EL CUERPO COMO UN MEDIO - La Otra Voz - Brent A. Haskell (Texto Buscable Con Indice)

Wanting somewhere better to put the output so that I could read it more easily, I opened it in Microsoft Excel. Excel refused to load four million unique items, so I checked Stack Overflow, updated Excel and switched to the newer XML-based format (saved as .xlsx files). I then discovered that there were all sorts of words present that did not exist in the English language, not even in its SF dialect: words like “A1l,” or “an0tlner,” or even “!!—111.. p.” This explained why, rather than the few hundred thousand words I was expecting, I had four million. Three million or more of them were not real words at all. I cursed the feeble OCR of Adobe, and wept over the futility of my efforts. I then pulled myself together, having found that the student in the cubicle next to me had not noticed – their attention was intently focused on dissecting the structure of a minority dialect spoken in Highland New Guinea. I went back to the original works, and discovered why the OCR had produced such gibberish: eighty years of slow decay smudges the ink printed on porous, uncoated pulp paper (Fig. 15).

In my naiveté I tried changing contrast, sharpness and other characteristics of the images, doing this painstakingly and manually in Adobe Photoshop (Adobe Systems, 2018b), and tried again with the “improved” version of the images. The results were no better – naturally the people at Adobe had already integrated “smudgy text” processing into the application. As an alternative, I wondered if the wizardry of Mac Automator could be used to clean up the text as a post-processing step. A return to Stack Overflow led me to “regular expressions” (regex) – compact patterns for matching and manipulating text – and to some ideas about script-based programming in Python (van Rossum, 2018). After more searching and Stack Overflow queries, I found many ways of achieving my goal with regex. I suspected that processing all my texts might require significant computing power, so I asked to use the university supercomputer. After listening to the person in charge for half an hour, and nodding sagely, I looked out of the window and decided not to follow this pathway after all. It seemed to lead into a potentially bottomless abyss, with a warning sign at the top stating: “Here be dragons.” After a week of
experimentation, I had created a regex script that ran within Automator, which would speedily remove all the non-alphanumeric characters from the files. Due to a glitch in my implementation, it destroyed the hard drive in my Mac by thrashing it to death over four days of continuous running time. I had managed, however, to make the text in the files look much less threatening by removing all the “111!!!!!lllllooooh1” entries, and replacing them with “lllllooooh.” It was progress. I replaced my hard drive with a faster solid-state one. I wrote ever more complex regex to do things such as collapsing any letter repeated more than twice in a row down to two occurrences. This was satisfying work, and turned “lllllooooh” into “llooh.” It didn’t make more sense, but it did look more like English, and as a benefit I had learned much about how programming today was very different to what I had been expecting based on my childhood experience. It was unfortunate that the results were not what I wanted.

Learning from my mistakes, I approached a fresh PhD student whom I had not yet burdened with my problems. He suggested I ask one of the computer programming students at the university for help. I smiled, baring my canines, but it really wasn’t his fault that this obvious resource had not occurred to me. I emailed a programmer he had recommended – I didn’t see them face to face, of course – they lived far, far away in an office decorated with Star Trek posters – and received a remarkably short Python script by return of email. I tried it out, and with minor tweaking (consulting Stack Overflow) it worked perfectly. In half an hour I had a better, quicker, happier version of my text – cleaner than the one I had spent a fortnight over on my own. I let it run on one of my early editions of Astounding Stories and I was impressed by the results. Using the scan in Fig. 15 as the source material, the script converted the OCR output:

' .. / .

-there she was WAITING AT THE CHURCH \\"/,- c· 1.: I I

-: .r r·Clr·cI;JII,·,.: \\' Annsq>tier"I.I,. .. by l'wI.. .

., . . '·' r "'"' lt-rnrenr· . . . -rstenne . '"·'.f"l" l':rtL\L" ..,- h,-.. I I .rtrun "' riJ,.nwllc/,

l.:nu\1'd <.rtI "'"'' Tl ' l:rt j'o11r /.n-:ul ·. It'll you will F:"ri I' ' '' Ml·r·r·r,.r. purer (Astounding Stories, 1938)

into:

there she was WAITING THE CHURCH I Clr JII Annsq tier I by wI lt rnrenr rstenne rtL I rtrun riJ nwllc nu rtI rt o11r It you will ri purer.
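
The two regex steps described above (stripping junk characters, then shortening long runs of a repeated letter) can be illustrated with a minimal Python sketch. This is not the Automator script or the programmer's script from the study, neither of which is reproduced here. The function names are hypothetical, and the sketch assumes that "junk" means anything other than an ASCII letter or whitespace, since that is what turns “111!!!!!lllllooooh1” into “lllllooooh.”

```python
import re

# Assumption: treat everything except ASCII letters and whitespace as junk.
NON_LETTERS = re.compile(r"[^A-Za-z\s]+")
# A letter followed by two or more copies of itself (three or more in a row).
LONG_RUNS = re.compile(r"([A-Za-z])\1{2,}")
WHITESPACE = re.compile(r"\s+")


def strip_non_letters(text):
    """Replace each run of junk characters with a space, then tidy spacing."""
    spaced = NON_LETTERS.sub(" ", text)
    return WHITESPACE.sub(" ", spaced).strip()


def collapse_runs(text):
    """Shorten any run of three or more identical letters to two."""
    return LONG_RUNS.sub(r"\1\1", text)


def clean(text):
    """Apply both steps in the order described in the text."""
    return collapse_runs(strip_non_letters(text))


if __name__ == "__main__":
    print(clean("111!!!!!lllllooooh1"))  # llooh
    print(clean("D@rlcne55"))            # D rlcne
```

The second call shows a side effect of this kind of cleanup: a symbol inside a damaged word splits it into fragments rather than repairing it, so the output looks tidier without recovering any real words.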

I carried out a word frequency analysis of the cleaned-up versions against the original ones, but found to my surprise, and subsequent chagrin, that both versions provided output that was virtually identical. I had not changed the frequency of occurrence of “real” words, just removed some of the non-words. Since I was only interested in words that occurred above a certain frequency, this made no difference. In my thematic searches I would be looking for occurrences of very specific words – such as “cyborg” or “volcano” – and these were either there, or they were not, regardless of the surrounding junk text.

I still had faith, however, and I reasoned that there must be ways of improving accuracy by performing an “autocorrect” of faulty versions of words: “D@rlcne55,” for instance, should be readily convertible to “Darkness.” I had already noted that “lc” for “k” and “ri” for “n,” not to mention “li” for “h,” were common
errors found in OCR of my aged pages, so this type of error could presumably be fixed with the right script. I found another programmer, and he offered to look into it. He came back to me a few days later with Peter Norvig. Not the actual Peter Norvig (Fig. 16), but an implementation of his dictionary-matching script for comparing faulty text against a reference dictionary to find possible errors (Norvig, 2016). This was wonderful, except that it proved impractical for my research – in my text there were many non-standard words, and gibberish, and there was a danger that I would lose words that were important (like “Scientology,” for instance) unless they were in the text-matching dictionary file, which would have to be huge – possibly almost as large as the original texts. This would take a very long time to process, and a dictionary-matching system might “correct” nonsense characters into words that did not occur in the original. So “111!!!!!lllllooooh1” might become “oh,” when it was really an OCR misreading of the border around an advert. I parked Norvig in a virtual garage, with the regex and Python scripts, and surreptitiously dropped the USB stick in the river. It was an interesting excursion that had added precisely nothing to my PhD, but had been part of my process of evolving into a “proper” digital humanities researcher.

Unsurprisingly, in retrospect, these issues have been identified in the analysis of similarly-smudgy
newspaper text, and researchers have come to similar conclusions regarding a substantial corpus: that for some research aims (e.g. my own) this is a “desirable but not essential” process (Strange et al., 2014: 51). A happy result of this toil, however, was that I gave an in-house conference presentation on text processing which was attended by at least a dozen people – four percent of those invited – so it was a tremendous success, despite being irrelevant to my actual study. (If my mother had come it would have been five percent, but fifteen thousand kilometres is a long way to go and observe your son’s moment of glory.) This was when I realised that even though I may care how
aesthetically pleasing my text looks, the computer is unimpressed. Pretty text does not improve statistical results, or add significant words to a search. The positive outcome of this process was that I now had a large volume of digitised text, from many books, and, due to my foresight, the files all had identifiable names such as “Amazing (v1n1) 1926.” This gave me a feeling of quiet satisfaction that outweighed the difficulties I had experienced.
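
The earlier observation that cleaning left the counts of “real” words essentially unchanged can be illustrated with a small Python sketch. It is an illustration only, not the analysis carried out in the study. The function names and both sample strings are hypothetical, and tokens are assumed to be separated by whitespace.

```python
from collections import Counter

# Hypothetical samples: the same passage before and after junk removal.
RAW_SAMPLE = (
    "the cyborg !!111.. p. met an0tlner cyborg "
    "near the volcano 111!!!!!lllllooooh1"
)
CLEANED_SAMPLE = "the cyborg p met an tlner cyborg near the volcano llooh"


def token_counts(text):
    """Count whitespace-separated tokens, ignoring case."""
    return Counter(text.lower().split())


def compare_targets(raw, cleaned, targets):
    """Return {word: (count_in_raw, count_in_cleaned)} for each target word."""
    raw_counts = token_counts(raw)
    cleaned_counts = token_counts(cleaned)
    return {w: (raw_counts[w.lower()], cleaned_counts[w.lower()]) for w in targets}


if __name__ == "__main__":
    result = compare_targets(RAW_SAMPLE, CLEANED_SAMPLE, ["cyborg", "volcano"])
    print(result)  # {'cyborg': (2, 2), 'volcano': (1, 1)}
```

Words that the OCR read correctly are counted identically in both versions, because the junk sits in separate tokens. A target word that the OCR itself damaged would be missed in both versions, which is why this kind of cleaning alone does not change the search results.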
