Because Dickens’s works have proliferated in many formats, collecting the DCC texts was relatively straightforward, as nearly all of them exist in electronic format. However, they were dispersed across numerous websites and university library archives, which on occasion meant tracking them down and revising them in detail before their final inclusion in the DCC. The task was eased by hosting websites such as Project Gutenberg and the University of Adelaide library, which contain the full text of his novels, novellas, short stories, journal articles, letters and speeches. On these websites, Dickens’s works are typically stored as individual, searchable texts, enabling the reader to obtain the work of a single author or a specific work.
A search for electronic texts of Dickens’s works showed the most relevant internet archive to be Project Gutenberg, a digital library containing over forty-five thousand documents. The items in the collection are free, either because their copyright has expired or because the copyright owner has granted permission, and comprise the full texts of public domain books. The documents are primarily available in TXT (plain text) format, while other formats may also be available, including HTML (hypertext markup language), EPUB (electronic publication) and PDF (portable document format).
Project Gutenberg, deemed the most relevant internet archive, was selected as the primary source for downloading the texts of Dickens’s works, while other resources such as the University of Toronto’s Robarts Library, Trove (the National Library of Australia) and the library of Harvard University have also been used. These other archives offer some texts of Dickens’s works in different formats. The texts have primarily been scanned from printed copies and saved directly as PDF files. From the PDF version, the website itself can automatically convert them into TXT or HTML files, so the user needs no separate recognition software to identify the text on the scanned pages. Unfortunately, the output TXT files are not of the same quality as those from Project Gutenberg or the University of Adelaide library in terms of proofreading. Although the accuracy of the text converted from the scanned pages into electronic format is remarkable, proofreading is still necessary because letters and numbers can be misidentified. Examples noted include I’11 instead of I’ll, I’ ve instead of I’ve, and the occasional use of American spellings where the hard copy employs British English.
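The kind of slip described above can be pre-screened mechanically before manual proofreading. The following is a minimal sketch only, not the procedure used for the DCC: the function name `flag_ocr_suspects` and both patterns are hypothetical, and they cover just the two contraction errors mentioned (a digit after the apostrophe, and a stray space after it).

```python
import re

# Hypothetical patterns for the two OCR slips noted in the text.
# \u2019 is the typographic apostrophe; the straight apostrophe is accepted too.
DIGIT_IN_CONTRACTION = re.compile(r"\b[A-Za-z]+[\u2019'][0-9]+\b")
SPLIT_CONTRACTION = re.compile(r"\b[A-Za-z]+[\u2019']\s+(?:ve|ll|re|d|s|t|m)\b")


def flag_ocr_suspects(text):
    """Return (line_number, matched_string) pairs for a human to check.

    Nothing is corrected automatically: each hit still has to be
    compared against the printed copy.
    """
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern in (DIGIT_IN_CONTRACTION, SPLIT_CONTRACTION):
            for match in pattern.finditer(line):
                suspects.append((number, match.group(0)))
    return suspects
```

Such a screen only narrows down where to look. It would miss misreadings that produce real words (for example clear read as dear), which is why proofreading against the hard copy remains necessary.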
Kennedy (1998) reports that ‘current scanners have challenges in identifying hyphens, apostrophes and certain letters or groups of letters such as a (car is rendered as cor), cl (clear becomes dear), in (innate becomes mnate) and the number 1 vs. the letter l’ (Kennedy 1998, cited in Baker 2006: 34; italics added). My own extensive revision of texts obtained from such archives, using the word processor to identify misspelled words, confirmed this. These misidentified and misspelled passages required proofreading, which I conducted for some of the less common works of Dickens. Thanks to the internet, there was no need to type out in full each of Dickens’s works included in the DCC; however, the process did require copying, pasting and editing to ensure that some of the documents matched their printed versions, and some letters and speeches had to be inserted manually as they could not be located in electronic format after online searches. The texts that were added manually are presented in Appendix 4.1.
When saving the TXT files of Dickens’s works, I attempted to preserve the language data as accurately as possible so that they mirror the printed copies. To achieve this, great effort was made to retain all the data found in the texts, including speech marks and accented characters, since Dickens used characters from languages other than English, such as those in the French or Italian names and vocabulary describing his journeys and the places he visited, especially in his letters. To preserve Latin-alphabet characters with diacritics and orthographic ligatures, for example Æ and æ, the TXT files were saved in a ‘Unicode’ encoding, that is, the ‘[i]nternational character-encoding system designed to support the electronic interchange, processing, and display of the written texts of the diverse languages of the modern and classical world’ (Britannica Concise Encyclopaedia).
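As an illustration of why the encoding matters, the sketch below saves and reloads a text and lists its non-ASCII characters for spot-checking. It assumes UTF-8 as the concrete Unicode encoding, which the text above does not specify, and all function names are hypothetical.

```python
import os
import tempfile


def save_text_utf8(path, text):
    """Write text as UTF-8 (an assumed choice of Unicode encoding)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)


def load_text_utf8(path):
    """Read a UTF-8 text file back into a string."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return handle.read()


def survives_round_trip(text):
    """Check that saving and reloading leaves every character unchanged."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        save_text_utf8(path, text)
        return load_text_utf8(path) == text


def non_ascii_characters(text):
    """Return the sorted set of non-ASCII characters for spot-checking."""
    return sorted({ch for ch in text if ord(ch) > 127})
```

Listing the non-ASCII characters of a file gives a quick inventory of ligatures and accented letters to compare against the printed copy. An ASCII-only encoding would be unable to store these characters at all.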
Once collected, the electronic texts of Dickens’s works needed to be spell-checked and corrected. The aim was to match the printed versions in hand as accurately as possible, and the procedure was to verify the first line of each paragraph against the printed copy. On rare occasions, I looked for an electronic version other than the Project Gutenberg one in order to find a text identical to the printed copy. This happened with the novel Great Expectations. I also checked the opening and closing parts of each chapter, and ensured compliance with British English spelling conventions using the word processor’s spell-checker. This helped to identify American spellings where the texts had been sourced from the archives of American universities. While Project Gutenberg affords valuable opportunities for selecting and downloading texts in various formats, such documents needed to be revised and lightly edited so as to match the printed volumes of Dickens’s works. The hard copies on which the DCC is based are available for later reference should this corpus be developed further.
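The two checks described above, comparing paragraph openings with the printed copy and looking for American spellings, were carried out by hand with a word processor. Purely as an illustration, they could be supported by a script along the following lines. The function names and the deliberately tiny spelling list are hypothetical; a real check would need a much fuller list, and every hit would still have to be confirmed against the hard copy.

```python
import re

# Illustrative sample only, not a complete American-to-British mapping.
AMERICAN_TO_BRITISH = {
    "color": "colour",
    "honor": "honour",
    "center": "centre",
    "traveled": "travelled",
}


def paragraph_openings(text, width=60):
    """Return the first `width` characters of each paragraph.

    Paragraphs are assumed to be separated by blank lines; internal
    line breaks are collapsed to single spaces.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [" ".join(p.split())[:width] for p in paragraphs]


def flag_american_spellings(text):
    """Return (word_found, suggested_british_form) pairs in text order."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group(0)
        british = AMERICAN_TO_BRITISH.get(word.lower())
        if british:
            hits.append((word, british))
    return hits
```

The list of openings can be read side by side with the printed volume, one paragraph at a time, while the spelling hits point to places where an edition may follow American conventions.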