In addition to excluding non-dialogic text from computations made by the corpus tools, mentioned in the previous section, annotation also preserves information such as stage directions and speaker labels in the texts. This helps make sense of what is going on in the dialogue and aids the interpretation of results during analysis. Additionally, annotation enables some useful meta-data about the contents of each play-text file in the corpus to be encoded for reference. To some extent, annotation can be automated by searching and replacing text to be tagged using regular expressions. With a text editor such as Notepad++ (see footnote 40) this can be carried out on multiple texts simultaneously. Additional manual fine-tuning is usually required to tag individual items which cannot be globally searched. The annotation process is therefore time-consuming and open to human error. This is pointed out by Baker (2006:42), who also emphasises that the direct benefit of annotation to the interpretation of results in a study needs to be assessed carefully at the outset. The annotation of the NDC is therefore limited to what was essential to serve the needs of the present study, but I use a conventional system which has potential for exploitation in future research, and to which further degrees of annotation can be added later if necessary.
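The kind of regular-expression search-and-replace described above can be illustrated with a minimal Python sketch. It is not one of the scripts used for the NDC: the function names are hypothetical, and the example simply assumes that a bare "[...]" gap marker is to be wrapped in a comment tag across several texts at once.

```python
import re

# Matches a bare "[...]" marker, but not one that is already inside c="...".
# (Assumption for illustration: untagged markers are never preceded by c=".)
MISSING = re.compile(r'(?<!c=")\[\.\.\.\]')

def tag_missing(text: str) -> str:
    """Wrap each untagged [...] marker in a comment tag."""
    return MISSING.sub('<comment c="[...]"/>', text)

def tag_all(texts: dict[str, str]) -> dict[str, str]:
    """Apply the same replacement to several play-texts in one pass."""
    return {name: tag_missing(body) for name, body in texts.items()}
```

Because of the lookbehind, running the replacement twice leaves already tagged markers untouched, which makes it safe to repeat after manual corrections.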
My rationale for annotating and marking up the NDC play-texts is informed by those of other scholars who have built corpora containing EModE drama, e.g. Archer and Culpeper (2003), Culpeper and Kytö (2010), Kytö and Walker (2006) and Lutzky (2009a, 2009b and 2012). The annotation I use is encoded in XML (eXtensible Markup Language) tags. XML tags are based on a standardised coding system for electronic texts, but their contents can be customised (Baker 2006:38-42). They are therefore useful in corpus linguistics, because they can be tailored to different kinds of texts whilst still being interpretable by a range of other computer programmes that may be required for analysing them. As noted in 3.3.2, Wmatrix will only work with well-formed XML tags. WordSmith excludes from computation any text bounded by a pair of angle brackets. This automatically includes XML tags (which are bounded by pairs of angle brackets), although, as mentioned in 3.3.2, it inexplicably picked up one word (who) from the speaker identity tags (which was fortunately an isolated anomaly).
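The effect of excluding everything bounded by a pair of angle brackets can be approximated as follows. This is a sketch of the behaviour described above, not of how WordSmith is implemented; the function name is hypothetical.

```python
import re

# Anything from "<" to the next ">" counts as markup, including
# attribute values that run over several lines.
TAG = re.compile(r"<[^<>]*>")

def visible_text(marked_up: str) -> str:
    """Return only the text that lies outside angle brackets."""
    return TAG.sub("", marked_up)
```

Since tag contents are stored in attributes such as c="" and who="", stage directions and speaker names disappear from the visible text along with the tags themselves.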
40 Currently free to download. See http://notepad-plus-plus.org/ (last accessed 31.08.12).
Devising annotation systems is not particularly easy, and compatibility with different kinds of corpus linguistic software tools is an issue. Andrew Hardie (in preparation, and personal communication, 05.05.10) argues that the scripting language PHP can be used to increase the speed and efficiency of annotation. It requires a reasonable level of knowledge of computer programming and regular expressions, however, which I could not have acquired sufficiently in the time available for the project. Therefore, he and I discussed the most advantageous ways to annotate the texts, and he wrote the PHP scripts which I could then execute and adapt in minor ways (e.g. by altering elements of the regular expressions in the scripts in order to change the search-and-replace parameters). PHP requires a text editor in which to write the scripts, and we used Notepad++. Among its useful features (which I discuss a little further in 5.5) is the display of XML tags in different colours from the main body of the text. PHP scripts automated the tagging of the speaker identities of over 31,000 speech turns in the NDC, as I explain in more detail below. The other tag types in the corpora were not in sufficiently standard forms to be annotated automatically.
The XML tags used to mark up the contents of the play-texts in the NDC are summarised in Table 11 below, which is followed by a brief explanation of each one.
Table 11. Encoding conventions used in the NDC in the form of XML tags
Tag                          Function
<text id=""> ... </text>     short code for identifying the play title and genre, with an end marker showing where the zone of the play-text finishes
<ref c=""/>                  reference and bibliographic information about the play-texts
<frontmatter c=""/>          additional text preceding the dialogue of the play
<endmatter c=""/>            additional text following the dialogue of the play
<comment c=""/>              typewritten notes and/or markers highlighting an anomaly or problem in the dialogic text, e.g. missing or unclear words
<stagedir c=""/>             stage directions
<scene id=""> ... </scene>   start and end tags marking acts and scenes
<u who="">                   speaker identification tags
Short code for identifying the play title and genre
Each play in the NDC has a short title code comprising an initial letter N (denoting non-Shakespearean plays), a second letter identifying the genre (C for comedy, H for history and T for tragedy), and then a short form or acronym of the title of the play. For example, The Duchess of Malfi is coded as NTDOM, and Bartholomew Fair as NCBFAIR. These are inserted as text identification ("id") tags at the top of each play-text file, and the end marker inserted after the final line, to serve as boundaries when joining multiple files. Play-text files were labelled using the text-id, for consistency. Play-texts in the SDC are labelled in a similar format, but with the initial letter S, e.g. SCMWW for The Merry Wives of Windsor. Text-ids for all the plays in both corpora are shown in Appendix IV.
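The structure of these text-ids can be made explicit with a small Python sketch. The function and table names are hypothetical; only the letter codes themselves come from the scheme described above.

```python
# First letter: which corpus the play belongs to.
CORPORA = {"N": "NDC", "S": "SDC"}
# Second letter: genre.
GENRES = {"C": "comedy", "H": "history", "T": "tragedy"}

def parse_text_id(text_id: str) -> tuple[str, str, str]:
    """Split a text-id into (corpus, genre, short title)."""
    if len(text_id) < 3:
        raise ValueError(f"text-id too short: {text_id!r}")
    corpus = CORPORA[text_id[0]]   # KeyError if the letter is unknown
    genre = GENRES[text_id[1]]
    return corpus, genre, text_id[2:]
```

For instance, parse_text_id("NTDOM") separates the code into the NDC, the genre tragedy and the short title DOM. The short title cannot be expanded automatically; the full list in Appendix IV is needed for that.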
Reference and bibliographic information about the play-text
Baker (2006:40) explains that including tagged "headers" enables the retention or inclusion of "meta-linguistic" information about the texts in a corpus. This is useful for reference. The digitised play-texts on EEBO already contain information such as the date and bibliographic name or number, the extended title, author's name and lifespan, and date of publication. I encoded all this in a single "reference" tag, to form a header in each play-text. The header information could be broken down into separate components, as is the case in the CED text files, if there were a need to search on individual components such as date. However, this is not necessary in my study. The header tag from the top of the play-text of Webster's tragedy The White Devil (see footnote 41) is shown on the next page.
41 Non-dialogic text in the corpora, including the header tags, is not subjected to the spelling regularisation process discussed in 5.4. In the text of the thesis, I standardise the titles of the plays to the modern forms by which they are commonly known today, for convenience and brevity.
<ref c="Author: Webster, John, 1580?-1625?
Title: The white diuel, or, The tragedy of Paulo Giordano Vrsini, Duke of Brachiano with the life and death of Vittoria Corombona the famous Venetian curtizan. Acted by the Queenes Maiesties Seruants. Written by Iohn Webster.
Date: 1612
Bibliographic name / number: STC (2nd ed.) / 25178
Bibliographic name / number: Greg, I, 306(a).
Physical description: [88] p.
Copy from: Bodleian Library
Reel position: STC / 1296:01"/>
Other non-dialogic text preceding and following the content of the play
Some play-texts feature an introductory preamble for the benefit of the players or the audience, before the dialogue of the characters begins. This typically includes prologues, dedications to the dramatists' patrons or friends, and/or a list of dramatis personae. It is all encoded in a single <frontmatter c=""/> tag in each play-text. Any text which comes after the dialogue of the play ends (typically an epilogue, or the printer's details) is encoded in a single <endmatter c=""/> tag.
Comments
Missing or unclear text in the play-text files is indicated by three leader dots between square brackets: [...], a convention already in place in some of the digitised files downloaded from EEBO. The marker is encoded in a comment tag, as in the following line from Marlowe's tragedy The Massacre at Paris:
And made <comment c="[...]"/> look with terror on the world:
Apart from the above "missing" marker, a few other brief notes are encoded in comment tags in the play-texts, such as places where a speaker's identity is unclear.
Stage directions
Stage directions are marked off in stage direction tags, e.g.:
<stagedir c="Exeunt."/>
Acts and scenes
As noted in 1.5.2, not all early extant EModE play-texts are divided up into acts and scenes. In the NDC, those which are so divided are marked off into zones between a scene-id tag containing the text-id code and an end-of-scene tag. For example, Act I, scene i in Tamburlaine Part I is bounded by the following tags:
<scene id="NHTAM_I_i">
</scene>
Speaker identification tags
Each speech turn in the play-texts downloaded from EEBO is preceded by a speaker label, in most cases an abbreviation of the character's name followed by a full-stop. These speaker labels needed to be converted to speaker-id tags. As mentioned at the start of this section (5.2), the annotation of speaker-id tags was automated to a great extent using PHP scripts. These are given in Appendix V. A single PHP script (written by Andrew Hardie) carried out the following set of commands:
(i) the identification of speaker labels, by searching for a single word followed by a full-stop on a line by itself, e.g.:
Lodovico.
(ii) the insertion of a tag containing that character's name immediately after the speaker label (a "u who" tag from Table 11 above), e.g.:
<u who="Lodovico">
(iii) the insertion of an end-of-utterance tag </u> before each speaker label, to mark off the end of the previous character's speech.
Lodovico's speech turn was then marked off by the end-of-utterance tag preceding the next speaker's turn, which was prompted by PHP finding the next speaker label in the play-text. The end-of-utterance tag before the first speaker label in each play-text was redundant and was deleted manually, and a final end-of-utterance tag was added manually after the last speech turn in the play-text. In just a few seconds, the execution of this single PHP script annotated the vast majority of 31,000 speaking turns in all 43 play-texts in the corpus (it also inserted the text-id tags, discussed above, and saved the annotated text files as new XML files).
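Steps (i) to (iii) can be sketched in Python as follows. This is an illustrative reconstruction from the description above, not the PHP script given in Appendix V; the function names and the exact regular expression are assumptions, and the tidy-up function simply mirrors the two corrections that were made by hand.

```python
import re

# (i) A speaker label: one word followed by a full-stop, alone on its line.
LABEL = re.compile(r"^[ \t]*([A-Za-z]+)\.[ \t]*$", re.MULTILINE)

def tag_speakers(text: str) -> str:
    """Insert </u> before each label (iii) and <u who="..."> after it (ii)."""
    def wrap(match: re.Match) -> str:
        return f'</u>\n{match.group(0)}\n<u who="{match.group(1)}">'
    return LABEL.sub(wrap, text)

def tidy(tagged: str) -> str:
    """Drop the redundant first </u> and close the final speech turn."""
    if "</u>\n" not in tagged:
        return tagged  # no speaker labels were found
    return tagged.replace("</u>\n", "", 1) + "\n</u>"
```

A pattern of this kind cannot tell a speaker label from a one-word line of dialogue such as "Good.", so its output still has to be checked by eye.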
Often, however, the speaker labels for a single character are not consistent in EModE play-texts, because of non-standard spelling and abbreviations. This variation caused PHP to code them with separate "u who" tags (meaning that a single character's speech was split across several different speaker-id tags). That would have made it more difficult to extract all the dialogue of a single character, which was desirable for creating separate male and female data files later on (mentioned below). To address the problem, I used a second PHP script which identified and listed all the variants of "u who" tags in a play-text. I could then identify potential variants of a single speaker label, verify them as belonging to one character in the play-text, and convert multiple variants to a standard speaker-id tag using a third PHP script. For example, in the play-text of The Duchess of Malfi downloaded from EEBO, speaker labels for the character Ferdinand were variously abbreviated to "Fer.", "Ferd.", "Fred.", "Ford." or "Berd.".
Each variation was initially assigned a different "u who" tag by the first PHP script.
These were then identified with the second script, and finally replaced with a standard tag: <u who="Ferdinand"> by the third script.
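The listing and standardising steps can be illustrated with a short Python sketch. It is not the second or third PHP script; the function names are hypothetical, and it assumes that each variant label was stored in the tag without its full-stop (e.g. who="Ferd").

```python
import re
from collections import Counter

WHO = re.compile(r'<u who="([^"]*)">')

def list_speakers(text: str) -> Counter:
    """List every distinct who value and how often it occurs."""
    return Counter(WHO.findall(text))

def normalise_speakers(text: str, variants: dict[str, str]) -> str:
    """Replace verified variant names with the standard form."""
    def swap(match: re.Match) -> str:
        name = match.group(1)
        return f'<u who="{variants.get(name, name)}">'
    return WHO.sub(swap, text)
```

The mapping passed to normalise_speakers is drawn up by hand after inspecting the listing, for example {"Fer": "Ferdinand", "Ferd": "Ferdinand", "Fred": "Ferdinand"}, so that the judgement about which variants belong to one character stays with the researcher.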
Following the automated annotation of speaker-id tags as explained above, a relatively small amount of manual fine-tuning was necessary to correct text which fitted the search parameters of the first PHP script, but which did not constitute speaker labels. These were instances of single words of dialogue followed by a full-stop (i.e. one-word speech turns, such as "Good."). I could only do this by scrutinising the corpus texts and checking them line by line, which also enabled me to pick up a few non-standard speaker labels that the first PHP script could not capture. These were speaker labels not followed by a full-stop, which were rare in most of the NDC play-texts, although prevalent in a few. In these cases, the "u who" tag and end-of-utterance markers had to be inserted manually. The original speaker labels in the play-text files also had to be encoded between pairs of angle brackets to isolate them from the rest of the dialogic text, a process which in retrospect could have been included in the first PHP script. It was quick and easy to go back and make global replacements, however, whilst I was checking the corpus texts and carrying out the manual annotation of other tag types discussed above (using Notepad++).
Following the annotation of the speaker-id tags in each successive play-text, I used a fourth PHP script to count the number of words of dialogic text, i.e. everything contained between "u who" tags and end-of-utterance tags (apart from anything marked with other tags, e.g. comments and stage directions). PHP defines a word as a "string containing alphabetic characters, which also may contain, but not start with "'" and "-" characters" (see footnote 42), and its word count function produces results that are not entirely consistent with those from other programmes such as WordSmith. However, it provided a quick guide to the amount of dialogue harvested from each play-text, which was useful in building up each section of the NDC to a size approximating that in the SDC.
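A count of this kind can be approximated in Python as follows. The sketch is not the fourth PHP script and does not call PHP's word count function: the WORD pattern is an ASCII-only approximation of the definition quoted above (PHP's own definition is locale dependent), and the function name is hypothetical.

```python
import re

# Text between a speaker-id tag and the next end-of-utterance tag.
UTTERANCE = re.compile(r'<u who="[^"]*">(.*?)</u>', re.DOTALL)
# Any other tag inside the speech turn (comments, stage directions).
TAG = re.compile(r"<[^<>]*>")
# A word starts with a letter and may continue with letters, ' or -.
WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")

def count_dialogue_words(text: str) -> int:
    """Count words inside speech turns, ignoring embedded tags."""
    total = 0
    for speech in UTTERANCE.findall(text):
        total += len(WORD.findall(TAG.sub(" ", speech)))
    return total
```

Speaker labels are not counted, because they sit before the opening tag and therefore outside the speech turn. Differences in how hyphens and apostrophes are treated are one reason why counts of this sort need not agree with those of other tools.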
I constructed a spreadsheet logging the name and sex of each character in every play during the annotation process, to facilitate the rapid block extraction of male and female dialogue with a fifth PHP script. This enabled me to create separate components of the corpus for analysis of selected results by gender, which is
42 See http://uk3.php.net/manual/en/function.str-word-count.php (last accessed 10.08.12).
occasionally useful in the present study (e.g. in 8.3 for the word cluster I PRAY YOU), and which will benefit future research into language styles and gender in the plays (which I suggest as a useful direction in 9.4).
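The block extraction of dialogue by sex can be sketched in Python as follows. This is not the fifth PHP script: the function name is hypothetical, and the spreadsheet is represented here by a plain dictionary from speaker-id to sex, whose "F" and "M" codes are an assumption for the purposes of illustration.

```python
import re

# Captures the speaker-id and the speech turn that follows it.
UTTERANCE = re.compile(r'<u who="([^"]*)">(.*?)</u>', re.DOTALL)

def extract_by_sex(text: str, sex_of: dict[str, str], wanted: str) -> list[str]:
    """Return the speech turns of every character of the wanted sex."""
    return [
        speech.strip()
        for who, speech in UTTERANCE.findall(text)
        if sex_of.get(who) == wanted
    ]
```

Characters missing from the mapping are left out of both sets rather than guessed at, so any gaps in the spreadsheet show up as missing dialogue rather than as misclassified dialogue.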
Other than the annotation explained above, and some correction of gaps and mis-transcribed text in the EEBO digitised text files to increase their accuracy (discussed in the next section), nothing was added to the play-texts in the NDC. Nothing was deleted apart from extra blank lines, spaces and unusual characters such as hash signs # which might interfere with the orthographic matching processes of the corpus analysis software, following Lutzky (2009b: 1). The only characters which do not stand for themselves in the play-texts are the angle brackets < and > which surround the encoded information. Following Kytö and Walker (2006:37-38), I did not alter the lineation of the play-texts, but I removed hyphens from words which were split at the line break for the printers' convenience. This is because they would otherwise artificially inflate word counts for the NDC texts (compared to the SDC texts, which do not have words split at line breaks).
5.3 Missing text and other transcription issues in the digitised play-texts