Each text was labelled with the author’s (anonymised) name plus a number from 1 to 5 indicating the order in which that author's texts were written, so Carla-5 is the fifth text produced by Carla and, likewise, Keith-1 is the first text written by Keith.
The previous chapter ended by advocating the use of automated methods for identifying formulaic sequences in the forensic context, and the methods outlined in Section 4.2 utilise such an approach. However, a computer can only identify strings of words that it has been programmed to find. Therefore, if the search criteria or the data involve a word which is misspelled, a match will not be made. Researchers who have used automated methods for identifying authorship have often relied on published texts as their data (Hänlein, 1999; Hoover, 2001, 2002, 2003a). By virtue of being published, such texts will have already been subjected to heavy editing to ensure that spelling, punctuation and formatting are all standardised. However, research using data that have not been professionally edited raises the question of whether spelling should be corrected:
If it is not [corrected], then a misspelled word will not be recognized as an instance of that word in its correct form, and, indeed, may be counted as a nonword, a hapax legomenon (single-occurrence) or as an instance of another word with which the spelling coincides. Misspellings can be precisely what separates out one writer from another, but they will be unhelpful in many analyses (Mollet, Wray, Fitzpatrick, Wray, & Wright, 2010: 434).
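The effect described in this quotation can be illustrated with a minimal sketch. It assumes a very simple tokeniser and an exact, contiguous match on word tokens; the function names `tokenize` and `count_sequence` and the search target are hypothetical and chosen only for illustration, and this is not the procedure used in this thesis. The example sentence is taken from the data (John-4).

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (a deliberately simple tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def count_sequence(tokens, sequence):
    """Count exact, contiguous occurrences of `sequence` within `tokens`."""
    n = len(sequence)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sequence)

# Hypothetical search target: a candidate sequence in its standard spelling.
target = tokenize("at everyone's mercy")

uncorrected = tokenize("I was completely at everyones mercy")
corrected = tokenize("I was completely at everyone's mercy")

print(count_sequence(uncorrected, target))  # 0: the missing apostrophe blocks the match
print(count_sequence(corrected, target))    # 1
```

Under these assumptions, a single missing apostrophe is enough for the sequence to go uncounted, which is the practical motivation for considering standardisation before any automated search.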
Furthermore, deciding to correct spelling is not straightforward, since an author can make both ‘performance’ mistakes (mistakes that an author knows they have produced) and ‘competence’ errors (where non-standard rules are broken consistently) (Coulthard, 2005b: 15). Coulthard also describes the problem of working with typewritten text:
[E]rrors and mistakes may be confused and compounded – one may not know, for any given item, particularly if it only occurs once, whether the ‘wrong’ form is the product of a mis-typing or a non-standard rule – for instance if a (British English) text includes the word ‘color’ is this a typing mistake or a spelling error, or even worse the result of the computer user being unable to change the spell-check to British English (Coulthard, 2005b: 16).
Since spelling is not the focus of this thesis, and since automated methods will be used for identifying formulaic sequences, the decision was made to standardise the data, using the autocorrect feature in Microsoft Word 2010 as a guide. Such changes included:
1) Inserting spaces as needed for punctuation:
June-5
Original: I would learn from my mistakes,but no fear, I don’t.
Edited: I would learn from my mistakes, but no fear, I don’t.
Elaine-1
Original: and it was beautiful-it really was exactly what I would have chosen
Edited: and it was beautiful – it really was exactly what I would have chosen
Carla-1
Original: it was a beautiful day and isn’t right .Firstly it’s
Edited: it was a beautiful day and isn’t right. Firstly it’s
2) Inserting or deleting spaces between words:
John-2
Original: Those 6months were very hard
Edited: Those 6 months were very hard
Sue-1
Original: I had somewhat convinced myself that Iw as to get AAAB
Edited: I had somewhat convinced myself that I was to get AAAB
June-1
Original: Duke of EdinburghAward
Edited: Duke of Edinburgh Award
Hannah-1
Original: The blindness went on for a bout 4 minutes
Edited: The blindness went on for about 4 minutes
3) Adding/removing apostrophes, accents and extraneous punctuation:
John-4
Original: I was completely at everyones mercy
Edited: I was completely at everyone’s mercy
June-2
Original: I saw him in my minds eye
Edited: I saw him in my mind’s eye
Carla-1
Original: cliche
Edited: cliché
Greg-2
Original: no serious damage done,.although I wasn’t aware of it at the time
Edited: no serious damage done, although I wasn’t aware of it at the time
4) Correcting some spellings, often of homophones:
Keith-2
Original: to the local boarder crossing police
Edited: to the local border crossing police
Elaine-3
Original: when your young
Edited: when you’re young
Keith-3
Original: I had heard a bit more about the shear numbers of people
Edited: I had heard a bit more about the sheer numbers of people
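The spacing changes in category 1 lend themselves to a rule-based approximation. The following is a minimal sketch only: the edits in this study were made using the Microsoft Word 2010 autocorrect feature as a guide, not with code, and the function name `normalise_punctuation_spacing` and its two rules are assumptions introduced here for illustration.

```python
import re

def normalise_punctuation_spacing(text):
    """Approximate the category 1 edits with two simple rules.

    Caveat: these rules are illustrative and would wrongly split
    abbreviations such as "e.g." or strings such as web addresses.
    """
    # Rule 1: remove any whitespace that precedes a punctuation mark.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Rule 2: insert a space after a punctuation mark that is directly
    # followed by a letter (digits are left alone, so "4:30am" survives).
    text = re.sub(r"([,.;:!?])(?=[A-Za-z])", r"\1 ", text)
    return text

print(normalise_punctuation_spacing("I would learn from my mistakes,but no fear, I don't."))
print(normalise_punctuation_spacing("it was a beautiful day and isn't right .Firstly it's"))
```

A sketch of this kind handles the June-5 and Carla-1 examples, but not cases such as Elaine-1 (a hyphen standing in for a dash) or Greg-2 (doubled punctuation), which is one reason a human-guided check remains necessary.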
Some irregularities identified by the autocorrect feature were not corrected. These included:
5) Unrecognised lexical items:
Sarah-1 I was absolutely outstanded when he told me that I had passed
Mark-1 Me and JP have had a few bumpings off head but that’s just our characters really
David-3 The existence of Santa Clause has always been one of magic and intrepidation
6) Inconsistent/incorrect capitalisation:
Rose-5 it Was also quite embarrassing
Keith-2 at 4:30am i was dead tired and left everyone in the club and headed back to the Hotel
Thomas-4 back to Normal
7) Features of spoken register:
June-3 We then run to our parents room and jump on their bed and begin opening our presents, oohing and aahing about what we have got
Alan-1 It takes a helluva lot to make me cry
Thomas-3 The other thing that kinda makes you stop believing
8) Lexical reduplication for emphasis:
June-2 I’m sure there [sic.] something really really bad that’s happened to me
Judy-1 I had got up very very early in the morning
Alan-2 It was just constant pain, pain, pain
The few unrecognised lexical items were automatically identified as spelling errors, and in each case it was possible to make an educated guess about the target word. However, the target could not be known categorically, and so the decision was made to err on the side of caution and not to second-guess the author. For example, in the case of outstanded, it is likely that Sarah blended outstanding with astounded, but we cannot know for sure which was the target word.
There was no need to correct capitalisation as this would not interfere with any automated matching. Features of spoken register and reduplication for emphasis were not standardised since these were judged to be potentially characteristic of how each author used lexis. Since an author's use of lexis is the central focus of this research, it would be unjustifiable to alter this aspect.
The data were also checked manually, and a series of additional errors not identified by the autocorrect feature was found:
9) Perceived errors in flow:
June-2 Moving back up North to and going to primary school
Sue-5 I refused on the basis there WAS no more room they start pushing along it, ramming into my sitting next to me
Mark-1 So it that is, that’s the last time i cried
10) Omitted words:
Rose-5 When we were all sat in the hall waiting for the presentation ø begin
Michael-1 Later in ø morning we would take a stroll
Hannah-5 We had all drunk too much and there was ø of flirting
11) Incorrect lexical choices and/or potential typing errors:
Sue-3 I was still to select my choices, yet alone start writing a personal statement
Sarah-5 There are other elements rather that money that make people happy
Greg-5 smacked me on the back and pushed me fast first onto the snow
12) Some homophones:
Mark-3 but as the years past my love for animals hasn’t changed
David-2 but buy definition they were accidents
Keith-3 but he actually sailed down my road in his slay
13) Incorrect word boundaries that formed complete, recognisable words:
Mark-4 but it came down to the stupidest thing of a miss understanding of what was happening
Rose-5 The nit was my turn!
This final category is akin to metanalysis in Old and Middle English, where napron came to be pronounced as an apron and a nadder became an adder (Campbell, 2004). However, the difference is that whilst napron and nadder are not recognised as standard spellings in Modern English, the examples above are, and so are not instantly recognisable as misspellings. Categories 9 to 13 were not corrected for two reasons. Firstly, they were not identified by the autocorrect feature in Microsoft Word 2010, and therefore the task of identifying every single example could be too cumbersome for the forensic context. Secondly, whilst in some cases it would be possible to establish the target word (i.e. homophones and incorrect word boundaries that formed complete, recognisable words), an element of second-guessing the author would be required for other categories (i.e. perceived errors in flow, omitted words, and incorrect lexical choices and/or potential typing errors). Rather, these categories have been highlighted through manual checking for two purposes: to illustrate the authenticity of these texts, and to make explicit the potential problems they pose for any analytical technique. Conversely, if evidence in favour of formulaic sequences as a marker of authorship can still be found in the face of these problems, this will demonstrate the robustness of the marker.
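Why categories 12 and 13 escape automatic detection can be seen in a minimal sketch of a purely dictionary-based check. The toy lexicon and the function name `flag_nonwords` are assumptions introduced for illustration; the checker in Microsoft Word 2010 is more sophisticated than this, as shown by the fact that it did identify some homophones.

```python
import re

# Toy lexicon standing in for a spell-checker's dictionary (illustrative only).
LEXICON = set(
    "i was absolutely when he told me that had passed but actually "
    "sailed down my road in his slay sleigh astounded outstanding".split()
)

def flag_nonwords(text, lexicon):
    """Return the tokens that do not appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in lexicon]

# Category 5: a non-word is flagged.
print(flag_nonwords("I was absolutely outstanded when he told me that I had passed", LEXICON))
# Category 12: 'slay' is a real word, so nothing is flagged.
print(flag_nonwords("but he actually sailed down my road in his slay", LEXICON))
```

Because slay is itself a legitimate entry in the lexicon, a lookup of this kind cannot distinguish it from the intended sleigh; only context, and therefore manual checking or a more sophisticated tool, can do so.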
In the case of homophones, it is clear that some were identified by Microsoft Word 2010 and some were not. Homophones that were identified automatically were corrected, whilst those that required manual identification were not. This divide within a single category highlights a potential limitation in the use of automated methods: the research is limited by the software’s level of sophistication. A summary of the changes made to the data can be found below in Table 4-5:
Table 4-5 Summary of changes made to data
Identified by automated check
Edited: Inserting space between punctuation; Inserting or deleting space between words; Adding/removing apostrophes, accents and extraneous punctuation; Some homophones
Unedited: Unrecognised lexical items; Incorrect/inconsistent capitalisation; Features of spoken register; Lexical reduplication for emphasis
Identified by manual check
Edited: Names, dates, places and any other identifying material
Unedited: Perceived errors in flow; Omitted words; Incorrect lexical choices and/or potential typing errors; Some homophones; Incorrect word boundaries that formed complete, recognisable words
The fact that so many errors of different types occurred in the data is an unavoidable characteristic of the data collection design; clearly, asking people to type their answers relies on individual typing ability, though this does highlight that the data are authentic and that a linguist is faced with many of the same problems in a ‘real’ case of forensic authorship attribution. Nonetheless, it is acknowledged that editing the data in this way may be problematic for some.
4.7 Summary
In this chapter, three key claims have been stated about the nature of formulaic sequences as they relate to the individual. Three approaches have been proposed, all of which, although influenced by other approaches, are novel in how they identify formulaic sequences, and a corpus has been described on which these claims can be tested. Over the next three chapters, each of these approaches will be described in turn, with a full account of the results, so that an answer to the central research question can be determined.