Each text was labelled with the author’s (anonymised) name plus a number from 1 to 5 indicating the order in which it was written, so a text labelled Carla-5 is the fifth text produced by Carla and, likewise, Keith-1 denotes the text written first by Keith.

The previous chapter ended by advocating the use of automated methods for identifying formulaic sequences in the forensic context, and the methods outlined in Section 4.2 utilise such an approach. However, a computer can only identify strings of words that it has been programmed to find. Therefore, if the search criteria or the data involve a word which is misspelled, a match will not be made. Researchers who have used automated methods for identifying authorship have often relied on published texts as their data (Hänlein, 1999; Hoover, 2001, 2002, 2003a). By virtue of being published, such texts will already have been subjected to heavy editing to ensure that spelling, punctuation and formatting are all standardised. However, research using data that have not been professionally edited raises the question of whether spelling should be corrected:

If it is not [corrected], then a misspelled word will not be recognized as an instance of that word in its correct form, and, indeed, may be counted as a nonword, a hapax legomenon (single-occurrence) or as an instance of another word with which the spelling coincides. Misspellings can be precisely what separates out one writer from another, but they will be unhelpful in many analyses (Mollet, Wray, Fitzpatrick, Wray, & Wright, 2010: 434).
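The effect described in this quotation can be illustrated with a minimal sketch. It assumes a naive tokeniser and an exact word-sequence search, and it uses the Keith-2 fragment discussed later in this section in its typed and standardised forms. The function names `tokenise` and `contains_sequence` are hypothetical and do not represent the software used in this thesis.

```python
import re
from collections import Counter


def tokenise(text: str) -> list[str]:
    """Lowercase the text and return its word tokens (letters, with internal apostrophes)."""
    return re.findall(r"[^\W\d_]+(?:[’'][^\W\d_]+)*", text.lower())


def contains_sequence(text: str, sequence: str) -> bool:
    """Return True if the words of `sequence` occur contiguously in `text`."""
    words, target = tokenise(text), tokenise(sequence)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


original = "to the local boarder crossing police"  # Keith-2, as typed
edited = "to the local border crossing police"     # Keith-2, standardised

print(contains_sequence(original, "border crossing"))  # False
print(contains_sequence(edited, "border crossing"))    # True
print(Counter(tokenise(original))["border"])           # 0
```

In the typed version the search for border crossing fails, and boarder is counted as an instance of a different word, which is the situation the quotation describes.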

Furthermore, deciding to correct spelling is not straightforward, since an author can make both ‘performance’ mistakes (mistakes that an author knows they have produced) and ‘competence’ errors (where non-standard rules are broken consistently) (Coulthard, 2005b: 15). Coulthard also describes the problem of working with typewritten text:

[E]rrors and mistakes may be confused and compounded – one may not know, for any given item, particularly if it only occurs once, whether the ‘wrong’ form is the product of a mis-typing or a non-standard rule – for instance if a (British English) text includes the word ‘color’ is this a typing mistake or a spelling error, or even worse the result of the computer user being unable to change the spell-check to British English (Coulthard, 2005b: 16).

Since spelling is not the focus of this thesis, and since automated methods will be used for identifying formulaic sequences, the decision was made to standardise the data, using the autocorrect feature in Microsoft Word 2010 as a guide. Such changes included:

1) Inserting spaces as needed for punctuation:

| Text | Original | Edited |
| --- | --- | --- |
| June-5 | I would learn from my mistakes,but no fear, I don’t. | I would learn from my mistakes, but no fear, I don’t. |
| Elaine-1 | and it was beautiful-it really was exactly what I would have chosen | and it was beautiful – it really was exactly what I would have chosen |
| Carla-1 | it was a beautiful day and isn’t right .Firstly it’s | it was a beautiful day and isn’t right. Firstly it’s |
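Changes of this first type are regular enough to be expressed as rules. The following is a minimal sketch only: the data in this thesis were standardised using the autocorrect feature in Microsoft Word 2010 as a guide, not with this code, and the function name `normalise_punctuation_spacing` is hypothetical. The sketch assumes two rules, a misplaced space before a full stop or comma and a missing space after a comma. It deliberately leaves hyphens alone, since deciding whether a hyphen joins a compound or separates two clauses (as in Elaine-1) requires human judgement.

```python
import re


def normalise_punctuation_spacing(text: str) -> str:
    """Apply two illustrative spacing rules around full stops and commas."""
    # "right .Firstly" -> "right. Firstly": move a space from before the mark to after it
    text = re.sub(r"\s+([.,])(?=\w)", r"\1 ", text)
    # "mistakes,but" -> "mistakes, but": add a space after a comma followed by a letter
    text = re.sub(r",(?=[^\W\d_])", ", ", text)
    return text


print(normalise_punctuation_spacing("I would learn from my mistakes,but no fear, I don’t."))
print(normalise_punctuation_spacing("it was a beautiful day and isn’t right .Firstly it’s"))
```

Restricting the second rule to a following letter leaves figures such as 4,000 untouched.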

2) Inserting or deleting spaces between words:

| Text | Original | Edited |
| --- | --- | --- |
| John-2 | Those 6months were very hard | Those 6 months were very hard |
| Sue-1 | I had somewhat convinced myself that Iw as to get AAAB | I had somewhat convinced myself that I was to get AAAB |
| June-1 | Duke of EdinburghAward | Duke of Edinburgh Award |
| Hannah-1 | The blindness went on for a bout 4 minutes | The blindness went on for about 4 minutes |

3) Adding/removing apostrophes, accents and extraneous punctuation:

| Text | Original | Edited |
| --- | --- | --- |
| John-4 | I was completely at everyones mercy | I was completely at everyone’s mercy |
| June-2 | I saw him in my minds eye | I saw him in my mind’s eye |
| Carla-1 | cliche | cliché |
| Greg-2 | no serious damage done,.although I wasn’t aware of it at the time | no serious damage done, although I wasn’t aware of it at the time |

4) Correcting some spellings, often of homophones:

| Text | Original | Edited |
| --- | --- | --- |
| Keith-2 | to the local boarder crossing police | to the local border crossing police |
| Elaine-3 | when your young | when you’re young |
| Keith-3 | I had heard a bit more about the shear numbers of people | I had heard a bit more about the sheer numbers of people |

Some irregularities identified by the autocorrect feature were not corrected. These included:

5) Unrecognised lexical items:

Sarah-1 I was absolutely outstanded when he told me that I had passed

Mark-1 Me and JP have had a few bumpings off head but that’s just our characters really

David-3 The existence of Santa Clause has always been one of magic and intrepidation

6) Inconsistent/incorrect capitalisation:

Rose-5 it Was also quite embarrassing

Keith-2 at 4:30am i was dead tired and left everyone in the club and headed back to the Hotel

Thomas-4 back to Normal

7) Features of spoken register:

June-3 We then run to our parents room and jump on their bed and begin opening our presents, oohing and aahing about what we have got

Alan-1 It takes a helluva lot to make me cry

Thomas-3 The other thing that kinda makes you stop believing

8) Lexical reduplication for emphasis:

June-2 I’m sure there [sic.] something really really bad that’s happened to me

Judy-1 I had got up very very early in the morning

Alan-2 It was just constant pain, pain, pain

The few unrecognised lexical items were automatically identified as spelling errors, and in each case it was possible to make an educated guess about what the target word was. However, the target could not be known with certainty, and so the decision was made to err on the side of caution and not to second-guess the author. For example, in the case of outstanded, it is likely that Sarah blended outstanding with astounded, but we cannot know for sure which was the target word.

There was no need to correct capitalisation as this would not interfere with any automated matching. Features of spoken register and reduplication for emphasis were not standardised since these were judged to be potentially characteristic of how each author used lexis. Since each author’s use of lexis is the central focus of this research, it would be unjustifiable to alter these features.
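The point about capitalisation can be illustrated with a minimal sketch, on the assumption that the automated matching ignores case; the matching software itself is not described in this section, and the function name `find_sequence` is hypothetical.

```python
import re


def find_sequence(text: str, sequence: str) -> bool:
    """Case-insensitive search for a contiguous sequence of whole words."""
    # Escape each word and allow any run of whitespace between words
    pattern = r"\b" + r"\s+".join(re.escape(word) for word in sequence.split()) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(find_sequence("back to Normal", "back to normal"))                  # True
print(find_sequence("it Was also quite embarrassing", "it was also"))     # True
```

Under this assumption the irregular capitals in Thomas-4 and Rose-5 do not prevent a match, so standardising them would change the data without changing the result.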

In addition to the errors identified by the autocorrect feature, the data were manually checked and a series of additional errors were found:

9) Perceived errors in flow:

June-2 Moving back up North to and going to primary school

Sue-5 I refused on the basis there WAS no more room they start pushing along it, ramming into my sitting next to me

Mark-1 So it that is, that’s the last time i cried

10) Omitted words:

Rose-5 When we were all sat in the hall waiting for the presentation ø begin

Michael-1 Later in ø morning we would take a stroll

Hannah-5 We had all drunk too much and there was ø of flirting

11) Incorrect lexical choices and/or potential typing errors:

Sue-3 I was still to select my choices, yet alone start writing a personal statement

Sarah-5 There are other elements rather that money that make people happy

Greg-5 smacked me on the back and pushed me fast first onto the snow

12) Some homophones:

Mark-3 but as the years past my love for animals hasn’t changed

David-2 but buy definition they were accidents

Keith-3 but he actually sailed down my road in his slay

13) Incorrect word boundaries that formed complete, recognisable words:

Mark-4 but it came down to the stupidest thing of a miss understanding of what was happening

Rose-5 The nit was my turn!

This final category is akin to metanalysis in Old and Middle English, where a napron came to be pronounced as an apron and a nadder became an adder (Campbell, 2004). The difference, however, is that whilst napron and nadder are not recognised as standard spellings in Modern English, the examples above are, and so are not instantly recognisable as misspellings.

Categories 9–13 were not corrected for two reasons. Firstly, they were not identified by the autocorrect feature in Microsoft Word 2010, and the task of identifying every single example could therefore be too cumbersome for the forensic context. Secondly, whilst in some cases it would be possible to establish the target word (i.e. homophones and incorrect word boundaries that formed complete, recognisable words), an element of second-guessing the author would be required for the other categories (i.e. perceived errors in flow, omitted words, and incorrect lexical choices and/or potential typing errors). Rather, they have been highlighted through this manual checking to illustrate the authenticity of these texts and to explicate potential problems with any analytical technique. Conversely, if evidence in favour of formulaic sequences as a marker of authorship can still be found in the face of these problems, this demonstrates the robustness of the marker.
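Why categories 12 and 13 escape an automated check can be shown with a minimal sketch of a word-list lookup. This is an illustration only: the internal workings of the Microsoft Word 2010 checker are not described here, and `VOCABULARY` is a hypothetical toy word list standing in for a real dictionary.

```python
import re

# Toy vocabulary: a hypothetical stand-in for a spell-checker's word list
VOCABULARY = frozenset({
    "i", "was", "absolutely", "the", "nit", "my", "turn", "but", "he",
    "actually", "sailed", "down", "road", "in", "his", "slay",
})


def unrecognised_words(text: str, vocabulary: frozenset = VOCABULARY) -> list[str]:
    """Return the words in `text` that are absent from the vocabulary."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [word for word in words if word not in vocabulary]


print(unrecognised_words("I was absolutely outstanded"))                      # ['outstanded']
print(unrecognised_words("The nit was my turn!"))                             # []
print(unrecognised_words("but he actually sailed down my road in his slay"))  # []
```

A lookup of this kind flags outstanded because it is not a word, but it passes The nit and slay because each token is a legitimate word in its own right; only context reveals the error.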

In the case of homophones, it is clear that some were identified by Microsoft Word 2010 and some were not. Those that were corrected were those that were automatically identified, whilst those left uncorrected required manual identification. This divide within a single category highlights a potential limitation in the use of automated methods: the research is limited by the software’s level of sophistication. A summary of the changes made to the data can be found below in Table 4-5:

Table 4-5 Summary of changes made to data

| | Edited | Unedited |
| --- | --- | --- |
| Identified by automated check | Inserting space between punctuation; inserting or deleting space between words; adding/removing apostrophes, accents and extraneous punctuation; some homophones | Unrecognised lexical items; incorrect/inconsistent capitalisation; features of spoken register; lexical reduplication for emphasis |
| Identified by manual check | Names, dates, places and any other identifying material | Perceived errors in flow; omitted words; incorrect lexical choices and/or potential typing errors; some homophones; incorrect word boundaries that formed complete, recognisable words |

The fact that so many errors of different types occurred in the data is an unavoidable consequence of the data collection design: asking people to type their answers clearly relies on individual typing ability. However, this does highlight that the data are authentic and that a linguist is faced with many of the same problems in a ‘real’ case of forensic authorship attribution. Nonetheless, it is acknowledged that editing the data in this way may be problematic for some.

4.7 Summary

In this chapter, three key claims have been stated about the nature of formulaic sequences as they relate to the individual. Three approaches have been proposed, all of which, although influenced by earlier work, are novel in how they identify formulaic sequences, and a corpus has been described on which these claims can be tested. Over the next three chapters, each of these approaches will be described with a full account of the results so that an answer to the central research question can be determined.
