As a starting point, it makes sense to identify only a small subset of formulaic sequences, because this makes it possible to interrogate the data more closely whilst being reasonably confident that the sequences identified are actually formulaic, with fewer examples of ‘grey areas’. To achieve this, a quantitative approach (e.g. Biber & Conrad, 1999; Biber, Conrad, & Cortes, 2004; Hoover, 2002; Stubbs, 2002; Stubbs & Barth, 2003) will be used to identify recurrent sequences (clusters of words) in the corpus which may be indicative of authorship. It is the number of occurrences and the consistency with which individual clusters occur that qualify them as formulaic (as discussed in Section 3.5.2, p. 70) and that separate the research presented here from other investigations which have used clusters as a marker of authorship.
At this juncture, it may be useful to consider whether the research presented here adopts a corpus-based or corpus-driven approach to the data analysis, which Römer (2005) describes as being “two different opposing disciplines within corpus linguistics” (p. 22). The corpus-driven approach is more prone to the alteration and development of theory leading to new theoretical insights. Corpus-based linguists, alternatively, “do not put the corpus at the centre of their research but see it as a welcome tool which provides them with frequency data, attested illustrative examples, or with answers to questions of grammaticality or acceptability” (p. 23). Although the author corpus is not annotated (a preference of corpus-driven linguists who avoid relying on other researchers’ views of language), there are pre-formulated ideas and hypotheses in mind (p. 23) which the corpus evidence is then used to either support or refute. On this basis, it is more accurate to describe the present research as being corpus-based.
Genre can be an important feature of some cluster-based investigations (notably, lexical bundles e.g. Biber & Conrad, 1999; Biber, Conrad, & Cortes, 2004, cf. Section 3.3.1 for definition). Since the present research is interested in a more universally applicable approach, a robust method for authorship attribution needs to be independent of genre or context. As such, it is necessary to develop a definition of what exactly will be identified as formulaic. The term formulaic cluster has been coined here for this purpose, and should be understood to mean:
Sequences of three words or more which are not necessarily complete meaningful units and which are not overtly related to context. Formulaic clusters occur in the majority of texts produced by an individual author and can be argued to be idiolectal based on the recurrence of form across separate texts and to be formulaic in terms of their frequency.
The fact that formulaic clusters are found in the majority of texts demonstrates that they are a strong and, crucially, recurring part of that author’s lexical repertoire (as opposed to clusters which might be very frequent in one text but not across a series; these are also likely to be idiolectal but less consistent and therefore less reliable). Repetition across texts also reduces the likelihood of clusters being content specific or chance occurrences. The threshold for determining what ‘majority’ means will be dependent on the data available in terms of quantity of texts and the length of texts. In a later section (cf. Section 4.4, p. 84), the author corpus is described, in which each author produced a total of five texts. As a guide, occurrence in three of the available texts (60%) is justified as the minimum since this equates to over half of the texts produced by an author (and obviously, formulaic clusters which occur in 80% or 100% of texts should be more characteristic of idiolect). Other researchers wishing to draw on this definition would be required to justify their own thresholds based on their own data.
The definition specifies that formulaic clusters must consist of at least three words, since two-word clusters will typically consist of grammatical items (e.g. Biber, Conrad, & Cortes, 2004). Although the diagnostic potential of grammatical items has been claimed (e.g. Mosteller & Wallace, 2007), it may be less convincing to argue that they will be useful in this context. After all, grammatical items are required for the organisation of text whereas lexical items allow for more variability. Although grammatical items may well be stored formulaically, they form a smaller set of words, so there is more limited variation in how authors can use them compared to lexical words. Among the clusters that meet the definition, a cline will naturally be generated between those which occur more frequently across fewer texts and those which occur less frequently over more texts.
Finally, focusing on the recurrence of form means that variability cannot be tolerated; in other words, authors must produce identical forms in at least three of their texts. The limitation of this approach is that clusters which naturally allow for some variability (e.g. it’s his choice and it’s her choice, where the pronominal choice is content dependent) will not be identified as formulaic clusters in this research. However, the method will enable an initial automated analysis, contributing to the requirement that a method based on formulaic language should be robust (cf. Section 3.5).
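The definition of a formulaic cluster given above lends itself to automation, and the following is a minimal sketch of how such an extraction might proceed, not a description of the software actually used in this research. It assumes a simple lowercase tokeniser, an upper limit on cluster length (`max_len`, which the definition does not specify) and the three-text threshold discussed above; the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Yield every contiguous n-word sequence as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def formulaic_clusters(texts, min_len=3, max_len=5, min_texts=3):
    """Return clusters found in at least `min_texts` of one author's texts.

    Each qualifying cluster maps to (number of texts, total occurrences).
    Only identical forms are counted, so variable clusters are not merged.
    """
    text_counts = Counter()   # number of texts containing the cluster
    total_counts = Counter()  # occurrences summed over all texts
    for text in texts:
        tokens = tokenize(text)
        per_text = Counter()
        for n in range(min_len, max_len + 1):
            per_text.update(ngrams(tokens, n))
        total_counts.update(per_text)
        text_counts.update(per_text.keys())
    return {
        " ".join(cluster): (text_counts[cluster], total_counts[cluster])
        for cluster in text_counts
        if text_counts[cluster] >= min_texts
    }
```

Recording both the number of texts and the total number of occurrences makes it possible to place each cluster on the cline between frequent but narrowly distributed clusters and less frequent but widely distributed ones.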
If individual lexicons do contain preferred formulaic sequences, differences between authors’ formulaic clusters should manifest. Using this definition, formulaic clusters will be identified in the data in Chapter 5. It will then be possible to determine whether their occurrence in texts can differentiate authors and enable the correct attribution of a Questioned Document.
4.2.2 Core word
Having determined the extent to which a small and circumscribed sub-set of formulaic sequences are employed by some authors, it will be possible to increase the range of sequences to determine whether the authors employ sequences from different sets—or at least, whether the choices made in the texts differ significantly from author to author.
It stands to reason that if one particular word can be isolated which occurs predominantly and frequently in formulaic sequences—a core word—then a reasonable sub-set of sequences, the majority of which could be expected to be formulaic, will also be identified. The rationale behind using a core word is that a frequent content word will have fragmented meaning (Wray, 2002: 29) and therefore will rely on other words for the construction of a unified meaning. Wray (2002) discusses this concept in relation to Willis (1990):
Willis (1990) nicely illustrates this fact with reference to the word way, which he argues could usefully be a key vocabulary item in ESL teaching. This is not because way in the sense of ‘minor road’, or even ‘direction’, is particularly frequent, but because way figures in numerous expressions (e.g. in a way, by the way, by way of, ways and means) which, between them, propel the word virtually to the top of the frequency counts in a large corpus. (Wray, 2002: 29)
In this example, way should be frequent in large corpora because it is central to “numerous expressions”. It follows that identifying all instances of way in a corpus should provide a direct path to a range of formulaic sequences.
So what candidates are available as core words? Of the content words, two in particular stand out as worthy candidates. The first is thing. Willis (1990) observes that thing is very common in the English language, it has a clear meaning and its grammatical behaviour is known (p. 39). These are important factors for ensuring that sufficient data are extracted from the corpus and that the marked and unmarked uses of the word are understood. However, what makes thing especially suitable as a core word is that it is incorporated into a variety of formulaic sequences e.g. one thing after another, the shape of things to come (Willis, 1990: 39). Thing (and its plural form, things) occurs 168 times in the author corpus (described in Section 4.4, p. 84) so it generates plentiful data.
The second potential candidate is of course the noun way described above. Way is also singled out by Willis (1990) as a key content vocabulary item which occurs in numerous formulaic sequences and enjoys the same advantages as thing described above. In comparison to thing though, way (and its plural, ways) occurs less frequently in the author corpus, a total of only 105 times. Even so, way does hold certain other advantages. There is existing literature to support the range of meanings conveyed by way along with its specific uses (e.g. Goldberg, 1996; Sinclair, 1999; Willis, 1990). More importantly though, there are more entries and definitions in the Oxford Online Reference tool (2010) for expressions containing way. The importance of this will become clear later (cf. Chapter 5) when glosses for the sequences identified through the core word are required. It stands to reason that for an exploratory piece of research there is inherent value in utilising the findings of existing research and resources to inform the methods adopted. Overall then, way is judged to be the more appropriate core word on which to concentrate.
The first task will be to establish whether individual authors use a different set of way-phrases. Then it will be possible to determine whether other authors use the same way-phrases (i.e. the distinctiveness of those formulaic sequences), and, for comparative purposes, whether similar meanings are expressed in different forms (i.e. using expressions that do not include way). This is the approach adopted in Chapter 6.
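By way of illustration only, the retrieval of candidate way-phrases could be sketched as a simple concordance search followed by a comparison across authors. The fixed context window (`window`), the tokeniser and the function names are assumptions made for this sketch; deciding which retrieved contexts actually constitute formulaic sequences would still require inspection against the literature and reference resources described above.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", text.lower())


def core_word_contexts(text, forms=("way", "ways"), window=2):
    """Return the window of words around each occurrence of the core word."""
    tokens = tokenize(text)
    contexts = []
    for i, token in enumerate(tokens):
        if token in forms:
            left = max(0, i - window)
            contexts.append(" ".join(tokens[left:i + window + 1]))
    return contexts


def distinctive_contexts(texts_by_author):
    """Map each author to the contexts that no other author produced."""
    sets = {
        author: {c for text in texts for c in core_word_contexts(text)}
        for author, texts in texts_by_author.items()
    }
    return {
        author: own - set().union(*(s for a, s in sets.items() if a != author))
        for author, own in sets.items()
    }
```

A context shared by several authors drops out of every author's distinctive set, which corresponds to the question of whether other authors use the same way-phrases.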
4.2.3 Reference list of formulaic sequences
The final investigation into formulaic sequences as a marker of authorship takes an increasingly inclusive approach in order to establish how many formulaic sequences occur in the texts, whether authors use different proportions of formulaic sequences, and crucially, whether texts can be attributed to their authors on this variable. In Chapter 3, the difference in the size of individuals’ store of formulaic sequences was highlighted. If Wray’s (2002) assertion is correct that the range of formulaic sequences available to an individual varies, then each author in the corpus should rely on formulaic sequences to differing degrees. Some authors may use a higher proportion of formulaic sequences whilst others may use a higher proportion of novel language, depending on what each author has stored as communicatively more useful and the complexity of their individual needs. In Section 3.5, it became clear that from the researcher’s perspective there were different advantages to each of the methods, but that each ultimately raised problems from a forensic-orientated perspective. For example, using intuitions about formulaic sequences may be particularly suited to exploratory investigations such as this, but the limited objectivity and reliability of intuition render the findings too problematic to be used as forensic evidence. The solution, drawing on the shared knowledge of a panel of judges, can lead to consensus regarding whether a given example can be considered formulaic. However, forensic case work does not typically lend itself to analysis by a panel of judges due to issues of confidentiality, restrictions over access and limitations on time.
On the other hand, reference lists can overcome such problems and they afford the linguist an opportunity to analyse substantially more data, on their own, relatively quickly. The key consideration is which items are contained in such a list, and what decisions are made at the compilation stage. The ideal basis on which to proceed is to create a unique reference list of formulaic sequences based on the shared knowledge of a considerable number of judges. Such a list draws on the strengths of both approaches: (i) there is an element of consensus derived from a large panel of judges, and (ii) the list can be applied to data reliably without having to actually involve individual judges. The result will be a list of formulaic sequences that can be applied to individual forensic cases. Through the marriage of these two approaches a greater level of reliability, validity and feasibility can be achieved.
The proposed method is to develop a reference list based on examples of formulaic sequences obtained from the internet. The internet represents language as it is used by a huge range of language communities. If there is consensus amongst internet users over what is acceptable as a formulaic sequence (what Peters (1983: 11) calls community-wide formulas), it is a reasonable assumption that such items will actually be formulaic. Numerous lists created on the basis of different aspects of formulaic sequences are available on the internet (e.g. clichés, idioms etc.), usually created as a reference tool for non-native speakers of English. In many cases, such lists are amended and added to following suggestions from readers. This satisfies the requirement of using a panel of judges.
The empirical research in the final analytical chapter sets out to test whether such a list can in fact be created and usefully applied to data. More importantly, it tests firstly whether identifying formulaic sequences in this way produces results which actually differentiate authors (legitimising formulaic sequences as a marker of authorship), and secondly whether the method could be presented in evidence (legitimising the method as a forensically robust approach to identifying formulaic language).
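To make the idea of applying a reference list concrete, the following sketch shows one way in which a list could be matched against a text and the proportion of formulaic language estimated. It is offered under stated assumptions and not as the procedure used in the analytical chapter: the measure (share of word tokens falling inside a matched sequence), the preference for the longest match at each position and the function names are all hypothetical choices made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", text.lower())


def formulaic_proportion(text, reference_list):
    """Return (matched sequences, share of tokens that fall inside a match)."""
    tokens = tokenize(text)
    entries = {tuple(tokenize(entry)) for entry in reference_list}
    entries.discard(())  # ignore entries with no word tokens
    lengths = sorted({len(entry) for entry in entries}, reverse=True)
    covered = [False] * len(tokens)
    found = []
    i = 0
    while i < len(tokens):
        for n in lengths:  # prefer the longest entry starting at position i
            if i + n <= len(tokens) and tuple(tokens[i:i + n]) in entries:
                found.append(" ".join(tokens[i:i + n]))
                covered[i:i + n] = [True] * n
                i += n
                break
        else:
            i += 1  # no entry starts here; move to the next token
    share = sum(covered) / len(tokens) if tokens else 0.0
    return found, share
```

Because matching requires identical forms, a list entry is only found where the text reproduces it exactly; comparing the resulting shares across an author's texts would give one possible measure of how far that author relies on formulaic sequences.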
Before the analytical work can begin, appropriate data are required. The data that will be used in this research are described in the following section.