This chapter reconceptualises corpus-driven and corpus-based approaches by positioning the former as a subset within the wider field of corpus-based investigation. The dichotomous distinction outlined by Tognini-Bonelli (2001), although useful and influential, is challenged here on four counts.² Firstly, it overlooks the comparative scale of the approaches, as corpus-based research greatly outweighs corpus-driven work both in the amount of work produced and in its variety. A more accurate representation of their relationship involves adjusting their relative scope and size.
Figure 3.2 The relationship adjusted to reflect size and scope of approaches
[Figure labels: Corpus-based approach; Corpus-driven approach]
Secondly, the dichotomy overlooks the similarities and shared priorities of the two approaches. Proponents of both share a belief in relying on attested data when making statements about language and agree that this data should take the form of corpora: collections of authentic texts designed for linguistic purposes, planned accordingly, stored and accessed electronically, and analysed non-linearly and quantitatively (as well as qualitatively). Another similarity is that the insights both approaches offer over non-corpus or intuition-based work rest largely, though in different ways, on frequency: the corpus-based approach through investigating feature frequency and distribution within and across corpora, and the corpus-driven approach by using frequency (rather than existing theoretical categories) as the initial criterion through which to identify significantly occurring words, patterns, and units of meaning. In overlooking similarities and focusing on differences, the concept of dichotomy only enforces confrontational relationships between corpus researchers.
1 As discussed below, the approach followed in these latter studies provides a model for implementing a greater focus on corpus-driven research to complement or balance trends towards larger and more differentiated studies based not only on corpora but on pre-corpus language models.
2 For similar reasons, some researchers, such as McEnery et al (2006) and Hunston (2002), decide to avoid the terms altogether.
Thirdly, the boundary between the two is, in McEnery et al’s (2006: 11) words, ‘fuzzy’ or ‘overstated’, and it is suggested that their relationship is best represented as a cline (McEnery and Gabrielatos 2006: 36). Although proponents of both approaches may hold strong underlying convictions, in reality researchers working within the corpus-based tradition with heavily annotated corpora can adopt inductive approaches, just as ‘corpus-driven’ studies can depart in various ways from the ‘prototypical’ approach outlined by Sinclair among others. Rayson (2008), for example, describes his approach as data-driven in order to capture the fact that the features he studies are selected through consideration of their keyness or salience in the corpus, while distancing his work from the corpus-driven perspective, within which his use of (key) semantic categories and POS tags could be questioned. Meanwhile, although Gledhill’s (2000) and Groom’s (2003; 2006) work is most consistently driven by their corpora, in allowing frequency rather than existing theory or explicit preconceptions regarding language description to drive findings, Carter and McCarthy’s work on spoken language can be placed further towards the middle of the cline.¹ Their work involves the re-evaluation of descriptive categories and the creation of new linguistic terms (for spoken language) within an otherwise traditional grammar model. As discussed below, the distinction within the corpus-driven approach between the lexical grammar model and an inductive approach can be significant in describing differences between such studies.
The alternative way of conceptualising the two approaches shown below captures both their similarities and their differing scope. Prototypically corpus-driven studies, exemplified by Sinclair (2004), could occupy the centre of the corpus-driven circle, with studies becoming progressively less prototypically corpus-driven towards its outer ring.
1 This is not to say that Carter and McCarthy would describe themselves using the term ‘corpus-driven’ and, in fact, these authors avoid any mention of the dichotomy (Carter and McCarthy 2006; O’Keeffe et al 2007).
Figure 3.3 An alternative conceptualisation of the relationship
[Figure labels: Corpus-based approach; Corpus-driven approach]
The need to avoid a dichotomous model is supported by discussion of two of the ‘four basic differences’ between the approaches (attitudes towards existing theories and towards intuition, and research focus),¹ which suggests that in some cases differences have not only been overstated (McEnery et al 2006: 8) but can be put down to factors such as purpose, data type, and how general or specialist the corpus is. These factors typify, but are not inevitably linked to, one or other approach. In relation to the current focus on language variation studies, the two approaches differ most strikingly in determining which linguistic features should be studied in describing language varieties² and how these are identified and retrieved (deductively or inductively). This distinction is significant in the current study, for it underlies the argument made for the inclusion, alongside quantitative corpus-based studies, of corpus-driven studies: inductive, in-depth, data-driven analyses of single varieties.
A further advantage of avoiding the dichotomous model, however, lies in the assertion that the two approaches can complement rather than confront each other and that they have much to learn from each other: it is here, Sinclair (1991: 36) suggests, that progress lies. Correspondingly, although the current study is shaped primarily by an intention to let the data drive findings, in exploring what is a new domain in corpus linguistics, it does not confine itself to any presumed distinction between corpus-based and corpus-driven procedures but draws, as we shall see, both on a word-frequency approach and on the application of an existing (inductive) model (Carter and McCarthy 2006).
1 The other differences include their paradigmatic claims and the suggestion that corpus-driven approaches are more ‘radical’ (largely due to their attitude towards existing theories and their research focus), and the types of corpora used by the two approaches (raising issues of size and representativeness). These issues are less relevant to the current chapter and are explored in Chapter 4: Considerations and Challenges in Corpus Compilation.
2 Given that, unlike a dictionary, a description cannot take account of all the words in a language variety.