Corpus linguistics, generally understood as language study based on authentic evidence of language use, has a long history. McEnery and Wilson (1996, pp. 2-4) have shown that many researchers carried out investigations utilising corpora of real language use on a variety of subjects, such as language acquisition, spelling conventions, language pedagogy, comparative linguistics, and syntax and semantics, long before the term ‘corpus linguistics’ actually came into use in the mid-1980s (Leech, 1992, p. 105). In fact, Leech (1992, p. 105) suggested that the reason for the lack of such a label was that “for those who espoused this approach, corpus linguistics was simply ‘linguistics’”.
80 Sinclair (2005) explains that “pieces” is used because some researchers compile extracts of texts (e.g. the Longman/Lancaster English Language Corpus; Summers, 1993).
However, with the rise of Chomsky’s rationalism-driven theory of language studies in the 1950s and 1960s, a claim was made that competence, not performance, was the real object of linguistic investigation. Hence, empiricism-based linguistic research, i.e. research using corpora to study language, was heavily criticised. As McEnery and Wilson (1996, pp. 4-5) explained, according to this new perspective, linguistics should be based on “theories which reflected a psychological reality, cognitively plausible models of language” instead of abstract descriptions of utterances in a corpus, which Chomsky deemed skewed, and thus untrustworthy.81 With this paradigm shift, Chomsky’s generative theory of language took over the position of mainstream scientific linguistics (see Hanks, 2009).
It should be noted that, despite the undoubted predominance of generative linguistics in those decades and the consequent loss of popularity of corpus-based studies, corpus analyses were still being carried out, mostly due to the contribution of technology to the advancement of this kind of research (cf. McEnery & Wilson, 1996; Sardinha, 2000).
In this vein, notwithstanding the theoretically adverse context of the time, Nelson Francis and Henry Kučera created the 1-million-word Brown corpus (Brown University Standard Corpus of Present-Day American English), covering 15 genres of written American English published in 1961, and released it in 1964 (Baker, 2011, p. 17). For the first time, a corpus comprised machine-readable texts stored in a computer.
Other corpus compilations of ever-increasing size followed. For example, the 5-million-word AHI corpus (American Heritage Intermediate Corpus; written American English, 1971), five times larger than the Brown corpus, was an impressive milestone in the progress of corpus linguistics. In the United Kingdom, corpora were reaching sizes as large as 20 million words, as with the Birmingham Corpus (Birmingham University International Language Database; written British English, 1987), which was used to compile the innovative Cobuild Dictionary of English for Advanced Learners from scratch (more details in section 3.2 below).
81 See McEnery & Wilson (1996, pp. 4-14) for a detailed account of Chomsky’s opposition to corpus linguistics.
It is apparent that the analysis of large corpora required methods other than the traditional manual inspection applied to hand-picked instances of language use. Fast-developing technology contributed to advances in computer-based corpus analysis tools, for example concordancers, left and right sorting of keywords in context, searches for collocates within a 2- to 5-word span of the keyword, word lists, etc. The advent of the computer therefore marked a clear dividing line in the history of corpus-based studies, giving rise to the concept of corpus linguistics as we presently know it.
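The tools listed above can be illustrated with a minimal Python sketch. It is not taken from any of the corpus packages discussed in this chapter: the function names (tokenize, word_list, kwic, sort_right, sort_left, collocates), the simplified tokenizer and the three-sentence sample text are all assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical three-sentence sample; a real corpus would hold millions of words.
SAMPLE = (
    "The strong tea was served hot. She prefers strong tea to weak coffee. "
    "A strong wind blew all night."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_list(tokens):
    """Frequency-ordered word list: (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, width=4):
    """Keyword-in-context lines as (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_right(lines):
    """Sort concordance lines by the words to the right of the keyword."""
    return sorted(lines, key=lambda line: line[2])


def sort_left(lines):
    """Sort concordance lines by the words to the left, nearest word first."""
    return sorted(lines, key=lambda line: line[0][::-1])


def collocates(tokens, keyword, span=4):
    """Count the words found within `span` tokens on either side of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - span):i])
            counts.update(tokens[i + 1:i + 1 + span])
    return counts


def format_line(line, width=30):
    """Render one concordance line with the keyword aligned in a column."""
    left, key, right = line
    return f"{' '.join(left):>{width}}  {key}  {' '.join(right)}"


if __name__ == "__main__":
    tokens = tokenize(SAMPLE)
    for line in sort_right(kwic(tokens, "strong", width=3)):
        print(format_line(line))
    print(collocates(tokens, "strong", span=2).most_common(3))
    print(word_list(tokens)[:5])
```

Even on this toy sample, sorting the concordance lines to the right of strong brings the two occurrences of strong tea together, which is the kind of recurrent pattern that manual inspection of isolated examples tends to miss. The span parameter corresponds to the 2- to 5-word window mentioned above.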
It was thus possible, for the first time, to undertake a systematic examination of large amounts of evidence of language use, which revealed patterns and regularities never seen before (cf. Sinclair, 1991). Quantitative results of language behaviour analyses enabled Sinclair to show that “words are interconnected, not isolates, that meaning is derived from context, and that collocation is key” (Moon, 2008, p. 243). With this revelation, Sinclair made a compelling case against principles of generative theory and shed new light on linguistic studies.
One of his most renowned arguments was that meaning construction in texts did not follow the ‘slot-and-filler’ model proposed by generative theory. That is, he argued that while Chomsky’s tradition defined the production of meaning as the filling of grammatical slots with virtually any word, according to the speaker’s choice, corpus investigation revealed that meaning derives from phrases, which are more or less fixed and vary in extent.
Sinclair’s idiom principle, “that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments” (Sinclair, 1991, p.110), has thus revolutionised linguistics, opening up unprecedented possibilities for research on how language works.
For instance, it has been shown that phraseology is pervasive and highly frequent in language. Stubbs (2006, p. 24) cites Mel’čuk’s (1998) estimate of “ten times as many phrasal units as individual words” in language. Gries also shares this view. He succinctly explains how meaning depends on word co-relations:
(…) formal differences reflect, or correspond to, functional differences. Thus, different frequencies of (co-)occurrences of formal elements […] are assumed to reflect functional regularities, and ‘functional’ is understood here in a very broad sense as anything – be it semantic, discourse-pragmatic, … – that is intended to perform a particular communicative function. (Gries, 2009, p. 4)
In this vein, painstaking analysis of lexical items and inspection of recurrent structures in very large amounts of naturally occurring language evidence soon became a widely adopted method in a variety of studies. This is the case of many studies on academic language, as presented in Chapter 2, which use corpora for analysing multi-word units82 and their meaning, function or specialized use with regard to variation in genre, discipline, writing expertise and language proficiency.
As can be seen, corpus linguistics has restored empirical research to a legitimate position within language studies. In consequence, corpora have been increasing exponentially not only in size – mega corpora, as will be shown later, are reaching sizes of 20 billion words – but also in language coverage. In the early days of electronic corpora, the vast majority concerned the English language. Since then, wider access to corpus tools and language resources, owing to the rapid progress of language technology software and the increasing availability of Internet connections, has enabled the compilation of corpora for several additional languages, many of them with little or no (electronic) lexicographic tradition, e.g. Mirandese, Yoruba, and other varieties of Portuguese (from Mozambique and Cape Verde).
Accordingly, irrespective of the theoretical paradigm from which corpus-based research is undertaken, it can be argued that corpus linguistics today owes its fundamental characteristic to Sinclair’s ground-breaking demonstration of a previously unknown facet of language: meaning derives from the relations that words maintain with other words.