1. The origins of corpora
The starting point of corpus linguistics can be traced by considering the issue of observable data and how this has been handled in different periods and across different theoretical schools. Of necessity, historical linguistics has always been corpus-based since by far the principal evidence of language change and evolution is found in collections of texts of different periods and locations (Johansson 1995: 22). Indeed, modern linguistics owes its impetus to the lively work of the historical linguists of the nineteenth century. It may come as a surprise therefore that in a relatively short space of time it should shift its focus from that data-based approach to one based on intuition and introspection.
But it is a fact that, in spite of its data-based origins, modern linguistics, after the historical-evolutionist period, shifted away from the observation of data and, starting with Saussure, the object of linguistics was defined as the system, abstract par excellence, and non-identifiable in single tokens.
Under the influence of the positivist and behaviourist trend, post-Bloomfieldian linguistics in the USA became concerned to account for the observable data, and there was little room for abstract speculation. With Chomsky, though, the pendulum swung back towards a refusal of observable data as the basis for linguistic statements. Chomsky’s position as to observable data in general and corpus linguistics in particular is made clear in the following quotes:
Like most facts of interest and importance… information about the speaker-hearer’s competence… is neither presented for direct observation nor extractable from data by inductive procedures of any known sort.
(Chomsky 1965: 18)
Corpus Linguistics does not exist.
(Chomsky, in an interview with Bas Aarts, 2000)
The contrast between this position and the theoretical assumptions of corpus linguistics is obvious: corpus linguistics represents a definite shift towards a linguistics of parole; the focus is on ‘performance’ rather than ‘competence’. The linguist aims to describe language use rather than identify linguistic universals. The quantitative element (frequency of occurrence) is considered very significant and, depending on the specific approach, is taken to determine the categories of description.
The idea of a corpus grew in the 1960s, deriving mainly from the tradition of lexicography (Francis 1992). While Dr Johnson, the reference point for English lexicography, used for his examples sentences quoted from great scholars like Hume, Johnson’s focus was on the meaning of the words in use, and not on the ideas expressed in the sentences. Well-known writers were cited because they were authority figures in a prescriptive tradition. But as well as gathering the words of the great and famous, another tradition of scholarship that grew with modern linguistics was that of the field linguists, who spread all over the world reaching ever more remote communities and building up records of the languages – usually spoken – that they found. Their informants were in the main quite ordinary people, their conversations also ordinary.
The modern corpus was mainly based on these prior methods of acquiring data for language study. Nevertheless, the thought of compiling a collection of texts which would provide sound evidence of the state of a language was new. Instead of capturing the great signs of culture, the early corpora had modest aims: to collect a good variety of language in use by fairly ordinary people in order to study better the grammar and vocabulary currently in use. As noted by Biber (this volume), an early example of corpus-based work is found in C. C. Fries’ grammars of written and spoken American English (1940 and 1952, respectively). By the end of the 1960s there existed a few small corpora, constructed on diverse principles.
The Survey of English Usage, led by Randolph Quirk from its inauguration in 1959 (see website) was an exception to the trend of the time. The Survey focused on the everyday linguistic interactions, spoken and written, of non-celebrities and accumulated a large database on file cards at University College London (see website). There were, however, no plans to computerise it until many years later.
2. The influence of technology in the development of corpora
It was not the linguistic climate but the technological one that stimulated the development of corpora. The electronic computer was on the horizon, and although the first computers were extremely difficult to work with, their great potential was correctly assessed from an early date. Computational work on texts began with Father Busa’s Index Thomisticus before 1950 (completed in 1978, see Busa 2000), continuing the scholarly tradition of making concordances to works of high status, but using the clerical potential of the computer. There is now a large library of electronic versions of literary, philosophical and religious texts, and the word corpus often stretches to cover some of these collections (see special corpora below).
The digitisation of a vast range of documents is of more recent origin. Starting with databases of legal and journalistic documents, the movement has grown in parallel with the access provided by the internet, and the world-wide web in particular.
The first electronic corpus of written language was the Brown Corpus, compiled in the 1960s at Brown University by Nelson Francis and Henry Kučera (Francis and Kučera 1964) and still very much in use (see website). This corpus contains a million words of American English from documents which had been published in the year 1961. Its design (see sample corpora below) became the standard for some years, and some thirty years later was repeated in the Frown Corpus (see diachronic corpora below).
Advances in technology also enabled the collection of spoken data through the invention of the tape recorder. Portable tape recorders were just coming on the market in the late 1950s, and speech could be played back again and again, and studied as a sound wave. A side-benefit of this activity was that speech events could be transcribed without using shorthand, and the first electronic corpus of spoken language was assembled at the University of Edinburgh in the years 1963–5 on Sinclair’s initiative (see Krishnamurthy 2004). It contained 166,000 words of informal conversations in English, recorded and transcribed. So the written and spoken corpora of the early 1960s were prepared over the same period (by Francis and Kučera on the one hand for the written language and by Sinclair on the other for the spoken), but the researchers were not initially aware of each other’s work.
The 1970s was a period of consolidation and modest spread to a number of languages and different types of corpus. Development was slow, and this was also mainly because of the state of the available technology. Computers were still calculating machines with small memories, and programming languages were not devised with the manipulation of character strings in mind. Nevertheless, this is the time when corpora in excess of one million words were assembled, and annotated corpora were first considered, and also a spoken corpus in a detailed phonological transcription, the spoken section of the Survey of English Usage (see London–Lund website). All of these advances came from Sweden, and Scandinavian scholars such as Sture Allén, Knut Hofland, Stig Johansson and Jan Svartvik set the shape of mainstream corpus linguistics for a generation; they were not alone, however, and important corpus work was in progress with French, Hebrew and Frisian, among other projects. The first corpus of a special variety of a language was the Jiao Da English for Science and Technology (JDEST) corpus, compiled by Yang Huizhong in Shanghai around the same time (see JDEST corpus website).
Here again the timing was the result of technological advances. The invention of scanners improved access to the printed word enormously, and the growth of computer typesetting pushed the horizon out of sight. As Sinclair was fond of pointing out, by about 1990 linguistics had changed from a subject that was constrained by a scarcity of data to one that was confused by more data than the methodologies could cope with.
Some may even claim that it has not yet come to terms with this abundance.
Certain classes of data are still scarce, and this is not likely to change in the very near future.
Anything that is not available in electronic and alphanumeric form has still to undergo skilled and expensive processing at the input stage. The sound wave is still not amenable to automatic linguistic interpretation despite some successes in the field of speech recognition.
Handwritten material has to be transcribed, and a lot of older printed material resists the best scanners. Meanwhile advances in graphics and the emergence of animated text and mixed media communication have set new descriptive goals for which linguistics was ill-prepared.
The large corpora of today often privilege material from an essentially unlimited source: journalism. This feature maintains the controversies about ‘balance’ and ‘representativeness’ which have been important issues since computer typesetting became almost universal. There is a clear risk that some features presented as characteristic of a language are actually characteristic mainly of its journalism. More recently the growth of electronic communication has given rise to several new and equally abundant sources, notably web pages, e-mail and blogging. All of these are uncharted territories whose communicative properties are, at the time of writing, largely unknown.
To conclude this section we could say that, in a rough-and-ready way, the relatively brief progress of electronic corpus building and availability can be seen as falling into three stages, or ‘generations’ (Tognini Bonelli and Sinclair 2006: 208):
(a) The first twenty years, c. 1960–80; learning how to build and maintain corpora of up to a million words; no material available in electronic form, so everything has to be transliterated on a keyboard.
(b) The second twenty years, 1980–2000; divisible into two decades:
(i) The eighties, the decade of the scanner, where with even the early scanners a target of twenty million words becomes realistic.
(ii) The nineties, the First Serendipity, when text becomes available as the by-product of computer typesetting, allowing another order of magnitude to the target size of corpora.
(c) The new millennium, and the Second Serendipity, when text that never had existence as hard copy becomes available in unlimited quantities from the internet.
3. A quantitative and a qualitative revolution
The technological advances outlined above are strictly interwoven with the emergence of corpus linguistics as a discipline, and the progressive penetration of the computer in corpus linguistics work needs to be considered further. In the first stage the computer was seen simply as a tool: it was used to process, in real time, a quantity of information that could hardly be envisaged a few years ago. This is still the most impressive contribution of corpora to language research. But in changing the dimension of evidence, the computer reached a second stage of penetration into linguistic work: not only was it providing an abundance of new evidence, it was by its nature affecting the methodological frame of enquiry by speeding it up, systematising it, and making it applicable in real time to ever larger amounts of data. So Leech (1992) saw that there was now a distinctive methodology associated with corpus work; while a corpus was little more than a big collection of evidence, one approached it in a different way from the perusal of separate texts. The means of retrieving information were getting more and more sophisticated, and the results required more and more skilled interpretation.
Leech (ibid.) drew a clear distinction between corpus linguistics and such varieties as sociolinguistics and psycholinguistics, which, although manifestly hybrid, were regarded as disciplines in their own right; the corpus advanced the methodology but did not change the categorial map drawn by linguistic theory. For many linguists, that is the limit of the changes brought about by the addition of corpora to the computational toolkit.
The computer as a tool was not expected to upturn the theoretical assumptions behind the original enquiries themselves, and so no such effect was expected. It was not even felt necessary to look out for such fundamental changes.
However, we can now show that a further development has taken place in the 1990s, a third stage of penetration. What started as a methodological enhancement but included a quantitative explosion (I am referring here to the quantity of data processed thanks to the aid of the computer) has turned out to be a theoretical and qualitative revolution in that it has offered insights into the language that have shaken the underlying assumptions behind many well-established theoretical positions in the field.
Writing a little after Leech, Halliday foresaw the signs of a qualitative change in the results of the quantitative studies opened up by corpus research. He warned that not only language but semiotic systems in general would be affected by this new proximity of theory and data (Halliday and James 1993: 1–25). This is clearly a stage beyond methodology.
Others expressed similar points of view around this time; one theme concerned the effect of increasing the arithmetic power under the command of the linguist. Clear (1993) pointed out the connection between the use of computational, and consequently algorithmic and statistical, methods on the one hand and the qualitative change in the observations. Not only could language researchers speed up the process of analysis, they could carry out procedures which were just not feasible before computers became available. The difference of scale led to a qualitative difference in the observations. It is strange to imagine that just more data and better counting could trigger philosophical repositionings, but that indeed is what has happened.
Saussure’s famous words, ‘c’est le point de vue qui crée l’objet’ (it is the viewpoint which creates the object), can be reinterpreted in this turn of events: if the dimensions of the viewpoint change as they did, or the granularity of the research results, the object created is substantially different from before. What we have witnessed in the development of corpus linguistics as a discipline is that our chosen methodological standpoint has progressively determined both the object and the aim of the enquiry. In other words, in this instance, the methodology has ended up defining the domain of the discipline.
Given these premises, we should note a few points. Linguistic data are now available in such large quantities that patterns emerge that could not be seen before. In the debate centred around the issues arising in corpus linguistics there is a lot of talk of ‘the web as a corpus’ and the explosion of information that affects corpus building. The change in the quality of evidence is now obvious to most scholars, and observations about instances of language use systematically affect the statements about the language system in general.
The problem for the linguist has shifted from accessing large enough quantities of data to elaborating a reliable methodology to describe and take into account this type of unprecedented evidence. This is what a scholar like Sinclair observed (Sinclair 1991: 1 and ff.) and much of his theoretical work on the definition of units of meaning for language description amply proves this point.
Halliday’s point about the converging paths of theory and data raises question marks around some of the most familiar dichotomies of modern linguistics, in particular ‘competence’ and ‘performance’. Although this separation between linguistic theory and linguistic evidence could be seen as a methodological convenience, it certainly cuts right across the descriptive framework which is necessary for deriving linguistic information from corpora. While before corpora came along such buffers were needed because no way could be envisaged of accounting directly for all the evidence, corpus work offers no reason or motivation for selecting some evidence and ignoring the rest. The theoretical statement derived from corpus evidence, especially nowadays when large corpora are at everybody’s disposal, has to start from new presuppositions. This point has been discussed in detail in Tognini Bonelli (2001) and was at the basis of the distinction between corpus-based and corpus-driven linguistics.
4. The theoretical shift from text-linguistics to corpus linguistics
There is another point that is worth noting. Given that a corpus is a collection of texts, the aim of corpus linguistics has rightly been seen as the analysis and description of language use, as realised in text(s). Corpus linguistics started from the same premises as text-linguistics in that texts were assumed to be the main vehicle for the creation of meaning.
The question that arose, however, was: could corpus evidence be evaluated in the same way as a text was evaluated? This issue is in no way resolved. Different scholars continue to approach it in different ways and there are still those who advocate that, in order to understand and evaluate corpus data better, the analyst has to have direct and full access to the individual texts at any point in time. Most scholars, however, now accept that, in spite of the initial starting point which corpus and text share, the two approaches are fundamentally and qualitatively different from several points of view (summarised in Table 2.1).
Working within the Firthian framework of a contextual theory of meaning, the text has been seen in a unique communicative context as a single, unified language event mediated between two (sets of) participants. The switch in focus from a text-linguistic perspective to a corpus-linguistics one has brought about a different approach. The corpus is not ‘just like a text, only more of it’. It brings together many different texts and therefore cannot be identified with a unique and coherent communicative event; the citations in a corpus – expandable from the Key Word in Context (KWIC) format to include n number of words – remain fragments of texts and lose out on the integrity of the text (for a detailed treatment of KWIC searches, see Tribble, this volume). The significant elements in a corpus become the patterns of repetition and patterns of co-selection.
In other words, in corpus linguistics it is the frequency of occurrence that takes pride of place.
This difference entails a different ‘reading’ of a corpus compared to one of a text (Tognini Bonelli 2001): the text is to be read horizontally, from left to right in the case of English and other western languages, paying attention to the boundaries between larger units such as clauses, sentences and paragraphs, possible markers of the macro-structure. A corpus, on the other hand, examined at first in KWIC format with the node word aligned in the centre, is read vertically, scanning for the repeated patterns present in the context of the node.
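The ‘vertical’ reading of a KWIC display can be illustrated with a minimal sketch in Python. It is offered only as an illustration under stated assumptions, not as the procedure used by any particular concordancer: the function names (tokenize, kwic, format_kwic, collocates), the crude tokeniser, the three invented sample sentences and the fixed span of three words on either side of the node are all hypothetical choices made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokeniser (an assumption): lower-case runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def kwic(texts, node, span=3):
    """Collect (left context, node, right context) for every occurrence of
    the node word across all texts, ignoring text boundaries."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == node:
                left = tokens[max(0, i - span):i]
                right = tokens[i + 1:i + 1 + span]
                lines.append((left, tok, right))
    return lines


def format_kwic(lines, width=30):
    # Right-align the left context so that the node forms a central column.
    rows = []
    for left, node, right in lines:
        rows.append(f"{' '.join(left):>{width}}  {node}  {' '.join(right)}")
    return rows


def collocates(lines):
    # Count the words that recur in the contexts of the node:
    # the patterns of repetition that a vertical reading scans for.
    counts = Counter()
    for left, _, right in lines:
        counts.update(left)
        counts.update(right)
    return counts


# Invented sample 'corpus' of three short texts (hypothetical data).
sample_texts = [
    "The corpus is read vertically for repeated patterns.",
    "A text is read horizontally from left to right.",
    "Each citation is read in the context of the node.",
]

concordance = kwic(sample_texts, "read", span=3)
for row in format_kwic(concordance):
    print(row)
print(collocates(concordance).most_common(3))
```

Each printed row is a fragment cut from a different text, so no row can be read as a coherent communicative event; what the display makes visible is the column of the node and the words that recur around it, which is where frequency of occurrence enters the description.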
Furthermore, the text has a function which is realised in a verbal context, but also extends to a situational and a wider cultural context. It is interpreted by looking at the functions it has as a communicative event. The corpus, on the other hand, does not have a unique function, apart from the one of being a sample of the language gathered for linguistic analysis; the parameters for corpus analysis are above all formal.