The CC has been designed to contribute to the diachronic study of English at several linguistic levels. The main idea behind this corpus is to study the scientific register in English. The project aims at complementing other corpora devoted to the history of English for specific purposes, such as the well-known Corpus of Early English Correspondence63 (Nevalainen et al., 1998), the Corpus of Early English Medical
61 See section 3.1.2 for a more complete description of CETA.
62 See section 3.1.3 for a description of CCT.
63 For further information and access to the corpus, visit http://www.helsinki.fi/varieng/domains/CEEC.html (Retrieved November 20, 2012)
Writing64, the Lampeter Corpus of Early Modern English Tracts65 (Schmied, 1994), the Helsinki Corpus of English Texts (Kytö, 2012; Rissanen et al., 1991), and ARCHER (A Representative Corpus of Historical English Registers) (Biber et al., 1994).
Text selection was made according to the UNESCO classification of science and technology (1988). The disciplines chosen for the CC, which so far belong to the groups of natural sciences, social sciences and humanities, are marked in bold type in the following table:
Table 3: Fields of science and technology (UNESCO, 1978).
64 The Corpus of Early English Medical Writing is composed of several chronologically arranged subcorpora: Middle English Medical Texts (MEMT) (Taavitsainen, 2005), Early Modern English Medical Texts (EMEMT) (Taavitsainen, 2010) and Late Modern English Medical Texts (LMEMT) (in preparation).
Further information is available at http://www.helsinki.fi/varieng/CoRD/corpora/CEEM/ (Retrieved November 20, 2012)
65 Further information can be accessed at http://khnt.hit.uib.no/icame/manuals/LAMPETER/LAMPHOME.HTM (Retrieved November 20, 2012)
The decision not to compile any text from the group of medical sciences was, according to the compilers of the corpus, a deliberate choice so as not to overlap with the Helsinki Corpus, a similar diachronic corpus of texts specialized in medicine. Admittedly, the UNESCO classification of science and technology (1988) was aimed at modern science, and not all the texts compiled for the CC can be ascribed to a single category without serious doubts. The idea of science, like its language, has evolved over time, and what was considered properly scientific in the eighteenth century may seem odd to us today (and vice versa). However, the need for a clear-cut division and the lack of a standardized classification of science from the period covered in the corpus resulted in the adoption of the UNESCO parameters, as stated by the compilers of the corpus (Moskowich, 2012, p. 38).
The time-span of the texts compiled in the CC, 1700-1900, was chosen according to extralinguistic considerations (Moskowich, 2012, p. 47). The starting point of the corpus, 1700, coincides with a revolution in old epistemological patterns (Taavitsainen & Pahta, 1997). By 1700 the Scientific Revolution could already be considered an established phenomenon. The Royal Society had been running for 40 years and Isaac Newton (1643-1727) was already a senior scientist. Additionally, the beginning of the eighteenth century coincided with the start of the Enlightenment. As discussed in chapter one, the decline of the influence of religion over society, together with the shift of scientific interest from deduction to induction, gave rise to a new form of science that favored the emergence of a new type of language for its dissemination (Swales, 1990). Several discoveries that took place around the end of the nineteenth century, such as J. J. Thomson's discovery of the electron, Planck's announcement of quantum mechanics in 1896 and Einstein's first formulation of the theory of relativity in 1905, triggered a crisis in the foundations of mechanical physics and serve as a good end-point for the corpus (Moskowich-Spiegel & Crespo, 2007, p. 348; Moskowich, 2012, p. 48).
The main aim of corpus design was to “construct smaller samples of the variety” (McEnery & Wilson, 1996, p. 21) to be studied, as analyzing every instance of language would be practically impossible. Accordingly, apart from the external criteria used to delimit the dates, the formal features of the corpus include uniform sampling techniques, a uniform number of words per sample, and a similar treatment of all texts. The CC contains two texts per decade and discipline. The general aim was to include one text from the beginning of each decade and one from the end. Samples contain around 10,000 words each, excluding figures, tables, formulae and graphs, which amounts to ca. 200,000 words per century and discipline. Although scholars in this field (Biber, 1993) have claimed that 1,000-word samples are long enough for the study of variation within the scientific register, the compilers of the corpus (Moskowich, 2012, p. 39) have justified their choice of larger samples on historical grounds: the time-span of CETA covers a period in which the standardization of the English scientific register was largely an ongoing process. Texts from this period present more variation and, hence, more words are needed to observe recurring structures and patterns, as well as emerging standards of writing.
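The sampling arithmetic described above can be checked with a short sketch. The constant and function names below are illustrative assumptions, not part of the corpus documentation, and the figures are nominal, since actual samples only approximate 10,000 words.

```python
# Nominal sampling design of the CC as described in the text.
# All names are hypothetical; sample sizes are approximate in practice.
TEXTS_PER_DECADE = 2        # one early-decade and one late-decade text
WORDS_PER_SAMPLE = 10_000   # approximate, excluding figures, tables, formulae, graphs
DECADES_PER_CENTURY = 10
CENTURIES_COVERED = 2       # 1700-1900


def words_per_century(texts_per_decade=TEXTS_PER_DECADE,
                      words_per_sample=WORDS_PER_SAMPLE):
    """Nominal word count for one discipline over one century."""
    return texts_per_decade * DECADES_PER_CENTURY * words_per_sample


def words_per_discipline():
    """Nominal word count for one discipline over the whole time-span."""
    return words_per_century() * CENTURIES_COVERED


if __name__ == "__main__":
    print(words_per_century())      # nominal words per century and discipline
    print(words_per_discipline())   # nominal words per discipline, 1700-1900
```

Under these assumptions, two samples per decade over ten decades yield the ca. 200,000 words per century and discipline mentioned in the text.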
Issues of representativeness (Camiña, 2012, p. 96; Crespo & Moskowich, 2009; Lareo & Estévez, 2008, p. 70; Lareo & Montoya, 2008, p. 140; Moskowich-Spiegel & Crespo, 2007, p. 349) have been taken into account and include the use of first editions only and a balance in text-types and in the gender and origin of authors. First editions of texts written in English by English-speaking authors have been preferred. Similarly, no more than one text by the same writer has been included, to avoid personal idiosyncrasies.
After text selection, the texts were edited according to the TEI (Text Encoding Initiative) standards, the most widely used encoding scheme for digital texts in the Humanities. Once edited, texts were saved in XML format. The corpus also contains extralinguistic information about the authors in a series of XML files called metadata, which can be consulted from the CCT interface. This information is especially useful for studying extralinguistic variables: the CC records the place of education and the sex of the authors. The audience, either real or intended, delimits the parameter of genre/text-type66.
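As an illustration of how such XML metadata might be queried outside the CCT interface, the following is a minimal sketch using Python's standard library. The element and attribute names (`metadata`, `author`, `placeOfEducation`, `text`, `sex`, `year`, `genre`) and the placeholder values are assumptions made for this example only; they do not reproduce the actual metadata schema of the CC.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the schema and values are placeholders,
# not the real structure of the Coruña Corpus metadata files.
SAMPLE_RECORD = """<metadata>
  <author sex="female">
    <name>Example Author</name>
    <placeOfEducation>Example Place</placeOfEducation>
  </author>
  <text year="1750" genre="treatise"/>
</metadata>"""


def read_record(xml_string):
    """Extract the extralinguistic variables from one metadata record."""
    root = ET.fromstring(xml_string)
    author = root.find("author")
    text = root.find("text")
    return {
        "sex": author.get("sex"),
        "place_of_education": author.findtext("placeOfEducation"),
        "year": int(text.get("year")),
        "genre": text.get("genre"),
    }


if __name__ == "__main__":
    print(read_record(SAMPLE_RECORD))
```

Records read in this way could then be grouped by sex, place of education or genre/text-type to study the extralinguistic variables mentioned above.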
Figure 8: Subcorpora in the Coruña Corpus.
Several subcorpora were designed within the CC. Of all these subcorpora, only CETA has been published so far. CEPHiT, CELiST and CHET are ready for publication
66 In this study, genre and text-type are understood as in Taavitsainen (2001), who defined genre as a “mental frame in people’s minds which gets realized in texts for a certain purpose in a certain cultural context” (2001, p. 140) and considered text-types to be a linguistic realization of genres.
and the rest are still under compilation. The next section is concerned with a description of CETA.