3. DEVELOPMENT OF MATERIALS
3.5 METHODOLOGY AND
The main distinguishing feature of this typology is the semantic codification of entities as nouns. The nominalizations in this group can be labeled as terms because they are totally reified. Terms are cognitive devices we create and use to study reality by establishing a set of differences and frontiers (Calvin, 1996; Eckardt, 1993; Lakoff & Johnson, 1980; Thagard, 1996). They are especially useful in scientific disciplines because they provide semantic traces of entity to both processes and entities themselves.
The extensive use of terminology is, in fact, one of the defining features of the scientific register in any language. Thus, in (40)
(40) But independent of theſe conſiderations, this rude ſyſtem was ſoon found incapable of ſtanding the teſt of obſervation and experiment (Bonnycastle 1786, p. 59; emphasis added).
the turning of a verb into a noun helps readers identify the processes and events under study. Both observation and experiment are reified nominalizations that function as signposts drawing attention to the process. Information about agents is irrelevant here, since the main objective is to present the processes as things. Cognitively, this is similar to providing an index at the end of a book to facilitate quick searches, or to giving a title to a book or a chapter; it is linked not only to the reification of science but also to the organization of information in our minds.
Structurally, term nominalizations have usually undergone both a valency reduction (Mackenzie, 1985) and a substantivization process (Malchukov, 2006), so they are usually identified by the lack of a semantic relationship between their modifiers and the agents, participants and circumstances of the process.
According to Malchukov (2006, p. 976), these are clear examples of “strong nominalizations”, that is, nominalizations characterized by a lack of verbal properties and a total recategorization as nouns. Concerning premodification, the most usual determiners are articles, whereas postmodification is not common. Pluralization is also common in this typology. Syntactically, they can function in any position, but they are the only type found in titles, given their extremely concise, reified nature.
This chapter has covered the most important features of the structure and function of nominalizations, not only in the scientific register but also in general language.
At the morphosyntactic level, the only feature that all schools have highlighted is their ability to fill nominal positions. Their structure, origin and semantics have been continuously debated. Of all the theories explained, I would highlight the fact that nominalizations have a particular way of expressing information about processes.
Consequently, the inclusion of optional modifiers maximizes the function of nominalizations as focalizers of information and opens up a wide range of functional implications: nominalizations act not only as discourse organizers but also as assimilation facilitators, which reinforces their value as tools for knowledge transmission. The typology presented in this study responds to structural and functional premises, but it also takes into account extralinguistic factors, such as the stylistic concerns motivated by the new linguistic practices of a new discourse community. Now that the theoretical framework used for this study has been established, the next chapter turns to the corpus of study and the methodology.
This chapter presents a description of the corpus of texts used for the analysis as well as the methodology applied to it. Section 3.1 is concerned with the main work tools for this study, that is, the corpus of texts and the search engine used. The description of the corpus is approached from two different angles: section 3.1.1 provides a general description of the corpus, explaining issues such as size and textual categorization, as well as the sex, occupation and provenance of the authors. Section 3.1.2 explores in detail the parts of the CETA, the subcorpus chosen for this study. Apart from the general features of this subcorpus (section 3.1.2.1), an account of its different parts is provided, which also includes information about metadata files and prologues. The final part of this section briefly explains the treatment of texts in the corpus. After the presentation of the corpus, section 3.1.3 describes the Coruña Corpus Tool (CCT henceforth), the search engine used to retrieve information from the corpus in this study.
The methodology is presented in section 3.2. Both the process of disambiguation and the creation of the database used for the analysis are explained. Additionally, I will set out the variables of study together with the expected results.
3.1. Work tools: Coruña Corpus, CETA and the Coruña Corpus Tool
Hickey (2003) attributed the emergence of diachronic corpora to the merging of English historical linguistics with corpus linguistics, a discipline that came into vogue after the initial hostility of generativists and the enormous advances in computer science. Electronic corpora and the disciplines of computational and corpus linguistics can be considered the turning point of linguistics in the last decades of the twentieth century, and their impact on the study of language can be compared to that of structuralism at the beginning of the century or the rise of generativism in the 1950s (Crystal, 1992, p. 85). Corpus linguistics deals with the principles and practice of using corpora in language study. Its benefits revolve around a methodological reformulation that made it possible to obtain data more quickly and more reliably (Taavitsainen, 2005). This revolution in method resulted in a shift of interest from random to central linguistic features, backed up by frequency counts, which has led to the discipline of quantitative linguistics. The main criticism levelled at corpus linguistics concerned the skewness of corpora: “any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list” (Chomsky, 1965, p. 159). The debate over competence and performance continued for decades, until advances in computer science made it possible to compile massive corpora, which rendered Chomsky's concerns about language underrepresentation obsolete.
According to Tognini-Bonelli (2001, p. 2), “a corpus can be defined as a collection of texts assumed to be representative of a given language put together so that it can be used for linguistic analysis.” There is no unanimity about the exact definition or the specific features of a corpus. Some scholars appeal to the etymological origin of the word corpus (Latin for 'body') to extend the notion of corpora to any collection of texts. Hence, McEnery and Wilson (1996, p. 21) claimed that “in principle, any collection of more than one text can be called a corpus.” Following this statement, Kilgarriff and Grefenstette (2003) studied the possibility of considering the World Wide Web as a corpus. This line of thought, shared by many (Ghani, Jones & Mladenić, 2003; Robb, 2003; Sharoff, 2006), resulted in the creation of powerful automatic, web-based corpora, such as BootCaT57 (Baroni & Bernardini, 2004) and Sketch Engine58 (Kilgarriff, Rychlý, Smrž & Tugwell, 2004). Although these corpora can be really useful, scholars generally agree that corpora should have explicit design criteria (Baker, 2002; Biber, 1993; Hickey, 2003; McEnery & Wilson, 1996; Oostdijk, 1991; Taavitsainen, 2005; Tognini-Bonelli, 2001). This could be taken as the main difference between raw corpora and annotated corpora (Taavitsainen, 2005, p. 326). McEnery and Wilson (1996, p. 21), who are usually taken as the reference for the description of corpus design, highlighted four main specific features of corpora, namely sampling and representativeness, finite size, machine-readable form and standard reference, which will be analyzed in the next section.
57 http://bootcat.sslmit.unibo.it/?section=hom (Retrieved October 19, 2012).
58 http://www.sketchengine.co.uk/ (Retrieved October 19, 2012).
The Coruña Corpus, A Collection of Samples for the Historical Study of English Scientific Writing59 complies with all the corpus design specifications proposed by McEnery and Wilson (1996). It is a closed corpus60 with a finite size of around 400,000 words in each subcorpus, and it is machine-accessible thanks to a search tool (CCT) that has been designed for joint use with it. The last prerequisite established by McEnery and Wilson (1996), standard reference, is also fulfilled, as the first of the subcorpora was released as a publication (Moskowich, Lareo, Camiña-Rioboo & Crespo, 2012), making it available to other researchers worldwide.
Concerning Hickey's (2003, p. 4) suggestions about how to build a corpus, the CC complies with all the requisites, as it is an untagged closed corpus presented in separate files. Additionally, normalization, one of Hickey's concerns, has been carefully planned: different spellings were checked against the OED and normalized where applicable.
59 The CC is part of an ongoing project carried out by MuStE (research group for Multidimensional Corpus-based Studies in English) at the University of A Coruña. The main area of study of this group falls within the category of language variation and the history of the English language, and the methodology common to all members combines traditional philological knowledge with new technologies.
From 2003 to 2010 the group received funding to carry out the compilation of this project and, although the compilation of some of the subcorpora is still ongoing, the CETA subcorpus (the one used for this study) has already been published and others are ready for publication. In order to give coherence to the project and to widen the scope of the studies related to the corpus, the interests of the group have spread to cover other fields of knowledge. Thus, as the result of a collaboration with the Information Retrieval Lab team at the Department of Computer Science at the University of A Coruña, a tool for retrieving information from the corpus (CCT) was designed. In addition, to better understand the scientific discourse produced by women, some of the members of the group are currently working on a project about women scientists from 1700 to 1930. The main aim of this project is to raise awareness of the contribution made by women to the field of science, not only as writers but also as assistants, editors, translators, illustrators and collectors, which were in many cases the only professions open to them (Crespo, Puente, Bello & Lojo, 2012).
60 McEnery & Wilson (1996) acknowledged that a corpus can be either open or closed. Open corpora, also called monitor corpora, are open entities to which updates are progressively applied, whereas closed corpora have a finite size.
Still under compilation, the CC will be made up of several subcorpora containing samples from different disciplines according to the UNESCO classification of science and technology (1988). For this study I have selected one discipline from the field of exact and natural sciences, astronomy (CETA61 subcorpus). All subcorpora in the CC share a common structure to facilitate contrastive studies across them. To make the corpus easier to exploit, an information retrieval tool, the Coruña Corpus Tool (CCT), was developed. This tool has been designed specifically for the CC by the IRLab (Information Retrieval Laboratory) at the University of A Coruña. The CCT enables the retrieval of morphemes, words or sets of words from the texts, which greatly facilitates the study62.
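The kind of retrieval described here, searching text samples for a morpheme and returning each hit with its surrounding context, can be illustrated with a minimal Python sketch. It is not the CCT's implementation, which this section does not describe. The function names, the suffix list and the context width are all assumptions made for illustration, and replacing the long s (ſ) is only a simple stand-in for the OED-based normalization applied to the corpus.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical suffix list, chosen only for illustration; it is not
# the set of search criteria used in this study.
NOMINAL_SUFFIXES: Tuple[str, ...] = ("tion", "tions", "ment", "ments")


def normalize(text: str) -> str:
    """Replace the long s (U+017F) with a modern s so that searches
    also match historical spellings such as 'obſervation'."""
    return text.replace("\u017f", "s")


def find_by_suffix(
    samples: Dict[str, str],
    suffixes: Tuple[str, ...] = NOMINAL_SUFFIXES,
    width: int = 30,
) -> List[Tuple[str, str, str, str]]:
    """Return (sample name, left context, word, right context) for
    every word that ends in one of the given suffixes."""
    alternatives = "|".join(re.escape(s) for s in suffixes)
    pattern = re.compile(r"\b\w+(?:" + alternatives + r")\b", re.IGNORECASE)
    hits = []
    for name, raw in samples.items():
        text = normalize(raw)
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            hits.append((name, left, match.group(0), right))
    return hits


# Usage: the sentence quoted in example (40), keyed by an arbitrary label.
sample = {
    "Bonnycastle 1786": (
        "But independent of the\u017fe con\u017fiderations, this rude "
        "\u017fy\u017ftem was \u017foon found incapable of \u017ftanding "
        "the te\u017ft of ob\u017fervation and experiment"
    )
}
for name, left, word, right in find_by_suffix(sample):
    print(f"{name}: {left}[{word}]{right}")
```

A suffix search of this kind overgenerates, since it also returns words that merely end in the same letters (nation, moment), which is one reason why a manual disambiguation stage is needed after retrieval.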