2. CASE STUDY
1.1. Conceptual and epistemological framework of the thesis
3.2.1. Mono-modal corpus design
‘Time and fiscal constraints, as well as the traditions of different research communities make it impossible to adopt a single standard for all corpora’ (Strassel and Cole, 2006: 2; a fact also explored by Lapadat and Lindsay, 1999). Therefore, current mono-modal corpora, like the developing 4th generation MM corpora, are bespoke insofar as they are commonly designed and constructed in ‘light of the investigator’s goals’ (Cameron, 2001: 29; also see Lapadat and Lindsay, 1999; O’Connell and Kowal, 1999: 112; Reppen and Simpson, 2002: 93 and Roberts, 2006), in order to meet a given research need and/or to allow users to focus on specific features of spoken or written language.
Despite this, since corpus construction is generally motivated by the aim of representing an ‘authentic’ sample of language, the ‘unambiguous, rigorous, consistent and well-documented practices [involved] in data development’ (Wynne, 2005) are of fundamental concern when designing corpora. Although such practices are to a certain extent locally determined (Conrad, 2002: 77), Sinclair offers suggestions for ‘good practice’ that provide general benchmarks for all corpora (2005; see Wynne, 2005 for similar prescriptions, and note 15). Although these were designed with 3rd generation corpora in mind, they are also relevant for 4th generation corpora, and serve as a good starting point for discussions of MM corpus development. They are as follows:
1. The contents of a corpus should be selected without regard for the language they contain, but according to their communicative function in the community in which they arise.
2. Corpus builders should strive to make their corpus as representative as possible of the language from which it is chosen.
3. Only those components of corpora which have been designed to be independently contrastive should be contrasted.
Note 15: Exhaustive standards for the construction of spoken corpora specifically have also been developed by EAGLES (Expert Advisory Groups on Language Engineering Standards); refer to the following website for further details: http://www.spectrum.uni-
4. Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.
5. Any information about a text other than the alphanumeric string of its words and punctuation should be stored separately from the plain text and merged when required in applications.
6. Samples of language for a corpus should, wherever possible, consist of entire documents or transcriptions of complete speech events, or should get as close to this target as possible. This means that samples will differ substantially in size.
7. The design and composition of a corpus should be fully documented with information about the contents and arguments in justification of the decisions taken.
8. The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.
9. Any control of subject matter in a corpus should be imposed by the use of external, and not internal, criteria.
10. A corpus should aim for homogeneity in its components while maintaining adequate coverage, and rogue texts should be avoided.
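Suggestion 5 in the list above, that information about a text be stored apart from the plain text and merged only when needed, can be illustrated with a minimal sketch. The `Annotation` class, the `merge` function and the `backchannel` label are hypothetical names introduced here for illustration only; they are not drawn from Sinclair (2005) or from any particular corpus tool, and the sketch assumes that annotated spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A hypothetical stand-off annotation pointing into the plain text."""
    start: int   # character offset where the span begins (inclusive)
    end: int     # character offset where the span ends (exclusive)
    label: str   # illustrative tag, e.g. a discourse or gesture code


def merge(text: str, annotations: list) -> str:
    """Build a tagged view of the text; the plain text is left untouched.

    Assumes the annotated spans do not overlap.
    """
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        parts.append(text[cursor:ann.start])            # untagged stretch
        parts.append(f"<{ann.label}>{text[ann.start:ann.end]}</{ann.label}>")
        cursor = ann.end
    parts.append(text[cursor:])                         # remainder of the text
    return "".join(parts)


# The plain text and its annotations are kept as two separate objects.
plain_text = "yeah I think so"
annotations = [Annotation(0, 4, "backchannel")]

# Merging happens only when an application requires the combined view.
merged = merge(plain_text, annotations)
```

Because the annotations refer to the text by offset rather than being embedded in it, the same plain text can carry several independent layers of description, which is one reason the principle transfers readily to MM corpora.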
It is important to acknowledge that the above suggestions are theoretically idealistic. ‘Since language text is a population without limits, and a corpus is necessarily finite at any one point; a corpus, no matter how big, is not guaranteed to exemplify all the patterns of the language in roughly their normal proportions’ (Sinclair, 2008: 30). Corpora are necessarily ‘partial’: it is impossible to include everything in a corpus, because the methodological and practical processes of recording and documenting natural language are selective and therefore ‘incomplete’ (Thompson, 2005, see also Ochs, 1979; Kendon, 1982: 478-9 and Cameron, 2001: 71). This is true irrespective of whether a corpus is specialist or more general in nature.
Given this selectivity, requirements such as representativeness, balance and homogeneity (see suggestions 8 and 10, also see Biber, 1993) can be difficult to uphold meticulously. This problem is intensified by the fact that these notions are relative, abstract concepts that are open to wide interpretation. A corpus that is sufficiently ‘balanced’ to achieve the aims of a particular corpus developer, or to allow for a specific line of research, may not be adequate for other users or lines of linguistic enquiry. Nevertheless, ‘we use corpora in full awareness of their possible shortcomings’ (Sinclair, 2008: 30) because no better resource exists for the analysis of real-life language-in-use than a corpus, nor are there better strategies for exploring such language than current CL methodologies.
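The relativity of ‘balance’ can be made concrete with a small sketch that reports the share of tokens each category contributes to a corpus. The function name `category_shares`, the category labels and the token counts are all invented for illustration and describe no real corpus; the sketch measures only the composition of a sample, and whether that composition counts as balanced remains a judgement tied to the research aim.

```python
from collections import Counter


def category_shares(samples):
    """Return each category's share of the total token count.

    `samples` is an iterable of (category, token_count) pairs.
    """
    totals = Counter()
    for category, tokens in samples:
        totals[category] += tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # an empty corpus has no proportions to report
    return {category: n / grand_total for category, n in totals.items()}


# Invented figures for a hypothetical spoken corpus.
samples = [
    ("conversation", 4000),
    ("conversation", 2000),
    ("lecture", 3000),
    ("interview", 1000),
]
shares = category_shares(samples)
```

Here conversation accounts for 60% of the tokens. A study of casual talk might regard that distribution as appropriate, whereas a study of academic speech might regard the same corpus as skewed.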
3.2.2. A new design methodology for 4th generation corpora
Despite the potential for variety in the specific approaches used, when collecting and assembling naturally occurring qualitative data, in linguistics and beyond, there are essentially four fundamental processes which need to be considered. These are outlined below (for similar models consult Psathas and Anderson, 1990; Leech et al., 1995; Lapadat and Lindsay, 1999; De Ruiter et al., 2003; Thompson, 2005 and Knight et al., 2006):
1. Recording.
2. Transcribing.
3. Coding and mark-up.
4. Applying and presenting data.
Although these processes are portrayed in a list-like format, it is appropriate to think of each as operating as part of a complete research system, rather than as stages that are temporally ordered and distinct. Each stage is therefore best conceptualised as interacting with, and influencing, the next. Just how the stages interact, however, depends on the specific approaches and methods adopted as part of each stage. Again, since corpus construction is driven by the specific ‘investigator’s goals’ (Cameron, 2001: 29), the actual methods used at each of these stages are highly variable.
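The interdependence of the four processes can be sketched, under simplifying assumptions, as a single record on which each stage operates. The class `CorpusItem`, the functions `transcribe`, `code` and `present`, and the identifier `session-01` are hypothetical and stand for no existing tool. The sketch assumes that codes are stored as character offsets into the transcript, so that revising the transcript invalidates earlier coding.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusItem:
    """One recorded speech event moving through the four processes."""
    recording_id: str                           # stage 1: the recording
    transcript: str = ""                        # stage 2: transcription
    codes: list = field(default_factory=list)   # stage 3: (start, end, label)


def transcribe(item: CorpusItem, text: str) -> CorpusItem:
    """Attach or revise the transcript of a recording."""
    item.transcript = text
    # Codes are offsets into the transcript, so a revised transcript
    # makes any existing codes unreliable; they are discarded here.
    item.codes.clear()
    return item


def code(item: CorpusItem, start: int, end: int, label: str) -> CorpusItem:
    """Add a code to a span of the transcript."""
    if not item.transcript:
        raise ValueError("coding requires a transcript")
    if not 0 <= start < end <= len(item.transcript):
        raise ValueError("span lies outside the transcript")
    item.codes.append((start, end, label))
    return item


def present(item: CorpusItem) -> list:
    """Stage 4: return each coded stretch of talk with its label."""
    return [(item.transcript[s:e], label) for s, e, label in item.codes]


item = CorpusItem("session-01")
transcribe(item, "mm yes exactly")
code(item, 0, 2, "backchannel")
view = present(item)
```

Coding cannot begin without a transcript, and re-transcribing discards the existing codes. Decisions taken at one stage therefore constrain, and may force a return to, the others.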
Accordingly, although the following sections aim to provide a general overview of some of the typical conventions and strategies used for corpus construction, this is not, in any way, a definitive account of possible procedures. Instead it outlines some of the choices and challenges faced by corpus linguists developing MM corpora, in order to propose guidelines of good practice for such development. In the remainder of this chapter, the stages of recording, transcribing, coding and presentation are tackled in turn; this ordering, however, is simply a means of giving the discussion a coherent structure. Consequently, the interoperability of these phases is re-addressed throughout each section.