5. CERESOS EVALUATED IN BAJA CALIFORNIA
5.6 CERESO EL HONGO III
As we have seen, much of the existing literature on vocabulary has grown out of the study of written texts and the study of relatively minimal conversational exchanges. Most researchers seem to have concentrated on lexical repetition in spoken language (for example, Persson, 1994, who uses spoken and written data; Tannen, 1989; and, most notably, Bublitz, 1989, who looks at various functions of repetition in spoken data). There has also been a limited amount of discussion of formality in vocabulary choice in spoken language (Powell, 1992).
There are obvious historical reasons why spoken vocabulary has been under-researched: lack of good spoken corpora, the frustrating inability of analytical computer software to cope well with the ‘messiness’ of spoken transcripts, and, above all, the immense effort and resources required to collect spoken data compared with the ease (nowadays) of optically scanning large amounts of written text into databases, which offer access to hundreds of millions of running words (see Section 6.6). Thus it is the written word which has dominated our view not only of which words are the most important ones, but also of how words are used in acts of communication.
One of the most obviously useful types of output from computerized corpora is the frequency list. Frequency lists for everyday spoken language differ significantly from those dependent only on written databases. The two lists in Table 4.1 are each based on samples of approximately 330,000 words of data, and reveal interesting differences. The data is from the Cambridge International Corpus (CIC).
Immediately noticeable in these lists are both the similarity of occurrence of basic function words and some interesting differences which give the spoken language some of its characteristic qualities. The written list is made up of function words (function words here include all non-lexical, that is, non-contentful items, such as pronouns, determiners, prepositions, modal verbs, auxiliary verbs, conjunctions, etc.), but the spoken list seems, at first glance, to include a number of lexical words such as know, well, got, think and right. Quite as expected, the function words dominate the top frequencies of both lists, and, indeed, one of the defining criteria of function words is their high frequency. Nonetheless, as we go down the frequency list, there is no absolute cut-off between function words and lexical words of high frequency (such as thing). Using frequency alone, without other criteria (e.g. whether the word in question belongs to an open or closed set), results in a blurred borderline between ‘grammar’ (function) and ‘vocabulary’ (lexical) words.
This is something which becomes apparent in spoken data of the kind exemplified in Table 4.2.
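The blurred borderline can be made concrete with a minimal sketch in Python. It builds a frequency list and flags each item as belonging, or not, to a closed set. The names frequency_list and FUNCTION_WORDS are illustrative assumptions, the closed-class set is deliberately tiny, and the sample utterance is invented rather than taken from the CIC.

```python
from collections import Counter

# Hypothetical, deliberately incomplete closed-class inventory (an assumption
# for illustration; a real study would need a full list).
FUNCTION_WORDS = {"the", "a", "i", "you", "it", "and", "to", "of", "in", "is", "that"}


def frequency_list(tokens, top_n=10):
    """Return (rank, word, count, is_function_word) for the top_n word-forms."""
    counts = Counter(token.lower() for token in tokens)
    return [
        (rank, word, count, word in FUNCTION_WORDS)
        for rank, (word, count) in enumerate(counts.most_common(top_n), start=1)
    ]


# Invented toy utterance, not corpus data.
sample = "well you know I think it is the thing you know I think".split()

for row in frequency_list(sample, top_n=5):
    print(row)
```

Even in so small a sample, lexical items such as know and think sit alongside you and I at the top of the list, which is why frequency on its own cannot separate the two classes.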
On closer examination, some of the lexical words which intrude into the high-frequency function-word list prove to be elements of interpersonal markers (e.g. you know and I think) or single-word organizational markers (e.g. well and right). Stenström (1990) discusses such words that seem to belong quintessentially to the spoken mode, and offers a useful set of headings for what she generally refers to as ‘discourse items’, which include apologies, smooth-overs (e.g. never mind), hedges (e.g. kind of and sort of), and a variety of other types unlikely to occur in the written mode. Well occurs approximately nine times more frequently in spoken than in written texts. The hedging-word just ranks as 33 in the spoken; in the written it ranks at 61 and is two and a half times less frequent. Other items in Table 4.1 call for closer scrutiny too. What are the commonest functions of the extremely frequent spoken uses of got? Is got used differently in the spoken and the written? Let us consider some statistics. Got occurs approximately five and a half times more frequently in our spoken sample than in the written.
Table 4.1 The fifty most frequent written (left-hand column) and spoken (right-hand column) words from 330,000 words in the Cambridge International Corpus (CIC, 1996).
Table 4.2 Total occurrences of verb-inflections of start and begin, and total occurrences of too and also in the written and spoken parts of the CIC, 1996.
By far the most frequent use of got in spoken is in the construction have got as the basic verb of possession or personal association with something.
But frequency statistics alone do not tell us everything: McCarthy and Carter (1997) comment on the colligational properties of got, observing that structures such as I’ve got so many birthdays in July and I’ve got you are typical spoken uses. In the first case the speaker is referring to the responsibility of sending birthday cards to members of the family: I’ve got seems to mean something like ‘I have to deal with’. In the second case the utterance means roughly ‘I understand you’. Neither meaning is likely to crop up in formal written texts; spoken data is likely to be the best source for such uses.
Got is not the only word to show such interesting differences in distribution and usage between written and spoken; other words display significant differences too, especially apparently synonymous everyday words such as start and begin (see Rundell, 1995), and too and also. The occurrences in the samples of written and spoken texts, from CIC and CANCODE respectively, are shown in Table 4.2.
What can be noticed here is that start seems equally at home in spoken and written discourse, but that begin is relatively rare in informal spoken discourse of the kind recorded in CANCODE (part of CIC). A very similar picture obtains with too, which occurs more or less equally in spoken and written discourse; also, by contrast, occurs less than half as often in spoken discourse as in written. In the case of begin, it is perhaps also worth noting that, in the written data, the form beginning used as a noun occurs 41 times, but in the spoken only 15 times, reflecting the tendency towards nominalization in the written mode. For further illustrations of the different distributions of a wide selection of words in spoken and written texts, see Engels (1988).
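Comparisons of the ‘n times more frequent’ kind rest on simple arithmetic, which can be sketched as follows. The function names per_million and frequency_ratio are illustrative assumptions, and the counts and sample sizes are invented placeholders chosen only to show the calculation; they are not figures from the CIC or CANCODE.

```python
def per_million(count, corpus_size):
    """Occurrences per million running words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count / corpus_size * 1_000_000


def frequency_ratio(count_a, size_a, count_b, size_b):
    """How many times more frequent an item is in sample A than in sample B."""
    return per_million(count_a, size_a) / per_million(count_b, size_b)


# Invented placeholder figures: 45 hits in a 100,000-word spoken sample,
# 10 hits in a 200,000-word written sample.
ratio = frequency_ratio(45, 100_000, 10, 200_000)
print(f"{ratio:.1f} times more frequent in the spoken sample")
```

Normalizing to a common base matters whenever the two samples differ in size; with equal-sized samples, as in Table 4.1, the raw counts can be compared directly.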
One final point that needs to be considered with regard to the ‘top 50’ spoken and written word-forms is that of how much of the total text in the corpus samples they cover. The top 50 written word-forms cover 38.8 per cent of all the text; the top 50 spoken cover 48.3 per cent, almost 10 percentage points more of the total. Schonell et al. (1956, pp. 73–4) report a similar percentage difference in coverage for their first thousand words of spoken data, as compared with coverage figures for the first thousand words of written. This would suggest that, on the face of it, the top 50 spoken words are more useful for learners wanting an emphasis on speaking skills in their learning programme, and that the view, often anecdotally expressed, that the written language is the best basis for learning both spoken and written codes may be difficult to defend. However, another way of looking at the problem is that the figures suggest that almost half of spoken discourse has virtually no content (i.e. many of the items are function words), which would seem to make the teaching of such words as ‘vocabulary’ extremely difficult without accompanying contentful words to provide the necessary context. One position here would be to advocate situation-bound teaching of spoken language, where ‘content’ is provided by context. But it is also worth noting that one consequence of the heavy burden carried by the top 50 words in the spoken data is that, as we go down the frequency list, the spoken words in lower frequency bands will cover slightly less text than their written counterparts.
Table 4.3 shows what percentage of the total text the words in ranks 501–550 and 1001–1050 cover in the written and spoken parts respectively.
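Coverage figures of this kind, whether for the top 50 word-forms or for a lower band of ranks, can be computed along the following lines. This is a sketch only: band_coverage is a hypothetical helper, tokenization is assumed to have been done already, and the toy token list is invented.

```python
from collections import Counter


def band_coverage(tokens, first_rank, last_rank):
    """Percentage of all running words accounted for by the word-forms
    ranked first_rank..last_rank (1-based, inclusive) in the frequency list."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(token.lower() for token in tokens)
    ranked = counts.most_common()            # word-forms sorted by frequency
    band = ranked[first_rank - 1:last_rank]  # slice out the requested ranks
    total = sum(counts.values())             # total running words
    return 100 * sum(count for _, count in band) / total


# Invented toy data: 10 running words, 4 distinct word-forms.
toy = "a a a a b b b c c d".split()
print(band_coverage(toy, 1, 2))  # coverage of the two most frequent forms
print(band_coverage(toy, 3, 4))  # coverage of the forms ranked 3 and 4
```

Called with ranks 1 and 50 on each sample, a function of this sort would yield the ‘top 50’ percentages; called with 501 and 550, or 1001 and 1050, it would yield band figures of the kind reported in Table 4.3.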
Two basic positive points may be made about the use of corpora:
1 It is worth separating spoken and written corpora for the examination of the distribution and usage patterns of individual words.
2 It is worth separating spoken and written corpora for the examination of the distribution and usage patterns of pairs or groups of words that are apparently synonymous.
However, some problems also arise with such comparisons:
1 There is a problem with the status of the term word or word-form in the spoken corpus. Not included in the top 50 above are vocalizations transcribed in the corpus such as mm, er, erm and so on, some of which would merit being in the top 20 in terms of frequency of occurrence. They are not commonly thought of as relevant items for vocabulary teaching; yet they may be quite significant discoursally, and of interest in cross-cultural comparisons with languages that have phonetically different equivalent vocalizations (see McCarthy, 1990, p. 127 for a further brief discussion). On the other hand, we have included oh in our list, since it seems to express great affective and interpersonal meaning. But the cut-off line is by no means easy to justify.
2 Equally problematic in the spoken data is the very high incidence of contracted forms such as it’s, that’s, don’t, etc. They are included as single items here, since they are often in the same general bands of frequency as their non-expanded forms (e.g. it and it’s both occur in the top 20 spoken forms; do and don’t are also within 20 places of each other). However, major problems present themselves to transcribers. Are cos and because to be recorded and counted as two different word-forms? If going to is transcribed as gonna when it is uttered as such, should got to become godda and have to become hafta when they are uttered informally? Such decisions can greatly affect the count for these basic, everyday spoken word-forms and there is no simple criterion that can always be followed.

Table 4.3 Percentage coverage of words in rank 501–550 and 1001–1050 in the written and spoken parts of CIC, 1996.

                     Coverage (%)
Rank in word list    Written    Spoken
501–550              1.00       0.80
1001–1050            0.52       0.36

3 Word-lists consisting of single word-forms (as we saw with the case of know) may hide the fact that the respective form regularly occurs as an element of a multi-word expression. For example, how many of the 500-plus occurrences of thing in the CIC spoken sample are embedded within the extremely common expression the thing is . . . (meaning ‘the problem/point is . . .’)? How many are in vague expressions such as things like that? Only a concordance can properly reveal whether thing is occurring in this way or not. (A concordance is a computer-generated listing of every occurrence of a word together with its surrounding context, used for studying patterns of words as they occur in corpora of natural language.)
4 The discussion of coverage suggested that spoken words covered much more text than written. This is so, but it is also true that spoken-word meanings are often more elusive and cryptic than their written-word equivalents (note again the meaning of have got discussed above). It is equally true that, in texts where there is a very high proportion of common function words, occasional, low-frequency content words may provide the crucial and only convincing clues as to what the text is ‘about’. This is particularly so in the case of ‘language-in-action’ texts, that is, situations where the language is directly generated by the actions speakers are performing, such as cooking, loading luggage into a car and arranging furniture (see Carter and McCarthy, 1997, for examples).
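The point made under problem 3, that only a concordance reveals whether a form such as thing is embedded in a larger expression, can be illustrated with a minimal keyword-in-context sketch. The functions concordance and count_phrase are hypothetical helpers written for this illustration, not the interface of any particular concordancing package, and the sample utterance is invented.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def count_phrase(tokens, phrase):
    """Count occurrences of a multi-word expression in the token sequence."""
    words = [word.lower() for word in phrase.split()]
    if not words:
        return 0
    lowered = [token.lower() for token in tokens]
    n = len(words)
    return sum(1 for i in range(len(lowered) - n + 1) if lowered[i:i + n] == words)


# Invented toy utterance, not corpus data.
text = ("the thing is I lost my keys and things like that "
        "but the thing is nobody saw a thing").split()

for left, hit, right in concordance(text, "thing"):
    print(f"{left:>20} [{hit}] {right}")

print(count_phrase(text, "the thing is"))
```

Here a single-word count would report three occurrences of thing, while the concordance lines show that two of them belong to the thing is. The same lines also make visible a transcription issue of the kind raised under problem 2: things is counted as a separate word-form unless forms are grouped beforehand.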
Computational analysis of language corpora can reveal many interesting and pedagogically useful differences between spoken and written vocabulary use, and even relatively small samples (by today’s standards) can yield original insight, or can raise awareness for future observation and verification in the field. However, computers are less useful when it comes to understanding the way vocabulary is used as a communicative resource by individual speakers in individual situations. A discourse- or conversation-analysis approach may be the best way of getting at how vocabulary is used in everyday spoken interaction. For example, the most common occurrence of see in the spoken corpus is in the unit you see (meaning ‘understand’). Does this necessarily mean that the prototypical meaning of ‘perceive with the eyes’ should be relegated to second place? However, conversation analysis of itself (especially of just one textual fragment) may yield no more than an account of that particular piece of data, with little generalizability. Subsequent checking in a large corpus would always be advisable to see if insights from the individual text hold good across a wide range of samples.
Corpus- and conversation-analysis are complementary for linguistic analysis.