
A collocate is simply a word that occurs near another word. What exactly “near” means is up to individual interpretation. The chosen context within which two elements are said to be near each other is called a window. For example, let us look at the verb ate in sentence (1). This verb has two direct neighbours, the noun cat on its left, and the determiner the on its right. These two collocates therefore occur within a symmetrical window of one element.

(1) My old cat ate the dog’s homework

window   3    2    1    0    1    2    3    4
word     My   old  cat  ate  the  dog  ’s   homework

Table 2. Collocate window around the verb ate in sentence (1).

If the window is extended to two words instead of one, then the adjective old and the noun dog are also considered as collocates of the verb ate. Windows can be asymmetrical if one is only interested in the left or right collocates of a given element. For instance, if one decides to select an asymmetric window of three elements to the right, then the collocates the, dog and ’s would be selected. Table 2 provides a visual representation of a collocate window.
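The window selections described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the studies: the function name `collocates`, its `left` and `right` parameters, and the tokenization of sentence (1) into eight elements (following Table 2) are assumptions made for this example.

```python
def collocates(tokens, index, left=1, right=1):
    """Return the elements found up to `left` positions before and
    `right` positions after the element at `index` (hypothetical helper)."""
    start = max(0, index - left)  # do not run past the start of the list
    left_part = tokens[start:index]
    right_part = tokens[index + 1:index + 1 + right]
    return left_part + right_part


# Sentence (1), tokenized as in Table 2 (the clitic 's is its own element).
sentence = ["My", "old", "cat", "ate", "the", "dog", "'s", "homework"]
node = sentence.index("ate")

print(collocates(sentence, node, left=1, right=1))  # ['cat', 'the']
print(collocates(sentence, node, left=2, right=2))  # ['old', 'cat', 'the', 'dog']
print(collocates(sentence, node, left=0, right=3))  # ['the', 'dog', "'s"]
```

Setting `left=0` or `right=0` yields the asymmetric windows mentioned above.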

Collocates give relevant information about the meaning of a word or construction around which they occur. For instance, if one looks at all the collocates around the verb eat in a corpus, it is likely that many of these collocates are going to be food-related items, especially to the right. Similarly, left collocates will mostly comprise agents, such as people or animals. Therefore, collocates can inform about a link between the verb eat and entities that can eat, as well as things that can be eaten. This led Firth (1957: 11) to coin his famous slogan: “You shall know a word by the company it keeps”.

Computational linguistics makes use of collocates to establish semantic profiles of given elements. These profiles can then be used in word-sense disambiguation tasks or information retrieval in general (Turney and Pantel 2010). For instance, the word bank is ambiguous because it can denote a financial institution or the bank of a river or lake. However, it is evident that when the word refers to the financial institution, most collocates will include finance-related words, whereas the other meaning will involve water-related words. This is how collocates can be used to gain information about a word’s semantics, and therefore also about the kinds of contexts in which it occurs.

The fact that collocates can be used to investigate semantics is relevant for grammaticalization because it has often been observed that grammaticalized elements can be used in a wider variety of contexts as they become semantically vaguer or broader. The notion of semantic bleaching in grammaticalization (Sweetser 1988) is a generally well-attested phenomenon (Hopper and Traugott 2003: 127) that has been observed with modals (Krug 2001), auxiliaries (Heine 1993) and many other types of elements. The semantic broadening of grammatical elements has also been described as a result of increases in frequency (e.g. Haiman 1994).

While Lehmann (2002) includes semantics as an aspect of the integrity parameter (i.e. semantic integrity), collocates also relate to some of his other parameters, such as bondedness and syntagmatic variability. Indeed, if a sign is constrained with regard to the other signs it can be associated with, then this should be reflected empirically in its collocate profile as well. A highly constrained profile would mean an overall lower number of possible collocates, whereas a largely unrestricted profile would translate into a wider range of possible collocates. This is why a measurement of collocate diversity might be useful to measure degrees of grammaticalization. In fact, Hilpert (2008: 13-48), as well as Torres Cacoullos and Walker (2011), have already illustrated that collocates of grammatical constructions can be used to show degrees of grammaticalization for specific cases such as modals, future tense markers and complementizer that. Furthermore, the idea that collocations are relevant to grammaticalization has also been discussed in other works, such as Himmelmann (2004: 31-34) and Gisborne and Patten (2011: 96-98), although not necessarily in direct relation to gradualness.

In the subsequent studies, collocate diversity is measured in the following way for a given element. First, all the possible collocates are found within a given window. Three different windows are used. One is a symmetric window of four words on each side (i.e. eight words in total). The other two are a one-word window to the right and a one-word window to the left.

Second, the number of different types among these collocates is counted. This is simply the number of different forms that are found. For example, if ten collocates were found, where five instantiate the word dog, three the word cat and two the word nice, this means that three different types were found. In a similar fashion to section 3.1, no lemmatization is used for the list of collocates, which means that dog and dogs count as separate types.

Third, the number of different collocates (i.e. the number of types) is then divided by the token frequency of the item, in order to take into account that items with higher token frequencies are necessarily going to have more different collocates as a result of their high frequency. This results in a number that ranges between zero and the window size. A brief explanation using a window size w and a frequency n is that the maximum number of different collocates corresponds to w*n. This corresponds to the situation where all collocates are unique and not a single one is repeated. If this maximal number (w*n) is further divided by the frequency (n), then trivially the maximum diversity is equal to w. Conversely, the theoretical minimum diversity (where all the collocates are in fact the same one) corresponds to 1/n, which will approach zero when n is large.
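The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the subsequent studies: the function name `collocate_diversity`, the toy corpus, and the choice to truncate windows at the corpus edges are all introduced here for the example. Forms are compared as they appear, without lemmatization.

```python
def collocate_diversity(tokens, target, left, right):
    """Distinct collocate types of `target` divided by its token frequency.

    Hypothetical helper: windows are simply truncated at the corpus edges.
    """
    types = set()
    frequency = 0
    for i, token in enumerate(tokens):
        if token != target:
            continue
        frequency += 1
        types.update(tokens[max(0, i - left):i])      # left collocates
        types.update(tokens[i + 1:i + 1 + right])     # right collocates
    if frequency == 0:
        raise ValueError(f"{target!r} does not occur in the corpus")
    return len(types) / frequency


# Toy corpus (invented for illustration); "ate" occurs three times.
corpus = ("the cat ate the fish and the dog ate the bone "
          "while the bird ate seeds").split()

print(collocate_diversity(corpus, "ate", left=1, right=0))            # 1.0
print(round(collocate_diversity(corpus, "ate", left=0, right=1), 3))  # 0.667
print(collocate_diversity(corpus, "ate", left=4, right=4))            # 3.0
```

In this toy corpus the left neighbours (cat, dog, bird) are all different, which gives the maximum of 1 for a one-word window, whereas the right neighbours (the, the, seeds) contain a repetition. The four-four value of 3.0 lies between the theoretical bounds of 1/n and w = 8.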

The three different windows proposed above reflect different linguistic aspects. The larger symmetric window of four words arguably gives more information about the meaning of a specific element. Collocate-based research for semantics-related purposes (e.g. word disambiguation, grouping words by semantic similarity) tends to be better served by broader windows (e.g. Turney and Pantel 2010: 170-171). While there does not seem to be a recommended number as this may depend on the task at hand, grammatical elements occur rather close to the elements that they are used in conjunction with, which motivates using four words rather than a larger number to both sides. Another note is that studies that use collocates to retrieve information about meaning tend to focus on lexical meaning and therefore remove words that are called “stopwords”, which mostly consist of function words. In the present case, since grammatical elements are the focus of the study, such words are also included, and no stopwords are removed from the elements found within the chosen windows. In the sense that the four-four window is more closely related to semantic aspects, it is connected to Lehmann’s notion of semantic integrity (section 2.2.1).

The smaller windows of just one word are expected to give more information on morphosyntactic constraints, as they pertain to elements that are directly adjacent. Adjacency is related to Lehmann’s notions of bondedness and syntagmatic variability (section 2.2.1), as both these parameters are concerned with how a given item attaches itself to other elements. For instance, prepositions tend to be heads of prepositional phrases, which means that most of the time, they will have another phrase to their left (e.g. he was on the roof). In contrast, it is likely that to the right of the preposition, one will find the second element of that prepositional phrase. Directionality is therefore an additional aspect that will be investigated in the subsequent studies. The remainder of this section presents several examples in order to illustrate the computation of the collocate diversity measure.

Consider for instance the case of a window of one element to the right. If one applies this window to an element that occurs 6’000 times in a given corpus, then there are going to be 6’000 collocates to the right of this item.[10] If all these 6’000 elements are unique types, then dividing this number by a frequency of 6’000 results in a maximum diversity of 1 (6’000/6’000). In contrast, if there is only one type repeated 6’000 times, this results in a diversity of almost zero (1/6’000). In a more plausible case, maybe there will be 1’250 different types among these collocates, which results in a value of 0.208 (1’250/6’000), which can also be interpreted as 20.8% of the collocates being unique types.

If the window comprised two elements instead, then there would be 12’000 collocates, which can result in a possible maximum collocate diversity of 2 (12’000/6’000), which corresponds to the size of the window in that case. Therefore, a further step is to divide this value by the window size to always get a value between zero and one (and that can therefore be interpreted as a percentage as well, as shown in the previous paragraph), but this step is not required in the subsequent studies, as other standardization processes are undertaken (section 4.1.2).
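The arithmetic of these examples can be checked with a short sketch. The function `diversity_from_counts` and its `normalize` flag are hypothetical names introduced for this illustration; the optional division by the window size is the further step mentioned above, which the subsequent studies do not apply.

```python
def diversity_from_counts(n_types, frequency, window_size=1, normalize=False):
    """Types divided by token frequency, optionally rescaled to [0, 1]
    by dividing by the window size (hypothetical helper)."""
    value = n_types / frequency
    return value / window_size if normalize else value


# One-word window, frequency 6'000
print(diversity_from_counts(6000, 6000))            # 1.0 (all collocates unique)
print(round(diversity_from_counts(1250, 6000), 3))  # 0.208

# Two-word window, frequency 6'000, all 12'000 collocates unique
print(diversity_from_counts(12000, 6000, window_size=2))                  # 2.0
print(diversity_from_counts(12000, 6000, window_size=2, normalize=True))  # 1.0
```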

To give an actual example from the subsequent studies, the collocate diversity measures from section 5.4 for the clitics ’re and ’m are given in Table 3. As expected, they show a very low diversity with a window of one element on the left (Colloc_L1), since the left element is almost always I or you. The right diversity (Colloc_R1) is also relatively low, because many instances include not, going and a. On the other hand, the four-four window (Colloc_4-4) shows higher values, since many other elements become possible once one looks beyond direct adjacency. This illustrates why using different window sizes can lead to different results.

             Colloc_4-4   Colloc_L1   Colloc_R1
<w V**>'M    0.486        0.001       0.066
<w V**>'RE   0.543        0.001       0.097

Table 3. Collocate diversity measures for the clitics ’m and ’re (data taken from Section 5.4).

[10] To keep the discussion simple, elements that occur where the corpus begins and ends are simply ignored. These cases are special in the sense that, for instance, the very first element in a corpus does not have left collocates at all. However, these cases are marginal and the way they are handled pertains more to practical matters.

The way collocate diversity is computed here is close to a type-token ratio, which is a measure of vocabulary diversity (or richness) [11] of a text, or of a speech sample. This measure has been particularly prominent in child language research for a long time (e.g. Richards 1986). The type-token ratio of a given text is obtained by dividing the number of different words (i.e. types) by the total number of words (i.e. tokens). However, a limitation of such a measure is that it is dependent on text length (e.g. Tweedie and Baayen 1998, Covington and McFall 2010). It is therefore problematic to compare the richness of texts of different lengths using a type-token ratio. In the present case, collocate diversity is obtained by dividing the number of different types by the token frequency of an element (which therefore corresponds to text length in the case of a type-token ratio). While it is possible to address this shortcoming of type-token ratios, for instance by using text samples of similar sizes instead of the whole texts, in the present case this cannot really be achieved because the data concerns the collocates of a given element. The extent to which this limitation is problematic can be investigated by empirical means, namely by checking how the collocate diversity measures correlate with the token frequency measure. Sections 5.4.1, 5.5.1, and 5.6.1 deal with this issue by showing that while token frequency and collocate diversity are indeed correlated to some extent, their correlation is relatively small.
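The dependence of the type-token ratio on text length can be illustrated with a small sketch. The function `type_token_ratio` and the toy text are assumptions made for this example only; the point is simply that the same text yields a lower ratio as more of it is taken into account, because frequent words keep recurring.

```python
def type_token_ratio(tokens):
    """Number of distinct forms divided by the total number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a type-token ratio for an empty text")
    return len(set(tokens)) / len(tokens)


# Toy text (invented for illustration): 14 tokens, 6 types.
text = "the cat saw the dog and the dog saw the cat and the bird".split()

print(type_token_ratio(text[:5]))         # 0.8 (4 types among the first 5 tokens)
print(round(type_token_ratio(text), 3))   # 0.429 (6 types among 14 tokens)
```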
