
In this section I examine how distributional information can help categorize words syntactically. Frequency helps predict some parts of speech, notably function words. Figure 4.3 shows the frequency rank of syntactic categories.

[Figure 4.3 plot: Distribution of syntactic categories in the frequency rank. Axis: frequency rank. Categories plotted: num, pron, excl, name, funct, adv, verb, adj, noun.]

Figure 4.3. Syntactic category of the 394 surface-form words with a frequency greater than or equal to 200 in the corpus, ranked by frequency. Each dot represents one word, and there is only one word per frequency rank position.

The most frequent words are on the left of the graph, in the higher rank positions, and the less frequent words are on the right, in the lower rank positions. The only obvious categorisation that could be derived from frequency information alone is that between functors and content words, since functors tend to be significantly more frequent than content words.

Simple cooccurrence statistics also reflect syntactic category. Figure 4.4 shows the distribution of words by part of speech ranked according to their average cooccurrence-based similarity with other words.

[Figure 4.4 plot: Distribution of syntactic categories in the average semantic similarity rank. Axis: average semantic similarity rank. Categories plotted: num, pron, excl, name, funct, adv, verb, adj, noun.]

Figure 4.4. Syntactic category of the 394 surface-form words with a frequency greater than or equal to 200 in the corpus (context including content and function words), ranked by average similarity value.

Similarity was calculated for all word pairs as the cosine of the angle between the two vectors representing the words in the pair. The similarity values between each word and every other word were averaged, and all words were then ranked by average cooccurrence-similarity value. As in the frequency rank, function words, being so ubiquitous, cooccur with many words and cluster at the top of the similarity rank. But the ranking based on cooccurrence statistics offers more information: numerals are on average far from other words (a closer examination reveals that they are very close to each other, forming a cluster), verbs tend to be more similar on average to other words, and nouns tend to be less similar on average to the rest of the words.
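The ranking procedure just described can be sketched as follows. This is a minimal illustration, not the code used for Figure 4.4: the function names and the toy cooccurrence counts are assumptions made up for the example, and no corpus data are reproduced.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_average_similarity(vectors):
    """Rank words from highest to lowest average cosine similarity
    to every other word in the set."""
    averages = {}
    for word, vec in vectors.items():
        sims = [cosine(vec, other) for w, other in vectors.items() if w != word]
        averages[word] = sum(sims) / len(sims)
    ranked = sorted(averages, key=averages.get, reverse=True)
    return ranked, averages

# Hypothetical toy cooccurrence counts (NOT taken from the corpus).
toy_vectors = {
    "de": [9, 8, 7, 9],
    "casa": [6, 1, 0, 2],
    "comer": [5, 6, 4, 3],
    "tres": [0, 0, 1, 9],
}
ranked, averages = rank_by_average_similarity(toy_vectors)
```

A word whose vector overlaps with many others ends up near the top of `ranked`, while a word that shares context with few others ends up near the bottom.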

More complex computations should achieve a more accurate syntactic categorisation of words. Section 4.2.1.5 outlined the characteristics of the four vector spaces that I will use for the tests in this section. I now investigate the effect of functors in the context-word set and of inflectional morphemes in the target-word sets on the ability of the vector space to predict the part of speech of words.

The ability to predict part of speech, or syntactic category, has been tested in different ways. Levy, Bullinaria and Patel (1998) used the part-of-speech tags in the BNC to construct a syntactic categorisation test. They calculated the centroid of a large number of word vectors for each part-of-speech category, then took the 100 most frequent words of each category and checked which centroid each was closest to. This method correctly categorised over 90% of the words using a window of one word, either to the left only or to both the right and left. Redington, Chater and Finch (1998) calculated 600-dimension vectors for the 1,000 most frequent words in the corpus. They considered a window of two words to the left and right, and the information for each position (two words to the left, one to the left, one to the right and two to the right) was stored in separate vector components. The context words were the 150 most common words, which included a large proportion of functors.
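The position-specific vectors described above can be pictured with the following minimal sketch. The function name `positional_context_vector`, the toy sentence and the three-word context set are illustrative assumptions, not Redington, Chater and Finch's implementation. With 150 context words and four window positions, this layout would give 4 × 150 = 600 components, which is consistent with the 600-dimension vectors they report.

```python
def positional_context_vector(tokens, target, context_words, offsets=(-2, -1, 1, 2)):
    """Count context words around each occurrence of `target`, keeping a
    separate block of components for each window position in `offsets`."""
    index = {w: i for i, w in enumerate(context_words)}
    n = len(context_words)
    vector = [0] * (n * len(offsets))
    for pos, token in enumerate(tokens):
        if token != target:
            continue
        for block, offset in enumerate(offsets):
            neighbour = pos + offset
            # Skip positions outside the text and words not in the context set.
            if 0 <= neighbour < len(tokens) and tokens[neighbour] in index:
                vector[block * n + index[tokens[neighbour]]] += 1
    return vector

# Hypothetical toy text and context-word set, for illustration only.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
context_words = ["the", "on", "and"]
vec = positional_context_vector(tokens, "sat", context_words)
```

Keeping one block per position means that "the" two words to the left of the target and "the" two words to the right are counted in different components, so word order around the target is preserved in the vector.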

Redington, Chater and Finch’s (1998) syntactic categorisation test involved hierarchical clustering of the vectors, using Spearman’s rank correlation to measure vector similarity. Their method allowed a similarity cut-off point to be set, which they fixed at 0.8 to obtain the best categorisation. This unsupervised method (the syntactic category information was not provided prior to the cluster construction) correctly categorised 90% of nouns and 72% of verbs (chance baselines of 25% and 14%, respectively).
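As a rough illustration of the similarity measure named above, the sketch below computes Spearman's rank correlation between two vectors as the Pearson correlation of their ranks, with tied values sharing the mean of their rank positions. The helper names are assumptions for this example, and the clustering step itself is not reproduced here.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(u, v):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(u), average_ranks(v))

CUTOFF = 0.8  # similarity cut-off reported in the text above
rho = spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])  # toy vectors, not corpus data
```

Because only the ordering of the components matters, this measure is insensitive to the large differences in raw counts between frequent and infrequent context words.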

I present a supervised syntactic categorisation test, also based on hierarchical clustering, that categorises each word according to the category of the majority of its nearest neighbours in the space.

Method

The vectors in the four sets in § 4.2.1.5 above were manually tagged for syntactic category. Ten categories were used: noun, adjective, verb, adverb, functor, proper name, exclamation, personal pronoun, indefinite pronoun and numeral. Functors included determiners, prepositions and conjunctions; personal pronouns included possessives; indefinite pronouns included the Spanish equivalents of wh- pronouns such as qué, quién, cómo (what, who, how). I performed a hierarchical cluster analysis in SPSS (vector similarity metric: cosine) on each vector space and obtained a dendrogram with clusters of part-of-speech labels (see Figure 4.5).

Figure 4.5. Part of a dendrogram showing hierarchical clustering (method: cosine) of words in a vector space (condition: lemmatised, functors and content). Words and part-of-speech labels are shown.

I performed a categorisation task on this dendrogram in the following way: given a new word whose position in the space (and therefore in the dendrogram) is known, it is categorised as belonging to the predominant category in its local cluster. I first consider each terminal-level cluster (marked in red in Figure 4.5); if there is one majority category² (as in the first and third terminal-level clusters in Figure 4.5), then I count words of the majority category in that cluster as correctly categorised. If there is no majority in a cluster (as in the second terminal-level cluster in Figure 4.5), I consider that words in that cluster cannot be correctly categorised. Words clustered at the next level up (the two bottom words in Figure 4.5) count as correctly categorised if they belong to the majority category in the higher-level cluster. In the example in Figure 4.5, pronoun ‘que’ is correctly categorised because the majority of the words in the second-level cluster including the seven bottom words in the dendrogram are pronouns too. I did this only for the first two levels (considering more levels could only improve the results).
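The counting rule above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure as actually run: the function names are made up, the example clusters are hypothetical and do not reproduce Figure 4.5, and the majority threshold for clusters of more than two words (`min_share`) is an assumed placeholder, since only the two-element rule is fully specified.

```python
from collections import Counter

def majority_category(labels, min_share=0.5):
    """Return the majority category of a cluster, or None if there is none.

    Two-element clusters have a majority only if both labels agree.
    For larger clusters, `min_share` is an ASSUMED threshold: the most
    common category must account for more than this share of the cluster.
    """
    if len(labels) < 2:
        return None
    category, count = Counter(labels).most_common(1)[0]
    if len(labels) == 2:
        return category if count == 2 else None
    return category if count / len(labels) > min_share else None

def count_correct(terminal_clusters, min_share=0.5):
    """Count words that belong to the majority category of their terminal cluster."""
    correct = 0
    for labels in terminal_clusters:
        winner = majority_category(labels, min_share)
        if winner is not None:
            correct += sum(1 for label in labels if label == winner)
    return correct

def correct_at_next_level(label, higher_cluster_labels, min_share=0.5):
    """A word joined one level up counts as correct if its category is the
    majority category of the higher-level cluster that contains it."""
    return majority_category(higher_cluster_labels, min_share) == label

# Hypothetical clusters of part-of-speech labels (not those in Figure 4.5).
terminal = [["noun", "noun", "adj"], ["verb", "noun"], ["pron", "pron"]]
n_correct = count_correct(terminal)
second_level = ["pron", "pron", "pron", "pron", "verb", "noun", "pron"]
que_ok = correct_at_next_level("pron", second_level)
```

In this toy input the mixed two-word cluster contributes nothing, because neither of its words can be assigned a majority category.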

Results

This method categorised high proportions of words correctly. As seen in the summarised results in Figure 4.6, the presence of functors in the context-word set clearly improved the performance both in the surface-form (two-tailed paired t-test, t=2.23, df=9, p=0.05) and the lemmatised (two-tailed paired t-test, t=2.21, df=9, p=0.05) versions of the corpus. Surface forms were marginally better categorised than lemmas (t-tests not significant).
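For readers who want to see how a paired t statistic of this kind is obtained, the sketch below computes t and the degrees of freedom from two matched samples. The function name and the input values are hypothetical and are not the thesis data. The reported df=9 would be consistent with ten paired observations, possibly one per syntactic category, but that pairing is an assumption here. The p-value is not computed, since it requires the t distribution, which the Python standard library does not provide.

```python
import math

def paired_t(xs, ys):
    """Return (t, df) for a paired t-test on two equal-length samples."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical proportions correct with and without functors (made-up values).
with_functors = [0.6, 0.7, 0.5, 0.8]
without_functors = [0.5, 0.5, 0.4, 0.7]
t, df = paired_t(with_functors, without_functors)
```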

[Figure 4.6 plot: Correctly categorised words. Axis: proportion (0 to 0.7). Groups: surface, lemmas. Series: (-) functors, (+) functors.]

2 For 2-element clusters, there is only a majority if both items are the same category. Then, the classification algorithm will classify each of them correctly by assigning it the same category as the other item in the cluster. For larger clusters, I consider a majority of at least

Figure 4.6. Results of syntactic categorisation task using the four vector spaces.

Figure 4.7 shows the proportion of correctly categorised words in each syntactic category, compared with chance levels. Baseline chance levels are proportional to the number of nouns, proper names, numerals, etc. in the target-word set. Some syntactic categories were categorised better than others, but all were categorised correctly well above chance levels.

[Figure 4.7 plot: Axis: proportion. Categories: noun, pn, num, adj, verb, adv, funct, p-pron, i-pron, excl. Series: surface (+functors), lemmas (+functors), baseline (surface), baseline (lemmas).]

Figure 4.7. Proportion of words of the ten different syntactic categories that were correctly categorised in the two vector spaces that included functors in their context-word set. Chance baseline levels are also shown. (All result-baseline two-tailed paired t-tests yield p<0.01.)

The graph shows the proportions of words correctly categorised in the two best-performing vector spaces (those including functors in their context-word sets). This comparison shows the effect of corpus lemmatisation on a syntactic categorisation task.

Discussion

As we see in Figure 4.7, nouns, numerals, proper names, adjectives and indefinite pronouns are better categorised in the lemmatised corpus. Verbs, adverbs, personal pronouns and, particularly, functors, however, are better categorised in the surface-form corpus. This suggests that conflating all noun, adjective and indefinite-pronoun surface forms into their lemmas helps categorise them syntactically. On the other hand, conflating all surface forms of a verb into a single lemma hinders verb categorisation. Nouns, adjectives and indefinite pronouns can take gender and plural inflections. In the second group, only verbs change between versions of the corpus, with inflections removed from the lemmatised version.

Of the word categories that do not change between the surface and the lemmatised corpus versions, the largest difference is found in functors, which are better categorised in the surface-form corpus. As indicated by the results in Figure 4.6, the role of functors is to relate words to one another in the sentence, so it could be said that they categorise, but do not need to be categorised. Since relationships between words in Spanish are also signalled by agreement (in number and gender between nouns and adjectives, in number between subjects and verbs), inflected words provide a more fine-grained, and therefore more accurate, categorisation of functors than lemmas do.

These results support the idea that gender and number inflections on the one hand and verb inflections on the other have different roles in syntactic categorisation. The difference between English noun (number) and verb (person and tense) morphology was pointed out by Tyler, Bright, Fletcher, and Stamatakis (2004), whose fMRI studies of noun and verb processing suggest that while noun and verb stem representations do not differ, verb and noun morpho-phonology engage different neural systems. The present results suggest that while nouns and adjectives are better categorised in a vector space based on the word root (lemma), verb categorisation is helped by the variety introduced by verb inflections.

The next section looks more closely at the classification of nouns and verbs in a vector space.
