The correlation found between the phonological and the semantic distances in § 5.2.2 could be driven solely by syntax, reflecting the match between the morphosyntactic information contained in word phonology and syntactic information captured by words’ cooccurrence with functors. Another contributing factor to the correlation could be phonological typicality, the fact that different syntactic classes have different phonological characteristics: Kelly (1992, 1996) shows phonological differences between English nouns and verbs. For example, disyllabic nouns tend to have initial stress whereas disyllabic verbs tend to have final stress; on average, nouns have more segments, more syllables and longer duration than verbs; and nouns tend to have more low vowels and more nasal consonants than verbs.
(See also Durieux & Gillis, 2000, and Monaghan, Chater & Christiansen, 2003, for reviews.)
Another factor could be phonological priming, the putative tendency to produce words containing sounds that are similar to recently uttered or heard words. The effect of (short-range) phonological priming could be eliminated from the correlation metric by using very large context windows such as those of Landauer and Dumais (1997). Phonological priming can be considered as a reflection of the similarity-based structure of the phonological lexicon on speech. An uttered word activates similar-sounding words more than different-sounding words, so the former are more likely than the latter to be uttered soon after.
Among the more tentative contributing factors to the correlation found in § 5.2.2 is the bias towards systematicity between the phonology and the meaning levels of the lexicon discussed above. We saw in chapter four that cooccurrence-based semantic similarity spaces do capture meaning, as shown by the facts that they model semantic priming and that they perform above average in semantic tests (§ 4.2.3).
This section aims to test the correlation between word form and word meaning by removing the influence of syntax from the semantic similarity metric. One way to eliminate the influence of syntax on the correlation would be to use the lemmatised corpus and remove the functors from the context word sets in the calculation of the vectors. That condition performed worst of all in syntactic classification tasks: part of speech (§ 4.2.2.1), nouns and verbs (§ 4.2.2.2) and masculine and feminine nouns (§ 4.2.2.4); but it performed well in a semantic task such as noun classification of 'person nouns' (§ 4.2.2.3). However, during lemmatisation, as well as losing their morphemes, certain words have their root changed, and this affects their position in the phonological similarity space. For instance, feminine inflections are an integral part of words and cannot be removed without losing phonological information about the word ending, syllabic structure and length.
Lemmatisation replaces irregular forms of verbs by their (regular) stem.
Verbs present an additional problem. The canonical verb form, the infinitive, has one of three very characteristic endings: stressed -ar, -er or -ir. My lemmatisation removes the final -r, but still leaves a syntactically conspicuous final stressed -a, -e or -i.
An alternative way of eliminating the effect of syntax on the correlation is to use the surface forms, but to exclude parameters that may pick up on the morphology from the phonological similarity metric. I do not remove the parameters directly related to the last segment, site of the gender morpheme, for several reasons. The last segment is a site of important phonological information, as we saw in chapters two and three, and dispensing with it altogether leaves an incomplete picture of the word’s phonology. Feminine endings are not always inflections of a masculine stem: most feminine words are uninflected (in the aggregate cvcv and cvccv words, only 22% are inflections of a masculine stem), and the ending is arguably part of their phonological identity. Besides, it is not always the case that feminine words end in -a, and masculine in -o, with about 15% of masculine and feminine words ending in -e (see Figure 5.11).
[Figure 5.11: bar chart. x-axis: final vowel (a, e, o, u); y-axis: proportion of cases (0 to 0.9); series: masc, fem.]
Figure 5.11. Final segment of the aggregate cvcv and cvccv gendered words.
(Note that plural inflections are not an issue, since the two word-groups at hand both end in a vowel, and are all singular.) As explained in chapter three (§ 3.2.2.5.3), the stress-related parameters – sharing the stress on the same syllable and sharing the same stressed vowel on the same syllable – reflect morphological similarity related to verb tense and person. Therefore, removing the stress-related parameters should eliminate most of the morphosyntactic information from the phonological similarity metric.
Summing up, I attempt to remove the effects of syntax by eliminating cooccurrences with functors in the semantic space and by eliminating stress-related parameters from the phonological similarity metric. The next section presents a measurement of a correlation with the new, relatively syntax-free data.
5.2.3.1 Materials
As in § 5.2.2, I use the 252 cvcv and the 146 cvccv phonetically transcribed words of frequency greater than or equal to 20 in the surface-form corpus. The position vectors for the semantic similarity calculations take into account cooccurrences with content words, but not with functors.
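As an illustration of what excluding functors from the context word set amounts to, the following minimal Python sketch builds cooccurrence count vectors over content words only. It is not the implementation used in this thesis: the function name, the symmetric window of five tokens, the use of raw counts and the toy functor list are all assumptions made for the example, and the frequency threshold for context words is left as a parameter.

```python
from collections import Counter

def cooccurrence_vectors(tokens, targets, functors, min_freq=1, window=5):
    """Count cooccurrences of each target word with content words only.

    tokens   : the corpus as a list of word forms
    targets  : the words whose position vectors are wanted
    functors : set of function words to exclude from the context set
    min_freq : minimum corpus frequency for a context word (assumed parameter)
    window   : tokens considered on each side of the target (assumed value)
    """
    freq = Counter(tokens)
    # Context set: frequent enough AND not a functor.
    context_words = sorted(
        w for w, f in freq.items() if f >= min_freq and w not in functors
    )
    index = {w: k for k, w in enumerate(context_words)}
    vectors = {t: [0] * len(context_words) for t in targets}

    for pos, tok in enumerate(tokens):
        if tok not in vectors:
            continue
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for q in range(lo, hi):
            # Skip the target itself and any word outside the context set.
            if q != pos and tokens[q] in index:
                vectors[tok][index[tokens[q]]] += 1
    return context_words, vectors

# Toy usage with a hypothetical functor list.
toy = "el gato come el pescado y el perro come la carne".split()
ctx, vecs = cooccurrence_vectors(
    toy, ["gato", "perro"], {"el", "la", "y"}, window=2
)
```

In the toy corpus the two target words end up with identical vectors because both cooccur only with "come" once the functors are ignored; with functors included, their differing determiners and conjunctions would also contribute.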
5.2.3.2 Procedure
The procedure is essentially the same as that of the last section, with a few crucial differences. For the semantic similarity, the calculation of each word’s position vector considers the cooccurrences of the target word with the content words, but not with the functors, of frequency greater than or equal to 200 in the corpus. The phonological similarity metric calculates the parameter values in the same way as in § 5.2.2.2, but now excluding the parameters related to stress (stress in the same syllable and same stressed vowel in the same syllable) and to syllabic structure. See the new parameter values in Table 5.12. Note that these values differ from, and are not completely correlated with, the values in Table 5.10 above, because the removed parameters did not intervene in their calculation.
cvcv             cvccv
c1    0.178      c1    0.081
c2    0.009      c2    0.028
v1    0.021      c3    0
v2    0.072      tc13  0.105
tc    0.388      tc23  0.094
tv    0.332      3c    0.321
                 v1    0.082
                 v2    0.043
                 tv    0.246
Table 5.12. Phonological similarity parameter values used in the calculation of the correlation.
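To make the role of these values concrete, the sketch below computes a weighted similarity between two cvcv words using the cvcv column of Table 5.12. It rests on assumptions that are not confirmed in this section: that c1, c2, v1 and v2 stand for sharing the segment in that position, that tc and tv stand for sharing both consonants and both vowels respectively, and that the similarity of a pair is the sum of the values of the parameters it satisfies. The function names and the one-character-per-segment transcription are hypothetical.

```python
# Parameter values copied from Table 5.12 (cvcv column).
CVCV_WEIGHTS = {
    "c1": 0.178, "c2": 0.009,
    "v1": 0.021, "v2": 0.072,
    "tc": 0.388, "tv": 0.332,
}

def shared_parameters(a, b):
    """Return which parameters a pair of cvcv words satisfies.

    a, b: four-segment transcriptions in the order c1, v1, c2, v2,
    one character per segment (an assumption of this sketch).
    """
    c1 = a[0] == b[0]
    v1 = a[1] == b[1]
    c2 = a[2] == b[2]
    v2 = a[3] == b[3]
    return {
        "c1": c1, "c2": c2, "v1": v1, "v2": v2,
        "tc": c1 and c2,   # assumed reading: both consonants shared
        "tv": v1 and v2,   # assumed reading: both vowels shared
    }

def phon_similarity(a, b, weights=CVCV_WEIGHTS):
    """Sum the values of all parameters satisfied by the pair (assumed rule)."""
    shared = shared_parameters(a, b)
    return sum(w for name, w in weights.items() if shared[name])

# Toy usage: the pair shares both consonants and the final vowel.
sim = phon_similarity("kasa", "kosa")
```

Under these assumptions an identical pair scores 1, since the values in each column of Table 5.12 sum to 1, and a pair sharing no segment scores 0.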
5.2.3.3 Results
Table 5.13 shows the correlation values (Fisher divergence) for the cvcv and the cvccv word groups, the number of words configuring each space, and the significance, calculated with a Monte-Carlo analysis of 1000 randomisations. Figure 5.12 shows the distribution of the randomised results, indicating the position of the Fisher divergence obtained with the veridical pairs.
         Fisher divergence    Nr. words    Significance
cvcv     7.79                 252          p<0.05
cvccv    3.69                 146          p=0.09
Table 5.13. Correlation value (Fisher divergence) and significance for the cvcv and cvccv word groups after removing syntactic cues from phonological and semantic similarity metrics.
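The logic of a Monte-Carlo significance test of this kind can be sketched as follows. This is an illustrative stand-in rather than the analysis reported here: it uses Pearson's r between the two sets of pairwise distances in place of the Fisher divergence (whose computation is not restated in this section), it randomises by shuffling the assignment of words to positions in the semantic space, and it estimates p as (hits + 1) / (randomisations + 1). All function names are hypothetical.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def monte_carlo_test(words, phon_dist, sem_dist, n_random=1000, seed=0):
    """Compare the veridical statistic with randomised word pairings.

    phon_dist, sem_dist: functions giving the distance between two words.
    Returns (observed statistic, estimated one-tailed p-value).
    """
    rng = random.Random(seed)
    pairs = list(combinations(range(len(words)), 2))
    phon = [phon_dist(words[i], words[j]) for i, j in pairs]

    def statistic(order):
        # 'order' reassigns each word to another word's semantic position.
        sem = [sem_dist(words[order[i]], words[order[j]]) for i, j in pairs]
        return pearson(phon, sem)

    identity = list(range(len(words)))
    observed = statistic(identity)          # veridical pairing
    hits = 0
    for _ in range(n_random):
        order = identity[:]
        rng.shuffle(order)
        if statistic(order) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_random + 1)

# Toy usage: eight "words" whose two distance measures coincide exactly.
toy_words = list(range(8))
dist = lambda a, b: abs(a - b)
obs, p = monte_carlo_test(toy_words, dist, dist, n_random=200)
```

The veridical value corresponds to the white bin in a histogram of the randomised values: the fewer randomisations that reach or exceed it, the smaller the estimated p.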
[Figure 5.12: histograms of the Monte-Carlo randomisations; recovered panel title: "cvcv, no syntax".]
Figure 5.12. Histogram plots showing the results of the Monte-Carlo analysis for cvcv and cvccv words. The veridical results are in the white bins.
5.2.3.4 Discussion
The results obtained after eliminating syntactic information from the data are significant for cvcv words (p<0.05), but only marginally significant for cvccv words (p=0.09).
However, the fact that significant or near-significant values are obtained in two independent word-groups adds robustness to the results. This indicates that word form may be correlated with word meaning, although the results are not conclusive. Nevertheless, they are encouraging, given the rough phonological and semantic similarity metrics employed and the relatively small samples of the lexicon tested. It would be interesting to test the correlation with a phonological similarity metric including more parameters and a more robust semantic similarity metric based on a larger corpus and perhaps using a larger context window. Chapter six will offer some insight into some of these directions.
These results, together with those of section 5.2.2, show that there is a measurable significant correlation between the cooccurrence-based and the phonological levels of representation of the Spanish lexicon. I have shown that part of this correlation can be attributed to syntax, but a small part may rely on the meaning of the concepts denoted by words. The next section looks at the word classes that drive the phon-sem systematicity.