
This section explains how I constructed the hyperspace representation of the Spanish semantic lexicon that is the basis of the analyses of syntactic category, gender and semantic clustering described in § 4.2.2 to § 4.2.4. In those analyses I manipulate two variables: the presence of morphology in the corpus and the presence of function words as context words in the calculation of the word vectors. Some of the other parameters are set in order to maximize vector quality given the limited size of the corpus available.

I created vectors for each of a number of target words – all the types above a certain frequency, which in practice coincides with the set of content and function context words (see Table 4.1 below). Here the corpus size constrains the number of words frequent enough to generate dense vectors.

These vectors are created by counting the number of times that each context word appears within 5 words of the target word in the corpus. The frequency counts are then transformed into probability distributions to normalise for word frequency. I measured the similarity between two vectors as the cosine of the angle they form, because this metric is not sensitive to vector length, and it performs well in semantic tests (Lowe & McDonald, 2000; McDonald, 2000). The following sections describe the other elements involved in the configuration of the semantic hyperspace.
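The counting, normalisation and similarity steps described above can be sketched as follows. This is a minimal illustration rather than the code used for the thesis: the function names are hypothetical, and dividing each count by the vector total is only one assumed reading of "transformed into probability distributions".

```python
from collections import Counter
import math


def cooccurrence_vector(tokens, target, context_words, window=5):
    """Count how often each context word occurs within `window` tokens
    to the left or right of every occurrence of `target`."""
    context = set(context_words)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in context:
                counts[tokens[j]] += 1
    # One dimension per context word, in the order given.
    return [counts[c] for c in context_words]


def to_probabilities(vector):
    """Assumed normalisation: divide each count by the vector total."""
    total = sum(vector)
    return [v / total for v in vector] if total else list(vector)


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy usage with an invented sentence, not taken from the corpus.
tokens = "el gato come pescado y el perro come carne".split()
context_words = ["gato", "perro", "pescado", "carne"]
come = to_probabilities(cooccurrence_vector(tokens, "come", context_words))
```

Because the cosine ignores vector length, the similarity between a raw count vector and its normalised version is 1, which is the property that motivates the choice of metric.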

4.2.1.1 The corpus

The distributional statistics in this section are based on the same corpus used in chapter three, namely ‘Corpus oral de referencia del español’, an orthographical Spanish speech corpus (Marcos Marín, 1992). The words are transcribed phonetically using the same citation rules as in chapter two of this thesis. After removing all tags the corpus has 897,395 word tokens (38,847 types). This is a much smaller corpus than those used in the studies mentioned in § 4.1.1.1 above. The spoken part of the BNC used in other studies mentioned above is about ten times larger¹. Even with this important limitation, the distributional statistics provide information at the levels explored, namely syntactic category, gender and semantics. I assume that more refined vectors based on a larger corpus would provide even more detailed information, including subtler nuances.

4.2.1.2 Lemmatisation

One of the variables in this study is whether the corpus contains surface forms (all word forms as found in speech, including gender, plural and verb inflections) or lemmas only (uninflected words). The corpus is not annotated, so instead of lemmatising the whole corpus by hand, I only lemmatised types of frequency greater than or equal to 100, plus a few other types that added together would generate a lemma of frequency greater than or equal to 100. The lemmatisation process comprised:

• Replacing feminine and plural inflections with the masculine singular form.

• Replacing all verb forms, including all persons and tenses, participles and infinitives, with the verb root: the infinitive without the final -r. Exceptions include forms of verb ser (be), which were replaced with the most common form, 3rd person singular of the present tense, ‘es’; forms of verb ir (go) were left as ‘ir’, because the forms resulting from the regular substitutions, ‘se’ and ‘i’, are homophonous with the very common impersonal pronoun ‘se’ and the conjunction ‘y’ (and), respectively.

• Removing the ending ‘-mente’ (equivalent to English ‘-ly’) from adverbs.

• Merging very frequent compound forms, e.g. ‘por favor’ (please) becomes ‘porfavor’ and ‘sin embargo’ (however) becomes ‘sinembargo’.

1 I could not find a larger corpus of spoken European Spanish available for research, which limits the quality of the resulting vectors and therefore of the hyperspace. There are enough differences between the varieties of Spanish spoken across Latin America to make it desirable to use a single variety.
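The lemmatisation steps listed above could be applied with a hand-built lookup table along the following lines. This is only a sketch under stated assumptions: the table entries, the names `LEMMA_TABLE`, `COMPOUNDS` and `lemmatise`, and the use of orthographic rather than phonetic forms are all illustrative and do not come from the thesis.

```python
# Hypothetical hand-built table mapping frequent surface forms to lemmas.
LEMMA_TABLE = {
    "niñas": "niño",      # feminine plural -> masculine singular
    "buenas": "bueno",
    "cantamos": "canta",  # verb root: infinitive 'cantar' without the final -r
    "soy": "es",          # forms of 'ser' -> 3rd person singular present
    "eran": "es",
    "voy": "ir",          # forms of 'ir' are kept as 'ir'
    "vamos": "ir",
}

# Very frequent two-word compounds merged into a single token.
COMPOUNDS = {
    ("por", "favor"): "porfavor",
    ("sin", "embargo"): "sinembargo",
}


def lemmatise(tokens):
    """Return a lemmatised copy of `tokens` using the tables above."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            out.append(COMPOUNDS[pair])
            i += 2
            continue
        tok = tokens[i]
        if tok in LEMMA_TABLE:
            out.append(LEMMA_TABLE[tok])
        elif tok.endswith("mente") and len(tok) > len("mente"):
            # Strip the adverbial ending; non-adverbs ending in -mente
            # would need to be excluded by hand in a real run.
            out.append(tok[: -len("mente")])
        else:
            out.append(tok)
        i += 1
    return out


example = lemmatise("sin embargo vamos claramente".split())
```

A lookup table fits the procedure described here because only the types above the frequency threshold were lemmatised by hand; tokens that are not in the table pass through unchanged.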

4.2.1.3 Context words and dimensionality

This is the second variable manipulated in this study. Although several studies assume that semantic information is best captured by contexts consisting of content words and syntactic information by function word contexts (Lowe & McDonald, 2000; McDonald, 2000; Jarmasz, 2003), Levy and Bullinaria (2001) found that adding functors to their context-word set significantly boosted the performance of their metric in a semantic test. This study examines the performance of two context word sets:

1) Content and function words: all word types above a certain frequency threshold.

2) Content words only: the words remaining after removing function words from set (1).

In the 'content word' condition I removed determiners, prepositions and conjunctions, plus the auxiliary verbs ser, estar (be) and haber (have) from the context-word list. Table 4.1 shows the dimensionality of the spaces generated by the different context word sets.

                  surface       lemma
content+funct.    394 (≥200)    523 (≥100)
content only      320 (≥200)    481 (≥100)

Table 4.1. Number of context words (in brackets, threshold frequency) in the surface-form and the lemmatised corpus, when considering all words or content words only.

In a small corpus, a low number of dimensions will yield denser vectors. In order to obtain vectors of similar density with both versions of the corpus, the frequency threshold for the surface form version of the corpus is 200, and that for the lemmatised version is 100.
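The selection of the two context-word sets by frequency threshold can be sketched as below. The function name and the tiny function-word list are assumptions made for illustration; the actual list of determiners, prepositions, conjunctions and auxiliaries removed in the thesis is not reproduced here.

```python
from collections import Counter

# Hypothetical, deliberately incomplete function-word list.
FUNCTION_WORDS = {"el", "la", "de", "en", "y", "que", "a", "ser", "estar", "haber"}


def context_word_sets(tokens, threshold, function_words=FUNCTION_WORDS):
    """Return (all words at or above `threshold`, the same list minus function words)."""
    freq = Counter(tokens)
    all_words = sorted(w for w, n in freq.items() if n >= threshold)
    content_only = [w for w in all_words if w not in function_words]
    return all_words, content_only


# Toy corpus: with a threshold of 2, 'raro' is too rare to be a context word.
toy = ["el"] * 3 + ["gato"] * 2 + ["de"] * 2 + ["raro"]
all_words, content_only = context_word_sets(toy, threshold=2)
```

Under this sketch, the surface-form corpus would be processed with `threshold=200` and the lemmatised corpus with `threshold=100`, matching the thresholds given in Table 4.1.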

4.2.1.4 Window size

The cooccurrence vectors were calculated by transforming the raw cooccurrence counts within a window of five words to the left and five to the right, all conflated in a single value, into probability distributions. Window size is not a variable in this study – its effect has been extensively analyzed for English (§ 4.1.1.2). I chose a window size that generated reasonable results in most English tests, but that was not too small – again, to prevent sparse vectors given the small corpus available. Also, the eleven words contained in this window take approximately 2.5 seconds to pronounce at a naturalistic spontaneous Spanish speech rate of 250 words per minute. This is close to the 2 seconds proposed by Baddeley, Thomson and Buchanan (1975) as the time-span of working memory. This 2.5 second window includes the five words that will be relevant for the processing of the target word, plus the five words in whose processing the target word is involved.

4.2.1.5 The vector spaces

I calculate four vector spaces using the methods and parameters above, to be used in the studies presented in § 4.2.2 and § 4.2.3 below. For each target word I count the occurrences of each context word within a window of five words to the left and five to the right, crossing the two versions of the corpus with the two context-word sets. This results in four conditions:

1. Surface-form corpus, content and functors: the targets and the context words are the same: the 394 word types of frequency greater than or equal to 200 in the surface-form corpus.

2. Surface-form corpus, content words only: the target words are the same as in condition 1; the context words are the 320 content words left after removing functors from the context-word set in condition 1.

3. Lemmatised corpus, content and functors: the targets and the context words are the same: the 523 word types of frequency greater than or equal to 100 in the lemmatised corpus.

4. Lemmatised corpus, content words only: the target words are the same as in condition 3; the context words are the 481 content words left after removing functors from the context-word set in condition 3.
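The four conditions above amount to crossing two corpus versions with two context-word sets. The sketch below enumerates them together with the resulting shape of each space (targets × context dimensions), using the counts reported in Table 4.1; the dictionary layout and the function name `conditions` are illustrative assumptions.

```python
from itertools import product

# Frequency thresholds per corpus version, as stated in the text.
THRESHOLDS = {"surface": 200, "lemma": 100}
CONTEXT_SETS = ("content+function", "content only")

# Number of context words per condition, from Table 4.1.
DIMENSIONS = {
    ("surface", "content+function"): 394,
    ("surface", "content only"): 320,
    ("lemma", "content+function"): 523,
    ("lemma", "content only"): 481,
}


def conditions():
    """List the four conditions; targets are always the full word set."""
    return [
        {
            "corpus": corpus,
            "threshold": THRESHOLDS[corpus],
            "context": context,
            "targets": DIMENSIONS[(corpus, "content+function")],
            "dimensions": DIMENSIONS[(corpus, context)],
        }
        for corpus, context in product(THRESHOLDS, CONTEXT_SETS)
    ]


for c in conditions():
    print(c["corpus"], c["context"], c["targets"], "x", c["dimensions"])
```

Note that removing function words changes only the number of dimensions: the target words stay the same within each corpus version.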

The rest of this chapter explores the performance of these four vector spaces in various syntactic and semantic categorisation tests.
