CHAPTER III DEVELOPMENT
1.7 PROCEDURE AND DESCRIPTION OF THE ACTIVITIES CARRIED OUT
1.7.2 The first stage of 5S is Seiri (Sorting)
The LC-STAR project highlighted the need for phonetic, prosodic, and morpho-syntactic enrichment in pronunciation lexica for voice-driven applications (Hartikainen et al., 2003). All thirteen LC-STAR lexica conform to a language-independent specification with guidelines on coverage, syntax, and phonology, for example:
each lexicon, defined as a set of entry group elements, includes at least 50,000 inflected common-word entries;
generic entries classify wordforms via a basic scheme of 21 PoS with attributes common to several languages;
phonological information takes the form of stressed and syllabified SAM-PA phonetic transcriptions.
As an aside, all function words are assigned primary stress and this is also the default setting for function words in ProPOSEL.
6.4.1. English phonology in the Carnegie-Mellon pronouncing dictionary
The CMU pronouncing dictionary restricts information for each of its 127,069 entries to: orthographic form; a counter denoting pronunciation variant; and an ARPAbet phonetic transcription - the ARPAbet being an American English subset of the International Phonetic Alphabet (IPA). Entries for the inflected form presented, which displays the maximum number (i.e. three) of American English pronunciation variants in this dictionary, are as follows (Example 3).
Example 3
PRESENTED 1 P R IY0 Z EH1 N T AH0 D
PRESENTED 2 P ER0 Z EH1 N T AH0 D
PRESENTED 3 P R IY0 Z EH1 N AH0 D
Interestingly, the phonetic transcriptions in this dictionary do not show how stress affects vowel reduction (Jurafsky and Martin, 2008); hence, the usual ARPAbet symbol for schwa /ax/ does not make an appearance (cf. P R IY0 Z EH1 N T AX0 D and its counterpart in SAM-PA prI'zent@d). Also, while a stress pattern can easily be extracted from CMU's ARPAbet transcriptions, homographs cause problems because there is no syntactic information to distinguish between wordforms which have the same spelling but belong to different word classes. The lemma present is a case in point (Table 6.2).
CMU ENTRY | STRESS PATTERN | WORD CLASS
PRESENT 1 P R EH1 Z AH0 N T | 1 0 | It is not possible to determine automatically from this entry which word class this most common pronunciation and stress pattern belongs to. (A native speaker or advanced learner will know it's either a noun or an adjective.)
PRESENT 2 P R IY0 Z EH1 N T | 0 1 | Pronunciations 2 and 3 for this lemma signify that it's a verb - but how can this be automatically determined from the information given?
PRESENT 3 P ER0 Z EH1 N T | 0 1 |
Table 6.2: Automatic mapping of phonological and syntactic information is not enabled from CMU dictionary entries
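The stress patterns in Table 6.2 can be read off the ARPAbet transcriptions mechanically. The following is a minimal sketch in Python, assuming the whitespace-separated layout shown in Example 3 (wordform, variant counter, then phones); the function name `cmu_stress_pattern` is hypothetical and not part of any CMU tooling.

```python
def cmu_stress_pattern(entry: str) -> str:
    """Return the stress digits of the vowels in one CMU-style entry line."""
    # Assumed layout (as in Example 3): WORD, variant counter, then ARPAbet phones.
    _word, _variant, *phones = entry.split()
    # Vowel phones end in a stress digit: 0 = unstressed, 1 = primary, 2 = secondary.
    # Consonant phones carry no digit and are skipped.
    return "".join(p[-1] for p in phones if p[-1].isdigit())


entries = [
    "PRESENT 1 P R EH1 Z AH0 N T",
    "PRESENT 2 P R IY0 Z EH1 N T",
    "PRESENT 3 P ER0 Z EH1 N T",
]
patterns = [cmu_stress_pattern(e) for e in entries]
print(patterns)  # ['10', '01', '01']
```

The sketch recovers 10, 01 and 01 for the three variants of present, but, as the table notes, nothing in the entry itself says which pattern belongs to the noun, adjective or verb.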
6.4.2. English phonology in CELEX-2
There are some 160,595 English wordforms in the CELEX-2 lexical database. Phonological information is detailed: Example 4 shows an entry for territorial from a CELEX-based epw (English phonology wordforms) directory3.
Example 4
90218\territorial\0\46811\2\P\"tE-r@-'t$-r7l\[CV][CV][CVV][CVVC]\[tE][r@][tO:][rI@l]\S\"tE-rI-'t$-r7l\[CV][CV][CVV][CVVC]\[tE][rI][tO:][rI@l]
Fields of interest are: (1) unique ID number or key; (2) orthographic form; (6) pronunciation status: primary citation form or stylistic variant of same; (7) stressed and syllabified phonetic transcription using the DISC character set; (8) CV (consonant-vowel) pattern; (9) syllabified phonetic transcription using the SAM-PA character set; (10) secondary, less common pronunciation variant.
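As an illustration of the field layout just described, the entry in Example 4 can be split on its backslash delimiter. This is only a sketch: the dictionary keys are hypothetical labels chosen here, and the field positions are taken from the numbering in the text rather than from the CELEX-2 documentation.

```python
# Hypothetical labels for the fields of interest; keys are the 1-based
# field numbers used in the text.
FIELDS_OF_INTEREST = {
    1: "id",
    2: "orthography",
    6: "status",
    7: "disc_transcription",
    8: "cv_pattern",
    9: "syllabified_transcription",
}


def parse_epw_entry(line: str) -> dict:
    """Split one backslash-delimited entry and label the fields of interest."""
    fields = line.split("\\")
    record = {name: fields[num - 1] for num, name in FIELDS_OF_INTEREST.items()}
    # From field 10 onwards the entry describes a less common variant.
    record["variant_fields"] = fields[9:]
    return record


entry = r"""90218\territorial\0\46811\2\P\"tE-r@-'t$-r7l\[CV][CV][CVV][CVVC]\[tE][r@][tO:][rI@l]\S\"tE-rI-'t$-r7l\[CV][CV][CVV][CVVC]\[tE][rI][tO:][rI@l]"""
record = parse_epw_entry(entry)
print(record["orthography"], record["status"], record["cv_pattern"])
```

A raw triple-quoted string is used so that the backslashes and both kinds of quote mark in the transcription survive unchanged.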
Field seven is of particular interest. CELEX-2 provides four different character sets for phonetic transcriptions, including the DISC set, which allows one-to-one mapping between character and distinct phonological segment. The DISC transcription for territorial, /"tE-r@-'t$-r7l/, shows the character /7/ being used to represent a diphthong. Field seven also demonstrates that if the user selects a stressed and syllabified phonetic transcription, irrespective of character set, they will have effectively assigned stresses to syllables and bypassed the problems outlined in Section 3.1. A lexical stress pattern of 2010 (but not, unfortunately, 20100; see Section 4.4) can also be derived for territorial from the DISC transcription in field seven, where /'/ denotes primary and /"/ secondary stress; alternatively, users familiar with Unix can use an awk script to compute this pattern.
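The derivation of a stress pattern from a stressed and syllabified transcription can be sketched as follows. This is a Python illustration rather than the awk route mentioned above, and it assumes that syllables are separated by hyphens and that a stress mark, where present, is the first character of its syllable, as in Example 4; the function name is hypothetical.

```python
def disc_stress_pattern(transcription: str) -> str:
    """Map each syllable to 1 (primary), 2 (secondary) or 0 (unstressed)."""
    digits = []
    for syllable in transcription.split("-"):
        if syllable.startswith("'"):
            digits.append("1")  # /'/ marks primary stress
        elif syllable.startswith('"'):
            digits.append("2")  # /"/ marks secondary stress
        else:
            digits.append("0")  # no mark: unstressed syllable
    return "".join(digits)


print(disc_stress_pattern("\"tE-r@-'t$-r7l"))  # 2010
```

Because the pattern has one digit per syllable in the transcription, a four-syllable transcription of territorial can only ever yield a four-digit pattern such as 2010.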
6.4.3. Variance in pronunciation lexica: lexical stress patterns
The CELEX-2 database lists a number of primary (P) and secondary (S) pronunciation variants for each lemma or wordform, as in this example from the English Linguistic Guide (Burnage, 1990) for passenger, using stressed and syllabified SAM-PA4 transcriptions (Table 6.3).
Variant pecking order | Pronunciation status | SAM-PA transcription
1 | P | "p{-sIn-dZ@r*
2 | P | "p{-sIn-Z@r*
3 | S | "p{-s@n-dZ@r*
4 | S | "p{-s@n-Z@r*
5 | S | "p{-sn,-dZ@r*
6 | S | "p{-sn,-Z@r*
Table 6.3: Primary and secondary pronunciation variants for passenger in CELEX-2
Despite such segmental variation, the lexical stress pattern usually remains constant for a given word of two or more syllables in a given sentence slot - at least in terms of the location of primary stress: here passenger is realised throughout as 100. The citation form for each entry in the CELEX-2 wordforms directory was therefore used as the main generator for lexical stress patterns in ProPOSEL.
Perhaps the notion of one abiding stress pattern for an inflectional form in English needs qualification, however; homographs (§6.5.1) are a special case and there is evidence that dictionaries differ in the assignment of secondary stress and in syllable counts. The Carnegie-Mellon pronouncing dictionary (American English) is comfortable with secondary stress on the final syllable, whereas the Oxford-Longman derived CELEX-2 and CUVPlus (British English) are not (Table 6.4).
4 SAM-PA transcriptions use /"/ for primary stress; the asterisk in "p{-sIn-dZ@r* in
Lexicon | Orthographic Form | Phonetic Transcription | Stress Pattern
CELEX-2 | abolished | @-'bQ-lISt | 010
CUVPlus | abolished | @'b0lISt | 010
CMU | abolished | AH0 B AA1 L IH2 SH T | 012
CELEX-2 | calcify | 'k{l-sI-f2 | 100
CUVPlus | calcify | 'k&lsIfaI | 100
CMU | calcify | K AE1 L S AH0 F AY2 | 102
CELEX-2 | finite | 'f2-n2t | 10
CUVPlus | finite | 'faInaIt | 10
CMU | finite | F AY1 N AY2 T | 12
Table 6.4: Secondary stress is not marked for the wordforms abolished, calcify and finite in either CELEX-2 or CUVPlus, whereas Carnegie-Mellon assigns secondary stress quite readily in these cases.
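The comparison in Table 6.4 can be made explicit with a small check. This Python sketch uses only the stress patterns printed in the table and hypothetical names throughout, and reports two things for each wordform: whether the lexica agree on the position of primary stress, and which lexica mark a secondary stress.

```python
# Stress patterns copied from Table 6.4, keyed by wordform and then lexicon.
TABLE_6_4 = {
    "abolished": {"CELEX-2": "010", "CUVPlus": "010", "CMU": "012"},
    "calcify": {"CELEX-2": "100", "CUVPlus": "100", "CMU": "102"},
    "finite": {"CELEX-2": "10", "CUVPlus": "10", "CMU": "12"},
}


def compare_patterns(patterns: dict) -> dict:
    """Summarise agreement between lexica for one wordform."""
    # Position of the primary-stressed syllable (digit 1) in each lexicon.
    primary_positions = {p.index("1") for p in patterns.values()}
    return {
        "same_primary": len(primary_positions) == 1,
        "secondary_marked_by": sorted(
            name for name, p in patterns.items() if "2" in p
        ),
    }


summary = {word: compare_patterns(p) for word, p in TABLE_6_4.items()}
for word, result in summary.items():
    print(word, result)
```

For all three wordforms the lexica agree on where primary stress falls, and only the CMU pattern contains a secondary stress.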
6.4.4. Variance in pronunciation lexica: syllabification
When it comes to syllabification, Roger Mitton's account (Mitton, 1992) of his difficulties in deciding on syllable counts for some 3000 or so words in CUV2 is illuminating. His problems concerned the /@/ phoneme or schwa in diphthongs, in the middle of words and in words ending in -ion. He compares the sound /aI@/ in higher (definitely 2 syllables) and hire (he is unsure). He opts for one syllable for each of fire/hire/wire/pier/tour but says that '…on another day, [he] might easily have counted them as two…' He juxtaposes gambolling with gambling and gives instances of the word champion realised as 2 and then 3 syllables. Sometimes it's simply a case, he says, of judging whether more or less /@/ seems most natural. Hence Mitton awards territorial five syllables whereas CELEX-2 only gives it four.
Mitton's experience is played out in the dictionaries themselves. Reduced vowels are included in syllable counts but sometimes disappear from phonetic transcriptions. The online version of OALD5 (now in its seventh edition) uses the schwa in its transcription for descendant (3 syllables) but not for iridescent (4 syllables) - and the same goes for CUVPlus (Example 5).
Example 5
descendant|0|dI'send@nt|K6%|NN1:3|3
iridescent|0|,IrI'desnt|OA%|AJ0:1|4
5 Oxford Advanced Learner's Dictionary:
A quick check on the LDOCE6 website verifies that Longman use the schwa in both transcriptions.
Finally, dictionaries are not in accord on syllabification. To give just one instance, the syllable count for memorial is difficult to determine because it contains a diphthong (Mitton's old problem again). The CUV2-derived CUVPlus records a syllable count of 4 for this wordform but the CV pattern and syllabified transcription in CELEX-2 tell a different story. ProPOSEL reflects these discrepancies and this is intentional (Example 6).
Example 6
CUVPlus: memorial|0|mI'mOrI@l|K6%|NN1:16|4
CELEX-2: [CV][CVV][CVVC]\[m@][mO:][rI@l]
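The discrepancy in Example 6 can be checked by counting syllables from each source. This is a minimal Python sketch under two assumptions drawn from the examples above: that the final pipe-separated field of a CUVPlus entry is its syllable count, and that each bracketed group in a CELEX-2 CV pattern stands for one syllable. The function names are hypothetical.

```python
import re


def cuvplus_syllable_count(entry: str) -> int:
    """Read the syllable count stored as the last pipe-separated field."""
    return int(entry.split("|")[-1])


def celex_syllable_count(cv_pattern: str) -> int:
    """Count bracketed groups such as [CV] or [CVVC], one per syllable."""
    return len(re.findall(r"\[[^\]]*\]", cv_pattern))


cuv_count = cuvplus_syllable_count("memorial|0|mI'mOrI@l|K6%|NN1:16|4")
celex_count = celex_syllable_count("[CV][CVV][CVVC]")
print(cuv_count, celex_count)  # 4 3
```

On these assumptions CUVPlus gives memorial four syllables and CELEX-2 gives it three, which is the disagreement that ProPOSEL deliberately preserves.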