CHAPTER III DEVELOPMENT
1.7 PROCEDURE AND DESCRIPTION OF THE ACTIVITIES CARRIED OUT
1.7.3 The second stage of the 5S: Seiton (Order)
As previously stated, the purpose of this work is to integrate information from different dictionaries into one lexicon, customised for language engineering tasks which involve the prosodic-syntactic chunking of text. One such task is automated phrase break prediction: the identification of potential pauses in text which reflect the way in which a native speaker might process or chunk that same text as speech. This is treated as a classification task in supervised machine learning, where junctures (whitespaces) between words in the input text are classified as either breaks (the minority class) or non-breaks. The machine learner is trained on boundary-annotated text, the hand-labelled speech corpus or "gold standard", and then tested on an unseen reference dataset from the same corpus, minus the boundaries, to see how many junctures have been correctly classified.
6.6.1. The importance of PoS tags
Training and testing language models on a "gold standard" corpus which exemplifies the rules and structures to be learned is an approach widely used in NLP (e.g. in PoS-tagging, parsing, and semantic representation such as thematic role labelling). This approach depends on PoS-tagged text; the sentence fragment below (Example 10) is taken from Section A09 (informal mid-1980s BBC radio news commentary) of MARSEC, the Machine Readable Spoken English Corpus (Roach et al., 1993) and shows syntactic annotations from the LOB tagset and boundary annotations in the form of pipe symbols: (|) for tone unit boundary and (||) for pause (Roach, 2000).
Example 10
internal/JJ leaders/NNS | who've/WP+HV come/VBN together/RB to/TO form/VB a/AT new/JJ government/NN | to/TO get/VB on/RP with/IN it/PP3 ||
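The way such a fragment feeds the classification task can be illustrated with a minimal sketch. It assumes the word/TAG and pipe conventions of Example 10, and it collapses tone unit boundaries (|) and pauses (||) into a single break class, which is a simplification made here for illustration. The function name `parse_annotated` and the labels "B" and "N" are hypothetical.

```python
def parse_annotated(fragment):
    """Turn 'word/TAG | word/TAG ||' text into (word, tag, label) triples.

    The label describes the juncture that follows the word:
    'B' if a boundary symbol (| or ||) comes next, otherwise 'N'.
    """
    rows = []
    for item in fragment.split():
        if item in ("|", "||"):
            # A boundary symbol marks the juncture after the previous word.
            if rows:
                word, tag, _ = rows[-1]
                rows[-1] = (word, tag, "B")
        else:
            # Split on the last slash so the tag is always the final field.
            word, _, tag = item.rpartition("/")
            rows.append((word, tag, "N"))
    return rows


example_10 = (
    "internal/JJ leaders/NNS | who've/WP+HV come/VBN together/RB to/TO "
    "form/VB a/AT new/JJ government/NN | to/TO get/VB on/RP with/IN it/PP3 ||"
)
rows = parse_annotated(example_10)
breaks = [word for word, tag, label in rows if label == "B"]
```

Stripping the labels from `rows` gives the kind of boundary-free test input described above, while the labels themselves serve as the reference against which predictions are scored.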
Therefore, a dictionary for NLP and linkage with corpora needs discriminating word class information in the form of PoS tags rather than categories based on the traditional eight parts-of-speech; CELEX-2, for instance, only uses nine categories to classify English lemmas (Burnage, 1990): {Noun, Adjective, Quantifier/Numeral, Verb, Pronoun, Adverb, Preposition, Conjunction, Interjection}. The LOB tagset captures fine-grained distinctions - on and with are tagged as particle (RP) and preposition (IN) respectively in the string get on with in Example 10 - and offers a choice of tags for the same word depending on its function or sentence-slot (Atwell, 2008). This is important for prosody - cf. the discussion in Section 4.3.6 of prepositional phrase attachment and its implications for prosodic-syntactic chunking. The C5 tagset used in CUVPlus, while sparser than LOB, retains this discriminatory characteristic; as noted, it has a separate tag for of (PRF) as distinct from other prepositions (PRP), which may emerge as a useful refinement for phrase break prediction.
6.6.2. CFP status
Phrase break classifiers have been trained on additional text-based features besides PoS tags. The CFP status of a token - is it a content word (e.g. nouns or adjectives) or function word (e.g. prepositions or articles) or punctuation mark? - has proved to be a very effective attribute in both deterministic and probabilistic models (Liberman and Church, 1992; Busser et al., 2001) and therefore a default content-word/function-word tag is assigned to each entry in the prosody lexicon in field ten. It is anticipated that further research will suggest modifications to this default status when the CFP attribute interacts with other text-based features. For example, the word against is a function word but it is also bi-syllabic and likely to carry word stress - different, therefore, from function words that 'disappear' prosodically due to vowel reduction. The second entry for can in the Carnegie-Mellon pronouncing dictionary indicates that this is what happens when, presumably, can is functioning as a modal auxiliary (Example 11).
Example 11
CAN 1 K AE1 N (full vowel probably signifies noun)
CAN 2 K AH0 N (no schwa, no word class but looks like vowel reduction in can the verb)
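A default CFP assignment of the kind described in this section could be sketched as follows. The list of tag prefixes treated as content-word classes is an assumption made for illustration and is deliberately incomplete; it is not the mapping used to populate field ten of the lexicon. The function name `cfp_status` and the one-letter labels are likewise hypothetical.

```python
# Hypothetical, incomplete set of LOB tag prefixes treated as content-word
# classes (nouns, proper nouns, adjectives, lexical verbs, adverbs).
# Any other tag defaults to function word.
CONTENT_PREFIXES = ("NN", "NP", "JJ", "VB", "RB")


def cfp_status(word, tag):
    """Return 'P' (punctuation), 'C' (content word) or 'F' (function word)."""
    # A token with no letters or digits is treated as punctuation.
    if not any(ch.isalnum() for ch in word):
        return "P"
    if tag.startswith(CONTENT_PREFIXES):
        return "C"
    return "F"


statuses = [
    cfp_status("leaders", "NNS"),
    cfp_status("with", "IN"),
    cfp_status("on", "RP"),
    cfp_status(",", ","),
]
```

A default of this sort says nothing about cases such as against, where a function word still carries stress; that interaction would have to come from other fields in the lexicon.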
6.6.3. Syllable count and lexical stress
Syllable counts have already been used in phrase break models for English (Atterer and Klein, 2002; Schmid and Atterer, 2004). Using raw counts rather assumes that syllables are uniform in duration, whereas we know that in connected speech an indefinite number of unstressed syllables are packed into the gap between one stress pulse (Mortimer, 1985) and another, English being a stress-timed language (Hirst, 2009). A lexical stress pattern, capturing both syllabification and stress distribution (rhythmic structure) in a simple abstract form, has therefore been included for each entry in the prosody-PoS lexicon because of its potential as a classificatory feature in the machine learning task of phrase break prediction. This expectation is further supported by the presence of rhythmic annotation tiers in the Aix-MARSEC corpus project (Hirst et al., 2000; Auran et al., 2004), with its focus on speech synthesis applications and the theoretical modelling (acoustic, phonetic and phonological) of intonation and speech prosody.
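To make the notion of an abstract stress pattern concrete, the sketch below derives one from a pronunciation written in the Carnegie-Mellon style of Example 11, where each vowel phone ends in a stress digit. This is an illustration only: the notation actually stored in the lexicon may differ, the function names are hypothetical, and the transcription given for against is an assumed one.

```python
def stress_pattern(pronunciation):
    """Collect the stress digit of each vowel phone, e.g. 'K AE1 N' -> '1'."""
    return "".join(
        phone[-1] for phone in pronunciation.split() if phone[-1].isdigit()
    )


def syllable_count(pronunciation):
    """One syllable per vowel phone, i.e. per stress digit."""
    return len(stress_pattern(pronunciation))


full_can = stress_pattern("K AE1 N")        # first entry in Example 11
reduced_can = stress_pattern("K AH0 N")     # second entry in Example 11
against = stress_pattern("AH0 G EH1 N S T") # assumed transcription
```

The two entries for can have the same syllable count but different patterns, which is exactly the distinction a bare syllable count loses.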
6.6.4. Prior knowledge for machine learning
One of the thematic programmes for PASCAL10 (2008) identifies a current interest in, and trend towards, leveraging a priori knowledge to enhance performance in machine learning in a variety of application domains, including text and language processing. ProPOSEL is customised for corpus-based research and, specifically, for populating raw training data (i.e. the tokenized corpus text) with a priori knowledge gathered and cross-referenced from widely-used lexica. Predicting phrase boundaries at the prosody-syntax interface is a notoriously complex task for machine learning because of the inherent variance of prosody (cf. Taylor and Black, 1998; Atterer and Klein, 2002; Chapter 5). Planned research into the phrase break prediction task will attempt to incorporate a dictionary-derived feature such as lexical stress pattern (field eight in ProPOSEL) into a data-driven model to explore this interface more fully, although using just the raw pattern would entail one hundred and twenty-four separate values for this feature (i.e. the set of lexical stress patterns in the lexicon).
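Populating tokenized text with dictionary-derived knowledge might look something like the sketch below. The toy lexicon, its keys, its field names and its values are all hypothetical stand-ins and are not taken from ProPOSEL; the point is only to show each juncture acquiring features by lookup on a (word, tag) pair.

```python
# Toy stand-in for lexicon lookups; keys, field names and values are
# hypothetical and chosen only to make the example run.
TOY_LEXICON = {
    ("new", "JJ"): {"stress": "1", "cfp": "C"},
    ("government", "NN"): {"stress": "100", "cfp": "C"},
    ("to", "TO"): {"stress": "0", "cfp": "F"},
}
UNKNOWN = {"stress": "UNK", "cfp": "UNK"}


def juncture_features(tagged, i):
    """Features for the juncture between tagged[i] and tagged[i + 1]."""
    left, right = tagged[i], tagged[i + 1]
    left_entry = TOY_LEXICON.get(left, UNKNOWN)
    right_entry = TOY_LEXICON.get(right, UNKNOWN)
    return {
        "left_tag": left[1],
        "right_tag": right[1],
        "left_cfp": left_entry["cfp"],
        "right_cfp": right_entry["cfp"],
        "left_stress": left_entry["stress"],
        "right_stress": right_entry["stress"],
    }


tagged = [("new", "JJ"), ("government", "NN"), ("to", "TO")]
features = [juncture_features(tagged, i) for i in range(len(tagged) - 1)]
```

Used this way, each distinct pattern string is one categorical value, so the raw feature has as many values as there are patterns in the lexicon.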