CHAPTER 2: SPATIO-TEMPORAL DATA MINING TECHNIQUES
2.2 Methodologies for detecting regions of interest (RoI) in a trajectory
2.3.1 Geometry-based spatio-temporal data mining techniques:
The previous section made clear that coarticulation, understood as the mutual interaction of successively pronounced phonemes, has typically been considered only as a distortion of intended phonemes. Yet, if coarticulation “disturbs” phonemes, one should then expect that isolated phonemes are more readily identified than are phonemes in running speech.
That coarticulation may actually help rather than hinder phonemic identification was dramatically demonstrated soon after the introduction of the spectrograph by Harris (1953). He found, for example, that if he edited the /d/ from the real word /dik/ and put it together with the /æk/ from /kæk/, creating the new word /dæk/, the initial consonant in the newly synthesized word not only sounded unnatural, but was essentially unintelligible. He concluded:
To synthesize speech with reasonable naturalness, the influence factor should be included. Here these influences can be approximated by employing more than one building block to represent each linguistic element and by selecting these blocks properly, taking into account the spectral characteristics of adjacent sounds so as to approximate the time pattern of the formant structure occurring in ordinary speech. (p. 962)
Harris’s work, carried out at Bell Telephone Laboratories in cooperation with its director R. K. Potter, offered one explanation as to why spectrograms were so difficult to read. The “influence factor,” Harris’s term for coarticulation, indicated how tightly coupled successive phonemes are with regard to their perception. Although these experiments showed that the perception of consonants is strongly dependent on coarticulation with neighboring vowels, the findings did not address the question of whether the phenomenon as such should be evaluated as a “positive” or “negative” factor in speech perception.
The role of coarticulation has been investigated extensively for the vowels. For example, if coarticulation impeded the identification of phonemes (had a “negative” effect), one would expect that vowel identification would be more accurate for sustained vowels as opposed to vowels coarticulated with consonants. Strange, Verbrugge, Shankweiler, and Edman (1976) tested this hypothesis by comparing listeners’ accuracy in identifying isolated vowels versus vowels pronounced in syllables such as /pip/, /pIp/, and so on. The result was the opposite of the “negative” effect prediction: 69% of the isolated vowels were correctly identified, as opposed to 91% of the syllabic-nucleus vowels. These scores were obtained for vowels pronounced by the same speaker. Mixing the utterances of different speakers resulted in lower scores of 57% and 83%, respectively, but confirmed the same finding: interactions with neighboring consonants can contribute in a positive way to vowel identification.
This unexpected finding initiated a number of investigations in which the contributions of initial and final consonants to vowel identity were studied in more detail. For example, Strange and Bohn (1998) tested 14 different German vowels as pronounced in the syllable /d(vowel)t/. German was chosen for this experiment because the formant frequencies of its vowels are much more consistent than in American English (i.e., they form much better monophthongs). Possible additional advantages include the number of vowels, and the fact that they were pronounced in the context of the carrier sentence Ich habe /d(vowel)t/ gesagt (I said /d(vowel)t/). Each individual target word was pronounced twice by the same German speaker, and each of these tokens was then split into three parts representing the initial consonant, the vowel center, and the final consonant (roughly 25%, 50%, and 25%, respectively, of the original word’s duration). These edited fragments were then presented to different groups of German listeners, who were asked to match each presented fragment against a list of target words printed in standard German orthography. Their average correct scores, ordered from lowest to highest, are plotted in Fig. 4.4.
As indicated in the figure, the lowest scores were obtained when only the initial or the final 25% of the word was presented. When segments of the central, steady-state portion of the vowels were presented (all vowels adjusted to the same duration), only 53% of the original words were correctly identified. A much higher score (70%) was obtained when the initial and final 25% “consonant” portions were presented together, separated by a silent center adjusted to such a length that the total “word” durations were the same for all stimuli.
Note that in these conditions exclusively spectral information was presented, whereas the other three conditions also contained information about the duration of the spoken vowel. The first was the vowel center, with its actual duration maintained. The score of 85% is indeed much higher than when duration information is lacking but, surprisingly, still lower than the score of 90% obtained for the initial and final consonants where the syllable center was replaced by a silent interval.
These results clearly demonstrate that syllable fragments, even when edited to represent only consonants, contain substantial information about neighboring vowels. The fact that medial vowel segments when presented alone led to relatively low vowel-identification scores does not support the view, held by most speech scientists in the past, that sustained vowels are the ideal form of a vowel, imperfectly realized in actual speech. In fact, Strange (1989) proposed an alternative theory, that the articulatory specification of a vowel is not only determined by a specific spectral target but also by a characteristic temporal movement pattern of the vocal tract. That is, the vowel gesture is still specified independently of the preceding and following consonant gestures but, due to the considerable temporal overlap of vowel and consonant movements, the formant trajectories are a joint function of both consonant and vowel gestures. Thus Strange’s explanation is still very much couched in the terms of the motor theory of speech perception.
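The stimulus conditions used by Strange and Bohn (1998), as described above, can be illustrated with a minimal Python sketch. It assumes each token is available as a one-dimensional NumPy array of samples and that the cut points fall at fixed fractions of the token (the text gives roughly 25%, 50%, and 25%). The function names `split_token` and `silent_center` are hypothetical, and the sketch leaves out details the text does not specify, such as how the cut points were actually chosen or whether the edges were smoothed.

```python
import numpy as np


def split_token(signal, onset_frac=0.25, offset_frac=0.25):
    """Split a 1-D waveform into initial, center, and final portions.

    The fixed fractions are an assumption based on the 'roughly 25%, 50%,
    25%' description; they are not the study's actual segmentation rule.
    """
    n = len(signal)
    start = int(round(n * onset_frac))       # end of the initial portion
    stop = n - int(round(n * offset_frac))   # start of the final portion
    return signal[:start], signal[start:stop], signal[stop:]


def silent_center(signal, center_len=None):
    """Keep the initial and final portions and silence the center.

    center_len=None keeps the natural center duration (duration cue kept);
    an integer sets the same number of silent samples for every token
    (duration cue removed).
    """
    initial, center, final = split_token(signal)
    gap = len(center) if center_len is None else center_len
    silence = np.zeros(gap, dtype=signal.dtype)
    return np.concatenate([initial, silence, final])
```

Calling `silent_center(token)` keeps the original timing of the two consonant portions, whereas `silent_center(token, center_len=k)` with one shared `k` gives every stimulus the same silent interval, which corresponds to the distinction the text draws between conditions with and without duration information.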
The conclusion that coarticulation contributes to vowel identification does not automatically mean that this is also true for consonants. Nittrouer and Studdert-Kennedy (1987) created a synthetic /ʃ/–/s/ continuum followed by one of four natural vocalic portions: /i/ and /u/ produced with transitions appropriate for either /ʃ/ or /s/. For listeners, they recruited adults and also children between the ages of 3 and 7 years. Results of the testing indicated that perceptual sensitivity to certain forms of coarticulation seems to be present from a very early age, and the authors concluded that it may therefore be intrinsic to the process of speech perception.
FIG. 4.4. Percentage of correctly identified vowels with only the indicated parts of the word /d(vowel)t/ presented to the listeners (based on data from Strange & Bohn, 1998).
A quite different approach was followed by Diehl and coauthors (Diehl, Kluender, Foss, Parker, & Gernsbacher, 1987). They synthesized randomized lists of /b(vowel)s/, /d(vowel)s/, and /g(vowel)s/ syllables, utilizing 10 different vowels, and presented them to three groups of listeners. The first group was instructed to push a button immediately upon recognizing the initial consonant /b/, the second group responded to /d/, and the third was given /g/ as a target. Results indicated that reaction time (RT) correlated positively with the duration of the following vowel (i.e., RTs were longer when the vowel was longer). The authors interpreted this as suggesting that “consonant recognition is vowel dependent” and, more specifically, that “a certain amount or proportion of the vowel formant trajectory must be evaluated before consonants can be reliably identified” (p. 570). Again, coarticulation appeared to be contributing in a positive way to identification.
In a later study, van Son and Pols (1995) investigated whether the contribution of coarticulation to identification is restricted to the influence of immediately neighboring phonemes or extends over a larger range. In contrast to the studies already discussed, these authors used fragments taken from a longer, read text rather than isolated words. In addition, they tested a large number of different consonants and vowels. The results provided strong evidence that the identification of both vowels and consonants can be improved by acoustic information from beyond the boundaries of the transitions to neighboring phonemes. It was found that information from the speech ahead of the target segment improved identification more than information from speech after the segment, even when the transition boundaries were exceeded.
Finally, it is important to recognize the role of silent gaps in phoneme identification. For example, Best, Morrongiello, and Robson (1981) observed that hearing “say” or “stay” depends on the duration of the silent gap (preceding the /t/) in the word, indicating the significance of the overall dynamical structure in distinguishing /s/ from /st/.
We may conclude on the basis of the evidence presented in this section that coarticulation should be seen as a natural and contributing rather than a disturbing phenomenon. It is clear that the apparently strong intermingling of spectral as well as temporal features of neighboring phonemes makes it more and more difficult to insist that discrete phonemes are the basic units of speech perception. I return to this question at the end of this chapter, after addressing some quite different approaches to the phoneme problem.