4. RESULTS
4.2. DISCUSSION
Coarticulation can be loosely described as the effect of adjacent phonemes on each other’s articulation (Massaro, 1998:372-386). The influence of the preceding segment is known as perseverative coarticulation, while the influence of the upcoming segment is called anticipatory coarticulation. As an example, consider the difference in the articulation of ‘t’ in boot and in beet. The mouth position at the point of articulation of ‘t’ is visibly different: the lips are still rounded after the vowel of boot and spread after the vowel of beet. There is also a more subtle difference in the sound of the phoneme.
The mouth cannot assume the perfect posture for every phoneme in a word because of the speed at which words are uttered. While forming the posture for a phoneme, the following phoneme posture is anticipated. In this process, the posture for the current phoneme is compromised to a certain degree, in order to be able to assume the posture for the following phoneme in time.
This has a significant impact on visual speech synthesis and needs to be considered if one wishes to achieve a realistic facial animation. We discuss two approaches to coarticulation: a three-step algorithm by Pelachaud (1991) and a dominance/blending algorithm by Massaro (1998:376-386).
Pelachaud’s (1991) algorithm is based on the deformability of the lips, that is, the degree of influence that neighbouring phonemes are allowed to have over the phoneme under review. Different phonemes have different deformability: ‘f’ is among the least deformable phonemes, while ‘m’ is among the most deformable. In this coarticulation algorithm, deformability also depends on the speed of speech. The slower the speech, the more time the lips have to form each shape, so the effect of coarticulation diminishes.
The basic idea of this approach is identifying and examining a highly visible vowel before and after the phoneme under review. The viseme shape of the phoneme is adjusted to conform to the shapes of the two neighbouring vowels.
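As a rough illustration of this idea, the sketch below pulls the lip shape of a phoneme towards the mean shape of the nearest vowel before and after it, in proportion to the phoneme's deformability. The single lip-opening parameter, the vowel set, the numeric deformability values, the `speed` scaling and the linear blend are all assumptions made for illustration; none of them is taken from Pelachaud (1991), and the function names are hypothetical.

```python
# Illustrative sketch only: the vowel set, the numeric values and the
# linear blend are assumptions, not Pelachaud's published rules.
VOWELS = {"a", "e", "i", "o", "u"}


def nearest_vowel(segments, index, step):
    """Return the index of the nearest vowel before (step=-1) or
    after (step=+1) position `index`, or None if there is none."""
    i = index + step
    while 0 <= i < len(segments):
        if segments[i] in VOWELS:
            return i
        i += step
    return None


def adjust_shape(segments, shapes, deformability, index, speed=1.0):
    """Blend the lip shape at `index` towards its neighbouring vowels.

    shapes: target lip opening per segment (0 = closed, 1 = fully open)
    deformability: per-phoneme value in [0, 1] (assumed scale)
    speed: relative speech rate; slower speech (< 1) weakens the effect
    """
    neighbours = [nearest_vowel(segments, index, -1),
                  nearest_vowel(segments, index, +1)]
    neighbours = [n for n in neighbours if n is not None]
    if not neighbours:
        # No vowel to conform to: keep the phoneme's own shape.
        return shapes[index]
    vowel_mean = sum(shapes[n] for n in neighbours) / len(neighbours)
    weight = min(1.0, deformability[segments[index]] * speed)
    return (1.0 - weight) * shapes[index] + weight * vowel_mean
```

With a highly deformable ‘m’ between two open vowels, the blended shape moves most of the way towards the vowels, whereas a barely deformable ‘f’ stays close to its own target. Halving `speed` halves the pull in both cases.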
In addition, the lapse of time between two phonemes is considered. Suppose that we observe two consecutive phonemes a and b. Each phoneme has its articulation time. This time can be functionally divided into three parts:
- Time required for the facial muscles to contract into the correct shape for articulation;
- The actual articulation time, in which the viseme does not change; and
- Relaxation time, during which the lips restore their neutral form.
If the relaxation time of phoneme a, added to the contraction time of phoneme b, is longer than the time that ordinary speech allows for articulation, phoneme a visually influences phoneme b: the lips have to start contracting to articulate b somewhere along the relaxation path of phoneme a. Where exactly this occurs depends on various factors, such as the position in the word in question, and the accent and language of the speaker.
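The timing condition can be sketched as a simple comparison. In the sketch below, the `Timing` class, the `gap` parameter (the time available between the end of the articulation phase of a and the start of the articulation phase of b) and the choice to take the whole overlap out of the relaxation of a are assumptions made for illustration; as noted above, the actual point on the relaxation path depends on several further factors.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    contraction: float  # seconds for the muscles to reach the target shape
    hold: float         # seconds during which the viseme does not change
    relaxation: float   # seconds for the lips to return to neutral


def overlap(a: Timing, b: Timing, gap: float) -> float:
    """Time by which a's relaxation plus b's contraction exceeds `gap`.

    A positive result means that phoneme a visually influences phoneme b.
    """
    return max(0.0, a.relaxation + b.contraction - gap)


def relaxation_fraction_reached(a: Timing, b: Timing, gap: float) -> float:
    """Fraction of a's relaxation path completed when b starts contracting.

    Assumption: the whole overlap is taken out of a's relaxation phase.
    """
    if a.relaxation == 0:
        return 1.0
    completed = a.relaxation - overlap(a, b, gap)
    return max(0.0, completed / a.relaxation)
```

A fraction of 1.0 means the lips return fully to neutral before b begins, so there is no visible influence. Smaller values mean b starts from a shape still partly determined by a.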
Elements of this approach have been put into practice by Pelachaud, Badler and Steedman (1996). Their system was discussed earlier in Section 9.2.1. The authors admitted that it is often not enough to analyse the segments immediately before and after the current one, as the current position can depend on up to five segments before or after (Figure 109).
[Figure 109 flowchart; box labels in reading order. The YES/NO branch structure is not recoverable from the text.]
- if lip shape for segment not already computed
- apply lip movement rules
- if forward or backward rules apply
- find vowel
- complete list of AUs and their intensity
- if phoneme is emphasized or accented
- increase intensity of AU
- apply spatial and temporal constraint
Figure 109: Lip shape computation and coarticulation algorithm (Pelachaud, Badler and Steedman, 1996).
Massaro’s (1998:376-386) approach used dominance and blending functions. He took a phoneme and its timing information to produce key-frames at specific intervals. Each speech segment has a varying degree of dominance over the articulators, calculated by a function for each articulator-phoneme combination. Using this function, it is possible to determine the position of each articulator at any given point in time. A segment’s dominance decreases with temporal distance from the centre of the segment. The lip and tongue position is the average of the targets of all segments acting within a given time frame, weighted by their dominances.
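The dominance and blending idea can be sketched in a few lines. The exponential decay used for the dominance function, the parameter names `magnitude` and `rate`, and the tuple layout are assumptions made for illustration. The text above only states that dominance falls with time distance from the centre and that positions are a dominance-weighted average.

```python
import math


def dominance(t, centre, magnitude=1.0, rate=10.0):
    """Dominance of a segment at time t, decaying with the distance
    from the segment centre (exponential decay is an assumed form)."""
    return magnitude * math.exp(-rate * abs(t - centre))


def blend(t, segments):
    """Dominance-weighted average of segment targets at time t.

    segments: list of (centre_time, target_value, magnitude, rate)
    tuples for one articulator parameter, e.g. lip opening.
    """
    weights = [dominance(t, c, m, r) for c, _, m, r in segments]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no segment has any dominance at this time")
    weighted = sum(w * target
                   for w, (_, target, _, _) in zip(weights, segments))
    return weighted / total


def key_frames(segments, start, end, interval):
    """Sample the blended parameter at fixed intervals from start to end."""
    count = int(round((end - start) / interval)) + 1
    return [blend(start + i * interval, segments) for i in range(count)]
```

For two equally dominant segments, the blended value midway between their centres is the mean of the two targets. Closer to one centre, that segment's target prevails without ever fully excluding its neighbour.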
This method is still considered superior to the other existing ones (Albrecht, Haber and Seidel, 2002). Albrecht, Haber and Seidel listed the advantages of Massaro’s algorithm: low memory usage, no neural network training, convincing results and fast animation. They themselves used Massaro’s algorithm and improved on it by adopting the muscle-based facial animation model described by Kahler, Haber and Seidel (2001). They also restricted the influence of a segment to only seven preceding or following segments, in order to reduce computational overhead.
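The restriction to a fixed neighbourhood can be pictured as a simple window over the segment list. This is a minimal sketch and not Albrecht, Haber and Seidel's implementation; the function name and the list representation are assumptions, and only the window size of seven comes from the text.

```python
def influencing_segments(segments, index, window=7):
    """Return the segments within `window` positions of `index`,
    including the segment at `index` itself."""
    start = max(0, index - window)
    return segments[start:index + window + 1]
```

However long the utterance, each position is then evaluated against at most 2 × 7 + 1 = 15 segments, so the cost per key-frame stays bounded.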
Ezzat, Geiger and Poggio (2002) included a solution to the coarticulation problem based on artificial intelligence principles (using a gradient descent learning procedure). While the actual technique is similar to Cohen and Massaro’s (1998:376-386), its main benefit is that it does not need human intervention.
9.4 Conclusion
Speech communication is at the very least bi-modal, as humans perceive talking faces both visually and aurally. If either of the two aspects lacks realism, the animation attempt will fail. Synthetic speech is outside the scope of this text, as the emphasis is primarily on facial animation.
Lip synchronisation can be either manual or automated. Manual lip synchronisation is simple but laborious, and the quality of the result depends to a great extent on the artistic skill of the animator. Automated lip synchronisation is far more complicated, but less dependent on human intervention and ability. According to the data acquisition method, lip synchronisation can be divided into three areas: text-, speech- and image-driven synchronisation. The speech- and image-driven approaches both eventually rely on a text-driven method. Each of these two approaches has its loyal followers, and it is unclear which produces better results. Intuitively, using both independently would enhance recognition by providing two points of reference.