9.4.1 Method
The aim of the experiments described in this section was to obtain an indication of the ability of static and linear trajectory models to describe typical observed feature vector sequences within segments. The studies were therefore based on a small subset of the data, using a few examples of each digit. For each of the eight mel-cepstrum features and for the average amplitude feature, the frame-by-frame observed values were plotted superimposed on the calculated model values and time-aligned with the segment labels and filterbank output. The approximations were compared for the three types of model.
Figure 9.2a: Frame-by-frame values (solid lines) superimposed on calculated model values (dotted lines) according to standard HMM modelling assumptions. The tracks represent the lowest eight mel-cepstrum features and average amplitude feature for a digit sequence "zero three", time-aligned with the speech waveform, spectrographic display of filterbank analysis, and phone-state labels.
Figure 9.2b: Frame-by-frame values (solid lines) and calculated model values (dotted lines) according to static segmental HMM modelling assumptions. The tracks can be compared with those of Figure 9.2a,c.
Figure 9.2c: Frame-by-frame values (solid lines) and calculated model values (dotted lines) according to linear segmental HMM modelling assumptions. The tracks can be compared with ...
9.4.2 Results and discussion
Some example plots for the three model types are shown in Figure 9.2. It can be seen that the conventional HMM approach (Figure 9.2a) follows the general characteristics of each speech sound, but that an average over all frames of all examples is often quite a poor match for any one particular frame. By incorporating static segmental modelling assumptions (Figure 9.2b), individual examples are matched more closely. When the linear model is applied (Figure 9.2c), the model generally follows the pattern of change of the observed feature vectors very well. For the overall energy feature and for the lower-order cepstral features, the match to the frame-by-frame observed values is remarkably close. The higher-order cepstral features (from around the sixth upwards) tend to change less smoothly and there is therefore some loss of detail in the linear approximation.
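The difference between the three kinds of approximation can be illustrated with a minimal sketch in Python. It is not the procedure used to produce Figure 9.2: the data values are made up, the function names (`pooled_fit`, `static_fit`, `linear_fit`) are hypothetical, and the linear trajectory is assumed here to be an ordinary least-squares line through the frames of one segment.

```python
from statistics import mean

# Made-up values of one feature over two example segments of the same sound
# (illustrative only; not taken from the experiments).
examples = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]


def pooled_fit(frames, all_examples):
    """Conventional-HMM-style value: the mean over all frames of all examples."""
    pooled_mean = mean(y for segment in all_examples for y in segment)
    return [pooled_mean] * len(frames)


def static_fit(frames):
    """Static segmental value: the mean of this one segment, constant over time."""
    segment_mean = mean(frames)
    return [segment_mean] * len(frames)


def linear_fit(frames):
    """Linear trajectory: least-squares straight line through this segment."""
    n = len(frames)
    if n < 2:
        return list(frames)  # a single frame is matched exactly
    t_mid = (n - 1) / 2  # time is measured from the segment mid-point
    y_mid = mean(frames)
    num = sum((t - t_mid) * (y - y_mid) for t, y in enumerate(frames))
    den = sum((t - t_mid) ** 2 for t in range(n))
    slope = num / den
    return [y_mid + slope * (t - t_mid) for t in range(n)]


def mean_squared_error(observed, fitted):
    """Average squared distance between observed and model values."""
    return mean((o - f) ** 2 for o, f in zip(observed, fitted))


segment = examples[0]
errors = {
    "pooled": mean_squared_error(segment, pooled_fit(segment, examples)),
    "static": mean_squared_error(segment, static_fit(segment)),
    "linear": mean_squared_error(segment, linear_fit(segment)),
}
print(errors)  # {'pooled': 2.25, 'static': 1.25, 'linear': 0.0}
```

In this toy case the error falls from the pooled mean to the segment mean to the fitted line, which mirrors the ordering seen in the plots. Real feature tracks are not exactly linear, so some residual error would remain for the linear fit.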
Overall, it can be concluded from the trajectory plots that, not surprisingly, a dynamic model is necessary to follow the time-evolving nature of acoustic features. It appears that, for models with three segments per phone using mel-cepstrum features, a linear model should be adequate to capture the characteristics of these changes, especially as any additional variation around the linear trajectory will be modelled by the intra-segment variance. The adequacy of a linear model is supported by Gish and Ng (1993), who found that a linear trajectory was sufficient for most sounds, even when only using one segment per phone. On the other hand, Deng, Aksmanovic, Sun and Wu (1994) have argued for the use of higher-order polynomials, at least for some speech sounds, although their linear models used no more than two states per phone. A higher-order polynomial should allow fewer states to be used to represent each phone and hence make greater use of the segmental-model constraints, but the current studies suggest that a linear model makes a good starting point.
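The relationship between the static, linear and higher-order polynomial trajectories discussed above can be written down in a generic form. The notation is an assumption introduced for illustration and is not taken from the cited papers: y_t is the value of one feature at frame t of a segment of T frames, \bar{t} is the segment mid-point, and \sigma^2 is the intra-segment variance.

```latex
% Polynomial trajectory of order P for one feature within one segment
% (generic illustrative notation, not the notation of the cited work)
y_t = \sum_{p=0}^{P} c_p \,(t - \bar{t})^{p} + \epsilon_t ,
\qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2),
\qquad t = 1, \dots, T .

% P = 0: static model, a constant value c_0 for the whole segment
% P = 1: linear model, mid-point value c_0 and slope c_1
% P \ge 2: higher-order polynomial trajectories
```

Under this reading, whatever the order-P trajectory does not capture is absorbed by \epsilon_t, so a lower-order model trades a simpler trajectory for a larger intra-segment variance.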