4.4.1 ASR
In this research we investigated how different configurations of the modulation filterbank affect recognition performance. To deepen our understanding of the
Chapter 4. Human-inspired Modulation Frequency Features for Noise-robust ASR 107
[Figure 4.10, panels (a) and (b). Legend entries: Lin18-EMS features, sparse classification, 1-frame window; non-std-equalized Lin18-EMS features, sparse classification, 1-frame window; Mel features, sparse classification, 30-frame window; MFCCs, Aurora2 multi-condition recognizer; MFCCs, ETSI-AFE multi-condition recognizer.]
Figure 4.10: Word recognition accuracy per test set as a function of SNR for four different systems: (1) the proposed EMS features (Lin18-EMS); (2) sparse classification using Mel-spectra features (Gemmeke et al., 2011b); (3) the Aurora2 multi-condition recognizer applied to MFCC features (Hirsch and Pearce, 2000); (4) the ETSI-AFE multi-condition recognizer applied to MFCC features (Hirsch and Pearce, 2006).
strengths and weaknesses of the combination of EMS features and SC, we compared the performance on test sets A and B in aurora-2 with previously published recognition accuracies of three other systems: the ‘standard’ aurora-2 system trained with the multi-condition data (Hirsch and Pearce, 2000), the multi-condition aurora-2 system that includes the Wiener-filter-based ETSI advanced front-end (Hirsch and Pearce, 2006), and the SC-based system of Gemmeke et al. (2011b). The first two systems use GMMs based on MFCC features to estimate state posterior probabilities, while the third uses stacks of up to 30 frames of Mel-frequency energy spectra as features and non-negative matrix factorization with the Kullback-Leibler divergence as the solver in the sparse coding engine.
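As an illustration of what a sparse coding engine with a Kullback-Leibler solver can look like, the following minimal sketch estimates non-negative exemplar activations for a single observation with a fixed dictionary. It uses the standard multiplicative update for the generalized KL divergence with an L1 sparsity penalty. The function names, the random dictionary, the penalty weight and the iteration count are assumptions made for this example; they are not the settings of Gemmeke et al. (2011b) or of the system described here.

```python
import numpy as np

def kl_divergence(y, y_hat, eps=1e-12):
    """Generalized Kullback-Leibler divergence between non-negative vectors."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sum(y * np.log((y + eps) / (y_hat + eps)) - y + y_hat))

def sparse_activations(y, dictionary, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations x such that dictionary @ x ~ y.

    The dictionary (one exemplar per column) is kept fixed; only the
    activations are updated. `sparsity` is a hypothetical L1 penalty weight.
    """
    y = np.asarray(y, dtype=float)
    A = np.asarray(dictionary, dtype=float)
    x = np.ones(A.shape[1])
    col_sums = A.sum(axis=0)  # A^T 1, the denominator of the KL update
    for _ in range(n_iter):
        ratio = y / (A @ x + eps)
        # Multiplicative update: keeps x non-negative by construction.
        x *= (A.T @ ratio) / (col_sums + sparsity + eps)
    return x

# Toy data (hypothetical): 50 exemplars of dimension 20, two of them active.
rng = np.random.default_rng(0)
A = rng.random((20, 50))
x_true = np.zeros(50)
x_true[[3, 17]] = [1.0, 0.5]
y = A @ x_true

x_init = np.ones(50)
x_hat = sparse_activations(y, A)
kl_before = kl_divergence(y, A @ x_init)
kl_after = kl_divergence(y, A @ x_hat)
```

In an exemplar-based recognizer, the activations of exemplars that carry state labels would then be accumulated into state scores; that step is omitted here.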
Since there is no configuration of the modulation filterbank that is optimal for all SNR levels and all noise types, we conducted the comparison with the modulation filterbank consisting of an LPF with a 1 Hz cut-off frequency and M = 18 linearly spaced BPFs (which we refer to as the Lin18-EMS system). The Lin18-EMS system offers a good compromise between the best attainable performance on clean speech and on the conditions with the lowest SNR level. The detailed results obtained with the Lin18-EMS system are collected in Table 4.1.
In Figure 4.10, the recognition accuracies of the Lin18-EMS system and the three competing systems are plotted. Figure 4.10a shows the test results for matched noise types in test set A. While the Lin18-EMS system outperforms both MFCC-based multi-condition recognizers at very low SNR levels, its performance at higher SNRs is substantially worse than that of the MFCC-based systems. The single-frame EMS features almost always outperform the 30-frame Mel features.
However, the results on test set B (Figure 4.10b), which contains the unseen noise types, show that the Lin18-EMS system does not generalize well to unseen noise types, a characteristic that it shares with the other exemplar-based system. The superior performance of the 30-frame Mel features is most probably due to the fact that Gemmeke et al. (2011b) included artificially constructed noise exemplars that accounted to some extent for the mismatch between the noise exemplars from test set A and the different noise types in test set B. Our EMS-based system did not include artificially constructed exemplars.
In cleaner conditions (down to 10 dB) the EMS-based system performs roughly on par with the other exemplar-based system. In contrast to the behaviour for test set A, however, the performance drop at SNRs below 10 dB is much steeper.
Averaged over the four noise types of test set B, the recognition accuracy is approximately equal to that of the multi-condition trained GMM system without noise reduction.
A detailed analysis revealed that the performance of the Lin18-EMS system is in fact very similar to that of the system of Gemmeke et al. (2011b), except for train station noise (cf. Table 4.1). In search of the cause of this deviant behaviour, we found that omitting the standard deviation equalization step ((4.3) in Section 4.2.2) substantially improved recognition performance for utterances corrupted with train station noise at low SNR levels. This is illustrated by the dotted line in Figure 4.10b, which shows the average performance on test set B (SNR = 5, 0, −5 dB) when excluding the standard deviation equalization for train station noise. Recall that the main purpose of the standard deviation equalization procedure was to equalize the contribution of all gammatone frequency bands. The equalization weight vector was designed, using the speech exemplars from the dictionary, such that the standard deviation of the coefficients in the EMS vector is on average equal in all 15 gammatone filters, without changing the relative magnitude of the coefficients pertaining to the modulation bands. It appeared that the equalization procedure works well for noisified speech, as long as the envelope of the 15 gammatone coefficients in the modulation bands does not change between bands with low and high modulation frequencies. As long as that is the case, applying a fixed equalization vector will not change the average modulation spectrum of the noises. However, there are two noise types that violate this assumption, viz.
SNR (dB)         Clean    20       15       10       5        0        -5       Average

Test A
Subway           94.14    94.84    94.38    88.89    86.98    81.98    66.38    86.80
Babble           93.62    93.44    92.93    91.90    87.07    73.52    42.05    82.16
Car              93.56    92.69    92.45    91.65    88.58    80.11    59.68    85.53
Exhibition       93.55    95.53    95.19    94.38    85.71    82.63    73.93    88.70
Average          93.72    94.12    93.74    91.70    87.24    79.56    60.51    85.79

Test B
Restaurant       94.14    89.39    91.56    90.97    84.86    69.11    36.94    79.56
Street           93.62    93.68    92.53    90.60    83.92    63.27    29.53    78.16
Airport          93.56    94.87    94.15    91.02    82.58    63.05    28.78    78.28
Train station    93.55    94.54    91.76    85.00    67.76    38.41    13.24    69.18
Average          93.72    93.10    92.50    89.40    79.78    58.46    27.12    76.29
Table 4.1: The word recognition accuracy obtained using Lin18-EMS features on aurora-2 test sets. (For explanation see text)
car noise in test set A and train station noise in test set B. The detrimental effect of the violation in car noise is limited, because car noise is represented in the dictionary by noise exemplars taken from the car noise signals. For the train station noise this is not the case. As a result, the match between the modulation spectra of the speech noisified by adding train station noise and the exemplars in the dictionary deteriorates as the SNR level decreases.
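The idea behind the standard deviation equalization can be sketched as follows. This is not a reproduction of (4.3), which is not repeated in this section. It is a minimal sketch under the assumption that each EMS vector can be arranged as a matrix of 15 gammatone bands by 19 modulation bands (one LPF plus 18 BPFs), and that one weight per gammatone band is derived from the speech exemplars. The array layout, function names and toy data are hypothetical.

```python
import numpy as np

N_GAMMATONE = 15   # number of gammatone bands, as stated in the text
N_MODULATION = 19  # assumption: 1 LPF + 18 BPFs for the Lin18 filterbank

def equalization_weights(speech_exemplars):
    """One weight per gammatone band from the speech exemplars.

    speech_exemplars: array of shape (n_exemplars, N_GAMMATONE, N_MODULATION).
    The weight is the inverse of the standard deviation of the coefficients,
    averaged over the modulation bands of that gammatone band.
    """
    per_coeff_std = speech_exemplars.std(axis=0)  # shape (15, 19)
    band_std = per_coeff_std.mean(axis=1)         # shape (15,)
    return 1.0 / band_std

def equalize(ems, weights):
    """Scale all modulation coefficients of a gammatone band by one weight.

    Because the weight is shared within a band, the relative magnitudes of
    the modulation-band coefficients in that band are left unchanged.
    """
    return ems * weights[:, np.newaxis]

# Toy exemplars (hypothetical) whose spread grows with the gammatone band index.
rng = np.random.default_rng(1)
scale = np.linspace(0.5, 5.0, N_GAMMATONE)
exemplars = rng.random((200, N_GAMMATONE, N_MODULATION)) * scale[None, :, None]

weights = equalization_weights(exemplars)
equalized = equalize(exemplars, weights)
band_std_after = equalized.std(axis=0).mean(axis=1)
```

Since the weights are fixed after being estimated on speech exemplars, a noise whose modulation spectrum differs strongly between gammatone bands is reshaped by the same vector, which is the situation described above for car and train station noise.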
4.4.2 Comparison with HSR
To evaluate the combination of EMS features and sparse coding in terms of human-like performance, we re-use the data on the recognition accuracy of ten human listeners on aurora-2 utterances reported by Meyer (2013). Meyer used three different criteria: speech reception threshold (SRT), the effect of noise type, and the effect of string length. The SRT is the SNR at which listeners achieve 50% accuracy; it usually corresponds to the SNR at which the accuracy as a function of SNR is steepest. The SRT estimated for HSR in Meyer (2013) is around −10.2 dB, while for the aurora-2 system trained with the multi-condition data (Hirsch and Pearce, 2000) the SRT is −1.5 dB. From Figure 4.10a, it can be inferred that the SRT of the EMS-based system is well below −5 dB; although it is dangerous to extrapolate the curves, it is reasonable to assume that the SRT for the two exemplar-based systems is close to the human SRT. As can be seen from Figure 4.10b, which represents the noise mismatch case (test set B), our EMS-based system does not generalize well to unseen noise types. We will come back to this issue in Section 4.5.
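To make the SRT criterion concrete, the sketch below locates the 50% point of an accuracy-versus-SNR curve by linear interpolation between measured SNR levels, using rows of Table 4.1 as input. The function name and the choice of linear interpolation are assumptions made for this illustration; this is not the estimation procedure of Meyer (2013). The function returns None when the curve does not cross 50% within the measured range, which avoids extrapolation.

```python
import numpy as np

def estimate_srt(snr_db, accuracy, target=50.0):
    """SNR (dB) at which accuracy crosses `target`, by linear interpolation.

    Returns None if the target is not crossed within the measured SNR range.
    """
    snr = np.asarray(snr_db, dtype=float)
    acc = np.asarray(accuracy, dtype=float)
    order = np.argsort(snr)            # sort by increasing SNR
    snr, acc = snr[order], acc[order]
    for i in range(len(snr) - 1):
        lo, hi = acc[i], acc[i + 1]
        if lo < target <= hi:
            frac = (target - lo) / (hi - lo)
            return float(snr[i] + frac * (snr[i + 1] - snr[i]))
    return None

snr_levels = [20, 15, 10, 5, 0, -5]

# Rows taken from Table 4.1 (noisy conditions only).
train_station = [94.54, 91.76, 85.00, 67.76, 38.41, 13.24]
test_a_average = [94.12, 93.74, 91.70, 87.24, 79.56, 60.51]

srt_train_station = estimate_srt(snr_levels, train_station)
srt_test_a = estimate_srt(snr_levels, test_a_average)
```

For the train station row the 50% point falls between 0 and 5 dB, at roughly 2 dB, whereas the test set A average stays above 50% down to −5 dB, so no SRT can be read off without extrapolating.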
According to Meyer (2013), the noise types that are difficult for ASR differ from those that are difficult for HSR. At SNR = 0 and −5 dB, the aurora-2 system trained with multi-condition data performs better for babble noise than for car noise, while HSR shows higher performance for car than for babble noise. From Table 4.1 it can be seen that our EMS+SC system shows the same trend as the human listeners: accuracy with babble noise is lower than with car noise. The same holds for the comparison of airport and train station noise, provided that we solve the equalization issue.
In the human data there is a small but clear drop in accuracy for the longest digit strings, which is probably due to memorization problems. Our EMS-based system does not show this effect. This was to be expected, because an automatic system is not affected by the need to memorize long strings. Our system also does not show the problems with one-digit utterances reported by Meyer (2013) for the ‘standard’ aurora-2 systems with multi-condition training. The raw EMS features that we used for speech-silence segmentation yield quite accurate results.
Only in a very small proportion of the utterances did the endpoint estimates differ from the voice onset and offset determined from the forced alignment by more than 16 frames, the minimum number of frames needed to find (or hallucinate) a digit word.
In summary, it can be concluded that the operation of our EMS-plus-SC system for the estimation of sub-word probabilities mimics human speech recognition on a semantics-free task better than more conventional MFCC-plus-GMM systems.