9.5 Robustness Across All Conditions

The robustness results reported in Sect. 9.4 may seem to suggest that DNNs provide a significantly higher error rate reduction over GMM systems for noisier speech than for cleaner speech. This is, however, untrue. Those results only indicate that DNN systems are more robust than GMM systems to speaker and environmental distortions. The perturbation-shrinking property of the higher layers, discussed in Sect. 9.2, applies equally to all conditions. In this section we show, using results from [8], that DNNs provide similar gains over GMM systems across different noise levels and speaking rates.

In [8], Huang et al. conducted a series of studies comparing GMMs and DNNs on the mobile voice search (VS) and short message dictation (SMD) datasets. Both were collected through real-world applications used by millions of users with distinct speaking styles in diverse acoustic environments. These datasets were chosen because they cover almost all key LVCSR acoustic modeling challenges, each with enough data to ensure statistical significance. In the study, Huang et al. trained a pair of GMM and DNN models using 400 h of VS/SMD data. The GMM system, which used the 39-dimensional MFCC feature with up to third-order derivatives, is a state-of-the-art model trained with the feature-space minimum phone error rate (fMPE) [22] and boosted MMI (bMMI) [21] criteria. The DNN system, which used the 29-dimensional log filter-bank (LFB) feature (with up to second-order derivatives) and a context window of 11 frames, was trained using the cross-entropy (CE) criterion.

The two models shared the same training data and decision tree. The same maximum likelihood estimation (MLE) GMM seed model was used for lattice generation in the GMM system and for senone state alignment in the DNN system. The analytic study was conducted on 100 h of VS/SMD test data, randomly sampled from the datasets and roughly following the same distribution as the training data.

9.5.1 Robustness Across Noise Levels

Figures 9.8 and 9.9, reproduced from [8], compare the error patterns of the GMM-HMM and CD-DNN-HMM models under different signal-to-noise ratios (SNRs) for the VS and SMD datasets, respectively. As these figures show, the CD-DNN-HMM significantly outperforms the GMM-HMM at all SNR levels, for both clean and very noisy speech. More interestingly, the CD-DNN-HMM yields an almost uniform performance gain over the GMM-HMM across SNR levels on both the VS and SMD datasets.

Fig. 9.8 Performance comparison of GMM-HMM and CD-DNN-HMM at different SNR levels for the VS task. The solid lines are the regression curves. (Figure from Huang et al. [8], used with permission of ISCA.)

Fig. 9.9 Performance comparison of GMM-HMM and CD-DNN-HMM at different SNR levels for the SMD task. The solid lines are the regression curves. (Figure from Huang et al. [8], used with permission of ISCA.)

We can also measure the noise robustness of the DNN by calculating the performance degradation per 1 dB SNR drop. For the VS task, each 1 dB SNR drop introduces about 0.40 % absolute (or 2.2 % relative) WER increment, since the WER increases from 18 to 34 % when the SNR drops from 40 to 0 dB. For the SMD task, the same 1 dB SNR drop results in a 0.15 % absolute (or 1.3 % relative) WER increment, because the SMD WER increases from 12 to 18 % over the same SNR range. The quantitative difference in noise sensitivity between these tasks is likely due to the fact that SMD has a much lower LM perplexity. With the GMM system, however, the same 1 dB SNR drop introduces 0.6 % absolute (2.6 % relative) and 0.20 % absolute (1.2 % relative) WER increments on the VS and SMD datasets, respectively.
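The per-dB figures quoted for the DNN can be reproduced with a few lines of arithmetic. The sketch below assumes a linear change in WER between the two end points and assumes that the relative increment is taken with respect to the WER at 40 dB; the function name `wer_slope` is hypothetical and is not taken from [8].

```python
def wer_slope(wer_high_snr, wer_low_snr, snr_high_db=40.0, snr_low_db=0.0):
    """Return (absolute, relative) WER increment per 1 dB SNR drop, in percent.

    Assumptions (not stated in [8]): WER changes linearly between the two
    SNR end points, and the relative increment is measured against the WER
    at the high-SNR end.
    """
    span_db = snr_high_db - snr_low_db
    absolute = (wer_low_snr - wer_high_snr) / span_db
    relative = 100.0 * absolute / wer_high_snr
    return absolute, relative


# End-point WERs (in percent) quoted in the text for the CD-DNN-HMM.
vs_abs, vs_rel = wer_slope(18.0, 34.0)    # VS task, 40 dB -> 0 dB
smd_abs, smd_rel = wer_slope(12.0, 18.0)  # SMD task, 40 dB -> 0 dB

print(f"VS : {vs_abs:.2f} % absolute, {vs_rel:.2f} % relative per dB")
print(f"SMD: {smd_abs:.2f} % absolute, {smd_rel:.2f} % relative per dB")
```

Under these assumptions the absolute slopes come out at 0.40 and 0.15 points per dB, and the relative slopes are close to the 2.2 % and 1.3 % quoted above. The end-point WERs of the GMM system are not given in the text, so the same calculation is not repeated for it here.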

These results suggest that the CD-DNN-HMM is more robust to noise than the GMM system: on average its WER increases less per 1 dB SNR drop, and slightly more so in the low-SNR range, as indicated by its flatter slope.

However, the difference is very small. The recognition performance of the DNN still degrades considerably as the noise level increases within the normal range of mobile speech applications. This indicates that noise robustness remains an important research area, and that techniques such as speech enhancement, noise-robust acoustic features, or other multi-condition learning technologies need to be explored to bridge the performance gap and further improve the overall performance of deep learning-based acoustic models.

9.5.2 Robustness Across Speaking Rates

Speaking rate variation is another well-known factor that affects speech intelligibility and thus speech recognition accuracy. The speaking rate can change with the speaker, the speaking mode, and the speaking style. There are several reasons why a change in speaking rate may degrade recognition accuracy. First, it may change the dynamic range of the acoustic score, since the AM score of a phone is the sum of the scores of all the frames in the same phone segment. Second, the fixed frame rate, frame length, and context window size may be inadequate to capture the dynamics of transient speech events in fast or slow speech, and therefore result in suboptimal modeling. Third, variable speaking rates may result in a slight formant shift due to the limitations of the human vocal apparatus. Last, extremely fast speech may cause missed formant targets and phone deletion.
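The first reason can be illustrated with a toy calculation. The sketch below assumes a constant per-frame log score, which is a simplification made only for illustration, and all numbers in it are hypothetical. With a fixed frame rate, a phone spoken twice as fast spans half as many frames, so its summed score has half the magnitude.

```python
def segment_score(frame_log_scores):
    """Sum the per-frame acoustic log scores over one phone segment."""
    return sum(frame_log_scores)


# Hypothetical per-frame log score, assumed constant for illustration only.
per_frame_log_score = -2.0

# The same phone at two speaking rates under a fixed frame rate:
# the slow rendition spans 12 frames, the fast one only 6.
slow_segment = [per_frame_log_score] * 12
fast_segment = [per_frame_log_score] * 6

slow_score = segment_score(slow_segment)
fast_score = segment_score(fast_segment)

print(f"slow: {slow_score}, fast: {fast_score}")
```

Because the segment score scales with the number of frames, its balance against scores that do not scale with duration shifts as the speaking rate changes.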

Figures 9.10 and 9.11, which originally appeared in [8], illustrate the WER difference across different speaking rates, measured as the number of phones per second,³ on the VS and SMD datasets, respectively. From these figures, we can see that the CD-DNN-HMM system consistently outperforms the GMM-HMM system, with an almost uniform WER reduction across all speaking rates. Unlike in the noise robustness case, here we observe a U-shaped pattern on both the VS and SMD datasets. On the VS dataset, the best WER is achieved at around 10–12 phones per second. When the speaking rate deviates from this sweet spot by 30 %, whether faster or slower, a 30 % relative WER increment is observed. On the SMD dataset, a 15 % relative WER increment is observed when the speaking rate deviates 30 % from the sweet spot.

³ Huang et al. [8] also tried other definitions, such as the number of vowels per second and the speaking rate normalized by the average duration of different phonemes. They reported that the WER pattern is very similar no matter which definition is used.
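The speaking rate measure and the deviation from the sweet spot used above can be sketched as follows. The helper names are hypothetical, the sweet spot of 11 phones per second is an assumed midpoint of the 10–12 range reported for the VS dataset, and how [8] treats silence when measuring duration is not stated here, so the duration argument is simply taken as given.

```python
def speaking_rate(num_phones, duration_sec):
    """Speaking rate in phones per second for one utterance."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return num_phones / duration_sec


def relative_deviation(rate, sweet_spot):
    """Signed relative deviation of a speaking rate from the sweet spot."""
    return (rate - sweet_spot) / sweet_spot


# Assumed sweet spot: midpoint of the 10-12 phones/s range for the VS dataset.
VS_SWEET_SPOT = 11.0

# Hypothetical utterance: 33 phones in 3 seconds of speech.
rate = speaking_rate(33, 3.0)

# A hypothetical faster utterance at 14.3 phones/s is 30 % above the sweet spot.
deviation = relative_deviation(14.3, VS_SWEET_SPOT)

print(f"rate = {rate:.1f} phones/s, deviation = {deviation:+.0%}")
```

A deviation of plus or minus 0.3 under this definition corresponds to the 30 % speed-up or slow-down discussed in the text.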

Fig. 9.10 Performance comparison of the GMM-HMM and the CD-DNN-HMM at different speaking rates for the VS task. Horizontal axis: speaking rate (number of phones per second); plotted series: data distribution (VS), GMM-HMM (VS), and CD-DNN-HMM (VS). (Figure from Huang et al. [8], used with permission of ISCA.)

Fig. 9.11 Performance comparison of the GMM-HMM and the CD-DNN-HMM at different speaking rates for the SMD task. Horizontal axis: speaking rate (number of phones per second); plotted series: data distribution (SMD), GMM-HMM (SMD), and CD-DNN-HMM (SMD). (Figure from Huang et al. [8], used with permission of ISCA.)

To compensate for the speaking rate difference, additional modeling techniques need to be developed.
