The development towards a fully informational society requires better integration of machines into our lives and a more natural form of communication with them. The general objective of this thesis is to study existing methods, and to find novel ones, for the robust recognition of strongly distorted speech. The situations considered include signals recorded with distant microphones, speech recorded in a noisy car environment, and compressed speech. The focus is on techniques working at the level of acoustic model creation and front-end processing. The motivation for this research can be formulated as follows.
The recognition of recordings from a distant microphone is analysed for its application in so-called smart homes, which are built around the idea of voice-controlled appliances and remote control of home facilities. A second practical application is the transcription of lectures and conference speeches recorded in auditoriums, where the microphone is usually placed at a distance from the speaker. The recognition of recordings from a car environment is analysed for two primary reasons. The first is to provide a human-to-machine interface for voice-controlled devices, which include on-board navigation systems and other systems for controlling the car's facilities. The second reason is more general: conversations and phone calls made in cars also suffer from specific acoustic distortions, which limit their usability for further processing.
Concerning compressed speech, the algorithm widely known as MP3 belongs to the group of perceptual audio coders. Its worldwide popularity is mainly historical, as it appeared during the rapid growth of the Internet and the media sharing that came with it. It was developed primarily for multimedia, namely for video and music storage and distribution [59], but it has seen successful use for speech encoding as well. Only music professionals, phoneticians, and audiophiles have consistently avoided it. However, various studies have shown that even expert listeners cannot distinguish between original and encoded files for bitrates higher than 256 kbps [60]. In practice, people tended to use much lower bitrates, because even highly compressed speech containing audible distortions was still perceived by human listeners as intelligible. Recently, professional studios and many broadcasters have been abandoning MP3 coding tools in favour of formats that are better suited for speech (e.g. Speex or FLAC). However, a lot of speech data has already been compressed and archived in the MP3 format, which makes the recognition of MP3-compressed speech a true research challenge. This fact led me to the decision to study compensation methods that would enable the automatic processing of MP3-compressed recordings. The particular ideas analysed within this thesis can be formulated as follows.
• Signals recorded with distant microphones suffer from additive noise, strong echoes and reverberation. Home environments often introduce only weak additive noise, whereas public places introduce strong additive and convolutional noise. What is the contribution of front-end compensation methods in these situations? What is the contribution of acoustic modelling techniques? How much do these two environments differ in terms of ASR performance?
• Signals recorded in a running car suffer from strong additive noise, namely engine noise and aerodynamic noise. Both grow stronger as the driving speed increases. What is the contribution of signal pre-processing methods to ASR in a running car? What is the contribution of acoustic modelling techniques? How much do the differing driving conditions matter in terms of ASR performance?
• The principal idea of MP3 compression is to remove the imperceptible parts of the signal. What are the primary distortions introduced by the compression, and how do they affect the standard cepstral-based features? Is it possible that the distortions are located in certain parts of the speech more often than in others?
• The compression introduces non-linear distortions which corrupt the signal spectra and the extracted features. Is it possible to optimize feature extraction parameters such as the window length and step? Do the standard compensation and feature normalization methods improve the performance? Which features are better suited for this task?
• A common way of improving ASR performance in adverse conditions is either to employ matched training or to adapt general-purpose models to the specific conditions. What is the contribution of using bitrate-specific acoustic models (AMs) in comparison with a general-purpose AM? Can AM adaptation reduce this mismatch?
• Theoretical and practical work on distorted speech recognition has demonstrated that adding noise to the speech signal can improve ASR performance. Can these ideas be extended to MP3 speech?
• Recognition systems based on neural networks have displayed much greater robustness against adverse environmental conditions than their GMM predecessors. However, these systems are discriminative by nature and thus purely data-reliant, unlike GMMs. Can the DNN-HMM system outperform the GMM-HMM system? Can the DNN-HMM system still benefit from feature-level compensation methods such as the ones studied in this thesis?