In this chapter an overview of the SHoUT system has been given. The system consists of three subsystems: segmentation, diarization and ASR. Each of these subsystems has been designed to contain as few tunable parameters as possible. When tuning is needed, a development set is required that matches the audio that is going to be processed. If the development data are not representative of the target data, the subsystem will be tuned poorly and will perform suboptimally. Therefore, instead of being tuned on a development set, the subsystems tune themselves automatically on the audio that is being processed, which makes it possible to process data with unknown audio conditions.
The same mismatch problem exists if statistical models are used. For statistical models, the data used to train the models need to match the conditions of the audio that is going to be processed. Therefore, the use of models that are trained on a training set is avoided wherever possible. This is possible for segmentation and diarization, but not for ASR. For ASR the models are normalized using CMN and VTLN in order to reduce the mismatch between the training data and the data that are going to be processed by the decoder.
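The effect of CMN (cepstral mean normalization) on a channel mismatch can be illustrated with a minimal sketch. It assumes the common per-recording form of CMN, in which the mean of each cepstral coefficient over all frames is subtracted from every frame. The function name and the toy feature values are hypothetical and are not taken from SHoUT; VTLN is not covered here.

```python
from typing import List


def cepstral_mean_normalize(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean over all frames from every frame."""
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient over the whole recording.
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy data (assumption): the same three frames, once as recorded in the
# training condition and once with a constant offset per coefficient,
# which is how a stationary channel shows up in the cepstral domain.
clean = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
shifted = [[c0 + 10.0, c1 - 3.0] for c0, c1 in clean]

normalized_clean = cepstral_mean_normalize(clean)
normalized_shifted = cepstral_mean_normalize(shifted)
```

After normalization both versions of the toy recording yield the same feature vectors, so a constant channel offset no longer separates the training condition from the test condition.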
The subsystems for segmentation, diarization and ASR will be discussed in-depth in the following three chapters.
CHAPTER 4
SPEECH ACTIVITY DETECTION
Speech Activity Detection (SAD) is the task of detecting the fragments in an audio recording that contain speech. Speech activity detection is useful for the ASR subsystem because it is more practical to process small speech segments than an entire recording. It is easier to keep the required computer resources, such as processor time and memory usage, within reasonable bounds when the length of each segment is limited. A more important advantage of applying SAD is that all non-speech is removed from the recording, so that the ASR subsystem does not need to process these segments. Although audible non-speech (such as sound effects) does not contain any speech, a decoder will always output a hypothesis when such audio is passed to it, leading to insertions. SAD is also very important for speaker diarization. All non-speech presented to the diarization subsystem will contaminate the speaker models, and this will degrade the quality of the diarization.
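As a minimal sketch of how SAD output can limit what the decoder sees, the function below turns a sequence of per-frame labels into speech segments, so that only these segments would be passed on to ASR. The label names, the per-frame representation and the function name are assumptions made for illustration; they do not describe the SHoUT interface.

```python
from typing import List, Tuple


def speech_segments(labels: List[str]) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs of consecutive 'speech' frames.

    end_frame is exclusive. Frames with any other label (for example
    'silence' or 'nonspeech') are dropped.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == "speech" and start is None:
            start = i                      # a speech segment begins here
        elif label != "speech" and start is not None:
            segments.append((start, i))    # the segment ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(labels)))  # recording ends in speech
    return segments


frame_labels = ["silence", "speech", "speech", "nonspeech", "speech"]
segments = speech_segments(frame_labels)
```

Only the frames inside the returned segments would be decoded; the sound effect at frame 3 never reaches the decoder and therefore cannot cause an insertion.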
A common approach in speech activity detection is to attempt to classify all types of sound that are present in the recording. If it is known what types of sound can be expected, it is possible to create statistical models for them and the classification is straightforward. SAD is a lot harder when it is unknown beforehand what kinds of sound effects can be expected, because this makes it impossible to create high quality non-speech models. This chapter answers the research question ‘How can all audible non-speech be filtered out of a recording without having any prior information about the type of non-speech that will be encountered?’ and presents the SHoUT SAD subsystem, which is able to handle this task.
The SHoUT SAD subsystem is inspired by the model-based SAD approach described in [AWP07]. During the segmentation process, the models of SHoUT are trained on the audio that is being processed. In order to obtain a bootstrap segmentation that can be used to train these models, the original algorithm described in [AWP07] employs a silence-based segmentation strategy (see chapter 2, section 2.3 about segmentation methods). Using this method, no training set is needed to train the models on, and the second research question, ‘How can the system perform speech/non-speech segmentation without the use of statistical models based on training data?’, is successfully addressed. When audible non-speech is expected to be present in the audio, though, a bootstrap segmentation based on silence will not be sufficient. Therefore, a new solution is needed to solve this research problem. The SHoUT SAD subsystem addresses the problem by applying a model-based segmentation component to create the bootstrap segmentation. After the initial segmentation step, three models are trained on the audio under evaluation: a model for silence, a model for audible non-speech and a model for speech. Because each of these models is trained on the data that is being segmented, the subsystem is able to perform high quality SAD by applying the three models.
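The idea of training the three models on the audio that is being segmented can be sketched as follows. This is a deliberately simplified illustration, not the SHoUT algorithm: it assumes a single scalar feature per frame, one Gaussian per class instead of a mixture model, frame-wise classification instead of decoding, and it trains on the complete bootstrap segmentation rather than on a selection based on confidence measures. All names and values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def fit_gaussian(values: List[float]) -> Tuple[float, float]:
    """Estimate mean and variance, with a small variance floor."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-3)


def log_likelihood(x: float, mean: float, var: float) -> float:
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def retrain_and_resegment(features: List[float],
                          bootstrap_labels: List[str]) -> List[str]:
    """Train one model per class on the recording itself, then relabel it."""
    models: Dict[str, Tuple[float, float]] = {}
    for label in set(bootstrap_labels):
        values = [x for x, l in zip(features, bootstrap_labels) if l == label]
        models[label] = fit_gaussian(values)
    # Assign every frame to the class whose model gives the highest likelihood.
    return [max(models, key=lambda l: log_likelihood(x, *models[l]))
            for x in features]


# Toy recording (assumption): silence near 0, a sound effect near 5, speech near 10.
features = [0.1, 0.2, 0.0, 0.3, 5.1, 4.9, 5.2, 5.0, 9.8, 10.1, 10.0, 9.9]
# Bootstrap segmentation with one error: the frame at index 7 is marked as speech.
bootstrap = ["silence"] * 4 + ["nonspeech"] * 3 + ["speech"] * 5
resegmented = retrain_and_resegment(features, bootstrap)
```

In this toy case the models trained on the recording move the mislabelled sound-effect frame from the speech class to the non-speech class, even though no model was trained on external data.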
In the following sections, the definition of speech and non-speech is discussed first, after which the algorithm used by the SHoUT SAD subsystem is described. In section 4.3 the features are described, and in section 4.4 the two confidence measures are discussed that are needed to decide which part of the bootstrap segmentation is going to be used to train the new models. In section 4.5 the component that is used to create the bootstrap segmentation is discussed. The bootstrap component is a standard model-based segmentation component for Dutch broadcast news. Finally, in section 4.7 the evaluation of the SHoUT SAD subsystem is discussed.
4.1 What is considered speech?
When people talk they produce speech, even if what they say is drowned out by loud noises or by other speech. For some applications it may be desirable that a speech activity detection system marks such corrupted speech as actual speech, but when SAD is used as a preprocessing step for ASR, corrupted speech that the ASR subsystem is not able to process correctly anyway might as well be marked as non-speech. On the other hand, during evaluation, the system should be penalized if it is not able to recognize certain parts of the speech. In this thesis, speech is marked as actual speech if the transcriber is able to hear what is being said. The system is expected to process all types of speech as long as a person, the transcriber, is able to understand the content of this speech. Therefore the SAD subsystem needs to be able to classify all speech fragments as actual speech, even if the fragments contain high levels of noise.