The first speaker diarization system developed for this research is a straightforward implementation of the agglomerative model-based algorithm. It was first tested at the NIST Rich Transcription 2006 Spring (RT06s) evaluation benchmark, and therefore this version of the diarization subsystem will be referred to as SHoUTD06.
SHoUTD06 was implemented using modules available from the SHoUT ASR subsystem. At the time of the evaluation it was not yet possible to create an HMM topology with strings of states sharing a single GMM, as drawn in figure 2.9. Therefore, in this system each speaker model consists of a single HMM state. Instead of enforcing a minimum duration for each segment by creating strings of states, the duration of each segment is influenced by setting the transition probabilities to fixed values. The transition probability from one speaker to another is set to the small value of 1/200, representing an average segment length of 2 seconds.
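The link between the transition probability and the average segment length can be checked with a short calculation. The sketch below assumes that 1/200 is the total per-frame probability of leaving a speaker state and that features are computed every 10 ms; the frame shift is not stated in this section, and the function name is purely illustrative.

```python
def expected_segment_seconds(p_switch: float, frame_shift_s: float = 0.01) -> float:
    """Mean time spent in a single HMM state that is left with
    probability p_switch at every frame (geometric dwell time)."""
    if not 0.0 < p_switch <= 1.0:
        raise ValueError("p_switch must be in (0, 1]")
    expected_frames = 1.0 / p_switch       # mean of a geometric distribution
    return expected_frames * frame_shift_s  # frame_shift_s is an assumed value


if __name__ == "__main__":
    # With p = 1/200 and an assumed 10 ms frame shift: about 2 seconds.
    print(expected_segment_seconds(1 / 200))
```

Under these assumptions a state is occupied for 200 frames on average, which matches the 2 seconds quoted above. Note that a geometric dwell time makes very short segments the most likely ones, which is why a fixed transition probability only influences segment duration and does not enforce a minimum.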
The system contains five other parameters, which are tuned on the RT05s conference meeting evaluation data that is part of the RT06s development data. As shown in chapter 2, section 2.4.4, none of these five system parameters is sensitive to changes in audio conditions [AW03]. In the remainder of this section, a short description of SHoUTD06 will be given and the evaluation results will be discussed.
5.2.1 System description
For the first version of the SHoUT speaker diarization subsystem, SHoUTD06, the feature extraction component of the Sonic LVCSR toolkit [PH03] is used. This component calculates 12 Perceptual Minimum Variance Distortionless Response (PMVDR) cepstral coefficients. This feature type was developed to be more noise robust than MFCC features (see [PH03]). For speech recognition, energy is added to the twelve coefficients and the first and second derivatives of these features are concatenated to the feature vectors. For this diarization system though, only the PMVDR coefficients are used.
The next step in the algorithm is the creation of the initial clusters. Following the baseline system described in [AWPA06], a fixed number of initial clusters is set. For conference meetings ten clusters are used and for lecture meetings the initial number of clusters is five. Note that the number of initial clusters is fixed for all audio files, no matter how long the files are. The number of Gaussians for each initial model, though, depends on the total amount of data that is used to train the model. This means that the initial models will contain more Gaussians for longer audio files than for short files. Experiments on the development data showed that this approach outperforms the variant with a fixed number of Gaussians. The optimum number of training samples per Gaussian is 800. For the RT06s conference meetings, this means that each initial model is trained with approximately 10–14 Gaussians. Making the number of Gaussians dependent on the duration of the meeting (the number of training samples) ensures that the models are not under- or over-trained when the duration of the audio varies. The average number of Gaussians is remarkably higher than the five Gaussians per model reported in [AWPA06]. This suggests that more Gaussians are needed because this system uses only 12 feature coefficients instead of the 19 coefficients in [AWPA06].
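The rule of 800 training samples per Gaussian can be illustrated with a small calculation. The sketch below assumes that the speech frames are divided evenly over the initial clusters and that the result is rounded to the nearest integer with a minimum of one; neither detail is specified here, and the function name and the example frame count are hypothetical.

```python
def gaussians_per_initial_model(n_speech_frames: int,
                                n_clusters: int,
                                samples_per_gaussian: int = 800) -> int:
    """Number of Gaussians for each initial model, given the amount of data."""
    if n_clusters <= 0 or samples_per_gaussian <= 0:
        raise ValueError("n_clusters and samples_per_gaussian must be positive")
    # Assumption: the data is split evenly over the initial clusters.
    frames_per_cluster = n_speech_frames / n_clusters
    # Assumption: round to the nearest integer, never fewer than one Gaussian.
    return max(1, round(frames_per_cluster / samples_per_gaussian))


if __name__ == "__main__":
    # Hypothetical meeting with 96,000 speech frames and ten initial clusters.
    print(gaussians_per_initial_model(96_000, n_clusters=10))
```

For the hypothetical meeting in the usage example, each of the ten initial models receives 9,600 frames and therefore 12 Gaussians, which falls inside the 10–14 range reported for the RT06s conference meetings.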
The speaker models are all trained in an iterative process. Each model is first trained with a single Gaussian, and then the model splits its Gaussian with the highest weight until the desired number of Gaussians is reached. At each iteration, before splitting a Gaussian, all Gaussian means and covariance matrices are adjusted in a number of training runs until the overall model score does not improve by more than 1.5% relative to the previous training run. After the training of all models, the data is re-aligned using Viterbi and a new training run (with the existing models) is started (see figure 5.1). The merged speaker models are created in the same way as the single speaker models. In order to speed up the initialization, the model with the most Gaussians is used as the initial model. Then, the model is trained on the data of both speakers and the number of Gaussians is increased iteratively until the correct number is reached.
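The control flow of this train-and-split procedure can be sketched as follows. This is a minimal illustration, not the SHoUT implementation: `run_training` and `split_heaviest` are hypothetical callbacks standing in for one training run (returning the overall model score) and for splitting the highest-weight Gaussian, and the cap on the number of runs is an added safeguard.

```python
from typing import Callable


def train_until_converged(run_training: Callable[[], float],
                          rel_threshold: float = 0.015,
                          max_runs: int = 100) -> int:
    """Repeat training runs until the score improves by at most
    rel_threshold relative to the previous run. Returns the run count."""
    previous = run_training()
    runs = 1
    while runs < max_runs:
        current = run_training()
        runs += 1
        # Guard against division by zero for a score of exactly 0.
        denom = abs(previous) or 1.0
        improvement = (current - previous) / denom
        previous = current
        if improvement <= rel_threshold:
            break
    return runs


def grow_model(target_gaussians: int,
               run_training: Callable[[], float],
               split_heaviest: Callable[[], None]) -> int:
    """Start with one Gaussian; train to convergence before every split
    and once more after the last split (the final pass is an assumption)."""
    n_gaussians = 1
    train_until_converged(run_training)
    while n_gaussians < target_gaussians:
        split_heaviest()
        n_gaussians += 1
        train_until_converged(run_training)
    return n_gaussians


if __name__ == "__main__":
    scores = iter([-1000.0, -900.0, -890.0, -889.0])
    print(train_until_converged(lambda: next(scores)))
```

In the usage example the score improves by 10% in the second run and by about 1.1% in the third, so training stops after the third run.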
Finding candidate models to merge and deciding when to stop merging is done with BIC. Figure 5.2 illustrates the merging process. First, two models are picked that will be replaced by a single merged model. The remaining models are left unchanged while the merged model is being trained. If there are no two models with a positive BIC score that can be merged, the system stops and topology (a) will be the final topology. Otherwise, after training the merged model, all models are trained and the data is re-aligned a number of times. This is also shown in the bottom gray box of figure 5.1.
Figure 5.2: The SHoUTD06 system uses BIC to compare models pairwise (a). The two models that are considered most similar (highest positive BIC score) are replaced by a single model trained on the data from both separate models (b). After this replacement, the data is re-aligned and all models are retrained (c).
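The selection and stopping rule illustrated in figure 5.2 can be sketched as a small loop. The sketch assumes a caller-supplied `delta_bic` function that returns the pairwise BIC score (positive meaning the two models are better represented by one) and a `merge` function that builds the merged model; both are hypothetical placeholders, and retraining and Viterbi re-alignment are only marked by a comment.

```python
from itertools import combinations
from typing import Callable, List, Optional, Tuple, TypeVar

M = TypeVar("M")


def pick_merge_pair(models: List[M],
                    delta_bic: Callable[[M, M], float]) -> Optional[Tuple[M, M]]:
    """Return the pair with the highest positive BIC score, or None
    when no pair has a positive score (the stopping criterion)."""
    best_pair = None
    best_score = 0.0
    for a, b in combinations(models, 2):
        score = delta_bic(a, b)
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair


def merge_until_stop(models: List[M],
                     delta_bic: Callable[[M, M], float],
                     merge: Callable[[M, M], M]) -> List[M]:
    """Repeatedly replace the best-scoring pair by a merged model."""
    models = list(models)
    while True:
        pair = pick_merge_pair(models, delta_bic)
        if pair is None:
            return models
        a, b = pair
        models = [m for m in models if m is not a and m is not b]
        models.append(merge(a, b))
        # In the real system, all models would be retrained and the data
        # re-aligned here before the next pairwise comparison.


if __name__ == "__main__":
    # Toy score: models whose names start with the same letter should merge.
    toy_bic = lambda a, b: 1.0 if a[0] == b[0] else -1.0
    print(merge_until_stop(["x1", "x2", "y1"], toy_bic, lambda a, b: a + b))
```

In the toy example the two "x" models are merged first, after which no pair has a positive score and the loop stops with two models left.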
5.2.2 RT06s evaluation
The benchmarks for Rich Transcription of meetings contain two speaker diarization tasks (see appendix A). For the Multiple Distant Microphone (MDM) task, all available microphones may be used, while for the Single Distant Microphone (SDM) task only one microphone, picked by NIST, may be used. For the MDM SHoUTD06 submission, only the SDM recording is used for feature extraction; the extra information that can be obtained by using multiple microphones, for example with beamforming, is not used. The speaker diarization error rates of the diarization system on the conference meeting audio are listed in table 5.1.

Test set                  DER (%) without overlap   DER (%) with overlap
RT05s conference room     21.6                      30.2
RT06s conference room     22.7                      37.2
RT06 lecture room         30.8                      32.4
Processing speed (×RT)    4.63

Table 5.1: The speaker diarization results of SHoUTD06, measured with and without overlapping speech regions.

Table 5.1 contains the results with and without overlapping speech regions taken into account when calculating the Diarization Error Rate (DER, see section 2.4.2). It is surprising to see that the DER increases considerably when overlapping speech regions are considered during scoring. Part of this performance degradation is due to the fact that in 2005 and 2006 the speaker segment borders were annotated manually. Especially for overlapping speech regions this may introduce some noise, as it is hard to determine when exactly someone starts or stops speaking. But most of the degradation can be attributed to the fact that the system is simply not able to model overlapping speech. Because of the Viterbi alignment, all speech is by definition assigned to a single speaker. In the next section it will be analyzed how much of the total DER is actually due to missed overlapping speech.
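The effect of assigning every frame to a single speaker can be made concrete with a lower bound on the error. The sketch below assumes frame-level reference annotation given as the number of active speakers per frame, and assumes that the DER is normalized by the total scored speaker time, with overlapping frames counted once per speaker. It only computes the share that a one-speaker-per-frame system must miss; the function name and the example data are hypothetical.

```python
from typing import Sequence


def min_missed_overlap_fraction(speakers_per_frame: Sequence[int]) -> float:
    """Lower bound on the missed-speech part of the DER for a system
    that outputs at most one speaker per frame."""
    # Scored speaker time: a frame with k active speakers counts k times.
    total_speaker_frames = sum(speakers_per_frame)
    if total_speaker_frames == 0:
        return 0.0
    # With k > 1 reference speakers, at least k - 1 of them are missed.
    missed = sum(k - 1 for k in speakers_per_frame if k > 1)
    return missed / total_speaker_frames


if __name__ == "__main__":
    reference = [1, 1, 2, 1, 0, 2, 3, 1]  # hypothetical annotation
    print(min_missed_overlap_fraction(reference))
```

For the hypothetical annotation in the usage example, 4 of the 11 scored speaker frames cannot be recovered, so even a system that is otherwise perfect would score a DER of roughly 36% when overlap is included.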
At the RT06s benchmark the SHoUTD06 system performed at a state-of-the-art level, which shows that the system parameters are tuned sufficiently and that the software is performing adequately. The analysis described in the next section will explain why SHoUTD06 was beaten by the ICSI diarization system.
5.2.3 Post evaluation changes
The audio files in the test collection of the RT06s benchmark are all more or less of the same length. Making the number of Gaussians variable helps because the files are not precisely of the same length and the amount of speech in the audio varies for each recording. But for more extreme variations in audio length, keeping the number of initial models fixed will result in models trained on too little data for short recordings and models trained on too much data for long recordings. Models trained on too little data tend to get over-trained, and this might prevent models of the same speaker from being selected for merging. Models that are trained on large quantities of data might be so general that all models become similar and are all merged together. In order to prevent these two kinds of mistakes, the system was changed after the benchmark. Instead of making the number of Gaussians variable, the number of initial models was varied and the number of Gaussians was fixed for each initial model. The analysis described in the following section is performed on this new version of SHoUTD06.