The first research question concerning speaker diarization, defined in chapter 1, is: ‘How can a speaker clustering system be created that does not require any statistical models created using training data?’ This question was addressed during the development of the speaker diarization subsystem, which does not need any training data for the creation of statistical models. Instead, it randomly cuts up the recording and trains models on the recording that is being processed itself. By performing a number of Viterbi re-alignments while training the models, each speaker gradually ends up with his or her own model.
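The idea of training models on the recording itself can be illustrated with a minimal sketch. It is not the SHoUT implementation: it assumes one-dimensional features, a single Gaussian per cluster, a fixed number of clusters and a fixed speaker-switch penalty, and all function and parameter names (`diarize`, `chunk_len`, `switch_penalty`) are hypothetical.

```python
import math
import random


def fit_gaussian(values):
    """Mean and variance of a list of floats, with a small variance floor."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-3)


def log_likelihood(x, model):
    mean, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(frames, models, switch_penalty):
    """Best cluster sequence; changing cluster costs `switch_penalty` (log domain)."""
    n = len(models)
    score = [log_likelihood(frames[0], m) for m in models]
    back = []
    for x in frames[1:]:
        best_prev = max(range(n), key=lambda s: score[s])
        new_score, pointers = [], []
        for s in range(n):
            stay = score[s]
            switch = score[best_prev] - switch_penalty
            prev, best = (s, stay) if stay >= switch else (best_prev, switch)
            pointers.append(prev)
            new_score.append(best + log_likelihood(x, models[s]))
        back.append(pointers)
        score = new_score
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def diarize(frames, n_clusters, chunk_len, iterations=5, switch_penalty=5.0, seed=0):
    n_chunks = math.ceil(len(frames) / chunk_len)
    if n_chunks < n_clusters:
        raise ValueError("need at least one chunk per cluster")
    # Random initial cut: shuffle fixed-length chunks and deal them out
    # round-robin, so that every cluster starts with some data.
    order = list(range(n_chunks))
    random.Random(seed).shuffle(order)
    cluster_of_chunk = {chunk: pos % n_clusters for pos, chunk in enumerate(order)}
    labels = [cluster_of_chunk[i // chunk_len] for i in range(len(frames))]
    models = [None] * n_clusters
    for _ in range(iterations):
        # Train each model on the frames currently assigned to it.
        for c in range(n_clusters):
            data = [x for x, lab in zip(frames, labels) if lab == c]
            if data:  # a cluster that lost all its frames keeps its old model
                models[c] = fit_gaussian(data)
        # Re-align the frames to the retrained models.
        labels = viterbi(frames, models, switch_penalty)
    return labels
```

No model is trained before the recording is seen: the only inputs are the frames of the recording itself. The single Gaussian and the fixed cluster count are simplifications chosen for brevity.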

The second question related to speaker diarization, ‘How can the proposed speaker clustering system be adjusted so that it is able to process long recordings with reasonable computational effort?’, is also answered in this thesis, although it must be noted that the two solutions, SHoUTDCM and SHoUTD07∗, both speed up the process but also decrease the performance. Future research is needed to understand why the SHoUTDCM system did not perform as well as the original diarization subsystem. A thorough analysis such as the one performed for SHoUTD06 can reveal the aspects that need improvement. Due to time constraints it was not possible to perform an analysis for both systems. Instead, the SHoUTD07∗ system was developed, which is closely related to the SHoUTD07 system. With the use of a single parameter, it is possible to decrease the real-time factor of this subsystem at the cost of a slight decrease in performance. For short recordings, though, the SHoUTD07∗ system is identical to the SHoUTD07 system.

Future research on diarization of long recordings could focus on the SHoUTDCM approach, but also on two other approaches. First, the process could be made faster by combining the SHoUTDCM approach and the SHoUTD07 approach. For long recordings, the first iterations could be performed by SHoUTDCM while the final iterations are performed by SHoUTD07. The experiments from section 5.5.1 indicate that SHoUTDCM seems weak in deciding the optimal number of clusters, but it works

The second possible approach to diarization of long recordings is related to the approach taken for SAD. Instead of processing the entire recording at once, it could be cut up into chunks and each chunk could be processed individually. Although for SAD it is relatively easy to re-combine the chunks, for diarization this step is not so straightforward: it is not easy to determine which speaker model from one chunk matches the model of another chunk. If a method is found that can match the correct speaker models of the various chunks, it is possible to process recordings of infinite length. It would also make tracking of speakers over multiple recordings straightforward.

8.1.3 Automatic speech recognition

From a software engineering point of view, the ASR subsystem is the most complex of all three subsystems. This is the reason that in the chapter about ASR, chapter 6, a number of development issues have been presented. The implementation of a modular system received special attention in the first section of chapter 6. The implementation and evaluation of a number of techniques for robust ASR were also discussed, and the question ‘Which methods can be applied to make the decoder insensitive for a potential mismatch between training data and target audio?’ was addressed. The three methods, cepstrum mean normalization, vocal tract length normalization and structured maximum a posteriori linear regression, all proved to reduce the word error rate significantly.
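Of the three methods, cepstrum mean normalization is the simplest to illustrate. The sketch below assumes the textbook per-recording form (subtract the mean of each cepstral coefficient over all frames); the function name is hypothetical and the exact variant used in the SHoUT system may differ.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of one recording.

    `frames` is a list of equal-length feature vectors (lists of floats).
    """
    if not frames:
        return []
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# A constant offset on every frame (for example a different channel)
# disappears after normalization.
clean = [[1.0, 4.0], [3.0, 8.0]]
shifted = [[1.5, 6.0], [3.5, 10.0]]  # clean plus the offset [0.5, 2.0]
assert cepstral_mean_normalize(clean) == cepstral_mean_normalize(shifted)
```

Because a stationary channel effect appears as an additive constant in the cepstral domain, removing the mean reduces this kind of mismatch between training data and target audio.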

In dealing with the development requirements, a very specific research question was encountered: ‘How can full language model look-ahead be applied for decoders with static pronunciation prefix trees?’ Language Model Look-Ahead (LMLA) is a very helpful technique in managing the computer resources needed by the ASR system. Unfortunately, it is not straightforward how to use this technique with the system architecture that was chosen in order to fulfill the development requirements. In chapter 6 this problem was addressed and a method to efficiently use language model look-ahead in the SHoUT decoder or in any other decoder using a single pronunciation prefix tree was discussed and evaluated. It was shown that LMLA speeds up the decoder considerably without loss of recognition precision. It was also shown that in the SHoUT decoder, the full LMLA architecture outperforms unigram look-ahead.
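To make the look-ahead idea concrete, the following naive sketch attaches to every node of a pronunciation prefix tree the list of words that are still reachable from it, and takes the best language model score among them. It is an illustration only, not the efficient data architecture of chapter 6: the toy lexicon, the toy language model and all names (`Node`, `build_prefix_tree`, `lookahead`, `lm_logprob`) are hypothetical.

```python
import math


class Node:
    def __init__(self):
        self.children = {}  # phone -> Node
        self.words = []     # words whose pronunciation passes through this node


def build_prefix_tree(lexicon):
    """Build a single prefix tree from a {word: [phones]} lexicon."""
    root = Node()
    for word, phones in lexicon.items():
        node = root
        node.words.append(word)
        for phone in phones:
            node = node.children.setdefault(phone, Node())
            node.words.append(word)
    return root


def lookahead(node, history, lm_logprob):
    """Best LM log-probability of any word still reachable from `node`."""
    return max(lm_logprob(history, word) for word in node.words)


# Toy language model (invented numbers): bigram if known, else unigram.
UNIGRAM = {"speech": 0.5, "speak": 0.2, "talk": 0.3}
BIGRAM = {("we", "speak"): 0.6, ("we", "talk"): 0.3, ("we", "speech"): 0.1}


def lm_logprob(history, word):
    if history is None:
        return math.log(UNIGRAM[word])
    return math.log(BIGRAM.get((history, word), UNIGRAM[word]))


lexicon = {
    "speech": ["s", "p", "iy", "ch"],
    "speak": ["s", "p", "iy", "k"],
    "talk": ["t", "ao", "k"],
}
root = build_prefix_tree(lexicon)
node_s = root.children["s"]

unigram_score = lookahead(node_s, None, lm_logprob)  # ignores the history
full_score = lookahead(node_s, "we", lm_logprob)     # uses the history
```

Unigram look-ahead yields one value per node that can be computed once, whereas full look-ahead depends on the history of each token. Recomputing the maximum for every query, as this sketch does, is the cost that an efficient implementation has to avoid.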

The unigram LMLA system was 1.35 times slower than the full LMLA system and, as expected, the computational cost of the system without any LMLA was the highest (2.4 times as slow as the optimal system). The fact that unigram LMLA already provides a considerable speed-up, and that it is less complex to implement than full LMLA, could be a reason to choose unigram LMLA. Also, note that the reported real-time factor results are closely related to the implementation of the SHoUT decoder and that it is possible that, if implemented in another decoder, the RTF gain of the full LMLA system compared to the unigram LMLA system is less distinct. Given this caveat, the experiments with the SHoUT decoder are very promising and it is believed that full LMLA using the proposed data architecture will also improve the real-time factor of other token passing decoders.

occurring words. This helps because these words are used considerably more often than the remaining words and therefore have a high probability of being looked up. In the SHoUT decoder, no cache is used. Instead, a very efficient LM look-up method is implemented that reduces a regular n-gram query to calculating the key for a minimal perfect hash table and using this key to directly access the probability. A cache might be useful for speeding up the calculation of the key, but the effect of this speed-up will be highly limited.
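The look-up principle can be sketched as follows. This is not the SHoUT construction: it assumes a brute-force search for a hash seed, which is only feasible for a handful of n-grams, and it stores the n-gram next to its probability so that unseen queries can be rejected. All names (`slot`, `build_table`, `lookup`) are hypothetical.

```python
import hashlib


def slot(ngram, seed, size):
    """Map an n-gram (tuple of words) to a table index for a given seed."""
    data = (str(seed) + "|" + " ".join(ngram)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % size


def build_table(ngram_logprobs, max_seed=100000):
    """Find a seed that sends every n-gram to its own slot in a table
    with exactly as many slots as n-grams (a minimal perfect hash)."""
    size = len(ngram_logprobs)
    for seed in range(max_seed):
        if len({slot(g, seed, size) for g in ngram_logprobs}) == size:
            table = [None] * size
            for ngram, logprob in ngram_logprobs.items():
                table[slot(ngram, seed, size)] = (ngram, logprob)
            return seed, table
    raise ValueError("no seed found; large sets need a proper construction")


def lookup(ngram, seed, table):
    """One hash computation and one array access per query."""
    stored_ngram, logprob = table[slot(ngram, seed, len(table))]
    return logprob if stored_ngram == ngram else None


ngrams = {
    ("the",): -1.0,
    ("the", "cat"): -2.0,
    ("cat", "sat"): -2.5,
    ("sat", "on"): -3.0,
}
seed, table = build_table(ngrams)
```

Since every stored n-gram has its own slot and the table has no empty slots, a query never has to resolve a collision. Most of the cost of a query is then the computation of the key, which is the only part a cache could speed up.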

8.1.4 The sum of the three subsystems:
