The decoder uses statistical N-gram backoff language models to determine its a-priori probabilities (see chapter 2, section 2.1.6). A number of tools exist to create N-gram language models from text data collections. The most commonly used tool is the SRILM toolkit [Sto02]. This toolkit stores its models in the standard ARPA format, which is the LM input file format for the SHoUT decoder. In the ARPA file format, each N-gram is stored as text on a separate line together with its probability and backoff value. If the decoder used this file format directly, it would need to compare the words of its hypothesis with the words of each N-gram until the correct N-gram is found. Such text-based comparisons are computationally too expensive for large language models. Therefore, a fast look-up method is needed for querying N-grams and their probabilities and backoff values.
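To make the file layout concrete, the following minimal sketch reads a tiny ARPA-style model into Python dictionaries. The sample model and the function name parse_arpa are illustrative assumptions and are not taken from SHoUT. The sketch assumes the usual SRILM layout, in which each line holds a log10 probability, the N-gram words and an optional backoff weight.

```python
def parse_arpa(text):
    """Parse ARPA text into {order: {ngram_tuple: (logprob, backoff)}}."""
    models = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # skip blank lines and the header with N-gram counts
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            # Section header such as "\2-grams:" announces the N-gram order.
            order = int(line[1:line.index("-")])
            models[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional; a missing weight is log10(1) = 0.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else 0.0
        models[order][words] = (logprob, backoff)
    return models


# Hypothetical toy model, only for illustration.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.5\t<s>\t-0.3
-0.7\thello\t-0.2
-0.9\tworld

\\2-grams:
-0.1\t<s> hello
-0.4\thello world

\\end\\
"""

lm = parse_arpa(SAMPLE)
```

Looking up an N-gram in such a dictionary still hashes and compares word strings, which is the cost the wordID representation described next avoids.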
Comparing numbers is a lot faster than comparing text strings. Therefore, each word in the dictionary is assigned a unique identification number, its wordID, and the words in each N-gram are replaced by these identification numbers. Even replacing each word with its wordID takes up considerable computational resources when plain string comparison is used to identify each word in each N-gram. Therefore, the words are first stored in a tree structure with a node for each character of the word, and this tree is then used to find the correct wordIDs for each N-gram. Matching words with their wordID by searching the tree is considerably faster than searching through the full list of words.
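The character tree can be sketched as follows. This is a minimal illustration using nested dictionaries; the class name CharTrie, the helper ngram_to_ids and the toy vocabulary are assumptions made for the example, not names from the SHoUT source.

```python
class CharTrie:
    """Tree with one node per character; a word's final node holds its wordID."""

    def __init__(self):
        self.root = {}

    def insert(self, word, word_id):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word_id  # the None key marks the end of a word

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return None  # no word in the tree starts with this prefix
        return node.get(None)  # None if the prefix is not a complete word


# Hypothetical vocabulary; the wordID is simply the position in the list.
vocabulary = ["the", "then", "there", "cat"]
trie = CharTrie()
for word_id, word in enumerate(vocabulary):
    trie.insert(word, word_id)


def ngram_to_ids(ngram):
    """Replace each word of an N-gram by its wordID."""
    return tuple(trie.lookup(w) for w in ngram)
```

A lookup touches one node per character of the query word, so its cost depends on the word length and not on the size of the vocabulary.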
Even comparing sequences of numbers is time consuming. Therefore, a method is needed to find the N-grams of wordIDs efficiently. For unigrams, this task is easy: all unigram probabilities and backoff values are stored in a list that is sorted on the wordID of each unigram, so that the probability of each wordID can be obtained with a single lookup.
For the higher order N-grams, it is not possible to use sorted lists that can be queried directly. Such lists would be multi-dimensional (three dimensional for trigrams), very sparse and would take up too much memory. Instead, a minimum perfect hash table is used for querying higher order N-grams [CDGM02]. A minimum perfect hash table is a hash table where each slot in the table is filled with exactly one item. This means that N-gram probabilities can be queried in a single lookup and that no extra memory is needed except for storing the hash function and the key of each data structure, the N-gram wordIDs. This key is needed because during lookup, the hash function will map queries for non-existing N-grams to random slots. By comparing the N-gram of the query to the N-gram stored in the table slot that was found, it can be determined whether the search was successful. The algorithm proposed in [CHM92, CDGM02] is used to generate the hash functions.
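The role of the stored key can be illustrated with a toy table. The sketch below does not implement the algorithm of [CHM92, CDGM02]; as a stand-in it simply tries seeds until a hash function is found that places every N-gram in its own slot, which is only practical for a handful of keys. The function names, the mixing constants and the sample trigrams are assumptions made for the example.

```python
def slot_hash(seed, key, size):
    """Seeded integer hash of a tuple of wordIDs, reduced to a table slot."""
    h = (seed ^ 0x811C9DC5) & 0xFFFFFFFF
    for word_id in key:
        h = ((h ^ word_id) * 0x01000193) & 0xFFFFFFFF
    h ^= h >> 15
    return h % size


def build_table(ngrams, max_seed=1_000_000):
    """Brute-force search for a seed that fills every slot exactly once."""
    keys = list(ngrams)
    size = len(keys)
    for seed in range(max_seed):
        slots = {slot_hash(seed, k, size) for k in keys}
        if len(slots) == size:  # no collisions: the hash is perfect and minimal
            table = [None] * size
            for k in keys:
                # Store the key next to the values so queries can be verified.
                table[slot_hash(seed, k, size)] = (k, ngrams[k])
            return seed, table
    raise RuntimeError("no collision-free seed found")


def query(seed, table, key):
    """Single lookup; the stored key reveals queries for unknown N-grams."""
    stored_key, value = table[slot_hash(seed, key, len(table))]
    return value if stored_key == key else None


# Hypothetical trigrams of wordIDs mapped to (log probability, backoff).
trigrams = {
    (0, 1, 2): (-0.3, -0.1),
    (0, 1, 3): (-0.9, 0.0),
    (4, 1, 2): (-1.2, -0.4),
    (2, 2, 5): (-0.6, 0.0),
}
seed, table = build_table(trigrams)
```

A query for an N-gram that is not in the model, such as query(seed, table, (9, 9, 9)), still lands in some slot, but the key comparison fails and the lookup reports a miss.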
The SHoUT decoder is able to handle language models up to 4-grams. It is possible to extend the decoder so that it is able to handle higher order N-grams, although the tables and the hash functions would then start to take up large amounts of memory.
6.1.2 Acoustic models
For acoustic modeling, standard three-state left-to-right hidden Markov models with Gaussian mixture models as probability distribution functions are applied (see chapter 2 and figure 2.1). This means that a training procedure is needed to set the HMM transition probabilities, GMM Gaussian weights, Gaussian mean vectors and covariance matrices. Also the number of Gaussians per mixture needs to be set and the triphone clusters need to be defined. In the SHoUT toolkit, this complex procedure has been implemented in a stand-alone application. The steps that the training application takes will be discussed next. Figure 6.2 is a graphical representation of the training procedure.
Figure 6.2: The training procedure as it is implemented in the SHoUT toolkit. Each phone model is trained this way.
Allocating training data to each phone model
In order to train the acoustic models, training data for each model is needed. The acoustic training data consists of the audio itself, the exact start and end time of each phone and of the neighboring phones (left and right context). The easiest way to obtain this information is to run a forced-alignment of each utterance in the training set. A forced-alignment is a decoding run with only one search path: the string of phones that make up the utterance. The timing information for each phone in the result of the forced-alignment can be used for training the models. Of course, this method requires that there are already acoustic models available.
Another method for creating a training set is to manually determine start and end time of each phone. This method is very time consuming. Fortunately, by annotating a relatively small set of utterances, initial models can be created that can be used for forced-alignment. The new (larger) training set can then be used to train new models and iteratively more refined models can be created.
A third approach is to map acoustic models from another language to the target language. Similar to the previous method, the initial forced-alignment will be rough, but by iterating the process, refined alignments can be created. This method was used to create the Dutch acoustic models for the SHoUT decoder. Publicly available English acoustic models were used as initial models and in each iteration the latest alignments were used to train new models. In total, eight iterations were run to create the final Dutch acoustic models.
Non-speech models
Once training data is allocated for each phone, the training procedure as depicted in figure 6.2 can be started. The SHoUT decoder uses two kinds of acoustic models: phone models with the described three-state topology and non-speech models. The non-speech models represent not only silence, but also a variety of non-speech sounds such as lip-smack or laughter. They are not modeled with three states; instead, the HMMs of these models contain only a single state. The non-speech model training is not context dependent: all non-speech frames are mapped to the same GMM. Therefore, for the non-speech models there is no need to determine triphone clusters and only two transition probabilities need to be calculated. In all other aspects the training procedure for non-speech models is identical to that of the other models. Note that because the SAD subsystem is expected to remove all audible non-speech, the SHoUT ASR subsystem will only be trained with a non-speech model for silence.
Context dependency
Determining the triphone context clusters is done as proposed in [YOW94] and described in chapter 2. For each state, a decision tree is created in such a way that at each node, the training set is optimally divided in two. For each node in the tree a single Gaussian is trained on the data of that node. Note that this means that not only the begin and end time of each phone occurrence should be annotated, but also the moments at which a state transition is made, so that each feature vector can be assigned to one of the states.
The size of each decision tree is restricted by three static parameters. First, a minimum number of training samples should be available for each node. Second, the improvement in score gained by splitting a node should be at least a fixed percentage. Third, the depth of the tree is limited by a fixed number. Finding the optimum values for these three parameters by means of a grid search would have been too time consuming. Instead, a number of exploratory experiments were conducted to obtain rough settings for these parameters so that the number of clusters was balanced with the amount of available data for each cluster. The three parameters were finally set to a minimum of 2000 training samples per cluster, a maximum tree depth of 19 (resulting in at most 2^18 clusters) and a minimum score improvement of 50%.
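The three restrictions can be read as a single test that is applied before a node is split. The sketch below is one possible reading, under several assumptions that the text does not fix: the root counts as depth 1 (so that a depth limit of 19 gives at most 2^18 leaves), the sample minimum applies to both children, and the improvement is measured relative to the absolute score of the unsplit node. The function name may_split and the example scores are hypothetical.

```python
MIN_SAMPLES = 2000       # minimum training samples per cluster
MAX_DEPTH = 19           # maximum tree depth, root assumed to be depth 1
MIN_IMPROVEMENT = 0.50   # minimum relative score improvement of a split

# A binary tree whose leaves sit at depth 19 has at most 2**18 of them.
max_leaves = 2 ** (MAX_DEPTH - 1)


def may_split(left_count, right_count, depth, parent_score, split_score):
    """Return True if splitting a node at the given depth is allowed.

    parent_score is the score of the unsplit node and split_score the
    combined score of its two children (assumed: higher is better).
    """
    if depth + 1 > MAX_DEPTH:
        return False  # the children would exceed the depth limit
    if min(left_count, right_count) < MIN_SAMPLES:
        return False  # one of the children would have too little data
    improvement = (split_score - parent_score) / abs(parent_score)
    return improvement >= MIN_IMPROVEMENT
```

For instance, a split at depth 5 into children with 2500 and 3000 samples that raises a score of -1000.0 to -400.0 is a relative improvement of 60% and would be accepted under these assumptions.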
Increasing the number of Gaussians
Once the triphone clusters are determined for each state, an initial model with one Gaussian for each cluster is trained. Two EM training methods are implemented for the training application: Baum-Welch and Viterbi. Using Baum-Welch training, all training samples are presented to all three states, but for each state the samples are weighted with the probability that they occur in that particular state. Note that in contrast to the determination of the triphone clusters, for this task the alignment of each phone occurrence is fixed but the moments of the state transitions are unknown (hidden). Using Viterbi training, each feature vector is presented only to the state in which it is most likely to occur. With Baum-Welch training more accurate GMMs can be trained than with Viterbi training, but Baum-Welch is more time consuming.
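The difference between the two methods lies in how each feature vector is assigned to the states. The following sketch computes both assignments for one phone occurrence with a three-state left-to-right HMM: the soft state occupancies used by Baum-Welch (forward-backward) and the hard state sequence used by Viterbi. It assumes that the per-frame state likelihoods are already given, that the phone starts in the first state and ends in the last one, and that the exit probability of the last state can be ignored. All numbers are made up for illustration.

```python
def state_occupancy(trans, emit):
    """Baum-Welch weights: probability of each state at each frame."""
    T, S = len(emit), len(trans)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    alpha[0][0] = emit[0][0]  # the phone starts in the first state
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = emit[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(S))
    beta[T - 1][S - 1] = 1.0  # the phone ends in the last state
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(S))
    total = alpha[T - 1][S - 1]
    return [[alpha[t][s] * beta[t][s] / total for s in range(S)]
            for t in range(T)]


def viterbi_path(trans, emit):
    """Viterbi assignment: the single most likely state for each frame."""
    T, S = len(emit), len(trans)
    delta = [[0.0] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    delta[0][0] = emit[0][0]
    for t in range(1, T):
        for j in range(S):
            best = max(range(S), key=lambda i: delta[t - 1][i] * trans[i][j])
            back[t][j] = best
            delta[t][j] = delta[t - 1][best] * trans[best][j] * emit[t][j]
    path = [S - 1]  # backtrack from the last state at the last frame
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical left-to-right transitions and per-frame state likelihoods.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1, 0.1],
        [0.5, 0.4, 0.1],
        [0.1, 0.8, 0.2],
        [0.1, 0.4, 0.5],
        [0.1, 0.1, 0.9]]

gamma = state_occupancy(trans, emit)  # one weight per frame and state
path = viterbi_path(trans, emit)      # one state index per frame
```

With Baum-Welch, frame t contributes to the statistics of every state s with weight gamma[t][s]; with Viterbi, it contributes only to state path[t] with weight one.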
In order to speed up the procedure of increasing the number of Gaussians in each triphone cluster, Viterbi training is used instead of Baum-Welch. Each triphone cluster is trained until the score no longer improves by more than a fixed percentage. As with determining the triphone clusters, determining the optimum value for this percentage is very time consuming. After exploratory experiments it was set to 0.5%. After each set of training runs, the Gaussian with the highest weight (the Gaussian that was trained with the most training samples) is split in two. The two new Gaussians are created by shifting the means of the Gaussians to opposite sides in all dimensions and by increasing the variance in each dimension by 20%. After splitting the Gaussian, each model is trained with another set of Viterbi training runs.
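The splitting step can be sketched for a diagonal-covariance GMM as follows. The 20% variance increase is taken from the text; the size of the mean shift (here a fraction of the standard deviation in each dimension) and the equal division of the weight over the two new Gaussians are assumptions made for this illustration, as are the function names.

```python
import math


def split_gaussian(weight, mean, var, shift=0.2, var_growth=1.2):
    """Split one diagonal Gaussian into two.

    shift is an assumed fraction of the standard deviation by which the
    means move apart; var_growth = 1.2 is the 20% variance increase.
    """
    offsets = [shift * math.sqrt(v) for v in var]
    mean_up = [m + o for m, o in zip(mean, offsets)]
    mean_down = [m - o for m, o in zip(mean, offsets)]
    new_var = [v * var_growth for v in var]
    # Assumption: the original weight is divided equally.
    return [(weight / 2, mean_up, list(new_var)),
            (weight / 2, mean_down, list(new_var))]


def split_heaviest(gmm):
    """Replace the Gaussian with the highest weight by its two halves."""
    index = max(range(len(gmm)), key=lambda i: gmm[i][0])
    weight, mean, var = gmm[index]
    return gmm[:index] + split_gaussian(weight, mean, var) + gmm[index + 1:]


# Hypothetical two-component GMM: (weight, mean vector, variance vector).
gmm = [(0.7, [0.0, 1.0], [1.0, 4.0]),
       (0.3, [5.0, 5.0], [1.0, 1.0])]
new_gmm = split_heaviest(gmm)
```

After the split the mixture has one more Gaussian and the weights still sum to one; the subsequent training runs then move the two new Gaussians apart according to the data.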
This procedure of increasing the number of Gaussians is repeated until a maximum number of Gaussians is reached or until the number of training samples for any of the Gaussians in the GMM drops to a minimum. This minimum was also determined by exploratory experiments. It is set to 20 training samples per Gaussian.
Baum-Welch training
In an attempt to improve the precision of the models, after a number of iterations of increasing the number of Gaussians, instead of Viterbi training, Baum-Welch training is applied. Also, once the maximum number of Gaussians per GMM is reached, an extra number of Baum-Welch runs is performed. As the experiments in section 6.3.3 will show, these final Baum-Welch runs improve the models considerably.