Direct posterior models for automatic speech recognition have attracted increasing interest in the research community. In [Heigold & Schlüter+ 07] it is shown that Gaussian HMMs (GHMMs) are equivalent to Gaussian HMM-like Hidden Conditional Random Fields (HCRFs). While conventional GHMMs are usually estimated with a criterion on the segment level, hybrid approaches are typically based on a formulation of the criterion on the frame level. The study shows that the improvements of HCRFs over GHMMs reported in the literature are not due to refined acoustic modeling but rather to the more robust formulation of the underlying optimization problem or to spurious local optima. In [Heigold & Deselaers+ 08a], the work presented in this chapter has been continued by extending GIS such that it not only allows for training log-linear models with hidden variables but also enables the optimization of discriminative training criteria other than the Maximum Mutual Information criterion, for instance the Minimum Phone Error criterion. A completely different approach to the training problem of linear models is suggested in the following chapter, which extends the Minimum Error Rate Training (MERT) algorithm for N-best lists, as suggested by [Och 03] for statistical machine translation, to word lattices. The strength of the extended MERT algorithm is that it allows for efficiently constructing and exploring the exact error surface over all sentence hypotheses that are encoded in a word lattice under virtually any automated evaluation criterion used in natural language processing.
7.5 Conclusions
While Maximum Entropy (ME) based learning procedures have been successfully applied to text-based natural language processing, there are only a few investigations on using the ME framework
for acoustic modeling in automatic speech recognition. In this chapter, it was shown that the Generalized Iterative Scaling (GIS) algorithm can be used as an optimization algorithm to discriminatively train the parameters of an automatic speech recognizer based on Hidden Markov Models (HMMs) with continuous Gaussian densities. The ME approach was compared analytically and experimentally with both a conventional Maximum Likelihood (ML) training and a standard approach to discriminative training under the Maximum Mutual Information (MMI) criterion based on the Extended Baum (EB) algorithm. Experiments conducted on a recognition task for continuously spoken connected German digit strings achieved a relative improvement of up to 23% over the ML trained system, and more than 14% over the MMI criterion trained with the EB algorithm. In combination with a linear discriminant analysis, the EB algorithm performed better and outperformed the ME approach by 9% relative.
Chapter 8
Minimum Error Rate Training
Minimum Error Rate Training (MERT) is an effective means to estimate the feature function weights of a linear model such that an automated evaluation criterion for measuring system performance can be optimized directly in training. To accomplish this, the training procedure determines for each feature function its exact error surface on a given set of sentence hypotheses. The feature function weights are then adjusted by traversing the error surface combined over all sentences and picking those values for which the resulting error count reaches a minimum. Typically, candidates in MERT are represented as N-best lists which contain the N most probable sentence hypotheses produced by a decoder. This chapter presents a novel algorithm that allows for efficiently constructing and representing the exact error surface of all sentence hypotheses that are encoded in a word lattice. Compared to N-best MERT, the number of sentence hypotheses thus taken into account increases by several orders of magnitude. The proposed method can be used to train the feature function weights of a log-linear combination of feature functions and multiple knowledge sources.
The remainder of this chapter is organized as follows. Section 8.1 motivates the general concept behind the MERT criterion. Section 8.2 briefly reviews N-best MERT and introduces some basic concepts that are used to develop the line optimization algorithm for word lattices in Section 8.3. Section 8.4 presents an upper bound on the complexity of the unsmoothed error surface for the sentence hypotheses represented in a word lattice. This upper bound is used to prove the space and runtime efficiency of the suggested algorithm. The chapter concludes with a summary in Section 8.5.
8.1 Introduction
Many statistical methods in natural language processing aim at minimizing the probability of sentence errors. In practice, however, system quality is often measured based on error metrics that assign non-uniform costs to classification errors and thus go far beyond counting the number of wrong decisions. Examples are the mean average precision for ranked retrieval, the F-measure for parsing, and the word error rate in automatic speech recognition. A class of training criteria that provides a tighter connection between the decision rule and the final error metric is known as Minimum Error Rate Training (MERT) and has been suggested in the context of statistical machine translation in [Och 03].
MERT aims at estimating the model parameters such that the decision under the zero-one loss function maximizes some end-to-end performance measure on a development corpus. In combination with log-linear models, the training procedure allows for a direct optimization of the unsmoothed error count. The criterion can be derived from Bayes' decision rule as follows: Let $X_r = x_{r1}, \dots, x_{rT_r}$ denote a sequence of acoustic observation vectors together with the corresponding spoken word sequence $W_r = w_{r1}, \dots, w_{rN_r}$. Under the zero-one loss function, the sentence hypothesis which maximizes the a posteriori probability is chosen:

$$\hat{W} = \arg\max_{W} \Pr(W \mid X_r) \tag{8.1}$$
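Since the true posterior distribution is unknown, $\Pr(W \mid X_r)$ is modeled via a log-linear model which combines some feature functions $h_m(W, X)$ with feature function weights $\lambda_m$, $m = 1, \dots, M$:

$$\begin{align}
\Pr(W \mid X_r) &= p_{\lambda_1^M}(W \mid X_r) \tag{8.2} \\
&= \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(W, X_r)\right]}{\sum_{W'} \exp\left[\sum_{m=1}^{M} \lambda_m h_m(W', X_r)\right]} \tag{8.3}
\end{align}$$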
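As an illustration of Eq. (8.3), the following minimal Python sketch evaluates the log-linear posterior on a small, finite hypothesis set. Restricting the normalization sum to a finite candidate set, the function name `log_linear_posterior`, and the toy feature values are assumptions made for this example only; they are not taken from the system described in this chapter.

```python
import math


def log_linear_posterior(features, weights):
    """Posterior over a finite hypothesis set, in the form of Eq. (8.3).

    features: dict mapping a hypothesis W to its feature values [h_1, ..., h_M]
    weights:  list of feature function weights [lambda_1, ..., lambda_M]
    """
    # Linear score of each hypothesis: sum_m lambda_m * h_m(W, X_r)
    scores = {w: sum(l * h for l, h in zip(weights, hs))
              for w, hs in features.items()}
    # Subtract the maximum score before exponentiating (numerical stability)
    top = max(scores.values())
    expd = {w: math.exp(s - top) for w, s in scores.items()}
    # Normalize over the (assumed finite) set of competing hypotheses
    z = sum(expd.values())
    return {w: e / z for w, e in expd.items()}


# Hypothetical toy data: three digit-string hypotheses with two features each
toy_features = {
    "eins zwei": [-4.0, -1.0],
    "eins drei": [-5.0, -0.5],
    "nein zwei": [-6.0, -2.0],
}
toy_weights = [1.0, 0.5]

posterior = log_linear_posterior(toy_features, toy_weights)
best = max(posterior, key=posterior.get)
```

Because the denominator does not depend on the hypothesis, the hypothesis with the highest posterior is also the one with the highest linear score, so the decision itself never requires the normalization.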
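The feature function weights are the parameters of the model, and the objective of the MERT criterion is to find a parameter set $\hat{\lambda}_1^M$ that minimizes the error count on a representative set of training sentences. More precisely, let $(\mathcal{X}, \mathcal{W}) := (X_r, W_r)_{r=1,\dots,R}$ denote the training utterances of a speech corpus, each consisting of a sequence of acoustic observation vectors $X_r = x_{r1}, \dots, x_{rT_r}$ together with the corresponding spoken word sequence $W_r = w_{r1}, \dots, w_{rN_r}$. Assuming that the corpus-based error count for some sentence hypotheses $\mathcal{V} = \{W(X_r) \mid r = 1, \dots, R\}$ is additively decomposable into the error counts of the individual sentences, i.e., $E(\mathcal{W}, \mathcal{V}) = \sum_{r=1}^{R} E(W_r, W(X_r))$, the MERT criterion is given as:

$$\begin{align}
\hat{\lambda}_1^M &= \arg\min_{\lambda_1^M} \left\{ \sum_{r=1}^{R} E\big(W_r, \hat{W}(X_r; \lambda_1^M)\big) \right\} \tag{8.4} \\
&= \arg\min_{\lambda_1^M} \left\{ \sum_{r=1}^{R} \sum_{W \in M_r} E(W_r, W)\, \delta\big(\hat{W}(X_r; \lambda_1^M), W\big) \right\} \tag{8.5}
\end{align}$$

with

$$\hat{W}(X_r; \lambda_1^M) = \arg\max_{W} \left\{ \sum_{m=1}^{M} \lambda_m h_m(W, X_r) \right\} \tag{8.6}$$

Here, $M_r$ denotes the set of candidate sentence hypotheses for utterance $X_r$, and $\delta(\cdot, \cdot)$ is the Kronecker delta, which equals one if both arguments are identical and zero otherwise.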
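To make the criterion concrete, the sketch below evaluates the objective of Eq. (8.4) for a fixed weight vector on a toy corpus. It assumes that each utterance comes with a small candidate list whose feature values and error counts are already known; the dictionary layout, the function names `decide` and `corpus_error`, and all numbers are hypothetical.

```python
def decide(candidates, weights):
    """Decision rule of Eq. (8.6): pick the candidate with the highest linear score."""
    return max(candidates,
               key=lambda c: sum(l * h for l, h in zip(weights, c["h"])))


def corpus_error(corpus, weights):
    """Objective of Eq. (8.4): summed error count of the selected hypotheses."""
    return sum(decide(cands, weights)["err"] for cands in corpus)


# Hypothetical toy corpus: two utterances, two candidates each.
# "h" holds the feature values, "err" the error count against the reference.
toy_corpus = [
    [{"hyp": "a", "h": [-1.0, -3.0], "err": 2},
     {"hyp": "b", "h": [-2.0, -1.0], "err": 0}],
    [{"hyp": "c", "h": [-1.0, -2.0], "err": 1},
     {"hyp": "d", "h": [-3.0, -1.0], "err": 0}],
]

errors_first_feature_only = corpus_error(toy_corpus, [1.0, 0.0])
errors_equal_weights = corpus_error(toy_corpus, [1.0, 1.0])
errors_strong_second = corpus_error(toy_corpus, [1.0, 3.0])
```

In this toy setup the error count drops from 3 to 1 to 0 as the second weight grows. The objective is piecewise constant in the weights, which is why gradient-based optimization cannot be applied to it directly.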
In [Och 03], it was shown that linear models can effectively be trained under the MERT criterion using a special line optimization algorithm. This line optimization determines for each feature function $h_m$ and sentence $X_r$ the exact error surface on a set of sentence hypotheses $M_r$. The feature function weights are then adjusted by traversing the error surface combined over all sentences in the training corpus and moving the weights to a point where the resulting error reaches a minimum.
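The line optimization described above can be sketched for candidate lists as follows. Along a search line $\lambda + \gamma d$, the score of every candidate is linear in $\gamma$, so the best candidate per utterance is given by the upper envelope of a set of lines, and the error count changes only at the envelope's breakpoints. This is a minimal illustration in the spirit of [Och 03], not the implementation used in this work; the data layout, the function names, and the rule for picking a point inside the best interval are assumptions. Optimizing a single feature weight corresponds to choosing a unit vector as the direction.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def upper_envelope(lines):
    """Return [(gamma_start, err), ...] for the topmost line in each interval.

    lines: list of (slope, intercept, err), one line per candidate hypothesis.
    """
    hull = []  # entries (gamma_start, slope, intercept, err), slopes increasing
    for m, b, e in sorted(lines, key=lambda t: (t[0], t[1])):
        if hull and hull[-1][1] == m:
            hull.pop()  # parallel line with lower or equal intercept is dominated
        while hull:
            x0, m0, b0, _ = hull[-1]
            x = (b0 - b) / (m - m0)  # the new line is on top for gamma > x
            if x > x0:
                break
            hull.pop()  # the previous line never reaches the top
        else:
            x = -math.inf  # the new line forms the leftmost segment
        hull.append((x, m, b, e))
    return [(x, e) for x, _, _, e in hull]


def line_search(corpus, weights, direction):
    """Minimize the summed error count along weights + gamma * direction."""
    events, total = [], 0
    for cands in corpus:
        lines = [(dot(direction, c["h"]), dot(weights, c["h"]), c["err"])
                 for c in cands]
        segs = upper_envelope(lines)
        total += segs[0][1]  # error count for gamma -> -infinity
        for (x, e), (_, e_prev) in zip(segs[1:], segs):
            events.append((x, e - e_prev))  # change in error at breakpoint x
    events.sort()
    best = (total, -math.inf, events[0][0] if events else math.inf)
    for i, (x, delta) in enumerate(events):
        total += delta
        nxt = events[i + 1][0] if i + 1 < len(events) else math.inf
        if nxt > x and total < best[0]:
            best = (total, x, nxt)
    err, lo, hi = best
    # Pick a point inside the best interval (assumed rule: midpoint or offset 1)
    if math.isinf(lo) and math.isinf(hi):
        gamma = 0.0
    elif math.isinf(lo):
        gamma = hi - 1.0
    elif math.isinf(hi):
        gamma = lo + 1.0
    else:
        gamma = 0.5 * (lo + hi)
    return gamma, err


# Hypothetical toy corpus: "h" are feature values, "err" are error counts
toy_corpus = [
    [{"h": [-1.0, -3.0], "err": 2}, {"h": [-2.0, -1.0], "err": 0}],
    [{"h": [-1.0, -2.0], "err": 1}, {"h": [-3.0, -1.0], "err": 0}],
]
# Optimize the second weight only, starting from weights (1, 0)
gamma, err = line_search(toy_corpus, [1.0, 0.0], [0.0, 1.0])
```

For the toy corpus, the breakpoints lie at $\gamma = 0.5$ and $\gamma = 2$, and the error count is zero for all $\gamma > 2$. Sorting the lines by slope makes the construction of each envelope cost $O(N \log N)$ for $N$ candidates.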
Sentence hypotheses in MERT are typically represented as N-best lists which contain the N most probable sentence hypotheses. A downside of this approach, however, is that N-best lists can only capture a very small fraction of the search space. As a consequence, the line optimization algorithm needs to repeatedly decode the development corpus and enlarge the candidate repositories with newly found hypotheses in order to avoid overfitting on $M_r$ and to prevent the optimization procedure from stopping in a poor local optimum.
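The repeated decoding described above can be outlined as a simple outer loop. The sketch below is only an assumption about how such a loop might be organized: `decode_nbest` and `optimize` are hypothetical callables standing in for a decoder and a weight optimizer, and stopping once no new hypotheses are found is an assumed criterion that the text does not specify.

```python
def nbest_mert(decode_nbest, optimize, weights, max_rounds=10):
    """Alternate between decoding and weight optimization on growing pools.

    decode_nbest(weights) -> list (one entry per utterance) of hypothesis lists
    optimize(pools, weights) -> new weights, trained on the accumulated pools
    Both callables are hypothetical placeholders.
    """
    pools = None
    for _ in range(max_rounds):
        nbest = decode_nbest(weights)
        if pools is None:
            pools = [set() for _ in nbest]
        added = 0
        for pool, hyps in zip(pools, nbest):
            before = len(pool)
            pool.update(hyps)  # enlarge the candidate repository
            added += len(pool) - before
        if added == 0:
            break  # assumed stopping rule: decoding produced nothing new
        weights = optimize(pools, weights)
    return weights, pools


# Toy stand-ins: the "decoder" output depends on the first weight only
def toy_decode(weights):
    return [["b"]] if weights[0] >= 1.0 else [["a"]]


def toy_optimize(pools, weights):
    return [weights[0] + 1.0]


final_weights, final_pools = nbest_mert(toy_decode, toy_optimize, [0.0])
```

In the toy run, the pool grows from one to two hypotheses over two rounds, and the loop stops in the third round when decoding yields no new candidate.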
This chapter presents a novel algorithm that allows for efficiently constructing and representing the unsmoothed error surface for all sentence hypotheses that are encoded in a word lattice. The number of sentence alternatives thus taken into account increases by several orders of magnitude compared to N-best MERT.