The errors made by conventional GMM-HMM systems differ from those made by DNN-HMM systems. This makes it possible to improve overall performance by fusing the complementary recognition results of the GMM-HMM and DNN-HMM systems. The most widely used system combination techniques are recognizer output voting error reduction (ROVER) [21], segmental conditional random field (SCARF) [22], and minimum Bayes risk (MBR) based lattice combination [23].
10.2.1 ROVER
ROVER [21] is a two-step procedure consisting of alignment and voting, as shown in Fig. 10.4. In the alignment step, an example of which is depicted in Fig. 10.3, the recognition results from two or more ASR systems are combined into a single word transition network (WTN). To align and combine three or more ASR results, we first create a linear WTN for each ASR system output. In Fig. 10.3, for example, linear WTNs are created for three ASR results in Step 1. Restricting the WTNs to a linear topology significantly simplifies the combination process.
To achieve the best results, these linear WTNs are ordered by increasing WER. The first WTN (WTN-1 in Fig. 10.3), which has the lowest WER, is designated as the base WTN from which the composite WTN is developed. The second WTN is then aligned to the base WTN using the dynamic programming (DP) alignment protocol.
Fig. 10.3 Illustration of the word transition network (WTN) composition procedure. In Step 1, a linear WTN is generated for each ASR result. In Step 2, WTN-1 is selected as the base WTN to which WTN-2 is aligned.
Fig. 10.4 Processing steps in ROVER
184 10 Fuse Deep Neural Network and Gaussian Mixture Model Systems
Table 10.6 The effect of system combination using ROVER
Method                  50 h (%)    430 h (%)
Baseline GMM-HMM        18.8        16.0
GMM-HMM with AE-BN      17.5        15.5
ROVER over two          16.4        15.0
WER on the English broadcast news task. (Summarized from [17])
We then augment the base WTN with word transition arcs from the second WTN as appropriate, as shown in Step 3. With this new base WTN, the third WTN is merged into the base WTN as shown in Step 4. The process is repeated until all linear WTNs are merged into the base WTN.
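The alignment step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference ROVER implementation: a WTN is represented as a list of slots, each slot holding one word per merged system; `NULL`, `align_into_wtn`, and `build_wtn` are hypothetical names; and the DP uses unit insertion, deletion, and substitution costs, with a substitution costing nothing when the new word already appears in the slot. The hypotheses are assumed to be passed in order of increasing WER, so the first one acts as the base WTN.

```python
NULL = "@"  # hypothetical marker for a null (empty) word transition arc


def align_into_wtn(wtn, hyp, n_merged):
    """Align the word list `hyp` to `wtn` and return the augmented WTN.

    wtn: list of slots; each slot is a list of `n_merged` words.
    """
    rows, cols = len(wtn) + 1, len(hyp) + 1
    cost = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i
    for j in range(1, cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if hyp[j - 1] in wtn[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitution
                             cost[i - 1][j] + 1,        # hyp has no word here
                             cost[i][j - 1] + 1)        # hyp has an extra word

    # Backtrace, adding one arc from `hyp` (or a null arc) to every slot.
    merged = []
    i, j = len(wtn), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if hyp[j - 1] in wtn[i - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                merged.append(wtn[i - 1] + [hyp[j - 1]])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            merged.append(wtn[i - 1] + [NULL])
            i -= 1
        else:
            # New slot: earlier systems get null arcs at this position.
            merged.append([NULL] * n_merged + [hyp[j - 1]])
            j -= 1
    merged.reverse()
    return merged


def build_wtn(hyps):
    """Merge linear hypotheses (ordered by increasing WER) into one WTN."""
    wtn = [[w] for w in hyps[0]]  # base WTN from the first hypothesis
    for n, hyp in enumerate(hyps[1:], start=1):
        wtn = align_into_wtn(wtn, hyp, n)
    return wtn
```

For example, `build_wtn([["a", "b", "c"], ["a", "c"], ["a", "b", "d"]])` gives three slots, with a null arc in the second slot for the system that omitted "b".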
Once the combined WTN is generated, the voting module evaluates each branching point using a voting scheme, which selects the best scoring word (with the highest number of votes) for the new transcription. There can be many different voting schemes, for example, based on frequency of occurrence, frequency of occurrence and average word confidence, or frequency of occurrence and maximum confidence.
The general scoring formula is
where $\lambda_n$ is the system-dependent weight, $\delta$ is the Kronecker delta, $i$ denotes the position in the alignment, $N$ is the number of systems, and $\mathrm{conf}(w, i)$ is the confidence score of word $w$ at position $i$. Majority vote and averaged confidence score are smoothly interpolated via $\alpha$, which is trained on a development set.
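To make the voting step concrete, the sketch below scores every word in a slot and keeps the best one. The exact score used here is an assumption chosen to match the symbols defined above, not a formula taken from the source: $\alpha \cdot \frac{1}{N}\sum_{n=1}^{N} \lambda_n \delta(w, w_{n,i}) + (1-\alpha)\,\mathrm{conf}(w, i)$, with $\mathrm{conf}(w, i)$ taken as the average confidence over the systems that proposed $w$ in slot $i$. The function name `rover_vote` and the `"@"` null marker are hypothetical.

```python
from collections import defaultdict


def rover_vote(wtn, confs, weights, alpha, null="@"):
    """Pick one word per slot of a combined WTN.

    wtn:     list of slots, each a list with one word per system.
    confs:   same shape as `wtn`; confidence in [0, 1] for each word.
    weights: per-system weights (lambda_n).
    alpha:   interpolation between weighted vote and average confidence.
    """
    n_sys = len(weights)
    output = []
    for slot_words, slot_confs in zip(wtn, confs):
        votes = defaultdict(float)     # sum of lambda_n over systems voting w
        conf_sum = defaultdict(float)  # sum of confidences for w
        count = defaultdict(int)       # number of systems proposing w
        for n, (w, c) in enumerate(zip(slot_words, slot_confs)):
            votes[w] += weights[n]
            conf_sum[w] += c
            count[w] += 1
        # Assumed score: interpolated weighted vote and average confidence.
        scores = {
            w: alpha * votes[w] / n_sys
               + (1.0 - alpha) * conf_sum[w] / count[w]
            for w in votes
        }
        best = max(scores, key=scores.get)
        if best != null:  # a winning null arc emits no word
            output.append(best)
    return output
```

With `alpha = 1.0` the scheme reduces to a weighted majority vote; with `alpha = 0.0` only the confidence scores decide.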
It has been shown that ROVER over different systems almost always provides additional improvement in recognition accuracy. For example, Sainath et al. [17] reported that, on the English broadcast news task, combining the AE-BN system with the baseline GMM-HMM system yields additional WER reductions of 0.9 and 0.5 % over the AE-BN system trained with 50 and 430 h of training data, respectively, as shown in Table 10.6.
10.2.2 SCARF
In the SCARF [22] framework, the conditional probability of a state sequence $s$ given an observation sequence $o$ is given by

$$p(s|o) = \frac{\sum_{q:\,|q|=|s|} \exp\left(\sum_{e \in q} \sum_{k} \lambda_k f_k\left(s_l^e, s_r^e, o(e)\right)\right)}{\sum_{s'} \sum_{q:\,|q|=|s'|} \exp\left(\sum_{e \in q} \sum_{k} \lambda_k f_k\left({s'}_l^e, {s'}_r^e, o(e)\right)\right)}, \tag{10.2}$$

where $s_l^e$ and $s_r^e$ are the left and right states associated with an edge $e$ in the recognition lattice, $q$ is a segmentation of the observation sequence which induces a set of edges $e \in q$ between the states, $o(e)$ is the segment associated with the right-hand state $s_r^e$ and spans a block of observations from some start time to some end time, $f_k\left(s_l^e, s_r^e, o(e)\right)$ is a feature defined over the edge and associated segment, and $\lambda_k$ is the weight associated with the feature. Figure 10.5 depicts an example SCARF, in which three states are hypothesized and aligned with seven observations. The weights $\lambda_k$ are optimized to maximize the sequence conditional log-likelihood over a training set.
Fig. 10.5 An example of SCARF. Shown in the figure are three hypothesized states aligned with seven observations. $s_1$ equals $s_l^e$, the left state of edge $e$, and $s_2$ equals $s_r^e$, the right state
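A brute-force sketch of the SCARF conditional probability on a toy problem may help fix the roles of the segmentation $q$, the edge features $f_k$, and the weights $\lambda_k$. Everything specific here is an assumption for illustration: the names `segmentations`, `unnormalized`, and `scarf_posterior`, the start symbol `"<s>"` used as the left state of the first edge, and the exhaustive enumeration of state sequences, which is only feasible for very short inputs. A practical system would restrict the sums to paths in a recognition lattice.

```python
import itertools
import math


def segmentations(num_obs, num_states):
    """Yield every split of range(num_obs) into contiguous non-empty blocks."""
    for cuts in itertools.combinations(range(1, num_obs), num_states - 1):
        bounds = (0,) + cuts + (num_obs,)
        yield list(zip(bounds[:-1], bounds[1:]))


def unnormalized(s, o, feats, lam):
    """Sum over segmentations q of exp(sum_e sum_k lam_k * f_k(sl, sr, o(e)))."""
    total = 0.0
    for q in segmentations(len(o), len(s)):
        left = "<s>"  # assumed start symbol for the first edge
        log_score = 0.0
        for state, (b, e) in zip(s, q):
            log_score += sum(l * f(left, state, o[b:e])
                             for l, f in zip(lam, feats))
            left = state
        total += math.exp(log_score)
    return total


def scarf_posterior(s, o, labels, feats, lam):
    """p(s | o), normalized over all state sequences of length 1..len(o)."""
    z = sum(unnormalized(sp, o, feats, lam)
            for m in range(1, len(o) + 1)
            for sp in itertools.product(labels, repeat=m))
    return unnormalized(s, o, feats, lam) / z


# Two toy features (hypothetical): label/observation agreement in the
# segment, and a constant per-segment penalty.
def f_match(left, right, seg):
    return sum(1 for x in seg if x == right)


def f_segment(left, right, seg):
    return -1.0
```

For `o = list("aabbb")` and `labels = ("a", "b")`, the state sequence `("a", "b")` receives a higher posterior than `("b", "a")` under positive weights, because some segmentation lets every observation agree with its segment label.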
The key to the success of the SCARF model is the features extracted from recognition lattices generated by different ASR systems. Typical features used are [22]:
• Expectation features: defined with reference to a dictionary that specifies the spelling of each word in terms of the units.
• Levenshtein features: computed by aligning the observed unit sequence in a hypothesized span with that expected based on the dictionary entry for the word.
• Existence features: indicate the simple association between a unit in a detection stream, and a hypothesized word.
• Language model features: derived directly from the language model (LM).
• Baseline features: extracted from the baseline one-best sequence. The baseline feature for a segment is +1 when the hypothesized segment spans exactly one baseline word, and the label of the segment matches the baseline word. Otherwise it is −1.
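Of the features listed above, the baseline feature is simple enough to sketch directly. The version below is a minimal illustration under assumptions that are not in the source: the function name `baseline_feature` is hypothetical, times are frame indices with half-open intervals, and a baseline word counts as "spanned" by a segment when the midpoint of the word falls inside the segment.

```python
def baseline_feature(seg_start, seg_end, seg_label, baseline):
    """Return +1 or -1 for a hypothesized segment.

    baseline: list of (start, end, word) tuples from the baseline
              one-best output, with half-open [start, end) frame spans.
    """
    # Assumption: a baseline word is spanned when its midpoint lies
    # inside the hypothesized segment.
    spanned = [word for (start, end, word) in baseline
               if seg_start <= (start + end) / 2.0 < seg_end]
    if len(spanned) == 1 and spanned[0] == seg_label:
        return 1
    return -1
```

Using the midpoint rather than strict containment keeps the feature at +1 when the hypothesized boundaries differ from the baseline boundaries by a few frames; other overlap criteria are equally plausible.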
In [24], Jaitly et al. applied the SCARF technique to combine the GMM-HMM system with the CD-DNN-HMM system. They reduced WER by 0.4 % absolute (from 12.2 to 11.8 %) and 0.9 % absolute (from 47.1 to 46.2 %) over the MMI-trained CD-DNN-HMM system on the voice search and YouTube tasks, respectively.
10.2.3 MBR Lattice Combination
The MBR combination [23] finds the word sequence that minimizes the expected word error rate across the different systems being combined as

$$w^{*} = \arg\min_{w} \sum_{n=1}^{N} \sum_{w'} P_n(w'|o)\, L(w, w'), \tag{10.3}$$

where $L(w, w')$ is the Levenshtein distance between two word sequences $w$ and $w'$ and $P_n(w|o)$ is the posterior probability of the word sequence $w$ given the acoustic observation sequence $o$ as computed by the $n$-th model. $P_n(w|o)$ can be estimated as

$$P_n(w|o) = \frac{p_n(o|w)^{\kappa} P(w)}{\sum_{w'} p_n(o|w')^{\kappa} P(w')}, \tag{10.4}$$

where $\kappa$ is the acoustic scaling factor.
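The decision rule can be illustrated on N-best lists instead of full lattices, which is a simplifying assumption of this sketch and not the lattice-based procedure of [23]. Each system supplies hypotheses with an acoustic log-likelihood and a language model log-probability; posteriors are obtained by scaling the acoustic score with $\kappa$ and normalizing over that system's own list, and the candidate with the smallest summed expected Levenshtein distance is returned. The names `levenshtein`, `posteriors`, and `mbr_combine` are hypothetical, and each N-best list is assumed to contain distinct hypotheses.

```python
import math


def levenshtein(a, b):
    """Edit distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution / match
        prev = cur
    return prev[-1]


def posteriors(nbest, kappa):
    """nbest: list of (words_tuple, acoustic_loglik, lm_logprob).

    Returns {words_tuple: posterior}, normalized over this list only
    (an approximation of the sum over all word sequences).
    """
    logs = [kappa * am + lm for _, am, lm in nbest]
    top = max(logs)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logs]
    z = sum(exps)
    return {words: e / z for (words, _, _), e in zip(nbest, exps)}


def mbr_combine(nbest_lists, kappa):
    """Return the hypothesis with minimum expected edit distance."""
    posts = [posteriors(nb, kappa) for nb in nbest_lists]
    candidates = {w for post in posts for w in post}

    def risk(w):
        return sum(p * levenshtein(w, w2)
                   for post in posts for w2, p in post.items())

    # Sorting makes tie-breaking deterministic.
    return min(sorted(candidates), key=risk)
```

Restricting the candidates to hypotheses that appear in some N-best list keeps the search small; a lattice-based implementation can consider word sequences that no single system proposed.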
In [25], Swietojanski et al. reported that by combining the GMM-HMM and DNN-HMM systems with the MBR lattice combination technique they can achieve 1–8 % relative WER reduction over the DNN-HMM system across different setups.
However, MBR lattice combination is less robust than ROVER as it sometimes increases the error rates.