
The errors made by conventional GMM-HMM systems differ from those made by DNN-HMM systems. This makes it possible to improve overall performance by fusing the complementary recognition results of the two systems. The most widely used system combination techniques are recognizer output voting error reduction (ROVER) [21], segmental conditional random field (SCARF) [22], and minimum Bayes risk (MBR) based lattice combination [23].

10.2.1 ROVER

ROVER [21] is a two-step procedure consisting of alignment and voting, as shown in Fig. 10.4. In the alignment step, an example of which is depicted in Fig. 10.3, the recognition results from two or more ASR systems are combined into a single word transition network (WTN). To align and combine three or more ASR results, we first create a linear WTN for each ASR system output. In Fig. 10.3, for example, linear WTNs are created for three ASR results in Step 1. Restricting the WTNs to a linear topology significantly simplifies the combination process.

To achieve the best results, these linear WTNs are ordered by increasing WER. The first WTN (WTN-1 in Fig. 10.3), which has the lowest WER, is designated as the base WTN from which the composite WTN is developed. The second WTN is then aligned to the base WTN using the dynamic programming (DP) alignment protocol.

Fig. 10.3 Illustration of the word transition network (WTN) composition procedure. In Step 1, a linear WTN is generated for each ASR result. In Step 2, WTN-1 is selected as the base WTN to which WTN-2 is aligned.

Fig. 10.4 Processing steps in ROVER

184 10 Fuse Deep Neural Network and Gaussian Mixture Model Systems

Table 10.6 The effect of system combination using ROVER. WER (%) on the English broadcast news task. (Summarized from [17])

Method                 50 h (%)   430 h (%)
Baseline GMM-HMM       18.8       16.0
GMM-HMM with AE-BN     17.5       15.5
ROVER over two         16.4       15.0

We then augment the base WTN with word transition arcs from the second WTN as appropriate, as shown in Step 3. With this new base WTN, the third WTN is merged into the base WTN as shown in Step 4. The process is repeated until all linear WTNs are merged into the base WTN.
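The iterative alignment described above can be sketched in a few lines of Python. This is a simplified illustration and not the reference ROVER implementation: the WTN is represented as a list of slots (one entry per merged system), the null-arc marker `NULL` and the function names are hypothetical, and the DP uses unit edit costs, with a word matching a slot at zero cost if any system already proposed it there.

```python
NULL = "@"  # hypothetical marker for a null (empty) arc


def align_into_wtn(wtn, words):
    """Align a linear word sequence to the composite WTN and merge it in.

    wtn   : list of slots; each slot is a list with one entry per system so far.
    words : word list from the next ASR system.
    """
    n_sys = len(wtn[0]) if wtn else 0
    rows, cols = len(wtn), len(words)
    # cost[i][j] = minimum edit cost of aligning the first i slots to the first j words
    cost = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        cost[i][0] = i
    for j in range(1, cols + 1):
        cost[0][j] = j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            sub = 0 if words[j - 1] in wtn[i - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match / substitution
                             cost[i - 1][j] + 1,        # slot without a new word
                             cost[i][j - 1] + 1)        # new word without a slot
    # Backtrace, augmenting each slot with the new system's arc.
    merged, i, j = [], rows, cols
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if words[j - 1] in wtn[i - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                merged.append(wtn[i - 1] + [words[j - 1]])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            merged.append(wtn[i - 1] + [NULL])  # new system is silent here
            i -= 1
        else:
            merged.append([NULL] * n_sys + [words[j - 1]])  # earlier systems are silent
            j -= 1
    merged.reverse()
    return merged


def build_wtn(hypotheses):
    """hypotheses: word lists, assumed already ordered by increasing WER."""
    wtn = [[w] for w in hypotheses[0]]  # base WTN from the best system
    for hyp in hypotheses[1:]:
        wtn = align_into_wtn(wtn, hyp)
    return wtn


if __name__ == "__main__":
    hyps = ["a b c d".split(), "a b d".split(), "a x c d".split()]
    for slot in build_wtn(hyps):
        print(slot)
```

Each slot of the result holds one arc per system, so the voting step can operate slot by slot.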

Once the combined WTN is generated, the voting module evaluates each branching point with a voting scheme that selects the best-scoring word (the one with the most votes) for the new transcription. Many voting schemes are possible, based for example on frequency of occurrence, frequency of occurrence and average word confidence, or frequency of occurrence and maximum confidence.

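The general scoring formula is

\[
w_i^{*} = \arg\max_{w} \left[ \alpha \sum_{n=1}^{N} \lambda_n \, \delta(w, w_{n,i}) + (1 - \alpha)\, \mathrm{conf}(w, i) \right],
\]

where $w_{n,i}$ is the word hypothesized by system $n$ at position $i$, $\lambda_n$ is the system-dependent weight, $\delta$ is the Kronecker delta, $i$ denotes the position in the alignment, $N$ is the number of systems, and $\mathrm{conf}(w, i)$ is the confidence score of word $w$ at position $i$. Majority vote and averaged confidence score are smoothly interpolated via $\alpha$, which is trained on a development set.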
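The voting step can be illustrated with a minimal Python sketch. It assumes the score of word w at slot i takes the form alpha * (sum of the weights of the systems that proposed w) + (1 - alpha) * (average confidence of w in that slot); the function name `rover_vote`, the data layout, and the treatment of the null arc as an ordinary candidate with its own confidence are all assumptions made for illustration.

```python
from collections import defaultdict

NULL = "@"  # hypothetical marker for a null (empty) arc


def rover_vote(wtn, confs, weights, alpha):
    """Pick one word per slot of a combined WTN.

    wtn[i][n]   : word proposed by system n at slot i (NULL if none)
    confs[i][n] : confidence attached to that word
    weights[n]  : system-dependent weight (lambda_n), assumed to sum to 1
    alpha       : interpolation between weighted vote and averaged confidence
    """
    output = []
    for slot, slot_conf in zip(wtn, confs):
        vote = defaultdict(float)      # weighted frequency of each candidate
        conf_sum = defaultdict(float)  # accumulated confidence of each candidate
        count = defaultdict(int)
        for n, (word, conf) in enumerate(zip(slot, slot_conf)):
            vote[word] += weights[n]
            conf_sum[word] += conf
            count[word] += 1

        def score(word):
            avg_conf = conf_sum[word] / count[word]
            return alpha * vote[word] + (1.0 - alpha) * avg_conf

        best = max(vote, key=score)
        if best != NULL:  # a winning null arc emits no word
            output.append(best)
    return output


if __name__ == "__main__":
    wtn = [["a", "a", "a"], ["b", "b", "x"], ["c", NULL, "c"], ["d", "d", "d"]]
    confs = [[0.9, 0.8, 0.9], [0.6, 0.5, 0.95], [0.7, 0.3, 0.8], [0.9, 0.9, 0.9]]
    weights = [1 / 3, 1 / 3, 1 / 3]
    print(rover_vote(wtn, confs, weights, alpha=1.0))  # pure weighted vote
    print(rover_vote(wtn, confs, weights, alpha=0.0))  # pure averaged confidence
```

Setting `alpha` to 1 reduces the scheme to weighted majority voting, while 0 relies on confidence alone; in practice the value would be tuned on a development set, as the text notes.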

It has been shown that ROVER over different systems almost always provides additional improvement in recognition accuracy. For example, Sainath et al. [17] reported that, on the English broadcast news task, combining the AE-BN system with the baseline GMM-HMM system reduces the WER of the AE-BN system by a further 1.1 % absolute (from 17.5 to 16.4 %) with 50 h of training data and by 0.5 % absolute (from 15.5 to 15.0 %) with 430 h, as shown in Table 10.6.

10.2.2 SCARF

In the SCARF [22] framework, the conditional probability of a state sequence $s$ given an observation sequence $o$ is given by

\[
p(s \mid o) = \frac{\sum_{q:\,|q|=|s|} \exp\left( \sum_{e \in q,\,k} \lambda_k f_k\left(s_l^e, s_r^e, o(e)\right) \right)}{\sum_{s'} \sum_{q:\,|q|=|s'|} \exp\left( \sum_{e \in q,\,k} \lambda_k f_k\left(s_l'^{e}, s_r'^{e}, o(e)\right) \right)},
\]

where $s_l^e$ and $s_r^e$ are the left and right states associated with an edge $e$ in the recognition lattice, $q$ is a segmentation of the observation sequence which induces a set of edges $e \in q$ between the states, $o(e)$ is the segment associated with the right-hand state $s_r^e$ and spans a block of observations from some start time to some end time, $f_k(s_l^e, s_r^e, o(e))$ is a feature defined over the edge and associated segment, and $\lambda_k$ is the weight associated with the feature. Figure 10.5 depicts an example SCARF, in which three states are hypothesized and aligned with seven observations. The weights $\lambda_k$ are optimized to maximize the sequence conditional log-likelihood over a training set.

Fig. 10.5 An example of SCARF. Shown in the figure are three hypothesized states aligned with seven observations. $s_1$ equals $s_l^e$, the left state of edge $e$, and $s_2$ equals $s_r^e$, the right state
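To make the segmental model concrete, the following Python sketch evaluates p(s|o) by brute force on a toy problem. It is only an illustration under stated assumptions: all segmentations and all competing state sequences are enumerated explicitly (a real system restricts them to a lattice), the start symbol `"<s>"`, the function names, and the two toy feature functions are hypothetical.

```python
import math
from itertools import combinations, product


def segmentations(num_obs, num_states):
    """Yield every split of range(num_obs) into num_states contiguous, non-empty blocks."""
    for cuts in combinations(range(1, num_obs), num_states - 1):
        bounds = (0, *cuts, num_obs)
        yield [(bounds[i], bounds[i + 1]) for i in range(num_states)]


def unnormalized(states, obs, feats, lam, start="<s>"):
    """Sum over segmentations q of exp(sum over edges e and features k of lam_k * f_k)."""
    total = 0.0
    for q in segmentations(len(obs), len(states)):
        left, score = start, 0.0
        for right, (begin, end) in zip(states, q):
            segment = obs[begin:end]  # o(e): block of observations for this edge
            score += sum(l * f(left, right, segment) for l, f in zip(lam, feats))
            left = right
        total += math.exp(score)
    return total


def scarf_prob(states, obs, vocab, feats, lam):
    """p(s|o): normalize over all state sequences of length 1..len(obs)."""
    z = sum(unnormalized(s, obs, feats, lam)
            for length in range(1, len(obs) + 1)
            for s in product(vocab, repeat=length))
    return unnormalized(states, obs, feats, lam) / z


# Toy features (assumptions): segment purity and a state-change indicator.
def f_match(left, right, segment):
    return 1.0 if all(x == right for x in segment) else 0.0


def f_change(left, right, segment):
    return 1.0 if left != right else 0.0


if __name__ == "__main__":
    obs = ["a", "a", "b"]
    feats, lam = [f_match, f_change], [2.0, 0.5]
    for s in [("a", "b"), ("b", "a"), ("a",)]:
        print(s, scarf_prob(s, obs, ("a", "b"), feats, lam))
```

Training would adjust `lam` to maximize the log of this probability for the reference sequences; the sketch only covers evaluation.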

The key to the success of the SCARF model is the features extracted from recognition lattices generated by different ASR systems. Typical features used are [22]:

• Expectation features: defined with reference to a dictionary that specifies the spelling of each word in terms of the units.

• Levenshtein features: computed by aligning the observed unit sequence in a hypothesized span with that expected based on the dictionary entry for the word.

• Existence features: indicate the simple association between a unit in a detection stream, and a hypothesized word.

• Language model features: derived directly from LM.

• Baseline features: extracted from the baseline one-best sequence. The baseline feature for a segment is +1 when the hypothesized segment spans exactly one baseline word, and the label of the segment matches the baseline word. Otherwise it is −1.
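As an illustration of the last item, the baseline feature can be sketched as follows. The text does not say how "spans" is decided, so this sketch assumes a baseline word is spanned when its temporal midpoint falls inside the hypothesized segment; the function name and the `(start, end, word)` layout with frame indices are likewise hypothetical.

```python
def baseline_feature(segment, label, baseline):
    """Return +1 or -1 for a hypothesized segment.

    segment  : (start, end) frame indices of the hypothesized segment, end exclusive
    label    : word label hypothesized for the segment
    baseline : list of (start, end, word) from the baseline one-best output
    """
    start, end = segment
    # Assumption: a baseline word is "spanned" if its midpoint lies in the segment.
    spanned = [word for (b, e, word) in baseline if start <= (b + e) / 2 < end]
    if len(spanned) == 1 and spanned[0] == label:
        return 1
    return -1


if __name__ == "__main__":
    one_best = [(0, 10, "the"), (10, 25, "cat"), (25, 40, "sat")]
    print(baseline_feature((10, 25), "cat", one_best))  # one word, label agrees
    print(baseline_feature((10, 25), "hat", one_best))  # one word, label differs
    print(baseline_feature((0, 25), "cat", one_best))   # two words spanned
```

With a large positive weight on this feature, the model tends to reproduce the baseline one-best output unless other features provide evidence against it.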

In [24], Jaitly et al. applied the SCARF technique to combine the GMM-HMM system with the CD-DNN-HMM system. They reduced WER by 0.4 % (from 12.2 to 11.8 %) and 0.9 % (from 47.1 to 46.2 %) over the MMI trained CD-DNN-HMM system on the voice search and YouTube tasks, respectively.

10.2.3 MBR Lattice Combination

The MBR combination [23] finds the word sequence that minimizes the expected word error rate across the different systems being combined as

\[
w^{*} = \arg\min_{w} \sum_{n=1}^{N} \sum_{w'} P_n(w' \mid o)\, L(w, w'),
\]

where $L(w, w')$ is the Levenshtein distance between two word sequences $w$ and $w'$, and $P_n(w \mid o)$ is the posterior probability of the word sequence $w$ given the acoustic observation sequence $o$ as computed by the $n$-th model. $P_n(w \mid o)$ can be estimated as

\[
P_n(w \mid o) = \frac{p_n(o \mid w)^{\kappa} P(w)}{\sum_{w'} p_n(o \mid w')^{\kappa} P(w')}, \tag{10.4}
\]

where $\kappa$ is the acoustic scaling factor.
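The decision rule can be sketched in Python on N-best lists. This is a simplification and an assumption: the technique in the text operates on lattices, whereas here each system contributes a short list of unique hypotheses with acoustic and language model log-scores, the posterior of (10.4) is normalized over that list only, and all function names are hypothetical.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # match / substitution
        prev = cur
    return prev[-1]


def posteriors(nbest, kappa):
    """Posterior per (10.4), restricted to one system's N-best list.

    nbest: list of (words, acoustic_loglik, lm_logprob); hypotheses assumed unique.
    """
    logs = [kappa * am + lm for _, am, lm in nbest]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logs]
    z = sum(exps)
    return {tuple(words): e / z for (words, _, _), e in zip(nbest, exps)}


def mbr_combine(systems, kappa):
    """Return the candidate with the lowest expected edit distance summed over systems."""
    posts = [posteriors(nbest, kappa) for nbest in systems]
    candidates = {w for post in posts for w in post}

    def risk(w):
        return sum(p * edit_distance(w, other)
                   for post in posts for other, p in post.items())

    return min(sorted(candidates), key=risk)


if __name__ == "__main__":
    gmm = [(("a", "b", "c"), -10.0, -2.0), (("a", "x", "c"), -11.0, -2.0)]
    dnn = [(("a", "b", "c"), -12.0, -2.0), (("a", "b", "d"), -10.0, -2.0)]
    print(mbr_combine([gmm, dnn], kappa=1.0))
```

In this toy case the hypothesis supported by both systems wins even though it is not the top choice of the second system, which is the behavior the combination is meant to produce.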

In [25], Swietojanski et al. reported that combining the GMM-HMM and DNN-HMM systems with the MBR lattice combination technique achieved a 1–8 % relative WER reduction over the DNN-HMM system across different setups.

However, MBR lattice combination is less robust than ROVER as it sometimes increases the error rates.
