At each step, split the node with the question that can bring the highest likelihood gain. This single tree node splitting procedure is applied to each node unless:

• There are no more questions available in the question set.

• $S$ has only one state.

• $\sum_{t=1}^{T} \sum_{s \in S} \gamma_s(t)$ is smaller than a threshold.

• $\Delta L(S, r^{*}) = \max_{r} \Delta L(S, r)$ falls below a threshold.
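The stopping tests above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `decide_split`, the `SplitDecision` record, the question names in the usage note, and the two threshold parameters are all hypothetical, and the likelihood gains are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SplitDecision:
    split: bool                      # True if the node should be split
    reason: str                      # which test decided the outcome
    question: Optional[str] = None   # best question when split is True


def decide_split(num_states: int,
                 occupancy: float,
                 gains: Dict[str, float],
                 min_occupancy: float,
                 min_gain: float) -> SplitDecision:
    """Apply the four stopping tests to one tree node.

    `gains` maps each remaining question r to its likelihood gain
    Delta L(S, r); `occupancy` is the summed state occupancy of the node.
    Both thresholds are hypothetical tuning parameters.
    """
    if not gains:
        return SplitDecision(False, "no questions left")
    if num_states <= 1:
        return SplitDecision(False, "single state")
    if occupancy < min_occupancy:
        return SplitDecision(False, "occupancy below threshold")
    # Pick the question r* with the highest likelihood gain.
    best_question = max(gains, key=gains.get)
    if gains[best_question] < min_gain:
        return SplitDecision(False, "gain below threshold")
    return SplitDecision(True, "split", best_question)
```

For instance, `decide_split(3, 500.0, {"L_Nasal": 12.0, "R_Vowel": 30.0}, 100.0, 10.0)` selects the made-up question `"R_Vowel"`, while the same call with an empty `gains` dictionary leaves the node as a leaf.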

The tree construction stops once no more tree node splits occur. Note that it is often assumed that the alignments are fixed throughout the tree construction procedure, which makes $\sum_t \gamma_s(t)$, $\mu_s$, and $\Sigma_s$ constant and reduces the computational cost. In the final stage, the decrease in log-likelihood caused by merging leaf nodes with different parents is calculated using Eqn. (2.59). Any pair of leaf nodes whose log-likelihood decrease falls below a threshold is merged. Once the trees are constructed, triphone states unseen in the training corpus can be classified into the existing leaf nodes by answering the questions attached to the tree nodes. Moreover, this state level tying usually results in many logical triphone HMMs with the same tied states, which can therefore be tied together. Tying the logical HMMs with identical state definitions leads to a more compact set of HMMs, termed physical HMMs (Young et al., 2015).

Further improvements to the standard ML based decision tree can remove the single Gaussian distribution assumption by using GMMs (Reichl and Chou, 2000), or add MAP (Gauvain and Lee, 1994) based hierarchical priors (Zen and Gales, 2011). Alternatively, entropy (Hwang et al., 1996) and minimum description length (MacKay, 2003; Shinoda and Watanabe, 2000) are also widely used as criteria for decision tree construction. In addition, the requirement for a question set can also be removed (Beulen and Ney, 2000; Chou, 1991; Povey et al., 2011).

2.6 Maximum Likelihood Linear Transforms

Various linear transformations have been applied to HMM-based ASR in the past decades by assuming that the mismatch between the original models and a particular data set is piece-wise linear (Gales, 1998). This section describes three linear transformations involved in this thesis, namely maximum likelihood linear regression (MLLR) (Leggetter and Woodland, 1995), semi-tied covariance matrices (STC) (Gales, 1999), and heteroscedastic linear discriminant analysis (HLDA) (Kumar, 1997; Liu et al., 2003), all of which are estimated to maximise the likelihood of the data generated by the HMMs.

2.6.1 Maximum likelihood linear regression

It is well known that speech from distinct speakers has unique characteristics (Choukri and Chollet, 1986). Speaker independent (SI) acoustic models are a general model set built upon many speakers' data, so speaker specific acoustic models are considered to have a better chance of modelling such characteristics well. However, constructing such a model set usually requires a large amount of data from the target speakers, which is often hard or infeasible to collect. An alternative solution is to adapt SI models using a small amount of speaker specific data to capture the characteristics of the speaker's voice, which is termed speaker adaptation (Choukri and Chollet, 1986; Cox and Bridle, 1989; Gales, 2000; Gauvain and Lee, 1994; Leggetter and Woodland, 1995; Woodland, 2001).

MLLR is a widely used speaker adaptation method that employs linear transforms to adapt the mean and covariance values of Gaussian components by

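\begin{align}
\hat{\mu} &= A\mu + b \tag{2.65} \\
          &= W \begin{bmatrix} \mu \\ 1 \end{bmatrix} \notag \\
\hat{\Sigma} &= H \Sigma H^{\mathsf{T}}, \tag{2.66}
\end{align}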

where $\hat{\mu}$ and $\hat{\Sigma}$ are the transformed mean vector and covariance matrix, and $W = [A \;\; b]$ and $H$ denote their associated linear transforms. As its name suggests, the linear transforms are estimated by maximising the likelihood of generating the speaker specific data from the models, which is a linear regression problem. From Section 2.3.4, this can be done using the BW algorithm. The $Q$ function from Eqn. (2.35) applied to the HMM set $\Lambda$ is

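\[
Q(\Lambda, \hat{\Lambda}) \propto -\frac{1}{2} \sum_{i=1}^{N} \sum_{m=1}^{M} \sum_{t=1}^{T} \gamma_{im}(t) \left[ D_{im} + \ln|\hat{\Sigma}_{im}| + (o(t) - \hat{\mu}_{im})^{\mathsf{T}} \hat{\Sigma}_{im}^{-1} (o(t) - \hat{\mu}_{im}) \right], \tag{2.67}
\]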

where $D_{im}$ is the normalisation constant associated with Gaussian component $m$ in state $i$. The re-estimation formulae for $W$ can be obtained by differentiating $Q(\Lambda, \hat{\Lambda})$ with respect to $W$ and equating the result to zero (Leggetter and Woodland, 1995). Note that for each speaker, the speaker dependent (SD) linear transforms can be tied across a number of distributions through a regression class tree to share the limited adaptation samples (Leggetter and Woodland, 1995). Let $W^{(r)}$ be the regression matrix of the $r$th class. Once $W^{(r)}$ is updated, $H^{(r)}$ is re-estimated using a formula obtained in a similar way (Gales and Woodland, 1996). Note that since MLLR allows different models to have different linear transforms, the overall transform is actually piece-wise linear, and therefore the BW algorithm is often performed for multiple iterations.
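As a small illustration of Eqns. (2.65) and (2.66), the sketch below applies a given mean transform $W = [A \;\; b]$ and covariance transform $H$ to one Gaussian component. It does not estimate the transforms; the helper names (`adapt_mean`, `adapt_covariance`) and the numbers in the usage note are made up for the example, and plain Python lists stand in for matrices.

```python
def matvec(m, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def matmul(x, y):
    """Matrix-matrix product for list-of-lists matrices."""
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def transpose(m):
    return [list(col) for col in zip(*m)]


def adapt_mean(w, mu):
    """Eqn. (2.65): mu_hat = W [mu; 1], where W = [A b]."""
    extended_mean = list(mu) + [1.0]
    return matvec(w, extended_mean)


def adapt_covariance(h, sigma):
    """Eqn. (2.66): Sigma_hat = H Sigma H^T."""
    return matmul(matmul(h, sigma), transpose(h))
```

With `W = [[2.0, 0.0, 1.0], [0.0, 0.5, -1.0]]` (that is, `A = diag(2, 0.5)` and `b = [1, -1]`) and `mu = [1.0, 2.0]`, `adapt_mean` yields `[3.0, 0.0]`. In a regression class tree setting, the same `W` and `H` would be shared by every component assigned to the class.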

Besides applying the linear transforms to model parameters, MLLR can also be applied to input features. This is achieved by constraining the mean and covariance of a Gaussian component to share the same linear transform (Digalakis et al., 1995; Gales, 1998), i.e., by letting $H = A$. Since

\[
\mathcal{N}(o(t) \mid A\mu + b, A \Sigma A^{\mathsf{T}}) = |A^{-1}| \, \mathcal{N}(A^{-1} o(t) - A^{-1} b \mid \mu, \Sigma), \tag{2.68}
\]

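the constrained Gaussian mean and covariance transform is equivalent to transforming the feature vector by

\begin{align}
\hat{o}(t) &= A^{-1} o(t) - A^{-1} b \tag{2.69} \\
           &= W \begin{bmatrix} o(t) \\ 1 \end{bmatrix}, \notag
\end{align}

where $W$ now denotes the feature-space transform $[A^{-1} \;\; -A^{-1} b]$.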

$W$ can be acquired similarly using Eqn. (2.67) by replacing $o(t)$, $\hat{\mu}_{im}$, and $\hat{\Sigma}_{im}$ with $\hat{o}(t)$, $\mu_{im}$, and $\Sigma_{im}$, respectively. A major benefit of this constrained MLLR (CMLLR) transform is that speaker adaptation is performed without changing the model parameters (Gales, 1998). More details of CMLLR can be found in Gales (1998).
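The equivalence in Eqn. (2.68) can be checked numerically. The sketch below is restricted to two dimensions so that the inverse and determinant can be written out by hand; all function names and test values are illustrative assumptions and not part of any toolkit.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix (assumed non-singular)."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def matmul(x, y):
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def transpose(m):
    return [list(col) for col in zip(*m)]


def gauss2(x, mu, sigma):
    """Density of a 2-dimensional Gaussian N(x | mu, sigma)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    quad = sum(d * s for d, s in zip(diff, matvec(inv2(sigma), diff)))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det2(sigma)))


def cmllr_feature(a, b, o):
    """Eqn. (2.69): o_hat = A^{-1} o - A^{-1} b."""
    a_inv = inv2(a)
    return [p - q for p, q in zip(matvec(a_inv, o), matvec(a_inv, b))]


def model_space_density(a, b, mu, sigma, o):
    """Left-hand side of Eqn. (2.68): N(o | A mu + b, A Sigma A^T)."""
    mu_hat = [m + bi for m, bi in zip(matvec(a, mu), b)]
    sigma_hat = matmul(matmul(a, sigma), transpose(a))
    return gauss2(o, mu_hat, sigma_hat)


def feature_space_density(a, b, mu, sigma, o):
    """Right-hand side of Eqn. (2.68): |A^{-1}| N(o_hat | mu, Sigma)."""
    return abs(det2(inv2(a))) * gauss2(cmllr_feature(a, b, o), mu, sigma)
```

For any non-singular `a` and positive definite `sigma`, the two density functions should agree up to floating point error, which is why the model parameters can stay untouched while only the features are transformed.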

In addition to test-time speaker adaptation, it is also of broad interest to apply adaptation in both training and testing (Anastasakos et al., 1996; Gales, 1998; Pye and Woodland, 1997). At test time, SD transforms are applied to a model set trained using the adaptation scheme instead of the SI model set trained in the standard way. This scheme is therefore referred to as speaker adaptive training (SAT). In this method, the standard acoustic model parameters are estimated to model only the phonetically relevant variations, since the speaker variations are assumed to have been modelled separately by the training set SD parameters. When applying SAT based on MLLR, given the SD transforms, the ML estimates of $\mu_{im}$ and $\Sigma_{im}$ are obtained by maximising Eqn. (2.67). This becomes very efficient for CMLLR since the re-estimation formulae are equivalent to the standard ML re-estimation formulae in Eqns. (2.38) and (2.39) with the transformed observation $\hat{o}(t)$ (Gales, 1998).
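A minimal sketch of the CMLLR-based SAT update for a single Gaussian component follows. It assumes that Eqns. (2.38) and (2.39), which are not reproduced in this section, take the usual occupancy-weighted form, and it keeps only the diagonal of the covariance. The function names and the frame layout `(W_speaker, o, gamma)` are hypothetical.

```python
def transform_obs(w, o):
    """Apply a feature-space transform: o_hat = W [o; 1]."""
    extended = list(o) + [1.0]
    return [sum(x * y for x, y in zip(row, extended)) for row in w]


def sat_reestimate(frames):
    """Re-estimate one Gaussian from speaker-normalised observations.

    `frames` is a list of (w_speaker, o, gamma) tuples, where w_speaker is
    the CMLLR transform of the speaker who produced observation o and gamma
    is the occupancy of this component at that frame. Returns the mean and
    the diagonal variance (assumed occupancy-weighted ML estimates).
    """
    transformed = [(transform_obs(w, o), g) for w, o, g in frames]
    total = sum(g for _, g in transformed)
    dim = len(transformed[0][0])
    mean = [sum(g * o[k] for o, g in transformed) / total for k in range(dim)]
    var = [sum(g * (o[k] - mean[k]) ** 2 for o, g in transformed) / total
           for k in range(dim)]
    return mean, var
```

If one speaker's features are offset by a constant and that speaker's transform removes the offset, the pooled estimate sees no speaker-induced variance, which mirrors the idea that the canonical model should capture only phonetically relevant variation.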

2.6.2 Heteroscedastic linear discriminant analysis

Recall from Section 2.3.2 that, in practice, GMMs with diagonal covariance matrices are often used for acoustic modelling, which assumes that all dimensions of $o(t)$ are independent variables. To better satisfy this assumption, the GMM input features can be decorrelated with a linear Karhunen-Loève transform (KLT) (or its simplified approximation, the DCT, as for MFCCs), which is a fixed transform estimated by modelling all data samples with a single Gaussian distribution (Bishop, 2006). The KLT can find some nuisance dimensions, which are less important in data modelling and can be discarded for data compression purposes. However, for speech data, the nuisance dimensions found by knowing the distribution of each class (e.g., a phonetic unit) are often different from those found by the KLT (Bishop, 2006). When each class is modelled by a Gaussian distribution with a shared covariance matrix, a linear decorrelation transform can be estimated by maximising the likelihood of generating the data. Equivalently, it can also be obtained by discriminating between the classes, maximising the inter-class distances while minimising the intra-class distances, and the nuisance dimensions are those that are useless for class discrimination. This approach is therefore named linear discriminant analysis (LDA) (Bishop, 2006). A further improvement of LDA can be derived by allowing each Gaussian distribution to have a distinct covariance matrix, which results in the HLDA transform (Kumar, 1997).

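Let the top $d$ dimensions be the valuable dimensions supposed to contain the discriminative information and the other $D - d$ dimensions be the nuisance dimensions. The transformed mean vector and covariance matrix associated with class $c$ are

\begin{align}
\hat{\mu}^{(c)} &= \begin{bmatrix} \hat{\mu}^{(c)}_{[d]} \\ \hat{\mu}^{(g)}_{[D-d]} \end{bmatrix} \tag{2.70} \\
\hat{\Sigma}^{(c)} &= \begin{bmatrix} \hat{\Sigma}^{(c)}_{[d]} & 0 \\ 0 & \hat{\Sigma}^{(g)}_{[D-d]} \end{bmatrix} \tag{2.71} \\
&= \begin{bmatrix} \mathrm{diag}\big(A_{[d]} \Sigma^{(c)} A_{[d]}^{\mathsf{T}}\big) & 0 \\ 0 & \mathrm{diag}\big(A_{[D-d]} \Sigma^{(g)} A_{[D-d]}^{\mathsf{T}}\big) \end{bmatrix}, \notag
\end{align}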

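where $\mathcal{N}(\hat{o}_{[d]}(t) \mid \hat{\mu}^{(c)}_{[d]}, \hat{\Sigma}^{(c)}_{[d]})$ and $\mathcal{N}(\hat{o}_{[D-d]}(t) \mid \hat{\mu}^{(g)}_{[D-d]}, \hat{\Sigma}^{(g)}_{[D-d]})$ are, respectively, the Gaussian distribution generating the samples of $c$ in the valuable dimensions and a global Gaussian distribution generating the information in the nuisance dimensions. $A$ is the HLDA transform and

\[
A = \begin{bmatrix} A_{[d]} \\ A_{[D-d]} \end{bmatrix}. \tag{2.72}
\]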

The transformed observation vector $\hat{o}(t)$ is

\[
\hat{o}(t) = A o(t), \tag{2.73}
\]

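and $A$ can be estimated using the BW algorithm. By substituting Eqn. (2.61) into Eqn. (2.34), the auxiliary function $Q$ increases monotonically with

\[
\frac{1}{2} \sum_{t} \sum_{c} \gamma_c(t) \ln \left( \frac{|A|^{2}}{\left|\mathrm{diag}\big(A_{[d]} \Sigma^{(c)} A_{[d]}^{\mathsf{T}}\big)\right| \, \left|\mathrm{diag}\big(A_{[D-d]} \Sigma^{(g)} A_{[D-d]}^{\mathsf{T}}\big)\right|} \right) \tag{2.74}
\]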

and the full covariance matrices $\Sigma^{(c)}$ and $\Sigma^{(g)}$ are estimated from the samples associated with class $c$ and from all the samples, respectively, using Eqn. (2.39). Further steps for calculating $A$ are described in Gales (1998, 1999, 2002). In practice, the nuisance dimensions are often discarded by using $A_{[d]}$ as the HLDA projection, which results in a subspace based on the valuable dimensions that can be more suitable for modelling.
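To make the objective in Eqn. (2.74) and the final projection concrete, the sketch below evaluates the objective for a given transform and projects an observation with $A_{[d]}$. It does not perform the row-by-row optimisation of $A$ described in the cited papers. It relies on the fact that the determinant of a diagonal matrix is the product of its diagonal entries, so each determinant reduces to a product of terms $a_k \Sigma a_k^{\mathsf{T}}$ over the rows $a_k$ of $A$. The function names and the `class_stats` layout are illustrative assumptions.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [[float(x) for x in row] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign
        result *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return result


def quad_form(row, sigma):
    """a Sigma a^T for one row a of the transform."""
    n = len(row)
    return sum(row[i] * sigma[i][j] * row[j] for i in range(n) for j in range(n))


def hlda_objective(a, d, class_stats, sigma_global):
    """Evaluate Eqn. (2.74) for transform `a` with `d` valuable dimensions.

    `class_stats` is a list of (occupancy, sigma_c) pairs, where occupancy
    is the sum over t of gamma_c(t) and sigma_c is the full class covariance.
    """
    log_det_sq = 2.0 * math.log(abs(det(a)))
    nuisance = sum(math.log(quad_form(row, sigma_global)) for row in a[d:])
    total = 0.0
    for occupancy, sigma_c in class_stats:
        valuable = sum(math.log(quad_form(row, sigma_c)) for row in a[:d])
        total += 0.5 * occupancy * (log_det_sq - valuable - nuisance)
    return total


def hlda_project(a, d, o):
    """Keep only the valuable dimensions: o_hat_[d] = A_[d] o."""
    return [sum(x * y for x, y in zip(row, o)) for row in a[:d]]
```

One property visible in this form is that rescaling a row of `a` leaves the objective unchanged, because the scale cancels between the numerator and the corresponding diagonal term.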

2.6.3 Semi-tied covariance matrices

Since it is well known that the distributions of phonetic units are not Gaussian, HLDA still relies on assumptions that are not satisfied. As an alternative to using a fixed feature level linear