Implementation of a language identification system based on optimization and machine learning techniques


Sergio David Rodríguez Bermúdez

Thesis for the Degree of Electronic Engineer

Sergio David Rodríguez Bermúdez: Implementation of a Language Identification System Based on Optimization and Machine Learning Techniques, Thesis for the Degree of Electronic Engineer, © November 2013

Ignoramus et ignorabimus

ABSTRACT

In this work we implement a language identification system that decides whether an utterance is a sample of English or French. To do this we rely on the free-license speech corpus of the VoxForge project. Pre-processing of this corpus uses the very common mel-frequency cepstral coefficients (MFCC) and delta-cepstral coefficients as feature vectors. The approach lies within the context of machine learning techniques, specifically Gaussian Mixture Models (GMM) and Support Vector Machines (SVM). We show that GMMs are fairly simple to train and that the computational approach using Gaussian mixtures scales to large datasets without many complications. In contrast, custom approaches to training SVMs with Gaussian radial basis function (RBF) kernels show severe limitations when handling large datasets.


We have seen that computer programming is an art, because it applies accumulated knowledge to the world, because it requires skill and ingenuity, and especially because it produces objects of beauty.

— Donald E. Knuth

ACKNOWLEDGMENTS

Many thanks to all the people who were involved, directly or indirectly, in the development of this work. It constitutes the last task of my degree and it is meant to close the cycle; as cliché as it may sound, this is a relief for me and for my parents. Above all, I appreciate the support that my parents have given me throughout my undergraduate studies, from the beginning to this point. I count on them, and that is a privilege that is not necessarily given to everyone. I am also very grateful to my advisor, Fernando Lozano, whose particular style allowed me to be independent in my work, yet very thoughtful and responsible for it. His knowledge was very well received throughout the machine learning course. Finally, I must say that I am in debt to the people from the machine learning seminar, especially Reinaldo Uribe, who delivered a life-saving hint and who rescued my computer and my thesis.

CONTENTS

1 Introduction
   1.1 LID Overview

Part I: Theoretical Framework
2 Feature Extraction
3 Gaussian Mixture Models
   3.1 Expectation-Maximization Algorithm
      3.1.1 Maximum-likelihood
      3.1.2 Basic EM
      3.1.3 Maximum likelihood in mixture densities
4 Support Vector Machines
   4.1 Hyperplane Classifiers
   4.2 Support Vector Classifiers
      4.2.1 Soft Margin Hyperplane
   4.3 Sequential Minimal Optimization (SMO)
      4.3.1 Part I: Analytical step: Solving for two Lagrange Multipliers
      4.3.2 Part II: Heuristics: Choosing two Lagrange multipliers
      4.3.3 Part III: Threshold: Determine the constant b

Part II: Objectives
5 Objectives
   5.1 General Objectives
   5.2 Specific Objectives

Part III: Methodology
6 Database Selection and Characterization
7 Feature Extraction: MFCC and Delta Features
8 Training and Testing: GMM and SVM
   8.1 Gaussian Mixture Models: Training
   8.2 Gaussian Mixture Models: Testing
   8.3 Support Vector Machines: Training
   8.4 Support Vector Machines: Testing

Part IV: Results
9 Results: Error Rates and Computational Effort
10 Conclusions
Bibliography

LIST OF FIGURES

Figure 1: Process to create MFCC Features
Figure 2: Mel scaling and smoothing of the log amplitude spectrum
Figure 3: The Mel function
Figure 4: Hyperplane Representation

LIST OF TABLES

Table 1: Training Database
Table 2: Testing Database
Table 3: MFCC Parameters
Table 4: gmdistribution.fit Parameters
Table 5: svmtrain Parameters
Table 6: GMM Classifier Results: Cepstral and Delta-cepstral coefficients
Table 7: SVM Classifier Results: Misclassification rate estimation
Table 8: SVM Classifier Results: Testing Database

1 INTRODUCTION

The purpose of this document is to report on the implementation of two automatic language identification (LID) systems using Gaussian Mixture Models and Support Vector Machines. To this end, we present the overall process of building a functional LID learning algorithm from scratch and illustrate the whole process step by step. The document is organized in four parts. The first part presents the theoretical framework: a detailed description of the pre-processing stage in chapter 2, followed by a full description of Gaussian Mixture Models and Support Vector Machines in chapters 3 and 4. The second part briefly presents the general and specific objectives. The third part (chapters 6 to 8) contains the explicit methodology used to implement the language identification algorithms. Finally, results and conclusions are presented in the last two chapters.

1.1 LID Overview

For the sake of clarity, let us briefly describe the main steps involved in building a learning machine, and specifically one built to recognize languages.

To build an LID system, the first requirement is a database of recordings in the languages of interest. In this case, more than 20,000 .wav recordings from English and French native speakers were extracted from the speech corpus in the VoxForge repository [1]. Details on this database are given in the following sections.

After acquiring the database, the next step is to transform these .wav files into feature vectors. This process is called feature extraction, and it is necessary because learning algorithms perform operations on dot product spaces, which require vector representations instead of binary files.

Another reason for this process is that feature extraction reduces the dimensionality of the data: fragments of audio encoded in long binary strings are condensed into 13-dimensional vectors that contain only the relevant information of these fragments.

Having done this, the next step is to train a model using the resulting feature vector database. The process of setting up the models and the training algorithms is to be described throughout the document.

Finally, a subset of the database is used to test the performance of the algorithms. At the end of this work we intend to carry out a benchmark comparing the performance of the approaches used to build the language identifier. Also, time permitting, our objective is to build a very simple interface to facilitate the use of the program.


Part I

2 FEATURE EXTRACTION

As mentioned, the first thing to be done when building an LID system is to conduct feature extraction on the dataset. For this purpose we use Mel-Frequency Cepstral Coefficients (MFCCs), which are the dominant features used for speech and language recognition.

The process of extracting MFCCs involves several steps, but the main assumptions are that the Mel frequency scale is a suitable scale to model the spectra of speech utterances, and that the Discrete Cosine Transform (DCT) is adequate to decorrelate the Mel-spectral vectors and hence to produce decorrelated feature components [2].

We now describe the overall process (see Figure 1). The first step is to divide the speech signal into frames, usually by applying a windowing function at fixed intervals. The most typical choice is a Hamming window 0.01 seconds wide [3]. Cepstral feature vectors are extracted for each frame (one 13-dimensional vector per frame).
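The framing step can be illustrated with a minimal Python sketch. It is only an illustration and not the implementation used in this work: the function names, the non-overlapping step of 0.01 seconds and the decision to drop a trailing partial frame are assumptions made here for the example.

```python
import math


def hamming(n_samples):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def frame_signal(signal, sample_rate, frame_seconds=0.01, step_seconds=0.01):
    """Split a signal into Hamming-windowed frames.

    The step size and the dropping of a trailing partial frame are
    assumptions of this sketch, not choices taken from the text.
    """
    frame_len = int(round(frame_seconds * sample_rate))
    step = int(round(step_seconds * sample_rate))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, step):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

For instance, a 0.1 second signal sampled at 8 kHz yields ten frames of 80 samples each, and each frame is then passed on to the spectral analysis described next.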

Figure 1: Process to create MFCC Features [2]

The next step is to take the Discrete Fourier Transform (DFT) of each frame, take its amplitude and retain the logarithm of that amplitude. Phase information is discarded because perceptual studies have shown that the amplitude of the spectrum is much more important than the phase. The logarithm of the amplitude is taken because the perceived loudness of a signal has been found to be approximately logarithmic [4].

The next step is to smooth the spectrum and emphasize perceptually meaningful frequencies. This is achieved by selectively reducing the number of spectral components, as illustrated in Figure 2.

Figure 2: Mel scaling and smoothing of the log amplitude spectrum. Spectral components are averaged over Mel-spaced bins to produce a smoothed spectrum. [2]

Here we see that the original spectrum is divided into segments. These segments, or bins, are distributed unevenly according to the Mel frequency scale, because it has been found that, for speech, the lower frequencies are perceptually more important than the higher ones. Components are averaged over these bins to yield a reduced spectrum, which is referred to as the Mel spectrum.

The Mel scale is based on a mapping between actual frequency and perceived pitch, since humans do not perceive pitch linearly. The mapping is approximately linear below 1 kHz and logarithmic above. Figure 3 shows the Mel function.

The formula for converting from frequency to the Mel scale is:

$$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right)$$
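As a small illustration of this formula, the following Python sketch converts between Hz and Mel and places bin edges at equal Mel spacing. The inverse mapping follows algebraically from the formula above; the helper `mel_bin_edges` and its parameters are hypothetical names introduced here only to show why the bins become wider at higher frequencies.

```python
import math


def hz_to_mel(f_hz):
    """M(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Algebraic inverse of hz_to_mel."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


def mel_bin_edges(f_min_hz, f_max_hz, n_bins):
    """Bin edges in Hz that are equally spaced on the Mel scale.

    Hypothetical helper: the number of bins and the frequency range
    are free parameters of this sketch.
    """
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / n_bins
    return [mel_to_hz(lo + i * step) for i in range(n_bins + 1)]
```

Because the edges are uniform in Mel but not in Hz, each bin is wider than the previous one, which matches the uneven bin distribution described for Figure 2.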

So far so good, but we still have a problem: the components of the Mel-spectral vector calculated for each frame are highly correlated. This is problematic if the behaviour of the data is to be modelled, since some schemes assume that the data are independent and identically distributed, as in Gaussian Mixture Models, where features are typically modelled by mixtures of Gaussian densities.

Therefore, in order to eliminate the correlation and reduce the number of parameters in the system, the last step of MFCC feature construction is to apply a transform to the Mel-spectral vectors that decorrelates their components. Theoretically, the Karhunen-Loève (KL) transform achieves this. In the speech community, the KL transform is approximated by the Discrete Cosine Transform (DCT) [5]. In our case, a DCT of type II is used:

$$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \qquad k = 0, \ldots, N-1,$$

where the $x_n$ are the Mel-spectral components of a given frame $t$ and the $X_k$ are the cepstral coefficients of that frame.
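The type II DCT can be written directly from its definition. The sketch below is a plain, unoptimized transcription in Python, given only as an illustration; the function `cepstral_coefficients` and its `n_coeffs` argument are hypothetical names for the truncation to the lowest coefficients.

```python
import math


def dct_ii(x):
    """Unnormalized DCT-II: X_k = sum_n x_n * cos(pi/N * (n + 1/2) * k)."""
    n_total = len(x)
    return [
        sum(x[n] * math.cos(math.pi / n_total * (n + 0.5) * k)
            for n in range(n_total))
        for k in range(n_total)
    ]


def cepstral_coefficients(log_mel_spectrum, n_coeffs=13):
    """Keep only the lowest n_coeffs DCT coefficients of one frame."""
    return dct_ii(log_mel_spectrum)[:n_coeffs]
```

A constant input concentrates all of its energy in the coefficient with k = 0, which is consistent with the remark that the lowest coefficient carries overall level information.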

It is worth mentioning that, for language identification, only the lowest 13 coefficients of the mel-cepstrum are calculated ($X_0$ through $X_{12}$). The lowest cepstral coefficient is sometimes ignored, because it contains only overall energy level information [6].

Finally, in an effort to model cepstral transition information, difference cepstra are also computed and modelled. These vectors of cepstral differences are called delta-cepstra and are computed at every frame as:

$$\Delta\vec{c}(t) = \vec{c}(t+1) - \vec{c}(t-1) \tag{2.1}$$

where $\Delta\vec{c}(t)$ and $\vec{c}(t)$ are the delta-cepstral and cepstral vectors at frame $t$, respectively. Once our feature vector space is full of instances, it is time to model the information they convey. The idea is to approximate, with a model, the pattern of the underlying process behind French and English utterances. This can be done in different ways, but we have chosen GMM and SVM as our approaches.
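Equation 2.1 can be sketched in a few lines of Python. The text does not say how the first and last frames are handled, so replicating the edge frames is an assumption of this example, and `delta_cepstra` is a hypothetical name.

```python
def delta_cepstra(cepstra):
    """Delta-cepstra per equation 2.1: c(t+1) - c(t-1) for every frame.

    cepstra is a list of per-frame coefficient lists. At the first and
    last frame the missing neighbour is replaced by the edge frame
    itself (an assumption of this sketch).
    """
    n_frames = len(cepstra)
    deltas = []
    for t in range(n_frames):
        prev_vec = cepstra[max(t - 1, 0)]
        next_vec = cepstra[min(t + 1, n_frames - 1)]
        deltas.append([b - a for a, b in zip(prev_vec, next_vec)])
    return deltas
```

Each utterance then provides two parallel streams of vectors, the cepstra and the delta-cepstra, with one vector of each kind per frame.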

3 GAUSSIAN MIXTURE MODELS

Among the most popular models for language identification, the Gaussian Mixture Model (GMM) is the simplest approach, not only because of its mathematical framework but also because its requirements in terms of training data are relatively simple to meet.¹

To train a GMM, a database of real-valued m-vectors is required. In the present case these vectors are the Mel-frequency cepstral coefficients together with the delta-cepstral coefficients. As seen in the last chapter, the latter require no complex calculations and improve the model performance.

Under the GMM assumption, each feature vector $\vec{v}_t$ at frame time $t$ (remember that MFCCs are computed per time frame) is assumed to be drawn randomly according to a probability density that is a weighted sum of multivariate Gaussian densities [3]:

$$p(\vec{v}_t \mid \lambda) = \sum_{k=1}^{N} \alpha_k\, p_k(\vec{v}_t) \tag{3.1}$$

where $\lambda$ is the set of model parameters,

$$\lambda = \{\alpha_k, \vec{u}_k, \Sigma_k\},$$

$k$ is the mixture index ($1 \le k \le N$), the $\alpha_k$ are the mixture weights, constrained such that $\sum_{k=1}^{N} \alpha_k = 1$, and the $p_k$ are the multivariate Gaussian densities defined by the means $\vec{u}_k$ and variances $\Sigma_k$.

For each language $l$, two GMMs are created: one for the cepstral feature vectors $\{\vec{x}_t\}$ and one for the delta-cepstral feature vectors $\{\vec{y}_t\}$. These models are trained by running multiple iterations of the estimate-maximize (E-M) algorithm, which produces, for each stream of data, a more likely set of $\alpha_k, \vec{u}_k, \Sigma_k$ [8].

The recognition of an unknown speech utterance proceeds as follows. First, feature extraction is conducted on the utterance by converting the digitized waveform (in this case the .wav file) to its cepstra and delta-cepstra. Second, the log-likelihood that the model of language $l$ produced the unknown utterance is calculated. The log-likelihood $L$ is defined as:

$$L(\{\vec{x}_t, \vec{y}_t\} \mid \lambda^{C}_l, \lambda^{DC}_l) = \sum_{t=1}^{T} \left[\log p(\vec{x}_t \mid \lambda^{C}_l) + \log p(\vec{y}_t \mid \lambda^{DC}_l)\right] \tag{3.2}$$

where $\lambda^{C}_l$ and $\lambda^{DC}_l$ are the cepstral and delta-cepstral GMMs, respectively, for language $l$, and $T$ is the duration of the utterance (in frames). The assumptions concerning the feature vectors are that the observations $\{\vec{x}_t\}$ are statistically independent of each other, that the observations $\{\vec{y}_t\}$ are statistically independent of each other, and that the two streams are jointly independent of each other as well.

¹ Notation and text coherence of this section as in [Zissman, 1996].

The maximum-likelihood classifier hypothesizes $\hat{l}$ as the language of the unknown utterance, where:

$$\hat{l} = \arg\max_{l}\; L(\{\vec{x}_t, \vec{y}_t\} \mid \lambda^{C}_l, \lambda^{DC}_l)$$

The main issue of the GMM approach is then to calculate the parameters $\lambda^{C}_l$ and $\lambda^{DC}_l$. These parameters determine the structure of the predictors and hence determine the classifier. To calculate them, the training strategy is based on an implementation of the EM algorithm.
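The scoring rule of equation 3.2 and the maximum-likelihood decision can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in this work: covariances are taken to be diagonal, each GMM is represented as a list of `(weight, mean, variance)` tuples, and all function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_gmm(x, gmm):
    """Log of equation 3.1; gmm is a list of (weight, mean, var) tuples.

    The log-sum-exp form avoids underflow for unlikely vectors.
    """
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def utterance_log_likelihood(cepstra, deltas, gmm_c, gmm_dc):
    """Equation 3.2: sum over frames of both stream log-likelihoods."""
    return sum(log_gmm(x, gmm_c) + log_gmm(y, gmm_dc)
               for x, y in zip(cepstra, deltas))


def classify(cepstra, deltas, models):
    """Return the language whose pair of GMMs scores the utterance highest.

    models maps a language name to a (gmm_c, gmm_dc) pair.
    """
    return max(models,
               key=lambda lang: utterance_log_likelihood(
                   cepstra, deltas, *models[lang]))
```

Working in the log domain turns the product over frames into a sum, which is what makes long utterances numerically manageable.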

3.1 Expectation-Maximization Algorithm

In this section we describe the Expectation-Maximization (EM) algorithm and its theoretical foundations. We recall the maximum-likelihood problem, explain its motivation and describe how it relates to the Expectation-Maximization problem. A general overview of the EM algorithm is presented, along with its application to Gaussian Mixture Models.

3.1.1 Maximum-likelihood

The maximum-likelihood problem is the problem of fitting a probability density function so as to maximize the likelihood of its parameters given a set of observed data. For a data set $X = \{\vec{x}_1, \ldots, \vec{x}_N\}$ of size $N$, assumed to be drawn independently from the distribution $p(\vec{x} \mid \Theta)$, the likelihood of the parameters given the observations can be expressed as:

$$p(X \mid \Theta) = \prod_{i=1}^{N} p(\vec{x}_i \mid \Theta) = L(\Theta \mid X)$$

where $\Theta$ is the set of parameters that govern the distribution. The likelihood is thought of as a function of the parameters $\Theta$. In the maximum-likelihood problem our goal is to find the $\Theta$ that maximizes $L$. That is, we wish to find $\Theta^*$ where:

$$\Theta^* = \arg\max_{\Theta}\; L(\Theta \mid X)$$

Depending on the form of $p(\vec{x} \mid \Theta)$, this problem can be analytically intractable. In such cases EM offers an alternative approach to the maximum-likelihood problem.

3.1.2 Basic EM

The EM algorithm is a method for solving the maximum-likelihood problem for an underlying distribution from a given data set when the data are incomplete or have missing/hidden values. It applies to the current scenario because optimizing the likelihood function for Gaussian mixture densities is analytically intractable, but the likelihood function can be simplified by assuming the existence of, and values for, additional but missing parameters.

As before, we assume the observed data $X$ to be generated by some distribution, but this time we refer to it as the incomplete data [7]. We assume that a complete data set $Z = (X, Y)$ exists and also assume a joint density:

$$p(\vec{z} \mid \Theta) = p(\vec{x}, \vec{y} \mid \Theta) = p(\vec{y} \mid \vec{x}, \Theta)\, p(\vec{x} \mid \Theta)$$

Now we can define the complete-data likelihood as:

$$L(\Theta \mid Z) = L(\Theta \mid X, Y) = p(X, Y \mid \Theta)$$

Note that, given $X$ and $\Theta$, this likelihood can be thought of as a random variable, as it is a function of the random variable $Y$. In this context, the likelihood of the previous section is generally referred to as the incomplete-data likelihood.

The EM algorithm consists of two steps, the E-step and the M-step. In the E-step we calculate the expected value of the complete-data log-likelihood with respect to $Y$, given the observed data $X$ and the current parameters. This quantity can be expressed as:

$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(X, Y \mid \Theta) \mid X, \Theta^{(i-1)}\right] \tag{3.3}$$

where $\Theta^{(i-1)}$ are the current parameter estimates, used to evaluate the expectation, and $\Theta$ are the new parameters that we optimize to increase $Q$.

The key aspects of this equation are the following:

$X, \Theta^{(i-1)}$: constants

$\Theta$: variable that we wish to adjust

$Y$: random variable with distribution $f(\vec{y} \mid X, \Theta^{(i-1)})$

Therefore, the right-hand side of equation 3.3 can be rewritten as in [7]:

$$E\left[\log p(X, Y \mid \Theta) \mid X, \Theta^{(i-1)}\right] = \int_{\vec{y} \in \Upsilon} \log p(X, \vec{y} \mid \Theta)\, f(\vec{y} \mid X, \Theta^{(i-1)})\, d\vec{y} \tag{3.4}$$

Note that $f(\vec{y} \mid X, \Theta^{(i-1)})$ is the marginal distribution of the unobserved data and depends on the observed data $X$ and the current parameters $\Theta^{(i-1)}$; $\Upsilon$ is the space of possible values of $\vec{y}$. The resulting function can be maximized with respect to $\Theta$.

Conceptually, the main idea behind this expression is embedded in the meaning of the arguments of the function $Q(\Theta, \Theta')$. The first argument, $\Theta$, corresponds to the parameters that will be optimized to maximize the likelihood. The second argument, $\Theta' = \Theta^{(i-1)}$, corresponds to the parameters used to evaluate the function $f(\vec{y} \mid X, \Theta^{(i-1)})$, which makes it possible to calculate the expectation.

As mentioned, the second step of the EM algorithm is called the M-step. In this step the expectation $Q(\Theta, \Theta^{(i-1)})$ computed in the first step is maximized. That is, we find:

$$\Theta^{(i)} = \arg\max_{\Theta}\; Q(\Theta, \Theta^{(i-1)})$$

Each iteration of these steps is guaranteed not to decrease the log-likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function [8]. The algorithmic details of the steps depend strongly on the particular application. Let us discuss the case of Gaussian mixtures.

3.1.3 Maximum likelihood in mixture densities

In the mixture-density parameter estimation problem we assume the following probabilistic model (see equation 3.1):²

$$p(\vec{x} \mid \Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\vec{x} \mid \theta_i)$$

where the parameters are $\Theta = (\alpha_1, \ldots, \alpha_M, \theta_1, \ldots, \theta_M)$. As before, $\sum_{i=1}^{M} \alpha_i = 1$ and each $p_i$ is a multivariate Gaussian density function parametrized by $\theta_i$. We consider $M$ component densities mixed together with weights $\alpha_i$.

² Notation and text coherence of this section as in [Bilmes, 1998].

If we try to fit the parameters $\alpha_i, \theta_i$ by directly solving the maximum-likelihood problem, we face an intractable problem. If we compute the incomplete-data log-likelihood for this probability density and data $X$, the resulting expression is:

$$\log L(\Theta \mid X) = \log \prod_{i=1}^{N} p(x_i \mid \Theta) = \sum_{i=1}^{N} \log\left(\sum_{j=1}^{M} \alpha_j\, p_j(x_i \mid \theta_j)\right)$$

As can be seen from this equation, the presence of a sum within a logarithm makes this expression difficult to optimize. As mentioned, the problem can be simplified by assuming the existence of unobserved data items $Y = \{y_i\}_{i=1}^{N}$, one for each element of $X$. Let us assume that $y_i \in \{1, \ldots, M\}$ for each $i$, and $y_i = k$ if the $i$th sample was generated by the $k$th mixture component. Assuming that we know the values of $Y$, the complete-data log-likelihood becomes:

$$\log L(\Theta \mid X, Y) = \log P(X, Y \mid \Theta) = \sum_{i=1}^{N} \log\left(P(x_i \mid y_i)\, P(y_i)\right) = \sum_{i=1}^{N} \log\left(\alpha_{y_i}\, p_{y_i}(x_i \mid \theta_{y_i})\right) \tag{3.5}$$

given that the $\alpha_j$ parameters can be thought of as an a priori distribution for the variables $y_i$.

This expression can be optimized using a variety of techniques. The problem is that the $Y$ are not known; but, treating them as random variables, one can use the EM technique to compute the expected value of 3.5 and maximize it.

Recall that the E-step of the EM algorithm consists of calculating the expected value of the complete-data log-likelihood, given $X$ and some assumed values of the parameters, $\Theta^g$, also referred to as guesses. The following expression rewrites 3.4 for the current case:

$$Q(\Theta, \Theta^g) = \sum_{\vec{y} \in \Upsilon} \log\left(L(\Theta \mid X, \vec{y})\right) p(\vec{y} \mid X, \Theta^g) \tag{3.6}$$

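Here, the distribution of the unobserved data remains unknown. But assuming some values for the guesses $\Theta^g = (\alpha^g_1, \ldots, \alpha^g_M, \theta^g_1, \ldots, \theta^g_M)$, and bearing in mind the interpretation of the $\alpha$'s as a priori probabilities, one can use Bayes' rule to obtain:

$$p(y_i \mid x_i, \Theta^g) = \frac{\alpha^g_{y_i}\, p_{y_i}(x_i \mid \theta^g_{y_i})}{p(x_i \mid \Theta^g)} = \frac{\alpha^g_{y_i}\, p_{y_i}(x_i \mid \theta^g_{y_i})}{\sum_{k=1}^{M} \alpha^g_k\, p_k(x_i \mid \theta^g_k)}$$

and hence:

$$p(\vec{y} \mid X, \Theta^g) = \prod_{i=1}^{N} p(y_i \mid x_i, \Theta^g)$$

Substituting this result into 3.6, expanding the expression and then simplifying as in [7], chapter 3, one obtains a fairly simple expression for $Q(\Theta, \Theta^g)$ that can be written as: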

$$Q(\Theta, \Theta^g) = \sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta^g) + \sum_{l=1}^{M} \sum_{i=1}^{N} \log\left(p_l(x_i \mid \theta_l)\right) p(l \mid x_i, \Theta^g) \tag{3.7}$$

In this equation, whenever $l$ appears as an argument of a probability density, it means $y_i = l$.

Now that we have completed the E-step, the next procedure is to perform the M-step and maximize 3.7 with respect to the parameters $\alpha_l, \theta_l$. We can maximize the term containing $\alpha_l$ and then the term containing $\theta_l$ independently, since they are not related. Let us start with the weights $\alpha$.

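To find the expression for $\alpha_l$, we introduce the Lagrange multiplier $\lambda$ with the constraint $\sum_l \alpha_l = 1$ and solve the following equation:

$$\frac{\partial}{\partial \alpha_l}\left[\sum_{l=1}^{M} \sum_{i=1}^{N} \log(\alpha_l)\, p(l \mid x_i, \Theta^g) + \lambda\left(\sum_{l} \alpha_l - 1\right)\right] = 0$$

or

$$\sum_{i=1}^{N} \frac{1}{\alpha_l}\, p(l \mid x_i, \Theta^g) + \lambda = 0$$

Multiplying by $\alpha_l$ and summing both sides over $l$, we get $\lambda = -N$, resulting in:

$$\alpha_l = \frac{1}{N} \sum_{i=1}^{N} p(l \mid x_i, \Theta^g)$$

This expression gives us the maximum complete-data likelihood estimate of the weights. It delivers $M$ such estimates, each corresponding to one weight in the Gaussian mixture 3.1. More importantly, it delivers a better estimate than the ones implicitly contained in $\Theta^g$.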

Similar procedures are conducted to obtain the equivalent expressions for $\theta_l$; for the detailed calculation refer to [7]. In our case, the component distributions are 13-dimensional Gaussians with mean $\mu$ and covariance matrix $\Sigma$, i.e., $\theta = (\mu, \Sigma)$, so that:

$$p_l(x \mid \mu_l, \Sigma_l) = \frac{1}{(2\pi)^{13/2}\, |\Sigma_l|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_l)^T \Sigma_l^{-1} (x - \mu_l)\right) \tag{3.8}$$

The estimates of the new parameters $\mu_l, \Sigma_l$ in terms of the old parameters are as follows:

$$\mu^{new}_l = \frac{\sum_{i=1}^{N} x_i\, p(l \mid x_i, \Theta^g)}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}$$

$$\Sigma^{new}_l = \frac{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)\,(x_i - \mu^{new}_l)(x_i - \mu^{new}_l)^T}{\sum_{i=1}^{N} p(l \mid x_i, \Theta^g)}$$

The above equations perform the expectation step and the maximization step simultaneously. As noted before, the algorithm proceeds by using the newly derived parameters as the guess for the next iteration.
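One EM iteration for a Gaussian mixture can be sketched in Python as follows. To keep the example short it works on scalar data, so each covariance matrix reduces to a single variance; this restriction, the variance floor and the function names are assumptions of the sketch and not part of the derivation above.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density (1-D special case of equation 3.8)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    n, m = len(data), len(weights)
    # E-step: posteriors p(l | x_i, Theta^g) via Bayes' rule.
    resp = []
    for x in data:
        joint = [weights[l] * gaussian_pdf(x, means[l], variances[l])
                 for l in range(m)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: closed-form updates of weights, means and variances.
    new_w, new_mu, new_var = [], [], []
    for l in range(m):
        mass = sum(resp[i][l] for i in range(n))
        mu = sum(resp[i][l] * data[i] for i in range(n)) / mass
        var = sum(resp[i][l] * (data[i] - mu) ** 2 for i in range(n)) / mass
        new_w.append(mass / n)
        new_mu.append(mu)
        new_var.append(max(var, 1e-6))  # variance floor: added safeguard
    return new_w, new_mu, new_var


def log_likelihood(data, weights, means, variances):
    """Incomplete-data log-likelihood of the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, mu, v)
                     for w, mu, v in zip(weights, means, variances)))
        for x in data
    )
```

Calling `em_step` repeatedly, each time feeding back the returned parameters as the new guess, reproduces the iteration described above; monitoring `log_likelihood` between calls is a simple way to decide when to stop.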

4 SUPPORT VECTOR MACHINES

To understand Support Vector Machines (SVMs), one must understand the mathematical framework that gives rise to this technique. To this end, let us briefly recall the very common hyperplane classifiers and derive from them the main ideas behind Support Vector Machines.

It is worth mentioning that the main idea behind this method is to model the boundary between the classes, as opposed to the GMM approach of the last chapter, which models the probability distributions of the classes.

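4.1 Hyperplane Classifiers

Hyperplane classifiers are based on the class of hyperplanes, defined in a dot product space $H$:

$$\langle \vec{w}, \vec{x} \rangle + b = 0, \qquad \vec{w} \in H,\; b \in \mathbb{R}, \tag{4.1}$$

corresponding to decision functions:

$$y = f(\vec{x}) = \operatorname{sign}\left(\langle \vec{w}, \vec{x} \rangle + b\right) \tag{4.2}$$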

The learning algorithm to find 4.2, proposed for problems that are separable by hyperplanes, is based on two facts. First, among all hyperplanes separating the data there exists a unique optimal hyperplane with maximum margin of separation between any training point and the hyperplane [9]. Given $m$ patterns $\vec{x}_i$ and their corresponding labels $y_i$, this hyperplane is the solution of:

$$\underset{\vec{w} \in H,\, b \in \mathbb{R}}{\text{maximize}}\;\; \min\left\{\|\vec{x} - \vec{x}_i\| \;:\; \vec{x} \in H,\; \langle \vec{w}, \vec{x} \rangle + b = 0,\; i = 1, \ldots, m\right\} \tag{4.3}$$

Second, the capacity of the class of separating hyperplanes decreases with increasing margin. Hence there are theoretical arguments supporting the good generalization performance of the optimal hyperplane; see [9], chapters 5, 7 and 12.

Solving 4.3 is equivalent to minimizing the norm of the vector $\vec{w}$, as the following formulation summarizes:

$$\underset{\vec{w} \in H,\, b \in \mathbb{R}}{\text{minimize}}\;\; \tau(\vec{w}) = \frac{1}{2}\|\vec{w}\|^2 \tag{4.4}$$

$$\text{subject to}\quad y_i\left(\langle \vec{w}, \vec{x}_i \rangle + b\right) \ge 1 \quad \text{for all } i = 1, \ldots, m. \tag{4.5}$$

It can be shown that the optimal hyperplane can be uniquely constructed by solving a constrained quadratic optimization problem [10], and that the solution $\vec{w}$ has an expansion $\vec{w} = \sum_i v_i \vec{x}_i$ in terms of a subset of training patterns that lie on the margin.

Notice that the function $\tau$ in equation 4.4 can be understood as an objective function, while 4.5 constitutes an inequality constraint. Together they form a constrained optimization problem, which can be dealt with by introducing a Lagrangian formulation:

$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{m} \alpha_i\left(y_i\left(\langle \vec{x}_i, \vec{w} \rangle + b\right) - 1\right) \tag{4.6}$$

The Lagrangian has to be minimized with respect to the primal variables $\vec{w}$ and $b$, and maximized with respect to the dual variables $\alpha_i$; that is, a saddle point has to be found.

Introducing the Karush-Kuhn-Tucker (KKT) conditions to guarantee optimality, one can also simplify the solution and obtain additional restrictions that narrow it. For example, the KKT conditions state that at the saddle point the derivatives of $L$ with respect to the primal variables must vanish:

$$\frac{\partial}{\partial b} L(\vec{w}, b, \vec{\alpha}) = 0 \quad \text{and} \quad \frac{\partial}{\partial \vec{w}} L(\vec{w}, b, \vec{\alpha}) = 0 \tag{4.7}$$

Applying these conditions to 4.6 leads to:

$$\sum_{i=1}^{m} \alpha_i y_i = 0 \tag{4.8}$$

and

$$\vec{w} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i \tag{4.9}$$

As mentioned, equation 4.9 shows that the solution vector $\vec{w}$ has an expansion in terms of the training patterns with non-zero $\alpha_i$. These patterns are called Support Vectors (SVs). The KKT conditions also imply that the SVs lie on the margin [9]. All remaining training data $(\vec{x}_i, y_i)$ become irrelevant; see Figure 4.

Substituting 4.8 and 4.9 into the Lagrangian 4.6, we arrive at the dual optimization problem, which is the problem actually solved in practice:

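$$\underset{\vec{\alpha} \in \mathbb{R}^m}{\text{maximize}}\;\; W(\vec{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle \vec{x}_i, \vec{x}_j \rangle \tag{4.10}$$

$$\text{subject to}\quad \alpha_i \ge 0 \;\text{ for all } i = 1, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$

Figure 4: Hyperplane Representation. Only the patterns that lie on the margin matter. [11]

Also, substituting the KKT results into the decision function 4.2, we obtain:

$$f(\vec{x}) = \operatorname{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i \langle \vec{x}, \vec{x}_i \rangle + b\right) \tag{4.11}$$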

This formalism has a crucial property that needs to be emphasized: both the quadratic optimization problem and the final decision function 4.11 depend only on dot products between patterns. This feature is precisely what lets us generalize to the nonlinear case.

4.2 Support Vector Classifiers

We now have all the tools to describe SVMs. The basic idea of this technique is to map the original data, presumably in $\mathbb{R}^N$, into some other dot product space $F$ (called the feature space) via a nonlinear map:

$$\varphi : \mathbb{R}^N \to F \tag{4.12}$$

and to apply the above linear classifier (section 4.1) in $F$. To do this, we would like to write the dot product of the feature vectors $\varphi(\vec{x}_i)$ in terms of the input patterns $\vec{x}_i$ using a kernel function $k$:

$$k(\vec{x}, \vec{x}_i) = \langle \varphi(\vec{x}), \varphi(\vec{x}_i) \rangle \tag{4.13}$$

Clearly, if $F$ is high dimensional, the right-hand side of equation 4.13 will be very expensive to compute. In some cases, however, there is a simple kernel $k$ that can be evaluated efficiently. Given this, one can define an SVM as a binary classifier constructed from sums of a kernel function $k(\cdot, \cdot)$, which implements a dot product in some higher dimensional space [11]:

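$$f(\vec{x}) = \operatorname{sgn}\left(\sum_{i=1}^{m} y_i \alpha_i \langle \varphi(\vec{x}), \varphi(\vec{x}_i) \rangle + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{m} \alpha_i y_i\, k(\vec{x}, \vec{x}_i) + b\right) \tag{4.14}$$

where the $y_i$ are the ideal outputs, $\sum_{i=1}^{m} \alpha_i y_i = 0$ and $\alpha_i > 0$. In this expression the vectors $\vec{x}_i$ are the support vectors from the training data.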
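The decision function 4.14 can be sketched in Python once the support vectors, their labels, the multipliers and the constant b are known. The sketch below uses the Gaussian radial basis function kernel in the common form exp(-gamma * ||x - z||^2); the parameter name `gamma`, its default value and the function names are assumptions of this example, and ties at zero are resolved towards the positive class by convention.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Argument of sgn in equation 4.14: sum_i alpha_i y_i k(x, x_i) + b."""
    return sum(a * y * kernel(x, sv)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b


def svm_classify(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Return +1 or -1 according to the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, b, kernel)
    return 1 if value >= 0 else -1
```

Only kernel evaluations between the input and the support vectors are needed, so the map into the feature space never has to be computed explicitly.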

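The decision function of equation 4.14 is found by solving a quadratic program, specifically an optimization problem very similar to 4.10 but with the dot product implemented in the space $F$:

$$\underset{\vec{\alpha} \in \mathbb{R}^m}{\text{maximize}}\;\; W(\vec{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j\, k(\vec{x}_i, \vec{x}_j) \tag{4.15}$$

$$\text{subject to}\quad \alpha_i \ge 0 \;\text{ for all } i = 1, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$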

4.2.1 Soft Margin Hyperplane

In practice, a perfectly separating hyperplane may not exist. To allow SVMs to be applied to more general situations in which 4.5 does not hold, we introduce slack variables:

$$\xi_i \ge 0 \quad \text{for all } i = 1, \ldots, m. \tag{4.16}$$

These variables relax the constraint 4.5:

$$y_i\left(\langle \vec{w}, \vec{x}_i \rangle + b\right) \ge 1 - \xi_i \quad \text{for all } i = 1, \ldots, m. \tag{4.17}$$

Given this, we seek a classifier that minimizes not only the magnitude of $\vec{w}$ but also the magnitude of the slack variables, i.e., the following quantity:

$$\tau(\vec{w}, \vec{\xi}) = \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{m} \xi_i \tag{4.18}$$

subject to 4.16 and 4.17. In this expression the constant $C$ is a positive real value that determines the trade-off between margin maximization and training error minimization. This again leads to 4.15, but the constraints change to:

$$0 \le \alpha_i \le C \;\text{ for all } i = 1, \ldots, m \quad \text{and} \quad \sum_{i=1}^{m} \alpha_i y_i = 0.$$

The constant $C$ is sometimes referred to as the box constraint, since it confines the multipliers $\alpha_i$ to a box of side $C$. Other than that, the problem is still quadratic and very similar to the former one.

The main issue of the SVM approach is now to calculate the decision function and find the support vectors, i.e., the non-vanishing Lagrange multipliers of equation 4.15. These parameters determine the structure of the decision function and hence define the classifier hyperplane. To calculate them, the training strategy used here is based on an implementation of the Sequential Minimal Optimization (SMO) technique.

4.3 s e q u e n t i a l m i n i m a l o p t i m i z at i o n (s m o)

As mentioned in the last section, training a Support Vector Ma-chine (SVM) requires the solution of a quadratic programming (QP) optimization problem. This problem is usually very large and requires special methods. SMO breaks this large problem into a series of smallest possible QP problems that are solved using an analytical QP step instead of inner QP optimization loops. Depending on the database type SMO can be faster or equal in speed to traditional procedures such as the projected conjugate gradient (PCG) chunking algorithm. Sometimes even

1000times faster [12]

Unlike previous methods, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem the smallest possible optimization problem involves two Lagrange multipliers. SMO chooses precisely two Lagrange multipliers at every iteration to jointly optimize, find optimal values for these multipliers and updates de SVM to reflect the new optimal. This feature represents a great advantage, since two Lagrange multipliers can be done analytically, which represents less computational effort otherwise spent in numerical inner iterations.

There are three components to SMO. The first component is analytical and concerns solving for the two Lagrange multipliers. The second is heuristic and provides the rules for choosing which multipliers to optimize. The last is a method for computing the constant $b$ in the decision function 4.14.


4.3.1 Part I : Analytical step : Solving for two Lagrange Multipliers

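Let's recall the dual formulation involved in solving an SVM with soft margin:

$$\max_{\vec{\alpha} \in \mathbb{R}^m} \; W(\vec{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \, k(\vec{x}_i, \vec{x}_j) \tag{4.19}$$

subject to $0 \le \alpha_i \le C$ for all $i = 1, \dots, m$, and $\sum_{i=1}^{m} \alpha_i y_i = 0$.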

The bound constraints in this expression, also referred to as box constraints, cause the Lagrange multipliers to lie within a box. On the other hand, the linear equality constraint causes the Lagrange multipliers to lie on a diagonal line. SMO computes these constraints and then solves for the constrained maximum.

Together, these constraints imply that the constrained maximum of the objective function must lie on a diagonal line segment. This explains why two is the minimum number of Lagrange multipliers that can be optimized: changing one multiplier alone could not fulfil the linear equality constraint at every step.

Say we pick two Lagrange multipliers based on some heuristics (details on how to pick them are explained in the following section). For simplicity, let's label them $\alpha_1$ and $\alpha_2$, and let's label all the quantities related to these multipliers in the same fashion.

The SMO algorithm first computes the second Lagrange multiplier $\alpha_2$ (it could just as well start with $\alpha_1$ without loss of generality) and computes the ends of the diagonal line segment in terms of this multiplier.

To find the location of the unconstrained maximum of the objective function in equation 4.19, we use the procedure described in [12], chapter 12, section 7. This analytical maximization is carried out while allowing only two Lagrange multipliers to change, and is based on some initial guesses for the values of the $\alpha_i$, referred to as $\alpha_i^{old}$. The resulting expression for the updated $\alpha_2$ is:

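$$\alpha_2^{new} = \alpha_2^{old} - \frac{y_2 (E_1 - E_2)}{\beta} \tag{4.20}$$

where $E_i = f^{old}(\vec{x}_i) - y_i$. In this expression $f^{old}$ refers to the SVM decision function built from the old parameters:

$$f^{old}(\vec{x}) = \operatorname{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i^{old} \, k(\vec{x}, \vec{x}_i) + b \right)$$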


And $\beta$ is short for:

$$\beta = 2k(\vec{x}_1, \vec{x}_2) - k(\vec{x}_1, \vec{x}_1) - k(\vec{x}_2, \vec{x}_2) \tag{4.21}$$

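Once $\alpha_2^{new}$ is evaluated, the following bounds are applied. If target $y_1$ does not equal target $y_2$, we apply the bounds:

$$L = \max\left(0, \; \alpha_2^{old} - \alpha_1^{old}\right), \qquad H = \min\left(C, \; C + \alpha_2^{old} - \alpha_1^{old}\right) \tag{4.22}$$

If target $y_1$ equals target $y_2$, then the following bounds apply to $\alpha_2$:

$$L = \max\left(0, \; \alpha_1^{old} + \alpha_2^{old} - C\right), \qquad H = \min\left(C, \; \alpha_1^{old} + \alpha_2^{old}\right) \tag{4.23}$$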

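Finally, the constrained maximum is found by clipping the unconstrained maximum $\alpha_2^{new}$ to the ends of the line segment:

$$\alpha_2^{new,clipped} = \begin{cases} H, & \text{if } \alpha_2^{new} \ge H \\ \alpha_2^{new}, & \text{if } L < \alpha_2^{new} < H \\ L, & \text{if } \alpha_2^{new} \le L \end{cases}$$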

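The value of $\alpha_1$ is computed from the clipped $\alpha_2$:

$$\alpha_1^{new} = \alpha_1^{old} + s \left( \alpha_2^{old} - \alpha_2^{new,clipped} \right) \tag{4.24}$$

where $s = y_1 y_2$, as in [12].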

With this procedure, SMO moves the Lagrange multipliers to the end point with the highest value of the objective function.
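The analytical step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MATLAB routine used in this work: the function name `smo_pair_update` and its argument names are hypothetical, the kernel values `k11`, `k12`, `k22` and the errors `e1`, `e2` are assumed to be precomputed, the bounds follow the case distinction of [12] ($y_1 \ne y_2$ versus $y_1 = y_2$), and $s$ in equation 4.24 is taken to be $y_1 y_2$ as in [12]. A pair with $\beta \ge 0$ is simply skipped here.

```python
def smo_pair_update(a1_old, a2_old, y1, y2, e1, e2, k11, k12, k22, c):
    """One analytical SMO step for a pair of multipliers (equations 4.20 to 4.24).

    Hypothetical helper: all kernel values and errors are passed in precomputed.
    """
    beta = 2.0 * k12 - k11 - k22  # equation 4.21
    if beta >= 0.0:
        # Degenerate pair (assumption of this sketch): leave it unchanged.
        return a1_old, a2_old

    # Unconstrained maximum along the diagonal line (equation 4.20).
    a2_new = a2_old - y2 * (e1 - e2) / beta

    # Ends of the diagonal segment inside the box [0, C] x [0, C].
    if y1 != y2:
        low = max(0.0, a2_old - a1_old)
        high = min(c, c + a2_old - a1_old)
    else:
        low = max(0.0, a1_old + a2_old - c)
        high = min(c, a1_old + a2_old)

    # Clip to the segment, then recover alpha_1 (equation 4.24).
    a2_clipped = min(high, max(low, a2_new))
    s = y1 * y2
    a1_new = a1_old + s * (a2_old - a2_clipped)
    return a1_new, a2_clipped
```

Because $\alpha_1$ is recovered from the clipped $\alpha_2$, the quantity $y_1\alpha_1 + y_2\alpha_2$ is the same before and after the step, which is what keeps the linear equality constraint satisfied.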

4.3.2 Part II : Heuristics : Choosing two Lagrange multipliers

In order to speed up convergence, the SMO algorithm uses heuristics to choose which two Lagrange multipliers to optimize jointly. There are two heuristic rules, one for each Lagrange multiplier. The choice of the first multiplier is the result of a loop. This loop determines which training examples violate the following KKT conditions:

$$\begin{aligned} \alpha_i = 0 &\;\Rightarrow\; y_i f(\vec{x}_i) \ge 1, \\ 0 < \alpha_i < C &\;\Rightarrow\; y_i f(\vec{x}_i) = 1, \\ \alpha_i = C &\;\Rightarrow\; y_i f(\vec{x}_i) \le 1. \end{aligned} \tag{4.25}$$

The first KKT violator found by this outer loop is chosen as the first multiplier (remember that there is a one-to-one relation between the training examples and the Lagrange multipliers). A second multiplier is selected using a second heuristic rule. The two multipliers are jointly optimized and the SVM is then updated. The process resumes by looking for further KKT violators.

The outer loop passes repeatedly over the training examples until every example obeys the KKT conditions within a tolerance $\epsilon$, which is typically set in the range $10^{-2}$ to $10^{-3}$ [12]. At that point, the algorithm terminates. For reasons of computational efficiency the loop does not always run over all the training data; some iterations are made only over the examples that are most likely to violate the KKT conditions: those whose Lagrange multipliers are neither $0$ nor $C$, also called non-bound examples.
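The first-choice heuristic can be illustrated with a short Python sketch. It is only an approximation of the outer loop described above: the helper names (`violates_kkt`, `non_bound_indices`, `first_violator`) are hypothetical, the SVM outputs $f(\vec{x}_i)$ are assumed to be available in a list, and the tolerance `eps` plays the role of the KKT tolerance.

```python
def violates_kkt(alpha, y, f_x, c, eps=1e-3):
    """Return True if one example violates conditions 4.25 by more than eps."""
    r = y * f_x - 1.0  # zero exactly on the margin
    if r < -eps and alpha < c:
        return True  # y*f(x) < 1 is only allowed when alpha = C
    if r > eps and alpha > 0.0:
        return True  # y*f(x) > 1 is only allowed when alpha = 0
    return False


def non_bound_indices(alphas, c):
    """Indices of examples whose multiplier is strictly between 0 and C."""
    return [i for i, a in enumerate(alphas) if 0.0 < a < c]


def first_violator(alphas, ys, outputs, c, eps=1e-3, only_non_bound=False):
    """Index of the first KKT violator found by the loop, or None."""
    if only_non_bound:
        candidates = non_bound_indices(alphas, c)
    else:
        candidates = range(len(alphas))
    for i in candidates:
        if violates_kkt(alphas[i], ys[i], outputs[i], c, eps):
            return i
    return None
```

Calling `first_violator` with `only_non_bound=True` corresponds to the cheaper passes over the non-bound examples; a return value of `None` on a full pass means that every example satisfies the conditions within the tolerance.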

The choice of the second multiplier is based on equation 4.20. SMO keeps a cached error value $E$ for every non-bound example and chooses $\alpha_2$ such that the corresponding $E_2$ maximizes the numerator term $|E_1 - E_2|$. Briefly, the second Lagrange multiplier is chosen to maximize the step taken during joint optimization:

$$\alpha_2^{*} = \arg\max_{\alpha_i} \; |E_1 - E_i(\alpha_i)| \tag{4.26}$$
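As a rough illustration of this second-choice rule, the following Python sketch picks, among the non-bound examples, the index whose cached error is farthest from $E_1$. The function name `choose_second` and the list-based error cache are assumptions made for the example; they are not taken from [12] or from the MATLAB implementation.

```python
def choose_second(i1, errors, alphas, c):
    """Pick the non-bound index that maximizes |E_1 - E_i| (equation 4.26).

    errors: cached error values, one per training example (hypothetical cache).
    Returns None when there is no non-bound candidate other than i1.
    """
    candidates = [
        i for i, a in enumerate(alphas)
        if 0.0 < a < c and i != i1
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda i: abs(errors[i1] - errors[i]))
```

When the function returns `None`, a full implementation would fall back to some other way of choosing the second multiplier; that fallback is outside the scope of this sketch.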

4.3.3 Part III : threshold : Determine the constant b

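The procedure described above does not determine the threshold $b$ of the SVM, so it must be computed separately. $b$ is re-computed at each iteration. Two equations are used to correct $b$, depending on the case. When the new $\alpha_1$ is not at the bounds, we apply the correction:

$$b^{new} = E_1 + y_1 \left( \alpha_1^{new} - \alpha_1^{old} \right) k(\vec{x}_1, \vec{x}_1) + y_2 \left( \alpha_2^{new,clipped} - \alpha_2^{old} \right) k(\vec{x}_1, \vec{x}_2) + b^{old} \tag{4.27}$$

If the new $\alpha_2$ is not at the bounds, the corresponding equation is:

$$b^{new} = E_2 + y_1 \left( \alpha_1^{new} - \alpha_1^{old} \right) k(\vec{x}_1, \vec{x}_2) + y_2 \left( \alpha_2^{new,clipped} - \alpha_2^{old} \right) k(\vec{x}_2, \vec{x}_2) + b^{old} \tag{4.28}$$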

When both equations are valid, they are equal. When both new Lagrange multipliers are at a bound and $L$ is not equal to $H$, all the values in the interval between the results of 4.27 and 4.28 are thresholds consistent with the KKT conditions; SMO chooses the value halfway between them. Pseudocode for this algorithm is shown in [12], chapter 12, section 3.
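The threshold update can be written compactly as follows. This Python sketch transcribes equations 4.27 and 4.28 literally; the function name `update_threshold` and its arguments are hypothetical. Note the assumption on sign conventions: these two equations correspond to the convention of [12], in which the SVM output is written as $\sum_i y_i \alpha_i k(\vec{x}_i, \vec{x}) - b$, and the sketch assumes that the errors `e1` and `e2` were computed with that convention.

```python
def update_threshold(b_old, e1, e2, y1, y2,
                     a1_old, a1_new, a2_old, a2_new,
                     k11, k12, k22, c):
    """Recompute the threshold b after a joint step (equations 4.27 and 4.28)."""
    d1 = y1 * (a1_new - a1_old)
    d2 = y2 * (a2_new - a2_old)
    b1 = e1 + d1 * k11 + d2 * k12 + b_old  # equation 4.27
    b2 = e2 + d1 * k12 + d2 * k22 + b_old  # equation 4.28
    if 0.0 < a1_new < c:
        return b1
    if 0.0 < a2_new < c:
        return b2
    # Both multipliers at a bound: take the halfway value.
    return 0.5 * (b1 + b2)
```

With the convention stated above, using 4.27 when $\alpha_1$ is non-bound makes the new error on the first example vanish, which is exactly what the second KKT condition requires for a non-bound multiplier.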


Part II


5 OBJECTIVES

5.1 General Objectives

• Build a Language Identifier that tells apart English and French using Machine Learning Techniques.

5.2 Specific Objectives

• Understand the underlying theory concerning the problem of Language Identification.

• Learn models and algorithms related to the framework of Support Vector Machines (SVM).

• Learn Machine Learning techniques applied to LID systems.

• Conduct proper feature extraction on a database or speech corpus of the English and French languages.

• Use specialized libraries oriented to facilitate and optimize machine learning algorithms within the context of Language Identification.

• Build an end-point or user-friendly interface that implements the service of language identification.


Part III


6 DATABASE SELECTION AND CHARACTERIZATION

We summarize the specific steps taken to conduct database selection and characterization. As mentioned, the database was downloaded from the Voxforge speech corpus.¹ Voxforge is an organization dedicated to collecting speech samples of diverse languages through voluntary donations of utterances made by internet users. Each language corpus consists of the following directories:

• 16kHz−16bits

• 32kHz−16bits

• 44.1kHz−16bits

• 48kHz−16bits

• 8kHz−16bits

The directory names indicate the sample rate and the encoding with which the recordings were made. Different directories contain different sets of data (meaning different speakers). For both the French and the English corpus, the 48 kHz directory turns out to be the one with the most recordings, so all the data was taken at this sample rate.

An equivalent number of recordings was downloaded for both languages. In this case the French database had the lower number of files, so it determined the size of both sets. The resulting sets are as follows:

Table 1: Training Database

| Aspect | English | French |
|---|---|---|
| Sample rate | 48 kHz | 48 kHz |
| N. of speakers | 215 | 219 |
| Audio per speaker | 32 | 31 |
| Average duration | 7 sec | 7 sec |
| N. of files | 7000 | 7000 |
| Cepstral per audio | 13 × 500 | 13 × 500 |
| Delta-cepstral per audio | 13 × 500 | 13 × 500 |
| Training set | 13 × 6542014 | 13 × 7787422 |

¹ Specifically taken from: http://www.voxforge.org/home/downloads


Table 1 summarizes the main features of the training database.

Some explanation is needed to clarify the concepts of "Cepstral per audio" and "Delta-cepstral per audio". These fields of the table give the average number of cepstral and delta-cepstral 13-vectors that result from pre-processing one audio recording. This result is valid when the windowing used for MFCC extraction has a window width of 25 ms (Table 3).

Pre-processing results in a database of order $10^6$. Similar results hold for the testing data. This additional database is computed in order to test the resulting performance of the learning algorithms. The features of the testing database are shown in Table 2.

Table 2: Testing Database

| Aspect | English | French |
|---|---|---|
| Sample rate | 48 kHz | 48 kHz |
| N. of speakers | 92 | 94 |
| Audio per speaker | 32 | 31 |
| Average duration | 7 sec | 7 sec |
| N. of files | 3000 | 3000 |
| Cepstral per audio | 13 × 500 | 13 × 500 |
| Delta-cepstral per audio | 13 × 500 | 13 × 500 |
| Testing set | 13 × 2927328 | 13 × 3105082 |

Note that the complete database, counting recordings from both English and French, consists of 20 thousand audio files (10 thousand per language), corresponding to roughly 300 speakers per language. Also notice that 70% of the available data is used for training and 30% is used for testing. This database is modest, but it will be seen that it suffices for solving the problem at hand. For example, in the case of SVM, only a subset of the training data is actually used for training.


7 FEATURE EXTRACTION: MFCC AND DELTA FEATURES

As mentioned in the last section, the database originally consisted of 20 thousand audio files: 10 thousand files correspond to English speakers and 10 thousand to French ones. These audio files were sampled at 48 kHz and encoded in .wav format at 16 bits. They were converted to feature vectors following the Mel-frequency cepstral coefficient extraction process explained in chapter 2. To perform this transformation, a MATLAB library authored by Daniel P. W. Ellis [13] was used. This library consists of an implementation of the RASTA-PLP and MFCC feature extraction algorithms.

To perform MFCC extraction on a .wav file, the first step is to extract the samples. This is done using the wavread function in MATLAB. Once the samples are stored in a vector, they are used as an argument for the function melfcc from the Ellis library, which transforms these samples into a matrix of around 500 column cepstral vectors of dimension 13. These matrices are stored in .txt files, one per .wav file.

The parameters used to evaluate this function were set as follows:

Table 3: MFCC Parameters

| Parameter | Value | Description |
|---|---|---|
| wintime | 25 ms | Window length (passed in seconds) |
| hoptime | 10 ms | Step between successive windows (passed in seconds) |
| numcep | 13 | Number of cepstral coefficients to return |
| fbtype | 'mel' | Frequency warp |
| broaden | 0 | Flag to retain first and last bands |

Table 3 shows the parameters ultimately used when training the classifier. Other schemes were tried with different values of wintime and hoptime. In particular, 0.01 second time frames with a hoptime of 0.04 were used, thus deriving the so-called centi-second mel-scale cepstral and delta-cepstral coefficients, as recommended in [3]. This arrangement of values ended up lowering the performance of the learning machines, so the values shown in the table were used instead. Finally, for each .wav file, the corresponding delta-cepstral coefficients were computed and stored. The computation was straightforward and followed equation 2.1.


8 TRAINING AND TESTING: GMM AND SVM

8.1 Gaussian Mixture Models: Training

Recall that under the GMM assumption all the instances of a particular class (say, English) are assumed to be drawn randomly according to a probability density that is a weighted sum of multivariate Gaussian densities. This probability density is not given and has to be fitted to the training data using some methodology; in this case, the EM algorithm.

In MATLAB, an object of the gmdistribution class defines a Gaussian mixture distribution. Its attributes consist of the means, covariance matrices, number of components ($k$) and mixture weights necessary to completely define any given GMM. This object has a method called fit that implements the EM algorithm. Briefly, this MATLAB method receives a given dataset and returns the corresponding gmdistribution object with maximum likelihood for that data.

We train several GMMs: one for the Mel-frequency cepstral coefficients of the English training database, called gmc-english; one for the Mel-frequency delta-cepstral coefficients of the same database, called gmd-english; and the corresponding gmc-french and gmd-french for the French training database.

These distributions were calculated for different numbers of components $k$ in the mixture, as will be shown in the following section. Other than that, the rest of the parameters remained the same throughout the simulations, as shown in the following table:

Table 4: gmdistribution.fit Parameters

| Parameter | Value | Meaning |
|---|---|---|
| Start | 'randSample' | Random initial conditions |
| CovType | 'diagonal' | Covariances must be diagonal |
| SharedCov | 'false' | Covariances not necessarily equal |
| MaxIter | 200 | Maximum number of iterations |

In this table, random initial conditions means that the distribution means are initialized with $k$ random observations from the database X and that the covariance matrices all start out equal and diagonal, with the $j$-th diagonal element equal to the variance of X(:,j). MaxIter is set to 200 because the amount of data requires many iterations, and SharedCov is set to false because setting it to true, despite requiring less computational effort, more often gives rise to error messages such as "ill-conditioned covariance matrix". The meaning of this error is explained in the gmdistribution.fit documentation.¹
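To make the training procedure more concrete, here is a minimal EM sketch in Python for a one-dimensional mixture. It is not the gmdistribution.fit implementation and it is deliberately simplified: the thesis models are 13-dimensional with diagonal covariances, whereas this example is scalar. The names `fit_gmm_1d` and `log_gauss`, the variance floor and the convergence tolerance are assumptions of the sketch. The initialization imitates the 'randSample' start: means drawn from the observations, equal variances set to the data variance, uniform weights.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gmm_1d(data, k, max_iter=200, tol=1e-8, seed=0, init_means=None):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative only)."""
    rng = random.Random(seed)
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    # 'randSample'-style start: k observations as means, equal variances.
    means = list(init_means) if init_means is not None else rng.sample(data, k)
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibilities, computed with log-sum-exp for stability.
        resp, ll = [], 0.0
        for x in data:
            logs = [math.log(weights[j]) + log_gauss(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)
            log_norm = top + math.log(sum(math.exp(v - top) for v in logs))
            ll += log_norm
            resp.append([math.exp(v - log_norm) for v in logs])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # variance floor (assumption)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances
```

The `max_iter` argument plays the same role as MaxIter in Table 4: EM stops either when the log-likelihood no longer changes appreciably or when the iteration budget is exhausted.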

8.2 Gaussian Mixture Models: Testing

To determine whether an utterance $U_1$ originates from English or from French, we first extract its MFCC using the function melfcc mentioned in chapter 7, and convert it into an array of 13-vectors $\{\vec{u}\}$. Next, using equation 3.2, we compute the log-likelihood that the given array belongs to English and to French, $L(\{\vec{u}\} \mid \lambda_{English})$ and $L(\{\vec{u}\} \mid \lambda_{French})$. We compare these quantities, and the language with the greater likelihood is signalled as the originator of $U_1$.

To determine the performance of the GMM learning machine, we apply the former procedure to all 6000 recordings of the testing database. First, we perform GMM-based classification on each of the English recordings in the English testing database of size $T$. Each time a recording is incorrectly labelled as originating from French, we count an error in the discrete variable $c_f$, and in the end we compute:

$$Error_{English} = \frac{c_f}{T} \tag{8.1}$$

An identical procedure is applied to the French testing database and $Error_{French}$ is computed. Since both testing databases are equal in size, these two errors are simply averaged to obtain our empirical error $Error_{Empirical}$. Performance tables for different $k$ are shown in the results chapter.
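The testing procedure of this section can be sketched in Python as follows. The sketch assumes that equation 3.2 is the usual GMM log-likelihood, i.e. the sum over frames of the logarithm of the weighted Gaussian densities, with diagonal covariances as in Table 4. It also assumes a single model per language, whereas this work trains separate cepstral and delta-cepstral models. All function names and the model layout `(weights, means, variances)` are hypothetical.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log sum_k w_k N(u | mu_k, diag(var_k))."""
    total = 0.0
    for u in frames:
        logs = []
        for w, mu, var in zip(weights, means, variances):
            lg = math.log(w)
            for x, m, v in zip(u, mu, var):
                lg += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            logs.append(lg)
        top = max(logs)
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total


def classify(frames, english, french):
    """Label an utterance with the language of greater log-likelihood."""
    l_en = gmm_log_likelihood(frames, *english)
    l_fr = gmm_log_likelihood(frames, *french)
    return "English" if l_en >= l_fr else "French"


def empirical_error(english_utts, french_utts, english, french):
    """Per-language error rates (equation 8.1) and their average."""
    c_f = sum(classify(u, english, french) == "French" for u in english_utts)
    c_e = sum(classify(u, english, french) == "English" for u in french_utts)
    err_en = c_f / len(english_utts)
    err_fr = c_e / len(french_utts)
    return err_en, err_fr, 0.5 * (err_en + err_fr)
```

Averaging the two error rates is only equivalent to the overall error rate because both testing sets have the same size, as noted above.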

8.3 Support Vector Machines: Training

Training an SVM is not as straightforward. Contrary to GMMs, Support Vector Machines are binary classifiers based on hyperplanes, and they depend highly on the kernel type and on how separable the data is. Also, SVM training is not very scalable, so it is not viable to use an entire database of $7 \times 10^6$ cepstral vectors to train this type of classifier. Instead, a subset of 10 thousand random examples was used. Databases larger than that caused convergence problems when running the algorithms.

To train the SVM we use a function from the MATLAB Statistics Toolbox called svmtrain. This function trains an SVM based on some input data. The settings of this function are shown in Table 5. As can be seen in this table, some kernel parameters are initialized randomly. At some point we will have to determine these parameters via an optimization process.

¹ gmdistribution.fit documentation: http://www.mathworks.com/help/stats/gmdistribution.fit.html

Table 5: svmtrain Parameters

| Parameter | Value | Meaning |
|---|---|---|
| Kernel-Function | 'rbf' | Gaussian Radial Basis Function |
| rbf-sigma | random | rbf scaling factor, initial value |
| boxconstraint | random | soft margin C, initial value |
| method | SMO | Optimization method |

We use the Gaussian radial basis function as our kernel, based on previous approaches [?]. This type of kernel has the form:

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) \tag{8.2}$$
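For reference, the kernel of equation 8.2 takes only a few lines of Python. The function name `rbf_kernel` is hypothetical; in this work the kernel is evaluated internally by svmtrain.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian radial basis function kernel (equation 8.2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

The kernel equals 1 when both vectors coincide and decays towards 0 as they move apart; a larger $\sigma$ makes the decay slower, so distant training vectors still influence the decision function.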

Initially, the SVM is trained using random values for $\sigma$ and $C$ (the soft margin box constraint). The idea, however, is to perform an iterative process that helps us find optimal values for these parameters.

We use the MATLAB function crossval to determine a 10-fold cross-validation estimate of the misclassification rate, labelled mcr, thus determining the performance of any given pair $\{\sigma, C\}$. In order to minimize this error, we perform a line search over an exponential grid of the parameters $\sigma$ and $C$ that seeks to minimize mcr. Several iterations of this kind are made to find a convincing minimum, each time departing from random but different initial parameter values. Each iteration results in different parameter estimates, and some line searches even fail to converge. After finding a convincing set $\{mcr^*, \sigma^*, C^*\}$, we perform the minimization process using $\{\sigma^*, C^*\}$ as initial values. Details are shown in the following section.
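The parameter search can be illustrated with the following Python sketch. The exact line search used in this work is not reproduced here; the sketch swaps in a simple coordinate-wise search in the $(\log\sigma, \log C)$ plane with step halving and random restarts, which is only one possible reading of "a line search over an exponential grid". The callable `mcr` is hypothetical: in practice it would train the SVM for the given pair and return the 10-fold cross-validated misclassification rate.

```python
import random


def log_grid_search(mcr, start, step=1.0, min_step=1e-3, max_evals=500):
    """Coordinate-wise search over (log_sigma, log_c) with step halving."""
    best = tuple(start)
    best_val = mcr(*best)
    evals = 1
    while step >= min_step and evals < max_evals:
        improved = False
        for dim in (0, 1):
            for sign in (1.0, -1.0):
                cand = list(best)
                cand[dim] += sign * step
                val = mcr(*cand)
                evals += 1
                if val < best_val:
                    best, best_val = tuple(cand), val
                    improved = True
        if not improved:
            step *= 0.5  # refine the grid around the current best point
    return best_val, best


def multi_start(mcr, n_starts=5, seed=0, low=-3.0, high=3.0):
    """Repeat the search from random starting points and keep the best run."""
    rng = random.Random(seed)
    runs = [
        log_grid_search(mcr, (rng.uniform(low, high), rng.uniform(low, high)))
        for _ in range(n_starts)
    ]
    return min(runs)
```

Working in logarithmic coordinates means that each step multiplies $\sigma$ or $C$ by a constant factor, which is what makes the grid exponential. Since every evaluation of a cross-validated `mcr` requires training the SVM ten times, the `max_evals` budget matters in practice.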

8.4 Support Vector Machines: Testing

As before, to determine whether an utterance $U_1$ originates from English or from French, we first extract its MFCC using the function melfcc mentioned in chapter 7, and convert it into an array of 13-vectors $\{\vec{u}\}$. Next, using the decision function resulting from the SVM training, we score the utterance by summing the classifications of all its cepstral vectors (the classifier can take the values $\{+1, -1\}$). If the resulting score is positive, the utterance $U_1$ is labelled as originating from French, and from English otherwise.
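The utterance-level vote described above can be sketched in Python as follows. The representation of the trained SVM as a list of `(support_vector, label, alpha)` triples plus a threshold `b` is an assumption made for the example, as are the function names; in this work the trained classifier is the structure returned by svmtrain.

```python
import math


def rbf(x, z, sigma):
    """Gaussian radial basis function kernel."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))


def frame_label(u, support, b, sigma):
    """Classify one cepstral vector as +1 or -1 with the SVM decision function."""
    value = sum(y * alpha * rbf(u, x, sigma) for x, y, alpha in support) + b
    return 1 if value > 0 else -1


def classify_utterance(frames, support, b, sigma):
    """Sum the per-frame labels; a positive score means French."""
    score = sum(frame_label(u, support, b, sigma) for u in frames)
    return "French" if score > 0 else "English"
```

Because the score is a sum over several hundred frames, an utterance can be labelled correctly even when a sizeable fraction of its individual frames is misclassified.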

We perform this classification on each of the English recordings in the English testing database of size $T$. Each time a recording is labelled as originating from French, we count an error in the discrete variable $c_f$, and we compute equation 8.1.

An identical procedure is applied to the French testing database and $Error_{French}$ is computed. Since both testing databases are equal in size, these two errors are averaged to obtain our empirical error $Error_{Empirical}$. Performance tables are shown in the results chapter.


Part IV


9 RESULTS: ERROR RATES AND COMPUTATIONAL EFFORT

In this section we present the results obtained for both models. These results constitute the performances in terms of error rates achieved by GMM and SVM learning machines.

In the case of GMM, our first approach was to use only the cepstral coefficients to train the model; the result was a minimum $Error_{Empirical}$ of 20.48%. An attempt to increase the performance of the algorithm was made by including the delta-cepstral coefficients in the training as well. In this case, a sweep was made over different values of $k$, the number of mixtures in the Gaussian distribution. The results obtained were:

Table 6: GMM Classifier Results: Cepstral and Delta-cepstral coefficients

| k | Error_Eng | Error_Fr | Error_emp | Time [sec] |
|---|---|---|---|---|
| 5 | 33.24% | 32.07% | 32.65% | 792 (13 min) |
| 10 | 33.11% | 23.92% | 28.51% | 3164 (52 min) |
| 15 | 30.07% | 18.17% | 24.12% | 3248 (54 min) |
| 20 | 27.10% | 12.83% | 19.96% | 6452 (1.79 h) |
| 25 | 25.55% | 12.63% | 19.09% | 8956 (2.49 h) |
| 30 | 26.89% | 15.81% | 21.35% | 12000 (3.3 h) |
| 35 | 22.74% | 18.86% | 20.80% | 13580 (3.7 h) |
| 40 | 21.73% | 21.05% | 21.39% | 17424 (4.8 h) |
| mixed | 16.60% | 21.38% | 18.99% | – |
| multi | 17.07% | 18.72% | 17.89% | – |

In this table, Error_emp refers to the empirical error obtained by averaging the per-language errors of equation 8.1 in the last section. Note that the last two rows show the best performance. In this context, mixed means that during the classification process GMM models with $k = 25$ were used to calculate the English likelihood, whilst models with $k = 40$ were used for French. On the other hand, multi means that models with several values of $k$ were used for each language.

In these rows the time field is empty. This is because we are reusing past results to build a new classifier.


Now, the other type of classifier, SVM, had to be trained using a dataset of only 10 thousand examples. The random line search described in the last section resulted in multiple parameter sets. We observed the formation of clusters, i.e., multiple sets $\{mcr^*, \sigma^*, C^*\}$ with very similar values. The best misclassification rate (mcr) of each cluster was extracted and is shown in the following table:

Table 7: SVM Classifier Results: Misclassification rate estimation

| mcr | log σ | log C | Time [sec] |
|---|---|---|---|
| 41.85% | 2.0533 | 1.2852 | 2932 |
| 37.64% | 0.2975 | -0.4113 | 2676 |
| 37.46% | 0.7531 | -0.4113 | 4803 |
| 37.45% | 0.5092 | 0.0374 | 2135 |

We attempted to train the SVM with bigger datasets of sizes {15k, 20k, 25k}, but the algorithm always failed due to convergence issues. The results in Table 7 may look poor, but since the classification of each utterance $U_1$ is made over a large number of cepstral vectors, this mcr produces a classifier with better performance at the utterance level. We ran classification tests on the testing data with the parameters of the best mcr in Table 7, and obtained the following results:

Table 8: SVM Classifier Results: Testing Database

| log σ | log C | Error_Eng | Error_Fr | Error_emp |
|---|---|---|---|---|
| 0.5092 | 0.0305 | 38.30% | 28.78% | 33.54% |


10 CONCLUSIONS

In this work, we have sought to learn techniques related to the problem of Language Identification (LID); more specifically, LID solutions within the theoretical framework of supervised learning. The principles and main steps (theoretical and computational) of these techniques were properly learned and understood.

We became familiar with the pre-processing of speech databases based on Mel-frequency cepstral coefficients (MFCC) and used a MATLAB library to successfully implement feature extraction on 20 thousand .wav files from the Voxforge database.

Two techniques that belong to the family of supervised learning models were studied and implemented: Gaussian Mixture Models (GMM) and Support Vector Machines (SVM). As can be seen in Tables 6 and 8, our GMM implementation performed better than the SVM. We demonstrated that GMM models are fairly simple to train and that the computational approach using Gaussian mixtures is highly scalable to large datasets without many complications. On the other hand, we saw that custom approaches to training SVMs using Gaussian radial basis function (RBF) kernels show severe limitations when handling big datasets.

In order to carry out the pre-processing and training of the mentioned models, specialized libraries were used throughout: the D. P. W. Ellis RASTA-PLP and MFCC implementations, the MATLAB Statistics Toolbox class gmdistribution and the function svmtrain. These tools allowed us to successfully use machine learning techniques to build a language identifier that tells apart English and French.

Future work should focus on the use of boosting algorithms in order to increase the performance of the models and make the LID system more reliable. The construction of an end-point, user-friendly interface that implements the service of language identification also remains to be done.


BIBLIOGRAPHY

[1] VoxForge, Speech Corpus. http://www.voxforge.org/

[2] Beth Logan et al., "Mel Frequency Cepstral Coefficients for Music Modeling," Cambridge Research Laboratory, 2001.

[3] Marc A. Zissman, "Comparison of Four Approaches to Acoustic Language Identification of Telephone Speech," IEEE Transactions on Speech and Audio Processing, 1996.

[4] Rabiner, L. R. & Juang, B. H., "Fundamentals of Speech Recognition," Prentice Hall, 1993.

[5] Merhav, N. & Lee, C., "On the Asymptotic Statistical Behavior of Empirical Cepstral-coefficients," IEEE Transactions on Signal Processing, 1993.

[6] Davis, S. & Mermelstein, P. et al., "Comparison of Parametric Representations for Monosyllabic Word Recognition in Spoken Sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, 1998.

[7] Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," International Computer Science Institute, Berkeley, 1998.

[8] Frank Dellaert, "The Expectation Maximization Algorithm," College of Computing, Georgia Institute of Technology, 2002.

[9] Bernhard Schölkopf, Alexander J. Smola, "Learning with Kernels," The MIT Press, 2002.

[10] Marti A. Hearst, "Support Vector Machines," IEEE Intelligent Systems, 2001.

[11] W. M. Campbell, "Support Vector Machines for Speaker and Language Recognition," Computer Speech and Language, 2005.

[12] John C. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," in Schölkopf, B. and Smola, A. J., editors, Advances in Kernel Methods - Support Vector Learning, MIT Press, 2005.

[13] D. Ellis, "PLP and RASTA in Matlab using melfcc.m and invmelfcc.m," http://labrosa.ee.columbia.edu/matlab/rastamat/