IMPLEMENTATION OF A LANGUAGE IDENTIFICATION SYSTEM BASED ON OPTIMIZATION AND MACHINE LEARNING TECHNIQUES
Sergio David Rodríguez Bermúdez
Thesis for the Degree of Electronic Engineer
Sergio David Rodríguez Bermúdez: Implementation of a Language Identification System based on Optimization and Machine Learning Techniques, Thesis for the Degree of Electronic Engineer, © November 2013
Ignoramus et ignorabimus
ABSTRACT
In this work we implement a language identification system that decides whether an utterance is a sample of English or French. To do this we rely on the free-license speech corpus of the VoxForge organization. Pre-processing of this corpus uses the very common mel-frequency cepstral coefficients (MFCC) and delta-cepstral coefficients as feature vectors. The approach lies within the context of machine learning techniques, specifically Gaussian Mixture Models (GMM) and Support Vector Machines (SVM). We demonstrate that GMMs are fairly simple to train and that the computational approach using Gaussian mixtures scales to large datasets without many complications. On the other hand, we will see that custom approaches to training SVMs with Gaussian radial basis function (RBF) kernels show severe limitations when handling big datasets.
We have seen that computer programming is an art, because it applies accumulated knowledge to the world, because it requires skill and ingenuity, and especially because it produces objects of beauty.
— Donald E. Knuth
ACKNOWLEDGMENTS
Many thanks to all the people who were involved, directly or indirectly, in the development of this work. It constitutes the last task of my degree and it is meant to close the cycle; as cliché as it may sound, this is a relief for me and for my parents. Above all, I appreciate the support that my parents have given me throughout my undergraduate studies, from the beginning to this point. I count on them, and that is a privilege that is not necessarily given to everyone. I am also very grateful to my advisor, Fernando Lozano, whose particular style allowed me to be independent in my work, yet very thoughtful and responsible for it. His knowledge was very well received throughout the machine learning course. Finally, I must say that I am in debt to the guys from the machine learning seminar, especially Reinaldo Uribe, who delivered a life-saving hint and who rescued my computer and my thesis.
CONTENTS
1 Introduction
  1.1 LID Overview
I Theoretical Framework
2 Feature Extraction
3 Gaussian Mixture Models
  3.1 Expectation-Maximization Algorithm
    3.1.1 Maximum-likelihood
    3.1.2 Basic EM
    3.1.3 Maximum likelihood in mixture densities
4 Support Vector Machines
  4.1 Hyperplane Classifiers
  4.2 Support Vector Classifiers
    4.2.1 Soft Margin Hyperplane
  4.3 Sequential Minimal Optimization (SMO)
    4.3.1 Part I: Analytical step: Solving for two Lagrange Multipliers
    4.3.2 Part II: Heuristics: Choosing two Lagrange multipliers
    4.3.3 Part III: Threshold: Determine the constant b
II Objectives
5 Objectives
  5.1 General Objectives
  5.2 Specific Objectives
III Methodology
6 Database Selection and Characterization
7 Feature Extraction: MFCC and Delta Features
8 Training and Testing: GMM and SVM
  8.1 Gaussian Mixture Models: Training
  8.2 Gaussian Mixture Models: Testing
  8.3 Support Vector Machines: Training
  8.4 Support Vector Machines: Testing
IV Results
9 Results: Error Rates and Computational Effort
10 Conclusions
Bibliography
LIST OF FIGURES
Figure 1 Process to create MFCC Features
Figure 2 Mel scaling and smoothing of the log amplitude spectrum
Figure 3 The Mel function
Figure 4 Hyperplane Representation
LIST OF TABLES
Table 1 Training Database
Table 2 Testing Database
Table 3 MFCC Parameters
Table 4 gmdistribution.fit Parameters
Table 5 svmtrain Parameters
Table 6 GMM Classifier Results: Cepstral and Delta-cepstral coefficients
Table 7 SVM Classifier Results: Misclassification rate estimation
Table 8 SVM Classifier Results: Testing Database
1 INTRODUCTION
It is the purpose of this document to report on the implementation of two automatic language identification (LID) systems using Gaussian Mixture Models and Support Vector Machines. To this end, we present the overall process of building a functional LID learning algorithm from scratch and illustrate it step by step. The document is organized in four parts. The first part presents the theoretical framework: a detailed description of the pre-processing stage in chapter 2, followed by a full description of Gaussian Mixture Models and Support Vector Machines in chapters 3 and 4. The second part briefly presents the general and specific objectives. The third part (chapters 6 to 8) contains the explicit methodology used to implement the language identification algorithms. Finally, results and conclusions are presented in the last 2 chapters.
1.1 LID Overview
For the purpose of clarity, let us briefly describe the main steps involved in building a learning machine, and specifically one that is built to recognize languages.
To build an LID system, the first requirement is a database of recordings in the languages of interest. In this case, more than 20,000 .wav recordings from English and French native speakers were extracted from the speech corpus in the VoxForge repository [1]. Details on this database will be given in the following sections.
After acquiring the database, the next relevant step is to transform these .wav files into feature vectors. This process is called feature extraction and it is necessary since learning algorithms perform operations on dot product spaces, which require vector representations instead of binary files.
Another reason for this process is that feature extraction reduces the dimensionality of the data: fragments of audio encoded in long binary strings are condensed into 13-dimensional vectors that contain only the relevant information of these fragments.
Having done this, the next step is to train a model using the resulting feature vector database. The process of setting up the models and the training algorithms is to be described throughout the document.
Finally, a subset of the database is used to test the performance of the algorithms. At the end of this work we intend to produce a benchmark that compares the performance of the approaches used to build the language identifier. Also, if time permits, our objective is to build a very simple interface to make the program easier to use.
Part I
2 FEATURE EXTRACTION
As mentioned, the first thing to be done when building an LID system is to conduct feature extraction on the dataset. For this purpose we use Mel-Frequency Cepstral Coefficients (MFCCs), which are the dominant features used for speech and language recognition.
The process of extracting MFCCs involves several steps, but the main assumptions are that the Mel frequency scale is a suitable scale to model the spectra of speech utterances, and that the Discrete Cosine Transform (DCT) is adequate to decorrelate the Mel-spectral vectors and hence produce independently distributed samples [2].
We aim to describe the overall process; see figure 1. The first step is to divide the speech signal into frames, usually by applying a windowing function at fixed intervals. The most typical choice is a Hamming window 0.01 seconds wide [3]. Cepstral feature vectors are extracted for each frame (one 13-dimensional vector per frame).
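The framing step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `frame_signal` and the 0.005 s hop between frames are hypothetical choices, while the 0.01 s Hamming window follows the text.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_len_s=0.01, hop_s=0.005):
    """Split a 1-D signal into overlapping, Hamming-windowed frames.

    frame_len_s follows the 0.01 s window mentioned in the text;
    hop_s (the interval between frame starts) is an assumed value.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(frame_len_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    # One row per frame, each multiplied sample-by-sample by the window.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return frames * window
```

For a one-second signal sampled at 16 kHz this yields frames of 160 samples each, and every later step (DFT, Mel binning, DCT) operates row by row on this matrix.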
Figure 1: Process to create MFCC Features [2]
The next step is to take the Discrete Fourier Transform (DFT) of each frame, take the amplitude and retain its logarithm. Phase information is discarded because perceptual studies have shown that the amplitude of the spectrum is much more important than the phase. The logarithm of the amplitude is taken because the perceived loudness of a signal has been found to be approximately logarithmic [4].
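As a minimal sketch of this step, assuming the frames are stored one per row, the log amplitude spectrum can be computed as follows. The function name and the small constant `eps`, added only to avoid taking the logarithm of zero, are assumptions made for the illustration.

```python
import numpy as np

def log_amplitude_spectrum(frames, eps=1e-10):
    """Return log|DFT| of each frame (rows of `frames`); phase is discarded."""
    spectrum = np.fft.rfft(np.asarray(frames, dtype=float), axis=1)
    return np.log(np.abs(spectrum) + eps)
```

`np.fft.rfft` keeps only the non-negative frequencies, which is sufficient for real-valued audio frames.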
The next step is to smooth the spectrum and emphasize perceptually meaningful frequencies. This is achieved by selectively reducing the number of spectral components, as illustrated in figure 2.
Figure 2: Mel scaling and smoothing of the log amplitude spectrum. Spectral components are averaged over Mel-spaced bins to produce a smoothed spectrum. [2]
Here we see that the original spectrum is divided into segments. These segments, or bins, are distributed unevenly according to the Mel frequency scale, because it has been found that, for speech, the lower frequencies are perceptually more important than the higher ones. Components are averaged over these bins to produce a reduced spectrum, which is referred to as the Mel spectrum.
The Mel scale is based on a mapping between actual frequency and perceived pitch, as humans apparently do not perceive pitch linearly. The mapping is approximately linear below 1 kHz and logarithmic above. Figure 3 shows the Mel function.
The formula for converting from frequency $f$ (in Hz) to the Mel scale is:
$$M(f) = 1125 \ln\left(1 + \frac{f}{700}\right)$$
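The conversion formula translates directly into code. The sketch below uses the constants 1125 and 700 from the formula; the inverse mapping `mel_to_hz` is not stated in the text and is obtained here simply by solving the formula for $f$, as one would need when placing Mel-spaced bin edges on the frequency axis.

```python
import math

def hz_to_mel(f_hz):
    """Frequency in Hz to Mel, M(f) = 1125 ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, derived by solving M(f) for f."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)
```

Equal steps in Mel correspond to increasingly wide steps in Hz, which is what makes the bins of figure 2 narrow at low frequencies and wide at high ones.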
So far so good. But we still have a problem: the components of the Mel-spectral vector calculated for each frame are highly correlated. This is problematic if the behaviour of the data is to be modelled, since some schemes assume that the data are independent and identically distributed, as in Gaussian Mixture Models, where features are typically modelled by mixtures of Gaussian densities.
Therefore, in order to eliminate the correlation and reduce the number of parameters in the system, the last step of MFCC feature construction is to apply to the Mel-spectral vectors a transform that decorrelates their components. Theoretically, the Karhunen-Loève (KL) transform achieves this. In the speech community, the KL transform is approximated by the Discrete Cosine Transform (DCT) [5]. In our case, a DCT of type II is used:
$$X_k = \sum_{n=0}^{N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right)k\right], \qquad k = 0, \dots, N-1.$$
Here the $x_n$ correspond to the Mel-spectral components of a given frame $t$ and the $X_k$ constitute the cepstral coefficients of that frame.
It is worth mentioning that, for language ID, only the lowest 13 coefficients of the mel-cepstrum are calculated ($X_0$ through $X_{12}$). The lowest cepstral coefficient is sometimes ignored, because it contains only overall energy level information [6].
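A direct, unoptimized transcription of the type-II DCT above may help to fix the indices. It is a sketch only: the function name is hypothetical and no normalization factor is applied, matching the formula as written rather than any particular library convention.

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II: X_k = sum_n x_n cos[pi/N (n + 1/2) k]."""
    x = np.asarray(x, dtype=float)
    N = x.shape[0]
    n = np.arange(N)                # sample index (columns)
    k = np.arange(N).reshape(-1, 1) # coefficient index (rows)
    basis = np.cos(np.pi / N * (n + 0.5) * k)  # basis[k, n]
    return basis @ x
```

Given the Mel-spectral vector of one frame, the 13 cepstral coefficients $X_0$ through $X_{12}$ would then be `dct_ii(mel_spectrum)[:13]`, where `mel_spectrum` stands for that (hypothetical) input vector.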
Finally, let us note that, in an effort to model cepstral transition information, cepstral differences are also computed and modelled. These vectors of cepstral differences are called delta-cepstra and they are computed at every frame as:
$$\Delta\vec{c}(t) = \vec{c}(t+1) - \vec{c}(t-1) \tag{2.1}$$
where $\Delta\vec{c}(t)$ and $\vec{c}(t)$ refer to the delta-cepstral and cepstral vectors at frame $t$, respectively. Now, once we have our feature vector space full of instances, it is time to model the information they convey. The idea is to approximate, with a model, the pattern of the underlying process behind French and English utterances. This can be done in different ways, but we have chosen GMM and SVM as our approaches.
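Returning to equation 2.1, the delta-cepstra of a whole utterance can be computed in one vectorized step. The sketch assumes the cepstra are stored as a matrix with one row per frame; how the first and last frames are handled is not specified in the text, so repeating the edge frames is an assumption made here.

```python
import numpy as np

def delta_cepstra(cepstra):
    """Delta features per equation 2.1: c(t+1) - c(t-1) for every frame t.

    `cepstra` has shape (T, d). The first and last frames are padded by
    repeating the edge frame, which is an assumed boundary treatment.
    """
    c = np.asarray(cepstra, dtype=float)
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - padded[:-2]
```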
3 GAUSSIAN MIXTURE MODELS
Among the most popular models in the context of language identification, Gaussian Mixture Models (GMM) are the simplest approach, not only because of their mathematical framework but also because their requirements in terms of training data are relatively simple to meet.¹
To be able to train a GMM, a database consisting of real-valued m-vectors is required. In the present case these vectors are the Mel-frequency cepstral coefficients as well as the delta-cepstral coefficients. As seen in the last chapter, the latter require no complex calculations and improve the model performance.
Under the GMM assumption, each feature vector $\vec{v}_t$ at frame time $t$ (remember that MFCCs are taken per time frame) is assumed to be drawn randomly according to a probability density that is a weighted sum of multivariate Gaussian densities [3]:
$$p(\vec{v}_t|\lambda) = \sum_{k=1}^{N} \alpha_k\, p_k(\vec{v}_t) \tag{3.1}$$
where $\lambda$ is the set of model parameters:
$$\lambda = \{\alpha_k, \vec{u}_k, \Sigma_k\}$$
Here $k$ is the mixture index ($1 \le k \le N$), the $\alpha_k$ are the mixture weights, constrained such that $\sum_{k=1}^{N}\alpha_k = 1$, and the $p_k$ are the multivariate Gaussian densities defined by the means $\vec{u}_k$ and covariances $\Sigma_k$.
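To make equation 3.1 concrete, the following sketch evaluates the logarithm of the mixture density for a batch of feature vectors. It is an illustration rather than the implementation used in this work: the function names are hypothetical, full covariance matrices are assumed, and the log-sum-exp trick is used only for numerical stability.

```python
import numpy as np

def gaussian_log_pdf(x, mean, cov):
    """Log of a multivariate Gaussian density, evaluated at each row of x."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ti,ij,tj->t", diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def gmm_log_density(x, weights, means, covs):
    """log p(v_t | lambda) of equation 3.1 for each row v_t of x."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    # comp[t, k] = log(alpha_k) + log p_k(v_t)
    comp = np.stack(
        [np.log(w) + gaussian_log_pdf(x, m, c)
         for w, m, c in zip(weights, means, covs)],
        axis=1,
    )
    top = comp.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(comp - top).sum(axis=1, keepdims=True))).ravel()
```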
For each language $l$ two GMMs are created: one for the cepstral feature vectors $\{\vec{x}_t\}$ and one for the delta-cepstral feature vectors $\{\vec{y}_t\}$. These models are trained by running multiple iterations of the estimate-maximize (E-M) algorithm, which produces, for each stream of data, a more likely set of $\alpha_k, \vec{u}_k, \Sigma_k$ [8].
The recognition of an unknown speech utterance proceeds as follows. First, one conducts feature extraction on the utterance by converting the digitized waveform (in this case the .wav file) to its cepstra. Second, one calculates the log likelihood that the language $l$ model produced the unknown utterance.
¹ Notation and text coherence of this section as in [Zissman, 1996]
The log likelihood $L$ is defined as:
$$L(\{\vec{x}_t, \vec{y}_t\}|\lambda_l^{C}, \lambda_l^{DC}) = \sum_{t=1}^{T}\left[\log p(\vec{x}_t|\lambda_l^{C}) + \log p(\vec{y}_t|\lambda_l^{DC})\right] \tag{3.2}$$
where $\lambda_l^{C}$ and $\lambda_l^{DC}$ are the cepstral and delta-cepstral GMMs, respectively, for language $l$, and $T$ is the duration of the utterance in frames. The assumptions concerning the feature vectors are that the observations $\{\vec{x}_t\}$ are statistically independent of each other, that the observations $\{\vec{y}_t\}$ are statistically independent of each other, and that the two streams are independent of each other as well.
The maximum-likelihood classifier hypothesizes $\hat{l}$ as the language of the unknown utterance, where:
$$\hat{l} = \underset{l}{\arg\max}\; L(\{\vec{x}_t, \vec{y}_t\}|\lambda_l^{C}, \lambda_l^{DC})$$
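The decision rule can be sketched independently of how the densities are computed. In the illustration below, `models` is a hypothetical dictionary that maps each language to a pair of functions returning per-frame log densities for the cepstral and delta-cepstral streams; the score is the sum of equation 3.2 and the hypothesis is the language with the largest score.

```python
import numpy as np

def classify_utterance(cepstra, deltas, models):
    """Return (best language, scores) using the log likelihood of eq. 3.2.

    models: {language: (log_p_cepstral, log_p_delta)}, where each entry is
    a callable mapping a (T, d) array to T per-frame log densities.
    """
    scores = {}
    for lang, (log_p_c, log_p_dc) in models.items():
        scores[lang] = float(np.sum(log_p_c(cepstra)) + np.sum(log_p_dc(deltas)))
    best = max(scores, key=scores.get)
    return best, scores
```

Summing log densities over frames is exactly where the frame-independence assumptions stated above enter the classifier.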
The main issue of the GMM approach is to calculate the parameters $\lambda_l^{C}$ and $\lambda_l^{DC}$. These parameters determine the structure of the predictors and hence determine the classifier. To calculate these parameters, the training strategy is based on an implementation of the EM algorithm.
3.1 Expectation-Maximization Algorithm
In this section we describe the Expectation-Maximization algorithm and its theoretical foundations. We recall the maximum-likelihood problem, explain its motivation and describe the relation of this problem to the Expectation-Maximization problem. A general overview of the EM algorithm is presented along with its application to Gaussian Mixture Models.
3.1.1 Maximum-likelihood
The maximum-likelihood problem is the problem of fitting a probability density function so as to maximize the likelihood of its parameters given a set of observed data. For a data set $X = \{\vec{x}_1, \dots, \vec{x}_N\}$ of size $N$, supposedly drawn independently from the distribution $p(\vec{x}|\Theta)$, the likelihood of the parameters given the observations can be expressed as:
$$p(X|\Theta) = \prod_{i=1}^{N} p(\vec{x}_i|\Theta) = L(\Theta|X)$$
where $\Theta$ is the set of parameters that govern the distribution. The likelihood is thought of as a function of the parameters $\Theta$. As noted before, in the maximum likelihood problem our goal is to find the $\Theta$ that maximizes $L$. That is, we wish to find $\Theta^*$ where:
$$\Theta^* = \underset{\Theta}{\arg\max}\; L(\Theta|X)$$
Depending on the form of $p(\vec{x}|\Theta)$, this problem can be analytically intractable. In such cases EM offers an alternative approach to the maximum-likelihood problem.
3.1.2 Basic EM
The EM algorithm is a method for solving the maximum-likelihood problem for an underlying distribution from a given data set when the data is incomplete or has missing/hidden values. Its relevance to the current scenario is that optimizing the likelihood function for Gaussian mixture densities is analytically intractable, but the likelihood function can be simplified by assuming the existence and the values of additional but missing parameters.
As before, we assume the observed data $X$ to be generated by some distribution, but this time we refer to it as incomplete data [7]. We assume that there is a set of complete data, $Z = (X, Y)$, and also assume a joint density:
$$p(\vec{z}|\Theta) = p(\vec{x}, \vec{y}|\Theta) = p(\vec{y}|\vec{x}, \Theta)\,p(\vec{x}|\Theta)$$
Now we can define the complete-data likelihood as:
$$L(\Theta|Z) = L(\Theta|X, Y) = p(X, Y|\Theta)$$
Note that, given $X$ and $\Theta$, this likelihood can be thought of as a random variable, as it is a function of the random variable $Y$. In this context, the likelihood of the previous section is generally referred to as the incomplete-data likelihood.
The EM algorithm consists of two steps, the E-step and the M-step. In the E-step we calculate the expected value of the complete-data log likelihood with respect to $Y$, given the observed data $X$ and the current parameters. This quantity can be expressed as:
$$Q(\Theta, \Theta^{(i-1)}) = E\left[\log p(X, Y|\Theta)\,\middle|\,X, \Theta^{(i-1)}\right] \tag{3.3}$$
where $\Theta^{(i-1)}$ are the current parameter estimates that we use to evaluate the expectation and $\Theta$ are the new parameters that we optimize to increase $Q$.
The key aspects to understand in this equation are the following:
$X, \Theta^{(i-1)}$: constants
$\Theta$: variable that we wish to adjust
$Y$: random variable with distribution $f(\vec{y}|X, \Theta^{(i-1)})$
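Therefore, the right-hand side of equation 3.3 can be rewritten as in [7]:
$$E\left[\log p(X, Y|\Theta)\,\middle|\,X, \Theta^{(i-1)}\right] = \int_{\vec{y}\in\Upsilon} \log p(X, \vec{y}|\Theta)\, f(\vec{y}|X, \Theta^{(i-1)})\, d\vec{y} \tag{3.4}$$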
Note that $f(\vec{y}|X, \Theta^{(i-1)})$ is the marginal distribution of the unobserved data and is dependent on the observed data $X$ and the current parameters $\Theta^{(i-1)}$. $\Upsilon$ is the space of possible values of $\vec{y}$. This function can be maximized with respect to $\Theta$.
Conceptually, the main idea behind this expression is embedded in the meaning of the arguments of the function $Q(\Theta, \Theta')$. The first argument $\Theta$ corresponds to the parameters that will be optimized to maximize the likelihood. The second argument $\Theta'$ corresponds to the parameters used to evaluate the function $f(\vec{y}|X, \Theta^{(i-1)})$, which makes it possible to calculate the expectation.
As mentioned, the second step of the EM algorithm is called the M-step. In this step the expectation $Q(\Theta, \Theta')$ computed in the first step is maximized. That is, we find:
$$\Theta^{(i)} = \underset{\Theta}{\arg\max}\; Q(\Theta, \Theta^{(i-1)})$$
Each iteration of these steps is guaranteed not to decrease the log likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function [8]. The algorithmic details of the steps depend strongly on the particular application. Let us discuss the case of Gaussian mixtures.
3.1.3 Maximum likelihood in mixture densities
In the mixture-density parameter estimation problem we assume the following probabilistic model (see 3.1):²
$$p(\vec{x}|\Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\vec{x}|\theta_i)$$
where the parameters are $\Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M)$. As before, $\sum_{i=1}^{M}\alpha_i = 1$ and each $p_i$ is a multivariate Gaussian density function parametrized by $\theta_i$. We consider $M$ component densities mixed together with weights $\alpha_i$.
² Notation and text coherence of this section as in [Bilmes, 1998]
If we try to fit the parameters $\alpha_i, \theta_i$ by solving the maximum likelihood problem directly, we face an intractable problem. If we compute the incomplete-data log likelihood for this probability density and data $X$, the resulting expression is:
$$\log L(\Theta|X) = \log\prod_{i=1}^{N} p(x_i|\Theta) = \sum_{i=1}^{N}\log\left(\sum_{j=1}^{M}\alpha_j\, p_j(x_i|\theta_j)\right)$$
As can be seen from this equation, the presence of a sum within a logarithm makes this expression difficult to optimize. As mentioned, this problem can be simplified by assuming the existence of unobserved data items $Y = \{y_i\}_{i=1}^{N}$, one per element of $X$. Let us assume that $y_i \in \{1, \dots, M\}$ for each $i$, and that $y_i = k$ if the $i$-th sample was generated by the $k$-th mixture component. Assuming that we know the values of $Y$, the complete-data log likelihood becomes:
$$\log L(\Theta|X, Y) = \log P(X, Y|\Theta) = \sum_{i=1}^{N}\log\big(P(x_i|y_i)P(y_i)\big) = \sum_{i=1}^{N}\log\big(\alpha_{y_i}\, p_{y_i}(x_i|\theta_{y_i})\big) \tag{3.5}$$
given that the $\alpha_j$ parameters can be thought of as an a priori distribution for the variables $y_i$.
This expression can be optimized using a variety of techniques. The problem is that the values of $Y$ are not known, but by treating them as random variables one can use the EM technique to estimate the expected value of 3.5 and maximize it.
Recall that the E-step of the EM algorithm consists of calculating the expected value of the complete-data log likelihood, given $X$ and some assumed values of the parameters, $\Theta^g$, also referred to as guesses. The following expression rewrites 3.4 for the current case:
$$Q(\Theta, \Theta^g) = \sum_{\vec{y}\in\Upsilon}\log\big(L(\Theta|X, \vec{y})\big)\, p(\vec{y}|X, \Theta^g) \tag{3.6}$$
Here, the distribution of the unobserved data remains unknown. But assuming some values for the guesses $\Theta^g = (\alpha_1^g, \dots, \alpha_M^g, \theta_1^g, \dots, \theta_M^g)$, and keeping in mind the interpretation of the $\alpha$'s as a priori probabilities, one can use Bayes' rule to obtain:
$$p(y_i|x_i, \Theta^g) = \frac{\alpha_{y_i}^g\, p_{y_i}(x_i|\theta_{y_i}^g)}{\sum_{k=1}^{M}\alpha_k^g\, p_k(x_i|\theta_k^g)}$$
And hence:
$$p(\vec{y}|X, \Theta^g) = \prod_{i=1}^{N} p(y_i|x_i, \Theta^g)$$
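The posterior probabilities $p(l|x_i, \Theta^g)$, often called responsibilities, are the only quantities the E-step needs to produce for a Gaussian mixture. The sketch below computes them for all samples and components at once; it is an illustration with hypothetical names, assuming full covariance matrices and working in the log domain to avoid underflow.

```python
import numpy as np

def responsibilities(x, weights, means, covs):
    """E-step for a GMM: r[i, l] = p(l | x_i, Theta^g) via Bayes' rule."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    N, d = x.shape
    log_r = np.empty((N, len(weights)))
    for l, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = x - np.asarray(mu, dtype=float)
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
        # log(alpha_l) + log p_l(x_i | theta_l): numerator of Bayes' rule
        log_r[:, l] = np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)
    # Normalizing each row divides by the sum over components (the denominator).
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```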
Replacing this result into 3.6, expanding the expression and then simplifying as in [7], chapter 3, one obtains a fairly simple equation for $Q(\Theta, \Theta^g)$ that can be written as:
$$Q(\Theta, \Theta^g) = \sum_{l=1}^{M}\sum_{i=1}^{N}\log(\alpha_l)\, p(l|x_i, \Theta^g) + \sum_{l=1}^{M}\sum_{i=1}^{N}\log\big(p_l(x_i|\theta_l)\big)\, p(l|x_i, \Theta^g) \tag{3.7}$$
In this equation, whenever $l$ appears as an argument of a probability density, it means $y_i = l$.
Now that we have completed the E-step, the procedure that follows is to perform the M-step and maximize 3.7 with respect to the parameters $\alpha_l, \theta_l$. We can maximize the term containing $\alpha_l$ and then the term containing $\theta_l$ independently, since they are not related. Let us start with the weights $\alpha_l$.
To find the expression for $\alpha_l$, we introduce the Lagrange multiplier $\lambda$ with the constraint that $\sum_l \alpha_l = 1$ and solve the following equation:
$$\frac{\partial}{\partial\alpha_l}\left[\sum_{l=1}^{M}\sum_{i=1}^{N}\log(\alpha_l)\, p(l|x_i, \Theta^g) + \lambda\Big(\sum_{l}\alpha_l - 1\Big)\right] = 0$$
or
$$\sum_{i=1}^{N}\frac{1}{\alpha_l}\, p(l|x_i, \Theta^g) + \lambda = 0$$
Multiplying by $\alpha_l$ and summing both sides over $l$, we get $\lambda = -N$, resulting in:
$$\alpha_l = \frac{1}{N}\sum_{i=1}^{N} p(l|x_i, \Theta^g)$$
This expression gives us the maximum complete-data likelihood estimate. It delivers $M$ such estimates, each corresponding to one weight in the Gaussian mixture 3.1. More importantly, it delivers a better estimate than the ones implicitly contained in $\Theta^g$.
Similar procedures are conducted to obtain equivalent expressions for $\theta_l$. For the detailed calculation refer to [7]. In our case, the distributions are 13-dimensional Gaussian components with mean $\mu$ and covariance matrix $\Sigma$, i.e., $\theta = (\mu, \Sigma)$, so that:
$$p_l(x|\mu_l, \Sigma_l) = \frac{1}{(2\pi)^{13/2}|\Sigma_l|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_l)^T\Sigma_l^{-1}(x - \mu_l)\right) \tag{3.8}$$
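The estimates of the new parameters $\mu_l, \Sigma_l$ in terms of the old parameters are as follows:
$$\mu_l^{new} = \frac{\sum_{i=1}^{N} x_i\, p(l|x_i, \Theta^g)}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)}$$
$$\Sigma_l^{new} = \frac{\sum_{i=1}^{N} p(l|x_i, \Theta^g)\,(x_i - \mu_l^{new})(x_i - \mu_l^{new})^T}{\sum_{i=1}^{N} p(l|x_i, \Theta^g)}$$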
The above equations perform both the expectation step and the maximization step simultaneously. As noted before, the algorithm proceeds by using the newly derived parameters as the guess for the next iteration.
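The update equations for the weights, means and covariances can be written compactly once the posteriors $p(l|x_i, \Theta^g)$ are available as a matrix. The following sketch assumes such a matrix `resp` of shape (N, M) is given; the function name is hypothetical and no safeguard against empty or degenerate components is included.

```python
import numpy as np

def m_step(x, resp):
    """GMM parameter updates from posteriors resp[i, l] = p(l | x_i, Theta^g)."""
    x = np.asarray(x, dtype=float)
    resp = np.asarray(resp, dtype=float)
    N, d = x.shape
    M = resp.shape[1]
    Nl = resp.sum(axis=0)               # sum_i p(l | x_i, Theta^g), shape (M,)
    weights = Nl / N                    # alpha_l
    means = (resp.T @ x) / Nl[:, None]  # mu_l^new
    covs = np.empty((M, d, d))
    for l in range(M):
        diff = x - means[l]
        covs[l] = (resp[:, l, None] * diff).T @ diff / Nl[l]  # Sigma_l^new
    return weights, means, covs
```

Alternating the computation of the posteriors with this update until the log likelihood stops improving gives the complete training loop for one GMM.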
4 SUPPORT VECTOR MACHINES
To understand Support Vector Machines (SVM), one must understand the mathematical framework that gives birth to this technique. To this end, let us briefly recall the very common hyperplane classifiers and derive from there the main ideas behind Support Vector Machines.
It is worth mentioning that the main idea behind this method is to model the boundary between classes, as opposed to the GMM approach of the last chapter, which models the probability distributions of the classes.
4.1 Hyperplane Classifiers
Hyperplane classifiers are based on the class of hyperplanes, defined in a dot product space $H$:
$$\langle\vec{w}, \vec{x}\rangle + b = 0, \qquad \vec{w}\in H,\ b\in\mathbb{R}, \tag{4.1}$$
corresponding to decision functions:
$$y = f(\vec{x}) = \operatorname{sign}(\langle\vec{w}, \vec{x}\rangle + b) \tag{4.2}$$
The learning algorithm to find 4.2, proposed for problems which are separable by hyperplanes, is based on two facts. First, among all hyperplanes separating the data one can find a unique optimal hyperplane with maximum margin of separation between any training point and the hyperplane [9]. Given $m$ patterns $\vec{x}_i$ and their corresponding labels $y_i$, this hyperplane is the solution of:
$$\underset{\vec{w}\in H,\, b\in\mathbb{R}}{\text{maximize}}\ \min\{\|\vec{x} - \vec{x}_i\| : \vec{x}\in H,\ \langle\vec{w}, \vec{x}\rangle + b = 0,\ i = 1, \dots, m\} \tag{4.3}$$
Second, the capacity of the class of separating hyperplanes decreases with increasing margin. Hence there are theoretical arguments supporting the good generalization performance of the optimal hyperplane. See [9], chapters 5, 7, 12.
Solving 4.3 is equivalent to minimizing the norm of the vector $\vec{w}$; the following formulation summarizes this:
$$\underset{\vec{w}\in H,\, b\in\mathbb{R}}{\text{minimize}}\ \tau(\vec{w}) = \frac{1}{2}\|\vec{w}\|^2 \tag{4.4}$$
$$\text{subject to } (\langle\vec{w}, \vec{x}_i\rangle + b)\,y_i \ge 1 \text{ for all } i = 1, \dots, m. \tag{4.5}$$
It can be shown that the optimal hyperplane can be uniquely constructed by solving a constrained quadratic optimization problem [10] and also that the solution $\vec{w}$ has an expansion $\vec{w} = \sum_i v_i\vec{x}_i$ in terms of a subset of training patterns that lie on the margin.
Notice that the function $\tau$ in equation 4.4 can be understood as an objective function, while 4.5 constitutes an inequality constraint. Together they form a constrained optimization problem. This problem can be dealt with by introducing a Lagrangian formulation:
$$L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{m}\alpha_i\big((\langle\vec{x}_i, \vec{w}\rangle + b)\,y_i - 1\big) \tag{4.6}$$
The Lagrangian has to be minimized with respect to the primal variables $\vec{w}$ and $b$, and maximized with respect to the dual variables $\alpha_i$. A saddle point has to be found.
Introducing the Karush-Kuhn-Tucker (KKT) conditions to guarantee optimality, one can also simplify the solution and learn some additional restrictions that narrow it. For example, the KKT conditions state that, at the saddle point, the derivatives of $L$ with respect to the primal variables must vanish:
$$\frac{\partial}{\partial b}L(\vec{w}, b, \vec{\alpha}) = 0 \quad\text{and}\quad \frac{\partial}{\partial\vec{w}}L(\vec{w}, b, \vec{\alpha}) = 0 \tag{4.7}$$
Applying these conditions to 4.6 leads to:
$$\sum_{i=1}^{m}\alpha_i y_i = 0 \tag{4.8}$$
and
$$\vec{w} = \sum_{i=1}^{m}\alpha_i y_i\vec{x}_i \tag{4.9}$$
As mentioned, equation 4.9 shows that the solution vector $\vec{w}$ has an expansion in terms of the training patterns with non-zero $\alpha_i$. These patterns are called Support Vectors (SVs). The KKT conditions also imply that the SVs lie on the margin [9]. All remaining training data $(\vec{x}_i, y_i)$ become irrelevant. See figure 4.
Substituting 4.8 and 4.9 into the Lagrangian 4.6 we arrive at the dual optimization problem, which is the problem actually solved in practice:
$$\underset{\vec{\alpha}\in\mathbb{R}^m}{\text{maximize}}\ W(\vec{\alpha}) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\langle\vec{x}_i, \vec{x}_j\rangle \tag{4.10}$$
$$\text{subject to } \alpha_i \ge 0 \text{ for all } i = 1, \dots, m \text{ and } \sum_{i=1}^{m}\alpha_i y_i = 0.$$
Figure 4: Hyperplane Representation. Only the patterns that lie on the margin matter. [11]
Also, substituting the KKT results into the decision function 4.2 we obtain:
$$f(\vec{x}) = \operatorname{sgn}\left(\sum_{i=1}^{m} y_i\alpha_i\langle\vec{x}, \vec{x}_i\rangle + b\right) \tag{4.11}$$
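Equation 4.11 can be evaluated without ever forming $\vec{w}$ explicitly. The sketch below assumes the support vectors, their labels, the multipliers $\alpha_i$ and the threshold $b$ have already been obtained by some training procedure; the function name is hypothetical.

```python
import numpy as np

def linear_svm_decision(x, support_vectors, labels, alphas, b):
    """Evaluate sgn(sum_i y_i alpha_i <x, x_i> + b), as in equation 4.11."""
    x = np.asarray(x, dtype=float)
    sv = np.asarray(support_vectors, dtype=float)
    coeffs = np.asarray(alphas, dtype=float) * np.asarray(labels, dtype=float)
    # sv @ x gives the dot product of x with every support vector at once.
    return np.sign(coeffs @ (sv @ x) + b)
```

Only dot products between the input and the support vectors appear in the computation, which is the property exploited when moving to kernels.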
This formalism has a crucial property that needs to be emphasized: both the quadratic optimization problem and the final decision function 4.11 depend only on dot products between patterns. This feature is precisely what lets us generalize to the nonlinear case.
4.2 Support Vector Classifiers
We now have all the tools to describe SVMs. The basic idea of this technique is to map the original data, presumably in $\mathbb{R}^N$, into some other dot product space $F$ (called the feature space) via a nonlinear map:
$$\phi : \mathbb{R}^N \to F \tag{4.12}$$
and to apply the above linear classifier (section 4.1) in $F$. To do this, we would like to write the dot product of feature vectors $\phi(x_i)$ in terms of input patterns $x_i$ using a kernel function $k$:
$$k(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle \tag{4.13}$$
Clearly, if $F$ is high dimensional, the right-hand side of equation 4.13 will be very expensive to compute. In some cases, however, there is a simple kernel $k$ that can be evaluated efficiently.
Given this, one can define an SVM as a binary classifier constructed from sums of a kernel function $k(\cdot, \cdot)$, which implements a dot product in some higher dimensional space [11]:
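$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{m} y_i\alpha_i\langle\phi(x), \phi(x_i)\rangle + b\right) = \operatorname{sgn}\left(\sum_{i=1}^{m}\alpha_i y_i\, k(\vec{x}, \vec{x}_i) + b\right) \tag{4.14}$$
where the $y_i$ are the ideal outputs, $\sum_{i=1}^{m}\alpha_i y_i = 0$, and $\alpha_i > 0$. In this expression the vectors $x_i$ are the support vectors from the training data.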
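As an illustration of equation 4.14, the sketch below uses a Gaussian radial basis function kernel, the kernel family named in the abstract. The parametrization $k(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$, the default $\sigma = 1$, the function names and the choice of returning +1 when the argument of sgn is exactly zero are assumptions made for this example.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 sigma^2)) (assumed form)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def kernel_svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate sgn(sum_i alpha_i y_i k(x, x_i) + b), as in equation 4.14."""
    total = sum(a * y * kernel(x, sv)
                for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if total + b >= 0 else -1
```

The cost of one prediction grows linearly with the number of support vectors, since the kernel has to be evaluated against each of them.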
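The decision function from equation 4.14 is found by solving a quadratic program; specifically, an optimization problem very similar to 4.10 but with the dot product implemented in the space $F$:
$$\underset{\vec{\alpha}\in\mathbb{R}^m}{\text{maximize}}\ W(\vec{\alpha}) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\, k(x_i, x_j) \tag{4.15}$$
$$\text{subject to } \alpha_i \ge 0 \text{ for all } i = 1, \dots, m \text{ and } \sum_{i=1}^{m}\alpha_i y_i = 0.$$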
4.2.1 Soft Margin Hyperplane
In practice, a perfect separating hyperplane may not exist. To allow SVMs to be applied to more general situations in which 4.5 does not hold, we introduce slack variables:
$$\xi_i \ge 0 \text{ for all } i = 1, \dots, m. \tag{4.16}$$
These variables relax the constraint 4.5:
$$(\langle\vec{w}, \vec{x}_i\rangle + b)\,y_i \ge 1 - \xi_i \text{ for all } i = 1, \dots, m. \tag{4.17}$$
Given this, we look for a classifier that minimizes not only the magnitude of $\vec{w}$ but also the magnitude of the slack variables, i.e., the following quantity:
$$\tau(\vec{w}, \vec{\xi}) = \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{m}\xi_i \tag{4.18}$$
subject to 4.16 and 4.17. In this expression the constant $C$ is a positive real value that determines the trade-off between margin maximization and training error minimization. This again leads to 4.15, but the constraints change to:
$$0 \le \alpha_i \le C \text{ for all } i = 1, \dots, m \text{ and } \sum_{i=1}^{m}\alpha_i y_i = 0.$$
The constant $C$ is sometimes referred to as the box constraint, since it confines each $\alpha_i$ to the interval $[0, C]$, so that the multipliers lie inside a box of side $C$. Other than that, the problem is still quadratic and very similar to the previous one.
The main issue of the SVM approach is to calculate the decision function and find the support vectors, i.e., the non-vanishing Lagrange multipliers of equation 4.15. These parameters determine the structure of the decision function and hence define the classifier hyperplane. To calculate these parameters, the training strategy used here is based on an implementation of the Sequential Minimal Optimization (SMO) technique.
4.3 Sequential Minimal Optimization (SMO)
As mentioned in the last section, training a Support Vector Machine (SVM) requires the solution of a quadratic programming (QP) optimization problem. This problem is usually very large and requires special methods. SMO breaks this large problem into a series of the smallest possible QP problems, which are solved using an analytical QP step instead of inner QP optimization loops. Depending on the type of database, SMO can be as fast as or faster than traditional procedures such as the projected conjugate gradient (PCG) chunking algorithm, sometimes as much as 1000 times faster [12].
Unlike previous methods, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem, the smallest possible optimization problem involves two Lagrange multipliers. At every iteration SMO chooses precisely two Lagrange multipliers to optimize jointly, finds the optimal values for these multipliers and updates the SVM to reflect the new optimal values. This feature represents a great advantage, since solving for two Lagrange multipliers can be done analytically, which saves the computational effort otherwise spent in numerical inner iterations.
There are three components to SMO. The first component is analytical and concerns solving for the two Lagrange multipliers. The second is heuristic and gives the rules for choosing which multipliers to optimize. Finally, there is a method for computing the constant $b$ in the decision function 4.14.
4.3.1 Part I : Analytical step : Solving for two Lagrange Multipliers
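Let us recall the dual formulation involved in solving an SVM with soft margin:
$$\underset{\vec{\alpha}\in\mathbb{R}^m}{\text{maximize}}\ W(\vec{\alpha}) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j\, k(x_i, x_j) \tag{4.19}$$
$$\text{subject to } 0 \le \alpha_i \le C \text{ for all } i = 1, \dots, m \text{ and } \sum_{i=1}^{m}\alpha_i y_i = 0.$$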
The bound constraints in this expression, also referred to as box constraints, cause the Lagrange multipliers to lie within a box. The linear equality constraint, on the other hand, causes the Lagrange multipliers to lie on a diagonal line. SMO computes these constraints and then solves for the constrained maximum.
Together, these constraints imply that the constrained maximum of the objective function must lie on a diagonal line segment. This explains why two is the minimum number of Lagrange multipliers that can be optimized: a single multiplier could not change and still fulfil the linear equality constraint at every step.
Say we pick two Lagrange multipliers based on some heuristics (details on how to pick them are explained in the following section). For simplicity, let us label them $\alpha_1$ and $\alpha_2$, and label all the quantities related to these multipliers in the same fashion.
The SMO algorithm first computes the second Lagrange multiplier $\alpha_2$ (it could just as well start with $\alpha_1$, without loss of generality) and computes the ends of the diagonal line segment in terms of this multiplier.
To find the location of the unconstrained maximum of the objective function in equation 4.19, we use the procedure described in [12], chapter 12, section 7. This maximum is found analytically while allowing only two Lagrange multipliers to change, and it is based on some initial values for the $\alpha_i$, referred to as $\alpha_i^{old}$. The resulting expression for the updated $\alpha_2$ is:

$$\alpha_2^{new} = \alpha_2^{old} - \frac{y_2 (E_1 - E_2)}{\beta} \tag{4.20}$$

where $E_i = f^{old}(\vec{x}_i) - y_i$. In this expression $f^{old}$ refers to the real-valued SVM output (the argument of the sign function in the decision function) built from the old parameters:

$$f^{old}(\vec{x}) = \sum_{i=1}^{m} y_i \alpha_i^{old} k(\vec{x}, \vec{x}_i) + b$$
And $\beta$ is short for:

$$\beta = 2k(\vec{x}_1, \vec{x}_2) - k(\vec{x}_1, \vec{x}_1) - k(\vec{x}_2, \vec{x}_2)$$
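Once $\alpha_2^{new}$ is evaluated, the following bounds are applied. If target $y_1$ does not equal target $y_2$, we apply the bounds:

$$L = \max(0, \alpha_2^{old} - \alpha_1^{old}), \qquad H = \min(C, C + \alpha_2^{old} - \alpha_1^{old}) \tag{4.22}$$

If target $y_1$ equals target $y_2$, then the following bounds apply to $\alpha_2$:

$$L = \max(0, \alpha_1^{old} + \alpha_2^{old} - C), \qquad H = \min(C, \alpha_1^{old} + \alpha_2^{old}) \tag{4.23}$$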
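Finally, the constrained maximum is found by clipping the unconstrained maximum $\alpha_2^{new}$ to the ends of the line segment:

$$\alpha_2^{new,clipped} = \begin{cases} H, & \text{if } \alpha_2^{new} \ge H \\ \alpha_2^{new}, & \text{if } L < \alpha_2^{new} < H \\ L, & \text{if } \alpha_2^{new} \le L \end{cases}$$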
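The value of $\alpha_1$ is computed from the clipped $\alpha_2$:

$$\alpha_1^{new} = \alpha_1^{old} + s \, (\alpha_2^{old} - \alpha_2^{new,clipped}) \tag{4.24}$$

where $s = y_1 y_2$. With this procedure, SMO moves the two Lagrange multipliers to the point on the line segment with the highest value of the objective function.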
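The analytical step of equations 4.20 to 4.24 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function name `smo_pair_update` and its argument names are hypothetical, the kernel values are assumed to be precomputed, and the degenerate cases (non-negative β, or L equal to H) are simply skipped.

```python
def smo_pair_update(a1, a2, y1, y2, E1, E2, k11, k12, k22, C):
    """One analytical SMO step for a pair of multipliers (eqs. 4.20-4.24).

    a1, a2 are the old multipliers, E1, E2 the cached errors f(x_i) - y_i,
    and k11, k12, k22 the kernel values k(x1,x1), k(x1,x2), k(x2,x2).
    Returns the new pair (a1_new, a2_clipped).
    """
    # beta = 2 k12 - k11 - k22; it is negative in the usual case
    beta = 2.0 * k12 - k11 - k22
    if beta >= 0.0:
        return a1, a2  # degenerate case, skipped in this sketch

    # Ends of the diagonal line segment (eqs. 4.22 and 4.23)
    if y1 != y2:
        L = max(0.0, a2 - a1)
        H = min(C, C + a2 - a1)
    else:
        L = max(0.0, a1 + a2 - C)
        H = min(C, a1 + a2)
    if L >= H:
        return a1, a2  # no room to move along the segment

    # Unconstrained maximum (eq. 4.20), then clipping to [L, H]
    a2_new = a2 - y2 * (E1 - E2) / beta
    a2_clipped = min(H, max(L, a2_new))

    # The first multiplier follows from the equality constraint (eq. 4.24)
    s = y1 * y2
    a1_new = a1 + s * (a2 - a2_clipped)
    return a1_new, a2_clipped


# Two orthogonal points of opposite class, all multipliers and b at zero,
# so that E1 = 0 - (+1) = -1 and E2 = 0 - (-1) = +1.
print(smo_pair_update(0.0, 0.0, +1, -1, -1.0, 1.0, 1.0, 0.0, 1.0, C=1.0))
```

In this toy case both multipliers move to 1, and the equality constraint $\sum_i \alpha_i y_i = 0$ still holds after the step; with a smaller C the update would be clipped at the box.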
4.3.2 Part II : Heuristics : Choosing two Lagrange multipliers
In order to speed up convergence, the SMO algorithm uses heuristics to choose which two Lagrange multipliers to optimize jointly. There are two heuristic rules, one for each Lagrange multiplier. The first multiplier is chosen by an outer loop that determines which training examples violate the following KKT conditions:

$$\begin{aligned} \alpha_i = 0 &\Rightarrow y_i f(\vec{x}_i) \ge 1, \\ 0 < \alpha_i < C &\Rightarrow y_i f(\vec{x}_i) = 1, \\ \alpha_i = C &\Rightarrow y_i f(\vec{x}_i) \le 1. \end{aligned} \tag{4.25}$$
The first KKT violator found by this outer loop is chosen as the first multiplier (remember that there’s a one-to-one relation between the training set and the Lagrange multipliers). A second multiplier is selected using a second heuristic rule. The two
multipliers are jointly optimized and the SVM is then updated. The process resumes by looking for further KKT violators.
The outer loop passes repeatedly over the training examples until every example obeys the KKT conditions within a tolerance $\epsilon$, which is typically set in the range $10^{-2}$ to $10^{-3}$ [12]. At that point, the algorithm terminates. For reasons of computational efficiency, the loop does not always run over all the training data: some passes are made only over the examples that are most likely to violate the KKT conditions, namely those whose Lagrange multipliers are neither 0 nor C, also called non-bound examples.
The choice of the second multiplier is based on equation 4.20. SMO keeps a cached error value E for every non-bound example and chooses $\alpha_2$ such that the corresponding $E_2$ maximizes the numerator term $|E_1 - E_2|$. In short, the second Lagrange multiplier is chosen to maximize the step taken during joint optimization:

$$\alpha_2^{*} = \arg\max_{\alpha_i} |E_1 - E_i(\alpha_i)| \tag{4.26}$$
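The two heuristic rules can be illustrated with the following Python sketch. It is a simplified reading of the conditions 4.25 and of equation 4.26, under the assumption that the SVM outputs and the cached errors are already available; the names `violates_kkt` and `choose_second` are hypothetical.

```python
def violates_kkt(alpha, y, f, C, eps=1e-3):
    """True if an example violates the KKT conditions (4.25) within eps.

    alpha is the multiplier of the example, y its label (+1 or -1) and
    f the current real-valued SVM output for it.
    """
    r = y * f - 1.0  # zero exactly on the margin
    if alpha < C and r < -eps:
        return True  # inside the margin, but the multiplier is not at C
    if alpha > 0 and r > eps:
        return True  # beyond the margin, but the multiplier is not at 0
    return False


def choose_second(i1, errors, non_bound):
    """Index of the non-bound example maximizing |E1 - Ei| (eq. 4.26)."""
    candidates = [i for i in non_bound if i != i1]
    if not candidates:
        return None
    return max(candidates, key=lambda i: abs(errors[i1] - errors[i]))


errors = [0.5, -1.2, 0.3, 2.0]  # toy cached errors, not real data
print(violates_kkt(alpha=0.0, y=+1, f=0.5, C=1.0))
print(choose_second(0, errors, non_bound=[1, 2, 3]))
```

The first call flags a violation, since the example has $y f(\vec{x}) < 1$ while its multiplier is still zero; the second picks the example whose error is farthest from $E_1$.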
4.3.3 Part III : threshold : Determine the constant b
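The procedure described above does not determine the threshold b of the SVM, so it must be computed separately; b is re-computed at each iteration. Two equations are used to correct b, depending on the case. They are written here for the convention used above, in which the threshold enters the SVM output as $+b$; Platt [12] writes the output with $-b$, which reverses the signs of the correction terms. When the new $\alpha_1$ is not at the bounds, we apply the correction:

$$b^{new} = b^{old} - E_1 - y_1 (\alpha_1^{new} - \alpha_1^{old}) k(\vec{x}_1, \vec{x}_1) - y_2 (\alpha_2^{new,clipped} - \alpha_2^{old}) k(\vec{x}_1, \vec{x}_2) \tag{4.27}$$

If the new $\alpha_2$ is not at the bounds, the corresponding equation is:

$$b^{new} = b^{old} - E_2 - y_1 (\alpha_1^{new} - \alpha_1^{old}) k(\vec{x}_1, \vec{x}_2) - y_2 (\alpha_2^{new,clipped} - \alpha_2^{old}) k(\vec{x}_2, \vec{x}_2) \tag{4.28}$$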
When both equations are valid, they are equal. When both new Lagrange multipliers are at a bound and L is not equal to H, all the values in the interval between 4.27 and 4.28 are thresholds consistent with the KKT conditions; SMO chooses the value halfway between them. Pseudo-code for this algorithm is shown in [12], chapter 12, section 3.
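The threshold correction can be sketched as follows. The sketch assumes that the threshold enters the SVM output with a positive sign, $f(\vec{x}) = \sum_i y_i \alpha_i k(\vec{x}, \vec{x}_i) + b$; under the opposite convention (output written with $-b$, as in [12]) the signs of the correction terms flip. The function `update_threshold` and its arguments are hypothetical names chosen for illustration.

```python
def update_threshold(b_old, E1, E2, y1, y2, d1, d2,
                     k11, k12, k22, a1_new, a2_new, C):
    """Threshold after a joint step, for an output of the form sum(...) + b.

    d1 and d2 are the changes (new minus old) of the two multipliers.
    """
    # Value of b that makes the output for x1 equal to y1
    b1 = b_old - E1 - y1 * d1 * k11 - y2 * d2 * k12
    # Value of b that makes the output for x2 equal to y2
    b2 = b_old - E2 - y1 * d1 * k12 - y2 * d2 * k22
    if 0.0 < a1_new < C:
        return b1
    if 0.0 < a2_new < C:
        return b2
    # Both multipliers at a bound: take the halfway point
    return 0.5 * (b1 + b2)


# Same toy pair as before: both multipliers move from 0 to 1 with C = 2,
# so neither is at a bound and both candidate thresholds coincide.
b = update_threshold(0.0, -1.0, 1.0, +1, -1, 1.0, 1.0,
                     1.0, 0.0, 1.0, a1_new=1.0, a2_new=1.0, C=2.0)
print(b)
```

With the resulting threshold the outputs for the two points are exactly +1 and −1, which is what the KKT conditions require for non-bound multipliers.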
Part II
5 Objectives
5.1 General Objectives
• Build a Language Identifier that tells apart English and French using Machine Learning Techniques.
5.2 Specific Objectives
• Understand the underlying theory concerning the problem of Language Identification.
• Learn models and algorithms related to the framework of Support Vector Machines (SVM).
• Learn Machine Learning techniques applied to LID systems.
• Conduct proper feature extraction on a speech corpus of the English and French languages.
• Use specialized libraries that facilitate and optimize machine learning algorithms within the context of Language Identification.
• Build an end-point or user-friendly interface that implements the service of language identification.
Part III
6 Database Selection and Characterization
We summarize the specific steps taken to conduct database selection and characterization. As mentioned, the database was downloaded from the speech corpus of Voxforge.¹ Voxforge is an organization dedicated to collecting speech samples of diverse languages through voluntary donations of utterances made by internet users. Each language corpus consists of the following directories:
• 16kHz−16bits
• 32kHz−16bits
• 44.1kHz−16bits
• 48kHz−16bits
• 8kHz−16bits
These directory names indicate the sample rate and the encoding with which the recordings were made. Different directories contain different sets of data (meaning different speakers). For both the French and the English corpus, the 48 kHz directory turns out to be the one with the most recordings, so all the data was taken at this sample rate.
An equivalent number of recordings was downloaded for both languages. In this case the French database had the lower number of files. The resulting sets are as follows:
Table 1: Training Database
Aspect                      English         French
Sample rate                 48 kHz          48 kHz
No. of speakers             215             219
Audio files per speaker     32              31
Average duration            7 s             7 s
No. of files                7000            7000
Cepstral per audio          13 x 500        13 x 500
Delta-cepstral per audio    13 x 500        13 x 500
Training set                13 x 6542014    13 x 7787422
¹ Specifically taken from: http://www.voxforge.org/home/downloads
Table 1 summarizes the main features of the training database. Some explanation is needed to clarify the concepts of "Cepstral per audio" and "Delta-cepstral per audio". These fields of the table give the average number of cepstral and delta-cepstral 13-vectors that result from pre-processing each audio recording. This result is valid when the windowing used for MFCC extraction has a window width of 250 ms.
Pre-processing results in a database on the order of $10^6$ feature vectors. Similar results hold for the testing data. This additional database is used to test the resulting performance of the learning algorithms. The features of the testing database are shown in table 2.
Table 2: Testing Database
Aspect                      English         French
Sample rate                 48 kHz          48 kHz
No. of speakers             92              94
Audio files per speaker     32              31
Average duration            7 s             7 s
No. of files                3000            3000
Cepstral per audio          13 x 500        13 x 500
Delta-cepstral per audio    13 x 500        13 x 500
Testing set                 13 x 2927328    13 x 3105082
Note that the complete database, counting both training and testing recordings, consists of 10 thousand audio files per language, corresponding to nearly 300 speakers per language. Also notice that 70% of the available data is used for training and 30% is used for testing. This database is modest, but it will be seen that it suffices for solving the problem at hand. For example, in the case of SVM, only a subset of the training data is actually used for training.
7 Feature Extraction: MFCC and Delta Features
As mentioned in the last section, the database originally consisted of 20 thousand audio files: 10 thousand correspond to English speakers and 10 thousand to French ones. These audio files were sampled at 48 kHz and encoded in .wav format at 16 bits. They were converted to feature vectors following the Mel-frequency cepstral coefficient extraction process explained in chapter 2. To perform this transformation we used a MATLAB library authored by Daniel P. W. Ellis [13]. This library consists of an implementation of the RASTA-PLP and MFCC feature extraction algorithms.
To compute the MFCC of a .wav file, the first step is to extract its samples. This is done using the wavread function in MATLAB. Once the samples are stored in a vector, they are passed as an argument to the function melfcc from Ellis's library, which transforms these samples into a matrix of around 500 column cepstral vectors of dimension 13. These matrices are stored in .txt files, one per .wav file. The parameters used to evaluate this function were set as follows:
Table 3: MFCC Parameters
Parameter   Value    Description
wintime     25 ms    Window length
hoptime     10 ms    Step between successive windows
numcep      13       Number of cepstral coefficients to return
fbtype      'mel'    Frequency warp
broaden     0        Flag to retain first and last bands
Table 3 shows the parameters ultimately used when training the classifier. Other schemes were tried with different values of wintime and hoptime. In particular, 0.01 second time frames with a hoptime of 0.04 were used, thus deriving the so-called centisecond mel-scale cepstral and delta-cepstral coefficients, as recommended in [3]. This arrangement of values ended up lowering the performance of the learning machines, so the values shown in the table were used instead. Finally, for each .wav file, the corresponding delta-cepstral coefficients were computed and stored. This computation was straightforward and followed equation 2.1.
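As an illustration of the delta computation, the sketch below derives delta vectors from a sequence of cepstral vectors. It assumes that the delta of a frame is the difference between its two neighbouring frames, with the first and last frames repeated at the edges; this is one common definition and may differ in detail from equation 2.1. The function name `delta_features` is hypothetical.

```python
def delta_features(frames):
    """Delta vectors for a list of cepstral vectors, one per time frame.

    Assumed definition: delta[t] = frames[t + 1] - frames[t - 1],
    repeating the first and last frames at the edges.
    """
    n = len(frames)
    deltas = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        deltas.append([b - a for a, b in zip(prev, nxt)])
    return deltas


# Three toy frames of dimension 2 (real frames have dimension 13)
cepstra = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
print(delta_features(cepstra))
```

The output has the same shape as the input, which matches the identical sizes reported for the cepstral and delta-cepstral matrices of each recording.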
8 Training and Testing: GMM and SVM
8.1 Gaussian Mixture Models: Training
Recall that under the GMM assumption all the instances of a particular class (say English) are assumed to be drawn randomly according to a probability density that is a weighted sum of multivariate Gaussian densities. This probability density is not given and has to be fit to the training data using some methodology, in this case the EM algorithm.
In MATLAB, an object of the gmdistribution class defines a Gaussian mixture distribution. Its attributes consist of the means, covariance matrices, number of components (k) and mixture weights necessary to completely define any given GMM. This object has a method called fit that implements the EM algorithm. Briefly, this method receives a dataset and returns the gmdistribution object with maximum likelihood for that data.
We train several GMMs: one for the Mel-frequency cepstral coefficients of the English training database, called gmc-english; one for the Mel-frequency delta-cepstral coefficients of the same database, called gmd-english; and the corresponding gmc-french and gmd-french for the French training database.
These distributions were calculated for different numbers of components k in the mixture, as will be shown in the following section. Other than that, the rest of the parameters remained the same throughout the simulations, as shown in the following table:
Table 4: gmdistribution.fit Parameters
Parameter   Value           Meaning
Start       'randSample'    Random initial conditions
CovType     'diagonal'      Covariances must be diagonal
SharedCov   'false'         Covariances not necessarily equal
MaxIter     200             Maximum number of iterations
In this table, Random initial conditions means that the distribution means are initialized with k random observations from the database X and that the covariance matrices all start out equal and diagonal, with the j-th diagonal element equal to the variance of X(:,j). MaxIter is set to 200 because the amount of data requires many iterations, and SharedCov is set to false because setting it to true, despite requiring less computational effort, more often gives rise to error messages such as ill-conditioned covariance matrix. The meaning of this error is explained in the gmdistribution.fit documentation.¹
8.2 Gaussian Mixture Models: Testing
To determine whether an utterance $U_1$ originates from English or from French, we first extract its MFCC using the function melfcc mentioned in chapter 7 and convert it into an array of 13-vectors $\{\vec{u}\}$. Next, using equation 3.2, we compute the log-likelihood that the given array belongs to English and to French, $L(\{\vec{u}\} \mid \lambda_{English})$ and $L(\{\vec{u}\} \mid \lambda_{French})$. We compare these quantities, and the language with the greatest likelihood is signalled as the originator of $U_1$.
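The decision rule above can be sketched in Python as follows. The sketch assumes diagonal covariances (as in table 4) and takes the log-likelihood of an utterance to be the sum over its frames of the log mixture density, which is the usual form and is assumed here to match equation 3.2. The function names and the toy one-dimensional models are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_loglik(frames, weights, means, variances):
    """Sum over frames of the log mixture density (log-sum-exp per frame)."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


def classify(frames, models):
    """Name of the model with the greatest log-likelihood for the frames."""
    return max(models, key=lambda name: gmm_loglik(frames, *models[name]))


# Toy one-dimensional, single-component models; NOT the trained GMMs
models = {
    "english": ([1.0], [[0.0]], [[1.0]]),
    "french": ([1.0], [[5.0]], [[1.0]]),
}
print(classify([[0.1], [-0.2]], models))
print(classify([[4.8], [5.3]], models))
```

Working with log-densities and the log-sum-exp trick avoids numerical underflow, which matters when an utterance contributes several hundred frames.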
To determine the performance of the GMM learning machine, we apply this procedure to all 6000 recordings of the testing database. First, we perform GMM-based classification on each of the recordings in the English testing database of size T. Each time a recording is incorrectly labelled as originating from French, we count an error in the discrete variable $c_f$, and in the end we compute:

$$Error_{English} = \frac{c_f}{T} \tag{8.1}$$
An identical procedure is applied to the French testing database, and $Error_{French}$ is computed. Since both testing databases are equal in size, these two errors are simply averaged to obtain our empirical error $Error_{Empirical}$. Performance tables for different k are shown in the results chapter.
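The error computation of equation 8.1 and the final average amount to the following small sketch; the function names are hypothetical and the predicted labels are toy values.

```python
def class_error(predictions, true_label):
    """Fraction of recordings of one language that were mislabelled (eq. 8.1)."""
    wrong = sum(1 for p in predictions if p != true_label)
    return wrong / len(predictions)


def empirical_error(error_english, error_french):
    """Average of the two per-language errors (equal-sized test sets)."""
    return 0.5 * (error_english + error_french)


# Toy predictions for four English recordings: one is mislabelled
print(class_error(["english", "french", "english", "english"], "english"))
```

A plain average is appropriate only because both testing databases contain the same number of recordings; with unequal sizes a weighted average would be needed.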
8.3 Support Vector Machines: Training
Training an SVM is not as straightforward. Unlike GMMs, support vector machines are binary classifiers based on hyperplanes, and they depend highly on the kernel type and on the separable nature of the data. Also, SVM training is not very scalable, so it is not viable to use the entire database of $7 \times 10^6$ cepstral vectors to train this type of classifier. Instead, a subset of 10 thousand random examples was used. Larger databases caused convergence problems when running the algorithms.
¹ gmdistribution.fit documentation: http://www.mathworks.com/help/stats/gmdistribution.fit.html
To train the SVM we use a function from the MATLAB Statistics Toolbox called svmtrain. This function trains an SVM based on some input data. The settings of this function are shown in table 5. As can be seen in this table, some kernel parameters are initialized randomly. At some point we will have to determine these parameters via an optimization process.
Table 5: svmtrain Parameters
Parameter          Value     Meaning
kernel_function    'rbf'     Gaussian radial basis function
rbf_sigma          random    RBF scaling factor, initial value
boxconstraint      random    Soft margin C, initial value
method             SMO       Optimization method
We use the Gaussian radial basis function as our kernel, based on previous approaches [?]. This type of kernel has the form:

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \tag{8.2}$$
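For reference, the Gaussian radial basis function kernel can be written directly in Python; `rbf_kernel` is a hypothetical name and the inputs are plain lists of floats.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian RBF kernel exp(-||x - z||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


print(rbf_kernel([0.0, 0.0], [0.0, 0.0], sigma=1.0))  # identical points
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], sigma=1.0))  # points at distance 1
```

The kernel equals 1 for identical points and decays towards 0 as the points move apart, with $\sigma$ controlling how fast. Since $k(\vec{x}, \vec{x}) = 1$ for every point, the quantity $\beta$ of the SMO analytical step reduces to $2k(\vec{x}_1, \vec{x}_2) - 2$, which is never positive.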
Initially, the SVM is trained using random values for $\sigma$ and C (the soft-margin box constraint). The idea, however, is to perform an iterative process that helps us find optimal values for these parameters.
We use the MATLAB function crossval to obtain a 10-fold cross-validation estimate of the misclassification rate, labelled mcr, thus determining the performance of any given pair $\{\sigma, C\}$. In order to minimize this error, we perform a line search over an exponential grid of the parameters $\sigma$ and C that seeks to minimize mcr. Several searches of this kind are made to find a convincing minimum, each departing from different random initial parameter values. Each search results in different parameter estimates, and some line searches even fail to converge. After finding a convincing set $\{mcr^*, \sigma^*, C^*\}$, we perform the minimization process using $\{\sigma^*, C^*\}$ as initial values. Details are shown in the following section.
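The idea of searching an exponential grid can be sketched as below. This is not the search actually run in this work: it is a simple coordinate-wise line search, the objective `toy_mcr` is an invented stand-in for the cross-validated misclassification rate (no SVM is trained), natural logarithms are assumed, and all names are hypothetical.

```python
import math


def line_search(mcr, log_sigma, log_c, grid, sweeps=5):
    """Alternate line searches along log(sigma) and log(C) over a grid.

    mcr(sigma, c) returns the error to minimize; the grid holds candidate
    exponents, so the parameters themselves are spaced exponentially.
    """
    best = mcr(math.exp(log_sigma), math.exp(log_c))
    for _ in range(sweeps):
        improved = False
        for ls in grid:  # search along sigma with C fixed
            value = mcr(math.exp(ls), math.exp(log_c))
            if value < best:
                best, log_sigma, improved = value, ls, True
        for lc in grid:  # search along C with sigma fixed
            value = mcr(math.exp(log_sigma), math.exp(lc))
            if value < best:
                best, log_c, improved = value, lc, True
        if not improved:
            break
    return best, log_sigma, log_c


def toy_mcr(sigma, c):
    # Invented smooth surface with its minimum at log(sigma)=0.5, log(C)=0
    return 0.37 + 0.01 * ((math.log(sigma) - 0.5) ** 2 + math.log(c) ** 2)


grid = [i * 0.5 for i in range(-4, 5)]  # exponents from -2 to 2
best, log_sigma, log_c = line_search(toy_mcr, 2.0, 1.0, grid)
print(round(best, 4), log_sigma, log_c)
```

On a real cross-validation surface the outcome depends on the starting point, which is consistent with repeating the search from several random initial values.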
8.4 Support Vector Machines: Testing
As before, to determine whether an utterance $U_1$ originates from English or from French, we first extract its MFCC using the function melfcc mentioned in chapter 7 and convert it into an array of 13-vectors $\{\vec{u}\}$. Next, using the decision function resulting from the SVM training, we score the utterance by summing the classifications of all its cepstral vectors (the classifier takes values in {+1, −1}). If the resulting score is positive, the utterance $U_1$ is labelled as originating from French, and from English otherwise.
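The frame-voting rule can be sketched as follows, assuming a trained SVM described by its support vectors, multipliers, labels, threshold and RBF width, with +1 standing for French and −1 for English. The names and the toy one-dimensional model are hypothetical.

```python
import math


def svm_frame_label(x, support, alphas, labels, b, sigma):
    """Sign of the SVM output for one cepstral vector (+1 or -1)."""
    output = b
    for sv, a, y in zip(support, alphas, labels):
        squared_distance = sum((p - q) ** 2 for p, q in zip(x, sv))
        output += y * a * math.exp(-squared_distance / (2.0 * sigma ** 2))
    return 1 if output > 0 else -1


def label_utterance(frames, support, alphas, labels, b, sigma):
    """Sum the frame labels; a positive score means French."""
    score = sum(svm_frame_label(x, support, alphas, labels, b, sigma)
                for x in frames)
    return "french" if score > 0 else "english"


# Toy model: one "English" support vector at 0 and one "French" at 5
support, alphas, labels = [[0.0], [5.0]], [1.0, 1.0], [-1, +1]
print(label_utterance([[4.9], [5.2], [0.1]], support, alphas, labels, 0.0, 1.0))
```

Two of the three toy frames fall on the French side, so the utterance is labelled French even though one frame is classified as English. This voting is what lets a weak frame-level classifier yield a more reliable utterance-level decision.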
We perform this classification on each of the recordings in the English testing database of size T. Each time a recording is labelled as originating from French, we count an error in the discrete variable $c_f$, and we compute equation 8.1.
An identical procedure is applied to the French testing database, and $Error_{French}$ is computed. Since both testing databases are equal in size, these two errors are averaged to obtain our empirical error $Error_{Empirical}$. Performance tables are shown in the results chapter.
Part IV
9 Results: Error Rates and Computational Effort
In this section we present the results obtained for both models. These results constitute the performances in terms of error rates achieved by GMM and SVM learning machines.
In the case of GMM, our first approach was to use only the cepstral coefficients to train the model; the result was a minimum $Error_{Empirical}$ of 20.48%. An attempt to improve the performance of the algorithm was made by including the delta-cepstral coefficients in the training as well. In this case, a sweep was made over different values of k, the number of components in the Gaussian mixture. The results obtained were:
Table 6: GMM Classifier Results: Cepstral and Delta-cepstral coefficients
k        ErrorEng    ErrorFr    Erroremp    Time [sec]
5        33.24%      32.07%     32.65%      792 (13 min)
10       33.11%      23.92%     28.51%      3164 (52 min)
15       30.07%      18.17%     24.12%      3248 (54 min)
20       27.10%      12.83%     19.96%      6452 (1.79 h)
25       25.55%      12.63%     19.09%      8956 (2.49 h)
30       26.89%      15.81%     21.35%      12000 (3.3 h)
35       22.74%      18.86%     20.80%      13580 (3.7 h)
40       21.73%      21.05%     21.39%      17424 (4.8 h)
mixed    16.60%      21.38%     18.99%      –
multi    17.07%      18.72%     17.89%      –
In this table, Erroremp refers to the empirical error obtained by averaging the per-language errors of equation 8.1, as described in the last chapter. Note that the last two rows show the best performance. In this context, mixed means that during the classification process GMM models with k = 25 were used to calculate the English likelihood, whilst models with k = 40 were used for French. On the other hand, multi means that models with several values of k were used for each language.
In these rows the time field is empty, because previously trained models are reused to build a new classifier.
The other type of classifier, SVM, had to be trained using a dataset of only 10 thousand examples. The random line search described in the last chapter resulted in multiple parameter sets. We observed the formation of clusters, i.e., multiple sets $\{mcr^*, \sigma^*, C^*\}$ with very similar values. The best misclassification rate (mcr) of each cluster was extracted and is shown in the following table:
Table 7: SVM Classifier Results: Misclassification rate estimation
mcr       log σ     log C      Time [sec]
41.85%    2.0533    1.2852     2932
37.64%    0.2975    -0.4113    2676
37.46%    0.7531    -0.4113    4803
37.45%    0.5092    0.0374     2135
We attempted to train the SVM with bigger datasets, of sizes {15k, 20k, 25k}, but the algorithm always failed due to convergence issues. The results in table 7 seem poor, but since the classification of each utterance $U_1$ is made over a large number of cepstral vectors, this mcr produces a classifier with better performance. We ran classification tests on the testing data with the best mcr of table 7 and obtained the following results:
Table 8: SVM Classifier Results: Testing Database
log σ     log C     ErrorEng    ErrorFr    Erroremp
0.5092    0.0305    38.30%      28.78%     33.54%
10 Conclusions
In this work, we have sought to learn techniques related to the problem of language identification (LID), more specifically LID solutions within the theoretical framework of supervised learning. The principles and main steps (theoretical and computational) of these techniques were properly learned and understood.
We became familiar with the pre-processing of speech databases based on Mel-frequency cepstral coefficients (MFCC) and used a MATLAB library to successfully implement feature extraction on 20 thousand .wav files from the Voxforge database.
Two techniques that belong to the family of supervised learning models were studied and implemented: Gaussian mixture models (GMM) and support vector machines (SVM). As can be seen in tables 6 and 8, our GMM implementation performed better than the SVM. We demonstrated that GMM models are fairly simple to train and that the computational approach using Gaussian mixtures is highly scalable to large datasets without many complications. On the other hand, we saw that custom approaches to training SVMs with Gaussian radial basis function (RBF) kernels show severe limitations when handling big datasets.
Specialized libraries were used extensively to carry out the pre-processing and training of the mentioned models: D. P. W. Ellis's RASTA-PLP and MFCC implementations, the gmdistribution class of the MATLAB Statistics Toolbox, and the function svmtrain. These tools allowed us to successfully use machine learning techniques to build a language identifier that tells apart English and French.
Future work should focus on the use of boosting algorithms in order to increase the performance of the models and make the LID system more reliable. The construction of an end-point, user-friendly interface that implements the language identification service also remains to be done.
Bibliography
[1] VoxForge, Speech Corpus, http://www.voxforge.org/.
[2] Beth Logan et al., "Mel Frequency Cepstral Coefficients for Music Modeling," Cambridge Research Laboratory, 2001.
[3] Marc A. Zissman, "Comparison of Four Approaches to Acoustic Language Identification of Telephone Speech," IEEE Transactions on Speech and Audio Processing, 1996.
[4] L. R. Rabiner, B. H. Juang, "Fundamentals of Speech Recognition," Prentice Hall, 1993.
[5] N. Merhav, C. Lee, "On the Asymptotic Statistical Behavior of Empirical Cepstral-coefficients," IEEE Transactions on Signal Processing, 1993.
[6] S. Davis, P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Spoken Sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, 1998.
[7] Jeff A. Bilmes, "A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models," International Computer Science Institute, Berkeley, 1998.
[8] Frank Dellaert, "The Expectation Maximization Algorithm," College of Computing, Georgia Institute of Technology, 2002.
[9] Bernhard Schölkopf, Alexander J. Smola, "Learning with Kernels," The MIT Press, 2002.
[10] Marti A. Hearst, "Support Vector Machines," IEEE Intelligent Systems, 2001.
[11] W. M. Campbell, "Support Vector Machines for Speaker and Language Recognition," Computer Speech and Language, 2005.
[12] John C. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," in B. Schölkopf, A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press, 2005.
[13] D. Ellis, "PLP and RASTA in matlab using melfcc.m and invmelfcc.m," http://labrosa.ee.columbia.edu/matlab/rastamat/