
are interested in visualising the high dimensional supervector space produced by each AID system. To gain the most insight from the visualisation, we need an approach that maximises the separation between the projected classes while reducing the dimensionality of the feature space to two. Here, ‘class’ refers to the different regional accents (for example Birmingham accent, Glaswegian accent).

This suggests Linear Discriminant Analysis (LDA) [73]. However, the rank of the covariance matrix cannot be greater than the number of supervectors used to estimate it, which is much smaller than the dimension of the supervector space. Hence, due to the small sample size N and the high dimensionality D of the supervector space, it is not possible to invert the within-class covariance matrix. Here, the ‘sample size’ refers to the number of speakers in the data set, and the ‘high dimensionality’ refers to the dimension of the AID supervectors.
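The rank argument can be written out explicitly. The following is a sketch in notation introduced here for illustration only: x_i is the supervector of speaker i, m_c is the mean supervector of class c, I_c is the set of speakers in class c, and S_W is the within-class scatter matrix.

```latex
S_W = \sum_{c=1}^{C} \sum_{i \in \mathcal{I}_c} (x_i - m_c)(x_i - m_c)^{\top},
\qquad
\operatorname{rank}(S_W) \le N - C < D .
```

Each class contributes at most |I_c| − 1 linearly independent centred vectors, so the D × D matrix S_W has rank at most N − C and is singular whenever N − C < D.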

A solution is to use Principal Components Analysis (PCA) [74] to reduce the dimensionality of the accent supervectors to a new value n, chosen empirically such that C ≤ n ≤ N − C, where C is the number of classes. We then apply LDA to reduce the dimensionality to two [75, 76]. In this work we use the EM algorithm to calculate the principal components in PCA (EM-PCA) [77]. This enables us to extract the eigenvectors from the high dimensional supervectors at lower computational cost [77].
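The two-stage projection can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes NumPy is available, it computes the principal components with a plain SVD rather than EM-PCA, and the function name `pca_then_lda` and its arguments are hypothetical.

```python
import numpy as np


def pca_then_lda(X, y, n_pca, n_out=2):
    """Project N supervectors (rows of X, dimension D) to n_out dimensions.

    Illustrative only: PCA is done with an SVD here, not with EM-PCA.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    N, C = X.shape[0], len(classes)
    # Constraint from the text: C <= n <= N - C.
    if not (C <= n_pca <= N - C):
        raise ValueError("n_pca must satisfy C <= n_pca <= N - C")

    # Stage 1: PCA. The rows of Vt are the principal directions.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_pca].T  # N x n_pca scores

    # Stage 2: LDA in the reduced space, where the within-class
    # scatter matrix can be inverted for generic data.
    mu = Z.mean(axis=0)
    Sw = np.zeros((n_pca, n_pca))
    Sb = np.zeros((n_pca, n_pca))
    for c in classes:
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        d = (mc - mu)[:, None]
        Sb += len(Zc) * (d @ d.T)

    # Leading eigenvectors of Sw^{-1} Sb give the discriminant directions.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][:n_out]
    W = evecs[:, order].real
    return Z @ W  # N x n_out projected points
```

With three or more accent classes the between-class scatter has at least two non-zero eigenvalues, so the two retained LDA directions are both informative.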

Full descriptions of the PCA, EM-PCA, and LDA dimension reduction approaches can be found in Appendices A.1.1, A.1.2, and A.1.3 respectively.

For visualisation purposes, after projecting the accent supervectors into a 2-dimensional space, the supervectors that belong to the same ‘accent region’ are represented by a cluster. For each accent region, a v standard-deviation (0 < v ≤ 1) contour around the mean value m represents the distribution of supervectors corresponding to speakers with that accent in the supervector space.
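One way to draw such a contour is sketched below. It assumes that the contour is the ellipse at v standard deviations from the class mean under the sample covariance of the projected points; the text does not confirm this reading, and the function name `std_contour` is hypothetical.

```python
import math


def std_contour(points, v=1.0, n_vertices=72):
    """Return (mean, vertices) of a v-standard-deviation ellipse.

    `points` is a list of (x, y) pairs for one accent region.
    Assumption: the contour is defined by the 2x2 sample covariance.
    """
    if not 0 < v <= 1:
        raise ValueError("v must satisfy 0 < v <= 1")
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)

    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2
    root = math.hypot((sxx - syy) / 2, sxy)
    lam1 = half_trace + root
    lam2 = max(half_trace - root, 0.0)
    # Orientation of the major axis.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)

    a, b = v * math.sqrt(lam1), v * math.sqrt(lam2)
    vertices = []
    for k in range(n_vertices):
        t = 2 * math.pi * k / n_vertices
        x = mx + a * math.cos(t) * math.cos(theta) - b * math.sin(t) * math.sin(theta)
        y = my + a * math.cos(t) * math.sin(theta) + b * math.sin(t) * math.cos(theta)
        vertices.append((x, y))
    return (mx, my), vertices
```

The returned vertices can be passed to any plotting routine as a closed polygon, one per accent region.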

We expect the relative positions of the accent clusters in the AID space to reflect the geographical and social similarities and differences between the accent regions. The visualisations for the i-vector, phonotactic, and ACCDIST-SVM AID feature spaces can be found in Section 6.4.

2.6 Summary

In this chapter we presented four different approaches to address accent issues faced in ASR systems. Then, three popular AID systems were described, namely ACCDIST, phonotactic, and i-vector based systems. The ACCDIST-SVM system is a supervised approach which requires an exact transcription of the utterances to recognise the test speaker’s accent. The phonotactic and i-vector systems are unsupervised and do not rely on pre-transcribed material.

Performance of these three AID systems on British English accented speech is reported in Section 6.2.

The supervectors generated by different AID systems are used for two-dimensional visualisation of the accent space. The visualisation results for different AID accent feature spaces can be found in Section 6.4.

In Chapters 7 and 8 these AID systems will be used for selecting an accented acoustic model that matches the test speaker’s accent.

Chapter 3

HMM based speech recognition using GMMs and DNNs

3.1 Introduction

ASR is the task of automatically converting speech data into a written transcription. ASR is not an easy task and visually similar waveforms do not necessarily indicate similar sounds. Therefore, simple pattern recognition is not powerful enough for the speech recognition task [78].

Fig. 3.1 Diagram of a simple speech recogniser (blocks: speech data, feature extraction, decoding, acoustic model, pronunciation dictionary, language model, word sequence)

The speech recognition task consists of four main stages, namely feature extraction, acoustic modelling, language modelling and decoding [79]. A diagram of a simple speech recognition system is shown in Figure 3.1.

The aim of feature extraction is to convert the speech waveform into a sequence of acoustic feature vectors which are suitable for applications such as speech recognition and accent recognition. Over the last four decades, low-dimensional representations of speech that preserve the information necessary for ASR systems have been used. An ideal acoustic feature vector should convey the information needed for making phonetic distinctions, while not being sensitive to speaker specific characteristics such as the shape and size of the vocal tract. Many different types of acoustic feature vectors have been proposed in the literature, of which the most popular are Mel Frequency Cepstral Coefficients (MFCCs) [80], Mel Log filterbanks (FBANKs) [80, 81], and Perceptual Linear Prediction coefficients (PLPs) [82]. These low-dimensional features are inspired by physiological evidence of human perception of auditory signals and have been successfully applied to many applications. Recent publications suggest that it is possible to train DNN based acoustic models on the raw waveform, and that the accuracy of these systems matches that obtained with FBANK features [83–85]. The complete description of the acoustic signal processing used in this work can be found in Section 3.3.

ASR systems use a pronunciation dictionary. A pronunciation dictionary comprises one or more phone level transcriptions of each word that occurs in the training set. It provides a link between the words and their phone level transcriptions and can be used in creating context dependent acoustic models (Section 3.4).

The language modelling stage is concerned with the development of structures to model the word sequences using a large quantity of text data to estimate the probability of the word sequences. Details of the language modelling process can be found in Section 3.5.

Acoustic modelling involves a Markov process, to model the underlying sequential structure of a word, plus a mechanism to relate the acoustic feature vectors to the Markov model states, which can be achieved with GMMs or DNNs. Hidden Markov Models (HMMs) are one of the major approaches to acoustic modelling. They can characterise the observed time-varying speech data samples. An observation probability distribution is associated with each state in the HMM, so that the model can describe the time-varying speech feature sequences. The complete description of GMM and DNN based acoustic modelling can be found in Sections 3.7 and 3.10.

The aim of the acoustic decoding stage is to find the most likely word sequence for the observed acoustic data. Section 3.11 fully describes the speech decoding process.
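The decoding objective is commonly written as follows. This is the standard formulation rather than a quotation from Section 3.11, and the symbols O (the sequence of acoustic feature vectors) and W (a candidate word sequence) are notation assumed here for illustration.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \, p(O \mid W)\, P(W)
```

The second form follows from Bayes' rule, since p(O) does not depend on W. The term p(O | W) is supplied by the acoustic model together with the pronunciation dictionary, and P(W) by the language model.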

In Sections 3.7 and 3.10, the different stages of GMM and DNN based ASR systems are described. Two popular approaches to GMM-HMM based acoustic model adaptation are then introduced. Finally, we present the ASR evaluation formula, which is used to report and compare the recognition performance of different systems.
