The approach for Accent Characterization by Comparison of Distances in the Inter-segment Similarity Table is called ACCDIST. This text-dependent (supervised) AID approach was first introduced by Huckvale [72]. The ACCDIST measure depends on absolute spectral properties. This section first describes the ACCDIST system introduced by Huckvale [72], and then a more generalised approach suggested by Hanani [56], called ACCDIST-SVM. In this work we use Hanani’s ACCDIST-SVM system and report its AID accuracy in Section 6.2.4.

Huckvale’s ACCDIST based AID

Using the ACCDIST method, the accent of a speaker is characterised by the similarities between the realisations of vowels in certain words. The method compares a speaker’s pronunciation system with the average pronunciation systems of known accent groups to recognise the speaker’s accent.

For instance, the realisation of the vowel in the words ‘cat’, ‘after’, and ‘father’ is a clue for distinguishing between southern and northern British English regional accents. In this example, the distance table for the mean cepstral envelopes of the vowels shows that for northern English speakers the vowels in ‘cat’ and ‘after’ are more similar, while for southern English speakers the vowels in ‘after’ and ‘father’ are more similar.
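The ‘cat’, ‘after’, ‘father’ example can be illustrated with a small sketch. The two-dimensional vectors below are invented purely for illustration (they are not measured cepstral data), and the helper names `euclidean` and `distance_table` are hypothetical; Euclidean distance is assumed as the distance measure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_table(vowels):
    """Map each unordered word pair to the distance between its vowel vectors."""
    words = sorted(vowels)
    return {(w1, w2): euclidean(vowels[w1], vowels[w2])
            for i, w1 in enumerate(words) for w2 in words[i + 1:]}

# Invented 2-D stand-ins for mean cepstral envelopes (illustration only).
northern = {"cat": [1.0, 0.0], "after": [1.1, 0.1], "father": [3.0, 2.0]}
southern = {"cat": [1.0, 0.0], "after": [2.9, 2.1], "father": [3.0, 2.0]}

north_table = distance_table(northern)
south_table = distance_table(southern)

# In this toy data the vowel in 'after' is close to 'cat' for the northern
# speaker and close to 'father' for the southern speaker.
for name, table in (("northern", north_table), ("southern", south_table)):
    for pair, dist in table.items():
        print(name, pair, round(dist, 3))
```

The point of the sketch is that the pattern of distances within one speaker's table, rather than the vectors themselves, carries the accent information.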

Huckvale’s ACCDIST accent identification approach is carried out in five stages, namely forced-alignment, vowel feature vector generation, vowel distance table measurement, vectorisation and correlation distance measurement [72], as shown in Figure 2.3.

[Figure 2.3: block diagram. Utterance → Forced alignment → Vowel distance table → Vectorisation → Correlation distance measure → Hypothesized accent]

Fig. 2.3 ACCDIST AID system based on correlation distance measure

• Forced-alignment: In the ACCDIST text-dependent system, during the forced-alignment stage the corresponding phone-level transcription and time segmentation are generated for each utterance using a pronunciation dictionary. Next, for each speaker only the vowel segments of the utterances are analysed.

• Vowel feature vector generation: Given a phone-level transcription for each utterance, the start and end time indices of the vowels are determined. Each vowel is divided into two halves by time. For each half the average of 19 MFCC coefficients is calculated. The mean cepstral envelopes of the two halves are concatenated. Each vowel is then represented by a 40-dimensional vector. If there are multiple instances of a vowel then the 40-dimensional vectors are averaged over all of these instances.

• Vowel distance measurement: For each speaker a vowel distance table is generated by computing the distances between the 40-dimensional vectors corresponding to each pair of monophone vowels.

• Vectorisation: The distance tables generated in the previous stage are concatenated to form a supervector.

• Correlation distance measurement: For each utterance, in order to determine the closest accent group to the test speaker, the correlation between the test speaker’s supervector and the mean supervector of each accent group is computed. Because the distances are measured between pairs of vowels produced by the same speaker, the comparison is insensitive to speaker-dependent properties other than the speaker’s accent. The correlation distance d between two mean and variance normalized vectors v1 and v2 is computed as shown by Equation 2.20. Here ‘.’ represents the product, and the sum runs over the J elements of the vectors. The correlation distance for two independent vectors is zero and for two identical vectors is one.

$$d(v_1, v_2) = \sum_{j=1}^{J} v_{1,j} \cdot v_{2,j} \qquad (2.20)$$
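The stages after forced alignment can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Huckvale’s implementation: all function names are hypothetical, Euclidean distance is assumed for the distance table, and the normalisation (zero mean, unit Euclidean norm) is chosen so that the sum of products equals one for identical vectors. The sketch is agnostic to the number of coefficients per frame, and each vowel segment is assumed to contain at least two frames.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def vowel_vector(frames):
    """Split a vowel's frames into two halves by time, average each half,
    and concatenate the two mean vectors (needs at least two frames)."""
    mid = len(frames) // 2
    return mean_vector(frames[:mid]) + mean_vector(frames[mid:])

def speaker_vowel_vectors(segments):
    """segments: list of (vowel_label, frames). Repeated vowels are averaged."""
    by_vowel = {}
    for label, frames in segments:
        by_vowel.setdefault(label, []).append(vowel_vector(frames))
    return {label: mean_vector(vecs) for label, vecs in by_vowel.items()}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def supervector(vowel_vecs, vowel_order):
    """Flatten the distance table: one entry per vowel pair, in a fixed order
    (assumed here so that supervectors from different speakers line up)."""
    return [euclidean(vowel_vecs[a], vowel_vecs[b])
            for i, a in enumerate(vowel_order) for b in vowel_order[i + 1:]]

def normalise(v):
    """Zero mean and unit Euclidean norm (assumed normalisation; fails on a
    constant vector because its norm after centring is zero)."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred]

def correlation(v1, v2):
    """Sum of element-wise products of the two normalised vectors."""
    return sum(a * b for a, b in zip(normalise(v1), normalise(v2)))

def closest_accent(test_sv, accent_means):
    """Return the accent whose mean supervector correlates best with the test one."""
    return max(accent_means, key=lambda acc: correlation(test_sv, accent_means[acc]))

# Toy supervectors (invented numbers, illustration only).
accent_means = {"north": [0.1, 2.7, 2.8], "south": [2.8, 0.1, 2.8]}
test_sv = [0.2, 2.5, 2.9]
best = closest_accent(test_sv, accent_means)
print(best)
```

With the toy numbers above, the test supervector has the same shape as the "north" mean, so that accent is selected even though the absolute values differ.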

Hanani’s ACCDIST-SVM based AID

Huckvale’s AID system requires every utterance to correspond to exactly the same known phrase or set of phrases.

Later, Hanani [56] generalised this technique by comparing the realisation of vowels in triphones rather than in words. In addition, for determining the closest accent group to the test speaker, Hanani’s classifier is based on Support Vector Machines (SVMs) with a correlation distance kernel. We refer to Hanani’s system as the ACCDIST-SVM system [56].

Given utterances from different speakers, the task is to create and compare the speaker distance tables of the mean cepstral envelopes of the most common vowel triphones in order to identify the speaker’s accent. Hanani’s ACCDIST accent identification approach is carried out in five stages, namely forced-alignment, vowel feature vector generation, vowel distance measurement, vectorisation and SVM classification, as shown in Figure 2.4.

[Figure 2.4: block diagram. Utterance → Forced alignment → Vowel distance measurement → Vectorisation of vowel features → Multi-class SVM → Hypothesized accent]

Fig. 2.4 ACCDIST-SVM AID system

• Forced-alignment: During the forced-alignment stage, a tied-state triphone based phone recognizer is used to generate a triphone-level transcription and time segmentation for each utterance. Next, for each speaker only the vowel-triphone segments are analysed.

• Vowel feature vector generation: For each utterance, the vowel triphones are identified and the most common vowel triphones are selected, to ensure that distance tables from different utterances are comparable.

In the same way as in Huckvale’s method, a 40-dimensional vector is constructed for each vowel. The vowel duration is then appended to this vector, giving a 41-dimensional vector. For repeated triphones the average of these 41-dimensional vectors is used.

Each realisation of a common vowel triphone is represented in the form (pi, vi), where pi denotes the i-th vowel triphone and the vector vi is its corresponding 41-dimensional feature vector.

• Vowel distance measurement: A set of cepstral feature vectors was computed in the previous stage for the vowel triphones. In this stage, a vowel distance table is computed for each utterance by finding the Euclidean distance between every pair of vowel triphones.

• Vectorisation: The distance tables computed in the previous stage are then vectorised and stored in a supervector. The supervector generated in this stage is referred to as the ACCDIST-SVM system’s supervector.

• SVM classification: In our experiment, ACCDIST-SVM supervectors from C accent groups are used to train a multi-class SVM classifier with the correlation distance kernel. To address this classification problem, a multi-class SVM with the ‘one against all’ approach is chosen.
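The supervector construction and the inputs to the SVM stage might be sketched as below. This is an illustrative sketch only and not Hanani’s implementation: the function names and triphone labels are hypothetical, the correlation kernel is assumed to be the sum of products of zero-mean, unit-norm supervectors, and the SVM training itself is left out. The kernel (Gram) matrix and the ‘one against all’ target vectors are the quantities an SVM trainer that accepts a precomputed kernel would need.

```python
import math

def triphone_feature(cepstral_vec, duration):
    """Append the vowel duration to the cepstral vector (40 -> 41 dims in the text)."""
    return list(cepstral_vec) + [duration]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def accdist_supervector(features, common_triphones):
    """features: dict triphone -> feature vector. Distances between every pair of
    common triphones, in a fixed order so that all supervectors are comparable."""
    return [euclidean(features[p], features[q])
            for i, p in enumerate(common_triphones)
            for q in common_triphones[i + 1:]]

def normalise(v):
    """Zero mean and unit Euclidean norm (assumed normalisation)."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred]

def gram_matrix(supervectors):
    """Correlation kernel evaluated between every pair of supervectors."""
    normed = [normalise(s) for s in supervectors]
    return [[sum(a * b for a, b in zip(x, y)) for y in normed] for x in normed]

def one_against_all_targets(labels):
    """One binary problem per accent: +1 for that accent, -1 for all others."""
    return {acc: [1 if lab == acc else -1 for lab in labels]
            for acc in sorted(set(labels))}

# Toy example with hypothetical triphone labels and invented numbers.
feats = {"k-a+t": triphone_feature([0.0, 0.0], 0.1),
         "f-a+dh": triphone_feature([3.0, 4.0], 0.1)}
sv = accdist_supervector(feats, ["k-a+t", "f-a+dh"])

train_svs = [[0.1, 2.7, 2.8], [0.2, 2.5, 2.9], [2.8, 0.1, 2.8]]
train_labels = ["north", "north", "south"]
K = gram_matrix(train_svs)
targets = one_against_all_targets(train_labels)
```

Each binary classifier would then be trained on `K` with its own target vector, and a test supervector assigned to the accent whose classifier gives the largest score.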

2.5 Proposed approach for visualisation of the AID feature
