Accent Characterization by Comparison of Distances in the Inter-segment Similarity Table (ACCDIST) is a text-dependent (supervised) AID approach first introduced by Huckvale [72]. The ACCDIST measure depends on absolute spectral properties. This section first describes the ACCDIST system introduced by Huckvale [72], and then a more general approach proposed by Hanani [56], called ACCDIST-SVM. In this work we use Hanani’s ACCDIST-SVM system and report its AID accuracy in Section 6.2.4.
Huckvale’s ACCDIST based AID
The ACCDIST method characterises a speaker’s accent by the similarities between the realisations of vowels in certain words. It compares the speaker’s pronunciation system with the average pronunciation systems of known accent groups to recognise his or her accent.
For instance, the realisation of the vowel in the words ‘cat’, ‘after’, and ‘father’ is a clue for distinguishing between southern and northern British English regional accents. In this example the distance table for the mean cepstral envelopes of the vowels shows that, for northern English speakers, the vowels in ‘cat’ and ‘after’ are more similar, while for southern English speakers the vowels in ‘after’ and ‘father’ are more similar.
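The ‘cat’, ‘after’, ‘father’ example can be made concrete with a toy distance table. The sketch below is illustrative only: the two-dimensional vectors are invented numbers (not measured cepstral envelopes), the speaker names are hypothetical, and the Euclidean distance is an assumed metric.

```python
import math

# Invented 2-D "vowel vectors" (NOT measurements), chosen only to mimic
# the pattern described in the text for the two accent groups.
northern = {"cat": (1.0, 0.0), "after": (1.2, 0.1), "father": (3.0, 1.0)}
southern = {"cat": (1.0, 0.0), "after": (2.9, 1.1), "father": (3.0, 1.0)}


def distance_table(vowels):
    """Return {(word_a, word_b): distance} for every unordered word pair."""
    words = sorted(vowels)
    return {
        (a, b): math.dist(vowels[a], vowels[b])  # Euclidean distance (assumed)
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }


north_table = distance_table(northern)
south_table = distance_table(southern)

# Which neighbour is the vowel of 'after' closer to for each toy speaker?
north_closer = min(("cat", "father"), key=lambda w: north_table[tuple(sorted(("after", w)))])
south_closer = min(("cat", "father"), key=lambda w: south_table[tuple(sorted(("after", w)))])
```

With these invented values the vowel of ‘after’ sits next to ‘cat’ for the toy northern speaker and next to ‘father’ for the toy southern speaker, which is the kind of pattern the distance table is meant to expose.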
Huckvale’s ACCDIST accent identification approach is carried out in five stages, namely forced-alignment, vowel feature vector generation, vowel distance table measurement, vectorisation and correlation distance measurement [72], as shown in Figure 2.3.
Fig. 2.3 ACCDIST AID system based on correlation distance measure (blocks: Utterance, Forced alignment, Vowel distance table, Vectorisation, Correlation distance measure, Hypothesized accent)
• Forced-alignment: In the text-dependent ACCDIST system, the forced-alignment stage generates the phone-level transcription and time segmentation of each utterance using a pronunciation dictionary. Next, for each speaker only the vowel segments of the utterances are analysed.
• Vowel feature vector generation: Given a phone-level transcription for each utterance, the start and end time indices of the vowels are determined. Each vowel is divided into two halves by time. For each half the average of 19 MFCC coefficients is calculated. The mean cepstral envelopes of the two halves are concatenated, so that each vowel is represented by a 40-dimensional vector. If there are multiple instances of a vowel, the 40-dimensional vectors are averaged over all of these instances.
• Vowel distance measurement: For each speaker a vowel distance table is generated by computing the distances between the 40-dimensional vectors corresponding to each pair of monophone vowels.
• Vectorisation: The distance tables generated in the previous stage are concatenated to form a supervector.
• Correlation distance measurement: For each utterance, in order to determine the closest accent group to the test speaker, the correlation between the test speaker’s supervector and the mean supervector of each accent group is computed. Because this process uses distances between pairs of vowels produced by the same speaker, the comparison is insensitive to various speaker-dependent properties other than the speaker’s accent. The correlation distance $d$ between two mean and variance normalized vectors $v_1$ and $v_2$ is computed as shown by Equation 2.20:

$$d(v_1, v_2) = \sum_{j=1}^{J} v_{1,j} \cdot v_{2,j} \tag{2.20}$$

Here ‘.’ represents the product of corresponding elements, summed over the $J$ elements of the vectors. The correlation distance for two independent vectors is zero and for two identical vectors is one.

Hanani’s ACCDIST-SVM based AID
Huckvale’s AID system requires every utterance to correspond to exactly the same known phrase or set of phrases.
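The stages of Huckvale’s method after forced alignment can be sketched as follows. This is a minimal sketch under stated assumptions, not Huckvale’s implementation: the frames are synthetic, 20 coefficients per frame are assumed so that two halves give the 40 dimensions quoted above, the Euclidean distance is an assumed metric, Equation 2.20 is scaled by 1/J so that identical vectors score one, and all function and variable names are hypothetical. Every speaker is assumed to produce the same vowel inventory, which is the restriction just noted.

```python
import numpy as np


def vowel_vector(frames):
    """Concatenate the mean cepstral envelopes of the two temporal halves."""
    frames = np.asarray(frames, dtype=float)  # shape (n_frames, n_coeffs)
    mid = len(frames) // 2
    return np.concatenate([frames[:mid].mean(axis=0), frames[mid:].mean(axis=0)])


def speaker_supervector(vowel_instances):
    """Map {vowel_label: [frames, ...]} to a vectorised vowel distance table."""
    labels = sorted(vowel_instances)  # fixed order keeps speakers comparable
    # Average the vectors of repeated instances of each vowel.
    means = np.array([
        np.mean([vowel_vector(f) for f in vowel_instances[v]], axis=0)
        for v in labels
    ])
    # Pairwise Euclidean distances (assumed metric); keep the upper triangle.
    diff = means[:, None, :] - means[None, :, :]
    table = np.sqrt((diff ** 2).sum(axis=-1))
    return table[np.triu_indices(len(labels), k=1)]


def correlation(v1, v2):
    """Equation 2.20 on mean/variance normalised vectors, scaled by 1/J (assumed)."""
    z1 = (v1 - v1.mean()) / v1.std()
    z2 = (v2 - v2.mean()) / v2.std()
    return float(np.dot(z1, z2) / len(z1))


def classify(test_sv, accent_means):
    """Return the accent whose mean supervector correlates best with test_sv."""
    return max(accent_means, key=lambda a: correlation(test_sv, accent_means[a]))


# Synthetic demonstration data (invented, not real speech).
rng = np.random.default_rng(0)
vowels = ["cat", "after", "father", "boot"]  # hypothetical vowel labels
layouts = {"north": rng.normal(size=(4, 20)), "south": rng.normal(size=(4, 20))}


def synth_speaker(accent):
    """Two instances per vowel: accent layout + speaker offset + small noise."""
    offset = rng.normal(size=20)  # speaker-dependent shift
    return {
        v: [layouts[accent][i] + offset + 0.05 * rng.normal(size=(10, 20))
            for _ in range(2)]
        for i, v in enumerate(vowels)
    }


accent_means = {
    a: np.mean([speaker_supervector(synth_speaker(a)) for _ in range(5)], axis=0)
    for a in layouts
}
test_sv = speaker_supervector(synth_speaker("north"))
hypothesis = classify(test_sv, accent_means)
```

In this sketch a constant speaker offset added to every frame cancels in the pairwise distances, which illustrates why the comparison is insensitive to some speaker-dependent properties.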
Later, Hanani [56] generalised this technique by comparing the realisations of vowels in triphones rather than in words. In addition, to determine the closest accent group to the test speaker, Hanani’s classifier uses Support Vector Machines (SVMs) with a correlation distance kernel. We refer to Hanani’s system as the ACCDIST-SVM system [56].
Given utterances from different speakers, the task is to create and compare the speakers’ distance tables of the mean cepstral envelopes of the most common vowel triphones in order to identify each speaker’s accent. Hanani’s ACCDIST-SVM accent identification approach is carried out in five stages, namely forced-alignment, vowel feature vector generation, vowel distance measurement, vectorisation and SVM classification, as shown in Figure 2.4.
Fig. 2.4 ACCDIST-SVM AID system (blocks: Utterance, Forced alignment, Vowel distance measurement, Vectorisation of vowel features, Multi-class SVM, Hypothesized accent)
• Forced-alignment: During the forced-alignment stage, a tied-state triphone based phone recognizer is used to generate a triphone-level transcription and time segmentation for each utterance. Next, for each speaker only the vowel-triphone segments are analysed.
• Vowel feature vector generation: For each utterance, the vowel triphones are identified and the most common vowel triphones are selected, to ensure that distance tables from different utterances are comparable.
In the same way as in Huckvale’s method, a 40-dimensional vector is constructed. The vowel duration is then appended to this vector. For repeated triphones the average of these 41-dimensional vectors is used.
Each realisation of a common vowel-triphone is represented in the form $(p_i, v_i)$, where $p_i$ denotes the $i$-th vowel triphone and the vector $v_i$ is its corresponding 41-dimensional feature vector.
• Vowel distance measurement: A set of cepstral feature vectors was computed in the previous stage for the vowel-triphones. In this stage a vowel distance table is computed for each utterance by finding the Euclidean distance between every vowel-triphone pair.
• Vectorisation: The distance tables computed in the previous stage are then vectorised and stored in a supervector. The supervector generated in this stage is referred to as the ACCDIST-SVM system’s supervector.
• SVM classification: In our experiment, ACCDIST-SVM supervectors from $C$ accent groups are used to train a multi-class SVM classifier with the correlation distance kernel. To address this classification problem, a multi-class SVM with the ‘one against all’ approach is chosen.
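The stages listed above, from the $(p_i, v_i)$ pairs onwards, can be sketched as follows. This is a hedged illustration on synthetic data rather than the system evaluated in this work: it assumes NumPy and scikit-learn are available, uses 20 coefficients per frame so that two halves plus the duration give 41 dimensions, scales the correlation of Equation 2.20 by 1/J, and picks `C=10.0` arbitrarily. The triphone labels, accent names and function names are hypothetical.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def triphone_vector(frames, duration):
    """Two-half mean cepstral vector (40-D) with the duration appended (41-D)."""
    frames = np.asarray(frames, dtype=float)
    mid = len(frames) // 2
    halves = [frames[:mid].mean(axis=0), frames[mid:].mean(axis=0)]
    return np.concatenate(halves + [[duration]])


def supervector(realisations, common_triphones):
    """Vectorised Euclidean distance table built from (p_i, v_i) pairs."""
    grouped = {}
    for p, v in realisations:
        grouped.setdefault(p, []).append(v)
    # Average repeated triphones; every common triphone is assumed present.
    means = np.array([np.mean(grouped[p], axis=0) for p in common_triphones])
    diff = means[:, None, :] - means[None, :, :]
    table = np.sqrt((diff ** 2).sum(axis=-1))
    return table[np.triu_indices(len(common_triphones), k=1)]


def correlation_kernel(X, Y):
    """Gram matrix of correlations between the rows of X and the rows of Y."""
    Xz = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    Yz = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)
    return Xz @ Yz.T / X.shape[1]


# Synthetic demonstration data (invented, not real speech).
rng = np.random.default_rng(0)
triphones = ["k-ae+t", "f-aa+dh", "b-ah+t", "s-iy+n", "g-uw+d"]  # hypothetical
accents = ["accent_0", "accent_1", "accent_2"]                   # C = 3 groups
centroids = {a: rng.normal(size=(len(triphones), 20)) for a in accents}
durations = {a: rng.uniform(0.05, 0.2, size=len(triphones)) for a in accents}


def synth_utterance(accent):
    """One supervector: accent layout + speaker offset + small noise."""
    offset = rng.normal(size=20)  # speaker shift, cancels in the distances
    reals = []
    for i, p in enumerate(triphones):
        for _ in range(2):  # each common triphone occurs twice
            frames = centroids[accent][i] + offset + 0.05 * rng.normal(size=(8, 20))
            reals.append((p, triphone_vector(frames, durations[accent][i])))
    return supervector(reals, triphones)


def make_set(n_per_accent):
    X = np.array([synth_utterance(a) for a in accents for _ in range(n_per_accent)])
    y = np.array([a for a in accents for _ in range(n_per_accent)])
    return X, y


X_train, y_train = make_set(10)
X_test, y_test = make_set(3)

# 'One against all' multi-class SVM with the correlation kernel passed as a callable.
clf = OneVsRestClassifier(SVC(kernel=correlation_kernel, C=10.0))
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
```

Passing the kernel as a callable keeps the supervectors as the classifier input; an equivalent design would precompute the Gram matrix and use `SVC(kernel="precomputed")`.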