
Manoj Banik¹, Aloke Kumar Saha², Mohammed Rokibul Alam Kotwal³, Mohammad Nurul Huda³, Chowdhury Mofizur Rahman³

¹Department of CSE, Ahsanullah University of Science and Technology, Dhaka, Bangladesh

²Department of CSE, The University of Asia Pacific, Dhaka, Bangladesh

³Department of CSE, United International University, Dhaka, Bangladesh

[email protected], [email protected], [email protected], mnh@cse.uiu.ac.bd, [email protected]

Abstract—This paper presents a method for extracting distinctive phonetic features (DPFs). The method comprises three stages: i) acoustic feature extraction, ii) a multilayer neural network (MLN) and iii) an HMM-based classifier. In the first stage, acoustic features called local features (LFs) are extracted from the input speech. In the second stage, the MLN generates a 45-dimensional DPF vector from the 75-dimensional LFs. Finally, this 45-dimensional DPF vector is fed into a hidden Markov model (HMM) based classifier to obtain phoneme strings. Experiments on the Acoustic Society of Japan (ASJ) database show that the proposed DPF extractor provides a higher phoneme correct rate with fewer HMM mixture components than the other methods investigated.

Index Terms-hidden Markov model; Japanese language; phonetic feature extraction

I. INTRODUCTION

Many studies [1-3] have addressed feature extraction for speech recognition. Kirchhoff et al. [1] mapped acoustic features into distinctive phonetic features (DPFs) using a set of lower-level multilayer neural networks (MLNs) in five groups, in which each MLN was trained to extract the corresponding DPF in its group.

The DPFs output by the lower-level MLNs were input to a higher-level MLN, which produced acoustic likelihoods of subword units. Their work improved the recognition accuracy of spontaneous speech as well as of speech with additive noise. Jain et al. [2] also applied a set of MLNs, one for each BPF channel, to extract DPFs, and then used the DPFs as input to a higher-level MLN, similar to [1]. Although these methods provide recognition accuracy up to a certain level, they have some drawbacks: i) they require more mixture components to obtain higher recognition performance, ii) they incur a higher computational cost and iii) they use mel-frequency cepstral coefficient (MFCC) features.

To eliminate these problems, we have developed a method based on articulatory features (AFs), or distinctive phonetic features (DPFs). The method comprises three stages: i) extraction of acoustic features, called local features (LFs), from the input speech, ii) an MLN that obtains DPFs from the LFs and iii) an HMM-based classifier that performs phoneme recognition. This method has two advantages: (i) it uses LFs instead of MFCCs as input to the MLN and (ii) it provides a higher phoneme correct rate with fewer mixture components in the HMMs.

For evaluation purposes, we have investigated the following methods: i) MFCC+HMM, ii) MFCC+LF+HMM and iii) LF+MLN+HMM (the proposed method).

The paper is organized as follows: Section II discusses the necessity of DPFs. Section III explains the system configuration of the existing methods together with the proposed one. The experimental database and setup are described in Section IV, while the experimental results are analyzed in Section V. Finally, Section VI draws conclusions and gives some remarks on future work.

II. DISTINCTIVE PHONETIC FEATURES

A phoneme can easily be identified by using its unique DPF set [4-5]. The Japanese balanced DPF set for classifying Advanced Telecommunications Research Institute International (ATR) phonemes has 15 elements. These DPF elements are vocalic, high, low, intermediate between high and low <nil>, anterior, back, intermediate between anterior and back <nil>, coronal, plosive, affricate, continuant, voiced, unvoiced, nasal and semi-vowel. Here, present and absent elements of the AFs are indicated by “+” and “-” signs, respectively. Table I shows the Japanese balanced DPF set for classifying ATR phonemes.
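As a small illustration of how a phoneme maps to its 15-element DPF set, the sketch below encodes the /a/, /i/ and /o/ columns of Table I as binary vectors and counts the elements in which two phonemes differ. The names `DPF_NAMES`, `DPF_PATTERNS`, `dpf_vector` and `dpf_distance` are hypothetical helpers introduced here, and the element-count distance is only an illustrative measure, not one used in the paper.

```python
# Order of the 15 DPF elements, following the rows of Table I.
DPF_NAMES = [
    "vocalic", "high", "low", "nil (high/low)", "anterior", "back",
    "nil (anterior/back)", "coronal", "plosive", "affricate", "continuant",
    "voiced", "unvoiced", "nasal", "semi-vowel",
]

# "+" / "-" patterns read column-wise from Table I for three vowels.
DPF_PATTERNS = {
    "a": "+-+--+----++---",
    "i": "++----+---++---",
    "o": "+--+-+----++---",
}


def dpf_vector(phoneme):
    """Return the 15-dimensional binary DPF vector (1 = present, 0 = absent)."""
    return [1 if sign == "+" else 0 for sign in DPF_PATTERNS[phoneme]]


def dpf_distance(p, q):
    """Count the DPF elements in which two phonemes differ (illustrative only)."""
    return sum(a != b for a, b in zip(dpf_vector(p), dpf_vector(q)))


if __name__ == "__main__":
    for name, value in zip(DPF_NAMES, dpf_vector("a")):
        print(f"{name:20s} {value}")
    print(dpf_distance("a", "o"), dpf_distance("a", "i"))
```

Under this encoding, /a/ and /o/ differ in two elements, while /a/ and /i/ differ in four.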

III. PHONEME RECOGNITION METHODS

A. The Existing Methods

1) MFCC-based Method: The conventional approach in ASR systems uses MFCCs of 38 dimensions (12-MFCC, 12-ΔMFCC, 12-ΔΔMFCC, P and ΔP, where P stands for the raw energy of the input speech signal) as the feature vector fed into an HMM-based classifier. The system diagram is shown in Fig. 1.

The parameters (mean and diagonal covariance of the hidden Markov model of each phoneme) are estimated from MFCC training data using the Baum-Welch algorithm.

For different numbers of mixture components, the training data are clustered using the K-means algorithm. During the recognition phase, the most likely phoneme sequence for an input utterance is obtained using the Forward.

2) MFCC-LF based Method: In the acoustic feature extraction stage, the input speech is first converted into 25-dimensional local features (LFs) that represent variations in the spectrum along the time and frequency axes [6]. These 25-dimensional LFs are combined with the MFCCs of 38 dimensions (12-MFCC, 12-ΔMFCC, 12-ΔΔMFCC, P and ΔP, where P stands for the raw energy of the input speech signal) to obtain a 63-dimensional feature vector. This feature vector is fed into an HMM-based classifier to obtain phoneme recognition performance.

B. Proposed Method

The 25-dimensional LF vectors extracted through the procedure described in Section III.A(2) are entered into an MLN with four layers, including two hidden layers, after combining the current frame x_t with the two frames that are three points before and after it (x_{t-3}, x_{t+3}). The MLN has 45 output units (15×3) corresponding to a context-dependent DPF vector, which consists of three DPF vectors (a preceding-context DPF, a current DPF, and a following-context DPF) of 15 dimensions each.
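The frame combination described above can be sketched as follows, assuming NumPy. The function name `stack_context`, the ordering of the three frames inside the 75-dimensional vector, and the handling of utterance edges (repeating the first or last frame) are assumptions made for illustration; the paper does not specify them.

```python
import numpy as np


def stack_context(lf, offset=3):
    """Combine each 25-dim LF frame with the frames `offset` steps before and after.

    lf: array of shape (T, 25), one LF vector per frame.
    Returns an array of shape (T, 75) laid out as [x_{t-3}, x_t, x_{t+3}].
    Edge frames reuse the first/last frame (an assumption, not from the paper).
    """
    num_frames = lf.shape[0]
    idx = np.arange(num_frames)
    prev_idx = np.clip(idx - offset, 0, num_frames - 1)
    next_idx = np.clip(idx + offset, 0, num_frames - 1)
    return np.concatenate([lf[prev_idx], lf, lf[next_idx]], axis=1)


if __name__ == "__main__":
    lf = np.random.default_rng(0).normal(size=(200, 25))
    print(stack_context(lf).shape)  # one 75-dim MLN input per frame
```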

The two hidden layers consist of 256 and 96 units, respectively, counted from the input layer. The fifteen DPF elements of the Japanese balanced DPF set are used. The MLN is trained with the standard back-propagation algorithm to output a value of one for the DPF elements corresponding to an input phoneme and its adjacent phonemes. This method has several advantages: it provides (a) robust features in different acoustic environments and (b) higher WCR with fewer mixture components in the HMMs. On the other hand, this method still has some drawbacks: it shows some misclassification caused by co-articulation at phoneme boundaries, and it cannot solve co-articulation problems because a single MLN is unable to represent context information.
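A minimal sketch of the forward pass of such a four-layer MLN (75 inputs, hidden layers of 256 and 96 units, 45 outputs) is given below, assuming NumPy and the sigmoid 1/(1+exp(-x)) on the hidden and output layers as stated in the experimental setup. The weights here are random placeholders, and `init_mln` and `mln_forward` are hypothetical names; the back-propagation training used in the paper is not shown.

```python
import numpy as np

# Layer sizes: 75-dim stacked LFs -> 256 -> 96 -> 45-dim context-dependent DPFs.
LAYER_SIZES = [75, 256, 96, 45]


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def init_mln(rng, scale=0.1):
    """Create random (weight, bias) pairs; placeholders for trained parameters."""
    return [
        (rng.normal(0.0, scale, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
    ]


def mln_forward(params, x):
    """Propagate a batch of inputs of shape (N, 75) to DPF outputs of shape (N, 45)."""
    h = x
    for weight, bias in params:
        h = sigmoid(h @ weight + bias)
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_mln(rng)
    dpf = mln_forward(params, rng.normal(size=(4, 75)))
    # Split the 45 outputs into preceding, current and following DPF vectors.
    preceding, current, following = np.split(dpf, 3, axis=1)
    print(dpf.shape, current.shape)
```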

Figure 1. MFCC-based system.

Figure 2. Proposed system.

TABLE I. JAPANESE BALANCED DPF-SET FOR CLASSIFYING ATR PHONEMES

DPFs a i u e o N w y j my ky dy by gy ny hy ry py p t k ts ch b d g z m n s sh h f r

vocalic ＋＋＋＋＋－－－－－－－－－－－－－－－－－－－－－－－－－－－－－

high －＋＋－－－＋＋＋＋＋＋＋＋＋＋＋＋－－＋－＋－－＋－－－－＋－＋－

low ＋－－－－－－－－－－－－－－－－－－－－－－－－－－－－－－＋－－

nil (high/low) －－－＋＋＋－－－－－－－－－－－－＋＋－＋－＋＋－＋＋＋＋－－－＋

anterior －－－－－－－－＋＋－＋＋－＋－＋＋＋＋－＋＋＋＋－＋＋＋＋＋－＋＋

back ＋－＋－＋－＋－－－－－－－－－－－－－＋－－－－＋－－－－－－－－

nil (anterior/back) －＋－＋－＋－＋－－＋－－＋－＋－－－－－－－－－－－－－－－＋－－

coronal －－－－－－－－＋－－＋－－＋－＋－－＋－＋＋－＋－＋－＋＋＋－－＋

plosive －－－－－－－－－－＋＋＋＋－－－＋＋＋＋－－＋＋＋－－－－－－－－

affricative －－－－－－－－＋－－－－－－－－－－－－＋＋－－－＋－－－－－－－

continuant ＋＋＋＋＋＋＋＋＋－－－－－－－－－－－－－－－－－＋－－＋＋＋＋－

voiced ＋＋＋＋＋＋＋＋＋＋－＋＋＋＋－＋－－－－－－＋＋＋＋＋＋－－－－＋

unvoiced －－－－－－－－－－＋－－－－＋－＋＋＋＋＋＋－－－－－－＋＋＋＋－

nasal －－－－－＋－－－＋－－－－＋－－－－－－－－－－－－＋＋－－－－－

semi-vowel －－－－－－＋＋－＋＋＋＋＋＋＋＋＋－－－－－－－－－－－－－－－＋

IV. EXPERIMENTS

A. Speech Database

The following two clean data sets are used in our experiments.

D1) Training data set: A subset of the Acoustic Society of Japan (ASJ) Continuous Speech Database [7] comprising 4513 sentences uttered by 30 different male speakers (16 kHz, 16 bit) is used.

D2) Test data set: This data set comprises 2379 sentences from the Japanese Newspaper Article Sentences (JNAS) [8] database uttered by 16 different male speakers (16 kHz, 16 bit).

B. Experimental Setup

The frame length and frame shift are set to 25 ms and 10 ms, respectively, to obtain the acoustic features, MFCCs and LFs, from the input speech. The LFs form a 25-dimensional vector consisting of 12 delta coefficients along the time axis, 12 delta coefficients along the frequency axis, and the delta coefficient of the log power of the raw speech signal. The MFCCs, on the other hand, comprise 38 dimensions (12-MFCC, 12-ΔMFCC, 12-ΔΔMFCC, P and ΔP, where P is the power of the raw speech signal).
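To make the framing concrete, the sketch below converts the 25 ms and 10 ms settings into sample counts at the 16 kHz sampling rate of the data sets and counts the frames in an utterance. It reads the 10 ms value as the shift between successive frames and drops any incomplete final frame; both choices, and the function names, are assumptions for illustration.

```python
def frame_params(sample_rate_hz=16000, frame_ms=25, shift_ms=10):
    """Return (frame length, frame shift) in samples."""
    frame_len = sample_rate_hz * frame_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return frame_len, shift


def num_frames(n_samples, frame_len, shift):
    """Number of complete frames that fit in a signal of n_samples samples."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // shift


if __name__ == "__main__":
    frame_len, shift = frame_params()
    print(frame_len, shift)                      # samples per frame and per shift
    print(num_frames(16000, frame_len, shift))   # frames in one second of speech
```

With these settings each frame spans 400 samples, successive frames start 160 samples apart, and one second of speech yields 98 complete frames.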

To measure the phoneme error rate (PER), the D2 data set is evaluated using an HMM-based classifier.

The D1 data set is used to design 37 Japanese monophone HMMs, which are left-to-right models with five states and three loops. The input features for the classifier are MFCCs of 38 dimensions, MFCC-LF of 63 dimensions and DPFs of 45 dimensions. In the HMMs, the output probabilities are represented by Gaussian mixtures with diagonal covariance matrices. The number of mixture components is set to 1, 2, 4 and 8. In our experiments with the MLNs, the non-linear function for the hidden and output layers is a sigmoid ranging from 0 to 1, 1/(1+exp(-x)).

To obtain the PER, we have investigated the following methods.

i) MFCC+HMM ii) MFCC+LF+HMM iii) LF+MLN+HMM (the proposed method)

V. EXPERIMENTAL RESULTS AND DISCUSSION

The DPF-based method (i) gives features that are robust to different acoustic environments with fewer mixture components in the HMMs, and (ii) improves the margin between acoustic likelihoods. Figs. 3(a) and 3(b) show the phoneme distances of the five Japanese vowels in an utterance /ioi/, calculated with a mel-frequency cepstral coefficient (MFCC)-based ASR system and a DPF-based system, respectively. In both systems, each distance is the Mahalanobis distance between a given input vector and the mean and covariance of the corresponding vowel in a single-state model. The input sequence in the figures, /i/../i//o/../o//i/../i/, shows the phoneme for each frame and has 20 frames in total, in which the first three frames, the middle 13 frames, and the last four frames are the phonemes /i/, /o/, and /i/, respectively. The MFCC-based system (Fig. 3(a)) shows seven misclassifications (/u/ output for /i/ input) at frames 4, 5, 13, 14, 15, 16, and 17, while the DPF-based system (Fig. 3(b)) shows two misclassifications (/o/ and /u/ output for /i/ input) at frames 17 and 18.

Therefore, the DPF-based system produces fewer misclassifications.
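The per-frame distance used in this comparison can be sketched as below, assuming NumPy and a diagonal covariance for each single-state vowel model (consistent with the diagonal matrices used in the HMMs, but an assumption for this figure). The two-dimensional toy models and the names `mahalanobis_diag` and `classify` are invented for illustration and are not values from the experiments.

```python
import numpy as np


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance from x to a Gaussian with diagonal covariance `var`."""
    diff = x - mean
    return float(np.sqrt(np.sum(diff * diff / var)))


def classify(x, models):
    """Return the name of the model whose distance to x is smallest."""
    return min(models, key=lambda name: mahalanobis_diag(x, *models[name]))


if __name__ == "__main__":
    # Toy (mean, variance) pairs; real models would be 38- or 45-dimensional.
    models = {
        "i": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
        "o": (np.array([3.0, 3.0]), np.array([1.0, 4.0])),
    }
    frame = np.array([0.5, 0.2])
    print(classify(frame, models))
```

A frame is counted as misclassified when the nearest vowel model is not the vowel actually spoken in that frame.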

Fig. 4 shows the PER for the investigated methods. The figure shows that the proposed system provides a lower PER for all the mixture components investigated except 8. For example, with 2 mixture components, the proposed method obtains a PER of 23.94%, while the corresponding values for the methods based on MFCCs and MFCC+LF are 27.6% and 27.47%, respectively.

Figure 3. Phoneme classification for a) MFCC-based system and b) DPF-based system.

Figure 4. Phoneme error rate for investigated methods.

Table II compares the computation time of the MFCC+LF+HMM method and the proposed one. We have measured the HMM time required by both methods using the formula mS²T, where m, S and T denote the number of mixture components, the number of states and the length of the observation sequence, respectively. For MFCC+LF+HMM, the required time is 4×5²×200 (=20K), while the corresponding time for the proposed method is 2×5²×200 (=10K), assuming an observation sequence of 200 frames.

Therefore, the proposed method is faster than the method based on MFCC+LF+HMM.
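The arithmetic behind the two figures can be checked with a short sketch of the mS²T formula. The function name `hmm_multiplications` is a hypothetical helper; the inputs are the values quoted in the text (five states, 200 frames, 4 versus 2 mixture components).

```python
def hmm_multiplications(m, S, T):
    """Cost estimate m * S^2 * T for m mixtures, S states and T frames."""
    return m * S**2 * T


if __name__ == "__main__":
    baseline = hmm_multiplications(m=4, S=5, T=200)   # MFCC+LF+HMM
    proposed = hmm_multiplications(m=2, S=5, T=200)   # proposed method
    print(baseline, proposed, baseline / proposed)
```

Because the cost is linear in m, halving the number of mixture components halves the estimated number of multiplications.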

VI. CONCLUSION

This paper has presented a method for extracting DPFs and has evaluated phoneme recognition performance using the extracted features. The findings are as follows:

(a) With fewer mixture components, the proposed method provides better results than the other methods investigated.

(b) The proposed method requires less computation time since it needs fewer mixture components.

In the near future, the authors would like to evaluate Bengali articulatory feature extraction using the method proposed in this paper.

REFERENCES

[1] K. Kirchhoff, et al., "Combining acoustic and articulatory feature information for robust ASR," Speech Comm., 2002.

[2] P. Jain, et al., "Distributed speech recognition using noise-robust MFCC and TRAPS-estimated manner features," Proc. ICSLP’02, pp. 473-476, 2002.

[3] Automatic speech recognition with neural networks: Beyond nonparametric models, SpringerLink, pp. 104-121, vol. 745, 1993.

[4] S. King and P. Taylor, "Detection of Phonological Features in Continuous Speech using Neural Networks," Computer Speech and Language, vol. 14, no. 4, pp. 333-345, 2000.

[5] E. Eide, "Distinctive Features for Use in an Automatic Speech Recognition System," Proc. Eurospeech 2001, vol. III, pp. 1613-1616, 2001.

[6] T. Nitta, "Feature extraction for speech recognition based on orthogonal acoustic-feature planes and LDA," Proc. ICASSP’99, pp. 421-424, 1999.

[7] T. Kobayashi, et al., "ASJ Continuous Speech Corpus for Research," Acoustic Society of Japan Trans., vol. 48, no. 12, pp. 888-893, 1992.

[8] JNAS: Japanese Newspaper Article Sentences. Available: http://www.milab.is.tsukuba.ac.jp/jnas/instruct.html

TABLE II. COMPARISON OF TIME COMPLEXITY BETWEEN MFCC+LF+HMM AND PROPOSED METHODS

                          MFCC+LF+HMM   Proposed
PCR=76.43                 4 Mix         -
PCR=76.06                 -             2 Mix
Required Multiplication   20 k          10 k
