5.5 The MLP Algorithm for Frication / Voicing / Silence Detection: MLP-FVS
5.5.1 Preprocessing
5.5.2 Network Architecture
5.5.3 Databases
* Anechoic Speech
* Reverberant Speech
* Speech in Noise
* Speech for Lipreading Tests
5.5.4 Training
CHAPTER 5 FRICATION INFORMATION DETECTION: OUTLINE OF THE FRICATION/VOICING/SILENCE PATTERN CLASSIFICATION ALGORITHM
CHAPTER 5
FRICATION INFORMATION DETECTION: FRICATION / VOICING / SILENCE PATTERN CLASSIFIER - MLP-FVS
This chapter first discusses the purpose of frication detection in speech pattern processing for listeners with profound hearing loss, followed by a brief summary of the conventional and artificial neural network pattern recognition techniques for broad acoustic/phonetic feature detection. A detailed introduction to the multi-layer perceptron (MLP) method employed in this study is then given, followed by a detailed description of the frication / voicing / silence detection algorithm "MLP-FVS".
The theoretical issues concerning selection of input pattern vectors, training and adaptation for the MLP algorithm will be discussed in the next chapter.
5.1 Purpose of Frication Detection
In this thesis, frication refers to the turbulent noise component of sounds which are contrastive at the phone and phoneme levels in normal phonetic description. The term frication is used here at the level of signal / sound description for any random component, whether of, for example, frication in a fricative, burst in a plosive, or aspiration at the subphonemic level.
The production of these noise-like sounds involves an obstruction of the vocal tract. Plosive sounds are produced by a complete closure of the articulators: pressure is built up behind the closure point, and a sudden turbulent flow of air is produced when the closure is released. Fricatives are produced with a narrow constriction. The breath stream, passing through the constriction, becomes turbulent because of the non-laminar flow of the airstream at the walls of the constriction. This generates a noise-like hissing sound which may or may not be associated with voicing. An affricate is simply a sequence of a stop followed by a homorganic (same vocal tract place) fricative.
The spectrograms of Chinese plosives, fricatives and affricates are shown in Figures 3-4, 3-5, 3-6, and 3-7. Usually the main energy of most fricative sounds lies in the frequency range 1 - 8 kHz (except for the glottal fricative /h/, where the lower limit of frequency varies between 400 and 700 Hz, and energy lies mainly between 400 and 1700 Hz). The energy of the aspirated plosives /p/, /t/, /k/ is spread between 200 and 5000 Hz or higher.
As these frication sounds have hardly any energy below 1000 Hz, they are difficult to hear for profoundly hearing-impaired listeners, who have little or no hearing above 1000 Hz.
The motivation for the development of a frication detection algorithm in this study is to detect these frication sounds and transform or recode them into low frequency noises within the residual hearing range of profoundly deaf subjects (usually <= 500 Hz).
In Chinese, 15 out of the 23 initials contain sounds which are contrastively dependent on frication. These initials (plosives, affricates and fricatives) carry much information in speech communication in Chinese, where the morpheme tends to be coextensive with the syllable and the initial of the syllable carries a far greater contrastive load than the final (see section 3.2.3 for Chinese initials and section 3.2.4.4 for phonotactics).
Since the 'mlp-fvs' algorithm is designed to detect frication excitation, it will simultaneously detect the syllabic onset of those syllables that have an initial with frication sound.
The 'mlp-fvs' algorithm uses the multi-layer perceptron pattern classification technique to detect the major acoustic aspects of speech: frication, voicing and silence. This can be combined with the fundamental period extraction algorithm developed by
Howard (1991) to generate a compound speech pattern signal to be used in a new type of hearing aid for the profoundly hearing-impaired population.
The robust detection of frication / voicing / silence also has a potential application in large vocabulary speech recognition systems, where it can reduce the computational expense of an exhaustive search over candidate units. Accurate broad acoustic-phonetic categorization can be used to reduce the search space (Larar, 1986): words whose broad-class pattern clearly does not match the input can be eliminated without the computation needed for precise matching, so there is no need to attempt a match with inappropriate candidates. The large vocabulary problem can thus be divided into two stages: first, reduce the search space to only those words that are similar in some sense to the test words; then apply accurate matching techniques in the second stage.
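The two-stage idea can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the lexicon, the broad-class patterns (F = frication, V = voicing, S = silence) and the stand-in scoring function are hypothetical and are not taken from this study or from Larar (1986).

```python
# Hypothetical lexicon: each word is paired with an assumed broad-class
# pattern (F = frication, V = voicing, S = silence). Illustrative only.
LEXICON = {
    "sha": "FV",
    "fa": "FV",
    "ma": "V",
    "an": "V",
    "ta": "SFV",
    "pa": "SFV",
}


def shortlist(observed_pattern, lexicon):
    """Stage 1: keep only words whose broad-class pattern matches."""
    return [word for word, pattern in lexicon.items() if pattern == observed_pattern]


def recognise(observed_pattern, lexicon, detailed_score):
    """Stage 2: run the expensive matcher on the shortlist only."""
    candidates = shortlist(observed_pattern, lexicon)
    return max(candidates, key=detailed_score) if candidates else None


def toy_score(word):
    """Stand-in for a precise acoustic matcher (hypothetical)."""
    return 1.0 if word == "fa" else 0.0


print(shortlist("FV", LEXICON))
print(recognise("FV", LEXICON, toy_score))
```

In this toy case the detailed matcher is called on two candidates rather than six; the saving would grow with the size of the vocabulary.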
5.2 Methods of Frication Detection
In order to recode frication information into the residual hearing range of the profoundly hearing impaired, it is necessary to develop an accurate classification method for major acoustic events in speech. Some of the most basic acoustic categories in human speech are given by the three-way distinction between silence regions, regions of voiced speech, and regions of frication (Waibel, 1988). A number of studies in different application areas have addressed the problem of detecting these or similar regions such as voiced, unvoiced, and silence (Noll, 1967; Atal & Hanauer, 1971; Markel, 1972; Atal & Rabiner, 1976; Siegel, 1979; Siegel & Bessey, 1982; Sarma & Venugopal, 1978; Cox & Timothy, 1980; De Souza, 1983; Bendiksen & Steiglitz, 1990; Hahn, 1990; Ran & Millar, 1991; Bengio et al., 1991). Before pattern recognition methods were introduced to address this classification problem (Atal & Rabiner, 1976), methods for voiced-unvoiced decision usually worked in conjunction with pitch analysis. For example, in the well known cepstral pitch detector (Noll, 1967), the voiced/unvoiced decision is made on the basis of the amplitude of the largest peak in the cepstrum (the cepstrum is the inverse Fourier transform of the logarithm of the power spectrum of a signal). Other methods for voiced-unvoiced determination include those based on the short-time average zero-crossing rate, the auto-correlation function, the ratio of low- to high-frequency energy (Rabiner & Schafer 1979, Atal & Rabiner 1976, O'Shaughnessy 1987), and counting bit alternations in the bit stream from linear delta modulation (Un & Lee, 1980). The disadvantage of these approaches is that only one feature is used in the decision procedure. A binary decision using a simple threshold test on any one of these features is inadequate for voiced-unvoiced determination, as shown in Figure 5-1.
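The cepstral definition given above can be made concrete with a minimal sketch. It is not the detector of Noll (1967): the frame length, the period, the quefrency search range and the synthetic signals are all assumptions chosen for illustration, and a slow direct DFT is used only to keep the example free of dependencies.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (slow but dependency-free)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log power spectrum of the frame."""
    log_power = [math.log(abs(c) ** 2 + floor) for c in dft(frame)]
    return [c.real for c in dft(log_power, inverse=True)]


def cepstral_peak(frame, q_min, q_max):
    """Largest cepstral value, and its quefrency, within [q_min, q_max)."""
    ceps = cepstrum(frame)
    q = max(range(q_min, q_max), key=lambda i: ceps[i])
    return q, ceps[q]


# Hypothetical test frames: a noisy pulse train (voiced-like) and white noise.
N, PERIOD = 256, 32
rng = random.Random(1)
voiced_like = [(1.0 if n % PERIOD == 0 else 0.0) + rng.gauss(0.0, 0.01) for n in range(N)]
noise_like = [rng.gauss(0.0, 1.0) for _ in range(N)]

voiced_q, voiced_peak = cepstral_peak(voiced_like, 16, 48)
noise_q, noise_peak = cepstral_peak(noise_like, 16, 48)
print(voiced_q, round(voiced_peak, 2), round(noise_peak, 2))
```

The periodic frame produces a pronounced cepstral peak at the quefrency equal to its period, while the noise frame produces no comparable peak, which is the property a threshold on peak amplitude relies on.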
Figure 5-1 Zero-Crossing Rate and Auto-Correlation of the Chinese Fricatives sh /ʂ/, h /χ/ and f /f/, plotted against time (s).
(a) Speech signal which contains the Chinese fricatives sh /ʂ/, h /χ/, and f /f/, S/N = 20 dB.
(b) Zero-crossing rate of (a).
(c) Auto-correlation at unit sample delay of (a).
(d) Speech signal which contains the same token as in (a), but S/N = 10 dB.
(e) Zero-crossing rate of (d).
(f) Auto-correlation at unit sample delay of (d).
(g) Pinyin annotation of the Chinese fricatives in the speech.
It can be seen from Figure 5-1 that the zero-crossing rate for the fricative sh /ʂ/ is high, but much lower for the fricatives h /χ/ and f /f/. The auto-correlation function is only distinctive for the fricative sh /ʂ/. These two features deteriorate when the S/N level becomes lower, as shown in the second sh /ʂ/ position.
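The two features plotted in Figure 5-1 can be computed per frame as in the following minimal sketch. The frame length, sampling rate and synthetic test signals are assumptions for illustration and are not the speech data used in this study.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def unit_lag_autocorrelation(frame):
    """Auto-correlation at a delay of one sample, normalised by frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(frame, frame[1:])) / energy


# Hypothetical 20 ms frames at an assumed 16 kHz sampling rate.
SAMPLE_RATE = 16000
N = 320
rng = random.Random(0)
voiced_like = [math.sin(2 * math.pi * 150 * n / SAMPLE_RATE) for n in range(N)]
frication_like = [rng.gauss(0.0, 1.0) for _ in range(N)]

for name, frame in [("voiced-like", voiced_like), ("frication-like", frication_like)]:
    print(name, round(zero_crossing_rate(frame), 3), round(unit_lag_autocorrelation(frame), 3))
```

On these idealised signals both features separate the two frames cleanly. Real fricatives with weaker high-frequency energy, or frames at a lower S/N, sit much closer to any single threshold, which is the weakness noted above.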
In this study, a pattern recognition approach will be described for classifying the speech signal into three classes: voicing, silence, and frication. The advantage of the pattern recognition method is that it provides an effective way of combining the contributions of a number of speech features (which individually may not be sufficient to discriminate between the classes) into a single measure capable of providing reliable separation between the three classes.
5.3 Pattern Recognition Techniques
This section gives a brief overview of pattern recognition techniques and a comparison of conventional and neural network pattern recognisers. This is followed by a detailed discussion of the multi-layer perceptron classifier, the technique that is employed in this study.
5.3.1 Definition
A pattern can be considered to be an array of elements. It can be represented as a multi-dimensional vector with the components in the vector corresponding to the elements in the pattern: $X = (x_1, x_2, \ldots, x_n)$, assuming $x_1, x_2, \ldots, x_n$ are orthogonal. Thus a pattern can be considered to be a point in an n-dimensional Euclidean space.
Pattern recognition is the process by which the input vectors are classified into significantly different categories. The system that performs this function is called a pattern classifier. In geometric terms, a pattern classifier is a system that divides the input space into a given set of discrete regions, which correspond to different categories. The surfaces that divide the points in the input space are known as decision surfaces.
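As a minimal sketch of this geometric view, the following example divides a two-dimensional input space into three regions using linear discriminant functions. The weights, offsets and class labels are hypothetical values chosen for illustration; the decision surfaces are the lines along which two discriminants are equal.

```python
# Hypothetical linear discriminants over a 2-D feature space:
# label -> (weight vector, offset). Values are illustrative only.
DISCRIMINANTS = {
    "silence": ((-1.0, -1.0), 1.0),
    "voicing": ((1.0, 0.0), 0.0),
    "frication": ((0.0, 1.0), 0.0),
}


def classify(x):
    """Assign x to the region whose discriminant value is largest."""
    scores = {
        label: sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for label, (w, b) in DISCRIMINANTS.items()
    }
    return max(scores, key=scores.get)


for point in [(0.1, 0.1), (2.0, 0.5), (0.3, 1.5)]:
    print(point, classify(point))
```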
5.3.2 Basic Structure of a Conventional Pattern Classifier
The structure of a conventional pattern classifier can be summarized in Figure 5-2:
[Block diagram of the conventional pattern classifier: Input Symbols -> Measurement -> Pre-processing -> Matching score computation -> Decision function, with parameters estimated from training data.]
Figure 5-2 Basic Structure of a Conventional Pattern Classifier.
Pre-processing
The first stage in any pattern recognition task is usually referred to as pre-processing. The purpose of the pre-processing is mainly feature extraction. This process can often be complicated and it constitutes one of the most important parts of the pattern recognizer. In this stage, an input measurement is transformed into a set of components (features) which are useful in the specific discrimination task. The selection of suitable features is crucial for the performance of the whole pattern recognition system.
Decision Functions
Decision functions are used to compute matching scores and select output class, as shown in Figure 5-2. Decision functions used for conventional pattern classification
fall into two types: classification based on distance functions, and classification based on probabilistic functions.
Classification using distance functions calculates the similarity between the input pattern and a set of exemplar patterns as a function of their geometric proximity in the vector space, to determine which class the input should be assigned to. The exemplar patterns are those which are most representative of each class. In the case of classification using probabilistic functions, the task is to find a statistical function that leads to the optimal decision. The optimal decision is the one with the minimum cost for the decisions made by the classifier. A priori knowledge of the input pattern distributions is assumed. An assumption that the input patterns follow a Gaussian distribution is often made, in which case the pattern classes need only be represented by their means and covariances. The main decision functions of each type are listed in Figure 5-3. Further details may be found in standard sources (Tou & Gonzalez, 1974; Duda & Hart, 1973; James, 1987).
Decision Function
    Distance Function
        Nearest Neighbour Classification
        K-nearest Neighbour Classification
        Heuristic Cluster Seeking Algorithms
        Performance Index Cluster Seeking Algorithm
    Probabilistic Function
        Bayes' Classifier
        Minimax Classifier
        Neyman-Pearson Classifier
Figure 5-3 Decision Functions of Conventional Pattern Recognition Techniques.
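The two families of decision function can be contrasted with a minimal sketch: a nearest neighbour rule (distance function) and a Gaussian classifier that picks the class with the largest posterior (probabilistic function). The exemplars, means, variances and priors are hypothetical; the Gaussian model assumes a diagonal covariance and equal misclassification costs, which are simplifications made for this example.

```python
import math


def nearest_neighbour(x, exemplars):
    """Return the label of the exemplar closest to x (Euclidean distance)."""
    return min(exemplars, key=lambda item: math.dist(x, item[0]))[1]


def gaussian_log_likelihood(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, mean, var)
    )


def bayes_classify(x, models):
    """Pick the class maximising log prior + log likelihood."""
    return max(
        models,
        key=lambda label: math.log(models[label]["prior"])
        + gaussian_log_likelihood(x, models[label]["mean"], models[label]["var"]),
    )


# Hypothetical exemplars in a 2-D feature space (values are illustrative).
EXEMPLARS = [((0.02, 0.10), "silence"), ((0.05, 0.90), "voicing"), ((0.45, 0.10), "frication")]
MODELS = {
    label: {"mean": mean, "var": (0.01, 0.04), "prior": 1 / 3}
    for mean, label in EXEMPLARS
}

x = (0.40, 0.20)
print(nearest_neighbour(x, EXEMPLARS), bayes_classify(x, MODELS))
```

With equal priors and a shared covariance the two rules agree here; they can differ once the classes have different spreads or prior probabilities.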
5.3.3 Artificial Neural Networks (ANN)
Introduction
Artificial Neural Networks (ANN) can be described as an attempt to mimic aspects of human brain function and have been studied for many years in the hope of achieving human-like performance in the field of pattern recognition, especially speech and image recognition. The superior performance of humans in speech and image recognition compared to the best computer systems is believed to arise from the massively parallel use of many similar basic computing units (neurons) in the biological nervous system. It is estimated that there are about $10^{10}$ neurons in the human brain (Lippmann, 1987).
Neural network (NN) researchers believe that the human brain builds up its own hidden rules through what is usually called 'experience'. Neural net models explore many competing hypotheses simultaneously using massively parallel nets composed of many computational elements connected by links with variable weights, and it is within these structures that hidden rules grow and can be executed (Lippmann, 1987; Aleksander, 1989).
Comparisons Between Traditional Classifiers and Artificial Neural Networks
Apart from the massive parallelism of neural networks, and their learning ability, another important difference between ANNs and traditional classifiers is that ANNs are non-parametric and make weaker assumptions than traditional statistical classifiers concerning the underlying input distributions. This is shown in Figure 5-2 and Figure 5-4. The traditional classifier in Figure 5-2 first computes matching scores for each class and then selects the class with the maximum score. The inputs to the first stage are symbols representing values of the N input elements. These symbols are entered sequentially. An algorithm computes a matching score for each of the M classes which indicates how closely the input matches the exemplar pattern for each class. This exemplar pattern is that pattern which is most representative of that class. In many situations a probabilistic model is used to model the generation of input patterns from exemplars and the matching score represents the likelihood or probability that the input
pattern was generated from each of the M possible exemplars. In those cases, strong assumptions are typically made concerning the underlying distributions of the input elements. Matching scores are coded into symbolic representations and passed sequentially to the second stage of the classifier. Then they are decoded and the class with the maximum score is selected.
A neural net classifier is shown in Figure 5-4. The input values are fed in parallel to the first stage via N input connections. The first stage computes matching scores and outputs these scores in parallel to the next stage. The second stage has one output for each of the M classes. After classification is complete, only the output corresponding to the most likely class will be "high"; the other outputs will be "low". The correct class and the classifier outputs can be fed back to the first stage of the classifier to adapt weights using a learning algorithm.
[Block diagram of the neural net classifier: Inputs -> Measurement -> Pre-processing -> Matching score computation -> Select output class -> Output, with a feedback path that adapts the weights according to the outputs and the correct class.]
Figure 5-4 Block Diagram of a Neural Network Classifier.
Neural Network Models
ANN models are specified by three elements:
1. Net topology.
2. Node characteristics.
3. Training or learning rules.
The nonlinearity used within nodes is one important factor in the capabilities of ANN models. The most commonly used node sums N weighted inputs and passes the result through a nonlinear function as shown in Figure 5-5. The node is characterized by an internal threshold $\theta_j$ and by the type of nonlinearity.
$$\alpha_j = \sum_{i=0}^{N-1} w_{ij} x_i - \theta_j, \qquad y_j = f(\alpha_j)$$
[Three panels plot the node output against $\alpha$: Hard Limiter, Threshold Logic, Sigmoid.]
Figure 5-5 Nonlinear Computational Nodes.
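The node computation and the three nonlinearities named in Figure 5-5 can be sketched as follows. The exact shapes are assumptions based on their common textbook forms (a hard limiter switching between -1 and +1, threshold logic rising linearly from 0 and saturating at +1, and a logistic sigmoid), and the inputs, weights and threshold are hypothetical.

```python
import math


def hard_limiter(a):
    """Output +1 for non-negative activation, otherwise -1."""
    return 1.0 if a >= 0 else -1.0


def threshold_logic(a):
    """Linear between 0 and 1, clipped outside that range."""
    return min(1.0, max(0.0, a))


def sigmoid(a):
    """Smooth logistic nonlinearity with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def node_output(inputs, weights, theta, nonlinearity):
    """Weighted sum of the inputs minus the threshold, passed through f."""
    a = sum(w * x for w, x in zip(weights, inputs)) - theta
    return nonlinearity(a)


# Hypothetical values for illustration only.
INPUTS, WEIGHTS, THETA = (1.0, 0.5), (0.8, -0.4), 0.1
for f in (hard_limiter, threshold_logic, sigmoid):
    print(f.__name__, round(node_output(INPUTS, WEIGHTS, THETA, f), 4))
```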
Most ANN algorithms adapt connection weights on the basis of current results as is shown in Figure 5-4. Adaptation or learning is a major focus of ANN research. ANNs can be divided into those which are supervised and those which are unsupervised, depending upon how their training / learning is carried out. In supervised learning, the nets are provided with information or labelling that specifies the correct class for new input patterns during training. In unsupervised learning, no information concerning the correct class is provided to the nets during training. In this case, the nets apply some general mapping to an input pattern, and group it with other input patterns with similar characteristics. Patterns that are similar are clustered together, whereas patterns that are different form separate clusters.
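Supervised adaptation of weights can be illustrated with the classic single-node perceptron rule, in which the labelled correct class drives each weight change. This is a generic sketch rather than the training procedure used in this study; the toy data set, learning rate and number of epochs are assumptions.

```python
def predict(x, weights, theta):
    """Hard-limiting node: +1 if the weighted sum reaches the threshold."""
    a = sum(w * xi for w, xi in zip(weights, x)) - theta
    return 1 if a >= 0 else -1


def train_perceptron(samples, epochs=20, rate=0.5):
    """Adjust weights and threshold whenever the labelled class is missed."""
    weights = [0.0] * len(samples[0][0])
    theta = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(x, weights, theta)  # 0 when correct
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            theta -= rate * error
    return weights, theta


# Hypothetical, linearly separable training set with class labels +1 / -1.
SAMPLES = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, theta = train_perceptron(SAMPLES)
print(weights, theta)
print([predict(x, weights, theta) for x, _ in SAMPLES])
```

An unsupervised method would receive the same input vectors without the target labels and would have to group them by similarity alone.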
The input to a neural network can be either discrete or analog.
A good review and discussion of six important neural nets used for pattern classification can be found in Lippmann (1987). They are categorized in Figure 5-6:
Neural Net Classifiers
    Binary Input
        Supervised
            Hopfield Net
            Hamming Net
        Unsupervised
            Carpenter/Grossberg Classifier
    Continuous-Valued Input
        Supervised
            Perceptron
            Multi-Layer Perceptron
        Unsupervised
            Kohonen Self-Organizing Feature Maps
Figure 5-6 Neural Net Classifiers (after Lippmann, 1987).
The Multi-layer Perceptron (MLP) method is employed in this study and will be discussed in more detail in the next section. The MLP has shown itself to be a robust pattern recognition technique in many applications of speech pattern processing (Waibel, 1989; Huang & Lippmann, 1987; Peeling & Bridle, 1986). Waibel (1989) proposed a Time Delay Neural Network (TDNN) and compared the model with a discrete Hidden Markov Model (HMM) in the task of recognizing the three phonemes /b, d, g/ in different contexts. He found the performance of the TDNN to be superior to that of the HMM in his tests. For vowel recognition, Huang and Lippmann (1987) found that the MLP performed better than Gaussian and K-nearest neighbour classifiers. Peeling & Bridle (1986) used the MLP to recognize several acoustic-phonetic features of the speech signal, including voicing, and reported high performance. A further reason for choosing the MLP is that the algorithm developed in this study can be combined with the fundamental period extraction algorithm developed in this Department (Howard, 1991; Walliker & Howard, 1990), which also employed the MLP method, to generate a compound speech pattern signal to be used in a new type of hearing aid for the profoundly hearing impaired. The other ANN methods shown in Figure 5-6 will not be discussed here as they are not appropriate for our application.
Although the NN approach looks appealing and promising, some studies have found that the performance of NNs is not superior to that of other pattern recognition methods (Brauer et al., 1991). At present, the best speech recognition system is still based on the HMM (Lee, 1989).
Several problems remain unsolved in ANN models, including which architecture should be chosen, how many layers and how many cells should be used, and how to deal with time processing (Mariani, 1989).
5.3.4 Multi-Layer Perceptron
Multi-layer perceptrons (MLP) are feed-forward nets with one or more hidden layers between the input and output nodes. Figure 5-7 shows a multi-layer perceptron with one hidden layer.
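A feed-forward pass through a net with one hidden layer can be sketched as below. The layer sizes, the random weights and the use of a sigmoid in both layers are assumptions for illustration; they are not the architecture or trained weights of the MLP-FVS algorithm.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer(inputs, weights, thresholds):
    """One fully connected layer: weighted sum minus threshold, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) - theta)
        for row, theta in zip(weights, thresholds)
    ]


def mlp_forward(x, hidden, output):
    """Feed-forward pass: input -> hidden layer -> output layer."""
    return layer(layer(x, *hidden), *output)


# Hypothetical sizes and untrained random weights, for illustration only.
N_IN, N_HIDDEN, N_OUT = 4, 5, 3
rng = random.Random(0)


def random_layer(n_out, n_in):
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    thresholds = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, thresholds


hidden = random_layer(N_HIDDEN, N_IN)
output = random_layer(N_OUT, N_HIDDEN)
y = mlp_forward([0.2, 0.7, 0.1, 0.9], hidden, output)
labels = ["frication", "voicing", "silence"]
print(labels[y.index(max(y))], [round(v, 3) for v in y])
```

Because the weights are untrained, the selected label is arbitrary; the sketch only shows how signals propagate from the input nodes through the hidden layer to one output per class.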
Output Patterns
Output Layer
Hidden Layer