In this thesis, segmentation is defined as the process of cutting an audio stream into segments and labeling these segments with a specific class such as ‘silence’, ‘speech’ or ‘music’. The main reasons to perform segmentation in an LVCSR system are to filter out the parts of the audio that the decoder cannot handle and to provide the decoder with extra information about the segments so that decoding can be optimized. A third reason is to enrich the ASR output with the segment information.

The silence and music classes mentioned earlier are obvious examples of audio classes that the decoder cannot process successfully. Discarding these segments will speed up the decoding process and most likely keep the word error rate low (fewer insertions). Classes that are often used directly to optimize the decoding process are the ‘audio channel’ (telephone/broadband) and ‘gender’ (female/male) classes. For example, the system can train gender-specific acoustic models and, during decoding, select the model for each segment according to the segmentation information.

A more complex, but also effective way of optimizing the decoding process is to cluster all segments of the same class together and use this larger amount of audio for unsupervised adaptation of the acoustic models or feature vectors. Adaptation can often be done with higher precision when more data is available and therefore it is interesting to group segments with the same acoustic characteristics together. For example, the first step in finding all segments containing only speech from one single speaker is called speaker segmentation or speaker change detection. In this case, segments labeled with speech are split at the points in time where a speaker change is detected.

Three main approaches to segmentation can be distinguished: silence-based, model-based and metric-based [CG98, KSWW00]. In the remainder of this section these three approaches will be described briefly. In the next section the process of clustering segments is discussed.

2.3.1 Feature extraction for segmentation

Because segmentation is a statistical classification problem, a set of acoustic observations is needed, just as for decoding. Although feature types such as MFCC or PLP were not designed to distinguish between speakers, most state-of-the-art segmentation systems use these feature extraction methods. For speaker change detection, feature vectors with a higher number of coefficients are sometimes used. An example of a system that does not use standard MFCC or PLP features is [AMB03], where ‘entropy’ and ‘dynamism’ features are used to discriminate between speech and music.

2.3.2 Silence-based segmentation

For some tasks it is assumed that the audio only contains speech and silence. For example, BN recordings might contain some jingles, but the major part of the recording consists of speech and small pauses between utterances or topics [HOvH01]. Some systems make use of this by segmenting on the basis of the silences in the audio. If the segments are later needed to cluster speakers, these systems assume that there is always a short silence between speakers. For BN recordings, this assumption is often valid. Unfortunately, for recordings with more spontaneous speech, such as recordings of meetings, it often does not hold at all.

There are two common methods of finding silences in an audio stream. The first method is calculating the energy of short (often overlapping) windows. The local minima of this energy series are considered silence. The second method, decoder-based segmentation, is to run a fast ASR decoder [WSK07]. Most decoders contain a silence ‘phone’ that takes care of pauses between speech.
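The energy-based method can be illustrated with a minimal sketch. The window length, hop size and the decibel floor below are illustrative assumptions, not values taken from any of the cited systems, and the function names are hypothetical. A practical system would probably also smooth the energy series or require a minimum depth for each minimum, since raw local minima are noisy.

```python
import math

def frame_energies(samples, win=400, hop=160):
    """Log energy (dB) of short, overlapping windows.

    win and hop are in samples; the defaults are illustrative only.
    """
    energies = []
    for start in range(0, len(samples) - win + 1, hop):
        window = samples[start:start + win]
        power = sum(x * x for x in window) / win
        energies.append(10.0 * math.log10(power + 1e-12))  # small floor avoids log(0)
    return energies

def local_minima(values):
    """Indices that are lower than the previous value and not higher than the next."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] <= values[i + 1]]

# Toy signal: loud, silent, loud.
loud = [1000.0, -1000.0] * 800
samples = loud + [0.0] * 800 + loud
silence_candidates = local_minima(frame_energies(samples))
```

Each returned index refers to a window; multiplying it by the hop size gives the sample position at which a silence candidate starts.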

In [PH03] the ASR acoustic models are used to create two special models: one for silence and one for speech. The speech model is created by combining the most dominant Gaussian mixtures of all phones into one GMM. A small HMM is then created containing only two states. The first state uses the silence GMM for its PDF and the second state uses the speech GMM. A Viterbi decoding run using this HMM will result in the speech/silence segmentation. Decoder-based segmentation systems, although they only distinguish between silence and speech, can also be considered to be model-based segmentation systems.

2.3.3 Model-based segmentation

Model-based segmentation systems train one GMM for each segmentation class. These GMMs are used as PDFs in a hidden Markov model where each state is connected to all other states. Performing a Viterbi decoding run using this HMM results in the segmentation of an audio file. The advantage of this method is that it is very easy to add segmentation classes. The systems in [HJT+98, GLAJ99] train a silence, speech and music GMM, but it is possible to create models for other classes such as sound effects or even known speakers (for example the anchor-man in BN recordings).
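As a rough illustration of decoding with a fully connected HMM, the sketch below labels each frame with its most likely class. Several simplifications are assumptions made for brevity: a single one-dimensional Gaussian stands in for each class GMM, the features are scalars, and transitions are modelled with an unnormalised switch penalty (`log_switch`) rather than trained transition probabilities. All names are hypothetical.

```python
import math

def gauss_loglik(x, mean, var):
    """Log density of a 1-D Gaussian (stand-in for a per-class GMM)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_segment(frames, models, log_switch=-5.0):
    """Return the best class label per frame.

    models: {class_name: (mean, var)}
    log_switch: penalty added whenever the class changes between frames.
    """
    classes = list(models)
    score = {c: gauss_loglik(frames[0], *models[c]) for c in classes}
    backptrs = []
    for x in frames[1:]:
        new_score, ptr = {}, {}
        for c in classes:
            # Staying in the same class is free, switching costs log_switch.
            prev = max(classes,
                       key=lambda p: score[p] + (0.0 if p == c else log_switch))
            ptr[c] = prev
            trans = 0.0 if prev == c else log_switch
            new_score[c] = score[prev] + trans + gauss_loglik(x, *models[c])
        score = new_score
        backptrs.append(ptr)
    # Trace the best path back from the best final state.
    state = max(classes, key=score.get)
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

models = {"silence": (0.0, 1.0), "speech": (5.0, 1.0)}
frames = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2, 0.1, 0.0]
labels = viterbi_segment(frames, models)
```

Adding a class in this sketch only requires adding an entry to `models`, which mirrors the ease of extension described above.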

Without special measures, HMMs with one state for each class tend to produce short segments, even when the transition probabilities from one class to another are set low. In order to enforce minimum time constraints on segments, HMMs are sometimes created with a string of states per class that all share the same GMM. Each state in a string is connected to the next state and only the final state has a self-transition (see figure 2.9). The number of states in the string determines the minimum duration of each segment. Another approach is to post-process the segmentation and join short speech segments or remove short silence segments.

Figure 2.9: An example HMM used in model-based segmentation. Each string of states represents one segmentation class and all states of a string share the same PDF.
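The minimum-duration topology can be sketched as a Viterbi search over expanded states `(class, position)`. This is a simplified illustration under stated assumptions: per-frame class log-likelihoods are given as input, class changes carry a fixed penalty `log_switch`, and the path may end in any state, so the last segment can be cut short by the end of the audio. The function name and parameters are hypothetical.

```python
NEG_INF = float("-inf")

def min_duration_viterbi(loglik, min_frames, log_switch=-1.0):
    """loglik[t][c] is the log-likelihood of frame t under class c's shared PDF.

    Every class is a chain of min_frames states; only the final state of a
    chain has a self-transition and may be left for another class.
    """
    n_classes = len(loglik[0])
    last = min_frames - 1
    states = [(c, k) for c in range(n_classes) for k in range(min_frames)]

    def predecessors(c, k):
        if k > 0:
            preds = [((c, k - 1), 0.0)]            # advance along the chain
            if k == last:
                preds.append(((c, k), 0.0))        # self-transition, final state only
            return preds
        # First state of a chain: entered from the final state of another class.
        preds = [((o, last), log_switch) for o in range(n_classes) if o != c]
        if last == 0:
            preds.append(((c, 0), 0.0))            # chain of length one may loop
        return preds

    score = {s: NEG_INF for s in states}
    for c in range(n_classes):
        score[(c, 0)] = loglik[0][c]               # paths start at a chain head
    backptrs = []
    for t in range(1, len(loglik)):
        new_score, ptr = {}, {}
        for (c, k) in states:
            best_prev, best = None, NEG_INF
            for p, trans in predecessors(c, k):
                if score[p] + trans > best:
                    best_prev, best = p, score[p] + trans
            ptr[(c, k)] = best_prev
            new_score[(c, k)] = best + loglik[t][c] if best_prev is not None else NEG_INF
        score = new_score
        backptrs.append(ptr)
    state = max(states, key=score.get)
    path = [state[0]]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state[0])
    path.reverse()
    return path

# Eleven frames favouring class 0, with a one-frame blip favouring class 1.
blip = [[0.0, -2.0]] * 5 + [[-2.0, 0.0]] + [[0.0, -2.0]] * 5
without_constraint = min_duration_viterbi(blip, min_frames=1, log_switch=-0.5)
with_constraint = min_duration_viterbi(blip, min_frames=3, log_switch=-0.5)
```

With a single state per class the one-frame blip becomes its own segment, while a chain of three states suppresses it, which is the behaviour the string-of-states design is meant to achieve.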

The major disadvantage of model-based segmentation is that the GMMs need to be trained on some training set. If the acoustic characteristics of the audio under evaluation are too different from the characteristics of the training data, the accuracy of the segmentation will be poor. Model-based segmentation has recently been used in various systems for finding speech and non-speech regions [HJT+98, GLAJ99, HMV+07, SAB+07, vLK07].

2.3.4 Metric-based segmentation

One of the most common segmentation methods to date is metric-based segmentation. In metric-based segmentation, a sliding window is used to investigate a short portion of the audio at each step. Typically, the window is cut in the middle and it is determined whether this point in time should be marked as a segment border. Some kind of distance metric is used to measure whether the two segments $S_i$ and $S_j$ belong to the same class $S$, or whether they are actually part of two separate segments.

In the literature, a number of distance metrics have been proposed. Most of these metrics make use of models (often Gaussians or Gaussian mixtures) that are trained on $S_i$, $S_j$ and $S$ in order to calculate distances [Ang06]. The most common distance metric is the Bayesian Information Criterion (BIC) [Sch78]. This metric uses some model $M_i$ with $\#(M_i)$ parameters representing a segment of data $S_i$ with $N_i$ time frames (feature vectors) and it determines how well the model fits the data:

$$\mathrm{BIC}(M_i) = \log L(S_i, M_i) - \frac{1}{2}\lambda\,\#(M_i)\log N_i \tag{2.6}$$

$\lambda$ is a free parameter that needs to be tuned on a training set. The value of this parameter influences when the BIC value is positive, meaning that the model fits the data, or negative, meaning that the model does not fit the data very well. Formula 2.6 can be used to determine if the data of the two segments $S_i$ and $S_j$ fit $M_i$ and $M_j$ best or if the data of the two segments together ($S_i + S_j = S$) fit the model $M$ trained on $S$ best:

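$$
\begin{aligned}
\Delta\mathrm{BIC}(M_i, M_j) &= \mathrm{BIC}(M) - \bigl(\mathrm{BIC}(M_i) + \mathrm{BIC}(M_j)\bigr) \\
&= \log L(S, M) - \bigl(\log L(S_i, M_i) + \log L(S_j, M_j)\bigr) - \frac{1}{2}\lambda\,\Delta\#(M_i, M_j)\log N
\end{aligned} \tag{2.7}
$$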

where $\Delta\#(M_i, M_j)$ is $\#(M) - (\#(M_i) + \#(M_j))$ and $N = N_i + N_j$ is the number of frames in $S$. If $\Delta\mathrm{BIC}$ is negative, the model of the total segment $S$ does not fit the data as well as the two separate models and a segment border is placed between the two segments. $\Delta\mathrm{BIC}$ was first used for segmentation and clustering in [CG98]. In [Ang06] a mathematical proof of formula 2.7 is given. Note that when $\Delta\#(M_i, M_j)$ is zero, meaning that the number of free parameters in $M$ equals the total number of free parameters in $M_i$ and $M_j$, the design parameter $\lambda$ no longer influences the equation.
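The ∆BIC test can be sketched for the simplest possible models. The example below assumes one diagonal-covariance Gaussian per segment, estimated by maximum likelihood, so that every model has two parameters per feature dimension. This model choice, the variance floor and the function names are assumptions made for illustration. The factor ½ in the penalty follows the usual BIC convention and can equally be absorbed into λ.

```python
import math

def gaussian_loglik(frames):
    """Log-likelihood of frames under their own ML diagonal Gaussian."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for k in range(dim):
        column = [f[k] for f in frames]
        mean = sum(column) / n
        sq_dev = sum((x - mean) ** 2 for x in column)
        var = sq_dev / n + 1e-6                      # small variance floor
        total += -0.5 * (n * math.log(2 * math.pi * var) + sq_dev / var)
    return total

def delta_bic(seg_i, seg_j, lam=1.0):
    """Negative values suggest a segment border between seg_i and seg_j."""
    merged = seg_i + seg_j
    dim = len(merged[0])
    n_params = 2 * dim                               # mean and variance per dimension
    delta_params = n_params - (n_params + n_params)  # #(M) - (#(Mi) + #(Mj))
    log_n = math.log(len(merged))
    return (gaussian_loglik(merged)
            - (gaussian_loglik(seg_i) + gaussian_loglik(seg_j))
            - 0.5 * lam * delta_params * log_n)

# Two halves of a window with the same statistics, and two with a shifted mean.
same_a = [[math.sin(t)] for t in range(50)]
same_b = [[math.sin(t)] for t in range(50, 100)]
shifted = [[10.0 + math.sin(t)] for t in range(50)]
no_border = delta_bic(same_a, same_b)     # positive: keep as one segment
border = delta_bic(same_a, shifted)       # negative: place a border
```

In a sliding-window system this test would be repeated at each candidate point, with λ tuned on held-out data as described above.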

In combination with speaker clustering, the Bayesian Information Criterion has recently been used for speaker change detection in a number of systems [Cas04, IFM+06, vLK07, RSB+07].

2.3.5 Assessment of segmentation systems

Two measures are regularly used for assessing segmentation results. The first measures, for each class, the percentage of time that the class was correctly assigned or, if one overall number is required, the percentage of time that all classes were correctly assigned:

$$\mathrm{Score} = \frac{\sum_c C_c}{L} \cdot 100\% \tag{2.8}$$

where $C_c$ is the total time that class $c$ was classified correctly and $L$ is the total audio length.
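Formula 2.8 can be illustrated with a small sketch that compares two segmentations given as `(start, end, label)` tuples in seconds. The tuple representation and the function name are assumptions; the sketch also assumes that the segments within each list do not overlap.

```python
def correct_time_score(reference, hypothesis, total_length):
    """Percentage of the audio for which the hypothesis label matches the reference."""
    correct = 0.0
    for r_start, r_end, r_label in reference:
        for h_start, h_end, h_label in hypothesis:
            if r_label == h_label:
                # Time during which both segmentations agree on this class.
                correct += max(0.0, min(r_end, h_end) - max(r_start, h_start))
    return 100.0 * correct / total_length

reference = [(0.0, 4.0, "speech"), (4.0, 6.0, "silence"), (6.0, 10.0, "music")]
hypothesis = [(0.0, 5.0, "speech"), (5.0, 6.0, "silence"), (6.0, 10.0, "music")]
score = correct_time_score(reference, hypothesis, total_length=10.0)
```

Here the hypothesis labels one second of silence as speech, so nine of the ten seconds are correct.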

The second method of assessing segmentation systems is to consider the result to be a special case of a speaker diarization system [NIS06]. This was done at the NIST benchmarks in 2005 and 2006 for the Speech Activity Detection (SAD) task. SAD is a segmentation task with two classes: speech and non-speech. All reference speakers were joined in one cluster and any speaker overlap was removed. The measurement explained in section 2.4.2 was then used to score the SAD results. Because overlapping speech is not measured and therefore the number of ‘speakers’ is always zero or one, for SAD systems, formula 2.10 in section 2.4.2 can be formulated as:

$$\mathrm{SAD} = \frac{M + F}{S} \cdot 100\% \tag{2.9}$$

where S is the total time of speech, M is the total time of speech that was not classified as speech (missed speech) and F is the total time of silence that was falsely classified as speech (false alarms). Note that this measurement results in an error percentage while the first measurement (formula 2.8) results in a percentage of correctly assigned classes. Also, the SAD measurement is a percentage of the total time of speech in the reference transcript, while using the first measurement for a SAD system would result in a percentage of the total time of the evaluation audio.
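The SAD error of formula 2.9 can be computed in a similar way. The sketch below takes the speech regions of the reference and of the hypothesis as lists of `(start, end)` intervals in seconds. It is a simplified illustration that assumes non-overlapping intervals within each list and ignores any conventions a particular scoring tool may apply; the function names are hypothetical.

```python
def overlap_time(a, b):
    """Total time covered by both interval lists."""
    return sum(max(0.0, min(a_end, b_end) - max(a_start, b_start))
               for a_start, a_end in a
               for b_start, b_end in b)

def sad_error(ref_speech, hyp_speech):
    """Missed speech plus false alarms, as a percentage of reference speech time."""
    speech = sum(end - start for start, end in ref_speech)         # S
    hyp_total = sum(end - start for start, end in hyp_speech)
    common = overlap_time(ref_speech, hyp_speech)
    missed = speech - common                                       # M
    false_alarm = hyp_total - common                               # F
    return 100.0 * (missed + false_alarm) / speech

ref_speech = [(0.0, 4.0), (6.0, 10.0)]
hyp_speech = [(0.0, 5.0), (7.0, 10.0)]
error = sad_error(ref_speech, hyp_speech)
```

In this example one second of speech is missed and one second of non-speech is a false alarm, against eight seconds of reference speech. The denominator is the reference speech time, not the ten seconds of audio, which is the difference from formula 2.8 noted above.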
