In speaker recognition applications, speech processing is not commonly performed directly on the raw audio signal. Most applications, including all speaker segmentation and clustering techniques covered throughout this dissertation, require features containing speaker-specific information to be extracted from the audio signal, so that subsequent speaker modelling and classification can be performed using these features. The extracted features should ideally maximise inter-speaker variability while minimising intra-speaker variability, and represent the relevant information in a compact form [18].
To date, cepstral-domain features based on short periods of speech have proven most successful at capturing the useful characteristics of speech for speech and speaker recognition applications, compared to both time-domain signals and frequency-domain spectra [57]. Cepstral features are capable of capturing information pertinent to the anatomical aspects of the speaker’s voice production mechanisms, including the vocal tract shape and glottal source. This allows acoustical and physical traits such as the tone, nasality and roughness of the speaker’s voice to be encoded in the features. This class of features includes mel-frequency cepstral coefficients (MFCC) [14], linear predictive cepstral coefficients (LPCC) [23], and perceptual linear predictive (PLP) coefficients [37]. A detailed evaluation of these acoustic features for speaker recognition can be found in [58].
Cepstral features are based on spectral information from relatively short segments of speech, referred to as frames, which typically contain 10-30 milliseconds of speech with a significant overlap between consecutive frames. It is assumed that the speech signals are quasi-stationary within the short periods considered.
A sliding spectral analysis window [7] (not to be confused with the sliding-window method for speaker segmentation described in Section 2.7) is applied to the frames of speech to provide a more consistent response across all frequencies and pitches of speech. A Hamming window is used in this work, as is typical in the literature. The windowed frames of speech are then used to compute a sequence of magnitude spectra, and the spectral representations are transformed to cepstral coefficients as a final step. Each frame of speech results in a single feature vector.
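The framing and windowing steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the 25 ms frame length and the 10 ms frame step are assumptions chosen from within the typical range quoted above.

```python
import math

def hamming(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_frames(signal, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Split a signal into overlapping frames and window each frame.

    frame_ms and step_ms are illustrative defaults (assumptions),
    giving 15 ms of overlap between consecutive frames.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    window = hamming(frame_len)
    frames = []
    # Only complete frames are kept; trailing samples are dropped.
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        frames.append([x * w for x, w in zip(frame, window)])
    return frames
```

Each returned frame would then be passed to a spectral analysis stage, yielding one feature vector per frame.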
Two categories of cepstral features are commonly used in speaker recognition systems reported in the literature, differing in the method by which the log-magnitude spectrum is represented. Filterbank analysis describes the magnitude spectrum through the energy in the output signal of a set of bandpass filters, while linear predictive analysis approximates the magnitude spectrum using an all-pole filter.
2.3.1 Filterbank Analysis
Although filterbank analysis was one of the earliest methods developed for speech processing, it remains one of the most effective techniques used in speaker recognition systems today [46]. In this approach, the short-time magnitude spectrum of a speech signal is represented by the energy in the output signal of a set of bandpass filters spaced evenly across the frequency range of interest. Approximately 20 filters are typically used in this process, producing a compact set of coefficients to represent the spectrum.
Based on the filterbank analysis approach, mel-frequency cepstral coefficients (MFCC) are the most popular and commonly used of the acoustic features, and they have been demonstrated to work well in both speech recognition [14] and speaker recognition [46] tasks. MFCCs are produced by spacing the filters evenly according to the mel-frequency scale. The mel-frequency scale is a non-linear transformation of the physical frequency to the pitch perceived by humans [72], placing less emphasis on higher frequencies. In this way, the bandwidth of each filter represents a perceptually similar frequency range and quantity of information content. The mel-frequency scale is approximately logarithmic in the standard frequency scale and is approximated by
$$f_{\mathrm{mel}} = 2595 \cdot \log_{10}\left(1 + \frac{f_{\mathrm{Hz}}}{700}\right). \tag{2.1}$$
For computational efficiency, the filterbank is implemented in the frequency domain using the fast Fourier transform (FFT) of the speech frames.
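As a rough sketch of how a mel-spaced filterbank can be applied in the frequency domain, the code below converts between Hz and mel using Equation (2.1) and builds filter weights over the FFT bins. The triangular filter shape, the function names and the energy floor are assumptions made for illustration; the text above does not specify them.

```python
import math

def hz_to_mel(f_hz):
    """Equation (2.1): physical frequency in Hz to mel."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of Equation (2.1)."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, fft_size, sample_rate):
    """Weights of triangular filters (an assumed shape) over FFT bins."""
    num_bins = fft_size // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge points, evenly spaced in mel, mapped back to Hz.
    edges = [mel_to_hz(mel_max * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, centre, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for k in range(num_bins):
            f = k * sample_rate / fft_size  # centre frequency of bin k
            if lo < f <= centre:
                weights.append((f - lo) / (centre - lo))
            elif centre < f < hi:
                weights.append((hi - f) / (hi - centre))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def filterbank_log_energies(power_spectrum, bank, floor=1e-10):
    """Log of the weighted spectral energy under each filter.

    The floor (an assumption) avoids taking the log of zero.
    """
    return [math.log(max(floor, sum(w * p for w, p in zip(weights, power_spectrum))))
            for weights in bank]
```

Because the filter edges are evenly spaced in mel, the filters become progressively wider in Hz towards higher frequencies.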
The log-energies of the filterbank outputs are transformed into cepstral coefficients via the discrete cosine transform (DCT). This significantly reduces the correlation between the energy outputs of adjacent (and usually partially overlapping) bandpass filters, thus allowing simpler subsequent modelling of speech using these feature vectors.
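The DCT step can be illustrated with a direct, unnormalised type-II DCT. This is a sketch under assumptions: the function name is hypothetical, and the scaling convention and the number of retained coefficients vary between systems.

```python
import math

def dct_cepstrum(log_energies, num_ceps):
    """Type-II DCT of the log filterbank energies.

    c_i = sum_j E_j * cos(pi * i * (j + 0.5) / M), for i = 0..num_ceps-1,
    where M is the number of filters. No normalisation is applied
    (an assumption; other scalings are also in use).
    """
    m = len(log_energies)
    return [sum(e * math.cos(math.pi * i * (j + 0.5) / m)
                for j, e in enumerate(log_energies))
            for i in range(num_ceps)]
```

For a perfectly flat set of log-energies, all coefficients other than the zeroth vanish, which shows how the transform separates the overall level from the spectral shape.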
The time derivatives of the static features, also known as delta coefficients, are often appended to the feature vector as additional features to model trajectory information [10]. Delta coefficients approximate the instantaneous derivative of each cepstral coefficient by the slope of a least-squares linear regression fitted over a window of consecutive frames. The window lengths are typically between 3 and 7 frames.
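The regression slope has a simple closed form for a symmetric window of 2Θ+1 frames: the delta at frame t is the sum over θ = 1..Θ of θ(c[t+θ] − c[t−θ]), divided by twice the sum of θ². The sketch below assumes that form; the function name and the choice to repeat the first and last frames at the edges are illustrative assumptions.

```python
def delta_coefficients(features, half_window=2):
    """Least-squares slope of each coefficient over 2*half_window + 1 frames.

    features: list of per-frame feature vectors (lists of floats).
    Edge frames are handled by repeating the first/last frame (assumption).
    """
    num_frames = len(features)
    denom = 2.0 * sum(t * t for t in range(1, half_window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            num = 0.0
            for theta in range(1, half_window + 1):
                ahead = features[min(t + theta, num_frames - 1)][d]
                behind = features[max(t - theta, 0)][d]
                num += theta * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas
```

With `half_window=2` the regression spans 5 frames, which falls within the typical range of 3 to 7 frames mentioned above.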
2.3.2 Linear Predictive Analysis
In linear predictive analysis, the speech production model is assumed to incorporate a glottal excitation signal filtered through the vocal tract and nasal cavity.
Let the speech signal at time $n$ be denoted by $s_n$. The linear predictor (LP) models $s_n$ by a linear combination of its past values and a weighted present input excitation [45] as
$$s_n = G \cdot u_n - \sum_{k=1}^{P} a_k \cdot s_{n-k}, \tag{2.2}$$
where $G$ is a gain scaling factor, $u_n$ is the present input excitation, $P$ is the prediction order, and $a_k$ is a set of model parameters called the predictor coefficients, which define an all-pole filter that describes the response of the vocal tract given an input excitation signal. The predictor coefficients are estimated using a minimum mean squared error (MMSE) criterion where the residual error is assumed to be equivalent to the excitation term, $G u_n$. While some speaker-dependent characteristics such as the fundamental frequency of voiced speech can be extracted from this excitation term [10], it is the predictor coefficients that are usually the part of the model of interest in feature extraction, as they provide the majority of speaker discriminative information.
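One common way to obtain the minimum mean squared error solution is the autocorrelation method solved with the Levinson-Durbin recursion. The text above does not state which solver is used, so the sketch below is an assumption in that respect, as are the function names. The sign convention follows Equation (2.2), in which the predictor terms are subtracted.

```python
import math

def autocorrelation(frame, max_lag):
    """Biased (unnormalised) autocorrelation r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a_1..a_P and the gain G.

    Uses the convention of Equation (2.2), so the prediction error is
    e_n = s_n + sum_k a_k * s_{n-k}.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # silent or degenerate frame: leave remaining a_k at zero
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        kappa = -acc / error  # reflection coefficient
        prev = a[:]
        for k in range(1, i):
            a[k] = prev[k] + kappa * prev[i - k]
        a[i] = kappa
        error *= (1.0 - kappa * kappa)
    # The residual energy is taken as G^2, matching the assumption that
    # the residual equals the excitation term.
    return a[1:], math.sqrt(max(error, 0.0))
```

For an autocorrelation sequence of the form 1, 0.5, 0.25, which corresponds to a first-order process, the recursion returns a first coefficient of -0.5 and a second coefficient of zero.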
In order to express the features in a more appropriate form for speaker modelling, the predictor coefficients calculated based on linear predictive analysis are commonly transformed into linear predictive cepstral coefficients (LPCC) [23], which have found significant use in speaker recognition tasks [5, 61]. Similar to MFCCs, LPCC features are derived through a further Fourier or cosine transform from the log-magnitude of the spectrum, and delta coefficients are commonly appended to the feature vector to capture transient information. However, the log-magnitude of the spectrum in this case is estimated via the frequency response of the all-pole filter defined by the predictor coefficients.
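In practice the cepstrum of an all-pole filter can also be computed directly from the predictor coefficients with a short recursion, without evaluating the spectrum explicitly. The sketch below assumes the sign convention of Equation (2.2), so the filter is $G / (1 + \sum_k a_k z^{-k})$, and omits the gain-dependent zeroth coefficient. The function name is hypothetical, and this is offered as an illustration rather than the procedure used in this work.

```python
def lpc_to_cepstrum(a, num_ceps):
    """Cepstral coefficients c_1..c_num_ceps of the filter 1 / A(z).

    a: predictor coefficients [a_1, ..., a_P] with A(z) = 1 + sum a_k z^-k.
    Recursion: c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    with a_n taken as zero for n > P.
    """
    p = len(a)
    c = [0.0] * (num_ceps + 1)  # c[0] is unused (gain term omitted)
    for n in range(1, num_ceps + 1):
        # Only terms with 1 <= n - k <= P contribute to the sum.
        acc = sum(k * c[k] * a[n - k - 1] for k in range(max(1, n - p), n))
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]
```

As a check, a single pole at 0.5 (that is, a_1 = -0.5) gives c_n = 0.5^n / n, which matches the series expansion of the log of the filter response.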
Based on LP modelling, the perceptual linear predictive (PLP) analysis technique [37] attempts to represent speech based on human perception by incorporating several human perceptual factors into the speech signal before applying the LP model. Similar to the mel-frequency transformation for calculating MFCCs, a Bark-scale transformation is applied to the power spectrum in order to equalize the information content of the signal. Additionally, differences in perceived loudness across frequencies and power levels are normalized.
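The Bark-scale warping can be sketched with one commonly quoted closed-form approximation, Bark = 6 asinh(f / 600). The text above does not give a formula, so this particular form and the function names are assumptions made for illustration.

```python
import math

def hz_to_bark(f_hz):
    """Approximate Bark value for a frequency in Hz (assumed formula)."""
    return 6.0 * math.asinh(f_hz / 600.0)

def bark_to_hz(bark):
    """Inverse of hz_to_bark."""
    return 600.0 * math.sinh(bark / 6.0)
```

As with the mel scale, equal steps in Bark cover progressively wider ranges in Hz at higher frequencies.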