This section describes some state-of-the-art speaker diarization systems. The HMM/GMM based system provides state-of-the-art results in the NIST-RT evaluation campaigns [National Institute of Standards and Technology, 2003]. The information bottleneck framework provides results comparable to those of the HMM/GMM based system [Vijayasenan and Valente, 2012].
HMM/GMM system
In an HMM/GMM based speaker diarization system, each speaker is represented by a state of an HMM, and the state emission probabilities are modeled using GMMs. Clustering is initialized by partitioning the audio signal into equal parts, which generates a set of segments $\{s_i\}$. Let $c_i$ represent the $i$th speaker cluster, $b_i$ the emission probability of cluster $c_i$, and $f_t$ a given feature vector at time $t$. Then the log-likelihood $\log b_i(f_t)$ of the feature $f_t$ for cluster $c_i$ is calculated as follows:

$$\log b_i(f_t) = \log \sum_{r} w_i^{(r)} \, \mathcal{N}\left(f_t;\, \mu_i^{(r)}, \Sigma_i^{(r)}\right) \tag{2.31}$$

where $\mathcal{N}(\cdot)$ is a Gaussian pdf and $w_i^{(r)}$, $\mu_i^{(r)}$, $\Sigma_i^{(r)}$ are the weight, mean and covariance matrix of the $r$th Gaussian mixture component of cluster $c_i$, respectively.
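As an illustration of Equation (2.31), the following minimal sketch evaluates the per-frame log-likelihood of a GMM with full covariance matrices. The function name `gmm_log_likelihood` and the array layout (one feature vector per row) are assumptions made for this example and are not taken from any particular diarization toolkit. The sum over components is computed in the log domain to avoid numerical underflow.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, covs):
    """Return log b_i(f_t) for every row f_t of `frames` (shape T x D).

    weights, means, covs describe one cluster's GMM (hypothetical layout):
    weights[r] is a scalar, means[r] has shape (D,), covs[r] has shape (D, D).
    """
    frames = np.asarray(frames, dtype=float)
    n_frames, dim = frames.shape
    log_terms = np.empty((n_frames, len(weights)))
    for r, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        diff = frames - np.asarray(mu, dtype=float)
        _, logdet = np.linalg.slogdet(cov)
        # Squared Mahalanobis distance of each frame to component r.
        maha = np.einsum("td,td->t", diff @ np.linalg.inv(cov), diff)
        log_terms[:, r] = np.log(w) - 0.5 * (dim * np.log(2 * np.pi) + logdet + maha)
    # Log-sum-exp over the mixture components.
    peak = log_terms.max(axis=1, keepdims=True)
    return (peak + np.log(np.exp(log_terms - peak).sum(axis=1, keepdims=True)))[:, 0]
```

For a single standard-normal component in two dimensions, a frame at the origin has log-likelihood $-\log 2\pi$, which provides a quick sanity check.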
The agglomerative hierarchical clustering starts by overestimating the number of clusters. At each iteration, the clusters that are most similar are merged based on the BIC distance. The distance measure is based on the modified delta Bayesian information criterion [Ajmera and Wooters, 2003]. The modified BIC distance does not take into account the penalty term that corresponds to the number of free parameters of a multivariate Gaussian distribution and is expressed as:
$$\Delta\mathrm{BIC}(c_i, c_j) = \sum_{f_t \in \{c_i \cup c_j\}} \log b_{ij}(f_t) - \sum_{f_t \in c_i} \log b_i(f_t) - \sum_{f_t \in c_j} \log b_j(f_t) \tag{2.32}$$
where $b_{ij}$ is the probability distribution of the combined clusters $c_i$ and $c_j$. The two clusters that produce the highest $\Delta$BIC score are merged at each iteration, so the number of clusters is reduced by one each time. A minimum duration is normally imposed on the speech segments of each class to prevent the decoding of very short segments. When the maximum $\Delta$BIC score among the remaining clusters is less than the threshold value of 0, the speaker diarization system stops and outputs the hypothesis.
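The merging rule and stopping criterion can be sketched as follows. This is a simplified illustration, not the implementation of any cited system: the names `delta_bic` and `select_merge` are hypothetical, the three `loglik_*` arguments are assumed to be callables that return per-frame log-likelihoods under already-trained cluster models, and retraining of the merged model and Viterbi realignment are left out.

```python
import numpy as np

def delta_bic(frames_i, frames_j, loglik_i, loglik_j, loglik_ij):
    """Modified delta-BIC of Equation (2.32) for clusters c_i and c_j.

    Each loglik_* is a callable mapping a (T x D) array of frames to a
    length-T array of log-likelihoods (an assumed interface).
    """
    merged = np.vstack([frames_i, frames_j])
    return float(loglik_ij(merged).sum()
                 - loglik_i(frames_i).sum()
                 - loglik_j(frames_j).sum())

def select_merge(scores, threshold=0.0):
    """Pick the cluster pair with the highest delta-BIC score.

    `scores` is a symmetric matrix of pairwise scores. Returns (i, j) with
    i < j, or None when clustering should stop because the best score is
    below the threshold or fewer than two clusters remain.
    """
    scores = np.asarray(scores, dtype=float)
    if scores.shape[0] < 2:
        return None
    rows, cols = np.triu_indices_from(scores, k=1)
    best = int(np.argmax(scores[rows, cols]))
    if scores[rows[best], cols[best]] < threshold:
        return None
    return int(rows[best]), int(cols[best])
```

In a full system, `select_merge` would be called once per iteration, the selected pair would be merged and its model retrained, and the score matrix would then be recomputed for the reduced set of clusters.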
Information bottleneck system
The Information Bottleneck (IB) system is a non-parametric system based on information-theoretic principles. Its results are comparable with those of the HMM/GMM system [Wooters and Huijbregts, 2008]. The main advantage of IB is that it requires less computation time than HMM/GMM systems [Vijayasenan et al., 2009, Vijayasenan and Valente, 2012]. IB clustering groups segments that have similar distributions over a set of variables called relevance variables.
Let $X = \{x_1, x_2, \ldots, x_n\}$ represent the input variables to be clustered and $Y = \{y_1, y_2, \ldots, y_m\}$ denote the relevance variables, which carry meaningful information about the clustering output $C = \{c_1, c_2, \ldots, c_r\}$. The IB method optimizes the clustering process by maximizing the following objective:

$$F = I(Y, C) - \frac{1}{\beta}\, I(C, X) \tag{2.33}$$

where $\beta$ is a Lagrange multiplier, $I(C, X)$ denotes the mutual information between the set of speech segments $X$ at each iteration and the clusters $C$, and $I(Y, C)$ measures the mutual dependence between the relevance variables $Y$ and the clustering partition $C$.
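To make the objective in Equation (2.33) concrete, the sketch below evaluates it for a hard assignment of segments to clusters. The helper names `mutual_information` and `ib_objective`, and the choice of a hard (0/1) assignment, are assumptions made for this example; natural logarithms are used, so the values are in nats.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    row = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    col = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = joint > 0                          # treat 0 * log 0 as 0
    return float((joint[mask] * np.log(joint[mask] / (row @ col)[mask])).sum())

def ib_objective(p_x, p_y_given_x, assign, n_clusters, beta):
    """Evaluate F = I(Y, C) - (1 / beta) * I(C, X) for a hard clustering.

    p_x:          prior of each segment, shape (n,)
    p_y_given_x:  relevance-variable distribution per segment, shape (n, m)
    assign:       cluster index of each segment, length n
    """
    p_x = np.asarray(p_x, dtype=float)
    p_c_given_x = np.eye(n_clusters)[np.asarray(assign)]    # 0/1 assignment, (n, r)
    joint_xc = p_x[:, None] * p_c_given_x                    # p(x, c)
    joint_cy = joint_xc.T @ np.asarray(p_y_given_x, float)   # p(c, y)
    return mutual_information(joint_cy) - mutual_information(joint_xc) / beta
```

With two equally likely segments whose relevance distributions do not overlap, keeping them in separate clusters gives $F = (1 - 1/\beta)\log 2$, whereas merging them gives $F = 0$. For $\beta > 1$ the objective therefore favors keeping such segments apart.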
The IB system uses a greedy technique to optimize the clustering process [Vijayasenan and Valente, 2012]. It starts from an initial segmentation of the speech in which each segment is treated as one element of the set of input variables X. The set of relevance variables Y consists of the components of a background GMM estimated from the speech segments. Given an input speech segment xi, the posterior distribution of the relevance variables for that segment is obtained using Bayes' rule. IB clustering is initialized with each speech segment in X as its own cluster, and the two clusters with the most similar distributions are merged at each iteration.
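The Bayes' rule step can be sketched as follows. The sketch assumes a background GMM with diagonal covariances, and it assumes that the distribution for a segment is obtained by averaging the frame-level component posteriors over that segment. Both choices, along with the function names, are made for illustration only and are not stated in the text above.

```python
import numpy as np

def component_posteriors(frames, weights, means, variances):
    """p(y | f_t) for each frame via Bayes' rule, shape (T, m).

    weights: (m,), means: (m, D), variances: (m, D) diagonal covariances
    of the background GMM (assumed layout).
    """
    frames = np.asarray(frames, dtype=float)
    diff = frames[:, None, :] - means[None, :, :]            # (T, m, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                  # log w_y + log N(f_t)
    log_joint -= log_joint.max(axis=1, keepdims=True)        # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)            # normalize over y

def segment_relevance_distribution(frames, weights, means, variances):
    """Assumed estimate of p(y | x_i): mean frame posterior over the segment."""
    return component_posteriors(frames, weights, means, variances).mean(axis=0)
```

The resulting vector sums to one and can serve as the distribution over relevance variables that is compared between clusters when deciding which pair to merge.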
Other approaches
The HMM/GMM and IB based speaker diarization systems both follow an agglomerative clustering framework. Other approaches to speaker diarization also exist and are described below.
Top-down system
The top-down approach starts by modeling the entire audio signal with a single speaker model and then successively adds new speaker models. The speech segments used to train a new speaker model can be selected using a criterion such as segment duration, and a new speaker model is generated for the selected segments. This process is repeated until the final number of speakers is found. Top-down approaches are not as widely used as bottom-up ones. They are, however, computationally efficient, and their performance can be improved using cluster purification, as reported in [Bozonnet et al., ].
Factor analysis techniques
Factor analysis techniques, which are the state of the art in speaker recognition, have recently been used successfully in speaker diarization [Kenny et al., 2010, Franco-Pedroso et al., 2010, Shum et al., 2011]. The speech clusters are first represented by i-vectors, and the successive clustering stages are performed on this i-vector representation. Using factor analysis to model speech segments reduces the dimension of the feature vector while retaining most of the relevant information. Once the speech clusters are represented by i-vectors, cosine-distance and PLDA scoring techniques can be applied to decide whether two clusters belong to the same speaker or to different speakers [Dehak et al., 2011].
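Of the two scoring techniques mentioned, cosine scoring is the simpler to illustrate. The sketch below assumes that i-vectors have already been extracted and are available as NumPy arrays. The function names and the decision threshold of 0.5 are hypothetical; in practice the threshold would be tuned on development data.

```python
import numpy as np

def cosine_score(ivec_a, ivec_b):
    """Cosine similarity between two i-vectors (1-D arrays of equal length)."""
    ivec_a = np.asarray(ivec_a, dtype=float)
    ivec_b = np.asarray(ivec_b, dtype=float)
    return float(ivec_a @ ivec_b / (np.linalg.norm(ivec_a) * np.linalg.norm(ivec_b)))

def same_speaker(ivec_a, ivec_b, threshold=0.5):
    """Decide whether two clusters share a speaker.

    The threshold is an illustrative placeholder, not a recommended value.
    """
    return cosine_score(ivec_a, ivec_b) >= threshold
```

Because the score depends only on the angle between the two vectors, it is insensitive to their magnitudes.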