4.3.2.1 Dataset
To experiment with the background tracking features in a genre ID task, data from 332 shows totalling 231 hours, broadcast by the BBC during the first week of May 2008, were used. According to the BBC's internal genre classification, these shows fall into eight genres:
• advice: consumer, do-it-yourself and property shows
• children’s: cartoons and educational shows
• competition: quiz shows and other contest shows
• documentary: including fly-on-the-wall shows
• drama: soap operas and other serialised dramas
• events: live events, sports and concerts
• news: broadcast news and current affair shows

Figure 4.2: Background tracking features extraction process, adapted from Saz and Hain (2013)
Stages shown: input features (MFCC, PLP, ...), asynchronous alignment, background indexes (example: 0 0 2 2 2 2 3 0 1 1 2 2), output features (example: 0.25 0.17 0.5 0.08).
Since the shows cover a whole week, the dataset contains a mixture of genres and is in this sense a more realistic scenario than the limited RAI dataset. The genres are also very heterogeneous: for example, the events genre covers live sports as well as music shows.
The data were split into a training set of 285 shows and a test set of 47 shows. The amount of data and the number of shows per genre in each set are presented in table 4.1. This dataset is called dataset A in the remainder of this chapter.
4.3.2.2 Extracting background tracking features
As discussed earlier, if the correct transcript of the data is available, the background tracking features can be extracted by aligning the transcripts to the audio signals and keeping track of the best path through the states. However, in the case of dataset A, only subtitles were available. A lightly supervised training procedure, as described in (Lanchantin et al., 2013), was used to train the GMM-HMM acoustic models, which were then used for the forced alignment of the data.

Table 4.1: Amount of training and test data (hours) per genre in dataset A

Genre          Training set         Test set
               #Shows  Duration     #Shows  Duration
Advice             34      24.5          4       3.0
Children’s         45      18.5          8       3.0
Comedy             20       9.7          6       3.2
Competition        37      25.9          6       3.3
Documentary        41      29.8          9       6.8
Drama              19      14.4          4       2.7
Events             23      29.8          5       4.3
News               66      50.3          5       2.0
Total             285     203.0         47      28.3

Seven initial CMLLR transformations were trained on a modified version of the WSJCAM0 corpus (Robinson et al., 1995), as described in Saz and Hain (2013). These transformations correspond to seven acoustic backgrounds: clean speech, classical music, contemporary music, applause, cocktail party noise, traffic noise and wildlife noise. They were then retrained asynchronously on the BBC dataset. After this initial stage, the feature vectors were processed using P = 100, which yielded 7-dimensional feature vectors.
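The aggregation step can be illustrated with the example values shown in figure 4.2. The following is only a minimal sketch, assuming that P is the window length in frames, that windows do not overlap, and that each output dimension is the relative frequency of one background index within the window. The function name and its parameters are hypothetical and not taken from the original implementation.

```python
from collections import Counter

def aggregate_background_indexes(indexes, num_backgrounds, window):
    """Turn per-frame background indexes into per-window relative frequencies.

    Assumption: non-overlapping windows; a trailing partial window is dropped.
    """
    features = []
    for start in range(0, len(indexes) - window + 1, window):
        counts = Counter(indexes[start:start + window])
        # One dimension per background: fraction of frames assigned to it.
        features.append([counts[b] / window for b in range(num_backgrounds)])
    return features

# The 12 frame-level indexes from figure 4.2, with 4 backgrounds and one window.
frames = [0, 0, 2, 2, 2, 2, 3, 0, 1, 1, 2, 2]
print([round(v, 2) for v in aggregate_background_indexes(frames, 4, 12)[0]])
# prints [0.25, 0.17, 0.5, 0.08]
```

Under the same assumptions, calling the function with num_backgrounds=7 and window=100 would give one 7-dimensional vector per 100 frames.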
4.3.2.3 Visualising the background tracking features
Using the procedure described in section 4.3.2.2, the features were extracted and aggregated. Each aggregated feature vector corresponds to one second of the audio segment. Figure 4.3 visualises 60 seconds of these features for four different shows. The 7-dimensional features are represented by bar plots, with each column corresponding to one second. Visual inspection of these plots, synchronised with the audio, shows that changes in the distribution of the feature vectors correspond to the events happening in the background. For example, in figure 4.3a the news programme starts with music, then changes to street noise, then to clean studio speech, and finally ends with some street noise. Figure 4.3b is a cut from a music event show in which the music changes from rock music to solo singing and ends with instrumental rock music. Figure 4.3c presents a historical documentary show that starts with bell sounds and whistles, continues with some music, followed by some clean speech, and ends with some birdsong and seaside noises. Figure 4.3d corresponds to a one-minute cut from a light entertainment show and has portions of speech with long bursts of laughter.

Table 4.2: Genre classification accuracy (%) with GMM models and short-term PLP features on dataset A

#Components  Accuracy
8                44.7
16               48.9
32               48.9
64               48.9
128              53.2
256              53.2
512              61.7
1024             59.6
2048             61.7

4.3.2.4 Baseline
To evaluate the performance of the proposed approach on the genre identification task, baseline experiments were performed first. As a baseline classifier, GMMs were trained on PLP features. The 13-dimensional PLP features were extracted every 10 ms, and their first and second derivatives were appended to form a final 39-dimensional feature vector. For each of the 8 genres, GMMs with a varying number of mixture components were trained using the EM algorithm and the mix-up procedure. A label was assigned to new data by computing the overall likelihood of its frames under each of the 8 models and picking the genre whose GMM gave the highest likelihood. This baseline enables the comparison of the dataset and the proposed approach with the related techniques introduced in section 4.2. Table 4.2 summarises the classification accuracy of the GMM classifiers for varying numbers of mixture components.
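The decision rule described above can be sketched as follows. This is an illustrative sketch only: it assumes diagonal-covariance GMMs and independent frames, so that the overall likelihood is the sum of the per-frame log-likelihoods. The function names, the toy 2-dimensional models and their parameters are made up for the example and are not the models trained in this work.

```python
import math

def diag_gmm_loglik(frame, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM (assumed form)."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
            for x, m, v in zip(frame, mu, var)
        )
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over the components for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))

def classify_show(frames, genre_models):
    """Return the genre whose GMM gives the highest total log-likelihood."""
    scores = {
        genre: sum(diag_gmm_loglik(f, *params) for f in frames)
        for genre, params in genre_models.items()
    }
    return max(scores, key=scores.get)

# Toy models: (weights, means, variances) per genre, all values hypothetical.
models = {
    "news": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "events": ([1.0], [[5.0, 5.0]], [[1.0, 1.0]]),
}
print(classify_show([[0.2, 0.1], [0.9, 1.2]], models))  # prints news
```

Summing log-likelihoods rather than multiplying likelihoods avoids numerical underflow over the many frames of a full show.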
Comparing the results obtained here with those reported in the literature on other datasets, such as the RAI dataset, shows how challenging this BBC dataset is. The best accuracy on this dataset is 61.7%, obtained with a GMM with 512 mixtures (and matched by the 2048-component model), while the best accuracy with GMMs on the RAI dataset was reported as 93.6% (Kim et al., 2013).
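When reading these accuracies, it helps to keep the size of the test set in mind. The short calculation below assumes that accuracy is computed per show over the 47 test shows of dataset A; under that assumption, 61.7% corresponds to 29 correctly classified shows and a single show changes the accuracy by about 2.1 percentage points. The helper name is hypothetical.

```python
TEST_SHOWS = 47  # size of the dataset A test set (table 4.1)

def accuracy_percent(correct, total=TEST_SHOWS):
    """Accuracy in percent, rounded to one decimal place as in the tables."""
    return round(100.0 * correct / total, 1)

print(accuracy_percent(29))          # prints 61.7
print(accuracy_percent(28))          # prints 59.6
print(round(100.0 / TEST_SHOWS, 1))  # prints 2.1
```

Small differences between adjacent rows of the accuracy tables may therefore reflect a change of only one or two shows.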
Figure 4.3: One-minute samples of background tracking features for four different shows, adapted from Saz et al. (2014)
Panels: (a) News genre (broadcast news); (b) Events genre (live music show); (c) Documentary genre (history show); (d) Comedy genre (light entertainment). Each panel plots feature values against time (0 to 60 seconds).
Table 4.3: Genre classification accuracy (%) with GMM models and background tracking features on dataset A

#Components         Type
             Subtitles  Decoding
8                 59.6      59.6
16                66.0      63.8
32                66.0      68.0
64                72.3      68.1
128               70.2      68.1
256               68.1      66.0
512               70.2      70.2
1024              59.6      59.6
2048              53.1      49.0