4.3.2.1 Dataset

To experiment with the background tracking features in a genre identification (ID) task, 332 shows broadcast by the BBC during the first week of May 2008, totalling 231 hours, were used. According to the BBC's internal genre classification, these shows fall into the following eight genres:

• advice: consumer, do-it-yourself and property shows
• children’s: cartoons and educational shows
• comedy: including light entertainment shows
• competition: quiz shows and other contest shows
• documentary: including fly-on-the-wall shows
• drama: soap operas and other serialised dramas
• events: live events, sports and concerts
• news: broadcast news and current affairs shows

Figure 4.2: Background tracking features extraction process, adapted from Saz and Hain (2013)
(Diagram stages: input features (MFCC, PLP, ...); asynchronous alignment; background indexes, e.g. 0 0 2 2 2 2 3 0 1 1 2 2; output features, e.g. 0.25 0.17 0.5 0.08.)

Because the shows cover a whole week of broadcasts, the dataset contains a mixture of genres and is in this sense a more realistic scenario than the limited RAI dataset. The genres are also very heterogeneous; for example, the events genre covers live sports as well as music shows.

The data were split into a training set of 285 shows and a test set of 47 shows. The amount of data and the number of shows per genre in each set are presented in Table 4.1. This dataset is referred to as dataset A in the remainder of this chapter.

4.3.2.2 Extracting background tracking features

Table 4.1: Amount of training and test data (hours) per genre in dataset A

Genre          Training set          Test set
               #Shows    Duration    #Shows    Duration
Advice         34        24.5        4         3.0
Children’s     45        18.5        8         3.0
Comedy         20        9.7         6         3.2
Competition    37        25.9        6         3.3
Documentary    41        29.8        9         6.8
Drama          19        14.4        4         2.7
Events         23        29.8        5         4.3
News           66        50.3        5         2.0
Total          285       203.0       47        28.3

As discussed earlier, if the correct transcript of the data is available, the background tracking features can be extracted by aligning the transcripts to the audio signals and keeping track of the best path through the states. However, in the case of dataset A, only subtitles were available. A lightly supervised training procedure, as described in Lanchantin et al. (2013), was used to train the GMM-HMM acoustic models, which were then used for the forced alignment of the data.

Seven initial CMLLR transformations were trained on a modified version of the WSJCAM0 corpus (Robinson et al., 1995), as described in Saz and Hain (2013). The seven transformations correspond to the following acoustic backgrounds: clean speech, classical music, contemporary music, applause, cocktail party noise, traffic noise and wildlife noise. They were then retrained asynchronously on the BBC dataset. After this initial stage, the feature vectors were processed using P = 100, which yielded 7-dimensional feature vectors.
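The aggregation step can be illustrated with a small sketch. It assumes, based on the example values in Figure 4.2 (background indexes 0 0 2 2 2 2 3 0 1 1 2 2 giving output features 0.25 0.17 0.5 0.08), that each output vector holds the relative frequency of every background index within a window, and that P plays the role of the window length in frames. The function name `aggregate_background_indexes` and its parameters are hypothetical and are not taken from the original system.

```python
from collections import Counter


def aggregate_background_indexes(indexes, num_backgrounds, window):
    """Turn per-frame background indexes into per-window relative frequencies.

    Assumption: each output vector is a normalised histogram of the
    background indexes seen in one window of `window` frames.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if any(i < 0 or i >= num_backgrounds for i in indexes):
        raise ValueError("background index out of range")

    features = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(indexes) - window + 1, window):
        counts = Counter(indexes[start:start + window])
        features.append([counts[b] / window for b in range(num_backgrounds)])
    return features


# The 12-frame example from Figure 4.2, with four backgrounds.
example = [0, 0, 2, 2, 2, 2, 3, 0, 1, 1, 2, 2]
features = aggregate_background_indexes(example, num_backgrounds=4, window=12)
print([round(v, 2) for v in features[0]])  # [0.25, 0.17, 0.5, 0.08]
```

Under the same assumption, the setting used for dataset A would correspond to `num_backgrounds=7` and `window=100`, giving one 7-dimensional vector per window.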

4.3.2.3 Visualising the background tracking features

Using the procedure described in section 4.3.2.2, the features were extracted and aggregated. Each aggregated feature vector corresponds to one second of audio. Figure 4.3 visualises 60 seconds of these features for four different shows. The 7-dimensional features are represented by bar plots, with each column corresponding to one second. When these plots are inspected visually and synchronised with the audio, the changes in the distribution of the feature vectors correspond to the events happening in the background. For example, in Figure 4.3a the news programme starts with music, then changes to street noise, then to clean studio speech, and finally ends with some street noise. Figure 4.3b is an excerpt from a music event show in which the music changes from rock music to solo singing and ends with instrumental rock music. Figure 4.3c presents a historical documentary show that starts with bell sounds and whistles, then continues with some music, followed by some clean speech, and ends with birdsong and seaside noises. Figure 4.3d corresponds to a one-minute excerpt from a light entertainment show and contains portions of speech with long bursts of laughter.

Table 4.2: Genre classification accuracy (%) with GMM models and short-term PLP features on dataset A

#Components    Accuracy
8              44.7
16             48.9
32             48.9
64             48.9
128            53.2
256            53.2
512            61.7
1024           59.6
2048           61.7

4.3.2.4 Baseline

To evaluate the performance of the proposed approach on the genre identification task, baseline experiments were performed first. As a baseline classifier, GMMs were trained on PLP features. The 13-dimensional PLP features were extracted every 10 ms, and their first and second derivatives were appended to form a final 39-dimensional feature vector. For each of the 8 genres, GMMs with a varying number of mixture components were trained using the EM algorithm and the mix-up procedure. A label was assigned to new data by computing the overall likelihood of its frames under each of the 8 models and picking the genre whose GMM gave the highest likelihood. This baseline allows the dataset and the proposed approach to be compared with the related techniques introduced in section 4.2. Table 4.2 summarises the classification accuracy of the GMM classifiers for a varying number of mixture components.
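The decision rule described above can be sketched as follows. This is a minimal illustration rather than the system used in the experiments: it assumes diagonal-covariance GMMs whose parameters are already trained (EM training and the mix-up procedure are not shown), treats frames as independent so that the overall likelihood becomes a sum of per-frame log-likelihoods, and uses hypothetical names (`classify_show`, `genre_models`) with toy 2-dimensional parameters in place of the 39-dimensional PLP models.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def frame_log_likelihood(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # Log-sum-exp over the mixture components, for numerical stability.
    return top + math.log(sum(math.exp(t - top) for t in terms))


def classify_show(frames, genre_models):
    """Pick the genre whose GMM gives the highest total log-likelihood."""
    scores = {
        genre: sum(frame_log_likelihood(x, gmm) for x in frames)
        for genre, gmm in genre_models.items()
    }
    return max(scores, key=scores.get), scores


# Toy models with two components each (illustrative values only).
genre_models = {
    "news": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "events": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
frames = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
predicted, scores = classify_show(frames, genre_models)
print(predicted)  # news
```

Working in the log domain avoids numerical underflow, which would otherwise occur quickly when multiplying likelihoods over the many frames of a full show.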

Comparing these results with those reported in the literature on other datasets, such as the RAI dataset, shows how challenging the BBC dataset is. The best accuracy on this dataset is 61.7%, obtained with a GMM with 512 mixture components (and matched with 2048 components; see Table 4.2), whereas the best accuracy reported with GMMs on the RAI dataset is 93.6% (Kim et al., 2013).
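To put the differences in Table 4.2 into perspective, the accuracies can be converted back into numbers of correctly classified shows. The sketch below assumes that accuracy is computed per show over the 47 test shows of dataset A; under that assumption each show accounts for roughly 2.1 percentage points. The helper name `correct_shows` is hypothetical.

```python
NUM_TEST_SHOWS = 47  # size of the dataset A test set (Table 4.1)

# Accuracy (%) per number of mixture components, copied from Table 4.2.
table_4_2 = {
    8: 44.7, 16: 48.9, 32: 48.9, 64: 48.9, 128: 53.2,
    256: 53.2, 512: 61.7, 1024: 59.6, 2048: 61.7,
}


def correct_shows(accuracy_percent, num_shows=NUM_TEST_SHOWS):
    """Number of correctly classified shows implied by a per-show accuracy."""
    return round(accuracy_percent * num_shows / 100)


counts = {k: correct_shows(acc) for k, acc in table_4_2.items()}
for components, count in counts.items():
    print(f"{components:>5} components: {count}/{NUM_TEST_SHOWS} shows correct")
```

Under this reading, the best baseline configurations classify 29 of the 47 test shows correctly, and the step from 1024 to 512 components corresponds to a single additional show.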

Figure 4.3: One-minute samples of background tracking features for four different shows, adapted from Saz et al. (2014)
(Panels: (a) News genre (broadcast news); (b) Events genre (live music show); (c) Documentary genre (historical documentary); (d) Comedy genre (light entertainment). In each panel the x-axis is time in seconds, from 0 to 60, and the y-axis shows the feature values.)

Table 4.3: Genre classification accuracy (%) with GMM models and background tracking features on dataset A

#Components    Type
               Subtitles    Decoding
8              59.6         59.6
16             66.0         63.8
32             66.0         68.0
64             72.3         68.1
128            70.2         68.1
256            68.1         66.0
512            70.2         70.2
1024           59.6         59.6
2048           53.1         49.0

4.3.3 GMM classification with the background tracking features
