This section describes the features used in the experimental studies of this thesis. A major advantage of audio features is that they can always be extracted from the corresponding MP3 song, so we restricted the feature set used in this work to audio features only. The score is not always available for popular music pieces, and metadata are often incomplete or subjective. The exact feature lists with literature references are provided in Appendix A. In the following sections, we briefly discuss the characteristics of the feature groups and provide definitions for several features that were implemented for this thesis.

2.2. Feature extraction 29

Figure 2.6.: Several representations and audio feature domains for Beethoven’s “Für Elise”: (a) the score; (b) time domain; (c) spectrum domain; (d) chroma discrete cosine transform-reduced log pitch (CRP) [155]; (e) phase domain [146].

Sections 2.2.3.1 to 2.2.3.3 adapt the categorisation of feature groups from [211], where three types of audio features were described: timbral, rhythmic, and pitch characteristics. We extend these groups to ‘timbre and energy’ (features estimated mostly from the time, spectral, and cepstral domains), ‘chroma and harmony’ (pitch-related, short-frame, low-level characteristics, as well as high-level descriptors, e.g., the number of different chords), and ‘tempo, rhythm, and structure’ (long-frame, mostly high-level features, which describe the time structure of music). Section 2.2.3.4 mentions three high-level feature groups, which are derived from the audio signal by applying the sliding feature selection introduced in Section 3.3. The experimental studies for the creation of these high-level features are described in detail in Section 5.1.

Table 2.2 lists the software tools used for FE. All of them are integrated as libraries or plugins into the Advanced MUSic Explorer (AMUSE) [221]. We developed this Java framework with the aim of providing interfaces for various music classification tasks, as categorised in Section 2.1.3.

Table 2.2.: Software tools for FE.

Name                                     Reference
AMUSE                                    [221]
Chroma Toolbox                           [155]
jAudio                                   [138]
MIR Toolbox                              [117]
NNLS Chroma and Chordino Vamp plugins    [135]
Yale                                     [147]

2.2.3.1. Timbre and energy

Timbre and energy features can be considered low-level (see Section 2.2.1), and most of them are estimated from short extraction frames. Timbre is the characteristic that makes tones of the same pitch and loudness sound different, depending on the source instrument and the playing style. Energy features relate to the noisiness and loudness of an audio signal.

Table A.1 in Appendix A lists the feature names, literature references, extraction frame sizes in samples $W_e$ (for mono signals with $f_s = 22{,}050$ Hz), numbers of feature dimensions, the software used for feature estimation, and the unique AMUSE feature IDs. Most of these features are described in our technical report [206] and the manual of the MIR Toolbox [115], in which references to further works are given.

It is possible to group these features by their extraction domain:

• Time domain characteristics describe the audio signal time series, e.g., by its approximation with linear prediction coefficients or by its energy distribution. For example, ‘low energy’ compares the energy of a frame to the energy of the previous larger analysis window. Another commonly used and simple feature is the zero-crossing rate. It correlates with the noisiness of the signal, which in turn describes the timbre [211].

• Spectral domain features correspond to numerous statistics of the distribution of the frequency bin amplitudes: spectral centroid, crest factor, slope, kurtosis, flux, skewness, distances between spectral peaks, etc.

• Cepstral domain descriptors consist of several implementations of the mel frequency cepstral coefficients (MFCCs) and the cepstral modulation ratio regression (CMRARE) features [133], which describe the temporal progress of the cepstrum using a polynomial approximation.

• Phase domain features are the average distance and the average angle in the phase domain. These features are well suited for the separation of classical music and popular genres with a higher percussion share [146].

• Finally, ERB and Bark scale domains are motivated by the characteristics of human perception, where different frequency bands are sensed differently [151].
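To make two of the features named above more concrete, the following sketch computes a zero-crossing rate and a spectral centroid for a single extraction frame. It uses common textbook definitions, which are an assumption here: the exact definitions used by the tools in Table 2.2 may differ, and the function names are hypothetical. A naive DFT is used only to keep the example free of external libraries.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Amplitude-weighted mean frequency in Hz over the bins 0..N/2."""
    n = len(frame)
    amplitudes = []
    for k in range(n // 2 + 1):
        # Naive DFT coefficient for frequency bin k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        amplitudes.append(abs(coeff))
    total = sum(amplitudes)
    if total == 0:
        return 0.0  # silent frame: no meaningful centroid
    return sum(k * sample_rate / n * a for k, a in enumerate(amplitudes)) / total
```

A frame that alternates in sign at every sample yields the maximum zero-crossing rate of 1, whereas a pure sinusoid concentrates the spectral centroid at its own frequency.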

2.2.3.2. Chroma and harmony

Harmony describes the relationship between simultaneously played tones (and is often described as the ‘vertical’ music component). If exactly two tones are played at the same time, they build an interval; three or more tones are characterised as a chord. One of the central terms in music harmony is consonance: consonant intervals are perceived as more complete and pleasing, whereas dissonant intervals are perceived as rough. The differences between consonant and dissonant sounds can be measured with respect to mathematical, physical, physiological, and psychoacoustical aspects. However, it is difficult to provide an exact definition, in particular because the understanding of consonance has changed over the centuries. References to older and newer theories are provided in [144,185].

Because the exact notes cannot be perfectly extracted from audio, the first step in the estimation of almost all audio harmonic characteristics is the transformation into the chroma domain. One of the simplest possibilities is to estimate the PCP, as defined in Eq. 2.4. The chroma-related harmonic characteristics are often less precise than the score features. However, they build a bridge between signal processing methods and music theory and are essential when no score is available.

Chroma and harmony features listed in Table A.2, Appendix A, comprise low-level spectral characteristics as well as high-level harmonic descriptors related to music theory. One can roughly distinguish between chroma-based features, harmonic characteristics, and chord statistics. A semitone spectrum, which is estimated from the frequency bin amplitudes aggregated around the corresponding pitches, can be considered low-level. On the other hand, the characteristics of chords and musical keys can be referred to as high-level. Several features were implemented for this study directly in AMUSE and are defined as follows:

• Interval strengths from the 10 highest semitone values: First, a semitone spectrum is estimated with NNLS Chroma [135], saving the amplitudes $SC(p)$ for the 85 different pitch levels. Then, the indices of the 10 highest values are sorted and saved in $p_{10}$. The interval strengths $IS(k)$, $k \in \{1, 2, \ldots, 12\}$, are calculated as follows:

$$IS(k) = \sum_{\substack{i,j \in p_{10} \\ |i-j| = k}} \min\left(SC(i), SC(j)\right). \tag{2.9}$$
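A minimal sketch of Eq. 2.9 is given below. It is not the AMUSE implementation. It assumes that each unordered pair of indices is counted once, which the equation leaves open, and the function name and the list-based input are hypothetical.

```python
def interval_strengths_top10(sc, n_top=10, max_interval=12):
    """IS(k) for k = 1..max_interval from the n_top strongest semitone bins.

    sc is a list of semitone spectrum amplitudes, indexed by pitch level.
    Assumption: every unordered index pair (i, j) contributes once.
    """
    # Indices of the n_top highest amplitudes (p10 in the text).
    top = sorted(range(len(sc)), key=lambda p: sc[p], reverse=True)[:n_top]
    strengths = [0.0] * max_interval
    for a in range(len(top)):
        for b in range(a + 1, len(top)):
            i, j = top[a], top[b]
            k = abs(i - j)
            if 1 <= k <= max_interval:
                # The weaker of the two tones limits the interval strength.
                strengths[k - 1] += min(sc[i], sc[j])
    return strengths
```

For a spectrum that contains only a root, a tone four semitones above it, and a tone seven semitones above it, the non-zero entries are those for k = 3, 4, and 7, i.e., the intervals of a major triad.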

• Interval strengths from the semitone spectrum above 3/4 of its maximum value: If a part of the simultaneously played tones is significantly louder than another part, the 10 strongest $SC$ values may describe the fundamental frequencies, overtones, and noisy components of the louder tones only. Therefore, another possibility to measure the interval strengths is to allow all values above a certain threshold to contribute to the interval estimation. Here, all semitone spectrum values above 3/4 of the maximum are used:

$$IS_T(k) = \sum_{\substack{SC(i),\, SC(j) > \frac{3}{4} \cdot SC(p_{10}(1)) \\ |i-j| = k}} \min\left(SC(i), SC(j)\right). \tag{2.10}$$
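The thresholded variant of Eq. 2.10 can be sketched in the same way. As before, this is an illustration under the assumption that unordered pairs are counted once. The function name is hypothetical, and the strict comparison mirrors the ">" in the equation.

```python
def interval_strengths_threshold(sc, fraction=0.75, max_interval=12):
    """IS_T(k) for k = 1..max_interval from all bins above fraction * max(sc).

    Assumption: every unordered index pair (i, j) contributes once.
    """
    threshold = fraction * max(sc)
    # All pitch levels whose amplitude strictly exceeds the threshold.
    loud = [p for p, value in enumerate(sc) if value > threshold]
    strengths = [0.0] * max_interval
    for a in range(len(loud)):
        for b in range(a + 1, len(loud)):
            i, j = loud[a], loud[b]
            k = abs(i - j)
            if 1 <= k <= max_interval:
                strengths[k - 1] += min(sc[i], sc[j])
    return strengths
```

In contrast to the variant with a fixed number of values, the number of contributing tones adapts to the frame: a tone at 0.6 of the maximum is ignored, and a single dominant tone yields no interval at all.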

• Strengths of the CRP cooccurrences: Chroma discrete cosine transform-reduced log pitch (CRP) [155] is an enhanced chroma variation. It was developed especially for filtering out timbre sound characteristics, which are mostly captured by lower MFCCs. The strength of two cooccurrent values $CRP(i)$ and $CRP(j)$, $i, j \in \{1, 2, \ldots, 12\}$, is defined as:

$$CRPS(i, j) = \frac{CRP(i) + CRP(j)}{2}. \tag{2.11}$$
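Eq. 2.11 can be evaluated for all pairs of CRP bins with a few lines of code. The sketch below assumes that only unordered pairs with i < j are kept. The function name and the dictionary output are hypothetical choices for illustration.

```python
from itertools import combinations


def crp_cooccurrence_strengths(crp):
    """Mean of every unordered pair of CRP bins, keyed by 1-based (i, j)."""
    return {
        (i + 1, j + 1): (crp[i] + crp[j]) / 2
        for i, j in combinations(range(len(crp)), 2)
    }
```

Calling `len(crp_cooccurrence_strengths(crp))` on a 12-dimensional CRP vector gives the number of resulting feature dimensions.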

The estimation of all strengths between CRP values provides a raw description of interval strengths, and the overall number of dimensions is equal to $\frac{12 \cdot 11}{2} = 66$.

• Number of different chords and chord changes in 10 s: This feature is estimated from the chords, which were previously extracted by the Chordino Vamp plugin [135]. A frequent chord change does not necessarily correspond to a rich harmonic progression, since only a few different chords may be a part of the chord sequence.
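The difference between the two counts can be illustrated with a short sketch. The input format, a list of (onset in seconds, chord label) pairs sorted by onset, is a hypothetical stand-in for the plugin output, and the handling of chords that overlap the window borders is an assumption.

```python
def chord_counts(segments, window_start, window_length=10.0):
    """Return (number of different chords, number of chord changes) in a window.

    segments: list of (onset_seconds, chord_label), sorted by onset; each chord
    is assumed to last until the onset of the next one.
    """
    window_end = window_start + window_length
    active = []
    for idx, (onset, label) in enumerate(segments):
        if idx + 1 < len(segments):
            next_onset = segments[idx + 1][0]
        else:
            next_onset = float("inf")  # last chord lasts until the end
        # Keep every chord that overlaps the window at least partially.
        if onset < window_end and next_onset > window_start:
            active.append(label)
    n_different = len(set(active))
    n_changes = sum(1 for a, b in zip(active, active[1:]) if a != b)
    return n_different, n_changes
```

For the sequence C, G, C, G, C with a new chord every two seconds, the first 10 s window contains four chord changes but only two different chords.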

• Shares of the most frequent 20, 40, and 60 per cent of chords with regard to their duration: Initially, the durations of each chord are summed up for each chord type, and the most frequent chords for the complete music piece are estimated. $c_{20}$, $c_{40}$, and $c_{60}$ save the indices of the most frequent chords, which cover more than 20%, 40%, and 60% of the song. Afterwards, the time shares of these most frequent chords are estimated for each extraction window:

$$CS(k) = \sum_{i \in c_k} \frac{Ch_i}{W_e}, \tag{2.12}$$

where $k \in \{20, 40, 60\}$, and $Ch_i$ is the overall duration of the chord $i$ in the extraction
