
Automatically classifying musical genre or style by examining a file’s audio or symbolic (usually MIDI format) musical content has applications primarily in musical information retrieval (MIR) and cognitive science. In the former case, the goal is to automate the human task of assigning genres to tracks in musical databases to facilitate searching, browsing and recommendation. In the latter, the goal is to discover the processes behind the human cognition of musical style, and often to try to determine how composer styles are manifested statistically or structurally. The computational approaches for each discipline have tended to be slightly different in the literature. MIR research focuses predominantly on statistical feature extraction and standard machine learning techniques. Style cognition research has a longer history, and has seen emphasis on grammatical and probabilistic models in addition to statistical feature extraction.

Scaringella et al. [Scaringella et al. 2006] provide a comprehensive survey of automatic genre classification, pointing out that it is an extremely non-trivial problem not only for technical reasons, but also due to many endemic problems with genre definitions themselves. One of these problems is the lack of a consistent semantic basis: labelling can derive from geographical origins (Latin), historical periods (Classical), instrumentation (Orchestral), composition techniques (Musique Concrète), subcultures (Jazz), or from terms which are coined arbitrarily in the media or by artists (Dubstep). Issues of scalability arise whenever new genres emerge from combinations of old ones. Pachet and Cazaly noted the utter lack of consensus on genre taxonomies among researchers and popular musical databases [Pachet and Cazaly 2000].

These problems cannot be ignored when designing classifiers. Scaringella et al. argue that attempting to derive genre from audio requires the assumption that it is as much an intrinsic attribute of a title as tempo, which is “definitely questionable” [Scaringella et al. 2006]. Dannenberg et al. commented that higher-level musical intent appears “chaotic and unstructured” when viewed as low-level data streams [Dannenberg et al. 1997]. On the other hand, one particular study seems to provide good motivation for this line of research: Gjerdingen and Perrott found that humans with varied musical backgrounds were able to correctly categorise musical snippets of only 250ms in 53 percent of cases, and snippets of 3 seconds in 72 percent of cases [Gjerdingen and Perrott 2008]. This result is convincing evidence that even untrained humans have an innate ability to recognise style from a small amount of data, which implies that the data must contain some measurable characteristics which make that possible. Therefore, in MIR the emphasis to date has been on the extraction of meaningful statistical features from short frames of audio data.

Statistical features extracted from audio fall into the broad categories of temporal, spectral, perceptual and energy content [Scaringella et al. 2006]. The precise feature extraction algorithms are numerous and need not be discussed here. Feature patterns are used to train models based on unsupervised clustering algorithms or supervised learning algorithms. In both cases the resulting model of pattern separation is used as the basis for the classification of new patterns extracted from unlabelled pieces of music. Various authors have reported success with an array of different algorithms and feature sets, for both audio and symbolic data [Scaringella et al. 2006]. The advantage of symbolic data is that reliably discerning musical statistics such as pitch and chord relationships is easily accomplished; a disadvantage is the shortage of important spectral information.
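To illustrate why pitch statistics are easy to obtain from symbolic data, the following minimal sketch derives two simple features from a list of MIDI note numbers: a pitch-class histogram and a melodic interval histogram. These particular features, and the function names `pitch_class_histogram`, `interval_histogram` and `feature_vector`, are assumptions chosen for illustration; they are not taken from any of the cited studies.

```python
from collections import Counter


def pitch_class_histogram(midi_pitches):
    """Return a 12-bin normalised histogram of pitch classes (C=0 ... B=11)."""
    total = len(midi_pitches)
    if total == 0:
        return [0.0] * 12
    counts = Counter(p % 12 for p in midi_pitches)
    return [counts.get(pc, 0) / total for pc in range(12)]


def interval_histogram(midi_pitches, max_interval=12):
    """Return a normalised histogram of absolute melodic intervals (semitones).

    Intervals larger than max_interval are clamped into the last bin.
    """
    intervals = [
        min(abs(b - a), max_interval)
        for a, b in zip(midi_pitches, midi_pitches[1:])
    ]
    if not intervals:
        return [0.0] * (max_interval + 1)
    counts = Counter(intervals)
    return [counts.get(i, 0) / len(intervals) for i in range(max_interval + 1)]


def feature_vector(midi_pitches):
    """Concatenate both histograms into one illustrative feature pattern."""
    return pitch_class_histogram(midi_pitches) + interval_histogram(midi_pitches)


# Hypothetical input: an ascending C major scale as MIDI note numbers.
melody = [60, 62, 64, 65, 67, 69, 71, 72]
features = feature_vector(melody)
```

A feature pattern of this kind could then be passed to any of the clustering or supervised learning algorithms mentioned above. Audio would require a signal-processing front end to estimate comparable quantities.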

Chai and Vercoe classified symbolic encodings of monophonic folk melodies as being Irish, German or Austrian using Hidden Markov Models, with an accuracy approaching 80 percent [Chai and Vercoe 2001]. The classification of symbolically encoded folk songs was also addressed by Bod, using probabilistic grammars to achieve 85 percent accuracy [Bod 2001]. Shan and Kuo trained a genre classifier using both MIDI harmonies and melodies [Shan and Kuo 2003]; they used a method combining a priori pattern finding with heuristics, which achieved an accuracy of 84 percent using just melodic features. Kiernan used self-organising maps to successfully partition audio into three classes representing the composers Friederick, Quantz and Bach [Kiernan 2000]. Ruppin and Yeshurun [Ruppin and Yeshurun 2006] used the K-nearest-neighbour algorithm to classify MIDI files as either Classical, Pop or Classical Japanese, with 85 percent accuracy. Kosina used K-nearest-neighbour classification to label audio as Metal, Dance or Classical with 88 percent accuracy [Kosina 2002]. Xu et al. distinguished between Pop, Classical, Jazz and Rock audio using support-vector machines, with 96 percent accuracy [Xu et al. 2003]. Among the most comprehensive and successful work in MIR to date is that by McKay, who used a learning ensemble consisting of neural network and K-nearest-neighbour classifiers trained on MIDI files using 111 features and audio using 26 features, each weighted by sensitivity using a genetic algorithm. This system achieved a 9-genre classification accuracy of 98 percent [McKay 2010].
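As a rough illustration of the K-nearest-neighbour approach mentioned above, the sketch below labels a new feature pattern by majority vote among its closest labelled neighbours. The two-dimensional feature values, the Euclidean distance metric and the names `euclidean` and `knn_classify` are assumptions made for this example; they do not reproduce the feature sets or settings of any cited system.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k=3):
    """Return the majority label among the k training patterns nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical, hand-made training patterns: two made-up features per piece.
train = [
    ([0.90, 0.80], "Metal"),
    ([0.80, 0.90], "Metal"),
    ([0.85, 0.75], "Metal"),
    ([0.20, 0.10], "Classical"),
    ([0.10, 0.20], "Classical"),
    ([0.15, 0.25], "Classical"),
]

label_a = knn_classify(train, [0.8, 0.8])
label_b = knn_classify(train, [0.2, 0.2])
```

In practice the choice of k, the distance metric and any feature weighting all affect accuracy, which is one reason reported results are difficult to compare across studies.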

The majority of authors agree that improvement can be made by increasing the sophistication of the feature sets, but evidently there is still no widely accepted algorithm for making even extremely broad classifications. Some authors have deduced that the relatively small size of the datasets may be to blame: both McKay and Ponce de León et al. have concluded that song databases much larger than those currently in use are the key to assessing the real worth of particular combinations of feature sets and learning algorithms [McKay 2010; Ponce de León et al. 2004]. McKay also advocates the training of classifiers on both audio and symbolic features simultaneously. This requires perfect MIDI transcriptions of audio files, a rare commodity that will continue to rely on highly skilled human labour until significant advances are made in the field of automated polyphonic transcription [McKay 2010].

The recent release of a million-song feature-set for public use [Bertin-Mahieux et al. 2011] is likely to instigate the next generation of MIR research and a significant raising of the bar in the near future. In the meantime, it must be stressed that the assignment of genre labels to the automated Schillinger System’s output will be flawed to an extent; the purpose of the experiment is simply to determine whether the output’s statistical characteristics point more towards certain styles than others, and whether the output contains a notable degree of diversity.
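One possible way to quantify these two questions, given a list of genre labels assigned to generated pieces, is to inspect the label distribution and its normalised Shannon entropy. The sketch below is an assumption made for illustration only: the entropy measure, the example labels and the names `label_distribution` and `normalised_entropy` are hypothetical and are not claimed to be the procedure used in the experiment.

```python
import math
from collections import Counter


def label_distribution(labels):
    """Return the proportion of pieces assigned to each genre label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def normalised_entropy(labels, num_classes):
    """Shannon entropy of the label distribution divided by log(num_classes).

    0 means every piece received the same label; 1 means the labels are
    spread uniformly over all num_classes genres.
    """
    if not labels or num_classes < 2:
        return 0.0
    dist = label_distribution(labels)
    entropy = -sum(p * math.log(p) for p in dist.values())
    return entropy / math.log(num_classes)


# Hypothetical classifier output for six generated pieces.
predicted = ["Jazz", "Jazz", "Classical", "Pop", "Jazz", "Classical"]
dist = label_distribution(predicted)
diversity = normalised_entropy(predicted, num_classes=3)
```

A skewed distribution would suggest that the output leans towards particular styles, while a value of `diversity` close to 1 would suggest a broad spread. Both readings inherit whatever errors the underlying classifier makes.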

§4.4 Assessing Stylistic Diversity