ing Model to Piano Music
To gain a better understanding of the features found in polyphonic music recordings, it is beneficial to analyse the statistics of a single piano note. If the goal of a learning algorithm is to learn representations of individual notes, it has to be investigated whether such representations are feasible in the time domain and, if so, what form these representations are likely to take. The questions of interest are:
1. Is there a single time domain representation that contains the relevant features of a note in order to distinguish notes of different instrument types, or even of different models of the same instrument, so that such a representation can be assigned to an individual note and instrument in a recording?
2. Does a single feature contain enough information for transcription, source separation or signal compression, i.e. can a signal be compressed using one general feature vector for each different note present?
3. What do such features look like and what information do they contain? Is, for example, the relationship of the phase of high frequency and low frequency components similar for different realisations of the same note?
4. How high is the dimension of the space spanned by a note played several times? Can the series of notes be represented accurately enough with a single vector, and what information gets lost in such a representation?
CHAPTER 8. EMERGENCE OF MUSICAL STRUCTURES 133
Here several realisations of a single note played on a piano under similar conditions are analysed and their properties studied. The notes analysed were taken from a commercial recording of Ludwig van Beethoven’s Sonata for Piano No. 12, in A flat, Scherzo (Allegro molto) in which this particular note was played 14 times without any other overlapping notes. The note was extracted by cutting the recording just before the note onset and again just before the onset of the following note. This procedure gave a set of 14 notes of identical pitch, played at roughly the same loudness and of roughly the same length.
Principal component analysis (PCA) of the 14 piano notes was performed. The individual piano notes were normalised and time aligned to maximise the cross correlation between them before conducting PCA. The top panel in figure 8.1 shows the ordered contribution each principal component makes to the variance observed in the 14 observations. It was found that two principal components account for 88% of the variance of the original notes. The other components are much less significant, with the third component accounting for about 4% of the variance. Note that only 13 principal components have been found, which means that the contribution of the 14th component was smaller than the accuracy of the computation, so that the 14 notes effectively span a 13 dimensional space. It is also clear that a two dimensional representation would account for 88% of the variance, and it seems that at least two components are necessary to represent a single note.
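The procedure described above (normalise, align by cross correlation, then PCA) can be sketched as follows. This is a minimal illustration on synthetic decaying tones, not the recording analysed in the text. The function names, the sampling rate, the circular shift used for alignment and the mean-centring step are all assumptions made for the sketch. If the data are mean-centred, 14 observations can yield at most 13 components with non-zero variance, which would be consistent with the 13 components reported above.

```python
import numpy as np

def normalise(x):
    """Scale a note to unit Euclidean norm."""
    return x / np.linalg.norm(x)

def align_to(reference, x):
    """Shift x (circularly, an assumption) by the lag that maximises
    its cross correlation with the reference note."""
    xcorr = np.correlate(x, reference, mode="full")
    lag = int(np.argmax(xcorr)) - (len(reference) - 1)
    return np.roll(x, -lag)

def pca_variance_fractions(notes):
    """Return the principal components (rows) and the fraction of the
    total variance explained by each, in decreasing order."""
    centred = notes - notes.mean(axis=0)  # mean-centring is assumed here
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    variance = s ** 2
    return vt, variance / variance.sum()

# Synthetic stand-ins for the 14 extracted notes (hypothetical data).
rng = np.random.default_rng(0)
fs = 8000                       # assumed sampling rate in Hz
t = np.arange(2048) / fs

def synthetic_note(decay, delay):
    """Decaying two-harmonic tone with a small onset offset and noise."""
    tone = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
    note = np.exp(-decay * t) * tone
    return np.roll(note, delay) + 0.01 * rng.standard_normal(len(t))

raw = [synthetic_note(rng.uniform(10, 20), int(rng.integers(0, 20)))
       for _ in range(14)]
reference = normalise(raw[0])
notes = np.array([align_to(reference, normalise(x)) for x in raw])

components, fractions = pca_variance_fractions(notes)
print(np.round(fractions, 3))   # last entry is zero up to rounding
```

The variance fractions printed here depend entirely on the synthetic data and should not be compared with the 88% figure reported for the piano recording.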
The time domain and spectral representations of one of the original piano notes are shown in the second row of figure 8.1. In the third and fourth rows of figure 8.1 the time domain and spectral representations of the principal components related to the highest and second highest variance are shown respectively. The similarity of the spectrum of the original note to the principal components is evident. It can further be seen that the second principal component has much higher fifth and seventh harmonics than the first principal component.
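A comparison of harmonic strengths such as the one above can be made by reading the magnitude spectrum at multiples of the fundamental. The following is a minimal sketch under stated assumptions: the function name, the Hann window, the search band of ±3% around each harmonic and the synthetic test signal are illustrative choices and are not taken from the analysis in the text.

```python
import numpy as np

def harmonic_magnitudes(signal, fs, f0, n_harmonics):
    """Peak spectral magnitude near each of the first n_harmonics
    multiples of the fundamental frequency f0 (in Hz)."""
    window = np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    magnitudes = []
    for h in range(1, n_harmonics + 1):
        # Search a +/-3% band (assumed) to tolerate slight inharmonicity.
        band = (freqs > 0.97 * h * f0) & (freqs < 1.03 * h * f0)
        magnitudes.append(spectrum[band].max())
    return np.array(magnitudes)

# Synthetic example: a fundamental plus a weaker fifth harmonic.
fs = 8000                       # assumed sampling rate in Hz
t = np.arange(4096) / fs
f0 = 200.0
signal = np.sin(2 * np.pi * f0 * t) + 0.2 * np.sin(2 * np.pi * 5 * f0 * t)

mags = harmonic_magnitudes(signal, fs, f0, n_harmonics=7)
print(np.round(mags / mags[0], 3))  # harmonic strengths relative to the fundamental
```

Applied to two principal components in turn, the ratio of the returned vectors would indicate which harmonics are stronger in one component than in the other.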
The time domain and spectral representations of the weakest principal component are shown in the last row in figure 8.1. It is obvious that it contains much more noise but it still contains some harmonic structure. It is interesting to note that the higher harmonics are of comparable strength to the high harmonics in the other principal components, while the low harmonics are much weaker.

[Figure 8.1 panels, top to bottom: % of contribution to variance against number of principal component; original piano note; first principal component; second principal component; thirteenth principal component. Left column axis: time/ms; right column axis: frequency/kHz.]
Figure 8.1: Principal component analysis of piano notes. Percentage that each principal component contributes to the variance of 14 piano notes (top), the time-domain (left) and spectral-domain (right) representations of one of the original piano notes (second row), the principal components with the largest eigenvalue (third row), the second largest eigenvalue (fourth row) and the smallest eigenvalue (fifth row).

[Figure 8.2 panels, top to bottom: first principal component; second principal component. Left column axis: time/ms; right column axis: frequency/kHz.]
Figure 8.2: Time-domain (left) and spectral-domain (right) representations of the two strongest principal components of the left channel. The form of the envelope clearly suggests that the two principal components are necessary to represent the piano notes with different envelopes.
Here, as well as in the results presented in the next chapter, a stereo recording was summed to mono before analysing the signal. In order to see the effect of this summation and to investigate whether the second strongest principal component might be due to the signal reaching the different recording microphones with varying strengths and delays, the same experiment was conducted using only one of the stereo channels. The same observations were made as reported above. It is interesting to note that the two principal components that are responsible for most of the variation in the signal have different amplitude envelopes. One component models mostly the note onset, while the other component models mainly the latter part of the note. This is shown in figure 8.2. A similar observation, though not as pronounced, can be made for the strongest principal components shown in figure 8.1, which were found in the previous experiment. In both cases, the second strongest component is found to have a slightly higher high frequency content compared to the strongest component.
The above results were obtained from a set of notes, all of which were roughly of the same length. For notes of different lengths these results are not valid. However, the rest of this thesis demonstrates that notes of different lengths can be handled by the shift-invariant sparse coding model by concatenating features. This is shown in detail in chapter 9. It should also be mentioned that the piano notes used above were all of roughly the same loudness, so the influence of changes in loudness on a linear representation could not be deduced from the above experiment. For the case of notes of similar lengths and with roughly similar amplitude we can give the following answers to some of the questions raised above.
1. It can be seen that a single component can represent the magnitude spectrum of a piano note quite accurately, but fails to represent the different time envelopes observed for different notes. For accurate reconstruction of notes it seems necessary to model a note with at least two features to cater for the different time envelopes. It can, however, be assumed that a single feature can capture much of the information and that two features can represent a piano note quite accurately. The question of whether features learned from different sources are sufficiently different to separate two sources was not investigated here, but experimental results are presented in chapter 10. These results show that, at least for certain mixtures, this assumption can be made.
2. In the piano example reported above it is evident that the pitch of a piano note can be described by a single feature. Whether this is still true for notes played on different pianos or for notes recorded in different acoustical environments is questionable. However, for the task of identifying individual notes played by a single piano, a set of features each describing an individual piano note might be sufficient. For blind source separation, features have to be found and grouped that relate to single sources and that offer good reconstruction of those sources. For piano signals, it was shown that at least 2 features are required for good reconstruction of individual notes. For high quality signal compression a single feature is therefore not enough.
However, if a MIDI representation of a musical signal is seen as a low quality compression of the original audio file, then such a compression can be achieved from transcription, which seems feasible with only a single feature.
3. It can be seen that the single time domain feature that represents the above piano notes relatively well has a similar spectrum to the original piano note. It also has an envelope that is similar to the original time envelope of the notes in the sample space. It must again be mentioned that the above sample set was quite restricted in that it not only contained notes at the same velocity, but also notes of a similar length with only slight envelope deviations. It is therefore not surprising that a single time domain representation can be found with a similar time envelope. Such a simple representation is not possible for sounds of varying length or for sounds with different time envelopes.
4. It has been shown that for the example studied here, most of the variance of the notes is concentrated in a two dimensional subspace. However, the dimension of this subspace is likely to increase for more complex signals.
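The dimensionality question in item 4 can be made concrete by asking how many principal components are needed to reach a given share of the variance, and how well a low-rank reconstruction matches the notes. The sketch below uses constructed data in which two orthogonal waveforms carry 90% and 10% of the variance. The function names, the thresholds and the data are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def components_needed(notes, threshold):
    """Smallest number of principal components whose cumulative share
    of the variance reaches the threshold (0 < threshold < 1)."""
    centred = notes - notes.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, threshold) + 1)

def rank_k_reconstruction(notes, k):
    """Reconstruct every note from the mean plus its k strongest
    principal components."""
    mean = notes.mean(axis=0)
    u, s, vt = np.linalg.svd(notes - mean, full_matrices=False)
    return mean + (u[:, :k] * s[:k]) @ vt[:k]

# Constructed data: 14 "notes" built from two orthogonal waveforms.
n = np.arange(512)
wave_a = np.sin(2 * np.pi * 8 * n / 512)
wave_b = np.cos(2 * np.pi * 8 * n / 512)
i = np.arange(14)
coeff_a = 3.0 * np.cos(2 * np.pi * i / 14)   # carries 90% of the variance
coeff_b = 1.0 * np.sin(2 * np.pi * i / 14)   # carries 10% of the variance
notes = np.outer(coeff_a, wave_a) + np.outer(coeff_b, wave_b)

print(components_needed(notes, 0.85))   # one component suffices
print(components_needed(notes, 0.95))   # two components are needed

# Relative error of a one-component reconstruction.
error = np.linalg.norm(notes - rank_k_reconstruction(notes, 1))
print(error / np.linalg.norm(notes))
```

For more complex signals the count returned by such a function would be expected to grow, in line with the remark above.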