IMPROVING PIANO MUSIC TRANSCRIPTION BY ELMAN DYNAMIC NEURAL NETWORKS

GIOVANNI COSTANTINI
Department of Electronic Engineering, University of Rome “Tor Vergata”, via del Politecnico, 1 – 00133 Rome, Italy
Institute of Acoustics “O. M. Corbino”, via del Fosso del Cavaliere, 100 – 00133 Rome, Italy

MASSIMILIANO TODISCO
Department of Electronic Engineering, University of Rome “Tor Vergata”, via del Politecnico, 1 – 00133 Rome, Italy

MASSIMO CAROTA
Department of Electronic Engineering, University of Rome “Tor Vergata”, via del Politecnico, 1 – 00133 Rome, Italy

In this paper, we present two methods based on neural networks for the automatic transcription of polyphonic piano music. The input to these methods consists of live piano music acquired by a microphone, while the pitch of all the notes in the corresponding score forms the output. The aim of this work is to compare the accuracy achieved using a feed-forward neural network, the MLP (MultiLayer Perceptron), with that supplied by a recurrent neural network, the ENN (Elman Neural Network). Signal processing techniques based on the CQT (Constant-Q Transform) are used to create a time-frequency representation of the input signals, and Non-Negative Matrix Factorization (NMF) is used for onset detection. Since large-scale tests were required, the whole process (synthesis of audio data generated starting from MIDI files, comparison of the results with the original score) has been automated. Training, validation and test sets have been generated with reference to three different musical styles, respectively represented by J.S. Bach’s inventions, F. Chopin’s nocturnes and C. Debussy’s preludes.

1. Introduction

This work addresses the problem of extracting musical content (i.e., a symbolic representation of the musical notes) from audio data, with particular reference to polyphonic piano music. Indeed, the overlapping harmonic components of notes that occur simultaneously in polyphonic music significantly complicate automated transcription. The first algorithms, developed by Moorer [1], used comb filters and autocorrelation in order to transcribe very


restricted duets. The most important work in this research field is the Sonic project [2], developed by Marolt.

In this paper, we propose two supervised classification methods that infer the correct note labels based only on training with labeled examples. These methods perform polyphonic transcription via a system of MultiLayer Perceptron (MLP) [3] classifiers and a system of Elman Neural Network (ENN) [3] classifiers, both trained on spectral features obtained by means of the well-known Constant-Q Transform (CQT) [4]. In order to investigate the influence of musical style on the classification systems, we used MIDI files containing works by three composers who lived in different eras: J.S. Bach (three-part inventions), F. Chopin (nocturnes) and C. Debussy (preludes).

The processing phase starts at each note onset. The onset detection algorithm, proposed in [5], is based on a suitable binary time-frequency representation of the audio signal followed by Non-Negative Matrix Factorization (NMF) [6].
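To make this step concrete, the following is a minimal sketch of NMF-based onset detection in Python. It is not the authors' implementation: the component count, novelty function and peak-picking threshold are all assumptions.

```python
# Hypothetical sketch of NMF-based onset detection (cf. [5], [6]);
# all parameters are illustrative assumptions, not those of the paper.
import numpy as np
from scipy.signal import find_peaks
from sklearn.decomposition import NMF

def detect_onsets(V, hop_s, n_components=8, height=0.3):
    """V: non-negative magnitude spectrogram (freq_bins x frames);
    hop_s: frame hop in seconds. Returns onset times in seconds."""
    # Factorize V ~ W @ H: W holds spectral templates,
    # H their time-varying activations.
    model = NMF(n_components=n_components, init='nndsvd', max_iter=500)
    W = model.fit_transform(V)           # (freq_bins, n_components)
    H = model.components_                # (n_components, frames)
    # A sudden rise in any activation marks a candidate onset: sum the
    # positive temporal derivative of H over all components.
    novelty = np.maximum(np.diff(H, axis=1), 0.0).sum(axis=0)
    novelty /= novelty.max() + 1e-12     # normalize to [0, 1]
    peaks, _ = find_peaks(novelty, height=height, distance=3)
    return (peaks + 1) * hop_s
```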

The paper is organized as follows: Section 2 describes the generation of the audio data set; Section 3 illustrates the spectral features; Section 4 is devoted to the description of the note classification sensor interface; finally, Section 5 presents the results concerning the transcription of the notes in the score.

2. Audio Data Set

We trained the neural networks with five different training-validation data sets: two major-tonality and two minor-tonality sets, taken from the Bach and Chopin repertoires, and one taken from Debussy’s pieces (Table 1). The whole process (synthesis of the audio data starting from the MIDI files, transposition, and comparison of the transcription results with the original score) was automated.

Table 1: Training-validation audio data set (an arrow denotes transposition to the common tonality).

Bach’s & Chopin’s pieces, major tonality
  Training:   Bach’s three-part invention in C major; Chopin’s nocturne in Ab major → C major
  Validation: Bach’s three-part invention in D major → C major; Chopin’s nocturne in G major → C major

Bach’s & Chopin’s pieces, minor tonality
  Training:   Bach’s three-part invention in A minor; Chopin’s nocturne in Bb minor → A minor
  Validation: Bach’s three-part invention in E minor → A minor; Chopin’s nocturne in C minor → A minor

Debussy’s pieces
  Training:   Debussy’s Prelude n. 1
  Validation: Debussy’s Prelude n. 4
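As stated above, the generation of the audio data was automated. A minimal sketch of how such a pipeline could look, using the pretty_midi library (an assumption: the paper does not name its tools), is given below.

```python
# Hypothetical data-generation sketch; pretty_midi and the semitone shifts
# are assumptions: the paper only states that the process was automated.
import pretty_midi

def transpose_and_render(midi_path, semitones, fs=44100):
    """Transpose every note by `semitones` and synthesize audio at rate fs."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in inst.notes:
            note.pitch += semitones   # e.g. -2 maps D major onto C major
    # synthesize() uses simple sine waves; rendering through a piano
    # soundfont (pm.fluidsynth) would be closer to the paper's setting.
    audio = pm.synthesize(fs=fs)
    return pm, audio

# Example: a validation piece in D major transposed down to C major
# (the file name is hypothetical).
pm, audio = transpose_and_render("bach_inv_D_major.mid", semitones=-2)
```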


3. Spectral Features

The processing phase starts at each note onset. First, the attack portion of the note is discarded (in the case of the piano, the longest attack time is about 32 ms). Then, after Hanning windowing, a single CQT of the following 64 ms of the audio note event is calculated. Figure 1 shows the complete process.
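For reference, a direct (unoptimized) implementation of the constant-Q transform as defined by Brown [4] could look as follows; the frequency range and resolution are assumptions, and an efficient kernel-based variant would normally be preferred in practice.

```python
# Sketch of Brown's direct constant-Q transform [4]; fmin, n_bins and
# bins_per_octave are illustrative assumptions, not the paper's settings.
import numpy as np

def cqt_frame(x, sr, fmin=65.4, n_bins=72, bins_per_octave=12):
    """Magnitude CQT of one analysis frame x (1-D array) at sample rate sr."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)  # constant quality factor
    X = np.zeros(n_bins)
    for k in range(n_bins):
        fk = fmin * 2.0 ** (k / bins_per_octave)     # center frequency of bin k
        Nk = min(int(np.ceil(Q * sr / fk)), len(x))  # window shrinks as fk grows;
                                                     # low bins may need more than
                                                     # the 64 ms frame provides
        n = np.arange(Nk)
        w = np.hanning(Nk)                           # Hanning window, as in Fig. 1
        X[k] = np.abs(np.sum(w * x[:Nk] * np.exp(-2j * np.pi * Q * n / Nk)) / Nk)
    return X
```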

Fig. 1: Complete spectral features extraction process.

4. Note Classification Sensor Interface

Our note classification sensor interface consists of three components: a microphone; a signal pre-processing stage, based on our onset detection algorithm and on the extraction of CQT events, as explained in the previous section; and a bank of separate one-versus-all (OVA) binary classifiers, implemented both as feed-forward neural networks and as Elman neural networks and trained on the spectral features of the most important degrees of the major and minor scales. The input to this sensor interface consists of live piano music acquired by a microphone, while the pitch of all the notes in the corresponding score forms the output. The following figure shows the note classification sensor interface.
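To make the OVA scheme concrete, here is a minimal sketch of a single Elman binary classifier; PyTorch, the hidden size and the pitch range are assumptions, since the paper does not specify its implementation. Note that torch.nn.RNN realizes exactly the Elman recurrence, with the hidden state fed back as the context layer.

```python
# Hypothetical OVA Elman note classifier; framework and sizes are assumptions.
import torch
import torch.nn as nn

class ElmanNoteClassifier(nn.Module):
    """One binary (one-versus-all) classifier for a single pitch."""
    def __init__(self, n_cqt_bins=72, hidden=50):
        super().__init__()
        # A tanh nn.RNN is the classic Elman network: the previous hidden
        # state acts as the context layer at each time step.
        self.rnn = nn.RNN(input_size=n_cqt_bins, hidden_size=hidden,
                          batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, time, n_cqt_bins)
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h))  # per-frame "note present" probability

# One classifier per piano pitch (MIDI 21-108); frame t of classifier p is
# labeled 1 when pitch p sounds at t. The MLP baseline would replace the
# recurrent layer with a feed-forward hidden layer applied to single frames.
classifiers = {pitch: ElmanNoteClassifier() for pitch in range(21, 109)}
```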

Fig. 2: Note classification sensor interface.

5. Conclusion and Discussion

We tested the algorithm on all of Bach’s three-part inventions, all of Chopin’s nocturnes and all of Debussy’s preludes. The average accuracy, in terms of F-measure (the harmonic mean of precision and recall, F = 2PR/(P + R)), was 83% for the ENN classifiers and 76% for the MLP classifiers.


Figure 3 shows that the difference in transcription accuracy between the ENN F-measure and the MLP F-measure decreases across musical styles, from Bach to Chopin to Debussy. The advantage of the dynamic neural network on Bach’s inventions and on Chopin’s nocturnes is due to the composition method adopted by these two authors, based on classical harmony: note transitions are strongly constrained, so a network with memory of the preceding context can exploit them. In Debussy’s compositions, on the contrary, almost every transition from one note to another is equally likely; in this case, the performances of the dynamic classifier and the static classifier are comparable.

Fig. 3: Transcription accuracy difference.

References

1. J. A. Moorer, “On the transcription of musical sound by computer”, Computer Music Journal, vol. 1, no. 4, pp. 32–38, 1977.
2. M. Marolt, “A connectionist approach to automatic transcription of polyphonic piano music”, IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
3. J. Hertz, A. Krogh and R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Reading, Massachusetts, 1991.
4. J. C. Brown, “Calculation of a constant Q spectral transform”, Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.
5. G. Costantini, R. Perfetti and M. Todisco, “Event based transcription system for polyphonic piano music”, Signal Processing (2009), doi:10.1016/j.sigpro.2009.03.024.
6. P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription”, Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 177–180, October 2003.