Proceedings of the 10 th International Conference on Music Perception and Cognition (ICMPC 10). Sapporo, Japan. Ken’ichi Miyazaki, Yuzuru Hiraga, Mayumi Adachi, Yoshitaka Nakajima, and Minoru Tsuzaki (Editors)

Gaussian Process Regression for Rendering Music Performance

Keiko Teramura*1, Hideharu Okuma*2, Yuusaku Taniguchi*3, Shimpei Makimoto*4, Shin-ichi Maeda#5

* Graduate School of Information Science, Nara Institute of Science and Technology, Japan
# Graduate School of Informatics, Kyoto University, Japan

{1keiko-te, 2hideharu-o, 3yuusaku-t, 4makimoto.shimpei01}@is.naist.jp, 5 [email protected]

ABSTRACT

So far, many computational models for rendering music performance have been proposed, but they often consist of many heuristic rules and tend to be complex. This makes it difficult to generate and select useful rules, or to optimize the parameters in the rules. In this study, we present a new approach that automatically learns a computational model for rendering music performance, with score information as input and the corresponding real performance data as output. We use a Gaussian process (GP) combined with a Bayesian Committee Machine, which reduces the heavy computational cost of the naive GP, to learn these input-output relationships. To evaluate the proposed scheme, we compared the real performance and the performance predicted by the trained GP in terms of three normalized errors: dynamics, attack time, and release time. We evaluated both the learning ability and the generalization ability. The results show that the trained GP has an acceptable learning ability for 'known' pieces, but insufficient generalization ability for 'unknown' pieces. This suggests that the GP can learn expressive music performance without setting many parameters manually, but that the current training dataset is not large enough to generalize from the training pieces to 'unknown' test pieces.

INTRODUCTION

Anyone who has performed music will agree that we do not follow the music score exactly, but accent, expand, or contract notes to some extent in order to perform music expressively. Understanding how we perform music expressively, or knowing which elements are essential to expressive performance, is one of the most important themes in musicology. Rendering expressive music performance is the counterpart of understanding it. In 1984, Fryden and Sundberg proposed a rendering model for monophonic music. Since then, attention to computational models for rendering expressive music has been growing as computing power increases. In recent years, there are even models for real-time musical accompaniment (Raphael, 2003) and for polyphonic music (Hashida et al., 2007). However, many of the existing approaches are rule-based and consist of many handcrafted rules. These rules and their parameters must be determined based on technical knowledge or pilot experiments, yet such tasks are complex and often difficult to carry out thoroughly. Therefore, methods that determine the computational model automatically are desired.

Machine learning is one of the promising approaches for extracting rules from a dataset automatically. Widmer (2002) succeeded in discovering 17 simple music performance rules out of 383 rules using machine learning. This greatly helps in constructing a computational model for expressive music performance; however, it is still necessary to prepare the rules in advance and to tune their parameters manually. Dorard et al. (2007) presented a model that learns the relationship between score information and performance information directly, using Kernel Canonical Correlation Analysis, which is one of the kernel methods. The kernel method has an advantage in designing the model: it does not require many parametric rules to be prepared in advance, but only a kernel that, in this case, measures the proximity between inputs and outputs. They predict the output for a test input by searching for the output that is most closely related to the test input. However, their method needs chord information as an input. Because chord information is not explicitly given in the music score, the method is not suitable for learning the input-output relationship from a large dataset of musical scores.

The purpose of this study is to propose a machine-learning model for expressive music performance that requires few parameters to be set manually. The model learns the music performance so as to predict the human performance, which includes conscious or unconscious deviations from the musical score. In particular, we use the Gaussian process (GP), which is also a kernel method and does not need many manually set parameters. Using the GP and the CrestMusePEDB (www.crestmuse.jp/pedb/), a public database of score text information and real performance information, as the training dataset, we can directly approximate input-output relationships that may include rules that are hard to describe explicitly. If the input-output relationship obeys a certain probability distribution and training samples can be obtained from that distribution, the GP is guaranteed to reproduce the true input-output relationship under certain conditions (Rasmussen & Williams, 2006). Therefore, we expect that the accuracy can be enhanced by using a large training dataset.

METHOD

We utilize GP regression to approximate the input-output relationships. Before describing the GP, we explain the input and output information. For each single note, three outputs and thirteen corresponding input features are defined, and three functions, each of which receives the thirteen input features and returns one of the three outputs, are learned independently.

A. Input features

The input features consist of information relevant to a single target note. Several features are taken from the PEDB-SCR data in the CrestMusePEDB. The input features are categorized into four groups: position features, pitch features, duration features, and dynamics features. Each of them is defined as follows.


<position feature>
- meter: This describes the musical meter by two dimensions, e.g., (4,4) for 4/4, (3,4) for 3/4, (6,8) for 6/8, and so on.
- first beat: This describes whether the target note is on the first beat of the bar; it takes 1 if the note is on the first beat and 0 otherwise.
- melody part: This discriminates whether the target note is in the melody part; it takes 1 if the note is in the highest part and 0 otherwise.

<pitch feature>
- relative pitch: This describes the relative pitch of the target note. In the PEDB-SCR data, pitch is represented in SMF (standard MIDI file) format, e.g., C4 = 60. We transformed it into a key-invariant relative pitch based on the key attribute, which is also described in the PEDB-SCR data. The transformation was done so as to match the dominant note to C. To discriminate the mode (major or minor), we numbered the pitches on the scale with integers so that the difference between the relative pitches of two notes on the scale reflects how many scale steps apart they are, while a pitch lying between two neighboring scale notes is numbered halfway between the two neighbors' relative pitches. Examples for G major and A minor are shown in Figures 1 and 2, respectively.
- pre-pitch I: This indicates the pitch of the previous note minus the pitch of the target note.
- pre-pitch II: This indicates the difference between the pitch of the note two notes before the target note and the predicted pitch, divided by the target pitch. The predicted pitch is calculated as the target note's pitch plus twice the pre-pitch I.
- pre-pitch III: This indicates the average pitch of the notes (hereafter, AP) in the previous bar minus the AP in the bar where the target note appears.
- post-pitch I: This indicates the pitch of the next note minus the pitch of the target note.
- post-pitch II: This indicates the difference between the AP calculated for the next bar and the AP predicted from the current and previous bars. The predicted AP is calculated as the AP in the current bar minus the pre-pitch III.

Figure 1. Example of relative pitch for G major. The circled numbers on the piano keyboard indicate the relative pitch of the corresponding notes.

Figure 2. Example of relative pitch for A minor. The circled numbers on the piano keyboard indicate the relative pitch of the corresponding notes.

<duration feature>
- duration: This feature represents the duration of the target note. It takes 1 for a quarter note, 2 for a half note, and so on. When the target note is a grace note, it takes 1/8, regarding it as a demisemiquaver. Also, if there is a tie on the target note, we extend the duration by combining the two tied notes into one.
- pre-duration ratio: This feature represents the previous note's duration divided by the target note's duration. It takes zero when the previous note is a rest.
- post-duration ratio: This feature represents the next note's duration divided by the target note's duration. It takes zero when the next note is a rest.
- average pre-duration ratio: This feature represents the average duration of the notes in the previous bar divided by the average duration of the notes in the target bar.
- average post-duration ratio: This feature represents the average duration of the notes in the next bar divided by the average duration of the notes in the target bar.

<dynamics feature>
- measure-dynamics: This feature represents the dynamics marking of the bar in which the target note appears. It takes integer values such as f = 2, mf = 1, mp = -1, and p = -2.
- note-dynamics: This feature represents the dynamics marking at the target note. It takes integer values in the same way as measure-dynamics.

B. Output

The outputs are taken from the PEDB-DEV data in the CrestMusePEDB as follows.
- dynamics: This represents the velocity of the note; the quantity follows the PEDB-DEV data format.
- attack: This represents the deviation of the actual attack time from the attack time expected from the music score for the target note. It takes 1 if the deviation equals the duration of a quarter note. A positive value means that the actual attack is later than the score indicates, and a negative value means it is earlier.
- release: This represents the deviation of the actual release time from the release time expected from the music score for ending the target note. It takes 1 if the deviation equals the duration of a quarter note. A positive value means that the actual release is later than the score indicates, and a negative value means it is earlier.
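To make this encoding concrete, the following is a minimal sketch of how one note could be turned into (part of) a feature vector. It is illustrative only: the Note container, its field names, and the note_features helper are hypothetical and are not part of the CrestMusePEDB tools, and only a subset of the thirteen features is shown.

    from dataclasses import dataclass

    @dataclass
    class Note:
        pitch: int           # SMF pitch number, e.g. C4 = 60
        duration: float      # in quarter-note units, e.g. 1.0 = quarter note
        bar_dynamics: int    # f = 2, mf = 1, mp = -1, p = -2
        on_first_beat: bool
        in_melody: bool

    def note_features(prev, cur, nxt):
        """Build part of the input feature vector for the current note.

        Covers only a subset of the thirteen features; rests, grace notes,
        and the bar-level averages are not handled in this sketch.
        """
        return [
            1.0 if cur.on_first_beat else 0.0,    # first beat
            1.0 if cur.in_melody else 0.0,        # melody part
            float(prev.pitch - cur.pitch),        # pre-pitch I
            float(nxt.pitch - cur.pitch),         # post-pitch I
            cur.duration,                         # duration
            prev.duration / cur.duration,         # pre-duration ratio
            nxt.duration / cur.duration,          # post-duration ratio
            float(cur.bar_dynamics),              # measure-dynamics
        ]

The three targets (dynamics, attack, release) would come from the aligned PEDB-DEV data for the same note.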

C. Gaussian Process

In this study, we use GP regression to learn the three input-output functions, where the input features are the music score information defined above and the output is one of the three deviations also defined above. A parametric function approximation optimizes the parameters of a function to fit the training data (we call this learning); it does not need to retain the training data, but only the learned parameters, when it predicts the output for given inputs (we call this inference). On the other hand, a non-parametric function approximation such as the GP does not require a concrete parametric function (and hence no such parameters), but needs the training data at inference time. In general, although the memory and computation cost of non-parametric learning tends to be high because it stores and uses the whole training data, it can represent a wide class of functions. In the case of rendering music performance, numerous rules can be considered; however, it is unknown which rule set is important, and finding out requires experience such as trial and error. Because it is difficult to prepare concrete (perhaps parametric) learning rules, we choose to use the GP. To resolve the computational difficulty of the naive GP, we utilize the Bayesian Committee Machine (BCM) (Tresp, 2000; Schwaighofer et al., 2003), which approximates the naive GP to reduce its computation cost.

We begin with a brief review of the naive GP. Suppose we have a training data set D = {x, t}, which contains N input vectors x = {x_1, ..., x_N} and the corresponding N scalar outputs t = {t_1, ..., t_N}. We assume each output t_n (n = 1, ..., N) follows

    t_n = y_n + \varepsilon_n,    (1)

where y_n denotes a random variable that depends on the input x_n, while \varepsilon_n denotes another random variable that is independent of the input x_n. If we regard \varepsilon_n as measurement noise, we can regard y_n as a noise-free output. When \varepsilon_n is assumed to obey a Gaussian distribution with mean zero and variance \sigma^2,

    p(t_n | y_n) = \mathcal{N}(t_n | y_n, \sigma^2)    (2)

holds. Here, \mathcal{N}(\cdot | \mu, \Sigma) denotes a Gaussian distribution whose mean is \mu and covariance matrix is \Sigma. We assume that the noise \varepsilon_n (n = 1, ..., N) is independent across samples, and denote the set of noise-free outputs by y = (y_1, ..., y_N). Therefore,

    p(t | y) = \mathcal{N}(t | y, \sigma^2 I_N),    (3)

where I_N denotes an N x N identity matrix. In the formulation of the GP, y is also assumed to be generated by a Gaussian distribution,

    p(y | x) = \mathcal{N}(y | 0, K_N),    (4)

where K_N denotes an N x N Gram matrix, and its (m, n) element k_{m,n}, i.e., the correlation between y_m and y_n, is assumed to be described by some proximity measure between x_m and x_n as k_{m,n} = k(x_m, x_n). The measure of proximity is defined by a kernel function. In this study, we used a rational quadratic kernel, so the (m, n) element k_{m,n} = k(x_m, x_n) is defined as

    k(x_m, x_n) = \frac{a}{(1 + b \, \| x_m - x_n \|^2)^{c}}.    (5)
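As a concrete illustration of eq. (5), the following sketch builds a Gram matrix with the rational quadratic kernel using NumPy. The kernel parameters a, b, and c are placeholders here; in the paper they are estimated by type II maximum likelihood, which is not shown.

    import numpy as np

    def rational_quadratic_kernel(X1, X2, a=1.0, b=1.0, c=1.0):
        """Rational quadratic kernel of eq. (5): a / (1 + b * ||x_m - x_n||^2)^c.

        X1: (M, d) array of input feature vectors, X2: (N, d) array.
        Returns the (M, N) Gram matrix.
        """
        sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                    + np.sum(X2 ** 2, axis=1)[None, :]
                    - 2.0 * X1 @ X2.T)
        sq_dists = np.maximum(sq_dists, 0.0)   # guard against small negative values
        return a / (1.0 + b * sq_dists) ** c

For example, K_N = rational_quadratic_kernel(X_train, X_train) gives the matrix of eq. (4).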

Here, ||.|| denotes the Euclidean norm, and the parameters a, b, and c are kernel parameters, which take positive values.

Now we consider the prediction of T noise-free outputs y^* = (y_1^*, ..., y_T^*) for T test input vectors x^* = {x_1^*, ..., x_T^*}. We assume that the conditional joint distribution of y^* and y, p(y^*, y | x^*, x), is a Gaussian distribution as in eq. (4), i.e., a Gaussian distribution whose mean is zero and whose covariance matrix has elements given by eq. (5). Then, given the training data set D and the test input vectors x^*, the distribution of the noise-free test outputs y^* is

    p(y^* | x^*, D) = \frac{p(y^*, t | x^*, x)}{p(t | x^*, x)} = \frac{\int p(t | y)\, p(y^*, y | x^*, x)\, dy}{\int\!\int p(t | y)\, p(y^*, y | x^*, x)\, dy\, dy^*}.    (6)

Note that p(y^* | x^*, D) is itself a Gaussian distribution because both p(t | y) and p(y^*, y | x^*, x) are Gaussian. The mean of the distribution p(y^* | x^*, D), E[y^* | x^*, D], is given by

    E[y^* | x^*, D] = K_N^* (K_N + \sigma^2 I_N)^{-1} t,    (7)

where K_N^* is a T x N Gram matrix whose (m, n) element is k_{m,n}^* = k(x_m^*, x_n), using the kernel function defined in eq. (5). We use E[y^* | x^*, D] as the prediction for the test input vectors x^*.

For the calculation of eq. (7), we need to compute the inverse of an N x N matrix, which becomes difficult when the training data size N is large. To overcome this computational difficulty, we utilize the BCM framework, which approximates the calculation of eq. (7).
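Eq. (7) can be implemented directly for moderate N. The sketch below is a minimal illustration under the assumptions of this section (rational quadratic kernel, fixed noise variance and kernel parameters); it reuses the hypothetical rational_quadratic_kernel helper defined above, and the default parameter values are arbitrary placeholders.

    import numpy as np

    def gp_predict_mean(X_train, t_train, X_test, sigma2=0.1, a=1.0, b=1.0, c=1.0):
        """Predictive mean of eq. (7): E[y*] = K*_N (K_N + sigma^2 I_N)^(-1) t."""
        K_N = rational_quadratic_kernel(X_train, X_train, a, b, c)       # N x N
        K_star_N = rational_quadratic_kernel(X_test, X_train, a, b, c)   # T x N
        N = X_train.shape[0]
        # Solve the linear system instead of forming the inverse explicitly.
        alpha = np.linalg.solve(K_N + sigma2 * np.eye(N), t_train)
        return K_star_N @ alpha

In the setting of this paper, one such regressor would be trained for each of the three outputs (dynamics, attack, release).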

To begin with, we note that eq. (7) can also be written as

    E[y^* | x^*, D] = K^* \left( K^* + K_N^* \,\mathrm{cov}(t | y^*)^{-1} K_N^{*T} \right)^{-1} K_N^* \,\mathrm{cov}(t | y^*)^{-1} t,    (8)

where K^* denotes a T x T Gram matrix whose (m, n) element is given by k(x_m^*, x_n^*), the superscript T denotes the matrix transpose, and cov(t | y^*) is given by

    \mathrm{cov}(t | x, x^*, y^*) = K_N + \sigma^2 I_N - (K_N^*)^T (K^*)^{-1} K_N^*.    (9)

For the calculation of eq. (8), we also need the inverse of cov(t | y^*), whose size is N x N. The BCM approximates cov(t | y^*) with a block-diagonal matrix. Using this approximation, eq. (8) can be approximated by

    E[y^* | x^*, D] \approx C^{-1} \sum_{i=1}^{M} \mathrm{cov}(y^* | x^*, D_i)^{-1} E[y^* | x^*, D_i],    (10)

where C = -(M-1)(K^*)^{-1} + \sum_{i=1}^{M} \mathrm{cov}(y^* | x^*, D_i)^{-1}, and we assume that the training data set is divided into M disjoint subsets D = {D_1, ..., D_M}. The size of the matrix that must be inverted is now much smaller.

Finally, we mention the hyper-parameter estimation. In eq. (10), there are four parameters to be determined: the variance of the measurement noise \sigma^2 and the kernel parameters a, b, and c. They can be determined by type II maximum likelihood estimation (Bishop, 2006). We used scaled conjugate gradient descent to perform the type II maximum likelihood estimation.
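A rough sketch of the BCM combination of eq. (10) follows. It assumes the hyper-parameters have already been fixed (e.g., by type II maximum likelihood, not shown), reuses the hypothetical rational_quadratic_kernel helper from above, and uses the standard GP predictive covariance for each subset, which the paper does not spell out; only the predictive mean is computed.

    import numpy as np

    def gp_predict(X_i, t_i, X_test, sigma2, a, b, c):
        """Standard GP predictive mean and covariance for one training subset D_i."""
        K = rational_quadratic_kernel(X_i, X_i, a, b, c) + sigma2 * np.eye(X_i.shape[0])
        Ks = rational_quadratic_kernel(X_test, X_i, a, b, c)       # T x N_i
        Kss = rational_quadratic_kernel(X_test, X_test, a, b, c)   # T x T
        mean = Ks @ np.linalg.solve(K, t_i)
        cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
        return mean, cov

    def bcm_predict_mean(subsets, X_test, sigma2=0.1, a=1.0, b=1.0, c=1.0):
        """Approximate predictive mean of eq. (10).

        subsets: list of (X_i, t_i) pairs, the M disjoint training subsets.
        """
        Kss = rational_quadratic_kernel(X_test, X_test, a, b, c)
        M = len(subsets)
        C = -(M - 1) * np.linalg.inv(Kss)              # -(M-1) (K*)^{-1}
        weighted_sum = np.zeros(X_test.shape[0])
        for X_i, t_i in subsets:
            mean_i, cov_i = gp_predict(X_i, t_i, X_test, sigma2, a, b, c)
            cov_i_inv = np.linalg.inv(cov_i)
            C += cov_i_inv
            weighted_sum += cov_i_inv @ mean_i
        return np.linalg.solve(C, weighted_sum)        # C^{-1} sum_i cov_i^{-1} E[y*|D_i]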


EXPERIMENT

In this section, we evaluate the proposed scheme by applying it to a subset of the CrestMusePEDB data.

A. Used data

We used the CrestMusePEDB for training and testing of the proposed GP regression; in particular, we used pieces composed by W. A. Mozart, as described in Table 1.


Table 1. Used data for the experiment.

Data ID   Title                          Player
1N        Piano sonata No.11 K.331-I     H. Nakamura
1S        Piano sonata No.11 K.331-I     N. Shimizu
2N        Piano sonata No.11 K.331-II    H. Nakamura
3N        Piano sonata No.11 K.331-III   H. Nakamura
4G        Piano sonata No.16 K.545-I     G. Gould
5G        Piano sonata No.16 K.545-II    G. Gould

B. Condition of the Experiment

We carried out two experiments to evaluate the learning ability and the generalization ability of the trained GP.

Experiment 1: The test for learning ability. To test how well the GP learns, we used five pieces for training and picked one of them as the test piece, to check whether the trained GP can imitate the performance of a 'known' training piece. Table 2 summarizes the training and test data used.

Table 2. Training (TR) and test (TS) data used for Experiment 1.

Experiment 2: The test for generalization ability. To test how well the trained GP generalizes to 'unknown' pieces, we evaluated its performance on pieces that are not included in the training data. Table 3 summarizes the training and test data used.

Table 3. Training (TR) and test (TS) data used for Experiment 2.

RESULTS

To evaluate the accuracy of the model, we calculated the normalized difference (ND) between the real performance and the performance predicted by the trained GP. The ND is defined as

    ND = \frac{SD_{error}}{SD_{real}},    (11)

where SD_{error} denotes the standard deviation of the error (difference) between two outputs, one given by the PEDB-DEV data and the other returned by the trained GP, and SD_{real} denotes the standard deviation of the output given by the PEDB-DEV data. The ND is zero if the trained GP returns exactly the same output as the PEDB-DEV data, and one if the trained GP always returns zero regardless of the input values. We calculated the ND for each output, i.e., dynamics, attack time, and release time.
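As an illustration, the ND of eq. (11) is a one-line computation; the function below is a minimal sketch with hypothetical argument names.

    import numpy as np

    def normalized_difference(predicted, real):
        """ND of eq. (11): std of the prediction error divided by std of the real output."""
        return np.std(np.asarray(predicted) - np.asarray(real)) / np.std(real)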


(1) Result of experiment 1

Table 4 shows the ND of experiment 1. The average ND of dynamics, attack time, and release time over all five cases was 28.5%, 26.8%, and 27.2%, respectively, showing a good learning ability.

Table 4. Results of Experiment 1 (ND for each output).

EX. ID    Dynamics   Attack   Release
1-1       0.290      0.350    0.340
1-2       0.232      0.041    0.125
1-3       0.250      0.345    0.170
1-4       0.336      0.185    0.263
1-5       0.316      0.418    0.462
Average   0.285      0.268    0.272

(2) Result of experiment 2

Table 5 shows the ND of experiment 2. The average ND of dynamics, attack time, and release time over all four cases was 98.9%, 109.0%, and 88.4%, respectively, showing a rather poor generalization ability.

Table 5. Results of Experiment 2 (ND for each output).

EX. ID    Dynamics   Attack   Release
2-1       0.667      0.971    0.679
2-2       0.712      1.388    0.747
2-3       0.999      1.000    1.111
2-4       1.577      1.002    0.998
Average   0.989      1.090    0.884

(3) Observation from the results of two experiments

Figure 3. Average ND of Experiment 1 and Experiment 2.

Figure 3 shows the average ND for the three outputs in experiments 1 and 2. The figure clearly shows the significant difference between the two experiments. Recall that the GP is guaranteed to reproduce the true input-output relationship under certain conditions (Rasmussen & Williams, 2006) if the input-output relationship obeys a certain probability distribution and training samples can be obtained from that distribution. Based on this, we can consider several reasons that may account for the difference: 1) the performance data cannot be well captured by samples taken from a stationary distribution, 2) the number of training samples is too small, and 3) the BCM approximation works poorly.

The first case means either that there are hardly any rules that describe the real performance well in terms of the input features, or that the distribution has too large a variance to be regarded as stationary. One probable scenario is that the current input features are insufficient to capture the rules, so it appears as if there were no rules describing this input-output relationship. A similar thing will happen when the kernel function is not appropriate for capturing the proximity of two input feature vectors. To examine how much these factors influence the results, further experiments are needed.

The second case is also likely, because we used only five pieces for training. Moreover, the fact that those pieces are played by different players may make the learning much more difficult, because a model that learns an input-output relationship with large noise variance generally needs many more samples. The GP uses all the training data and calculates the proximity between the test input and every training input. If we have plenty of training data (assuming the input features are sufficient and we have a good kernel function), the GP will find similar samples, weight their outputs according to their proximity, and predict the output as a weighted sum of those outputs, as in eqs. (8) or (10). One piece of supporting evidence can be found in Table 5: the performance on 'dynamics' and 'release' is better in EX. IDs 2-1 and 2-2 than in the others. In these experiments, the same piece (Piano sonata No.11 K.331-I) is used for both training and test, but played by different players, i.e., the data with IDs 1N and 1S. Therefore, if these players perform the piece in a similar way with regard to 'dynamics' and 'release', the GP will perform the test piece well by imitating the training piece. In contrast, the poor performance on 'attack' can be accounted for by the large discrepancy in 'attack' between the two players.

To investigate this hypothesis further, we examined the correlation coefficients of the actual outputs of 1N and 1S, as shown in Figure 4. The correlation coefficients of 'dynamics' and 'release' are significantly high, while the coefficient of 'attack' is low. This result suggests that 1) we need much more training data for generalization, and 2) it is desirable that the data be collected from a single player in order to learn a consistent way of performance.

Figure 4. Correlation coefficients between the outputs of 1N and those of 1S.
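The comparison behind Figure 4 amounts to a per-output Pearson correlation between the two performances of the same piece; a minimal sketch with hypothetical argument names is:

    import numpy as np

    def output_correlation(outputs_a, outputs_b):
        """Pearson correlation between two performers' per-note outputs,
        e.g. the 'dynamics' values of 1N and 1S aligned on the same notes."""
        return np.corrcoef(outputs_a, outputs_b)[0, 1]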


CONCLUDING REMARKS

We proposed a new computational scheme for rendering music performance based on a machine learning method. We use Gaussian process (GP) regression, which requires few parameters to be set manually, is a flexible function approximator, and can represent a wide class of functions. Our GP showed reasonable performance after learning from only five pieces. Therefore, we anticipate that it might be able to imitate a master performer when trained on a large dataset of that performer's recordings in the future.

ACKNOWLEDGMENT

This research is supported by the "Support Program for Improving Graduate School Education" financed by the Japan Society for the Promotion of Science.

References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Dorard, L., Hardoon, D., & Shawe-Taylor, J. (2007). Can style be learned? A machine learning approach towards 'performing' as famous pianists. Proceedings of the Music, Brain & Cognition Workshop, Neural Information Processing Systems 2007.

Fryden, L., & Sundberg, J. (1984). Performance rules for melodies: origin, functions, purposes. Proceedings of the 1984 International Computer Music Conference, 221-224.

Hashida, M., Nagata, N., Kawahara, H., & Katayose, H. (2007). A performance rendering model for polyphrase ensemble. Transactions of Information Processing Society of Japan, 48(1), 248-257 (in Japanese).

Raphael, C. (2003). Orchestra in a box: A system for real-time musical accompaniment. Proceedings of the International Joint Conference on Artificial Intelligence, 2003.

Rasmussen, C. E., & Williams, C. (2006). Gaussian Processes for Machine Learning. Cambridge: MIT Press.

Schwaighofer, A., & Tresp, V. (2003). Transductive and inductive methods for approximate Gaussian process regression. Advances in Neural Information Processing Systems 15. Cambridge: MIT Press.

Tresp, V. (2000). A Bayesian committee machine. Neural Computation, 12(11), 2719-2741.

Widmer, G. (2002). Machine discoveries: A few simple, robust local expression principles. Journal of New Music Research, 31(1), 37-50.
