Perceptual constraints for automatic vocal detection in music recordings

Félicien Vallet
ENST-GET, Département TSI, Paris, France
[email protected]

Martin F. McKinney
Philips Research Laboratories, Eindhoven, The Netherlands
[email protected]

In: K. Maimets-Volt, R. Parncutt, M. Marin & J. Ross (Eds.), Proceedings of the third Conference on Interdisciplinary Musicology (CIM07), Tallinn, Estonia, 15-19 August 2007, http://www-gewi.uni-graz.at/cim07/

Background in Music Information Retrieval. For many applications in music information retrieval, automatic music structure analysis is a desired capability, including the detection of vocal/sung segments within a musical recording. While there have been several recent studies on automatic vocal detection in music recordings, current performance is not sufficient for all applications. Above all, it is not clear which basic features of the audio signal are most important for distinguishing vocal from non-vocal segments, especially in the context of a multi-instrument down-mixed recording.

Background in Music Perception and Cognition. The human auditory system can ascertain various properties of sounds, including aspects of vocalizations, within a very short time frame. Robinson and Patterson (1995) showed that listeners need sound lengths of only ~5-10 msec to make reliable distinctions between vowel types. Perrott and Gjerdingen (1999) found that listeners could reliably distinguish most music styles from excerpts as short as 250 msec, and that the presence of vocals hindered listeners' ability to identify the correct style. Other studies on timbre perception have shown that the attack portion of a sound, or the transients between successive notes, is important for timbre identification. These abilities and the relevant parameters provide helpful starting constraints for developing automatic audio and music processing systems; however, it is still not clear which parameters are the most important for the detection of vocal content within a music recording.

Aims. We examine how the length of a sound excerpt influences listeners' ability to identify the presence of a singing voice, and whether the inclusion of note transients aids listeners' detection of vocal content. The collected data are used to constrain the development of acoustic feature models for the automatic detection of singing voice in popular songs.

Method. Ten subjects with mixed musical experience (from none to 15 years) listened over headphones to 576 different musical excerpts from a wide range of music genres. The 12 genres represented included reggae, jazz, classical, country, folk, rock, electronics, rhythm and blues, hip hop, pop, and blues. For each genre, three different songs were used to create the excerpt database. The excerpts were either vocal or non-vocal, contained either transient or steady-state regions, and were between 100 and 500 msec long (4 different lengths). Listeners indicated whether or not they heard a singer in each excerpt.

Results. The experiment revealed that even for the shortest excerpt length, 100 msec, subjects performed well above chance at detecting the presence of a singer. For the two longest excerpt lengths (350 and 500 msec), subjects could also more easily discriminate vocal from non-vocal excerpts when transients were present.

Conclusions. Information regarding the presence of vocals in music recordings exists in very short excerpts; thus signal descriptors computed from short windows of audio should suffice for systems that automatically detect vocal segments. In addition, the fact that transitions between notes appear to benefit listeners' ability to detect vocals suggests that automated vocal detection systems should be directed toward transients in audio recordings.

Implications. This study is an example of how a basic method and study in music perception can yield valuable data for the development of music information retrieval systems. The fact that very short audio segments suffice to perceive singing in a music recording suggests that real-time interactive systems are quite feasible for computer-aided applications related to singing and music. In addition, the perceptually derived constraints from this study serve as helpful guides for automatic music structure analysis systems.


The automatic detection of vocal sections within a music audio track is a useful capability for many music information retrieval (MIR) tasks, including music structure analysis, automatic disc-jockey applications, and lyric synchronization. Automatic analysis of music structure, for example, would benefit from an algorithm that could accurately identify where the singing in a chorus begins and ends. Several algorithms have recently been proposed and have shown modest success; however, most have not been tested extensively on large databases (Maddage 2006; Tsai and Wang 2006). Furthermore, results from several of these studies suggest that the optimal parameters of the feature-extraction stage for a vocal detection system are not clearly evident.

One can attempt to constrain parameters in such systems by evaluating the criteria necessary for the perception of vocals within musical fragments. Previous perceptual studies have shown that the human auditory system is able to ascertain certain properties of sounds, including aspects of vocalizations and musical style, within a very short time frame (Perrott and Gjerdingen 1999; Robinson and Patterson 1995). Other studies on timbre perception have shown that the attack portion of sounds and the transients between successive notes are important for the perception of timbre. While these studies provide general guidelines to relevant audio feature attributes for our purpose, further investigation is necessary. Here we look specifically at the length of musical audio fragment necessary for the correct (perceptual) detection of vocals within the fragment. We also examine whether or not the presence of a note transient within the fragment aids the detection process.

Method

Ten subjects listened to 576 short musical excerpts over headphones and indicated whether or not they heard vocal content in each excerpt. The excerpts were either entirely sung (vocal) or entirely instrumental and were taken from 36 songs covering 12 musical genres (3 songs from each genre), including reggae, jazz, Western classical, country, folk, rock, electronics, rhythm and blues, hip hop, pop, and blues. From each song, excerpts were extracted to examine three factors: 1) presence of vocals, 2) presence of a note transient, and 3) excerpt length (100, 200, 350, or 500 msec). Thus 16 excerpts were extracted from each song: 2 (vocal/non-vocal) × 2 (transient/no transient) × 4 (excerpt lengths). The excerpts were played in random order and could be listened to only once. The experiment lasted forty minutes on average. Subjects were primarily graduate students working at Philips Research Laboratories, had no known hearing abnormalities, and had a range of musical training from none to 15 years.

Results

Results show that listeners can reliably distinguish between vocal and non-vocal excerpts at all excerpt lengths used in this study (see Fig. 1). There is a tendency for vocal detection to be more accurate for longer excerpts, but the differences across excerpt length are not significant. Another aspect shown in Fig. 1 (upper plot) is that subjects can more easily hear the presence of a singer than its absence: even for the shortest excerpt length, the error percentage for non-vocal excerpts is a few points higher than that for vocal excerpts. In other words, a non-vocal excerpt is more likely to be misclassified as vocal than the reverse. Despite this trend, the two types of errors are not statistically different. Figure 1 (lower plot) also shows that the presence of note transients aids the perceptual detection of vocal content at all excerpt lengths, but the difference in detectability is significant only for the two longest excerpt lengths, 350 and 500 msec.

A general result here is that the information necessary to detect the presence of singing in a music recording exists even in very short excerpts (100 msec). This suggests that real-time interactive systems are quite feasible for singing performance, performance evaluation, and other computer-aided applications related to singing and music. In addition, the fact that note transients seem to provide additional perceptual cues to the presence of vocals in an excerpt provides a clear guideline for designing automatic vocal detection systems. One could imagine an onset detector, or "note transient detector", that would apply a detection algorithm at each point in the audio signal where a note transient is spotted, as shown in the sketch below.
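As a rough illustration of this idea, the following sketch uses the onset detector from the open-source librosa library to select candidate analysis windows. The window length and the classify_window placeholder are illustrative assumptions, not the implementation described in this paper.

```python
# Sketch: transient-gated vocal detection (illustrative; assumes librosa).
import librosa
import numpy as np

def classify_window(window, sr):
    # Placeholder vocal/non-vocal classifier: returns a dummy probability.
    # In practice, plug in a trained model here.
    return 0.5

def detect_vocals_near_transients(path, win_sec=0.35):
    y, sr = librosa.load(path, sr=22050, mono=True)
    # Locate note transients (onsets), reported in seconds.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    results = []
    for t in onsets:
        start = int(t * sr)
        window = y[start:start + int(win_sec * sr)]
        if len(window) < int(win_sec * sr):
            break  # skip a truncated final window
        results.append((t, classify_window(window, sr)))
    return results
```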

Figure 1. Perceptual detection errors of vocals in short musical excerpts as a function of fragment length (msec). The upper plot shows errors split across vocal and non-vocal excerpts; the lower plot shows errors split across excerpts with transients and steady-state excerpts. Errors are shown both as counts and as percentages. The maximum number of errors is 720: 10 subjects × 576 excerpts / (4 window lengths × 2 excerpt types). Error bars show estimates of the standard error of the mean, calculated by bootstrapping the experimental data (Efron and Tibshirani 1993).
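The bootstrap estimate of the standard error used for the error bars in Figure 1 can be sketched as follows; the errors array here is a hypothetical stand-in for the per-condition experimental data.

```python
# Sketch: bootstrap estimate of the standard error of the mean
# (Efron & Tibshirani 1993). `errors` is a hypothetical data array.
import numpy as np

def bootstrap_sem(errors, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(errors)
    # Resample with replacement and record the mean of each resample.
    means = [rng.choice(errors, size=n, replace=True).mean()
             for _ in range(n_boot)]
    return np.std(means)  # spread of resampled means approximates the SEM
```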

Vocal Detector

As a first attempt at constructing an automatic system for vocal detection in music audio, we employed our existing audio/music feature extractor, which we have used to reliably classify a number of different audio and music classes (McKinney and Breebaart 2003). The results of the current perception experiment validate the use of this feature extractor in that its time scale of feature extraction (~700 msec) is more than long enough to capture the relevant information for vocal detection. The features include those derived from spectro-temporal information as well as statistics describing the tonal content of the audio (van de Par, McKinney, and Redert 2006).
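The published feature set is described in McKinney and Breebaart (2003); as a stand-in, the sketch below computes generic spectro-temporal statistics (MFCC means and variances) over ~700 msec frames. The frame length and feature choice here are illustrative assumptions, not the exact published features.

```python
# Sketch: generic spectro-temporal frame features over ~700 msec windows.
# MFCC statistics stand in for the published feature set (an assumption).
import librosa
import numpy as np

def frame_features(y, sr, frame_sec=0.7):
    hop = int(frame_sec * sr)
    feats = []
    for start in range(0, len(y) - hop + 1, hop):
        frame = y[start:start + hop]
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=13)
        # Mean and variance of each coefficient summarize the frame.
        feats.append(np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)]))
    return np.array(feats)  # shape: (n_frames, 26)
```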

After feature extraction, we used quadratic discriminant analysis (QDA), preceded by a linear regression method for dimension reduction, to classify the music excerpts. The classifier consists of a multivariate Gaussian model for each class (vocal, non-vocal), trained on 1300 15-sec excerpts of Western popular music covering a range of styles. We used 5-fold cross-validation for training and testing our models. Performance on excerpt-based classification (i.e., multiple feature frames) from the testing phase is shown in Fig. 2. An interesting result here is that the automatic detector, while performing worse than the human subjects in our perceptual experiment, also tends to be more accurate at detecting vocals than non-vocals.
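A minimal sketch of this classification scheme using scikit-learn follows. PCA stands in here for the paper's linear-regression-based dimension reduction (an assumption), and X and labels denote hypothetical frame-feature matrices and vocal/non-vocal labels.

```python
# Sketch: QDA with dimension reduction, evaluated by 5-fold cross-validation.
# PCA is a stand-in for the paper's dimension-reduction step (assumption).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def evaluate_vocal_classifier(X, labels):
    """X: (n_excerpts, n_features); labels: 1 = vocal, 0 = non-vocal."""
    model = make_pipeline(PCA(n_components=10),
                          QuadraticDiscriminantAnalysis())
    scores = cross_val_score(model, X, labels, cv=5)
    return scores.mean(), scores.std()
```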

Figure 2. Classification of musical excerpts as vocal or non-vocal using quadratic discriminant analysis. Mean correct classification: 81.6 ± 1.1%. Confusion matrix (rows: real class; columns: detected class; values in %):

Real class    Detected vocal    Detected non-vocal
Vocal         84.2 ± 1.4        15.8 ± 1.4
Non-vocal     21.0 ± 0.9        79.0 ± 0.9

We used the vocal classifier to identify vocal sections in complete songs by averaging and thresholding the classification probabilities; an example of this process is shown in Fig. 3, and a sketch of the smoothing step follows below. For each time frame (~700 msec), the detector produces a probability, between zero and one, that the frame contains vocals. For this particular song the detector performs quite well: the probability curve closely follows the hand-labeled annotation of the song (shown in the top portion of the bottom panel). However, of all the songs we tested, only a few showed such good results. On average, performance was around 70%.
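A minimal sketch of the averaging-and-thresholding step, assuming a per-frame probability array; the window size and threshold value here are illustrative, not the published settings.

```python
# Sketch: smooth per-frame vocal probabilities and threshold them into
# binary vocal / non-vocal frames. Window size and threshold are assumptions.
import numpy as np

def vocal_segments(frame_probs, win=5, threshold=0.5):
    """frame_probs: per-frame P(vocal), one value per ~700 msec frame."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(frame_probs, kernel, mode="same")  # moving average
    return smoothed > threshold  # True where a frame is judged vocal
```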

Figure 3. Detection of vocal content in a single song, "Honey Child" by Martha Reeves & the Vandellas. The upper panel shows the single-frame classification probabilities over time (sec) together with their quantized and smoothed version; the dotted horizontal line is the threshold used to obtain the predicted vocal sections. The bottom panel compares the extracted vocal/not-vocal sections with the hand-labeled annotation.

The current results generated by the vocal detector suggest that the features used here are sub-optimal and that classification performance, while decent, does not match that of human listeners. In addition, the current results were achieved only when the classification probabilities of consecutive frames were averaged. Nevertheless, the current performance of our singing detector allows successful classification of vocal content for many pieces of music. Next steps could take into account the role of the note transient in detecting vocals by analyzing only windows containing a transient.

Conclusion

We have shown how perceptual data on vocal detection can help constrain model parameters for the automatic detection of sung content in music. Indeed, this study is an example of how a basic method and study in music perception can yield valuable data for the development of music information retrieval systems. Our results showed that humans can reliably detect the presence of vocal content in musical excerpts only a few hundred milliseconds long. Additionally, we saw that the presence of a note transient may help considerably in detecting the presence of a singer.

References

Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall.

Maddage, N. (2006). Automatic structure detection for popular music. IEEE MultiMedia, 13(1), 65-77.

McKinney, M. F., & Breebaart, J. (2003). Features for audio and music classification. In Proceedings of the 4th International Conference on Music Information Retrieval, Baltimore, MD, Johns Hopkins University.

Perrott, D., & Gjerdingen, R. (1999). Scanning the dial: An exploration of factors in the identification of musical style. Society for Music Perception and Cognition Conference, Evanston, IL.

Roads, C. (2003). The perception of microsounds and its musical implications. Annals of the New York Academy of Sciences, 999(1), 272-281.

Robinson, K., & Patterson, R. (1995). The stimulus duration required to identify vowels, their octave, and their pitch chroma. Journal of the Acoustical Society of America, 98(4), 1858-1865.

Schellenberg, E. G., Iverson, P., & McKinnon, M. C. (1999). Name that tune: Identifying popular recordings from brief excerpts. Psychonomic Bulletin & Review, 6(4), 641-646.

Tsai, W., & Wang, H. (2006). Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 330-341.

van de Par, S., McKinney, M. F., & Redert, A. (2006). Musical key extraction from audio using profile training. In Proceedings of the 7th International Conference on Music Information Retrieval, Victoria, Canada.