Towards A Music Digital Library: A Content-based Processing Paradigm of Music Collections Using Solo Vocal Signal Modeling

Wei-Ho Tsai and Hsin-Min Wang
Institute of Information Science, Academia Sinica, Taiwan, Republic of China
{wesley,whm}@iis.sinica.edu.tw

Abstract. With the explosive growth of networked collections of music material, a need has arisen for a mechanism, such as a digital library, for managing music data. This paper presents a content-based processing paradigm for music collections to facilitate the realization of a music digital library. The paradigm is built upon the automatic extraction of information of interest from music audio. Recognizing that the vocal part, when present, is often the heart of a music performance, emphasis is put on developing techniques for exploiting the underlying solo voice in an accompanied performance. We show by example that a singer's voice can be characterized via stochastic modeling of the solo vocal signal, such that music documents can be clustered, retrieved, and verified on the basis of the associated singer.

1 Introduction

Recent advances in digital signal processing technologies, coupled with essentially unlimited data storage and transmission capabilities, have created an explosion of multimedia data being produced, spread, and made available to everyone everywhere. Music, as one of the most prevalent forms of multimedia, is growing in volume at an unprecedented pace. However, the rapid proliferation of music material has ironically brought with it the dilemma of what to do with it: how to locate the desired music among the innumerable options, and how to make sure only those authorized can access it. This dilemma has motivated recent research [1,2,3,25] into establishing a mechanism, such as a digital library, for managing the burgeoning amount of music collections, so that music becomes, if possible, as accessible as text, which has long been a well-regulated archival resource. Constructing a music digital library requires the integration of comprehensive technologies and disciplines spanning library science, computer science, audio engineering, musicology, law, and so on. Although each of these disciplinary communities concentrates on its own set of requirements, a music digital library, as far as functionality is concerned, is expected to support the following operations:

• Music data organization. Apart from data acquisition, the first consideration when constructing a music digital library is to organize the music data readily available in electronic form for subsequent preservation, access, research, and other uses.

• Music information retrieval. Following the services provided by traditional public libraries, a music digital library must support a scenario in which users know characteristics of the music they desire and use a music information retrieval system to locate the music documents that most closely match those characteristics.
• Music recommendation. In contrast to music information retrieval, which is passively initiated by user request, a music digital library may actively recommend music documents to a user, or, for instance, notify subscribers of the arrival of new music objects according to their preferences. Music recommendation can also assist users in locating their desired music when they have difficulty formulating an appropriate query.
• Copyright protection. In addition to providing services to authorized users, a music digital library must be able to prevent its music documents from unauthorized access, manipulation, or distribution.

Traditionally, in record shops, music is cataloged by title, composer, producer, artist, or other objective and subjective classes to enable customers to quickly find the music they want. Such a means has been extended to current music encoded in digital formats, in which descriptive information is delivered together with the actual audio content. Examples of this so-called metadata ("data about data") are the widely used ID3 tags attached to MP3 bitstreams [4] and the forthcoming MPEG-7 standard [5]. From the viewpoint of the traditional bibliographic text system, it might seem rather straightforward for a digital library to manage music collections through the use of metadata. However, since music presented in audio form encompasses numerous levels of information ranging from concrete to abstract, metadata may not be effective, because the desired information is probably not included in the attached tags. Even for the limited information we mostly care about, generating metadata for a vast music database involves considerable effort, owing to the inevitable need for human intervention. Moreover, metadata is easily removed or manipulated, which copyright violators might exploit to keep piracy from being detected. As a result, a feasible music digital library cannot rely solely on the common metadata-based mechanism, but instead requires automatic techniques for analyzing the content of music data.

This paper presents a paradigm of content-based processing of music data to facilitate the construction of a music digital library. It aims to develop techniques for automatically extracting information of interest from music by inspecting only the audio signal. A considerable amount of related research has been devoted to this end, such as melody extraction [6,7], instrument recognition [8,9], genre classification [10,11], artist or singer identification [12-15], mood classification [16], and lyric transcription [17]. Among these studies, of particular interest and challenge is the problem of probing information residing in the vocal part, i.e., singing over music. In most music, the vocal part is often the heart of a performance, carrying the melodic hook, artist style, lyrics, and so on. Extracting vocal-related information is made notably difficult by the fact that singing voices are inextricably intertwined with the non-stationary signal of the background accompaniment.
However, most prior work has either assumed that the background music can be ignored or has simply attacked the problem from a fault-tolerance standpoint. Methods dedicated to exploiting the underlying solo voice in an accompanied performance remain relatively unexplored. Recently, we proposed a stochastic modeling technique for the solo vocal signal to enable better analysis of music content [21]. The stochastic modeling is built upon the goal of extracting the singer's voice characteristics from a piece of music. In this study, we show by example how a singer's voice can be characterized such that music documents can be clustered, retrieved, and verified on the basis of the associated singer, thereby benefiting the realization of a music digital library.

The rest of this paper is organized as follows. Section 2 describes the specific tasks we address, along with the system configurations for their implementation. Section 3 presents a statistical classifier for distinguishing between vocal segments and accompaniments. Section 4 introduces a method for distilling singers' vocal characteristics from the vocal regions of music recordings. In Sections 5, 6, and 7, we describe in turn how to perform singer-based clustering, singer-based music document retrieval, and singer verification through the use of solo vocal signal modeling. Finally, Section 8 concludes this paper.

2 Content-based Processing of Music Collections

2.1 Music Data Organization

In this work, a singer-based clustering technique is developed that partitions a collection of music recordings into several clusters such that each cluster consists exclusively of recordings from one singer. This technique serves as a tool for expediently organizing unlabeled or insufficiently well labeled music collections. For instance, many rock bands have a lead singer who sings the majority of the band's songs, but a minority of songs are sung by the guitarist, drummer, or other band members. Since the vast majority of commercial music data is labeled only by lead singer or band name, singer-based clustering can be deployed to identify the songs not sung by the lead singer. In addition, lead singers in both rock and pop music are known to quit, record solo albums, start new bands, or join other bands. Singer-based clustering is therefore useful for those wishing to collect the complete works of artists like Phil Collins, Sting, Ozzy Osbourne, or even Michael Jackson. To cluster music recordings by singer, the system shown in Fig. 1 is employed. The system takes the M music recordings as input and produces K clusters as output. It involves three major processes. First, each music recording is segmented into vocal and non-vocal regions, where a vocal region consists of concurrent singing and accompaniment, whereas a non-vocal region consists of accompaniment only. Then, the singer's vocal characteristics of each music recording are statistically modeled from the segmented vocal regions. Finally, similarities between music recordings are computed in terms of the singers' vocal characteristics, and recordings similar to each other are grouped into a cluster. Each of these processes is described in greater detail in the subsequent sections.

Fig. 1. Block diagram of the singer-based clustering: each of the M music recordings undergoes vocal/non-vocal segmentation and singer characteristic modeling, followed by inter-recording similarity computation and clustering into K clusters.

2.2 Music Document Retrieval/Recommendation

A music document retrieval system based on query-by-example is built that allows users to locate a specified singer's music recordings in a database by submitting a fragment of music as a query. Such a system would be of great use to those wishing to listen to a particular artist's songs but unable to recall the artist's name. The need may also arise when users hear songs somewhere and want to know more about the artists, or want to listen to more songs performed by the same artists. In these cases, a user would like to query, "Find me all the songs performed by the singer of this attached recording." Being query-by-example, such a system can also trivially accommodate music recommendation, in which users are suggested songs performed by their favorite singers or by singers with voices similar to their favorites. A block diagram of the music document retrieval system is shown in Fig. 2. Given an exemplar music query submitted by a user, the system evaluates the similarity between the query and each of the music documents in the database, and produces a ranked list of the relevance of each document to the query. Here, the more similar a document is to the query, the more relevant the document is deemed. In order for the similarity to reflect the singer's voice characteristics, each music document undergoes the above-mentioned vocal/non-vocal segmentation along with singer characteristic modeling.

Fig. 2. Block diagram of the singer-based music document retrieval: the music query and each of the N music documents undergo vocal/non-vocal segmentation and singer characteristic modeling, followed by query-document similarity computation and ranking to produce a rank list.

2.3 Copyright Protection

A singer verification technique is developed that determines whether or not a test music recording contains the voice of a wanted singer. This technique enables copyright holders to rapidly scan suspect websites by examining whether music files offered for free download are performed by the singers of their concern. Because the examination is content-based, manipulated music files, for example those with descriptive-information fields removed or given irrelevant filenames, can still be detected. Given a test music recording X, the singer verification task can be formulated as a basic hypothesis test between H0: X is performed by the hypothesized singer, and H1: X is not performed by the hypothesized singer. The optimum test to decide between these two hypotheses is a likelihood ratio test:
$$\frac{p(X \mid H_0)}{p(X \mid H_1)} \;\underset{H_1(\text{No})}{\overset{H_0(\text{Yes})}{\gtrless}}\; \delta,$$
where δ is a decision threshold. Fig. 3 shows the basic components of a singer-verification system based on this hypothesis test. Before determining whether a test music recording is performed by the hypothesized singer, a training process must be carried out to characterize the two hypotheses, H0 and H1. Toward this end, music data performed and not performed by the hypothesized singer are collected and used to generate the representative characteristics of H0 and H1, respectively. These music data undergo the above-mentioned vocal/non-vocal segmentation along with singer characteristic modeling.
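As a small illustration, the following sketch evaluates this likelihood ratio test in the log domain; the function name is illustrative rather than from the paper, and the two log-likelihoods are assumed to come from models such as those described in Sections 4 and 7.

```python
# Minimal sketch of the likelihood ratio test above, evaluated in the log domain.
# log_p_h0 and log_p_h1 are log p(X|H0) and log p(X|H1); delta is the (positive) threshold.
import math

def likelihood_ratio_decision(log_p_h0, log_p_h1, delta):
    """Return True ("Yes": accept H0) iff p(X|H0) / p(X|H1) > delta."""
    return (log_p_h0 - log_p_h1) > math.log(delta)
```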

Fig. 3. Illustration of the singer-verification system: in the training phase, music data performed by the hypothesized singer and all other available music data each undergo vocal/non-vocal segmentation and singer characteristic modeling; in the testing phase, the test music recording is scored by likelihood ratio computation and a yes/no decision is made.

3 Vocal/Non-vocal Segmentation

As an indispensable step in determining the vocal characteristics of a singer, music segments that contain vocals must be located and marked as such. This task can be formulated as a problem of distinguishing between vocal segments and accompaniments, and a vocal/non-vocal classifier is thus built to solve it. The classifier consists of a front-end signal processor that converts digital waveforms into cepstrum-based feature vectors, followed by a back-end statistical processor that performs modeling and matching. It operates in two phases, training and testing. During training, a music database with manual vocal/non-vocal transcriptions is used to form two separate Gaussian mixture models (GMMs): a vocal GMM and a non-vocal GMM. The use of GMMs is motivated by the desire to model various broad acoustic classes by a combination of Gaussian components. These broad acoustic classes reflect some general vocal tract and instrumental configurations. It has been shown that GMMs have a strong ability to provide smooth approximations to arbitrarily-shaped densities of spectra over a long time span [18]. We denote the vocal GMM by λV and the non-vocal GMM by λN. Parameters of the GMMs are initialized via k-means clustering and iteratively adjusted via expectation-maximization (EM) [19]. During testing, the classifier takes as input the Tx feature vectors X = {x1, x2, ..., xTx} extracted from an unknown recording, and produces as output the frame log-likelihoods log p(xt|λV) and log p(xt|λN), 1 ≤ t ≤ Tx. Since singing tends to be continuous, classification is preferably made segment by segment rather than frame by frame. To reduce the risk of crossing multiple vocal/non-vocal boundaries, segments are selected and examined in the following way. First, vector clustering is applied to all the frame feature vectors, and each frame is assigned the cluster index associated with its feature vector. Then, each segment is assigned the majority index of its constituent frames, and adjacent segments with the same index are merged into a homogeneous segment. Finally, classification is made per homogeneous segment using
$$\frac{1}{W_k}\left[\sum_{i=0}^{W_k-1}\log p(\mathbf{x}_{s_k+i}\mid\lambda_V)-\sum_{i=0}^{W_k-1}\log p(\mathbf{x}_{s_k+i}\mid\lambda_N)\right]\;\underset{\text{non-vocal}}{\overset{\text{vocal}}{\gtrless}}\;\eta,\qquad(1)$$
where Wk and sk represent, respectively, the length and the starting frame of the k-th homogeneous segment, and η is the decision threshold.
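For concreteness, the sketch below outlines this classifier in Python, using scikit-learn's GaussianMixture as a stand-in for λV and λN. It assumes that MFCC frames and the homogeneous-segment boundaries have already been computed elsewhere; the mixture counts shown are the ones reported in Sec. 5.2, and all function and variable names are illustrative.

```python
# Minimal sketch of the vocal/non-vocal classifier and the segment-level
# decision of Eq. (1). MFCC frames and homogeneous-segment boundaries
# (start frame s_k, length W_k) are assumed to be computed elsewhere.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_vocal_nonvocal_gmms(vocal_frames, nonvocal_frames,
                              n_vocal_mix=64, n_nonvocal_mix=80, seed=0):
    """Train lambda_V and lambda_N from manually labeled MFCC frames."""
    lam_v = GaussianMixture(n_components=n_vocal_mix, covariance_type="diag",
                            random_state=seed).fit(vocal_frames)
    lam_n = GaussianMixture(n_components=n_nonvocal_mix, covariance_type="diag",
                            random_state=seed).fit(nonvocal_frames)
    return lam_v, lam_n

def classify_segments(frames, segments, lam_v, lam_n, eta=0.0):
    """Label each homogeneous segment as vocal (True) or non-vocal (False) by
    thresholding the average frame log-likelihood difference, as in Eq. (1)."""
    ll_v = lam_v.score_samples(frames)   # log p(x_t | lambda_V) for every frame
    ll_n = lam_n.score_samples(frames)   # log p(x_t | lambda_N) for every frame
    labels = []
    for s_k, w_k in segments:            # (start frame, length) of segment k
        labels.append(np.mean(ll_v[s_k:s_k + w_k] - ll_n[s_k:s_k + w_k]) > eta)
    return labels
```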

4 Singer Characteristic Modeling

After locating the vocal portions, the next important step is to probe the solo voice signal underlying those portions and thereby extract the singer's voice characteristics. In most modern music, particularly popular music, substantial similarities exist between the instrumental-only regions and the accompaniment of the singing regions. It is therefore reasonable to assume that the stochastic characteristics of the background music can be approximated by those of the instrumental-only regions. This assumption forms the basis of the following formulation. Suppose that an accompanied voice V = {v1, v2, ..., vT} is a mixture of a singing voice S = {s1, s2, ..., sT} and a background music B = {b1, b2, ..., bT}. Both S and B are unobservable, but B's stochastic characteristics can be estimated from the non-vocal segments. Accordingly, it suffices to build a stochastic model λs for the singing voice S, based on the available information from V and B. Toward this end, we further assume that S and B are, respectively, drawn randomly and independently according to GMMs λs = {ws,i, µs,i, Σs,i | 1 ≤ i ≤ I} and λb = {wb,j, µb,j, Σb,j | 1 ≤ j ≤ J}, where ws,i and wb,j are mixture weights, µs,i and µb,j mean vectors, and Σs,i and Σb,j covariance matrices. If the signal V is formed by a generative function vt = f(st, bt), 1 ≤ t ≤ T, the probability of V given λs and λb can be represented by
$$p(V\mid\lambda_s,\lambda_b)=\prod_{t=1}^{T}\left[\sum_{i=1}^{I}\sum_{j=1}^{J}w_{s,i}\,w_{b,j}\,p(\mathbf{v}_t\mid\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i},\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j})\right],\qquad(2)$$
where
$$p(\mathbf{v}_t\mid\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i},\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j})=\iint_{\mathbf{v}_t=f(\mathbf{s}_t,\mathbf{b}_t)}N(\mathbf{s};\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i})\,N(\mathbf{b};\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j})\,d\mathbf{s}\,d\mathbf{b}.\qquad(3)$$

In our context, V, S, and B are represented in the form of cepstral features, and since S and B are additive in the time domain or linear-spectrum domain, the accompanied voice can be approximately expressed as vt = log(exp(st) + exp(bt)) ≈ max(st, bt), 1 ≤ t ≤ T. Thus, Eq. (3) is explicitly computed using
$$p(\mathbf{v}_t\mid\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i},\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j})=N(\mathbf{v}_t;\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i})\int_{-\infty}^{\mathbf{v}_t}N(\mathbf{b};\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j})\,d\mathbf{b}+N(\mathbf{v}_t;\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j})\int_{-\infty}^{\mathbf{v}_t}N(\mathbf{s};\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i})\,d\mathbf{s}.\qquad(4)$$

To build λs, a maximum-likelihood estimate is computed as
$$\lambda_s^{*}=\arg\max_{\lambda_s}p(V\mid\lambda_s,\lambda_b).\qquad(5)$$
Using the EM algorithm, a new model λ̂s is iteratively estimated by maximizing the auxiliary function
$$Q(\lambda_s,\hat{\lambda}_s)=\sum_{t=1}^{T}\sum_{i=1}^{I}\sum_{j=1}^{J}p(i,j\mid\mathbf{v}_t,\lambda_s,\lambda_b)\log p(i,j,\mathbf{v}_t\mid\hat{\lambda}_s,\lambda_b),\qquad(6)$$
where
$$p(i,j,\mathbf{v}_t\mid\hat{\lambda}_s,\lambda_b)=\hat{w}_{s,i}\,w_{b,j}\,p(\mathbf{v}_t\mid\hat{\boldsymbol{\mu}}_{s,i},\hat{\boldsymbol{\Sigma}}_{s,i},\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j}),\qquad(7)$$
and
$$p(i,j\mid\mathbf{v}_t,\lambda_s,\lambda_b)=\frac{w_{s,i}\,w_{b,j}\,p(\mathbf{v}_t\mid\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i},\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j})}{\sum_{m=1}^{I}\sum_{n=1}^{J}w_{s,m}\,w_{b,n}\,p(\mathbf{v}_t\mid\boldsymbol{\mu}_{s,m},\boldsymbol{\Sigma}_{s,m},\boldsymbol{\mu}_{b,n},\boldsymbol{\Sigma}_{b,n})}.\qquad(8)$$
Setting ∇Q(λs, λ̂s) = 0 with respect to each parameter to be re-estimated, we obtain
$$\hat{w}_{s,i}=\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{J}p(i,j\mid\mathbf{v}_t,\lambda_s,\lambda_b),\qquad(9)$$
$$\hat{\boldsymbol{\mu}}_{s,i}=\frac{\sum_{t=1}^{T}\sum_{j=1}^{J}p(i,j\mid\mathbf{v}_t,\lambda_s,\lambda_b)\,E\{\mathbf{s}_t\mid\mathbf{v}_t,\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i},\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j}\}}{\sum_{t=1}^{T}\sum_{j=1}^{J}p(i,j\mid\mathbf{v}_t,\lambda_s,\lambda_b)},\qquad(10)$$
$$\hat{\boldsymbol{\Sigma}}_{s,i}=\frac{\sum_{t=1}^{T}\sum_{j=1}^{J}p(i,j\mid\mathbf{v}_t,\lambda_s,\lambda_b)\,E\{\mathbf{s}_t\mathbf{s}_t'\mid\mathbf{v}_t,\boldsymbol{\mu}_{s,i},\boldsymbol{\Sigma}_{s,i},\boldsymbol{\mu}_{b,j},\boldsymbol{\Sigma}_{b,j}\}}{\sum_{t=1}^{T}\sum_{j=1}^{J}p(i,j\mid\mathbf{v}_t,\lambda_s,\lambda_b)}-\hat{\boldsymbol{\mu}}_{s,i}\hat{\boldsymbol{\mu}}_{s,i}',\qquad(11)$$
where the prime denotes vector transpose and E{·} denotes expectation. The details of Eqs. (9)-(11) can be found in [20,21]. Note that if the number of mixtures in λb is zero, the method above degenerates to directly modeling the accompanied voices with a GMM. This serves as a baseline for performance comparison.
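To illustrate how Eqs. (2) and (4) can be evaluated in practice, the following sketch assumes diagonal-covariance GMMs (a simplifying assumption; the paper does not state the covariance structure), so that the Gaussian densities and the cumulative terms factor over feature dimensions. The EM re-estimation of Eqs. (9)-(11) is omitted, and the model containers and names are illustrative.

```python
# Sketch of the observed-frame likelihood of Eq. (4) and the total log-likelihood
# of Eq. (2) under v_t = max(s_t, b_t), assuming diagonal covariances.
import numpy as np
from scipy.stats import norm

def frame_likelihood(v, mu_s, var_s, mu_b, var_b):
    """Eq. (4) for one solo/background mixture pair (i, j)."""
    dens_s = np.prod(norm.pdf(v, mu_s, np.sqrt(var_s)))  # N(v; mu_s,i, Sigma_s,i)
    dens_b = np.prod(norm.pdf(v, mu_b, np.sqrt(var_b)))  # N(v; mu_b,j, Sigma_b,j)
    cdf_s = np.prod(norm.cdf(v, mu_s, np.sqrt(var_s)))   # Pr(s <= v) under mixture i
    cdf_b = np.prod(norm.cdf(v, mu_b, np.sqrt(var_b)))   # Pr(b <= v) under mixture j
    return dens_s * cdf_b + dens_b * cdf_s

def log_likelihood(V, solo, background):
    """Eq. (2): log p(V | lambda_s, lambda_b). Each model is a dict holding
    'w' (weights, shape (K,)), 'mu' (means, (K, D)), and 'var' (variances, (K, D))."""
    total = 0.0
    for v in V:                                   # frames of the vocal portion
        p = 0.0
        for w_i, mu_i, var_i in zip(solo["w"], solo["mu"], solo["var"]):
            for w_j, mu_j, var_j in zip(background["w"], background["mu"], background["var"]):
                p += w_i * w_j * frame_likelihood(v, mu_i, var_i, mu_j, var_j)
        total += np.log(p + 1e-300)               # guard against numerical underflow
    return total
```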

5 Singer-based Clustering of Music Recordings

5.1 Methodology

With vocal/non-vocal segmentation and singer characteristic modeling in place, the singer-based clustering concept described in Sec. 2.1 can be put into practice. To begin, a solo voice model λs,i and a background music model λb,i are generated for each of the M recordings to be clustered, 1 ≤ i ≤ M. The log-likelihood Li,j = log p(Vi|λs,j, λb,i), 1 ≤ i, j ≤ M, that the vocal portion of recording Vi tests against the model λs,j is then computed using Eqs. (2) and (4). A large log-likelihood Li,j indicates that the singer of recording i is similar to the singer of recording j. Singer-based clustering can then be formulated as a conventional vector clustering problem by assigning the characteristic vector Li = [Li,1, Li,2, ..., Li,M]′, 1 ≤ i ≤ M, to each recording i, and computing the similarity between two recordings as the Euclidean distance ||Li − Lj||. The clustering quality may be further improved by emphasizing the larger likelihoods and suppressing the smaller ones. To achieve this, the Li,j for each recording i are ranked in descending order of likelihood. Let the rank of Li,j be denoted by Ri,j. Then the characteristic vectors Fi = [Fi,1, Fi,2, ..., Fi,M]′, 1 ≤ i ≤ M, are formed using
$$F_{i,j}=\begin{cases}1.0, & j=i\\ \exp\{\alpha(L_{i,j}-L_{i,\varphi})\}, & j\neq i \text{ and } R_{i,j}\leq\theta\\ 0.0, & j\neq i \text{ and } R_{i,j}>\theta\end{cases}\qquad(12)$$
and
$$\varphi=\arg\max_{k\neq i}L_{i,k},\qquad(13)$$
where α is a positive scaling constant, and θ is an integer constant for pruning the lower log-likelihoods. The vector clustering problem is solved using the k-means algorithm, which starts with a single cluster and recursively splits clusters in an attempt to minimize the within-cluster variances. A choice must be made as to how many clusters should be created. If the number of clusters is too low, a single cluster is likely to include recordings from multiple singers. On the other hand, if the number of clusters is too high, a single singer's recordings will be split across multiple clusters. Clearly, the optimal number of clusters K equals the number of singers, which is unknown. In this study, the Bayesian Information Criterion (BIC) [22] is employed to decide the best value of K. The BIC assigns a value to a stochastic model based on how well the model fits a data set and how simple the model is, specifically
$$\mathrm{BIC}(\Lambda)=\log p(D\mid\Lambda)-\frac{1}{2}\gamma\, d\log|D|,\qquad(14)$$
where d is the number of free parameters in model Λ, |D| is the size of the data set D, and γ is a penalty factor. A K-clustering can be modeled as a collection of Gaussian distributions (one per cluster). The BIC may then be computed as
$$\mathrm{BIC}(K)=-\sum_{k=1}^{K}\frac{n_k}{2}\log|\boldsymbol{\Sigma}_k|-\frac{1}{2}\gamma K\left[M+\frac{1}{2}M(M+1)\right]\log M,\qquad(15)$$
where nk is the number of elements in cluster k and Σk is the covariance matrix of the characteristic vectors in cluster k. The BIC value should increase as splitting improves the conformity of the model, but should decline significantly after an excess of clusters is created. A reasonable number of clusters can therefore be determined by
$$K^{*}=\arg\max_{1\leq K\leq M}\mathrm{BIC}(K).\qquad(16)$$
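A minimal sketch of this clustering stage is given below: it builds the characteristic vectors of Eqs. (12)-(13) from a precomputed pairwise log-likelihood matrix L, clusters them with k-means, and selects the number of clusters by the BIC of Eqs. (15)-(16). The values of α, θ, and γ, the covariance regularization, and the handling of singleton clusters are illustrative choices, not taken from the paper.

```python
# Sketch of the clustering stage: characteristic vectors (Eqs. (12)-(13)),
# k-means clustering, and BIC-based selection of the cluster count (Eqs. (15)-(16)).
# L is the M x M matrix of pairwise log-likelihoods L[i, j] = log p(V_i | lambda_s,j, lambda_b,i).
import numpy as np
from sklearn.cluster import KMeans

def characteristic_vectors(L, alpha=1.0, theta=10):
    M = L.shape[0]
    F = np.zeros((M, M))
    for i in range(M):
        others = [k for k in range(M) if k != i]
        phi = others[int(np.argmax(L[i, others]))]          # Eq. (13)
        ranks = {j: r for r, j in enumerate(sorted(others, key=lambda j: -L[i, j]), start=1)}
        for j in range(M):
            if j == i:
                F[i, j] = 1.0                               # Eq. (12), diagonal term
            elif ranks[j] <= theta:
                F[i, j] = np.exp(alpha * (L[i, j] - L[i, phi]))
    return F

def bic(F, labels, K, gamma=1.0):
    """Eq. (15) for a K-clustering of the characteristic vectors (singleton
    clusters are skipped, since their covariance is degenerate)."""
    M = F.shape[0]
    fit = 0.0
    for k in range(K):
        members = F[labels == k]
        if len(members) > 1:
            cov = np.cov(members, rowvar=False) + 1e-6 * np.eye(M)  # small regularizer
            fit -= 0.5 * len(members) * np.linalg.slogdet(cov)[1]
    penalty = 0.5 * gamma * K * (M + 0.5 * M * (M + 1)) * np.log(M)
    return fit - penalty

def cluster_by_singer(L, max_K=20):
    """Eq. (16): try K = 1..max_K and keep the clustering with the highest BIC."""
    F = characteristic_vectors(L)
    best = None
    for K in range(1, max_K + 1):
        labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(F)
        score = bic(F, labels, K)
        if best is None or score > best[0]:
            best = (score, K, labels)
    return best[1], best[2]
```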

5.2 Experimental Results

The music data used in this experiment consisted of 416 tracks from Mandarin pop music CDs. All the tracks were manually labeled with the singer identity and the vocal/non-vocal boundaries. The database was divided into two subsets, denoted DB-1 and DB-2. DB-1 comprised 200 tracks performed by 10 female and 10 male singers, with 10 distinct songs per singer. DB-2 contained the remaining 216 tracks, involving 13 female and 8 male singers, none of whom appeared in DB-1. All music data were down-sampled from the CD sampling rate of 44.1 kHz to 22.05 kHz, to exclude the high-frequency components beyond the range of normal singing voices. In this experiment, DB-1 was used to examine the validity of the singer-based clustering methods, while DB-2 was used to train the vocal model λV and the non-vocal model λN. The feature vectors used here were Mel-scale frequency cepstral coefficients (MFCCs), computed using a 32-ms Hamming-windowed frame with a 10-ms frame shift. Performance of the vocal/non-vocal segmentation was evaluated on the basis of frame accuracy. The best segmentation accuracy obtained was 79.8%, using a 64-mixture vocal GMM and an 80-mixture non-vocal GMM. Performance of the singer-based clustering was characterized by the average cluster purity [23], defined as
$$\rho=\frac{1}{M}\sum_{k=1}^{K}n_k\,\rho_k,\qquad(17)$$
and
$$\rho_k=\sum_{p=1}^{P}\frac{n_{kp}^2}{n_k^2},\qquad(18)$$
where ρk is the purity of cluster k, K is the number of clusters, M is the total number of recordings, nk is the total number of recordings in cluster k, nkp is the number of recordings in cluster k performed by singer p, and P is the number of singers. Fig. 4 shows the average purity as a function of the number of clusters. As expected, the average purity rises sharply as the number of clusters increases in the beginning, and then tends to saturate after too many clusters are created. Comparing the results yielded by the GMMs and the solo voice models, it is clear that better clustering performance can be obtained by explicitly exploiting prior knowledge of the background music. When the number of clusters equals the singer population (K = P = 20), the highest purities of 0.87 and 0.77 were obtained using manual segmentation and automatic segmentation, respectively. This confirms that the system is capable of grouping the music data by singer.
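For reference, the purity measure of Eqs. (17)-(18) can be computed as in the short sketch below, given each recording's predicted cluster label and true singer label; the function name is illustrative.

```python
# Sketch of the average cluster purity of Eqs. (17)-(18).
import numpy as np

def average_purity(cluster_labels, singer_labels):
    cluster_labels = np.asarray(cluster_labels)
    singer_labels = np.asarray(singer_labels)
    M = len(cluster_labels)
    rho = 0.0
    for k in np.unique(cluster_labels):
        members = singer_labels[cluster_labels == k]
        n_k = len(members)
        _, counts = np.unique(members, return_counts=True)  # n_kp for each singer p
        rho += n_k * np.sum(counts ** 2) / n_k ** 2          # n_k * rho_k, Eq. (18)
    return rho / M                                           # Eq. (17)
```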

Fig. 4. Results of the singer-based clustering: average purity versus the number of clusters for four configurations (manual segmentation with a 32-mixture solo GMM and an 8-mixture background GMM per recording; manual segmentation with a 32-mixture vocal GMM per recording; automatic segmentation with a 24-mixture solo GMM and an 8-mixture background GMM per recording; automatic segmentation with a 24-mixture vocal GMM per recording).

Next, the problem of automatically determining the number of singers was investigated. A series of clustering experiments was conducted using 50 music recordings (5 singers × 10 tracks), 100 music recordings (10 singers × 10 tracks), 150 music recordings (15 singers × 10 tracks), and 200 music recordings (20 singers × 10 tracks), respectively. Fig. 5 shows the resulting BIC values as a function of the number of clusters. The peak of each curve is located very close to the actual number of singers, validating the BIC-based decision criterion.

Fig. 5. BIC measurements after each split: BIC value versus the number of clusters for the collections of 5, 10, 15, and 20 singers.

6 Retrieving Music Documents by Singer

6.1 Methodology

Following the system configuration shown in Fig. 2, a collection of N music documents X1, X2, ..., XN is represented by solo voice models λs,1, λs,2, ..., λs,N, built with the method described in Sec. 4. The characteristic similarity L(Xi, Y), 1 ≤ i ≤ N, is then evaluated by computing the log-probability (likelihood) that Y tests against λs,i, i.e.,
$$L(X_i,Y)=\log p(Y_v\mid\lambda_{s,i},\lambda_{b,y}),\qquad(19)$$
where Yv is the vocal portion of Y, and λb,y is the background music GMM trained using the non-vocal portion of Y. Let R{L(Xi, Y)} denote the rank of L(Xi, Y) among L(X1, Y), L(X2, Y), ..., L(XN, Y) in descending order. A music document Xi is hypothesized as relevant to the query Y if
$$R\{L(X_i,Y)\}<\Upsilon,\qquad(20)$$
where Υ controls the number of documents that will be presented to the user.
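A minimal sketch of this ranking step is shown below. It treats the Eq. (19) scorer as a pluggable function (for instance, the log_likelihood() sketch given in Sec. 4) rather than re-implementing it, and the parameter names are illustrative.

```python
# Sketch of the singer-based retrieval step of Eqs. (19)-(20). score_fn(V, lambda_s, lambda_b)
# computes the log-likelihood of Eq. (19), e.g. the log_likelihood() sketch given in Sec. 4.
import numpy as np

def retrieve(query_vocal_frames, query_background_model, document_solo_models,
             score_fn, upsilon=10):
    """Return the indices of the documents whose 1-based rank is below upsilon."""
    scores = np.array([score_fn(query_vocal_frames, solo, query_background_model)
                       for solo in document_solo_models])    # L(X_i, Y), Eq. (19)
    order = np.argsort(-scores)                              # document indices, best first
    return order[:upsilon - 1]                               # ranks 1..upsilon-1 satisfy Eq. (20)
```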

6.2 Experimental Results

The music data used here were the same as those used in Sec. 5.2. The retrieval experiments were conducted in a leave-one-out manner: each track in DB-1 was used as a query once to retrieve the remaining 199 tracks in DB-1, rotating through all the tracks. Fig. 6 shows the precision rates (PR) and recall rates (RR) with respect to the number of documents presented to the user. The numbers of mixtures in λb,y and λs,i, 1 ≤ i ≤ 199, were empirically determined to be 8 and 32, respectively. For each query, nine documents in the database are truly relevant. From Fig. 6, we can see that, on the whole, around six documents were truly relevant when the system presented nine documents to the user. Compared with the results obtained with the GMMs, the effectiveness of the solo voice models is clearly demonstrated.

Fig. 6. Results of the singer-based music document retrieval: precision and recall rates (in %) versus the number of documents deemed relevant, for the GMM baseline and the solo voice modeling.

7 Singer Verification

7.1 Methodology

A singer-verification system, as depicted in Fig. 3, operates in two phases: training and testing. In the training phase, music data from a training set are segmented into vocal and non-vocal regions. The resulting non-vocal regions are used to form a GMM that simulates the characteristics of the background accompaniment. The background music GMM, together with the segmented vocal regions, is then used to create two solo voice models: the hypothesized singer model λsH and the universal singer model λsU. The hypothesized singer model is trained using music recordings performed entirely by the hypothesized singer, while the universal singer model is trained using all the available music data not performed by the hypothesized singer. During testing, a background music GMM λb is created on-line using the segmented non-vocal regions of a test recording X. The system then determines whether or not X is performed by the hypothesized singer using
$$\log p(X_V\mid\lambda_s^{H},\lambda_b)-\log p(X_V\mid\lambda_s^{U},\lambda_b)\;\underset{\text{No}}{\overset{\text{Yes}}{\gtrless}}\;\delta,\qquad(21)$$
where XV denotes all the segmented vocal regions of X, and δ is the decision threshold.
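A minimal sketch of this decision follows, again with the Eq. (2) scorer passed in as a pluggable function; the function and argument names are illustrative.

```python
# Sketch of the verification decision of Eq. (21). score_fn is an Eq. (2)-style
# scorer; lam_sH, lam_sU, and lam_b are the hypothesized-singer, universal-singer,
# and on-line background models described above.
def verify_singer(test_vocal_frames, lam_sH, lam_sU, lam_b, score_fn, delta=0.0):
    """Return True ("yes") iff the log-likelihood ratio of Eq. (21) exceeds delta."""
    llr = (score_fn(test_vocal_frames, lam_sH, lam_b)
           - score_fn(test_vocal_frames, lam_sU, lam_b))
    return llr > delta
```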

7.2 Experimental Results

The music data used here were the same as in Sec. 5.2. In analogy with the experiments in Sec. 6.2, the singer-verification experiments were conducted in a leave-one-out manner: each singer in DB-1 was used as the hypothesized singer once, rotating through all the singers. The subset DB-1 was further divided into two sub-subsets, one for training the hypothesized singer models and the other for evaluating the singer-verification performance. The training sub-subset contained eight tracks per singer, while the evaluation sub-subset contained two tracks per singer. In addition to creating the vocal and non-vocal GMMs, DB-2 was also used to train the universal singer model λsU. Performance of the singer verification was characterized by two error measures, the Miss Error Rate (MER) and the False Alarm Rate (FAR). A miss error occurs when a music recording performed by the hypothesized singer is rejected ("no"), while a false alarm occurs when a music recording not performed by the hypothesized singer is accepted ("yes"). Fig. 7 shows the singer-verification results reported using the detection error trade-off (DET) plot [24]. Here, the numbers of mixtures used in the hypothesized singer, universal singer, and background music models were empirically determined to be 32, 32, and 8, respectively. The superiority of the solo voice models over the GMMs was demonstrated once again. The best equal error rate (MER = FAR) of 12.4% shows the feasibility of the singer-verification system.

Fig. 7. Results of the singer verification: DET curves (miss probability versus false alarm probability, in %) for the GMM baseline and the solo voice modeling.

8 Summary

This paper has investigated several essential techniques required for the realization of a music digital library. In particular, to facilitate the automatic organization of music collections, we have studied a technique for blind clustering of music recordings based on singer voice characteristics. In addition, to support information retrieval, a query-by-example framework has been built that allows users to locate a specified singer's music recordings in a database without explicitly indicating the name of the sought singer. Furthermore, to protect copyrighted music documents from unauthorized distribution, a singer-verification technique has been developed that enables copyright holders to rapidly scan suspect websites for piracy. Through the use of vocal/non-vocal segmentation and solo voice modeling, we have shown that singer voice characteristics can be better handled, and hence the feasibility of singer-based clustering, retrieval, and verification can be significantly enhanced.

References
[1] R. J. McNab, L. A. Smith, and I. H. Witten. Towards the Digital Music Library: Tune Retrieval from Acoustic Input. In Proc. of the 1st ACM International Conference on Digital Libraries, pp. 11-18, 1996.
[2] D. Bainbridge, C. G. Nevill-Manning, I. H. Witten, L. A. Smith, and R. J. McNab. Towards a Digital Library of Popular Music. In Proc. of the 4th ACM International Conference on Digital Libraries, pp. 161-169, 1999.
[3] D. Bainbridge, C. G. Nevill-Manning, I. H. Witten, L. A. Smith, and R. J. McNab. An Ethnographic Study of Music Information Seeking: Implications for the Design of a Music Digital Library. In Proc. of the ACM/IEEE Joint Conference on Digital Libraries, pp. 5-16, 2003.
[4] S. Hacker. MP3: The Definitive Guide. O'Reilly, 2000.
[5] ISO-IEC/JTC1 SC29 WG11 Moving Pictures Expert Group. Information Technology - Multimedia Content Description Interface - Part 4: Audio. Committee Draft 15938-4, ISO/IEC, 2000.
[6] A. S. Durey and M. A. Clements. Features for Melody Spotting Using Hidden Markov Models. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1765-1768, 2002.
[7] M. A. Akeroyd, B. C. J. Moore, and G. A. Moore. Melody Recognition Using Three Types of Dichotic-pitch Stimulus. The Journal of the Acoustical Society of America, 110(3), pp. 1498-1504, 2001.
[8] P. Herrera, X. Amatriain, E. Batlle, and X. Serra. Towards Instrument Segmentation for Music Content Description: A Critical Review of Instrument Classification Techniques. In Proc. of the 1st International Symposium on Music Information Retrieval, 2000.
[9] A. Eronen. Musical Instrument Recognition Using ICA-based Transform of Features and Discriminatively Trained HMMs. In Proc. of the 7th International Symposium on Signal Processing and Its Applications, pp. 133-136, 2003.
[10] G. Tzanetakis and P. Cook. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 10(5), pp. 293-302, 2002.
[11] C. Xu, N. C. Maddage, X. Shao, F. Cao, and Q. Tian. Musical Genre Classification Using Support Vector Machines. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 429-432, 2003.
[12] Y. E. Kim and B. Whitman. Singer Identification in Popular Music Recordings Using Voice Coding Features. In Proc. of the 3rd International Conference on Music Information Retrieval, pp. 164-169, 2002.
[13] C. C. Liu and C. S. Huang. A Singer Identification Technique for Content-based Classification of MP3 Music Objects. In Proc. of the International Conference on Information and Knowledge Management, pp. 438-445, 2002.
[14] W. H. Tsai, H. M. Wang, and D. Rodgers. Automatic Singer Identification of Popular Music Recordings via Estimation and Modeling of Solo Vocal Signal. In Proc. of the 8th European Conference on Speech Communication and Technology, 2003.
[15] B. Whitman, G. Flake, and S. Lawrence. Artist Detection in Music with Minnowmatch. In Proc. of the IEEE Workshop on Neural Networks for Signal Processing, pp. 559-568, 2001.
[16] D. Liu, L. Lu, and H. J. Zhang. Automatic Mood Detection from Acoustic Music Data. In Proc. of the 4th International Conference on Music Information Retrieval, 2003.
[17] C. K. Wang, R. Y. Lyu, and Y. C. Chiang. An Automatic Singing Transcription System with Multilingual Singing Lyric Recognizer and Robust Melody Tracker. In Proc. of the 8th European Conference on Speech Communication and Technology, 2003.
[18] D. A. Reynolds and R. C. Rose. Robust Text-independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing, 3(1), pp. 72-83, 1995.
[19] A. Dempster, N. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, pp. 1-38, 1977.
[20] R. C. Rose, E. M. Hofstetter, and D. A. Reynolds. Integrated Models of Signal and Background with Application to Speaker Identification in Noise. IEEE Transactions on Speech and Audio Processing, 2(2), pp. 245-257, 1994.
[21] W. H. Tsai, H. M. Wang, D. Rodgers, S. S. Cheng, and H. M. Yu. Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics. In Proc. of the 4th International Conference on Music Information Retrieval, 2003.
[22] G. Schwarz. Estimating the Dimension of a Model. The Annals of Statistics, 6, pp. 461-464, 1978.
[23] A. Solomonoff, A. Mielke, M. Schmidt, and H. Gish. Clustering Speakers by Their Voices. In Proc. of the IEEE Conference on Acoustics, Speech, and Signal Processing, pp. 757-760, 1998.
[24] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The DET Curve in Assessment of Detection Task Performance. In Proc. of the 5th European Conference on Speech Communication and Technology, 1997.
[25] http://music-ir.org/evaluation/wp.html