Automated Classification of Audio Data and Retrieval Based on Audio Classes

S.R. Subramanya (1), Abdou Youssef (2), Bhagirath Narahari (2), Rahul Simha (3)

(1) Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409
(2) Department of EE and CS, The George Washington University, Washington, DC 20052
(3) Department of Computer Science, College of William and Mary, Williamsburg, VA 23185

Abstract

The explosive increase in the amount of audio (and multimedia) data being generated, processed, and used in computer applications has necessitated the development of audio (and multimedia) database systems. The classification of audio data into different categories is an important task which facilitates efficient indexing and accurate searching. The large number of data items, coupled with the subjective and inexact nature of audio data, makes manual classification a daunting task and automated classification schemes highly desirable. This paper presents a scheme for the automated classification of audio data with minimal manual interaction, and a scheme for speeding up the retrieval of relevant audio data using pre-determined audio classes. The results of classification and retrieval using the proposed algorithm are presented.

1 Introduction

The phenomenal increase in the use of audio, along with text and other multimedia data such as images and video, in a variety of computer applications has necessitated the design and development of audio databases with newer features. Example applications are in digital libraries, the entertainment industry, medical and forensic laboratories, virtual reality, and several others. The huge number of data items, coupled with the richness of expression, the inexact nature, and the subjective interpretations of audio data, has rendered keyword-based querying ineffective in the aforementioned applications. For effective use of data in such applications, the user needs content-based queries, which are unrestricted and unanticipated. Although it is desirable for the content-based retrieval scheme to search the entire collection of audio data, it might be beneficial to have a classification of the audio data, so that the search can be made to look for data more closely in certain classes than in other

classes, to improve the search speed and search accuracy. Sometimes, even within a particular application, the audio data types are very diverse and a classification might be beneficial. The huge data sizes and enormous number of data items, coupled with the subjective and inexact nature of audio data, make manual classification a daunting task, and automated classification schemes highly desirable. A scheme for the classification of audio data based on content, to facilitate subsequent search and retrieval, is presented in [8]. That classification is done by deriving a few low-level attributes of audio, such as pitch, brightness, and harmonicity, from the measurable properties of audio, and then using statistical techniques such as the mean, variance, and autocorrelation on the derived attributes to cluster or classify the data. In this paper, we propose (1) a scheme for the automated classification of audio data, and (2) a search and retrieval scheme for queries-by-example, based on audio classes. We approach the problem of automated classification of audio data with minimal manual interaction as a special case of the general clustering problem, where a finite set of items has to be partitioned into disjoint subsets such that the `distances' between all items in a group are as small as possible, while the `distances' between the groups are as large as possible. The `distance', in this context, is a measure of (dis)similarity between two audio data items. Determination of the clusters (classes) is a global optimization problem, whereby a set of simultaneous algebraic equations describing the optimality criterion must be solved by direct or iterative methods. This is computationally intensive, and is the motivation for the heuristics we present in this paper. The proposed scheme for the automated classification/clustering of audio data is based on the k-means clustering algorithm [1, 2, 3].
The proposed classification and retrieval schemes use the frequency content information of the audio data for determining their similarity. This paper

only addresses the broad classification of highly distinct audio data, and does not address finer sub-classes within audio classes. For example, one might further classify the sounds of bells into finer sub-classes such as jingles, gongs, etc. In the proposed algorithms, the audio data is assumed to be of relatively short duration (less than 20 seconds). These could represent actual data files, or they could be content-indices for larger files. In this paper, the terms class and cluster are used synonymously. The next section describes the audio data model and outlines the assumptions used in the proposed algorithm; Section 3 presents the proposed algorithm. Experimental results are given in Section 4, followed by conclusions.

2 Audio Data Model and Assumptions

2.1 Audio data model

We use the frequency content information of short, non-overlapping windows of audio data as a signature, or characterizing feature, for the audio data items. To determine the similarity of two data items, we use a distance measure between the corresponding frequency components of the blocks. The reasons for this representation are given below. Transforms such as the FFT and DCT applied to signals (time-domain: audio; spatial: images; spatio-temporal: video) yield frequency-domain data (the transform coefficients), which offer advantages in processing data, noise removal, energy compaction, compression, etc. The energy compaction property of the transforms enables the use of only a few significant transform coefficients to adequately `represent' the original data. This is the reason for our choice of transform coefficients for representing and comparing audio data items, since it speeds up the process of comparison. In this work, we adopt the DCT (Discrete Cosine Transform) as the transform, for its advantages over several other transforms for our requirements [9]. Audio data is non-stationary, i.e., all the frequencies in the signal may not be present over the whole signal duration. Using a Fourier or DCT transform of the whole signal gives the frequency components present in the signal and their relative strengths, but gives no information about the times at which the frequencies occur in the signal. So, to capture the time information (and the finer local information of the signal), the transform needs to be applied to short windows of the signal. The process of applying the DCT on short windows of the signal and selecting a suitable number of transform coefficients to represent the audio data, as used in the proposed scheme, is described in [9]. Thus, each audio data item A_i is represented by a sequence of blocks D_in of transform coefficients. We use the Euclidean distance between the transform coefficients as the similarity measure.
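A minimal sketch of this representation, assuming illustrative window and coefficient counts (the function names, the 256-sample window, and the direct DCT-II computation are ours, not from the paper):

```python
import numpy as np

def dct2(x):
    """Type-II DCT of a 1-D signal, computed directly from its definition."""
    n = len(x)
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * (np.arange(n) + 0.5) * k / n)
    coeffs = basis @ x
    # Orthonormal scaling so energies are comparable across windows.
    coeffs[0] *= np.sqrt(1.0 / n)
    coeffs[1:] *= np.sqrt(2.0 / n)
    return coeffs

def signature(samples, window=256, keep=64):
    """Split the signal into non-overlapping windows, DCT each window,
    and keep the first `keep` coefficients per window as the signature."""
    n_windows = len(samples) // window
    blocks = samples[:n_windows * window].reshape(n_windows, window)
    return np.array([dct2(b)[:keep] for b in blocks])

def distance(sig_a, sig_b):
    """Euclidean distance between two signatures, over their common length."""
    m = min(len(sig_a), len(sig_b))
    return float(np.linalg.norm(sig_a[:m] - sig_b[:m]))
```

In a real system the DCT would come from an FFT-based routine; the direct computation here just keeps the sketch self-contained.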

2.2 Assumptions

The following assumptions are made about the audio data used in the classification and search.

- All of the audio data in the collection are of relatively short duration (less than 20 seconds).
- All audio data items are of approximately equal sizes (durations).
- The frequency content of the audio data is used in determining the similarity between them. Euclidean distance defined on the transform coefficients is used as the (dis)similarity measure of two audio data items.
- Every data item contains only one predominant type of audio data. In other words, a data item will not have music, speech, etc. all predominantly present.

3 The Proposed Algorithm

The general idea behind the classification scheme is as follows. First, a few representative audio data items R_i are manually selected from the collection of audio files, each R_i initially serving as the representative for class C_i. For example, male speech, female speech, classical music, country music, laughter, etc. are the classes. Then, for each of the remaining audio data items A_i in the collection, the R_k which is closest to A_i is determined and A_i is assigned to class C_k. Next, for each class C_i, the `centroid' of the class, i.e., the audio data item which is most similar to the rest of the audio data in the class (the one with the least total distance to all others in the class), is determined and taken as the new R_i. Then, for each of the remaining audio data items, the nearest R_k is determined as before and the item is assigned to that class C_k. This process of clustering and determining new representatives is continued until the classes become `stable'. The stability or convergence is not a simple issue, however. We use a heuristic wherein, between successive iterations, if (1) the changes in the R_i's of the classes, and (2) the number of data migrations across classes, are within certain thresholds, then the classes are assumed to have stabilized.

Algorithm 3.1 AudioClassify
Input: Set of all audio data.
Output: Audio classes.
1. begin
2. Manually select representatives R_i, 1 ≤ i ≤ M.
3. For all remaining data A_i (≠ R_i), find the closest R_k and assign A_i to C_k.
4. Find the `centroid' of each class. (The centroid is the data item which has the minimum total distance to all other data in the class.)
5. Treat the above centroids as the new R_i's.
6. Test for convergence.
7. If converged, the C_i's are the classes; else go to step 3.
8. end
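The steps above can be sketched as a k-medoid-style loop. This is an illustrative implementation, not the authors' code; for simplicity it treats each item as a flat feature vector and uses "no representative changed" as the convergence test, rather than the threshold heuristic described in the text:

```python
import numpy as np

def audio_classify(items, rep_idx, max_iter=20):
    """Sketch of Algorithm 3.1. `items` is an (N, D) array of feature
    vectors; `rep_idx` holds the manually chosen initial representative
    indices, one per class. Returns (labels, representative indices)."""
    reps = list(rep_idx)
    labels = np.zeros(len(items), dtype=int)
    for _ in range(max_iter):
        # Step 3: assign every item to the class of its nearest representative.
        dists = np.linalg.norm(items[:, None, :] - items[reps][None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Steps 4-5: new representative = the `centroid', i.e. the member
        # with minimum total distance to all other members of its class.
        new_reps = []
        for c in range(len(reps)):
            members = np.flatnonzero(labels == c)
            if len(members) == 0:
                new_reps.append(reps[c])  # keep the old rep for an empty class
                continue
            intra = np.linalg.norm(
                items[members][:, None] - items[members][None, :], axis=2)
            new_reps.append(int(members[np.argmin(intra.sum(axis=1))]))
        # Step 6: converged when the representatives stop changing.
        if new_reps == reps:
            break
        reps = new_reps
    return labels, reps
```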

3.1 Choice of initial classes

In the above description, the initial classes were derived manually by choosing representative audio data from the collection. When the collection is huge and the content of every item is not precisely known (and/or if the contents are prone to subjective interpretation), it might be better to have an automated scheme to derive the initial classes. This could be done by one of the following approaches:

(1) Random. The M initial representatives are randomly selected. Thereafter, the clustering algorithm described above is used to classify the data.

(2) Top-down. In this scheme, the whole collection is initially regarded as a single big class, and its representative is determined. Then a perturbed value of the representative is computed, and the data items are reclassified based on the two representatives. The process is repeated until M classes are formed. (Although this results in M being a power of 2, by creating only one new representative at every step, or after a suitable number of steps, any value of M can be handled.)

(3) Bottom-up. In this approach, each data item is initially regarded as a single class. Then the pair-wise distances of all data items are determined, and the closest pair is merged into a single class (pair-wise nearest neighbor (PNN) clustering). After merging, a new representative of the merged class is determined. This process is continued until the number of classes reaches M.
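The bottom-up (PNN) start might be sketched as below. The paper does not specify how the merged class's representative is determined, so taking the mean of the members is our assumption for illustration:

```python
import numpy as np

def pnn_initial_classes(items, M):
    """Bottom-up (pairwise nearest neighbour) seeding: each item begins as
    its own class; repeatedly merge the closest pair of representatives
    until only M classes remain. Returns lists of member indices."""
    classes = [[i] for i in range(len(items))]
    reps = [items[i].astype(float) for i in range(len(items))]
    while len(classes) > M:
        # Find the closest pair of class representatives.
        best, best_d = (0, 1), np.inf
        for i in range(len(reps)):
            for j in range(i + 1, len(reps)):
                d = np.linalg.norm(reps[i] - reps[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        # Merge class j into class i; the mean of the members serves as
        # the new representative (an assumption, see lead-in).
        classes[i] += classes[j]
        reps[i] = items[classes[i]].mean(axis=0)
        del classes[j], reps[j]
    return classes
```

The O(n^2) pair scan per merge is fine for seeding a few hundred files; a distance heap would be needed at larger scales.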

3.2 Search using audio classes

A brief description of the search algorithm based on classes is given below, followed by the pseudocode.

It is assumed that for every class, a `representative' data item has been determined. The representative is akin to the `centroid' of the class: it could be one of the audio data items which is approximately equidistant from the other members of the class, or it could be a `synthetic' data item which is indeed equidistant from the other class members. The algorithm first finds the distance between the given query and the representative of every class. All the distances are then sorted in non-decreasing order. Based on this order and the distances, the classes are grouped into a few `super-classes'. The classes that are nearer to the given query are considered more promising, as they are likely to contain more data items which are closer to the given query. So, the search process is designed to look more closely for matches in the classes which are closer to the query than in those which are farther. This is done by using more/fewer transform coefficients in distance computations for the data in the classes that are closer to/farther from the query. As a specific example, suppose that the classes are arranged in non-decreasing order of distances from the given query. The top 30% of the classes form super-class 1, the next 30% form super-class 2, and the remaining 40% form super-class 3. Suppose also that 64 significant coefficients are used per block of data as the index data. For the distance computation (for similarity matches), the search algorithm uses all 64 coefficients for all audio data in super-class 1, only 32 coefficients for all data in super-class 2, and only 16 coefficients for super-class 3. A straightforward algorithm, on the other hand, uses a fixed number of 64 coefficients for all data.

Algorithm 3.2 ClassBasedRetrieve (in: C_i, R_i, 1 ≤ i ≤ M, Q; out: L)
{C_i: class i, R_i: representative of C_i, Q: query, L: result list}
1. begin
2. Find the distances d_i between Q and each R_i.
3. Sort the d_i, 1 ≤ i ≤ M, in non-decreasing order.
4. Based on the d_i's, form P super-classes S_0 ... S_{P-1}, of sizes N_0 ... N_{P-1}.
5. Prioritize the searches in the different super-classes, i.e., fix the index size to be used in the search in each super-class.
6. Search the super-classes based on the index sizes fixed for each of them.
7. Collect the results of the search in all classes. L ← {best matches}. Return L.
8. end
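A possible rendering of this retrieval loop in code, using the 30/30/40 split and 64/32/16 coefficient counts from the example above (the parameter names and the flat-vector representation of each item are illustrative assumptions):

```python
import numpy as np

def class_based_retrieve(query, classes, reps, n_coeffs=(64, 32, 16),
                         fractions=(0.3, 0.3, 0.4), top=5):
    """Sketch of Algorithm 3.2. `classes` maps class index -> (n_items, D)
    array of coefficient vectors; `reps` holds one representative vector per
    class. Nearer super-classes are searched with more coefficients."""
    # Steps 2-3: distances to class representatives, sorted non-decreasingly.
    order = np.argsort([np.linalg.norm(query - r) for r in reps])
    # Step 4: super-class boundaries as fractions of the sorted class list.
    bounds = np.cumsum([round(f * len(order)) for f in fractions])
    bounds[-1] = len(order)  # guard against rounding drift
    results, start = [], 0
    for k, end in enumerate(bounds):
        c = n_coeffs[k]  # steps 5-6: index size for this super-class
        for ci in order[start:int(end)]:
            for item_idx, sig in enumerate(classes[ci]):
                d = float(np.linalg.norm(query[:c] - sig[:c]))
                results.append((d, int(ci), item_idx))
        start = int(end)
    # Step 7: collect the best matches across all classes.
    results.sort()
    return results[:top]
```

Note that distances computed with different coefficient counts are mixed in one ranking; this is the source of the slight precision loss discussed in the experimental results.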

No. of data   m_i = (no. of misclassified items / total items in class)
items         Class 1   2      3      4      5
100           1/12      2/18   3/24   2/27   2/19
200           4/42      4/39   6/46   4/41   3/32
300           8/61      9/65   10/63  7/58   9/53

Table 1: Misclassification ratios for 5 classes.

No. of data   Class 1  2     3     4     5     6     7     8     9     10
items
100           1/8      1/11  2/9   2/12  1/9   1/10  1/8   1/11  2/13  1/9
200           2/21     3/22  3/19  3/21  2/20  1/18  2/17  2/22  3/24  1/16
300           4/33     3/30  4/28  3/29  4/32  4/31  4/28  3/29  3/31  3/29

Table 2: Misclassification ratios for 10 classes.

4 Experimental Results

The proposed algorithm has been implemented and used to classify audio data from a database of about 300 audio files. The classification experiments were run (1) with different starting audio data as class representatives, and (2) with different numbers of data files. The algorithm generally converges in 4-6 iterations. The results shown here are for the case of manual selection of initial class representatives. The following table gives the division of classes into `super-classes' and the number of coefficients used in the search process.

                          Super-class   No. of classes   No. of coefficients
                                        in super-class   used in search
Number of classes = 5:        1               2                 64
                              2               3                 32
Number of classes = 10:       1               3                 64
                              2               3                 32
                              3               4                 16

A misclassification occurs when an audio data item does not `belong' to a class, i.e., it is assigned to a class but should, in fact, be in some other class. A data item `belongs' to a class if it is closer to `all the members' of that class than to `all the members' of any other class. Misclassifications occur because, in the classification algorithm, the determination of the class for a data item is made based only on the representative of a class, without considering the other data items in the class at that time.

The misclassification ratio for class i, m_i, is given by:

    m_i = (number of misclassified items in class i) / (total number of items in class i)

Tables 1 and 2 summarize the misclassification in the various classes. The overall misclassification ratio m is given by:

    m = (1/M) * sum_{i=1}^{M} m_i
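As a small worked instance of these two formulas, using the 100-item row of Table 1 (the function name is ours):

```python
def misclassification_ratio(per_class):
    """m_i = misclassified / total for each class; m = (1/M) * sum of m_i."""
    m_i = [mis / tot for mis, tot in per_class]
    return m_i, sum(m_i) / len(m_i)

# (misclassified, total) per class, from the 100-item row of Table 1.
row = [(1, 12), (2, 18), (3, 24), (2, 27), (2, 19)]
m_i, m = misclassification_ratio(row)
```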

The misclassification ratio, averaged over ten runs, is shown in Figure 1 for various data sizes (left) and for different numbers of selected coefficients (right). Figure 2 shows a comparison of (1) the retrieval times (on the left), and (2) the retrieval precision (on the right), for the schemes with and without classes. It is easily seen that the scheme which uses classes has better retrieval speed, since it uses fewer coefficients for searching in the classes which are farther away from the query. Note that the times spent (1) finding the distance between the given query and the class representatives, and then (2) sorting those distances, have been included as part of the retrieval times. The retrieval precision is slightly lower for class-based retrieval due to misclassifications, and due to the use of fewer coefficients for certain classes seemingly farther from the query.

5 Conclusions and Future Directions

Fast and accurate content-based retrieval of audio data is important in audio databases. Instead of uniformly searching the entire database, a classification of the data enables the search scheme to look more closely


Figure 1: Misclassification ratio vs. data size (left) and vs. number of coefficients (right), for 5 and 10 classes.


Figure 2: Comparison of retrieval times (left) and retrieval precision (right), with and without classes, vs. data size.

in certain classes that are closer to the query than in others. This paper proposed schemes for (1) the automated classification of audio data items, and (2) the retrieval of audio data based on pre-determined audio classes. The algorithms were implemented, and the results show improvements in the speed and accuracy of retrieval compared to a scheme which searches the entire audio collection uniformly. The proposed method could be extended to refine the individual classes and build a hierarchy of classes and sub-classes (to a desired number of levels). The development of a good similarity measure is a challenging issue. Further extensions might include handling data of different sizes and more distinct sounds (classes). Another extension could be the automatic creation of new classes: when a data item is not `close enough' to the representative of any class, it could be assigned a new class.

References

[1] MacQueen, J. `Some methods for classification and analysis of multivariate observations', Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Le Cam, L.M. and Neyman, J. (Eds.), 1967, pp. 281-297.

[2] Hartigan, J. Clustering Algorithms, Wiley, 1975.
[3] Jain, A.K. and Dubes, R.C. Algorithms for Clustering Data, Prentice-Hall, 1988.
[4] Narasimhalu, A.D. (Ed.) Special issue on content-based retrieval. ACM Multimedia Systems, Vol. 3, No. 1, Feb. 1995.
[5] Hermansky, H. `Speech Beyond 10 milliseconds (Temporal Filtering in Feature Domain)', Tech. report, Oregon Graduate Institute.
[6] Kimber, D. and Wilcox, L. `Acoustic Segmentation for Audio Browsers', Proc. Interface Conference, Sydney, July 1996.
[7] Hawley, M.J. Structure of Sound, Ph.D. Thesis, MIT, Sept. 1993.
[8] Wold, E. et al. `Content-based classification, search and retrieval of audio data', IEEE Multimedia Magazine, 1996.
[9] Subramanya, S.R. et al. `Transform-Based Indexing of Audio Data for Multimedia Databases', IEEE Int'l Conference on Multimedia Systems, Ottawa, June 1997.