International Journal of Computers and Applications, Vol. 31, No. 2, 2009

AUDIO SIGNAL DISCRIMINATION USING EVOLUTIONARY SPECTRUM

A. I. Al-Shoshan∗

Abstract

In this paper, a joint-distribution algorithm, namely the evolutionary spectrum (ES), is proposed and discussed for audio signal discrimination. The purpose of audio signal discrimination is to build two different libraries, a speech library and a music library, from a stream of sounds. In general, classification algorithms can be divided into three approaches: time-domain, frequency-domain, and time–frequency domain approaches. The first two approaches have been discussed thoroughly in the literature; however, the third approach has not yet been tested thoroughly for this type of discrimination. As audio signals are considered non-stationary, in general, the time- or frequency-domain approach alone may fail in reflecting the time-varying behaviour of such signals correctly. Therefore, in this paper, the ES is proposed to solve this classification problem. To show the effectiveness of the proposed algorithm, simulations and comparisons with classical methods are carried out using the Columbia SMD database.

Key Words

Audio signal discrimination, joint-distribution, evolutionary spectrum

1. Introduction

The problem of discriminating audio signals has become increasingly important, and it has been applied to many real-world multimedia domains [1–6]. Owing to new techniques for the analysis and synthesis of speech signals, musical signal processing has gained particular weight [7, 8], and classical sound analysis techniques are therefore used in processing music signals. There are many kinds of music, such as Classical, Rock, Pop, Disco, Jazz, Country, Latin, Electronic, Arabic, and so on [9]. As audio signal characteristics change with time, audio signals are considered non-stationary; therefore, classical stationary signal analysis approaches do not reflect their actual behaviour and cannot deal accurately with these types of signals. Due to this property, we propose using time–frequency distributions to reflect their properties. This paper is organized as follows. In Section 2, a summary of audio signal properties is given and a list of approaches used in discrimination is mentioned. In Section 3, audio signal analysis using the evolutionary spectrum (ES) approach is introduced and discussed. In Section 4, simulation results and a comparison with other approaches are presented, and the results are concluded in Section 5.

∗ Computer Engineering Department, College of Computer, Qassim University, P.O. Box 1928, Onizah 51911, Saudi Arabia; e-mail: [email protected]
Recommended by Dr. Leone C. Monticone (paper no. 202-1990)

2. Audio Signals Discrimination

2.1 Audio Signal Properties

In this section, we briefly mention some of the main differences between music and speech signals. These differences can be summarized as follows: tonality [10–12], alternative sequence [13], bandwidth and power distribution [14], fundamental frequency, dominant frequency, and excitation patterns [15], tonal duration, energy sequences, and zero-crossing rate (ZCR) [15], and consonants [16]. The audio signal discrimination approaches can be classified into three categories: (1) time, (2) frequency, and (3) time–frequency domains. El-Maleh [17, 18] has developed a two-level music and speech classifier using long-term features such as differential parameters, variance, time averages of spectral parameters, and zero crossings. Saunders [15] has also proposed a two-level algorithm for discrimination based on the average ZCR and energy features, applying a simple threshold procedure. Matityaho and Furst [19] have developed a neural network-based model for classification of music type, designed around the functional behaviour of the human cochlea. Hoyt and Wechsler [20] have also developed a neural network-based model for speech detection, but they have used a Hamming filter, the Fourier transform, and a logarithmic function in pre-processing before the neural network input. They have used a simple threshold algorithm to detect speech from music, traffic, wind, or any other interfering sound, and they have suggested a wavelet transform feature as an optional pre-processing step to improve performance. Their work is similar to that of Matityaho and Furst [19]. Scheirer and Slaney [21] have examined 13 features, some of which are modifications of each other, intended to measure conceptually distinct properties of speech and/or music signals, and combined them in several multidimensional classification frameworks. They concluded that using long-term features, like cepstrum pitch or the spectral centroid, introduces a large delay without a worthwhile increase in overall discrimination precision. Tzanetakis and Cook [22] have worked on classifying audio signals using the discrete wavelet transform (DWT). They have compared their results with short-time Fourier transform (STFT) and Mel-frequency cepstral coefficient (MFCC) methods using three different sets of audio: speech/music, classical, and voices. They have also presented a technique for detecting the beat attributes of music.
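For illustration, the two long-term quantities that recur throughout this survey, the zero-crossing rate and the short-time energy, can be computed frame by frame as in the minimal NumPy sketch below. The frame length of 512 samples and hop of 256 samples are illustrative assumptions, not values taken from the cited works.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a mono signal into overlapping frames (frame/hop sizes are assumed)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ, per frame."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)
```

A two-level scheme such as Saunders' [15] then applies simple thresholds to statistics of these per-frame sequences; the specific statistics and threshold values are those of the cited paper, not of this sketch.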

2.2 Audio Signal Discriminators

In general, the music and speech discrimination processes found in the literature can be classified, briefly, into the following algorithms: (1) time-domain approaches [13–15, 18, 19, 21, 23–26], such as the ZCR, the short-time energy (STE), the ZCR and the STE with positive derivative, the variance of the roll-off feature, the pulse metric, the number of silent segments, the hidden Markov model (HMM) unit, and neural networks; (2) frequency-domain approaches [14, 23, 27–32], such as the spectral centroid, the mean and variance of the spectral flux, the mean and variance of the spectral centroid, the roll-off of the spectrum, the bandwidth of the signal, the amplitude and delta amplitude, the cepstral residual, the variance of the cepstral residual, cepstral features, the pitch, and the delta pitch; and (3) time–frequency domain approaches, such as the wavelet transform [22], which have not yet been tested thoroughly. As audio signals are considered non-stationary, in general, the time- or frequency-domain approach alone may fail in reflecting the behaviour of such signals correctly; therefore, the joint-distribution approach, namely the ES [33, 34], is proposed and will be discussed and tested.

3. Evolutionary Spectrum

In this section, a brief mathematical introduction to the ES is given. First, the spectral representation of a stationary signal may be viewed as an infinite sum of sinusoids with random amplitudes and phases:

e(n) = \int_{-\pi}^{\pi} e^{j\omega n}\, dZ(\omega)    (1)

where Z(ω) is a process with orthogonal increments, i.e.:

E\{dZ^{*}(\omega)\, dZ(\Omega)\} = \frac{S(\omega)}{2\pi}\, \delta(\omega - \Omega)\, d\omega    (2)

and S(ω) is the spectral density function of e(n). The family of constant-amplitude sinusoids is, however, not appropriate for characterizing non-stationary processes such as speech. In the Wold–Cramér decomposition [32], a discrete-time non-stationary process {x(n)} is considered the output of a causal, linear, time-variant (LTV) system driven by a zero-mean, unit-variance white noise input e(n), i.e.:

x(n) = \sum_{m=-\infty}^{n} h(n, m)\, e(n - m)    (3)

where h(n, m) is the impulse response of the LTV system. Substituting e(n) from (1) into (3) (with S(ω) = 1 for white noise) we get:

x(n) = \int_{-\pi}^{\pi} H(n, \omega)\, e^{j\omega n}\, dZ(\omega)    (4)

where the generalized transfer function of the LTV system is defined as:

H(n, \omega) = \sum_{m=-\infty}^{n} h(n, m)\, e^{-j\omega m}    (5)

A non-stationary process can thus be expressed as an infinite sum of sinusoids with random, time-varying amplitudes and phases. As the instantaneous variance of x(n) is given by [33]:

E\{|x(n)|^{2}\} = \frac{1}{2\pi} \int_{-\pi}^{\pi} |H(n, \omega)|^{2}\, d\omega    (6)

the Wold–Cramér ES is defined as:

S(n, \omega) = \frac{1}{2\pi} |H(n, \omega)|^{2}    (7)

Figure 1. Block diagram of the ES method.

4. Simulation Results

Applying the ES to a set of 500 speech samples and a set of 500 music samples, each 2.5 s long and sampled at 22,050 Hz, we get the results shown in Table 1.

Table 1
Results Using the ES Method

Detection | Correct (%) | Error (%)
Speech    | 99.4        | 0.6
Music     | 99          | 1

Table 1 shows an error of 8 samples out of 1,000 samples, i.e., an error of 0.8%. The ES feature vector used in Table 1 can be constructed as follows (see the sketches after this list):
1. Get the ES of the audio signal.
2. Divide the ES of the audio signal into four non-uniform sub-bands: [0, ω0/8], [ω0/8, ω0/4], [ω0/4, ω0/2], and [ω0/2, ω0], where ω0 is half of the sampling frequency.
3. As features, take the mean, the variance, the average spectral flux, and the average sub-band energy ratio (ASBER) of each sub-band.
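The ES in (7) requires the generalized transfer function H(n, ω) of the underlying LTV model, which the authors estimate via the time-varying autocorrelation function [34]. As a hedged, minimal stand-in, the sketch below computes a short-time (spectrogram-style) estimate of a time-varying spectrum S(n, ω); the Hann window, 1,024-sample window, and 256-sample hop are assumptions for illustration, not the paper's estimator or settings.

```python
import numpy as np

def evolutionary_spectrum_estimate(x, fs=22050, win_len=1024, hop=256):
    """Crude time-varying spectral estimate S(n, w) ~ |H(n, w)|^2 / (2*pi).

    A windowed short-time estimate is used here only as a practical
    surrogate for the Wold-Cramer ES; it is not the paper's estimator.
    Returns frame times (s), bin frequencies (Hz), and a (frames x bins) array.
    """
    window = np.hanning(win_len)
    n_frames = 1 + max(0, (len(x) - win_len) // hop)
    spec = np.empty((n_frames, win_len // 2 + 1))
    for i in range(n_frames):
        frame = x[i * hop:i * hop + win_len] * window
        spec[i] = np.abs(np.fft.rfft(frame)) ** 2 / (2 * np.pi)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    return times, freqs, spec
```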
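Given any time-varying spectral estimate of that shape, the four-sub-band feature vector of steps 1–3 above can be assembled as in the following sketch. The band edges follow the paper (ω0 equal to half the sampling frequency), while the particular formulas used here for the average spectral flux and the ASBER are plausible interpretations rather than definitions given in the paper.

```python
import numpy as np

def es_feature_vector(freqs, spec):
    """Mean, variance, average spectral flux, and ASBER for each of the
    four non-uniform sub-bands (4 bands x 4 statistics = 16 features).

    `spec` is a (frames x frequency-bins) time-varying spectrum and `freqs`
    the bin frequencies in Hz; the flux/ASBER definitions are assumptions.
    """
    w0 = freqs[-1]                              # half the sampling frequency (Nyquist)
    edges = np.array([0.0, w0 / 8, w0 / 4, w0 / 2, w0])
    idx = np.searchsorted(freqs, edges)
    idx[-1] = len(freqs)                        # include the Nyquist bin in the top band
    total_energy = spec.sum() + 1e-12
    features = []
    for a, b in zip(idx[:-1], idx[1:]):
        band = spec[:, a:b]
        band_energy = band.sum(axis=1)          # per-frame energy in the band
        flux = np.abs(np.diff(band_energy)).mean() if len(band_energy) > 1 else 0.0
        asber = band.sum() / total_energy       # average sub-band energy ratio (assumed definition)
        features.extend([band.mean(), band.var(), flux, asber])
    return np.array(features)
```

With the assumed window and hop, a 2.5 s clip sampled at 22,050 Hz, as in Table 1, yields on the order of 200 frames, from which one 16-dimensional feature vector is computed per clip.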


Figure 2. Speech ES.

Figure 3. Music ES.

Table 2
Features and Classifiers Used in Some Speech/Music and General Audio Classifiers

Article | Features | Classifiers | Avg. Accuracy (%)
[15] | STE; ZCR | Gaussian | 91
[35] | MFCC | MAP | 94
[21] | 4 Hz; STE; Roll-Off (+Δ Roll-Off); SC (+Δ SC); Pulse; Flux (+Δ Flux); ZCR (+Δ ZCR); CRRM (+Δ CRRM) | MAP / GMM / k-NN / k-d trees | 94 / 94.4 / 94.7 / 94.3
[36] | MFCC (+Δ MFCC); Amplitude (+Δ Amplitude); Pitch (+Δ Pitch); ZCR (+Δ ZCR) | GMM | 98.8
[37] | Entropy; Avg. prob. dynamism; Background energy ratio; Phone distribution match | HMM | 98.6
[38] | STE; Harmonic frequencies; ZCR | Threshold | 80
[39] | STE; Fundamental frequency; ZCR | HMM | 90
[40] | Temporal energy density; BW; Sub-band centroid | Gaussian | 90
[18] | LSF (+Δ LSF); HOC; LP-ZCR | QGC / k-NN | 95.9
[41] | Silence ratio; ZCR | Threshold | 87.7 / 75.5
[42] | MFCC (+Δ MFCC) | SVM | 81.8
[22] | Discrete wavelet transform | Gaussian | 98.7
[44] | 7k+ high-level features (depends on type of music) | MULT | 82.1–99.5
Proposed | ES | Gaussian | 99.2

Note: SVM stands for support vector machine, QGC for quadratic Gaussian classifier, CRRM for cepstrum re-synthesis residual magnitude, GMM for Gaussian mixture models, and MULT denotes a variety of classifiers (mainly SVM, k-NN, decision tree, neural net, and Bayesian classifiers).


A block diagram showing the processing flow is shown in Fig. 1. Figs. 2 and 3 show the averaged ES of 500 speech samples and 500 music samples, respectively. From Figs. 2 and 3, we observe that the averaged ES can be considered a useful tool for discriminating music from speech signals: the distribution of energy in the time–frequency domain for speech differs clearly from that of music. A comparison with a set of recently proposed speech/music discrimination and general audio classification schemes is summarized in Table 2. The related work is cited in the first column, columns two and three show the features and classifiers used in each work, and column four lists the average classification accuracy. Schuller et al. have also carried out extensive work on music/speech discrimination [43–47]. Their method is based on low-level descriptors and a set of more than 7,000 high-level features, they have tested different types of music signals from different databases, and they used a genetic algorithm-based search through the possible feature space. The high accuracy of the proposed ES method is expected because the ES is able to reflect the non-stationarity of the audio signal. Although time–frequency distributions are good at discriminating music from speech signals, their main disadvantage is computational cost; therefore, they may be better suited to off-line analysis.
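Table 2 lists a Gaussian classifier against the proposed ES features. A minimal sketch of such a classifier is given below: one full-covariance Gaussian is fitted per class to the training feature vectors, and a clip is labelled by the larger log-likelihood. The use of SciPy's multivariate normal and equal class priors are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

class GaussianSpeechMusicClassifier:
    """One full-covariance Gaussian per class over ES feature vectors."""

    def fit(self, feats_speech, feats_music):
        # feats_*: arrays of shape (n_clips, n_features)
        self.models = {
            "speech": multivariate_normal(feats_speech.mean(axis=0),
                                          np.cov(feats_speech, rowvar=False),
                                          allow_singular=True),
            "music": multivariate_normal(feats_music.mean(axis=0),
                                         np.cov(feats_music, rowvar=False),
                                         allow_singular=True),
        }
        return self

    def predict(self, feat):
        # Label by the larger log-likelihood (equal priors assumed).
        return max(self.models, key=lambda c: self.models[c].logpdf(feat))
```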

5. Conclusions

The purpose of audio signal discrimination is to build two different libraries, a speech library and a music library, from a stream of sounds. In this paper, a joint-distribution algorithm, namely the ES, was proposed and discussed. In general, classification algorithms can be divided into three categories: time-domain, frequency-domain, and time–frequency domain approaches. The first two approaches have been discussed thoroughly in the literature; however, the joint-distribution approaches had not yet been tested for this type of discrimination. As audio signals are considered non-stationary, in general, the time- or frequency-domain approach alone may fail in reflecting the behaviour of such signals correctly. Therefore, in this paper, the ES was proposed and discussed. To show the effectiveness of the proposed algorithm, simulations and comparisons with classical methods were carried out. In our future work, we aim to investigate more time-dependent features, as well as the music/speech separation process.

References

[1] G. Tzanetakis & P. Cook, Multifeature audio segmentation for browsing and annotation, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1, New Paltz, NY, 1999, 91–94.
[2] K. Martin, Towards automatic sound source recognition: identifying musical instruments, Proc. NATO Computational Hearing Advanced Study Institute, Ciocco (Tuscany), Italy, 1, 1998, 54–61.
[3] P. Herrera-Boyer, Towards instrument segmentation for music content description: a critical review of instrument classification techniques, ISMIR, 1, 2000, 115–119.
[4] R.O. Gjerdingen, Using connectionist models to explore complex musical patterns, Computer Music Journal, 13(3), 1989, 67–75.
[5] D. Hörnel & T. Ragg, Learning musical structure and style by recognition, prediction and evolution, in D. Rossiter (Ed.), Int. Computer Music Conf., 1996, 59–62.
[6] M. Leman & P. Van Renterghem, Transputer implementation of the Kohonen feature map for a music recognition task, Proc. Second International Transputer Conference: Transputers for Industrial Applications II, Antwerp, BIRA, 1, 1989, 213–216.
[7] C. Stevens & C. Latimer, A comparison of connectionist models of music recognition and human performance, Minds and Machines, 2(4), 1992, 379–400.
[8] M. Kahrs & K. Brandenburg, Applications of digital signal processing to audio and acoustics (London: Kluwer Academic Publishers, 1998).
[9] P. Toiviainen, Modelling the target-note technique of Bebop-style jazz improvisation: an artificial neural network approach, Music Perception, 12(4), 1995, 399–413.
[10] C. Stevens & J. Wiles, Representations of tonal music: a case study in the development of temporal relationships, Proc. Connectionist Models Summer School, Hillsdale, NJ, Erlbaum, 1993, 228–235.
[11] N. Griffith & P.M. Todd, Musical networks (Bradford Books/The MIT Press, 1999).
[12] R. Monelle, Linguistics and semiotics in music (Harwood Academic Publishers, 1992).
[13] B. Feiten & T. Ungvary, Organizing sounds with neural nets, Int. Computer Music Conference, San Francisco, 1991, 441–443.
[14] J.T. Foote, Content-based retrieval of music and audio, SPIE'97, 1997, 138–147.
[15] J. Saunders, Real-time discrimination of broadcast speech/music, IEEE ICASSP'96, 1996, 993–996.
[16] L. Rabiner & B.H. Juang, Fundamentals of speech recognition (NJ: Prentice-Hall, 1993).
[17] K. El-Maleh, A. Samoulian, & P. Kabal, Frame-level noise classification in mobile environments, Proc. IEEE ICASSP'99, 1999, 237–240.
[18] K. El-Maleh, M. Klein, G. Petrucci, & P. Kabal, Speech/music discrimination for multimedia applications, ICASSP, Istanbul, 2000, 2445–2448.
[19] B. Matityaho & M. Furst, Neural network based model for classification of music type, IEEE Catalogue, 95, 1995, 640–645.
[20] J.D. Hoyt & H. Wechsler, Detection of human speech using hybrid recognition models, IEEE, 2, 1994, 330–333.
[21] E. Scheirer & M. Slaney, Construction and evaluation of a robust multifeature speech/music discriminator, ICASSP'97, 2, Munich, Germany, 1997, 1021–1024.
[22] G. Tzanetakis & P. Cook, Audio analysis using the discrete wavelet transform (Report, Computer Science Department, Princeton University, 2001).
[23] D. Roy & C. Malamud, Speaker identification based text to audio alignment for an audio retrieval system, ICASSP'97, 2, Munich, Germany, 1997, 1099–1102.
[24] B. Kedem, Spectral analysis and discrimination by zero-crossings, Proceedings of the IEEE, 74(11), 1986, 1477–1492.
[25] P. Laine, Generating musical patterns using mutually inhibited artificial neurons, Proc. Int. Computer Music Conference, San Francisco, 1997, 422–425.
[26] A.I. Al-Shoshan, A. Al-Atiyah, & K. Al-Mashouq, A three-level speech, music, and mixture classifier, Journal of King Saud University (Engineering Sciences), 2(16), 2003, 23–30.
[27] H. Beigi, S. Maes, J. Sorensen, & U. Chaudhari, A hierarchical approach to large-scale speaker recognition, IEEE ICASSP'99, Phoenix, Arizona, 1999, 105–109.
[28] H. Jin, F. Kubala, & R. Schwartz, Automatic speaker clustering, Proc. of the Speech Recognition Workshop, 1997, 108–111.
[29] R. Meddis & M. Hewitt, Modelling the identification of concurrent vowels with different fundamental frequency, Journal of the Acoustical Society of America, 91, 1992, 233–245.
[30] B.P. Bogert, M.J.R. Healy, & J.W. Tukey, The frequency analysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking (New York: John Wiley and Sons, 1963), 209–243.


[31] A. Eronen & A. Klapuri, Musical instrument recognition using cepstral coefficients and temporal features, Proc. ICASSP, 2000, 513–516.
[32] N.J.L. Griffith, Modelling the influence of pitch duration on the induction of tonality from pitch-use, Proc. Int. Computer Music Conf., San Francisco, 1994, 35–37.
[33] M.B. Priestley, Non-linear and non-stationary time series analysis (NY: Academic Press, 1988).
[34] A.I. Al-Shoshan, LTV system identification using the time-varying autocorrelation function and application to audio signal discrimination, ICSP'02, China, 2002, 419–422.
[35] M. Spina & V. Zue, Automatic transcription of general audio data: preliminary analysis, Proc. Fourth Int. Conf. on Spoken Language (ICSLP 1996), 2, October 1996, 594–597.
[36] M.J. Carey, E.S. Parris, & H. Lloyd-Thomas, A comparison of features for speech, music discrimination, Proc. 1999 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 1, 1999, 149–153.
[37] G. Williams & D. Ellis, Speech/music discrimination based on posterior probability features, Eurospeech-99, 2, Budapest, Hungary, September 1999, 687–690.
[38] S. Srinivasan, D. Ponceleon, & D. Petkovic, Towards robust features for classifying audio in the CueVideo system, Proc. 7th ACM Int. Conf. on Multimedia, Orlando, FL, USA, 1999, 393–400.
[39] T. Zhang & C.-C.J. Kuo, Hierarchical classification of audio data for archiving and retrieving, IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'99), 6, March 1999, 3001–3004.
[40] Y. Nakajima, Y. Lu, M. Sugano, A. Yoneyama, H. Yamagihara, & A. Kurematsu, A fast audio classification from MPEG coded data, IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP'99), 6, March 1999, 3005–3008.
[41] L. Guojun & T. Hankinson, An investigation of automatic audio classification and segmentation, Proc. 5th Int. Conf. on Signal Processing (WCCC-ICSP 2000), 2, 2000, 1026–1032.
[42] P.J. Moreno & R. Rifkin, Using the Fisher kernel method for web audio classification, Proc. 2000 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 4, June 2000, 2417–2420.
[43] B. Schuller, F. Eyben, & G. Rigoll, Static and dynamic modelling for the recognition of non-verbal vocalisations in conversational speech, PIT 2008, 99–110.

[44] B. Schuller, F. Wallhoff, D. Arsic, & G. Rigoll, Musical signal type discrimination based on large open feature sets, ICME 2006, 1089–1092.
[45] A. Hyvärinen & E. Oja, Independent component analysis: algorithms and applications, Neural Networks, 13(4–5), 2000, 411–430.
[46] G. Mu & D.L. Wang, An extended model for speech segregation, Proc. IEEE, 2001, 1089–1094.
[47] V. Peltonen, Computational auditory scene recognition, Master of Science Thesis, Tampere University of Technology, Department of Information Technology, February 2001.

Biography Abdullah I. Al-Shoshan received his M.S. and Ph.D. degrees in Electrical Engineering from the University of Pittsburgh, Pittsburgh, PA, USA, in 1991 and 1995, respectively. In 1996, he joined the faculty of Computer Engineering Department at King Saud University. Currently, he is a professor and Dean of the College of Computer at Qassim University, Saudi Arabia. His research interests are in artificial intelligence, statistical signal processing, time–frequency analysis, non-stationary signal modelling, system identification, speech and image processing, wavelets analysis, and neural networks. He is a member of IEEE, ISCA, IAMCM, and SCS.
