Efficient audio-driven multimedia indexing through similarity-based speech / music discrimination Nikolaos Tsipas · Lazaros Vrysis · Charalampos Dimoulas · George Papanikolaou

Received: date / Accepted: date

Abstract In this paper, an audio-driven algorithm for the detection of speech and music events in multimedia content is introduced. The proposed approach is based on the hypothesis that short-time frame-level discrimination performance can be enhanced by identifying transition points between longer, semantically homogeneous segments of audio. In this context, a two-step segmentation approach is employed in order to initially identify transition points between the homogeneous regions and subsequently classify the derived segments using a supervised binary classifier. The transition point detection mechanism is based on the analysis and composition of multiple self-similarity matrices, generated using different audio feature sets. The implemented technique aims at discriminating events focusing on transition point detection with high temporal resolution, a target that is also reflected in the adopted assessment methodology. Thereafter, multimedia indexing can be efficiently deployed (for both audio and video sequences), incorporating the processes of high resolution temporal segmentation and semantic annotation extraction. The system is evaluated against three publicly available datasets and experimental results are presented in comparison with existing implementations. The proposed algorithm is provided as an open source software package in order to support reproducible research and encourage collaboration in the field.

Keywords Speech/Music Discrimination · Self-Similarity Matrix Analysis · Transition Point Detection · Supervised Learning

N. Tsipas
Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece
E-mail: [email protected]
L. Vrysis
E-mail: [email protected]
C. Dimoulas
E-mail: [email protected]
G. Papanikolaou
E-mail: [email protected]


1 Introduction

During recent years, an exponential growth of multimedia content produced and made available online by users has been observed. As a consequence, the availability of robust semantic multimedia analysis systems has become a necessity in order to maintain the ability of users to explore, consume and analyse content. Audio event detection and classification can be considered among the most useful analysis mechanisms for multimedia content; speech / music discrimination falls within the wider range of audio-based semantic analysis applications, providing new ways of multimedia content description and management. Example use cases include semantically-enhanced navigation allowing users to skip content they are not interested in, efficient compression algorithms using different encoding strategies depending on content type, purification of multimedia content that will be used as training data, and others. In most cases, the task of speech/music discrimination of multimedia content is approached from an audio content analysis perspective, as this provides significant advantages in terms of computational efficiency, reduced ambiguity and existing research maturity level. In addition, semantic annotation in terms of audio segmentation, together with the time-duration distribution analysis of the implicated audio segments, can be exploited towards characterisation and overall classification of generic multimedia content (i.e. video categorization / archiving in related video-on-demand web-tv services [39]). Towards this direction, an audio-driven, similarity-based speech/music discrimination system is proposed, leveraging the advantages of audio-driven analysis while preserving the attributes of an extensible platform that can be evolved into a multimodal analysis framework.

The rest of the paper is organised as follows. The state of research is presented in section 2, followed by an overview of the proposed approach outlined in section 3. As part of the proposed approach, the employed audio features and pre-processing steps are analysed in subsection 3.1, while the similarity-based content segmentation mechanism and the supervised binary classification model are discussed in subsections 3.2 and 3.3. Finally, experimental results are presented in section 4, evaluating the proposed algorithm's performance at multiple levels and in comparison with other publicly available speech/music discrimination algorithms.

2 State of Research

Multimedia segmentation and semantic annotation for the purposes of content documentation, description and management has been a very popular field of research during the last decades. In this direction, various audio pattern recognition methods have been introduced, exploiting many of the aforementioned advantages of audio-driven analysis. Speech / music detection and segmentation is both a very popular and a well-studied sub-topic, with a broad field of applications, since audio is encountered in most multimedia streams. During


the recent years, the task of speech / music discrimination has been studied through a variety of features and analysis techniques. In most relevant work, researchers introduce new audio features aiming at improving separability between the two classes, while, at the same time, performance improvements have been demonstrated by incorporating a wide range of classification techniques. Saunders [32] was one of the pioneers in the field, proposing a multivariate Gaussian-based speech / music discriminator using statistical features on zero-crossing rate and energy, while Scheirer and Slaney [33] explored a wider range of features in combination with different classifiers and identified subsets of best performing features. Carey et al. [8] focused on single-feature performance, identifying cepstral coefficients as the best performing one, followed by amplitude, pitch and zero-crossing rate. Similarly, Wang et al. [40] followed a single-feature approach by introducing the modified low energy ratio feature. El-Maleh et al. [10] proposed a combination of line spectral frequencies and zero-crossing based features for frame-level speech/music discrimination, while Pikrakis et al. [27] proposed a computationally efficient, region growing technique operating on a single feature, a variant of spectral entropy. Seyerlehner et al. [35] introduced a new feature named Continuous Frequency Activation, which relies on the detection of horizontal segments in audio spectrograms for the analysis of television productions. More recently, chroma-based music tonality features were introduced in [34], while Wavelet-based feature extraction techniques were demonstrated in [30]. Towards this direction of purpose-built feature sets, Khonglah and Prasanna [16] introduced speech-specific features representing the excitation source, vocal tract system and syllabic rate of speech.

With regard to classification algorithms and analysis techniques, Gaussian Mixture Model (GMM) usage was demonstrated by Shirazi and Ghaemmaghami [36] as part of their work focusing on Sinusoidal Model based features, while Lavner and Ruinskiy [19] introduced a decision-tree based approach. On the other hand, Dynamic Programming and Bayesian Networks were successfully employed by Pikrakis et al. [28], broadcast-audio segmentation using Artificial Neural Networks was demonstrated by Kotsakis et al. [18] and Support Vector Machines (SVM) based approaches were followed in [20, 31]. Finally, HMM usage for speech/music discrimination was exhibited in [7, 27] and Deep Learning application in the field was evaluated as part of [29].

Elaborating on previous work conducted as part of the speech / music classification and detection task of MIREX 2015 [37], this paper focuses on segmentation performance improvements by more accurately detecting transition points between successive audio events. Instead of introducing new features, as many related methods do, a collection of standard, general-purpose audio features that have been previously employed in similar systems is used. The enhanced discrimination performance exhibited by accurately detecting transition points between homogeneous regions is achieved by introducing an extended, novelty-detection, self-similarity matrix (SSM) analysis technique, based on the approach originally proposed by Foote [13]. SSM analysis methods


have been broadly and successfully employed in the field of Music Information Retrieval for segmentation and structure analysis tasks during the recent years [15, 38]. Although the presented system relies on the analysis of audio data for the speech / music discrimination task, it is intended to be used for the analysis of multimedia content, following the audio-driven analysis paradigm demonstrated in [9, 11, 14, 22, 42].

3 Proposed Approach

The proposed approach is based on the hypothesis that short-time frame-level discrimination performance can be enhanced by identifying transition points between longer, semantically homogeneous (i.e. speech or music) segments of audio. In this context, a two-step segmentation approach is employed in order to initially identify transition points between homogeneous regions and, subsequently, classify the derived segments using a supervised binary classifier. An overview of the proposed system is presented in Figure 1.

Fig. 1: Algorithm overview diagram. Segmentation is achieved by merging transition points detected on self-similarity matrices generated using different audio feature-sets. (m: music, s: speech)

In the following subsections, the audio features and algorithms employed in both stages of the proposed algorithm are presented. Focusing on the first stage, the employed self-similarity matrices and novelty-detection algorithm are discussed in order to establish the building blocks behind the introduced multiple SSM analysis approach. Afterwards, focusing on the second stage of the algorithm, the method employed to annotate the derived segments provided as input from stage one is presented.


3.1 Audio Features and Preprocessing

Three feature sets consisting of different audio features are extracted from the audio signal using Yaafe [21] and a fourth one is derived by merging feature sets A, B and C. The details of the extracted features comprising each group are outlined in Table 1. The exact feature extraction process is outlined in the corresponding feature extraction plan, a text file in which each line represents an extracted feature according to Yaafe's feature definition syntax. The employed Yaafe feature extraction plan is available in the project code repository 1. The selected audio features cover a wide range of temporal, spectral and cepstral aspects of the signal, and similar features have been successfully employed in various audio classification tasks including voice activity detection [17, 24], speech/non-speech discrimination [25, 39] and speaker diarisation [23].
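For illustration only, a minimal feature-extraction sketch using Yaafe's Python bindings is given below. The plan strings follow Yaafe's "name: Feature param=value" definition syntax, but the plan shipped in the project repository remains authoritative; the exact feature names, parameters and the input file used here are assumptions made for the example.

    from yaafelib import AudioFileProcessor, Engine, FeaturePlan

    # Illustrative plan; the authoritative plan is the one in the project repository.
    PLAN = [
        "zcr: ZCR blockSize=1024 stepSize=1024",
        "flux: SpectralFlux blockSize=1024 stepSize=1024",
        "rolloff: SpectralRolloff blockSize=1024 stepSize=1024",
        "energy: Energy blockSize=1024 stepSize=1024",
        "mfcc: MFCC blockSize=1024 stepSize=1024",
        "flatness: SpectralFlatnessPerBand blockSize=1024 stepSize=1024",
    ]

    fp = FeaturePlan(sample_rate=22050)
    for line in PLAN:
        fp.addFeature(line)

    engine = Engine()
    engine.load(fp.getDataFlow())

    processor = AudioFileProcessor()
    processor.processFile(engine, "input.wav")  # hypothetical input file
    features = engine.readAllOutputs()          # dict: feature name -> (frames x coefficients) array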

Fig. 2: Feature aggregation. Each feature is described by two aggregated values, its mean and variance statistics over 16 non-aggregated frames.

Initially, each feature is extracted with a block size of 1024 samples (46 ms) and no overlapping. As part of this work, the default Hanning window configuration for Yaafe was utilised, without further evaluation of its impact on the proposed algorithm. Afterwards, the extracted features are aggregated over 16 frames with no overlapping in order to calculate their mean and variance values. This leads to an aggregated frame size of 16384 samples, which at a 22050 Hz sampling rate is equal to 743 ms. Since each aggregated feature is represented by its mean and variance statistics over the aggregation window, the total number of components comprising the aggregated feature vector is equal to two times the non-aggregated number of features. The aggregation procedure is illustrated in Figure 2. As a pre-processing step, all features are scaled to have zero mean and unit variance.
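As a sketch of the aggregation step (assuming each feature is available as a 2-D array of non-aggregated frames), the mean and variance statistics over non-overlapping 16-frame windows can be computed as follows; the standardisation mentioned above would then be applied to the resulting aggregated vectors.

    import numpy as np

    def aggregate(frames, window=16):
        """Aggregate frame-level features (n_frames x n_features) into mean/variance
        vectors over non-overlapping windows of `window` frames."""
        n = (frames.shape[0] // window) * window              # drop the trailing partial window
        blocks = frames[:n].reshape(-1, window, frames.shape[1])
        return np.hstack([blocks.mean(axis=1), blocks.var(axis=1)])   # 2 x n_features columns

    # Example: 13 MFCC coefficients over 160 frames -> 10 aggregated frames of 26 values each
    aggregated = aggregate(np.random.rand(160, 13))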

1 https://github.com/nicktgr15/similarity-based-speech-music-discrimination/blob/master/datasets/featureplans/featureplan


Table 1: Feature Sets. The aggregated features are represented by their mean and variance values, hence the aggregated vector length is twice the length of the non-aggregated vector.

Feature Set   Audio Features                             Non-Aggregated Vector Length   Aggregated Vector Length
A             zcr, flux, spectral rolloff, rms energy    4                              8
B             mfcc                                       13                             26
C             spectral flatness per band                 19                             38
D             A + B + C                                  36                             72

3.2 Similarity-based Content Segmentation

As discussed, the first stage of the proposed algorithm relies on the detection of transition points between homogeneous regions of content in order to identify speech and music segments. In the context of this work, a homogeneous region of content refers to either an audio segment containing speech or an audio segment containing music. The identification of speech and music segments is achieved by employing a well-established, checkerboard-kernel, novelty-detection technique [13] that is applied on a self-similarity matrix. A novel aspect of this work is the extension of the standard methodology in order to employ multiple self-similarity matrices generated from distinct feature sets. The motivation behind this decision is the fact that different feature sets produce different similarity matrices and consequently different sets of transition points.

3.2.1 The Self-Similarity Matrix Data Structure

The self-similarity matrix is a data structure capturing the pairwise similarity between n items of a time series, introduced by Foote [12] in the late 90s. The generated matrix S is described by equation 1, where v_j and v_k represent a pair of compared feature vectors. The chosen similarity metric s is the cosine distance, a measure of similarity between two non-zero vectors that estimates the cosine of the angle θ between them, as described by equation 2.

S(j, k) = s(v_j, v_k), \quad j, k \in (1, \dots, n) \qquad (1)

s(v_j, v_k) = \cos(\theta) = \frac{v_j \cdot v_k}{\lVert v_j \rVert \, \lVert v_k \rVert} \qquad (2)

In terms of complexity, generating the complete self-similarity matrix for n elements has a complexity of O(n^2), which makes the scaling of the algorithm quite challenging as n increases. However, in order to detect a transition point, only a small number of self-similarity distance pairs before and after it is required. The number of calculated self-similarity distance pairs needs to be big enough to accommodate the size of the checkerboard kernel. Specifically, if m pairwise distances are calculated before and after each element n, then the complexity is reduced to O(mn), which is significantly lower. Based on this fact, the optimised, reduced-size self-similarity matrix generation is employed in the proposed algorithm implementation. Example similarity matrices calculated with m = 60 are presented in Figure 5.
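A minimal sketch of this lag-limited construction is given below, assuming the aggregated feature vectors are stored as the rows of a matrix; only the m cosine-similarity pairs on either side of the main diagonal are computed. The released implementation remains authoritative.

    import numpy as np

    def lag_limited_ssm(features, m=60):
        """Cosine similarities within +/- m frames of the main diagonal.
        Returns an (n x 2m+1) band where entry [i, m + d] holds s(v_i, v_{i+d})."""
        v = features / np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
        n = v.shape[0]
        band = np.zeros((n, 2 * m + 1))
        for d in range(-m, m + 1):
            lo, hi = max(0, -d), min(n, n - d)
            band[lo:hi, m + d] = np.einsum("ij,ij->i", v[lo:hi], v[lo + d:hi + d])
        return band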


Fig. 3: Top: Transition points detected on the self-similarity matrix. Bottom: Checkerboard kernel correlation result. Red "x" markers indicate the detected transition points. Ground truth segments are indicated with blue (music) and green (speech).

3.2.2 Transition Point Discovery Through Novelty Detection

In order to detect transition points on a self-similarity matrix, a Gaussian checkerboard kernel is correlated with it across its main diagonal. The checkerboard kernel A (eq. 3) is a square matrix that can be clockwise partitioned into four blocks, having even blocks equal to -J_k and odd blocks equal to J_k. The block J_k (eq. 4) is a square matrix of size k, resulting in a checkerboard kernel A of size 2k.

A = \begin{pmatrix} -J_k & J_k \\ J_k & -J_k \end{pmatrix} \qquad (3)

J_k = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}, \quad k \in \mathbb{Z} \qquad (4)


Additionally, an exponential function (eq. 5) is applied on the elements of J_k in order to obtain a Gaussian kernel similar to the ones presented in Figure 4. The resulting J_k block after the application of g(x) is presented in equation 6. Parameters k and p were set equal to 10 and 0.1 respectively, through a trial and error procedure.

g(x) = e^{-p x^2}, \quad x \in (0, \dots, k), \; p \in \mathbb{R} \qquad (5)

J_k = \begin{pmatrix} g(0)a_0 & g(1)a_1 & \cdots & g(k)a_k \\ g(1)a_1 & g(1)a_1 & \cdots & g(k)a_k \\ \vdots & \vdots & \ddots & \vdots \\ g(k)a_k & g(k)a_k & \cdots & g(k)a_k \end{pmatrix}, \quad k \in \mathbb{Z}, \; a_k = 1 \qquad (6)
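For illustration, a minimal NumPy sketch of the kernel construction described by equations 3 to 6 follows; the element-wise taper uses g(max(i, j)), which reproduces the structure of equation 6.

    import numpy as np

    def gaussian_checkerboard_kernel(k=10, p=0.1):
        """Build the 2k x 2k Gaussian-tapered checkerboard kernel of equations 3-6."""
        idx = np.arange(k)
        taper = np.exp(-p * np.maximum.outer(idx, idx) ** 2)   # J_k with g(max(i, j)) applied
        return np.block([[-taper, taper],
                         [taper, -taper]])

    kernel = gaussian_checkerboard_kernel()   # 20x20 kernel with p = 0.1, as in Figure 4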

The transition points are derived from the time series data generated during the Gaussian kernel correlation step using a simple peak detection algorithm (Fig 3). In particular, the PeakUtils [5] Python package is employed, which identifies peaks on time series data by taking the first order difference and applying a normalised threshold. The normalised threshold was set equal to 0.20 through a trial and error approach.
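A corresponding sketch of the novelty computation and peak picking is given below: the kernel is slid along the main diagonal of a (full, square) self-similarity matrix and PeakUtils is applied to the resulting curve with the threshold quoted above. Variable names are illustrative and the released implementation remains authoritative.

    import numpy as np
    import peakutils

    def novelty_curve(ssm, kernel):
        """Correlate the checkerboard kernel along the main diagonal of a square SSM."""
        k = kernel.shape[0] // 2
        padded = np.pad(ssm, k, mode="constant")   # zero-pad so every frame receives a score
        return np.array([np.sum(padded[i:i + 2 * k, i:i + 2 * k] * kernel)
                         for i in range(ssm.shape[0])])

    # novelty = novelty_curve(ssm, kernel)
    # transition_frames = peakutils.indexes(novelty, thres=0.20)   # normalised threshold of 0.20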

Fig. 4: Gaussian Kernels (20x20) generated with k=10 and from left to right p=0.1, p=0.5, p=0.01.

3.2.3 Multiple Self-Similarity Matrix Analysis

Following the initial hypothesis, the segmentation performance of the standard novelty-detection approach can be enhanced by applying the same methodology not on a single SSM but on multiple ones. As indicated in Figure 5, different homogeneous regions can be captured from self-similarity matrices generated using different audio feature sets and subsequently different sets of transition points can be derived. The derived sets of transition points are then combined to form the final segmentation of the audio signal. Although the merging of transition points detected across dissimilar self-similarity matrices can increase over-segmentation, the performance gains of a multiple SSM approach are significant, as indicated by the experimental results presented in section 4. Additionally, the utilisation of multiple SSMs, introduced with this


work, makes the optimal selection of the size of the feature vector used to generate an individual matrix less relevant. This is a consequence of the fact that multiple feature vectors of different sizes are used to generate the different matrices.
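As a sketch, the merging step can be as simple as taking the union of the per-matrix transition points and collapsing detections that fall closer than a minimum gap; the exact merging rule used in the released implementation may differ.

    def merge_transition_points(point_sets, min_gap=2):
        """Union of transition points (aggregated-frame indices) from several SSMs,
        collapsing points closer than `min_gap` frames into a single transition."""
        merged = []
        for p in sorted(set().union(*point_sets)):
            if not merged or p - merged[-1] >= min_gap:
                merged.append(p)
        return merged

    # merge_transition_points([{12, 40}, {13, 77}]) -> [12, 40, 77]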

Fig. 5: Example transition points detected on self-similarity matrices generated with different feature sets using audio data from the GTZAN dataset. Clockwise from top left, feature sets A, B, D and C. Notice the different homogeneous regions presented on each one of the matrices and the different transition points derived from them.

3.3 Frame Level Classification

The second stage of the algorithm is built around a supervised, binary classifier that helps to transform transition points into annotated segments of audio, as illustrated in Figure 6. The trained classifier receives as input an aggregated frame/feature-vector and is able to classify it as speech or music. In the proposed system, a collection of frames comprising a segment defined by two successive transition points is given as input to the classifier. Those frames are classified one after the other and, as a result, a collection of corresponding classifications is produced. Using a simple, percentage-based majority voting approach, the dominant class is determined and assigned to the evaluated segment. Finally, as a post-processing step, adjacent segments of the same class are merged to form larger homogeneous segments. Three algorithms, Random Forests, Logistic Regression and SVM, were tested in order to evaluate their classification performance. Implementations


of the algorithms available in scikit-learn [26] were used in the experiments and the final implementation of the system.
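A sketch of this second stage is given below, assuming a fitted scikit-learn classifier and transition points expressed as aggregated-frame indices; all variable names are hypothetical.

    from sklearn.svm import SVC

    def annotate_segments(frames, transition_points, clf):
        """Classify every aggregated frame in each segment and assign the majority
        class ('s' = speech, 'm' = music) to the whole segment."""
        bounds = [0] + sorted(transition_points) + [len(frames)]
        labels = []
        for start, end in zip(bounds[:-1], bounds[1:]):
            if end > start:
                votes = list(clf.predict(frames[start:end]))
                labels.append((start, end, max(set(votes), key=votes.count)))
        return labels

    # clf = SVC(C=1.0).fit(train_vectors, train_labels)   # hypothetical training data
    # segments = annotate_segments(aggregated_frames, merged_points, clf)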


Fig. 6: Top to bottom, going from detected transition points to annotated segments using a binary classifier, majority voting and post-processing.

4 Experimental Results

As already discussed, the proposed system consists of two distinct analysis stages, the first one undertaking the transition point detection task and the second one the annotation of the derived segments, using majority voting to set the dominant class. Various types of evaluation were conducted in order to assess the performance of the involved subsystems. The performance of the transition point detection subsystem is assessed in terms of segmentation accuracy by comparing different variants of the system in subsection 4.2. Similarly, in subsection 4.3 the performance of different machine learning algorithms, used for frame level classification, is evaluated in order to select


the best performing model. Finally, the performance evaluation of the system as a whole is accomplished using two types of evaluation. The frame-level evaluation focuses on the end-to-end performance of the system at the frame level, while the event-level evaluation assesses the system's performance focusing on segment homogeneity and onset/offset point accuracy.

4.1 Training and Evaluation Datasets

A crucial aspect for the objective and thorough evaluation of an algorithm is the choice of appropriate training and evaluation data. For this work, the attributes required to be satisfied by the data were content diversity, adequate data volume and public availability. In order to satisfy the above requirements, the fusion of existing, publicly available datasets was considered. In this context, GTZAN [2], LabROSA [3, 33] and data from the Mirex 2015 muspeak sample dataset [4] were selected.

Table 2: Evaluation Datasets

Dataset      Duration (s)   Speech/Music (%)   Speech/Music (number of segments)   Speech/Music avg segment length (s)
GTZAN        3840           50 / 50            30 / 31                             61.93 / 64.0
LabROSA      2715           44.1 / 55.8        49 / 49                             24.48 / 30.91
Mirex 2015   3164           16.6 / 83.3        13 / 12                             40.46 / 219.83
Total        9719           36.9 / 63.1        92 / 92                             42.29 / 104.91

GTZAN and LabROSA come in the form of separate audio clips for each one of the two classes (i.e. speech, music), while the Mirex 2015 dataset comes in the form of a continuous audio file with corresponding text-based annotations. Since the goal of this work was to assess the ability of the algorithms to discriminate speech / music segments as events by analysing a continuous stream of audio, GTZAN and LabROSA were also transformed into that form by merging the smaller audio clips and generating text-based annotations. The scripts used for the transformation of the original data are available on GitHub 2 and the details of the derived datasets are presented in Table 2. Since the proposed algorithm is intended to be used as a speech/music discriminator that operates after a voice-activity detector (or some other filtering mechanism), no silent audio parts of considerable duration exist in the datasets.

2 https://github.com/nicktgr15/similarity-based-speech-music-discrimination/tree/master/datasets


4.2 Segmentation Performance Evaluation

The transition point detection algorithm's performance has been evaluated against all three datasets. Two variants of the algorithm were assessed, the first one using a single similarity matrix generated using feature-set D and the second one using multiple similarity matrices generated with feature-sets A, B, C and D. Furthermore, the transition point detection accuracy was evaluated at two different tolerance levels, ±0.5 and ±1.0 seconds. The results of the evaluation are presented in Table 3 using the precision, recall and f1-score measures. To take into account label imbalances in the dataset, the above metrics were calculated for each class and their average was derived, weighted by support (the number of true instances for each label). The results show that recall (without sacrificing precision) is significantly higher when multiple self-similarity matrices are employed. We focus on the recall metric as it effectively sets an upper limit on the maximum performance of the overall speech/music discrimination algorithm. In other words, if the transition points are not detected correctly at this stage, then the binary classifier will not be able to improve performance even if it were an ideal, 100% accurate classifier.
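A sketch of the tolerance-based matching underlying this evaluation is given below: a detected transition point counts as correct when it lies within the tolerance of a still-unmatched ground-truth point. The precise matching and the per-class, support-weighted averaging of the released evaluation scripts may differ.

    def transition_point_scores(detected, ground_truth, tol=0.5):
        """Greedy one-to-one matching of detected vs. ground-truth transition points (seconds)."""
        unmatched = sorted(ground_truth)
        tp = 0
        for d in sorted(detected):
            hit = next((g for g in unmatched if abs(g - d) <= tol), None)
            if hit is not None:
                unmatched.remove(hit)
                tp += 1
        precision = tp / len(detected) if detected else 0.0
        recall = tp / len(ground_truth) if ground_truth else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1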

Table 3: Segmentation Performance Results

                         Single SSM              Multiple SSMs
Dataset      toler.(s)   P       R       F1      P       R       F1       Impr. (%)
Mirex 2015   ±0.5        0.051   0.625   0.094   0.105   0.833   0.187    33.333
             ±1.0        0.068   0.833   0.126   0.121   0.958   0.2157   15.000
LabROSA      ±0.5        0.183   0.567   0.276   0.196   0.773   0.313    36.364
             ±1.0        0.316   0.979   0.477   0.251   0.990   0.401    1.053
GTZAN        ±0.5        0.111   0.617   0.188   0.142   0.783   0.241    27.027
             ±1.0        0.165   0.917   0.280   0.179   0.983   0.303    7.273

4.3 Evaluation and Selection of Supervised Classification Algorithm

As discussed, three algorithms, Random Forests, Logistic Regression and SVM, were tested against the three available datasets. Following a cross-validation approach, when one of the datasets is tested the other two are used as training data. This approach allows the classification performance of the algorithms to be objectively evaluated on unseen data. Feature set D is used with the binary classifier in two variants, one employing the raw feature vectors and one employing reduced-length feature vectors generated using Principal Component


Analysis (PCA) as a pre-processing step. When PCA is employed, feature vectors have a reduced length of 20 components as opposed to the original size of 72 components.
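The comparison can be sketched with scikit-learn as below, with the PCA(20) variant wired in as an optional pipeline step; the dataset arrays are random placeholders standing in for the aggregated 72-component feature vectors, and the released training scripts remain authoritative.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(200, 72)), rng.choice(["s", "m"], 200)   # placeholder data
    X_test, y_test = rng.normal(size=(50, 72)), rng.choice(["s", "m"], 50)

    candidates = {"rf": RandomForestClassifier(), "logreg": LogisticRegression(), "svm": SVC(C=1.0)}

    for name, clf in candidates.items():
        for use_pca in (False, True):
            steps = [StandardScaler()] + ([PCA(n_components=20)] if use_pca else []) + [clf]
            model = make_pipeline(*steps).fit(X_train, y_train)
            score = f1_score(y_test, model.predict(X_test), pos_label="s")
            print(name, "pca" if use_pca else "raw", round(score, 3))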

Table 4: Classifier Performance using feature-set D with and without PCA

             Random Forests      Logistic Regression   SVM
Dataset      F1      F1(PCA)     F1      F1(PCA)       F1      F1(PCA)
Mirex 2015   0.988   0.989       0.991   0.994         0.997   0.992
LabROSA      0.958   0.962       0.957   0.961         0.976   0.973
GTZAN        0.905   0.890       0.916   0.930         0.959   0.938
Average      0.950   0.947       0.955   0.962         0.977   0.967

The results, presented in Table 4, indicate that SVM without the dimensionality-reduction step achieves the highest f1-score and it is thus selected as the classification algorithm for the final implementation and the rest of the experiments. A regularisation parameter C = 1.0 was selected through a trial and error approach for the SVM classifier.


Fig. 7: Event-level evaluation relies on the detection of events/segments with onset and offset points within the evaluation threshold, while frame-level evaluation relies on the correct classification of the 10 ms evaluation frame. In the above example, for the event-level evaluation, only two out of the four detected transition points are within the evaluation threshold and this results in three (out of five) incorrectly classified segments. On the other hand, for the frame-level evaluation, there are only 4 incorrectly classified frames out of a total of around 80 frames. Based on the above, the event-level evaluation will return a much lower f1-score, making clear that it is a much more demanding task in comparison to the frame-level evaluation. Red colour indicates incorrectly classified instances (i.e. events/segments and frames).


4.4 Frame-level Evaluation

The frame-level evaluation is performed on a 10 ms frame-length basis. This value was selected in order to provide a considerably short window length for serving high temporal resolution needs. Such a requirement is encountered in the frame-level evaluation of the Music/Speech Classification and Detection task [4], where the same 10 ms value is employed, motivating the proposed configuration as well. In practical terms, this means that the ground truth segments and the detected segments are split into 10 ms frames (Fig 7) and precision, recall and f1-score are calculated over those. Three variants of the algorithm, illustrated in Figure 8, are evaluated and compared. The first variant, indicated as "raw", relies on the output of the binary classifier and the formation of segments by combining adjacent frames of the same class and employing post-processing to remove short gaps in segments. The second and third variants use the segments derived during the transition point detection step, with the difference that in the first case a single SSM is used as opposed to multiple ones in the second. The results of the frame-level evaluation are presented in Table 5. As indicated in the results, the multiple SSMs approach achieves the highest f1-score, followed by the single SSM and raw variants.
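A sketch of this frame-level scoring follows: both annotation sets are rasterised onto a 10 ms grid and the weighted f1-score is computed over the resulting frame labels. Segment lists are assumed to be (start, end, label) tuples in seconds that jointly cover the whole file (no silent gaps, as in the datasets above).

    import numpy as np
    from sklearn.metrics import f1_score

    def rasterise(segments, duration, hop=0.01):
        """Turn (start, end, label) segments into one label per 10 ms frame."""
        grid = np.full(int(round(duration / hop)), "", dtype=object)
        for start, end, label in segments:
            grid[int(round(start / hop)):int(round(end / hop))] = label
        return grid

    def frame_level_f1(detected, reference, duration):
        return f1_score(rasterise(reference, duration), rasterise(detected, duration),
                        average="weighted")

    # frame_level_f1([(0, 9.5, "m"), (9.5, 20, "s")], [(0, 10, "m"), (10, 20, "s")], 20)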

Table 5: Frame-level Evaluation Results

             Raw                     Single SSM              Multiple SSMs
Dataset      P       R       F1      P       R       F1      P       R       F1
Mirex        0.995   0.995   0.995   0.997   0.996   0.996   0.998   0.997   0.998
LabROSA      0.970   0.970   0.970   0.976   0.975   0.976   0.976   0.976   0.976
GTZAN        0.960   0.960   0.960   0.985   0.985   0.985   0.998   0.997   0.998
Average      0.976   0.975   0.975   0.986   0.986   0.986   0.990   0.990   0.990

4.5 Event-level Evaluation

Event-level evaluation is accomplished on an onset-offset basis with a tolerance of ±0.5 and ±1.0 seconds (Fig 7). This means that both the onset and the offset points need to be within the tolerance limits when precision, recall and f1-score are calculated. The results are presented in Table 6 for the same three variants of the algorithm discussed in the previous section. As indicated, the variant in which multiple SSMs are used for the transition point detection step outperforms, on average, the other two variants at both tolerance levels. Specifically, for the sub-second (±0.5 s) tolerance level, an f1-score that is about 35% higher compared to the Raw and Single SSM variants is exhibited.


Fig. 8: Three variants of the proposed algorithm are evaluated in order to justify the advantages of single and multiple SSM-based segmentation

Table 6: Event-level Evaluation Results

                      Raw                     Single SSM              Multiple SSMs
Dataset     tol.(s)   P       R       F1      P       R       F1      P       R       F1
Mirex       ±0.5      0.478   0.458   0.468   0.545   0.500   0.522   0.792   0.792   0.792
            ±1.0      0.957   0.917   0.936   0.864   0.792   0.826   1.000   0.917   0.957
LabROSA     ±0.5      0.330   0.351   0.340   0.222   0.227   0.224   0.317   0.330   0.323
            ±1.0      0.845   0.897   0.870   0.899   0.918   0.908   0.871   0.907   0.889
GTZAN       ±0.5      0.267   0.333   0.296   0.367   0.367   0.367   0.467   0.467   0.467
            ±1.0      0.613   0.767   0.681   0.850   0.850   0.850   0.950   0.950   0.950
Average     ±0.5      0.358   0.381   0.368   0.378   0.364   0.371   0.525   0.529   0.527
            ±1.0      0.805   0.860   0.829   0.871   0.853   0.861   0.940   0.925   0.932

By reviewing the results presented in this and the previous section, it becomes clear that the event-level evaluation is a much more demanding task in comparison to the frame-level evaluation. In particular, when sub-second accuracy is required, high frame-level performance metrics are translated into much lower event-level figures. For example, the average event-level f1-score when multiple SSMs are used is 61% lower compared to the corresponding frame-level figure.
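A sketch of the event-level matching is given below: a detected segment counts as correct only if a still-unmatched ground-truth segment of the same class has both its onset and its offset within the tolerance. The exact matching rules of the evaluation scripts in the repository remain authoritative.

    def event_level_scores(detected, reference, tol=0.5):
        """Segments are (onset, offset, label) tuples in seconds; greedy one-to-one matching."""
        unmatched = list(reference)
        tp = 0
        for onset, offset, label in detected:
            hit = next((r for r in unmatched
                        if r[2] == label and abs(r[0] - onset) <= tol and abs(r[1] - offset) <= tol),
                       None)
            if hit is not None:
                unmatched.remove(hit)
                tp += 1
        precision = tp / len(detected) if detected else 0.0
        recall = tp / len(reference) if reference else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1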

4.6 Comparison with the Current State of the Art

The proposed algorithm's performance was also evaluated against two existing algorithms with publicly available implementations that can accomplish the task of speech/music discrimination.


The first algorithm [41] is a Yaafe plugin [1] built around the Continuous Frequency Activation (CFA) feature. CFA is based on the hypothesis that music tends to have more stationary parts than speech and that, consequently, a higher number of horizontal bars can be detected in spectrograms generated from music data [35]. The second algorithm is a speech / music discriminator that combines a Variable Duration Hidden Markov Model with a Bayesian Network, with a computationally efficient region growing technique employed as a pre-processing step [27]. The algorithm is distributed as a Matlab executable package [6]. The evaluation metrics are the same as the ones used in the previous section, including frame and event level assessment. For the event-level evaluation, a ±1.5 second tolerance level was added to the existing ±0.5 s and ±1.0 s ones.

Table 7: Comparison results

                                  Wieser et al   Pikrakis et al   Proposed
Dataset      Evaluation Type      F1             F1               F1          Impr. (%)
Mirex 2015   Frame-level          0.976          0.973            0.998       2.5
             Event-lev. ±0.5s     0.073          0.192            0.792       312.3
             Event-lev. ±1.0s     0.17           0.269            0.957       255.6
             Event-lev. ±1.5s     0.414          0.538            0.957       77.8
LabROSA      Frame-level          0.805          0.953            0.976       2.4
             Event-lev. ±0.5s     0.013          0.411            0.323       -21.4
             Event-lev. ±1.0s     0.079          0.666            0.889       33.5
             Event-lev. ±1.5s     0.125          0.784            0.909       16.0
GTZAN        Frame-level          0.673          0.876            0.993       13.4
             Event-lev. ±0.5s     0.008          0.104            0.467       348.7
             Event-lev. ±1.0s     0.016          0.22             0.950       331.8
             Event-lev. ±1.5s     0.033          0.29             0.983       239.1

The results of the evaluation are presented in Table 7. Frame-level evaluation indicates a performance improvement ranging from 2.4% to 13.4% (6.1% on average) in comparison to the best performing algorithm of the two alternatives. The performance improvement is significantly higher during the event-level evaluation, where the proposed algorithm's figures are up to three times higher in comparison to the best performing alternative. It is noteworthy that the performance improvement of the proposed algorithm, with the exception of the results generated for LabROSA ±0.5 s, increases monotonically as the event-based evaluation tolerance level decreases, indicating increased sub-second accuracy.


4.7 Results Review and Discussion

In the previous subsections, the proposed system went through different types of evaluation in order to assess the performance of the algorithm components as single units and the algorithm as an end-to-end pipeline. Firstly, the performance of the single and multiple SSM-based analysis approaches as transition point detection mechanisms was evaluated against the available datasets. Results indicated that the proposed, multiple SSM-based approach yields up to a 36% recall improvement in comparison to the standard single SSM variant. Secondly, in order to build the binary classification model, three machine learning algorithms were evaluated using cross-validation. SVM performance results indicated 3% higher f1-scores on average in comparison to the Random Forests and Logistic Regression alternatives. Furthermore, the potential benefits of PCA as a dimensionality reduction mechanism were investigated; however, no performance gains were observed. Considering that the classification performance was already high enough (with the selected feature set), PCA was indicatively tested as a means of reducing computational demands during training. Hence, a single dimensionality setting (20 components) was evaluated for PCA, while additional experiments (with a higher or lower number of PCA components) were not conducted.

The first two evaluation steps helped to make decisions regarding the two basic building blocks of the system, the transition point detection mechanism and the binary classifier. Afterwards, in order to justify the advantages of the multiple SSM-based segmentation approach, three variants (Fig 8) of the system were assessed using frame and event-based evaluation approaches. During the frame-level evaluation, the proposed multiple SSM-based approach exhibited an approximately 2% higher f1-score in comparison to the Raw variant, while during the event-level evaluation the performance gains in terms of f1-score increased to 38% in comparison to Raw. Furthermore, it is observed that the proposed algorithm achieves higher f1-scores when sub-second accuracy is assessed during event-level evaluation. In particular, as already mentioned, a 38% performance improvement is exhibited at ±0.5 s tolerance, while that figure decreases to 7% at ±1.0 s tolerance. From the frame and event-level evaluation results it becomes clear that the latter is a much more demanding task, as a 2% performance gain during frame-level evaluation can lead to up to a 38% improvement during event-level evaluation.

Finally, the proposed algorithm was evaluated against two existing, publicly available speech/music discrimination algorithms. The same evaluation approach as the one previously employed was followed and the proposed method exhibited an f1-score improvement of up to 13% for the frame-based evaluation and up to three times higher scores, in comparison to the best performing alternative, for the event-level evaluation. Similarly to what was observed before, the performance improvement of the proposed algorithm increased monotonically as the event-level evaluation tolerance decreased to sub-second values. The high f1-score improvement (in comparison to the alternatives) exhibited by the proposed algorithm during the event-level evaluation is a result


of its ability to detect segments with a high degree of coherency/homogeneity and its ability to precisely identify, in terms of time, their onset and offset points. Both characteristics emanate from the employment of an SSM-based approach for the transition point detection stage.

5 Conclusion and Further Work

As part of this work, an audio-driven, similarity-based speech / music discrimination algorithm for the analysis of multimedia content was introduced and evaluated against three publicly available datasets and two existing, alternative algorithms. The proposed system exhibited significant performance improvements during both frame and event level evaluations, especially when sub-second accuracy was assessed. The employment of multiple self-similarity matrices for transition point detection resulted in higher performance figures in comparison to the standard single self-similarity matrix analysis. Furthermore, the availability of the proposed algorithm in the form of a documented, open-source software package 3 will allow researchers to reproduce results and evaluate the algorithm's performance under different scenarios.

Further steps towards this direction could include improvements or alternative approaches in both stages of the system. On the transition point detection side, image segmentation techniques including the watershed transform, region growing and others could be evaluated as alternatives to the standard checkerboard-kernel SSM analysis approach, while the generation of multiple SSMs with different spatial distance metrics could also be explored. On the classification side, additional classes could be considered (e.g. noise) and the performance of more recent machine learning algorithms (e.g. Deep Neural Networks) could be evaluated. Finally, considering the ubiquitous nature of music and speech in multimedia content, the proposed framework (alone or in combination with other methods) could be exploited in fully and semi-automated approaches for semantic annotation extraction.

3 https://github.com/nicktgr15/similarity-based-speech-music-discrimination

References

1. Continuous frequency activation Yaafe plugin repository. URL https://github.com/mcrg-fhstp/cba-yaafe-extension. [Online; accessed 30-July-2016]
2. GTZAN music speech dataset. URL http://marsyasweb.appspot.com/download/data_sets/. [Online; accessed 30-July-2016]
3. LabROSA music-speech corpus. URL http://labrosa.ee.columbia.edu/sounds/musp/scheislan.html. [Online; accessed 30-July-2016]
4. Mirex 2015 muspeak sample dataset. URL http://www.music-ir.org/mirex/wiki/2015:Music/Speech_Classification_and_Detection#Dataset_2. [Online; accessed 30-July-2016]
5. PeakUtils. URL http://pythonhosted.org/PeakUtils/. [Online; accessed 30-July-2016]
6. Speech - music discrimination demo version 1.0. URL http://cgi.di.uoa.gr/~sp_mu/download.html. [Online; accessed 30-July-2016]


7. Ajmera, J., McCowan, I.A., Bourlard, H.: Speech/music discrimination using entropy and dynamism features in a hmm classification framework. Tech. rep., IDIAP (2001)
8. Carey, M.J., Parris, E.S., Lloyd-Thomas, H.: A comparison of features for speech, music discrimination. In: Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, vol. 1, pp. 149–152. IEEE (1999)
9. Dimoulas, C.A., Symeonidis, A.L.: Syncing shared multimedia through audiovisual bimodal segmentation. IEEE MultiMedia 22(3), 26–42 (2015)
10. El-Maleh, K., Klein, M., Petrucci, G., Kabal, P.: Speech/music discrimination for multimedia applications. In: Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on, vol. 6, pp. 2445–2448. IEEE (2000)
11. Elizalde, B., Friedland, G.: Lost in segmentation: Three approaches for speech/non-speech detection in consumer-produced videos. In: 2013 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2013)
12. Foote, J.: Visualizing music and audio using self-similarity. In: Proceedings of the seventh ACM international conference on Multimedia (Part 1), pp. 77–80. ACM (1999)
13. Foote, J.: Automatic audio segmentation using a measure of audio novelty. In: Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on, vol. 1, pp. 452–455. IEEE (2000)
14. Jiang, H., Lin, T., Zhang, H.: Video segmentation with the support of audio segmentation and classification. In: Proc. IEEE ICME (2000)
15. Jun, S., Rho, S., Hwang, E.: Music structure analysis using self-similarity matrix and two-stage categorization. Multimedia Tools and Applications 74(1), 287–302 (2015)
16. Khonglah, B.K., Prasanna, S.M.: Speech/music classification using speech-specific features. Digital Signal Processing 48, 71–83 (2016)
17. Kinnunen, T., Chernenko, E., Tuononen, M., Fränti, P., Li, H.: Voice activity detection using mfcc features and support vector machine. In: Int. Conf. on Speech and Computer (SPECOM07), Moscow, Russia, vol. 2, pp. 556–561 (2007)
18. Kotsakis, R., Kalliris, G., Dimoulas, C.: Investigation of broadcast-audio semantic analysis scenarios employing radio-programme-adaptive pattern classification. Speech Communication 54(6), 743–762 (2012)
19. Lavner, Y., Ruinskiy, D.: A decision-tree-based algorithm for speech/music classification and segmentation. EURASIP Journal on Audio, Speech, and Music Processing 2009(1), 1 (2009)
20. Lim, C., Chang, J.H.: Efficient implementation techniques of an svm-based speech/music classifier in smv. Multimedia Tools and Applications 74(15), 5375–5400 (2015)
21. Mathieu, B., Essid, S., Fillon, T., Prado, J., Richard, G.: Yaafe, an easy to use and efficient audio feature extraction software. In: ISMIR, pp. 441–446 (2010)
22. Minami, K., Akutsu, A., Hamada, H., Tonomura, Y.: Video handling with music and speech detection. IEEE Multimedia 5(3), 17–25 (1998)
23. Miro, X.A., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: A review of recent research. Audio, Speech, and Language Processing, IEEE Transactions on 20(2), 356–370 (2012)
24. Moattar, M., Homayounpour, M.: A simple but efficient real-time voice activity detection algorithm. In: Signal Processing Conference, 2009 17th European, pp. 2549–2553. IEEE (2009)
25. Panagiotakis, C., Tziritas, G.: A speech/music discriminator based on rms and zero-crossings. Multimedia, IEEE Transactions on 7(1), 155–166 (2005)
26. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
27. Pikrakis, A., Giannakopoulos, T., Theodoridis, S.: Speech/music discrimination for radio broadcasts using a hybrid hmm-bayesian network architecture. In: Signal Processing Conference, 2006 14th European, pp. 1–5. IEEE (2006)
28. Pikrakis, A., Giannakopoulos, T., Theodoridis, S.: A speech/music discriminator of radio recordings based on dynamic programming and bayesian networks. IEEE Transactions on Multimedia 10(5), 846–857 (2008)


29. Pikrakis, A., Theodoridis, S.: Speech-music discrimination: A deep learning perspective. In: 2014 22nd European Signal Processing Conference (EUSIPCO), pp. 616–620. IEEE (2014)
30. Ramalingam, T., Dhanalakshmi, P.: Speech/music classification using wavelet based feature extraction techniques. J. Comput. Sci 10(1), 34–44 (2014)
31. Sang-Kyun, K., Chang, J.H.: Speech/music classification enhancement for 3gpp2 smv codec based on support vector machine. IEICE transactions on fundamentals of electronics, communications and computer sciences 92(2), 630–632 (2009)
32. Saunders, J.: Real-time discrimination of broadcast speech/music. In: ICASSP, vol. 96, pp. 993–996 (1996)
33. Scheirer, E., Slaney, M.: Construction and evaluation of a robust multifeature speech/music discriminator. In: Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on, vol. 2, pp. 1331–1334. IEEE (1997)
34. Sell, G., Clark, P.: Music tonality features for speech/music discrimination. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2489–2493. IEEE (2014)
35. Seyerlehner, K., Pohle, T., Schedl, M., Widmer, G.: Automatic music detection in television productions. In: Proceedings of the 10th International Conference on Digital Audio Effects (DAFx07). Citeseer (2007)
36. Shirazi, J., Ghaemmaghami, S.: Improvement to speech-music discrimination using sinusoidal model based features. Multimedia Tools and Applications 50(2), 415–435 (2010)
37. Tsipas, N., Vrysis, L., Dimoulas, C., Papanikolaou, G.: Mirex 2015: Methods for speech/music detection and classification
38. Tsipas, N., Vrysis, L., Dimoulas, C.A., Papanikolaou, G.: Content-based music structure analysis using vector quantization. In: Audio Engineering Society Convention 138. Audio Engineering Society (2015)
39. Tsipas, N., Zapartas, P., Vrysis, L., Dimoulas, C.: Augmenting social multimedia semantic interaction through audio-enhanced web-tv services. In: Proceedings of the Audio Mostly 2015 on Interaction With Sound, p. 34. ACM (2015)
40. Wang, W., Gao, W., Ying, D.: A fast and robust speech/music discrimination approach. In: Information, Communications and Signal Processing, 2003 and Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint Conference of the Fourth International Conference on, vol. 3, pp. 1325–1329. IEEE (2003)
41. Wieser, E., Husinsky, M., Seidl, M.: Speech/music discrimination in a large database of radio broadcasts from the wild. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2134–2138. IEEE (2014)
42. Zhang, T., Kuo, C.C.J.: Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on speech and audio processing 9(4), 441–457 (2001)