On Feature Extraction for Voice Pathology Detection from Speech Signals

Zvi Kons 1, Aharon Satt 1, Ron Hoory 1, Virgilijus Uloza 2, Evaldas Vaiciukynas 3, Adas Gelzinis 3 and Marija Bacauskiene 3

1 IBM Research, Haifa Research Lab, Israel
2 Lithuanian University of Health Sciences
3 Kaunas University of Technology

{ZVI, AHARONSA, HOORY}@il.ibm.com

Abstract

A reliable, automatic and objective detector of pathological voice disorders in speech signals is a long-sought tool, both for voice clinicians and for general practitioners. Such a detector could also be used for low-cost, non-invasive mass screening, diagnosis and early detection of voice pathology for professionals who use their voice as an essential career tool, for people working in risky environments such as chemical factories, and for the general population. Despite years of research and significant advances in voice pathology detection and classification, correct detection and classification rates for the various pathology stages are still insufficient for reliable, trusted large-scale screening. Research in this field generally splits into two stages: first, extraction of meaningful feature sets, and second, use of these features to classify speech recordings into healthy and different pathological cases. This work examines the performance of state-of-the-art methods and investigates their weaknesses. The methods examined include features of time, frequency, perturbation, noise and spectral structure, as well as features more closely related to the glottal source; these features are evaluated with different machine learning techniques. The paper describes ongoing work to improve feature sets for more accurate detection of pathology in voice signals, aiming to overcome certain weaknesses of current state-of-the-art methods, with emphasis on early-stage cases. Promising results on real-world pathological and healthy samples, recorded in voice clinics, are shown, and future directions are discussed.

Index Terms: speech analysis; pathological voices; voice disorders; speech parameterization

1. Introduction

Voice disorders can be caused by different problems in the larynx. In this article we focus on disorders caused by different pathologies of the vocal folds. Today, diagnosing those pathologies requires direct visual imaging of the larynx by invasive procedures. Naturally, those procedures are complex and require medical expertise.

As expected, those pathologies manifest in the patient's speech. Common manifestations are a weak voice, hoarseness, lowered pitch, and even an inability to produce voiced speech. Medical experts can use those voice qualities to diagnose the patient, but their correct classification rate for distinguishing between healthy and pathological cases is only about 70% [1]. It therefore seems reasonable to assume that automatic classification of speech samples can achieve better rates. Several previous papers [1]-[5] use voice samples alone, or speech samples with additional data such as images and questionnaires, to distinguish between healthy and pathological cases and between different pathologies. The main idea of these works is to extract different features from the voice samples using general or specific speech analysis techniques, and then to train some classification tool (usually an SVM) on the data. The classification rates reported in those works are usually around 90%. In this work we continue along this path. Our main goal is to improve the features used for classification, by using improved voice analysis tools and by producing new features that are more relevant to these kinds of pathology. We experiment on a much larger database than previous works, which gives us more reliable results. The problem of unbalanced data, where the number of pathologic cases is significantly higher than that of the healthy group, is also addressed.

1.1. Voice database

Our database contains samples from 719 subjects (320 male and 339 female) that were recorded in a soundproof booth at the Department of Otolaryngology, Kaunas University of Medicine, Kaunas, Lithuania. For each subject the database holds 1-3 recordings of a sustained phonation of the vowel /a/ (2-5 seconds) and a recording of a sentence (not used in this article). The recordings were made at a sample rate of 44.1 kHz and converted to 11.25 kHz. A total of 2101 recordings of the vowel /a/ were used.

Diagnosis was done for all patients based on clinical signs revealed during video laryngostroboscopy and direct microlaryngoscopy, and on histological examination of removed vocal fold mass lesions. For some of the subjects the database also contains images of the larynx and additional information in the form of a questionnaire (both not used here). Table 1 lists the distribution of the different pathologies. For some subjects the severity was evaluated through a medical procedure on a four-level scale: 0 (healthy), 1 (mild), 2 (medium) and 3 (severe). Table 2 lists the distribution of the evaluated levels in the database, based on patients visiting the clinic.

DIAGNOSIS                          NUMBER OF SUBJECTS     %
Control group (normal voice)       126                    18%
Vocal fold nodule                  97                     13%
Vocal fold polyp                   262                    36%
Vocal fold cyst                    31                     4%
Vocal fold cancer                  66                     9%
Vocal fold polypoid hyperplasia    68                     9%
Vocal fold keratosis               20                     3%
Vocal fold papilloma               49                     7%

Table 1: Distribution of pathologies

SEVERITY LEVEL     NUMBER OF SUBJECTS     %
0                  126                    31%
1                  89                     22%
2                  102                    25%
3                  96                     23%

Table 2: Distribution of severity levels

2. Feature extraction

2.1. Features

Our initial investigation of the pathological samples shows that the different pathologies can have various manifestations. In severe cases the subject cannot even maintain a voiced /a/ (i.e. no periodic-like signal is generated). Less severe cases usually show problems such as hoarseness, breathiness, unstable pitch, elevated noise levels and attenuation of the higher frequencies. In a large part of the pathological cases the subjects sound normal and cannot be identified by listening alone, without a deeper examination; conversely, healthy subjects can sound hoarse while showing no signs of pathology in the medical examination. In order to apply classification tools to the speech samples we first need to represent them by a vector of features. Since the problems have a variety of manifestations, we have to look at features that represent different parameters of the voice. The features we use are based on the following speech parameters, calculated from the speech samples:

1. Pitch and degree of voicing (DOV)
2. Spectral envelope
3. Harmonics frequency jitter
4. LPC filter coefficients and the signal after LPC inverse filtering
5. Residual signal
6. Glottal source

Parameters 1-5 are described below; parameter 6, the glottal source, is described in Section 2.3. The pitch, DOV, harmonic amplitudes and spectral envelope are calculated for each 10 ms frame. We use the sinusoidal model and represent each frame as:

$$x[n] = \sum_k A_k \sin\left(2\pi n \left(f_0 k + \delta_k\right) + \varphi_k\right) \qquad (1)$$

where $x[n]$ is the speech signal at sample $n$, $A_k$ are the harmonic amplitudes, $f_0$ is the pitch frequency, $\delta_k$ represent the frequency jitter and $\varphi_k$ are the phase offsets. For unvoiced frames we extract the amplitudes and phases from their STFT.
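To make the sinusoidal model concrete, here is a minimal synthesis sketch of Eq. (1). Treating $f_0$ and the jitter terms as normalized frequencies (cycles per sample) is an assumption made for illustration, as are the parameter values; this is not the analysis code used in the paper.

```python
import numpy as np

def synth_frame(amps, f0, jitter, phases, n_samples):
    """Synthesize one voiced frame per Eq. (1):
    x[n] = sum_k A_k * sin(2*pi*n*(f0*k + delta_k) + phi_k).
    f0 and jitter are assumed to be in cycles/sample (normalized)."""
    n = np.arange(n_samples)
    x = np.zeros(n_samples)
    for k, (a_k, d_k, p_k) in enumerate(zip(amps, jitter, phases), start=1):
        x += a_k * np.sin(2.0 * np.pi * n * (f0 * k + d_k) + p_k)
    return x

# Example: a 10 ms frame at 11.25 kHz, 120 Hz pitch, 5 harmonics, no jitter
fs = 11250
frame = synth_frame(amps=[1.0, 0.5, 0.3, 0.2, 0.1],
                    f0=120.0 / fs,
                    jitter=np.zeros(5),
                    phases=np.zeros(5),
                    n_samples=int(0.01 * fs))
```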

The 24 Mel-spaced spectral envelope parameters $C_m$ are found by requiring that linear interpolation at the frequencies $f_0 k + \delta_k$ gives the closest match to $\log(A_k)$. DOV is a parameter based on the maximal voicing frequency, at which the pitch frequency jitter crosses a given threshold. The details of these procedures are described in [6]-[10]. From those parameters we derive features as follows:

Pitch: mean, STD, maximum and minimum over the entire file, means of different segments of the file, STD of the pitch differences, and the shape of the pitch distribution (13 features).

DOV: mean, median, maximum and minimum (4 features).

Spectral envelope: mean and STD of the parameters, their DCTs, and the STD of the differences (17 features).

Frequency jitter: the amount of jitter in different frequency bands (4 features).

LPC: for the next set of features we first calculate the LPC filter coefficients for each frame; the mean and STD of those coefficients are taken as features. In addition, applying the inverse LPC filter to the signal yields a signal that is mainly composed of time-localized excitation and noise. We found that the best results are obtained by looking at short frames, half a pitch cycle long, and examining the distribution of the signal amplitudes within those frames; the mean and STD of several predefined quantiles were used as features (54 features).
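As an illustration of the LPC feature set just described, the following sketch computes the LPC inverse-filtered excitation and quantile statistics over half-pitch-cycle frames. The LPC order and the specific quantiles are assumptions for illustration; the paper does not specify them.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(frame, order=12):
    """LPC via the autocorrelation (Yule-Walker) method."""
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))  # A(z) = 1 - sum_i a_i z^-i

def lpc_quantile_features(signal, pitch_period, order=12,
                          quantiles=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Inverse-filter the signal with its LPC filter, then take amplitude
    quantiles over half-pitch-cycle frames; the mean and STD of each
    quantile across frames are the features. Order and quantile choices
    are illustrative assumptions."""
    a = lpc_coeffs(signal, order)
    excitation = lfilter(a, [1.0], signal)   # LPC inverse filtering
    half = max(2, pitch_period // 2)         # half-pitch-cycle frame length
    frames = [np.abs(excitation[i:i + half])
              for i in range(0, len(excitation) - half, half)]
    q = np.array([np.quantile(f, quantiles) for f in frames])
    return np.concatenate([q.mean(axis=0), q.std(axis=0)])
```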

Residual: our goal is to remove the periodic parts of the signal so that only the noise-like residual is left. We first refine the frame-based pitch and calculate the pitch at every sample using correlation. The pitch and correlation values are used to subtract the signal of the previous pitch cycle, which leaves us with the residual signal. We apply this process both to the original signal and to the signal after LPC inverse filtering. Various features are extracted from the residual signal, including the spectral shape in several bands, the distribution of amplitudes, and the 3rd and 4th moments of the signal (94 features).

The last feature is the gender of the subject. Finally, before classification, all the features are normalized to zero mean and unit STD.
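The pitch-cycle subtraction behind the residual features could look like the sketch below. For simplicity it assumes a constant pitch period, whereas the paper refines the pitch at every sample using correlation; the skewness and kurtosis lines illustrate the 3rd and 4th moment features.

```python
import numpy as np

def residual_by_cycle_subtraction(x, period):
    """Remove the periodic part of the signal by subtracting the previous
    pitch cycle, leaving the noise-like residual. A constant pitch period
    is assumed here for illustration."""
    x = np.asarray(x, dtype=float)
    res = x.copy()
    res[period:] = x[period:] - x[:-period]
    res = res[period:]                         # first cycle has no predecessor
    centered = res - res.mean()
    std = centered.std() + 1e-12
    skewness = np.mean(centered ** 3) / std ** 3   # 3rd moment feature
    kurtosis = np.mean(centered ** 4) / std ** 4   # 4th moment feature
    return res, skewness, kurtosis
```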

2.2. Classification

In this work our focus is on two-class classification (healthy vs. pathologic) using the features described in Section 2.1. We experimented with several classification methods and tools, which in general gave similar results; the results reported here were achieved with an SVM classifier. We apply feature weighting by multiplying each feature by a different weight, so the kernel function for two feature vectors x and y is:

$$K(\mathbf{x}, \mathbf{y}) = \exp\left(-\sum_k w_k \left(x_k - y_k\right)^2\right) \qquad (2)$$
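Eq. (2) is equivalent to a standard RBF kernel applied after scaling each feature by the square root of its (non-negative) weight, so it can be supplied to an SVM as a precomputed kernel. The sketch below assumes scikit-learn; the toolkit and the weight values are illustrative, not the authors'.

```python
import numpy as np
from sklearn.svm import SVC

def weighted_rbf_kernel(X, Y, w):
    """K(x, y) = exp(-sum_k w_k (x_k - y_k)^2), Eq. (2).
    Implemented by scaling features by sqrt(w_k), then an RBF kernel."""
    Xw = X * np.sqrt(w)
    Yw = Y * np.sqrt(w)
    # squared Euclidean distances between all pairs of rows
    d2 = (np.sum(Xw ** 2, axis=1)[:, None]
          + np.sum(Yw ** 2, axis=1)[None, :]
          - 2.0 * Xw @ Yw.T)
    return np.exp(-np.maximum(d2, 0.0))

# Usage with a precomputed kernel (X_train, y_train, X_test, w assumed given):
# svm = SVC(kernel='precomputed')
# svm.fit(weighted_rbf_kernel(X_train, X_train, w), y_train)
# scores = svm.decision_function(weighted_rbf_kernel(X_test, X_train, w))
```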

The optimal SVM parameters and the feature weights are found by a search process. Since the available data set is not very large, we apply out-of-bag cross-validation: the data of several subjects is held out as a test subset and the classifier is trained on the rest, and this procedure is repeated several times for full coverage. Since the data is also not balanced between the two classes, we measure the Equal Error Rate (EER), the point on the ROC curve where the false-alarm rate equals the misdetection rate. The details of this procedure will be published elsewhere.
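A hedged reconstruction of this evaluation scheme is sketched below; GroupKFold stands in for the repeated held-out-subjects procedure, and the EER is read off the ROC curve. This is illustrative only, since the authors state that the details will be published elsewhere.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_curve
from sklearn.svm import SVC

def equal_error_rate(y_true, scores):
    """EER: the ROC point where the false-alarm rate equals the
    misdetection rate."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0

def subject_wise_eer(X, y, subject_ids, n_splits=10):
    """Hold out all recordings of a subset of subjects per fold, so no
    subject appears in both the training and the test set."""
    eers = []
    for tr, te in GroupKFold(n_splits).split(X, y, groups=subject_ids):
        clf = SVC().fit(X[tr], y[tr])
        eers.append(equal_error_rate(y[te], clf.decision_function(X[te])))
    return np.mean(eers), np.std(eers)
```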

2.3. Glottal source

The glottal source, the derivative of the glottal pulse, is expected to facilitate accurate detection of the health state of the vocal folds in particular and of the larynx in general. In this work we describe preliminary, promising results of glottal source extraction from voice samples of the vowel /a/. We begin with a method known in the art: separating the z-transform roots of a segment of voice into minimum-phase and maximum-phase groups [11]-[13]. Two pitch cycles of the vowel /a/, centered around the Glottal Closure Instant (GCI) [14], are separated into minimum- and maximum-phase components by dividing the z-transform roots into two groups, with modulus less than and greater than 1.0, respectively [11]-[14]. It is well known that the vocal tract transfer function of the vowel /a/ accurately corresponds to a minimum-phase signal, and so does the glottal-source closing-phase signal; the maximum-phase component therefore closely represents the glottal-source open phase. In this work we examine the utilization of the open-phase signal for voice pathology detection and classification.

As discussed in [15]-[16], this method of separating the signal into minimum- and maximum-phase components is unstable and typically yields an inaccurate estimate of the open-phase signal. The instability stems from factors such as windowing effects and inaccuracy of the GCI location. We have developed an algorithm that automatically performs z-transform normalization, by manipulating the z-transform roots, to efficiently extract the maximum-phase component of the signal. It performs very well for both normal and mild-to-medium pathological voices, yielding a "clean" and meaningful glottal-source open-phase signal while being insensitive to GCI location inaccuracies and other factors. We have also developed an algorithm that automatically estimates the "quality" of the extracted open-phase signal, in terms of prediction measures for the "healthiness" of the examined voice sample. The full details of this procedure will be published elsewhere.
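The basic root-splitting step of [11]-[14] can be sketched as follows. This shows only the standard ZZT decomposition; the z-transform normalization and quality-estimation algorithms described above are unpublished and are not reproduced here, and the frame extraction details are assumptions.

```python
import numpy as np

def zzt_split(frame, n_fft=4096):
    """Split a GCI-centered, two-pitch-cycle frame into minimum- and
    maximum-phase spectra via the zeros of its z-transform (ZZT).
    The z-transform of the finite frame is a polynomial whose zeros are
    the roots of the polynomial with the frame samples as coefficients."""
    zeros = np.roots(frame)
    inside = zeros[np.abs(zeros) < 1.0]    # minimum-phase group
    outside = zeros[np.abs(zeros) >= 1.0]  # maximum-phase group

    z = np.exp(2j * np.pi * np.arange(n_fft) / n_fft)  # unit-circle samples
    spec_min = (np.prod([z - r for r in inside], axis=0)
                if len(inside) else np.ones(n_fft))
    spec_max = (np.prod([z - r for r in outside], axis=0)
                if len(outside) else np.ones(n_fft))

    # the maximum-phase component approximates the glottal-source open phase
    open_phase = np.real(np.fft.ifft(spec_max))
    return spec_min, spec_max, open_phase
```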

3. Results

The classification error rates for the entire feature set (without the glottal source) and for feature subsets (without the gender) are shown in Table 3, listed by increasing error rate.

Feature set                     Size    EER (%)
All                             187     10 ± 3
Residual of inverse LPC         46      14 ± 4
Spectral envelope               17      15 ± 5
LPC inverse filtered            20      15 ± 4
LPC coefficients                34      18 ± 5
Frequency jitter                4       20 ± 6
Residual of original signal     46      22 ± 4
Pitch                           15      25 ± 5
DOV                             4       30 ± 5

Table 3: Classification equal error rates for different sets of features

As can be seen, most of the information originates from the spectral features, while the pitch-related features seem to be less important. The most useful feature set is the residual of the inverse-LPC-filtered signal.

This feature set describes different irregularities that mostly manifest themselves as noise in the speech signal. Another important feature set is the spectral envelope, whose most important components describe the amplitude shimmer. By examining the optimal feature weights that were found, we can see the relevance of individual features. At the top of the list of features with the highest weights is the speaker's gender. This probably means that the male and female distributions are very different (e.g. for females, low pitch is a better indication of pathology); it might be useful in the future to create a separate classifier for each gender. Most of the top positions in the list are occupied by different spectral bands of the residuals of the original and inverse-LPC-filtered signals. Two more important features are the high-frequency components of the LPC filter (indicating noise levels) and the minimal DOV (low voicing indicates pathology). Classification rates for the different levels of severity are shown in Table 4. It is clear that more severe cases are easier to diagnose; there is still large room for improvement in correctly identifying the weak pathologies.

Severity     Classification rate
0            90 ± 0.5 %
1            79 ± 0.9 %
2            91 ± 0.6 %
3            98 ± 0.5 %

Table 4: Classification rates for different severity levels

3.1. Results for glottal source

Since the glottal source extraction algorithm depends on the existence of GCIs and on the consistent estimation of their locations, which requires a periodic-like signal to be present, we currently focus only on the healthy and severity-1 cases. These are the most challenging cases, in the sense that they represent the lion's share of the sample distribution in key use-cases, while being significantly more difficult to detect and classify. Initial results show that by comparing scalar "quality" measures against fixed thresholds, we can obtain 75% and 70% correct classification rates for healthy and severity-1 cases, respectively, using samples from over 100 healthy and over 50 severity-1 human subjects.

4. Summary

At this stage of our work we are able to demonstrate good classification between healthy and pathologic cases. It is clear, however, that additional work is still needed to improve the classification rates, especially for the weaker pathologies. We believe that this can be obtained by improving our current feature set and by adding new features.

5. Acknowledgements

This work was done in the framework of the IBM – Lithuania joint research agreement.

6. References

[1] Uloza, V., Verikas, A., Bacauskiene, M., Gelzinis, A., Pribuisiene, R., Kaseta, M. and Saferis, V., "Categorizing normal and pathological voices: automated and perceptual categorization", J. Voice, June 2010.
[2] Gelzinis, A., Verikas, A. and Bacauskiene, M., "Automated speech analysis applied to laryngeal disease categorization", Computer Methods and Programs in Biomedicine, 91(1):36-47, 2008.
[3] Hadjitodorov, S. and Mitev, P., "A computer system for acoustic analysis of pathological voices and laryngeal diseases screening", Med. Eng. Phys., 24(6):419-429, July 2002.
[4] Wang, X., Zhang, J. and Yan, Y., "Discrimination between pathological and normal voices using GMM-SVM approach", J. Voice, 25(1):38-43, Jan. 2011.
[5] Dibazar, A. A., Narayanan, S. and Berger, T. W., "Feature analysis for automatic detection of pathological speech", Engineering in Medicine and Biology, 2002.
[6] Shechtman, S. and Sorin, A., "Sinusoidal model parameterization for HMM-based TTS system", INTERSPEECH 2010, Makuhari, Japan, Sept. 2010.
[7] Chazan, D., Hoory, R., Sagi, A., Shechtman, S., Sorin, A., Shuang, Z. and Bakis, R., "High quality sinusoidal modeling of wideband speech for the purpose of speech synthesis and modification", ICASSP 2006, Toulouse, May 2006.
[8] Chazan, D., Hoory, R., Kons, Z., Sagi, A., Shechtman, S. and Sorin, A., "Small footprint concatenative text-to-speech synthesis system using complex spectral envelope modeling", INTERSPEECH 2005, Lisbon, Sept. 2005.
[9] Sorin, A., Ramabadran, T., Chazan, D., Hoory, R., McLaughlin, M., Pearce, D., Wang, F. and Zhang, Y., "The ETSI extended distributed speech recognition standards: client side processing and tonal language recognition evaluation", ICASSP 2004, Montreal, Canada, May 2004.
[10] Chazan, D., Zibulski, M., Hoory, R. and Cohen, G., "Efficient periodicity extraction based on sine-wave representation and its application to pitch determination of speech signals", EUROSPEECH 2001.
[11] Bozkurt, B., Doval, B., d'Alessandro, C. and Dutoit, T., "Zeros of z-transform representation with application to source-filter separation in speech", IEEE Signal Processing Letters, 12(4), 2005.
[12] Doval, B., d'Alessandro, C. and Henrich, N., "The voice source as a causal/anticausal linear filter", Proc. ISCA ITRW VOQUAL'03, 15-19, 2003.
[13] Sturmel, N., d'Alessandro, C. and Doval, B., "A comparative evaluation of the zeros of z-transform representation for voice source estimation", INTERSPEECH 2007, 558-561, 2007.
[14] Bozkurt, B., "Zeros of the z-transform (ZZT) representation and chirp group delay processing for the analysis of source and filter characteristics of speech signals", Ph.D. thesis, http://www.tcts.fpms.ac.be/publications/phds/bozkurt/thesis_Bozkurt.pdf
[15] Drugman, T., Bozkurt, B. and Dutoit, T., "Chirp decomposition of speech signals for glottal source estimation", Proc. NOLISP (ISCA Workshop on Non-Linear Speech Processing), Barcelona, June 2009.
[16] Drugman, T., Bozkurt, B. and Dutoit, T., "Glottal source estimation using an automatic chirp decomposition", in Advances in Nonlinear Speech Processing, LNCS 5933, pp. 35-42, Springer, New York.