A comparative study on feature extraction techniques in speech recognition

Smita B. Magre

Ratnadeep R. Deshmukh

Pukhraj P. Shrishrimal

Department of C. S. and I.T., Dr. Babasaheb Ambedkar Marathwada University, Aurangabad [email protected]

Department of C.S. and I.T., Dr. Babasaheb Ambedkar Marathwada University, Aurangabad [email protected]

Department of C.S. and I.T., Dr. Babasaheb Ambedkar Marathwada University, Aurangabad [email protected]

ABSTRACT
Automatic speech recognition (ASR) has made great strides with the development of digital signal processing hardware and software. Despite these advances, however, machines cannot match the performance of their human counterparts in terms of accuracy and speed, especially in the case of speaker-independent speech recognition. A good portion of current speech recognition research is therefore focused on the speaker-independent recognition problem, motivated by its wide range of applications and by the limitations of the available techniques. This paper provides an outline of the techniques developed for each stage of speech recognition, and helps in selecting among them by presenting their relative advantages and disadvantages. A comparative study of the various techniques is carried out stage by stage. The paper concludes with a choice of future direction for developing feature extraction techniques for a human-computer interface system using the Marathi language.

General Terms Modeling technique, speech processing, signal processing

Keywords-

Speech Recognition; Feature Extraction; MFCC; LPC; PCA; LDA; Wavelet; DTW.

1. INTRODUCTION
Speech recognition, also known as automatic speech recognition or computer speech recognition, means understanding a voice input to the computer and performing any required task, or the ability to match a voice against a provided or acquired vocabulary. The task is to get a computer to understand spoken language. By "understand" we mean to react appropriately and to convert the input speech into another medium, e.g. text. Speech recognition is therefore sometimes referred to as speech-to-text (STT). Speech processing is one of the exciting areas of signal processing. A speech recognition system consists of a microphone for the person to speak into, speech recognition software, a computer to take and interpret the speech, a good quality sound card for input and/or output, and a proper and good pronunciation.

1.1 Topology of Speech Recognition System 









- Speaker-dependent systems: These require a user to train the system with his or her voice.
- Speaker-independent systems: These do not require a user to train the system, i.e. they are developed to operate for any speaker.
- Isolated word recognizers: These accept one word at a time.
- Connected word systems: These allow the speaker to say each word slowly and distinctly, with a short pause between words, i.e. planned speech.
- Continuous speech systems: These allow us to speak naturally and continuously.
- Spontaneous recognition systems: These allow us to speak spontaneously [3].

1.2 Overview of Speech Recognition
A speech recognition system consists of five blocks: feature extraction, acoustic modeling, pronunciation modeling, language modeling, and the decoder. The process of speech recognition begins with a speaker producing an utterance, which consists of sound waves. These sound waves are captured by a microphone and converted into electrical signals, which are then converted into digital form to make them understandable by the speech system. The speech signal is then converted into a discrete sequence of feature vectors, which is assumed to contain only the relevant information about the given utterance that is important for its correct recognition. An important property of feature extraction is the suppression of information irrelevant for correct classification, such as information about the speaker (e.g. fundamental frequency) and information about the transmission channel (e.g. the characteristic of the microphone). Finally, the recognition component finds the best match in the knowledge base for the incoming feature vectors. Sometimes, however, the information conveyed by these feature vectors may be correlated and less discriminative, which may slow down further processing. Feature extraction methods like Mel-frequency cepstral coefficients (MFCC) provide a way to obtain uncorrelated vectors by means of the discrete cosine transform (DCT).

Figure 1: Outline of Speech Recognition System [4].

2. FEATURE EXTRACTION
First of all, various speech samples of each word of the vocabulary are recorded by different speakers. After the speech samples are collected, they are converted from analog to digital form by sampling at a frequency of 16 kHz. Sampling means recording the speech signal at regular intervals. The collected data is then quantized, if required, to eliminate noise in the speech samples. The collected speech samples are then passed through the feature extraction, feature training and feature testing stages. Feature extraction transforms the incoming sound into an internal representation such that it is possible to reconstruct the original signal from it. There are various techniques to extract features, such as MFCC, PLP, RASTA, LPCC, PCA, LDA, Wavelet and DTW, but the most widely used is MFCC.

Figure 2: Feature Extraction Diagram [1].

2.1 MFCC: Mel-Frequency Cepstral Coefficients
Mel-frequency cepstral coefficients (MFCC) are among the most commonly used feature extraction methods in speech recognition. The technique is called FFT-based, which means that feature vectors are extracted from the frequency spectra of the windowed speech frames. The Mel-frequency filter bank is a series of triangular bandpass filters. The filter bank is based on a non-linear frequency scale called the Mel scale. A 1000 Hz tone is defined as having a pitch of 1000 mel. Below 1000 Hz, the Mel scale is approximately linear with respect to the linear frequency scale. Above the 1000 Hz reference point, the relationship between the Mel scale and the linear frequency scale is non-linear and approximately a power law. The following equation describes the mathematical relationship between the Mel scale and the linear frequency scale:

    φ = 1127.01 ln(f/700 + 1)

The Mel-frequency filter bank contains triangular bandpass filters placed in such a way that the lower boundary of one filter is located at the center frequency of the previous filter and the upper boundary at the center frequency of the next filter. A fixed frequency resolution in the Mel scale is computed, corresponding to a linear index scaling of the center frequencies, using

    Δφ = (φ_max − φ_min) / (M + 1)

where φ_max is the highest frequency of the filter bank on the Mel scale, computed from f_max using the equation above, φ_min is the lowest frequency on the Mel scale, with corresponding f_min, and M is the number of filters in the bank. The values considered for the parameters in the present study are M = 16 and f_min = 0. The center frequencies on the Mel scale are given by

    φ_c(m) = φ_min + m · Δφ,   1 ≤ m ≤ M

The center frequency in Hertz is then given by

    f_c(m) = 700 (e^(φ_c(m)/1127.01) − 1)

This is inserted into the triangular filter equation to give the M-filter bank. Finally, the MFCCs are obtained by computing the discrete cosine transform of the log filter-bank outputs S(m):

    c(n) = Σ_{m=1}^{M} S(m) cos(π n (m − 1/2) / M),   n = 1, 2, 3, ..., M

where c(n) is the n-th MFCC.

The time derivative of the cepstral coefficients is approximated by a linear regression coefficient over a finite window, defined as

    Δc_m(t) = μ Σ_{τ=−T}^{T} τ · c_m(t + τ),   1 ≤ m ≤ M

where c_m(t) is the m-th cepstral coefficient at time t and μ is a constant used to make the variances of the derivative terms equal to those of the original cepstral coefficients.
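The regression-based delta computation above can be sketched as follows. This is a minimal sketch for a single coefficient track; edge frames are clamped (an assumption not stated in the paper), and μ is chosen here as 1/Στ², a common convention, whereas the paper chooses μ to equalize variances.

```python
def delta_coefficients(c, T=2):
    # Delta c(t) = mu * sum_{tau=-T..T} tau * c(t + tau),
    # with mu = 1 / sum(tau^2) so that a unit-slope ramp gives delta = 1.
    mu = 1.0 / sum(tau * tau for tau in range(-T, T + 1))
    n = len(c)
    out = []
    for t in range(n):
        acc = 0.0
        for tau in range(-T, T + 1):
            idx = min(max(t + tau, 0), n - 1)  # clamp at the sequence edges
            acc += tau * c[idx]
        out.append(mu * acc)
    return out
```

In practice the same window is applied independently to each of the M cepstral dimensions per frame.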

Figure 3: Steps involved in MFCC Feature Extraction.
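As an illustration, the Mel-scale mapping and filter-center placement described in Section 2.1 can be sketched in Python. This is a minimal sketch; the function names and the 0-8000 Hz range are illustrative, not from the paper.

```python
import math

def hz_to_mel(f):
    # Mel scale from Section 2.1: phi = 1127.01 * ln(f/700 + 1)
    return 1127.01 * math.log(f / 700.0 + 1.0)

def mel_to_hz(phi):
    # Inverse mapping: f = 700 * (e^(phi/1127.01) - 1)
    return 700.0 * (math.exp(phi / 1127.01) - 1.0)

def filter_center_frequencies(f_min, f_max, M):
    # Fixed Mel-scale spacing: delta_phi = (phi_max - phi_min) / (M + 1),
    # center frequencies phi_c(m) = phi_min + m * delta_phi, 1 <= m <= M,
    # converted back to Hertz for filter placement.
    phi_min, phi_max = hz_to_mel(f_min), hz_to_mel(f_max)
    delta_phi = (phi_max - phi_min) / (M + 1)
    return [mel_to_hz(phi_min + m * delta_phi) for m in range(1, M + 1)]
```

With M = 16 and f_min = 0 as in the present study, the centers come out densely spaced at low frequencies and progressively sparser toward f_max, which is exactly the non-linear resolution the Mel scale is meant to provide.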

2.1.1 Advantage
Because the frequency bands are positioned logarithmically in MFCC, it approximates the response of the human auditory system more closely than a linearly spaced filter bank does.

2.1.2 Disadvantage
MFCC values are not very robust in the presence of additive noise, so it is common to normalize their values in speech recognition systems to lessen the influence of noise.
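One common form of such normalization is cepstral mean subtraction, sketched below. This is an illustrative sketch of per-utterance normalization, not a method prescribed by this paper.

```python
def cepstral_mean_normalize(frames):
    # frames: list of MFCC vectors (lists), one per frame.
    # Subtracting the per-coefficient mean over the utterance removes a
    # constant (channel-like) offset from each MFCC dimension.
    n, dim = len(frames), len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    return [[f[i] - means[i] for i in range(dim)] for f in frames]
```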



2.1.3 Applications
MFCCs are commonly used as features in speech recognition systems, such as systems which can automatically recognize numbers spoken into a telephone. They are also common in speaker recognition, which is the task of recognizing people from their voices. MFCCs are also increasingly finding uses in music information retrieval applications such as genre classification, audio similarity measures, etc.

2.2 Linear Predictive Coding (LPC)
LPC (Linear Predictive Coding) analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process is called inverse filtering, and the remaining signal is called the residue. In an LPC system, each speech sample is expressed as a linear combination of the previous samples; this equation is called linear predictive coding [9]. The coefficients of the difference equation characterize the formants. The basic steps of an LPC processor are the following [5]:

2.2.1 Preemphasis
The digitized speech signal s(n) is put through a low-order digital system to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing. The output of the preemphasis network, s̃(n), is related to the input s(n) by the difference equation:

    s̃(n) = s(n) − ã s(n − 1)

where ã is the preemphasis coefficient, typically close to 1.

2.2.2 Frame Blocking
The output of the preemphasis step, s̃(n), is blocked into frames of N samples, with adjacent frames separated by M samples. If x_l(n) is the l-th frame of speech and there are L frames in the entire speech signal, then [5]

    x_l(n) = s̃(Ml + n),   0 ≤ n ≤ N − 1,   0 ≤ l ≤ L − 1

2.2.3 Windowing
After frame blocking, the next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. If we define the window as w(n), 0 ≤ n ≤ N − 1, then the result of windowing is the signal

    x̃_l(n) = x_l(n) w(n),   0 ≤ n ≤ N − 1

A typical window is the Hamming window, which has the form

    w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1

Autocorrelation Analysis: The next step is to autocorrelate each frame of the windowed signal in order to give

    r_l(m) = Σ_{n=0}^{N−1−m} x̃_l(n) x̃_l(n + m),   m = 0, 1, ..., p

where the highest autocorrelation value, p, is the order of the LPC analysis.

2.2.4 LPC Analysis
The next processing step is the LPC analysis, which converts each frame of p + 1 autocorrelations into an LPC parameter set by using Durbin's method. This can formally be given as the following algorithm:

    E^(0) = r(0)
    k_i = [ r(i) − Σ_{j=1}^{i−1} α_j^(i−1) r(|i − j|) ] / E^(i−1),   1 ≤ i ≤ p
    α_i^(i) = k_i
    α_j^(i) = α_j^(i−1) − k_i α_{i−j}^(i−1),   1 ≤ j ≤ i − 1
    E^(i) = (1 − k_i²) E^(i−1)

By solving the above equations recursively for i = 1, 2, ..., p, the LPC coefficients a_m are given as

    a_m = α_m^(p),   1 ≤ m ≤ p

Figure 4: Block Diagram of LPC.

2.2.5 Advantages
Linear prediction is one of the most powerful signal analysis techniques. LPC minimizes the sum of the squared differences between the original speech signal and the estimated speech signal over a finite duration.
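The preemphasis, windowing, autocorrelation and Durbin steps above can be sketched end to end in Python. This is a minimal sketch: the preemphasis coefficient 0.95 is an illustrative choice, and real systems run these steps per frame over the whole signal.

```python
import math

def preemphasize(s, a=0.95):
    # s~(n) = s(n) - a * s(n-1); the first sample is passed through unchanged
    return [s[0]] + [s[n] - a * s[n - 1] for n in range(1, len(s))]

def hamming(N):
    # Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def autocorrelate(x, p):
    # r(m) = sum_{n=0}^{N-1-m} x(n) * x(n+m), for m = 0..p
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) for m in range(p + 1)]

def durbin(r):
    # Durbin's recursion: converts autocorrelations r[0..p] into
    # LPC coefficients a[1..p] and the final prediction error E.
    p = len(r) - 1
    E = r[0]
    alpha = [0.0] * (p + 1)
    for i in range(1, p + 1):
        k = (r[i] - sum(alpha[j] * r[abs(i - j)] for j in range(1, i))) / E
        new_alpha = alpha[:]
        new_alpha[i] = k
        for j in range(1, i):
            new_alpha[j] = alpha[j] - k * alpha[i - j]
        alpha = new_alpha
        E *= (1.0 - k * k)
    return alpha[1:], E
```

For a first-order AR signal with r(m) = 0.5^m, the recursion recovers a_1 = 0.5 and a_2 = 0, as expected for a model whose true order is one.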

2.3 Perceptually Based Linear Predictive Analysis (PLP)
PLP analysis models the perceptually motivated auditory spectrum by a low-order all-pole function, using the autocorrelation LP technique. The basic concept of the PLP method is shown in the block diagram of Fig. 5.

Figure 5: Block Diagram of PLP Speech Analysis Method [7].

It involves two major steps: obtaining the auditory spectrum, and approximating the auditory spectrum by an all-pole model. The auditory spectrum is derived from the speech waveform by critical-band filtering, equal-loudness-curve preemphasis, and intensity-loudness root compression. Eighteen critical-band filter outputs, with their center frequencies equally spaced in the Bark domain, are used. The Bark frequency Ω corresponding to the angular frequency ω is

    Ω(ω) = 6 ln( ω/1200π + √((ω/1200π)² + 1) )

and the center frequency of the i-th critical band is Ω_i = 0.994 i. The response of the i-th critical-band filter is

    Ψ_i(Ω) = 10^(2.5(Ω − Ω_i + 0.5)),    Ω ≤ Ω_i − 0.5
    Ψ_i(Ω) = 1,                          Ω_i − 0.5 ≤ Ω ≤ Ω_i + 0.5
    Ψ_i(Ω) = 10^(−1.0(Ω − Ω_i − 0.5)),   Ω_i + 0.5 ≤ Ω

The i-th critical-band output is obtained by weighting the short-term power spectrum P(ω) with this filter response and applying the intensity-loudness root compression:

    Θ_i = [ Σ_ω Ψ_i(Ω(ω)) P(ω) ]^0.33

The output thus obtained is linearly interpolated to give the interpolated auditory spectrum. The interpolated auditory spectrum is approximated by a fifth-order all-pole model spectrum. The IDFT of the interpolated auditory spectrum provides the first six terms of the autocorrelation function. These are used in the solution of the Yule-Walker equations [7] to obtain five autoregressive coefficients of the all-pole filter. PLP analysis provides results similar to LPC analysis, but the order of the PLP model is half that of the LP model. This allows computational and storage savings for ASR. It also provides better performance for cross-speaker ASR.

2.3.1 Advantages
PLP coefficients are often used because they approximate well the high-energy regions of the speech spectrum while simultaneously smoothing out the fine harmonic structure, which is often characteristic of the individual speaker but not of the underlying linguistic unit. LPC, however, approximates the speech spectrum equally well at all frequencies, and this representation is contrary to known principles of human hearing. The spectral resolution of human hearing is roughly linear up to 800 or 1000 Hz, but it decreases with increasing frequency above this linear range. PLP incorporates critical-band spectral resolution into its spectrum estimate by remapping the frequency axis to the Bark scale and integrating the energy in the critical bands to produce a critical-band spectrum approximation.

2.4 Dynamic Time Warping (DTW)
In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences which may vary in time or speed. For instance, similarities in walking patterns could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio and graphics data; indeed, any data which can be turned into a linear sequence can be analyzed with DTW.

Figure 6: Block Diagram of DTW.

2.4.1 Advantages
- Increased speed of recognition.
- Reduced storage space for the reference template.
- Constraints can be imposed when finding the optimal path.
- Increased recognition rate.
- A threshold can be used in order to stop the process if the error is too great.

2.4.2 Disadvantage
Choosing the appropriate reference template for a given word is a difficult task.
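The DTW comparison described in Section 2.4 can be sketched with the standard dynamic program. This is a minimal sketch over 1-D sequences with an absolute-difference local cost; speech systems apply the same recurrence to distances between whole feature vectors (e.g. MFCC frames).

```python
def dtw_distance(x, y):
    # Classic dynamic-programming DTW: D[i][j] is the cheapest cost of
    # aligning the first i elements of x with the first j elements of y.
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three allowed predecessor paths
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Note how a repeated element in one sequence costs nothing extra: the warping path simply dwells on the matching element of the other sequence, which is exactly the tolerance to speaking-rate variation that DTW provides.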

2.5 Wavelet
Speech is a non-stationary signal. The Fourier transform (FT) is not suitable for the analysis of such a non-stationary signal, because it gives only the frequency information of the signal but does not provide information about which frequency is present at what time. The windowed short-time Fourier transform (STFT) provides temporal information about the frequency content of the signal, but a disadvantage of the STFT is its fixed time resolution, due to the fixed window length. The wavelet transform (WT), with its flexible time-frequency window, is an appropriate tool for the analysis of non-stationary signals like speech, which have both short high-frequency bursts and long quasi-stationary components.
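To make the decomposition concrete, one level of the simplest wavelet (the Haar wavelet, in an unnormalized average/difference form) can be sketched as follows; deeper levels would recurse on the approximation band.

```python
def haar_step(signal):
    # One level of the (unnormalized) Haar wavelet transform:
    # pairwise averages give the coarse approximation band, and
    # pairwise half-differences give the fine detail band.
    # len(signal) must be even.
    approx = [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]
    return approx, detail
```

The transform is invertible: each original pair is recovered as (approx + detail, approx − detail), so nothing is lost while the signal is split into coarse and fine components.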

2.5.1 Advantages
- Wavelet transforms have been used for speech feature extraction, with the energies of wavelet-decomposed sub-bands used in place of Mel-filtered sub-band energies, because of their better energy compaction property. Wavelet-based features give better recognition accuracy than LPC and MFCC.
- The WT has a better capability to model the details of unvoiced sound portions.
- Better time resolution than the Fourier transform.
- Wavelets offer simultaneous localization in the time and frequency domains.
- Using the fast wavelet transform, the WT is computationally very fast.
- Wavelets can separate the fine details in a signal: very small wavelets can isolate very fine details, while very large wavelets can identify coarse details. A wavelet transform can be used to decompose a signal into component wavelets.

2.5.2 Disadvantages
- The cost of computing the DWT may be higher than that of the DCT.
- It requires longer compression time.

2.5.3 Applications
- Signal processing.
- Data compression.
- Speech recognition.
- Computer graphics and multi-fractal analysis.

Figure 7: Block Diagram of Wavelet.

2.6 Relative Spectra Filtering of Log Domain Coefficients (RASTA)
To compensate for linear channel distortions, the analysis library provides the ability to perform RASTA filtering. The RASTA filter is used either in the log-spectral or in the cepstral domain. In effect, the RASTA filter band-passes each feature coefficient. Linear channel distortions appear as an additive constant in both the log-spectral and the cepstral domains. The high-pass portion of the equivalent band-pass filter alleviates the effect of convolutional noise introduced in the channel, and the low-pass filtering helps to smooth frame-to-frame spectral changes.

Figure 8: Block Diagram of RASTA.

2.7 Principal Component Analysis (PCA)
PCA, Principal Component Analysis, is a statistical analytical tool used to explore, sort and group data. PCA takes a large number of correlated (interrelated) variables and transforms them into a smaller number of uncorrelated variables (principal components) while retaining the largest amount of variation, thus making it easier to operate on the data and build predictions. PCA is a method of identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. Since patterns are hard to find in data of high dimension, where the luxury of graphical representation is not available, PCA is a powerful tool for analyzing such data.

Figure 9: Block Diagram of PCA.

2.7.1 Advantage
- High dimensionality reduction technique.

2.7.2 Disadvantages
- The results of PCA depend on the scaling of the variables.
- The applicability of PCA is limited by certain assumptions made in its derivation.
- There is no probability density model and associated likelihood measure.
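As a minimal sketch of PCA, the code below extracts the first principal component of 2-D points using the closed-form eigendecomposition of the 2x2 sample covariance matrix. This is illustrative only; real front ends apply PCA to high-dimensional feature vectors and use a general eigensolver.

```python
import math

def pca_first_component(points):
    # points: list of (x, y) pairs. Returns the unit leading eigenvector
    # of the sample covariance matrix and its eigenvalue (the variance
    # captured along the first principal component).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    lam = tr / 2.0 + math.sqrt(max(tr * tr / 4.0 - det, 0.0))  # larger eigenvalue
    if abs(cxy) > 1e-12:
        vx, vy = lam - cyy, cxy  # eigenvector of the symmetric 2x2 matrix
    else:
        vx, vy = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam
```

For perfectly correlated points along y = x, the first component comes out along the diagonal (1/√2, 1/√2) and captures all of the variance, illustrating how PCA folds two correlated variables into one uncorrelated component.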

2.8 Combined LPC and MFCC
Algorithms for determining the MFCC and LPC coefficients that express the basic speech features have been developed by the authors. The combined use of MFCC and LPC cepstra in a speech recognition system is suggested there to improve the reliability of the recognition system. The recognition system is divided into MFCC-based and LPC-based recognition subsystems. The training and recognition processes are realized in both subsystems separately, and the recognition system accepts a decision only when both subsystems produce the same result. The authors claim that this results in a decrease of the error rate during recognition.

2.8.1 Steps of Combined Use of Cepstrals of MFCC and LPC
1. The speech signal is passed through a first-order FIR high-pass filter.
2. Voice activation detection (VAD): locating the endpoints of an utterance in the speech signal with the help of commonly used measures such as the short-term energy estimate Es, the short-term power estimate Ps, and the short-term zero crossing rate Zs.
3. The mean and variance of these measures are calculated for the background noise, assuming that the first 5 blocks are background noise.
4. Framing.
5. Windowing.
6. Calculation of MFCC features.
7. Calculation of LPC features.
8. The speech recognition system consists of two subsystems, based on MFCC and LPC. These subsystems are trained by neural networks with MFCC and LPC features.

2.8.2 The Recognition Process Stages
1. In the MFCC-based and LPC-based recognition subsystems, the recognition processes are realized in parallel.
2. The recognition results of the MFCC-based and LPC-based subsystems are compared, and the speech recognition system confirms the result that is confirmed by both subsystems.

Since the MFCC and LPC [2] methods are applied to overlapping frames of the speech signal, the dimension of the feature vector depends on the dimension of the frames. At the same time, the number of frames depends on the length of the speech signal, the sampling frequency, the frame step, and the frame length. The authors use a sampling frequency of 16 kHz, a frame step of 160 samples, and a frame length of 400 samples. Another problem of speech recognition is that the same speech has different time durations; even when the same person repeats the same speech, it has different durations. The authors suggest that, to partially remove this problem, time durations are scaled to the same length. When the dimension of the scale defined for the speech signal increases, the dimension of the feature vector corresponding to the signal also increases.

3. COMPARATIVE ANALYSIS
Various methods for feature extraction in speech recognition are broadly summarized in Table 1.

Table 1: Feature Extraction Methods [1]

Method | Property | Comments
Principal Component Analysis (PCA) | Linear map; fast; eigenvector-based | Traditional eigenvector-based method, also known as the Karhunen-Loeve expansion; good for Gaussian data.
Linear Discriminant Analysis (LDA) | Supervised linear map; fast; eigenvector-based | Better than PCA for classification.
Linear Predictive Coding (LPC) | Static feature extraction method; 10 to 16 lower-order coefficients | Used for feature extraction at lower orders.
Cepstral Analysis | Static feature extraction method; power spectrum | Used to represent the spectral envelope.
Mel-frequency cepstrum (MFCC) | Power spectrum is computed by performing Fourier analysis | Widely used method for finding features.
Independent Component Analysis (ICA) | Linear map; iterative; non-Gaussian | Blind source separation; used for de-mixing non-Gaussian distributed sources (features).
Mel-frequency scale analysis | Static feature extraction method; spectral analysis | Spectral analysis is done with a fixed resolution along a subjective frequency scale, i.e. the Mel-frequency scale.
Kernel-based feature extraction | Non-linear transformations | Dimensionality reduction leads to better classification; used to remove noisy and redundant features and to improve classification error.
Wavelet | Better time resolution than the Fourier transform | Replaces the fixed bandwidth of the Fourier transform with one proportional to frequency, which allows better time resolution at high frequencies.
Dynamic feature extraction: (i) LPC, (ii) MFCC | Acceleration and delta coefficients, i.e. second- and third-order derivatives of normal LPC and MFCC coefficients | Used for dynamic (runtime) features.
Spectral subtraction | Robust feature extraction method | Based on the spectrogram.
RASTA filtering | For noisy speech | Finds features in noisy data.
Integrated Phoneme Subspace method | A transformation based on PCA+LDA+ICA | Higher accuracy than the existing methods.
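The agreement rule of the combined MFCC and LPC system of Section 2.8 amounts to a simple decision fusion, which can be sketched as follows. The function name is illustrative; the subsystem outputs would come from the trained neural-network recognizers.

```python
def fused_recognition(mfcc_result, lpc_result):
    # Accept a hypothesis only when the MFCC-based and LPC-based
    # subsystems agree; otherwise report no decision (None), which
    # trades some coverage for a lower recognition error rate.
    return mfcc_result if mfcc_result == lpc_result else None
```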

4. CONCLUSION
We have discussed several feature extraction techniques along with their advantages and disadvantages. Some new methods have been developed using combinations of multiple techniques, and their authors have claimed improvements in performance. There is a need to develop new hybrid methods that give better performance in the area of robust speech recognition.

5. REFERENCES
[1] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yannawar, "A Review on Speech Recognition Technique", International Journal of Computer Applications (0975-8887), Volume 10, No. 3, November 2010.
[2] Urmila Shrawankar, "Techniques for Feature Extraction in Speech Recognition System: A Comparative Study", International Journal of Computer Applications in Engineering, Technology and Sciences (IJCAETS), 6 May 2013.
[3] M. A. Anusuya, S. K. Katti, "Speech Recognition by Machine: A Review", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 6, No. 3, 2009.
[4] Preeti Saini, Parneet Kaur, "Automatic Speech Recognition: A Review", International Journal of Engineering Trends and Technology, Volume 4, Issue 2, 2013.
[5] Leena R. Mehta, S. P. Mahajan, Amol S. Dabhade, "Comparative Study of MFCC and LPC For Marathi Isolated Word Recognition System", International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, Vol. 2, Issue 6, June 2013.
[6] Kayte Charansing Nathoosing, "Isolated Word Recognition for Marathi Language using VQ and HMM", Science Research Reporter, 2(2):161-165, April 2012.
[7] Manish P. Kesarkar, "Feature Extraction for Speech Recognition", M.Tech. Credit Seminar Report, Electronic Systems Group, EE Dept., IIT Bombay, November 2003.
[8] Hynek Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech", Speech Technology Laboratory, Division of Panasonic Technologies, Journal of the Acoustical Society of America, Vol. 87, No. 4, April 1990.
[9] Bharti W. Gawali, Santosh Gaikwad, Pravin Yannawar, Suresh C. Mehrotra, "Marathi Isolated Word Recognition System using MFCC and DTW Features", ACEEE International Journal on Information Technology, Vol. 01, No. 01, March 2011.
[10] K. R. Aida-Zade, C. Ardil, S. S. Rustamov, "Investigation of Combined use of MFCC and LPC Features in Speech Recognition Systems", Proceedings of World Academy of Science, Engineering and Technology, Volume 13, May 2006, ISSN 1307-6884.