INTERNATIONAL COMPUTER SCIENCE INSTITUTE 1947 Center St.  Suite 600  Berkeley, California 94704-1198  (510) 643-9153  FAX (510) 643-7684


Simultaneous speech and speaker recognition using hybrid architecture
Dominique Genoud, Dan Ellis, Nelson Morgan
TR-99-012
July 1999

Abstract

This report summarizes the work done during the last six months at ICSI on speaker recognition and speaker adaptation.

1 Introduction

Automatic recognition of the human voice is often divided into speech recognition and speaker recognition. The two areas use the same input signal (the voice), but not for the same purpose: speech recognition aims to recognize the message uttered by any speaker, while speaker recognition aims to identify the person who is talking. However, more and more applications need to use both kinds of information simultaneously. Some current examples given below illustrate this tendency.

State-of-the-art speech recognition systems tend to be speaker independent, both by using models (phonemes, diphones, triphones) estimated on huge databases containing numerous speakers, and by using parameterizations which try to suppress speaker-dependent characteristics (PLP, RASTA-PLP). However, for some types of applications it can be important to readapt the speaker-independent speech recognizer to a defined speaker, in order to improve noise robustness, for example, or simply to improve speech recognition performance by adding some knowledge of the speaker. Recent results show that speaker adaptation of a speech recognizer improves system performance [DARPA, 1998].

Nowadays, numerous applications performing speech information retrieval require the automatic extraction of the content of shows and the retrieval of the speech of a particular speaker on a particular subject. In this case speech recognition and speaker recognition should be carried out in parallel. Furthermore, the detection of speaker changes in a conversation (speaker A / speaker B, or speaker / music) may also be very useful for the indexing and labeling of the huge databases available. Finally, speaker recognition is needed for applications like secured voice access to information (such as a bank account or a voice-mail box). In this case, the speaker recognition can be text independent if the content of the utterance is not checked.
However, better results are obtained by using text-dependent speaker recognition, both because what is said can be controlled and because more accurate models (phonemes, words) can be built. In any case, text-dependent speaker recognition has to be preceded by a speech recognition step, to control and segment the message properly. All these applications show the need for simultaneous speaker and speech recognition. This report shows that some possibilities exist to carry out these two tasks simultaneously.

2 Background

This section recalls some useful notions from the speech and speaker recognition domains.

2.1 Some definitions

2.1.1 Types of speakers

The speakers who have to be identified (or verified) by our system will be called registered speakers, RS; the speakers who attempt to impersonate the registered speakers will be called impostors. We also use the voices of many other speakers, which constitute the world speakers.

2.1.2 Types of speaker recognition applications

Speaker recognition applications are classified by their text dependency: they can be text dependent, text independent, or text prompted; the two latter cases imply a control of the text. Speaker recognition applications can also be classified by the way the identity of the speaker is checked: if the voice of an unknown speaker is compared directly to the references of the enrolled speakers, we perform an identification of the speaker. If the identity is claimed by another means (password, identification number, etc.), we perform a verification that the input voice belongs to the identified speaker.

2.1.3 Types of speech recognition applications

Speech recognition applications can be speaker dependent or speaker independent; however, this notion becomes a little fuzzy when speaker adaptation of a speaker-independent speech recognizer is performed. Speech recognizers can also be classified by the size of the vocabulary they handle, as small-, medium- or large-vocabulary applications.

2.1.4 Inside the systems

Speech/speaker recognition systems are built in two parts:

1. A training phase, where the parameters of the models of the speech recognizer or of the registered speakers are estimated, using known target sentences.

2. A test phase, where unknown sentences uttered by registered speakers (often called true speaker tests) or by impostors are given to the system. The speech recognizer produces a sequence of words using the models estimated in the training phase. The speaker recognizer produces a score which is compared to a threshold; if the score is greater than the threshold, the utterance is accepted as pronounced by the RS being tested.

3. A third phase is often used to set the a priori thresholds used to measure the performance of the speaker recognition system ([Bimbot and Genoud, 1997; Pierrot et al., 1998]). However, in this report a theoretical speaker-independent threshold will be used.

2.2 The Log Likelihood Ratio (LLR) as speaker verification score

When using statistical algorithms, the LLR is the main score computed in speaker recognition, because of its strong relationship with the statistical models themselves ([Green and Swets, 1988; Scharf, 1991]). The decision to accept or reject the utterance of a registered speaker can be seen as a test of hypothesis H0 (the speech segment belongs to the RS) against H1 (the speech segment does not belong to the RS), which is equivalent to comparing the conditional probabilities of an event X given the hypotheses H0 and H1 (see equation 1).

$H_0$: registered speaker, $H_1 = \bar{H}_0$

$$ P(H_0) \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; P(H_1); \qquad P(H_0 \mid X) \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; P(H_1 \mid X) \qquad (1) $$

The quantities $P(H_0 \mid X)$ and $P(H_1 \mid X)$ are called the a posteriori probabilities of the hypotheses $H_0$ and $H_1$, respectively, knowing the event $X$.
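The hypothesis test above can be sketched numerically. In this minimal illustration (all numbers are invented), the observation X is reduced to two class-conditional likelihoods, and the posteriors are obtained with Bayes' rule before comparing them:

```python
def posterior_test(lik_h0, lik_h1, prior_h0=0.5, prior_h1=0.5):
    """Return 'accept' if P(H0|X) > P(H1|X), else 'reject'."""
    evidence = lik_h0 * prior_h0 + lik_h1 * prior_h1  # P(X)
    p_h0 = lik_h0 * prior_h0 / evidence
    p_h1 = lik_h1 * prior_h1 / evidence
    return "accept" if p_h0 > p_h1 else "reject"

print(posterior_test(0.8, 0.2))  # likelihood favours H0 -> accept
print(posterior_test(0.1, 0.4))  # likelihood favours H1 -> reject
```

With equal priors, the test reduces to comparing the two likelihoods directly, which is why the ratio form of equations 5-7 is used in practice.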

Since it is not possible to model the "non-registered speaker space" (virtually all the other speakers of this planet living, or having lived, at the same time as the RS), the hypothesis H1 is translated into: "the segment belongs to a speaker among a large set of people who are neither registered speakers of the application nor its possible impersonators". H1 will be modeled by a world model estimated from the voices of many speakers, excluding the registered speakers and the speakers used as tuning impersonators of the application. Translated into the speaker verification problem, for the test of a sequence of observations $O_t$ (a sequence of parameter vectors) knowing a statistical model $M_C$ for each RS and a model $M_W$ for the world, equation 1 can be rewritten as (equation 2):

$$ P(M_C \mid O_t) \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; P(M_W \mid O_t) \qquad (2) $$

Using Bayes' rule (equation 3), and if the a priori probabilities $P(M_C)$ and $P(M_W)$ are known (they are often assumed equal), the likelihoods $P(O_t \mid M_C)$ and $P(O_t \mid M_W)$ can be estimated (equation 4):

$$ P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \quad \text{(Bayes)} \qquad (3) $$

$$ P(M_C \mid O_t) = \frac{P(O_t \mid M_C)\,P(M_C)}{P(O_t)} \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; \frac{P(O_t \mid M_W)\,P(M_W)}{P(O_t)} = P(M_W \mid O_t) \qquad (4) $$

This can be transformed into:

$$ \frac{P(O_t \mid M_C)}{P(O_t \mid M_W)} \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; \frac{P(M_W)}{P(M_C)} \qquad (5) $$

The estimation of the two likelihoods $P(O_t \mid M_C)$ and $P(O_t \mid M_W)$ is carried out using a maximum likelihood estimation function [Scharf, 1991]. The quantity LR is the likelihood ratio of the observation sequence $O_t$ knowing the two models $M_C$ and $M_W$. Equation 5 becomes:

$$ LR = \frac{L(O_t; M_C)}{L(O_t; M_W)} \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; \frac{P(M_W)}{P(M_C)} \qquad (6) $$

In order to reduce the computational load, and because the logarithm is monotonic, the log of equation 6 is taken (equation 7), which defines the Log Likelihood Ratio (LLR):

$$ LLR(M_C, M_W, O_t) = \log L(O_t; M_C) - \log L(O_t; M_W) \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; \log \frac{P(M_W)}{P(M_C)} \qquad (7) $$

with $\log L(O_t; M_C)$ and $\log L(O_t; M_W)$ the log likelihoods of the RS and of the world, respectively, computed on the observation sequence $O_t$.
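The LLR decision of equation 7 can be sketched with toy models. The sketch below stands in single 1-D Gaussians for the report's actual speaker and world models, assumes the frames of $O_t$ are independent, and uses invented means, variances and observations purely for illustration:

```python
import math

def gauss_logpdf(x, mu, sigma):
    """Log density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def llr(observations, model_c, model_w):
    """log L(Ot; MC) - log L(Ot; MW), summing per-frame log likelihoods."""
    ll_c = sum(gauss_logpdf(x, *model_c) for x in observations)
    ll_w = sum(gauss_logpdf(x, *model_w) for x in observations)
    return ll_c - ll_w

model_c = (1.0, 0.5)   # hypothetical registered-speaker model (mean, std)
model_w = (0.0, 1.0)   # hypothetical world model (mean, std)
obs = [0.9, 1.1, 1.2, 0.8]          # frames lying close to the RS model
threshold = math.log(0.5 / 0.5)     # log(P(MW)/P(MC)) = 0 with equal priors
decision = "accept" if llr(obs, model_c, model_w) > threshold else "reject"
print(decision)  # -> accept
```

With equal priors the threshold is 0, so the sign of the LLR alone decides; frames drawn near the world model's mean would flip the decision to "reject".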

2.2.1 Error function in speaker recognition

In a practical application, we try to minimize the total cost of errors. For this purpose a cost function can be defined as the sum of the errors made by wrongly accepting speech utterances which do not belong to the RS (False Acceptance, FA) and wrongly rejecting speech utterances which do belong to the RS (False Rejection, FR):

$$ c_{tot} = c_{fr} \cdot P(C) \cdot E(FR \mid C) + c_{fa} \cdot P(\bar{C}) \cdot E(FA \mid \bar{C}) \qquad (8) $$

with $c_{fr}$ and $c_{fa}$ the costs of a false rejection and of a false acceptance, respectively. These costs are established according to the application needs. $P(C)$ and $P(\bar{C})$ are the a priori probabilities that the test sequence belongs to the RS or not. $E(FR \mid C)$ and $E(FA \mid \bar{C})$ are the false rejection (i.e. falsely rejecting an utterance belonging to the registered speaker) and false acceptance (i.e. falsely accepting an utterance of an impostor) error rates of the system. It can be shown that minimizing the cost function $c_{tot}$ comes down to adding the error costs $c_{fr}$ and $c_{fa}$ to equation 4. Moreover, equation 4 can then be re-written as:
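Equation 8 is a direct weighted sum, as a quick numerical check shows. The numbers below are invented: equal unit costs, a 50% prior of a true-speaker test, a 5% false-rejection rate and a 2% false-acceptance rate:

```python
def total_cost(c_fr, c_fa, p_client, e_fr, e_fa):
    """ctot = cfr * P(C) * E(FR|C) + cfa * P(~C) * E(FA|~C)  (equation 8)."""
    return c_fr * p_client * e_fr + c_fa * (1 - p_client) * e_fa

print(round(total_cost(1.0, 1.0, 0.5, 0.05, 0.02), 3))  # -> 0.035
```

Raising $c_{fa}$ relative to $c_{fr}$ penalizes false acceptances more heavily, which, as the derivation below shows, pushes the decision threshold upward.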

$$ \frac{P(O_t \mid M_C)\,P(M_C)\,c_{fr}}{P(O_t)} \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; \frac{P(O_t \mid M_W)\,P(M_W)\,c_{fa}}{P(O_t)} \qquad (9) $$

Which, using equations 5, 6 and 7, eventually leads to the estimation of the following equation:

$$ LLR(M_C, M_W, O_t) = \log L(O_t; M_C) - \log L(O_t; M_W) \;\overset{\text{accept}}{\underset{\text{reject}}{\gtrless}}\; \log \frac{P(M_W)\,c_{fa}}{P(M_C)\,c_{fr}} $$
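Following equation 9, the LLR acceptance threshold becomes $\log(P(M_W) c_{fa} / (P(M_C) c_{fr}))$. A small sketch (priors and costs invented) of how the error costs shift this threshold:

```python
import math

def llr_threshold(p_client, c_fr, c_fa):
    """Cost-weighted LLR acceptance threshold: log(P(MW)*cfa / (P(MC)*cfr))."""
    return math.log(((1 - p_client) * c_fa) / (p_client * c_fr))

print(round(llr_threshold(0.5, 1.0, 1.0), 3))   # equal priors and costs -> 0.0
print(round(llr_threshold(0.5, 1.0, 10.0), 3))  # costly false acceptances -> 2.303
```

Making false acceptances ten times more expensive raises the threshold by $\log 10 \approx 2.3$, so an utterance must score much higher against the world model before being accepted.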