Robust Speaker Recognition Over Varying Channels
Report from JHU workshop 2008

Lukáš Burget (1), Niko Brümmer (2), Douglas Reynolds (3), Patrick Kenny (4), Jason Pelecanos (6), Robbie Vogt (7), Fabio Castaldo (5), Najim Dehak (4), Reda Dehak (12), Ondřej Glembek (1), Zahi N. Karam (3), John Noecker Jr. (9), Elly (Hye Young) Na (10), Ciprian Constantin Costin (11), Valiantsina Hubeika (1), Sachin Kajarekar (8), Nicolas Scheffer (8), and Jan "Honza" Černocký (editor) (1)

(1) Brno University of Technology, Czech Republic, (2) Agnitio, Spain, (3) MIT Lincoln Labs, USA, (4) Centre de Recherche en Informatique de Montreal, Canada, (5) Polytechnic University of Turin, Italy, (6) IBM, USA, (7) Queensland University of Technology, Australia, (8) SRI International, USA, (9) Duquesne University, USA, (10) George Mason University, USA, (11) Alexandru Ioan Cuza University, Romania, (12) EPITA, France.

Summary of planned work

Speaker recognition is nowadays relatively mature in its basic scheme, where a speaker model is trained using target-speaker speech and speech from a large number of non-target speakers. However, the speech from non-target speakers is typically used only for finding the general speech distribution (e.g. the UBM); it is not used to find the "directions" important for discriminating between speakers. This scheme is reliable when the training and test data come from the same channel, but all current speaker recognition systems are prone to errors when the channel changes (for example from IP telephone to mobile). In speaker recognition, "channel" variability can also include the linguistic content of the message, emotions, etc. - all factors that a speaker recognition system should not be sensitive to. Several techniques, such as feature mapping, eigenchannel adaptation and NAP (nuisance attribute projection), have been devised in past years to overcome channel variability. These techniques make use of the large amount of data from many speakers to find and ignore directions with high within-speaker variability. However, they still do not use the data to directly search for directions important for discriminating between speakers.

In an attempt to overcome the above-mentioned problem, the research will concentrate on utilizing the large amount of training data currently available to the research community to derive the information that can help discriminate among speakers and to discard the information that cannot. We propose direct identification of the directions in model parameter space that are most important for discrimination between speakers. According to our experience from speech and language recognition, the use of discriminative training should significantly improve the performance of acoustic SID systems. We also expect that discriminative training will make explicit modeling of channel variability unnecessary. The research will be based on an excellent baseline - the STBU system for the NIST 2006 SRE evaluations (NIST rules prohibit us from disclosing the exact position of the system in the evaluations). The data to be used during the workshop will include NIST SRE data (telephone), but we will not ignore the requests from the security/defense community and will also evaluate the investigated techniques on other data sources (meetings, web-radio, etc.) as well as on cross-channel conditions. The expected outcomes of the proposed research are:
1. a significant increase in the accuracy of current SID systems,
2. decreased dependency on the communication channel, the content of the message and other factors negatively affecting SID performance,
3. speaker identification and verification from very short speech segments.


Team members

Team Leader
Lukas Burget, [email protected], Brno University of Technology

Senior Researchers
Niko Brummer, [email protected], Agnitio
Patrick Kenny, [email protected], Centre de Recherche en Informatique de Montreal
Jason Pelecanos, [email protected], IBM
Douglas Reynolds, [email protected], MIT Lincoln Labs
Robbie Vogt, [email protected], Queensland University of Technology

Graduate Students
Fabio Castaldo, [email protected], Polytechnic University of Turin
Najim Dehak, [email protected], Ecole de Technologie Superieure
Reda Dehak, [email protected], EPITA
Ondrej Glembek, [email protected], Brno University of Technology
Zahi Karam, [email protected], Massachusetts Institute of Technology

Undergraduate Students
John Noecker Jr., [email protected], Duquesne University
Elly (Hye Young) Na, [email protected], George Mason University
Ciprian Constantin Costin, [email protected], Alexandru Ioan Cuza University
Valiantsina Hubeika, [email protected], Brno University of Technology

Affiliates
Sachin Kajarekar, [email protected], SRI International
Nicolas Scheffer, [email protected], SRI International

Acknowledgements

This research was conducted under the auspices of the 2008 Johns Hopkins University Summer Workshop, and partially supported by NSF Grant No. IIS-0705708 and by a gift from Google, Inc. BUT researchers were partly supported by the European project AMIDA (IST-033812), by the Grant Agency of the Czech Republic under project No. 102/05/0278 and by the Czech Ministry of Education under project No. MSM0021630528. Lukáš Burget was supported by the Grant Agency of the Czech Republic under project No. GP102/06/383. The hardware used in this work was partially provided by CESNET under projects Nos. 162/2005 and 201/2006. Thanks to Tomáš Kašpárek (BUT), who provided the JHU team with excellent computer support and allowed for efficient use of the BUT computing cluster during the workshop.


Contents

1 Introduction  7
  1.1 Role of NIST evaluations  7
  1.2 Sub-groups  7
    1.2.1 Diarization using JFA  8
    1.2.2 Factor Analysis Conditioning  8
    1.2.3 SVM-JFA and fast scoring  8
    1.2.4 Discriminative System Optimization  9

2 Overview of JFA  10
  2.1 Supervector model  10
  2.2 Generative ML training  11
  2.3 JFA operation  11
  2.4 Gender dependency  12

3 Factor analysis based approaches to speaker diarization  13
  3.1 Introduction  13
  3.2 Diarization Systems  13
    3.2.1 Agglomerative Clustering System (Baseline)  14
    3.2.2 Variational Bayes System  14
    3.2.3 Streaming Systems  16
    3.2.4 Hybrid System  16
  3.3 Experiment Design  16
    3.3.1 Data  16
    3.3.2 Measures of Performance  17
  3.4 Results  18
  3.5 Conclusions  18

4 Factor analysis conditioning  20
  4.1 Introduction  20
  4.2 A Phonetic Analysis  20
  4.3 Factor Analysis Combination Strategies  22
    4.3.1 Introduction  22
    4.3.2 Systems and protocol  23
    4.3.3 Combination Strategies  25
    4.3.4 Conclusion  28
  4.4 Within Session Variability Modelling  29
    4.4.1 Joint Factor Analysis with Short Utterances  29
    4.4.2 Extending JFA to Model Within-Session Variability  31
    4.4.3 Implementation  32
    4.4.4 Experiments  34
    4.4.5 Summary and Future Directions  37
  4.5 Multigrained Factor Analysis  38
    4.5.1 Introduction  38
    4.5.2 Multi-Grained Approach  38
    4.5.3 Results  40
    4.5.4 Conclusions  41
  4.6 Summary  41

5 Support vector machines and joint factor analysis for speaker verification  43
  5.1 Introduction  43
  5.2 Joint Factor analysis  43
  5.3 SVM-JFA  44
    5.3.1 GMM Supervector space  44
    5.3.2 Speaker factors space  45
    5.3.3 Speaker and Common factors space  46
  5.4 Experimental setup  46
    5.4.1 Test set  46
    5.4.2 Acoustic features  46
    5.4.3 Factor analysis training  46
    5.4.4 SVM impostors  46
    5.4.5 Within Class Covariance  47
  5.5 Results  47
    5.5.1 SVM-JFA: GMM supervector space  47
    5.5.2 SVM-JFA: speaker factors space  47
    5.5.3 SVM-JFA: speaker and common factors space  48
  5.6 Conclusion  49

6 Handling variability with support vector machines  50
  6.1 Introduction  50
  6.2 Motivation  50
  6.3 Handling Nuisance Variability  51
  6.4 Should All Nuisance be Treated Equally  53
  6.5 Using Inter-speaker Variability  54
  6.6 Incorporating All Variability  55
  6.7 Probabilistic Interpretation  55
  6.8 Artificially Extending the Target Data  56
  6.9 Experimental Results  56
  6.10 Conclusion  57

7 Comparison of Scoring Methods used in Speaker Recognition with Joint Factor Analysis  58
  7.1 Introduction  58
  7.2 Theoretical background  58
    7.2.1 Frame by Frame  59
    7.2.2 Integrating over Channel Distribution  59
    7.2.3 Channel Point Estimate  60
    7.2.4 UBM Channel Point Estimate  60
    7.2.5 Linear Scoring  61
  7.3 Experimental setup  62
    7.3.1 Test Set  62
    7.3.2 Feature Extraction  62
    7.3.3 JFA Training  62
    7.3.4 Normalization  62
    7.3.5 Hardware and Software  62
  7.4 Results  63
    7.4.1 Speed  63
  7.5 Conclusions  64

8 Discriminative optimization of speaker recognition systems  65
  8.1 Introduction  65
  8.2 Motivation for discriminative training  65
    8.2.1 Envisioned advantages  66
  8.3 Challenges of discriminative training  67
  8.4 Solutions for discriminative training  68
    8.4.1 Discriminative Training Objective Function  68
    8.4.2 Regularization  69
    8.4.3 Optimization algorithms  70
    8.4.4 Computation of derivatives  71
  8.5 Experiments  71
    8.5.1 Small scale experiment  72
    8.5.2 Large Scale Experiments  73
    8.5.3 Experiment with ML trained eigenchannels  75
    8.5.4 Conclusion  75

9 Summary and conclusions  77

Chapter 1

Introduction

The largest challenge to practical use of speaker detection systems is channel/session variability, where "variability" refers to changes in channel effects between training and successive detection attempts. Channel/session variability encompasses several factors:
• microphones – carbon-button, electret, hands-free, array, etc.
• acoustic environment – office, car, airport, etc.
• transmission channel – landline, cellular, VoIP, etc.
• differences in speaker voice – aging, mood, spoken language, etc.
Recent experiments by several sites on NIST 2008 data have shown that the performance of a speaker verification system degrades from about 1% EER when the same microphone is used in training and test to about 3% EER when different microphones are used. The main vehicle for fighting this unwanted variability in this workshop was Joint Factor Analysis (JFA), which dominated the NIST 2008 SRE evaluation.

1.1 Role of NIST evaluations

The role of the NIST SRE evaluations (annual evaluations of speaker verification technology, run since 1995 using a common paradigm for comparing technologies; see http://nist.gov/speech/tests/sre/) goes beyond providing the data and metrics for the workshop: all the team members participated in the recent 2008 NIST evaluation, so the JHU workshop can be seen as a great opportunity to:
• do common post-evaluation analysis of our systems
• combine and improve techniques developed by individual sites
Thanks to the NIST evaluations we have:
• identified some of the current problems that we worked on
• a well-defined setup and evaluation framework
• baseline systems that we were trying to extend and improve during the workshop

1.2 Sub-groups

The work in the workshop was split into four work-groups:


1.2.1 Diarization using JFA

Problem Statement
• Diarization is an important upstream process for real-world multi-speaker speech
• At one level, diarization depends on accurate speaker discrimination for change detection and clustering
• JFA and Bayesian methods hold the promise of providing improvements to speaker diarization

Goals
• Apply diarization systems to summed telephone speech and interview microphone speech
  – Baseline segmentation/agglomerative clustering
  – Streaming system using speaker-factor features
  – New variational Bayes approach using eigenvoices
• Measure performance in terms of DER and the effect on speaker detection

For more details, see Chapter 3.

1.2.2 Factor Analysis Conditioning

Problem Statement
• A single FA model is sub-optimal across different conditions
• E.g., different durations, phonetic content and recording scenarios

Goals – explore two approaches:
• Build FA models specific to each condition and robustly combine multiple models
• Extend the FA model to explicitly model the condition as another source of variability

For more details, see Chapter 4.

1.2.3 SVM–JFA and fast scoring

Problem Statement
• The support vector machine (SVM) is a discriminative recognizer which has proved to be useful for SRE
• Parameters of generative GMM speaker models are used as features for linear SVMs (sequence kernels)
• We know Joint Factor Analysis provides higher-quality GMMs, but using these as-is in SVMs has not been so successful.

Goals
• Analysis of the problem
• Redefinition of SVM kernels based on JFA
• Application of JFA vectors to recently proposed and closely related bilinear scoring techniques which do not use SVMs

For more details, see Chapters 5, 6 and 7.

1.2.4 Discriminative System Optimization

Problem Statement
• Discriminative training has proved very useful in speech and language recognition, but has not been investigated in depth for speaker recognition
• In both speech and language recognition, the classes (phones, languages) are modeled with generative models, which can be trained with copious quantities of data
• In speaker recognition, however, our speaker GMMs have at best a few minutes of training data, typically from only one recording of the speaker

Goals
• Reformulate the speaker recognition problem as binary discrimination between pairs of recordings which can be (i) of the same speaker, or (ii) of two different speakers
• We then have lots of training data for these two classes and can afford to train complex discriminative recognizers

For more details, see Chapter 8.


Chapter 2

Overview of JFA

Joint factor analysis (JFA) is a two-level generative model of how different speakers produce speech and how their (remotely) observed speech may differ on different occasions (or sessions). The hidden, deeper level is the joint factor analysis part that models the generation of speaker- and session-dependent GMMs. The output level is the GMM generated by the hidden level, which in turn generates the sequence of feature vectors of a given session. The GMM part needs no further introduction. As is customary in speaker recognition, all of the GMMs differ only in the mean vectors of the components [Reynolds et al., 2000]; the component weights and the variances are the same for all speakers and sessions. The session-dependent GMM component means are modeled as

M_{ki} = m_k + U_k x_i + V_k y_{s(i)} + D_k z_{ks(i)}    (2.1)

Here the indices are: k for the GMM component, i for the session, and s(i) for the speaker in session i. The system hyperparameters are:
• m_k, the speaker- and session-independent mean vector;
• U_k, the rectangular channel-factor loading matrix;
• V_k, the rectangular speaker-factor loading matrix;
• D_k, the diagonal speaker-residual scaling matrix.
The hidden speaker and session variables are:
• x_i, the session-dependent vector of channel factors;
• y_s, the speaker-dependent vector of speaker factors;
• z_{ks}, the speaker- and component-dependent vector of speaker residuals.
Standard normal distributions are used as priors for all of these hidden variables.

2.1 Supervector model

We can summarize our JFA model by stacking the component-dependent hyperparameters into larger matrices:

V = \begin{bmatrix} V_1 \\ V_2 \\ \vdots \end{bmatrix}, \quad U = \begin{bmatrix} U_1 \\ U_2 \\ \vdots \end{bmatrix}, \quad D = \begin{bmatrix} D_1 & 0 & \cdots \\ 0 & D_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}    (2.2)

We refer to V as the eigenvoice matrix, to U as the eigenchannel matrix, and to D as the residual scaling matrix. By also stacking the component-dependent vectors into larger vectors, which we shall refer to as supervectors,

M_i = \begin{bmatrix} M_{1i} \\ M_{2i} \\ \vdots \end{bmatrix}, \quad m = \begin{bmatrix} m_1 \\ m_2 \\ \vdots \end{bmatrix}, \quad z_s = \begin{bmatrix} z_{1s} \\ z_{2s} \\ \vdots \end{bmatrix},    (2.3)

the JFA model can be expressed succinctly in supervector form as

M_i = m + U x_i + V y_{s(i)} + D z_{s(i)}    (2.4)
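To make the supervector composition in (2.4) concrete, the following is a minimal numpy sketch with toy dimensions; all variable names and sizes here are illustrative assumptions, not parameters of any system described in this report.

```python
import numpy as np

def session_supervector(m, U, V, D, x, y, z):
    """Compose a speaker- and session-dependent mean supervector M_i as in (2.4)."""
    return m + U @ x + V @ y + D @ z

# Toy dimensions: C components, Dim-dimensional features, Rc channel and Rs speaker factors.
C, Dim, Rc, Rs = 4, 3, 2, 2
rng = np.random.default_rng(0)
m = rng.normal(size=C * Dim)                      # UBM mean supervector
U = rng.normal(size=(C * Dim, Rc))                # eigenchannel matrix
V = rng.normal(size=(C * Dim, Rs))                # eigenvoice matrix
D = np.diag(rng.uniform(0.1, 1.0, size=C * Dim))  # residual scaling matrix
x = rng.normal(size=Rc)                           # channel factors (standard normal prior)
y = rng.normal(size=Rs)                           # speaker factors
z = rng.normal(size=C * Dim)                      # speaker residuals
M = session_supervector(m, U, V, D, x, y, z)
```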

2.2 Generative ML training

In this section we give a rough summary of how the hyperparameters of a JFA system may be trained. The steps are as follows:

1. Train the universal background model (UBM) on a large selection of development data, possibly on all of it. The UBM is a GMM trained by maximum likelihood (ML), via appropriate initialization followed by multiple iterations of the EM algorithm. The UBM essentially provides the following functionality:
• Its component means are a good choice to use for the speaker- and session-independent supervector m; its variances and weights are a good choice to use for all speaker- and session-dependent GMM variances and weights.
• It parametrizes a computationally efficient approximation to all GMM log-likelihoods, used during training and operation of the JFA system. Specifically, all GMM log-likelihoods are approximated by the EM-algorithm auxiliary function [Minka, 1998], often denoted the 'Q-function' in the literature. Informally, given some GMM, we approximate log p(data|GMM) ≈ Q(UBM, GMM, data). All further processing makes use of this approximation.

2. Train the eigenvoice matrix V with an EM algorithm designed to optimize a maximum-likelihood criterion over a database of as many speakers as possible. Pool multiple sessions per speaker to attenuate intersession variation.

3. Given V as obtained above, and with D temporarily set to zero, train the eigenchannel matrix U with a similar EM algorithm over a database that has multiple sessions per speaker. This data should be rich in channel variation; the Mixer databases are very good for this purpose.

4. Finally (and optionally), train D with a similar EM algorithm on some held-out data.
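The eigenvoice estimation in step 2 can be sketched as a single EM iteration over per-speaker Baum-Welch statistics pooled across sessions. This is only an illustrative sketch under simplifying assumptions (statistics already collected against the UBM, no minimum-divergence re-estimation); it is not the exact recipe of any system in this report.

```python
import numpy as np

def em_eigenvoice_iteration(stats, m, Sigma, V):
    """One EM iteration for the eigenvoice matrix V.

    stats: list of (N, F) per speaker; N is (C,) zeroth-order and F is (C*D,)
           first-order Baum-Welch statistics pooled over the speaker's sessions.
    m:     (C*D,) UBM mean supervector
    Sigma: (C*D,) UBM diagonal covariances, flattened
    V:     (C*D, R) current eigenvoice matrix
    """
    C = stats[0][0].size
    D = V.shape[0] // C
    R = V.shape[1]
    A = np.zeros((C, R, R))          # per-component sum of N_c * E[y y']
    Cacc = np.zeros((C * D, R))      # sum of centred stats times E[y]'
    for N, F in stats:
        Fc = F - np.repeat(N, D) * m                     # centre stats on the UBM mean
        VtSi = V.T / Sigma                               # (R, C*D)
        prec = np.eye(R) + (VtSi * np.repeat(N, D)) @ V  # posterior precision of y
        cov = np.linalg.inv(prec)
        Ey = cov @ (VtSi @ Fc)                           # posterior mean of y
        Eyy = cov + np.outer(Ey, Ey)                     # posterior second moment
        for c in range(C):
            A[c] += N[c] * Eyy
        Cacc += np.outer(Fc, Ey)
    # M-step: per-component row block V_c = C_c A_c^{-1}
    V_new = np.empty_like(V)
    for c in range(C):
        rows = slice(c * D, (c + 1) * D)
        V_new[rows] = np.linalg.solve(A[c], Cacc[rows].T).T
    return V_new
```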

2.3 JFA operation

When used operationally, the steps performed by a JFA system to score a given trial, composed of one train and one test segment, can be described as follows:

1. Use the JFA model (2.4) and the train segment to make a MAP (maximum a posteriori) point-estimate of the target speaker model; the likelihood used here is the Q-function approximation and the prior is the standard normal distribution over the hidden variables. That is, the hidden variables x, y, and (optionally) z are jointly estimated. Then x is discarded and M is denoted the target speaker model. Note that this model now has an unspecified parameter x, because its value in a test segment will be different from its value in the train segment. This uncertainty is modeled by the standard normal prior over x.

2. Compute an approximation to the log-likelihood of the target speaker model, given the test segment data, log p(test segment|M). Good approximations to use here include [Glembek et al., 2009]:
• the Q-function approximation, where the unknown nuisance variable x is integrated out, see [Kenny et al., 2007b], equation 19;
• a linear simplification to the Q-function, where a MAP point-estimate of x is used. For computational efficiency, x is estimated relative to the UBM, i.e. with y = 0 and z = 0.

3. Compute the same approximation to the UBM log-likelihood, i.e. with y = 0 and z = 0. The raw score (or raw log-likelihood-ratio) is now the difference between the target model log-likelihood and the UBM log-likelihood.

4. Normalize the raw score by applying the following in order: (i) divide by the number of test frames, (ii) z-norm, (iii) t-norm.
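One fast variant of this scoring, referred to elsewhere in this report as linear scoring, amounts to a dot product between the target model's mean offset and the centred, channel-compensated first-order statistics of the test segment. The sketch below illustrates that form together with the normalization order of step 4; the exact details (for example whether the D z term is included, and how the cohorts are selected) differ between systems, so treat the function signatures as assumptions.

```python
import numpy as np

def linear_raw_score(m, Sigma, U, V, D_diag, y, z, x_test, N_test, F_test):
    """Raw trial score under the linear (dot-product) simplification.

    y, z:    speaker factors / residuals estimated from the enrollment segment
    x_test:  channel factors estimated on the test segment against the UBM
    N_test:  (C,)   zeroth-order Baum-Welch statistics of the test segment
    F_test:  (C*D,) first-order statistics of the test segment
    D_diag:  (C*D,) diagonal of the residual scaling matrix D
    """
    feat_dim = F_test.size // N_test.size
    offset = V @ y + D_diag * z                      # target model mean offset from the UBM
    F_tilde = F_test - np.repeat(N_test, feat_dim) * (m + U @ x_test)
    return float(offset @ (F_tilde / Sigma))

def normalize_score(raw, n_frames, z_mean, z_std, t_cohort_scores):
    """Apply the normalizations of step 4 in order: frame count, z-norm, then t-norm."""
    s = raw / n_frames
    s = (s - z_mean) / z_std
    return (s - np.mean(t_cohort_scores)) / np.std(t_cohort_scores)
```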

2.4 Gender dependency

JFA systems benefit from gender-dependent components:
• Some, like the CRIM system at SRE'08, are trained from the UBM onwards on gender-dependent data. This gives independent male and female systems, which can be used respectively for all-male or all-female trials.
• Others, like the BUT system at SRE'08, are trained on mixed data, but then use gender-dependent ZT-norm cohorts.


Chapter 3

Factor analysis based approaches to speaker diarization

This chapter reports on work examining new approaches to speaker diarization. Four different systems were developed and experiments were conducted using summed-channel telephone data from the 2008 NIST SRE. The systems are a baseline agglomerative clustering system, a new variational Bayes system using eigenvoice speaker models, a streaming system using a mix of low-dimensional speaker factors and classic segmentation and clustering, and a new hybrid system combining the baseline system with a new cosine-distance speaker-factor clustering. Results are presented using the diarization error rate (DER) as well as the EER when using the diarization outputs for a speaker detection task. The best configurations of the diarization systems produced DERs of 3.5-4.6%, and we demonstrate a weak correlation of EER and DER.

3.1 Introduction

Audio diarization is the process of annotating an input audio channel with information that attributes (possibly overlapping) temporal regions of signal energy to their specific sources. These sources can include particular speakers, music, background noise sources and other signal source/channel characteristics. Diarization systems are typically used as a pre-processing stage for other downstream applications, such as providing speaker and non-speech annotations to text transcripts or for adaptation of speech recognition systems. In this work we are interested in improving diarization to aid speaker recognition tasks where the training and/or the test data consist of speech from more than one speaker. In particular we focus on two-speaker telephone conversations and multi-microphone recorded interviews, as used in the latest NIST Speaker Recognition Evaluation (SRE; see http://www.nist.gov/speech/tests/sre/2008/ for more details). This chapter reports on work carried out at the 2008 JHU Summer Workshop examining new approaches to speaker diarization. Four different systems were developed and experiments were conducted using data from the 2008 NIST SRE. Results are presented using a direct measure of diarization error (diarization error rate, DER) as well as the effect of using diarization outputs for a speaker detection task (equal error rate, EER). Finally we conclude by showing the relation of DER to EER and summarize the effective components common to all systems.

3.2 Diarization Systems

Four systems were developed for the 2008 JHU Summer Workshop. They range from a baseline agglomerative clustering system, to a new system based on variational Bayes theory, to a streaming audio clustering system, and a new hybrid system using elements of the baseline system and newly developed speaker-factor distances.

3.2.1 Agglomerative Clustering System (Baseline)

The baseline system represents the framework of most widely used diarization systems [Tranter and Reynolds, 2006]. It consists of three main stages. In the first stage, speaker change points are detected using a Bayesian Information Criterion (BIC) based distance between abutting windows of feature vectors. The features for the baseline system consist of 13 cepstral coefficients (including c0) with no channel normalization. This technique searches for change points within a window using a penalized likelihood ratio test of whether the data in the window is better modeled by a single distribution (no change point) or two different distributions (change point). If a change is found, the window is reset to the change point and the search is restarted. If no change point is found, the window is increased and the search is redone. Full-covariance Gaussians are used as distribution models.

The purpose of the second stage is to associate, or cluster, segments from the same speaker. The clustering ideally produces one cluster for each speaker in the audio, with all segments from a given speaker in a single cluster. Hierarchical agglomerative clustering with a BIC-based stopping criterion is used, consisting of the following steps:
0. Initialize leaf clusters of the tree with speech segments.
1. Compute pair-wise distances between each cluster.
2. Merge the closest clusters.
3. Update distances of the remaining clusters to the new cluster.
4. Iterate steps 1-3 until the stopping criterion is met.
The clusters are represented by a single full-covariance Gaussian. Since we have prior knowledge that two speakers are present in the audio, we stop when we reach two clusters.

The last stage is iterative re-segmentation with GMM Viterbi decoding to refine change points and clustering decisions. Additionally, a form of Baum-Welch re-training of speaker GMMs using segment posterior-weighted statistics can be used before a final Viterbi segmentation. This step was inspired by the Variational Bayes approach and is also referred to as "soft-clustering."
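The BIC-based distance and the merging loop of the second stage can be sketched as follows. The penalty weight and the exact stopping rule used by the baseline system are not specified in this report, so treat them as illustrative assumptions.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """BIC merge criterion between two clusters of feature vectors modelled by
    full-covariance Gaussians; larger values mean the clusters look more different."""
    def logdet_cov(x):
        sign, logdet = np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))
        return logdet
    n1, n2 = len(x1), len(x2)
    n, d = n1 + n2, x1.shape[1]
    pooled = np.vstack([x1, x2])
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet_cov(pooled) - n1 * logdet_cov(x1) - n2 * logdet_cov(x2)) - penalty

def agglomerate_to_two(segments, lam=1.0):
    """Agglomerative clustering: repeatedly merge the closest pair until two clusters remain."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 2:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = delta_bic(clusters[i], clusters[j], lam)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```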

3.2.2 Variational Bayes System

This is one of the new systems developed during the workshop and is based on the Variational Bayes method of speaker diarization described by Valente [Valente, 2005]. Work on this system was motivated by the desire to build on the success of factor analysis methods in speaker recognition and to capitalize on some of the advantages a Bayesian approach may bring to the diarization problem (e.g., EM-like convergence guarantees, avoiding premature hard decisions, automatic regularization). To build on the factor analysis work, we begin by using an eigenvoice model to represent the speakers. The assumption in eigenvoice modeling is that supervectors (a supervector being the concatenation of the mean vectors of a Gaussian mixture model) have the form s = m + V y. Here s is a randomly chosen speaker-dependent supervector; m is a speaker-independent supervector (i.e., the UBM); V is a rectangular matrix of low rank whose columns are referred to as eigenvoices; the vector y has a standard normal distribution; and the entries of y are the speaker factors. From the point of view of Bayesian statistics, this is a highly informative prior distribution as it imposes

severe constraints on speaker supervectors. Although supervectors typically have tens of thousands of dimensions, this representation constrains all supervectors to lie in an affine subspace of the supervector space whose dimension is typically at most a few hundred. The subspace in question is the affine subspace containing m which is spanned by the columns of V.

In the Variational Bayes diarization algorithm, we start with an audio file in which we assume there are just two speakers, and a partition of the file into short segments, each containing the speech of just one of the speakers. This partitioning need not be very accurate; a uniform partition into one-second intervals can be used to begin with, and this assumption can be relaxed in a second pass. We define two types of posterior distribution, which we refer to as speaker posteriors and segment posteriors. For each of the two speakers, the speaker posterior is a Gaussian distribution on the vector of speaker factors which models the location of the speaker in the speaker factor space. The mean of this distribution can be thought of as a point estimate of the speaker factors, and the covariance matrix as a measure of the uncertainty in the point estimate. For each segment, there are two segment posteriors q1 and q2; q1 is the posterior probability of the event that the speaker in the segment is speaker 1, and similarly for speaker 2. The Variational Bayes algorithm consists in estimating these two types of posterior distribution alternately, as explained in detail in [Kenny, 2008]. At convergence, it is normally the case that q1 and q2 take values of 0 or 1 for each segment, but q1 and q2 are initialized randomly, so the Variational Bayes algorithm can be thought of as performing a type of soft speaker clustering, as distinct from the hard decision making in the agglomerative clustering phase of the baseline system.

The Variational Bayes algorithm can be summarized as follows:

Begin:
• Partition the file into 1 second segments and extract Baum-Welch statistics from each segment
• Initialize the segment posteriors randomly
• No initialization is needed for the speaker posteriors

On each iteration of Variational Bayes:
• For each speaker s:
  – Synthesize Baum-Welch statistics for the speaker by weighting the Baum-Welch statistics for each segment by the corresponding segment posterior qs
  – Use the synthetic Baum-Welch statistics to update the speaker posterior
• For each segment:
  – Update the segment posteriors for each speaker

End:
• Baum-Welch estimation of speaker GMMs together with iterative Viterbi re-segmentation (as in the baseline system)

In the Variational Bayes system, 39-dimensional feature vectors derived from HLDA transforms of 13 cepstra (including c0) plus single, double and triple deltas are used. The cepstra were processed with short-term (300 frame) Gaussianization. For the re-segmentation, 13 un-normalized cepstra (c0-c12) were used. The eigenvoice analysis used 512-mixture GMMs and 200 speaker factors.
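A compact sketch of this alternating update is given below. It simplifies the speaker posterior to a point estimate of the speaker factors and scores segments with a GMM Q-function difference, so it only illustrates the structure of the algorithm, not the exact updates of [Kenny, 2008]; shapes and variable names are assumptions.

```python
import numpy as np

def vb_diarization_sketch(N_seg, F_seg, V, Sigma, n_iter=10, seed=0):
    """Two-speaker diarization loop in the spirit of the algorithm above.

    N_seg: (S, C)    zeroth-order Baum-Welch statistics per segment and component
    F_seg: (S, C*D)  first-order statistics per segment, centred on the UBM means
    V:     (C*D, R)  eigenvoice matrix
    Sigma: (C*D,)    UBM diagonal covariances, flattened
    Returns segment posteriors q (S, 2) and speaker-factor point estimates y (2, R).
    """
    rng = np.random.default_rng(seed)
    S, C = N_seg.shape
    D = F_seg.shape[1] // C
    R = V.shape[1]
    q = rng.dirichlet([1.0, 1.0], size=S)       # random initialization of segment posteriors
    y = np.zeros((2, R))
    for _ in range(n_iter):
        for spk in range(2):
            # synthesize speaker statistics by weighting segment statistics with q
            Nw = q[:, spk] @ N_seg              # (C,)
            Fw = q[:, spk] @ F_seg              # (C*D,)
            # point estimate of the speaker factors (mean of the speaker posterior)
            VtSi = V.T / Sigma                  # (R, C*D)
            prec = np.eye(R) + (VtSi * np.repeat(Nw, D)) @ V
            y[spk] = np.linalg.solve(prec, VtSi @ Fw)
        # segment posterior update from per-speaker GMM Q-function differences
        logq = np.zeros((S, 2))
        for spk in range(2):
            off = V @ y[spk]                    # speaker mean offset from the UBM
            quad = (off * off / Sigma).reshape(C, D).sum(axis=1)
            logq[:, spk] = F_seg @ (off / Sigma) - 0.5 * (N_seg @ quad)
        logq -= logq.max(axis=1, keepdims=True)
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)
    return q, y
```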

3.2.3 Streaming Systems

In this section we describe another way to combine speaker diarization and joint factor analysis. Speaker diarization using factor analysis was first introduced in [Castaldo et al., 2008] using a stream-based approach. This technique performs on-line diarization, where a conversation is seen as a stream of fixed-duration time slices. The system operates in a causal fashion by producing segmentation and clustering for a given slice without requiring the following slices. Speakers detected in the current slice are compared with previously detected speakers to determine whether a new speaker has been detected or previous models should be updated.

Given an audio slice, a stream of cepstral coefficients and their first derivatives is extracted. With a small sliding window (about one second), a new stream of speaker factors (as described in the previous section) is computed and used to perform the slice segmentation. The dimension of the speaker factor space is quite small (about twenty) with respect to the number used for speaker recognition (about three hundred), due to the short estimation window. In this new space, the stream of speaker factors is clustered, obtaining a single multivariate Gaussian for each speaker. A BIC criterion is used to determine how many speakers there are in the slice. A Hidden Markov Model (HMM) is built with one state per speaker, each state using the corresponding Gaussian, and a slice segmentation is obtained through the Viterbi algorithm. In addition to the segmentation, a Gaussian Mixture Model (GMM) in the acoustic space is created for each speaker found in the audio slice. These models are used in the last step, slice clustering, where we determine whether a speaker in the current audio slice was present in previous slices or is a new one. Using an approximation to the Kullback-Leibler divergence, we find the closest speaker model built in previous slices to each speaker model in the current slice. If the divergence is below a threshold, the previous model is adapted using the model created in the current slice; otherwise the current model is added to the set of speaker models found in the audio. The final segmentation and speakers found from the on-line processing can be further refined using Viterbi re-segmentation over the entire file.
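The slice-linking step compares GMMs in the acoustic space with an approximate KL divergence. The sketch below uses one common matched-component approximation for mean-adapted GMMs that share the UBM weights and diagonal covariances; the exact approximation and threshold used in [Castaldo et al., 2008] are not specified here, so this is an illustration rather than a reproduction.

```python
import numpy as np

def approx_kl(means_a, means_b, weights, variances):
    """Matched-component approximation to the KL divergence between two GMMs that
    differ only in their means (shapes: means (C, D), weights (C,), variances (C, D))."""
    diff = means_a - means_b
    per_component = 0.5 * np.sum(diff * diff / variances, axis=1)
    return float(np.sum(weights * per_component))

def link_speakers(current_models, previous_models, weights, variances, threshold):
    """For each speaker model found in the current slice, return the index of the
    closest previously seen speaker, or None if a new speaker should be created."""
    links = []
    for cur in current_models:
        if previous_models:
            dists = [approx_kl(cur, prev, weights, variances) for prev in previous_models]
            best = int(np.argmin(dists))
            links.append(best if dists[best] < threshold else None)
        else:
            links.append(None)
    return links
```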

3.2.4 Hybrid System

The last system was motivated by other work at the 2008 JHU Workshop [Dehak et al., 2009] showing that good speaker recognition performance can be obtained using a simple cosine distance between speaker-factor vectors. The idea is to use the baseline agglomerative clustering system to build a tree up to a certain level where the nodes contain sufficient data, then to extract speaker factors for these nodes and continue the clustering process using the cosine distance. The critical factor in determining when to stop the initial clustering is the amount of speech in each node, since we wished to work with 200 speaker factors. Two approaches were used for stopping the initial clustering: level cutting and tree searching. Level cutting (upper panel of Figure 3.1) consists of merely running the clustering until a preset number of nodes exist - typically 5-15. Tree searching (lower panel of Figure 3.1) consists of building the entire tree, then searching the tree from the top down to select the set of nodes that have at least a preset amount of speech. As with the other systems, a final Viterbi re-segmentation is applied to refine the diarization.
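The cosine distance used to continue the clustering, and the selection of the closest pair of nodes, can be written as follows (a minimal sketch; variable names are illustrative):

```python
import numpy as np

def cosine_distance(y1, y2):
    """Cosine distance between two speaker-factor vectors extracted from tree nodes."""
    return 1.0 - float(np.dot(y1, y2) / (np.linalg.norm(y1) * np.linalg.norm(y2)))

def closest_pair(node_factors):
    """Return the index pair of the two nodes whose speaker-factor vectors are closest."""
    best = None
    for i in range(len(node_factors)):
        for j in range(i + 1, len(node_factors)):
            d = cosine_distance(node_factors[i], node_factors[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]
```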

3.3 Experiment Design

3.3.1 Data

Summed-channel telephone data from the 2008 SRE was used for diarization experiments with the above systems. This data was selected since we could derive reference diarization, needed for measuring DER, by using time marks from ASR transcripts produced on each channel separately. In addition, the data corresponded to one of the speaker detection tasks in the 2008 SRE, so we could measure the effect of diarization on EER. The test set consists of 2215 files of approximately five minutes duration each (≈ 200 hours total). To avoid confounding effects of mismatched speech/non-speech detection on the error measures, all diarization systems used a common set of reference speech activity marks for processing.

Figure 3.1: Level cutting and tree clustering.

3.3.2 Measures of Performance

As mentioned earlier, we used two measures of performance for the diarization systems. The diarization error rate (DER) is the more direct measure; it aligns a reference diarization output with a system diarization output and computes a time-weighted combination of miss, false alarm and speaker error (DER scoring code is available at www.nist.gov/speech/tests/rt/2006-spring/code/md-eval-v21.pl). Since all systems use reference speech activity marks, miss and false alarm, which are affected only by speech activity detection, are not used. Speaker error measures the percentage of time a system incorrectly associates speech from different speakers as being from a single speaker. In these results we report the average and standard deviation of the DER computed over the test set, to show both the average and the variation in performance for a given system.

To measure the effect of diarization on a speaker detection task, we used the diarization output in the recognition phase of one of the summed-channel telephone tasks from the 2008 SRE. In the 3conv-summed task, the speaker models are trained with three single-channel conversations and tested with a summed-channel conversation. The diarization output is used to split the test conversation into two speech files (presumably each from a single speaker), which are scored separately; the maximum of the two scores is the final detection score. A state-of-the-art Joint Factor Analysis (JFA)

speaker detection system developed by Loquendo [Vair et al., 2007] is used for all diarization systems. Results are reported in terms of the equal error rate (EER).

3.4 Results

In Table 3.1 we present DER results for some key configurations of the diarization systems. Overall we see that the final Viterbi re-segmentation significantly helps all diarization systems. For the baseline system, it was further seen that the soft-clustering inspired by the Variational Bayes system reduces the DER by almost 50%. The Variational Bayes system achieves similarly low DER when a second pass is added that relaxes the first-pass assumption of a fixed one-second segmentation. The streaming system had the best performance out of the box, with some further gains from the non-causal Viterbi re-segmentation. Disappointingly, the hybrid system did not achieve performance better than the original baseline. This may be due to the first-stage baseline clustering biasing the clusters too much, or to the inability to reliably extract 200 speaker factors from the small amounts of speech in the selected clusters.

Table 3.1: Mean and standard deviation of diarization error rates (DER) on the NIST 2008 summed-channel telephone data for various configurations of diarization systems.

System                               mean DER (%)   σ (%)
Baseline + Viterbi                        6.8        12.3
Baseline + soft-cluster + Viterbi         3.5         8.0
Var. Bayes                                9.1        11.9
Var. Bayes + Viterbi                      4.5         8.5
Var. Bayes + Viterbi + 2-pass             3.8         7.6
Stream                                    5.8        11.1
Stream + Viterbi                          4.6         8.8
Hybrid + Viterbi (level cut)             14.6        17.1
Hybrid + Viterbi (tree search)            6.8        13.6

Lastly, in Figure 3.2 we show the EER for the 3conv-summed task for different configurations of the above diarization systems. The end-point DER values of 0% and 35% represent using reference diarization and no diarization, respectively. We see that there is some correlation of EER with DER, but it is relatively weak. It appears that systems with a DER below 10% produce EERs within about 1% of the "perfect" diarization. To sweep out more points with higher DER, we ran the baseline system with no Viterbi re-segmentation (DER = 20%). While the EER did increase to 10.5%, it was still better than the no-diarization result of EER = 14.1%.

Figure 3.2: EER vs. DER for several diarization systems.

3.5 Conclusions

In this chapter we have reported on a study of several diarization systems developed during the 2008 JHU Summer Workshop. While each of the systems took a different approach to speaker diarization, we found that ideas and techniques proven in one system could also be successfully applied to the others. The Viterbi re-segmentation used in the baseline system was a very useful stage for the other systems, and the idea of soft-clustering from the Variational Bayes approach was incorporated into the agglomerative clustering baseline system to reduce the DER by almost 50%.

The best configurations of the diarization systems produced DERs of 3.5-4.6% on summed-channel conversational telephone speech. We further examined the impact of using different diarization systems, with varying DERs, on a speaker recognition task. While there was some weak correlation of EER with DER, it was not as direct as one would like in order to optimize diarization systems using DER independently of the recognition systems that use their output.

In future work we plan on applying these diarization systems to the interview recordings in the 2008 SRE. This new domain will present several new challenges, including variable acoustics due to microphone type and placement as well as different speech styles and dynamics between a face-to-face interviewer and interviewee.


Chapter 4

Factor analysis conditioning

4.1 Introduction

Factor Analysis (FA) modelling [Kenny, 2006] is a popular and effective mechanism for capturing variabilities in speaker recognition. However, it is recognized that a single FA model is sub-optimal across different conditions, for example when modelling utterances of different durations, phonetic content and recording configurations. In this chapter we begin to address these conditions by exploring two approaches: (1) building FA models specific to each condition and robustly combining multiple models, and (2) extending the FA model to explicitly model the condition as another source of variability. These approaches guide the study in four areas:
• A Phonetic Analysis
• Factor Analysis Combination Strategies
• Within Session Variability Modelling
• Multigrained Factor Analysis

The work stemming from these themes exploits the use of phonetic information in both enrollment and verification. Figure 4.1 illustrates the issue of phonetic variability across sessions. In the traditional factor analysis system, the phonetic variability component is largely ignored and is modelled indirectly as part of a larger within-session variability process – whether or not the phonetic instances were observed in all utterances. Section 4.2 provides an introductory study of the performance of phonetic events within an FA-type system. Section 4.3 discusses the use of different FA configurations (such as stacking and concatenation) and their effect on performance. Section 4.4 then investigates the issue of factor analysis for varied utterance durations, and finally, Section 4.5 examines one of the granularity assumptions of the implemented FA model.

4.2 A Phonetic Analysis

This section examines the relative performance of phonetically labelled events and the improvement attributed to cross-fusion of these categories in a factor analysis setup. This work validates the need for a conditioned analysis of the underlying processes being modelled. To demonstrate the importance of conditioning a system on the audio context, the results of an artificial experiment are presented. The results in Table 4.1 demonstrate the importance of balanced phonetic content in both enrollment and verification. The results are presented for the NIST 2006 Speaker Recognition Evaluation [National Institute of Standards and Technology, 2008], using a standard factor analysis system trained on the broad phonetic groups as classified by the BUT Hungarian phonetic recognizer.


Figure 4.1: A drawing indicating the breakdown of speech into phonetic categories in enrollment and test.

Table 4.1: Performance of systems when trained and tested on broad phonetic categories.

Enroll       Vowel (Test)              Consonant (Test)
             EER (%)    Min DCF        EER (%)    Min DCF
Vowel        4.50       0.0208         12.47      0.0537
Consonant    10.72      0.0521         7.03       0.0336

This result, albeit an extreme example, demonstrates the challenge of mismatched phonetic content. For example, if only consonants are used to enroll and verify a speaker, the EER is approximately 7%, while if only vowels are used in verification, the EER increases to more than 12%. Phonetic mismatch is pronounced for short-duration utterances and for utterances recorded with a different speech style. Not only are there performance differences attributed to speech content across enrollment and verification, but there are also performance differences between phones, as shown in Table 4.2 (note that the output from the Hungarian recognizer does not correspond to English phones and may be considered more an audio tokenizer). A follow-up plot (Figure 4.2) uses the data from Table 4.2 to present the performance of broad phonetic categories versus their relative duration in the utterance. Interestingly, the vowels tend to be the best performing, but they also comprise more of the speech in an utterance. A final experiment examines the performance of fusing the systems from two different phonetic events (optimally combined by linear fusion). The question this experiment attempts to address is whether the linear score fusion of two vastly different phonetic categories is more beneficial than the fusion of two similar phonetic classes. Figure 4.3 plots the performance of the score fusion of two phone classes versus the total duration of the combined phonetic classes. Intuition would suggest that phonetic diversity should help, but this was not observed to a significant degree in this experiment.


Table 4.2: Performance of systems when trained and tested on broad phonetic categories.

Phoneme   Type       % of speech   DET 1 EER (%)   DET 1 DCF   DET 3 EER (%)   DET 3 DCF
E         vowel      18.93         12.16           0.0567      8.62            0.0419
O         vowel      10.71         14.57           0.0645      12.30           0.0558
i         vowel      6.85          16.73           0.0749      15.49           0.0696
A:        vowel      5.89          23.31           0.0876      21.79           0.0852
n         nonvowel   5.44          19.08           0.0779      17.23           0.073
e:        vowel      4.73          25.31           0.0917      22.92           0.0866
k         stop       4.49          25.56           0.0926      22.26           0.0868
z         sibilant   4.25          29.73           0.098       28.22           0.0971
o         vowel      3.01          25.53           0.0924      25.24           0.0926
t         stop       2.76          27.04           0.0956      24.92           0.0936
s         sibilant   2.74          30.73           0.0965      27.63           0.0908
f         sibilant   2.41          34.43           0.0998      31.42           0.0984
j         nonvowel   2.38          25.00           0.0918      22.41           0.0862
v         sibilant   2.35          33.66           0.1         30.78           0.0992
m         nonvowel   2.29          21.18           0.0835      18.63           0.0782
S         sibilant   2.21          31.97           0.0959      31.74           0.0981
l         nonvowel   1.99          30.05           0.0974      29.91           0.0955


Figure 4.2: A plot of the phonetic performance of individual phones identified according to broad phonetic category.

4.3 Factor Analysis Combination Strategies

4.3.1 Introduction

Modeling variability in the model space is a major focus of the speaker recognition community. This work has proven particularly useful for channel compensation of speaker models. One of the most developed frameworks tackling this problem is Joint Factor Analysis (JFA), introduced by Patrick Kenny in [Kenny et al., 2005a]. This framework aims at factoring out two components of an utterance: the speaker and the nuisance component (usually called channel or session variability). The latter is commonly removed when training a speaker model.

This work (reproduced from our accepted ICASSP 2009 submission) aims to take advantage of developments in JFA in the context of a phonetically conditioned system. Previous work with phonetic systems has shown the ability to extract additional performance through phonetic conditioning [Kajarekar, 2008, Castaldo et al., 2007], although this advantage was not observed for a full factor analysis model. The particular focus of this work is to investigate strategies for combining the phone-conditioned JFA systems. Our hypothesis is that score-level combination is suboptimal and does not fully realize the potential advantages of a conditioned JFA system. Options for model-level

vowel with others vowel with vowel

Figure 4.3: A plot of the performance of fusing two phonetic events from within or across broad phone categories.

combination are presented and compared. We term the model combination strategies as supervector concatenation and subspace stacking, both illustrated in Figure 4.4. The motivation behind the supervector concatenation approach is to simultaneously present all the phone-conditioned statistics to the JFA model so that correlations and relationships between the phonetic conditions, as well as the differences, can be observed and modeled. This approach results in an increase in the dimension of the speaker model mean by a factor of the number of phonetic classes with no increase in the latent variable dimension. Alternatively, the subspace stacking approach combines subspace transforms from each phonetic context resulting in an increased dimension of the speaker, channel or both latent variables. It is hypothesized that this approach provides the flexibility for the observed data to select the most relevant subspace dimensions and has previously proven useful in the auxiliary microphone conditions of recent NIST SREs [Kenny et al., 2008c]. While the focus of this work is on phone-conditioned JFA systems, the implications may reach beyond this scope. We expect that investigating several possibilities using phonetic-events will lead to a better understanding of the JFA model and a methodology that can be applied to increase robustness to other kinds of conditions such as language, gender and microphone types.

Figure 4.4: Stacked vs. concatenated eigenvectors for 2 phonetic classes. The former enriches the model by projecting statistics on both classes, thus increasing the rank. The latter produces a more robust latent variable by tying the classes together, thus increasing the model size.

4.3.2 Systems and protocol

We describe the JFA framework, as well as the system and the phonetic decoder used for the experiments, before presenting the experimental protocol.

Joint Factor Analysis

Let us define the notation that will be used throughout this discussion. The JFA framework uses the distribution of an underlying GMM, the universal background model (UBM), with mean m0 and diagonal covariance Σ0. Let the number of Gaussians of this model be N and the feature dimension in each Gaussian be F. A supervector is the concatenation of the means of a GMM; its dimension is NF. The speaker component of the JFA model is a factor analysis model on the speaker GMM supervector, composed of a set of eigenvoices and a diagonal model. Precisely, the supervector ms of a speaker s is governed by

ms = m0 + V y + D z    (4.1)

where V is a tall matrix of dimension NF × RS related to the eigenvoices (or speaker loadings), which span a subspace of low rank RS, and D is the diagonal matrix of the factor analysis model, of dimension NF × NF. Two latent variables y and z entirely describe the speaker and are subject to the prior N(0, 1). The nuisance (or channel/session) supervector distribution also lies in a low-dimensional subspace, of rank RC. The supervector for an utterance h with speaker s is

mh = ms + U x    (4.2)

The matrix U, known as the eigenchannels (or channel loadings), has dimension NF × RC. The loadings U, V, D are estimated from a sufficiently large dataset, while the latent variables x, y, z are estimated for each utterance.

Baseline System Description

The speaker recognition system from Brno University of Technology (BUT) is used for the experiments. The baseline system employs a 512-Gaussian UBM. The features are warped Mel-frequency cepstral coefficients (MFCCs) composed of 19 cepstral features and one energy feature. First and second order derivatives are appended, for a total dimension of 60. The rank of the speaker space is 120, while the channel space rank is 60. A lower number of Gaussians as well as lower subspace ranks were selected to accommodate the multiple phone classes. To train the matrices, several iterations of the expectation maximization (EM) algorithm of the factor analysis framework are used. An alternative minimum divergence estimation (MDE) is used at the second iteration to scale the latent variables to a N(0, 1) distribution. To train a speaker model, the posteriors of x, y, z are computed using a single iteration (via the Gauss-Seidel method, as in [Vogt et al., 2005]). The verification score for each trial is a scalar product between the speaker model mean offset and the channel-compensated first-order Baum-Welch statistics centered around the UBM. This scalar product was found to be simple yet very effective [Brümmer, 2008] and was subsequently adopted by the JHU fast scoring group (Chapter 7). The speaker verification system is gender-independent with a gender-dependent score normalization (ZT-norm).

Phonetic Decoder

The phonetic decoder used for these experiments is an open-loop Hungarian phone decoder from BUT, Brno [Matejka et al., 2006]. The Hungarian language possesses a large phone set and enables the modeling of more nuances than an English set, which has been particularly useful in language identification tasks. For this work, we chose to cluster the phonemes into broader phonetic events. We used two different clusterings obtained in a supervised way by expertise:
• 2-class set: vowels (V), consonants (C)
• 4-class set: vowels (V), sibilants (Si), stops (St), non-vowels (NV).

To build a phonetically conditioned system, for example a vowel system, we first extract the feature vectors of an utterance corresponding to the occurrences of vowels in the phone transcription, to obtain phone-conditioned Baum-Welch statistics for the utterance (a schematic of this extraction is sketched at the end of this subsection). These statistics are used in exactly the same fashion as described above to build a full JFA model with phone-conditioned speaker and channel subspace matrices. The speaker and channel loadings will be subscripted with the notation adopted for each event in Table 4.3 (for instance, V_V will be the speaker loading for the vowel set).

Experimental Protocol

All experiments were performed on the all-trials condition of the NIST-SRE-2006 dataset. The data set consists of 3616 target trials and 47452 non-target trials. Results are given in terms of equal error rate (EER) and the minimum detection cost function (mDCF) defined by NIST. The factor analysis model uses the following data sets for training:

• The UBM is trained on Switchboard and Mixer data. For simplicity, we fixed the UBM for all phonetic events.
• The eigenvoices and eigenchannels are trained in a gender-independent fashion on the NIST SRE 04 data set, consisting of 304 speakers and 4353 sessions. The diagonal model is trained on 359 utterances coming from 57 speakers from SRE 04 and 05.
• The score normalization data (Z- and T-norm) was drawn from SRE 04 and 05, with around 300 utterances for each gender.
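As a rough illustration of the phone-conditioned statistics extraction described at the beginning of this subsection, the sketch below computes zero- and first-order Baum-Welch statistics under a diagonal-covariance UBM and restricts them to frames aligned to one phonetic class; the UBM parameters, frames and the vowel alignment mask are synthetic placeholders.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, covs):
    """Zero- and first-order statistics of frames under a diagonal-covariance UBM."""
    diff = frames[:, None, :] - means[None, :, :]                # (T, N, F)
    log_g = -0.5 * np.sum(diff**2 / covs + np.log(2 * np.pi * covs), axis=2)
    log_p = np.log(weights) + log_g                              # (T, N)
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                      # frame posteriors
    n = post.sum(axis=0)                                         # zero-order stats, (N,)
    f = post.T @ frames                                          # first-order stats, (N, F)
    return n, f

# Phone-conditioned statistics: keep only frames aligned to the target class
# (`vowel_mask` is a hypothetical boolean array derived from the phone transcription).
rng = np.random.default_rng(1)
T, N, F = 200, 8, 5
frames = rng.normal(size=(T, F))
weights = np.full(N, 1.0 / N)
means = rng.normal(size=(N, F))
covs = np.ones((N, F))
vowel_mask = rng.random(T) < 0.6                                 # ~60% of frames are vowels

n_vow, f_vow = baum_welch_stats(frames[vowel_mask], weights, means, covs)
```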

4.3.3 Combination Strategies

In this section, we evaluate the performance of the score-level combination strategy for the phonetic system. We then investigate techniques in the model space that robustly estimate the speaker by taking into account all phonetic classes.

Baseline and Score-level Fusion Results

Score-level combination is a frequently used technique for gaining robustness on different conditions. For a phonetic GMM system, the usual strategy is to have as many systems as the number of phonetic events; the combination of information is then done at the score level by fusing the scores. In this experiment, an optimistic system combination is used, as the logistic regression is trained and tested on the same data. The FoCal toolkit [Brümmer and du Preez, 2006] is used for this process.

Table 4.3 presents the results for the baseline system, as well as for each broad phonetic event of our set. There is a clear advantage of the system using vowels alone, but vowels also represent 60% of the entire data used. The score-level fusion on the 2-class set is better than for the 4-class set. However, while using the same amount of data, the 2-class fusion performance is worse than the baseline system. In the following paragraphs, we show how to improve the subsystem combination.

Concatenation

The first model-space approach investigated consists of concatenating the speaker parameters from the different phone sets. The following experiments investigate at which level this concatenation should occur. Let us consider the 2-class phone set {V, C} for this approach. The resulting model supervector length will thus increase to 2NF. The main advantage of this method is that a single system is used for the entire phone set.

Table 4.3: Results for the baseline system, as well as for each phonetic group. The results of fusions across phonetic groupings are also shown. Score-level combinations for the two phonetic sets are similar, but fail to outperform the baseline. [SRE 06, all trials, DCF×10, EER(%)]

System                  % Data   EER (%)   mDCF
Vowels (V)                 60      6.17    0.296
Consonants (C)             40      7.91    0.391
Consonant subsets:
  Non-Vowels (NV)          15     10.7     0.502
  Sibilants (Si)           15     14.14    0.647
  Stops (St)               10     15.27    0.685
V + C                     100      5.20    0.262
V + NV + Si + St          100      5.42    0.272
Baseline                  100      5.12    0.241
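The score-level fusion described above was trained with the FoCal toolkit; as a rough stand-in, the sketch below fuses per-system scores with scikit-learn's LogisticRegression on synthetic trials, reproducing the "optimistic" setting in which the fusion is trained and evaluated on the same data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_trials = 1000
labels = rng.integers(0, 2, size=n_trials)             # 1 = target, 0 = non-target

# Synthetic scores from two subsystems (e.g. a vowel and a consonant system).
scores = np.column_stack([
    labels * 1.5 + rng.normal(scale=1.0, size=n_trials),
    labels * 1.0 + rng.normal(scale=1.2, size=n_trials),
])

fuser = LogisticRegression()
fuser.fit(scores, labels)                               # optimistic: same data for train/test
fused = fuser.decision_function(scores)                 # fused verification scores
```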

• Eigenvector concatenation

We first concatenate the eigenvectors from the different phonetic events during training and testing of the speaker models. Under this model, the system estimates a single set of latent variables x, y, z per utterance, each of them being independent of the class:

    m_s = [m_0; m_0] + [V_V; V_C] y + [U_V; U_C] x + diag(D_G, D_G) z        (4.3)

where [·; ·] denotes vertical concatenation and diag(D_G, D_G) is the block-diagonal matrix built from two copies of D_G.

Here, the ranks of the subspaces are the same as in the baseline system and the D_G matrix is a copy of the D matrix from the baseline system. The results in Table 4.4 (first three rows) show a significant degradation for the model-concatenation style of combination. It seems that, if the subspaces are trained separately, the projection onto the resulting concatenated subspace does not reflect the classes appropriately. This leads to the need to explicitly retrain the subspaces so that they are tied together. It is important to note that the concatenation of the channel eigenvectors decreases the performance much more than that of the speaker eigenvectors. This supports the hypothesis that the eigenvoices should be the main focus when using a phonetic GMM system.

• Baum-Welch statistics concatenation

For this experiment, the speaker and channel subspaces are retrained using the concatenated first- and zero-order statistics from each phonetic event. The results in Table 4.4 show that this approach performs close to the score-level combination, but fails to outperform it. However, the subspaces are effectively tied, so a robust estimate of the latent variable can be produced. Consequently, a gain is observed compared to the systems taken separately.

• Tied factor analysis

Tied factor analysis has been used successfully in other fields such as face recognition [Prince and Elder, 2006]. For this approach, the model is the same as in Equation 4.3, but the eigenvectors for each phonetic event are trained so that the latent variables are tied between the phonetic events. This approach should be successful for a phonetic system, as the amount of data for each event can vary, especially for very short conditions. We applied the following algorithm until convergence:

• Estimate the latent variables for the concatenated Baum-Welch statistics (as in the statistics concatenation above).

• Estimate the matrices separately, on their respective statistics, by maximizing the likelihood of the data with respect to the latent variables of the previous step.

Table 4.4 shows that retraining the subspaces by concatenating the statistics from each phone set or by using tied factor analysis leads to similar performance. It seems the EM algorithm used for the factor analysis model tends to tie the different phonetic events naturally.

Table 4.4: Eigenvector concatenation on the 2-class set. The speaker and channel subspaces used are shown along with the concatenation type. The subspaces have to be retrained, using the standard EM or a tied factor analysis approach, to obtain decent performance. [SRE 06, all trials, DCF×10, EER(%)]

System         Speaker            Channel     EER (%)   mDCF
Baseline       V_G                U_G           5.12    0.241
Eig. Concat.   V_V, V_C           U_V, U_C     13.4     0.573
Eig. Concat.   V_G                U_V, U_C     11.3     0.531
Eig. Concat.   V_V, V_C           U_G           7.02    0.378
BW Concat.     V_V, V_C           U_V, U_C      5.45    0.266
Tied FA        {V_V, V_C}_Tied    U_V, U_C      5.32    0.268
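A schematic of the tied factor analysis loop described above: a shared latent variable is estimated from the concatenated statistics, and the per-class loading matrices are then re-estimated against it. Plain least squares stands in for the actual ML/EM updates on Baum-Welch statistics, and all dimensions and names are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, rank, n_utt = 40, 5, 100

# Toy "first-order statistic" vectors per phonetic class (vowel / consonant).
stats = {"V": rng.normal(size=(n_utt, dim)), "C": rng.normal(size=(n_utt, dim))}
V_mat = {c: rng.normal(size=(dim, rank)) for c in stats}    # per-class eigenvoices

for _ in range(5):                                           # alternate until convergence
    # Step 1: tied latent variables from the concatenated statistics/loadings.
    V_cat = np.vstack([V_mat["V"], V_mat["C"]])              # (2*dim, rank)
    s_cat = np.hstack([stats["V"], stats["C"]])              # (n_utt, 2*dim)
    y = np.linalg.lstsq(V_cat, s_cat.T, rcond=None)[0].T     # shared factors, (n_utt, rank)

    # Step 2: re-estimate each class's loadings against the shared factors.
    for c in stats:
        V_mat[c] = np.linalg.lstsq(y, stats[c], rcond=None)[0].T
```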

Stacking

Another approach in the model space consists of stacking the eigenvectors of the subspaces together. In this approach, the dimension of the model remains constant while the rank of the subspaces increases. This leads to running one system per event before combining them at the score level.

• Eigenvector stacking

The advantage of this method is its robustness to different stacking configurations. Indeed, the latent variable estimation is enriched with the information of the other events while keeping a good estimate for the current event. Let us consider two matrices from the 2-class phone set, V_V and V_C, and their respective latent variables y_V, y_C. This approach captures cross-correlations between phonetic events when estimating the latent components. Stacking the eigenvectors for different events is equivalent to performing a sum in the supervector space. For the 2-class set, the system is expressed as:

    m_h = m_0 + [V_V  V_C] [y_V ; y_C] + [U_V  U_C] [x_V ; x_C] + D_G z        (4.4)

The D_G matrix is the one from the baseline system. The ranks of the resulting stacked matrices are 240 and 120 for the speaker and the channel spaces, respectively.

• Stacking in the speaker space and channel space

Stacking the channel eigenvectors was already demonstrated to be successful for a different set of microphones [Kenny et al., 2008c]. Stacking the speaker eigenvectors should be suitable for a phonetic GMM system for two reasons. Firstly, speaker modeling should profit from correlations between phonetic events. Secondly, using subspaces from all phonetic events when evaluating a single phonetic event should increase robustness to errors of the phonetic decoder.

Similarly to the concatenation experiments, the results in Table 4.5 tend to show that the relevant information is contained in the speaker space, as stacking in the channel space degrades the results. This means that a global channel matrix can be estimated and successfully applied to all events; therefore, we only present this configuration for the 4-class set. Stacking the speaker eigenvectors is a strategy that outperforms the score-level combination and gives results similar to the baseline non-phonetic system. There is no observed improvement from using the 4-class set over the 2-class one.
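A small sketch of the eigenvector stacking of Eq. (4.4): the per-class loading matrices are placed side by side, so the supervector dimension is unchanged while the subspace rank doubles, and a single joint estimate of the stacked latent vector is obtained (least squares is used here as a crude stand-in for the proper posterior estimate); all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
NF, R = 40, 6                           # supervector dim and per-class speaker rank

V_vow = rng.normal(size=(NF, R))        # vowel eigenvoices
V_con = rng.normal(size=(NF, R))        # consonant eigenvoices
V_stacked = np.hstack([V_vow, V_con])   # still NF rows, rank 2*R (e.g. 240 in the text)

m0 = rng.normal(size=NF)
obs = rng.normal(size=NF)               # statistics for one utterance

# Joint estimate of [y_V; y_C]; both halves contribute to one supervector offset.
y_stacked = np.linalg.lstsq(V_stacked, obs - m0, rcond=None)[0]
m_s = m0 + V_stacked @ y_stacked
```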

Table 4.5: System combination using stacked eigenvectors for the speaker space, channel space, or both. The matrices selected in each configuration are specified. The results tend to show that the relevant information is contained in the speaker space, as stacking the speaker loadings gives better results than the score-level fusion. [SRE 06, all trials, DCF×10, EER(%)]

System      Speaker                   Channel     EER    mDCF
Baseline    V_G                       U_G         5.12   0.241
Unstacked   V_V, V_C                  U_V, U_C    5.20   0.262
Stacked     V_G                       U_V, U_C    5.34   0.260
Stacked     V_V, V_C                  U_G         5.09   0.247
Stacked     V_V, V_C                  U_V, U_C    5.28   0.251
Stacked     V_V, V_St, V_Si, V_NV     U_G         5.03   0.250

• Stacked eigenvoices for the baseline system

In Section 4.3.3, we showed that stacking the matrices for each phonetic event is a successful approach for a phonetic-based system. One disadvantage of this method, compared to the concatenation method, is the need to run one system for each event. The phonetic subspaces can, however, be used to generate large factor loading matrices. In our protocol, around 300 speakers are used to train the eigenvoice matrix, which is also the maximum number of eigenvoices that can be estimated. For the 4-class phone set, the stacked system has a speaker-space rank of 480; this number of eigenvectors cannot be estimated from our data set. However, it is interesting to use this large eigenvoice matrix for the baseline non-phonetic system (channel matrices are not used here, following the results in Table 4.5). Under this scenario, the standard (non-phonetic) statistics are presented to the system while the stacked matrices coming from the different phonetic events are used as eigenvoices. The channel matrix used is the one from the baseline system.

Table 4.6: Performance of the stacked eigenvoices generated from different phonetic events on a non-phonetic system. Stacked eigenvoices from the 4-class set outperform the baseline. [SRE 06, all trials, DCF×10, EER(%)]

System     Speaker                     EER (%)   mDCF
Baseline   V_G                           5.12    0.241
Stacked    V_V, V_C                      5.14    0.243
Stacked    V_V, V_NV, V_St, V_Si         4.76    0.234

Results in Table 4.6 show that stacking eigenvoices derived from different phonetic events can improve performance over the standard baseline system. Using more classes may further improve the performance of the stacked system: indeed, the stacked eigenvoices from the 4-class set outperform both the baseline non-phonetic system and the 2-class system.

4.3.4 Conclusion

This work[3] aims to take advantage of the recent developments in Joint Factor Analysis in the context of a phonetically conditioned GMM speaker verification system. We focused on strategies for combining the phone-conditioned systems. Our first approach was to perform JFA per class and combine the systems at the score level. Our hypothesis is that this approach does not use the data efficiently, as its performance is worse than the baseline. We then employed strategies in the model space that more robustly estimate the latent variables by taking into account all phonetic events. We showed that the concatenation of eigenvectors can lead to decent performance, provided that the subspaces are explicitly retrained on the concatenated statistics. We also showed that both factor concatenation and score-level fusion can be outperformed by stacking eigenvectors from different phonetic events; for the phonetic system, stacking the eigenvoices leads to the greatest improvement. We further proposed to use this large set of eigenvoices in the baseline system and showed that it can result in a slight improvement over the traditional baseline system.

While the focus of this work is on phone-conditioned JFA systems, the implications may lead to a better understanding of the JFA model and a methodology that can be applied to increase robustness to other kinds of conditions such as language, gender and microphones. Future work will focus on understanding the differences and overlaps between the global and per-class estimates, in the channel and the speaker space, and on methods to extract more information for a more robust estimate of speaker models.

[3] The work, by authors at SRI International, was funded through a development contract with Sandia National Laboratories (#DE-AC04-94AL85000). The views herein are those of the authors and do not necessarily represent the views of the funding agencies.

4.4 Within Session Variability Modelling

Recent observations have shown that the current Joint Factor Analysis (JFA) model does not provide the performance improvements for short utterance lengths that it does for the core NIST SRE condition using full conversation sides. It is hypothesized in this work that this poor performance is the result of deficiencies in the current JFA model, particularly with respect to modelling the unwanted variability present within a session. Based on these observations, an extended JFA model is introduced in this work to specifically address the characteristics of verification with short utterances, by incorporating explicit modelling of within-session variability such as the phonetic information encoded in an utterance.

The following section investigates the effect of verification with short utterances on the standard JFA approach through the results of recent studies, highlighting the role of session variability and its dependency on utterance length. Section 4.4.2 then proposes the extended factor analysis model that incorporates within-session variability modelling to combat the deficiencies of the standard model. Implementation details and experiments on NIST SRE 2006 data are then presented, with a brief discussion, in Sections 4.4.3 and 4.4.4, respectively. Finally, a summary and possible future directions are presented in Section 4.4.5.

4.4.1 Joint Factor Analysis with Short Utterances

Previous work has highlighted some deficiencies of current Joint Factor Analysis models for shorter utterance lengths. As demonstrated in [Vogt et al., 2008b], as utterance lengths for training and testing are reduced, the effectiveness of JFA is also reduced. Table 4.7, with results reproduced from [Vogt et al., 2008b], shows that, while JFA provides a quite significant performance improvement for full conversation sides, this improvement certainly diminishes when utterance lengths of 20 seconds or less are used for training and testing. Furthermore, the inclusion of channel factors in the JFA model at these short utterance lengths had a significant negative impact on performance, while the inclusion of speaker factors was still generally beneficial.

Table 4.7: DCF on the female subset of the 2005 NIST SRE common evaluation condition for systems with and without channel compensation. From [Vogt et al., 2008b].

System              1 conv   60 sec   20 sec   10 sec
Baseline            .0442    .0456    .0608    .0752
Speaker only        .0422    .0434    .0571    .0727
Session only        .0305    .0373    .0702    .0857
Speaker & Session   .0295    .0350    .0671    .0880

Further investigation of JFA with short utterances was pursued in [Vogt et al., 2008a]. In this investigation, it was found that training the session variability subspace matrix U with utterances of length matched to the evaluation conditions provides significant improvements, as shown in Table 4.8. Matching the session variability training data resulted in performance gains for the full JFA model, incorporating session factors, even with utterance lengths as short as 10 and 20 seconds (Table 4.9). It is also clear from Table 4.8 that it is specifically the channel subspace that must be trained with matched conditions rather than the speaker subspace; additionally matching the training of the speaker subspace to the evaluation conditions results in degraded performance compared to matching the channel subspace training alone. From these results it was concluded that the inter-session variability captured in the subspace of U is actually dependent on the length of the utterances used to train the subspace. More specifically, shorter utterances show an increase in overall session variability, as shown by the measured trace of the session subspaces for differing lengths in Table 4.10.

Table 4.8: EER and minimum DCF on a modified 20 second train/test condition for the female subset of the 2005 NIST SRE. Results are presented for systems using subspaces trained on different length segments. From [Vogt et al., 2008a].

System            V training   U training   EER      Min. DCF
Full-length       1 conv       1 conv       13.47%   0.0544
Matched           20 sec       20 sec       12.04%   0.0498
Matched Session   1 conv       20 sec       11.70%   0.0493

Table 4.9: Minimum DCF on the female subset of the 2005 NIST SRE common evaluation for reduced utterance length conditions. From [Vogt et al., 2008a].

System               80 sec   40 sec   20 sec   10 sec
Baseline             0.0442   0.0501   0.0617   0.0753
FA Full-length       0.0238   0.0346   0.0544   0.0797
FA Matched Session   0.0234   0.0337   0.0493   0.0708

Table 4.10: Trace of the session subspace covariance with U trained on different length utterances. From [Vogt et al., 2008a].

Utt. Length   1 conv   80 sec   40 sec   20 sec   10 sec
tr(U U*)      105.7    116.9    148.8    213.0    329.8

Deficiencies of the JFA Model

The observed behaviour of joint factor analysis with short utterances does not fit well with the assumptions made by the model. It has been assumed to this point that the session factors and session subspace capture environmental effects such as channel, handset and background noise, which we take to be constant for the length of a session. This was the initial intent of including the session variables [Kenny and Dumouchel, 2004]. In fact, the terms "channel" and "session" have often been considered effectively synonymous, although this is not technically accurate. The characteristics of these environmental effects should be consistent regardless of utterance length, even if estimating session factors with shorter utterances should lead to less accurate results. The improved performance attained with matched session subspaces demonstrates that the matched subspaces are substantially different at different utterance lengths. These results indicate that the characteristics of the differences between sessions also change as the utterance length changes.

This is problematic firstly because we are thus required to train specialized session subspaces for the range of utterance lengths we are interested in using to extract optimal performance from the JFA model, but, more importantly, it implies that our assumptions about the nature of the inter-session variability are flawed. As noted above, the improved performance of the matched session subspace system indicates that the characteristics are different, but what exactly is this difference? Looking at the increasing session variability with shorter utterances, it seems that the consistent, stationary environmental factors may well still be present as utterances become shorter, but an additional source of variability becomes more apparent with reducing utterance length.

One hypothesis for this extra captured variability is the variability introduced by the speech content, that is, the phonetic information encoded in the speech. For the text-independent speaker recognition task, phonetic variation is in general unwanted variability (although it is possible to produce better, more accurate speaker models that are conditioned on phonetic context; this is the approach taken for the conditioned factor analysis work). The effect of the phonetic content of an utterance on a speaker model will be more pronounced as training utterances become shorter. Over the typical NIST conversation lengths, there is likely to be a reasonable coverage of the phonetic space and the effects of phonetic variability will largely average out. For utterances of only a few seconds in length, however, there will be very poor coverage of the phonetic space, and differences in the particular observed phones will cause large differences in the produced speaker model estimate.
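The overall session variability captured by a subspace can be summarised, as in Table 4.10, by the trace of U U^T; the snippet below is a toy check of that measure on two random "subspaces" standing in for matrices trained on long and short utterances (the scales are arbitrary assumptions, chosen only to mimic the trend in the table).

```python
import numpy as np

def session_variability(U):
    """Total variance captured by the session subspace, tr(U U^T)."""
    return np.trace(U @ U.T)

rng = np.random.default_rng(5)
U_full = rng.normal(scale=0.5, size=(480, 50))    # stand-in: trained on full conversations
U_short = rng.normal(scale=0.9, size=(480, 50))   # stand-in: trained on 10 s utterances

print(session_variability(U_short) > session_variability(U_full))   # larger for short
```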

4.4.2 Extending JFA to Model Within-Session Variability

This work extends the current JFA model. The goals of extending the model are two-fold: firstly, to produce better performance from the JFA model in the specific case of short utterances, through using a model that better fits the observed reality; and secondly, to construct a JFA model that can be used for a wide range of utterance lengths without adjustment, that is, to avoid the issue of retraining the session subspace for different length training and testing utterances. This is particularly relevant if an evaluation or application has mixed utterance lengths or the utterance length is not known a priori. We want to effectively enhance the JFA model by making it independent of utterance length.

With these goals in mind, the approach taken in this work is to extend the JFA model by separating the sources of session variability (a collective term for all information not useful for identifying a speaker) into distinct sources of inter-session variability and within-session variability. In this extended model, inter-session variability is modelled as an offset U_I x to the GMM mean supervector, as in the standard JFA model, where U_I is equivalent to U, except that U_I x is intended to strictly represent only constant environmental effects such as handset and channel. A goal therefore is to train U_I in such a way as to capture only stationary environmental effects that are independent of utterance length. Additionally, within-session variability is modelled over a shorter time span than the inter-session variability, to capture and remove transient effects within an utterance. Following the hypothesis above, these transient effects are expected to be dominated by phonetic variability, although this restriction is not enforced. The within-session variability is modelled by splitting an utterance into a series of short segments and estimating an additional GMM mean offset U_W w_n for each short segment n.

Including within-session variability, the complete model for a short segment n of an utterance is

    s_n = m + V y + d z + U_I x + U_W w_n

While y, z and x are all held constant for the entire utterance, there is an independent w_n for each short segment n.

Choice of Short Segmentation

An important consideration for this extended JFA model, including within-session variability modelling, is the choice of method for segmenting an utterance into short segments. Ideally, the segments should be short enough to capture the relevant sources of the within-session variability but must also be long enough to adequately estimate w_n for each segment. Computational load is also a consideration, as increasing the number of segments also increases the number of within-session factors w_n that must be estimated for an utterance.

Following the hypothesis that phonetic variability is the dominant source of within-session variability, this work explored modelling within-session variability for short segments that are aligned with open-loop phone recogniser (OLPR) transcripts. Using this alignment, there is a one-to-one mapping from each phone instance in the OLPR output transcript to a short segment. The OLPR transcripts were derived from the BUT Hungarian phone recognition system [Schwarz et al., 2004]. This phone recogniser has previously been shown to be effective for a number of applications, including speech activity detection and language recognition. A brief analysis of the OLPR transcripts for the Mixer data used in recent SREs revealed that this phone recogniser produces phone labels at approximately 10 phone events per second of active speech. This rate roughly translates into 1,000 phone events for a full NIST conversation side or 100 events for a 10-second segment. It is important to note that there is no conditioning or separate modelling based on the actual phone labels produced by the recogniser; the phone labels are simply used to chop an utterance into short segments based on the start and end time of each phone event, and the labels themselves are otherwise disregarded.

There are other potential methods of segmenting an utterance which may deserve pursuing. One option is to simply segment the active speech of an utterance at regular intervals, giving a sequence of segments of the same length. A reasonable range of segment lengths might be 0.1–1 seconds, which corresponds to approximately 10–100 speech frames. A regular segmentation scheme such as this has the advantage of not requiring any external dependencies such as an OLPR, as well as a consistent segment length, which may be helpful in estimating w_n for each segment. Another possibility is aligning segments with syllables instead of phones, allowing the segments to be centred around high-energy syllable nuclei. This may provide better quality estimates of w_n, since cepstral representations of high-energy voiced speech are generally less affected by environmental effects. A syllable-aligned segmentation would obviously also require syllable transcripts or some other method of recognising syllable events, such as [Dehak et al., 2007].
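A sketch of the segmentation step discussed above, mapping hypothetical OLPR phone start/end times to frame-index segments, with a regular fixed-interval fallback; the 100 frames-per-second rate and all names are assumptions.

```python
def segments_from_phones(phone_events, frame_rate=100):
    """Map (start_sec, end_sec) phone events to (start_frame, end_frame) segments."""
    return [(int(s * frame_rate), int(e * frame_rate)) for s, e in phone_events]

def segments_fixed(n_frames, seg_len=30):
    """Fallback: regular segments of seg_len frames (~0.3 s at 100 frames/s)."""
    return [(i, min(i + seg_len, n_frames)) for i in range(0, n_frames, seg_len)]

# Example: three phone events from a hypothetical OLPR transcript.
print(segments_from_phones([(0.00, 0.08), (0.08, 0.21), (0.21, 0.30)]))
print(segments_fixed(100))
```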

4.4.3 Implementation

Several systems were developed for comparison in this work:

1. A baseline JFA system, using the standard JFA model.
2. A standard JFA system with U specialised for differing utterance lengths.
3. A system implementing the extended JFA model incorporating within-session variability modelling.

Details of these systems are presented in the following sections.

Baseline JFA System

The baseline system for this evaluation implemented the standard JFA model introduced by Kenny et al. [Kenny and Dumouchel, 2004] for speaker modelling. This implementation was based on the "small" BUT system, comprising 512-component, gender-independent GMM models with 39-dimensional MFCC-based features. Details of the features and UBM training data are given in [Burget et al., 2007]. For simplicity and efficiency, this system implemented dot-product scoring for the verification trials, as described for the SUN/SDV submission for SRE'08 [Strasheim and Brümmer, 2008]. Channel compensation was also applied to the statistics for both training and testing utterances, in the manner described in [Strasheim and Brümmer, 2008]. ZT-norm was also applied, using BUT's "standard" lists.

The JFA parameters for the baseline system were trained on a small subset of the BUT lists for FA training. Specifically, U and V were trained on the SRE04 utterances in the list fa_train_eigenchannels.scp, while d was trained on fa_train_d.scp as usual. With the reduced number of utterances and unique speakers for training the subspaces, the subspace dimensions were limited to 100 speaker factors and 50 session factors. While this baseline system does not provide world-beating performance, due to the limited FA training data and smaller models, it is expected to be representative of a larger state-of-the-art system. The choice to use a configuration with reduced FA training data was made in order to provide representative performance and a maximum throughput of experiments given the limited time-frame.

Matched U to Utterance Length

This system is identical to the baseline system in most aspects. The only difference is in the data used to train the session subspace transform U. Additional U matrices were trained using the same utterances as the baseline system, except that the utterances were truncated in length to match the anticipated utterance lengths to be used in the experiments. Matrices for 20-second and 10-second conditions were produced for this system. V and d were unchanged.

Extended JFA System Incorporating Within-Session Modelling

Training the subspaces is expected to be at least as important for the extended JFA model as it is for the standard JFA model. Many potential options exist for the order in which the subspaces are estimated and whether joint or separate optimization is better, as well as the question of how to split the utterances into short segments. Due to time constraints and the computational cost involved in this estimation process, few of these options were examined. The approach taken in these experiments was the simplest and involved the fewest changes from the baseline system. Firstly, the parameters of the standard JFA model (U_I, V and d) were trained exactly as in the baseline system and were therefore identical to it. Following this, the additional within-session subspace U_W was trained on a subset of approximately 100 utterances (2 from each speaker in the fa_train_d.scp list). This training process is analogous to the training of U_I, except that U_W is trained to capture the dominant directions of the differences between the short segments of the training utterances. The transcripts from the Hungarian OLPR provided the segment alignment for the utterances used in estimating U_W.
It is therefore expected that the within-session variability captured through this procedure will be dominated by phonetic information,[4] although it is also reasonable to expect that variation in the actual realisations of phones will also be present.

[4] As an interesting aside, it may be possible to retrieve the phone label for each segment based on the corresponding estimate of w. This has not been investigated to date, but may provide insight into the validity of the assumption that the within-session variability is dominated by phonetic information.

Figure 4.5: Leading eigenvalues of the speaker, inter-session and within-session variability. The within-session subspace was trained on segments aligned to open-loop phone recogniser transcripts. (Log-scale plot of eigenvalue magnitude against eigenvalue index.)

The leading eigenvalues of the speaker, inter-session and within-session subspaces resulting from this training procedure are plotted in Fig. 4.5. The within-session variability is evidently very high and substantially greater than both the speaker and inter-session variability. Based on an approximate average length for the short segments of 0.1 seconds, the effective within-session variability over the utterance for a variety of utterance lengths is depicted in Fig. 4.6, again showing the speaker and inter-session variability for comparison. As the utterance length increases to a full conversation side of approximately 100 seconds, the effect of within-session variability on the utterance as a whole becomes effectively negligible, as expected, due to the averaging effect and sufficient coverage of the phonetic space.

During both speaker model training and testing, the effect of within-session variability was removed in an analogous fashion to the inter-session variability of the baseline system: for each short segment n of an utterance, the within-session factors w_n were estimated and the sufficient statistics compensated as in [Strasheim and Brümmer, 2008]. The compensated statistics for all of the segments in an utterance were then summed to give utterance statistics with within-session effects removed. These within-session-compensated statistics were then used in the same way as the usual utterance statistics in the baseline system.
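A schematic of the compensation procedure just described: for each short segment, the within-session factors w_n are estimated against U_W and the offset U_W w_n is removed before the segment statistics are summed over the utterance. A least-squares point estimate of w_n stands in for the proper computation of [Strasheim and Brümmer, 2008], and all names and sizes are illustrative.

```python
import numpy as np

def compensate_within_session(segment_stats, U_W):
    """Remove the within-session offset U_W @ w_n from each segment, then sum."""
    compensated = np.zeros(U_W.shape[0])
    for f_n in segment_stats:                                # f_n: centred first-order stats
        w_n = np.linalg.lstsq(U_W, f_n, rcond=None)[0]       # crude point estimate of w_n
        compensated += f_n - U_W @ w_n
    return compensated

rng = np.random.default_rng(6)
NF, R_W, n_segments = 40, 10, 50
U_W = rng.normal(size=(NF, R_W))
segment_stats = rng.normal(size=(n_segments, NF))
utt_stats = compensate_within_session(segment_stats, U_W)    # utterance-level statistics
```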

4.4.4 Experiments

A system implementing the extended JFA model with within-session variability was evaluated and compared against the standard and matched-U JFA systems on the NIST SRE 2006 core, 1conv4w-1conv4w condition. To investigate the performance of the systems with reduced utterance lengths, this same 1conv4w-1conv4w condition was again utilised; however, both the training and testing utterances were truncated to produce shorter utterances. 20-second and 10-second conditions for both training and testing were added in this way. From the results in [Vogt et al., 2008b] and [Vogt et al., 2008a], this 10 to 20 second range appears to be the range at which the effectiveness of the standard JFA model is diminishing.

Figure 4.6: Approximate effective within-session variability for a range of utterance lengths (1 second, 10 seconds, and a full conversation side) compared to speaker and inter-session variability. (Log-scale plot of eigenvalue magnitude against eigenvalue index.)

Table 4.11: Comparison of EER performance for the standard JFA model, the matched-length session JFA model and the extended JFA model incorporating within-session variability modelling, on the SRE 06 common evaluation condition.

JFA Model            Dims        1 conv   20 sec   10 sec
U + V + d            50          3.10%    12.79%   20.21%
U + V + d            60          3.03%    13.01%   20.31%
U_Matched + V + d    50          3.10%    12.20%   19.71%
U_I + U_W + V + d    50I + 10W   2.97%    11.98%   19.67%

Tables 4.11 and 4.12 present EER and minimum DCF results, respectively, comparing the variants of the JFA model for the full conversation side and the 20- and 10-second truncated training and testing conditions. All results are for English-language trials only. The first and second rows in each table use the standard JFA model with 50 and 60 session factors, respectively. The third row shows the results with U matched to the length of utterance used for training and testing. The last row of each table includes within-session variability modelling with 50 inter-session factors and 10 within-session factors.

As reported in [Vogt et al., 2008a], matching U to the evaluation conditions provides an advantage over the standard JFA model. The matched system provided better performance than the baseline in all short conditions, although the improvement for the 10-second condition is quite modest. Incorporating within-session variability modelling largely produced results similar to the matched-U approach, improving on the standard JFA system for all shortened utterances. Additionally, at the EER operating point this approach gave the best performance at each utterance length, although only by a small margin. Results were less clear-cut when measured by minimum DCF.

Table 4.12: Comparison of minimum DCF performance for the standard JFA model, the matched-length session JFA model and the extended JFA model incorporating within-session variability modelling, on the SRE 06 common evaluation condition.

JFA Model            Dims        1 conv   20 sec   10 sec
U + V + d            50          .0159    .0561    .0819
U + V + d            60          .0156    .0562    .0820
U_Matched + V + d    50          .0159    .0531    .0814
U_I + U_W + V + d    50I + 10W   .0170    .0541    .0807

Table 4.13: Comparison of EER performance for the standard JFA model, the matched-length session JFA model, a stacked session model and the extended JFA model incorporating within-session variability modelling, on the SRE 06 common evaluation condition with whole conversation side training and truncated utterances for testing.

JFA Model            Dims        20 sec   10 sec
U + V + d            50          6.12%    9.59%
U_Matched + V + d    50          6.39%    10.13%
U_Stacked + V + d    100         5.91%    9.54%
U_I + U_W + V + d    50I + 10W   5.85%    9.59%

From these results it can be seen that the introduction of within-session factors at least achieved one of the stated goals, namely producing a system that can be effective over a wide range of utterance lengths. While the matched system used a distinct U matrix for each utterance length tested, the parameters of the within-session modelling system were consistent across all trials. Thus, the within-session modelling approach provides a practical advantage over the standard JFA model through its flexibility.

The second goal, of improving performance through more accurately modelling the unwanted variability, has not been convincingly achieved with these results. Several factors may contribute to this outcome. Firstly, the choice of segmentation may not be optimal; more importantly, the approach to estimating the subspaces of the extended model used for these experiments was not at all tailored to the extended model. It is expected that the extended model should, at the very minimum, require adjustment to the values of d, as less information will be explained as "residual" variability with the inclusion of within-session factors. The effects of including within-session modelling on the speaker and inter-session subspaces must also be investigated. Future investigation of segmentation choice and proper integration of within-session modelling in the subspace estimation process may lead to significant improvements in the performance of this extended model.

The above results used utterances of the same length for both training and testing. An added complication is introduced when the training and testing utterance lengths differ, as the optimal matrix U is then different for training and testing. Tables 4.13 and 4.14 present results evaluated with a whole conversation side for training and 20- or 10-second testing utterances. Again, in these tables the first row is the baseline approach using the standard JFA model. The results in the second row represent a system with U matched to the utterance length for both training and testing; in this case, due to the full conversation side for training and truncated utterances for testing, U differs between training and testing. Interestingly, while the matched-U approach worked quite well with the same utterance lengths for both training and testing, here it causes a degradation in performance in all measures compared to the baseline system. Mismatch between the U used for training and that used for testing is the most likely cause of this performance degradation.

To overcome the issue of differing U between training and testing while still matching the session subspace to the utterance length, a stacking approach was investigated.

Table 4.14: Comparison of minimum DCF performance for the standard JFA model, the matched-length session JFA model, a stacked session model and the extended JFA model incorporating within-session variability modelling, on the SRE 06 common evaluation condition with whole conversation side training and truncated utterances for testing.

JFA Model            Dims        20 sec   10 sec
U + V + d            50          .0293    .0433
U_Matched + V + d    50          .0305    .0441
U_Stacked + V + d    100         .0275    .0421
U_I + U_W + V + d    50I + 10W   .0290    .0414

Under the stacking approach, a larger session subspace is constructed by concatenating the two session matrices matched to the training and testing conditions. That is, for a 1conv training, 10-second test condition, the U used for training and testing consists of concatenated matrices matched to the 1conv and 10-second utterance lengths. This approach has previously been employed successfully for mixed telephone and distant microphone conditions in recent SREs. The third row of Tables 4.13 and 4.14 demonstrates that this stacking approach provides an improvement in all cases over the baseline system, regaining the advantage of the matched approach observed previously, although again these gains are modest.

Finally, the last row in Tables 4.13 and 4.14 presents the performance of incorporating within-session modelling. As with the stacking approach, the extended model provides improved performance over the baseline system in all cases, except for the 10-second EER where the two are equivalent. The extended approach is also competitive with the stacked approach, as each provides the best performance depending on the condition and performance measure.

The results of these experiments again highlight the ability of the extended JFA model to provide competitive performance across a wide range of operating conditions without having to adjust model parameters. This flexibility is a major advantage of the approach, especially for situations in which it is not possible to know the training and testing utterance lengths prior to evaluation or, as in this case, where the utterance lengths are not consistent between training and testing.

4.4.5 Summary and Future Directions

This work motivated and presented an extension to the joint factor analysis model to include modelling of unwanted within-session variability. This extension was particularly motivated by observations of the relatively poor and ineffective performance of the standard JFA model for short utterance lengths. The inclusion of within-session variability modelling was particularly intended to compensate for the effects of poor and uneven phonetic coverage for short utterances, by modelling and removing the effects of phonetic variation over short segments of each utterance. The goals of the extended model were: (a) to produce better performance from the JFA model in the specific case of short utterances, by using a model that better fits observed behaviour; and (b) to produce a flexible JFA model that would be equally effective over a wide range of utterance lengths without adjusting model parameters such as retraining session subspaces.

Experimental results demonstrate the flexibility of the extended JFA model by providing competitive results over a wide range of utterance lengths and operating conditions without the need to adjust any of the model parameters. While modest performance improvements over the current state of the art were also observed in a number of conditions, further work is necessary to demonstrate that significant performance improvements are achievable through this extended model.

Future work on this approach is expected to focus on two areas: the optimal method of segmenting utterances, and better integration of within-session variability in estimating the parameters of the JFA model. Possible candidates for the segmentation method include aligning with syllables or syllable-like units, to provide slightly longer segments and potentially better estimates of the within-session factors w_n, and fixed-length segments in the range of 0.1–1 seconds of active speech, to provide more consistency in the estimates of w_n and less dependency on speech recognition tools. Methods of incorporating within-session variability in estimating the speaker and inter-session subspaces will also be examined. Previous work has shown that slight variations in the subspace estimation procedure can make significant performance differences for the standard JFA model; it is likely that this effect is exacerbated for the extended model.

4.5 Multigrained Factor Analysis

4.5.1 Introduction

Recent efforts in speaker recognition research have focussed on reducing the effects of session variability. The more recent papers in this research area attempt to compensate a high-dimensional utterance representation. These methods include factor analysis (in both the model [Kenny, 2006, Vogt et al., 2005] and feature [Vair et al., 2006] domains), Within-Class Covariance Normalisation (WCCN) [Hatch et al., 2006] and Nuisance Attribute Projection (NAP) [Solomonoff et al., 2005]. These techniques are noted to be very effective within a Gaussian Mixture Model (GMM) framework. Each of these methods makes particular assumptions, such as assuming that signal distortions do not span multiple mixture components. We attempt to address this constraint by examining the granularity of the distorted feature space. We investigate this problem from the perspective of two similar modelling approaches: factor analysis [Kenny, 2006] and NAP [Campbell et al., 2006a]. Factor analysis decomposes the statistics of an utterance into components relating to the speaker and to the channel/audio environment. The NAP approach models and eliminates directions of variability in the model space that are considered harmful to classification performance. The assumption for both the GMM-kernel based NAP approach [Campbell et al., 2006a] and the factor analysis variant is that the session variability is confined to distortions within the span of the corresponding mixture component. More specifically, in a feature-based interpretation, the most that a feature vector will be compensated by is an offset spanned by the operating space of the corresponding Gaussian mixture component. In addition, feature distortions that exist at a much smaller scale than the span of the mixture component will not be sufficiently described. In this work, we propose the use of a multi-grained approach whereby the compensated statistics generated by a low-complexity GMM-NAP structure are used in a higher-complexity compensation system. Section 4.5.2 presents the proposed multi-grained approach that compensates for distortions of differing structural detail. Section 4.5.3 follows with a presentation of the results and then leads into the conclusions.

4.5.2 Multi-Grained Approach

We propose a multi-grained approach as a simple means to mitigate assumptions of granularity. To motivate the approach given later in this section, we include an example indicating the need for it. Figure 4.7 shows two Gaussian mixture models with a different number of mixture components. The model on the left is a GMM with a lower complexity than the GMM on the right. In the figure, the data is represented by square dots while the (hypothetically applied) distortions are represented by arrows. When feature compensation is performed, the amount by which a single feature vector may be compensated is generally governed by the reach of the most significant Gaussian mixture component. A Gaussian's reach is used here to describe the space of feature vectors that will significantly affect, or be influenced by, that mixture component. Note that this description is specific to the formulation of the model. Through a thought experiment, it is apparent that a feature vector compensated by the GMM on the left could potentially be modified to a greater extent than by the GMM on the right. In addition, the GMM on the left may be able to compensate for significantly large channel effects, while the GMM on the right would tend to compensate for more fine-grained distortions. A multi-grained model may be able to compensate for session effects that cause large regional variability and yet also handle small localized distortions.

Figure 4.7: A hypothetical scenario of two different complexity models used to account for feature distortions.

The work regarding the multi-grained analysis is inspired by Chaudhari et al. [Chaudhari et al., 2000]. We apply a multi-grained framework to both the NAP and the factor analysis approaches to demonstrate their utility. Please refer to [Campbell et al., 2006a] for the NAP-specific details of the method and to [Kenny, 2006] for the factor analysis specifics. In brief, Figure 4.8 shows the general process. A low-complexity NAP-GMM setup is used to transform the raw features to produce Level-1 (L1) features. These features are further transformed by a more complex Level-2 (L2) NAP-GMM structure. Note that the features generated by the L2 model may be passed into an SVM or other alternate classifier. Although this diagram depicts feature-based compensation of the NAP statistics using models of differing granularity, the idea was also applied such that the sufficient statistics could be changed using a factor analysis type model (see Figure 4.9).

Multi-grained NAP feature compensation

With the previous method, NAP was applied in two stages: first, to enhance the features using a low-complexity model, and second, a higher-complexity model performs an additional NAP transformation on the sufficient statistics, which are then scored. For the NAP case, a feature vector x may be compensated to give x̂ according to the following equation [Vair et al., 2006]:

    x̂ = x − Σ_{i=1}^{N} Pr(i|x) S_i                                        (4.5)

where

    Pr(i|x) = w_i g(x | µ_i, Σ_i) / Σ_{j=1}^{N} w_j g(x | µ_j, Σ_j)          (4.6)

The conditional probability of mixture component i given observation x is Pr(i|x). The function g(x | µ_i, Σ_i) is the multivariate probability density function of feature vector x for the mean (µ_i) and diagonal covariance (Σ_i) of mixture component i.

The parameter vector S_i represents the session nuisance contribution for mixture component i and may be calculated as follows:

    S_i = (1 / √w_i) Σ_i^{1/2} V_i V^T φ                                     (4.7)

Note that V_i is the sub-matrix of V referring to the nuisance contribution of mixture component i; specifically, V_i consists of rows [(i − 1) × d + 1] through [i × d] of the matrix V. In addition, the L2 NAP-GMM is directly scored rather than being used to generate additional features.
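A direct transcription of Eqs. (4.5) and (4.6) in NumPy; the per-component nuisance vectors S_i are passed in precomputed (obtaining them via Eq. (4.7) requires the nuisance subspace V and the utterance projection φ, which are assumed given), and all values below are synthetic.

```python
import numpy as np

def nap_compensate(x, weights, means, covs, S):
    """Feature-domain NAP compensation, Eqs. (4.5)-(4.6): x_hat = x - sum_i Pr(i|x) S_i."""
    diff = x[None, :] - means                                    # (N, F)
    log_g = -0.5 * np.sum(diff**2 / covs + np.log(2 * np.pi * covs), axis=1)
    log_p = np.log(weights) + log_g
    post = np.exp(log_p - log_p.max())
    post /= post.sum()                                           # Pr(i|x), Eq. (4.6)
    return x - post @ S                                          # Eq. (4.5)

rng = np.random.default_rng(7)
N, F = 8, 5
weights = np.full(N, 1.0 / N)
means = rng.normal(size=(N, F))
covs = np.ones((N, F))
S = 0.1 * rng.normal(size=(N, F))       # per-component nuisance offsets (from Eq. 4.7)
x_hat = nap_compensate(rng.normal(size=F), weights, means, covs, S)
```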

Figure 4.8: The procedure for performing a two-step Nuisance Attribute Projection.

Multi-grained FA statistics compensation

With the statistics compensation approach, factor analysis is applied over all broad-phone classes to determine the session subspace contribution. Following this, another factor analysis model is created for each broad-phone class, whereby the previous session subspace contribution is already removed (the equations are omitted at this time).

Figure 4.9: The procedure for performing a two-step factor analysis compensation, firstly using the statistics over the entire utterance (an L1 global model over all speech) followed by the phone-group specific compensation (L2 models for the stop, non-vowel, vowel and sibilant classes); the FA-compensated statistics are then used in a kernel or model.
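The two-level structure of Figures 4.8 and 4.9 reduces to a simple chaining of compensation stages; the sketch below only shows that control flow, with placeholder compensators standing in for the NAP or FA routines defined earlier and a hypothetical broad-phone class label selecting the L2 model.

```python
def multigrained_compensate(features, compensate_l1, compensate_l2_by_class, phone_class):
    """Two-step compensation: a coarse global model first, then a finer per-class model.

    compensate_l1 and compensate_l2_by_class are placeholders for the NAP or FA
    compensation routines; phone_class selects the L2 broad-phone model.
    """
    l1_features = [compensate_l1(x) for x in features]        # Level-1 (coarse)
    l2 = compensate_l2_by_class[phone_class]
    return [l2(x) for x in l1_features]                        # Level-2 (fine)

# Usage with trivial stand-in compensators:
identity = lambda x: x
out = multigrained_compensate([1.0, 2.0], identity, {"VOW": identity}, "VOW")
```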

4.5.3 Results

The speaker recognition system is based on a GMM-based kernel structure [Reynolds et al., 2000, Campbell et al., 2006a]. All output scores have ZT-norm [Reynolds et al., 2000, Auckenthaler et al., 2000] (enrollment and test utterance score normalization) applied.

We first evaluate the multigrained NAP feature compensation approach. This consists of a 256 mixture component NAP system that is used to transform the cepstral-based features for use by a 1024 mixture component secondary NAP system. The secondary NAP system also incorporates the scoring (dot-product) component. The results of the multigrained NAP feature compensation approach are presented in Table 4.15 for the NIST 2008 SRE. Conditions 7 and 8 represent the telephony audio evaluation for all English trials and all native English trials, respectively. The table includes results for two types of systems: a standard GMM system and a broad phone system. The standard GMM system is of the same configuration as mentioned earlier in the paragraph. The broad phone system uses the compensated features generated by the 256 mixture component system, with broad-phone models (which are also NAP compensated) trained on these new features; the scores from the broad phone models are combined in late fusion. The results show a small improvement in the performance of the multigrained NAP systems over the standard NAP baselines.

Table 4.15: The NIST 2008 results with and without the multi-grained analysis.

Task                                        Condition 7   Condition 8
Base System with NAP                        0.179         0.182
Base System with Multigrained NAP           0.175         0.166
Broad Phone System with NAP                 0.212         0.209
Broad Phone System with Multigrained NAP    0.206         0.190

Table 4.16: The NIST 2006 results with and without the multi-grained analysis, compared for broad phonetic groupings.

                              DET 3 - Base        DET 3 - ZTNorm
System         Phone Type     Min DCF    EER      Min DCF    EER
Baseline       NonVowel       0.0888     24.04%   0.0413      9.05%
Baseline       Sibilant       0.0988     30.28%   0.0584     13.05%
Baseline       Stop           0.0993     33.33%   0.0631     13.81%
Baseline       Vowel          0.0604     11.26%   0.0201      3.97%
Hierarchical   NonVowel       0.0852     23.24%   0.042       9.53%
Hierarchical   Sibilant       0.0994     28.93%   0.0585     14.20%
Hierarchical   Stop           0.0991     33.27%   0.0655     14.63%
Hierarchical   Vowel          0.0482     10.29%   0.0206      3.91%

Baseline       Consonant      0.0839     20.48%   0.0323      6.28%
Baseline       Vowel          0.0604     11.26%   0.0201      3.97%
Hierarchical   Consonant      0.0777     18.26%   0.0312      6.45%
Hierarchical   Vowel          0.0482     10.29%   0.0206      3.91%

In another experiment (see Table 4.16), using the NIST 2006 SRE, the multigrained FA system was evaluated. This consisted of compensating the sufficient statistics using a general GMM and then further compensating the statistics using a broad-phone specific system. The multigrained approach demonstrated consistent improvements on the 'DET 3 - Base' result. Once ZT-norm was applied ('DET 3 - ZTNorm'), the observed benefits were lost. Note also that the fusion of multiple phone systems did not demonstrate an improvement. Effectively compensating the sufficient statistics of the FA model in a multigrained manner seems to be a challenging task at this time.

4.5.4 Conclusions

This section presented a multi-grained approach to address certain limitations of current compensation models. Results indicate some gains, with the potential for the method to be applied to other session modelling approaches.

4.6 Summary

This chapter presented some of the efforts performed at the JHU workshop on the topic of Factor Analysis Conditioning. The work covered four main areas: a phonetic analysis, Factor Analysis Combination Strategies, Within Session Variability Modelling and Multigrained Factor Analysis. The results demonstrate that a conditioned FA model can provide improved performance and that score-level combination may not always be the best method. Including within-session factors in an FA model can reduce the sensitivity to utterance duration and phonetic-content variability. Stacking factors across conditions or data subsets can provide additional robustness. Hierarchical modelling for NAP/Factor Analysis also shows promise. These approaches also have applicability to other condition types, such as different languages and microphone types.


Chapter 5

Support vector machines and joint factor analysis for speaker verification

This chapter presents several techniques for combining Support Vector Machines (SVM) and the Joint Factor Analysis (JFA) model for speaker verification. In this combination, the SVMs are applied to different sources of information produced by the JFA: the Gaussian Mixture Model supervectors and the speaker and common factors. We found that the use of JFA factors gave the best results, especially when within-class covariance normalization is applied in the speaker factor space in order to compensate for the channel effect. The new combination results are comparable to those of other classical JFA scoring techniques.

5.1 Introduction

During the last three years, the Joint Factor Analysis (JFA) approach [Kenny et al., 2008d] has become the state of the art in the speaker verification field. This modeling was proposed in order to deal with speaker and channel variability in the Gaussian Mixture Model (GMM) framework [Douglas A. Reynolds, 2000]. At the same time, the application of the Support Vector Machine (SVM) in the GMM supervector space [Campbell et al., 2006a] achieved interesting results, especially when nuisance attribute projection (NAP) was applied to deal with the channel effect. In this approach, the kernel used is based on a linear approximation of the Kullback-Leibler (KL) distance between two GMMs. The speaker GMM mean supervectors were obtained by adapting the Universal Background Model (UBM) supervector to the speaker frames using Maximum A Posteriori (MAP) adaptation [Douglas A. Reynolds, 2000].

In this chapter, we propose to combine the SVM with JFA. We tried two types of combination: the first uses the GMM supervector obtained with JFA as input to the SVM, using the classical linear KL kernel between two supervectors; the second, rather than using the GMM supervectors as features for the SVM, directly uses the information given by the speaker and common factor components (see Section 5.2) defined by the JFA model. The outline of the chapter is as follows. Section 5.2 describes the factor analysis model. In Section 5.3, we present the JFA-SVM approach and describe all the kernels used to implement it. The comparison between the different results is presented in Section 5.5. Section 5.6 concludes the chapter.

5.2 Joint Factor Analysis

Joint factor analysis is a model used to treat the problem of speaker and session variability in GMMs. In this model, each speaker is represented by the means, covariances, and weights of a mixture of C multivariate diagonal-covariance Gaussian densities defined in some continuous feature space of dimension F. The GMM for a target speaker is obtained by adapting the mean parameters of the Universal Background Model (UBM). In joint factor analysis [Kenny et al., 2008d, Kenny et al., 2007b, Kenny et al., 2007a], the basic assumption is that a speaker- and channel-dependent supervector M can be decomposed into a sum of two supervectors: a speaker supervector s and a channel supervector c,

M = s + c    (5.1)

where s and c are normally distributed. In [Kenny et al., 2008d], Kenny et al. described how the speaker-dependent supervector and the channel-dependent supervector can be represented in low-dimensional spaces. The first term on the right-hand side of (5.1) is modeled by assuming that if s is the speaker supervector for a randomly chosen speaker then

s = m + dz + Vy    (5.2)

where m is the speaker- and channel-independent supervector (UBM), d is a diagonal matrix, V is a rectangular matrix of low rank, and y and z are independent random vectors having standard normal distributions. In other words, s is assumed to be normally distributed with mean m and covariance matrix VV^t + d^2. The components of y and z are respectively the speaker and common factors. The channel-dependent supervector c, which represents the channel effect in an utterance, is assumed to be distributed according to

c = ux    (5.3)

where u is a rectangular matrix of low rank and x is distributed with a standard normal distribution. This is equivalent to saying that c is normally distributed with zero mean and covariance uu^t. The components of x are the channel factors in factor analysis modeling.
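The decomposition in (5.1)-(5.3) is easy to state concretely in code. The following is a minimal numerical sketch of how a speaker- and channel-dependent supervector would be assembled from the JFA hyperparameters; the dimensions and the random placeholder values of m, V, d and u are purely illustrative (in practice they are trained by EM), and the variable names are ours rather than those of any particular toolkit.

```python
import numpy as np

# Illustrative sizes: C Gaussians of dimension F give a CF-dimensional
# supervector; R_s speaker factors and R_c channel factors.
C, F = 512, 60
CF = C * F
R_s, R_c = 300, 100

rng = np.random.default_rng(0)

# JFA hyperparameters (random placeholders standing in for EM-trained values).
m = rng.normal(size=CF)            # UBM mean supervector
V = rng.normal(size=(CF, R_s))     # eigenvoice matrix (low rank)
d = np.abs(rng.normal(size=CF))    # diagonal of the matrix d
u = rng.normal(size=(CF, R_c))     # eigenchannel matrix (low rank)

# Latent factors, all standard normal under the model.
y = rng.normal(size=R_s)           # speaker factors
z = rng.normal(size=CF)            # common factors
x = rng.normal(size=R_c)           # channel factors

s = m + d * z + V @ y              # speaker supervector, eq. (5.2)
c = u @ x                          # channel supervector, eq. (5.3)
M = s + c                          # speaker- and channel-dependent supervector, eq. (5.1)
print(M.shape)                     # (30720,)
```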

5.3 SVM-JFA

The SVM is a classifier used to find a separator between two classes. The main idea of this classifier is to project the input vectors into a high-dimensional space, called the feature space, in which a linear separation can be found. This projection is carried out using a mapping function. In practice, SVMs use kernel functions to perform the scalar product computation in the feature space. These functions allow us to compute the scalar product in the feature space directly, without explicitly defining the mapping function. In this section, we present several ways to carry out the combination between the SVM and JFA. The first approach is similar to the classical SVM-GMM approach [Campbell et al., 2006a, Campbell et al., 2006b], where the speaker GMM supervectors are used as input to the SVM. The second set of methods that we tested consists of designing new kernels using the speaker factors, or the speaker and common factors, depending on the configuration of the JFA model.
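All of the SVM-JFA variants below reduce to plugging a custom kernel into an otherwise standard SVM. As a hedged illustration of that plumbing (not the workshop's actual implementation), the sketch below uses scikit-learn's precomputed-kernel interface with a plain dot-product kernel; the toy data and dimensions are invented.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy target-vs-impostor training set: one "enrollment" vector plus impostors.
X_train = rng.normal(size=(20, 300))       # e.g. 300-dimensional factor vectors
y_train = np.array([1] + [0] * 19)

def linear_kernel(A, B):
    """Gram matrix of dot products between rows of A and rows of B."""
    return A @ B.T

# Train with a precomputed Gram matrix; any of the kernels defined later in
# this section could be substituted here.
svm = SVC(kernel="precomputed", C=1.0)
svm.fit(linear_kernel(X_train, X_train), y_train)

# Scoring a test segment amounts to evaluating the kernel against the training set.
X_test = rng.normal(size=(5, 300))
scores = svm.decision_function(linear_kernel(X_test, X_train))
print(scores)
```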

5.3.1 GMM Supervector space

In order to apply the SVM with JFA using the speaker supervector as input, we used the classical linear Kullback-Leibler kernel. This kernel, applied in the GMM supervector space, is based on the Kullback-Leibler divergence between two GMMs [Campbell et al., 2006a]. This distance corresponds to a Euclidean distance between scaled GMM supervectors s and s':

d_e^2(s, s') = Σ_{i=1}^{C} w_i (s_i − s'_i)^t Σ_i^{-1} (s_i − s'_i)    (5.4)

where w_i and Σ_i are the ith UBM mixture weight and diagonal covariance matrix, and s_i corresponds to the mean of Gaussian i of the speaker GMM. The derived linear kernel is defined as the inner product corresponding to the preceding distance:

K_lin(s, s') = Σ_{i=1}^{C} (√w_i Σ_i^{-1/2} s_i)^t (√w_i Σ_i^{-1/2} s'_i)    (5.5)

This kernel was proposed by Campbell et al. [Campbell et al., 2006a].
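A small numerical sketch of (5.4) and (5.5) may help make the scaling explicit; the toy weights, covariances and supervectors below are invented, and the only property checked is the standard identity d_e^2 = K(s,s) − 2K(s,s') + K(s',s').

```python
import numpy as np

def kl_linear_kernel(s1, s2, w, sigma2):
    """Linear GSV kernel of eq. (5.5).
    s1, s2 : (C, F) per-Gaussian mean vectors (supervectors reshaped),
    w      : (C,) UBM mixture weights,
    sigma2 : (C, F) diagonal covariances of the UBM."""
    phi1 = np.sqrt(w)[:, None] * s1 / np.sqrt(sigma2)   # sqrt(w_i) * Sigma_i^{-1/2} * s_i
    phi2 = np.sqrt(w)[:, None] * s2 / np.sqrt(sigma2)
    return float(np.sum(phi1 * phi2))

def kl_distance_sq(s1, s2, w, sigma2):
    """Squared scaled Euclidean distance of eq. (5.4)."""
    diff = s1 - s2
    return float(np.sum(w[:, None] * diff ** 2 / sigma2))

rng = np.random.default_rng(2)
C, F = 8, 3                                 # toy sizes; the chapter uses C = 2048, F = 60
w = np.full(C, 1.0 / C)
sigma2 = np.ones((C, F))
s1, s2 = rng.normal(size=(C, F)), rng.normal(size=(C, F))

d2 = kl_distance_sq(s1, s2, w, sigma2)
k11 = kl_linear_kernel(s1, s1, w, sigma2)
k12 = kl_linear_kernel(s1, s2, w, sigma2)
k22 = kl_linear_kernel(s2, s2, w, sigma2)
print(np.isclose(d2, k11 - 2 * k12 + k22))  # True: the kernel induces the distance
```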

5.3.2 Speaker factors space

In this section, we discuss the use of the speaker factors as input to the SVM. The speaker factor coefficients correspond to the coordinates of the speaker in the speaker space defined by the eigenvoice matrix. The advantage of using speaker factors is that these vectors are of low dimension (typically 300), which makes the decision process faster. We tested these vectors with three classical kernels: linear, Gaussian and cosine. These kernels are respectively given by the following equations:

k(y_1, y_2) = ⟨y_1, y_2⟩    (5.6)

k(y_1, y_2) = exp(−‖y_1 − y_2‖^2 / (2σ^2))    (5.7)

k(y_1, y_2) = ⟨y_1, y_2⟩ / (‖y_1‖ ‖y_2‖)    (5.8)
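For concreteness, a compact sketch of the three kernels in (5.6)-(5.8) on toy 300-dimensional factor vectors is given below; the data and the value of σ are illustrative, not the tuned workshop settings.

```python
import numpy as np

def linear_kernel(y1, y2):
    return float(y1 @ y2)                                               # eq. (5.6)

def gaussian_kernel(y1, y2, sigma=10.0):
    return float(np.exp(-np.sum((y1 - y2) ** 2) / (2.0 * sigma ** 2)))  # eq. (5.7)

def cosine_kernel(y1, y2):
    return float(y1 @ y2 / (np.linalg.norm(y1) * np.linalg.norm(y2)))   # eq. (5.8)

rng = np.random.default_rng(3)
y1, y2 = rng.normal(size=300), rng.normal(size=300)   # toy 300-dim speaker factors
print(linear_kernel(y1, y2), gaussian_kernel(y1, y2), cosine_kernel(y1, y2))
```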

The motivation for using the linear kernel is that the speaker factor vectors are normally distributed with zero mean and identity covariance matrix. In order to obtain the speaker factors for this system, we used the JFA configuration which has speaker and channel factors only; there are no common factors z (see equation 5.2).

Within Class Covariance Normalization

In this new approach, we propose to perform another channel compensation step in the speaker factor space. The first compensation step is carried out by estimating the channel factors in the GMM supervector space. To achieve the second compensation, two choices are possible: the first is the NAP algorithm [Campbell et al., 2006a] and the second is the Within Class Covariance Normalization (WCCN) algorithm [Hatch et al., 2006]. We decided to apply the WCCN algorithm rather than NAP because NAP realizes channel compensation by removing the nuisance directions; the speaker factors, however, are vectors of low dimension, so removing additional directions could be harmful. The WCCN algorithm uses the within-class covariance (WCC) matrix to normalize the kernel functions in order to compensate for the channel effect without removing any directions of the space. The WCC matrix is obtained by the following formula:

W = (1/S) Σ_{s=1}^{S} (1/n_s) Σ_{i=1}^{n_s} (y_i^s − ȳ_s)(y_i^s − ȳ_s)^t    (5.9)

where ȳ_s = (1/n_s) Σ_{i=1}^{n_s} y_i^s is the mean of the speaker factor vectors of speaker s, S is the number of speakers and n_s is the number of utterances of speaker s. The WCCN algorithm was applied to the linear and cosine kernels. The new versions of these two kernels are given by the following equations:

k(y_1, y_2) = y_1^t W^{-1} y_2    (5.10)

k(y_1, y_2) = (y_1^t W^{-1} y_2) / (√(y_1^t W^{-1} y_1) √(y_2^t W^{-1} y_2))    (5.11)
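The following sketch estimates W as in (5.9) and evaluates the WCCN-normalized kernels. The per-speaker toy data are invented, and a small ridge is added before inversion only because a toy training set of this size cannot yield a full-rank 300 × 300 covariance; that ridge is a detail of the sketch, not of the original system.

```python
import numpy as np

def wccn_matrix(factors_per_speaker):
    """Within-class covariance matrix of eq. (5.9).
    factors_per_speaker : list of (n_s, R) arrays, one per training speaker."""
    R = factors_per_speaker[0].shape[1]
    W = np.zeros((R, R))
    for Y_s in factors_per_speaker:
        D = Y_s - Y_s.mean(axis=0)          # center around the speaker mean
        W += (D.T @ D) / Y_s.shape[0]       # average over that speaker's utterances
    return W / len(factors_per_speaker)     # average over speakers

def wccn_linear_kernel(y1, y2, W_inv):
    return float(y1 @ W_inv @ y2)           # eq. (5.10)

def wccn_cosine_kernel(y1, y2, W_inv):
    num = y1 @ W_inv @ y2
    den = np.sqrt((y1 @ W_inv @ y1) * (y2 @ W_inv @ y2))
    return float(num / den)                  # eq. (5.11)

rng = np.random.default_rng(4)
R = 300
# Toy background set: 10 speakers with 4 utterances each (offset = "speaker identity").
train = [rng.normal(size=(4, R)) + rng.normal(size=R) for _ in range(10)]
W = wccn_matrix(train)
W_inv = np.linalg.inv(W + 1e-3 * np.eye(R))  # ridge because the toy W is rank deficient
y1, y2 = rng.normal(size=R), rng.normal(size=R)
print(wccn_linear_kernel(y1, y2, W_inv), wccn_cosine_kernel(y1, y2, W_inv))
```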

5.3.3 Speaker and Common factors space

In the case where we have both the speaker and the common factors, we proposed and compared two techniques to combine these two sources of information. The first approach is to apply an SVM in each space (the speaker factor space and the common factor space) and thereafter perform a linear fusion of the two SVM scores; the fusion weights are obtained using logistic regression [Brümmer et al., 2007]. The second approach is to define a new kernel which is a linear combination of two kernels: the first kernel is applied in the speaker factor space and the second in the common factor space. The kernel combination weights are fixed in order to maximize the margin between the target speaker and the impostor utterances. This technique was already applied in speaker verification [Dehak et al., 2008].
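A minimal sketch of the two combination strategies follows. Scikit-learn's logistic regression stands in for the FoCal-style fusion training, the per-trial scores are synthetic, and the kernel-combination weight is shown as a fixed illustrative value rather than the margin-maximizing one used in the report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

# Synthetic per-trial scores from the two subsystems (speaker-factor SVM and
# common-factor SVM); labels: 1 = target trial, 0 = non-target trial.
labels = np.repeat([1, 0], 100)
scores_y = rng.normal(size=200) + labels * 1.0
scores_z = rng.normal(size=200) + labels * 0.5

# (a) Linear score fusion with weights trained by logistic regression
#     (here trained on the same toy scores purely for brevity).
fuser = LogisticRegression()
fuser.fit(np.column_stack([scores_y, scores_z]), labels)
fused = fuser.decision_function(np.column_stack([scores_y, scores_z]))

# (b) Kernel combination: a convex combination of two kernel matrices is
#     itself a valid kernel and can be fed to a precomputed-kernel SVM.
def combined_kernel(K_speaker, K_common, beta=0.5):
    return beta * K_speaker + (1.0 - beta) * K_common

print(fused[:3])
```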

5.4 Experimental setup

5.4.1 Test set

The results of our experiments are reported on the core condition of the NIST 2006 speaker recognition evaluation (SRE) dataset.1 In the case of the score fusion system, we trained the fusion weights on the NIST 2006 SRE dataset and tested the systems on the telephone data of the core condition of the NIST 2008 SRE.

5.4.2 Acoustic features

In our experiments, we used cepstral features extracted using a 25 ms Hamming window. 19 mel frequency cepstral coefficients together with log energy are calculated every 10 ms. This 20-dimensional feature vector was subjected to feature warping [Pelecanos and Sridharan, 2001] using a 3 s sliding window. Delta and double delta coefficients were then calculated using a 5-frame window, giving 60-dimensional feature vectors. These feature vectors were modeled using a GMM, and factor analysis was used to treat the problem of speaker and session variability.
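Feature warping is the short-term Gaussianization step of [Pelecanos and Sridharan, 2001]; a common rank-based formulation is sketched below, assuming a 10 ms frame shift so that a 3 s window is about 300 frames. This is our own illustrative implementation, not the workshop's feature extraction code.

```python
import numpy as np
from scipy.special import ndtri   # inverse CDF of the standard normal

def feature_warp(feats, win=300):
    """Map each feature dimension to a standard normal distribution over a
    sliding window of `win` frames (about 3 s at a 10 ms frame shift)."""
    n_frames, _ = feats.shape
    warped = np.empty_like(feats, dtype=float)
    half = win // 2
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half)
        window = feats[lo:hi]
        # Rank of the current frame within the window, mapped into (0, 1).
        ranks = (window < feats[t]).sum(axis=0) + 0.5
        warped[t] = ndtri(ranks / window.shape[0])
    return warped

rng = np.random.default_rng(6)
cepstra = rng.gamma(2.0, size=(1000, 20))          # toy 20-dim features, 10 s of speech
print(feature_warp(cepstra).std(axis=0).round(2))  # roughly unit variance per dimension
```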

5.4.3 Factor analysis training

We used a gender-independent Universal Background Model containing 2048 Gaussians. This UBM was trained using LDC releases of Switchboard II, Phases 2 and 3; Switchboard Cellular, Parts 1 and 2; and NIST 2004-2005 SRE data. The (gender-independent) factor analysis models were trained on the same quantities of data as the UBM. The decision scores obtained with factor analysis were normalized using zt-norm; we used 148 male and 221 female t-norm models, and 159 male and 201 female z-norm utterances. We used two factor analysis configurations. The first JFA configuration is composed of 300 speaker factors and 100 channel factors only; the second is the full configuration, where we added the diagonal matrix d in order to have both speaker and common factors.

5.4.4 SVM impostors

We used 1875 gender-independent impostors to train the SVM models. These impostors are taken from LDC releases of Switchboard II, Phases 2 and 3; Switchboard Cellular, Parts 1 and 2; and NIST 2004-2005 SRE data.

1 http://www.nist.gov/speech/tests/spk/index.htm

Table 5.1: Comparison results between SVM-JFA in GMM supervector space and JFA frame-by-frame scoring. The results are given as EER on the core condition of the NIST 2006 SRE.

System                          English    All trials
JFA: s = m + Vy                 1.95%      3.01%
JFA: s = m + Vy + dz            1.80%      2.96%
SVM-JFA: s = m + Vy             4.24%      4.98%
SVM-JFA: s = m + Vy + dz        4.23%      4.92%

Table 5.2: Comparison results between SVM-JFA in speaker factor space and in GMM supervector space. The results are given as EER on the core condition of the NIST 2006 SRE.

                               English                All trials
                               No-norm    T-norm      No-norm    T-norm
KL kernel: GMM supervectors    4.24%      -           4.98%      -
Linear kernel                  3.47%      2.93%       4.64%      4.04%
Gaussian kernel                3.03%      2.98%       4.59%      4.46%
Cosine kernel                  3.08%      2.92%       4.18%      4.15%

5.4.5 Within Class Covariance

The gender-independent within-class covariance matrix is trained on the same dataset as the JFA.

5.5 Results

5.5.1 SVM-JFA: GMM supervector space

We start with the results obtained by the SVM-JFA combination when the GMM supervectors are used as input to the SVM. We used GMM supervectors obtained with both JFA configurations (with and without common factors). The results are given in Table 5.1 and compared to the frame-by-frame JFA scoring technique. The results show that the performance of the SVM applied in the GMM supervector space is significantly worse than that obtained with conventional frame-by-frame JFA scoring. These results can be explained by the fact that the linear KL kernel is not appropriate for GMM supervectors obtained with the JFA model, because the assumption of independence between the GMM Gaussians, which holds for MAP adaptation, is not true for adaptation based on eigenvoices. The results also show that the addition of common factors did not improve the results in the case of SVM-JFA, in contrast to JFA scoring.

5.5.2 SVM-JFA: speaker factors space

In this section we present the results obtained with the linear, Gaussian and cosine kernels applied in the speaker factor space, and compare them with the previous results obtained with SVM-JFA applied to GMM supervectors. Table 5.2 gives these results.

Table 5.3: Comparison results between SVM-JFA in speaker factor space (with and without WCCN) and two JFA scoring techniques. The results are given as EER on the core condition of the NIST 2006 SRE, English trials.

                                      Without WCCN           With WCCN
                                      t-norm     zt-norm     t-norm     zt-norm
Linear kernel                         2.93%      -           2.44%      -
Cosine kernel                         2.92%      2.81%       2.43%      -
JFA frame-by-frame scoring            -          1.95%       -          -
JFA integrate over channel factors    4.12%      2.70%       -          -

Three remarks can be made about Table 5.2. First, applying the SVM in the speaker factor space gave better results than applying it in the GMM supervector space. Second, the comparison between the cosine and Gaussian kernels indicates that the speakers are already well separated linearly. Finally, t-norm did not give a large improvement for the cosine and Gaussian kernels; however, it helps in the case of the linear kernel.

Within Class Covariance Normalization

We now discuss the performance achieved with and without the WCCN technique for the linear and cosine kernels. Table 5.3 compares the results obtained with and without WCCN to the results of two JFA scoring techniques: the first consists of integrating over the channel factors as proposed in [Kenny et al., 2008d], and the second is frame-by-frame JFA scoring. The results given in Table 5.3 show that with WCCN we achieved a 17% relative improvement for both kernels. We can also see that the performance obtained with WCCN is very comparable to JFA scoring: we obtained better results than integrating over the channel factors, and results closer to frame-by-frame JFA scoring. An advantage of this new SVM-JFA scoring is that it is faster than the two other techniques.

5.5.3 SVM-JFA: speaker and common factors space

We present a comparison between the results obtained with score fusion and with kernel combination applied to the speaker and common factors. In both fusion techniques, we applied the cosine kernel in the speaker and common factor spaces, and we used WCCN to normalize the speaker-factor cosine kernel. The results are given in Table 5.4. Looking at these results, we can conclude that both fusion methods gave equivalent results. However, the use of the kernel combination is more appropriate because we do not need development data to set the kernel weights. The score-fusion results reported in Table 5.4 for the NIST 2006 SRE are not realistic because we trained and tested the score fusion weights on the same dataset. We also note that the common factor components give information complementary to the speaker factor components, and the combination between them improves the performance. If we compare the results obtained by the kernel combination method with the other scoring methods, we reach the same conclusion as when using only the SVM in the speaker factor space (see section 5.5.2).

Table 5.4: Comparison results between score fusion and kernel combination for the SVM-JFA system.

                                      NIST 2006 SRE            NIST 2008 SRE
                                      English    All trials    English    All trials
Cosine kernel on y                    2.34%      3.59%         3.86%      6.55%
Cosine kernel on z                    6.26%      8.68%         10.34%     13.45%
Linear score fusion                   2.11%      3.62%         3.23%      6.86%
Kernel combination                    2.08%      3.62%         3.20%      6.60%
JFA frame-by-frame scoring            1.80%      2.96%         -          -
JFA integrate over channel factors    2.65%      3.82%         -          -

5.6 Conclusion

In this chapter, we tested several combinations of a discriminative model, the Support Vector Machine, and a generative model, Joint Factor Analysis, for speaker verification. We found that using a linear or cosine kernel defined on the speaker and common factors, which are components of the JFA model, gave better results than using the linear Kullback-Leibler kernel applied to GMM supervectors that were also obtained with the JFA model. We showed that using within-class covariance normalization in the speaker factor space in order to compensate for the channel effect gave the best performance. The results obtained with SVM-JFA using the speaker factors were comparable to the results obtained with classical JFA scoring; moreover, using the SVM in the speaker factor space (usually of dimension 300) makes the scoring faster than the other classical techniques.


Chapter 6

Handling variability with support vector machines

In speaker verification we encounter two types of variation: inter-speaker and intra-speaker. The former is desired and a good recognizer should exploit it, while the latter is a nuisance and a good recognizer should suppress it. In this chapter, we will propose variability compensated SVM (VCSVM), a new framework for handling both of these types of variation in the SVM speaker recognition setup.

6.1 Introduction

Speaker verification using SVMs has proven to be a powerful method, specifically using the GSV kernel [Campbell et al., 2006b] with nuisance attribute projection (NAP) [Solomonoff et al., 2005]. Also, the recent popularity and success of factor analysis [Kenny et al., 2008c] has led to Najim Dehak's promising attempts to use speaker factors directly as SVM features (Chapter 5). Both NAP projection and the use of speaker factors with SVMs are methods of handling variability in speaker verification: NAP removes undesirable nuisance variability, while using the speaker factors forces the discrimination to be performed based on inter-speaker variability. These successes have led us to propose VCSVM, a new method that handles both inter- and intra-speaker variation directly in the SVM optimization. This is done by adding a penalty to the minimization that biases the normal to the hyperplane to be orthogonal to the nuisance subspace, or alternatively orthogonal to the complement of the subspace containing the intra-speaker variation. This bias attempts to ensure that inter-speaker variability is used in the recognition while intra-speaker variability is ignored.

6.2 Motivation

Evidence of the importance of handling variability can be found in the discrepancy in verification performance between one-, three- and eight-conversation enrollment tasks for the same SVM system, with performance improving as the number of enrollment utterances increases. One explanation for this is that when only one target conversation is available to enroll a speaker, the orientation of the separating hyperplane is set by the impostor utterances. As more target enrollment utterances are provided, the orientation of the separating hyperplane can change drastically, as sketched in Figure 6.1. The additional information that the extra enrollment utterances provide is intra-speaker variability, due to channel effects and other nuisance variables. If we could estimate the principal components of intra-speaker variability for a given speaker, then we could force the SVM not to use that variability in choosing a separating hyperplane, as shown in Figure 6.2, where the main nuisance direction was removed. However, since it is not generally possible to estimate intra-speaker variability for a specific speaker, we can substitute a global estimate obtained from a large number of speakers; this is exactly what is done in NAP.

Figure 6.1: Different separating hyperplanes obtained with 1, 3, and 8 conversation enrollment.

Figure 6.2: Effect of removing the nuisance direction from the SVM optimization.

6.3 Handling Nuisance Variability

NAP handles nuisance variability by estimating a small subspace where the nuisance lives and removing it completely from the SVM features, i.e. not allowing any information from the nuisance subspace to affect the SVM decision. We approach this from another angle: instead of removing the subspace completely, we bias the normal to the separating hyperplane to be orthogonal to the nuisance subspace. Assume that the nuisance subspace is spanned by a set of N orthonormal eigenvectors {u_1, u_2, . . . , u_N}, and let U be the matrix whose columns are the u's. Let the vector normal to the separating hyperplane be w; ideally, if the nuisance were restricted to the subspace U, one would require the orthogonal projection of w onto the nuisance subspace to be zero, i.e. ‖UU^T w‖^2 = 0. This requirement can be introduced directly into the primal formulation of the SVM optimization:

minimize    J(w, ε) = ‖w‖^2/2 + ξ ‖UU^T w‖^2/2 + C Σ_{i=1}^{m} ε_i
subject to  y_i (w^T x_i + b) ≥ 1 − ε_i,   i = 0, . . . , m
            ε_i ≥ 0,   i = 0, . . . , m,                                  (6.1)

where the x_i's are the utterance-specific SVM features (supervectors), the y_i's are the corresponding labels, and the ε_i are slack variables. Note that the only difference between (6.1) and the standard SVM formulation is the addition of the ξ‖UU^T w‖^2 term, where ξ is a tunable parameter (set on some held-out set) that regulates the amount of bias desired. If ξ = ∞ then this formulation becomes similar to NAP compensation, and if ξ = 0 then we obtain the standard SVM formulation; Figure 6.3 sketches the separating hyperplane obtained for different values of ξ.
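To make the role of the penalty concrete, the sketch below minimizes the primal of (6.1) by plain subgradient descent on toy data; the optimizer, data, and hyperparameter values are all illustrative choices of ours, not the solver used at the workshop. The printed ratio shows that, for large ξ, the learned normal w has only a small component inside the nuisance subspace.

```python
import numpy as np

def vcsvm_train(X, y, U, xi=1.0, C=1.0, lr=1e-3, iters=2000):
    """Subgradient descent on the VCSVM primal of eq. (6.1):
    0.5*||w||^2 + 0.5*xi*||U U^T w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + b)).
    X : (m, D) features, y : (m,) labels in {-1,+1}, U : (D, N) orthonormal nuisance basis."""
    m, D = X.shape
    w, b = np.zeros(D), 0.0
    P = U @ U.T                           # projector onto the nuisance subspace
    for _ in range(iters):
        active = y * (X @ w + b) < 1.0    # margin violators (hinge-loss subgradient)
        grad_w = w + xi * (P @ w) - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(7)
D, N = 20, 3
U, _ = np.linalg.qr(rng.normal(size=(D, N)))        # toy orthonormal nuisance directions
X = rng.normal(size=(40, D))
y = np.where(X[:, 0] > 0, 1.0, -1.0)                # toy labels
w, b = vcsvm_train(X, y, U, xi=10.0)
print(np.linalg.norm(U.T @ w) / np.linalg.norm(w))  # small: w nearly orthogonal to U
```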

[Figure 6.3: separating hyperplanes obtained for ξ = 0, 0 < ξ < ∞, and ξ = ∞.]
300, but we did not investigate this experimentally.
4 FoCal toolkit (http://niko.brummer.googlepages.com/focal) by Niko Brummer was used for this purpose.

∂C_llr/∂θ = (1/(2 log 2)) [ (1/|T|) Σ_{t∈T} (1 − P(target|λ_t, 0.5)) ∂λ_t/∂θ − (1/|N|) Σ_{t∈N} P(target|λ_t, 0.5) ∂λ_t/∂θ ]    (8.7)

where P(target|λ_t, 0.5) is given by (8.2). By combining (8.6) and (7.17) and differentiating w.r.t. V, we obtain

∂λ_t/∂V = α Σ*^{-1} F̄ y^t    (8.8)

To optimize our objective function, we need to define a set of training trials. In these experiments, each possible pair of two segments from our training set formed a valid trial, where one segment is considered to be the enrollment segment and the other the test segment. This allows us to define a J × J matrix P, where J is the number of segments in the training set and each element of the matrix corresponds to one trial: the row index defines the test segment and the column index defines the enrollment segment. Let the element of the matrix P corresponding to trial t be (1 − P(target|λ_t, 0.5))/|T| if the trial is a target trial and −P(target|λ_t, 0.5)/|N| if the trial is a non-target trial. Combining (8.7) and (8.8), making use of the matrix P and taking the gradient of the objective function w.r.t. V, we obtain

∇C_llr(V) = (1/(2 log 2)) α Σ*^{-1} F̃ P Y^t    (8.9)

where the columns of the matrix F̃ are the vectors of first-order sufficient statistics, F̄, extracted from all segments in the training set (representing the test segments), and the columns of the matrix Y are the vectors of speaker factors extracted from all segments in the training set (representing the enrollment segments). Now, the gradient could be used to optimize the objective function using standard gradient descent. However, the widely adopted technique for MMI training of GMM-based HMMs is Extended Baum-Welch re-estimation [Schluter et al., 2001], which has been shown to provide much faster convergence than gradient descent. In our case, Extended Baum-Welch cannot be adopted in a straightforward way because of our more complicated model and simplified linear scoring. In [Schluter et al., 2001], section 2.2.2, a relation between the Extended Baum-Welch and gradient descent updates of parameters was pointed out, showing that the Extended Baum-Welch update of GMM mean vectors can be seen as a gradient descent update with a specific learning rate for each parameter. Inspired by this relation, we propose to use a similar learning rate specific to each row of the matrix V. Specifically, we multiply the gradient ∇C_llr(V) by the diagonal matrix

L = η diag(Ñ P 1) Σ    (8.10)

where the columns of the matrix Ñ are the vectors of zero-order sufficient statistics, N, extracted from all segments in the training set, η is a parameter-independent learning rate, 1 is a column vector of ones, and diag is an operator that converts a vector into a diagonal matrix. Finally, the matrix V is iteratively updated using the following formula:

V_new = V_old + L ∇C_llr(V)    (8.11)
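Putting (8.7)-(8.11) together, one update of V can be sketched as below. Because equations (8.2), (8.6) and (7.17) appear earlier in the report, the sketch assumes a simple linear scoring λ_t = α F̄^t Σ*^{-1} V y consistent with (8.8), takes Σ* to be the diagonal UBM covariance supervector, and expands the per-Gaussian zero-order counts to supervector dimension for the learning-rate matrix; all of these are our assumptions for illustration, and every array below is a random toy stand-in.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminative_V_update(V, F_stats, N_stats, Y, Sigma, target_mask, alpha=1.0, eta=1e-3):
    """One discriminative update of the eigenvoice matrix V, eqs. (8.7)-(8.11).
    V           : (CF, R) eigenvoice matrix,
    F_stats     : (CF, J) first-order statistics, one column per training segment,
    N_stats     : (C, J)  zero-order statistics (per-Gaussian occupation counts),
    Y           : (R, J)  speaker factors, one column per training segment,
    Sigma       : (CF,)   diagonal of the UBM covariance supervector,
    target_mask : (J, J)  True where the test/enrollment segments share a speaker."""
    CF, _ = V.shape
    C, J = N_stats.shape
    F = CF // C
    # Scores for every (test, enrollment) pair: lambda = alpha * F^t Sigma^-1 V y.
    Lam = alpha * (F_stats / Sigma[:, None]).T @ (V @ Y)            # (J, J)
    P_tar = sigmoid(Lam)                                            # P(target | lambda, 0.5)
    n_T, n_N = target_mask.sum(), (~target_mask).sum()
    # Matrix P: (1 - P)/|T| on target trials, -P/|N| on non-target trials.
    P = np.where(target_mask, (1.0 - P_tar) / n_T, -P_tar / n_N)
    # Gradient of the objective w.r.t. V, eq. (8.9).
    grad = (alpha / (2.0 * np.log(2.0))) * (F_stats / Sigma[:, None]) @ P @ Y.T
    # Row-specific learning rates, eq. (8.10); counts expanded to CF dimensions.
    counts = np.repeat(N_stats @ P @ np.ones(J), F)                 # (CF,)
    L_diag = eta * counts * Sigma
    return V + L_diag[:, None] * grad                               # eq. (8.11)

# Toy sizes only; the experiment below uses a 19968 x 300 eigenvoice matrix.
rng = np.random.default_rng(8)
C, F, R, J = 4, 3, 5, 12
V = rng.normal(size=(C * F, R))
F_stats = rng.normal(size=(C * F, J))
N_stats = np.abs(rng.normal(size=(C, J)))
Y = rng.normal(size=(R, J))
Sigma = np.ones(C * F)
speakers = rng.integers(0, 3, size=J)
target_mask = speakers[:, None] == speakers[None, :]
print(discriminative_V_update(V, F_stats, N_stats, Y, Sigma, target_mask).shape)  # (12, 5), same as V
```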

Experiment with no explicit channel compensation

The first system that we used for experimenting with discriminative training of the eigenvoices V was a pure eigenvoice system (i.e. no U and no D matrices were considered in this case). A system based on a relatively small UBM with only 512 components and 39-dimensional features was selected for this experiment. The system was first trained using maximum likelihood; the eigenvoice matrix V of size 19968 × 300 was then retrained discriminatively using the procedure described in the previous section, with the ML-trained V as the starting point for the discriminative training. The results are presented in Table 8.1.

EER[%] Generative V Generative V and U Discriminative V Discriminative V with channel compensated y

No norm 15.44 6.99 7.19 6.80

ZT-norm 11.42 4.07 5.06 4.81

Table 8.1: Results of the 1st large scale experiment, on SRE 2006 all trials (det1). EER[%] Generative V and U Discriminative V, Generative U

No norm 6.99 6.00

ZT-norm 4.07 3.87

Table 8.2: Results of the 2nd large scale experiment, on SRE 2006 all trials (det1). Table 8.1. Comparing the first and the third line in the table, we can see that discriminative training provides substantial improvement in the performance. The improvement hold also when applying zt-normalization, though the normalization was not considered during the discriminative training. As mentioned in the previous section, speaker factors y are always computed using the original ML trained model. In this case, it is the pure eigenvoice system with speaker factors y estimated without considering channel variability. The last line in the table shows results obtained with the same discriminatively trained pure eigenvoice system used for testing, where, however, factors y were obtained from ML trained system modeling the channel variability. The improved result suggest that good estimation of y not affected by channel variability may be important. Possibly, in the future, this could be also achieved by means of discriminative training, without explicitly modeling the channel variability. The second line in the table shows performance of ML trained system making use of eigenchannels for both estimating speaker factors y and testing. We can see that discriminative training provides comparable improvements to the intersession variability modeling. However, improvement over the ML trained system is observed only in the first column of the third row, which corresponds to result without zt-normalization. When zt-norm is used, performance of the generative system is superior to the discriminatively trained one. Note, that zt-norm was not considered during discriminative training. Not having zt-norm incorporated in the discriminative training may force the training to concentrate on problem that can be easily solved by the normalization, which can lead to suboptimal result.

8.5.3 Experiment with ML trained eigenchannels

In the following experiment, we used a system where the intersession variability is modeled using an eigenchannel matrix U (200 eigenchannels). Otherwise, the system is the same as in the previous set of experiments. Again, the ML-trained parameters are used as the starting point for the discriminative training of the matrix V. Although we make use of U in this experiment, this matrix is not retrained discriminatively. As can be seen in Table 8.2, retraining the matrix V improves the results. A much higher improvement is, however, obtained without zt-norm; the probable reason is the same as explained in the previous paragraph.

8.5.4 Conclusion

Discriminative training for speaker identification is a large and difficult problem, but it has the potential for worthwhile gains, with the possibility of more accurate, yet faster and smaller, systems. We have managed to show some proof of concept, but so far without significantly improving on the state-of-the-art. The remaining problems are both practical and theoretical, including the complexity of the optimization and principled methods for combating over-training. Many extensions of our large-scale experiments are possible. Besides training the eigenvoices V, the hyperparameters U and D could also be trained discriminatively. In all of our current experiments, we worked with sufficient statistics collected with the UBM. This means that the assignment of frames to Gaussians is fixed, given by the UBM, which was, however, trained using the maximum likelihood criterion. It is quite possible that such an allocation of Gaussians is suboptimal for the task of discriminating between speakers, and it would be worthwhile to experiment with discriminative training that has the freedom to change this frame assignment. We have also pointed out the problem of zt-norm not being incorporated in the discriminative training. This could be addressed by making λ_t in (8.3) the zt-normalized score; however, this would make incorporating it into our objective function much more complicated.


Chapter 9

Summary and conclusions

In this workshop, several approaches to robust speaker recognition sharing the same theoretical background, Joint Factor Analysis (JFA), were investigated. In diarization (Chapter 3), we examined the application of JFA and Bayesian methods to diarization; our approach produced 3-4% on challenging interview speech. In Factor Analysis Conditioning (Chapter 4), we explored ways to use JFA to account for non-session variability (phone) and showed robustness using within-session, stacking and hierarchical modeling. We also advanced SVM-JFA approaches by developing techniques to use JFA elements in SVM classifiers (Chapters 5 and 6); the results are comparable to the full JFA system but with fast scoring (Chapter 7) and no score normalization, and we concluded that SVM approaches provide better performance when using all JFA factors. Finally, discriminative system optimization was investigated (Chapter 8); this work focused on means to discriminatively optimize the whole speaker recognition system and successfully demonstrated proof-of-concept experiments. To conclude, we found JHU 2008 an extremely productive and enjoyable workshop, and our aim is to continue collaborating on these problem areas going forward. Cross-site, joint efforts will certainly provide big gains in future speaker recognition evaluations and experiments.


Bibliography

[Auckenthaler et al., 2000] Auckenthaler, R., Carey, M., and Lloyd-Thomas, H. (2000). Score normalization for text-independent speaker verification systems. Digital Signal Processing, 10(1/2/3):42–54.
[Bishop, 2007] Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer.
[Brookes, 2006] Brookes, M. (2006). The matrix reference manual. http://www.ee.ic.ac.uk/hp/staff/www/matrix/intro.html.

[Brümmer, 2008] Brümmer, N. (2008). SUN SDV system description for the NIST SRE 2008 evaluation.
[Brümmer et al., 2007] Brümmer, N., Burget, L., Černocký, J., Glembek, O., Grézl, F., Karafiát, M., van Leeuwen, D., Matějka, P., Schwarz, P., and Strasheim, A. (2007). Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2072–2084.
[Brümmer and du Preez, 2006] Brümmer, N. and du Preez, J. (2006). Application-independent evaluation of speaker detection. Computer Speech & Language, 20(2-3):230–275.
[Burget et al., 2007] Burget, L., Matejka, P., Glembek, O., Schwarz, P., and Cernocky, J. (2007). Analysis of feature extraction and channel compensation in GMM speaker recognition system. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):1979–1986.
[Campbell et al., 2006a] Campbell, W., Sturim, D., Reynolds, D., and Solomonoff, A. (2006a). SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation. In IEEE-ICASSP, Toulouse.
[Campbell et al., 2006b] Campbell, W. M., Sturim, D. E., and Reynolds, D. (2006b). Support Vector Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters, 13(5):308–311.
[Castaldo et al., 2007] Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., and Vair, C. (2007). Compensation of nuisance factors for speaker and language recognition. IEEE Transactions on Audio, Speech and Language Processing, 15(7):1969–1978.
[Castaldo et al., 2008] Castaldo, F., Colibro, D., Dalmasso, E., Laface, P., and Vair, C. (2008). Stream-based speaker segmentation using speaker factors and eigenvoices. In Proc. ICASSP, Las Vegas, Nevada.
[Chaudhari et al., 2000] Chaudhari, U., Navratil, J., and Maes, S. (2000). Transformation enhanced multi-grained modeling for text independent speaker recognition. ICSLP, 2:298–301.

[Dehak et al., 2009] Dehak, N., Kenny, P., Dehak, R., Glembek, O., Dumouchel, P., Burget, L., Hubeika, V., and Castaldo, F. (2009). Support vector machines and joint factor analysis for speaker verification. In Proc. ICASSP, Taipei, Taiwan.
[Dehak et al., 2007] Dehak, N., Kenny, P., and Dumouchel, P. (2007). Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 15:2095–2103.
[Dehak et al., 2008] Dehak, R., Dehak, N., Kenny, P., and Dumouchel, P. (2008). Kernel Combination for SVM Speaker Verification. In Odyssey Speaker and Language Recognition Workshop 2008, Stellenbosch, South Africa.
[Douglas A. Reynolds, 2000] Reynolds, D. A., Quatieri, T. F., and Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, pages 19–41.
[Ferrer et al., 2007] Ferrer, L., Sonmez, K., and Shriberg, E. (2007). A smoothing kernel for spatially related features and its application to speaker verification. In Proceedings of Interspeech.
[Glembek et al., 2009] Glembek, O., Burget, L., Dehak, N., Brümmer, N., and Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. In Proc. ICASSP, Taipei.
[Hatch et al., 2006] Hatch, A. O., Kajarekar, S., and Stolcke, A. (2006). Within-class covariance normalization for SVM-based speaker recognition. In Proceedings of Interspeech.
[J. Pelecanos, 2006] Pelecanos, J. and Sridharan, S. (2006). Feature warping for robust speaker verification. In Proceedings of Odyssey 2006: The Speaker and Language Recognition Workshop, pages 213–218.
[Kajarekar, 2008] Kajarekar, S. (2008). Phone-based cepstral polynomial SVM system for speaker recognition. In Proceedings of Interspeech 2008.
[Kenny, 2005] Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. Technical report CRIM-06/08-13, CRIM, Montreal.
[Kenny, 2006] Kenny, P. (2006). Joint factor analysis of speaker and session variability: Theory and algorithms (draft version). IEEE Speech, Acoustics and Language Processing.
[Kenny, 2008] Kenny, P. (2008). Bayesian analysis of speaker diarization with eigenvoice priors.
[Kenny et al., 2005a] Kenny, P., Boulianne, G., and Dumouchel, P. (2005a). Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing, 13(3):345–354.
[Kenny et al., 2005b] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2005b). Factor analysis simplified. In Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 637–640, Toulouse, France.
[Kenny et al., 2007a] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007a). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech and Language Processing.
[Kenny et al., 2007b] Kenny, P., Boulianne, G., Ouellet, P., and Dumouchel, P. (2007b). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2072–2084.
[Kenny et al., 2008a] Kenny, P., Dehak, N., Dehak, R., Gupta, V., and Dumouchel, P. (2008a). The role of speaker factors in the NIST extended data task. In Odyssey: The Speaker and Language Recognition Workshop.

[Kenny et al., 2008b] Kenny, P., Dehak, N., Ouellet, P., Gupta, V., and Dumouchel, P. (2008b). Development of the primary CRIM system for the NIST 2008 speaker recognition evaluation. In Proc. Interspeech, Brisbane.
[Kenny and Dumouchel, 2004] Kenny, P. and Dumouchel, P. (2004). Experiments in speaker verification using factor analysis likelihood ratios. In Odyssey: The Speaker and Language Recognition Workshop, pages 219–226.
[Kenny et al., 2008c] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., and Dumouchel, P. (2008c). A study of inter-speaker variability in speaker verification. IEEE Transactions on Audio, Speech and Language Processing, 16(5):980–988.
[Kenny et al., 2008d] Kenny, P., Ouellet, P., Dehak, N., Gupta, V., and Dumouchel, P. (2008d). A study of inter-speaker variability in speaker verification. IEEE Transactions on Audio, Speech and Language Processing.
[Lin et al., 2008] Lin, C.-J., Weng, R. C., and Keerthi, S. S. (2008). Trust region Newton method for logistic regression. Journal of Machine Learning Research, 9:627–650.
[MacKay, 2003] MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press, New York, NY.
[Matejka et al., 2008] Matejka, P., Burget, L., Glembek, O., Schwarz, P., Hubeika, V., Fapso, M., Mikolov, T., Plchot, O., and Cernocky, J. (2008). BUT language recognition system for NIST 2007 evaluations. In Proc. Interspeech.
[Matejka et al., 2006] Matejka, P., Burget, L., Schwarz, P., and Cernocky, J. (2006). Brno University of Technology system for NIST 2005 Language Recognition Evaluation. In IEEE Odyssey 2006: The Speaker and Language Recognition Workshop, pages 1–7.
[Minka, 1998] Minka, T. (1998). Expectation-maximization as lower bound maximization. Technical report, Microsoft.
[National Institute of Standards and Technology, 2008] National Institute of Standards and Technology (2008). NIST speech group website. http://www.nist.gov/speech.
[Nocedal and Wright, 2006] Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer.
[Pelecanos and Sridharan, 2001] Pelecanos, J. and Sridharan, S. (2001). Feature warping for robust speaker verification. In Speaker Odyssey, pages 213–218, Crete, Greece.
[Prince and Elder, 2006] Prince, S. and Elder, J. (2006). Tied factor analysis for face recognition across large pose changes. In Proceedings of the British Machine Vision Conference, 3:889–898.
[Reynolds et al., 2000] Reynolds, D., Quatieri, T., and Dunn, R. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1/2/3):19–41.
[Schluter et al., 2001] Schluter, R., Macherey, W., Muller, B., and Ney, H. (2001). Comparison of discriminative training criteria and optimization methods for speech recognition. Speech Communication, 34:287–310.
[Schwarz et al., 2004] Schwarz, P., Matějka, P., and Černocký, J. (2004). Towards lower error rates in phoneme recognition. In International Conference on Text, Speech and Dialogue, pages 465–472.
[Schwarz et al., 2006] Schwarz, P., Matějka, P., and Černocký, J. (2006). Hierarchical structures of neural networks for phoneme recognition. In Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 325–328, Toulouse, France.

[Sollich, 1999] Sollich, P. (1999). Probabilistic interpretation and Bayesian methods for support vector machines. In Proceedings of ICANN.
[Solomonoff et al., 2005] Solomonoff, A., Campbell, W. M., and Boardman, I. (2005). Advances in channel compensation for SVM speaker recognition. In Proceedings of ICASSP.
[Strasheim and Brümmer, 2008] Strasheim, A. and Brümmer, N. (2008). SUNSDV system description: NIST SRE 2008. In NIST Speaker Recognition Evaluation Workshop Booklet.
[Tranter and Reynolds, 2006] Tranter, S. and Reynolds, D. (2006). An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1557–1565.
[Vair et al., 2006] Vair, C., Colibro, D., Castaldo, F., Dalmasso, E., and Laface, P. (2006). Channel factors compensation in model and feature domain for speaker recognition. In IEEE Odyssey 2006 Speaker and Language Recognition Workshop.
[Vair et al., 2007] Vair, C., Colibro, D., Castaldo, F., Dalmasso, E., and Laface, P. (2007). Loquendo - Politecnico di Torino's 2006 NIST speaker recognition evaluation system. In Proceedings of Interspeech 2007, pages 1238–1241.
[Valente, 2005] Valente, F. (2005). Variational Bayesian Methods for Audio Indexing. PhD thesis, Eurecom.
[Vogt et al., 2005] Vogt, R., Baker, B., and Sridharan, S. (2005). Modelling session variability in text-independent speaker verification. In Interspeech, pages 3117–3120.
[Vogt et al., 2008a] Vogt, R., Baker, B., and Sridharan, S. (2008a). Factor analysis subspace estimation for speaker verification with short utterances. In Interspeech, pages 853–856.
[Vogt et al., 2008b] Vogt, R., Lustri, C., and Sridharan, S. (2008b). Factor analysis modelling for speaker verification with short utterances. In Odyssey: The Speaker and Language Recognition Workshop.