INVITED PAPER

Audio-Visual Biometrics

Low-cost technology that combines recognition of individuals' faces with voice recognition is now available for low-security applications, but overcoming false rejections remains an unsolved problem.

By Petar S. Aleksic and Aggelos K. Katsaggelos

ABSTRACT | Biometric characteristics can be utilized in order to enable reliable and robust-to-impostor-attacks person recognition. Speaker recognition technology is commonly utilized in various systems enabling natural human computer interaction. The majority of speaker recognition systems rely only on acoustic information, ignoring the visual modality. However, visual information conveys correlated and complementary information to the audio information, and its integration into a recognition system can potentially increase the system's performance, especially in the presence of adverse acoustic conditions. Acoustic and visual biometric signals, such as the person's voice and face, can be obtained using unobtrusive and user-friendly procedures and low-cost sensors. Developing unobtrusive biometric systems makes biometric technology more socially acceptable and accelerates its integration into everyday life. In this paper, we describe the main components of audio-visual biometric systems, review existing systems and their performance, and discuss future research and development directions in this area.

KEYWORDS | Audio-visual biometrics; audio-visual databases; audio-visual fusion; audio-visual person recognition; face tracking; hidden Markov models; multimodal recognition; visual feature extraction

I. INTRODUCTION

Biometrics, or biometric recognition, refers to the utilization of physiological and behavioral characteristics for automatic person recognition [1], [2]. Person recognition can be classified into two problems: person identification and person verification (authentication). Person identification is the problem of determining the identity of a person from a closed set of candidates, while person

Manuscript received September 2, 2005; revised August 16, 2006. The authors are with the Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208 USA (e-mail: [email protected]; http://ivpl.ece.northwestern.edu/Staff/Petar.html; [email protected]). Digital Object Identifier: 10.1109/JPROC.2006.886017

0018-9219/$20.00 © 2006 IEEE

verification refers to the problem of determining whether a person is who s/he claims to be. In general, a person verification system should be capable of rejecting claims from impostors, i.e., persons not registered with the system or registered but attempting access under someone else's identity, and accepting claims from clients, i.e., persons registered with the system and claiming their own identity. Applications that can employ person recognition systems include automatic banking, computer network security, information retrieval, secure building access, e-commerce, teleworking, etc. Personal devices, such as cell phones, PDAs, laptops, cars, etc., could also have built-in person recognition systems that would help prevent impostors from using them. Traditional person recognition methods, including knowledge-based (e.g., passwords, PINs) and token-based (e.g., ATM or credit cards, and keys) methods, are vulnerable to impostor attacks. Passwords can be compromised, while keys and cards can be stolen or duplicated. Identity theft is one of the fastest-growing crimes worldwide. For example, in the U.S. over ten million people were victims of identity theft within a 12-month period in 2004 and 2005 (approximately 4.6% of the adult population) [3]. Unlike knowledge- and token-based information, biometric characteristics cannot be forgotten or easily stolen, and the systems that utilize them exhibit improved robustness to impostor attacks [1], [2], [4]. There are many different biometric characteristics that can be used in person recognition systems, including fingerprints, palm prints, hand and finger geometry, hand veins, iris and retinal scans, infrared thermograms, DNA, ears, faces, gait, voice, signature, etc. [1] (see Fig. 1). The choice of biometric characteristics depends on many factors, including the best achievable performance, robustness to noise, cost and size of biometric sensors, invariance of characteristics with time, robustness to attacks, uniqueness, population coverage, scalability, template (representation) size, etc. [1], [2]. Each biometric modality has its own advantages and disadvantages with respect to the above-mentioned factors. All of them are usually considered when choosing the most


Fig. 1. Biometric characteristics: (a) fingerprints, (b) palm print, (c) hand and finger geometry, (d) hand veins, (e) retinal scan, (f) iris, (g) infrared thermogram, (h) DNA, (i) ear, (j) face, (k) gait, (l) speech, and (m) signature.

appropriate biometric characteristics for a certain application. A comparison of the most commonly used biometric characteristics with respect to maturity, accuracy, scalability, cost, obtrusiveness, sensor size, and template size is shown in Table 1 [2]. The best recognition performance is achieved when iris, fingerprints, hand, or signature are used as biometric features. However, systems utilizing these biometric features require user cooperation, are considered intrusive, and/or utilize high cost sensors. Although reliable, they are usually not widely acceptable, except in high-security applications. In addition, there are many biometric applications, such as sport venue (gym) entrance check, access to desktop, building access, etc., which do not require high security, and in which it is very important to use unobtrusive and low-cost methods for extracting biometric features, thus enabling natural person recognition and reducing inconvenience. In this review paper, we address audio-visual (AV) biometrics, where speech is utilized together with static

video frames of the face or certain parts of the face (face recognition) [5]–[9] and/or video sequences of the face or mouth area (visual speech) [10]–[16] in order to improve person recognition performance (see Fig. 2). With respect to the type of acoustic and visual information they use, person recognition systems can be classified into audio-only, visual-only-static (only visual features obtained from a single face image are used), visual-only-dynamic (visual features containing temporal information obtained from video sequences are used), audio-visual-static, and audio-visual-dynamic. The systems that utilize acoustic information only are extensively covered in the audio-only speaker recognition literature [17] and will not be discussed here. In addition, visual-only-static (face recognition) systems are also covered extensively in the literature [18]–[22] and will not be discussed separately, but only in the context of AV biometrics. Speaker recognition systems that rely only on audio data are sensitive to microphone types (headset, desktop, telephone, etc.), acoustic environment (car, plane, factory, babble, etc.), channel noise (telephone lines, VoIP, etc.), or complexity of the scenario (speech under stress, Lombard speech, whispered speech). On the other hand, systems that rely only on visual data are sensitive to visual noise, such as extreme lighting changes, shadowing, changing background, speaker occlusion and nonfrontality, segmentation errors, low spatial and temporal resolution video, compressed video, and appearance changes (hair style, make-up, clothing). Audio-only speaker recognizers can perform poorly even at typical acoustic background signal-to-noise ratio (SNR) levels (10 to 15 dB), and the incorporation of additional biometric modalities can alleviate problems characteristic of a single modality and improve system performance. This has been well established in the literature [5]–[16], [23]–[51]. The use of visual information, in addition to audio, improves speaker recognition performance even in noise-free environments [14], [28], [34]. The potential for such improvements is greater in acoustically noisy environments, since visual speech information is typically much less affected by acoustic noise than the acoustic speech information. It is true, however, that there is an equivalent to the acoustic Lombard effect in the visual

Table 1 Comparison of Biometric Characteristics (Adapted From [2])


Fig. 2. AV biometric system.

domain, although it has been shown that it does not affect visual speech recognition as much as the acoustic Lombard effect affects acoustic speech recognition [52]. Audio-only and static-image-based (face recognition) biometric systems are susceptible to impostor attacks (spoofing) if the impostor possesses a photograph and/or speech recordings of the client. It is considerably more difficult for an impostor to impersonate both the acoustic and the dynamic visual information simultaneously. Overall, AV person recognition is one of the most promising user-friendly, low-cost person recognition technologies that is rather resilient to spoofing. It holds promise for wider adoption due to the low cost of audio and video biometric sensors and the ease of acquiring audio and video signals (even without assistance from the client) [53]. The remainder of the paper is organized as follows. We first describe in Section II the justification for combining the audio and video modalities for person recognition. We then describe the structure of an AV biometric system in Section III, and in Section IV we review the various approaches for extracting and representing important visual features. Subsequently, in Section V we describe the speaker recognition process, and in Section VI we review the main approaches for integrating the audio and visual information. In Section VII, we provide a description of some of the AV biometric systems that have appeared in the literature. Finally, in Section VIII, we provide an assessment of the topic, describe some of the open problems, and conclude the paper.

II. IMPORTANCE OF AV BIOMETRICS

Although great progress has been achieved over the past decades in computer processing of speech, it still lags significantly behind human performance levels [54], especially in noisy environments. On the other hand, humans easily accomplish complex communication tasks by utilizing additional sources of information whenever required, especially visual information. Face visibility benefits speech perception due to the fact that the visual

signal is both correlated to the produced audio signal [55]–[60] and also contains complementary information to it [55], [61]–[66]. There has been significant work on investigating the relationship between articulatory movements, vocal tract shape, and speech acoustics [67]–[70]. It has also been shown that there exists a strong correlation among face motion, vocal tract shape, and speech acoustics [55], [61]–[66]. For example, Yehia et al. [55] investigated the degree of this correlation. They measured the motion of markers placed on the face and in the vocal tract. Their results show that 91% of the total variance observed in the facial motion could be determined from the vocal tract motion, using simple linear estimators. In addition, looking at the reverse problem, they determined that 80% of the total variance observed in the vocal tract can be estimated from face motion. Regarding speech acoustics, linear estimators were sufficient to determine between 72% and 85% (depending on subject and utterance) of the variance observed in the root mean squared amplitude and line-spectrum pair parametric representation of the spectral envelope from face motion. They also showed that even the tongue motion can be reasonably well recovered from the face motion, since the tongue frequently displays motion similar to that of the jaw during speech articulation. Jiang et al. [56] investigated the correlation among external face movements, tongue movements, and speech acoustics for consonant–vowel (CV) syllables and sentences. They showed that multilinear regression could successfully be used to predict face movements from speech acoustics for short speech segments, such as CV syllables. The prediction was best for chin movements, followed by lip and cheek movements. They also showed, like the authors of [55], that there is high correlation between tongue and face movements. Hearing-impaired individuals utilize lipreading and speechreading in order to improve their speech perception. In addition, normal-hearing persons also use lipreading and speechreading to a certain extent, especially in acoustically noisy environments [61]–[66]. Lipreading represents the


perception of speech based only on the talker's articulatory gestures, while speechreading represents understanding of speech by observing the talker's articulation, facial and manual gestures, and audition [64]. Summerfield's [65] experiments showed that sentence recognition accuracy in noisy conditions can improve by 43% by speechreading and 31% by AV speech perception compared to an audio-only scenario. It has also been shown in [66] that most hearing-impaired listeners achieve a higher level of speech recognition when the acoustic information is augmented by visual information, such as mouth shapes. The bimodal integration of audio and visual information in perceiving speech has been demonstrated by the McGurk effect [71]. According to it, when, for example, the audio of a person uttering the sound /ba/ is dubbed onto a video of a person uttering the sound /ga/, most people perceive the sound /da/ as being uttered. Speech contains not only linguistic information, but also information about emotion, location, and identity, and plays an important part in the development of human–computer interaction (HCI) systems. Over the past 20 years, significant interest and effort have been focused on exploiting the visual modality in order to improve HCI [72]–[77], but also on the automatic recognition of visual speech, also known as automatic speechreading, and its integration with traditional audio-only systems, giving rise to AV automatic speech recognition (AV-ASR) systems [77]–[83]. The successes in these areas form another basis for exploiting the visual information in the person recognition problem, thus giving rise to AV biometrics.

A. Building Blocks of Acoustic and Visual Speech

The basic unit that describes how speech conveys linguistic information is the phoneme. For American English, there exist approximately 42 phonemes [84], generated by specific positions or movements of the vocal tract articulators. Similarly, the basic visually distinguishable unit, utilized in the AV speech processing and human perception literature [62], [76], [85], is the viseme. The

number of visemes is much smaller than that of phonemes. Phonemes capture the manner of articulation, while visemes capture the place of articulation [61], [62], i.e., they describe where the constriction occurs in the mouth, and how mouth parts, such as the lips, teeth, tongue, and palate, move during speech articulation. Therefore, many consonant phonemes with identical manner of articulation, which are difficult to distinguish based on acoustic information alone, may differ in the place of articulation, and thus be visually identifiable, as is, for example, the case with the two nasals /m/ (a bilabial) and /n/ (an alveolar). In contrast, certain phonemes are easier to perceive acoustically than visually, since they have identical place of articulation, but differ in the manner of articulation, as is, for example, the case with the bilabials /m/ and /p/. Various mappings between phonemes and visemes have appeared in the literature [86]. In general, they are derived by human speechreading studies, but they can also be generated using statistical clustering techniques [87]. There is no universal agreement about the exact grouping of phonemes into visemes, although some clusters are well defined. For example, phonemes /p/, /b/, and /m/ (bilabial group) are articulated at the same place (lips), thus appearing visually the same.
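To make the phoneme/viseme distinction concrete, the Python sketch below shows a many-to-one mapping from phonemes to viseme classes. The grouping is illustrative only (as noted above, no universal agreement on the clustering exists), and the dictionary and function names are ours, not taken from any cited system.

```python
# Illustrative phoneme-to-viseme grouping (one of many possible mappings).
PHONEME_TO_VISEME = {
    # bilabial group: same place of articulation (lips), visually identical
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    # labiodental group
    "f": "labiodental", "v": "labiodental",
    # alveolars, hard to separate visually from one another
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    # rounded vowels
    "uw": "rounded", "ow": "rounded",
}

def phonemes_to_visemes(phoneme_seq):
    """Map a phoneme transcription to its (coarser) viseme transcription."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phoneme_seq]

print(phonemes_to_visemes(["m", "ae", "p"]))  # ['bilabial', 'other', 'bilabial']
```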

III. AV BIOMETRICS SYSTEM DESCRIPTION

The block diagram of an AV person recognition system is shown in Fig. 3. It consists of three main blocks: preprocessing, feature extraction, and AV fusion. Preprocessing and feature extraction are performed in parallel for the two modalities. The preprocessing of the audio signal under noisy conditions includes signal enhancement, tracking of environmental and channel noise, feature estimation, and smoothing [88]. The preprocessing of the video signal typically consists of the challenging problems of detecting and tracking the face and the important facial features.

Fig. 3. Block diagram of AV biometrics system.


The preprocessing should be coupled with the choice and extraction of acoustic and visual features, as depicted by the dashed lines in Fig. 3. Acoustic features are chosen based on their robustness to channel and background noise. A number of results have been reported in the literature on extracting the appropriate acoustic features for both clean and noisy speech conditions [88]–[91]. The most commonly utilized acoustic features are mel-frequency cepstral coefficients (MFCCs) and linear prediction coefficients (LPCs). Features that are more robust to noise, obtained with the use of spectral subband centroids [90] or the zero-crossings with peak-amplitudes model [89] as an acoustic front end, have also been proposed. Acoustic features are usually augmented by their first- and second-order derivatives (delta and delta–delta coefficients) [88]. The appropriate selection and extraction of acoustic features is not addressed in this paper. On the other hand, the establishment of visual features for speaker recognition is a relatively newer research topic. Various approaches have been implemented for face detection and tracking and facial feature extraction [22], [92]–[96] and will be discussed in more detail in Section IV. The dynamics of the visual speech are captured, similarly to acoustic features, by augmenting the "static" (frame-based) visual feature vector by its first- and second-order time derivatives, which are computed over a short temporal window centered at the current video frame [86]. Mean normalization of the visual feature vectors can also be utilized to reduce variability due to illumination [115]. AV fusion combines audio and visual information in order to achieve higher person recognition performance than both audio-only and visual-only person recognition systems. If no fusion of the acoustic and visual information takes place, then audio-only and visual-only person recognition systems result (see Fig. 3). The main advantage of AV biometric systems lies in their robustness, since each modality can provide independent and complementary information and therefore prevent performance degradation due to the noise present in one or both of the modalities. There exist various adaptive fusion approaches, which weight the contributions of the different modalities based on their discrimination ability and reliability, as discussed in Section VI. The rates of the acoustic and visual features are typically different. The rate of acoustic features is usually 100 Hz [86], while video frame rates can be up to 25 frames per second (50 fields per second) for PAL or 30 frames per second (60 fields per second) for NTSC. For fusion methods that require the same rate for both modalities, the video is typically up-sampled using an interpolation technique in order to achieve AV feature synchrony at the audio rate. Finally, adaptation of the person's models is an important part of the AV system in Fig. 3. It is usually performed when the environment or the speaker's voice characteristics change, or when the person's

appearance changes, due, for example, to pose or illumination changes, facial hair, glasses, or aging [80].
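As a concrete illustration of the rate-matching and dynamic-feature steps described above, the following sketch up-samples frame-based visual features from a 30-Hz video rate to the 100-Hz audio feature rate by linear interpolation, applies mean normalization, and appends first- and second-order derivatives. It is a minimal sketch assuming NumPy; the function names, window length, and feature dimensions are illustrative assumptions, not the configuration of any specific system cited here.

```python
import numpy as np

def upsample_features(feats, src_rate=30.0, dst_rate=100.0):
    """Linearly interpolate frame-based features (T, D) from src_rate to dst_rate."""
    t_src = np.arange(feats.shape[0]) / src_rate
    t_dst = np.arange(int(feats.shape[0] * dst_rate / src_rate)) / dst_rate
    return np.stack([np.interp(t_dst, t_src, feats[:, d]) for d in range(feats.shape[1])], axis=1)

def add_deltas(feats, win=2):
    """Augment features with first- and second-order time derivatives (regression formula)."""
    def delta(x):
        pad = np.pad(x, ((win, win), (0, 0)), mode="edge")
        num = sum(k * (pad[win + k:len(x) + win + k] - pad[win - k:len(x) + win - k])
                  for k in range(1, win + 1))
        return num / (2 * sum(k * k for k in range(1, win + 1)))
    d1 = delta(feats)
    return np.hstack([feats, d1, delta(d1)])

video_feats = np.random.randn(150, 3)                 # e.g., 5 s of 3-D visual features at 30 Hz
video_feats -= video_feats.mean(axis=0)               # mean normalization against illumination bias
synced = add_deltas(upsample_features(video_feats))   # (~500, 9) at the 100-Hz audio feature rate
```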

IV. ANALYSIS OF VISUAL FEATURES UTILIZED FOR AV BIOMETRICS

The choice of acoustic features for speaker recognition has been thoroughly investigated in the literature [88] (see also other papers in this issue). Therefore, in this paper, we focus on presenting various approaches for the extraction of visual features utilized for AV speaker recognition. Visual features are usually extracted from two-dimensional (2-D) or three-dimensional (3-D) images [97]–[99], in the visible or infrared part of the spectrum [100]. Facial visual features can be classified into global or local, depending on whether the face is represented by only one or multiple feature vectors. Each local feature vector represents information contained in small image patches of the face or specific regions of the face (e.g., eyes, nose, mouth, etc.). Visual features can also be either static (a single face image is used) or dynamic (a video sequence of only the mouth region, the visual-labial features, or the whole face is used). Static visual features are commonly used for face recognition [18]–[22], [92]–[96], while dynamic visual features are used for speaker recognition, since they contain additional important temporal information that captures the dynamics of facial feature changes, especially the changes in the mouth region (visual speech). The various sets of visual facial features proposed in the literature are generally grouped into three categories [101]: 1) appearance-based features, such as transformed vectors of the face or mouth region pixel intensities using, for example, image compression techniques [28], [77], [79]–[81], [83], [102]; 2) shape-based features, such as geometric or model-based representations of the face or lip contours [77], [79]–[83]; and 3) features that are a combination of both the appearance and shape features in 1) and 2) [79]–[81], [103]. The algorithms utilized for detecting and tracking the face, mouth, or lips depend on the visual features that will be used for speaker recognition, along with the quality of the video data and the resource constraints. For example, only a rough detection of the face or mouth region is sufficient to obtain appearance-based visual features, requiring only the tracking of the face and the two mouth corners. On the other hand, a computationally more expensive lip extraction and tracking algorithm is additionally required for obtaining shape-based features, a challenging task especially in low-resolution videos.

A. Facial Feature Detection, Tracking, and Extraction

Face detection constitutes, in general, a difficult problem, especially in cases where the background, head pose, and lighting are varying. It has attracted significant interest in the literature [93]–[96], [104]–[106]. Some reported face detection systems use traditional image processing techniques, such as edge detection, image


thresholding, template matching, color segmentation, or motion information in image sequences [106]. They take advantage of the fact that many local facial subfeatures contain strong edges and are approximately rigid. Nevertheless, the most widely used techniques follow a statistical modeling of the face appearance to obtain a classification of image regions into face and nonface classes. Such regions are typically represented as vectors of grayscale or color image pixel intensities over normalized rectangles of a predetermined size. They are often projected onto lower dimensional spaces and are defined over a "pyramid" of possible locations, scales, and orientations in the image [93]. These regions are usually classified using one or more techniques, such as neural networks, clustering algorithms along with distance metrics from the face or nonface spaces, simple linear discriminants, support vector machines (SVMs), and Gaussian mixture models (GMMs) [93], [94], [104]. An alternative popular approach instead uses a cascade of weak classifiers that are trained using the AdaBoost technique and operate on local appearance features within these regions [105]. If color information is available, image regions that do not contain a sufficient number of skin-tone-like pixels can be determined (for example, utilizing hue and saturation) [107], [108] and eliminated from the search. Typically, face detection goes hand-in-hand with tracking, in which the temporal correlation is taken into account (tracking can be performed at the face or facial feature level). The simplest possible approach to capitalize on the temporal correlation is to assume that the face (or facial feature) will be present in the same spatial location in the next frame. After successful face detection, the face image can be processed to obtain an appearance-based representation of the face. Hierarchical techniques can also be used at this point to detect a number of interesting facial features, such as mouth corners, eyes, nostrils, and chin, by utilizing prior knowledge of their relative position on the face in order to simplify the search. Also, if color information is available, hue and saturation information can be utilized in order to directly detect and extract certain facial features (especially lips) or constrain the search area and enable more accurate feature extraction [107], [108]. These features can be used to extract and normalize the mouth region-of-interest (ROI), containing useful visual speech information. The normalization is usually performed with respect to head-pose information and lighting [Fig. 4(a) and (b)]. The appearance-based features are extracted from the ROI using image transforms (see Section IV). On the other hand, shape-based visual mouth features (divided into geometric, parametric, and statistical, as in Fig. 5) are extracted from the ROI utilizing techniques such as snakes [109], templates [110], and active shape and appearance models [111]. A snake is an elastic curve represented by a set of control points, and it is used to detect important visual features, such as lines, edges, or contours. The snake control point coordinates are iteratively

updated, converging towards a minimum of the energy function, defined on the basis of curve smoothness constraints and a matching criterion to desired features of the image [109]. Templates are parametric curves that are fitted to the desired shape by minimizing an energy function, defined similarly to that for snakes. Examples of lip contour estimation using a gradient vector field (GVF) snake and two parabolic templates are depicted in Fig. 4(c) [82]. Examples of statistical models are active shape models (ASMs) and active appearance models (AAMs) [111]. The former are obtained by applying principal component analysis (PCA) [112] to training vectors containing the coordinates of a set of points that lie on the shapes of interest, such as the inner and outer lip contours. These vectors are projected onto a lower dimensional space defined by the eigenvectors corresponding to the largest PCA eigenvalues, representing the axes of shape variation. The latter are extensions of ASMs that, in addition, capture the appearance variation of the region around the desired shape. AAMs remove the redundancy due to shape and appearance correlation and create a single model that describes both the shape and the corresponding appearance deformation.
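The following sketch illustrates the ASM idea described above: PCA is applied to training vectors of lip-contour landmark coordinates, the eigenvectors associated with the largest eigenvalues are retained as shape-variation axes, and new shapes are projected onto this low-dimensional space. It is a toy example with synthetic shapes; the variance threshold and function names are our own choices, not those of [111] or [112].

```python
import numpy as np

def train_shape_model(shapes, var_kept=0.95):
    """PCA shape model from training shapes of size (N, 2*L): L contour points per shape."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    cov = centered.T @ centered / (len(shapes) - 1)
    evals, evecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), var_kept)) + 1
    return mean, evecs[:, :k]                    # mean shape and shape-variation axes

def project(shape, mean, axes):
    """Low-dimensional shape parameters b such that shape is approx. mean + axes @ b."""
    return axes.T @ (shape - mean)

# Synthetic training shapes: 200 examples of 20 (x, y) landmark points each.
shapes = np.random.randn(200, 2 * 20) * 0.1 + np.tile(np.linspace(0, 1, 40), (200, 1))
mean, axes = train_shape_model(shapes)
b = project(shapes[0], mean, axes)               # compact shape-based feature vector
```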

B. Visual Features

With appearance-based approaches to visual feature representation, the pixel values of the face or mouth ROI, tracked and extracted according to the discussion of the previous section, are directly considered. The extracted ROI is typically a rectangle containing the mouth, possibly including larger parts of the lower face, such as the jaw and cheeks [80], or it could even be the entire face [103] (see Figs. 4(b) and 5). It can also be extended into a three-dimensional rectangle, containing adjacent frame ROIs, thus capturing dynamic visual speech information. Alternatively, the mouth ROI can be obtained from a number of image profiles vertical to the estimated lip contour, as in [79], or from a disc around the mouth center [113]. A feature vector x_t (see Fig. 5) is created by ordering the grayscale pixel values inside the ROI. The

Fig. 4. Mouth appearance and shape tracking for visual feature extraction. (a) Commonly detected facial features. (b) Two corresponding mouth ROIs of different sizes. (c) Lip contour estimation using a gradient vector field snake (upper: the snake’s external force field is depicted) and two parabolas (lower) [82].


Fig. 5. Various visual speech feature representation approaches discussed in this section: appearance-based (upper) and shape-based features (lower) that may utilize lip geometry, parametric, or statistical lip models.

dimension d of this vector typically becomes prohibitively large for successful statistical modeling of the classes of interest, and thus a lower dimensional transformation of it is used instead. A D × d dimensional linear transform matrix R is generally sought, such that the transformed data vector y_t = R x_t contains most of the speechreading information in its D ≪ d elements (see Fig. 5). Matrix R is often obtained based on a number of training ROI grayscale pixel value vectors, utilizing techniques borrowed from the image compression and pattern classification literature. Examples of such transforms are PCA, generating "eigenlips" (or "eigenfaces" if applied to face images for face recognition) [113], the discrete cosine transform (DCT) [114], [115], the discrete wavelet transform (DWT) [115], linear discriminant analysis (LDA) [11], [77], [80], the Fisher linear discriminant (FLD), and the maximum likelihood linear transform (MLLT) [28], [80]. PCA provides a low-dimensional representation optimal in the mean-squared error sense, while LDA and FLD provide the most discriminant features, that is, features that offer a clear separation between the pattern classes. Often, these transforms are applied in a cascade [11], [80] in order to cope with the "curse of dimensionality" problem. In addition, fast algorithmic implementations are available for some of these transformations. It is important to point out that appearance-based features allow for dynamic visual feature extraction in real time, due to the fact that a rough ROI extraction can be achieved by utilizing computationally inexpensive face detection algorithms. Clearly, the quality of the appearance-based visual features degrades under intense head-pose and lighting variations. With shape-based features, it is assumed that most of the information is contained in the face contours or the shape of the speaker's lips [77], [102], [103], [116]. Therefore,

such features achieve a compact representation of facial images and visual speech using low-dimensional vectors and are invariant to head pose and lighting. However, their extraction requires robust algorithms and is often difficult and computationally intensive in realistic scenarios. Geometric features, such as the height, width, and perimeter of the mouth, are meaningful to humans and can be readily extracted from mouth images. Geometric mouth features have been used for visual speech recognition [87], [117] and speaker recognition [117]. Alternatively, model-based visual features are typically obtained in conjunction with a parametric or statistical facial feature extraction algorithm. With model-based approaches, the model parameters are directly used as visual speech features [79], [82], [111]. An example of model-based visual features is represented by the facial animation parameters (FAPs) of the outer- and inner-lip contours [14], [82], [102]. FAPs describe facial movement and are used in the MPEG-4 AV object-based video representation standard to control facial animation, together with the so-called facial definition parameters (FDPs) that describe the shape of the face. The FAP extraction system described in [82] is shown in Fig. 6. The system first employs a template matching algorithm to locate the person's nostrils by searching the central area of the face in the first frame of each sequence. Tracking is performed by centering the search area in the next frame at the location of the nose in the previous frame. The nose location is used to determine the approximate mouth location. Subsequently, the outer lip contour is determined by using a combination of a GVF snake and a parabolic template [see also Fig. 4(c)]. Following the outer lip contour detection and tracking, ten FAPs describing the outer-lip shape ("group 8" FAPs [82])


Fig. 6. Shape-based visual feature extraction system of [82], depicted schematically in parallel with audio front end, as used for AV speaker recognition experiments [14].

are extracted from the resulting lip contour (see also Figs. 4 and 5). These are placed into a feature vector, which is subsequently projected by means of PCA onto a three-dimensional space [82]. The resulting visual features (eigenFAPs) are augmented by their first- and second-order derivatives, providing a nine-dimensional dynamic visual speech vector. Since appearance- and shape-based visual features contain, respectively, low- and high-level information about the person's face and lip movements, their combination has been utilized in the expectation of improving the performance of the recognition system. Features of each type are usually just concatenated [79], or a single model of face shape and appearance is created [103], [111]. For example, PCA appearance features are combined with snake-based features or ASMs, or a single model of face shape and appearance is created using AAMs [111]. PCA can be applied further to this single model vector [103]. In summary, a number of approaches can be used for extracting and representing the visual information utilized for speaker recognition. Unfortunately, limited work exists in the literature comparing the relative performance of visual speech features. The advantage of appearance-based features is that they, unlike shape-based features, do not require sophisticated extraction methods. In addition, appearance-based visual features contain information that cannot be captured by shape-based visual features. Their disadvantage is that they are generally sensitive to lighting and rotation changes. The dimensionality of appearance-based visual features is also usually much higher than that of shape-based visual features, which affects reliable training of AV person recognition systems. Most comparisons of visual features are made for features within the same category (appearance- or shape-based) in the context of AV or V-only person recognition, or AV-ASR [103], [114]–[116]. Occasionally, features across categories are compared, but in most cases with inconclusive results [102], [103], [114]. Thus, the question of what are the most appropriate and robust visual speech features remains to a

large extent unresolved. Clearly, the characteristics of the particular application and factors such as computational requirements, video quality, and the visual environment have to be considered in addressing this question.
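As a minimal illustration of the appearance-based pipeline discussed in this section (flattened mouth-ROI pixels projected by a learned D × d transform R, as in y_t = R x_t), the sketch below learns R with PCA via an SVD and projects ROI frames to low-dimensional "eigenlip" features. NumPy is assumed; the ROI size, output dimension, and function names are illustrative and not taken from any specific cited system.

```python
import numpy as np

def learn_pca_transform(train_rois, d_out=32):
    """Learn a D x d transform R from training ROI pixel vectors (N, d), as in y_t = R @ x_t."""
    mean = train_rois.mean(axis=0)
    # SVD of the centered data gives the principal axes without forming the d x d covariance.
    _, _, vt = np.linalg.svd(train_rois - mean, full_matrices=False)
    return mean, vt[:d_out]                       # R has shape (D, d), with D << d

def appearance_features(roi_frames, mean, R):
    """Project a sequence of flattened mouth-ROI frames to low-dimensional appearance features."""
    return (roi_frames - mean) @ R.T

rois = np.random.rand(500, 32 * 32)               # e.g., 500 flattened 32x32 grayscale mouth ROIs
mean, R = learn_pca_transform(rois, d_out=32)
y = appearance_features(rois[:10], mean, R)       # (10, 32) per-frame appearance features
```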

V. SPEAKER RECOGNITION PROCESS

Although single-modality biometric systems can achieve high performance in some cases, they are usually not robust under nonideal conditions and do not meet the needs of many potential person recognition applications. In order to improve the robustness of biometric systems, multisample (multiple samples of the same biometric characteristic), multialgorithm (multiple algorithms applied to the same biometric sample), and multimodal (different biometric characteristics) biometric systems have been developed [5]–[16], [23]–[51]. Different modalities can provide independent and complementary information, thus alleviating problems characteristic of single modalities. The fusion of multiple modalities is a critical issue in the design of any recognition system, as is the case with the fusion of the audio and visual modalities in the design of AV person recognition systems. In order to justify the complexity and cost of incorporating the visual modality into a person recognition system, fusion strategies should ensure that the performance of the resulting AV system exceeds that of its single-modality counterparts, hopefully by a significant amount, especially in nonideal conditions (e.g., acoustically noisy environments). The choice of classifiers and of algorithms for feature and classifier fusion is clearly central to the design of AV person recognition systems. In this section, we describe the speaker recognition process, and in the next section we review the main concepts and techniques for combining the acoustic and visual feature streams. Audio-only speaker recognition has been extensively discussed in the literature [17], and the objective of this paper is to concentrate only on AV speaker recognition systems. The process of speaker recognition consists of two


phases, those of training and testing. In the training phase, the speaker is asked to utter certain phrases in order to acquire the data to be used for training. In general, the larger the amount of training data, the better the performance of a speaker recognition system. In the testing phase, the speaker utters a certain phrase, and the system accepts or rejects his/her claim (speaker authentication) or makes a decision on the speaker's identity (speaker identification). The testing phase can be followed by an adaptation phase, during which the recognized speaker's data is used to update the models corresponding to him/her. This phase is used for improving the robustness of the system to changes in the speaker's acoustic and visual speech characteristics over time, due to channel and environmental noise, visual appearance changes, etc. Maximum a posteriori (MAP) [112] adaptation is commonly used in this phase. Speaker recognition systems can be classified into text-dependent and text-independent, based on the text used in the testing phase. Text-dependent systems can be further divided into fixed-phrase and prompted-phrase systems. Fixed-phrase systems are trained on the phrase that is also used for testing. However, these systems are very vulnerable to attacks (an impostor, for example, could play a recorded phrase uttered by the user). Prompted-phrase systems ask the claimant to utter a word sequence (phoneme sequence) not used in the training phase or in previous tests. These systems are less vulnerable to attacks (an impostor would need to generate in real time an AV representation of the speaker) but require an interface for prompting phrases. In text-independent systems, the speech used for testing is unconstrained. These systems are convenient for applications in which we cannot control the claimant's input. However, they would be classified as more vulnerable to attacks than the prompted-phrase systems.

A. Classifiers in AV Speaker Recognition

The speaker recognition problem is a classification problem. A set of classes needs to be defined first, and then, based on the observations, one of these classes is chosen. Let C denote the set of all classes. In speaker identification systems, C typically consists of the enrolled subject population, possibly augmented by a class denoting the unknown subject. On the other hand, in speaker authentication systems, C reduces to a two-member set, consisting of the class corresponding to the user and the general population (impostor class). The number of classes can be larger than mentioned above, as, for example, in text-dependent speaker recognition (and ASR). In this case, subphonetic units are considered, utilizing tree-based clustering of possible phonetic contexts (biphones or triphones, for example), to allow coarticulation modeling [86]. The set of classes C then becomes the product space between the set of speakers and the set of phonetic-based units. Similarly, one could consider visemic subphonetic units, obtained for example by

decision tree clustering based on visemic context (visemic units have been used in ASR [118]). The same set of classes is typically used if both speech modalities are considered, since the use of different classes complicates AV integration, especially in text-dependent speaker recognition. The input observations to the recognizer are represented by the extracted feature vectors at time t, o_{s,t}, and their sequence over a time interval T, O_s = {o_{s,t}, t ∈ T}, where s ∈ S denotes the available modalities. For instance, for the AV speaker recognition system, S = {a, v, f}, where a stands for audio, v for visual-dynamic (visual-labial), and f for face appearance input. A number of approaches can be used to model our knowledge of how the observations are generated by each class. Such approaches are usually statistical in nature and express our prior knowledge in terms of the conditional probability P(o_{s,t}|c), where c ∈ C, by utilizing artificial neural networks (ANNs), support vector machines (SVMs), Gaussian mixture models (GMMs), hidden Markov models (HMMs), etc. The parameters of the prior model are estimated during training. During testing, based on the trained model, which describes the way the observations are generated by each class, the posterior probability is maximized. In single-modality speaker identification systems, the objective is to determine the class ĉ, corresponding to one of the enrolled persons or the unknown person, that best matches the person's biometric data, that is

$$\hat{c} = \arg\max_{c \in C} P(c \mid o_{s,t}), \qquad s \in \{a, v, f\} \tag{1}$$

where P(c|o_{s,t}) represents the posterior conditional probability. In closed-set identification systems the unknown person is not modeled and the classification is forced into one of the enrolled persons' classes. In single-modality speaker verification systems there are only two classes: w, the class corresponding to the general population (impostor class), and c, the class corresponding to the true claimant. Then, the following similarity measure D can be defined:

$$D = \log P(c \mid o_{s,t}) - \log P(w \mid o_{s,t}), \qquad s \in \{a, v, f\} \tag{2}$$

If D is larger than an a priori defined verification threshold, the claim is accepted; otherwise, it is rejected. The world model is usually obtained using biometric data from the general speaker population. A widely used prior model is the GMM, expressed by

$$P(o_{s,t} \mid c) = \sum_{k=1}^{K_{s,c}} w_{s,c,k}\, N(o_{s,t}; m_{s,c,k}, \Sigma_{s,c,k}) \tag{3}$$


where K_{s,c} denotes the number of mixture weights w_{s,c,k}, which are positive and add up to one, and N(o; m, Σ) represents a multivariate Gaussian distribution with mean m and covariance matrix Σ, typically considered to be diagonal. During training, the parameters of each GMM in (3), namely w_{s,c,k}, m_{s,c,k}, and Σ_{s,c,k}, are estimated. In certain applications, for example, text-independent speaker recognition, a single GMM is used to model the entire observation sequence O_s for each class c. Then, for a given modality s, the MAP estimate of the unknown class is obtained as

$$\hat{c} = \arg\max_{c \in C} P(c \mid O_s) = \arg\max_{c \in C} P(c) \prod_{t \in T} P(o_{s,t} \mid c) \tag{4}$$

where P(c) denotes the class prior. In a number of other applications, such as text-dependent speaker recognition, the observations are generated by a temporal sequence of interacting states. In this case, HMMs are widely used. They consist of a number of states. At each time instance, one of them generates ("emits") the observed features with class-conditional probability given by (3). The HMM parameters (mixture weights, means, variances, transition probabilities) are typically estimated iteratively during training, using the expectation-maximization (EM) algorithm [84], [86], or by discriminative training methods [84]. Once the model parameters are estimated, HMMs can be used to obtain the "optimal state sequence" λ̂^c = {λ^c_t, t ∈ T} per class c, given an observation sequence O_s over an interval T, namely

$$\hat{\lambda}^c = \arg\max_{\lambda^c} P(\lambda^c \mid O_s) = \arg\max_{\lambda^c} P(c) \prod_{t \in T} P\left(\lambda^c_t \mid \lambda^c_{t-1}\right) P\left(o_{s,t} \mid \lambda^c_t\right) \tag{5}$$

where P(λ^c_t | λ^c_{t-1}) denotes the transition probability from state λ^c_{t-1} to state λ^c_t. The Viterbi algorithm, based on dynamic programming, is used for solving (5) [84], [86]. Then, the MAP estimate of the unknown class is obtained as

$$\hat{c} = \arg\max_{c \in C} P(\hat{\lambda}^c \mid O_s) \tag{6}$$
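A minimal sketch of the GMM-based recognition described by (2)–(4) is given below, using scikit-learn's GaussianMixture as a stand-in for the models in (3); the paper does not prescribe a particular implementation, so this is only one possible realization. Class priors are assumed uniform, identification picks the class maximizing the sequence log-likelihood as in (4), and verification thresholds the log-likelihood ratio of (2) against a world model; the data, model sizes, and threshold are toy values.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, n_mix=8):
    """Diagonal-covariance GMM for one modality of one speaker, as in (3)."""
    return GaussianMixture(n_components=n_mix, covariance_type="diag", random_state=0).fit(features)

def identify(models, observations):
    """Closed-set identification: pick the class maximizing the sequence log-likelihood, cf. (4)."""
    scores = {c: m.score_samples(observations).sum() for c, m in models.items()}
    return max(scores, key=scores.get)

def verify(claimed_model, world_model, observations, threshold=0.0):
    """Log-likelihood-ratio verification against a world (impostor) model, cf. (2)."""
    D = claimed_model.score_samples(observations).sum() - world_model.score_samples(observations).sum()
    return D > threshold, D

# Toy usage with random "feature" data standing in for acoustic or visual features.
rng = np.random.default_rng(0)
models = {c: train_speaker_model(rng.normal(c, 1.0, size=(200, 12))) for c in range(3)}
world = train_speaker_model(rng.normal(1.0, 2.0, size=(600, 12)))
test = rng.normal(2, 1.0, size=(50, 12))
print(identify(models, test), verify(models[2], world, test))
```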

B. Performance Evaluation Measures

The performance of identification systems is usually reported in terms of identification error or rank-N correct identification rate, defined as the probability that the correct match of the unknown person's biometric data is in the top N similarity scores (this scenario corresponds to an identification system which is not fully automated, needing human intervention or additional identification systems applied in cascade). Two commonly used error measures for verification performance are the false acceptance rate (FAR), i.e., the rate at which impostors are accepted, and the false rejection rate (FRR), i.e., the rate at which clients are rejected. They are defined by

$$\mathrm{FAR} = \frac{I_A}{I} \times 100\%, \qquad \mathrm{FRR} = \frac{C_R}{C} \times 100\% \tag{7}$$

where I_A denotes the number of accepted impostors, I the number of impostor claims, C_R the number of rejected clients, and C the number of client claims. There is an inherent tradeoff between FAR and FRR, which is controlled by an a priori chosen verification threshold. The receiver operating characteristic (ROC) curve or the detection error tradeoff (DET) curve can be used to graphically represent the tradeoff between FAR and FRR [119]. DET and ROC depict FRR as a function of FAR on a log and linear scale, respectively. The detection cost function (DCF) is a measure derived from FAR and FRR according to

$$\mathrm{DCF} = \mathrm{Cost(FR)} \cdot P(\mathrm{client}) \cdot \mathrm{FRR} + \mathrm{Cost(FA)} \cdot P(\mathrm{impostor}) \cdot \mathrm{FAR} \tag{8}$$

where P(client) and P(impostor) are the prior probabilities that a client or an impostor will use the system, respectively, while Cost(FA) and Cost(FR) represent, respectively, the costs of false acceptance and false rejection. The half total error rate (HTER) [119], [120] is a special case of the DCF when the prior probabilities are equal to 0.5 and the costs equal to 1, resulting in HTER = (1/2)(FRR + FAR). Verification system performance is often reported using a single measure, either by choosing the threshold for which FAR and FRR are equal, resulting in the equal error rate (EER), or by choosing the threshold that minimizes the DCF (or HTER). The appropriate threshold can be found either using the test set (providing biased results) or a separate validation set [121]. Expected performance curves (EPCs) are proposed as a verification measure in [121] and [122]. They provide unbiased expected system performance analysis using a validation set to compute thresholds corresponding to various criteria related to real-life applications.
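The sketch below illustrates how FAR, FRR, and the EER operating point can be computed from client and impostor score lists, following the definitions in (7); it uses synthetic Gaussian scores and a simple threshold sweep, so the numerical values are purely illustrative.

```python
import numpy as np

def far_frr(client_scores, impostor_scores, threshold):
    """FAR and FRR (in %) at a given decision threshold, per (7): accept if score >= threshold."""
    far = 100.0 * np.mean(impostor_scores >= threshold)   # accepted impostors / impostor claims
    frr = 100.0 * np.mean(client_scores < threshold)      # rejected clients / client claims
    return far, frr

def equal_error_rate(client_scores, impostor_scores):
    """Sweep thresholds over the observed scores and return the point where FAR and FRR meet."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    rates = np.array([far_frr(client_scores, impostor_scores, t) for t in thresholds])
    i = np.argmin(np.abs(rates[:, 0] - rates[:, 1]))
    return thresholds[i], rates[i].mean()                 # threshold and EER (= HTER at that point)

clients = np.random.normal(2.0, 1.0, 1000)                # verification scores of true claims
impostors = np.random.normal(0.0, 1.0, 1000)              # verification scores of impostor claims
thr, eer = equal_error_rate(clients, impostors)
print(f"EER of about {eer:.2f}% at threshold {thr:.2f}")
```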

VI. AV FUSION METHODS

Information fusion is used for the integration of different sources of information with the ultimate goal of achieving superior classification results. Fusion approaches are usually classified into three categories: premapping fusion, midst-mapping fusion, and postmapping fusion [8]


Fig. 7. AV fusion methods (adapted from [8]).

(see Fig. 7). These are also referred to in the literature as early integration, intermediate integration, and late integration, respectively [80], and the terms will be used interchangeably in this paper. In premapping fusion, audio and visual information are combined before the classification process. In midst-mapping fusion, audio and visual information are combined during the mapping from sensor data or feature space into opinion or decision space. Finally, in postmapping fusion, information is combined after the mapping from sensor data or feature space into opinion or decision space. In the remainder of this section, following [8], we will review some of the most commonly used information fusion methods and comment on their use for the fusion of acoustic and visual information.

A. Premapping Fusion (Early Integration)

Premapping fusion can be divided into sensor data level and feature level fusion. In sensor data level fusion [123], the sensor data obtained from different sensors of the same modality are combined. Weighted summation and mosaic construction are typically utilized in order to enable sensor data level fusion. In the weighted summation approach, the data are first normalized, usually by mapping them to a common interval, and then combined utilizing weights. For example, weighted summation can be utilized to combine multiple visible and infrared images or to combine acoustic data obtained from several microphones of different types and quality. Mosaic construction can be utilized to create one image from images of parts of the face obtained by several different cameras. Feature level fusion represents the combination of the features obtained from different sensors. Joint feature vectors are obtained either by weighted summation (after normalization) or by concatenation (e.g., by appending a visual to an acoustic feature vector, with normalization, in order to obtain a joint feature vector). The features

obtained by the concatenation approach are usually of high dimensionality, which can affect reliable training of a classification system (the "curse of dimensionality") and ultimately recognition performance. In addition, concatenation does not allow for the modeling of the reliability of the individual feature streams. For example, in the case of audio and visual streams, it cannot take advantage of the information that might be available about the acoustic or the visual noise in the environment. Furthermore, the audio and visual feature streams should be synchronized before the concatenation is performed.
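A minimal sketch of feature-level fusion by concatenation is shown below: each (already rate-matched) stream is normalized and the frames are stacked side by side, producing the high-dimensional joint vectors whose drawbacks are discussed above. The z-score normalization and the feature dimensions are illustrative assumptions.

```python
import numpy as np

def concatenate_streams(audio_feats, visual_feats):
    """Feature-level fusion: per-stream normalization followed by frame-wise concatenation.
    Both streams are assumed to be synchronized to the same frame rate (e.g., 100 Hz)."""
    assert len(audio_feats) == len(visual_feats), "streams must be rate-matched first"
    def zscore(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    # The joint dimensionality is the sum of the per-stream dimensionalities,
    # which is what makes reliable training harder (curse of dimensionality).
    return np.hstack([zscore(audio_feats), zscore(visual_feats)])

joint = concatenate_streams(np.random.randn(500, 39), np.random.randn(500, 9))  # (500, 48)
```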

B. Midst-Mapping Fusion (Intermediate Integration)

With midst-mapping fusion, information streams are processed during the procedure of mapping the feature space into the opinion or decision space. It exploits the temporal dynamics contained in the different streams, thus avoiding problems resulting from vector concatenation, such as the "curse of dimensionality" and the requirement of matching rates. Furthermore, stream weights can be utilized in midst-mapping fusion to account for the reliability of different streams of information. Multistream and extended HMMs [32], [79], [80], [82], [88] are commonly used for midst-mapping fusion in AV speaker recognition systems [13], [30], [32], [33]. For example, multistream HMMs can be used to combine acoustic and visual dynamic information [80], [82], [88], allowing for easy modeling of the reliability of the audio and visual streams and various levels of asynchronicity between them. In the state-synchronous case, the probability density function for each HMM state is defined as [77], [88]

$$P(o_{s,t} \mid \lambda^c_t) = \prod_{s \in \{a,v\}} \left[ \sum_{k=1}^{K_{s,c}} w_{s,\lambda^c_t,k}\, N(o_{s,t}; m_{s,c,k}, \Sigma_{s,c,k}) \right]^{\gamma_s} \tag{9}$$

Vol. 94, No. 11, November 2006 | Proceedings of the IEEE

2035

Aleksic and Katsaggelos: Audio-Visual Biometrics

where γ_s denotes the stream weight corresponding to modality s. The stream weights add up to one, that is, γ_a + γ_v = 1. The combination of audio and visual information can also be performed at the phone, word, and utterance level [80], allowing for different levels of asynchronicity between the audio and visual streams.
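The following sketch evaluates the state-synchronous multistream emission probability of (9) in the log domain, weighting the per-stream GMM log-likelihoods by exponents that sum to one. SciPy's multivariate normal density is used for the mixture components; the stream-weight value and the toy GMM parameters are illustrative assumptions, and the γ notation follows the reconstruction used in (9) above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def stream_gmm_loglik(obs, weights, means, diag_covs):
    """log sum_k w_k N(obs; m_k, diag(cov_k)) for one stream of one HMM state."""
    comps = [np.log(w) + multivariate_normal.logpdf(obs, mean=m, cov=np.diag(c))
             for w, m, c in zip(weights, means, diag_covs)]
    return np.logaddexp.reduce(comps)

def state_log_likelihood(obs_audio, obs_visual, audio_gmm, visual_gmm, gamma_audio=0.7):
    """State-synchronous multistream emission of (9), in the log domain:
    gamma_a * log p_a + gamma_v * log p_v, with gamma_a + gamma_v = 1."""
    log_pa = stream_gmm_loglik(obs_audio, *audio_gmm)
    log_pv = stream_gmm_loglik(obs_visual, *visual_gmm)
    return gamma_audio * log_pa + (1.0 - gamma_audio) * log_pv

# Toy two-component GMMs per stream: (weights, means, diagonal covariances).
audio_gmm = ([0.6, 0.4], [np.zeros(3), np.ones(3)], [np.ones(3), np.ones(3)])
visual_gmm = ([0.5, 0.5], [np.zeros(2), -np.ones(2)], [np.ones(2) * 0.5, np.ones(2)])
print(state_log_likelihood(np.zeros(3), np.zeros(2), audio_gmm, visual_gmm))
```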

C. Postmapping Fusion (Late Integration)

Postmapping fusion approaches are grouped into decision fusion and opinion fusion (also referred to as score-level fusion). With decision fusion [11], [14], classifier decisions are combined in order to obtain the final decision by utilizing majority voting, combination of ranked lists, or AND and OR operators. In majority voting [48], the final decision is made when the majority of the classifiers reaches the same decision. The number of classifiers should be chosen carefully in order to prevent ties (e.g., for a two-class problem, such as speaker verification, the number of classifiers should be odd). In ranked list combination fusion [48], [124], the ranked lists provided by each classifier are combined in order to obtain the final ranked list. There exist various approaches for the combination of ranked lists [124]. In AND fusion [125], the final decision is made only if all classifiers reach the same decision. This type of fusion is typically used for high-security applications where we want to achieve very low FA rates, by allowing higher FR rates. On the other hand, when the OR fusion method is utilized, the final decision is made as soon as one of the classifiers reaches a decision. This type of fusion is utilized for low-security applications where we want to achieve lower FR rates and prevent causing inconvenience to the registered users, by allowing higher FA rates. Unlike decision fusion, in opinion (score-level) fusion methods the experts do not provide final decisions but only opinions (scores) on each possible decision. The opinions are usually first normalized by mapping them to a common interval and then combined utilizing weights (e.g., either weighted summation or weighted product fusion). The weights are determined based on the discriminating ability of the classifier and the quality of the utilized features (usually affected by the feature extraction method and/or the presence of different types of noise). For example, when audio and visual information are employed, the acoustic SNR and/or the quality of the visual feature extraction algorithms are considered in determining the weights. After the opinions are combined, the class that corresponds to the highest opinion is chosen. In the postclassifier opinion fusion approach [8], the likelihoods corresponding to each of the N_C classes of interest, obtained utilizing each of the N_L available experts, are considered as features in the N_L × N_C dimensional space, where the classification of the resulting features is performed. This method is particularly useful in verification applications since only two classes are available. In the case of AV speaker verification, the number of experts can also be small (two or more). However, utilizing this method in identification problems

(with a large number of classes), or when the number of experts N_L is large, can result in features of high dimensionality, which could cause inadequate performance. It is important to point out that the problem of combining single-modality biometric results becomes more complicated in the case of speaker identification, which requires one-to-many comparisons to find the best match. In addition, when choosing the biometric modalities to be used in a multimodal biometric system, one should consider not only how much the modalities complement each other in terms of identification accuracy but also the identification speed. In some cases, a cascade of single-modality biometric classifiers can be used in order to narrow down the number of candidates with a fast, but not necessarily highly accurate, first biometric system, and then use a second, higher accuracy biometric system on the remaining candidates to improve the overall identification performance.
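As a small illustration of opinion (score-level) fusion, the sketch below min-max normalizes per-class scores from an audio and a visual expert to a common interval and combines them with a weighted sum, after which the class with the highest fused opinion is chosen. The weight and score ranges are illustrative; in practice, as noted above, the audio weight might be lowered when the acoustic SNR is poor.

```python
import numpy as np

def minmax_normalize(scores, low, high):
    """Map raw classifier scores to [0, 1] given the score range observed on development data."""
    return np.clip((scores - low) / (high - low), 0.0, 1.0)

def weighted_sum_fusion(audio_scores, visual_scores, w_audio=0.6):
    """Opinion (score-level) fusion: weighted sum of normalized per-class scores.
    The weight could be set adaptively, e.g., lowered for the audio stream at low SNR."""
    return w_audio * audio_scores + (1.0 - w_audio) * visual_scores

audio = minmax_normalize(np.array([-12.0, -8.5, -15.0]), low=-20.0, high=-5.0)   # per-class audio scores
visual = minmax_normalize(np.array([0.3, 0.9, 0.1]), low=0.0, high=1.0)          # per-class visual scores
fused = weighted_sum_fusion(audio, visual)
print("chosen class:", int(np.argmax(fused)))
```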

VII. AV BIOMETRIC SYSTEMS

In this section, we briefly review corpora commonly used for AV person recognition research and present examples of specific systems and their performance.

A. AV Databases

In contrast to the abundance of audio-only databases, there exist only a few databases suitable for AV biometric research. This is because the field is relatively young, but also due to the fact that AV corpora pose additional challenges concerning database collection, storage, distribution, and privacy. The most commonly used databases in the literature were collected by a few university groups or individual researchers with limited resources, and as a result, they usually contain a small number of subjects and are of relatively short duration. AV databases vary greatly in the number of speakers, vocabulary size, number of sessions, nonideal acoustic and visual conditions, and evaluation measures. This makes the comparison of different visual features and fusion methods, with respect to the overall performance of an AV biometric system, difficult. They lack realistic variability and are usually limited to one area or focus on one aspect of biometric person recognition. Some of the currently publicly available AV databases which have been used in the published literature for AV biometrics are the M2VTS (multimodal verification for teleservices and security applications) [126] and XM2VTS (extended M2VTS) [127], BANCA (biometric access control for networked and e-commerce applications) [128], VidTIMIT [26] (video recordings of people reciting sentences from the TIMIT corpus), DAVID [129], VALID [130], and AVICAR (AV speech corpus in a car environment) [131] databases. We next provide a short description of each of them. The M2VTS [126] database consists of audio recordings and video sequences of 37 subjects uttering the digits


Table 2 Sample AV Person Recognition Systems

0 through 9 in five sessions spaced apart by at least one week. The subjects were also asked to rotate their head to the left and then to the right in each session in order to obtain a head rotation sequence that can provide 3-D face features to be used for face recognition purposes. The main drawbacks of this database are its small size and limited vocabulary. The extended M2VTS database [127] consists of audio recordings and video sequences of 295 subjects uttering three fixed phrases, two ten-digit sequences and one seven-word sentence, with two utterances of each phrase, in four sessions. The main drawback of this database is its limitation to the development of text-dependent systems. Both M2VTS and XM2VTS databases have been frequently used in the literature for comparison of different AV biometric systems (see Table 2). The BANCA database consists of audio recordings and video sequences of 208 subjects (104 male, 104 female) recorded in three different scenarios, controlled, degraded, and adverse, over 12 different sessions spanning three months. The subjects were asked to say a random 12-digit number, their name, their address, and date of birth during each of the recordings. The BANCA database was captured


in four European languages. Both high- and low-quality microphones and cameras were used for recording. This database provides realistic and challenging conditions and allows for comparison of different systems with respect to their robustness. The VidTIMIT database consists of audio recordings and video sequences of 43 subjects (19 female and 24 male), reciting short sentences from the test section of the NTIMIT corpus [26] in three sessions with an average delay of a week between sessions, allowing for appearance and mood changes. Each person utters ten sentences. The first two sentences are the same for all subjects, while the remaining eight are generally different for each person. All sessions contain phonetically balanced sentences. In addition to the sentences, the subjects were asked to move their heads left, right, up, and then down, in order to obtain a head rotation sequence. AV biometric systems that utilize the VidTIMIT corpus are described in [8]. The DAVID database consists of audio and video recordings (frontal and profile views) of more than 100 speakers, including 30 subjects recorded in five sessions over a period of several months. The utterances include a digit set, the alphabet set, vowel-consonant-vowel syllables, and phrases. The challenging visual conditions include illumination changes and variable scene background complexity. The VALID database consists of five recordings of 106 subjects (77 male, 29 female) over a period of one month. Four of the sessions were recorded in an office environment in the presence of visual noise (illumination changes) and acoustic noise (background noise). In addition, one session was recorded in a studio environment and contains a head rotation sequence, in which the subjects were asked to face four targets, placed above, below, left, and right of the camera. The database consists of recordings of the same utterances as those recorded in the XM2VTS database, therefore enabling comparison of the performance of different systems and investigation of the effect of challenging visual environments on the performance of algorithms developed with the XM2VTS database. The AVICAR database [131] consists of audio recordings and video sequences of 100 speakers (50 male and 50 female) with various language backgrounds (60% native American English speakers) uttering isolated digits, isolated letters, phone numbers, and TIMIT sentences inside a car. Audio recordings are obtained using a visor-mounted array of eight microphones under five different car noise conditions (car idle, and 35 and 55 mph with all windows rolled up or with the front windows rolled down). Video sequences are obtained using a dashboard-mounted array of four video cameras. This database provides different challenges for tracking and extraction of visual features and can be utilized for analysis of the effect of nonideal acoustic and visual conditions on AV speaker recognition performance.

Additional datasets for AV biometrics research are the Clemson University AV Experiments (CUAVE) corpus containing connected digit strings [132], the AMP/CMU database of 78 isolated words [133], the Tulips1 set of four isolated digits [134], the IBM AV database [15], and the AV-TIMIT corpus [9]. Probably none of the existing AV databases has all of the desirable characteristics, such as an adequate number of subjects, a sufficiently large vocabulary and set of utterances, realistic variability (representing, for example, speaker identification on a mobile hand-held device, or taking into account other nonideal acoustic and visual conditions), recommended experiment protocols (specifying, for example, which subjects are to be used as clients and which as impostors), and suitability for both text-independent and text-dependent verification systems. There is, therefore, a great need for new, standardized databases and evaluation measures (see Section V-B) that would enable fair comparison of different systems and represent realistic nonideal conditions. Experiment protocols should also be defined in a way that avoids biased results and allows for fair comparison of different person recognition systems.
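Whatever database and protocol are adopted, verification performance is commonly summarized by FA and FR rates and by the EER reported throughout this section. The sketch below shows one straightforward way such measures can be computed from lists of client and impostor scores; the simple threshold sweep used to estimate the EER, and the assumption that higher scores support the client claim, are illustrative choices rather than part of any standard evaluation protocol.

```python
import numpy as np

def far_frr(client_scores, impostor_scores, threshold):
    """False acceptance and false rejection rates at one decision threshold,
    assuming that higher scores support the client (genuine) claim."""
    clients = np.asarray(client_scores, dtype=float)
    impostors = np.asarray(impostor_scores, dtype=float)
    far = float(np.mean(impostors >= threshold))  # impostors wrongly accepted
    frr = float(np.mean(clients < threshold))     # clients wrongly rejected
    return far, frr

def equal_error_rate(client_scores, impostor_scores):
    """Sweep the observed scores as candidate thresholds and return the
    operating point where FAR and FRR are closest, with the EER approximated
    as their average at that point."""
    candidates = np.unique(np.concatenate([np.asarray(client_scores, dtype=float),
                                           np.asarray(impostor_scores, dtype=float)]))
    best_t, best_gap, best_rates = None, np.inf, (1.0, 1.0)
    for t in candidates:
        far, frr = far_frr(client_scores, impostor_scores, t)
        if abs(far - frr) < best_gap:
            best_t, best_gap, best_rates = t, abs(far - frr), (far, frr)
    return best_t, sum(best_rates) / 2.0

# Toy example: five genuine (client) trials and six impostor trials.
clients = [2.3, 1.9, 2.8, 1.2, 2.5]
impostors = [0.4, 1.5, 0.9, 1.1, 0.2, 1.3]
threshold, eer = equal_error_rate(clients, impostors)
print(threshold, eer)
```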

B. Examples of AV Biometric Systems
The performance of AV person recognition systems strongly depends on the choice and accurate extraction of the acoustic and visual features and on the AV fusion approach utilized. Due to differences in visual features, fusion methods, AV databases, and evaluation procedures, it is usually very difficult to compare systems. As mentioned in Section I, audio-only speaker recognition and face recognition systems are extensively covered elsewhere and will not be discussed here. We present in this section various visual-only-dynamic, audio-visual-static, and audio-visual-dynamic systems and provide some comparisons. Table 2 shows an overview of the various AV person recognition systems found in the literature. Some of those systems are discussed in more detail in the remainder of this section. Luettin et al. [135] developed a visual-only speaker identification system by utilizing only the dynamic visual information present in the video recordings of the mouth area. They utilized the Tulips1 database [134], consisting of recordings of 12 speakers uttering the first four English digits, extracted shape- and appearance-based visual features, and performed both text-dependent and text-independent experiments. Their person identification system, based on HMMs, achieved 72.9%, 89.6%, and 91.7% recognition rates when shape-based, appearance-based, and joint (concatenation fusion) visual features were utilized, respectively, in text-dependent experiments. In text-independent experiments, their system achieved 83.3%, 95.8%, and 97.9% recognition rates when shape-based, appearance-based, and joint (concatenation) visual features were utilized, respectively. In summary, they


achieved better results with appearance-based than with shape-based visual features, and the identification performance improved further when joint features were utilized.
1) Audio-Visual-Static Biometric Systems: Chibelushi et al. [5] developed an AV biometric system that utilizes acoustic information and static visual information contained in face profiles. They utilized an AV database that consists of audio recordings and face images of ten speakers [5]. The images are taken at different head orientations, image scales, and subject positions. They combined acoustic and visual information utilizing weighted summation fusion. Their system achieved an EER of 3.4%, 3.0%, and 1.5% when only speech information, only visual information, or both acoustic and visual information were used, respectively. Brunelli and Falavigna [6] developed a text-independent speaker identification system that combines audio-only speaker identification and face recognition. They utilized an AV database that consists of audio recordings and face images of 89 speakers collected in three sessions [6]. The system employs five classifiers, two acoustic and three visual. The two acoustic classifiers correspond to two sets of acoustic features (static and dynamic) derived from the short-time spectral analysis of the speech signal. Their audio-only speaker identification system is based on vector quantization (VQ). The three visual classifiers correspond to the visual features extracted from three regions of the face, i.e., eyes, nose, and mouth. The individually obtained classification scores are combined using the weighted product approach. The identification rate of the integrated system is 98%, compared to the 88% and 91% rates obtained by the audio-only speaker recognition and face recognition systems, respectively. Ben-Yacoub et al. [7] developed both text-dependent and text-independent AV speaker verification systems by utilizing acoustic information and frontal face visual information from the XM2VTS database. They utilized elastic graph matching in order to obtain face matching scores. They investigated several binary classifiers for postclassifier opinion fusion, namely, SVM, Bayesian classifier, Fisher's linear discriminant, decision tree, and multilayer perceptron (MLP). They obtained the best results utilizing the SVM and Bayesian classifiers, which also outperformed the single modalities. Sanderson and Paliwal [8] utilized speech and face information to perform text-independent identity verification. They extracted appearance-based visual features by performing PCA on the face image window containing the eyes and the nose. The acoustic features consisted of MFCCs and their corresponding deltas and maximum autocorrelation values, which capture pitch and voicing information. A voice activity detector (VAD) was used to remove the feature vectors which represent silence or background noise, while a GMM classifier was used as a

modality (speech or face) expert to obtain opinions from the speech features. They performed an elaborate analysis and evaluation of several nonadaptive and adaptive approaches for information fusion and compared them in noisy and clean audio conditions with respect to overall verification performance on the VidTIMIT database. The fusion methods they analyzed include weighted summation, a Bayesian classifier, SVM, concatenation, adaptive weighted summation, and the proposed piecewise linear postclassifier and modified Bayesian postclassifier. The utilized fusion methods take into account how the distributions of opinions are likely to change due to noisy conditions, without making a direct assumption about the type of noise present in the testing features. The verification results obtained for various SNRs in the presence of operations-room noise are shown in Fig. 8. The operations-room noise contains background speech as well as machinery sounds. The results are reported in terms of total error (TE), defined as TE = FAR + FRR. They concluded that the performance of most of the nonadaptive fusion systems was similar and that it degraded in noisy conditions. Hazen et al. [9] developed a text-dependent speaker authentication system that utilizes lower quality audio and visual signals obtained by a handheld device. They detected 14 face components and used ten of them, after normalization, as visual features for a face recognition algorithm which utilizes SVMs. They achieved a 90% reduction in speaker verification EER when fusing face and speaker identification information.
2) Audio-Visual-Dynamic Biometric Systems: Jourlin et al. [10] developed a text-dependent AV speaker verification system that utilizes both acoustic and visual dynamic information and tested it on the M2VTS database. Their

Fig. 8. AV person recognition performance obtained for various SNRs in the presence of operations-room noise, utilizing several nonadaptive and adaptive AV fusion methods described in [8], on the VidTIMIT database [26].


39-dimensional acoustic features consist of LPC coefficients and their first- and second-order derivatives. They use 14 lip shape parameters, ten intensity parameters, and the scale as visual features, resulting in a 25-dimensional visual feature vector. They utilize HMMs to perform audio-only, visual-only, and AV experiments. The AV score is computed as a weighted sum of the audio and visual scores. Their results demonstrate a reduction of the FAR from 2.3% when the audio-only system is used to 0.5% when the multimodal system is used. Wark et al. [11]–[13] employed multi-stream HMMs to develop text-independent AV speaker verification and identification systems tested on the M2VTS database. They utilized MFCCs as acoustic features and lip contour information, obtained after applying PCA and LDA, as visual features. They trained the system in clean conditions and tested it in degraded acoustic conditions. At low SNRs, the AV system achieved significant performance improvement over the audio-only system and also outperformed the visual-only system, while at high SNRs its performance was similar to that of the audio-only system. We have developed an AV speaker recognition system with the AMP/CMU database [133] utilizing 13 MFCC coefficients and their first- and second-order derivatives as acoustic features [14]. A visual shape-based feature vector consisting of ten FAPs, which describe the movement of the outer-lip contour [82], extracted using the systems previously discussed in Section IV, was projected by means of PCA onto a three-dimensional space (see Fig. 6). The resulting visual features were augmented with first- and second-order derivatives, providing nine-dimensional dynamic visual feature vectors. We used a feature fusion integration approach and single-stream HMMs to integrate the dynamic acoustic and visual information. Speaker verification and identification experiments were performed using audio-only and AV information, under both clean and noisy audio conditions at SNRs ranging from 0 to 30 dB. The obtained results for both speaker identification and verification experiments, expressed in terms of the identification error and EER, are shown in Table 3. Significant improvement in performance over the audio-only (AU) speaker recognition system was achieved, especially under noisy acoustic conditions. For instance, the identification error was reduced from 53.1%, when

Table 3 Speaker Recognition Performance Obtained for Various SNRs Utilizing Audio-Only and AV Systems in [14] Tested on AMP/CMU Database [133]


audio-only information was utilized, to 12.82%, when AV information was employed at 0-dB SNR. Chaudhari et al. [15] developed an AV speaker identification and verification system which modeled the reliability of the audio and video information streams with time-varying and context-dependent parameters. The acoustic features consisted of 23 MFCC coefficients, while the visual features consisted of 24 DCT coefficients from the transformed ROI. They utilized GMMs to model speakers and parameters that depended on time, modality, and speaker to model stream reliability. The system was tested on the IBM database [15] achieving an EER of 1.04%, compared to 1.71%, 1.51%, and 1.22%, of the audio-only, video-only, and AV (feature fusion) systems, respectively. Dieckmann et al. [16] developed a system which used visual features obtained from all three modalities, face, voice, and lip movement. Their fusion scheme utilized majority voting and opinion fusion. Two of the three experts had to agree on the opinion, and the combined opinion had to exceed the predefined threshold. The identification error decreased to 7% when all three modalities were used, compared to 10.4%, 11%, and 18.7%, when voice, lip movements, and face visual features were used individually.
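The feature-fusion (early integration) approach used in several of the systems above, including [14], amounts to augmenting each stream with its time derivatives and concatenating the two streams frame by frame before classifier (e.g., HMM) training. The sketch below illustrates the idea; the use of simple linear interpolation to bring the visual stream to the audio frame rate and the toy feature dimensions are assumptions made for the example, not the exact processing of any of the cited systems.

```python
import numpy as np

def add_deltas(features):
    """Augment a (T x D) feature sequence with its first- and second-order
    time derivatives (deltas and delta-deltas), a common step for both
    acoustic (MFCC) and visual feature streams."""
    d1 = np.gradient(features, axis=0)  # first-order derivative over time
    d2 = np.gradient(d1, axis=0)        # second-order derivative over time
    return np.hstack([features, d1, d2])

def concatenate_streams(audio_feats, visual_feats):
    """Feature-level (early integration) fusion: linearly interpolate the
    slower visual stream up to the audio frame rate, then concatenate the
    two streams frame by frame into joint observation vectors."""
    t_audio = np.linspace(0.0, 1.0, len(audio_feats))
    t_visual = np.linspace(0.0, 1.0, len(visual_feats))
    visual_up = np.column_stack([np.interp(t_audio, t_visual, visual_feats[:, d])
                                 for d in range(visual_feats.shape[1])])
    return np.hstack([audio_feats, visual_up])

# Toy dimensions: 100 audio frames of 13 MFCCs and 30 video frames of
# 3 PCA-projected shape features; deltas bring these to 39 and 9 dimensions,
# and concatenation yields 48-dimensional joint vectors for a single-stream HMM.
audio = add_deltas(np.random.randn(100, 13))
visual = add_deltas(np.random.randn(30, 3))
joint = concatenate_streams(audio, visual)
print(joint.shape)  # (100, 48)
```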

VIII. DISCUSSION/CONCLUSION
In this paper, we addressed how the joint processing of audio and visual signals, both generated by a talking person, can provide valuable information to benefit AV speaker recognition applications. We first concentrated on the analysis of visual signals and described various ways of representing and extracting the information available in them that is unique to each speaker. We then discussed how the visual features can complement features extracted (by well-studied methods) from the acoustic signal, and how the two-modality representations can be fused together to allow joint AV processing. We reviewed several speaker identification and/or authentication systems that have appeared in the literature and presented some experimental results. These results demonstrated the importance of utilizing visual information for speaker recognition, especially in the presence of acoustic noise. The field of joint AV person recognition is new and active, with many accomplishments (as described in this paper) and exciting opportunities for further research and development. Some of these open issues and opportunities are the following. There is a need for additional resources for advancing and assessing the performance of AV speaker recognition systems. Publicly available multimodal corpora that better reflect realistic conditions, such as acoustic noise and lighting changes, would help in investigating the robustness of AV systems. They can serve as a reference point for development, as well as for evaluation and comparison of various systems. Baseline algorithms and systems could also be


agreed upon and made available in order to facilitate separate investigation of the effects that various factors, such as the choice of acoustic and visual features, the information fusion approach, or the classification algorithms, have on system performance. In comparing and evaluating speaker recognition systems, the statistical significance of the results needs to be determined [136]. It is not adequate to simply report that one system achieved a lower error rate than another (and is therefore better) using the same experiment setup. The mean and the variance of a particular error measure can assist in determining the relative performance of systems. In addition, standard experiment protocols and evaluation procedures should be defined in order to enable fair comparison of different systems. Experiment protocols could include a number of different configurations in which the available subjects are randomly divided into enrolled subjects and impostors, therefore providing performance measures for each of the configurations. These measures can be used to determine the statistical significance of the results. In many cases, the performance of person identification systems is reported for a closed-set system. The underlying assumption is that the same performance will carry over to an open-set system. This, however, may well not hold true; therefore, a model for an unknown person should be used. The design of a truly high-performing visual feature representation system with improved robustness to the visual environment, possibly employing 2.5-D or 3-D face information [97]–[99], needs to be further investigated. Finally, the development of improved AV integration algorithms that allow unconstrained AV asynchrony modeling and robust, localized estimation of the reliability of the signal information content, degraded due, for example, to occlusion, illumination changes, or pose, is also needed.

Concerning the practical deployment of AV biometric technology, robustness represents the grand challenge. There are few applications where the environment and the data acquisition mechanism can be carefully controlled. Widespread use of the technology will require the ability to handle variability in the environment and in data acquisition devices, as well as degradations due to data and channel encoding and spoofing. Overall, the technology is quite robust to spoofing (when audio and dynamic visual features are used), although in the future it may become possible to synthesize a video of a person talking and replay it to the biometric system, for example on a laptop and even on the fly, in order to defeat a prompted-phrase system. The technology is readily available, however, for most day-to-day applications that have the following characteristics [137]: 1) low security and highly user friendly, e.g., access to a desktop using biometric log-in; 2) high security, but the user can tolerate the inconvenience of being falsely rejected, e.g., access to military property; 3) low security, and convenience is the prime factor (more than any other factor such as cost), e.g., time-stamping in a factory setting. The technology is not yet available for highly secure and highly user-friendly applications (such as banking). Further research and development is therefore required for AV biometric systems to become widespread in practice.

REFERENCES
[1] A. K. Jain, A. Ross, and S. Prabhakar, "An introduction to biometric recognition," IEEE Trans. Circuits Systems Video Technol., vol. 14, no. 1, pp. 4–20, Jan. 2004.
[2] N. K. Ratha, A. W. Senior, and R. M. Bolle, "Automated biometrics," in Proc. Int. Conf. Advances Pattern Recognition, Rio de Janeiro, Brazil, 2001, pp. 445–474.
[3] Financial Crimes Report to the Public, Fed. Bur. Investigation, Financial Crimes Section, Criminal Investigation Division. [Online]. Available: http://www.fbi.gov/publications/financial/fcs_report052005/fcs_report052005.htm
[4] A. K. Jain and U. Uludag, "Hiding biometric data," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 11, pp. 1494–1498, Nov. 2003.
[5] C. C. Chibelushi, F. Deravi, and J. S. Mason, "Voice and facial image integration for speaker recognition," in Proc. IEEE Int. Symp. Multimedia Technologies Future Appl., Southampton, U.K., 1993.
[6] R. Brunelli and D. Falavigna, "Person identification using multiple cues," IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 955–965, Oct. 1995.

[7] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, BFusion of face and speech data for person identity verification,[ IEEE Trans. Neural Networks, vol. 10, pp. 1065–1074, 1999. [8] C. Sanderson and K. K. Paliwal, BIdentity verification using speech and face information,[ Digital Signal Processing, vol. 14, no. 5, pp. 449–480, 2004. [9] T. J. Hazen, E. Weinstein, R. Kabir, A. Park, and B. Heisele, BMulti-modal face and speaker identification on a handheld device,[ in Proc. Works. Multimodal User Authentication, Santa Barbara, CA, 2003, pp. 113–120. [10] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, BIntegrating acoustic and labial information for speaker identification and verification,[ in Proc. 5th Eur. Conf. Speech Communication Technology, Rhodes, Greece, 1997, pp. 1603–1606. [11] T. Wark, S. Sridharan, and V. Chandran, BRobust speaker verification via fusion of speech and lip modalities,[ in Proc. Int. Conf. Acoustics, Speech Signal Processing, Phoenix, AZ, 1999, pp. 3061–3064. [12] VV, BRobust speaker verification via asynchronous fusion of speech and lip information,[ in Proc. 2th Int. Conf. Audio- and Video-Based Biometric Person

Authentication, Washington, DC, 1999, pp. 37–42.
[13] ——, "The use of temporal speech and lip information for multi-modal speaker identification via multi-stream HMMs," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Istanbul, Turkey, 2000, pp. 2389–2392.
[14] P. S. Aleksic and A. K. Katsaggelos, "An audio-visual person identification and verification system using FAPs as visual features," in Proc. Works. Multimedia User Authentication, Santa Barbara, CA, 2003, pp. 80–84.
[15] U. V. Chaudhari, G. N. Ramaswamy, G. Potamianos, and C. Neti, "Information fusion and decision cascading for audio-visual speaker recognition based on time-varying stream reliability prediction," in Proc. Int. Conf. Multimedia Expo, Baltimore, MD, Jul. 6–9, 2003, pp. 9–12.
[16] U. Dieckmann, P. Plankensteiner, and T. Wagner, "SESAM: A biometric person identification system using sensor fusion," Pattern Recogn. Lett., vol. 18, pp. 827–833, 1997.
[17] J. P. Campbell, "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.


[18] W.-Y. Zhao, R. Chellappa, P. J. J. Phillips, and A. Rosenfeld, BFace recognition: A literature survey,[ ACM Computing Survey, pp. 399–458, 2003, Dec. Issue. [19] M. Turk and A. Pentland, BEigenfaces for recognition,[ J. Cognitive Neuroscience, vol. 3, no. 1, pp. 586–591, Sep. 1991. [20] M. Kirby and L. Sirovich, BApplication of the Karhunen-Loeve procedure for the characterization of human faces,[ IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 1, pp. 103–108, Jan. 1990. [21] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, BEigenfaces versus fisherfaces: Recognition using class specific linear projection,[ IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, pp. 711–720, 1997. [22] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, BFace recognition: A literature survey,[ Proc. ACM Computing Surverys (CSUR), vol. 35, no. 4, pp. 399–458, 2003. [23] J. Luettin, BVisual speech and speaker recognition,[ Ph.D. dissertation, Dept. Computer Science, Univ. Sheffield, Sheffield, U.K., 1997. [24] C. C. Chibelushi, F. Deravi, and J. S. Mason, BAudio-visual person recognition: An evaluation of data fusion strategies,[ in Proc. Eur. Conf. Security Detection, London, U.K., 1997, pp. 26–30. [25] R. Brunelli, D. Falavigna, T. Poggio, and L. Stringa, BAutomatic person recognition using acoustic and geometric features,[ Machine Vision Appl., vol. 8, pp. 317–325, 1995. [26] C. Sanderson and K. K. Paliwal, BNoise compensation in a person verification system using face and multiple speech features,[ Pattern Recognition, vol. 36, no. 2, pp. 293–302, Feb. 2003. [27] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, BAcoustic-labial speaker verification,[ Pattern Recogn. Lett., vol. 18, pp. 853–858, 1997. [28] U. V. Chaudhari, G. N. Ramaswamy, G. Potamianos, and C. Neti, BAudio-visual speaker recognition using time-varying stream reliability prediction,[ in Proc. Int. Conf. Acoustics, Speech Signal Processing, Hong Kong, China, 2003, pp. V-712–V-715. [29] S. Bengio, BMultimodal authentication using asynchronous HMMs,[ in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 770–777. [30] A. V. Nefian, L. H. Liang, T. Fu, and X. X. Liu, BA Bayesian approach to audio-visual speaker identification,[ in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 761–769. [31] T. Fu, X. X. Liu, L. H. Liang, X. Pi, and A. V. Nefian, BAn audio-visual speaker identification using coupled hidden Markov models,[ in Proc. Int. Conf. Image Processing, Barcelona, Spain, 2003, pp. 29–32. [32] S. Bengio, BMultimodal authentication using asynchronous HMMs,[ in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 770–777. [33] VV, BMultimodal speech processing using asynchronous hidden Markov models,[ Information Fusion, vol. 5, pp. 81–89, 2004. [34] N. A. Fox, R. Gross, P. de Chazal, J. F. Cohn, and R. B. Reilly, BPerson identification using automatic integration of speech, lip,

and face experts," in Proc. ACM SIGMM 2003 Multimedia Biometrics Methods and Applications Workshop (WBMA'03), Berkeley, CA, 2003, pp. 25–32.
[35] N. A. Fox and R. B. Reilly, "Audio-visual speaker identification based on the use of dynamic audio and visual features," in Proc. 4th Int. Conf. Audio- and Video-Based Biometric Person Authentication, Guildford, U.K., 2003, pp. 743–751.
[36] Y. Abdeljaoued, "Fusion of person authentication probabilities by Bayesian statistics," in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 172–175.
[37] Y. Yemez, A. Kanak, E. Erzin, and A. M. Tekalp, "Multimodal speaker identification with audio-video processing," in Proc. Int. Conf. Image Processing, Barcelona, Spain, 2003, pp. 5–8.
[38] A. Kanak, E. Erzin, Y. Yemez, and A. M. Tekalp, "Joint audio-video processing for biometric speaker identification," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Hong Kong, China, 2003, pp. 561–564.
[39] E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using an adaptive classifier cascade based on modality reliability," IEEE Trans. Multimedia, vol. 7, no. 5, pp. 840–852, Oct. 2005.
[40] M. E. Sargin, E. Erzin, Y. Yemez, and A. M. Tekalp, "Multimodal speaker identification using canonical correlation analysis," in Proc. IEEE Int. Conf. Acoustics, Speech Signal Processing, Toulouse, France, May 2006, pp. 613–616.
[41] J. Kittler, J. Matas, K. Johnsson, and M. U. Ramos-Sánchez, "Combining evidence in personal identity verification systems," Pattern Recogn. Lett., vol. 18, pp. 845–852, 1997.
[42] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 226–239, 1998.
[43] J. Kittler and K. Messer, "Fusion of multiple experts in multimodal biometric personal identity verification systems," in Proc. 12th IEEE Workshop Neural Networks Sig. Processing, Switzerland, 2002, pp. 3–12.
[44] E. S. Bigun, J. Bigun, B. Duc, and S. Fisher, "Expert conciliation for multi modal person authentication systems by Bayesian statistics," in Proc. 1st Int. Conf. Audio- and Video-Based Biometric Person Authentication, Crans-Montana, Switzerland, Mar. 1997, pp. 291–300.
[45] R. W. Frischholz and U. Dieckmann, "BioID: A multimodal biometric identification system," Computer, vol. 33, pp. 64–68, 2000.
[46] S. Basu, H. S. M. Beigi, S. H. Maes, M. Ghislain, E. Benoit, C. Neti, and A. W. Senior, "Methods and apparatus for audio-visual speaker recognition and utterance verification," U.S. Patent 6 219 640, 1999.
[47] L. Hong and A. Jain, "Integrating faces and fingerprints for personal identification," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 1295–1307, 1998.
[48] V. Radova and J. Psutka, "An approach to speaker identification using multiple classifiers," in Proc. IEEE Conf. Acoustics, Speech Signal Processing, Munich, Germany, 1997, vol. 2, pp. 1135–1138.


[49] A. Ross and A. Jain, BInformation fusion in biometrics,[ Pattern Recogn. Lett., vol. 24, pp. 2115–2125, 2003. [50] V. Chatzis, A. G. Bors, and I. Pitas, BMultimodal decision-level fusion for person authentication,[ IEEE Trans. Systems, Man, Cybernetics, Part A: Syst. Humans, vol. 29, no. 5, pp. 674–680, Nov. 1999. [51] N. Fox, R. Gross, J. Cohn, and R. B. Reilly, BRobust automatic human identification using face, mouth, and acoustic information,[ in Proc. Int. Workshop Analysis Modeling of Faces and Gestures, Beijing, China, Oct. 2005, pp. 263–277. [52] F. J. Huang and T. Chen, BConsideration of Lombard effect for speechreading,[ in Proc. Works. Multimedia Signal Process, 2001, pp. 613–618. [53] J. D. Woodward, BBiometrics: Privacy’s foe or privacy’s friend?[ Proc. IEEE, vol. 85, pp. 1480–1492, 1997. [54] R. P. Lippmann, BSpeech recognition by machines and humans,[ Speech Commun., vol. 22, no. 1, pp. 1–15, 1997. [55] H. Yehia, P. Rubin, and E. Vatikiotis-Bateson, BQuantitative association of vocal-tract and facial behavior,[ Speech Commun., vol. 26, no. 1–2, pp. 23–43, 1998. [56] J. Jiang, A. Alwan, P. A. Keating, E. T. Auer, Jr., and L. E. Bernstein, BOn the relationship between face movements, tongue movements, and speech acoustics,[ EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1174–1188, Nov. 2002. [57] J. P. Barker and F. Berthommier, BEstimation of speech acoustics from visual speech features: A comparison of linear and non-linear models,[ in Proc. Int. Conf. Auditory Visual Speech Processing, Santa Cruz, CA, 1999, pp. 112–117. [58] H. C. Yehia, T. Kuratate, and E. Vatikiotis-Bateson, BUsing speech acoustics to drive facial motion,[ in Proc. 14th Int. Congr. Phonetic Sciences, San Francisco, CA, 1999, pp. 631–634. [59] A. V. Barbosa and H. C. Yehia, BMeasuring the relation between speech acoustics and 2-D facial motion,[ in Proc. Int. Conf. Acoustics, Speech Signal Processing, Salt Lake City, UT, 2001, vol. 1, pp. 181–184. [60] P. S. Aleksic and A. K. Katsaggelos, BSpeech-to-video synthesis using MPEG-4 compliant visual features,[ IEEE Trans. CSVT, Special Issue Audio Video Analysis for Multimedia Interactive Services, pp. 682–692, May 2004. [61] A. Q. Summerfield, BSome preliminaries to a comprehensive account of audio-visual speech perception,[ in Hearing by Eye: The Psychology of Lip-Reading, R. Campbell and B. Dodd, Eds. London, U.K.: Lawrence Erlbaum, 1987, pp. 3–51. [62] D. W. Massaro and D. G. Stork, BSpeech recognition and sensory integration,[ Amer. Scientist, vol. 86, no. 3, pp. 236–244, 1998. [63] J. J. Williams and A. K. Katsaggelos, BAn HMM-based speech-to-video synthesizer,[ IEEE Trans. Neural Networks, Special Issue Intelligent Multimedia, vol. 13, no. 4, pp. 900–915, Jul. 2002. [64] Q. Summerfield, BUse of visual information in phonetic perception,[ Phonetica, vol. 36, pp. 314–331, 1979. [65] VV, BLipreading and audio-visual speech perception,[ Phil. Trans. R. Soc. Lond. B., vol. 335, pp. 71–78, 1992.


[66] K. W. Grant and L. D. Braida, BEvaluating the articulation index for auditory-visual input,[ J. Acoustical Soc. Amer., vol. 89, pp. 2950–2960, Jun. 1991. [67] G. Fant, Acoustic Theory of Speech Production, S-Gravenhage. Amsterdam, The Netherlands: Mouton, 1960. [68] J. L. Flanagan, Speech Analysis Synthesis and Perception. Berlin, Germany: Springer-Verlag, 1965. [69] S. Narayanan and A. Alwan, BArticulatory-acoustic models for fricative consonants,[ IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 328–344, Jun. 2000. [70] J. Schroeter and M. Sondhi, BTechniques for estimating vocal-tract shapes from the speech signal,[ IEEE Trans. Speech Audio Processing, vol. 2, no. 1, pp. 133–150, Feb. 1994. [71] H. McGurk and J. MacDonald, BHearing lips and seeing voices,[ Nature, vol. 264, pp. 746–748, 1976. [72] T. Chen and R. R. Rao, BAudio-visual integration in multimodal communication,[ Proc. IEEE, vol. 86, no. 5, pp. 837–852, May 1998. [73] S. Oviatt, P. Cohen, L. Wu, J. Vergo, L. Duncan, B. Suhm, J. Bers, T. Holzman, T. Winograd, J. Landay, J. Larson, and D. Ferro, BDesigning the user interface for multimodal speech and pen-based gesture applications: State-of-the-art systems and research directions,[ Human-Computer Interaction, vol. 15, no. 4, pp. 263–322, Aug. 2000. [74] J. Schroeter, J. Ostermann, H. P. Graf, M. Beutnagel, E. Cosatto, A. Syrdal, A. Conkie, and Y. Stylianou, BMultimodal speech synthesis,[ in Proc. Int. Conf. Multimedia Expo, New York, 2000, pp. 571–574. [75] C. C. Chibelushi, F. Deravi, and J. S. D. Mason, BA review of speech-based bimodal recognition,[ IEEE Trans. Multimedia, vol. 4, no. 1, pp. 23–37, Mar. 2002. [76] D. G. Stork and M. E. Hennecke, Eds., Speechreading by Humans and Machines. Berlin, Germany: Springer, 1996. [77] P. S. Aleksic, G. Potamianos, and A. K. Katsaggelos, BExploiting visual information in automatic speech processing,[ in Handbook of Image and Video Processing, A. Bovik, Ed. New York: Academic, Jun. 2005, pp. 1263–1289. [78] E. Petajan, BAutomatic lipreading to enhance speech recognition,[ Ph.D. dissertation, Univ. Illinois at Urbana-Champaign, Urbana, IL, 1984. [79] S. Dupont and J. Luettin, BAudio-visual speech modeling for continuous speech recognition,[ IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141–151, Sep. 2000. [80] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, BRecent advances in the automatic recognition of audiovisual speech,[ Proc. IEEE, vol. 91, no. 9, pp. 1306–1326, Sep. 2003. [81] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, BAudio-visual automatic speech recognition: An overview,[ in Issues in Visual and Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds. Cambridge, MA: MIT Press, 2004. [82] P. S. Aleksic, J. J. Williams, Z. Wu, and A. K. Katsaggelos, BAudio-visual speech recognition using MPEG-4 compliant visual features,[ EURASIP J. Appl. Signal

Processing, vol. 2002, no. 11, pp. 1213–1227, Nov. 2002.
[83] T. Chen, "Audiovisual speech processing. Lip reading and lip synchronization," IEEE Signal Processing Mag., vol. 18, no. 1, pp. 9–21, Jan. 2001.
[84] J. R. Deller, Jr., J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. Englewood Cliffs, NJ: Macmillan, 1993.
[85] R. Campbell, B. Dodd, and D. Burnham, Eds., Hearing by Eye II: Advances in the Psychology of Speechreading and Auditory Visual Speech. Hove, U.K.: Psychology Press, 1998.
[86] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book. London, U.K.: Entropic, 2005.
[87] A. J. Goldschen, O. N. Garcia, and E. D. Petajan, "Rationale for phoneme-viseme mapping and feature selection in visual speech recognition," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. Berlin, Germany: Springer, 1996, pp. 505–515.
[88] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1993.
[89] D.-S. Kim, S.-Y. Lee, and R. M. Kil, "Auditory processing of speech signals for robust speech recognition in real-world noisy environments," IEEE Trans. Speech Audio Processing, vol. 7, no. 1, pp. 55–69, Jan. 1999.
[90] K. K. Paliwal, "Spectral subband centroids features for speech recognition," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Seattle, WA, 1998, vol. 2, pp. 617–620.
[91] M. Akbacak and J. H. L. Hansen, "Environmental sniffing: Noise knowledge estimation for robust speech systems," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Hong Kong, China, 2003, vol. 2, pp. 113–116.
[92] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.
[93] A. W. Senior, "Face and feature finding for a face recognition system," in Proc. Int. Conf. Audio Video-based Biometric Person Authentication, Washington, DC, 1999, pp. 154–159.
[94] K. Sung and T. Poggio, "Example-based learning for view-based human face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 39–51, 1998.
[95] E. Hjelmas and B. K. Low, "Face detection: A survey," Computer Vision and Image Understanding, vol. 83, no. 3, pp. 236–274, Sep. 2001.
[96] M.-H. Yang, D. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 1, pp. 34–58, Jan. 2002.
[97] V. Blanz, P. Grother, P. J. Phillips, and T. Vetter, "Face recognition based on frontal views generated from non-frontal images," in Proc. Computer Vision Pattern Recognition, 2005, pp. 454–461.
[98] C. Sanderson, S. Bengio, and Y. Gao, "On transforming statistical models for non-frontal face verification," Pattern Recognition, vol. 39, no. 2, pp. 288–302, 2006.
[99] K. W. Bowyer, K. Chang, and P. Flynn, "A survey of approaches and challenges in 3-D and multi-modal 3-D face recognition," Computer Vision Image Understanding, vol. 101, no. 1, pp. 1–15, 2006.
[100] S. G. Kong, J. Heo, B. R. Abidi, J. Paik, and M. A. Abidi, "Recent advances in visual and infrared face recognition – A review," Computer Vision Image Understanding, vol. 97, no. 1, pp. 103–135, 2005.
[101] M. E. Hennecke, D. G. Stork, and K. V. Prasad, "Visionary speech: Looking ahead to practical speechreading systems," in Speechreading by Humans and Machines, D. G. Stork and M. E. Hennecke, Eds. Berlin, Germany: Springer, 1996, pp. 331–349.
[102] P. S. Aleksic and A. K. Katsaggelos, "Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition," in Proc. Int. Conf. Acoustics, Speech Signal Processing, Montreal, Canada, 2004, pp. 917–920.
[103] I. Matthews, G. Potamianos, C. Neti, and J. Luettin, "A comparison of model and transform-based visual features for audio-visual LVCSR," in Proc. Int. Conf. Multimedia Expo, 2001, pp. 22–25.
[104] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network based face detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 1, pp. 23–38, Jan. 1998.
[105] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. Conf. Computer Vision Pattern Recognition, Kauai, HI, Dec. 11–13, 2001, pp. 511–518.
[106] H. P. Graf, E. Cosatto, and G. Potamianos, "Robust recognition of faces and facial features with a multi-modal system," in Proc. Int. Conf. Systems, Man, Cybernetics, Orlando, FL, 1997, pp. 2034–2039.
[107] M. T. Chan, Y. Zhang, and T. S. Huang, "Real-time lip tracking and bimodal continuous speech recognition," in Proc. Workshop Multimedia Signal Processing, Redondo Beach, CA, 1998, pp. 65–70.
[108] G. Chetty and M. Wagner, "'Liveness' verification in audio-video authentication," in Proc. Int. Conf. Spoken Language Processing, Jeju Island, Korea, 2004, pp. 2509–2512.
[109] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," Int. J. Computer Vision, vol. 4, no. 4, pp. 321–331, 1988.
[110] A. L. Yuille, P. W. Hallinan, and D. S. Cohen, "Feature extraction from faces using deformable templates," Int. J. Computer Vision, vol. 8, no. 2, pp. 99–111, 1992.
[111] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," in Proc. Eur. Conf. Computer Vision, Freiburg, Germany, 1998, pp. 484–498.
[112] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ: Wiley, 2001.
[113] B. Maison, C. Neti, and A. Senior, "Audio-visual speaker recognition for broadcast news: Some fusion techniques," in Proc. Works. Multimedia Signal Processing, Copenhagen, Denmark, 1999, pp. 161–167.
[114] P. Duchnowski, U. Meier, and A. Waibel, "See me, hear me: Integrating automatic speech recognition and lip-reading," in Proc. Int. Conf. Spoken Lang. Processing, Yokohama, Japan, Sep. 18–22, 1994, pp. 547–550.
[115] G. Potamianos, H. P. Graf, and E. Cosatto, "An image transform approach for HMM based automatic lipreading," in Proc. Int. Conf. Image Processing, Chicago, IL, Oct. 4–7, 1998, vol. 1, pp. 173–177.

[116] P. S. Aleksic and A. K. Katsaggelos, "Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance," in Proc. Int. Conf. Image Processing, Italy, Sep. 2005, vol. 5, pp. 501–504.
[117] X. Zhang, C. C. Broun, R. M. Mersereau, and M. Clements, "Automatic speechreading with applications to human-computer interfaces," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1228–1247, 2002.
[118] M. Gordan, C. Kotropoulos, and I. Pitas, "A support vector machine-based dynamic network for visual speech recognition applications," EURASIP J. Appl. Signal Processing, vol. 2002, no. 11, pp. 1248–1259, 2002.
[119] F. Cardinaux, C. Sanderson, and S. Bengio, "User authentication via adapted statistical models of face images," IEEE Trans. Signal Processing, vol. 54, no. 1, pp. 361–373, Jan. 2006.
[120] G. R. Doddington, M. A. Przybycki, A. F. Martin, and D. A. Reynolds, "The NIST speaker recognition evaluation – Overview, methodology, systems, results, perspective," Speech Commun., vol. 31, no. 2–3, pp. 225–254, 2000.
[121] S. Bengio, J. Mariethoz, and M. Keller, "The expected performance curve," in Int. Conf. Machine Learning, Workshop ROC Analysis Machine Learning, Bonn, Germany, 2005.
[122] S. Bengio and J. Mariethoz, "The expected performance curve: A new assessment measure for person authentication," in Proc. Speaker Language Recognition Works. (Odyssey), Toledo, Spain, 2004, pp. 279–284.

[123] D. L. Hall and J. Llinas, BMultisensor data fusion,[ in Handbook of Multisensor Data Fusion, D. L. Hall and J. Llinas, Eds. Boca Raton, FL: CRC, 2001, pp. 1–10. [124] T. K. Ho, J. J. Hull, and S. N. Srihari, BDecision combination in multiple classifier systems,[ IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 66–75, 1994. [125] R. C. Luo and M. G. Kay, BIntroduction,[ in Multisensor Integration and Fusion for Intelligent Machines and Systems, R. C. Luo and M. G. Kay, Eds. Norwood, NJ: Ablex, 1995, pp. 1–26. [126] S. Pigeon and L. Vandendorpe, BThe M2VTS multimodal face database (release 1.00),[ in Proc. 1st Int. Conf. Audio- and Video-Based Biometric Person Authentication, Crans-Montana, Switzerland, 1997, pp. 403–409. [127] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, BXM2VTSDB: Te extended M2VTS database,[ in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, Washington, DC, 1999, pp. 72–77. [128] E. Bailly-Bailliere, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariethoz, J. Matas, K. Messer, V. Popovici, F. Poree, B. Ruiz, and J.-P. Thiran, BThe BANCA database and evaluation protocol,[ in Proc. Audio- and Video-Based Biometric Person Authentication, Guilford, 2003, pp. 625–638. [129] C. C. Chibelushi, F. Deravi, and J. S. Mason, BT DAVID DatabaseVInternal Rep., Speech and Image Processing Research Group, Dept. of Electrical and Electronic Engineering, Univ, les Swansea, 1996. [130] N. Fox, B. O’Mullane, and R. B. Reilly, BThe realistic multi-modal VALID database and visual speaker identification comparison experiments,[ in Lecture Notes in Computer

Science, T. Kanade, A. K. Jain, and N. K. Ratha, Eds. New York: Springer-Verlag, 2005, vol. 3546, p. 777.
[131] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang, "AVICAR: Audio-visual speech corpus in a car environment," in Proc. Conf. Spoken Language, Jeju, Korea, 2004.
[132] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: A new audio-visual database for multimodal human-computer interface research," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Orlando, FL, 2002.
[133] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Mag., vol. 18, pp. 9–21, Jan. 2001.
[134] J. R. Movellan, "Visual speech recognition with stochastic networks," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995, vol. 7.
[135] J. Luettin, N. Thacker, and S. Beet, "Speaker identification by lipreading," in Proc. Int. Conf. Speech and Language Processing, Philadelphia, PA, 1996, pp. 62–64.
[136] S. Bengio and J. Mariethoz, "A statistical significance test for person authentication," in Proc. Speaker and Language Recognition Workshop (Odyssey), Toledo, Spain, 2004, pp. 237–244.
[137] N. Poh and J. Korczak, "Biometric authentication in the e-World," in Automated Authentication Using Hybrid Biometric System, D. Zhang, Ed. Boston, MA: Kluwer, 2003, ch. 16.

ABOUT THE AUTHORS Petar S. Aleksic received the B.S. degree in electrical engineering from the University of Belgrade, Serbia, in 1999, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University, Evanston, IL, in 2001 and 2004, respectively. He has been a member of the Image and Video Processing Lab at Northwestern University, since 1999, where he is currently a Postdoctoral Fellow. He has published more than ten articles in the area of audio-visual signal processing, pattern recognition, and computer vision. His primary research interests include visual feature extraction and analysis, audio-visual speech recognition, audio-visual biometrics, multimedia communications, computer vision, pattern recognition, and multimedia data mining.


Aggelos K. Katsaggelos received the Diploma degree in electrical and mechanical engineering from the Aristotelian University, Thessaloniki, Greece, in 1979, and the M.S. and Ph.D. degrees from the Georgia Institute of Technology, Atlanta, in 1981 and 1985, respectively, both in electrical engineering. He is currently a Professor of electrical engineering and computer science at Northwestern University, Evanston, IL, and also the Director of the Motorola Center for Seamless Communications and a member of the academic affiliate staff, Department of Medicine, Evanston Hospital. He is the Editor of Digital Image Restoration (New York, Springer, 1991), Coauthor of Rate-Distortion Based Video Compression (Kluwer, Norwell, 1997), and Coeditor of Recovery Techniques for Image and Video Compression and Transmission (Kluwer, Norwell, 1998). Also, he is Coinventor of ten international patents. Dr. Katsaggelos is a member of the Publication Board of the IEEE PROCEEDINGS and has served as the Editor-in-Chief of the IEEE Signal Processing Magazine (1997–2002). He has been a recipient of the IEEE Third Millennium Medal (2000), the IEEE Signal Processing Society Meritorious Service Award (2001), and an IEEE Signal Processing Society Best Paper Award (2001).
