Multimedia Tools and Applications, 14, 137–151, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Multi-Modal Dialog Scene Detection Using Hidden Markov Models for Content-Based Multimedia Indexing

A. AYDIN ALATAN [email protected]
Electrical-Electronics Engineering Department, Middle East Technical University, Balgat, Ankara 06531, Turkey

ALI N. AKANSU
New Jersey Center for Multimedia Research, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA

WAYNE WOLF
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544-5263, USA

Abstract. A class of audio-visual data (fiction entertainment: movies, TV series) is segmented into scenes which contain dialogs, using a novel hidden Markov model-based (HMM) method. Each shot is classified using both its audio track (via classification of speech, silence and music) and its visual content (face and location information). The result of this shot-based classification is an audio-visual token to be used by the HMM state diagram to achieve scene analysis. After simulations with circular and left-to-right HMM topologies, it is observed that both perform very well with multi-modal inputs. Moreover, for the circular topology, comparisons between different training and observation sets show that audio and face information together gives the most consistent results among the different observation sets.

Keywords: content-based indexing, multi-modal analysis, hidden Markov models, dialog scene analysis

1. Introduction

Rapid advances in digital imaging and storage products have made digital multimedia libraries feasible. Management of this huge content with limited computational power and networking capacity requires devising clever methods. Efficient retrieval of multimedia data is usually realized by indexing the raw content, so that users can navigate, search, browse and view it easily. Multimedia content, more specifically video (including audio information), can be categorized into four major classes [1]. These classes, with some examples, are as follows:

– information: news, education programs
– communication: video-conferencing, presentation
– data analysis: surveillance, scientific video recordings
– entertainment:

• interactive video: games, home-shopping
• non-fiction: sports, documentaries, talk-shows
• fiction: motion pictures, TV series

This paper presents an efficient method which segments dialog scenes within video content of the fiction entertainment class. Although all the classes listed above might contain human communications (i.e., dialogs) in their content, fitting a model is more meaningful for motion pictures or TV series, which are basically post-edited to improve their presentation of the events (dialogs, actions) to the viewers. Such a post-editing process considerably eases modeling and indexing of the scenes. Dialog scene analysis is especially useful for abstraction. It is semantically more meaningful to represent the (dialog) scenes by a number of key-frames or short trailers. Moreover, quantitative comparisons between the durations of dialog and non-dialog (e.g., action) scenes give a rough idea about the genre of the movie (e.g., romantic, action) as well [15]. Hence, dialog scene analysis is a key concept in managing some classes of video content. In the next section, we briefly explain the scene analysis concept with emphasis on the definitions, elements and analysis hierarchy of dialog scenes, as well as multi-modal approaches. Section 3 is devoted to the proposed multi-modal HMM-based dialog scene modeling concept. In Section 4, a practical system based on the model of Section 3 is proposed; this system is simulated and the results are also presented in Section 4. The simulation results are then followed by our conclusions.

2. Scene analysis

While a shot is usually defined as a single continuous recording of a camera, a scene consists of a concatenation of shots (combined during the editing stage) which are connected in time or space and present a meaningful part of the whole story. Finding semantically meaningful parts of a story (i.e., scene analysis) is a necessary (and also challenging) process in multimedia indexing, since such an analysis is a crucial step in reaching from low-level features, like color, motion and voice pitch, to the high-level semantic descriptions of the content. This challenging goal can be simplified by defining a context in which the connections between low-level features and high-level descriptions are well established in comparison to the general case. Dialog scenes are such a context, with clear interactions between low-level and high-level features.

2.1. Dialog scene analysis

A dialog scene can be defined as a set of consecutive shots which contain conversations of people. However, a dialog scene may contain shots that do not show a conversation or even a person; such shots should still be included in the scene due to their semantic coherence. For example, while two people are talking to each other, the shots usually contain the faces and speech of these people. In order to emphasize another subject, the movie director may insert shot(s) of another object or person between these shots. Since the conversation continues after this short interrupt, the visually irrelevant but semantically relevant shot(s) should also be included in the dialog scene. Inserting close-up views of silent faces between conversations, as a reaction shot from someone listening to the conversation, or inserting shots of what the people are talking about, are other usual methods employed by movie directors. However, such random events make the modeling more difficult due to their indeterministic nature. If the scene is understood semantically, these insertions make sense. However, they are hard to figure out from low-level information, which is why statistical models enter into the picture.

There are two main challenges in dialog scene analysis: determining a dialog as a whole (neither totally missing it nor erroneously dividing it into multiple sub-dialogs) and finding the boundaries of a dialog (which can be quite subjective in some cases) with good accuracy. For different applications, the importance of these performance measures might differ. The analysis of a dialog scene is usually achieved by using either deterministic or probabilistic approaches. While the deterministic methods usually cluster consecutive shots by utilizing appropriate measures [1, 7, 13], the probabilistic methods use hidden Markov models (HMM) to represent the states (i.e., scenes) in the content [4, 16]. Although both approaches are still quite immature and there is no fair comparison yet, the HMM-based formulation looks more promising due to two factors. The first reason lies in the random behavior of any natural language, which does not permit a set of deterministic rules to model this behavior. The second reason is that, following the analogy between spoken and visual languages (cinema), statistical models are expected to parse the shots of a movie (like the phonemes of a word) better than non-statistical approaches. Moreover, clustering is not flexible enough to parse video programs into useful syntactic structures [16].

2.2. The elements of a dialog scene

A dialog scene requires three elements at the same time: some people, their conversation and a location. The importance of these three components might differ. Intuitively, the (audio) conversation between characters should be the most important among the three elements, since a human can detect a dialog in a radio broadcast even without understanding its content. On the other hand, for example, a common location might not be necessary for a phone conversation scene in a movie, hence it should have the least priority. In order to perform an analysis, the elements of a dialog must be extracted from the multimedia content. Although there are many approaches, the simplest way to detect speech (conversation) is to analyze the audio track of the content. Since it is common to show the face of speakers during a dialog, face analysis is a good choice to determine the presence of people within shots. Finally, the location information usually requires some simple visual clues.

2.3. Multi-modal scene analysis

Multi-modal analysis combines different components belonging to the same information content. In other words, for a movie, the video, audio, closed-caption or textual overlay information is integrated for different applications. Although it has obvious advantages over uni-modal analysis, the fusion of data of different natures is still a difficult problem to solve [11]. Multi-modal approaches have been investigated for different areas in content-based indexing, such as shot-boundary detection [5], video abstraction [6, 7], speaker-dependent temporal indexing [14], story unit identification [13] and violent scene detection [9]. The simulation results of all these techniques point out an improvement over their uni-modal counterparts, as expected. Multi-modal fusion can be performed at three levels [11]: signal, feature and decision levels. While signal level fusion is only used to reduce measurement errors by considering multiple sensors, decision level fusion does not take into account some dependencies among features from different modalities. Feature level fusion is computationally costly due to the size of typical feature spaces. Considering these three levels with their disadvantages, one practical approach is to choose a transitional level between the decision and feature levels by reducing the size of the feature space after making some basic decisions. For example, before detecting a dialog scene from raw image intensity and audio data, some preliminary decisions can be made by using the features of the data. In this way, one can obtain a smaller feature space (e.g., music, face, silence, no-face) from each modality that can be combined more easily.
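As a rough illustration of this transitional fusion level, the short Python sketch below (not part of the original system; the label sets and the function name are hypothetical) enumerates reduced per-modality decisions and shows how small the resulting joint symbol space becomes once the preliminary decisions have been made.

```python
from itertools import product

# Coarse per-modality decisions made before fusion (an assumed reduction;
# the actual label sets depend on the classifiers used).
AUDIO = ("silence", "speech", "music")
FACE = ("face", "no-face")
LOCATION = ("changed", "unchanged")

# Joint symbol space after the preliminary decisions: 3 x 2 x 2 = 12 symbols,
# far smaller than the raw feature space of pixel intensities and audio samples.
JOINT_SYMBOLS = list(product(AUDIO, FACE, LOCATION))

def fuse(audio_label: str, face_label: str, location_label: str) -> int:
    """Map one shot's per-modality decisions to a single fused symbol index."""
    return JOINT_SYMBOLS.index((audio_label, face_label, location_label))

print(len(JOINT_SYMBOLS))                      # 12
print(fuse("music", "no-face", "changed"))     # index of one fused symbol
```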

2.4. Dialog scene analysis hierarchy

The elements of a dialog have already been stated as conversation, people and location. Since the most practical way to perform multi-modal fusion is to decrease the size of the feature space after some preliminary decisions, the best way to extract the dialog scene elements from multimedia data is to obtain the elements from uni-modal data before the final decision. These disjoint extraction processes are audio classification, face analysis and location analysis. Multi-modal dialog scene analysis can be categorized into a three-level hierarchy according to the (computational) complexity of the extraction systems. This classification is shown in Table 1.

Table 1. Classification of multi-modal scene analysis according to computational complexity levels of the extraction algorithms.

Level   Audio analysis               Visual analysis                                 Resulting system
Low     silence/no-silence           shot info                                       shot detection
Mid     silence/music/speech         shot info + face/no-face + old/new location     dialog scene detection
High    silence/music/speaker1...N   shot info + face1...M + location1...K           dialog scene description

The applications which require considerably lower complexity may simply use methods which have silence/no-silence classification of the audio track data and a set of visual shot boundaries. An improvement over shot-boundary detection using only visual data has already been shown [5]. A method from the high-level analysis class possibly requires robust unsupervised speaker, face and location classification algorithms, in order to segment the speech signals of N speakers in a soundtrack, to classify the faces of M different people in a movie, and to detect the K locations in which the events take place, respectively. Considering the state-of-the-art audio-visual analysis methods, all three goals are computationally quite complex. Moreover, they might not be robust against generic scenes and soundtracks. However, such an analysis may give a complete description of the dialog. In our research, we focus on medium-complexity techniques. These methods require only a classification between silence, speech and music data for their audio analysis part. This problem is relatively easy to solve [13]. For the visual part, we only need to find whether a face exists in each shot or not. Face detection is also a mature topic with robust solutions [14, 16]. Finally, location analysis can be simplified into the detection of significant changes between shots using simple histogram-based methods [2, 5, 9].
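For the location part of this medium-complexity setting, a minimal sketch of the kind of histogram comparison referred to above is given below. It is not the authors' implementation: the choice of one key-frame per shot, the L1 histogram distance and the threshold value are all assumptions made for illustration.

```python
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 16) -> np.ndarray:
    """Normalized per-channel intensity histogram of a key-frame (H x W x 3, uint8)."""
    hist = np.concatenate(
        [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0] for c in range(3)]
    ).astype(float)
    return hist / hist.sum()

def location_changed(prev_frames, curr_frame, threshold: float = 0.5) -> bool:
    """Compare the current shot's key-frame against the last one or two shots.

    Returns True (symbol C) if the histogram distance to *all* recent shots
    exceeds the threshold, otherwise False (symbol U). The threshold value is
    an assumption; the text only states it should be higher than the one used
    for shot-boundary detection.
    """
    h_curr = color_histogram(curr_frame)
    distances = [np.abs(h_curr - color_histogram(f)).sum() for f in prev_frames]
    return all(d > threshold for d in distances)
```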

3. HMM-based multi-modal dialogue scene analysis

Considering the promising performance of multi-modal video indexing [5–7, 9, 13, 14] and HMM-based scene modeling [4, 16], we propose a novel method which detects the dialogue scenes in a movie using a multi-modal HMM-based approach.

3.1. Hidden Markov modeling of dialog scenes

Hidden Markov Models (HMM) are powerful statistical tools which have been successfully utilized in the speech recognition and speaker identification fields [12]. Recently, they have also found applications in the content-based video indexing area, with attempts to solve video scene segmentation [16], shot-boundary detection [2], face [10] and gesture [3] recognition problems. HMM-based analysis usually requires answers to the questions below. While the first three questions depend on the particular application, the last two problems have general solutions in the literature [12].

• What are the hidden states?
• How are these states connected (i.e., topology)?
• How do we observe the hidden states?
• How do we obtain the statistical parameters of the HMM?
• How do we find a state sequence best fitting an observation data set?

Assigning semantically meaningful scenes of the content to the states of the HMM is the most straightforward approach. Dialog, action, establishing, progression and transition are the scenes mostly employed by film directors. Another approach is to use typical shots, such as master or close-up, as the states of the model. Although either of these approaches can be utilized, scene-based hidden state modeling is more attractive since it eventually determines the main goal: the detection of dialog scenes. For shot-based modeling, we would need one more analysis step to reach the scene information.

Figure 1. (a) Left-to-right and (b) circular HMM state diagram for modeling dialog scenes in movies.

The HMM states can be related to each other using different HMM topologies. One possibility is a left-to-right HMM state diagram (figure 1(a)), while another is a circular HMM (figure 1(b)). Obviously, the states (scenes) may slightly differ for these two topologies. In our application, the left-to-right state diagram simply states that movies start in a state, called the establishing scene, which is followed by a dialog scene. Once a dialog scene ends, we enter a transitional (or establishing) scene in which a conversation does not exist, and this order repeats itself. As an obvious drawback, the left-to-right model strictly requires knowing the number of scenes (states) in the content beforehand. On the other hand, without such a constraint, the circular topology can be constructed from only two states, representing dialog and non-dialog scenes. Other HMM topologies are also possible [4]. At each (hidden) state, there are some observable output symbols with associated probabilities. The observable symbols are important factors which determine the performance of the model and the analysis. For HMM-based dialog scene analysis, the output symbols are combinations of audio conversation, face and location descriptions. Once the model topology and observation symbol set are determined, the next step is training the model with some data and determining the initial, state-transition and observation probabilities. The Baum-Welch algorithm [12] is the well-known solution for training HMMs. After the model probabilities are obtained by using the training data, the observations of a test data set are fed into a dynamic programming routine in order to obtain the best state sequence, i.e., the sequence of scenes in the video clip. In some applications, rather than using a training sequence, the model parameters can be obtained from the data itself to represent the description of the model. Within the emerging multimedia description standard MPEG-7, a scheme is expected that will make it possible to store probabilistic model parameters (such as HMM probabilities) in an MPEG-7 bit-stream.
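To make the two topologies and the decoding step concrete, the sketch below writes down illustrative transition matrices and a plain Viterbi decoder in Python/NumPy. The probability values are invented placeholders, not the trained parameters of this work; in the actual system they would come from Baum-Welch training on labelled shot sequences.

```python
import numpy as np

# Illustrative transition matrices; the actual probabilities would be learned
# with Baum-Welch from hand-labelled shot sequences, not guessed as here.

# Left-to-right topology: establishing -> dialog -> transitional, no going back.
A_left_to_right = np.array([
    [0.8, 0.2, 0.0],   # establishing
    [0.0, 0.9, 0.1],   # dialog
    [0.0, 0.0, 1.0],   # transitional
])

# Circular topology: dialog <-> non-dialog, either state may follow the other.
A_circular = np.array([
    [0.85, 0.15],      # dialog
    [0.20, 0.80],      # non-dialog
])

def viterbi(pi, A, B, observations):
    """Most likely hidden state sequence for a discrete-observation HMM.

    pi: (N,) initial state probabilities
    A:  (N, N) state transition probabilities
    B:  (N, M) observation probabilities per state
    observations: sequence of observation symbol indices (one per shot)
    """
    N, T = len(pi), len(observations)
    delta = np.zeros((T, N))            # best log-likelihood ending in each state
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    with np.errstate(divide="ignore"):  # log(0) -> -inf for forbidden transitions
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (from, to)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, observations[t]]
    # Backtrack the best state sequence.
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states
```

With pi, A and B estimated from training tokens, calling viterbi(pi, A, B, observations) returns one scene label per shot, which is exactly the state sequence discussed above.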

Figure 2. The block diagram of the proposed system.

3.2. Proposed system

The block diagram of the proposed system is shown in figure 2. As a first step, it is necessary to decode and demultiplex the compressed input bit-stream (consisting of audio and video data) into its audio and video components. The video signal is then parsed into shots using any shot-boundary detection algorithm. After segmenting the video data into shots, all the remaining analysis continues based on this shot information. In other words, shots become the temporal units of the system. The shot information is used in the audio analysis to determine whether the soundtrack of a shot contains speech, silence or music content. This audio classification can be obtained using energy thresholding and frequency analysis [13], and its output decision will simply be speech (T), silence (S) or music (M) for that shot. The shot-segmented video should be further examined to detect faces within the frames of each shot. Either simple skin color detection or a more sophisticated method can be utilized to obtain face (F) or no-face (N) information for that shot. For location analysis, the histograms of consecutive shots can be compared to check whether there is a similarity between the current shot and the last shot (and also the one before the last), using a threshold (probably higher than the threshold used during shot-boundary detection). The result of this operation indicates whether the observed scene (location) has significantly changed (C) or remained unchanged (U) between consecutive shots. As the fusion step, the Token Generator (figure 2) simply combines the outputs of the three analysis kernels to obtain one multi-modal token for each shot. For example, a shot of a movie without a human face, with some music in the background and in a new location will be represented by the token MNC (i.e., Music in the background, No-face, Changed location). Hence, the token generator will represent each shot with a token consisting of three letters, which are obtained from all possible combinations of the sets (S, T, M), (F, N) and (C, U). Finally, the model parameters (state change and state output probabilities) are used to obtain the best state sequence corresponding to the input token sequence by using dynamic programming (i.e., the Viterbi algorithm). Note that the proposed method can be divided into two stages (as separated by the vertical dashed line in figure 2). The first part (left side of the dashed line) extracts the descriptors of the audio-visual data and feeds them into the decision part, which is a novel description scheme. The first part can easily be replaced by a bit-stream (e.g., an MPEG-7 bit-stream) that already contains these descriptions. Depending on the application, the HMM parameters can also be inserted into the bit-stream as a description.
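A minimal token-generator sketch in the same spirit follows; the function and variable names are ours, and the three-shot example is fabricated purely to show how the per-shot labels map to observation indices for the Viterbi routine sketched in Section 3.1.

```python
# Hypothetical token generator: one three-letter token per shot, built from the
# audio (S/T/M), face (F/N) and location (C/U) decisions described above.
AUDIO_SYMBOLS = "STM"       # Silence, speech (Talk), Music
FACE_SYMBOLS = "FN"         # Face, No-face
LOCATION_SYMBOLS = "CU"     # Changed, Unchanged location

TOKENS = [a + f + l for a in AUDIO_SYMBOLS for f in FACE_SYMBOLS for l in LOCATION_SYMBOLS]
TOKEN_INDEX = {token: i for i, token in enumerate(TOKENS)}   # 12 observation symbols

def shots_to_observations(shot_labels):
    """Convert per-shot (audio, face, location) labels into HMM observation indices."""
    return [TOKEN_INDEX[a + f + l] for a, f, l in shot_labels]

# Example: three shots; the second one is the 'MNC' shot mentioned in the text.
shots = [("T", "F", "U"), ("M", "N", "C"), ("T", "F", "U")]
observations = shots_to_observations(shots)
# These indices, together with trained pi, A and B matrices, would be fed to the
# Viterbi routine to recover the scene sequence.
print(observations)
```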

4. Simulation results

The system shown in figure 2 has been tested. Simulations are conducted to compare different topologies, different observation symbol sets of the states and the effects of different training data on HMM-based dialog scene analysis. Since the extraction process is not a factor in these comparisons, it is assumed that all low-level descriptors are available. In other words, the audio and video features are obtained manually from the training and test data sets. During our simulations, audio-visual content from the MPEG-7 Test Data Set (CD-20 [Spanish TV movie], CD-21 [Spanish TV sitcom], CD-22 [Portuguese TV sitcom]) is used for training and test purposes. The ground-truth data is obtained by subjectively assigning every shot to a "true" state/scene. The performance of the different approaches is compared based on this ground truth.

4.1. Comparisons between different HMM topologies

The left-to-right and circular topologies are compared according to a criterion which measures the difference between the ground truth and the computed state sequence output. The left-to-right topology has only three states, which are the establishing, dialog and transitional scenes, whereas the circular topology has two states, dialog and non-dialog. The shot accuracy measure, R1, simply gives the ratio of correct shot assignments to the total number of shots. The results are tabulated in Table 2, which shows how these two topologies perform for the different input test sets (CD-20, CD-21, CD-22) after they are trained by using all the available data, including themselves.

Table 2. Shot accuracy, R1, for different test data sets.

R1              CD-20   CD-21   CD-22
Left-to-right   0.92    0.98    0.99
Circular        0.71    0.82    0.94

It should be noted that for the left-to-right topology, the input data (tokens) is pre-segmented such that it contains one establishing scene, one dialog scene and one transitional scene (i.e., 3 states). Otherwise, it is not possible to use the left-to-right topology to detect scenes. The observation symbols are selected as combinations of audio, face and location information, as explained in the previous section. A typical result for the left-to-right HMM is presented in figure 3 with some randomly selected key-frames for each scene.

Figure 3. Dialog scene analysis for MPEG-7 test data set (CD-21) using a left-to-right HMM (ground truth is shown as a pulse diagram; dialogs are solid lines).

While both topologies are quite successful in assigning shots to the correct scenes, the simulation results reveal that the left-to-right topology performs better compared to its circular counterpart. However, the left-to-right topology requires a-priori knowledge about every scene, hence it is usually difficult to use in practice. The next stage of simulations is conducted on the circular topology to test its performance on different observation symbol and training data sets.

Figure 4. R1 (a) and R2 (b) for different observation sets and training data.

Figure 5. Dialog scene analysis for MPEG-7 Test Data Set (CD-22) using a circular HMM for different observation sets and training data (ground truth is shown as a pulse diagram; dialogs are high pulses. s-train: self-training, a-train: training by all data, o-train: training by all data other than the test data itself).

4.2. Comparisons between observation symbols of HMM and training sets

The next steps of the simulation are conducted only for the circular topology, since the left-to-right one requires a priori information. At this stage, the performance between the ground truth and computed values is compared using not only R1, but also the scene accuracy parameter, R2, which is defined as the ratio of correct scene assignments to the total number of scenes, either dialog or non-dialog. For this parameter, the transition point between scenes does not have to be exact, but a disjoint scene is strictly necessary for a detection to count as correct. In other words, if a dialog scene is totally missed or divided into two dialog scenes, it is accepted as a wrong classification for that particular dialog scene. Three different sets of observation symbols are tested: audio-only, audio + face and audio + face + location. In other words, we are trying to understand the performance of an HMM-based dialog scene detection system utilizing only audio information, or audio and face information, or audio, face and location information together. These different observation sets are also tested for different training data: self-training by the test data, and training by all data including and excluding the test data, respectively. In this way, it becomes possible to assess the dependency of each method on the training data. The results of these experiments are shown in figure 4, after summing the individual results for the three available test data sets, CD-20, 21 and 22. As a typical result, figure 5 gives the performance in detail for the circular HMM, corresponding to the test data set CD-22. Similar to figure 4, the results are presented for different observation sets and training data, but as a pulse diagram. Along the shot axis, a high pulse represents a dialog, while a low pulse is a non-dialog scene. The numbering of the dialogs (i.e., D1, D2, etc.) is assigned automatically as an ordering value. Hence, for example, D2 of the ground truth may successfully correspond to D3 of one of the methods. According to the simulation results, the performance improvement due to location information is not significant. In fact, this result is expected, since location information intuitively adds little extra information for defining a dialog. In some cases, it may even have a negative effect on the results. More interestingly, the audio-only observation set provides good results with some dependency on the training data. If the test set is removed from the training set, the performance of the audio-only approach decreases considerably. However, as noted previously, in some applications the test data can be freely included in the training data. According to the presented results, audio and face together gives the most consistent performance among the observation sets for different training data.
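The two measures can be made precise with a short sketch; the overlap rule used for R2 below is our interpretation of the criterion described above (a ground-truth scene counts as correct only if exactly one detected scene of the same type overlaps it), not code from this work.

```python
def shot_accuracy(truth, predicted):
    """R1: fraction of shots assigned to the correct state (e.g., dialog/non-dialog)."""
    assert len(truth) == len(predicted)
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def runs(labels):
    """Split a per-shot label sequence into scenes: (label, first_shot, last_shot)."""
    scenes, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            scenes.append((labels[start], start, i - 1))
            start = i
    return scenes

def scene_accuracy(truth, predicted):
    """R2: ratio of correctly detected scenes to the total number of ground-truth scenes.

    A ground-truth scene counts as correct if exactly one predicted scene of the
    same label overlaps it, so a missed or split scene counts as wrong while the
    boundaries themselves need not match exactly (an assumed reading of the text).
    """
    predicted_scenes = runs(predicted)
    correct = 0
    for label, start, end in runs(truth):
        overlaps = [s for s in predicted_scenes
                    if s[0] == label and s[1] <= end and s[2] >= start]
        if len(overlaps) == 1:
            correct += 1
    return correct / len(runs(truth))

# Example with 0 = non-dialog, 1 = dialog, one label per shot.
truth     = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
print(shot_accuracy(truth, predicted))   # 0.833...
print(scene_accuracy(truth, predicted))  # 0.5 (the dialog scene is split, so it counts as wrong)
```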

5. Conclusions

A novel dialog scene detection scheme with multi-modal inputs and probabilistic modeling is proposed. For both circular and left-to-right HMM topologies, the scene transitions are determined with good accuracy.


Since the number of dialog scenes in the content is not known beforehand, a left-to-right HMM will not be feasible in practice. For the more practical circular topology, among the three observation symbol sets, the set with audio and face information gives the best performance for different training data. Audio-only detection also remains a good option for low-complexity applications. With its current design, the method is not capable of differentiating between dialog and monolog scenes. However, with proper modifications (old face/new face) to the observation set, this deficiency may also be handled. Improving the proposed simple state diagram and testing with automatically extracted features remain as future research.

References

1. R.M. Bolle, B.-L. Yeo, and M.M. Yeung, "Video query: Research directions," IBM Journal of Research and Development, Vol. 42, pp. 233–252, 1998 (also available at http://www.almaden.ibm.com/journal/rd/422/bolle.txt).
2. J.S. Boreczky and L.D. Wilcox, "A hidden Markov model framework for video segmentation using audio and image features," in Proceedings of ICASSP'98, 1998, pp. 3741–3744.
3. S. Eickler and G. Rigoll, "Continuous online gesture recognition based on hidden Markov models," in Proceedings of ICPR'98, 1998, pp. 1206–1208.
4. M. Ferman and A.M. Tekalp, "Probabilistic analysis and extraction of video content," in Proceedings of ICIP'99, 1999.
5. J. Huang, Z. Liu, and Y. Wang, "Integration of audio and visual information for content-based video segmentation," in Proceedings of ICIP'98, 1998.
6. Q. Huang, Z. Liu, A. Rosenberg, D. Gibbon, and B. Shahraray, "Automated generation of news content hierarchy by integrating audio, video and text information," in Proceedings of ICASSP'99, 1999, pp. 3025–3028.
7. R. Lienhart, S. Pfeiffer, and W. Effelsberg, "Video abstracting," Communications of the ACM, Vol. 40, No. 12, pp. 55–62, 1997.
8. J. Nam, A.E. Cetin, and A.H. Tewfik, "Speaker identification and video shot analysis for hierarchical video shot classification," in Proceedings of ICIP'97, 1997.
9. J. Nam, M. Alghoneiemy, and A.H. Tewfik, "Audio-visual content-based violent scene characterization," in Proceedings of ICIP'98, 1998, pp. 353–357.
10. A.V. Nefian and M.H. Hayes III, "An embedded HMM-based approach for face detection and recognition," in Proceedings of ICASSP'99, 1999, pp. 3553–3556.
11. H. Pan, Z.-P. Liang, T.J. Anastasio, and T.S. Huang, "A hybrid NN-Bayesian architecture for information fusion," in Proceedings of ICIP'98, 1998, pp. 368–371.
12. L.R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall: Englewood Cliffs, NJ, USA, 1993.
13. C. Saraceno and R. Leonardi, "Identification of story units in audio-visual sequences by joint audio and video processing," in Proceedings of ICIP'98, 1998, pp. 363–367.
14. S. Tsekeridou and I. Pitas, "Speaker dependent video indexing based on audio-visual interaction," in Proceedings of ICIP'98, 1998, pp. 358–362.
15. N. Vasconcelos and A. Lippman, "Towards semantically meaningful feature spaces for the characterization of video content," in Proceedings of ICIP'97, 1997.
16. W. Wolf, "Hidden Markov model parsing of video programs," in Proceedings of ICASSP'97, 1997, pp. 2609–2611.


A. Aydin Alatan was born in Ankara, Turkey in 1968. He received his B.S. degree from Middle East Technical University, Ankara, Turkey in 1990, the M.S. and DIC degrees from Imperial College of Science, Technology and Medicine, London, UK in 1992, and the Ph.D. degree from Bilkent University, Ankara, Turkey in 1997, all in Electrical Engineering. He was a post-doctoral research associate at the Center for Image Processing Research at Rensselaer Polytechnic Institute between March 1997 and June 1998, and at the New Jersey Center for Multimedia Research at New Jersey Institute of Technology between June 1998 and May 2000. Since then, he has been a faculty member in the Electrical-Electronics Engineering Department at Middle East Technical University, Ankara, Turkey. His research interests are multi-modal content-based video analysis, digital watermarking, robust image/video transmission over mobile channels and packet networks, object-based video coding and 3-D motion models.

Ali N. Akansu received the BS degree from the Technical University of Istanbul, Turkey, in 1980, the MS and Ph.D degrees from the Polytechnic University, Brooklyn, New York in 1983 and 1987, respectively, all in Electrical Engineering. He has been with the Electrical and Computer Engineering Department of the New Jersey Institute of Technology since 1987. He was an academic visitor at IBM T.J. Watson Research Center and at GEC-Marconi Electronic Systems Corp. during the summers of 1989 and 1996, and 1992, respectively. He serves as a consultant to the industry. His current research interests are signal theory, linear transform theory (subbands and wavelets) and algorithms, image-video compression and multimedia networks, signal processing techniques for digital communications, communication networks and Internet technologies. Dr. Akansu has published many papers on block, subband and wavelet transforms and their applications in image and video compression, and digital communications. He is a co-author (with R.A. Haddad) of the book Multiresolution Signal Decomposition: Transforms, Subbands and Wavelets, Academic Press, 1992 and a coeditor (with M.J.T. Smith) of a book entitled Subband and Wavelet Transforms: Design and Applications, Kluwer, 1996. He is also a co-editor of a recent book (with M.J. Medley) Wavelet, Subband and Block Transforms in Communications and Multimedia, Kluwer, 1999. Dr. Akansu was an Associate Editor of IEEE Transactions on Signal Processing, a member of the Digital Signal Processing Technical Committee. He is currently a member of the Signal Processing Theory and Methods, and Multimedia Signal Processing Technical Committees of the IEEE Signal Processing Society. He was the Technical Program Chairman of the IEEE Digital Signal Processing Workshop 1996. He has served as session organizer or session chair, as a member of the Organizing Committees for the SPIE Visual Communications and Image Processing Conferences (VCIP). He co-organized the first Wavelet Transforms related technical conference in the United States, 1st NJIT Symposium on April 30, 1990, and then 2nd and 3rd NJIT Symposia in 1992 and 1994, respectively. Recently, he co-organized the 7th NJIT Symposium on Content Security, Data Hiding and Watermarking, April, 1999. He has given many talks and presentations internationally on Subband and Wavelet Transforms and their applications. He was the lead guest co-editor of the Special Issue of IEEE Transactions on


Signal Processing on the Theory and Application of Filter Banks and Wavelets (April, 1998 issue). He is a Senior Member of IEEE.

Wayne Wolf is a professor of electrical engineering at Princeton University. Before joining Princeton, he was with AT&T Bell Laboratories, Murray Hill, New Jersey. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from Stanford University in 1980, 1981, and 1984, respectively. His research interests include multimedia information systems, VLSI, and embedded computing. Wolf has been elected to Phi Beta Kappa and Tau Beta Pi. He is a Fellow of the IEEE and a member of the ACM and SPIE.