A Lip Reading Application on MS Kinect Camera

Alper Yargıç, Muzaffer Doğan
Computer Engineering Department, Anadolu University, Eskisehir, Turkey
{ayargic,muzafferd}@anadolu.edu.tr

Abstract—Hearing-impaired people can read lips, and lip reading applications may help them improve their lip imitation skills. The speech of people with normal hearing can be recognized even by cellular phones, but lip reading systems that use only visual features remain important for hearing-impaired people. This paper aims to develop an application using the MS Kinect camera to recognize Turkish color names, to be used in the education of hearing-impaired children. Predefined lip points are located with depth information by the MS Kinect Face Tracking SDK. Words are segmented from the speech, and the angles between the lip points are used as features to classify the words. The angles are computed using the 3D coordinates of the lip points. The KNN classifier is used to classify the words with Manhattan and Euclidean distances, and the best feature vectors are sought. As a result, the isolated words are classified with a success rate of 78.22%.

Keywords—lip reading; MS Kinect camera; 3D face tracking; K-nearest neighbor classifier; lip activation; lip motion detection

This research was supported by the Scientific Research Projects Commission of Anadolu University (Project number: 1302F039).

I. INTRODUCTION

With developing technology, there has been an increasing interest in automatic lip reading systems. Lip reading systems are commonly used to support hearing-impaired people, for example by measuring how successful a hearing-impaired person is when trying to mimic mouth motions and by scoring pronunciation accuracy [1, 2]. The first step of a lip reading system is to acquire the speaker's mouth images and to find the coordinates of some predefined points on the mouth. Then the motions of those points are analyzed with classifier methods, such as artificial neural networks [3], hidden Markov models [4], or the k-nearest neighbor classifier [5], to recognize the spoken word. Audio information can also be used as a supplementary resource alongside visual mouth images in lip reading systems to increase the recognition success. However, the importance of visual features, such as lip movements and the shape of the mouth, should not be underestimated [6, 7]. Classification systems using only visual data are available in the literature [8, 9]. Both hearing-impaired people and people with normal hearing use sound signals and visual features while listening to a conversation [10]. In the real world, audio and visual features complement each other, and the same rule applies to real-time lip reading applications. Classification using only audio or only video data is not very accurate; therefore, combining audio and video data increases the accuracy of speech recognition systems [11]. A data set created in a low-noise environment can reach an accuracy above 95% [12].


However, the classification accuracy decreases as the noise level in the environment increases. Audio-visual systems are more consistent and have higher classification rates than visual-only or audio-only systems. However, hearing-impaired people cannot extract the audio of a conversation, so they cannot use audio information to recognize lip motions. Therefore, increasing the accuracy of visual-only lip reading systems is important for applications that address hearing-impaired people's lip reading behavior.

Visual-only classification systems can be divided into two groups: shape-based and appearance-based systems [9]. Shape-based systems use predefined points on the lip as features, such as the distance between two points or the angle between three points on the lip [11]. Appearance-based systems are based on pixel intensity values around the mouth area of the image [13]. In this paper, the MS Kinect camera, which detects predefined lip points using an appearance-based algorithm, is used.

In this paper, isolated Turkish words whose lip points are obtained by the MS Kinect camera are classified using the K-Nearest Neighbor (KNN) classifier. The aim of the study is to eventually develop an application that helps improve hearing-impaired people's lip reading imitations. In Sec. 2, face tracking with the MS Kinect sensor is explained. Sec. 3 describes the data preprocessing before classification. Sec. 4 describes lip activation detection, lip movement tracking, active lip motion interval detection, and segmentation of an isolated word. Sec. 5 describes the classification of the isolated words with KNN. Experimental results are presented in Sec. 6 and the results are discussed in Sec. 7.

II. FACE TRACKING WITH KINECT SENSOR

The Microsoft Kinect sensor is an effective camera for face tracking and real-time movement tracking applications, because the Kinect camera has an integrated infrared sensor and captures a stream of color images together with the depth data of each image [14]. The Kinect sensor obtains 3D data by using these components:

• a color camera,
• an infrared emitter,
• an infrared receiver.

The Kinect sensor is supported by the Kinect Face Tracking Software Development Kit (Face Tracking SDK), whose versions after 1.5 provide tools for developing real-time face tracking and lip reading applications on the .NET Framework. The Face Tracking engine analyzes the input images to compute the 3D position of the head and locates 121 predefined points on the face model. These face points are shown in Fig. 1. The SDK engine also measures the distances of these face points to the camera and makes these values available to the application within the same processing step. The Face Tracking SDK uses an Active Appearance Model as its 2D feature tracker, and the 2D data obtained from the Kinect sensor are extended to 3D by inserting the depth information [15]. Face tracking is then accomplished in the 3D Kinect coordinate system, where depth values and skeleton space coordinates are expressed in meters [16]. The x and y axes represent the skeleton space coordinates and the z axis represents the depth, as shown in Fig. 3.

In this study, the Kinect sensor and the Kinect Face Tracking SDK are used for:

• face detection,
• lip detection,
• lip motion tracking.

18 of the 121 face points represent the lips: 8 of them are located on the inner lip and 10 on the outer lip. Each lip point is assigned an integer ID value to identify it in this paper. The lip points and their IDs are shown in Fig. 2.
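For illustration only (the paper's implementation runs on the .NET Framework with the Face Tracking SDK), a minimal sketch of how one frame of tracked lip points could be represented, assuming the 3D coordinates have already been exported; the names and coordinate values below are made up:

```python
# Hypothetical per-frame record of tracked lip points: lip point ID -> (x, y, z),
# with x, y in skeleton space and z the depth, all in meters.
from dataclasses import dataclass
from typing import Dict, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class LipFrame:
    frame_index: int
    points: Dict[int, Point3D]  # e.g. 18 entries, one per lip point ID

# Made-up example holding only the four outer-lip points (IDs 9, 12, 15, 17)
# that are later used as angle corners.
example = LipFrame(frame_index=0,
                   points={9: (-0.021, -0.030, 0.910),
                           12: (0.000, -0.012, 0.905),
                           15: (0.019, -0.028, 0.912),
                           17: (0.001, -0.047, 0.908)})
```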

Fig. 1. 121 face points tracked by Kinect sensor

Fig. 3. Kinect Coordinate Space

III. DATA PREPROCESSING

In real-time face tracking and lip reading systems, the head pose, head movements, camera-to-head distance, and head direction of the speaker are important parameters, and these parameters affect the lip reading performance and the robustness of the system. In real-time lip reading systems, the most frequently used features are the distance between the upper and lower lips and the width of the mouth. The width and height deformation rate of the mouth is also used as a feature in the classification process. The classification methods that use these features have reached effective word classification rates, because the biggest deformation of the lip contour occurs in these features [12]. Input images should be preprocessed before classification because of the following three main problems that may occur while collecting data:

• People have different lip contour shapes and mouth height/width ratios, and these differences may cause different classifications.

• The distance between the camera and the speaker affects the pixel distances between lip points on the image, and classification algorithms based on those distances may suffer from these differences.

• If the speaker turns his/her head slightly to the left or right, the height/width ratio of the mouth may change, which may lead to incorrect classifications.

To prevent these problems, 3D data which contain the depth information [17] and the angles between the lip points [18] can be used effectively instead of the width and height information between pixels on the input images. The depth information, i.e. the 3D data, eliminates the distance problem, and the angles eliminate the shape and height/width ratio problems. Therefore, no further data preprocessing is required when 3D information and angles are used to classify the isolated words.
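To make the angle feature concrete, the following is a minimal sketch, not the authors' code: the angle at the middle point B formed with points A and C is computed from the 3D coordinates, so it does not depend on the speaker's distance from the camera. The function and variable names are illustrative.

```python
import numpy as np

def lip_angle(a, b, c):
    """Angle (degrees) at vertex b between the rays b->a and b->c, from 3D points."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# e.g. feature f_1 = angle between lip points 12, 9, and 17 (vertex at point 9):
# f_1 = lip_angle(points[12], points[9], points[17])
```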

Fig. 2. 18 lip feature points and their assigned ID values

IV. WORD SEGMENTATION

In lip reading systems, the first step is to determine the starting and ending points of a word in the speech video. Precisely determining these points increases the success of the classification. Detecting where a word starts and finishes is called word segmentation. Word segmentation can be performed after lip activation detection, which decides whether the movements of the lips are meaningful or not.

During the pronunciation of a specified word, lip activation can be detected by using the standard deviation of a specified lip feature over the past N image frames, where the value of N is chosen experimentally according to the frame rate of the video data. The standard deviation of the first feature, f_1, over the past N frames around the k-th frame is computed by the formula

\sigma_{f_1,k} = \sqrt{\frac{1}{N}\sum_{i=k-N+1}^{k}\left(f_{1,i} - \bar{f}_1\right)^2}    (1)

where f_{1,i} is the i-th value of feature f_1 and \bar{f}_1 is the mean value of f_1 over the past N frames [12].

Fig. 5. Word segmentation steps: Input Features → Lip Activation Detection → Active Interval Detection → Word Segmentation
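As an illustration of Eq. (1), here is a minimal sketch (not the authors' implementation) of the activation test used in the following paragraph: a frame is flagged as active when the standard deviation of f_1 over the past N = 5 frames exceeds the threshold of 1. The function name is illustrative.

```python
import numpy as np

def activation_flags(f1_values, n=5, threshold=1.0):
    """Return a 0/1 flag per frame: 1 if the std of f1 over the past n frames exceeds threshold."""
    f1_values = np.asarray(f1_values, dtype=float)
    flags = np.zeros(len(f1_values), dtype=int)
    for k in range(n - 1, len(f1_values)):
        window = f1_values[k - n + 1:k + 1]
        if window.std() > threshold:   # population std, matching Eq. (1)
            flags[k] = 1
    return flags
```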

In this paper, 4 lip points on the outer lip, specified with the IDs 9, 12, 15, and 17 in Fig. 2, are used to define 2 features. The first feature, f_1, is the angle between the lip points 12, 9, and 17; the second feature, f_2, is the angle between the lip points 9, 12, and 15, as shown in Fig. 4. These features represent how much the mouth is opened in the vertical and horizontal axes, respectively. The number of frames, N, used to detect the lip activation is taken as 5. The standard deviation of f_1 over the last 5 frames is computed according to Eq. (1). If \sigma_{f_1,k} for the k-th frame is greater than 1, then the lips are assumed to be active at the k-th frame; otherwise they are assumed to be passive. This threshold value of 1 was determined from preliminary experiments. A computer program is developed which displays a word to the speaker, and the speaker reads that word while the Kinect camera records the data. If 10 consecutive frames are classified as active (1) or passive (0) in the order [0 0 0 0 0 1 1 1 1 1], where 5 passive frames are followed by 5 active frames, then the frame where 1 first occurs, i.e. the 6th frame, is taken as the starting frame of the word. If the activity of the frames occurs in the order [1 1 1 1 1 0 0 0 0 0], where 5 active frames are followed by 5 passive frames, then the first occurrence of 0 is taken as the ending frame of the word. By using these starting and ending frames, the words can be segmented from the input video. The word segmentation steps are shown in Fig. 5.

Fig. 4. The angle features, f_1 and f_2
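The start/end rule above can be sketched as follows, assuming `flags` is the 0/1 activity sequence from the previous sketch; the function name is illustrative:

```python
def segment_words(flags):
    """Return (start_frame, end_frame) pairs found by the 5-passive/5-active patterns."""
    start_pattern = [0] * 5 + [1] * 5
    end_pattern = [1] * 5 + [0] * 5
    words, start = [], None
    for k in range(len(flags) - 9):
        window = list(flags[k:k + 10])
        if window == start_pattern and start is None:
            start = k + 5                    # first active frame of the word
        elif window == end_pattern and start is not None:
            words.append((start, k + 5))     # first passive frame after the word
            start = None
    return words
```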

V. WORD CLASSIFICATION

A. Preprocessing (Normalization)

Pronunciations of different words take different time intervals. For example, the word beyaz in Turkish (meaning white) may take about 1 second, while the word kahverengi (meaning brown) may take 1.25 seconds. If the frame rate of the video is 12 fps, then the word beyaz takes 12 frames and kahverengi takes 16 frames. Besides, two different persons may pronounce the same word in different intervals. Even the same person may pronounce the same word at two different times in different intervals because of his/her mood at that time. Thus, the frame-lengths of the words should be normalized for a successful classification. The linear regression method is widely used to normalize the segmented word data for speaker adaptation in real-time lip reading systems [19]. Here, cubic spline interpolation [20] is used for normalization so that each word has the same frame-length, 15, at the end.

B. K-Nearest Neighbor Classifier

The KNN classifier is a popular classifier for speech recognition systems because of its simplicity and its strong classification rates in lip reading applications [5]. Input data are classified by using the distance to the nearest neighbor. During the classification process, each isolated Turkish word is assigned to the nearest neighbor based on the Euclidean and Manhattan distances. The KNN classifier does not need retraining when a new word is added to the training data set, and it also does not need additional training data for classification [21].
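A minimal sketch of these two steps under stated assumptions: each segmented word is a variable-length sequence of angle values, cubic spline interpolation resamples it to 15 frames, and a nearest neighbor classifier (k = 1, as the "nearest neighbor" description suggests) with the Manhattan metric assigns the label. The scipy/scikit-learn calls are illustrative choices; the paper's implementation runs on the .NET Framework.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from sklearn.neighbors import KNeighborsClassifier

def resample_word(angle_values, target_len=15):
    """Resample a variable-length angle sequence to target_len frames with a cubic spline."""
    angle_values = np.asarray(angle_values, dtype=float)
    x_old = np.linspace(0.0, 1.0, len(angle_values))
    x_new = np.linspace(0.0, 1.0, target_len)
    return CubicSpline(x_old, angle_values)(x_new)

# Hypothetical usage: X_train holds one 15-value row per recorded word,
# y_train holds the corresponding color-name labels.
# knn = KNeighborsClassifier(n_neighbors=1, metric="manhattan").fit(X_train, y_train)
# predicted = knn.predict([resample_word(new_word_angle_values)])
```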

Fig. 4. The angle features,

Word Segmentation

VI. EXPERIMENTAL RESULTS AND DISCUSSION

A computer application is developed to acquire and analyze images of speakers from the MS Kinect camera. The video images are recorded in a normally lit environment at 12 frames per second and a resolution of 1280x960. The Face Tracking SDK is used to detect the face and extract the lip points. During the recordings, the 2D video images are stored together with the 2D coordinates of the lip points and the 3D information, which contains the depth of each lip point. The data are then processed offline to segment the words, and the words are classified using KNN. The whole process is summarized in Fig. 6.

This study aims, in the future, to develop an application that improves the lip imitation skills of hearing-impaired children with the aid of an instructor. For this reason, the most frequently used words about colors in Turkish are chosen to construct a data set. These 15 color names and their English translations are listed in Table I.

Fig. 6. The flowchart for the lip reading system: Speaker → Input Image → Face Detection → Lip Detection → Lip Feature Detection (Face Tracking SDK) → Lip Activation Detection → Activation Interval Detection → Word Segmentation → Word Classification with KNN (offline data processing)

The Kinect camera is located about 90 cm away from the speaker's face, and each speaker repeats each of the 15 isolated Turkish words 5 times. For the experiments, 10 volunteers read the words, so that the data set consists of 750 words in total. Each of the angles listed in Table II is separately used to classify the words with the KNN algorithm. The second and third columns in Table II show the percentages of correctly classified words over the total 750 spellings when the Manhattan and Euclidean distances are used in the KNN classifier, respectively. It is seen that the Manhattan distance gives better classification rates than the Euclidean distance. It is also seen that the angles between the corners 9-17-15, 15-9-17, 3-1-7, and 17-12-9 give the best classification rates. These four angles are visually shown in Fig. 7.

TABLE I. LIP READING DATA SET

Turkish Word    Meaning
Beyaz           White
Bordo           Burgundy
Gri             Gray
Kahverengi      Brown
Kırmızı         Red
Lacivert        Navy blue
Mavi            Blue
Menekşe         Violet
Mor             Purple
Pembe           Pink
Sarı            Yellow
Siyah           Black
Turkuaz         Turquoise
Turuncu         Orange
Yeşil           Green

TABLE II. ACCURACY OF WORD CLASSIFICATION USING KNN

Angle corners    Percentage of correctly classified words
                 Manhattan    Euclidean
12-9-17          56.22        55.56
12-17-9          55.56        53.78
17-12-9          57.78        56.67
11-1-17          51.78        54.22
1-11-17          51.77        50.44
1-17-11          55.33        54.89
12-15-9          54.89        52.67
3-1-7            60.44        57.56
7-3-1            56.22        51.33
1-7-3            55.33        56.00
1-12-5           55.56        54.44
1-5-12           55.56        54.44
5-1-12           50.44        49.11
1-5-3            51.33        49.56
17-18-16         40.44        37.33
17-16-18         41.11        44.67
18-17-16         40.00        39.11
3-1-5            48.89        47.56
3-2-1            50.67        49.56
1-3-5            51.33        49.78
5-1-7            51.33        47.56
1-8-7            56.67        52.44
1-7-5            48.44        46.00
12-9-15          51.56        53.11
9-12-15          55.33        52.22
9-10-12          39.56        38.00
15-9-17          62.67        59.78
9-17-15          63.33        59.11
9-18-17          52.00        50.22

Fig. 7. Visual representation of the four best angles

TABLE III. PERCENTAGE OF CORRECTLY CLASSIFIED WORDS

Feature set          Percentage of correctly classified words
Four best angles     72.44
All of the angles    78.22

As the second experiment, the four best angles are used together to classify the words, instead of using only one feature as in the first experiment. In the third experiment, all of the angles are used for the classification. The Manhattan distance is used in both experiments. The percentages of correctly recognized words are presented in Table III. Using all of the angles gives a better classification rate, but the computation takes more time.
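A hedged sketch of how such multi-angle feature vectors might be assembled, reusing the illustrative lip_angle() and resample_word() helpers from the earlier sketches; the angle triples are the four best corners reported in Table II, and the middle ID is assumed to be the vertex, as in the definitions of f_1 and f_2:

```python
import numpy as np

BEST_ANGLE_CORNERS = [(9, 17, 15), (15, 9, 17), (3, 1, 7), (17, 12, 9)]

def word_feature_vector(frames, corners=BEST_ANGLE_CORNERS, target_len=15):
    """frames: list of per-frame dicts mapping lip point ID -> (x, y, z)."""
    parts = []
    for a, b, c in corners:
        per_frame = [lip_angle(frame[a], frame[b], frame[c]) for frame in frames]
        parts.append(resample_word(per_frame, target_len))
    return np.concatenate(parts)   # 4 angles x 15 frames = 60 values per word
```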


VII. CONCLUSION

In this paper, the MS Kinect camera and the Face Tracking SDK are used to acquire video images and detect the 3D coordinates of predefined points on the lips. A set of isolated words consisting of Turkish color names is constructed, and the words are classified by the KNN classifier. The features are selected from the angles between some points on the lips, and reasonable results are obtained by using these angle features. Each angle is used to classify the words separately, and the best angle features are determined. It is shown that the angle between the corners 9-17-15 alone correctly classifies 63.33% of the words. When the number of features increases, the classification rate also increases, but the computation gets slower. The success of the classifier is 78.22% when all angle features are used and 72.44% when only the best four angles are used. Using four angles decreases the computation time, and it can reasonably be used instead of all the angles. The KNN classifier is used to classify the words, and it is shown that the Manhattan distance yields better results than the Euclidean distance when used in the KNN classifier. As future work, the experiments will be repeated with the new version of the MS Kinect camera for Windows, which supports a near mode where the face can be tracked from only 40 cm away.


REFERENCES

[1] L. Neumeyer, H. Franco, V. Digalakis, and M. Weintraub, "Automatic scoring of pronunciation quality," Speech Commun., vol. 30, pp. 83-93, 2000.
[2] O. Turk and L. Arslan, "Speech recognition methods for speech therapy," in Proc. IEEE 12th Signal Processing and Communications Applications Conference, 2004, pp. 410-413.
[3] A. Bagai, H. Gandhi, R. Goyal, M. Kohli, and T. Prasad, "Lip-reading using neural networks," IJCSNS, vol. 9, p. 108, 2009.
[4] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257-286, 1989.
[5] T. Pao, W. Liao, and Y. Chen, "Audio-visual speech recognition with weighted KNN-based classification in mandarin database," in Proc. Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2007), 2007, pp. 39-42.
[6] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Am., vol. 26, p. 212, 1954.
[7] K. K. Neely, "Effect of visual factors on the intelligibility of speech," J. Acoust. Soc. Am., vol. 28, p. 1275, 1956.
[8] W. C. Yau, D. K. Kumar, and S. P. Arjunan, "Visual recognition of speech consonants using facial movement features," Integrated Computer-Aided Engineering, vol. 14, pp. 49-61, 2007.
[9] W. C. Yau, D. K. Kumar, S. P. Arjunan, and S. Kumar, "Visual speech recognition using image moments and multiresolution wavelet images," in Proc. International Conference on Computer Graphics, Imaging and Visualisation, 2006, pp. 194-199.
[10] P. Duchnowski, U. Meier, and A. Waibel, "See me, hear me: Integrating automatic speech recognition and lip-reading," in Proc. Int. Conf. Spoken Lang. Process., 1994, pp. 547-550.
[11] E. Petajan, "Automatic lipreading to enhance speech recognition," in Proc. Global Telecommunications Conference, Atlanta, GA, 1984, pp. 265-272.
[12] J. Shin, J. Lee, and D. Kim, "Real-time lip reading system for isolated Korean word recognition," Pattern Recognit., vol. 44, pp. 559-571, 2011.
[13] L. Liang, X. Liu, Y. Zhao, X. Pi, and A. V. Nefian, "Speaker independent audio-visual continuous speech recognition," in Proc. IEEE International Conference on Multimedia and Expo (ICME'02), 2002, pp. 25-28.
[14] D. Catuhe, Programming with the Kinect for Windows Software Development Kit: Add Gesture and Posture Recognition to Your Applications. O'Reilly Media, Inc., 2012.
[15] Q. Wang and X. Ren, "Facial feature locating using active appearance models with contour constraints from consumer depth cameras," Journal of Theoretical and Applied Information Technology, vol. 45, no. 2, pp. 593-597, 2012.
[16] J. Webb and J. Ashley, Beginning Kinect Programming with the Microsoft Kinect SDK. Apress, 2012.
[17] G. Loy, E. Holden, and R. Owens, "3D head tracker for an automatic lipreading system," in Proc. Australian Conf. on Robotics and Automation (ACRA 2000), 2000.
[18] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui, "Audio-visual speech recognition using new lip features extracted from side-face images," in COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction, 2004.
[19] K. Chen, W. Liau, H. Wang, and L. Lee, "Fast speaker adaptation using eigenspace-based maximum likelihood linear regression," in Proc. ICSLP, 2000, pp. 742-745.
[20] Y. Cheung, X. Liu, and X. You, "A local region based approach to lip tracking," Pattern Recognit., 2012.
[21] C. Rodriguez, F. Boto, I. Soraluze, and A. Pérez, "An incremental and hierarchical k-NN classifier for handwritten characters," in Proc. 16th International Conference on Pattern Recognition, 2002, pp. 98-101.