IEEE SIGNAL PROCESSING LETTERS, VOL. 17, NO. 3, MARCH 2010

Adaptive Sign Language Recognition With Exemplar Extraction and MAP/IVFS

Yu Zhou, Xilin Chen, Member, IEEE, Debin Zhao, Hongxun Yao, and Wen Gao, Fellow, IEEE

Abstract—Sign language recognition systems suffer from the problem of signer dependence. In this letter, we propose a novel method that adapts the original model set to a specific signer using a small amount of his/her training data. First, affinity propagation is used to extract the exemplars of the signer-independent hidden Markov models; the adaptive training vocabulary can then be formed automatically. Based on the collected sign gestures of the new vocabulary, a combination of maximum a posteriori estimation and iterative vector field smoothing is used to generate signer-adapted models. Experimental results on six signers demonstrate that the proposed method reduces the amount of adaptation data while still achieving high recognition performance.

Index Terms—Affinity propagation, maximum a posteriori, sign language recognition, signer adaptation, vector field smoothing.

I. INTRODUCTION

Sign language recognition aims to transcribe sign language into text automatically. Much work has been done on sign language recognition [1]; to the best of our knowledge, representative works include [2]–[4]. Most of this work focuses on signer-dependent (SD) sign language recognition. However, system performance is poor when a signer is not registered in the training set. Signer-independent (SI) models [5] can achieve high performance, but still cannot match SD models. Adaptation techniques from speech recognition [6] and handwriting recognition [7] offer an alternative solution to this problem. Ong et al. [8] applied supervised maximum a posteriori (MAP, [9]) estimation to adapt their system and obtained 88.5% accuracy on a 20-gesture vocabulary. von Agris et al. [10] combined maximum likelihood linear regression (MLLR) and MAP for signer adaptation; with 80 and 160 adaptation signs, they achieved 78.6% and 94.6% accuracy, respectively, on a vocabulary of 153 signs. In their latest work [11], they combined the eigenvoice, MLLR, and MAP algorithms to reduce the

Manuscript received September 29, 2009; revised November 19, 2009. First published December 08, 2009; current version published January 22, 2010. This work was supported by the Natural Science Foundation of China under Contracts 60533030, 60603023, and 60973067, by the National Key Technology R&D Program under Contract 2008BAH26B03, and by an open project of the Beijing Multimedia and Intelligent Software Key Laboratory at the Beijing University of Technology. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jen-Tzung Chien. Y. Zhou, D. Zhao, and H. Yao are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China (e-mail: [email protected]; [email protected]; [email protected]). X. Chen is with the Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China (e-mail: [email protected]). W. Gao is with the Institute of Digital Media, Peking University, Beijing, China (e-mail: [email protected]). Digital Object Identifier 10.1109/LSP.2009.2038251

Fig. 1. Exemplar extraction and MAP/IVFS for signer adaptation.

adaptation data and delay the saturation of performance. Wang et al. [12] presented an adaptive method based on data generation, in which they reduced the size of the adaptation data set from 350 to 136 samples with acceptable recognition accuracy.

In this letter we propose a novel signer adaptation method that reduces the amount of adaptation data further. As shown in Fig. 1, our method consists of two main steps: exemplar extraction, and the combination of maximum a posteriori estimation and iterative vector field smoothing (MAP/IVFS). First, affinity propagation (AP, [13]) is used to extract a subset of the vocabulary that represents the major characteristics of the new signer's signing; then MAP/IVFS is adopted to modify the parameters of the models. In the next two sections, AP-based exemplar extraction and MAP/IVFS are described respectively; the experimental evaluation and conclusion are given in Sections IV and V.

II. EXEMPLAR EXTRACTION

Different people have different hand sizes, body sizes, signing habits, signing rhythms, and so on, which leads to variations when they sign the same word. The mismatch between the training data and the test data leads to poor recognition performance. One way to address this problem is to collect enough data from different people to train SI models. Two problems stand out with this approach. 1) The models are difficult to converge because the data of different people vary noticeably; the distinctions between the data of two different people on the same sign can even be larger than the distinctions between the data of the same person on two different signs. 2) Generalization ability is another problem. Well-trained SI models may achieve acceptable performance on new signers, but not the performance attained by SD models; the SI models are one-size-fits-all. Adaptation techniques can adapt SI models to a specific signer. However, Chinese sign language contains more than 5000 words in total. Collecting data samples for adaptation is a tedious job even supposing that only one sample is needed for

one word. Therefore, we must explore the correlation among the models and select some exemplars for all the words.

We use hidden Markov models (HMMs, [14]) as the statistical models. For adaptation, the mean vectors of the HMMs' mixture components are the most important, so we adapt only the mean vectors, as in other signer adaptation works [8], [10]–[12]. The mean vectors of HMMs, however, contain implicit exemplars. If we do not have enough data to adapt all the vectors, we can first adapt the exemplars among them; the unadapted vectors can then be estimated according to the priors (from the SI models) and the changes (from the adapted vectors). We first use AP to cluster the HMMs' means and detect patterns among them; each pattern is an exemplar. Compared to other clustering methods, AP finds exemplars that represent the data structure well. Fig. 2 shows the clustering results of AP and k-means [15]. The lower left parts of Fig. 2(b) and (c) show that AP extracts several exemplars whereas k-means extracts only one cluster center, which means that AP can find exemplars whose structure is similar to that of the original data.

Fig. 2. Extracting exemplars using AP and k-means. (a) Original points. (b) AP clustering. (c) K-means clustering.

AP's inputs are a preference value and a similarity matrix whose elements are the similarity measures between pairs of means, so we should first compute the similarity between each pair of mean vectors. Chinese sign language data can be separated into three data streams: the position and orientation of the two hands (P&O), the affiliated hand shape (AHS), and the dominant hand shape (DHS). These data streams are almost independent of each other and contribute differently to recognition, so before computing the similarity between two means we first weight the three data streams; empirical weight values w_1, w_2, and w_3, corresponding to P&O, AHS, and DHS respectively, are obtained from experiments. As a result, the similarity s(m_i, m_j) between two means m_i and m_j can be obtained by (1):

s(m_i, m_j) = -[ (w_1/D_1) ||m_i^{P&O} - m_j^{P&O}||^2 + (w_2/D_2) ||m_i^{AHS} - m_j^{AHS}||^2 + (w_3/D_3) ||m_i^{DHS} - m_j^{DHS}||^2 ]    (1)

where D_1, D_2, and D_3 are the dimensionalities of P&O, AHS, and DHS. With the preference value and the similarity matrix as inputs, AP clusters the mean vectors through three steps, depicted in (2)–(4):

r(i,k) <- s(i,k) - max_{k' != k} [ a(i,k') + s(i,k') ]    (2)

a(i,k) <- min[ 0, r(k,k) + \sum_{i' \notin {i,k}} max(0, r(i',k)) ],  i != k    (3)

a(k,k) <- \sum_{i' != k} max(0, r(i',k))    (4)

where r(i,k) is the responsibility message sent from point i to candidate exemplar k; a(i,k) is the availability message sent from k to i; and r(k,k) + a(k,k) is the accumulated evidence that k is an exemplar. AP begins with all availabilities initialized to zero, and s(k,k) is set to the preference value that point k is chosen as an exemplar. The messages are then passed between all the mean pairs. The procedure terminates after a fixed number of iterations, after the changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations [13].

The outputs of AP are the exemplar mean vectors. These exemplars can represent a specific signer's signing manner to some extent. Each exemplar mean vector belongs to an HMM, that is, corresponds to a word. We collect one sample of a word for adaptation if the word's HMM includes at least one exemplar mean vector. Since different exemplar means may belong to the same HMM, the number of words included in the adaptation data may be smaller than the number of exemplar mean vectors, which further reduces the amount of adaptation data. MAP/IVFS takes the adaptation data and the SI models as inputs and outputs the SA models.
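For concreteness, the message passing of (2)–(4) can be sketched in Python. This is a toy, from-scratch illustration on generic 2-D points, not the letter's implementation: it uses a plain negative squared Euclidean similarity instead of the stream-weighted measure of (1), and adds message damping, a standard stabilization not discussed in the letter.

```python
import numpy as np

def affinity_propagation(S, max_iter=200, conv_iter=15, damping=0.5):
    """Toy AP (Frey & Dueck): S is (n, n) similarities, S[k, k] = preference."""
    n = S.shape[0]
    R = np.zeros((n, n))          # responsibilities r(i, k)
    A = np.zeros((n, n))          # availabilities a(i, k), start at zero
    stable, prev = 0, None
    for _ in range(max_iter):
        # (2): r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx].copy()
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # (3)/(4): availabilities from positive responsibilities
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())        # keep r(k,k) even if negative
        Anew = Rp.sum(axis=0)[None, :] - Rp       # exclude own contribution
        diag = Anew.diagonal().copy()             # a(k,k), eq. (4)
        Anew = np.minimum(Anew, 0)                # clip off-diagonal, eq. (3)
        np.fill_diagonal(Anew, diag)
        A = damping * A + (1 - damping) * Anew
        # point k is an exemplar when r(k,k) + a(k,k) > 0
        exemplars = np.flatnonzero(A.diagonal() + R.diagonal() > 0)
        if prev is not None and np.array_equal(exemplars, prev):
            stable += 1
            if stable >= conv_iter:
                break
        else:
            stable = 0
        prev = exemplars
    labels = np.argmax(S[:, exemplars], axis=1)   # assign to most similar exemplar
    return exemplars, labels

# Toy demo: two well-separated clusters, shared median preference.
pts = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
S = -((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(S, np.median(S))
exemplars, labels = affinity_propagation(S)
```

On this demo, AP extracts one exemplar per cluster; raising the shared preference yields more exemplars, which mirrors the role of the preference value in the experiments of Section IV.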

III. MAP/IVFS

We propose the MAP/IVFS algorithm to adapt the SI models into SA models with a small amount of the new signer's data.

MAP [9] utilizes the prior distribution of the parameters and the adaptation data to estimate the new parameters; the prior distribution is extracted from the SI models. If conjugate priors are used, the simple formula (5) is obtained:

\hat{\mu} = (N \bar{x} + \tau \mu) / (N + \tau)    (5)

where \hat{\mu}, \bar{x}, and \mu are the adapted mean vector, the mean of the observed adaptation data, and the mean vector of the SI models; \tau is the weight of the prior and N is the occupation likelihood of the adaptation data. If the amount of adaptation data is small, \hat{\mu} will be close to the SI mean \mu; if the amount of adaptation data is large, \hat{\mu} will be close to \bar{x}. As a result, tailoring the SI models with enough adaptation data using MAP will yield SA models close to SD models in performance. Suppose T is the length of the observation sequence and S is the number of states in the HMM set; the complexity of MAP estimation is then O(TS).

The adaptation data obtained by AP-based exemplar extraction do not cover all the HMMs. The HMMs with samples in the adaptation data set can be adapted using MAP; the unadapted HMMs must be estimated by utilizing the prior information (from the SI models), the adapted HMMs, and the correlation among different HMMs. Let M_A denote the set of HMM means for which adaptation data are available and M_U the set of HMM means for which no adaptation data exist. The transfer vectors of the means in M_A are obtained by subtracting the SI means from their corresponding adapted means:

\Delta\mu_i = \hat{\mu}_i - \mu_i,  \mu_i \in M_A    (6)

We assume that all the transfer vectors of M_A form a smooth vector field, so the transfer vector of a mean in M_U can be estimated from the transfer vectors of its neighbors in M_A. For each \mu_j in M_U, the estimate using MAP/VFS [16] is obtained by (7):

\Delta\hat{\mu}_j = \sum_{\mu_i \in N_K(j)} \lambda_{ij} \Delta\mu_i / \sum_{\mu_i \in N_K(j)} \lambda_{ij}    (7)

where N_K(j), a subset of M_A, represents the K nearest neighbors of \mu_j; \Delta\mu_i represents the transfer vector of \mu_i as defined in (6); and \lambda_{ij} is the weight of \mu_i with respect to \mu_j, which decreases with the distance between them, indicating that the more similar \mu_i is to \mu_j, the more information it supplies to \mu_j. The estimated mean \hat{\mu}_j is equal to the sum of its initial value \mu_j and the estimated transfer vector \Delta\hat{\mu}_j.

The transfer vector estimation is thus an interpolation that weights the transfer vectors of the K nearest neighbors, and selecting these neighbors is a problem that must be solved. Previously, in MAP/VFS, the neighbors were obtained in the SI means space without considering the SA means space. Nevertheless, the neighborhood relationships may change between before and after MAP/VFS adaptation. Although the SA means space is not completely available, the partial space generated by MAP/VFS can still supply some information. Based on this idea, we propose MAP/IVFS.
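To make (5)–(7) concrete, here is a small numerical sketch with hypothetical helper names. Two simplifications are assumed: every adaptation frame is assigned to the component being adapted (so the occupation count N is just the number of frames), and a Gaussian distance kernel stands in for the letter's unspecified weight \lambda_{ij}.

```python
import numpy as np

def map_adapt_mean(mu_si, obs, tau=10.0):
    """MAP update of one Gaussian mean in the form of eq. (5).
    Assumes all frames in `obs` belong to this component, so N = len(obs)."""
    obs = np.asarray(obs, float)
    N = len(obs)
    x_bar = obs.mean(axis=0)                      # mean of the adaptation data
    return (N * x_bar + tau * mu_si) / (N + tau)  # shrinks toward mu_si for small N

def vfs_estimate(mu_si, adapted_pairs, k=3, sigma=1.0):
    """VFS interpolation in the spirit of eqs. (6)-(7): shift an unadapted SI
    mean by a weighted average of its k nearest neighbors' transfer vectors.
    The Gaussian distance weight is an assumption, not the letter's choice."""
    si = np.array([p[0] for p in adapted_pairs], float)         # SI means with data
    dv = np.array([p[1] - p[0] for p in adapted_pairs], float)  # transfer vectors (6)
    d2 = ((si - mu_si) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                       # k nearest adapted neighbors
    w = np.exp(-d2[nn] / (2 * sigma ** 2))
    return mu_si + (w[:, None] * dv[nn]).sum(axis=0) / w.sum()  # eq. (7)
```

For example, with tau = 10 and ten adaptation frames, the MAP-adapted mean sits halfway between the SI mean and the data mean, which is exactly the small-data shrinkage behavior described after (5).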

Fig. 3. The MAP/IVFS process: the nearest neighbors are iteratively refined.
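The iterative refinement illustrated in Fig. 3 and described below might be sketched as follows. This is hypothetical code, not the letter's implementation: the Gaussian distance weight and the simple union rule for growing neighbor sets are our assumptions.

```python
import numpy as np

def ivfs_adapt(si_means, adapted, data_idx, k=3, sigma=1.0, max_iter=10):
    """Iterative VFS sketch: re-select each unadapted mean's neighbors in the
    partially built SA space and repeat until the neighbor sets stop growing."""
    si_means = np.asarray(si_means, float)
    data_idx = list(data_idx)
    sa = si_means.copy()
    sa[data_idx] = adapted                              # MAP-adapted means
    dv = np.zeros_like(si_means)
    dv[data_idx] = np.asarray(adapted, float) - si_means[data_idx]  # eq. (6)
    no_data = [i for i in range(len(si_means)) if i not in set(data_idx)]

    def knn(space, j):
        """k nearest adapted means to point j, measured in `space`."""
        d2 = ((space[data_idx] - space[j]) ** 2).sum(axis=1)
        return {data_idx[t] for t in np.argsort(d2)[:k]}

    nbrs = {j: knn(si_means, j) for j in no_data}       # Fig. 3(a): SI space
    for _ in range(max_iter):
        for j in no_data:                               # Fig. 3(b): VFS, eq. (7)
            nb = sorted(nbrs[j])
            d2 = ((si_means[nb] - si_means[j]) ** 2).sum(axis=1)
            w = np.exp(-d2 / (2 * sigma ** 2))
            sa[j] = si_means[j] + (w[:, None] * dv[nb]).sum(axis=0) / w.sum()
        changed = False
        for j in no_data:                               # Fig. 3(c)-(d): refine
            grown = nbrs[j] | knn(sa, j)                # add neighbors seen in SA space
            if grown != nbrs[j]:
                nbrs[j], changed = grown, True
        if not changed:                                 # neighbors stable: stop
            break
    return sa
```

When all adapted means share the same transfer vector, the sketch propagates that vector unchanged to every unadapted mean, which matches the smooth-vector-field assumption behind (7).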

Fig. 3 shows the process of MAP/IVFS. In (a), the target mean \mu_j selects K (we take 3 as an example) neighbors in the SI means space. (b) shows that the estimated value \hat{\mu}_j corresponding to \mu_j can be obtained from these neighbors using MAP/VFS. In the SA means space, the estimate \hat{\mu}_j finds the same number of neighbors. As can be seen in (c), a new neighbor appears, which means that it is still informative to \mu_j even though it is not very close to \mu_j in the SI means space. (d) shows that since a new exemplar mean appears, it should also be added to the neighbors in the SI means space. The procedure iterates until the maximum number of iterations is reached or the neighbors do not change for some number of iterations.

IV. EXPERIMENTS

Two data gloves and a position tracker are used as data input devices. The experimental data set consists of 6144 samples over 256 words, with each word having 24 samples (six signers, four samples per signer). One signer was selected as the test signer; his/her first group of samples of all the words was used as the adaptation data set, and the other three groups were used as the test data set. The data of the other five signers were used as the training data set for the SI models. Cross-validation was conducted over the six signers. Each word was modeled by a 3-state Bakis HMM, and the observation probability distribution was a unimodal multivariate Gaussian.

The experimental results are summarized in Table I. In Table I, E and A are the number of exemplars extracted by AP and the number of corresponding adaptation samples; K and K' are the initial and final average numbers of neighbors for MAP/IVFS; SI represents the recognition accuracy using the SI HMMs; SA represents the recognition rate without reducing the amount of adaptation data, that is, with an adaptation data set consisting of one sample for each word in the vocabulary. For comparing AP with k-means, k-means-based MAP/IVFS results are also listed.

Experimental results showed that the recognition accuracy improvements became larger as the preference value increased: when we increased the preference value, more exemplars were extracted, and the corresponding amount of adaptation data grew. At the larger of the two preference values tested, the number of exemplar mean vectors was about 188.2, included in about 154.8

TABLE I
EXPERIMENTAL RESULTS OF OUR METHOD

TABLE II
COMPARISON WITH OTHER METHODS

HMMs, so the adaptation data set consisted of about 154.8 samples. With this adaptation data set, a recognition accuracy of 84.60% was obtained; this performance is notable because about 100 HMMs had no adaptation data at all. Moreover, MAP/IVFS achieved a higher recognition rate than MAP/VFS because MAP/IVFS iteratively refines the neighbors, as shown by the comparison of K and K'.

At the smaller preference value, the average amount of adaptation data was 16.8 samples, and with these adaptation data MAP/IVFS achieved an average recognition rate of 76.26%, about an 8.66% improvement over the SI models. The average number of test samples recognized correctly by the SI models was 519.2, and the average number recognized correctly by MAP/IVFS was 585.7; therefore, the average number of test samples recognized wrongly by SI but correctly by MAP/IVFS was 66.5. The improvement within the vocabulary of exemplars can be estimated as 12.4 samples, and the improvement within the vocabulary of non-exemplars as 54.1 samples. As a result, the recognition rate improved by 7.54% within the vocabulary of non-exemplars.

To evaluate the effectiveness of our method, we also compared it with Wang's method [12] and von Agris' method [10], as illustrated in Table II. The experimental results in [11] were based on the eigenvoice method, which requires a large number of model sets to be trained; our database consisted of only six signers, so it was quite difficult to initialize the adaptation of SI models with eigenvoices, and we instead compared our method with their previous work [10]. From Table II, we can see that our method achieves results comparable to Wang's while using only 6.6% of the data versus Wang's 38.9%. Compared with von Agris' method, our method achieved a larger improvement.
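The accounting above can be double-checked numerically, assuming three test samples per word (768 test samples per signer), of which roughly (256 - 16.8) x 3 belong to non-exemplar words:

```python
# Sanity check of the improvement figures at the smaller preference value.
gained = 585.7 - 519.2                  # test samples fixed by MAP/IVFS overall
non_exemplar_gain = gained - 12.4       # minus the gain on exemplar words
non_exemplar_tests = (256 - 16.8) * 3   # assumed: 3 test samples per non-exemplar word
rate = 100 * non_exemplar_gain / non_exemplar_tests
print(round(gained, 1), round(non_exemplar_gain, 1), round(rate, 2))  # 66.5 54.1 7.54
```

The recovered 7.54% matches the figure reported above, which supports the assumed per-word test split.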
For MLLR, when the percentage of adaptation data was 6.6%, the estimated global transformation matrix was not robust enough, and the adapted models were biased toward the seen models.

V. CONCLUSION

In this letter we have proposed a novel method for adaptive Chinese sign language recognition. AP is utilized to extract exemplar mean vectors, from which the adaptation data can be directly selected; MAP/IVFS is then used to generate the SA models. Experimental results have shown that our method can greatly reduce the amount of adaptation data and still achieve an acceptable recognition rate. If the size of the vocabulary increases, the reduction of adaptation data may be more significant because more pooling arises among different models. Our future work will focus on signer adaptation with unlabeled data, since collecting unlabeled data is easier than collecting labeled data. Moreover, the computational time complexity also needs further consideration.

REFERENCES

[1] S. C. W. Ong and S. Ranganath, "Automatic sign language analysis: A survey and the future beyond lexical meaning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 873–891, Jun. 2005.
[2] T. Starner, J. Weaver, and A. Pentland, "Real-time American sign language recognition using desk and wearable computer based video," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1371–1375, Dec. 1998.
[3] C. Vogler and D. Metaxas, "Handshapes and movements: Multiple-channel American sign language recognition," in Proc. Gesture Workshop, 2003, pp. 247–258.
[4] W. W. Kong and S. Ranganath, "Signing exact English (SEE): Modeling and recognition," Pattern Recognit., vol. 41, no. 5, pp. 1638–1652, May 2008.
[5] J. Lichtenauer, G. ten Holt, M. Reinders, and E. Hendriks, "Person-independent 3D sign language recognition," in Gesture-Based Human-Computer Interaction and Simulation, 2009, pp. 69–80.
[6] S. K. Au-Yeung and M. H. Siu, "Maximum likelihood linear regression adaptation for the polynomial segment models," IEEE Signal Process. Lett., vol. 13, pp. 644–647, Oct. 2006.
[7] S. D. Connell and A. K. Jain, "Writer adaptation for online handwriting recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 329–346, Mar. 2002.
[8] S. C. W. Ong and S. Ranganath, "Deciphering gestures with layered meanings and signer adaptation," in Proc. Sixth IEEE Int. Conf. Automatic Face and Gesture Recognition, 2004, pp. 559–564.
[9] J. L. Gauvain and C. H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994.
[10] U. von Agris, D. Schneider, J. Zieren, and K. F. Kraiss, "Rapid signer adaptation for isolated sign language recognition," in Proc. Conf. Computer Vision and Pattern Recognition Workshop, 2006, p. 159.
[11] U. von Agris, C. Blomer, and K. F. Kraiss, "Rapid signer adaptation for continuous sign language recognition using a combined approach of eigenvoices, MLLR, and MAP," in Proc. 19th Int. Conf. Pattern Recognition (ICPR 2008), Tampa, FL, 2008, pp. 3539–3542.
[12] C. Wang, X. Chen, and W. Gao, "Generating data for signer adaptation," in Proc. Gesture Workshop, 2007, pp. 114–121.
[13] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, Feb. 2007.
[14] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–286, Feb. 1989.
[15] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967, pp. 281–297.
[16] J. Takahashi and S. Sagayama, "Vector-field-smoothed Bayesian learning for incremental speaker adaptation," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 1995, pp. 696–699.