Automatic Recognition of Signed Polish Expressions

TOMASZ KAPUSCINSKI and MARIAN WYSOCKI

The authors are with the Computer and Control Engineering Chair, Rzeszow University of Technology, W. Pola 2, 35-959 Rzeszow, Poland, e-mail: {tomekkap, mwysocki}@przrzeszow.pl. This is a modified version of the paper presented at the 2nd Language & Technology Conference, April 2005, Poznan, Poland. The work was supported by the State Committee for Scientific Research under Grant 4T11C00625. Received dd.mm.yyyy, revised dd.mm.yyyy.

The paper considers recognition of single sentences of the Polish Sign Language. We use a canonical stereo system that observes the signer from a frontal view. Feature vectors take into account information about the hand shape and orientation, as well as the 3D position of the hand with respect to the face. Recognition based on human skin detection and hidden Markov models (HMMs) is performed on-line. We focus on 35 sentences and a 101-word vocabulary that can be used at the doctor's and at the post office. Details of the solution and results of experiments with regular and parallel HMMs are given.

Key words: gesture recognition, sign language, computer vision, hidden Markov models

1. Introduction

Sign language is the natural language of deaf people. It is a visual language, different from the spoken language but serving the same function. It is not a universal language: regionally different sign languages have evolved. The Polish Sign Language (PSL) is an adaptation of the sign language used in our country to the rules of Polish grammar [4, 13]. The inability to use spoken language considerably complicates the lives of deaf people. The aim of sign language recognition is to provide an efficient and accurate mechanism to transcribe sign language into text or speech, so that communication between the deaf and the hearing society becomes more convenient. Furthermore, sign language recognition generally serves as a good basis for the development of gestural human-machine interfaces.

Automatic gesture recognition has recently attracted much attention. Good overviews of such systems can be found in [7, 8].
There exist two main approaches to collecting data for the classification process: instrumented glove-based data collection and video-based data collection. The video-based approach leads to a more natural interface, although some video-based recognition systems require the signer to wear coloured cotton gloves. One of the first video-based systems was presented by Tamura and Kawasaki [14]. The system recognized twenty one-handed words of the Japanese Sign Language; about 45% of them were recognized correctly. Charayaphan and Marble considered interpretation of the American Sign Language (ASL) [2]. They successfully classified a test sample of 31 signs. Grobel and Assam [3] recognized isolated signs of the German Sign Language (GSL) collected from video recordings of a signer wearing coloured cotton gloves. An accuracy of 91.3% on a 262-sign vocabulary was reported. The authors used hidden Markov models (HMMs). Bauer, Hienz and Kraiss extended the work [3] and recognized continuous sentences of GSL using one HMM per sign. They achieved an accuracy of 91.8% on a lexicon of 97 signs [1]. Starner et al. [11] presented a video-based system for real-time recognition of ASL sentences. They employed a single video camera as part of two different setups: one system observed the signer from a frontal view (desk-mounted camera), while the other used a cap-mounted camera for image recording. The vocabulary included 40 signs and the sentence structure to be recognized was constrained. HMMs were used for recognition. The system was tested on a corpus of 94 sentences for the desk-based setup and 100 sentences for the cap-based setup. Recognition accuracy ranged between 74.5% and 97.8%, depending on the camera position and the grammar used. Vogler and Metaxas [16] presented a novel approach to ASL recognition based on parallel HMMs (PaHMMs), which model the parallel processes occurring during the course of a sign. The authors used 3D data for a 22-sign vocabulary and demonstrated that PaHMMs can improve the robustness of HMM-based recognition. They obtained an accuracy of 91.1% for words and 84.5% for sentences.

This paper considers recognition of 35 single sentences and a 101-word vocabulary of PSL using a stereovision-based system. To our knowledge it is the first approach to PSL recognition. The reverse problem, i.e. translation of written (spoken) sentences into PSL using graphical animation, was considered in [12]. We use a canonical colour camera system observing the signer from a frontal view. The signer is not required to wear any coloured gloves, but he/she should wear long-sleeved clothes. The clothes and the background should be of colours different from the skin. Recognition is based on human skin detection and HMMs. Feature vectors take into account information on the hand shape and orientation, as well as the 3D position of the hand with respect to the face. Results of experiments with regular and parallel HMMs are presented and compared.


2. Characteristics of the PSL

In PSL, similarly to other sign languages, a sign is the equivalent of a word. Every sign can be analysed by specifying at least three components: (i) the place of the body against which the sign is made, (ii) the shape of a hand or hands, (iii) the movement of a hand or hands. Although in practical sign language communication some additional features (such as lip shape, etc.) are often used, we do not consider them in this paper. We focus on 35 sentences and 101 words that can be used at the doctor's and at the post office. The sentences contain 2-10 words that are signed in their basic form and in the right order. Fig. 1 presents the starting and final phases of the words that form the sentence "You are ill" [4].

Fig. 1. Sample sentence of PSL (the signs You, be, ill)

The considered gestures are static or dynamic. Most of them are two-handed. For one-handed signs the so-called dominant hand performs the sign, whereas for two-handed signs the non-dominant hand is also used. The non-dominant hand is often still, but in many gestures both hands move. The hands often touch each other or appear against the background of the face. The motion can be single or repeated.

3. Construction of the feature vectors

The following problems are important in our recognition task: (i) determining and tracking the position of the hands and the face of the signer, (ii) feature selection and computation. To identify regions of the hands and the face we used a histogram-based chrominance model of human skin in the normalized RGB space [5]. We assumed that the signer was facing the camera and was not changing his distance and orientation with respect to the camera. In order to ensure correct segmentation there were some restrictions on the clothing of the signer and on the background. Areas of skin-coloured objects, their centres of gravity and ranges of motion were analysed to identify the right hand, the left hand and the face. Comparison of neighbouring frames helped to detect whether the hands (or a hand and the face) touched or partially covered each other.
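A minimal sketch of such a histogram-based skin model in the normalized RGB (chrominance) space is given below. The bin count, the decision threshold and the NumPy implementation are our own assumptions for illustration, not details taken from [5].

```python
import numpy as np

def normalized_rg(image_rgb):
    """Convert an RGB image (H x W x 3, uint8) to normalized r, g chrominance."""
    rgb = image_rgb.astype(np.float64) + 1e-6          # avoid division by zero
    s = rgb.sum(axis=2)
    return rgb[..., 0] / s, rgb[..., 1] / s

def build_skin_histogram(skin_samples_rgb, bins=32):
    """Accumulate a 2D chrominance histogram from image crops containing only skin."""
    hist = np.zeros((bins, bins))
    for img in skin_samples_rgb:
        r, g = normalized_rg(img)
        h, _, _ = np.histogram2d(r.ravel(), g.ravel(),
                                 bins=bins, range=[[0, 1], [0, 1]])
        hist += h
    return hist / hist.sum()                            # normalise to a probability model

def skin_mask(image_rgb, hist, bins=32, threshold=1e-4):
    """Mark pixels whose chrominance falls into sufficiently probable histogram bins."""
    r, g = normalized_rg(image_rgb)
    ri = np.clip((r * bins).astype(int), 0, bins - 1)
    gi = np.clip((g * bins).astype(int), 0, bins - 1)
    return hist[ri, gi] > threshold
```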


The following fourteen features were computed: (1) the length lr of the line segment connecting the centres of gravity of the right hand and the face in the picture (see also Fig. 2), (2) the orientation ϕr of this line, (3) the area Sr of the right hand, (4) the compactness γr of the right hand, (5) the eccentricity εr of the right hand, (6) the angle αr between the maximum axis [15] of the right hand in the picture and the x-coordinate axis, (7) the difference zr between the average depth of the face and of the right hand, (8)-(14) the corresponding parameters for the left hand. The parameters l and ϕ characterise the position of the hands in the picture, S, γ and ε represent the hand shape, and the angles α represent the orientation of the hands. The parameters z contain the information about depth.
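The sketch below computes the seven per-hand features from a binary hand mask, the face centroid and the average depths. The exact definitions of compactness and eccentricity are not spelled out above, so standard moment-based variants are assumed here.

```python
import numpy as np

def region_features(hand_mask, face_centroid, hand_depth, face_depth):
    """Seven features for one hand, following the list above (a sketch;
    the compactness and eccentricity definitions are assumed variants)."""
    ys, xs = np.nonzero(hand_mask)
    cx, cy = xs.mean(), ys.mean()                      # hand centre of gravity
    fx, fy = face_centroid

    l = np.hypot(cx - fx, cy - fy)                     # (1) face-hand distance
    phi = np.arctan2(cy - fy, cx - fx)                 # (2) orientation of that line
    S = float(hand_mask.sum())                         # (3) area in pixels

    # central moments for the shape descriptors
    mu20 = ((xs - cx) ** 2).sum()
    mu02 = ((ys - cy) ** 2).sum()
    mu11 = ((xs - cx) * (ys - cy)).sum()

    # (4) compactness: area relative to the bounding circle (one common variant)
    radius = np.hypot(xs - cx, ys - cy).max()
    gamma = S / (np.pi * radius ** 2 + 1e-9)

    # (5) eccentricity of the moment ellipse
    common = np.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    lam1 = (mu20 + mu02 + common) / 2
    lam2 = (mu20 + mu02 - common) / 2
    eps = np.sqrt(1 - lam2 / (lam1 + 1e-9))

    # (6) angle of the maximum axis vs. the x axis
    alpha = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)

    # (7) difference between the average face depth and hand depth
    z = face_depth - hand_depth

    return np.array([l, phi, S, gamma, eps, alpha, z])
```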

Fig. 2. Elements of the feature vector describing the position and orientation of the hand: lr, ϕr, αr

4. Hidden Markov Models

A hidden Markov model is a statistical model used to characterize the statistical properties of a signal [9]. An HMM consists of two stochastic processes: one is an unobservable Markov chain with a finite number of states, an initial state probability distribution and a state transition probability matrix; the other is a set of probability density functions associated with the observations generated by each state.

Human hand gestures are spatio-temporal entities. The performance of a gesture is usually not perfect: the same gesture changes in time and space even if one person performs it twice. Human performance involves two distinct stochastic processes: human mental states and the resultant actions. The mental state is immeasurable, the action is measurable. Therefore many researchers use HMMs to recognize hand gestures [3, 8, 11].

Regular HMMs are capable of modelling only one process evolving over time. It is worth noticing that in spoken languages the phonemes, i.e. the smallest language units of speech that may bring about a change of meaning, appear sequentially. Their counterparts in signed languages can appear both in sequences and simultaneously.

Fig. 3. An HMM with two emitting states and the non-emitting start and end states

For example, the hand shape and hand orientation can change at the same time. Modelling parallel processes with a single HMM would require that the processes evolve by passing through the same state at the same time. To overcome this problem one considers PaHMMs. PaHMMs are essentially regular HMMs used in parallel [16]. They are based on the assumption that the separate processes evolve independently of one another, with independent outputs. Thus the HMMs for the separate processes differ in general with respect to their structures as well as their parameters. They are trained entirely independently.
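The independence assumption can be made concrete with a small sketch: each channel has its own HMM, and a sign is scored by summing the per-channel log-likelihoods (the product of likelihoods in the log domain). The log_likelihood method and the dictionary layout are illustrative assumptions, not part of any particular toolkit.

```python
def score_sign_pahmm(channel_models, channel_observations):
    """Score one sign with a parallel HMM.

    channel_models       -- list of independently trained HMMs, one per channel
    channel_observations -- list of observation sequences, one per channel
    Each model is assumed to expose a log_likelihood(seq) method. Channel
    independence means the joint likelihood is a product, i.e. a sum of logs.
    """
    return sum(m.log_likelihood(seq)
               for m, seq in zip(channel_models, channel_observations))

def recognize(word_models, channel_observations):
    """Pick the vocabulary word whose parallel HMM explains the data best."""
    scores = {word: score_sign_pahmm(models, channel_observations)
              for word, models in word_models.items()}
    return max(scores, key=scores.get)
```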

5. Experiments

We used the HTK software [17], originally developed for speech recognition with HMMs. The word models had the form shown in Fig. 3. Two states in the model generate observations (the emitting states) and two additional start and end states do not generate any observations (the non-emitting states). As we will see later, the start and end states facilitate the construction of composite models for recognition of sentences on the basis of models of isolated words. Continuous output probability distributions were assumed to be mixtures of two Gaussians.
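The models themselves were built in HTK; for readers without HTK, a rough equivalent of one word model can be sketched with the hmmlearn library (our substitution, not the authors' toolchain). hmmlearn has no non-emitting states, so the start and end states of Fig. 3 are omitted, and the transition probabilities below are arbitrary initial values.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def make_word_model():
    """Two emitting states, left-to-right, two Gaussians per state (cf. Fig. 3)."""
    model = GMMHMM(n_components=2, n_mix=2, covariance_type="diag",
                   init_params="mcw", params="stmcw")
    model.startprob_ = np.array([1.0, 0.0])            # always start in the first state
    model.transmat_ = np.array([[0.6, 0.4],            # stay or move right
                                [0.0, 1.0]])           # last state is absorbing
    return model

def train_word_models(training_data):
    """One model per vocabulary word, each fitted on its own realizations.

    training_data -- dict: word -> list of (frames x 14) feature sequences
    """
    models = {}
    for word, sequences in training_data.items():
        X = np.vstack(sequences)                       # stack all frames
        lengths = [len(s) for s in sequences]          # frame count per realization
        m = make_word_model()
        m.fit(X, lengths=lengths)
        models[word] = m
    return models
```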

Fig. 4. Laboratory setup


A colour stereo head from Videre Design was used in our vision-based system (see Fig. 4). Recognition was performed in real time. First we used a vocabulary of 101 words and two data sets Aw, Bw. Each set consisted of 20 realizations of each word, registered as sequences of images with a resolution of 320x240 pixels at 25 frames/s. Signs in the sets Aw and Bw were presented by signers SA and SB, respectively. Signer SB was a PSL teacher; signer SA learnt PSL for the needs of this research. We conducted various experiments reported in [6]. One of them concerned cross-validation on the sets Aw and Bw. Each set was divided into four independent subsets of equal size. Then three subsets were used to train the HMMs, and the remaining one to test them. This process was repeated for each of the four possible choices and the test recognition rates were averaged over all four results. We obtained recognition rates of 94.7% for Aw and 94.9% for Bw.
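The four-fold cross-validation just described can be summarised in a few lines. The data layout and the train_and_test callback are illustrative assumptions, and the fold assignment need not match the one actually used in the experiments.

```python
import numpy as np

def cross_validate(realizations_per_word, train_and_test, n_folds=4):
    """Average recognition rate over the folds described above.

    realizations_per_word -- dict: word -> list of 20 recorded realizations
    train_and_test        -- callable(train_dict, test_dict) -> recognition rate
                             (assumed to wrap HMM training and testing)
    """
    rates = []
    for fold in range(n_folds):
        train, test = {}, {}
        for word, realizations in realizations_per_word.items():
            # every n_folds-th realization goes to the test subset of this fold
            test[word] = realizations[fold::n_folds]
            train[word] = [r for i, r in enumerate(realizations)
                           if i % n_folds != fold]
        rates.append(train_and_test(train, test))
    return float(np.mean(rates))
```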

Fig. 5. Networks of isolated word models: variants (a) and (b)

Sentences were recognized using a composite model built as a network of isolated word models. We examined two structures, shown in Fig. 5. Both schemes used statistical information about the transition probability between two successive signs, calculated for any sign in relation to each possible preceding sign from the training corpus (a bigram language model [1, 10, 17]). The network (a) does not use any other information about the syntax, whereas the scheme (b) exploits expected connections between words, which is demonstrated in Fig. 5 in a simplified way. In both cases the parsing of long sequences is performed by a Viterbi algorithm based on token passing [17]. The non-emitting states (see Fig. 3) provide the link needed to join word models together.

The composite model was trained with pre-assembled sentences prepared by persons SA and SB and collected in the sets As, Bs, respectively. Each set contained 20 realisations of each of the 35 sentences. HTK offers so-called embedded training, which means that the parameters of the isolated word models are tuned on the basis of training sentences. Embedded training uses the same procedure as in the isolated word case, but rather than training each model individually, all models are trained simultaneously. The location of gesture boundaries in the training data is not required for this procedure, but a symbolic transcription of each training sequence is needed [17]. Table 1 shows the recognition results on the test sets depending on the signer and the structure of the network used in recognition. As one could expect, the network (b), exploiting more detailed information about the syntax, gives much better results.

Table 1. Recognition accuracy on test sets [%].
signer   network (a)   network (b)
SA       82.6          97.1
SB       83.7          97.4
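For readers who want to reproduce the bigram language model mentioned above outside HTK, the transition probabilities between successive signs can be estimated by counting over the training transcriptions. The smoothing constant and the sentence-boundary symbols below are our own choices, not details taken from the paper.

```python
from collections import defaultdict

def bigram_model(training_sentences, vocabulary, smoothing=0.5):
    """Estimate P(word | previous word) from sign-level transcriptions.

    training_sentences -- list of sentences, each a list of sign labels
    Add-constant smoothing keeps unseen transitions possible.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in training_sentences:
        # pair every sign with its predecessor, including sentence boundaries
        for prev, curr in zip(["<s>"] + sentence, sentence + ["</s>"]):
            counts[prev][curr] += 1.0

    probs = {}
    vocab = list(vocabulary) + ["</s>"]
    for prev, following in counts.items():
        total = sum(following.values()) + smoothing * len(vocab)
        probs[prev] = {w: (following.get(w, 0.0) + smoothing) / total
                       for w in vocab}
    return probs
```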

For continuous sign language recognition a key problem is the effect of movement epenthesis, i.e. transient movements between signs, which vary with the context of the signs. We examined the schemes from Fig. 5 with transient models having one emitting state and a direct transition from the start to the end node. The improvement of the results turned out to be statistically insignificant, especially for the scheme (b). Let us remark that the first and the last states in the word (or transient) models may be skipped (see Fig. 3). This is important in situations when the ending phase of the current word and the beginning of the next word are omitted because of the transient movement between the signs.

As mentioned in Section 4, regular HMMs are capable of modelling only one process evolving over time. Because of the parallel processes occurring in sign languages, it is interesting to what extent parallel HMMs can help in recognition of the PSL. We considered two PaHMM models. The first model (PaHMM1) consisted of two channels, i.e. one channel for each hand. The second model (PaHMM2) was composed of four channels: (1) with the elements l, ϕ that characterize the position of the hands, (2) with the elements S, γ, ε representing the shape, (3) with the angles α representing the orientation of the hands, (4) with the parameters z characterizing the depth. To perform computations with PaHMMs we prepared appropriate functions and added them to the HTK software. Sample results concerning the network (b) are presented in Table 2. Each set As, Bs was divided into two separate subsets, i.e. a training set and a test set, both with 10 realizations of each sentence.
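As an illustration, the four-channel decomposition of PaHMM2 amounts to a slicing of the 14-dimensional feature vector; the exact index layout below is an assumption based on the feature order given in Section 3, and the helper name is ours.

```python
import numpy as np

# Assumed feature order per hand (Section 3): l, phi, S, gamma, epsilon, alpha, z.
# Indices 0-6 describe the right hand, 7-13 the left hand.
PAHMM2_CHANNELS = {
    "position":    [0, 1, 7, 8],          # l and phi of both hands
    "shape":       [2, 3, 4, 9, 10, 11],  # S, gamma, epsilon of both hands
    "orientation": [5, 12],               # angles alpha
    "depth":       [6, 13],               # parameters z
}

def split_into_channels(sequence):
    """Split a (frames x 14) observation sequence into the four PaHMM2 streams."""
    sequence = np.asarray(sequence)
    return {name: sequence[:, idx] for name, idx in PAHMM2_CHANNELS.items()}
```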

Table 2. Recognition accuracy on test sets [%].
data sets   HMM    PaHMM1   PaHMM2
A           97.1   96.3     96.6
B           97.4   98.3     98.0
A/B         93.4   92.3     92.6
B/A         92.9   93.1     91.4
AB/A        96.3   98.0     98.3
AB/B        97.4   98.0     98.3

Table 2 gives the results of recognition on the test sets. The first two rows refer to models tested on data prepared by the same signer whose gestures were used in the training phase. The next two rows are for models trained on data prepared by one signer, denoted by the first letter, and tested on data from the other signer.

Because gestures were performed in ways that were characteristic of the signers' individual manners, the results turned out worse than those obtained with models trained and tested on gestures of the same person. Combining the training sets improved the results, as one can see in the last two rows, where AB denotes a training set composed of five realizations of each sentence from each training subset of As, Bs. Results obtained with regular HMMs are comparable for both signers, whereas parallel HMMs turned out better in recognition of sentences signed by signer SB, the PSL teacher. Her gestures were performed fluently, without paying attention to the synchronization of the hands, especially when they were already preparing for the next word at the end of the previous one. In contrast, the second signer's gestures were less spontaneous and more accurate. Both PaHMMs gave similar recognition accuracies. PaHMM1, with two channels, i.e. one channel for each hand, seems more natural.

To better evaluate the flexibility of PaHMMs, we trained the network (b) on the training subset of As (Bs) reduced by removing from it all realizations of nine chosen sentences. Some of the words used to build those sentences appeared in other sentences used for training. We recognized the set of removed sentences, containing 9*20 = 180 realizations. Next we trained the network on words only, i.e. none of the 700 realizations of the sentences were shown to it during training (no embedded training was performed). Then we recognized the whole set of sequences. Results are summarized in Table 3.

Table 3. Recognition of unknown sentences [%].
problem   180 sentences     700 sentences
model     SA      SB        SA      SB
HMM       29.4    30.6      31.0    31.3
PaHMM1    63.9    65.0      69.7    70.0

PaHMMs turned out to be more flexible than regular HMMs. With a large vocabulary it is hard to collect a sufficiently large set of training sentences. The presented results let us expect that PaHMMs will improve recognition of the PSL in such situations.

6. Conclusions and future work

In this paper an HMM-based system for recognition of signed Polish expressions was introduced. We considered a vocabulary of 101 words and 35 sentences used in typical situations at the doctor's and at the post office. The proposed feature vectors were composed of features taking into account information about the hand shape, orientation and 3D position of the hand with respect to the face, determined on the basis of image sequences obtained with a stereovision system. We used a data set of 4040 signs and 1400 sentences prepared by two persons. Comparative experiments with regular and parallel HMMs were presented. We plan further experiments with a larger vocabulary and verification of the system using larger data sets containing signs performed by various signers.

We expect that our solution, with its limited vocabulary, will be useful in translation and dialogue systems for specific domains such as post offices, doctors' surgeries, etc.

References

[1] B. Bauer, H. Hienz, and K. F. Kraiss. Video-based sign language recognition using statistical methods. Proc. of the ICPR'00, Barcelona, pp. 2463-2466, 2000.
[2] C. Charayaphan and A. E. Marble. Image processing system for interpreting motion in American Sign Language. J. Biomed. Eng., 14:419-425, 1992.
[3] K. Grobel and M. Assam. Isolated sign language recognition using hidden Markov models. Proc. of the IEEE Int. Conf. on SMC, Orlando, pp. 162-167, 1997.
[4] J. K. Hendzel. Dictionary of the Polish Sign Language. OFFER Press, Olsztyn, 3rd edition, 1997.
[5] T. Kapuscinski and M. Wysocki. Hand skin colour identification in different colour spaces (in Polish). Archives of Theoretical and Applied Computer Science, 13(1):53-68, 2001.
[6] T. Kapuscinski and M. Wysocki. Recognition of isolated words of the Polish Sign Language. Proc. of the CORES'05, Computer Recognition Systems, Springer, Berlin, Heidelberg, New York, pp. 697-704, 2005.
[7] S. C. W. Ong and S. Ranganath. Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Trans. PAMI, 27(6):873-891, 2005.
[8] V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Trans. PAMI, 19(7):677-693, 1997.
[9] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257-286, 1989.
[10] R. Rosenfeld. Two decades of statistical language modeling: Where do we go from here? Proc. of the IEEE, 88(8):1270-1278, 2000.
[11] T. Starner, J. Weaver, and A. Pentland. Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Trans. PAMI, 20(12):1371-1375, 1998.
[12] N. Suszczanska, P. Szmal, and J. Francik. Translating Polish texts into sign language in the TGT system. Proc. of the 20th IASTED International Multiconference Applied Informatics, Innsbruck, pp. 282-287, 2002.
[13] B. Szczepankowski. Sign Language in School (in Polish). WSiP, Warszawa, 1988.
[14] S. Tamura and S. Kawasaki. Recognition of sign language motion images. Pattern Recognition, 21:343-353, 1988.
[15] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, London, 1999.
[16] C. Vogler and D. Metaxas. A framework for recognizing the simultaneous aspects of American Sign Language. Computer Vision and Image Understanding, 81:358-384, 2001.
[17] S. Young et al. The HTK Book. Microsoft Corporation, 2000.
