
Extending the eSIGN Editor to Specify More Dynamic Signing

Rekha Jayaprakash

Silke Matthes

Thomas Hanke

IDGS University of Hamburg [email protected]

IDGS University of Hamburg [email protected]

IDGS University of Hamburg [email protected]

ABSTRACT

In this paper, we present a time alignment process to match the dynamics of human signing with the scripted (avatar) sign. We replace the manual editing process with a Kinect-based approach. While the current state of the art in sign language recognition does not allow reliable recognition of fluent signing, we show that it is possible to extract the dynamics of the user's sign performance and apply them to a pre-existing sign script featuring the exact same signs. The user manually defines a sign script and then signs it in front of the Kinect so that the system can capture the dynamics as performed by the user. A dataset of selected signs has been created to experiment with the proposed approach, and a performance analysis of our approach is described in detail in this paper.

Keywords

eSIGN Editor, Time Alignment, Dynamic Time Warping, SiGML, Kinect Skeleton Tracking, HamNoSys.

1. INTRODUCTION

The eSIGN Editor [7] provides an easy-to-use way to script avatar performances to be performed by the eSIGN/Dicta-Sign avatars. For the glosses the user enters, it looks up the HamNoSys [4] notation from its own database or an iLex [3] database. In subsequent steps, the user can manipulate the signs, e.g. by changing their locations in signing space, or add nonmanuals. The timing, however, cannot easily be manipulated by the user, as the duration of a sign is implicitly defined by its HamNoSys notation. As the signing rhythm (or the lack thereof) has been identified as a key shortcoming making this from-notation avatar signing difficult to understand, this is an area to address.

As a first step, the eSIGN Editor user interface was extended to allow the specification of sign durations. These can either be entered by the user or drawn from an iLex database. In order to obtain empirically founded durations for specific signs while abstracting away from the signing speed of individual signers, several strategies to normalize durations have been implemented and tried. In the case that duration information is provided automatically, it is of course possible to override it, e.g. to allow phrase-final lengthening. Obviously, the approach described so far provides timing information neither for transitional (inter-sign) movements nor at the sub-sign level. While it would be straightforward to allow the user to specify sub-sign level timing information within the eSIGN Editor user interface, empirical data to support such an approach simply do not exist, and purely manual specification would conflict with the tool's goal of providing a rather quick scripting facility.

The idea is to create movement descriptions, on the one hand from HamNoSys and average duration information from the iLex database, and on the other hand from the Kinect data, that are close enough to each other that Dynamic Time Warping [2] can time-align the movement description created from the sign script to match the dynamics of the Kinect data. For this to work, both the Kinect data and the sign script data are pre-processed.

2. PROCESSING HUMAN-SIGNED (KINECT) DATA

2.1 RGB Data Processing

The Kinect camera is used to capture human signing because it provides 3D information that is significantly helpful for the sign/gesture recognition process. It gives real-world distances between the signer in the scene and the camera, which is most useful for continuously tracking the signer's moving hands. It captures an RGB stream, commonly referred to as color video data, and the depth video data (3D information) in a single format called ".oni". This .oni video data is processed through several computer vision and image processing steps. Each RGB frame I(m,n,p) is filtered with a Gaussian filter to remove noise, and then the optical flow algorithm [1] is applied. In order to obtain the motion vectors of the regions of interest (face and hands), an intensity threshold for feature points is set. A motion vector contains the vector length and vector angle between the same feature point in two consecutive frames. For example, let P(x, y) and P'(x', y') be the same feature point in frame 1 and frame 2. Then the angle θ and the length l between them are given by

θ = arctan((y' − y) / (x' − x)),    l = √((x' − x)² + (y' − y)²).

To obtain additional directions (normals at the feature point), a vector direction equation with a scalar multiplier is used; varying the multiplier yields further directions (see figure 3).
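The following is a minimal sketch of this RGB pre-processing step, assuming OpenCV and NumPy; the thresholds and window sizes are illustrative choices, not the exact values used in the experiments, and the restriction to the face and hand regions is omitted for brevity.

```python
# Minimal sketch: Gaussian smoothing, sparse optical flow, and motion-vector
# length/angle between two consecutive frames (illustrative parameters only).
import cv2
import numpy as np

def motion_vectors(prev_gray, curr_gray, quality=0.3, max_corners=200):
    """Track feature points between consecutive frames and return the
    length and angle of each resulting motion vector."""
    # Select strong feature points (quality threshold) in the first frame.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=quality, minDistance=7)
    if pts is None:
        return np.array([]), np.array([])
    # Sparse Lucas-Kanade optical flow finds the same points in the next frame.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good_old = pts[status.flatten() == 1].reshape(-1, 2)
    good_new = nxt[status.flatten() == 1].reshape(-1, 2)
    dx = good_new[:, 0] - good_old[:, 0]
    dy = good_new[:, 1] - good_old[:, 1]
    lengths = np.hypot(dx, dy)      # l = sqrt(dx^2 + dy^2)
    angles = np.arctan2(dy, dx)     # theta = atan2(dy, dx)
    return lengths, angles

# Usage: denoise each frame with a Gaussian filter before tracking, e.g.
# g1 = cv2.GaussianBlur(cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY), (5, 5), 0)
# g2 = cv2.GaussianBlur(cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY), (5, 5), 0)
# lengths, angles = motion_vectors(g1, g2)
```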

The RGB video data is processed to get the motion vectors for each frame while the moving hands are continuously tracked. At the time of capture, timestamp information for each frame is carefully recorded throughout the video.


3. PROCESSING SCRIPTED SIGNING

3.1 Information Extraction

From the scripted sign, the avatar video is generated. From this video, we extract the following data: the individual time durations of the signs and the motion vectors (following the same procedure as for the Kinect RGB video). The average time duration for each sign is calculated by summing up the durations of all its utterances and dividing by the number of utterances, as sketched below. This is carried out using the information available in the iLex database [3] for DGS-Korpus data.
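As a small illustration of this computation, the sketch below averages per-utterance durations for one gloss; the duration values are hypothetical stand-ins for data queried from iLex.

```python
# Sketch of the per-sign average duration over its corpus utterances.
def average_duration(durations_ms):
    """Mean duration of a sign, given the durations of all its utterances."""
    return sum(durations_ms) / len(durations_ms)

# Hypothetical example: three utterances of one gloss, durations in ms.
print(average_duration([620, 700, 780]))   # -> 700.0
```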

Figure 1 Motion vectors of a random frame

As explained in section 2.1, by changing the scalar multiplication values to 3,5,7,9… in the vector direction equation, more possible directions of motion are introduced in the signing space to capture all relevant information (see figure 3).

2.2 Depth Data Processing

The OpenNI API [5] provides several algorithms to manipulate the depth video stream. The API returns positions and orientations of the skeleton joints, but the joint orientations are not reliable because of the variable length of the torso part of the human body: they change whenever the apparent torso length changes while the body is in motion. The joint positions (measured in mm), on the other hand, are reliable. Partial ground truth data for the joint positions is extracted from the SiGML [7] description, which is based on HamNoSys [4]. For example, the SiGML description for the sign "ICH" (Engl. "I") has a "location" element which gives information about the sign position; in this case, the location of the arm is near the chest.
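As an illustration, a hedged sketch of collecting such location and contact information from a SiGML file follows; the element and attribute names are assumptions inferred from the location_* elements mentioned in this paper, not a definitive SiGML schema.

```python
# Hedged sketch: collect partial ground truth (location/contact) from
# location_* elements in a SiGML description. Tag and attribute names are
# assumed for illustration and may differ from the actual SiGML schema.
import xml.etree.ElementTree as ET

def location_constraints(sigml_path):
    """Collect (tag, location, contact) triples from location_* elements."""
    root = ET.parse(sigml_path).getroot()
    constraints = []
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("location"):
            constraints.append((elem.tag,
                                elem.get("location"),
                                elem.get("contact")))
    return constraints

# For the sign "ICH" this might yield something like
# [("location_bodyarm", "chest", "touch")]  (illustrative output only).
```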

Figure 3 Directions of motion vectors

Figure 2 Skeleton tracking in a random depth frame

Now the parameters for each frame are extracted and stored separately for the depth video data; these include the skeleton joint positions (p) and orientations (o). Both the RGB and the depth streams have the same spatial resolution, so it is easy to map between color and depth frames. The skeleton-tracking algorithm on the depth video data provides the orientation and position of 24 different joints such as Head, Torso, Left/Right Hand, Fingertip, etc. Integrating and mapping the depth data with the RGB data is a challenge because the two streams are not synchronized during recording; in fact, the depth timestamps fluctuate in a time window around the corresponding RGB timestamps. With the help of the timestamps, this can be rectified. In the end, the important parameters are extracted from the Kinect data for further comparison and for the temporal alignment processing. The position and orientation data are extracted mainly for the purpose of segmenting signs before they are compared against the avatar video sequence.
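A minimal sketch of this timestamp-based rectification follows, assuming the RGB and depth timestamps have already been read from the .oni recording; nearest-neighbour matching is one straightforward way to pair the frames, not necessarily the exact procedure used here.

```python
# Sketch: pair each RGB frame with the depth frame whose timestamp is closest.
import numpy as np

def match_depth_to_rgb(rgb_ts, depth_ts):
    """For every RGB timestamp, return the index of the closest depth frame.
    Assumes depth_ts is sorted in ascending order."""
    rgb_ts = np.asarray(rgb_ts, dtype=float)
    depth_ts = np.asarray(depth_ts, dtype=float)
    idx = np.searchsorted(depth_ts, rgb_ts)          # insertion positions
    idx = np.clip(idx, 1, len(depth_ts) - 1)
    left, right = depth_ts[idx - 1], depth_ts[idx]
    # Pick whichever neighbour is closer in time.
    return np.where(rgb_ts - left <= right - rgb_ts, idx - 1, idx)
```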

Figure 4 Video frames from the sign "RECTANGLE" (left) and the corresponding calculation of velocity vectors (right)

The motion trajectory over the length of the sign sequence is also required in order to have a completely labeled set of data for classification. This motion trajectory is extracted from the avatar video generated by the JASigning Video Generator [7]. The sequences of signs are entered into the eSIGN Editor [6] to visualize the avatar signing in the SiGML player [7].

Simultaneously the JASigning Video Generator captures the sequence as a movie file.

3.2 Sign Segmentation in the Kinect Data

For the segmentation of signs in the Kinect RGB video data, hand position and orientation from the skeleton tracking are utilized. Only the total time duration is obtained from the Kinect video sequence; the duration of the individual sign segments in the recording is unknown. At this point, inter-sign pauses, i.e. longer periods without movement, are taken into account in order to improve segmentation accuracy. Segmenting the signs using the joint orientations and positions is a heuristic approach. The heuristic constraints are defined for the database that we use for testing in this work. As explained in section 2.2, the partial ground truth from the SiGML descriptions is specifically defined for the joint positions (location_body_part in SiGML) along with attributes such as contact and location. For example, consider Table 1.

Figure 5 Example showing the process of warping segments


Table 1. Partial ground truth constructed from SiGML

Sign                     Body part       Contact   Location
ICH                      Bodyarm/Hand    Touch     Chest
MITTWOCH                 Bodyarm/Hand    Close     Chin
LEER_oben_kreis in Uhr   Hand            Touch     Palm

The joint position information gives the distance of each joint from the sensor in mm. A relationship is then established between the Kinect joint orientation/position values and the entries in Table 1. This heuristic relationship, along with other parameters explained further in this paper, acts as the segmentation criterion for the Kinect data.
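The sketch below illustrates the pause-based part of this heuristic: the stream is cut at longer low-motion periods of the dominant hand joint. The speed threshold and minimum pause length are illustrative assumptions, not values taken from this work.

```python
# Sketch: segment a Kinect recording at inter-sign pauses, i.e. longer periods
# in which the hand joint barely moves (thresholds are illustrative only).
import numpy as np

def segment_by_pauses(hand_positions_mm, timestamps,
                      speed_thresh=50.0, min_pause=0.25):
    """hand_positions_mm: (N, 3) dominant-hand joint positions per frame.
    Returns (start, end) frame indices of the detected sign segments."""
    pos = np.asarray(hand_positions_mm, dtype=float)
    ts = np.asarray(timestamps, dtype=float)
    dt = np.maximum(np.diff(ts), 1e-6)
    speed = np.linalg.norm(np.diff(pos, axis=0), axis=1) / dt   # mm/s
    moving = speed > speed_thresh
    segments, start, pause_start = [], None, None
    for i, m in enumerate(moving):
        if m:
            if start is None:
                start = i
            pause_start = None
        else:
            if pause_start is None:
                pause_start = i
            # A sufficiently long still period closes the current segment.
            if start is not None and ts[i] - ts[pause_start] >= min_pause:
                segments.append((start, i))
                start = None
    if start is not None:
        segments.append((start, len(moving)))
    return segments
```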

4. TEMPORAL ALIGNMENT OF SIGNS

Now the time alignment between each segmented sign from the Kinect and from the avatar is carried out. The sequences are represented as Sk = {sk1, sk2, …, skn} and Sa = {sa1, sa2, …, san}, where the ski and sai are the segmented signs of the Kinect and avatar videos respectively. Each segmented sign is compared using Dynamic Time Warping (DTW) [2], which measures the similarity between two sequences that vary in time. In this paper, dynamic programming [2] is used to compute the distance measure. The feature vectors for each sign segment are formed from the extracted motion vectors and the time duration information. When these two vectors are passed through the DTW algorithm, it returns the non-normalized distance between the vectors, the accumulated distance between them, the length of the warping path (the normalizing factor), and the warping path points (matrix). Based upon the warping points, the two segments are temporally warped to a best fit (see figure 5).
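A compact dynamic-programming sketch of such a DTW alignment is given below. The one-dimensional per-frame features and the absolute-difference cost are simplifying assumptions for illustration, not necessarily the exact feature set or cost used in the experiments.

```python
# Sketch: dynamic-programming DTW between a Kinect (reference) segment and an
# avatar (test) segment, returning the accumulated distance, the path length
# (normalizing factor), and the warping path.
import numpy as np

def dtw(ref, test):
    ref, test = np.asarray(ref, float), np.asarray(test, float)
    n, m = len(ref), len(test)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(ref[i - 1] - test[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1],
                                   acc[i - 1, j - 1])
    # Backtrack the optimal warping path from (n, m) to (1, 1).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[n, m], len(path), path
```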

5. EXPERIMENTAL ANALYSIS

The proposed approach is applied to a dataset consisting of four samples performing eight different sign sequences. For example, consider a user entering the sign sequence ICH MUSS ICH ARBEITEN ("I must work") into the eSIGN Editor (see figure 6).

Figure 6 User-typed sign sequence with the corresponding HamNoSys

5.1 Test case

The user performs the sign sequence in front of the Kinect camera as previously scripted. The skeleton of the user is tracked to obtain the joint parameters of the moving hands, such as position and orientation (see figure 7).

Figure 7 The user signing the sequence, captured and tracked with the Kinect

Figure 8 RGB motion vectors between two frames

The motion vectors from the Kinect RGB video data are computed (see figure 8). At this point, key frames are included to segment the signs.

The segmented signs from both the Kinect and the avatar video are warped using dynamic time warping, considering the motion vectors. As explained earlier in this paper, the motion vectors of each sign may have different durations; they need to be aligned temporally using these features. The motion vector of the Kinect segment is taken as the reference input and the avatar sign segment as the test input. These two are given as input to the warping function, which finds the best-fitting points using similarity measures. The closest match is found at a particular time when the distances between the compared features are minimal. These points are obtained for the whole length of the test input (avatar signs) and warped with the reference input (Kinect sign), which provides the final temporally aligned sign segment with the duration difference.

Figure 9 Example frames from the avatar video and their corresponding velocity vectors

As of the time of writing, the test set consists of only eight sign sequences, for seven of which the optimal alignment is obtained, with an average accuracy of 77% with respect to correct matching. A correct match is identified based upon the reachable warping path for a sequence out of the n possible warping paths. The remaining sequence yields a false match due to insignificant motion vectors (i.e. the motion vectors are largely identical or do not change over time). This means that our procedure currently always provides a result, and the user must decide whether it is useful or not. We hope that in the future a combined measure derived from a much larger test set will give some indication of whether the dynamics could be successfully transferred or not.

Table 2. Example sign sequences from DGS-Korpus data

Signs          Avatar's standard time duration (sec)   Warped time duration (sec)
seq 1
  ICH1         0.7000                                  0.9330
  MUSS1        0.8400                                  0.4340
  ICH2         0.7000                                  0.4000
  ARBEITEN2    1.5600                                  1.6660
seq 2
  MONTAG       1.0600                                  2.2010
  KANN1        0.6000                                  0.6000
  AB2          0.6600                                  0.8670
  NUMUHR       1.3700                                  1.7000
seq 3
  ICH1         0.7000                                  0.8670
  WENIG        0.6000                                  0.8000
  ERFAHRUNG    0.5400                                  0.9670

Figure 10 Comparison of the warped time with the standard avatar time for sign sequences 1, 2 and 3 with respect to Table 2
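To illustrate how a "warped time duration" such as those in Table 2 could be obtained from a DTW result, the hedged sketch below rescales an avatar segment by the ratio of Kinect frames to avatar frames covered by the warping path; this is an assumed simplification of the duration transfer, for illustration only.

```python
# Hedged sketch: derive a warped duration from a DTW warping path by rescaling
# the avatar segment towards the human (reference) timing. This is an assumed
# simplification, not necessarily the exact procedure used in this work.
def warped_duration(avatar_duration_s, path):
    """path: list of (kinect_index, avatar_index) pairs from the DTW alignment."""
    if not path:
        return avatar_duration_s
    n_ref = path[-1][0] - path[0][0] + 1     # Kinect frames covered by the path
    n_test = path[-1][1] - path[0][1] + 1    # avatar frames covered by the path
    return avatar_duration_s * (n_ref / n_test)

# Hypothetical example: 28 Kinect frames aligned to 21 avatar frames would
# stretch a 0.70 s avatar sign to about 0.93 s.
```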

6. FUTURE WORK

In order to preserve not only the speed of individual signs but also the dynamics of inter-sign transitions, it would be most useful to treat these transitions as extra segments of the sign stream. For this to become feasible, we either need to be able to tell inter-sign movements apart from intra-sign movements or need a statistical model of how to split the signal. While there have been promising approaches to the first [9], they rely on motion capture providing a higher temporal and spatial resolution than the Kinect camera offers. Instead, we will use corpus data and investigate the usefulness of a statistical model to predict the duration of inter-sign transitions for specific pairs of signs.

7. REFERENCES

[1] Sun, Deqing, Roth, Stefan, and Black, Michael J. (2010). Secrets of Optical Flow Estimation and Their Principles. CVPR Technical Report, Brown University.

[2] Reyes, Miguel, Dominguez, Gabriel, and Escalera, Sergio (2011). Feature Weighting in Dynamic Time Warping for Gesture Recognition in Depth Data. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV), Barcelona, Spain, 1182-1188. DOI=http://dx.doi.org/10.1109/ICCVW.2011.6130384.

[3] Hanke, Thomas, and Storz, Jakob (2008). iLex – A Database Tool Integrating Sign Language Corpus Linguistics and Sign Language Lexicography. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marrakech, Morocco.

[4] Hanke, Thomas (2004). HamNoSys – Representing Sign Language Data in Language Resources and Language Processing Contexts. In Proceedings of the Workshop on Representation and Processing of Sign Languages, ELRA, Paris, 1-6.

[5] PrimeSense Inc. (2012). Prime Sensor NITE 1.5 Algorithms Notes.

[6] Hanke, Thomas, and Popescu, Hortensia (2003). Intelligent Sign Editor. eSIGN Deliverable D2.3.

[7] http://vh.cmp.uea.ac.uk/index.php/Main_Page

[8] Kennaway, Richard (2003). Experience with and Requirements for a Gesture Description Language for Synthetic Animation. In Proceedings of the International Gesture Workshop, Genova, Italy, 300-311.

[9] Duarte, Kyle, and Gibet, Sylvie (2010). How are Transitions Build in Sign Languages? Presentation at TISLR 10.