Computer Vision Based Hand Gesture Recognition Using Artificial Neural Network

Jiong June Phu and Yong Haur Tay
Faculty of Information and Communication Technology,
Universiti Tunku Abdul Rahman (UTAR), Malaysia
[email protected], [email protected]

Abstract- This paper describes the research and development of a computer vision based hand gesture recognition system that interprets a set of static hand gestures. It can be further implemented as a sign language interpreter as well as a novel approach to human computer interaction (HCI). Hand segmentation is based on Fleck's Human Skin Color Detection model [1], which uses the hue and saturation values of human skin: an optimal range in the color spectrum distinguishes human skin colors from the background, with the values formulated from the log-opponent components of a three-channel pixel. Evaluation of this approach shows a satisfactory result using 40,000 evaluation pixel samples of skin and non-skin color cropped from 40 random human images and 40 background images, achieving a 2.005% False Rejection Rate (FRR) and a 1.68% False Acceptance Rate (FAR) under the conditions investigated in several experiments. Ideas for solving several segmentation problems caused by the skin detection approach (e.g. the multiple-foreground-objects problem) are also presented, in which a collaboration of Connected Component Labeling and Edge Detection is proposed. A technique for image clipping, using a wrist detection algorithm based on local minimum width, is also proposed. This paper further identifies and analyzes several image preprocessing techniques (e.g. image smoothing) for the noise reduction optimization problem, and presents the implementation of an adaptive classification engine based on an Artificial Neural Network (a multilayer perceptron, MLP) for hand gesture recognition. Evaluation of the MLP using 10 classes of live web cam captured samples shows an accuracy of up to 92.97% at a rejection threshold of 0.7. Another evaluation using 10 classes of Triesch's hand gesture dataset shows an accuracy of only 50.81%, which may be caused by other factors, such as the different segmentation approach forced by the lack of color information.

Keywords- Computer Vision, Skin Color Detection, Connected Component Labeling, Edge Detection, Wrist Detection, Multi Layer Perceptron.

1. Introduction
Computer Vision (CV) is a subfield of artificial intelligence which deals with images and the understanding of images, in which specific information is extracted from image data for a specific purpose. CV uses the raw information (e.g. color information) extracted from the images and attempts to simulate the effect of human vision by perceiving electronically. Using the Computer Vision approach, it is possible to integrate natural gestures and interactive communication into a system, as an innovation in HCI. As described by Leong [2], there are many similarities between the ways humans communicate using gestures and the ways users manipulate the graphical user interface of a computer system to perform certain tasks. Leong proposed using a webcam as an alternative to the mouse and keyboard, currently the most common input devices. The concept is a device which enables users to interact with the system in a natural and intuitive manner (e.g. gesture or speech) without requiring the users to wear any special hardware (e.g. mechanical trackers). Many types of non-CV based gesture interface devices are available for various Virtual Reality applications; they measure the real-time position of the user's fingers and wrist to allow natural, gesture-based interaction with the virtual environment [3] (e.g. Pinch Glove, 5DT Data Glove, Didjiglove and CyberGlove). However, there are issues with wearing these special devices, particularly motion restriction, device cost and hygiene. Using CV to achieve the same purpose appears able to overcome these problems associated with gesture interface devices.

2. Approach and methods
An appearance based hand gesture recognition approach works as follows: from a two-dimensional hand image, the hand shape is extracted, and recognition is based on the segmented shape. A single static frame is grabbed from the video stream, and processing is done on each frame individually; previous frames have no influence on the result of the current frame. This method is appropriate for static gestures which involve no articulated motion, for example, the American Sign Language (ASL) alphabet¹ or the German Finger Spelling alphabet², each of which consists of a set of symbolic, autonomous alphabet gestures.
Prior to the recognition phase, tracking and segmentation are the most essential steps for extracting useful information from raw images. Thus, it is necessary to distinguish the foreground region from the background in a given image. The foreground consists of the objects of interest to be tracked (i.e. the hand or palm), while the background consists of non-relevant pixels which are to be discarded. Many techniques can be used to differentiate the foreground and background of an image, for example binary thresholding, connected component labeling, and image differencing. Two of these techniques, namely Image Differencing with Running Average and the Skin Color Filter, are examined in this paper.
Image Differencing uses a simple idea: the foreground is modeled by taking the absolute difference between the current frame and the background. Pixels whose difference is small are then filtered out using a threshold function, leaving the foreground pixels extracted.

¹ http://www.deafblind.com/asl.html
² http://www.sign-lang.uni-hamburg.de/fa/

The equation given is:

D(i,j) = |I(i,j) − R(i,j)|    (1)

where
D(i,j) = the difference image at pixel (i,j)
I(i,j) = the input image at pixel (i,j)
R(i,j) = the background image at pixel (i,j)

The difference between the images is in terms of brightness. By applying a small threshold value, say T_foreground, every D(i,j) which exceeds that value is labeled as foreground. This equation only works when a background image is known prior to the calculation. For an adaptive algorithm, the background can be updated in real time using the running-average formula:

B(t+1) = α·I(t) + (1 − α)·B(t)    (2)

where
t = the current time
I(t) = the input image at time t
B(t) = the background image at time t
α = the weighting factor

Choosing an appropriate weighting factor is critical in this algorithm, since the weighting factor determines how much the current frame influences the running background image model. Accordingly, the algorithm only works when there are continuous changes in the given image sequence. If the foreground objects are left static for a certain period of time, the background model eventually converges to the current frame regardless of the weighting factor; the whole image is then tracked as background, leaving no object tracked as foreground.
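As a rough illustration, here is a minimal NumPy sketch of equations (1) and (2); the function names, the α value and the threshold are illustrative assumptions rather than the authors' settings:

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background update, eq. (2): B <- a*I + (1-a)*B."""
    return alpha * frame.astype(np.float32) + (1.0 - alpha) * background

def foreground_mask(frame, background, t_foreground=30):
    """Eq. (1): pixels whose absolute brightness difference from the
    background exceeds the threshold are labeled foreground."""
    diff = np.abs(frame.astype(np.float32) - background)
    return diff > t_foreground
```

On each incoming frame, foreground_mask() would be applied first, after which the background model is refreshed with update_background() using the same frame.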

2.1 Skin Color Filter
As proposed by Fleck and Forsyth in [1], human skin color is composed of two extreme hues, red (blood) and yellow (melanin), with moderate saturation. The parameter that affects saturation is the melanin level: the skin becomes more saturated as more yellow is added. Fleck also observed that skin has low-amplitude texture, except for extremely hairy skin. These skin properties are essential information that can be used in a hand tracking algorithm. The skin color filter model is proposed as follows. The three channels of a pixel (the RGB values) are first transformed into log-opponent values I, Rg, and By using formulas (3) to (6):

L(x) = 105 · log10(x + 1 + n)    (3)
I = L(G)    (4)
Rg = L(R) − L(G)    (5)
By = L(B) − (L(G) + L(R))/2    (6)

where I is the intensity (the log-opponent value of the G channel), Rg is the red-green opponent value, and By is the blue-yellow opponent value.

The green (G) channel is used to represent intensity because the red and blue channels from some cameras have poor spatial resolution. The constant 105 simply scales the output of the log function into the range [0, 254]. n is a random noise value, generated from a uniform distribution over the range [0, 1); it is added to prevent banding artifacts in dark areas of the image. The constant 1 added before the log transformation prevents excessive inflation of color distinctions in very dark regions. The log transformation makes the Rg and By values, as well as differences between I values (e.g. texture amplitude), independent of the illumination level. The hue at a pixel is defined as atan(Rg, By), computed from the smoothed Rg and By values:

hue = (180/π) · atan2(Rg, By)    (7)

The saturation at a pixel is sqrt(Rg² + By²). Because this ignores intensity, yellow and brown regions cannot be distinguished; both are treated as yellow:

saturation = sqrt(Rg² + By²)    (8)

Once the hue and saturation are calculated, skin regions can be marked as the pixels with the following properties:
(a) hue between 110 and 150 with saturation between 20 and 60, or
(b) hue between 130 and 170 with saturation between 30 and 130.
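A minimal NumPy sketch of the filter described by equations (3) to (8); the texture-amplitude test and the smoothing of Rg and By mentioned above are omitted, and the function names are our own:

```python
import numpy as np

def skin_mask(rgb):
    """Fleck-Forsyth style skin filter, eqs. (3)-(8).
    rgb: HxWx3 uint8 image. Returns a boolean skin mask."""
    rgb = rgb.astype(np.float64)
    n = np.random.uniform(0.0, 1.0, rgb.shape[:2])  # anti-banding noise

    def L(x):
        # eq. (3): scaled log transform with noise n
        return 105.0 * np.log10(x + 1.0 + n)

    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = L(r) - L(g)                       # eq. (5)
    by = L(b) - (L(g) + L(r)) / 2.0        # eq. (6)
    hue = np.degrees(np.arctan2(rg, by))   # eq. (7): (180/pi)*atan2
    sat = np.hypot(rg, by)                 # eq. (8)
    range_a = (110 <= hue) & (hue <= 150) & (20 <= sat) & (sat <= 60)
    range_b = (130 <= hue) & (hue <= 170) & (30 <= sat) & (sat <= 130)
    return range_a | range_b
```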

Figure 1: Foreground modeling using Skin Color Filter

2.2 Segmentation
The shape extracted by the background subtraction algorithm may not be exactly what is desired. Many non-relevant objects are captured as foreground because of their naturally similar characteristics, as depicted in Figure 2.

Figure 2: Non-relevant object captured as foreground

To solve this problem, the foreground is partitioned into regions using a connected component algorithm with 4-connectivity. After the regions are identified, the largest piece is taken as the main object for further processing.
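A sketch of this 4-connectivity region selection using SciPy's labeling routine; the helper name is an assumption:

```python
import numpy as np
from scipy import ndimage

def largest_region(mask):
    """Partition a binary foreground mask into 4-connected regions
    and keep only the largest one as the main object."""
    structure = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]])  # 4-connectivity
    labels, count = ndimage.label(mask, structure=structure)
    if count == 0:
        return mask  # nothing to keep
    sizes = ndimage.sum(mask, labels, index=range(1, count + 1))
    return labels == (np.argmax(sizes) + 1)
```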

Figure 3: Regions identified based on Connected Component Labeling

This algorithm fails if one of the non-relevant objects has a larger area than the actual main object. Moreover, the regions captured as foreground cannot easily be partitioned into disjoint regions when there is no clear-cut border among them, as depicted in Figure 4.

Figure 4: Objects tracked as one mingled region

One way to solve this problem is a collaboration of several image segmentation algorithms, each tackling a problem of a different characteristic. One such solution, based on Edge Detection, the Skin Color Filter model and Absolute Difference, is proposed as follows. Applying an edge detection algorithm to the image defines the borderline of each object in the captured scene. However, it does not tell which is the main object (i.e. the hand) and which are not. Ideally, edge detection applied to the image in Figure 2 should obtain the result shown in Figure 5.

Figure 5: Ideal Edge Detection

To obtain more information about the image, especially about which region contains a hand, a skin color detection model can be applied. This model can tell which regions contain human skin color; however, its output may come out as one mingled region with no defined border.

Figure 6: Foreground based on skin colors

To combine both pieces of information (the objects' borders and the regions containing human skin color), the two outputs can be collaborated using an image operator: Absolute Difference. Taking the absolute difference of the image in Figure 5 and the image in Figure 6 should ideally produce an output as shown in Figure 7, where there is a clear-cut border between the two detected objects that contain human skin color.

Figure 7: The absolute difference of the edge-detected image and the image processed with the skin detection model

Applying the region identification algorithm using connected component labeling to the image in Figure 7, it is possible to identify and label each region in the image, since the regions are now disjoint parts separated by the border information obtained from the edge detection operator. By taking the largest region identified, we arrive at the ideal output depicted in Figure 8.
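Putting the pieces together, a hedged sketch of the proposed collaboration; the paper does not name a particular edge detector, so OpenCV's Canny (with arbitrary thresholds) stands in here, and skin_mask and largest_region refer to the earlier sketches:

```python
import cv2
import numpy as np

def segment_hand(frame_bgr):
    """Collaborative segmentation sketch: the edge map cuts borders
    into the skin mask via absolute difference, then the largest
    4-connected region is kept as the hand."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    skin = skin_mask(rgb).astype(np.uint8) * 255
    edges = cv2.Canny(frame_bgr, 100, 200)   # object borderlines
    separated = cv2.absdiff(skin, edges)     # disjoint skin regions
    return largest_region(separated > 0)
```

Edge pixels outside the skin regions survive the absolute difference as thin lines, but they form small connected components, so the largest-region step still selects the hand.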

Figure 8: Ideal segmentation using the Region Identification Algorithm based on Connected Component Labeling

2.3 Wrist Detection
Before feature extraction can proceed, the segmented hand image should be clipped to a Region of Interest (ROI) in order to discard the trivial portion of the image. This process is essential because one of the features that describes the hand gesture is based on the area ratio, which uses the ROI area and the hand area. In addition, a clipped ROI eases scaling to a standardized size for more robust feature extraction. A user's hand can appear anywhere in the scene captured by the web cam, at various distances (see Figure 9).

Figure 9: Hand appears in various locations at various distances

Without clipping the image to a defined ROI, the location of the hand (its x and y coordinates) as well as its distance from the camera (z coordinate) become an issue for the feature extraction component, which may lead to incorrect information being extracted from the segmented image. By clipping the image to the ROI, location and distance from the camera are no longer an issue, and the number of varieties in the set of feature vectors representing a class is significantly reduced (see Figure 10). Ideally, clipping should produce the output depicted in Figure 11.

Figure 11: Palm Clipping with Wrist Detection Applied

It is ideal to have the wrist discarded from the ROI because the domain of gestures is restricted to the palm area. To detect the wrist, a wrist detection algorithm is proposed as follows:
Step 1: Scan the image row by row, starting from the bottom of the image.
Step 2: Declare a variable to store the minimum width, and another variable as a counter.
Step 3: In each row, identify the width of the detected object. If the width is less than the minimum width, increment the counter and update the minimum width.
Step 4: Repeat Step 3 until the counter is greater than 5.
Two assumptions are made in this algorithm:
1. The wrist has a local minimum width across the horizontal cross section of the arm and hand.
2. The hand is in a vertical posture (with a certain degree of tolerance).

Figure 10: Wrist Detection Algorithm based on local minimum width

Once the wrist's position is detected, clipping can be done so that the ROI fits exactly the palm region and trivial pixels are not included. To do this, we need the top-most, left-most, right-most and bottom-most pixel coordinates of the clipped region. The former three can be obtained directly from the segmented hand, while the bottom-most y-coordinate is based on the position of the tracked wrist. Should no wrist be tracked, the whole hand together with the arm (the area below the wrist) will be clipped, and this will affect the recognition result.
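A minimal sketch of the wrist scan (Steps 1 to 4) and the subsequent palm clipping; the literal "counter greater than 5" stopping rule follows the text, while the interpretation that the counter advances on each new minimum is our assumption:

```python
import numpy as np

def detect_wrist_row(mask):
    """Bottom-up scan for the wrist as a local minimum of object
    width (Steps 1-4); returns the row index of the last minimum."""
    min_width, counter, wrist_row = np.inf, 0, None
    for row in range(mask.shape[0] - 1, -1, -1):
        cols = np.flatnonzero(mask[row])
        if cols.size == 0:
            continue  # empty row: no object here
        width = cols[-1] - cols[0] + 1
        if width < min_width:
            min_width, wrist_row = width, row
            counter += 1
        if counter > 5:   # stopping rule taken verbatim from the text
            break
    return wrist_row

def clip_palm(mask):
    """Clip the ROI to the palm: top/left/right from the segmented
    hand, bottom from the tracked wrist row."""
    rows, cols = np.nonzero(mask)
    top, left, right = rows.min(), cols.min(), cols.max()
    bottom = detect_wrist_row(mask)
    if bottom is None:        # no wrist tracked: fall back to full extent
        bottom = rows.max()
    return mask[top:bottom + 1, left:right + 1]
```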

2.4 Feature extraction
Feature extraction is the process of generating a set of descriptors or characteristic attributes from a binary image. The descriptors are stored as a feature vector which can be used in the recognition process as well as for training an adaptive recognition engine (i.e. an artificial neural network). The main features experimented with in the prototype are object crossings, projection histograms and scalar region descriptors.
Object crossing works by counting significant pixel changes across a cross section. It is a good feature because skeletonization of the object can be omitted: the width of the object does not affect the output. The cross sections used for object crossing may be horizontal, vertical or diagonal at any arbitrary angle, depending on the implementation of the technique. The cross sections are distributed uniformly across the object, and their number depends on how detailed the extracted features should be. A larger number of cross sections extracts more information from the object, but costs more computational power and memory, since more data must be extracted and stored. In this prototype, only two types of object crossing are implemented: horizontal and vertical cross sections.
A projection histogram captures the shape of the object from a given direction by calculating the distance of the first significant pixel from its respective origin along each cross section. Projection histograms vary in the direction of projection (i.e. top, left, right, bottom, or any arbitrary diagonal direction).
Scalar region descriptors are the simplest yet still significant features for describing an object. In this prototype, four main region descriptors are used: width, height, area ratio and width/height ratio. Based on the ROI defined on the clipped palm from the segmentation module, the width feature is the horizontal length of the ROI and the height is its vertical length. The area ratio is the proportion of foreground pixels over the overall area of the defined ROI. Usually, a feature from which other features can be derived is not also used as a separate individual feature, for instance the width and height features, from which the width/height ratio is derived. This avoids redundant information which may cause inefficiency and possible conflicts in the classifier. Basically, the objective of feature extraction is to reduce the dimensionality of the raw image while retaining as many descriptive features as possible.
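A sketch of how the 26-element feature vector described in Section 2.5 could be assembled from these features; the uniform sampling positions, the fallback distance for empty rows, and the helper names are assumptions:

```python
import numpy as np

def crossings(mask, axis=0, n_sections=6):
    """Object crossing: count pixel transitions along n uniformly
    spaced cross sections (axis=0: rows, axis=1: columns)."""
    idx = np.linspace(0, mask.shape[axis] - 1, n_sections).astype(int)
    feats = []
    for i in idx:
        line = mask[i, :] if axis == 0 else mask[:, i]
        feats.append(int(np.count_nonzero(np.diff(line.astype(np.int8)))))
    return feats

def left_projection(mask, n_sections=6):
    """Projection histogram: distance of the first foreground pixel
    from the left edge, sampled on uniformly spaced rows."""
    idx = np.linspace(0, mask.shape[0] - 1, n_sections).astype(int)
    feats = []
    for i in idx:
        cols = np.flatnonzero(mask[i])
        feats.append(int(cols[0]) if cols.size else mask.shape[1])
    return feats

def feature_vector(roi):
    """26 elements: 6 horizontal + 6 vertical crossings, 6 left +
    6 right projections, area ratio, width/height ratio."""
    h, w = roi.shape
    return np.array(
        crossings(roi, 0) + crossings(roi, 1)
        + left_projection(roi) + left_projection(roi[:, ::-1])
        + [roi.sum() / (h * w), w / h], dtype=np.float64)
```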

2.5 Artificial Neural Network (ANN)
An ANN is a statistical classifier which is adaptive and can be trained to classify a dynamic set of objects. In this prototype, a multilayer perceptron (MLP) is implemented as the core recognition engine of the classification component. For the input layer, a vector of 26 elements (26 input nodes) is constructed: 6 inputs for horizontal ink crossings, 6 for vertical ink crossings, 6 for the left projection histogram, 6 for the right projection histogram, 1 for the area ratio and 1 for the width/height ratio. The hidden layer consists of a vector of 150 elements (150 hidden nodes). The output layer consists of a vector of 10 elements (10 output nodes), using the 1-of-N encoding method, where each class is represented by a 1 on a unique output node with the rest of the output nodes set to zero (e.g. 'A' is represented by (1,0,0,...,0) and 'B' by (0,1,...,0)). Four types of activation functions are implemented in the MLP: (i) linear, (ii) sigmoid, (iii) tangent hyperbolic and (iv) softmax. Two combinations of activation functions for feed-forward and backward propagation are attempted:
1. Sigmoid for both feed-forward and backward propagation.
2. Softmax for feed-forward; tangent hyperbolic and softmax for backward propagation.
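A hedged sketch of a 26-150-10 MLP with a rejection threshold; the paper implements its own network, so scikit-learn's MLPClassifier is used here purely as an illustrative stand-in, with placeholder data and training settings that are assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for extracted 26-element feature
# vectors and 10 gesture class labels; real vectors would come from
# a feature extractor such as feature_vector() above.
rng = np.random.default_rng(0)
X_train = rng.random((200, 26))
y_train = rng.integers(0, 10, 200)

# 26-150-10 topology as in the paper, tanh activation; the solver
# settings are illustrative, not the authors' configuration.
clf = MLPClassifier(hidden_layer_sizes=(150,), activation='tanh',
                    max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

def classify(features, reject_threshold=0.7):
    """Predict a class; reject when the winning output activation
    falls below the rejection threshold."""
    probs = clf.predict_proba(features.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return clf.classes_[best] if probs[best] >= reject_threshold else None
```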

3. Results

3.1 Tests on Skin Filter Model
40,000 color samples are randomly cropped from 80 random images: 20,000 samples from 40 images containing human skin of various races, and 20,000 random color pixels from non-skin regions (i.e. random backgrounds) of another 40 images for the rejection test. Each color sample (pixel) in the dataset is passed through the Skin Color Filter, and the output is recorded as 1 for acceptance and 0 for rejection. The results are shown in Table 1.

Hue    Saturation    Correct Acceptance (%)    False Rejection (%)    Correct Rejection (%)    False Acceptance (%)
130    30            97                        3                      98.57                    1.43
140    30            97.995                    2.005                  98.32                    1.68
150    30            98.55                     1.45                   96.8                     3.2
10     40            98.55                     1.45                   96.8                     3.2
20     40            98.55                     1.45                   96.79                    3.21

Table 1: Skin Detection Acceptance and Rejection tests


3.2 Evaluation on MLP using web cam captured samples for 10 gesture classes
This evaluation dataset consists of 1,381 hand gesture samples (10 classes) captured using an N-Stel PC camera. The entire evaluation dataset is fed into the neural network and the outputs are compared with the corresponding teaching signals. A confusion matrix is generated to visualize the result of the evaluation, as depicted in Table 2.

Table 2: Evaluation on MLP using web cam captured evaluation samples at Rejection Threshold of 0.7 for 10 classes

3.3 Evaluation on MLP using Triesch's gesture dataset for 10 static gesture classes
Triesch's image dataset³ is greyscale, so its samples cannot be segmented using the skin color filter model. Thus, in order to extract the evaluation data from the image dataset, an alternative segmentation using a Binary Threshold filter is implemented. With this method, only 420 evaluation samples (10 classes), those with either a black or white background, are used, since these are the only samples which can be segmented successfully with a Binary Threshold. The result of the evaluation is depicted in Table 3.

Table 3: Evaluation on MLP using Triesch's Greyscaled Dataset at Rejection Threshold of 0.5 for 10 classes

³ http://www.idiap.ch/~marcel/Databases/gestures/main.php
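A minimal sketch of how such confusion matrices with a rejection column could be assembled, reusing the hypothetical classify() helper from the MLP sketch above:

```python
import numpy as np

def confusion_matrix(samples, labels, n_classes=10, reject_threshold=0.7):
    """Rows: true class; columns: predicted class, plus a final
    'rejected' column for outputs below the rejection threshold."""
    matrix = np.zeros((n_classes, n_classes + 1), dtype=int)
    for features, true_class in zip(samples, labels):
        predicted = classify(features, reject_threshold)
        col = n_classes if predicted is None else int(predicted)
        matrix[int(true_class), col] += 1
    return matrix
```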

4. Discussion
Based on the results shown in Table 1, and in contrast to the hypothesis discussed in Section 2.1, the most optimal ranges to distinguish skin regions from the background are: (i) hue between 125 and 155 with saturation between 25 and 35, and (ii) hue between 5 and 25 with saturation between 35 and 45.
It can be observed that training with the tangent hyperbolic activation function decreases the error faster than the sigmoid activation function. Moreover, the time taken for an MLP training session using tangent hyperbolic is much shorter than with the sigmoid function. This can be explained by the number of computational steps required, there being more operations in the sigmoid function than in the tangent hyperbolic function.
The confusion matrix in Table 2 shows that all classes are well recognized except for classes 'G' and 'V': 'G' has some misrecognition toward 'H', and 'V' has some misrecognition toward 'Y'. Class 'V' has the highest error. This can be explained by the fact that classes with similar extracted features are difficult to distinguish.
The confusion matrix in Table 3 shows that most classes are misrecognized as class 'A', while class 'G' has a large number of misrecognitions toward class 'H'. The accuracy when evaluated on Triesch's dataset is low, possibly due to a high error rate in segmentation, since binary thresholding had to replace the skin color detection model because of the lack of color information in the greyscale Triesch dataset. Besides, the MLP is trained mainly on one particular person's hand gesture training samples. Thus, overfitting might have occurred, in which the MLP tends to memorize, so that it can only recognize the hand gestures of that particular person.

5. Conclusion
We have presented the research and development of a simple but effective appearance-based hand gesture recognition model, from fundamental image processing to complex recognition algorithms such as the MLP. We proposed an approach using the Skin Detection Model as a robust foreground modeling technique in the image segmentation component; evaluation shows promising results, indicating it is robust enough to be implemented in a hand segmentation module. Problems arising from this technique are countered by two proposed solutions based on the Region Identification Algorithm and the Edge Detection Algorithm, and evaluation of the proposed solution shows that it solves part of the problem. We then proposed further techniques for segmentation enhancement, such as image clipping based on a wrist detection algorithm. The feature extraction component is briefly analyzed to show how it facilitates pattern recognition with an adaptive classifier (a multilayer perceptron, MLP). We then studied the structure of an MLP and its related parameters, and experimented with two activation functions, tangent hyperbolic and sigmoid. Based on the results obtained, we conclude that tangent hyperbolic is the better activation function for the MLP backward propagation process. Finally, we evaluated the MLP constructed in our prototype using both web cam captured samples and the greyscale Triesch dataset, each consisting of 10 classes of static gestures. We ran six sessions of training and nine sessions of evaluation. The experiments showed a satisfactory result (92.97%) using web cam captured samples, though the accuracy on Triesch's dataset is not very promising (50.82%). The analysis suggests that overfitting may be the reason: the MLP was trained excessively on one particular person's hand gestures, which may lead the system to memorize and thus only recognize that particular person's gestures.

6. References
[1] M. Fleck, D. Forsyth, and C. Bregler, "Finding Naked People," European Conference on Computer Vision, Vol. II, 1996, pp. 592-602.
[2] S. H. Leong, "Computer Vision Based Human Computer Interaction Using Video Based Finger Tracking," UTAR Final Year Project Report, 2005.
[3] G. C. Burdea and P. Coiffet, "Gesture Interfaces," in Virtual Reality Technology, Prentice Hall, 2001, pp. 46-54.