Understanding Users Intention: Programming Fine Manipulation Tasks by Demonstration

R. Zöllner, O. Rogalla, R. Dillmann
Universität Karlsruhe (TH), Institute of Computer Design and Fault Tolerance,
P.O. Box 6980, D-76128 Karlsruhe, Germany

M. Zöllner
Forschungszentrum Informatik, Interactive Diagnosis and Service Systems,
D-76131 Karlsruhe, Germany

Abstract

The Programming by Demonstration (PbD) paradigm enables the programming of service robots by inexperienced human users. The main goal of these systems is to allow inexperienced users to easily integrate motion and perception skills or complex problem solving strategies. Unfortunately, current PbD systems deal only with manipulations based on Pick & Place operations, which are not sufficient for complex service tasks. This paper therefore describes how fine manipulations, such as screwing movements, can be recognized by a PbD system. To this end, finger movements and forces on the fingertips are gathered and analyzed while an object is grasped. This requires an extensive sensor setup, such as a data glove with integrated tactile sensors. An overview of the used tactile sensors and the gathered signals is given. Furthermore, the segmentation of the user demonstration and the classification of the recognized Dynamic Grasps are described. For classifying dynamic grasps, a time delay method based on a Support Vector Machine (SVM) is used. Finally, the symbolic representation of service tasks is briefly illustrated.

1 Introduction

Using personal and service robots places high demands on the programming interface. The interaction of these robots with humans and their programming require the development of new techniques that allow untrained users to operate such a personal service robot both safely and efficiently. PbD is one way to meet these high requirements. The aim of PbD is to let arbitrary persons program robots by simply giving a demonstration of how to solve a certain task to a sensor system, and then have the system interpret the observed actions and map them to a specific manipulator. However, detecting and understanding the user's actions and intentions turned out to be a quite difficult task. Learning systems are needed that are capable of extracting knowledge from watching a user's demonstration. Such systems require heterogeneous sensor inputs like vision, tactile or position information. This paper presents an approach for a PbD system which handles more than only Pick & Place operations. In order to detect fine manipulations, a grasp is analyzed with respect to finger movements and forces exerted on the fingertips. Section 2 gives a brief overview of today's PbD techniques and of modeling techniques for human hand and finger motions. Section 3 outlines the PbD system currently running at our institute, the employed sensor devices and the implemented approaches. Section 4 focuses on the segmentation of the user demonstration in order to capture the user's intention; the reliable detection of grasped and un-grasped segments is crucial for analyzing a given demonstration. Section 5 presents and classifies dynamic grasps; the classification of a dynamic grasp is done by a Support Vector Machine using a time delay approach. Finally, the symbolic representation of service tasks is pointed out in section 6.

2 State of the art

The state of the art will be discussed on two topics: PbD systems, and the modeling and interpretation of finger and hand movements while an object is grasped. The recognition and interpretation of continuous human action sequences is critical to PbD. However, there are only few publications regarding sensors including visual processing. Kuniyoshi et al. [14] presented a system with a visual hand-tracker module that is able to detect grips and drops of objects. However, only one type of grasping is classified and the hand is constrained to appear under a certain angle. Kang [12] used a data glove in combination with depth images computed from recorded image sequences for a reconstruction of what has been done. The depth images are obtained by projecting structured light and are therefore subject to real-time constraints. Since elementary operations consist of movements, a lot of effort has been spent on tracking and reconstructing the trajectories of objects [20], of a robot's effector [16] or of the user's hand [17, 8, 18]. In order to recognize grasps in a demonstration, either the contact points between hand and object [13] or the hand posture itself [10] are considered. All these works consider only static grasps. One recent work [22] analyzes coordinated finger movements using the taxonomy proposed by Elliot in [5]. The method is based on detecting synchronous or asynchronous joint movements by processing signals received from a data glove. The approach is restricted to a small set of detectable movements.

3 Experimental Setup

Focusing on service tasks in household and workshop environments, PbD requires information about grasping states, movements, forces and objects. Therefore, we consider combining the results of as many suitable sensor types as possible in order to obtain as much information as possible from a single demonstration.

3.1 Sensors

As sensors for observing a user demonstration of a manipulation task, a VPL data glove, a camera head, a Polhemus magnetic tracker and force sensors, both mounted on the glove, are used. In order to improve the accuracy of the sensors, they are integrated in a fixed rack (see figure 1).

Figure 1: Experimental environment - demonstration rack and data glove with mounted tactile sensors.

3.2 Sensor Fusion

Because of its many degrees of freedom and its changing shape, it is very difficult to extract posture information about a user's hand solely from image sequences. Especially information about its particular grasping state is hard to obtain. Following [19], we consider data gloves to be good sensors for obtaining this kind of information. Besides the hand pose, which is needed for describing grasping or gesture actions, the contact between hand and environment has to be detected in order to obtain the exact timestamps of grasping and un-grasping an object during manipulative tasks. For this purpose we integrated force sensors on each fingertip of the data glove. As described in [23], the FSR force sensors are sensitive enough to detect contacts, and moreover, information about the movement of the grasped object can be derived from them.

Figure 2: Recording sensors: 1. Data Glove with mounted Tactile Sensors. 2. Active Stereo Camera Head.

In order to record a demonstration trajectory, all the VPL data glove sensor data is used, while the measurements of the Polhemus tracker are merged with visual tracking data. The visual tracking follows a marker fixed on the magnetic tracker. The camera head employs three grey-scale Pulnix TM765i cameras and AMTEC turn and tilt modules. For grabbing, a Matrox Genesis frame grabber is used on a standard PC. Additionally, the visual data is used for determining the types and positions of the manipulable objects.

3.3 PbD Approach

According to the PbD cycle presented in [4], we first check for objects present in the scene that the user is about to manipulate. This is done via the camera head using state-of-the-art image processing methods [9, 3]. After reconstructing their positions in the rack, the user's hand is tracked and the trajectory given by the magnetic and visual tracker is recorded. The recorded trajectory is then analyzed, interpreted and mapped to a manipulator (see [2, 6]). So far, only Pick & Place operations were considered. Regarding the analysis of the demonstration, we have shown that a static grasp can be detected and classified according to the Cutkosky hierarchy [1] with high precision and robustness by a neural network classifier [7]. We used this information, combined with movement speed considerations, to determine grasp events and movements. The next section shows how this segmentation step is extended by using tactile sensors.
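To make the order of these processing stages explicit, the following minimal sketch mirrors the cycle described above (object detection, trajectory recording, segmentation, mapping). All function bodies are placeholder stubs and all names are illustrative assumptions, not the actual implementation of the system.

```python
# Hypothetical sketch of the PbD cycle: observe -> segment -> interpret -> map.
from typing import Dict, List

def detect_objects(image) -> List[Dict]:
    # Stand-in for the vision-based object recognition [9, 3].
    return [{"type": "cup", "pose": (0.10, 0.25, 0.0)}]

def record_trajectory(n_samples: int) -> List[Dict]:
    # Stand-in for the merged magnetic/visual hand tracking plus glove data.
    return [{"t": 0.02 * i, "hand_pose": (0.0, 0.0, 0.01 * i)} for i in range(n_samples)]

def segment(trajectory: List[Dict]) -> List[str]:
    # Stand-in for the segmentation into movements and grasp phases (section 4).
    return ["move", "grasp", "move", "ungrasp"]

def map_to_manipulator(segments: List[str], objects: List[Dict]) -> List[str]:
    # Stand-in for mapping the interpreted task onto robot commands.
    return [f"{action}({objects[0]['type']})" for action in segments]

if __name__ == "__main__":
    objects = detect_objects(image=None)
    trajectory = record_trajectory(100)
    print(map_to_manipulator(segment(trajectory), objects))
```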

4 Segmentation of User Demonstration

In order to understand the user's intention, the performed demonstration has to be segmented and analyzed. In a first step, the sensor data is preprocessed to extract reliable measurements and key points, which are then used in a second step to segment the demonstration. The following two sections describe these steps.

4.1 Signal Processing

Figure 3 shows the preprocessing steps performed for the segmentation only; for detecting gesture actions, additional fusion methods are applied, as shown in [15]. The input signals ji, hi, fi are gathered from the data glove, the magnetic tracker and the tactile force sensors. All signals are filtered, i.e. smoothed and high-pass filtered, in order to eliminate outliers. Next, the joint angles are normalized and passed via a rule-based switch to a static (SGC) and a dynamic (DGC) grasp classifier. The tracker values, representing the absolute position of the hand, are differentiated in order to obtain the velocity of the hand movement. As mentioned in [23], the tactile sensors have a 10-20 % hysteresis, which is compensated by the function H(x). The normalized force values, together with the velocity of the hand, are passed to the module R, which contains a rule set that determines whether an object is potentially grasped and triggers the SGC and DGC. The output of these classifiers are grasp types according to the Cutkosky hierarchy (static grasps) or dynamic grasps according to the hierarchy presented in section 5.


Figure 3: Sensor Preprocessing and Fusion
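As a rough illustration of the chain in figure 3, the following hedged sketch implements plausible stand-ins for the individual blocks. The filter choices, the hysteresis model H(x) and all thresholds are assumptions for illustration; only the structure (filtering, normalization, differentiation, rule module R triggering SGC/DGC) follows the description above.

```python
import numpy as np

def smooth(signal: np.ndarray, width: int = 5) -> np.ndarray:
    """Moving-average smoothing to suppress outliers (assumed filter)."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")

def normalise_joints(joints: np.ndarray) -> np.ndarray:
    """Scale the raw glove joint angles to [0, 1] per joint (assumed calibration)."""
    lo, hi = joints.min(axis=0), joints.max(axis=0)
    return (joints - lo) / np.maximum(hi - lo, 1e-6)

def hand_speed(positions: np.ndarray, dt: float = 0.02) -> np.ndarray:
    """Differentiate the tracker positions to obtain the hand velocity magnitude."""
    return np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt

def compensate_hysteresis(forces: np.ndarray, factor: float = 0.15) -> np.ndarray:
    """Crude stand-in for H(x): shrink the 10-20 % hysteresis band of the sensors."""
    rising = np.maximum(np.diff(forces, prepend=forces[0]), 0.0)
    return forces - factor * rising

def rule_module_R(force: float, joint_change: float,
                  f_thresh: float = 0.2, j_thresh: float = 0.02) -> str:
    """Illustrative rule set: an object counts as grasped when the fingertip force
    exceeds a threshold; changing joint angles while grasped suggest a dynamic
    grasp (trigger the DGC), otherwise a static grasp (trigger the SGC)."""
    if force < f_thresh:
        return "none"
    return "DGC" if joint_change > j_thresh else "SGC"
```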

4.2 Segmentation step

The segmentation of a recorded demonstration is performed in two steps:

1. Trajectory segmentation: This step segments the trajectory of the hand during the manipulation task. The segmentation is done by detecting grasp actions, so the time of contact between hand and object has to be determined. This is done by analyzing the force values with a threshold-based algorithm (a sketch follows this list). To improve the reliability of the system, the results are fused with a second algorithm based on the analysis of the finger pose, velocity and acceleration trajectories with respect to their minima. Figure 4 shows the trajectories of the force values, the finger joint values and the velocity for three Pick & Place actions.

2. Grasp segmentation: For detecting fine manipulations, the actions performed while an object is grasped have to be segmented and analyzed. The upper part of figure 4 shows that the shape of the force graph features a relatively constant plateau. Since no external forces are applied to the object, this effect is plausible. If, however, the grasped object collides with the environment, the force profile will change: the result are high peaks, i.e. both amplitude and frequency oscillate, as shown in the lower part of figure 4.
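The following is a minimal sketch of the threshold-based contact detection and its fusion with velocity minima, as referenced in step 1 above. The hysteresis thresholds, window size and tolerance are assumed values, not those of the actual system.

```python
import numpy as np

def contact_events(forces: np.ndarray, f_on: float = 0.25, f_off: float = 0.15):
    """Detect grasp/ungrasp time indices from the summed fingertip forces
    using a simple hysteresis threshold (assumed values)."""
    grasped, grasps, releases = False, [], []
    for t, f in enumerate(forces):
        if not grasped and f > f_on:
            grasped = True
            grasps.append(t)
        elif grasped and f < f_off:
            grasped = False
            releases.append(t)
    return grasps, releases

def velocity_minima(speed: np.ndarray, window: int = 5):
    """Indices where the hand speed has a local minimum (candidate key points)."""
    return [t for t in range(window, len(speed) - window)
            if speed[t] == speed[t - window:t + window + 1].min()]

def fuse(force_events, speed_minima, tol: int = 10):
    """Keep only force-based events that are confirmed by a nearby speed minimum."""
    return [t for t in force_events
            if any(abs(t - m) <= tol for m in speed_minima)]
```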

Figure 4: Analyzing segments of a demonstration: force values and finger joint velocity (trajectory segmentation and grasp segmentation; static grasp, external force and dynamic grasp profiles).

Looking at the force values during a grasp, three different profiles can be distinguished:

• Static grasp: The gathered force values are nearly constant. The force profile shows characteristic plateaus, whose height reflects the weight of the grasped object.

• External forces: The force graph of this class shows high peaks. Because of the hysteresis of the sensors, no quantitative statement about the applied forces can be made. A proper analysis of external forces applied to a grasped object will be the subject of future work.

• Dynamic grasps: During a dynamic grasp, both amplitude and frequency oscillate moderately as a result of the finger movements performed by the user.

The result of the segmentation step is a sequence of elemental actions such as movements, static grasps and dynamic grasps.
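To make the distinction between these three profiles concrete, the following hedged heuristic labels a grasp segment from simple statistics of its force signal. The statistics and thresholds are assumptions for illustration only.

```python
import numpy as np

def classify_force_profile(forces: np.ndarray,
                           peak_factor: float = 3.0,
                           osc_thresh: float = 0.05) -> str:
    """Label a grasp segment as 'static', 'external_force' or 'dynamic'
    (assumed heuristic with illustrative thresholds)."""
    baseline = np.median(forces)
    deviation = forces - baseline
    spread = np.std(deviation) + 1e-6
    if np.max(np.abs(deviation)) > peak_factor * spread:
        return "external_force"   # isolated high peaks dominate the profile
    if spread > osc_thresh * (abs(baseline) + 1e-6):
        return "dynamic"          # moderate oscillation from finger movements
    return "static"               # nearly constant plateau
```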

5 Dynamic Grasps Classification

For describing various household activities, like opening a twist cap or screwing a bolt into a nut, simple operations like Pick & Place are inadequate. Therefore, new elemental operations like Dynamic Grasps need to be detected and represented in order to program such tasks with the PbD paradigm.

5.1 Dynamic Grasps

With Dynamic Grasps we denote operations like screwing, inserting etc., which all have in common that the finger joints change while an object is grasped (i.e. the force sensors provide non-zero values). For classifying dynamic grasps, we choose the movement of the grasped object as the distinction criterion. This allows an intuitive description of the user's intention when performing fine manipulative tasks. To describe them, the movements of the grasped object are transformed into hand coordinates. Figure 5 shows the axes of the hand according to the taxonomy of Elliot et al. (refer to [5]). Rotations and translations along the principal axes are regarded. Some restrictions of the human hand, such as the limited rotation around the y-axis, are taken into account.
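As a hedged illustration of this criterion, the sketch below expresses the grasped object's displacement in hand coordinates and names the dominant hand axis; rotations can be treated analogously using relative rotation matrices. The pose conventions and function names are assumptions, not the system's actual implementation.

```python
import numpy as np

def motion_in_hand_frame(R_hand: np.ndarray, t_hand: np.ndarray,
                         p_obj_before: np.ndarray, p_obj_after: np.ndarray) -> np.ndarray:
    """Transform two world-frame object positions into the hand frame and
    return the displacement expressed along the hand's x/y/z axes."""
    to_hand = lambda p: R_hand.T @ (p - t_hand)
    return to_hand(p_obj_after) - to_hand(p_obj_before)

def dominant_axis(delta: np.ndarray) -> str:
    """Name the hand axis with the largest displacement component."""
    return "xyz"[int(np.argmax(np.abs(delta)))]

# toy example: hand frame rotated 90 degrees about z, object shifted along world x
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
delta = motion_in_hand_frame(R, np.zeros(3), np.array([0.0, 0.0, 0.0]), np.array([0.05, 0.0, 0.0]))
print(dominant_axis(delta))   # -> 'y' in hand coordinates
```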


Figure 5: Principal axes of the human hand (refer to [5]).

Figure 6 shows the distinguished dynamic grasps. The classification is done separately for rotations and translations of the grasped object. Furthermore, the number of fingers involved in the grasp (i.e. from 2 to 5) is considered. Several precision tasks are performed only with thumb and index finger; opening a bottle, for example, is classified as a rotation around the x-axis. Other grasps require higher forces and involve all fingers in the manipulation, for example Full Roll for screwing actions. The presented classification contains most of the common fine manipulations if we assume that three- and four-finger manipulations are included in the five-finger classes. For example, a Rock Full dynamic grasp can be performed with three, four or five fingers. The next section gives a brief overview of SVMs and presents the experimental results.

Figure 6: Hierarchy of Dynamic Grasps containing the results of the DGC. The hierarchy distinguishes rotations (e.g. Roll Index, Roll Full, Roll Thumb, Roll Palm, Rock Index, Rock Thumb, Rock Full, Rock Radial) and translations (e.g. Shift Index, Shift Palm, Shift Full, Pinch Index, Pinch Full) along the hand axes; the reported per-class recognition rates lie between 88 % and 98 %.

5.2 Support Vector Machine Classifier

Support vector machines are a general class of statistical learning architectures which combine a profound theoretical foundation with excellent empirical performance in a variety of applications. Originally developed for pattern recognition, SVMs justify their application by a large number of positive qualities, such as fast learning and accurate classification combined with high generalization performance. The basic training principle behind the SVM is to find the optimal class-separating hyperplane such that the expected classification error for unseen examples is minimized. Using the kernel trick and the implicit transformation into a high dimensional working space leads to a nonlinear separation of the feature space. The decision function becomes a linear combination of kernels over the training data: $f(x) = \sum_j \alpha_j y_j K(x, x_j) + b$, where $x_j$ are the training vectors with their corresponding labels $y_j$, and $\alpha_j$ are the Lagrange multipliers. When performing the Lagrange optimization to find the optimal separating hyperplane, only a small set of multipliers $\alpha_j$ turn out to be nonzero. The corresponding data points are the so-called support vectors [21].
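The following small numerical example evaluates this decision function with a Gaussian (RBF) kernel. The support vectors, multipliers and bias are made-up toy values for illustration, not values learned by the classifier described in this paper.

```python
import numpy as np

def rbf_kernel(x, xj, gamma=0.5):
    """Gaussian kernel K(x, x_j) = exp(-gamma * ||x - x_j||^2)."""
    return np.exp(-gamma * np.sum((x - xj) ** 2))

def decision_function(x, support_vectors, labels, alphas, b, gamma=0.5):
    """f(x) = sum_j alpha_j * y_j * K(x, x_j) + b"""
    return sum(a * y * rbf_kernel(x, xj, gamma)
               for a, y, xj in zip(alphas, labels, support_vectors)) + b

# toy example with three support vectors in 2-D
sv = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([1.0, 0.0])]
y = [+1, -1, -1]
alpha = [0.8, 0.5, 0.3]
print(np.sign(decision_function(np.array([0.2, 0.1]), sv, y, alpha, b=0.1)))
```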

5.3 Experimental Results

For training the SVM, Gaussian kernel functions, an algorithm based on SVMLight [11] and the one-against-one strategy have been used. Twenty-six classes corresponding to the elementary dynamic grasps presented in figure 6 were trained. Because a dynamic grasp is defined by a progression of joint values, a time delay approach was chosen: the input vector of the SVM classifier comprises 50 joint configurations of 20 joint values each. The training data set contained 2600 input vectors. The fact that SVMs can learn from significantly less data than neural networks supports this approach. Figure 6 shows the results of the DGC. Since the figure shows only the right resp. forward direction, the displayed percentages represent the average over the two directions; the maximum variance between these two directions is about 2 %. Remarkable is the fact that the SVM needs only 486 support vectors (SVs) to generalize over the 2600 vectors, i.e. 18.7 % of the data set. A small number of SVs improves not only the generalization behavior but also the runtime of the resulting algorithm during application. The presented results are very encouraging, but some restricting remarks have to be made. The data set was recorded by a single user; even though the joint data is normalized, some user-specific variance surely exists. Therefore we tested the DGC with data from a second user: a sample of ten instances of eight elemental dynamic grasps (i.e. 80 grasps) was evaluated, and the maximum variance was less than or equal to 3 %.
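The sketch below illustrates the time delay input construction: a window of 50 joint configurations with 20 joint values each is flattened into a 1000-dimensional vector. scikit-learn's SVC (RBF kernel, one-against-one) is used here only as a stand-in for the SVMLight-based trainer of the paper, and the synthetic joint streams are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

WINDOW, N_JOINTS = 50, 20   # 50 joint configurations of 20 joint values each

def time_delay_vector(joint_stream: np.ndarray, t: int) -> np.ndarray:
    """Flatten the WINDOW joint configurations ending at time t (shape: 1000)."""
    return joint_stream[t - WINDOW:t].reshape(-1)

# two synthetic "dynamic grasp" classes, each given as a stream of joint configurations
rng = np.random.default_rng(0)
streams = {0: rng.normal(0.0, 0.1, (200, N_JOINTS)),
           1: rng.normal(1.0, 0.1, (200, N_JOINTS))}

X = np.stack([time_delay_vector(s, t)
              for s in streams.values() for t in range(WINDOW, 150)])
y = np.array([label for label in streams for _ in range(WINDOW, 150)])

clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
print("support vectors:", clf.n_support_.sum())
```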

6 Symbolical Task Representation

For the symbolic representation of fine manipulations, we enhanced our representation for Pick & Place (refer to [2, 6]). A task is represented as a sequence of elemental operators Oi in a STRIPS-like notation. Each operator has a relational precondition Vi and a postcondition (contribution) Ci. Elemental operations are various types of movements and grasps. To fit this representation, elementary dynamic grasps can be combined into a complex grasp. For example, during a screw operation a translatory and a rotatory component have to be performed. Therefore, the implementation of the DGC allows multiple results, so that a serial representation of fine manipulations is possible.
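A hedged sketch of such a STRIPS-like representation is given below. The field names, the concrete relational conditions and the elemental grasp names are illustrative assumptions, not the notation used in the actual system.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Operator:
    name: str
    preconditions: Set[str]            # V_i: relational facts that must hold
    postconditions: Set[str]           # C_i: facts contributed by the operator
    sub_grasps: List[str] = field(default_factory=list)   # elemental dynamic grasps

# a screw operation combines a rotatory and a translatory dynamic grasp (assumed names)
screw = Operator(
    name="screw_cap",
    preconditions={"grasped(cap)", "aligned(cap, bottle)"},
    postconditions={"fastened(cap, bottle)"},
    sub_grasps=["roll_full", "shift_palm"],
)

task = [
    Operator("pick", {"free(hand)", "at(cap, table)"}, {"grasped(cap)"}),
    Operator("place", {"grasped(cap)"}, {"aligned(cap, bottle)"}),
    screw,
]
print([op.name for op in task])
```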

7 Conclusion

This paper has shown how a PbD system handling Pick & Place manipulations can be extended to detect fine manipulation tasks by capturing Dynamic Grasps. In this context, tactile sensors were mounted in a data glove in order to improve the reliability of segmenting the user demonstration into movements and grasps. Furthermore, it was shown how a grasp segment can be divided into static and dynamic grasps by analyzing the force signals. Finally, a new time delay approach based on a Support Vector Machine was realized in order to classify Dynamic Grasps.

ACKNOWLEDGMENT

This work has been supported by the BMBF project "Morpha".

References

[1] M. R. Cutkosky. On grasp choice, grasp models, and the design of hands for manufacturing tasks. IEEE Transactions on Robotics and Automation, 5(3):269–279, 1989.

[2] R. Dillmann, O. Rogalla, M. Ehrenmann, R. Zöllner, and M. Bordegoni. Learning robot behaviour and skills based on human demonstration and advice: the machine learning paradigm. In 9th International Symposium of Robotics Research (ISRR 99), pages 229–238, Snowbird, Utah, USA, October 1999.

[3] M. Ehrenmann, D. Ambela, P. Steinhaus, and R. Dillmann. A comparison of four fast vision based object recognition methods for programming by demonstration applications. In Proceedings of the 2000 International Conference on Robotics and Automation (ICRA), volume 1, pages 1862–1867, San Francisco, California, USA, April 2000.

[4] M. Ehrenmann, P. Steinhaus, and R. Dillmann. A multisensor system for observation of user actions in programming by demonstration. In Proceedings of the IEEE International Conference on Multi Sensor Fusion and Integration (MFI), volume 1, pages 153–158, Taipei, Taiwan, August 1999.

[5] J. M. Elliot and K. J. Connolly. A classification of hand movements. Developmental Medicine and Child Neurology, 26:283–296, 1984.

[6] H. Friedrich. Interaktive Programmierung von Manipulationssequenzen. PhD thesis, Universität Karlsruhe, 1998.

[7] H. Friedrich, V. Grossmann, M. Ehrenmann, O. Rogalla, R. Zöllner, and R. Dillmann. Towards cognitive elementary operators: grasp classification using neural network classifiers. In Proceedings of the IASTED International Conference on Intelligent Systems and Control (ISC), volume 1, Santa Barbara, California, USA, October 1999.

[8] D. Gavrila and L. Davis. Towards 3d model-based tracking and recognition of human movement: a multi-view approach. In International Workshop on Face and Gesture Recognition, Zürich, 1995.

[9] J. González-Linares, N. Guil, P. Pérez, M. Ehrenmann, and R. Dillmann. An efficient image processing algorithm for high-level skill acquisition. In Proc. of the International Symposium on Assembly and Task Planning (ISATP), pages 262–267, Porto, Portugal, July 1999.

[10] H. Hashimoto and M. Buss. Skill acquisition for the intelligent assisting system using virtual reality simulator. In Proceedings of the 2nd International Conference on Artificial Reality and Tele-existence, Tokyo, 1992.

[11] T. Joachims. Making large scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 169–184, 1999.

[12] S. Kang. Robot Instruction by Human Demonstration. PhD thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1994.

[13] S. Kang and K. Ikeuchi. Toward automatic robot instruction from perception: Mapping human grasps to manipulator grasps. Robotics and Automation, 13(1):81–95, February 1997.

[14] Y. Kuniyoshi, M. Inaba, and H. Inoue. Learning by watching: Extracting reusable task knowledge from visual observation of human performance. IEEE Transactions on Robotics and Automation, 10, 1994.

[15] M. Ehrenmann, S. Knoop, R. Zöllner, and R. Dillmann. Multi sensor fusion approaches for programming by demonstration. In International Conference on Multi Sensor Fusion and Integration for Intelligent Systems (MFI), 2001.

[16] M. Päschke and J. Pauli. Vision based learning of gripper trajectories for a robot arm. In International Symposium on Automotive Technology and Automation (ISATA), pages 235–242, Florence, 1997.

[17] J. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: an application to human hand tracking. In ECCV, pages 35–46, 1994.

[18] N. Shimada and Y. Shirai. 3d hand pose estimation and shape model refinement from a monocular image sequence. In Proceedings of the VSMM, pages 423–428, Gifu, 1996.

[19] D. Sturman and D. Zeltzer. A survey on glove-based input. IEEE Computer Graphics and Applications, 14(1):30–39, 1994.

[20] A. Ude. Rekonstruktion von Trajektorien aus Stereobildfolgen für die Programmierung von Roboterbahnen. PhD thesis, Universität Karlsruhe, 1996. Published in: VDI Verlag, Fortschr.-Ber. VDI Reihe 10 Nr. 448, Düsseldorf.

[21] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.

[22] M. Zacksenhouse and P. Marcovici. Inherent structure of manipulative hand movements and its discriminative power. In International Conference on Intelligent Robots and Systems, 2000.

[23] R. Zöllner, O. Rogalla, and R. Dillmann. Integration of tactile sensors in a programming by demonstration system. In International Conference on Robotics and Automation (ICRA), 2001.