Interactive robot
Robot acting on human intentions

Trym Bremnes

Thesis submitted for the degree of
Master in Informatics: Nanoelectronics and Robotics
60 credits

Department of Informatics
Faculty of Mathematics and Natural Sciences

UNIVERSITY OF OSLO

May 2017

Interactive robot

Robot acting on human intentions

Trym Bremnes

© 2017 Trym Bremnes

Interactive robot

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

Abstract

As the world's population ages, demand for health services continues to increase, and robotics can be part of the solution to this problem. One particular area of medicine, ultrasound, can be helped by robotics, especially for doctors who would otherwise suffer long-term health implications. The primary purpose of this study is to construct a system designed to be useful and low cost in the context of an ultrasound procedure. An equally important goal is to determine whether the proposed solution, which is based on what the patient feels, works compared to a more traditional Human-Computer Interaction (HCI) approach. In general, the functions introduced here can be used at the start of, during and after an ultrasound diagnosis procedure. As for the approach itself, the system developed provides a simulation of the ultrasound robot and the patient, collision detection between them, and emotion detection of the patient's facial expression. The system consists of different computer vision-based methods performing in real-time. The experiments performed investigated the different components (communication, collision detection and emotion detection) before finally putting them all together and comparing this solution to the HCI approach. In the end, because of robot limitations, no verification of the system as a whole was completed.


Contents

Abstract   I
Preface   XI

1 Introduction   1
  1.1 Outline of the thesis   4

2 Background   5
  2.1 Representation of emotions   5
    2.1.1 Basic emotion   6
    2.1.2 Appraisal theory   6
    2.1.3 Psychological constructionist model   7
    2.1.4 Social constructionist model   9
  2.2 Computational emotion recognition   9
    2.2.1 Facial Action Coding System   9
    2.2.2 Machine learning   9
  2.3 User and patient interaction   11
    2.3.1 The feel dimension   11
    2.3.2 Relation to medicine   12
  2.4 3D camera technology   12
    2.4.1 Structured light   14
    2.4.2 Time-of-flight   15
    2.4.3 Stereo vision   16
  2.5 Communication background   16
    2.5.1 Communication protocols   16
    2.5.2 Tele-operated restrictions   17
  2.6 Related work   17
    2.6.1 Ultrasound Robotic System using UR5   18
    2.6.2 Emotion Interaction System for Service Robot   18
    2.6.3 Robot system for medical ultrasound   18
    2.6.4 The role of Emotion and Enjoyment for QoE   19

3 Interactive Ultrasound robot   20
  3.1 Overview of solution   20
  3.2 3D Camera   23
    3.2.1 Camera requirements   23
    3.2.2 Kinect overview   23
    3.2.3 Skeletal tracking   24
  3.3 Collision detection   28
    3.3.1 Abstract objective   28
    3.3.2 Specific solution   29
  3.4 Emotion detection   31
    3.4.1 Abstract objective   31
    3.4.2 Specific solution   31
  3.5 Robot   33
    3.5.1 Task   33
    3.5.2 Specific solution   34
  3.6 Implementation   35
    3.6.1 Overview   35
    3.6.2 3D Vision Server: Javascript-simulator and MainServer   39
    3.6.3 Cameras   47
    3.6.4 Emotion API   48
    3.6.5 Robot   49
    3.6.6 Python script   54

4 Experiments   56
  4.1 Cloud service delay test   57
    4.1.1 Setup   57
    4.1.2 Testing   58
    4.1.3 Results   58
    4.1.4 Conclusion   58
  4.2 Communication test   60
    4.2.1 Setup   60
    4.2.2 Testing   61
    4.2.3 Results   63
    4.2.4 Conclusion   63
  4.3 Collision detection experiment   64
    4.3.1 Setup   64
    4.3.2 Testing   64
    4.3.3 Results   68
    4.3.4 Conclusion   68
  4.4 Emotion detection experiment   70
    4.4.1 Setup   70
    4.4.2 Testing   73
    4.4.3 Results   75
    4.4.4 Conclusion   76
  4.5 Final experiment   77
    4.5.1 Setup   77
    4.5.2 Pilot test   83
    4.5.3 Main test   86
    4.5.4 Results   88
    4.5.5 Conclusion   88

5 Discussion   89
  5.1 Choice of networking protocol   89
  5.2 Network delay   90
  5.3 Emotion representation   91
  5.4 Emotion detection   91
  5.5 Patient interaction   93
  5.6 Camera performance   94
  5.7 Robot evaluation   96
    5.7.1 Safety   96
    5.7.2 Design   97
    5.7.3 Limitations and alterations   97
  5.8 Implementation drawbacks, advantages and potential corrections   98
    5.8.1 Advantages   98
    5.8.2 Drawbacks and corrections   99
    5.8.3 Reflection on development   100

6 Conclusion   102
  6.1 Final remarks   102
  6.2 Future Work   103
    6.2.1 Modify and verify current solution   103
    6.2.2 Virtual and augmented reality   103
    6.2.3 Cloud integration   103
    6.2.4 Ultrasound system without operator   104

A Image of setup, backside   105

B Kinect v1 specifications   106

C Robot specifications   107

D Flowcharts   109

E Cloud delay   112

F Hardware and frameworks used by the experiments   113
  F.1 Cloud delay experiment   113
    F.1.1 Hardware   113
    F.1.2 Frameworks   114
  F.2 Communication experiment   115
    F.2.1 Hardware   115
    F.2.2 Frameworks   116
  F.3 Collision experiment   117
    F.3.1 Hardware   117
    F.3.2 Frameworks   118
  F.4 Emotion experiment   119
    F.4.1 Hardware   119
    F.4.2 Frameworks   120
  F.5 Final experiment   121
    F.5.1 Hardware   121
    F.5.2 Frameworks   122

G Final experiment data   123
  G.1 Pilot test   123
    G.1.1 Subject 1   124
    G.1.2 Subject 2   125
    G.1.3 Subject 3   127
    G.1.4 Subject 4   129
    G.1.5 Subject 5   131
  G.2 Main test   133
    G.2.1 Questionnaire Norwegian   134
    G.2.2 Questionnaire English   136

H Kinect v2 specifications   138

I Software overview   139
  I.1 Overview of software   139
  I.2 Javascript simulator code   141
    I.2.1 init ()   141
    I.2.2 init human () and init robot ()   141
    I.2.3 update human (*positions) and update robot (*positions)   141
    I.2.4 calculate joint positions (joint rotations array)   141
    I.2.5 handleMessage (header, payload)   142
    I.2.6 saveAsImage temp () and saveAllImages ()   142
    I.2.7 change robot speed no saving (emotion payload) and emotion response (emotion)   142
    I.2.8 splitKinectData (payload data) and splitRobotData (payload data robot)   142
    I.2.9 Create sphere (start point)   142
    I.2.10 create box (*args)   142
    I.2.11 update box (new start point, new end point, name)   142
    I.2.12 detect collision (object1 name, other objects names array)   142
  I.3 Main server code   143
    I.3.1 MainWindow.xaml   143
    I.3.2 MainWindow.xaml.cs   143
    I.3.3 websocket con.cs   144
    I.3.4 emotion detection.cs   145
  I.4 Python script code   146
  I.5 Robot code   147
  I.6 Cloud delay experiment   148

Bibliography   161

List of Figures

1.1 Haptic robot operator scenario   2
1.2 Overview of context   3

2.1 Basic emotions   6
2.2 Appraisal emotions   7
2.3 Russell's emotional construction figure   8
2.4 Facial expression classifier structure   11
2.5 Distance measurement methods   13
2.6 Structured light illustrated   14
2.7 Time-of-flight illustrated   15
2.8 Time-of-flight phase angle   16
2.9 Emotion Interaction System robot   19

3.1 Overview of setup   21
3.2 Physical solution   22
3.3 Kinect's sensors   24
3.4 Images obtained through Kinect   25
3.5 Kinect's "speckle" pattern   26
3.6 Kinect skeleton   27
3.7 Raycasting illustrated   29
3.8 Universal Robots UR5   34
3.9 Overview of implementation   36
3.10 Top-down overview of scenario   38
3.11 Javascript-simulator's implementation overview   39
3.12 Robot rotation calculations illustrated   42
3.13 MainServer's implementation overview   45
3.14 Picture of cameras   47
3.15 Overview of the camera solution   48
3.16 Emotion API overview   49
3.17 UR5 native interface   50
3.18 Overview of Robot solution   51
3.19 Robot with probe   52
3.20 Acceleration and maximum velocity illustration   53
3.21 Overview of Python script in the solution   54

4.1 Setup: cloud service delay test   57
4.2 Cloud delay sequence   58
4.3 Setup: Communication with UR Simulator/UR robot   60
4.4 Network delay measured   62
4.5 Setup: Collision detection experiment   65
4.6 Collision test   67
4.7 DH parameters illustrated   69
4.8 Setup: Emotion detection experiment   71
4.9 Emotion detection sequence with local face detection   73
4.10 Emotions on a distance   74
4.11 Emotion detection sequence   75
4.12 Emotion detection sample   76
4.13 Setup: Final experiment   78
4.14 Emotion detection output   80
4.15 Javascript-simulator output   80
4.16 Kinect's resolution per centimeter   82
4.17 Subject 5 fourth image   84

5.1 Kinect for Xbox One   95

A.1 Appendix: Image of setup, backside   105

D.1 Appendix: Initialization sequence   109
D.2 Appendix: Input to display sequence   110
D.3 Appendix: Collision detection sequence   111

I.1 The structure of the additional material on the DVD   140

List of Tables

3.1 Data transition table   37
3.2 Robot velocity levels   44
3.3 Table of used joints on patients   46

4.1 Measured delay of cloud service   59
4.2 Subjects' emotion result table   88

5.1 Table of research questions   89

E.1 Appendix: Cloud delay complete results   112

F.1 Appendix: Cloud experiment: Hardware table 1   113
F.2 Appendix: Cloud experiment: Hardware table 2   113
F.3 Appendix: Cloud experiment: Framework table   114
F.4 Appendix: Communication experiment: Hardware table   115
F.5 Appendix: Communication experiment: Framework table   116
F.6 Appendix: Collision experiment: Hardware table 1   117
F.7 Appendix: Collision experiment: Hardware table 2   117
F.8 Appendix: Collision experiment: Framework table   118
F.9 Appendix: Emotion experiment: Hardware table   119
F.10 Appendix: Emotion experiment: Framework table   120
F.11 Appendix: Final experiment: Hardware table   121
F.12 Appendix: Final experiment: Framework table   122

Listings

3.1 Javascript-simulator's render loop   39
3.2 Collision detection script   43
3.3 Message formats   46
3.4 Emotion API call   49

Preface

This thesis would not have been completed without the help of many good people. I would like to thank my supervisors Egil Utheim, Idar Dyrdal and Ole Jacob Elle for their help with this project. Egil for his fantastic help with the development of the project and the countless hours we spent on it. Idar for answering all of my annoying questions in a calm and professional manner, as well as for all the times he read through what I wrote. Ole Jacob gave me the overall structure of this thesis and helped with the administrative tasks associated with this whole endeavor.
Dennis Warnaar read through the thesis line by line for hours on end. Without him this work would have been many orders of magnitude worse. I hope I can return the favor one day. Dr. Wolfgang Leister at the Norwegian Computing Center pointed out a mistake in one of the experiments, which I am grateful for. He also looked through my thesis on multiple occasions and gave me valuable feedback. Jonas Nævra at Mektron AS helped me in my robotic control endeavor, and without his help I would have struggled a lot more than I had to. Agathon Maximillian Skei-Hart deserves my thanks for looking through the cognitive background. I appreciate it. I would also like to thank Knut Øvsthus at Bergen University College for his feedback, and Dag Langmyhr for his help with fixing the metadata for the PDF. And finally I would like to thank my brother, Gard Bremnes, for attempting to help fix said experiment right before the deadline.


Chapter 1

Introduction

With a rapidly aging population and greater lifespans, many western countries are facing greater demand on health care services. Consequently, in order to mitigate the negative effects of reduced care per patient, and to reduce economic and societal costs, new solutions need to be introduced. Robotics in particular might help to reduce the workload of medical personnel while increasing productivity, as can be seen in the industrial sector [38].
One important, frequently used diagnostic tool is ultrasonography, or ultrasound. Manually performing an ultrasound diagnostic procedure is tiring, and the operators may suffer musculoskeletal damage [37]. This happens because of the strain of holding an ultrasound probe with high pressure against the patient, while simultaneously monitoring the patient's organs through a display in a, more often than not, awkward position. By removing the operator from this difficult scenario and replacing him or her with a robot, which the operator can control from a comfortable position, the operator can examine the patient without strain. Focusing on the task at hand instead of holding a rather rigid pose, the operator is free to inspect the patient more effectively.
Such a system has been researched by Mathiassen et al. [62], and the solution he created relied on haptic feedback to give the operator a sense of "touch". Figure 1.1 shows the system he created, which is the immediate context for this project. Not taking into account any of the flaws of that system, a few problems remained unresolved. In particular, how would the robot act prior to, during and after the operation, and how would this affect the experience as a whole for the patient? From a traditional HCI view, the solution would likely be a simple move-to-target-position-and-back-again problem, without considering the patient's emotions or concerns. Therefore a new approach to interacting with humans needed to be deployed. Inspiration for this work came from the relatively recent research by A. Larssen [54], who investigated alternatives to HCI and invented the concept of the feel dimension.

Figure 1.1: An operator controlling a robot remotely through a haptic feedback device.

The problem this thesis helps to solve is the development of a low-cost, semi-autonomous robot that can assist medical personnel performing ultrasonography, or ultrasound, on a patient, specifically by focusing on the interaction between robot and patient. The solution presented in this thesis is thought of as being generally applicable to the whole diagnostic operation, seen in Figure 1.2. The specific part that is addressed here is the start-up sequence of the operation. This has been done in conjunction with the feel dimension, in contrast to traditional HCI.
The main challenges are to detect the patient around the robot, to make the robot react to the patient in a satisfactory manner, and to use the feel dimension to give the patient an experience unlike traditional HCI. Two approaches, the HCI approach and a new approach, will be compared to see if the patient's perception of the procedure is improved by adding the feel dimension to the experience.
Specifically, this project demonstrates how collision detection and emotion detection can be used within the context of an ultrasound diagnosis procedure. The reasoning behind researching such a procedure can be found by looking at other procedures, e.g. MRI and CT, that have to a very high degree been automated [116].

Figure 1.2: Overview of context. The ultrasound part in the middle has already been created, though the start and shutdown sequences have not.


1.1 Outline of the thesis

This thesis is divided into chapters covering the different parts of the project. This chapter describes the problems this thesis is trying to solve. Chapter 2 covers the background this project relies upon, in addition to presenting related projects. Chapter 3 contains the different concepts and the particular solutions applied to achieve this project's goals, as well as the implementation. In Chapter 4 the different development tasks and studies are presented, leading to the discussion in Chapter 5, before concluding in Chapter 6, which also includes a few future problems that were left unsolved. Appendices and bibliography can be found at the end of the document.


Chapter 2

Background

In this chapter relevant prior research is presented, though the subjects covered here are in no way exhaustive. In order of appearance, they are: representation of emotions (2.1), computational emotion recognition (2.2), user and patient interaction (2.3), 3D camera technology (2.4), communication background (2.5) and related work (2.6). Representation of emotions presents a few fundamental approaches to working with emotions, while computational emotion recognition discusses ways of using one of these representations computationally. The next section, user and patient interaction, covers the subjective feelings that arise while interacting with agents, be they human or robot, and how to handle them. Recent principles for creating three-dimensional images can be found in the 3D camera technology section. Second to last, digital communication is presented in the communication background section, while the honor of being last goes to related work, which briefly presents other projects related to the goal of this thesis.

2.1 Representation of emotions

Emotions are "probably one of the least understood aspects of the human experience" [84], and this section should not be read as an exhaustive list of all possible representations of emotions. Emotions are usually evaluated automatically by humans, but computers and robots depend on highly advanced methods to do the same evaluation. However, even basic emotions like "happy", "sad", "afraid" etc. can potentially be a big advantage in robot-human interactions. Therefore it is reasonable to focus here on the most basic human emotions.
It is easy for most humans to detect and evaluate emotions. Whereas a smile can indicate that a person is happy, it could with some very subtle changes signify schadenfreude. Robots and intelligent systems do not have an inherent ability to detect human emotions, and therefore crave definitions and patterns so that they may evaluate emotions. This is precisely what is discussed in this section.
Studying emotions depends very much on the approach to emotions. Charles Darwin started the emotion debate back in 1872 with his work The Expression of the Emotions in Man and Animals [24]. There are mainly four ways of "measuring" emotions: through the basic emotion model, the appraisal emotion model, the psychological constructionist emotion model and finally the social constructionist emotion model.

2.1.1 Basic emotion

The first one, the basic emotion model, was put forward by P. Ekman, although he was very much inspired by Darwin. His assumption is that all humans have some discrete universal emotions such as anger, fear and joy [72]. This view, which is heavily influenced by evolution, makes the case that these basic emotions were either culturally or evolutionarily preferred because of the ease of communicating them. In short, any external event X almost necessarily leads to emotion Y; see Figure 2.1. Critics of this system say that it is too simplistic, and Lisa Feldman Barrett argues that there is "no clear objective way to measure the experience of emotion" [7].

Figure 2.1: Basic emotions illustrated

2.1.2 Appraisal theory

The appraisal emotion model, or appraisal theory, is the basic emotion model taken one step further. Instead of one event X almost always triggering emotion Y, event X in conjunction with "mental state" Z creates appraisal (or context) W, which finally triggers emotion Y. An illustration of this is shown in Figure 2.2. In effect, this means that if the context around the event is different, different emotions would appear [35]. Some appraisal theorists also reject the discreteness of emotions, resulting in varying degrees of "anger", "sadness", etc. [29]. R. B. Zajonc found in 1980, while subjecting test individuals to stimuli for very short durations, that the subjects had "at least two different forms of unconscious process", casting doubt on the accuracy of the appraisal model [80]. Lazarus presented an appraisal theory that divides appraisal into two basic parts, primary and secondary appraisals [52].

Figure 2.2: Appraisal emotions illustrated

2.1.3 Psychological constructionist model

The psychological constructionist view rejects the notion of basic, discrete emotions. James A. Russell lays out how "prototypes" of emotions can be represented as a combination of concepts, such as Core Affect, Perception of Affective Quality, Attribution to Object, Appraisal, Action, Emotional Meta-Experience and Emotion Regulation [85]. An overview of this model can be seen in Figure 2.3. The eagle-eyed reader might have spotted that appraisal is also a component of this view, and this work can be regarded as a psychologically focused, improved version of the appraisal model.

Figure 2.3: Russell's emotional construction figure

2.1.4 Social constructionist model

And finally, the social constructionist emotion model. C. Ratner explains that emotions in human infants are basically the same as emotions in animals: spontaneous and natural. As humans age, their emotions become hardened, more controlled, tempered by social thinking [14]. Newer approaches also include regarding emotions as "social objects" [28].

2.2 Computational emotion recognition

In recent years great progress has been made in creating real-time facial recognition software [74]. A natural extension of this work would be to recognize emotions, since humans can easily detect emotions by looking at other people's faces. The approaches to this problem, however, are very different. On one hand, there is psychology, sociology and many other fields researching how to interpret a face using models such as those mentioned in the section above. On the other hand, there is machine learning. This approach usually assumes much less about the nature of the problem, and simply searches for patterns that match the data provided.

2.2.1 Facial Action Coding System

One relatively well known system in use today is called the Facial Action Coding System (FACS) [73]. This system encodes facial muscle movements into Action Units (AUs, not to be confused with Astronomical Units). These 44 different AUs reflect the unique movements and positions each of the facial muscles can make or be in. The default expression is a relaxed, neutral face, so not all faces would be accurately evaluated without knowing this neutral position for each individual face. These AUs can be used as features for emotion detection. For example, raising the cheek (AU6) and pulling the lip corner (AU12) might signify happiness, depending on how you code these AUs into emotions. FACS has been successfully incorporated into recognition systems, for example by J. J. Lien et al. [56], who based their facial expression recognition on AUs.
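As a toy illustration of "coding AUs into emotions", the sketch below maps sets of active AUs to emotion labels. The particular AU combinations are illustrative assumptions rather than an established coding scheme.

# A minimal sketch of coding Action Units into emotions. The AU combinations
# below are illustrative assumptions; a real coding scheme would be far more
# nuanced and would also consider AU intensities.
AU_TO_EMOTION = {
    frozenset({6, 12}): "happiness",    # cheek raiser + lip corner puller
    frozenset({1, 4, 15}): "sadness",   # inner brow raiser + brow lowerer + lip corner depressor
    frozenset({4, 5, 7, 23}): "anger",  # brow lowerer + upper lid raiser + lid tightener + lip tightener
}

def code_emotion(active_aus: set) -> str:
    """Return the first emotion whose AU pattern is contained in the active AUs."""
    for pattern, emotion in AU_TO_EMOTION.items():
        if pattern <= active_aus:
            return emotion
    return "neutral"

print(code_emotion({6, 12, 25}))  # -> happiness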

2.2.2 Machine learning

Machine learning (ML) can be used for all sorts of pattern recognition, and human emotions are no exception. The most widely used definition of machine learning is:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E [68].

Pedro Domingos provides an excellent introduction to ML in A few useful things to know about machine learning [27]. Relevant work in terms of image recognition is Image based Static Facial Expression Recognition with Multiple Deep Network Learning by Z. Yu and C. Zhang [113]. In this work they proposed a new way of classifying a dataset consisting of images of faces. Their solution was twofold: first, detect the faces using a combination of three recent face detection algorithms; then, use convolutional neural networks (CNNs) to determine which of seven emotions each detected face was showing.
The face detection part was an ensemble of classifiers. The first of these is the joint cascade detection and alignment (JDA), which is almost as quick as Haar-cascade classification by Viola and Jones [108], but with greater efficiency [17]. The second one is a deep CNN detector (DCNN), based on a paper by C. Zhang et al. [117], that easily finds faces in different poses and illumination, but at a higher performance cost. Finally, a Mixtures of Trees (MoT) model [119] would be applied before enough confidence was present to determine whether an image did indeed include at least one face. MoT has been used "in the wild" and provides good results [119]; however, it was not superior to the alternatives, which is why it is only one part of the ensemble. Using the faces detected by the algorithm above, 48x48 pixel face images would be produced.
As for the facial expression recognition, CNNs with seven hidden layers were utilized, where the input was these 48x48 images. The layers used in these CNNs were five convolutional layers, three stochastic pooling [115] layers and three fully connected layers. Figure 2.4 shows this architecture. As for accuracy, the data set this method was tested on was the Facial Expression Recognition Challenge 2013 data set [97], where the test set accuracy was 61.29%, compared to the challenge baseline of 39.13%.
An alternative to Yu and Zhang is the Noldus Face Reader, which is based on the Active Appearance method [22], which tries to match object shapes to the image [105]. The drawback of this system is that it costs much more than the implementation by Yu et al.

Figure 2.4: Structure of the CNN used by Yu et al. Figure from [113]
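To make the described architecture more tangible, below is a minimal sketch in the spirit of the layer counts given above (48x48 input, five convolutional layers, three pooling layers, three fully connected layers, seven output emotions). It is not the authors' code: the filter sizes and channel counts are assumptions, and ordinary max pooling stands in for stochastic pooling, which Keras does not provide out of the box.

# A minimal sketch (not the authors' implementation) of a facial expression CNN:
# 48x48 grayscale input, five convolutional layers, three pooling layers,
# three fully connected layers and a 7-way softmax output.
import tensorflow as tf

def build_expression_cnn(num_emotions: int = 7) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same",
                               input_shape=(48, 48, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),    # stand-in for stochastic pooling
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
        tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_emotions, activation="softmax"),
    ])

model = build_expression_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])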

2.3 User and patient interaction

2.3.1 The feel dimension

Most technology needs some kind of interaction with human users, and this also applies to informatics. However, people and computers are not able (at least not at this point in time) to communicate with each other as seamlessly as interpersonal communication through body language and speech [63]. Human-computer interaction (HCI) has been developed to address this problem of communication. Alan Dix et al. go through what HCI is in their appropriately named work "Human-computer interaction" [26]. They go in depth explaining the traditional view of HCI: function and usability. Only a small part describes the subjective feel of the user (see Section 3.9 in their book [26]).
Astrid T. A. Larssen tries to expand on this traditional view of HCI in her thesis "How it Feels, not Just How it Looks". After conducting three studies (one of which included the usage of HCI, among others), she offers an addition to improve interactions between humans and computers: the feel dimension [54]. Her work can be interpreted as a way to focus more on the user and the user's senses, especially the kinaesthetic (the sense of knowing that you are moving or being moved) and proprioceptive (knowing the position, state and movement of body and limbs [45]) ones. Larssen recommends incorporating these senses for a more complete sensory experience, so that the user feels the right feelings (thereby the name: the feel dimension). This is different from traditional HCI, since the focus there is on the user being "happy", but not on being in "flow" [54]. To illustrate this, think of playing a song on a piano: hitting all the keys in the right order might cause the user to be satisfied. Doing the same with "flow", being in the "rhythm" and performing with "flair", might give the user a different experience.
The reader might be familiar with gamification [39], which shares some similarities with the feel dimension in terms of goals. Gamification differs in the approach used to make the experience more valuable. The goal also differs in some aspects; where gamification uses psychology (usually positive reinforcement [33] or reward-based reinforcement [83]), the feel dimension goes deeper, appealing to more primal, raw senses. This could potentially benefit people suffering from conditions like Alzheimer's, who might understand what the machine is doing or is going to do without being explicitly told.

2.3.2 Relation to medicine

Similar to the feel dimension, the art of medicine, or doctor-patient interaction, attempts to comfort and reassure the patient so that they comply with procedures and instructions given by their doctor [6]. In 2010, E. Broadbent et al. explored the public's perception of robots in healthcare [11]. They found that few people had actually seen a robot, but that many had become familiar with them through popular culture. The art of medicine differs from the feel dimension by building trust, not necessarily making the patient feel the right feelings. It is about "compassion, communication, professionalism, respecting patient autonomy, treating each individual as a beautiful and unique snowflake" [20].

2.4 3D camera technology

Though the first optical camera was developed almost two centuries ago [9], only recently have affordable 3D cameras become common. These 3D cameras use depth data to form a scene, unlike a laser rangefinder, which only returns the distance to one specific point.
There are many ways of calculating distances using a 3D camera. Here, only three of these methods are discussed: structured light, time-of-flight principles and stereo vision. Figure 2.5 shows an overview of many of the methods used to obtain depth images. In that figure, passive refers to sensors that use radiation from the environment to determine distance, while active refers to sensors that actively act on the environment to get the desired depth data [23].

Figure 2.5: Distance measurement methods. Note that Light Coding here is the same as structured light. Figure from: [23]


2.4.1 Structured light

Structured light works in a two-step procedure. First, a projector streams out light (usually infrared) in a structured pattern (thereby its name). Secondly, a camera (usually situated nearby) picks up the distortion created by the surface it reflected from [36]. This distortion can then be used to determine the distance between a point on the surface and the camera.

Figure 2.6: Structured light illustrated. Figure from: [36]

So if the surface were flat, the pattern sent by the projector would look very much the same when reflected from the surface. However, if the depth of the surface area changes, this will cause distortion in the light reflected from the surface, signaling that the surface has a different elevation. This elevation can be calculated using triangulation, specifically:

\[
R = B \, \frac{\sin(\theta)}{\sin(\alpha + \theta)} \tag{2.1}
\]

The goal in Figure 2.6 is to get R, the distance between the camera and pixel P in the image. Equation 2.1 enables the calculation of this distance, where B is the distance between the projector and the camera, α is the angle between the camera-pixel vector and the camera-projector vector, and θ is the angle between the projector-pixel vector and the projector-camera vector.
One of the great weaknesses of structured light is that radiation from sources other than the projector might interfere with it. Therefore the sun, with its large electromagnetic spectrum, might interfere with the data, rendering structured light less useful outdoors [55].
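As a small worked example of Equation 2.1, the sketch below computes the range R from an assumed baseline and a pair of assumed angles; the numbers are made up for illustration.

# A small numerical sketch of the triangulation in Equation 2.1: given the
# projector-camera baseline B and the two measured angles, compute the
# camera-to-point distance R.
import math

def structured_light_range(baseline_m: float, alpha_rad: float, theta_rad: float) -> float:
    """R = B * sin(theta) / sin(alpha + theta), as in Equation 2.1."""
    return baseline_m * math.sin(theta_rad) / math.sin(alpha_rad + theta_rad)

# Example: a 7.5 cm baseline and near-parallel viewing/projection rays,
# roughly corresponding to a target about 2 m away.
R = structured_light_range(0.075, math.radians(88.0), math.radians(89.9))
print(f"Distance to point: {R:.2f} m")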

2.4.2 Time-of-flight

Time-of-flight, or ToF for short, is another technique for obtaining a depth image. ToF works by measuring the phase shift between the light emitted by an illumination source, not unlike the projector in the subsection above, and the light received by the sensor. The source of this light does not need to be a separate unit, providing greater versatility [55].

Figure 2.7: Time-of-flight illustrated. Figure from: [67]

Figure 2.7 shows the overall ToF process. Note the four illuminated squares on the sensor part; these photo-detectors transform light into electrical charges. These charges can then be used to calculate the phase angle between the source light and the reflected light. The depth of any one pixel can be found using Equation 2.2, where D is the depth to the target, c is the speed of light, ∆φ is the phase angle and f is the frequency of the emitted light. Equation 2.3 shows how to calculate the phase angle using the electric charges Qi for each of the four photo-detectors in Figure 2.7. The way these charges are obtained is shown in Figure 2.8, where the charges Qi correspond to the photo-detectors Ci. Each window detects light at a different time.

\[
D = \frac{c}{2} \cdot \frac{\Delta\phi}{2\pi f} \tag{2.2}
\]

\[
\Delta\phi = \arctan\left(\frac{Q_3 - Q_4}{Q_1 - Q_2}\right) \tag{2.3}
\]

One of the major drawbacks of time-of-flight is the lower frame rate if higher accuracy is needed [67]. One of the advantages over structured light is that ambient light, e.g. the sun, has less of an impact than on structured light [47].

Figure 2.8: Time-of-flight phase angle calculation. This is the process where the charges Qi are obtained through the photo-detectors Ci. Figure from: [55]
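A corresponding numerical sketch of Equations 2.2 and 2.3 is given below; the charge readings and modulation frequency are invented for illustration.

# A small sketch of the time-of-flight depth computation in Equations 2.2 and
# 2.3: the four photo-detector charges give the phase angle, which in turn
# gives the depth.
import math

C = 299_792_458.0  # speed of light [m/s]

def tof_depth(q1: float, q2: float, q3: float, q4: float, mod_freq_hz: float) -> float:
    """Depth D = (c / 2) * (delta_phi / (2 * pi * f))."""
    delta_phi = math.atan2(q3 - q4, q1 - q2)   # Equation 2.3 (atan2 keeps the quadrant)
    return (C / 2.0) * (delta_phi / (2.0 * math.pi * mod_freq_hz))

# Example: 30 MHz modulation and arbitrary charge readings.
print(f"Depth: {tof_depth(0.8, 0.2, 0.7, 0.3, 30e6):.3f} m")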

2.4.3 Stereo vision

Stereo vision is how humans obtain depth information, and stereo vision for computers works the same way [2]. With two eyes (or cameras), each providing a slightly different view of the same scene, the two images may be used to calculate the disparity, from which the distance to whatever object is being perceived can be derived. The geometry is relatively simple (once again triangulation is used); however, it is not easy to know which pixels in the first and second image "belong together". This is known as the correspondence problem [89], which is also relevant for panorama picture creation and combining images [60].
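For completeness, a minimal sketch of the resulting depth computation is given below. It assumes the standard rectified pinhole relation depth = focal length x baseline / disparity, which is not stated in the text above but follows from the same triangulation.

# A minimal sketch of stereo triangulation under the usual rectified pinhole
# assumptions: depth Z = f * B / d, where f is the focal length in pixels,
# B the baseline between the two cameras and d the disparity in pixels.
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    if disparity_px <= 0:
        raise ValueError("Disparity must be positive; zero disparity means infinite depth.")
    return focal_px * baseline_m / disparity_px

# Example: 600 px focal length, 10 cm baseline, 30 px disparity -> 2.0 m.
print(stereo_depth(600.0, 0.10, 30.0))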

2.5 Communication background

Many projects are the sum of multiple devices, and communication between these devices needs to follow certain protocols to enable successful communication. The fundamental types of communication protocols are described in the following subsection, followed by the network delay restrictions in medical robot systems.

2.5.1 Communication protocols

Transmission Control Protocol (TCP) is the most widely used protocol in the world [4]. TCP works by setting up a connection between the server and the client, and sending data over this connection. The idea is to guarantee that the packets sent are received in order and that no packets are lost, so many features are implemented in this protocol, like flow control, among others [59]. One of the great advantages of TCP is that the data is almost always going to reach its destination, even over an unreliable connection [59]. This means valuable data, like control signals, should preferably be sent over a TCP connection. The same goes for large files, as file integrity should be preserved. One of the major drawbacks of TCP is that every device needs a connection to be established and maintained, which consumes memory and processing power. In addition, TCP has more overhead than UDP, since acknowledgement messages need to be sent all the time.
User Datagram Protocol (UDP), on the other hand, is connectionless. This means that a packet is simply sent; no message control or follow-up is included [59]. UDP really shines in scenarios where data order does not matter, or when only the newest data matters, such as real-time data. The disadvantage is, obviously, that datagram losses will not be detected, so control data should generally not be sent over UDP [4].
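The sketch below contrasts the two protocols using Python's standard socket module; the endpoint address and message contents are placeholders.

# A minimal sketch contrasting the two protocols discussed above.
import socket

HOST, PORT = "127.0.0.1", 9000  # hypothetical endpoint

# TCP: a connection is established first, and delivery/ordering is handled by
# the protocol - suited for control signals and file transfer.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as tcp_sock:
    tcp_sock.connect((HOST, PORT))
    tcp_sock.sendall(b"STOP\n")            # control command: must arrive

# UDP: connectionless "fire and forget" - suited for frequent real-time
# samples where only the newest value matters and a lost packet is harmless.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as udp_sock:
    udp_sock.sendto(b"joint_positions:0.1,0.5,1.2", (HOST, PORT))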

2.5.2 Tele-operated restrictions

According to Bauer et al., a transmission delay, also known as Round Trip Time (RTT), of 250 milliseconds would only slightly restrict how a surgeon operates a tele-operated robot [8]. Varkarakis et al. put this limit at 330 milliseconds [106]. The variation between these limits might be affected by what kind of movements the robot has to make and what environment surrounds the robot. It is also worth noting that the delay Bauer and Varkarakis measure is only the transmission delay, not any computational delays. This means that if a system has to do more than simply send the next action to be performed, for example calculating a motion plan, the total delay still should not exceed 250-330 milliseconds.

2.6 Related work

Related projects within the field of robotics are reviewed in this section. A project of import is the Ultrasound Robotic System using UR5 by Mathiassen et al. [62], which was mentioned in the introduction. The other related works are, in order of appearance, a robot using appraisal theory, another haptic-interfaced ultrasound robot and finally a study of emotions in a quiz setting.

2.6.1 Ultrasound Robotic System using UR5

Mathiassen et al. created a control system for a commercial robot, where the patient could be examined by a robot controlled by an operator through a haptic interface. One of its major advantages was that the haptic interface felt very intuitive to use; furthermore, it also reacted quickly and naturally to any input from the user. It also boasts its own force-feedback system for the haptic interface. The robot's inbuilt safety mechanisms were disabled during the haptic mode, so it would risk the health of the patient during use. The system developed also lacked any collision control or any kind of patient focus; however, neither of those were Mathiassen's objectives. Mathiassen does demonstrate the feasibility of a tele-operated commercial ultrasound robot [62].

2.6.2 Emotion Interaction System for Service Robot

In 2007 a team from Korea designed a robot that was reportedly able to recognize, generate and express emotions [53]. This robot used the OCC (Ortony, Clore, Collins) emotion model [71], which is a type of appraisal theory, as explained in Subsection 2.1.2. Interestingly, the authors did generate and express "facial" emotions, but did not identify emotions through vision. Voice, language and touch were the inputs for the emotion recognition software. It is important to note that touch was registered through certain "pads", restricting touch to certain designated areas.

2.6.3 Robot system for medical ultrasound

Another related robotic project, by Salcudean, Zhu, Abolmaesumi, Bachmann and Lawrence, demonstrated a tele-operated ultrasound procedure back in the year 2000. It is similar to Mathiassen's work, but its haptic interface is different. A very desirable feature of this robot is its ability to track features in the ultrasound images. As with the other ultrasound robot, there was neither collision detection nor emotion detection, and the authors barely mention "human-dependent specifications" [86].

Figure 2.9: Emotion Interaction System robot. Figure from: [53]

2.6.4 The role of Emotion and Enjoyment for QoE

Related in terms of the use of emotion detection is a paper by Tjøstheim et al. [100], who as part of their research utilized the Noldus Face Reader to detect emotions. Their setting was measuring the Quality of Experience (QoE) and how it was affected by emotions and enjoyment in a quiz setting. The goal of this study was to see how positive and negative emotions correlated with the winners and losers of the quiz. The findings were that the emotion detection software repeatedly classified a smile in the case of a loss as happiness, which was not necessarily the case. Potential criticism could be the lack of a neutral category; engagement does not necessarily indicate affection. For example, driving a car requires engagement, but does not necessarily provoke emotions.

Chapter 3

Interactive Ultrasound robot

To be able to solve the problem defined in the introduction, a new method of interacting with patients had to be developed. In this chapter a high-level description of the developed system is presented, as well as the most important components the solution consists of. The following sections explain the system as a whole, the most important concepts deployed and finally how this was implemented for this project.

3.1 Overview of solution

To help medical personnel scan patients with ultrasound, a new start-up procedure has been developed that takes the patient's feelings into consideration. The standard Human-Computer Interface view has been discarded in favor of an expanded perception of the delicate nature of interacting with humans through robotics. As a consequence of this, the robotic system presented here takes into account how the patient feels and tries to compensate for it through its actuators. This is not a component like those mentioned below, as it is something that permeates the entire solution (see Section 2.3). As such, it led to repeatedly asking questions, throughout the development process, about how a patient might perceive the solution. Unfortunately, comprehensive and complete psychological knowledge about human emotions and feelings did not exist, so design decisions were made by imagining how it would feel if the developer had been the patient. This, of course, can be argued as being a specific solution that is not generally applicable to other patients. However, there was no agreed-upon objective account of human experience at the time, therefore any reasonably explained solution would have to do.

Figure 3.1: Overview of the software components. Note that the area in the center of the figure, encompassing the Javascript-simulator and MainServer, is the 3D Vision Server.

In terms of the physical scenario, the objective was to guide a robotic arm from a default position to an area in front of the patient's chest by perceiving the environment through a 3D camera and an RGB camera. The way the system worked was that the cameras streamed data to a server called the 3D Vision Server. This server processed these data together with the positions streamed from the robot. In the 3D Vision Server, collisions and emotions were detected, and appropriate commands were issued to the robot to act on this information. The considered collisions were between 16 human joints and 8 robotic joints. The emotions detected were put into three categories: positive, neutral and negative. The commands, on the other hand, were "go", "stop", and increase or decrease the robot's speed.
The following sections introduce the different concepts and technology that have been utilized for this project. They have been divided into parts: 3D Camera, Collision detection, Emotion detection and Robot. These concepts were in use in the different software parts, and Figure 3.1 shows how the software components interact with each other. Figure 3.2 shows the physical setup without the 3D camera.


Figure 3.2: The physical system. The computer on the left runs the MainServer and Javascript-simulator. The RGB camera in the middle is facing the robot, which is in its default position. On the right the PolyScope robotic interface is shown, which is used to interact with the robot. The 3D camera is placed on the wall outside of this figure, but Appendix Figure A.1 shows it in this setting.
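As an illustration of how such categories and commands could be tied together, the sketch below maps a detected emotion category and the collision state to one of the commands listed above. The specific mapping (e.g. slowing down on a negative emotion) is an assumption for illustration, not necessarily the policy implemented in the thesis.

# A small sketch of turning a detected emotion category into one of the robot
# commands "go", "stop" or a speed adjustment. The category names and the
# chosen policy are illustrative assumptions.
def command_for_emotion(category: str, collision_detected: bool) -> str:
    if collision_detected:
        return "stop"                  # safety always wins over emotion handling
    if category == "negative":
        return "decrease_speed"        # patient seems uncomfortable: slow down
    if category == "positive":
        return "increase_speed"        # patient seems at ease: speed up (within limits)
    return "go"                        # neutral: continue at the current speed

print(command_for_emotion("negative", collision_detected=False))  # -> decrease_speed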


3.2 3D Camera

As with many robotics projects, some kind of perception had to be used to act on the environment. As mentioned in the introduction to this chapter, this project needed to contend with humans in a medical setting. Knowing where a patient was and what the patient was doing was paramount to creating meaningful interaction between machine and patient. For this scenario, the sensor's goal was to figure out where the patient was in the room, i.e. where 16 areas of the body were located. How this was achieved depended, of course, on the camera, since many ways exist to create a depth image. Three of these methods were mentioned in Section 2.4 in the background chapter. The following subsections deal with the relevant 3D camera questions, mainly what was required of the camera and how it achieved its goal.

3.2.1 Camera requirements

Two features were considered as requirements for the 3D camera:

Requirement 1: A refresh frequency equal to or greater than 10 Hz, as anything less would probably lead to an unsustainable delay in the system, which the user would experience as unresponsive.

Requirement 2: The ability to track a skeleton in 3D in real-time.

The first point stems indirectly from Subsection 2.5.2, where the transmission delay was discussed. If the image data had a delay of more than 100 milliseconds, it was estimated that the remaining time would not suffice for doing other calculations, like collision detection. The result would then be a system "living" in the past. The second point is very important, since it is difficult to determine where the patient is located without skeletal or similar positions. In practice, this requirement meant that any camera that could not obtain the distance between its surroundings and itself did not fulfill it.

3.2.2 Kinect overview

A literature review reveals that few cameras support robust tracking of humans, especially skeletal tracking. Fortunately, one of the big providers of 3D cameras, Microsoft with its Kinect for Xbox 360, has created a device suited for this task. The Kinect SDK version 1.8 features skeletal tracking, and the camera also runs at a 30 Hz frequency [49]. The Kinect has two sensors that were relevant for this project, a 3D sensor and an RGB camera, which can be seen in Figure 3.3. It is important to note that the Kinect's RGB camera was not used; only the 3D camera was needed. Other specifications are listed in Appendix B. Another advantage of choosing the Kinect was that the Department of Informatics had more experience working with this device.

Figure 3.3: The different sensors, as well as the tilt motor, of the Kinect for Xbox 360. Figure from [49]

3.2.3 Skeletal tracking

Every human skeleton consists of hundreds of bones [10], and skeletal tracking in this case was really just that: identifying and tracking a selection of points on the patient. These points are shown in Figure 3.6. The Kinect for Xbox 360's way of tracking skeletons relies on two concepts: structured light and machine learning. Structured light is a means to create a depth image by use of the triangulation principle, while machine learning is used to detect and track the humans in a depth image. Both concepts have been introduced in the background chapter.
The Microsoft Kinect for Xbox 360 operates with licensed software called Light Coding from a company known as PrimeSense (bought by Apple in 2013 [18]).


Figure 3.4: Images obtained by Kinect. a): RGB image of a chair. b): Depth image of the same chair.

This software was based on the principle of structured light, which is a way to ascertain a 3D image by using only a 2D camera and a structured light projector. Subsection 2.4.1 provides an explanation of how this principle works. Light Coding, the software that the Kinect uses, utilizes a variant of structured light. Though there was little publicly available knowledge about Light Coding, a patent filed by PrimeSense does indicate that structured light is indeed used by their software. Light Coding uses a "speckle" pattern, seen in Figure 3.5, as opposed to the pattern seen in Figure 2.6 [90]. Just like for any 3D camera, the result is a depth image. This image is then sent forward for classification of body parts.
One thing that needs to be clarified before describing how the Kinect uses machine learning is how the segmentation of the foreground (person) from the background works. There is no publicly available information, as far as the author of this document could find. However, there have been attempts at understanding how it works [16].
The way the Kinect uses this depth image is through machine learning (see Subsection 2.2.2). Machine learning is a way to use data and prior knowledge to automatically improve the results a computer program returns. Usually, the process of ML is divided into two stages: a training phase and a test phase. The objective of the training phase is to train the model chosen a priori (a model is not strictly required, one could for example use a k-NN classifier [3], but that is outside the scope of this thesis), while the test phase measures the accuracy, recall or other appropriate measures of the trained model [68].

Figure 3.5: The speckle pattern from PrimeSense's patent. Figure from [90]

In this particular scenario classification happened on a pixel-by-pixel basis. This means that every single pixel has a probability distribution, determining its probability of belonging to different areas of the body, like "left elbow" or "head". Since this is a classification problem, the most likely class gets assigned to the pixel. The question was what kind of classifier to choose and what features should be used for this classification [40, 19].
The Kinect researchers chose the model to be a random forest classifier, which is a method that uses randomly created decision trees to avoid the main pitfall of decision trees: overfitting. This classifier was trained using training data obtained by actors wearing sensors providing the "ground truth" while their movements were captured through an array of different cameras, including the Kinect. The resulting training data were too biased to be used directly (an overrepresentation of certain poses), so they decided to add "synthetic data". By creating variations of the real-world data, they managed to expand the training set from around 500k frames to as many as they wanted, as well as making the chances for the different poses more uniform [40, 19]. This dataset was then used to determine which of 2000 different candidate features separated the data in the best possible way without overfitting. These features were mainly features like "is the pixel three pixels to the right of this pixel background or foreground?" [40].
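The sketch below illustrates the flavor of such a feature on a foreground mask; the real Kinect features are depth-normalized and more involved, so this is only a simplified stand-in.

# A toy sketch of the kind of per-pixel feature quoted above: "is the pixel
# three pixels to the right of this pixel background or foreground?". Features
# like this would feed the random forest classifier.
import numpy as np

def offset_is_foreground(foreground_mask: np.ndarray, row: int, col: int,
                         d_row: int = 0, d_col: int = 3) -> bool:
    """Return True if the pixel at (row + d_row, col + d_col) is foreground."""
    r, c = row + d_row, col + d_col
    h, w = foreground_mask.shape
    if not (0 <= r < h and 0 <= c < w):
        return False                      # outside the image counts as background
    return bool(foreground_mask[r, c])

# Example with a tiny made-up mask: one "person" pixel at (2, 5).
mask = np.zeros((5, 8), dtype=bool)
mask[2, 5] = True
print(offset_is_foreground(mask, row=2, col=2))  # looks 3 px to the right -> True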

Figure 3.6: All of the joints the Kinect can identify. Figure from [101].


After all of the pixels in the depth image had been classified, the actual joint positions needed to be inferred. This was done through a mean shift [21] with a weighted Gaussian kernel, seen in Equation 3.1.

    f_c(\hat{x}) \propto \sum_{i=1}^{N} \omega_{ic} \exp\left( -\left\| \frac{\hat{x} - \hat{x}_i}{b_c} \right\|^2 \right)    (3.1)

In Equation 3.1, \hat{x} is the coordinate in the image, N is the total number of pixels, c is the body part label, \omega_{ic} is a pixel weight (see Equation 3.2), \hat{x}_i is the pixel x_i projected back into 3D using the depth d_I(x_i) (where I is the depth image), and b_c is a learned variable for each body part.

    \omega_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2    (3.2)

To summarize this subsection: first obtain the depth image using triangulation and structured light, then use this depth image to find the best features to use, classify the pixels in the depth image using a random forest classifier, and finally determine the likely joint positions using mean shift.
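To make the mean-shift step concrete, the following sketch estimates one joint position from the back-projected pixels of a single body part, using the weights of Equation 3.2 inside the Gaussian kernel of Equation 3.1. Function and variable names are illustrative and not taken from the Kinect implementation.

import numpy as np

def propose_joint(points_3d, class_probs, depths, bandwidth, iterations=10):
    """Weighted mean shift with a Gaussian kernel (cf. Eq. 3.1/3.2).
    points_3d: (N, 3) back-projected pixels for one body part,
    class_probs: P(c | I, x_i), depths: d_I(x_i), bandwidth: b_c."""
    weights = class_probs * depths**2                    # Eq. 3.2
    x = np.average(points_3d, axis=0, weights=weights)   # start at the weighted mean
    for _ in range(iterations):
        diff = (points_3d - x) / bandwidth
        kernel = weights * np.exp(-np.sum(diff**2, axis=1))  # Eq. 3.1 terms
        x = (points_3d * kernel[:, None]).sum(axis=0) / kernel.sum()
    return x

# Example with a small synthetic cluster and one outlier:
pts = np.array([[0.0, 1.0, 2.0], [0.1, 1.1, 2.0], [0.0, 0.9, 2.1], [2.0, 2.0, 2.0]])
print(propose_joint(pts, np.array([0.9, 0.8, 0.9, 0.1]), np.full(4, 2.0), bandwidth=0.3))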

3.3 Collision detection

Since the patient would be within close proximity of the robot at all times, there would have to be some kind of safeguard against possible collisions between robot and patient. Such collisions might result in patient injury and/or damage to the robot. This section explains the theory and concepts used to achieve this, followed by the specific solution created.

3.3.1 Abstract objective

There are many ways of detecting collisions. If there are n objects in a scene, the worst-case scenario would be in the order of O(n²) [31]. For example, video games that use physics engines to simulate real-life physics have to tackle this complexity problem, since n is usually large. In addition, the computations are usually time sensitive, making the problem worse. Solutions include "hierarchical representation, geometric reasoning, algebraic formulations, spatial partitioning, analytical methods, and optimization methods" [57].

The method chosen in this project is ray casting, which is a method that can be used to detect collisions between two objects [31]. The basic idea is to create a ray from one object towards the object you want to check collision with, and then do a search along the ray until you find a point of the other object. At this point, the problem-specific details begin to matter. One simple problem could be calculating whether two cubes collide or not. After casting the ray towards the other cube and finding a point on this ray belonging to the other cube, one way to find out if they collide is to check whether the distance from the center of the first cube, multiplied by its rotation factor, is equal to or greater than the distance to the point found on the ray. If yes, then a collision has happened; if not, then there might be a point closer to the first cube that was not found, or there might not be a collision at all. One drawback of ray casting is the potential to "jump over" intersecting points if the search along the ray is inadequate. This is a problem if the figures used have more complex shapes, such as concave 3D objects. Figure 3.7 shows how ray casting works when used with a cuboid.

Figure 3.7: Ray casting illustrated: the search for the object goes along the ray that intersects the cuboid. Figure from [58].
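For a concrete, simplified instance of a ray test against a cuboid, the sketch below uses the common "slab" method on an axis-aligned box. This is only an illustration of the principle; the actual simulator relies on the ray-casting facilities of its graphics library, and all names and parameters here are placeholders.

import numpy as np

def ray_hits_box(origin, direction, box_min, box_max):
    """Slab-method test: intersect the ray with each pair of parallel box
    faces and check whether the entry/exit intervals overlap in front of
    the ray origin."""
    direction = np.where(direction == 0.0, 1e-12, direction)  # avoid division by zero
    t1 = (box_min - origin) / direction
    t2 = (box_max - origin) / direction
    t_near = np.max(np.minimum(t1, t2))   # latest entry over the three axes
    t_far = np.min(np.maximum(t1, t2))    # earliest exit over the three axes
    return t_far >= max(t_near, 0.0)

# A ray along +x from the origin hits a unit cube centred at (3, 0, 0):
print(ray_hits_box(np.zeros(3), np.array([1.0, 0.0, 0.0]),
                   np.array([2.5, -0.5, -0.5]), np.array([3.5, 0.5, 0.5])))  # True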

3.3.2 Specific solution

As described in the overview section, this particular solution consists of 16 joints belonging to the human figure and 8 joints belonging to the robot. These joints are structured so that there are 15 rectangular cuboid objects representing the patient and 7 rectangular cuboid objects representing the robot. The 16 human joints were all of the joints seen in Figure 3.6, except for the feet and hands. The reason for not including the feet was that they were fairly far away from the chest area and that the skeleton would still look like a skeleton without them. The hands, on the other hand, were not included because of the assumed difficulty of tracking them: the Kinect could not track individual fingers. Instead of the simulated hands being put in obviously incorrect positions, leading to "fake" collisions, it was decided not to include them.

The only interesting collisions are those happening between the human and the robot. Robot-robot collisions are not interesting, since only particular movements may cause them and their impact on the patient is limited. Human-human collision means intra-human collision, which is natural: touching your own arm or leg is usually not a danger to anyone or anything. This means that only 15 · 7 = 105 collision checks have to be performed each time collisions are scheduled to be detected. As this was not a very large number of objects, nor were the objects of very complex shape, checking the collisions between each pair of objects was regarded as sufficient. Static obstacles, such as walls, could be directly inserted into the robot's software, creating zones it would avoid. This, in turn, removed the need for collision detection between walls and robot.

Adjusting the size of the cuboids would change the nature of the collision. By increasing the size of the cuboids, the collision would be detected earlier than the physical collision, in effect creating a buffer zone. A buffer zone was highly desired, as it would give the robot enough time to stop, thereby avoiding a physical collision between robot and patient. Naturally, this was implemented in the simulation layer, because the tool for displaying the objects had a method for detecting collisions using ray casting. The collisions did not have to be detected every frame, since there could potentially be many frames per millimetre of movement. Therefore, the collisions could be detected every tenth frame without any humanly noticeable loss in responsiveness.

In short, collisions were detected by first creating cuboids of the patient and the robot, then creating rays pointing towards objects of relevance, before finally verifying whether the objects are indeed touching.
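The sketch below illustrates the 15 × 7 pairwise check with inflated cuboids and the every-tenth-frame throttling. For brevity it uses a simple axis-aligned box overlap test instead of ray casting, and all names, the 20% buffer factor and the data layout are assumptions made for this illustration.

import numpy as np

def inflate(box_min, box_max, scale=1.2):
    """Grow a cuboid about its centre; scale > 1 creates the buffer zone."""
    centre = (box_min + box_max) / 2.0
    half = (box_max - box_min) / 2.0 * scale
    return centre - half, centre + half

def boxes_overlap(a_min, a_max, b_min, b_max):
    return bool(np.all(a_min <= b_max) and np.all(b_min <= a_max))

def detect_collisions(robot_boxes, human_boxes, frame_index, check_every=10):
    """Runs the 7 x 15 = 105 pairwise tests, but only every tenth frame."""
    if frame_index % check_every != 0:
        return []
    hits = []
    for ri, (rmin, rmax) in enumerate(robot_boxes):
        for hi, (hmin, hmax) in enumerate(human_boxes):
            if boxes_overlap(*inflate(rmin, rmax), *inflate(hmin, hmax)):
                hits.append((ri, hi))   # report which robot/human parts touched
    return hits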


3.4 Emotion detection

Collision detection might stop the patient from being physically harmed, but it might not stop the patient from feeling uncomfortable with the robot's movement. For example, moving the robot's tool head towards the patient at high speed might make the patient uncomfortable. From this arises the need to act on how the patient feels. Here, this was solved by detecting emotions through continuously examining the patient's face.

3.4.1 Abstract objective

Almost all humans can read human emotions easily, simply by looking at another person's face; however, this task has eluded computers for some time. One way of approaching the problem is to limit the facial expressions humans make to only a few categories. As mentioned in Section 2.1, researchers within different fields have created multiple models that explain emotions. The models mentioned in the background were the basic emotion model, the appraisal emotion model, the psychological constructionist emotion model and the sociological constructionist emotion model. Unfortunately, the appraisal emotion model requires "context" information (see Subsection 2.1.2), and it was unclear what information should have been gathered to make this work. Obtaining this information is not a trivial matter, although progress has been made towards making appraisal theory more formalized [12]. The psychological constructionist model leads to the highly complex world of psychology, and any complete solution might require deep understanding of the subject as well as personal data about the patient. The sociological model suffers from the same drawbacks as the former. By process of elimination, this left the basic emotion model, which was therefore the basis for emotion detection.

3.4.2 Specific solution

Though the basic model was chosen in the section above, questions linger, such as which emotions should be detected and what relationships or dependencies exist between them. Should the system be based on a model such as FACS, or should it be inferred from data?


In practice, few programs use anything other than FACS-based systems [65] or completely data-driven systems. Microsoft's Emotion API was chosen for the task of detecting emotions. It can detect eight basic emotions: happiness, neutral, surprise, fear, anger, contempt, disgust and sadness. Although many of these emotions are quite different, for the purpose of this project each emotion has been placed into one of three categories: "positive", "neutral" or "negative". The positive category consists of happiness, the neutral category consists of neutral, and the negative category consists of all the rest: fear, anger, contempt, surprise, sadness and disgust. The reasoning behind this was that fear, anger, contempt, sadness and disgust are all feelings most humans would like to avoid [88], and they were therefore regarded as undesired in this scenario. "Surprise" has, surprisingly, also been included in the negative category, since in this scenario it was regarded as related to fear, and therefore negative. The positive category is thought of as the desired state, where the patient is content and does not seem to mind the robot's actions. Neutral is somewhere in between the two other categories, but perhaps leaning slightly towards the positive one. For readability, these emotion categories will from here on simply be called positive, neutral and negative emotions; "emotion categories" will only be written explicitly when contrasting the categories with the emotions returned from the Emotion API.
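As a simple illustration of this grouping, the sketch below maps a set of per-emotion confidence scores onto the three categories by picking the dominant emotion. The dictionary of scores is a stand-in for whatever the emotion-recognition service returns; the field names and the service's actual response format are assumptions here.

CATEGORY = {
    "happiness": "positive",
    "neutral": "neutral",
    "surprise": "negative", "fear": "negative", "anger": "negative",
    "contempt": "negative", "disgust": "negative", "sadness": "negative",
}

def categorise(scores):
    """Pick the highest-scoring emotion and map it onto one of the three
    categories used in this project."""
    dominant = max(scores, key=scores.get)
    return CATEGORY[dominant]

# Example:
print(categorise({"happiness": 0.05, "neutral": 0.15, "fear": 0.80}))  # -> "negative"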


3.5 Robot

All of the sections above describe how the system perceives the world. This section describes the robot, whose role was to be the actuator of the project. For this scenario, for all intents and purposes, it was simply treated as a deterministic agent, acting on information sent from the collision and emotion detection components and moving the robotic arm appropriately. The next subsection describes what signals were sent by the proposed system to the robot and how it acted on this information.

3.5.1 Task

As outlined in the introduction of this chapter, the robotic component of this project acted on information from the other components. Specifically, it was informed by the collision detection system if a collision had been detected. This information could be, for example, which part of the robot and the human have collided, or whether there has been a collision at all. If computational resources were limited, it might also be relevant to report the time or frame at which the collision was registered.

Information from the emotion detection system works similarly to the collision detection system, but the data acquired from this sensory processing unit was not essentially binary. The possible data responses depend on the emotion model chosen in Subsection 3.4.1. Again, if limited in terms of computational resources, this process might run fewer times per minute, so adding a timestamp or similar to the detected emotion would be a reasonable decision.

The two paragraphs above are specific to this project, but any other relevant sensor already installed on the robot might be utilized. This includes, but is not limited to, gyroscopes, displacement sensors, force sensors and accelerometers. For example, an accelerometer might hinder the robot from acting in a way that might appear "too sudden".

The robot's responses to these signals were specific robotic movements. The ultimate logistical goal was to move the robot's tool head in front of the patient's chest area. But the robotic agent's task was twofold, as the patient's comfort and state of mind also affected current and potentially future movements. If a collision had been detected, one straightforward action would be to come to a complete stop. Other alternatives could have been to retrace the path it took or simply invert the direction of the tool head's movement.

The emotion response would in principle be different from the collision detection response, as the robot would not be acting on a physical state but on a psychological one. So a reasonable response may have been to temporarily stop, slow down, speed up or simply do nothing different, depending on the emotion(s) detected.

3.5.2 Specific solution

As for the developed solution, the Universal Robots UR5 was the robot of choice. This robot was also used by Mathiassen et al., see Subsection 2.6.1. It includes a fair range of sensors; the ones worth mentioning in this context are the force sensors and the accelerometer. The choice of this robot was not at the author's discretion; the decision was taken by Egil Utheim. Potential reasons could be the relatively low cost compared to classical industrial robots and that the robot was readily available for this project. An important quality of the UR5 was its collaborative nature, or put another way: it was a collaborative robot. It has multiple safety certificates, and as long as the core hardware and software remain unmodified, these certificates would still be valid, shortening its distance to being applied in a real-world scenario.

Figure 3.8: Universal Robots UR5. Figure from [104].

When a collision had been detected in this solution, a simple string indicating "stop" would be sent to the robot, which it promptly executed. The reasoning behind this was that the patient might be fragile, and the built-in force sensors were too insensitive to handle that task.
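A minimal sketch of how such a "stop" string could be dispatched is shown below, using a plain TCP socket. The host, port and message framing are assumptions for illustration only; in the actual setup the command travels from the Javascript-simulator through MainServer and the Python script described in Section 3.6.

import socket

def send_stop(host="192.168.1.10", port=50007):
    """Send a simple newline-terminated 'stop' command to the robot-side
    gateway over TCP (illustrative; address and framing are placeholders)."""
    with socket.create_connection((host, port), timeout=1.0) as sock:
        sock.sendall(b"stop\n")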


To enable the robot to react to different emotions, a simplistic reaction system was used, based on the three emotion categories specified in Subsection 3.4.2. A positive emotion increases the speed of the robot by a small amount. A neutral emotion prompts no action at all; this is a form of wait-and-see behaviour, where not enough is known about the patient to take a decision. Finally, a negative emotion reduces the speed of the robot, making it act more cautiously. This decrease in speed is greater than the increase for a positive emotion, so that the system would not suddenly surprise the patient.
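A sketch of this asymmetric speed adaptation is shown below. The step sizes and the speed limits are illustrative assumptions; the thesis does not prescribe exact values.

def update_speed(speed, category, increase=0.05, decrease=0.15,
                 v_min=0.05, v_max=1.0):
    """Small speed-up on positive emotions, larger slow-down on negative
    ones, no change on neutral; clamped to [v_min, v_max]."""
    if category == "positive":
        speed += increase
    elif category == "negative":
        speed -= decrease
    return min(max(speed, v_min), v_max)

# Example: a negative emotion more than undoes a previous positive one.
speed = update_speed(0.5, "positive")    # 0.55
speed = update_speed(speed, "negative")  # 0.40
print(speed)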

3.6 Implementation

The concepts introduced in the sections above had to be implemented to see if they indeed made it possible to investigate this new type of HCI. This section presents how these components worked together, as well as the roles of the various utility functions and processes. The code that was developed has been added to an external appendix, and a description of the developed software can be found in Appendix I.

3.6.1 Overview

The introduction to this chapter mentioned 3D Vision Server, which can be regarded as the major decision-making hub. 3D Vision Server includes two separate sub-instances, Javascript-simulator and MainServer. Other instances worth mentioning here are Cameras, Robot, Python script and Emotion API. All of these, and how they interact, are explained in this section. Figure 3.9 visualizes how these instances depend on each other and what communication interfaces were utilized between them. Figure 3.10 shows how the interacting elements appear in the physical world, without the purely computational instances.

Figure 3.9: Overview of the implementation.

MainServer acted as a gateway to the Javascript-simulator and handler of the cameras. It also assisted with the emotion detection procedure. Robot was, obviously, the robot, and acted on commands from the Javascript-simulator while continuously sending rotational data about its joints. Python script had the role of a gateway between the robot and MainServer. And finally Emotion API, a method calling a cloud service to detect emotions in an RGB image sent from MainServer.

In essence, the data flowed from the cameras and the robot, through MainServer, Emotion API and/or Python script, to the Javascript-simulator, which then displayed the situation on a display and dispatched orders to the robot. The data started out as RGB images, depth images and robot joint rotations, was transformed into emotions, skeletal joints and robot joint positions, and finally ended up as robot commands. Similarly, robot commands sent from the Javascript-simulator to the robot go through a similar transformation, only the result is a physical action. Table 3.1 illustrates the steps each of the data types goes through. Bear in mind that the vast majority of these robot commands are simply "no change", and therefore no commands would be sent to the robot to interrupt it.

Table 3.1: The route each data type takes. The data type in parentheses after each instance is what that instance outputs to the next instance.

Data from RGB camera:
  RGB camera (RGB image) -> MainServer (RGB image) -> Emotion API (Emotion) -> MainServer (Emotion category) -> Javascript-simulator (Robot command)

Data from 3D camera:
  3D camera (Depth image) -> MainServer (Skeletal joints) -> Javascript-simulator (Robot command)

Data from Robot:
  Robot (Robot joint rotations) -> Python script (Robot joint rotations) -> MainServer (Robot joint rotations) -> Javascript-simulator (Robot command)

Data from Javascript-simulator:
  Javascript-simulator (Robot command) -> MainServer (Robot command) -> Python script (Robot command) -> Robot (Action)
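To make the table above more tangible, the snippet below shows hypothetical message payloads at a few of the hops. All field names and values are invented for illustration; they are not taken from the actual implementation.

# MainServer -> Javascript-simulator: detected emotion, already categorised
emotion_msg = {"timestamp": 1493810402.5, "category": "negative"}

# MainServer -> Javascript-simulator: skeletal joints as 3D points (metres)
skeleton_msg = {"head": [0.10, 1.62, 0.90], "neck": [0.10, 1.45, 0.90]}

# Python script -> MainServer: the six UR5 joint rotations (radians)
rotations_msg = {"q": [0.0, -1.57, 1.2, 0.0, 0.3, 0.0]}

# Javascript-simulator -> MainServer -> Python script -> Robot
command_msg = "stop"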


Figure 3.10: View from above of the physical scenario. R is the robot, C is the RGB camera, 3D is the 3D camera and P is the patient.


3.6.2 3D Vision Server: Javascript-simulator and MainServer

Beginning with the Javascript-simulator, this instance had to render (render is a term in computer graphics that means to "draw" a figure on a screen) in real time a simulation of both the robot and the patient. It also had to determine if any collisions had occurred and how to respond to different emotions. An overview of what the Javascript-simulator interacts with can be seen in Figure 3.11.

Figure 3.11: Overview of what the Javascript-simulator interacts with. MainServer is the only instance it is directly communicating with.

The rendering was done by utilizing a framework known as Three.js, which is a Javascript library that empowers a Javascript-enabled HTML page to render graphics. Alternatives to Three.js include, but are not limited to, Blender [34], Unity [102] and Unreal Engine [111]. All of these, however, are large, complex and arguably more difficult to use. Three.js, on the other hand, can easily be incorporated into an HTML page and was easy to use as a real-time rendering tool. The render loop can be seen in Listing 3.1.

Listing 3.1: Javascript-simulator's render loop

function render() {
    // renders the frames
    renderer.render(scene, camera);
    // updates the frame rate box in the upper left corner
    stats.update();
    // whenever the program has rendered collision_detection_threshold
    // amount of frames, collision detection will be performed
    if (!(counter >= collision_detection_threshold)) {
        counter++;
    } else {
        counter = 0;
        var arrayLength = name_array_of_robot_parts.length;
        // goes through all of the different limbs of the robot
        for (var i = 0; i < arrayLength; i++) {
            // detects collisions only between the different robot parts and
            // the human parts, not intra-human or intra-robot collisions
            detect_collision(name_array_of_robot_parts[i], name_array_of_human_parts);
        }
    }
}

Three.js could easily create cuboid figures, and this, in conjunction with the skeletal joint data from MainServer, made it relatively easy to render cuboids between two vectors (or points) representing joints in a scene. This was done by simply finding the centre of mass and using an inbuilt method to align the cuboid with the directional vector. The joints could be either the robotic joints or the skeletal joints, and the directional vector was found by simple vector subtraction.

Data from the robot, however, were not that easily rendered. The robot could only provide joint rotations, so some calculations were needed before all of the 3D space coordinates could be found. The base position of the robot in the scene was in this case fixed, the displacements between each robotic joint were given by the robot developers, and the initial rotations were found through trial and error (since some of the displacements were negative, and negative lengths were not desired). All these variables made it possible to calculate the positions of the joints in the scene, using the following procedure. Before any of the robot joint rotation values had been received, a model representing the robot in a particular pose had been created. This pose was, of course, a pose with all of its rotations set to zero. Another important aspect of this model was that all of the distances were true to scale. When new rotations were received, the correctly rotated model was obtained by rotating each of the joints in order, from the base joint all the way up to the last wrist. A simplified version can be seen in Figure 3.12, where the right figure has been rotated by π radians at the base and −π/4 radians at the shoulder. By including more joints, this model could easily be extended to represent any robot with only rotating joints. The cuboids were updated in exactly the same manner, only now the old cuboids would be moved and scaled according to the newly obtained positions. Additional functionality of the graphical display included the ability to move, rotate and zoom the view of the scene.

Collision detection was performed using ray casting, as mentioned in Section 3.3, and the code for this can be found in Listing 3.2. It is a very simple algorithm, but it did exactly what it was meant to do. The default precision of the ray casting algorithm was sufficient, most likely because of the low complexity of the figures. A frequency that performed well on the systems tested was once per 10 render calls, which was a good balance between effective collision detection and performance. Whenever a collision was detected, a message telling the robot to stop would be dispatched to the MainServer using Javascript's WebSocket library. The reason for choosing WebSocket technology instead of HTTP was that after the initial handshake, both server and client can push data while suffering almost no overhead compared to HTTP [75].

Figure 3.12: An illustration of how the calculations to obtain the robot joints in 3D coordinates were done. The left figure is the starting pose, where all rotations are zero; the right one has had its base rotated by π radians and its shoulder rotated by −π/4 radians.
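The sketch below illustrates this kind of chained rotation in a minimal form: starting from a fixed base, each joint rotates the remaining chain about its axis before stepping along the link displacement. The axes and displacements used here are placeholders, not the real UR5 parameters, and the simplification ignores any fixed inter-link rotations.

import numpy as np

def axis_rotation(axis, theta):
    """Rotation matrix about a unit axis (Rodrigues' formula)."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0, -a[2], a[1]],
                  [a[2], 0, -a[0]],
                  [-a[1], a[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def joint_positions(base, displacements, axes, angles):
    """Walk the chain from the base: rotate the current frame by each joint
    angle about that joint's axis, then move along the link displacement."""
    R = np.eye(3)
    p = np.asarray(base, dtype=float)
    positions = [p.copy()]
    for d, axis, theta in zip(displacements, axes, angles):
        R = R @ axis_rotation(axis, theta)
        p = p + R @ np.asarray(d, dtype=float)
        positions.append(p.copy())
    return positions

# Two-link example matching Figure 3.12: rotate the base by pi and the
# shoulder by -pi/4 (link lengths are made up).
print(joint_positions(base=[0, 0, 0],
                      displacements=[[0, 0, 0.3], [0.5, 0, 0]],
                      axes=[[0, 0, 1], [0, 1, 0]],
                      angles=[np.pi, -np.pi / 4]))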


Listing 3.2: Collision detection script

function detect_collision(object1_name, other_objects_names_array) {
    object1 = scene.getObjectByName(object1_name);
    var collidableMeshList = [];
    other_objects_names_array.forEach(function (entry) {
        // creates a list of objects that will be used for the collision detection
        collidableMeshList.push(scene.getObjectByName(entry));
    });
    var originPoint = object1.position.clone();
    for (var vertexIndex = 0; vertexIndex < object1.geometry.vertices.length; vertexIndex++) {
        var localVertex = object1.geometry.vertices[vertexIndex].clone();
        var globalVertex = localVertex.applyMatrix4(object1.matrix);
        // creates the vector between the object's position and the vertex
        var directionVector = globalVertex.sub(object1.position);
        // creates the ray
        var ray = new THREE.Raycaster(originPoint, directionVector.clone().normalize());
        var collisionResults = ray.intersectObjects(collidableMeshList);
        // checks if there actually has been at least one collision
        if (collisionResults.length > 0 && collisionResults[0].distance < directionVector.length()) {
            for (var i = 0; i