


Presented to the Faculty of the Graduate School of The University of Texas at Arlington in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY




The members of the Committee approve the doctoral dissertation of SRUJANA GATTUPALLI

Dr. Vassilis Athitsos, Supervising Professor

Dr. Fillia Makedon

Dr. Farhad Kamangar

Dr. Chris Conly

Dean of the Graduate School

Copyright © by SRUJANA GATTUPALLI 2018
All Rights Reserved

“Traveler, there is no path. Paths are made by walking”— Antonio Machado

I dedicate this thesis to my beloved parents, Archana Gattupalli and NareshKumar Gattupalli, who have inculcated in me the love for science and education, my grandparents, Krishna and Gangaram Gayatri, without whom I would not be where I am today and to my husband Rachit Malik, who has been a source of support and encouragement. I love you all and I thank you for being there for me.

ACKNOWLEDGEMENTS

First and foremost, I would like to extend my warmest gratitude to my PhD advisor, Dr. Vassilis Athitsos, for his constant guidance, support, and encouragement during the course of my PhD study. I wish to thank my committee members Dr. Fillia Makedon, Dr. Farhad Kamangar, and Dr. Chris Conly for their interest in my research and for taking the time to serve on my dissertation committee. I would like to thank all members of the Vision-Learning-Mining Lab and the Heracleia Human Computer Interaction Lab for all the motivating discussions and for creating a perfect learning and working environment. I would also like to extend my appreciation to the CSE department for providing all the facilities and infrastructure necessary to carry out my PhD study. Special thanks to all my teachers in India and in the United States who have taught me and helped me achieve the knowledge and education that I have today. I would like to thank my beloved parents, who have taught me the value of hard work by their own example. I would like to thank all my family and friends for their affection and care that bring color to my life.

June 30, 2018




ABSTRACT

Supervising Professor: Dr. Vassilis Athitsos

Cognitive impairments in early childhood can lead to poor academic performance and require proper remedial intervention at the appropriate time [1]. ADHD affects about 6-7% of children [2, 3] and occurs about three times more frequently in boys than in girls [4]. According to [5, 6, 7], ADHD is a psychiatric neurodevelopmental disorder that is very hard to diagnose or tell apart from other disorders. There are specific symptoms that can be observed in individuals with the disorder, including inattention, inability to follow instructions, distractibility, hyperactivity, and acting impulsively [8, 9]. Such cognitive insufficiencies hinder the development of working memory, can affect school success, and can even have long-term effects that result in low self-esteem and self-acceptance [10]. The main aim of this research is to investigate the development of an automated and non-intrusive system for assessing physical exercises related to the treatment and diagnosis of Attention Deficit Hyperactivity Disorder (ADHD). The proposed artificially intelligent cognitive behavior assessment system takes advantage of state-of-the-art knowledge from both computer and cognitive sciences, and aims to assist

therapists in decision making by providing advanced statistics and sophisticated metrics regarding the subject's performance. The ultimate goal is to deliver meaningful information to cognitive experts and help develop skills in children that can result in overall improvement of the child's academic performance. To facilitate this, research has been carried out in artificial intelligence, computer vision, machine learning, and human computer interaction. Computational methods for human motion analysis are proposed in this dissertation to provide automatic measurements of various metrics of performance. These are metrics related to generic motion features as well as metrics explicitly defined by experts. To conclude, a novel set of user interfaces is introduced, specifically designed to assist human experts with data capturing and motion analysis using intuitive and descriptive visualizations.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF ILLUSTRATIONS
LIST OF TABLES

1. Introduction
   1.1 Motivation
   1.2 Proposed System
   1.3 Dissertation Structure
   1.4 Research Questions
2. Computerized Neurocognitive Tests
   2.0.1 Computerized Executive Function (EF) Assessment
   2.1 Physical Exercises for Cognitive Behavior Assessment
      2.1.1 Physical activities and metrics
      2.1.2 The Head-Toes-Knees-Shoulders Task
      2.1.3 Embodied Cognition Test
3. Computer Vision for Human Motion Analysis
   3.1 Body Pose Estimation
      Body Pose Dataset
   Hand Pose Estimation
      Depth-based methods
      RGB-based methods
      RGB-D based methods
   Deep Visual Features for Physical Exercise Assessment
4. Developed Application for Cognitive Behavior Assessment
   Theoretical Background
   CogniLearn System
      User Interface
      Data Acquisition
5. Physical Activity Recognition for HTKS
   Experimental Architecture
      Full-body Joints Localization
   CogniLearn HTKS Recognition Algorithm
      System Observations
      Algorithmic Improvements
      Supervised Learning Approach
   Multi-person Physical Activity Recognition
   Child HTKS Activity Recognition
   Comparison to Alternatives
      Alternative Approach
   User Performance Calculations
   Experimental Results
6. Physical Activity Recognition for CogniLearn-RSM Exercises
   Analyzing Rapid Sequential Movements
      Sequential Finger Movements
      Sequential Hand Movements
   Dataset Collection
      Captured RSM Exercises
      Sensors
      Hand Keypoints Dataset
   Vision-based Hand Keypoint Recognition
      Evaluation Protocol
      Experimental Results
7. Conclusions and Future Work
   Future Work
Appendix A. List of Publications
REFERENCES
BIOGRAPHICAL STATEMENT





LIST OF ILLUSTRATIONS

1.1 Proposed CogniLearn System Architecture
2.1 Three- and four-category models of executive function [11]
2.2 A typical Scavenger Hunt round [12]
2.3 Examples of cognitive testing tasks from Cogscreen-AE [13]
2.4 CNTs for the elderly and their application domains [14]
2.5 Pronate/Supinate motor test setup in [15]
3.1 Architecture for pose estimation in [16]
Ground truth ASLID training dataset variance
Example image annotations from ASLID
Ground truth ASLID test dataset variance
Human Body Pose Keypoints
Hand Keypoints
CogniLearn-HTKS Framework
Recording Module
Analysis Module
Example visualizations of self-occlusions in Kinect2 skeleton tracking (left) vs. vision-based pose estimation (right)
Visualizations of our method: a) the need for an offset, as the distances of the wrist from the head (a) and shoulders (b) are almost similar; b) pose estimation failure for occlusion due to an accessory (hat); c) correct estimation with a person in the background and a partially cropped image (toes not visible); d) correct classification of pose with an accessory when the face is visible
CogniLearn HTKS Algorithm Observations
Multi-Person HTKS Recognition
Example visualizations for HTKS prediction on the Child CogniLearn Dataset
Multi-modal data for preliminary experiments
Participants' Performance
Confusion matrices for system performance: a) with improvements from [17], Rule 1 only; b) our improvements, adding Rules 1 and 3; c) sideways poses with offset deduction for the "Head" and "Toes" gesture classes (without Rule 2); d) improved performance for sideways poses with Rule 2; e) Multi-Person Dataset recognition with improvements (all 3 rules); f) Child Dataset recognition with improvements (all 3 rules)
Gesture-wise system performance
Finger Tap keypoint detection
Hand pronate/supinate keypoint detection
Appose Finger Succession
Leap Motion Controller
LSM Field of View
Example annotations of cropped images from the HKD dataset
Variance of hand keypoints of Subject 1 in HKD
Visualization of Method 1 on HKD (top left, top right, bottom left); bottom right: visualization of ground-truth annotations of 6 keypoints
6.9 Visualization of 21 keypoints estimated on HKD by Method 1 [18] (left and center); visualization of the 6 keypoints of interest for our dataset
6.10 Example visualization of the convex hull from 6 keypoints; the green + is the centroid
6.11 Visualization of Method 2 (model from [19]) on HKD
6.12 Accuracy on HKD by Method 1 ([18]) and Method 2 ([19])
6.13 Accuracy for each participant in HKD by Method 1 ([18])
6.14 Accuracy for each participant in HKD by Method 2 ([19])
6.15 Accuracy for each of the 6 keypoints by Method 1 ([18])
6.16 Accuracy for each of the 6 keypoints by Method 2 ([19])
Design diagram for the proposed remote system
Embodied Cognition Test Battery
CogniLearn Interface Instruction Steps
Supervised vs. unsupervised classification


CHAPTER 1

Introduction

1.1 Motivation

Attention deficit hyperactivity disorder (ADHD) is a cognitive impairment that is known to affect mainly children but also adults [20, 21]. According to [4], ADHD is about three times more common in boys than in girls and can play a significant role in their academic performance, since children who suffer from it have a higher chance of under-performing in their educational duties [1]. Overall, as reported in [2, 3], about 6-7% of children suffer from ADHD. In contrast to some other cognitive disabilities, ADHD is considered to be a psychiatric neurodevelopmental disorder that is very hard to diagnose [5, 22, 7], and its most observable symptoms include inattention, inability to follow instructions, distractibility, hyperactivity, and acting impulsively [8, 9]. Cognitive dysfunctions caused by ADHD mainly affect executive functioning skills and can have a negative impact on working memory [8, 9]. Moreover, such cognitive insufficiencies significantly affect school success and can even be responsible for long-term effects that result in low self-esteem and self-acceptance [10]. As recent research has shown, specifically designed physical games and exercises can work as indicators for predicting and assessing ADHD symptoms in children [23]. There is a need for computational methods that help with the automatic computation of various physical performance metrics, to improve the accuracy and efficiency of human-made cognitive assessments. In this thesis, we propose an intelligent cognitive behavior assessment system intended to help provide early diagnosis of problems with underlying neurocognitive processes. The proposed system would

help provide recommendations and decision support to the cognitive expert, as well as targeted intervention for cognitive dysfunctions like ADHD. Our system is based on the latest advances in computer vision and machine learning to assess exercises that are designed to test executive functions such as decision making, working memory, and attention. The proposed system provides quantifiable results and helps monitor a child's progress over time, to help overcome learning difficulties.

1.2 Proposed System

In this thesis, an intelligent computer-vision- and machine-learning-based system for performing cognitive behavior assessment of children in a safe, unobtrusive, and easy-to-administer manner is presented. The proposed system provides decision support to experts for helping with early childhood development.

Figure 1.1: Proposed CogniLearn System Architecture

Figure 1.1 shows the proposed system architecture, which consists of an intelligent user interface capable of capturing participant data for physical exercise performances

and a computer vision and machine learning back-end that provides motion analysis of the exercise performances. The proposed system is supported by an intuitive and specifically designed user interface that can help human experts cross-validate and/or refine their diagnosis. Evaluations are performed solely on the captured RGB dataset. The analysis module of our interface gathers this information and provides systematic feedback regarding the assessment of these physical exercises to the human experts. Our system, CogniLearn, has two test suites available for cognitive behavior assessment. The first module, CogniLearn-HTKS, is based on the well-established framework of "Head-Toes-Knees-Shoulders" (HTKS), which is known for its sufficient psychometric properties and its ability to assess cognitive dysfunctions. HTKS serves as a useful measure for behavioral self-regulation [1]. Additional details about the Head-Toes-Knees-Shoulders task are presented in Section 2.1.2, and the CogniLearn-HTKS system is presented in Section 4.1. The second module, CogniLearn-RSM, is based on the "Rapid Sequential Movements" (RSM) tasks from the embodied cognition test described in Chapter 2. Additional details about the test battery are presented in Section 2.1.3, and the CogniLearn-RSM system is presented in Section 4.2.2.
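The capture-then-analyze flow described above can be sketched as a minimal two-stage pipeline. This is an illustrative sketch only: the class and function names (`Frame`, `Assessment`, `analyze`, `nearest_joint_classifier`) are hypothetical, and the toy nearest-joint classifier stands in for the actual CNN-based pose and gesture analysis.

```python
# Minimal sketch of a two-stage pipeline: a capture front-end produces frames,
# and a vision back-end turns each frame into a gesture label plus feedback.
# All names here are illustrative, not the system's actual API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Frame:
    index: int
    keypoints: dict          # joint name -> (x, y), e.g. from a pose estimator

@dataclass
class Assessment:
    labels: List[str]        # per-frame gesture classes
    accuracy: float          # fraction of frames matching the instructed gesture

def analyze(frames: List[Frame],
            classify: Callable[[Frame], str],
            instructed: List[str]) -> Assessment:
    """Back-end: classify each frame, then score against the instructions."""
    labels = [classify(f) for f in frames]
    correct = sum(1 for got, want in zip(labels, instructed) if got == want)
    return Assessment(labels, correct / max(len(instructed), 1))

# Toy classifier: the gesture is whichever target joint the wrist is nearest.
def nearest_joint_classifier(frame: Frame) -> str:
    wx, wy = frame.keypoints["wrist"]
    targets = {k: v for k, v in frame.keypoints.items() if k != "wrist"}
    return min(targets, key=lambda k: (targets[k][0] - wx) ** 2 + (targets[k][1] - wy) ** 2)

frames = [
    Frame(0, {"wrist": (0, 1), "head": (0, 0), "toes": (0, 10)}),
    Frame(1, {"wrist": (0, 9), "head": (0, 0), "toes": (0, 10)}),
]
result = analyze(frames, nearest_joint_classifier, ["head", "toes"])
print(result.labels, result.accuracy)   # ['head', 'toes'] 1.0
```

In the real system the per-frame labels would come from deep pose features rather than raw wrist distance, but the separation between the capture front-end and the analysis back-end is the same.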


1.3 Dissertation Structure

This dissertation is structured as follows:
• In Chapter 1, we present our proposed intelligent cognitive assessment system.
• In Chapter 2, we briefly summarize existing frameworks for computerized neurocognitive tests (CNTs). Further, we define the role of physical exercises in cognitive behavior assessment and describe the physical exercise test suites included in our proposed system.
• In Chapter 3, we discuss computer vision technologies used to build an intelligent physical exercise assessment system. Details of deep visual features for body and hand pose estimation for physical exercise assessment are described here.
• In Chapter 4, we present the applications developed as part of this thesis towards a computerized cognitive behavior assessment system.
• In Chapter 5, we provide implementation details of computer-vision-based physical activity recognition for the Head-Toes-Knees-Shoulders test suite.
• In Chapter 6, we describe hand keypoint detection for the cognitive behavior assessment system, enriching the task library with other physical games based on rapid sequential hand movements.
• In Chapter 7, conclusions and future work are reported.


1.4 Research Questions

Several research questions arise with the development of this thesis:
1. How can hand keypoints be accurately localized from RGB videos?
2. Can deep visual features such as body posture keypoints, facial landmarks, and hand joint keypoints help analyze physical exercise performance?
3. How can human motion be effectively estimated from live videos?
4. How can behaviorally relevant information be extracted from articulated motion analysis?


CHAPTER 2

Computerized Neurocognitive Tests

Computerized neurocognitive tests (CNTs) measure neurocognitive impairment while providing the benefits of speed, accuracy, and low cost. Computerized cognitive function tests have several advantages over traditional psychological testing, including standardized administration across a wide range of subjects, automatic scoring and reporting, and self-paced instructions. CNTs are consistent in providing quantitative analysis of performance, allow for frequent assessment of cognitive function, and in certain cases can even be self-administered inexpensively at home. These tests help with the assessment of executive function, which is one of the primary cognitive impairments in children suffering from ADHD. In this chapter, we review existing computerized neurocognitive tests, their effectiveness for executive behavior assessment, and the importance of physical exercises for cognitive function monitoring and intervention. We also discuss the physical exercises chosen for cognitive behavior assessment and the development of our CNT based on physical exercise analysis.

2.0.1 Computerized Executive Function (EF) Assessment

Executive function (EF) encompasses cognitive functions such as a child's working memory and inhibition. These support learning-related behaviors that lead to better academic outcomes, such as positive interaction with peers and teachers, task-based sequential activity performance, and the ability to refrain from disruptive actions. Executive functions in children begin to develop at an early age and help control attention and goal-directed behaviors. In the classroom, EF is one aspect of self-regulation, which

helps children focus on the task at hand, follow directions, and resist distraction. One theory of EF classifies its related skills into three classes, as shown in Figure 2.1: inhibitory control (self-control), cognitive flexibility (attention switching), and working memory. Another theory categorizes EF into four categories, in which attentional control and cognitive flexibility are similar to the three-class model; this model adds goal-defined reasoning and formulation as well as thought organization for task completion. The work in [11] analyzes the role that executive function plays in the academic performance and development of preschoolers. The outcomes of this study demonstrate that EF is important for success in routine task completion, as well as necessary for learning language skills and for solving mathematics problems.

Figure 2.1: Three- and four-category models of executive function.[11]

EFs are high-level cognitive processes and can be assessed by different tests. A computational model named Scavenger Hunt (SH), for the Trail Making Test (TMT), is developed in [12]. The TMT is a well-known measure of EF, pattern recognition, and working memory. This system provides quantitative assessment of cognitive function for people of all ages. SH is an engaging and fun mouse-driven point-and-click CNT that mimics the TMT with 30-second rounds of a connect-the-dots task. Figure 2.2 shows the Scavenger Hunt test system, which displays the time remaining in the game, the search string, the game score, targets, and distractors. In [24], EFs are assessed by the Wisconsin Card Sorting Test (WCST). The authors compare assessment by the manual test using cards vs. the computer-based WCST. This work shows that CNT-based administration is similar to manual administration and does not affect participants' performance. This was tested on elderly participants, with scores analyzed based on percentage errors, conceptual level responses, and number of categories completed.

A CNT battery to measure inhibitory control is developed in [25]. Inhibitory control is a component of EF that promotes goal-directed behavior and the conscious suppression of irrelevant responses.

Figure 2.2: A typical Scavenger Hunt round. [12]

The results of this work demonstrate an implementation of a CNT that provides cognitive assessment as well as intervention to help with EF from an early age. The authors apply Spearman's correlation to validate the conventional and CNT-based applications. The conventional tasks selected for this study were Rabbit/Crocodile and Sun-Grass. The CNT consisted of Rabbit/Crocodile, Moon/Star, and Day-Night tasks. For the Stroop test, the results demonstrated that the CNT Day-Night task correlated with the conventional Sun-Grass task, and also showed that both of the other CNT tasks correlated with each other. Here,

Figure 2.3: Examples of cognitive testing tasks from Cogscreen-AE. [13]

the Moon/Star and Rabbit/Crocodile tasks are CNT Go and No-Go tasks. Cogscreen-AE [13] is a CNT that consists of a series of computerized tasks for detecting cognitive changes caused by brain dysfunction. These tests were developed specifically for pilots, to help with recruitment and to check cognitive fitness-to-fly. It is a 20-minute test for analyzing cognitive-function competencies in speed, accuracy, throughput, and logical reasoning. The tasks, such as visual sequence comparison, simple math problems, path finder, and manikin, do not need any administration supervision. Examples of cognitive testing tasks from Cogscreen-AE are shown in Figure 2.3. The CNT called CNS Vital Signs (CNSVS) [26] is a successfully implemented test battery that is easy to set up and to use for clinical screening. It consists of seven tests: verbal and visual memory, finger tapping, symbol digit coding, the Stroop Test, a test for shifting attention, and the continuous performance test. The Computer-Administered Neuropsychological Screen for Mild Cognitive Impairment (CANS-MCI) is a 30-minute test to assess language, memory, and EF. It is a well-paced test, and elderly test takers are shown to rate this


CNT easy to use and understand compared to tests taken with paper and pen [14]. Figure 2.4 shows several CNTs and their administered domains.

Figure 2.4: CNTs for elderly and their application towards domains.[14]

These are CNTs for the elderly and are administered using a keyboard, mouse, or touchscreen. Early childhood is a developmental phase in which children have high levels of physical activity. Our work is based on monitoring physical exercises, and we focus on the development of CNTs that are administered by audio instructions; interaction with the computer is via a camera, so the participant does not need to use a keyboard, mouse, or touchscreen.

2.1 Physical Exercises for Cognitive Behavior Assessment

Attention-deficit/hyperactivity disorder (ADHD) is characterized by behavioral symptoms and fundamental cognitive deficiencies that affect executive function in

children. Physical activity is shown to positively impact neurobiological mechanisms and serves to be beneficial by moderating the symptoms of ADHD [27]. The work in [28] demonstrates that there is modest test-retest reliability for childrens performance of EF tasks. Hence, to improve precision of assessment it is important to administer several tasks and then aggregate performances for these tasks. Computerized tests provide this ease for administration and aggregation of scores towards assessment.

2.1.1 Physical activities and metrics

Physical exercises are used for cognitive behavior monitoring and for training skills like working memory, sustained attention, self-regulation, and cognitive flexibility [29]. The "Head-Toes-Knees-Shoulders" task and the "Rapid Sequential Movement" tasks, along with other similar cognitive assessment methods, were chosen for the development of our cognitive assessment system.

2.1.2 The Head-Toes-Knees-Shoulders Task

The traditional game called Head-Toes-Knees-Shoulders (HTKS) can provide sufficient psychometric observations and can be used as a measure of behavioral self-regulation [30]. According to the authors in [30] and their extended research on the task, HTKS is significantly related to cognitive flexibility, working memory, and inhibitory control. The game has three sections with up to four paired behavioral rules: "touch your head" and "touch your toes"; "touch your shoulders" and "touch your knees." Subjects first respond naturally, and then are instructed to switch rules by responding in the "opposite" way (e.g., touching their knees when told to touch their shoulders). Our work aims to provide a computerized infrastructure for performing and evaluating the aforementioned assessments.
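The paired rules and the "opposite" response mode can be made concrete with a small sketch. The mapping and function names below are illustrative; actual HTKS administration and scoring follow the protocol in [30].

```python
# Sketch of the HTKS "opposite" rule: each instruction maps to its paired
# opposite, and a response is correct when it matches the mapped target.
# Function names are illustrative, not part of any published protocol.
OPPOSITE = {"head": "toes", "toes": "head", "shoulders": "knees", "knees": "shoulders"}

def target_gesture(instruction: str, opposite_rule: bool) -> str:
    """Return the gesture the child should perform for a spoken instruction."""
    return OPPOSITE[instruction] if opposite_rule else instruction

def score_round(instructions, responses, opposite_rule=True):
    """Count responses matching the expected gesture under the current rule."""
    expected = [target_gesture(i, opposite_rule) for i in instructions]
    return sum(1 for e, r in zip(expected, responses) if e == r)

# "Touch your shoulders" under the opposite rule should elicit "knees":
print(target_gesture("shoulders", opposite_rule=True))        # knees
print(score_round(["head", "knees"], ["toes", "shoulders"]))  # 2
```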


The CogniLearn-HTKS system is based on the existing framework of Head-Toes-Knees-Shoulders (HTKS), which serves as a useful measure for behavioral self-regulation. The proposed method exploits recent advances in the area of computer vision, combining deep learning and convolutional neural networks with traditional computer vision features, in an effort to automate the capture and motion analysis of users performing the HTKS game. Our method aims to tackle several common problems in computer vision, such as multi-person analysis, point-of-view and illumination invariance, subject invariance, and self-occlusions, under a very sensitive context where accuracy and precision come as our first priority. We perform extensive evaluation of our system under varying conditions and experimental setups, and we provide detailed analysis regarding its capabilities. As an additional outcome of this work, a partially annotated dataset was publicly released, consisting of different subjects performing the HTKS activities under different scenarios.

2.1.3 Embodied Cognition Test

The Embodied Cognition Test consists of physical exercises that children perform as part of a comprehensive cognitive assessment and enhancement program. Table 2.1 shows the complete list of tests in the Embodied Cognition Test. For the development of the CogniLearn-RSM test battery, we chose tasks that correspond to the assessment of motor skills based on performing rapid sequential movements (RSM). These are the "Rapid Sequential Movements" tests 1 to 4 from the ECT (Table 2.1). The exercise we chose is the finger-to-thumb opposition sequence (FOS) learning task, a motor task that has previously been used in studies with adults and children [31]. In our system, the task is to tap the fingers to the thumb in the order 1-2-3-4. The fingers are numbered from 1 to 4, with 1 being assigned to the index finger




Lateral Preference Patterns
Gross Motor Gait and Balance: 1. Natural Walk, 2. Gait on Toes, 3. Tandem Gait, 4. Stand Eyes Closed Hands Outstretched, 5. Stand on One Foot
Synchronous Movements: 1. March Slow, 2. March Fast
Bilateral Coordination and Response Inhibition: 1. Bi-manual Bag Pass Slow, 2. Bi-manual Bag Pass Fast
Bilateral Coordination and Response Inhibition (Bi-Manual Ball Pass with Red Light/Green Light): 1. Red Light/Green Light – Slow, 2. Red Light/Green Light – Fast, 3. Red Light/Green Light/Yellow Light – Slow, 4. Red Light/Green Light/Yellow Light – Fast, 5. Red Light/Green Light/Yellow Light Visual Slow
Visual Response Inhibition (A Sailor Went to Sea, Sea, Sea): 1. Sailor Slow, 2. Sailor Fast
Cross Body Game (Cross Body Ears, Shoulders, Hips, Knees): 1. Cross Body Ears - Knees, 2. Cross Body Shoulders - Cross Body Hips, 3. Combined Reverse Actions
Finger-Nose Coordination
Rapid Sequential Movements: 1. Foot Tap, 2. Foot-Heel-Toe Tap, 3. Hand Pat, 4. Hand Pronate/Supinate, 5. Finger Tap, 6. Appose Finger Succession

Table 2.1: Embodied Cognition Test Battery

and 4 being the little finger. The children are instructed to perform this sequence as quickly and accurately as possible. The work in [31] shows that children who were able to learn sequential finger movements after one session of physical training were also able to transfer the training effects to the other, untrained hand. This study shows the ability of children to learn new motor skills through mental practice. Mental imagery of finger movements helps develop internal representations which are effector-independent. For this study, the children were seated with the hand supported on a desk and were allowed to verbalize the


Figure 2.5: Pronate/Supinate motor test setup in [15].

sequence. The task is analyzed with the help of a computer-monitored device attached to the fingertips to count the number of correct sequences per minute. Neuromotor processes underlying rapid limb pronation and supination movements are investigated in [15]. These are tasks that we have chosen to be a part of CogniLearn-RSM. The study in [15] shows that these limb movements are controlled by generalized motor programs, where the relative time to perform limb movements and the active time in the EMG for each muscle is invariant across different movement-speed conditions. In this study, a computer-monitored device is used to analyze the movements. Figure 2.5 shows their setup: participant limb movement responses are displayed on an oscilloscope, with angular displacement shown vertically and movement time displayed horizontally. In [32], the FOS learning task is performed to analyze training-related gains in three different age groups. Here, experiments are performed by children and adolescents, and their actions are videotaped, then manually evaluated and tested. Vision-based analysis would help with the assessment of such motor training/assessment tasks and is incorporated in CogniLearn-RSM.
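The counting of correct FOS repetitions per minute can be sketched as follows. The tap stream is hard-coded here, whereas in an actual system it would come from vision-based fingertip detection; the reset-on-error policy is an assumption for illustration, not the protocol of [31] or [32].

```python
# Sketch: count correct 1-2-3-4 FOS repetitions (1 = index ... 4 = little
# finger) from a stream of detected finger taps. Illustrative only.
TARGET = [1, 2, 3, 4]

def correct_sequences(taps, target=TARGET):
    """Count completions of the target sequence, restarting after any error."""
    count, pos = 0, 0
    for tap in taps:
        if tap == target[pos]:
            pos += 1
            if pos == len(target):      # one full correct repetition
                count += 1
                pos = 0
        else:
            # restart; the erroneous tap may itself begin a new attempt
            pos = 1 if tap == target[0] else 0
    return count

taps = [1, 2, 3, 4, 1, 2, 2, 3, 4, 1, 2, 3, 4]   # one error in the middle
print(correct_sequences(taps))                    # 2

# Normalizing by the session length gives the per-minute rate:
duration_min = 0.5
print(correct_sequences(taps) / duration_min)     # 4.0
```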

CHAPTER 3

Computer Vision for Human Motion Analysis

Human activities are sequences of body and hand configurations and postures. Physical activity recognition can be characterized as a simultaneous alignment and recognition problem: recognizing body/hand gestures from videos, and measuring the correctness of the performed movement based on the alignment of these body/hand configurations with the expected movement. In this work, we perform body and hand pose estimation towards body/hand joint keypoint recognition. These keypoints are then used as features to solve the temporal alignment problem. In this chapter, we provide a brief review of computer vision methods for articulated human body and hand pose estimation. We also discuss the articulated pose estimation methods applied towards feature extraction for physical exercise recognition in our system.

3.1 Body Pose Estimation

Many techniques have been proposed for vision-based human pose estimation. Several recent approaches use CNNs to address the task of joint localization for human body pose estimation. In [33], Tompson et al. address the challenging problem of articulated human pose estimation in monocular images by capturing geometric relations between body parts: they first detect heatmaps of the body joints using a CNN architecture, and then apply a graphical model to validate these predictions. In [34], Yang and Ramanan create a tree-like model using local deformable joint parts and solve it using a linear SVM to achieve good results for pose estimation.

Figure 3.1: Architecture for pose estimation in [16]

In [35], Chen and Yuille further extend this by using a graphical model and employing deep CNNs to learn the conditional probabilities of the presence of joints and their spatial relationships. Toshev and Szegedy [36] propose a deep-learning-based, AlexNet-like CNN model for human pose estimation, by which they localize body joints as the solution to a regression problem, and then improve the estimation precision by using a cascade of these pose regressors.

In [37, 38], Jain, Tompson et al. perform pose estimation using a model that combines a CNN, which regresses over joint heatmaps, with a Markov Random Field (MRF). In the work by Pfister et al. [16], the authors use a deep CNN to regress over heatmaps of body joints, and improve performance through the use of temporal information from consecutive frames. They propose a deeper network for regressing heatmaps, spatial fusion layers to learn the spatial model, optical flow to align the heatmap predictions of nearby frames, and a parametric pooling layer that combines the aligned heatmaps into a pooled confidence map. Their architecture for pose estimation is shown in Figure 3.1. Convolutional Pose Machines (CPM) [39] is a sequential CNN architecture for single-person body pose estimation. CPMs predict the spatial relationships among body parts without the need for graphical-model-based inference.
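The decoding step shared by these heatmap-regression methods, turning per-joint heatmaps into coordinates, can be sketched as an argmax over each map. This is a simplified illustration only; real systems add sub-pixel refinement and the fusion steps described above.

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Decode per-joint heatmaps of shape (J, H, W) into a list of
    (x, y, confidence) tuples by taking each map's argmax."""
    keypoints = []
    for hm in heatmaps:
        row, col = np.unravel_index(np.argmax(hm), hm.shape)
        keypoints.append((int(col), int(row), float(hm[row, col])))
    return keypoints

# Toy heatmap for a single joint with a peak at (x=3, y=1):
hm = np.zeros((1, 5, 5))
hm[0, 1, 3] = 0.9
print(heatmaps_to_keypoints(hm))  # [(3, 1, 0.9)]
```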

These CNNs repeatedly refine part location estimates by operating on 2D belief maps for each part from previous stages, learning implicit, image-dependent spatial models of the relationships between parts. In [40], the challenging problem of multi-person pose estimation is addressed by integer linear programming (ILP). Here, the poses of multiple people in a frame are estimated by minimizing a joint objective that labels and partitions body part candidates into body pose configurations corresponding to different people. The body parts are detected using a deep Residual Network (ResNet).

3.1.1 Body Pose Dataset

A relevant work on pose estimation specific to the Sign Language Recognition (SLR) domain is [41], where joint locations over frames of sign language videos are estimated by first performing background subtraction and then predicting joints as a regression problem solved using a random forest. As part of the preliminary experiments of this work on body joint localization, the American Sign Language Image Dataset (ASLID) is collected, with images extracted from the Gallaudet Dictionary Videos [42] and the American Sign Language Lexicon Video Dataset (ASLLVD) [43]. ASLID is annotated for upper body joint locations to help with the body joint recognition task. The dataset is divided into training and testing sets, to provide a benchmark for user-independent experiments. The training set consists of 808 ASLID images from different signs, performed by six different ASL signers. The test set contains 479 ASLID images from two ASL signers in ASLLVD videos. The training and testing sets vary in terms of users, signs, and background colors. Annotations are provided for seven key upper body joint locations, namely left hand (LH), left elbow (LE), left shoulder (LS), head (H), right hand (RH), right elbow (RE), and right shoulder (RS).

Figure 3.2: Ground truth ASLID training dataset variance

The work in [44] shows the performance of the pose estimator by Pfister et al. [16] on ASLID, and demonstrates how transfer learning can help improve joint localization accuracy. These experiments serve as a baseline for our physical activity recognition model, where the first step is to localize body joints. In Section 3.3, we provide an overview of full-body joint localization in a physical exercise assessment setting for our system. Figure 3.3 shows examples of annotated images from ASLID. Variations in the range of training and testing poses are shown in the ground truth scatter plots of Figures 3.2 and 3.4. Our dataset and code to display annotations is available from:


Figure 3.3: Example image annotations from ASLID

Transfer Learning

Transfer learning is a way to improve the performance of a learning algorithm by utilizing knowledge acquired from a previously solved, similar problem [45]. As pointed out in [46], initializing a network with transfer-learned weights, even weights obtained from a different task, can improve performance compared to random initialization. [46] further points out that transfer learning is more effective when the difference between the original task and the target task is smaller. In our case, the original task is human body pose estimation and the target task is ASL-specific upper body pose estimation, which are relatively similar. Hence, transfer learning helps in fine-tuning the pose estimator, so as to obtain better joint localization estimates for the sign language as well as the physical exercise recognition experiments.

Figure 3.4: Ground truth ASLID test dataset variance

As pointed out in [46], transfer learning is also a way to avoid overfitting during training, even when the target dataset is smaller than the original dataset.
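The transfer-learning scheme can be sketched as copying weights from the source network wherever layer names and shapes match, and freshly initializing the rest (here, a new output head). The layer names and shapes below are illustrative stand-ins, not the actual network of [16].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights learned on the original (generic body pose) task:
pretrained = {
    "conv1": rng.normal(size=(8, 3)),
    "conv2": rng.normal(size=(8, 8)),
    "head":  rng.normal(size=(16, 8)),  # task-specific output layer
}

def transfer_init(pretrained, target_shapes, rng):
    """Initialize target weights from pretrained ones where the layer
    name and shape match; reinitialize the remaining layers randomly."""
    weights = {}
    for name, shape in target_shapes.items():
        if name in pretrained and pretrained[name].shape == shape:
            weights[name] = pretrained[name].copy()   # transferred
        else:
            weights[name] = rng.normal(size=shape)    # fresh init
    return weights

# Target task (7 upper-body joints): reuse the backbone, new head.
target = transfer_init(pretrained,
                       {"conv1": (8, 3), "conv2": (8, 8), "head": (7, 8)}, rng)
transferred = [n for n in target
               if n in pretrained and target[n].shape == pretrained[n].shape]
print(transferred)           # ['conv1', 'conv2']
print(target["head"].shape)  # (7, 8)
```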

3.2 Hand Pose Estimation

Hand pose estimation is an important computer vision topic due to its wide range of applications in human-computer interaction, augmented/virtual reality, and gaming. Such applications often require hand segmentation, articulated hand pose estimation, and tracking. Recent methods in body pose estimation [40, 19] can be used to detect and segment hands using the body's hand-joint features. Articulated hand pose estimation from monocular RGB images is still a largely unsolved problem in several aspects. Human hand configurations are very diverse, with the pose of a human hand having over 20 Degrees of Freedom (DoF). Hands are smaller than the body, and thus occupy a small part of the image when the full body is visible. In addition, hand keypoints are oftentimes occluded by other parts of the same hand, the other hand, or the rest of the body. Deep learning based methods currently achieve state-of-the-art performance for human body pose estimation. Estimating body pose is an articulated pose estimation problem similar to hand pose estimation. However, body pose estimation is easier, because the body is oriented upright most of the time, and because occlusions are a less frequent and less severe problem for full-body images than for hand images. We investigate deep learning based methods for hand pose estimation that perform holistic articulated pose estimation. Pixel-wise pose estimation approaches can be slow for real-time applications and, due to per-pixel constraints, do not take advantage of important holistic hand features. Most vision-based articulated hand pose estimation methods can be categorized based on the modality of the data that they use, their application towards first-person (egocentric) or third-person views, and whether the methods are discriminative or generative. Below we briefly review articulated hand pose estimation based on depth, color, or a combination of both modalities.

3.2.1 Depth-based methods

Hand pose estimation is frequently done using depth image frames. The use of depth cameras has facilitated the creation of large-scale datasets [47]. Automatic annotation of keypoint locations for such datasets can be achieved by attaching magnetic sensors to the hand. Using magnetic sensors for automated annotation is less effective for RGB datasets, as the magnetic sensors/data gloves change the appearance of the hand in the RGB images and reduce the usefulness of such images for model training. This is why in this work we focus on providing an RGB dataset. With the advent of deep learning based pose estimation, data has proven to be a crucial part of model learning. The model in [48] predicts 3D hand skeleton joint locations by integrating a principal component analysis (PCA) based prior into a Convolutional Neural Network (CNN). In [49], 3D joint estimation is achieved by performing per-pixel classification on synthetic depth images using random decision forests, and has been shown to achieve good recognition accuracy on ASL digits. Hierarchical pose regression for 3D hand poses is performed in [50] towards recognizing 17 ASL hand gestures. The work in [51] estimates 3D joint locations by projecting the depth image onto 3 orthogonal planes and fusing the 2D regressions obtained on each plane to deduce the final estimate. In [52], the hand segmentation task is performed using depth information to train a random decision forest. Hand pose estimation for hand-object interaction is performed in [53], where grasp taxonomies are recognized from egocentric views of the hand.

3.2.2 RGB-based methods

The methods in [54] and [55] contribute towards egocentric hand pose recognition from RGB image frames: [54] is based on hierarchical regressors to estimate articulated hand pose, and [55] formulates a graph-based representation of hand shapes for hand gesture recognition. In [56], an RGB dataset of hands performing 10 different gestures is provided. The method in [57] consists of a model for color-based real-time hand segmentation and tracking for hands interacting with objects. In our work, we evaluate two state-of-the-art hand keypoint detectors for 2D keypoint estimation from RGB images. The first is the method presented in [18], which proposes a deep neural network to perform 3D hand pose estimation; this model is trained on static images from a synthetic dataset, and we evaluate the 2D keypoint detections of its PoseNet network on our dataset. The second is the method in [19], which produces real-time keypoint detections on RGB images using multi-view bootstrapping. We describe these methods in detail in Section 5 and provide extensive evaluation of them in Section 6.
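Evaluations of keypoint detectors such as these are commonly reported with a PCK-style metric (Percentage of Correct Keypoints): a prediction counts as correct if it falls within a pixel threshold of the ground truth. A minimal sketch, with illustrative coordinates and threshold:

```python
import numpy as np

def pck(pred, gt, threshold):
    """Fraction of predicted keypoints within `threshold` pixels of the
    ground truth. pred, gt: (N, 2) arrays of (x, y) locations."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= threshold))

gt   = np.array([[10, 10], [50, 40], [90, 80]], dtype=float)
pred = np.array([[12, 11], [70, 40], [91, 79]], dtype=float)
print(pck(pred, gt, threshold=5))  # 2 of 3 keypoints within 5 px -> 0.666...
```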


3.2.3 RGB-D based methods

The model in [58] combines a generative model with discriminatively trained salient features for articulated hand motion capture from monocular RGB-D images, for hand-hand and hand-object interactions. The work in [59] provides a new RGB-D dataset of egocentric hands manipulating objects. Another RGB-D approach for articulated 3D hand motion tracking is presented in [60], where the authors perform part-based pose retrieval and image-based pose optimization.

3.3 Deep Visual Features for Physical Exercise Assessment

Physical exercises can be used for both assessment and enhancement of cognitive skills [61]. Computer vision methods can automatically analyze physical exercise for performance metrics such as speed, correctness of movement, and response time. Our assumption is that children perform physical exercises in front of a Microsoft Kinect V2 camera. The children may or may not face the camera directly, and at most three children are in front of each camera. To initially evaluate the proposed system, we capture a novel dataset with adult individuals, mainly undergraduate students. For further evaluation, we also test our system on more complex recordings: multiple people performing exercises in front of one camera, the Kinect placed at the side of the subjects, and a recording of a real subject, a child. The CogniLearn system is built on top of state-of-the-art machine learning techniques for pose estimation from RGB video streams. For the underlying mechanism of the CogniLearn system, we exploit the remarkable performance reported on similar tasks by deep-learning approaches. Deep learning, and especially CNNs, has by far dominated the area of vision-based activity recognition and computer vision in general [62, 44], as such methods can significantly outperform previous state-of-the-art approaches that were primarily based on shallow classifiers and hand-crafted features [63, 64]. In 5.1, we compare the skeleton tracking offered by the Kinect with computer-vision-based body pose estimation. The Kinect's skeleton tracker fails in cases of self-occlusion, which are prevalent in our dataset for gestures where the participant touches their "Knees" or "Toes". In 5.4, we discuss the use of different data modalities for human motion analysis. Our method is based on the RGB modality, as it can be more consistent and less noisy than the Kinect's skeleton tracking data, and can work better under different lighting conditions and interference.

Human body pose estimation and hand detection are two important tasks for systems that perform computer vision for activity recognition. In this work, we focus on RGB-based articulated hand pose estimation. We prefer this modality due to the availability and ease of deployment of regular color cameras compared to depth cameras.

Figure 3.5: Human Body Pose Keypoints

Our contribution addresses the partial hand pose estimation problem on single RGB image frames. The hand keypoints that we estimate are the wrist and the fingertips of the thumb, index, middle, ring, and little fingers. We provide a novel RGB benchmark dataset for hand keypoint estimation, and perform experiments to provide a quantitative evaluation of current state-of-the-art methods for this task. This dataset includes hand gestures and keypoint annotations for gestures pertaining to rhythmic hand movements. Our motivation is that tasks involving such movements can be used

for cognitive assessments, in conjunction with tasks involving whole-body motion [65]. There is a need for computational methods that help with the automatic computation of various physical performance metrics, so as to improve on the accuracy and efficiency of human-made assessments. Articulated hand pose recognition is an important step in recognizing and assessing physical exercises that contain hand gestures. For the CogniLearn system, we perform articulated motion analysis on exercises that require observation of body parts such as the torso, upper and lower arms, upper and lower limbs, fingers, and face. For efficient hand pose estimation, we use the pretrained model from [66], a real-time hand keypoint detector that has demonstrated practical applicability on in-the-wild RGB videos. For the analysis of "Rapid Sequential Movement" exercises such as "Finger Tap", "Appose Finger Succession", "Hand Pat" and "Hand Pronate/Supinate", we need the precise locations of the fingers. The aforementioned deep learning model provides 21 keypoint detections each for the left and right hand, as shown in Figure 3.6. These are also important in tests for "Bilateral Coordination and Response Inhibition", such as "Bi-manual Bag Pass Slow" and "Bi-manual Bag Pass Fast", and in the test for "Finger-Nose Coordination". For CogniLearn-HTKS, we take advantage of the deep convolutional structure initially proposed in [40, 67]. For CogniLearn-RSM, we employ the deep learning network proposed in [68]; the body keypoints detected by [68] are shown in Figure 3.5. These deep neural networks have been trained to estimate the human-body joints of single subjects in an image, providing highly accurate and robust recognition results.

Figure 3.6: Hand Keypoints

The CogniLearn system operates based on the skeleton joints provided by the aforementioned Convolutional Neural Networks (CNNs), and its task is to understand the intended motion of every subject in the frame and score his/her performance. Recognition and system evaluation occur in a viewpoint-independent manner, while scoring takes place according to a set of scoring rules defined by cognitive experts. To analyze the human motion and evaluate the subject's performance, the CogniLearn-HTKS system continuously keeps track of the hand-joint positions with respect to the rest of the body joints of interest (i.e., Head, Toes, Knees and Shoulders). Evaluation occurs by inspecting whether the user's action matches the expected motion (e.g., what body part did he/she touch when he/she was supposed to touch the head?). To analyze the subject's intended motion, we segment each video stream into long-term segments, which consist of all the frames that occurred between two consecutive HTKS commands. In CogniLearn-RSM, the human motion analysis requires deep visual features of body, hand, and face keypoints. The intended motion is deduced from computations on these detected keypoints. This requires analysis of body keypoints for exercises that test "Gross Motor Gait and Balance". Body posture is also required initially to find the "Lateral Preference Patterns" of the candidate, i.e., to detect whether the candidate is left-handed or right-handed.
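The per-frame comparison of hand positions against the body joints of interest can be illustrated with a simple nearest-part rule. This is a hedged sketch of the idea, not the exact CogniLearn classifier; the coordinates and names below are made up.

```python
import numpy as np

def classify_htks_frame(hands, parts):
    """Label a frame with the body part closest (on average) to the two
    detected hand keypoints.
    hands: {'left': (x, y), 'right': (x, y)}
    parts: {part_name: (x, y)} reference locations from the pose estimator."""
    def mean_dist(part_xy):
        return np.mean([np.linalg.norm(np.subtract(h, part_xy))
                        for h in hands.values()])
    return min(parts, key=lambda name: mean_dist(parts[name]))

hands = {"left": (95, 52), "right": (105, 50)}
parts = {"Head": (100, 50), "Shoulders": (100, 90),
         "Knees": (100, 200), "Toes": (100, 300)}
print(classify_htks_frame(hands, parts))  # Head
```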


CHAPTER 4

Developed Application for Cognitive Behavior Assessment

4.1 Theoretical Background

Emerging technologies have significantly influenced several medicine-related processes, such as diagnosis, rehabilitation, and treatment. The effect of computer science on the medical domain is observable not only at the level of human-computer interaction, but also in the quality and quantity of useful data that a modern system can automatically capture and provide to experts as assistive material for diagnosis. The implementation of such systems must meet two major criteria: a) keep the user motivated, and b) provide meaningful and understandable data to the domain experts [69]. In that direction, various works have been proposed that try to assess different but similar medical conditions. In [70], the authors propose, after extended research, that active video game play (e.g., with systems like the Microsoft Kinect or Nintendo Wii) can promote physical activity and rehabilitation of children with cerebral palsy. In [71], an interactive game was developed to assist stroke patients in improving their balance. In [72], the authors focused on extending the attention span of children with ADHD by designing a computer software application that constantly monitors the user's attention state using an eye-tracker and adapts its user interface to incorporate multiple stimuli. In [73], the researchers deployed a humanoid robot to teach simple coordinated behaviors to children who suffer from complex developmental disabilities. The authors in [74] designed a virtual-reality game for upper-limb motor rehabilitation, while in [75] a virtual reality environment was deployed for the assessment and rehabilitation of attention deficits in children, especially ADHD. In [76], a set of new game designs is presented that is based on psychological tests or tasks and aims to monitor or improve ADHD-related symptoms. The system we propose in this research fits very well as a component of the framework suggested in [77], where the authors proposed a system that combines two types of feedback to the therapists: one directly from the user and one automatically generated by a computer-vision-based mechanism. Our work is based on the well-established HTKS framework as a tool for assessing cognitive dysfunctions [1], while at the same time employing state-of-the-art vision-based techniques for activity recognition and evaluation. It is packaged with a carefully designed UI that is intuitive and motivates both users (therapists and patients) to interact with it. Computation-wise, we exploit the remarkable performance reported on such tasks by deep-learning approaches. Deep learning, and especially CNNs, has shown very good results in activity recognition tasks [62, 44] compared to traditional approaches based on shallow classifiers and hand-crafted features [63, 64]. Finally, like other similar systems [78, 79, 80, 81], our application outputs metrics that are valuable to the experts and can add significant information at the level of treatment and diagnosis. Those metrics are defined in the HTKS protocol and are mainly related to the number of errors (whether the subject performed the expected motion) and the delay (how much time the subject required to perform a specific command).

4.2 CogniLearn System

In this work, we propose the CogniLearn-HTKS system, which has been designed specifically to help detect cognitive dysfunction and to serve as a tool for behavioral self-regulation. This is achieved by employing the HTKS self-regulatory tasks as proposed by McClelland et al. [30], which have proven to be good indicators of overall cognitive health. For the purposes of this work, we use the terms 'user' and 'instructor' synonymously to represent the therapist or cognitive expert, and we refer to the person being observed while performing the HTKS task as the 'participant' or 'subject'. The CogniLearn-HTKS system is built to achieve goals based on our novel framework, which can serve as a tool for evaluating and assessing physical activities that can reveal potential cognitive dysfunctions. Our framework, shown in Figure 4.1, aims for the CogniLearn-HTKS system to be well-defined, modular, and well-structured. We aim to provide ease of use for general users and to deliver valuable data, meaningful measures, and information that can benefit advanced users such as therapists and cognitive experts.

We deploy the HTKS game in an intelligent interface with a user-friendly front-end that is capable of providing instructions to play the HTKS game, and of collecting and storing multi-modal observations while the participant plays the game. Visual data demonstrating the actual and intended motion is made available to the instructor, and audio instructions are delivered to the participant.

Figure 4.1: CogniLearn-HTKS Framework

We use a Microsoft Kinect V2 to collect RGB, Depth, and Skeleton data from participants using the recording module of the CogniLearn interface. This data is stored in our database and then analyzed using computer vision and machine learning methods. CogniLearn's back-end evaluates the participant's performance and provides meaningful outcomes in the form of performance reports and scores. We further enhance our system to support multi-person evaluation for cases where more than one participant plays the game simultaneously using our interface; the system evaluates each participant's performance individually and provides analysis reports for each. We also support variations in participants' body orientation angles while performing the exercises.

4.2.1 User Interface

As part of this research, we developed a novel user interface that facilitates therapists and human experts who monitor the performing subjects, and provides them with the desired measurements. We provide a prototype user interface for recording and analyzing motion while children perform HTKS exercises in front of a Microsoft Kinect V2 camera. CogniLearn's user interface consists of two main components, namely the Recording and the Analysis modules. The recording module is designed to meet the needs of the instructor when conducting the HTKS task and capturing the required data. The analysis module, on the other hand, is used to conduct and visualize the data analysis provided by the system. Both modules follow the self-regulatory task protocol established by cognitive experts, similar to that in [30]. For implementation purposes, both the recording and the analysis modules are part of a single program designed using Electron, a Node.js framework that leverages Node's powerful web-app tools to develop a standalone app with access to capabilities normally reserved for the server. The modular approach of our GUI makes it easy to use the interface for either or both of the tasks of capturing and analyzing participant data. The interface is built keeping in mind appropriate HCI practices, so as to provide better visual features, such as interactive charts, and to capture and deliver information in a way that is accessible and understandable for the user, as thoroughly discussed in [82].

Recording Module: This part of the interface is used by the instructor to capture participant data while the participants perform physical exercises from the HTKS task. The interface provides verbal audio instructions to the performing subject regarding the upcoming motion (e.g., "Touch your Head"). As shown in Figure 4.2, the avatar in the middle of the recording interface shows the body part that the subject is expected to touch. This way, the instructor can keep track of the current association between the words in the audio and the body part that the hands should actually touch. Instructors also have the option to move between the steps that they desire to record, and to cancel or restart the recording. As already discussed, audio instructions are provided to the participant by the system and not by the instructor.

Figure 4.2: Recording Module

The sequence displayed above this avatar in Figure 4.2 corresponds to the sequence of moves to be made. For example, 1 corresponds to head, 2 to shoulders, 3 to knees, and 4 to toes, as in the first step shown. When the PLAY button is pressed, instructions are given, and the correct sections are highlighted for the instructor to see. Timestamped RGB data is collected simultaneously, which is the primary advantage of using Electron. In CogniLearn's interface, Electron communicates via a socket with a C# application, so as to enable collection of Kinect data with the proper metadata and timing provided by the front-end application. Once the initial step is completed, the system moves to the next step, which shuffles the body parts: the task switching. Each step selected on the left corresponds to a different association between audio words and body parts, and a different sequence of instructions to be given. The instructor can switch between these steps and record all instructions within a step, or a subpart of it. Table 4.1 illustrates the correlations between audio instructions and intended motion at each step, as provided by the current implementation of the system.

Analysis Module: The analysis interface provides front-end visualizations of the predictions obtained from our algorithm, which runs at the back-end. It allows for the selection of a participant and provides a stepwise as well as a consolidated summary of the participant's performance in terms of speed and accuracy of movement. In Figure 4.3(a), we can see that the frame numbers are shown at the bottom left of the image, which allows for precise manual annotation if needed. Shown to the right are three categories of text: Spoken, Command, and Prediction. Spoken refers to the word that is spoken (the audio instruction provided) by the system. Command refers to the body part that the participant is expected to touch, i.e., the intended movement of the participant. Prediction refers to what the system "thinks" the participant is touching.

The graph below is a prediction confidence visualization, a doughnut chart that shows the relative certainty for each body part. In the image, knees has the highest relative certainty in the chart, so "Knees" is shown as the prediction above. The scores are defined so that 0 points stand for "incorrect", 1 point for "almost correct", and 2 points for "correct".

Figure 4.3: Analysis Module

Step #    Audio Instruction    Intended Movement
1.        123434               H,S,K,T,K,T
2.        2432123413           S,T,K,S,H,S,K,T,H,K
3.        4321323423           T,K,S,H,K,S,K,T,S,H
4.        1242433132           H,S,T,S,T,K,K,H,K,S

Table 4.1: CogniLearn Interface Instruction Steps

The evaluation that we perform is based on the timestamps of the captured image frames. The subject is supposed to perform the instructed gesture within a 3-second window after the audio instruction has been delivered by the recording interface. Thus, how long the subject takes to reach the desired gesture determines the correctness of the motion. Figure 4.3(b) illustrates a detailed evaluation of the whole recording. In particular, the analysis shows how the subject responded to each command of every practiced step. It provides both the score that the system assigned to the related motion and the likelihood that the assigned score was decided correctly. Thus, the human expert can decide whether the analysis provided by CogniLearn for a specific command can be trusted, or whether it was made under great uncertainty and hence should be ignored. Finally, in Figure 4.3(c), additional performance measures are provided, which aim to show the subject's overall score with respect to every individual motion. This type of analysis aims to assist the therapist in identifying behavioral patterns of the subject from a higher level of abstraction. A demo video showing an initial prototype of the proposed UI can be found here:

4.2.2 Data Acquisition All the data for our experiments is collected using the recording module of our CogniLearn-HTKS Interface. We capture data following steps as shown in Table 4.1, similar to the HTKS protocol [30] as defined by cognitive experts. This data is then analyzed to deliver valuable performance measures to the users of our system. Instructions within each steps are provided in the form of audio outputs from the recording module. Participants are expected to listen to the instructions and perform the corresponding gestures to the best of their cognitive abilities. We collected different types of recordings using our interface to validate our system performance for different conditions. Our main CogniLearn-HTKS dataset, also referred to as single-person CogniLearn dataset, consists of 15 adult participants who perform exercises while facing towards the camera. In these recording, one participant performs exercises in front of one Kinect. Analysis of this dataset is done in single-person mode of CogniLearn. We collected RGB data from 15 participants (9 Male and 6 Females) of age group 18 to 30 that we recruited to follow the instructions provided by the interface and perform the task sequences. Videos are captured with some participants with accessories, for e.g. wearing a ‘hat’. RGB image frames and timestamps for each audio instruction delivered are captured while the participants perform these steps. These are all utilized in our system during analysis of participants’ movements and for score calculations. This dataset comprises of over 60,000 frames of RGB data captured for the participants 34

which creates a substantial dataset for cognitive analysis as well a baseline to perform gesture recognition for the physical exercises performed during the HTKS task. During this study and data collection, the analysis interface provides us valuable measures of performances of these participants such as scores for individual instructions, summarized report for overall performance per gesture class and visualizations of the steps performed. Our dataset and necessary annotations can be found at the provided link: We provide annotated gesture class labels for 4443 of these frames which are useful to calculate accuracy of our system. We collected additional data to enable our system to deliver support and validate its performance for different conditions. These conditions include support for more than one person in the frame (at most 3 people in front of one Kinect), person facing sideways or with varying body orientation angles while performing exercises and data captured for a child participant. We call these Multi-person Dataset, Side Pose Dataset and Child dataset respectively. These recordings all consist of multimodal stream captures for RGB data, Depth data, Kinect skeleton tracking data and instruction timestamps. The Multi-person CogniLearn dataset consists of recordings captured when two people perform steps from Table 4.1, captured using our interface. Both participants face towards the camera and do not change orientations or locations in the frame. Both are Male participants within the age group 18 to 30 years. All 4 steps of the Table 4.1 are recorded. The Side Pose CogniLearn dataset consists of three participants captured while performing HTKS as instructed by our recording interface. These are adult participants, 2 Males (right profile visible) and 1 Female (left profile visible) between age group 18 to 30 and one of the participant has worn a ‘hat’. We manually annotate 35

step-1 recordings of all the participants to perform accuracy calculations for the use of our system with side poses.

The Child CogniLearn dataset consists of multimodal recordings captured while a child participant performs the exercises. These recordings were not captured using our system and do not follow the instruction steps from our system; the data was recorded from a Kinect while a cognitive expert delivered the instructions. It consists of 198 images in which the child performs one of the 4 HTKS gestures in each frame. This data is highly useful for validating the performance of the CogniLearn HTKS algorithm on an actual subject, a six-year-old child. We annotate each of these frames as belonging to one of the four gesture classes.


CHAPTER 5

Physical Activity Recognition for HTKS

5.1 Experimental Architecture

Our experimental architecture comprises frame-by-frame recognition of participants' movements and an overall performance analysis over video sequences. We have devised a method to classify a participant's movement in each frame into one of four gesture categories: Head, Shoulders, Knees, or Toes. Long videos are then analyzed using this per-frame gesture information. Outcomes of this analysis are performance measures such as the accuracy of the movement, a comparison of the performed action to the intended action, the response time of the participant, and the steps or instructions on which the participant made the most movement errors.

Currently, our algorithm works by analyzing the captured RGB frames from the multi-modal data collected using our interface. For each step performed by the participant we have RGB data frames available, and we use our algorithm to classify each of these frames into the aforementioned gesture categories. At each frame, the algorithm recognizes the activity performed by the participant in order to classify the image into one of the four gesture classes: Head, Shoulders, Knees, or Toes. We perform this HTKS classification in two steps: 1) localize the participant's full-body joint coordinates in the image, and 2) use the localized joint coordinates to recognize the gesture the participant performs in that image frame. This algorithm generates frame-wise HTKS classification labels under the assumption that there is a single person in the image frame.


5.1.1 Full-body Joints Localization

For our HTKS recognition approach to be effective, it is highly important for the pose estimator to localize body joints correctly. The skeleton tracker coordinates offered by Kinect give poor results in certain cases involving self-occlusions [83]. Our dataset contains many instances of self-occlusion, for example when the participant bends forward to touch their toes or knees, or when the participant's body is oriented sideways from the camera. Example visualizations of self-occlusion cases that we observed in our dataset during joint localization by Kinect's skeleton tracker are shown in Figure 5.1. For this reason, we decided not to use the Kinect skeleton tracking modality that we collect in our recording module.

CogniLearn's physical activity recognition comprises a computer vision module that works with RGB data and takes advantage of the latest advances in deep learning to employ a pose estimator that works well for our problem. RGB data can be less prone to noise and more consistent than the skeleton tracking data, and it works well under different lighting conditions. Depth data is another valuable resource that could be used for recognition, and we would like to explore incorporating this modality for HTKS recognition in future work.

For full-body pose estimation, we explored several existing state-of-the-art pose estimators that work with RGB data and decided to use an existing deep-learning-based pose estimator called DeeperCut [40]. This method performs CNN-based body part detection for each input image and then uses deep neural networks for pairwise conditioning of the detected body parts. We chose this pose estimator for our problem because it is a current state-of-the-art method for multi-person detection and provides competitive performance for single-person pose estimation, as demonstrated by its results on popular datasets.
It is a fast and accurate model that complements the robustness of our user interface. Furthermore, we needed a

method that works well for joint localization of body poses observed from various camera viewpoints, and for participants of different age groups. The pre-trained model has been shown to work for adult participants as well as children, as demonstrated by its single- and multi-person pose estimation results on the popular MPII human pose dataset [84], which includes annotated images with adults as well as children (for example, images with the activity labels "Playing with children" and "Child care"). We use this pre-trained deep learning model and perform experiments using the Caffe framework [85]. The model provides full-body pose outputs on our recorded RGB image data without any training on our dataset.

We use the pose estimator to track 14 full-body joints, namely LS, LE, LH, LW, LK, LA, H, N, RS, RE, RH, RW, RK, and RA, which correspond to the body parts 'left shoulder', 'left elbow', 'left hand', 'left waist', 'left knee', 'left ankle', 'head', 'neck', 'right shoulder', 'right elbow', 'right hand', 'right waist', 'right knee', and 'right ankle' respectively. These joint coordinates for each participant are then utilized in our CogniLearn HTKS recognition algorithm. For this section of the thesis, we assume one person in the image frame, and our chosen model estimates the 14 joint coordinates for that single person; in later sections, we show how we use this for HTKS recognition with multiple people in the image frame. This approach to HTKS recognition using computer vision and deep learning, and how it integrates into our CogniLearn-HTKS system, is shown in Figure 4.1. The modularity of our framework allows this recognition subsystem to be changed, as well as the flexibility to replace the current pose estimator with a better one as the state of the art in pose estimation advances.
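As a concrete illustration, the 14 tracked joints can be kept in an ordered mapping from the abbreviations above to pixel coordinates. The container and function name below are illustrative, not part of the thesis implementation:

```python
# Ordered list of the 14 joint abbreviations used throughout the thesis.
JOINT_NAMES = ["LS", "LE", "LH", "LW", "LK", "LA",
               "H", "N", "RS", "RE", "RH", "RW", "RK", "RA"]

def joints_from_estimator(coords):
    """Map a pose estimator's 14 (x, y) outputs to named joints.

    `coords` is assumed to be a list of 14 (x, y) pixel tuples in the
    order above; a missing joint can be passed as None.
    """
    if len(coords) != len(JOINT_NAMES):
        raise ValueError("expected 14 joint coordinates")
    return dict(zip(JOINT_NAMES, coords))
```

Downstream steps of the algorithm can then look joints up by name (e.g., `joints["H"]` for the top of the head) regardless of the estimator's output order.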


Figure 5.1: Example visualizations of self-occlusions in Kinect2 skeleton tracking (left) vs. vision-based pose estimation (right)

The above-mentioned pose estimation model localizes the participant's full-body joint coordinates (top of head, neck, shoulders, elbows, hands, waist, knees and toes), which can now be used to formulate the algorithm that obtains image-wise HTKS gesture class labels.

5.2 CogniLearn HTKS Recognition Algorithm

We assume that every RGB image frame captures the movements of a single participant performing the HTKS gestures. Each of these images is input to the pose tracker, which outputs pixel coordinates for 14 body joint locations. We use distances between these joint locations to recognize the movement performed by the participant. We have available the wrist coordinates and the coordinates of the body parts of interest: head, left and right shoulders, left and right knees, and left and right ankles. The confidence scores for each of the four gesture classes (1–Head, 2–Shoulders, 3–Knees and 4–Toes) are obtained as follows:

• Calculate r1 and l1 as the Euclidean distances from the right hand (RH) to the head (H) and from the left hand (LH) to the head (H) respectively.

• Calculate Euclidean distances r2, r3, r4 as the distances from the right hand (RH) coordinate to the three body parts shoulder (RS), knee (RK) and toe (RA) respectively. Similarly, calculate Euclidean distances l2, l3, l4 as the distances from the left hand (LH) coordinate to the three body parts shoulder (LS), knee (LK) and toe (LA) respectively.

• Calculate distances d1, d2, d3 and d4, where each di is the average of ri and li for the corresponding body part.

• Compute scores z1, z2, z3 and z4 from the distances (for x ∈ {1, ..., 4}): zx = 1/dx

• Perform softmax over the scores to obtain probabilities: σ(z)x = e^(zx) / Σ_{x=1}^{4} e^(zx)

This provides the probability that any given frame belongs to one of the four classes H, S, K and T. The gesture class with the highest probability P(gesture | frame) = σ(z)x is chosen as the classification for that frame. Visualizations of this approach and the gesture classes obtained through this algorithm are shown in the linked video: watch?v=w63cqZtIeIk&t=14s. The performance of this method is further improved by adding constraints based on certain body poses, motivated by failure cases observed under certain conditions; these improvements and observations are explained below.
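The baseline per-frame classification described above can be sketched in Python as follows. The `joints` dictionary and function name are illustrative; the distance, scoring, and softmax steps follow the bullet points above (distances are assumed nonzero):

```python
import math

GESTURES = ["Head", "Shoulders", "Knees", "Toes"]

def classify_frame(joints):
    """Baseline per-frame HTKS classification from 14 named joints.

    `joints` maps abbreviations (H, LH, RH, LS, RS, LK, RK, LA, RA, ...)
    to (x, y) pixel coordinates, as produced by the pose estimator.
    """
    d = lambda a, b: math.dist(joints[a], joints[b])
    # Average left/right hand distance to each of the four target parts.
    d1 = (d("RH", "H") + d("LH", "H")) / 2          # hands to head
    d2 = (d("RH", "RS") + d("LH", "LS")) / 2        # hands to shoulders
    d3 = (d("RH", "RK") + d("LH", "LK")) / 2        # hands to knees
    d4 = (d("RH", "RA") + d("LH", "LA")) / 2        # hands to toes (ankles)
    z = [1.0 / dx for dx in (d1, d2, d3, d4)]       # scores z_x = 1 / d_x
    exp = [math.exp(zx) for zx in z]
    probs = [e / sum(exp) for e in exp]             # softmax over scores
    return GESTURES[probs.index(max(probs))], probs
```

For a frame where both hands are near the head, d1 is small, so z1 dominates the softmax and the frame is labeled "Head".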

5.2.1 System Observations

We measure the HTKS recognition accuracy of our system on our validation set using the above-mentioned algorithm. During this phase, we observed certain common failure cases on which the system does not perform well. This provided an

opportunity to improve our system further by incorporating minor changes in the algorithm. We observed that the system predicts an incorrect gesture in some cases where the participant touches the "Head" or "Toes" body parts. We demonstrate one such failure case in Figure 5.2a: here the person is touching their "Head" but the system predicts the gesture class "Shoulders". The gesture class is obvious to a human observer, yet the system produces an incorrect prediction because its calculations depend on the distances of body parts from the hand. The reason is that our pose estimator outputs joint localizations for the wrist, not for the fingers or the tip of the hand, which in most cases is the part participants use to touch the body part of interest. The joint localization method's prediction for the participant's wrist therefore needs to be compensated by deducting an offset distance between the wrist position and the fingertips. Incorrect predictions can also occur if the person touches a lower part of their head for the instruction "Head" but the wrist coordinates end up closer to the "Shoulders" than to the top of the head: the distance d1 from hand to head (denoted by 'a' in Figure 5.2a) and the distance d2 from hand to shoulders (denoted by 'b' in Figure 5.2a) are then almost the same, which leads to the incorrect prediction. A similar observation can be made when the person touches their "Toes" but the system predicts "Knees", since the distance calculations are done with respect to the localized wrist coordinates. We utilize these observations to improve our system by compensating for these offset distances caused by the localization of wrist coordinates by our chosen pose predictor. In the future we could train or fine-tune a predictor to output fingertip coordinates instead of wrist coordinates.
Here, we use a simple solution: we perform offset calculations and deductions while computing the distances of the hand from the "Head" and "Toes" body parts. The details of the calculation of this offset distance measure and its incorporation into the algorithm

are explained in the section on Algorithmic Improvements.

Figure 5.2: Visualizations of our method: a) demonstrates the need for an offset, as the distances of the wrist from the head (a) and shoulders (b) are almost similar; b) pose estimation failure due to occlusion by an accessory (hat); c) correct estimation with a person in the background and a partially cropped image (toes not visible); d) correct classification of the pose with an accessory when the face is visible.

For side poses, it is observed that this offset deduction is not necessary when the person is touching the "Head". The wrist is seen at an angled orientation, and the distance between the wrist and fingers is relatively smaller and less noticeable when the person faces sideways. In such cases, we observe that deducting the offset can result in the "Shoulders" gesture being incorrectly predicted as the "Head" gesture class. Example visualizations of side poses for the "Shoulders" gesture are shown in Figure 5.3. However, the offset deduction for the "Toes" gesture is still needed in this case, for the same reason as for front-facing poses. Therefore, we fine-tune our algorithm to recognize whether a person is standing facing the camera or is oriented at a different angle; this determines which offsets need to be applied for the algorithm to work precisely. Our algorithm also supports HTKS prediction for varying body orientations, for example if a person switches from a front-facing view to their side (left- or right-facing posture) while performing the HTKS gestures.

Another common failure case is due to failure of the pose estimation method itself. This is mostly observed under certain occlusions when a person is touching their "Toes". Though the pose estimator is capable of handling some occlusions, it fails in some cases when the person is bent down or is wearing certain accessories, as in Figure 5.2b, which shows a failure case where the participant's hat occludes their face and/or other body joints visible in the RGB images. In such cases the system performs well for the instructions "Head", "Shoulders" and "Knees", as the face is visible, but occasionally fails to detect the correct pose for the "Toes" position (see Figure 5.2d). Here, it helps to support our algorithm with additional vision-based observations for gesture class prediction, since the pose estimator estimates the joints incorrectly and predicts the gesture class "Head" instead of "Toes". In our method, we address this by observing the height of the person. In each RGB frame, we calculate the person's height; if in any frame the height of the person is less than 2/5 of the person's maximum height, then the pose is considered to be "Toes". We apply this constraint to the predictions on all our datasets and observe that it improves the system's accuracy in predicting the "Toes" gesture. The threshold of 2/5 of the height was obtained after experiments with different values and prediction outcomes. As future work, we would also like to track the positions of the joints across consecutive frames, so that when a person bends and the face is no longer visible, the system can still provide an estimate of where the joints are. Such a method would also help if the person, or a certain body part, is occluded in some frames of the captured data.


5.2.2 Algorithmic Improvements

Observations from experiments on our initial validation set provided the opportunity to improve the CogniLearn HTKS algorithm by fine-tuning it for certain gesture class predictions. In certain cases, depending on the participant's orientation relative to the camera, we need to account for the offset distance from the tip of the hand to the wrist, because the pose estimator provides wrist locations rather than finger locations. Based on the body orientation, these offsets have to be considered while measuring the distance of the hand to different body parts: for front poses, this is necessary for most cases of the distances measured from the hands for the "Head" and "Toes" gesture classes, while for side poses the deduction is necessary only for the "Toes" gesture class. For cases of occlusion, where the pose estimator fails for "Toes" and predicts incorrect class labels, we support the system with an additional feature to help with classification: the height of the person, which is a useful measure for predicting whether the person has bent down into the "Toes" position. We therefore improve our CogniLearn HTKS recognition algorithm using the following three rules:

Rule 1: For a person facing the camera, deduct an offset value, calculated to approximate the hand size of the person, from the distances measured from the hands to the "Head" and "Toes".

Rule 2: For a person facing sideways (left or right) from the camera, deduct the same offset value only from the distances measured from the hands to the "Toes".


Rule 3: If the height of the person in the current frame is less than 2/5 of the person's height when in an upright position, then output the gesture "Toes".

Figure 5.3: CogniLearn HTKS Algorithm Observations

These rules are incorporated into our algorithm as follows. For Rules 1 and 2, we need to calculate the offset value. Here, we utilize an approach similar to [44], where the offset value is based on the face height of the person. Face size is a reasonable basis for the offset, as it gives a scale-invariant estimate rather than a fixed value. The pose estimator provides us with coordinates for the "Head" (top of head) and "Neck" positions, which can be used to estimate the face size of the person in the image. The face height f is calculated from these available coordinates for the top of the head, htop, and the neck, n. The offset measure δ is then obtained as half the face height. For front poses, this offset is deducted from the distances of the hands from the head and toes; if the person is facing sideways, it is deducted only from the distances of the hands from the toes. To determine whether the person is facing sideways (left or right) or at another orientation, we use the distance between the left and right shoulders. If the distance between

the two shoulders is less than the offset, then the person is oriented sideways. We use the same offset that we calculated for the wrist-to-fingertip deduction. The offset needs to be deducted from the distances of both hands, and the HTKS algorithm described previously is modified in the following way:

δ = f / 2, where f = d(htop, n)

• Calculate d4 based on updated r4 and l4, where r4 = r4 − δ and l4 = l4 − δ.
• Calculate the distance dS as the distance between the right shoulder (RS) and the left shoulder (LS). If dS > δ (the person is facing the camera), then:
– Calculate d1 based on updated r1 and l1, where r1 = r1 − δ and l1 = l1 − δ.
• Perform the calculations for the remaining steps of the CogniLearn HTKS recognition algorithm using the updated values of d1 and d4.

For Rule 3, we employ a scale-invariant approach to determine whether the person is bent over and touching their "Toes". This is useful when the pose estimator fails due to occlusions by accessories or by other body parts. The height of the person is an important feature for determining whether the person is bent over in the frame. Maxh is the height of the person when standing upright; we assume that in the first frame of the recording, the person is upright. The algorithmic changes to incorporate this constraint are as follows:

• At each frame, calculate the height Ht of the person:
– If RA and LA are available, Ht = mean(d(htop, RA), d(htop, LA))
– Else if RA is available, Ht = d(htop, RA)

– Else, Ht = d(htop, LA)
• If Maxh < Ht, then Maxh = Ht
• If Ht < Maxh × 2/5, then output the class "Toes".

Figure 5.3 shows the performance of our algorithm and visualizations explaining the need for the rules. Figure 5.3a demonstrates the need for Rule 1, while Figures 5.3d and 5.3e show how the offset deduction of Rule 1 helps for front-facing poses but would cause incorrect classification if applied to side poses. Side poses are visually different in the RGB frame, and the same rule cannot be generalized to both front-facing and side-facing body postures; hence the need for Rule 2. Figures 5.3b and 5.3c are example visualizations of our method's recognition for side poses, where Rules 2 and 3 are applied by the system and result in correct classification.

Our experimental results show that adding the constraints of Rule 2 and Rule 3 improves the accuracy of our system over our previous work [17] by 5%. Our system in [17] employs Rule 1 and does not support side pose orientations or multi-person detection; we now provide these new features along with a substantial improvement in accuracy. Moreover, our system achieves performance accuracy similar to that of [86], but in a scale-invariant way. The method in [86] achieves its performance by placing constraints based on predetermined fixed thresholds. These thresholds might need to be fine-tuned based on scale or participant size in the frame, depending on age group or distance from the camera, and a fixed threshold would not work for all participants if multiple people of different age groups, or at different depths from the camera, appear in the same image frame. In our method, the constraints are scale invariant: we calculate estimates from the available joint localizations for the participants, so each constraint is customized to the participant within the algorithm. Therefore, our system gains accuracy without compromising our
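The three rules can be sketched together as a small adjustment step applied before the softmax classification. This is a minimal sketch, assuming the same named-joint dictionary as before and a running maximum height carried across frames; the function name and return layout are illustrative:

```python
import math

def apply_rules(joints, d1, d4, max_height):
    """Sketch of Rules 1-3 (function name and containers are illustrative).

    `d1` / `d4` are the averaged hand-to-head and hand-to-toe distances
    from the baseline algorithm; returns the adjusted distances, an
    optional forced gesture class, and the updated running max height.
    """
    dist = lambda a, b: math.dist(joints[a], joints[b])
    delta = dist("H", "N") / 2                # offset = half the face height
    d4 -= delta                               # Rules 1 & 2: toes offset, all orientations
    if dist("RS", "LS") > delta:              # shoulders far apart -> front-facing
        d1 -= delta                           # Rule 1: head offset, front poses only
    # Rule 3: height-based override for occluded 'Toes' poses.
    height = (dist("H", "RA") + dist("H", "LA")) / 2
    max_height = max(max_height, height)      # first frame assumed upright
    forced = "Toes" if height < max_height * 2 / 5 else None
    return d1, d4, forced, max_height
```

When `forced` is set, the height rule overrides the distance-based prediction; otherwise the adjusted d1 and d4 feed into the scoring and softmax steps unchanged.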

system’s the ability to perform well for different user types and conditions. All these improvements achieved by incorporating the rules help us avoid the common failure cases cause due to localization of wrist position instead of tip of the hand for frontfacing as well as side poses and also helps avoid errors due to any occlusions.

5.2.3 Supervised Learning Approach

The above-mentioned CogniLearn HTKS algorithm is an unsupervised approach to gesture classification. We also perform supervised learning for classification, to compare with the CogniLearn HTKS algorithm. Joint locations and distances are still obtained via deep feature extraction from the body pose estimator; supervised learning is performed on the distances obtained from the localized joint coordinates. Here, we used the annotated data of 4,443 frames from our CogniLearn HTKS dataset, which consists of the step-1 performances of the 15 participants. We conduct user-independent experiments, treating data from 12 participants as the training set and the remaining 3 participants' data as the test set, and perform 5-fold cross-validation to obtain accuracy on the entire annotated dataset. We train a shallow neural network to predict classes based on joint distances. The inputs to this neural network are the z-score-normalized distances d1, d2, d3 and d4 (the distances of the hands from the 4 joints), and the training label for each frame is one of the classes H, S, K or T. The neural network is shallow, with one hidden layer of 2,000 units. The results of these experiments are shown in Table 5.1, as mean percentage test-set accuracies over the 5-fold cross-validation. We also perform SVM-based gesture classification with the same 5-fold cross-validation. In Table 5.1, the unsupervised learning method is the CogniLearn HTKS algorithm with the algorithmic improvements.
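The z-score normalization of the four distance features can be sketched as follows. The function only prepares the classifier inputs (the one-hidden-layer, 2000-unit network itself could be built with any standard library); the name and layout are illustrative:

```python
def zscore_features(frames):
    """Z-score normalize per-frame distance features [d1, d2, d3, d4].

    `frames` is a list of 4-element distance lists, one per frame.
    Each column (feature) is centered on its mean and scaled by its
    standard deviation; a small floor avoids division by zero.
    """
    n = len(frames)
    cols = list(zip(*frames))
    means = [sum(c) / n for c in cols]
    stds = [max((sum((v - m) ** 2 for v in c) / n) ** 0.5, 1e-9)
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in frames]
```

In practice the means and standard deviations would be computed on the training folds only and reused on the test fold, to keep the cross-validation user-independent.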

Table 5.1: Supervised (Neural Network, SVM) vs. unsupervised (CogniLearn HTKS algorithm) classification accuracy

The advantage of supervised-training-based classification is that we do not need to incorporate rules explicitly, as in the Algorithmic Improvements section. However, these approaches require a labeled training set and need diverse training examples to obtain good performance on participants of varying age, cases of self-occlusion, and different orientations from the camera. Hence, in our system we choose to use the unsupervised algorithm, as it provides performance comparable to supervised training methods without the need for any labeled training dataset.

5.3 Multi-person Physical Activity Recognition

The CogniLearn-HTKS system allows simultaneous multi-person cognitive behavior assessment and monitoring while playing the HTKS game. The assumption is that at most three people play the game and perform the exercises in front of one Microsoft Kinect. Multiple Kinects can easily be integrated into our framework, to allow future deployment of the system in classroom settings. One key constraint we place on the system, so that it can work with multiple people in a single image frame, is that each participant should have all body parts within the frame without much occlusion from other participants. This is monitored by a person detector in our system based on multi-person joint estimation: we consider a person detected if at least 10 out of the 14 body joints, with their correspondences, are detected as belonging to the same person within the same image frame.

The deep-learning-based pose estimator that we have chosen [40] supports multi-person pose estimation. The multi-person joint detector outputs the recognized body joints and their correspondences, so as to detect each person in an image frame. Multi-person detection and joint estimation is more expensive than single-person full-body joint pose estimation; we aim to support multiple simultaneous users without compromising the speed and accuracy of our single-person CogniLearn [17]. We incorporate this functionality, again using the modular approach supported by our framework. The system architecture in Figure 4.1 shows how the multi-person detection module is placed in our system to pre-process the collected data and recognize the participants in the RGB frame prior to image-wise HTKS gesture class recognition. As this is an expensive step, we decided to recognize participants at the beginning of the recording and then assume that the same participants perform all steps of the game. It is important to follow the system's game rules for multiple participants, which state that participants are not allowed to switch places or move positions once the recording begins. Participants can still switch body orientations or stand at varying body angles to the camera, but they are required to perform the exercises in place; we still allow some room for minor location changes. In the future, we would like to integrate an inexpensive method to perform participant localization at timed intervals or for every image frame.

The multi-person physical activity recognition algorithm that we devised for our CogniLearn system is as follows:

1. Perform multi-person pose estimation on the first RGB image frame of the video to obtain person-wise joint localizations.

2. Obtain the number NP of participants detected by discarding person detections with fewer than 10 body joint localizations. Joint availability for the head, neck, and right and left hands is mandatory; for the shoulders, knees and toes, either the left or the right joint localization must be available.

3. Create a bounding box for each participant detected in the image frame.

4. Pre-process all RGB images of the entire video sequence by cropping according to the bounding boxes, obtaining NP RGB images for each original RGB frame.

5. Run the single-person HTKS algorithm on all frames for each of the NP participants separately.
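The validity check of Step 2 can be sketched as a simple filter over per-person joint dictionaries. The constants and function names below are illustrative; missing joints are assumed to simply be absent from the dictionary:

```python
MANDATORY = {"H", "N", "RH", "LH"}                   # must all be present
PAIRED = [("RS", "LS"), ("RK", "LK"), ("RA", "LA")]  # at least one of each pair

def valid_person(joints):
    """Decide whether a multi-person detection counts as a participant.

    `joints` maps joint abbreviations to (x, y) for one detected person.
    A detection is kept if at least 10 of the 14 joints are localized,
    all mandatory joints are present, and each shoulder/knee/ankle pair
    has at least one side detected.
    """
    if len(joints) < 10:
        return False
    if not MANDATORY <= joints.keys():
        return False
    return all(left in joints or right in joints for left, right in PAIRED)

def detected_participants(detections):
    """Filter raw per-person joint dicts down to the NP valid participants."""
    return [j for j in detections if valid_person(j)]
```

A false positive such as the lamp detection discussed below, with only a handful of joints, fails the first check and is discarded.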

Figure 5.4: Multi-Person HTKS Recognition

In Step 1, the multi-person pose predictor detects body joints as belonging to some number of people in the frame, producing localized joints and their correspondences for each detected person. An example visualization is shown in Figure 5.4a. Here, we see that a lamp was detected as a person, with some body-joint predictions. Also, the light blue edges and dark blue edges correspond to different detected people, even though these edges should belong to the same person. To avoid such false positives in person detection, we place certain constraints on the detections; this is done in Step 2 of our algorithm.

A person detection is considered valid only if at least 10 of the 14 full-body joints are available and localized for the detected person; otherwise it is regarded as a false positive and discarded. This gives us the number NP of participants detected in the frame, i.e., those for which 10 or more body joint localizations are available. For HTKS recognition we need all body parts of the person to lie within the camera frame, and this constraint helps avoid detections where only a few scattered joints appear in the image. We set this number of body joints as the threshold for person detection because we observed that our method can tolerate some occlusion or cropping, as in Figure 5.2c. In our example in Figure 5.4a, the light blue body joints and the lamp detections are discarded, as fewer than 10 body joints are localized for them.

In Step 3, we create a bounding box for each participant in the RGB frame. This bounding box is then used to pre-process the image frame, cropping out a separate RGB frame for each participant. The bounding box has to be of sufficient dimensions to encompass all of the participant's body joints and to provide enough room for the movements of the HTKS gestures; the steps to create it are explained in the section on bounding box calculations. In this way, for each image frame we divide the image into separate images based on the number of participants in the frame, and then feed each separate stream of images as per-participant input to our existing CogniLearn physical activity recognition algorithm, as mentioned in Step 5 of our multi-person activity recognition algorithm.

Bounding Box Calculations

We encapsulate each participant's body joints in the image using bounding boxes. These bounding boxes are then used to crop all images of the entire

video sequence captured, so as to obtain participant-wise RGB frames; each person-wise crop becomes the input to our single-person CogniLearn HTKS recognition algorithm. We need to calculate a bounding rectangle [xmin, ymin, width, height] for each person, such that all body joints detected for the person lie within the box and there is enough space for the person's movements while performing the HTKS gestures. We compute the bounding box size and location from the 14 body joint coordinates. As per our earlier constraint, a person is detected only if at least 10 of the 14 body joints are available, with the mandatory joints of Step 2 present. We compute the bounding box from the person's joint coordinate localizations as follows:

• Calculate the vertical margin VM. This is the space left above the head and below the toes. We use the previous measure δ as the vertical margin:

VM = δ = f / 2, where f = d(htop, n)

• Calculate the horizontal margin HM. This determines the width of the person's bounding box, allowing free arm movements so that the body joints stay within the bounding box while performing HTKS.
– Calculate HM = La, where La is the arm length of that participant.
– If the LS, LE and LH coordinates are available, La = d(n, LS) + d(LS, LE) + d(LE, LH)
– Else, if the RS, RE and RH coordinates are available, La = d(n, RS) + d(RS, RE) + d(RE, RH)
In this way, we calculate the arm length as the distance from neck to shoulder combined with the distance from shoulder to elbow and the

distance from elbow to wrist. This customizes the bounding box for scale invariance, making it suitable for each participant so that we neither lose valuable information nor waste computational resources on extra background input.
• Calculate Xmin = Hx − HM, where Hx is the x coordinate of the head.
• Calculate Ymin = Hy − VM, where Hy is the y coordinate of the head.
• Calculate Width = 2 × La.
• Calculate Height:
– If the coordinates of the right ankle are available, Height = d(H, RA) + VM × 2.
– Else, if the coordinates of the left ankle are available, Height = d(H, LA) + VM × 2.
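The bounding-box computation can be sketched as follows. This is a minimal sketch assuming the named-joint dictionary used earlier, with the horizontal margin (arm length) applied along x and the vertical margin (δ) along y so that Width = 2 × La is consistent; image y is assumed to grow downward:

```python
import math

def bounding_box(joints):
    """[xmin, ymin, width, height] for one participant (a sketch).

    Vertical margin VM = half the face height; horizontal margin
    HM = arm length La (neck -> shoulder -> elbow -> wrist); the head
    is the anchor point. Joint names are the thesis abbreviations.
    """
    d = lambda a, b: math.dist(joints[a], joints[b])
    vm = d("H", "N") / 2                      # VM = delta = face height / 2
    if all(k in joints for k in ("LS", "LE", "LH")):
        la = d("N", "LS") + d("LS", "LE") + d("LE", "LH")
    else:
        la = d("N", "RS") + d("RS", "RE") + d("RE", "RH")
    hx, hy = joints["H"]
    ankle = "RA" if "RA" in joints else "LA"
    height = d("H", ankle) + vm * 2           # head-to-ankle plus margins
    return [hx - la, hy - vm, 2 * la, height]
```

Because both margins derive from the participant's own face height and arm length, the box scales with the person's size and distance from the camera.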

5.4 Child HTKS Activity Recognition

Figure 5.5: Example visualizations for HTKS prediction on the Child CogniLearn Dataset

The CogniLearn HTKS algorithm can be used directly for participants of different age groups, without any fine-tuning or modifications. Our pose estimator supports joint localization predictions for different body types and at different body orientation angles. The CogniLearn HTKS algorithm described in the Physical Activity Recognition section can be employed for single-participant cognitive behavior assessment, and our method for Multi-Person Physical Activity Recognition can be used when more than one child participant is in the frame. By default, the system runs in single-person detection mode to preserve robustness; if there are multiple participants in the frame, multi-person mode can be enabled. We show the recognition accuracy of our method on the Child dataset in the experimental results section of this chapter.

5.5 Comparison to Alternatives
In order to build an end-to-end system like CogniLearn, there are several critical points that need the designer's special attention. To deal with these issues and fine-tune our algorithmic approach, we collected a preliminary dataset that covers all the possible subtasks within the HTKS task. There are 16 variations that can be recognized within the HTKS task according to hand movements. These 16 variations are subsets of the HTKS task where each movement begins at one of the four body parts (H, T, K or S) and ends at one of the four body parts (H, T, K or S).


We gathered a preliminary dataset (see Figure 5.6) by recruiting 5 subjects (3 males and 2 females) and recording snippets for the 16 distinct sub-tasks using Microsoft Kinect for Windows V2. This dataset consists of three different types of multi-modal data, namely RGB, depth, and skeleton-tracking 2D and 3D coordinates. For each sub-task, snippets were recorded for approximately 5 to 7 seconds at 30 fps for all three modalities, and at three different depths of the participant from the Kinect V2 camera. Various approaches can be considered to evaluate the activities that occur during these sub-tasks.

Figure 5.6: Multi-Modal data for preliminary experiments

Our initial approach to the problem was based on holistic recognition, using a Convolutional Neural Network applied to raw video input to obtain frame-wise gesture labels. For both approaches, the current one and the initial one, the input is raw RGB video data, and frame-by-frame analysis of the RGB images of the video is performed for classification into the following four gesture classes: 1–Head, 2–Shoulder, 3–Knees and 4–Toes. Long-term window recognition can then be performed on the output of either gesture classifier by the algorithm shown for score calculations. We considered using static frames as well as motion energy images [87] (see the optic flow feature in Figure 5.6), which could serve as an additional feature for the alternative CNN-based approach. Further, this approach could also be improved using Long Short-Term Memory (LSTM) networks in the same manner as proposed by [62].

5.5.1 Alternative Approach
Note that, in contrast to the alternative discussed in this section, CogniLearn-HTKS's current implementation does not require any training, since we deploy a

pre-trained deep-learning-based pose estimator to get body part coordinates and then perform classification based on distances between the body parts (the CogniLearn HTKS algorithm). The alternative approach that we considered for the gesture classification problem was to treat it as a supervised learning problem. To perform experiments accordingly, we divided our dataset into training, validation and test sections and annotated them image-wise with gesture class labels. The architecture we used was the CaffeNet network, and we performed training using the Caffe [85] framework. To improve the system's accuracy, we initialized the network's weights from a model pre-trained on the 1.2M-image ILSVRC-2012 dataset [88], an ImageNet subset, in order to obtain a powerful model and avoid overfitting. We performed transfer learning and fine-tuned this model using RGB data from our training set (80% of our initial data, consisting of 2 male and 1 female participants) and calculated accuracy on our initial test data (10% of our initial dataset, consisting of 1 male and 1 female participant). We performed user-independent experiments on this dataset and achieved 79.98% accuracy after training this network for 500 iterations. This network's performance can be improved in several ways, such as adding more training data, using additional modalities like flow images and depth information, and using LSTMs as discussed previously. Despite the promising results of this method, there are certain drawbacks that we needed to overcome in order to incorporate this approach into the CogniLearn architecture. The main disadvantage of the method proposed in [62] is the need to fine-tune the proposed model given a set of annotated data from our scenario. Those data would have to represent all the different scenarios in which our system could potentially be used, making the overall task much more complex to represent sufficiently.
For example, our system, as already explained, can potentially be used to analyze motion under various experimental conditions, such as single-person or

with multiple people in the frame, and in study settings for different user groups such as adult, elderly and/or child participants. The goal of our work is to obtain a system that is capable of being deployed in all these different settings and used in various environments: in schools for assessment of cognitive behavior in children, in clinics with adults, children or elderly participants who need assessment and enhancement of cognitive abilities, and for either a single participant or a group study that involves multi-person cognitive assessment. An approach similar to [62] would make it impractical to develop such a robust motion analysis system. An additional drawback of this approach is that it is a supervised learning problem and hence requires labeling all the training data captured under every scenario discussed above. This would be very time consuming and almost impossible to complete in a reasonable amount of time.

Figure 5.7: Participants’ Performance.


5.6 User Performance Calculations
Score calculations for each participant are done at the backend of our system, and their detailed analysis is provided in the Analysis Interface. Each participant performs all 4 Steps of the instructions provided by the recording module, and the system outputs an overall step score as well as accuracy for each individual instruction. It also provides analysis of performance for each individual gesture performed and of where the participant made the most errors. We also provide a measure of reflexes by analyzing how long it took the participant to reach the instructed body part. If the participant does not reach it instantly but gets it right after a certain number of frames, we consider the gesture almost correct. So for each instruction within the step we provide a score: 0 – incorrect, 1 – almost correct, 2 – correct. Every instruction is delivered at a 3-second interval. We stop recording 2 seconds after the last instruction. For every instruction, the instruction begin time is denoted by tbegin and is collected each time a new instruction is delivered by the recording module. The participant is assumed to start performing the gesture at this timestamp. The end of each instruction is denoted by tend, so that tend_i = tbegin_{i+1}, where i is the current instruction and i+1 is the next instruction. To account for the time the participant takes to listen to the instruction and start moving toward the body part, we add 30 milliseconds to each tbegin. Score calculations for each instruction are performed in the following way:
• All frames are captured with a timestamp associated with them. We denote the timestamp of each frame by tf.
• The start time (tbegin) and end time (tend) of each instruction are saved along with the participants' recorded videos and data, and are available for score calculations.


• For the T frames with tf in the range from tbegin to tend, compare the predicted gesture class label with the instruction provided. Keep a count C of the frames within the instruction range that match the instruction. The instruction score is assigned using the equations below:

Performance P = (C / T) × 100%

Instruction Score = 2 if 60% < P ≤ 100%; 1 if 30% < P ≤ 60%; 0 otherwise.

The recording module captures the four Steps performed by the participant as shown in Table 4.1. For Step 1 there are 6 instructions delivered, so the maximum attainable score for this step is 12. Similarly, for Steps 2 to 4 there are 10 instructions delivered, and the maximum attainable score for each of those steps is 20. So the maximum score that each participant can obtain is 72. Our system provides the flexibility to change the instructions or the steps as required by the cognitive experts. For our experiments, to maintain consistency, we used the steps as stated in Table 4.1. In the future, we also want to use reinforcement learning to deliver instructions customized to participant performance. This way the system would be able to deliver instructions at difficulty levels suitable to help with cognitive growth. Figure 5.7 depicts the scores obtained by the 15 participants from our dataset as measured by the CogniLearn system. The graph depicts cumulative and step-wise scores obtained by participants, with confidence intervals (error bars) according to our system accuracy. This is the overall system accuracy over our dataset of 15 participants performing HTKS in single-person mode, the sideway dataset and the multi-person HTKS dataset. Such graph visualizations help us understand participant performance; e.g., in this case we can see that participants 15 and 8 performed the steps really well and attained almost the maximum score, whereas participant 12 did well in the step with no task switching but lost most points in the third step, which has task switching for ‘Head’ and ‘Toes’.
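The per-instruction scoring described in this section can be sketched as a small Python function. This is a minimal illustration under the scoring rules above; the function name and label encoding are assumptions.

```python
def instruction_score(pred_labels, instruction):
    """Score one instruction from frame-wise gesture predictions.

    `pred_labels` are the predicted gesture classes for the T frames
    whose timestamps fall between tbegin and tend; `instruction` is
    the gesture that was announced. Returns (performance_percent,
    score), where score is 0 (incorrect), 1 (almost correct) or
    2 (correct).
    """
    T = len(pred_labels)
    # C counts frames whose predicted class matches the instruction.
    C = sum(1 for label in pred_labels if label == instruction)
    p = 100.0 * C / T
    if 60.0 < p <= 100.0:
        score = 2   # correct
    elif 30.0 < p <= 60.0:
        score = 1   # almost correct
    else:
        score = 0   # incorrect
    return p, score
```

For example, if 7 of the 10 frames in an instruction window are classified as the instructed gesture, the performance is 70% and the instruction score is 2.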

5.7 Experimental Results
The performance of our system is measured in terms of the accuracy of our algorithm in classifying each image into one of the four gesture classes: Head, Shoulder, Knees or Toes. Step-1 of each recording is captured without any task switching, and we utilize data from this step to obtain ground truth instructions for accuracy measurement. Our CogniLearn HTKS recognition algorithm takes as input RGB image frames for each participant. The classification of gestures for each of these frames is solved, without any supervised training, by our computer vision and deep learning based algorithm, which performs image-wise analysis of these frames and delivers confidence scores for the four gesture classes for each frame. Therefore, for each image an output label is generated that predicts the class of the performed HTKS gesture. For each image of the captured Step-1 RGB data for the participant, we match these predicted labels from the system against ground truth gesture class annotations. We manually annotate Step-1 recordings for all 15 participants (9 male and 6 female) from our Single-Person CogniLearn dataset, which constitute a total of 4,443 RGB frames. We also annotate Step-1 recordings for the captured Side Pose dataset and for Multi-Person HTKS recognition. The Side Pose dataset consists of 3 participants and constitutes a total of 1,149 RGB frames in the Step-1 recordings. For Multi-Person dataset accuracy calculations, we manually annotate the images obtained as output of the Multi-Person Recognition algorithm, which provides RGB frames for each participant individually.

These are the RGB frames obtained after cropping takes place according to the bounding boxes around each participant. The prediction labels for these images are generated by inputting them to the CogniLearn HTKS algorithm for each participant individually, which is the last step of the Multi-Person HTKS recognition algorithm. These predicted labels are then compared to the ground truth annotations. In the Multi-Person dataset, Step-1 is recorded with two participants in a single frame, but accuracy is calculated for each participant individually. We calculate accuracy on 353 RGB frames for participant 1 and 358 RGB frames for participant 2, who perform HTKS simultaneously. These numbers differ because the two participants start performing the gestures at different timestamps, at tbegin of the first instruction, and end at varying timestamps before tend of the last instruction. This gives us a total of 711 frames for the participants from the Multi-Person dataset. Step-1 consists of the instructions H, S, K, T, K, T performed by these participants. Here, the first 5 instructions are captured for 3 seconds each and the last instruction for 2 seconds. All recordings are at a frame rate of approximately 30 fps. For the Child HTKS dataset, we manually annotate the 198 frames available, and accuracy is calculated for HTKS recognition over all those 198 frames. Our experiments are all user-independent, as we do not perform any model training for our deep learning computations; the deep learning model used provides inference and does not rely on the captured participant data for training. For the Single-Person CogniLearn-HTKS dataset, the accuracy obtained over 4,443 frames after inclusion of the improvements to the algorithm is 97.54%. This outperforms our previous method [17], which was 92.54% accurate for the same dataset and employed only Rule-1 of our Algorithmic Improvement section.
In this work, we improve our algorithm further by adding Rules 2 and 3, achieving a performance improvement of 5%.







Figure 5.8: Confusion Matrix for system performance for: a.) With improvements from [17], Rule-1 only b.) Our Improvements by adding Rules-1 and 3 c.) Sideway Poses with offset deduction for “Head” and “Toes” gesture class (Without Rule-2), d.) Improved Performance for Sideway Poses with Rule-2, e.) Multi-Person Dataset Recognition with Improvement (all 3 Rules), f.) Child Dataset Recognition with Improvement (all 3 Rules).


Figure 5.9: Gesture-wise system performance

Overall, incorporation of these 3 rules gives a substantial improvement of 12.5% over the system accuracy obtained from the algorithm without any improvements (without incorporation of any of the Rules), which has an accuracy of 85.05% for the same dataset. We observe a substantial improvement in correct classification of the “Head” gesture class through incorporation of Rule-1. This shows that half of the face height works as a good offset value for this gesture class. Incorporation of Rule-3 gave a very good improvement for the “Toes” gesture class, where the number of correctly classified frames increased from 721 to 938. Figure 5.8a shows the confusion matrix for Single-Person HTKS recognition without incorporation of the offsets or improvements to the algorithm. Figure 5.8b shows the confusion matrix with improvements for the same 4,443 RGB frames. For the Side Poses dataset, we observed that inclusion of the offset deduction for the ‘Head’ body part increased the detection errors for the “Shoulder” gesture class. This is shown in Figure 5.8c, the confusion matrix for the 1,149 RGB images from the Side Pose dataset. Incorporating Rule-1 here gives an accuracy of

92.16% for the recognition of the HTKS gesture classes on the Side Poses Step-1 RGB data frames. Most errors here are in recognition of the “Shoulder” class, due to the offset deduction that had provided a good improvement for front-facing poses. Based on this, we added a constraint to our algorithm to avoid offset deductions for ‘Head’ depending on the pose orientation angle of the participant; offsets are still deducted for ‘Toes’. This is the constraint found in Rule-2, which does not deduct the offset from ‘Head’ for sideways orientations. An accuracy of 97.56% is obtained after placing this constraint. Figure 5.8d provides the confusion matrix obtained after placing this constraint. For the Multi-Person dataset, an accuracy of 96.77% is obtained for HTKS recognition. Figure 5.8e shows the confusion matrix for this Multi-Person HTKS recognition, performed over 711 RGB images. This is the recognition accuracy of our finalized CogniLearn HTKS algorithm, which incorporates offset deductions based on participant poses. Moreover, the Multi-Person HTKS algorithm accurately detects the two participants in the first frame without any false positives in people detection. It also correctly crops the images from the original RGB frames to fetch individual participants' RGB frames that contain the full-body joints for one participant in each image. For the Child dataset, our CogniLearn HTKS algorithm obtains an accuracy of 98.98% over the 198 RGB data frames. Figure 5.8f depicts the confusion matrix for this recognition on the Child dataset.


CHAPTER 6
Physical Activity Recognition for CogniLearn-RSM exercises

In this chapter, the development of the CogniLearn-RSM system is discussed, which helps perform cognitive behavior assessment with physical exercises from the RSM test battery defined in Section 2.1.2. CogniLearn-RSM provides a framework to support different physical exercises from Table 2.1, namely “Lateral Preference Patterns”, “Gross Motor Gait and Balance” test 1, “Bilateral Coordination and Response Inhibition” tests 1 & 2, “Finger-Nose Coordination” and “Rapid Sequential Movements” tests 1 to 4. Computer vision methods for articulated motion tracking of human body parts like arms, legs, hands (finger movements), torso and face provide useful features for physical activity recognition. In this work, we employ the deep learning network proposed in [68] for body keypoint detection. We use the pose estimator to track 18 full-body joints: Nose, Neck, RShoulder, RElbow, RWrist, LShoulder, LElbow, LWrist, RHip, RKnee, RAnkle, LHip, LKnee, LAnkle, REye, LEye, REar, LEar. These joint coordinates for each participant are then utilized in our recognition algorithms for physical exercises. The following section describes in detail the physical exercise analysis for exercises involving rapid sequential hand movements, specifically hand keypoint detection, finger movement recognition and gesture classification.

6.1 Analyzing Rapid Sequential Movements
The following physical exercises are designed for cognitive behavior monitoring based on the performance of rapid sequential hand movements. These are movements


with hands and fingers and require wrist and finger keypoint detection. Descriptions of the exercises and their analysis are as follows:

6.1.1 Sequential Finger Movements
The following exercises require motion analysis of a participant's fingers while performing sequential movements involving hand gestures.

Finger Tap
This exercise is performed for 10 seconds, during which the participant is instructed to tap the index finger against the thumb as fast as possible. Performance is measured in terms of a raw score num_ft, the total number of accurately completed finger taps. If num_ft is between 0 and 9, then R_ft = 0; if num_ft is between 10 and 19, then R_ft = 1; and if num_ft is greater than 19, R_ft = 2. The rhythm/sequencing score S_ft is based on smoothness of motion, e.g., whether the motion freezes up or other movements intrude. If movement stops and doesn't restart, or switches to other movements, then S_ft = 0; if movement stops but the correct movement restarts, then S_ft = 1; and for continuous movement, S_ft = 2. Recognition of the movements depends on the accuracy of the thumb and index finger tip keypoint detections (TT, IT). If the Euclidean distance between TT and IT is less than a certain threshold T, we classify the frame as containing a tap movement. The algorithm to detect the number of finger taps num_ft based on TT and IT over a sequence of N frames is as follows:
• Max ← 0, Min ← ∞
• For k = 1 to N:
  – Calculate d_it, the Euclidean distance from the thumb tip (TT) to the index finger tip (IT).
  – If d_it > Max, then Max ← d_it
  – If d_it < Min, then Min ← d_it

  – If Max > T and Min < T and d_it < T, then num_ft ← num_ft + 1, Max ← 0, Min ← ∞, TapStart ← k.
• end for
Hence, the tap count is increased when the finger tip distance d_it falls below the threshold T; consecutive frames with d_it < T are counted as a single tap. The tap start is obtained from the Min and Max variables. In this way the total number of taps is detected, as well as their timestamped start positions in the captured sequence.

Appose Finger Succession
In this exercise, the participant is instructed to tap each finger against the thumb in the order Thumb to Index, Middle, Ring and Little finger, respectively. This sequence is to be repeated as fast as possible in a 10-second interval.
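The tap-counting loop of the Finger Tap exercise can be sketched in Python. This is a minimal illustration: the per-frame distance list and the threshold value are assumed inputs (the thumb-to-index distances would come from the keypoint detector).

```python
import math

def count_finger_taps(distances, threshold):
    """Count finger taps from per-frame thumb-to-index tip distances.

    Implements the Max/Min tracking loop: a tap is registered when the
    distance has risen above the threshold since the last tap
    (Max > T) and then falls below it again (d < T). Consecutive
    below-threshold frames therefore count as a single tap.
    Returns (num_ft, tap_start_frames).
    """
    num_ft = 0
    tap_starts = []
    mx, mn = 0.0, math.inf
    for k, d in enumerate(distances):
        mx = max(mx, d)            # If d > Max, Max <- d
        mn = min(mn, d)            # If d < Min, Min <- d
        if mx > threshold and mn < threshold and d < threshold:
            num_ft += 1
            tap_starts.append(k)   # TapStart <- k
            mx, mn = 0.0, math.inf # reset for the next tap
    return num_ft, tap_starts

# Hypothetical distance trace (in mm): open, open, tap, tap, open, tap.
taps, starts = count_finger_taps([50, 50, 10, 10, 50, 10], threshold=36)
```

The reset of Max to 0 after each detected tap forces the fingers to separate beyond the threshold again before the next tap can be counted, which is what collapses runs of below-threshold frames into a single tap.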


Figure 6.1: Finger Tap Keypoint detection

The participant is told not to tap in the backwards order. The participant's performance is evaluated on the basis of speed and accuracy. Performance is measured in terms of a raw score num_afs, the total number of accurately completed finger succession sets. If num_afs is between 0 and 2, then R_afs = 0; if num_afs is between 3 and 4, then R_afs = 1; and if num_afs is 5 or greater, R_afs = 2. The rhythm/sequencing score S_afs is based on smoothness of motion, e.g., whether the motion freezes up or other movements intrude. If movement stops and doesn't restart, or switches to other

movements, then S_afs = 0; if movement stops but the correct movement restarts, then S_afs = 1; and for continuous movement, S_afs = 2. Recognition of the movements depends on the accuracy of the hand tip keypoint detections TT, IT, TM, TR, TL. The distances between these keypoint locations help determine the finger movements performed by the participant. The confidence scores for each of the four tap gesture classes (1–Index, 2–Middle, 3–Ring and 4–Little) are obtained as follows:
• Calculate d_it, the Euclidean distance from the thumb tip (TT) to the index finger tip (IT).
• Similarly, calculate the Euclidean distances d_tm, d_tr, d_tl from the thumb tip (TT) coordinates to the three finger tips TM, TR and TL, respectively.
• Compute scores z_1, z_2, z_3 and z_4 from the distances (where x ∈ 1 to 4): z_x = 1/d_x
• Perform a softmax over the scores to obtain probabilities:

σ(l)_x = e^{z_x} / Σ_{x=1}^{4} e^{z_x}

This provides the probability of each particular tap in any given frame. The tap gesture class with the highest σ(l)_x is chosen as the tap gesture class for that frame. For sequential motion analysis, time series alignment methods like dynamic time warping help obtain the similarity between the expected tap sequence and the performed tap sequence. To validate our hand movement recognition algorithm based on finger keypoint detections, we perform an initial analysis based on finger tip locations obtained using a sensor device. For this purpose, we use the Leap Motion controller to obtain the 3D finger tip locations. Further, we provide an analysis of utilizing pre-trained models from [66] and [18] to extract 2D hand keypoints from visual features of color image

frames. Details about the data collection using a Leap Motion sensor and an RGB camera are provided in Section 6.2. An analysis of vision-based hand keypoint detection on this captured data is provided in Section 6.3. The keypoint features of most value are the five finger tips, namely TT, IT, TM, TR, TL, for the tips of the thumb, index, middle, ring and little finger, respectively. Figure 6.1 shows visualizations of the finger keypoints on a participant performing finger tapping gestures.
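The per-frame tap classification from Section 6.1.1 (inverse-distance scores followed by a softmax) can be sketched as follows. The distance values in the usage example are hypothetical.

```python
import math

# Tap gesture classes in the order of the scores z_1..z_4.
FINGERS = ('Index', 'Middle', 'Ring', 'Little')

def tap_class_probabilities(d_it, d_tm, d_tr, d_tl):
    """Softmax over inverse thumb-to-fingertip distances.

    z_x = 1/d_x, then sigma(l)_x = exp(z_x) / sum_x exp(z_x).
    Distances must be nonzero (in practice a small epsilon would
    guard against a detected distance of exactly zero).
    """
    z = [1.0 / d for d in (d_it, d_tm, d_tr, d_tl)]
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify_tap(d_it, d_tm, d_tr, d_tl):
    """Pick the tap class with the highest softmax probability."""
    probs = tap_class_probabilities(d_it, d_tm, d_tr, d_tl)
    return FINGERS[probs.index(max(probs))]

# Hypothetical pixel distances: thumb is closest to the index tip.
probs = tap_class_probabilities(5.0, 40.0, 45.0, 50.0)
```

Since z_x grows as the distance shrinks, the finger currently touching (or nearest to) the thumb receives the largest probability.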

6.1.2 Sequential Hand Movements
These exercises consist of rapid hand movements by the participant. The basic version of this exercise is “Hand Pat”, where the participant pats their hand on the lap rapidly; another version, “Hand Pronate/Supinate”, requires the participant to flip the hand with every pat, i.e., pat with the palm and next time pat with the back of the hand.

Hand Pat
In this exercise, the participant is instructed to pat one hand on the lap as fast as possible. This movement is measured for 10 seconds. Performance is measured in terms of a raw score num_hp, the total number of accurately completed hand pats. If num_hp is between 0 and 9, then R_hp = 0; if num_hp is between 10 and 19, then R_hp = 1; and if num_hp is greater than 19, R_hp = 2. The rhythm/sequencing score S_hp is based on smoothness of motion, e.g., whether the motion freezes up or other movements intrude. If movement stops and doesn't restart, or switches to other movements, then S_hp = 0; if movement stops but the correct movement restarts, then S_hp = 1; and for continuous movement, S_hp = 2. We can analyze this motion by observing the joint angles between the shoulder, elbow and wrist of the hand of interest.

Hand Pronate/Supinate
In this exercise, the participant needs to flip the hand with every pat. Pronation describes rotating the forearm into a palm-down position, and supination describes rotating the forearm into a palm-up position.


Figure 6.2: Hand pronate/supinate keypoint detection

Figure 6.2 shows the hand pronate/supinate action. This exercise is performed for 10 seconds, during which the participant is instructed to pat one hand on the lap as fast as possible with the pronate/supinate action. Performance is measured in terms of a raw score num_hps, the total number of accurately completed hand pats. If num_hps is between 0 and 4, then R_hps = 0; if num_hps is between 5 and 9, then R_hps = 1; and if num_hps is 10 or greater, R_hps = 2. The rhythm/sequencing score S_hps is calculated similarly to that of the “Hand Pat” exercise. For detecting the pronate/supinate action, the visual feature of importance is the direction of the thumb with respect to the hand/wrist or the other fingers: if the thumb points outwards the action is determined to be pronation, and if the thumb points inwards the movement is supination. The pat motion is analyzed by observing the joint angles between the shoulder, elbow and wrist of the hand of interest.
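The thumb-direction heuristic above can be sketched in Python. This is only an illustration of the idea: the sign convention (image x grows rightward, participant facing the camera) and the definition of "inward" per hand are assumptions, and a real implementation would have to calibrate them against the camera setup.

```python
def pronation_state(thumb_x, wrist_x, hand='right'):
    """Classify a pat as pronation (palm down) or supination (palm up)
    from the thumb tip direction relative to the wrist.

    Per the text: thumb pointing outwards -> pronation, thumb pointing
    inwards -> supination. Which image direction counts as "inward" is
    an assumed convention here and depends on the camera mirroring.
    """
    # Assumed convention: for the right hand, a thumb left of the
    # wrist in the image points toward the body midline ("inward").
    inward = thumb_x < wrist_x if hand == 'right' else thumb_x > wrist_x
    return 'supination' if inward else 'pronation'
```

Applying this per frame and counting transitions between the two states would give the number of completed pronate/supinate flips, analogously to the tap counting in Section 6.1.1.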


6.2 Dataset Collection
This section describes the rapid sequential movement exercises captured in this work, the collected RGB dataset, and the data captured using a finger keypoint detection sensor.

6.2.1 Captured RSM Exercises
For the CogniLearn-RSM framework development, a dataset is collected for these exercises, with images of participants performing the gestures corresponding to “finger appose” and “appose finger succession”. These physical exercises are designed for cognitive behavior monitoring based on the performance of rapid sequential movements. The exercises consist of movements with hands and fingers and require wrist and finger keypoint detection. The physical exercise dataset captured towards building CogniLearn-RSM consists of the Sequential Finger Movements described previously. A brief overview of these exercises and the instructions for their data capture as part of this work follows:

Finger Tapping
The participant is instructed to tap the index finger against the thumb as fast as possible. This movement is performed for 10 seconds. The number of finger taps collected varies based on the participant's performance. This exercise is performed with different orientations of the hand with respect to the camera/Leap Motion device. Recognition of these movements depends on the accuracy of the thumb and index finger tip keypoint detections (TT, IT). The distance between TT and IT determines whether a tap movement is recognized, as described in Section


Figure 6.3: Appose Finger Succession

Appose Finger Succession
For this exercise's data collection, the participant is instructed to tap each finger against the thumb in the order Thumb to Index, Middle, Ring and Little finger, respectively. This sequence is to be repeated as fast as possible. The participant is told not to tap in the backwards order. The keypoint features of most value are the five finger tips, namely TT, IT, TM, TR, TL, for the tips of the thumb, index, middle, ring and little finger, respectively. The participant's performance on the test set would be evaluated on the basis of speed and accuracy. Figure 6.1 shows visualizations of the finger keypoints on a participant performing finger tapping gestures. The distances between the detected hand keypoint locations can be used to understand the finger movements performed by the participant, as described previously in Section. These finger tip locations can be obtained using a sensor, as described below. As part of this work, we also evaluate the accuracy of vision-based finger keypoint detection from single RGB image frames. The sensor-based data collection and the Hand Keypoints dataset collected as part of this work are described further.
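Section 6.1.1 suggests comparing the expected tap order against the performed sequence with a time-series alignment such as dynamic time warping. A minimal pure-Python sketch of classic DTW over label sequences follows; the single-letter finger labels are an illustrative encoding.

```python
import math

def dtw_distance(seq_a, seq_b, cost=lambda a, b: 0 if a == b else 1):
    """Classic dynamic-time-warping distance between two label
    sequences, e.g. the expected tap order and the performed taps.
    A distance of 0 means the performed sequence follows the expected
    order (repetitions of a label align at no cost)."""
    n, m = len(seq_a), len(seq_b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            # Best of insertion, deletion, or match/substitution.
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Two ideal succession sets vs. a performance with a duplicated 'I'.
expected = ['I', 'M', 'R', 'L'] * 2
performed = ['I', 'I', 'M', 'R', 'L', 'I', 'M', 'R', 'L']
```

Here a repeated tap of the same finger aligns to the expected sequence at zero cost, while an out-of-order or wrong finger contributes a substitution cost, so larger DTW distances indicate larger deviations from the instructed order.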

6.2.2 Sensors
Microsoft Kinect is used to capture RGB data frames for performing vision-based analysis. In addition to RGB images, we also perform Leap Motion sensor based data collection for finger keypoint detection. Finger keypoint coordinates are captured using the following sensor device:

Leap Motion Controller
The Leap Motion controller is a small peripheral device capable of sensing hand and finger movements. It consists of an array of infrared depth sensors that observe a hemispherical area to a distance of about 1 meter. This depth sensing, along with the accompanying detection algorithm, provides 3D touch-free motion sensing and motion control capabilities. The field of view of a Leap Motion sensor is 150° on the long side, 120° on the short side, and at most 60 cm above and on each side of the controller. This is shown in Figure 6.5.

Figure 6.4: Leap Motion Controller

Figure 6.5: Leap Motion Controller Field of View

As part of this work, timestamped 3D coordinates of the stabilized tip positions and the physical positions of the metacarpophalangeal joint (knuckle) of all five fingers are captured using the Leap Motion controller. The finger tapping and finger apposition exercises are captured for 8 participants using this sensor. The participants were instructed to perform the above-mentioned exercises for 10 seconds each, following the exercise rules. The scores obtained using these finger tip locations from the Leap controller in the algorithms above are completely accurate in detecting the taps. The threshold T of the tap detection algorithm was chosen as 36 mm for the captured sequences. For vision-based finger tip location detection, we capture RGB frames as described next.

6.2.3 Hand Keypoints Dataset
As part of this thesis, the Hand Keypoints Dataset (HKD) is generated, which comprises annotated RGB images captured of participants while they perform rhythmic finger movements. The dataset comprises 782 color image frames captured from four different participants.

Figure 6.6: Example annotations of cropped images from HKD dataset

The rapid sequential rhythmic finger movements performed are as described above. This is a novel benchmark dataset for hand keypoint detection and/or tracking from RGB images. We provide original frames with annotations, as well as annotated frames loosely cropped around the centroid location of the hand in the frame. We provide annotations for 6 hand keypoints, namely Wrist (W), Thumb tip (TT), Index finger tip (IT), Middle finger tip (TM), Ring finger tip (TR) and Little finger tip (TL). We also provide the hand centroid location in the original RGB frames.

Our dataset and code to display the annotations are available from:

6.3 Vision-based Hand Keypoint Recognition

In our experiments, we evaluate the performance of deep-learning-based hand keypoint detectors on our collected HKD dataset. We perform experiments with current state-of-the-art methods for hand keypoint localization on all 782 original image frames.

Figure 6.7: Variance of hand keypoints of Subject1 in HKD.

Figure 6.8: Visualization of Method1 on HKD - Top Left, Top Right, Bottom Left. Bottom Right - Visualization of Ground Truth Annotations of 6 keypoints.

We also provide subject-wise keypoint detection accuracy and per-keypoint accuracy for these methods, as described in Section 6.3.2.


Figure 6.9: Visualization of 21 keypoints estimated on HKD by Method1 [18] (left and center). Visualization of the 6 keypoints of interest for our dataset.

The first method (referred to as "Method1" in the experiments) that we have chosen for evaluation is the network in [18]. In this work, a hand segmentation network is used to detect the hand and crop it. After this segmentation, the image is passed as input to the PoseNet network, which localizes 21 hand keypoints. We use the pre-trained hand segmentation network and PoseNet provided with this work. This model is trained on the Rendered Handpose Dataset (RHD), which consists of 41,258 synthetic RGB training images. We evaluate the performance of this network on the original image frames from HKD for localization of our 6 keypoints of interest. The PoseNet network from this work is based on the work in [39], and the output of the network consists of heatmaps (one heatmap for each keypoint). The second method (referred to as "Method2") that we have chosen for evaluation is the network from [19], which achieves state-of-the-art performance on hand keypoint localization. The keypoint detection model architecture in this work is also similar to the work in [39]; this is why we chose this method to compare against Method1 [18]. The difference here is that, for feature extraction, convolutional layers from a pre-initialized VGG-19 network are used as the initial convolutional stages. The pre-trained model from this paper is trained on hand data from multiple cameras using bootstrapping. This helps to correctly train the keypoints despite the occlusions

from other hand joints, as the same pose is observed from multiple cameras. This method also detects 21 keypoints, out of which we select the 6 keypoints annotated in our dataset. The pre-trained model from this work is trained on RGB images captured from multiple views (51 views) simultaneously at each timestamp. 2D keypoint detections from one frame are triangulated geometrically and used as training labels for a different view's frame where that keypoint may be occluded. This is a boosted approach that learns keypoint labels from multiple views and improves on the supervised weak learners.
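Both networks described above output one heatmap per keypoint, and a 2D location is read off at each heatmap's maximum response. A minimal sketch of this decoding step follows; the array shapes and function name are illustrative assumptions, not the authors' code:

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps):
    """Decode (K, H, W) per-keypoint heatmaps into a (K, 2) array of
    (x, y) pixel coordinates by taking each channel's argmax."""
    K, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)  # flat argmax per channel
    ys, xs = np.unravel_index(flat_idx, (H, W))        # back to row/col indices
    return np.stack([xs, ys], axis=1)
```

For Method1, coordinates decoded this way from the cropped hand image would still need to be mapped back to the original frame before evaluation.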

6.3.1 Evaluation Protocol

We report the average End Point Error (EPE) in pixels for each keypoint and the Percentage of Correct Keypoints (PCK) to measure accuracy. For each prediction, if the EPE is lower than a threshold, then the keypoint is classified as correctly predicted. We evaluate the hand keypoint detection EPE over several thresholds. We have set an upper limit on this threshold based on the detected keypoints. A reasonable upper limit for this threshold is half of the average distance of each keypoint from the centroid of the hand region in the image. Overall, EPE is shown in terms of the Percentage of Correct Keypoints (PCK), and we plot the results on the HKD dataset in Section 6.3.2. The hand region is determined by the convex hull formed by connecting the six annotated keypoints of the hand. Due to the large number of DOFs of the hand, we perform the following steps to find the threshold:
• For each image frame i, find the 2D convex hull of the six predicted keypoints. The convex hull of the set of keypoints is the smallest convex polygon that contains all six keypoints. Figure 6.10 shows a convex hull visualization.
• Estimate the centroid coordinate C of the convex hull.

Figure 6.10: Example visualization of convex hull from 6 keypoints. Green + is the centroid.

• For each of the six predicted keypoints W, TT, IT, TM, TR, TL, find d_w, d_tt, d_it, d_tm, d_tr and d_tl as the Euclidean distances from C.
• Let d_i be the average distance to the centroid for image frame i, d_i = (d_w + d_tt + d_it + d_tm + d_tr + d_tl) / 6.
• Then, for n being the number of frames (in our case n = 782), the threshold T is computed as:

T = (1 / (2n)) × Σ_{i=1}^{n} d_i

In this way, we employ a convex-hull-based approach to estimate an upper-limit pixel threshold for evaluating correct keypoint detections. This value T is 45 pixels for the HKD cropped dataset and 12 pixels for the HKD original-frames dataset.
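Assuming per-frame keypoints are available as (x, y) arrays, the threshold estimation and PCK computation described above can be sketched as follows. This is a NumPy-based sketch, not the thesis code; in particular, the convex hull centroid is approximated here by the mean of the six keypoints, since with only six points most or all of them lie on the hull:

```python
import numpy as np

def pck_threshold(keypoints_per_frame):
    """Upper-limit pixel threshold T from the convex-hull-based protocol:
    T = (sum of per-frame average keypoint-to-centroid distances) / (2 n)."""
    avg_dists = []
    for kps in keypoints_per_frame:
        kps = np.asarray(kps, dtype=float)      # (6, 2): W, TT, IT, TM, TR, TL
        centroid = kps.mean(axis=0)             # approximation of the hull centroid C
        d = np.linalg.norm(kps - centroid, axis=1)
        avg_dists.append(d.mean())              # d_i for this frame
    return sum(avg_dists) / (2 * len(avg_dists))

def pck(pred, gt, thresholds):
    """Percentage of Correct Keypoints: a keypoint is correct when its
    end-point error (EPE, Euclidean pixel distance) is below the threshold."""
    epe = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return [float((epe < t).mean()) for t in thresholds]
```

Sweeping `pck` over a range of thresholds up to T produces the PCK-vs-threshold curves plotted in the experiments.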

6.3.2 Experimental Results

We provide graphs to demonstrate the evaluation of Method1 [18] and Method2 [19] on our dataset. The overall accuracy of both methods on HKD is shown in Figure 6.12. This plot shows the Percentage of Correct Keypoints (PCK) detections (y-axis) against different values of pixel thresholds (x-axis). For a threshold value of 11,

the performance of the model from [18] is approximately 5% and that of the model from [19] is 60%.

Figure 6.11: Visualization of Method2 (model from [19]) on HKD.

Figure 6.12: Accuracy on HKD by Method1 ([18]) and Method2 ([19]).

Figure 6.13: Accuracy for each participant in HKD by Method1 ([18]).

Figure 6.14: Accuracy for each participant in HKD by Method2 ([19]).

We provide the accuracy attained on each participant's frames for Method1 [18] in Figure 6.13. Here, the Percentage of Correct Keypoints (PCK) vs. pixel thresholds is shown for each of the four participants' frames of the HKD dataset. Similarly, in Figure 6.14 we plot PCK vs. pixel thresholds to show the accuracy for each of the four HKD participants' frames separately for Method2 [19]. These are important benchmarks for performing subject-independent tests on HKD. The results demonstrated by these plots show that the performance of both methods on subject 3 is higher than for the other participants. The performance for subjects 3 and 4 is approximately 10-15% lower than for the other subjects. We also measure the accuracy attained for each of the six keypoints (W, TT, IT, TM, TR, TL) individually in Figures 6.15 and 6.16. The plot in Figure 6.15 shows per-keypoint PCK vs. pixel threshold for Method1 [18], and the plot in Figure 6.16 shows per-keypoint PCK vs. pixel threshold for Method2 [19].

Figure 6.15: Accuracy for each of 6 keypoints by Method1 ([18]).

Figure 6.16: Accuracy for each of 6 keypoints by Method2 ([19]).

These plots help contrast the performance of both methods for each of the six keypoints. For both methods, wrist detection performance is better than detection of the fingertip keypoints. Overall, the method in [19] outperforms the method in [18], but the trends in PCK for the subject-wise and keypoint-wise accuracies are similar. The method in [19] leverages a boosted approach, improving the accuracy of the trained classifier by refining weakly supervised learning using multiview geometry. Here, keypoints occluded in one view's frames are projected from another camera view's frame where they are visible, and the training is refined with the help of these new labels. The pre-trained model for this method has been

trained on these multiview RGB hand image frames. In contrast, the hand keypoint detector in [18] is trained on synthetic RGB images from the Rendered Handpose Dataset (RHD). The results can also be taken as an indication that higher accuracy can be achieved by training a model on real RGB images, as compared to training on images from a synthetic dataset.


CHAPTER 7 Conclusions and Future Work

In this dissertation, I have presented my work on creating an intelligent system capable of assisting cognitive therapists with the diagnosis of ADHD, using state-of-the-art machine learning and computer vision. The architecture of the proposed framework is based on two major components: a) a Back-End module, which performs state-of-the-art motion analysis using traditional computer vision and modern deep-learning techniques and estimates the user's performance, and b) a Front-End module, which consists of two separate user interfaces and is responsible for assisting human experts with data capturing and user performance analysis. Support for two different cognitive test batteries has been discussed in this work. CogniLearn-HTKS is, to our knowledge, one of the few end-to-end systems that tries to evaluate cognitive performance related to task switching. It achieves this by computerizing current ADHD assessment practices and providing objective and quantitative metrics. The proposed Motion-Analysis module supports a robust architecture, able to compute the user's performance under varying, real-life environmental conditions, offering a multi-person, subject-independent and point-of-view-invariant mechanism. It is thus well suited to deal, in the future, with other similar physical exercises designed for cognitive assessment. CogniLearn's Front-End module, which is specifically designed to capture and visualize the physical-exercise motion data and their analysis, is one of the first efforts towards bridging the gap between two different points of view on the same aspect, as seen from the fields of Computer and Cognitive sciences respectively.

Future goals of this work are focused on four major directions. The first is to expand CogniLearn's functionalities, as discussed in Section 7.1. Additionally, the focus is on improving the system's accuracy and performance by further optimizing the proposed motion analysis module. Moreover, a primary piece of future work is to develop new methods for extracting additional metrics of interest, such as measurements related to the subject's cognitive load, attention levels, and unintended motion levels, and to extend experimentation to additional real subjects. Finally, the ultimate goal is to enrich CogniLearn's capabilities by adding supplementary exercises, other than HTKS and RSM, as mentioned in Chapter 2, with suggestions towards an embodied cognition test shown in Table 2.1, and to provide a compact tool for assisting therapists in treating and diagnosing ADHD.

7.1 Future Work

Future steps for the CogniLearn system are aimed at improving our evaluation constraints and expanding the system's current functionality. Primarily, future work needs to be done on expanding the available exercises supported by the CogniLearn architecture [89, 90]. Our vision is to develop a battery of physical exercises specifically designed for the cognitive assessment of children and adults. The goal is to target as many cognitive aspects as possible, including working memory, task switching, sequencing ability and processing speed. Another important piece of future work is to expand the system's functionality to support remote monitoring and real-time data analysis. These features will provide greater freedom to the experts and will enhance the outcomes of remote assessment. Currently, task administrators are not usually cognitive experts. They are mostly teachers who manually observe the children performing the assessment tasks and afterwards report to the human experts using traditional methods.

Remote data analysis aims to bridge the gap between subjects and experts by adding the expert in the loop. This will be achieved by providing the ability to remotely observe multiple subjects that require special attention and to modify their assessment accordingly, on the fly if needed. Additionally, it will significantly reduce the time required to obtain assessment results and will help speed up the rehabilitation process in special cases.

Figure 7.1: Design diagram for the proposed remote system

Towards that direction, we propose the system architecture shown in Figure 7.1. The "CogniLearn User Entity" corresponds to the node at the school or data collection site. The "Supercomputer/Remote Host" corresponds to the machine on which the data analysis is performed. The "Remote Professional Entity" is the node of the user viewing the real-time analysis. Prior to the recording sessions, each user must log in, so that the data collected can be annotated with a user ID, and so that the remote professional can gain access to the data, and only the data they are verified for, according to an unspecified login system. At the start of a recording session, the CogniLearn User Entity establishes a secured connection with the supercomputer, with a user and session ID. Then, as each

frame is collected, it is compressed to the size used by the algorithm, annotated with the correct intended move, encrypted, and sent over for analysis. Upon receiving the data, the supercomputer decrypts and parses the data, then immediately performs analysis on the frame. When complete, the original frame and the corresponding prediction are written to the database and sent to any connected professional. At any point, the Remote Professional Entity can establish a connection, with proper credentials, to gain access. In the live-stream view, they will be able to receive frames and predictions as soon as they become available, without database overhead. To step into a more detailed analysis of a recording, the session is requested from the supercomputer, which pulls it from the database and sends it to the Remote Professional Entity. Using this described system, a remote professional could take advantage of our data analysis in real time while the exercises are being performed, allowing real-time feedback and a dynamic, scalable system.
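The per-frame client-side packaging and server-side parsing described above could be sketched as follows. The function names and JSON message layout are hypothetical, and the encryption step is omitted for brevity (a real deployment would encrypt the payload before transmission):

```python
import base64
import json
import zlib

def package_frame(frame_bytes, user_id, session_id, intended_move):
    """Client side: compress the frame, annotate it with session metadata
    and the intended move, and serialize for transport. Encryption omitted."""
    payload = {
        "user": user_id,
        "session": session_id,
        "move": intended_move,
        "frame": base64.b64encode(zlib.compress(frame_bytes)).decode("ascii"),
    }
    return json.dumps(payload).encode("utf-8")

def unpack_frame(message):
    """Server side: parse the message and decompress the frame, returning
    (user_id, session_id, intended_move, frame_bytes) for analysis."""
    payload = json.loads(message.decode("utf-8"))
    frame = zlib.decompress(base64.b64decode(payload["frame"]))
    return payload["user"], payload["session"], payload["move"], frame
```

Keeping the metadata alongside each frame lets the supercomputer write the frame and its prediction to the database under the correct user and session IDs.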


APPENDIX A List of Publications


A.1 Publications related to this work

1. Srujana Gattupalli, Ashwin Ramesh Babu, James Robert Brady, Fillia Makedon, and Vassilis Athitsos, Towards Deep Learning Based Hand Keypoints Detection for Rapid Sequential Movements from RGB Images, Pervasive Technologies Related to Assistive Environments (PETRA), June 2018.
2. Varun Kanal, Maher Abujelala, Srujana Gattupalli, Vassilis Athitsos, Fillia Makedon, APSEN: PreScreening Tool for Sleep Apnea in a Home Environment, International Conference on Human-Computer Interaction, Vancouver, Canada, July 2017.
3. Amir Ghaderi, Srujana Gattupalli, Dylan Ebert, Ali Sharifara, Vassilis Athitsos, Fillia Makedon, Improving the Accuracy of the CogniLearn System for Cognitive Behavior Assessment, International Conference on PErvasive Technologies Related to Assistive Environments, PETRA, Rhodes, Greece, June 2017.
4. Srujana Gattupalli, Dylan Ebert, Michalis Papakostas, Vassilis Athitsos, Fillia Makedon, CogniLearn: A Deep Learning-based Interface for Cognitive Behavior Assessment, Intelligent User Interfaces (ACM IUI), Limassol, Cyprus, March 2017.
5. Srujana Gattupalli, Alexandros Lioulemes, Shawn N. Gieser, Paul Sassaman, Vassilis Athitsos, Fillia Makedon, MAGNI: A Real-time Robot-assisted Game-based Tele-Rehabilitation System, International Conference on Human-Computer Interaction, Toronto, August 2016.
6. Srujana Gattupalli, Human Motion Analysis and Vision-Based Articulated Pose Estimation, IEEE International Conference on Healthcare Informatics (ICHI), Dallas, TX, October 2015.
7. Srujana Gattupalli, Amir Ghaderi, Vassilis Athitsos, Evaluation of Deep Learning based Pose Estimation for Sign Language, International Conference on PErvasive Technologies Related to Assistive Environments, PETRA, Corfu Island, Greece, June 2016.


REFERENCES [1] M. M. McClelland and C. E. Cameron, “Self-regulation in early childhood: Improving conceptual clarity and developing ecologically valid measures,” Child Development Perspectives, vol. 6, no. 2, pp. 136–142, 2012. [2] A. R. Jadad, L. Booker, M. Gauld, R. Kakuma, M. Boyle, C. E. Cunningham, M. Kim, and R. Schachar, “The treatment of attention-deficit hyperactivity disorder: an annotated bibliography and critical appraisal of published systematic reviews and meta-analyses,” The Canadian Journal of Psychiatry, vol. 44, no. 10, pp. 1025–1035, 1999. [3] V. N. Parrillo, Encyclopedia of social problems. Sage Publications, 2008. [4] M. G. Sim, E. Khong, G. Hulse, et al., “When the child with adhd grows up,” Australian family physician, vol. 33, no. 8, p. 615, 2004. [5] E. Cormier, “Attention deficit/hyperactivity disorder: a review and update,” Journal of pediatric nursing, vol. 23, no. 5, pp. 345–357, 2008. [6] D. W. Dunn and W. G. Kronenberger, “Attention-deficit/hyperactivity disorder in children and adolescents,” Neurologic clinics, vol. 21, no. 4, pp. 933–940, 2003. [7] K. W. Lange, S. Reichl, K. M. Lange, L. Tucha, and O. Tucha, “The history of attention deficit hyperactivity disorder,” ADHD Attention Deficit and Hyperactivity Disorders, vol. 2, no. 4, pp. 241–255, 2010. [8] R. Mayes, C. Bagwell, and J. Erkulwater, “Adhd and the rise in stimulant use among children,” Harvard review of psychiatry, vol. 16, no. 3, pp. 151–166, 2008.


[9] R. G. Ross, “Psychotic and manic-like symptoms during stimulant treatment of attention deficit hyperactivity disorder,” American Journal of Psychiatry, vol. 163, no. 7, pp. 1149–1152, 2006. [10] C. A. Dendy, “Executive function: what is this anyway?,” Retrieved September, vol. 18, p. 2008, 2008. [11] D. J. Ackerman and A. H. Friedman-Krauss, “Preschoolers’ executive function: Importance, contributors, research needs and assessment options,” ETS Research Report Series, vol. 2017, no. 1, pp. 1–24. [12] S. Hagler, H. B. Jimison, and M. Pavel, “Assessing executive function using a computer game: Computational modeling of cognitive processes,” IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 4, pp. 1442–1452, July 2014. [13] R. Westerman, “Computer-assisted cognitive function assessment of pilots,” 2001. [14] “Status of computerized cognitive testing in aging: A systematic review,” Alzheimers and Dementia, vol. 4, no. 6, pp. 428 – 437, 2008. [15] M. C. Carter and D. C. Shapiro, “Control of sequential movements: evidence for generalized motor programs,” Journal of Neurophysiology, vol. 52, no. 5, pp. 787–796, 1984. [16] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in IEEE International Conference on Computer Vision, 2015. [17] S. Gattupalli, D. Ebert, M. Papakostas, F. Makedon, and V. Athitsos, “Cognilearn: A deep learning-based interface for cognitive behavior assessment,” in Proceedings of the 22nd International Conference on Intelligent User Interfaces. ACM, 2017, pp. 577–587.


[18] C. Zimmermann et al., “Learning to estimate 3d hand pose from single rgb images,” in IEEE International Conference on Computer Vision (ICCV), 2017. [19] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, “Hand keypoint detection in single images using multiview bootstrapping,” in CVPR, 2017. [20] S. O. ATTENTION-DEFICIT et al., “Adhd: clinical practice guideline for the diagnosis, evaluation, and treatment of attention-deficit/hyperactivity disorder in children and adolescents,” Pediatrics, pp. peds–2011, 2011. [21] J. Biederman, T. Wilens, E. Mick, S. Milberger, T. J. Spencer, and S. V. Faraone, “Psychoactive substance use disorders in adults with attention deficit hyperactivity disorder (adhd): effects of adhd and psychiatric comorbidity,” American Journal of Psychiatry, vol. 152, no. 11, pp. 1652–1658, 1995. [22] H. M. Feldman and M. I. Reiff, “Attention deficit–hyperactivity disorder in children and adolescents,” New England Journal of Medicine, vol. 370, no. 9, pp. 838–846, 2014. [23] C. Verret, M.-C. Guay, C. Berthiaume, P. Gardiner, and L. Béliveau, “A physical activity program improves behavior and cognitive functions in children with adhd: an exploratory study,” Journal of attention disorders, vol. 16, no. 1, pp. 71–80, 2012. [24] G. Peretti and C. M. Trentini, “Assessing executive functions in older adults: a comparison between the manual and the computer-based versions of the wisconsin card sorting test,” 2010. [25] D. Hendrawan, C. Carolina, F. Fauzani, H. N. Fatimah, F. Kurniawati, and M. A. M. Ariefa, “The construction of android computer-based application on neurocognitive executive function for early age children inhibitory control measurement,” in 2016 7th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Oct 2016, pp. 000 409–000 414. [26] C. T. Gualtieri and L. G. Johnson, “Reliability and validity of a computerized neurocognitive test battery, cns vital signs,” Archives of Clinical Neuropsychology, vol. 21, no. 7, pp. 623–643, 2006. [27] J. I. Gapin, “Associations among physical activity adhd symptoms and executive function in children with adhd,” Ph.D. dissertation, The University of North Carolina at Greensboro, 2009. [28] M. T. Willoughby, L. J. Kuhn, C. B. Blair, A. Samek, and J. A. List, “The test-retest reliability of the latent construct of executive function depends on whether tasks are represented as formative or reflective indicators,” Child Neuropsychology, vol. 23, no. 7, pp. 822–837, 2017. [29] B. E. Wexler, “Integrated brain and body exercises for ADHD and related problems with attention and executive function,” IJGCMS, vol. 5, no. 3, pp. 10–26, 2013. [30] M. M. McClelland, C. E. Cameron, R. Duncan, R. P. Bowles, A. C. Acock, A. Miao, and M. E. Pratt, “Predictors of early growth in academic achievement: the head-toes-knees-shoulders task,” Frontiers in Psychology, vol. 5, p. 599, 2014. [31] S. K. de Paula Asa, M. C. S. Melo, and M. E. P. Piemonte, “Effects of mental and physical practice on a finger opposition task among children,” Research Quarterly for Exercise and Sport, vol. 85, no. 3, pp. 308–315, 2014. [32] S. Dorfberger, E. Adi-Japha, and A. Karni, “Sequence specific motor performance gains after memory consolidation in children and adolescents,” PLoS ONE 7(1): e28673, 2012.


[33] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Advances in neural information processing systems, 2014, pp. 1799–1807. [34] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1385–1392. [35] X. Chen and A. Yuille, “Articulated pose estimation by a graphical model with image dependent pairwise relations,” in Advances in Neural Information Processing Systems (NIPS), 2014. [36] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in CVPR, 2014. [37] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler, “Learning human pose estimation features with convolutional networks,” CoRR, vol. abs/1312.7302, 2013. [38] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 1799–1807. [39] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in CVPR, 2016, pp. 4724–4732. [40] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “Deepercut: A deeper, stronger, and faster multi-person pose estimation model,” in European Conference on Computer Vision. Springer, 2016, pp. 34–50.


[41] J. Charles, T. Pfister, M. Everingham, and A. Zisserman, “Automatic and efficient human pose estimation for sign language videos,” International Journal of Computer Vision, vol. 95, pp. 180–197, 2011. [42] R. Tennant, “American sign language handshape dictionary,” in Gallaudet University Press, Washington, D.C, 2010. [43] V. Athitsos, C. Neidle, S. Sclaroff, J. P. Nash, A. Stefan, Q. Yuan, and A. Thangali, “The american sign language lexicon video dataset,” in CVPR Workshops 2008, Anchorage, AK, USA, 23-28 June, 2008, 2008, pp. 1–8. [44] S. Gattupalli, A. Ghaderi, and V. Athitsos, “Evaluation of deep learning based pose estimation for sign language recognition,” in Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments. ACM, 2016, p. 12. [45] J. Lu, V. Behbood, P. Hao, H. Zuo, S. Xue, and G. Zhang, “Transfer learning using computational intelligence: a survey,” Knowledge-Based Systems, vol. 80, pp. 14–23, 2015. [46] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328. [47] S. Yuan, Q. Ye, B. Stenger, S. Jain, and T.-K. Kim, “Bighand2.2m benchmark: Hand pose dataset and state of the art analysis,” in CVPR, 2017. [48] M. Oberweger and V. Lepetit, “Deepprior++: Improving fast and accurate 3d hand pose estimation,” in ICCV Workshops. IEEE Computer Society, 2017, pp. 585–594.

[49] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun, “Real time hand pose estimation using depth sensors,” in ICCV Workshops. IEEE, 2011, pp. 1228–1234. [50] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun, “Cascaded hand pose regression,” in CVPR. IEEE Computer Society, 2015, pp. 824–832. [51] L. Ge, H. Liang, J. Yuan, and D. Thalmann, “Robust 3d hand pose estimation in single depth images: From single-view CNN to multi-view cnns,” in CVPR. IEEE Computer Society, 2016, pp. 3593–3601. [52] W. J. Tsai, J. C. Chen, and K. W. Lin, “Depth-based hand pose segmentation with hough random forest,” in 2016 3rd International Conference on Green Technology and Sustainable Development (GTSD), Nov 2016, pp. 166–167. [53] D. Huang, M. Ma, W. Ma, and K. M. Kitani, “How do we use our hands? discovering a diverse set of common grasps,” in CVPR. IEEE Computer Society, 2015, pp. 666–675. [54] G. Rogez, M. Khademi, J. S. S. III, J. M. M. Montiel, and D. Ramanan, “3d hand pose detection in egocentric RGB-D images,” in ECCV Workshops (1), ser. Lecture Notes in Computer Science, vol. 8925. Springer, 2014, pp. 356–371. [55] M. Baydoun, A. Betancourt, P. Morerio, L. Marcenaro, M. Rauterberg, and C. S. Regazzoni, “Hand pose recognition in first person vision through graph spectral analysis,” in ICASSP. IEEE, 2017, pp. 1872–1876. [56] Y. Zhou, G. Jiang, and Y. Lin, “A novel finger and hand pose estimation technique for real-time hand gesture recognition,” Pattern Recognition, vol. 49, pp. 102–114, 2016. [57] S. Sridhar, F. Mueller, M. Zollhöfer, D. Casas, A. Oulasvirta, and C. Theobalt, “Real-time joint tracking of a hand manipulating an object from RGB-D input,” in ECCV (2), ser. Lecture Notes in Computer Science, vol. 9906. 2016, pp. 294–310.


[58] D. Tzionas, L. Ballan, A. Srikantha, P. Aponte, M. Pollefeys, and J. Gall, “Capturing hands in action using discriminative salient points and physics simulation,” International Journal of Computer Vision, vol. 118, no. 2, pp. 172–193, 2016. [59] G. Garcia-Hernando, S. Yuan, S. Baek, and T. Kim, “First-person hand action benchmark with RGB-D videos and 3d hand pose annotations,” CoRR, vol. abs/1704.02463, 2017. [60] S. Sridhar, A. Oulasvirta, and C. Theobalt, “Interactive markerless articulated hand motion tracking using RGB and depth data,” in ICCV. IEEE Computer Society, 2013, pp. 2456–2463. [61] D. P. Van Dusen, S. H. Kelder, H. W. Kohl, N. Ranjit, and C. L. Perry, “Associations of physical fitness and academic performance among schoolchildren*,” Journal of School Health, vol. 81, no. 12, pp. 733–740, 2011. [62] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634. [63] D. Leightley, J. Darby, B. Li, J. S. McPhee, and M. H. Yap, “Human activity recognition for physical rehabilitation,” in Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE, 2013, pp. 261–266. [64] L. Xia and J. Aggarwal, “Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2834–2841. [65] S. Gattupalli, D. Ebert, M. Papakostas, F. Makedon, and V. Athitsos, “Cognilearn: A deep learning-based interface for cognitive behavior assessment,” in


Proceedings of the 22nd International Conference on Intelligent User Interfaces, ser. IUI ’17, 2017, pp. 577–587. [66] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, “Hand keypoint detection in single images using multiview bootstrapping,” in CVPR, 2017. [67] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4929–4937. [68] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017. [69] L. J. Najjar, “Principles of educational multimedia user interface design,” Human factors, vol. 40, no. 2, pp. 311–323, 1998. [70] J. Howcroft, S. Klejman, D. Fehlings, V. Wright, K. Zabjek, J. Andrysek, and E. Biddiss, “Active video game play in children with cerebral palsy: potential for physical activity promotion and rehabilitation therapies,” Archives of physical medicine and rehabilitation, vol. 93, no. 8, pp. 1448–1456, 2012. [71] B. Lange, S. Flynn, R. Proffitt, C.-Y. Chang, et al., “Development of an interactive game-based rehabilitation tool for dynamic balance training,” Topics in stroke rehabilitation, 2015. [72] O. Asiry, H. Shen, and P. Calder, “Extending attention span of adhd children through an eye tracker directed adaptive user interface,” in Proceedings of the ASWEC 2015 24th Australasian Software Engineering Conference. ACM, 2015, pp. 149–152. [73] A. Billard, B. Robins, J. Nadel, and K. Dautenhahn, “Building robota, a mini-humanoid robot for the rehabilitation of children with autism,” Assistive Technology, vol. 19, no. 1, pp. 37–49, 2007.

[74] L. E. Sucar, F. Orihuela-Espina, R. L. Velazquez, D. J. Reinkensmeyer, R. Leder, and J. Hernandez-Franco, "Gesture therapy: An upper limb virtual reality-based motor rehabilitation platform," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 22, no. 3, pp. 634–643, 2014.

[75] A. A. Rizzo, J. G. Buckwalter, T. Bowerly, C. Van Der Zaag, L. Humphrey, U. Neumann, C. Chua, C. Kyriakakis, A. Van Rooyen, and D. Sisemore, "The virtual classroom: a virtual reality environment for the assessment and rehabilitation of attention deficits," CyberPsychology & Behavior, vol. 3, no. 3, pp. 483–499, 2000.

[76] M. P. Craven and M. J. Groom, "Computer games for user engagement in attention deficit hyperactivity disorder (ADHD) monitoring and therapy," in Interactive Technologies and Games (iTAG), 2015 International Conference on. IEEE, 2015, pp. 34–40.

[77] K. Tsiakas, M. Papakostas, B. Chebaa, D. Ebert, V. Karkaletsis, and F. Makedon, "An interactive learning and adaptation framework for adaptive robot assisted therapy," in Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments. ACM, 2016, p. 40.

[78] A. Da Gama, T. Chaves, L. Figueiredo, and V. Teichrieb, "Poster: improving motor rehabilitation process through a natural interaction based system using Kinect sensor," in 3D User Interfaces (3DUI), 2012 IEEE Symposium on. IEEE, 2012, pp. 145–146.

[79] H. Pirsiavash, C. Vondrick, and A. Torralba, "Assessing the quality of actions," in European Conference on Computer Vision. Springer, 2014, pp. 556–571.

[80] A. K. Roy, Y. Soni, and S. Dubey, "Enhancing effectiveness of motor rehabilitation using Kinect motion sensing technology," in Global Humanitarian Technology Conference: South Asia Satellite (GHTC-SAS), 2013 IEEE. IEEE, 2013, pp. 298–304.

[81] E. Velloso, A. Bulling, H. Gellersen, W. Ugulino, and H. Fuks, "Qualitative activity recognition of weight lifting exercises," in Proceedings of the 4th Augmented Human International Conference. ACM, 2013, pp. 116–123.

[82] B. Shneiderman, Designing the User Interface: Strategies for Effective Human-Computer Interaction. Pearson Education India, 2010.

[83] L. Zhou, Z. Liu, H. Leung, and H. P. H. Shum, "Posture reconstruction using Kinect with a probabilistic model," in Proceedings of the 20th ACM Symposium on Virtual Reality Software and Technology, ser. VRST '14, 2014, pp. 117–125.

[84] M. Andriluka, L. Pishchulin, P. V. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014, pp. 3686–3693.

[85] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.

[86] A. Ghaderi, S. Gattupalli, D. Ebert, A. Sharifara, V. Athitsos, and F. Makedon, "Improving the accuracy of the CogniLearn system for cognitive behavior assessment," in Proceedings of the 10th ACM International Conference on PErvasive Technologies Related to Assistive Environments. ACM, 2017, p. 4.

[87] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," Computer Vision – ECCV 2004, pp. 25–36, 2004.


[88] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[89] E. M. Mahone, M. Ryan, L. Ferenc, C. Morris-Berry, and H. S. Singer, "Neuropsychological function in children with primary complex motor stereotypies," Developmental Medicine & Child Neurology, vol. 56, no. 10, pp. 1001–1008, 2014.

[90] B. Wexler and M. Bell, "Activate program."


BIOGRAPHICAL STATEMENT

Srujana Gattupalli was born in Mumbai, India, in 1989. She received her B.E. degree from Gujarat University, India, in 2011, and her M.S. and Ph.D. degrees from The University of Texas at Arlington in 2013 and 2018, respectively, all in Computer Science. Her research interests are focused on Machine Learning, Computer Vision, Human-Computer Interaction, and their applications to human body motion estimation and pose tracking. Her academic work experience includes roles as a teaching assistant for graduate course offerings and as a graduate research assistant in the Vision-Learning-Mining Lab and the Heracleia Human-Computer Interaction Lab. She worked as a Software Engineer at Cerner Corporation in 2014, and was a Graduate Research Intern at Intel Corporation in 2017, where she contributed to the research and development of computer vision and machine learning algorithms. Ms. Gattupalli has co-authored several peer-reviewed papers published in technical conferences and has served as a reviewer for many others. She is currently a member of the Upsilon Pi Epsilon Texas Gamma Chapter honor society.