Measuring Child Visual Attention using Markerless Head Tracking from Color and Depth Sensing Cameras

Jonathan Bidwell, Irfan A. Essa, Agata Rozga & Gregory D. Abowd
School of Interactive Computing & Center for Behavior Imaging
Georgia Institute of Technology, Atlanta, GA, USA

ABSTRACT


A child's failure to respond to his or her name being called is an early warning sign for autism, and response to name is currently assessed as a part of standard autism screening and diagnostic tools. In this paper, we explore markerless child head tracking as an unobtrusive approach for automatically predicting child response to name. Head turns are used as a proxy for visual attention. We analyzed 50 recorded response to name sessions with the goal of predicting whether children, ages 15 to 30 months, responded to name calls by turning to look at an examiner within a defined time interval. The child's head turn angles and hand annotated child name call intervals were extracted from each session. Human assisted tracking was employed using an overhead Kinect camera, and automated tracking was later employed using an additional forward facing camera as a proof-of-concept. We explore two distinct analytical approaches for predicting child responses, one relying on a rule-based classifier and the other on random forest classification. In addition, we derive child response latency as a new measurement that could provide researchers and clinicians with finer-grained quantitative information currently unavailable in the field due to human limitations. Finally, we reflect on steps for adapting our system to work in less constrained natural settings.

CATEGORIES AND SUBJECT DESCRIPTORS

J.4 [Computer Applications]: Social and behavioral sciences; I.4.8 [Image Processing and Computer Vision]: Scene Analysis; I.5.4 [Pattern Recognition]: Applications

GENERAL TERMS

Experimentation, Algorithms, Human Factors

KEYWORDS

Computational Behavioral Analysis; Autism Spectrum Disorder

INTRODUCTION

There is a long-standing interest in the autism research and clinical community in measuring children's tendency to attend to social stimuli, such as faces, voices, and gestures [10]. These multimodal stimuli require an understanding of speech, gestures, and context for assessing child behaviors. The failure of a child to respond to a name call has received particular attention as a red flag for autism.

Figure 1: Markerless head pose estimation was employed for measuring a child's head angle at each frame. Head angles ranged between +/- 180 degrees with the child facing forward at zero degrees. The resulting head turn measurements were then used for predicting child response to name.

Response to name probes are now widely included in standard autism screening and diagnostic assessments, such as the M-CHAT [27] and the ADOS [16], where a child's failure to respond to his or her name being called is considered an early warning sign for autism [1,14]. While these observations are conceptually simple to make (call a child's name and observe whether he or she responds), methods for capturing and quantifying these observations currently require direct human involvement. Response to name is typically annotated after the fact by researchers or scored in person by clinicians. These approaches can present logistical challenges for researchers seeking to conduct observations in naturalistic settings, and cognitive bandwidth challenges for clinicians seeking to observe multiple child behaviors at the same time during in-person assessments.


In this work, we present initial steps towards an automated system for measuring a child's response to name through clinically standard inputs that include recorded audio and video feeds. Both human assisted and automated head tracking were performed to extract the left-right rotation of the child's head along the yaw axis as a marker for responding (see Figure 1). The system was evaluated on 50 sessions consisting of a standard response to name probe. Our goal was to predict whether the child responded to a name call by turning his or her head to face the examiner. Using automated tracking and a rule-based classifier, we predicted child response within a standard 3-second response time interval with a precision of 89.4% and a recall of 80.4%. Using a learning-based approach, driven by a random forest classifier, we obtained a precision of 76.9% and a recall of 83.3%. We further computed a response to name latency as the elapsed time between the call of the child's name and the onset of the predicted child response. These derived response latencies agreed closely with our ground truth annotations, with an average absolute value error of 0.46 sec (standard deviation (SD) of 0.35 sec) for the rule-based classification approach and 0.97 sec (SD = 1.22 sec) for the learning-based classification. These results highlight the potential for automating the measurement of children's orienting responses to social stimuli, a key factor in assessing developmental progress.

The research contributions of this work include:

1. Identifying that child head orientation can be used for predicting a child's response to a social stimulus (a call of their name) within defined time intervals.

2. Building a system that uses automated child head turn measurements and hand annotated name call intervals for reliably predicting response to name and measuring child response latency as a new fine-grained dimension of the response that may be beneficial to researchers.

RELATED WORK

The computer science community has long been interested in multimodal analysis for measuring human activities in natural environments [2,7,24]. Only recently has this work begun to focus on measuring child behaviors and on studying more nuanced behavioral phenomena such as social sharing and social bids for attention [5,26]. The ultimate context for our automated response to name measurement called for designing a system that would be both unobtrusive and unencumbering while requiring minimal instrumentation. We focused on head tracking as a primary indicator of child response for the four reasons outlined below.

1. Head pose can serve as a proxy for attention. Human eye gaze has been correlated with visual focus of attention [8,9,11] and tends to agree with head direction. Kretch et al. instrumented infants with eye trackers and found that most infant eye gaze fixations occur within the middle of the child's field of view [15]. Stiefelhagen et al. performed a related study with four adults interacting at a meeting table and found that eye gaze agreed 87% of the time with the direction of the subject's head [29].

2. Eye tracking solutions are limited for toddlers in room-sized environments. Traditional monitor-based eye tracking methods will not work in our use case, where children may not be able to cooperate with calibration instructions and responses must be measured in a 3D room-sized environment as opposed to on a 2D screen. Head mounted eye trackers have been used to compare eye fixations between older neurotypical children and children with autism [5]; however, these systems are cumbersome, require participants' active effort during calibration, and their form factor is not appropriate for young children [34,35]. Adult-worn video recording glasses and face tracking software have been used to infer eye contact by identifying when the child looks toward the camera; however, this approach requires the examiner to maintain the child within the camera's field of view at all times. More recently, a corneal-reflection based eye tracking technique has been proposed as an alternative for eye tracking in 3D environments [23]. This system uses one or more cameras for tracking reflections on the user's cornea; however, in our case such a system would require a prohibitive number of cameras, as the child's eyes must remain visible throughout head turns [23,25].

3. Markerless motion capture has been applied for measuring head pose in room-sized environments. By contrast, methods based on ultrasonic [36], electromagnetic [37], or marker-based tracking [38,39] tend to be obtrusive and require instrumenting the child with special clothing. Markerless motion capture does not require instrumenting subjects [19] and has been successfully demonstrated in room-sized environments. Murphy-Chutorian and Trivedi [20] and Stiefelhagen [30] used 2D image based feature detection and tracking to estimate the head pose of people during meetings. In addition, many of the latest depth based methods for head tracking require placing cameras prohibitively close to the subject. Fanelli et al. used a single Microsoft Kinect camera for tracking large head rotations up to +/- 90 degrees [13]; however, due to the limited depth accuracy of the Kinect [17], subjects in the study had to sit within approximately 1 meter of a tripod-mounted camera. Maimone et al. addressed this by suspending multiple Kinects from the ceiling on long poles [18]. In our case, we aimed to position cameras as far away from the children as possible, with the ultimate goal of measuring response to name in less constrained settings where the child may be moving around the room.

4. Head tracking beyond 90 degrees can be achieved from an overhead camera. Finally, we needed to track child head turns beyond 90 degrees. Most head pose estimation research assumes that images will contain easy-to-track facial features such as the eyes, nose, and mouth, and assumes side-to-side head rotation angles between +/- 90 degrees [21]. For example, Ba et al.'s system was trained on a dataset with head rotations ranging from -60 to 60 degrees [3]. Head tracking beyond this range presents an added challenge in that it introduces images (e.g., the back of someone's head) that contain fewer facial landmarks and thus have less detail for image segmentation and model fitting. This problem has been addressed for similar situations through the introduction of priors such as a full body skeleton [6,28] or by mapping an initial scan of the user's head in 2D or 3D to a lower dimensional representation for matching between frames [13,32]. In our case, the child's visual appearance can be difficult to anticipate in advance. While we could train a new model for each child, we felt this approach would present an unrealistic burden for researchers and clinicians hoping to use our system.

METHOD

Response to name administration (RTN)

Fifty response to name sessions were selected from the publicly available Multimodal Dyadic Behavior Dataset (MMDB) [http://www.cbi.gatech.edu/mmdb]. The sessions contained 63 name calls and followed a protocol similar to the response to name probe included in the ADOS [16], a standard diagnostic instrument for assessing autism (see Figure 2). The child sat at a table (either alone or in the parent's lap) and was given a small toy to occupy their attention. The examiner then positioned herself to the right and slightly behind the child, waited until the child's attention was on the toy, and called the child's name in a playful tone. If the child did not respond by turning his or her head and making eye contact with the examiner within 3 seconds, she called the child's name again, up to two more times. The examiner held a clipboard with a scoring sheet where she marked whether the child responded to the name call and, if so, on which trial.

Figure 2: Response to name administration and ground truth annotations. The examiner calls the child's name up to 3 times. In this example the examiner calls the child's name, but the child does not respond. The examiner waits three seconds and then calls the child's name again. This time the child responds by making eye contact.

Hand annotated video observations

The audio and video recordings from the session were later reviewed by two research assistants, who used Elan annotation software [31] to mark frame-level onsets and offsets of each instance of the child's name being called, and any instances of the child responding by making eye contact with the examiner. If it was unclear from the video whether eye contact occurred, the research assistants consulted the examiner's scoring sheet. These annotations served as ground truth for training and evaluating our system (see Figure 2).

HEAD TRACKING

Human assisted tracking

Human assisted tracking was performed on all sessions in order to establish ground truth for child head turn measurements. This was accomplished via a graphical interface that allowed the user to monitor the output from a template tracker and then click and re-initialize the tracks as needed. The result was a set of "best case" head angle measurements that enabled us to evaluate the feasibility of using head turn measurements for classification of response to name, prior to addressing head tracking challenges such as head occlusions, child appearance variations, and out-of-plane head rotations.

Head pose estimation

The child's side-to-side head motion was measured from a forward facing camera (Canon Vixia), and the child's head position was measured using a ceiling-mounted Kinect camera, centered and pointing directly down on the table. The Omron OKAO Vision Library [40] was used for tracking the child's head orientation within the forward facing camera's video feed, while Boston University (BU)'s Multi-Person Tracker was used for measuring the position of the child's head from the overhead camera [33]. These measurements were then used to first initialize and later update a Kanade-Lucas-Tomasi (KLT) template tracker on the overhead camera feed [4]. The template was updated with head orientation estimates from the front-facing camera every 80 frames. The resulting automated tracker was capable of measuring child head turns that exceeded 90 degrees.

Automated tracking using template matching

Head tracking was performed using a template tracker. The template consisted of a pair of masked, overhead color and depth images of the child's head. The template's position and orientation were initialized as an oriented bounding box rectangle. Head position was defined as the first detected position from BU's Multi-Person Tracker [33] and head orientation was defined as the first head detection from Omron [40]. The child's head was segmented by excluding pixels greater than 10 cm above and 2 cm below the center pixel of the bounding box in the depth image. Hole filling was then performed by applying dilate and erode morphological operations and Gaussian smoothing. The resulting mask was then applied to the color and depth images for creating a template that excluded background pixels. This process was repeated for matching the template to the current tracked bounding box at each frame (see Figure 3).
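As a concrete illustration of the segmentation step described above, the following sketch masks a head template from an overhead depth image using the stated 10 cm / 2 cm depth offsets, followed by dilate/erode hole filling and Gaussian smoothing. It is a minimal sketch using OpenCV and NumPy; the function name, the assumption that depth is in meters, and the kernel sizes are our own illustrative choices, not the authors' implementation.

```python
import cv2
import numpy as np

def mask_head_template(color, depth, box_center, box_size):
    """Illustrative sketch: segment a head template from overhead color/depth
    images by keeping pixels within a fixed depth band around the box center,
    then filling holes with dilate/erode and Gaussian smoothing."""
    cx, cy = box_center
    x0, y0 = max(cx - box_size // 2, 0), max(cy - box_size // 2, 0)
    depth_roi = depth[y0:y0 + box_size, x0:x0 + box_size].astype(np.float32)
    color_roi = color[y0:y0 + box_size, x0:x0 + box_size]

    # Depth of the bounding box center pixel (assumed to be in meters).
    h, w = depth_roi.shape
    center_depth = depth_roi[h // 2, w // 2]

    # Keep pixels no more than 10 cm above and 2 cm below the center pixel.
    mask = ((depth_roi > center_depth - 0.10) &
            (depth_roi < center_depth + 0.02)).astype(np.uint8) * 255

    # Hole filling: dilate, erode, then Gaussian smoothing, as described above.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.dilate(mask, kernel)
    mask = cv2.erode(mask, kernel)
    mask = cv2.GaussianBlur(mask, (5, 5), 0)
    mask = (mask > 127).astype(np.uint8)

    # Apply the mask so background pixels are excluded from the template.
    masked_color = cv2.bitwise_and(color_roi, color_roi, mask=mask)
    masked_depth = depth_roi * mask
    return masked_color, masked_depth
```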

This approach was chosen to address the following four challenges.

1. Measuring large head rotations. The Omron software provided side-to-side head rotation measurements [40]. While the software could detect side-to-side head turns between +/- 90 degrees, reliable measurements were limited to approximately +/- 30 degrees. In our case, we employed a template tracker to pick up where the Omron tracker left off and continued tracking the child's head turns beyond this limited range.

2. Robustness to occlusions. The examiner and parents frequently occlude the child from the frontal camera, which results in missing data. The overhead KLT template tracker is more robust to these occlusions and provides greater tracking coverage.

3. Robustness to appearance changes. Hair and skin colors varied considerably between children, and this appearance was also subject to out-of-plane rotations when viewed from the overhead camera. This led us to choose a template tracker, which provides a more general model.

Head position filtering

The BU tracker was used for detecting candidate head positions, and this often resulted in multiple head position estimates per frame [33]. Head positions were filtered using histogram peak detection and binning as an additional step for isolating the child's head (see Figure 3).
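The paper does not give the details of this filtering step, so the sketch below is only one plausible reading of it: candidate head positions are accumulated into a coarse 2D histogram, the densest bin is taken as the child's location, and only candidates near that peak are kept. The bin size, keep radius, and function name are illustrative assumptions.

```python
import numpy as np

def filter_head_candidates(candidates, bin_size=0.25, keep_radius=0.3):
    """Illustrative sketch: keep only candidate head positions (x, y in meters)
    that fall near the peak of a 2D histogram over all candidates."""
    pts = np.asarray(candidates, dtype=np.float32)
    # Bin candidate positions into a coarse 2D grid.
    ix = np.floor(pts[:, 0] / bin_size).astype(int)
    iy = np.floor(pts[:, 1] / bin_size).astype(int)
    bins = {}
    for i, key in enumerate(zip(ix, iy)):
        bins.setdefault(key, []).append(i)
    # The most populated bin is assumed to contain the child's head.
    peak_key = max(bins, key=lambda k: len(bins[k]))
    peak = pts[bins[peak_key]].mean(axis=0)
    # Keep candidates within keep_radius of the peak.
    dist = np.linalg.norm(pts - peak, axis=1)
    return pts[dist < keep_radius]
```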

Head orientation updates

The orientation of the overhead-tracking template was periodically reset to the front facing head rotation angle every 80 frames, a value that worked well in practice.

Figure 3: Response to name automated head tracking process. (a) Head position candidates are detected from the overhead camera and filtered using histogram binning. Head turns are measured from the overhead camera by tracking a segmented child head template between frames. (b) Head templates consist of an oriented bounding box whose position is automatically initialized from the overhead camera and whose orientation is initialized from a front-facing camera using commercial head tracking software (c). Finally, (d) the template is segmented using the overhead camera depth image and tracked between frames using a KLT tracker.

BEHAVIOR CLASSIFICATION

Binary classifier evaluation

We employed two "response to name" classifiers (rule based and random forest) on 50 sessions involving 63 individual name calls. Each classifier made predictions on a per response basis: a separate binary decision (true or false) was made for each hand annotated child name call interval. In each case, the child's activity was analyzed within a time window that started at the offset of the name call interval and spanned for a maximum of 3 seconds (30 fps * 3 sec = 90 frames) or until the start of the next ground truth name call interval, whichever was shorter.

The resulting predictions were evaluated against our hand annotated ground truth observations as follows. Any prediction that fell within a window of +/- 3 seconds from the onset of the child's response, as determined by the ground truth annotation, was counted as a true positive. A failure to predict a ground-truth coded response that occurred within this window was counted as a false negative. The 3-second margin was selected based on a previously published study using an identical response to name administration protocol [22]. In cases where the child's name was called more than once, we also calculated true negatives (correct prediction of no response in the 3-second window following the offset of the ground-truth coded name call) and false positives (incorrect prediction of child response in the 3-second window following the offset of the ground-truth coded name call).
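To make the windowing concrete, the following sketch scores a single name call the way the evaluation above describes: a prediction within +/- 3 seconds of the annotated response onset counts as a true positive, a missed annotated response as a false negative, and so on. The data layout and function name are illustrative assumptions, not the authors' code.

```python
def score_name_call(pred_onset, true_onset, margin=3.0):
    """Illustrative sketch: classify one name-call prediction as TP/FN/FP/TN.
    pred_onset/true_onset are response onsets in seconds relative to the
    name call offset, or None when no response was predicted/annotated."""
    if true_onset is not None:
        if pred_onset is not None and abs(pred_onset - true_onset) <= margin:
            return "true_positive"
        return "false_negative"
    # No annotated response: any prediction in the window is a false positive.
    if pred_onset is not None and pred_onset <= margin:
        return "false_positive"
    return "true_negative"
```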

Child latency measure

Next, we also computed the latency for each response, defined as the elapsed time between the offset of the last name call and the onset of the child's response. The child response latency was computed for our ground truth and predicted responses as follows:

    latency_truth = onset(ground truth response) - offset(name call)    (4)

    latency_predicted = onset(predicted response) - offset(name call)    (5)

The error and absolute error between these latency measures were computed as follows:

    error = latency_predicted - latency_truth    (6)

    absolute error = | latency_predicted - latency_truth |    (7)

Negative values reflect predictions made ahead of our hand annotated child response intervals, and positive values reflect predictions that were made after the response.
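A short sketch of how the latency and error terms in equations (4)-(7) could be computed from annotated and predicted event times; the variable names are ours.

```python
def latency_and_error(name_call_offset, true_response_onset, pred_response_onset):
    """Illustrative sketch of equations (4)-(7): latencies are measured from the
    offset of the last name call; error compares predicted to ground truth."""
    latency_truth = true_response_onset - name_call_offset        # eq. (4)
    latency_predicted = pred_response_onset - name_call_offset    # eq. (5)
    error = latency_predicted - latency_truth                     # eq. (6)
    return latency_truth, latency_predicted, error, abs(error)    # eq. (7)
```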

Random-forest classification

Many aspects of the response to name task could not be known in advance. For example, we observed that some children did a "double take" while responding rather than turning once, and thus realized we would need temporal features to encode such behaviors. The classifier was trained with features from our automated pipeline. Head tracking results were paired with our hand-annotated child response labels for creating a feature vector at each frame. Each vector contained the child's head angle in degrees and the most recent elapsed time since the child's name was called, in seconds.

The resulting feature vectors were then paired with our hand annotated ground truth labels. The child response label at each frame was appended for creating a set of positive or negative training examples. The random forest was then trained for predicting the child's response at each frame and evaluated using leave-one-out cross validation (see Table 2). The random forest classifier demarcated a positive response with a threshold of 15 consecutive "true" frames. This threshold was based on the assumption that children could not physically turn to make eye contact with the examiner in less than this time (15 frames / 30 fps = 0.5 seconds).

In general, this approach enabled us to quickly evaluate the contribution of multiple features such as child position, head velocity, and head acceleration. In our case, child head angle and the elapsed time since the name call were the two most discriminative features.
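The per-frame formulation above maps naturally onto a standard random forest implementation. The sketch below uses scikit-learn to train on per-frame feature vectors (head angle, time since name call) and turns per-frame predictions into a response decision by requiring 15 consecutive positive frames; it is our own illustration under those assumptions, not the authors' pipeline, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_frame_classifier(features, labels):
    """features: (n_frames, 2) array of [head_angle_deg, sec_since_name_call];
    labels: per-frame 0/1 child-response annotations."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(features, labels)
    return clf

def predict_response(clf, features, min_consecutive=15):
    """Declare a response only when at least 15 consecutive frames
    (0.5 s at 30 fps) are classified as positive; return the onset frame."""
    frame_preds = clf.predict(features)
    run = 0
    for i, p in enumerate(frame_preds):
        run = run + 1 if p else 0
        if run >= min_consecutive:
            return i - min_consecutive + 1  # onset frame of the run
    return None  # no predicted response
```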

Rule-based classification

Next, we evaluated the performance of a rule-based approach, which aims to capture the high-level structure of the interaction as opposed to temporal features, as an additional benchmark. This approach established rules and searched for the set of thresholds and parameters that results in the best classifier performance. Specifically, the child was labeled as responding if he or she turned beyond a given head angle and remained beyond this angle for a minimum duration of time. We also added a median kernel width parameter to reduce noise in the head-tracking signal. Next, we performed an exhaustive grid-search over these three variables to find the set of parameters yielding the highest precision rate.

The range for each grid-search value was informed by our understanding of the response to name protocol and the physical environment. For example, we knew the examiner always stood behind and to the right of the child, and the child would always respond by turning to the right; head turn responses would therefore always be between 0 and 180 degrees. The duration of the child's response could be no less than 0.25 seconds, as a limit on how fast children could turn, and no longer than 3 seconds, the maximum time that the examiner would wait before calling the child's name again. The median kernel width ranged from 1-30 frames (see Figure 4).

The rule-based classifier was trained using a grid-search in which we first computed the average precision rate of the classifier for each combination of parameters over 10 trials with a random 50/50 training and test split, and then chose the set of parameters with the highest average precision value.

Figure 4: Rule-based classification: The child is said to respond if his or her head angle (black line) exceeds a given head angle (blue line) for a minimum duration of time (15 frames). Intervals satisfying these criteria are shown in green.
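A minimal sketch of the rule-based criterion described above (before Figure 4), assuming a per-frame head-angle signal sampled at 30 fps: the signal is median filtered, and a response is declared when the angle stays beyond a threshold for a minimum duration. The function and parameter names are illustrative; the defaults echo the best grid-search values reported below (27 degrees, 0.5 seconds, 15-frame median kernel) but any grid of values could be substituted.

```python
import numpy as np
from scipy.signal import medfilt

def rule_based_response(head_angles_deg, angle_thresh=27.0,
                        min_duration_s=0.5, median_kernel=15, fps=30):
    """Illustrative sketch: declare a response when the (median filtered)
    head angle exceeds angle_thresh for at least min_duration_s seconds.
    Returns the onset frame of the first qualifying interval, or None."""
    smoothed = medfilt(np.asarray(head_angles_deg, dtype=float), median_kernel)
    min_frames = int(round(min_duration_s * fps))
    run_start, run_len = None, 0
    for i, angle in enumerate(smoothed):
        if angle > angle_thresh:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= min_frames:
                return run_start
        else:
            run_len = 0
    return None
```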

RESULTS

The system was evaluated against human coding of the child's response to name from video. The 50 sessions contained a total of 63 name calls, with 48 instances of a response and 15 instances of a failure to respond. Only two children never responded to the call of their name.

Our first step was to examine the relationship between child head orientation and name call intervals for predicting response to name. Human assisted tracking measurements were used to evaluate the system's performance when tracking errors were minimized. These results indicated that predicting response to name from changes in head orientation was a tractable problem. The best set of grid-search parameters for the rule-based classifier was 27 degrees for the head turn threshold, a 0.5-second minimum fixation duration, and a 15-frame median kernel width (see Table 1), with an average precision of 93.2%, a recall of 90.9%, an average latency error of -0.42 sec (SD = 0.55 sec), and an average absolute value latency error of 0.42 sec (SD = 0.55 sec). The random forest classifier had a precision of 93.3%, a recall of 87.5%, an average latency error of 0.09 sec (SD = 0.68 sec), and an average absolute value latency error of 0.47 sec (SD = 0.53 sec).

Next, we evaluated the performance of the two classifiers using automatically generated head orientation measurements. The rule-based classifier had a precision of 89.4%, a recall of 80.4%, an average latency error of -0.27 sec (SD = 0.50 sec), and an average absolute value latency error of 0.46 seconds (SD = 0.35 sec). The random forest classifier had a precision of 76.9%, a recall of 83.3%, an average latency error of -0.04 sec (SD = 1.50 sec), and an average absolute value latency error of 0.97 sec (SD = 1.22 sec). The random forest performed better than the rule-based approach during three sessions where the child did a "double take" before turning to the examiner, i.e., the children made eye contact on their second attempt after repositioning themselves to see around the parent. The performance of each classifier decreased with the automated tracking condition but remained reasonable for our application, with precision and recall above 76% and predicted child response latency error within +/- 3 seconds. These results confirmed that automated response to name predictions could accurately measure child latency within +/- 3 seconds of the child's actual response. The absolute value standard deviation latency error for both classifiers was less than 1 second.

Table 1: Rule-based classification results

a) Human assisted tracking
                                   Actual: Child responded   Actual: Child did not respond
Predicted: Child responded                  22                          1.6
Predicted: Child did not respond             2.2                        5.7

b) Automated tracking
                                   Actual: Child responded   Actual: Child did not respond
Predicted: Child responded                  19.4                        2.3
Predicted: Child did not respond             4.7                        6.2
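For reference, the precision and recall figures quoted in the text follow directly from these confusion tables; a tiny sketch using the Table 1a numbers:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from a confusion table (e.g., Table 1a:
    tp=22, fp=1.6, fn=2.2 gives roughly 93.2% precision and 90.9% recall)."""
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(22, 1.6, 2.2))
```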

Table 2: Random forest classification results

a) Human assisted tracking
                                   Actual: Child responded   Actual: Child did not respond
Predicted: Child responded                  42                           3
Predicted: Child did not respond             6                          12

b) Automated tracking
                                   Actual: Child responded   Actual: Child did not respond
Predicted: Child responded                  40                          12
Predicted: Child did not respond             8                           3

Binary classifier performance comparisons

The rule-based classifier had a consistent bias towards predicting earlier child responses for both head tracking conditions: the average rule-based latency error was -0.42 seconds for human assisted tracking and -0.27 seconds for automated tracking, respectively. This bias can be explained by how predictions are made. The rule-based head angle threshold did not account for the gradual turn of a child's head. Instead of predicting a response after the child had turned, that is, at the maximum of the plateau, our method would often make predictions earlier, at the moment the child initiated the head turn. This bias was not present with the random forest, as the child's head turn was included in the negative training examples.

The random forest classifier was less accurate for predicting the onset of child responses and was more susceptible to false positives than the rule-based classifier (see Table 2). The absolute predicted response latency error distribution was also greater for the random forest, at SD = 1.22 sec as compared to SD = 0.35 sec for the rule-based classifier. The rule-based classifier made predictions over multiple frames while the random forest classifier made predictions on a per frame basis. These results suggest that longer duration temporal features may be needed for improving random forest classifier performance; however, in practice these offsets were small and latency error remained well below our target application error margin of +/- 3 seconds.

DISCUSSION

Head tracking was well suited for predicting response to name during our study for two reasons. First, the response to name probe elicited large head rotations by design. The examiner called the child's name from behind and to the left of the child. This prompted children to turn and face the examiner rather than glance to the side while responding. Measuring these head turns called for coarse-level tracking estimates. Markerless head tracking provided us with sufficient accuracy for detecting the large angular sweep of the child's head turn responses without requiring more accurate techniques that would require placing our cameras closer to the children [12]. Second, response to name was administered in a structured laboratory setting where the positions of the examiner and the child were both known in advance. This enabled us to apply a simple but effective rule-based classification approach for predicting child responses (see Table 1). Each classifier reliably predicted response to name using automated head tracking and hand annotated child name call intervals. Both classifiers agreed above 76% with the examiners' in-person observations, and our latency error was well within our +/- 3 second cutoff window for in-person observations.

While at first glance our experimental setup may seem simplified and overly structured, we note that this type of test is standard practice in the psychology research community for supporting repeatable observations across participants. Our response to name probe was modeled on a standard diagnostic instrument for autism, which involves leaving the child occupied with a toy, moving behind and to the side, and calling the child's name up to three times [16]. Our setup has allowed us to demonstrate to our psychology collaborators that we can replicate their current practice with respect to response to name administration and scoring. The feedback that we received from our colleagues in the autism research community continues to emphasize the value of such objective measures as large, longitudinal datasets become increasingly common in the field.

The results of our work highlight the opportunities and challenges for response to name prediction in a naturalistic environment. The rule-based and random forest classifiers were evaluated on sessions where the examiner called the child's name from a single, fixed position in the room. In the future, we plan to track the examiner's head position in order to evaluate our system with multiple name call locations.

In the future we seek to perform additional head tracking and speech processing to support automated response to name measurement in a wider range of naturalistic social settings. Head tracking will be extended to include the examiner's position. The present system assumes that the examiner calls the child's name from a particular location in a particular room. The relative angle between the child's head direction and the examiner's position could be computed, enabling the examiner to move about the room; the more the child orients to the examiner's tracked position, the smaller this angle would become. The resulting measure could support name calls from multiple locations, as sketched below. Speech processing is also planned for automatically identifying instances of response to name within longer, less structured settings and for labeling child name call intervals. In our early work we explored automated word spotting using commercial software. These efforts were unsuccessful due to child names being called with low phoneme counts and background noise. In the future we plan to develop speech diarization software to address these shortcomings.
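A small sketch of the relative-angle idea described above, assuming 2D positions for the child and examiner and a child head direction in degrees; the names, coordinate conventions, and angle wrapping are illustrative assumptions rather than the system's actual implementation.

```python
import math

def child_to_examiner_angle(child_pos, child_head_deg, examiner_pos):
    """Illustrative sketch: angle (degrees) between the child's head direction
    and the direction from the child to the examiner's tracked position.
    Approaches 0 as the child orients toward the examiner."""
    dx = examiner_pos[0] - child_pos[0]
    dy = examiner_pos[1] - child_pos[1]
    bearing = math.degrees(math.atan2(dy, dx))
    diff = (child_head_deg - bearing + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    return abs(diff)
```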

CONCLUSIONS

We presented a system for reliably predicting a child's response to name given child head turn measurements, with the long-term goal of deploying the system in a naturalistic setting. Moreover, we measured child response latency as a first step for providing a finer-grained measure of a child's response than administrators are currently able to capture. These measures may prove to be clinically meaningful for psychologists by providing a second source of information during clinical assessments, and may further serve researchers by reducing the need for scheduling observations, establishing inter-rater reliability between examiners, and performing video annotation, as well as by enabling larger, more detailed datasets for contextualizing child behaviors. Human assisted tracking was performed to establish a set of "best-case" child head turn measurements, and we showed that we could reliably predict response to name with precision and recall above 87% and measure child response latency with a maximum absolute average error of 0.47 seconds (SD = 0.53 sec). Next, we implemented automated head tracking as a first step toward supporting these predictions in more naturalistic environments, and we showed that we could continue to reliably predict these measures with precision and recall above 76% and measure child response latency with a maximum absolute average error of 0.97 seconds (SD = 1.22 sec) when using a random forest classifier.

ACKNOWLEDGMENTS

This paper is based upon work supported by the National Science Foundation under Grant #1029679.

REFERENCES

1. Adrien, J.L., Lenoir, P., Martineau, J., et al. Blind ratings of early symptoms of autism based upon family home movies. Journal of the American Academy of Child & Adolescent Psychiatry 32, 3 (1993), 617–626.
2. Ba, S.O., Hung, H., and Odobez, J. Visual activity context for focus of attention estimation in dynamic meetings. IEEE International Conference on Multimedia and Expo (ICME 2009), (2009), 1424–1427.
3. Ba, S.O. and Odobez, J.M. A Rao-Blackwellized mixed state particle filter for head pose tracking. Proc. ACM-ICMI-MMMP, (2005), 9–16.
4. Baker, S. and Matthews, I. Lucas-Kanade 20 Years On: A Unifying Framework. International Journal of Computer Vision 56, 3 (2004), 221–255.
5. Bal, E., Harden, E., Lamb, D., Hecke, A.V.V., Denver, J.W., and Porges, S.W. Emotion Recognition in Children with Autism Spectrum Disorders: Relations to Eye Gaze and Autonomic State. Journal of Autism and Developmental Disorders 40, 3 (2010), 358–370.
6. Ballan, L. and Cortelazzo, G.M. Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes. 3DPVT, Atlanta, GA, USA 37, (2008).
7. Bernardin, K., Gehrig, T., and Stiefelhagen, R. Multi-level Particle Filter Fusion of Features and Cues for Audio-Visual Person Tracking. In R. Stiefelhagen, R. Bowers and J. Fiscus, eds., Multimodal Technologies for Perception of Humans. Springer Berlin Heidelberg, 2008, 70–81.
8. Conty, L. and Grèzes, J. Look at me, I'll remember you. Human Brain Mapping 33, 10 (2012), 2428–2440.
9. Von Cranach, M. The role of orienting behavior in human interaction. In Behavior and Environment. Springer, 1971, 217–237.
10. Devine, P.G. and Malpass, R.S. Orienting Strategies in Differential Face Recognition. Personality and Social Psychology Bulletin 11, 1 (1985), 33–40.
11. Emery, N.J. The eyes have it: the neuroethology, function and evolution of social gaze. Neuroscience & Biobehavioral Reviews 24, 6 (2000), 581–604.
12. Fanelli, G., Dantone, M., Gall, J., Fossati, A., and Gool, L. Random Forests for Real Time 3D Face Analysis. International Journal of Computer Vision 101, 3 (2013), 437–458.


13. Fanelli, G., Weise, T., Gall, J., and Gool, L.V. Real Time Head Pose Estimation from Consumer Depth Cameras. In R. Mester and M. Felsberg, eds., Pattern Recognition. Springer Berlin Heidelberg, 2011, 101–110.
14. Johnson, C.P. and Myers, S.M. Identification and Evaluation of Children With Autism Spectrum Disorders. Pediatrics 120, 5 (2007), 1183–1215.
15. Kretch, K.S., Franchak, J.M., and Adolph, K.E. Crawling and Walking Infants See the World Differently. Child Development, (2013).
16. Lord, C., Risi, S., Lambrecht, L., et al. The Autism Diagnostic Observation Schedule—Generic: A Standard Measure of Social and Communication Deficits Associated with the Spectrum of Autism. Journal of Autism and Developmental Disorders 30, 3 (2000), 205–223.
17. Maimone, A., Bidwell, J., Peng, K., and Fuchs, H. Enhanced personal autostereoscopic telepresence system using commodity depth cameras. Computers & Graphics 36, 7 (2012), 791–807.
18. Maimone, A. and Fuchs, H. Real-time volumetric 3D capture of room-sized scenes for telepresence. 3DTV-Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), 2012, (2012), 1–4.
19. Moeslund, T.B., Hilton, A., and Krüger, V. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104, 2–3 (2006), 90–126.
20. Murphy-Chutorian, E. and Trivedi, M.M. 3D tracking and dynamic analysis of human head movements and attentional targets. Second ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2008), (2008), 1–8.
21. Murphy-Chutorian, E. and Trivedi, M.M. Head Pose Estimation in Computer Vision: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 4 (2009), 607–626.
22. Nadig, A.S., Ozonoff, S., Young, G.S., Rozga, A., Sigman, M., and Rogers, S.J. A prospective study of response to name in infants at risk for autism. Archives of Pediatrics & Adolescent Medicine 161, 4 (2007), 378–383.
23. Nakazawa, A. and Nitschke, C. Point of Gaze Estimation through Corneal Surface Reflection in an Active Illumination Environment. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato and C. Schmid, eds., Computer Vision – ECCV 2012. Springer Berlin Heidelberg, 2012, 159–172.
24. Nickel, K., Gehrig, T., Stiefelhagen, R., and McDonough, J. A Joint Particle Filter for Audio-visual Speaker Tracking. Proceedings of the 7th International Conference on Multimodal Interfaces, ACM (2005), 61–68.
25. Nishino, K. and Nayar, S.K. The world in an eye [eye image interpretation]. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), (2004), I-444–I-451, Vol. 1.
26. Rehg, J.M., Abowd, G.D., Rozga, A., et al. Decoding Children's Social Behavior. 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2013), 3414–3421.
27. Robins, D.L., Fein, D., Barton, M.L., and Green, J.A. The Modified Checklist for Autism in Toddlers: An Initial Study Investigating the Early Detection of Autism and Pervasive Developmental Disorders. Journal of Autism and Developmental Disorders 31, 2 (2001), 131–144.
28. Shotton, J., Sharp, T., Kipman, A., et al. Real-time Human Pose Recognition in Parts from Single Depth Images. Commun. ACM 56, 1 (2013), 116–124.
29. Stiefelhagen, R. Tracking Focus of Attention in Meetings. Proceedings of the 4th IEEE International Conference on Multimodal Interfaces, IEEE Computer Society (2002), 273–.
30. Voit, M. and Stiefelhagen, R. 3D User-perspective, Voxel-based Estimation of Visual Focus of Attention in Dynamic Meeting Scenarios. International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction, ACM (2010), 51:1–51:8.
31. Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., and Sloetjes, H. Elan: a professional framework for multimodality research. Proceedings of LREC, (2006).
32. Xiong, X. and De La Torre, F. Supervised Descent Method and Its Applications to Face Alignment. 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2013), 532–539.
33. Zhang, J., Presti, L.L., and Sclaroff, S. Online Multi-person Tracking by Tracker Hierarchy. 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance (AVSS), (2012), 379–385.
34. FaceLab. 2014. http://www.seeingmachines.com/product/facelab/.
35. Mobile eye tracking - Tobii Glasses. 2014. http://www.tobii.com/en/eye-tracking-research/global/products/hardware/tobii-glasses-eye-tracker/.
36. Intersense. 2014. www.intersense.com.
37. Polhemus. 2014. http://www.polhemus.com/.
38. Vicon. 2014. http://www.vicon.com/.
39. PhaseSpace. 2014. http://www.phasespace.com/.
40. Omron. 2014. http://www.omron.com/r_d/coretech/vision/okao.html.