Received: 8 June 2017 | Revised: 2 January 2018 | Accepted: 14 March 2018
DOI: 10.1111/jcal.12262

ORIGINAL ARTICLE

Enhancing multimodal learning through personalized gesture recognition

M.J. Junokas | R. Lindgren | J. Kang | J.W. Morphew

University of Illinois, Urbana-Champaign, United States

Correspondence: Michael Junokas, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W Clark St, Room 2103, Urbana, IL, 61801. Email: [email protected]

Funding information: National Science Foundation, Grant/Award Number: IIS-1441563

Abstract
Gestural recognition systems are important tools for leveraging movement-based interactions in multimodal learning environments, but personalizing these interactions has proven difficult. We offer an adaptable model that uses multimodal analytics, enabling students to define their physical interactions with computer-assisted learning environments. We argue that these interactions are foundational to developing stronger connections between students' physical actions and digital representations within a multimodal space. Our model uses real time learning analytics for gesture recognition, training a hierarchical hidden-Markov model with a "one-shot" construct, learning from user-defined gestures, and accessing three different modes of data: skeleton positions, kinematic features, and internal model parameters. Through an empirical comparison with a "pretrained" model, we show that our model can achieve higher recognition accuracy in repeatability and recall tasks. This suggests that our approach is a promising way to create productive experiences with gesture-based educational simulations, promoting personalized interfaces and analytics of multimodal learning scenarios.
1 | INTRODUCTION

Computer-assisted learning systems have recently begun adopting interfaces that elicit embodied interactions across a variety of educational contexts, from learning about ratio (Abrahamson & Trninic, 2015), to the motion of gears (Han & Black, 2011), to planetary astronomy (Lindgren, Tscholl, Wang, & Johnson, 2016). While these interfaces have shown encouraging learning effects and represent a promising new paradigm for educational technology design, these systems tend to be limiting in that they restrict the physical interaction of the user to a specified, predetermined gestural vocabulary (e.g., particular finger conventions for manipulating fractions on a touchscreen, holding one's hands in certain ways to represent molecular interactions, etc.). There is emerging evidence that the gestures people produce, both big and small, can have significant effects on learning (Alibali & Nathan, 2012; Goldin-Meadow, 2011; Lindgren, 2015), but in the context of computer-assisted learning systems, this implies that the gestures can be accurately recognized, that they are intuitive and easy to perform, and that they carry the appropriate meanings for the participants in the context in which they are being enacted.

In order to move closer to this ideal context for multimodal embodied learning, we propose a system that affords personalized gesturing through what is referred to as "one-shot" machine learning, providing learners more accurate recognition of their gestures and an opportunity to customize their gestural interactions with the system. The one-shot technique (Fei-Fei, Fergus, & Perona, 2006) requires a minimal amount of user training to drive algorithmic models. The promise of this method for computer-assisted learning interfaces is that learners can flexibly interact with a robust system that is tuned specifically to their bodily input and associated meanings. Central to this method is creating a set of features that allows for personalized interaction yet does not restrict the system's real time performance.

In order to facilitate a gesture-recognition system that is adaptable to the individual user from a limited training set, we turn to multimodal analytics. By analyzing and applying three separate modes of movement data (skeleton positions, kinematic features, and internal model parameters), we demonstrate that our model supports embodied
J Comput Assist Learn. 2018;1–8. wileyonlinelibrary.com/journal/jcal © 2018 John Wiley & Sons Ltd
interactions that are better recognized, easier to recall, and carry more personal meaning than the traditional approach that uses predefined gestures.

In order to establish our model within a proper context, we begin with a brief review of research in the areas of embodied interaction and algorithmic-enabled gesture recognition, leading to our own one-shot model. We then describe our specific application of one-shot training, defining a generalized path that aims to be applicable in a variety of embodied learning platforms. We then perform an empirical comparison of a two-layer hierarchical hidden-Markov model (HHMM; Fine, Singer, & Tishby, 1998) trained to recognize gestures using two different training methods: the proposed one-shot model and a traditional "pretrained" model as used in past implementations of the algorithm (see Bevilacqua et al., 2010). We conducted a between-subjects study in which two groups of undergraduate students used one of these models to perform three separate tasks where the focus was on repeating gestures, recalling gestures, and using gestures to perform a contextualized quantitative operation.

2 | BACKGROUND

Our approach to interactive system design is based in theories of embodiment and its implications for human learning, multimodal educational environments that integrate gesture-recognition systems, and a focus on creating intuitive and personalized learning environments using machine-learning algorithms.

2.1 | Embodied interaction in computer-assisted learning systems

More accurate and more adaptive systems for recognizing gestures in interactive learning technologies are needed because physical engagement is increasingly being shown to be an effective way to enhance learning outcomes for a range of domains including reading comprehension (Glenberg, Gutierrez, Levin, Japuntich, & Kaschak, 2004), algebra (Goldin-Meadow, Cook, & Mitchell, 2009), chemistry (Flood et al., 2014), and the physical sciences (Kontra, Lyons, Fischer, & Beilock, 2015; Plummer & Maynard, 2014). These studies and others show that gesturing in educational activities can be important, but they also show that gestures can vary greatly, and there is not strong evidence to suggest that prescriptive gesture schemes lead to learning gains across a diverse population of students. The suggestion, rather, is that if computer-assisted learning environments are going to facilitate embodied interactions with learning content, they should be flexible and allow for a range of semantic associations.

In the last decade, there have been several technology-enhanced embodied learning designs involving whole-body gestures that have shown promising results (Enyedy, Danish, Delacruz, & Kumar, 2012; Johnson-Glenberg, Birchfield, Tolentino, & Koziupa, 2014; Lindgren et al., 2016), but these environments have employed fairly simple metaphors that do not necessarily leverage the power of embodied interactions for the kinds of robust concept development that are found in the nontechnology-mediated studies cited above. For example, Lindgren et al. (2016) used body position tracking in a 2D plane to support learning about the movement of objects in space by making metaphorical connections to a learner's body movement. Johnson-Glenberg et al. (2014) used 3D tracking of a hand-held wand to allow students to manipulate virtual objects and change parameters within immersive science simulations. Even the few embodied learning environments that do allow for more expressive body movements tend to utilize direct mappings to spatial position and pretrained classifiers (see Biskupski, Fender, Feuchtner, Karsten, & Willaredt, 2014; Isbister, Karlesky, Frye, & Rao, 2012). As motion tracking devices improve, there will be increased opportunities to collect multimodal data about the character and quality of a learner's interactions, but if gesture detection remains limited to searching for predetermined gestures and reducing these rich datasets to coarse descriptions of movement, there is little reason to expect additional improvements in learning outcomes from physical engagement with computer-assisted learning systems. Providing a more dynamic, adaptable system that builds from user intuitions could translate to both a higher level of perceived usability and enhanced learning (Alshammari, Anane, & Hendley, 2016; Nacenta, Kamber, Qiang, & Kristensson, 2013).

To move from this predefined space to one that affords dynamic bodily expression and rich metaphor, we propose to use a "one-shot" gesture training methodology, alleviating the need for burdensome mappings, memorization, and conforming to predefined vocabularies, instead basing symbolic structure on expressions created by an individual student.

2.2 | Personalized gesture recognition using the HHMM

FIGURE 1 Hierarchical hidden-Markov model graphical model in a single timestep (t) showing inputs (I) with two layers of hidden states (H) and resulting observations (O). Each hidden state can be a production state (p), resulting directly in an observation, or an internal state (i), resulting in a model [Colour figure can be viewed at wileyonlinelibrary.com]

In order to develop a minimally trained yet adaptable model for personalized user interactions, we applied the HHMM (see Figure 1) for gesture recognition. The HHMM, while capable of traditional machine-learning applications, is especially effective for modeling
movement features. This is due to the relatively robust feature set it generates using multiple layers of abstraction, representing temporal paths through concrete kinematic features. For example, a multimodal feature set collected from a Kinect V2 sensor (Microsoft, 2017) consists of kinematic measures, digital video, and depth maps. This can be temporally linked to abstract probabilistic parameters made by the model, producing a machine-generated mode of data for analysis. From these multiple streams, higher level statistical representations of the user's movement can be constructed, allowing us to synthesize a more nuanced and personalized platform for user interaction. This augmented feature space lends itself well to maximally descriptive class descriptions, providing an ideal environment for one-shot training. This more in-depth user representation provides an interaction environment that is conducive to fluid user performance, allowing for the bodily expression of symbolic conceptions to be captured without inhibiting user movement. Due to this potential, HHMMs have already been used effectively in several multimodal applications (Bevilacqua et al., 2010; Françoise, Caramiaux, & Bevilacqua, 2011), including movement-to-sound mapping schemas (Françoise, 2015). Extending this type of machine learning to educational applications has already been attempted (Junokas, Linares, & Lindgren, 2016), but with this paper, we show that these multimodal representations can be leveraged by one-shot models to specifically tailor interaction to the individual user. By providing an environment where users can define their own physical interactions without losing accuracy, we provide a system that can adapt to personalized user interactions, more accurately reflecting their intentions, thus leading to more beneficial feedback and customization of the user experience.

3 | MODEL DESCRIPTION

The proposed system captures raw movement data using the Microsoft Kinect V2, which utilizes infrared cameras to generate depth maps (see Figure 2). From these depth maps, the Kinect generates a skeleton frame of 25 distinct joints. We accessed this skeleton, writing software using the Kinect's application programming interface, to send skeleton positions across our system via the Open Sound Control protocol (Wright & Freed, 1997) at a fixed rate of 30 frames per second.

We then analyzed this raw movement with software written in Max 7/Jitter (Puckette, 2014), creating a set of real time kinematic features from the raw movement data. While we were able to generate a broad range of features (e.g., positional derivatives, comparative features, and statistical metrics) across a variety of timeframes, we chose to focus on frame-to-frame relative velocity features for our initial testing. For both of our experimental conditions, we trained a two-layer HHMM to recognize gestures using 30 features:

a. the x, y, z velocities of the right ankle (3), right wrist (3), left ankle (3), and left wrist (3) relative to the spine mid;
b. the x, y, z velocities of the right wrist (3), left ankle (3), and left wrist (3) relative to the right ankle;
c. the x, y, z velocities of the left ankle (3) and left wrist (3) relative to the right wrist; and
d. the x, y, z velocity of the left wrist (3) relative to the left ankle.

Using these kinematic features, we trained a two-layer HHMM adapted from the IRCAM MUBU package (IRCAM, 2015). From this HHMM, we were able to model and classify user-defined gestures. In the process of classifying these gestures, additional underlying statistics that contributed to that classification were captured. These internal model parameters gave us an abstracted, machine-driven mode of the user's gestures, providing a deeper feature perspective that could be used to aid in gesture recognition or applied in ulterior domains, such as mapping controls to simulation applications. Through the combination of these three data modes, we were able to create an adaptable one-shot gesture-recognition model that allowed for a personalized, multimodal interaction.

3.1 | Empirical tasks in repeatability, recall, and operation semantics

To validate the efficacy of the one-shot model for the kinds of gestural interactions one might find in embodied learning environments, we compared its performance to that of a traditional pretrained model in a three-stage gesture-recognition task. We hypothesized that in all three stages, one-shot gesture recognition would perform with superior accuracy to pretrained models, while also leaving learners more satisfied with the quality of the interactions.

FIGURE 2 Software-captured data using the Kinect V2: high definition digital video (left), depth maps with overlaid skeleton (center), and infrared sensor data (right) [Colour figure can be viewed at wileyonlinelibrary.com]
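The 30 relative-velocity features listed above can be sketched in a few lines. The joint names, the reading of "velocity relative to X" as a difference of joint velocities, and the helper names below are our assumptions for illustration; the study's actual pipeline was written in Max 7/Jitter:

```python
import numpy as np

FPS = 30          # Kinect V2 skeleton stream rate (frames per second)
DT = 1.0 / FPS

# Joint/reference pairings read off the paper's feature list (a)-(d):
# 10 pairs x 3 axes = 30 features per frame.
PAIRS = [
    ("right_ankle", "spine_mid"), ("right_wrist", "spine_mid"),
    ("left_ankle", "spine_mid"), ("left_wrist", "spine_mid"),
    ("right_wrist", "right_ankle"), ("left_ankle", "right_ankle"),
    ("left_wrist", "right_ankle"), ("left_ankle", "right_wrist"),
    ("left_wrist", "right_wrist"), ("left_wrist", "left_ankle"),
]

def joint_velocities(prev_frame, frame, dt=DT):
    """Frame-to-frame (x, y, z) velocity of every tracked joint."""
    return {j: (np.asarray(frame[j], float) - np.asarray(prev_frame[j], float)) / dt
            for j in frame}

def feature_vector(prev_frame, frame):
    """30-dimensional relative-velocity feature vector for one frame."""
    v = joint_velocities(prev_frame, frame)
    return np.concatenate([v[a] - v[b] for a, b in PAIRS])
```

A gesture is then the time series of these 30-dimensional vectors, which is what the recognizer consumes.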
4
JUNOKAS
ET AL.
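To make the one-shot training scheme concrete, the sketch below builds a left-to-right hidden-Markov model from a single recorded example (one state per template frame, isotropic Gaussian emissions) and classifies new sequences by forward-algorithm likelihood. This is a single-layer simplification with hypothetical names and a fixed emission variance, not the two-layer MUBU-based HHMM the study actually used:

```python
import numpy as np

def one_shot_model(template, var=0.05):
    """Left-to-right HMM from a single recorded example:
    one state per template frame, isotropic Gaussian emissions."""
    return {"means": np.asarray(template, float), "var": var}

def log_likelihood(model, sequence):
    """Forward algorithm (log space) for a left-to-right HMM with
    equal stay/advance transition probabilities."""
    means, var = model["means"], model["var"]
    n, dim = means.shape
    seq = np.asarray(sequence, float)
    # log emission density of each frame under each state
    d2 = ((seq[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    log_b = -0.5 * (d2 / var + dim * np.log(2 * np.pi * var))
    log_a = np.log(0.5)
    alpha = np.full(n, -np.inf)
    alpha[0] = log_b[0, 0]           # gestures must start in state 0
    for t in range(1, len(seq)):
        stay = alpha + log_a
        advance = np.concatenate(([-np.inf], alpha[:-1])) + log_a
        alpha = np.logaddexp(stay, advance) + log_b[t]
    return np.logaddexp.reduce(alpha)  # may end in any state

def classify(models, sequence):
    """Pick the gesture model that best explains the sequence."""
    scores = {name: log_likelihood(m, sequence) for name, m in models.items()}
    return max(scores, key=scores.get)
```

One "shot" per gesture class suffices: each user's recording becomes its own model (`models = {name: one_shot_model(example)}`), which is what lets the recognizer adapt to self-defined gestures.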
The three stages each participant went through consisted of (a) repeating gestures based on on-screen representations (repeatability), (b) performing gestures by recalling their associated number (recall), and (c) using gestures representing context-appropriate mathematical operations (operation semantics).

4 | METHODS

4.1 | Participants

Twenty-one participants were split into two groups, with 10 in the one-shot group and 11 in the pretrained group. All users were undergraduate college students between the ages of 18 and 22, majoring in a variety of fields including computer science, neuroscience, accounting, education, physics, and engineering. Gender of the participants was evenly balanced between the two conditions, with 5 males and 5 females in the one-shot condition and 5 males and 6 females in the pretrained condition. All participants were unfamiliar with our specific research aims at the beginning of the session and were not given any incentive to participate.

4.2 | Study procedures

Participants were brought into our lab and given a short introduction to the gesture-based interface (see Figure 3) they would be using and the tasks they would be performing. We carefully controlled all environmental variables across participants, including the amount of light exposure in the task environment and the distance the participant was from the sensor. The hardware configuration was kept as consistent as possible across all of the classification tasks in an attempt to minimize noise differentials or sensor variability. All participants in both conditions were guided through three stages led by an experimenter. Each stage consisted of three groups of eight gestures, and participants were given as much time as they needed to complete the given stage.

FIGURE 3 User interface for empirical task with experimenter controls on the right and gestural interactions captured and displayed via a skeleton figure in the center [Colour figure can be viewed at wileyonlinelibrary.com]

In the first stage (repeatability), participants were asked to repeat gestures that were demonstrated by a digital skeleton on a large projected screen (see Figure 4). One-shot users recorded four of their own designed gestures, and pretrained users were shown four predetermined gestures. Each participant was instructed to watch the playback to completion and then repeat the gesture they had seen. The system attempted to recognize this gesture, and the accuracy of such recognition was measured internally, though not reported back to the participant.

FIGURE 4 User (red skeleton figure) performing the first stage of the experiment, repeating gestures shown in playback (purple skeleton figure) [Colour figure can be viewed at wileyonlinelibrary.com]

In the second stage (recall), users were asked to recall gestures that were associated with a given number (see Figure 5). For example, if the stage prompted the user with the number one, the user would perform the gesture they had associated with the number one; if prompted with the number two, the user would perform the gesture they associated with the number two; and so forth. Gestures were defined by the users similarly to Stage 1, with one-shot users recording their own gestures and pretrained users being given a predetermined set.

FIGURE 5 User (red skeleton figure) performing the second stage of the experiment, recalling gestures associated with a given number on the left [Colour figure can be viewed at wileyonlinelibrary.com]

In the third stage (operation semantics), participants were asked to associate four new gestures with the arithmetic operations of add, subtract, multiply, and divide (see Figure 6). One-shot participants recorded the four operational gestures based on their own semantic associations with those operations. They could choose any gestures they wished as long as they were distinct and large enough to be recognized by the Kinect. After being given some time to consider their chosen gesture, users were cued with textual instructions on which operation they were recording and when to start recording their gesture by a digital countdown. Once started, gesture recordings were
automatically stopped when the user's "energy" (the summation of speed across all the joints) crossed below a preset threshold that represented "stillness." Once all four recordings were completed, users were given an opportunity to see the gestures they recorded through digital playback. If they were unsatisfied with any of their recordings, they were given an opportunity to rerecord any gesture.

Pretrained participants in the third stage were shown a set of four researcher-chosen gestures corresponding to the mathematical operations (a right-handed "stacking" motion to the right for add, a left-handed "throwing away" motion to the left for subtract, a two-handed "folding" gesture for multiply, and a left-handed "slashing" motion for divide). Participants were then given equations with missing operators and asked to gesture according to which operation they believed belonged in the equation. For example, if participants were given "1__4 = 5," they were expected to perform the gesture for add.

Once all of the stages were completed, participants were given a qualitative feedback form to provide their subjective impressions of their experience. Users were given several specific questions:

a. Were your gestures easy to remember?
b. Did anything help you remember your gestures?
c. What was difficult to do?
d. What was easy to do?

Additionally, they were given an open section to provide their own feedback about the experience. After participants were done with the questionnaire, they were thanked and allowed to leave.

4.3 | Gesture data analysis

In each of the three stages, each gesture was analyzed by the models frame to frame, classifying every frame collected by the system at a rate of 30 frames per second. Classification began for each gesture when the participant crossed above a specified speed threshold (i.e., began to move) after the corresponding digital playback, number, or equation was provided. Classification ended for each gesture when the participant verbally acknowledged they were done gesturing, upon which the researcher manually stopped the classification process. A frame was classified as correct if it matched the corresponding requested gesture for a given task in a given stage. Stage accuracy was determined by taking the total number of frames classified correctly from all of the tasks performed in a given stage, divided by the total number of frames classified for the stage.

FIGURE 6 User (red skeleton figure) performing the third stage of the experiment, performing gestures associated with arithmetic operations to complete the given equation [Colour figure can be viewed at wileyonlinelibrary.com]

5 | RESULTS

To investigate differences between the groups, an independent t test was conducted to compare stage accuracy for the one-shot and the pretrained models in each stage. Shapiro–Wilk's tests indicated that the scores were normally distributed, and Levene's tests indicated that there was homogeneity of variance between the groups. As shown in Table 1, there are statistically significant differences between the one-shot model and the pretrained model groups in all three stages. On average, the students in the one-shot group tended to have higher accuracy scores than the students in the pretrained group. Specifically, the mean accuracy score in Stage 3 differed between the one-shot model (M = 130.14, SD = 20.51, n = 7) and the pretrained model (M = 102.86, SD = 22.27, n = 7) at the .001 level of significance (t = 5.43, df = 19, p < .001).

Given the significance of the tests, the one-shot model outperformed the pretrained model in each of the three stages (see Table 1). The most significant difference was in the accuracies of the operation semantics Stage 3. While Stages 1 and 2 maintained relatively close accuracies, there was a drop in accuracy when going to Stage 3 for both of the model conditions.

In response to their postexperiment survey, several one-shot participants mentioned that their gestures were easy to remember, with one user saying, "I made them with the fact that I'd have to replicate them in mind" and another user saying, "Knowing how distinct and different they were from each other, I could remember which mover [sic] corresponded to the operation." One-shot users seemed to be working harder to create semantically meaningful associations with their gestures.

Pretrained participants tried to establish connections to the prescribed gestures, with one user stating they "made semantic connections between the operations and the gestures, e.g., the divide gesture is similar to a slash" and another saying they "tried to think of personal ways to memorize the gestures." Even using these connections, pretrained users frequently expressed frustration with spatial aspects of the gestures, remembering the general gesture but performing it with the wrong directional affinity (e.g., left-handed vs. right-handed). One user specifically stated that "It was a little bit hard to remember the direction." Pretrained users seemed to be trying to match prescribed directions rather than create any deeper, symbolic connections with their gestures.

6 | DISCUSSION

Our one-shot model achieved a higher accuracy in each of our experimental task's stages. This can be mainly attributed to the adaptability
TABLE 1 Means, standard deviations, and independent t-test results of the three-stage gesture-recognition task

                                 One-shot      Pretrained
                                 Mean (SD)     Mean (SD)     t test          Significance
Stage 1 - repeatability          0.71 (0.14)   0.57 (0.15)   t(19) = 2.24    p = .037
Stage 2 - recall                 0.73 (0.12)   0.55 (0.12)   t(19) = 3.23    p = .004
Stage 3 - operation semantics    0.65 (0.13)   0.38 (0.08)   t(19) = 5.43    p < .0001

Note. One-shot (n = 10), pretrained (n = 11).
of our one-shot model to personalized inputs, allowing the users to perform self-defined gestures and to create more solid mnemonic connections to their movement. This finding is overall in line with previous research indicating user-defined gestures yield a higher level of memorability or enjoyment (e.g., Freeman, Benko, Morris, & Wigdor, 2009; Nacenta et al., 2013). This personalized recognition additionally allowed users to focus on accomplishing tasks rather than performing accurate gestures.

In the repeatability Stage 1, pretrained users had to remember the preset gesture templates in addition to performing them, maintaining spatial and directional mappings that may not have been intuitive given a gesture. For example, if a gesture was performed with the right hand, pretrained users did not have the advantage of defining said gesture and thus were more likely to erroneously start with the left hand. In such cases, users quickly corrected their action when seeing they used an incorrect joint by using the playback provided in Stage 1. Providing the playback feature that enables users to observe and reflect on their movements seemed to benefit the pretrained users (Lindgren, 2015). One-shot users did not have the same problem, consistently using the correct joint for the respective actions and performing the spatial configuration of their self-defined gestures more precisely, not having to parse out directionality or recall any matching schema. This quality of the pretrained condition may have interfered with user performance. In particular, this extra step of mapping to an external schema (e.g., performing given gestures) forced pretrained participants to perform an extra mental step, recalling the different parameters of the prescribed gestures before performing them, thus leading to lower accuracies.

This directional problem was exacerbated in Stages 2 and 3, where users were forced to recall gestures without visual reinforcement. Several users performed the given gesture but directionally inverse, leading to much lower accuracies. Although this did occur with one of the one-shot users, directional inversion did not occur nearly as frequently as it did in the pretrained condition, where it occurred with three different users.

In the recall Stage 2, users performed with a very similar accuracy as in Stage 1, staying within 2%. The small difference in accuracy suggests that users are able to repeat and recall gestures with similar success given that the gestures bear no symbolic meaning. The drop in accuracy from the one-shot to the pretrained condition can then be attributed to the performance capabilities of the respective conditions.

Unlike the first two stages, the operation semantics Stage 3 had a significant difference in performance accuracies in comparison with the first two stages, with the one-shot model largely outperforming the pretrained model. Additionally, one-shot users were able to perform the correct target gestures more often than pretrained users in this stage. This was measured by looking at tasks where a majority of the frames were misclassified and reviewing the gesture recordings to make sure there were not any computational errors. This supports our larger argument that allowing users to define their own gesture space increases the user's ability to recall mathematical operations. Rather than expending mental and physical effort trying to frame their own gesture within a set of movement templates as in the pretrained model, one-shot users were able to use their gestural nuance as a mechanism to strengthen their interactions with the system. The capability of the model to adapt and learn the one-shot user's own interpretations of the mathematical operations seemed to support embodied interaction that alleviates the additional effort of specific pretrained template matching, which may have given users the opportunity to focus on accomplishing tasks rather than performing accurately.

Qualitative feedback from participants often cited recall problems when using the pretrained model, not remembering which specific joints to use and how precisely to perform gestures to get the desired response. The ability to create a personalized gestural vocabulary using the one-shot model seems to have allowed users to quickly connect abstract concepts such as mathematical functions to an immediate physical expression, without the additional burden of associating with externally prescribed criteria. This finding, together with the lower accuracies of the pretrained model, is in line with prior research that highlighted memorability as an important aspect of gestural interfaces, since users are required to remember predefined gestures and often become frustrated when a forgotten gesture causes errors (Long Jr, Landay, Rowe, & Michiels, 2000).

Body movement can benefit students' performance in numerous areas such as algebra (Goldin-Meadow et al., 2009; Lindgren et al., 2016). The results in this study show that the user-defined model led to increased learner performance, and the predefined condition did not provide the same level of benefits. This highlights the design of personalized gestural environments as a critical area of research and design. Interestingly, users who used the one-shot model created a wide variety of gestures (e.g., jumping with outstretched arms, squatting with arms held horizontally, spinning in place, and using only one arm for all gestures), not creating a cohesive representation across the mathematical functions, suggesting that allowing personalization of input is more in line with users' uninfluenced responses to such a system. Some of the symbolic representations of movement included creating spatial sequences, not necessarily stacking but organizing in rows or columns (add); horizontal hand motions, emulating a dash (subtract); expansion, growing from small to large using the whole body (multiply); and separating material by sweeping away excess (divide).
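The stage-accuracy measure these comparisons rest on (Section 4.3: correctly classified frames divided by all classified frames in a stage) reduces to a few lines; the data layout and function name here are a hypothetical illustration, not the study's actual analysis code:

```python
def stage_accuracy(tasks):
    """tasks: list of (requested_gesture, per_frame_labels) pairs for one
    stage; returns correctly classified frames / total classified frames."""
    correct = sum(label == target
                  for target, frames in tasks
                  for label in frames)
    total = sum(len(frames) for _, frames in tasks)
    return correct / total if total else 0.0
```

For example, a stage in which three of four classified frames match the requested gestures scores 0.75, the same scale as the Table 1 means.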
JUNOKAS
7
ET AL.
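For reference, the Table 1 t statistics follow the standard pooled-variance independent-samples formula with df = n1 + n2 - 2 = 19; recomputing from the reported two-decimal means and standard deviations reproduces them only approximately because of rounding:

```python
from math import sqrt

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t statistic with pooled variance (df = n1 + n2 - 2)."""
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Stage 1 (repeatability): one-shot 0.71 (0.14), pretrained 0.57 (0.15)
t1 = pooled_t(0.71, 0.14, 10, 0.57, 0.15, 11)   # ~2.2, vs. reported t(19) = 2.24
```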
In some cases, this variability in gesture vocabulary did lead to lower model performance due to the lack of diversity in the user's gestures. Prior research indicates that predesigned or user-elicited gestures are more applicable than user-defined gestures in certain situations; however, the gap is constantly narrowing with the improvement of sensing and recognizer technologies (Nacenta et al., 2013). An advantage of the pretrained model is that researchers would be able to control and define large differences in movements, resulting in more clearly delineated gestures. A clear advantage of the "one-shot" methodology is that users would presumably train in the exact environment (e.g., amount of light) and under the same conditions (e.g., personal physical attributes) in which they would test, providing a much more focused model. Although we took considerable effort to control the testing conditions, the Kinect V2 can still be sensitive to noise, demographic attributes, and other environmental factors. Since our task is centered on creating optimal algorithm performance and the ability to empower the user to connect this performance to higher-level abstractions, the capability of targeting the user directly to create a more focused model is beneficial to our goals and effective in reducing the effect of these sensor sensitivities.

For both groups, the majority of the incorrectly classified frames occurred at the beginning of the gestures, in the transition from stillness into movement. This was especially apparent in the first stage. After a gesture was initiated and maintained consistently (e.g., without changing or pausing mid-movement), the models generally maintained a single classification for the remainder of the recording. Classification results would improve by classifying a larger timeframe (e.g., multiple frames, or the entire gesture), but by looking at the frame-to-frame classification, we were able to identify the progression of classification, providing deeper insight into the viability of leveraging these abstract model parameters for real-time multimodal analytics.

Overall, although the sample size is fairly small and not very diverse, this preliminary experiment supports using a one-shot model to leverage personalized educational interactions when users are required to couple gestural performance and symbolic recall with accomplishing a reasoning task. A more rigorous and in-depth experiment would need a significantly larger sample size to make any definitive claims; however, this initial experiment shows the potential of using a one-shot gesture-recognition model.

7 | CONCLUSION AND FUTURE DIRECTIONS

This preliminary experiment lays the groundwork for future testing, including the application of our model in more practical educational interfaces. We anticipate that applying the model beyond simple classification tasks to multimodal feature synthesis and control mappings will lead to more intuitive and dynamic interactions. These more natural system interactions could, in turn, yield more developed and accurate analytics on student learning in gestural environments.

Although we felt our task (comparing pretrained and one-shot models) was the most basic that would test our hypothesis (that one-shot models provide a viable option for personalizing user interaction), there are innumerable additional combinations we could compare using our current construct. We are interested in exploring the optimization of model training (e.g., does two-shot training improve on one-shot at all?), seeking commonality across users' gestural representations of abstract concepts, and finding methods to emphasize the most relevant parameters of movement. In this initial work, we focused on leveraging the internal states of the respective models to classify gestures. While these internal states were a result of our two other modes of data (e.g., skeleton positions and kinematic features), we imagine a much more effective and satisfying model could be developed by balancing all of these modes into a cohesive representation of participants' movement. For example, skeleton joints with higher speeds could be used to weight the internal parameters, reflecting concentrations of user energy in addition to their nuanced expression. The model we have developed is flexible enough for that type of expansion and will be developed accordingly in our future applications.

We are currently working on implementing our one-shot model in several educational simulation theaters, replacing a pretrained model that we had previously used. These simulation theaters are situated within the context of understanding scale in three different domains: the Richter scale used to measure the magnitude of earthquakes, the pH scale that measures the level of acidity within solutions, and the growth of bacteria cultures. Interaction within these domains is preceded by a gestural tutorial space where students familiarize themselves with their movement and how it can control the simulations. Up until this point, we have been using a pretrained model, having students match gestural templates that we based on research and interviews (Alameh, Linares, Mathayas, & Lindgren, 2016). With the findings in this paper, we are now beginning to integrate our one-shot model, merging it with our previous research by physically and verbally cuing the students (Lindgren, 2015) while letting them define gestures according to their own specifications.

By continuing to solidify the relationship between the interaction space, gesture recognition, and feedback systems, we look to continue empowering users: alleviating them of the cognitive load required to match and perform the specific gestures of a pretrained model, and allowing them to form their own personalized gesture language and semiotic structure, through which they communicate their ideas via bodily movement and technology.

ACKNOWLEDGMENTS

This research was supported by a grant from the National Science Foundation (IIS-1441563). We would also like to acknowledge the work of the entire ELASTIC3S team, especially Ben Lane, Greg Kohlburn, Sahil Kumar, Tom Roush, Viktor Makarskyy, and Wai-Tat Fu, for their development and analysis work that contributed to this study.
ORCID

M.J. Junokas  http://orcid.org/0000-0002-8764-7936

REFERENCES

Abrahamson, D., & Trninic, D. (2015). Bringing forth mathematical concepts: Signifying sensorimotor enactment in fields of promoted action. ZDM, 47(2), 295–306.
Alameh, S., Linares, N., Mathayas, N., & Lindgren, R. (2016). The effect of students' gestures on their reasoning skills regarding linear and exponential growth. Annual Meeting of the National Association of Research on Science Teaching.
Alibali, M. W., & Nathan, M. J. (2012). Embodiment in mathematics teaching and learning: Evidence from learners' and teachers' gestures. The Journal of the Learning Sciences, 21(2), 247–286.
Alshammari, M., Anane, R., & Hendley, R. J. (2016). Usability and effectiveness evaluation of adaptivity in e-learning systems. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (pp. 2984–2991). ACM.
Bevilacqua, F., Zamborlin, B., Sypniewski, A., Schnell, N., Guédy, F., & Rasamimanana, N. (2010). Gesture in embodied communication and human–computer interaction (Vol. 5934).
Biskupski, A., Fender, A. R., Feuchtner, T. M., Karsten, M., & Willaredt, J. D. (2014). Drunken Ed: A balance game for public large screen displays. In CHI'14 Extended Abstracts on Human Factors in Computing Systems (pp. 289–292). ACM.
Enyedy, N., Danish, J. A., Delacruz, G., & Kumar, M. (2012). Learning physics through play in an augmented reality environment. International Journal of Computer-Supported Collaborative Learning, 7(3), 347–378.
Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1), 41–62.
Flood, V. J., Amar, F. G., Nemirovsky, R., Harrer, B. W., Bruce, M. R., & Wittmann, M. C. (2014). Paying attention to gesture when students talk chemistry: Interactional resources for responsive teaching. Journal of Chemical Education, 92(1), 11–22.
Françoise, J. (2015). Motion-sound mapping by demonstration (Doctoral dissertation). UPMC.
Françoise, J., Caramiaux, B., & Bevilacqua, F. (2011). Realtime segmentation and recognition of gestures using hierarchical Markov models (Mémoire de Master). Université Pierre et Marie Curie–Ircam.
Freeman, D., Benko, H., Morris, M. R., & Wigdor, D. (2009). ShadowGuides: Visualizations for in-situ learning of multi-touch and whole-hand gestures. In Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces (pp. 165–172). ACM.
Glenberg, A. M., Gutierrez, T., Levin, J., Japuntich, S., & Kaschak, M. P. (2004). Activity and imagined activity can enhance young children's reading comprehension. Journal of Educational Psychology, 96(3), 424–436.
Goldin-Meadow, S. (2011). Learning through gesture. Wiley Interdisciplinary Reviews: Cognitive Science, 2(6), 595–607.
Goldin-Meadow, S., Cook, S. W., & Mitchell, Z. A. (2009). Gesturing gives children new ideas about math. Psychological Science, 20(3), 267–272.
Han, I., & Black, J. B. (2011). Incorporating haptic feedback in simulation for learning physics. Computers & Education, 57(4), 2281–2290.
IRCAM. (2015). MUBU for Max. http://forumnet.ircam.fr/fr/produit/mubu/
Isbister, K., Karlesky, M., Frye, J., & Rao, R. (2012). Scoop!: A movement-based math game designed to reduce math anxiety. In CHI'12 Extended Abstracts on Human Factors in Computing Systems (pp. 1075–1078). ACM.
Johnson-Glenberg, M. C., Birchfield, D. A., Tolentino, L., & Koziupa, T. (2014). Collaborative embodied learning in mixed reality motion-capture environments: Two science studies. Journal of Educational Psychology, 106(1), 86.
Junokas, M., Linares, N., & Lindgren, R. (2016). Developing gesture recognition capabilities for interactive learning systems: Personalizing the learning experience with advanced algorithms. In Proceedings of the International Conference of the Learning Sciences.
Kontra, C., Lyons, D. J., Fischer, S. M., & Beilock, S. L. (2015). Physical experience enhances science learning. Psychological Science, 26(6), 737–749.
Lindgren, R. (2015). Getting into the cue: Embracing technology-facilitated body movements as a starting point for learning. In V. R. Lee (Ed.), Learning technologies and the body: Integration and implementation in formal and informal learning environments (pp. 39–54). New York, NY: Routledge.
Lindgren, R., Tscholl, M., Wang, S., & Johnson, E. (2016). Enhancing learning and engagement through embodied interaction within a mixed reality simulation. Computers & Education, 95, 174–187.
Long, A. C., Jr., Landay, J. A., Rowe, L. A., & Michiels, J. (2000). Visual similarity of pen gestures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 360–367). ACM.
Microsoft. (2017). Developing with Kinect for Windows. https://developer.microsoft.com/en-us/windows/kinect/develop
Nacenta, M. A., Kamber, Y., Qiang, Y., & Kristensson, P. O. (2013). Memorability of pre-designed and user-defined gesture sets. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1099–1108). ACM.
Plummer, J. D., & Maynard, L. (2014). Building a learning progression for celestial motion: An exploration of students' reasoning about the seasons. Journal of Research in Science Teaching, 51(7), 902–929.
Puckette, M. (2014). Max/MSP (version 6). Cycling '74.
Wright, M., & Freed, A. (1997). Open SoundControl: A new protocol for communicating with sound synthesizers. In ICMC.

How to cite this article: Junokas MJ, Lindgren R, Kang J, Morphew JW. Enhancing multimodal learning through personalized gesture recognition. J Comput Assist Learn. 2018;1–8. https://doi.org/10.1111/jcal.12262