Received: 8 June 2017 | Revised: 2 January 2018 | Accepted: 14 March 2018
DOI: 10.1111/jcal.12262

ORIGINAL ARTICLE

Enhancing multimodal learning through personalized gesture recognition

M.J. Junokas | R. Lindgren | J. Kang | J.W. Morphew

University of Illinois, Urbana-Champaign, United States

Correspondence: Michael Junokas, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W Clark St, Room 2103, Urbana, IL, 61801. Email: [email protected]

Funding information: National Science Foundation, Grant/Award Number: IIS-1441563

Abstract
Gestural recognition systems are important tools for leveraging movement-based interactions in multimodal learning environments, but personalizing these interactions has proven difficult. We offer an adaptable model that uses multimodal analytics, enabling students to define their physical interactions with computer-assisted learning environments. We argue that these interactions are foundational to developing stronger connections between students' physical actions and digital representations within a multimodal space. Our model uses real time learning analytics for gesture recognition, training a hierarchical hidden-Markov model with a "one-shot" construct, learning from user-defined gestures, and accessing three different modes of data: skeleton positions, kinematic features, and internal model parameters. Through an empirical comparison with a "pretrained" model, we show that our model can achieve higher recognition accuracy in repeatability and recall tasks. This suggests that our approach is a promising way to create productive experiences with gesture-based educational simulations, promoting personalized interfaces and analytics of multimodal learning scenarios.
1 | INTRODUCTION

Computer-assisted learning systems have recently begun adopting interfaces that elicit embodied interactions across a variety of educational contexts, from learning about ratio (Abrahamson & Trninic, 2015), to the motion of gears (Han & Black, 2011), to planetary astronomy (Lindgren, Tscholl, Wang, & Johnson, 2016). While these interfaces have shown encouraging learning effects and represent a promising new paradigm for educational technology design, these systems tend to be limiting in that they restrict the physical interaction of the user to a specified, predetermined gestural vocabulary (e.g., particular finger conventions for manipulating fractions on a touchscreen, holding one's hands in certain ways to represent molecular interactions, etc.). There is emerging evidence that the gestures people produce, both big and small, can have significant effects on learning (Alibali & Nathan, 2012; Goldin-Meadow, 2011; Lindgren, 2015), but in the context of computer-assisted learning systems, this implies that the gestures can be accurately recognized, that they are intuitive and easy to perform, and that they carry the appropriate meanings for the participants in the context in which they are being enacted.

In order to move closer to this ideal context for multimodal embodied learning, we propose a system that affords personalized gesturing through what is referred to as "one-shot" machine learning, providing learners more accurate recognition of their gestures and an opportunity to customize their gestural interactions with the system. The one-shot technique (Fei-Fei, Fergus, & Perona, 2006) requires a minimal amount of user training to drive algorithmic models. The promise of this method for computer-assisted learning interfaces is that learners can flexibly interact with a robust system that is tuned specifically to their bodily input and associated meanings. Central to this method is creating a set of features that allows for personalized interaction yet does not restrict the system's real time performance.

In order to facilitate a gesture-recognition system that is adaptable to the individual user from a limited training set, we turn to multimodal analytics. By analyzing and applying three separate modes of movement data (skeleton positions, kinematic features, and internal model parameters), we demonstrate that our model supports embodied
J Comput Assist Learn. 2018;1–8. wileyonlinelibrary.com/journal/jcal © 2018 John Wiley & Sons Ltd
interactions that are better recognized, easier to recall, and carry more personal meaning than the traditional approach that uses predefined gestures.

In order to establish our model within a proper context, we begin with a brief review of research in the areas of embodied interaction and algorithmic-enabled gesture recognition, leading to our own one-shot model. We then describe our specific application of one-shot training, defining a generalized path that aims to be applicable in a variety of embodied learning platforms. We then perform an empirical comparison of a two-layer hierarchical hidden-Markov model (HHMM; Fine, Singer, & Tishby, 1998) trained to recognize gestures using two different training methods: the proposed one-shot model and a traditional "pretrained" model as used in past implementations of the algorithm (see Bevilacqua et al., 2010). We conducted a between-subjects study in which two groups of undergraduate students used one of these models to perform three separate tasks where the focus was on repeating gestures, recalling gestures, and using gestures to perform a contextualized quantitative operation.

2 | BACKGROUND

Our approach to interactive system design is based in theories of embodiment and its implications for human learning, multimodal educational environments that integrate gesture-recognition systems, and a focus on creating intuitive and personalized learning environments using machine-learning algorithms.

2.1 | Embodied interaction in computer-assisted learning systems

More accurate and more adaptive systems for recognizing gestures in interactive learning technologies are needed because physical engagement is increasingly being shown to be an effective way to enhance learning outcomes for a range of domains including reading comprehension (Glenberg, Gutierrez, Levin, Japuntich, & Kaschak, 2004), algebra (Goldin-Meadow, Cook, & Mitchell, 2009), chemistry (Flood et al., 2014), and the physical sciences (Kontra, Lyons, Fischer, & Beilock, 2015; Plummer & Maynard, 2014). These studies and others show that gesturing in educational activities can be important, but they also show that gestures can vary greatly, and there is not strong evidence to suggest that prescriptive gesture schemes lead to learning gains across a diverse population of students. The suggestion, rather, is that if computer-assisted learning environments are going to facilitate embodied interactions with learning content, they should be flexible and allow for a range of semantic associations.

In the last decade, there have been several technology-enhanced embodied learning designs involving whole-body gestures that have shown promising results (Enyedy, Danish, Delacruz, & Kumar, 2012; Johnson-Glenberg, Birchfield, Tolentino, & Koziupa, 2014; Lindgren et al., 2016), but these environments have employed fairly simple metaphors that do not necessarily leverage the power of embodied interactions for the kinds of robust concept development that are found in the nontechnology-mediated studies cited above. For example, Lindgren et al. (2016) used body position tracking in a 2D plane to support learning about the movement of objects in space by making metaphorical connections to a learner's body movement. Johnson-Glenberg et al. (2014) used 3D tracking of a hand-held wand to allow students to manipulate virtual objects and change parameters within immersive science simulations. Even the few embodied learning environments that do allow for more expressive body movements tend to utilize direct mappings to spatial position and pretrained classifiers (see Biskupski, Fender, Feuchtner, Karsten, & Willaredt, 2014; Isbister, Karlesky, Frye, & Rao, 2012). As motion tracking devices improve, there will be increased opportunities to collect multimodal data about the character and quality of a learner's interactions, but if gesture detection remains limited to searching for predetermined gestures and reducing these rich datasets to coarse descriptions of movement, there is little reason to expect additional improvements in learning outcomes from physical engagement with computer-assisted learning systems. Providing a more dynamic, adaptable system that builds from user intuitions could translate to both a higher level of perceived usability and enhanced learning (Alshammari, Anane, & Hendley, 2016; Nacenta, Kamber, Qiang, & Kristensson, 2013).

To move from this predefined space to one that affords dynamic bodily expression and rich metaphor, we propose to use a "one-shot" gesture training methodology, alleviating the need for burdensome mappings, memorization, and conforming to predefined vocabularies, instead basing symbolic structure on expressions created by an individual student.

2.2 | Personalized gesture recognition using the HHMM

FIGURE 1 Hierarchical hidden-Markov model graphical model in a single timestep (t) showing inputs (I) with two layers of hidden states (H) and resulting observations (O). Each hidden state can be a production state (p), resulting directly in an observation, or an internal state (i), resulting in a model [Colour figure can be viewed at wileyonlinelibrary.com]

In order to develop a minimally trained yet adaptable model for personalized user interactions, we applied the HHMM (see Figure 1) for gesture recognition. The HHMM, while capable of traditional machine-learning applications, is especially effective for modeling
movement features. This is due to the relatively robust feature set it generates using multiple layers of abstraction, representing temporal paths through concrete kinematic features. For example, a multimodal feature set collected from a Kinect V2 sensor (Microsoft, 2017) consists of kinematic measures, digital video, and depth maps. This can be temporally linked to abstract probabilistic parameters made by the model, producing a machine-generated mode of data for analysis. From these multiple streams, higher level statistical representations of the user's movement can be constructed, allowing us to synthesize a more nuanced and personalized platform for user interaction. This augmented feature space lends itself well to maximally descriptive class descriptions, providing an ideal environment for one-shot training. This more in-depth user representation provides an interaction environment that is conducive to fluid user performance, allowing for the bodily expression of symbolic conceptions to be captured without inhibiting user movement. Due to this potential, HHMMs have already been used effectively in several multimodal applications (Bevilacqua et al., 2010; Françoise, Caramiaux, & Bevilacqua, 2011), including movement-to-sound mapping schemas (Françoise, 2015). Extending this type of machine learning to educational applications has already been attempted (Junokas, Linares, & Lindgren, 2016), but with this paper, we show that these multimodal representations can be leveraged by one-shot models to specifically tailor interaction to the individual user. By providing an environment where users can define their own physical interactions without losing accuracy, we provide a system that can adapt to personalized user interactions, more accurately reflecting their intentions, thus leading to more beneficial feedback and customization of the user experience.

3 | MODEL DESCRIPTION

The proposed system captures raw movement data using the Microsoft Kinect V2, which utilizes infrared cameras to generate depth maps (see Figure 2). From these depth maps, the Kinect generates a skeleton frame of 25 distinct joints. We accessed this skeleton, writing software using the Kinect's application programming interface, to send skeleton positions across our system via the Open Sound Control protocol (Wright & Freed, 1997) at a fixed rate of 30 frames per second.

We then analyzed this raw movement with software written in Max 7/Jitter (Puckette, 2014), creating a set of real time kinematic features from the raw movement data. While we were able to generate a broad range of features (e.g., positional derivatives, comparative features, and statistical metrics) across a variety of timeframes, we chose to focus on frame-to-frame relative velocity features for our initial testing. For both of our experimental conditions, we trained a two-layer HHMM to recognize gestures using 30 features:

a. the x, y, z velocities of the right ankle (3), right wrist (3), left ankle (3), and left wrist (3) relative to the spine mid;
b. the x, y, z velocities of the right wrist (3), left ankle (3), and left wrist (3) relative to the right ankle;
c. the x, y, z velocities of the left ankle (3) and left wrist (3) relative to the right wrist; and
d. the x, y, z velocity of the left wrist (3) relative to the left ankle.

Using these kinematic features, we trained a two-layer HHMM adapted from the IRCAM MUBU package (IRCAM, 2015). From this HHMM, we were able to model and classify user-defined gestures. In the process of classifying these gestures, additional underlying statistics that contributed to that classification were captured. These internal model parameters gave us an abstracted, machine-driven mode of the user's gestures, providing a deeper feature perspective that could be used to aid in gesture recognition or applied in ulterior domains, such as mapping controls to simulation applications. Through the combination of these three data modes, we were able to create an adaptable one-shot gesture-recognition model that allowed for a personalized, multimodal interaction.

3.1 | Empirical tasks in repeatability, recall, and operation semantics

To validate the efficacy of the one-shot model for the kinds of gestural interactions one might find in embodied learning environments, we compared its performance to that of a traditional pretrained model in a three-stage gesture-recognition task. We hypothesized that in all three stages, one-shot gesture recognition would perform with superior accuracy to pretrained models, while also leaving learners more satisfied with the quality of the interactions.

FIGURE 2 Software-captured data using the Kinect V2: high definition digital video (left), depth maps with overlaid skeleton (center), and infrared sensor data (right) [Colour figure can be viewed at wileyonlinelibrary.com]
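The 30 relative-velocity features listed above can be sketched in a few lines. The joint names, the reading of "velocity relative to X" as a difference of joint velocities, and the helper names below are our assumptions for illustration; the study's actual pipeline was written in Max 7/Jitter:

```python
import numpy as np

FPS = 30          # Kinect V2 skeleton stream rate (frames per second)
DT = 1.0 / FPS

# Joint/reference pairings read off the paper's feature list (a)-(d):
# 10 pairs x 3 axes = 30 features per frame.
PAIRS = [
    ("right_ankle", "spine_mid"), ("right_wrist", "spine_mid"),
    ("left_ankle", "spine_mid"), ("left_wrist", "spine_mid"),
    ("right_wrist", "right_ankle"), ("left_ankle", "right_ankle"),
    ("left_wrist", "right_ankle"), ("left_ankle", "right_wrist"),
    ("left_wrist", "right_wrist"), ("left_wrist", "left_ankle"),
]

def joint_velocities(prev_frame, frame, dt=DT):
    """Frame-to-frame (x, y, z) velocity of every tracked joint."""
    return {j: (np.asarray(frame[j], float) - np.asarray(prev_frame[j], float)) / dt
            for j in frame}

def feature_vector(prev_frame, frame):
    """30-dimensional relative-velocity feature vector for one frame."""
    v = joint_velocities(prev_frame, frame)
    return np.concatenate([v[a] - v[b] for a, b in PAIRS])
```

A gesture is then the time series of these 30-dimensional vectors, which is what the recognizer consumes.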
4
JUNOKAS
ET AL.
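To make the one-shot training scheme concrete, the sketch below builds a left-to-right hidden-Markov model from a single recorded example (one state per template frame, isotropic Gaussian emissions) and classifies new sequences by forward-algorithm likelihood. This is a single-layer simplification with hypothetical names and a fixed emission variance, not the two-layer MUBU-based HHMM the study actually used:

```python
import numpy as np

def one_shot_model(template, var=0.05):
    """Left-to-right HMM from a single recorded example:
    one state per template frame, isotropic Gaussian emissions."""
    return {"means": np.asarray(template, float), "var": var}

def log_likelihood(model, sequence):
    """Forward algorithm (log space) for a left-to-right HMM with
    equal stay/advance transition probabilities."""
    means, var = model["means"], model["var"]
    n, dim = means.shape
    seq = np.asarray(sequence, float)
    # log emission density of each frame under each state
    d2 = ((seq[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    log_b = -0.5 * (d2 / var + dim * np.log(2 * np.pi * var))
    log_a = np.log(0.5)
    alpha = np.full(n, -np.inf)
    alpha[0] = log_b[0, 0]           # gestures must start in state 0
    for t in range(1, len(seq)):
        stay = alpha + log_a
        advance = np.concatenate(([-np.inf], alpha[:-1])) + log_a
        alpha = np.logaddexp(stay, advance) + log_b[t]
    return np.logaddexp.reduce(alpha)  # may end in any state

def classify(models, sequence):
    """Pick the gesture model that best explains the sequence."""
    scores = {name: log_likelihood(m, sequence) for name, m in models.items()}
    return max(scores, key=scores.get)
```

One "shot" per gesture class suffices: each user's recording becomes its own model (`models = {name: one_shot_model(example)}`), which is what lets the recognizer adapt to self-defined gestures.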
The three stages each participant went through consisted of (a) repeating gestures based on on-screen representations (repeatability), (b) performing gestures by recalling their associated number (recall), and (c) using gestures representing context-appropriate mathematical operations (operation semantics).

4 | METHODS

4.1 | Participants

Twenty-one participants were split into two groups, with 10 in the one-shot group and 11 in the pretrained group. All users were undergraduate college students between the ages of 18 and 22, majoring in a variety of fields including computer science, neuroscience, accounting, education, physics, and engineering. Gender of the participants was evenly balanced between the two conditions, with 5 males and 5 females in the one-shot condition and 5 males and 6 females in the pretrained condition. All participants were unfamiliar with our specific research aims at the beginning of the session and were not given any incentive to participate.

4.2 | Study procedures

Participants were brought into our lab and given a short introduction to the gesture-based interface (see Figure 3) they would be using and the tasks they would be performing. We carefully controlled all environmental variables across participants, including the amount of light exposure in the task environment and the distance the participant was from the sensor. The hardware configuration was kept as consistent as possible across all of the classification tasks in an attempt to minimize noise differentials or sensor variability. All participants in both conditions were guided through three stages led by an experimenter. Each stage consisted of three groups of eight gestures, and participants were given as much time as they needed to complete the given stage.

FIGURE 3 User interface for empirical task with experimenter controls on the right and gestural interactions captured and displayed via a skeleton figure in the center [Colour figure can be viewed at wileyonlinelibrary.com]

In the first stage (repeatability), participants were asked to repeat gestures that were demonstrated by a digital skeleton on a large projected screen (see Figure 4). One-shot users recorded four of their own designed gestures, and pretrained users were shown four predetermined gestures. Each participant was instructed to watch the playback to completion and then repeat the gesture they had seen. The system attempted to recognize this gesture, and the accuracy of such recognition was measured internally, though not reported back to the participant.

FIGURE 4 User (red skeleton figure) performing the first stage of the experiment, repeating gestures shown in playback (purple skeleton figure) [Colour figure can be viewed at wileyonlinelibrary.com]

In the second stage (recall), users were asked to recall gestures that were associated with a given number (see Figure 5). For example, if the stage prompted the user with the number one, the user would perform the gesture they had associated with the number one; if prompted with the number two, the user would perform the gesture they associated with the number two; and so forth. Gestures were defined by the users similarly to Stage 1, with one-shot users recording their own gestures and pretrained users being given a predetermined set.

FIGURE 5 User (red skeleton figure) performing the second stage of the experiment, recalling gestures associated with a given number on the left [Colour figure can be viewed at wileyonlinelibrary.com]

In the third stage (operation semantics), participants were asked to associate four new gestures with the arithmetic operations of add, subtract, multiply, and divide (see Figure 6). One-shot participants recorded the four operational gestures based on their own semantic associations with those operations. They could choose any gestures they wished as long as they were distinct and large enough to be recognized by the Kinect. After being given some time to consider their chosen gesture, users were cued with textual instructions on which operation they were recording and when to start recording their gesture by a digital countdown. Once started, gesture recordings were
automatically stopped when the user's "energy" (the summation of speed across all the joints) crossed below a preset threshold that represented "stillness." Once all four recordings were completed, users were given an opportunity to see the gestures they recorded through digital playback. If they were unsatisfied with any of their recordings, they were given an opportunity to rerecord any gesture.

Pretrained participants in the third stage were shown a set of four researcher-chosen gestures corresponding to the mathematical operations (a right-handed "stacking" motion to the right for add, a left-handed "throwing away" motion to the left for subtract, a two-handed "folding" gesture for multiply, and a left-handed "slashing" motion for divide). Participants were then given equations with missing operators and asked to gesture according to which operation they believed belonged in the equation. For example, if participants were given "1__4 = 5," they were expected to perform the gesture for add.

Once all of the stages were completed, participants were given a qualitative feedback form to provide their subjective impressions of their experience. Users were given several specific questions:

a. Were your gestures easy to remember?
b. Did anything help you remember your gestures?
c. What was difficult to do?
d. What was easy to do?

Additionally, they were given an open section to provide their own feedback about the experience. After participants were done with the questionnaire, they were thanked and allowed to leave.

4.3 | Gesture data analysis

In each of the three stages, each gesture was analyzed by the models frame to frame, classifying every frame collected by the system at a rate of 30 frames per second. Classification began for each gesture when the participant crossed above a specified speed threshold (i.e., began to move) after the corresponding digital playback, number, or equation was provided. Classification ended for each gesture when the participant verbally acknowledged they were done gesturing, upon which the researcher manually stopped the classification process. A frame was classified as correct if it matched the corresponding requested gesture for a given task in a given stage. Stage accuracy was determined by taking the total number of frames classified correctly from all of the tasks performed in a given stage, divided by the total number of frames classified for the stage.

FIGURE 6 User (red skeleton figure) performing the third stage of the experiment, performing gestures associated with arithmetic operations to complete the given equation [Colour figure can be viewed at wileyonlinelibrary.com]

5 | RESULTS

To investigate differences between the groups, an independent t test was conducted to compare stage accuracy for the one-shot and the pretrained models in each stage. Shapiro–Wilk's tests indicated that the scores were normally distributed, and Levene's tests indicated that there was homogeneity of variance between the groups. As shown in Table 1, there are statistically significant differences between the one-shot model and the pretrained model groups in all three stages. On average, the students in the one-shot group tended to have higher accuracy scores than the students in the pretrained group. Specifically, the mean accuracy score in Stage 3 differed between the one-shot model (M = 130.14, SD = 20.51, n = 7) and the pretrained model (M = 102.86, SD = 22.27, n = 7) at the .001 level of significance (t = 5.43, df = 19, p < .001).

Given the significance of the tests, the one-shot model outperformed the pretrained model in each of the three stages (see Table 1). The most significant difference was in the accuracies of the operation semantics Stage 3. While Stages 1 and 2 maintained relatively close accuracies, there was a drop in accuracy when going to Stage 3 for both of the model conditions.

In response to their postexperiment survey, several one-shot participants mentioned that their gestures were easy to remember, with one user saying, "I made them with the fact that I'd have to replicate them in mind" and another user saying, "Knowing how distinct and different they were from each other, I could remember which mover [sic] corresponded to the operation." One-shot users seemed to be working harder to create semantically meaningful associations with their gestures.

Pretrained participants tried to establish connections to the prescribed gestures, with one user stating they "made semantic connections between the operations and the gestures, e.g., the divide gesture is similar to a slash" and another saying they "tried to think of personal ways to memorize the gestures." Even using these connections, pretrained users frequently expressed frustration with spatial aspects of the gestures, remembering the general gesture but performing it with the wrong directional affinity (e.g., left-handed vs. right-handed). One user specifically stated that "It was a little bit hard to remember the direction." Pretrained users seemed to be trying to match prescribed directions rather than create any deeper, symbolic connections with their gestures.

6 | DISCUSSION

Our one-shot model achieved a higher accuracy in each of our experimental task's stages. This can be mainly attributed to the adaptability
TABLE 1 Means, standard deviations, and independent t-test results of the three-stage gesture-recognition task

                                 One-shot      Pretrained
                                 Mean (SD)     Mean (SD)     t test          Significance
Stage 1 - repeatability          0.71 (0.14)   0.57 (0.15)   t(19) = 2.24    p = .037
Stage 2 - recall                 0.73 (0.12)   0.55 (0.12)   t(19) = 3.23    p = .004
Stage 3 - operation semantics    0.65 (0.13)   0.38 (0.08)   t(19) = 5.43    p < .0001

Note. One-shot (n = 10), pretrained (n = 11).
of our one-shot model to personalized inputs, allowing the users to perform self-defined gestures and to create more solid mnemonic connections to their movement. This finding is overall in line with previous research indicating user-defined gestures yield a higher level of memorability or enjoyment (e.g., Freeman, Benko, Morris, & Wigdor, 2009; Nacenta et al., 2013). This personalized recognition additionally allowed users to focus on accomplishing tasks rather than performing accurate gestures.

In the repeatability Stage 1, pretrained users had to remember the preset gesture templates in addition to performing them, maintaining spatial and directional mappings that may not have been intuitive given a gesture. For example, if a gesture was performed with the right hand, pretrained users did not have the advantage of defining said gesture and thus were more likely to erroneously start with the left hand. In such cases, users quickly corrected their action when seeing they used an incorrect joint by using the playback provided in Stage 1. Providing the playback feature that enables users to observe and reflect on their movements seemed to benefit the pretrained users (Lindgren, 2015). One-shot users did not have the same problem, consistently using the correct joint for the respective actions and performing the spatial configuration of their self-defined gestures more precisely, not having to parse out directionality or recall any matching schema. This quality of the pretrained condition may have interfered with user performance. In particular, this extra step of mapping to an external schema (e.g., performing given gestures) forced pretrained participants to perform an extra mental step, recalling the different parameters of the prescribed gestures before performing them, thus leading to lower accuracies.

This directional problem was exacerbated in Stages 2 and 3, where users were forced to recall gestures without visual reinforcement. Several users performed the given gesture but directionally inverse, leading to much lower accuracies. Although this did occur with one of the one-shot users, directional inversion did not occur nearly as frequently as it did in the pretrained condition, where it occurred with three different users.

In the recall Stage 2, users performed with a very similar accuracy as in Stage 1, staying within 2%. The small difference in accuracy suggests that users are able to repeat and recall gestures with similar success given that the gestures bear no symbolic meaning. The drop in accuracy from the one-shot to the pretrained condition can then be attributed to the performance capabilities of the respective conditions.

Unlike the first two stages, the operation semantics Stage 3 had a significant difference in performance accuracies in comparison with the first two stages, with the one-shot model largely outperforming the pretrained model. Additionally, one-shot users were able to perform the correct target gestures more often than pretrained users in this stage. This was measured by looking at tasks where a majority of the frames were misclassified and reviewing the gesture recordings to make sure there were not any computational errors. This supports our larger argument that allowing users to define their own gesture space increases the user's ability to recall mathematical operations. Rather than expending mental and physical effort trying to frame their own gesture within a set of movement templates as in the pretrained model, one-shot users were able to use their gestural nuance as a mechanism to strengthen their interactions with the system. The capability of the model to adapt and learn the one-shot user's own interpretations of the mathematical operations seemed to support embodied interaction that alleviates the additional effort of specific pretrained template matching, which may have given users the opportunity to focus on accomplishing tasks rather than performing accurately.

Qualitative feedback from participants often cited recall problems when using the pretrained model, not remembering which specific joints to use and how precisely to perform gestures to get the desired response. The ability to create a personalized gestural vocabulary using the one-shot model seems to have allowed users to quickly connect abstract concepts such as mathematical functions to an immediate physical expression, without the additional burden of associating with externally prescribed criteria. This finding, together with the lower accuracies of the pretrained model, is in line with prior research that highlighted memorability as an important aspect of gestural interfaces, since users are required to remember predefined gestures and often become frustrated when a forgotten gesture causes errors (Long Jr, Landay, Rowe, & Michiels, 2000).

Body movement can benefit students' performance in numerous areas such as algebra (Goldin-Meadow et al., 2009; Lindgren et al., 2016). The results in this study show that the user-defined model led to increased learner performance, and the predefined condition did not provide the same level of benefits. This highlights the design of personalized gestural environments as a critical area of research and design. Interestingly, users who used the one-shot model created a wide variety of gestures (e.g., jumping with outstretched arms, squatting with arms held horizontally, spinning in place, and using only one arm for all gestures), not creating a cohesive representation across the mathematical functions, suggesting that allowing personalization of input is more in line with users' uninfluenced responses to such a system. Some of the symbolic representations of movement included creating spatial sequences, not necessarily stacking but organizing in rows or columns (add); horizontal hand motions, emulating a dash (subtract); expansion, growing from small to large using the whole body (multiply); and separating material by sweeping away excess (divide).
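The stage-accuracy measure these comparisons rest on (Section 4.3: correctly classified frames divided by all classified frames in a stage) reduces to a few lines; the data layout and function name here are a hypothetical illustration, not the study's actual analysis code:

```python
def stage_accuracy(tasks):
    """tasks: list of (requested_gesture, per_frame_labels) pairs for one
    stage; returns correctly classified frames / total classified frames."""
    correct = sum(label == target
                  for target, frames in tasks
                  for label in frames)
    total = sum(len(frames) for _, frames in tasks)
    return correct / total if total else 0.0
```

For example, a stage in which three of four classified frames match the requested gestures scores 0.75, the same scale as the Table 1 means.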
JUNOKAS
7
ET AL.
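For reference, the Table 1 t statistics follow the standard pooled-variance independent-samples formula with df = n1 + n2 - 2 = 19; recomputing from the reported two-decimal means and standard deviations reproduces them only approximately because of rounding:

```python
from math import sqrt

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t statistic with pooled variance (df = n1 + n2 - 2)."""
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Stage 1 (repeatability): one-shot 0.71 (0.14), pretrained 0.57 (0.15)
t1 = pooled_t(0.71, 0.14, 10, 0.57, 0.15, 11)   # ~2.2, vs. reported t(19) = 2.24
```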
In some cases, this variability in gesture vocabulary did lead to lower model performance due to the lack of diversity in the user's gestures. Prior research indicates that predesigned or user-elicited gestures are more applicable than user-defined gestures in certain situations; however, the gap is constantly narrowing with the improvement of sensing and recognizer technologies (Nacenta et al., 2013). An advantage of the pretrained model is that researchers would be able to control and define large differences in movements, resulting in more clearly delineated gestures. A clear advantage of the "one-shot" methodology is that users would presumably train in the exact environment (e.g., amount of light) and under the same conditions (e.g., personal physical attributes) in which they would test, providing a much more focused model. Although we took considerable effort to control the testing conditions, the Kinect V2 can still be sensitive to noise, demographic attributes, and other environmental factors. Since our task is centered on creating optimal algorithm performance and the ability to empower the user to connect this performance to higher-level abstractions, the capability of targeting the user directly to create a more focused model is beneficial to our goals and effective in reducing the effect of these sensor sensitivities.

For both groups, the majority of the incorrectly classified frames occurred at the beginning of the gestures, in the transition from stillness into movement. This was especially apparent in the first stage. After a gesture was initiated and maintained consistently (e.g., without changing or pausing mid-movement), the models generally maintained a single classification for the remainder of the recording. Classification results would improve by classifying a larger timeframe (e.g., multiple frames, or the entire gesture), but by looking at the frame-to-frame classification, we were able to identify the progression of classification, providing deeper insight into the viability of leveraging these abstract model parameters for real-time multimodal analytics.

Overall, although the sample size is fairly small and not very diverse, this preliminary experiment supports using a one-shot model to leverage personalized educational interactions when users are required to couple gestural performance and symbolic recall with accomplishing a reasoning task. A more rigorous and in-depth experiment would need a significantly larger sample size to make any definitive claims; however, this initial experiment shows the potential of using a one-shot gesture-recognition model.

7 | CONCLUSION AND FUTURE DIRECTIONS

This preliminary experiment lays the groundwork for future testing, including the application of our model in more practical educational interfaces. We anticipate that applying the model beyond simple classification tasks to multimodal feature synthesis and control mappings will lead to more intuitive and dynamic interactions. These more natural system interactions could, in turn, yield more developed and accurate analytics on student learning in gestural environments.

Although we felt our task (comparing pretrained and one-shot models) was the most basic that would test our hypothesis (that one-shot models provide a viable option for personalizing user interaction), there are innumerable additional combinations we could compare using our current construct. We are interested in exploring the optimization of model training (e.g., does two-shot training improve on one-shot at all?), seeking commonality across users' gestural representations of abstract concepts, and finding methods to emphasize the most relevant parameters of movement. In this initial work, we focused on leveraging the internal states of the respective models to classify gestures. While these internal states were a result of our two other modes of data (e.g., skeleton positions and kinematic features), we imagine a much more effective and satisfying model could be developed by balancing all of these modes into a cohesive representation of participants' movement. For example, skeleton joints with higher speeds could be used to weight the internal parameters, reflecting concentrations of user energy in addition to their nuanced expression. The model we have developed is flexible enough for that type of expansion and will be developed accordingly in our future applications.

We are currently working on implementing our one-shot model in several educational simulation theaters, replacing a pretrained model that we had previously used. These simulation theaters are situated within the context of understanding scale in three different domains: the Richter scale used to measure the magnitude of earthquakes, the pH scale that measures the level of acidity within solutions, and the growth of bacteria cultures. Interaction within these domains is preceded by a gestural tutorial space where students familiarize themselves with their movement and how it can control the simulations. Up until this point, we have been using a pretrained model, having students match gestural templates that we based on research and interviews (Alameh, Linares, Mathayas, & Lindgren, 2016). With the findings in this paper, we are now beginning to integrate our one-shot model, merging it with our previous research by physically and verbally cuing the students (Lindgren, 2015) while letting them define gestures according to their own specifications.

By continuing to solidify the relationship between the interaction space, gesture recognition, and feedback systems, we look to continue empowering users: alleviating them of the cognitive load required to match and perform the specific gestures of a pretrained model, and allowing them to form their own personalized gesture language and semiotic structure, through which they communicate their ideas via bodily movement and technology.

ACKNOWLEDGMENTS

This research was supported by a grant from the National Science Foundation (IIS-1441563). We would also like to acknowledge the work of the entire ELASTIC3S team, especially Ben Lane, Greg Kohlburn, Sahil Kumar, Tom Roush, Viktor Makarskyy, and Wai-Tat Fu, for their development and analysis work that contributed to this study.
ORCID

M.J. Junokas  http://orcid.org/0000-0002-8764-7936

REFERENCES

Abrahamson, D., & Trninic, D. (2015). Bringing forth mathematical concepts: Signifying sensorimotor enactment in fields of promoted action. ZDM, 47(2), 295–306.
Alameh, S., Linares, N., Mathayas, N., & Lindgren, R. (2016). The effect of students' gestures on their reasoning skills regarding linear and exponential growth. Annual Meeting of the National Association of Research on Science Teaching.
Alibali, M. W., & Nathan, M. J. (2012). Embodiment in mathematics teaching and learning: Evidence from learners' and teachers' gestures. The Journal of the Learning Sciences, 21(2), 247–286.
Alshammari, M., Anane, R., & Hendley, R. J. (2016). Usability and effectiveness evaluation of adaptivity in e-learning systems. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (pp. 2984–2991). ACM.
Bevilacqua, F., Zamborlin, B., Sypniewski, A., Schnell, N., Guédy, F., & Rasamimanana, N. (2010). Gesture in embodied communication and human–computer interaction (Vol. 5934).
Biskupski, A., Fender, A. R., Feuchtner, T. M., Karsten, M., & Willaredt, J. D. (2014). Drunken Ed: A balance game for public large screen displays. In CHI'14 Extended Abstracts on Human Factors in Computing Systems (pp. 289–292). ACM.
Enyedy, N., Danish, J. A., Delacruz, G., & Kumar, M. (2012). Learning physics through play in an augmented reality environment. International Journal of Computer-Supported Collaborative Learning, 7(3), 347–378.
Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
Fine, S., Singer, Y., & Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1), 41–62.
Flood, V. J., Amar, F. G., Nemirovsky, R., Harrer, B. W., Bruce, M. R., & Wittmann, M. C. (2014). Paying attention to gesture when students talk chemistry: Interactional resources for responsive teaching. Journal of Chemical Education, 92(1), 11–22.
Françoise, J. (2015). Motion-sound mapping by demonstration (Doctoral dissertation). UPMC.
Françoise, J., Caramiaux, B., & Bevilacqua, F. (2011). Realtime segmentation and recognition of gestures using hierarchical Markov models (Mémoire de Master). Université Pierre et Marie Curie–Ircam.
Freeman, D., Benko, H., Morris, M. R., & Wigdor, D. (2009). ShadowGuides: Visualizations for in-situ learning of multi-touch and whole-hand gestures. In Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces (pp. 165–172). ACM.
Glenberg, A. M., Gutierrez, T., Levin, J., Japuntich, S., & Kaschak, M. P. (2004). Activity and imagined activity can enhance young children's reading comprehension. Journal of Educational Psychology, 96(3), 424–436.
Goldin-Meadow, S. (2011). Learning through gesture. Wiley Interdisciplinary Reviews: Cognitive Science, 2(6), 595–607.
Goldin-Meadow, S., Cook, S. W., & Mitchell, Z. A. (2009). Gesturing gives children new ideas about math. Psychological Science, 20(3), 267–272.
Han, I., & Black, J. B. (2011). Incorporating haptic feedback in simulation for learning physics. Computers & Education, 57(4), 2281–2290.
IRCAM. (2015). MUBU for Max. http://forumnet.ircam.fr/fr/produit/mubu/
Isbister, K., Karlesky, M., Frye, J., & Rao, R. (2012). Scoop!: A movement-based math game designed to reduce math anxiety. In CHI'12 Extended Abstracts on Human Factors in Computing Systems (pp. 1075–1078). ACM.
Johnson-Glenberg, M. C., Birchfield, D. A., Tolentino, L., & Koziupa, T. (2014). Collaborative embodied learning in mixed reality motion-capture environments: Two science studies. Journal of Educational Psychology, 106(1), 86.
Junokas, M., Linares, N., & Lindgren, R. (2016). Developing gesture recognition capabilities for interactive learning systems: Personalizing the learning experience with advanced algorithms. In Proceedings of the International Conference of the Learning Sciences.
Kontra, C., Lyons, D. J., Fischer, S. M., & Beilock, S. L. (2015). Physical experience enhances science learning. Psychological Science, 26(6), 737–749.
Lindgren, R. (2015). Getting into the cue: Embracing technology-facilitated body movements as a starting point for learning. In V. R. Lee (Ed.), Learning technologies and the body: Integration and implementation in formal and informal learning environments (pp. 39–54). New York, NY: Routledge.
Lindgren, R., Tscholl, M., Wang, S., & Johnson, E. (2016). Enhancing learning and engagement through embodied interaction within a mixed reality simulation. Computers & Education, 95, 174–187.
Long, A. C., Jr., Landay, J. A., Rowe, L. A., & Michiels, J. (2000). Visual similarity of pen gestures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 360–367). ACM.
Microsoft. (2017). Developing with Kinect for Windows. https://developer.microsoft.com/en-us/windows/kinect/develop
Nacenta, M. A., Kamber, Y., Qiang, Y., & Kristensson, P. O. (2013). Memorability of pre-designed and user-defined gesture sets. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1099–1108). ACM.
Plummer, J. D., & Maynard, L. (2014). Building a learning progression for celestial motion: An exploration of students' reasoning about the seasons. Journal of Research in Science Teaching, 51(7), 902–929.
Puckette, M. (2014). Max/MSP (version 6). Cycling '74.
Wright, M., & Freed, A. (1997). Open SoundControl: A new protocol for communicating with sound synthesizers. In ICMC.

How to cite this article: Junokas MJ, Lindgren R, Kang J, Morphew JW. Enhancing multimodal learning through personalized gesture recognition. J Comput Assist Learn. 2018;1–8. https://doi.org/10.1111/jcal.12262