Individual Differences in Multimodal Integration Patterns: What Are They and Why Do They Exist?

Sharon Oviatt, Rebecca Lunsford, Rachel Coulston
Center for Human-Computer Communication, Oregon Health & Science University
20000 NW Walker Road, Beaverton, OR 97006
oviatt|rebeccal|[email protected]
+1 503 748 1342

ABSTRACT

Techniques for information fusion are at the heart of multimodal system design. To develop new user-adaptive approaches for multimodal fusion, the present research investigated the stability and underlying cause of major individual differences that have been documented between users in their multimodal integration pattern. Longitudinal data were collected from 25 adults as they interacted with a map system over six weeks. Analyses of 1,100 multimodal constructions revealed that everyone had a dominant integration pattern, either simultaneous or sequential, which was 95-96% consistent and remained stable over time. In addition, coherent behavioral and linguistic differences were identified between these two groups. Whereas performance speed was comparable, sequential integrators made only half as many errors and excelled during new or complex tasks. Sequential integrators also had more precise articulation (e.g., fewer disfluencies), although their speech rate was no slower. Finally, sequential integrators more often adopted terse and direct command-style language, with a smaller and less varied vocabulary, which appeared focused on achieving error-free communication. These distinct interaction patterns are interpreted as deriving from fundamental differences in reflective-impulsive cognitive style. Implications of these findings are discussed for the design of adaptive multimodal systems with substantially improved performance characteristics.

ACM Classification Keywords

H.5.2 [Information Interfaces and Presentation]: User Interfaces—user-centered design, theory and methods, interaction styles, evaluation/methodology, input devices and strategies, voice I/O, natural language, prototyping.

Author Keywords

multimodal integration patterns, simultaneous or sequential input, individual differences, impulsive-reflective cognitive style, conversations, commands, disfluencies, errors

INTRODUCTION

Techniques for information fusion are at the heart of designing a new generation of multimodal systems, including approaches for combining information at both the semantic and temporal levels. With respect to the design of temporal constraints, current state-of-the-art multimodal systems use fixed temporal thresholds, which are based on previous modeling of users' natural modality integration patterns [10,12]. However, there are significant individual differences among users in their multimodal integration patterns when engaged in system interactions, which suggests that user-adaptive temporal thresholds potentially could support more tailored, flexible, and powerful approaches to fusion in the future.

As background, when people combine speech and pen input during a multimodal construction, recent research has revealed two distinct user integration patterns: simultaneous versus sequential. During a simultaneous integration, speech and pen input overlap at least partially in time, whereas during a sequential construction one mode ends before the second starts. Since users' sequential multimodal constructions are known to have intermodal lags ranging from 0 to 4 seconds, a system with a fixed temporal threshold would only attempt to fuse two signals if the lag was less than 4 seconds. Lengthier delays instead would result in a unimodal interpretation. As evident in this example, one key role of temporal thresholds is to assist in determining whether two signals are part of a joint user construction, or whether they are unimodal and should be processed separately. It is critical that multimodal systems make this distinction accurately, since users typically intermix unimodal with multimodal constructions during system interactions. In fact, their overall percentage of multimodal interaction can vary widely, from as little as 20% in some verbal-temporal domains, to 70% or higher in many spatial applications [10].
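As a concrete illustration of the fixed-threshold policy just described, the following minimal sketch (ours, not the authors' implementation) decides whether two signals should be fused into one construction; all names and timestamps are hypothetical.

```python
# Minimal sketch (not from the paper) of the fixed-threshold fusion
# decision described above. Assumes each signal carries start/end
# timestamps in seconds; names are illustrative only.

FUSION_THRESHOLD = 4.0  # fixed intermodal lag threshold, in seconds

def should_fuse(first_end, second_start, threshold=FUSION_THRESHOLD):
    """Decide whether two signals belong to one multimodal construction.

    A negative lag means the signals overlap in time (a simultaneous
    construction); a lag between 0 and the threshold is treated as a
    sequential construction; anything longer is interpreted unimodally.
    """
    lag = second_start - first_end
    return lag < threshold

# Example: pen ends at t=2.1 s, speech starts at t=3.4 s -> lag 1.3 s,
# so the signals are fused as one sequential construction.
assert should_fuse(first_end=2.1, second_start=3.4)
```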

Given the bimodal distribution of user integration patterns, adaptive temporal thresholds potentially could support more tailored and flexible approaches to fusion. Ideally, an adaptive multimodal system would detect, automatically learn, and adapt to a user's dominant multimodal integration pattern, which could yield substantial improvements in system processing speed, accuracy of interpretation, and synchrony of interchange with the user. Specifically, it has been estimated that system delays could be reduced to just 44% of their current levels by adopting user-defined thresholds. For example, if a user is known to be a simultaneous integrator during multimodal constructions, then the presence of an intermodal lag between signals should result in immediate unimodal processing, with the 3-4 second delay eliminated altogether. Likewise, for sequential integrators with habitually brief intermodal lags (e.g., 1.5 seconds), their temporal threshold and associated system processing delays should be reduced accordingly. The net impact of these cumulative reductions during both unimodal and multimodal processing would be a substantial speedup in system response. This, in turn, can be expected to reduce system recognition errors, which often occur, for example, when users repeat their input during lengthy delays.
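The adaptive policy sketched in the preceding paragraph can be expressed as a small threshold-selection routine. This is our own illustrative sketch under the paper's stated figures (a 4-second default and per-user habitual lags); the function name and the safety margin are hypothetical.

```python
# Illustrative sketch (ours, not the authors' implementation) of
# user-adaptive thresholds: a simultaneous integrator gets no wait at
# all on an intermodal lag, while a sequential integrator's threshold
# shrinks toward their habitual lag plus a small safety margin.

def adaptive_threshold(dominant_pattern, habitual_lag=None,
                       default=4.0, margin=0.5):
    """Return a per-user fusion threshold in seconds."""
    if dominant_pattern == "simultaneous":
        # Any lag at all implies unimodal input; process immediately.
        return 0.0
    if dominant_pattern == "sequential" and habitual_lag is not None:
        # Wait only slightly longer than this user's typical lag.
        return min(default, habitual_lag + margin)
    return default  # unknown user: fall back to the fixed threshold

# Example: a sequential integrator with habitual 1.5 s lags
# waits at most 2.0 s instead of the fixed 4.0 s.
print(adaptive_threshold("sequential", habitual_lag=1.5))  # -> 2.0
```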

Related Literature on Multimodal Integration Patterns

As illustrated in Table 1, a series of studies conducted with users across the lifespan has indicated that individual child, adult, and elderly users all adopt either a predominantly simultaneous or sequential integration pattern during production of speech and pen multimodal constructions [11]. This bimodal distribution of user integration patterns has been observed in different task domains (map-based real estate selection, crisis management, educational applications with animated characters), and also when using different types of interface (conversational, command style) [12,14,15]. In short, empirical studies have demonstrated that this bimodal distinction between users in integration pattern generalizes widely across different age groups, task domains, and types of interface.

During multimodal interaction, a user's dominant simultaneous or sequential integration pattern can be identified almost immediately, typically on the very first multimodal command, and it remains highly consistent (88-97%) throughout an interactive computer session [11,14,15]. In addition, users' dominant pattern is strikingly consistent and resistant to change, even when explicit instruction or strong selective reinforcement is delivered to encourage switching from a sequential to a simultaneous pattern, or vice versa [8,11]. Instead, both sequential and simultaneous integrators have shown evidence of entrenching further in their dominant patterns (i.e., increasing intermodal lag during sequential integrations, and overlap during simultaneous integrations) over the course of an interactive session, during system error handling, and when completing increasingly difficult tasks. Interestingly, independent of these new findings on multimodal production during human-computer interaction, large individual differences and within-subject stability likewise have been documented in the cognitive science literature on perception of multisensory synchrony during auditory-visual events [7,13].

Children (average consistency 93.5%)
User   SIM   SEQ
SIM integrators:
1      100     0
2      100     0
3      100     0
4      100     0
5      100     0
6      100     0
7       98     2
8       96     4
9       82    18
10      65    35
SEQ integrators:
11      15    85
12       9    91
13       2    98

Adults (average consistency 90%)
User   SIM   SEQ
SIM integrators:
1      100     0
2       94     6
3       92     8
4       86    14
SEQ integrators:
5       31    69
6       25    75
7       17    83
8       11    89
9        0   100
10       0   100
11       0   100

Seniors (average consistency 88.5%)
User   SIM   SEQ
SIM integrators:
1      100     0
2      100     0
3      100     0
4       97     3
5       96     4
6       95     5
7       95     5
8       92     8
9       91     9
10      90    10
11      89    11
12      73    27
SEQ integrators:
13       1    99
Non-dominant integrators:
14      59    41
15      48    52

Table 1. Percentage of simultaneously-integrated multimodal constructions (SIM) versus sequentially-integrated constructions (SEQ) for children, adults, and seniors

All of these findings imply that future multimodal systems that can detect and adapt to a user's dominant integration pattern potentially could yield substantial improvements in system robustness and overall performance. In fact, in many respects these data on individual differences in adult multimodal integration patterns present an ideal set of circumstances and an opportunity for adaptive processing. That is, users are divided into two basic types, with early predictability and a high degree of consistency in their integration pattern. To date, however, there has been no longitudinal research documenting whether these differences among individuals remain stable over an extended time period. Likewise, there has been no explanation of the root cause of these salient differences, for example through investigation of other possible behavioral and linguistic differences between the groups. Such information would be valuable for guiding future interface design, especially user-centered design of adaptive temporal thresholds in fusion-based multimodal systems.

Individual Differences in Cognitive Style

Perhaps the most pervasive and studied of all cognitive style dimensions is reflectivity-impulsivity, which is manifest as stable individual differences in the way people perceive, process, and use information [3,4,6]. Kagan and colleagues [4] first reported that when children must respond to situations or tasks in which there is response uncertainty, as in the Matching Familiar Figures Test (MFFT), reflective or analytic individuals respond relatively slowly and accurately, whereas impulsive ones tend to work in a more rapid and error-prone manner. The MFFT taps two individual difference components: (1) concern over making errors, or a low tolerance for errors, and (2) tempo of information processing. Corresponding studies of gaze patterns during MFFT problem solving have shown that reflective individuals inspect alternatives more systematically from the very beginning of a task, and before responding with what they believe is the best answer. They make more fixations, longer ones, and scan alternatives in a more planful way than impulsive individuals [2].

Consistent differences in impulsive-reflective cognitive style have been reported as early as infancy. For example, young reflective infants inspect objects visually for longer time periods [3,5]. With increasing age children become more reflective, and this is associated with improved performance on classroom tasks, although reflective and impulsive subgroups of individuals can be identified at any age. The growth of reflectivity in childhood is believed to be the outcome of developing mental structures that permit an increasingly complex use of materials available in the environment [5]. Reflective children's more strategic approach to information processing has been documented to yield the greatest accuracy advantage on difficult tasks that require detailed analysis. Importantly, the stability of individual differences in impulsivity-reflectivity has been demonstrated to generalize across different kinds of visual, verbal, and mathematical tasks, and also to become more consistent with age [6]. Although individual differences in reflective-impulsive cognitive style are relatively stable, they can be influenced to some extent by experience and training. In general, however, it is believed to be easier to modify response time (e.g., slowing down the tempo of a hyperactive child) than an individual's error-proneness [3,6].

Impulsive and reflective individuals also differ along social and personality dimensions. For example, impulsives tend to be more socially responsive. They also typically have shorter attention spans and weaker impulse control [6], so they are more distractible. At times, this can jeopardize their ability to learn new tasks, to navigate without accidents, and to perform other tasks where distractibility and dual-tasking are a special concern. In contrast, reflectives typically will concentrate on one object or task for a longer time period, and they are less dependent on others such as teachers.

Specific Goals and Predictions of This Study

The primary general goals of the present study were to determine: (1) whether individual users' multimodal integration patterns (simultaneous versus sequential) remain stable over time, as predicted, and (2) whether these groups also differ in other aspects of multimodal integration, task performance, or communication patterns that might provide the basis for a coherent explanation of their fundamental differences. More specifically, this study examined whether simultaneous and sequential integrators differ in:

• Consistency, mode precedence, or other aspects of integration patterns
• Accuracy or speed of task performance, with sequential integrators expected to be more accurate but slower
• Articulatory precision and speech rate, with sequential integrators anticipated to be more precise but slower
• Size and variability of lexicon, with sequential integrators expected to use a smaller and more constrained lexicon
• Pragmatic style and syntactic structure of constructions, with sequential integrators expected to adopt a more terse, direct, and constrained command style

METHODS

Participants

Twenty-five adult participants 18-55 years of age were studied, including 14 males and 11 females. During prescreening, 13 participants were identified as simultaneous integrators and 11 as sequential. Simultaneous integrators averaged 35 years of age, and 77% were right-handed. Sequential integrators averaged 32 years, and 81% were right-handed. All participants were native English speakers who were paid after their final session.

Application Scenario

Participants were instructed to act as volunteers assisting during a flood management exercise. They were given a simulated system with a map-based, multimodal interface. Instructions "from headquarters" were displayed as text near the bottom of their screen (Fig. 1, a). They then used speech and pen input to deliver instructions to the system. Individual tasks involved obtaining information, placing items, closing roads, bridges, and areas, and controlling the map display. Figure 1 shows a screen shot of the interface. In this example, the participant said "command post" while drawing an "X" on the map (Fig. 1, c) before a simulated error was received (Fig. 1, b). The participant then reentered input, which resulted in both textual feedback and an icon at the selected location.

[Figure 1. User interface, showing (a) the instruction from headquarters, (b) simulated error feedback, and (c) the participant's "X" drawn on the map]

Task difficulty included four levels: low, moderate, high, and very high. As illustrated in Table 2, low difficulty tasks referenced just one piece of directional (e.g., south, east) or locational information (e.g., Couch School). Moderate difficulty tasks referenced two such pieces of information, whereas high difficulty tasks referenced three, and very high difficulty tasks a total of four.

Difficulty    Message from Headquarters
Low           Situate a volunteer area near Marquam Bridge
Moderate      Send a barge from Morrison Bridge barge area to Burnside Bridge dock
High          Draw a sandbag wall along east riverfront from OMSI to Morrison Bridge
Very High     Place a maintenance shop near the intersection of I-405 and Hwy 30 just east of Good Samaritan

Table 2. Examples of task difficulty levels, with spatial-location lexical content in italics

Simulation Technique

Data collection in this study was accomplished using a dual-wizard high-fidelity semi-automatic simulation technique, as described in previous work [11]. This technique permitted real-time identification and logging of participants' integration patterns on a command-by-command basis. An error generator simulated a 20% rate of recognition errors throughout each session.

Research Design

The experimental design involved a mixed factorial that included a between-subject comparison of individual differences in (1) type of integration pattern (simultaneous, sequential). It also included a within-subject comparison of dependent measures as a function of (2) longitudinal session (first, second, third), and (3) task difficulty level (low, moderate, high, very high). Three parallel sets of tasks were developed and counterbalanced across sessions.

Procedure

Prescreening

Initial prescreening was conducted in field settings to identify people's dominant integration pattern so that a balanced set of sequential and simultaneous integrators could be recruited for the study, since 70% of people are simultaneous integrators [11]. The experimenter instructed participants to use both speech and pen input when completing each task on either a paper map or PC interface, although they could use these modes however they wished. She then noted people's integration patterns on each task.

Identification of User Multimodal Integration Pattern

During each participant's first session, their dominant multimodal integration pattern was verified in a lab setting during an initial "identification band" involving their first ten multimodal utterances to the system [11]. Dominance was defined as 60% or more of their constructions delivered either simultaneously or sequentially. Interactions involved the same simulation software and application as during actual testing, and the experimenter was not present. Using a dual-wizard simulation method, one assistant attended to logging participants' integration patterns in real time. Approximately half of the subjects retained for the full longitudinal study were simultaneous integrators and the other half were sequential integrators.

Main Session

Each volunteer participated in three test sessions involving 16 tasks apiece, with sessions spaced approximately two weeks apart. During each session, the participant first completed practice consisting of three tasks while the experimenter was present to answer questions and provide feedback. Afterwards, the experimenter left the room and the participant worked alone while completing their tasks.

Interviewing and Debriefing

After the final session, participants were interviewed about how they delivered their speech and pen input, and whether they were inclined to work quickly or slowly and carefully. Following the interview, the experimenter verified that all participants believed they were interacting with a fully functioning system. Participants then were debriefed about the simulation and reimbursed for participation.

Dependent Measures and Coding

All sessions were videotaped and transcribed, and data coding was performed using customized tools that supported frame-accurate temporal precision (0.03 sec) in coding the start and end of speech and pen signals. Careful hand coding of all data was performed to verify real-time logging of users' integration patterns.

Dominant Integration Pattern

A participant was defined as either a simultaneous or sequential integrator if 60% or more of all their constructions during each session were delivered in that pattern. If a participant dropped below the 60% threshold, their pattern was defined as non-dominant. If they delivered 60% or more in the opposite pattern, they were classified as having switched to the non-dominant pattern.
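The 60% dominance rule can be made concrete with a short classification sketch; this is our own illustration, not the study's coding tools, and the function and labels are hypothetical.

```python
# Illustrative sketch of the 60% dominance rule described above
# (our own rendering, not the study's coding tools).

def classify_integrator(patterns, threshold=0.60):
    """Classify a session from per-construction labels.

    `patterns` is a list of "SIM"/"SEQ" labels, one per multimodal
    construction in the session.
    """
    sim_share = patterns.count("SIM") / len(patterns)
    if sim_share >= threshold:
        return "simultaneous"
    if (1 - sim_share) >= threshold:
        return "sequential"
    return "non-dominant"

print(classify_integrator(["SIM"] * 9 + ["SEQ"]))  # -> simultaneous
```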

Integration & Duration Measures

Precedence–For each multimodal command, whether speech or pen input was presented first.

Intermodal Overlap/Lag–During simultaneous integrations, the duration of signal overlap in milliseconds (ms) for each construction. During sequential integrations, the lag from the end of the first signal to the start of the second.

Signal Durations–Duration of the speech, pen, and total multimodal signal in ms for each multimodal construction.
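For illustration, the precedence and overlap/lag measures can be derived from signal timestamps roughly as follows. This sketch is ours (the study used hand-verified coding tools with 0.03 s precision), and the (start, end) tuple format is an assumption.

```python
# Sketch of how precedence and overlap/lag can be derived from signal
# timestamps (illustrative only; the study used hand-verified coding
# tools with 0.03 s precision).

def integration_measures(speech, pen):
    """speech and pen are (start, end) tuples in seconds."""
    first, second = (speech, pen) if speech[0] <= pen[0] else (pen, speech)
    precedence = "speech" if first is speech else "pen"
    gap = second[0] - first[1]  # negative gap means temporal overlap
    if gap < 0:
        return precedence, "simultaneous", -gap  # overlap duration
    return precedence, "sequential", gap         # intermodal lag

# Pen from 0.0-1.3 s, speech from 1.9-3.2 s -> a pen-first sequential
# construction with an intermodal lag of about 0.6 s.
print(integration_measures(speech=(1.9, 3.2), pen=(0.0, 1.3)))
```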

Performance Measures

Task-critical Performance Errors–A task-critical human performance error was recorded whenever the participant specified an incorrect location, direction, or name for a location when completing a task, or if the task content was completely in error. All tasks during a participant's session were coded for the total of such errors, which was converted to a percentage of errors per 100 tasks.

Task Response Latency–The duration in ms between receipt of a task instruction on the screen and the start of a multimodal response.

Session Duration–Total duration in ms, from the appearance of the second task instruction on a user's screen until the end of the participant's multimodal input on the last task.

Linguistic Measures

Mean Length of Utterance (MLU)–The average total number of words in a spoken construction to the system, which was scored automatically for each task.

Lexical Disfluencies–The following types of disfluency were coded in spoken utterances directed to the system: (1) content self-corrections, (2) false starts, (3) repetitions, and (4) filled pauses, as described previously [9]. A rate of lexical disfluencies per 100 words then was calculated.

Non-lexical Disfluencies–The number of intrasentential pauses that were clearly audible by ear was recorded, and then a rate per 100 words was calculated.

Speech Rate–The number of syllables per utterance was calculated automatically using a modified CELEX lexical database [1]. Spoken utterance durations were calculated based on time codes as described above, and the average speech rate then was calculated in syllables per second.

Lexicon Size & Variability–The total number of unique spoken words and the overall total number of spoken words, which were calculated automatically from transcripts.

Construction Type–Each subject was categorized as using either predominantly command-style or conversational utterances, based on: (1) directness or indirectness of pragmatic style, (2) brevity or lengthiness of construction, and (3) the absence or presence of other grammatical markers, such as determiners and prepositions. Typical command-style constructions were of the syntactic form imperative plus noun ("Close highway") or noun phrase fragment ("Volunteer area"), whereas typical conversational constructions were more fully developed sentences or queries, as in "We could use a volunteer area here, okay?"

Presence and Type of Verb–Related to construction type, specific analyses also were conducted on whether a verb was present or elided in each construction, and whether a given verb was imperative or not. The following ratios then were calculated for each subject: (1) all constructions containing an elided verb, (2) imperative verbs to all constructions containing a verb, and (3) all constructions containing either an elided or imperative verb.
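Several of the automatically scored measures above (MLU, lexicon size and variability, rates per 100 words) amount to simple transcript arithmetic, sketched below. This is our own illustration, not the study's tools, and syllable counting via the CELEX database is omitted.

```python
# Sketch of how several of the linguistic measures above can be scored
# automatically from transcripts (illustrative only; the study used
# customized tools and the CELEX database for syllable counts).

def linguistic_measures(utterances):
    """`utterances` is a list of transcribed spoken constructions."""
    words = [w.lower() for u in utterances for w in u.split()]
    mlu = len(words) / len(utterances)   # mean length of utterance
    lexicon_total = len(words)           # total words spoken
    lexicon_unique = len(set(words))     # unique words spoken
    return mlu, lexicon_total, lexicon_unique

def rate_per_100_words(event_count, total_words):
    """Disfluency rates were expressed per 100 words."""
    return 100.0 * event_count / total_words

print(linguistic_measures(["close highway", "volunteer area here"]))
```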

Reliability

Second scoring was conducted between two independent coders on over 10% of the scored data. Measurements of the start and end of signals matched for 91% of pen and 89% of speech measurements to within 0.1 seconds. On mean length of utterance, pauses, and disfluencies, scorers matched on over 90% of scored data. Coders also agreed on over 98% of categorizations involving construction types and verbs, and on 96% of task errors. Response latencies and session durations have been documented to match over 80% of the time to within 0.1 seconds using the current coding procedures. Automated syllable counts were hand-checked against transcripts, and were 100% accurate.

RESULTS

Data were available for analysis on over 1,100 multimodal constructions.

Integration Pattern Stability and Consistency

Both simultaneous and sequential integrators were stable in maintaining their dominant integration pattern over time, with no cases of participants switching to an alternate pattern during any of the three sessions. Only two instances occurred in which participants temporarily dropped below the 60% dominance threshold. Table 3 summarizes the consistency of simultaneous and sequential users' dominant patterns and shows that both groups were highly consistent, with the former averaging 96% and the latter 95%.

Simultaneous Integrators

Subj   Session 1   Session 2   Session 3   Ave
1      100         100         100         100
2      100         100         100         100
3      100         100         100         100
4      100         100         100         100
5      100         100         100         100
6      100         100         100         100
7      100         100         100         100
8      100         100         100         100
9      100         100         100         100
10     100         100         100         100
11      80          88          88          85
12      56*         94         100          83
13      88          57*         80          76
Ave     94          95          98          96

Sequential Integrators

Subj   Session 1   Session 2   Session 3   Ave
14     100         100         100         100
15     100         100         100         100
16     100         100         100         100
17     100         100         100         100
18     100         100         100         100
19     100         100         100         100
20     100         100         100         100
21      81          69          75          75
22     100          81          75          85
23      88          88          94          90
24      94         100          80          91
Ave     97          94          93          95

Table 3: Percentage of users' multimodal constructions in their dominant integration pattern (non-dominant values, which fell below the 60% threshold, marked with *)

Integration Pattern Precedence and Durational Features

Modality Precedence

Different patterns also emerged with respect to input mode precedence for the two groups. An analysis of all constructions issued by simultaneous integrators revealed that 72% were delivered with speech preceding pen, whereas 82% of sequential integrators' constructions involved initiation with pen input, which was significant by Wilcoxon Rank Sum test, z = 2.38, p < 0.02, two-tailed.

Intermodal Overlap/Lag

The average intermodal overlap between speech and pen input for simultaneous integrators was 1.0 second, while in contrast the average lag between pen and speech input was 0.6 seconds for sequential integrators. The typical integration pattern for a simultaneous versus sequential integrator, including both mode precedence and average modality overlap/lag, is summarized in Figure 2.

[Figure 2: Model of average temporal integration pattern for simultaneous and sequential integrators' typical constructions, plotted on a 0-3.5 second time axis. The simultaneous example shows speech ("Let's have an evacuation route") overlapping pen input; the sequential example shows pen input followed, after a lag, by speech ("Make a route").]

Signal Durations

As shown in Figure 2, simultaneous integrators' speech duration averaged 1.8 seconds, while sequential integrators averaged 1.3, which was significant by independent t-test (logged), t = 3.44 (df = 22), p < 0.002, two-tailed. Simultaneous users averaged 1.5 seconds of writing, while sequential users averaged 1.3, which did not differ by independent t-test (logged), t < 1, N.S. Finally, simultaneous integrators averaged 2.4 seconds in total multimodal construction length, compared with 3.4 for sequential integrators, which also differed significantly by independent t-test (logged), t = 2.56 (df = 22), p < 0.02, two-tailed.

Performance Measures

Task-Critical Performance Errors

Simultaneous integrators had an average overall task-critical error rate of 12.8% per 100 tasks, compared with 6.6% for sequential integrators, which was significant by Wilcoxon Rank Sum test, z = 1.83, p < 0.035, one-tailed. Overall, simultaneous integrators made 94% more errors. As illustrated in Figure 3, participants' rate of errors decreased over time, averaging 15.1% in their first session, 8.6% in the second, and 6.3% in the final session. This pattern of decreasing errors over time also differed between the two groups, primarily in the first session. Simultaneous integrators averaged 21.6% errors in their first session, whereas sequential integrators averaged only 7.4%, a significant difference by Wilcoxon Rank Sum test, z = 2.40, p < 0.009, one-tailed. However, neither of their subsequent sessions differed significantly, z < 1.04, N.S.
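For readers who wish to reproduce analyses of this kind, the two tests used throughout these Results can be run as follows. This is an illustrative sketch with made-up example arrays, not the study's data or analysis scripts; scipy is assumed.

```python
# Sketch of the two statistical tests used throughout the Results
# (illustrative; the data arrays below are made up).
import numpy as np
from scipy import stats

sim_durations = np.array([1.9, 1.7, 2.0, 1.8])  # hypothetical, seconds
seq_durations = np.array([1.2, 1.4, 1.3, 1.3])

# Independent t-test on log-transformed durations ("t-test (logged)").
t, p = stats.ttest_ind(np.log(sim_durations), np.log(seq_durations))

# Wilcoxon rank-sum test for measures treated nonparametrically.
z, p_rs = stats.ranksums(sim_durations, seq_durations)
print(t, p, z, p_rs)
```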

[Figure 3: Differences in task-critical performance errors for simultaneous versus sequential integrators by session and task difficulty level. Error rates (performance errors per task, 0-60%) are shown for low, moderate, high, and very high difficulty tasks in each of the three sessions.]

The error rate also increased with task difficulty, averaging 1.0% for low difficulty tasks, 2.1% for moderate, 15.6% for high, and 21.2% for very high difficulty tasks. As shown in Figure 3, simultaneous and sequential integrators had similar error rates on low difficulty tasks (1.3% and 0.8%, respectively) and on moderate ones (2.6% and 1.5%), neither of which differed significantly by Wilcoxon Rank Sum test, z < 1, N.S. However, simultaneous integrators made marginally more errors on high difficulty tasks than sequential ones (18.6% and 12.1%, respectively), Wilcoxon Rank Sum test, z = 1.25, p < 0.105, one-tailed, and they also made substantially more errors on very high difficulty tasks (28.8% and 12.1%, respectively), significant by Wilcoxon Rank Sum test, z = 2.13, p < 0.02, one-tailed. Further examination of differences between the two groups just during session 1 revealed that simultaneous and sequential integrators differed significantly at both the high task difficulty level (34.6% and 15.9%, respectively), Wilcoxon Rank Sum test, z = 1.80, p < 0.04, one-tailed, and also at the very high difficulty level (48.1% and 11.4%), Wilcoxon Rank Sum test, z = 2.35, p < 0.01, one-tailed. Overall, simultaneous integrators made 322% more errors on these very highest difficulty tasks during their initial session.

Task Response Latency

There was no significant difference between simultaneous and sequential integrators in latencies to initiate a task (10.9 and 11.0 seconds, respectively), t < 1 by independent t-test (logged), N.S.

Session Duration

There was no difference between simultaneous and sequential integrators in total session duration (278.4 vs. 285.2 seconds, respectively), independent t-test, t < 1, N.S.

Linguistic Measures

Results are reported for the entire dataset, and for some measures also for a more controlled subset of 336 utterance pairs matched on lengths ranging from 2-10 words.

Mean Length of Utterance (MLU)

Simultaneous integrators' utterances ranged from 1-26 words in length (mean 5.02), and sequential integrators' ranged from 1-14 words (mean 2.67). The average utterance length of simultaneous integrators was 88.0% greater than that of sequential ones, significant by Wilcoxon Rank Sum test (reciprocal transform), z = 1.97, p < 0.05, two-tailed.

Lexical Disfluencies

For all data, simultaneous integrators averaged 0.59 disfluencies per 100 words, whereas sequential integrators averaged 0.09, which was significant by Wilcoxon Rank Sum test, z = 2.06, p < 0.02, one-tailed. Since a strong linear relation has been documented between utterance length and the density of disfluencies [9], further analysis was performed on the more controlled subset of utterance pairs matched on length. On these data, simultaneous integrators' disfluency rate was 0.98, whereas sequential integrators' was 0.11, significant by Wilcoxon Rank Sum test, z = 1.79, p < 0.04, one-tailed. Overall, simultaneous integrators had a substantial 791% higher disfluency rate.

Non-lexical Disfluencies

Over all data, simultaneous integrators averaged 4.16 pauses per 100 words, whereas sequential integrators averaged 2.43, significant by Wilcoxon Rank Sum test, z = 2.90, p < 0.002, one-tailed. However, analyses based on utterances matched for length revealed that the average pause rate for simultaneous integrators was 3.00 per 100 words, compared with 2.89 for sequential integrators, which was only marginally significant by Wilcoxon Rank Sum test, z = 1.31, p < 0.10, one-tailed.

Speech Rate

Simultaneous integrators' average speech rate was 3.45 syllables per second, and sequential integrators' was 3.31, which was not significant by independent t-test, t < 1, N.S. Since the two groups differed in their average utterance length and pause rate, analyses also were summarized for a subset of 593 2-4 word utterances in which no pauses were present. For this subset, simultaneous integrators averaged 3.28 and sequential integrators 3.33 syllables per second, which also was not significant by independent t-test, t < 1, N.S.

Lexicon Size & Variability

The total number of words spoken by simultaneous integrators averaged 244, compared with 137 for sequential integrators, a significant difference by Wilcoxon Rank Sum test, z = 1.97, p < .03, one-tailed. In addition, the average number of unique words spoken by simultaneous integrators was 91, compared with 52 for sequential integrators, also significant by Wilcoxon Rank Sum test, z = 2.23, p < .02, one-tailed.

Construction Type

Whereas only 6 of 13 simultaneous integrators delivered their speech using command language, all 11 of 11 sequential integrators did so, which was significant by Chi-Square test, χ² = 8.36 (df = 1), p < .004, one-tailed.

Presence & Type of Verb

The ratio of imperative verbs out of all verbs used was 50.7% for simultaneous integrators and 93.0% for sequential ones, significant by Wilcoxon Rank Sum test, z = 2.97, p < 0.0015, one-tailed. This represented an 83.4% higher rate of using imperatives by sequential integrators. However, the percentage of all constructions with the verb elided was 41.0% for simultaneous integrators and 63.8% for sequential ones, which was not significant by Wilcoxon Rank Sum test, z < 1.1, N.S. Finally, the percentage of all constructions which either elided the verb or contained an imperative was 61.9% for simultaneous integrators, but 97.3% for sequential ones, again a significant difference by Wilcoxon Rank Sum test, z = 2.75, p < .003, one-tailed. That is, sequential integrators were 57.2% more likely to either omit a verb or use its imperative form.

Self-Report

Overall, 21 of the 24 participants correctly identified their dominant integration pattern, or 87.5%. In addition, 22 participants correctly reported that their integration pattern did not change over sessions, or 91.7%. However, whereas 8 of 13 simultaneous integrators reported that they felt free to interact rapidly with the system (61.5%), only 3 of 11 sequential integrators reported this (27.3%), a marginal difference by Chi-Square test, χ² = 2.82 (df = 1), p < .10.

DISCUSSION

In the present longitudinal study, all 25 participants could be classified as having a dominant and stable multimodal integration pattern, which involved either simultaneous or sequential delivery of speech and pen input. As illustrated in Table 3, all users spontaneously maintained their dominant pattern over an extended 6-week time period, with no case of shifting over from one pattern to the other. Independent of the specific integration pattern adopted, both user groups also were remarkably consistent (95-96%) in deploying their preferred pattern. These new findings on the long-term stability of users’ multimodal integration patterns, combined with previous results showing that they occur across the lifespan and are resistant to change following instruction and training [8,11], clearly indicate that users’ dominant integration patterns are durable and highly consistent.

Apart from differences in the temporal characteristics of their multimodal constructions, sequential and simultaneous integrators clearly exhibited different behaviors and cognitive styles while completing tasks with the computer. As predicted, sequential integrators made substantially fewer task-critical errors than simultaneous integrators, approximately half overall. As illustrated in Figure 3, their ability to minimize errors was most apparent on newly introduced and complex tasks. In fact, during the first session, simultaneous integrators made 322% more errors on the highest difficulty tasks than did sequential integrators. Contrary to expectations, however, the greater accuracy achieved by sequential integrators did not come at the cost of slower completion times. Sequential integrators were neither slower to initiate a task following receipt of instructions, nor slower in their overall session durations.

In addition to their greater accuracy at the task level, sequential integrators had significantly fewer lexical and non-lexical disfluencies in their spoken utterances than simultaneous integrators. Once again, this greater articulatory control did not come at the cost of a slower speech rate, since both groups delivered their utterances at approximately 3.3 syllables per second. In comparisons based on matched-length utterance pairs, simultaneous integrators uttered a substantial 791% more lexical disfluencies than sequential integrators.

Figure 2 summarizes the average temporal integration pattern and typical linguistic constructions for simultaneous and sequential integrators. It illustrates several basic features that differed between the groups, for example that sequential integrators led with pen input 82% of the time, followed by a lag, and then a direct, terse, command-style spoken utterance. Their commands typically were noun phrase fragments ("Volunteer area") or imperative-noun combinations ("Close bridge"). In contrast, simultaneous integrators led with speech input 72% of the time, with pen input overlapped. Their utterances also were lengthier, more indirect, and more conversational, with larger and more varied vocabularies that included more verbs, adverbs, determiners, prepositions, conjunctions, and politeness terms ("We need an evacuation route here"; "Let's put some sandbags from Broadway to Fremont.").

From the viewpoint of managing miscommunication with a computer, sequential integrators' briefer and more direct communications, as well as their smaller and less varied vocabularies, could be interpreted as compatible with a more cautious style focused on achieving an error-free exchange. In addition, their delivery of pen input separately before speaking broke up input into two briefer communicative steps, and also created an ink trace as "common ground" to be talked about. All of these characteristics are consistent with a more careful and deliberative approach to ensuring communicative success. Sequential integrators' strategy of delivering briefer spoken utterances also may have assisted them in speeding up the tempo of their performance, such that they could maintain high accuracy without incurring any significant time loss. In comparison, simultaneous integrators' lengthier and more conversational language, their larger and more varied vocabularies, and their pattern of speaking immediately while writing all demonstrated a less inhibited and less cautious approach that appeared more focused on social interaction per se.

The present research has clarified that the individual differences in multimodal integration patterns between simultaneous and sequential integrators are indeed highly consistent and stable over time. In addition, the behavioral and linguistic profile uncovered for each group reveals fundamental differences in reflective-impulsive cognitive style. Previous psychology literature has shown that reflectivity-impulsivity is a pervasive and stable determiner of an individual's basic cognitive style, and one that generalizes across a wide variety of tasks. In this research, sequential integrators engaged in multimodal human-computer interaction have been identified as similar to reflective individuals in their cautious and deliberative intellectual style, including systematic information collection and strategic planning before committing to action, and a lower tolerance for committing errors during tasks and related communications. The advantage of this cognitive style clearly is its cultivated strategy for effectively minimizing errors and protecting performance accuracy, especially during newly introduced or complex tasks. In comparison, simultaneous integrators are similar to impulsive individuals in their greater spontaneity, less cautious approach to tasks, and more socially responsive conversational style.

This research highlights the critical nature of understanding and modeling stable individual differences among users, in particular major distinctions in their multimodal integration patterns (simultaneous vs. sequential), so that future multimodal systems can be designed to more effectively adapt their temporal thresholds during information fusion. Given the bimodal distribution of user integration patterns, adaptive temporal thresholds could support substantial improvements in system processing speed (i.e., with delays reduced to just 44% of current levels), greater system reliability, and improved synchrony of human-computer interaction. The present results also underscore that future mobile and educational interfaces should be designed with the goal of supporting the poorer attention span, weaker impulse control, and higher error rates of users with an impulsive profile, especially in the case of mobile in-vehicle, military, and other applications that bear an unacceptably high cost for committing errors.

ACKNOWLEDGMENTS

Thanks to Benfang Xiao, Matt Wesson, and Josh Flanders for assisting during testing, data collection, scoring, and second scoring. This research was supported by NSF Grant No. IIS-0117868 and DARPA Contract No. NBCHD030010. Any opinions, findings, or conclusions are those of the authors and do not necessarily reflect the views of DARPA or the Department of the Interior.

REFERENCES

1. Centre for Lexical Information, Max Planck Institute for Psycholinguistics. The CELEX lexical database. Linguistic Data Consortium.
2. Drake, D.M. Perceptual correlates of impulsive and reflective behavior. Developmental Psychology, 1970, 2(2), 202-214.
3. Kagan, J. and Kogan, N. Individual variation in cognitive processes. In Carmichael's Manual of Child Psychology, P. Mussen, Ed. Wiley, New York, 1970, 1273-1365.
4. Kagan, J., Rosman, B.L., Day, D., Albert, J., and Phillips, W. Information processing in the child: Significance of analytic and reflective attitudes. Psychological Monographs, 1964, 78(1), 1-37.
5. Maccoby, E. Social Development: Psychological Growth and the Parent-Child Relationship. Harcourt Brace Jovanovich, New York, 1980.
6. Messer, S.B. Reflection-impulsivity: A review. Psychological Bulletin, 1976, 83(6), 1026-1052.
7. Mollon, J.D. and Perkins, A.J. Errors of judgement at Greenwich in 1796. Nature, 1996, 380, 101-102.
8. Oviatt, S., Coulston, R., and Lunsford, R. Just do what I tell you: The limited impact of instructions on multimodal integration patterns, in submission.
9. Oviatt, S.L. Predicting spoken disfluencies during human-computer interaction. Computer Speech and Language, 1995, 9, 19-35.
10. Oviatt, S.L. Ten myths of multimodal interaction. Communications of the ACM, 1999, 42(11), 74-81.
11. Oviatt, S.L., Coulston, R., Tomko, S., Xiao, B., Lunsford, R., Wesson, M., and Carmichael, L. Toward a theory of organized multimodal integration patterns during human-computer interaction. Proc. ICMI. ACM Press, 44-51.
12. Oviatt, S.L., DeAngeli, A., and Kuhn, K. Integration and synchronization of input modes during multimodal human-computer interaction. Proc. CHI '97. ACM Press, 415-422.
13. Stone, J.V., Hunkin, N.M., Porrill, J., Wood, R., Keeler, V., Beanland, M., Port, M., and Porter, N.R. When is now? Perception of simultaneity. Proc. Royal Society B: Biological Sciences, 2001, 268, 31-38.
14. Xiao, B., Girand, C., and Oviatt, S.L. Multimodal integration patterns in children. Proc. ICSLP '02. Causal Productions, 629-632.
15. Xiao, B., Lunsford, R., Coulston, R., Wesson, M., and Oviatt, S.L. Modeling multimodal integration patterns and performance in seniors: Toward adaptive processing of individual differences. Proc. ICMI. ACM Press, 265-272.