COMPUTER-ASSISTED VOCABULARY LEARNING: MULTIMEDIA ANNOTATIONS, WORD CONCRETENESS, AND INDIVIDUALIZED INSTRUCTION

by

Anne Rimrott
Master of Arts, Simon Fraser University, 2005
Diplom, Justus Liebig Universität, 2002

DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

In the Department of Linguistics

© Anne Rimrott 2010
SIMON FRASER UNIVERSITY
Fall 2010

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced, without authorization, under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

APPROVAL

Name: Anne Rimrott

Degree: Doctor of Philosophy

Title of Dissertation: Computer-Assisted Vocabulary Learning: Multimedia Annotations, Word Concreteness, and Individualized Instruction

Examining Committee:

Chair: Dr. Chung-hye Han
Associate Professor, Department of Linguistics

Dr. Trude Heift
Senior Supervisor
Professor, Department of Linguistics

Dr. J. Dean Mellow
Supervisor
Associate Professor, Department of Linguistics

Dr. Christian Guilbault
Internal Examiner
Associate Professor, Department of French

Dr. Dorothy Chun
External Examiner
Professor, Graduate School of Education
University of California Santa Barbara

Date Defended/Approved: ______________________________________

STATEMENT OF ETHICS APPROVAL

The author, whose name appears on the title page of this work, has obtained, for the research described in this work, either:

(a) Human research ethics approval from the Simon Fraser University Office of Research Ethics, or
(b) Advance approval of the animal care protocol from the University Animal Care Committee of Simon Fraser University;

or has conducted the research

(c) as a co-investigator, collaborator or research assistant in a research project approved in advance, or
(d) as a member of a course approved in advance for minimal risk human research, by the Office of Research Ethics.

A copy of the approval letter has been filed at the Theses Office of the University Library at the time of submission of this thesis or project. The original application for approval and letter of approval are filed with the relevant offices. Inquiries may be directed to those authorities.

Simon Fraser University Library
Simon Fraser University
Burnaby, BC, Canada

Last update: Spring 2010

ABSTRACT

This dissertation addresses research gaps in second / foreign language (L2) vocabulary learning by investigating issues surrounding multimedia annotations, word concreteness, and individualized instruction. Two experiments were conducted with beginner learners of L2 German who used Voka, an online flashcard-based multimedia program for intentional vocabulary learning designed by the author of this dissertation. Experiment 1 explored the effectiveness of annotations for vocabulary learning by also considering word concreteness and variation in annotation effectiveness among learners. Using a within-subjects design, 72 participants studied 15 abstract and 15 concrete German nouns. For each word, learners received a translation, an example sentence, and one of five annotation clusters that address the form, meaning and / or use of a word: (PG) picture and gloss of example sentence; (DG) definition and gloss; (PA) picture and audio pronunciation; (DA) definition and audio; or (PAGD) picture, audio, gloss, and definition. An immediate vocabulary posttest revealed that for both abstract and concrete words, annotation clusters containing a picture are significantly more effective than clusters without a picture. The delayed posttest data showed, however, that all annotation clusters are equally effective for abstract and concrete words. Furthermore, both posttests demonstrated that abstract words are significantly harder to learn than concrete words in all annotation clusters and that the effectiveness of annotation clusters varies across learners.

Experiment 2 constructed an individualized learning environment by considering the effectiveness of different annotation clusters on learner performance in experiment 1 to then examine the additional effect of two presentation sequences of annotation clusters on L2 vocabulary learning. Using a between-subjects design, 68 participants studied another 28 nouns with Voka. The FIX group received a fixed presentation sequence that showed all words in each learner's most effective annotation cluster. The ALT group received an alternating presentation sequence of each learner's two most effective annotation clusters by studying 14 words in each cluster. The results showed that presentation sequence has no effect on L2 vocabulary learning. The dissertation discusses the implications of the findings of both experiments and identifies potential avenues for future research.
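The difference between the two presentation sequences of experiment 2 can be illustrated with a short sketch. It is not Voka's actual implementation: the function name `assign_clusters`, the best-first `ranked` list, and the word-by-word alternation pattern are assumptions made for illustration only. The abstract states only that FIX learners saw all 28 words in their most effective cluster and that ALT learners saw 14 words in each of their two most effective clusters.

```python
from typing import Dict, List

# The five annotation clusters named in the abstract.
CLUSTERS = ("PG", "DG", "PA", "DA", "PAGD")


def assign_clusters(words: List[str], ranked: List[str], group: str) -> Dict[str, str]:
    """Map each target word to an annotation cluster for one learner.

    `ranked` is a hypothetical best-first ordering of the learner's clusters,
    e.g. derived from that learner's experiment 1 scores.
    """
    if any(cluster not in CLUSTERS for cluster in ranked) or len(ranked) < 2:
        raise ValueError("ranked must list at least two known clusters")
    if group == "FIX":
        # Fixed sequence: every word appears in the single best cluster.
        return {word: ranked[0] for word in words}
    if group == "ALT":
        if len(words) % 2 != 0:
            raise ValueError("ALT needs an even number of words")
        # Alternating sequence (assumed word-by-word): half of the words
        # in the best cluster, half in the second-best cluster.
        return {word: ranked[i % 2] for i, word in enumerate(words)}
    raise ValueError(f"unknown group: {group}")


if __name__ == "__main__":
    target_words = [f"noun{i:02d}" for i in range(1, 29)]  # 28 placeholder nouns
    ranking = ["PAGD", "PG", "PA", "DG", "DA"]             # invented example ranking
    alt = assign_clusters(target_words, ranking, "ALT")
    print(sum(1 for c in alt.values() if c == "PAGD"), "words in the best cluster")
```

With 28 words, the ALT branch places 14 words in each of the two top-ranked clusters, matching the split described above; how the learner ranking is actually computed is covered in the dissertation's section on the learner model and is not reproduced here.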

Keywords: second language acquisition (SLA); computer-assisted language learning (CALL); computer-assisted vocabulary learning (CAVL); multimedia vocabulary annotations; picture annotations; definition annotations; audio annotations; gloss annotations; word concreteness; imageability; individualized instruction; German as a foreign language

DEDICATION

Dedicated to Marlena and Leon

ACKNOWLEDGEMENTS

They say it takes a village to raise a child. Well, it certainly took a village to complete this dissertation.

I deeply thank my senior supervisor, Dr. Trude Heift, for sharing her expertise and knowledge of language learning and linguistics, for her support, encouragement and friendship throughout this entire process and for the countless hours of work put into reading and discussing the many drafts of my dissertation. I am also very grateful to Dr. Dean Mellow for broadening my understanding of second language acquisition and for helping me see the bigger picture when contemplating the results of a study. Thank you also to both of you for understanding and accommodating my sometimes crazy family-dominated work schedule. I would also like to thank Dr. Dorothy Chun and Dr. Christian Guilbault for serving as my examiners and for a stimulating discussion and valuable comments and suggestions at the thesis defence.

I am very thankful to Claudia Hein for letting me conduct my experiments in her German classes in Fall 2008, Summer 2009 and Fall 2009, for accommodating my research schedule, for helping me coordinate everything and for keeping me calm while the students were working on Voka. Thank you also to the German 102 students for participating in this research project.

Many people helped me create Voka. First and foremost, thank you to my husband, Damir Tresnjo, for the hours and hours of programming Voka for me and for putting up with the many revisions and re-revisions! I also thank Jesko Petersen, Elisabeth Rimrott, and Damir Tresnjo for taking so much of their time to help me create the content for Voka.
Thank you to the many friends, family and department members for agreeing to be actors for the Voka pictures: Jesko, Kristin, Elisabeth, Hans, Niklas, Damir, Marlena, Bernhard, Hedwig, Antje, Franziska, Marie, Karin, Kurt, Claudia, Hella, Kathri(e)n, Katrin, Doreen, Nadine, Sabine, Bryan, Lubomir, Agnieszka, Martina, Ben, Jacqueline, Dennis, Loreley, Susan, Emrah, Amalia, Grace, and John, and thanks to the many strangers that I talked into posing for a Voka picture at airports, train stations and on the streets of Germany. Thank you also to Elisabeth Reising, Elisabeth Rimrott, Kurt Trautwein, and Damir Tresnjo for allowing me to use some of their pictures in Voka. Thank you Claudia Hein, Martina Lange, Jesko Petersen, Elisabeth Rimrott, and Niklas Rimrott for rating the Voka pictures.

Thank you Elisabeth Rimrott, Hans Rimrott, and Niklas Rimrott for evaluating the German example sentences and thank you Natasha Penner for evaluating the English glosses. Thank you Lubomir Slavicek and Andrew Szendrey for recording, editing, and cutting the audio files for Voka. Thank you Elisabeth Rimrott and Susan Morton for testing Voka. I also thank the following people for contributing in various additional ways to Voka and my research project: Kathrin Bauer, Jenny Clarke-Amberson, Sabine Clemens, Tami Esau, Stephen Halford, Megan Krempel, Dana Lyseng, Kristin Petersen, Ivanka Skrypnyk, and Thomas Welsch. Thank you Stephanie Gabriel for performing the statistical analysis of my research data. I thank Susan Morton for decluttering some of my long sentences and putting a nice verb in every single one of them. Thank you to the staff (past and present) in the SFU Linguistics Department, particularly Melanie Covey, Ray Haleem, Carol Jackson, Rita Parmar, and Grace Wattanga for their help in navigating graduate studies. Thank you to former and current SFU Linguistics graduate students for putting some fun into the whole process, especially Reem Alsadoon, Lorna Fadden, Heidi Kent, Emma Mileva, Susan Morton, Dennis Storoshenko, and Loreley Wiesemann. I am thankful to Simon Fraser University for supporting my graduate studies through Graduate Fellowships, Travel and Minor Research Awards, and a President's Ph.D. Research Stipend. I am also grateful for the financial support I received through teaching assistantships and several research assistantships from Dr. Trude Heift. Damir Tresnjo, my love, thank you for standing by me all along, thank you for your love, friendship and support and for our beautiful family. Thank you Marlena and Leon, my children, for giving me balance with your laughter and happiness. 
Thank you Elisabeth and Hans Rimrott, my parents, for being so involved in my project, for coming to my defence, and for inspiring me, along with my late uncle Friedrich Rimrott, to pursue a doctoral degree. Thank you also to Christian and Niklas Rimrott, my brothers, and to my friends for their support and encouragement along the way. I owe deep gratitude to Dzemal and Dilka Tresnjo, my parents-in-law, for taking care of Marlena in Fall 2006 and of Leon in Fall 2010. Without their love for our family, I could not have started or finished this dissertation. Hvala lijepo! Thank you also to Mitra Jannati and "Auntie" Liz and "Uncle" Chris Deas-Dawlish for taking care of Marlena while I took care of the dissertation. It was a long road but it was worth it!

TABLE OF CONTENTS

APPROVAL
ABSTRACT
DEDICATION
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

1 INTRODUCTION

2 L2 VOCABULARY LEARNING
2.1 WORD KNOWLEDGE
2.1.1 Degrees of Word Knowledge
2.1.2 Aspects of Word Knowledge
2.2 MULTIMEDIA THEORIES IN COMPUTER-ASSISTED ENVIRONMENTS
2.2.1 Multimedia Theories and Second Language Acquisition
2.3 VOCABULARY ANNOTATIONS
2.3.1 Visuals, Translations, and Definitions
2.3.2 Other Vocabulary Annotations
2.4 WORD CONCRETENESS
2.4.1 Learnability of Abstract and Concrete Words
2.4.2 Word Concreteness and the Effectiveness of Vocabulary Annotations
2.5 INDIVIDUAL LEARNER DIFFERENCES IN SLA
2.5.1 IDs in Multimedia CAVL
2.6 INDIVIDUALIZED INSTRUCTION
2.6.1 Individualization in CAVL
2.7 SUMMARY OF PREVIOUS RESEARCH
2.8 RESEARCH GAPS
2.8.1 Vocabulary Annotations
2.8.2 Effects of Word Concreteness
2.8.3 IDs, Individualized Instruction, and Presentation Sequence
2.9 RESEARCH QUESTIONS OF THIS DISSERTATION
2.9.1 Research Questions of Experiment 1 (Part 1 of Voka)
2.9.2 Research Question of Experiment 2 (Part 2 of Voka)

3 DESIGN OF VOKA
3.1 INTENTIONAL VOCABULARY LEARNING
3.2 WORD SELECTION
3.2.1 Imageability
3.2.2 Frequency
3.2.3 Linguistic Features
3.2.4 Characteristics of the Test Items in Voka
3.3 ASPECTS OF WORD KNOWLEDGE IN VOKA
3.4 DEFAULT WORD INFORMATION
3.4.1 Form Information: Written Form
3.4.2 Meaning Information: L1 Translation
3.4.3 Use Information: L2 Example Sentence
3.5 WORD INFORMATION IN THE ANNOTATION TYPES
3.5.1 Form Annotation Type: Audio Pronunciation
3.5.2 Meaning Annotation Type I: L1 Definition
3.5.3 Meaning Annotation Type II: Picture
3.5.4 Use Annotation Type: L1 Gloss of L2 Example Sentence
3.6 THE FIVE ANNOTATION CLUSTERS
3.7 PROGRAM FLOW
3.7.1 Pretest
3.7.2 The Two Study Phases
3.7.3 The Two Practice Phases
3.7.4 The Two Posttests
3.8 DATA COLLECTION IN VOKA
3.9 RATIONALE FOR VOKA'S DESIGN
3.9.1 Assessment in Voka
3.9.2 Word Knowledge in Voka
3.9.3 Exposure Time for Test Items

4 METHODOLOGY OF EXPERIMENT 1
4.1 STUDY PARTICIPANTS
4.2 TIMELINE OF EXPERIMENT 1
4.3 DESIGN OF EXPERIMENT 1
4.4 DATA ANALYSIS AND STATISTICAL PROCEDURES
4.4.1 Scoring of Student Answers
4.4.2 Experiment Variables and Statistical Tests

5 RESULTS OF EXPERIMENT 1
5.1 RESEARCH TOPIC 1: VOCABULARY ANNOTATIONS
5.1.1 RQ 1.1: Main Effect of Annotation Cluster
5.1.2 RQ 1.2: Main Effect of Annotation Type Presence
5.2 RESEARCH TOPIC 2: WORD CONCRETENESS
5.2.1 RQ 2: Main Effect of Word Type
5.3 RESEARCH TOPIC 3: ANNOTATIONS AND WORD CONCRETENESS
5.3.1 RQ 3.1: Interaction Effect of Annotation Cluster X Word Type
5.3.2 RQ 3.2: Main Effect of Annotation Type Presence in the Two Word Types
5.4 RESEARCH TOPIC 4: ANNOTATIONS AND INDIVIDUAL LEARNERS
5.4.1 RQ 4: Interaction of Annotation Cluster and Individual Learner
5.5 SUMMARY OF THE FINDINGS OF EXPERIMENT 1

6 DISCUSSION OF EXPERIMENT 1
6.1 RESEARCH TOPIC 1: VOCABULARY ANNOTATIONS
6.1.1 Annotation Cluster / Picture Annotation Type
6.1.2 Definition Annotation Type
6.1.3 Gloss Annotation Type
6.1.4 Audio Annotation Type
6.2 RESEARCH TOPIC 2: WORD CONCRETENESS
6.3 RESEARCH TOPIC 3: ANNOTATIONS AND WORD CONCRETENESS
6.3.1 The Word Concreteness Effect in the Annotation Clusters
6.3.2 Annotation Effectiveness for Abstract and Concrete Words
6.4 RESEARCH TOPIC 4: ANNOTATIONS AND INDIVIDUAL LEARNERS
6.4.1 Inter-learner Variation
6.4.2 A Need for Individualized Instruction

7 METHODOLOGY OF EXPERIMENT 2
7.1 STUDY PARTICIPANTS
7.1.1 Assignment of Participants to the Two Groups
7.2 TIMELINE OF EXPERIMENT 2
7.3 DESIGN OF EXPERIMENT 2
7.3.1 Construction of a Learner Model for Each Student
7.3.2 Annotation Clusters Presented to FIX and ALT
7.4 DATA ANALYSIS AND STATISTICAL PROCEDURES

8 RESULTS AND DISCUSSION OF EXPERIMENT 2
8.1 RESEARCH TOPIC 5: PRESENTATION SEQUENCE IN INDIVIDUALIZED INSTRUCTION
8.1.1 Results of RQ 5: Main Effect of Presentation Sequence
8.2 DISCUSSION OF TOPIC 5: PRESENTATION SEQUENCE IN INDIVIDUALIZED INSTRUCTION
8.2.1 Presentation Sequence
8.2.2 Ranking of Annotation Clusters
8.2.3 An Evaluation of the Individualized Instruction

9 CONCLUSION
9.1 SUMMARY
9.2 PEDAGOGICAL IMPLICATIONS
9.2.1 Multimedia Annotations and Word Concreteness
9.2.2 Individualized Instruction
9.3 STUDY LIMITATIONS AND CONSTRAINTS ON GENERALIZABILITY
9.3.1 Study Limitations
9.3.2 Constraints on Generalizability of Findings
9.4 FUTURE RESEARCH
9.4.1 Vocabulary Annotations and Word Concreteness
9.4.2 Individualized Instruction

APPENDICES
APPENDIX A: TEST ITEMS
APPENDIX B: PROGRAM FLOW SAMPLES
Program Flow Sample of Experiment 1
Program Flow Samples of Experiment 2
APPENDIX C: BACKGROUND QUESTIONNAIRE
APPENDIX D: EVALUATION QUESTIONNAIRE 1
APPENDIX E: EVALUATION QUESTIONNAIRE 2
APPENDIX F: INSTRUCTION AUDIO SCRIPT
APPENDIX G: COPYRIGHT INFORMATION

REFERENCE LIST

LIST OF FIGURES

Figure 1: Labelled Voka flashcard of target word Herbst (fall)
Figure 2: The three picture choices for Zweck (purpose) (Picture 3 selected for Voka)
Figure 3: The five annotation clusters in Voka illustrated with the target word Herbst (fall)
Figure 4: Example screenshot of the direction screen (assessment and study phases completed)
Figure 5: Example screenshot of the pretest (top part shown)
Figure 6: Example screenshot (study phase 1) of a word studied in annotation cluster PAGD
Figure 7: Example screenshot of a prompt in practice phase 1
Figure 8: Example screenshot of practice phase feedback for correct student input
Figure 9: Example screenshot of practice phase feedback for incorrect student input
Figure 10: Example screenshot of a prompt in practice phase 2
Figure 11: Example screenshot of the posttests (top of immediate posttest shown)
Figure 12: Example screenshot of answer sheet (top part shown)
Figure 13: Dimensions of vocabulary assessment
Figure 14: Standardized mean scores for annotation presence vs. absence (immediate posttest, all words)
Figure 15: Standardized mean scores for annotation presence vs. absence (delayed posttest, all words)
Figure 16: Immediate posttest results for abstract vs. concrete words per annotation cluster
Figure 17: Delayed posttest results for abstract vs. concrete words per annotation cluster
Figure 18: Standardized mean scores for annotation presence vs. absence (immediate posttest, abstract words)
Figure 19: Standardized mean scores for annotation presence vs. absence (immediate posttest, concrete words)
Figure 20: Standardized mean scores for annotation presence vs. absence (delayed posttest, abstract words)
Figure 21: Standardized mean scores for annotation presence vs. absence (delayed posttest, concrete words)
Figure 22: Target word Vogel (bird) in annotation cluster PAGD
Figure 23: Voka pictures of the concrete nouns Gehirn (brain), Vogel (bird), and Kreis (circle)
Figure 24: Voka pictures of the abstract nouns Beweis (proof), Zweck (purpose), and Zustand (state)
Figure 25: Immediate posttest results for the five annotation clusters (both experiments)
Figure 26: Delayed posttest results for the five annotation clusters (both experiments)
Figure 27: The test items of experiments 1 and 2

LIST OF TABLES

Table 1: Examples of receptive and productive recognition and recall tests
Table 2: Aspects of word knowledge
Table 3: Characteristics of the test items in Voka
Table 4: Aspects of word knowledge covered by the information in Voka
Table 5: The five annotation clusters in Voka
Table 6: The seven phases in Voka
Table 7: Sample Voka log output for participant voka203
Table 8: Aspects of word knowledge contained in Voka's treatment and assessment
Table 9: Exposure time per word in Voka's treatment
Table 10: Characteristics of participants of experiment 1
Table 11: Timeline for experiment 1
Table 12: Example of a Latin square
Table 13: The five word packages in experiment 1
Table 14: Annotation cluster distribution in Latin Square design of experiment 1
Table 15: Independent variables of experiment 1
Table 16: Research questions and operationalization for experiment 1
Table 17: Immediate posttest results for the five annotation clusters
Table 18: Pairwise comparisons for annotation cluster (immediate posttest)
Table 19: Delayed posttest results for the five annotation clusters
Table 20: T-tests for annotation presence vs. absence (immediate posttest, all words)
Table 21: T-tests for annotation presence vs. absence (delayed posttest, all words)
Table 22: Immediate posttest results for abstract vs. concrete words
Table 23: Delayed posttest results for abstract vs. concrete words
Table 24: Immediate posttest results for abstract vs. concrete words per annotation cluster
Table 25: Delayed posttest results for abstract vs. concrete words per annotation cluster
Table 26: T-tests for annotation presence vs. absence (immediate posttest, abstract words)
Table 27: T-tests for annotation presence vs. absence (immediate posttest, concrete words)
Table 28: T-tests for annotation presence vs. absence (delayed posttest, abstract words)
Table 29: T-tests for annotation presence vs. absence (delayed posttest, concrete words)
Table 30: Best annotation cluster(s) for each learner on the immediate posttest
Table 31: Best annotation cluster(s) for each learner on the delayed posttest
Table 32: Number of times each annotation cluster is in first place
Table 33: Summary of the findings of experiment 1
Table 34: Characteristics of participants of experiment 2
Table 35: Means and SDs for the sorting measures
Table 36: Timeline for experiment 2
Table 37: Annotation cluster ranks for participants of experiment 2
Table 38: Learner model scores to determine top two annotation clusters for experiment 2
Table 39: Calculation of the learner model for voka244 and voka250
Table 40: Best and second best annotation clusters for participants of experiment 2
Table 41: Distribution of annotation clusters in the FIX group
Table 42: Distribution of annotation clusters in the ALT group
Table 43: Research question and operationalization for experiment 2
Table 44: Immediate and delayed posttest scores for experiment 2
Table 45: Comparison of best vs. second-best annotation cluster scores for 30 of the ALT learners
Table 46: Some uncontrolled characteristics of the 30 target words of experiment 1
Table 47: Program flow of experiment 1 for participant voka201
Table 48: Program flow of experiment 2 for participant voka244 (ALT group)
Table 49: Program flow of experiment 2 for participant voka249 (FIX group)

LIST OF ABBREVIATIONS

ANOVA     Analysis of variance
BAWL      Berlin Affective Word List
BAWL-R    Berlin Affective Word List Reloaded
CALL      Computer-assisted language learning
CAVL      Computer-assisted vocabulary learning
CTML      Cognitive theory of multimedia learning
DA        Definition audio annotation cluster
df        degrees of freedom
DG        Definition gloss annotation cluster
IDs       Individual learner differences
L1        First language, native language
L2        Second language, foreign language
n         Number in subsample
N         Total number in sample
p         Probability
PA        Picture audio annotation cluster
PAGD      Picture audio gloss definition annotation cluster
PG        Picture gloss annotation cluster
RQ        Research question
Sig.      Significance
SEM       Standard error of the mean
SD        Standard deviation
SLA       Second language acquisition

1 INTRODUCTION

Vocabulary learning is an essential aspect of mastering a second or foreign language (L2). Knowledge of words is fundamental to both comprehension and production in reading, listening, speaking and writing activities. In instructional settings, providing learners with a variety of information about a word enhances vocabulary learning. For example, research indicates that pictorial annotations illustrating the meaning of a word in a reading passage have lasting effects on word retention. Research also suggests that vocabulary annotations are not equally effective for all learners. Individual learner differences determined by, for instance, distinct cognitive styles and / or learning preferences influence learners' abilities to benefit from particular vocabulary annotations for learning. Pictures are potentially more valuable for visual learners than for verbal learners, for example.

At the same time, vocabulary annotations are still underexplored. For instance, audio annotations require further investigation. In addition, it is unclear how word concreteness impacts the effectiveness of vocabulary annotations. For example, more research is needed to investigate whether pictorial annotations help with learning abstract words, which are not easily picturable. Moreover, it is necessary to explore if and to what extent the effectiveness of vocabulary annotations varies among learners to ascertain the kind of individualized instruction that best accommodates learner differences. Finally, the role of the computer environment itself in vocabulary learning needs further investigation. For example, the effect of the sequence in which annotations are presented in a computer-based learning environment has not been explored to date.

This dissertation fills some of the research gaps surrounding vocabulary learning in computer-assisted language learning (CALL). With their multimedia capabilities, CALL
programs can provide word annotations such as pictures, audio pronunciations and / or written definitions to assist in the language learning process. Due to convenient tracking capabilities, they can also answer questions concerning the effectiveness of multimedia annotations for different learners. This knowledge will contribute to constructing effective CALL programs that are tailored to individual learners' needs. Finally, questions surrounding the effect of features of the computer environment itself on learning outcomes can be explored with CALL programs. For this dissertation, in fall 2009, beginner L2 learners of German at a Canadian university studied vocabulary using Voka (Rimrott, 2009), a web-based multimedia computer-assisted vocabulary learning (CAVL) program designed by the author of this dissertation. Two consecutive CALL experiments were conducted with Voka. Experiment 1 explores the effectiveness of different combinations of annotations for learning abstract and concrete words by also examining learner differences. In the first experiment, 72 participants studied 15 abstract and 15 concrete German nouns with five combinations of annotations addressing the form, meaning and / or use of these test items. Using the results of experiment 1, an individualized learning environment was constructed for experiment 2, in which the effect of different presentation sequences of annotations on vocabulary learning was investigated as one variable made possible by the computer-based learning environment. In this experiment, 68 participants from experiment 1 studied another 28 German nouns, for which the annotations were displayed in two presentation sequences, fixed versus alternating. Half the participants (n = 34) studied the words in a fixed presentation sequence with their best combination of annotations as determined by their learning preferences and performance in experiment 1. 
The other 34 participants learned with their two best combinations of annotations, which were displayed in random alternation for the 28 vocabulary items. Immediate and delayed productive recall vocabulary posttests were used in both experiments to assess the level of mastery of the test items.

The dissertation is organized as follows. Chapter 2 presents an overview of previous research on L2 vocabulary learning. Section 2.1 discusses different degrees and aspects of word knowledge. In examining vocabulary learning in computer-assisted
environments, section 2.2 focuses on theories of multimedia learning and section 2.3 discusses research on vocabulary annotations in CAVL. As two aspects that might influence the effectiveness of vocabulary annotations, word concreteness is explored in section 2.4 and individual learner differences are discussed in section 2.5. This leads to an exploration of individualized instruction in section 2.6. Section 2.7 summarizes the research findings presented in the chapter while section 2.8 identifies gaps in previous research. The chapter concludes with a presentation of the research questions of this dissertation in section 2.9. Chapter 3 discusses the design of Voka, the CAVL program used for the experiments of this dissertation. Section 3.1 provides a rationale for using intentional vocabulary instruction in Voka. Section 3.2 describes the selection process for the target words while section 3.3 details the information presented about these words in Voka, either as default information or as a vocabulary annotation. The default word information is portrayed in more detail in section 3.4 while the vocabulary annotations are presented in section 3.5. Section 3.6 introduces the five combinations of annotation types that were used to convey the form, meaning, and use of the test items in Voka. Voka's program flow with respect to the learning and assessment tasks that the participants completed is described in section 3.7. The data tracking and collection procedures in Voka are presented in section 3.8. Finally, section 3.9 discusses the rationale for Voka's design. Chapter 4 outlines the methodology of the first experiment that was conducted for this dissertation. Section 4.1 provides a description of the study participants and section 4.2 presents the timeline of the experiment. The design of experiment 1 in terms of the distribution of the combinations of annotation types for the participants is discussed in section 4.3. 
Finally, section 4.4 specifies the data analysis and statistical procedures used.

Chapter 5 presents the results of experiment 1 according to four distinct research topics. Section 5.1 focuses on the effectiveness of vocabulary annotations while section 5.2 deals with the learnability of words of differing levels of concreteness (abstract and concrete words). Section 5.3 presents findings regarding the interaction of vocabulary annotation effectiveness and word concreteness in vocabulary learning. Section 5.4
covers the interaction of vocabulary annotation effectiveness and individual learners. A summary of the results of experiment 1 is provided in section 5.5. Chapter 6 discusses the results of experiment 1, covering each of the four research topics in order. Accordingly, section 6.1 explores vocabulary annotations, section 6.2 examines word concreteness, section 6.3 discusses the interaction between vocabulary annotation effectiveness and word concreteness, and section 6.4 inspects annotations in relation to individual learners. Chapter 7 describes the methodology of the second experiment conducted as part of this dissertation. Section 7.1 presents the study participants while section 7.2 provides the timeline for conducting the experiment. Section 7.3 explains the design of experiment 2 with respect to the two presentation sequences (fixed versus alternating) and the combinations of annotations shown to each learner. Section 7.4 describes the data analysis and statistical tests employed in experiment 2. The findings of experiment 2, which examines the effect of different presentation sequences in vocabulary instruction, are presented in chapter 8. Section 8.1 describes the results of experiment 2 and section 8.2 discusses them. As a conclusion, chapter 9 first summarizes the findings of this dissertation in section 9.1 and then discusses their pedagogical implications in section 9.2. The conclusion addresses limitations and constraints on the generalizability of the findings in section 9.3. Finally, section 9.4 identifies areas of future research.

2 L2 VOCABULARY LEARNING

This chapter first establishes what it means to know a word by discussing different degrees and components of word knowledge in L2 vocabulary learning (section 2.1). Turning to vocabulary learning in CALL environments, section 2.2 reviews theories of multimedia learning and their application to second language acquisition while section 2.3 provides an overview of CALL annotations for L2 vocabulary learning. Section 2.4 discusses word concreteness and its potential impact on the effectiveness of L2 vocabulary annotations. The impact of individual learner differences on annotation effectiveness is examined in section 2.5. This leads to a discussion of individualized instruction, that is, instruction tailored to the needs of individual learners in section 2.6. Section 2.7 summarizes the findings of previous research while section 2.8 addresses research gaps. Finally, section 2.9 presents the research questions of this study.

2.1 Word Knowledge

Linguists in general and also vocabulary researchers have found it difficult to provide a definition of a word (see, e.g., Haspelmath, in press; Nation, 2001; Read, 2000). For instance, after examining ten commonly used criteria for determining wordhood (potential pauses, free occurrence, mobility, uninterruptibility, non-selectivity, non-coordinatability, anaphoric islandhood, nonextractability, morphophonological idiosyncrasies, deviations from biuniqueness), Haspelmath (in press) concludes that there are no valid, cross-linguistic criteria for defining a word and for distinguishing words from affixes, clitics and phrases. However, he concedes that language-specific definitions of wordhood are both possible and practical for descriptive purposes.

As a working definition for discussing vocabulary learning in this dissertation, the term word generally refers to lemmas. For languages such as English and German, tokens, types, lemmas, and word families can be distinguished (Nation, 2001). Tokens are nonunique occurrences of a word form whereas types are unique occurrences. For example, the sentence "The cat is eating the food" contains 6 tokens but only 5 types because the word form the occurs twice. Lemmas consist of "a headword and some of its inflected and reduced (n't) forms" (Nation, 2001, p. 7) that belong to the same part of speech. For instance, the verb forms walk, walks, walking, walked are grouped under one lemma. A word family consists of the lemma and closely related derived forms (e.g., forms derived by adding affixes such as -ly, -ness, or un-).
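
The token and type counts in the example sentence can be checked with a few lines of code. The following is a minimal sketch in Python and is not part of Voka; the function names and the toy lemma table are hypothetical and serve only as an illustration. It assumes that tokens are whitespace-separated word forms compared without regard to capitalization, a simplification that ignores punctuation.

```python
def count_tokens_and_types(sentence):
    """Return (number of tokens, number of types) for a sentence.

    Assumption: tokens are whitespace-separated word forms, compared
    case-insensitively; punctuation handling is deliberately left out.
    """
    tokens = sentence.lower().split()
    types = set(tokens)
    return len(tokens), len(types)


# Hypothetical toy lemma table: inflected forms mapped to their headword.
TOY_LEMMAS = {"walk": "walk", "walks": "walk", "walking": "walk", "walked": "walk"}


def count_lemmas(word_forms):
    """Count distinct lemmas, falling back to the form itself if unknown."""
    return len({TOY_LEMMAS.get(form, form) for form in word_forms})


print(count_tokens_and_types("The cat is eating the food"))  # (6, 5)
print(count_lemmas(["walk", "walks", "walking", "walked"]))  # 1
```

Lowercasing before counting is what makes The and the fall together as one type; a real lemmatizer would replace the toy lookup table.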

2.1.1 Degrees of Word Knowledge

Learning or knowing a word is not an all-or-nothing affair but rather involves different kinds and degrees of knowledge. Receptive and productive types of knowledge can be distinguished (Nation, 2001). Receptive knowledge is associated with the so-called receptive skills of listening and reading while productive knowledge is linked to the so-called productive skills of speaking and writing.¹ Receptive knowledge involves perceiving an L2 word and being able to comprehend its meaning whereas productive knowledge implies expressing a meaning and supplying the appropriate L2 word. Accordingly, when L2 vocabulary is learned and / or tested with the help of the learner's native language (L1), the receptive direction entails going from the L2 to the L1 (e.g., das Bein → leg for English learners of German) while the productive direction is the reverse (i.e., leg → das Bein).

¹ While the distinction between receptive and productive knowledge is a useful one, it should be noted that there is some debate about the terminology itself as well as the exact definition and scope of the terms (Melka, 1997; Nation, 2001; Read, 2000; Schmitt, 2010). For example, the term receptive is not entirely appropriate to refer to listening and reading because meaning is also produced in these activities (Nation, 2001). Some researchers prefer the terms passive for receptive and active for productive although these terms are also problematic as listening and reading are not passive activities. In addition, some treat the receptive / productive distinction as a scale of knowledge while others view it as two distinct points (see, e.g., Nation, 2001, pp. 24-26, and Laufer & Goldstein, 2004, for a more detailed discussion).

Aside from the receptive / productive distinction, there is an additional distinction between recognition and recall knowledge of words. In recognition, the form and the meaning of a word are presented simultaneously whereas in recall, either the form or the meaning of a word has to be retrieved from memory. Recognition knowledge involves, for instance, studying words with a vocabulary list where both form and meaning are presented simultaneously, form first in receptive learning (e.g., das Bein = leg, for English learners of German) and meaning first in productive learning (e.g., leg = das Bein). Retrieval of the form (productive) or meaning (receptive) in recall can be practiced by using flashcards that display the meaning of the word on one side and the form on the other.

Table 1: Examples of receptive and productive recognition and recall tests

Knowledge      Receptive              Productive

Recognition    das Bein               leg
               a. chair               a. der Stuhl
               b. leg                 b. das Bein
               c. fork                c. die Gabel

Recall         das Bein _________     leg _________

Note. L1 = English, L2 = German.

Four degrees of strength of word knowledge can be derived from combinations of the two distinctions: receptive recognition, receptive recall, productive recognition, and productive recall. Table 1 provides examples of the four pairings in the context of vocabulary assessment. For instance, a receptive recognition test might contain a multiple choice task where learners have to select the correct meaning of an L2 word from several options presented in the L1. In contrast, a productive recall test could involve the provision of the correct L2 form from memory given its meaning in the L1 as a prompt.²

Laufer and Goldstein (2004) conducted a study with 435 L2 learners of English to ascertain the level of difficulty of the four degrees of word knowledge. The study found that productive recall is the most difficult task for language learners, followed by receptive recall, productive recognition, and finally, receptive recognition. Other studies confirm that learning words for productive use is harder than receptive learning and that recall is more difficult than recognition (Chun & Plass, 1996a; De Groot & Keijzer, 2000; Ellis & Beaton, 1993a; Nation, 2001). If only one type of learning is possible, studying productively rather than receptively and using recall instead of recognition is recommended because the more difficult tasks generally result in superior learning (Barcroft, 2007; Nation, 2001).

² Note that some studies confound the distinction between receptive / productive and recognition / recall by using a recognition item for assessing receptive knowledge but a recall item to measure productive knowledge (see Nation, 2001). In addition, the four terms are also used differently by different researchers (see, e.g., Read, 2000, pp. 154-157). Following Nation (2001) and Laufer and Goldstein (2004), for this dissertation and when reporting research by others, the terms are used as explained in this section.

2.1.2 Aspects of Word Knowledge

However, when learning L2 vocabulary, what is actually involved in knowing a word? According to Nation (2001), knowledge of a word is comprised of nine aspects, grouped under three headings: form, meaning, and use. Each of the aspects has a receptive and a productive knowledge dimension, as illustrated in Table 2. For example, receptive knowledge of the word underdeveloped comprises, among other things, knowing its written and spoken form so that one can recognize it in different contexts when reading or listening. Thus, "if we say that a particular word is part of someone’s receptive vocabulary, we are making a very general statement that includes many aspects of knowledge and use, and we are combining the skills of listening and reading." (Nation, 2001, p. 28). In turn, productive knowledge of the same word includes being able to spell and pronounce the word correctly, being able to produce the word to express the meaning underdeveloped, and being able to decide in which contexts to use the word. In general, speakers' receptive vocabulary knowledge is greater than their productive vocabulary knowledge (Nation, 2001; Webb, 2008).

Table 2: Aspects of word knowledge

Aspect of word knowledge                 Associated knowledge

Form     spoken                          R  What does the word sound like?
                                         P  How is the word pronounced?
         written                         R  What does the word look like?
                                         P  How is the word written and spelled?
         word parts                      R  What parts are recognisable in this word?
                                         P  What word parts are needed to express the meaning?

Meaning  form and meaning                R  What meaning does this word form signal?
                                         P  What word form can be used to express this meaning?
         concept and referents           R  What is included in the concept?
                                         P  What items can the concept refer to?
         associations                    R  What other words does this make us think of?
                                         P  What other words could we use instead of this one?

Use      grammatical functions           R  In what patterns does the word occur?
                                         P  In what patterns must we use this word?
         collocations                    R  What words or types of words occur with this one?
                                         P  What words or types of words must we use with this one?
         constraints on use              R  Where, when, and how often would we expect to meet this word?
         (register, frequency, …)        P  Where, when, and how often can we use this word?

Note. R = receptive knowledge, P = productive knowledge. Reproduction of Nation's Table 2.1 (2001, p. 27). Reproduced with permission.

When learning L2 vocabulary in a computer-assisted multimedia environment, theories and principles of multimedia learning come into play. This is the topic of the following section.

2.2 Multimedia Theories in Computer-Assisted Environments

The two theories most commonly applied to multimedia L2 vocabulary learning are Paivio's (1971, 1986) dual coding theory and Mayer's (2001, 2005a) cognitive theory of multimedia learning (see, e.g., Reed, 2006, for an overview of additional multimedia theories).

Paivio's (1971, 1986) dual coding theory assumes that humans possess two distinct cognitive modality-specific subsystems for processing information. The nonverbal subsystem, commonly called the imagery subsystem, specializes in dealing with nonverbal information (e.g., visual objects, environmental sounds, taste) while the verbal subsystem specializes in representing and processing language (e.g., written or spoken words). The two subsystems are functionally independent, that is, they can be active simultaneously or separately from each other. Activity in one subsystem can also trigger activity in the other. Dual coding theory maintains that objects can be recalled from memory via activation of their names (verbal information), their images (nonverbal information), or both. Information presented in both verbal and nonverbal form is thus said to be conducive to learning because it enables the creation of two memory components, which increases the chance of remembering the information. In other words, the additivity effect due to the independence but interrelatedness of the verbal and nonverbal subsystems acts as a back-up in that items coded in dual modalities (e.g., by providing a picture and a written linguistic stimulus) are more likely to be learned and recalled than single-modality items, which are encoded only once. This dual coding effect seems to apply to any two sensory modalities (e.g., auditory, visual, olfactory) (Thompson & Paivio, 1994). Mayer's (2001, 2005a) cognitive theory of multimedia learning (CTML), previously also referred to as generative theory of multimedia learning, builds on and incorporates elements of other theories (e.g., Paivio's (1986) dual coding theory, Baddeley's (1986, 1999) model of working memory). 
CTML's key components are consistent and compatible with other multimedia instructional design theories such as Sweller's (2005) cognitive load theory and Schnotz and Bannert's integrated model of text and picture comprehension (Schnotz, 2005). Given this compatibility with other theories, the following overview is limited to CTML.

Mayer (2005b) defines multimedia as the provision of both words (printed or spoken) and pictures (e.g., graphics, illustrations, photographs, videos, animations). Multimedia learning refers to the building of mental representations from words and pictures, and multimedia instruction entails the presentation of both words and pictures to promote learning. CTML rests on the dual-channel assumption³, which is that people have separate information processing channels for material represented visually and material represented auditorily (see also Baddeley, 1986; Baddeley, 1999; Paivio, 1986). CTML incorporates the dual-channel assumption by suggesting that humans use an auditory / verbal channel and / or a visual / pictorial channel when processing information. Information that is presented to the eyes (e.g., pictures or written text) is initially processed through the visual channel whereas information presented to the ears (e.g., spoken text, non-speech sounds) is first processed through the auditory channel. However, humans can convert a representation originally presented to one channel for processing in the other channel. For example, a narration, which is initially processed in the auditory channel, might also be processed in the visual channel if the learner constructs a corresponding mental image.

According to CTML, five cognitive processes are involved in multimedia learning: 1) selecting words from a multimedia presentation for processing in working memory, 2) selecting images from a multimedia presentation for processing in working memory, 3) organizing the words into a verbal model, 4) organizing the images into a pictorial model, and crucially, 5) integrating the pictorial and verbal representations with each other and also with prior knowledge. More specifically, CTML assumes that a number of steps occur in multimedia learning. Three memory stores are involved in the process: sensory memory, working memory, and long-term memory. Working memory is subdivided into two parts: a part dealing with the raw material that enters working memory, and a part that represents the knowledge constructed in working memory by organizing the raw material appropriately.

³ CTML is also based on two additional assumptions: the limited capacity assumption and the active processing assumption. The limited capacity assumption states that people can only process a limited amount of information at a given time, that is, learners are constrained by the capacity of their working memory. The active processing assumption is that people actively engage in the cognitive processing of information to construct mental representations, that is, learners aim to make sense of multimedia instruction.

When processing multimedia, learners are first presented with words and pictures in a multimedia presentation. This information enters the learners' sensory memory through the ears and eyes, where it is held exactly as seen and heard for a very brief period of time. From sensory memory, learners select relevant words and images, which then enter working memory as raw material. In working memory, a sound can be mentally converted into an image (e.g., by forming a mental image of a cat when hearing the word cat) and vice versa. From the sound and image bases, knowledge is then constructed in working memory. The word sounds selected from the multimedia presentation are organized into a coherent representation called the verbal model. This involves building connections among different pieces of verbal knowledge. In parallel fashion, the images selected from the multimedia presentation are organized into a coherent representation called the pictorial model. The processes of organizing words and images reflect the learners' effort to build a representation structure that makes sense to them. In a final step, the verbal model and the pictorial model are integrated in one representation in which corresponding elements from both models are mapped onto each other. The integrated model also incorporates relevant prior knowledge from long-term memory. The model entails that there are five forms of representation for words and pictures: 1) words and pictures in the multimedia presentation, 2) sounds and images in sensory memory, 3) sounds and images in working memory, 4) a verbal model and a pictorial model in working memory, and 5) knowledge in long-term memory. In this model, pictures enter sensory memory through the eyes, and are then processed mainly as part of the so-called image base and the pictorial model in working memory before they are integrated. 
Spoken words enter sensory memory through the ears, are then processed mainly as part of the sound base and the verbal model in working memory and are finally integrated. Processing is more complicated for printed words. Printed words pass into sensory memory through the eyes and are brought into working memory as part of the images. Then, by mentally pronouncing the words, they enter the sound base, and are subsequently processed like spoken words, that is, they are
organized in the verbal model and then integrated. Thus, printed words compete for attention with pictures because they share part of the processing route with them. Based on numerous multimedia studies, several principles have been put forward for multimedia learning. The most important principle and the main impetus for studying multimedia learning is the so-called multimedia principle, which, based on much empirical evidence (see, e.g., section 2.3), maintains that learning from words and pictures is generally more effective than learning from words alone. In CTML, this principle is explained by the differential processing of images and words in the human mind. The modality principle postulates that if pictures are part of a presentation, accompanying spoken text is more effective than written text. According to CTML, this is due to spoken text and pictures not sharing the same processing route. Furthermore, the redundancy principle stipulates that learning from graphics and narration is more effective than learning from graphics, narration, and on-screen text (Mayer, 2005d). Finally, the coherence principle states that learners learn better when extraneous information (e.g., text or sound not relevant to the learning goal) is not included in multimedia instruction (Mayer, 2005c).

2.2.1 Multimedia Theories and Second Language Acquisition

In applying multimedia theory to second language acquisition (SLA), Plass and L. Jones (2005) note that the most crucial principle of multimedia learning, the multimedia principle (see Mayer, 2005d), appears to be applicable to SLA. Most studies have found that instruction combining both words and pictures is more effective than instruction using either words or pictures alone in L2 listening or reading comprehension and vocabulary acquisition (e.g., Chun & Plass, 1996a; L. Jones & Plass, 2002; Plass, Chun, Mayer, & Leutner, 1998). However, Plass and L. Jones (2005; see also Sydorenko, 2010) suggest that some of the other principles that have been established for multimedia learning in general (see Mayer, 2005d) may not be transferrable to the SLA context:

    For example, the coherence principle suggests that unneeded or irrelevant words reduce learning of scientific content and should therefore be removed from the text. This principle does not extend easily to second-language acquisition, where any meaningful linguistic input has potential value for the acquisition of the language… Similar arguments can be made for the redundancy principle and the modality principle – reading and listening are two competencies that both need to be developed, and in many cases one is used as input enhancement for the other. (p. 480)

Plass and L. Jones (2005) propose a model of multimedia learning in SLA that incorporates ideas from Mayer's CTML and an interactionist perspective of SLA. The interactionist perspective of SLA (e.g., Gass, 1997; Gass & Mackey, 2006; Lightbown & Spada, 1998; Long, 1983, 1985, 1996; Pica, 1994) deems three functions crucial to language learning: comprehensible input, interaction, and comprehensible output. A simplified model of the interactionist perspective on the SLA process moves from linguistic input, via apperception of this input, its semantic / syntactic comprehension, intake, and its integration into the linguistic system to comprehensible linguistic output (Chapelle, 1998). Only input that is apperceived by the learner (e.g., by making it prominent in instructional materials) can be acquired. When the learner comprehends apperceived input, the linguistic material can become intake, that is, language that the learner comprehends and that may develop his or her L2 system. Integration entails processing intake in short term memory to develop L2 knowledge. Finally, output refers to the language produced by the learner. In the interactionist model, interaction with others along several stages of this SLA process plays an important role because it allows learners to confirm or reject their hypotheses about the L2 (Swain, 1995).

Combining this model with CTML, Plass and L.
Jones (2005) suggest that the apperception stage, as understood in the interactionist model, involves selecting words and images in terms of Mayer's CTML, while the interactionists' comprehension stage is linked to the creation of the sound and image bases in CTML. Furthermore, intake, as defined in the interactionist model, is associated with CTML's organization of words and images into the verbal and visual model, respectively. The interactionists' integration stage corresponds to CTML's integration of the verbal and visual models into the
learner's linguistic system. Finally, comprehensible output, as defined in the interactionist model, can be produced. Theories of multimedia learning can explain some of the benefits of L2 vocabulary annotations that are typically provided in computer-assisted learning environments. Vocabulary annotations can be defined as "visual or verbal supplementary items, such as a translation of visual representations, that provide additional information for selected keywords" (Plass & L. Jones, 2005, p. 483). Vocabulary annotations are reviewed in the following section by also presenting empirical findings regarding their effectiveness in light of theories of multimedia learning.

2.3 Vocabulary Annotations

There is evidence that providing additional resources (e.g., annotations) for language learning enhances language acquisition (Abraham, 2008; Cárdenas-Claros & Gruba, 2009; Chun & Plass, 1996a; L. Jones, 2006; Lomicka, 1998) and is appreciated by language learners (Chapelle & Heift, 2009; Davis & Lyman-Hager, 1997). When it comes to studying vocabulary, one of the advantages of CAVL over paper-based instruction is the ease of multimedia integration by embedding written text, audio, images, and video in vocabulary annotations.

2.3.1 Visuals, Translations, and Definitions

The bulk of research on L2 vocabulary annotations has explored the effectiveness of visuals (usually still pictures but also videos), translations, and definitions to express the meaning of a word. Visual annotations and L1 translations or definitions are believed to be important in L2 vocabulary learning because they are said to be linked to mental processes in SLA. For example, Plass et al. (1998) state that L2 learners have available two separate verbal systems (L1 and L2) and one common imagery system. This suggests that L1 "translations of words would not only link the two verbal systems but that this storage in the second verbal system would have an additive effect on learning (cf. also Paivio &
Desrochers, 1980; Paivio & Lambert, 1981)" (p. 26). In fact, research indicates that vocabulary is learned effectively when learners "can establish a direct connection between a word in their native language, the corresponding picture of an object or action, and its foreign equivalent" (p. 26; see also Plass, Chun, Mayer, & Leutner, 2003). Learners thus have two types of retrieval cues (visual and L1-verbal) in memory, allowing for better learning. Pictures might be even more effective than L1 translations because, as L. Jones and Plass (2002) suggest "the mapping of pictures onto the mental model provides a stronger bond than the mapping of words due to the different representations of their information (analog vs. symbolic)" (Kost, Foss, & Lenzini, 1999, p. 99). Analog representations (images) can be directly mapped onto the mental model and are assumed to be language independent, whereas symbolic representations (text) are sequentially processed and demand "an indirect transformation between the symbolic representation of the text and the analog mental model" (Chun & Plass, 1997a, p. 8). (p. 557) Paivio's (1971, 1986) dual coding theory also stipulates a strong link between L2 words, their L1 translations and pictures. According to this theory, the learner develops an L2 verbal representational system, which has referential interconnections to nonverbal representations in the image system and also associative interconnections with existing L1 verbal representations (Paivio, 1986; Paivio & Desrochers, 1980). Similarly, Yoshii (2006) maintains that L2 words may be represented conceptually in learners' minds via a direct link between the L2 word and the underlying concept, or indirectly via a lexical link to an L1 word, which is then linked conceptually to the underlying concept. Yoshii suggests that beginner learners rely more on L1-word-to-L2-word links (lexical links) whereas more proficient learners link L2 words directly to concepts (conceptual links). 
In this model, images provide a further conceptual link to anchor words in memory and consequently help strengthen links between L2 words and their underlying concepts. There are thus three routes for learning in total: L1 translations, L2 words, and images.

The following two subsections review empirical studies that have examined the effectiveness of translations, definitions, and visual annotations (pictures, videos) for L2 vocabulary learning in incidental or intentional learning contexts. Incidental learning refers to the non-deliberate acquisition of vocabulary while performing another cognitive task such as reading or listening, whereas intentional learning means giving explicit attention to a word for the specific purpose of learning it.

Empirical studies on incidental vocabulary learning

In incidental learning environments, vocabulary annotations have been examined as part of either reading L2 text passages (Akbulut, 2007; Al-Seghayer, 2001; Chun & Plass, 1996a, 1996b; Plass et al., 1998, 2003; Yoshii, 2006; Yoshii & Flaitz, 2002) or listening to an L2 oral passage (L. Jones & Plass, 2002).⁴ Not surprisingly, the studies have found that accessing annotations is more effective for vocabulary learning than not accessing them in incidental vocabulary contexts (e.g., L. Jones & Plass, 2002; Plass et al., 1998, 2003). More importantly, however, several studies show that visuals, particularly pictures but also videos, are effective annotations for vocabulary learning and that access to multiple annotations (visual and verbal information) is more beneficial than access to either verbal only or visual only annotations. For example, Chun and Plass (1996a, study 2) compared different annotations for learning words with 103 university learners of German. Some words in their reading passage were annotated with a definition only, others with a definition plus a picture, and yet others with a definition plus a video. The researchers found that scores on an immediate vocabulary posttest were significantly higher for words that were annotated with pictures and text than for words that were annotated with video and text or text only. 

⁴ Some studies (e.g., Al-Seghayer, 2001; Chun & Plass, 1996a) also provided audio pronunciations for all words under investigation.

Furthermore, students also reported using the pictures as retrieval cues for remembering the words more often than the definitions or the videos. In another study involving the same treatment and 21 learners, Chun and Plass (1996a, study 3) found that words annotated with a definition and a picture were recalled
significantly better on a one-week delayed posttest than on the corresponding immediate posttest but this was not the case for the definition only or definition and video words. The authors speculate that this might be due to the so-called hypermnesia effect, which predicts better recall of pictures compared to text over time. In line with these results, Plass et al. (1998) found in a study with 103 learners that looking up both verbal and visual annotations in the same multimedia program resulted in significantly higher vocabulary posttest scores than any of the other look-up behaviours (i.e., looking up verbal only, visual only, or no annotations). Plass et al. (2003) also obtained similar results in a between-subjects design with four treatment groups (N = 152). Learners in the verbal and visual (picture or video) annotation group performed best, followed by the two single annotation groups (verbal only, visual only) and finally the group that did not have access to annotations. Akbulut (2007) investigated these vocabulary annotations (see also Al-Seghayer, 2001) by assigning 69 English learners to one of three treatment groups: a) L2 definition, b) L2 definition and picture, and c) L2 definition and video. The results showed that the groups with the picture and video annotations significantly outperformed the definition only group on both immediate and delayed vocabulary posttests. L. Jones and Plass (2002) used a between-subjects design with four treatment groups (N = 171): a) no vocabulary annotations, b) written L1 translations, c) pictorial annotations, d) both written and pictorial annotations. On an immediate vocabulary posttest, the L2 French learners in the multiple annotations group (written and pictorial) scored significantly higher than all other learners. On a three-week delayed posttest, the multiple annotations group still outperformed the written only and no annotations groups. 
Furthermore, Yoshii and Flaitz (2002) compared three annotation types in a between-subjects design (N = 151): a) text only, b) picture only, and c) text and picture (see Kost, Foss, & Lenzini, 1999, for a similar study in a non-CALL context). Results of immediate and delayed vocabulary posttests demonstrated a clear advantage of multiple annotations (text and picture) over single annotations (text only or picture only). In addition, the data presented some evidence that picture only annotations might be more
effective than text only annotations (see also Kellogg & Howe, 1971; Tonzar, Lotto, & Job, 2009). In a further study with a between-subjects design (N = 195), Yoshii (2006) examined four annotation types: a) L1 translation, b) L2 definition, c) L1 translation and picture, and d) L2 definition and picture. While the combined annotation groups (L1 or L2 text and picture) and the single annotation groups (L1 or L2 text only) performed equally well on receptive recognition posttests, the verbal plus picture annotations led to significantly more effective vocabulary learning than the verbal only annotations on productive recall posttests, providing further evidence of the effectiveness of pictorial annotations.

Empirical studies on intentional vocabulary learning

In contrast to the cited studies on incidental vocabulary acquisition, less research has been conducted on intentional L2 vocabulary learning (Godwin-Jones, 2010). However, Kim (2006) conducted a study on intentional vocabulary learning with a flashcard-based CAVL program. Korean tenth grade learners of English (N = 172) were split into six groups, which all received the written L2 target words and a written L2 example sentence but different combinations of the following vocabulary annotations: spoken target word, spoken L2 definition, written L2 definition, spoken L2 example sentence, and graphic. The interpretation of Kim's results is complicated due to the variety of annotation combinations in the groups. Nonetheless, the study provides some support for the effectiveness of picture annotations for vocabulary learning in that some groups with both word and picture annotations significantly outperformed some of the groups without pictorial annotations on immediate and delayed vocabulary posttests. In another study on intentional vocabulary learning with a CAVL program, Dubois and Vial (2000) also found some support for the effectiveness of imagery for vocabulary learning. 
The authors suggest that annotations "designed to promote mental imagery … allowed learners to encode information more effectively" (p. 162). Finally, Zhuo (2008) performed a meta-analysis of 20 studies of L2 vocabulary learning that compared text only instruction with text plus picture instruction and found
that "students learned better when on-screen text and visual materials are both presented instead of text only" (slide 21). In sum, there is a substantial amount of empirical evidence in both incidental and intentional vocabulary learning environments indicating that pictures are effective learning aids and that vocabulary is learned more effectively with a combination of verbal and pictorial annotations than with verbal annotations alone. This evidence is in line with Paivio's dual coding theory and Mayer's CTML. However, more research is needed to investigate the effectiveness of these annotations for different types of words. For instance, it is unclear whether pictures are useful for concrete and abstract words alike. In addition, the influence of individual learner differences needs to be considered. For example, pictorial annotations may not be effective for every learner. However, before turning to these topics, the following section reviews additional vocabulary annotations that have been studied in SLA, although less extensively.

2.3.2 Other Vocabulary Annotations

There are many other possible annotations for L2 words. This section focuses on two annotations that are also used in the experiments conducted for this dissertation: audio pronunciations and L1 glosses of L2 example sentences.

Audio pronunciations

While audio pronunciations for L2 vocabulary are frequently included in CAVL programs, their effectiveness for L2 vocabulary learning requires further assessment. As reviewed in section 2.2, the modality principle established by research on multimedia learning in general states that spoken words are more effective than written words in instructional materials that contain pictures. This is due to the fact that spoken words and pictures do not compete for attention in the visual processing channel. However, this principle may not apply to an SLA multimedia learning environment, in which written words are presented as an additional L2 learning resource along with a picture and spoken word(s) (Plass & L. Jones, 2005; Sydorenko, 2010, see section 2.2.1). For instance, in a study by Danan (1992), subjects learned significantly more vocabulary test
items when a video with spoken L2 narration was accompanied by written L2 subtitles than when no subtitles were present. In the SLA context, auditory information is thought to be beneficial for L2 vocabulary learning because it supports the development of sound-form mapping (Hamada & Koda, 2008) and commits L2 words to phonological memory (Hummel & French, 2010). Some researchers believe phonological memory or phonemic coding ability, that is, the "capacity to code unfamiliar sound so that it can be retained over more than a few seconds and subsequently retrieved or recognized" (Dörnyei & Skehan, 2003, p. 592; see also Nation, 2001), to play a part in SLA. Retaining information in phonological memory increases the likelihood for it to become part of long-term memory (Joseph, Watanabe, Shiung, Choi, & Robbins, 2009). Auditory information seems to be helpful in learning vocabulary especially when learners are confronted with unfamiliar alphabets and phonemes (Dubois & Vial, 2000). Audio recordings of words may also strengthen links between the written L2 form and its meaning (Hill & Laufer, 2003; Laufer & Hill, 2000; Svenconis & Kerst, 1995). For example, in an investigation of incidental vocabulary learning, Laufer and Hill (2000) speculate that "the addition of auditory information has helped the Chinese learners [in their study] to build referential connections between the written form and the meanings of the words. The bimodal (visual plus auditory) presentation of words may have enhanced the storage in short term memory" (p. 70). In addition, some learners have a preference for auditory learning (Laufer & Hill, 2000), which might affect their performance in different multimedia environments. 
Evidence of the potential benefits of audio annotations is provided by Okuyama (2007), who found that the frequency of accessing audio recordings in a multimedia vocabulary learning environment for L2 Japanese strongly correlated with higher scores on two immediate posttests assessing the receptive recognition of both sound and script. Further tentative evidence also stems from a study by Sydorenko (2010) who compared vocabulary learning in three learner groups: a) video plus L2 audio, b) video plus L2 subtitles, and c) video plus L2 audio and L2 subtitles. While the difference was not statistically significant, Sydorenko found that the audio plus subtitles group
outperformed the subtitles only group on a test of meaning recall. This led the author to conclude that "attention (at some level) was paid to audio and can be considered as the factor that increased word recall" (p. 63). However, Okuyama's (2007) study did not investigate productive knowledge or delayed retention and Sydorenko's (2010) investigation only showed descriptive but not inferential benefits of the audio annotation. Other studies did not find the provision of audio information useful in multimedia vocabulary learning. In one study on incidental vocabulary learning, the speech rate of the audio information was apparently too fast for the language learners to be effective (Yeh & Wang, 2003). In addition, Chun and Plass (1996a) found that audio information was of limited importance as a retrieval cue for words on a vocabulary posttest. Overall, further studies concerning the effectiveness of audio annotations are advocated (Chun & Plass, 1996a, 1997).

L1 glosses of L2 example sentences

Nation (2001) recommends studying L2 words with an L2 example sentence because it provides helpful information on the use of the target word in context. Naturally, comprehension of L2 example sentences is difficult to achieve at a beginner proficiency level given the learners' limited vocabulary knowledge. However, an L1 gloss of the L2 example sentence, as an additional annotation type, may enhance comprehension of the original L2 sentence. For example, in studying the German word Beweis (proof), an L2 example sentence such as "Die Polizei findet einen Beweis in Philips Wohnzimmer." can be annotated with the L1 gloss "The police find proof in Philip's living room." The effectiveness of L1 sentence-level glosses of L2 sentences for vocabulary learning has only been investigated in a few studies. 
Sun and Dong (2004) conducted a study with 67 Chinese children who viewed an animated cartoon with English narration in one of three treatment groups: 1) children heard an L1 gloss after each English sentence, 2) children practiced the pronunciation of the target words, heard the L1 glosses, and were informed of a vocabulary posttest, and 3) control (no L1 glosses, no practice, no posttest announcement). Group 2 significantly outperformed both group 1
and group 3 on a vocabulary posttest. Despite some limitations in research design (e.g., only four words comprised the posttest), the results suggest that L1 glosses, in combination with pronunciation practice and a posttest announcement, facilitate vocabulary learning. Grace (1998a, 1998b) examined the effectiveness of L1 glosses for incidental vocabulary learning in a between-subjects study with a multimedia CAVL program used by 181 first- and second-semester university learners of French. The CAVL program consisted of 40 comic book screens with dialogues in which 48 target vocabulary items were embedded. All subjects had access to the written text, a corresponding audio track and L2 definitional sentences of the target words. Subjects in the experimental group (n = 89) could additionally access L1 sentence-level glosses of the French dialogue whereas control subjects (n = 92) could not. Results of an immediate and delayed posttest showed that the subjects with access to sentence-level glosses learned vocabulary significantly better than the control subjects. Grace concludes that L1 glosses promote the retention of the correct word meaning, especially for beginner learners, because they assure learners of the correctness of their inferences. Deep and durable memory coding might also stem from the more extensive analysis of the input when learners search for semantic equivalents in and contrast the structure of the two sentences. Some additional evidence of the potential helpfulness of L1 glosses for vocabulary learning stems from two related fields of inquiry: research on video subtitles and parallel corpus research. Video subtitles refer to on-screen text (written language) that corresponds to and simultaneously accompanies a soundtrack (spoken language) of video material (i.e., animated pictorial representations). 
When subtitles are given in one language (e.g., L1) and the soundtrack in the other (e.g., L2), the environment is similar to a CAVL program that contains L2 sentences with corresponding L1 glosses. Studies have found different-language subtitles, especially L2 subtitles accompanying an L1 soundtrack, to be effective for vocabulary learning, often more so than same-language subtitles or no subtitles (Danan, 1992, 2004; Markham & Peter,
2002-2003). These findings indicate that to understand the meaning of a message, language learners, particularly those with limited L2 proficiency, benefit from the additional L1 input because it allows them to grasp the contextual cues in the L2 sentences. With reference to Paivio's bilingual dual coding theory, which stipulates two separate verbal systems (L1, L2) and an image system for SLA (see section 2.3.1), Danan (2004) maintains that the positive effect of subtitling on language learning is brought about by the fact that "these three independent systems are interconnected through triple associations between image, sound in one language, and text in another, which may lead to better processing and recall because of the additive effects of both image and translation" (p. 72). Regarding vocabulary annotations, studies on different-language subtitling thus suggest that L2 example sentences combined with corresponding L1 glosses provide a contextualized link of the two verbal systems involved in SLA and may thus lead to enhanced vocabulary learning. Another area that can provide insights into the possible effectiveness of sentence-level translations is parallel corpus research. A parallel corpus provides concordances of authentic sentences in the L2 along with their corresponding glosses in the L1. A parallel concordance is thus like an L2 sentence with a corresponding L1 gloss in a CAVL program. Parallel concordances are increasingly considered beneficial for L2 learning, not only by SLA researchers but also by learners themselves (Chang & Chang, 2004; Fan & Xunfeng, 2002; Granger, in press; Nerbonne, 2000; St. John, 2001; J.-C. Wu, Yeh, Chuang, Shei, & Chang, 2003; Zanettin, 1998). Parallel concordances are similar to a contextualized dictionary in that they allow learners to inductively engage in meaning creation and to confirm their linguistic hypotheses. 
They enable consciousness-raising by letting learners reflect on interlanguage similarities and differences in contextualized word use (Granger, in press). L1 glosses also provide a reasonable assurance that a learner will comprehend the L2 material (Nerbonne, 2000), which is a prerequisite for learning. While one can speculate that these benefits transfer to the provision of L2
example sentences with L1 glosses in a multimedia CAVL program, empirical studies are needed to test this assumption. The above discussion identifies open questions regarding the effectiveness of different multimedia annotations. More research is also necessary to examine their benefits with respect to different types of words. The following section explores word concreteness.

2.4 Word Concreteness

Words can be classified according to their level of concreteness and imageability, respectively. Concreteness refers to the ease with which a word's referent can be experienced by the senses (Paivio, Yuille, & Madigan, 1968) whereas imageability refers to the ease with which a word's referent arouses a mental image in a person's mind (Ellis & Beaton, 1993b; Paivio et al., 1968). Both concepts are usually measured by asking subjects to rate words on a 7-point scale from 1 to 7, ranging from abstract to concrete for word concreteness and from low imageability to high imageability for imageability. For example, the German word Hand (hand) is highly imageable whereas Ausnahme (exception) is very low on the imageability scale (mean imageability 6.88 and 1.35, respectively, in Vö, Jacobs, & Conrad, 2006). While there is evidence that concreteness and imageability are somewhat distinct concepts in terms of psychological processing (Richardson, 1975a, 1975b, 1976), many researchers treat them as interchangeable because concreteness and imageability are strongly correlated, "typically r > .90" (De Groot & Keijzer, 2000, p. 9). A highly concrete word is generally highly imageable and an abstract word is generally low on the imageability scale (see also Paivio, 1986; Paivio et al., 1968; Richardson, 1976). Accordingly, following the terminological convention of others (e.g., De Groot & Keijzer, 2000; Lahl, Göritz, Pietrowsky, & Rosenberg, 2009; van Hell & Mahn, 1997), words rated low on imageability and / or concreteness are termed abstract words and words rated high on imageability and / or concreteness are called concrete words in this dissertation unless otherwise noted.

2.4.1 Learnability of Abstract and Concrete Words

Regarding learnability, there is evidence that concrete words are generally easier to learn than abstract words (Paivio, 1986). In fact, the term concreteness effect refers to "the observation that concrete nouns are processed faster and more accurately than abstract nouns in a variety of cognitive tasks" (Jessen et al., 2000, p. 103). Neurolinguistic research confirms that there are differences in representation, retrieval, and processing of concrete versus abstract words in the human brain (e.g., Bedny & Thompson-Schill, 2006; Binder, 2007; Binder, Westbury, McKiernan, Possing, & Medler, 2005; Bruyer & Racquez, 1985; Fiebach & Friederici, 2003; Fliessbach, Weis, Klaver, Elger, & Weber, 2006; Holcomb, Kounios, Anderson, & West, 1999; Jessen et al., 2000; Noppeney & Price, 2004; Scott, 2004; Tyler & Moss, 1997). Two hypotheses have been put forward to explain the concreteness effect. The dual coding hypothesis attributes the advantage of concrete (high imageability) words to the greater ease with which a mental image can be formed (Paivio, 1986; see also Matsumi, 1994), or, in neurolinguistic terms, with easier access of an image-based system in the brain's right hemisphere (Jessen et al., 2000). While abstract (low imageability) nouns "are not assumed to be represented only by the verbal system" (Paivio, 1986, p. 170), "they are less likely to arouse images or do so with greater difficulty" (p. 170) because their access to the imagery system is less direct. In contrast, the context availability hypothesis maintains that "concrete words activate a broader contextual verbal support, which results in faster processing, but do not access a distinct image based system" (Jessen et al., 2000, p. 103). Based on event-related functional magnetic resonance imaging data from their study, Jessen et al. assert that a combination of both models can likely explain the superior encoding of concrete words. 
However, the neurolinguistic evidence is not conclusive at this point (cf., e.g., Binder et al., 2005; Fiebach & Friederici, 2003; Fliessbach et al., 2006).⁵

⁵ Two additional hypotheses are that concrete words are easier to learn than abstract words because either a) they are associated with a smaller number of other words, which makes them easier to remember, or b) they are associated with a larger number of other words, which makes them easier to remember (Hulstijn, 1997). According to Hulstijn, "Nelson and Schreiber [1992] claim that they have been able to falsify both of these (mutually incompatible) … hypotheses" (p. 213).

The concreteness effect also extends to the L2 context (e.g., De Groot & Keijzer, 2000; Ellis & Beaton, 1993b; Hulstijn, 1997; Matsumi, 1994; van Hell & Mahn, 1997). For instance, Ellis and Beaton (1993b) suggest that imageability is a "strong determinant of learnability" (p. 559). A study by de Groot and Keijzer (2000) with 60 native language words paired with pseudowords also found that abstract words were harder to learn and more susceptible to forgetting than concrete words.

2.4.2 Word Concreteness and the Effectiveness of Vocabulary Annotations

However, when it comes to vocabulary annotations in L2 learning, the effect of concreteness has hardly been explored. For instance, to the author's knowledge, the interaction between word concreteness and the effectiveness of annotations in the form of definitions, audio pronunciations, or L1 glosses of L2 example sentences has not been studied to date and previous publications do not appear to provide much insight into these areas. Furthermore, pictures have been shown to be effective annotations for learning words in general (see section 2.3.1) but it is unclear whether they are equally useful for learning concrete and abstract words. Pictures tend to fulfill differential functions with respect to the two word types. Pictures of concrete nouns are generally iconic and representational because they depict the actual referent (e.g., a picture of an apple to visualize the referent apple), thus providing "largely the same content but in a different sensory code" (Heidemann, 1996, p. 71). In contrast, abstract nouns generally cannot be visualized directly but rather have to be conveyed indirectly, for instance, through the use of metaphors (e.g., a picture of a clock to express the referent time) (Heidemann, 1996). Although visual annotations might be ambiguous for abstract words, they may nonetheless "provide invaluable situational contexts and thus facilitate vocabulary growth and retention after all" (Kost et al., 1999, p. 97). One would expect pictorial annotations to be less helpful for abstract, barely imageable words than for concrete, highly imageable words but this hypothesis is yet to be tested empirically. Previous studies have generally either investigated differences
between concrete and abstract words without recourse to pictorial annotations (e.g., De Groot & Keijzer, 2000) or they have examined the effectiveness of pictorial annotations but constrained themselves to studying only concrete words (e.g., Al-Seghayer, 2001; Barcroft, 2007; Kost et al., 1999; Lotto & de Groot, 1998; Yoshii, 2006). For instance, Yoshii (2006) states that "this study used concrete verbs since the researcher needed to illustrate them effectively" (p. 90). To the author's knowledge, only two unpublished master's theses (Leeflang, 2007; O'Bryan, 2005) have investigated pictorial annotations for abstract words to date. For instance, in O'Bryan's (2005) experiment, university ESL students were assigned to either a picture-annotation condition (n = 7) or an L2-definition-annotation condition (n = 6) for reading three texts with 48 glossed vocabulary items. The words were all rated as abstract (mean concreteness between 1 and 4 on a scale from 1 to 7), but imageable (mean imageability between 3.5 and 7). Vocabulary learning was measured by an immediate and 1.5-week delayed vocabulary posttest. While the study did not reveal a significant difference in the effectiveness of pictorial versus textual annotations for learning abstract, imageable words, the results have to be interpreted with caution because of limitations in research design. For example, only one participant in the definition group and six in the picture group accessed the annotations so the inferential statistics were based on a small sample. Given the limited number of studies to date, more research is needed to investigate the potential interaction between word concreteness and the effectiveness of vocabulary annotations. Aside from word concreteness, individual learner differences might also influence the effectiveness of vocabulary annotations in that different learners might benefit differently from vocabulary annotations. The following section explores this topic.

2.5 Individual Learner Differences in SLA

Section 2.3 reviewed empirical studies that consistently found that providing a variety of annotations, particularly verbal coupled with pictorial annotations, is an effective means of vocabulary learning. While this is the case for learners in general, differences have been found with respect to the usefulness of annotations for different learners (Chun, 2006). These individual differences (IDs) among learners also need to be taken into account in CAVL. For example, L. Jones (2009) notes that

high-spatial-ability students are more likely to benefit from pictorial information (Riding & Cheema, 1991) since processing such material is relatively effortless for them, while low-spatial-ability students must use a larger amount of cognitive effort to understand pictorial representations (Mayer & Sims, 1994). High-verbal-ability students are more likely to benefit from text-based materials, whereas low-verbal-ability students must expend greater cognitive effort to process information from aural or written texts (Jonassen & Grabowski, 1993). (p. 270)

IDs have been studied extensively in the realm of SLA (for an overview, see, e.g., Dörnyei & Skehan, 2003). Differences among learners in age, personality, foreign language aptitude (e.g., phonemic coding ability), cognitive and learning style, L2 proficiency, learner strategies, and motivation, for instance, have been linked to a broad spectrum of variation in SLA (Dörnyei & Skehan, 2003). This includes variation in L2 learning outcomes and achievement, in the way learners respond to vocabulary tests (e.g., Eyckmans, van de Velde, van Hout, & Boers, 2007; Milton, 2007), in the use of CALL materials (e.g., Chapelle & Heift, 2009; Hegelheimer & Tower, 2004; Heift, 2002), in the number and kinds of errors students make in CALL practice (Heift, 2008), and in learning from information presented in different modalities (e.g., Chun & Plass, 1997; Kellogg & Howe, 1971; Skehan, 2000), to name but a few diverse areas.
However, in ascertaining IDs, a number of studies have found that data on learners' cognitive and / or learning styles collected from self-report questionnaires are not always in accordance with students' actual performance in an authentic learning setting (Chapelle, 2001; Chapelle & Heift, 2009; Clark & Feldon, 2005; Leutner & Plass, 1998). Accordingly, some researchers suggest paying more attention to student performance data, preferably in the context in which IDs are to be assessed (Chapelle, 2001; Leutner & Plass, 1998). IDs are increasingly regarded as dynamic and situated in context rather than static and context-independent:

The most striking aspect of nearly all the recent ID literature is the emerging theme of context: It appears that cutting-edge research in all these diverse areas has been addressing the same issue, that is, the situated nature of the ID factors in question. Scholars have come to reject the notion that the various traits are context-independent and absolute, and are now increasingly proposing new dynamic conceptualizations in which ID factors enter into some interaction with the situational parameters rather than cutting across tasks and environments. (Dörnyei, 2005, p. 218) Accordingly, IDs "can only be evaluated with regard to their interaction with specific environmental and temporal factors or conditions" (Dörnyei, 2009, p. 232).

2.5.1 IDs in Multimedia CAVL

With regard to multimedia learning environments, IDs have been found to impact multimedia learning in general (e.g., Fletcher & Tobias, 2005; Goldman, 2009; Mayer, 2001; Mayer & Sims, 1994; Paivio, 1971, 1986) and in SLA in particular (e.g., Chun & Payne, 2004; L. Jones, 2009; Laufer & Hill, 2000; Plass et al., 1998, 2003). Regarding multimedia learning in general, the individual differences principle states that "design effects are stronger for low-knowledge learners than for high-knowledge learners, and for high-spatial learners than for low-spatial learners" (Mayer, 2001, p. 161). Similarly, in multimedia SLA learning, verbal and spatial abilities, visualizer and verbalizer learning preferences and background knowledge, for instance, have been found to be particularly influential (Chun & Plass, 1997). Chun and Plass (1996a) point out that:

Ongoing investigation of cognitive styles and abilities addresses the possibility that there is no one mode or medium that is helpful to all learners, but that different types of learners look up different types of annotations and learn more successfully with them. In other words, different types of learners might learn better with the type of information best suited to their cognitive style. For example, pictures may not be more useful than definitions for all learners but might help visualizers, whereas definitions might improve the learning of verbalizers. Other dimensions of individual differences that should be taken into consideration include general, visual, and verbal abilities. (p. 195)

Some empirical studies also provide evidence of the existence of these IDs in multimedia vocabulary learning environments. For example, L. Jones (2009) analyzed the extent to which the effectiveness of multimedia annotations for vocabulary acquisition varied for learners of different verbal and spatial abilities. For listening to a passage, second-semester university learners of French (N = 171) were split into four groups which differed in the annotations provided for 27 target items in the text (pictorial annotations only, L1 translation only, both, control). Based on immediate and delayed vocabulary posttests, the findings showed that while pictorial annotations can be effective, "when comparing ability groups, lower ability students did not find images to be as helpful as written annotations" (p. 286). In light of her findings, L. Jones recommends that multimedia learning environments provide flexibility and choice to accommodate learners' differing cognitive abilities by allowing them to access the information that is most helpful for them. Similarly, Plass et al. (1998) investigated differences in incidental vocabulary acquisition for visual and verbal learners in a multimedia CALL environment for L2 reading. The researchers classified their learners as visualizers, verbalizers or control subjects based on their learning preferences as manifested in their use of the program. On the vocabulary posttest, learners were also asked to indicate for each L2 word whether it primarily reminded them of hearing a translation (verbal information) or of seeing an image or video (visual information). The researchers found that learners who were classified as verbalizers were more likely to produce a correct translation for items that reminded them of verbal as opposed to visual information whereas visualizers showed the opposite pattern. 
These results suggest that visualizers are more effective in using visual cues for remembering vocabulary information, whereas verbalizers are more effective in using verbal cues. An instructional implication is that multimedia environments that allow for both visual and verbal modes of elaborating on words may be effective because learners can choose the mode of annotation that best suits their learning preference. (p. 32)


However, based on results of a later study, Plass et al. (2003) conclude that rather than displaying all available information by default, multimedia environments should be tailored to learners by taking their IDs into account. In this study, the participants (N = 152) completed tests of verbal and spatial ability and read a text with 35 target words under one of four treatment conditions (no annotations, verbal only, visual only, both). The subjects' performance on a vocabulary posttest demonstrated "that multiple representations of information do not always help learning. Indeed, they may hinder learning in low-ability students when they experience high cognitive load as it is imposed by the requirement to process visual information" (p. 240). Individualized instruction, that is, instruction tailored to individual learners, is the topic of the next section.

2.6 Individualized Instruction

Learners learn differently and thus it can be assumed that they benefit from individualized instruction rather than from a one-size-fits-all approach (Heift & Schulze, 2007). Individualized or adaptive instruction means adapting instruction (e.g., in terms of the content, presentation, instructional media, instructional materials, or the pace of learning) to suit the characteristics of the individual learner (e.g., his or her abilities, interests, cognitive traits or prior knowledge base). In educational contexts, the main purposes of individualized instruction are corrective (giving individualized feedback), elaborative (extending the student's knowledge), strategic (guiding the approach to teaching), diagnostic (assessing the student's knowledge state), predictive (anticipating student behaviour), and evaluative (assessing the student's achievement) (Elsom-Cook, 1993). CALL programs, including multimedia language learning environments, are ideally suited to adapt instruction to the individual learner (Chun & Plass, 1997; Heift, 2007; Heift & Schulze, 2007; Plass et al., 1998). In fact, "individualized language instruction has long been recognized as a significant advantage of CALL over workbook tasks" (Heift & Schulze, 2007, p. 171). Given the impact of IDs in multimedia learning, Chun and Plass (1997) maintain that "instructional materials should be designed as adaptive systems to support learners with different traits, such as learning preferences and cognitive styles, so that different learners can receive the type of information in the mode they need or prefer" (p. 73) (see also Chun, 2006).

In a computer environment, three stages generally comprise the adaptation process: collecting data about a program user, processing the data to construct or update a so-called user or student model, and providing adaptation based on this model (Brusilovsky, 1996). According to O'Shea and Self (1983), a student model is

any information which a teaching program has which is specific to the particular student being taught. The reason for maintaining such information is to help the program to decide on appropriate teaching actions. The information itself could range from a simple count of how many incorrect answers have been given, to some complicated data structure which purports to represent a relevant part of the student's knowledge of the subject. (p. 143)

However, student models "will always be approximations because of the noise present when inferring and maintaining the model" (Heift & Schulze, 2007, p. 174). For example, noise might be due to the non-monotonic nature of learning or the indeterminacy of student answers.

There are static and dynamic approaches to individualization of the learning process. In static adaptation, instruction is individualized based on pre-collected information about a learner (e.g., age, L1, proficiency level). This approach conceptualizes learner characteristics as static, that is, unchanging. After a one-time preassessment of a learner's individualization needs, instruction is not further adapted in subsequent learner interactions with a program. 
In contrast, in dynamic adaptation, information about each learner is collected and frequently updated throughout the learner's interaction with the system to continually adapt to the changes in learner performance. This approach addresses the dynamic nature of learning and the potentially changing needs of learners over time, even when interacting with the same learning environment. For example, a dynamically adaptive program might reassess a learner's L2

proficiency every time the learner interacts with the system and continually adjust its corrective feedback as the learner becomes increasingly proficient in the L2 (Heift & Nicholson, 2001). Depending on the purpose of the adaptation, programs can adapt to, for instance, user goals, subject knowledge, background, experience, preferences, interests or traits (e.g., personality, learning styles), or to environment factors (e.g., user location) (Brusilovsky, 1996, 2001). However, it is not feasible to design a specific instructional method for each individual learner. Instead, it is possible to identify different types of learners, so-called learner personas, to which instruction may be adapted. Within CALL, learner personas are "archetypal users of a learning tool that represent the needs of larger groups of users in terms of their goals and personal characteristics" (Heift, 2007, p. 4). Thus, personas are fictitious users that are constructed to individualize language instruction to a group of learners that share certain properties. Learning environments can either be designed with personas that are based on knowledge about real users (e.g., from performance data, student interviews) or with pre-defined personas in mind that are then "tested and revised to confirm or reject the designer's assumptions about the persona" (p. 4). A study by Laufer and Hill (2000) provides an example of different learner personas obtained by investigating performance data in a multimedia CAVL program. Students from Israel and Hong Kong read a text containing target words with five annotations: audio pronunciation, L2 definition, L1 translation, extra information, and word root. The authors classified the individual students into different learner personas based on their look-up preferences (i.e., the annotations they predominantly selected) into L1 type, L2 type, L1 / L2 type, and L1 / L2 + other type. 
Although all look-up personas appeared in both learner groups, most of the Israeli learners were of the L1 type whereas most Hong Kong learners were L1 / L2 + other.
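
The three-stage adaptation process outlined in this section (collecting data about a user, updating a student model, and adapting instruction on the basis of that model) can be illustrated with a minimal Python sketch. The model used here is the simplest kind mentioned by O'Shea and Self (1983), namely a count of incorrect answers. The class and function names, the 0.5 threshold, and the two feedback labels are hypothetical and are not taken from any of the systems cited in this chapter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudentModel:
    """Hypothetical student model: running counts of a learner's answers."""
    correct: int = 0
    incorrect: int = 0

    def update(self, was_correct: bool) -> None:
        # Stage 2: process the collected datum to update the model.
        if was_correct:
            self.correct += 1
        else:
            self.incorrect += 1

    def error_rate(self) -> float:
        total = self.correct + self.incorrect
        return self.incorrect / total if total else 0.0


def choose_feedback(model: StudentModel, threshold: float = 0.5) -> str:
    # Stage 3: adapt instruction based on the current state of the model.
    # The threshold and the feedback labels are illustrative assumptions.
    return "detailed" if model.error_rate() >= threshold else "brief"


def run_session(answers: List[bool]) -> List[str]:
    """Dynamic adaptation: the model is updated after every answer."""
    model = StudentModel()
    feedback = []
    for was_correct in answers:  # Stage 1: collect data about the learner.
        model.update(was_correct)
        feedback.append(choose_feedback(model))
    return feedback
```

Because the model is updated after every answer, the sketch corresponds to dynamic adaptation. A static variant would call `choose_feedback` once on pre-collected data and keep that decision for the remainder of the learner's interaction with the program.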

2.6.1 Individualization in CAVL

A number of individualized language learning programs exist in diverse areas within CALL (see Heift & Schulze, 2007, for an overview). Regarding vocabulary learning in particular, four areas can be identified in which prior research has been conducted: 1) adaptive testing of vocabulary knowledge, 2) adaptive determination of the review cycle for vocabulary learning, 3) adaptive selection of L2 target words, and 4) adaptive selection of resources provided for L2 target words.

As part of the broader field of computer-adaptive testing (CAT, see, e.g., Carr, 2006), adaptive vocabulary testing is an area of adaptive CAVL that has received a fair amount of research attention (e.g., Laufer & Goldstein, 2004; Vispoel, 1998; W.-s. Wu, 2005). In adaptive vocabulary assessment, test items are adapted to each test taker's ability level. For example, if a learner demonstrates sufficient knowledge of high-frequency words, the test might skip further high-frequency words and instead test the learner on lower-frequency items. Adaptive vocabulary testing that is performed dynamically (e.g., W.-s. Wu, 2005) frequently relies on item response theory, which is "a statistical approach that assumes some ability underlies examinees' performances on tests and that the higher their ability level is, the more likely they will be to answer a particular item correctly" (Carr, 2006, p. 292).

Another popular kind of individualization in CAVL is to dynamically adapt the learning cycle for reviewing vocabulary, that is, the number or timing of re-presentations of vocabulary items (e.g., Atkinson, 1972; Chen & Chung, 2008; Hubbard, Coady, Graney, Mokhtari, & Magoto, 1986; Joseph, Lewis, & Joseph, 2004; Joseph et al., 2009; Nakata, 2008; Stockwell, 2007). Godwin-Jones (2010) provides an overview of spaced repetition software available online. The adaptive spacing of vocabulary re-presentations might involve a computerized system similar to a flashcard box with several sections (see Schuetze & Weimer-Stuckmann, 2010). 
Words that a learner cannot recall drop back to the previous section while words that are remembered advance to the following section and are not presented again once they are correctly recalled in the final section. Aside from learner performance on individual test items, the dynamic adaptation can also take into account the learner's learning ability and the level of difficulty of a given word (Chen & Chung, 2008). Adaptively selecting target vocabulary is another area of individualized instruction in CAVL. A dynamically adaptive program developed by Chen and Chung (2008), for

instance, recommends appropriate vocabulary words for individuals based on their vocabulary learning ability as evident from their interaction with the CAVL program. Employing a static approach to adaptation, Crozer (1996) describes an adaptive CAVL program that presents a student with target words that are selected based on the student's knowledge as indicated on a pretest. The adaptation in the program ensures that learners do not study words that they already know. Another static approach is exemplified by the REAP Tutor (Heilman & Eskenazi, 2006a, 2006b). Based on a pretest in which unknown target vocabulary is determined for each learner, the REAP Tutor creates tailored reading passages for students that contain unknown target words in authentic contexts. Not only the target vocabulary can be selected adaptively, but also the information provided for each word. For example, ELDIT, a web-based German-Italian learner dictionary, plans to adapt to the learners' proficiency level and preferences by providing different amounts of detail about a word. For instance, beginners will receive more basic word information while advanced learners will obtain more in-depth details (Brusilovsky, Knapp, & Gamper, 2006; Gamper & Knapp, 2001). Furthermore, Kaya (2006) conducted a study where the information about L2 words presented to learners is statically adapted to their learning style as assessed on a pre-treatment questionnaire. To the author's knowledge, this is the only study that investigates the effect of individualized instruction in a multimedia CAVL environment. Kaya designed two versions of a CAVL program to study 12 academic words, each addressing a specific learning style. The VIDS version targeted learners with a preference for visual learning, individual work, detailed, systematic and logical information. The AGEC version was designed for learners with a preference for auditory learning, guessing, essential and broad elements and collaborative work. 
Half of the 145 Japanese learners of English used the program version that matched their learning style (e.g., an AGEC learner received the AGEC program) while the other half received instruction that was not suited to their learning style (e.g., an AGEC learner received the VIDS program). Although the results showed no significant differences on vocabulary achievement as measured by an immediate and a delayed posttest between the matched and non-matched groups in both versions of the CAVL program, it is likely that limitations in the study's design concealed the potential effectiveness of learning-style-adapted vocabulary instruction (see Kaya, 2006). For example, the lack of a significant difference between matched and non-matched instruction might be partially due to the fact that self-report questionnaire data rather than performance data were used to assess IDs. Given the limited research on individualized multimedia vocabulary instruction, coupled with the findings that underscore the impact of IDs in L2 CAVL, more research on adaptive multimedia CAVL is needed to examine if and how individualized instruction may be able to benefit individual learners. The following section presents a summary of the findings of previous research on L2 vocabulary learning as discussed in this chapter.
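
The flashcard-box review cycle described earlier in this section (section 2.6.1) can be sketched in a few lines of Python. The sketch covers only the rule by which words advance or drop back between sections; it is not the implementation of any of the cited programs, and the number of sections, the function name, and the data layout are assumptions made for illustration.

```python
from typing import Dict

NUM_SECTIONS = 3  # illustrative; the number of sections is an assumption


def review(sections: Dict[str, int], word: str, recalled: bool) -> None:
    """Move `word` between box sections after a single review.

    `sections` maps each word still under study to its current
    section, numbered from 1 to NUM_SECTIONS.
    """
    current = sections[word]
    if recalled:
        if current == NUM_SECTIONS:
            # Correctly recalled in the final section: not presented again.
            del sections[word]
        else:
            # Remembered: advance to the following section.
            sections[word] = current + 1
    else:
        # Not recalled: drop back to the previous section (never below 1).
        sections[word] = max(1, current - 1)
```

A program built on this rule would additionally need a schedule that determines how often each section is reviewed; that part is left out here because the sources discussed above do not specify it.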

2.7 Summary of Previous Research

Numerous previous studies on L2 vocabulary annotations have provided evidence of the effectiveness of pictures for learning L2 words. Vocabulary is generally learned better when translations or definitions of a word are presented alongside pictorial annotations than when pictorial annotations are not included. This empirical finding is usually linked to Mayer's (2001, 2005a) CTML and Paivio's (1971, 1986) dual coding theory. Based on the dual-channel assumption, both theories assume that information coded as both pictures and words in the human mind is learned and recalled more effectively than information that is only coded in one processing channel. There is a lack of evidence regarding the effectiveness of other vocabulary annotations such as audio pronunciations and L1 glosses. In terms of learning different types of words, research suggests that abstract words are generally harder to learn than concrete words. This holds true for both the L1 and the L2 context. Moreover, previous research has found that IDs have an impact on L2 multimedia CAVL. For example, high-spatial-ability learners appear to benefit more from pictorial annotations than low-spatial-ability learners whereas high-verbal-ability learners can make use of verbal annotations more effectively than low-verbal-ability learners (L. Jones, 2009; Plass et al., 1998). Studies on IDs in CAVL indicate that individualized instruction catering to individual learners can enhance L2 vocabulary learning. However, several questions pertaining to L2 vocabulary learning in multimedia CAVL environments need further exploration. The following section presents these research gaps.

2.8 Research Gaps

This dissertation contributes to three areas of research requiring further investigation in CAVL, which are discussed in turn in the following three sections: 1) vocabulary annotations, 2) effects of word concreteness, and 3) IDs, individualized instruction, and presentation sequence.

2.8.1 Vocabulary Annotations

Studies have generally found translations, definitions and pictures to be beneficial annotations for L2 vocabulary learning. However, additional vocabulary annotations such as audio annotations have hardly been explored. Audio annotations might be valuable in CAVL because, as reviewed in section 2.3.2, auditory information strengthens form-meaning links and is preferred by some learners. Moreover, a learner's phonological memory influences vocabulary learning, and some multimedia presentations are more effective when spoken rather than written language is included (modality principle). Some studies indeed suggest that audio annotations are effective learning aids in CAVL (e.g., Laufer & Hill, 2000) but others cannot confirm this finding (e.g., Chun & Plass, 1996a). Accordingly, research on this annotation is warranted (Chun & Plass, 1996a, 1997). Furthermore, sentence-level L1 glosses of L2 example sentences have hardly been examined. Within CAVL, a couple of studies indicate that L1 glosses promote the retention of word meaning (Grace, 1998a, 1998b) and research on subtitles in videos and parallel concordances suggests a potential positive effect of L1 glosses on vocabulary learning because they might allow the inductive creation of meaning in context.


However, more research is required to further elucidate the role of L1 glosses in L2 vocabulary learning. In addition to the lack of research on certain annotation types in isolation (e.g., audio, L1 gloss), the effectiveness of combinations of annotation types has yet to be examined more thoroughly. Previous studies have generally compared two annotation types in isolation (e.g., a picture or a translation) with both annotation types together (e.g., picture and translation) and / or with no annotation types at all (e.g., Akbulut, 2007; Al-Seghayer, 2001; Chun & Plass, 1996a; L. Jones & Plass, 2002; Plass et al., 1998, 2003; Yoshii, 2006; Yoshii & Flaitz, 2002). These studies generally indicate that giving learners access to multiple annotations for the meaning of a word (e.g., pictures and definitions) is superior to presenting only one type of annotation (see section 2.3). However, previous studies have generally not intentionally considered the interplay of form, meaning, and use aspects of word knowledge in the provision of L2 vocabulary annotations. For example, pictorial annotations for the meaning of a word may be differentially effective depending on whether they are presented alongside audio pronunciations as form information or L1 glosses as use information. Research is needed to examine which combinations of annotation types for form, meaning, and use are particularly effective for vocabulary learning.

2.8.2 Effects of Word Concreteness

Another aspect that needs further exploration is the interaction of word concreteness and annotation effectiveness in CAVL. With respect to pictorial annotations, for instance, Chun and Plass (1996a) propose that future studies investigate "how or whether abstract ideas … can be visualized or represented by nontextual annotations" (p. 195). Chapelle (2003) suggests exploring the use of images for abstract words even if more creativity or interpretation might be required to connect the words and images in this case (see also Al-Seghayer, 2001; Chun & Plass, 1997). More recently, Xu (2010) still maintains that "future research needs to compare the effectiveness of MVA [multimedia vocabulary annotations] for concrete versus abstract vocabulary and to look for the best way to visualize abstract vocabulary" (p. 323) (see also Yoshii, 2006).

Abstract and concrete words differ in their imageability and accordingly, one might expect pictorial annotations to be more effective for concrete, that is, highly imageable, words. Other annotations or combinations of annotations may also be differentially effective for the two word types. At the same time, vocabulary annotations might be able to eliminate the concreteness effect such that abstract words are learned as easily as concrete words given an optimal combination of annotation types.

2.8.3 IDs, Individualized Instruction, and Presentation Sequence

Finally, the effectiveness of vocabulary annotations might also be influenced by IDs. Many studies point towards the benefits of multimedia vocabulary instruction but also note that the same type of instruction is not equally effective for every learner. The amount of evidence on individual learner variation in SLA in general and in CAVL in particular (e.g., Chun & Payne, 2004; L. Jones, 2009; Laufer & Hill, 2000; Plass et al., 1998, 2003) strongly suggests a need for individualized multimedia vocabulary instruction, which is part of the research agenda for vocabulary learning (Chun & Plass, 1997) but has rarely been investigated to date (see section 2.6.1). Within CAVL, different combinations of vocabulary annotations can cater to different types of learners. For example, some learners might learn more effectively with annotations that include pictures and audio pronunciations whereas others might perform better with annotations containing definitions and L1 glosses. In order to provide and investigate individualized vocabulary instruction, it needs to be assessed if and in what ways vocabulary annotations differ in their effectiveness for different learners in a given CAVL environment.

Aside from the potential impact of IDs in CAVL, there are also variables made possible by the computer environment itself that might have an effect on vocabulary learning. However, these variables have received little research attention in CAVL environments. For example, in addition to the potential effect of computer platform (e.g., mobile phone vs. desktop, see Stockwell, 2010) or issues of interface design⁶, the presentation sequence of annotations in a computer-based learning environment might affect L2 vocabulary learning. A fixed presentation sequence of annotations means that learners receive the same annotation for each target word whereas an alternating presentation sequence implies that some target words are displayed in one annotation and others are shown in another annotation. Research suggests that the monotony of a fixed presentation sequence might be less effective for or even detrimental to L2 vocabulary learning compared to the variation provided by an alternating presentation sequence, which possibly helps learners continually refocus their attention on the learning task (Bear, Connors, & Paradiso, 2007; Shaffer & Kipp, 2009). The noticing hypothesis (Schmidt, 1990) could also be applied here in that learners might notice the annotations more if they alternate than if they are presented in a fixed sequence. However, empirical research is needed to test this potential effect in a CAVL environment.

In examining a fixed versus an alternating presentation sequence as one aspect of a computer-based environment, however, the question arises as to which annotation(s) to present to each learner. The annotation itself might have an impact on the effectiveness of the presentation sequence given that research indicates that there are differences in the effectiveness of annotations for different learners. Therefore, a learning environment that provides individualized instruction by considering the annotations that are most effective for each learner appears to be an ideal choice to investigate presentation sequence. 
Not only does previous research call for individualized instruction in CAVL but more importantly, by giving learners only annotations that are ideally suited to them, the annotation chosen for each learner is eliminated as a potential confounding variable when assessing the effect of presentation sequence.

6 For instance, studies have investigated the effect of highlighting words on incidental L2 vocabulary learning (e.g., De Ridder, 2002). Effects of highlighting in CALL have also been explored in relation to learner feedback (e.g., Heift, 2004) under the theoretical framework of the noticing hypothesis (Schmidt, 1990).


2.9 Research Questions of This Dissertation

Based on the research gaps identified in section 2.8, this dissertation presents two experiments conducted with Voka, a web-based multimedia CAVL program for L2 German designed by the author of this dissertation. In Voka, abstract and concrete L2 words are studied in one of five annotation clusters (i.e., combinations of annotation types): 1) PG (picture + gloss), 2) DG (definition + gloss), 3) PA (picture + audio), 4) DA (definition + audio), or 5) PAGD (picture + audio + gloss + definition). Voka consists of two parts, which correspond to the two experiments conducted for this dissertation. Experiment 1 (conducted with part 1 of Voka) investigates annotations, word concreteness and variation among learners in CAVL by exposing each study participant to every annotation cluster. Using the results of experiment 1, experiment 2 (conducted with part 2 of Voka) provides an individualized environment in which the effectiveness of a fixed versus an alternating presentation sequence of annotation clusters for L2 vocabulary learning is examined. The following sections list the research questions (RQs) that are addressed in the two experiments.

2.9.1 Research Questions of Experiment 1 (Part 1 of Voka)

Topic 1: Vocabulary annotations

RQ 1.1. Main effect of annotation cluster
Does annotation cluster (i.e., PG, DG, PA, DA, PAGD) have an effect on vocabulary learning?

RQ 1.2. Main effect of annotation type presence
Is there a difference in the effectiveness for vocabulary learning between annotation clusters (i.e., PG, DG, PA, DA, PAGD) that contain a given annotation type (i.e., picture, definition, gloss, or audio) and annotation clusters that do not contain that annotation type?

Topic 2: Word concreteness

RQ 2. Main effect of word type
Does word type (i.e., abstract, concrete) have an effect on vocabulary learning?

Topic 3: Annotations and word concreteness

RQ 3.1. Interaction effect of word type X annotation cluster
Is there an interaction between word type (i.e., abstract, concrete) and annotation cluster (i.e., PG, DG, PA, DA, PAGD) in the effectiveness for vocabulary learning? An interaction can be present in two ways and thus RQ 3.1 is divided into two subquestions:
RQ 3.1.a. Is the effect of word type on vocabulary learning different for the five annotation clusters?
RQ 3.1.b. Is the effect of annotation cluster on vocabulary learning different for the two word types?

RQ 3.2. Main effect of annotation type presence in the two word types
RQ 3.2.1. Is there a difference in the effectiveness for learning abstract words between annotation clusters (i.e., PG, DG, PA, DA, PAGD) that contain a given annotation type (i.e., picture, definition, gloss, or audio) and annotation clusters that do not contain that annotation type?
RQ 3.2.2. Is there a difference in the effectiveness for learning concrete words between annotation clusters (i.e., PG, DG, PA, DA, PAGD) that contain a given annotation type (i.e., picture, definition, gloss, or audio) and annotation clusters that do not contain that annotation type?

Topic 4: Annotations and individual learners

RQ 4. Interaction effect of annotation cluster X learner
Does the effectiveness of the annotation clusters (i.e., PG, DG, PA, DA, PAGD) for vocabulary learning vary across learners?
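
The comparison underlying RQ 1.2 and RQ 3.2 (clusters that contain a given annotation type versus clusters that do not) follows directly from the cluster definitions given in section 2.9. The Python sketch below merely makes that grouping explicit. The variable and function names are hypothetical, and the sketch says nothing about how the resulting groups are compared statistically.

```python
from typing import Dict, List, Set, Tuple

# Annotation clusters as defined in section 2.9.
CLUSTERS: Dict[str, Set[str]] = {
    "PG": {"picture", "gloss"},
    "DG": {"definition", "gloss"},
    "PA": {"picture", "audio"},
    "DA": {"definition", "audio"},
    "PAGD": {"picture", "audio", "gloss", "definition"},
}


def split_by_presence(annotation_type: str) -> Tuple[List[str], List[str]]:
    """Return (clusters containing the type, clusters not containing it)."""
    with_type = [name for name, types in CLUSTERS.items()
                 if annotation_type in types]
    without_type = [name for name, types in CLUSTERS.items()
                    if annotation_type not in types]
    return with_type, without_type
```

For the annotation type picture, for example, the grouping yields PG, PA and PAGD on one side and DG and DA on the other. Since every annotation type occurs in PAGD and in exactly two of the four two-type clusters, each comparison contrasts three clusters with two.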


2.9.2 Research Question of Experiment 2 (Part 2 of Voka)

Topic 5: Presentation sequence in individualized instruction

RQ 5. Main effect of presentation sequence
Does presentation sequence (i.e., fixed, alternating) have an effect on vocabulary learning?


3 DESIGN OF VOKA

Voka (Rimrott, 2009) is an online CAVL program for beginner learners of L2 German designed by the author of this dissertation. It provides intentional vocabulary learning for the two experiments of this dissertation. Voka consists of two parts. Part 1 is used to answer the research questions of experiment 1 while part 2 is designed to address the research question of experiment 2 (see section 2.9).

The following sections discuss Voka's design. Section 3.1 reviews arguments for intentional vocabulary instruction. Section 3.2 describes the selection criteria for the L2 test items while section 3.3 explains the aspects of word knowledge covered in Voka. Word information that is provided for all words (default information) is described in section 3.4. The vocabulary annotation types used in Voka are presented in section 3.5. Section 3.6 introduces the five vocabulary annotation clusters. Section 3.7 portrays Voka's program flow. Section 3.8 describes the data collection in Voka. A rationale for some of Voka's design features is given in section 3.9.

3.1 Intentional Vocabulary Learning

Whereas most studies on vocabulary annotations in multimedia CAVL programs examine incidental vocabulary acquisition (see section 2.3), Voka provides direct vocabulary instruction to add to the comparatively small body of research on multimedia vocabulary annotations in intentional learning environments. Direct vocabulary instruction is an established and efficient method of learning that has been used in a variety of settings throughout the past centuries. Many researchers (e.g., Cobb, 2007; Godwin-Jones, 2010; Laufer, 2003, 2005, 2006; Min, 2008; Nation, 2001; Read, 2000, 2004; Schmitt, 2010; Tozcu & Coady, 2004) maintain that intentional vocabulary study, in addition to incidental vocabulary acquisition, is an important part of L2 vocabulary learning. Cobb (2007) points out that "free reading alone is not sufficient to ‘do the entire job’ of building a functional second lexicon in any typical time frame of L2 learning" (p. 45). Laufer (2006) believes intentional vocabulary learning to be particularly important "in any learning context that cannot recreate the input conditions of first-language acquisition" (p. 162). Read (2000) remarks that

…the trend in the 1990s is for many language teachers to discourage learners from memorising lists of isolated words, on the basis that vocabulary should always be learned in context. Nevertheless, research shows that systematic learning of individual words can provide a good foundation for vocabulary development, especially in foreign-language environments where learners have limited exposure to the language outside of the classroom. Nation (1982) made a careful analysis of various arguments that could be put forward to support the learn-in-context view, as applied to the initial learning of new words, and concluded that it was a statement of belief rather than a principle supported by the research evidence. (p. 41)

Nation (2001) argues that a balanced, well-designed language course should consist of four major strands, each of which should make up roughly 25% of the course: 1) learning from comprehensible meaning-focused input, 2) meaning-focused output, 3) fluency development, and 4) language-focused learning (i.e., form-focused instruction). While instruction can focus solely on one strand at a time, two or more strands can also be combined in one learning activity. According to Nation (2001), direct vocabulary learning is part of the language-focused learning strand and should therefore amount to about a quarter of the time available for vocabulary learning.
In this view, learning vocabulary in context and direct instruction are seen as complementary rather than conflicting ways of learning, with each activity reinforcing and enhancing the learning that arises from the other (see also Laufer, 2005; Laufer, 2006; Schmitt, 2010). By providing direct instruction in an online CAVL program that learners can also access outside the classroom, valuable classroom time can be freed up for the other three strands


of learning (meaning-focused input, meaning-focused output, fluency development), which rely more strongly on the interaction with a language instructor.

In addition to arguments for a balanced approach to vocabulary learning, Schmitt (2008) maintains that there are five reasons to warrant an explicit focus on vocabulary instruction. First, although learners might understand the overall message of L2 input, they often do not attend to the precise meanings of words in meaning-based learning. Second, it is often unreliable to guess words from context. Third, if a word is in fact easily guessed from context, it may not generate enough engagement to be acquired. Fourth, repeated exposure to a word is necessary in order for it to be learned and this is more readily provided through explicit teaching. Fifth, and most importantly, explicit learning is very effective and often results in better retention and productive mastery than incidental learning (see also Laufer, 2006; Nation, 2001).

Based on the arguments in favour of intentional vocabulary learning, and the research questions pursued in this dissertation, Voka investigates aspects of direct vocabulary instruction to shed more light on one of the four strands of a balanced foreign language course in a multimedia CAVL environment. In doing so, it needs to be borne in mind that direct instruction is only one of four equally important instructional strands, and that it is best suited to the teaching of high-frequency words. High-frequency words by definition cover approximately 80% of the running words in any spoken or written text (Nation, 2001).7 Given that learners will encounter and use these words frequently in any L2 context, "the benefits of knowing high-frequency vocabulary compensate for the time and effort required for direct vocabulary instruction" (Nation, 2001, p. 97; see also Read, 2000). In contrast, direct instruction is not recommended for low-frequency words

7 Typically, high-frequency words, specialized vocabulary (academic words, technical words) and low-frequency words are distinguished (Nation, 2001). High-frequency words are generally thought to consist of the 2000 most frequent word families. They contain both function (e.g., the, to, one) and content words (e.g., say, student, simple). Specialized words include both academic words, which are common (about 9% of the running words) in academic texts in different fields (e.g., involve, pursue, approach), and technical words, which cover about 5% of the running words and are closely connected to a text's topical content (e.g., beech, timber, overgrazing in a text about forests). The remainder of all words are low-frequency words (e.g., pastoral, eponymous, perpetuity). They are by far the largest group but cover only a small percentage of the words in a text (roughly 5% of the words in an academic text).


because they are too numerous and do not occur frequently enough to warrant individual attention. Moreover, direct vocabulary instruction can deal effectively with some although not all aspects of word knowledge (see Table 2 in section 2.1.2). For instance, direct CAVL instruction can effectively teach the spoken and the written form of the word, the concept of the word, the form-meaning connection, its grammatical functions (particularly the part of speech) and some collocations (e.g., if an example sentence is provided). However, direct instruction cannot easily convey other aspects of word knowledge such as usage constraints, the whole range of possible collocations, or the diversity of referents, which rely on quantity of experience, implicit knowledge and contextualized language use. Nation (2001) recommends learning these aspects of word knowledge incidentally in context. Finally, "a word is not fully learned through one meeting with it, even if this meeting involves substantial deliberate teaching" (Nation, 2001, p. 81). Accordingly, Voka provides information on some form, meaning and use aspects of high-frequency L2 words through direct vocabulary instruction as a step forward in the cumulative process of learning these words. In addition to deciding on the intentional vocabulary learning environment, the test items provided by Voka had to be carefully selected. This is discussed in the following section.

3.2 Word Selection

Voka teaches 58 German words, 15 concrete and 15 abstract words in part 1 (experiment 1) and 14 concrete and 14 abstract words in part 2 (experiment 2) (see Appendix A for a list of the test items in Voka). In addition, 10 abstract and 10 concrete words are included in Voka as spare words that are automatically substituted by the computer program in case a student already knows a target word on the vocabulary pretest (see section 3.7). Taking the main words and the spare words together, there are 78 German words in total, 39 abstract and 39 concrete.

The test items in Voka are restricted to nouns because several studies suggest that part of speech affects ease of learning (e.g., Ellis & Beaton, 1993a; M. O. James, 1996). To minimize the probability that the test items are known to the study participants, none of the nouns chosen are part of the active or passive vocabulary of the textbook chapters covered by the participants in their beginner German course, derivationally related to active textbook vocabulary, or frequently used in classroom instruction (e.g., Antwort (answer)). The selection of the test items can be further characterized according to the following additional parameters: 1) imageability, 2) frequency, and 3) linguistic features.

3.2.1 Imageability

The test items in Voka are chosen from two imageability ranges, low and high imageability. The imageability ratings are taken from the Berlin Affective Word List data (Vö et al., 2006). The Berlin Affective Word List is a list of 1867 nouns and 341 verbs rated by 40 German speakers for their imageability on a scale from 1 (low imageability) to 7 (high imageability). The abstract words in Voka stem from the pool of nouns with a mean imageability between 1.0 and 3.0. The concrete words are taken from the 5.0 to 7.0 range. The Berlin Affective Word List contains imageability but not word concreteness ratings and accordingly, the abstract test items in Voka comprise words with a low imageability while the concrete test items are composed of highly imageable words (see section 2.4).
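The two imageability ranges can be expressed as a simple classification rule. The following sketch is an illustration only: the function name word_type, the inclusive treatment of the range boundaries, and the placeholder nouns with their ratings are assumptions and are not taken from the Berlin Affective Word List.

```python
# Imageability ranges used for the two word types (scale from 1 to 7).
ABSTRACT_RANGE = (1.0, 3.0)
CONCRETE_RANGE = (5.0, 7.0)


def word_type(mean_imageability):
    """Classify a noun by its mean imageability rating.

    Returns "abstract", "concrete", or None for ratings between the ranges.
    Treating the range boundaries as inclusive is an assumption.
    """
    if ABSTRACT_RANGE[0] <= mean_imageability <= ABSTRACT_RANGE[1]:
        return "abstract"
    if CONCRETE_RANGE[0] <= mean_imageability <= CONCRETE_RANGE[1]:
        return "concrete"
    return None


# Invented placeholder ratings, not values from the actual word list.
candidates = {"noun_a": 2.4, "noun_b": 4.1, "noun_c": 6.3}

pools = {"abstract": [], "concrete": []}
for noun, rating in candidates.items():
    kind = word_type(rating)
    if kind is not None:  # nouns of medium imageability are left out
        pools[kind].append(noun)

print(pools)
```

Nouns with a medium imageability rating fall into neither pool, which keeps the two word types clearly separated.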

3.2.2 Frequency

The test items in Voka are all highly frequent German nouns because direct vocabulary instruction is best suited to high-frequency words (see section 3.1). Of the 78 words, 76 are among the 2000 most frequent German words based on R. L. Jones and Tschirner's frequency dictionary of German (R. L. Jones & Tschirner, 2006) while the remaining 2 words (Befehl (command) and Verbot (ban)) occupy positions 3066 and 3515, respectively. 8 Because R. L. Jones and Tschirner's list is not restricted to nouns, these positions suggest that the two words are nonetheless frequently occurring nouns in German. The 58 main target words have a frequency ranging from 20 words per million for Verbot (ban) to 381 words per million for Grund (reason), with a mean frequency of 99.98 occurrences per million words (SD: 80.57). 9

8 Despite their lower frequency in R. L. Jones and Tschirner's (2006) list, as relatively short abstract nouns, Befehl (command) and Verbot (ban) are included in Voka to make the mean word length of the abstract nouns in Voka more comparable to the mean word length of the concrete nouns, which are generally shorter (see Table 3).

9 Including the 20 spare words, the 78 test items in Voka have a range of 20 to 1070 occurrences per million words with a mean frequency of 112.31 (SD: 135.26). Based on another frequency list, the CELEX database (see Vö et al., 2006), the 58 main words have a frequency per million words ranging from 9.17 for Gehirn (brain) to 437.33 for Grund (reason) with a mean frequency of 99.36 (SD: 80.71). According to CELEX, all 78 test items have a frequency range of 9.17 to 437.33 and a mean frequency of 113.86 (SD: 92.15).

3.2.3 Linguistic Features

To control learning burden, the choice of the test items is characterized by a consideration of several linguistic features. English-German cognates are not included because cognates are generally easier to learn than non-cognates (Nation, 2001; Tonzar et al., 2009). Nouns containing special German orthographic characters (ä, ö, ü and ß) are excluded because they might be more difficult to learn than nouns without unfamiliar letters (Ellis & Beaton, 1993b; Laufer, 1990, 1997). The presence of orthographically similar words is avoided because it also affects learnability (Nation, 2001). For instance, Mauer (wall) is excluded because Bauer (farmer) is already a test item in Voka. Semantically similar words are excluded as well because learning lexical sets, synonyms, opposites or free associates together also leads to interference (Erten & Tekin, 2008; Nation, 2001). For instance, Bereich (area) is part of Voka and therefore, Gebiet (area, region, district) is not included. Furthermore, according to Laufer (1990, 1997), additional intralingual factors such as grammatical characteristics (part of speech, inflexional complexity, derivational complexity), semantic features (e.g., specificity, idiomaticity), register restrictions and multiple meanings have also been shown or are presumed to affect the learning burden of a word. These factors thus also constrain the selection and presentation of the test items in Voka. For example, the test items in Voka are generally not compounds and although the inflexional complexity of the words might differ (e.g., plural formation), students are not tested on the morphology of the nouns. All words are general terms presented in a non-idiomatic context. The words do not have register restrictions and in case of multiple meanings, only the most common meaning of the noun is given in Voka.

With other intralingual learnability factors thus controlled, the final test items for Voka are selected with an eye to balancing phonographic complexity for concrete versus abstract nouns. Phonographic complexity is an additional factor that affects the learning burden of a word (Laufer, 1990, 1997). For example, research indicates that longer words may be more difficult to learn than shorter ones (Ellis & Beaton, 1993b; Laufer, 1990, 1997). Phonographic complexity is considered by taking into account the number of letters, consonants, vowels, phonemes, and syllables and the consonant / vowel ratio of the abstract and the concrete test items in Voka, as described in detail in the following section.
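The orthographic part of these complexity measures can be illustrated with a short sketch. It is a simplification under stated assumptions: only the letters a, e, i, o and u are counted as vowels (nouns with ä, ö and ü are excluded from the test items), the function name phonographic_profile is hypothetical, and phoneme and syllable counts are not computed because they require pronunciation information.

```python
# Simplifying assumption: only these five letters count as vowels.
VOWELS = set("aeiou")


def phonographic_profile(word):
    """Count letters, consonants and vowels and compute the C/V ratio."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    consonants = len(letters) - vowels
    # The ratio is undefined for a word without vowel letters.
    cv_ratio = consonants / vowels if vowels else None
    return {
        "letters": len(letters),
        "consonants": consonants,
        "vowels": vowels,
        "cv_ratio": cv_ratio,
    }


for noun in ["Herbst", "Grund"]:
    print(noun, phonographic_profile(noun))
```

For Herbst, for example, this yields six letters, five consonants, one vowel and a consonant / vowel ratio of 5.0.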

3.2.4 Characteristics of the Test Items in Voka

In summary, for the 58 test items in Voka, Table 3 presents the mean numbers, standard deviations and ranges for imageability, frequency, and the phonographic complexity measures. 10 For example, the 15 abstract nouns in part 1 are between five and eight letters long and have a mean word length of 6.1 letters (see Appendix A for all L2 target words and Table 13 for the 30 target words of part 1). Focusing on the words in part 1 first, in terms of their means, standard deviations, and ranges for phonographic complexity and frequency measures, the abstract and concrete words are similar, often even identical. For example, they have the same mean number of letters (6.1), consonants (4.1), vowels (1.9), phonemes (5.1) and syllables (1.8). As stated above, other word characteristics that potentially affect learning burden (e.g., cognate status) are also controlled. This leaves imageability as the only identifiable variable considered here that differs markedly between the two groups, thus enabling the assessment of imageability as a factor influencing word learnability and annotation effectiveness. 11

10 Note that the spare words are not characterized in Table 3. However, they are matched as closely as possible to their corresponding main words in terms of the characteristics discussed here.

Table 3: Characteristics of the test items in Voka

Words           Statistic  Imageab.  Freq.    Lett.  Cons.  Vow.  C/V ratio  Phon.  Syll.
Part 1 of Voka
Abstract (15)   Mean       2.5       103      6.1    4.1    1.9   2.4        5.1    1.8
                SD         0.5       91       0.8    0.8    0.6   1.1        0.9    0.4
                Range      1.7-3.0   23-381   5-8    3-6    1-3   1.0-5.0    3-7    1-2
Concrete (15)   Mean       5.7       81       6.1    4.1    1.9   2.5        5.1    1.8
                SD         0.5       47       1.0    1.1    0.6   1.5        0.7    0.6
                Range      5.0-6.8   42-175   5-8    3-6    1-3   1.0-6.0    4-6    1-3
Both (30)       Mean       4.1       92       6.1    4.1    1.9   2.5        5.1    1.8
                SD         1.7       72       0.9    1.0    0.6   1.3        0.8    0.5
                Range      1.7-6.8   23-381   5-8    3-6    1-3   1.0-6.0    3-7    1-3
Part 2 of Voka
Abstract (14)   Mean       2.5       121      6.4    4.3    2.1   2.2        5.4    1.9
                SD         0.4       107      0.9    1.0    0.5   1.2        0.8    0.3
                Range      1.8-2.9   20-348   5-8    3-6    1-3   1.3-6.0    4-7    1-2
Concrete (14)   Mean       5.9       97       6.1    4.3    1.8   2.8        5.3    1.6
                SD         0.5       69       2.0    1.7    0.8   1.5        1.9    0.6
                Range      5.1-6.8   43-238   4-10   2-8    1-3   0.7-6.0    3-9    1-3
Both (28)       Mean       4.2       109      6.3    4.3    2.0   2.5        5.3    1.8
                SD         1.8       89       1.6    1.4    0.7   1.4        1.4    0.5
                Range      1.8-6.8   20-348   4-10   2-8    1-3   0.7-6.0    3-9    1-3
Both parts
All words (58)  Mean       4.1       100      6.2    4.2    1.9   2.5        5.2    1.8
                SD         1.7       81       1.2    1.2    0.6   1.3        1.1    0.5
                Range      1.7-6.8   20-381   4-10   2-8    1-3   0.7-6.0    3-9    1-3

Note. The numbers in brackets in the first column refer to the number of words in that word group. Imageab. = imageability (taken from Vö et al., 2006). Freq. = frequency per million words (taken from R. L. Jones & Tschirner, 2006). Lett. = (number of) letters, Cons. = consonants, Vow. = vowels, C/V ratio = consonant / vowel ratio, Phon. = phonemes, Syll. = syllables.

Concerning part 2, Table 3 shows that the abstract and concrete words are also fairly comparable in their phonographic makeup, although there is more variability than in part 1. This, however, is unproblematic because experiment 2 does not investigate the influence of imageability on word learning or annotation effectiveness and thus identity or near-identity in phonographic complexity measures between the two groups is not necessary.

11 Note that additional factors that might affect item difficulty (e.g., emotional valence) are not intentionally controlled in this study (see section 9.3.1).


In addition to choosing the words to be included in a CAVL program for research purposes, decisions have to be made regarding the information to be provided about each word. The aspects of word knowledge covered in Voka are discussed in the following section.

3.3 Aspects of Word Knowledge in Voka

In order to learn a word, learners need to gain knowledge of the word's form, meaning and use (Nation, 2001; see section 2.1.2). In Voka, this is achieved by providing information on all three aspects of word knowledge, as summarized in Table 4.

Table 4: Aspects of word knowledge covered by the information in Voka

Aspect of word knowledge  Information                      Purpose in Voka
Form                      Written form                     Default information
                          Audio pronunciation              Annotation type
Meaning                   L1 translation                   Default information
                          L1 definition                    Annotation type
                          Picture                          Annotation type
Use                       L2 example sentence              Default information
                          L1 gloss of L2 example sentence  Annotation type

Voka makes use of a flashcard design consisting of one electronic vocabulary flashcard per target word. To illustrate the look of the form, meaning, and use information in the flashcards, Figure 1 presents a labelled screenshot of the target word Herbst (fall) shown in the flashcard that includes all annotation types. 12 The information on the target words that is part of the default information shown to every student for every word is displayed in the shaded box at the top of the flashcard (i.e., L1 translation, L2 example sentence, written form). The information shown in the white box in the middle of the flashcard is the annotation type information (i.e., L1 gloss, picture, audio pronunciation, L1 definition). This information is only given to learners if they are studying the word with an annotation cluster that includes this annotation type (see section 3.6).

12 All screenshots of Voka shown in this dissertation © Anne Rimrott, 2010. For the copyright of the pictures shown in this dissertation, see Appendix G. For all pictures and screenshots, all rights reserved. The pictures and screenshots are not to be reproduced without the copyright holder's written consent.


Figure 1: Labelled Voka flashcard of target word Herbst (fall)

To convey the form of a word, Voka provides the written form as default information and an audio pronunciation as an annotation type. The audio is chosen as the annotation type to address the need for more research on audio annotations (see section 2.3.2).13 With regard to a word's meaning, Voka provides three sources of information: L1 translations as default information, and L1 definitions and pictures as two meaning annotation types. The translation is chosen as the default meaning information because it is more succinct than the definition or the picture, in particular with respect to abstract words. The two meaning sources selected as annotation types address the lack of findings

13 Phonetic transcriptions of L2 words, another possible form annotation, were not included in Voka because the results of a pilot study conducted for this dissertation suggested that they are not very helpful for vocabulary learning. Note that to inform the design of the main study of fall 2009, two pilot studies were conducted with an earlier version of Voka in fall 2008 (N = 30) and summer 2009 (N = 14). As one of the annotation types, the fall 2008 pilot study contained phonetic transcriptions whereas the main study and the summer 2009 pilot study used sentence-level L1 glosses instead.


on the effectiveness of pictures and definitions for words of differing levels of concreteness (see section 2.8). 14 The three sources are viewed as complementary rather than alternative ways of expressing the meaning of the test items, each source providing an approximation of a word's meaning (see Nation, 2001). According to Nation (2001), "all ways of communicating meaning involve the changing of an idea into some observable form, are indirect, are likely to be misinterpreted, and may not convey the exact underlying concept of the word" (p. 85). For example, L1 translation "is often criticised as being indirect,… and encouraging the idea that there is an exact equivalence between words in the first and second languages. These criticisms are all true but they apply to most other ways of communicating meaning" (Nation, 2001, pp. 85-86), including the use of definitions and pictures. For instance, for a picture annotating the vocabulary item school of fish, a learner needs to decide if the picture refers to "a school of fish, a fish, water, the sea, sea-life or to a similar meaning of the vocabulary word" (Plass et al., 2003, pp. 236-237).

In Voka, both the translation and the definition are in the L1. The integration of the L1 in L2 teaching is supported by many researchers (e.g., Chapelle, 2003; Danan, 1992; Grace, 1998a, 1998b; Laufer & Girsai, 2008; Nation, 2001; Pavičić Takač, 2008; Schmitt, 2010) and apparently enjoyed by language learners (Brooks-Lewis, 2009). For example, learners have been found to prefer L1 translations over L2 definitions (Chun & Payne, 2004; Davis & Lyman-Hager, 1997; Laufer & Hill, 2000). In L2 vocabulary learning, recourse to the L1 allows the creation of two separate verbal memory stores (L1, L2) (Paivio, 1986) (see section 2.3.1), and benefits the psycholinguistic establishment of the initial form-meaning link (Schmitt, 2008). In addition, using L1

14 As another possible meaning source, L2 definitions were not considered for inclusion in Voka because it was impossible to compose precise L2 definitions that the study participants could process given their limited L2 vocabulary knowledge (see also Chun & Plass, 1996a, p. 190). For this reason, the use of synonyms or antonyms as possible meaning sources was not an option either. Finally, videos were not chosen because the three meaning sources selected for Voka were deemed more appropriate given that the test items consisted of nouns only.


translations is efficient (Nation, 2001), especially for beginning learners with no previous L2 knowledge (Chun & Plass, 1996a; Plass & L. Jones, 2005). Finally, the use information in Voka consists of a German example sentence as the default information and an English sentence-level gloss of the German example sentence as a use annotation type. The L1 gloss is selected as the annotation type to address the lack of research on L1 gloss annotations (see section 2.8).15 The following sections describe the default information and the annotation type information used in Voka in detail.

3.4 Default Word Information

3.4.1 Form Information: Written Form

In Voka, the default form information for each word consists of the German noun, which is inflected in the nominative singular with the definite article (der = masculine, die = feminine or das = neuter) and the plural morpheme (see Figure 1). 16 For example, Figure 1 displays the written form der Herbst, -e where der indicates that Herbst (fall) is a masculine noun, and -e signifies that the plural is Herbste.

3.4.2 Meaning Information: L1 Translation

The default meaning information is an L1 translation of the L2 target words. The translation consists of a single word, which expresses the most common meaning of the target noun (e.g., fall as the translation of Herbst in Figure 1).

15 Concordance sentences were not considered as a possible use annotation because of the study participants' limited L2 proficiency.

16 Note that German has three noun genders (masculine, feminine and neuter) that correspond to three articles in the singular nominative case (der, die, and das, respectively). Generally, the gender of a noun cannot be predicted by the noun itself and has to be memorized by L2 learners. It was decided to supply the article for the nouns in both the vocabulary learning treatment and the assessment in Voka (see section 3.7). This allowed learners to study the article with each noun while eliminating noun gender as a variable or distractor.


3.4.3 Use Information: L2 Example Sentence

The default use information consists of a German example sentence. For instance, for Herbst (fall), the example sentence is Viele Menschen wandern gern im Herbst. (Many people like to go hiking in the fall.) (see Figure 1). The sentences were written by the author of this dissertation, a native speaker of German, and evaluated by up to three additional adult native German speakers. Sentences that sounded unnatural were rewritten and reassessed until all sentences were deemed acceptable by all raters. In the sentences, the target noun is presented in its singular form and in a grammatical case that is orthographically identical to the nominative case. The sentences are between six and nine words long and the target word usually appears towards the middle or end of the sentence. The example sentences generally only contain vocabulary and grammatical constructions that are familiar to the students. However, to make the examples more meaningful and informative, some sentences also contain cognates, English borrowings or internationalisms that students can easily understand in context (e.g., Farm (farm), Lotion (lotion), Polizei (police)). In addition, whenever comprehensible to the participants, the sentences include significant lexical co-occurrences of the target words taken from the Wortschatz database (http://wortschatz.uni-leipzig.de). For instance, the example sentence for Loch (hole) includes the word Wand (wall), which is a significant co-occurrence of that target word.

3.5 Word Information in the Annotation Types

3.5.1 Form Annotation Type: Audio Pronunciation

For the form annotation type, Voka includes an audio pronunciation of the target noun in the singular nominative case. For instance, for Herbst (fall), the audio annotation is a pronunciation of the word Herbst. All audio recordings are between one and two seconds long. The recordings were spoken by the author of this dissertation in a soundproof audio recording studio. The original audio files were recorded in WAVE sound format (.wav) and then converted to MP3 format (.mp3) for use in Voka. In Voka, the audio recording is played once automatically for each flashcard that includes this annotation type. Learners have the option to replay the audio pronunciation as often as they choose while the flashcard is being displayed.

3.5.2 Meaning Annotation Type I: L1 Definition

In Voka, the L1 definitions consist of between seven and nine words and always start with the target noun followed by a colon. For example, the L1 definition of Herbst (fall) is Herbst: the season between summer and winter (see Figure 1). If a target word has more than one possible meaning, the definition provides the same meaning as the English translation. However, to avoid redundancy, the definitions are non-recursive, that is, they do not contain the word used in the one-word English translation. For example, the word fall is not included in the definition of Herbst (fall).

Several online dictionaries were consulted in composing the L1 definitions: a German-German dictionary (DWDS, http://www.dwds.de/woerterbuch) and two English-English dictionaries (Merriam-Webster, http://www.merriam-webster.com/dictionary/; Dictionary.com, http://dictionary.reference.com/). While the definitions in Voka were sometimes quoted directly from one of the English dictionaries, the English definitions were often not entirely applicable and, as a result, ideas from all three sources had to be mixed and / or the author's personal ideas had to be included to arrive at a suitable L1 definition.
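The three formal constraints on the L1 definitions (length, opening with the target noun and a colon, and non-recursiveness) lend themselves to an automatic check. The following sketch is not part of Voka; the function name check_definition is hypothetical, and counting the target noun as one of the seven to nine words is an assumption based on the Herbst example.

```python
def check_definition(target, translation, definition):
    """Return a list of violated constraints (empty if all are met)."""
    problems = []
    words = definition.split()

    # Assumption: the target noun counts towards the seven to nine words.
    if not 7 <= len(words) <= 9:
        problems.append(f"length is {len(words)} words, expected 7 to 9")

    # The definition has to open with the target noun and a colon.
    if not definition.startswith(target + ":"):
        problems.append("does not start with the target noun and a colon")

    # Non-recursive: the one-word translation must not occur in the body.
    body = [word.strip(".,;:").lower() for word in words[1:]]
    if translation.lower() in body:
        problems.append("contains the one-word translation")

    return problems


print(check_definition("Herbst", "fall",
                       "Herbst: the season between summer and winter"))
```

The Herbst definition passes all three checks, so the printed list of problems is empty.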

3.5.3 Meaning Annotation Type II: Picture

The other meaning annotation type in Voka is the provision of a colour photograph to illustrate the meaning of the target word (see Figure 1). Most photographs were taken in Germany because "there is often a cultural dimension to the meaning and use of vocabulary" (Nation, 2001, p. 51) and authentic visuals of people and cultural artefacts can help capture this aspect. All photographs were taken in natural settings by amateur photographers. In post-editing, some photographs were cropped and / or enhanced through minor colour, contrast or focus adjustments while preserving a natural, unedited


(rather than an artistic or creative) look. All the pictures in Voka have a standard 4:3 width-to-height ratio with a size of 384 x 288 pixels. 17

To ensure adequate pictorial illustration of the target words, five L1 German speakers were asked to rank two to three pictures for each word and also indicate if they thought a picture was not a good visual representation of the target referent at all (see Akbulut, 2007; and O'Bryan, 2005, for a similar procedure). The raters were selected such that there was diversity in terms of their gender (three women, two men), age (28 to 67 years old), L2 German teaching experience (one experienced L2 German instructor, four raters without L2 German teaching experience) and experience with both German and Canadian culture (three raters have lived their entire lives in Germany, two raters have lived in both Germany and Canada for extended periods of time).

Figure 2: The three picture choices for Zweck (purpose) (Picture 3 selected for Voka)

[Picture 1] [Picture 2] [Picture 3]

17 All photographs are used with permission from the copyright holders (see Appendix G for copyright information). Persons depicted in photographs either gave their consent to the use of the pictures or were photographed in public places where assent is not legally required. The use of the pictures in Voka does not defame persons or unreasonably intrude into their private life.

For the rating procedure, the author first selected two to three suitable pictures for each word from a database of over 2,800 pictures that was created for this dissertation. For 22 of the 58 target words (38%), the two or three picture choices for one target word captured different ideas to illustrate the same meaning. For example, Figure 2 shows the three choices for Zweck (purpose), which play with very different underlying ideas: The purpose of scissors is to cut paper (Picture 1), the purpose of a recycling bin is to recycle waste (Picture 2), the purpose of rubber boots is to protect feet from water (Picture 3). For 23 of the 58 target words (40%), the picture choices portrayed a similar scene or idea but shot at different locations with different actors and / or props. For instance, for Gesicht (face), the three pictures depicting a face were taken with three different actors. For 9 of the 58 words (16%), the choice consisted of different pictures of the same scene, shot in the same location with the same actors and / or props. For example, the pictures of Gehirn (brain) used the same plastic model of a brain but one shows a lateral view whereas the other shows a top view. For 4 of the 58 words (7%), the pictures were created from the same photograph but edited differently (e.g., cropped) in postproduction.

The picture ratings indicated that the five raters generally believed the pictures to be adequate representations of the target words. For example, for 32 of the 58 words (55%), at least four of the five raters agreed on one picture. Nevertheless, the ratings also prompted the replacement of some target words due to poor picture ratings. Usually, the picture with the highest cumulative rating was chosen for display in Voka. To ensure a better match of the picture and the L2 example sentence, however, a different picture that was also ranked high was selected instead in some cases.
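The default selection rule described above (choose the picture with the highest cumulative rating, and check how many raters agree on one picture) can be illustrated with a minimal Python sketch. The scores below are invented for illustration only, and the convention that a higher number means a better rating is an assumption; the numeric scale used by the raters is not specified here.

```python
from collections import Counter

# Hypothetical ratings for one target word: one dict per rater,
# mapping picture label -> score (assumption: higher is better).
ratings = [
    {"Picture 1": 1, "Picture 2": 2, "Picture 3": 3},
    {"Picture 1": 2, "Picture 2": 1, "Picture 3": 3},
    {"Picture 1": 1, "Picture 2": 3, "Picture 3": 2},
    {"Picture 1": 1, "Picture 2": 2, "Picture 3": 3},
    {"Picture 1": 2, "Picture 2": 1, "Picture 3": 3},
]

def cumulative_scores(ratings):
    """Sum each picture's scores across all raters."""
    totals = Counter()
    for rater in ratings:
        totals.update(rater)
    return dict(totals)

def top_choice_agreement(ratings):
    """Return the most frequent first choice and the number of raters who made it."""
    first_choices = Counter(max(r, key=r.get) for r in ratings)
    return first_choices.most_common(1)[0]

totals = cumulative_scores(ratings)
best = max(totals, key=totals.get)           # picture with the highest cumulative score
picture, votes = top_choice_agreement(ratings)
print(best, totals[best], picture, votes)
```

A sketch like this covers only the default rule; the manual override for a better match with the example sentence would remain a judgment call outside the code.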

3.5.4 Use Annotation Type: L1 Gloss of L2 Example Sentence

The use annotation in Voka is an English gloss (i.e., sentence-level translation) of the German example sentence. For example, for Herbst (fall), the English gloss "Many people like to go hiking in the fall." is a translation of the German example sentence "Viele Menschen wandern gern im Herbst." (see Figure 1). The English glosses are between five and ten words long. In the gloss, the target word is always translated with the same English word that is also used in the one-word English translation. 18 The author of this dissertation wrote the glosses and a native English speaker evaluated them for their acceptability. Glosses that sounded unnatural were rewritten until the English native speaker considered all glosses acceptable.

The following section explains the ways in which the form, meaning and use annotations presented in this section are combined in the five annotation clusters of this study.

18 For clarity, in discussing Voka, the term gloss is used as a synonym for sentence-level L1 translation whereas the one-word English translation of the German target noun is called translation (see Figure 1).

3.6 The Five Annotation Clusters

In addition to the default information on the form (written L2 word), meaning (L1 translation), and use (L2 example sentence) of a word, which is identical for all learners, the study participants receive one of five annotation clusters that differ in the annotations they provide for each word (see Table 5).

Table 5: The five annotation clusters in Voka

Annotation cluster   Meaning annotation       Use annotation   Form annotation
PG                   picture                  L1 gloss         -
DG                   L1 definition            L1 gloss         -
PA                   picture                  -                audio pronunciation
DA                   L1 definition            -                audio pronunciation
PAGD                 L1 definition, picture   L1 gloss         audio pronunciation

Based on the premise that meaning information is more essential to word knowledge than either use or form information, every annotation cluster in Voka contains at least one meaning annotation. There are two meaning-use (i.e., PG, DG), two meaning-form (i.e., PA, DA) and one meaning-use-form annotation cluster (i.e., PAGD). Figure 3 shows the corresponding vocabulary flashcards in Voka for the target word Herbst (fall).
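As an illustration, the cluster definitions in Table 5 can be written down as a small lookup structure. This is a minimal sketch, not Voka's actual implementation; the names CLUSTERS and cluster_kind are hypothetical.

```python
# Annotation clusters from Table 5, encoded as sets of annotation types
# per aspect of word knowledge (names are illustrative, not Voka's own).
CLUSTERS = {
    "PG":   {"meaning": {"picture"}, "use": {"L1 gloss"}, "form": set()},
    "DG":   {"meaning": {"L1 definition"}, "use": {"L1 gloss"}, "form": set()},
    "PA":   {"meaning": {"picture"}, "use": set(), "form": {"audio pronunciation"}},
    "DA":   {"meaning": {"L1 definition"}, "use": set(), "form": {"audio pronunciation"}},
    "PAGD": {"meaning": {"picture", "L1 definition"}, "use": {"L1 gloss"},
             "form": {"audio pronunciation"}},
}

def cluster_kind(name):
    """Classify a cluster by the aspects (meaning, use, form) it annotates."""
    aspects = [a for a in ("meaning", "use", "form") if CLUSTERS[name][a]]
    return "-".join(aspects)

# Every cluster contains at least one meaning annotation.
every_cluster_has_meaning = all(c["meaning"] for c in CLUSTERS.values())

for name in CLUSTERS:
    print(name, cluster_kind(name))
```

Encoding the clusters this way makes the design constraint (at least one meaning annotation per cluster) directly checkable.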

Figure 3: The five annotation clusters in Voka illustrated with the target word Herbst (fall)

Meaning-use annotation clusters: annotation cluster PG (picture gloss); annotation cluster DG (definition gloss)

Meaning-form annotation clusters: annotation cluster PA (picture audio); annotation cluster DA (definition audio)

Meaning-form-use annotation cluster: annotation cluster PAGD (picture audio gloss definition)

3.7 Program Flow

Voka's program flow is identical for both parts 1 and 2. The program consists of seven phases (see Table 6): a pretest (called assessment in Voka), two study phases, two practice phases, an immediate posttest (follow-up test 1), and a delayed posttest (follow-up test 2). Each phase has either a recall or a recognition format, as shown in Table 6. For both experiments, phases 1 through 6 were conducted together while the delayed posttest (phase 7) took place one week later (for further details, see Table 11 for experiment 1 and Table 36 for experiment 2).

Table 6: The seven phases in Voka

Phase                        Corresponding name in Voka   Format        Function
1  Pretest                   Assessment                   Recall        Assessment
2  Study phase 1             Study phase 1                Recognition   Treatment
3  Study phase 2             Study phase 2                Recognition   Treatment
4  Practice phase 1          Practice phase 1             Recognition   Treatment
5  Practice phase 2          Practice phase 2             Recall        Treatment
6  Immediate posttest        Follow-up test 1             Recall        Assessment
7  1-week delayed posttest   Follow-up test 2             Recall        Assessment

Studies have shown that spaced repetition leads to more effective L2 vocabulary learning than massed repetition 19 (Nation, 2001), and accordingly, Voka's treatment involves spaced repetition by presenting each noun on four different occasions, twice in the study phases and twice in the practice phases, before it is tested (see Table 6). Moreover, for each student and each Voka phase, the target words are presented in a randomized order to prevent serial learning effects (see Appendix B for examples). 20 The total number of random orders for the presentation of words is the number of phases in Voka (i.e., 7) multiplied by the number of study participants (i.e., N = 72 in experiment 1, N = 68 in experiment 2).

19 Massed repetition means giving repeated attention to a word in a continuous, uninterrupted time frame (e.g., two minutes) whereas spaced repetition involves "spreading the repetitions across a long period of time, but not spending more time in total on the study of the words." (Nation, 2001, p. 76).

20 Serial learning takes place when one word in a list enables recall of the next word because the words are always learned in the same order. In addition, words at the beginning and end of a series are generally learned better than words in the middle. Changing the order of the words allows vocabulary items to be learned and recalled independently of their position in a list (Nation, 2001).

Figure 4: Example screenshot of the direction screen (assessment and study phases completed)

Once the students log on to the Voka website at www.voka.ca with their unique IDs, they first see the direction screen (Figure 4), which gives them an overview of the program flow by listing the different phases that Voka leads them through. For each phase, the screen displays the allotted time along with instructions. The direction screen also records the student's progress through the program by providing a green dot next to each completed phase. The following sections discuss each phase in turn.
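Before turning to the individual phases, the randomized presentation order described earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not Voka's code: the function names, the seeding scheme and the short word list are hypothetical, and the sketch ignores that the pretest contains more items than the other phases. With seven phases, the number of random orders is 7 × 72 = 504 in experiment 1 and 7 × 68 = 476 in experiment 2.

```python
import random

PHASES = [
    "pretest", "study phase 1", "study phase 2",
    "practice phase 1", "practice phase 2",
    "immediate posttest", "delayed posttest",
]

def presentation_orders(student_id, words, seed=0):
    """Return an independently shuffled word order for each phase."""
    orders = {}
    for phase in PHASES:
        # Hypothetical seeding per student and phase, so the sketch is reproducible.
        rng = random.Random(f"{seed}:{student_id}:{phase}")
        order = list(words)
        rng.shuffle(order)
        orders[phase] = order
    return orders

def total_orders(n_participants):
    """Number of random orders: number of phases times number of participants."""
    return len(PHASES) * n_participants

words = ["Herbst", "Kreis", "Zweck", "Gehirn"]   # small illustrative subset
orders = presentation_orders("voka203", words)
print(total_orders(72), total_orders(68))
```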


3.7.1 Pretest

The initial assessment of the learners' knowledge of the test items consists of a productive recall pretest in the form of an English to German translation. For each word, the prompt is the English word and the German noun's article as well as a blank field for students to provide the German noun (see Figure 5). The pretest for experiment 1 contains 48 words: 15 main abstract target words, 15 main concrete target words, 5 alternative abstract target words (spare words), 5 spare concrete words, and 8 distractor words that are part of the students' active vocabulary from the course textbook. For experiment 2, the pretest contains 46 words (1 main abstract word and 1 main concrete word fewer than in experiment 1). To maximize the number of valid study participants, main target words are assigned spare words as potential substitutes. Each main target word that a student knows on the pretest is automatically replaced by an unknown corresponding spare word in subsequent Voka phases. 21 The 8 distractor words chosen from the students' active vocabulary are inserted to keep students' attention on the test.

Figure 5: Example screenshot of the pretest (top part shown)

The participants have a maximum of 10 minutes to complete the pretest. At the top of the screen, Voka provides a counter indicating the remaining time (see Figure 5). The participants can submit the test within the allotted time; otherwise, the program automatically leads them back to the direction screen.

21 For example, on the part 2 pretest, participant voka250 knew the main word Schmerz (pain) but not its first spare word Fleisch (meat), and thus received Fleisch (meat) instead of Schmerz (pain) in the study and practice phases and on the two posttests.
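The spare-word substitution can be sketched in a few lines of Python. This is a hedged illustration, not Voka's actual algorithm: the function name, the data layout (each main word mapped to an ordered list of spare words) and the handling of words without a usable spare are assumptions.

```python
def assign_study_words(main_to_spares, known_words):
    """Replace each known main word with its first unknown spare word.

    Returns (study_words, unresolved), where unresolved lists main words
    that were known but had no unknown spare word available.
    """
    study_words, unresolved, used = [], [], set()
    for main, spares in main_to_spares.items():
        if main not in known_words:
            study_words.append(main)          # unknown main word is kept
            continue
        substitute = next(
            (s for s in spares if s not in known_words and s not in used),
            None,
        )
        if substitute is None:
            unresolved.append(main)           # no valid substitute found
        else:
            used.add(substitute)
            study_words.append(substitute)
    return study_words, unresolved

# Illustrative data: only the Schmerz -> Fleisch pairing is taken from the text.
main_to_spares = {"Schmerz": ["Fleisch"], "Herbst": []}
study_words, unresolved = assign_study_words(main_to_spares, known_words={"Schmerz"})
print(study_words, unresolved)
```

Returning the unresolved words separately makes it easy to flag participants for whom no valid substitute exists.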

3.7.2 The Two Study Phases

Following the pretest are two study phases, which introduce the words to the learners. For each word, a flashcard displays the annotation cluster allocated to a given learner for that word. For example, learners studying Herbst (fall) in annotation cluster PAGD are presented with the screen shown in Figure 6 in both study phases (see Figure 3 for all flashcards; the allocation of annotation clusters is described in section 4.3 for experiment 1 and in section 7.3 for experiment 2).

Figure 6: Example screenshot (study phase 1) of a word studied in annotation cluster PAGD

In addition to displaying content-related information, the bottom of the flashcard informs the learner of the current Voka phase, the number of words already presented in that phase and the time remaining before the next flashcard appears (see Figure 6). In each study phase, Voka automatically advances to the next flashcard after the allotted time (i.e., 23 seconds in study phase 1, 18 seconds in study phase 2). Once Voka has cycled through all the words in study phase 1, the program returns to the direction screen from which students can activate study phase 2. 22 In study phase 2, students receive the same annotation cluster for each word as in study phase 1, and again in random order, but the test items are presented for a shorter period of time (18 as opposed to 23 seconds).

22 Although for methodological reasons, the students may not manipulate the program while a study phase is in progress, they can self-activate each phase from the direction screen by clicking on START, thus giving them time to rest between phases.

3.7.3 The Two Practice Phases

The first practice phase

Following the second study phase, the learner initiates the first practice phase, which is a productive recognition task in a modified multiple choice format. The learners see the English translation and have to identify the corresponding German target noun, which is presented along with two other German nouns as distractors. However, instead of simply selecting the correct answer, the study participants have to type the noun into an input field. This affords learners the opportunity to practice spelling the nouns in preparation for the posttests. Figure 7 displays the prompt for Herbst (fall). For each student and word, the target noun and its two distractors are presented in a random order.

The two distractors are assumed to be unknown to the learners because they are not part of the textbook vocabulary and are also generally not highly frequent nouns. The distractors always have the same gender and the same initial as the target noun to compel learners to read the entire word and to supply them with a minimum of correct word information for learning. However, for maximum difference between a target noun and its distractors beyond that, the distractors generally differ from the target noun in terms of word length and the letters and / or consonant clusters contained. Furthermore, the second letter of the distractors is never the second letter of the target noun or any of the other Voka target words that start with the same initial.
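The hard constraints on distractors stated above (same gender, same initial, a second letter not used by any target word with that initial) can be expressed as a simple check. The sketch below is illustrative only: the function name, the representation of nouns as (article, noun) pairs, and the candidate distractors are assumptions, and the softer criteria (word length, consonant clusters) are not modelled.

```python
def is_valid_distractor(distractor, target, target_words):
    """Check an (article, noun) distractor against the stated hard constraints."""
    d_article, d_noun = distractor
    t_article, t_noun = target
    if d_article != t_article:                      # same gender (same article)
        return False
    if d_noun[0].lower() != t_noun[0].lower():      # same initial letter
        return False
    # Second letters of all target words that share the target's initial.
    blocked = {
        noun[1].lower()
        for _, noun in target_words
        if noun[0].lower() == t_noun[0].lower()
    }
    return d_noun[1].lower() not in blocked

# Small illustrative subset of target words (the full list is not reproduced here).
TARGETS = [("der", "Kreis"), ("der", "Herbst"), ("der", "Zweck")]
kreis = ("der", "Kreis")

print(is_valid_distractor(("der", "Kellner"), kreis, TARGETS))  # passes all checks
print(is_valid_distractor(("der", "Kragen"), kreis, TARGETS))   # same second letter as Kreis
print(is_valid_distractor(("die", "Kette"), kreis, TARGETS))    # different gender
```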


Figure 7: Example screenshot of a prompt in practice phase 1

In the first practice phase, the students can either click on Submit at any time to advance to the feedback screen or the program will advance automatically after 16 seconds. A feedback message informs the students if their answer is correct, that is, identical to the target word (e.g., for Beweis (proof), see Figure 8), or provides them with the correct answer in case their answer deviates from the spelling of the target word (for Kreis (circle), see Figure 9). For instance, in Figure 9, Voka informs the learner that the "answer 'der Kellner' is incorrect. The correct answer is: der Kreis." In addition, below the feedback message, the flashcard again displays the noun in the same annotation cluster in which the student studied it in the study phases. Regardless of whether the student translated the noun correctly, the feedback screen is displayed for 16 seconds before Voka automatically advances to the next word. Students do not have the option to advance the feedback screen faster.


Figure 8: Example screenshot of practice phase feedback for correct student input

Figure 9: Example screenshot of practice phase feedback for incorrect student input


The second practice phase

Parallel to the pretest and posttest format, the second practice phase tests productive recall: Students see the English word and the German article and are asked to translate the word into German. However, here, the students see only one word at a time (see Figure 10). Voka advances automatically to the feedback screen after 14 seconds but students can submit their answers earlier.

Figure 10: Example screenshot of a prompt in practice phase 2

As in the first practice phase, the feedback message is displayed along with the flashcard in the respective annotation cluster for 16 seconds before Voka automatically proceeds to the next prompt (see Figure 8 and Figure 9).

3.7.4 The Two Posttests

Like the pretest, the posttests in Voka are in a productive recall format but the participants are only tested on the words they studied during the treatment (i.e., 30 words in part 1, 28 words in part 2). For all words, students see the English word and the German article and have to provide the German noun (Figure 11). To complete the posttest, 15 minutes are allotted in part 1 and 14 minutes in part 2. Students can submit the tests sooner by clicking on Submit at the bottom of the posttest page.


Figure 11: Example screenshot of the posttests (top of immediate posttest shown)

The immediate posttest (follow-up test 1)

Once the immediate posttest is submitted, Voka informs the learner of his or her posttest score (e.g., "Follow-up test successfully completed. Your score: 26/30."). However, to avoid post-treatment learning, students do not see the correct answers to the posttest prompts.

The delayed posttest (follow-up test 2)

Seven days after completing the pretest, the treatment and the immediate posttest, students take the delayed posttest. The delayed posttest is identical to the immediate posttest (see Figure 11) and the words are presented again in random order. For both experiments, the participants are informed of the delayed posttest but are asked not to study the test items between the immediate and the delayed posttest. Once the delayed posttest is submitted, students can inspect an answer sheet showing them the posttest prompts, the corresponding German translations, their own translations and their score (Figure 12).


Figure 12: Example screenshot of answer sheet (top part shown)

A one-week post-treatment interval was chosen for the delayed posttest. Aside from logistical reasons, a one-week delay was considered optimal for assessing the learning of vocabulary items that are not part of the learners' regular curriculum. This interval is also commonly used in L2 vocabulary research studies (e.g., Chun & Plass, 1996a; Hill & Laufer, 2003; Kaya, 2006; Kellogg & Howe, 1971; Kim, 2006).

3.8 Data Collection in Voka

Voka tracks the entire student interaction with the system and saves it for research purposes (see Table 7). For each appearance of a target word in Voka, the program records the target word (e.g., Zustand (condition)), its word type (i.e., abstract or concrete), the annotation cluster the student studied the word in (e.g., PAGD), the student ID (e.g., voka203), the student input (if applicable), the Voka part (i.e., 1 or 2), the Voka phase (e.g., practice phase 2), the display order of the word in that phase (e.g., position 28), the completion date, and the time (in seconds) that the word was displayed until either the screen advanced automatically or the learner submitted an answer. The information logged by Voka is saved in a database and subsequently extracted into a Microsoft® Excel® spreadsheet which contains one row for each word in each of the seven Voka phases for each student and each part of Voka (roughly 33,000 rows in total, see example in Table 7).

Table 7: Sample Voka log output for participant voka203

Word      Type       Cluster   ID    Input     Part   Phase      Order   Date                    Time
Zustand   abstract   PAGD      203   --        1      Study 2    24      21/10/2009 11:34:59am   18 s
Zustand   abstract   PAGD      203   Gestand   1      Pract. 2   28      21/10/2009 12:00:53pm   13 s
Zustand   abstract   PAGD      203   Zustand   1      Post. 1    13      21/10/2009 12:05:15pm   188 s
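To make the structure of the log concrete, the rows of Table 7 can be read into typed records as in the following sketch. The class and field names are hypothetical and do not reflect Voka's database schema; the date format string is inferred from the sample values in Table 7.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class LogRow:
    word: str
    word_type: str                 # "abstract" or "concrete"
    cluster: str                   # e.g. "PAGD"
    student_id: int
    student_input: Optional[str]   # None when no input applies ("--")
    part: int                      # 1 or 2
    phase: str
    order: int                     # display position within the phase
    completed: datetime
    seconds_displayed: int

def parse_row(fields):
    """Build a LogRow from ten string fields laid out as in Table 7."""
    word, wtype, cluster, sid, inp, part, phase, order, date, secs = fields
    return LogRow(
        word, wtype, cluster, int(sid),
        None if inp == "--" else inp,
        int(part), phase, int(order),
        # Day/month/year with a 12-hour clock and am/pm suffix (inferred format).
        datetime.strptime(date, "%d/%m/%Y %I:%M:%S%p"),
        int(secs.rstrip(" s")),
    )

rows = [parse_row(r) for r in [
    ["Zustand", "abstract", "PAGD", "203", "--", "1", "Study 2", "24",
     "21/10/2009 11:34:59am", "18 s"],
    ["Zustand", "abstract", "PAGD", "203", "Gestand", "1", "Pract. 2", "28",
     "21/10/2009 12:00:53pm", "13 s"],
    ["Zustand", "abstract", "PAGD", "203", "Zustand", "1", "Post. 1", "13",
     "21/10/2009 12:05:15pm", "188 s"],
]]

# An answer counts as correct only if it is identical to the target word.
correct = [r.student_input == r.word for r in rows if r.student_input is not None]
print(correct)
```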

3.9 Rationale for Voka's Design

This section provides a rationale for the assessment, the aspects of word knowledge covered, and the exposure time for test items in Voka.

3.9.1 Assessment in Voka

Productive recall tests

In the context of L2 vocabulary research, an evaluation of the effectiveness of different kinds of instructed vocabulary learning methods entails giving explicit thought to vocabulary assessment (Read, 2000). In learning new vocabulary, it is initially most important to establish a memory link between the word form and its meaning (Laufer & Goldstein, 2004; Nation, 2001; Ryan, 1997; Schmitt, 2008, 2010; Webb, 2005). Accordingly, in terms of the word knowledge tested, the posttests in Voka assess the creation of this form – meaning link. The posttests are in a productive recall format, asking the participants to supply the German target words from memory given their English translations as test prompts. The pretest in Voka employs the same format to make it comparable to the posttests (see section 3.7).

Time constraints for conducting the experiment precluded the use of more than one assessment type in Voka. Productive recall was selected because it is the most difficult task for language learners when compared to productive recognition, receptive recall and receptive recognition (see section 2.1.1). It is recommended if only one type of learning is possible because recall generally results in better vocabulary learning than recognition (Nation, 2001) and productive learning also leads to a considerable amount of receptive knowledge (receptive learning, however, does not foster productive knowledge to the same extent; see Mondria & Wiersma, 2004).

Because the students are tested in a productive format, the flashcards used in Voka encourage productive learning by presenting the English word on the left and the German word on the right (see Figure 1) (De Groot & Keijzer, 2000; Nation, 2001). Generally, productive learning is more efficient for productive testing and receptive learning favours receptive testing (Nation, 2001; Mondria & Wiersma, 2004). Furthermore, the two study and two practice phases that comprise Voka's vocabulary learning treatment are designed to provide a step-wise increase in the difficulty of the L2 vocabulary learning task as preparation for the productive recall posttests by moving from recognition to recall (see Table 6) and by following the pedagogical sequence of presentation (i.e., the two study phases) and practice (i.e., the two practice phases) before the final assessment (see section 3.7).

Discrete, selective, and context-independent assessment

In Voka, the productive recall posttests are in a discrete, selective, and context-independent format. This format is a specific combination of the three dimensions of vocabulary assessment proposed by Read (2000) (see Figure 13). Discrete, selective, and context-independent vocabulary tests have been a staple of educational measurement for a long time, although language learning and assessment philosophy in recent decades has favoured embedded, context-dependent and often comprehensive vocabulary assessment.
However, Read (2000) argues convincingly that "instead of making a blanket statement such as 'all vocabulary testing should be contextualised', we should consider what the appropriate role of context is for a particular assessment purpose, and design the test tasks accordingly" (p. 165).


Figure 13: Dimensions of vocabulary assessment

Construct underlying the assessment instrument: Discrete ⇔ Embedded
  Discrete: A measure of vocabulary knowledge or use as an independent construct
  Embedded: A measure of vocabulary which forms part of the assessment of some other, larger construct

Range of vocabulary: Selective ⇔ Comprehensive
  Selective: A measure in which specific vocabulary items are the focus of the assessment
  Comprehensive: A measure which takes account of the whole vocabulary content of the input material (reading / listening tasks) or the test taker's response (writing / speaking tasks)

Role of context: Context-independent ⇔ Context-dependent
  Context-independent: A vocabulary measure in which the test taker can produce the expected response without referring to any context
  Context-dependent: A vocabulary measure which assesses the test taker's ability to take account of contextual information in order to produce the expected response

Note. Reproduction of Read's Figure 1.1 (2000, p. 9) with addition of header rows. Reproduced with permission.

For Voka, a discrete, selective, and context-independent test format was chosen for four reasons. First, according to Read (2000), this format is appropriate and commonly employed in studies that view vocabulary knowledge as a discrete form of L2 knowledge and L2 lexical forms as individual units of meaning. This perspective pertains to this dissertation, which examines the construct of vocabulary knowledge (rather than, say, reading comprehension) and assesses the knowledge of specific vocabulary items.

Second, discrete, selective and context-independent vocabulary tests are suitable when the learning task is restricted to establishing an initial memory link between the L2 form and its meaning (Read, 2000). This is precisely the case in Voka, where the focus of the research is on the effectiveness of annotations for constructing a mental link between form (L2 word) and meaning (L1 word). By including only the L1 translation and the L2 target word, and not the annotations, the assessment focuses on the extent to which the annotations as a learning resource contribute to better vocabulary learning rather than on how well the annotations themselves can be remembered.

Third, the test format was chosen to avoid the potential confounding influence on test performance introduced by assessing L2 vocabulary knowledge in sentence contexts. While learners with a good grasp of the L2 syntax and vocabulary used in the test sentences, for instance, would likely have an advantage over learners with less syntactic and lexical L2 awareness, both types of learners are on an even playing field when L2 target words are tested in isolation.

Fourth, the test format is deemed appropriate given the limited L2 proficiency level of the study participants. For beginner learners of a language, learning and being tested on discrete and context-independent knowledge of vocabulary items is often sufficient while advanced learners may require more extensive knowledge about words. Exposure to vocabulary items at a beginner level provides only a foundation for more extensive learning about the words at later proficiency stages (Nation, 2001; Read, 2000).

While discrete, selective, and context-independent vocabulary tests are thus certainly appropriate for the stated research endeavour of this dissertation, it is important to note that these tests do not allow researchers to make broad inferences about L2 vocabulary acquisition in general because they do not test whether learners can understand or use the L2 words in context in various communicative situations. In that respect, investigators employing discrete, selective, and context-independent tests need to take heed in assessing the implications of their findings (see, e.g., Norris & Ortega, 2000, 2001).

3.9.2 Word Knowledge in Voka

Table 8 provides a summary of the treatment and assessment of vocabulary learning in Voka in view of the receptive and / or productive coverage of the nine aspects of word knowledge (see section 2.1, Table 2). For example, Voka's treatment (i.e., study and practice phases) covers productive knowledge of the written target form by asking students to type the word in the two practice phases. During Voka's assessment (i.e., pretest and posttests), the written form is also tested productively because learners have to supply it on the tests. Table 8 shows that the treatment covers more aspects of word knowledge than the posttests because, as discussed in section 3.9.1, the study assesses the potential of the vocabulary annotations to aid in the establishment of the initial form-meaning link in L2 vocabulary learning.


Table 8: Aspects of word knowledge contained in Voka's treatment and assessment

Aspect of word knowledge                Voka treatment   Voka assessment
Form      spoken                  R     ✓                -
                                  P     -                -
          written                 R     ✓                -
                                  P     ✓                ✓
          word parts              R     ✓                -
                                  P     -                -
Meaning   form + meaning          R     ✓                -
                                  P     ✓                ✓
          concept + referents     R     ✓                -
                                  P     ✓                ✓
          associations            R     -                -
                                  P     -                -
Use       grammatical functions   R     ✓                -
                                  P     -                -
          collocations            R     ✓                -
                                  P     -                -
          constraints on use      R     -                -
                                  P     -                -

Note. R = receptive knowledge, P = productive knowledge.

Table 8 also indicates that Voka does not provide explicit coverage of associations and constraints on use. However, regarding associations, Nation (2001) reports that introducing new words with related words usually results in interference and confusion rather than effective learning and thus associated words were not included in Voka. Even if desired, it would have been difficult to include associations given the learners' limited vocabulary knowledge. Finally, concerning constraints on use, the participants were informed when introducing the experiments that they would be studying high-frequency words with no constraints on their use.

3.9.3 Exposure Time for Test Items

Two competing forces have to be considered in deciding on the amount of time for which learners are exposed to a test item in a vocabulary research study. On the one hand, the participants should have ample opportunity to learn all the words; on the other, for research purposes, the study should be designed to avoid a ceiling or floor effect in word retention scores so that a meaningful analysis of the research questions is possible.


Taking the timing in previous studies and the two pilot studies into account 23, the timing in Voka was set to balance these two issues. In the treatment phases (i.e., the study and practice phases), each learner sees each word for a minimum of 1 minute and 13 seconds (73 seconds) and a maximum of 1 minute and 43 seconds in total (103 seconds). However, the experiments are conducted over the Internet, and thus potential fluctuations in web-page load times cannot be ruled out.

Table 9: Exposure time per word in Voka's treatment

Treatment                    Time per word
Study phase 1                23 seconds
Study phase 2                18 seconds
Practice phase 1 prompt      0 – 16 seconds
Practice phase 1 feedback    16 seconds
Practice phase 2 prompt      0 – 14 seconds
Practice phase 2 feedback    16 seconds
Total                        73 – 103 seconds

Table 9 indicates the exposure time per word for all parts of the Voka treatment. For the study phases and the practice phase feedback, the timing is preset by the program whereas for the prompts in the two practice phases, the participants can submit their response before the allotted time runs out.

23 Studies have different research questions and methodologies and accordingly, the exposure time in previous studies varies widely, from much less than a minute per word at the low end of the spectrum to quite a few minutes per word at the high end. The total number of words studied at one time and the amount of information presented for each word naturally also influence time per word decisions. At the low end, Mondria and Wiersma (2004) report that "Stoddard 1929 and Griffin & Harley 1996 allotted 24 seconds per word, receptively or productively. Waring's 1997 subjects used 38 seconds per word receptively and 48 seconds productively." (pp. 89-90). Mondria and Wiersma themselves had learners study 16 words in 15 minutes, which corresponds to 56 seconds per word. Tonzar et al. (2009) gave children 48 seconds per word to study 40 L2 words. In a study involving many more words, "Thorndike (1908) found that learners could average about 34 German-English word pairs per hour (1,030 words in 30 hours)" (Nation, 1982, p. 16). This translates into 1 minute 45 seconds per word. The timing in CALL studies also varies considerably. For example, Dubois and Vial (2000) had learners study 19 words in two successive presentations of 15 seconds each (i.e., 30 seconds per word in total). In Nakata (2008), learners studied ten words for an average of 51 seconds per word, leading to a total study time of under 9 minutes. Kim (2006) allotted a maximum study time of 30 minutes for 12 English words, that is 2 minutes 30 seconds per word. The study was similar to Voka in that some treatment conditions included a definition, a drawing, an example sentence and an audio pronunciation for each word. With Kaya's (2006) multimedia CALL program, learners studied 12 words in approximately 80 minutes, that is, 6 minutes 40 seconds per word. In the two pilot studies that were conducted for this experiment, the learners studied each word for a maximum total of 1 minute 26 seconds (fall 2008) and 1 minute 33 seconds (summer 2009), respectively. The pilot studies used Voka in a similar way as the main study and yielded adequate scores for data analysis.
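The minimum and maximum totals in Table 9 follow directly from the per-phase timings, as the short calculation below shows. The variable and function names are illustrative; a prompt that is submitted immediately is counted as 0 seconds, as in Table 9.

```python
# Per-word timings from Table 9, in seconds, as (minimum, maximum).
TIMINGS = {
    "study phase 1": (23, 23),
    "study phase 2": (18, 18),
    "practice phase 1 prompt": (0, 16),     # learners may submit early
    "practice phase 1 feedback": (16, 16),
    "practice phase 2 prompt": (0, 14),     # learners may submit early
    "practice phase 2 feedback": (16, 16),
}

min_total = sum(lo for lo, _ in TIMINGS.values())
max_total = sum(hi for _, hi in TIMINGS.values())

def as_min_sec(seconds):
    """Format a number of seconds as minutes and seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"

print(min_total, as_min_sec(min_total))
print(max_total, as_min_sec(max_total))
```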


Regarding the assessment phases in Voka, learners have 10 minutes to complete the pretests (48 items in experiment 1, 46 items in experiment 2), 15 minutes each to complete the 30-item immediate and delayed posttest of part 1, and 14 minutes each for the 28-item immediate and delayed posttest of part 2.


4 METHODOLOGY OF EXPERIMENT 1

The first experiment conducted for this dissertation investigated the effectiveness of vocabulary annotations for L2 vocabulary learning while also taking into account word concreteness and differences in annotation effectiveness among learners. This chapter provides the methodology of experiment 1. Section 4.1 describes the study participants while section 4.2 presents the timeline of the experiment in fall 2009. To detail the design of experiment 1, section 4.3 explains the distribution of annotation clusters in Voka's treatment phases. Finally, section 4.4 reports on the data analysis and statistical procedures used in the experiment.

4.1 Study Participants

The study participants for this research were beginner L2 learners of German registered in the first-semester German class (German 102) at a Western Canadian university in fall 2009. German 102, a course for learners with no previous knowledge of German, entails 13 weeks of instruction with a classroom session of 100 minutes twice a week. Overall, 78 students were registered in German 102 in fall 2009 and all of them consented to participating in the experiment. However, 6 of the 78 students had to be eliminated from experiment 1. One student was deleted from the sample because, diverging from the experiment instructions, she wrote down all the target words during the treatment. The remaining five students were removed because their pretest indicated that they already knew one or more of the words that Voka showed them during treatment. The computational algorithm in Voka had not detected their prior knowledge and for this reason, no spare word had been provided. None of the remaining 72 final participants knew any of the L2 words that were part of the study treatment. In addition, all final participants indicated that they had not studied the target words between the treatment and the delayed posttest.24 Based on the self-report background questionnaire (see Appendix C), Table 10 provides more details about the participants.

Table 10: Characteristics of participants of experiment 1

Characteristic Number of students Age Range Mean Gender female male Status undergraduate graduate student a English proficiency native language native-like advanced intermediate advanced beginner beginner Other known languagesb

Prior German instruction none 1 semester Visits to German countries 0 times 1 – 3 times 4 or more Time in German countries none 1 – 21 days 1 – 6 months 1 – 4 years Computer comfort (very) comfortable neither (very) uncomfortable

Experiment 1 N = 72 17 - 47 years 20.7 years 48 students 24 students 69 students 3 students 49 students 8 students 13 students 2 students 0 students 0 students Arabic, Chinese, Croatian, English, French, German, Greek, Italian, Japanese, Punjabi, Spanish, Thai 68 students 4 students 40 students 29 students 3 students 39 students 23 students 8 students 2 students 71 students 1 student 0 students

24

This was determined through a question posed to the participants immediately after having submitted the delayed posttest. The question was Did you actively study or look up any of the words that Voka showed you last time? The reply options were yes, no, and not sure. All final participants replied no.

81

Characteristic Prior CALL use never rarely / occasionally (very) frequently

Experiment 1 29 students 30 students 13 students

Note. aOther than English, the following L1s were reported: Arabic, Croatian, Chinese, Farsi, French, Hindi, Korean, Moroccan, Punjabi, Russian, Spanish, Tagalog, Turkish, and Ukrainian. bOther known languages include non-native languages for which students indicated a proficiency from beginner to advanced level. All students listing German assessed themselves as beginners.

Table 10 indicates that while the ratio of women to men was 2 to 1, the participants were fairly homogeneous regarding most other characteristics. For example, most participants were undergraduate students and they were generally in their late teens or early twenties (all but six students were between 17 and 23 years old). Although the participants came from various language backgrounds, they were all fluent in English and all had passed the university's entrance requirements for English language proficiency. All participants were at a beginner proficiency level in German and most students had never been to a German-speaking country or reported only short visits.25 The students were also homogeneous concerning their level of comfort using computers, with all but one student indicating that they were comfortable or very comfortable with computers. Finally, the participants were presumably also fairly uniform with respect to their motivation to learn German. The university offers a variety of languages and German 102 is an elective course. Therefore, a fairly high motivation to learn the language can be assumed for all participants.

25 As Table 10 indicates, only four students had previous German instruction. However, as assessed by the course instructor, their knowledge was not extensive enough to disqualify them from German 102. Regarding the two students who indicated visits of between one and four years, one student stayed in Germany as a young child. The other took part in a university program with English as the language of instruction and reported not having learned the language because of little contact with L1 speakers.

4.2 Timeline of Experiment 1

The research conducted for experiment 1 involved three meetings with the study participants. The participants were enrolled in four sections of German 102 taught by two instructors (not the author of this dissertation). Accordingly, each component of the experiment was conducted on four separate occasions in fall 2009, as indicated in Table 11.

Table 11: Timeline for experiment 1

Meeting  Experiment component       Approx.    Class   German 102 section
                                    duration   week    D1        D2        E1        E2
1        Introduction               30 min.    2       Sep. 15   Sep. 16   Sep. 15   Sep. 16
         Background quest.
2        Instruction video          65 min.    7       Oct. 22   Oct. 21   Oct. 22   Oct. 26
         Exp. 1 Phase 1 - 6
3        Exp. 1 delayed posttest    25 min.    8       Oct. 29   Oct. 28   Oct. 29   Nov. 2
         Evaluation quest. 1

At the first meeting, the author introduced the participants to the experiment with a brief presentation. The participants were told that the research was about vocabulary learning but they were not informed about the specific research questions. The author then asked the students to fill out the background questionnaire (see Appendix C), and distributed an ethics consent form. All students consented to participating in the research conducted for this dissertation. Ethics approval for the study was obtained from Simon Fraser University's Office of Research Ethics prior to conducting the experiments.

At the second meeting, the participants watched a 5-minute video to ensure that the participants in all course sections received identical instructions on the procedures of the research project using Voka. The video was a screencast of the Voka program with voice-over narration (see Appendix F). The participants then completed phases 1 through 6 of part 1 of Voka (i.e., the pretest, the two study phases, the two practice phases and the immediate posttest, see section 3.7).

At the third meeting, the participants completed the delayed posttest of part 1 of Voka (i.e., phase 7, see section 3.7) and filled out the first evaluation questionnaire, which asked them to rank the five annotation clusters according to their personal preferences (see Appendix D).26 Note that the participants also completed a second evaluation questionnaire at the end of experiment 2, in which they evaluated both experiments (see section 7.2 and Appendix E for more details). After completing the first evaluation questionnaire, the participants received a handout of the part 1 Voka words including the German word, its English translation and a German example sentence.27

The experiment sessions were conducted during regular classroom time as part of the coursework for German 102. Students received 5% of their final course grade if they attended all sessions pertaining to both experiment 1 and experiment 2 (see section 7.2) whereas they received 0% if they missed and did not make up for a session for reasons other than extenuating circumstances. Students who missed a session either completed it by attending another section of German 102 or by meeting with the researcher individually. The participants were informed that they would receive their course credit for participating in the meetings for experiments 1 and 2 independently of their performance (e.g., posttest scores) in Voka. All students earned the 5% course credit because all participated in all parts of the experiments.

26 The first evaluation questionnaire showed each annotation cluster with the sample word Abschied (parting). This word was not studied by the participants and with an imageability of 3.9, it was chosen to lie approximately halfway in between the mean imageability of the abstract words (2.5) and the mean imageability of the concrete words (5.7) so as not to bias the participants' annotation cluster rankings towards either word type.

27 Meetings two and three, as well as both meetings for experiment 2 (see section 7.2), were conducted in reserved university computer labs on Macintosh or PC computers. There was one computer for each participant and the participants were seated sufficiently separate from each other to be able to focus only on their own computer screen. During the study and practice phases (see section 3.7), the participants wore headsets connected to their computers to listen to the audio annotations. All monitors had a large enough resolution to enable the learners to study the L2 words without having to scroll up or down the screen. Students accessed the program via the Internet. For the entire experiment, there were no technical problems surrounding the delivery of Voka. Students who finished a session early generally remained quietly in their seats until all participants were done.

4.3 Design of Experiment 1

Experiment 1 investigated the effectiveness of five annotation clusters (i.e., PG, DG, PA, DA, PAGD) for vocabulary learning by also considering word concreteness and variation among learners. For this, part 1 of Voka presented the participants with 30 words (15 abstract, 15 concrete) (see section 3.2 and Appendix A). In deciding on the distribution of annotation clusters for these words in Voka's study and practice phases, three objectives were considered. First, it had to be ensured that each participant be exposed to all annotation clusters to determine the effectiveness of each annotation cluster for each student (RQ 4, see section 2.9.1). This objective also eliminated the learner as an extraneous variable in assessing the effectiveness of the annotation clusters. Second, abstract and concrete words had to be displayed in each annotation cluster to permit an analysis of the impact of word concreteness on annotation effectiveness (RQ 3.1 and RQ 3.2, see section 2.9.1). Third, each annotation cluster needed to be tested with all of the target words to control for a potentially confounding effect of a word on annotation cluster effectiveness. If each participant had received the same words in the same annotation cluster, differences in word learning in the annotation clusters could have been due either to differences in the effectiveness of the annotation clusters investigated or to differences in the relative learning burden of the words in that cluster (Akbulut, 2007). Incidentally, the objective that each word be presented in each annotation cluster also enabled the determination of the best annotation cluster for each target word. Although not a focus of this study, this might provide further insights into the learning burden of words as well as the interplay of word characteristics and annotation cluster effectiveness in L2 vocabulary learning.

To achieve these objectives, a Latin square within-subjects experimental design was employed in experiment 1. As illustrated in Table 12, a Latin square is an n × n table populated by n symbols such that every symbol appears exactly once in each row and exactly once in each column. A Latin square experimental design controls two sources of extraneous variation (here: learner and word) by showing each treatment (here: annotation cluster) exactly once in each row and each column (see Table 14).

Table 12: Example of a Latin square

1 2 3
3 1 2
2 3 1

Accordingly, in the Latin square design of experiment 1, each participant received each annotation cluster for three abstract and three concrete words and at the same time, each word was studied in all five annotation clusters by different participants. For each participant, the annotation cluster for a word was identical across the Voka phases. The Latin square design in Voka involved splitting the 30 words into five word packages with 3 abstract and 3 concrete words in each package, as shown in Table 13.


The author strove to make the five word packages comparable in terms of their overall learning burden by matching linguistic features and imageability ratings (see section 3.2) as much as possible and by also considering pilot study data when available (specifically, the relative item difficulty of the target words as evidenced by scores on the pilot study posttests). Table 13: The five word packages in experiment 1

Word package   Abstract               Concrete
A              Ablauf (procedure)     Kreis (circle)
               Gesetz (law)           Richter (judge)
               Mangel (defect)        Schritt (step)
B              Grund (reason)         Himmel (sky)
               Hoffnung (hope)        Regen (rain)
               Sorge (worry)          Verkehr (traffic)
C              Inhalt (content)       Kunst (art)
               Schuld (blame)         Rechnung (bill)
               Zustand (state)        Vogel (bird)
D              Anfang (beginning)     Grenze (border)
               Befehl (command)       Pfarrer (priest)
               Gefahr (danger)        Waffe (weapon)
E              Beweis (proof)         Beamte (government employee)
               Vorwurf (criticism)    Herbst (fall)
               Zweck (purpose)        Urlaub (vacation)

The participants were then randomly divided into five exposure groups. An exposure group is a specific combination of word package and annotation cluster. As illustrated in Table 14, the participants in exposure group I, for example, received the six words in word package A (see Table 13) in annotation cluster PG, word package B in DG, C in PA, D in DA and E in PAGD. In contrast, the participants in exposure group II studied the six words from word package A in annotation cluster DG, word package B in PA, C in DA, D in PAGD, E in PG, and so on. As a detailed illustration, the program flow for participant voka201 is provided in Appendix B. Table 14: Annotation cluster distribution in Latin Square design of experiment 1

               Exposure group
Word package   I (n = 14)   II (n = 15)   III (n = 14)   IV (n = 13)   V (n = 16)
A              PG           DG            PA             DA            PAGD
B              DG           PA            DA             PAGD          PG
C              PA           DA            PAGD           PG            DG
D              DA           PAGD          PG             DG            PA
E              PAGD         PG            DG             PA            DA
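The distribution in Table 14 follows a cyclic pattern: each exposure group receives the same sequence of annotation clusters, shifted by one word package. The following Python sketch reproduces that pattern and checks the Latin square property. It is an illustration only; the function names are hypothetical and it is not the code used in Voka.

```python
# Illustrative sketch: rebuild the cluster distribution of Table 14 as a
# cyclic Latin square. All names below are hypothetical helpers.
CLUSTERS = ["PG", "DG", "PA", "DA", "PAGD"]
PACKAGES = ["A", "B", "C", "D", "E"]
GROUPS = ["I", "II", "III", "IV", "V"]


def cyclic_latin_square(symbols):
    """Return an n x n square in which row r is the symbol list shifted by r."""
    n = len(symbols)
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """Check that every symbol occurs exactly once in each row and column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok


# Rows correspond to word packages, columns to exposure groups.
square = cyclic_latin_square(CLUSTERS)

# assignment[group][package] gives the annotation cluster for that package.
assignment = {
    group: {package: square[r][c] for r, package in enumerate(PACKAGES)}
    for c, group in enumerate(GROUPS)
}

if __name__ == "__main__":
    for package, row in zip(PACKAGES, square):
        print(package, " ".join(row))
```

Because every row and every column contains each cluster once, each learner sees all five clusters and each word package is studied in all five clusters across the exposure groups.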

4.4 Data Analysis and Statistical Procedures

4.4.1 Scoring of Student Answers

In tackling the research questions of experiment 1, both the immediate and delayed posttest scores of part 1 of Voka were considered. In scoring the posttest results for data analysis purposes, a learner received either 1 point, 0.5 points or 0 points for each word. A score of 1 was given for responses that were identical to the target item or the target item's plural form, independent of whether the letters of the word were written in upper and / or lower case. For instance, for Kreis (circle), the target form in any capitalization or the plural form Kreise would have been awarded one point. Responses received a score of 0.5 points if they deviated from the target word by an edit distance of one or two because this demonstrated a considerable amount of partial knowledge on the part of the learner. Edit distance, a measure to describe the difference between two strings, is defined here as the minimum number of changes required to transform the student response into the target response, where a change is the addition, omission or substitution of a single letter or the transposition of two letters. For example, a student response that requires one addition (the letter c) and one omission (the letter t) to be converted into the target response Schmerz (pain) has an edit distance of two. Accordingly, such an answer received 0.5 points. Responses with an edit distance of 3 or more, or no responses, received 0 points. Being quite far removed from the target items, these responses demonstrated only a minor degree of partial knowledge, if any at all, and were thus counted as incorrect. For instance, 0 points were given to student responses that differed from Gesetz (law) by an edit distance of 3 and from Vogel (bird) by an edit distance of 5, and to the response < > for Himmel (sky) (edit distance 6).
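The scoring rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the scoring routine used in the study: the edit distance is computed with the optimal string alignment variant (which counts the transposition of two adjacent letters as one change), comparison is case-insensitive, partial credit is measured against the singular target only, and the misspelling Shmertz is a hypothetical response chosen to match the description above.

```python
# Illustrative sketch of the 1 / 0.5 / 0 scoring rule. Assumptions: adjacent
# transpositions only, case-insensitive comparison, partial credit measured
# against the singular target form.
def edit_distance(response, target):
    """Minimum number of additions, omissions, substitutions, or adjacent
    transpositions needed to turn `response` into `target`."""
    rows, cols = len(response) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # omit every letter of the response
    for j in range(cols):
        d[0][j] = j  # add every letter of the target
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if response[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # omission
                d[i][j - 1] + 1,         # addition
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and response[i - 1] == target[j - 2]
                    and response[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]


def score_response(response, target, plural=None):
    """Return 1.0, 0.5, or 0.0 points for one posttest response."""
    answer = response.strip().lower()
    accepted = {target.lower()}
    if plural is not None:
        accepted.add(plural.lower())
    if answer in accepted:
        return 1.0
    if answer and edit_distance(answer, target.lower()) <= 2:
        return 0.5
    return 0.0  # blank response or edit distance of 3 or more


if __name__ == "__main__":
    print(score_response("kreis", "Kreis", plural="Kreise"))
    print(score_response("Shmertz", "Schmerz"))  # hypothetical misspelling
    print(score_response("", "Himmel"))
```

A learner's posttest score is then the sum of the per-word scores, which is why totals can only be multiples of 0.5.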

4.4.2 Experiment Variables and Statistical Tests

The independent variables considered in experiment 1 were: annotation cluster, annotation type presence, word type, individual learner, and exposure group. Table 15 provides more details on these variables. Except for exposure group, all independent variables were directly involved in the research questions of this dissertation.28

Table 15: Independent variables of experiment 1

Independent variable        Type               Levels                             Measure
Annotation cluster          within-subjects    5 levels: PG, DG, PA, DA, PAGD     nominal
Annotation type presence    within-subjects    2 levels: presence, absence        nominal
Word type                   within-subjects    2 levels: abstract, concrete       nominal
Individual learner          between-subjects   72 levels (= N)                    nominal
Exposure group              between-subjects   5 levels: I, II, III, IV, V        nominal

The dependent variable in this experiment was vocabulary learning, which was operationalized as the scores that students received on the immediate and delayed productive recall posttests. These scores ranged from 0 – 30 points per student in experiment 1. For each student, scores could only be multiples of 0.5 because each word was awarded either 0, 0.5 or 1 point(s) (see section 4.4.1).

Table 16 lists the research questions (RQs) of this study, the corresponding statistical effects to be tested, the operationalization, and the inference tests used. For all tests, the alpha level to determine statistical significance was set to .05. The inferential results for RQ 1.1, RQ 2, and RQ 3.1 were obtained by conducting a three-way mixed ANOVA with word type and annotation cluster as within-subjects factors and exposure group as a between-subjects factor. Univariate F tests were performed to test main effects and interaction effects.

The paired samples contrast t-tests for RQ 1.2 and RQ 3.2 compared the mean of the means of the three annotation clusters containing the annotation type under investigation to the mean of the means of the two annotation clusters not containing it. For example, with regard to the picture annotation type, scores of 0 to 18 points per student (N = 72) were possible for picture presence (i.e., 6 points per student in each of PG, PA, and PAGD) while the possible range was 0 to 12 points per student for picture absence (i.e., 6 points each in DG and DA). Corresponding data were used to test the main effects of definition (presence: DG, DA, PAGD vs. absence: PG, PA), gloss (presence: PG, DG, PAGD vs. absence: PA, DA), and audio (presence: PA, DA, PAGD vs. absence: PG, DG), respectively.

As Table 16 shows, it was not possible to make an inference about the interaction between annotation cluster and individual learner. This was because each level of the factor individual learner was an individual learner (i.e., there was only one subject per level of the factor). Accordingly, RQ 4 was answered by examining variation in the data on a descriptive level only.

28 Exposure group was included as an additional factor in experiment 1 because it was regarded as a potential confounding variable in assessing the effectiveness of the annotation clusters for vocabulary learning (see section 4.3). As an experiment-internal background factor, exposure group was relevant to the assignment of the participants to the two treatment groups in experiment 2 (see section 7.1.1). However, it is not included in the presentation and discussion of the results of experiment 1 in chapters 5 and 6 because it was not part of the research questions of experiment 1.

Table 16: Research questions and operationalization for experiment 1

Vocabulary annotations

RQ 1.1. Does annotation cluster (i.e., PG, DG, PA, DA, PAGD) have an effect on vocabulary learning?
  Effect tested: Main effect of annotation cluster
  Operationalization: 0 – 6 points per annotation cluster per student (N = 72) in each posttest of part 1
  Inference test: ANOVA (α = .05)

RQ 1.2. Is there a difference in the effectiveness for vocabulary learning between annotation clusters (i.e., PG, DG, PA, DA, PAGD) that contain a given annotation type (i.e., picture, definition, gloss, or audio) and annotation clusters that do not contain that annotation type?
  Effect tested: Main effect of annotation type presence
  Operationalization: 0 – 18 points per student (N = 72) for clusters containing the annotation type, 0 – 12 points per student (N = 72) for clusters not containing the annotation type in each posttest of part 1
  Inference test: 4 paired samples contrast t-tests (α = .05)

Word concreteness

RQ 2. Does word type (i.e., abstract, concrete) have an effect on vocabulary learning?
  Effect tested: Main effect of word type
  Operationalization: 0 – 15 points per word type per student (N = 72) in each posttest of part 1
  Inference test: ANOVA (α = .05)

Annotations and word concreteness

RQ 3.1. Is there an interaction between word type (i.e., abstract, concrete) and annotation cluster (i.e., PG, DG, PA, DA, PAGD) in the effectiveness for vocabulary learning?
  Effect tested: Two-way interaction effect of annotation cluster X word type
  Operationalization: 0 – 3 points per word type per annotation cluster per student (N = 72) in each posttest of part 1
  Inference test: ANOVA (α = .05)

RQ 3.2.1. Is there a difference in the effectiveness for learning abstract words between annotation clusters (i.e., PG, DG, PA, DA, PAGD) that contain a given annotation type (i.e., picture, definition, gloss, or audio) and annotation clusters that do not contain that annotation type?
  Effect tested: Main effect of annotation type presence for abstract words
  Operationalization: 0 – 9 points per student (N = 72) for clusters containing the annotation type, 0 – 6 points per student (N = 72) for clusters not containing the annotation type in each posttest of part 1
  Inference test: 4 paired samples contrast t-tests (α = .05)

RQ 3.2.2. Is there a difference in the effectiveness for learning concrete words between annotation clusters (i.e., PG, DG, PA, DA, PAGD) that contain a given annotation type (i.e., picture, definition, gloss, or audio) and annotation clusters that do not contain that annotation type?
  Effect tested: Main effect of annotation type presence for concrete words
  Operationalization: 0 – 9 points per student (N = 72) for clusters containing the annotation type, 0 – 6 points per student (N = 72) for clusters not containing the annotation type in each posttest of part 1
  Inference test: 4 paired samples contrast t-tests (α = .05)

Annotations and individual learners

RQ 4. Does the effectiveness of the annotation clusters (i.e., PG, DG, PA, DA, PAGD) for vocabulary learning vary across learners?
  Effect tested: Two-way interaction effect of annotation cluster X individual learner
  Operationalization: 0 – 6 points per annotation cluster per student (N = 72) in each posttest of part 1
  Inference test: No inference possible
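The paired samples contrast t-tests for RQ 1.2 and RQ 3.2 described in this section can be illustrated with a short Python sketch. For each student, the mean score in the three clusters containing an annotation type is compared with the mean score in the two clusters not containing it, and the resulting per-student differences are tested against zero. The four student records below are invented for illustration and are not data from the study; the function names are likewise hypothetical, and the study's statistics were presumably produced with a statistics package rather than with code of this kind.

```python
# Illustrative sketch of a paired samples contrast t-test.
# The student scores are hypothetical (0 - 6 points per cluster).
import math
import statistics

ALL_CLUSTERS = ("PG", "DG", "PA", "DA", "PAGD")
CLUSTERS_WITH = {
    "picture": ("PG", "PA", "PAGD"),
    "definition": ("DG", "DA", "PAGD"),
    "gloss": ("PG", "DG", "PAGD"),
    "audio": ("PA", "DA", "PAGD"),
}


def contrast_scores(students, annotation_type):
    """Per student: mean of the presence clusters minus mean of the absence clusters."""
    present = CLUSTERS_WITH[annotation_type]
    absent = [c for c in ALL_CLUSTERS if c not in present]
    return [
        statistics.mean(s[c] for c in present) - statistics.mean(s[c] for c in absent)
        for s in students
    ]


def paired_t(differences):
    """Return (t, df) for testing whether the mean difference is zero."""
    n = len(differences)
    sem = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / sem, n - 1


hypothetical_students = [
    {"PG": 4.0, "DG": 3.0, "PA": 4.5, "DA": 3.5, "PAGD": 4.0},
    {"PG": 3.0, "DG": 3.0, "PA": 3.5, "DA": 2.5, "PAGD": 3.5},
    {"PG": 5.0, "DG": 4.0, "PA": 5.0, "DA": 4.5, "PAGD": 5.5},
    {"PG": 2.0, "DG": 2.5, "PA": 3.0, "DA": 2.0, "PAGD": 2.5},
]

if __name__ == "__main__":
    diffs = contrast_scores(hypothetical_students, "picture")
    t, df = paired_t(diffs)
    print(f"picture: t({df}) = {t:.3f}")
```

Averaging within each side of the contrast puts presence (three clusters) and absence (two clusters) on the same 0 – 6 scale before the difference is taken.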

The results of experiment 1 (part 1 of Voka) are presented in the following chapter and discussed in chapter 6.


5 RESULTS OF EXPERIMENT 1

The results of experiment 1 (conducted with part 1 of Voka) are presented in four subsections, corresponding to its four research topics (see section 2.9 and section 4.4.2). Accordingly, section 5.1 explores vocabulary annotations, first presenting the results for the main effect of annotation cluster (section 5.1.1), then the results for the four main effects of the presence or absence of an annotation type in an annotation cluster (section 5.1.2). Section 5.2 provides the results for the main effect of word type. Examining annotations together with word concreteness in section 5.3, section 5.3.1 first investigates the interaction effect of annotation cluster and word concreteness while section 5.3.2 then presents the findings regarding the four main effects of the presence of an annotation type for abstract and concrete words. Section 5.4 deals with the interaction of individual learners and annotation cluster effectiveness. Section 5.5 presents a summary of the findings of experiment 1. In all sections, each research question is answered by inspecting the immediate and then the delayed posttest data.

5.1 Research Topic 1: Vocabulary Annotations

The first research topic investigates the effectiveness of annotations for L2 vocabulary learning by asking two subquestions. Research question 1.1 enquires about the main effect of annotation cluster (i.e., PG, DG, PA, DA, PAGD) on vocabulary learning. Research question 1.2 examines, for each of the annotation types (i.e., picture, definition, gloss, audio), whether there is a difference in vocabulary learning effectiveness between the clusters that contain the annotation type and those that do not.


5.1.1 RQ 1.1: Main Effect of Annotation Cluster

Immediate Posttest

Descriptive statistics

Regarding the main effect of annotation cluster (RQ 1.1), Table 17 displays the results of the immediate posttest for the 72 participants of experiment 1. Each learner studied 6 words (3 abstract, 3 concrete) in each of the five annotation clusters, for a total of 30 words (see section 4.3).

Table 17: Immediate posttest results for the five annotation clusters

Annotation cluster         Rank   Mean    Maximum score   Percentage   SD
PG (Picture Gloss)         3      3.69    6               61.6%        1.41
DG (Definition Gloss)      5      3.31    6               55.2%        1.65
PA (Picture Audio)         1      3.81    6               63.5%        1.56
DA (Definition Audio)      4      3.35    6               55.9%        1.56
PAGD (Everything)          2      3.71    6               61.8%        1.37
All annotation clusters           17.88   30              59.6%        6.16

Table 17 shows that the learners obtained an average score of 17.88 out of 30, which corresponds to a mean vocabulary learning rate of 59.6%. Annotation cluster PA (picture audio) best supported vocabulary learning with a mean score of 3.81 out of 6 possible points per learner, that is, 63.5% on average. With a mean of 61.8%, the second best annotation cluster was PAGD, the annotation cluster that contained all annotations, closely followed by PG (picture gloss) with 61.6%. Annotation clusters DA (definition audio) and DG (definition gloss) were least effective, with 55.9% and 55.2%, respectively. In the most effective annotation cluster (PA), learners on average scored 0.5 (out of 6) points more, or 8.3% higher, than in the least effective annotation cluster (DG). Inspecting the data with respect to the two meaning-use (PG, DG) and meaning-form (PA, DA) annotation clusters, results show that the picture annotation clusters (PA, PG) outperformed the definition annotation clusters (DA, DG) independent of whether additional form (audio) or use (gloss) information was provided. Furthermore, within both the two picture (PA, PG) and the two definition annotation clusters (DA, DG), the meaning-form annotation clusters (PA, DA) always ranked higher than the meaning-use annotation clusters (PG, DG).

Inferential statistics

Formal hypothesis testing with an alpha level of .05 showed a main effect of annotation cluster on the immediate posttest, F(4, 268) = 4.265, p = .002. Thus, the annotation clusters differ significantly in their effectiveness for learning L2 vocabulary. The estimated magnitude of effect is η² (eta-squared) = .016, that is, 1.6% of the variance in word retention scores on the immediate posttest can be explained by the main effect of annotation cluster. Post-hoc multiple pairwise comparisons revealed two homogeneous subsets: PG, PA, and PAGD versus DG and DA. Annotation clusters PG, PA, and PAGD are significantly more effective for vocabulary learning than annotation clusters DG and DA (see Table 18).

Table 18: Pairwise comparisons for annotation cluster (immediate posttest)

Annotation    Annotation    Mean difference   Standard error   Sig.
cluster (I)   cluster (J)   (I - J)
PG            DG             .195             .074             .011
PG            PA            -.058             .075             .440
PG            DA             .158             .077             .045
PG            PAGD          -.011             .071             .872
DG            PA            -.253             .083             .003
DG            DA            -.037             .083             .656
DG            PAGD          -.206             .073             .006
PA            DA             .216             .081             .010
PA            PAGD           .047             .074             .532
DA            PAGD          -.169             .076             .029

Delayed Posttest

Descriptive statistics

The results of the delayed posttest of experiment 1 are presented in Table 19. Mean word retention was notably lower on the delayed posttest (21.6%, or 6.49 / 30 points) than on the immediate posttest (59.6%, or 17.88 / 30 points).

Table 19: Delayed posttest results for the five annotation clusters

Annotation cluster         Rank   Mean    Maximum score   Percentage   SD
PG (Picture Gloss)         2      1.34    6               22.3%        1.13
DG (Definition Gloss)      5      1.17    6               19.6%        1.11
PA (Picture Audio)         3      1.33    6               22.2%        1.18
DA (Definition Audio)      4      1.28    6               21.3%        1.13
PAGD (Everything)          1      1.37    6               22.8%        1.22
All annotation clusters           6.49    30              21.6%        4.43

Inspecting the descriptive data with regard to RQ 1.1, the difference in effectiveness among the five annotation clusters was less pronounced on the delayed posttest than on the immediate posttest. With 22.8%, 22.3%, and 22.2%, annotation clusters PAGD, PG, and PA, respectively, were the three most effective annotation clusters while annotation clusters DA and DG trailed them closely with a mean word retention of 21.3% and 19.6%, respectively. There was a 0.2 point difference (3.2%) in mean word retention scores between the most effective annotation cluster (PAGD: mean 1.37 points) and the least effective one (DG: mean 1.17 points). As on the immediate posttest, the three picture annotation clusters (PAGD, PG, PA) were more effective than the two annotation clusters without pictures (DA, DG).

Inferential statistics

However, formal hypothesis testing for RQ 1.1 revealed no main effect of annotation cluster on the delayed posttest, F(4, 268) = .713, p = .584. This means that the annotation clusters did not differ significantly in their effectiveness one week after the learning session.

5.1.2 RQ 1.2: Main Effect of Annotation Type Presence

Immediate posttest

Descriptive statistics

Research question 1.2 examines whether the effect on vocabulary learning is different when learners study with annotation clusters (i.e., PG, DG, PA, DA, PAGD) that contain a given annotation type (i.e., picture, definition, gloss, audio) than when they study with annotation clusters that do not contain it. Figure 14 displays the standardized mean scores for the presence versus the absence of the four annotations in the annotation clusters on the immediate posttest.29

Figure 14: Standardized mean scores for annotation presence vs. absence (immediate posttest, all words)

Figure 14 shows that with 3.74 out of a maximum of 6 points, the mean response for words studied in annotation clusters with a picture was 0.41 points (6.8%) higher than the mean response for words studied in annotation clusters without a picture (3.33). Regarding the definition annotation, learners scored 0.29 points (4.8%) lower on average when they had studied the words with a definition (3.46) as opposed to without one (3.75). Words studied with a gloss yielded similar scores to words studied without a gloss (3.57 vs. 3.58 points, respectively). The mean response for words studied with audio was 0.13 points (2.2%) higher than the mean response for words studied without audio (3.63 vs. 3.50 points).

29

For each comparison, three annotation clusters contain the annotation type investigated (e.g., PG, PA, and PAGD for the picture annotation type) and two annotation clusters do not contain that annotation type (e.g., DG and DA for the picture annotation type). Accordingly, the mean of the three means for annotation presence is out of 18 points whereas the mean of the two means for annotation absence is out of 12 points (see section 4.4.2). To facilitate the comparison between annotation presence versus absence, the presence and absence means are both standardized here to a scale of 6 points.
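The standardized presence and absence means can be approximated from the cluster means reported in Table 17, because every learner contributed scores to all five clusters. The Python sketch below is an illustration with hypothetical names; since it starts from means rounded to two decimals, its results agree with the values reported for Figure 14 only to within rounding.

```python
# Illustrative sketch: presence vs. absence means on the 6-point scale,
# computed from the rounded cluster means of Table 17 (immediate posttest).
IMMEDIATE_MEANS = {"PG": 3.69, "DG": 3.31, "PA": 3.81, "DA": 3.35, "PAGD": 3.71}

CLUSTERS_WITH = {
    "picture": {"PG", "PA", "PAGD"},
    "definition": {"DG", "DA", "PAGD"},
    "gloss": {"PG", "DG", "PAGD"},
    "audio": {"PA", "DA", "PAGD"},
}


def presence_absence_means(cluster_means, annotation_type):
    """Return (mean of presence clusters, mean of absence clusters)."""
    with_type = CLUSTERS_WITH[annotation_type]
    present = [m for c, m in cluster_means.items() if c in with_type]
    absent = [m for c, m in cluster_means.items() if c not in with_type]
    return sum(present) / len(present), sum(absent) / len(absent)


if __name__ == "__main__":
    for annotation_type in CLUSTERS_WITH:
        presence, absence = presence_absence_means(IMMEDIATE_MEANS, annotation_type)
        print(f"{annotation_type:<10} presence {presence:.2f}  absence {absence:.2f}")
```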


Inferential statistics

A paired samples contrast t-test for the picture annotation type revealed a main effect of picture, t(71) = 4.016, p < .001 (see Table 20). Accordingly, of the five annotation clusters tested, those containing a picture (PG, PA, PAGD) are on average significantly more effective for vocabulary learning than those without a picture (DG, DA). The estimated magnitude of effect is Hedges' g = 0.47.30

Table 20: T-tests for annotation presence vs. absence (immediate posttest, all words)

                                           Paired differences
Annotation   Pair                          Mean      SD       SEM      t        df   Sig.
Picture      (PG, PA, PAGD) - (DG, DA)      .40509   .85584   .10086    4.016   71   .000
Definition   (DG, DA, PAGD) - (PG, PA)     -.29514   .87913   .10361   -2.849   71   .006
Gloss        (PG, DG, PAGD) - (PA, DA)     -.01157   .98097   .11561    -.100   71   .921
Audio        (PA, DA, PAGD) - (PG, DG)      .12153   .95127   .11211    1.084   71   .282

A t-test also showed a main effect of definition, t(71) = -2.849, p = .006. In this case, the absence of a definition is significantly more effective for vocabulary learning: learners perform better on average on annotation clusters PG and PA, which do not contain a definition, than on the annotation clusters with a definition (DG, DA, and PAGD). The estimated magnitude of this effect is Hedges' g = 0.34. Finally, there was no main effect of either gloss presence or audio presence (t(71) = -.100, p = .921, and t(71) = 1.084, p = .282, respectively). These inferential results lead to the conclusion that of the five annotation clusters tested, those that contain a gloss (PG, DG, PAGD) are not more or less effective on average than those that do not contain a gloss (PA, DA). Likewise, the annotation clusters with audio (PA, DA and PAGD) are on average as effective as the annotation clusters without audio (PG, DG).
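As a consistency check on Table 20, each t value can be recovered from the reported mean and standard deviation of the paired differences, since the standard error of the mean is SD / √N and t is the mean divided by that standard error. The following Python sketch uses only the figures reported in Table 20; the function name is hypothetical.

```python
# Illustrative sketch: recompute SEM and t from the paired-difference
# summaries reported in Table 20 (N = 72 participants).
import math

N = 72
TABLE_20 = {
    # annotation: (mean difference, SD of differences)
    "picture": (0.40509, 0.85584),
    "definition": (-0.29514, 0.87913),
    "gloss": (-0.01157, 0.98097),
    "audio": (0.12153, 0.95127),
}


def t_from_summary(mean_diff, sd_diff, n):
    """Return (t, SEM) for a paired samples t-test given summary statistics."""
    sem = sd_diff / math.sqrt(n)
    return mean_diff / sem, sem


if __name__ == "__main__":
    for annotation, (mean_diff, sd_diff) in TABLE_20.items():
        t, sem = t_from_summary(mean_diff, sd_diff, N)
        print(f"{annotation:<10} SEM = {sem:.5f}  t({N - 1}) = {t:.3f}")
```

The sign of t follows the direction of the contrast (presence minus absence), so a negative value indicates higher scores for the clusters without the annotation type.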

Footnote 30: ANOVA's post-hoc analysis (see section 5.1.1) claims inferential equality of the means in the two homogeneous subsets (i.e., [PGmean = PAmean = PAGDmean] ≠ [DGmean = DAmean]). The contrast t-tests test whether the mean of the means of the three clusters containing the annotation type in question is equal to the mean of the means of the two clusters not containing it (e.g., [(PGmean + PAmean + PAGDmean)/3] ≠ [(DGmean + DAmean)/2]). Accordingly, the contrast t-test result of the picture annotation type could be anticipated by inspecting the post-hoc ANOVA results but the t-test results of the other three annotation types (audio, definition, gloss) cannot.


Delayed posttest

Descriptive statistics

Figure 15 displays the standardized means of the vocabulary scores for annotation presence versus absence for the four annotation types on the delayed posttest.

Figure 15: Standardized mean scores for annotation presence vs. absence (delayed posttest, all words)

Picture presence was descriptively better than picture absence (1.35 vs. 1.23 points). Words studied without a definition (1.34 points) were retained better than words studied with a definition (1.27 points). The mean scores were similar for gloss presence (1.29 points) and gloss absence (1.31 points), and words studied with an audio annotation were retained better than words studied without audio (1.33 points vs. 1.26 points, respectively). Overall, in terms of descriptive statistics, the tendencies observed on the delayed posttest were the same as those on the immediate posttest, although the mean scores were lower and the differences in mean scores between annotation presence and absence were generally less pronounced. For instance, the mean difference for picture presence versus picture absence was 0.12 points (2.0%) on the delayed posttest but 0.41 points (6.8%) on the immediate posttest.


Inferential statistics

Table 21: T-tests for annotation presence vs. absence (delayed posttest, all words)

| Annotation | Pair | Mean (paired diff.) | SD (paired diff.) | SEM (paired diff.) | t | df | Sig. |
|---|---|---|---|---|---|---|---|
| Picture | (PG, PA, PAGD) - (DG, DA) | .12153 | .71384 | .08413 | 1.445 | 71 | .153 |
| Definition | (DG, DA, PAGD) - (PG, PA) | -.06366 | .81739 | .09633 | -.661 | 71 | .511 |
| Gloss | (PG, DG, PAGD) - (PA, DA) | -.01157 | .83078 | .09791 | -.118 | 71 | .906 |
| Audio | (PA, DA, PAGD) - (PG, DG) | .06944 | .73863 | .08705 | .798 | 71 | .428 |

The four complex hypothesis tests showed no main effects of annotation presence for the four individual annotations on the delayed posttest (see Table 21). Accordingly, when assessing learners seven days after the treatment, the presence of a picture, a definition, a gloss or an audio annotation, respectively, is on average not significantly more (or less) effective for vocabulary learning than its absence in the five annotation clusters tested.

5.2 Research Topic 2: Word Concreteness

5.2.1 RQ 2: Main Effect of Word Type

The second research question examines whether word concreteness has an effect on vocabulary learning.

Immediate Posttest

Descriptive statistics

Table 22: Immediate posttest results for abstract vs. concrete words

| Word type | Mean | Maximum score | Percentage | SD |
|---|---|---|---|---|
| Abstract words | 6.72 | 15 | 44.8% | 3.77 |
| Concrete words | 11.17 | 15 | 74.4% | 2.79 |
| Both | 17.88 | 30 | 59.6% | 6.16 |

Table 22 displays the results of the immediate posttest, for which each learner received 15 abstract and 15 concrete words as test items (see section 4.3). With 44.8% mean retention, the abstract words were harder to learn than the concrete words (74.4% mean retention). On average, students' scores were 4.45 points (29.7%) lower for the abstract words.

Inferential statistics

Formal hypothesis testing showed a main effect of word type, F(1, 67) = 234.841, p < .001. Thus, the concreteness of a word has a significant effect on its learnability, with abstract words being significantly harder to learn than concrete words (see footnote 31). The estimated magnitude of effect is η² = .320, meaning that 32% of the variance in posttest scores can be attributed to the effect of word type.

Delayed Posttest

Descriptive statistics

The results of the delayed posttest, displayed in Table 23, mirror the results of the immediate posttest. Learners again scored lower on the abstract words (mean retention: 10.2%) than on the concrete words (33.1%). On average, the score for abstract words was 3.43 / 15 points (22.9%) lower than the score for concrete words.

Table 23: Delayed posttest results for abstract vs. concrete words

| Word type | Mean | Maximum score | Percentage | SD |
|---|---|---|---|---|
| Abstract words | 1.53 | 15 | 10.2% | 1.76 |
| Concrete words | 4.96 | 15 | 33.1% | 3.06 |
| Both | 6.49 | 30 | 21.6% | 4.43 |

Inferential statistics

Formal hypothesis testing revealed a main effect of word type on the delayed posttest, F(1, 67) = 155.037, p < .001. The significantly higher learning burden of abstract words compared to concrete words is still present one week after the vocabulary learning session. The magnitude of effect estimate is η² = .281, that is, 28.1% of the variance in scores is explained by the main effect of word type.

Footnote 31: The sample of abstract and concrete words chosen for this study is representative of the population of abstract and concrete words, and from a linguistic point of view it is thus reasonable to generalize to abstract and concrete words in general. However, note that based on strict statistical principles, it would not be possible to generalize to all abstract and concrete words because the words were not randomly selected but rather chosen according to specific characteristics (e.g., word length) to control for potential influences on learning burden apart from word concreteness (see section 3.2).


5.3 Research Topic 3: Annotations and Word Concreteness

5.3.1 RQ 3.1: Interaction Effect of Annotation Cluster X Word Type

RQ 3.1 asks whether there is an interaction between word concreteness and annotation cluster effectiveness in L2 vocabulary learning. An interaction can be present in one of two ways. Accordingly, RQ 3.1.a examines whether the effect of word concreteness on vocabulary learning differs as a function of annotation cluster while RQ 3.1.b investigates whether the effect of annotation cluster on vocabulary learning differs as a function of word type.

Immediate Posttest

Descriptive statistics

Each participant studied three abstract and three concrete words in each of the five annotation clusters (see section 4.3). Table 24 displays the mean results for the two word types in the five annotation clusters on the immediate posttest. The data are visualized in Figure 16.

Regarding the first subquestion, RQ 3.1.a, the descriptive data in Table 24 and Figure 16 show that in all five annotation clusters, learners performed better on concrete words than on abstract words. Notably, the mean difference between concrete and abstract scores was highest in DG, with the participants earning on average 1.03 / 3 points (34.5%) more with concrete than with abstract words (2.17 – 1.14), while the difference was lowest in PA at 0.79 points (26.1%) (2.30 – 1.51). Furthermore, abstract words learned in the most effective annotation cluster for abstract words (PA, 50.5%) were still harder to learn than concrete words studied in the least effective annotation cluster for concrete words (DA, 71.5%).


Table 24: Immediate posttest results for abstract vs. concrete words per annotation cluster

| Annotation cluster | Rank | Word type | Mean | Maximum score | Percentage | SD |
|---|---|---|---|---|---|---|
| PG (Picture Gloss) | 3 | Abstract | 1.42 | 3 | 47.5% | 0.96 |
| | 3 | Concrete | 2.27 | 3 | 75.7% | 0.76 |
| | | Both | 3.69 | 6 | 61.6% | 1.41 |
| DG (Definition Gloss) | 5 | Abstract | 1.14 | 3 | 38.0% | 1.03 |
| | 4 | Concrete | 2.17 | 3 | 72.5% | 0.80 |
| | | Both | 3.31 | 6 | 55.2% | 1.65 |
| PA (Picture Audio) | 1 | Abstract | 1.51 | 3 | 50.5% | 1.10 |
| | 1 | Concrete | 2.30 | 3 | 76.6% | 0.65 |
| | | Both | 3.81 | 6 | 63.5% | 1.56 |
| DA (Definition Audio) | 4 | Abstract | 1.21 | 3 | 40.3% | 0.98 |
| | 5 | Concrete | 2.15 | 3 | 71.5% | 0.86 |
| | | Both | 3.35 | 6 | 55.9% | 1.56 |
| PAGD (Everything) | 2 | Abstract | 1.43 | 3 | 47.7% | 0.94 |
| | 2 | Concrete | 2.28 | 3 | 75.9% | 0.72 |
| | | Both | 3.71 | 6 | 61.8% | 1.37 |
| All annotation clusters | | Abstract | 6.72 | 15 | 44.8% | 3.77 |
| | | Concrete | 11.17 | 15 | 74.4% | 2.79 |
| | | Both | 17.88 | 30 | 59.6% | 6.16 |

Figure 16: Immediate posttest results for abstract vs. concrete words per annotation cluster
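The descriptive comparisons drawn from Table 24 (the concrete-minus-abstract gap per cluster and the effectiveness ranking per word type) can be reproduced with a short sketch. The variable names are hypothetical, and the inputs are the rounded means from Table 24, so the output is descriptive only and says nothing about statistical significance.

```python
# (abstract mean, concrete mean), each out of 3 points, from Table 24.
TABLE_24 = {
    "PG": (1.42, 2.27),
    "DG": (1.14, 2.17),
    "PA": (1.51, 2.30),
    "DA": (1.21, 2.15),
    "PAGD": (1.43, 2.28),
}

# Concrete-minus-abstract gap per annotation cluster (RQ 3.1.a, descriptive view).
gaps = {c: round(concrete - abstract, 2) for c, (abstract, concrete) in TABLE_24.items()}
largest_gap = max(gaps, key=gaps.get)
smallest_gap = min(gaps, key=gaps.get)

# Effectiveness ranking of the clusters per word type (RQ 3.1.b, descriptive view).
abstract_ranking = sorted(TABLE_24, key=lambda c: TABLE_24[c][0], reverse=True)
concrete_ranking = sorted(TABLE_24, key=lambda c: TABLE_24[c][1], reverse=True)

print("gaps:", gaps)
print("largest gap:", largest_gap, "smallest gap:", smallest_gap)
print("abstract ranking:", abstract_ranking)
print("concrete ranking:", concrete_ranking)
```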


With respect to the second subquestion, RQ 3.1.b, the descriptive rankings regarding the effectiveness of the five annotation clusters were almost identical for the two word types (see Table 24). For abstract words, annotation cluster PA was most effective (50.5%), followed by PAGD (47.7%), PG (47.5%), DA (40.3%), and finally DG (38.0%). For concrete words, the order of the first three annotation clusters was the same as for abstract words. PA (76.6%) was followed by PAGD (75.9%) and PG (75.7%). However, DG (72.5%) was slightly ahead of DA (71.5%) for concrete words. For both word types, the three annotation clusters with a picture (PG, PA, PAGD) ranked above the two annotation clusters without one (DG, DA).

The mean difference between the most and the least effective annotation clusters was bigger for abstract words than for concrete words. For abstract words, the average score under the most effective annotation cluster (PA) was 1.51 points (50.5%) while the average score under the least effective one (DG) was 1.14 points (38.0%), that is, 0.37 points (12.5%) less on average. For concrete words, the most beneficial annotation cluster (PA) scored 2.30 points (76.6%) and the least beneficial one (DA) scored 2.15 points (71.5%), leading to a range of only 0.15 points (5.1%).

Inferential statistics

Hypothesis testing revealed no two-way interaction effect of word type X annotation cluster, F(4, 268) = .828, p = .508, meaning that the mean difference in the immediate posttest scores for abstract versus concrete words is the same, in inferential terms, in each annotation cluster. This finding is visualized by the near-parallelism of the lines for abstract and concrete words in Figure 16.

Subquestion RQ 3.1.a asks whether the effect of word type is different for the five annotation clusters. In answering RQ 3.1.a, the lack of an interaction effect of word type and annotation cluster requires recourse to the inferential results for the main effect of word type (RQ 2). RQ 2 found a main effect of word type on the immediate posttest with abstract words being significantly harder to learn than concrete words (see section 5.2.1). The main effect of word type, coupled with the lack of a statistically significant interaction between word type and annotation cluster, means that the same conclusion is warranted for each annotation cluster in isolation as for the five annotation clusters combined. Accordingly, in each annotation cluster, abstract words are significantly harder to learn than concrete words and, in inferential terms, they are harder by the same margin in each cluster.

Subquestion RQ 3.1.b enquires whether the effect of annotation cluster is different for the two word types. In response to this question, the lack of an interaction between word type and annotation cluster necessitates a look at the inferential results for the main effect of annotation cluster on the immediate posttest (RQ 1.1). Hypothesis testing for RQ 1.1 showed a main effect of annotation cluster with post-hoc analyses revealing two homogeneous subsets. Annotation clusters PA, PG, and PAGD are significantly more effective for L2 vocabulary learning than annotation clusters DG and DA (see section 5.1.1). This main effect of annotation cluster on the immediate posttest, together with the lack of an interaction effect between word type and annotation cluster, signifies that the same conclusion can be drawn for the two word types in isolation as for all words combined. Accordingly, abstract words are significantly easier to learn with annotation clusters PA, PG, and PAGD than with annotation clusters DG and DA. Likewise, annotation clusters PA, PG, and PAGD are significantly more effective than annotation clusters DG and DA for concrete words.

Delayed Posttest

Descriptive statistics

Table 25 and Figure 17 present the mean scores for abstract and concrete words in the five annotation clusters on the delayed posttest. Concerning subquestion RQ 3.1.a, which examines whether the effect of word type is different in the five annotation clusters, the delayed posttest showed that abstract words were harder to retain than concrete words in all five annotation clusters. The mean difference in scores was highest in PAGD and lowest in PG.
In PAGD, the score for concrete words was on average 0.85 points (28.4%) higher than the score for abstract words (1.11 – 0.26). In PG, the difference was only 0.58 points (19.2%) (0.96 – 0.38). Moreover, the mean score obtained for abstract words studied in the most effective annotation cluster for abstract words (PG, 12.7%) was still lower than the mean score for concrete words presented in the least effective annotation cluster for concrete words (DG, 29.6%).

Table 25: Delayed posttest results for abstract vs. concrete words per annotation cluster

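| Annotation cluster | Rank | Word type | Mean | Out of | Percentage | SD |
|---|---|---|---|---|---|---|
| PG (Picture Gloss) | 1 | Abstract | 0.38 | 3 | 12.7% | 0.61 |
| | 4 | Concrete | 0.96 | 3 | 31.9% | 0.77 |
| | | Both | 1.34 | 6 | 22.3% | 1.13 |
| DG (Definition Gloss) | 4 | Abstract | 0.28 | 3 | 9.5% | 0.56 |
| | 5 | Concrete | 0.89 | 3 | 29.6% | 0.75 |
| | | Both | 1.17 | 6 | 19.6% | 1.11 |
| PA (Picture Audio) | 3 | Abstract | 0.30 | 3 | 10.0% | 0.55 |
| | 2 | Concrete | 1.03 | 3 | 34.5% | 0.81 |
| | | Both | 1.33 | 6 | 22.2% | 1.18 |
| DA (Definition Audio) | 2 | Abstract | 0.31 | 3 | 10.4% | 0.53 |
| | 3 | Concrete | 0.97 | 3 | 32.2% | 0.86 |
| | | Both | 1.28 | 6 | 21.3% | 1.13 |
| PAGD (Everything) | 5 | Abstract | 0.26 | 3 | 8.6% | 0.51 |
| | 1 | Concrete | 1.11 | 3 | 37.0% | 0.91 |
| | | Both | 1.37 | 6 | 22.8% | 1.22 |
| All annotation clusters | | Abstract | 1.53 | 15 | 10.2% | 1.76 |
| | | Concrete | 4.96 | 15 | 33.1% | 3.06 |
| | | Both | 6.49 | 30 | 21.6% | 4.43 |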

In view of subquestion RQ 3.1.b, which investigates whether the effect of annotation cluster on vocabulary learning is different for abstract versus concrete words, Figure 17 shows that the rankings of the five annotation clusters in terms of their effectiveness were different for the two word types. For abstract words, annotation cluster PG (12.7%) was most effective, DA (10.4%) was in second place, PA (10.0%) third, DG (9.5%) fourth and PAGD (8.6%) fifth. Whereas PAGD was the least effective annotation cluster for abstract words, it was the most effective annotation cluster for concrete words. For concrete words, PAGD (37.0%) was followed by PA (34.5%), DA (32.2%), PG (31.9%), and finally DG (29.6%). However, there was little variation in mean scores between the highest scoring and the lowest scoring annotation clusters for both abstract and concrete words. Regarding abstract words, learners scored 0.12 points (4.1%) more on average on the most effective annotation cluster (PG) than on the least effective annotation cluster (PAGD). Similarly, for concrete words, the difference between the highest scoring (PAGD) and the lowest scoring (DG) annotation cluster was 0.22 points (7.4%) on average.

Figure 17: Delayed posttest results for abstract vs. concrete words per annotation cluster

Inferential statistics

Hypothesis tests on the delayed posttest data showed no interaction effect of word type X annotation cluster, F(4, 268) = 1.834, p = .122. This means that, in inferential terms, the mean difference in posttest scores between the two word types is the same in each annotation cluster (see Figure 17 for a visualization of this finding).

With regard to RQ 3.1.a, which investigates differences in the effect of word concreteness due to annotation cluster, the lack of an interaction between word type and annotation cluster necessitates a look at the inferential results for the main effect of word type (RQ 2). Hypothesis testing revealed a main effect of word type on the delayed posttest with abstract words being significantly more difficult to learn than concrete words (see section 5.2.1). This main effect of word type, coupled with the lack of an interaction effect, leads to the conclusion that abstract words are significantly harder to learn than concrete words in each annotation cluster.
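What an interaction test evaluates can be pictured descriptively as a "difference of differences": how far each cluster's concrete-minus-abstract gap departs from the average gap. The sketch below, with hypothetical variable names, computes these departures from the rounded delayed posttest means in Table 25. It is only a descriptive illustration; the reported F test additionally takes the variability among learners into account, which cannot be recovered from the cluster means alone.

```python
# (abstract mean, concrete mean), each out of 3 points, from Table 25 (delayed posttest).
TABLE_25 = {
    "PG": (0.38, 0.96),
    "DG": (0.28, 0.89),
    "PA": (0.30, 1.03),
    "DA": (0.31, 0.97),
    "PAGD": (0.26, 1.11),
}

# Word-type gap (concrete minus abstract) within each annotation cluster.
gaps = {c: concrete - abstract for c, (abstract, concrete) in TABLE_25.items()}

# Average gap across clusters; with no interaction, every gap would equal this value.
mean_gap = sum(gaps.values()) / len(gaps)

# Departure of each cluster's gap from the average gap.
deviations = {c: gap - mean_gap for c, gap in gaps.items()}

for cluster, deviation in deviations.items():
    print(f"{cluster:<5} gap={gaps[cluster]:.2f} deviation from mean gap={deviation:+.2f}")
```

Perfectly parallel lines in a plot such as Figure 17 would correspond to all deviations being zero.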


Concerning RQ 3.1.b, which examines differences in the effectiveness of the five annotation clusters due to word type, the lack of an interaction effect between word type and annotation cluster requires revisiting the inferential results for the main effect of annotation cluster (RQ 1.1). Hypothesis testing showed no main effect of annotation cluster for vocabulary learning on the delayed posttest, meaning that all five annotation clusters are equally effective for vocabulary learning seven days post-treatment (see section 5.1.1). The lack of a main effect of annotation cluster, combined with the lack of an interaction effect between word type and annotation cluster, means that seven days after the treatment, all five annotation clusters are equally effective for learning abstract words and all five annotation clusters are equally effective for learning concrete words.

5.3.2 RQ 3.2: Main Effect of Annotation Type Presence in the Two Word Types

RQ 3.2 examines whether vocabulary learning is different in the annotation clusters that contain a given individual annotation type (i.e., picture, definition, gloss, audio) than in the clusters that do not contain that annotation type. Subquestion RQ 3.2.1 explores this question for abstract words and subquestion RQ 3.2.2 for concrete words.

Immediate posttest

Descriptive statistics

For RQ 3.2.1, Figure 18 displays the standardized means of the immediate posttest scores for abstract words studied with annotation clusters with versus without each of the four annotation types (see footnote 32). The learners on average scored 0.29 out of 3 points (9.7%) more when abstract words were studied with pictures (1.46 points) than without pictures (1.17 points). When definitions were present in an annotation cluster (1.26 points), posttest scores were on average lower by 0.21 points (7.0%) than when they were not present (1.47 points). For abstract words learned with a gloss, posttest scores were slightly lower on average (1.33 points) than for abstract words studied without a gloss (1.36 points). The scores for abstract words studied with audio pronunciations (1.38 points) were higher by 0.10 points (3.4%) than the scores for abstract words studied without audio (1.28 points).

Figure 18: Standardized mean scores for annotation presence vs. absence (immediate posttest, abstract words)

Footnote 32: The mean of the three means for annotation type presence, which is out of 9 points per word type and student, and the mean of the two means for annotation type absence, which is out of 6 points per word type and student (see section 4.4.2), are both standardized to a scale of 3 points.

Figure 19: Standardized mean scores for annotation presence vs. absence (immediate posttest, concrete words)

[Figure data; vertical axis from 0.00 to 3.00 points]

| | Picture | Definition | Gloss | Audio |
|---|---|---|---|---|
| Presence | 2.28 | 2.20 | 2.24 | 2.24 |
| Absence | 2.16 | 2.28 | 2.22 | 2.22 |

For RQ 3.2.2, Figure 19 displays the standardized means of the immediate posttest scores for concrete words. The mean score was 0.12 / 3 points (4.0%) higher for words studied with a picture annotation (2.28 points) than for words studied without a picture (2.16 points). Concrete words studied with annotation clusters not containing a definition (2.28 points) were learned more effectively than concrete words studied with annotation clusters containing a definition (2.20 points). The difference in mean scores for words studied in annotation clusters with a gloss or audio annotation compared to words studied in annotation clusters without a gloss or audio annotation, respectively, was minimal (see Figure 19).

Inferential statistics

Table 26: T-tests for annotation presence vs. absence (immediate posttest, abstract words)

| Annotation | Pair | Mean (paired diff.) | SD (paired diff.) | SEM (paired diff.) | t | df | Sig. |
|---|---|---|---|---|---|---|---|
| Picture | (PG, PA, PAGD) - (DG, DA) | .28241 | .59398 | .07000 | 4.034 | 71 | .000 |
| Definition | (DG, DA, PAGD) - (PG, PA) | -.20949 | .67324 | .07934 | -2.640 | 71 | .010 |
| Gloss | (PG, DG, PAGD) - (PA, DA) | -.03009 | .70798 | .08344 | -.361 | 71 | .719 |
| Audio | (PA, DA, PAGD) - (PG, DG) | .10301 | .73538 | .08667 | 1.189 | 71 | .239 |

For abstract words (RQ 3.2.1), the paired samples contrast t-tests revealed a main effect of picture, t(71) = 4.034, p < .001, Hedges' g = 0.48 (see Table 26). This leads to the conclusion that learners score significantly higher on abstract words when they study with the annotation clusters that contain a picture (PG, PA, PAGD) than when they study with the annotation clusters that do not contain a picture (DG, DA). There was also a main effect of definition, t(71) = 2.640, p = .010, Hedges' g = 0.31. The presence of a definition in an annotation cluster (in DG, DA, PAGD) is detrimental to the learning of abstract words. There was no main effect of gloss or audio for abstract words (t(71) = .361, p = .719, and t(71) = 1.189, p = .239, respectively, see Table 26). Thus, the annotation clusters with a gloss or audio annotation are on average as effective as the annotation clusters without a gloss or an audio annotation, respectively, for learning abstract words. The inferential results for abstract words paralleled those for both word types presented in section 5.1.2. For abstract words only, there was also a main effect of picture and a main effect of definition but no main effect of gloss or audio.


Table 27: T-tests for annotation presence vs. absence (immediate posttest, concrete words)

| Annotation | Pair | Mean (paired diff.) | SD (paired diff.) | SEM (paired diff.) | t | df | Sig. |
|---|---|---|---|---|---|---|---|
| Picture | (PG, PA, PAGD) - (DG, DA) | .12269 | .51560 | .06076 | 2.019 | 71 | .047 |
| Definition | (DG, DA, PAGD) - (PG, PA) | -.08565 | .48832 | .05755 | -1.488 | 71 | .141 |
| Gloss | (PG, DG, PAGD) - (PA, DA) | .01852 | .56903 | .06706 | .276 | 71 | .783 |
| Audio | (PA, DA, PAGD) - (PG, DG) | .01852 | .50141 | .05909 | .313 | 71 | .755 |

For concrete words (RQ 3.2.2), there was also a main effect of picture, t(71) = 2.019, p = .047, Hedges' g = 0.24, as shown in Table 27. The annotation clusters with a picture (PG, PA, PAGD) are on average significantly more effective for learning concrete words than the annotation clusters without a picture (DA, DG). With respect to the remaining annotation types, there were no main effects for concrete words (definition: t(71) = 1.488, p = .141; gloss: t(71) = .276, p = .783; audio: t(71) = .313, p = .755, see Table 27). Thus, in inferential terms, the presence or absence of a definition, a gloss or an audio pronunciation, respectively, in the five annotation clusters examined does not have an effect on the learning of concrete words.

Delayed posttest

Descriptive statistics

In reference to the delayed posttest data for RQ 3.2.1, Figure 20 displays the standardized means for the four annotation types for abstract words. For each annotation type, there was little difference in mean scores for words learned with a given annotation compared to words learned without it. For example, words studied in annotation clusters containing a picture (0.31 points) scored on average 0.01 points (0.5%) higher than words studied in annotation clusters without a picture (0.30 points).


Figure 20: Standardized mean scores for annotation presence vs. absence (delayed posttest, abstract words)

Figure 21: Standardized mean scores for annotation presence vs. absence (delayed posttest, concrete words)

For RQ 3.2.2, Figure 21 shows the results for the delayed posttest for concrete words. Learners on average scored 0.11 points (3.6%) higher on concrete words studied with a picture (1.03 points) than on concrete words studied without a picture (0.93 points). Similar results were obtained for the audio annotation (presence: 1.04 points, absence: 0.92 points, difference: 0.11 points (3.8%)). For the definition and the gloss annotation types, the mean scores for annotation presence versus annotation absence were almost identical (0.99 points vs. 1.00 points in both cases).

Inferential statistics

Table 28: T-tests for annotation presence vs. absence (delayed posttest, abstract words)

| Annotation | Pair | Mean (paired diff.) | SD (paired diff.) | SEM (paired diff.) | t | df | Sig. |
|---|---|---|---|---|---|---|---|
| Picture | (PG, PA, PAGD) - (DG, DA) | -.01389 | .39732 | .04682 | -.297 | 71 | .768 |
| Definition | (DG, DA, PAGD) - (PG, PA) | -.05556 | .46113 | .05434 | -1.022 | 71 | .310 |
| Gloss | (PG, DG, PAGD) - (PA, DA) | .00231 | .47901 | .05645 | .041 | 71 | .967 |
| Audio | (PA, DA, PAGD) - (PG, DG) | -.04398 | .45385 | .05349 | -.822 | 71 | .414 |

Table 29: T-tests for annotation presence vs. absence (delayed posttest, concrete words)

| Annotation | Pair | Mean (paired diff.) | SD (paired diff.) | SEM (paired diff.) | t | df | Sig. |
|---|---|---|---|---|---|---|---|
| Picture | (PG, PA, PAGD) - (DG, DA) | .10764 | .57928 | .06827 | 1.577 | 71 | .119 |
| Definition | (DG, DA, PAGD) - (PG, PA) | -.00810 | .58177 | .06856 | -.118 | 71 | .906 |
| Gloss | (PG, DG, PAGD) - (PA, DA) | -.01389 | .58959 | .06948 | -.200 | 71 | .842 |
| Audio | (PA, DA, PAGD) - (PG, DG) | .11343 | .57859 | .06819 | 1.663 | 71 | .101 |

Regarding the presence of the four individual annotations in the five annotation clusters, the complex hypothesis tests revealed no main effects in either word type (see Table 28 for abstract words, RQ 3.2.1, and Table 29 for concrete words, RQ 3.2.2). Thus, seven days after the vocabulary learning session, the presence of a picture, a definition, a gloss, or an audio annotation, respectively, in the five annotation clusters investigated is not significantly more effective on average than its absence for either the learning of abstract words or the learning of concrete words.

5.4 Research Topic 4: Annotations and Individual Learners

5.4.1 RQ 4: Interaction of Annotation Cluster and Individual Learner

RQ 4 examines whether the vocabulary learning effectiveness of the five annotation clusters varies across learners. To answer this question, in experiment 1, each learner studied 6 of the 30 words in each of the five annotation clusters (see section 4.3). An annotation cluster ranking for each learner was obtained by investigating the learner's score (out of 6) for each of the five annotation clusters on each posttest. For each learner, the annotation cluster(s) with the highest scores were taken to be the ones with which s/he learned best (see footnote 33).

Immediate posttest

Table 30 lists the best annotation cluster(s) for all 72 learners on the immediate posttest. The data indicate that annotation cluster effectiveness varies considerably across learners. PA, the annotation cluster that was best for the largest number of learners, only accounted for 20.8% of the sample (15 of the 72 learners). PA was closely trailed by PAGD with 13 learners (18.1% of the sample). PG, DA, and DG followed, and then combinations of two or more annotation clusters for which students scored equally high.

Table 30: Best annotation cluster(s) for each learner on the immediate posttest

| Number of learners | Percentage of sample | Best annotation cluster(s) |
|---|---|---|
| 15 | 20.8% | PA |
| 13 | 18.1% | PAGD |
| 9 | 12.5% | PG |
| 8 | 11.1% | DA |
| 6 | 8.3% | DG |
| 3 each | 4.2% each | PG + PAGD; DA + PAGD |
| 2 each | 2.8% each | DG + PA; DG + PAGD; PA + PAGD; PG + DG + PA + DA + PAGD |
| 1 each | 1.4% each | PG + DG; PG + PA; PG + PA + DA; DG + PA + DA; PG + PA + PAGD; DG + PA + PAGD; PG + PA + DA + PAGD |

Delayed posttest

The data for the delayed posttest, displayed in Table 31, again reveal differences among individual learners with respect to the annotation clusters that best support their vocabulary learning. As on the immediate posttest, annotation cluster PA was the one with which the largest number of learners excelled on the delayed posttest (14 learners, or 19.4%). PA was followed by PAGD with 12 learners (16.7%), DA, PG and DG. Combinations of two or more annotation clusters were varied, as on the immediate posttest.

Footnote 33: Note that due to the small number of test items per word type and annotation cluster for each student (i.e., three abstract words and three concrete words per annotation cluster), variation among learners is not investigated with respect to word concreteness but only for the vocabulary items as a whole.


Table 31: Best annotation cluster(s) for each learner on the delayed posttest

| Number of learners | Percentage of sample | Best annotation cluster(s) |
|---|---|---|
| 14 | 19.4% | PA |
| 12 | 16.7% | PAGD |
| 8 each | 11.1% each | PG; DA |
| 7 | 9.7% | DG |
| 3 | 4.2% | DG + PAGD |
| 2 each | 2.8% each | PG + DG; PG + PA; PG + DA; PG + PAGD; DA + PAGD; PG + PA + DA; PG + DG + PAGD |
| 1 each | 1.4% each | DG + PA; PA + DA; PG + DG + PA; DG + DA + PAGD; DG + PA + DA + PAGD; n/a (score of 0/30) |
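The per-learner step behind Tables 30 and 31 (take a learner's five cluster scores, each out of 6, and keep every cluster tied for the highest score) can be sketched as follows. The function name `best_clusters` and the three learners are hypothetical illustrations, not data from the study. Returning no best cluster for a learner who scored 0 everywhere is an assumption modelled on the "n/a (score of 0/30)" entry in Table 31.

```python
def best_clusters(scores):
    """Return the cluster(s) tied for the highest score, or [] if every score is 0."""
    top = max(scores.values())
    if top == 0:
        # Assumption: a learner with 0/30 overall has no best cluster ("n/a").
        return []
    return sorted(cluster for cluster, score in scores.items() if score == top)


# Hypothetical learners with one score (out of 6) per annotation cluster.
learners = {
    "L01": {"PG": 4, "DG": 3, "PA": 5, "DA": 3, "PAGD": 5},  # tie between PA and PAGD
    "L02": {"PG": 2, "DG": 1, "PA": 1, "DA": 0, "PAGD": 1},  # single best cluster
    "L03": {"PG": 0, "DG": 0, "PA": 0, "DA": 0, "PAGD": 0},  # no best cluster
}

for learner_id, scores in learners.items():
    best = best_clusters(scores)
    print(learner_id, " + ".join(best) if best else "n/a")
```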

Overall, the immediate and delayed posttest data showed that the effectiveness of the five annotation clusters varied across learners. Rather than one annotation cluster emerging as the most effective for the majority of learners, each annotation cluster was an effective learning tool for a considerable number of individuals.

Both posttests

Given the variation among learners, in order to ascertain the relative importance of the five annotation clusters as vocabulary learning tools on an individual learner level, the number of times each annotation cluster was most effective on each posttest was tabulated and then combined for both measures (see Table 32). For example, on the immediate posttest, annotation cluster PG was the only best annotation cluster for 9 students and one of the best annotation clusters for another 10 students (see Table 30) and thus PG was in first place a total of 19 times, or 18% of the time (see Table 32). Although annotation cluster PAGD was most often in first place on the immediate (27%) and the delayed (23%) posttest, it only accounted for approximately one quarter of the data. The other four annotation clusters were also in first place frequently, especially annotation clusters PA and PG. Even the annotation cluster that was most effective for the fewest students appeared in first place at least 14% of the time (DG on the immediate posttest). In sum, when considering the number of times each cluster appeared in first place, the distribution of annotation clusters was fairly even.


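Table 32: Number of times each annotation cluster is in first place

| Annotation cluster | Immediate posttest: Best | Immediate posttest: % | Delayed posttest: Best | Delayed posttest: % | Total: Best | Total: % |
|---|---|---|---|---|---|---|
| PG | 19x | 18% | 21x | 21% | 40x | 19% |
| DG | 15x | 14% | 18x | 18% | 33x | 16% |
| PA | 27x | 26% | 22x | 22% | 49x | 24% |
| DA | 16x | 15% | 17x | 17% | 33x | 16% |
| PAGD | 28x | 27% | 23x | 23% | 51x | 25% |
| Total | 105x | 100% | 101x | 100% | 206x | 100% |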
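The tabulation that leads from Table 30 to the immediate posttest columns of Table 32 can be retraced with a small sketch: every learner contributes one first-place count to each cluster in his or her best set, so learners with ties are counted more than once and the total exceeds 72. The variable names are hypothetical; the rows are copied from Table 30.

```python
from collections import Counter

# (number of learners, best cluster(s)) for each entry of Table 30.
TABLE_30 = [
    (15, ["PA"]), (13, ["PAGD"]), (9, ["PG"]), (8, ["DA"]), (6, ["DG"]),
    (3, ["PG", "PAGD"]), (3, ["DA", "PAGD"]),
    (2, ["DG", "PA"]), (2, ["DG", "PAGD"]), (2, ["PA", "PAGD"]),
    (2, ["PG", "DG", "PA", "DA", "PAGD"]),
    (1, ["PG", "DG"]), (1, ["PG", "PA"]), (1, ["PG", "PA", "DA"]),
    (1, ["DG", "PA", "DA"]), (1, ["PG", "PA", "PAGD"]),
    (1, ["DG", "PA", "PAGD"]), (1, ["PG", "PA", "DA", "PAGD"]),
]

# Count how often each cluster is (one of) the best cluster(s).
first_place = Counter()
for n_learners, clusters in TABLE_30:
    for cluster in clusters:
        first_place[cluster] += n_learners

total = sum(first_place.values())
shares = {cluster: round(100 * count / total) for cluster, count in first_place.items()}

for cluster in ["PG", "DG", "PA", "DA", "PAGD"]:
    print(f"{cluster:<5} {first_place[cluster]}x  {shares[cluster]}%")
print("Total", total)
```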

5.5 Summary of the Findings of Experiment 1

Table 33 provides a summary of the main findings of experiment 1 (statistically significant findings are indicated in bold). The following chapter discusses the results of experiment 1 with respect to the four research topics examined.


Table 33: Summary of the findings of experiment 1

RQ    Effect tested    Posttest    Results

Vocabulary annotations

1.1   Annotation cluster
      I   Main effect of annotation cluster: F(4, 268) = 4.265, p = .002, η² = .016
          Post-hoc: PG, PA, and PAGD more effective than DG, DA
      D   No main effect of annotation cluster: F(4, 268) = .713, p = .584

1.2   Individual annotation type presence
      I   Picture more effective than no picture (p = .000, g = 0.47)
          No definition more effective than definition (p = .006, g = 0.34)
          No effect of gloss or audio presence, respectively
      D   No effect of picture, definition, gloss, or audio presence, respectively

Word concreteness

2     Word type
      I   Concrete words easier than abstract words: F(1, 67) = 234.841, p = .000, η² = .320
      D   Concrete words easier than abstract words: F(1, 67) = 155.037, p = .000, η² = .281

Annotations and word concreteness

3.1   Annotation cluster X Word type
      I   No interaction effect: F(4, 268) = .828, p = .508
          3.1.a) concrete words easier than abstract words in each annotation cluster
          3.1.b) PG, PA and PAGD more effective than DG, DA for abstract words only and for concrete words only
      D   No interaction effect: F(4, 268) = 1.834, p = .122
          3.1.a) concrete words easier than abstract words in each annotation cluster
          3.1.b) all annotation clusters equally effective for abstract words only and for concrete words only

3.2   Individual annotation type presence
      I   3.2.1) for abstract words:
          Picture more effective than no picture (p = .000, g = 0.48)
          No definition more effective than definition (p = .010, g = 0.31)
          No effect of gloss or audio presence, respectively
          3.2.2) for concrete words:
          Picture more effective than no picture (p = .047, g = 0.24)
          No effect of definition, gloss, or audio presence, respectively
      D   3.2.1) for abstract words, and 3.2.2) for concrete words:
          No effect of picture, definition, gloss, or audio presence, respectively

Annotations and individual learners

4     Annotation cluster X Individual learner
      I   Annotation cluster effectiveness varies across learners (20.8% of learners scored highest with PA only, 18.1% with PAGD only, etc.)
      D   Annotation cluster effectiveness varies across learners (19.4% of learners scored highest with PA only, 16.7% with PAGD only, etc.)

Note. RQ = Research question, I = immediate posttest, D = delayed posttest, g = Hedges' g.
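For readers less familiar with the effect size g reported in Table 33, the following is a minimal sketch of how Hedges' g is commonly computed for two independent groups. It is an illustration only: the function name and the input values are hypothetical, and the variant actually used for the within-subjects comparisons in experiment 1 is not specified in this section and may differ.

```python
import math


def hedges_g(mean1, sd1, n1, mean2, sd2, n2):
    """Hedges' g for two independent groups (illustrative sketch only).

    Assumption: independent-groups formula with the common approximate
    small-sample correction; not necessarily the variant used in the study.
    """
    df = n1 + n2 - 2
    # Pooled standard deviation of the two groups
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    cohens_d = (mean1 - mean2) / pooled_sd
    # Approximate correction factor that shrinks d slightly for small samples
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d * correction


# Hypothetical values, not data from experiment 1
g = hedges_g(10.0, 2.0, 20, 9.0, 2.0, 20)
print(round(g, 3))
```

The correction factor approaches 1 as the sample grows, so g and Cohen's d converge for large samples.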


6 DISCUSSION OF EXPERIMENT 1

This chapter discusses the findings of experiment 1. Following the four research topics of this experiment, section 6.1 explores vocabulary annotations, section 6.2 focuses on word concreteness, section 6.3 considers the interaction between annotations and word concreteness, and section 6.4 examines annotations and individual learner differences.

6.1 Research Topic 1: Vocabulary Annotations

In discussing the main findings of experiment 1 with regard to research topic 1, section 6.1.1 examines the effect of annotation cluster and the picture annotation type, section 6.1.2 covers the definition annotation, section 6.1.3 discusses the gloss annotation, and section 6.1.4 investigates the audio annotation.

6.1.1 Annotation Cluster / Picture Annotation Type

Beneficial effect on immediate posttest

With respect to the provision of form, meaning, and use information when learning new L2 words (Nation, 2001), which is one of the design features underlying Voka (see section 3.3), the findings of experiment 1 show that additional meaning information is most beneficial for vocabulary learning: The annotation clusters are split into two effectiveness groups (i.e., homogeneous subsets) according to the picture meaning annotation. The annotation clusters with a picture (PG, PA, PAGD) are more beneficial than the annotation clusters without a picture (DG, DA). However, the findings further reveal that not every kind of meaning information is useful per se. Whereas the picture meaning annotation is beneficial, the definition meaning annotation is detrimental to learning L2 vocabulary in Voka. Finally, experiment 1 also shows that a picture is always helpful for L2 vocabulary learning, independent of other form, meaning, and/or use annotation types (audio, definition, gloss) that are provided alongside the picture in the annotation clusters.

With regard to the effectiveness of picture annotations for L2 vocabulary learning, the findings of experiment 1 thus confirm but also extend previous research, which has generally found vocabulary presented with visual annotations to be learned more effectively than vocabulary studied without visuals (see section 2.3.1, e.g., Akbulut, 2007; Chun & Plass, 1996a; L. Jones & Plass, 2002; Plass et al., 1998; Yoshii, 2006; Yoshii & Flaitz, 2002; Zhuo, 2008). Whereas most previous studies have focused on incidental vocabulary learning while reading or listening (see section 2.3.1), the research with Voka confirms this finding in the context of intentional vocabulary learning with annotation clusters that simultaneously address form, meaning, and/or use aspects of the target words.

In experiment 1, the effect size of the statistically significant main effect of annotation cluster was η² = .016 (see section 5.1.1), meaning that this effect accounted for only 1.6% of the variance in immediate posttest scores while 98.4% of the variance was due to other factors. Furthermore, the mean difference in immediate posttest scores between the three significantly more effective picture-based annotation clusters (PG, PA, PAGD) and the two less effective non-picture annotation clusters (DG, DA) was merely 6.8% (see section 5.1).

Three reasons might explain why the effect of annotation cluster obtained on the immediate posttest was not more pronounced in Voka. First, the annotation clusters (PG, DG, PA, DA, PAGD) overlapped to a considerable extent in the annotation types (audio, picture, definition, gloss) they provided.
Second, it is generally more difficult to detect statistically significant effects for factors with a large number of levels. The annotation cluster factor had five levels (PG, DG, PA, DA, PAGD) and it is likely that the elimination of some of these annotation clusters would have led to more marked differences among the remaining clusters. Third, the scale for scoring the students' posttest answers was coarse-grained with 0.5 point increments, which allowed only 13 scores for each learner in each annotation cluster (0, 0.5, … 5.5, 6; see section 4.4.1). A more fine-grained scoring scale (e.g., with 0.25 point increments) would have produced greater variation in posttest scores for different annotation clusters.

However, the mean difference in immediate posttest scores observed in experiment 1 (6.8%) is similar to statistically significant mean differences reported in previous studies between words studied with pictures (and verbal annotations) and words studied without pictures. For example, in Akbulut (2007), compared to learners receiving only definitions, learners receiving definitions plus visual (picture or video) annotations scored significantly higher by 8% on an immediate meaning recognition posttest and by 13% on an immediate meaning production posttest. In Plass et al. (1998), the statistically significant difference in one-day delayed posttest scores between words studied with verbal plus visual (picture or video) annotations and words studied with verbal only annotations was approximately 8%. In L. Jones and Plass (2002), the learners in the written plus pictorial annotations group significantly outperformed the learners in the written only annotations group on the immediate posttest by 11%. Finally, the statistically significant difference in mean immediate posttest scores between words studied with picture plus definition and definition only was 13% in Chun and Plass (1996a, study 2).

The finding of a beneficial effect of picture annotations for L2 vocabulary learning is in line with Paivio's (1971, 1986) dual coding theory and Mayer's (2001, 2005a) CTML. Based on a dual channel assumption, both theories stipulate that information processed both verbally and visually is learned better than information processed in only one processing channel (see section 2.2).
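The point about the coarse-grained scoring scale made earlier in this section can be checked with simple arithmetic. The sketch below counts the distinct scores available on a 0 to 6 scale for a given increment; the function name is hypothetical and the code is only an illustration of the counting argument, not part of the Voka scoring procedure.

```python
from fractions import Fraction


def score_levels(max_score, step):
    """Return every attainable score from 0 to max_score in steps of `step`."""
    count = int(Fraction(max_score) / step)
    return [i * step for i in range(count + 1)]


half_point = score_levels(6, Fraction(1, 2))     # 0, 0.5, ..., 5.5, 6
quarter_point = score_levels(6, Fraction(1, 4))  # 0, 0.25, ..., 5.75, 6

print(len(half_point))     # 13 levels with 0.5 point increments
print(len(quarter_point))  # 25 levels with 0.25 point increments
```

Exact fractions are used instead of floating-point numbers so that repeated increments cannot introduce rounding errors into the count.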
Confirming the posttest results, on the second evaluation questionnaire (see Appendix E and section 7.2), the participants also frequently commented on the effectiveness of the pictures in helping them remember the vocabulary:³⁴

³⁴ All participants' comments reproduced in this dissertation are unedited.

"The images are very helpful in creating mental pictures one can link to words." voka275

"The pictures were useful (in my mind if I saw the word I could sort of think of the photo and remember the word." voka266

"The pictures with the words definately makes me remember the vocab more." voka264

Finally, it is interesting to note that the picture annotations in Voka were beneficial in spite of the fact that they were not part of the posttests, which assessed written productive recall of the L2 forms by prompting their L1 translations (see section 3.7.4 and section 3.9.1). In contrast to studies which used the visual annotations presented during treatment in subsequent vocabulary posttests (e.g., Al-Seghayer, 2001; Chun & Plass, 1996a, study 3; L. Jones, 2004), experiment 1 with Voka demonstrates the effect of pictures on establishing the link between the meaning of the words and their form in assessment measures where the learning modality (i.e., pictorial annotation) is different from the testing modality (i.e., written text). This effect is also expressed in a comment made by participant voka256 on the second evaluation questionnaire: "Pictures really useful for remember the words. (I recalled pictures which then allowed me to recall the initial letters of words)."

No observed effect on delayed posttest

In Voka, the statistically significant positive effect of learning with the picture-based annotation clusters diminished over time and was thus not observed on the delayed posttest. Descriptively, however, the annotation clusters containing a picture (PA, PG, and PAGD) ranked higher than the annotation clusters without a picture (DA and DG) on both posttests (see section 5.1.1). Given the effect on the immediate posttest, the lack of an effect on the delayed posttest contrasts with research on knowledge acquisition from pictures in general, as expressed by Peeck (1989):

In their review of the literature, Levie and Lentz (1982) listed 24 … studies [comparing retention of text in immediate and delayed recall of text-only and text plus picture conditions], 19 of which showed that pictures helped more in delayed recall than in immediate recall (p. 222). The average facilitation due to pictures was five times greater in delayed testing than in immediate testing. (p. 263) Peeck (1989) immediately qualifies these findings, however, stating that "the comparisons were derived from only 6 studies …, with, in some instances, very brief delays" (p. 263). A stronger benefit of picture annotations on delayed compared to immediate posttests has also been found for L2 vocabulary learning in particular. In a study by Chun and Plass (1996a, study 3), 21 learners of German read a text in which some words were annotated with a definition only, others with a picture and a definition and others with video and a definition. The learners completed an immediate and 1-week delayed multiple choice recognition posttest (see a visual or textual meaning prompt, select German word). Chun and Plass found that the participants scored significantly higher on words they had studied with a definition plus picture on the delayed posttest (mean correct responses: 7.29 / 9, SD: 0.94) than on the immediate posttest (mean: 6.89 / 9, SD: 0.82) but this was not the case for the definition only or definition plus video words. The authors speculate that this might be due to the so-called hypermnesia effect, which predicts better recall of pictures compared to text over time. However, it should be noted that the improvement in scores in the sample was minimal (0.4 / 9 points, or 4.4%). Several differences between Chun and Plass's (1996a) study and experiment 1 with Voka might explain why the hypermnesia effect was not detected in the current study. 
First, the pictures shown during the treatment were also the posttest prompts in Chun and Plass whereas L1 translations were the posttest prompts in the Voka experiment. Second, Chun and Plass tested productive recognition in a multiple choice test which provided the L2 form alongside distractors while the Voka experiment assessed productive recall. These differences might matter, as indicated by another study (N = 36) by Chun and Plass (1996a, study 1), which did not find a significant improvement in scores for words annotated with picture plus definition when comparing an immediate to a two-week delayed posttest. In this study, the researchers assessed receptive recall rather than productive recognition and the treatment pictures were not included in the posttest. Finally, Chun and Plass's study focused on incidental vocabulary learning in a reading context whereas the Voka experiment examined intentional vocabulary learning.

In addition, the design of experiment 1 was particularly conducive to the fading of significant effects because the participants were not exposed to the target words between the tests and were expressly asked not to study the words.

6.1.2 Definition Annotation Type

The inferential statistics performed on the immediate posttest data reveal that the L1 definition of the L2 target words provided in some annotation clusters (DG, DA, PAGD) is detrimental rather than beneficial to learning L2 vocabulary. Descriptively, the presence of a definition was also detrimental on the delayed posttest. However, no significant effect was detectable in the delayed posttest data, indicating that the difference in scores diminished over time. Four possible explanations of the detrimental effect of the definition are discussed in the following, and they relate to the general usefulness of the definition, CTML's modality principle, the preset exposure time in Voka, and relevance to the assessment measure, respectively.

First, taking this finding at face value, students may not have found the definition in Voka useful. This suggestion receives support from the participants' evaluation of the annotation types. On the second evaluation questionnaire (see Appendix E), the 72 Voka participants indicated their agreement / disagreement with four statements suggesting that each annotation type is useful for L2 vocabulary learning on a 5-point Likert scale labelled 1 (strongly agree), 2 (agree), 3 (neither agree nor disagree), 4 (disagree), and 5 (strongly disagree). The definition received the lowest mean usefulness rating (mean: 2.5, SD: 1.21), followed by the gloss (mean: 2.26, SD: 0.98), the picture (mean: 2.01, SD: 1.04), and the audio (mean: 1.57, SD: 0.80).


The definition might have been judged as superfluous by the learners because for most words, the L1 translation, which was part of the default information for every word, did not need clarification. For example, the definition for Anfang (beginning), Anfang: the point at which something starts., expresses in more words the meaning information provided by the L1 translation. However, the L1 definition might help solidify the learners' understanding of the new word by providing additional L1 synonyms. For example, in addition to the word purpose, which appears as the L1 translation of Zweck, the L1 definition, Zweck: a clearly directed intent or use., adds two additional words, intent and use, to the semantic network that learners can now establish for Zweck. The definition can also disambiguate the meaning of the L2 target word in cases where the L1 translation has multiple meanings. For example, the definition of Zustand (state), Zustand: condition something is in at a given time, clarifies that Zustand denotes a condition rather than a region. The second possible explanation relates to the modality principle of CTML. The modality principle states that it is more effective to provide spoken text with pictures than written text with pictures because the former two types of input do not compete for resources in the same processing channel (see section 2.2). More generally, this principle suggests that learning is hindered if several types of input share the same processing route. In Voka, all of the word information apart from the audio (i.e., L2 word, L1 translation, L2 example sentence, L1 gloss, L1 definition, picture) is presented visually to the learners, thus competing for attention in the visual processing channel according to CTML. 
Possibly, the annotation clusters that do not contain a definition are significantly more effective than the annotation clusters with a definition because they reduce the number of items competing for resources in the visual processing channel.

A third explanation concerns the preset exposure time of the test items. In Voka, the flashcards displaying the annotation information advanced automatically in a set time frame (see section 3.9.3). For example, in the second study phase of Voka, the participants had 18 seconds to view each test item. In CALL environments where learners are not able to control the pace of their own learning, the competition for attentional resources might be particularly taxing. In this context, limited working memory capacity (see, e.g., Chun & Payne, 2004) may play a role in that learners might be unable to process the multiple sources of word information in their memory in the limited exposure time provided for each test item given that the definitions are presented alongside additional word information. Indeed, the following comments by two pilot study participants on the evaluation questionnaire express the difficulty of taking in all the word information in a limited time:

"What I think could be of benefit would be extending time limits, as I started to panick at how little time I was given to press my brain to remember. Also, combinations of 2 or 3 help options are good, but all 5 [note: there were only 4 in total] just make the page too busy and I start panicking again because I'm unsure of what I should spend my time really looking at." voka54

"Example sentences: I hardly succumbed to read the sample sentences, not enough time to learn them along with definitions" voka108

Presumably, in annotation clusters without definitions, learners profit from being able to spend more time on average on each remaining item of word information provided for the word (recall that learners saw each word for the same amount of time independent of the annotation cluster associated with the word, see section 3.9.3). Definitions might be more effective in learning environments that allow learners to proceed through a CALL program at a self-selected pace.

Thus, somewhat surprisingly, the findings of experiment 1 show that offering learners more annotation types is not necessarily more effective than giving them fewer annotation types. It is commonly assumed that in the absence of individualized instruction, providing learners with a wealth of information by which a word can be remembered is most effective because it allows learners to choose to focus on the annotation type(s) most helpful for them (L. Jones, 2009; Nation, 2001; Plass et al., 1998). However, as experiment 1 demonstrates, this may not apply to learning environments with brief and computer-controlled exposure time to the target vocabulary.

The fourth potential explanation relates to the assessment measure in Voka. Norris and Ortega (2000) point out that "there can be little doubt that the particular test or measure utilized within a given study plays a central role in observations and eventual interpretations about the effectiveness of L2 instructional treatments" (p. 486) and "interpretations of study findings should be tempered by the realization that a different test type would likely have produced different results" (pp. 486-487) (see also Read, 2000; Schmitt, 2010). It appears that, in contrast to the picture annotation, the definition annotation does not strengthen the link between the L1 translation and the written L2 form, which was assessed on the posttests. If this is the case, the definition can be regarded as extraneous information relative to the assessment context and, according to the coherence principle of multimedia learning, learning is more effective when extraneous information is excluded from multimedia instruction (Mayer, 2005c, see section 2.2). Definitions might address aspects of L2 vocabulary learning that are captured better by posttests asking learners to use target words in context or to describe their meanings as precisely as possible.

In sum, it is likely that definitions of L2 target words, although possibly helpful in general, are detrimental in learning environments like Voka, which provide the majority of word information in a visual (particularly written textual) format, present test items in a predetermined exposure time, and assess vocabulary achievement with a discrete, selective and context-independent written productive recall posttest.

6.1.3 Gloss Annotation Type

The immediate and delayed posttest data of experiment 1 lead to the conclusion that the annotation clusters that contain an L1 gloss of the L2 example sentence (PG, DG, PAGD) are as effective as the annotation clusters without the L1 gloss (PA, DA) for Voka's CAVL environment. The possible explanations of the non-detection of an effect of the gloss annotation are similar to the explanations brought forward regarding the detrimental effect of the definition annotation type (see section 6.1.2). Accordingly, students may not have found the gloss to be beneficial for L2 vocabulary learning. Furthermore, the gloss might have competed for attentional resources with other visual word information provided on the flashcard. Particularly given the limited preset exposure time of the test items in Voka, students might have chosen to focus their attention on information that they deemed more relevant. Finally, the non-detection of an effect of the gloss might also have come about because the information provided by the gloss was not part of the assessment in Voka.

No effect of gloss but definition detrimental

While the findings regarding both the definition and the gloss annotation type prompt similar explanations, it still needs to be explored why the gloss has no statistically significant effect on vocabulary learning while the definition has a detrimental effect on the immediate posttest. A likely explanation for this difference is that the definitions are harder to process than the glosses and thus interfere more with attending to other, more beneficial word information on the flashcards such as the picture annotation. There are four reasons for the presumed difference in processing, which are illustrated by the target word Vogel (bird) (see Figure 22). First, the glosses generally consist of a main clause (e.g., I see a bird in my garden) while the definitions contain complex phrases (e.g., egg-laying creature with feathers, usually capable of flying). Second, the glosses contain the L1 translation of the target word, which the learners are also exposed to, whereas the definitions deliberately do not contain the word used as the L1 translation. Third, the gloss, as a translation of the L2 example sentence (e.g., Ich sehe einen Vogel in meinem Garten), contains only highly frequent vocabulary but the definition often contains words that are used less commonly. Fourth, the meaning of the gloss sentence is also provided in the L2 example sentence while the definition is not a translation of information presented elsewhere on the flashcard.


Figure 22: Target word Vogel (bird) in annotation cluster PAGD

6.1.4 Audio Annotation Type

The immediate and delayed posttest data of experiment 1 reveal that, in inferential terms, studying L2 words with the annotation clusters that contain audio (PA, DA, PAGD) is as effective as studying L2 words with the annotation clusters that do not contain audio (PG, DG). This finding is not only contrary to what one would expect based on theories of multimedia learning and SLA (see section 2.2 and section 2.3.2) but it is also unexpected given the participants' own judgement of the usefulness of audio annotations on the second evaluation questionnaire (see Appendix E). Of the four Voka annotation types (audio, definition, gloss, picture), the participants rated the audio annotation as the most useful one. With a mean usefulness rating of 1.57, audio is the only annotation type for which the mean falls between 1 ("strongly agree" that it is useful) and 2 ("agree").³⁵ The appreciation of the audio annotation as a learning aid is also echoed in students' comments:

"I found the audio of the words very helpful. Hearing a word as well as reading it reinforces the learning." voka229

"I really liked when Voka used audio as a reminder, It helped me to remember the word." voka227

"[I did not like that] some of the vocabularies don't have audio for pronunciation. Audio is very helpful for me to learn." voka216

"I liked that I was able to hear the words spoken as I was studying them." voka251

"The audio was excellent. I remember things better if I hear them, and I always want to hear the correct pronounciation." voka276

Descriptively, it also appears that the audio annotation was effective for L2 vocabulary learning. On average, the participants scored higher on words studied with as opposed to without audio annotations (2.2% higher on the immediate posttest, 1.2% higher on the delayed posttest, see section 5.1.2). Furthermore, not considering the annotation cluster that provided all annotation types (PAGD), within both the two picture annotation clusters (PA, PG) and the two definition annotation clusters (DA, DG), the meaning-form clusters (PA, DA) always outperformed the meaning-use clusters (PG, DG) on the immediate posttest (see section 5.1.1). PA ranked highest, followed by PG, then DA and DG. This suggests an advantage of audio annotations over glosses for vocabulary learning.

However, in inferential terms, audio did not have an effect on L2 vocabulary learning as measured on the posttest. This is particularly surprising given that the audio annotation was the only annotation type in Voka that targeted L2 form, and form recall was assessed on the posttest. However, it is possible that the audio annotation in Voka helped students remember the written L2 form but the commission of phonologically-motivated misspellings due to phoneme-grapheme ambiguities in both English and German might have concealed this positive effect. Phonologically-motivated misspellings come about when the actual or assumed phonology of a word has an influence on its orthography. Spelling in German by L1 English speakers can be influenced by both English and German spelling and phonology (C. James & Klein, 1994).³⁶ Rimrott and Heift (2008) found that 94% of the phonologically-motivated misspellings by their Anglophone beginner learners of L2 German resulted in spellings that deviated by an edit distance of 1 from the correct target words. In Voka, this equates to scoring 0.5 points instead of 1 point for a test item on the posttests (see section 4.4.1).

A distinction also needs to be made between grapheme to phoneme (i.e., spelling to sound) conversions and phoneme to grapheme (i.e., sound to spelling) conversions (see, e.g., Ryan, 1997). In reading the target word and then listening to the audio recording on the Voka flashcards, learners establish spelling to sound connections. Several participants appreciated this help, as illustrated by the following evaluation questionnaire comments:

"I really liked the audio because it helps with proper pronunciation." voka212

³⁵ As anecdotal testament to the perceived usefulness of the audio annotation, the author also noticed that while conducting the experiment, the room was filled with the constant sound of students clicking on the replay button of the audio player with their computer mouse.

³⁶ In German, there are at least two possible spellings for almost every phoneme (Wimmer & Landerl, 1997) and in English, the phoneme-grapheme ambiguities are even greater (Dewey, 1970; Hall, 1961; Wimmer & Landerl, 1997). For example, English students writing in German may choose to represent the phoneme /aj/ with the German graphemes <ei> or <ai> (e.g., mein (my), Saite (string)) or the English graphemes <i>, <y>, <ie>, <igh>, and <i_e> (e.g., hi, my, lie, sigh, rise).

"I liked having the word pronounced, it deepened my understanding of how words and certain letters are pronounced." voka201

"I liked the audio part since it help me with German pronunciation" voka253

"…the program … helped improve my German both with words and with understanding pronunciation" voka235

However, in having to recall the words on the posttest, the process is reversed in that learners have to translate the sound that the audio annotation might have implanted in their memory into the correct spelling of the L2 target word. Given the non-detection of an effect for the audio recording, the participants presumably could not successfully apply the grapheme to phoneme information obtained from the audio annotation to the phoneme to grapheme conversions needed on the posttest. In line with this interpretation, participant voka244 remarked that "the fact that I was able to replay the audio helped me to have the word stick in my mind better, even if I couldn't quite remember the spelling."

Indeed, a closer examination of the immediate posttest data reveals that the participants produced numerous target word misspellings that were likely phonologically-motivated. For instance, the misspelling * for Anfang (beginning) used the English grapheme instead of the correct German grapheme <a> to represent the phoneme /a/. Interestingly, this misspelling was produced by 3 of the 45 participants that studied this word with the audio annotation (i.e., with annotation clusters PA, DA, or PAGD) but by none of the 27 participants that studied the word without the audio (i.e., with annotation clusters PG or DG). Presumably, the audio annotation helped the learners recall the correct pronunciation of the sound (i.e., /a/) but did not aid in choosing the correct grapheme to represent the sound (i.e., <a>). Furthermore, in the absence of audio information, the learners studying with annotation clusters PG or DG might have interpreted the grapheme <a> as the sound /æ/ based on English grapheme-phoneme correspondences³⁷ and thus were shielded from committing the phonologically-motivated misspelling of using the incorrect grapheme on the posttest.³⁸

If audio annotations help decode the spelling and recall of the written L2 form but not encode its spelling in the learners' own production, then audio annotations should boost vocabulary learning scores when assessing the written L2 form with recognition rather than recall test items (i.e., when learners do not have to write the form themselves). This is precisely what Okuyama (2007) found. The researcher notes that a higher frequency of accessing audio recordings of L2 Japanese target words strongly correlated with higher scores on an immediate vocabulary posttest in which the learners received the written L2 forms and had to select the corresponding meaning from among several illustrations. However, given that Okuyama's study did not measure written L2 form recall and experiment 1 of Voka did not assess written L2 form recognition, within-study comparisons of the potentially contrastive benefits of audio annotations for recognition versus recall are not possible.
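The edit-distance observation cited earlier in this section from Rimrott and Heift (2008) can be made concrete with a minimal sketch. It assumes the standard Levenshtein distance (insertions, deletions, substitutions) and a deliberately simplified scoring rule in which an exact match earns 1 point, an answer at edit distance 1 earns 0.5 points, and anything else earns 0; the actual scoring scheme is described in section 4.4.1 and may be more nuanced. The function names and the example misspelling are hypothetical and not taken from the posttest data.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def score_answer(answer, target):
    """Simplified scoring rule (assumption): 1, 0.5, or 0 points."""
    distance = edit_distance(answer, target)
    if distance == 0:
        return 1.0
    if distance == 1:
        return 0.5
    return 0.0


# Hypothetical misspelling that substitutes one letter of the target word
print(edit_distance("Gezetz", "Gesetz"))
print(score_answer("Gezetz", "Gesetz"))
```

Under this simplified rule, a single phonologically-motivated letter substitution halves the score for an item even though the learner has recalled most of the written form.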

³⁷ Note that the participants' textbook, Deutsch: Na Klar! (Di Donato, Clyde, & Vansant, 2008), does not include any information on pronunciation or phoneme-grapheme correspondences in German and it is likely that the learners resort to corresponding L1 knowledge when necessary. As Hamada and Koda (2008) point out:

Inasmuch as cross-language transfer is a factor in virtually every aspect of L2 learning (Gass & Selinker, 1983; Kellerman & Sharwood Smith, 1986), it can be hypothesized that adult L2 learners, who are literate in their native language, make use of their L1 phonological decoding skills during L2 print information processing. Studies conducted to test the hypothesis have repeatedly shown that L2 learners with divergent L1 backgrounds utilize systematically different procedures for phonological information extraction and, more critically, that the observed procedural variation corresponds with differences in L1 orthographic properties (Akamatsu, 1999, 2002; Brown & Haynes, 1985; Gairns, 1992; Koda, 1989, 1990; Muljani, Koda, & Moates, 1998; Wang, Koda, & Perfetti, 2003). (p. 6)

³⁸ As another example, the phoneme /z/ as the third sound of the target word Gesetz (law) was misspelled using the English grapheme <z> by 5 of the 45 learners that studied with audio (* (2x), *, *, *) but by none of the 27 learners that studied without the audio annotation. Again, one can speculate that the learners studying with the audio annotation could recall the correct pronunciation (i.e., /z/) but not the correct spelling (i.e., <s>) whereas learners studying without the audio annotation might have interpreted the grapheme <s> as the sound /s/ and thus did not employ the incorrect grapheme <z> on the test.


Furthermore, if the posttest had focused on aural rather than written L2 form recall, the audio would have likely emerged as more beneficial. Indeed, Sydorenko (2010) found that learners in a video-audio group performed better on an aural than on a written form recognition posttest while the reverse pattern emerged for learners in a video-caption group, supporting the conclusion that "recognition of form is best when modality of input and test modality are the same" (p. 62). Finally, competition for attentional resources in the visual processing channel and the preset exposure time of the flashcards, which were also cited as explanations for the findings regarding the definition and the gloss, are presumably not as influential for the audio annotation because the audio could be attended to in the auditory processing channel while simultaneously processing other word information in the visual processing channel. The following section discusses the results of experiment 1 with respect to word concreteness.

6.2 Research Topic 2: Word Concreteness

RQ 2 investigates whether word concreteness (i.e., abstract vs. concrete) has an effect on L2 vocabulary learning. The immediate posttest data reveal a main effect of word type with abstract words being significantly harder to learn than concrete words. The concreteness effect also persisted over time as the results of the delayed posttest again reveal a main effect of word type with abstract words being significantly harder to remember than concrete words (see section 5.2.1).

Interestingly, the concreteness effect is also apparent when inspecting the individual scores for each learner. On the immediate posttest, 66 of the 72 students (91.7%) had a lower word retention score for abstract words than for concrete words. Three students (4.2%) had equal retention scores and only 3 students (4.2%) achieved 0.5 / 15 points more for abstract words than for concrete words. On the delayed posttest, 68 students (94.4%) scored higher on the concrete words, 3 students (4.2%) received equal


scores for both word types, and only 1 student (1.4%) scored 2 / 15 points higher on the abstract words.

The concreteness effect detected with the Voka learning environment is consistent with the findings of prior research (see section 2.4.1). However, prior research on word concreteness has generally considered L1 – L1 paired associates learning, offline learning, learning of pseudolanguage words, computer-based learning without multimedia, or learning of words without some consideration of form, meaning, and use aspects (see, e.g., De Groot & Keijzer, 2000; Ellis & Beaton, 1993b; Paivio, 1971, 1986; van Hell & Mahn, 1997). Experiment 1 extends the finding of a concreteness effect to intentional foreign language vocabulary learning with a multimedia CAVL program. This result appears to be particularly valid because the mean imageability ratings were determined by 40 German speakers in an independent study (Vö et al., 2006) rather than by a small number of raters assigned to the research project discussed in this dissertation.

However, Laufer (1997; see also Pavičić Takač, 2008) states that "it cannot be claimed that concreteness in itself can assure ease in learning" (p. 150), although she contends elsewhere (Laufer, 1990) that concrete words are likely easier to learn than abstract words if other learnability factors are controlled. By controlling numerous variables that might have an effect on word learnability, experiment 1 provides more conclusive evidence of the concreteness effect in SLA. The 15 abstract and 15 concrete test items in Voka were controlled for the following factors: part of speech, prior learner knowledge, frequency, unfamiliar orthographic characters, cognate status, orthographic or semantic similarity to other test items, inflexional or derivational complexity, semantic features, register restrictions, multiplicity of meaning, C/V ratio, number of letters, consonants, vowels, syllables, and phonemes (see section 3.2). To the author's knowledge, no prior study has controlled as many extraneous variables in an investigation of the learnability of abstract versus concrete L2 words in a multimedia CAVL environment. The following section discusses the interaction between word concreteness and annotation cluster effectiveness in Voka.
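As a quick arithmetic check, the learner-level percentages reported in this section follow directly from the counts out of 72 participants. The following is a minimal sketch; the variable and function names are illustrative and not part of Voka.

```python
# Counts of learners per outcome, as reported for the two posttests.
N_LEARNERS = 72

immediate = {"concrete higher": 66, "equal": 3, "abstract higher": 3}
delayed = {"concrete higher": 68, "equal": 3, "abstract higher": 1}


def shares(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    if sum(counts.values()) != total:
        raise ValueError("counts do not add up to the number of learners")
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


for name, counts in (("immediate", immediate), ("delayed", delayed)):
    print(name, shares(counts, N_LEARNERS))
```

The check that the counts sum to 72 guards against a learner being tallied twice or omitted.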

6.3 Research Topic 3: Annotations and Word Concreteness

RQ 3.1 investigates whether there is an interaction of annotation cluster effectiveness and target word concreteness in L2 vocabulary learning. Specifically, RQ 3.1.a inspects whether the effect of word type varies in the five annotation clusters while RQ 3.1.b examines whether the effectiveness of the annotation clusters varies across the two word types. Furthermore, RQ 3.2 investigates whether L2 vocabulary learning differs when comparing the annotation clusters that contain a given annotation type (i.e., picture, definition, gloss, or audio) to those that do not contain that respective annotation type. This question is examined for abstract words in RQ 3.2.1 and for concrete words in RQ 3.2.2. The following section discusses the word concreteness effect in the annotation clusters (RQ 3.1.a). This is followed by a discussion of annotation effectiveness for abstract and concrete words (RQ 3.1.b, RQ 3.2.1, and RQ 3.2.2) in section 6.3.2.

6.3.1 The Word Concreteness Effect in the Annotation Clusters

The findings of both posttests demonstrate that concrete words are significantly easier to learn than abstract words in all five annotation clusters. Accordingly, the effect of word concreteness (abstract, concrete) is stronger than the effect of annotation cluster (PG, DG, PA, DA, PAGD). Not only are abstract nouns significantly harder to retain than concrete nouns in all annotation clusters but on both posttests, the least effective annotation cluster for concrete nouns was still more effective than the most effective annotation cluster for abstract nouns. Experiment 1 thus provides additional evidence for the prior research finding that concrete words are easier to learn than abstract words. For both immediate and delayed vocabulary learning, the concreteness effect is so strong that the addition of beneficial annotations for learning abstract words cannot counteract this effect.


6.3.2 Annotation Effectiveness for Abstract and Concrete Words

In considering the differences in annotation effectiveness for abstract and concrete words, this section first discusses the picture annotation type, then the definition, followed by the gloss, and finally the audio annotation type.

Picture annotation type

Immediate posttest

The conclusion that annotation clusters PG, PA, and PAGD are significantly more effective than annotation clusters DG and DA for L2 vocabulary learning in general (see section 5.1 and section 6.1), when measured immediately post-treatment, was also drawn when looking at abstract words and concrete words in isolation. For concrete words, given their high imageability, these results are not surprising. Previous research has shown time and again that visuals are effective for learning easily picturable nouns (see section 2.3.1 and section 2.4). For concrete words, the picture is usually representational, that is, it is a fairly direct representation of the noun's referent. As such, the picture largely presents the same information as the L2 target word but in a different modality, thus facilitating retention (Heidemann, 1996; Issing, Hannemann, & Haack, 1989, see also section 2.4.2). For instance, the Voka picture of Gehirn (brain) shows a model of a brain, the picture of Vogel (bird) depicts a bird, and the picture of Kreis (circle) portrays a circle (see Figure 23). In all cases, the connection between the picture and the word is straightforward and easily drawn such that the picture can often be recognized as a depiction of the referent even without knowing in advance what the picture is supposed to illustrate (see Appendix A for more examples).

Figure 23: Voka pictures of the concrete nouns Gehirn (brain), Vogel (bird), and Kreis (circle)


However, even for abstract nouns, the annotation clusters with pictures (PG, PA, and PAGD) are significantly more effective for L2 vocabulary learning than the annotation clusters without pictures (DG, DA). This novel finding extends existing research and is remarkable given the low imageability of abstract words and the fact that because of this, previous studies have generally dismissed the idea that it is possible, let alone potentially beneficial, to create picture annotations for abstract words (i.e., words with a low imageability) (see section 2.4.2). Rather than the representational function typical of pictures of concrete referents, pictures of abstract referents frequently serve an associative or interpretational function, illustrating the text through analogies, visual metaphors, or the like (Heidemann, 1996; Molitor, Ballstaedt, & Mandl, 1989, see also section 2.4.2). For abstract nouns, the connection between the noun and its picture is thus more obscure and indirect than for concrete nouns. Consider the Voka pictures of the abstract nouns Beweis (proof), Zweck (purpose) and Zustand (state) in Figure 24.

Figure 24: Voka pictures of the abstract nouns Beweis (proof), Zweck (purpose), and Zustand (state)

The picture of Beweis (proof) shows a hand holding a blood-covered (o.k., ketchup-covered!) knife in a plastic bag. This picture draws on popular conceptions of crime scenes in that the knife is presumably proof of the commission of a serious crime. The picture of Zweck (purpose) shows rubber boots in a puddle, suggesting that the purpose of the boots is to protect the feet from getting wet. For Zustand (state), the picture depicting an untidy room implies that the room is in a bad state due to this chaos. For abstract words such as these, understanding the connection between the picture and the referent of the L2 target word presumably requires a substantial amount of mental processing and guesswork and may, for some learners and/or for some L2 target

words, never be accomplished. The relationship between the picture and the abstract referent is so indirect that one often cannot identify the referent of the pictures without knowing beforehand what the picture is intended to depict (see Appendix A for more examples). However, notwithstanding the indirect and oftentimes obscure connection between the picture and the referent for abstract nouns, the findings clearly show that pictures are also effective for the recall of abstract words.

Picture effect stronger for abstract than for concrete words

What is even more astonishing than the fact that pictures are effective for learning abstract words is that in Voka, pictures are actually more effective for abstract than for concrete words. The effect size for the benefit of annotation clusters containing pictures compared to annotation clusters without pictures is twice as large for abstract words (Hedges' g = 0.48) as for concrete words (Hedges' g = 0.24). Descriptively, with picture annotations, learners scored on average 9.7% higher for abstract words and only 4.0% higher for concrete words (see section 5.3.2).

The findings of experiment 1 are compatible with Paivio's (1971, 1986) dual coding theory in that words studied with picture annotations are learned more effectively than words studied without picture annotations, and this applies to both abstract and concrete words. Moreover, the concreteness effect also holds in that low imageability words are more difficult to learn than high imageability words. Note also that an additional explanation of the concreteness effect might stem from the context availability hypothesis (see section 2.4.1), which relates the advantage of concrete over abstract words to the presumed greater activation of verbal contexts for concrete words.
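For reference, the effect sizes of 0.48 and 0.24 reported above are standardized mean differences. The block below gives the common textbook definition of Hedges' g for two groups as a sketch only; which variant was computed for the Voka data is not restated in this section, and the subscripts P (clusters with pictures) and N (clusters without pictures) are labels assumed here for illustration.

```latex
% M = mean posttest score, s = standard deviation, n = number of observations
% Subscripts P / N (with / without picture) are illustrative labels.
\[
  g = J \cdot \frac{M_P - M_N}{s_{\mathrm{pooled}}},
  \qquad
  s_{\mathrm{pooled}} = \sqrt{\frac{(n_P - 1)\,s_P^2 + (n_N - 1)\,s_N^2}{n_P + n_N - 2}},
  \qquad
  J \approx 1 - \frac{3}{4\,(n_P + n_N) - 9}
\]
```

Under this definition, the ratio 0.48 / 0.24 = 2 means that the picture advantage, expressed in pooled standard deviation units, is twice as large for abstract as for concrete words.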
However, although a presumed visual coding advantage of concrete over abstract words is present, the fact that pictures are more effective for abstract than for concrete words needs explanation that goes beyond Paivio's (1971, 1986) dual coding theory. It is possible that picture processing of abstract referents cognitively engages the learner more than picture processing of concrete referents (see also Schmitt, 2008, 2010). This explanation is


elaborated in the following by referring to two hypotheses: 1) the levels of processing hypothesis, and 2) the involvement load hypothesis.

The levels of processing hypothesis

The levels of processing (or depth of processing) hypothesis, originally proposed by Craik and Lockhart in 1972 (Craik, 2002; Craik & Lockhart, 1972; Lockhart & Craik, 1990), maintains that retention of knowledge in memory is a function of the depth of processing (i.e., degree of elaboration) of the learning stimuli in that stimuli processed more deeply are retained better than stimuli processed less deeply. Examples of shallower processing include analyses of surface form, colour, loudness, and brightness while deeper processing entails analyzing meaning, inference, interpretation, and implication (Craik, 2002; Lockhart & Craik, 1990). Furthermore, with respect to L2 vocabulary learning in particular, Schmitt (1997) notes that images entail elaborative mental processing of the form envisioned by the levels of processing hypothesis. The amount of attention devoted to a stimulus is one of the factors involved in determining its depth of processing (Craik & Lockhart, 1972). Greater depth of processing of a stimulus implies a greater degree of cognitive or semantic analysis, which leads to a more persistent trace in memory and hence better retention (Craik & Lockhart, 1972). Accordingly, "the more effort a learner puts into figuring out a meaning and how to retain it ('depth of processing'), the more likely it is to be remembered (Loucky, 2006)" (Godwin-Jones, 2010, p. 6).

The depth of processing hypothesis can be applied to the present findings in that learners have to devote considerably more attention to analyzing the more interpretational pictures of abstract nouns than to analyzing the generally representational pictures of concrete nouns.
For example, little attention is necessary to realize that the bird depicted in the picture for Vogel (bird) is a representation of a bird (see Figure 23). However, drawing a connection between the rubber boots in the puddle and the target word Zweck (purpose) requires more cognitive resources (see Figure 24). The comparatively more difficult task of drawing the connection between the abstract word and the picture might have resulted in students spending more time and more cognitive


resources on processing the picture, which might imply more mental involvement or a deeper level of processing. This might explain the larger benefit of pictorial annotations for learning abstract compared to concrete words.

While the levels of processing hypothesis makes instant intuitive sense, one of its lasting criticisms is the danger of circular reasoning. In the absence of an independent index of depth, it is easy to claim that something that has been remembered better must have been processed more deeply because deeper processing leads to better remembering (Craik, 2002; Lockhart & Craik, 1990). Craik (2002) concedes that "the concept of depth clearly requires much greater specification" (p. 315). Naturally, the potential for circularity is also present when applying the hypothesis to the data collected in experiment 1 with Voka. One might claim that pictures of abstract referents are processed more deeply than pictures of concrete referents because pictures are found to be more beneficial for abstract than for concrete words. Nonetheless, within the same domain, that is, when processing associations between a picture and its target word, it seems reasonable to assume that it requires more attention, and hence a deeper level of processing, to work out the less direct link between a picture and its abstract referent than to comprehend the more direct association between a picture and its concrete referent.

The involvement load hypothesis

Along similar lines, the findings might also be explained by referring to the involvement load hypothesis, which was first formulated by Laufer and Hulstijn in 2001 (Hulstijn & Laufer, 2001; Laufer & Hulstijn, 2001). The involvement load hypothesis maintains that the amount of involvement induced by a language learning task predicts word retention.
One of the central assumptions of the hypothesis is that "other factors being equal, words which are processed with higher involvement load will be retained better than words which are processed with lower involvement load" (Laufer & Hulstijn, 2001, p. 15). The involvement load of a task is determined by its varying degrees of need, search, and evaluation. Need refers to the learners' (externally imposed or self-imposed) motivation to fulfill a task. Search is the attempt to establish a form-meaning link for an

L2 word. Evaluation entails assessing the fit of a word (i.e., form and meaning) in a given context. The degree of need, search, and evaluation can vary for different tasks, ranging from absent via moderate to strong. For example, in a reading comprehension task where target words that are not relevant to comprehension are glossed in a text, need, search, and evaluation are absent because the learners do not need the words for text comprehension, do not have to search for the meaning of the words, and do not have to evaluate the fit of different possible meanings of the word. However, if instead of single glosses multiple-choice glosses are provided (i.e., the correct translation is given alongside distractors), readers have to engage in evaluation to decide which gloss is appropriate in the reading context (Rott, 2005). Rott's study suggests that compared to single glosses, multiple-choice glosses lead to more robust form-meaning connections because of their higher task-induced involvement load. 39

In applying the involvement load hypothesis with its three dimensions of need, search, and evaluation to the present experiment, for learning both abstract and concrete words, the degree of need is identical: The participants had an externally imposed need to learn the target words. The degree of search is also the same: The participants received all the necessary information on the Voka flashcards and did not have to search for additional information elsewhere. However, the degree of evaluation required to process the pictures is presumably much higher for abstract than for concrete words. For concrete words, only a limited degree of evaluation is necessary because the picture is generally a clear representation of the target concept. For abstract words, however, learners have to

39

As a further example, in a writing composition task where learners are provided with L1 concepts to include in their essays, need, search, and evaluation are present because the learners need the L2 words to write the composition, they have to search for L2 forms to express the concepts (e.g., in a dictionary), and they have to evaluate which L2 form is most appropriate in the context.


generate diverse hypotheses about how the picture might relate to the target word and then evaluate which of their hypotheses (if any) is appropriate in the context. 40 The dimension of evaluation in the involvement load hypothesis was conceived to refer to an evaluation of a given target word in a linguistic context and clearly, the application of the hypothesis to picture processing is an extension of Laufer and Hulstijn's (2001) original idea. However, their hypothesis can readily be applied to the present data because a picture can be regarded as a form of non-linguistic context in which the appropriateness of a target word needs to be evaluated. This process is similar to evaluating which gloss is appropriate when multiple-choice glosses are provided, as in Rott's (2005) study.

In Voka, the higher degree of evaluation imposed by picture processing for abstract words, in combination with identical degrees of need and search for both abstract and concrete words, leads to a higher task-induced involvement load for processing picture-based annotation clusters of abstract compared to concrete words. This might explain why learning with pictures is more beneficial for abstract than for concrete words in Voka. The fact that the scores for abstract words are still considerably lower than the scores for concrete words despite the higher task-induced involvement for processing pictures associated with abstract words can be explained by the inherent differences in learnability of the two word types (i.e., the concreteness effect). Almost paradoxically then, target word imageability appears to exert two opposing influences on learning L2 vocabulary with pictures.
Imageability on its own, that is, independent of the processing of pictures associated with the target words, affects vocabulary learning in that low imageability words are much more difficult to learn than high imageability words and this concreteness effect has also been demonstrated for Voka (see section 5.2 and section 6.2). At the same time, however, it is speculated here 40

For instance, regarding the pink rubber boots in the puddle as a picture of Zweck (purpose) (see Figure 24), learners might form hypotheses such as the following: 1) Boots are worn with a purpose, 2) The purpose of little feet (= children) is to grow up one day and do something good, 3) The purpose of rubber boots is to protect feet from water [the author's intended meaning], 4) The purpose of water is to give life, 5) The purpose of this picture is to get me thinking about something, and 6) This picture is completely unrelated to the L2 target word.


that target word imageability influences the involvement load presented by an L2 vocabulary learning task that shows target words with associated pictures. Compared to high imageability words, low imageability words require a higher degree of evaluation when processing the association between picture and target word, which leads to a higher degree of task-induced involvement load and therefore a greater boost in vocabulary learning scores. Evidently, the effect of imageability on vocabulary learning in general is stronger than the supposed effect on task-induced involvement load during picture processing, which explains the overall higher scores for concrete compared to abstract words in experiment 1. Thus, while the higher task-induced involvement entailed in the picture processing of abstract words leads to a larger boost in scores, it cannot eliminate entirely the concreteness effect introduced by the low imageability of the words in the first place.

Processing of irrelevant pictures

The discussion up to this point has worked with the underlying assumption that learners are ultimately successful in creating a meaningful association between the picture and its abstract referent. However, even when a learner concludes that the picture is not related to the L2 target word, the picture might still help in recalling the word. The learner presumably has to spend some time and cognitive resources to consider a picture in order to come to the decision that it is not related to a given word. Therefore, it arguably still requires a deeper analysis (in terms of levels of processing) and more evaluation (in terms of involvement load) to conclude that a picture is unrelated to an abstract referent than to confirm that a picture is a representation of a concrete referent.
The idea that semantically unrelated pictures are still useful mnemonic aids is nicely expressed by voka12, a pilot study participant, who thought that "the pictures were useful" but also commented that "sometimes the picture did not relate well enough to the word. … (while I don’t think the picture of pink boots related all that much to 'purpose' the picture was memorable)."
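The involvement load comparison developed in this section can be sketched numerically. The sketch below assumes the scoring commonly associated with Laufer and Hulstijn's (2001) proposal, in which each component is rated 0 (absent), 1 (moderate), or 2 (strong) and the three ratings are summed. The specific ratings assigned to the Voka flashcards are illustrative assumptions made here, not values reported for experiment 1, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Degree(IntEnum):
    """Degree to which a component is present in a task."""
    ABSENT = 0
    MODERATE = 1
    STRONG = 2


@dataclass(frozen=True)
class Task:
    need: Degree
    search: Degree
    evaluation: Degree

    @property
    def involvement_index(self) -> int:
        # Sum of the three component ratings.
        return int(self.need) + int(self.search) + int(self.evaluation)


# Assumed ratings: need is externally imposed (moderate) and search is
# absent for both word types; only evaluation differs between them.
concrete_with_picture = Task(Degree.MODERATE, Degree.ABSENT, Degree.MODERATE)
abstract_with_picture = Task(Degree.MODERATE, Degree.ABSENT, Degree.STRONG)

for label, task in (("concrete", concrete_with_picture),
                    ("abstract", abstract_with_picture)):
    print(label, task.involvement_index)
```

Because need and search are held equal, any difference in the index between the two word types comes from the evaluation rating alone, which mirrors the argument made in the text.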


In fact, it is known that visuals that are semantically unrelated to new L2 target words can still serve as powerful learning aids. This is emphasized by Godwin-Jones (2010) who describes how new L2 words can be associated with unrelated visuals: This involves linking in one’s mind new vocabulary to something concrete and familiar to the learner, such as items in a room in one’s home. The familiar locale provides a memory hook which can be used to retrieve linked items by perusing mentally the trajectory through that physical space. This is, in fact, the classical ars memoriae or “method of loci” evoked by Cicero in De oratare, used by classical and medieval scholars to remember speeches and for aid in recalling all kinds of systemizable knowledge. It was famously used by the Jesuit Matteo Ricci in 16th century China to help prepare candidates in learning language and culture for the all-important imperial exam. In our day, it is known to be used by winners of memory contests. The technique goes by a variety of names including the Roman Room, the Peg System, and the Nook and Cranny method; also used is the term “journeys,” as trips through familiar scenes can also be used as a pegging mechanism. (pp. 5-6) In sum, it is possible that pictures are more effective for abstract nouns not in spite of the indirect connection between picture and referent but precisely because of it. Pictures of concrete nouns serve a largely representative function, which facilitates an obvious and direct association between the picture and the concrete noun. Therefore, they are presumably processed on a shallow level, in terms of the levels of processing hypothesis, and with little evaluation required, in terms of the involvement load hypothesis. In contrast, pictures of abstract nouns often have an interpretational function, which entails a more obscure and indirect link between the picture and the abstract referent. 
They therefore might be processed on a deeper level, in terms of levels of processing, and with more evaluation required, in terms of involvement load. Furthermore, it is speculated that even pictures that learners ultimately deem semantically unrelated to their abstract referents are more deeply processed and


thoroughly evaluated. Deeper processing and/or higher involvement load predicts a greater effectiveness of pictures for abstract compared to concrete nouns. 41

Delayed posttest

In contrast to the findings of the immediate posttest, there were no significant differences in terms of annotation cluster or annotation type effectiveness for the two word types on the delayed posttest. Possible reasons for the non-detection of effects on the delayed posttest have already been discussed for L2 vocabulary learning in general (see section 6.1) and these are assumed to apply to concrete and abstract words in isolation as well. However, in addition to these reasons, the delayed posttest scores for abstract words might have been too low to detect differences among annotation clusters or annotation types. The mean score for abstract words on the delayed posttest was 1.53 / 15 points (10.2%, SD 1.76), with a range of 0 to 7.5 points. These numbers possibly suggest a floor effect, indicating that the task might have been too difficult for the learners, in which case no significant effects could have been detected.

Definition annotation type

While there is no significant effect of the definition on the delayed posttest for either word type, on the immediate posttest, the definition is detrimental to the learning of abstract words and shows no statistically significant effect regarding the learning of concrete words. Because the abstract words are inherently hard to learn, the addition of a definition in the annotation clusters may lead to cognitive overload, particularly given the preset exposure time of the L2 test items. For instance, learners studying with annotation cluster PAGD, in which both the picture and the definition are present, have to split their

41

As a sidenote, in creating the materials for Voka, the author of this dissertation could usually easily generate picture ideas for the concrete nouns but it took considerably more effort, and often the help of family and friends, to generate and evaluate picture ideas for the abstract nouns. On a purely personal and descriptive level, this experience further supports the idea that a deeper level of processing and a higher task-induced involvement load might be involved in associating pictures with abstract as opposed to concrete nouns.


time between deciphering the association between picture and abstract referent and attending to the definition. In contrast, the definition may not impair the learning of concrete words because of the inherent ease of learning these words. A closer look at the descriptive statistics for the immediate posttest provides some support for this contention. First, the particular annotation cluster with which a learner was studying was more influential for abstract than for concrete words. For abstract nouns, the mean difference in posttest scores between the most effective annotation cluster (PA, 1.51 / 3 points) and the least effective cluster (DG, 1.14 points) was 12.5% while for concrete nouns, the most effective annotation cluster (PA, 2.30 points) outperformed the least effective cluster (DA, 2.15 points) by only 5.1% (see section 5.3.1). Second, the relative ease of learning concrete words is also highlighted by the fact that the mean posttest score for concrete words was 11.17 out of 15 points (74.4%, see section 5.2.1) with one third of the 72 participants (24 / 72) achieving 13 or more out of 15 points.

Gloss annotation type

On both the immediate and the delayed posttest, there is no statistically significant effect of the gloss for either abstract words or concrete words. As a way of confirming L2 meaning in context, glosses were expected to be particularly useful for abstract words because their meaning is arguably more elusive than the meaning of concrete words. For example, according to the context availability hypothesis (e.g., Schwanenflugel, Akin, & Luh, 1992), it is more difficult to access contextual information for abstract words like Schuld (blame) or Ablauf (procedure) than for concrete words such as Vogel (bird) or Regen (rain). One might speculate that the L1 glosses allow learners to establish correct inferences regarding the usage of abstract words.
However, even when inspecting the descriptive statistics of the Voka posttests, there is no discernable advantage of the annotation clusters with a gloss compared to the annotation clusters without a gloss, overall and for either word type. Possible reasons for the non-detection of an effect of gloss presence were discussed in section 6.1.3.


Audio annotation type

Experiment 1 found that for both abstract and concrete words, the inclusion of the audio annotation in the five annotation clusters investigated has neither a positive nor a negative effect on immediate and delayed L2 vocabulary retention. It is not surprising that there is no difference between abstract and concrete words in this regard because phonographic complexity factors such as the number of phonemes and syllables and word length of the two word types were tightly controlled in Voka (see section 3.2). However, one might expect audio annotations to be more useful for abstract words in general because abstract words appear to be slightly longer than concrete nouns (De Groot, Dannenburg, & van Hell, 1994; van Hell & Mahn, 1997), and thus their form is presumably harder to process. For example, in the Berlin Affective Word List Reloaded (BAWL-R, see Vö et al., 2009), a more comprehensive version of BAWL (see Vö et al., 2006, and section 3.2), the 452 abstract German nouns (i.e., imageability 1.0 – 3.0) have a mean length of 5.8 phonemes (SD: 1.3), 6.6 letters (SD: 1.4), and 2.3 syllables (SD: 0.6). The 850 concrete German nouns (i.e., imageability 5.0 – 7.0) have a mean length of 5.2 phonemes (SD: 1.4), 5.9 letters (SD: 1.5), and 2.0 syllables (SD: 0.7). The following section discusses the results of experiment 1 with respect to annotations and individual learners.
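To gauge how large the reported length differences between abstract and concrete nouns are, they can be expressed as standardized mean differences computed from the BAWL-R summary statistics quoted above. This is a rough sketch: it uses the rounded means and SDs as printed, assumes the pooled-SD form of Cohen's d, and the function and variable names are illustrative.

```python
from math import sqrt


def pooled_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference (Cohen's d) from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


N_ABSTRACT, N_CONCRETE = 452, 850

# (mean, SD) for abstract nouns, then (mean, SD) for concrete nouns.
measures = {
    "phonemes": ((5.8, 1.3), (5.2, 1.4)),
    "letters": ((6.6, 1.4), (5.9, 1.5)),
    "syllables": ((2.3, 0.6), (2.0, 0.7)),
}

effect_sizes = {
    name: pooled_d(a_mean, a_sd, N_ABSTRACT, c_mean, c_sd, N_CONCRETE)
    for name, ((a_mean, a_sd), (c_mean, c_sd)) in measures.items()
}

for name, d in effect_sizes.items():
    print(f"{name}: d = {d:.2f}")
```

With these rounded inputs, each difference comes out at somewhat less than half a pooled standard deviation, which is in line with describing abstract nouns as only slightly longer.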

6.4 Research Topic 4: Annotations and Individual Learners

6.4.1 Inter-learner Variation

RQ 4 examines whether the effectiveness of the annotation clusters varies across learners. As demonstrated on both posttests, there is indeed considerable inter-learner variation regarding the annotation clusters that best support vocabulary learning. There is no single annotation cluster that is most effective for an absolute majority of learners. Rather, the results are much closer to an even distribution across all annotation clusters. For instance, on both the immediate and the delayed posttest, annotation cluster PA received the relative majority with 20.8% and 19.4%, respectively, of the participants achieving the highest score with PA only (see section 5.4).

On an individual learner basis, all five annotation clusters are effective aids for L2 vocabulary learning. For instance, when considering the number of times each annotation cluster was in first place for a learner, the combined results of both posttests show that annotation cluster PAGD obtained the relative majority with 25% while the other four annotation clusters combined accounted for 75% of the first places (see section 5.4.1). Thus, although PAGD was best overall, the vast majority of students learned best with an annotation cluster other than PAGD. This confirms the finding expressed in section 6.1.2, in that, in Voka, most students learn L2 vocabulary better with flashcards showing fewer annotation types than with flashcards showing all annotation types. The inter-learner variation regarding annotation cluster effectiveness on the posttests was also mirrored in the students' learning preferences as expressed on the two evaluation questionnaires (see Appendix D and Appendix E). For example, 49 of the 72 learners (68.1%) regarded PAGD as the most useful or as one of the most useful annotation clusters while 7 learners (9.7%) considered it least effective. Furthermore, voka219, for instance, ranked PA as the most useful and DG as the least useful annotation cluster whereas voka265 indicated the opposite: She regarded PA as least useful and DG as most useful (see section 7.3.1 for further details on the evaluation questionnaire rankings). The learners also evaluated the four individual annotation types quite differently. For instance, voka202 strongly agreed that the definition was useful and strongly disagreed that the picture was useful while voka234 expressed the opposite view. Several comments on the second evaluation questionnaire further demonstrate that the learning preferences of some learners contradict the learning preferences of others: "In my opinion, sometimes 'less' is 'more'. 
Rather than having an overcrowded interface, simplifying it may be more beneficial to learning these vocabs better" voka108

"I liked how each flashcard gave you many different ways to learn the vocabulary" voka230

"cards with all the possible information was often distracting (focus on picture, audio, etc. rather then word & definition)" voka202

"I liked the variety of study tools." voka249

"[I liked] the example sentences [but] the definition does not help much" voka245

"I think the sentences were rather redundant and not particularly helpful - definitions and audio were good though more of that!" voka247

6.4.2 A Need for Individualized Instruction

The results of experiment 1 of Voka call for individualized vocabulary instruction: Annotation cluster effectiveness varies considerably across learners. Depending on the student, annotation cluster PG, DG, PA, DA or PAGD (or a combination thereof) is most effective for L2 vocabulary learning, implying that students would benefit from individualized instruction based on learner performance with and preferences for different annotation clusters. However, in addition to the variation among learners regarding the annotation cluster that best supports their learning (i.e., inter-learner variation), a further factor to consider when contemplating individualized instruction is the variation in annotation cluster effectiveness within each individual learner (i.e., intra-learner variation). This intra-learner variation can be captured by examining each learner's range of scores for different annotation clusters. For each student and each posttest, the range was determined by subtracting the learner's score on his or her least effective annotation cluster from the learner's score on his or her most effective annotation cluster. For example, on the immediate posttest, voka222 scored 5 points in his best annotation cluster and 1 point in his worst one, making his range 4. The highest possible range was 6, the lowest possible range was 0.
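The range calculation just described can be illustrated with a short Python sketch. The function name score_range and the dictionary layout are illustrative assumptions and not part of Voka. The best score (5) and worst score (1) are those reported for voka222 on the immediate posttest; the three intermediate scores and the assignment of scores to particular clusters are hypothetical placeholders.

```python
def score_range(cluster_scores):
    """Return the best-minus-worst score across annotation clusters for one posttest."""
    values = list(cluster_scores.values())
    return max(values) - min(values)


# voka222, immediate posttest: best cluster 5 points, worst cluster 1 point (from the text).
# Which cluster earned which score, and the three middle values, are hypothetical.
voka222 = {"PG": 3.0, "DG": 1.0, "PA": 5.0, "DA": 2.5, "PAGD": 4.0}

print(score_range(voka222))  # 4.0
```

Because each annotation cluster covered 6 points per posttest, the value returned for a learner necessarily lies between 0 and 6.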

The more prominent the difference in performance between the most and the least effective annotation cluster is, the more important it is to provide individualized instruction considering the learner's most effective annotation cluster(s) to optimize performance. For learners with a high range value, annotation cluster has a strong effect, whereas for learners with a low range value, the effect of annotation cluster is weaker. For instance, a student like voka233 would benefit from individualized instruction that enables him to study all target words with annotation cluster PA. Voka233 scored 5 / 6 points on the immediate posttest and 4.5 / 6 points on the delayed posttest with PA, his best annotation cluster, but only 1 / 6 points on the immediate posttest and 0.5 / 6 points on the delayed posttest with DG, his worst annotation cluster. This translates into a range of 4 on each posttest. In contrast, for a student like voka237, individualized instruction with respect to annotation cluster might not be as important because this student displays a fairly even performance with each annotation cluster. On the immediate posttest, voka237 scored 2 points with his best annotation cluster (PA) and 0.5 points with his worst cluster (DA), which corresponds to a range of 1.5. On the delayed posttest, he scored 0.5 points with his best cluster (PA) and 0 points with the other clusters, which equals a range of 0.5. Accordingly, to further evaluate the need for individualized instruction in Voka, calculations of the range of scores on both posttests are required. On the immediate posttest, a 5.5 point difference between best and worst annotation cluster was the highest range of scores while 0 was the lowest range (for two students who obtained 30 out of 30 points). 
The mean range between the learners' best and worst annotation clusters for the 72 students was 2.30 points (SD: 1.04), that is, with their best annotation cluster learners scored 2.30 points (38.3%) higher than with their worst annotation cluster on average. On the delayed posttest, the highest range was 4 and the lowest 0 points. The mean range was 1.76 points (i.e., 29.3%; SD: 0.92). The highest actual range, the mean range and the SDs on the two posttests provide evidence of considerable intra-learner variation with regard to the effectiveness of annotation clusters. Combined with the inter-learner variation in annotation cluster effectiveness discussed in section 6.4.1, this strongly suggests that in CAVL programs like Voka, individualized instruction would significantly boost learner performance compared to generic, non-individualized instruction. Accordingly, experiment 2, which examines the effect of presentation sequence on vocabulary learning, provided learners with an individualized learning environment that was based on learner performance and preference regarding the five annotation clusters in experiment 1. The following chapter describes the methodology of experiment 2.


7 METHODOLOGY OF EXPERIMENT 2

The second experiment conducted for this dissertation investigated the effect of presentation sequence of annotation clusters (i.e., fixed vs. alternating) on L2 vocabulary learning with a between-subjects design involving two treatment groups (FIX and ALT). By taking into account the results of experiment 1, experiment 2 provided an individualized learning environment. Each learner in the FIX group received a fixed presentation sequence of their best annotation cluster for all 28 test items of part 2. The learners in the ALT group received an alternating presentation sequence of their two best annotation clusters by studying half the target words in one annotation cluster and the other half in the other annotation cluster. This chapter describes the methodology of this experiment. Section 7.1 describes the study participants and the assignment of the participants to the two treatment groups. Section 7.2 provides the timeline of the experiment in fall 2009. Section 7.3 explains the design of experiment 2 with respect to the two presentation sequences and the annotation clusters provided for each learner in each treatment group. Section 7.4 describes the data analysis and statistical procedures of experiment 2.

7.1 Study Participants

Experiment 2 was conducted with 68 of the 72 participants from experiment 1 (see section 4.1), who were divided equally into two treatment groups (n = 34 in FIX, n = 34 in ALT).42 Based on the pretest, none of the 68 students knew any of the words contained in their treatment in experiment 2. In addition, none of them had studied the test items between the two posttests. Table 34 provides more details about the participants.

Table 34: Characteristics of participants of experiment 2

Characteristic | Both groups | FIX group | ALT group
Number of students | N = 68 | n = 34 | n = 34
Age: range | 17 - 47 years | 17 - 28 years | 18 - 47 years
Age: mean | 20.8 years | 20.6 years | 21.0 years
Gender: female | 45 students | 21 students | 24 students
Gender: male | 23 students | 13 students | 10 students
Status: undergraduate | 65 students | 33 students | 32 students
Status: graduate student | 3 students | 1 student | 2 students
English proficiency (a): native language | 47 students | 23 students | 24 students
English proficiency (a): native-like | 7 students | 3 students | 4 students
English proficiency (a): advanced | 12 students | 6 students | 6 students
English proficiency (a): intermediate | 2 students | 2 students | 0 students
English proficiency (a): advanced beginner | 0 students | 0 students | 0 students
English proficiency (a): beginner | 0 students | 0 students | 0 students
Other known languages (b) | Chinese, English, French, German, Italian, Japanese, Spanish in both groups. Additionally, Punjabi and Thai in FIX, and Arabic, Croatian, and Greek in ALT.
Prior German instruction: none | 64 students | 31 students | 33 students
Prior German instruction: 1 semester | 4 students | 3 students | 1 student
Visits to German countries: 0 times | 39 students | 20 students | 19 students
Visits to German countries: 1 – 3 times | 26 students | 12 students | 14 students
Visits to German countries: 4 or more | 3 students | 2 students | 1 student
Time in German countries: none | 38 students | 20 students | 18 students
Time in German countries: 1 – 21 days | 20 students | 11 students | 9 students
Time in German countries: 1 – 6 months | 8 students | 3 students | 5 students
Time in German countries: 1 – 4 years | 2 students | 0 students | 2 students
Computer comfort: (very) comfortable | 67 students | 34 students | 33 students
Computer comfort: neither | 1 student | 0 students | 1 student
Computer comfort: (very) uncomfortable | 0 students | 0 students | 0 students
Prior CALL use: never | 26 students | 12 students | 14 students
Prior CALL use: rarely / occasionally | 29 students | 15 students | 14 students
Prior CALL use: (very) frequently | 13 students | 7 students | 6 students

Note. (a) Other than English, the following L1s were reported: Arabic, Croatian, Chinese, Farsi, French, Hindi, Korean, Moroccan, Punjabi, Russian, Spanish, Tagalog, Turkish, and Ukrainian. (b) Other known languages include non-native languages for which students indicated a proficiency from beginner to advanced level. All students listing German assessed themselves as beginners.

42 Four of the 72 students from experiment 1 had to be eliminated from experiment 2. One student accidentally did not complete the entire treatment. Another student did not engage in the experiment, submitting each 28-item posttest in under 10 seconds without filling in any input fields. Experiment 2 is designed as a paired samples experiment, and accordingly, the paired participants of the two students were also excluded.

Table 34 indicates that in terms of their personal characteristics, the participants in experiment 2 were quite evenly divided into the two treatment groups. For instance, of the 45 women participating in the second study, 21 women were in the FIX treatment group and 24 were in the ALT treatment group. Also, 13 of the 23 men were in FIX and 10 were in ALT. The following section describes how the participants were assigned to the two treatment groups.

7.1.1 Assignment of Participants to the Two Groups

To control for variables that might influence the results of the study, a matched-participant design was employed to assign the 68 participants to the two treatment groups of experiment 2.43 The following measures from experiment 1 were considered, in decreasing order of importance, to obtain two balanced groups: 1) the learner's exposure group (groups I – V, see section 4.3), 2) the learner's delayed posttest score, 3) the learner's immediate posttest score, 4) the learner's range of scores for different annotation clusters on the delayed posttest, and 5) the number of best annotation clusters the learner had.

43 In a matched-participant design, participants that share similar characteristics are paired so that potential differences in scores between two treatment groups can be attributed more assertively to differences in treatment rather than to pre-existing differences in, for instance, student abilities in the two groups. One member of each pair is randomly assigned to one of the treatment groups while the other member is assigned to the other treatment group.

To control for the potential effect the learners' exposure group might have had on the learners' performance in experiment 1 (see section 4.3), exposure group was chosen as the primary grouping measure. Indeed, formal hypothesis testing on the immediate and delayed posttest scores of experiment 1 showed that there are some statistically significant interaction effects involving the exposure group factor. In particular, on the immediate posttest, there is a two-way interaction effect of exposure group × annotation cluster, F(16, 268) = 3.618, p < .001, η² = .055, and a three-way interaction effect of exposure group × annotation cluster × word type, F(16, 268) = 1.934, p = .018, η² = .026, but no two-way interaction effect of exposure group × word type, F(4, 67) = .989, p = .420. On the delayed posttest, there is a two-way interaction effect of exposure group × annotation cluster, F(16, 268) = 2.657, p = .001, η² = .045, but there are no interaction effects for either exposure group × word type (F(4, 67) = .702, p = .593) or exposure group × annotation cluster × word type (F(16, 268) = 1.159, p = .302). There is no main effect of exposure group on the immediate or delayed posttests (F(4, 67) = 1.179, p = .328, and F(4, 67) = 1.816, p = .458, respectively). The interaction effects with exposure group mean that the particular words assigned to a given annotation cluster have some influence on the effectiveness of that annotation cluster. However, the effect size estimates are low in all cases, and thus the proportion of variance explained by interaction effects involving exposure group is minimal. Nonetheless, this confounding influence was controlled by making exposure group the primary grouping measure. The remaining four measures (see 2 – 5 above) were chosen because they are considered to be most prognostic of the learners' success in experiment 2.
The score learners received on the delayed posttest in experiment 1 is likely the most predictive of their performance in experiment 2 and was therefore chosen as the secondary grouping measure while the student's immediate posttest score was selected as the tertiary grouping measure. Each student's range of scores for different annotation clusters on the delayed posttest was the fourth grouping measure. The range is determined by subtracting the student's score on his / her worst annotation cluster from the student's score on his / her best cluster. Presumably, students with a high range between their best and worst annotation cluster are helped more if they receive an annotation cluster suited to them than students with a low range between their best and worst annotation clusters (see also section 6.4.2). Finally, the number of best annotation clusters per student was the final grouping measure because it is likely that a student with only one best annotation cluster is more dependent on receiving this annotation cluster in experiment 2 than a student with several best annotation clusters.44

Based on the five measures, the students were sorted in a spreadsheet, first by their exposure group from I to V. Within exposure group, the students were sorted by their delayed posttest score from high to low, then by their immediate posttest score, then by their range, and finally by the number of best annotation clusters. The students were then grouped into pairs with two consecutive students comprising a pair (i.e., the first two students were a pair, students 3 and 4 were a pair, etc.).45 For each pair, one of the members was randomly assigned to the FIX group while the other was assigned to the ALT group.

Table 35: Means and SDs for the sorting measures

Measure | FIX group (n = 34): Mean | SD | ALT group (n = 34): Mean | SD
Delayed posttest score | 6.53 | 4.43 | 6.60 | 4.50
Immediate posttest score | 17.43 | 5.90 | 18.74 | 6.56
Range of scores | 1.90 | 1.01 | 1.60 | 0.77
Number of best annotation clusters | 1.12 | 0.33 | 1.12 | 0.33

Table 35 displays the means and standard deviations of the four numerical measures for both groups. Four paired-samples t-tests confirm that the two treatment groups do not differ significantly on any of the four measures (delayed posttest score: t(33) = -.135, p = .894; immediate posttest score: t(33) = -1.584, p = .123; range of scores: t(33) = 1.782, p = .084; number of best annotation clusters: t(33) = .000, p = 1.000).

44 For the grouping of the participants, the delayed posttest score of part 1 was categorized as high (students with a score of 9 / 30 points or more), medium (4 - 8.5 points), or low (0 - 3.5 points). For instance, a student scoring 8.5 points on the delayed posttest falls into the medium category. Note that the category start and end values were chosen such that there was a similar number of students in each of the categories. The immediate posttest score was grouped into the categories high (21.5 points or more), medium (15 - 21 points), and low (6 - 14.5 points). The learners' range of scores was categorized as high (range 2.5 - 4), medium (range 1.5 - 2), or low (range 0 - 1). For the fifth measure, the learner model calculation described in section 7.3.1 was used to determine the number of best annotation clusters per student.

45 For instance, participants voka270 and voka275 were matched for part 2. The two are an ideal match: Both were in exposure group V in part 1, had a high delayed posttest score, a medium immediate posttest score, a high range of scores and only one best annotation cluster.
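The sorting and pairing procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the spreadsheet procedure actually used: the field names, the function names, the seed, and the two learners labelled hyp_a and hyp_b are hypothetical, and the sort direction for every measure other than the delayed posttest score (stated as high to low) is assumed. Only the profile of voka270 and voka275 is taken from the text.

```python
import random

# Categories follow the high / medium / low grouping described in the text.
CATEGORY_ORDER = {"high": 0, "medium": 1, "low": 2}


def sort_key(learner):
    """Order by the five grouping measures in decreasing order of importance."""
    return (
        learner["exposure_group"],              # 1-5, i.e., groups I-V
        CATEGORY_ORDER[learner["delayed"]],     # high to low (stated in the text)
        CATEGORY_ORDER[learner["immediate"]],   # direction assumed
        CATEGORY_ORDER[learner["range"]],       # direction assumed
        learner["n_best"],                      # direction assumed
    )


def assign_matched_pairs(learners, rng):
    """Pair consecutive learners after sorting; randomly split each pair into FIX and ALT."""
    ordered = sorted(learners, key=sort_key)
    assignment = {}
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)  # random assignment within the pair
        assignment[pair[0]["id"]] = "FIX"
        assignment[pair[1]["id"]] = "ALT"
    return assignment


learners = [
    # Profile of voka270 and voka275 as reported in the text.
    {"id": "voka270", "exposure_group": 5, "delayed": "high",
     "immediate": "medium", "range": "high", "n_best": 1},
    {"id": "voka275", "exposure_group": 5, "delayed": "high",
     "immediate": "medium", "range": "high", "n_best": 1},
    # Hypothetical learners added only to make the example non-trivial.
    {"id": "hyp_a", "exposure_group": 1, "delayed": "low",
     "immediate": "low", "range": "medium", "n_best": 2},
    {"id": "hyp_b", "exposure_group": 1, "delayed": "low",
     "immediate": "medium", "range": "low", "n_best": 1},
]

assignment = assign_matched_pairs(learners, random.Random(0))
print(assignment)
```

Whatever the seed, the two members of each pair always end up in different treatment groups, which is the property the matched-participant design relies on.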


7.2 Timeline of Experiment 2

The research conducted for experiment 2 involved two meetings with the study participants, which were conducted in each of the four sections of German 102 (see Table 36).

Table 36: Timeline for experiment 2

Meeting | Experiment component | Approx. duration | Class week | German 102 section D1 | D2 | E1 | E2
1 | Exp. 2 Phase 1 - 6 | 60 min. | 12 | Nov. 24 | Nov. 25 | Nov. 24 | Nov. 25
2 | Exp. 2 delayed posttest; Evaluation quest. 2; Debriefing | 30 min. | 13 | Dec. 1 | Dec. 2 | Dec. 1 | Dec. 2

At the first meeting, the study participants completed the pretest, the two study phases, the two practice phases and the immediate posttest of part 2 of Voka (see section 3.7). At the second meeting, the participants completed the delayed posttest of part 2 (see section 3.7). They also filled out the second evaluation questionnaire, which asked them to evaluate the study (see Appendix E). The participants then received a handout of the part 2 Voka words. During experiment debriefing, the researcher presented the preliminary findings of both experiments 1 and 2 and conducted a lottery with small prizes as a thank you to the participants. As with experiment 1 (see section 4.2), the Voka sessions were conducted as part of the regular in-class instruction for German 102 and students could make up a missed session by attending another section of German 102 or meeting with the researcher individually.

7.3 Design of Experiment 2

Experiment 2 explored the effect of different presentation sequences on L2 vocabulary learning with a between-subjects design. An individualized learning environment was constructed because the findings of experiment 1 called for individualized instruction to boost learner performance (see section 6.4.2). The static adaptation in Voka guided the approach to teaching by considering the learner's performance and preference in the selection of resources provided for the L2 test items (i.e., annotation cluster) (see section 2.6).

For experiment 2, 68 of the participants from experiment 1 were divided into two groups, FIX and ALT, to study another 28 L2 words (14 abstract, 14 concrete). Learners in the FIX group received a fixed presentation sequence of the same annotation cluster for all 28 target words. Learners in the ALT group received an alternating presentation sequence by studying 14 words (7 abstract, 7 concrete) in one annotation cluster and the other 14 words (7 abstract, 7 concrete) in another annotation cluster. The annotation clusters in ALT were presented in random order for each student and Voka phase. The 7 abstract and 7 concrete words shown in each annotation cluster were also selected randomly for each student.

In designing experiment 2, it had to be ensured that the annotation cluster(s) selected for the students in the treatment groups would not affect their performance and thus confound the results. Experiment 1 showed that the effectiveness of the annotation clusters varied across learners (see RQ 4, section 5.4) and thus the relative effectiveness of an annotation cluster for a given learner was an extraneous variable that could have influenced the results of experiment 2. Accordingly, experiment 2 investigated presentation sequence in an individualized environment in which learners received only annotation clusters that had been demonstrated to be effective for them. Learners in the FIX group received their most effective annotation cluster for all 28 words of part 2, while for learners in the ALT group, instruction alternated between their two most effective annotation clusters.

In providing individualized instruction in experiment 2, data from the first experiment were the input for the learner model (see section 2.6) used to determine effective annotation clusters for each student. This is described in detail in the following section.
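As a brief aside on the ALT allocation described in this section, the random split of the 28 target words across a learner's two annotation clusters might be sketched in Python as follows. The function name, the seed, and the placeholder word labels are hypothetical; the cluster pair DA and PAGD is the one reported for voka244 in section 7.3.1. The sketch covers only the allocation of words to clusters, not the random presentation order within each Voka phase.

```python
import random


def split_words_for_alt(abstract_words, concrete_words, clusters, rng):
    """Give each of two clusters a random half of the abstract and of the concrete words."""
    first, second = clusters
    plan = {first: [], second: []}
    for words in (abstract_words, concrete_words):
        shuffled = list(words)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        plan[first].extend(shuffled[:half])
        plan[second].extend(shuffled[half:])
    return plan


# Placeholder labels standing in for the 14 abstract and 14 concrete target words.
abstract = [f"abstract_{i}" for i in range(1, 15)]
concrete = [f"concrete_{i}" for i in range(1, 15)]

plan = split_words_for_alt(abstract, concrete, ("DA", "PAGD"), random.Random(1))
for cluster, words in plan.items():
    print(cluster, len(words))
```

Each cluster receives 14 words (7 abstract, 7 concrete), and no word appears in both clusters.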

7.3.1 Construction of a Learner Model for Each Student

The learner model for each student in Voka consisted of the student's most effective annotation cluster(s). To calculate the learner model for each student, both learner performance and learner preferences were considered. Specifically, the following data from experiment 1 were taken into account: the student's most effective and second most effective annotation clusters from the two posttests, as determined by his or her posttest scores, and the annotation clusters ranked highest and second-highest by the learner on the first evaluation questionnaire. Table 37 displays these data for the 68 participants of experiment 2. For example, 18 students scored highest with annotation cluster PG on the immediate posttest and another 30 students achieved their second highest score with PG.46

Table 37: Annotation cluster ranks for participants of experiment 2

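Data source | Rank | PG | DG | PA | DA | PAGD | Sum
Immediate posttest | 1 | 18 | 14 | 25 | 16 | 28 | 101
Immediate posttest | 2 | 30 | 19 | 18 | 20 | 11 | 98
Delayed posttest | 1 | 20 | 17 | 20 | 17 | 23 | 97
Delayed posttest | 2 | 18 | 15 | 19 | 17 | 22 | 91
Evaluation questionnaire | 1 | 11 | 8 | 28 | 12 | 46 | 105
Evaluation questionnaire | 2 | 26 | 9 | 31 | 20 | 12 | 98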

As shown in Table 38, to calculate the learner model for each student, the six components contributing to the learner model were weighted differently to reflect their relative importance.47 By then adding the learner model scores for each annotation cluster, the final result was a ranking of the five annotation clusters for each student.

46 The sum of the numbers in the rows exceeds 68 because some students achieved the same posttest score in more than one annotation cluster and / or gave more than one annotation cluster the same ranking (e.g., voka244, see Table 39).

47 The weighting system, although to some extent arbitrary, takes into account multifaceted measures with the goal of arriving at a more reliable determination of the learner's ranking of the five annotation clusters. It takes both posttests into consideration and, on pedagogical grounds, assigns less weight to the immediate posttest than to the delayed posttest, given the fleeting nature of the former. Moreover, learner preferences, as indicated on the first evaluation questionnaire, are also considered in ranking the five annotation clusters for each student. For each data source, the most effective annotation cluster is weighted more than the second most effective one.


Table 38: Learner model scores to determine top two annotation clusters for experiment 2

Data source | Annotation cluster rank | Learner model score
Immediate posttest | 1) best annotation cluster(s) | 3
Immediate posttest | 2) second best annotation cluster(s) | 1
Delayed posttest | 1) best annotation cluster(s) | 6
Delayed posttest | 2) second best annotation cluster(s) | 2
Evaluation questionnaire | 1) annotation cluster(s) ranked first | 6
Evaluation questionnaire | 2) annotation cluster(s) ranked second | 2

The learner model calculation is illustrated with participants voka244 and voka250 in Table 39. For instance, voka244 scored 5 out of a maximum of 6 points in DA, making it her most effective annotation cluster on the immediate posttest. With 4.5 out of 6 points, PG was her second most effective cluster. According to the learner model calculation (Table 38), she received a learner model score of 3 for DA and a learner model score of 1 for PG. Allocating learner model scores in a similar fashion for the delayed posttest and the evaluation questionnaire and then tallying the scores for each annotation cluster, voka244 achieved the highest score (15) with DA, making it her best annotation cluster. PAGD was her second best annotation cluster with a score of 6. The same procedure was applied to voka250's data where it resulted in two best annotation clusters, PG and DG, both with a learner model score of 9. Table 39: Calculation of the learner model for voka244 and voka250

Data source | Raw data score: PG | DG | PA | DA | PAGD | Corresponding learner model score: PG | DG | PA | DA | PAGD

Participant voka244
Imm. posttest | 4.5 | 2.5 | 2 | 5 | 3 | 1 | 0 | 0 | 3 | 0
Delayed posttest | 2 | 0 | 1 | 4 | 2 | 2 | 0 | 0 | 6 | 2
Quest. ranking | #4 | #3 | #2 | #1 | #1 | 0 | 0 | 2 | 6 | 6
Sum | -- | -- | -- | -- | -- | 3 | 0 | 2 | 15 | 6
Result | -- | -- | -- | -- | -- | -- | -- | -- | best | second

Participant voka250
Imm. posttest | 5 | 6 | 4 | 3 | 4 | 1 | 3 | 0 | 0 | 0
Delayed posttest | 3 | 3 | 1.5 | 1.5 | 1 | 6 | 6 | 2 | 2 | 0
Quest. ranking | #2 | #3 | #1 | #1 | #1 | 2 | 0 | 6 | 6 | 6
Sum | -- | -- | -- | -- | -- | 9 | 9 | 8 | 8 | 6
Result | -- | -- | -- | -- | -- | best | best | -- | -- | --
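The learner model calculation can be made concrete with the following Python sketch, which applies the weights of Table 38 to the raw data reported for voka250 in Table 39. The function and variable names are illustrative assumptions, as is the treatment of ties through dense ranking (clusters with equal scores share a rank, and the next distinct score receives the next rank); this reading is consistent with the voka250 figures but is not spelled out as an algorithm in the text.

```python
# Weights from Table 38: data source -> {rank: learner model score}.
WEIGHTS = {
    "immediate": {1: 3, 2: 1},
    "delayed": {1: 6, 2: 2},
    "questionnaire": {1: 6, 2: 2},
}


def dense_ranks(values, higher_is_better=True):
    """Map each cluster to its dense rank (ties share a rank; 1 = best)."""
    distinct = sorted(set(values.values()), reverse=higher_is_better)
    position = {value: index + 1 for index, value in enumerate(distinct)}
    return {cluster: position[value] for cluster, value in values.items()}


def learner_model(immediate, delayed, questionnaire):
    """Sum the weighted rank scores per annotation cluster."""
    ranks_by_source = {
        "immediate": dense_ranks(immediate),
        "delayed": dense_ranks(delayed),
        # Questionnaire values are already rankings, so lower is better.
        "questionnaire": dense_ranks(questionnaire, higher_is_better=False),
    }
    totals = {cluster: 0 for cluster in immediate}
    for source, ranks in ranks_by_source.items():
        for cluster, rank in ranks.items():
            totals[cluster] += WEIGHTS[source].get(rank, 0)
    return totals


# Raw data for voka250 as reported in Table 39.
voka250 = learner_model(
    immediate={"PG": 5, "DG": 6, "PA": 4, "DA": 3, "PAGD": 4},
    delayed={"PG": 3, "DG": 3, "PA": 1.5, "DA": 1.5, "PAGD": 1},
    questionnaire={"PG": 2, "DG": 3, "PA": 1, "DA": 1, "PAGD": 1},
)
print(voka250)  # {'PG': 9, 'DG': 9, 'PA': 8, 'DA': 8, 'PAGD': 6}
```

The totals reproduce the sums shown for voka250 in Table 39, with PG and DG tied as her best annotation clusters.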


Learners in the ALT treatment received their top two annotation clusters, which were chosen randomly if more than two clusters emerged. For example, voka244, who was assigned to the ALT group, studied with DA and PAGD while voka250, who was also in the ALT group, received PG and DG. For learners in the FIX treatment, if the learner model calculation revealed only one best annotation cluster, then this cluster was presented. For example, if voka244 had been assigned to the FIX group, she would have received DA (see Table 39). If a student had more than one best annotation cluster, one of them was chosen randomly. For example, this would have applied to voka250 if she had been in the FIX group (see Table 39). As a detailed illustration, Appendix B shows the program flow for voka244 in the ALT group, and her paired partner (see section 7.1.1), voka249, in the FIX group. The results of the calculation of each student's learner model are displayed in Table 40, which shows the distribution of best and second best annotation clusters for all 68 participants of experiment 2. For example, annotation cluster PG was the best cluster for 10 students and the second best cluster for 16 students. 48 Table 40: Best and second best annotation clusters for participants of experiment 2

Result | PG | DG | PA | DA | PAGD | Sum
Best cluster | 10 | 9 | 17 | 8 | 32 | 76
Second-best cluster | 16 | 8 | 19 | 12 | 20 | 75
Sum | 26 | 17 | 36 | 20 | 52 | 151
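How the learner model totals translate into the annotation cluster(s) presented in each treatment group can be sketched in Python as follows. The function names and the seed are hypothetical. The totals are the sums reported in Table 39. The rule for a learner with one best cluster and several tied second-best clusters (one of the latter is drawn at random) is an assumption based on the statement that clusters were chosen randomly when more than two emerged.

```python
import random


def best_and_second(totals):
    """Return the clusters with the highest and the second-highest learner model score."""
    distinct = sorted(set(totals.values()), reverse=True)
    best = [c for c, s in totals.items() if s == distinct[0]]
    second = [c for c, s in totals.items() if len(distinct) > 1 and s == distinct[1]]
    return best, second


def cluster_for_fix(totals, rng):
    """FIX: the best cluster, chosen at random if several clusters tie for best."""
    best, _ = best_and_second(totals)
    return rng.choice(best)


def clusters_for_alt(totals, rng):
    """ALT: the top two clusters, with random choice among tied candidates (assumed rule)."""
    best, second = best_and_second(totals)
    if len(best) >= 2:
        return rng.sample(best, 2)
    return [best[0], rng.choice(second)]


# Learner model sums as reported in Table 39.
voka244 = {"PG": 3, "DG": 0, "PA": 2, "DA": 15, "PAGD": 6}
voka250 = {"PG": 9, "DG": 9, "PA": 8, "DA": 8, "PAGD": 6}

rng = random.Random(0)
print(clusters_for_alt(voka244, rng))  # ['DA', 'PAGD']
print(sorted(clusters_for_alt(voka250, rng)))  # ['DG', 'PG']
print(cluster_for_fix(voka244, rng))  # DA
```

For voka250, a hypothetical FIX assignment would draw either PG or DG, since both share her highest learner model score.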

The following section describes the annotation clusters that were displayed in the two treatment groups of experiment 2.

48 The sum of the numbers in the rows exceeds 68 because some students received the same learner model score in more than one annotation cluster (e.g., voka250, see Table 39).

7.3.2 Annotation Clusters Presented to FIX and ALT

Experiment 2 provided individualized instruction based on a static adaptation to each learner's most effective annotation cluster(s). The 68 participants of experiment 2 were split into two treatment groups to assess the effect of presentation sequence on vocabulary learning. Table 41 shows the annotation clusters that the 34 students in the FIX group received. Although PAGD was the annotation cluster with the highest number of participants (13), it accounted for only 38% of the FIX group. The remaining 62% of learners studied with annotation clusters PG, DG, PA or DA.

Table 41: Distribution of annotation clusters in the FIX group

Annotation cluster | Number of learners | Percentage
PG | 4 | 12%
DG | 2 | 6%
PA | 10 | 29%
DA | 5 | 15%
PAGD | 13 | 38%
All | 34 | 100%

Table 42 displays the distribution of annotation clusters for the 34 students in the ALT group. The number in a given cell refers to the number of students that received that combination of annotation clusters. For example, six participants received instruction that randomly alternated between annotation clusters DA and PAGD. The last column and last row in Table 42 indicate that overall, for instance, annotation cluster PG was presented 11 times. As in FIX, PAGD was shown most often, followed by PA, PG, and finally DG and DA. 49 To give more detail, Appendix B lists the specific treatment for some students of both treatment groups. Table 42: Distribution of annotation clusters in the ALT group

Annotation clusters | PG | DG | PA | DA | PAGD | Sum
PG | -- | 2 | 2 | 2 | 5 | 11
DG | 2 | -- | 2 | 0 | 5 | 9
PA | 2 | 2 | -- | 1 | 9 | 14
DA | 2 | 0 | 1 | -- | 6 | 9
PAGD | 5 | 5 | 9 | 6 | -- | 25
Sum | 11 | 9 | 14 | 9 | 25 | 68

The following section describes the data analysis and statistical procedures used in experiment 2.

49 The FIX group would have had the following distribution of annotation clusters if all 68 participants of experiment 2 had been assigned to it: PG: 10 learners, DG: 6 learners, PA: 17 learners, DA: 8 learners, PAGD: 27 learners. Conversely, if all 68 participants had been in the ALT group, the distribution of annotation clusters would have been the following: PG/DG: 3 learners, PG/PA: 6 learners, PG/DA: 3 learners, PG/PAGD: 10 learners, DG/PA: 2 learners, DG/DA: 0 learners, DG/PAGD: 9 learners, PA/DA: 3 learners, PA/PAGD: 22 learners, DA/PAGD: 10 learners.


7.4 Data Analysis and Statistical Procedures

The independent variable considered in experiment 2 was presentation sequence, a nominal between-subjects variable with two levels: fixed versus alternating. The dependent variable was vocabulary learning, operationalized as learners' scores on the immediate and delayed posttest of experiment 2. The posttest scores ranged from 0 to 28 points per student. The scoring of the student answers on the two posttests was identical to the scoring applied in experiment 1 (see section 4.4.1). Accordingly, each student response received either 1 point, 0.5 points or 0 points. Table 43 lists the research question of experiment 2 along with the statistical effect tested, the operationalization, and its inference test.

Table 43: Research question and operationalization for experiment 2

Research question: 5. Does presentation sequence (i.e., fixed, alternating) have an effect on vocabulary learning?
Effect tested: Main effect of presentation sequence
Operationalization: Presentation sequence: 0 – 28 points per student in each group (n = 34) in each posttest of part 2
Inference test: Paired samples t-test (α = .05)

Chapter 8 presents the results and discussion of experiment 2.

8 RESULTS AND DISCUSSION OF EXPERIMENT 2

Experiment 2 provided an individualized learning environment, which was based on the results of experiment 1, to examine the effect of presentation sequence (i.e., fixed vs. alternating) on vocabulary learning. The results of this research endeavour are presented in section 8.1 and discussed in section 8.2.

8.1 Research Topic 5: Presentation Sequence in Individualized Instruction

8.1.1 Results of RQ 5: Main Effect of Presentation Sequence

Immediate and delayed posttests

Descriptive statistics

Table 44 shows the immediate and delayed posttest scores for each treatment group and the total. On both posttests, the group with a fixed presentation sequence (FIX) performed better than the group with an alternating presentation sequence (ALT). On the immediate posttest, the FIX group received a mean score of 20.79 / 28 points (74.3%) compared to a mean of 19.59 / 28 points (70.0%) for the ALT group. On the delayed posttest, the means were 6.31 / 28 points (22.5%) for FIX and 5.59 / 28 points (20.0%) for ALT.

Table 44: Immediate and delayed posttest scores for experiment 2

                       Immediate posttest        Delayed posttest
                       Mean    %       SD        Mean   %       SD
FIX (n = 34)           20.79   74.3%   5.66      6.31   22.5%   4.22
ALT (n = 34)           19.59   70.0%   6.39      5.59   20.0%   3.49
Both groups (N = 68)   20.19   72.1%   6.02      5.95   21.3%   3.86
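The percentages in Table 44 express each mean score as a share of the 28-point maximum. The sketch below recomputes them from the reported means; rounding half up to one decimal place is an assumption made here for illustration, and `percent_of_max` is a hypothetical helper rather than part of the original analysis.

```python
from decimal import Decimal, ROUND_HALF_UP

MAX_SCORE = Decimal(28)  # maximum posttest score per student in experiment 2


def percent_of_max(mean_score, max_score=MAX_SCORE):
    """Express a mean score as a percentage of the maximum, to one decimal."""
    pct = Decimal(mean_score) / max_score * 100
    # Assumption: values such as 74.25 are rounded half up (to 74.3).
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Mean scores as reported in Table 44 (kept as strings to avoid float error).
MEANS = {
    ("FIX", "immediate"): "20.79", ("ALT", "immediate"): "19.59",
    ("Both", "immediate"): "20.19",
    ("FIX", "delayed"): "6.31", ("ALT", "delayed"): "5.59",
    ("Both", "delayed"): "5.95",
}

percentages = {key: percent_of_max(value) for key, value in MEANS.items()}
for key, pct in percentages.items():
    print(key, f"{pct}%")
```

Decimal arithmetic is used because two of the values fall exactly on a rounding boundary, where binary floating point could round in either direction.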

Inferential statistics

To formally test the hypothesis of an effect of presentation sequence, a matched samples t-test was conducted for each posttest. For the immediate posttest, there is no main effect of presentation sequence, t(33) = 1.104, p = .278. For the delayed posttest, there is also no main effect of presentation sequence, t(33) = .777, p = .443. Accordingly, in inferential terms, presentation sequence has no effect on L2 vocabulary learning immediately or seven days post-treatment. The following section discusses these results.
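For readers less familiar with the test, the sketch below shows how a matched (paired) samples t statistic is computed from pairs of scores. The scores are hypothetical and serve only to illustrate the calculation; they are not data from experiment 2. With 34 matched pairs, the degrees of freedom would be 34 − 1 = 33, as in the t(33) values reported above.

```python
import math
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Return the paired-samples t statistic and its degrees of freedom."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both samples need one score per matched pair")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    # Standard error of the mean difference (sample SD uses n - 1).
    # Note: identical differences give a standard error of zero.
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical posttest scores for four matched pairs of learners.
fix_scores = [21.0, 18.5, 25.0, 16.0]
alt_scores = [20.0, 16.5, 22.0, 15.0]

t_stat, df = paired_t(fix_scores, alt_scores)
print(f"t({df}) = {t_stat:.3f}")
```

The p value would then be read from the t distribution with the given degrees of freedom, which is omitted here to keep the sketch free of external libraries.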

8.2 Discussion of Topic 5: Presentation Sequence in Individualized Instruction

This section first explores the effect of presentation sequence as a variable enabled by a computer learning environment that may influence the effectiveness of vocabulary instruction (section 8.2.1). Section 8.2.2 then discusses the results with respect to the ranking of annotation clusters followed by an evaluation of the individualized instruction that learners received in experiment 2 (section 8.2.3).

8.2.1 Presentation Sequence

The results of experiment 2 reveal that immediate and delayed L2 vocabulary retention is not influenced by presentation sequence. Thus, whether annotation clusters alternate (alternating presentation sequence) or whether learners study all L2 target words with the same annotation cluster (fixed presentation sequence) does not affect L2 vocabulary learning (see Appendix B for an example of each presentation sequence). These results are somewhat surprising considering existing literature. For instance, one might have expected that the alternating presentation sequence would lead to more successful vocabulary learning because it might facilitate noticing of L2 input (Schmidt, 1990). The noticing hypothesis maintains that learners are more likely to acquire L2 knowledge if it is noticed in the input they receive. Prior research has found, for example, that the provision of L2 annotations in CALL generally helps learners notice L2 target vocabulary (e.g., Bowles, 2004; Yanguas, 2009). However, to the author’s knowledge, the noticing hypothesis has never been applied to a pedagogical and computational environment like Voka, which alters the vocabulary annotations provided for L2 words. Nonetheless, the variation inherent to the alternating presentation sequence in Voka might prompt learners to repeatedly re-direct their attention to the task at hand and thus lead to noticing of the L2 input because this element of newness, diversity, and / or entertainment is missing from the monotony provided by the fixed presentation sequence.

More generally, this idea can also be applied to the theory of habituation. As a form of learning, habituation is a psychological process in which the repeated presentation of a stimulus results in decreased responsiveness (Bear et al., 2007; Shaffer & Kipp, 2009).⁵⁰ In the fixed presentation sequence, the participants received the same annotation cluster for each word, which might have led to habituation and a resulting decrease in learning effectiveness. In contrast, the alternating presentation sequence presented two different stimuli and thus might have prevented habituation from occurring. The latter is especially likely because the two annotation clusters were presented in random alternation (see section 7.3) and thus learners could not anticipate at which point in the instructional process which annotation clusters would be shown.

However, not only is there no statistically significant difference between the two presentation sequences but, in opposition to the noticing and habituation hypotheses, the descriptive statistics indicate that the FIX group outperformed the ALT group (immediate posttest: FIX 74.3% vs. ALT 70.0%; delayed posttest: FIX 22.5% vs. ALT 20.0%, see section 8.1), suggesting that neither habituation nor noticing was a factor.

One might speculate that the FIX group had a descriptive advantage over the ALT group because of differences in the annotation cluster effectiveness in the two groups. In the FIX group, all learners received instruction with the annotation cluster that was most effective for them. In contrast, in the ALT group, some learners received the two annotation clusters that were most effective for them while others received their most and second-most effective annotation clusters. The former was the case when the learner model calculation produced two annotation clusters with an equal score whereas the latter occurred when the learner model calculation produced a single high score (see section 7.3.1). Four of the 34 learners (11.8%) in ALT received their two best annotation clusters and 30 learners (88.2%) received their best and second-best annotation cluster. The fact that most learners in ALT received half the words in their second-best annotation cluster might have been a disadvantage. This might have resulted in a lower mean score for the ALT group when compared to the FIX group. However, the descriptive and inferential results indicate that, if a disadvantage was present at all, it was so slight that it is unlikely to have impaired the performance of the ALT group as a whole, especially when also considering that 4 of the 34 students studied with their two best annotation clusters. For instance, the data in Table 45 show that on average, the 30 ALT learners achieved only a slightly higher score with their best than with their second-best annotation cluster on each posttest (71.8% vs. 70.1% on the immediate posttest; 21.2% vs. 19.0% on the delayed posttest). Two paired samples t-tests confirm that the difference in scores between the best and the second-best annotation clusters for the 30 learners is not significant (immediate posttest: t(29) = .502, p = .619; delayed posttest: t(29) = .603, p = .551).

⁵⁰ For example, habituation is often employed in studies with infants (Shaffer & Kipp, 2009). To test whether infants can discriminate between two sounds, for instance, researchers first present them with a sound until they stop responding to it (habituate) and then present a different sound. If an infant shows increased responsiveness (e.g., by sucking more strongly or turning their head), the researchers conclude that the infant was able to discriminate the two sounds.

Table 45: Comparison of best vs. second-best annotation cluster scores for 30 of the ALT learners

Description                                    Mean         %       SD
Immediate posttest
  Score with best annotation cluster           10.05 / 14   71.8%   3.45
  Score with second-best annotation cluster     9.82 / 14   70.1%   3.67
Delayed posttest
  Score with best annotation cluster            2.95 / 14   21.2%   2.25
  Score with second-best annotation cluster     2.67 / 14   19.0%   1.95
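Table 45 scores each annotation cluster out of 14 points, which reflects that each ALT learner studied half of the 28 target words with each of the two clusters, in random alternation. The sketch below shows one way such a balanced random sequence could be generated. It is an illustration under that assumption only, not Voka's actual implementation, and the function name and `seed` parameter are hypothetical.

```python
import random


def alternating_sequence(cluster_a, cluster_b, n_words=28, seed=None):
    """Assign one of two clusters to each word: half each, in random order."""
    if n_words % 2 != 0:
        raise ValueError("n_words must be even to split the words in half")
    half = n_words // 2
    sequence = [cluster_a] * half + [cluster_b] * half
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(seed).shuffle(sequence)
    return sequence


# Example: a learner whose two assigned clusters are DA and PAGD.
sequence = alternating_sequence("DA", "PAGD", seed=1)
print(sequence)
```

Because the order is shuffled rather than strictly alternated, a learner cannot anticipate which cluster accompanies the next word, while each cluster is still used for exactly 14 words.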

Thus, the results of experiment 2 suggest that the continuous random alternation between the two annotation clusters in the ALT group does not present an advantage for students. On the contrary, the random pattern of alternation might even be frustrating, especially when learners do not know the purpose of the alternating instruction. Although the study participants were told prior to the experiment that they might receive two annotation clusters, they were not aware that instruction was individualized according to their best annotation cluster(s) or that presentation sequence was the topic of investigation. To the learners, the seeming lack of an instructional rationale might have impaired their learning and ultimate achievement scores. A comment by voka213 on the evaluation questionnaire, for example, indicates that at least this student disliked the seemingly purposeless alternation of annotation clusters: "[I didn't like that it was] randomly with and without elements; pictures, definition, etc.;".

Despite a lack of significant differences between the two presentation sequences, the results of experiment 2, nonetheless, add to our understanding of the design of computer-assisted learning environments. Providing different presentation sequences as part of the L2 learning task is an additional pedagogical tool made possible by the computer as the instructional medium and, like interface design issues for instance, this certainly can lead to differences in learner performance. In addition, the design of experiment 2 also allows for an investigation of the effect of the annotation clusters that the students received as part of their individualized L2 vocabulary instruction. This analysis belongs to the realm of individualized instruction and is discussed in the following section.

8.2.2 Ranking of Annotation Clusters

In the individualized learning environment of experiment 2, investigating whether a fixed presentation sequence is more effective than an alternating sequence equates to the question of whether students learn L2 vocabulary better when they study with their most effective annotation cluster or when they study with their two most effective annotation clusters. Accordingly, the absence of a significant effect of presentation sequence also implies that there is no difference in Voka between studying words with one's most effective annotation cluster or with one's two most effective annotation clusters (see section 8.1.1 and section 8.2.1). However, it is likely that providing learners with annotation clusters that are ranked third, fourth or fifth for them would negatively affect their L2 vocabulary learning performance. While not tested in experiment 2, this is suggested by the considerable intra-learner variation in posttest scores between the learners' most effective and least effective annotation clusters in experiment 1. As detailed in section 6.4.2, this intra-learner variation in annotation cluster effectiveness was one of the factors that prompted the need for an individualized learning environment to enhance learner performance.

For the conclusion regarding the ranking of annotation clusters in experiment 2 to be valid, the information upon which it is based must be reliable. Accordingly, the following section evaluates the goodness of fit of the individualized instruction provided in experiment 2.

8.2.3 An Evaluation of the Individualized Instruction

Experiment 2 tested whether presentation sequence has an effect on learner performance by providing an individualized learning environment in which learners received their best annotation cluster(s) according to the results obtained from experiment 1. In this context, there are three predictions that should be investigated to evaluate whether the individualization performed in experiment 2 adequately addressed the differential effectiveness of the annotation clusters for different learners:

1. For the 30 of the 34 participants in the ALT group who received instruction in their best and second-best annotation cluster (see section 8.2.1), the ranking of the two annotation clusters in experiment 2 should be consistent with the ranking of the two annotation clusters determined by the learner model calculation (see section 7.3.1).
2. The annotation cluster ranking in experiment 2 should be different from the ranking obtained in experiment 1 because in experiment 2, each learner studied only with annotation clusters that were effective for him or her whereas in experiment 1, each learner studied with all five annotation clusters.
3. The scores in experiment 2 should be higher than the scores in experiment 1 because of the individualized instruction that the learners received in experiment 2.

Regarding the first prediction, both the mean results across learners and the individual results for each learner are in line with the first prediction. The data relating to the mean results are provided in Table 45 (section 8.2.1). While there is no statistically significant difference between the scores obtained with the learners' best annotation clusters and those achieved with their second-best annotation clusters, the descriptive results confirm the prediction: Learners on average scored higher with their best annotation cluster (immediate posttest: 71.8%, delayed posttest: 21.2%) than with their second-best cluster (immediate posttest: 70.1%, delayed posttest: 19.0%) (see section 8.2.1). Furthermore, the data for individual learners show that on both posttests, the majority of learners performed better with their best than with their second-best annotation cluster. On the immediate posttest, 16 of the 30 participants (53.3%) scored higher with their best annotation cluster than with their second-best cluster. On the delayed posttest, 17 of the 30 learners (56.7%) achieved a higher score with their best annotation cluster. Furthermore, for the vast majority of the learners who scored better with their second-best cluster, the difference in performance was slight. On the immediate posttest, for only 2 learners, the second-best cluster outperformed the best cluster by 2 or more points and contributed more than 60% to the learner's final score. On the delayed posttest, this applied to only 4 learners.

Regarding the second prediction, the data of the immediate posttest reveal that indeed, the ranking of annotation clusters, displayed above the data columns in Figure 25, was different in the two experiments. Most notably, annotation cluster DG, which was descriptively the least effective annotation cluster in experiment 1, was in second place in experiment 2 (see Figure 25). Furthermore, in experiment 1 annotation clusters PG (61.6%), PA (63.5%), and PAGD (61.8%) had the most similar scores (and were inferentially part of the same homogeneous subset, see section 5.1.1). In experiment 2, however, annotation clusters PG (70.3%), DG (72.0%), and PAGD (71.0%) were most alike in scores, with PA (80.4%) scoring much higher and DA (62.2%) scoring much lower.

Figure 25: Immediate posttest results for the five annotation clusters (both experiments)

Note. The annotation cluster rankings in the respective experiment are displayed above the data columns.

Figure 26 shows that the descriptive annotation cluster rankings in the two experiments were also different on the delayed posttests. Notably, annotation cluster PA, which was in first place in experiment 1, was only in third place in experiment 2, and DG, which was ranked fifth in experiment 1, was ranked 3/4 in experiment 2. The data are thus in line with the second prediction in that the ranking of the annotation clusters differed in the two experiments. One might initially expect that in the individualized environment of experiment 2, all five annotation clusters should receive similar scores because all learners studied only with annotation clusters that were most effective for them. However, this would presume that all learners have an equal ability to learn L2 vocabulary and, when given adequate, individualized instruction, they will all perform to the same standard. This is clearly not the case and thus, the variation in annotation cluster effectiveness in the second experiment is compatible with expected student performance in individualized instruction.


Figure 26: Delayed posttest results for the five annotation clusters (both experiments)

Note. The annotation cluster rankings in the respective experiment are displayed above the data columns.

In this regard, it is interesting to note that providing a picture in combination with audio was descriptively most effective for L2 vocabulary learning in both Voka's individualized and Voka's non-individualized learning environments. Annotation cluster PA was the most effective annotation cluster on the immediate posttest of both experiments 1 and 2 and on the delayed posttest of experiment 2 while PAGD was most effective on the delayed posttest of experiment 1 (see Figure 25 and Figure 26). This suggests that if, for instance, resources are limited, providing a picture in combination with audio should be the highest priority (see section 9.2.1).

The third prediction is that the posttest scores should be higher in experiment 2 than in experiment 1 because of the individualized instruction that the learners received in the second experiment. Inspecting the immediate posttest results, the mean achievement score in experiment 2 (72.1%, see section 8.1.1) was indeed 12.5 percentage points higher than the mean score in experiment 1 (59.6%, see section 5.1.1). Furthermore, in each annotation cluster, the score was notably higher in the second experiment than in the first experiment (see Figure 25). Regarding the delayed posttest, the overall mean score was nearly identical: 21.6% in experiment 1 (see section 5.1.1) and 21.3% in experiment 2 (see section 8.1.1). Although the scores from experiment 1 were higher in four of the five annotation clusters, the mean difference between experiment 1 and experiment 2 scores in each cluster was fairly small (see Figure 26). Thus, while the delayed posttest scores did not differ by much between the two experiments, prediction three is strongly supported by the immediate posttest data, where the scores were much higher in experiment 2.

In sum, all three predictions receive support from the data. Learners generally scored higher in their best than in their second-best annotation cluster, the ranking of the annotation clusters was different in the two experiments, and the posttest scores in experiment 2 were generally higher than in experiment 1. This not only suggests an accurate assessment of the learners’ most effective annotation cluster(s) as determined in experiment 1 but also indicates that the individualized instruction in experiment 2 effectively addressed learner differences with respect to their performance with, and preferences for, annotation clusters.

However, an important caveat with respect to data collection, analysis and interpretation needs to be mentioned. In addition to the individualization provided in experiment 2, the results might have been affected by additional factors that varied between the two experiments. For instance, experiment 2 involved different L2 target words, which might have differed in their item difficulty (see, e.g., section 3.2.4). Moreover, the learners had studied German for several more weeks when they participated in experiment 2, which possibly affected their posttest performance (e.g., better spelling because of a greater awareness of German orthography). Finally, there might have also been a practice effect with Voka given that the participants used the CAVL program for a second experiment. However, while these factors should be borne in mind as extraneous variables, any CAVL program that provides individualized instruction will face similar challenges given that an assessment of learners’ annotation clusters is required in order to effectively tailor instruction to individual learner needs. This implies an initial observation and testing phase in which data on student use and preferences must be collected, which then form the basis of the subsequently constructed individualized learning environment.

The following chapter offers concluding remarks with regard to the two experiments of this dissertation.


9 CONCLUSION

The concluding chapter provides a summary of this dissertation in section 9.1 and then explores the pedagogical implications of the findings in section 9.2. Section 9.3 discusses the limitations and constraints on the generalizability of the findings, which leads into a discussion of possible avenues for future research in section 9.4.

9.1 Summary

This dissertation explored issues surrounding multimedia annotations, word concreteness, and individualized instruction in L2 vocabulary learning in CALL with two experiments involving English L2 learners of beginner German. In experiment 1, which employed a within-subjects design, 72 participants studied 15 abstract and 15 concrete German nouns with Voka. For each test item, the learners received the German word with its article and plural, its English translation, a German example sentence, and one of five annotation clusters addressing form, meaning and / or use aspects of the word: PG, DG, PA, DA, or PAGD. With respect to multimedia vocabulary annotations, the immediate posttest of experiment 1 revealed that it is more beneficial for L2 vocabulary learning to provide learners with annotation clusters that contain a picture (PG, PA, PAGD) than to offer annotation clusters without a picture (DG, DA). Thus, additional meaning information in the form of a picture annotation is most effective, presumably because of the dual coding advantage of providing both text and pictures. However, not all additional meaning information is beneficial as the immediate posttest findings further showed that vocabulary in Voka is learned significantly more effectively without a definition

annotation. While the trends for the picture and definition annotations were the same on the delayed posttest, the effects were not statistically significant. Furthermore, the gloss and the audio annotations did not have a statistically significant effect on vocabulary learning on either posttest. However, for the audio, both posttests indicated a descriptive advantage of studying with this annotation type, and the participants also considered it the most beneficial annotation for learning L2 vocabulary. Concerning word concreteness, both the immediate and the delayed posttest of experiment 1 revealed that abstract words are significantly harder to learn than concrete words. Due to the stringent control of extraneous variables (see section 3.2), this result is particularly valid. The finding not only confirms prior research but also extends it to intentional L2 vocabulary learning with a multimedia CAVL program that considers form, meaning, and use aspects of the target words. Regarding the interaction of annotations and word concreteness, both posttests of experiment 1 showed that the concreteness effect is sustained in each annotation cluster. This demonstrates that the effect of word type is stronger than the effect of annotation cluster. Moreover, the immediate posttest data showed that the beneficial effect of pictures applies to both abstract and concrete words but is comparatively stronger for abstract words. Possibly, pictures have a relatively stronger effect for abstract nouns because their processing is cognitively more engaging, thus entailing a deeper level of processing and / or more evaluation in terms of the involvement load hypothesis. The immediate posttest data also revealed that the definition has a detrimental effect on learning abstract words, likely because it leads to cognitive overload when studying these inherently difficult words. 
The delayed posttest data did not reveal statistically significant effects of the annotations for either word type. Turning to individualized instruction, the results of experiment 1 call for individualized instruction in CAVL that considers the learners' performance and preferences with different annotation clusters. Both the immediate and the delayed posttest data demonstrated that annotation cluster effectiveness varied greatly across learners. This inter-learner variation in performance was also echoed in learner


preferences for particular annotation types and clusters. Furthermore, there was also considerable intra-learner variation in annotation cluster effectiveness. Accordingly, experiment 2 examined the presentation sequence of annotation clusters in the individualized learning environment prompted by experiment 1. Employing a between-subjects design, 68 of the participants of experiment 1 studied another 28 German nouns with an individualized version of Voka, which presented to each student only the most effective annotation cluster(s) based on his or her posttest performance and learning preferences from experiment 1. The students in the FIX group (n = 34) received a fixed presentation sequence which presented their most effective annotation cluster for all 28 words. The students in the ALT group (n = 34) received instruction that alternated between their two most effective annotation clusters (i.e., they studied half the words in one annotation cluster and the other half in another cluster). Experiment 2 furthers our understanding of the effect of presentation sequence in a computer-based learning environment. While one might have expected the alternating presentation sequence to facilitate noticing and prevent habituation, thus leading to better performance, the results of both posttests showed that presentation sequence does not have a statistically significant effect on L2 vocabulary learning. More importantly, descriptively, the FIX group outperformed the ALT group, suggesting that students might have been distracted by the seemingly purposeless alternation of annotation clusters in the ALT group. Furthermore, experiment 2 also enhances our understanding of the effect of annotation cluster ranking in individualized instruction on L2 vocabulary learning. 
Given the individualized learning environment of experiment 2, the results further showed that there is no statistically significant difference between studying target words with a learner's most effective annotation cluster (i.e., learners in the FIX group) or with a learner's two most effective clusters (i.e., learners in the ALT group). Note, however, that the provision of lower-ranked annotation clusters might negatively affect learner performance as suggested by the considerable intra-learner variation in annotation cluster effectiveness revealed in experiment 1 (see section 6.4.2 and section 8.2.2).


The following section discusses the implications of the findings of the two experiments of this dissertation.

9.2 Pedagogical Implications

The two experiments with Voka indicate a considerable benefit of intentional CAVL programs for learning L2 vocabulary. After only four consecutive learning cycles (two study and two practice phases) and a total exposure time of only 73 to 103 seconds per L2 target word (see section 3.9.3), the participants achieved a mean immediate posttest score of 59.6% in experiment 1 and 72.1% in experiment 2 (see section 5.1.1 and section 8.1.1, respectively). These scores are quite high considering that learners had to memorize a total of 30 test items in experiment 1 and 28 test items in experiment 2 in a relatively short amount of time. Moreover, the posttest assessed productive recall, the most difficult type of word knowledge (see section 2.1.1).

While one might argue that instruction with Voka was only moderately effective given the low delayed posttest scores (21.6% in experiment 1, 21.3% in experiment 2, see section 5.1.1 and section 8.1.1), the following five considerations suggest that Voka was in fact very effective in teaching the target words in spite of the research focus of the learning environment. First, for data analysis purposes, Voka was intentionally designed to be more difficult than a comparable non-experimental CAVL program to prevent a ceiling effect in posttest scores (see section 3.9.3). Second, in order to control for extraneous variables, the test items in Voka were purposefully chosen not to be part of the learners' regular German 102 curriculum and thus there was little motivation to learn the target words. The learners were also asked not to study the words in between the two testing phases and course credit was solely given for participation in the research project and not for performance. Third, many learners remarked that they lost focus because of the preset exposure time of the flashcards, which was implemented to control time as a variable.
Fourth, the students were only exposed to the L2 target words in one learning session in Voka. However, repeated exposure to L2 words on separate occasions is necessary for effective and durable vocabulary learning (e.g., Nation, 2001). Research suggests that at least five to seven repetitions are needed for effective vocabulary learning (Nation, 2001; see also Webb, 2007). Fifth, inspecting the data in more detail also reveals that assessing a different type of word knowledge would have resulted in higher scores. For instance, while the mean score on the immediate productive recall posttest was 17.88 / 30 (59.6%), the mean score in the first practice phase, which tested productive recognition (see section 3.7.3), was 28.4 / 30 points (94.8%). Similarly, Chun and Plass (1996a) report 25% accuracy on production tests compared to 77% accuracy on corresponding recognition tests. In view of all of these factors, the posttest scores in Voka are high, suggesting considerable potential of Voka-like programs in non-experimental learning environments, in which the target words are part of the regular L2 curriculum and learners have more control over the pace of their own learning, their vocabulary review cycles, and the number of words studied at one time.

Naturally, for all words contained in an intentional learning program like Voka, contextual encounters are eventually required to broaden the learners' vocabulary knowledge. Considering Nation's (2001) four strands of learning – meaning-focused input, meaning-focused output, fluency development, and language-focused learning – (see section 3.1), the use of Voka-like intentional CAVL programs that learners can access in their own time frees up classroom sessions for the remaining three strands of learning. For instance, vocabulary study might be more effective before rather than as part of a meaning-oriented reading activity as research suggests that vocabulary annotations included in reading passages, although generally beneficial for vocabulary learning, are often not effective for reading comprehension (Chun, 2006; Tozcu & Coady, 2004). Possibly, the annotations distract from the reading task (Ariew & Ercetin, 2004) and thus separating the two tasks from each other by first using an intentional CAVL program might be more conducive to both vocabulary learning and reading comprehension.

The following two sections discuss pedagogical implications of the two experiments as they relate to multimedia annotations and word concreteness (section 9.2.1), and individualized vocabulary instruction (section 9.2.2).


9.2.1 Multimedia Annotations and Word Concreteness

In considering pedagogical implications, it needs to be recognized that statistically significant differences are not necessarily pedagogically relevant (Norris & Ortega, 2000; Schmitt, 2010). At the same time, the findings of this dissertation prompt reflections not only on the research topics that were investigated but also on their possible impact on related areas. For instance, although this dissertation examined the learning of L2 words, the implications discussed in the following might also be applicable to the teaching and learning of phrases, collocations, and idioms given the indeterminacy of word segmentation (Haspelmath, in press, see section 2.1). More specifically, in considering the results with respect to multimedia annotations and word concreteness for L2 vocabulary learning, experiment 1 prompts three pedagogical recommendations which relate to 1) picture annotations, 2) audio annotations, and 3) the number of annotations provided in a CAVL environment.

Picture annotations

The results of experiment 1 suggest that pictures are useful for L2 vocabulary learning and thus, ideally, pictures should be provided. Of the four annotation types displayed in Voka (picture, definition, gloss, and audio), the picture is the only annotation type that has a statistically significant positive effect on L2 vocabulary learning. The immediate posttest data revealed a main effect of annotation cluster, showing that annotation clusters with a picture are significantly more effective than annotation clusters without a picture (see section 5.1.1). Moreover, vocabulary instruction should also take into account learner preferences for vocabulary annotation types, especially in those instances where learner preferences strengthen empirical findings. In Voka, the picture annotation type received a mean usefulness rating of 2.01 (SD: 1.04, see section 6.1.2), meaning that the participants generally agreed that the pictures in Voka were useful for learning vocabulary (on the 5-point Likert scale, 1 = strongly agree, 2 = agree, 3 = neither agree nor disagree, 4 = disagree, and 5 = strongly disagree).


However, the recommendation to provide pictures needs to be put into perspective by considering the expected pedagogical impact (Norris & Ortega, 2000). The findings suggest that the impact of providing a picture for L2 vocabulary learning is moderate: When studying with pictures and text compared to when studying without pictures, learners can expect an improvement in immediate posttest scores of approximately 10% (6.8% in experiment 1, up to 13% in other studies, see section 6.1.1). Moreover, in Voka, the picture superiority effect did not last over time and 98.4% of the variance in immediate posttest scores was due to factors other than the main effect of annotation cluster (see section 6.1.1). Thus, at first sight, the empirical findings might not justify the effort of generating pictures for each and every target word. However, the findings of experiment 1 also suggest that considerably more effort needs to be spent on teaching L2 abstract words than on teaching L2 concrete words because abstract words are significantly harder to learn than concrete words. In Voka, the concreteness effect persisted over time and had a considerable magnitude of effect on both posttests (η2 = .320 on immediate posttest, η2 = .281 on delayed posttest; see section 5.2.1). From a pedagogical perspective, this finding suggests that it is important to not only spend more time on abstract words in each learning session but also to devote more learning sessions overall to abstract words and this applies to all proficiency levels. For instance, children acquiring their L1 often learn several abstract words early in their development (Tomasello, 2003) and adult L2 learners of all proficiency levels certainly have communicative needs beyond talking or writing about concrete objects in their immediate environment. In L2 listening and reading tasks, abstract words are also encountered frequently because many abstract words are part of the 2000 most frequent words of a language. 
In addition, abstract words also appear early in L2 textbooks for beginner learners. For example, in Deutsch: Na Klar! (Di Donato et al., 2008), the textbook used by the study participants in their German course, the following sample of abstract (i.e., low imageability) nouns is part of the active vocabulary of the first three chapters: Frage (question) (BAWL-R imageability: 2.7, see Vö et al., 2009), Beruf (profession) (imageability: 2.5), Miete (rent) (imageability: 2.4), and Kunde (customer) (imageability: 3.0). Thus, abstract nouns must be taught, even at a beginner level. Moreover, the findings of experiment 1, while indicating that pictures are beneficial for both concrete and abstract words, suggest that it is even more important to provide pictures for abstract words given that the effect of pictures is stronger for abstract words. While learners generally can easily think of pictures for concrete referents, providing them with pre-selected pictures in a CAVL program facilitates their access to the imaginal system for abstract words. As a result, while not completely eliminating the concreteness effect, this at least mitigates it by providing additional help for abstract words, which are inherently harder to learn. Accordingly, there is a strong incentive to provide learners with picture annotations, if not for all target words, then at least for abstract words.

Audio annotations

A further recommendation is to provide audio annotations for L2 vocabulary learning. Although in Voka, annotation clusters containing audio are not significantly more effective than annotation clusters without audio, there are three additional observations that suggest that audio annotations are nonetheless beneficial for L2 vocabulary learning. First, a closer analysis of the posttest data suggested that the audio annotation helped establish the form-meaning link for L2 words but this effect was concealed by the commission of phonologically-motivated misspellings on the written productive recall posttests (see section 6.1.4). Second, descriptively, words studied with audio were learned better than words studied without audio on both posttests of experiment 1 (see section 5.1.2). Third, of the four annotation types provided in Voka, the study participants ranked the audio as the most useful annotation type (mean usefulness rating: 1.57, SD: 0.80).
In assessing the expected pedagogical impact of this recommendation, providing audio annotations may increase student motivation to work with a CAVL program and thus lead to improved learner performance.


Number of multimedia annotations

The final pedagogical recommendation with regard to vocabulary annotations is to pay close attention to the number of annotation types provided in a CAVL environment, especially in cases where learners cannot control the exposure time to the target words. The findings of experiment 1, for instance, suggest that, for learning environments with brief and computer-controlled exposure time, fewer annotation types prevent cognitive overload and are thus more effective than more annotation types (see section 6.1.2). Considering the four annotation types in Voka, the provision of a picture annotation is clearly most important. If two annotation types can be offered, the picture should be combined with an audio annotation given that on all four posttests of both Voka experiments, an annotation cluster containing a picture and audio (i.e., either PA or PAGD) was descriptively the most effective annotation cluster (see section 8.2.3). Furthermore, the study participants ranked the audio and the picture as the two most useful annotation types on the evaluation questionnaire. The fact that learners particularly liked the combination of picture and audio is also expressed by participant voka282, who commented that "the pictures and audio helped to have the vocab words stick in my mind," and voka215, who stated that "it combine the audio with the visual will help to memorize." The provision of fewer rather than more annotation types appears to be especially important when teaching abstract as opposed to concrete words. As experiment 1 showed, the definition is detrimental to learning abstract words but shows no effect regarding concrete words, indicating that cognitive overload caused by too many annotation types is more prevalent for the inherently hard-to-learn abstract words. However, most learning environments, especially those not solely used for research purposes, allow learners to control the exposure time to target words.
In these settings, it may be more beneficial to provide many annotation types to facilitate vocabulary learning, as suggested by Hulstijn and Laufer (2001):

    If learners pay careful attention to the word’s pronunciation, orthography, grammatical category, meaning, and semantic relations to other words, they are more likely to retain the word (i.e., the link between at least one representation of the word’s form and at least one of its meanings) than if they pay attention to only one or two of the above word properties. (pp. 541-542)

For example, annotation cluster PAGD might be quite effective in untimed learning environments because, in providing a wealth of information by which a word can be remembered, it allows learners to focus on the annotations that are most helpful for them.

9.2.2 Individualized Instruction

Experiment 1 calls for individualized instruction in CAVL. The data demonstrate considerable inter-learner and intra-learner variation in terms of the effectiveness of annotation clusters for L2 vocabulary learning (see section 6.4). CAVL programs should acknowledge individual learner differences by allowing learners to explore different learning resources (e.g., annotation clusters) and by then customizing instruction based on learner performance and / or preferences. Attention to IDs might be especially important in CAVL programs that provide learning options that diverge considerably from each other (e.g., annotation clusters PA vs. DG in Voka). Furthermore, the findings of this dissertation show that individualized instruction can be constructed in a flexible manner. Experiment 2 revealed that an alternating presentation sequence of a learner’s two most effective annotation clusters is as effective as a fixed presentation sequence of a learner’s most effective annotation cluster. From a pedagogical perspective, this implies that more learners can be instructed with the same annotation cluster. For example, students who learn best with annotation cluster PA and second-best with DG can study target words with DG with little or no impact on their performance. Moreover, for words that are not annotated with all available annotations, the program has more flexibility in providing learners with a good match of learning help options in the form of multimedia annotations. Finally, annotation clusters that are outliers, that is, that are only effective for very few learners, can be eliminated given that learners may study with their remaining effective annotation cluster in a comparably effective manner.
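
The selection logic implied by experiment 2 can be illustrated with a minimal Python sketch. It assumes that a learner's performance with each annotation cluster is available as a proportion-correct score. The function names, the scores, and the labels C4 and C5 (placeholders for the two clusters not named in this section) are hypothetical and are not taken from Voka's implementation.

```python
from itertools import cycle, islice


def rank_clusters(scores):
    """Return cluster labels ordered from highest to lowest score."""
    # sorted() is stable, so tied clusters keep their original order.
    return sorted(scores, key=lambda label: scores[label], reverse=True)


def presentation_sequence(scores, n_words, alternate=True):
    """Build the annotation-cluster sequence for n_words target words.

    alternate=False: fixed sequence using only the best cluster.
    alternate=True: alternate between the two best clusters.
    """
    ranked = rank_clusters(scores)
    chosen = ranked[:2] if alternate else ranked[:1]
    return list(islice(cycle(chosen), n_words))


# Hypothetical proportion-correct scores for one learner (not study data).
learner_scores = {"PA": 0.67, "DG": 0.50, "PAGD": 0.33, "C4": 0.17, "C5": 0.17}

fixed = presentation_sequence(learner_scores, 6, alternate=False)
alternating = presentation_sequence(learner_scores, 6, alternate=True)
print(fixed)
print(alternating)
```

In this sketch, the fixed sequence uses only the learner's best cluster, whereas the alternating sequence switches between the best and second-best cluster from word to word, mirroring the two presentation sequences compared in experiment 2.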

9.3 Study Limitations and Constraints on Generalizability

This section discusses the limitations of the study and the constraints on the generalizability of the findings of this dissertation.

9.3.1 Study Limitations

The findings of this research project have to be interpreted in light of its limitations. This section discusses three limitations. First, in investigating the concreteness effect in multimedia CAVL (see section 5.2), experiment 1 provides, to the author's knowledge, the most stringent control to date of extraneous variables that might affect item difficulty (e.g., part of speech, phonographic complexity, cognate status, see section 3.2). Nevertheless, the findings are limited by additional extraneous variables that were not intentionally controlled (see, e.g., Schmitt, 2010). As an illustration of additional uncontrolled factors that might influence the learning burden of a word, for the 15 abstract and 15 concrete nouns of experiment 1, Table 46 provides information on the emotional valence and arousal ratings as well as the psycholinguistic index ratings contained in BAWL-R. 51 In addition, information on the

51 In the Berlin Affective Word List Reloaded (BAWL-R, see Vö et al., 2009), emotional valence is rated on a 7-point scale from -3 (very negative) to +3 (very positive) while emotional arousal is rated on a 5-point scale from +1 (low-arousing) to +5 (high-arousing). For example, Gefahr (danger) has a quite negative mean valence rating of -1.8 but a high mean arousal rating of +4.4. In addition to emotional valence and arousal, the following psycholinguistic indexes are included for each word in BAWL-R:

NUMBER OF ORTHOGRAPHIC NEIGHBORS (N): Two words are considered orthographic neighbors when they share all the letters (in the same position) except one (Coltheart, Davelaar, Jonasson, & Besner, 1977). ...

FREQUENCY OF ORTHOGRAPHIC NEIGHBORS (FN): This index refers to the summed frequency of orthographic neighbors.

NUMBER OF HIGHER FREQUENCY ORTHOGRAPHIC NEIGHBORS (HFN): This variable lists the number of words that are higher frequency orthographic neighbors.

FREQUENCY OF HIGHER FREQUENCY ORTHOGRAPHIC NEIGHBORS (FHFN): This variable contains the summed frequency of words that are higher frequency orthographic neighbors.

BIGRAM FREQUENCY (BIGmean): This index provides information on the nonpositional mean token bigram frequency of the critical word, that is, the frequency of those


word length of the L1 translations of the L2 target nouns is included to illustrate additional potential learnability factors stemming from the L1 (Schmitt, 2010). As indicated in Table 46, the valence and arousal ratings, the accent index, and the word length of the L1 translation are fairly similar for the two word types, but there is more variation between abstract and concrete words for N, FN, HFN, FHFN, and bigram frequency, all of which might have affected the results of experiment 1 with regard to word concreteness.

Table 46: Some uncontrolled characteristics of the 30 target words of experiment 1

Words                 Val.   Arou.  N      FN       HFN    FHFN     BIGmean    Accent  L1 word length
Abstract (15)  Mean   -0.5   2.9    0.7    17.9     0.1    12.2     151091.6   1.3     6.2
               SD     1.3    0.7    1.0    51.9     0.3    47.3     59230.0    0.5     1.8
               Min    -2.0   2.1    0.0    0.0      0.0    0.0      55860.3    1.0     3.0
               Max    2.2    4.4    3.0    202.8    1.0    183.0    271566.4   2.0     9.0
Concrete (15)  Mean   0.3    2.6    2.0    189.2    0.5    163.0    242946.2   1.1     5.9
               SD     1.4    0.7    3.3    573.0    1.4    566.4    133445.2   0.4     3.7
               Min    -2.4   1.7    0.0    0.0      0.0    0.0      45866.0    1.0     3.0
               Max    2.5    4.2    13.0   2234.0   5.0    2197.3   505799.2   2.0     18.0

Note. Min = Minimum value, Max = Maximum value (Range = Min to Max), Val. = emotional valence, Arou. = emotional arousal. N: for all 2107 nouns in BAWL-R, the range of N is 0 - 18; FN: BAWL-R noun range: 0 - 130233; HFN: BAWL-R noun range: 0 - 11; FHFN: BAWL-R noun range: 0 - 104092; BIGmean: BAWL-R noun range: 8024.5 - 839507.3. All information except the L1 word length column taken from Vö et al. (2009) (see footnote 51 for further explanation of the columns).
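
To make the index N in Table 46 more concrete, the following minimal Python sketch counts orthographic neighbors according to the definition quoted in footnote 51 (same length, all letters shared in the same position except one; Coltheart et al., 1977). The function names and the toy word list are assumptions made for illustration only; the list is not BAWL-R data, and the BAWL-R values were not computed with this code.

```python
def are_neighbors(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def neighbor_count(word, lexicon):
    """Count the entries in lexicon that are orthographic neighbors of word."""
    target = word.lower()
    return sum(are_neighbors(target, entry.lower()) for entry in lexicon)


# Toy word list for illustration only (hypothetical, not BAWL-R).
toy_lexicon = ["Sorge", "Sorte", "Borge", "Sorgen", "Gefahr"]

n_sorge = neighbor_count("Sorge", toy_lexicon)
print(n_sorge)
```

In the toy list, only Sorte and Borge differ from Sorge in exactly one letter position; Sorge itself (no difference) and the longer words Sorgen and Gefahr do not count as neighbors.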

As a second limitation, when examining annotations and word concreteness (see section 5.3), the posttest scores for abstract words were quite low. This possibly implies a floor effect, which might have prevented the detection of more statistically significant effects pertaining to abstract words (see section 6.3.2). For example, the mean score for abstract words in experiment 1 was 1.53 out of 15 points on the delayed posttest (10.2%) and 45 of the 72 participants (62.5%) scored only 10% or less on the abstract words (i.e., 0 to 1.5 out of 15 points). This also applied to 5 participants (6.9%) on the immediate posttest.

51 (continued) words that contain the bigrams of the critical word regardless of their position within the word.

ACCENT: This variable indicates which syllable of the critical word is stressed when pronouncing the word. (Vö et al., 2009, p. 536)


The third limitation relates to the number of test items and the individualized instruction provided in experiment 2. At least for some students, the individualized instruction was based largely on their scores for the concrete words because their scores for the abstract words were too low to contribute much to their overall performance (see section 6.3.2). Moreover, experiment 1 shows that the effectiveness of annotations varies for concrete versus abstract words (e.g., definitions are only detrimental for learning abstract words on the immediate posttest, see section 5.3.2) and thus individualized instruction should ideally take into consideration a learner's most effective annotation cluster(s) for each word type. This, however, calls for an increased number of test items per word type per annotation cluster per learner. The total number of test items in experiment 1 (30 L2 nouns) is certainly comparable or even above average for research conducted in intentional multimedia CAVL but given that the words were split into two word types and also distributed over five annotation clusters, there were only three words per word type per annotation cluster per learner.
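
The arithmetic behind the second and third limitations can be checked with a short Python sketch. All numbers are taken from the text above; the variable names are chosen for illustration only.

```python
# Third limitation: test items per word type per annotation cluster per learner.
n_items, n_word_types, n_clusters = 30, 2, 5
items_per_cell = n_items // (n_word_types * n_clusters)

# Second limitation: possible floor effect for abstract words.
pct_mean_abstract = round(1.53 / 15 * 100, 1)    # mean delayed posttest score
pct_floor_delayed = round(45 / 72 * 100, 1)      # participants at or below 10%
pct_floor_immediate = round(5 / 72 * 100, 1)     # same criterion, immediate posttest

print(items_per_cell, pct_mean_abstract, pct_floor_delayed, pct_floor_immediate)
```

With only three items in each cell, a single additional correct response shifts a learner's cell score by one third, which illustrates why a larger item pool per word type and annotation cluster would be needed for word-type-specific individualization.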

9.3.2 Constraints on Generalizability of Findings

Aside from the limitations, the findings are also constrained by additional features of the design of the two experiments (Norris & Ortega, 2000, 2001). This section discusses four features which affect the generalizability of the findings. First, both experiments were conducted with L2 learners of German, a language with a Roman script. Accordingly, the findings may not be generalizable to other languages and / or other scripts (see, e.g., Hamada & Koda, 2008). Second, the study participants were all beginner learners, and generally, "results must be viewed in terms of the level of the learner's L2 language ability and cannot be generalized to all learners" (Chun, 2006, p. 77). For example, audio annotations might be more effective for advanced learners because these learners are more accustomed to the L2 orthography and thus might commit fewer phonologically-based misspellings (see section 6.1.4). Third, the findings have to be interpreted in light of the operationalization of L2 vocabulary learning in the two experiments (Levin, 1989; Norris & Ortega, 2000, 2001; Salomon, 1989; Schmitt, 2010). For example, the findings pertain to the learning of L2 nouns and further research will have to establish whether they are generalizable to other parts of speech such as verbs or adjectives. Moreover, the findings are limited in their generalizability by the intentional CAVL environment used and the assessment measure applied: discrete, selective, context-independent written productive recall posttests, which assess the establishment of a link between the meaning (provided as the test prompt) and the written L2 form (the expected learner response). For example, Salomon (1989) points out that "the frequent practice of comparing different stimulus conditions that may accomplish entirely different cognitive functions on common learning outcomes … ignores the possibility that different kinds of outcome may be facilitated by this functional diversity" (p. 79). For instance, in assessing only productive recall, the posttest data do not provide information on learning related to the other degrees of word knowledge (receptive recall, productive recognition, receptive recognition, see section 2.1.1). Furthermore, the posttests in Voka measured remembering of L2 target words and thus the data cannot illuminate the potential facilitative role of the annotations for learners' understanding of the target words or their ability to apply them (see Levin, 1989). Accordingly, this study cannot make any claims regarding the effectiveness of Voka's instruction on reading comprehension or other more contextualized language use. At the same time, Tozcu and Coady (2004), for instance, demonstrated that direct vocabulary instruction has a beneficial effect on both reading comprehension and the rate of speed for frequent word recognition. Fourth, the study was designed as laboratory research (see Hegelheimer & Tower, 2004) to provide considerable variable control for research purposes. While this increased the internal validity of the findings, it likely decreased their external validity (Hegelheimer & Tower, 2004). For example, the findings may not be entirely generalizable to a more natural setting, in which a CAVL program is used as an integral part of the learners' curriculum. The limitations and constraints on generalizability of the research findings of this dissertation pave the way for future research on L2 vocabulary learning, which is discussed in the following section.

9.4 Future Research

The following sections explore some of the possible research endeavours for future studies that are prompted by the findings of this dissertation. Section 9.4.1 lists future research possibilities regarding vocabulary annotations and word concreteness (research topics 1, 2, and 3). Section 9.4.2 discusses future research with respect to individualized instruction (research topics 4 and 5).

9.4.1 Vocabulary Annotations and Word Concreteness

More research is clearly necessary to further understand the effectiveness of the four annotations provided in Voka for CAVL. In general, this might involve assessing a variety of learning outcomes (see, e.g., Webb, 2007) to reach a more comprehensive understanding of the contribution of the four annotations to L2 vocabulary learning. For instance, assessment measures can be varied in future research according to the construct underlying the assessment instrument (discrete vs. embedded), the range of vocabulary (selective vs. comprehensive), and the role of context (context-independent vs. context-dependent) (see Read, 2000, and section 3.9.1). Moreover, future studies can assess reception and / or production, recognition and / or recall, and form, meaning and / or use of the L2 target words (see section 2.1). For example, the effectiveness of the gloss annotation type can be further investigated in a discrete, selective, context-dependent posttest assessing receptive recognition of the L2 words in use. Studies are also needed to investigate the effect of audio annotations in assessments of L2 form recognition versus form recall and in written versus oral production to further analyze whether audio annotations lead to phonologically-motivated misspellings (see section 6.1.4). Moreover, the effect of audio annotations, pictures, definitions, and glosses needs to be explored for learners of different proficiency levels (see section 9.3.2). Future studies should also further explore the idea that a deeper level of processing and / or more evaluation of the picture – referent connection leads to better retention (see section 6.3.2). For example, future studies could compare the amount of time learners devote to picture annotations of abstract versus concrete words in self-paced CAVL programs or employ eye-tracking technology to further explore learner processing of pictures. Researchers could also collect additional qualitative data with think-aloud protocols or post-treatment interviews by asking learners to evaluate and explain specific picture annotations. Furthermore, studies might test the effectiveness of pictures of varying degrees of relatedness to the target words. For instance, for concrete words such as table, would a picture in which learners have to search for a table be more effective than a picture clearly showing it? Or, considering prototype theory (Field, 2003), when studying bird, would it be more effective to show a picture of a prototypical bird such as a robin or a dove (see Figure 22), or a less prototypical bird like a flamingo or a penguin? Furthermore, not every type of picture is equally effective (L. Jones, 2006; Winn, 1989) and studies are needed to compare the effectiveness of, for instance, line drawings versus colour photographs for both abstract and concrete L2 words. More research is also required to examine the proportion of variance unexplained in experiment 1. The effects of annotation cluster (η2 = .016 on immediate posttest) and word concreteness (η2 = .320 on immediate, η2 = .281 on delayed posttest) together account for roughly one third of the variance on each posttest, leaving approximately two thirds of the variance unexplained. Accordingly, future studies might explore to what extent additional factors such as learner characteristics (e.g., gender, motivation, learning style, language background) contribute to the variance in posttest scores in studies of annotation effectiveness in L2 vocabulary learning. Finally, future studies are also needed to further investigate the factors that contribute to a word's learning burden and the ways in which these factors interact with the effectiveness of annotation types for L2 vocabulary learning.
Although not a focus of this dissertation, experiment 1 showed that the words assigned to an annotation cluster influence the effectiveness of that annotation cluster (see section 7.1.1). Thus, studies of picture effectiveness in which all learners received the same words in the same annotation type (e.g., Al-Seghayer, 2001; Chun & Plass, 1996a; Plass et al., 1998) likely confounded the effect of the picture annotation type with the effect of the learning burden of the word (e.g., its imageability) on L2 vocabulary learning. There are two design options to avoid the confounding influence of a word's learning burden on annotation effectiveness. First, studies can employ a between-subjects design in which different learners study the same words with different annotations (e.g., Akbulut, 2007; Dubois & Vial, 2000; L. Jones & Plass, 2002; Kim, 2006; Plass et al., 2003; Yeh & Wang, 2003; Yoshii, 2006; Yoshii & Flaitz, 2002). Second, in within-subject designs, in which each learner is exposed to every annotation, researchers can employ a Latin square design to assure that each annotation is tested with each target word (see experiment 1, section 4.3) or, alternatively, test items can be allocated randomly to the annotations. Both designs allow researchers to study the interaction between item difficulty and annotation effectiveness.
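
The Latin square option mentioned above can be sketched in a few lines of Python. The sketch uses a simple cyclic construction, which is only one of several ways to build a Latin square and is not necessarily the scheme used in experiment 1 (see section 4.3). The word-set names and the cluster labels C4 and C5 are hypothetical placeholders.

```python
def latin_square(conditions):
    """Cyclic Latin square: each row is the condition list rotated by one more step."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_conditions(word_sets, conditions, learner_index):
    """Map each word set to one condition, cycling learners through the rows."""
    square = latin_square(conditions)
    row = square[learner_index % len(square)]
    return dict(zip(word_sets, row))


# Hypothetical labels: three clusters named in the text plus two placeholders.
clusters = ["PA", "DG", "PAGD", "C4", "C5"]
word_sets = ["set1", "set2", "set3", "set4", "set5"]

square = latin_square(clusters)
for learner in range(len(clusters)):
    print(learner, assign_conditions(word_sets, clusters, learner))
```

Because every cluster occurs exactly once in each row and each column, each word set is paired with each annotation cluster equally often across any block of five consecutive learners, so that item difficulty is not confounded with annotation type.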

9.4.2 Individualized Instruction

More research is also needed to further investigate individualized vocabulary instruction. For example, future studies should investigate learner variables and the ways in which they relate to the effectiveness of annotation clusters. This research can establish learner personas relevant to L2 vocabulary learning and determine annotation clusters that are most conducive to their learning (see section 2.6). For instance, for which type of learner is a picture-audio annotation cluster effective and who learns best with a definition-gloss annotation cluster? Different learning and / or cognitive styles, gender, proficiency level, language aptitude, motivation, and other IDs (see section 2.5) can be considered. Future studies might also explore word-based adaptive vocabulary instruction because experiment 1 suggests that learners benefit from adaptation to the imageability of the target word. Word-based adaptive instruction was not pursued in this dissertation because experiment 1 did not find a statistically significant interaction of word type and annotation cluster effectiveness (see section 5.3.1). Nevertheless, the results of experiment 1 indicate that annotations are differentially effective for abstract versus concrete words. For instance, descriptively, annotation cluster PAGD was most effective for concrete but least effective for abstract words on the delayed posttest (see section 5.3.1). Studies might also want to compare the effectiveness of learner-based versus word-based adaptive instruction, or a combination of learner-based and word-based adaptivity. From a practicality perspective, it is certainly easier to implement word-based adaptive instruction because, in contrast to learner-based adaptive instruction, it does not require extensive pretesting of each new system user. Finally, future studies are needed to explore the possibility of dynamic rather than static adaptation of annotation clusters. Learner needs may change over time, which may require continued performance evaluation and dynamic adaptation rather than a static, one-time pre-assessment of learner differences and needs (see section 2.5 and section 2.6). This might involve providing learners with open (i.e., inspectable) learner models (see, e.g., Bull, Dimitrova, & McCalla, 2007), which allow them to select annotations and to modify their choices based on their preferences and / or frequently updated system recommendations. Voka's design, for example, lends itself to an exploration of dynamic adaptation because learners can be continually monitored on their performance with the five annotation clusters while interacting with the program. Furthermore, the goal of a CAVL program is to provide the most effective vocabulary instruction for each learner and thus future studies might also analyze learner motivation and attitudes towards the use of a CAVL program. For example, it would be interesting to investigate the ways in which learners view individualized instruction compared to a generic CAVL program and the extent to which this may affect learning outcomes. Clearly, CAVL remains a field ripe for research and analysis. While this research project contributed some pieces to the big puzzle of CAVL, it is hoped that the further accumulation of research studies will ultimately enable us to provide an individualized computer-assisted learning environment that is most conducive to L2 vocabulary learning.


APPENDICES


Appendix A: Test Items

The following table lists the L2 test items of experiments 1 (i.e., part 1 of Voka) and 2 (i.e., part 2 of Voka), their word type (abstract or concrete), their mean imageability score (scale from 1.0 to 7.0), the distractors shown in practice phase 1 (see section 3.7.3) and the following word information as displayed in Voka: English translation, German example sentence, gloss, definition, and picture (see sections 3.3, 3.4, 3.5 and 3.6). The table first lists the abstract and then the concrete words of experiment 1, and then the abstract followed by the concrete words of experiment 2, all in alphabetical order. For space reasons, the article and plural of the test items are not included in the table. Note that in this dissertation, the eyes of persons who appear in pictures are masked by a black box to protect their privacy. The eyes were unmasked in the actual Voka program.

Figure 27: The test items of experiments 1 and 2

Noun: Ablauf | Translation: procedure | Word type: abstract | Imageability: 2.11 | Part: 1 | Distractors: Aal, Ahorn
Sentence: Der Ablauf für den Test ist: schreiben, lesen, sprechen.
Gloss: The procedure for the test is: writing, reading, speaking.
Definition: Ablauf: specific progression of a process or event.
Picture: [image not reproduced]


Noun: Anfang | Translation: beginning | Word type: abstract | Imageability: 2.44 | Part: 1 | Distractors: Affe, Azubi
Sentence: Tristan ist noch am Anfang von dem Buch.
Gloss: Tristan is still at the beginning of the book.
Definition: Anfang: the point at which something starts.
Picture: [image not reproduced]

Noun: Befehl | Translation: command | Word type: abstract | Imageability: 2.68 | Part: 1 | Distractors: Bote, Bursche
Sentence: "Bleib da!" Das ist der Befehl an Waldo.
Gloss: "Stay there!" That is the command for Waldo.
Definition: Befehl: obligatory order or instruction from a superior.
Picture: [image not reproduced]

Noun: Beweis | Translation: proof | Word type: abstract | Imageability: 3.00 | Part: 1 | Distractors: Bohrer, Bund
Sentence: Die Polizei findet einen Beweis in Philips Wohnzimmer.
Gloss: The police finds proof in Philip's living room.
Definition: Beweis: evidence that something is indeed like someone claimed.
Picture: [image not reproduced]


Noun: Gefahr | Translation: danger | Word type: abstract | Imageability: 2.91 | Part: 1 | Distractors: Gans, Garderobe
Sentence: Kleine Kinder sind im Haus oft in Gefahr.
Gloss: Little children are often in danger around the house.
Definition: Gefahr: potential exposure to pain or harm.
Picture: [image not reproduced]

Noun: Gesetz | Translation: law | Word type: abstract | Imageability: 2.86 | Part: 1 | Distractors: Glatteis, Garn
Sentence: Das Parlament diskutiert gerade dieses Gesetz.
Gloss: The parliament is currently discussing this law.
Definition: Gesetz: a community's regulations established by some authority.
Picture: [image not reproduced]

Noun: Grund | Translation: reason | Word type: abstract | Imageability: 1.73 | Part: 1 | Distractors: Gallier, Gimpel
Sentence: Doris sieht keinen Grund für das Problem.
Gloss: Doris does not see a reason for the problem.
Definition: Grund: motive supporting a conclusion or explaining a fact.
Picture: [image not reproduced]

Noun: Hoffnung | Translation: hope | Word type: abstract | Imageability: 2.00 | Part: 1 | Distractors: Huld, Hachse
Sentence: Ich habe ein großes Problem, aber ich habe Hoffnung.
Gloss: I have a big problem, but I have hope.
Definition: Hoffnung: feeling that something might turn out as desired.
Picture: [image not reproduced]

Noun: Inhalt | Translation: content | Word type: abstract | Imageability: 2.67 | Part: 1 | Distractors: Igel, Ire
Sentence: Der Inhalt von dieser Box ist interessant.
Gloss: The content of this box is interesting.
Definition: Inhalt: something enclosed or contained in something else.
Picture: [image not reproduced]

Noun: Mangel | Translation: defect | Word type: abstract | Imageability: 2.78 | Part: 1 | Distractors: Mut, Morast
Sentence: Diese Skulptur hat da einen Mangel.
Gloss: This sculpture has a defect there.
Definition: Mangel: imperfection that reduces the worth of something.
Picture: [image not reproduced]

Noun: Schuld | Translation: blame | Word type: abstract | Imageability: 2.78 | Part: 1 | Distractors: Schere, Schottin
Sentence: Sarah hat die Schuld an diesem Problem.
Gloss: Sarah is to blame for this problem.
Definition: Schuld: responsibility for a wrongdoing or an offence.
Picture: [image not reproduced]

Noun: Sorge | Translation: worry | Word type: abstract | Imageability: 2.77 | Part: 1 | Distractors: Sicht, Suppe
Sentence: Alex ist noch nicht da. Das ist Cornelias Sorge.
Gloss: Alex is not there yet. That is Cornelia's worry.
Definition: Sorge: restlessness or fear caused by a difficult situation.
Picture: [image not reproduced]

Noun: Vorwurf | Translation: criticism | Word type: abstract | Imageability: 2.89 | Part: 1 | Distractors: Vicomte, Varistor
Sentence: Martina macht Thomas einen Vorwurf: "Du bist unpraktisch!"
Gloss: Martina's criticism of Thomas is: "You are impractical!"
Definition: Vorwurf: expressing disapproval with someone's supposed wrongdoing.
Picture: [image not reproduced]

Noun Zustand

Translation

Word type

Imageability

Part

Distractors

state

abstract

1.78

1

Zipfel, Zorn

1

Zapfen, Zentner

Sentence

Das Zimmer ist in einem schlechten Zustand.

Gloss

The room is in a bad state.

Definition

Zustand: condition something is in at a given time.

Picture

Zweck

purpose

abstract

1.77

Sentence

Diese Boots sind schön und haben einen praktischen Zweck.

Gloss

These boots are pretty and have a practical purpose.

Definition

Zweck: a clearly directed intent or use.

Picture

Beamte

government employee

concrete

5.00

1

Sentence

Dieser Beamte arbeitet bei der Stadt Berlin.

Gloss

This government employee works for the city of Berlin.

Definition

Beamte: person employed by the state, usually for life.

Picture

Bon, Bottich

Noun Grenze

Translation

Word type

Imageability

Part

Distractors

border

concrete

5.78

1

Gabe, Gicht

Sentence

Die Grenze von Deutschland ist 3621 Kilometer lang.

Gloss

Germany's border is 3621 kilometres long.

Definition

Grenze: edge of a region, boundary between two regions.

Picture

Herbst

fall

concrete

5.78

Sentence

Viele Menschen wandern gern im Herbst.

Gloss

Many people like to go hiking in the fall.

Definition

Herbst: the season between summer and winter.

1

Haarkamm, Huf

1

Habicht, Husten

Picture

Himmel

sky

concrete

5.82

Sentence

Der Himmel ist sehr schön heute.

Gloss

The sky is very beautiful today.

Definition

Himmel: the upper atmosphere of the earth.

Picture

Noun Kreis

Translation

Word type

Imageability

Part

Distractors

circle

concrete

6.46

1

Kellner, Konzern

1

Kerze, Kirmes

Sentence

Dieser Kreis hat einen Radius von 5 cm.

Gloss

This circle has a radius of 5 cm.

Definition

Kreis: flat and completely round geometric shape.

Picture

Kunst

art

concrete

5.42

Sentence

Viele Menschen finden moderne Kunst schön.

Gloss

Many people find modern art beautiful.

Definition

Kunst: creative expression of what is beautiful or aesthetic.

Picture

Pfarrer

priest

concrete

6.11

1

Pokal, Pinsel

Sentence

Der Pfarrer spricht heute über viele Religionen.

Gloss

The priest is speaking about many religions today.

Definition

Pfarrer: community theologian performing church service and spiritual guidance.

Picture

Noun Rechnung

Translation

Word type

Imageability

Part

Distractors

bill

concrete

5.78

1

Rast, Rosine

Sentence

Wie hoch ist die Rechnung für mein Essen?

Gloss

How much is the bill for my meal?

Definition

Rechnung: form that details the cost of something purchased.

Picture

Regen

rain

concrete

6.77

Sentence

Viele Menschen bleiben bei Regen im Haus.

Gloss

Many people stay in the house when there is rain.

Definition

Regen: precipitation that consists of water drops.

1

Ramsch, Rost

1

Radau, Ruhm

Picture

Richter

judge

concrete

5.96

Sentence

Der Richter sagt: "Peter hat Recht, Susanne nicht."

Gloss

The judge says: "Peter is right, Susanne isn't."

Definition

Richter: legal expert administering justice according to state laws.

Picture

Noun Schritt

Translation

Word type

Imageability

Part

Distractors

step

concrete

5.15

1

Schwefel, Schlauch

1

Uso, Unwille

Sentence

Ich mache einen Schritt und bin im Garten.

Gloss

I take a step and I'm in my garden.

Definition

Schritt: movement made by advancing one's foot.

Picture

Urlaub

vacation

concrete

5.09

Sentence

Ich mache nächstes Jahr Urlaub in Italien.

Gloss

I'm going on vacation to Italy next year.

Definition

Urlaub: time spent away from work, especially in travel.

Picture

Verkehr

traffic

concrete

5.69

1

Vulkan, Vagabund

Sentence

Heute ist nicht viel Verkehr auf der Straße.

Gloss

There is not much traffic on the road today.

Definition

Verkehr: movement of vehicles or pedestrians through an area.

Picture

Noun Vogel

Translation

Word type

Imageability

Part

Distractors

bird

concrete

5.86

1

Vandale, Valeur

Sentence

Ich sehe einen Vogel in meinem Garten.

Gloss

I see a bird in my garden.

Definition

Vogel: egg-laying creature with feathers, usually capable of flying.

Picture

Waffe

weapon

concrete

5.14

Sentence

Herr Braun hat eine alte Waffe im Haus.

Gloss

Mr. Braun has an old weapon in the house.

Definition

Waffe: object used to attack or destroy.

1

Wut, Wirtin

2

Achse, Alm

Picture

Auswahl

selection

abstract

2.81

Sentence

Hier gibt es eine große Auswahl an Essen.

Gloss

There is a large selection of food here.

Definition

Auswahl: the variety of products or services offered.

Picture

Noun Bereich

Translation

Word type

Imageability

Part

Distractors

area

abstract

2.62

2

Bogen, Bube

2

Boje, Bucht

Sentence

Hier ist der Bereich für Kunden.

Gloss

Here is the area for customers.

Definition

Bereich: a particular extent of space with boundaries.

Picture

Bildung

education

abstract

2.92

Sentence

Angela Schulz hat eine gute politische Bildung.

Gloss

Angela Schulz has a good political education.

Definition

Bildung: development of human intellectual and mental capacities.

Picture

Fehler

mistake

abstract

2.50

Sentence

Da ist ein Fehler in deinem Test.

Gloss

There is a mistake on your test.

Definition

Fehler: deviation from what is correct or right.

Picture

2

Faulpelz, Finne

Noun Folge

Translation

Word type

Imageability

Part

Distractors

sequence

abstract

2.45

2

Famulatur, Faust

Sentence

Die Zahlen "4, 5, 6" sind eine Folge.

Gloss

The numbers "4, 5, 6" are a sequence.

Definition

Folge: connected series where one thing follows another.

Picture

Menge

crowd

abstract

2.59

2

Sentence

Heute ist eine Menge von Menschen im Biergarten.

Gloss

A crowd of people is at the beer garden today.

Definition

Menge: large number of persons, especially when together.

Motte, Muschel

Picture

Merkmal

attribute

abstract

2.67

Sentence

Hat diese Frau ein interessantes Merkmal?

Gloss

Does this woman have an interesting attribute?

Definition

Merkmal: characteristic feature of a person or thing.

Picture

2

Mus, Mofa

Noun Pflicht

Translation

Word type

Imageability

Part

Distractors

duty

abstract

1.78

2

Patrone, Perle

Sentence

Genaue und gute Arbeit ist meine Pflicht.

Gloss

Exact and good work is my duty.

Definition

Pflicht: obligatory tasks or conduct arising from one's position.

Picture

Sache

thing

abstract

1.95

Sentence

Sie diskutieren diese Sache mit dem Haus.

Gloss

They are discussing that thing with the house.

Definition

Sache: a certain matter or issue of concern.

2

See, Skizze

2

Stufe, Steuer

Picture

Stimmung

mood

abstract

2.67

Sentence

Auf dieser Party ist die Stimmung toll.

Gloss

The mood is great at this party.

Definition

Stimmung: quality of feeling at a particular time.

Picture

Noun Umfang

Translation

Word type

Imageability

Part

Distractors

circumference

abstract

2.50

2

Uhu, Ukelei

Sentence

Der Tisch hat einen Umfang von 126 cm.

Gloss

The table has a circumference of 126 cm.

Definition

Umfang: the outer boundary, especially of a circular area.

Picture

Verbot

ban

abstract

2.64

2

Sentence

Wir brauchen ein Verbot von Zigaretten.

Gloss

We need a ban on cigarettes.

Definition

Verbot: a strict order prohibiting or disallowing something.

Vieh, Visier

Picture

Vorteil

advantage

abstract

1.89

2

Sentence

Christopher ist groß. Das ist oft ein Vorteil.

Gloss

Christopher is tall. That is often an advantage.

Definition

Vorteil: having a superior position compared to others.

Picture

Vampir, Vaduzer

Noun Zukunft

Translation

Word type

Imageability

Part

Distractors

future

abstract

2.78

2

Zeile, Ziffer

2

Bock, Bummel

2

Boden, Borretsch

Sentence

Kinder haben in Deutschland eine gute Zukunft.

Gloss

Children have a good future in Germany.

Definition

Zukunft: time that comes after the present.

Picture

Bauer

farmer

concrete

5.18

Sentence

Dieser Bauer arbeitet auf einer Farm bei Frankfurt.

Gloss

This farmer works on a farm near Frankfurt.

Definition

Bauer: person cultivating farmland or raising cattle.

Picture

Blick

look

concrete

5.46

Sentence

Annas Blick sagt viel. Sie findet Michael interessant.

Gloss

Anna's look says a lot. She finds Michael interesting.

Definition

Blick: the temporary act of looking somewhere.

Picture

Noun Brust

Translation

Word type

Imageability

Part

Distractors

chest

concrete

6.42

2

Bosheit, Bulette

2

Falz, Fremde

Sentence

Ich brauche eine Lotion für meine Brust.

Gloss

I need a lotion for my chest.

Definition

Brust: part of the body between neck and abdomen.

Picture

Flughafen

airport

concrete

6.77

Sentence

Nicole kommt um 12:30 an den Flughafen.

Gloss

Nicole is coming to the airport at 12:30.

Definition

Flughafen: the premises where airplanes arrive and depart.

Picture

Gehirn

brain

concrete

6.22

Sentence

Der Professor fragt: "Wie arbeitet das Gehirn?"

Gloss

The professor asks: "How does the brain work?"

Definition

Gehirn: mass in the skull where consciousness sits.

Picture

2

Golderz, Gurgeln

Noun Gesicht

Translation

Word type

Imageability

Part

Distractors

face

concrete

6.00

2

Gatter, Gutachten

2

Kehle, Koppel

2

Libelle, Luft

Sentence

Diese Frau hat ein schönes Gesicht.

Gloss

This woman has a beautiful face.

Definition

Gesicht: front of the head from chin to hairline.

Picture

Krankheit

illness

concrete

5.46

Sentence

Mark geht es schlecht. Er hat eine Krankheit.

Gloss

Mark is feeling bad. He has an illness.

Definition

Krankheit: unhealthy condition of the body or mind.

Picture

Landschaft

scenery

concrete

6.38

Sentence

Die Landschaft bei Frankfurt ist besonders schön.

Gloss

The scenery around Frankfurt is particularly beautiful.

Definition

Landschaft: general appearance of a geographic region.

Picture

Noun Loch

Translation

Word type

Imageability

Part

Distractors

hole

concrete

6.11

2

Lumen, Lyzeum

Sentence

In meiner Wand ist ein großes Loch.

Gloss

There is a big hole in my wall.

Definition

Loch: a man-made or natural opening through something.

Picture

Punkt

period

concrete

6.19

2

Papagei, Pegel

Sentence

Bitte mach hier einen Punkt und kein Komma.

Gloss

Please put a period here, not a comma.

Definition

Punkt: orthographic character marking the end of a sentence.

Picture

Rede

speech

concrete

5.12

2

Sentence

Die Rede von Frau Meier ist ausgezeichnet.

Gloss

Mrs. Meier's speech is excellent.

Definition

Rede: oral presentation of thoughts before an audience.

Picture

Rasur, Russin

Noun Schmerz

Translation

Word type

Imageability

Part

Distractors

pain

concrete

5.27

2

Schotter, Schlag

Sentence

Der Schmerz in meiner Hand ist noch da.

Gloss

The pain in my hand is still there.

Definition

Schmerz: suffering associated with bodily disorders or emotional distress.

Picture

Wald

forest

concrete

6.56

Sentence

Annika und Martin laufen oft im Wald.

Gloss

Annika and Martin often run in the forest.

Definition

Wald: dense growth of trees covering a large area.

2

Wortlaut, Wucher

2

Wimper, Wurzel

Picture

Welle

wave

concrete

5.68

Sentence

Ein Surfer sucht immer eine große Welle.

Gloss

A surfer is always looking for a big wave.

Definition

Welle: continuous up-and-down motion of the surface of water.

Picture

Appendix B: Program Flow Samples

Program Flow Sample of Experiment 1

As an illustration, Table 47 provides the program flow of the treatment and posttest assessment in Voka for participant voka201 in experiment 1. Voka201 was in exposure group 1 (see section 4.3). Accordingly, in all phases of the program, voka201 received the words Ablauf, Gesetz, Mangel, Kreis, Richter, and Schritt in annotation cluster PG, the words Grund, Hoffnung, Sorge, Himmel, Regen, and Verkehr in annotation cluster DG, the words Inhalt, Schuld, Zustand, Kunst, Rechnung, and Vogel in annotation cluster PA, the words Anfang, Befehl, Gefahr, Grenze, Pfarrer, and Waffe in annotation cluster DA, and the words Beweis, Vorwurf, Zweck, Beamte, Herbst, and Urlaub in annotation cluster PAGD (see section 4.3, in particular, Table 13 and Table 14). The words are shown in random display order for each learner in each of the Voka phases. This is exemplified for voka201 with the target word Vogel (bird), highlighted in bold in Table 47. Vogel (bird) is shown as the 8th of the 30 target words in study phase 1, as the 29th word in study phase 2, as the 7th word in practice phase 1, as the 2nd word in practice phase 2, as the 27th word on the immediate posttest, and as the 1st word on the delayed posttest. Appendix A provides the English translation along with the additional word information shown for each test item. The five annotation clusters are illustrated with the target word Herbst (fall) in section 3.6.

Table 47: Program flow of experiment 1 for participant voka201

Phase | Display order | Target words
St1 | 1-6 | Regen (C, DG); Zweck (A, PAGD); Schritt (C, PG); Beweis (A, PAGD); Grund (A, DG); Vorwurf (A, PAGD)
St1 | 7-12 | Pfarrer (C, DA); Vogel (C, PA); Gesetz (A, PG); Anfang (A, DA); Himmel (C, DG); Befehl (A, DA)
St1 | 13-18 | Inhalt (A, PA); Richter (C, PG); Waffe (C, DA); Schuld (A, PA); Mangel (A, PG); Urlaub (C, PAGD)
St1 | 19-24 | Herbst (C, PAGD); Gefahr (A, DA); Kreis (C, PG); Grenze (C, DA); Hoffnung (A, DG); Zustand (A, PA)
St1 | 25-30 | Ablauf (A, PG); Rechnung (C, PA); Beamte (C, PAGD); Verkehr (C, DG); Sorge (A, DG); Kunst (C, PA)
St2 | 1-6 | Grenze (C, DA); Schuld (A, PA); Rechnung (C, PA); Pfarrer (C, DA); Gesetz (A, PG); Beweis (A, PAGD)
St2 | 7-12 | Regen (C, DG); Verkehr (C, DG); Sorge (A, DG); Anfang (A, DA); Kunst (C, PA); Schritt (C, PG)
St2 | 13-18 | Hoffnung (A, DG); Zweck (A, PAGD); Zustand (A, PA); Himmel (C, DG); Gefahr (A, DA); Befehl (A, DA)
St2 | 19-24 | Urlaub (C, PAGD); Mangel (A, PG); Grund (A, DG); Richter (C, PG); Beamte (C, PAGD); Waffe (C, DA)
St2 | 25-30 | Herbst (C, PAGD); Kreis (C, PG); Ablauf (A, PG); Inhalt (A, PA); Vogel (C, PA); Vorwurf (A, PAGD)
Pr1 | 1-6 | Regen (C, DG); Hoffnung (A, DG); Mangel (A, PG); Verkehr (C, DG); Sorge (A, DG); Zustand (A, PA)
Pr1 | 7-12 | Vogel (C, PA); Richter (C, PG); Herbst (C, PAGD); Befehl (A, DA); Inhalt (A, PA); Himmel (C, DG)
Pr1 | 13-18 | Grenze (C, DA); Anfang (A, DA); Vorwurf (A, PAGD); Pfarrer (C, DA); Gefahr (A, DA); Schritt (C, PG)
Pr1 | 19-24 | Beamte (C, PAGD); Beweis (A, PAGD); Urlaub (C, PAGD); Ablauf (A, PG); Grund (A, DG); Schuld (A, PA)
Pr1 | 25-30 | Waffe (C, DA); Kreis (C, PG); Rechnung (C, PA); Zweck (A, PAGD); Gesetz (A, PG); Kunst (C, PA)
Pr2 | 1-6 | Regen (C, DG); Vogel (C, PA); Grund (A, DG); Inhalt (A, PA); Herbst (C, PAGD); Sorge (A, DG)
Pr2 | 7-12 | Verkehr (C, DG); Kunst (C, PA); Zweck (A, PAGD); Zustand (A, PA); Anfang (A, DA); Kreis (C, PG)
Pr2 | 13-18 | Waffe (C, DA); Beamte (C, PAGD); Vorwurf (A, PAGD); Schuld (A, PA); Rechnung (C, PA); Ablauf (A, PG)
Pr2 | 19-24 | Richter (C, PG); Mangel (A, PG); Himmel (C, DG); Befehl (A, DA); Gefahr (A, DA); Schritt (C, PG)
Pr2 | 25-30 | Grenze (C, DA); Urlaub (C, PAGD); Gesetz (A, PG); Hoffnung (A, DG); Pfarrer (C, DA); Beweis (A, PAGD)
ImmP | 1-6 | Gefahr (A, DA); Himmel (C, DG); Gesetz (A, PG); Kunst (C, PA); Vorwurf (A, PAGD); Beweis (A, PAGD)
ImmP | 7-12 | Hoffnung (A, DG); Zustand (A, PA); Zweck (A, PAGD); Mangel (A, PG); Beamte (C, PAGD); Kreis (C, PG)
ImmP | 13-18 | Schuld (A, PA); Schritt (C, PG); Ablauf (A, PG); Richter (C, PG); Waffe (C, DA); Befehl (A, DA)
ImmP | 19-24 | Regen (C, DG); Herbst (C, PAGD); Anfang (A, DA); Sorge (A, DG); Verkehr (C, DG); Pfarrer (C, DA)
ImmP | 25-30 | Inhalt (A, PA); Rechnung (C, PA); Vogel (C, PA); Urlaub (C, PAGD); Grund (A, DG); Grenze (C, DA)
DelP | 1-6 | Vogel (C, PA); Ablauf (A, PG); Rechnung (C, PA); Kreis (C, PG); Schritt (C, PG); Kunst (C, PA)
DelP | 7-12 | Schuld (A, PA); Grenze (C, DA); Richter (C, PG); Grund (A, DG); Sorge (A, DG); Anfang (A, DA)
DelP | 13-18 | Urlaub (C, PAGD); Mangel (A, PG); Beweis (A, PAGD); Vorwurf (A, PAGD); Befehl (A, DA); Himmel (C, DG)
DelP | 19-24 | Verkehr (C, DG); Inhalt (A, PA); Hoffnung (A, DG); Gesetz (A, PG); Gefahr (A, DA); Waffe (C, DA)
DelP | 25-30 | Regen (C, DG); Zweck (A, PAGD); Pfarrer (C, DA); Beamte (C, PAGD); Herbst (C, PAGD); Zustand (A, PA)

Note. Phase = Voka phase; St1 = study phase 1; St2 = study phase 2; Pr1 = practice phase 1; Pr2 = practice phase 2; ImmP = immediate posttest; DelP = delayed posttest; A = abstract word; C = concrete word.
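The random display order described for Table 47 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not Voka's actual implementation: the helper name `build_program_flow`, the seed value, and the data layout are hypothetical. Only the cluster assignment for voka201 and the six phase labels are taken from the text.

```python
import random

# Cluster assignment for voka201 (exposure group 1), as listed in Appendix B.
CLUSTERS = {
    "PG": ["Ablauf", "Gesetz", "Mangel", "Kreis", "Richter", "Schritt"],
    "DG": ["Grund", "Hoffnung", "Sorge", "Himmel", "Regen", "Verkehr"],
    "PA": ["Inhalt", "Schuld", "Zustand", "Kunst", "Rechnung", "Vogel"],
    "DA": ["Anfang", "Befehl", "Gefahr", "Grenze", "Pfarrer", "Waffe"],
    "PAGD": ["Beweis", "Vorwurf", "Zweck", "Beamte", "Herbst", "Urlaub"],
}
PHASES = ["St1", "St2", "Pr1", "Pr2", "ImmP", "DelP"]


def build_program_flow(clusters, phases, seed=None):
    """Hypothetical helper: shuffle the same (word, cluster) pairs anew for each phase."""
    rng = random.Random(seed)  # the seed is an assumption, used only for repeatability
    items = [(word, cluster) for cluster, words in clusters.items() for word in words]
    flow = {}
    for phase in phases:
        order = items[:]    # each word keeps its cluster in every phase
        rng.shuffle(order)  # only the display order changes between phases
        flow[phase] = order
    return flow


flow = build_program_flow(CLUSTERS, PHASES, seed=201)
for phase in PHASES:
    position = [word for word, _ in flow[phase]].index("Vogel") + 1
    print(f"{phase}: Vogel is word {position} of {len(flow[phase])}")
```

The sketch keeps the word-to-cluster mapping fixed and reshuffles only the order, which matches what Table 47 shows: Vogel is always presented in cluster PA, but at a different position in each phase.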

Program Flow Samples of Experiment 2

Table 48 shows the program flow for voka244, a participant in the ALT group in experiment 2. Voka244 received instruction that alternated between annotation clusters DA and PAGD (see section 7.3.1).

Table 48: Program flow of experiment 2 for participant voka244 (ALT group)

Phase | Display order | Target words
St1 | 1-6 | Menge (A, PAGD); Bauer (C, DA); Merkmal (A, DA); Bildung (A, PAGD); Loch (C, DA); Punkt (C, DA)
St1 | 7-12 | Bereich (A, DA); Umfang (A, DA); Folge (A, PAGD); Gesicht (C, PAGD); Stimmung (A, DA); Schmerz (C, PAGD)
St1 | 13-18 | Verbot (A, PAGD); Zukunft (A, PAGD); Vorteil (A, PAGD); Flughafen (C, DA); Auswahl (A, DA); Krankheit (C, PAGD)
St1 | 19-24 | Rede (C, PAGD); Wald (C, DA); Welle (C, PAGD); Pflicht (A, PAGD); Sache (A, DA); Fehler (A, DA)
St1 | 25-28 | Brust (C, PAGD); Landschaft (C, DA); Gehirn (C, DA); Blick (C, PAGD)
St2 | 1-6 | Merkmal (A, DA); Sache (A, DA); Welle (C, PAGD); Schmerz (C, PAGD); Pflicht (A, PAGD); Vorteil (A, PAGD)
St2 | 7-12 | Rede (C, PAGD); Wald (C, DA); Auswahl (A, DA); Verbot (A, PAGD); Gesicht (C, PAGD); Loch (C, DA)
St2 | 13-18 | Fehler (A, DA); Bereich (A, DA); Punkt (C, DA); Zukunft (A, PAGD); Blick (C, PAGD); Krankheit (C, PAGD)
St2 | 19-24 | Bauer (C, DA); Stimmung (A, DA); Bildung (A, PAGD); Folge (A, PAGD); Gehirn (C, DA); Menge (A, PAGD)
St2 | 25-28 | Landschaft (C, DA); Umfang (A, DA); Flughafen (C, DA); Brust (C, PAGD)
Pr1 | 1-6 | Rede (C, PAGD); Sache (A, DA); Bauer (C, DA); Stimmung (A, DA); Gehirn (C, DA); Bildung (A, PAGD)
Pr1 | 7-12 | Folge (A, PAGD); Menge (A, PAGD); Landschaft (C, DA); Brust (C, PAGD); Krankheit (C, PAGD); Gesicht (C, PAGD)
Pr1 | 13-18 | Zukunft (A, PAGD); Vorteil (A, PAGD); Blick (C, PAGD); Welle (C, PAGD); Auswahl (A, DA); Merkmal (A, DA)
Pr1 | 19-24 | Verbot (A, PAGD); Bereich (A, DA); Punkt (C, DA); Loch (C, DA); Umfang (A, DA); Wald (C, DA)
Pr1 | 25-28 | Pflicht (A, PAGD); Flughafen (C, DA); Fehler (A, DA); Schmerz (C, PAGD)
Pr2 | 1-6 | Schmerz (C, PAGD); Gehirn (C, DA); Fehler (A, DA); Wald (C, DA); Landschaft (C, DA); Umfang (A, DA)
Pr2 | 7-12 | Blick (C, PAGD); Rede (C, PAGD); Auswahl (A, DA); Vorteil (A, PAGD); Loch (C, DA); Bauer (C, DA)
Pr2 | 13-18 | Brust (C, PAGD); Pflicht (A, PAGD); Flughafen (C, DA); Zukunft (A, PAGD); Krankheit (C, PAGD); Gesicht (C, PAGD)
Pr2 | 19-24 | Verbot (A, PAGD); Merkmal (A, DA); Bildung (A, PAGD); Stimmung (A, DA); Sache (A, DA); Welle (C, PAGD)
Pr2 | 25-28 | Menge (A, PAGD); Bereich (A, DA); Punkt (C, DA); Folge (A, PAGD)
ImmP | 1-6 | Gesicht (C, PAGD); Pflicht (A, PAGD); Rede (C, PAGD); Blick (C, PAGD); Flughafen (C, DA); Verbot (A, PAGD)
ImmP | 7-12 | Brust (C, PAGD); Zukunft (A, PAGD); Gehirn (C, DA); Krankheit (C, PAGD); Schmerz (C, PAGD); Umfang (A, DA)
ImmP | 13-18 | Vorteil (A, PAGD); Sache (A, DA); Bildung (A, PAGD); Loch (C, DA); Welle (C, PAGD); Landschaft (C, DA)
ImmP | 19-24 | Bereich (A, DA); Folge (A, PAGD); Menge (A, PAGD); Bauer (C, DA); Stimmung (A, DA); Punkt (C, DA)
ImmP | 25-28 | Merkmal (A, DA); Fehler (A, DA); Wald (C, DA); Auswahl (A, DA)
DelP | 1-6 | Sache (A, DA); Brust (C, PAGD); Folge (A, PAGD); Krankheit (C, PAGD); Schmerz (C, PAGD); Umfang (A, DA)
DelP | 7-12 | Zukunft (A, PAGD); Stimmung (A, DA); Vorteil (A, PAGD); Fehler (A, DA); Blick (C, PAGD); Wald (C, DA)
DelP | 13-18 | Bauer (C, DA); Flughafen (C, DA); Pflicht (A, PAGD); Bildung (A, PAGD); Verbot (A, PAGD); Gesicht (C, PAGD)
DelP | 19-24 | Auswahl (A, DA); Landschaft (C, DA); Merkmal (A, DA); Loch (C, DA); Punkt (C, DA); Bereich (A, DA)
DelP | 25-28 | Rede (C, PAGD); Gehirn (C, DA); Welle (C, PAGD); Menge (A, PAGD)
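As a quick check on Table 48, the following Python sketch tallies voka244's target words by word type (A or C) and annotation cluster (DA or PAGD). The word labels are copied from the study phase 1 (St1) entries of Table 48; the variable names and the regular expression are illustrative assumptions and are not part of Voka.

```python
import re
from collections import Counter

# St1 entries for voka244, copied from Table 48 (28 target words).
ST1_VOKA244 = """
Menge (A, PAGD) Bereich (A, DA) Verbot (A, PAGD) Rede (C, PAGD) Brust (C, PAGD)
Bauer (C, DA) Umfang (A, DA) Zukunft (A, PAGD) Wald (C, DA) Landschaft (C, DA)
Merkmal (A, DA) Folge (A, PAGD) Vorteil (A, PAGD) Welle (C, PAGD) Gehirn (C, DA)
Bildung (A, PAGD) Gesicht (C, PAGD) Flughafen (C, DA) Pflicht (A, PAGD) Blick (C, PAGD)
Loch (C, DA) Stimmung (A, DA) Auswahl (A, DA) Sache (A, DA)
Punkt (C, DA) Schmerz (C, PAGD) Krankheit (C, PAGD) Fehler (A, DA)
"""

# Each entry has the form "Word (type, cluster)".
ENTRY = re.compile(r"(\w+) \(([AC]), (DA|PAGD)\)")
entries = ENTRY.findall(ST1_VOKA244)

# Count how many words fall into each (word type, cluster) combination.
tally = Counter((word_type, cluster) for _, word_type, cluster in entries)

for (word_type, cluster), count in sorted(tally.items()):
    print(f"{word_type} / {cluster}: {count}")
```

Each of the four combinations occurs seven times among the 28 target words, so abstract and concrete words are evenly split across the two clusters for this participant. This is an observation about the table, not a statement of how the assignment was generated (see section 7.3.1 for the procedure).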

Table 49 shows the program flow for voka249, the paired participant of voka244 in the FIX group in experiment 2. Voka249 received all target words in annotation cluster PAGD. In both tables, the abstract word Auswahl (selection) is highlighted in bold to illustrate the random display order of the target words for each participant in each Voka phase. The English translation along with the additional word information shown for each test item is presented in Appendix A.

Table 49: Program flow of experiment 2 for participant voka249 (FIX group)

Phase | Display order | Target words
St1 | 1-6 | Punkt (C, PAGD); Verbot (A, PAGD); Bereich (A, PAGD); Stimmung (A, PAGD); Rede (C, PAGD); Umfang (A, PAGD)
St1 | 7-12 | Blick (C, PAGD); Flughafen (C, PAGD); Auswahl (A, PAGD); Vorteil (A, PAGD); Brust (C, PAGD); Gesicht (C, PAGD)
St1 | 13-18 | Welle (C, PAGD); Landschaft (C, PAGD); Zukunft (A, PAGD); Schmerz (C, PAGD); Gehirn (C, PAGD); Fehler (A, PAGD)
St1 | 19-24 | Folge (A, PAGD); Loch (C, PAGD); Wald (C, PAGD); Krankheit (C, PAGD); Merkmal (A, PAGD); Sache (A, PAGD)
St1 | 25-28 | Bauer (C, PAGD); Pflicht (A, PAGD); Menge (A, PAGD); Bildung (A, PAGD)
St2 | 1-6 | Punkt (C, PAGD); Merkmal (A, PAGD); Verbot (A, PAGD); Bereich (A, PAGD); Bildung (A, PAGD); Pflicht (A, PAGD)
St2 | 7-12 | Wald (C, PAGD); Landschaft (C, PAGD); Stimmung (A, PAGD); Fehler (A, PAGD); Bauer (C, PAGD); Sache (A, PAGD)
St2 | 13-18 | Schmerz (C, PAGD); Vorteil (A, PAGD); Krankheit (C, PAGD); Flughafen (C, PAGD); Welle (C, PAGD); Menge (A, PAGD)
St2 | 19-24 | Auswahl (A, PAGD); Umfang (A, PAGD); Rede (C, PAGD); Gehirn (C, PAGD); Blick (C, PAGD); Folge (A, PAGD)
St2 | 25-28 | Brust (C, PAGD); Loch (C, PAGD); Zukunft (A, PAGD); Gesicht (C, PAGD)
Pr1 | 1-6 | Wald (C, PAGD); Gehirn (C, PAGD); Auswahl (A, PAGD); Zukunft (A, PAGD); Sache (A, PAGD); Menge (A, PAGD)
Pr1 | 7-12 | Merkmal (A, PAGD); Verbot (A, PAGD); Umfang (A, PAGD); Brust (C, PAGD); Schmerz (C, PAGD); Krankheit (C, PAGD)
Pr1 | 13-18 | Welle (C, PAGD); Loch (C, PAGD); Rede (C, PAGD); Fehler (A, PAGD); Vorteil (A, PAGD); Punkt (C, PAGD)
Pr1 | 19-24 | Blick (C, PAGD); Bereich (A, PAGD); Gesicht (C, PAGD); Folge (A, PAGD); Stimmung (A, PAGD); Bauer (C, PAGD)
Pr1 | 25-28 | Landschaft (C, PAGD); Pflicht (A, PAGD); Bildung (A, PAGD); Flughafen (C, PAGD)
Pr2 | 1-6 | Rede (C, PAGD); Fehler (A, PAGD); Punkt (C, PAGD); Landschaft (C, PAGD); Zukunft (A, PAGD); Vorteil (A, PAGD)
Pr2 | 7-12 | Menge (A, PAGD); Bereich (A, PAGD); Loch (C, PAGD); Gesicht (C, PAGD); Umfang (A, PAGD); Welle (C, PAGD)
Pr2 | 13-18 | Brust (C, PAGD); Wald (C, PAGD); Flughafen (C, PAGD); Verbot (A, PAGD); Gehirn (C, PAGD); Pflicht (A, PAGD)
Pr2 | 19-24 | Bauer (C, PAGD); Blick (C, PAGD); Schmerz (C, PAGD); Krankheit (C, PAGD); Merkmal (A, PAGD); Auswahl (A, PAGD)
Pr2 | 25-28 | Bildung (A, PAGD); Sache (A, PAGD); Stimmung (A, PAGD); Folge (A, PAGD)
ImmP | 1-6 | Zukunft (A, PAGD); Bereich (A, PAGD); Schmerz (C, PAGD); Bauer (C, PAGD); Sache (A, PAGD); Stimmung (A, PAGD)
ImmP | 7-12 | Gehirn (C, PAGD); Welle (C, PAGD); Krankheit (C, PAGD); Folge (A, PAGD); Flughafen (C, PAGD); Punkt (C, PAGD)
ImmP | 13-18 | Rede (C, PAGD); Auswahl (A, PAGD); Menge (A, PAGD); Landschaft (C, PAGD); Loch (C, PAGD); Fehler (A, PAGD)
ImmP | 19-24 | Merkmal (A, PAGD); Wald (C, PAGD); Verbot (A, PAGD); Umfang (A, PAGD); Pflicht (A, PAGD); Blick (C, PAGD)
ImmP | 25-28 | Bildung (A, PAGD); Gesicht (C, PAGD); Brust (C, PAGD); Vorteil (A, PAGD)
DelP | 1-6 | Loch (C, PAGD); Umfang (A, PAGD); Schmerz (C, PAGD); Brust (C, PAGD); Landschaft (C, PAGD); Gehirn (C, PAGD)
DelP | 7-12 | Auswahl (A, PAGD); Verbot (A, PAGD); Gesicht (C, PAGD); Bildung (A, PAGD); Welle (C, PAGD); Folge (A, PAGD)
DelP | 13-18 | Wald (C, PAGD); Flughafen (C, PAGD); Vorteil (A, PAGD); Bereich (A, PAGD); Krankheit (C, PAGD); Punkt (C, PAGD)
DelP | 19-24 | Merkmal (A, PAGD); Bauer (C, PAGD); Blick (C, PAGD); Pflicht (A, PAGD); Sache (A, PAGD); Zukunft (A, PAGD)
DelP | 25-28 | Menge (A, PAGD); Stimmung (A, PAGD); Fehler (A, PAGD); Rede (C, PAGD)

Appendix C: Background Questionnaire

The 2-page background questionnaire given to the study participants is reproduced below (see section 4.2).

********************************************************************************************

Background Questionnaire

Title: Adaptive vocabulary instruction.
Investigator name: Anne Rimrott ([email protected]).
Investigator Department: Linguistics

Please answer the following questions. Note that your answers to this questionnaire have no effect whatsoever on your course grade for German 102. The questionnaire is only used to answer some of the research questions that are posed as part of this study.

Today's date (MM / DD / YYYY): _________________

What is your full name? __________________________________________

How old are you? _____ years old

What is your gender? O male O female

What are you currently? O an undergraduate student O a graduate student O other – please specify: _____________________

What are your major(s) and minor(s) at SFU?

major(s): ___________________________________________

minor(s): ___________________________________________

O undeclared: O not applicable: Why not? _________________________________

What is your cumulative grade point average (GPA) at SFU?

O 3.5 – 4.33 O 3.0 – 3.49 O 2.5 – 2.99 O 2.0 – 2.49 O below 2.0 O not applicable: Why not? _____________________________________

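What is your native language and what other languages do you know?

Language | beginner | advanced beginner | intermediate | advanced | nativelike | native language
__________ | O | O | O | O | O | O
__________ | O | O | O | O | O | O
__________ | O | O | O | O | O | O
__________ | O | O | O | O | O | O
__________ | O | O | O | O | O | O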

Have you taken German in high school? O no O yes: for ________ years

Not counting the current semester (German 102), have you ever taken German at a postsecondary institution (e.g., college, university)? O no O yes: for ________ semesters

How many times have you visited a German-speaking country (e.g., Germany, Austria)?

never O   once O   2 – 3 times O   4 – 5 times O   6 or more times O

If applicable, how long were you there (all visits and German-speaking countries combined)? _____________ years, __________months, __________ days

Have you had any other experience with German (e.g., German friends, tutoring, German heritage, …)? If yes, please describe when, for how long and in what way you have come into contact with German. Please be as specific as possible. _____________________________________________________________________________ _____________________________________________________________________________ ____________________________________________________________________________

Are you comfortable using computers (e.g., Internet, E-mail)?

very uncomfortable O   uncomfortable O   neither O   comfortable O   very comfortable O

Have you used computers for foreign language practice before this semester?

never O   rarely O   occasionally O   frequently O   very frequently O

Thank you! ********************************************************************************************

Appendix D: Evaluation Questionnaire 1

The 2-page first evaluation questionnaire given to the study participants is reproduced below (see section 4.2).

********************************************************************************************

Evaluation Questionnaire

Title: Adaptive vocabulary instruction.
Investigator name: Anne Rimrott.
Investigator Department: Linguistics

I would like to get some feedback on Voka from you. Note that your answers have no effect on your course grade for German 102. The questionnaire is only used to answer some of the research questions that are posed as part of this study. Your feedback will be treated anonymously. Your name will not be included in reports about the study. Your identity will remain confidential at all times. Thank you!

Your name:

_________________________ Your Voka ID: ______________

Today's Date: ________________________

1. What was the most effective flashcard for you for learning words? Please rank the five flashcards from most useful (1) to least useful (5). It is ok to give some flashcards the same rank.

Flashcard | Example | 1 (most useful) | 2 | 3 | 4 | 5 (least useful)
Translation Picture | [example image] | O | O | O | O | O
Translation Definition | [example image] | O | O | O | O | O
Picture Audio | [example image] | O | O | O | O | O
Definition Audio | [example image] | O | O | O | O | O
Translation Definition Picture Audio | [example image] | O | O | O | O | O

********************************************************************************************

Appendix E: Evaluation Questionnaire 2

The 2-page second evaluation questionnaire given to the study participants is reproduced below (see section 7.2).

********************************************************************************************

Evaluation Questionnaire

Title: Adaptive vocabulary instruction.
Investigator name: Anne Rimrott.
Investigator Department: Linguistics

I would like to get some feedback on how you liked working with Voka and what I could do to make Voka better. Note that your answers to these questions have no effect on your course grade for German 102. The questionnaire is only used to answer some of the research questions that are posed as part of this study. Your feedback will be treated anonymously. Your name will not be included in reports about the study. Your identity will remain confidential at all times.

Your name:

__________________________ Your Voka ID: ______________

Today's Date: ________________________

1. Please provide your opinion on the following statements.

Statement | strongly agree | agree | neither agree nor disagree | disagree | strongly disagree
Voka helped me learn vocabulary. | O | O | O | O | O
I enjoyed studying vocabulary with Voka. | O | O | O | O | O
The German example sentences for the words were useful. | O | O | O | O | O
I could understand the German example sentences. | O | O | O | O | O
The English translations of the German example sentences were useful. | O | O | O | O | O
The definitions of the words in English were useful. | O | O | O | O | O
The pictures of the words were useful. | O | O | O | O | O
The pictures related well to the words I was studying. | O | O | O | O | O
The audio pronunciations of the words were useful. | O | O | O | O | O
The PRACTICE phases (where I typed in the words and received feedback) were useful. | O | O | O | O | O

2. Please provide your opinion on the following aspects of the timing within Voka.

Timing aspect | generally not enough time | generally adequate | generally too much time
Time to study each word in the STUDY phases (where you got introduced to the words for the first time). | O | O | O
Time to type in the word in the PRACTICE phases (where you typed in the words and received feedback). | O | O | O
Time to read the feedback in the PRACTICE phases (where you typed in the words and received feedback). | O | O | O

3. Would you have liked to study the German 102 Deutsch Na Klar chapter vocabulary (e.g., the vocabulary for chapter 3) with Voka?

definitely yes O   probably yes O   maybe O   probably not O   definitely not O

4. In your own words, what did you like about Voka? _____________________________________________________________________________ _____________________________________________________________________________ ___________________________________________________________________________

5. In your own words, what didn’t you like about Voka and/or what can be improved? _____________________________________________________________________________ _____________________________________________________________________________ ___________________________________________________________________________

6. Do you have any other comments? I would be happy to hear them. _____________________________________________________________________________ _____________________________________________________________________________ ___________________________________________________________________________

Thank you! ********************************************************************************************


Appendix F: Instruction Audio Script

The following is the audio script (i.e., voice over) of the 5-minute instruction video that the study participants viewed prior to logging on to Voka for the first time.

***********************************************************************

This video shows you how we will use Voka to learn German words. Once you log on to Voka, you will see an instruction screen, which gives you detailed information about what we will do. We will complete six steps today: an assessment, two study phases, two practice phases, and a follow-up test. In the study and practice phases, Voka will teach you 30 German nouns today. You will then be tested on these nouns on the follow-up test. The follow-up test is an English-German translation. On the test, you only need to type in the German noun itself; the article will be provided for you.

The step that is highlighted in green is the one that you need to complete next. We will start with an assessment test to see if you already know some of the words that Voka will teach you. Click START to start the assessment. For this demonstration, there are only four words on the assessment. In the real study, there will be more words for you to translate. When you have completed the test, please click on SUBMIT. This brings you back to the instructions screen, which now shows that you can start the first study phase. Once you've read the instructions, put on your headset and click START.

In the first study phase, for each word, you will see a flashcard with the English word and its German translation, a German example sentence and additional information. This information will vary for each word. For example, you might see a translation of the example sentence or a definition of the word. Or you might see a picture or hear how the word is pronounced. The audio will play once automatically. You can replay it by clicking on the play button of the audio player. At the bottom, a timer counts down the seconds you have left to study the flashcard before Voka automatically moves on to the next one. While you are working with Voka, please do not take any notes but just focus on the computer screen. Please note that you may not check your emails, chat with your classmates, et cetera, while Voka is running. This would cause you to lose your course credit. Voka automatically moves through all the flashcards and then brings you back to the instruction screen.

The second study phase is very similar to the first study phase, and once you have completed it, you are ready to practice what you have just learned. In the first practice phase, Voka will show you three possible German translations for each English noun. Please type in the translation you think is correct and click submit. Remember to capitalize each noun, otherwise Voka will not count your entry as correct. The feedback screen informs you whether your entry was correct or not. It also shows you the flashcard for the word again to give you one more opportunity to learn the word. As before, Voka will cycle through all the words automatically and then bring you back to the instruction screen.

The second practice phase is similar to the first practice phase. You need to translate the words into German again but this time Voka does not give you any clues. As in the first practice phase, the feedback screen evaluates your answer and shows you the flashcard for the word again.

After the second practice phase, you are ready to do the follow-up test. Here, Voka tests what you have just learned. On the follow-up test, you translate the words that Voka has taught you from English into German. You do not need to remember the article, it is already provided for you. Remember to capitalize your nouns.

You are then done with Voka for today. In a week, we will access Voka again to complete the same follow-up test one more time. Please do not actively study the Voka words between today and next week. This is very important for this research study. Remember that the score you receive on Voka does not influence your grade in German 102. After the follow-up test next week, you will get a handout with all the Voka words for you to keep.

As you know, you receive 5% credit for German 102 for your participation in Voka. In addition, I would like to thank you for your cooperation by entering you all in a lottery draw for a $10 gift certificate to a coffee shop. Thank you for listening and have fun with Voka today!

***********************************************************************
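The capitalization rule stated in the script (an uncapitalized noun is not counted as correct) can be illustrated with a minimal sketch. It is not Voka's actual code: the function name `is_correct` is hypothetical, and the sketch assumes that scoring is an exact, case-sensitive string comparison that ignores only leading and trailing whitespace.

```python
def is_correct(entry: str, target: str) -> bool:
    """Return True only if the typed entry matches the target noun exactly.

    Assumption (not taken from Voka itself): surrounding whitespace is
    ignored, but letter case is not, so 'welle' does not match 'Welle'.
    """
    return entry.strip() == target


# 'Welle' (wave) is one of the nouns named in this dissertation.
print(is_correct("Welle", "Welle"))  # capitalized entry
print(is_correct("welle", "Welle"))  # lowercase entry
```

Under these assumptions the capitalized entry is accepted and the lowercase entry is rejected, which is the behaviour the script warns participants about.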


Appendix G: Copyright Information

The copyright of the pictures and screenshots shown in this dissertation is as follows.

Copyright of the pictures: Pfarrer (priest) © Elisabeth Reising, 2008; Rede (speech) © Kurt Trautwein, 1968; Verkehr (traffic), Waffe (weapon), and Bauer (farmer) © Elisabeth Rimrott, 2008; Menge (crowd), Stimmung (mood), and Welle (wave) © Damir Tresnjo, 2008. All other pictures shown in this dissertation © Anne Rimrott, 2008-2009.

All screenshots of Voka shown in this dissertation © Anne Rimrott, 2010.

For all pictures and screenshots, all rights reserved. The pictures and screenshots are not to be reproduced without the copyright holder's written consent.


REFERENCE LIST Abraham, L. B. (2008). Computer-mediated glosses in second language reading comprehension and vocabulary learning: A meta-analysis. Computer Assisted Language Learning, 21(3), 199-226. Akbulut, Y. (2007). Effects of multimedia annotations on incidental vocabulary learning and reading comprehension of advanced learners of English as a foreign language. Instructional Science, 35, 499-517. Al-Seghayer, K. (2001). The effect of multimedia annotation modes on L2 vocabulary acquisition: A comparative study. Language, Learning & Technology, 5(1), 202232. Ariew, R., & Ercetin, G. (2004). Exploring the potential of hypermedia annotations for second language reading. Computer Assisted Language Learning, 17(2), 237-259. Atkinson, R. C. (1972). Optimizing the learning of a second-language vocabulary. Journal of Experimental Psychology, 96(1), 124-129. Baddeley, A. D. (1986). Working memory. Oxford, England: Oxford University Press. Baddeley, A. D. (1999). Human memory. Boston: Allyn & Bacon. Barcroft, J. (2007). Effects of opportunities for word retrieval during second language vocabulary learning. Language Learning, 57(1), 35-56. Bear, M. F., Connors, B. W., & Paradiso, M. A. (2007). Neuroscience: Exploring the brain. Philadelphia: Lippincott Williams & Wilkins. Bedny, M., & Thompson-Schill, S. L. (2006). Neuroanatomically separable effects of imageability and grammatical class during single-word comprehension. Brain and Language, 98, 127-139. Binder, J. R. (2007). Effects of word imageability on semantic access: Neuroimaging studies. In J. Hart & M. A. Kraut (Eds.), Neural basis of semantic memory (pp. 149-181). Cambridge, England: Cambridge University Press. Binder, J. R., Westbury, C. F., McKiernan, K. A., Possing, E. T., & Medler, D. A. (2005). Distinct brain systems for processing concrete and abstract concepts. Journal of Cognitive Neuroscience, 17(6), 905-917. Bowles, M. (2004). L2 glossing: To CALL or not to CALL. Hispania, 87(3), 541-552. 228

Brooks-Lewis, K. A. (2009). Adult learners' perceptions of the incorporation of their L1 in foreign language teaching and learning. Applied Linguistics, 30(2), 216-235. Brusilovsky, P. (1996). Methods and techniques of adaptive hypermedia. User Modeling and User-Adapted Interaction, 6(2-3), 87-129. Brusilovsky, P. (2001). Adaptive hypermedia. User Modeling and User-Adapted Interaction, 11, 87-110. Brusilovsky, P., Knapp, J., & Gamper, J. (2006). Supporting teachers as content authors in intelligent educational systems. International Journal of Knowledge and Learning, 2(3/4), 191-215. Bruyer, R., & Racquez, F. (1985). Are lateral differences in word processing modulated by concreteness, imageability, both, or neither? International Journal of Neuroscience, 27, 181-189. Bull, S., Dimitrova, V., & McCalla, G. I. (2007). Preface "Open learner models: Research questions" special issue of the IJAIED. International Journal of Artificial Intelligence in Education, 17(2), 83-87. Cárdenas-Claros, M. S., & Gruba, P. A. (2009). Help options in CALL: A systematic review. CALICO Journal, 27(1), 69-90. Carr, N. T. (2006). Computer-based testing: Prospects for innovative assessment. In L. Ducate & N. Arnold (Eds.), Calling on CALL: From theory and research to new directions in foreign language teaching (pp. 289-312). San Marcos, TX: CALICO. Chang, J. S., & Chang, Y.-C. (2004). Computer assisted language learning based on corpora and natural language processing: The experience of project CANDLE. In Proceedings of IWLeL 2004: An interactive workshop on language e-learning (pp. 15-23): IWLeL 2004 Program Committee. Chapelle, C. A. (1998). Multimedia CALL: Lessons to be learned from research on instructed SLA. Language Learning & Technology, 2(1), 21-39. Chapelle, C. A. (2001). Computer applications in second language acquisition. Cambridge, England: Cambridge University Press. Chapelle, C. A. (2003). 
English language learning and technology: Lectures on applied linguistics in the age of information and communication technology. Amsterdam: John Benjamins. Chapelle, C. A., & Heift, T. (2009). Individual learner differences in CALL: The Field Independence/Dependence (FID) construct. CALICO Journal, 26(2), 246-266. Chen, C.-M., & Chung, C.-J. (2008). Personalized mobile English vocabulary learning system based on item response theory and learning memory cycle. Computers & Education, 51, 624-645.

229

Chun, D. M. (2006). CALL technologies for L2 reading. In L. Ducate & N. Arnold (Eds.), Calling on CALL: From theory and research to new directions in foreign language teaching (pp. 69-98). San Marcos, TX: CALICO. Chun, D. M., & Payne, S. J. (2004). What makes students click: Working memory and look-up behavior. System, 32(4), 481-503. Chun, D. M., & Plass, J. L. (1996a). Effects of multimedia annotations on vocabulary acquisition. The Modern Language Journal, 80(2), 183-198. Chun, D. M., & Plass, J. L. (1996b). Facilitating reading comprehension with multimedia. System, 24(4), 503-519. Chun, D. M., & Plass, J. L. (1997). Research on text comprehension in multimedia environments. Language Learning & Technology, 1(1), 60-81. Clark, R. E., & Feldon, D. F. (2005). Five common but questionable principles of multimedia learning. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 97-115). Cambridge, England: Cambridge University Press. Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning & Technology, 11(3), 38-63. Craik, F. I. M. (2002). Levels of processing: Past, present ... and future? Memory, 10(5/6), 305-318. Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11(6), 671684. Crozer, N. (1996). Individualized vocabulary instruction on the computer. Woodland Hills, CA: Los Angeles Pierce College. Danan, M. (1992). Reversed subtitling and dual coding theory: New directions for foreign language instruction. Language Learning, 42(4), 497-527. Danan, M. (2004). Captioning and subtitling: Undervalued language learning strategies. Meta: Translator's Journal, 49(1), 67-77. Davis, J. N., & Lyman-Hager, M. (1997). Computers and L2 reading: Student performance, student attitudes. Foreign Language Annals, 30(1), 58-72. De Groot, A. M. B., Dannenburg, L., & van Hell, J. G. (1994). Forward and backward word translation by bilinguals. 
Journal of Memory and Language, 33, 600-629. De Groot, A. M. B., & Keijzer, R. (2000). What is hard to learn is easy to forget: The roles of word concreteness, cognate status, and word frequency in foreignlanguage vocabulary learning and forgetting. Language Learning, 50(1), 1-56. De Ridder, I. (2002). Visible or invisible links: Does the highlighting of hyperlinks affect incidental vocabulary learning, text comprehension, and the reading process? Language Learning & Technology, 6(1), 123-146.

230

Dewey, G. (1970). Relative frequency of English spellings. New York: Teachers College Press. Di Donato, R., Clyde, M. D., & Vansant, J. (2008). Deutsch: Na klar!: An introductory German course (5th ed.). New York: McGraw-Hill. Dörnyei, Z. (2005). The psychology of the language learner: Individual differences in second language acquisition. Mahwah, NJ: Lawrence Erlbaum Associates. Dörnyei, Z. (2009). Individual differences: Interplay of learner characteristics and learning environment. Language Learning, 59(Suppl. 1), 230-248. Dörnyei, Z., & Skehan, P. (2003). Individual differences in second language learning. In C. J. Doughty & M. H. Long (Eds.), The handbook of Second Language Acquisition (pp. 589-630). Oxford, England: Blackwell. Dubois, M., & Vial, I. (2000). Multimedia design: The effects of relating multimodal information. Journal of Computer Assisted Learning, 16, 157-165. Ellis, N. C., & Beaton, A. (1993a). Factors affecting the learning of foreign language vocabulary: Imagery keyword mediators and phonological short-term memory. The Quarterly Journal of Experimental Psychology Section A, 46(3), 533-558. Ellis, N. C., & Beaton, A. (1993b). Psycholinguistic determinants of foreign language vocabulary learning. Language Learning, 43(4), 559-617. Elsom-Cook, M. (1993). Student modeling in Intelligent Tutoring Systems. Artificial Intelligence Review, 7(3-4), 227-240. Erten, I. H., & Tekin, M. (2008). Effects on vocabulary acquisition of presenting new words in semantic sets versus semantically unrelated sets. System, 36, 407-422. Eyckmans, J., van de Velde, H., van Hout, R., & Boers, F. (2007). Learners' response behaviour in yes/no vocabulary tests. In H. Daller, J. Milton & J. Treffers-Daller (Eds.), Modelling and assessing vocabulary knowledge (pp. 59-76). Cambridge, England: Cambridge University Press. Fan, M., & Xunfeng, X. (2002). An evaluation of an online bilingual corpus for the selflearning of legal English. System, 30, 47-63. Fiebach, C. 
J., & Friederici, A. D. (2003). Processing concrete words: fMRI evidence against a specific right-hemisphere involvement. Neuropsychologia, 42(1), 62-70. Field, J. (2003). Psycholinguistics: A resource book for students. New York: Routledge. Fletcher, J. D., & Tobias, S. (2005). The multimedia principle. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 117-133). Cambridge, England: Cambridge University Press. Fliessbach, K., Weis, S., Klaver, P., Elger, C. E., & Weber, B. (2006). The effect of word concreteness on recognition memory. NeuroImage, 32(3), 1413-1421.

231

Gamper, J., & Knapp, J. (2001). Adaptation in a language learning system. In Proceedings of ABIS-Workshop held in conjunction with Lehren, Lernen, Wissen und Adaptivität (LLWA'01). Gass, S. (1997). Input, interaction, and the second language learner. Mahwah, NJ: Lawrence Erlbaum. Gass, S., & Mackey, A. (2006). Input, interaction and output: An overview. AILA Review, 19(1), 3-17. Godwin-Jones, R. (2010). Emerging technologies: From memory palaces to spacing algorithms: Approaches to second-language vocabulary learning. Language Learning & Technology, 14(2), 4-11. Goldman, S. R. (2009). Commentary: Explorations of relationships among learners, tasks, and learning. Learning and Instruction, 19(5), 451-454. Grace, C. A. (1998a). Personality type, tolerance of ambiguity, and vocabulary retention in CALL. CALICO Journal, 15(1-3), 19-45. Grace, C. A. (1998b). Retention of word meanings inferred from context and sentencelevel translations: Implications for the design of beginning-level CALL software. The Modern Language Journal, 82(4), 533-544. Granger, S. (in press). Comparable and translation corpora in cross-linguistic research: Design, analysis and applications. Journal of Shanghai Jiaotong University. Hall, R. A. J. (1961). Sound and spelling in English. Philadelphia: Chilton. Hamada, M., & Koda, K. (2008). Influence of first language orthographic experience on second language decoding and word learning. Language Learning, 58(1), 1-31. Haspelmath, M. (in press). The indeterminacy of word segmentation and the nature of morphology and syntax. Folia Linguistica. Hegelheimer, V., & Tower, D. (2004). Using CALL in the classroom: Analyzing student interactions in an authentic classroom. System, 32(2), 185-205. Heidemann, A. (1996). The visualization of foreign language vocabulary in CALL. Frankfurt (Main), Germany: Peter Lang. Heift, T. (2002). Learner control and error correction in ICALL: Browsers, peekers, and adamants. CALICO Journal, 19(2), 295-313. Heift, T. (2004). 
Corrective feedback and learner uptake in CALL. ReCALL, 16(2), 416431. Heift, T. (2007). Learner personas in CALL. CALICO Journal, 25(1), 1-10. Heift, T. (2008). Modeling learner variability in CALL. Computer Assisted Language Learning, 21(4), 305-321. Heift, T., & Nicholson, D. (2001). Web delivery of adaptive and interactive language tutoring. International Journal of Artificial Intelligence in Education, 12(4), 310325. 232

Heift, T., & Schulze, M. (2007). Errors and intelligence in Computer-Assisted Language Learning: Parsers and pedagogues. New York: Routledge. Heilman, M., & Eskenazi, M. (2006a, December). Authentic, individualized practice for English as a Second Language vocabulary. Paper presented at the Interfaces of Intelligent Computer-Assisted Language Learning workshop, Ohio State University, Columbus, OH. Heilman, M., & Eskenazi, M. (2006b). Language Learning: Challenges for Intelligent Tutoring Systems. In Proceedings of the Workshop of Intelligent Tutoring Systems for Ill-Defined Domains. 8th International Conference on Intelligent Tutoring Systems (pp. 20-28). Hill, M., & Laufer, B. (2003). Type of task, time-on-task and electronic dictionaries in incidental vocabulary acquisition. International Review of Applied Linguistics, 41(2), 87-106. Holcomb, P. J., Kounios, J., Anderson, J. E., & West, C. W. (1999). Dual-coding, context-availability, and concreteness effects in sentence comprehension: An electrophysiological investigation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(3), 721-742. Hubbard, P., Coady, J., Graney, J., Mokhtari, K., & Magoto, J. (1986). Report on a pilot study of the relationship of high frequency vocabulary knowledge and reading proficiency in ESL readers. Ohio University Working Papers in Linguistics and Language Teaching, 8, 48-57. Hulstijn, J. H. (1997). Mnemonic methods in foreign language vocabulary learning: Theoretical considerations and pedagogical implications. In J. Coady & T. N. Huckin (Eds.), Second language vocabulary acquisition: A rationale for pedagogy (pp. 203-224). Cambridge, England: Cambridge University Press. Hulstijn, J. H., & Laufer, B. (2001). Some empirical evidence for the involvement load hypothesis in vocabulary acquisition. Language Learning, 51(3), 539-558. Hummel, K. M., & French, L. M. (2010). Phonological memory and implications for the second language classroom. 
Canadian Modern Language Review, 66(3), 371391. Issing, L. J., Hannemann, J., & Haack, J. (1989). Visualization by pictorial analogies in understanding expository text. In H. Mandl & J. R. Levin (Eds.), Knowledge acquisition from text and pictures (pp. 195-214). Amsterdam: North-Holland. James, C., & Klein, K. (1994). Foreign language learners' spelling and proofreading strategies. Papers and Studies in Contrastive Linguistics, 29, 31-46. James, M. O. (1996). Improving second language reading comprehension: A computerassisted vocabulary development approach (Doctoral dissertation, University of Hawai'i at Manoa, 1996). Dissertations & Theses: A&I, Publication No. AAT 9629832.

233

Jessen, F., Heun, R., Erb, M., Granath, D.-O., Klose, U., Papassotiropoulos, A., et al. (2000). The concreteness effect: evidence for dual coding and context availability. Brain and Language, 74(1), 103-112. Jones, L. (2004). Testing L2 vocabulary recognition and recall using pictorial and written test items. Language Learning & Technology, 8(3), 122-143. Jones, L. (2006). Listening comprehension in multimedia environments. In L. Ducate & N. Arnold (Eds.), Calling on CALL: From theory and research to new directions in foreign language teaching (pp. 99-125). San Marcos, TX: CALICO. Jones, L. (2009). Supporting student differences in listening comprehension and vocabulary learning with multimedia annotations. CALICO Journal, 26(2), 267289. Jones, L., & Plass, J. L. (2002). Supporting listening comprehension and vocabulary acquisition in French with multimedia annotations. The Modern Language Journal, 86(4), 546-561. Jones, R. L., & Tschirner, E. (2006). A frequency dictionary of German: Core vocabulary for learners. New York: Routledge. Joseph, S. R. H., Lewis, A. S., & Joseph, M. H. (2004). Adaptive vocabulary instruction. In Proceedings of the IEEE International Conference on Advanced Learning Technologies (ICALT'04) (Vol. 00, pp. 141-145). Washington, D.C.: IEEE Computer Society. Joseph, S. R. H., Watanabe, Y., Shiung, Y.-J., Choi, B., & Robbins, C. (2009). Key aspects of computer assisted vocabulary learning (CAVL): Combined effects of media, sequencing and task type. Research and Practice in Technology Enhanced Learning, 4(2), 133-168. Kaya, T. (2006). The effectiveness of adaptive computer use for learning vocabulary (Doctoral dissertation, Northern Arizona University, 2006). Dissertations & Theses: A&I, Publication No. AAT 3213107. Kellogg, G. S., & Howe, M. J. A. (1971). Using words and pictures in foreign language learning. The Alberta Journal of Educational Research, 17(2), 89-94. Kim, D.-S. (2006). 
Effects of text, audio, and graphic aids in multimedia instruction on the achievement of students in vocabulary learning (Doctoral dissertation, Indiana State University, 2006). Dissertations & Theses: A&I, Publication No. AAT 3251404. Kost, C., Foss, P., & Lenzini, J. (1999). Textual and pictorial glosses: Effectiveness on incidental vocabulary growth when reading in a foreign language. Foreign Language Annals, 32(1), 89-113. Lahl, O., Göritz, A. S., Pietrowsky, R., & Rosenberg, J. (2009). Using the world-wide web to obtain large-scale word norms: 190,212 ratings on a set of 2,654 German nouns. Behavior Research Methods, 41(1), 13-19.

234

Laufer, B. (1990). Why are some words more difficult than others? Some intralexical factors that affect the learning of words. International Review of Applied Linguistics, 28(4), 293-307. Laufer, B. (1997). What's in a word that makes it hard or easy: Some intralexical factors that affect the learning of words. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp. 140-155). Cambridge, England: Cambridge University Press. Laufer, B. (2003). Vocabulary acquisition in a second language: Do learners really acquire most vocabulary by reading? Some empirical evidence. The Canadian Modern Language Review, 59(4), 567-587. Laufer, B. (2005). Focus on form in second language vocabulary acquisition. In S. H. Foster-Cohen, M. P. Garcia-Mayo & J. Cenoz (Eds.), EUROSLA Yearbook 5 (pp. 223-250). Amsterdam: John Benjamins. Laufer, B. (2006). Comparing Focus on Form and Focus on FormS in second-language vocabulary learning. The Canadian Modern Language Review, 63(1), 149-166. Laufer, B., & Girsai, N. (2008). Form-focused instruction in second language vocabulary learning: A case for contrastive analysis and translation. Applied Linguistics, 29(4), 694-716. Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size, strength and computer adaptiveness. Language Learning, 54(3), 399-436. Laufer, B., & Hill, M. (2000). What lexical information do L2 learners select in a CALL dictionary and how does it affect word retention? Language Learning & Technology, 3(2), 58-76. Laufer, B., & Hulstijn, J. H. (2001). Incidental vocabulary acquisition in a second language: The construct of task-induced involvement. Applied Linguistics, 22(1), 1-26. Leeflang, K. (2007). Visualisations: Do they really facilitate foreign language learning? Unpublished master's thesis, University of Groningen, Groningen, Netherlands. Leutner, D., & Plass, J. L. (1998). 
Measuring learning styles with questionnaires versus direct observation of preferential choice behavior in authentic learning situations: The Visualizer/Verbalizer Behavior Observation Scale (VV-BOS). Computers in Human Behavior, 14(4), 543-557. Levin, J. R. (1989). A transfer-appropriate-processing perspective of pictures in prose. In H. Mandl & J. R. Levin (Eds.), Knowledge acquisition from text and pictures (pp. 83-100). Amsterdam: Elsevier Science. Lightbown, P. M., & Spada, N. (1998). How languages are learned. Oxford, England: Oxford University Press. Lockhart, R. S., & Craik, F. I. M. (1990). Levels of Processing: A retrospective commentary on a framework for memory research. Canadian Journal of Psychology, 44(1), 87-112. 235

Lomicka, L. L. (1998). "To gloss or not to gloss": An investigation of reading comprehension online. Language Learning & Technology, 1(2), 41-50. Long, M. H. (1983). Linguistic and conversational adjustments to non-native speakers. Studies in Second Language Acquisition, 5, 177-193. Long, M. H. (1985). Input and second language acquisition theory. In S. M. Gass & C. G. Madden (Eds.), Input in second language acquisition (pp. 377-393). Rowley, MA: Newbury House. Long, M. H. (1996). The role of the linguistic environment in second language acquisition. In W. Ritchie & T. Bhatia (Eds.), Handbook of second language acquisition (pp. 413-468). San Diego, CA: Academic Press. Lotto, L., & de Groot, A. M. B. (1998). Effects of learning method and word type on acquiring vocabulary in an unfamiliar language. Language Learning, 48(1), 3169. Markham, P., & Peter, L. (2002-2003). The influence of English language and Spanish language captions on foreign language listening/reading comprehension. Journal of Educational Technology Systems, 31(3), 331-341. Matsumi, N. (1994). Second language vocabulary learning and visuo-spatial short-term memory. Hiroshima Forum for Psychology, 16, 27-32. Mayer, R. E. (2001). Multimedia learning. Cambridge, England: Cambridge University Press. Mayer, R. E. (2005a). Cognitive theory of multimedia learning. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 31-48). Cambridge, England: Cambridge University Press. Mayer, R. E. (2005b). Introduction to multimedia learning. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 1-16). Cambridge, England: Cambridge University Press. Mayer, R. E. (2005c). Principles for reducing extraneous processing in multimedia learning: Coherence, signaling, redundancy, spatial contiguity, and temporal contiguity principles. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 183-200). Cambridge, England: Cambridge University Press. Mayer, R. E. (Ed.). 
(2005d). The Cambridge handbook of multimedia learning. Cambridge, England: Cambridge University Press. Mayer, R. E., & Sims, V. K. (1994). For whom is a picture worth a thousand words? Extensions of a dual-coding theory of multimedia learning. Journal of Educational Psychology, 86(3), 389-401. Melka, F. (1997). Receptive vs. productive aspects of vocabulary. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp. 84102). Cambridge, England: Cambridge University Press.

236

Milton, J. (2007). Lexical profiles, learning styles and the construct validity of lexical size tests. In H. Daller, J. Milton & J. Treffers-Daller (Eds.), Modelling and assessing vocabulary knowledge (pp. 47-58). Cambridge, England: Cambridge University Press. Min, H.-T. (2008). EFL vocabulary acquisition and retention: Reading plus vocabulary enhancement activities and narrow reading. Language Learning, 58(1), 73-115. Molitor, S., Ballstaedt, S.-P., & Mandl, H. (1989). Problems in knowledge acquisition from text and pictures. In H. Mandl & J. R. Levin (Eds.), Knowledge acquisition from text and pictures (pp. 3-35). Amsterdam: North-Holland. Mondria, J.-A., & Wiersma, B. (2004). Receptive, productive, and receptive + productive L2 vocabulary learning: What difference does it make? In P. Bogaards & L. Batia (Eds.), Vocabulary in a second language: Selection, acquisition, and testing (pp. 79-100). Amsterdam: John Benjamins. Nakata, T. (2008). English vocabulary learning with word lists, word cards and computers: Implications from cognitive psychology research for optimal spaced learning. ReCALL, 20(1), 3-20. Nation, I. S. P. (1982). Beginning to learn foreign vocabulary: A review of research. RELC Journal, 13(1), 14-36. Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge, England: Cambridge University Press. Nerbonne, J. (2000). Parallel texts in computer-assisted language learning. In J. Véronis (Ed.), Parallel text processing: Alignment and use of translation corpora (pp. 299-311). Dordrecht, Netherlands: Kluwer Academic Publishers. Noppeney, U., & Price, C. J. (2004). Retrieval of abstract semantics. NeuroImage, 22(1), 164-170. Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50(3), 417-528. Norris, J. M., & Ortega, L. (2001). Does type of instruction make a difference? Substantive findings from a meta-analytic review. In R. 
Ellis (Ed.), Form-focused instruction and second language learning (pp. 157-213). New York: Blackwell. O'Bryan, A. (2005). Effects of images on the incidental acquisition of abstract words. Unpublished master's thesis, Iowa State University, Ames, IA. O'Shea, T., & Self, J. A. (1983). Learning and Teaching with Computers. Brighton, England: Harvester. Okuyama, Y. (2007). CALL vocabulary learning in Japanese: Does Romaji help beginners learn more words? CALICO Journal, 24(2), 355-379. Paivio, A. (1971). Imagery and verbal processes. New York: Holt, Rinehart and Winston.

237

Paivio, A. (1986). Mental representations: A dual coding approach. Oxford, England: Oxford University Press.

Paivio, A., & Desrochers, A. (1980). A dual-coding approach to bilingual memory. Canadian Journal of Psychology, 34(4), 388-399.

Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psychology, 76(1, Part 2), 1-25.

Pavičić Takač, V. (2008). Vocabulary learning strategies and foreign language acquisition. Clevedon, UK: Multilingual Matters.

Peeck, J. (1989). Trends in the delayed use of information from an illustrated text. In H. Mandl & J. R. Levin (Eds.), Knowledge acquisition from text and pictures (pp. 263-277). Amsterdam: Elsevier Science.

Pica, T. (1994). Research on negotiation: What does it reveal about second-language learning conditions, processes, and outcomes? Language Learning, 44(3), 493-527.

Plass, J. L., Chun, D. M., Mayer, R. E., & Leutner, D. (1998). Supporting visual and verbal learning preferences in a second-language multimedia learning environment. Journal of Educational Psychology, 90(1), 25-36.

Plass, J. L., Chun, D. M., Mayer, R. E., & Leutner, D. (2003). Cognitive load in reading a foreign language text with multimedia aids and the influence of verbal and spatial abilities. Computers in Human Behavior, 19(2), 221-243.

Plass, J. L., & Jones, L. (2005). Multimedia learning in second language acquisition. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 467-488). Cambridge, England: Cambridge University Press.

Read, J. (2000). Assessing vocabulary. Cambridge, England: Cambridge University Press.

Read, J. (2004). Research in teaching vocabulary. Annual Review of Applied Linguistics, 24, 146-161.

Reed, S. K. (2006). Cognitive architectures for multimedia learning. Educational Psychologist, 41(2), 87-98.

Richardson, J. T. E. (1975a). Concreteness and imageability. Quarterly Journal of Experimental Psychology, 27, 235-249.

Richardson, J. T. E. (1975b). Imagery, concreteness, and lexical complexity. Quarterly Journal of Experimental Psychology, 27, 211-223.

Richardson, J. T. E. (1976). Imageability and concreteness. Bulletin of the Psychonomic Society, 7(5), 429-431.

Rimrott, A. (2009). Voka [Computer software]. Accessed December 31, 2009 at http://www.voka.ca.

Rimrott, A., & Heift, T. (2008). Evaluating automatic detection of misspellings in German. Language Learning & Technology, 12(3), 73-92.

Rott, S. (2005). Processing glosses: A qualitative exploration of how form-meaning connections are established and strengthened. Reading in a Foreign Language, 17(2), 95-124.

Ryan, A. (1997). Learning the orthographical form of L2 vocabulary - a receptive and a productive process. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp. 181-198). Cambridge, England: Cambridge University Press.

Salomon, G. (1989). Learning from texts and pictures: Reflections on a meta-level. In H. Mandl & J. R. Levin (Eds.), Knowledge acquisition from text and pictures (pp. 73-82). Amsterdam: Elsevier Science.

Schmidt, R. W. (1990). The role of consciousness in second language learning. Applied Linguistics, 11(2), 129-158.

Schmitt, N. (1997). Vocabulary learning strategies. In N. Schmitt & M. McCarthy (Eds.), Vocabulary: Description, acquisition and pedagogy (pp. 199-227). Cambridge, England: Cambridge University Press.

Schmitt, N. (2008). Review article: Instructed second language vocabulary learning. Language Teaching Research, 12(3), 329-363.

Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual. Hampshire, UK: Palgrave Macmillan.

Schnotz, W. (2005). An integrated model of text and picture comprehension. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 49-69). Cambridge, England: Cambridge University Press.

Schuetze, U., & Weimer-Stuckmann, G. (2010). Virtual vocabulary: Research and learning in lexical processing. CALICO Journal, 27(3), 517-528.

Schwanenflugel, P. J., Akin, C., & Luh, W.-M. (1992). Context availability and the recall of abstract and concrete words. Memory and Cognition, 20(1), 96-104.

Scott, S. K. (2004). The neural representation of concrete nouns: What's right and what's left? TRENDS in Cognitive Sciences, 8(4), 151-153.

Shaffer, D. R., & Kipp, K. (2009). Developmental psychology: Childhood & adolescence. Belmont, CA: Wadsworth.

Skehan, P. (2000). A cognitive approach to language learning. Oxford, England: Oxford University Press.

St. John, E. (2001). A case for using a parallel corpus and concordancer for beginners of a foreign language. Language Learning & Technology, 5(3), 185-203.

Stockwell, G. (2007). Vocabulary on the move: Investigating an intelligent mobile phone-based vocabulary tutor. Computer Assisted Language Learning, 20(4), 365-383.

Stockwell, G. (2010). Using mobile phones for vocabulary activities: Examining the effect of the platform. Language Learning & Technology, 14(2), 95-110.

Sun, Y., & Dong, Q. (2004). An experiment on supporting children's English vocabulary learning in multimedia context. Computer Assisted Language Learning, 17(2), 131-147.

Svenconis, D. J., & Kerst, S. (1995). Investigating the teaching of second-language vocabulary through semantic mapping in a hypertext environment. CALICO Journal, 12(2-3), 33-57.

Swain, M. (1995). Three functions of output in second language learning. In G. Cook & B. Seidlhofer (Eds.), Principles and practice in the study of language (pp. 125-144). Oxford, England: Oxford University Press.

Sweller, J. (2005). Implications of cognitive load theory for multimedia learning. In R. E. Mayer (Ed.), The Cambridge handbook of multimedia learning (pp. 19-30). Cambridge, England: Cambridge University Press.

Sydorenko, T. (2010). Modality of input and vocabulary acquisition. Language Learning & Technology, 14(2), 50-73.

Thompson, V. A., & Paivio, A. (1994). Memory for pictures and sounds: Independence of auditory and visual codes. Canadian Journal of Experimental Psychology, 48(3), 380.

Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.

Tonzar, C., Lotto, L., & Job, R. (2009). L2 vocabulary acquisition in children: Effects of learning method and cognate status. Language Learning, 59(3), 623-646.

Tozcu, A., & Coady, J. (2004). Successful learning of frequent vocabulary through CALL also benefits reading comprehension and speed. Computer Assisted Language Learning, 17(5), 473-495.

Tyler, L. K., & Moss, H. E. (1997). Imageability and category-specificity. Cognitive Neuropsychology, 14(2), 293-318.

van Hell, J. G., & Mahn, A. C. (1997). Keyword mnemonics versus rote rehearsal: Learning concrete and abstract foreign words by experienced and inexperienced learners. Language Learning, 47, 507-546.

Vispoel, W. P. (1998). Reviewing and changing answers on computer-adaptive and self-adaptive vocabulary tests. Journal of Educational Measurement, 35(4), 328-345.

Vö, M. L.-H., Conrad, M., Kuchinke, L., Urton, K., Hofmann, M. J., & Jacobs, A. M. (2009). The Berlin Affective Word List Reloaded (BAWL-R). Behavior Research Methods, 41(2), 534-538.

Vö, M. L.-H., Jacobs, A. M., & Conrad, M. (2006). Cross-validating the Berlin Affective Word List. Behavior Research Methods, 38(4), 606-609.

Webb, S. (2005). Receptive and productive vocabulary learning: The effects of reading and writing on word knowledge. Studies in Second Language Acquisition, 27, 33-52.

Webb, S. (2007). The effects of repetition on vocabulary knowledge. Applied Linguistics, 28(1), 46-65.

Webb, S. (2008). Receptive and productive vocabulary sizes of L2 learners. Studies in Second Language Acquisition, 30(1), 79-95.

Wimmer, H., & Landerl, K. (1997). How learning to spell German differs from learning to spell English. In C. A. Perfetti, L. Rieben & M. Fayol (Eds.), Learning to spell: Research, theory, and practice across languages (pp. 81-96). Mahwah, NJ: Lawrence Erlbaum.

Winn, W. (1989). The design and use of instructional graphics. In H. Mandl & J. R. Levin (Eds.), Knowledge acquisition from text and pictures (pp. 125-144). Amsterdam: Elsevier Science.

Wu, J.-C., Yeh, K. C., Chuang, T. C., Shei, W.-C., & Chang, J. S. (2003). TotalRecall: A bilingual concordance for computer assisted translation and language learning. In Proceedings of the 41st Association of Computational Linguistics Conference, Sapporo, Japan.

Wu, W.-s. (2005). Development of an online adaptive vocabulary test system. In Proceedings of ED-MEDIA 2005: World Conference on Educational Multimedia, Hypermedia & Telecommunications. Montreal, Canada (pp. 632-637).

Xu, J. (2010). Using multimedia vocabulary annotations in L2 reading and listening activities. CALICO Journal, 27(2), 311-327.

Yanguas, I. (2009). Multimedia glosses and their effect on L2 text comprehension and vocabulary learning. Language Learning & Technology, 13(2), 48-67.

Yeh, Y., & Wang, C.-W. (2003). Effects of multimedia vocabulary annotations and learning styles on vocabulary learning. CALICO Journal, 21(1), 131-144.

Yoshii, M. (2006). L1 and L2 glosses: Their effects on incidental vocabulary learning. Language Learning & Technology, 10(3), 85-101.

Yoshii, M., & Flaitz, J. (2002). Second language incidental vocabulary retention: The effect of text and picture annotation types. CALICO Journal, 20(1), 33-58.

Zanettin, F. (1998). Bilingual comparable corpora and the training of translators. Meta: Translator's Journal, 43(4), 616-630.

Zhuo, F. J. (2008, March). Synthesis of the effectiveness of words, pictures and other instructional media. Paper presented at the CALICO Conference, San Francisco, CA.
