What is to be assessed? Teachers’ understanding of constructs in an oral English examination in Norway

Henrik Bøhn

Thesis submitted for the degree of Philosophiae Doctor

Department of Teacher Education and School Research
Faculty of Educational Sciences
UNIVERSITY OF OSLO

2016


Summary

The present thesis has investigated EFL teachers’ rating orientations in an oral English examination at the upper secondary level in Norway. As part of this investigation, aspects of the teachers’ scoring behaviour, i.e. grades given, have also been studied. In addition, comparisons were made between what the teachers understand as relevant performance aspects to be tested and what the English subject curriculum and accompanying government documents define as construct-relevant. The thesis is article-based and comprises three articles and an extended abstract. The extended abstract provides a background for the investigation, a theoretical framework, a literature review, a presentation of the research design and methods used, as well as a discussion of the main findings. The articles present the three individual studies which have been conducted. A major concern throughout the thesis has been the lack of a common national rating scale in the upper secondary school context and how this may affect the validity and reliability of the scores.

In Study 1 (Article 1), semi-structured interviews were used to explore 24 Norwegian EFL teachers’ general understanding of the constructs to be tested. The study found that the teachers focused on two main constructs, namely ‘communication’ and ‘content’, which in turn comprised a number of sub-constructs. Overall, the teachers understood the main constructs in the same way, but they disagreed on some of the more specific performance aspects, such as ‘pronunciation’. In addition, the study found that the teachers weighted the content construct differently. The teachers working in the general studies programme put more emphasis on content than did the teachers in the vocational studies programmes. There was also evidence that some teachers focused on construct-irrelevant performance features, such as effort. Beyond this, the investigation of scoring behaviour indicated that there was fairly good agreement in the scoring of performance.

Study 2 (Article 2) used semi-structured interviews and questionnaires to investigate the rating orientations of 70 EFL teachers regarding aspects of the pronunciation construct. These aspects included native speaker pronunciation and intelligibility, as well as the pronunciation of segmentals (individual sounds), word stress, sentence stress and intonation. The results showed that the teachers had widely differing views on native speaker pronunciation, but that they strongly agreed that intelligibility was important for a high-scoring performance. In addition, they largely agreed that segmentals, word stress and sentence stress were important features to be assessed. As for intonation, however, the findings indicated that the teachers were either not as concerned with this feature or unsure of its relevance.

Study 3 (Article 3) employed verbal protocol analysis and semi-structured interviews to explore 10 EFL teachers’ understanding of the content construct. This construct was mainly analysed in terms of a subject matter dimension and a skills and abilities dimension. Comparisons were also made between the teachers’ perceptions of content and aspects of content identified in the subject curriculum and accompanying government documents. The results showed that the teachers had a very general understanding of subject matter, largely interpreting it in terms of ‘general world knowledge’, which may be said to correspond well with what the subject curriculum stipulates. In addition, the study found that the teachers were more concerned with the skills and abilities dimension than with the subject matter dimension, stressing the importance of higher-order thinking skills for a top-scoring performance. There was also evidence that the teachers largely had the same understanding of the construct, but that some of them disagreed on what kind of performance was indicative of the different achievement levels. These differences were largely attributed to study programme affiliation, the vocational studies teachers being more lenient in their assessment orientations than the general studies teachers.

In sum, the three articles provide empirical evidence of the kinds of performance aspects teacher raters attend to in a curriculum-based, oral EFL assessment context at the upper-intermediate level (Common European Framework of Reference B1/B2 level) where no common rating scale exists. Overall, the results showed that the teachers had a similar understanding of the main constructs to be tested, but that they disagreed on the narrower performance aspects. The study also indicated that constructs such as pronunciation and content are somewhat elusive and need to be better defined in order to provide scoring outcomes that are valid and dependable. In addition, the findings suggested that the Norwegian educational authorities should consider introducing common rating scale guidelines, as well as more coherent rater training, in order to guide teachers in their assessment of oral English performance.


Acknowledgements

This thesis would never have been finalized without the help of a number of people. First of all, I am greatly indebted to my main supervisor, Professor Glenn Ole Hellekjær, for all his excellent advice, helpful comments, genuine care and pragmatic guidance when I got lost in details along the way, which I constantly did. I am also indebted to my co-supervisor, Professor Hilde Hasselgård, for her great repertoire of practical advice, razor-sharp analyses of my drafts, eminent suggestions for text improvement and wonderful sense of humor. I also heartily thank the student who had the courage to let me film her during her oral exam. Without her, this project would have been very different. Similarly, the 80 teachers who agreed to watch the video clip and answer my questions as interview informants and questionnaire respondents deserve credit for sharing their precious time. Special thanks go to the teachers Christian Andresen at Jessheim Upper Secondary School, Phil Grey at Bjørkelangen Upper Secondary School and Margrethe Hall Christensen at Ås Upper Secondary School for granting me access to study participants.

Likewise, a number of good colleagues at Østfold University College must be acknowledged. First of all, I want to thank Associate Professor Magne Dypedahl for his rock-solid support through all these years. His untiring encouragement, insightful analyses and pointed comments on different text versions have been particularly helpful in this process. I am also indebted to Assistant Professor Thomas Hansen for agreeing to co-author the second article of this thesis with me, for helping me to analyse data and critique drafts, and for his admirable good spirits. Professor Roald Jensen also deserves great thanks for sharing his profound knowledge of learning and assessment theory and for his thorough and helpful feedback on two of the chapters. My sincere thanks are furthermore due to Professor Julianne Cheek, whose competence in the area of research methodology is truly of the highest academic standard. Her lucid criticism of parts of this text proved immensely valuable to me. I am also indebted to my former Dean, Associate Professor Eva Lambertsson Björk, for her backing and counselling in the very early stages of the project.

In the same manner I owe a great deal to helpful colleagues at the University of Oslo. I am particularly grateful to the members of the SISCO research group, who will leave a lasting impression on my academic career because of all the things they taught me about good research. Professor Kirsti Klette and Professor emerita Frøydis Hertzberg, who chaired this group when I joined it in 2013, and who gave me such a warm welcome, combine the highest academic standards and a human touch in an exemplary way. The same can be said of Professor Andreas Lund, who generously set aside time to read and provide excellent comments on my PhD project proposal before I was admitted to the Faculty of Education. Similarly, Associate Professor Ulrikke Rindahl should be credited for kind-heartedly providing very valuable feedback on various versions of the articles and the extended abstract. Also, former Master’s student Caroline Borch-Nielsen contributed insightful ideas on the development of the questionnaire which I used for my pilot. Furthermore, I want to express my gratitude to Associate Professor Therese N. Hopfenbeck at the University of Oxford, for her superb comments on various parts of the thesis in the final stages of the project, and to Mathilda Burman and Kim Buxton at the Norwegian Directorate for Education and Training, who kindly shared their time to inform me of assessment policies in the Norwegian context.

I must also thank the library staff at Østfold University College for their excellent and speedy service, always willing to make an extra effort to provide me with sources before deadlines. In the same vein I am obliged to Geir Jarberg and Anne Grethe Bremnes at the ICT technical support unit at Østfold University College, who always jumped to their feet to assist me whenever my computer jammed, a software license expired or my lack of digital skill prevented me from getting my writing done.

Finally, and most importantly, I want to thank my family. I am truly grateful to my parents, Tor Inge and Hildegunn, for their unwavering support in all respects, to my sister, Anne Marie, whose experience as an English teacher brought fruitful ideas into this project, and to my children, Eva Julie and Simon, who always cheer me up. I dedicate this thesis to my wife, colleague and best friend, Gro-Anita, whose love, compassion, academic agility and moral and practical support got me through this long process. You have my heart.

Halden, 30 May 2016 Henrik Bøhn


Table of contents

Summary ... iii
Acknowledgements ... v
Table of contents ... vii
List of tables ... x
List of figures ... x
Abbreviations ... xi

Part I: Extended abstract

Chapter 1: Introduction ... 1
1.1 General background ... 1
1.2 Assessment paradigms in the educational domain ... 2
1.3 The Norwegian context ... 4
1.3.1 Assessment in Norway ... 4
1.3.2 English in Norway ... 5
1.3.3 The Norwegian school context ... 6
1.4 Research purpose and aims ... 8
1.5 A note on terminology ... 9
1.6 The structure of the thesis ... 10

Chapter 2: Theoretical framework ... 11
2.1 Introduction to the chapter ... 11
2.2 Constructs: Operationalization of underlying abilities ... 11
2.3 Validity and validation ... 14
2.3.1 Evaluating evidence about score interpretations ... 14
2.3.2 The unitary theory of validity and argument-based validity approaches ... 15
2.3.3 The social consequences of test use ... 18
2.4 Reliability and standardization ... 18
2.5 Constructs to be tested: Oral communication and content ... 19
2.5.1 Models of communicative competence and the oral communication construct ... 20
2.5.2 Theoretical perspectives on content ... 23
2.5.3 Competence aims which relate to oral communication and content ... 25
2.6 Short summary ... 27

Chapter 3: Literature review ... 28
3.1 Introduction to the chapter ... 28
3.2 International research on rater orientations and rater behaviour ... 28
3.2.1 Rater variability and the focus of the present investigation ... 28
3.2.2 Rater orientations ... 29
3.2.3 Rater behaviour ... 32
3.3 Assessment research in the Norwegian context ... 33
3.4 Short summary ... 34

Chapter 4: Methodology ... 35
4.1 Introduction to the chapter ... 35
4.2 The phases of the research process ... 35
4.2.1 General overview ... 35
4.2.2 The pilot phase ... 38
4.2.3 Study 1 (Article 1) ... 39
4.2.4 Study 2 (Article 2) ... 40
4.2.5 Study 3 (Article 3) ... 40
4.3 Data collection ... 41
4.3.1 Study 1: Teachers’ overall understanding of constructs ... 41
4.3.2 Study 2: Teachers’ orientations towards pronunciation ... 44
4.3.3 Study 3: Teachers’ understanding of content ... 44
4.4 Participants and procedure ... 46
4.5 Data analyses ... 47
4.6 Research validity ... 51
4.7 Ethical considerations ... 54
4.8 Generalizability ... 56
4.9 Short summary ... 57

Chapter 5: Summary and discussion ... 58
5.1 Introduction to the chapter ... 58
5.2 Summary of the articles ... 58
5.2.1 Article 1 ... 58
5.2.2 Article 2 ... 61
5.2.3 Article 3 ... 63
5.3 Research contribution ... 66
5.4 Implications for the Norwegian educational context ... 68
5.5 Concluding remarks ... 71

Part II: Articles

Article 1: Assessing spoken EFL without a common rating scale: Norwegian EFL teachers’ conceptions of construct. Sage Open, October-December 2015.

Article 2: Assessing pronunciation in an EFL context: Teachers’ orientations towards nativeness and intelligibility. Under review for Language Assessment Quarterly.

Article 3: Assessing content in a curriculum-based EFL oral exam: The importance of higher-order thinking skills. Under review for Assessment in Education: Principles, Policy and Practice.


List of tables

Table 1 CEFR production scale for overall spoken production ... 22
Table 2 CEFR production scale for overall spoken interaction ... 23
Table 3 Overview of the central elements in the research process ... 50

List of figures

Figure 1 CEFR’s model of communicative language competence, Council of Europe (2001) ... 21
Figure 2 Competence aims underlying the two constructs to be tested; taken from the English subject curriculum, GSP1/VSP2 level ... 25
Figure 3 Overview of the research design ... 37

List of abbreviations

AfL	Assessment for Learning
CBI	Content-Based Instruction
CLA	Communicative Language Ability
CEFR	Common European Framework of Reference for Languages
EAP	English for Academic Purposes
EFL	English as a Foreign Language
ESL	English as a Second Language
FYR	Fellesfag, yrkesretting, relevans [Vocational Education and Training Promotion]
GSP	General Studies Programme
GSP1	General Studies Programme, year 1 at the upper secondary school level
IELTS	The International English Language Testing System
LEA	Local Educational Authorities
LFC	Lingua Franca Core
LK-06	Læreplanverket for kunnskapsløftet [The Knowledge Promotion curriculum reform]
L1	First language
L2	Foreign or second language
NKVS	Nasjonalt kvalitetsvurderingssystem [National Quality Assessment System]
PIRLS	Progress in International Reading Literacy Study
PISA	Programme for International Student Assessment
RQ	Research question
TIMSS	Trends in International Mathematics and Science Study
TOEFL	Test of English as a Foreign Language
UDIR	Utdanningsdirektoratet [Norwegian Directorate for Education and Training]
VPA	Verbal Protocol Analysis
VSP	Vocational Studies Programme
VSP2	Vocational Studies Programme, year 2 at the upper secondary school level

Part I Extended abstract


Chapter 1: Introduction

Speaking is […] the most difficult skill to assess reliably.¹

1.1 General background

Educational assessment can be defined as “the planned and systematic process of gathering and interpreting evidence about learning in order to make a judgement about that learning” (Tina Isaacs, Zara, Herbert, Coombs, & Smith, 2013). This practice has a fundamental role in education and involves a number of political, philosophical, social, ethical and technical issues. In the past two decades assessment has received increasing attention from researchers, policy makers, teaching practitioners and the general public, not least because of the importance that has been attributed to the role of formative assessment in the advancement of student learning (Black & Wiliam, 1998; Hattie, 2009; Sadler, 1998; Stiggins, 2005). Other types of assessment practices have also seen an upsurge, especially in the form of national and international large-scale language testing, which may have diagnostic, comparative and accountability functions (Bennett & Gitomer, 2008; Hopfenbeck, 2014; Kunnan, 2008; Stobart & Eggen, 2012). In addition, school exams and other summative assessments continue to occupy an important place in school systems, serving achievement record and certification functions with potential high-stakes consequences. This thesis investigates the latter domain, with a focus on the rating process in an oral English exam.

Having worked as an English teacher at the upper secondary school level in Norway for three years, and as a lecturer at the tertiary level for more than 15 years, I have been involved in numerous assessment situations, many of which have been oral exams. My first encounter with an oral English school exam as a young teacher in the late 1990s came to epitomize some of the challenges of the Norwegian educational assessment system, which I encountered when embarking on this PhD project many years later. Not only was there no interlocutor training to help me ask the right kinds of questions during the examination, there was also no rater training and no rating scale to guide me in the rating process. With no experience at all, I felt very much dependent on the judgement of my more experienced co-assessor. In retrospect, as I became increasingly aware of the complexities of assessment, I have been pondering the arbitrariness of the system that I encountered, which no doubt sparked my subsequent interest in assessment research.

¹ J. Charles Alderson & Lyle F. Bachman, in the preface to Assessing Speaking (Luoma, 2004, p. ix).

In the almost 20 years that have passed since then, the Norwegian educational system has undergone important changes, and international advances in testing and assessment research have improved our understanding of the nature of assessment and of how to improve testing practices. Despite this, nationally administered rater training and rating scales for oral exams are still non-existent in Norway. Moreover, the international research community continues to grapple with a number of unsolved challenges. For example, there are problems related to the issues of reliability (as indicated by Alderson and Bachman in the introductory quote), assessment design, score interpretation and test use. Other dilemmas relate to the nature of language ability, the appropriateness of the scoring system and the ethical uses of language assessments (Bachman, 2014; Davies, 2014; Green, 2014).

With regard to the appropriateness of the scoring system, the question of validity is particularly important. Very simply put, validity can be understood as the extent to which the inferences drawn from assessment scores are ‘sound’ (Cronbach, 1971; Fulcher, 2015). In order to ensure that inferences are sound, raters must have a common conceptualization of what is to be assessed. This ‘what’ is regularly referred to as the construct (O'Sullivan, 2014). If raters do not have a shared understanding of the construct, it will negatively affect validity (and reliability) (Jones, 2012; Taylor & Galaczi, 2011). Validity is therefore a fundamental aspect of the quality of the assessment procedure (Newton & Shaw, 2014; Wiliam, 2008).

In this thesis I address the issues of score interpretation and the appropriateness of the scoring system. The focus is on the assessment of spoken English as a Foreign Language (EFL) in an oral examination at the upper secondary school level. My main aim is to identify what aspects of performance teachers pay attention to in the rating process. As part of this investigation I compare their notions of relevant performance aspects with what the curriculum and other defining documents identify as relevant features to be assessed. In addition, I examine aspects of teachers’ scoring behaviour.

1.2 Assessment paradigms in the educational domain

Assessment is not undertaken in a vacuum. Ontological and epistemological assumptions, tradition, values and ideologies all affect the way assessment is looked upon, researched, designed, implemented and appraised. Taken together on a general level, facets such as these may be said to form a paradigm, or “a set of interrelated concepts which provide the framework within which we see and understand a particular problem or activity” (Gipps, 1994, p. 1). Not infrequently, ideas attributable to different paradigms may exist side by side, creating tensions in societal systems. In educational assessment such tensions can be observed in views and practices stemming from two overarching paradigms, namely the measurement, or psychometrics, paradigm and the assessment paradigm.² In order to understand the oral English exam under investigation here, it is relevant to discuss some of the main features of these paradigms and how they are reflected in educational assessment practices.

The measurement paradigm originated in the field of psychology in the 19th century and was traditionally associated with a positivist epistemological outlook (Baird, Hopfenbeck, Newton, Stobart, & Steen-Utheim, 2014; Broadfoot, 2007). A basic assumption in this paradigm is the idea that abilities are fixed individual properties which can be ‘measured’, or ‘tested’, quantitatively. Norm-referenced test practices are frequent, and reliability and standardization are of major concern (Baird et al., 2014; Broadfoot, 2007; Gipps, 1994). In order to enhance reliability, externally defined criteria, or standards, are commonly preferred. In terms of learning, behaviouristic and cognitive models are frequently drawn upon (Inbar-Lourie, 2008), and knowledge is often believed to exist separately from the learner. From this perspective, tests can be designed to objectively assess the amount of knowledge that a student has acquired (Serafini, 2001). Thus, an important purpose of tests in education is to monitor learning (Inbar-Lourie, 2008). Other important purposes are ranking, reporting, surveillance and the certification of competence (Black & Jones, 2006; Inbar-Lourie, 2008).

The assessment paradigm, on the other hand, which was developed in the late 20th century, is sometimes seen as a reaction against the psychometrics tradition (Throndsen, Hopfenbeck, Lie, & Dale, 2009). Based on interpretivist and constructivist epistemological positions, this paradigm typically sees abilities as evolving and contextually sensitive (Inbar-Lourie, 2008). On this view, learning is typically understood as knowledge construction, rather than something which is objectively acquired (Hargreaves, 2005). Moreover, there is a preference for criterion-referenced forms of assessment, whereas reliability and standardization are de-emphasized (Gipps, 1994; Inbar-Lourie, 2008).³ The main purpose of assessment is to promote learning, and in this process the teacher has a prominent role. Hence, criteria can legitimately be designed and implemented on the local level. Engh (2011) goes so far as to say that:

The teachers’ technical expertise is used, among other things, to assess student competence. This type of assessment is to be carried out on the basis of the teachers’ professional judgement. Only in exceptional cases is it possible or pedagogically sensible to use standards for assessing student performance. In most cases, what we are assessing is quality, and quality cannot be assessed with the use of standards or other quantitative measures (p. 17, my translation).

² Other labels have been used to describe these paradigms. Inbar-Lourie (2008), for example, refers to them as “testing culture” and “assessment culture” (p. 285).
³ This may be a problematic stance in validity frameworks which incorporate reliability into validity (cf. section 2.3.3). However, in some approaches it may seem possible to have validity without reliability (Moss, 1994).

The Assessment for Learning (AfL) approach, which has been influential in Norway (cf. section 1.3.1, below) draws heavily on this tradition. This is an approach which reflects social constructivist, cognitive and socio-cultural theories of learning (Black & Wiliam, 2009).

Features of these two paradigms are recognizable in a number of educational systems, Norway being no exception. Traces of the measurement paradigm, for example, are evident in national and international large-scale (external) testing practices, which generally have a quantitative orientation and where the monitoring of learning is one important function of such assessment. However, what is even more relevant for the present thesis is the way aspects of the two paradigms are reflected in various forms of school-based (internal) assessment. For example, assessments used for formative purposes draw largely on the assessment paradigm, whereas summative assessments, such as exams, tend to share more features with the measurement paradigm.⁴ Still, there are overlaps, and in the concrete design and implementation of different assessments some important questions relating to differing views from the two paradigms need to be asked, for instance: How standardized do examinations need to be? Are common rating scales required? Is rater training absolutely necessary? This thesis presents empirical findings which relate to these questions and discusses potential consequences of choosing some solutions over others.

1.3 The Norwegian context

1.3.1 Assessment in Norway

Since the mid-2000s, assessment has been a major area of attention for Norwegian educational authorities, reflecting recent international trends in education (Andreassen & Gamlem, 2011). This development was prompted by the low average results of Norwegian students on international tests such as PISA, TIMSS & PIRLS after the turn of the millennium (Engh, 2011). On the release of the first PISA results in 2001, the authorities immediately initiated a number of research projects to find out why Norwegian students did not perform better. The results of this research identified a number of challenges in the area of assessment, particularly with regard to formative evaluation. For example, studies found that feedback practices were unsystematic and poorly related to learning objectives, indicating that teachers lacked assessment competence (Haug, 2004; Haugstveit, 2005; Hertzberg, 2003; Klette, 2003; Solstad & Rønning, 2003). The government concluded that there was a “weak assessment culture” in many Norwegian schools (Meld. St. 16 (2006-2007), 2007, p. 77, my translation).

Consequently, a range of measures were initiated to improve the situation, several of which related specifically to assessment. Among these were the establishment of a national quality assessment system (the “NKVS”) in 2004 with a particular focus on accountability measures, the introduction of the Knowledge Promotion curriculum reform (LK-06) in 2006, a revision of the Regulations to the Education Act in 2009, introducing a distinction between formative and summative assessment, and a focus on AfL as a prioritized area in education (Meld. St. 20 (2012-2013), 2013). In addition, calls for more research were made, concerning both theoretical analyses and empirical investigations of assessment practices in schools (Throndsen et al., 2009). The present thesis is a response to these calls.

⁴ I here follow Harlen (2012), who refers to formative assessment as assessment intended to “help learning” and summative assessment as assessment intended to “report learning” (p. 97). That being said, it is also clear that exams generally have a certification function.

1.3.2 English in Norway

English holds a strong position in Norwegian society. Since 1969 it has been a compulsory school subject for all. Norwegians are widely exposed to English both at school and in society at large, and people use it for a number of different purposes across a range of different contexts (Chvala & Graedler, 2010; Simensen, 2011). Moreover, studies have shown that the proficiency level of the population is generally high compared to other countries in which English is neither the first nor an official language (Education First, 2014, 2015). However, studies have also shown that the proficiency level of the population may be insufficient for meeting the communicative requirements in professional settings (Hellekjær, 2007, 2008, 2012).

The educational authorities have given English special status in the subject curricula by no longer subsuming it under the label “foreign languages”. Despite this, they do not explicitly use the label “second language”. Whether English in Norway should be treated as a “foreign” or a “second” language seems to be a matter of preference. Some scholars base the distinction between them on the status accorded to the language in society (e.g. Graddol, 2006, p. 84), whereas others base it on whether the language in question is used as an L1 by the majority population (e.g. Alderson et al., 2015, p. 71). Yet others see the distinction in itself as somewhat artificial and outdated (Celce-Murcia, 2014). However, in this thesis I will follow Simensen (2014) and refer to English in Norway as a foreign language.

1.3.3 The Norwegian school context

Norwegian children start school at the age of six. Schooling is compulsory at the primary (grades 1-7) and lower secondary level (grades 8-10). Upper secondary school (grades 11-13) is voluntary, but everyone has the legal right to attend. At the upper secondary level students can choose between a general studies programme (GSP) – for students whose primary goal is to continue to the tertiary level – and various vocational studies programmes (VSPs).

The English subject is compulsory in primary and secondary school. At the upper secondary level it is required for all GSP students in their first year and for all VSP students in their first and second years. Both student groups have the same curriculum, albeit with some adjustments for the different study programmes. For example, students in the Building and Construction programme will be expected to handle a specialized vocabulary related to the building and construction domain, in addition to “a wide general vocabulary” expected of students regardless of programme (Norwegian Ministry of Education and Research [KD], 2006/2013). The reason for having a common subject curriculum, which is generally rather academic in its orientation, has been to give all students the opportunity to qualify for tertiary education (Skjersli & Aamodt, 1997). However, this curriculum has repeatedly been criticized for being too ‘theoretical’ and poorly tailored to the VSP students’ needs (Høst, Seland, & Skålholt, 2013; Solberg, 2010; Tarrou, 2010). Traces of this criticism are found in the data gathered for the present investigation. Recently, however, the government has partly acknowledged the critique by initiating projects such as the Vocational Education and Training Promotion (“FYR”), aimed at making the common core subjects, such as mathematics, Norwegian and English, more relevant for vocational students (KD, 2015).

The current English subject curriculum, which was introduced in 2006 and revised in 2013, is loosely based on the Common European Framework of Reference (CEFR) (Simensen, 2010). It specifies a number of learning outcomes or “competence aims”, which guide instruction and define what is to be assessed. The aims are grouped into four “main subject areas”: Language Learning, Oral Communication, Written Communication, and Culture, Society and Literature (KD, 2006/2013).⁵ However, as the aims are many, and some of them are rather general, they need to be operationalized in order to be assessed (Meld. St. nr. 30, 2004, p. 40). In addition to the competence aims, the subject curriculum defines five basic skills, which are common to all subjects in school and which are described as “fundamental prerequisites for learning and development in schools, in the workplace and in society at large” (Norwegian Directorate for Education and Training [UDIR], 2015, p. 5, my translation). The inclusion of oral skills as one of these five basic skills underscores the importance attributed to spoken proficiency in the Norwegian context.

Summative assessment in upper secondary school is predominantly given in the form of overall achievement marks. These marks are awarded by each subject teacher on the basis of various forms of classroom assessment. In the case of the English level studied here, i.e. first year GSP / second year VSP, approximately 20 per cent of the students are also randomly selected to take a written exam, and five per cent are selected to take an oral exam. The educational authorities give no explicit reason why exams administered to only a portion of the students are needed in addition to the overall achievement marks, but the practice may be explained historically as a matter of different assessment traditions existing side by side (see e.g. Lysne, 2006). As the marks awarded are decisive for admission to colleges and universities, the different forms of summative assessment must be regarded as high-stakes.

An interesting distinction between the oral and the written exam, which is of relevance here, regards their administration. The written exam is administered nationally by the Norwegian Directorate for Education and Training, which provides exam tasks, written rating scales and assessment guidelines nationwide. The oral exam, on the other hand, is managed by the local educational authorities (LEAs) through the county governors in each of the 19 counties. Some of these LEAs provide rating scales, exam tasks and rater training for teachers, but in many cases they leave it to the individual schools to decide in these matters. In turn, some schools leave it to the individual teachers to handle the assessment procedures. Consequently, there are no common national exam tasks or rating scale. This apparent incongruity between the written and the oral exam may partly be explained in terms of a long tradition of the local level having a strong position in the management of school policies, which was reinforced with the LK-06 curriculum reform in 2006 (Sandberg & Aasen, 2008). In fact, the Norwegian Directorate for Education and Training emphasizes the importance of the subject curriculum being adapted locally in everyday teaching and assessment practices so as to “promote adapted education” (UDIR, 2014d, p. 5, my translation). More generally, however, the difference in the administration between the oral and written exams may also be said to reflect the afore-mentioned tension between the measurement and the assessment paradigms. Arguments for more standardization, as manifested in the written exam, can be supported with reference to the measurement paradigm. Arguments for less standardization, on the other hand, which is demonstrated in the oral exam, can be supported by assessment paradigm thinking.

The lack of a national rating scale for oral English in the Norwegian system is of particular interest in the present thesis, as it appears to be taken as given in international language test design that a language test should be accompanied by a common rating scale (Fulcher, 2003; Ginther, 2013; Luoma, 2004). Rating scales are considered invaluable tools for raters in helping them to focus on those aspects of the performance which the test is intended to measure. As Fulcher (2012) has noted, the rating scale can be seen as the operationalization of the construct to be tested. For example, if the assessment is intended to test pronunciation, this should be specified in the rating scale. If not, it should be left out. A number of studies have investigated rater variability in test situations where rating scales exist; a considerably smaller number of investigations have studied tests without rating scales (e.g. Brown, Iwashita, & McNamara, 2005). In both cases, there is evidence that raters have somewhat different conceptions of the construct to be assessed. In any case, assessment contexts with no common scales are special, and they beg for closer scrutiny.

⁵ Minor revisions to the curriculum were made in 2013, just after the data for Article 1 had been collected. One of the most important ones was the division of the previous main area “Communication” into “Oral communication” and “Written communication”, thus emphasizing the importance of, and differences between, writing and speaking. The 2013 version can be found in Appendix 1, the 2006 version in Appendix 2.

1.4 Research purpose and aims

The present thesis investigates rating processes and outcomes in an oral English exam at the upper secondary school level. This exam is administered to GSP students in their first year (GSP1) and VSP students in their second year (VSP2). The main focus is on teachers’ understanding of the constructs to be tested.⁶ As a part of this inquiry, rater behaviour in terms of grades awarded is also examined, as well as correspondence between the teachers’ notions of construct and the intended construct as specified in the subject curriculum and related documents. The three studies that have been undertaken have had the following foci:

• Article 1 has examined teachers’ general perceptions of what should be tested in the GSP1/VSP2 oral English exam. This has included a brief analysis of scoring behaviour (i.e. grading) and a comparison between aspects of teachers’ orientations and the construct to be tested according to the English subject curriculum and other defining documents.

• Article 2 has investigated teachers’ orientations towards the assessment of various aspects of pronunciation in the GSP1/VSP2 oral English exam.

• Article 3 has explored teachers’ understanding of how to assess subject content in the GSP1/VSP2 oral English exam, and compared their assessment foci with what the subject curriculum stipulates with regard to content.

The present study is a response to the calls for more assessment research in the Norwegian educational context (cf. section 1.3.1, above) by providing empirical evidence of what happens in the rating process. In this sense, the study contributes to the evaluation of assessment quality in the GSP1/VSP2 oral English exam. In addition, the study more generally provides information on rating processes in EFL school contexts at the upper-intermediate proficiency level.

⁶ This is intrinsically linked to the specific aspects of performance that the teachers focus on during assessment. In order to describe this focus I will use the terms “teacher (rater) orientations”, “teacher (rater) perceptions” and “teacher (rater) cognition” (cf. Brown et al., 2005).

1.5 A note on terminology

Bachman and Palmer (2010, pp. 19-21) use the terms test, assessment, measurement and evaluation more or less synonymously to describe the practice of collecting and evaluating evidence about learning in order to make judgements (cf. Tina Isaacs et al.’s (2013) definition on p. 1, above). In this thesis I follow their use of the terminology. However, I am well aware that these terms are used with different meanings and connotations, not least because of their association with the two paradigms outlined in section 1.2, above. Hence, a brief explication of some definitions will follow.

Generally, it may be said that assessment is a broader term than test and evaluation, and that the former subsumes the latter two (Kunnan, 2004, p. 1). Tina Isaacs et al. (2013) explain assessment in relation to learning, but it could also be explained in relation to abilities or behaviour generally, or even on a macro-level, such as an educational programme. A test, on the other hand, is typically seen as a more systematic and rigorous form of information gathering, normally restricted by a predetermined time frame (Green, 2014, p. 6). Evaluation is sometimes regarded as the “use” of assessment, for example in the evaluation of an educational programme, whereas measurement is often associated with the gathering of quantitative data according to explicit rules and procedures (Bachman, 2004, pp. 8-9). While recognizing these terminological differences, I would still argue that the collection and evaluation of evidence about students’ abilities and behaviour can be referred to as assessment, testing, measurement and evaluation in the context of this study, in line with Bachman and Palmer (2010).

1.6 The structure of the thesis

This thesis is divided into two main parts. Part I contains the extended abstract, and Part II comprises the three articles which report on the investigations undertaken. The extended abstract consists of five chapters. While the present chapter situates the study by providing a general introduction, Chapter 2 explains the theoretical framework for the thesis. This framework is largely based on theories from the fields of educational and psychological measurement and applied linguistics. The rationale for using these theories is that they provide relevant conceptualizations for understanding the nature of the phenomena being studied, i.e. assessment processes and outcomes in the GSP1/VSP2 oral English exam. Chapter 3 reviews relevant research literature on rater cognition and rater behaviour, both in educational and non-educational contexts, and both internationally and in Norway. The purpose of the review is to identify the space in which the present study makes a research contribution. In Chapter 4 the research design and the methods used are outlined, including a presentation of the research questions, participants, data and analyses, as well as a discussion of the appropriateness of the methods chosen for the different research questions. In addition, I address aspects of research validity and ethical considerations regarding the investigation. Finally, in Chapter 5 I discuss the main findings of the three articles, including their interrelatedness and the extent to which they have responded to the overall research aims and purpose of the study. The chapter ends with a number of implications for assessment and instruction and some suggestions for future research.


Chapter 2: Theoretical framework

2.1 Introduction to the chapter

In this chapter I discuss the theoretical framework of this thesis. As the main focus is on rating processes in a high-stakes, oral examination, I have found it relevant to use a number of conceptualizations developed in the field of educational and psychological measurement, typically applied in standardized, large-scale testing. Examples of such conceptualizations are ‘construct’, ‘validity’ and ‘reliability’. The use of this terminology reflects a pragmatic stance on the relevance of these conceptualizations for the object of study. This is also related to the fact that the summative nature of the oral exam makes it “test-like” (Erickson, 2014, p. 50). Thus, the decision to use this theoretical framing is consistent with a pragmatist epistemological position, which holds that concepts are to be understood as tools for understanding the phenomena we want to study (Hookway, 2015).

In the following I will start by explaining the concept of construct, before moving on to an exposition of the notions of validity and validation. I continue by outlining some perspectives on reliability and standardization. Finally, I discuss the concepts of oral communication and content, as they represent the main components of what should be tested in the oral exam under scrutiny.

2.2 Constructs: Operationalization of underlying abilities

As established in section 1.1, assessment can be seen as the collection and interpretation of evidence about learning in order to form a judgement about that learning. Regardless of test purpose, a very central concern in testing and assessment is what one is trying to form a judgement about. In test theory this ‘what’ is commonly referred to as “attributes”, “traits” or “constructs” (Fulcher, 2015, p. 127; Kane, 2006, p. 30; Newton & Shaw, 2014, p. 10). According to Weir (2005), constructs are the “underlying […] abilities we wish to measure in students” (p. 1). An example of such an ability, taken from the CEFR, is lexical competence (Council of Europe, 2001, p. 110).

The choice of label for these abilities is a contentious issue (Fulcher, 2015; Kane, 2012). In this thesis I use the term “construct”, rather than “attribute” or “trait”, because I find that it aptly points to the constructed and abstract nature of the phenomena being investigated, such as, for instance, lexical competence.⁷ The justification for this view is that an unobservable concept such as lexical competence is an abstract notion, which can only be assessed after having been operationalized (Fulcher & Davidson, 2007, pp. 369-370). Thus, in order to assess lexical competence, one would have to identify observable properties which can serve as indicators of this construct. Examples of such properties are “sentential formulae” (“How do you do?”, “Good morning!”) and “phrasal idioms” (“He kicked the bucket”, “It’s a long shot”) (Council of Europe, 2001, p. 110). Quite frequently, a construct and its observable indicators will form a larger whole, together with other constructs and indicators in a more or less unified theory. In the CEFR, for instance, lexical competence is a construct within a model of communicative language competence, which is logically linked to a number of related constructs, such as grammatical competence, semantic competence and phonological competence (Council of Europe, 2001, p. 109). In turn, this model is built on theories of communicative competence (North, 2014).

One of the reasons for the disagreement over the use of terminology is that the term “construct” is used with so many different meanings that it may be difficult to know what it refers to (Kane, 2012, p. 67). Moreover, theorists disagree on the ontological nature of constructs. Measurement specialists who subscribe to a realist world view, for example, typically see constructs as psychologically real entities, which exist in the minds of individuals and which may cause variation in behaviour (e.g. Borsboom, Cramer, Kievit, Scholten, & Franić, 2009, p. 150). Theorists who subscribe to an antirealist position, on the other hand, question the existence of constructs as ‘real’ attributes of the mind. To antirealists, they are first of all theoretical ideas, constructed by the research community, which are meant to describe and explain patterns of behaviour (Newton & Shaw, 2014, p. 164). From a realist perspective, there is no point in trying to measure constructs (i.e. theoretical ideas) since they cannot cause variation in behaviour if they do not exist. Borsboom et al. (2009), for example, therefore suggest that the construct label be replaced by the term “psychological attribute”, which can be regarded as a property that “plays a role in psychological reality” (pp. 150, 152). A third ontological position, referred to as pragmatic realism (Fulcher, 2015), holds that a construct (such as lexical competence) is real if the operationalizations of the construct “can be observed, and if they vary in ways predicted” (Fulcher, 2014, p. 1447). According to this view, a construct can be seen as:

[t]he abstract name for a complex idea derived from observations of co-occurring phenomena, the purpose of which is to explain the coherence of our perceptions and make predictions about the likelihood of future states or events. The names are ‘the signs of our ideas only’, but no less real for that. (Fulcher, 2015, pp. 129-130)

By keeping a dual focus on the existence of both theoretical constructs and observable indicators, Fulcher occupies a middle position between (extreme) realist and antirealist positions. It follows from my pragmatist epistemological position (cf. section 2.1) that it would not greatly matter which label I choose for the aspects that teachers attend to when assessing performance. Still, I find that the notion of underlying theoretical constructs, operationalized in terms of observable properties (Fulcher & Davidson, 2007), aptly describes what is to be assessed in the Norwegian context. As the English subject curriculum – which forms the basis for instruction and assessment – is based on theories of communicative competence through its influence from the CEFR (cf. section 1.3.3), it explicitly and implicitly uses theoretical constructs which need to be operationalized. Examples of such constructs are “speaking strategies” and “fluency” (Norwegian Ministry of Education and Research [KD], 2006/2013). For example, fluency cannot be observed directly in students, but must be inferred on the basis of properties such as “pauses”, “fillers”, “false starts” etc. (Brown et al., 2005, p. 23). Furthermore, the use of constructs as analytical tools in the present context fits well with Bachman and Palmer’s (2010) claim that a construct is defined on the basis of a “frame of reference” (pp. 212-213). This frame of reference may be a theory of language, a syllabus, a needs analysis, or a combination of the three. As already mentioned, it is the English subject curriculum which above all informs teaching and assessment in the Norwegian context. However, communicative theories also play a part through their influence on curriculum development. In addition, there are government documents, such as circulars, which specify what goes into the construct and what does not. For instance, in Norway a circular specifically states that a student’s “effort” is not to be assessed (UDIR, 2010, p. 13).⁸ Hence, in Norway the frame of reference for the construct definition is the English subject curriculum, communicative theories and government directives.

⁷ In Article 1 I use the term “criterion” as an auxiliary concept. This is defined as “aspects of performance to be assessed”. The reader is referred there for a further discussion on the use of this term.

⁸ This circular was replaced by a revised one in 2014, after the main bulk of the data for this thesis had been collected. In the new circular (UDIR, 2014b), the reference to “effort” has been omitted.

2.3 Validity and validation

2.3.1 Evaluating evidence about score interpretations

In section 1.1 I pointed out that validity is commonly regarded as a fundamental concern in assessment, sometimes referred to as the quality or ‘soundness’ of an assessment procedure. However, the concept is multifaceted and complex and its meaning has evolved over the years. Although some agreement can be found today, not all theorists interpret the concept in the same way (Newton & Shaw, 2014, pp. 7-9). In addition, it should be noted that the concept of validity in this thesis is discussed against an educational backdrop, where there may be said to be tensions between assessment and learning (Baird et al., 2014, p. 97). More broadly, these tensions are echoed in the measurement and the assessment paradigms (cf. section 1.2), which affect assessment theory and practices in different ways.

The classic definition of test validity concerned the extent to which a test “measures what it purports to measure” (McCall, 1922, quoted in Anthony Green, 2014, p. 75). According to this view, validity is seen as a property of the test itself. Some measurement specialists (e.g. Borsboom et al., 2009) still adhere to this notion of validity, but to most authors it is no longer tenable (Bachman, 2014; Fulcher, 2015; Anthony Green, 2014; Kane, 2013; Newton & Shaw, 2014). A typical argument for rejecting the classical definition is that no matter how well-developed a test is, it would not measure what it is supposed to measure if it is poorly administered or used in contexts for which it was not intended (Newton, 2012, p. 3). Hence, the ‘consensus’ view today holds that validity is not a property of a test, but of the inferences that are made from assessment results (Fulcher, 2015; Newton & Shaw, 2014; Wiliam, 2008). In the Standards for Educational and Psychological Testing (henceforth: Standards) the concept is defined in the following way:

Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. […] The process of validation involves accumulating relevant evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores for proposed uses that are evaluated, not the test itself. (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME], 2014)

In passing, it is worth commenting on the term “validation” in the above quotes. According to Davies and Elder (2005) validity refers to the theoretical aspect of assessment quality, whereas validation is the actual practice of evaluating the quality of a test.

In passing, it is worth commenting on the term “validation” in the above quotes. According to Davies and Elder (2005) validity refers to the theoretical aspect of assessment quality, whereas validation is the actual practice of evaluating the quality of a test.

Two aspects of the Standards definition are particularly important for the present thesis. The first relates to the formulation ‘interpretations of test scores’, which essentially concerns score meaning.9 The notion of score meaning raises a host of questions: For example, in the oral English exam under investigation one may ask what the mark 3 means. According to the Norwegian Directorate for Education and Training the numerical mark 3 means “fair degree of competence in the subject” (UDIR, 2009, p. 2). However, one could continue to inquire: “In relation to what? That is, what kind of competence has been assessed?” According to the Regulations to the Education Act, it is the competence aims of the subject curriculum which form the basis for assessment (KD, 2006/2015). However, not all of these competence aims are relevant for the oral exam (e.g. those that relate to written proficiency). One may therefore continue to probe: “Which competence aims have been tested? How have they been operationalized? What kind of performance has the student given that is indicative of goal attainment with regard to the competence aims being tested?”, etc.

The second aspect to consider in the Standards definition concerns the importance attributed to the collection and interpretation of evidence as a central element in validation. In order to make sure that the interpretations from test scores are valid, one has to gather and analyse information about the different aspects of the assessment process, such as task design, scoring procedure, rater bias etc.

These two aspects of validity and validation, i.e. score meaning and the collection and interpretation of evidence, are of direct concern in the present thesis. In all three articles I have provided evidence of the teacher raters’ perceptions of score meaning in terms of the constructs to be assessed. In addition, I have gathered and analysed data from the English subject curriculum and accompanying government documents concerning the intended meaning of the scores. Finally, in Article 1 I have also investigated rater consistency by asking teachers to score student performance.

2.3.2 The unitary theory of validity and argument-based validity approaches

In order to understand more fully how the interpretation of scores can be evaluated, it is worth considering two frameworks which have been very influential in educational and psychological assessment. These are the unitary theory of validity and the argument-based validity approach.

9 In this thesis I follow Messick (1989), who uses the term “score” in a very general sense (p. 13). This means that it does not only reflect numerical ratings, but also, for example, verbal descriptions of scoring outcomes.


In his unitary theory of validity, Messick (1989) developed a framework which puts primary emphasis on the construct to be assessed. He defines validity as:

[a]n integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (Messick, 1989, p. 13, emphasis in the original)

In order to be able to draw sound inferences from score meaning, Messick maintains, it is important that the results represent, as accurately as possible, the intended constructs. However, this is difficult in practice, as measurements will typically “either leave out something that should be included according to the construct theory or else include something that should be left out, or both” (p. 34). To conceptualize this, he borrowed two terms introduced by Cook and Campbell (1979). Aspects of the construct which are left out are labelled construct underrepresentation, and aspects which are not supposed to be included are referred to as construct-irrelevant test variance (Messick, 1989, p. 34). Returning to the example of lexical competence, one could say that failure to assess a student’s knowledge of idioms would indicate construct underrepresentation, whereas the assessment of the ability to use adjectives and adverbs correctly would signify construct-irrelevant variance. Validation studies should therefore collect and analyse evidence to identify such validity threats. This evidence could come from virtually any source, and Messick advocates the collection and interpretation of as many sources of evidence as possible (Messick, 1989, p. 35).

One potential problem with the unitary theory of validity, however, is its applicability, since it is conceptually very complex. As Messick himself acknowledged, a consequence of this complexity is that validation studies will have to attend to a very large number of different questions in order to provide good validation evidence (Messick, 1996, p. 7). The broadening of the scope of validation to such an extent has led some validation practitioners to regard the theory as impractical for application in test evaluation (see e.g. Baird et al., 2014, p. 79). Alternative approaches have therefore been developed.

One such approach is argument-based validation (Cronbach, 1988; Kane, 2006; Mislevy, Almond, & Lucas, 2003). Argument-based validation distinguishes itself from the unitary theory of validity in not taking theory-based constructs as the starting point for the evaluation of inferences. Rather, as the name suggests, it uses an argument to clarify the reasoning of the proposed interpretations and uses of scores (Kane, 2013, p. 8). This argument typically consists of claims for the inferences to be made from the scores, warrants to support the inferences,
backing evidence, and alternative hypotheses as rebuttals to the claims. Simply put, this means that it is possible to infer directly from an observation to a claim, without reference to a construct. For instance, on the basis of the observation that a test taker’s speech is unintelligible, one may make the claim that he or she is unfit for working as a teaching assistant at university. However, as Kane makes clear, the structure of the argument would depend on the attribute to be assessed and on the claim to be stated (Kane, 2012, p. 68). In cases where the trait is a theoretical construct, defined in terms of an underlying theory, the argument-based approach will be similar to construct validity approaches, such as the unitary theory of validity (Kane, 2013, p. 9). However, in most cases there is no need to invoke notions of construct, since the trait to be assessed will be an “observable attribute” that is not dependent on an underlying theory (Kane, 2012, p. 68). This attribute can then be assessed directly. As Kane (2013) points out, it would be possible to test someone’s skill in servicing computers without making assumptions about an underlying “‘computer-servicing’ trait” (p. 21).

Kane (2006) distinguishes four main types of inferences in an interpretive argument: scoring inference, generalization inference, extrapolation inference and implication inference. The first is the most relevant in the present thesis. The scoring inference concerns the assignment of scores to a test taker’s performance according to a “scoring rule, which provides the warrant for the scoring inference” (Kane, 2006, p. 34). The scoring rule includes aspects such as the purpose of the assessment, the types of tasks included and the criteria to be applied. In order to evaluate the appropriateness of the scoring inference, several kinds of evidence can be used. For example, empirical evidence from the rating process can be used “to check on the consistency (e.g., inter-rater reliability) and accuracy (e.g. quality control data) of scoring” (p. 34). The specification of different types of inferences in this way is meant to provide guidance on what kinds of evidence are needed for validation.

To sum up, both the unitary theory of validity and the argument-based approach provide valuable conceptualizations for investigating rater perceptions (and behaviour) in the GSP1/VSP2 oral English exam. Both emphasize the collection and analysis of evidence for the evaluation of the inferences that are made from score meaning. In addition, the unitary theory of validity brings to the analysis the concepts of construct underrepresentation and construct-irrelevant variance, which are appropriate for analysing the extent to which the teachers attend to the things that they should attend to. The argument-based approach helps narrow the focus by pointing to the types of evidence needed for the analysis of the scoring process. In addition, even if the argument-based validity approach de-emphasizes the role of
construct, it does not eliminate it, provided that it is based on some underlying theory. In the case of communicative language assessment, such as in Norway, there is reason to argue that there is an underlying theory for the attributes to be tested (cf. section 2.2).

The features of validity and validation discussed in this section concern the technical aspects of assessment quality, and are by far the most important for the present thesis. However, both the unitary validity theory and the argument-based approach include a broader concern as well – which is also reflected in the analyses I have undertaken – namely the social consequences of test use. In the next section I therefore turn to this issue.

2.3.3 The social consequences of test use

Messick’s (1989) concern for the social consequences of test use implied an evaluation of the value implications of score interpretation, as well as issues such as fairness and test impact. For example, in a validation study it could be relevant to evaluate the impact that the use of a test would have on teaching, learning and instructional materials (Cumming, 2013, p. 6008). Although this issue is not directly addressed in the present study, the question of values surfaces in the analyses of my data. For example, some of the teachers interviewed report that they experience the assessment system as being unfair to some students. Consequently, they deliberately score the students that they perceive as being disadvantaged more leniently than other students. This type of value judgement influences the interpretations that the teachers make, in the sense that it negatively affects the consistency, or reliability, of the scores.

This last point is also interesting in the sense that both Messick and Kane integrate the issue of reliability, or score consistency, into their frameworks (Kane, 2006; Messick, 1989). Since reliability may be affected by test administration and rating procedures and is relevant in a discussion on the amount of standardization needed in an assessment context, I now turn to a consideration of these issues.

2.4 Reliability and standardization

According to Harlen (2012), assessment used for summative purposes requires some form of quality assurance procedures in order to ensure that the scores are reliable, or dependable. “The more weight that is given to the summative judgment”, she says, “the more stringent the quality assurance needs to be” (p. 97). This would mean that some form of standardization of the assessment procedure is required. However, as Harlen’s quote indicates, such
standardization is a matter of degree. The question is then: How standardized do the assessment procedures need to be?

In large-scale testing, rating scales and rater training are regarded as an important part of test development and administration and are rarely, if ever, dispensed with. As Standard 4.18 of the Standards specifies: “[I]n scoring more complex responses, test developers must provide detailed rubrics [i.e. rating scales] and training in their use” (p. 91, italics added). However, the same Standards also make clear that “more flexibility in the assessment procedures” may be considered in order to better capture complex constructs “that may be otherwise difficult to assess”, even if this jeopardizes reliability (AERA, APA, & NCME, 2014, p. 36). Although the Standards here do not mention the abandonment of rating scales or rater training as examples of greater flexibility, they do not exclude the possibility of dispensing with them either, or, as in the Norwegian setting, of creating locally developed scales.

The fact that assessment in a curriculum-based context is intrinsically linked to teaching and learning makes the oral exam studied here somewhat different from testing in non-educational contexts. In educational contexts, researchers have advocated greater procedural flexibility. According to Moss (1994), for example, privileging standardization in order to enhance reliability may come into conflict with good teaching and learning, since there are “certain intellectual activities that standardized assessment can neither document nor promote” (p. 6).

Finally, in the discussion of standardization and the use of a common rating scale, it may be argued that although rating scales provide a means for guiding raters in the assessment process, they do not necessarily reflect the complexity of the constructs to be tested (Taylor & Galaczi, 2011). Logically enough, rating scales need to be simplified in order to make them usable to raters. This seems particularly obvious in the case of assessment based on a comprehensive subject curriculum. Hence, rating scales may be criticized for not giving a realistic representation of the features to be assessed (Lumley, 2005). As the Standards allude to, there is a tension here, reflecting a difference in emphasis: should reliability be stressed, or is it better to go for a looser structure in order to better capture the constructs?

2.5 Constructs to be tested: Oral communication and content

The oral English exam under investigation is markedly different from traditional language tests in the sense that the English subject curriculum, in addition to language-specific constructs, also specifies a number of content-related issues that are to be tested (cf. section 1.3.3). This means that two main constructs may be identified: oral communication and content.10 In the following, I will briefly outline some theoretical perspectives on these two constructs and link them to the corresponding competence aims of the subject curriculum.

10 Cf. appendices 1 and 2.

2.5.1 Models of communicative competence and the oral communication construct

Models of communicative competence generally aim to describe and explain what it means to know and use a language for communication (Purpura, 2008). Several communicative models have been developed. Perhaps the most well-established are Canale and Swain’s model (Canale, 1983; Canale & Swain, 1980) and Bachman’s model (Bachman, 1990; Bachman & Palmer, 1996). The former describes communicative competence as consisting of four components: grammatical competence, sociolinguistic competence, discourse competence and strategic competence. Bachman’s framework is an expansion of Canale and Swain’s and other earlier models in that it explicitly “attempts to characterize the processes by which the various components interact with each other and with the context in which language use occurs” (Bachman, 1990, p. 81). Communicative competence, or “communicative language ability” (CLA), is described by Bachman as comprising two main components: (i) language knowledge, such as grammatical knowledge, textual knowledge, functional knowledge and sociolinguistic knowledge; and (ii) strategic competence, defined as “a set of metacognitive components, or strategies, which can be thought of as higher order executive processes that provide a cognitive management function in language use” (Bachman & Palmer, 1996, p. 70).

Despite their currency, these models have attracted criticism from various quarters. Some authors have claimed that they are too simplistic, not accounting sufficiently well for all the different elements which affect communication, particularly as regards contextual factors (McNamara, 2003). Bachman’s model, for example, taking a cognitive perspective on language ability, sees the construct as something residing in the individual. From an interactionist point of view, however, this is too narrow a perspective, as communicative competence is understood as being more explicitly shaped by contextual features, such as the physical setting, participants and tasks (Chalhoub-Deville, 2003; Chapelle, 1998; He & Young, 1998). As Chalhoub-Deville (2003) argues, “the ability components the language user brings to the situation or context interact with situational facets to change those facets as well as to be changed by them” (p. 372, italics added). On this view, language ability is seen as co-constructed by the participants of the interaction in local settings. This is an interesting position which is relevant in a discussion on oral exams in Norway, where the exams are
administered on the local level, and where no national rating scales exist. However, as both Bachman (2007) and Chalhoub-Deville (2003) have pointed out, such a position may cause problems in language assessment, as it makes generalizations across contexts difficult.

In addition to the criticism levelled against communicative models for being too simplistic, there are also those who have found the models too complex for application in test situations (cf. Harding, 2014). As a practical solution to this problem, Harding points to the use of frameworks such as the CEFR, which may function “as an accessible de facto theory of communicative language ability” (p. 191).11 For this reason, and because the CEFR has been influential on curriculum development in Norway, it is worth briefly considering some of its main features relevant for the assessment of oral communication.

As was mentioned in section 2.2, the CEFR lists a number of factors which contribute to a language user’s “communicative language competence”. The three main features are: (i) linguistic competences; (ii) socio-linguistic competences; and (iii) pragmatic competences. In Figure 1 these three competences and their sub-components have been listed.

COMMUNICATIVE LANGUAGE COMPETENCES

LINGUISTIC COMPETENCES
- Lexical competence
- Grammatical competence
- Semantic competence
- Phonological competence
- Orthographic competence
- Orthoepic competence

SOCIO-LINGUISTIC COMPETENCES
- Linguistic markers of social conventions
- Politeness conventions
- Expressions of folk wisdom
- Register differences
- Dialects and accents

PRAGMATIC COMPETENCES
- Discourse competence (i.e. ability to produce coherent stretches of language in terms of thematic organization, cohesion etc.)
- Functional competence (e.g. imparting and seeking information, expressing and finding out attitudes, socializing, communication repair)

Figure 1. CEFR’s model of communicative language competence (Council of Europe, 2001).

11 Cf. Fulcher’s (2004, 2010) critique of the CEFR’s lack of a theoretical basis.


In addition to the elements listed in Figure 1, mention is also made of fluency and propositional precision, described as “generic qualitative factors which determine the functional success of the learner/user” (p. 128). Beyond this, the CEFR makes clear that there are additional competences contributing to a language user’s ability to communicate. Three such competences are mentioned: “knowledge of the world”, “sociocultural knowledge” and “practical skills and know-how”. Furthermore, the CEFR refers to an additional concept labelled “production strategies”, which involves “mobilising resources, balancing between different competences – exploiting strengths and underplaying weaknesses – in order to match the available potential to the nature of the task” (Council of Europe, 2001, p. 63). Thus seen, production strategies have affinities with the notion of strategic competence in Bachman and Palmer’s (1996) model.

Beyond the descriptions of communicative competence, the CEFR is relevant in the present thesis in that it provides descriptions of “oral production” and “oral interaction” in a number of proficiency scales. These describe oral ability at different levels of competence. In Table 1 and Table 2, extracts from the two scales describing overall production and interaction have been listed. Note that only the B1 and B2 levels are included, as these correspond to the average proficiency level of the Norwegian students in this investigation.

Table 1. CEFR production scale for overall spoken production.

OVERALL ORAL PRODUCTION

B2: Can give clear, systematically developed descriptions and presentations, with appropriate highlighting of significant points, and relevant supporting detail. Can give clear, detailed descriptions and presentations on a wide range of subjects related to his/her field of interest, expanding and supporting ideas with subsidiary points and relevant examples.

B1: Can reasonably fluently sustain a straightforward description of one of a variety of subjects within his/her field of interest, presenting it as a linear sequence of points.


Table 2. CEFR interaction scale for overall spoken interaction.

OVERALL SPOKEN INTERACTION

B2: Can use the language fluently, accurately and effectively on a wide range of general, academic, vocational or leisure topics, marking clearly the relationships between ideas. Can communicate spontaneously with good grammatical control without much sign of having to restrict what he/she wants to say, adopting a level of formality appropriate to the circumstances. Can interact with a degree of fluency and spontaneity that makes regular interaction and sustained relationships with native speakers quite possible without imposing strain on either party. Can highlight the personal significance of events and experiences, account for and sustain views clearly by providing relevant explanations and arguments.

B1: Can communicate with some confidence on familiar routine and non-routine matters related to his/her interests and professional field. Can exchange, check and confirm information, deal with less routine situations and explain why something is a problem. Can express thoughts on more abstract, cultural topics such as films, books, music etc. Can exploit a wide range of simple language to deal with most situations likely to arise whilst travelling. Can enter unprepared into conversation on familiar topics, express personal opinions and exchange information on topics that are familiar, of personal interest or pertinent to everyday life (e.g. family, hobbies, work, travel and current events).

The descriptors listed in Table 1 and Table 2 indicate that a number of communicative language competences are involved in oral production and interaction. For example, the references to coherence, fluency, grammatical control and level of formality point to linguistic, socio-linguistic and pragmatic competences (cf. Figure 1). Moreover, the references to different types of topics indicate that other competences such as knowledge of the world and sociocultural knowledge are also important in actual language use. In addition to these two overall scales, the CEFR includes a number of more detailed scales for “public announcements”, “conversation”, “informal discussions” and “phonological control”, to mention a few.

2.5.2 Theoretical perspectives on content

According to Met (1998), language education may be regarded as a continuum from language-driven approaches to content-driven approaches. In Norway, the English subject taught at the GSP1/VSP2 level may be located somewhere in the middle of this continuum. Competence aims such as “[The student shall be able to] discuss literature by and about indigenous peoples in the English-speaking world” (cf. Figure 2, below) attest to this claim.


In other words, they reveal that content is an important part of the construct to be taught and tested. In models of communicative competence, language aspects are described in considerably more detail than content aspects. Apart from sketchy descriptions of “topical knowledge” in Bachman and Palmer (1996) and general references to “world knowledge” and “socio-cultural knowledge” in the CEFR (cf. section 2.5.1), there is, in fact, little theoretical support to be found for the analysis of content in the GSP1/VSP2 oral English exam. However, relevant conceptualizations can be found in the field of content-based instruction (CBI), which is common in curriculum-based, second language instruction in the U.S. In CBI, subject-specific curricular content – such as language arts or social science – is typically taught alongside a second language in order to improve students’ skills and abilities in both areas (e.g. Brinton, Snow, & Wesche, 2003; Chamot, 2009; Snow & Katz, 2014). In addition, valuable theoretical perspectives on content may be found in Bloom’s revised taxonomy, which has affinities with Norwegian subject curricula through its focus on learning objectives (Anderson & Krathwohl, 2001).

In Chamot’s (2009) CBI framework, content is referred to as facts, concepts, laws, principles and theories. An interesting point that she makes, which I have also found traces of in statements from the teachers in my material, is that understanding subject matter concepts and the relationships between them is a prerequisite for developing academic knowledge (p. 20). Moreover, Chamot stresses the importance of higher-order thinking skills, such as being able to analyse, reflect, predict and synthesize (p. 30). A similar emphasis is found in the English subject curriculum, where a number of competence aims in the content area include terms such as “assess”, “discuss” and “elaborate on” (cf. Figure 2).

Chamot’s exposition points to a separation of the content construct into two main components. On the one hand, there are the subject matter elements, such as facts, concepts, laws etc., and on the other, there are the thinking skills or abilities which are needed to handle subject matter. This perspective is even clearer in Bloom’s revised taxonomy, where learning outcome objectives are arranged along a knowledge dimension and a skills and processes dimension (Anderson & Krathwohl, 2001). Accordingly, the content construct consists of a what-dimension (subject matter knowledge) and a how-dimension (skills, abilities, processes). The former is divided into “factual knowledge”, “conceptual knowledge”, “procedural knowledge” and “metacognitive knowledge”. The latter is structured hierarchically from the simple to the complex in the following order: “remember”, “understand”, “apply”, “analyse”, “evaluate” and “create” (Krathwohl, 2002, p. 216).

2.5.3 Competence aims which relate to oral communication and content

Turning to the subject curriculum, there are a number of competence aims which underlie the two overall constructs to be tested in the oral English exam. Figure 2 gives an illustration.

ORAL COMMUNICATION
(The student shall be able to:)
• understand and use a wide general vocabulary and an academic vocabulary related to his/her own education programme
• understand oral […] presentations about general and specialized themes related to his/her own education programme
• express him/herself […] orally in a varied, differentiated and precise manner, with good progression and coherence
• select and use appropriate […] listening strategies to locate information in oral and written texts
• select and use appropriate […] speaking strategies that are adapted to a purpose, situation and genre
• take the initiative to begin, end and keep a conversation going

CONTENT
(The student shall be able to:)
• exploit and assess various situations, working methods and strategies for learning English
• describe and evaluate the effects of different verbal forms of expression
• assess and comment on his/her progress in learning English
• select and use content from different sources independently, critically and responsibly
• select an in-depth study topic within his/her own education programme and present this
• use technical and mathematical information in the media
• discuss social/cultural conditions and values from a number of English-speaking countries
• present and discuss international news topics and current events
• give an account of the use of English as a universal world language
• discuss and elaborate on English texts from a selection of different genres, poems, short stories, novels, films and theatre plays from different epochs and parts of the world
• discuss literature by and about indigenous peoples in the English-speaking world

Figure 2. Competence aims underlying the two constructs to be tested; taken from the English subject curriculum, GSP1/VSP2 level.

It should be noted here that the competence aims listed in Figure 2 are taken from the previous version of the subject curriculum (2006), which was the governing document when the first data were collected in 2012. The curriculum was slightly revised in 2013.12

As can be seen in Figure 2, the oral communication construct draws on competence aims which not only include production skills, but also reception skills, i.e. listening. In this sense the Norwegian subject curriculum follows the CEFR, which treats these skills as overlapping (Council of Europe, 2001, p. 92). Despite the rather general formulations characterizing these competence aims, the references to “vocabulary”, “coherence” and “speaking strategies” give clear indications of linguistic, socio-linguistic and pragmatic competences as defined by the CEFR (cf. Figure 1, above). In addition, expressions such as “precise manner” and “progression” (i.e. fluency) point to the notions of fluency and propositional precision mentioned in section 2.5.1, above.13

In terms of the content construct, the competence aims relating to this trait have a very wide scope, ranging from rather conventional English subject content matter, such as being able to “discuss and elaborate on English texts from a selection of different genres”, to rather unconventional ones, such as “exploit and assess various situations, working methods and strategies for learning English”. However, in line with Bloom’s revised taxonomy, both types of aims relate to skills or processes (other than communicative language competence), on the one hand, and to subject matter knowledge, on the other. For example, being able to assess strategies for learning English has to do with subject matter knowledge in the field of metacognitive strategies. And, in order to be able to assess such strategies, one also has to be able to understand, use, apply and analyse them (cf. section 2.5.2, above).

Two more comments are worth making. Firstly, there are eight competence aims listed in the subject curriculum which have not been included in Figure 2. These involve skills such as reading, writing and using digital aids (cf. Appendix 2). Secondly, it is, of course, virtually impossible to test all these aims in a single oral exam. Hence, some aims may be excluded when the construct is to be operationalized. This may be a potential problem if teachers, who both design test tasks and evaluate performance, consistently disregard some of these aims.

12 Cf. footnote 5, p. 7 and appendices 1 and 2.
13 «Progression» is an erroneous translation of the Norwegian word “flyt” (cf. the Norwegian version of the curriculum, Appendix 3, bullet point 5 under the heading “Kommunikasjon”).


2.6 Short summary

The theoretical aspects of assessment discussed in this chapter provide a relevant framework for understanding the phenomena studied in this thesis, i.e. teacher rater perceptions (and behaviour) in the Norwegian EFL context. Moreover, they also afford valuable tools for evaluating the quality of the rating processes. Overall, the notion of “construct” offers a useful conceptualization for understanding what aspects teachers attend to while rating oral performance. In addition, the concept of “validity” and its related notions of “construct underrepresentation” and “construct-irrelevant variance” guide attention towards the aspects of student performance that should be attended to according to the curriculum and accompanying guiding documents. This is further supported by the idea of the scoring inference, which emphasizes the importance of clarifying the reasoning of the proposed interpretations of the scores. Furthermore, as the analyses undertaken in this thesis give indications of social value judgements which affect the teachers’ ratings, it is also relevant to discuss this feature in light of the social consequences of test use. Finally, the theoretical perspectives on oral communication and content provide a useful backdrop for analysing the teachers’ orientations towards the two main constructs to be assessed in the GSP1/VSP2 oral English exam.


Chapter 3: Literature review

3.1 Introduction to the chapter

In this chapter I situate the present investigation in the assessment research context. The review supplements and expands on the review sections of each of the three articles included in Part II of the thesis. The choice of literature has been guided by: (i) the focus on (teacher) rater orientations (and to a lesser degree on rater behaviour and the validity of the scoring inference); (ii) the fact that the oral exam focuses not only on aspects of language, but also on subject content; (iii) the lack of a common rating scale for the teachers in this exam; and (iv) the Norwegian context.

A systematic search for relevant studies, based on these criteria, was undertaken by browsing databases such as Web of Science, Academic Search Premier, ERIC and Idunn, journals such as Language Assessment Quarterly and Language Teaching, and reference works such as The companion to language assessment (Kunnan, 2014). Both international and Norwegian sources were explored. In addition, Google Scholar searches were conducted to see whether additional studies could be identified. Moreover, the web pages of the Norwegian Directorate for Education and Training and print monographs on language testing and assessment were searched (e.g. Fulcher, 2003). Beyond the studies located from searches in the sources mentioned, other studies were identified in the reference lists of the literature which was examined. An expanded overview of sources explored and search terms used can be found in Appendix 4. All in all, the search yielded a total of 153 sources, 51 of which are considered in the present review.

In the following I will first address international rater cognition and rater behaviour research, and then go on to present findings from the Norwegian educational context. As all three articles include literature review sections, the reader is also referred to those sections for more specific information pertaining to the three studies.

3.2 International research on rater orientations and rater behaviour

3.2.1 Rater variability and the focus of the present investigation

It has long been known that substantial variation in scoring outcomes may be caused by raters rather than by the performance of the test takers (Bejar, 2012). Such variation is commonly referred to as rater variability (McNamara, 1996). Rater variability may take different forms,
such as: (i) differences in the kinds of performance aspects that raters pay attention to; (ii) variations in how raters interpret and apply assessment criteria; (iii) differences in how raters understand and use rating scale categories; (iv) variability in terms of rater severity or leniency; and (v) interactions between raters and test takers, tasks or other facets of the assessment situation (Eckes, 2005, p. 44). In the present thesis it is the first three forms which are of particular concern, as they relate to the question of rater orientations, or rater cognition. However, point (iv) will also be considered, as the investigation touches on rater behaviour in terms of score consistency and the severity of the raters. The review will predominantly focus on oral assessment research, but where relevant, mention will also be made of written assessment research.

Three other points are worth keeping in mind. Firstly, the vast majority of studies reviewed have investigated assessment in general proficiency speaking tests, rather than in curriculum-based achievement tests, as is the case in Norway. Secondly, they have involved different proficiency levels. Thirdly, only three of them (Brown et al., 2005; Pollitt & Murray, 1996; Yildiz, 2011) have investigated rater variability in L2 speaking contexts without a common rating scale. Hence, the majority of these studies are not immediately relevant for the GSP1/VSP2 oral English exam investigated in the present thesis.

3.2.2 Rater orientations

International research on rater orientations, or rater cognition, in the area of general L2 oral proficiency assessment is limited (Brown et al., 2005). Early studies on rater orientations tended to show that raters typically paid more attention to linguistic features of performance, especially grammar, than to other aspects (Magnan, 1988; McNamara, 1990). More recently, however, studies have shown that assessors heed a broader range of factors, including content, discourse complexity and functional skills (Ang-Aw & Goh, 2011; Borger, 2014; Brown, 2000; Brown et al., 2005; Sato, 2012). On the whole, there is evidence that raters heed a number of different features of performance, both construct-relevant and construct-irrelevant ones (Douglas, 1994; C.-N. Hsieh, 2011; Joe, Harmes, & Hickerson, 2011; May, 2006; Orr, 2002; Pollitt & Murray, 1996). Examples of construct-irrelevant performance aspects that raters attend to are interest and personality (Ang-Aw & Goh, 2011), effort (Brown, 1995), voice quality (C.-N. Hsieh, 2011) and age and gender (Orr, 2002). In addition, raters may fail to attend to features that are included in the construct of interest, thus underrepresenting it (Ang-Aw & Goh, 2011; Cai, 2015). In the following, I will review the studies by Brown et al. (2005) and Pollitt and Murray (1996) in some more detail, as these are the only international
investigations I have located which explored rater orientations in oral assessment contexts where rating scales were not provided. Moreover, they investigated so-called singleton speaking examination formats, which are similar to the Norwegian EFL exam format in that there is only one test taker being examined at a time.14 I will then go on to identify some common problem areas discovered in related studies.

14 The singleton format can be contrasted with the paired speaking format, which involves more than one test taker. Hence, in the paired speaking format it is possible to test, for example, interactional skills (see e.g. Ducasse & Brown, 2009, which investigated such a format without a common rating scale).

Pollitt and Murray (1996) investigated examiners’ assessment of performance in speech samples taken from the Cambridge Certificate of Proficiency in English test, which is a high proficiency level oral examination. Five trained raters were presented with pairs of speech samples obtained from five young adult non-native speakers. The raters were first asked to decide which of the performances in each pair was “better” (p. 81). Afterwards, they were asked to verbalize their perceptions of similarities and differences in the different pairs. The results showed that while the raters paid more attention to content, or what was being said, at the higher levels of performance, they attended more to linguistic features and associated notions of ‘correctness’ at the lower levels of performance. In addition, the results showed that the raters disagreed on the question of whether comprehension should be part of the spoken performance construct. Pollitt and Murray also found that the judges were influenced by non-relevant criterion elements in their evaluations, such as the personalities, physical attractiveness and cultural backgrounds of the examinees.

Brown et al.’s (2005) study had a dual focus. It explored raters’ perceptions of relevant assessment criteria and investigated the quality of the test takers’ discourse in two oral proficiency assessments. The starting point for the study was the piloting of a TOEFL (Test of English as a Foreign Language) speaking test, which is an advanced proficiency level examination focusing on English for Academic Purposes (EAP). The orientations of the raters (n = 10) were investigated using a qualitative research design (i.e. verbal protocol analysis). Overall, the study found that the judges had the same ideas of the main constructs to be tested, but that there was some variation in terms of their attention to the more finely grained performance features. The main aspects that the raters heeded were: “linguistic resources”, “phonology”, “fluency” and “content” (p. 31). Interestingly, it was found that content was a major focus, only surpassed by linguistic factors. The authors speculated that the relative importance attributed to content could be explained by the nature and the level of the test, which was an EAP examination. This would agree with Pollitt and Murray’s (1996)
conclusion that raters pay more attention to content at the higher proficiency levels. A final observation, which is relevant for Article 2 in this thesis, is that in terms of the phonological, syntactic and organizational aspects of the examinees’ speech production, the raters were more concerned with “comprehensibility” and “clarity” than with native-speaker “correctness” (p. 101).

Beyond Pollitt and Murray’s (1996) and Brown et al.’s (2005) studies, other investigations – where rating scales have been provided – have reported rater variability both in relation to the interpretation and use of criteria, and to the raters’ comprehension and use of the rating scale categories (cf. points (ii) and (iii) mentioned by Eckes, 2005, above). For example, in studies by Brown (2000), Kim (2015), May (2009) and Orr (2002) it was found that raters had conflicting views on the meaning of the criteria in the scales. Orr (2002) and Eckes (2009) also found that the raters comprehended the rating scale categories in different ways, thereby awarding the same score to different performances, as well as different scores to similar performances. A similar conclusion was reached by Douglas and Selinker (1993, 1994; cited in Douglas, 1994).

The exposition so far makes clear that some studies have documented considerable rater variability (Eckes, 2005; Orr, 2002), whereas others have demonstrated fairly good correspondence in the features that the raters attend to (Brown et al., 2005; see also Borger, 2014). Brown et al.’s (2005) study is of particular interest in this respect, as no scoring guidelines were provided. The question, then, is how such differences can be accounted for. Different explanations for this variation have been suggested. For example, it may be a matter of rater background characteristics, such as professional background (Brown, 1995; Chalhoub-Deville, 1995), first language background (Kang, 2008; Y.-H. Kim, 2009) and rating experience (Lumley, 2005). However, it may also be related to a number of other variables, such as test tasks, test administration, rating scales and rater training. According to May (2009), for example, one explanation for the confusion over how to interpret the descriptors in the rating scales could be that the criteria in the scales are vague. As for the question of rater training, I will return to this issue in section 3.2.3 below.

More recently, efforts have been made to identify rater profiles in order to better understand rater variability, and to be able to tailor rater training to different needs. For example, Eckes (2009) classified raters into six different rater types, depending on the extent to which they focused on content, correctness, comprehensibility, description, completeness or overall performance. Similarly, Cai (2015) divided assessors into form-oriented, balanced and content-oriented, depending on the degree to which they focused mainly on linguistic
factors, content factors or both. These are interesting findings in the light of the strong focus on subject content in the English subject curriculum in Norway.

Before moving on to the issue of rater behaviour, I find it relevant to return briefly to Brown et al.’s (2005) finding that raters focused more on comprehensibility than on nativeness. A long-standing debate in the language teaching and assessment literature has been the question of speaker norms, and related issues of ‘correctness’ (Davies, 2003; Jenkins, Cogo, & Martin, 2011). Traditionally, the native speaker has been used as the model for production and the standard for assessment, in terms of pronunciation, grammar, vocabulary and organization (Cook, 1999). However, with the advent of alternative approaches to L2 language teaching, such as World Englishes (Kachru, 1986), the Intercultural Model (Byram, 1997) and English as a Lingua Franca (Jenkins, 2000), there are indications that this is changing, with raters becoming more concerned with intelligibility than with correctness (Brown et al., 2005; Timmis, 2002). The extent of this change is not known, however, as a number of raters, perhaps particularly teachers, prefer a native speaker standard (Coskun, 2011; Deterding, 2010; Jenkins, 2007).

3.2.3 Rater behaviour

Although rater behaviour in terms of grades awarded is not a major focus of the present investigation, I still find it relevant to include a brief review of research in this area in order to provide a backdrop for the subsequent analysis of the Norwegian teachers’ scoring behaviour. Rater consistency in terms of inter-rater reliability is typically found to be high in oral performance tests (McNamara, 1996; Eckes, 2011; see also Fulcher, 2003). There is also evidence that rater training and experience will increase such reliability (Barnwell, 1989; Brown, 2012). In addition, training may improve intra-rater reliability (Weigle, 1998).

As for the question of rater severity or leniency, however, studies are less conclusive as to the effect of training. Some investigations have found that it does have a positive impact, particularly with novice and excessively harsh or lenient raters (Davis, 2015), whereas others have found little effect (Eckes, 2011; Lumley & McNamara, 1995). An additional point worth making here is that reliability appears to increase when two or more raters are involved in the scoring (Henning, 1996). This is relevant in Norway, where two examiners are always involved in the scoring of performance in the oral exam.

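Since inter-rater reliability recurs as a quality criterion throughout this review, a brief illustration of how agreement between two examiners can be quantified may be helpful. The sketch below is purely illustrative and is not drawn from the thesis or from the studies reviewed; the marks are invented, and the choice of Cohen's kappa (a standard chance-corrected agreement index for two raters) is my own assumption for the example.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters' marks (Cohen's kappa)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreement
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, based on each rater's marginal mark distribution
    dist_a, dist_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((dist_a[m] / n) * (dist_b[m] / n) for m in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Invented marks on the 1-6 scale for eight hypothetical candidates
examiner_1 = [4, 5, 3, 4, 6, 2, 5, 4]
examiner_2 = [4, 5, 4, 4, 6, 3, 5, 4]
print(round(cohens_kappa(examiner_1, examiner_2), 2))  # 0.65 for these made-up marks

Values close to 1 indicate agreement well beyond chance; raw exact agreement alone (here 6 of the 8 marks) would overstate consistency, since some agreement is expected by chance.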


3.3 Assessment research in the Norwegian context

As mentioned in section 1.3.1, studies of assessment practices generally in Norway in the early 2000s pointed towards a “weak assessment culture” in Norwegian schools, including inadequate focus on learning objectives and unsystematic feedback routines. Later studies have confirmed these findings, documenting inappropriate evaluation practices that have been contrary to government regulations. For example, there is evidence that many teachers have been prone to award achievement marks in a norm-referenced manner, despite the fact that the system has been criterion-referenced since 2001 (Galloway, Kirkebøen, & Rønning, 2011; Hægeland, Kirkebøen, Raaum, & Salvanes, 2005). There are also indications that teachers have tended to award different grades for the same performance and that they heed construct-irrelevant performance aspects, such as effort (Nusche, Earl, Maxwell, & Shewbridge, 2012). This may be explained in terms of difficulties in the interpretation of competence aims and the problem of relating student performance to different levels of competence (Prøitz & Borgen, 2010; Throndsen et al., 2009).

There are, however, also indications that the initiatives undertaken by the educational authorities to improve assessment literacy among Norwegian teachers have had an impact. Studies by Hodgson, Rønning, Skogvold, and Tomlinson (2010) and Sandvik and Buland (2014), for example, indicate that teachers have become more focused on implementing principles for good formative and summative assessment practices, although there are considerable differences across schools. Sandvik and Buland (2014) also show that teachers’ assessment literacy is typically of a general character, relating to overarching pedagogical principles rather than to the subject-specific nature of assessment in the different subjects.

Research on teacher rater orientations and behaviour in oral assessment in the English subject is extremely scarce, however. One reason for this may be that the area of oral skills in education, both as regards L1 and L2, is underresearched (Svenkerud, Klette, & Hertzberg, 2012). Only one study, a Master’s thesis, has been identified which specifically addresses rater perceptions in the English classroom (Yildiz, 2011). Using a qualitative research design involving interviews with 16 English teachers at the upper secondary school level, Yildiz examined both the administration of the oral exam across different counties and the kinds of performance features the teachers attend to. Overall, the study found that the teachers were concerned with five general criteria: (i) “Language competence”; (ii) “Communicative competence”; (iii) “Subject competence”; (iv) “Ability to reflect and discuss independently”; and (v) “Ability to speak freely and independent of manuscript”. However, there was also clear evidence that the teachers paid attention to different performance aspects, that they
considered construct-irrelevant criteria, such as effort, and that they weighted criteria differently.

3.4 Short summary

Overall, the rater orientation and behaviour studies reviewed in this chapter point to varying degrees of rater variability in oral EFL/ESL assessment. There is evidence that judges attend to different kinds of performance aspects, that they apply criteria differently, that they have problems distinguishing levels of performance and that they vary in severity/leniency. However, the number of studies conducted is limited, particularly in curriculum-based assessment contexts where no common rating scale exists. Consequently, there is a need for more research to provide evidence of how the construct is understood in this context and what kind of features (teacher) raters heed when scoring performance.


Chapter 4: Methodology

4.1 Introduction to the chapter

In the present chapter I describe the research design and explain how it contributed to answering the research questions. I also discuss aspects of research validity, including strengths and weaknesses of the design in view of the research focus. In addition, I consider some ethical aspects of the research process. The chapter supplements the sections on methodology in each of the three articles and explains how the three studies relate to each other in the research project as a whole.

4.2 The phases of the research process

4.2.1 General overview

This investigation was primarily designed to explore teacher raters’ understanding of the constructs to be tested in the GSP1/VSP2 oral English exam. In addition, it was devised to study aspects of scoring behaviour, as well as the correspondence between the teachers’ understanding of constructs and the intended constructs to be tested as specified by the English subject curriculum and accompanying government documents. The study has an inductive theoretical drive (Morse & Niehaus, 2009), utilizing a “qualitative priority” (Creswell & Plano Clark, 2011, p. 65). This implies that the overall direction of the research was guided by the inductive analysis of the data which was gathered through the use of qualitative methods in the first phase of the project. The design may further be described as emergent, allowing for the ongoing reconsideration of how to collect and analyse information based on what has been learnt in earlier stages of the project (Creswell, 2013).

As the overarching focus of the thesis is on raters’ understanding of constructs, the prime object of study is the teachers’ cognitive processes. Hence, I found it relevant to use introspective research methods in the study of these processes (Gass & Mackey, 2000; Grotjahn, 1987; Sasaki, 2014). Such methods may be either qualitative or quantitative (Richards, 2009; Sasaki, 2014), and in the present study I collected introspective data through semi-structured interviews, questionnaires and verbal protocols.15

15 Verbal protocols are used in verbal protocol analysis (VPA) methodology (Alison Green, 1998).

The research process comprised a pilot phase and two main phases. In the pilot phase a questionnaire was administered to a group of EFL teachers and subsequently followed by a
semi-structured interview with one additional teacher. The pilot studies aimed to examine the teachers’ general orientations towards the constructs to be tested in order to obtain preliminary data on the research topic and to evaluate the efficacy of these research instruments with regard to the object of study. On the basis of the pilots I decided that semi-structured interviews would be better suited for investigating the problem statement in the first main phase. Hence, in the first main phase I interviewed a group of teachers on their understanding of constructs generally. The main bulk of this interview data was analysed in Study 1 and reported in Article 1 (cf. Figure 3, below). The analysis pointed towards two constructs which needed to be investigated further, namely pronunciation and content. Therefore, in the second main phase I went on to investigate these two constructs more closely.

Firstly, in Study 2 I used unexplored qualitative data from the interviews in Study 1, as well as a questionnaire designed exclusively for Study 2, to examine the teachers’ orientations towards aspects of the pronunciation construct. The results of this investigation were reported in Article 2. Secondly, in Study 3 I used verbal protocols and semi-structured interviews to analyse the teachers’ understanding of the content construct. The results from this analysis were presented in Article 3. Figure 3 provides an overview of the research design.


Overarching research focus: Teachers’ understanding of constructs (+ rater behaviour and correspondence between teachers’ understanding of constructs and intended constructs)

Pilot phase
- Pilot studies: 1. Questionnaire survey; 2. Interview

Phase 1
- Study 1 (Article 1). Main research focus: Teachers’ understanding of constructs generally. Data collection method: semi-structured interviews.

Phase 2
- Study 2 (Article 2). Research focus: Teachers’ orientations towards aspects of pronunciation. Data collection methods: semi-structured interviews; questionnaire.
- Study 3 (Article 3). Main research focus: Teachers’ understanding of the content construct. Data collection methods: verbal protocols; semi-structured interviews.

Each stage informed the next (indicated by “Informed” arrows in the figure).

Figure 3. Overview of the research design.


In addition to the main focus on teachers’ understanding of constructs, the ancillary foci on aspects of teacher rater behaviour, i.e. grading, and on the correspondence between the teachers’ understanding of constructs and the intended constructs, as specified by the English subject curriculum and accompanying government documents, were investigated in Study 1 and Study 3. In the following I will describe the phases of the research process in further detail.

4.2.2 The pilot phase The pilot questionnaire was conducted in collaboration with a master’s student at the University of Oslo in the very early stages of the project. Its main aim was to investigate teachers’ understanding of the constructs in the GSP1/VSP2 oral English exam, in order to test the suitability of this research instrument for the object of study (Yin, 2016). 32 upper secondary EFL teachers were asked to watch a video-taped performance of two students taking a mock oral English exam and then to respond to a questionnaire designed to tap into their understanding of what should be tested. The items in the questionnaire were developed on the basis of an analysis of the competence aims in the English subject curriculum and the mock oral exam tasks.16 Furthermore, we included an item to investigate attitudes towards the question of whether native speaker pronunciation is a relevant assessment criterion. The reason for doing this was a then recently published study which had shown conflicting attitudes among Norwegian EFL teachers regarding this question (Hansen, 2011). In addition to the specific items which were constructed to elicit rater orientations, we also created space for feedback in the questionnaire regarding the relevance of the items (cf. Appendix 5). A main finding from this pilot survey concerned the usefulness of the research method. In written comments on the questionnaire four teachers pointed to the inappropriacy of several items, one remarking: “This has got nothing to do with the mastery of English or other foreign languages!”. Consequently, we encountered hands-on a criticism sometimes voiced against questionnaires: The fact that the questions, or items, are developed by the researchers means that they to some extent measure the researchers’ understanding of the phenomena being studied (Jacobsen, 2005, p. 31). Although this position may be countercritiqued, as questionnaires can be trialled and improved, it nevertheless prompted me to consider alternative methods of data collection in the first main phase of the investigation. 16

16 The mock exam consisted of a pre-planned presentation (monologue) task and a discussion task based on a fictional text from the syllabus.


To continue the process, I decided to employ a qualitative research instrument, since qualitative methods are claimed to be particularly suitable for studying “the views and perspectives of a study’s participants” (Yin, 2016, p. 9). Hence, I created an interview guide based on the pilot questionnaire responses in order to test whether semi-structured interviews would better capture the teachers’ orientations towards the constructs (cf. Appendix 6). The interview guide was piloted on a GSP1/VSP2 English teacher, and on the basis of this pilot, I concluded that semi-structured interviewing would be a more appropriate data collection method for Study 1. In addition, I decided that a video-taped student performance used as prompt, or stimulus, would be useful for eliciting responses. However, rather than employing the video-clip from the mock exam presented to the survey pilot teachers, I decided to use a video-recording from an authentic GSP1/VSP2 oral English exam. The video-prompt was obtained by recruiting a VSP2 student who agreed to be filmed as she was taking her GSP1/VSP2 oral English exam. The exam format comprised three tasks: (i) a pre-planned presentation task, followed by a discussion between the examiners and the student; (ii) an interview task based on a short story from the syllabus; (iii) an interview task based on an audio-taped listening comprehension sequence. In the first task the student had been given 48 hours in advance to plan a presentation on “a common health issue in today’s society [and] the problems it causes the individual and in society”. As for the second and third tasks, the short story was about eating disorders, and the listening comprehension exercise treated the issue of English as a world language.

4.2.3 Study 1 (Article 1) In Study 1 I used a revised version of the pilot interview guide as a basis for interviewing a sample of 24 GSP1/VSP2 English teachers on their understanding of the constructs. First, I asked the teachers to score the video-taped student performance. Next, I interviewed them on what kind of performance features they had been paying attention to in the specific performance of the student in the video-clip, as well as which performance aspects they would heed generally in this exam. Beyond this, I compared the teachers’ understanding of constructs with aspects of the intended construct as stipulated by the subject curriculum and a government circular. These analyses served as evidence for the evaluation of the validity of the scoring inference. The results of the analyses were presented in Article 1 and formed the basis for identifying the phenomena to be studied more closely in Study 2 and Study 3.


Overall, the responses indicated that phonology was an important feature heeded by the teachers.17 Moreover, as the pilot survey had indicated that the teachers had widely differing views on the question of native speaker pronunciation, I decided to investigate the pronunciation construct in more detail in Study 2. In fact, a question on native speaker phonology had been included in the interview guide used in Study 1, but the responses to this question had been left unexplored. This was partly due to space constraints and partly because the teachers' strong focus on pronunciation made me realize that the native speaker question could be investigated further in the second main phase of the project. In addition, as the teachers interviewed in Study 1 had diverging opinions on the question of how to evaluate content, I found it relevant to look more closely into this construct in Study 3.

4.2.4 Study 2 (Article 2) Study 2 analysed the unexplored question of native speaker pronunciation which had been put to the 24 teacher informants in Study 1. In addition, on the basis of findings from pronunciation research, I developed a questionnaire which I distributed to another cohort of 46 teachers. The questionnaire items attempted to elicit both the teachers' views on the native speaker question and their orientations towards aspects of the pronunciation construct which have been found by research to be important for communication. The video prompt, which had been used in Study 1, was also used as stimulus in Study 2. The analyses of this data were presented in Article 2.

4.2.5 Study 3 (Article 3) Finally, in Study 3 I collected concurrent verbal reports (Alison Green, 1998) and interview data from an additional group of 10 teachers in order to investigate their understanding of the content construct. The study used theoretical and empirical evidence from Study 1, as well as conceptualizations from the research literature, to develop a conceptual framework for analysing teacher statements pertaining to content. In addition, I compared the teachers' understanding of the content construct with aspects of content identified in the English subject curriculum. This comparison provided evidence for the evaluation of the validity of the scoring inference. The verbal reports were gathered by letting the teachers watch the same video-prompt used in Study 1 and Study 2 and then simultaneously comment on which performance aspects they paid attention to. Subsequently, semi-structured interviews were carried out with the same group of teachers, who were asked to elaborate on their understanding of the content construct. The findings were reported in Article 3.

17 In this thesis I use the terms phonology and pronunciation interchangeably to refer to both segmental (i.e. individual sounds) and suprasegmental (e.g. stress, intonation) aspects of pronunciation (cf. Talia Isaacs, 2014).

4.3 Data collection 4.3.1 Study 1: Teachers' overall understanding of constructs As mentioned above, the experiences from the pilot studies prompted the decision to use semi-structured interviews in Study 1. However, a number of issues had to be considered, as qualitative interviews may take a number of forms, depending on the researcher's ontological and epistemological positions, the nature of the study, the research questions, time constraints etc. (Brinkmann & Kvale, 2015; King & Horrocks, 2010; Mann, 2016). I will here briefly present some epistemological reflections on interviewing which are important for understanding my approach to collecting and analysing the interview data in this thesis. According to Silverman (1993, 2011) and Alvesson (2003), interview research can broadly be said to be informed by three major paradigms, namely neopositivism, romanticism and localism. Neopositivism typically draws on quantitative ideals for data collection, seeing interviews as a channel for transmitting more or less objective knowledge. In order to obtain valid data, contextual factors must be controlled or minimized as they may 'contaminate' the information that is gathered. Objectivity and neutrality are important principles, and the researcher should therefore strive to keep a professional distance and follow rigorous procedures in order not to cause unnecessary 'bias' (Alvesson, 2003). Romanticism, on the other hand, is more inclined to regard the research interview as a meaning-making event, rather than a "pipeline for transmitting knowledge" (Gubrium & Holstein, 2003, p. 68). The focus is on creating a situation in which the interviewer can get access to the perceptions or experiences of the informant by establishing "rapport, trust and commitment", rather than following a strict research protocol (Alvesson, 2003, p. 16). However, romanticism shares with neopositivism the idea that the perceptions and experiences of the research participant can be regarded as an "object […] located inside people's heads" (Silverman, 2011, p. 18). Thus, the interview may be seen as a 'tool' or a 'technique' for collecting evidence about how individuals understand the world (Alvesson, 2003). Localism, on the other hand, is sceptical of this tool metaphor and stresses the importance of the social context in the creation of interview outcomes. As Alvesson (2003) puts it, the interview is seen as a localized accomplishment, in which the interviewees are producing "situated accounts, drawing upon cultural resources" (p. 17). Also, the localist position takes an essentially critical stance on interviewing. It confronts the assumptions, purposes and arguments of those who aspire to use interviews instrumentally (Silverman, 1993).18 In section 4.2.1 I described the data collection methods used in this thesis as introspective. Although introspective research methods may be employed using different philosophical perspectives, there seems to be a tendency for them to be used by researchers adhering to the neopositivist paradigm. For example, according to Gass & Mackay (2000), introspective methods are "a means of eliciting data about thought processes involved in carrying out a task or activity, [working under the assumption that] humans have access to their internal thought processes at some level and can verbalize those thought processes" (p. 1). Such statements point to a neopositivist epistemology, evoking the pipeline metaphor of Gubrium and Holstein, quoted above. However, I believe a critical perspective on such a position is important, as a host of factors – other than interviewer bias – may influence the interview 'output'. Examples of these are the expectations of the interviewee, the informant's desire to express a certain identity (or identities), the use of professional jargon to describe the phenomenon being studied and the informant's inclination to further his or her own political viewpoints (Alvesson, 2003). Alvesson, therefore, advocates a reflexive pragmatist view on interviews, in which it is important that the researcher recognizes the variety of meanings that may occur, interpreting these in an open and (self-) critical way. Based on my pragmatist epistemological position (cf. section 2.1), I found this approach to be highly relevant for the collection and analysis of data in this thesis. I do not completely reject the interview-as-tool metaphor (neither does Alvesson), but I believe it is important to critically scrutinize the data produced in the interviews, always keeping in mind alternative interpretations. This fits with the pragmatist notion that the outcome of research is not necessarily "true" knowledge, but knowledge that is "useful" (Brinkmann & Kvale, 2015, p. 65). On a more practical note, I also considered collecting data in Study 1 by using verbal protocols. This is a form of introspective data collection method which has been popular in rater cognition studies (Borg, 2003; Cai, 2015; H. J. Kim, 2015; May, 2006; Orr, 2002). It assumes that it is possible to make inferences about individuals' cognitive processes by asking them to verbalize their thoughts during or after the completion of a task (Ericsson & Simon, 1993; Alison Green, 1998). Verbalizations which are made during task completion are

18 Brinkmann and Kvale (2015) present an alternative description of interview research paradigms, illustrated in terms of an interviewer-as-miner metaphor and an interviewer-as-traveller metaphor. The former may be labelled (neo)positivist, the latter constructivist.


referred to as concurrent, whereas verbalizations that are produced afterwards are termed retrospective. The verbal data will constitute a 'protocol' or a 'report', which must be gathered in the form of an audio or video-recording in order to be subjected to scientific analysis. Verbal protocols can be used either with or without stimuli, or "memory aids" (Sasaki, 2014, p. 1344). Protocols using stimuli are also referred to as stimulated recall. One of the advantages of verbal protocols compared to interviews is the lower level of personal reactivity involved (Hammersley, 2008a), which means that the research participants are less likely to be affected by the researcher's involvement, for example as interviewer.19 However, there are also disadvantages, especially regarding concurrent reports, relating to issues such as taciturn participants, ambiguous and 'superficial' comments and the missed opportunity on the part of the researcher to ask for clarification and to probe further into the participants' understanding of the phenomena. Despite the reputed advantages of verbal protocols in rater cognition studies, I decided to employ qualitative interviews in Study 1. Even though I recognized the challenges of reactivity, I wanted to have the opportunity to ask for clarification and to delve deeper into the teachers' understanding of constructs, which is the main focus of this project. This included their understanding of the overall meaning of the scores, their perceptions of which criteria they regarded as the most important, and the question of native speaker pronunciation (cf. interview guide, Article 1, Appendix B). On the basis of these aims, I chose a semi-structured interview format, which implied an overall inductive approach, where open-ended questions at the beginning of the interview were meant to make the teachers describe in their own words their understanding of what was to be tested. However, deductive elements are apparent in the specific questions relating to, for example, native speaker pronunciation. It may of course be objected that I could have combined interviewing and verbal protocols in this study, as I did in Study 3, in order to obtain richer data. This possibility was indeed considered. However, the recruitment of students for the video prompt proved very time-consuming, since the majority of the local educational authorities, school principals and teachers that I contacted turned down my request for authorization to film. Therefore, I decided to opt for interviewing only. The research questions which were addressed in Study 1 are listed in Table 3, below.

19 This presupposes a 'non-mediated' VPA procedure, where the researcher refrains from intervening during verbalization. Alison Green (1998), however, also describes 'mediated' alternatives, where the researcher actively prompts participants to make comments when they fall silent, or even asks questions for clarification.


4.3.2 Study 2: Teachers' orientations towards pronunciation As mentioned in section 4.2.4, parts of the data which were analysed in Study 2 were actually gathered in Study 1. This data comprised un-analysed responses to the question on native speaker pronunciation, which had been put to the 24 interview informants. The question read: "What about phonology? Some teachers say that a near-native speaker accent is important in order to get a top score? What is your comment on that?" (cf. Article 1, Appendix B). The actual data collection in Study 2 took as a starting point this question about native-speaker accent, as well as the finding from Study 1 that the teachers were generally very concerned with pronunciation. Based on this evidence, I consulted the pronunciation research literature in order to develop an analytical framework which could be used to investigate teacher orientations towards this construct. The framework centred on the concepts of nativeness and intelligibility (Levis, 2005) and a related set of four specific pronunciation features found to be important for communication (cf. Article 1). These concepts and features were operationalized in a number of items in a questionnaire, which was then distributed to 46 GSP1/VSP2 English teachers. The purpose of the questionnaire was twofold. Firstly, it aimed to corroborate the findings on native-speaker pronunciation investigated by the interview question. Secondly, it tried to test the extent to which the teachers attended to the four specific pronunciation aspects when assessing student performance. Hence, this specific design may be said to have both validity-checking and complementary-information-seeking purposes (Hammersley, 2008b). Broadly speaking, the data collection in the second study formed part of the larger, emergent and inductive process. However, the gathering of data in Study 2 itself may be characterized as essentially deductive. The questionnaire, for example, served a hypothesis-testing purpose, checking to what extent the teachers judge pronunciation against a native speaker norm. The research questions that were asked are listed in Table 3, below.

4.3.3 Study 3: Teachers' understanding of content In Study 3 data was gathered both inductively and deductively from 10 GSP1/VSP2 teacher participants. Firstly, concurrent verbal protocols were collected inductively, using the video-clip as stimulus, in order to produce 'grounded' evidence on the teachers' understanding of the content construct. The verbal protocols were followed up in interview sessions immediately afterwards. In the first part of the interviews I asked the teachers open-ended questions on how they would assess the performance they had just seen. In these two phases, then – the recording of the verbal protocols and the first part of the interviews – no hypotheses, theories or conceptualizations were guiding the information gathering. However, prior to the data collection phase, I had developed a conceptual framework for describing content, which was intended as an analytical tool in the exploration of the teachers' statements. This framework, which was built on findings from Study 1, indicated that the teachers were largely assessing content in terms of a Bloom-like taxonomy of analysing and reflecting on subject matter. The framework was further supplemented with theoretical descriptions of content from educational theory and content-based instruction literature (e.g. Anderson & Krathwohl, 2001; Chamot, 2009). At its core, it conceptualized content as a two-dimensional construct, consisting of a subject matter, or what, dimension, and a skills and processes, or how, dimension. The framework was operationalized in a set of interview questions which were put to the teachers in the second phase of the interview (cf. Article 3). The questions tried to tap into the teachers' notions of subject matter content. Additionally, in order to check the correspondence between the teachers' understanding of the content construct and the intended construct, I asked the informants about the relevance of a number of content-related issues which were identified in the subject curriculum. Thus, in the second phase of the interviews, the data collection may be regarded as deductive. The reason for combining inductive and deductive information gathering in this way was to check whether the verbal protocol data and the teachers' answers to the open-ended questions could corroborate the responses to the theory-driven questions in the second phase of the interviews (cf. section 4.5, below). In addition, the verbal reports and open-ended interview questions also served a complementary-information-seeking objective (Hammersley, 2008b), in that they might provide additional evidence which the two-dimensional content model had not managed to capture. A final word here concerns the validity of verbal protocols. According to Alison Green (1998), it is important that the research participants are given proper instructions, in order to be able to produce rich and dependable reports. It is also essential that they are encouraged to speak as much as possible. I therefore attempted to explain as clearly as possible the purpose of the research and the usefulness of this data collection method, when properly carried out. I also stressed the importance of providing as many comments as possible, encouraging the teachers to verbalize any thought that came to mind when watching the video-clip. In addition, I allowed them five minutes at the beginning of the session to familiarize themselves with the equipment and to trial a commenting sequence. The research question for Study 3 is listed in Table 3, below.

4.4 Participants and procedure The student participant who was video-taped as she was taking her oral exam was an 18-year-old girl in the Health and Social Care vocational study programme. She consented to participating after having been invited by her English teacher to take part in the study. The recording produced a 22-minute-long video-sequence. All of the teacher participants in the three studies (n=80) were fully qualified EFL teachers at the upper secondary level in Norway. They were recruited by means of purposeful sampling (Creswell, 2013), in order to obtain variation in the samples with regard to age, gender, L1, teaching experience, county and study programme affiliation. No economic incentives were provided. The 24 interview informants in Study 1 were recruited from 19 different schools in the three counties of Finnmark (n=8), Oslo (n=8) and Østfold (n=8). One of the teachers in Oslo and seven in Østfold were interviewed face-to-face, whereas the rest were interviewed by telephone. They were recruited directly by email or telephone, after I had approached schools and asked them to help me identify informants on the basis of the above-mentioned criteria. Three interviews were conducted in English and 21 were carried out in Norwegian. All the informants received a letter in advance informing them of the purpose of the study. The video-clip was distributed to them on a USB memory stick a few days before the interviews were scheduled. Instructions were given to watch the video-clip, score the performance and justify the decision. Moreover, I asked the teachers to watch the video as close to the time of the interview as possible, and to take notes, in order to keep the performance as vivid in their memory as possible. The interviews lasted between 24 and 47 minutes. The interview informants in Study 2 were the same as those in Study 1. Of the 46 questionnaire respondents, 14 represented 10 different schools in eight counties (Akershus, Aust-Agder, Nord-Trøndelag, Oppland, Sogn og Fjordane, Sør-Trøndelag, Vestfold and Østfold). They were recruited in the same way as the informants in Study 1. The remaining 32 respondents represented a number of schools from all over the county of Akershus. They were recruited at a teacher seminar via a teacher who invited me to his school to talk about language assessment. At the beginning of the seminar they were informed of the purpose of the study and requested to participate, which all of them agreed to. After my talk, they were shown the video-clip and asked to complete the questionnaire. Being present, I was able to clarify three questions which were asked in relation to the wording of some of the items. The 10 interview and VPA participants in Study 3 represented five different schools in Akershus, Oslo and Østfold. They were recruited in the same way as the informants in Study 1. The purpose of the study and the specifics of the research design were explained to the participants by email in advance. To collect the data I met with each of the teachers individually. The video-clip was shown to them on a laptop computer, and headphones were provided to let them listen to the student's speech while producing the verbal protocols. They were also given a digital voice recorder to record their comments. The VPA procedure was then explained, and the teachers were given 5-10 minutes to familiarize themselves with the procedure. I pointed out that they were allowed to stop the video if they needed to, but that they could preferably watch it all in one go without interruption. They were also told to provide as many comments as possible. During the recording of the protocols I left the room in order to minimize reactivity. Immediately afterwards, I interviewed each of them on their understanding of the content construct. Nine protocols were recorded in Norwegian. One, carried out by a native speaker of English, was conducted in English.

4.5 Data analyses All the data from the interviews and the verbal protocols were explored using the computer programme QSR NVivo10. The questionnaire data was analysed using IBM SPSS Statistics. All the interviews and verbal protocols were transcribed, checked and sent to the teachers for respondent validation (Bryman, 2012). The data obtained in Study 1 was analysed by means of qualitative and quantitative content analysis (Galaczi, 2014; H.-F. Hsieh & Shannon, 2005; Krippendorff, 2013). This analysis may be characterized as essentially inductive, in the sense that I decided to let the analytical categories, as far as possible, develop from the teachers' statements. However, as I have good knowledge of the English subject curriculum, theories of communicative competence, the Common European Framework of Reference etc., it is of course impossible to claim that the categories I developed were purely data-driven. For example, one category I invented was labelled "compensatory strategies" (cf. Article 1). This category was coded from statements such as: "And if they can't [find the word they are looking for], they should try to circumvent it, rather than switching into Norwegian". While analysing such statements, the first concept that came to mind was admittedly Canale and Swain's (1980) term strategic competence, which is conceptually what circumvention rather than switching into your first language is about. Hence, my analysis clearly has elements of deduction, or rather abduction, in it (see e.g. Douven, 2011; Hanson, 1958). As pointed out in section 4.3.2, the analytical framework which was developed for Study 2 centred on the concepts of nativeness and intelligibility and four related pronunciation features identified as important for intelligibility. Both the interview data and the questionnaire responses were investigated using this framework. The interview data, which consisted of answers to the question of native speaker pronunciation (cf. section 4.3.2), was analysed both inductively and deductively, using magnitude and provisional coding (Miles, Huberman, & Saldaña, 2014). The magnitude coding was used to answer the question of nativeness, i.e. the extent to which (near-) native speaker pronunciation is a relevant assessment criterion. It involved the assignment of coded statements along a four-point continuum going from "not at all" in agreement with the nativeness perspective to "to a large extent" in agreement. The provisional coding was employed to answer the question of intelligibility, i.e. the degree to which students merely need to make themselves understood. It was carried out by deductively setting up a number of phrases pertaining to intelligibility, such as "comprehensible speech" and "understanding" and then searching the interview transcripts for corresponding phrases. Finally, the questionnaire responses were used to answer the question about the relevance of the four pronunciation features.20 They were analysed deductively by calculating descriptive statistics, such as means and standard deviations, in order to examine the teachers' orientations towards these constructs. Turning to Study 3, I analysed the verbal protocols and the interview data obtained in the second phase of the project. The analysis was carried out in two cycles using provisional coding (Miles et al., 2014). In the first cycle, teacher statements were coded on the basis of the construct categories developed in Study 1, such as "Vocabulary", "Grammar" and "Fluency", as well as the analytical framework developed specifically for Study 3. The analytical framework, based on the aforementioned subject matter and skills and processes dimensions (cf. section 4.3.3), helped identify the teacher statements relating to content. More specifically, these dimensions were identified in terms of noun phrases (subject matter) and verb phrases (skills and processes) (cf. Krathwohl, 2002). For example, the teacher statement She didn't get the chance to, sort of, talk about the English language as a world language and an international language was analysed in terms of these two phrase types. The verb phrase "talk about" represents the skills and processes dimension, and "English language as a world language and an international language" represents the subject matter dimension. In the second analysis cycle, I sifted out all the phrases relating to subject matter, which was the main object of study, and re-analysed these statements with a particular focus on noun phrases, as specified by Krathwohl. Finally, I compared these phrases with the corresponding noun phrases representing subject matter in the English subject curriculum.

20 These were segmentals (i.e. individual sounds), word stress, sentence stress and intonation (cf. Article 2).
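As an illustration of the quantitative part of this analysis, the descriptive statistics for the questionnaire items could in principle be computed as in the minimal sketch below. The study itself used IBM SPSS Statistics, and the item label and response values shown here are invented for demonstration purposes only.

```python
# Minimal sketch (illustrative only): summarizing responses to one Likert-type
# questionnaire item in the way the SPSS analyses summarized the real data.
from statistics import mean, median, stdev

# Hypothetical agreement ratings from a group of teacher respondents
# for an item such as "Segmentals are important for a top score".
responses = [5, 6, 4, 5, 5, 3, 6, 4, 5, 5]

print("Md =", median(responses))            # median response
print("M  =", round(mean(responses), 2))    # mean response
print("SD =", round(stdev(responses), 2))   # sample standard deviation
```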


When considered across the research phases, my analysis of the content construct may be said to reflect abductive reasoning. For example, in Study 1, I came across statements which indicated that the teachers were thinking in terms of the Bloom-like categories of description, analysis, reflection etc. when assessing content. When I put this to one informant, she confirmed that this was how she thought most teachers reasoned about content. Hence, I inferred that this was a relevant conceptualization of the construct, and I therefore consulted the theory, i.e. Bloom's revised taxonomy, to further elaborate on this perspective. In Study 3 I returned to the teachers to collect more empirical evidence, which in turn confirmed the theory. Thus, there was a going back and forth between empirical data and theory in a way very similar to the abductive process described by, for example, Hanson (1958, p. 85). As mentioned above, the interview transcripts were sent to the informants for respondent validation. In order to further validate the analyses, three colleagues were involved in parts of the coding process. In Study 1 two colleagues, who had previously worked as EFL teachers, agreed to code four transcripts (16 per cent of the total). The comparison between their coding and mine resulted in a moderate inter-coder reliability estimate (Cohen's Kappa = .69). Therefore, we sat down to discuss the coding categories, in order to arrive at a better understanding of how the teachers' statements could be analysed. After having revised the coding scheme and re-analysed the transcripts, I asked one of the colleagues to code two new transcripts. This inter-coder reliability analysis resulted in a Cohen's Kappa estimate of .89, which can be regarded as very good (Landis & Koch, 1977). In Study 2, one of these colleagues also agreed to code two transcripts (8 per cent of the total). This coding resulted in a Kappa estimate of .85, which is also very good. In Study 3 a third colleague, who also has previous experience as a teacher in upper secondary school, consented to analyse two transcripts (20 per cent of the total). The inter-coder consistency between my own coding and hers resulted in a Kappa estimate of .83 for the VPA and .78 for the interviews, which may be regarded as substantial. Obviously, it would have been preferable to have more researchers co-code a larger number of transcripts, but time constraints did not allow for that. Table 3 gives an overview of the central elements in the research process, such as research foci and research questions, data collection, data analyses, main constructs and methods of validation.
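For reference, the agreement statistic used above is Cohen's Kappa, which expresses the agreement between two coders corrected for the agreement that would be expected by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of coding decisions on which the two coders agree, and $p_e$ is the proportion of agreement expected by chance, given each coder's marginal category frequencies.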


Table 3. Overview of the central elements in the research process.

Study 1 (Article 1)
- Research focus: Teachers' understanding of constructs generally
- Research questions: 1. How do EFL teachers in Norway understand the constructs and criteria to be tested in an oral English exam at the upper secondary level? 2. What kind of criteria do these teachers see as salient when assessing performance?
- Data collection: Semi-structured interviews
- Number of participants: 24 interviewees
- Data analysis: Qualitative and quantitative content analysis (Galaczi, 2014)
- Main constructs: Communication; Content
- Validation of data: Respondent validation; co-coding of transcripts

Study 2 (Article 2)
- Research focus: Teachers' orientations towards aspects of the pronunciation construct
- Research questions: 1. To what extent do EFL teachers at the upper secondary level in Norway see nativeness as an important criterion in the assessment of pronunciation? 2. To what extent do the teachers see intelligibility as an important criterion in the assessment of pronunciation? 3. To what extent do teachers see segmentals, word stress, sentence stress and intonation as important in the assessment of pronunciation?
- Data collection: Semi-structured interviews; questionnaires
- Number of participants: 24 interviewees; 46 questionnaire respondents
- Data analysis: Qualitative, using magnitude and provisional coding (Miles et al., 2014), and quantitative, calculating descriptive statistics
- Main constructs: Nativeness; Intelligibility
- Validation of data: Respondent validation; co-coding of transcripts

Study 3 (Article 3)
- Research focus: Teachers' understanding of the content construct
- Research question: What do EFL teachers at the upper secondary school level in Norway perceive as relevant subject matter content to be assessed in the GSP1/VSP2 oral English exam?
- Data collection: Verbal protocols; semi-structured interviews
- Number of participants: 10 verbal protocol and interview informants
- Data analysis: Qualitative, using provisional coding (Miles et al., 2014)
- Main constructs: Subject matter content; cognitive skills and abilities (description, application, analysis etc.)
- Validation of data: Respondent validation; co-coding of transcripts

4.6 Research validity In section 1.1 test validity was described as the extent to which inferences drawn from assessment scores are 'sound'. In a similar vein, research validity may be described as the trustworthiness of the interpretations that are made from research results (Hesse-Biber & Leavy, 2011; Kleven, 2008; Yin, 2016). A number of qualitative researchers avoid using the term 'validity', associating it with the measurement paradigm and quantitative research methods. This is worth noting, as the current investigation utilizes a qualitative priority (cf. section 4.2.1). However, as Kleven (2008) points out, regardless of scientific outlook, the essence of a research endeavour is the dependability of the knowledge claims that are ultimately being put forward. In this perspective, it is of minor importance whether one chooses to use the labels validity, trustworthiness, credibility or dependability to describe the quality of the research outcomes. Rather, what is important is to clarify issues that may undermine the soundness of the conclusions drawn. No universal, standardized list of criteria for the clarification of these issues exists, but in the following I will use Patton's (2015) 10 general criteria for assessing the quality of qualitative research in order to evaluate the validity of the current study. The 10 criteria are:

1. Clarity of purpose
2. Epistemological clarity
3. Questions and hypotheses flow from and are consistent with purpose and epistemology
4. Methods, design and data collection procedures are appropriate for the nature of the inquiry
5. Data collection procedures are systematic and carefully documented
6. Data analysis is appropriate for the kind of data collected
7. Strengths and weaknesses are acknowledged and discussed
8. Findings should flow from the data and analysis
9. Research should be presented for review
10. Ethical reflection and disclosure

Point no. 8 on this list will be considered in Chapter 5, and point no. 10 will be treated in section 4.7. In addition to these ten aspects of research quality, I will consider the question of generalizability of the results, as this point is commonly regarded as vital in social science research. I turn to this question in section 4.8.


As for the first point on Patton's (2015) list, the purpose of the present investigation was pointed out in section 1.4. By studying Norwegian EFL teacher raters' understanding of constructs, rating behaviour, and the correspondence between aspects of teachers' understanding of constructs and what the curriculum and other defining documents identify as construct relevant and irrelevant, I provide empirical evidence of these issues in a domain which is under-researched. Not only is there little evidence of oral EFL assessment in curriculum-based educational settings generally; such research has hardly been undertaken in Norway and has been explicitly requested by the research community (cf. section 1.3.1). In terms of epistemological clarity, I established in section 2.1 that I adhere to a pragmatist position in the sense that I see the terms employed in the thesis as tools for understanding the phenomena under investigation. The same is true of the methods used. In this thesis I have utilized different introspective methods because I have found them useful for studying the teachers' understanding of what should be tested. Hence, this "what works" approach draws attention to the research problem rather than to the methods used (Creswell, 2013, p. 28). Obviously, this implies some assumptions about what kind of inferences can be drawn from the research results. In section 4.3.1 above, I pointed out that introspective research assumes that it is possible to get access to individuals' cognition through the use of these particular methods. I recognize of course that individuals will not always be able to verbalize their thoughts (Freeman, 1996; Polanyi, 1966) and that this is a potential weakness in such research designs. However, all designs involve assumptions about some aspect of the world, and what is important is to make these assumptions explicit. Beyond this, a core question in the pragmatic paradigm, according to Patton (2015), is what practical consequences can be drawn from a given study (p. 152). In the current study this question relates first of all to the problem of not having a common rating scale on the national level. I will return to this problem in Chapter 5. The third and the fourth points on Patton's list are interrelated and have been demonstrated in this chapter. I have argued in section 4.3.1 that the use of introspective research instruments is a highly rational choice in a study of teachers' cognitive processes. Moreover, the use of multiple methods, as employed here, is quite consistent with a pragmatist epistemology (Creswell, 2009, p. 9). In addition, I have argued that the methods chosen to answer the six research questions outlined in Table 3 – interviews, questionnaires and verbal protocols – have been appropriate for providing sound evidence consistent with the purpose of the study, i.e. to investigate the rating process in the GSP1/VSP2 oral English exam. That being said, I recognize that I would have added descriptive power to the results by, for example, combining verbal protocols and interviews in Study 1. In terms of points five and six on the list, I have in the present chapter and in the appendices attached also documented the systematicity of the data collection procedure and shown how the different phases of the research process followed logically from each other. Similarly, I have shown how the data analyses have been undertaken and discussed the appropriateness of these analyses with regard to the data collection and the nature of the study (cf. sections 4.3 and 4.5, above). When it comes to point seven on Patton's list, I would argue that one of the strengths of this investigation is its use of methods which are highly relevant for the nature of the study. Also, the use of multiple methods, particularly in Study 3, both strengthened the validity of the results and broadened the perspective on the nature of that particular inquiry, i.e. the content construct. Moreover, the involvement of as many as 24 interview informants in the first study produced rich descriptions of the teachers' understanding of the constructs.21 This claim is supported by the results in Study 3, which showed that the informants there had a similar understanding of the content construct to that of the participants in Study 1. However, there are also weaknesses in the investigation which need to be addressed. The problem of reactivity has already been mentioned (cf. sections 4.3.1 and 4.4). Similarly, I noted in section 4.5 that a larger number of coders to validate the interview and verbal protocol analyses would have strengthened the descriptive power of the conclusions arrived at. This is also a relevant point in the sense that the second inter-coder reliability measure, which was obtained in Study 1 (Cohen's Kappa = .89), may have been influenced by the fact that I discussed the coding categories with the colleague who was carrying out the second co-coding. Hence, if I had involved other coders, who had not been involved in the discussion of the categories, the results might have been different. An additional perspective relating to the previous point is the general problem of representing an emic, or insider, perspective of research participants through analyses made by a researcher, as it is well established that observation will always be theory-laden (Hanson, 1958). This was touched upon in my discussion of the abductive nature of some of my analyses (cf. section 4.5). In addition, the use of only one video-clip as stimulus in the three studies may have affected the teachers' focus unduly and influenced their responses in the verbal protocols, interviews and questionnaire. As I argue in all of the articles, had there been another student,

21 Compare the interview study by Guest, Bunce & Johnson (2006, quoted in Bryman, 2012, p. 426), where the authors found that around 12 interviews (out of 60) were sufficient for achieving data saturation.


their attention might have shifted to different performance aspects. Against this claim, however, it could be said that the teachers are generally well accustomed to thinking in terms of, and verbalizing, criteria, particularly for formative purposes in the classroom. Hence, the effect of only one video-clip is uncertain. However, what they are not accustomed to is producing concurrent verbal protocols. It could therefore be held that the use of such protocols is an unnatural way for teachers to talk about criteria and that it may have impacted on the data in unforeseen ways. With regard to points eight and nine on Patton's list of criteria, I will return to a summary and discussion of the findings in Chapter 5. As for the presentation of the findings for review, Article 1 has been published in Sage Open, Article 2 has been submitted to Language Assessment Quarterly, and Article 3 has been submitted to Assessment in Education: Principles, Policy and Practice (cf. section 5.2).

4.7 Ethical considerations Criterion 10 on Patton's (2015) list concerns ethical considerations and needs to be addressed here in more detail. Drawing on Diener and Crandall (1978), Bryman (2012, p. 135) lists four main ethical issues which need to be considered in social science research. These relate to:

(i) whether there is harm to participants;
(ii) whether there is lack of informed consent;
(iii) whether there is an invasion of privacy;
(iv) whether deception is involved.

All of these questions seem to depend on how the specific notions of harm, consent, invasion and deception are interpreted. However, in terms of the first point, Bryman (2012) argues that harm can be related to the confidentiality of records. In the present investigation this aspect is relevant in relation to the video-recording of the student taking the oral exam, the audio-recordings of interview informants and VPA participants, as well as the interview and VPA transcripts. According to the Norwegian National Research Ethics Committee (2006), records must be protected whenever participants can be directly or indirectly identified through, for example, the combined variables of place of work, age, gender, occupation etc. In my project, the most obviously identifiable participant was the student recorded in the video-clip.

Although neither her name nor her school is recognizable in the video-sequence, it is evident that she is easy to identify. Hence, the video-clip was stored on my computer, which is only accessible through a username and a password, as well as on USB memory sticks, which were kept in a locked cabinet. The teachers who received them were explicitly instructed to either send them back to me immediately after assessment, or to delete their contents. Beyond the student who was video-recorded, all the teachers were represented only by a code in the audio-recordings, transcripts and articles. Personal details, including names and email addresses, which could be linked to the codes, were kept in a notebook in a locked cabinet. The second principle Bryman lists concerns lack of informed consent. In the current investigation I sent written letters (consent forms) to all prospective participants informing them of the purpose of the study, the research questions, the use of data, and an assurance that all the participants would be guaranteed anonymity (cf. Appendix 7). In addition, the consent forms going out to the students informed them that they could withdraw from the project at any time (cf. Appendix 7). According to the Norwegian National Research Ethics Committee (2006) informed consent also means that the information that is given to participants should be given in a "neutral manner" and that prospective participants should be informed that participation is voluntary (p. 12, my translation). The reason for this is not to put pressure on anyone to sign up for something that they may later regret. In my letters to the teachers and students I took care to use neutral language, and one of the executive officers at the Local Educational Authority (LEA) in the county of Østfold read through the consent forms and made suggestions for improvements. The third principle, invasion of privacy, can be linked to issues of anonymity and confidentiality and is particularly relevant in relation to the student who was being filmed. The question of anonymity relating to the confidentiality of the recording has already been mentioned. As for the dissemination of the video-clip to the teachers, one might consider it a potential problem that I was not able to guarantee that the teachers would not show the video-sequence to third parties. However, as teachers are bound by professional secrecy and are trained to handle such issues, I regard this matter as relatively unproblematic. Similarly, the principle relating to deception is not particularly relevant in this thesis. Apart from the fact that I did not disclose my research question in Study 3 to the teachers until the second half of the interview, which I do not consider essentially unethical, I have at no point tried to withhold relevant information from the participants. On the whole, I consider the research topic to be fairly uncontroversial, and it did not involve minors or "other vulnerable populations" (Miles et al., 2014, p. 60). All required authorizations were obtained from the Norwegian Data Protection Official for Research (cf. Appendix 8), the LEAs, and the different school principals, and data was stored and treated in accordance with the guidelines of the National Research Ethics Committee.

4.8 Generalizability The final point I would like to make in this chapter concerns the generalizability of the results. Generalizability concerns the representativeness of research findings and is a standard aim in quantitative research, where it is commonly regarded as an integrated aspect of research validity (Bryman, 2012). In quantitative research generalizability typically involves the use of statistical sampling techniques in order to generalize findings from a randomized, representative sample to the population from which the sample was drawn (Onwuegbuzie & Leech, 2010). As Bryman (2012) points out, it is impossible to generalize statistically from a small, non-randomized sample to a population. The findings from the three purposefully selected teacher samples in the present investigation can therefore not be statistically generalized to the population of EFL upper secondary school teachers in Norway. However, other forms of generalization are possible. As Gobo (2008) has argued, in a number of research fields, such as geology, genetics, history, linguistics and paleontology, researchers work with non-probability samples which are "regarded as being just as representative of their relative populations and therefore as producing generalizable results" (p. 200). Two types of generalizations that are relevant for the present investigation are analytical, or theoretical, generalization (Hammersley, 1992; Mitchell, 1983; Yin, 2016), and generalizable patterns (Larsson, 2009; Patton, 2015, p. 107). The former implies that it is the "cogency of the theoretical reasoning" rather than statistical criteria that is decisive for judging the extrapolation of the results (Mitchell, 1983, p. 207). The latter can be understood as "configurations, which can be recognized in the empirical world" (Larsson, 2009, p. 33). In this study the notion of theoretical generalization is particularly relevant in relation to the content construct. In Article 1 I found evidence that the teachers were largely thinking of the content construct in terms of the ability to explain, analyse and evaluate whatever is up for discussion. From this empirical finding I developed a theoretical model of content based on Bloom's revised taxonomy, which involved a subject matter dimension and a skills and processes dimension (cf. section 4.3.3). This theoretical model was further empirically supported by the data gathered in Study 3. Hence, the conclusion that the content construct can be understood in this way is not a matter of statistical generalization, but rather of "theoretical representativeness" (Gomm, 2008, p. 235). In other words, it suggests that such a model is a relevant way of describing theoretically how teachers think about content when assessing performance in the oral exam. The idea of generalizable patterns, on the other hand, pertains to a number of other findings which are not so much a matter of theoretical representativeness, but rather of empirical resemblance. Thus, findings such as the general agreement among the teachers on the main constructs to be tested, the variations in orientations towards native speaker pronunciation, and the indications of construct-irrelevant variance and construct underrepresentation may be seen as patterns that are transferable from the specific research situation – which resembles authentic oral exam situations – to real-life, GSP1/VSP2 oral English exams (cf. Yin, 2016, p. 106).

4.9 Short summary In this chapter I have shown how the research design fits the nature of the study and critically evaluated aspects of the design which may compromise the inferences that can be drawn from the results. I have also given examples of such inferences. In the next and final chapter I elaborate on these inferences, summing up the main findings from the three articles, and point to some possible implications and avenues for further research.


Chapter 5: Summary and discussion 5.1 Introduction to the chapter As stated in section 1.4, the main aim of this investigation is to explore EFL teacher raters’ understanding of the constructs to be tested in an oral English exam in Norway. In addition, the study examines aspects of rater behaviour and the correspondence between the teachers’ understanding of constructs and the intended constructs as specified by the English subject curriculum and accompanying government documents. The purpose of the inquiry is to provide empirical evidence of these aspects of the rating process in an assessment context which is poorly researched. In this final chapter I present and discuss the main findings of the study in light of the purpose and the research focus of the investigation. I start by providing a summary of the three articles and continue with a discussion of the contribution of the study to the body of research literature on oral L2 assessment generally and the Norwegian context in particular, as well as some potential implications for EFL teaching and assessment in Norway. Finally, I suggest some avenues for further research.

5.2 Summary of the articles 5.2.1 Article 1 Article 1, entitled "Assessing spoken EFL without a common rating scale: Norwegian EFL teachers' conceptions of construct", was published in Sage Open, October-December, 2015. The aim of the article was to explore Norwegian EFL teachers' general assessment orientations in the GSP1/VSP2 oral English exam. Two research questions (RQs) were addressed:

RQ1. How do EFL teachers in Norway understand the constructs and criteria to be tested in an oral exam at the upper secondary level?

RQ2. What kind of criteria do these teachers see as salient when assessing performance?

The data was gathered from semi-structured interviews with 24 teachers who had been asked to share their assessment orientations after having seen a video-taped performance of a student taking the GSP1/VSP2 oral English exam (cf. section 4.2.3). The analyses of the interview transcripts showed that the teachers paid attention to a large number of different

performance aspects. In fact, the number of categories developed was so substantial (n=56) that I was unable to report on all of them. Hence, I concentrated on the 38 categories that were common to all the three exam tasks (cf. section 4.2.2). As for the teachers’ overall understanding of what was to be tested, the results showed that the informants focused on two main constructs, namely Communication and Content. These two constructs comprised a number of sub-categories, where Linguistic competence (belonging to Communication) and Application, analysis, reflection (belonging to Content) turned out to be the most important. Linguistic competence, in turn, consisted of three large sub-categories, namely Grammar, Vocabulary and Phonology, Phonology being the most substantial of the three.22 In general, there was evidence that the teachers largely understood the main constructs in the same way, but that there was some variation with regard to the more finely grained performance aspects, such as Phonology. This finding corresponds with those of Brown et al. (2005) and Borger (2014), which indicated overall agreement on the main aspects to be tested, but some disagreement on the more narrowly defined performance features. One notable difference in teacher orientations, however, was the importance attributed to Content. Although the informants largely understood Content in the same way – conceptualized as a Bloom-like taxonomy of analysing and reflecting on subject matter, as well as the ability to address the task or topic question – the analysis showed that the teachers weighted this construct differently. More specifically, there was evidence that the general studies programme (GSP) teachers were more concerned with Content than the vocational studies programmes (VSP) teachers. For example, the GSP teachers clearly penalized the student in the video-clip for not answering the task question properly. It was hypothesized that this could be explained in terms of the GSP teachers being used to working with students who are generally more proficient in English. Such an interpretation is congruent with findings that raters are more likely to pay attention to linguistic features at the lower levels of proficiency, whereas they have a stronger focus on content at the higher levels (Brown et al., 2005; Pollitt & Murray, 1996; Sato, 2012). Beyond the differences in orientations relating to the Content construct, I also found that some teachers openly admitted that they would assess VSP students more leniently than GSP students, thus explicitly bringing to the fore the debate about the fairness of making the VSP students take the same course and the same exam as the GSP students (cf. section 1.3.3).

22 "Phonology" and "pronunciation" are here being used interchangeably (cf. Chapter 4, footnote 17).


This point relates to the question of the social consequences of test use (Messick, 1989), as mentioned in section 2.3.3. The analysis of scoring behaviour, based on the grades awarded, showed that three teachers gave the student in the video-clip a 2, 15 teachers awarded her a 3 and six teachers awarded her a 4 (mean value M = 3.13, standard deviation = .61).23 However, as only one student performance was scored, it was not possible to calculate an inter-rater reliability measure, such as a Cronbach's alpha, for the grades awarded. No analyses of relationships between scoring behaviour and teacher background variables were reported in Article 1, but statistically significant variation was found in terms of study programme affiliation, as the GSP teachers scored more harshly than the VSP teachers (r = .50, p < .05). Moreover, the fact that some teachers explicitly stated that they would assess VSP students more leniently shows variation in rater severity, which is a threat to the validity of the scoring outcomes. As for the evaluation of the scoring inference, it must be pointed out that the current investigation is not a validation study. I have not made any comprehensive, systematic inventory of the constructs to be tested as specified by the subject curriculum and defining documents. Yet, when comparing the teachers' statements with the subject curriculum and the circular UDIR-1-2010, there is clear evidence that some teachers were attending to construct-irrelevant performance features. Four informants explicitly referred to the student's level of preparedness as a relevant assessment criterion, and five teachers mentioned effort as an aspect to be tested. According to the above-mentioned circular, preparation and effort are not to be assessed (UDIR, 2010). Such a focus on construct-irrelevant features is not uncommon in the assessment of spoken proficiency (Ang-Aw & Goh, 2011; Borger, 2014; Brown, 1995; C.-N. Hsieh, 2011; Lyn A. May, 2006; Orr, 2002). On the basis of these analyses I argued that the findings point to the problem of not having a common rating scale on the national level. For lack of data, I cannot compare the results in this study to teacher rater cognition and behaviour in a similar setting with a common rating scale. However, there is evidence that such a scale may enhance the validity and reliability of the scores (Fulcher, 2012; Ginther, 2013). In addition, I maintained that the inclusion of a comprehensive content construct – as evidenced by the many references to content in the English subject curriculum – is problematic in an assessment context where the proficiency levels of the students seem to vary a lot. Judging from the teachers' accounts there are indications that the VSP students are, on average, at a lower proficiency level in English.

23 The grades given range from 1 (= 'fail') to 6 (= 'excellent').


As a consequence, the VSP teachers seem to prioritize language aspects over content features. However, given the design and administration of the oral English exam as it is currently framed, I also argued that teachers should make students explicitly aware of the importance of answering task questions properly and of reflecting on the questions they are given.

5.2.2 Article 2

Article 2 was co-written with Thomas Hansen and entitled "Assessing pronunciation in an EFL context: Teachers' orientations towards nativeness and intelligibility". It is currently under review for Language Assessment Quarterly. The article took as its starting point the strong focus on phonology reported in Study 1 (Article 1) and the divergent views on the question of native speaker pronunciation found in the first pilot study (cf. section 4.2.3). These are interesting findings, given that pronunciation appears, until recently, to have been neglected in ESL/EFL pedagogy and research (Baker, 2013; Isaacs, 2014; Moyer, 2013). It is also worth noting that pronunciation instruction and assessment may be said to have been guided by two contradictory principles, the nativeness principle and the intelligibility principle (Levis, 2005). The former holds that the goal of pronunciation pedagogy is to make students achieve native-like pronunciation; the latter implies that the goal is for students to make themselves understood. Although the intelligibility principle, at least in theory, seems to have been embraced by a large number of teaching practitioners in recent years (Brown et al., 2005; Hansen, 2011; Timmis, 2002), there are indications that teaching and assessment practices are still largely guided by the nativeness principle (Deterding, 2010; Jenkins, 2007; Seidlhofer, 2011). Against this backdrop, we explored Norwegian EFL teachers' attitudes towards nativeness and intelligibility. In addition, we measured the teachers' orientations towards the assessment of four specific pronunciation features which have been found to be important for intelligibility, namely segmentals (i.e. individual sounds), word stress, sentence stress and intonation. Study 2 (Article 2) therefore addressed the following three RQs:

RQ1. To what extent do EFL teachers at the upper secondary level in Norway see nativeness as an important criterion in the assessment of pronunciation?
RQ2. To what extent do the teachers see intelligibility as an important criterion in the assessment of pronunciation?
RQ3. To what extent do the teachers see segmentals, word stress, sentence stress and intonation as important in the assessment of pronunciation?

RQ1 and RQ2 were answered with data from both the 24 informants interviewed in Study 1 and the 46 questionnaire respondents participating in Study 2, totalling 70 teacher participants. RQ3, on the other hand, was only answered with data from the 46 questionnaire respondents. The video-clip used as a prompt in Study 1 was also used in this study (cf. section 4.2.4). As for RQ1, the results showed that the teachers had conflicting views on the issue of nativeness. For example, 11 teachers strongly disagreed and 13 strongly agreed that students should be assessed against a native-speaker norm. However, we also found evidence of ambivalent attitudes towards nativeness. Some of the interview informants advocated acceptance of non-native speaker pronunciation at the same time as they would prefer top-scoring students to approximate a native-speaker accent. Regarding RQ2, however, there was much more agreement among the teachers. On average, they strongly agreed that intelligibility was important. For instance, 37 of the 46 questionnaire respondents strongly or completely agreed that they would automatically mark a student down from a grade of 6 if it was difficult to understand what he or she said. Finally, in terms of RQ3, the analysis showed that the teachers moderately to strongly agreed that segmentals, word stress and sentence stress were important. For example, on a set of questionnaire items tapping into the respondents' orientations towards the importance of the four pronunciation features for a top score, segmentals turned out to be the most important (median value Md=5; mean value M=4.46), followed by word stress (Md=4; M=4.02) and sentence stress (Md=4; M=3.67).24 However, when it comes to the assessment of intonation, the teachers were more cautious. The response values (Md=3; M=3.07), including 32 respondents who chose the mid-range option (i.e. neither agree nor disagree), indicate that the teachers were either less concerned with intonation or uncertain of how to assess it (cf. Dubois & Burns, 1975). An additional analysis of possible relationships between rater background variables and interview and questionnaire responses found significant positive relationships between teacher experience and all four pronunciation features. Put differently, the more experienced teachers were slightly more concerned with these features than the less experienced teachers. All in all, the findings in Article 2 indicated that there was agreement among the teachers about intelligibility. Nativeness, on the other hand, was a contentious issue. Some teachers were strongly in favour of assessing performance against a native speaker norm,

24 A five-point Likert scale going from 1 = "completely disagree" to 5 = "completely agree" was used to elicit responses.


whereas others were strongly against. In this respect, the results are in accord with other studies showing divergent orientations in this area (Coskun, 2011; Jenkins, 2007; Timmis, 2002). Additionally, there were teachers who displayed ambivalent attitudes to this issue. However, it seems clear that the teachers overall are concerned with a number of pronunciation issues, such as segmentals, word stress and sentence stress. As for intonation, they are more reluctant, either being unsure of its relevance or finding it less important. If the latter is the case, they are in agreement with proponents of the intelligibility principle within the English as a Lingua Franca paradigm, who point to studies showing that intonation is not important for making oneself understood (Deterding, 2010; Jenkins, 2000). On the basis of these findings, we also argued that the pronunciation construct itself needs to be better defined. This includes a clarification of the relationship between nativeness and intelligibility, which we consider to be interrelated rather than contradictory (cf. Levis, 2005). Moreover, as the reference to pronunciation in the English subject curriculum is rather vague,25 we suggested that the introduction of a common rating scale with a more clearly operationalized pronunciation construct be considered. At any rate, it is vital that those pronunciation aspects that are important for intelligibility are made clear to raters. Until empirical studies have proven otherwise, we suggested that the features discussed here, such as certain segmentals, word stress and sentence stress, be singled out as salient criteria.

5.2.3 Article 3

Article 3, entitled "Assessing content in a curriculum-based EFL oral exam: The importance of higher-order thinking skills", has been submitted to Assessment in Education: Principles, Policy and Practice. The starting point of the article was the variability in teacher perceptions on the issue of content, which was reported in Article 1, as well as claims that the content construct is complex and not well understood in all contexts (Frost, Elder, & Wigglesworth, 2012). Based on the finding from Article 1 that content was largely understood in terms of a Bloom-like taxonomy of analysing and reflecting on subject matter, I developed an analytical framework drawing mainly on Bloom's revised taxonomy of educational objectives (Anderson & Krathwohl, 2001). This framework divided content into a subject matter dimension and a skills and abilities dimension (cf. sections 4.3.3 and 4.5). As the teachers in Article 1 were mostly concerned with the skills and abilities dimension when

25 The 2006 curriculum did not mention it at all, and the 2013 version states that "[The student shall be able to] use patterns of pronunciation, intonation […]" (cf. appendices 1 and 2).


assessing performance, I decided to focus Article 3 on the kinds of subject matter aspects the teachers paid attention to in the oral English exam. Hence, the following RQ was formulated:

What do EFL teachers at the upper secondary school level in Norway perceive as relevant subject matter content to be assessed in the GSP1/VSP2 oral English exam?

As part of the investigation of the validity of the scoring inference (cf. section 1.4), I also decided to use the analytical framework to identify aspects of subject matter in the competence aims of the subject curriculum. This allowed me to compare the intended construct to be tested with the teachers' own understanding of the construct. Data was collected from verbal protocols and semi-structured interviews involving 10 teachers. The initial analysis of subject matter elements in the English curriculum showed that several competence aims involve a considerable number of such elements. Examples of these are "Cultural and societal conditions in English-speaking countries" and "English-speaking cultural forms of expression" (cf. Article 3, Table 2). In total, ten subject matter categories were developed from the analysis of the subject curriculum. The exploration of the teacher statements in the interviews and verbal protocols revealed that the teachers also understood subject matter in very general terms. The analyses yielded 13 subject matter classifications, including categories as different as Personalized knowledge and Indigenous peoples. The former implies that some teachers would accept, at the lower proficiency levels at least, that students describe topics of personal interest. The latter means that they would expect students to display specific knowledge of indigenous peoples (cf. Article 3, Table 4). However, most of the teachers were quick to point out that, because of the very general nature of the subject matter construct in the curriculum, they do not expect students to remember specific details, such as particular historical events or geographical locations. Rather, the teachers are largely open to assessing whatever is up for discussion. Some of them also report that this relates to the fact that the Norwegian Directorate for Education and Training has specified that the examiners are supposed to help the students to "display their competence", not to seek out their "lack of competence" (UDIR, 2014a, p. 2, my translation). This position has two consequences. Firstly, it means that subject matter largely becomes an issue of students having general world knowledge. Secondly, and more importantly, the crucial element in the content construct is not the subject matter dimension, but rather the skills and abilities dimension. What is important is that the students are able to analyse and reflect on whichever topic they are presented with. And, the higher up

on the grading scale the students are, the more important it is that they are able to display higher-order thinking skills, such as analysing, synthesizing and evaluating. In this respect, the teachers may be said to be in agreement with the widely held notion that the development of higher-order thinking skills is one of the primary goals of education (Baird et al., 2014; Chamot, 2009; Marin & Halpern, 2011). The investigation of the validity of the scoring inference showed that the teachers' descriptions of content were generally highly congruent with the subject matter aspects identified in the curriculum. This, however, was not unexpected, given the very general nature of subject matter described in the competence aims. The only discrepancy between the teachers and the curriculum was found in the area of metacognitive strategies, where the teachers appeared to be unsure of their relevance, despite references to this component in one of the competence aims. One teacher, in fact, persistently denied that metacognitive strategies were to be tested in the oral exam. Hence, the intended construct to be tested was underrepresented with regard to this element. Such construct underrepresentation is not uncommon in ESL/EFL speaking tests (Ang-Aw & Goh, 2011; Cai, 2015). A final analysis, inspired by the finding in Article 1 that the GSP teachers gave more weight to the content construct, showed that the GSP and VSP teachers in Article 3 were equally concerned with the relevance of this construct. The three VSP teachers in the current sample (n=10) all acknowledged the importance of analysing and reflecting on subject matter, as well as answering the task or topic question properly. In this respect, the findings in Article 3 do not support the findings in Article 1. However, the VSP informants disagreed with some of the GSP teachers on what kind of performance is indicative of achievement with regard to a given grade level. For example, one of the GSP teachers insisted that the student in the video-clip had failed to answer the task question, which she regarded as a major flaw in the student's performance. When I quoted this statement to one of the VSP teachers, she agreed that answering the task is important, but strongly objected to the claim that the student had not answered the task properly. In her view, the student had responded well to the question. In this sense, the results in Article 3 corroborate the findings in Article 1 that VSP instructors are more lenient in their assessment of content than GSP teachers. Moreover, this result is consistent with research findings showing that raters give different scores for the same performance (Douglas, 1994; Eckes, 2009; Orr, 2002). On the basis of the results of these analyses, I argued that the correspondence between the teachers' understanding of the construct and the intended construct appears to be acceptable. The validity threats in this study concern the underrepresentation of the

metacognitive strategies component, as well as the disagreement on what kind of performance is characteristic of a given level in relation to a criterion. Thus, in order to further strengthen the validity of the scoring inference, I suggested that the educational authorities consider the introduction of a common rating scale, or at least common scoring guidelines, as this may improve the validity of the scores (Fulcher, 2012). A related point is the introduction of more rater training, which seems to be arbitrarily organized in the Norwegian educational system. Some of the teachers in the studies comprising this thesis report that they have never had any rater training at all. Such training may have a positive effect on the consistency of the scores (Taylor & Galaczi, 2011).

5.3 Research contribution

The main research contribution of the present investigation is increased knowledge of how EFL teacher raters assess performance in an English L2 assessment context at the upper-intermediate proficiency level (CEFR level B1/B2). The three studies point towards ways in which Norwegian EFL teachers understand the constructs to be assessed. Additionally, they provide empirical evidence of aspects of scoring behaviour, as well as of the consistency between the teachers' understanding of constructs and the intended constructs to be tested. Regarding empirical contributions, three main findings related to the teachers' understanding of constructs must be mentioned. First of all, this thesis has shown that the teachers, overall, have the same understanding of the main constructs to be tested, but that they differ with regard to more finely grained assessment criteria, such as phonology. The lack of research findings in similar contexts, i.e. curriculum-based school settings without a common rating scale, makes direct comparisons with other studies difficult. The only enquiry examining exactly the same context, albeit with a smaller teacher sample and a more general research focus (i.e. Yildiz, 2011), found that the teachers attended to constructs similar to those identified in the present investigation, but with somewhat more variation in the teachers' orientations. Apart from Yildiz's study, Brown et al. (2005), investigating an EAP context without a rating scale, and Borger (2014), exploring a Scandinavian upper secondary school context with a common rating scale, are the studies identified in the literature review which most strongly resemble the current one (cf. Chapter 3). Both corroborate the conclusions arrived at here. Secondly, the results of the present thesis support findings that teachers tend to heed linguistic performance aspects at the lower proficiency levels, whereas they put more

emphasis on content at the higher levels of proficiency (Brown et al., 2005; Pollitt & Murray, 1996; Sato, 2012). For example, some of the teachers working exclusively with VSP students, who are reportedly at a lower proficiency level in English, stress that they prioritize language-related aspects when assessing performance. Thirdly, the investigation shows that the pronunciation construct is a complex one, in the sense that teachers disagree on the relevance of applying a native speaker standard when judging phonology. This is consistent with conclusions drawn in similar studies (Brown et al., 2005; Deterding, 2010; Jenkins, 2007; Timmis, 2002). What was particularly interesting here was that 19 per cent of the teachers were strongly in favour of using a native speaker standard, whereas 16 per cent were strongly against. This suggests that nativeness is a contentious issue among many English teachers in Norway. Such conflicting views have also been reported elsewhere (Jenkins, 2007; Jenkins et al., 2011). Above and beyond this, the findings in Article 2 support the contention that pronunciation has been neglected in language teaching pedagogy (Baker, 2013; Derwing & Munro, 2009; Isaacs, 2014) in the sense that the subject curriculum is surprisingly vague in its reference to this construct. As for scoring behaviour, no measure of inter-rater reliability could be given due to the fact that only one video-taped student performance was used. However, the analysis showed that the GSP teachers were significantly harsher in their grading than the VSP teachers, which constitutes a threat to the reliability of the scores. When correcting for this difference, the results indicate acceptable levels of rater consistency, especially when considering that the scoring of performance always involves two teacher raters (Henning, 1996). This is in line with studies on inter-rater reliability, which generally show high levels of consistency (McNamara, 1996; Eckes, 2011; see also Fulcher, 2003), but the lack of a common rating scale in the present study, as well as the general lack of rater training reported by the teachers, makes this finding original. The comparison between aspects of the teachers' understanding of constructs and the intended constructs as specified by the English subject curriculum and other government documents showed generally good correspondence, considering the lack of rater training and rating scale. This corresponds to the findings made by Brown et al. (2005) and Borger (2014). Construct-irrelevant assessment criteria included preparation and effort. Construct underrepresentation was identified in the teachers' reluctance to include metacognitive strategies when assessing performance. Failure to heed the constructs properly in this way is quite common in oral ESL/EFL assessment (Ang-Aw & Goh, 2011; Brown, 1995; Cai, 2015;


C.-N. Hsieh, 2011; Orr, 2002), but there is little evidence on this issue in curriculum-based contexts where no scoring guidelines are provided. One theoretical contribution provided in this thesis concerns the conceptualization of the content construct. Applying Bloom's revised taxonomy of educational objectives (Anderson & Krathwohl, 2001), I argued in Article 3 that the teachers to a large extent understand content as consisting of a subject matter (or what) dimension and a skills and abilities (or how) dimension. Furthermore, given the very general nature of the subject matter dimension in the curriculum, I maintained that the teachers largely understand subject matter in terms of general world knowledge. Moreover, they tend to assess the skills and abilities dimension in terms of whether students are able to recount or describe subject matter (at the lower levels of achievement) or to analyse and reflect on it (at the higher levels of achievement). Of the two dimensions, the skills and abilities dimension is the more important one, as the teachers primarily tend to focus on the extent to which students possess higher-order thinking skills. This is a perspective which may theoretically generalize to other contexts (Mitchell, 1983) in the sense that it provides a highly relevant conceptual tool for assessing the content construct. Another theoretical contribution of this thesis relates to the claim put forward in Article 2 that nativeness and intelligibility are interrelated rather than contradictory principles (cf. Levis, 2005). As was argued in Article 2, alternative approaches to native speaker pronunciation pedagogy tend to take native speaker features as a starting point when attempting to define what is important for intelligibility and what is not. The suggestions made within the English as a Lingua Franca paradigm, which stresses intelligibility over nativeness, are a case in point. These suggestions are based on a number of features taken from Received Pronunciation and General American. Hence, the notion that the nativeness principle and the intelligibility principle are incongruous appears paradoxical. Rather, the two principles seem interrelated, although their relationship may be somewhat blurred. I therefore suggest that more research be undertaken to clarify this association (cf. Article 2).

5.4 Implications for the Norwegian educational context

Given the fact that there was no common rating scale in the current study, as well as the lack of rater training reported by many teachers, one may conclude that the teachers' agreement on the main constructs to be tested is acceptable. However, there are instances of construct underrepresentation and construct-irrelevant variance which do pose a threat to the validity of

the scores, such as the attention paid to effort and the failure to heed metacognitive strategies. In addition, the fact that some teachers differ in what kind of performance they consider indicative of achievement on the different levels of proficiency is problematic for score consistency. Such differences were found, for example, in the assessment of content. The analysis of scoring behaviour also shows significant differences between the GSP teachers and VSP teachers, the VSP teachers being more lenient. However, when controlling for the variable of study programme affiliation, the score consistency may be regarded as satisfactory, especially when considering that two teacher raters will always be involved in the scoring of performance in the oral exam. Against this background, the question of standardization, which has been a recurring issue throughout this thesis, needs to be considered. This question first of all concerns the lack of a common rating scale (cf. sections 1.1, 1.3.3 and 2.4). As touched upon in section 2.4, the question of standardization may be regarded as a continuum, depending on the purpose of the test, the constructs to be assessed and the extent to which validity or reliability is emphasized (AERA, APA, & NCME, 2014; Harlen, 2012). It also depends on paradigm perspective (cf. section 1.2). From a measurement paradigm viewpoint, dispensing with rating scales may seem rather peculiar, since they are widely considered to contribute to improving the reliability of scores (Brown, 2012; Davis, 2015; Fulcher, 2012; Ginther, 2013; Taylor & Galaczi, 2011; Weigle, 1998). It may also seem somewhat puzzling that rating scales and rater training are not provided for the oral exam, since they are provided for the corresponding written English exam. In fact, in language performance testing they are hardly ever dispensed with (Fulcher, 2012; Ginther, 2013). However, from an assessment paradigm perspective, rating scales are not necessarily the universal solution to increased quality in educational contexts. According to Baird et al. (2014), for example, assessment is intrinsically linked to teaching and learning and should not be treated in isolation (p. 82). This relates to the notion that rating scales may not be able to capture the complexities of what is to be tested, especially when the frame of reference for the definition of the constructs is a comprehensive subject curriculum (cf. sections 2.2 and 2.4). A similar issue concerns the objection that the introduction of a rating scale may blur the distinctive character of the course of study, causing the criteria to "create a new structure in the subject" (Throndsen et al., 2009, p. 109, my translation). Relatedly, a less formalized structure gives teachers the opportunity to integrate valuable learning outcomes which may not be specified in the English subject curriculum, but which are included in the Core Curriculum (R. Jensen, personal communication, April 7, 2016). Finally, there is the

interactionist perspective (cf. section 2.5.1), which holds that communicative competence is co-constructed in local contexts, making the application of universal criteria problematic (Chalhoub-Deville, 2003; Chapelle, 1998; He & Young, 1998; see also Bachman, 2007). On the basis of these considerations, and the threats to validity and reliability identified in the current investigation, I side with Throndsen et al. (2009), who suggest that national rating scale guidelines, which may be locally adapted, be introduced to support the scoring of performance in the GSP1/VSP2 exam. Such guidelines have, in fact, been implemented in Norwegian lower secondary schools and provide teachers with common operationalizations of the main constructs, while allowing broader assessment concerns to be included in cases where this is deemed relevant. In addition, I suggest that the Norwegian educational authorities offer rater training on a more systematic basis. Given the positive effects of such training (Barnwell, 1989; Brown, 2012; Taylor & Galaczi, 2011), it may strengthen the already established focus on a "shared assessment culture".26 Along with the question of standardization, another recurrent issue in this thesis is the consequences of letting GSP and VSP students take the same exam. As mentioned in section 1.3.3, this has been a controversial topic for a number of years. The analyses conducted in the current investigation show clear indications of rater variability between GSP and VSP teachers. Not only are VSP teachers more lenient in their scoring; some of them also openly admit that they will apply non-relevant criterion information, such as effort, in the assessment of performance in order to let VSP students pass the exam. In addition, there are teachers working with both student groups who indicate that they score GSP students more harshly than VSP students because they find the system unfair to the VSP students. However, considering the fact that measures to make English more relevant for vocational students were introduced after the data for the present investigation was collected (cf. section 1.3.3), it may be that the teachers' attitudes towards this issue have changed. Yet, the matter should be further looked into, as this type of rater variability poses one of the most serious threats to the validity of the scores in this thesis. This relates to the social consequences of test use (cf. section 2.3.3).

26 The Norwegian educational authorities' focus on a "shared assessment culture" (Norw. 'tolkingsfellesskap') entails the voluntary development, by the teachers, of "a common understanding of the subject curriculum, and the assessment of competence, through dialogue and discussion" (UDIR, 2014c, p. 1, my translation).


5.5 Concluding remarks

As pointed out by Alderson and Bachman in the quote given in the introduction to this thesis (p. 1), speaking is probably the most difficult skill to assess reliably. Given this, and the fact that there is neither a common rating scale nor systematic rater training in the assessment context explored in this thesis, I conclude that the teachers involved in this study had surprisingly similar views on how to assess performance in the GSP1/VSP2 oral English exam. One may speculate that this is the result of competent teachers working consciously to develop a "shared assessment culture" (cf. footnote 26). Nevertheless, there is clear evidence of rater variability, and I therefore suggest that the educational authorities consider introducing common rating scale guidelines at the national level, as well as more systematic rater training. In addition, more research is needed. One area in which more research should be conducted is pronunciation, particularly in terms of the relationship between nativeness and intelligibility and the issues of 'correctness' and 'error'. As has been pointed out in this thesis, the operationalization of the pronunciation construct is problematic, and more studies are required in order to clarify which phonological features should be prioritized in teaching and assessment. Another under-researched area concerns the question of how to assess content, specifically regarding the interface between language and content (cf. Snow & Katz, 2014). Finally, as regards the Norwegian context, it would be relevant to investigate how Norwegian EFL teachers assess oral English when producing oral achievement marks, as these make up a substantial proportion of the students' final English grade at the upper secondary school level.


References

Alderson, J. C., Haapakangas, E.-L., Huhta, A., Nieminen, L., & Ullakonoja, R. (2015). The diagnosis of reading in a second or foreign language. New York: Routledge Ltd.
Alvesson, M. (2003). Beyond neopositivists, romantics, and localists: A reflexive approach to interviews in organizational research. Academy of Management Review, 28(1), 13-33. doi:10.5465/amr.2003.8925191.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Anderson, L. W., & Krathwohl, D. R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. New York: Longman.
Andreassen, R.-A., & Gamlem, S. M. (2011). Arbeid med elevvurdering som utvikling av skolens læringskultur [Developing schools' learning cultures through student assessment]. In S. Dobson, A. B. Eggen, & K. Smith (Eds.), Vurdering, prinsipper og praksis: Nye perspektiver på elev- og læringsvurdering (pp. 112-129). Oslo: Gyldendal Akademisk.
Ang-Aw, H. T., & Goh, C. C. M. (2011). Understanding discrepancies in rater judgement on national-level oral examination tasks. RELC Journal, 42(1), 31-51. doi:10.1177/0033688210390226.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge University Press.
Bachman, L. F. (2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In J. Fox, M. Wesche, D. Bayliss, L. Cheng, C. E. Turner, & C. Doe (Eds.), Language testing reconsidered (pp. 41-71). Ottawa: Ottawa University Press.
Bachman, L. F. (2014). Ongoing challenges in language assessment. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 3, pp. 1586-1603). Chichester, UK: Wiley-Blackwell.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Baird, J.-A., Hopfenbeck, T. N., Newton, P., Stobart, G., & Steen-Utheim, A. T. (2014). State of the field review: Assessment and learning. (Case number 13/4697). Oslo: Knowledge Centre for Education.
Baker, A. (2013). Integrating fluent pronunciation use into content-based ESL instruction: Two case studies. In J. Levis & K. LeVelle (Eds.), Proceedings of the 4th Pronunciation in Second Language Learning and Teaching Conference, Aug. 2012 (pp. 245-254). Ames, IA: Iowa State University.
Barnwell, D. (1989). 'Naive' native speakers and judgements of oral proficiency in Spanish. Language Testing, 6(2), 152-163. doi:10.1177/026553228900600203.
Bejar, I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2-9. doi:10.1111/j.1745-3992.2012.00238.x.


Bennett, R. E., & Gitomer, D. H. (2008). Transforming K-12 assessment: Integrating accountability testing, formative assessment, and professional support. (Research memorandum ETS RM-08-13). Princeton, NJ: Educational Testing Service.
Black, P., & Jones, J. (2006). Formative assessment and the learning and teaching of MFL: Sharing the language learning road map with the learners. The Language Learning Journal, 34(1), 4-9. doi:10.1080/09571730685200171.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7-74. doi:10.1080/0969595980050102.
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5-39. doi:10.1007/s11092-008-9068-5.
Borg, S. (2003). Teacher cognition in language teaching: A review of research on what language teachers think, know, believe, and do. Language Teaching, 36(2), 81-109. doi:10.1017/S0261444803001903.
Borger, L. (2014). Looking beyond scores: A study of rater orientations and ratings of speaking. (Licentiate thesis). University of Gothenburg, Gothenburg, Sweden. Retrieved from http://hdl.handle.net/2077/38158.
Borsboom, D., Cramer, A. O. J., Kievit, R. A., Scholten, A. Z., & Franić, S. (2009). The end of construct validity. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 135-170). Charlotte, NC: Information Age Publishing.
Brinkmann, S., & Kvale, S. (2015). InterViews: Learning the craft of qualitative research interviewing. Thousand Oaks: Sage.
Brinton, D. M., Snow, M. A., & Wesche, M. (2003). Content-based second language instruction. University of Michigan Press: Michigan Classics Edition.
Broadfoot, P. (2007). An introduction to assessment. New York: Continuum.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(1), 1-15. doi:10.1177/026553229501200101.
Brown, A. (2000). An investigation of the rating process in the IELTS oral interview. IELTS research reports 3. Retrieved from https://www.ielts.org/pdf/Vol3Report3.pdf.
Brown, A. (2012). Interlocutor and rater training. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing. Oxford: Routledge.
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks. (TOEFL monograph series. MS - 29). Princeton, NJ: Educational Testing Service.
Bryman, A. (2012). Social research methods (4th ed.). Oxford: Oxford University Press.
Byram, M. (1997). Teaching and assessing intercultural communicative competence. Clevedon: Multilingual Matters.
Cai, H. (2015). Weight-based classification of raters and rater cognition in an EFL speaking test. Language Assessment Quarterly, 12(3), 262-282. doi:10.1080/15434303.2015.1053134.
Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller (Ed.), Issues in language testing research (pp. 333-342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics, 1(1), 1-47.
Celce-Murcia, M. (2014). Teaching English in the context of World Englishes. In M. Celce-Murcia, D. M. Brinton, & M. A. Snow (Eds.), Teaching English as a Second or Foreign Language (4th ed., pp. 63-70). Boston, MA: National Geographic Learning.

Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing, 12(1), 16-33. doi:10.1177/026553229501200102.
Chalhoub-Deville, M. (2003). Second language interaction: Current perspectives and future trends. Language Testing, 20(4), 369-383. doi:10.1191/0265532203lt264oa.
Chamot, A. U. (2009). The CALLA handbook: Implementing the cognitive academic language learning approach (2nd ed.). White Plains, NY: Pearson Education.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and language testing research (pp. 32-70). Cambridge: Cambridge University Press.
Chvala, L., & Graedler, A. L. (2010). Assessment in English. In S. Dobson & R. Engh (Eds.), Vurdering for læring i fag [Assessment for learning in the disciplines]. Kristiansand: Høyskoleforlaget.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cook, V. (1999). Going beyond the native speaker in language teaching. TESOL Quarterly, 33(2), 185-209.
Coskun, A. (2011). Future English teachers' attitudes towards EIL pronunciation. Journal of English as an International Language, 6(2), 46-68.
Council of Europe. (2001). The Common European Framework of Reference for Languages: Learning, teaching, assessment. Strasbourg: Council of Europe, Language Policy Unit.
Creswell, J. W. (2009). Research design: Qualitative, quantitative, and mixed methods approaches (3rd ed.). Thousand Oaks: Sage.
Creswell, J. W. (2013). Qualitative inquiry & research design: Choosing among five approaches. Los Angeles: Sage.
Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research (2nd ed.). Thousand Oaks: Sage.
Cronbach, L. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.
Cumming, A. (2013). Validation in language assessments. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 6006-6015). Chichester, UK: Blackwell.
Davies, A. (2003). The native speaker: Myth and reality. Clevedon: Multilingual Matters.
Davies, A. (2014). Fifty years of language assessment. In A. J. Kunnan (Ed.), The companion to language assessment: Abilities, contexts, and learners (Vol. 1, pp. 3-21). Chichester, UK: Wiley Blackwell.
Davies, A., & Elder, C. (2005). Validity and validation in language testing. In E. Hinkel (Ed.), Handbook of research in second language teaching and learning (pp. 795-813). Mahwah, NJ: Lawrence Erlbaum.
Davis, L. (2015). The influence of training and experience on rater performance in scoring spoken language. Language Testing. doi:10.1177/0265532215582282.
Derwing, T., & Munro, M. J. (2009). Putting accent in its place: Rethinking obstacles to communication. Language Teaching, 42(4), 476-490.
Deterding, D. (2010). ELF-based pronunciation teaching in China. Chinese Journal of Applied Linguistics, 33(6), 3-15.
Diener, E., & Crandall, R. (1978). Ethics in social and behavioural research. Chicago: University of Chicago Press.
Dobson, S., Eggen, A. B., & Smith, K. (2009). Vurdering, prinsipper og praksis: Nye perspektiver på elev- og læringsvurdering [Assessment, principles and practice: New perspectives on student and learning assessment]. Oslo: Gyldendal Akademisk.

Douglas, D. (1994). Quantity and quality in speaking test performance. Language Testing, 11(2), 125-144.
Douven, I. (2011). Peirce on abduction. The Stanford Encyclopedia of Philosophy.
Dubois, B., & Burns, J. A. (1975). An analysis of the meaning of the question mark response category in attitude scales. Educational and Psychological Measurement, 35(4), 869-884. doi:10.1177/001316447503500414.
Ducasse, A. M., & Brown, A. (2009). Assessing paired orals: Raters' orientation to interaction. Language Testing, 26(3), 423-443. doi:10.1177/0265532209104669.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221. doi:10.1207/s15434311laq0203_2.
Eckes, T. (2009). On common ground? How raters perceive scoring criteria in oral proficiency testing. In A. Brown & K. Hill (Eds.), Tasks and criteria in performance assessment: Proceedings of the 28th Language Testing Research Colloquium (Vol. 13). Frankfurt: Peter Lang.
Eckes, T. (2011). Introducing many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt: Peter Lang.
Education First. (2014). EF English Proficiency Index. Retrieved from http://media.ef.com/__/~/media/centralefcom/epi/v4/downloads/full-reports/ef-epi2014-english.pdf.
Education First. (2015). EF English Proficiency Index. Retrieved from http://www.ef.co.uk/epi/.
Engh, R. (2011). Vurdering for læring i skolen [Assessment for learning in school]. Kristiansand: Høyskoleforlaget.
Erickson, G. (2014). What is good (about) language assessment? Conference proceedings from early language learning: Theory and practice 2014 (pp. 50-54).
Ericsson, K. A., & Simon, H. (1993). Protocol analysis. Cambridge, MA: MIT Press.
Freeman, D. (1996). "To take them at their word": Language data in the study of teachers' knowledge. Harvard Educational Review, 66(4), 732-762. doi:10.17763/haer.66.4.3511321j38858h69.
Frost, K., Elder, C., & Wigglesworth, G. (2012). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers' oral performances. Language Testing, 29(3), 345-369. doi:10.1177/0265532211424479.
Fulcher, G. (2003). Testing second language speaking. London: Pearson.
Fulcher, G. (2004). Deluded by artifices? The Common European Framework and harmonization. Language Assessment Quarterly, 1(4), 253-266. doi:10.1207/s15434311laq0104_4.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Fulcher, G. (2012). Scoring performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 378-392). Oxford: Routledge.
Fulcher, G. (2014). Philosophy and language testing. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 3, pp. 1433-1451). Chichester, UK: Wiley-Blackwell.
Fulcher, G. (2015). Re-examining language testing: A philosophical and social inquiry. Oxon: Routledge.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment. Oxford: Routledge.
Galaczi, E. (2014). Content analysis. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 3). Chichester: Wiley-Blackwell.
Galloway, T. A., Kirkebøen, L. J., & Rønning, M. (2011). Karakterpraksis i grunnskoler: Sammenheng mellom standpunkt- og eksamenskarakterer [Grading practices in primary and secondary schools: The relationship between overall achievement marks and exam scores]. Rapporter 4/2011. Oslo-Kongsvinger: Statistics Norway.
Gass, S. M., & Mackey, A. (2000). Stimulated recall methodology in second language research. Mahwah, NJ: Lawrence Erlbaum.
Ginther, A. (2013). Assessment of speaking. In C. E. Chapelle (Ed.), The Encyclopedia of applied linguistics (pp. 234-240): Wiley Blackwell.
Gipps, C. (1994). Beyond testing: Towards a theory of educational assessment. London: Falmer.
Gobo, G. (2008). Re-conceptualizing generalization: Old issues in a new frame. In P. Alasuutari, L. Bickman, & J. Brannen (Eds.), The Sage handbook of social research methods (pp. 193-213). London: Sage.
Gomm, R. (2008). Social research methodology: A critical introduction (2nd ed.). Basingstoke: Palgrave Macmillan.
Graddol, D. (2006). English next: Why global English may mean the end of 'English as a Foreign Language'. London: The British Council.
Green, A. (1998). Verbal protocol analysis in language testing research: A handbook. Cambridge: Cambridge University Press.
Green, A. (2014). Exploring language assessment and testing: Language in action. Oxon: Routledge.
Grotjahn, R. (1987). On the methodological basis of introspective methods. In C. Faerch & G. Kasper (Eds.), Introspection in second language research (pp. 54-81). Clevedon, UK: Multilingual Matters.
Gubrium, J. F., & Holstein, J. A. (2003). Active interviewing. In J. F. Gubrium & J. A. Holstein (Eds.), Postmodern interviewing (pp. 67–80). Thousand Oaks: Sage.
Hammersley, M. (1992). What's wrong with ethnography? London: Routledge.
Hammersley, M. (2008a). Questioning qualitative inquiry: Critical essays. Thousand Oaks: Sage.
Hammersley, M. (2008b). Troubles with triangulation. In M. M. Bergman (Ed.), Advances in mixed methods research. Thousand Oaks, CA: Sage.
Hansen, T. (2011). Speaker models and the English classroom: The impact of the intercultural-speaker teaching model in Norway. (Unpublished master's thesis). Østfold University College, Halden, Norway.
Hanson, N. R. (1958). Patterns of discovery: An inquiry into the conceptual foundations of science. Cambridge: University Press.
Harding, L. (2014). Communicative language testing: Current issues and future research. Language Assessment Quarterly, 11(2), 186-197. doi:10.1080/15434303.2014.895829.
Hargreaves, E. (2005). Assessment for learning? Thinking outside the (black) box. Cambridge Journal of Education, 35(2), 213-224. doi:10.1080/03057640500146880.
Harlen, W. (2012). On the relationship between assessment for formative and summative purposes. In J. Gardner (Ed.), Assessment and learning (2nd ed., pp. 87-101). London: Sage.
Hattie, J. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Oxon: Routledge.
Haug, P. (2004). Resultat frå evalueringa av Reform 94 [Results from the evaluation of the 1994 curriculum reform]. Oslo: Norges forskningsråd.
Haugstveit, T. B. (2005). Vurdering som profesjonsfaglig kompetanse – Læreres refleksjoner over egen vurderingspraksis på 5., 6. og 7. trinn [Assessment as professional competence - Teachers' reflections on their own assessment practices in the 5th, 6th and 7th grades]. Norsk Pedagogisk Tidsskrift, 89(6), 417-430.


He, A., & Young, R. (1998). Language proficiency interviews: A discourse approach. In A. He & R. Young (Eds.), Talking and testing (pp. 1-24). Amsterdam: John Benjamins.
Hellekjær, G. O. (2007). Fremmedspråk i norsk næringsliv - engelsk er ikke nok! [Foreign languages in Norwegian trade and industry - English is not enough!]. (Vol. 3). Halden: Fremmedspråksenteret.
Hellekjær, G. O. (2008). A case for improved reading instruction for academic English reading proficiency. Acta Didactica Norge, 2(1), 1-17.
Hellekjær, G. O. (2012). A survey of English use and needs in Norwegian export firms. Hermes - Journal of Language and Communication Studies, 48, 7-18.
Henning, G. (1996). Accounting for nonsystematic error in performance ratings. Language Testing, 13(1), 53-61. doi:10.1177/026553229601300104.
Hertzberg, F. (2003). Arbeid med muntlige ferdigheter [Working with oral skills]. In K. Klette (Ed.), Evaluering av Reform 97: Klasserommets praksisformer etter Reform 97 [An evaluation of the 1997 curriculum reform: Classroom practices after the 1997 curriculum reform] (pp. 137-171). Oslo: Unipub.
Hesse-Biber, S. N., & Leavy, P. (2011). The practice of qualitative research (2nd ed.). Los Angeles, Calif: SAGE.
Hodgson, J., Rønning, W., Skogvold, A. S., & Tomlinson, P. (2010). Vurdering under Kunnskapsløftet: Læreres begrepsforståelse og deres rapporterte og faktiske vurderingspraksis [Assessment in the Knowledge Promotion curriculum reform: Teachers' conceptual understanding and their reported and actual assessment practices]. NF-rapport nr. 17/2010. Nordland Research Institute.
Hookway, C. (2015). Pragmatism. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2015 edition). Retrieved from http://plato.stanford.edu/archives/spr2015/entries/pragmatism/.
Hopfenbeck, T. N. (2014). Testing times: Fra Pisa til nasjonale prøver [Testing times: From Pisa to national tests]. In J. H. Stray & L. Witteck (Eds.), Pedagogikk: En grunnbok. Oslo: Cappelen Damm Akademisk.
Hsieh, C.-N. (2011). Rater effects in ITA testing: ESL teachers' vs. American undergraduates' judgements of accentedness, comprehensibility, and oral proficiency. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 9, 47-74. Retrieved from Spaan Fellowship website: http://www.cambridgemichigan.org/sites/default/files/resources/SpaanPapers/Spaan_V9_Hsieh.pdf.
Hsieh, H.-F., & Shannon, S. E. (2005). Three approaches to qualitative content analysis. Qualitative Health Research, 15(9), 1277-1288. doi:10.1177/1049732305276687.
Hægeland, T., Kirkebøen, L. J., Raaum, O., & Salvanes, K. G. (2005). Familiebakgrunn, skoleressurser og avgangskarakterer i norsk grunnskole [Family background, school resources, and final grades in Norwegian primary and lower secondary schools]. Retrieved from http://www.ssb.no/a/publikasjoner/pdf/sa74/kap-2.pdf.
Høst, H., Seland, I., & Skålholt, A. (2013). Yrkesfagelevers ulike tilpasninger til fagopplæring: En undersøkelse av elever i tre yrkesfaglige utdanningsprogram i videregående skole [Vocational studies students' adjustment to vocational training: An investigation of students in three vocational studies programmes in upper secondary school]. (16/2013). Oslo: NIFU.
Inbar-Lourie, O. (2008). Language assessment culture. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education (2nd ed., Vol. 7, pp. 285-289). New York: Springer.


Isaacs, T. (2014). Assessing pronunciation. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 1, pp. 140-155). Chichester, UK: Wiley-Blackwell.
Isaacs, T., Zara, C., Herbert, G., Coombs, S. J., & Smith, C. (2013). Key concepts in educational assessment. London: SAGE Publications Ltd.
Jacobsen, D. I. (2005). Hvordan gjennomføre undersøkelser?: Innføring i samfunnsvitenskapelig metode [How to carry out investigations?: Introduction to methodology in the social sciences] (2nd ed.). Kristiansand: Høyskoleforlaget.
Jenkins, J. (2000). The phonology of English as an international language: New models, new norms, new goals. Oxford: Oxford University Press.
Jenkins, J. (2007). English as a lingua franca: Attitude and identity. Oxford: Oxford University Press.
Jenkins, J., Cogo, A., & Martin, D. (2011). Review of developments in research into English as a lingua franca. Language Teaching, 44(3), 281-315.
Joe, J. N., Harmes, J. C., & Hickerson, C. A. (2011). Using verbal reports to explore rater perceptual processes in scoring: A mixed methods application to oral communication assessment. Assessment in Education: Principles, Policy & Practice, 18(3), 239-258. doi:10.1080/0969594X.2011.577408.
Jones, N. (2012). Reliability and dependability. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 350-362). Oxon: Routledge.
Kachru, B. B. (1986). The alchemy of English. Oxford: Pergamon Press.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: Praeger and the American Council on Education.
Kane, M. T. (2012). All validity is construct validity. Or is it? Measurement: Interdisciplinary Research and Perspectives, 10(1-2), 66-70. doi:10.1080/15366367.2012.681977.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1-73. doi:10.1111/jedm.12000.
Kang, O. (2008). Ratings of L2 oral performance in English: Relative impact of rater characteristics and acoustic measures of accentedness. Spaan Fellow Working Papers in Second or Foreign Language Assessment, 6, 181-205.
Kim, H. J. (2015). A qualitative analysis of rater behavior on an L2 speaking assessment. Language Assessment Quarterly, 12(3), 239-261. doi:10.1080/15434303.2015.1049353.
Kim, Y.-H. (2009). An investigation into native and non-native teachers' judgments of oral English performance: A mixed methods approach. Language Testing, 26(2), 187-217. doi:10.1177/0265532208101010.
King, N., & Horrocks, C. (2010). Interviews in qualitative research. London: Sage.
Klette, K. (2003). Lærernes klasseromsarbeid: Interaksjons- og arbeidsformer i norske klasserom etter Reform 97 [Teachers' classroom work: Forms of interaction and working methods in Norwegian classrooms after the 1997 curriculum reform]. In K. Klette (Ed.), Evaluering av Reform 97: Klasserommets praksisformer etter Reform 97 [An evaluation of the 1997 curriculum reform: Classroom practices after the 1997 curriculum reform] (pp. 39-76). Oslo: Unipub.
Kleven, T. A. (2008). Validity and validation in quantitative and qualitative research. Nordisk Pedagogik, 28, 219-233.
Krathwohl, D. R. (2002). A revision of Bloom's taxonomy: An overview. Theory into Practice, 41(4), 212-218. doi:10.1207/s15430421tip4104_2.
Krippendorff, K. (2013). Content analysis (3rd ed.). Thousand Oaks: Sage.
Kunnan, A. J. (2004). Regarding language assessment. Language Assessment Quarterly, 1(1), 1-4. doi:10.1207/s15434311laq0101_1.


Kunnan, A. J. (2008). Large-scale language assessments. In E. Shohamy & N. Hornberger (Eds.), Encyclopaedia of language and education (2nd ed., Vol. 7, pp. 135-155). New York, NY: Springer Science.
Kunnan, A. J. (Ed.) (2014). The companion to language assessment. Chichester, UK: Wiley Blackwell.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. doi:10.2307/2529310.
Larsson, S. (2009). A pluralist view of generalization in qualitative research. International Journal of Research & Method in Education, 32(1), 25-38. doi:10.1080/17437270902759931.
Levis, J. M. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39(3), 369-377.
Lumley, T. (2005). Assessing second language writing. Frankfurt am Main: Peter Lang.
Lumley, T., & McNamara, T. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71. doi:10.1177/026553229501200104.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Lysne, A. (2006). Assessment theory and practice of students' outcomes in the Nordic countries. Scandinavian Journal of Educational Research, 50(3), 327-359. doi:10.1080/00313830600743365.
Magnan, S. S. (1988). Grammar and the ACTFL oral proficiency interview: Discussion and data. The Modern Language Journal, 72(3), 266-276. doi:10.1111/j.1540-4781.1988.tb04187.x.
Mann, S. (2016). The research interview: Reflective practice and reflexivity in research processes. London: Palgrave Macmillan.
Marin, L. M., & Halpern, D. F. (2011). Pedagogy for developing critical thinking in adolescents: Explicit instruction produces greatest gains. Thinking Skills and Creativity, 6(1), 1-13. doi:10.1016/j.tsc.2010.08.002.
May, L. A. (2006). An examination of rater orientations on a paired candidate discussion task through stimulated verbal recall. Melbourne Papers in Language Testing, 11(1), 29-51.
May, L. A. (2009). Co-constructed interaction in a paired speaking test: The rater's perspective. Language Testing, 26(3), 397-421. doi:10.1177/0265532209104668.
McNamara, T. (1996). Measuring second language performance. Harlow: Longman.
McNamara, T. (2003). Looking back, looking forward: Rethinking Bachman. Language Testing, 20(4), 466-473. doi:10.1191/0265532203lt268xx.
McNamara, T. F. (1990). Item response theory and the validation of an ESP test for health professionals. Language Testing, 7(1), 52-76. doi:10.1177/026553229000700105.
Meld. St. 16 (2006-2007). (2007). ... og ingen stod igjen: Tidlig innsats for livslang læring [... and no one lingered: Early initiatives for life-long learning]. Oslo: Norwegian Ministry of Education and Research. Retrieved from https://www.regjeringen.no/contentassets/a48dfbadb0bb492a8fb91de475b44c41/no/pdfs/stm200620070016000dddpdfs.pdf.
Meld. St. 20 (2012-2013). (2013). På rett vei: Kvalitet og mangfold i fellesskolen [On the right track: Quality and diversity in an inclusive school system]. Oslo: Norwegian Ministry of Education and Research.
Meld. St. nr. 30. (2004). Kultur for læring [Culture for learning]. Oslo: Norwegian Ministry of Education and Research. Retrieved from https://www.regjeringen.no/contentassets/988cdb018ac24eb0a0cf95943e6cdb61/no/pdfs/stm200320040030000dddpdfs.pdf.


Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: Macmillan.
Messick, S. (1996). Validity and washback in language testing. Retrieved from Princeton, New Jersey: http://onlinelibrary.wiley.com/doi/10.1002/j.23338504.1996.tb01695.x/pdf
Met, M. (1998). Curriculum decision-making in content-based language teaching. In J. Cenoz & F. Genesee (Eds.), Beyond bilingualism: Multilingualism and multilingual education (pp. 35-63). Philadelphia, PA: Multilingual Matters.
Miles, M. B., Huberman, A. M., & Saldaña, J. (2014). Qualitative data analysis: A methods sourcebook. Los Angeles: Sage.
Mislevy, R., Almond, R. G., & Lucas, J. F. (2003). A brief introduction to evidence-centered design. Research Report RR-03-16. Princeton, NJ: Educational Testing Service.
Mitchell, J. C. (1983). Case and situation analysis. Sociological Review, 31(2), 187-211.
Morse, J. M., & Niehaus, L. (2009). Mixed method design: Principles and procedures. Walnut Creek, CA: Left Coast Press.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12. doi:10.3102/0013189x023002005
Moyer, A. (2013). Foreign accent: The phenomenon of non-native speech. Cambridge: Cambridge University Press.
Newton, P. E. (2012). Clarifying the consensus definition of validity. Measurement: Interdisciplinary Research and Perspectives, 10(1-2), 1-29. doi:10.1080/15366367.2012.669666
Newton, P. E., & Shaw, S. D. (2014). Validity in educational and psychological assessment. London: Sage.
North, B. (2014). The CEFR in practice. Cambridge: Cambridge University Press.
Norwegian Directorate for Education and Training [UDIR]. (2009). Certificates and grading scales. Retrieved from http://www.udir.no/Stottemeny/English/Curriculum-inEnglish/About-certificates-and-grading-scales/
Norwegian Directorate for Education and Training [UDIR]. (2014a). Muntlig eksamen [Oral exams]. Retrieved from http://www.udir.no/Vurdering/Eksamen/Muntlig-eksamen/
Norwegian Directorate for Education and Training [UDIR]. (2014b). Rundskriv Udir-02-2014 - Lokalt gitt muntlig eksamen [Circular Udir-02-2014 - Locally administered oral exams]. Oslo: Author. Retrieved from http://www.udir.no/Regelverk/Finn-regelverkfor-opplaring/Finn-regelverk-etter-tema/eksamen/Udir-2-2014-Lokalt-gitt-muntligeksamen/
Norwegian Directorate for Education and Training [UDIR]. (2014c). Tolkingsfellesskap - en vei mot rettferdig vurdering [Shared assessment culture - a path towards fair assessment]. Retrieved from http://www.udir.no/Vurdering-for-laring/rettferdigvurdering/Tolkningsfellesskap/Tolkningsfellesskap/
Norwegian Directorate for Education and Training [UDIR]. (2014d). Veiledning i lokalt arbeid med læreplaner [Guide to the development of locally adapted subject curricula]. Retrieved from http://www.udir.no/Lareplaner/Veiledninger-tillareplaner/Veiledning-i-lokalt-arbeid-med-lareplaner/2-Lareplanverket-forKunnskapsloftet-LK06-og-LK06S/Lokalt-arbeid-med-lareplaner-etter-LK06/
Norwegian Directorate for Education and Training [UDIR]. (2015). Rammeverk for grunnleggende ferdigheter [Framework for basic skills]. Oslo: Author. Retrieved from http://www.udir.no/globalassets/upload/larerplaner/lareplangrupper/rammeverk_grf_2012.pdf


Norwegian Ministry of Education and Research [KD]. (2006/2013). Læreplan i engelsk [English subject curriculum]. Oslo: Author. Retrieved from http://data.udir.no/kl06/ENG1-03.pdf?lang=eng
Norwegian Ministry of Education and Research [KD]. (2006/2015). Forskrift til opplæringslova [Regulations to the Education Act]. Retrieved from https://lovdata.no/dokument/SF/forskrift/2006-06-23-724
Norwegian Ministry of Education and Research [KD]. (2015). The Vocational Education and Training Programme. Retrieved from https://www.regjeringen.no/en/topics/education/innsikt/yrkesfagloftet/id2353804/
Norwegian National Research Ethics Committee. (2006). Ethical guidelines. Retrieved from https://www.etikkom.no/en/
Nusche, D., Earl, L., Maxwell, W., & Shewbridge, C. (2012). OECDs gjennomgang av evaluering og vurdering innen utdanning: Norway [OECD review of evaluation and assessment of education: Norway]. Retrieved from http://www.oecd.org/edu/school/Evaluation-and-Assessment_Norwegian-version.pdf
O'Sullivan, B. (2014). Assessing speaking. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 1, pp. 156-171). Chichester, UK: Wiley-Blackwell.
Onwuegbuzie, A., & Leech, N. (2010). Generalization practices in qualitative research: A mixed methods case study. International Journal of Methodology, 44(5), 881-892. doi:10.1007/s11135-009-9241-z
Orr, M. (2002). The FCE Speaking test: Using rater reports to help interpret test scores. System, 30, 143-154.
Patton, M. Q. (2015). Qualitative research and evaluation methods (4th ed.). Thousand Oaks: Sage.
Polanyi, M. (1966). The tacit dimension. Garden City, New York: Doubleday.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th language research testing colloquium, Cambridge. Cambridge: Cambridge University Press.
Purpura, J. E. (2008). Assessing communicative language ability: Models and their components. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education (Vol. 7, pp. 53-68). New York: Springer Science.
Richards, K. (2009). Trends in qualitative research in language teaching since 2000. Language Teaching, 42(2), 147-180.
Sadler, D. R. (1998). Formative assessment: Revisiting the territory. Assessment in Education: Principles, Policy & Practice, 5(1), 77-84. doi:10.1080/0969595980050104
Sandberg, N., & Aasen, P. (2008). Det nasjonale styringsnivået: Intensjoner, forventninger og vurderinger: Evaluering av Kunnskapsløftet [The national level of governance: Intentions, expectations and evaluations: An evaluation of the Knowledge Promotion curriculum reform]. Retrieved from http://brage.bibsys.no/xmlui/bitstream/handle/11250/284600/NIFUrapport200842.pdf?sequence=1&isAllowed=y
Sandvik, L. V., & Buland, T. (2014). Vurdering i skolen: Utvikling av kompetanse og fellesskap [Assessment in schools: The development of competence and shared practices]. Retrieved from http://www.udir.no/Upload/Forskning/2015/FIVIS%20sluttrapport%20desember%202014.pdf?epslanguage=no
Sasaki, M. (2014). Introspective methods. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 3, pp. 1340-1357). Chichester, UK: Wiley-Blackwell.

Sato, T. (2012). The contribution of test-takers' speech content to scores on an English oral proficiency test. Language Testing, 29(2), 223-241. doi:10.1177/0265532211421162
Seidlhofer, B. (2011). Understanding English as a lingua franca. Oxford: Oxford University Press.
Serafini, F. (2001). Three paradigms of assessment: Measurement, procedure, and inquiry. The Reading Teacher, 54(4), 384-393.
Silverman, D. (1993). Interpreting qualitative data. London: Sage.
Silverman, D. (2011). Interpreting qualitative data: A guide to the principles of qualitative research. London: Sage.
Simensen, A. M. (2010). Fluency: An aim in teaching and a criterion in assessment. Acta Didactica Norge, 4(1). Retrieved from http://adno.no/index.php/adno/article/view/118/137
Simensen, A. M. (2011). Europeiske institusjoners rolle i utviklingen av engelskfaget i norsk skole [The role of European institutions in the development of the English subject in Norwegian schools]. Didaktisk Tidskrift, 20(3), 157-181.
Simensen, A. M. (2014). Skolefaget engelsk: Fra britisk engelsk til mange slags ”engelsker” – og veien videre [The school subject English: From British English to many different types of "Englishes" - and into the future]. Acta Didactica Norge, 8(2), 1-18.
Skjersli, S., & Aamodt, P. O. (1997). 10 Effekter av Reform 94 på sosiale skjevheter i valg mellom allmennfag og yrkesfag [10 effects of the 1994 curriculum reform on social imbalances in students' choices between general studies and vocational studies programmes]. In B. Lødding & K. Tornes (Eds.), Idealer og paradokser: Aspekter ved gjennomføringen av Reform 94 (pp. 256-280). Oslo: Tano Aschehoug.
Snow, M. A., & Katz, A. M. (2014). Assessing language and content. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 1, pp. 230-247). Chichester, UK: Wiley-Blackwell.
Solberg, E. (2010). Et løft for yrkesfagene [A boost for the vocational subjects]. Retrieved from https://ernasolberg.wordpress.com/2010/01/17/et-l%C3%B8ft-for-yrkesfagene/
Solstad, K. J., & Rønning, W. (2003). Likeverdig skole i praksis - Synteserapport [An equal school system in practice - A synthesis report]. Retrieved from Bodø:
Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools. Phi Delta Kappan, 87(4), 324-328.
Stobart, G., & Eggen, T. (2012). High-stakes testing – value, fairness and consequences. Assessment in Education: Principles, Policy & Practice, 19(1), 1-6. doi:10.1080/0969594X.2012.639191
Svenkerud, S., Klette, K., & Hertzberg, F. (2012). Opplæring i muntlige ferdigheter [Teaching and learning oral skills]. Nordic Studies in Education, 32(1).
Tarrou, A.-L. H. (2010). Yrkesfagopplæringen - mellom skole og bedrift [Vocational training - between school and company]. Bedre skole, 2, 67-71.
Taylor, L., & Galaczi, E. (2011). Scoring validity. In L. Taylor (Ed.), Examining speaking: Research and practice in assessing second language speaking (Vol. 30, pp. 171-233). Cambridge: Cambridge University Press.
Throndsen, I., Hopfenbeck, T. N., Lie, S., & Dale, E. L. (2009). Bedre vurdering for læring: Rapport fra "Evaluering av modeller for kjennetegn på måloppnåelse i fag" [Better assessment for learning: Report from "The evaluation of models for assessment criteria for goal achievements in subjects"]. Retrieved from Oslo: http://www.udir.no/Upload/Forskning/5/Bedre_vurderingspraksis_ILS_rapport.pdf?epslanguage=no
Timmis, I. (2002). Native-speaker norms and international English: A classroom view. ELT Journal, 56(3), 240-249.

Udir. (2010). Rundskriv Udir-1-2010: Individuell vurdering i grunnskolen og videregående opplæring etter forskrift til opplæringslova kapittel 3 [Circular Udir-1-2010: Individual assessment in primary and secondary education and training pursuant to the regulations to the Education Act, chapter 3]. Oslo: Norwegian Directorate for Education and Training. Retrieved from http://www.udir.no/Upload/Rundskriv/2010/5/Udir_1_2010_Individuell_vurdering_i_grunnskolen_og_videregaende_opplaring.pdf
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287. doi:10.1177/026553229801500205
Weir, C. (2005). Language testing and validation: An evidence-based approach. Basingstoke: Palgrave Macmillan.
Wiliam, D. (2008). Quality in assessment. In S. Swaffield (Ed.), Unlocking assessment: Understanding for reflection and application (pp. 123-137). Oxford: Routledge.
Yildiz, L. M. (2011). English VG1 level oral examinations: How are they designed, conducted and assessed? (Unpublished master's thesis). University of Oslo, Oslo, Norway. Retrieved from https://www.duo.uio.no/bitstream/handle/10852/32421/YildizMaster.pdf?sequence=2&isAllowed=y
Yin, R. K. (2016). Qualitative research from start to finish (2nd ed.). New York: The Guilford Press.


Appendix 1

English subject curriculum – 2013 (abridged)

Dette er en oversettelse av den fastsatte læreplanteksten. Læreplanen er fastsatt på bokmål. [This is a translation of the adopted curriculum text. The curriculum was adopted in Norwegian (Bokmål).]

Established as a Regulation by the Ministry of Education and Research on 21 June 2013
Valid from 01.08.2013

Purpose

English is a universal language. When we meet people from other countries, at home or abroad, we need English for communication. English is used in films, literature, songs, sports, trade, products, science and technology, and through these areas many English words and expressions have found their way into our own languages. When we want information on something of private or professional interest, we often search for it in English. In addition, English is increasingly used in education and as a working language in many companies.

To succeed in a world where English is used for international communication, it is necessary to be able to use the English language and to have knowledge of how it is used in different contexts. Thus, we need to develop a vocabulary and skills in using the systems of the English language, its phonology, orthography, grammar and principles for sentence and text construction and to be able to adapt the language to different topics and communication situations. This involves being able to distinguish between oral (spoken) and textual (written) styles and formal and informal styles. Moreover, when using the language for communication we must also be able to take cultural norms and conventions into consideration.

Language learning occurs while encountering a diversity of texts, where the concept of text is used in the broadest sense of the word. It involves oral and written representations in different combinations and a range of oral and written texts from digital media. When we are aware of the strategies that are used to learn a language, and strategies that help us to understand and to be understood, the acquisition of knowledge and skills becomes easier and more meaningful. It is also important to establish our own goals for learning, to determine how these can be reached and to assess the way we use the language. Learning English will contribute to multilingualism and can be an important part of our personal development.

In addition to language learning, the subject of English shall contribute to providing insight into the way people live and different cultures where English is the primary or the official language. The subject of English shall provide insight into how English is used as an international means of communication. Learning about the English-speaking world and the increasing use of English in different international contexts will provide a good basis for understanding the world around us and how English developed into a world language. Literary texts in English can instil a lifelong joy of reading and a deeper understanding of others and of oneself. Oral, written and digital texts, films, music and other cultural forms of expression can further inspire personal expressions and creativity.


Thus, English as a school subject is both a tool and a way of gaining knowledge and personal insight. It will enable the pupils to communicate with others on personal, social, literary and interdisciplinary topics. The subject shall help build up general language proficiency through listening, speaking, reading and writing, and provide the opportunity to acquire information and specialised knowledge through the English language. Development of communicative language skills and cultural insight can promote greater interaction, understanding and respect between persons with different cultural backgrounds. Thus, language and cultural competence promote the general education perspective and strengthen democratic involvement and co-citizenship.

Main subject areas

The subject of English is structured into main subject areas with competence aims. The main subject areas supplement each other and must be considered together. The subject of English is a common core subject for all the upper secondary education programmes. Learning in this subject shall therefore be made as relevant as possible for pupils by adapting each subject to the different education programmes. English has competence aims after the second, fourth, seventh and tenth years in primary and lower secondary school and after the first year in the programmes for general studies (Vg1) or after the second year of the vocational education programmes (Vg2).

Overview of main subject areas:

Year: 1–10, Vg1, Vg2 (vocational education programme)
Main subject areas: Language learning, Oral communication, Written communication, Culture, society and literature

Language learning The main subject area Language learning focuses on what is involved in learning a new language and seeing relationships between English, one's native language and other languages. It covers knowledge about the language, language usage and insight into one's own language learning. The ability to evaluate own language usage and learning needs and to select suitable strategies and working methods is useful when learning and using the English language.

Oral communication The main subject area Oral communication deals with understanding and using the English language by listening, speaking, conversing and applying suitable communication strategies. The main subject area involves developing a vocabulary and using idiomatic structures and grammatical patterns when speaking and conversing. It also covers learning to speak clearly and to use the correct intonation.


The main subject area involves listening to, understanding and using English in different situations where communication needs to be done orally. General politeness and awareness of social norms in different situations are also important elements. This also involves adapting the language to purposeful objectives and adapting the language to the recipient, i.e. by distinguishing between formal and informal spoken language. The use of different media and resources and the development of a linguistic repertoire across subjects and topics are also key elements of the main subject area.

Written communication The main subject area Written communication deals with understanding and using English language through reading, writing and using suitable reading and writing strategies. The main subject area includes reading a variety of different texts in English to stimulate the joy of reading, to experience greater understanding and to acquire knowledge. This involves reading a large quantity of literature to promote language understanding and competence in the use of text. Reading different types of texts can lay the foundation for personal growth, maturation and creativity and provide the inspiration necessary to create texts. The main subject area includes writing different texts in English in different situations where written communication is necessary to stimulate the joy of writing, to experience greater understanding and to acquire knowledge. This also involves adapting the language to purposeful objectives and to the recipient, i.e. by distinguishing between formal and informal written language. The main subject area involves developing a vocabulary and using orthography, idiomatic structures and grammatical patterns when writing. It also covers creating structure, coherence and concise meaning in texts. The use of different media and resources and the development of a linguistic repertoire across subjects and topics are also key elements of the main subject area.

Culture, society and literature The main subject area Culture, society and literature focuses on cultural understanding in a broad sense. It is based on the English-speaking countries and covers key topics connected to social issues, literature and other cultural expressions. This main area also involves developing knowledge about English as a world language with many areas of use. The main subject area involves working with and discussing expository texts, literary texts and cultural forms of expression from different media. This is essential to develop knowledge about, understanding of and respect for the lives and cultures of other people.

Basic skills Basic skills are integrated in the competence aims where they contribute to the development of competence in the subject, while also being part of this competence. In the subject of English, the basic skills are understood as follows: Oral skills in English means being able to listen, speak and interact using the English language. It means evaluating and adapting ways of expression to the purpose of the conversation, the recipient and the situation. This further involves learning about social


conventions and customs in English-speaking countries and in international contexts. The development of oral skills in English involves using oral language in gradually using more precise and nuanced language in conversation and in other kinds of oral communication. It also involves listening to, understanding and discussing topics and issues to acquire more specialised knowledge. This also involves being able to understand variations in spoken English from different parts of the world. Being able to express oneself in writing in English means being able to express ideas and opinions in an understandable and purposeful manner using written English. It means planning, formulating and working with texts that communicates and that are well structured and coherent. Writing is also a tool for language learning. The development of writing proficiency in English involves learning orthography and developing a more extensive repertoire of English words and linguistic structures. Furthermore, it involves developing versatile competence in writing different kinds of generalised, literary and technical texts in English using informal and formal language that is suited to the objective and recipient. Being able to read in English means the ability to create meaning by reading different types of text. It means reading English language texts to understand, reflect on and acquire insight and knowledge across cultural borders and within specific fields of study. This further involves preparing and working with reading English texts for different reasons and of varying lengths and complexities. The development of reading proficiency in English implies using reading strategies that are suited to the objective by reading texts that are advancingly more demanding. Furthermore, it involves reading English texts fluently and to understand, explore, discuss, learn from and to reflect upon different types of information. Numeracy in English means being able to use relevant mathematical concepts in English in different situations. This involves familiarity with units of measure used in Englishspeaking countries and to understand and to communicate in figures, graphic representations, tables and statistics in English. The development of numeracy in English involves using figures and calculations to develop a repertoire of mathematical terms in English related to daily life and general and technical fields. Digital skills in English means being able to use a varied selection of digital tools, media and resources to assist in language learning, to communicate in English and to acquire relevant knowledge in the subject of English. The use of digital resources provides opportunities to experience English texts in authentic situations, meaning natural and unadapted situations. The development of digital skills involves gathering and processing information to create different kinds of text. Formal requirements in digital texts means that effects, images, tables, headlines and bullet points are compiled to emphasise and communicate a message. This further involves using digital sources in written texts and oral communication and having a critical and independent attitude to the use of sources. Digital skills involve developing knowledge about copyright and protection of personal privacy through verifiable references to sources.

Competence aims

Competence aims after Vg1 – programmes for general studies and Vg2 – vocational education programmes

Language learning
The aims of the studies are to enable pupils to

- evaluate and use different situations, working methods and learning strategies to further develop one’s English-language skills
- evaluate own progress in learning English
- evaluate different digital resources and other aids critically and independently, and use them in own language learning

Oral communication
The aims of the studies are to enable pupils to

- evaluate and use suitable listening and speaking strategies adapted for the purpose and the situation
- understand and use a wide general vocabulary and an academic vocabulary related to his/her own education programme
- understand the main content and details of different types of oral texts about general and academic topics related to one’s education programme
- listen to and understand social and geographic variations of English from authentic situations
- express oneself fluently and coherently in a detailed and precise manner suited to the purpose and situation
- introduce, maintain and terminate conversations and discussions about general and academic topics related to one’s education programme
- use patterns for pronunciation, intonation, word inflection and various types of sentences in communication
- interpret and use technical and mathematical information in communication

Written communication
The aims of the studies are to enable pupils to

- evaluate and use suitable reading and writing strategies adapted for the purpose and type of text
- understand and use an extensive general vocabulary and an academic vocabulary related to one’s education programme
- understand the main content and details in texts of varying length about different topics
- read to acquire knowledge in a particular subject from one’s education programme
- use own notes to write texts related to one’s education programme
- write different types of texts with structure and coherence suited to the purpose and situation
- use patterns for orthography, word inflection and varied sentence and text construction to produce texts
- produce different kinds of texts suited to formal digital requirements for different digital media
- evaluate different sources and use contents from sources in an independent, critical and verifiable manner

Culture, society and literature
The aims of the studies are to enable pupils to

- discuss and elaborate on culture and social conditions in several English-speaking countries
- present and discuss current news items from English language sources
- discuss and elaborate on the growth of English as a universal language
- discuss and elaborate on different types of English language literary texts from different parts of the world
- discuss and elaborate on English language films and other forms of cultural expressions from different media
- discuss and elaborate on texts by and about indigenous peoples in English-speaking countries
- select an in-depth study topic within one’s education programme and present this

Assessment

Provisions for final assessment:

Overall achievement assessment
Year 10: The pupils shall have one overall achievement grade for written work and one overall achievement grade for oral performance.
Vg1 Programme for General Studies / Vg2 Vocational Education Programme: The pupils shall have one overall achievement grade.

Examinations for pupils
Year 10: The pupils may be selected for a written examination. The written examination is prepared and graded centrally. The pupils may also be selected for an oral examination. The oral examination is prepared and graded locally.
Vg1 programme for general studies / Vg2 vocational education programme: The pupils may be selected for a written examination. The written examination is prepared and graded centrally. The pupils may also be selected for an oral examination. The oral examination is prepared and graded locally. The examination covers the entire subject (140 teaching hours).


Appendix 2

English subject curriculum – 2006 (abridged)

Competence aims after Vg1 – programmes for general studies and Vg2 – vocational education programmes

Language learning
The aims are that the pupil shall be able to

- exploit and assess various situations, working methods and strategies for learning English
- describe and evaluate the effects of different forms of verbal expression
- assess and comment on his/her progress in learning English
- use a wide selection of digital and other aids independently, including monolingual dictionaries

Communication
The aims are that the pupil shall be able to

- understand and use a wide general vocabulary and an academic vocabulary related to his/her own education programme
- understand oral and written presentations about general and specialized themes related to his/her own education programme
- express himself/herself in writing and orally in a varied, differentiated and precise manner, with good progression and coherence
- select and use appropriate reading and listening strategies to locate information in oral and written texts
- select and use appropriate writing and speaking strategies that are adapted to a purpose, situation and genre
- take the initiative to begin, end and keep a conversation going
- read texts from different genres and with different objectives
- write formal and informal texts with good writing structure and coherence based on themes that interest him/her and are important for society
- read and write texts related to his/her own education programme
- select and use content from different sources independently, critically and responsibly
- use technical and mathematical information in communication
- produce composite texts using digital media
- select an in-depth study topic within his/her own education programme and present this to the other pupils

Culture, society and literature
The aims are that the pupil shall be able to

- discuss social and cultural conditions and values from a number of English-speaking countries
- present and discuss international news topics and current events
- give an account of the use of English as a universal world language
- discuss and elaborate on English texts from a selection of different genres, poems, short stories, novels, films and theatre plays from different epochs and parts of the world
- discuss literature by and about indigenous peoples in the English-speaking world


Appendix 3

English subject curriculum (2006 – Norwegian version)27

Etter Vg1 – studieforberedende utdanningsprogram og Vg2 – yrkesfaglige utdanningsprogram [After Vg1 – programmes for general studies and Vg2 – vocational education programmes]

Språklæring
Mål for opplæringen er at eleven skal kunne

- utnytte og vurdere ulike situasjoner, arbeidsmåter og strategier for å lære seg engelsk
- drøfte likheter og forskjeller mellom engelsk og andre fremmedspråk og utnytte dette i egen språklæring
- bruke relevant og presis terminologi for å beskrive språkets formverk og strukturer
- beskrive og vurdere egen framgang i arbeidet med å lære engelsk
- bruke et bredt utvalg digitale og andre hjelpemidler, inkludert ettspråklige ordbøker, på en selvstendig måte

Kommunikasjon
Mål for opplæringen er at eleven skal kunne

- beherske et bredt ordforråd
- bruke språkets formverk og tekststrukturer i skriftlige og muntlige framstillinger
- forstå lengre framstillinger i skrift og tale om ulike personlige, litterære, tverrfaglige og samfunnsmessige emner
- trekke ut vesentlige opplysninger fra muntlige og skriftlige tekster og drøfte forfatterens synspunkt og holdninger
- uttrykke seg skriftlig og muntlig på en nyansert og situasjonstilpasset måte, med flyt, presisjon og sammenheng
- velge hensiktsmessige lytte-, tale-, lese- og skrivestrategier tilpasset formål, situasjon og sjanger
- ta initiativ til å begynne, avslutte og holde en samtale i gang
- lese formelle og uformelle tekster i ulike sjangere og med ulike formål
- skrive formelle og uformelle tekster med god struktur og sammenheng om personlige, tverrfaglige og samfunnsmessige temaer
- velge og bruke innhold fra ulike kilder på en selvstendig, kritisk og ansvarlig måte
- bruke teknisk og matematisk informasjon i kommunikasjon
- produsere tekster med sammensatt innhold i digitale medier
- velge et tverrfaglig fordypningsemne innenfor eget programområde og presentere dette

27 The full version of this subject curriculum is retrievable from http://www.udir.no/kl06/eng1-02.


Kultur, samfunn og litteratur
Mål for opplæringen er at eleven skal kunne

- drøfte sosiale forhold, samfunnsforhold og verdier i ulike kulturer i flere engelskspråklige land
- presentere og diskutere internasjonale nyheter og aktuelle hendelser
- gjøre rede for hovedtrekk i utvikling av engelsk fra et anglosaksisk språk til et internasjonalt verdensspråk
- analysere og drøfte en film og et representativt utvalg engelskspråklige litterære tekster fra sjangrene dikt, novelle, roman og skuespill
- drøfte et utvalg engelskspråklige litterære tekster fra ulike deler av verden og ulike tidsepoker, fra 1500-tallet til moderne tid
- drøfte litteratur av og om urfolk i den engelskspråklige verden
- lage og vurdere egne muntlige eller skriftlige tekster inspirert av litteratur og kunst


Appendix 4

Literature search for the extended abstract

Database / Search term(s)

Web of Science

- Oral language assessment rating √ - Oral language assessment rater √ - Oral language test rating √ - Oral language test rater √ - Spoken language assessment rating √ - Spoken language assessment rater √ - Spoken L2 test rating √ - Spoken L2 test rater √ - L2 language assessment rating √ - L2 languge assessment rater √ - Oral language rater effects √ - Oral language assessment rater effects √ - Oral language rater orientations √ - Rater cognition language assessment √ - Rater cognition language test √ - Rater orientation language assessment √ - English exam teacher assessment √ - Teacher summative language assessment √ - Classroom-based language assessment √ - Teacher-based language assessment √ - Language assessment speaking rater √ - Oral language proficiency rater √ - Rater variability √ - content-based assessment √ - curriculum-based assessment √ - assessment AND “topical knowledge” √

Academic search premier

- “language assessment” AND rating √ - “language assessment” AND rater √ - “language test” AND rating √ - “language testing” AND rating √ - “language testing” AND rater √ - “language assessment” AND spoken AND rating √ - “language assessment” AND oral AND rating √ - “language assessment” AND “rater cognition” √ - “language assessment” AND “rater orientation” √

- “English exam” AND teacher AND assessment √ - “language assessment” AND teacher AND summative √ - summative AND “language assessment” √ - “classroom-based” AND “language assessment” √

ERIC

- “language assessment” AND oral AND rating √ - “language assessment” AND Oral AND rater √ - “language test” AND oral AND rating √ - “language test” AND oral AND rater √ - “language testing” AND oral AND rater √ - “language assessment” AND spoken AND rating √ - “language assessment” AND spoken AND rater √ - “rater cognition” AND “language assessment”√ - “rater cognition” AND “language assessment”√ - “English exam” AND teacher AND assessment √ - “language assessment” AND teacher AND summative √ - “language assessment” AND “rater effects” √ - “language testing” AND “rater effects” √

Language Testing

- oral AND rating √ - “rater cognition” AND oral √ - “rating process” √ - summative √ - “curriculum-based” √ - “rater variability” √

Language Assessment Quarterly

- oral AND rating √ - “rater cognition” AND oral √ - “rating process” √ - summative √ - “curriculum-based” √ - “rater variability” √

Language Teaching

- “language assessment” AND “subject knowledge” √ - “language assessment” AND “subject content” - “curriculum-based assessment” - “assessing topical knowledge”

Idunn

- muntlig AND vurdering √ - “vurdering av muntlige” √ - “muntlig eksamen” √ - “summative vurdering” AND muntlig √ - vurdering AND muntlig AND ferdigheter √

Google scholar

“language assessment” AND oral AND rating √ - “language assessment” AND Oral AND rater √ - “language test” AND oral AND rating √ - “language test” AND oral AND rater √ - “language testing” AND oral AND rater √ - “language assessment” AND spoken AND rating √ - “language assessment” AND spoken AND rater √ - “rater cognition” AND “language assessment”√ - “rater cognition” AND “language assessment”√ - “English exam” AND teacher AND assessment √ - “language assessment” AND teacher AND summative √ - “language assessment” AND “rater effects” √ - “language testing” AND “rater effects” √

The Directorate for Education and Training (www.udir.no)

- muntlig AND vurdering √ - “vurdering av muntlige” √ - “muntlig eksamen” √ - “summative vurdering” AND muntlig √ - vurdering AND muntlig AND ferdigheter √

Other sources:
- Companion to Language Assessment, 2014
- Encyclopedia of Applied Linguistics, 2013
- Dobson, Eggen, and Smith (2009)
- Fulcher (2003)
- Fulcher (2010)
- Fulcher and Davidson (2007)
- Anthony Green (2014)
- Luoma (2004)
- T. McNamara (1996)
- Throndsen et al. (2009)
- Master's and PhD theses

Note: The list of sources is not exhaustive. In the earlier stages of the PhD project, research studies were also located elsewhere, such as on the web pages of the Norwegian National Centre for Foreign Languages in Education (www.fremmedspraksenteret.no), in the public media, and in other digital and printed literature.


References

Dobson, S., Eggen, A. B., & Smith, K. (2009). Vurdering, prinsipper og praksis: Nye perspektiver på elev- og læringsvurdering [Assessment, principles and practice: New perspectives on student and learning assessment]. Oslo: Gyldendal Akademisk.
Fulcher, G. (2003). Testing second language speaking. London: Pearson.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment. Oxford: Routledge.
Green, A. (2014). Exploring language assessment and testing: Language in action. Oxon: Routledge.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
McNamara, T. (1996). Measuring second language performance. Harlow: Longman.
Throndsen, I., Hopfenbeck, T. N., Lie, S., & Dale, E. L. (2009). Bedre vurdering for læring: Rapport fra «Evaluering av modeller for kjennetegn på måloppnåelse i fag» [Better assessment for learning: Report from «The evaluation of models for assessment criteria for goal achievements in subjects»]. Oslo: Universitetet i Oslo.


Appendix 5

November 2011

Pilot questionnaire28 Assessing oral English – GSP1/VSP2 level This questionnaire is designed to gather data for a master’s degree project and a PhD-project on the assessment of oral English at the GSP1/VSP2 level. It will give valuable information on how teacher raters assess oral English performance. You have just seen a video-clip of two students in a mock oral English exam situation. How would you score their performances?

Grade: _______

Please answer the questionnaire by indicating to what extent you agree with the different statements listed. Note that the statements concern the assessment of oral English performance generally, and not the specific performances of the two students you have just seen. The questionnaire has three sub-sections: one concerning the oral English exam generally, one concerning the presentation part of the exam and one concerning the discussion part (between the candidate and the interlocutor). We would also be very pleased if you could comment on the appropriateness of the different statements in the questionnaire.

1. Background questions

1.1 Have you ever examined your own students in an oral English exam?  Yes ❑  No ❑
1.2 Have you ever been interlocutor at the GSP1/VSP2 level?  Yes ❑  No ❑
1.3 Have you ever been external examiner at the GSP1/VSP2 level?  Yes ❑  No ❑
1.4 Are you only employed in the VSP studies programme?  Yes ❑  No ❑
1.5 Are you only employed in the GSP study programme?  Yes ❑  No ❑
1.6 Are you employed both in the GSP and the VSP studies programmes?  Yes ❑  No ❑
1.7 Have you ever had any formal rater training?  Yes ❑  No ❑

28 The questionnaire has been translated from Norwegian.


1.8 How often have you been interlocutor / external examiner in the Oral English exam at the GSP1/VSP2 level? Never ❑

1-2 times ❑

3-5 times ❑

6 times or more ❑

2. Assessing performance in the oral English exam generally

2.1 Language aspects
(Scale: 1 = Completely disagree, 5 = Completely agree)

2.1.1 Good vocabulary is important in order to achieve a top grade.
2.1.2 Clear pronunciation is important in order to achieve a top grade.
2.1.3 Good “native speaker” pronunciation is important in order to achieve a top grade.
2.1.4 Good fluency is important in order to achieve a top grade (i.e. coherent speech without too much hesitation and unnatural pauses).
2.1.5 The student must adapt his/her language to the situation in order to achieve a top grade (i.e. language use which corresponds to the genre conventions for a formal setting).
2.1.6 Grammatically correct language use is important in order to achieve a top grade.
2.1.7 The use of cohesive ties is important in order to achieve a top grade.

2.1.8 COMMENTS / OTHER IMPORTANT ASPECTS: _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________

2.1.9 Would you say that any of these aspects (criteria) are more important in the scoring of performance than others? Yes ❑

No ❑

Don’t know ❑


If yes, which three criteria (vocabulary, clear pronunciation, fluency etc.) in section 2.1 do you consider the most important (1 meaning most important and 3 meaning least important)? 1. ______________________________ 2. ______________________________ 3. ______________________________

3. The presentation part of the exam

3.1 Content
(Scale: 1 = Completely disagree, 5 = Completely agree)

3.1.1 Relevance and range of content is important in order to achieve a top grade.
3.1.2 A creative presentation design is important in order to achieve a top grade.
3.1.3 The ability to reflect on the material presented is important in order to achieve a top grade.
3.1.4 Good use of sources is important in order to achieve a top grade.
3.1.5 Candidates who have obviously prepared well (without this being necessarily reflected in the performance) should be credited for this.

3.1.6 COMMENTS / OTHER IMPORTANT ASPECTS: _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________


3.2 Visual aids
(Scale: 1 = Completely disagree, 5 = Completely agree)

3.2.1 Good use of visual aids is important in order to achieve a top grade.
3.2.2 The use of technically advanced aids (e.g. using a number of advanced features of the PowerPoint program) will affect the score in a positive way.
3.2.3 Visual aids (PowerPoint etc.) containing a number of language errors will mark the student down considerably.
3.2.4 The ability to paraphrase and to elaborate on, for example, the bullet points in the PowerPoint slides is important for a top score.
3.2.5 Variation in presentation modes (i.e. swapping between different media, such as text, pictures etc.) is important for a top score.

COMMENTS / OTHER IMPORTANT ASPECTS _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________

3.3 Presentation skills
(Scale: 1 = Completely disagree, 5 = Completely agree)

3.3.1 Good fluency, pace and rhythm is important for a top grade.
3.3.2 The ability to free oneself from the manuscript is important for a top grade.
3.3.3 To perform in a creative way is important for a top grade.
3.3.4 To speak sufficiently loudly and clearly is important for a top grade.
3.3.5 Appropriate body language (including good eye contact) is important for a top grade.
3.3.6 Good presentation structure is important for a top grade.
3.3.7 Candidates who do not show signs of nervousness during the presentation should be credited for this.
3.3.8 Candidates who display a high level of engagement during the presentation should be credited for this.
3.3.9 Candidates who have good digital skills should be credited for this.


3.3.10 COMMENTS / OTHER IMPORTANT ASPECTS: _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________

3.3.11 Would you say that any of these aspects (criteria) are more important in the scoring of performance than others? Yes ❑

No ❑

Don’t know ❑

If yes, which three criteria (vocabulary, clear pronunciation, fluency etc.) in section 2.1 do you consider the most important (1 meaning most important and 3 meaning least important)? 1. ______________________________ 2. ______________________________ 3. ______________________________

4. Discussion part

4.1 Content
(Scale: 1 = Completely disagree, 5 = Completely agree)

4.1.1 The ability to retell facts is important for a top grade.
4.1.2 The ability to reflect on facts is important for a top grade.
4.1.3 The ability to draw on syllabus texts is important for a top grade.
4.1.4 Relevant subject matter content is important for a top grade.

4.1.5 COMMENTS / OTHER IMPORTANT ASPECTS: _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _________________________________________________________________________________


4.2 Communicative interaction
(Scale: 1 = Completely disagree, 5 = Completely agree)

4.2.1 The ability to listen well is important in order to achieve a top grade.
4.2.2 Candidates who take the initiative to talk and who show willingness to participate in the conversation should be credited for this.
4.2.3 If a candidate gives very little response, this will negatively affect the score.

4.2.4 COMMENTS / OTHER IMPORTANT ASPECTS:
_______________________________________________________________________________________________________________
_______________________________________________________________________________________________________________
_________________________________________________________________________________

4.2.5 Would you say that any of these aspects (criteria) are more important in the scoring of performance than others? Yes ❑

No ❑

Don’t know ❑

If yes, which three criteria (vocabulary, clear pronunciation, fluency etc.) in section 2.1 do you consider the most important (1 meaning most important and 3 meaning least important)? 1. ______________________________ 2. ______________________________ 3. ______________________________

Do you have other comments on the questionnaire? _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________

Thanks for your time!


Appendix 6

Pilot interview guide29

The assessment of oral EFL at the GSP1/VSP2 oral English exam

1. BACKGROUND:
1.1 Age:
1.2 L1:
1.3 Education:
1.4 Experience (number of years at the upper secondary level):
1.5 Experience as rater at the oral exam (from GSP1/VSP2):
1.6 Study programme affiliation: General studies ___  Vocational studies ___
1.7 Have you worked outside of your county?
1.8 Have you attended rater training?

2. How well would you say that your school works with the identification of criteria for assessment in English (on the basis of the subject curriculum)?
3. Do you have a common written rating scale for the assessment of oral English (developed either at your school or in your county)?
4. Which criteria do you apply generally when rating performance in the oral English exam?
5. Which of these criteria are the most important, would you say?
6. To what extent is correctness in pronunciation important?
7. Do you assess holistically or analytically?
8. Some teachers say that they are more lenient when scoring students in the vocational studies programmes compared with students in the general studies programmes. What is your comment on that?

29

The questions have been translated from Norwegian.


Appendix 7 Henrik Bøhn Avdeling for økonomi, språk og samfunnsfag, Remmen, 1757 Halden Tlf.: 69 21 52 89

Forespørsel til elever om deltakelse i forskningsprosjekt vedrørende muntlig eksamen i engelsk

Undertegnede jobber for tiden med et forskningsprosjekt om vurdering av muntlig engelsk i videregående skole. Prosjektets formål er å se på hvordan muntlig eksamen i engelskfaget gjennomføres på forskjellige skoler og hvordan sensorer vurderer de enkelte kandidatene.

For å kunne dokumentere dette nærmere ønsker jeg å komme i kontakt med elever som er villige til å la seg filme under muntlig eksamen – dersom de skulle bli trukket ut i engelsk. Det legges opp til at filmingen gjøres så diskret som mulig slik at ikke elevens prestasjon under eksamen skal bli påvirket. Undertegnede vil ikke være til stede; kun et kamera vil bli satt opp. Alle opplysninger vil bli behandlet konfidensielt og personlige opplysninger vil ikke bli lagret i noe register. Videomaterialet skal brukes i forskningsøyemed, dvs. som utgangspunkt for forskning på sensorers karaktersetting, og til sensorskolering, dvs. for at sensorer skal kunne trenes i å gi best mulig vurdering. Du vil ved å si ja til dette, kunne gi et viktig bidrag til vurderingsforskningen i Norge. Dersom du skulle si ja til dette, gjøres det oppmerksom på at du når som helst – både før, under og etter eksamen – kan trekke deg fra deltakelse i prosjektet. Dersom du er under 18 år, er det også nødvendig med bekreftelse fra en av dine foresatte.

Med vennlig hilsen Henrik Bøhn --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Ja, jeg gir tillatelse til filming under muntlig eksamen i engelsk dersom jeg skulle bli trukket ut.

____________________________________ Navn

__________________________________ Skole

____________________________________ Dato

__________________________________ Underskrift

___________________________________ Dato

___________________________________ Foresattes underskrift (hvis under 18 år)


Forespørsel til lærere om deltakelse i forskningsprosjekt vedrørende muntlig eksamen i engelsk Undertegnede jobber for tiden med et forskningsprosjekt om vurdering av muntlig engelsk i videregående skole. Prosjektets formål er å se på hvordan muntlig eksamen i engelskfaget gjennomføres på forskjellige skoler og hvordan sensorer vurderer de enkelte kandidatene.

For å kunne dokumentere dette nærmere ønsker jeg å komme i kontakt med elever som er villige til å la seg filme under muntlig eksamen – dersom de skulle bli trukket ut i engelsk. Jeg er også avhengig av at eksaminatorer/sensorer sier seg villige til å delta. Det legges opp til at filmingen gjøres så diskret som mulig slik at ikke elevenes prestasjon under eksamen skal bli påvirket. Undertegnede vil ikke være til stede; kun et kamera vil bli satt opp. Eksaminator/sensor vil ikke bli filmet, men eksaminators/sensors spørsmål vil bli kunne høres på lydsporet. Alle opplysninger vil bli behandlet konfidensielt og verken elevenes eller eksaminatorenes/sensorenes navn eller andre personlige opplysninger vil bli lagret i noe register. Videomaterialet skal brukes i forskningsøyemed, dvs. som utgangspunkt for forskning på sensorers karaktersetting, og til sensorskolering, dvs. for at sensorer skal kunne trenes i å gi best mulig vurdering. I tillegg til selve filmingen, ønsker jeg å komme i kontakt med eksaminatorer/sensorer som er villige til å la seg intervjue – alternativt svare på et spørreskjema – om hvordan karakterene har blitt satt. Dersom du skulle si ja til dette, gjøres det oppmerksom på at du når som helst – både før, under og etter eksamen – vil kunne trekke deg fra deltakelse i prosjektet. Din deltakelse i et slikt prosjekt vil kunne være et viktig bidrag til lærings- og vurderingsforskningen.

Med vennlig hilsen Henrik Bøhn

--------------------------------------------------------------------------------------------------------------------------Ja, jeg gir tillatelse til filming under muntlig eksamen i engelsk dersom elever fra min klasse skulle bli trukket ut og jeg er eksaminator. Ja, jeg gir tillatelse til filming under muntlig eksamen dersom jeg er oppnevnt som ekstern sensor. Ja, jeg sier meg villig til å delta i intervju.

Ja, jeg sier meg villig til å svare på et spørreskjema.

__________________________________ Navn

_____________________________________ Dato og underskrift


Appendix 8


Part II Articles





Article

Assessing Spoken EFL Without a Common Rating Scale: Norwegian EFL Teachers’ Conceptions of Construct

SAGE Open, October-December 2015: 1–12. © The Author(s) 2015. DOI: 10.1177/2158244015621956. sgo.sagepub.com

Henrik Bøhn1,2

1University of Oslo, Oslo, Norway
2Østfold University College, Halden, Norway

Corresponding Author:
Henrik Bøhn, Department of Teacher Education and School Research, University of Oslo, P.O. Box 1099, Blindern, Oslo 0317, Norway.
Email: [email protected]

Abstract
This study investigated teacher cognition and behavior in a high-stakes, English as a Foreign Language (EFL) school context where no common rating scale exists. 24 EFL teachers at the upper secondary level in Norway were asked to rate the performance of a student taking her oral English exam and to give an account of what kind of performance aspects they pay attention to in the rating process. The study showed that while the raters had the same general ideas of the constructs to be assessed, there were differences in how they perceived the relative importance of these constructs, particularly as regards topical knowledge. The study has implications for language teaching and assessment practices at the intermediate to upper-intermediate levels (Common European Framework of Reference, level B1/B2), particularly with regard to the role of topical knowledge.

Keywords
language assessment, English as a Foreign Language, spoken L2, oral English exam, constructs, criteria

Introduction

The question of constructs, or what is to be tested, is crucial in language assessment. Constructs are typically operationalized in written rating scales (Fulcher, 2012; Luoma, 2004), which are usually provided for raters in high-stakes tests. In the Norwegian educational system, however, there are no national requirements for the provision of common rating scales in the assessment of oral English at the upper secondary level. A general framework exists in the form of national legislation, general directives, and a national curriculum, but the operationalization of the constructs is left to the local level, which in many cases means the individual teachers.

A number of studies have shown that raters pay attention to different aspects of performance when rating spoken English as a Second/Foreign Language (ESL/EFL), but most of this research has focused on assessment and testing in contexts where common rating scales exist (e.g., Ang-Aw & Goh, 2011). What happens in situations where the constructs have not been operationalized is much less clear, however. The aim of this study, therefore, is to explore how EFL teachers in Norway understand the constructs to be tested in a high-stakes, oral English exam at the upper secondary level, where no common rating scale has been provided. The major focus will be on rater cognition, but as part of this discussion, the issue of rater consistency will also be considered. Understanding which aspects of performance raters pay attention to is important for informing the design of test tasks, the selection of criteria for assessment, and the creation of rating scales (Pollitt & Murray, 1996; Taylor & Galaczi, 2011).

Literature Review

National and international studies have found variability in rater cognition and rater behavior in L2 speaking assessment. Internationally, for example, research has found that raters pay attention to a range of different aspects of performance in the rating process (Brown, 2000; Hsieh, 2011; Orr, 2002). More specifically, they may vary considerably in how they perceive the importance of the various criteria in the rating scales, such as the use of vocabulary (Ang-Aw & Goh, 2011; Brown, 1995; Eckes, 2009; Kim, 2009). There is also evidence that raters pay attention to different aspects of performance depending on level. For example, in the assessment of low-level performance, raters are more likely to heed features such as grammar and pronunciation, whereas at more advanced levels, they will pay more attention to aspects such as fluency and content (Brown, Iwashita, & McNamara, 2005; Pollitt & Murray, 1996; Sato, 2012). There is also research showing that raters attend to non-relevant criteria in their assessment of performance, one example being the voice quality of the test takers (Brown, 2000; Orr, 2002; Sato, 2012). In addition, there are indications that raters give test takers credit for effort, regardless of whether it is defined as part of the construct or not (Ang-Aw & Goh, 2011; Brown, 1995; May, 2006; Pollitt & Murray, 1996).

However, with the exception of Brown et al. (2005) and Pollitt and Murray (1996), none of the above-mentioned studies have looked into rater orientations in contexts where common rating scales are absent. Moreover, it is worth noting that both Brown et al. and Pollitt and Murray studied rater cognition in high proficiency level contexts (English for Academic Purposes and the Cambridge Certificate of Proficiency in English, respectively), whereas the level under investigation in this study is upper intermediate. As raters may attend to different criteria at different levels, it is relevant to study teachers' conceptions of constructs also at the intermediate to advanced level.

In the Norwegian context, there is very little empirical evidence on how raters operationalize the construct in oral English exams. However, studies investigating assessment practices more generally, including subjects such as English and Norwegian, have found that teachers may find criterion-referenced assessment difficult, even though such assessment is required by the regulations of the Education Act (Hægeland, Kirkebøen, Raaum, & Salvanes, 2005; Prøitz & Borgen, 2010). More specifically, there are studies indicating that teachers find it difficult to describe student competence at different levels (Throndsen, Hopfenbeck, Lie, & Dale, 2009). As for the question of teacher cognition in the assessment of oral English, only a master's study, Yildiz (2011), has cursorily investigated this issue. The study indicated that Norwegian teachers heed different aspects of performance, that they weigh criteria differently, and that they employ non-relevant criterion information in the rating process. Consequently, with so little national and international research having been undertaken, the present investigation adds valuable empirical evidence to the field of spoken L2 assessment at the upper-intermediate proficiency level.

In this study, the following two research questions will be addressed:

Research Question 1 (RQ1): How do EFL teachers in Norway understand the constructs and criteria to be tested in an oral English exam at the upper secondary level?

Research Question 2 (RQ2): What kind of criteria do these teachers see as salient when assessing performance?

Theoretical Considerations

Despite the view that terminology such as "construct" and "construct definition" is less useful for explaining observed behavior in assessment situations (Kane, 2006, 2012), a number of language assessment and testing specialists still find it relevant as a way of conceptualizing what should be tested (Bachman & Palmer, 2010; Fulcher, 2015; Fulcher & Davidson, 2007; Green, 2014; Hulstijn, 2011; Inbar-Lourie, 2008). According to Fulcher and Davidson (2007), a construct can be considered an unobservable concept, usually identified by an abstract noun, which needs to be defined so that it can be scientifically investigated. This means that "it can be operationalized so that it can be measured" (Fulcher & Davidson, 2007, pp. 369-370, emphasis in original). One of the examples Fulcher and Davidson use is "fluency." Thus, to assess "fluency," one would have to decide on its operationalization, for example, using performance features such as "pauses," "fillers," and "false starts" as indicators of this construct (Brown et al., 2005, p. 23). In language assessments, constructs typically relate to one or more aspects of language ability. However, they may also relate to aspects of content, or topical knowledge (Bachman & Palmer, 2010). Here, it is also worth noting that constructs typically have a source or a "frame of reference," such as a course syllabus that helps assessment designers to operationalize the constructs (Bachman & Palmer, 2010, p. 211). In the Norwegian system, it is the subject curriculum that forms the basis for this operationalization.

In the language assessment and testing literature, one also frequently comes across the term "criterion" in relation to what should be assessed (e.g., Council of Europe, 2001; Cumming, 2009; Lumley, 2002; Luoma, 2004; Stoynoff, 2009; Taylor, 2006). This concept has been defined in a number of different ways (e.g., Glaser & Klaus, 1962; Popham, 1978). However, the notion of criterion that best fits with the approach taken in this study is that of Brindley (1991), who says that criteria are "the key aspects of performance . . . to be assessed such as fluency, appropriacy, accuracy, pronunciation, grammar etc." (p. 140, emphasis added). Interestingly, Brindley mentions fluency as an example of a criterion. This may appear confusing, as Fulcher and Davidson (above) use fluency as an example of a construct. To avoid this confusion, I will in the following reserve the term "construct" for the broader categories of concepts under investigation and use the terms "criteria," "sub-criteria," and "sub-sub-criteria" for the more narrowly defined performance aspects. An example of a construct will be "communication," whereas examples of criteria, sub-criteria, and sub-sub-criteria will be "linguistic competence," "grammar," and "subject-verb concord," respectively. In this discussion, then, there is a hierarchical relationship between the constructs and the criteria, the sub-criteria, and the sub-sub-criteria (cf., Tables 2 and 4).

The Situation in Norway

In Norway, English is a compulsory subject from the first grade onward. Consequently, by the time students enter upper secondary school, at the age of 16, they have normally reached an upper-intermediate level (Common European Framework of Reference, level B1/B2). The subject curriculum is standards based, listing a number of competence aims that specify what students are expected to master at the end of instruction at different levels. These aims are grouped together in three "main areas":

i. Language and language learning—involving aims such as "the pupil shall be able to . . . exploit and assess various situations, working methods and strategies for learning English."
ii. Communication—comprising aims such as "the pupil shall be able to . . . express him/herself in writing and orally in a varied, differentiated and precise manner, with good progression and coherence."
iii. Culture, society, literature—including aims such as "the pupil shall be able to . . . present and discuss international news topics and current events."¹

At the upper secondary level, the English course involving this curriculum is obligatory both for students in the general studies program (GSP) and for students in the various vocational studies programs (VSPs). However, the GSP students complete the course after 1 year (GSP1), whereas the VSP students complete it after 2 years (VSP2). The fact that these two groups of students are made to take the same course has caused some tension in the past, drawing criticism from stakeholders who have found the course far too academic for VSP students, who are allegedly less proficient in English (e.g., Solheim, 2009).

End-of-instruction assessment is mainly given in the form of overall achievement marks, awarded by each subject teacher at the end of the school year on the basis of various forms of classroom assessment. In addition, approximately 20% of the students are randomly selected for written English exams and 5% for the oral English exams at the GSP1 and VSP2 levels. Marks range from 1 ("fail") to 6 ("excellent"). In contrast to the written exam, which is administered nationally by the Directorate for Education and Training, the administration of the oral exam is left to the local educational authorities through the county governors. A direct consequence of this policy is that while the written exam is standardized in terms of a common exam format, common exam tasks, and a common written rating scale, there is no such standardization for the oral exam. In fact, in many cases, the local educational authorities leave it to the individual schools to decide for themselves, particularly with regard to exam tasks and rating scales.²

Method

Research Design

This study is primarily qualitative. As the focus is predominantly on rater cognition, it was decided to use semi-structured interviews as a means to tap into the "life-world" of the informants (Kvale, 2007; Kvale & Brinkmann, 2009). To obtain relevant interview data, it was also decided to use a prompt in the form of a videotaped oral exam performance. This prompt was then distributed to a group of teachers who were asked to watch the video-clip, score the performance, and write down their comments explaining what kind of criteria they applied in the rating process. The informants were then interviewed individually, and in the interviews they were asked to answer both the question on criteria related specifically to the performance in the video-clip and the question on criteria to be applied more generally. In addition, the informants were asked to score the performance, to obtain a consistency measure as well as an indication of rater behavior, which could then be used to validate the rater orientation analysis (Krippendorf, 2013).

Content analysis was used in the exploration of the data. This method can be used both quantitatively and qualitatively, and in the present study, I have used both approaches (Galaczi, 2014; Hsieh & Shannon, 2005; Krippendorf, 2013). According to Hsieh and Shannon (2005), the qualitative approach is particularly relevant when existing theory or research literature on a phenomenon is limited, which is the case in the Norwegian context. Due to this lack of prior conceptualizations, I carried out the analysis inductively, letting the construct and criterion categories emerge from the data (Galaczi, 2014). As for the quantitative aspects of the investigation, the frequencies of the categories that emerged may serve as an index of the salience of these categories (cf., Krippendorf, 2013).
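To make this quantification step concrete, the sketch below shows how coded ideas units could be tallied into category frequencies of the kind reported in the Findings section. It is an illustration only, not the procedure actually used in the study (coding and counting were done in QSR NVivo 10): the informant data are invented, and the tally applies the once-per-interview-section counting rule described under Data Analysis below.

from collections import Counter

# Hypothetical coded data: for each informant, the category labels assigned
# to ideas units within each of the three interview sections.
coded_units = {
    "Informant A": {
        "video_clip_criteria": ["Linguistic competence", "Linguistic competence",
                                "Listening comprehension"],
        "criteria_generally": ["Linguistic competence",
                               "Application, analysis, reflection"],
        "most_important_criteria": ["Application, analysis, reflection"],
    },
    "Informant B": {
        "video_clip_criteria": ["Communication (general reference)", "Preparation"],
        "criteria_generally": ["Linguistic competence"],
        "most_important_criteria": [],
    },
}

counts = Counter()
for sections in coded_units.values():
    for categories in sections.values():
        # A category is counted only once per section, so repeated mentions
        # within the same section are collapsed with set().
        counts.update(set(categories))

# The resulting frequencies can then be read as a rough index of salience.
for category, frequency in counts.most_common():
    print(f"{frequency:>2}  {category}")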

Participants

As for the filming of student performance, a VSP student who volunteered to participate was videotaped as she was taking her oral exam. The exam format consisted of three tasks: (a) a preplanned monologue task in the form of a presentation, followed by a discussion of the presentation; (b) an oral interview task based on a short story from the syllabus; and (c) an oral interview task based on a listening comprehension exercise.

When it comes to the recruitment of teacher interviewees, purposeful sampling (Creswell, 2013) was employed to ensure variation with regard to age, gender, geographical location, experience, and study program affiliation. In total, 24 teachers from the three counties of Finnmark, Oslo, and Østfold were recruited by contacting schools directly.³ All the informants had English teaching experience from the upper secondary level, and all but one of them (No. 23) had examined students at the oral English GSP1/VSP2 level. Some only taught students in the VSPs, some were exclusively involved in the GSP, and some were involved in both types of programs. The background information on the informants as well as the score they awarded to the student in the video-clip are summarized in Appendix A.

Interview Procedure

A semi-structured interview format was chosen, and an interview guide was piloted and revised (cf. Appendix B). Seven teachers in Østfold and one in Oslo were interviewed face-to-face, whereas the rest were interviewed via telephone. The informants were asked to watch the video-clip immediately before the interview was scheduled to keep the event as vividly in their minds as possible. All interviews were recorded. No specific analysis of interviewer effects, that is, interviewer influence on informants' answers, has been carried out (e.g., Kreuter, 2008). However, it appears that typical variables known to make a difference in this respect, such as sensitivity of topic, marginalized respondents, and older or hearing-impaired respondents, have not affected the responses negatively (e.g., Shuy, 2003).

Data Analysis

The data were analyzed using the computer software package QSR NVivo 10. The analysis was carried out in several stages. First, the interviews were transcribed and checked, and all the transcripts were read through to get an overall impression of the material. Second, the transcripts were divided into three sections, each corresponding to the three interview questions: (a) criteria used for scoring performance in the video-clip, (b) criteria generally, and (c) most important criteria. Within these three sections, teacher statements were divided into "ideas units" based on the nature of the research questions. An ideas unit can be described as "a single or several utterances with a single aspect of the event as the focus," that is, a unit that is "concerned with a distinct aspect of performance" (Brown et al., 2005, p. 13). The following excerpt serves as an illustration, in which the ideas-unit boundaries have been marked by a "/":

/She is fairly fluent,/and there are no serious errors hampering communication, right?/And she tackles that quite well, even if she has to stop and switch into Norwegian a couple of times./But there are no errors hampering communication./

The ideas units in the above excerpt were coded as "Fluency," "Disruptive features," and "Compensatory strategies." In the next stage, a coding scheme was developed. This entailed the comparison of codes with statements and codes with other codes in a cyclical process (Galaczi, 2014). Having developed the coding scheme, I coded all the transcripts. This process also involved the quantification of statements by making category counts. An ideas unit that was mentioned in one section, such as "and there are no serious errors hampering communication, right?" (cf. the extract above), was counted once. If the same ideas unit appeared within the same section, like, for example, "But there are no errors hampering communication" (cf. above), it was not counted again. However, if it appeared in two or three sections, it was counted 2 or 3 times. It should be pointed out that this type of quantification does not automatically reveal the strength of the relationship between statement frequency and the prominence of a category. However, it can be validated against "behavioral effects" (Krippendorf, 2013, p. 31) such as the scores awarded by the teachers. In addition, it can be corroborated by the qualitative analysis, which may support findings through the in-depth scrutiny of statements. Both of these validation procedures were employed here.

To validate the analysis, two colleagues, who had previously worked as English teachers at the upper secondary level in Norway, were asked to code four transcripts (a total of 16% of the transcripts). The match between their coding and my own resulted in a Cohen's Kappa reliability estimate of .69, which is regarded as moderate agreement (Landis & Koch, 1977). The mismatched codes were then discussed, and the coding scheme was revised. I then re-analyzed all the transcripts, and one of the colleagues agreed to code two new transcripts. The intercoder reliability analysis in this phase resulted in a Kappa estimate of .89, which can be labeled very good (Landis & Koch, 1977).
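For readers who want to run a comparable intercoder check on their own coding, the agreement statistic can be computed directly from two coders' category assignments over the same set of ideas units. The sketch below is illustrative only: the study itself worked in NVivo, the labels are invented, and scikit-learn's cohen_kappa_score is simply one readily available implementation of the statistic.

from sklearn.metrics import cohen_kappa_score

# Hypothetical category labels assigned by two coders to the same ten
# ideas units (invented values for illustration).
coder_1 = ["Fluency", "Grammar", "Content structure", "Grammar", "Vocabulary",
           "Pronunciation", "Effort", "Grammar", "Fluency", "Knowledge"]
coder_2 = ["Fluency", "Grammar", "Cohesion", "Grammar", "Vocabulary",
           "Pronunciation", "Effort", "Vocabulary", "Fluency", "Knowledge"]

# Cohen's kappa corrects raw percentage agreement for the agreement expected
# by chance; the resulting value is then interpreted against conventional
# benchmarks such as those of Landis and Koch (1977).
kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa = {kappa:.2f}")

In practice, the two label lists would be exported from the coded transcripts rather than typed in by hand, but the computation is the same.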

Findings

As the investigation of rater behavior was used to validate the rater orientations analysis (cf., "Method" section), it is relevant to briefly look at the interviewees' scoring of the performance in the video-clip. Table 1 gives an overview of the frequencies and percentages of the scores, as well as the mean score and the standard deviation. (For the assignment of individual scores, see Appendix A.)

Table 1.  Frequencies and Percentages of Scores, Mean Score, and Standard Deviation.

  Grade   Frequency   Percentage
  2            3         12.5
  3           15         62.5
  4            6         25.0
  N           24        100

  M = 3.13, SD = .612

As Table 1 shows, most of the teachers awarded the performance a 3. The standard deviation of .612 further indicates only a moderate spread in the scoring, suggesting that the teachers largely agreed that this was an average performance.
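The figures in Table 1 can be recomputed directly from the score distribution. The short sketch below is not part of the original analysis; it simply reconstructs the 24 scores from the reported frequencies and derives the same descriptive statistics with Python's statistics module.

import statistics

# Scores reconstructed from the frequencies in Table 1:
# three 2s, fifteen 3s and six 4s (N = 24).
scores = [2] * 3 + [3] * 15 + [4] * 6

n = len(scores)
mean = statistics.mean(scores)   # 3.125, reported as 3.13
sd = statistics.stdev(scores)    # sample standard deviation, approx. .612

print(f"N = {n}, M = {mean:.2f}, SD = {sd:.3f}")
for grade in sorted(set(scores)):
    frequency = scores.count(grade)
    print(f"Grade {grade}: {frequency} ({frequency / n:.1%})")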

RQ1: Teachers' Notions of Constructs and Criteria

The results for RQ1 are based on the informants' answers to the questions of which performance aspects they pay attention to in the rating process, both in terms of the specific case of the student performance in the video-clip and in the assessment of oral performance at this level more generally. It should be observed that all the teachers reported that they score performance holistically.

The coding of statements yielded a total of 56 categories. These were then ordered into "construct," "criterion," "sub-criterion," and "sub-sub-criterion" categories (see Tables 2-5). In total, 38 of these categories related to student performance irrespective of task, 17 were relevant for the presentation task only, and one related solely to the short story discussion task. However, for reasons of space, I will restrict my presentation and discussion here to the 38 categories that relate to performance irrespective of task. An overview of the construct and criterion categories, with one example statement for each criterion, is presented in Table 2. Note that many of these criteria, such as "Linguistic competence," have sub-criteria and sub-sub-criteria that are not displayed in Table 2 but will be displayed in Table 4.


Table 2.  Constructs and Criteria Developed From Teacher Statements: Categories and Examples.

Communication
  (General reference)ᵃ: "I think in terms of communication . . . she was able to communicate" (No. 21)
  Linguistic competence: ". . . the vocabulary was reasonably limited . . . simple sentences with quite a few grammar errors" (No. 9)
  Compensatory strategies: "And if they can't [find the word], they should try to circumvent it, rather than switching into Norwegian" (No. 4)
  Listening comprehension: ". . . she's got good listening skills. When she was asked a question, there was no problem understanding" (No. 9)
  Take initiative: "the student needs to . . . contribute to keep the conversation going" (No. 5)
  Adapt communication to situation and audience: ". . . she adapts her language to the situation" (No. 2)
  Cohesion: ". . . the importance of using paragraph-connectors . . . 'firstly,' 'secondly'" (No. 4)
  Ability to repair: ". . . I see students who are able to self-correct orally . . ." (No. 24)
  Social competence: ". . . to me, that communication thing is to some extent a social issue; you are supposed to put yourself into it" (No. 22)

Content
  (General reference)ᵇ: "She probably says too little about the topic, actually" (No. 13)
  Application, analysis, reflection: "She totally missed out on the second . . . the analysis part of the question" (No. 9)
  Comprehension (explain using own words): "you're testing their understanding" (No. 23)
  Knowledge (reproduction): ". . . she is able to recount the content of the short story" (No. 2)
  Addressing task or problem statement: ". . . she doesn't really answer the task question" (No. 11)
  Elaborated response: ". . . she didn't respond well to the questions . . . she answered in three words and ended with a 'yes'" (No. 2)
  Content structure: "I think she structured the retelling of the story well" (No. 24)

(Other)
  Disruptive features: "[these aspects] are not really hampering communication" (No. 24)
  Preparation: "I think she has prepared well, according to her level, that is" (No. 21)
  Effort: "But trying isn't in the competency criteria. But I think it should be." (No. 14)

ᵃ "General reference" is not a criterion in itself, but a category that summarizes all the instances where the informants mentioned "communication" or "to communicate" (as in the example provided in Table 2).
ᵇ cf. Note 5.

In this analysis, two constructs emerged from the coded statements, namely, "Communication" and "Content." With the exception of the criteria "Disruptive features," "Preparation," and "Effort," which were put in an "Other" category, all the criteria, sub-criteria, and sub-sub-criteria relate to these two constructs. This is not surprising, given that the subject curriculum warrants the identification of the same two constructs. The three main areas in the curriculum, that is, "Language learning," "Communication," and "Culture, society and literature" (cf., "The Situation in Norway" section), can, in my interpretation, be subsumed under the headings "Communication" and "Content." On one hand, students are expected to be able to communicate, and on the other, they are expected to know something about language, language learning, and cultural issues.

When it comes to the "Communication" construct, it should be noted that 11 informants spoke of "language" as an "overall category." For example, Informant No. 9 said, "Usually I identify three areas—content, organization and language—and then, in the descriptors for each grade, I write exactly what I expect students to perform."⁴ However, I would argue that "language," which I have here termed "Linguistic competence," is in fact a sub-category, or a criterion that belongs to "Communication." Support for this claim is found in the many statements concerning other criteria that are not linguistic, but which are closely connected to "Linguistic competence," and which, taken together, logically make up what can be labeled "Communication" (cf., Tables 2 and 4). This also fits theoretically with a communicative approach to language teaching, which the Norwegian educational system draws on through the Common European Framework of Reference (North, 2004; Simensen, 2010).

As for the "Content" construct, the informants' statements pertaining to this category largely turned out to involve classification. Consequently, it was deemed relevant to use a taxonomy in the coding of some of these references. Adapting Bloom, Englehart, Furst, Hill, and Kratwohl's (1956) taxonomy, I therefore found it pertinent to apply the criterion categories "Knowledge," "Comprehension," and "Application, analysis, reflection." A statement from Informant No. 2 supports this decision:

And it's often these three levels one relates to: Are they just reproducing facts, are they on a level where they understand some more, and are able to use it to some extent, or have they reached a level where they are able to analyse, reflect and compare?

It may be objected here that a criterion should not involve classification. However, keeping in mind the definition of criteria used in this study, that is, "aspects of performance . . . to be assessed," I would argue that such criterion categories are relevant in the present analysis. Not only do some teachers report that they assess performance according to these three categories, but such a classification is also internally consistent in the sense that all the criteria reflect aspects of performance that can be linked to level indicators, like, for instance, "poor," "average," or "good." In other words, just as the teachers may find a student's linguistic competence to be good, they may also judge her ability to reflect upon topical knowledge to be good.

When it comes to the categories "Disruptive features," "Preparation," and "Effort," which did not clearly relate to the two overall constructs, they were referred to less frequently (see Table 3). As for "Effort," there are indications that some of the teachers rate VSP students more leniently than they do GSP students, especially the weaker students who risk failing. This means that the teachers may give credit to students who "try their best" in order to compensate for lack of language or content knowledge. Informant No. 14 reflects this sentiment:

We've had a lot of non-native Norwegians, who are in a [vocational] programme. They're going to become hairdressers and they're going to work at [the local supermarkets], and oftentimes we have students that understand very little English. They can't even have an ongoing, real discussion with you in the classroom. . . . We see how broken these kids are [and] passing English is the difference between getting a job and not getting a job. . . . I say to a lot of kids: "If you come and you try, I will do my best to give you a two" . . . But trying isn't in the competency criteria. But I think it should be.

However, this picture is balanced by some of the other VSP teachers, who take the opposite stance. Nos. 22 and 23, for example, categorically deny that they would give extra credit for effort. "I am not allowed to do that," No. 22 says. This remark is apt, as the national educational authorities have stipulated that effort is not to be assessed (Norwegian Directorate for Education and Training [UDIR], 2010).

RQ2: Teachers' Notions of Salient Criteria

As for RQ2, the answer to this question is based on two types of evidence. First, it is based on the reference counts for each of the categories that emerged in the general quantitative analysis (cf., Tables 3 and 4). Second, it is evidenced by the answers given to the interview question concerning which performance aspects the teachers considered as salient (the "most important criterion" question; cf. Table 5). Table 3 presents the total number of counts that were made for the constructs and the criteria in the general quantitative analysis. (Note that the figures in Table 3 include the counts for the sub-criteria and the sub-sub-criteria, although these have not been specified here; cf., Table 4.)

Table 3.  Number of Reference Counts for the Different Statements Pertaining to Constructs and Criteria.

Communication                                       Reference counts
  (General reference to communication)                    28
  Linguistic competence                                   240
  Compensatory strategies                                  24
  Listening comprehension                                  21
  Take initiative                                          15
  Adapt communication to situation and audience             6
  Cohesion                                                  2
  Ability to repair                                         2
  Social competence                                         2
  Sum Communication                                       340

Content
  (General reference to content)                           43
  Application, analysis, reflection                        44
  Comprehension (explain using own words)                  30
  Knowledge (reproduction)                                  27
  Addressing task or problem statement                     26
  Elaborated response                                       15
  Content structure                                          4
  Sum Content                                              189

(Other)
  Disruptive features                                       17
  Preparation                                               14
  Effort                                                     7
  Sum Other                                                 38

As Table 3 shows, "Linguistic competence" was the criterion category that received by far the most counts in the general quantitative analysis (240 counts). In fact, it is more than 5 times larger than the second largest category, "Application, analysis, reflection" (44 counts), and 8 times larger than the third category, "Comprehension" (30 counts). This does not necessarily mean, however, that the teachers see language ability as 5 to 8 times more important than the ability to understand or analyze content, but it reflects the fact that they mention a larger number of different aspects of language when they are asked to discuss criteria. This can be seen in Table 4, which lists the seven sub-criteria and nine sub-sub-criteria that were developed from teacher statements relating to "Linguistic competence." In comparison, no teacher statements produced sub-criteria or sub-sub-criteria within the "Application, analysis, reflection" or "Comprehension" categories. In passing, it should be mentioned that a number of references to "structure" have been left out, because it was difficult to decide whether the respondents referred to "Cohesion" or "Content structure" (cf., Table 2).


Table 4.  "Linguistic Competence": Sub-Criteria, Sub-Sub-Criteria, and Reference Counts.

Linguistic competence                               Reference counts
  (General reference)                                      31
  Grammar
    (General reference)                                    25
    Syntax                                                 12
    Subject-verb concord                                    9
    Tense                                                   2
    Adjective/adverb                                        1
    Sum Grammar                                            49
  Vocabulary
    (General reference)                                    40
    Technical                                               8
    Advanced/nuanced                                        6
    Sum Vocabulary                                         54
  Phonology
    (General reference)                                     0
    Pronunciation                                          48
    Intonation                                             15
    Stress, rhythm, pauses                                  4
    Sum Phonology                                          54
  Fluency                                                  25
  Idioms, metaphors                                         9
  Independence/originality                                  3
  Accuracy                                                  2
Sum Linguistic competence                                 240

Table 5.  Most Important Criteria and Sub-Criteria Mentioned by 19 Out of 24 Informants.

Criteria and sub-criteria                           Counts
Communication
  (General reference)
  Linguistic competence
    (General reference)                                  2
    Pronunciation                                        4
    Vocabulary                                           4
    Grammar                                              2
    Sum Linguistic competence                           12
  Compensatory strategies                                3
  Listening comprehension                                1
  Sum Communication                                     16
Content
  (General reference)                                    3
  Application, analysis, reflection                      9
  Addressing task or problem statement                   5
  Sum Content                                           17

As the "Linguistic competence" category turned out to be so comprehensive, including sub-criteria that received a substantial number of counts, it is relevant to briefly consider some of these subcategories. As can be seen in Table 4, the two most prominent of these, "Vocabulary" and "Phonology," both received 54 counts. The third largest of the subcategories, "Grammar," received 49 counts. In other words, all these sub-criterion categories received more counts than the biggest "Content" category, that is, "Application, analysis, reflection."

Moving on to the analysis of the most important criterion question, one gets a fuller account of what the teachers see as salient criteria. This analysis is based on statements like the following (from No. 6): "And then there is the fact that she hasn't answered the whole task. That's what marks it down the most." Due to the emergent nature of this research design (Creswell, 2013), not all the informants were systematically asked about which criteria they see as most important. Only 19 out of 24 teachers gave answers to this question. Consequently, the findings reported in Table 5 may give an incomplete picture of the entire teacher sample's response to this question. Nevertheless, when comparing the results in Table 5 with the number of counts in Tables 3 and 4, one gets a more complete picture of the salient criteria.


The answers reported in Table 5 largely supported the findings summarized in Tables 3 and 4, although with some modifications. Again, "Linguistic competence" turned out to be the most prominent criterion category, followed by "Application, analysis, reflection." However, the difference between these two categories was not in any way as substantial as was the difference in the general quantitative analysis. As can be seen in Table 5, 12 counts were made for "Linguistic competence" and nine counts were made for "Application, analysis, reflection." Interestingly, the third largest category identified in the general analysis, that is, "Comprehension," was not mentioned at all in the "most important criterion" discussion. Instead, "Addressing task or problem statement" was cited as a salient criterion (five counts). As this criterion also received a number of counts in the general quantitative analysis (26 counts), it seems clear that the teachers consider it to be important. Finally, it is worth noting that the three sub-criteria emerging as important in the general analysis, that is, "Vocabulary," "Phonology," and "Grammar," were all pointed to as salient criteria. "Vocabulary" and "Phonology" received four counts each, and "Grammar" received two counts.

As for variation in the teacher responses, I found a clear distinction in the data pertaining to the criterion category "Addressing task or problem statement." The informants involved only in the GSP seemed to be particularly concerned with this criterion, whereas the teachers involved only in the VSPs did not mention it at all. For example, the three informants who awarded the student in the video-clip a 2 (Nos. 11, 12, and 17) mentioned lack of task response as a dire weakness in the candidate's performance. All of these are GSP teachers. Informant No. 11 put it this way:

So I would have put her at a two, apart from the listening task, since she doesn't quite answer the task, and since the assessors have to "pull" so much information out of her.

Conversely, none of the teachers who awarded the candidate a 4 (most of whom mainly or only teach VSP students) mentioned the criterion "Addressing task or problem statement" at all. One reason for this may be that they put more emphasis on language features in their assessment. A quote from Informant No. 24 supports this interpretation:

So I'm not so concerned with whether they have necessarily acquired so much factual knowledge and societal aspects. I consider myself more of a language teacher in my English lessons, rather than a teacher of cultural studies.

As I will return to below, this suggests that there is a difference between the teachers in how they regard the importance of content knowledge.

Discussion

In response to the two research questions, then, this investigation has found variability in the way teachers understand the constructs and criteria to be tested and what kind of criteria they see as salient. In addition, it has found variability in scoring behavior.

As for the teachers' notions of constructs and criteria, all the informants reflect an understanding of the two constructs that can be identified in the subject curriculum, namely, "Communication" and "Content." However, they view the relative importance of these two constructs somewhat differently. The VSP teachers have a tendency to put more emphasis on "Communication," and particularly "Linguistic competence," whereas the GSP teachers see "Communication" and "Content" as being more juxtaposed. For example, it was striking how the GSP teachers penalized the student in the video-clip for not answering the topic question and not reflecting sufficiently on the issues under discussion. Assuming that GSP students are on average more proficient in English than VSP students, one may infer that the GSP teachers are used to focusing more on "Content" because of the higher level of proficiency of their regular students. Such a conclusion supports the research results mentioned above, which have indicated that raters focus more on linguistic features at lower levels and pay more attention to content at the higher levels of proficiency (Brown et al., 2005; Pollitt & Murray, 1996; Sato, 2012).

Beyond this, the present study confirms the findings reported by Brown et al. (2005) in that the teachers largely focus on the same overall features of performance, but that there is some variation in the way that they attend to the more narrow features. For instance, all the teachers mention phonology as a criterion that should be heeded, whereas only two mention the ability to repair mistakes. Of course, this does not necessarily mean that only these two informants pay attention to a student's ability to repair, but it suggests that it is seen as a less salient performance criterion.

When it comes to the three categories "Disruptive features," "Effort," and "Preparation," they did not correlate well with the overall constructs "Communication" and "Content." Actually, one may question their status as criteria to be tested. The first one of these, "Disruptive features," is not unambiguously a criterion in the sense of "aspect of performance" as defined above. Rather, it is an effect of the failure of a student to perform well on other criteria, like, for example, "Linguistic competence." In other words, if a student cannot pronounce a word correctly, this may disrupt communication. Still, several informants appeared to treat it as a criterion, and it is actually included in the written rating scale for Østfold county. As for "Effort," it is not uncommon that raters pay attention to such a feature, even in contexts where it is not included in the construct to be tested (Brown, 1995; May, 2006). In the present study, this aspect may further be linked to differences in rater severity, an aspect that is also commonly found in the research literature (Bonk & Ockey, 2003; Hsieh, 2011; Iwashita, McNamara, & Elder, 2001; Lumley & McNamara, 1995). As the results showed, some teachers were rating the VSP students more leniently than the GSP students, especially the weaker students who might fail. Such a practice may be attributed to the already mentioned belief held by some teachers that it is unfair to make VSP students take the same course, and the same exam, as the GSP students, who are supposedly theoretically stronger. Finally, it can be argued that the category "Preparation" is not an aspect of the performance. Rather, it is an assumed cause of one or more aspects of the performance. What is more, just like "Effort," it is not criterion-relevant according to the national educational authorities (UDIR, 2010, pp. 13, 44). Overall, then, these findings corroborate results from earlier studies that have found that raters apply non-criterion-relevant information when scoring performance (May, 2006; Pollitt & Murray, 1996; Sato, 2012; Yildiz, 2011).

Conclusion and Implications

This study has investigated what kind of performance aspects teachers pay attention to in an EFL oral exam at the upper-intermediate level where no common rating scale exists. The study found that the teachers generally have the same broad understanding of the constructs and criteria to be tested, but indicated that there is some variation when it comes to how they value the relative importance of these constructs and criteria. In particular, there is variation as regards how the teachers view the significance of content knowledge. In addition, the study found variability in scoring outcomes.

Three important limitations of this study must be kept in mind. First, it is based on a purposeful sample comprising only 24 informants. The generalization of these results to raters in Norway, or raters generally, is therefore, of course, problematic. Second, there is the possibility that the teachers' accounts of general criteria may have been influenced by the particular student performance shown to them in the video-clip. Had there been a different performance, the teachers may have mentioned different aspects in the discussion of general criteria. Third, it may be difficult for teachers to describe the salience of individual criteria because performance is assessed holistically. Considering these limitations, it would be relevant to undertake a larger study involving a number of student performances at different levels, as well as a more sizable teacher sample, to see if the conclusions in this study could be supported. Despite these limitations, however, the findings provide important empirical evidence about teacher cognition and behavior, which may help inform the development of rating scales, test tasks, and classroom assessment practices.

The study has three major implications. First of all, it points to the problem of not having a common rating scale in a high-stakes oral L2 testing situation. As there is evidence that a common rating scale may lead to "sounder, if imperfect, inferences . . . in the process of decision making" (Fulcher, 2012, p. 379), it is likely that the introduction of a common rating scale would strengthen the validity of the score interpretations. Second, this investigation highlights the problem of introducing a comprehensive content construct at the intermediate to upper-intermediate proficiency levels. There are indications that teachers working with lower proficiency level students downplay the content construct, despite curricular requirements, as many of their students find it difficult enough to come to grips with basic linguistic features. However, as the findings here do not warrant firm conclusions, and the assessment of content in language learning contexts is an underexplored area (Snow & Katz, 2014), it is recommended that more research be undertaken. Third, given that many examiners in the oral exam seem to be quite concerned with students' abilities to reflect on content, it is important that classroom practices at this level involve tasks that give students the opportunity to reflect on topical knowledge. Restricting work in class to language-related exercises or the simple recounting of content will not prepare them sufficiently for an oral exam such as the one investigated here.

Appendix A
Rater Background and Scores Awarded

Rater Background Information and Scores Awarded.

No.  Age  Gender  L1         Education  Teaches at study program  Score given
1    39   Male    English    Master     Both GSP and VSP          3
2    57   Female  Norwegian  Bachelor   VSP only                  4
3    57   Male    Norwegian  Bachelor   Both GSP and VSP          3
5    48   Male    Norwegian  Bachelor   Both GSP and VSP          3
6    35   Female  Norwegian  Bachelor   Mainly GSP                3
7    29   Female  Norwegian  Master     Both GSP and VSP          3
8    42   Male    Norwegian  Bachelor   Mainly GSP                4
9    41   Female  Russian    Master     GSP only                  3
10   59   Male    Norwegian  Master     Mainly GSP                3
11   55   Female  Swedish    Bachelor   GSP only                  2
12   28   Female  Norwegian  Master     GSP only                  2
13   39   Female  Norwegian  Master     GSP only                  3
14   55   Male    English    Master     VSP only                  3
15   36   Female  Finnish    Master     GSP only                  3
16   58   Female  Norwegian  Bachelor   VSP only                  4
17   38   Female  Norwegian  Bachelor   GSP only                  2
18   36   Male    Norwegian  Master     Both GSP and VSP          3
19   41   Male    Norwegian  Bachelor   VSP only                  3
20   54   Male    Norwegian  Master     Both GSP and VSP          4
21   35   Female  Mandarin   Master     Mainly VSP                4
22   35   Male    English    Doctor     Mainly VSP                3
23   34   Female  Romanian   Master     Mainly VSP                3
24   47   Male    Norwegian  Bachelor   VSP only                  4
N = 24

Note. GSP = general studies program; VSP = vocational studies program.

Appendix B
Interview Guide—Assessing the GSP1 (General Studies Program – 1st Year)/VSP2 (Vocational Studies Program – 2nd Year) Oral English Exam

1. Background:
   1.1. Age:
   1.2. First language:
   1.3. Education (English):
   1.4. Number of years as a teacher (upper secondary level):
   1.5. Experience as examiner (at the GSP1/VSP2 level):
   1.6. Has been teaching: GSP ___ VSP ____ Health/social ___
   1.7. Worked as a teacher outside your county?
   1.8. Attended rater training courses?
   1.9. Do you use a written rating scale while rating? If yes, who has developed this scale?
2. How would you assess the performance you have just seen? Which grade would you have given and why? In other words, which criteria would you have applied in the assessment process?
3. Are there any other criteria, which you haven't applied here, that would be relevant in the general scoring of performance in this exam?
4. Do you score analytically or holistically?
5. Do you compare students when grading?
6. What, in your opinion, does the grade reflect? General English competence, competence relating to vocational English, academic English, or what?
7. How do you understand the concept of "communication"?
8. What would it take to get a top score? What criteria are the most important?
9. Conversely, when will a student fail?
10. What about phonology? Some teachers say that a near-native speaker accent is important to get a top score. What is your comment on that?
11. Would you give credit for effort?

Acknowledgment

I would like to thank the student and the 24 teachers who agreed to participate in this study.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research and/or authorship of this article.

Notes

1. Due to curriculum revisions in 2014, the main area "Communication" has now been divided into "Oral Communication" and "Written Communication," and minor revisions of some of the aims have been undertaken (cf. www.udir.no).
2. In the county of Østfold, in which eight of the 24 teacher informants in this study were employed, the county governor has developed a common, written rating scale to be used by all English teachers in the county.
3. In Norway, there is a total of 19 counties.
4. With the exception of the quotes from Informants Nos. 9, 14, and 23, which are verbatim, all the quotes in this article have been translated from Norwegian.

References

Ang-Aw, H. T., & Goh, C. C. M. (2011). Understanding discrepancies in rater judgement on national-level oral examination tasks. RELC Journal, 42, 31-51. doi:10.1177/0033688210390226
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford, UK: Oxford University Press.
Bloom, B. S., Englehart, M., Furst, E., Hill, W., & Kratwohl, D. (Eds.). (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: The cognitive domain. London, England: Longman.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20, 89-110. doi:10.1191/0265532203lt245oa
Brindley, G. (1991). Defining language ability: The criteria for criteria. In S. Anivan (Ed.), Current developments in language testing. Anthology series 25. Singapore: Southeast Asian Ministers of Education Organization (SEAMEO) Regional Language Centre.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1-15. doi:10.1177/026553229501200101
Brown, A. (2000). An investigation of the rating process in the IELTS oral interview (IELTS research reports, Vol. 3). Retrieved from https://www.ielts.org/pdf/Vol3Report3.pdf
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks (TOEFL Monograph Series, MS-29). Princeton, NJ: Educational Testing Service.
Council of Europe. (2001). The common European framework of reference for languages: Learning, teaching, assessment. Strasbourg, France: Council of Europe, Language Policy Unit.
Creswell, J. W. (2013). Qualitative inquiry & research design: Choosing among five approaches. Los Angeles, CA: SAGE.
Cumming, A. (2009). Language assessment in education: Tests, curricula, and teaching. Annual Review of Applied Linguistics, 29, 90-100. doi:10.1017/S0267190509090084
Eckes, T. (2009). On common ground? How raters perceive scoring criteria in oral proficiency testing. In A. Brown & K. Hill (Eds.), Tasks and criteria in performance assessment: Proceedings of the 28th language testing research colloquium (Vol. 13, pp. 43-73). Frankfurt, Germany: Peter Lang.
Fulcher, G. (2012). Scoring performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 378-392). Oxford, UK: Routledge.
Fulcher, G. (2015). Re-examining language testing: A philosophical and social inquiry. Oxon, UK: Routledge.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment. Oxford, UK: Routledge.
Galaczi, E. (2014). Content analysis. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 3). Chichester, UK: Wiley-Blackwell.
Glaser, R., & Klaus, D. J. (1962). Proficiency measurement: Assessing human performance. In R. M. Gagné (Ed.), Psychological principles in systems development (pp. 419-474). New York, NY: Holt, Rinehart & Winston.
Green, A. (2014). Exploring language assessment and testing: Language in action. Oxon, UK: Routledge.
Hægeland, T., Kirkebøen, L. J., Raaum, O., & Salvanes, K. G. (2005). Familiebakgrunn, skoleressurser og avgangskarakterer i norsk grunnskole [Family background, school resources, and final grades in Norwegian primary and lower secondary schools]. Retrieved from http://www.ssb.no/a/publikasjoner/pdf/sa74/kap-2.pdf
Hsieh, C.-N. (2011). Rater effects in ITA testing: ESL teachers' versus American undergraduates' judgments of accentedness, comprehensibility, and oral proficiency. In Spaan Fellow Working Papers in Second or Foreign Language Assessment (Vol. 9, pp. 47-74). Retrieved from http://www.cambridgemichigan.org/wp-content/uploads/2014/12/Spaan_V9_FULL.pdf
Hsieh, H.-F., & Shannon, S. E. (2005). Three approaches to qualitative content analysis. Qualitative Health Research, 15(9), 1277-1288. doi:10.1177/1049732305276687
Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers: An agenda for research and suggestions for second-language assessment. Language Assessment Quarterly, 8, 229-249. doi:10.1080/15434303.2011.565844
Inbar-Lourie, O. (2008). Constructing a language assessment knowledge base: A focus on language assessment courses. Language Testing, 25, 385-402. doi:10.1177/0265532208090158
Iwashita, N., McNamara, T., & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of an information-processing approach to task design. Language Learning, 51, 401-436. doi:10.1111/0023-8333.00160
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: Praeger.
Kane, M. (2012). All validity is construct validity. Or is it? Measurement: Interdisciplinary Research and Perspectives, 10, 66-70. doi:10.1080/15366367.2012.681977
Kim, Y.-H. (2009). An investigation into native and nonnative teachers' judgments of oral English performance: A mixed methods approach. Language Testing, 26, 187-217. doi:10.1177/0265532208101010
Kreuter, F. (2008). Interviewer effects. In P. J. Lavrakas (Ed.), Encyclopedia of survey research methods (pp. 369-371). London, England: SAGE.
Krippendorf, K. (2013). Content analysis (3rd ed.). Thousand Oaks, CA: SAGE.
Kvale, S. (2007). Doing interviews (Vol. 2). London, England: SAGE.
Kvale, S., & Brinkmann, S. (2009). Interviews: Learning the craft of qualitative research interviewing. Los Angeles, CA: SAGE.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174. doi:10.2307/2529310
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19, 246-276. doi:10.1191/0265532202lt230oa
Lumley, T., & McNamara, T. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12, 54-71. doi:10.1177/026553229501200104
Luoma, S. (2004). Assessing speaking. Cambridge, UK: Cambridge University Press.
May, L. A. (2006). An examination of rater orientations on a paired candidate discussion task through stimulated verbal recall. Melbourne Papers in Language Testing, 11, 29-51. Retrieved from http://eprints.qut.edu.au/15747/1/15747.pdf?ev=pub_ext_prw_xdl
North, B. (2004, April 15). Europe's framework promotes language discussion, not directives [Online edition]. Guardian Weekly. Retrieved from http://www.theguardian.com/education/2004/apr/15/tefl6
Norwegian Directorate for Education and Training. (2010). Rundskriv Udir-1-2010: Individuell vurdering i grunnskolen og videregående opplæring etter forskrift til opplæringsloven kapittel 3 [Circular Udir-1-2010: Individual assessment in primary and secondary school according to the regulations of the Education Act]. Oslo, Norway: Directorate for Education and Research. Retrieved from http://www.udir.no/Regelverk/Finn-regelverk-for-opplaring/Finn-regelverk-etter-tema/Vurdering/Udir-1-2010-Individuell-vurdering/
Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores. System, 30, 143-154.
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th language research testing colloquium. Cambridge, UK: Cambridge University Press.
Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice Hall.
Prøitz, T. S., & Borgen, J. S. (2010). Rettferdig standpuktvurdering det (u)muliges kunst? [Fair overall achievement marks—A(n) (im)possible goal?] (Report 16/2010). Oslo, Norway: NIFU (Nordisk Institutt for Studier av Innovasjon, Forskning og Utdanning) STEP (Studies in Technology, Innovation and Economic Policy).
Sato, T. (2012). The contribution of test-takers' speech content to scores on an English oral proficiency test. Language Testing, 29, 223-241. doi:10.1177/0265532211421162
Shuy, R. W. (2003). In-person versus telephone interviewing. In J. A. Holstein & J. F. Gubrium (Eds.), Inside interviewing (pp. 175-193). Thousand Oaks, CA: SAGE.
Simensen, A. M. (2010). Fluency: An aim in teaching and a criterion in assessment. Acta Didactica Norge, 4(1), 1-13.
Snow, M. A., & Katz, A. M. (2014). Assessing language and content. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 1, pp. 230-247). Chichester, UK: Wiley-Blackwell.
Solheim, T. (2009). Opplæring i yrkesfag: Teori-praksis [Education in the vocational subjects: Theory-practice] (Bedre skole, 4/2009). Retrieved from https://www.utdanningsforbundet.no/upload/Tidsskrifter/Bedre%20Skole/BS_nr_4-09/4328-04-09BedreSkole-Solheim.pdf
Stoynoff, S. (2009). Recent developments in language assessment and the case of four large-scale tests of ESOL ability. Language Teaching, 42, 1-40. doi:10.1017/S0261444808005399
Taylor, L. (2006). The changing landscape of English: Implications for language assessment. ELT Journal, 60, 51-60. doi:10.1093/elt/cci081
Taylor, L., & Galaczi, E. (2011). Scoring validity. In L. Taylor (Ed.), Examining speaking: Research and practice in assessing second language speaking (Vol. 30, pp. 171-233). Cambridge, UK: Cambridge University Press.
Throndsen, I., Hopfenbeck, T. N., Lie, S., & Dale, E. L. (2009). Bedre vurdering for læring: Rapport fra "Evaluering av modeller for kjennetegn på måloppnåelse i fag" [Better Assessment for Learning: Report from "The Evaluation of Models for Assessment Criteria for Goal Achievements in Subjects"]. Retrieved from http://www.udir.no/Upload/Forskning/5/Bedre_vurderingspraksis_ILS_rapport.pdf?epslanguage=no
Yildiz, L. M. (2011). English VG1 level oral examinations: How are they designed, conducted and assessed? (Unpublished master's thesis). University of Oslo, Norway.

Author Biography

Henrik Bøhn is a lecturer in English at Østfold University College, where he teaches language proficiency and English education. He is currently undertaking a PhD in language assessment.


Assessing pronunciation in an EFL context: Teachers’ orientations towards nativeness and intelligibility

Henrik Bøhn

This study investigated EFL teachers' orientations towards the assessment of pronunciation at the upper-secondary school level in Norway. Data was gathered from purposeful samples of 24 interview informants and 46 questionnaire respondents. The teachers were asked about their attitudes towards the assessment of nativeness and intelligibility, including pronunciation features such as segmentals, word stress, sentence stress and intonation. The results showed that the teachers strongly agreed on the importance of intelligibility, whereas they strongly disagreed on the salience of nativeness. They were also moderately to strongly oriented towards the evaluation of segmentals, word stress and sentence stress. As for intonation, the results suggest that the teachers were either less concerned with this criterion or unsure of how to relate to it. The study points to the importance of a more clearly defined pronunciation construct in language assessment, which should be informed by recent advances in pronunciation research.

Key words: EFL, ESL, oral assessment, pronunciation, nativeness, intelligibility

Background Pronunciation1 is a salient feature of spoken language and is frequently included as a criterion in speaking assessments (e.g. ETS, 2009; IELTS, 2008). Despite this, there is evidence that pronunciation, until recently, has been neglected in L2 language research, instruction and assessment (Baker, 2013; Crowther, Trofimovich, Saito, & Isaacs, 2015; Derwing & Munro, 2009). This may be partly explained in terms of its marginalized status in communicative language teaching (Isaacs, 2014). Studies indicate that many raters lack pronunciation training and that they feel uneasy about how to assess this construct (Levis, 2006; MacDonald, 2002). This supports Derwing and Munro’s (2005) claim that pronunciation pedagogy has been largely guided by intuitive thinking. An important issue in English as a Foreign/Second language (EFL/ESL) pedagogy is the question of speaker norms. Traditionally, the native speaker model has been regarded as the norm for language production and assessment (Cook, 1999; Deterding, 2010). In the past few decades, however, this model has been challenged by competing approaches such as World Englishes (Kachru, 1985) and English as a Lingua Franca (Jenkins, 2000; Seidlhofer,

1 In the discussion of pronunciation in this paper the term refers to both segmental (individual sounds) and suprasegmental (intonation, stress, rhythm etc.) features of spoken utterances (cf. Isaacs, 2014, p. 21).

2001). These approaches seem to have impacted ESL/EFL instruction and assessment, making them less native-speaker focused (Graddol, 2006). Yet, the extent of this impact is not clear, as studies have shown variation in instructors’ and raters’ attitudes towards the native speaker norm (Brown, Iwashita, & McNamara, 2005; Deterding, 2010). According to Seidlhofer (2011), the professional discourse concerning norms has changed, but instructional (and assessment) practices have not (p. 13). Such claims are worthy of further investigation, as knowledge of which aspects of performance teacher raters attend to is fundamental to the development of fair and valid assessment practices (Taylor & Galaczi, 2011). Therefore, in this study we aim to examine Norwegian EFL teachers’ attitudes towards the native speaker norm in the assessment of pronunciation, as well as their views on the importance of four specific pronunciation features frequently found to be important for comprehension (cf. Analytical framework section, below). The focus will be on assessment in an oral exam at the upper secondary school level in Norway.

The Norwegian context Although English does not have status as a first or an official language in Norway, it has a strong position in Norwegian society. It is taught as a compulsory school subject from the first grade onwards (age six), and people are widely exposed to the English language both in- and outside of school. Studies have shown that the general proficiency level of the population is high (Education First, 2015), and people use English for a number of different purposes across a range of different contexts, both nationally and internationally. Cross-linguistically, Norwegian and English share a number of phonological similarities, particularly in the area of consonant sounds (Nilsen & Rugesæter, 2015). However, there are also significant differences, such as the phonemes /θ/, /ð/, /tʃ/ /dʒ/, /z/ and /w/, which do not have phonemic status in Norwegian. Norwegian intonation patterns also differ markedly from those of, for example, General American (GA) or Received Pronunciation (RP), which have traditionally been used as pronunciation models in Norway. For example, in eastern and central Norwegian dialects – used by a large proportion of the population – ordinary statements are typically spoken with rising intonation, in contrast to GA and RP which prefer falling intonation (Nilsen & Rugesæter, 2015).


In upper secondary school, students (aged 16-17) have on average reached an upper-intermediate proficiency level (Common European Framework of Reference [CEFR], level B1/B2). End-of-instruction assessment is primarily given in the form of overall achievement marks, awarded by each individual subject teacher on the basis of various forms of classroom assessment. In addition, approximately 20 per cent of the pupils are randomly selected for written English exams and five per cent for an oral English exam. Grades range from 1 (“fail”) to 6 (“excellent”) and performance is scored holistically. Instruction and assessment are based on a national subject curriculum and regulations to the Education Act (Norwegian Ministry of Education and Research [KD], 2006/2013; Vurderingsforskriften, 2006/2009). Little empirical knowledge exists to describe how pronunciation is taught and assessed, however.

The curriculum is loosely based on the CEFR and stresses the role of English as an international language. It is also standards-based, stipulating a number of competence aims which define what students are to work towards.2 However, as some of these competence aims are rather general, they need to be operationalized in the form of assessment criteria and rating scale descriptors in order to be applicable in assessment (Meld. St. nr. 30, 2004, p. 40). As for the administration of the oral exam, the national educational authorities have delegated this responsibility to the county governors at the local level, who in many cases leave it up to the individual schools to develop their own rating scales. Studies show that this has led to the development of a number of different criteria. In addition, criteria use is somewhat inconsistent across teacher raters, schools and counties (Bøhn, 2015; Yildiz, 2011).

The English subject curriculum may be regarded as a source or a “framework” for the operationalization of the constructs to be assessed (Bachman & Palmer, 2010, p. 211). However, it offers little guidance on the assessment of pronunciation. For example, only one competence aim mentions pronunciation, stating that students should be able to “use patterns for pronunciation, intonation, word inflection and various types of sentences in communication” (KD, 2006/2013, p. 10). What ‘using patterns’ entails is not further specified. Obviously, such formulations leave room for individual interpretation when teachers are to operationalize the construct.

Little research has been carried out on teachers’ orientations towards pronunciation and the native speaker norm in the Norwegian context. Three studies have found variation in teachers’ attitudes towards the native speaker norm in EFL pedagogy generally (Hansen, 2011; Sannes, 2013; Østensen, 2013). In addition, two studies have found that EFL teachers in Norway do consider features of pronunciation to be relevant in oral performance assessment (Bøhn, 2015; Yildiz, 2011). Finally, it is worth mentioning that most certified EFL teachers in Norway have degrees in English from Norwegian tertiary institutions, and that these degrees comprise obligatory courses in English phonology. Hence, the above-mentioned claim that raters lack pronunciation training does not formally apply in the Norwegian context.

2 A translated version of the English subject curriculum can be accessed at http://www.udir.no/kl06/eng103/Hele/?lplang=eng.

Analytical framework

Nativeness and intelligibility

According to Levis (2005), pronunciation research and pedagogy have traditionally been informed by two “contradictory” principles, the nativeness principle and the intelligibility principle (p. 370). The nativeness principle states that it is both feasible and desirable for L2 learners to achieve native-like pronunciation; the intelligibility principle holds that these learners simply need to make themselves understood. Levis’ dichotomous description has been acknowledged by a number of pronunciation specialists (Crowther et al., 2015; Derwing & Munro, 2009; Kirkpatrick, Deterding, & Wong, 2008).

The nativeness principle may be associated with the native speaker norm and has, as already mentioned, long held a strong position in language teaching and assessment. However, it has frequently been criticized by pronunciation experts and applied linguists for a number of reasons. Among the arguments are claims that the attainment of a native-speaker accent is an unrealistic goal for most learners (Jenkins, 2000; Munro & Derwing, 2011; see also Singleton, 2005). Another objection relates to the problem of defining who counts as a “native speaker” (Cook, 1999). It has also been maintained that the native speaker norm is an inappropriate target, since the majority of those who use English globally today are non-native speakers who do not need to conform to first language users (Jenkins, 2000; Kirkpatrick, 2007; Seidlhofer, 2001). A related concern has arisen from the recognition that accent involves aspects of identity (Giles, 1979; Rindal, 2010). This has led some scholars to argue that speakers should be allowed to develop their own distinct speech variety as long as focus is kept on comprehensible output (Jenkins, Cogo, & Martin, 2011). This last point relates to the most pervasive objection to the nativeness principle, namely the argument that what L2 speakers need is to be understood, not to come across as native speakers (Derwing &

Munro, 2009; Isaacs, 2014). This position is supported by research showing that a strong foreign accent does not necessarily impede intelligibility (Derwing & Munro, 1997; Munro & Derwing, 1999). The term intelligibility has different denotations (Derwing & Munro, 2009; Jenkins, 2000; Smith & Nelson, 1985). More recently it has been common to distinguish between intelligibility, defined narrowly as a listener’s actual understanding of what is being said, and comprehensibility, defined as a listener’s perceptions of how difficult it is to understand a message (Derwing & Munro, 2009). In this study we investigate the teachers’ attitudes towards intelligibility in the sense of actual understanding.

Linguistic features important for intelligibility

The view that intelligibility should be the main priority in L2 pronunciation pedagogy naturally raises the question of which linguistic features should be prioritized in assessment. As research in the field has been limited (Harding, 2013; Munro & Derwing, 2015) and carried out in different communicative contexts (e.g. Jenkins, 2000; Kang, 2010), it is difficult to set up a comprehensive, empirically validated list of features. However, there is a growing body of research which gives indications of priorities, and in the following we will present features which are recurrently mentioned as important in the literature (Crowther et al., 2015; Derwing & Munro, 2009; Harding, 2013; Isaacs, 2014; Jenkins, 2000; Jenkins et al., 2011; Munro, Derwing, & Thomson, 2015).

From the perspective of English as a Lingua Franca, Jenkins (2000, 2009) has suggested a set of features – the so-called “Lingua Franca Core” (LFC) – which her research has shown to be necessary for intelligibility in interactions involving non-native speakers of English. The main features of the LFC are:

1. Consonant sounds, except /θ/, /ð/ and dark /l/.
2. Vowel length contrasts (e.g., the difference between the vowels in “pitch” and “peach”).
3. Restrictions on consonant deletion (in particular, not omitting sounds in the beginning and in the middle of words).
4. Nuclear (or tonic) stress production/placement.
5. The vowel /ɜ:/ (as in RP “fur”). (Jenkins, 2009)


As can be seen from this overview, Jenkins finds that it is the pronunciation of segmentals which matters most (cf. points 1, 2, 3 and 5). The only suprasegmental feature here is nuclear (tonic) stress, also referred to as sentence stress (cf. Hahn, 2004). Jenkins (2009) goes on to argue that other features of “native speaker English”, such as native-like vowel quality (except /ɜ:/) and suprasegmental features, such as word stress, intonation and rhythm, are “unnecessary” for intelligibility in non-native speaker interactions (p. 13). However, the LFC has been criticized for having a too limited empirical base, and for not having clarified the inclusion criteria for speech samples in the LFC corpus (Isaacs, 2014). Moreover, as the LFC is concerned with intelligibility in situations where both interlocutors are non-native speakers, it could be claimed that this focus is too narrow in EFL/ESL settings, as many learners will have to communicate with native speakers of English (Trudgill, 2005). Still, the LFC serves as an interesting point of departure for the identification of aspects that are important for intelligibility, and a comparison of the LFC with other features repeatedly mentioned in the literature as being crucial may provide a fuller picture of what should be prioritized in teaching and assessment.

For example, other studies have found that segmental aspects are vital for communication, particularly those which carry high functional load (e.g. Munro & Derwing, 2006). Functional load refers to “the number of pairs of words in the lexicon that [a phonemic contrast] serves to keep distinct” (Catford, 1987, p. 88). For example, the contrast /tʃ/ vs. /ʃ/ (such as watching vs washing) carries higher functional load than /s/ vs. /z/ (such as ice vs eyes), since the former distinguishes a larger number of word pairs than the latter. Jenkins’ claim that the dental fricatives /ð/ and /θ/ are not important for intelligibility can be supported with reference to this principle, since typical phonemic contrasts that they are part of, such as /ð/ vs /d/ and /θ/ vs /t/, carry low functional load.

There is also evidence that sentence stress is significant for understanding (Dauer, 2005; Hahn, 2004). For example, in Jenkins’s (2002) data, the sentence “I smoke more than you DO” (rather than “I smoke more than YOU do”) led to communication breakdown between two non-native interlocutors. However, in terms of Jenkins’s conclusion that word stress does not affect intelligibility, other researchers have reached the opposite conclusion (Dauer, 2005; Field, 2005; Saito, Trofimovic, & Isaacs, 2015; Zielinski, 2008). Field (2005), for instance, gives an example where the change of stress from the first to the last syllable in the word second (i.e. “seCOND”) caused reduced understanding in both native and non-native listeners. Similarly, a number of other studies have contradicted Jenkins’ claim that intonation is not important for understanding (Munro & Derwing, 1999; Pickering, 2001; Saito et al., 2015; Winters &

O’Brien, 2013). However, there are also studies which have partly supported Jenkins’ findings (Fayer & Krasinski, 1987; Kang, 2010; see also Walker & Zoghbor, 2015). This contradictory evidence is further complicated by the elusive nature of the intonation construct (Jenkins, 2000; Underhill, 2005). In this study we use Underhill’s (2005) definition of intonation, which describes it as “pattern of pitch variation” (p. 76). This should be recognizable to EFL teachers in Norway, as textbooks used in English phonology courses in Norway typically refer to intonation as such (e.g. Nilsen & Rugesæter, 2015). The idea of intonation as pitch variation has led researchers to study its impact on communication by physically monotonizing it, with the use of a computer program, in order to see whether native speaker listeners could distinguish native from non-native speakers, which in many cases they could not (e.g. Van Els & De Bot, 1987). This is relevant for our operationalization of the intonation construct (cf. below).

In sum, the research discussed in this section argues that intelligibility is more important than nativeness for communicative purposes. It also indicates that segmentals, word stress and sentence stress are important for understanding and should therefore be included in a pronunciation syllabus. It would therefore be relevant to investigate the extent to which Norwegian EFL teachers pay attention to these features in pronunciation assessment. As for intonation, research on its relevance for intelligibility is less conclusive. Still, the importance attributed to it in pronunciation textbooks (Nilsen & Rugesæter, 2015; Underhill, 2005) leads us to include it here.
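To make the notion of functional load discussed above concrete, the following is a minimal, hypothetical sketch: it counts how many word pairs in a toy pronunciation lexicon are kept distinct by a single phonemic contrast. The lexicon, transcriptions and function names are illustrative only and are not part of the study; real functional load estimates are based on large lexicons and typically weight pairs by word frequency (cf. Catford, 1987; Munro & Derwing, 2006).

```python
# Toy illustration (not from the study): minimal pairs distinguished by one phonemic contrast.
from itertools import combinations

# Hypothetical lexicon: word -> phonemic transcription as a tuple of phoneme symbols.
LEXICON = {
    "watching": ("w", "ɒ", "tʃ", "ɪ", "ŋ"),
    "washing":  ("w", "ɒ", "ʃ", "ɪ", "ŋ"),
    "chip":     ("tʃ", "ɪ", "p"),
    "ship":     ("ʃ", "ɪ", "p"),
    "ice":      ("aɪ", "s"),
    "eyes":     ("aɪ", "z"),
}

def minimal_pairs(contrast, lexicon):
    """Return the word pairs that differ in exactly one segment realizing the given contrast."""
    a, b = contrast
    pairs = []
    for (w1, t1), (w2, t2) in combinations(lexicon.items(), 2):
        if len(t1) != len(t2):
            continue
        diffs = [(x, y) for x, y in zip(t1, t2) if x != y]
        if len(diffs) == 1 and set(diffs[0]) == {a, b}:
            pairs.append((w1, w2))
    return pairs

print(minimal_pairs(("tʃ", "ʃ"), LEXICON))  # [('watching', 'washing'), ('chip', 'ship')]
print(minimal_pairs(("s", "z"), LEXICON))   # [('ice', 'eyes')]
```

In this toy lexicon the /tʃ/–/ʃ/ contrast keeps two pairs apart while /s/–/z/ keeps only one apart, mirroring the high- versus low-functional-load example given in the text.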

Research questions

Against this backdrop, the present study addresses the following three research questions (RQs):

1. To what extent do EFL teachers at the upper secondary level in Norway see nativeness as an important criterion in the assessment of pronunciation?
2. To what extent do the teachers see intelligibility as an important criterion in the assessment of pronunciation?
3. To what extent do the teachers see segmentals, word stress, sentence stress and intonation as important in the assessment of pronunciation?


Method Research design and instruments In this study we used a sequential, exploratory, mixed-methods design (Creswell, Plano Clark, & Garrett, 2008). The data was collected from semi-structured interviews with 24 informants and a questionnaire subsequently distributed to an additional sample of 46 respondents, all working as fully qualified EFL teachers in Norway. The interviews were conducted as part of a larger research project on Norwegian EFL teachers’ general assessment orientations, which also yielded other data (explored in Bøhn (2015)). However, the current investigation analysed responses to an interview question designed exclusively for the present study. More specifically, this question was employed to answer RQ1 and RQ2, and the questionnaire was used to check whether the findings from the interview analysis could be corroborated, as well as to answer RQ3. Hence, the questionnaire served both validity checking and complementary information seeking purposes (Hammersley, 2008). In order to obtain relevant data in the larger project, i.e. in both Bøhn (2015) and the current study, it was decided to use a prompt in the form of a video-taped performance of a student taking her oral exam.

Participants The interview informants were purposefully selected (Creswell, 2013) from a population of fully qualified EFL teachers at the upper secondary school level in order to obtain variation in the sample with regard to gender, first language, teaching and rater experience, county background, and study programme affiliation. The informants were contacted directly by telephone on the basis of these criteria, and those who agreed to participate did so voluntarily, as no financial incentive was provided. 16 of the teachers were native speakers of Norwegian, three had English as their L1, whereas the remaining five were native speakers of Russian, Swedish, Finnish, Mandarin and Rumanian. Their teaching experience ranged from 1 to 32 years, with an average of 11.2 years. All but one of them had previous experience as examiners at the oral English exam.3 The questionnaire respondents were recruited in the same way as the interview informants, and 46 teachers agreed to participate. No economic incentive was given. 40 of the informants had Norwegian as their L1, the others were native speakers of Swedish (3),

3 Background information on the interview informants can be found in Appendix A.

English (2) and French (1). These teachers had on average 6-10 years of experience. 32 of them had previous experience as oral examiners, whereas 14 had no such experience.4 The student recorded in the video-clip was an 18-year old girl who volunteered to be filmed as she was taking her oral exam. Her pronunciation was generally clear, but was characterized by a number of “errors”. For instance, in terms of segmentals she used /`hedeɪtʃ/ for headache and /`wɒʃɪŋ/ for watching. As for word stress, she said /`devǝlɒp/ for develop and /`sevǝrlІ/ for severely. In addition, her accent showed fairly strong traces of Norwegian intonation. When asked to grade this particular performance, most of the teachers awarded her an average grade (M = 3.2 on the 6-point grading scale mentioned in “The Norwegian context” section, above).

Procedure An interview guide was piloted and revised. In order to explore the teachers’ attitudes towards nativeness and intelligibility, we first designed a question reading: “To what extent is correctness in pronunciation important?”. However, as the pilot informant found this question ambiguous, we changed it for the final version of the interview guide into: “What about phonology? Some teachers say that a near-native speaker accent is important in order to get a top score? What is your comment on that?”.5 The analysis of the responses to this question was used to answer RQ1 and RQ2. In addition, it was used as a starting point for the development of the questionnaire, as we wanted to see whether the results could be corroborated. The development of the questionnaire also entailed a piloting process. In order to check the validity of the findings from the interviews, we included three items on nativeness and two on intelligibility, but one of the latter had to be taken out because of poor fit with the construct (cf. footnote 6, Table 1 and Figure 2, below). Additionally, we created eight items to answer RQ3. These were related to the pronunciation features discussed above, i.e. segmentals, word stress, sentence stress, and intonation (cf. tables 2 and 3). As the pilot indicated potential problems related to the interpretation of some of these items, we attempted to clarify them by exemplifying the first three. The last one, intonation, was operationalized

4 Background information on the questionnaire respondents can be found in Appendix B.
5 The interview guide can be found in Appendix C.

as “monotonous speech pattern”.6 Responses were measured on a five-point Likert scale ranging from 1 = “not at all in agreement” to 5 = “completely agree”. The video-clip used as prompt was distributed to all the interview informants and questionnaire respondents, who were asked to score the performance and answer questions regarding criteria generally (Bøhn, 2015) and nativeness and intelligibility (the present study). As for the questionnaire respondents, one of the researchers was present for a group of 32 of these respondents while they watched the video-clip and completed the questionnaire. No time-limit was set for the completion of the questionnaire. The rest of the respondents (n=14) and all of the interviewees, on the other hand, watched the video-clip on their own, without the researchers being present. The interviewees were asked to take notes and were interviewed shortly afterwards. The language used in both the interviews and the questionnaire was Norwegian; translations into English have been carried out by the researchers.

Data analyses The interviews were transcribed and checked and then returned to the informants to be read through and commented on. Thereafter the transcriptions were coded on the basis of statements pertaining to nativeness and intelligibility as defined in the Analytical framework section (cf. above). We used the computer program QSR NVivo10 in the coding process. In the analysis of the statements relating to RQ1 we used magnitude coding (Miles, Huberman, & Saldaña, 2014). This entailed the assignment of coded statements along a four-point scale going from “not at all” in agreement with the nativeness-perspective to “to a large extent” in agreement. In addition, we included an “ambiguous” category, since some statements turned out to be vague or somewhat inconsistent. The analysis of responses concerning RQ2 was carried out using provisional coding (Miles et al., 2014). This involved an initial deductive phase, where we set up a list of phrases relating to intelligibility, such as “clear pronunciation”, “comprehensible speech” and “understanding”, and then searched the transcripts for matching phrases. The interview analysis was validated by having a colleague co-code two manuscripts (8% of the total number) using magnitude and provisional coding. This colleague had prior experience as an EFL teacher at the upper secondary level in

6 The questionnaire can be found in Appendix D.

Norway. The match between her analysis and our own resulted in a Kappa-estimate of .85, which is high (Landis & Koch, 1977). The questionnaire data was analysed by using IBM SPSS Statistics to calculate descriptive statistics in terms of median and mean values, standard deviations and skewness.
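As an illustration of the kinds of analyses reported above, the following sketch reproduces the same statistics with invented numbers and open-source libraries (pandas, SciPy, scikit-learn) as stand-ins for the SPSS procedures and the manual agreement check; none of the values below are from the study.

```python
# Illustrative sketch (invented data): descriptive statistics for one Likert item
# and inter-coder agreement (Cohen's kappa) for the qualitative coding.
import pandas as pd
from scipy.stats import skew
from sklearn.metrics import cohen_kappa_score

# Hypothetical responses to a five-point Likert item (1-5), one value per teacher.
item = pd.Series([5, 4, 3, 3, 5, 2, 4, 4, 3, 5, 1, 4])

print("Md   =", item.median())
print("M    =", round(item.mean(), 2))
print("SD   =", round(item.std(ddof=1), 2))         # sample standard deviation, as reported by SPSS
print("Skew =", round(skew(item, bias=False), 2))   # bias-corrected skewness, as reported by SPSS

# Hypothetical magnitude codes assigned to the same ten statements by two coders
# (0-3 approximating "not at all" ... "to a large extent", plus 4 for "ambiguous").
coder_1 = [0, 1, 2, 3, 3, 1, 0, 2, 4, 2]
coder_2 = [0, 1, 2, 3, 2, 1, 0, 2, 4, 2]
print("Cohen's kappa =", round(cohen_kappa_score(coder_1, coder_2), 2))
```

The category labels in the comment are only an approximation of the four-point magnitude scale plus the “ambiguous” category described above.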

Results

RQ1: To what extent do EFL teachers at the upper secondary level in Norway see nativeness as an important criterion in the assessment of pronunciation?

Interview analysis

RQ1 was answered with data from both the interviews and the questionnaire. The interviews revealed considerable variation in the informants’ answers. The analysis using magnitude coding showed that five of the 24 informants did not at all find nativeness to be important, five found it to be of little importance, five found it to be of some importance, and six found it to be of considerable importance. One teacher who was strongly opposed to using nativeness as an assessment criterion said:

I don’t kind of expect Norwegian students to be native speakers. I don’t think it is ethical for the teachers to give a lower grade just because the student hasn’t got a proper British accent, or an American accent. It is my personal opinion.

Conversely, a teacher who was firmly in favour of judging performance against a native speaker norm said:

The closer you get to a 6, the closer your pronunciation should be to a native speaker accent.

Additionally, we found statements from three informants which were ambiguous. The following quote gives an illustration:

Informant: Anyway, I think it is quite o.k. that [the students] don’t speak perfect British English or American English, since English has become sort of a global language. This means that we must accept that pronunciation has been localized in different parts of the world.

Interviewer: But […] would you say that it would be important to have a native speaker accent in order to obtain a 6?


Informant: If you do, that’s the best thing, but if you don’t … I don’t think it’s a must.

Here the informant echoed proponents of English as a Lingua Franca who argue that different local varieties of English must be accepted. This view typically involves acceptance of nonnative speaker English as a variety in its own right (Jenkins et al., 2011; Seidlhofer, 2011). At the same time, however, the informant maintained that a native speaker accent is preferable at the top end of the scale, although it is not a “must”. We will return to this somewhat incongruous position in our discussion below.

Questionnaire analysis

With regard to the questionnaire, three items attempted to measure the respondents’ orientations towards nativeness. The responses were measured on a five-point Likert scale, ranging from 1 (“completely disagree”) to 5 (“completely agree”). A reliability analysis of these items yielded a Cronbach’s alpha of α = .77, which is acceptable. In Table 1 these answers have been summarized.

Table 1. Findings from questionnaire: Rank order of responses to items measuring nativeness (n=46).

Rank order | Item | Md | M | SD | Skew
1 | A strong Norwegian accent will mark the student down from a top score. | 3 | 3.11 | 1.27 | -.14
2 | A student may go from 5 to 6 if he or she has a near native-speaker accent. | 3 | 2.93 | 1.23 | -.06
3 | Good native speaker accent is important in order to obtain a top score. | 3 | 2.78 | 1.02 | -.49

The response categories ranged from 1 (“completely disagree”) to 5 (“completely agree”).

The use of both median (Md) and mean (M) to represent central tendency in Table 1 reflects the fact that the median is usually regarded as the more appropriate measure with ordinal data (e.g. in Likert-scale responses such as the ones used here). However, with responses that are close to a normal distribution, it is relevant to report results from ordinal data in terms of means and standard deviation (Sullivan & Artino, 2013). As can be seen from the low to moderate skewness measures, the responses were not too far from a normal distribution (apart

from “Good native speaker accent is important”). The findings suggest that the teachers, on average, were moderately sceptical about using a native speaker standard as a criterion for assessment. However, the fairly high standard deviation (SD) scores indicate that the teachers disagreed on this issue. The answers to the first item are a case in point, as illustrated in Figure 1.

Figure 1. The distribution of responses to the item: “A strong Norwegian accent will mark the student down from a top score” (n=46).

As Figure 1 shows, the teachers’ responses to this item varied extensively: six respondents strongly disagreed that nativeness matters, seven strongly agreed that it does, eight moderately disagreed and 11 moderately agreed, whereas the largest group, 13 respondents, neither agreed nor disagreed. Thus, the results from the questionnaire analysis support the findings from the interviews.


RQ2: To what extent do the teachers see intelligibility as an important criterion in the assessment of pronunciation?

Interview analysis

RQ2 regarding intelligibility was also addressed by analysing both the interviews and the questionnaire data. The interview analysis using deductive provisional coding showed that 11 informants used phrases such as “clear pronunciation”, “comprehensible speech” etc. in the discussion of general criteria. One example is the following answer to the question of whether a native speaker accent is required for a top score:

I think native speaker is too… well, very few teachers are approaching native speaker level, so I don’t think that is what we are aiming for. But pronunciation must be very good and clear. It should be easy to understand, and, then, one should avoid those kinds of common mistakes. It is possible to have proper pronunciation without sounding completely native.

In this quote the view that pronunciation “should be easy to understand” points to an orientation towards intelligibility. It is worth noting that three of the four interview informants who insisted that nativeness was not relevant at all stressed the importance of intelligibility. The following quote, from an American teacher, is a case in point:

No, I’m not concerned with accent and pronunciation. I don’t expect [the students] to imitate and copy me, my own dialect and way of speaking. But the words are to be pronounced in a way that is readable… or… understandable.

Questionnaire analysis

Data from the questionnaire turned out to support the findings from the interviews. The item designed to measure orientations towards intelligibility gave clear indications of the teachers’ positions in this regard. Figure 2 illustrates this finding.


Figure 2. The distribution of responses to the question: “If it is difficult to understand what the student says, I will automatically mark him/her down from a 6” (n=46).

Figure 2 shows that as many as 23 teachers strongly agreed that top-scoring students should be penalized for being difficult to understand. The median value (Md = 5) further suggests that intelligibility was seen as a very important assessment criterion by these respondents. This strongly supports the findings from the interviews that understanding is important in the scoring of pronunciation.

RQ3: To what extent do the teachers see segmentals, word stress, sentence stress and intonation as important in the assessment of pronunciation?

RQ3 was only answered with data from the questionnaire. Eight items were created in order to examine the teachers’ attitudes towards these aspects. A reliability analysis of these items yielded a Cronbach’s alpha of α = .81, which may be regarded as good. The first four items were designed to measure the respondents’ general views on the importance of these features for a top score. Table 2 gives an overview of the responses.


Table 2. Findings from questionnaire: Rank order of responses to the items measuring general orientations towards segmentals, word stress, sentence stress and intonation (n=46).

Rank order | Item | Md | M | SD | Skew
1 | Correct pronunciation of segmentals is important in order to achieve a top score. | 5 | 4.46 | .69 | -.90
2 | Correct word stress is important in order to obtain a top score. | 4 | 4.02 | .80 | -.06
3 | Correct sentence stress is important in order to obtain a top score. | 4 | 3.67 | 1.06 | -.72
4 | Intonation should be assessed at the oral exam. | 3 | 3.07 | .61 | .58

The response categories ranged from 1 (“completely disagree”) to 5 (“completely agree”).

As can be seen in Table 2, some of the skewness values were considerably higher than for the items in Table 1, indicating that these responses were further from a normal distribution. Hence, the means and standard deviations for these responses must be interpreted with caution. Overall, the teachers on average strongly agreed that the correct pronunciation of segmentals is important (Md = 5; M = 4.46), and the fairly low standard deviation of .69 denotes relatively little disagreement in these responses. The teachers also found word stress and sentence stress to be relevant, the former turning out to be slightly more important (M = 4.02) than the latter (M = 3.67). However, the standard deviations pointed to somewhat more disagreement (SD = .80 and SD = 1.06, respectively). In terms of intonation, an error in the questionnaire meant that the operationalization of this construct, i.e. “monotonous speech pattern”, was excluded for this particular item (cf. the corresponding item in Table 3). Thus, the results may be less informative than for the other items. The median (Md = 3) and mean (M = 3.07) values show that the respondents on average have answered that they neither “strongly disagree” nor “strongly agree” that this feature should be assessed at the oral exam. The rather low standard deviation (SD = .61) also implies little disagreement in the responses. In order to further corroborate these findings, we added a set of four items tapping into the teachers’ orientations towards the potential consequences of making errors related to these

four pronunciation features. Each of these items corresponded to the four items reported in Table 2. Table 3 gives an overview of the responses.

Table 3. Findings from questionnaire: Rank order of items measuring potential consequences of making errors related to segmentals, word stress, sentence stress, and intonation (n=46).

Rank order | Item | Md | M | SD | Skew
1 | More than five segmental errors could lower the score from 6 to 5. | 4 | 3.76 | 1.42 | -.98
2 | A monotonous speech pattern could lower the score from a 6 to a 5. | 4 | 3.63 | .97 | -.08
3 | More than five word stress errors could lower the score from 6 to 5. | 4 | 3.46 | 1.19 | -.56
4 | More than five sentence stress errors could lower the score from 6 to 5. | 4 | 3.33 | 1.26 | -.32

The response categories ranged from 1 (“completely disagree”) to 5 (“completely agree”).

The results in Table 3 support the findings in Table 2, apart from intonation, i.e. “monotonous speech pattern”, which is ranked higher here. Even though the median values for all the items are equal (Md = 4), the results suggest that segmental errors (M = 3.76) were seen as slightly more important than intonation (M = 3.63), word stress errors (M = 3.46) and sentence stress errors (M = 3.33). However, the moderate to high skewness values for “word stress” (Skew = -.56) and “segmental errors” (Skew = -.98) indicate that the mean and standard deviation values for these items must be interpreted with caution. Bearing this in mind, we find it interesting that all the standard deviation values reported in Table 3 are higher than for the corresponding items in Table 2, something which indicates less agreement among the teachers. For example, on the item measuring segmentals, 19 respondents “completely agreed” that five such errors would lower the score, whereas as many as seven respondents “completely disagreed”. This disagreement does not, however, disprove the claim that segmentals are an important pronunciation feature, only that many teachers do not find five such errors to be serious enough to mark a student down from a top score.


Beyond the reported findings, a final analysis was undertaken to explore possible relationships between rater background variables and the interview and questionnaire responses. Here we found significant positive correlations between teaching experience and “segmental mistakes” (r = .39, p < .01), “word stress mistakes” (r = .41, p < .01) and “sentence stress mistakes” (r = .39, p < .05). Similarly, we found significant positive correlations between rater experience and “monotonous speech patterns” (r = .30, p < .05), “word stress” (r = .36, p < .05), “word stress mistakes” (r = .35, p < .05), “sentence stress” (r = .35, p < .05) and “sentence stress mistakes” (r = .30, p < .05). These findings suggest that the more experienced teachers are somewhat more concerned with these pronunciation features than the less experienced teachers. Apart from these relationships we found no other significant correlations.
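The type of correlational analysis reported in this paragraph can be illustrated with a brief, hypothetical sketch; the numbers are invented, and SciPy’s pearsonr is used merely as an open-source stand-in for the statistical package actually used in the study.

```python
# Illustrative sketch (invented data): correlating years of teaching experience
# with agreement ratings (1-5) on a single questionnaire item.
from scipy.stats import pearsonr

experience = [2, 5, 8, 10, 12, 15, 20, 25, 3, 7, 18, 30]   # years of teaching experience
ratings    = [2, 3, 3, 4, 3, 4, 5, 4, 2, 3, 4, 5]           # Likert responses to one item

r, p = pearsonr(experience, ratings)
print(f"r = {r:.2f}, p = {p:.3f}")
```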

Discussion

In terms of RQ1, the results in this study clearly show that the teachers disagreed on the question of nativeness. On the two items measuring nativeness in the questionnaire, 18 and 14 respondents, respectively, chose the mid-point option. This choice could reflect moderate agreement with these items, but it could also indicate uncertainty (Dubois & Burns, 1975). Findings from the interview data substantiate this interpretation, such as the somewhat contradictory statement in the interview excerpt cited above. As could be seen there, the informant reflected arguments against nativeness typically voiced in the field of English as a Lingua Franca, while at the same time preferring a native speaker accent for a top-scoring performance.

As regards RQ2, the teachers strongly agreed that intelligibility is important. Both the data from the interviews and the questionnaire confirmed this. The mean scores for the two items measuring this feature in the questionnaire were the highest of all the items, and the standard deviations were moderate to low. These findings were not unexpected, as understanding obviously is a prerequisite for success in a communicative language paradigm.

As for RQ3, we found that the respondents, on average, moderately to strongly agreed that segmentals, word stress and sentence stress are important. Of these three features segmentals received the highest scores. With regard to intonation, the results are somewhat less clear. As the first item measuring this construct (i.e. “Intonation should be assessed at the oral exam”) was not properly operationalized in the questionnaire, it is difficult to compare it

with the corresponding items for segmentals, word stress and sentence stress. The high number of teachers scoring this item mid-point (33 respondents) also leaves it unclear whether they are moderately oriented towards the construct, or whether they are unsure of how to relate to it. Still, the fact that the majority of these teachers have completed courses in English phonology, in which intonation is typically defined as pitch variation, suggests that they have an understanding of the construct similar to our definition given above. Consequently, the average score for this item indicates that the teachers are only moderately oriented towards intonation. A comparison of this result with the scores for the second item (“A monotonous speech pattern could lower the score from a 6 to a 5”) does not really clarify the issue. The mean score for this item was in fact higher than for word stress and sentence stress (cf. Table 3) and suggests that the teachers to some extent see monotonous speech as potentially problematic for the achievement of a 6. Yet, it may be argued that “monotonous speech pattern” is not a successful operationalization of the intonation construct. It could be said that native speakers may also speak in a monotonous way, something which could make listening strenuous. Therefore, it might be that this item did not tap very well into the teachers’ orientations towards intonation.

The analysis of relationships between rater background variables and the teachers’ responses indicated that the more experienced teachers were slightly more focused on segmentals, word stress and sentence stress than the less experienced teachers. One might speculate that the more experienced raters would be more concerned with nativeness as well, in view of the fact that nativeness used to be the norm in the past (Cook, 1999). However, no relationship between rater experience and nativeness was found.

It must be pointed out that most of the pronunciation features in the questionnaire relate to performance at the top end of the scale. Hence, this data does not uncover teacher orientations towards those aspects at the middle or lower end of the continuum. Still, data from the interviews indicates that the teachers are concerned with various features of pronunciation also at the mid- and lower levels, something which has been confirmed in other studies (Bøhn, 2015; Pollitt & Murray, 1996; Yildiz, 2011). This is also in agreement with studies showing that pedagogically and linguistically untrained native speaker listeners attend to features of pronunciation at all proficiency levels when listening for meaning (Saito et al., 2015).

Turning to Seidlhofer’s (2011) claim that the discourse concerning nativeness has changed, whereas practices have not (cf. above), we have found that a number of teachers in

our study are either strongly oriented towards nativeness or display ambivalence towards this construct. In addition, there are several teachers who are clearly opposed to using a native speaker standard for judging pronunciation. Although we do not have any longitudinal data to support or refute Seidlhofer’s claim, the variability in teacher orientations is a potential threat to the validity of the scoring outcomes. We therefore believe that steps should be taken to clarify the criteria that the teachers use.

However, a problem in this regard seems to be that the discourse itself is somewhat ambivalent. Not only are there different suggestions for which speaker model to use (Deterding, 2013; Trudgill, 2005), curricula may also be perceived as ambiguous. The opacity of the English subject curriculum used in Norway has already been mentioned. Moreover, the contention that nativeness and intelligibility are contradictory principles (Levis, 2005) may also add to the confusion, as a number of suggestions for intelligibility typically take native speaker norms as a starting point. Thus, these principles seem to be interrelated, rather than contradictory. In addition, although a number of pronunciation specialists agree that intelligibility rather than nativeness should be in focus, the intelligibility construct is elusive, and it is therefore difficult to make strong recommendations for which linguistic features to prioritize (cf. Harding, 2013; Munro & Derwing, 2015). This point is further complicated by the fact that different language backgrounds may require different foci (Crowther et al., 2015). Moreover, the question of “intelligible to whom” has not yet been properly resolved (Isaacs, 2014). The aspects suggested in this study, however, are features that a number of specialists agree on across contexts and should therefore be considered for inclusion in rating scales.

Conclusion and implications

Overall, this study showed that the teachers in the two study cohorts were strongly oriented towards intelligibility and that they disagreed on the relevance of nativeness. As for the salience of individual pronunciation features, the teachers found segmental features to be the most prominent, followed by word stress and sentence stress. Their attitudes towards intonation were somewhat less clear, but the analysis suggests that they were either not as concerned with this feature or unsure of how to relate to it.

Three limitations of this study should be noted. Firstly, the teacher samples were small and purposefully selected. Thus, it is problematic to generalize the results to other

EFL/ESL contexts. Still, they point to important tendencies in teacher orientations and support earlier studies concerning nativeness (Hansen, 2011; Sannes, 2013; Østensen, 2013) and pronunciation assessment (Bøhn, 2015; Yildiz, 2011). Secondly, this study has only examined orientations towards linguistic pronunciation features in isolation. Hence, we do not know how the teachers would react to combinations of features (cf. Analytical framework, above). Finally, as the stimulus presented to the teachers consisted of only one video-taped student performance, it could be argued that the teachers were unduly influenced by the language idiosyncrasies of this student.

Still, the fact that this study has shown variation in teachers’ orientations towards the assessment of some of these pronunciation features implies a threat to the validity of the scoring outcomes. Consequently, in the Norwegian context, we recommend that steps be taken to clarify the assessment criteria. We also believe that a common rating scale at the national level should be considered as a means of improving rater reliability (cf. e.g. Fulcher, 2012). More broadly, we find that, if intelligibility is to be the primary assessment criterion in EFL/ESL pronunciation assessment, it is important that those phonological features that are important for intelligibility are made clear to raters. Until otherwise has been established in empirical studies, we suggest that the features discussed here, such as segmentals involving high functional load, word stress and sentence stress, are singled out as salient criteria. As for intonation, research findings are less clear, but there seems to be more evidence in favour of including it than excluding it, at least in more traditional EFL/ESL contexts.

Finally, there is a need for more research. In terms of rater orientations, it would be particularly relevant to look more closely into EFL teachers’ attitudes towards the assessment of intonation, as well as into their orientations towards combinations of pronunciation features. In addition, in order to strengthen the validity of the claims made for intelligibility, more research into this construct should be undertaken. Such research should also attempt to clarify the interrelationship between nativeness and intelligibility, particularly concerning the notions of “correctness” and “error”.


References Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press. Baker, A. (2013). Integrating fluent pronunciation use into content-based ESL instruction: Two case studies. In J. Levis & K. LeVelle (Eds.), Proceedings of the 4th Pronunciation in Second Language Learning and Teaching Conference. Aug. 2012 (pp. 245-254). Ames, IA: Iowa State University. Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks. (TOEFL monograph series. MS - 29). Princeton, NJ: Educational Testing Service. Bøhn, H. (2015). Assessing spoken EFL without a common rating scale: Norwegian EFL teachers’ conceptions of constructs. Sage Open. October-December 2015. Retrieved from http://sgo.sagepub.com/content/spsgo/5/4/2158244015621956.full.pdf. Catford, J. C. (1987). Phonetics and the teaching of pronunciation: A stystemic description of English phonology. In J. Morley (Ed.), Current perspectives on pronunciation: Practices anchored in theory (pp. 87-100). Washington, DC: TESOL. Cook, V. (1999). Going beyond the native speaker in language teaching. TESOL Quarterly, 33(2), 185-209. Creswell, J. W. (2013). Qualitative inquiry & research design: Choosing among five approaches. Los Angeles: Sage. Creswell, J. W., Plano Clark, V. L., & Garrett, A. L. (2008). Methodological issues in conducting mixed methods research designs. In M. M. Bergman (Ed.), Advances in mixed methods research (pp. 66-83). Thousand Oaks, CA: Sage. Crowther, D., Trofimovich, P., Saito, K., & Isaacs, T. (2015). Second language comprehensibility revisited: Investigating the effects of learner background. TESOL Quarterly, 49(4), 814-837. Dauer, R. M. (2005). The Lingua Franca Core: A new model for pronunciation instruction? TESOL Quarterly, 39(3), 543-550. Derwing, T., & Munro, M. J. (1997). Accent, inntelligibility, and comprehensibility: Evidence from four L1s. Studies in Second Language Acquisition, 20, 1-16. Derwing, T., & Munro, M. J. (2005). Second language accent and pronunciation teaching: A research-based approach. TESOL Quarterly, 39(3), 379-397. Derwing, T., & Munro, M. J. (2009). Putting accent in ints place: Rethinking obstacles to communication. Language teaching, 42(4), 476-490. Deterding, D. (2010). ELF-based pronunciation teaching in China. Chinese Journal of Applied Linguistics, 33(6), 3-15. Deterding, D. (2013). Pronunciation models. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 4722-4725). Chichester: Wiley-Blacwell. Dubois, B., & Burns, J. A. (1975). An analysis of the meaning of the question mark response category in attitude scales. Educational and Psychological Measurement, 35(4), 869884. doi:10.1177/001316447503500414. Education First. (2015). EF English Proficiency Index. ETS. (2009). The official guide to the TOEFL test (3rd ed.). New York: McGraw-Hill. Fayer, J. M., & Krasinski, E. (1987). Native and nonnative judgments of intelligibility and irritation. Language Learning, 37(3), 313-326. Field, J. (2005). Intelligibility and the listener: The role of lexical stress. TESOL Quarterly, 39(3), 399-423. doi:10.2307/3588487. Fulcher, G. (2012). Scoring performance tests. In G. Fulcher & F. Davidson (Eds.), The routledge handbook of language testing (pp. 378-392). Oxford: Routledge. 22

Giles, H. (1979). Ethnicity markers in speech. In H. Giles & K. R. Scherer (Eds.), Social markers in speech (pp. 251-290). Cabridge: Cambridge University Press. Graddol, D. (2006). English next: Why global English may mean the end of 'English as a Foreign Language'. London: The British Council. Hahn, L. D. (2004). Primary stress and intelligibility: Research to motivate the teaching of suprasegmentals. TESOL Quarterly, 38(2), 201-223. doi:10.2307/3588378. Hammersley, M. (2008). Troubles with triangulation. In M. M. Bergman (Ed.), Advances in mixed methods research. Thousand Oaks, CA: Sage. Hansen, T. (2011). Speaker models and the English classroom: The impact of the intercultural-speaker teaching model in Norway. (Unpublished master's thesis). Østfold University College, Halden, Norway. Harding, L. (2013). Pronunciation assessment. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 4708-4713). Chichester: Wiley-Blackwell. IELTS. (2008). IELTS speaking test: Instructions to IELTS examiners. Cambridge: IELTS. Isaacs, T. (2014). Assessing pronunciation. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 1, pp. 140-155). Chichester, UK: Wiley Blackwell. Jenkins, J. (2000). The phonology of English as an international language: New models, new norms, new goals. Oxford: Oxford University Press. Jenkins, J. (2002). A sociolinguistically based, empirically researched pronunciation syllabus for English as an international language. Applied Linguistics, 23, 83-103. Jenkins, J. (2009). (Un)pleasant? (In)correct? (Un)intelligible? ELF speakers’ perceptions of their accents In A. Mauranen & E. Ranta (Eds.), English as a Lingua Franca: Studies and findings. (pp. 10-36). Newcastle: Cambridge Scholars Publishing. Jenkins, J., Cogo, A., & Martin, D. (2011). Review of developments in research into English as a lingua franca. Language teaching, 44(3), 281-315. Kachru, B. B. (1985). Standards, codification and sociolinguistic realism: The English language in the outer circle. In R. Quirk & H. G. Widdowson (Eds.), English in the World: Teaching and learning the language and literatures (pp. 11-30). Cambridge: Cambridge University Press. Kang, O. (2010). Relative salience of suprasegmental features on judgments of L2 comprehensibility and accentedness. System, 38(2), 301-315. doi:http://dx.doi.org/10.1016/j.system.2010.01.005. Kirkpatrick, A. (2007). World Englishes: Implications for international communication and English language teaching. Cambridge: Cambridge University Press. Kirkpatrick, A., Deterding, D., & Wong, J. (2008). The international intelligibility of Hong Kong English. World Englishes, 27(3/4), 359-377. doi:10.1111/j.1467971X.2008.00573.x. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. doi:10.2307/2529310. Levis, J. M. (2005). Changing contexts and shifting paradigms in pronunciation teaching. TESOL Quarterly, 39(3), 369-377. Levis, J. M. (2006). Pronunciation and the assessment of spoken language. In R. Hughes (Ed.), Spoken English, TESOL and applied linguistics: Challenges for theory and practice (pp. 245-269). Basingstoke: Palgrave MacMillan. MacDonald, S. (2002). Pronunciation - views and practices of reluctant teachers. Prospect, 17(3), 3-18. Meld. St. nr. 30. (2004). Kultur for læring [Culture for learning]. Oslo: Norwegian Ministry of Education and Research. 
Retrieved from https://www.regjeringen.no/contentassets/988cdb018ac24eb0a0cf95943e6cdb61/no/pdfs/stm200320040030000dddpdfs.pdf.

Miles, M. B., Huberman, A. M., & Saldaña, J. (2014). Qualitative data analysis: A methods sourcebook. Los Angeles: Sage. Munro, M. J., & Derwing, T. (1999). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 49, 285-310. doi:10.1111/0023-8333.49.s1.8. Munro, M. J., & Derwing, T. (2006). The functional load principle in ESL pronunciation instruction: An exploratory study. System, 34, 520-531. Munro, M. J., & Derwing, T. (2011). The foundations of accent and intelligibility in pronunciation research. Language teaching, 44(3), 316-327. doi:10.1017/S0261444811000103. Munro, M. J., & Derwing, T. M. (2015). Intelligibility research and practice: Teaching priorities. In M. Reed & J. M. Levis (Eds.), The Handbook of English Pronunciation (pp. 377-396). [Wiley Ebook]. Retrieved from http://site.ebrary.com/lib/hiof/reader.action?docID=11053049. Munro, M. J., Derwing, T. M., & Thomson, R. I. (2015). Setting segmental priorities for English learners: Evidence from a longitudinal study. International Review of Applied Linguistics in Language Teaching (IRAL), 53(1), 39-60. Nilsen, T. S., & Rugesæter, K. N. (2015). English phonetics for teachers (3rd ed.). Bergen: Fagbokforlaget. Norwegian Ministry of Education and Research [KD]. (2006/2013). Læreplan i engelsk [English Subject Curriculum]. Oslo: Author. Retrieved from http://data.udir.no/kl06/ENG1-03.pdf?lang=eng. Pickering, L. (2001). The role of tone choice in improving ITA communication in the classroom. TESOL Quarterly, 35(2), 233-255. doi:10.2307/3587647. Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th language research testing colloquium, Cambridge. Cambridge: Cambridge University Press. Rindal, U. (2010). Constructing identity with L2: Pronunciation and attitudes among Norwegian learners of English. Journal of Sociolinguistics, 14(2), 240-261. doi:10.1111/j.1467-9841.2010.00442.x. Saito, K., Trofimovic, P., & Isaacs, T. (2015). Second language speech production: Investigating linguistic correlates of comprehensibility and accentedness for learners at different ability levels. Applied Psycholinguistics, 1-24. doi: 10.1017 / S0142716414000502. Sannes, M. T. (2013). From the native speaker norm towards English as an international language: A study of exposure and attitudes to native and non-native varieties in the teaching of English in Norway. (Unpublished master's thesis). University of Bergen, Bergen, Norway. Retrieved from https://bora.uib.no/bitstream/handle/1956/6669/106843915.pdf?sequence=1. Seidlhofer, B. (2001). Closing a conceptual gap: The case for a description of English as a Lingua Franca. International Journal of Applied Linguistics, 11(2), 133-158. doi:10.1111/1473-4192.00011. Seidlhofer, B. (2011). Understanding English as a Lingua Franca. Oxford: Oxford University Press. Singleton, D. (2005). The Critical Period Hypothesis: A coat of many colours. International Review of Applied Linguistics in Language Teaching (IRAL), 43(4), 269-285. Smith, L. E., & Nelson, C. L. (1985). International intelligibility of English: Directions and resources. World Englishes, 4(3), 333-342. doi:10.1111/j.1467-971X.1985.tb00423.x.


Sullivan, G. M., & Artino, A. R. (2013). Analyzing and interpreting data from Likert-type scales. Journal of Graduate Medical Education, 5(4), 541-542.
Taylor, L., & Galaczi, E. (2011). Scoring validity. In L. Taylor (Ed.), Examining speaking: Research and practice in assessing second language speaking (Vol. 30, pp. 171-233). Cambridge: Cambridge University Press.
Trudgill, P. (2005). Native-speaker segmental phonological models and the English Lingua Franca Core. In K. Dziubalska-Kołaczyk & J. Przedlacka (Eds.), English pronunciation models: A changing scene (pp. 77-98). Bern: Peter Lang.
Underhill, A. (2005). Sound foundations: Learning and teaching pronunciation. Oxford: Macmillan.
Van Els, T., & De Bot, K. (1987). The role of intonation in foreign accent. The Modern Language Journal, 71(2), 147-155. doi:10.2307/327199.
Vurderingsforskriften. (2006/2009). Forskrift til Opplæringslova [Regulations to the Education Act]. Retrieved from https://lovdata.no/dokument/SF/forskrift/2006-06-23724.
Walker, R., & Zoghbor, W. (2015). The pronunciation of English as a Lingua Franca. In M. Reed & J. M. Levis (Eds.), The handbook of English pronunciation (pp. 433-453). [Wiley Ebook]. Retrieved from http://site.ebrary.com/lib/hiof/reader.action?docID=11053049.
Winters, S., & O'Brien, M. G. (2013). Perceived accentedness and intelligibility: The relative contributions of F0 and duration. Speech Communication, 55(3), 486-507. doi:10.1016/j.specom.2012.12.006.
Yildiz, L. M. (2011). English VG1 level oral examinations: How are they designed, conducted and assessed? (Unpublished master's thesis). University of Oslo, Oslo, Norway. Retrieved from https://www.duo.uio.no/bitstream/handle/10852/32421/YildizMaster.pdf?sequence=2&isAllowed=y.
Zielinski, B. W. (2008). The listener: No longer the silent partner in reduced intelligibility. System, 36(1), 69-84.
Østensen, H. G. (2013). Teacher cognition and the acquisition and teaching of pronunciation. (Unpublished master's thesis). Bergen University College, Bergen, Norway.


Appendix A

Interview informants: Background characteristics and scores awarded

No. | Age | Gender | L1 | Education | Teaches at study programme | Score given
1 | 39 | Male | English | Master | Both GSP and VSP | 3
2 | 57 | Female | Norwegian | Bachelor | VSP only | 4
3 | 57 | Male | Norwegian | Bachelor | Both GSP and VSP | 3
4 | 29 | Male | Norwegian | Master | Mainly GSP | 3
5 | 48 | Male | Norwegian | Bachelor | Both GSP and VSP | 3
6 | 35 | Female | Norwegian | Bachelor | Mainly GSP | 3
7 | 29 | Female | Norwegian | Master | Both GSP and VSP | 3
8 | 42 | Male | Norwegian | Bachelor | Mainly GSP | 4
9 | 41 | Female | Russian | Master | GSP only | 3
10 | 59 | Male | Norwegian | Master | Mainly GSP | 3
11 | 55 | Female | Swedish | Bachelor | GSP only | 2
12 | 28 | Female | Norwegian | Master | GSP only | 2
13 | 39 | Female | Norwegian | Master | GSP only | 3
14 | 55 | Male | English | Master | VSP only | 3
15 | 36 | Female | Finnish | Master | GSP only | 3
16 | 58 | Female | Norwegian | Bachelor | VSP only | 4
17 | 38 | Female | Norwegian | Bachelor | GSP only | 2
18 | 36 | Male | Norwegian | Master | Both GSP and VSP | 3
19 | 41 | Male | Norwegian | Bachelor | VSP only | 3
20 | 54 | Male | Norwegian | Master | Both GSP and VSP | 4
21 | 35 | Female | Mandarin | Master | Mainly VSP | 4
22 | 35 | Male | English | Doctor | Mainly VSP | 3
23 | 34 | Female | Romanian | Master | Mainly VSP | 3
24 | 47 | Male | Norwegian | Bachelor | VSP only | 4
n=24 | | | | | | M=3.13

Note: GSP = General studies programme; VSP = Vocational studies programmes. The scores ranged from 1 («fail») to 6 («excellent»).


Appendix B

Questionnaire respondents: Background characteristics and scores awarded

No. | Gender | L1 | Teaching experience | Rater experience | Score given
1 | Female | Norwegian | 16 years or more | 6 exams or more | 3
2 | Male | Swedish | 16 years or more | 6 exams or more | 4
3 | Female | Norwegian | 0-1 years | None | 4
4 | Female | Norwegian | 2-5 years | 1-2 exams | 3
5 | Female | Norwegian | 11-15 years | 1-2 exams | 4
6 | Female | Norwegian | 6-10 years | 6 exams or more | 3
7 | Male | English | 2-5 years | None | 3
8 | Male | Norwegian | 2-5 years | 6 exams or more | 3
9 | Female | Norwegian | 0-1 years | None | -
10 | Female | Norwegian | 11-15 years | 6 exams or more | 3
11 | Female | Norwegian | 2-5 years | 3-5 exams | 3
12 | Male | English | 6-10 years | 3-5 exams | 3
13 | Female | Norwegian | 2-5 years | None | 3
14 | Female | Norwegian | 11-15 years | 1-2 exams | 3
15 | Female | Norwegian | 0-1 years | None | 4
16 | Male | Norwegian | 16 years or more | 6 exams or more | 3
17 | Female | Norwegian | 16 years or more | 6 exams or more | 3
18 | Female | Norwegian | 6-10 years | None | 3
19 | Male | Norwegian | 16 years or more | 6 exams or more | 3
20 | Female | Norwegian | 16 years or more | - | 3
21 | Female | Norwegian | 6-10 years | 1-2 exams | 4
22 | Male | Swedish | 2-5 years | 1-2 exams | 3
23 | Female | Norwegian | 11-15 years | 3-5 exams | 3
24 | Female | Norwegian | 16 years or more | 3-5 exams | 3
25 | Female | Norwegian | 2-5 years | 3-5 exams | 3
26 | Male | Norwegian | 6-10 years | None | 3
27 | Female | Norwegian | 0-1 years | None | 4
28 | Female | Norwegian | 0-1 years | None | 4
29 | Female | Norwegian | 6-10 years | 1-2 exams | 3
30 | Female | Norwegian | 6-10 years | 6 exams or more | 3
31 | Female | Norwegian | 11-15 years | None | 4
32 | Female | Norwegian | 0-1 years | None | 3
33 | Male | Norwegian | 2-5 years | 1-2 exams | 4
34 | Male | Norwegian | 16 years or more | 6 exams or more | 3
35 | Female | Norwegian | 16 years or more | 6 exams or more | 4
36 | Female | Norwegian | 11-15 years | 6 exams or more | 2
37 | Female | Norwegian | 6-10 years | 6 exams or more | 3
38 | Female | Norwegian | 0-1 years | None | 3
39 | Female | Norwegian | 0-1 years | None | 3
40 | Female | Norwegian | 11-15 years | 3-5 exams | 3
41 | Female | Swedish | 11-15 years | 3-5 exams | 3
42 | Female | Norwegian | 2-5 years | 3-5 exams | 3
43 | Female | Norwegian | 16 years or more | 6 exams or more | 4
44 | Female | Norwegian | 16 years or more | 6 exams or more | 3
45 | Female | French | 6-10 years | 6 exams or more | 4
46 | Male | Norwegian | 6-10 years | 6 exams or more | 4
n=46 | | | | | M=3.24

Note: The scores ranged from 1 («fail») to 6 («excellent»).


Appendix C

INTERVIEW GUIDE – ASSESSING THE GSP1/VSP2 ORAL ENGLISH EXAM

1. Background:
1.1 Age:
1.2 First language:
1.3 Education (English):
1.4 Number of years as a teacher (upper secondary level):
1.5 Experience as examiner (at the GSP1/VSP2 level):
1.6 Has been teaching: GSP ___ VSP ___ Health/social ___
1.7 Worked as a teacher outside your county?
1.8 Attended rater training courses?
1.9 Do you use a written rating scale while rating? If yes, who has developed this scale?

2. How would you assess the performance you have just seen? Which grade would you have given and why? In other words, which criteria would you have applied in the assessment process?
3. Are there any other criteria, which you haven’t applied here, that would be relevant in the general scoring of performance in this exam?
4. Do you score analytically or holistically?
5. Do you compare students when grading?
6. What, in your opinion, does the grade reflect? General English competence, competence relating to vocational English, academic English, or what?
7. How do you understand the concept of “communication”?
8. What would it take to get a top score? What criteria are the most important?
9. Conversely, when will a student fail?
10. What about phonology? Some teachers say that a near-native speaker accent is important in order to get a top score. What is your comment on that?
11. Would you give credit for effort?


Appendix D

Questionnaire: Assessing pronunciation – The English oral exam (vg1/yf2)

This questionnaire is one of several data collection instruments in a PhD project on the assessment of oral English. It will provide important information about English teachers’ assessment of oral English proficiency with regard to pronunciation. The questionnaire is to be answered after you have seen a video-clip of a yf2 vocational student (Health and Social Care programme) who is taking her oral exam.

Which grades would you have awarded the candidate in the presentation sequence and in the interview sequence, and which overall grade would you have given?

Presentation sequence: _______
Interview sequence: _______
Overall grade: _______

Answer the questionnaire by indicating to what extent you agree or disagree with the various claims. Note! This questionnaire concerns the assessment of performance in the oral exam generally, not specifically the candidate you have just seen.

1. Background information

1.1 Gender: Male ❑ Female ❑
1.2 First language: _______________________________
1.3 How much teaching experience do you have at the upper secondary level?
0-1 ys ❑ 2-5 ys ❑ 6-10 ys ❑ 11-15 ys ❑ 16 ys or more ❑
1.4 How many times have you been internal/external examiner at the oral exam at the upper secondary level?
Never ❑ 1-2 times ❑ 3-5 times ❑ 6 times or more ❑
1.5 Do you have any experience as internal/external examiner in oral exams in other subjects or at other levels (lower secondary school or university / university college)?
Yes ❑ No ❑
If yes, which subject(s)? ____________________________________________________
1.6 Have you attended any courses in the assessment of oral English?
Yes ❑ No ❑
If yes, what kind of courses? ___________________________________________________
1.7 How well would you say that your school is working with assessment criteria and the development of a joint assessment culture?
Not at all ❑ To little extent ❑ To some extent ❑ To a large extent ❑ To a very large extent ❑
1.8 To what extent do you consider the competence aims of the subject curriculum when you assess oral English performance?
Not at all ❑ To little extent ❑ To some extent ❑ To a large extent ❑ To a very large extent ❑
1.9 To what extent do you use (or consider) ready-made assessment criteria in the assessment of oral English performance?
Not at all ❑ To little extent ❑ To some extent ❑ To a large extent ❑ To a very large extent ❑
1.10 To what extent do you consider pronunciation to be an important aspect to be assessed in the oral English exam?
Not at all ❑ To little extent ❑ To some extent ❑ To a large extent ❑ To a very large extent ❑

2. Assessing pronunciation

Please respond to the following statements (1 = Completely disagree, 5 = Completely agree):

2.1 Intonation is a performance aspect which should be assessed at the oral exam.
2.2 The correct pronunciation of individual sounds is important in order to achieve a top score (for example /hedeɪk/ and not /hedeɪtʃ/ for «headache»; /wɒtʃɪŋ/ and not /wɒʃɪŋ/ for «watching»).
2.3 More than five pronunciation errors such as the ones mentioned in 2.2 above may lower the score from 6 to 5.
2.4 Good “native speaker”-accent is important in order to achieve a top score.
2.5 A monotonous speech pattern could lower the score from a 6 to a 5.
2.6 Correct word stress is important in order to obtain a top score (for example, “deVElop” rather than “DEvelop”).
2.7 More than five word stress errors, as mentioned in 2.6 above, could lower the score from a 6 to a 5.
2.8 Correct sentence stress is important in order to obtain a top score (for example, «I smoke more than YOU do» rather than «I smoke more than you DO»).
2.9 More than five sentence stress errors, as mentioned in 2.8 above, could lower the score from a 6 to a 5.
2.10 Clarity of speech is important in order to obtain a top score.
2.11 If it is difficult to understand what the student says, I will automatically mark him/her down from a 6.
2.12 A strong Norwegian accent will mark the student down from a top score.
2.13 A student may go from 5 to 6 if he or she has a near native-speaker accent.

2.15 Are there any other salient criteria in terms of pronunciation? __________________________________________________________________________________ __________________________________________________________________________________ __________________________________________________________________________________ __________________________________________________________________________________ __________________________________________________________________________________

2.16 Comments / Other important issues: _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________


2.17 Would you say that some of the criteria mentioned in 2.1-2.14 are more important than others? Yes ❑ No ❑ Don’t know ❑

If yes, which three criteria (pauses, pronunciation errors, native-speaker accent etc.) do you see as the most important in the assessment of student performance (i.e. where mistakes related to these criteria would have lowered the score the most)?
1._________________________________________________________
2._________________________________________________________
3._________________________________________________________

2.18 Additional comments _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________ _______________________________________________________________________________________________________________

Thanks for your time!


Assessing content in a curriculum-based EFL oral exam: The importance of higher-order thinking skills

Henrik Bøhn

In this study, data from verbal protocols and semi-structured interviews were analysed to explore Norwegian EFL teacher raters’ (n=10) orientations towards content in an oral English exam at the upper secondary school level. The content construct was mainly analysed in terms of a subject matter dimension and a skills and processes dimension. The investigation included a comparison between the teachers’ orientations towards subject matter and the aspects of subject matter identified in the English subject curriculum. The study found that the teachers generally had a broad understanding of subject matter and that they were rather more concerned with the skills and processes dimension of the construct (e.g. analysis, reflection) than with the subject matter dimension (e.g. news topics, syllabus texts). Moreover, their understanding of subject matter compared fairly well with the subject curriculum, which allows for the assessment of a wide variety of subject matter topics. However, one content component specifically mentioned in the curriculum, i.e. meta-cognitive strategies, was less likely to be heeded by the teachers as a relevant assessment criterion. The study points to the importance of providing guidance for teacher raters in the assessment of content and of alerting students to the significance of higher-order thinking skills in assessment situations at this level.

Key words: EFL, ESL, oral L2 assessment, content, subject matter, higher-order thinking skills

Background

Aspects of content may be said to be involved in all language use (Bachman & Palmer, 2010, p. 41). Despite this, the role of content in second and foreign language instruction and assessment varies substantially from context to context. Historically, it has been considerably downplayed in language assessment, as the primary focus has been on the evaluation of language features. In fact, in some cases it has even been treated as a potential source of language bias (Douglas, 2000, p. 2). More recently, however, the assessment of content has been emphasized in a number of settings, for example in content-based instruction and specific purposes courses (Byrnes, 2008; Snow & Katz, 2014). Overall, language instruction and assessment may thus be regarded as a continuum from language-driven approaches to content-driven approaches (Met, 1998). Evidence suggests that lower proficiency levels, traditional foreign language contexts, and oral assessment are more strongly associated with the language-driven end of the continuum. Higher proficiency levels, content-based language instruction, and written assessment, on the other hand, are more likely to be found on the

content-driven side of the scale (see e.g. Brown, Iwashita, & McNamara, 2005; Pollitt & Murray, 1996; Sato, 2012; Snow & Katz, 2014). The concept of content is somewhat elusive, however. In the language assessment literature it has been related to, or used synonymously with, as diverse terms as “subject matter”, “topic”, “cultural knowledge”, “information content”, “ideas”, and “framing”, to mention a few (Bachman & Palmer, 2010 p. 41; Brown et al., 2005, p. 27; Council of Europe, 2001, p. 68; Kratwohl, 2002, p. 213; Sato, 2012, pp. 223, 226). As for expected test taker response in the content area, the concept has been linked to performance features such as “[task] fulfilment”, “description”, “explanation”, “sophistication”, “meaningfulness”, “accuracy”, “elaboration”, and “development” (Bachman & Palmer; 2010; p. 218; Brown, 2000, p. 63; Douglas, 2000, p. 117; Eckes, 2009, p. 48; Frost et al. 2011, p. 349; Sato, 2012, pp. 225-226). In other words, the concept is multifaceted and complex, and there is evidence that it is not well understood in all contexts (Frost, Elder, & Wigglesworth, 2012). More research has therefore been called for (Snow & Katz, 2014). The context of the present study is curriculum-related English as a Foreign Language (EFL) education at the upper-intermediate proficiency level (CEFR B1/B2) in Norway. This may be said to belong to the middle of the language-content continuum. A defining feature of this context is the lack of a common rating scale to guide teachers in their assessment of oral performance. Little empirical evidence exists to describe how content is assessed in this setting. A notable exception is Bøhn (2015), which used semi-structured interviews (n=24) to investigate Norwegian EFL teachers’ general assessment orientations in the oral English exam. The results showed that content was one of the aspects which caused the most variability among the teachers. Some of them gave it considerable attention, whereas others largely downplayed it. Overall, however, the study showed that the majority of the teachers saw content as consisting of the following four aspects: (i)

addressing task or problem statement;
(ii) elaborated response;
(iii) content structure;
(iv) a Bloom-like taxonomy of reproduction, comprehension, application, analysis and reflection.

As the study found differences in rater orientations between teachers from different study programmes, it was speculated that the teachers employed in the academic, general studies programme were used to working with more proficient students than teachers in the

vocational studies programme and that they therefore were more focused on content aspects. These results were not conclusive, however, and it was suggested that more research should be carried out in order to further explore the teachers’ operationalization of the content construct in this setting. In addition to Bøhn (2015), another investigation relevant for the present study is Brown et al. (2005). Although the context of this study was English for academic purposes (EAP) rather than curriculum-related EFL education, it is relevant in the sense that it investigated rater orientations in a content-oriented test situation where no rating scale existed. The authors used verbal-report methodology to examine the rating orientations of 10 EAP specialists. The raters were asked to report on which performance features they paid attention to in 40 audio-recorded student performances in a pilot TOEFL test. The results showed that content was a major focus in the raters’ reports (together with linguistic resources, phonology and fluency). More specifically, content was associated with “task fulfillment”, “ideas” and “framing”. Task fulfilment indicated the degree to which the test takers were “on topic” or “addressed the question”. Ideas referred to “the amount of speech produced”, the ability of the candidates to respond to the “functional demands” of the task, as well as the “sophistication” and “relevance” of the responses. Finally, framing related to “the generic structure of test-takers’ responses in terms of an introduction and a conclusion” (pp. 27-30). A comparison of Bøhn (2015) with Brown et al. (2005) shows several similarities. Table 1 gives an overview.

Table 1. Raters’ conceptions of content in Bøhn (2015) and Brown et al. (2005).

Bøhn (2015) | Brown et al. (2005)
Addressing task or problem statement | Task fulfilment
Elaborated response | Ideas: amount of speech produced; response to functional demands; sophistication (independent); relevance
Bloom-like taxonomy of reproduction, comprehension, application, analysis and reflection | –
Content structure | Framing


The comparison of findings illustrated in Table 1 shows that Addressing task or problem statement in Bøhn parallels Task fulfillment in Brown et al. Moreover, Bøhn’s concept of Elaborated response has affinities with Brown et al.’s notion of Ideas, although Ideas is a broader category involving more features. Finally, the notion of Content structure in Bøhn has similarities with Framing in Brown et al. However, Bøhn’s category involving the Bloom-like taxonomy is less conspicuous in Brown et al.’s findings. Still, one may consider that Brown et al.’s subcategory Sophistication is partly covering the same construct, in the sense that a sophisticated answer would presumably require the ability to analyse and reflect on the issue being discussed. Another interesting similarity between these two studies is the fact that the categories developed mainly describe the skills or abilities involved in the handling of content. Except for the general references to “task”, “ideas” and “problem statement”, very little is said about what should be tested. In Brown et al.’s (2005) study this may come as no surprise. As the test takers were responding to a proficiency test (Douglas, 2010, pp. 1-2), they were not judged on their knowledge of specific EAP topics, but rather on their language abilities and their capacity to handle whatever topic was under discussion. However, in Bøhn (2015) the oral exam used as a basis for the inquiry may be classified as an achievement test, as it is based on a subject curriculum (Bachman & Palmer, 2010, p. 213). Consequently, one may expect raters in such settings to comment on specific subject matter. No such comments were reported, however. Against this background the present study investigates Norwegian EFL teachers’ orientations towards subject matter content in an oral English exam at the upper-intermediate proficiency level. As part of the investigation the teachers’ orientations will be compared with the aspects of content identified in the English subject curriculum. The aim of the study is to provide empirical evidence of teachers’ understanding of content in curriculum-related EFL education, as little research has focused on the importance of non-linguistic criteria in oral assessment (Sato, 2012), and as knowledge of raters’ assessment focus is important for the creation of test tasks and the development of rating scales (Pollitt & Murray, 1996; Taylor & Galaczi, 2011). Moreover, by comparing the teachers’ cognitive processes with the intended construct to be measured, as outlined in the English subject curriculum, the study may contribute important evidence concerning the validity of the scoring process (Bejar, 2012).


Analytical framework

As was touched upon in section one, the concept of content is used in a number of different ways in the assessment literature. Traditional assessment theory gives the concept scant treatment, but where mentioned, it is often related to Bachman and Palmer’s (1996) model of communicative language ability (e.g. Douglas, 2000; 2010; Fulcher & Davidson, 2007; Green, 2014; Luoma, 2004; McNamara, 1996). In this model content is referred to as topical knowledge or real-world knowledge and defined loosely as “knowledge structures in long-term memory” (Bachman & Palmer, 1996, p. 65). Although Bachman and Palmer offer interesting perspectives on the testing of content, their focus is nevertheless predominantly on language aspects. Therefore, in order to broaden the analytical framework, I will briefly turn to the field of curriculum-related, content-based L2 language instruction, which provides richer theoretical support for the subsequent analysis.

In content-based instruction (CBI) the overall objective is typically to teach “subject-specific curricular content”, such as mathematics, social science or language arts, alongside a second language, in order to help students develop L2 for communicative purposes and to “access academic content” in regular subject classes (Snow & Katz, 2014, p. 230). In this sense CBI belongs to the content side of the above mentioned language-content continuum. Exactly what the subject-specific or academic content is will depend upon the nature of the subject field, and it is of course impossible to make an inventory of all the different subject matter issues that may be treated. Overall, however, one may find content described in terms of words such as facts, concepts, laws, principles and theories (Chamot, 2009, p. 239). The reference to “concepts” here is particularly noteworthy, as Chamot claims that “[c]ontent subject concepts and relationships are the foundation of academic knowledge” (p. 20, emphasis added). In passing, it is also worth observing that Chamot lists a number of skills and abilities needed to process subject matter content. Stressing the importance of teaching higher-order thinking skills, she claims that students should be encouraged to speculate, predict, synthesize and make judgements about the material they are learning, “rather than merely recall facts” (p. 30). Another relevant feature of curriculum-related CBI in this discussion is its focus on content standards (Chamot, 2009; Echevarría, Vogt, & Short, 2008; Snow & Katz, 2014). Content standards specify learning outcome objectives, which state what students should know and be able to do in relation to some defined subject matter content (Chamot, 2009, p. 16). As Kratwohl (2002) explains,


statements of objectives typically consist of a noun or noun phrase – the subject matter content – and a verb or a verb phrase – the cognitive process(es). Consider, for example, the following objective: The student shall be able to remember the law of supply and demand in economics. (p. 213, italics added)

Learning objectives like the one Kratwohl mentions here offer a suitable framework for the assessment of content because they provide tools for identifying both what to assess (i.e. subject matter content) and how to assess it (i.e. the range of cognitive processes, or ‘skills’ or ‘abilities’ involved). In order to further specify this, it is relevant to briefly consider Bloom’s taxonomy, which has been used as a basis for the development of content standards in many contexts, and which is frequently drawn upon in CBI (Chamot, 2009; Echevarría et al., 2008; Kratwohl, 2002). Here I will concentrate on the revised version of the taxonomy. Bloom’s revised taxonomy arranges learning outcome objectives in a two-dimensional grid (Anderson & Kratwohl, 2001). One dimension represents different types of knowledge (the what-aspects), and the other represents various types of cognitive processes (the how-aspects). The organization of knowledge and cognitive processes along dimensions is meant to demonstrate their hierarchical nature, from the simpler forms of knowledge and processes to the more complex. According to Kratwohl (2002), knowledge is related to subject matter content and can be divided into four types: factual, conceptual, procedural and metacognitive. Cognitive processes fall into the following six categories listed from the simple to the complex: remember, understand, apply, analyse, evaluate and create. Figure 1 illustrates how the two dimensions interrelate and places Kratwohl’s example in this grid.


The knowledge dimension (rows) × The cognitive process dimension (columns):

 | Remember | Understand | Apply | Analyse | Evaluate | Create
Factual knowledge | | | | | |
Conceptual knowledge | Remember the law of supply and demand | | | | |
Procedural knowledge | | | | | |
Metacognitive knowledge | | | | | |

Figure 1. Example of a learning objective represented in Bloom’s taxonomy table, as outlined by Kratwohl (2002).
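To make the two-dimensional structure more concrete, the following is a minimal illustrative sketch (not part of the original article) of how a learning objective can be represented as a cell in the taxonomy table. The function name and the data structure are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch: Bloom's revised taxonomy as a two-dimensional grid.
# A learning objective occupies one cell, defined by a knowledge type (the
# noun-phrase / "what" dimension) and a cognitive process (the verb / "how" dimension).
KNOWLEDGE_TYPES = ("factual", "conceptual", "procedural", "metacognitive")
COGNITIVE_PROCESSES = ("remember", "understand", "apply", "analyse", "evaluate", "create")


def place_objective(knowledge: str, process: str, subject_matter: str) -> dict:
    """Return a cell entry for the taxonomy table."""
    if knowledge not in KNOWLEDGE_TYPES or process not in COGNITIVE_PROCESSES:
        raise ValueError("Unknown knowledge type or cognitive process")
    return {"knowledge": knowledge, "process": process, "subject_matter": subject_matter}


# Kratwohl's (2002) example: "The student shall be able to remember the law of supply and demand."
cell = place_objective("conceptual", "remember", "the law of supply and demand")
print(cell)
```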

In summary, the theoretical frameworks discussed in this section all point to important ways in which subject matter content can be understood. Both Chamot’s framework and Bloom’s revised taxonomy seem to be highly relevant, as they relate to curriculum-based contexts. More specifically, Kratwohl’s description of subject matter in terms of nouns and noun phrases provides a particularly useful tool for analysing content in the present study.

Research question

Against the empirical findings and theoretical frameworks presented in the preceding sections, the present study addresses the following research question: What do EFL teachers at the upper secondary school level in Norway perceive as relevant subject matter content to be assessed in the GSP1/VSP2 oral English exam?1 As part of this investigation, the teachers’ orientations will be compared against the aspects of subject matter identified in the subject curriculum.

1 This upper secondary school exam is taken by students in their first year in the general subjects programme (GSP1) or by students in their second year in the vocational subjects programmes (VSP2).


The context of the study

In Norway, English is a compulsory subject for all students from the first grade onwards (age six). By the time the students start upper secondary school at the age of 16, they have on average reached an upper-intermediate proficiency level (CEFR B1/B2). The subject curriculum is centered on competence aims which define what students are expected to master at the end of the different levels of instruction. Grades are mainly given in the form of overall achievement marks, awarded by each individual subject teacher on the basis of various forms of classroom assessment. In addition, around 20 per cent of the students are randomly selected to sit for a written exam, and five per cent are selected to take an oral exam. Whenever a group of students are assigned for the oral exam, their English subject teacher is required to act as an examiner. In addition, an English teacher external to the students’ school is assigned the role as assessor. Grades range from 1 (“fail”) to 6 (“excellent”), and performance is scored holistically. At the upper secondary level the English course involving the exam being studied here is mandatory for students enrolled in both the general studies programme (GSP) and the different vocational studies programmes (VSPs). However, the GSP students complete the course after one year (GSP1), whereas the VSP students complete it after two years (VSP2). The fact that these two groups of students are required to take the same course has caused some tension in the past. According to some stakeholders the course is overly academic and therefore gives an advantage to GSP students who are allegedly both better motivated and more proficient in English (Solheim, 2009). The English subject curriculum, which works as a framework for the operationalization of the constructs (Bachman & Palmer, 2010, p. 211), stipulates a number of competence aims relating to subject matter content. At the level under investigation here, the GSP1/VSP2 level, 10 aims explicitly address content. These aims have been listed in Table 2, with a corresponding description of the subject matter content as defined above.


Table 2. Subject matter content specified in the English subject curriculum (GSP1/VSP2-level).2

Competence aim | Subject matter content
Assess and use different situations, working methods and learning strategies for developing one’s English skills | Metacognitive strategies
Assess different digital resources and other aids critically and independently and use them in one’s own language learning | Resources
Understand the main content and details of different types of oral texts about general topics and subject-specific topics related to one’s own study programme | General topics; subject-specific topics related to study programme
Discuss cultural and societal conditions in a number of English-speaking countries | Cultural and societal conditions in English-speaking countries
Present and discuss current news from English-speaking sources | Current news topics from English-speaking sources
Discuss the development of English as a world language | English as a world language
Discuss different types of English-speaking texts from different parts of the world | English-speaking texts
Discuss English-speaking films and other cultural forms of expression from different parts of the world | English-speaking cultural forms of expression
Discuss texts by and about indigenous peoples in English-speaking countries | Texts by and about indigenous peoples
Select an in-depth study topic of one’s own study programme and present this | Subject-specific topic related to study programme

As can be seen in Table 2, there is a very broad range of subject matter aspects. Not only are the students expected to handle a number of topics related to the English-speaking world, such as literary texts, cultural conditions and indigenous peoples, they are also expected to be able to understand the content of subject-general texts and subject-specific topics related to their own study programme. In addition, they are also required to know and to assess both metacognitive strategies and (re)sources.

2 The English subject curriculum can be accessed at http://www.udir.no/kl06/eng1-03/Hele/?lplang=eng.


As for the operationalization of the constructs to be assessed, there is a notable difference between the oral and the written exam. Whereas the written exam is administered nationally by the Norwegian Directorate for Education and Training, the oral exam is controlled by the local education authorities in each of the 19 counties. This means that, for the written exam, there is a nationally developed rating scale and nationally designed test tasks, whereas for the oral exam, different types of locally developed scales and tasks exist. As a rating scale can be understood as an operationalization of the constructs to be assessed (Fulcher, 2012, p. 378; Luoma, 2004, p. 59), this means that the constructs are operationalized differently in the various counties (see e.g. Bøhn, 2015).

Method

Research design

The study used data from two sources of evidence: verbal protocols and semi-structured interviews (Brinkmann & Kvale, 2015; Green, 1998), involving 10 EFL teachers at the upper secondary school level in Norway. A prompt in the form of a video-taped performance of a student taking the oral exam was used as a stimulus for the generation of verbal protocols. On the basis of this video-clip the teachers were asked to comment on the performance in real time (concurrent verbal reporting) and to give it a score. Directly after the protocols had been recorded, the teachers were interviewed by the researcher on their conceptions of content in the oral English exam. Both data sets were analysed using provisional coding (Miles, Huberman, & Saldaña, 2014).

Participants

The teachers were recruited for the study through purposeful sampling (Creswell, 2013), in order to obtain variation in the sample with regard to school and county background, teaching and rater experience and study programme affiliation. The teachers were contacted directly by telephone, and all who agreed to participate did so on a voluntary basis, with no financial compensation. The participants were between 32 and 51 years of age (M=40), and their teaching experience ranged from one and a half to 26 years. They represented six different schools in three different counties. Three of them worked only in the vocational studies programmes (VSP), three worked only in the general studies programme (GSP) and four worked in both programmes. All of them were fully qualified teachers and had previously been involved as examiners.3 As for the video-taped prompt, a VSP student agreed to be filmed as she was taking her oral exam. The exam format consisted of three tasks: (i) a pre-planned monologue task in the form of a presentation, followed by a discussion of the presentation; (ii) an oral interview task based on a short story from the syllabus; and (iii) an oral interview task based on a listening comprehension sequence. For the pre-planned monologue task the student had been given 48 hours in advance to respond to the following prompt:

Choose a common health issue in today’s society and make a presentation of the problems it causes the individual and in society. Use examples from fictional and factual texts as well as films from your reading list to illustrate your examples.

The student had chosen to give a presentation about HIV/AIDS in South Africa. As regards the topics of the other two tasks, the short story focused on obesity and eating disorders, and the listening comprehension sequence involved a discussion about English as a world language.

Procedure

An interview guide was piloted and revised. The questions in the interview guide were formulated on the basis of the findings in Bøhn (2015), the analytical framework presented above and the content-related statements identified in the English subject curriculum (cf. Table 2, above).4 The verbal protocols were generated by the teachers in individual think-aloud sessions (Green, 1998). The video-clip was shown to the participants on a lap-top computer, and a headset was provided in order to ensure good sound quality. Before the recording started, the teachers were instructed to verbalize their thoughts on the general aspects of the performance and then to give it a grade. They were also given five minutes to familiarize themselves with the equipment and the procedure. All the teacher comments were recorded on an Olympus DM-450 digital voice recorder. Immediately after the think-aloud sequence, the teachers were interviewed on their judgments of the performance they had just seen, as well as on their assessment orientations more broadly. In the first half of each interview only open, “nondirective” questions (Yin, 2016, p. 144) concerning general assessment criteria were asked, in order not to impose researcher-generated conceptions of content on the participants (see questions B1-3 in the interview guide, Appendix B). Thus, it was hoped that ‘unsolicited’ answers regarding content would emerge. Subsequently, the teachers were questioned specifically on whether and to what extent they considered content while rating. This included questions concerning what they regarded as content, how they thought it should be evaluated, and to what extent they found the subject matter identified in the curriculum to be relevant in the assessment of oral exam performance (see question B4-10 in the interview guide, Appendix B).

3 Further information about teacher background can be found in Appendix A.
4 The interview guide can be found in Appendix B.

Data analyses

After the verbal protocols had been recorded I transcribed, checked and segmented them (cf. Green, 1998). In the segmentation process the transcripts were divided into ideas units. An ideas unit can be defined as “a single or several utterances with a single aspect of the event as the focus”, i.e. a unit which is “concerned with a distinct aspect of performance” (Brown et al., 2005, p. 13). The following excerpt, divided into five units (separated by “/”), serves as an illustration:

/ Good vocabulary / She corrected herself. There was an error there / There was a Norwegian word there / She is doing well in terms of content / Here her pronunciation is not that good […] There were some long words… loan words /

All the segments were then coded into categories, using the computer software package QSR NVivo 10. The transcripts were coded in two cycles (Saldaña, 2013). In the first cycle all the segments were assigned codes, using provisional coding based on the categories developed in Bøhn (2015) and the conceptual framework presented in the analytical framework section, above (cf. Miles et al., 2014; Saldaña, 2013). For example, the segments in the above quoted excerpt were coded as Vocabulary, Ability to repair, Compensatory strategies, Content and Pronunciation, respectively. After all the statements had been coded, the codes relating to content were sifted out and re-analysed in a second cycle in order to validate these categories. In this cycle ideas units were specifically checked for nouns and noun phrases relating to content, as specified by Kratwohl (2002) (cf. Analytical framework section). For example, in one statement, the following noun phrase occurred: “She remembers the syllabus, so she has studied”. Here, the noun phrase the syllabus was categorized as a subject matter item. More specifically, the unit was coded in the category “Syllabus texts” (cf. Table 3, below). In order to further validate the analysis, a colleague with prior experience as an EFL teacher at the upper secondary level was asked to code two transcripts, using provisional coding. The inter-coder reliability analysis yielded a Kappa estimate of .83, which may be regarded as very good (Landis & Koch, 1977).

The interviews were also analysed using provisional coding. First, they were transcribed and checked and then divided into two sections corresponding to the unsolicited and solicited answers to the open and specific questions that had been asked (cf. Procedure section). Next, these two sections were divided into ideas units, in a process similar to the verbal protocol analysis (VPA), and again the transcripts were coded in two cycles. In the first cycle the ideas units were compared against the analytical framework developed in Bøhn (2015) and in the analytical framework section, whereas in the second cycle the content segments were separated out and analysed with a particular focus on nouns and noun phrases. The following extract gives an illustration (the segments have been separated by “/”):

Researcher: How would you define content? Informant:

/ First of all, that she answers the task, and that it is an answer which is relevant to the task / that it is an answer which shows that she has knowledge of English-speaking countries and English-speaking literature, something which this student [in the video-clip] doesn’t have at all /

Here the first ideas unit contained the content-related noun phrase the task. It was coded as “Task / Topic statement” (cf. Table 4, below). The second ideas unit comprised the noun phrase knowledge of English-speaking countries and English-speaking literature. However, as “knowledge” in Kratwohl’s (Bloom’s) framework relates to the process dimension of learning objectives rather than to the subject matter dimension, this noun phrase head may be excluded. We are then left with the two noun phrases English-speaking countries and English-speaking literature. These two phrases were coded as “Knowledge of culture and literature in the English-speaking world” (cf. Table 4). In order to validate the coding, the above mentioned colleague agreed to analyse another two transcripts. The inter-coder consistency between my own coding and hers resulted in a Kappa estimate of .78, which may be regarded as substantial (Landis & Koch, 1977).
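As a purely supplementary illustration (not taken from the thesis), agreement estimates of this kind can be reproduced with Cohen's kappa once the two coders' category labels for the same set of ideas units have been exported, for example from NVivo, into two parallel lists. The toy data and variable names below are hypothetical; only the use of scikit-learn's cohen_kappa_score reflects a standard way of computing the statistic.

```python
# Minimal sketch: Cohen's kappa for two coders' category labels on the same ideas units.
# The labels are invented for illustration; in practice they would come from the coding software.
from sklearn.metrics import cohen_kappa_score

coder_1 = ["Task / Topic statement", "Sources", "Content - general", "Syllabus texts", "Personalized knowledge"]
coder_2 = ["Task / Topic statement", "Sources", "Task / Topic statement", "Syllabus texts", "Personalized knowledge"]

kappa = cohen_kappa_score(coder_1, coder_2)
# Landis and Koch (1977): .61-.80 = substantial agreement; .81-1.00 = almost perfect.
print(f"Cohen's kappa = {kappa:.2f}")
```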


Results

Results from the verbal protocol analysis

The VPA regarding relevant subject matter content to be assessed produced five specific subject matter categories, in addition to a general one (cf. Table 3, below). The first one, which comprised a number of statements from all the participants, was labelled Task / Topic statement. This category reflects the fact that the teachers mainly commented on subject matter in relation to the three exam tasks: the presentation about HIV/AIDS in South Africa, the discussion of the text from the syllabus about eating disorders and the listening comprehension task about English as a world language. Hence, a large proportion of the statements were simply references to “HIV/AIDS”, “South Africa”, “symptoms”, “obesity”, “English around the world”, “accents” and the like. Similarly, the teachers used a number of general descriptions such as “topic”, “theme”, “problem statement” and “concepts” to refer to those task-related issues. Three statements illustrate this:

She is reflecting a bit on the consequences of HIV and AIDS and the fact that there is no proper cure. (Informant no. 10)

She shows understanding of the complexities of eating disorders. (Informant no. 3)

She’s managed to demonstrate that she understood some of what she has listened to. (Informant no. 5)

None of the other categories comprised nearly as many statements as Task / Topic statement. The second one, termed Sources, was commented on by six teachers. This category was found to be related to Task / Topic statement, but it was singled out as a separate category. The reason for this was that the teachers seemed to expect the student in the video-clip to reflect on, or at least to comment on, the sources of her presentation. Hence, this analysis is consistent with the other categorizations made here, considering that Sources is realized by a noun and would fit neatly into Bloom’s taxonomy table as presented in Figure 1. A quote from informant no. 6 illustrates this point: “She doesn’t say much about her sources”. Comments like this one were only made in relation to the presentation task.


The third category, labelled Personalized knowledge, was developed from three teacher statements which pointed to the fact that the student in the video-clip related the topic of task three to personal experiences:

[She was asked] a question about whether she speaks English outside of Norway… She communicates o.k. when she speaks freely. That’s quite common… when they are allowed to speak about what they want, they usually do o.k. (Informant no. 8)

As may be observed, the underlined content aspect in this extract is not represented by a noun phrase, but rather by a nominal relative clause (Hasselgård, Lysvåg, & Johansson, 2012). However, such clauses have syntactic functions similar to noun phrases, and in this case it seems clear that it denotes subject matter. The statement suggests that speaking about personal experiences is a type of subject matter which is sometimes seen as relevant by Norwegian teachers. Four informants made comments of this kind. The fourth category, Syllabus texts, indicates that some teachers seem to expect students to remember, and possibly to reflect on, texts from the syllabus. As the informant quoted in the methods section put it, “She remembers the syllabus” (cf. above). Again, this is an example which fits in Bloom’s taxonomy table presented in Figure 1. The fifth category, termed Knowledge of culture and literature in the English-speaking world, suggests that some teachers expect students to be able to refer to culture-specific issues in their responses to the exam tasks. Commenting on the student’s response to exam task number two regarding English as a world language, one informant said:

[A] really good student would jump on that question and talk about different values and… how some people look up to posh accents, or might look down at another. But she’s not at all in that category of students. (Informant no. 5)

Only informant no. 5 made comments which were coded in this category, however. Finally, a general content category emerged, comprising statements where the teachers merely referred to “content”. In Table 3, all the categories from the verbal protocol analysis have been listed, together with an example for each category.


Table 3. Subject matter categories developed from the verbal protocol data.

Content category | Example
Task / Topic statement | It is good that she is able to reflect, at least a little bit, on the task. (Informant no. 7)
Sources | [There are] sources [in the last PowerPoint slide] … which are only URLs, nothing more. (Informant no. 4)
Personalized knowledge | But she is telling an interesting story here about a friend… with an eating disorder. (Informant no. 3)
Syllabus texts | She remembers the syllabus. So she has studied. (Informant no. 3)
Knowledge of culture and literature in the English-speaking world | [A] really good student would jump on that question and talk about different values and… how some people look up to posh accents. (Informant no. 5)
Content – general | It seems that she doesn’t know the content very well. (Informant no. 8)

Results from the interview analysis

The analysis of the answers to the unsolicited questions in the first part of the interviews yielded no additional subject matter content categories. As the teachers had only been asked to explain their general assessment orientations in this sequence, they mainly reiterated the aspects of content which they had reported in the VPA. However, when they were asked specific questions regarding content in the second part of the interview, other and more nuanced aspects emerged. The following exchange between the researcher and informant no. 3 serves as an example:

Researcher: How do you understand content?
Informant: That you have understood some concepts and relationships and are able to show me that you understand by explaining on the basis of texts and examples. And in order to be able to do that, you obviously have to remember some texts and be able to remember some facts and stuff, but it’s not particularly important that you have remembered the exact year [of an event] or the full name and title of a king, or an author or the like […] And I would like you to know that it is Hemingway who has written the short story, but I would much rather that you really understand… why Nick Adams acts like he does, and why


he has that relationship to his father… and if possible compare with another short story. (Informant no. 3)

Here the informant relates relevant subject matter content to concepts, relationships, texts, facts and ‘literary topics’ and exemplifies with reference to a literary text from the syllabus. In the analysis, “concepts” was coded as a separate category labelled Concepts, whereas “texts” was classified as Syllabus texts. ‘Literary topics’ was coded in the category Knowledge of culture and literature in the English-speaking world (cf. Table 4). Interestingly, the emphasis placed on “concepts” and “relationships” is an echo of Chamot’s (2009) claim that the ability to understand concepts and to see the relationship between them is the foundation of academic knowledge (cf. Analytical Framework section). However, as the idea of ‘seeing relationships [between concepts]’ may have more in common with the process aspects (understand, analyse, evaluate etc.) than with the subject matter aspects of content, it was decided not to place “relationships” in a separate category in the analysis. Similarly, “facts” was not coded as a separate category, as the data also contained answers to a question concerning how the teachers perceived the notion of “facts” (cf. the interview guide, Appendix B). Thus, it was hoped that more explicit features of subject matter could be discerned. In response to this question, one informant answered: “Well, I think of general knowledge” (informant no. 2). This view that general knowledge is part of the content construct was supported by seven other informants. Hence, a separate category labelled General world knowledge was included in the analysis (cf. Table 4). In addition, on the question of facts, another informant explained: Well, if you look at the English subject curriculum, there is no list of facts that you have to remember; absolutely not. You don’t have to know that Sydney is the capital of Australia (sic) in order to pass in English […]. But if you get that task, you are expected to find some information about Australia. (Informant no. 5)

As informant no. 5 quite correctly points out, the curriculum does not list any facts that students must remember. For him, this seems to mean that subject matter largely relates to the information that the student has collected in preparation for the presentation task (task number one). Accordingly, it appears that this task is seen as central for the what to be tested. Another informant also mentioned the fact that detailed subject matter aspects are absent from the curriculum:

I had a student in an oral exam once who didn’t know anything about the Tea Party [Movement]… and there is nothing [in the curriculum] about the Tea Party in the U.S. But he had to know something. Exactly what that ‘something’ is […] isn’t so important. But it has to be something. And what he or she shows... has to be thoroughly done… and be at a certain level… not just surface level knowledge. (Informant no. 4) In this extract informant no. 4 does not mention the centrality of task one, but rather alludes to the very general and wide-ranging content aspects of the curriculum. A consequence of this appears to be that the what-aspects to be presented are seen as less important. What matters is how subject matter is presented. Elaborating on this point, she explained: But then I also think that… every now and then… we are assessing general maturity. […] … content… how much do they actually understand of the world around them? And what is kind of… more a type of general intelligence or general knowledge, which is perhaps not always linked to the English subject. (Informant no. 4) Here, the formulation “general knowledge, which is perhaps not always linked to the English subject” suggests that any topic is potentially relevant for discussion. Moreover, the use of the phrase “actually understand” again points to an emphasis on skills and processes. This view was, in fact, supported by several other teachers. For example, in a response to my question on whether good general knowledge could help improving a student’s score, informant no. 8 replied: “Yes… In fact, I would put it in the category ‘Having the ability to reflect’”. A related issue which supports the interpretation that specific factual knowledge is less important is the view that a teacher should not search for ‘knowledge gaps’ in students’ performance. Examples of such a position were: [One should try to] help the students perform well, not necessarily look for things that they don’t know. (Informant no. 1) So I think it’s my responsibility if the student can’t answer… But there are many ways of asking questions, right? Are you able to sneak it in in a subtle way? And as a teacher you should be able to spot… is the student completely… [lost] or is it possible to get some more out of [him/her]. (Informant no. 3)


It is said that we are not supposed to search for things that they don’t know. We are meant to try to elicit what they do know. (Informant no. 10)

Four teachers made comments like these. Such a focus may also explain why some teachers include Personalized knowledge in the content construct. If a student is unable to speak about curricular texts, indigenous peoples, English as a world language or any of the other issues listed in the subject curriculum (cf. Table 2), one last resort may be to ask about a topic of personal interest. This interlocutor (or examiner) strategy may also be explained by the view, mentioned by several of the teachers, that it is virtually impossible to test all the competence aims in the oral exam. Hence, some aims have to be left out. This is interesting in light of the fact that the students’ own subject teachers, acting as interlocutors in the exam, have considerable influence over which competence aims are tested: [a]t the same time you [i.e. the teacher] select the competence aims to be tested in the oral exam. So if you choose not to include it, it shouldn’t be tested. (Informant no. 10) Finally, as regards the relevance of the specific content issues identified in the curriculum (cf. Table 2), all the teachers confirmed that the aspects listed there, except for Metacognitive strategies, were relevant features to be tested. However, the issue of Metacognitive strategies left most teachers hesitant. One teacher even categorically denied they were to be tested in the exam, calling such strategies “a meta-science”. Only one teacher clearly affirmed that these strategies were a relevant part of the content construct. That being said, it should be emphasized that the teachers did not appear to expect students to respond impromptu to detailed questions concerning all of these issues. Rather, they emphasized the importance of being able to analyse, reflect on and evaluate whatever subject matter that the task or question addressed. In Table 4 all the subject-matter categories which emerged from the interview analysis have been listed. As has been mentioned, the first one of these, Task / Topic statement was by far the largest one. Some, such as News topics, Syllabus texts and particularly Metacognitive strategies, were rather marginal.


Table 4. Subject matter categories developed from the interview data.

Content category: Example

Task / Topic statement: Well, I [think of content as] her focusing on the theme that she has been given and actually talks about this topic. (Informant no. 7)
Sources: Content [relates to the student’s ability to] use some sources… Because she doesn’t say anything about that either. (Informant no. 2)
Personalized knowledge: But she is telling an interesting story here about a friend… with an eating disorder. (Informant no. 3)
Knowledge of culture and literature in the English-speaking world: [Content relates to the fact that] she has knowledge of the English-speaking world and of English literature. (Informant no. 6)
English as a world language: She didn’t get the chance to sort of talk about the English language, as a world language and international language. (Informant no. 4)
General world knowledge: You will get a better grade if you have good general knowledge of the world. (Informant no. 6)
Concepts: [Content means] that you have understood some concepts […] (Informant no. 3)
Indigenous peoples: We could have asked a question like: “Have you learnt anything about indigenous peoples?” (Informant no. 5)
News topics: [Students should have] the ability to reflect on [news topics from] Fox news, right… those kinds of things. (Informant no. 4)
Syllabus texts: If I have taken some texts […] from the syllabus […] and they don’t know anything about them […] then they are in trouble, I’d say. (Informant no. 9)
Metacognitive strategies: [Interviewer:] Does this mean that learning strategies could be tested? [Informant:] Yes, it’s there [in the curriculum], isn’t it? (Informant no. 1)
Content – general: I think part 1 [task 1] is good in the sense that she shows good knowledge. (Informant no. 4)
Miscellaneous: Content in English […] that’s an inexhaustible field. (Informant no. 4)

Two final comments are worth making. Firstly, it is interesting to observe the apparent discrepancy between the general agreement on certain assessment criteria – e.g. that answering the task is an important criterion – and the occasional disagreement on what kind of performance is characteristic of a given level in relation to a criterion. For example, informant no. 2 reported in her verbal protocol: “She doesn’t mention film or literature at all. She is not answering the whole task”. In response to this remark, which was presented to her in the interview, informant no. 10 replied: “Not answering the task? Well of course she does!” In other words, the informants do agree that answering the task is important, but they do not agree on what kind of performance is indicative of task fulfillment.

Secondly, and relatedly, the finding in Bøhn (2015) which indicated that GSP teachers are more concerned with aspects of content than VSP teachers is only partly corroborated. In fact, none of the participants in this study downplayed content as a relevant criterion in itself, as some of the VSP teachers in Bøhn (2015) did. However, as the finding above showed, the teachers partially disagreed on what kind of performance is necessary for students to score at a certain level in relation to content. Here, the statements quoted above from informants no. 2 and no. 10 are quite illustrative. Informant no. 2, who found that the student was not answering the task properly, is a teacher who mainly works with GSP students. Informant no. 10, on the other hand, who thought that the student did answer the task well, only works with VSP students. Moreover, informant no. 2 awarded the student a 3, whereas informant no. 10 awarded her a 4.

Discussion

Overall, in response to the research question What do EFL teachers at the upper secondary school level in Norway perceive as relevant subject matter content to be assessed in the GSP1/VSP2 oral English exam?, the analyses showed that the teachers understand subject matter in very general terms. They confirmed that the aspects listed in the subject curriculum, apart from metacognitive strategies, are relevant features to be tested, but as a number of these aspects are very wide-ranging, the teachers pointed out that it is unrealistic to expect students to remember details from all kinds of potential topics. Consequently, they appear to adopt an assessment strategy where the testing of skills and processes (describing, analysing, evaluating etc.) becomes more important than the assessment of clearly defined subject matter. Simply put, the specifics of the subject matter are notably downplayed. This supports the finding in Bøhn (2015) which showed that the teachers’ operationalization of the content construct to a large extent involved skills and abilities. However, the results in the present study only partially support Bøhn’s finding that the VSP teachers are less concerned with content than the GSP teachers. The evidence suggests that they generally have the same understanding of how the content construct should be operationalized, but that they may disagree on what kind of behaviour is indicative of performance at the different levels of proficiency. Similar findings were reported in Brown et al.’s (2005) study.

On closer inspection, the teachers seemed to focus most of their attention on subject matter in the pre-planned presentation task. However, as some of them pointed out, the material that the students have prepared beforehand, for example PowerPoint slides, is not to be tested. Rather, it is their ability to present and discuss this material which should be the focus of the assessment. Such a position is consistent with stipulations made by the national educational authorities, which specify that whatever the students have prepared beforehand is not to be assessed (Norwegian Directorate for Education and Training [UDIR], 2014). This may further explain why the teachers seemed more concerned with the students’ ability to present, reflect and analyse than with the specifics of what was being presented. In particular, they gave the impression of being preoccupied with the higher-order thinking skills of analysis, reflection and evaluation, as mentioned by Chamot (2009). Such an orientation is reinforced by the fact that several of the teachers acknowledged that they try to seek out what the students know, not what they do not know. This is also in line with the requirements of the educational authorities, which state that the examiners should not search for the students’ “lack of competence” (UDIR, 2014, p. 8, my translation). A potential consequence of this line of thinking is that the subject matter construct is largely understood as general world knowledge and that a student with good general knowledge might obtain a high score in the exam as long as he or she possesses well-developed higher-order thinking skills.

As for the validity of the scoring process, the general agreement between the teachers’ orientations towards subject matter and the aspects of content identified in the subject curriculum attests to fairly good correspondence between the teachers’ cognitive processes and the intended construct to be measured (Bejar, 2012). One potential threat to validity, however, is the teachers’ reluctance to treat metacognitive strategies as an assessment criterion. Another is the variation in teacher perceptions regarding what kind of performance indicates proficiency at the different levels.


Conclusion and implications

This study has investigated Norwegian EFL teachers’ conceptions of subject matter content in an oral exam at the upper secondary level and compared these conceptions with aspects of content specified in the English subject curriculum. The results show that the teachers have very general conceptions of content, which corresponds well with the content construct as defined in the curriculum. Moreover, the findings indicate that the teachers are generally more concerned with the skills and process aspects of content than with specific subject matter. In particular, they seem oriented towards higher-order thinking skills, such as the ability to reflect on a given topic.

Three limitations of this study must be kept in mind. First of all, the teacher sample was small, which makes generalizations to other contexts problematic. Secondly, although the teachers were interviewed on their orientations towards the content construct generally, they were probably influenced by the performance of the student in the video-clip. Had another student given a different performance, the teacher responses might also have been somewhat different. Thirdly, introspective methods such as interviews and VPA, which were used in this study, do not necessarily reflect genuine teacher behaviour in authentic assessment situations.

The study has two major implications. Firstly, the validity problems related to the role of metacognitive strategies in the test construct, as well as the differences in perceptions concerning what kind of behaviour is indicative of performance at the different levels, need to be addressed. One feasible solution to these problems is the introduction of a common rating scale, which may better guide the teachers in their operationalization of the content construct (Fulcher, 2012). In addition, more rater training would seem beneficial, as this is reported to have positive effects on reliability (Taylor & Galaczi, 2011). The second implication relates to the issue of washback. As classroom assessment, both formative and summative, is a prioritized area in the Norwegian setting, students should be informed of the importance of higher-order thinking skills in their work with the English subject.

An avenue for further research is the question of the interface between language and content. As current research is particularly concerned with how the language and content constructs interrelate (Snow & Katz, 2014), it would be relevant to explore further how teachers understand this interrelation in the Norwegian school context.


References

Anderson, L. W., & Krathwohl, D. R. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives. New York: Longman.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford University Press.
Bejar, I. (2012). Rater cognition: Implications for validity. Educational Measurement: Issues and Practice, 31(3), 2-9. doi:10.1111/j.1745-3992.2012.00238.x
Brinkmann, S., & Kvale, S. (2015). InterViews: Learning the craft of qualitative research interviewing. Thousand Oaks: Sage.
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test-taker performance on English-for-academic-purposes speaking tasks (TOEFL Monograph Series, MS-29). Princeton, NJ: Educational Testing Service.
Byrnes, H. (2008). Assessing content and language. In E. Shohamy & N. H. Hornberger (Eds.), Encyclopedia of language and education (Vol. 7, pp. 37-52). New York: Springer Science+Business.
Bøhn, H. (2015). Assessing spoken EFL without a common rating scale: Norwegian EFL teachers’ conceptions of constructs. Sage Open, October-December 2015. Retrieved from http://sgo.sagepub.com/content/spsgo/5/4/2158244015621956.full.pdf
Chamot, A. U. (2009). The CALLA handbook: Implementing the Cognitive Academic Language Learning Approach (2nd ed.). White Plains, NY: Pearson Education.
Council of Europe. (2001). The Common European Framework of Reference for Languages: Learning, teaching, assessment. Strasbourg: Council of Europe, Language Policy Unit.
Douglas, D. (2000). Assessing languages for specific purposes. Cambridge: Cambridge University Press.
Douglas, D. (2010). Understanding language testing. Oxon: Hodder Education.
Echevarría, J., Vogt, M. E., & Short, D. J. (2008). Making content comprehensible for English learners: The SIOP model (3rd ed.). Boston: Allyn & Bacon.
Frost, K., Elder, C., & Wigglesworth, G. (2012). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers’ oral performances. Language Testing, 29(3), 345-369. doi:10.1177/0265532211424479
Fulcher, G. (2012). Scoring performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 378-392). Oxford: Routledge.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment. Oxford: Routledge.
Green, A. (1998). Verbal protocol analysis in language testing research: A handbook (Vol. 5). Cambridge: Cambridge University Press.
Green, A. (2014). Exploring language assessment and testing: Language in action. Oxon: Routledge.
Hasselgård, H., Lysvåg, P., & Johansson, S. (2012). English grammar: Theory and use (2nd ed.). Oslo: Universitetsforlaget.
Krathwohl, D. R. (2002). A revision of Bloom's taxonomy: An overview. Theory into Practice, 41(4), 212-218. doi:10.1207/s15430421tip4104_2
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. doi:10.2307/2529310
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
McNamara, T. (1996). Measuring second language performance. Harlow: Longman.
Met, M. (1998). Curriculum decision-making in content-based language teaching. In J. Cenoz & F. Genesee (Eds.), Beyond bilingualism: Multilingualism and multilingual education (pp. 35-63). Philadelphia, PA: Multilingual Matters.
Miles, M. B., Huberman, A. M., & Saldaña, J. (2014). Qualitative data analysis: A methods sourcebook. Los Angeles: Sage.
Norwegian Directorate for Education and Training [UDIR]. (2014). Rundskriv Udir-02-2014 - Lokalt gitt muntlig eksamen [Circular Udir-02-2014 - Locally administered oral exams]. Oslo: Author. Retrieved from http://www.udir.no/Regelverk/Finn-regelverkfor-opplaring/Finn-regelverk-etter-tema/eksamen/Udir-2-2014-Lokalt-gitt-muntligeksamen/
Pollitt, A., & Murray, N. L. (1996). What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium, Cambridge. Cambridge: Cambridge University Press.
Saldaña, J. (2013). The coding manual for qualitative researchers (2nd ed.). London: Sage.
Sato, T. (2012). The contribution of test-takers’ speech content to scores on an English oral proficiency test. Language Testing, 29(2), 223-241. doi:10.1177/0265532211421162
Snow, M. A., & Katz, A. M. (2014). Assessing language and content. In A. J. Kunnan (Ed.), The companion to language assessment (Vol. 1, pp. 230-247). Chichester, UK: Wiley-Blackwell.
Solheim, T. (2009). Opplæring i yrkesfag: Teori - praksis [Education in the vocational subjects: Theory - practice]. Bedre skole, 4/2009. Retrieved from https://www.utdanningsforbundet.no/upload/Tidsskrifter/Bedre%2020Skole/BS_nr_2004-2009/4328-2004-2009-BedreSkole-Solheim.pdf
Taylor, L., & Galaczi, E. (2011). Scoring validity. In L. Taylor (Ed.), Examining speaking: Research and practice in assessing second language speaking (Vol. 30, pp. 171-233). Cambridge: Cambridge University Press.
Yin, R. K. (2016). Qualitative research from start to finish (2nd ed.). New York: The Guilford Press.


Appendix A

Informants’ background

No.  Age  Gender  L1          Education  Teaches at study programme
1    48   Female  Norwegian   Master     Both vocational and general studies
2    40   Female  Norwegian   Master     Mainly general studies
3    35   Female  Danish      Master     Both vocational and general studies
4    47   Female  Norwegian   Master     Mainly general studies
5    51   Male    English     Master     Mainly general studies
6    35   Female  Norwegian   Bachelor   Mainly general studies
7    39   Male    Norwegian   Master     Both vocational and general studies
8    32   Female  Norwegian   Bachelor   Only vocational studies
9    32   Female  Norwegian   Bachelor   Only vocational studies
10   40   Female  Norwegian   Bachelor   Only vocational studies


Appendix B

Assessing content – Interview guide

A. Background
1. Gender _____________________
2. Age _____________________
3. Education (English) ______________________________
4. Teaching experience (upper secondary) ___________________________
5. Rater/examiner experience _____________________________________

B. Questions concerning the assessment of content
(Please remember that these questions concern how you would assess students in the oral exam, not what the Directorate for Education and Training, colleagues, researchers or others think.)

1. Very briefly: Which grade would you have given the performance in the video-clip that you have just seen? Why?
2. Overall, which aspects are you looking for while rating performance in the oral English exam?
3. Which criterion/criteria is/are the most important one(s)?
4. To what extent would you say that «content» is to be tested?
5. How would you define content?
6. Regarding skills and processes:
   a. To what extent is it important that the student answers the task, such as in this video-clip? (Would you say that answering the task concerns the “content” to be tested?)
   b. To what extent is it important that the student gives an elaborate response? (Would you say that this concerns the “content” to be tested?)
   c. To what extent is it important that the candidate structures his or her answer well? (Would you say that this concerns the “content” to be tested?)
   d. To what extent would you say that the ability to remember subject matter, being able to reflect on or evaluate subject matter etc., is important?
      i. Does this mean that students have to remember subject matter?
      ii. Is knowledge of meta-cognitive strategies, and listening and speaking strategies, a part of this knowledge?
7. Regarding subject matter knowledge: Is subject matter knowledge (facts) important? If yes, what kind of subject matter knowledge? («Culture and societal aspects in English-speaking countries», “culture and societal aspects generally”, “meta-cognitive knowledge”, “English as a world language”, “indigenous peoples”, “news topics from English speaking countries”?)
8. In your opinion, how much should content be weighted against language (or other aspects)?
9. Do you think the focus on content is too strong in the English subject curriculum?
10. It has been claimed that the subject curriculum is too ambitious concerning content for the less proficient students (who have problems coming to terms with language-related aspects). What is your opinion on this issue?
