Running Head: TEST DIRECTIONS AND FAMILIARITY

Test Directions as a Critical Component of Test Design: Best Practices and the Impact of Examinee Characteristics

Joni M. Lakin
Auburn University

To appear in Educational Assessment

Author Note

Joni M. Lakin, Department of Educational Foundations, Leadership, and Technology, Auburn University. This work was supported by a fellowship from the American Psychological Foundation. The author gratefully acknowledges the helpful comments of David Lohman, Kurt Geisinger, Brent Bridgeman, Randy Bennett, and Dan Eignor on earlier drafts of this article. Correspondence concerning this article should be addressed to Joni Lakin, Department of Educational Foundations, Leadership, and Technology, Auburn University, Auburn, AL 36849. Email: [email protected]


Abstract

The purpose of test directions is to familiarize examinees with a test so that they respond to items in the manner intended. However, changes in educational measurement as well as in the U.S. student population present new challenges to test directions and increase the impact that differential familiarity could have on the validity of test score interpretations. This article reviews the literature on best practices for the development of test directions and documents differences in test familiarity for culturally and linguistically diverse students that could be addressed with test directions and practice. The literature indicates that the choice of practice items and feedback are critical in the design of test directions and that more extensive practice opportunities may be required to reduce group differences in test familiarity. As increasingly complex and rich item formats are introduced in next-generation assessments, test directions become a critical part of test design and validity.


Test Directions as a Critical Component of Test Design: Best Practices and the Impact of Examinee Characteristics

Introducing new or unfamiliar item formats to examinees poses particular challenges to test developers because of the need for students to quickly and accurately understand what the test items require (Haladyna & Downing, 2004; Lazer, 2009). Item formats are defined not only by their response modes (e.g., multiple choice, constructed response), but also by the types of demands they are expected to place on examinees' knowledge and cognition (e.g., basic arithmetic, scientific reasoning; Martinez, 1999). When introducing new item formats, the critical challenge is how best to introduce the task so that all students are able to respond to the format as intended by the test developers.

With innovations in achievement testing, especially through computer-based testing and performance assessment, novel item formats are increasingly being introduced to achievement tests to increase construct representation and measure more complex and authentic thinking skills (e.g., National Center for Education Statistics [NCES], 2012; see also Bennett, Persky, Weiss, & Jenkins, 2010; Fuchs et al., 2000; Haladyna & Downing, 2004; Lazer, 2009; Parshall, Davey, & Pashley, 2000). In these new formats, simulations, computer-based tools, and items with complex scoring procedures are used for both teaching and assessment (Bennett et al., 2010; NCES, 2012). Furthermore, the role of innovative (and thus initially unfamiliar) formats will only increase in the near future because computer-based assessments with more complex and authentic test items form a central component of the general assessment consortia formed in response to Race to the Top (PARCC and SMARTER Balanced Assessment Consortia; Center for K–12 Assessment & Performance Management, 2011; Lissitz & Jiao, 2012).

When item formats are unfamiliar, test directions and practice become a crucial, but often overlooked, component of test development and a threat to valid score interpretation (Haladyna & Downing, 2004). Test directions serve to explain unfamiliar tasks to test-takers and are often the only means that test developers have of diminishing potential unfamiliarity or practice effects.


Further, if the test developers suspect there will be preexisting group differences in test familiarity, test directions are their last opportunity to address this validity threat as well. If groups differ in their understanding of the test task after the directions have been completed, bias can be introduced to the test scores, resulting in score differences between the two groups that do not reflect real group differences in ability (Clemans, 1971; Cronbach, 1984). As James (1953) advocated, test directions should "give all candidates the minimum amount of coaching which will enable them to do themselves full justice on the test. All candidates are then given a flying start and not just some of them" (p. 159).

The goal of this article is to outline some key considerations in developing high-quality test directions. This is particularly important in the context of a culturally and linguistically diverse society, where test familiarity is more common among some cultural and linguistic groups and creates a barrier to educational opportunities for individuals from other groups (Kopriva, 2008). These disparities are discussed below.

Importance of Developing Good Directions and Practice Activities

Test directions are an instructional tool whose purpose is to teach examinees the sometimes complex task of how to solve a series of test items. Depending on the students and test purpose, these directions may include general test-taking tips (e.g., "Don't spend too long on any one item!"; Millman, Bishop, & Ebel, 1965). Some directions rely heavily on examples accompanied by oral or written explanations of the task, while others consist only of brief oral or written introductory statements. The effectiveness of a set of directions depends not only on the characteristics of those directions, but also on the characteristics of the examinees, including their ability to understand the written or spoken directions and their familiarity with the test tasks.

Good test directions provide enough information to familiarize examinees with the basic rules and process of answering test items so that test items are answered in the manner intended by the test creator (AERA, APA, & NCME, 1999; Clemans, 1971). With increasingly complex computer-based or interactive tasks, efficiency in answering the questions and understanding the task (with minimal guidance from a proctor) is also critical. The Standards (AERA et al., 1999) further emphasize that examinees must have equal understanding of the goals of the test, basic solution strategies, and scoring processes for test fairness to be assured. The Standards suggest the use of practice materials and example items as part of the test directions to accomplish this goal.


On some tests, such as classroom tests, traditional achievement tests, and even aptitude tests for college applicants, it is assumed that understanding the task is trivial and so the directions are perfunctory (Scarr, 1981). In these cases, the directions can be brief because the examinees have been exposed to many similar tests (e.g., vocabulary or mathematical computation tests) throughout their education. Some researchers even complain that examinees entirely ignore the test directions and skip straight to the test items as soon as they are able (James, 1953; LeFevre & Dixon, 1986). In these cases, low-quality directions are expected to have a negligible impact on test scores.

On the other hand, it has long been acknowledged that practice can significantly affect performance on novel tasks and that differences in familiarity can lead to construct-irrelevant differences in scores (Haladyna & Downing, 2004; Jensen, 1980; Thorndike, 1922). Several researchers exploring the effects of training on item performance confirmed that more complex and unfamiliar formats showed the greatest gains in response to training (Bergman, 1980; Evans & Pike, 1973; Glutting & McDermott, 1989; Kulik et al., 1984; Jensen, 1980). The size of practice effects likely has practical importance, but diminishes quickly after a few exposures to a particular test (Kulik et al., 1984; te Nijenhuis et al., 2001). Practice effects on traditional formats in parallel forms of a test have been estimated at .23 SD for a single practice test (meta-analysis; Kulik, Kulik, & Bangert, 1984). Jensen (1980) estimated this effect to be a gain of .33 SD on IQ tests based on the literature up to 1980. A noteworthy point is that practice gains differ significantly by examinee type and test design.

Thus, on tests where the formats are less likely to be familiar, the demands of understanding the task may not be trivial and thus the directions must not be perfunctory. In the absence of a full practice test, directions with practice examples are test developers' best chance to decrease practice gains on novel tests. With the introduction of complex item formats into achievement tests, such as laboratory simulations (NCES, 2012), the novelty of the items may introduce construct-irrelevant variance and be detrimental to measuring the desired construct in early applications of the format (Ebel, 1965; Fuchs et al., 2000).
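The practice-effect estimates cited in this section are standardized mean differences. As a point of reference only (the computation is not spelled out by the sources cited here), the familiar pooled-SD form of Cohen's d expresses a practice gain relative to the spread of scores:

d = \frac{\bar{X}_{\text{retest}} - \bar{X}_{\text{first test}}}{SD_{\text{pooled}}}, \qquad SD_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}}

On this scale, the .23 SD parallel-form effect from Kulik et al. (1984) and the .33 SD estimate from Jensen (1980) correspond to roughly 3 to 5 points on a typical IQ-style metric with SD = 15.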


These effects may have critical impacts on test use. For example, Fuchs et al. (2000) found that a brief orientation session on a new mathematics performance assessment (an unfamiliar assessment format for their sample of students) greatly improved the scores students received (by .54 to .75 SD, depending on subtest). They concluded that unfamiliarity with the testing situation, which was particularly acute for performance assessments where both the format and grading scheme were unfamiliar, was an obstacle to the students' ability to demonstrate their mathematical knowledge on the test. In their case, the unfamiliarity not only changed the construct measured, but also created spurious growth effects that could confound interpretations of the test scores longitudinally.

Importance of Directions in Culturally and Linguistically Diverse Examinee Populations

Over time and with additional test experience, students acquire important problem-solving skills and knowledge relevant to the format (not only general test-wiseness [Millman et al., 1965] but format-specific knowledge and strategies). Students with greater experience with the item format may use more sophisticated or practiced strategies in completing the items compared to students with less experience (Norris, Leighton, & Phillips, 2004; Oakland, 1972). When all students are affected by item format or test unfamiliarity, the influence of unfamiliarity can be detected through traditional validity methods such as retesting and experimental designs (such as those used by Fuchs et al., 2000). Unfamiliarity in this case can also be addressed with a standard set of directions.

However, differential familiarity may, in fact, be an even greater threat than general unfamiliarity to valid score interpretation (Fairbairn, 2007; Haladyna & Downing, 2004; Maspons & Llabre, 1985; Ortar, 1960; Scarr, 1994). When familiarity affects only a subset of students or even individual students, detecting its effects (termed individual score validity [ISV] by Kingsbury & Wise, 2012) becomes more difficult because many validation methodologies rely on accurately identifying affected examinee subgroups, such as race/ethnicity. For example, differential item functioning methods can only detect intermittent group differences in item functioning, not differences that affect the entire test or an unidentified subset of students. Because of this difficulty, test developers should actively anticipate instances where unfamiliarity could be a threat to valid and fair use of test scores for all students.
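To make that limitation concrete, consider the Mantel-Haenszel procedure, one widely used DIF method (offered here only as an illustration; the text above does not single out any particular method). Examinees are matched on total score, and within each score level k the counts of right (R) and wrong (W) answers for the reference (r) and focal (f) groups are combined into a common odds ratio:

\hat{\alpha}_{MH} = \frac{\sum_k R_{rk} W_{fk} / N_k}{\sum_k R_{fk} W_{rk} / N_k}

where N_k is the number of examinees at score level k; values far from 1 flag the item for differential functioning. Because the contrast is defined by membership in a prespecified group and is conditioned on total score, familiarity effects that depress performance on the entire test, or that affect an unidentified subset of examinees, leave this statistic largely untouched, which is exactly the blind spot noted above.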


When new item formats are introduced to a testing application, if some students are able to gain additional practice or test exposure, these differences could potentially contribute bias to test scores (Haladyna & Downing, 2004). These differences in the ability to profit from test practice are pervasive. Prior experience with tests and practice activities provided by parents or extracurricular programs can give some students greater knowledge of typical tests and solution strategies (Hessels & Hamers, 1993). When tests are high-stakes, these influences are especially likely and worrisome (Haladyna & Downing, 2004). But extracurricular practice is not the only concern. For students who are new to U.S. schools or do not come from families who are engaged with their school system, access to knowledge about tests and test preparation activities may be restricted and may also lead to differential effects of item format novelty (Budoff, Gimon, & Corman, 1974).

The critical assumption that an item format is equally familiar to all students (a foundation of valid score interpretation; AERA et al., 1999; Haladyna & Downing, 2004) seems especially problematic among culturally and linguistically diverse students, whose experiences with testing may vary considerably (Anastasi, 1981; Fairbairn & Fox, 2009; Peña & Quinn, 1997; Sattler, 2001). Both students and parents can vary in their understanding of the purposes of testing, motivation, knowledge of the tests used, and access to test preparation opportunities (Kopriva, 2008). These differences are exacerbated when tests have high stakes attached to them (Reese, Balzano, Gallimore, & Goldenberg, 1995; Solano-Flores, 2008; Walpole et al., 2005).

The U.S. school population has always been diverse, but the challenges for educational assessment are increasing with rising numbers of Latino students and English learners (EL; Federal Interagency Forum on Child and Family Statistics, 2011; Passel, Cohn, & Lopez, 2011). Latino and EL students have a number of social and cultural differences—in addition to the linguistic differences—that may contribute to the widely documented differences in their test performance compared to White and non-EL students. These differences are described in the sections that follow.


Test-wiseness. There are apparent cultural differences between mainstream middle-class White families and families of other racial/ethnic and SES backgrounds with respect to their testing-related parenting behaviors and beliefs about testing purposes. Although test-taking skills classified as test-wiseness are usually not addressed by test directions, they could be, and they might contribute to students' ability to profit from test directions that focus on the particulars of a given format.

Cultural differences seem to affect students' test-wiseness (general knowledge and strategies for testing). Greenfield, Quiroz, and Raeff (2000) and Peña and Quinn (1997) found differences in assessment-related behaviors for Latino mother-child pairs compared to White mother-child pairs. Peña and Quinn argued that differences in the interactions of mothers with their children, specifically the types of questions that mothers asked during playtime, related to differences in the way that the children worked on picture-labeling (vocabulary) vs. information-seeking (knowledge) tests. They observed that the Latino children had more difficulty understanding vocabulary tasks than they did understanding information-seeking tasks that elicited descriptions, functions, and explanations, and argued that the latter task more closely resembled the way that the Latino mothers interacted with their children.

At school entry, test-wiseness continues to differentially affect students from culturally and linguistically diverse backgrounds. Dreisbach and Keogh (1982) found that test-wiseness training, devoid of test-relevant content, improved the performance of Spanish-speaking children from low-SES backgrounds more than would be expected from the general population (d = .46 between trained and untrained students; effect size calculated from their reported data). Even among college students, differences may persist. Maspons and Llabre (1985) found that college-aged Latino students in the Miami-Dade area benefited significantly when trained in the test-taking skills necessary for U.S. testing systems. In their experimental design, the correlation between their knowledge test and later performance in the class jumped from r = .39 for the control group to r = .67 for the experimental group that received training in test-taking skills.

Differences in cultural and educational backgrounds also appear to affect EL students' engagement with tests. Budoff et al. (1974) argued that EL students "differ in familiarity and experience with particular tasks, have a negative expectancy of success in test-taking, and are less effective in spontaneously developing strategies appropriate to solving the often strange problems on a test" (p. 235).


The latter point makes high-quality test directions especially likely to help EL students if those directions support strategy development. Budoff et al. confirmed that their learning potential post-test (which allowed more opportunities for feedback on test strategies) created substantial practice gains for EL students: d = 0.68 for primary students and d = 0.40 for intermediate grade students. They also found that the post-tests correlated more strongly with achievement for EL students compared to their pre-test scores and that indicators of SES and English proficiency were correlated with pre-test but not post-test performance.

Perceptions. There are also important differences in perceptions of tests, as suggested by Budoff et al. (1974). Students and parents new to the U.S. school system may differ in their understanding of the purposes of testing and the basic strategies of completing tests (Kopriva, 2008; Reese et al., 1995). For example, Walpole et al. (2005) found that students from low-SES backgrounds enrolled in low-performing schools lacked information about the availability of test preparation materials and vastly overestimated how much improvement could be expected from re-testing. Such differences are believed to affect students' motivation to perform well on these tasks and to exert effort (Kopriva, 2008; Schwarz, 1971). When the tests have high stakes (such as for gifted and talented identification or college admissions), there are often large differences between White and minority students in the perceived importance of preparing at home for school-administered tests (Briggs, 2001; Chrispeels & Rivero, 2001; Kopriva, 2008; Solano-Flores, 2008; Walpole et al., 2005). Better advance information about tests and their uses should therefore be targeted to these parents.

Summary. Although test unfamiliarity can affect all students when test formats are novel, the literature strongly suggests that there is a possibility of differential familiarity for any test format when there is diversity in the student population. As a result, test developers should take care with the development of test directions and practice activities to address any potential unfamiliarity effects that may influence test performance.

Best Practice in Designing Test Directions and Practice Opportunities

Several older measurement texts tout the importance of good directions to valid test use and offer some advice on the features of good directions (e.g., Clemans, 1971; Cronbach, 1984; Traxler, 1951).


The attention to the construction of test directions appears to have diminished over time, as exhibited by the content coverage of successive editions of Educational Measurement: Lindquist (1951), Thorndike (1971), Linn (1989), and Brennan (2006a). The first two editions each contained a full chapter on test administration with extensive advice on creating test directions (Clemans, 1971; Traxler, 1951), whereas the latter two editions restricted their discussion to issues of standardization with little specific advice on developing test directions (e.g., Bond, 1989; Cohen & Wollack, 2006; Millman & Greene, 1989). Even in these early guides, specific guidance for how to create directions was uncommon. To some extent, this is because the best directions differ by test and item type. However, the research also lacks systematic efforts to discover exactly what features make for good directions for a given format, especially in recent years when test item formats have been evolving considerably.

Given the infrequent inclusion of guidance on the development of directions in recent measurement texts, test directions may appear easy to construct. However, when actually trying to write directions, it can be quite difficult to explain an unfamiliar task clearly and succinctly so that all examinees fully understand. This is especially true when writing directions for children because even basic concepts must be minimized (Kaufman, 1978). Children have less experience with tests and need simple language and adequate practice. Furthermore, when tests are less familiar and more complex, students of any age may depend more heavily on the test directions to understand the task.

From a review of the literature, the following guidelines for creating directions and practice items are suggested. These guidelines are then expanded upon in the following sections.

1. Treat directions as an instructional activity and build in flexibility for additional practice or explanations for students who do not immediately understand.

2. Use the simplest language needed, but no simpler.

3. Use more practice items with greater variety, dependent on examinee characteristics.

4. Provide elaborative feedback to make examples efficient and effective for all.

5. Gather evidence about the quality of the directions to support the intended validity inferences for students who vary in test familiarity.


Treat Directions as an Instructional Activity

We know from the extensive work on practice effects (Jensen, 1980; Kulik et al., 1984; Thorndike, 1922) that examinees improve their test scores when repeatedly administered similar tests. What is the examinee learning that results in practice effects on later examinations even when the items are not the same and no feedback is given? For examinees first encountering novel item formats, part of the challenge is that such formats require them to assemble a strategy and rely more heavily on metacognitive monitoring skills rather than on well-practiced and automatic strategies (Marshalek, Lohman, & Snow, 1983; Prins, Veenman, & Elshout, 2006). The items may also require specific knowledge about the response format or rubric (Fuchs et al., 2000). Test directions with a number of practice items (or a separate practice opportunity) may be sufficient to help students overcome this initial learning phase.

Depending on the age of the examinees and how much they vary in their experience with the test format (or tests in general), the detail of the directions and practice may need to be adapted at the student or class level so that students are neither bored nor overwhelmed by the length and detail of the directions or practice. This would, in fact, be a sharp departure from typical standardized test directions. The traditional understanding of standardization is that it refers to the behaviors of the test administrator and the testing environment rather than to the readiness of the examinees to engage in the test. In this behaviorist framework, fairness is achieved when every student hears the same directions and sees the same practice items (cf. Cronbach, 1970; Geisinger, 1998; Norris et al., 2004). However, when students vary greatly in their familiarity with testing, the typical administration procedures and test directions provided may fail to homogenize the test experience of examinees. For complex and unfamiliar formats, test developers would benefit from broadening their definition of standardization to include homogeneity of the examinee experience and from changing the focus of directions from standardization to effective instruction.


To accomplish this change, mainstream testing might benefit from incorporating elements of the dynamic assessment and learning potential assessment traditions to adapt directions and practice opportunities. Following these traditions, test direction activities would ensure that every student is on equal footing with respect to understanding the task, motivation to do their best, and familiarity and test-wiseness related to the format (Budoff, 1987; Caffrey, Fuchs, & Fuchs, 2008; Kaufman, 2010; Ortar, 1972; Rand & Kaniel, 1987; Tzuriel & Haywood, 1992). In some circles, this is called the "Betty Hagen rule"1: "Do not start the test until you are sure every child understands what he or she is supposed to do." Like Hagen, these interactive testing traditions recognize that fairness comes from adaptation to students rather than standardization of administration and that, in fact, equitable treatment can produce inequality (Brennan, 2006b). Hattie, Jaeger, and Bond (1999) capture this tension between standardization and adaptation as the key to fairness in item formats and administration:

"The manner under which tests are administered has become a persistent methodological question, primarily because of the tensions between the desire to provide standardized conditions under which the test or measure is administered (to ensure more comparability of scores) and the desire for representativeness or authenticity in the testing condition (to ensure more meaning in inferring from the score)." (p. 395)

When students vary widely in their familiarity with tests and their readiness to show their best work, fairness is likely best achieved through increased flexibility in test administration (e.g., additional practice for some, changes in directions, comprehension checks) rather than rigid adherence to standardized procedures.

In treating test directions as an instructional activity, the directions must be adapted to examinee characteristics and must pay attention to motivation. Young students may benefit especially from basic test-wiseness training, including self-pacing and tactics to reduce impulsive responding.

1 Hagen was the co-author of several forms of the Cognitive Abilities Test (CogAT), among many other tests, and was well-versed in designing tests for children. This is hearsay wisdom from her CogAT Form 6 co-author, David Lohman.


In a study of test directions for young students, Lakin (2010) found that multiple tactics and teacher interventions were needed within the directions to slow students down and to encourage them to consider all of the response options before selecting an answer. Even young students seem quick to assume they understand the task, consistent with the minimal effect of written directions on older students in an experimental design (LeFevre & Dixon, 1986).

For abstract item formats, such as those used on ability tests and critical thinking tasks, the directions should also try to introduce a basic strategy for solving the early items so that students feel successful and can adapt the strategy as items increase in difficulty (Budoff et al., 1974). Familiarity with successful strategies is certainly an advantage for more experienced test-takers (Byrnes & Takahira, 1993; Marshalek et al., 1983; Prins, Veenman, & Elshout, 2006), but that advantage may be minimized by teaching a basic strategy. In this way, the first few test items (also called "teaching items") should start easy and increase in difficulty so as to lead the child to develop appropriate strategies by gradually introducing more complex rules (Cronbach, 1984). For introducing complex simulations or tools, test developers should similarly plan to introduce tools as needed, and preferably one at a time, so that the examinee is not overwhelmed with new cognitive demands (Gee, 2003).

Use the Simplest Language Needed, But No Simpler

In writing test directions, the primary goal must be clarity in conveying information to examinees. Directions that are too long lead to inattention, and directions that are too vague or brief will not serve the intended purpose. There is thus a fine line to walk when every student will receive the same directions. The trend in abilities assessment seems to be toward using the shortest instructions possible (even to the somewhat extreme "nonverbal directions" procedure; Lakin, 2010; McCallum et al., 2001), with the idea that less language is more culture-fair. However, this may not be the best method for closing gaps in understanding between culturally and linguistically diverse students. In fact, shorter directions—where repetitions, explanations, and logical conjunctions are dropped—may actually increase the burden of listening comprehension for examinees (Baker, Atwood, & Duffy, 1988; Davison & Kantor, 1982).


Furthermore, even directions that seem short and clear to the adult test developer may contain an undesirable number of basic concepts when testing young students. In traditional test administration, basic concepts are essential to explaining the task, especially when a practice item is presented. Telling the examinees to "look at the top row of pictures" or asking, "What happens next?" both involve basic concepts that young students may not know. As a result, the directions for most widely used intelligence tests contain many more basic concepts than one would expect or desire on a test for young students (Flanagan, Kaminer, Alfonso, & Rader, 1995; Kaufman, 1978).

From their work in cross-cultural testing, van de Vijver and Poortinga (1992) argue that directions should "rely minimally on explanations in which the correct understanding of the verbalization of a key idea is essential" (p. 20). In their view, visual examples and verbal descriptions must work together with appropriate verbal repetition and feedback to support comprehension so that a single moment of inattention or misunderstanding does not derail an examinee's performance. Indeed, rather than arguing for short directions, van de Vijver and Poortinga (1992) argue that increasing the amount of instructions and practice can actually reduce the cultural loading of a test. More language decreases how critical it is that students quickly grasp the rules of the task from a few examples. A liberal number of examples and exercises, they say, can overcome relatively small group differences in familiarity. Biesheuvel (1972) likewise argued that familiarization with test demands is essential before beginning a test in a cross-cultural context: practitioners often fail "to recognize the desirability of prolonged pre-test practice in dealing with the material which constitutes the content of the test and of introducing a learning element in the test itself" (Biesheuvel, 1972, p. 48).

The best compromise therefore seems to be appropriately detailed explanations that are firmly concretized by examples and visuals. Visuals are a way of reducing language load without reducing information (Mayer, 2001). Another solution is to rely on computer animations: one important benefit of computerized tests is the potential for directions to show rather than tell test-takers where to direct their attention. This approach reduces the amount of explanation required and the corresponding language load.


Use More Practice Items

In treating the directions as an instructional activity, the test developer should pay special attention to choosing the best practice items and tactics to teach students how to respond to the task. For example, the first few items should not provide opportunities to succeed using inappropriate strategies. Misconceptions about the task introduced by the test directions and early test items are a serious threat to validity. For example, Lohman and Al-Mahrzi (2003) found that the use of certain item distractors (that is, the incorrect options to an item) unintentionally reinforced a misunderstanding of the task that was only discovered through an analysis based on personal standard errors of measurement. Likewise, Kaufman (2010) reported that the content of early items on the WISC Similarities tests unintentionally steered students toward giving lower-point answers. Thus, it is clear that test directions, practice items, and early test items must work together to prevent misunderstandings.

The choice and number of practice items is no small consideration in the design of good test directions and practice. In fact, research shows that test developers should design directions and examples keeping in mind that students will rely more heavily on the examples than on verbal descriptions to get a sense of the task. LeFevre and Dixon (1986) compared the performance of students on two item formats when the examples and instructions for the tests conflicted (e.g., the words described a figure series test but the example showed a figure classification problem). No matter how the experimenters varied the emphasis on the directions, the college-aged students used the example as their primary source of information on how to do the items and virtually ignored the written directions.

However, LeFevre and Dixon (1986) warn that "the present conclusions probably would not apply to completely novel tasks; in such cases a single example might be insufficient to induce any procedure at all" (p. 28; see also Bergman, 1980). LeFevre and Dixon argue that examinees need both examples to help concretize procedures for tests and verbal instructions to help abstract the important features from the examples. Feist and Gentner (2007) concur, finding that verbal directions were essential to drawing attention to important features of test items, teaching basic strategies, and giving feedback on practice items.


"In speaking, we are induced by the language we use to attend to certain aspects of the world while disregarding or de-emphasizing others" (p. 283). Although examples by themselves may be less useful to naïve test-takers, they clearly play an important role in helping examinees develop specific strategies and procedures for solving test items (Anderson, Farrell, & Sauers, 1984).

An additional benefit of using more practice items is that they provide a more representative sample of test items (Haslerud & Meyers, 1958; Kittell, 1957; van de Vijver & Poortinga, 1992). Presenting a range of realistic examples is important. Examples that are too simplified do not expose students to the full range of item types and may lead them to develop ineffective strategies for the task, while a range of examples can introduce strategies and allow examinees to anticipate the range of challenges in the test (Jacobs & Vandeventer, 1971; Morrisett & Hovland, 1959). Furthermore, several studies showed that schemas are better built from multiple practice items than from long explanations (Anderson, Fincham, & Douglass, 1997; Klausmeier & Feldman, 1975; Morrisett & Hovland, 1959; Ross & Kennedy, 1990). Morrisett and Hovland (1959) also found that presenting a range of different practice items was essential because it taught examinees to discriminate between task-relevant and task-irrelevant problem features. However, they also found limits to how much variety is beneficial. Although extensive practice on just one type of item led to negative transfer on diverse items, sufficient practice on a set of three item types was more effective than brief practice on 24 item types. Thus, the best approach in selecting practice items is to identify a few broad categories of items and help examinees develop workable but general strategies that can be adapted to different items within these categories.

The relative timing of directions and example items is also a consideration. Lakin (2010) found that the preamble describing the test's purpose and task that many tests use may not be ideal and that good test directions should engage students (especially young students) in responding to practice items almost immediately rather than expecting them to listen to extensive explanations first.


In fact, some research has shown that it may be better to let students see examples before they are given any description of the task at all, in order to make the description more concrete (Sullivan & Skanes, 1971). This puts the onus of explanation on the feedback to responses rather than on the directions.

Provide Elaborative Feedback

Practice or example items can play an important role in demonstrating the format rules to students. However, practice alone does not lead to optimal performance when it is not paired with appropriate feedback. The research indicates that (1) feedback makes practice more efficient for learning and (2) elaborative feedback that goes beyond yes/no verification improves learning further.

Practice with feedback is more efficient. The importance of feedback was established early in the history of psychology (Thorndike, 1927). Morrisett and Hovland (1959) showed that feedback increased the amount of learning from each practice item, whereas providing no feedback resulted in the need for many more practice items to produce the same amount of learning. Furthermore, practice without feedback tends to make most students faster but not better at the task (Thorndike, 1914, Ch. 14), because such practice leads to automatization of whatever strategy the student is using, whether efficient or not, and not necessarily to improvement in performance.

Several studies support the need to provide feedback for practice items on tests (e.g., Hessels & Hamers, 1993; Parshall et al., 2000; Sullivan & Skanes, 1971; Whitely & Dawis, 1974). In one example, Tunteler, Pronk, and Resing (2008) conducted a microgenetic study of first-grade students solving figure analogies and showed that although all students improved their performance from practice alone, targeted feedback in a dynamic training session greatly increased performance. Importantly, Sullivan and Skanes (1971) found that feedback had a greater effect for low-ability examinees, who gained much less from practice without feedback than high-ability examinees did. This latter finding is consistent with research showing that high-ability examinees learn more from practice without feedback because they can more often correct their own errors (Shute, 2007).


The positive effect of feedback for culturally and linguistically diverse students has been confirmed in cross-cultural research as well. Hessels and Hamers (1993) and Resing, Tunteler, de Jong, and Bosma (2009) found that dynamic testing approaches with tailored feedback and practice significantly increased the performance of language- and ethnic-minority students on an ability test.

Elaborative feedback increases learning from practice problems. Several researchers have investigated what degree of feedback and explanation is optimal (e.g., Judd, 1908; Kittell, 1957; Sullivan, 1964). In her review of the literature, Shute (2007) concluded that elaborative feedback provided in tutoring systems resulted in larger gains in comprehension compared to simple verification of the answer. Elaborative feedback could include both a brief explanation of why the chosen answer was correct and information on why other attractive or plausible distractors are wrong.

Ideally, the person administering the test should be aware of what feedback the examinees are receiving so that the examiner can determine whether the examinee has sufficiently understood the directions (van de Vijver & Poortinga, 1992). This is rarely a component of traditional test directions, but in many cases it could be easily added to the administration process (e.g., directing the administrator to walk around during the practice items). Note that examiners would also need to be provided with information on how to give additional feedback to students who are not completing the practice items correctly. Such comprehension checks should allow the examiner to be confident that all students understand before the test begins. Computer-based administration of tests or practice tests would also be effective in tracking practice item performance and routing students to additional practice.

Collect Evidence on the Quality of Directions

Test directions are a critical component of test design that influence both the construct measured and the precision of that measurement (Clemans, 1971; Cronbach, 1970), yet they rarely receive systematic evaluation to confirm that they are effective in reducing pre-existing differences in test familiarity among examinees. Ideally, directions would be evaluated as part of the validity argument for a test to assure that they adequately familiarize all students with the task. Examples of appropriate research methods for this purpose have been reviewed earlier in this article (LeFevre & Dixon, 1986; Fuchs et al., 2000; Lohman & Al-Mahrzi, 2003; te Nijenhuis et al., 2007; see also Leighton & Gokiert, 2005).


Likewise, researchers should evaluate the magnitude of practice effects (overall and for specific students) as one check on how much students gain from changes in test familiarity alone, because such gains can threaten score interpretation (particularly longitudinal growth estimates).

The age, test experience, and background characteristics of the examinees clearly matter in the design of directions and warrant additional research. First, young students clearly require more orientation and distributed practice opportunities2 to help them understand the test tasks (Dreisbach & Keogh, 1982; Lakin, 2010), although finding the optimal method requires more research. In contrast, the particular issue with older examinees seems to be getting them to attend to the directions (LeFevre & Dixon, 1986), a challenge that has also not been addressed with empirical research. Second, the background characteristics of the students, whether in terms of cultural differences that influence test-wiseness (Biesheuvel, 1972) or linguistic differences (Dreisbach & Keogh, 1982; Ortiz & Ochoa, 2005), may influence the type and amount of directions and practice that are needed. In schools with large populations of culturally and linguistically diverse students, the adequacy of test directions to create scores that support valid comparisons of student performance is an even greater concern and requires additional fairness research evidence.

Other background characteristics should also be considered, when appropriate. One example is the Matthew effect, where gains from practice opportunities differentially benefit the most able students, at least on cognitive ability tests (Jensen, 1980; Sullivan, 1964; te Nijenhuis et al., 2007). One estimate of the Matthew effect comes from Kulik et al.'s (1984) meta-analysis, which estimated the effects of practice to be .17 SD for low-ability students, .40 for middle-ability students, and .82 for high-ability students (note that this was only for identical tests—the differential gains were much smaller for parallel forms: d = .07, .26, and .27, respectively). Test developers should consider such complications and avoid unintentionally exacerbating disparities.

2 It is important to note that teaching test-related skills does not necessarily have to happen in isolation and extend the time spent on test preparation. Integrated instruction/assessment systems, such as the CBAL assessment system (Bennett et al., 2010; see also DiCerbo & Behrens, 2012), embed instruction related to learning the assessment system and tools into meaningful content instruction.


Conclusions about the Design of Test Directions

Complex and novel item formats play an important role in both cognitive ability and achievement testing. Introducing new item formats can yield large familiarity or practice effects when students first encounter the format, which can influence evaluation at the individual or group level (Jensen, 1980; Fuchs et al., 2000). When access to test preparation varies, and particularly when tests have high individual stakes attached to their results, differential novelty between students becomes a serious threat to fairness and to the validity of interpretations from scores. In the diverse populations many schools serve, equal familiarity with testing and access to test preparation cannot be assumed.

Based on the literature review, several guidelines for creating directions were substantiated. It was clear that directions should be treated as an instructional activity that engages students in the task, use clear examples, provide elaborative feedback, and use age- and level-appropriate language. The literature also suggests that when students vary widely in their familiarity with tests, fairness is likely best achieved through increased flexibility in test administration (e.g., additional practice for some, changes in directions) rather than rigid adherence to standardized procedures.

A critical area for future research is the use of computer-based adaptive directions that could provide the needed flexibility in practice and feedback. Even for paper-based tests, computer-based practice activities could be beneficial in addressing the suggestions for directions and practice made in this article. Computers can respond adaptively to what students can already do and provide more or less practice as needed. Computer-based practice and directions could even provide basic test-wiseness strategies for students who lack any familiarity with tests—training that would be inappropriate and boring for other students. They can also permit the sort of semi-structured trial-and-solution strategy that students commonly use when learning new computer games (Gee, 2003). Computer-based directions could also target the feedback to provide only useful (and brief) corrections for errors made on practice items and support this feedback with appropriate visual demonstrations. A further benefit of computer-based practice would be that examinees could elect to hear the directions in one of several languages.
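As an illustration only (no such system is described in the sources reviewed here), the routing logic of a computer-based practice module might be sketched as follows in Python. All names and thresholds (PracticeItem, min_correct, max_rounds, and the helper callables) are hypothetical; the sketch simply encodes the ideas above: elaborative feedback on every practice response, a comprehension check before the scored test begins, and additional practice only for the examinees who need it.

from dataclasses import dataclass

@dataclass
class PracticeItem:
    prompt: str
    options: list[str]
    correct: int
    feedback: dict[int, str]  # explanation for each option, right or wrong

def run_practice(items, get_response, show, min_correct=0.8, max_rounds=3):
    """Administer practice items with elaborative feedback and route
    examinees to additional rounds until they demonstrate understanding."""
    for round_num in range(1, max_rounds + 1):
        correct = 0
        for item in items:
            show(item.prompt, item.options)
            choice = get_response()
            # Elaborative feedback: explain why the chosen option is
            # right or wrong, not just whether it is correct.
            show(item.feedback.get(choice, "Let's look at this one together."))
            if choice == item.correct:
                correct += 1
        # Comprehension check: proportion correct in this practice round.
        if correct / len(items) >= min_correct:
            return True   # ready to begin the scored test
        show("Let's try a few more practice questions before we start.")
    return False  # flag for the proctor: examinee may need live help

# Example use (hypothetical): route to the scored test only when ready.
# ready = run_practice(practice_items, keyboard_input, display)
# if not ready: notify_proctor()

Logging each examinee's practice-round performance within such a module would also supply the kind of evidence on direction quality called for in the preceding section.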


Regardless of the format of the test, the most important lesson from the literature is that test directions are a vital contributor to test validity, should be systematically studied and designed, and should be evaluated as part of the process of compiling a validity argument. Unless test developers and users are certain that all students have had equal opportunity to understand the task presented, bias may be introduced through systematic differences in test familiarity.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Anastasi, A. (1981). Coaching, test sophistication, and developed abilities. American Psychologist, 36, 1086-1093.
Anderson, J. R., Fincham, J. M., & Douglass, S. (1997). The role of examples and rules in the acquisition of a cognitive skill. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 932-945.
Baker, E. L., Atwood, N. K., & Duffy, T. M. (1988). Cognitive approaches to assessing the readability of text. In A. Davison & G. M. Green (Eds.), Linguistic complexity and text comprehension: Readability issues reconsidered (pp. 55-83). Hillsdale, NJ: Lawrence Erlbaum.
Bennett, R. E., Persky, H., Weiss, A., & Jenkins, F. (2010). Measuring problem solving with technology: A demonstration study for NAEP. Journal of Technology, Learning, and Assessment, 8. Retrieved July 23, 2010, from http://www.jtla.org
Bergman, I. B. (1980). The effects of test-taking instruction vs. practice without instruction on the examination responses of college freshmen to multiple-choice, open-ended, and cloze type questions. Dissertation Abstracts International: Section A. Humanities and Social Sciences, 41(1A), 176.
Biesheuvel, S. (1972). Adaptability: Its measurement and determinants. In L. J. Cronbach & P. J. D. Drenth (Eds.), Mental tests and cultural adaptation (pp. 47-62). The Hague: Mouton.
Bond, L. (1989). The effects of special preparation on measures of scholastic ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 429-444). New York, NY: American Council on Education/Macmillan.
Brennan, R. L. (Ed.). (2006a). Educational measurement (4th ed.). Westport, CT: American Council on Education/Praeger.


Brennan, R. L. (2006b). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1-16). Westport, CT: American Council on Education/Praeger.
Briggs, D. C. (2001). The effect of admissions test preparation: Evidence from NELS:88. Chance, 14(1), 10-21.
Budoff, M. (1987). The validity of learning potential assessment. In C. S. Lidz (Ed.), Dynamic assessment (pp. 52-81). New York, NY: Guilford Press.
Budoff, M., Gimon, A., & Corman, L. (1974). Learning potential measurement with Spanish-speaking youth as an alternative to IQ tests: A first report. InterAmerican Journal of Psychology, 8, 233-246.
Byrnes, J. P., & Takahira, S. (1993). Explaining gender differences on SAT-Math items. Developmental Psychology, 29, 805-810.
Caffrey, E., Fuchs, D., & Fuchs, L. S. (2008). The predictive validity of dynamic assessment: A review. Journal of Special Education, 41, 254-270.

Center for K–12 Assessment & Performance Management. (2011, July). Coming together to raise achievement: New assessments for the Common Core State Standards. Princeton, NJ: Educational Testing Service.
Chrispeels, J. H., & Rivero, E. (2001). Engaging Latino families for student success: How parent education can reshape parents' sense of place in the education of their children. Peabody Journal of Education, 76(2), 119-169.
Clemans, W. V. (1971). Test administration. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 188-201). Washington, DC: American Council on Education.
Cohen, A. S., & Wollack, J. A. (2006). Test administration, security, scoring, and reporting. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 355-386). Westport, CT: American Council on Education/Praeger.


Cronbach, L. J. (1970). Essentials of psychological testing (3rd ed.). New York, NY: Harper and Row.
Cronbach, L. J. (1984). Essentials of psychological testing (4th ed.). New York, NY: Harper and Row.
Davison, A., & Kantor, R. (1982). On the failure of readability formulas to define readable texts: A case study from adaptations. Reading Research Quarterly, 17, 187-209.
DiCerbo, K. E., & Behrens, J. T. (2012). Implications of the digital ocean on current and future assessment. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessments: Recent history and predictions for the future (pp. 273-306). Charlotte, NC: Information Age Publishing, Inc.
Dreisbach, M., & Keogh, B. K. (1982). Testwiseness as a factor in readiness test performance of young Mexican-American children. Journal of Educational Psychology, 74(2), 224-229.
Ebel, R. (1965). Measuring educational achievement. Englewood Cliffs, NJ: Prentice-Hall.
Evans, F. R., & Pike, L. W. (1973). The effects of instruction for three mathematics item formats. Journal of Educational Measurement, 10(4), 257-271.
Fairbairn, S. (2007). Facilitating greater test success for English language learners. Practical Assessment, Research, & Evaluation, 12(11). Retrieved from http://pareonline.net/getvn.asp?v=12&n=11
Fairbairn, S., & Fox, J. (2009). Inclusive achievement testing for linguistically and culturally diverse test takers: Essential considerations for test developers and decision makers. Educational Measurement: Issues and Practice, 28(1), 10-24. doi:10.1111/j.1745-3992.2009.01133.x
Federal Interagency Forum on Child and Family Statistics. (2011). America's children: Key national indicators of well-being. Washington, DC: U.S. Government Printing Office.
Feist, M. I., & Gentner, D. (2007). Spatial language influences memory for spatial scenes. Memory and Cognition, 35(2), 283-296.
Flanagan, D. P., Kaminer, T., Alfonso, V. C., & Rader, D. E. (1995). Incidence of basic concepts in the directions of new and recently revised American intelligence tests for preschool children. School Psychology International, 16(4), 345-364.


Fuchs, L. S., Fuchs, D., Karns, K., Hamlett, C. L., Dutka, S., & Katzaroff, M. (2000). The importance of providing background information on the structure and scoring of performance assessments. Applied Measurement in Education, 13, 1-34.
Gee, J. P. (2003). What video games have to teach us about learning and literacy. ACM Computers in Entertainment, 1(1), 1-4.
Geisinger, K. F. (1998). Psychometric issues in test interpretation. In J. Sandoval, C. L. Frisby, K. F. Geisinger, J. D. Scheuneman, & J. R. Grenier (Eds.), Test interpretation and diversity: Achieving equity in assessment (pp. 17-30). Washington, DC: American Psychological Association.
Glutting, J. J., & McDermott, P. A. (1989). Using "teaching items" on ability tests: A nice idea, but does it work? Educational and Psychological Measurement, 49, 257-268.
Greenfield, P. M., Quiroz, B., & Raeff, C. (2000). Cross-cultural conflict and harmony in the social construction of the child. New Directions for Child and Adolescent Development, 2000(87), 93-108.
Haladyna, T. M., & Downing, S. M. (2004). Construct-irrelevant variance in high-stakes testing. Educational Measurement: Issues and Practice, 23(1), 17-27.
Haslerud, G. M., & Meyers, S. (1958). The transfer value of given and individually derived principles. Journal of Educational Psychology, 49(6), 293-298.
Hattie, J., Jaeger, R. M., & Bond, L. (1999). Persistent methodological questions in educational testing. Review of Research in Education, 24, 393-446.
Hessels, M. G. P., & Hamers, J. H. M. (1993). The learning potential test for ethnic minorities. In J. H. M. Hamers, K. Sijtsma, & A. J. J. M. Ruijssenaars (Eds.), Learning potential assessment: Theoretical, methodological and practical issues (pp. 285-313). Amsterdam: Swets & Zeitlinger.
Jacobs, P. I., & Vandeventer, M. (1971). The learning and transfer of double-classification skills by first graders. Society for Research in Child Development, 42, 149-159.
James, W. S. (1953). Symposium on the effects of coaching and practice in intelligence tests: Coaching for all recommended. British Journal of Educational Psychology, 23, 155-162.


Jensen, A. R. (1980). Bias in mental testing. New York, NY: The Free Press.
Judd, C. H. (1908). The relation of special training to general intelligence. Educational Review, 36, 28-42.
Kaufman, A. S. (1978). The importance of basic concepts in individual assessment of preschool children. Journal of School Psychology, 16, 207-211.
Kaufman, A. S. (2010). "In what way are apples and oranges alike?": A critique of Flynn's interpretation of the Flynn Effect. Journal of Psychoeducational Assessment. Advance online publication. doi:10.1177/0734282910373346
Kingsbury, G. G., & Wise, S. L. (2012). Turning the page: How smarter testing, vertical scales, and understanding of student engagement may improve our tests. In R. W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessments: Recent history and predictions for the future (pp. 245-269). Charlotte, NC: Information Age Publishing, Inc.
Kittell, J. E. (1957). An experimental study of the effect of external direction during learning on transfer and retention of principles. Journal of Educational Psychology, 48(6), 391-405.
Klausmeier, H. J., & Feldman, K. V. (1975). Effects of a definition and a varying number of examples and nonexamples on concept attainment. Journal of Educational Psychology, 67(2), 174-178.
Kopriva, R. J. (2008). Improving testing for English language learners. New York, NY: Routledge.
Kulik, J. A., Kulik, C. C., & Bangert, R. L. (1984). Effects of practice on aptitude and achievement test scores. American Educational Research Journal, 21, 435-447.
Lakin, J. M. (2010). Comparison of test directions for ability tests: Impact on young English-language learner and non-ELL students (Unpublished doctoral dissertation). The University of Iowa, Iowa City. Retrieved from http://ir.uiowa.edu/etd/536
Lazer, S. (2009, December 10). Technical challenges with innovative item types [Video from presentation at National Academies of Science]. Retrieved August 3, 2010, from http://www7.nationalacademies.org/bota/Steve%20Lazer.pdf

LeFevre, J. A., & Dixon, P. (1986). Do written instructions need examples? Cognition and Instruction, 3, 1-30.
Leighton, J. P., & Gokiert, R. J. (2005, April). The cognitive effects of test item features: Informing item generation by identifying construct irrelevant variance. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada. Retrieved from http://www2.education.ualberta.ca/.../NCME_Final_Leigh&Gokiert.pdf
Lindquist, E. F. (Ed.). (1951). Educational measurement. Washington, DC: American Council on Education.
Linn, R. L. (Ed.). (1989). Educational measurement. New York, NY: American Council on Education/Macmillan.
Lissitz, R. W., & Jiao, H. (Eds.). (2012). Computers and their impact on state assessments: Recent history and predictions for the future. Charlotte, NC: Information Age Publishing, Inc.
Lohman, D. F., & Al-Mahrzi, R. (2003). Personal standard errors of measurement. Retrieved from http://faculty.education.uiowa.edu/dlohman/pdf/personalSEMr5.pdf
Marshalek, B., Lohman, D. F., & Snow, R. E. (1983). The complexity continuum in the radex and hierarchical models of intelligence. Intelligence, 7, 107-128.
Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207-218.
Maspons, M. M., & Llabre, M. M. (1985). The influence of training Hispanics in test taking on the psychometric properties of a test. Journal for Research in Mathematics Education, 16, 177-183.
Mayer, R. E. (2001). Multimedia learning. New York, NY: Cambridge University Press.
McCallum, R. S., Bracken, B. A., & Wasserman, J. D. (2001). Essentials of nonverbal assessment. Hoboken, NJ: Wiley.
Millman, J., Bishop, C. H., & Ebel, R. (1965). An analysis of test-wiseness. Educational and Psychological Measurement, 25(3), 707-726.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335-355). New York, NY: American Council on Education/Macmillan.
Morrisett, L., & Hovland, C. I. (1959). A comparison of three varieties of training in human problem solving. Journal of Experimental Psychology, 58(1), 52-55.
National Center for Education Statistics. (2012, June). Science in action: Hands-on and interactive computer tasks from the 2009 science assessment. Retrieved from https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2012468
te Nijenhuis, J., van Vianen, A. E. M., & van der Flier, H. (2007). Score gains on g-loaded tests: No g. Intelligence, 35, 283-300. doi:10.1016/j.intell.2006.07.006
te Nijenhuis, J., Voskuijl, O. F., & Schijve, N. B. (2001). Practice and coaching on IQ tests: Quite a lot of g. International Journal of Selection and Assessment, 9(4), 302-308.
Norris, S. P., Leighton, J. P., & Phillips, L. M. (2004). What is at stake in knowing the content and capabilities of children's minds? A case for basing high stakes tests on cognitive models. Theory and Research in Education, 2, 283-308. doi:10.1177/1477878504046524
Oakland, T. (1972). The effects of test-wiseness materials on standardized test performance of preschool disadvantaged children. Journal of School Psychology, 10, 355-360.
Ortar, G. R. (1960). Improving test validity by coaching. Educational Research, 2, 137-142.
Ortar, G. (1972). Some principles for adaptation of psychological tests. In L. J. Cronbach & P. J. D. Drenth (Eds.), Mental tests and cultural adaptation (pp. 111-120). The Hague: Mouton.
Ortiz, S. O., & Ochoa, S. H. (2005). Conceptual measurement and methodological issues in cognitive assessment of culturally and linguistically diverse individuals. In R. L. Rhodes, S. H. Ochoa, & S. O. Ortiz (Eds.), Assessing culturally and linguistically diverse students: A practical guide (pp. 153-167). New York, NY: Guilford Press.
Parshall, C. G., Davey, T., & Pashley, P. J. (2000). Innovative item types for computerized testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 129-148). The Netherlands: Kluwer Academic Publishers.
Passel, J. S., Cohn, D., & Lopez, M. H. (2011). Hispanics account for more than half of nation's growth in past decade. Retrieved from http://pewresearch.org/pubs/1940/hispanic-united-states-populationgrowth-2010-census
Peña, E. D., & Quinn, R. (1997). Task familiarity: Effects on the test performance of Puerto Rican and African American children. Language, Speech, and Hearing Services in Schools, 28, 323-332.
Prins, F. J., Veenman, M. V. J., & Elshout, J. J. (2006). The impact of intellectual ability and metacognition on learning: New support for the threshold of problematicity theory. Learning and Instruction, 16, 374-387.
Rand, Y., & Kaniel, S. (1987). Group administration of the LPAD. In C. S. Lidz (Ed.), Dynamic assessment (pp. 196-214). New York, NY: Guilford Press.
Reese, L., Balzano, S., Gallimore, R., & Goldenberg, C. (1995). The concept of educación: Latino family values and American schooling. International Journal of Educational Research, 23, 57-81.
Resing, W. C. M., Tunteler, E., de Jong, F. M., & Bosma, T. (2009). Dynamic testing in indigenous and ethnic minority children. Learning and Individual Differences, 19, 445-450.
Ross, B. H., & Kennedy, P. T. (1990). Generalizing from the use of earlier examples in problem solving. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 42-55.
Sattler, J. (2001). Assessment of children: Cognitive applications. San Diego, CA: Jerome M. Sattler.
Scarr, S. (1981). Implicit messages: A review of "Bias in mental testing." American Journal of Education, 89(3), 330-338.
Scarr, S. (1994). Culture-fair and culture-free tests. In R. J. Sternberg (Ed.), Encyclopedia of human intelligence (pp. 322-328). New York, NY: Macmillan.
Schwarz, P. A. (1971). Prediction instruments for educational outcomes. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 303-334). Washington, DC: American Council on Education.
Shute, V. J. (2007). Focus on formative feedback (Research Report No. RR-07-11). Retrieved from Educational Testing Service website: http://www.ets.org/research/researcher/RR-07-11.html
Solano-Flores, G. (2008). Who is given tests in what language by whom, when, and where? The need for probabilistic views of language in the testing of English Language Learners. Educational Researcher, 37(4), 189-199.
Sullivan, A. M. (1964). The relation between intelligence and transfer (Doctoral thesis, McGill University, Montreal, Quebec).
Sullivan, A. M., & Skanes, G. R. (1971). Differential transfer of training in bright and dull subjects of the same mental age. British Journal of Educational Psychology, 41(3), 287-293.
Thorndike, E. L. (1914). Educational psychology: A briefer course. New York, NY: Teachers College, Columbia University.
Thorndike, E. L. (1922). Practice effects in intelligence tests. Journal of Experimental Psychology, 5(2), 101.
Thorndike, E. L. (1927). The law of effect. American Journal of Psychology, 39, 212-222.
Thorndike, R. L. (Ed.). (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Traxler, A. E. (1951). Administering and scoring the objective test. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 329-416). Washington, DC: American Council on Education.
Tunteler, E., Pronk, C., & Resing, W. C. M. (2008). Inter- and intra-individual variability in the process of change in the use of analogical strategies to solve geometric tasks in children: A microgenetic analysis. Learning and Individual Differences, 18, 44-60.
Tzuriel, D., & Haywood, H. C. (1992). The development of interactive-dynamic approaches to assessment of learning potential. In H. C. Haywood & D. Tzuriel (Eds.), Interactive assessment (pp. 3-37). New York, NY: Springer-Verlag.
van de Vijver, F., & Poortinga, Y. (1992). Testing in culturally heterogeneous populations: When are cultural loadings undesirable? European Journal of Psychological Assessment, 8(1), 17-24.
Walpole, M., McDonough, P. M., Bauer, C. J., Gibson, C., Kanyi, K., & Toliver, R. (2005). This test is unfair: Urban African American and Latino high school students' perceptions of standardized college admission tests. Urban Education, 40(3), 321-349.
Whitely, S. E., & Dawis, R. V. (1974). Effects of cognitive intervention on latent ability measured from analogy items. Journal of Educational Psychology, 66, 710-717.