Connecting the dots: Formative, interim, and summative assessment. Dylan Wiliam, Gage Kingsbury, Steven Wise. In R. W. Lissitz (Ed.), Informing the practice of teaching using formative and interim assessment: A systems approach (pp. 1–19). Charlotte, NC: Information Age Publishing (2013).
Introduction Over the past twenty years, interest in educational success has grown dramatically, in response to a variety of factors that differ from one country to the next. As studies of educational performance in different countries have made international comparisons available (TIMSS: Mullis, Martin, Ruddock, O’Sullivan, & Preuschoff, 2009; PISA: OECD, 2012; PIRLS: Mullis, Martin, Kennedy, Trong, & Sainsbury, 2009), ranking performance among countries has become a focus of attention. At the same time, concerns about educational funding have increased governmental interest in educational budgeting and effectiveness. Parents are also concerned about the escalating need for a good education to help their children succeed as they enter the workforce. As these factors have been at work outside the classroom, teachers have been pushed to accomplish more for their students in the face of declining educational budgets. The pressure for the current generation to compete for careers with others
based around the globe has changed the nature of what educational success actually means. As a result of these many factors increasing interest in education, we have a host of stakeholders whose interests may vary. Students, parents, teachers, school and district administrators, legislators, and the general public (including the business community) all have an interest in how education is working. The interests of these groups differ dramatically when we consider what makes up quality education and what the goals of education should be. Examples may include the following:
• District administrators may be most interested in providing meaningful education for the diverse students in their school district, within the limits of the current budget.
• Legislators may be interested in establishing laws and regulations that improve the quality of education compared to other states or other countries.
• Business owners may want the schools to provide them with students who are capable of stepping into their entry-level positions.
• Parents may want schools to provide their children with opportunities that they never had.
• Teachers may want their school to provide them with supplies, resources, and support to help them in the classroom.
• Students may want their school to help them find out what they can and want to do in their lives.
Given the variety of these needs, and the hundreds of others that might be included in a plan for helping education move forward, it is useful to consider what the mission of an educational system should be. Many different thinkers have considered this issue, and the resulting comments have been quite varied. When we start to review them, though, commonalities emerge from very diverse sources. Thomas Jefferson said that education had as its purpose “the ideal of offering all children the opportunity to succeed, regardless of who their parents happen to be”. George Washington Carver suggested that “Education is the key to unlock the golden door of freedom”. More recently, Malcolm Forbes said that “Education's purpose is to replace an empty mind with an open one”. While views may vary, it is clear that these speakers commonly viewed education as a way to expand students’ views of the world. We will adopt that view, and consider the student as an evolving human being, expanding their view of the world as they grow to include the wide variety of possibilities that are available. Our view is that the student, and the future that we owe that student, need to be central to any educational system. With this as a starting point, we will make the following assumption concerning the development and improvement of a system of education: The mission of an educational system is to provide each student with an opportunity to learn what life has available, to help them decide what interests them, and to help them learn as much as they can to take them in their desired direction. For this chapter we will use this mission as our starting and ending point.
Clearly, a different mission statement will lead to very different conclusions, but it may also make the student less central to the educational process. Since education is a less satisfying enterprise if it doesn’t involve students, we will include them at the heart of our discussion. Assessment Needs To this point, we haven’t discussed assessment at all, but it is clear that as interest in educational quality has increased, so has interest in test scores. We have systems in place in countries around the world that require the testing of some or all students in some or all grades in some or all subjects. Some of these testing systems have been developed with the needs of school personnel in mind (asTTle is a very fine example; Fletcher, 2000). In the United States, however, most have been developed to provide external agencies, such as state and federal governments, with a window into the development of student competency in the schools. These assessments often provide a very narrow view of education (typically testing only a few subjects, with at most one test per year). The shortcomings of these tests have caused schools to use a wide variety of other tests, designed to serve different needs and different groups of students. The result is that we have many tests in use, but few ways to design assessment systems that are efficient and effective in telling us about students while they help the students learn. Currently, the primary focus of federal regulation in the United States is summative assessment. This focus creates an imbalance in the classroom, since summative assessment meets the needs of only a few educational stakeholders. We
need to find a better balance, so that each assessment tool is used when it is appropriate, and each assessment provides the information we need to influence education in a manner that informs each stakeholder and serves the most important stakeholder, the student, well. In the remainder of this chapter, we will describe some of the types of assessments that are used in schools today. We will then attempt to connect some of the dots to describe an assessment system that could be useful to students, as well as the other stakeholders in our schools.
Summative Assessment In the United States, the most common types of summative assessment currently used are state assessments, which are used to assess student proficiency toward the end of a school year. Scores from these tests are usually aggregated to support inferences about groups of students. For example, during the past decade, the No Child Left Behind (NCLB) legislation has mandated that states report annual testing results to the U.S. Department of Education as the basis for its focus on school accountability. When inferences are to be made at the school level, the focus is on the precision of the measurement for the aggregated groups of scores. The tests contain items that represent only a sample of the state’s learning standards. Also, an inference at the school level would not require that all students in a school be tested, though NCLB has required census testing.
How useful are state summative tests for making statements about individual students? We would suggest that they are not very useful, for two reasons. First, state tests are not long enough to yield scores with satisfactory measurement precision (particularly for the low and high performers). The design of the tests could be changed to better support inferences about individual students, but that would require longer tests than are currently being used, and therefore more testing time. The more important limitation, however, stems from when the tests are administered. They are typically administered at, or towards, the end of the school year, and test results are typically unavailable until after the school year has ended. Immediacy of results is less important when one is making inferences about schools. Moreover, the “shelf-life” of the information (i.e., for how long do the data support the intended inferences?) is much longer when it is the accountability of the school, rather than the performance of an individual student, that is the focus. To what extent are the various stakeholders’ assessment needs met by this type of summative test? School administrators can use the results to chart trends in student proficiency over time. In addition, administrators might make inferences (whether warranted or not) about the relative success of particular schools in educating students. Legislators can use the results from tests to identify educational program and funding needs. The general public can use the results to gauge the general effectiveness of the educational system that is funded by taxpayer dollars. Teachers may be able to use the results to help in their curriculum planning for future cohorts of students. Although some assessment needs are met by tests that are designed primarily for summative purposes, others are not. Teachers receive little information about the
instructional needs of this year’s students for two reasons. First, as noted above, teachers typically receive the results in the summer—well after the conclusion of the academic year. Second, in many, if not most, states, the results for a particular student are given as scale scores derived using technology such as item response theory (IRT), along with a coarse classification of proficiency relative to the state’s proficiency standards (e.g., “basic,” “proficient,” or “advanced”). Such general information has little instructional value for teachers. For the same reasons, students receive little or no actionable information about their specific instructional needs. Parents receive their student’s scale score and proficiency classification, along with information about the performance of the student’s school, but little information about their student’s academic growth, or what might be done to support the student’s learning.
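To illustrate the precision limitation noted earlier in this section (fixed-form tests measure students at the extremes of the scale less precisely than students near the middle), here is a minimal sketch under the Rasch model. The 40-item form, its item difficulties, and the ability values are invented for illustration and are not drawn from any actual state test.

```python
# Sketch: why a short, fixed-form test measures low and high performers less
# precisely. Under the Rasch model an item's Fisher information at ability
# theta is p * (1 - p), and the standard error of measurement (SEM) is
# 1 / sqrt(total information). The form below is hypothetical.

import math

def item_information(theta: float, b: float) -> float:
    """Rasch item information at ability theta for an item of difficulty b."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def sem(theta: float, difficulties: list) -> float:
    """Standard error of measurement of the test at ability theta."""
    total_info = sum(item_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(total_info)

if __name__ == "__main__":
    # A typical fixed form targets the middle of the proficiency distribution.
    form = [-1.0 + 2.0 * i / 39 for i in range(40)]  # difficulties from -1 to +1
    for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
        print(f"ability {theta:+.1f}: SEM = {sem(theta, form):.2f}")
```

Running this shows the standard error roughly doubling for abilities far from the middle of the form, which is the pattern behind the earlier observation that low and high performers are measured least precisely.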
Interim Assessment Interim assessments are focused on student achievement and growth relative to a trait of primary interest during instruction. They are typically used to assess student proficiency at multiple points during a year of instruction, and they are designed to support inferences about the academic growth of individual students. Because inferences are being made about the growth of individual students, high measurement accuracy and precision are needed. For this reason, a computerized adaptive test (CAT) is particularly useful. CATs, such as Northwest Evaluation Association’s Measures of Academic Progress (NWEA, 2012), can assess student proficiency and growth efficiently and with high precision.
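To make the adaptive logic concrete, the sketch below shows a minimal CAT loop under the Rasch model: pick the unadministered item whose difficulty is closest to the current ability estimate (the most informative item under that model), score the response, and re-estimate ability. This is only an illustration of the general principle, not the algorithm used by MAP or any other operational test; the item bank, the prior, and the simulated examinee are invented.

```python
# Sketch of a computerized adaptive test (CAT) under the Rasch model.
# Hypothetical item bank and examinee; not any vendor's actual algorithm.

import math
import random

def p_correct(theta: float, b: float) -> float:
    """Rasch model: probability of a correct response at ability theta
    for an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def eap_estimate(responses):
    """Expected-a-posteriori ability estimate from (difficulty, correct) pairs,
    using a discrete standard-normal prior on a grid of theta values."""
    grid = [i / 10.0 for i in range(-40, 41)]              # theta from -4 to +4
    posterior = []
    for t in grid:
        weight = math.exp(-0.5 * t * t)                    # unnormalized N(0, 1) prior
        for b, correct in responses:
            p = p_correct(t, b)
            weight *= p if correct else (1.0 - p)
        posterior.append(weight)
    total = sum(posterior)
    return sum(t * w for t, w in zip(grid, posterior)) / total

def run_cat(item_bank, true_theta, test_length=20):
    """Adaptive loop: administer the most informative unused item, score it,
    and re-estimate ability after every response."""
    unused = list(item_bank)
    responses = []
    theta_hat = 0.0                                        # start at the prior mean
    for _ in range(test_length):
        b = min(unused, key=lambda d: abs(d - theta_hat))  # closest difficulty
        unused.remove(b)
        correct = random.random() < p_correct(true_theta, b)  # simulated answer
        responses.append((b, correct))
        theta_hat = eap_estimate(responses)
    return theta_hat

if __name__ == "__main__":
    random.seed(1)
    bank = [i / 4.0 for i in range(-12, 13)]               # difficulties -3.0 to +3.0
    print("Estimated ability:", round(run_cat(bank, true_theta=1.2), 2))
```

Because each item is chosen near the examinee's current estimate, the test keeps gathering information wherever the student actually is on the scale, which is why an adaptive test of modest length can measure both low and high performers with a precision that a fixed form of the same length typically cannot match.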
Since interim assessments are designed to provide information about individual student growth, they are administered to all students for whom growth inferences are to be made. And because the information shelf life of interim test scores is short, immediacy of returning results to stakeholders is important. Interpretation of scores can be made relative to norms (i.e., how does the student’s growth compare to that of some reference group of students?), to aspirations (e.g., to what extent did the student meet the growth targets he or she helped establish?), or to long-term benchmarks (e.g., is the student making adequate progress toward college readiness?). Compared to summative assessments, interim assessments can provide useful information to a broader array of stakeholders. Students are able to gauge their academic growth relative to normative, aspirational, or long-term goals. Similarly, parents are able to use the results from interim assessments to track their child’s academic growth. Teachers can use the results to make instructional decisions about how they should manage and plan for the instruction of the entire cohort of students for whom they are responsible. School administrators can aggregate student results to assess trends in growth and as part of a plan to evaluate teacher effectiveness. Legislators can use interim assessment results both to evaluate the effectiveness of public educational policy and to articulate performance expectations for schools that take into account the academic progress of all students. Finally, the general public can use the results to gauge the effectiveness of the educational system. One of the most important aspects of interim testing is that it changes the unit of inference from groups of students (e.g., teacher; school) to the individual student. Because they are administered only several times per year, however, interim
assessments are ill-suited to inform teachers’ day-to-day instructional decision making. What is needed is a process that allows a teacher to capture learning as it occurs, and to make appropriate instructional adjustments.
Formative Assessment No system of instruction can be guaranteed to be effective. However well instruction is designed, because learning is largely a constructive, rather than a passive, process, the knowledge that learners construct will be influenced by their previous experiences. So, to a very real extent, each individual in a class experiences different instruction from the others. As David Ausubel (1968) reminded us almost half a century ago, to be effective, instruction must take into account the learner’s own starting point. To accomplish this, assessment must be a central process in effective instruction. Assessment is needed at the outset, to establish where learners are in their learning, and during instruction, to provide a means whereby the teacher can establish whether the instructional activities in which the students have engaged have resulted in the intended learning, and if not, to take appropriate action before moving on. This basic idea of a cycle of evidence collection, interpretation, and action can be operationalized in myriad ways, and along a number of time scales. Consider the following scenarios, taken from Wiliam (2011): 1. A team of mathematics teachers from the same school meet to discuss their professional development needs. They analyze the scores obtained by their students on national tests and see that while their scores are, overall, comparable to national benchmarks, their students tend to score less well on items involving
ratio and proportion. They decide to make ratio and proportion the focus of their professional development activities for the coming year, meeting regularly to discuss the changes they have made in the way they teach this topic. Two years later, they find that their students are scoring well on items on ratio and proportion in the national tests, which takes their students’ scores well above the national benchmarks. 2. Each year, a group of fourth-grade teachers meet to review students’ performance on a standardized reading test, and to examine the facility (proportion correct) for different kinds of items on the test. Where item facilities are lower than expected, they look at how the instruction on those aspects of reading was planned and delivered, and they look at ways in which the instruction can be strengthened in the following year. 3. Every seven weeks, teachers in a school use a series of interim tests to check on student progress. Any student who scores below a threshold judged to be necessary to make adequate progress is invited to attend additional instruction. Any student who scores below the threshold on two successive occasions is required to attend additional instruction. 4. A teacher designs an instructional unit on pulleys and levers. Following the pattern that is common in middle schools in Japan (Lewis, 2002, p. 76), although 14 periods are allocated to the unit, the teacher makes sure that all the content is covered in the first 11 periods. In period 12, the students complete a test on what they have covered in the previous 11 periods, and the teacher collects the students’ responses, reads them, and, on the basis of what she learns about the
class’s understanding of the topic, plans what she is going to do in lessons 13 and 14. 5. A teacher has just been discussing with a class why historical documents cannot be taken at face value. As the lesson is drawing to a close, each student is given a 3 by 5 index card and is asked to write an answer to the question “Why are historians concerned about bias in historical sources?” As they leave the classroom, the students hand the teacher these “exit passes” and after all the students have left, the teacher reads through the cards, and then decides how to begin the next lesson. 6. A sixth-grade class has been learning about different kinds of figurative language. In order to check on the class’s understanding, the teacher gives each student a set of five cards bearing the letters A, B, C, D, and E. On the interactive white board, she displays the following list:
A. Alliteration
B. Onomatopoeia
C. Hyperbole
D. Personification
E. Simile
She then reads out a series of statements:
1. He was like a bull in a china shop.
2. This backpack weighs a ton.
3. He was as tall as a house.
4. The sweetly smiling sunshine warmed the grass.
5. He honked his horn at the cyclist.
As each statement is read out to them, each member of the class has to hold up letter cards to indicate what kind of figurative language they have heard. The teacher realizes that almost all the students have assumed that each sentence can have only one kind of figurative language. She points out that the third sentence is a simile, but is also hyperbole, and she then re-polls the class on the last two statements, and finds that most students can now correctly identify the two kinds of figurative language in the last two statements. In addition, she makes a mental note of three students who answer most of the questions incorrectly, so that she can follow up with them individually at some later point. 7. A high-school chemistry teacher has been teaching a class how to balance chemical equations. In order to test the class, she writes up the unbalanced equation for the reaction of mercury hydroxide with phosphoric acid. She then invites students to change the quantities of the various elements in the equation, and when there are no more suggestions from the class, she asks the class to vote on whether the equation is now correct. All vote in the affirmative. The teacher concludes that the class has understood, and moves on. In each of these situations, information about student achievement was elicited, interpreted, and used to inform decisions about next steps in instruction. Moreover, the decision was either likely to be better, or better grounded in evidence, than the decision that would have been made had the evidence of student achievement not been used. This motivates the following definition of formative assessment, based on Black and Wiliam (2009):
An assessment functions formatively to the extent that evidence about student achievement elicited by the assessment is interpreted, and used to make decisions that are likely to be better, or better founded, than the decisions that would have been taken in the absence of the evidence. The important thing about this definition is that decisions, rather than data, are central. Rather than data-driven decision-making, this approach might be described as decision-driven data collection. As the seven scenarios above indicate, these decisions about instruction can be at a number of levels and over a range of time scales. In terms of levels, the instructional decisions can relate to an individual, a group of students, a whole class, a building, a district, or even a state. The time scale can be seconds, minutes, hours, days, weeks, months, or years. These two variables define a space that can be used to locate different kinds of formative assessment, as is shown in Figure 1 below, which provides indicative locations of the seven assessment scenarios presented above.
Figure 1: Level/cycle space As well as providing a way of relating the seven assessment scenarios described above, the level/cycle space diagram also draws attention to other possibilities for formative assessment, including highlighting the tendency for worthwhile formative assessment to be concentrated in the lower and rightmost part of the space. For some of the decisions that need to be taken, assessments that are reported on a unidimensional scale might be adequate, in which case attention would focus on the nature of the scale (e.g., nominal, ordinal, or equal interval) and how the performance of an individual was to be interpreted (e.g., with respect to an external criterion, the performance of other students, or the same student’s performance at some time in the past). Other decisions would require multidimensional information, for example a profile of achievement reported across a number of sub-domains. Where the focus was on a teacher’s instructional decision-making, the relevant group might be all the students in a grade (e.g., “Do we need to supplement the textbooks we are using to adequately cover the state standards?”) or all students in one group (e.g., “Which instructional units do I need to review with this class in preparation for an upcoming test?”). At other times, the focus might be on individual students.
Classroom assessment One obvious way in which assessment can function formatively is for assessments to be used to indicate different courses of action for different students. Students receiving instruction would be tested, and on the basis of the test outcomes, decisions would be taken about the next steps in instruction for each individual. Specifically, analysis of each individual student’s performance on a test might be used to tailor instruction for that student. This is the logic behind much of the current interest in “diagnostic assessment.” Although current systems for representing student achievement are in general rather too coarse grained to support individualized instruction, notable examples do exist, such as Carnegie Learning’s Cognitive Tutor for Algebra (Ritter, Anderson, Koedinger, & Corbett, 2007). An alternative take on classroom assessment is typified by the Diagnostic Items in Mathematics and Science (DIMS) project. If the response of one student to thirty items provides a reasonable basis for improving the decisions taken about the learning of that individual, the logic of the DIMS approach is that the response of thirty students to one item provides a reasonable basis for improving the decisions taken about the learning of that group of students (for further details see Wylie & Wiliam, 2006; 2007). One of the items developed in the DIMS project is shown in Figure 2 below.
Sheena leaves a wooden block, a glass flask, a woolly hat, and a metal stapler on a table overnight. What can she say about their temperatures the next morning?
A. The stapler will be colder than the other objects
B. The woolly hat will be warmer than the other objects
C. The temperatures of all four objects will be different
D. The temperatures of all four objects will be the same
Figure 2: Diagnostic item probing students’ understanding of temperature
In one sense, these two approaches represent two ends of a spectrum. If the responses of 30 students to 30 items are arranged in an array with students as rows, and item outcomes as columns, then the diagnostic testing approach involves analyzing each row separately, and the DIMS approach involves analyzing each column separately. This way of thinking about analyzing item responses suggests that other approaches that look for patterns in the array would also be worth exploring. The approach that is often entitled “response to intervention,” where students who are judged not to be making sufficient progress under conditions of ordinary instruction are given a different, more intensive approach, is in effect a version of diagnostic testing, in which a number of students are treated as equivalent. However, other approaches are possible. For example, an analysis of the item responses of a class might indicate that certain topics could usefully be re-taught to the whole class,
that there exist three distinct groups of students in terms of their understanding of the bulk of the subject matter under study, and that there are also three individuals with highly idiosyncratic patterns of response that indicate that the way they are learning this topic is very different from their peers, suggesting that further investigation of their problems is warranted. In other words, rather than trying to work out what is the one next step (for the class) or the thirty next steps (for the thirty individuals in the class), we might also usefully look for a set of five or six next steps.
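As a concrete illustration of the row-wise and column-wise analyses described above, the sketch below takes a small students-by-items array of scored responses and computes each student's proportion correct (the diagnostic-testing view), each item's facility (the DIMS view), and a simple whole-class signal: items that most of the class missed and that might be candidates for re-teaching. The response matrix and the 0.5 threshold are invented for illustration.

```python
# Sketch: analyzing a students-by-items response matrix by rows (per student)
# and by columns (per item). The data below are hypothetical.

from typing import List

def student_scores(matrix: List[List[int]]) -> List[float]:
    """Row-wise analysis: each student's proportion correct across all items."""
    return [sum(row) / len(row) for row in matrix]

def item_facilities(matrix: List[List[int]]) -> List[float]:
    """Column-wise analysis: each item's facility (proportion of students correct)."""
    n_students = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_students for j in range(n_items)]

def items_to_reteach(matrix: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Whole-class pattern: indices of items that most of the class got wrong."""
    return [j for j, f in enumerate(item_facilities(matrix)) if f < threshold]

if __name__ == "__main__":
    # 5 students x 4 items, scored 1 = correct, 0 = incorrect
    # (a real class might be 30 x 30, as in the example above).
    responses = [
        [1, 0, 1, 1],
        [1, 0, 1, 0],
        [0, 0, 1, 1],
        [1, 1, 1, 0],
        [1, 0, 0, 1],
    ]
    print("Per-student proportion correct:", student_scores(responses))
    print("Per-item facility:", item_facilities(responses))
    print("Items to consider re-teaching:", items_to_reteach(responses))
```

Richer analyses of the same array, such as clustering students with similar response patterns or flagging individuals whose patterns fit no cluster, correspond to the intermediate possibilities the paragraph above points toward.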
Connecting (some of) the dots Figure 1 illustrated two dimensions along which assessments might vary: the “shelf-life” of the assessment and the level of aggregation. To this we can add a third dimension—the functions that assessment might serve. These functions might broadly be classified as “instructional guidance,” “describing individuals,” and “institutional accountability.” Obviously these three dimensions are not entirely independent of each other. It seems rather unlikely that anyone would want to collect building-level data on an hourly basis for the purpose of institutional accountability. On the other hand, it is not possible to regard any of the three dimensions as completely subsumed within another. Hinge-point questions are most meaningful at the level of an instructional group, as are exit passes and “before-the-end-of-the-unit” tests, while decisions about academic promotion are, by definition, taken at the level of the individual student. The three dimensions therefore represent a space within which different kinds of assessments can be placed. Obviously representing this in a two-dimensional medium such as a book chapter is difficult, so Figure 3 merely represents the cycle length and the assessment function. The other dimension
(aggregation level) might therefore be considered to be at right angles to the surface of the page. The three-dimensional space represented in Figures 1 and 3 provides a way of relating different functions, time scales, and levels of aggregation for assessments, but does not, of course, provide any guidance about the kinds of assessment that best fulfill these needs. While the definitions of formative, interim, and summative assessments proposed here indicate that these are functions that assessment outcomes can serve, rather than properties of assessments themselves, this does not mean that any assessment can serve any purpose. This is important because, especially in the U.S., testing is unpopular, and therefore to minimize the amount (and cost) of testing it seems attractive to use the same assessment to serve multiple functions, which immediately raises issues about the validity of using the same test for different purposes.
[Figure 3 arrays assessments by cycle length (hourly, daily, weekly, interim/benchmark, annual) against three assessment functions: instructional guidance (“formative”), describing individuals (“summative”), and institutional accountability (“evaluative”). Examples placed in the space include hinge-point questions, exit passes, before-the-end-of-unit tests, end-of-unit tests, common formative assessments, growth measures, end-of-course exams, academic promotion, and high-stakes accountability.]
Figure 3: Cycle length and the functions of assessment
Test validity As many authors have pointed out, the idea that validity is a property of a test is problematic, since the test may be valid for some purposes and not others, valid for some populations and not others, and valid under some circumstances and not others. Although agreement is not universal, most authors seem to agree with Cronbach, Messick, and others that validity is a property of inferences supported by test scores (Cronbach, 1971; Messick, 1989). While test scores from one assessment may be able to serve different kinds of inferences, for example about students, groups of students, schools, districts, or states, a validity argument would need to be constructed for each of the intended inferences. This much appears to be fairly widely accepted (see, for example, the various Standards for Educational and Psychological Testing developed by the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education). However, what is less widely appreciated is that even when each of the different inferences that tests are to support is effectively validated, if these validation exercises are undertaken independently, they may not adequately account for what happens when the tests are used to support different inferences simultaneously. For example, if assessments do function formatively, then they are likely to modify, and presumably improve, the instruction received by students. If this instruction is improved, then this weakens the ability of the same assessment information to function summatively. A medical analogy might be helpful here. If a blood test on an individual reveals high levels of cholesterol, which prompts a doctor
to prescribe a course of statins, which in turn has the effect of lowering the level of cholesterol, then the original blood test is now inaccurate, because it has been used to change things for the better. In the same way, if assessment outcomes are used formatively to improve instruction, leading to higher achievement, the assessment outcomes are no longer useful indications of the students’ achievement, because the outcomes have been successful in improving the instruction. This gives us a version of the Pauli exclusion principle in physics—assessment outcomes can function summatively only if they do not function formatively. If they function formatively, then they can no longer function summatively, because they are likely to have improved the instruction to the extent that the original assessment data are no longer relevant. As a second example, consider the use of results achieved by individual students on a state test. The tests are typically designed to indicate the degree to which students have mastered the state standards for their grade, but do this by sampling across the standards. Where the same assessment outcomes are used to hold teachers accountable, teachers are incentivized to teach only those aspects of the standards that are likely to be tested. Scores go up, but the results obtained by students are now less useful as indicators of students’ achievement, since inferences about aspects of the standards that were not tested are likely to be less valid. Tests that might, if used solely for this purpose, provide useful information about students’ mastery of standards no longer do so because they have been used to support other kinds of inferences. As a third example, consider the use made by a district of interim or benchmark tests, in order to monitor the extent to which students in a school building
are on track to be regarded as proficient on a state test at some point in the future. When such tests are used as low-stakes tests, they can provide valuable information about where additional instructional resources might best be deployed. However, in some districts, the scores on such low-stakes assessments are also used to provide early warning about ineffective instruction. Even if this is not the case, individual teachers may believe that unwelcome attention will be focused upon them if the scores of their students are lower than expected. As a result, they may decide to spend significant amounts of classroom time preparing for these tests. Not only does this preparation take time away from instruction, but it also makes the results of the test difficult to interpret, since without information about the amount of specific preparation undertaken for the tests, results will not be comparable across classrooms. What is important to note about each of these three examples is that in each case, assessment outcomes were used for multiple purposes, and while the additional uses may well be justified in their own right, the effect of these multiple usages was to weaken the ability of the assessment to serve its original purpose. This suggests that while the same assessment outcome information could be used for multiple purposes, and it would seem efficient to do so, great care needs to be taken that any additional use of assessment information does not weaken the ability of the assessment to serve both the additional and the original function. Indeed, it does not seem to us to be unreasonable to argue that where any assessment is used for more than a single function, the validity of the assessment can be established only by a validation process in which all the intended inferences that a test is to support are validated concurrently. As Wiliam and Black (1996) have pointed out, it may well be
that the formative functions that assessments serve are validated primarily by their consequences, while interim and summative functions of assessment are validated primarily in terms of their meanings, but the interactions between the different uses of the assessment need to be explored in a systemic way to minimize the likelihood of unintended consequences. Where such concurrent validation is not possible, it seems to us that a “self-denying ordinance” should be adopted. However attractive it might seem to use the same data to serve multiple functions, there is sufficient evidence to suggest that the costs of the unintended consequences of multiple uses of assessment data, even when each of the uses is validated, are likely to be greater than the costs of additional data collection.
Conclusion: Building a strong assessment system While formative assessment practices, interim assessments, and summative assessments all provide important information to educational stakeholders, putting them together in a way that best serves the needs of each student is as tricky as building a Saturn rocket from a table full of Legos. While views on what constitutes a strong assessment system will vary widely, the following are a few elements that follow from the student-centered mission of education that we adopted earlier. A strong system of assessments will:
• provide students with immediate feedback concerning their progress
• provide teachers with actionable information concerning their students’ needs
• provide teachers with information useful in long-range instructional planning
• provide school administrators with information about the school’s progress
• provide the public with information about student achievement and growth
• be designed to have an impact in the classroom
• communicate needed information clearly to teachers and students
• use a strong measurement scale to measure growth
• provide normative, criterion, and content references to make meaning of performance
• use a strong measurement design to measure growth well
If we use these characteristics as a starting point, we can begin to fashion an
assessment system that benefits from the unique characteristics of each type of assessment that we have considered above. A strong assessment system including the characteristics described above can be developed in any number of ways, but any development needs to be thoughtful and mission-driven. Below, we illustrate one way in which these disparate elements might be brought together, and how one particular system might address some of the tensions we have described above. There is no one perfect system, because each system needs to be designed to take account of the constraints and affordances in the area, but the hypothetical example below shows how the principles identified in this paper might inform the design of the “assessment-rich” school. Larkrise Middle School, Lake Wobegon Students who are entering sixth grade at Larkrise Middle School in the fall complete an interim assessment in the previous May. This, combined with an electronic portfolio and individual student profiles prepared by the fifth-grade teacher at their elementary school, is used to help the middle school allocate students to classes, ensuring the full range of achievement in each class, and to set individual
growth targets for each student. Parents have online access to the electronic portfolio, the teacher reports, and the scores gained by their children on the interim tests. Teachers at Larkrise Middle School meet once a month in cross-grade teams to plan learning progressions, using the protocol outlined in Leahy & Wiliam (2011). On the basis of these learning progressions, they produce short tests that they use approximately once every two weeks to determine how far along the learning progression the students in their classes have reached, and they also plan high-quality single “hinge-point” items that they incorporate into their lesson plans. Teachers also meet in grade-based teams every two weeks to review the progress their students have made. The seven administrators at Larkrise Middle School undertake “Learning walks” approximately once per month, in which they attempt to visit as many classrooms as possible, typically spending between 10 and 15 minutes in each classroom they visit. During a day, they are generally able to visit the classrooms of every single one of the teachers at the school. At the end of each visit, the teacher being observed receives a short report slip that follows the “two stars and a wish” protocol (two positive aspects of the practice observed, and one reflection point for the teacher—see Wiliam, 2011, for more details). The administrator also has a copy of the report, but this does not give the observed teacher’s name, since, as a result of the “self-denying ordinance” discussed above, the administrative team has decided that the quality of the evidence collected from a single lesson after a 10-minute observation is not sufficiently reliable to provide a basis for the evaluation of a particular teacher (see Hill, 2012). Teachers are, however, free to use these report slips in their annual meetings with their supervisors to discuss their future professional development priorities. Although the
results of 10-minute observations on individual teachers may not support inferences about the quality of individual teachers, the evidence from the 100 to 150 lessons observed during a typical “Learning walk” day does provide a sound evidence base for the average quality of instruction being provided in the school. By reviewing trends over several months, the administrators are able to determine whether institution-wide initiatives are having an effect on instruction. These reviews of long-term trends are also informed by a monthly questionnaire completed by a sample of 10% of the students (students are randomly allocated to complete one questionnaire each year). At the end of the first marking period (six weeks into the school year) students take an interim assessment that gives each teacher and student a first look at achievement during the year and progress toward growth targets. This information is used to make “mid-course” corrections and is used as the basis of the second series of reports to stakeholders. In keeping with the decision about “self-denying ordinances” described above, data on student achievement on these interim assessments are never used to support inferences about individual teachers. During the second and third marking periods, formative assessment approaches allow each teacher to adjust content as each student progresses. The regular monitoring of student progress allows a “response to intervention” type approach to be used, whereby students who are not making adequate progress are provided with additional support, which takes the form of tuition in smaller groups, re-allocation to the classes of teachers known to be highly effective with students with special needs, or special “catch-up” classes. Toward the end of the school year, the summative assessment identifies the overall achievement of the students in the class to help determine what there is to celebrate,
and what might be done better in subsequent years. This information is also passed to stakeholders in the form of easily readable reports that describe the depth and breadth of the accomplishments of the school. At the end of the school year, an interim assessment gives each teacher and student a look at achievement during the year and attainment of growth targets. This information serves as the basis of the final series of stakeholder reports, which describe both the accomplishments of the year and the changes that will be made to serve students better in the upcoming years. “Value-added” analyses are also undertaken to establish the total progress made by students in the school, and, where sufficient reliability can be achieved, these analyses, along with observational data, feed into the evaluation of each teacher’s performance over the year. Obviously, another group of educators might come to a substantially different design for using these types of assessment together to improve education. However, as long as we keep a mission that is centered on the student in mind, it is unlikely that we will go too far wrong. The quality of our educational systems may be seen most easily in test scores and student growth, but it is important to remember that the quality of education is best seen in the accomplishments of our students. The best that schools can hope to do is to set our students along paths that will, eventually, make our world a better place for them and their children.
References Ausubel, D. P. (1968). Educational psychology: a cognitive view. New York, NY: Holt, Rinehart & Winston.
Black, P. J., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5–31.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.
Fletcher, R. (2000). A review of linear programming and its application to the assessment tools for teaching and learning (asTTle) projects. Auckland, NZ: University of Auckland.
Hill, H. C. (2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–84.
Leahy, S., & Wiliam, D. (2011, April). Devising learning progressions. Paper presented at the annual meeting of the American Educational Research Association held at New Orleans, LA.
Lewis, C. C. (2002). Lesson study: a handbook of teacher-led instructional change. Philadelphia, PA: Research for Better Schools.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Washington, DC: American Council on Education/Macmillan.
Mullis, I., Martin, M., Kennedy, A., Trong, K., & Sainsbury, M. (2009). PIRLS 2011 assessment framework. Boston, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Mullis, I., Martin, M., Ruddock, G., O’Sullivan, C., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks. Boston, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Northwest Evaluation Association. (2009). Technical manual for Measures of Academic Progress and Measures of Academic Progress for Primary Grades. Portland, OR: Author.
Northwest Evaluation Association. (2012). Measures of Academic Progress®. Retrieved March 31, 2012, from http://www.nwea.org/products-services/computer-based-adaptive-assessments/map
OECD. (2012). PISA 2009 technical report. Paris, France: Author.
Ritter, S., Anderson, J. R., Koedinger, K. R., & Corbett, A. (2007). Cognitive Tutor: applied research in mathematics education. Psychonomic Bulletin & Review, 14(2), 249–255.
Wiliam, D. (2011). Embedded formative assessment. Bloomington, IN: Solution Tree.
Wiliam, D., & Black, P. J. (1996). Meanings and consequences: a basis for distinguishing formative and summative functions of assessment? British Educational Research Journal, 22(5), 537–548.
Wylie, E. C., & Wiliam, D. (2006). Diagnostic questions: is there value in just one? Paper presented at the annual meeting of the National Council on Measurement in Education held at San Francisco, CA.
Wylie, E. C., & Wiliam, D. (2007). Analyzing diagnostic questions: what makes a student response interpretable? Paper presented at the annual meeting of the National Council on Measurement in Education held at Chicago, IL.