Connecting the dots: Formative, interim, and summative assessment

Dylan Wiliam, Gage Kingsbury, Steven Wise

In R. W. Lissitz (Ed.), Informing the practice of teaching using formative and interim assessment: A systems approach (pp. 1-19). Charlotte, NC: Information Age Publishing (2013).

Introduction

Over the past twenty years, interest in educational success has grown dramatically, in response to a variety of factors that differ from one country to the next. As international studies of educational performance have made cross-country comparisons available (TIMSS: Mullis, Martin, Ruddock, O'Sullivan, & Preuschoff, 2009; PISA: OECD, 2012; PIRLS: Mullis, Martin, Kennedy, Trong, & Sainsbury, 2009), the ranking of countries' performance has attracted widespread attention. At the same time, worries about educational funding have heightened governmental interest in educational budgeting and effectiveness. Parents, too, are concerned about the escalating need for a good education to help their children succeed as they enter the workforce.

As these factors have been at work outside the classroom, teachers have been pushed to accomplish more for their students in the face of declining educational budgets. The pressure for the current generation to compete for careers with others
based around the globe has changed the nature of what educational success actually means.

Because so many factors have increased interest in education, we now have a host of stakeholders whose interests may vary. Students, parents, teachers, school and district administrators, legislators, and the general public (including the business community) all have an interest in how education is working. These groups differ dramatically in what they consider quality education to be and what they believe the goals of education should be. Examples may include the following:

• District administrators may be most interested in providing meaningful education for the diverse students in their school district, within the limits of the current budget.
• Legislators may be interested in establishing laws and regulations that improve the quality of education compared to other states or other countries.
• Business owners may want the schools to provide them with students who are capable of stepping into their entry-level positions.
• Parents may want schools to provide their children with opportunities that they never had.
• Teachers may want their school to provide them with supplies, resources, and support to help them in the classroom.
• Students may want their school to help them find out what they can and want to do in their lives.

Given the variety of these needs, and the hundreds of others that might be included in a plan for helping education move forward, it is useful to consider what the mission of an educational system should be. Many different thinkers have considered this issue, and the resulting comments have been quite varied. When we start to review them, though, commonalities emerge from very diverse sources. Thomas Jefferson said that education had as its purpose "the ideal of offering all children the opportunity to succeed, regardless of who their parents happen to be." George Washington Carver suggested that "Education is the key to unlock the golden door of freedom." More recently, Malcolm Forbes said that "Education's purpose is to replace an empty mind with an open one." While views may vary, it is clear that these speakers commonly viewed education as a way to expand students' views of the world. We will adopt that view, and consider the student as an evolving human being whose view of the world expands, as they grow, to include the wide variety of possibilities that are available.

Our view is that the student, and the future we owe that student, need to be central to any educational system. With this as a starting point, we will make the following assumption concerning the development and improvement of a system of education:

The mission of an educational system is to provide each student with an opportunity to learn what life has available, to help them decide what interests them, and to help them learn as much as they can to take them in their desired direction.

For this chapter we will use this mission as our starting and ending point.

Clearly, a different mission statement will lead to very different conclusions, but it may also make the student less central to the educational process. Since education is a less satisfying enterprise if it doesn't involve students, we will include them at the heart of our discussion.

Assessment Needs

To this point, we haven't discussed assessment at all, but it is clear that as interest in educational quality has increased, so has interest in test scores. Countries around the world have systems in place that require the testing of some or all students, in some or all grades, in some or all subjects. Some of these testing systems have been developed with the needs of school personnel in mind (asTTle is a very fine example; Fletcher, 2000). In the United States, however, most have been developed to provide external agencies, such as state and federal governments, with a window into the development of student competency in the schools. These assessments often provide a very narrow view of education, commonly testing only a few subjects, with at most one test per year.

The shortcomings of these tests have caused schools to use a wide variety of other tests, designed to serve different needs and different groups of students. The result is that we have many tests in use, but few ways to design assessment systems that are efficient and effective in telling us about students while also helping those students learn.

Currently, the primary focus of federal regulation in the United States is summative assessment. This focus creates an imbalance in the classroom, since summative assessment meets the needs of only a few educational stakeholders. We
need to find a better balance, so that each assessment tool is used when it is appropriate, and each assessment provides the information we need to inform every stakeholder and, above all, to serve the most important stakeholder: the student.

In the remainder of this chapter, we will describe some of the types of assessments that are used in schools today. We will then attempt to connect some of the dots to describe an assessment system that could be useful to students, as well as to the other stakeholders in our schools.

Summative Assessment

In the United States, the most common types of summative assessment currently used are state assessments, which are used to assess student proficiency toward the end of a school year. Scores from these tests are usually aggregated to support inferences about groups of students. For example, during the past decade, the No Child Left Behind (NCLB) legislation has mandated that states report annual testing results to the U.S. Department of Education as the basis for its focus on school accountability.

When inferences are to be made at the school level, the focus is on the precision of the measurement for the aggregated groups of scores. The tests contain items that represent only a sample of the state's learning standards. Also, an inference at the school level would not require that all students in a school be tested, though NCLB has required census testing.

How  useful  are  state  summative  tests  for  making  statements  about  individual   students?    We  would  suggest  that  they  are  not  very  useful,  for  two  reasons.    First,  state   tests  are  not  long  enough  to  yield  scores  with  satisfactory  measurement  precision   (particularly  for  the  low  and  high  performers).    The  design  of  the  tests  could  be   changed  to  better  support  inferences  about  individual  students,  but  that  would   require  longer  tests  than  are  currently  being  used,  and  therefore  more  testing  time.     The  more  important  limitation,  however,  stems  from  when  the  tests  are  administered.     They  are  typically  administered  at,  or  towards  the  end  of  the  school  year,  and  test   results  are  typically  unavailable  until  after  the  school  year  has  ended.   Immediacy  of  results  is  less  important  when  one  is  making  inferences  about   schools.    Moreover,  the  “shelf-­‐life”  of  the  information  (i.e.,  for  how  long  do  the  data   support  the  intended  inferences?)  is  much  longer  when  it  is  the  accountability  of  the   school,  rather  than  the  performance  of  an  individual  student,  that  is  the  focus.   To  what  extent  are  the  various  stakeholders’  assessment  needs  met  by  this   type  of  summative  test?    School  administrators  can  use  the  results  to  chart  trends  in   student  proficiency  over  time.    In  addition,  administrators  might  make  inferences   (whether  warranted  or  not)  about  the  relative  success  of  particular  schools  in   educating  students.    Legislators  can  use  the  results  from  tests  to  identify  educational   program  and  funding  needs.    The  general  public  can  use  the  results  to  gauge  the   general  effectiveness  of  the  educational  system  that  is  funded  by  taxpayer  dollars.     Teachers  may  be  able  to  use  the  results  to  help  in  their  curriculum  planning  for  future   cohorts  of  students.   Although  some  assessment  needs  are  met  by  tests  that  are  designed  primarily   for  summative  purposes,  others  are  not.    Teachers  receive  little  information  about  the  
instructional  needs  of  this  year’s  students  for  two  reasons.    First,  as  noted  above,   teachers  typically  receive  the  results  in  the  summer—well  after  the  conclusion  of  the   academic  year.    Second,  in  many,  if  not  most,  states,  the  results  for  a  particular  student   are  given  as  scale  scores  using  technology  such  as  item-­‐response  theory  (IRT)  along   with  a  coarse  classification  of  proficiency  relative  to  the  state’s  proficiency  standards   (e.g.,  “basic,”  “proficient”  or  “advanced”).    Such  general  information  has  little   instructional  value  for  teachers.    For  the  same  reasons,  students  receive  little  or  no   actionable  information  about  their  specific  instructional  needs.    Parents  receive  their   student’s  scale  score  and  proficiency  classification,  along  with  information  about  the   performance  of  the  student’s  school,  but  little  information  about  their  student’s   academic  growth,  or  what  might  be  done  to  support  the  student’s  learning.    
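To make the phrase "scale score" concrete, the simplest IRT model (the Rasch model) relates a student's score to item difficulty as follows; the notation is ours, added for illustration, and does not come from the chapter or from any particular state's scoring system:

\[ P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \]

Here \(X_{ij}\) is student i's response (1 if correct) to item j, \(\theta_i\) is the student's scale score, and \(b_j\) is the item's difficulty, both expressed on the same logit scale. The reported scale score is the value of \(\theta_i\) under which the student's observed pattern of right and wrong answers is most likely. Because a single number summarizes the whole response pattern, it says nothing about which particular standards the student has or has not mastered, which is one reason such scores carry so little instructional information.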

Interim Assessment

Interim assessments are focused on student achievement and growth relative to a trait of primary interest during instruction. They are typically used to assess student proficiency at multiple points during a year of instruction, and they are designed to support inferences about the academic growth of individual students. Because inferences are being made about the growth of individual students, high measurement accuracy and precision are needed. For this reason, a computerized adaptive test (CAT) is particularly useful. CATs, such as Northwest Evaluation Association's Measures of Academic Progress (NWEA, 2012), can assess student proficiency and growth efficiently and with high precision.
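The chapter does not describe how a CAT selects items, but a small sketch may help convey why adaptive tests can be both short and precise. The sketch below is ours, written for illustration only (it is not NWEA's algorithm, and the item pool, estimator, and test length are invented): each successive item is administered at the difficulty closest to the current proficiency estimate, which under the Rasch model is the most informative choice.

```python
import math
import random

def p_correct(theta, b):
    """Rasch model: probability of a correct response given ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses):
    """Crude maximum-likelihood estimate of theta over a coarse grid.
    `responses` is a list of (item_difficulty, correct) pairs."""
    grid = [x / 10.0 for x in range(-40, 41)]  # theta values from -4.0 to 4.0
    def log_lik(theta):
        ll = 0.0
        for b, correct in responses:
            p = p_correct(theta, b)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=log_lik)

def next_item(theta, pool, used):
    """Pick the unused item whose difficulty is closest to theta.
    Under the Rasch model this maximizes Fisher information, p * (1 - p)."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return min(candidates, key=lambda i: abs(pool[i] - theta))

def run_cat(true_theta, pool, test_length=20):
    """Simulate one adaptive administration: estimate, select, administer, re-estimate."""
    theta_hat, responses, used = 0.0, [], set()
    for _ in range(test_length):
        item = next_item(theta_hat, pool, used)
        used.add(item)
        correct = random.random() < p_correct(true_theta, pool[item])  # simulated answer
        responses.append((pool[item], correct))
        theta_hat = estimate_theta(responses)
    return theta_hat

pool = [random.uniform(-3, 3) for _ in range(200)]  # invented item difficulties
print(run_cat(true_theta=1.2, pool=pool))
```

Operational adaptive tests add content balancing, item-exposure control, and more efficient estimators, but the core loop of estimate, select, administer, and re-estimate is what allows a short test to achieve high precision across the full range of achievement, including the low and high performers for whom fixed-form state tests are least precise.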

Since  interim  assessments  are  designed  to  provide  information  about   individual  student  growth,  they  are  administered  to  all  students  for  whom  growth   inferences  are  to  be  made.    And  because  the  information  shelf  life  of  interim  test   scores  is  short,  immediacy  of  returning  results  to  stakeholders  is  important.     Interpretation  of  scores  can  be  made  relative  either  to  norms  (i.e.,  how  does  the   student’s  growth  compare  to  that  of  some  reference  group  of  students?),  to   aspirations  (e.g.,  to  what  extent  did  the  student  meet  the  growth  targets  he  or  she   helped  establish?),  or  to  long-­‐term  benchmarks  (e.g.,  is  the  student  making  adequate   progress  toward  college  readiness?).     Compared  to  summative  assessments,  interim  assessments  can  provide  useful   information  to  a  broader  array  of  stakeholders.    Students  are  able  to  gauge  their   academic  growth  relative  to  normative,  aspirational,  or  long-­‐term  goals.    Similarly,   parents  are  able  to  use  the  results  from  interim  assessments  to  track  their  child’s   academic  growth.    Teachers  can  use  the  results  to  make  instructional  decisions  about   how  they  should  manage  and  plan  for  the  instruction  of  the  entire  cohort  of  students   for  whom  they  are  responsible.    School  administrators  can  aggregate  student  results   to  assess  trends  in  growth  and  as  part  of  a  plan  to  evaluate  teacher  effectiveness.     Legislators  can  use  interim  assessment  results  to  both  evaluate  the  effectiveness  of   public  educational  policy  and  articulate  performance  expectations  for  schools  that   take  into  account  the  academic  progress  of  all  students.    Finally,  the  general  public  can   use  the  results  to  gauge  the  effectiveness  of  the  educational  system.   One  of  the  most  important  aspects  of  interim  testing  is  that  it  changes  the  unit   of  inference  from  groups  of  students  (e.g.,  teacher;  school)  to  the  individual  student.     Because  they  are  administered  only  several  times  per  year,  however,  interim  
assessments  are  ill-­‐suited  to  inform  teachers’  day-­‐to-­‐day  instructional  decision   making.    What  is  needed  is  a  process  that  allows  a  teacher  to  capture  learning  as  it   occurs,  and  to  make  appropriate  instructional  adjustments.  

Formative Assessment

No system of instruction can be guaranteed to be effective. However well instruction is designed, because learning is largely a constructive, rather than a passive, process, the knowledge that learners construct will be influenced by their previous experiences. So, to a very real extent, each individual in a class experiences different instruction from the others.

As David Ausubel (1968) reminded us almost half a century ago, to be effective, instruction must take into account the learner's own starting point. To accomplish this, assessment must be a central process in effective instruction. Assessment is needed at the outset, to establish where learners are in their learning, and during instruction, to provide a means whereby the teacher can establish whether the instructional activities in which the students have engaged have resulted in the intended learning, and, if not, to take appropriate action before moving on.

This basic idea of a cycle of evidence collection, interpretation, and action can be operationalized in myriad ways, and along a number of time-scales. Consider the following scenarios, taken from Wiliam (2011):

1. A team of mathematics teachers from the same school meet to discuss their professional development needs. They analyze the scores obtained by their students on national tests and see that while their scores are, overall, comparable to national benchmarks, their students tend to score less well on items involving
ratio and proportion. They decide to make ratio and proportion the focus of their professional development activities for the coming year, meeting regularly to discuss the changes they have made in the way they teach this topic. Two years later, they find that their students are scoring well on items on ratio and proportion in the national tests, which takes their students' scores well above the national benchmarks.

2. Each year, a group of fourth-grade teachers meet to review students' performance on a standardized reading test, and to examine the facility (proportion correct) for different kinds of items on the test. Where item facilities are lower than expected, they look at how the instruction on those aspects of reading was planned and delivered, and they look at ways in which the instruction can be strengthened in the following year.

3. Every seven weeks, teachers in a school use a series of interim tests to check on student progress. Any student who scores below a threshold judged to be necessary to make adequate progress is invited to attend additional instruction. Any student who scores below the threshold on two successive occasions is required to attend additional instruction.

4. A teacher designs an instructional unit on pulleys and levers. Following the pattern that is common in middle schools in Japan (Lewis, 2002, p. 76), although 14 periods are allocated to the unit, the teacher makes sure that all the content is covered in the first 11 periods. In period 12, the students complete a test on what they have covered in the previous 11 periods, and the teacher collects the students' responses, reads them, and, on the basis of what she learns about the
class's understanding of the topic, plans what she is going to do in lessons 13 and 14.

5. A teacher has just been discussing with a class why historical documents cannot be taken at face value. As the lesson is drawing to a close, each student is given a 3 by 5 index card and is asked to write an answer to the question "Why are historians concerned about bias in historical sources?" As they leave the classroom, the students hand the teacher these "exit passes," and after all the students have left, the teacher reads through the cards and then decides how to begin the next lesson.

6. A sixth-grade class has been learning about different kinds of figurative language. In order to check on the class's understanding, the teacher gives each student a set of five cards bearing the letters A, B, C, D, and E. On the interactive white board, she displays the following list:

A. Alliteration
B. Onomatopoeia
C. Hyperbole
D. Personification
E. Simile

She then reads out a series of statements:

1. He was like a bull in a china shop.
2. This backpack weighs a ton.
3. He was as tall as a house.
4. The sweetly smiling sunshine warmed the grass.
5. He honked his horn at the cyclist.

As each statement is read out, each member of the class has to hold up letter cards to indicate what kind of figurative language they have heard. The teacher realizes that almost all the students have assumed that each sentence can contain only one kind of figurative language. She points out that the third sentence is a simile but is also hyperbole, and she then re-polls the class on the last two statements, and finds that most students can now correctly identify the two kinds of figurative language in those statements. In addition, she makes a mental note of three students who answer most of the questions incorrectly, so that she can follow up with them individually at some later point.

7. A high-school chemistry teacher has been teaching a class how to balance chemical equations. In order to test the class, she writes up the unbalanced equation for the reaction of mercury hydroxide with phosphoric acid. She then invites students to change the quantities of the various elements in the equation, and when there are no more suggestions from the class, she asks the class to vote on whether the equation is now correct. All vote in the affirmative. The teacher concludes that the class has understood, and moves on.

In each of these situations, information about student achievement was elicited, interpreted, and used to inform decisions about next steps in instruction. Moreover, the decision was either likely to be better, or better grounded in evidence, than the decision that would have been made had the evidence of student achievement not been used. This motivates the following definition of formative assessment, based on Black and Wiliam (2009):

An  assessment  functions  formatively  to  the  extent  that  evidence   about  student  achievement  elicited  by  the  assessment  is   interpreted,  and  used  to  make  decisions  that  are  likely  to  be  better,   or  better  founded,  than  the  decisions  that  would  have  been  taken   in  the  absence  of  the  evidence.     The  important  thing  about  this  definition  is  that  decisions,  rather  than  data,  are   central.    Rather  than  data-­‐driven  decision-­‐making,  this  approach  might  be  described   as  decision-­‐driven  data  collection.   As  the  seven  scenarios  above  indicate,  these  decisions  about  instruction  can  be   at  a  number  of  levels  and  over  a  range  of  time  scales.    In  terms  of  levels,  the   instructional  decisions  can  relate  to  an  individual,  a  group  of  students,  a  whole  class,  a   building,  a  district,  or  even  a  state.    The  time  scale  can  be  seconds,  minutes,  hours,   days,  weeks,  months,  or  years.    These  two  variables  define  a  space  that  can  be  used  to   locate  different  kinds  of  formative  assessment,  as  is  shown  in  Figure  1  below,  which   provides  indicative  locations  of  the  seven  assessment  scenarios  presented  above.    

Figure 1: Level/cycle space

As well as providing a way of relating the seven assessment scenarios described above, the level/cycle space diagram also draws attention to other possibilities for formative assessment, including highlighting the tendency for worthwhile formative assessment to be concentrated in the lower and rightmost parts of the space.

For some of the decisions that need to be taken, assessments that are reported on a unidimensional scale might be adequate, in which case attention would focus on the nature of the scale (e.g., nominal, ordinal, or equal interval) and how the performance of an individual was to be interpreted (e.g., with respect to an external criterion, the performance of other students, or the same student's performance at some time in the past). Other decisions would mandate multidimensional information, for example a profile of achievement across a number of sub-domains. Where the focus was on a teacher's instructional decision-making, the relevant group might be all the students in a grade (e.g., "Do we
need to supplement the textbooks we are using to adequately cover the state standards?") or all students in one group (e.g., "Which instructional units do I need to review with this class in preparation for an upcoming test?"). At other times, the focus might be on individual students.

Classroom assessment

One obvious way in which assessment can function formatively is for assessments to be used to indicate different courses of action for different students. Students receiving instruction would be tested, and on the basis of the test outcomes, decisions would be taken about the next steps in instruction for each individual. Specifically, analysis of each individual student's performance on a test might be used to tailor instruction for that student. This is the logic behind much of the current interest in "diagnostic assessment." Although current systems for representing student achievement are in general rather too coarse-grained to support individualized instruction, notable examples do exist, such as Carnegie Learning's Cognitive Tutor for Algebra (Ritter, Anderson, Koedinger, & Corbett, 2007).

An alternative take on classroom assessment is typified by the Diagnostic Items in Mathematics and Science (DIMS) project. If the responses of one student to thirty items provide a reasonable basis for improving the decisions taken about the learning of that individual, the logic of the DIMS approach is that the responses of thirty students to one item provide a reasonable basis for improving the decisions taken about the learning of that group of students (for further details see Wylie & Wiliam, 2006, 2007). One of the items developed in the DIMS project is shown in Figure 2 below.

Sheena leaves a wooden block, a glass flask, a woolly hat, and a metal stapler on a table overnight. What can she say about their temperatures the next morning?

A. The stapler will be colder than the other objects
B. The woolly hat will be warmer than the other objects
C. The temperatures of all four objects will be different
D. The temperatures of all four objects will be the same

Figure 2: Diagnostic item probing students' understanding of temperature

In one sense, these two approaches represent two ends of a spectrum. If the responses of 30 students to 30 items are arranged in an array with students as rows, and item outcomes as columns, then the diagnostic testing approach involves analyzing each row separately, and the DIMS approach involves analyzing each column separately. This way of thinking about analyzing item responses suggests that other approaches that look for patterns in the array would also be worth exploring. The approach that is often entitled "response to intervention," where students who are judged not to be making sufficient progress under conditions of ordinary instruction are given a different, more intensive approach, is in effect a version of diagnostic testing, in which a number of students are treated as equivalent. However, other approaches are possible. For example, an analysis of the item responses of a class might indicate that certain topics could usefully be re-taught to the whole class,
that there exist three distinct groups of students in terms of their understanding of the bulk of the subject matter under study, and that there are also three individuals with highly idiosyncratic patterns of response that indicate that the way they are learning this topic is very different from that of their peers, suggesting that further investigation of their problems is warranted. In other words, rather than trying to work out the one next step (for the class) or the thirty next steps (for the thirty individuals in the class), we might also usefully look for a set of five or six next steps.
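To make the row/column image concrete, the sketch below (our illustration; the data and the grouping rule are invented, and the DIMS project did not prescribe any particular algorithm) builds a 30-by-30 response matrix and shows the three readings described above: rows for individual diagnosis, columns for item facility and whole-class re-teaching, and a crude grouping of similar response patterns that yields a handful of next steps rather than one or thirty.

```python
import random

random.seed(1)
n_students, n_items = 30, 30

# Invented 0/1 response matrix: rows are students, columns are items.
responses = [[1 if random.random() < 0.6 else 0 for _ in range(n_items)]
             for _ in range(n_students)]

# Diagnostic-testing view: analyze each ROW separately (one student across all items).
student_totals = {s: sum(row) for s, row in enumerate(responses)}

# DIMS-style view: analyze each COLUMN separately (one item across all students).
item_facility = {i: sum(row[i] for row in responses) / n_students
                 for i in range(n_items)}
reteach = [i for i, fac in item_facility.items() if fac < 0.4]  # topics for the whole class

# Pattern view: group students whose response patterns agree on most items,
# so the teacher plans a handful of next steps rather than one or thirty.
def disagreement(a, b):
    return sum(x != y for x, y in zip(a, b))

groups = []  # each group is a list of student indices with similar patterns
for s, row in enumerate(responses):
    for group in groups:
        if disagreement(row, responses[group[0]]) <= n_items // 4:
            group.append(s)
            break
    else:
        groups.append([s])

print("lowest-scoring student:", min(student_totals, key=student_totals.get))
print("items to re-teach to everyone:", reteach)
print("number of distinct instructional groups:", len(groups))
```

Students who end up in a group of their own are exactly the idiosyncratic responders described above, and are candidates for individual follow-up.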

Connecting (some of) the dots

Figure 1 illustrated two dimensions along which assessments might vary: the "shelf-life" of the assessment and the level of aggregation. To this we can add a third dimension—the functions that assessment might serve. These functions might broadly be classified as "instructional guidance," "describing individuals," and "institutional accountability." Obviously these three dimensions are not entirely independent of each other. It seems rather unlikely that anyone would want to collect building-level data on an hourly basis for the purpose of institutional accountability. On the other hand, it is not possible to regard any of the three dimensions as completely subsumed within another. Hinge-point questions are most meaningful at the level of an instructional group, as are exit passes and "before-the-end-of-the-unit" tests, while decisions about academic promotion are, by definition, taken at the level of the individual student. The three dimensions therefore represent a space within which different kinds of assessments can be placed. Obviously representing this in a two-dimensional medium such as a book chapter is difficult, so Figure 3 merely represents the cycle length and the assessment function. The other dimension
(aggregation  level)  might  therefore  be  considered  to  be  at  right  angles  to  the  surface   of  the  page.   The  three-­‐dimensional  space  represented  in  Figures  1  and  3  provides  a  way  of   relating  different  functions,  time  scales  and  levels  of  aggregation  for  assessments,  but   does  not,  of  course,  provide  any  guidance  about  the  kinds  of  assessment  that  best   fulfill  these  needs.  While  the  definitions  of  formative,  interim,  and  summative   assessments  proposed  here  indicate  that  these  are  functions  that  assessment   outcomes  can  serve,  rather  than  properties  of  assessments  themselves,  this  does  not   mean  that  any  assessment  can  serve  any  purpose.  This  is  important,  because,   especially  in  the  U.S.,  testing  is  unpopular,  and  therefore  to  minimize  the  amount  (and   cost)  of  testing  it  seems  attractive  to  use  the  same  assessment  to  serve  multiple   functions,  which  immediately  raises  issues  about  the  validity  of  using  the  same  test  for   different  purposes.  

[Figure 3 arranges types of assessment by cycle length (hourly, daily, weekly, interim/benchmark, annual) against three assessment functions: Instructional Guidance ("formative"), Describing Individuals ("summative"), and Institutional Accountability ("evaluative"). The assessment types plotted include hinge-point questions, exit passes, before-the-end-of-unit tests, end-of-unit tests, common formative assessments, growth measures, end-of-course exams, academic promotion, and high-stakes accountability.]

Figure 3: Cycle length and the functions of assessment

Test validity

As many authors have pointed out, the idea that validity is a property of a test is problematic, since the test may be valid for some purposes and not others, valid for some populations and not others, and valid under some circumstances and not others. Although agreement is not universal, most authors seem to agree with Cronbach, Messick, and others that validity is a property of inferences supported by test scores (Cronbach, 1971; Messick, 1989). While test scores from one assessment may be able to serve different kinds of inferences, for example about students, groups of students, schools, districts, or states, a validity argument would need to be constructed for each of the intended inferences. This much appears to be fairly widely accepted (see, for example, the various Standards for Educational and Psychological Testing developed by the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education). However, what is less widely appreciated is that even when each of the different inferences that tests are to support is effectively validated, if these validation exercises are undertaken independently, they may not adequately account for what happens when the tests are used to support different inferences simultaneously.

For example, if assessments do function formatively, then they are likely to modify, and presumably improve, the instruction received by students. If this instruction is improved, then this weakens the ability of the same assessment information to function summatively. A medical analogy might be helpful here. If a blood test on an individual reveals high levels of cholesterol, which prompts a doctor
to prescribe a course of statins, which in turn has the effect of lowering the level of cholesterol, then the original blood test is now inaccurate, because it has been used to change things for the better. In the same way, if assessment outcomes are used formatively to improve instruction, leading to higher achievement, the assessment outcomes are no longer useful indications of the students' achievement, because the outcomes have been successful in improving the instruction. This gives us a version of the Pauli exclusion principle in physics—assessment outcomes can function summatively only if they do not function formatively. If they function formatively, then they can no longer function summatively, because they are likely to have improved the instruction to the extent that the original assessment data are no longer relevant.

As a second example, consider the use of results achieved by individual students on a state test. The tests are typically designed to indicate the degree to which students have mastered the state standards for their grade, but do this by sampling across the standards. Where the same assessment outcomes are used to hold teachers accountable, teachers are incentivized to teach only those aspects of the standards that are likely to be tested. Scores go up, but the results obtained by students are now less useful as indicators of students' achievement, since inferences about aspects of the standards that were not tested are likely to be less valid. Tests that might, if used solely for this purpose, provide useful information about students' mastery of standards no longer do so, because they have been used to support other kinds of inferences.

As a third example, consider the use made by a district of interim or benchmark tests, in order to monitor the extent to which students in a school building
are on track to be regarded as proficient on a state test at some point in the future. When such tests are used as low-stakes tests, they can provide valuable information about where additional instructional resources might best be deployed. However, in some districts, the scores on such low-stakes assessments are also used to provide early warning about ineffective instruction. Even if this is not the case, individual teachers may believe that unwelcome attention will be focused upon them if the scores of their students are lower than expected. As a result, they may decide to spend significant amounts of classroom time preparing for these tests. Not only does this preparation take time away from instruction, but it also makes the results of the test difficult to interpret, since without information about the amount of specific preparation undertaken for the tests, results will not be comparable across classrooms.

What is important to note about these three examples is that in each case, assessment outcomes were used for multiple purposes, and while the additional uses may well be justified in their own right, the effect of these multiple usages was to weaken the ability of the assessment to serve its original purpose. This suggests that while the same assessment outcome information could be used for multiple purposes, and it would seem efficient to do so, great care needs to be taken that any additional use of assessment information does not weaken the ability of the assessment to serve both the additional and the original function. Indeed, it does not seem to us unreasonable to argue that where any assessment is used for more than a single function, the validity of the assessment can be established only by a validation process in which all the intended inferences that a test is to support are validated concurrently. As Wiliam and Black (1996) have pointed out, it may well be
that  the  formative  functions  that  assessments  serve  are  validated  primarily  by  their   consequences,  while  interim  and  summative  functions  of  assessment  are  validated   primarily  in  terms  of  the  meanings,  but  the  interactions  between  the  different  uses  of   the  assessment  need  to  be  explored  in  a  systemic  way  to  minimize  the  likelihood  of   unintended  consequences.    Where  such  concurrent  validation  is  not  possible,  it  seems   to  us  that  a  “self-­‐denying  ordinance”  should  be  adopted.    However  attractive  it  might   seem  to  use  the  same  data  to  serve  multiple  functions,  there  is  sufficient  evidence  to   suggest  that  the  costs  of  the  unintended  consequences  of  multiple  uses  of  assessment   data,  even  when  each  of  the  uses  is  validated,  are  likely  to  be  greater  than  the  costs  of   additional  data  collection.    

Conclusion: Building a strong assessment system

While formative assessment practices, interim assessments, and summative assessments all provide important information to educational stakeholders, putting them together in a way that best serves the needs of each student is as tricky as building a Saturn rocket from a table full of Legos. While views on what constitutes a strong assessment system will vary widely, here are a few elements that follow from the student-centered mission of education that we adopted earlier. A strong system of assessments will:

• provide students with immediate feedback concerning their progress
• provide teachers with actionable information concerning their students' needs
• provide teachers with information useful in long-range instructional planning
• provide school administrators with information about the school's progress
• provide the public with information about student achievement and growth
• be designed to have an impact in the classroom
• communicate needed information clearly to teachers and students
• use a strong measurement scale to measure growth
• provide normative, criterion, and content references to make meaning of performance
• use a strong measurement design to measure growth well

If we use these characteristics as a starting point, we can begin to fashion an
assessment system that benefits from the unique characteristics of each type of assessment that we have considered above. A strong assessment system including the characteristics described above can be developed in any number of ways, but any development needs to be thoughtful and mission-driven. Below, we illustrate one way in which these disparate elements might be brought together, and how one particular system might address some of the tensions we have described above. There is no one perfect system, because each system needs to be designed to take account of local constraints and affordances, but the hypothetical example below shows how the principles identified in this paper might inform the design of the "assessment-rich" school.

Larkrise Middle School, Lake Wobegon

Students who are entering sixth grade at Larkrise Middle School in the fall complete an interim assessment in the previous May. This, combined with an electronic portfolio and individual student profiles prepared by the fifth-grade teacher at their elementary school, is used to help the middle school allocate students to classes, ensuring the full range of achievement in each class, and to set individual
growth targets for each student. Parents have online access to the electronic portfolio, the teacher reports, and the scores gained by their children on the interim tests.

Teachers at Larkrise Middle School meet once a month in cross-grade teams to plan learning progressions, using the protocol outlined in Leahy & Wiliam (2011). On the basis of these learning progressions, they produce short tests that they use approximately once every two weeks to determine how far along the learning progression the students in their classes have progressed, and they also plan high-quality single "hinge-point" items that they incorporate into their lesson plans. Teachers also meet in grade-based teams every two weeks to review the progress their students have made.

The seven administrators at Larkrise Middle School undertake "Learning walks" approximately once per month, in which they attempt to visit as many classrooms as possible, typically spending between 10 and 15 minutes in each classroom they visit. During a day, they are generally able to visit the classrooms of every single one of the teachers at the school. At the end of each visit, the teacher being observed receives a short report slip that follows the "two stars and a wish" protocol (two positive aspects of the practice observed, and one reflection point for the teacher—see Wiliam, 2011, for more details). The administrator also has a copy of the report, but this does not give the observed teacher's name, since, as a result of a "self-denying ordinance" as discussed above, the administrative team has decided that the quality of the evidence collected from a single lesson after a 10-minute observation is not sufficiently reliable to provide a basis for the evaluation of a particular teacher (see Hill, 2012). Teachers are, however, free to use these report slips in their annual meetings with their supervisors to discuss their future professional development priorities. Although the
results of 10-minute observations on individual teachers may not support inferences about the quality of individual teachers, the evidence from the 100 to 150 lessons observed during a typical "Learning walk" day does provide a sound basis for judging the average quality of instruction being provided in the school. By reviewing trends over several months, the administrators are able to determine whether institution-wide initiatives are having an effect on instruction. These reviews of long-term trends are also informed by a monthly questionnaire completed by a sample of 10% of the students (students are randomly allocated to complete one questionnaire each year).

At the end of the first marking period (six weeks into the school year), students take an interim assessment that gives each teacher and student a first look at achievement during the year and progress toward growth targets. This information is used to make "mid-course" corrections and serves as the basis of the second series of reports to stakeholders. In keeping with the decision about "self-denying ordinances" described above, data on student achievement on these interim assessments are never used to support inferences about individual teachers.

During the second and third marking periods, formative assessment approaches allow each teacher to adjust content as each student progresses. The regular monitoring of student progress allows a "response to intervention" type of approach to be used, whereby students who are not making adequate progress are provided with additional support, which takes the form of tuition in smaller groups, re-allocation to the classes of teachers known to be highly effective with students with special needs, or special "catch-up" classes.

Toward the end of the school year, the summative assessment identifies the overall achievement of the students in the class to help determine what there is to celebrate,
and what might be done better in subsequent years. This information is also passed to stakeholders in the form of easily readable reports that describe the depth and breadth of the school's accomplishments. At the end of the school year, an interim assessment gives each teacher and student a look at achievement during the year and attainment of growth targets. This information serves as the basis of the final series of stakeholder reports, which describe both the accomplishments of the year and the changes that will be made to serve students better in the upcoming years. "Value-added" analyses are also undertaken to establish the total progress made by students in the school, and, where sufficient reliability can be achieved, these analyses, along with observational data, feed into the evaluation of each teacher's performance over the year.

Obviously, another group of educators might come to a substantially different design for using these types of assessment together to improve education. However, as long as we keep a mission that is centered on the student in mind, it is unlikely that we will go too far wrong. The quality of our educational systems may be seen most easily in test scores and student growth, but it is important to remember that the quality of education is best seen in the accomplishments of our students. The best that schools can hope to do is to set our students along paths that will, eventually, make our world a better place for them and their children.

References

Ausubel, D. P. (1968). Educational psychology: a cognitive view. New York, NY: Holt, Rinehart & Winston.

Black, P. J., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5-31.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.

Fletcher, R. (2000). A review of linear programming and its application to the assessment tools for teaching and learning (asTTle) projects. Auckland, NZ: University of Auckland.

Hill, H. C. (2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56-84.

Leahy, S., & Wiliam, D. (2011, April). Devising learning progressions. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Lewis, C. C. (2002). Lesson study: a handbook of teacher-led instructional change. Philadelphia, PA: Research for Better Schools.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education/Macmillan.

Mullis, I., Martin, M., Kennedy, A., Trong, K., & Sainsbury, M. (2009). PIRLS 2011 assessment framework. Boston, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.

Mullis, I., Martin, M., Ruddock, G., O'Sullivan, C., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks. Boston, MA: TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.

Northwest Evaluation Association. (2009). Technical manual for Measures of Academic Progress and Measures of Academic Progress for Primary Grades. Portland, OR: Author.

Northwest Evaluation Association. (2012). Measures of Academic Progress®. Retrieved March 31, 2012, from http://www.nwea.org/products-services/computer-based-adaptive-assessments/map

OECD. (2012). PISA 2009 technical report. Author.

Ritter, S., Anderson, J. R., Koedinger, K. R., & Corbett, A. (2007). Cognitive Tutor: applied research in mathematics education. Psychonomic Bulletin & Review, 14(2), 249-255.

Wiliam, D. (2011). Embedded formative assessment. Bloomington, IN: Solution Tree.

Wiliam, D., & Black, P. J. (1996). Meanings and consequences: a basis for distinguishing formative and summative functions of assessment? British Educational Research Journal, 22(5), 537-548.

Wylie, E. C., & Wiliam, D. (2006). Diagnostic questions: is there value in just one? Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA.

Wylie, E. C., & Wiliam, D. (2007). Analyzing diagnostic questions: what makes a student response interpretable? Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL.