SRI International November 30, 2002

A Review of the Washington Assessment of Student Learning in Mathematics: Grades 7 and 10

Executive Summary and Final Report SRI Project Number: P12089

Prepared for: The Office of Superintendent of Public Instruction (OSPI) Old Capitol Building, P.O. Box 47200 Olympia, WA 98504-7200

Submitted by: Geneva Haertel SRI International

Table of Contents

Executive Summary ... i
    Technical Panel ... i
    Alignment Panel ... ii
    Cross-Panel Analyses Relating Item Characteristics to Item Difficulty ... iii
    Examination of Standard-Setting Procedures ... iv
    Conclusions and Recommendations ... iv
    Standard Setting ... v
    Educational Reform ... v
    Summary ... vi
Introduction ... 1
Statement of the Problem ... 1
General Approach to a Review of the WASL Mathematics Assessment ... 2
Background on the WASL ... 3
Technical Studies ... 5
    Technical Panel Participants and Methods ... 5
    Technical Analyses of WASL Mathematics Achievement ... 7
    Analysis 1. Analysis of Initial Year Mathematics Score Distributions ... 7
    Analysis 2. Analysis of Percentage of Students Reaching Standard ... 9
    Analysis 3. Analysis of 2002 Mathematics Score Distributions with Comparisons to Initial Year of Testing ... 11
    Analysis 4. Analysis by Mathematics Achievement Levels ... 13
    Analysis 5. Comparing Standards and Tests across Grades ... 17
    Analysis 6. Matched Samples Analyses ... 21
    Conclusions: Technical Analyses 1-6 ... 23
Alignment Studies ... 24
    Alignment Panel Participants and Methods ... 24
    Rating Categories ... 26
    Results of Alignment Panel Classifications of WASL Mathematics Items ... 30
Cross-Panel Analyses: Relationship of Item Characteristics to Item Difficulty ... 60
    Grade 7 Results ... 61
    Grade 10 Results ... 66
    Conclusions: Cross-Panel Analyses ... 71
Review of Standard-Setting Procedures ... 72
    Conclusions and Recommendations: Standard-Setting Procedures ... 78
Recommendations ... 78
    Test Development ... 78
    Standard Setting ... 79
    Educational Reform ... 79
Summary ... 80
References ... 81
Appendix A: Resumes for National Experts ... A-1
Appendix B: Resumes for the Technical Assistance Committee ... B-1
Appendix C: Resumes for SRI International ... C-1

List of Figures

Figure 1. Scale Score Frequency Distribution by Grade for Initial Year(s) of Testing ... 8
Figure 2. Percentage of Students Meeting Standards, by Grade Level and Year ... 10
Figure 3. Scale Score Frequency Distributions for Grades 4, 7, and 10 in 2002 ... 12
Figure 4. Percentage of Students in Each of Four Mathematics Achievement Levels (L1, L2, L3, and L4) ... 15
Figure 5. Mathematics Item Characteristics Judged by Alignment Panel ... 26
Figure 6. Types of Knowledge and Associated Mathematical Activities ... 27

List of Tables

Table 1. Distribution of Mathematics Item Formats across Tests ... 4
Table 2. Percentages of Students Reaching Standard in Grades 4, 7, and 10 in Reading and Mathematics ... 9
Table 3. Comparison of Students' Mathematics Achievement for Initial Year of Testing and Spring 2002 ... 13
Table 4. Percentage of Students at Each Grade Level Scoring at or above Each Achievement Level ... 16
Table 5. Comparison of Mathematics Test Characteristics for Initial Year ... 18
Table 6. Number of Students in Matched Samples and Percentage of Students Retained and Lost in Matching, by Year ... 21
Table 7. Average Cross-Classification Percentages for Grades 4-7 Cohorts on the Mathematics Tests ... 22
Table 8. Average Cross-Classification Percentages for Grades 7-10 Cohorts on the Mathematics Tests ... 22
Table 9. Distribution of Mathematics Item Formats across Tests ... 30
Table 10. EALR Distribution across Mathematics Tests ... 32
Table 11. EALR Distribution across Mathematics Item Formats ... 33
Table 12. EALR and Element Distribution across Mathematics Tests ... 34
Table 13. Agreement of Alignment Panel Codes with Test Developer Item Strands ... 38
Table 14. Relationship between Mathematics Item Directions and Performance Expectations ... 39
Table 15. Type of Knowledge Used in Mathematics Items ... 41
Table 16. Knowledge Type of Mathematics Items by Item Formats ... 42
Table 17. Mathematics Items that May Require Computation ... 43
Table 18. Mathematics Items that May Require Computation, by Item Format ... 43
Table 19. Use of Mathematics Vocabulary ... 43
Table 20. Mathematics Items that Use Context ... 44
Table 21. Context, by Mathematics Item Format ... 44
Table 22. Grade-Level-Appropriate Context for Mathematics Items ... 45
Table 23. Grade-Level-Appropriate Context across Mathematics Item Formats ... 45
Table 24. Context of Mathematics Item Stimulus Materials ... 46
Table 25. Background Knowledge of Mathematics Item Stimulus Materials ... 47
Table 26. Use of Representational Graphics in Mathematics Items ... 48
Table 27. Representational Graphic Formats by Mathematics Item Formats ... 48
Table 28. Grade-Level-Appropriate Representational Graphic Formats ... 49
Table 29. Type of Graphic Format Used in Mathematics Item Stimulus Materials ... 50
Table 30. Information Content within Graphic Formats in Mathematics Items ... 51
Table 31. Information Load in Graphic Formats in Mathematics Items ... 51
Table 32. Information Load in Graphic Formats across Mathematics Item Formats ... 52
Table 33. Degree of Scaffolding within Mathematics Item Directions ... 53
Table 34. Degree of Scaffolding in Mathematics Item Directions across Item Formats ... 53
Table 35. Information Load in Mathematics Item Stimulus Materials ... 54
Table 36. Extraneous Information across Mathematics Item Formats ... 54
Table 37. Degree of Item Scaffolding in Mathematics Item Stimulus Materials ... 55
Table 38. Degree of Item Scaffolding in Mathematics Item Stimulus Materials across Item Formats ... 55
Table 39. Performance Expectations within Scoring Rubrics in Mathematics ... 56
Table 40. Performance Expectations in Scoring Rubrics in Mathematics across Item Formats ... 56
Table 41. Match of Item Rubric with Performance Expectation of Primary EALR in Mathematics ... 57
Table 42. Relationship between Anchor Papers and Rubrics in Mathematics ... 58
Table 43. Grade 7: Overall Question Type Characteristics ... 61
Table 44. Grade 7: Item May Require Computation ... 61
Table 45. Grade 7: Knowledge Content Combinations ... 62
Table 46. Grade 7: Item Uses Procedural Knowledge ... 63
Table 47. Grade 7: Information Load Is Appropriate ... 63
Table 48. Grade 7: Scaffolding Is Appropriate ... 64
Table 49. Grade 7: Is Math Vocabulary Specialized? ... 64
Table 50. Grade 7: Is Scaffolding of Problem within the Item Directions Appropriate? ... 65
Table 51. Grade 7: Performance Required Matches the EALR ... 65
Table 52. Grade 7: Do the Anchor Papers Match the Rubrics? ... 66
Table 53. Grade 10: Overall Question Type Characteristics ... 66
Table 54. Grade 10: Knowledge Content Combinations ... 67
Table 55. Grade 10: Information Load Is Appropriate ... 68
Table 56. Grade 10: Scaffolding Is Appropriate ... 68
Table 57. Grade 10: Is Math Vocabulary Specialized? ... 69
Table 58. Grade 10: Is Scaffolding of Problem within the Item Directions Appropriate? ... 69
Table 59. Grade 10: Performance Required Matches an EALR ... 70
Table 60. Grade 10: Performance Required Matches Part of EALR ... 70
Table 61. Grade 10: Performance Required Matches EALR at Lower Grade Level ... 71
Table 62. Grade 10: Do the Anchor Papers Match the Rubrics? ... 71

A Review of the Washington Assessment of Student Learning in Mathematics: Grades 7 and 10

EXECUTIVE SUMMARY

The Office of Superintendent of Public Instruction (OSPI) in the state of Washington commissioned SRI International to lead a study of students' performance on the Washington Assessment of Student Learning (WASL) in mathematics at grades 7 and 10. SRI International coordinated a two-pronged review of the WASL mathematics assessment. The primary objective of the review was to address the performance pattern discrepancies, alignment, and standard-setting procedures of the grades 7 and 10 mathematics tests through the use of expert panels. These expert panels conducted alignment and technical studies using methodologies targeted to the goals. Dr. Geneva Haertel, Senior Educational Researcher in the Center for Technology in Learning at SRI International, coordinated the review effort. SRI's role was to orchestrate the review process, recruit content and psychometric experts, convene the Alignment and Technical Panels, coordinate and synthesize analyses and findings of the Alignment and Technical Panels, and prepare the final report. The review integrated the expertise of five groups:

• Two national experts in testing, measurement, and psychometrics.
• Two national experts in the alignment of assessments and standards in mathematics.
• Two state of Washington educators with extensive knowledge of the WASL and the Essential Academic Learning Requirements (EALRs).
• Three Washington Technical Advisory Committee (TAC) members.
• SRI project staff.

TECHNICAL PANEL

The Technical Panel was composed of individuals with expertise in testing, measurement, and psychometrics. The primary focus of this panel was to address questions about the performance patterns and the consistency and appropriateness of the standards and cutscores used at grade levels 7 and 10 of the WASL mathematics assessments. In particular, panelists were asked to examine the relative difficulty of meeting the mathematics performance standard and achieving the cutscores. The participants on the Technical Panel were:

Ronald Hambleton, University of Massachusetts, Amherst
Edward Haertel, Stanford University
Peter Behuniak, Washington TAC, Connecticut Department of Education
Joseph Ryan, Washington TAC, Arizona State University West, Phoenix

Panelists examined available data and documentation to better understand the patterns of student performance on the WASL at different grade levels and in different subject areas. In addition, the panelists considered the findings of the alignment study, since it was anticipated that the conclusions and data generated by the Alignment Panel could help to inform future test development and standard-setting activities on the WASL.

TECHNICAL PANEL: KEY FINDINGS AND CONCLUSIONS

Overall, the technical analyses of students' performance in mathematics in grades 4, 7, and 10 from 1997 to 2002 revealed progress in mathematics achievement at each of the grade levels. Patterns of achievement performance at grade 4 revealed that gains have been rapid and consistent over the six years that the WASL has been administered. In grade 7, achievement gains have been sluggish, although they have improved in the last two years. Achievement gains in grade 10 have been consistently modest since the WASL's inception. The analyses further suggest that the grade 7 test was more challenging for the 7th-graders than the grade 10 test was for the 10th-graders.

ALIGNMENT PANEL

The Alignment Panel was composed of four individuals with extensive expertise in mathematics education. Each panelist also had experience in judging the alignment of assessments with standards. Two members were nationally recognized for their pioneering work in the field of alignment of math and science assessments with standards, and two members had worked on the grade 4 Northwest Regional Educational Laboratory (NWREL) Washington Assessment of Student Learning (WASL) Mathematics Study. The primary focus of the Alignment Panel was to study the issue of the relative difficulty of individual grade 7 and grade 10 mathematics WASL items with respect to their correspondence with grade-level Essential Academic Learning Requirements and the cognitive demands of test item features. The participants on the Alignment Panel were:

Norman Webb, University of Wisconsin, Madison
Gerald Kulm, Texas A&M University
Verna Adams, Washington State University, Pullman, Emeritus
John Woodward, University of Puget Sound

SRI researchers designed an alignment methodology tailored to the goals of the study: (1) to judge the appropriateness of the items for providing evidence of achievement of the EALRs specified for the grade level, and (2) to identify possible sources of challenge in item design. The Alignment Panelists were asked to evaluate characteristics of items that were hypothesized to affect student performance. Features of item content, format, and scoring were rated for their alignment with mathematics EALRs and grade-level appropriateness. Item features considered included appropriateness of problem contexts, graphics, information load, scaffolding, mathematical terminology, and computation.

ALIGNMENT PANEL: KEY FINDINGS AND RECOMMENDATIONS

In general, Alignment Panelists' ratings suggested that the grade 7 tests were more challenging for 7th-grade students than the grade 10 tests were for 10th-grade students. Comparisons of the first versions of the tests in 1998 and 1999 with the 2001 tests revealed that the WASL test development process continues to improve the appropriateness of items for grade-level tests. Alignment Panel recommendations:

• Clarify the relationship among score reports, the basis for development of scoring guides, and the indexing of items for the purpose of item and test development and for the public.
• Confirm whether EALRs and elements are represented in the items and reports as intended.
• Examine and document the balance of grade-level items within and across the grade levels.
• Review the relative distribution of items across the different knowledge types. Confirm the percentage of each knowledge type desired on tests at each grade level.
• Consider whether the proportion of items with graphical forms should be balanced across grade levels.
• Screen for information load, background knowledge, and scaffolding in the contexts and graphics and confirm appropriateness with student "think-alouds."
• Vertically align scoring criteria across grade levels.

CROSS-PANEL ANALYSES RELATING ITEM CHARACTERISTICS TO ITEM DIFFICULTY

As part of the technical analyses, the ratings of items in the alignment study were combined with information about how students actually performed on the test questions. Students' performance on the items was represented by the item difficulty or p-value and the item mean.

CROSS-PANEL ANALYSES: KEY FINDINGS ON THE RELATIONSHIP OF ITEM CHARACTERISTICS TO ITEM DIFFICULTY

Findings from the technical analyses of performance patterns and the alignment of test items with EALRs confirm the conclusion that the grade 7 tests were more difficult for 7th-graders than the grade 10 tests were for 10th-graders. These analyses also confirm the relationship of test item characteristics to item difficulties.

Key findings:

• Multiple-choice items that required computation were harder than those that did not.
• Items requiring procedural knowledge were slightly easier.
• Items with appropriate information load were easier.
• Items coded as having appropriate scaffolding within the item directions were easier.
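For reference, the two item statistics used in these cross-panel analyses are conventional: the p-value of a multiple-choice item is the proportion of students answering correctly, and the item mean of a constructed-response item is the average number of points earned. A minimal sketch in Python, with invented scores rather than WASL data:

    # Illustrative sketch of the two item statistics named above;
    # the score vectors are invented, not WASL data.
    mc_scores = [1, 0, 1, 1, 0, 1, 1, 1]    # 0/1 scores on a multiple-choice item
    cr_scores = [0, 2, 1, 4, 3, 2, 1, 3]    # 0-4 scores on a 4-point constructed-response item

    p_value = sum(mc_scores) / len(mc_scores)     # item difficulty: proportion correct
    item_mean = sum(cr_scores) / len(cr_scores)   # average points earned on a polytomous item

    print(f"p-value = {p_value:.2f}, item mean = {item_mean:.2f}")
    # p-value = 0.75, item mean = 2.00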

EXAMINATION OF STANDARD-SETTING PROCEDURES

Members of the Technical Panel, Dr. Ronald Hambleton and Dr. Edward Haertel, examined technical documents describing the process used in setting performance standards and summarizing the results. The two panelists provided comments on the standard-setting procedures employed and recommendations for future methodologies.

STANDARD-SETTING PROCEDURES: KEY FINDINGS AND RECOMMENDATIONS

The standard-setting practices that were employed by OSPI from 1997 through 1999 were reasonable, were similar to those practices that have been employed by other states, and were considered state-of-the-art at that time. In this review, some aspects of the procedures were identified as potential sources of variability that could have influenced the results of the standard setting and consequently the cross-grade achievement patterns. Recommendations:

• Provide more detailed documentation of the procedures.
• Specify how sub-panels will be used.
• Review the facilitators' role in working with each panel or sub-panel.
• Specify performance-level descriptors more explicitly.
• Use impact data to set standards.
• Codify validity evidence for performance standards.
• Incorporate current standard-setting technologies.

CONCLUSIONS AND RECOMMENDATIONS

Conclusions and recommendations of the review panels relate to methodologies for test development and standard setting and to the role of the WASL in the Washington State education reform agenda.

TEST DEVELOPMENT

The panels offered a number of specific recommendations for the WASL test development process. Panel findings revealed that the grade 7 math test is harder for 7th-graders than the grade 4 test is for 4th-graders or the grade 10 test is for 10th-graders. A number of the characteristics of items judged by the Alignment Panel as likely sources of item difficulty were associated with empirical indicators of item difficulty, although the relationships varied at grades 7 and 10. Building on the methodologies used by the Technical and Alignment Panels, a series of recommendations was made related to continued review of test development, accompanied by even more detailed communication of the WASL design to the educational community. Recommendations:

• Conduct longitudinal and cross-sectional analyses like the ones reported in this study at the end of each year and in each content area.
• Provide clear test frameworks and blueprints to the public displaying the representation of EALRs by items on the test.
• Vertically align and balance test designs for assessing EALRs.
• Routinely engage experts to screen item features.
• Routinely conduct and document think-alouds with small samples of representative students.
• Conduct further analyses of the relationship of multiple characteristics to item difficulty.
• Compare the relationship between item characteristics and item difficulty across grades.
• Compare the relationship between item characteristics from the alignment study and item discrimination indices.

STANDARD SETTING

Panelists found that the standard-setting procedures employed for grades 7 and 10 mathematics have incorporated common practices and should continue to add recommended elements. Recommendations:

• Provide detailed documentation of standard-setting procedures and decisions in WASL technical reports.
• Include impact data for standard-setting panels.
• Vertically equate and "smooth" expectations and cutscores.

EDUCATIONAL REFORM

Panelists emphasize that this study of the mathematics achievement of students in grades 7 and 10 tested by the WASL must be viewed within the interrelationships of standards, assessments, curriculum, teachers, instruction, and students. The EALRs established in the content areas govern the design of the WASL. Standards set at each grade level tested by the WASL specify test scores for each level of achievement. The grade levels selected for administration of the WASL influence instructional emphases. Test performance at the grade levels could be affected by:

• Varying demands in the content standards.
• Differences in the design of the tests for the grade levels.
• Content and performances required of students to meet the standards on the WASL.
• Variations in the application of standard-setting methods across subjects and grade levels.
• Level of effort that students make when taking the assessment.
• Differences in the nature of curriculum content to which students at different grade levels are exposed.
• Differences in the quality of instruction provided.
• Variation in the qualifications of teachers.

The exemplary efforts by OSPI to systematically examine the design of the WASL tests and their relationships to the educational reform agenda have provided OSPI and Washington educators with valuable data and insights. For example, relevant systemic factors to consider might include:

• Teachers in grade 4 have had more experience than 7th- and 10th-grade teachers with a standards-based curriculum and assessment system.
• The first three groups of students exposed to a standards-based system in grade 4 have reached grade 7, and achievement in grade 7 is improving.
• The earlier-grade math experiences of students in grade 10 are more varied than those of students in grades 4 and 7.
• Teachers in grade 10 have had the least experience with a standards-based curriculum and assessment system.
• Teachers in grade 10 have had less opportunity than 7th-grade teachers to work with students who have been moving through a standards-based curriculum and assessment system.

Recommendation:

• Routinely review the state assessment as a component within Washington State educational reforms.

SUMMARY

Both the Technical and Alignment Panels note that since the inception of the WASL, progress has been made by Washington students in all grades on the WASL mathematics tests. These increases in performance mirror progress found in other state standards-based reforms. Panelists also note significant improvement in the design of the tests. They praise OSPI's ongoing efforts to monitor and improve the WASL. They note that the WASL test development process continues to increase the appropriateness of items for grade-level tests. Therefore, we stress the importance of integrating this study of the grades 7 and 10 mathematics WASLs with other studies of the impacts of resource allocation, teaching, curriculum, and instruction on students' mathematics achievement.

A Review of the Washington Assessment of Student Learning in Mathematics: Grades 7 and 10

INTRODUCTION

The Office of Superintendent of Public Instruction (OSPI) in the state of Washington commissioned SRI International to lead a study of students' performance on the Washington Assessment of Student Learning (WASL) in mathematics at grades 7 and 10. SRI International coordinated a two-pronged review of the WASL mathematics assessment. The primary objective of the review was to address the alignment features, performance pattern discrepancies, and standard-setting procedures of the grades 7 and 10 mathematics tests through the use of expert panels. These expert panels conducted alignment and technical studies using methodologies targeted to the goals. This document summarizes the review process and highlights the panel findings and recommendations to OSPI concerning the WASL mathematics assessments.

STATEMENT OF THE PROBLEM

Analyses of the WASL in mathematics revealed trends in student performance that require further study and explanation. In particular, students in grades 7 and 10 score lower in mathematics than in other subject areas. The grade 7 math and reading scores are much lower than the scores for grades 4 and 10. In addition, the rate of improvement in math has been slower than in the other subject areas. There are numerous plausible explanations for these trends. Differential performance at different grade levels could arise from several sources, including:

• Varying demands in the content standards.
• Differences in the design of the tests for the grade levels.
• Content knowledge and performances required of students to meet the standards tested by the WASL.
• Variations in the application of standard-setting methods across subjects and grade levels.
• Level of effort that students make when taking the assessment.
• Differences in the nature of curriculum content to which students at different grade levels are exposed.
• Differences in the quality of instruction provided.
• Variation in the qualifications of teachers.

This study addresses only issues related to qualities of the grade 7 and grade 10 tests.

GENERAL APPROACH TO A REVIEW OF THE WASL MATHEMATICS ASSESSMENT

Dr. Geneva Haertel, Senior Educational Researcher in the Center for Technology in Learning at SRI International, coordinated the review effort. SRI's role in the review included: orchestrating the review process; recruiting content and psychometric experts; convening the Alignment and Technical Panels; coordinating and synthesizing analyses and findings of the Alignment and Technical Panels; and preparing the final report. The review had two thrusts:

1. A technical review of: (a) the discrepancies in students' patterns of performance on the grade 7 and grade 10 tests and (b) the policies governing and procedures employed in setting the performance standards for the tests.

2. A review of the alignment of the WASL mathematics items with state standards—the Essential Academic Learning Requirements (EALRs).

For each area, a panel of national experts and members of the Washington Technical Advisory Committee was convened to conduct analyses designed to answer key questions and prepare written findings. Panels shared the information they gathered, so that the findings of each could support an integrated interpretation of students' performance on the WASL in mathematics at grade 7 and grade 10. See Appendices A, B, and C for resumes. The review integrated the expertise of five groups:

• Two national experts in testing, measurement, and psychometrics.
• Two national experts in the alignment of assessments and standards in mathematics.
• Two state of Washington educators with extensive knowledge of the WASL and the Essential Academic Learning Requirements (EALRs).
• Three Washington Technical Advisory Committee (TAC) members.
• SRI project staff.

In addition, members of OSPI and Riverside Publishing personnel served as resources to the review process, providing the Alignment and Technical Panels with test materials and data as required.


BACKGROUND ON THE WASL

This section describes the educational context in which the WASL was developed and provides a brief history of its implementation. Recent education reform efforts in Washington State have their roots in legislation passed in the early 1990s, when a comprehensive effort to improve teaching and learning was launched. The Legislature passed Engrossed Substitute House Bill 1209 in 1993, noting that "student achievement in Washington must be improved to keep pace with societal changes, changes in the workplace, and an increasingly competitive international economy." In response to this legislation, the Legislature created the Commission on Student Learning (CSL), which later transferred its duties to OSPI and the Academic Achievement and Accountability Commission. The state was required to establish academic standards at an internationally competitive level. To achieve a competitive level, the state of Washington conducted the following activities:

• Essential Academic Learning Requirements (EALRs) were established in eight content areas (reading, writing, communication, mathematics, science, health/fitness, social studies, and the arts) to specify what all students should know and be able to do.
• Using the EALRs, the WASL was developed to measure student progress at grades 4, 7, and 10.
• The accountability commission was created to make recommendations for criteria to identify schools that are struggling to meet the academic standards set forth by the state.

Development of the WASL was an extensive process. Several committees, composed of diverse populations from across the state, were established to create EALRs for the various content areas ("strands"). They designed the content of the test, drafted items for a pilot test, ensured that items were free of potentially offensive and biased content, and reviewed the results of the pilot test. Tests for mathematics and other subjects were initially developed for grade 4 and were first administered on a voluntary basis in the spring of 1997. Participation was mandatory in 1998. Assessments for grade 7 were administered on a voluntary basis in the spring of 1998. Grade 10 assessments were pilot-tested in 1998 and were administered in the spring of 1999. Participation in the grade 7 and 10 assessments was voluntary until 2001. The CSL adopted EALRs for mathematics, reading, writing, and communications in 1995, revising them in 1997. Performance standards were established to reflect the legislative requirement that performance criteria reflect internationally competitive levels for grades 4, 7, and 10. EALRs for science, social studies, health/fitness, and the arts were adopted in 1996 and revised in 1997. Using the EALRs as guides, a statewide assessment, the WASL, was developed to test achievement of these performance standards. After the test was initially administered in 1997 (for grade 4), a diverse standard-setting committee was created to set the scoring level that students need to achieve in order to meet the state standard.

The committee members were charged with determining these levels and were guided by what they believed a "well taught, hard working student" should be able to do by spring of a given year. Setting the standard necessitated an extensive analysis by experts. This thorough "expert judgment" process ensured that the standard set for proficiency was carefully scrutinized by a broad range of constituents of education. Committee members received significant input from peers and had many opportunities to discuss the standard. After the committee set the cutscore that students have to achieve to meet the standard, it set cutscores for partial achievement of the standard and achievement above the standard. Statistical equating methods are used to examine the consistency of these levels over time. Independent experts were consulted to determine whether the test development and standard-setting processes were sound and well documented and met the standards set forth in the Standards for Educational and Psychological Testing. The test development process is similar to that used across the country, and the standard-setting process has been used to set standards in other states.

Table 1 presents basic information on the WASL mathematics assessment for grades 4, 7, and 10, including the number and types of items for each version of the test, by year.

Table 1. Distribution of Mathematics Item Formats across Tests

                                  Number of Items, by Grade Level and Year
Item Format                     Gr 4    Gr 4    Gr 4    Gr 7    Gr 7    Gr 10   Gr 10
                                1998    1999    2001    1998    2001    1999    2001
Multiple-choice                   24      24      21      30      27      30      26
2-point constructed response      13      13      11      12      11      12      11
4-point constructed response       3       3       3       4       4       4       4
Total number of items             40      40      35      46      42      46      41
Total number of points            62      62      55      70      65      70      64
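Consistent with every column of Table 1, the total number of points follows from the item formats, with each multiple-choice item worth 1 point: total points = (number of multiple-choice items x 1) + (number of 2-point items x 2) + (number of 4-point items x 4). For the 1998 grade 4 test, for example, 24(1) + 13(2) + 3(4) = 62 points.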

TECHNICAL STUDIES

TECHNICAL PANEL PARTICIPANTS AND METHODS

PARTICIPANTS

The Technical Panel was composed of individuals with expertise in testing, measurement, and psychometrics. The primary focus of this panel was to address several technical questions about the consistency and appropriateness of the standards and cutscores used at grade levels 7 and 10 of the Washington Assessment of Student Learning mathematics assessments. In particular, panelists were asked to examine the relative difficulty of meeting the mathematics performance standard and achieving the cutscores. They also were asked to address the extent to which the mathematics standards and cutscores are consistent and appropriate across the tested grades. The participants on the Technical Panel were:

Ronald Hambleton, University of Massachusetts, Amherst
Edward Haertel, Stanford University
Peter Behuniak, Washington TAC, Connecticut Department of Education
Joseph Ryan, Washington TAC, Arizona State University West, Phoenix

METHODS

• The purpose of the review was to examine available data and documentation to better understand the patterns of student performance on the WASL at different grade levels and in different subject areas. Two nationally known measurement experts were invited to participate to provide independent, third-party perspectives on the technical issues of interest. Dr. Ronald Hambleton and Dr. Edward Haertel agreed to serve in this capacity.

• The panel's activities involved the review of numerous technical documents describing the design and development of the WASL, as well as the data produced by the students who had taken the tests. Among the materials shared with panelists were the grade 4 Technical Reports for 1998, 1999, 2000, and 2001; grade 7 Technical Reports for 1998, 1999, 2000, and 2001; and grade 10 Technical Reports for 2000 and 2001. Additional materials included sets of discussion issues concerning standard setting addressed by the Commission on Student Learning, documentation of standard-setting procedures used on the WASL, and information regarding the composition of the standard-setting committees. Further, student performance data were made available to summarize statewide achievement levels on the WASL. Finally, the results of the Alignment Panel's work were considered by the panelists conducting the technical review.

• Following the distribution of the WASL data and documentation to the panel members, panelists worked on three broad areas to conduct the studies. The two members of the Washington Technical Advisory Committee who were familiar with the WASL student performance data, Drs. Peter Behuniak and Joseph Ryan, reviewed the levels and patterns of student performance by grade level and subject area. Their work is summarized in the review of the seven technical analyses presented below.

• The available documentation regarding the development and implementation of procedures associated with the standard-setting process used on the WASL was reviewed by Drs. Ronald Hambleton and Edward Haertel. This independent review of the standard-setting processes implemented from 1997 through 1999 attempted to identify sources of variation that could have contributed to the discrepant patterns and levels of student performance across grades and subject areas. The discussion of the procedures used and the panelists' comments and recommendations are presented in the section "Review of Standard-Setting Procedures."

• In addition, the panelists considered the findings of the alignment studies, since it was anticipated that the conclusions and data generated by the Alignment Panel could help to inform future test development and standard-setting activities on the WASL. The issues and the recommendations of the panelists are presented in the section "Cross-Panel Analyses: Relationship of Item Characteristics to Item Difficulty."

TECHNICAL ANALYSES OF WASL MATHEMATICS ACHIEVEMENT

This portion of the study examines students' performance in mathematics in grades 4, 7, and 10 from 1997 to 2002. Seven analyses were performed:

1. Analysis of Initial Year Mathematics Score Distributions
2. Analysis of Percentage of Students Reaching Standard
3. Analysis of 2002 Mathematics Score Distributions with Comparisons to Initial Year of Testing
4. Analysis by Mathematics Achievement Levels (L1, L2, L3, L4)
5. Comparing Standards and Tests across Grades
6. Matched Samples Analysis
7. Alignment Analysis Related to Item Analysis

The first six analyses are reported here. Analysis 7 is reported after the description of the alignment review.

ANALYSIS 1. ANALYSIS OF INITIAL YEAR MATHEMATICS SCORE DISTRIBUTIONS

WASL tests were first administered in grades 4, 7, and 10 in 1997, 1998, and 1999, respectively. The scores from the initial year of testing at each grade are examined and compared because they represent the starting point or baseline from which progress is evaluated. The students' overall performance on these tests is shown graphically in Figure 1. The WASL scale is shown in the graphs going from 100 to 600, with a maximum frequency of 2,500 to standardize the graphs across the grades. The WASL scales are not equated across the grades, however, so the scale scores are not directly comparable for grades 4, 7, and 10. Within each grade, the scale score of 400 has been set as the achievement standard that must be met. Each graph shows the grade-level standard and the students' mean scale score. The shapes of the scale score distributions for the three grades are different. At grade 4, the score distribution approximates a normal distribution that stretches out to the right. At grade 7, the score distribution is skewed, with relatively more students stacking up together toward the lower end of the scale, well below standard. The distribution of scores for grade 10 is spread out almost evenly across a wide range of the score scale (a flat distribution). At all grades, the mean score of the students in the first year of testing is well below the grade-level standard. The proportions of students between the mean and the standard for grades 4, 7, and 10 are 27%, 30%, and 15%, respectively. These data show that in the first year of testing, grade 7 contained a relatively large group of students located at the lower end of the score range. Twice as many students fell between the mean and the standard in grade 7 as in grade 10, and the average scale score in grade 10 was closer to the standard than those in grades 4 and 7.
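As a concrete illustration of the summary statistics used in this analysis, the minimal Python sketch below (with invented scores rather than WASL data) computes a mean scale score and the percentage of students falling between the mean and the standard of 400:

    # Minimal sketch (invented scores, not WASL data): percentage of
    # students whose scale score falls between the mean and the standard.
    STANDARD = 400

    scores = [310, 350, 365, 372, 380, 390, 398, 405, 420, 455]
    mean = sum(scores) / len(scores)
    between = [s for s in scores if mean <= s < STANDARD]
    pct_between = 100.0 * len(between) / len(scores)

    print(f"mean = {mean:.1f}, percent between mean and standard = {pct_between:.0f}%")
    # mean = 384.5, percent between mean and standard = 20%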


[Figure 1 consists of three histograms, one per grade: the grade 4 (1997), grade 7 (1998), and grade 10 (1999) mathematics scale score frequency distributions. Each panel plots number of students (0 to 2,500) against mathematics scale score and marks the mean and the standard of 400.]

Figure 1. Scale Score Frequency Distribution by Grade for Initial Year(s) of Testing

ANALYSIS 2. ANALYSIS OF PERCENTAGE OF STUDENTS REACHING STANDARD

The percentage of students who reach grade-level standard in mathematics in grades 4, 7, and 10 from 1997 to 2002 varies substantially. The percentages are reported in Table 2 and are shown graphically in the top graph in Figure 2. The line for grade 4 is very encouraging, since it steadily increases from 1997 to 2002. The line for grade 10 is just below that for grade 4, and it shows a modest increase from 1999 to 2001 and a small dip in 2002. Grade 7 is the lowest line on the graph and reveals a small and slow increase from 1998 to 2002, while remaining in the lowest position. Although the focus of this investigation is students' achievement in WASL mathematics, the corresponding information for WASL reading achievement for the same grades and same years is shown in the second graph in Figure 2 to provide a broader context for understanding grade-level variation in students' mathematics achievement. The percentages of students in grades 4 and 10 reaching the reading standard were high in the initial years of testing and have generally increased over time. The percentage of students reaching the reading standard in grade 7 started off lower than in the other grades (by about 10 percentage points) and has remained essentially flat over the years of testing. The percentages of students reaching standard in mathematics and reading are shown together in the bottom graph in Figure 2. This combined graph highlights the lack of progress in the 7th grade in both reading and mathematics. These results suggest that systemic factors may be dampening students' achievement in general as measured in grade 7.

Table 2. Percentages of Students Reaching Standard in Grades 4, 7, and 10 in Reading and Mathematics

Grade Level/                                                   Average Percent    Average Annual
Subject            1997   1998   1999   2000   2001   2002   Reaching Standard  Percent Increase
Grade 4 Reading    47.9   55.6   59.1   65.8   66.1   65.6         60.0               3.54
Grade 7 Reading     --    38.4   40.8   41.5   39.8   44.5         41.0               1.53
Grade 10 Reading    --     --    51.4   59.8   62.4   59.2         58.2               2.60
Grade 4 Math       21.4   31.2   37.3   41.8   43.4   51.8         37.8               6.08
Grade 7 Math        --    20.1   24.2   28.2   27.4   30.4         26.1               2.58
Grade 10 Math       --     --    33.0   35.0   38.9   37.3         36.1               1.43
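The Average Annual Percent Increase column appears to be the total gain from the first to the last year of testing divided by the number of intervening years; this inferred rule reproduces every row of Table 2. For grade 4 math, for example, (51.8 - 21.4) / 5 = 6.08 percentage points per year.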

[Figure 2 consists of three line graphs: the percentage of students meeting the mathematics standard, the percentage meeting the reading standard, and both subjects combined, each plotted by year (1997-2002) for grades 4, 7, and 10.]

Figure 2. Percentage of Students Meeting Standards, by Grade Level and Year

ANALYSIS 3. ANALYSIS OF 2002 MATHEMATICS SCORE DISTRIBUTIONS WITH COMPARISONS TO INITIAL YEAR OF TESTING

The distributions of students' mathematics scale scores for spring 2002 are shown in Figure 3, and the comparison of students' performance in the initial years of testing with performance in 2002 is summarized in Table 3. The scale score distributions for spring 2002 are quite different from those of their grade-level counterparts in 1997, 1998, and 1999 for grades 4, 7, and 10, respectively. In grade 4, the scores have moved to the high end of the score range, so much so that the mean scale score is now at the grade 4 standard. The students in grade 7 in 2002 show considerable improvement over their 1998 grade 7 counterparts. In 1998, grade 7 scores were skewed toward the lower end of the scale score range, whereas in 2002, the distribution is roughly normal. The grade 10 scale score distribution in 2002 is more like that of the initial year of testing (1999) than is seen for either of the other grades. Progress has been made in all grades on the WASL mathematics tests. The percentage of students reaching standard in 2002 increased 30.4 percentage points in grade 4 since 1997 (six years); 10.3 points in grade 7 since 1998 (five years); and only 4.3 points in grade 10 since 1999 (four years). The yearly average gains in the percentage of students reaching standard are 6.08 points in grade 4, 2.58 points in grade 7, and 1.43 points in grade 10. The percentage of students between the mean and the grade-level standard has been reduced from 30.2% to 17.5% in grade 7, a reduction of 42%. In grade 10, there has been a reduction of 27%, from 15.2% to 11%.
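The relative reductions cited here follow from the percentages in Table 3: (30.2 - 17.5) / 30.2 = 0.42, the 42% reduction for grade 7, and (15.2 - 11.0) / 15.2 = 0.28 for grade 10; the reported 27% presumably reflects unrounded values.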


[Figure 3 consists of three histograms of the 2002 mathematics scale score frequency distributions for grades 4, 7, and 10. Each panel plots number of students (0 to 3,500) against mathematics scale score and marks the mean and the standard of 400; in the grade 4 panel the mean and the standard coincide.]

Figure 3. Scale Score Frequency Distributions for Grades 4, 7, and 10 in 2002

Table 3. Comparison of Students' Mathematics Achievement for Initial Year of Testing and Spring 2002

Grade 4                                        1997    2002
Mean Scale Score                                374     401
Standard Deviation                             34.0    42.8
Percent Reaching Standard                      21.4    51.8
Percent between the Mean and the Standard      27.3      NA

Grade 7                                        1998    2002
Mean Scale Score                                360     374
Standard Deviation                             44.8    48.4
Percent Reaching Standard                      20.1    30.4
Percent between the Mean and the Standard      30.2    17.5

Grade 10                                       1999    2002
Mean Scale Score                                382     388
Standard Deviation                             42.8    38.5
Percent Reaching Standard                      33.0    37.3
Percent between the Mean and the Standard      15.2    11.0

ANALYSIS 4. ANALYSIS BY MATHEMATICS ACHIEVEMENT LEVELS

A fine-grained analysis of students' mathematics achievement across grades 4, 7, and 10 was obtained by examining the percentage of students in each of the four achievement levels (L1, L2, L3, and L4) over the years of testing. There are six years of data for grade 4 (1997-2002), five years of data for grade 7 (1998-2002), and four years of data for grade 10 (1999-2002). Figure 4 presents histograms illustrating the percentage of students at each grade level scoring at Levels 1-4 for each year of testing. All six years are included in the legend for these graphs to generate columns for each year that line up visually across the grades. The percentages of students in each level were calculated using all students in the respective grades as the base, including students who were not tested. The percentages of students in the four achievement levels do not sum to 100% because the data do not include the percentage of students not tested. The data used to construct Figure 4 are shown in Table 4.

GRADE 4

The data show a distinct pattern in grade 4. Over the six years, the percentage of students in L1 has decreased consistently, from 47.2% to 20.0%. The percentage of students in L2, while fluctuating a bit, is basically constant over the six years. The percentage of students in L3 goes up each year, but not as sharply as the percentage of students in L4. It would appear that, over time, students whose counterparts from previous years would have been in L1 are appearing in L2 or higher. The percentage of students in L2 appears to be about constant because the percentage in L1 is going down while the percentages in L3 and L4 are going up. From 1997 to 2002, the proportion of students in successive cohorts shifting from L1 to L2 is about the same as the proportion of students shifting from L2 to L3 and L4.

GRADE 7

The analysis of grade 7 shows far less change over time than in grade 4. Sixty-one percent of the students were in L1 in 1998; there was a drop of about 5.5% in 1999, with another drop of 5.5% over the next three years. The percentage of students in L2 has been almost constant over the years. The percentages of students in L3 and L4 have fluctuated from 1998 to 2002, with the end result being a very modest increase over time.

GRADE 10

The percentages of students in the various achievement levels in grade 10 have not changed substantially between 1999 and 2002. There has been a small decrease in the percentage of students in L1. The percentage of students in L2 has fluctuated around 20%, as has L3. The percentage of students in L4 has been between 15% and 19%, and in 2002 is back to 15.6%.

ANALYSIS OF MATHEMATICS ACHIEVEMENT, BY LEVELS ACROSS GRADES

Comparing the L1 percentage across grades reveals that initial testing in grade 7 in 1998 resulted in a larger percentage of students in L1 (over 60%), compared with the other two grades in their initial years of testing. The percentage of students in L2 is fairly constant in each of the grades, even though it fluctuates a bit. The percentage of students in L3 goes up strongly in grade 4 but improves only slightly in grades 7 and 10. Grade 4 shows a steady increase in the percentage of students in L4, while very modest L4 increases are seen in grades 7 and 10. Overall, the analysis of student achievement levels indicates that grade 4 has shown the type of progress over time that is highly desirable. Grade 7 results have been dominated by a very large percentage of students in L1 each year of testing, although some progress has been made. Grade 10 has not shown substantial progress, but it had a smaller percentage of students in L1 in the initial year of testing than either of the other two grades. Also, the grade 10 test has been administered over only four years, in contrast to six years for grade 4.

[Figure 4 consists of three bar charts, one each for grades 4, 7, and 10, showing the percentage of students at each achievement level (L1-L4) for every year of testing from 1997 through 2002.]

Figure 4. Percentage of Students in Each of Four Mathematics Achievement Levels (L1, L2, L3, and L4)

Table 4. Percentage of Students at Each Grade Level Scoring at or above Each Achievement Level

Grade 4                  1996-97  1997-98  1998-99  1999-00  2000-01  2001-02
Reading
  At Level 4                16.8     15.6     17.7     22.4     21.5     27.0
  At or above Level 3       47.9     55.6     59.1     65.8     66.0     65.6
  At or above Level 2       83.9     90.2     90.3     92.8     93.4     93.9
  Tested                    95.3     97.7     97.6     97.9     98.3     98.6
  Not Tested                 4.7      2.3      2.4      2.1      1.7      1.4
Mathematics
  At Level 4                 6.6     11.0     13.9     19.3     20.3     24.8
  At or above Level 3       21.3     31.2     37.2     41.7     43.4     51.7
  At or above Level 2       50.2     61.0     64.6     66.6     71.8     78.6
  Tested                    97.4     98.8     98.2     98.0     98.6     98.6
  Not Tested                 2.6      1.2      1.8      2.0      1.4      1.4

Grade 7                  1996-97  1997-98  1998-99  1999-00  2000-01  2001-02
Reading
  At Level 4                  --     11.9     13.9     13.7     16.8     14.2
  At or above Level 3         --     38.5     41.5     40.8     39.8     44.6
  At or above Level 2         --     79.9     80.6     79.9     81.5     84.5
  Tested                      --     96.2     96.9     96.1     97.1     97.7
  Not Tested                  --      3.8      3.1      3.9      2.9      2.3
Mathematics
  At Level 4                  --      5.7     10.4     12.0     13.1     13.2
  At or above Level 3         --     20.1     24.2     28.2     27.4     30.4
  At or above Level 2         --     36.6     41.6     43.5     44.1     47.5
  Tested                      --     97.8     97.4     97.2     97.3     97.7
  Not Tested                  --      2.2      2.6      2.8      2.7      2.3

Grade 10                 1996-97  1997-98  1998-99  1999-00  2000-01  2001-02
Reading
  At Level 4                  --       --     33.4     37.7     47.9     44.0
  At or above Level 3         --       --     51.5     59.8     62.4     59.2
  At or above Level 2         --       --     74.6     79.4     81.3     79.2
  Tested                      --       --     88.3     91.1     92.2     93.1
  Not Tested                  --       --     11.7      9.0      7.8      6.9
Mathematics
  At Level 4                  --       --     13.9     14.8     19.0     15.6
  At or above Level 3         --       --     33.0     35.0     38.9     37.2
  At or above Level 2         --       --     52.4     57.9     59.4     59.9
  Tested                      --       --     91.8     92.7     91.8     93.1
  Not Tested                  --       --      8.2      7.2      8.2      6.9

ANALYSIS 5. COMPARING STANDARDS AND TESTS ACROSS GRADES

The WASL tests were not designed to be directly compared across grades. No attempt was made during item development, test construction, or standard setting to place the separate score scales for grades 4, 7, and 10 on a common scale. Thus, grade-to-grade comparisons must be made with great care, and attention to important qualifications is essential. Notwithstanding these warnings, certain characteristics of the WASL mathematics tests for grade 4 in 1997, grade 7 in 1998, and grade 10 in 1999 are compared in Table 5. These years represent the initial years of testing for these three grades.


Table 5. Comparison of Mathematics Test Characteristics for Initial Year

Mathematics Test Characteristic                                Grade 4     Grade 7     Grade 10
                                                                 1997        1998        1999
Number of Students Tested                                       63,055      73,270      65,270
Scale Score Required to Meet Standard                             400         400         400
Mean Scale Score                                                  374         360         382
Scale Score Standard Deviation                                    34.0        44.8        42.8
Distance from Mean to 400 in Standard Deviation Units            -0.77       -0.89       -0.42
Number and Percent of Students between the Mean                 17,227      22,099       9,950
  and the Standard                                               27.3%       30.2%       15.2%
Number of Points on the Test                                       62          70          70
Percent of Points Required at Standard                           62.9%       55.7%       57.1%
Percent of Students Meeting Standard                             21.4%       20.1%       33.0%
Percent of 4th-Grade Students Meeting Standard if the             32%          —           —
  Grade 7 Criterion of 55.7% of Points Earned Applied
Percent of 7th-Grade Students Meeting Standard if the              —          11%          —
  Grade 4 Criterion of 62.9% of Points Earned Applied
Number/Percent of Points at Grade 7 Corresponding to               —       32 / 45.7%      —
  the Grade 10 Percent Meeting Standard of 33%
Number/Percent of Points at Grade 10 Corresponding to              —           —       48 / 68.6%
  the Grade 7 Percent Meeting Standard of 20.1%

The scale score needed to reach standard is adjusted during the scaling process so that it is 400 for all grades in reading and mathematics. The common scale standard of 400, however, does not mean that the tests are equally demanding across grades and subjects.

GRADE 4 AND GRADE 7

The mean scale scores of 374 and 360 for grades 4 and 7, respectively, suggest that the 4th-grade test is somewhat easier for the 4th-graders than the 7th-grade test is for the 7th-graders. The standard deviation for the grade 4 students is 34.0, compared with 44.8 for grade 7. These values indicate that the 4th-grade students are more closely clustered around their mean score than the 7th-grade students are around theirs. The mean scale score for the 4th-graders is .7 standard deviation below the grade-level standard; for the 7th-graders, the mean scale score is .8 standard deviation below the grade-level standard. The percentage of students between the mean and the standard is similar for the two grades, although a slightly larger proportion of students in grade 7 (by about 3 percentage points) fall between the mean and the grade 7 standard.

The percentage of points needed to reach grade-level standard is very different for the two grades. In grade 4, the standard corresponds to earning 62.9% of the possible points (39/62). In grade 7, the standard corresponds to earning 55.7% of the possible points (39/70). The percentages of students reaching standard in grades 4 and 7 are quite similar, at 21.4% and 20.1%, respectively. These pass rates are very similar despite the fact that a share of the points about 7 percentage points higher must be earned on the 4th-grade test than on the 7th-grade test to reach standard. These results suggest that the 4th-grade standard-setting committee may have perceived the 4th-grade test to contain relatively easy content and compensated by setting a higher standard in terms of the percentage of points that had to be earned to reach standard.

The 4th- and 7th-grade tests can be compared by projecting the percentage of points needed to reach standard from one test to the other. The results of these projections are shown in Table 5. At grade 4, 32% of the students would reach standard if the grade 7 criterion requiring 55.7% of the points were applied (in fact, only 21% of the students reached standard). In contrast, at grade 7, only 11% of the students would reach standard if the grade 4 criterion requiring 62.9% of the points were applied (in fact, 20.1% of the students reached standard).

The preceding analyses suggest that the grade 4 test may have contained relatively easier content for the 4th-graders, compared with the grade 7 test content for the 7th-graders. The percentage of points required to reach standard in grade 4 was set higher, and this relatively higher cutscore compensated for the differences between the difficulty of the content on the two tests relative to the students at the two grades.

GRADE 7 AND GRADE 10

For grades 7 and 10, the mean scores of 360 and 382 suggest that the 7th-grade test is relatively more difficult for the 7th-graders than the 10th-grade test is for the 10th-graders. The standard deviations for the two tests are roughly equivalent, at 45 and 43 for grades 7 and 10, respectively. The mean scale score for the 7th-graders is .9 standard deviation below the grade-level standard; in contrast, for the 10th-graders, the mean scale score is .4 standard deviation below the grade-level standard. The number and percentage of students between the mean and the standard are quite different for the two grades. There are twice as many students between the mean and the standard for grade 7 (22,099 students, or 30% of the 7th-graders) as for grade 10 (9,950 students, or 15% of the 10th-graders).

The percentage of points needed to reach grade-level standard is roughly equivalent for the two grades, at 56% and 57% for grades 7 and 10, respectively. However, only 20.1% of the students reached the standard in grade 7, compared with 33% in grade 10.

The 7th- and 10th-grade tests can be compared by projecting the percentage of students reaching standard from one test to the other. This is done in the last rows of Table 5. If the 33% passing rate from grade 10 is projected onto the grade 7 test data, the raw score required so that 33% of the 7th-graders would reach standard is 32. Thirty-two points represent 45.7% of the points on the 7th-grade test (in fact, 55.7% of the points were required to reach standard). Conversely, if the 20.1% passing rate from grade 7 is projected onto the grade 10 test data, the raw score required so that 20.1% of the 10th-graders would reach standard is 48. Forty-eight points represent 68.6% of the points on the 10th-grade test (in fact, 57.1% of the points were required to reach standard).

The preceding analyses strongly suggest that the 7th-grade test was more challenging for the 7th-graders than the 10th-grade test was for the 10th-graders. This inference is supported by the following observations:

• The percentage of points needed to meet standard was about the same for both tests.

• Only 20.1% of the 7th-graders, compared with 33% of the 10th-graders, reached standard the first year the tests were administered.

• The grade 7 mean scale score (360) was lower than the grade 10 mean scale score (382).

• The grade 7 mean scale score was almost 1 standard deviation below the standard, while the grade 10 mean scale score was less than half a standard deviation below the standard.

• More than twice as many 7th-graders as 10th-graders had scores between the mean and the standard.

• Projecting the percentage meeting standard from grade 10 to grade 7 yields a standard of less than 50% of the items correct.

• Projecting the percentage meeting standard from grade 7 to grade 10 yields a standard of almost 70% of the items correct.
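The distances and projections above are simple arithmetic on the Table 5 entries. The following minimal sketch (in Python; all inputs copied from Table 5, except the grade 10 raw-score cut of 40 points, which is inferred from 57.1% of 70 points) reproduces the key figures:

    # Arithmetic behind Table 5 (values copied from the table above).
    means = {"grade 4": 374, "grade 7": 360, "grade 10": 382}
    sds = {"grade 4": 34.0, "grade 7": 44.8, "grade 10": 42.8}
    CUT = 400  # scale score required to meet standard at every grade

    for grade, mean in means.items():
        # Distance from the mean to the cut score, in standard deviation units.
        z = (mean - CUT) / sds[grade]
        print(f"{grade}: {z:+.2f} SD")
    # grade 4: -0.76 (the table reports -0.77, presumably from an unrounded mean),
    # grade 7: -0.89, grade 10: -0.42

    # Percent of raw-score points required to reach standard.
    raw_cuts = {"grade 4": (39, 62), "grade 7": (39, 70), "grade 10": (40, 70)}
    for grade, (needed, total) in raw_cuts.items():
        print(f"{grade}: {needed / total:.1%} of points required")
    # grade 4: 62.9%, grade 7: 55.7%, grade 10: 57.1%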

It is important to mention some of the procedures used during the standard-setting process as a context for understanding these analyses. Part of the process familiarized the standard-setting panelists with the tests, so they knew how many items or total points were on each test. In addition, specific instructions were given to the members of the various panels that set the grade-level standards. The standard-setting groups were told, “In all content areas the standard should reflect what a well taught and hard working student should know and be able to do near the end of grade [4, 7, or 10].” The standard-setting committee members knew that two additional cutscores would be set after the initial performance standard was set for “Meets Standard.” One of these additional cutscores would be below “Meets Standard” (separating Levels 1 and 2) and one would be above it (separating Levels 3 and 4). Finally, it is important to note that the standard-setting panelists did not have any information about the proportion of students who might meet or not meet standard as they contemplated setting the standard at various raw score points. Information about the percentage of students who would meet or not meet various possible standards is called “impact data,” and it is often supplied to standard-setting committees as they do their work. Impact data were not used as part of the WASL standard-setting procedure.

With this background, consider the 7th-grade standard-setting committee members as they examined what we now know, after the fact, to be a relatively demanding test. Members of the committee knew there were 70 points on the test, and their first task was to set a “Meets Standard” cut point for “well taught and hard working” students. It seems very unlikely that the panelists could bring themselves to set a 7th-grade standard anything close to one that would be proportionally equivalent in difficulty to the 10th-grade standard. As shown in Table 5 and discussed above, a score of 32 on the 7th-grade test would be proportionally as difficult for 7th-graders as the 10th-grade standard was for the 10th-grade students. However, a score of 32 on the 7th-grade test is less than 50% of the points on the test (70 points total), the standard was supposed to typify expectations for “well taught and hard working” students, and an additional cutscore was to be placed below the “Meets Standard” point. The challenging nature of the 7th-grade test, the directions given to the standard-setting committee, and the absence of impact data inevitably led to a very demanding 7th-grade standard.

ANALYSIS 6. MATCHED SAMPLES ANALYSES

The standards-referenced performance of students across grades was compared using special subsamples of students whose 4th-grade scores could be matched to their 7th-grade scores, and different matched subsamples whose 7th-grade scores could be matched with their 10th-grade scores. There are three grades 4-7 cohorts and two grades 7-10 cohorts. The number of matched pairs in each cohort, the original number of students in the cohort at the time of first testing, and the percentage of students retained and lost in the matching process are shown in Table 6.

Table 6. Number of Students in Matched Samples and Percentage of Students Retained and Lost in Matching, by Year

Year        Grades   Number of       Original Number   Percent of        Percent of Students
                     Matched Pairs   of Students       Students Matched  Lost in Matching
1997-2000    4-7        35,821           63,055             56.8               43.2
1998-2001    4-7        45,924           73,164             62.8               37.2
1999-2002    4-7        46,483           74,670             62.3               37.7
1998-2001    7-10       42,663           73,270             58.2               41.8
1999-2002    7-10       41,967           65,270             64.3               35.7

The percentage of students who could be successfully matched ranges from 56% to 64%, with loss rates of 35% to 43%. These loss rates are substantial and suggest that results of analyses based on these data may not generalize to the intact cohorts. Furthermore, a detailed examination of two of the cohorts indicates that the lost cases occurred proportionally much more often among students who had not reached the standard at the time of first testing.

The average cross-classification percentages for the grades 4-7 cohorts on the mathematics tests are shown in Table 7. “Pass” means the students reached grade-level standard; “Fail” means the students did not reach grade-level standard. These data show a fairly high classification consistency of 79% (Pass-Pass + Fail-Fail) across the 3-year span from grade 4 to grade 7.

Table 7. Average Cross-Classification Percentages for Grades 4-7 Cohorts on the Mathematics Tests

                 Grade 7 Pass   Grade 7 Fail
Grade 4 Pass         25.2           10.5
Grade 4 Fail          9.0           53.8

The average cross-classification percentages for the grades 7-10 cohorts on the mathematics tests are shown in Table 8. These data also show a fairly high classification consistency of 76% (Pass-Pass + Fail-Fail) across the 3-year span from grade 7 to grade 10.

Table 8. Average Cross-Classification Percentages for Grades 7-10 Cohorts on the Mathematics Tests

                 Grade 10 Pass   Grade 10 Fail
Grade 7 Pass         34.5            19.1
Grade 7 Fail          5.1            41.3
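The consistency figures quoted for Tables 7 and 8 are the sums of the diagonal (Pass-Pass and Fail-Fail) cells. A minimal sketch, with the cell percentages copied from the two tables above:

    # Classification consistency = Pass-Pass + Fail-Fail percentages.
    grades_4_7 = {("pass", "pass"): 25.2, ("pass", "fail"): 10.5,
                  ("fail", "pass"): 9.0, ("fail", "fail"): 53.8}
    grades_7_10 = {("pass", "pass"): 34.5, ("pass", "fail"): 19.1,
                   ("fail", "pass"): 5.1, ("fail", "fail"): 41.3}

    def consistency(table):
        # Students classified the same way (pass or fail) in both years.
        return table[("pass", "pass")] + table[("fail", "fail")]

    print(consistency(grades_4_7))   # 79.0 -> the 79% reported for Table 7
    print(consistency(grades_7_10))  # 75.8 -> reported as 76% for Table 8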

A comparison of the percentages in Tables 7 and 8 shows that the percentage of Pass-Pass students from grade 4 to grade 7 was 25.2%, whereas the percentage of Pass-Pass students from grade 7 to grade 10 was 34.5%. Conversely, the percentage of Fail-Fail students from grade 4 to grade 7 was 53.8%, whereas the percentage of Fail-Fail students from grade 7 to grade 10 was 41.3%. These data show that a smaller percentage of matched students reached standard at both grades 4 and 7 (25.2%) than reached both the grades 7 and 10 standards in the grades 7-10 cohorts (34.5%). Likewise, a larger percentage of students failed to reach standard at both grades 4 and 7 (53.8%) than failed to reach both the grades 7 and 10 standards (41.3%).

A number of additional analyses were considered and would have been conducted if the percentage of students retained in the matching over the grades 4-7 and grades 7-10 cohorts had been higher, or if the loss of students had been random. These additional analyses could have been used to examine students’ performance in the later grade, conditional on how they had performed in the earlier grade. For example, it would be informative to know:

• What percentage of the students who reached standard in the first grade tested reached or failed to reach standard when they were tested at their subsequent grade level?

• What percentage of the students who did not reach standard in the first grade tested reached or failed to reach standard when they were tested at their subsequent grade level?

Detailed analyses at this level seemed unwise, given the very substantial attrition rate in the matched cohorts and the fact that a disproportionately large percentage of the students who could not be matched had not met the standard in the first grade tested. Detailed analyses based on this type of nonrepresentative sample would be misleading, and the results would not be applicable to the intact cohorts.

CONCLUSIONS: TECHNICAL ANALYSES 1-6

Overall, the technical analyses reveal progress in mathematics achievement at each of the grade levels. Patterns of achievement at grade 4 show gains that have been rapid and consistent over the six years that the WASL has been administered. In grade 7, achievement gains have been sluggish, although they have improved in the last two years. Achievement gains in grade 10 have been consistently modest since the WASL’s inception.


ALIGNMENT STUDIES

ALIGNMENT PANEL PARTICIPANTS AND METHODS

PARTICIPANTS

The Alignment Panel was composed of four individuals with extensive expertise in mathematics education. Each panelist also had experience in judging the alignment of assessments with standards. Two members were nationally recognized for their pioneering work in the field of alignment of mathematics and science assessments with standards, and two members had worked on the grade 4 Northwest Regional Educational Laboratory (NWREL) Washington Assessment of Student Learning (WASL) Mathematics Study. The primary focus of the Alignment Panel was to study the relative difficulty of individual grade 7 and grade 10 mathematics WASL items with respect to their correspondence with grade-level Essential Academic Learning Requirements (EALRs) and the cognitive demands of test item features.

The participants on the Alignment Panel were:

Norman Webb, University of Wisconsin, Madison
Gerald Kulm, Texas A&M University
Verna Adams, Washington State University, Pullman (Emeritus)
John Woodward, University of Puget Sound

Panel support personnel included:

Edys Quellmalz, Washington Technical Advisory Committee, SRI International
Patty Kreikemeier, SRI International
Barbara Chamberlain, Consultant, OSPI

METHODS

The alignment protocol used by SRI International for the WASL mathematics items was developed through the synthesis of multiple alignment protocols and framework documents:

• NSF-funded studies completed at SRI International, including the Validities of Science Inquiry Design and Implementation Studies (Quellmalz & Kreikemeier, 2002) and Alignment of GLOBE with Standards (Quellmalz, Kreikemeier, Rosenquist, & Hinojosa, 2001; Quellmalz, Kreikemeier, & Rosenquist, 2002).

• The National Council of Teachers of Mathematics’ Principles and Standards for School Mathematics (NCTM, 2000).

• The mathematics frameworks for the 1996 and 2000 National Assessment of Educational Progress (NAEP, 1999).

• The National Research Council’s National Science Education Standards (NRC, 1996).

• The American Association for the Advancement of Science’s Benchmarks for Science Literacy (AAAS, 1993).

• The Achieve alignment protocol (Achieve, Inc., 1997).

• The Webb alignment protocol used to align standards and assessments in mathematics in four states (Webb, 1999).

• The AAAS/Project 2061 alignment project currently being developed (Kulm, 2002).

• A revision of Bloom’s taxonomy of educational objectives (Anderson & Krathwohl, 2001).

• The Washington Essential Academic Learning Requirements.

• The OSPI-directed study of the grade 4 WASL mathematics test completed by NWREL.

SRI researchers designed an alignment methodology tailored to the goals of the study: (1) to judge the appropriateness of the items for providing evidence of achievement of the EALRs specified for the grade level, and (2) to identify possible sources of challenge in item design. An inductive approach was developed that asked independent experts to analyze each test item, judge the content and processes that the item seemed to elicit, and then use these judgments to identify the standards the item appeared to test. The Alignment Panelists worked through each item, thus serving as expert-level students. To judge the cognitive complexity of an item, the panelists engaged in a cognitive analysis of the requirements of the item and the mathematical concepts and processes students must use to solve it.

This inductive approach is in contrast to a deductive review model that supplies alignment judges with the test developer’s classifications of items and asks them to confirm or disconfirm the developer’s intentions. SRI’s method provides an independent verification of the test developer’s codes by checking their agreement with item classifications resulting from the alignment experts’ inductive process.

The Alignment Panelists were asked to evaluate components of items that were hypothesized to affect student performance on mathematics items on the grade 7 1998, grade 7 2001, grade 10 1999, and grade 10 2001 WASL. Figure 5 describes the three characteristics of items that panelists judged.

25

Content Characteristics
• EALRs – What is the primary EALR to which an item is linked? What is the secondary link, if appropriate?
• Computation – Is computation required for this student response?
• Knowledge type – What type of knowledge is measured by this student response?

Format Characteristics
• Item stimulus – To what extent do cognitive demands and grade-level appropriateness of item stimulus materials influence student response?
• Item directions – To what extent do cognitive demands and structure of item directions influence student response?

Scoring Characteristics
• Student evidence – To what extent does the student evidence match the item directions and the performance expectations of the assigned EALR?
• Rubric – To what extent do the item rubric and anchor papers match the performance expectations of the assigned EALR?

Figure 5. Mathematics Item Characteristics Judged by Alignment Panel

Detailed descriptions of judgments made within each category are given below.

RATING CATEGORIES

Content Characteristics.

EALR linkage: Identify the EALR for which the item appears to provide the strongest evidence of student achievement. If appropriate, link the item to a secondary EALR.

Computation may apply: Make a decision about the likelihood that a student will use computation to reach the answer.

Type of knowledge: Identify the type of knowledge elicited in the student response. The type of knowledge categories include: declarative knowledge – domain-specific knowledge; procedural knowledge – a sequence of steps or actions to achieve a goal; schematic knowledge – using principles and mental models to interpret, explain/justify, and predict/hypothesize; and strategic knowledge – applying other knowledge types to particular situational demands. Examples of the student evidence elicited by each knowledge type are given in Figure 6.


Declarative Knowledge
• Recognize, recall, and identify.
• Classify and label.
• Interpret and apply signs, symbols, and terms.

Procedural Knowledge
• Carry out experimental procedures; use manipulatives; make observations; collect and organize data.
• Use an algorithm within a given problem situation.
• Read graphs and tables; produce graphs and tables.
• Complete geometric constructions.
• Perform noncomputational skills, such as rounding and ordering.
• Collect information from models, diagrams, and representations.

Schematic Knowledge
• Describe, display, explain, compare, and interpret data; determine the consistency of data.
• Explain the purpose and use of experimental procedures; verify or justify the correctness of a procedure; judge the reasonableness and correctness of solutions.
• Make connections between findings, related concepts, and phenomena.
• Interrelate models, diagrams, manipulatives, and representations.

Strategic Knowledge
• Select a problem-solving approach among alternatives.
• Design experiments; generate, extend, and modify procedures in new settings.
• Connect mathematical knowledge and apply it to particular situational demands.
• Relate ideas within the content area or among content areas.
• Use reasoning, strategies, and models to generate concepts; combine and synthesize ideas into new concepts.
• Compare, contrast, and integrate related concepts and principles.

Figure 6. Types of Knowledge and Associated Mathematical Activities


Format Characteristics. Decisions about format characteristics were divided into two categories: decisions about the item stimulus materials and decisions about the item directions. Decisions related to item stimulus materials included:

• Use of specialized mathematical vocabulary and its grade appropriateness (examples: “reflect over the y-axis,” “equivalent expression,” “convex polygon,” “congruent angle”).

• Use of grade-appropriate context (examples: background information is familiar or adequately explained if it is specialized).

• Use of grade-appropriate scaffolding supports for the content or structure of the item (examples: labeling axes of a graph, providing criteria for a constructed-response item, supplying sufficient context).

• Inclusion of extraneous information that increases reading load but does not add to appropriate scaffolding of the item.

• Use of grade-inappropriate information load (examples: overly complex graphic formats; multistep calculations; overly complex context; excessive, irrelevant text).

• Use of grade-appropriate graphic formats (examples: graphics are required and serve more than a supplementary function; graphics contain sufficient but not extraneous information; background information of the graphic is familiar or adequately explained if it is specialized).

Decisions related to item directions included several of the characteristics described above (e.g., use of specialized mathematical vocabulary and its grade appropriateness), as well as use of grade-appropriate scaffolding supports. Additionally, decisions were made about the match between the item directions and the expectations and features of the EALR to which the item was linked.

Scoring Characteristics. Decisions about scoring included:

• Match between student response expectations and the expectations stated within the item directions.

• Match between the rubric and the item directions.

• Match between the rubric and the expectations and features of the EALR to which the item was linked.

• Match between anchor papers used to exemplify the rubric and the rubric features and quality levels.

ALIGNMENT TRAINING

Panel members convened for a 2-day working session on August 8 and 9, 2002, in Seattle, Washington. During that time, the panelists were introduced to the purpose of the study, the general structure of the Washington State Essential Academic Learning Requirements document, and the alignment protocol. Each alignment coding category was defined and accompanied by an item and an annotation explaining the specific item characteristics that placed the item into a category. The panelists and support personnel discussed and reached consensus on decision rules for classifying the alignment of items with standards and for classifying item characteristics. Panelists practiced coding sample sets of items.

ITEM CODING

Following the discussion of the training materials, Alignment Panel members coded the WASL mathematics assessment items. The Alignment Panel coded all items from the 1998 and 2001 grade 7 WASL (88 items) and all items from the 1999 and 2001 grade 10 WASL (87 items). These 175 items were divided into two groups based on item format: all multiple-choice WASL items were collected and compiled into one notebook (57 grade 7 items plus 56 grade 10 items), and all constructed-response WASL items were collected and compiled into another notebook (31 grade 7 items plus 31 grade 10 items). Each item was identified by grade level and the unique nine-digit number given by the test developer. The items were then randomized within each item format notebook. Panel members were trained on day 1 to code items in the multiple-choice format, which involves fewer item features (no scoring rubrics or corresponding anchor papers). On day 2, panelists were trained to address the additional item characteristics associated with the constructed-response format.

Items were coded by two or more members of the panel, who worked independently. Panelists used the Washington State EALR document and the item characteristics coding definitions as they worked. Mathematics item specification documents created by OSPI were made available to panel members. As panel members made their decisions about each item characteristic, they recorded their classifications on an alignment coding sheet. After an initial reading of the item and ancillary materials, such as scoring rubrics and anchor papers in the case of constructed-response items, panelists began the coding and linking process. Once a primary EALR (and secondary EALR, when appropriate) had been identified, it served as the anchor to which all item characteristics were referenced.

In addition to classifying WASL mathematics items according to the item characteristics described above, panelists were encouraged to provide written feedback on individual items, when appropriate. These comments were quite specific and, to preserve item security, were forwarded to OSPI as confidential files.

Alignment coding sheets were designed to be processed by the Teleform system. Each coding sheet was preprinted with the item identification code and the designation of the item format. The completed alignment coding sheets were scanned using the Teleform system, and the data were entered into a spreadsheet.


DATA ANALYSIS

Interrater reliabilities were established for each item characteristic on the mathematics WASL for each of the years examined. For example, when three out of four panelists agreed on an item characteristic, a reliability rating of 0.75 was assigned to that item characteristic; if two out of three panelists agreed, the item received a 0.67 reliability rating; and in those infrequent cases where none of the panelists agreed on a particular characteristic, the item classification chosen by the panelist with the most extensive alignment experience was used and the reliability was recorded as 0.25.
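One way to express this rule in code is sketched below. Only the three values given above (0.75, 0.67, and the 0.25 fallback) come from the document; how other splits were handled, such as a 2-2 tie among four panelists, is not specified, so the proportional treatment here is an assumption:

    from collections import Counter

    def interrater_reliability(codes, fallback=0.25):
        """Proportion of panelists assigning the modal code to an item
        characteristic. When no two panelists agree, fall back to 0.25
        (the most experienced panelist's classification is then used)."""
        top = Counter(codes).most_common(1)[0][1]  # size of the largest agreeing group
        if top == 1:  # no two panelists agreed
            return fallback
        return round(top / len(codes), 2)

    print(interrater_reliability(["EALR1", "EALR1", "EALR1", "EALR2"]))  # 0.75
    print(interrater_reliability(["EALR1", "EALR1", "EALR3"]))           # 0.67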

RESULTS OF ALIGNMENT PANEL CLASSIFICATIONS OF WASL MATHEMATICS ITEMS

Sets of tables summarizing the panel ratings were generated for each item characteristic for each test year and grade level. For each judgment, an initial table presents the ratings across the grade levels and testing years. A second table displays classifications of the item characteristic subdivided by item format. Interrater reliabilities for each item characteristic are presented at the level of the test grade and test year only.

REPRESENTATION OF MATHEMATICS ITEM FORMATS

The item formats currently employed in the WASL in mathematics include multiple-choice, 2-point constructed-response, and 4-point constructed-response. As shown in Table 9, the distribution of item formats and score points was evenly balanced across the grade levels for all years.

Table 9. Distribution of Mathematics Item Formats across Tests

Item Format                        Grade 7   Grade 7   Grade 10   Grade 10
                                     1998      2001      1999       2001
Multiple-choice
  Number of items                     30        27        30         26
  Percent of items                   0.65      0.64      0.65       0.63
  Percent of total points            0.43      0.42      0.43       0.41
2-point constructed-response
  Number of items                     12        11        12         11
  Percent of items                   0.26      0.26      0.26       0.27
  Percent of total points            0.34      0.34      0.34       0.34
4-point constructed-response
  Number of items                      4         4         4          4
  Percent of items                   0.09      0.10      0.09       0.10
  Percent of total points            0.23      0.25      0.23       0.25
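The percentages in Table 9 follow directly from the item counts and the point value of each format. A minimal sketch for the grade 7 1998 form (counts taken from Table 9; 2-point and 4-point constructed-response items contribute 2 and 4 points each):

    # Item-format shares for the grade 7 1998 form: (number of items, points per item).
    formats = {"multiple-choice": (30, 1),
               "2-point constructed-response": (12, 2),
               "4-point constructed-response": (4, 4)}

    total_items = sum(n for n, _ in formats.values())        # 46 items
    total_points = sum(n * p for n, p in formats.values())   # 70 points

    for name, (n, p) in formats.items():
        print(f"{name}: {n / total_items:.2f} of items, "
              f"{n * p / total_points:.2f} of points")
    # multiple-choice: 0.65 of items, 0.43 of points -- matching Table 9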

REPRESENTATION OF EALRS ON THE MATHEMATICS TESTS

Tables 10–13 summarize panel ratings related to items’ representation of the EALRs on the WASL in mathematics, as administered for grade 7 in 1998 and 2001 and for grade 10 in 1999 and 2001. Table 10 and Table 11 present the primary EALR distributions across tests and item formats. General findings in Table 10 and Table 11 include:

• The representation of EALRs by items was fairly even across the grade levels.

• The majority of items on the WASL in mathematics for grades 7 and 10 were linked to EALR 1 (concepts and procedures of mathematics); between 78% and 88% of the test items and 70% or more of the student’s test score were attributable to this EALR. This finding is not unexpected: the five major curricular areas (number sense, measurement, probability and statistics, geometry, and algebra) are all subcategories of EALR 1.

• The remaining 12% to 22% of the test items were spread across EALRs 2, 3, and 4, which relate to mathematical problem solving, mathematical reasoning, and mathematical communication.

• The Alignment Panel did not identify EALR 5, the EALR for mathematical connections, as the primary link for any items.

The interrater agreement on item/primary EALR links ranged from 0.86 to 0.96; that is, alignment panelists agreed with each other from 86% to 96% of the time when identifying one of the five EALRs as the primary link to an item. The agreement level was highest on those items linked to EALR 1: “The student understands and applies the concepts and procedures of mathematics.” The agreement levels were lower on those items linked to EALRs 2, 3, 4, and 5.


Table 10. EALR Distribution across Mathematics Tests

Primary EALR as Identified by Alignment Panel   Grade 7   Grade 7   Grade 10   Grade 10
                                                  1998      2001      1999       2001
1. The student understands and applies the concepts and procedures of mathematics.
   Number of items                                 34        34        39         36
   Percent of items                               0.74      0.81      0.85       0.88
   Percent of total points                        0.69      0.72      0.77       0.78
2. The student uses mathematics to define and solve problems.
   Number of items                                  3         4         3          1
   Percent of items                               0.07      0.10      0.07       0.02
   Percent of total points                        0.11      0.15      0.07       0.06
3. The student uses mathematical reasoning.
   Number of items                                  3         2         1          3
   Percent of items                               0.07      0.05      0.02       0.07
   Percent of total points                        0.11      0.09      0.06       0.11
4. The student communicates knowledge and understanding in both everyday and mathematical language.
   Number of items                                  3         1         4          2
   Percent of items                               0.07      0.02      0.09       0.05
   Percent of total points                        0.11      0.03      0.14       0.06
5. The student understands how mathematical ideas connect within mathematics, to other subject areas, and to real-life situations.
   Number of items                                  0         0         0          0
   Percent of items                               0.00      0.00      0.00       0.00
   Percent of total points                        0.00      0.00      0.00       0.00
Interrater agreement                              0.86      0.96      0.94       0.95


[Table 11, “EALR Distribution across Mathematics Item Formats,” subdivides the primary EALR links in Table 10 by item format (multiple-choice, 2-point constructed-response, and 4-point constructed-response) for each test form, reporting the number of items, the percent of items of each format, and the percent of total points. Across all four test forms, EALR 1 accounted for at least half of the items of every format, and no items of any format were linked to EALR 5.]

Table 12 presents the Alignment Panel’s linking of items at a more detailed level, the level of EALR + element + target. The table reveals the following relationships:

• There was some uneven representation of test items across the 5 EALR + elements and across the 13 EALR + element + targets on the test forms. For example, EALR 1.4.b (“Understand and apply concepts and procedures from probability and statistics – statistics”) was linked to two items on the grade 7 test in 1998 and to zero items on the remaining tests.

• Within EALR 1, representation varied by grade level. For example, fewer items on the grade 7 tests (n = 4 items) than on the grade 10 tests (n = 11 items) were linked to EALR 1.2.a (“Understand and apply concepts and procedures from measurement – attributes and dimensions”). On the other hand, more items were linked to EALR 1.4.a (“Understand and apply concepts and procedures from probability and statistics – probability”) on the grade 7 tests (n = 9 items) than on the grade 10 tests (n = 5 items).

• Some EALR + element + targets were not linked to any test items. For example, EALR 3.2 (“Use mathematical reasoning to predict results”) was not linked to any items at either grade level on either test. The low number of items linked to EALRs 2, 3, 4, and 5 makes analysis of item linkage at the more detailed level of EALR + element + target difficult.

Even at this more detailed level, interrater reliabilities ranged from 0.70 to 0.86.

Table 12. EALR and Element Distribution across Mathematics Tests
(Cells show the number of items, with the percent of items of this type in parentheses.)

EALR and Element Group Identified                  Grade 7    Grade 7    Grade 10   Grade 10
by Alignment Panel(1)                                1998       2001       1999       2001

EALR 1  The student understands and applies the concepts and procedures of mathematics.
  1.1.a Number sense – numbers & numeration        4 (0.09)   2 (0.05)   4 (0.09)   2 (0.05)
  1.1.b Number sense – computation                 5 (0.11)   4 (0.10)   6 (0.13)   1 (0.02)
  1.1.c Number sense – estimation                  1 (0.02)   1 (0.02)   0 (0.00)   1 (0.02)
  1.2.a Measurement – attributes & dimensions      1 (0.02)   3 (0.07)   6 (0.13)   5 (0.12)
  1.2.b Measurement – approximation & precision    2 (0.04)   0 (0.00)   0 (0.00)   0 (0.00)
  1.2.c Measurement – systems & tools              1 (0.02)   1 (0.02)   1 (0.02)   0 (0.00)
  1.3.a Geometric sense – shape & dimension        2 (0.04)   2 (0.05)   0 (0.00)   1 (0.02)
  1.3.b Geometric sense – relationships &
        transformation                             4 (0.09)   3 (0.07)   5 (0.11)   5 (0.12)
  1.4.a Probability & statistics – probability     5 (0.11)   4 (0.10)   2 (0.04)   3 (0.07)
  1.4.b Probability & statistics – statistics      2 (0.04)   0 (0.00)   0 (0.00)   0 (0.00)
  1.4.c Probability & statistics – prediction &
        inference                                  1 (0.02)   1 (0.02)   1 (0.02)   0 (0.00)
  1.5.a Algebraic sense – relations &
        representations                            2 (0.04)   2 (0.05)   0 (0.00)   1 (0.02)
  1.5.b Algebraic sense – operations               4 (0.09)   3 (0.07)   5 (0.11)   5 (0.12)
  Multiple elements within EALR 1                  2 (0.04)   2 (0.05)   5 (0.11)   6 (0.15)

EALR 2  The student uses mathematics to define and solve problems.
  2.1 Investigate situations                       5 (0.11)   4 (0.10)   2 (0.04)   3 (0.07)
  2.2 Formulate questions and define the problem   1 (0.02)   1 (0.02)   0 (0.00)   0 (0.00)
  2.3 Construct solutions                          1 (0.02)   2 (0.05)   1 (0.02)   1 (0.02)

EALR 3  The student uses mathematical reasoning.
  3.1 Analyze information                          1 (0.02)   0 (0.00)   1 (0.02)   1 (0.02)
  3.2 Predict results                              0 (0.00)   0 (0.00)   0 (0.00)   0 (0.00)
  3.3 Draw conclusions and verify results          1 (0.02)   1 (0.02)   0 (0.00)   2 (0.05)
  Multiple elements within EALR 3                  1 (0.02)   1 (0.02)   0 (0.00)   1 (0.02)

EALR 4  The student communicates knowledge and understanding in both everyday and mathematical language.
  4.1 Gather information                           0 (0.00)   0 (0.00)   1 (0.02)   0 (0.00)
  4.2 Organize and interpret information           0 (0.00)   1 (0.00)   0 (0.00)   0 (0.00)
  4.3 Represent and share information              1 (0.02)   1 (0.02)   3 (0.07)   0 (0.00)
  Multiple elements within EALR 4                  0 (0.00)   0 (0.00)   0 (0.00)   1 (0.02)

EALR 5  The student understands how mathematical ideas connect within mathematics, to other subject areas, and to real-life situations.
  5.1 Relate concepts and procedures within
      mathematics                                  0 (0.00)   0 (0.00)   0 (0.00)   0 (0.00)
  5.2 Relate mathematical concepts and
      procedures to other disciplines              0 (0.00)   0 (0.00)   0 (0.00)   0 (0.00)
  5.3 Relate mathematical concepts and
      procedures to real-life situations           0 (0.00)   0 (0.00)   0 (0.00)   0 (0.00)

Cross-strand EALRs & elements                      3 (0.07)   3 (0.07)   1 (0.02)   3 (0.07)

Interrater agreement                                 0.85       0.79       0.86       0.70

(1) To parallel the test developer’s item “strands,” both the primary and secondary EALR identifications were used for this table. Therefore, the number of items within a given classification will not always match the number of items within that same classification in Tables 10 and 11, where only primary EALR classifications are reported.

Table 13 presents the level of agreement between the primary EALR identified by the Alignment Panel and the test developer’s item strand identification. The more general the identification, the greater the match between the Alignment Panel’s link and the test developer’s code:

• At the most general level, the EALR level, the Alignment Panel’s interrater agreement was 86% to 96%, and its agreement with the test developer was 78% to 87%.

• At the next level of detail, EALR + element, the Alignment Panel’s interrater agreement was 70% to 76%, while its agreement with the test developer was 70% to 78%.

• At the most specific level, EALR + element + target, the Alignment Panel’s interrater agreement was 57% to 59%, while its agreement with the test developer was 56% to 59%.

Table 13. Agreement of Alignment Panel Codes with Test Developer Item Strands

Level of Specificity within the Washington         Grade 7   Grade 7   Grade 10   Grade 10
Essential Academic Learning Requirements (EALRs)     1998      2001      1999       2001
EALR
  Number of items agreeing                            36        34        40         35
  Percent of items                                   0.78      0.81      0.87       0.85
  Interrater agreement                               0.86      0.96      0.94       0.95
EALR + element
  Number of items agreeing                            32        32        33         29
  Percent of items                                   0.70      0.78      0.72       0.71
  Interrater agreement                               0.70      0.76      0.72       0.71
EALR + element + target
  Number of items agreeing                            26        24        27         23
  Percent of items                                   0.57      0.59      0.59       0.56
  Interrater agreement                               0.57      0.57      0.59       0.56

Results from the Alignment Panel’s linking of WASL mathematics items with EALRs suggest that several features related to the representation of the EALRs on the WASL mathematics tests might be further examined. Building on the methodologies outlined in the NWREL study of the grade 4 WASL mathematics tests, panelists linked items to a primary EALR and, if appropriate, to a secondary EALR. Test development documents indicate that items may be linked to several EALRs. Scores, however, seem to be reported only for the primary EALR.

Recommendations

The relationships among score reports, the development of scoring guides, and the indexing of items should be clarified for item and test development and for the public. Confirm whether EALRs and elements are represented in the items and score reports as intended.


GRADE-LEVEL EXPECTATIONS

Panelists were asked to judge the alignment of the performance required by each WASL mathematics item to one or more EALRs at the intended grade level. Table 14 presents these judgments. Panelists’ ratings suggest that the grade 7 test is more challenging for 7th-grade students than the grade 10 test is for 10th-grade students. The total number of items that seemed to require higher or lower grade-level performance decreased between the first administrations (1998 for grade 7 and 1999 for grade 10) and the more recent 2001 administration. Interrater agreement ranged from 83% to 96%, indicating that the Alignment Panel was in fairly high agreement regarding grade-level expectations.

Table 14. Relationship between Mathematics Item Directions and Performance Expectations

Item Direction Relative to           Grade 7   Grade 7   Grade 10   Grade 10
Performance Expectations               1998      2001      1999       2001
Match grade-appropriate benchmark
  Number of items                       34        35        34         32
  Percent of items                     0.74      0.83      0.74       0.78
  Percent of total points              0.75      0.86      0.77       0.86
  Interrater agreement                 0.71      0.72      0.78       0.70
Match higher-grade-level benchmark
  Number of items                       10         6         0          0
  Percent of items                     0.22      0.14      0.00       0.00
  Percent of total points              0.21      0.12      0.00       0.00
  Interrater agreement                 0.90      0.91      0.98       1.00
Match lower-grade-level benchmark
  Number of items                        2         1        12          9
  Percent of items                     0.04      0.02      0.26       0.22
  Percent of total points              0.04      0.02      0.23       0.14
  Interrater agreement                 0.96      0.96      0.83       0.87

Recommendation

Examine and document the balance of grade-level items within and across the grade levels.


ITEM FEATURES

Panelists also classified mathematics WASL items according to several item characteristics (e.g., the type of knowledge elicited and the probability that an item would require computation). Table 15 and Table 16 identify the knowledge types elicited by mathematics WASL items across the grades, testing years, and item formats. The definitions of the types of knowledge were developed through a synthesis of various standards, frameworks, and alignment protocols, as described earlier. Panelists classified items as eliciting:

• Declarative knowledge: “knowing that”; domain-specific knowledge.

• Schematic knowledge: “knowing when”; using principles and mental models to interpret, explain/justify, and predict/hypothesize.

• Procedural knowledge: “knowing how”; a sequence of steps/actions to achieve a goal.

• Strategic knowledge: “knowing when, where, why, and how” to use other knowledge to apply to particular situational demands.

Detailed lists of verbs associated with mathematics were included as examples to further elucidate each category. A specific mathematics WASL item could be classified as likely to elicit more than one type of knowledge. Results in Tables 15 and 16 indicate that:

• More items were judged by the panelists to elicit the use of declarative and strategic knowledge on the grade 7 tests than on the grade 10 tests.

• The number of items judged by panelists to elicit the use of schematic and procedural knowledge appears to be fairly even across the grade levels and test years.

• Approximately 80% of the items were judged by the panelists to elicit the use of procedural knowledge.

• The number of items judged by the panelists to elicit strategic knowledge was smaller than for the other types of knowledge. This finding appears to support the previous finding that panelists linked few items to the EALRs for mathematical problem solving, mathematical reasoning, mathematical communication, and mathematical connections.

Interrater agreement on these item characteristics ranged from 71% to 90%, with lower levels of agreement when classifying the grade 10 items.


Table 15. Type of Knowledge Used in Mathematics Items

Type of Item Knowledge(1)           Grade 7   Grade 7   Grade 10   Grade 10
                                      1998      2001      1999       2001
Declarative knowledge
  Number of items                      36        35        23         29
  Percent of items                    0.78      0.83      0.50       0.71
  Percent of total points             0.81      0.74      0.60       0.72
  Interrater agreement                0.80      0.82      0.79       0.77
Schematic knowledge
  Number of items                      19        13        10         15
  Percent of items                    0.42      0.32      0.22       0.37
  Percent of total points             0.53      0.48      0.24       0.42
  Interrater agreement                0.83      0.85      0.71       0.85
Procedural knowledge
  Number of items                      38        29        39         34
  Percent of items                    0.83      0.69      0.85       0.83
  Percent of total points             0.87      0.68      0.83       0.78
  Interrater agreement                0.85      0.80      0.82       0.85
Strategic knowledge
  Number of items                       9         9         5          7
  Percent of items                    0.20      0.21      0.11       0.17
  Percent of total points             0.21      0.17      0.14       0.17
  Interrater agreement                0.86      0.90      0.82       0.83

(1) Items could be classified into more than one knowledge type.


Table 16. Knowledge Type of Mathematics Items by Item Format

                            Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Type of Item Knowledge     mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4
Declarative knowledge
  Number of items          23    9     4     24    10    1     13    7     3     18    8     3
  Percent of items        0.77  0.75  1.00  0.92  0.83  0.25  0.43  0.58  0.75  0.69  0.73  0.75
  Percent of total points 0.33  0.26  0.23  0.37  0.31  0.06  0.19  0.20  0.17  0.28  0.25  0.19
Schematic knowledge
  Number of items           7    9     3      3    6     4      5    4     1      7    6     2
  Percent of items        0.23  0.75  0.75  0.16  0.50  1.00  0.17  0.33  0.25  0.27  0.55  0.50
  Percent of total points 0.10  0.26  0.17  0.05  0.19  0.25  0.07  0.11  0.06  0.11  0.19  0.13
Procedural knowledge
  Number of items          23   11     4     18    9     2     26   10     3     22   10     2
  Percent of items        0.23  0.92  1.00  0.69  0.75  0.50  1.00  0.91  0.75  0.85  0.91  0.50
  Percent of total points 0.33  0.31  0.23  0.28  0.28  0.12  0.37  0.29  0.17  0.34  0.31  0.13
Strategic knowledge
  Number of items           5    3     1      7    2     0      2    2     1      5    1     1
  Percent of items        0.17  0.25  0.25  0.27  0.17  0.00  0.07  0.17  0.25  0.19  0.09  0.25
  Percent of total points 0.07  0.09  0.06  0.11  0.06  0.00  0.03  0.06  0.06  0.08  0.03  0.06

Recommendation

Review the relative distribution of items across the different knowledge types. Confirm the percentage of each knowledge type desired on tests at each grade level.

In addition, panelists indicated whether WASL mathematics items were likely to require computation procedures to answer the question posed. These results are found in Table 17 and Table 18. Approximately 60% of the grade 7 items and 75% of the grade 10 items were judged to require some computation. These percentages have remained steady between the tests’ first use (1998 for grade 7, 1999 for grade 10) and the more recent 2001 testing.


Table 17. Mathematics Items that May Require Computation

Computation Required               Grade 7   Grade 7   Grade 10   Grade 10
                                     1998      2001      1999       2001
Number of items                       28        25        35         30
Percent of items                     0.61      0.60      0.76       0.73
Percent of total points              0.69      0.57      0.76       0.72
Interrater agreement                 0.96      0.93      0.88       0.89

Table 18. Mathematics Items that May Require Computation, by Item Format

                            Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Computation Required       mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4
  Number of items          16    8     4     15    9     1     23    9     3     20    7     3
  Percent of items        0.53  0.67  1.00  0.58  0.75  0.25  0.77  0.75  0.75  0.77  0.64  0.75
  Percent of total points 0.23  0.23  0.23  0.23  0.28  0.06  0.33  0.26  0.17  0.31  0.22  0.19

ANALYSIS OF ITEM FEATURES

To provide information about the characteristics that might contribute to item difficulty, panelists were asked to judge the appropriateness of the item stimulus materials, the item directions, the response requirements, and the use of various graphic formats.

Table 19 presents panelists’ judgments of the grade-level appropriateness of the mathematical vocabulary used in WASL mathematics items. Almost all items on the grade 7 tests (1998, 2001) and grade 10 tests (1999, 2001) were considered grade-level appropriate.

Table 19. Use of Mathematics Vocabulary

Grade-Level-Appropriate            Grade 7   Grade 7   Grade 10   Grade 10
Mathematics Vocabulary               1998      2001      1999       2001
Number of items                       42        39        46         39
Percent of items                     0.93      0.93      1.00       0.95
Percent of total points              0.87      0.88      0.87       0.86
Interrater agreement                 0.89      0.99      0.98       0.98


When asked to identify items employing a context within the item stimulus materials, panelists classified fewer WASL mathematics items from the 2001 tests as having a context (see Table 20 and Table 21). The multiple-choice items are becoming more decontextualized. Data in Table 21 indicate that 73% of the multiple-choice items on the grade 7 WASL in mathematics in 1998 were set within a context, while only 62% of the multiple-choice items from the 2001 test were set within a context. The percentages for grade 10 are very similar: 77% in 1999, compared with 62% in 2001. OSPI and the test developer may wish to pursue further study of the relationship between item context and item difficulty.

Table 20. Mathematics Items that Use Context

Context Present within Item        Grade 7   Grade 7   Grade 10   Grade 10
Stimulus Materials                   1998      2001      1999       2001
Number of items                       36        30        37         25
Percent of items                     0.78      0.73      0.80       0.61
Percent of total points              0.83      0.82      0.83       0.78
Interrater agreement                 0.93      0.99      0.99       0.94

Table 21. Context, by Mathematics Item Format

                            Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Context Present within     mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4
Item Stimulus Materials
  Number of items          22   10     4     17    10    4     22   10     4     14   10     4
  Percent of items        0.73  0.83  0.25  0.65  0.83  0.25  0.73  0.83  0.25  0.54  0.91  0.25
  Percent of total points 0.31  0.29  0.23  0.26  0.31  0.25  0.31  0.29  0.23  0.22  0.31  0.25

For those items set within a context, panelists judged the grade-level appropriateness and familiarity of that context. Table 22 and Table 23 show the panel ratings regarding the grade-level appropriateness of item context. On the earlier version of each test, panelists judged that 27% of the grade 7 items used a context more appropriate to higher grade levels, while 22% of the grade 10 items used a context more appropriate to lower grade levels. The use of grade-level-appropriate context has increased dramatically since those earlier versions of the grade 7 and grade 10 WASL mathematics tests: whereas the context mismatch was 27% for the grade 7 WASL in mathematics in 1998 and 22% for grade 10 in 1999, those numbers fell to 5% for both grades 7 and 10 in 2001.

Table 22. Grade-Level-Appropriate Context for Mathematics Items

Context in Item Stimulus Materials   Grade 7   Grade 7   Grade 10   Grade 10
                                       1998      2001      1999       2001
Appropriate for grade level
  Number of items                       27        28        28         23
  Percent of items                     0.59      0.67      0.61       0.56
  Percent of total points              0.86      0.75      0.74       0.61
Above grade level
  Number of items                        7         2         1          0
  Percent of items                     0.27      0.05      0.02       0.00
  Percent of total points              0.16      0.03      0.01       0.00
Below grade level
  Number of items                        2         0         8          2
  Percent of items                     0.04      0.00      0.22       0.05
  Percent of total points              0.03      0.00      0.14       0.03
Interrater agreement                   0.91      0.94      0.88       0.94

Table 23. Grade-Level-Appropriate Context across Mathematics Item Formats

                            Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Context in Item            mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4    mc   cr2   cr4
Stimulus Materials
Appropriate for grade level
  Number of items          15    9     3     15    9     4     15    9     4     13    7     3
  Percent of items        0.50  0.75  0.75  0.58  0.75  1.00  0.50  0.75  1.00  0.50  0.64  0.75
  Percent of total points 0.21  0.26  0.17  0.23  0.28  0.25  0.21  0.26  0.23  0.20  0.22  0.19
Above grade level
  Number of items           5    1     1      2    0     0      1    0     0      0    0     0
  Percent of items        0.17  0.08  0.25  0.08  0.00  0.00  0.03  0.00  0.00  0.00  0.00  0.00
  Percent of total points 0.07  0.03  0.06  0.03  0.00  0.00  0.03  0.00  0.00  0.00  0.00  0.00
Below grade level
  Number of items           2    0     0      0    0     0      6    2     0      2    0     0
  Percent of items        0.07  0.00  0.00  0.00  0.00  0.00  0.20  0.16  0.00  0.08  0.00  0.00
  Percent of total points 0.03  0.00  0.00  0.00  0.00  0.00  0.09  0.06  0.00  0.00  0.00  0.00

45

The improved item designs seem to relate to the application of the grade 4 study findings to the item development process and the incorporation of these newly developed items into the 2001 WASL in mathematics.

Recommendation: Continue to study the WASL test designs. Use ongoing expert review to inform continued improvement.

Results of panelists' judgments of the familiarity of item contexts are presented in Table 24 and Table 25. In general, more items on the grade 7 WASL in mathematics were judged to present unfamiliar contexts, with 15% of the items in both years being identified as assuming specialized knowledge.

Table 24. Context of Mathematics Item Stimulus Materials

Item Context                                         Grade 7   Grade 7   Grade 10  Grade 10
                                                     1998      2001      1999      2001
Use familiar background or adequately explained
  Number of items                                    29        24        35        20
  Percent of items                                   0.63      0.59      0.76      0.49
  Percent of total points                            0.69      0.62      0.79      0.48
Assume specialized knowledge that is not explained
  Number of items                                    7         6         2         5
  Percent of items                                   0.15      0.15      0.04      0.12
  Percent of total points                            0.14      0.20      0.07      0.16
Interrater agreement                                 0.85      0.89      0.90      0.87

Table 25. Background Knowledge of Mathematics Item Stimulus Materials

                                 Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Item Context                     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4
Uses familiar background
  Number of items                16   10   3       14   7    3       21   11   3       13   5    2
  Percent of items of this type  0.53 0.83 0.75    0.54 0.58 0.75    0.70 0.92 0.75    0.50 0.45 0.50
  Percent of total points        0.23 0.29 0.17    0.22 0.22 0.18    0.30 0.31 0.17    0.20 0.16 0.13
Assumes specialized knowledge
  Number of items                6    0    1       3    2    1       1    0    1       2    2    1
  Percent of items               0.20 0.00 0.25    0.16 0.17 0.25    0.03 0.00 0.25    0.08 0.18 0.25
  Percent of total points        0.09 0.00 0.06    0.05 0.06 0.06    0.06 0.00 0.06    0.03 0.06 0.06

Recommendation: Screen item context to determine grade-level appropriateness.

Panelists were asked to judge the appropriateness of the graphic formats presented in items. Table 26 and Table 27 summarize these ratings. Most graphic formats were judged to be grade-level-appropriate. More of the constructed-response items on the grade 7 and grade 10 tests in 2001 required the use and interpretation of graphs and diagrams. On the grade 7 WASL in mathematics, the use of graphic formats increased from one-half to three-quarters of the constructed-response items between the 1998 and 2001 tests. On the grade 10 test, constructed-response items that required the use of graphic formats increased from one-third of the items in 1999 to one-half of the items in 2001.

Table 26. Use of Representational Graphics in Mathematics Items

Graphic Formats                                      Grade 7   Grade 7   Grade 10  Grade 10
                                                     1998      2001      1999      2001
Are used in the item stimulus materials
  Number of items                                    23        21        24        21
  Percent of items                                   0.50      0.50      0.52      0.52
  Percent of total points                            0.53      0.57      0.49      0.48
Are supplemental only, not required to answer the question
  Number of items                                    3         3         6         2
  Percent of items                                   0.07      0.07      0.13      0.05
  Percent of total points                            0.06      0.05      0.19      0.08
Are required to answer the question
  Number of items                                    20        18        18        19
  Percent of items                                   0.43      0.43      0.39      0.46
  Percent of total points                            0.47      0.52      0.37      0.41
Interrater agreement                                 0.97      0.94      0.91      0.97

Table 27. Representational Graphic Formats by Mathematics Item Formats

                                 Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Representational Formats         mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4
Are used in the item stimulus materials
  Number of items                5    5    3       11   7    3       15   6    1       13   7    1
  Percent of items of this type  0.50 0.42 0.75    0.41 0.58 0.75    0.50 0.50 0.25    0.50 0.63 0.25
  Percent of total points        0.21 0.14 0.17    0.17 0.22 0.18    0.21 0.17 0.06    0.20 0.22 0.06
Are supplemental only, not required to answer the question
  Number of items                2    1    0       3    0    0       3    1    2       1    0    1
  Percent of items of this type  0.07 0.08 0.00    0.12 0.00 0.00    0.10 0.08 0.50    0.03 0.00 0.25
  Percent of total points        0.03 0.03 0.00    0.05 0.00 0.00    0.04 0.03 0.11    0.02 0.00 0.06
Required to answer the item
  Number of items                13   4    3       12   5    1       7    8    3       12   7    0
  Percent of items of this type  0.19 0.11 0.17    0.12 0.22 0.18    0.17 0.14 0.06    0.19 0.22 0.00
  Percent of possible points     0.47 0.56 0.31    0.73 0.37 0.38    0.58 0.53 0.47    0.56 0.31 0.73

Table 29 presents the different graphic formats used within the items. More line graphs are used on the grade 10 tests than on the grade 7 tests. The grade 7 tests appear to use more pictures/drawings to represent information. The percentage of items using pictures/drawings increased from 28% of the items in 1998 to 39% of the items on the 2001 grade 7 WASL in mathematics.

Panelists were asked to judge the appropriateness of graphics presented in items. Table 28 shows that the most recent tests at both grades 7 and 10 present mostly grade-level-appropriate graphics.

Table 28. Grade-Level-Appropriate Representational Graphic Formats

Graphic Formats                            Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Above grade level
  Number of items                          4         1         0         0
  Percent of items                         0.09      0.02      0.00      0.00
  Percent of total points                  0.06      0.06      0.00      0.00
Below grade level
  Number of items                          0         0         2         0
  Percent of items                         0.00      0.00      0.04      0.00
  Percent of total points                  0.00      0.00      0.03      0.00
Interrater agreement                       0.96      0.97      0.97      0.96

Panelists identified the types of graphic formats presented in items. Table 29 describes the types of format. The 2001 grade 10 test presented fewer items with graphics (22) than did the 2001 grade 7 test (26). On the 2001 grade 7 test, 14 items presented pictures or drawings, 5 presented tables, and 4 presented bar or line graphs. On the 2001 grade 10 test, 8 items presented pictures or drawings, 5 presented tables, and 6 presented bar or line graphs.

Table 29. Type of Graphic Format Used in Mathematics Item Stimulus Materials

Type of Representational Format            Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Table
  Number of items                          6         5         0         5
  Percent of items                         0.13      0.12      0.00      0.12
  Percent of total points                  0.16      0.17      0.00      0.13
Bar graph
  Number of items                          0         2         2         1
  Percent of items                         0.00      0.05      0.04      0.02
  Percent of total points                  0.00      0.03      0.04      0.02
Line graph
  Number of items                          3         2         9         5
  Percent of items                         0.07      0.05      0.20      0.12
  Percent of total points                  0.06      0.09      0.22      0.11
Picture / Drawing
  Number of items                          14        14        11        8
  Percent of items                         0.30      0.33      0.24      0.20
  Percent of total points                  0.31      0.38      0.27      0.23
Other
  Number of items                          7         3         3         4
  Percent of items                         0.15      0.07      0.07      0.10
  Percent of total points                  0.10      0.11      0.06      0.09

"Other" types of representational format:
  Grade 7 1998: scatter plots(1), pictographs, coordinate grids
  Grade 7 2001: grids, coordinate graphs
  Grade 10 1999: specialized bar graph, fill-in-the-blank card, coordinate grid
  Grade 10 2001: coordinate graph, contour map, Venn diagram, scatter plot

(1) Several raters expressed concern in their written comments about whether scatter plots are appropriate for grade 7 students.

Tables 30 – 32 present the panel members’ ratings of the content and information load in the various graphic formats. Panel members were nearly unanimous in their judgments (interrater agreement of 0.91 – 0.95) that many graphics (17% – 22%) on the grades 7 and 10 WASL in mathematics required the use of specialized background knowledge.
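The report does not specify the formula behind the interrater agreement figures cited throughout these tables. The sketch below assumes one common choice, mean pairwise percent agreement, purely for illustration; the rater data are hypothetical.

```python
# Mean pairwise percent agreement: for every pair of raters, compute the
# share of items on which both gave the same category, then average.
from itertools import combinations

def pairwise_agreement(ratings):
    """ratings: one list per rater, each holding a category per item."""
    pairs = list(combinations(ratings, 2))
    per_pair = [
        sum(a == b for a, b in zip(r1, r2)) / len(r1) for r1, r2 in pairs
    ]
    return sum(per_pair) / len(pairs)

# Three hypothetical raters classifying five items as 1 (flagged) or 0.
print(pairwise_agreement([[1, 0, 1, 1, 0],
                          [1, 0, 1, 0, 0],
                          [1, 1, 1, 1, 0]]))  # ~0.73
```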

Table 30. Information Content within Graphic Formats in Mathematics Items

Content in Graphic Format                  Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Extraneous information
  Number of items                          2         2         1         0
  Percent of items                         0.04      0.05      0.02      0.00
  Percent of total points                  0.04      0.09      0.01      0.00
  Interrater agreement                     0.96      0.96      0.99      0.96
Specialized background knowledge
  Number of items                          8         8         10        9
  Percent of items                         0.17      0.19      0.22      0.22
  Percent of total points                  0.13      0.18      0.17      0.17
  Interrater agreement                     0.92      0.95      0.91      0.93

Table 31. Information Load in Graphic Formats in Mathematics Items

Information Load in Graphic Format         Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Appropriate for grade level
  Number of items                          18        20        24        20
  Percent of items                         0.39      0.48      0.52      0.49
  Percent of total points                  0.64      0.54      0.56      0.55
Excessive for grade level
  Number of items                          5         1         0         0
  Percent of items                         0.11      0.02      0.00      0.00
  Percent of total points                  0.07      0.03      0.00      0.00
Interrater agreement                       0.90      0.95      1.00      1.00

Table 32. Information Load in Graphic Formats across Mathematics Item Formats

                                 Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Information Load in Graphic
Format                           mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4
Appropriate for grade level
  Number of items                23   5    3       11   6    3       15   6    3       11   6    3
  Percent of items of this type  0.77 0.42 0.75    0.42 0.50 0.75    0.50 0.50 0.75    0.42 0.55 0.75
  Percent of total points        0.33 0.14 0.17    0.17 0.18 0.18    0.21 0.17 0.17    0.17 0.19 0.19
Excessive for grade level
  Number of items                5    0    0       0    1    0       0    0    0       0    0    0
  Percent of items of this type  0.17 0.00 0.00    0.00 0.08 0.00    0.00 0.00 0.00    0.00 0.00 0.00
  Percent of total points        0.07 0.00 0.00    0.00 0.03 0.00    0.00 0.00 0.00    0.00 0.00 0.00

Recommendations: Consider whether the proportion of items that use graphic formats should be balanced across grade levels. Screen for information load in the various graphic formats and confirm appropriateness with student "think-alouds."

When making judgments about the item directions, panelists judged that many grade 7 items had insufficient scaffolding. The ratings in Table 33 and Table 34 show that, although the number of items with insufficient scaffolding decreased from 1998 to 2001 (from 37% to 12%), this source of challenge was greater for grade 7 than for grade 10. Moreover, the item formats most associated with this insufficient scaffolding were multiple-choice items and the 2-point constructed-response items.

Table 33. Degree of Scaffolding within Mathematics Item Directions

Scaffolding of Item Directions             Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Appropriate for grade level
  Number of items                          29        36        43        38
  Percent of items                         0.63      0.86      0.93      0.93
  Percent of total points                  0.64      0.85      0.93      0.93
Excessive for grade level
  Number of items                          0         1         0         1
  Percent of items                         0.00      0.02      0.00      0.02
  Percent of total points                  0.00      0.03      0.00      0.02
Insufficient for grade level
  Number of items                          17        5         3         1
  Percent of items                         0.37      0.12      0.07      0.02
  Percent of total points                  0.36      0.12      0.04      0.06
Interrater agreement                       0.79      0.89      0.91      0.92

Table 34. Degree of Scaffolding in Mathematics Item Directions across Item Formats

                                 Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Scaffolding of Item Directions   mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4
Appropriate for grade level
  Number of items                19   7    3       25   7    4       29   10   4       25   11   3
  Percent of items of this type  0.63 0.23 0.75    0.96 0.58 0.75    0.96 0.83 1.00    0.96 1.00 0.75
  Percent of total points        0.27 0.20 0.17    0.38 0.22 0.25    0.27 0.29 0.23    0.39 0.34 0.19
Excessive for grade level
  Number of items                0    0    0       0    1    0       0    0    0       1    0    0
  Percent of items of this type  0.00 0.00 0.00    0.00 0.08 0.00    0.00 0.00 0.00    0.04 0.00 0.00
  Percent of total points        0.00 0.00 0.00    0.00 0.03 0.00    0.00 0.00 0.00    0.02 0.00 0.00
Insufficient for grade level
  Number of items                11   5    1       2    3    0       1    2    0       0    0    1
  Percent of items of this type  0.37 0.42 0.25    0.08 0.25 0.00    0.03 0.17 0.00    0.00 0.00 0.25
  Percent of total points        0.16 0.14 0.06    0.03 0.09 0.00    0.01 0.06 0.00    0.00 0.00 0.00

Tables 35 – 38 examine the information load and scaffolding within item stimulus materials. On the grade 7 tests, the information load improved from 1998 to 2001 but still remained high on nearly one out of eight items, which accounted for 17% of a student’s score. The information load on the grade 10 tests was judged too high on only two items, or 6% of a student’s score on the 2001 test.

Table 35. Information Load in Mathematics Item Stimulus Materials

Item Stimulus Materials                    Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Contain extraneous information
  Number of items                          4         1         0         2
  Percent of items                         0.09      0.04      0.00      0.05
  Percent of total points                  0.07      0.02      0.00      0.03
  Interrater agreement                     0.80      0.91      0.99      0.89
Inappropriate information load for grade level
  Number of items                          9         5         0         2
  Percent of items                         0.20      0.12      0.00      0.05
  Percent of total points                  0.19      0.17      0.00      0.03
  Interrater agreement                     0.88      0.91      0.99      0.94

Table 36. Extraneous Information across Mathematics Item Formats

                                 Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Item Stimulus Materials          mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4
Contain extraneous information
  Number of items                3    1    0       1    0    0       0    0    0       2    0    0
  Percent of items of this type  0.10 0.08 0.00    0.04 0.00 0.00    0.00 0.00 0.00    0.08 0.00 0.00
  Percent of total points        0.04 0.00 0.00    0.02 0.00 0.00    0.00 0.00 0.00    0.02 0.00 0.00
Inappropriate information load for grade level
  Number of items                7    1    1       1    3    1       0    0    0       2    0    0
  Percent of items of this type  0.23 0.08 0.25    0.04 0.25 0.25    0.00 0.00 0.00    0.08 0.00 0.00
  Percent of total points        0.10 0.03 0.06    0.02 0.09 0.06    0.00 0.00 0.00    0.02 0.00 0.00

Table 37. Degree of Item Scaffolding in Mathematics Item Stimulus Materials

Degree of Item Scaffolding within          Grade 7   Grade 7   Grade 10  Grade 10
Item Stimulus Materials                    1998      2001      1999      2001
Appropriate for grade level
  Number of items                          28        34        45        39
  Percent of items                         0.61      0.81      0.98      0.95
  Percent of total points                  0.61      0.77      0.99      0.97
Excessive for grade level
  Number of items                          3         0         1         1
  Percent of items                         0.07      0.00      0.02      0.02
  Percent of total points                  0.04      0.00      0.01      0.02
Insufficient for grade level
  Number of items                          15        8         0         1
  Percent of items                         0.33      0.19      0.00      0.02
  Percent of total points                  0.34      0.23      0.00      0.02
Interrater agreement                       0.81      0.89      0.91      0.89

Table 38. Degree of Item Scaffolding in Mathematics Item Stimulus Materials across Item Formats

                                 Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Degree of Item Scaffolding       mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4
Appropriate for grade level
  Number of items                17   9    2       24   7    3       29   12   4       24   11   4
  Percent of items of this type  0.57 0.75 0.50    0.92 0.58 0.75    0.97 1.00 1.00    0.92 1.00 1.00
  Percent of total points        0.24 0.26 0.11    0.37 0.22 0.18    0.41 0.34 0.23    0.38 0.34 0.25
Excessive for grade level
  Number of items                3    0    0       0    0    0       1    0    0       1    0    0
  Percent of items of this type  0.10 0.00 0.00    0.00 0.00 0.00    0.03 0.00 0.00    0.04 0.00 0.00
  Percent of total points        0.04 0.00 0.00    0.00 0.00 0.00    0.01 0.00 0.00    0.02 0.00 0.00
Insufficient for grade level
  Number of items                10   3    2       3    4    1       0    0    0       1    0    0
  Percent of items of this type  0.33 0.25 0.50    0.12 0.33 0.25    0.00 0.00 0.00    0.04 0.00 0.00
  Percent of total points        0.14 0.09 0.11    0.05 0.12 0.06    0.00 0.00 0.00    0.02 0.00 0.00

Recommendation: Screen for scaffolding and information load in item stimulus materials and confirm appropriateness with student "think-alouds."

MATCH OF SCORING RUBRICS WITH PERFORMANCE EXPECTATIONS. The final set of item judgments relates to the scoring rubrics and anchor papers. Table 39 and Table 40 present these ratings. Panelists judged the grade-level appropriateness of the criteria in the rubrics. Twenty-two percent of the items on the 1998 grade 7 assessment were judged to have performance expectations within the scoring rubrics that more closely matched the performance expectations of an EALR at a higher grade level. In contrast, 26% of the items on the 1999 grade 10 assessment were judged to have performance expectations within the scoring rubrics that more closely matched the performance expectations of an EALR at a lower grade level. Although the numbers decreased on the 2001 test administration, the percentages of items with criteria for higher or lower grade levels are still 15% for grade 7 and 20% for grade 10.

Table 39. Performance Expectations within Scoring Rubrics in Mathematics

Performance Expectations                   Grade 7   Grade 7   Grade 10  Grade 10
Match Benchmarks                           1998      2001      1999      2001
Higher grade level
  Number of items                          10        6         0         0
  Percent of items                         0.22      0.09      0.00      0.00
  Percent of total points                  0.21      0.12      0.00      0.00
Lower grade level
  Number of items                          2         1         12        9
  Percent of items                         0.03      0.02      0.17      0.14
  Percent of total points                  0.04      0.02      0.23      0.15

Table 40. Performance Expectations in Scoring Rubrics in Mathematics across Item Formats

                                 Grade 7 1998      Grade 7 2001      Grade 10 1999     Grade 10 2001
Performance Expectations
Match Benchmarks                 mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4     mc   cr2  cr4
Higher grade level
  Number of items                7    2    1       4    2    0       0    0    0       0    0    0
  Percent of items of this type  0.24 0.17 0.25    0.15 0.19 0.00    0.00 0.00 0.00    0.00 0.00 0.00
  Percent of total points        0.10 0.06 0.06    0.06 0.06 0.00    0.00 0.00 0.00    0.00 0.00 0.00
Lower grade level
  Number of items                1    1    0       1    0    0       8    4    0       9    0    0
  Percent of items of this type  0.03 0.08 0.00    0.04 0.00 0.00    0.27 0.33 0.00    0.34 0.00 0.00
  Percent of total points        0.01 0.03 0.00    0.02 0.00 0.00    0.11 0.11 0.00    0.14 0.00 0.00

As shown in Table 41 and Table 42, panelists judged the match between the performance levels in the rubric and the anchor papers and the performance levels of the standard to which the item had been linked. The ratings in Table 41 indicate that rubrics at both grade levels are strong matches to the performance expectations in the linked standard. In Table 42, some anchor papers were judged not to match their intended score point, but most anchor papers did.

Recommendation: Vertically align scoring criteria across grade levels.

Table 41. Match of Item Rubric with Performance Expectation of Primary EALR in Mathematics

Rubric                                     Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Match all or part of the performance stated within the identified EALR-element-benchmark
  Number of items                          12        13        15        13
  Percent of items                         0.26      0.31      0.33      0.20
  Percent of total points                  0.43      0.55      0.51      0.53
Require more than the performance stated within the EALR-element-benchmark
  Number of items                          3         1         1         0
  Percent of items                         0.07      0.02      0.02      0.00
  Percent of total points                  0.09      0.03      0.06      0.00
Interrater agreement                       0.67      0.84      0.77      0.69

Table 42. Relationship between Anchor Papers and Rubrics in Mathematics

Anchor Papers                              Grade 7   Grade 7   Grade 10  Grade 10
                                           1998      2001      1999      2001
Match the rubric
  Number of items                          9         11        7         11
  Percent of items of this type            0.20      0.27      0.15      0.27
Match part of the rubric
  Number of items                          6         4         9         4
  Percent of items of this type            0.13      0.10      0.20      0.10
  Interrater agreement                     0.72      0.93      0.84      0.80
Do not match score point "0"
  Number of items                          1         1         5         2
  Percent of items of this type            0.02      0.02      0.11      0.05
  Interrater agreement                     0.78      0.98      0.88      0.90
Do not match score point "1"
  Number of items                          3         3         6         1
  Percent of items of this type            0.07      0.07      0.13      0.02
  Interrater agreement                     0.69      1.00      0.92      0.98
Do not match score point "2"
  Number of items                          1         1         0         1
  Percent of items of this type            0.02      0.02      0.00      0.02
  Interrater agreement                     0.81      0.95      1.00      0.96
Do not match score point "3"
  Number of items                          1         0         0         1
  Percent of items of this type            0.02      0.00      0.00      0.02
  Interrater agreement                     1.00      1.00      1.00      0.98
Do not match score point "4"
  Number of items                          0         0         0         0
  Percent of items of this type            0.00      0.00      0.00      0.00
  Interrater agreement                     1.00      1.00      1.00      1.00

KEY FINDINGS AND RECOMMENDATIONS

These analyses of the alignment data point to aspects of the test and item development processes that OSPI should continue to study and document. Redesigned test maps that match items to EALRs, elements, targets, and benchmarks, rather than strands, would provide clear documentation of the balance of evidence that is collected across standards and grades. Detailed examinations of item features identified as problematic by the Alignment Panel would further focus the item development process on specific characteristics that affect grade appropriateness. A deeper comparison of the item features flagged by panelists with the item specifications and commentary on individual secure items might allow refinement and elaboration of the specifications. Furthermore, cognitive analyses of think-alouds by small samples of student populations on types of items could illuminate problems related to scaffolding or information load. These alignment data may suggest further empirical investigations of the difficulties of items and clusters of items flagged as problematic for the intended grade level by the panelists. OSPI should continue to monitor the alignment of all WASL mathematics tests through regular reviews designed to ensure the implementation of the OSPI recommendations developed after careful study of these analyses.

In addition to classifying WASL mathematics items according to the item characteristics described above, panelists were encouraged to provide written feedback on individual items, when appropriate. These comments were quite specific and are more appropriately linked at the item level, rather than the test level. A complete set of item classifications has been provided to OSPI in a separate "secure" document.


CROSS-PANEL ANALYSES: RELATIONSHIP OF ITEM CHARACTERISTICS TO ITEM DIFFICULTY

The alignment studies described in this report reviewed grade 7 and grade 10 WASL items in terms of a number of key characteristics, such as match to the EALRs, nature of the knowledge required, and scaffolding. In these cross-panel analyses, the coding information from the alignment study is combined with information about how students actually performed on the test questions.

Students' performance on the items is represented by the item difficulty, or p-value, and the item mean. The item p-value is the proportion of the possible points that students earned on the item. The mean item response is the average number of points students earned on an item. For multiple-choice items, the p-value and mean item response are the same. For 2-point and 4-point constructed-response items, the p-value is the mean item response divided by 2 or 4, respectively. For example, a 4-point item may have a mean response of 2, which corresponds to a p-value of 0.5. P-values make it possible to compare items on which students can earn varying numbers of points (1, 2, and 4).

The analyses reported in this section are meant to be exploratory and descriptive and must be interpreted carefully, for several reasons. The items involved come from only two test forms and thus represent a small sample of items. In addition, a large number of classification variables with multiple classification categories are used in coding the items. Thus, for a number of comparisons, there are very few or, in some cases, no items coded in a category. Many different comparisons are made, and the sheer number of comparisons is likely to include some that look statistically significant by chance. Finally, it is important to remember that the items vary on multiple characteristics. Classifying items into two categories, such as "information load is appropriate" compared to "information load is not appropriate," may give the erroneous impression that the items vary only on the dimension of information load appropriateness. The items in these two categories vary on many other features, and these other features may be systematically correlated with, and thus confound, the comparison based on the single dimension of information load appropriateness.

The items from the two grade 7 tests are analyzed together, as are the items from the two grade 10 tests. The items are coded on a large number of characteristics, but analyses are reported only if they reflect a substantively important issue (e.g., information load) or if differences in the comparisons are statistically significant or large. The key results of the grade 7 and grade 10 analyses are reported below. The analyses involved comparing the p-values for items in different classification categories. The p-values were compared using independent-sample t-tests or one-way analyses of variance when a sufficient number of items per category were available. Mean differences in p-value are used to describe general trends. Analyses that reveal statistically significant differences are identified.
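As a concrete illustration of the definitions above, the following minimal sketch (with hypothetical student scores, not the report's actual analysis code) computes the mean item response and p-value for a single item.

```python
def item_stats(scores, max_points):
    """scores: points earned by each student on one item;
    max_points: 1 for multiple-choice, 2 or 4 for constructed-response."""
    mean_response = sum(scores) / len(scores)   # average points earned
    p_value = mean_response / max_points        # proportion of possible points
    return mean_response, p_value

# Worked example from the text: a 4-point item with a mean response of 2
# has a p-value of 0.5.
mean, p = item_stats([1, 2, 3, 2], max_points=4)
print(mean, p)  # 2.0 0.5
```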
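The two comparison methods named above can likewise be sketched briefly. The p-values below are hypothetical; the Welch form of the t-test (equal variances not assumed) matches the footnote that accompanies the starred tables in this section, and the one-way ANOVA corresponds to three-category codes such as appropriate/excessive/insufficient scaffolding.

```python
from scipy import stats

# Hypothetical item p-values grouped by a binary panel code.
appropriate = [0.50, 0.43, 0.61, 0.47, 0.55]
not_appropriate = [0.21, 0.28, 0.17, 0.25]

# Welch's independent-samples t-test (equal variances not assumed).
t, p = stats.ttest_ind(appropriate, not_appropriate, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.4f}")

# One-way ANOVA across three hypothetical scaffolding categories.
excessive = [0.39, 0.30, 0.48]
insufficient = [0.45, 0.37, 0.29, 0.41]
f, p = stats.f_oneway(appropriate, excessive, insufficient)
print(f"ANOVA F = {f:.2f}, p = {p:.4f}")
```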

GRADE 7 RESULTS

• Table 43 presents the number, mean p-value, and mean score for the multiple-choice, 2-point constructed-response, and 4-point constructed-response items in the grade 7 mathematics test.

Table 43. Grade 7: Overall Question Type Characteristics

                        p-Value         Mean Response
                 N      Mean    SD      Mean    SD
Multiple-choice  57     0.49    0.16    0.49    0.16
2-point c.r.     23     0.39    0.15    0.78    0.31
4-point c.r.     8      0.47    0.16    1.41    0.47



• Multiple-choice items that required computation were harder than those that did not (statistically significant); also, 4-point constructed-response items involving computation were harder than those that did not involve computation (Table 44).

Table 44. Grade 7: Item May Require Computation

                                     p-Value         Mean Response
Question Type      Choice    N       Mean    SD      Mean    SD
Multiple-choice*   Yes       31      0.45    0.16    0.45    0.16
                   No        26      0.53    0.16    0.53    0.16
2-point c.r.       Yes       17      0.39    0.15    0.78    0.29
                   No        6       0.39    0.18    0.77    0.37
4-point c.r.       Yes       5       0.42    0.11    1.25    0.33
                   No        3       0.56    0.20    1.67    0.61

*Sig. < .05, equal variances not assumed.



• Items were coded to indicate whether they required strategic, procedural, schematic, or declarative knowledge. There were 13 combinations of these types of knowledge found in the multiple-choice items, 8 combinations for the 2-point constructed-response items, and 4 combinations for the 4-point constructed-response items (Table 45).

Table 45. Grade 7: Knowledge Content Combinations
Content codes, in the order SPCD: S = Strategic, P = Procedural, C = Schematic, D = Declarative

                                              p-Value         Mean Response
Question Type    Combination  N    Percent    Mean    SD      Mean    SD
Multiple-choice  0000         1    1.75       0.39    --      0.39    --
                 0001         9    15.79      0.54    0.13    0.54    0.13
                 0010         1    1.75       0.59    --      0.59    --
                 0011         2    3.51       0.62    0.33    0.62    0.33
                 0100         5    8.77       0.63    0.10    0.63    0.10
                 0101         22   38.60      0.43    0.17    0.43    0.17
                 0111         5    8.77       0.48    0.12    0.48    0.12
                 1000         1    1.75       0.19    --      0.19    --
                 1001         1    1.75       0.58    --      0.58    --
                 1011         1    1.75       0.63    --      0.63    --
                 1100         2    3.51       0.52    0.21    0.52    0.21
                 1101         6    10.53      0.47    0.15    0.47    0.15
                 1111         1    1.75       0.37    --      0.37    --
                 Total        57   100.00     0.49    0.16    0.49    0.16
2-point c.r.     0010         2    8.70       0.50    0.16    1.01    0.32
                 0101         5    21.74      0.27    0.18    0.54    0.36
                 0110         1    4.35       0.32    --      0.63    --
                 0111         10   43.48      0.39    0.14    0.77    0.29
                 1001         1    4.35       0.48    --      0.96    --
                 1101         2    8.70       0.51    0.01    1.02    0.02
                 1110         1    4.35       0.38    --      0.76    --
                 1111         1    4.35       0.56    --      1.11    --
                 Total        23   100.00     0.39    0.15    0.78    0.31
4-point c.r.     0010         2    25.00      0.49    0.24    1.47    0.71
                 0110         1    12.50      0.49    --      1.46    --
                 0111         4    50.00      0.50    0.15    1.51    0.45
                 1101         1    12.50      0.27    --      0.81    --
                 Total        8    100.00     0.47    0.16    1.41    0.47
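A minimal sketch, with invented codes, of how the four binary SPCD knowledge codes combine into the pattern strings tallied in Table 45; the report does not describe its tabulation tooling, so this is purely illustrative.

```python
from collections import Counter

# item_id -> (S, P, C, D) binary codes; the values here are hypothetical.
codes = {1: (0, 1, 0, 1), 2: (0, 0, 0, 1), 3: (0, 1, 0, 1)}

# Join each item's four codes into a pattern string and count items per pattern.
patterns = Counter("".join(str(bit) for bit in spcd) for spcd in codes.values())
for pattern, n in sorted(patterns.items()):
    print(pattern, n)   # e.g., "0101 2": no strategic, procedural, no schematic, declarative
```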



• Items requiring procedural knowledge were slightly easier than items coded as not requiring procedural knowledge (Table 46).

Table 46. Grade 7: Item Uses Procedural Knowledge

                                     p-Value         Mean Response
Question Type      Choice    N       Mean    SD      Mean    SD
Multiple-choice    Yes       16      0.53    0.16    0.53    0.16
                   No        41      0.47    0.16    0.47    0.16
2-point c.r.       Yes       3       0.50    0.11    0.99    0.23
                   No        20      0.37    0.15    0.75    0.31
4-point c.r.       Yes       2       0.49    0.24    1.47    0.71
                   No        6       0.46    0.15    1.39    0.45



• Items with appropriate information load were easier than items coded as having inappropriate information load (Table 47). This difference was statistically significant for 2-point constructed-response items.

Table 47. Grade 7: Information Load Is Appropriate

                                          p-Value         Mean Response
Question Type      Choice          N      Mean    SD      Mean    SD
Multiple-choice    Appropriate     49     0.50    0.16    0.50    0.16
                   Not appropriate 8      0.39    0.15    0.39    0.15
2-point c.r.*      Appropriate     19     0.43    0.14    0.85    0.28
                   Not appropriate 4      0.21    0.07    0.42    0.14
4-point c.r.       Appropriate     6      0.48    0.17    1.44    0.50
                   Not appropriate 2      0.44    0.16    1.31    0.49

*Sig. < .05, equal variances not assumed.



• Items for which the scaffolding was appropriate were easier than items coded as having excessive or insufficient scaffolding. This difference was statistically significant for 2-point constructed-response items (Table 48).

Table 48. Grade 7: Scaffolding Is Appropriate

                                        p-Value         Mean Response
Question Type      Choice        N      Mean    SD      Mean    SD
Multiple-choice    Appropriate   41     0.50    0.16    0.50    0.16
                   Excessive     3      0.39    0.30    0.39    0.30
                   Insufficient  13     0.45    0.14    0.45    0.14
2-point c.r.*      Appropriate   16     0.43    0.14    0.86    0.27
                   Excessive     0      --      --      --      --
                   Insufficient  7      0.30    0.16    0.59    0.31
4-point c.r.       Appropriate   5      0.49    0.18    1.47    0.55
                   Excessive     0      --      --      --      --
                   Insufficient  3      0.43    0.12    1.30    0.35

*Sig. < .05, equal variances not assumed.



• The difficulty of multiple-choice items was the same, regardless of whether or not they were coded as having specialized math vocabulary. There were few constructed-response items coded as having no specialized math vocabulary. The 2-point and 4-point constructed-response items coded as having specialized math vocabulary were harder than those coded as having no such vocabulary. The difference for the 2-point constructed-response items was statistically significant (Table 49).

Table 49. Grade 7: Is Math Vocabulary Specialized?

                                                        p-Value         Mean Response
Question Type      Choice                        N      Mean    SD      Mean    SD
Multiple-choice    Specialized math vocabulary   52     0.49    0.17    0.49    0.17
                   No specialized vocabulary     5      0.49    0.14    0.49    0.14
2-point c.r.*      Specialized math vocabulary   20     0.36    0.15    0.73    0.29
                   No specialized vocabulary     3      0.56    0.06    1.11    0.12
4-point c.r.       Specialized math vocabulary   7      0.44    0.15    1.33    0.44
                   No specialized vocabulary     1      0.66    --      1.97    --

*Sig. < .05, equal variances not assumed.



• Items coded as having appropriate scaffolding within the item directions were easier than items coded as having insufficient scaffolding. The differences were statistically significant for the multiple-choice and 2-point constructed-response items (Table 50).

Table 50. Grade 7: Is Scaffolding of Problem within the Item Directions Appropriate?

                                        p-Value         Mean Response
Question Type      Choice        N      Mean    SD      Mean    SD
Multiple-choice*   Appropriate   44     0.52    0.16    0.52    0.16
                   Excessive     0      --      --      --      --
                   Insufficient  13     0.37    0.12    0.37    0.12
2-point c.r.*      Appropriate   14     0.44    0.15    0.87    0.29
                   Excessive     1      0.59    --      1.17    --
                   Insufficient  8      0.29    0.11    0.57    0.21
4-point c.r.       Appropriate   7      0.49    0.16    1.46    0.48
                   Excessive     0      --      --      --      --
                   Insufficient  1      0.35    --      1.05    --

*Sig. < .05, equal variances not assumed.



• Multiple-choice items coded as not having performance matched to the EALRs were easier than those with matches to the EALRs (statistically significant). Constructed-response items with performance that matched the EALRs were easier than constructed-response items coded as having no EALR matches (Table 51).

Table 51. Grade 7: Performance Required Matches the EALR

                                               p-Value         Mean Response
Question Type      Choice               N      Mean    SD      Mean    SD
Multiple-choice*   No matches to EALR   33     0.52    0.17    0.52    0.17
                   Matches the EALR     24     0.43    0.14    0.43    0.14
2-point c.r.       No matches to EALR   9      0.37    0.17    0.73    0.35
                   Matches the EALR     14     0.40    0.14    0.81    0.29
4-point c.r.       No matches to EALR   3      0.39    0.14    1.17    0.43
                   Matches the EALR     5      0.52    0.16    1.55    0.47

*Sig. < .05, equal variances not assumed.



• Constructed-response items with anchor papers that matched the rubrics were easier than those that only partially matched the rubrics (Table 52).

Table 52. Grade 7: Do the Anchor Papers Match the Rubrics?

                                                  p-Value         Mean Response
Question Type    Choice                    N      Mean    SD      Mean    SD
2-point c.r.     Yes                       14     0.43    0.15    0.87    0.30
                 Partial match to rubrics  8      0.34    0.13    0.68    0.26
4-point c.r.     Yes                       6      0.49    0.16    1.47    0.47
                 Partial match to rubrics  2      0.41    0.20    1.23    0.59

GRADE 10 RESULTS

• Table 53 presents the number, mean p-value, and mean score for the multiple-choice, 2-point constructed-response, and 4-point constructed-response items in the grade 10 mathematics test.

Table 53. Grade 10: Overall Question Type Characteristics

                        p-Value         Mean Response
                 N      Mean    SD      Mean    SD
Multiple-choice  56     0.45    0.16    0.45    0.16
2-point c.r.     23     0.41    0.16    0.82    0.33
4-point c.r.     8      0.55    0.16    1.65    0.47



• Items were coded to indicate whether they required strategic, procedural, schematic, or declarative knowledge. There were 11 combinations of these types of knowledge found in the multiple-choice items, 9 combinations for the 2-point constructed-response items, and 6 combinations for the 4-point constructed-response items (Table 54).

Table 54. Grade 10: Knowledge Content Combinations
Content codes, in the order SPCD: S = Strategic, P = Procedural, C = Schematic, D = Declarative

                                              p-Value         Mean Response
Question Type    Combination  N    Percent    Mean    SD      Mean    SD
Multiple-choice  0000         1    1.79       0.53    --      0.53    --
                 0001         4    7.14       0.49    0.09    0.49    0.09
                 0010         1    1.79       0.56    --      0.56    --
                 0100         16   28.57      0.47    0.19    0.47    0.19
                 0101         18   32.14      0.48    0.12    0.48    0.12
                 0110         4    7.14       0.45    0.33    0.45    0.33
                 0111         5    8.93       0.32    0.12    0.32    0.12
                 1000         2    3.57       0.42    0.03    0.42    0.03
                 1101         2    3.57       0.39    0.21    0.39    0.21
                 1110         1    1.79       0.29    --      0.29    --
                 1111         2    3.57       0.51    0.11    0.51    0.11
                 Total        56   100.00     0.45    0.16    0.45    0.16
2-point c.r.     0000         1    4.35       0.22    --      0.44    --
                 0001         1    4.35       0.50    --      0.99    --
                 0011         1    4.35       0.52    --      1.04    --
                 0100         2    8.70       0.52    0.16    1.04    0.33
                 0101         6    26.09      0.50    0.13    1.00    0.25
                 0110         4    17.39      0.32    0.16    0.64    0.32
                 0111         5    21.74      0.37    0.22    0.75    0.44
                 1100         1    4.35       0.42    --      0.84    --
                 1101         2    8.70       0.28    0.12    0.56    0.24
                 Total        23   100.00     0.41    0.16    0.82    0.33
4-point c.r.     0001         2    25.00      0.58    0.00    1.75    0.00
                 0011         1    12.50      0.34    --      1.02    --
                 0100         1    12.50      0.54    --      1.62    --
                 0110         1    12.50      0.70    --      2.11    --
                 0111         1    12.50      0.77    --      2.31    --
                 1101         2    25.00      0.44    0.18    1.33    0.53
                 Total        8    100.00     0.55    0.16    1.65    0.47



• Only two multiple-choice items and no constructed-response items were coded as having "Inappropriate Information Load"; thus, no meaningful comparisons could be made (Table 55).

Table 55. Grade 10: Information Load Is Appropriate

                                          p-Value         Mean Response
Question Type      Choice          N      Mean    SD      Mean    SD
Multiple-choice    Appropriate     54     0.45    0.16    0.45    0.16
                   Not appropriate 2      0.54    0.14    0.54    0.14
2-point c.r.       Appropriate     23     0.41    0.16    0.82    0.33
                   Not appropriate 0      --      --      --      --
4-point c.r.       Appropriate     8      0.55    0.16    1.65    0.47
                   Not appropriate 0      --      --      --      --

• Only one multiple-choice item was coded as having insufficient scaffolding, and only two were coded as excessive. No constructed-response items were coded as having inappropriate scaffolding. The small number of items coded as having insufficient or excessive scaffolding made it impossible to make meaningful comparisons based on the degree of scaffolding (Table 56).

Table 56. Grade 10: Scaffolding Is Appropriate

                                        p-Value         Mean Response
Question Type      Choice        N      Mean    SD      Mean    SD
Multiple-choice    Appropriate   53     0.45    0.16    0.45    0.16
                   Excessive     2      0.59    0.14    0.59    0.14
                   Insufficient  1      0.44    --      0.44    --
2-point c.r.       Appropriate   23     0.41    0.16    0.82    0.33
                   Excessive     0      --      --      --      --
                   Insufficient  0      --      --      --      --
4-point c.r.       Appropriate   8      0.55    0.16    1.65    0.47
                   Excessive     0      --      --      --      --
                   Insufficient  0      --      --      --      --



• Multiple-choice items coded as having specialized math vocabulary were harder than those not coded for math vocabulary (statistically significant). There were few constructed-response items coded as having no specialized math vocabulary. The difficulties of the 2-point and 4-point constructed-response items coded as having specialized math vocabulary were virtually identical to those of constructed-response items coded as having no such vocabulary (Table 57).

Table 57. Grade 10: Is Math Vocabulary Specialized?

                                                        p-Value         Mean Response
Question Type      Choice                        N      Mean    SD      Mean    SD
Multiple-choice*   Specialized math vocabulary   51     0.44    0.16    0.44    0.16
                   No specialized vocabulary     5      0.59    0.11    0.59    0.11
2-point c.r.       Specialized math vocabulary   20     0.41    0.18    0.82    0.35
                   No specialized vocabulary     3      0.40    0.03    0.80    0.06
4-point c.r.       Specialized math vocabulary   6      0.55    0.12    1.66    0.36
                   No specialized vocabulary     2      0.54    0.32    1.63    0.96

*Sig. < .05, equal variances not assumed.



• Almost all grade 10 items were found to have appropriate scaffolding of the problem within the item directions (Table 58); thus, differences due to scaffolding could not be examined.

Table 58. Grade 10: Is Scaffolding of Problem within the Item Directions Appropriate?

                                        p-Value         Mean Response
Question Type      Choice        N      Mean    SD      Mean    SD
Multiple-choice    Appropriate   54     0.45    0.16    0.45    0.16
                   Excessive     1      0.55    --      0.55    --
                   Insufficient  1      0.69    --      0.69    --
2-point c.r.       Appropriate   21     0.40    0.17    0.80    0.34
                   Excessive     0      --      --      --      --
                   Insufficient  2      0.51    0.02    1.02    0.04
4-point c.r.       Appropriate   7      0.58    0.14    1.74    0.43
                   Excessive     0      --      --      --      --
                   Insufficient  1      0.34    --      1.02    --



• Multiple-choice items coded as not having performance matched to the EALRs were easier than those with matches to the EALRs (statistically significant). Constructed-response items with performance that matched the EALRs were easier than constructed-response items coded as having no EALR matches (Table 59).

Table 59. Grade 10: Performance Required Matches an EALR

                                               p-Value         Mean Response
Question Type      Choice               N      Mean    SD      Mean    SD
Multiple-choice*   No matches to EALR   37     0.48    0.17    0.48    0.17
                   Matches the EALR     19     0.40    0.13    0.40    0.13
2-point c.r.       No matches to EALR   7      0.37    0.13    0.74    0.27
                   Matches the EALR     16     0.43    0.18    0.85    0.35
4-point c.r.       No matches to EALR   4      0.50    0.21    1.50    0.64
                   Matches the EALR     4      0.60    0.07    1.81    0.21

*Sig. < .05, equal variances not assumed.