On Examinee Choice in Educational Testing - ETS

14 downloads 86 Views 4MB Size Report
Holland, Robert Lukhele, Donald Rubin, and Xiang-bo Wang. In addition, we have ...... In a choice situation, we c:rn obscrvc the distribution of scoresJ,(y), for those ..... this argument more convincing, but it is moot in a measurement task that is.

On Examinee Choice

in Educational

Testing

Howard Wainer and David Thissen

GRE Board Report

No. 91017P

June 1994

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational

Testing

Service,

Princeton,

NJ

08541

******************** Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy. ********************

The Graduate Record Examinations Board dedicated to the principle of equal services, and employment policies EDUCATIONAL TESTING SERVICE, and GRE are registered Copyright

and Educational Testing Service opportunity, and their programs, are guided by that principle.

are

ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, trademarks of Educational Testing Service.

e 1994 by the American Educational Reprinted by permission of the

Research publisher.

Association.

Review of Educational Research Sprihg 1994, Vol. 64, No. 1, pp. 159-195

On Examinee Choice in Educational Testing Iiaw~~d Waincr Educational [email protected] Service David Tlrissen University of North Carolitta at Chapel Ifill Tltrougltout our society,individuals are compared 011srrrnmariesof diverse measures.Oftert tire measuresare rteitJtertltc same for all competitorsttor selectedrandomly front a urtiverseof rrtcasures.It1 fact, coniparisonsortell are based011a mixture of tmzasurcsitr wJricJt tJtecompetitorschoosesome or all of tire elenterttsof tltc rnixturc; applican fi to collegeschoosesome of tile coursestJleytake irr Jiighscltool, wlricltextracurricularactivitiesrlreyparticipate in, and (to some extent) wlticlt errtranceexamstJteytake. Wlrat are tile cottscquetmsof arty seiectiortprocedure, if sucJtclroicesare allowed? How cat1tlte adverseconsequerlces bc nritriruizcd?Tlrisarticle summarizesresults from teststlrat have allowed exarttirteechoiceJleretoforeand provides a basis for a fuller discussionof tlreseissues.

If you allow choice, you will regret it; if you don’t allow choice, you will regret it; whether you allow choice or not, you wilt regret both. --Kierkegaard, 1986, p. 24 Throughout the educational process,decisionsare made using nonrandomly selected data. Admissions to college are decided among individuals whose dossierscontain mixtures of material; one high school student may opt to emphasizecoursesin math and sciencewhereas another may have taken advancedcoursesin French and Spanish. One student might have been editor of the schoolnewspaper,another captain of the football team, and yet a third might have been first violin in the orchestra. All represent commitment and success; how are they to be compared?Is your French better than my calculus?Is sucha comparisonsensible?Admissionsoffices at competitive universities face these problemsall the time; dismissingthem is being blind to reality. Moreover, there is obvious sense in statements like, “I know more physics than you know This report summarizesthe wisdom gathered during the course of researchcarried

on over the past 3 yearsand owespart of its content to contributionsfrom Paul Holland, RobertLukhele, Donald Rubin, and Xiang-boWang. In addition,we have profited from conversations with Lyle Jones,Skip Livingston,Nick Longford, and Rick Morgan. We are pleasedto thank them. Of course, none of these worthy scholarsshouldbe held responsiblefor the opinionsexpressedor any errors that might remain. Last, this report, and most of the researchdescribedherein, was supportedin whole, or in part, by the Graduate Record Board. We gratefully acknowledge thissupport,withoutwhichlittle of thiswouldhavebeenaccomplished. 159

Waincr and Thissen

French.“ Or, “1 am a better runner than you are a swimmer.” Cross-modal comparisonsare not impossible, given that we have some implicit underlying notion of quality. How accurateare suchcomparisons?Can we make them at all when the differencesbetween the individuals are subtle? How do we take into account the difficulty of the accomplishment?Is being an All-State athlctc ;\s distinguishedan accomplishmentas being a Merit Scholarship finalist? How can we understand comparisonslike these? Perhaps we can begin by consideringthe more manageablesituation that manifests itself when the examination scoresof studentswho are to be compared are obtained from test items that the studentshave chosen themselves. Such a situation is likely to occur increasinglyoften in the future, becauseof the greater contemporary emphasis on assessing what are called gertcratiue or comtrucfive processes in learning. To be able to measuresuchprocesses,testingprograms believe they must incorporate constructedresponseitems into their previouslymultiple-choicestandardized exams.It is hoped that large items suchas testlcts, essays,mathcmsticai proofs, experiments, portfolios, or other performance-based tasks are better able to measuredeep understanding,broad analysis, and higller levels of performance than traditional multiple-choice items. Examples of testscurrently using large items are the College Board’s Advanced PIacement Examinations in United States History, European History, United States Government and Politics, Physics,Calculus, and Chemistry. There are many others. When an exam consists,in whole or in part, of constructedresponseitems, it is common practice to allow the examince to choose a subsetof the constructed responsequestionsfrom a larger pool. It issometimesargued that, if choicewere not alIowed, the 1imitationSon domain coverageforced by the small number of items might unfairly affect someexaminees.An alternative to choice wouId be to increase the length of the test; this is not often practical. Another alternative would bc to confine the test questionsto a core curriculum that all valid courses ought to cover. This option may discourage teachers from broadening their coursesbeyond that core. Even in the absenceof attractive alternatives, is allowing examinee choice a sensiblestrategy? Under what conditionscan we ahow choice without compromisingthe fairnessand quality of the test?Wewill notprovide a complete answer to thesequestions.We will illustrate someof the pitfalls associatedwith allowing examinee’choice; we will provide a vocabulary and framework that aids in the clear discussionof the topic; we will outline some experimental stepsthat can tell us whether choice can bc implemented fairly; and we wiI1 provide some experimental evidence that illuminates the topic. We begin with a brief history of examinee choice in testing, in the belief that it is easier to learn from the experiencesof our clever predecessorsthan to try to relive thoseexperiences. A SelectiveHistory

ofChoice

in Exam

We shall confine our attention to somecollege entrance examsusedduring the first half of the 20th century in the United States.The CollegeEntrance Examination Board (CEEB) began testing prospectivecollege studentsat the turn of the century. By 1905, exams were offered in 13 subjects: English, French, German, Greek, Latin, Spanish,mathematics,botany, chemistry, physics,drawing, geography,and history (CoIlege EntranceExaminationBoard, 1905).Most 160

On Examinee Choice in Testing of these examscontained somedegreeof choice. In addition, all of the science

examsused portfolio assessment,as it might be called in modem terminology. For example, on the 1905 Botany exam 37% of the grade was based on the examinee’s laboratory notebook. The remaining 63% was based on a lo-item exam. The examinee was askedto answer7 of those 10 items. This yielded 120 (10 choose7) different examinee-created“forms.” Question 10 (below), of the 1905Botany exam, couldcomplicatethe computation of the number of choice-createdforms: 10. Selectsome botanical topic not included in the questionsabove, and write a brief exposition of it. The gradersfor each testwere identified, with their institutional affiliations. The

chemistry and physicsexams shared the structureof the botany exam. As the CEEB grew more experienced,the structureof its testschanged.The portfolio aspect disappeared by 1913, when the requirement of a teacher’s certification that the studenthad, in fact, completeda lab coursewassubstituted for the student’s notebook. By 1921, this certification was no longer required. The extent to which examineeswere permitted choice varied; see Table 1, in which the number of possible examinee-createdforms are listed for several subjectsand years. The contrastbetweenthe flamboyance of the English exam and the staid German exam is instructive. Only once, in 1925, was any choice allowedon the German exam (“Answer only oneof the following sixquestions”). The lack of choiceseenin the German examis representativeof all of the foreign language exams except Latin. The amount of choice seen in the Physicsand Chemistry examsparallels that for mostexamsthat allowedchoice. The English exam between 1913 and 1925 is unique in terms of the possiblevariation. No equating of these examinee-constructedforms was considered. By 1941, the CEEB offered 14exams,but only 3 (American History, Contemporary Civilization, and Latin) allowed examinee choice. Even among these, choice was sharply limited: l

In the American History exam, there were six essayquestions.Essays1, 2, and 6 were mandatory. There were three questionsabout the American

TABLE 1 Number of possible testforms generatedby txaminee choicepatterns Subject Year

Chemistry

Physics

English

German

1905 1909 1913 1917 1921 1925 1929 1933 1937 1941

54 18 8 252 252 126 20 20 15 1

81 108 144 1,620 216 56 56 10 2 1

64 60 7,260 1,587,600 2,960,100 48 90 24 1 1

1 1 1 1 1 6 1 1 1 1

161

l

l

Constitution (labeled 3A, 4A, and 5A) as well as three parallel questions about the British Constitution (3B,4B, and 5B). The examinee could choose tither the A questionsor the B questions. In the Contcrnporary Civilization exam, there were six essay questions. Questions1-4 and 6 were all mandatory. Question 5 consistedof six shortansweritems out of which the examinee had to answer five. The Latin exam had many parts. In sectionsrequiring translation from Latin to English or vice versa, the examineeoften had the opportunity to pick one passagefrom a pair to translate.

In 1942, the last year of the program, there were fewer exams given; none allowedexamineechoice. Why did the useof choice disappearover the 40 years of this pioneering examination program? Our investigations did not yield a definitive answer, although there are many hints that suggestthat issuesof fairnesspropelled the CEEB toward the test structure on which they eventually settled. Our insight into this was sharpened during the reading of Brigham’s (1934, p. i) remarkable report on “the first major attack on the problem of gradingthe written examination.” This exam required the writing of four essays. There were six topics;Topics 1 and 6 were required of all examinees, and there was a choice offered between Topics2 and 3 and between Topics 4 and 5. The practiceof allowing examinee choice was termed “alternative questions” (p i). Brigham noted (1934, p. 7) that, When alternativequestionsare used,differentexaminationsare in fact set for the various groupselecting the different patterns. The total score reportedto the collegeis to a certainextenta sumof the separateelements, and the mannerin whichthe elementscombinedependson their intercorrelation. This subjectis too complexto be investigatedwith this material. Brigham’s judgment of the difficulty is mirrored by Harold Gulliksen (1950, p. 338), who wrote that, “In general it is impossibleto determine the appropriate adjustmentwithout an inordinate amount of effort. Alternative questionsshould always be avoided.” Both of thesecommentssuggestthat, while adjusting for the effects of examinee choice is possible,it is too difficult to do within an operational context. We suspect that even these careful researchersunderestimated the difficulty of satisfactorily accomplishingsuch an adjustment; their warnings are strongly reminiscentof Fermat’s marginal commentsabout his only just proved theorem. This view wassupportedby Ledyard Tucker (personal communication, February 10, 1993), who said that, “I don’t think that they knew how to deal with choice then. I’m not sure we know how now.” The core of the problem of choice is that, when it is allowed, the examinees’

choicesgeneratewhat can be thoughtof as different formsof the test. These formsmay not be of equaldifficulty.When different formsare administered, standards of good testing practice require that those forms be statistically equated so that individuals who took different forms can be compared fairly. Perhaps,through pretestingand careful test construction, it may be possibleto make the differencesin form difficulty sufficiently small that further equating is not required. As we will show,an unbiasedestimate of the difficulty of any item can only be obtained from a random sample of the examinee population. The 162

On Examinee Choice in Testing

CEEB made no attempt to get sucha sample. Perhapsthat is why Brigham said that, “This subject is too complex to be investigatedwith this material.” Are Choice Items of Equal DifFculty?

We have not yet found a situationin which it wasplausibleto believe that two choicequestionswere of equal difficulty. However, in only one casecould this be shownunequivocally;in operational tests,there are alwaysalternative explanations. These alternatives, although often unlikely, can never be dismissedentirely. To illustrate, we will give three examplesand provide referencesto more. We will then describethe only circumstancewe know of in which the difficulty of choice items was established. This last case will provide motivation for the subsequentdiscussionof how choice can be permitted and testscan still be fair. Example I.- 1968 AP Chembtry Test. Shownin Table 2 is a result reported by Fremer, Jackson, and McPeek (1968) for a form of the College Board’s Advanced Placement Chemistry Test. One group of examinees chose to take Problem 4, and a second group chose Problem 5. While their scoreson the common multiple-choice sectionwere about the same (11.7 vs. 11.2, out of a possible25), their scoreson the choiceproblem were very different (8.2 vs. 2.7, on a lo-point scale). There are several possibleconclusionsto be drawn from this; four among them are: 1. Problem 5 is a good deal more difficult than Problem 4. 2. Small differencesin performanceon the multiple-choicesectiontranslate into much larger differences on the free responsequestions. 3. The proficiency required to do the two problemsis not stronglyrelated to that required to do well on the multiple-choice section. Item 5 is selected by those who are less likely to do well on it. 4. Investigation of the content of the questions,as well as studiesof the dimensionality of the entire test, suggeststhat Conclusion1 is the most credible. This interpretation would suggestthat scoreson these two examinee-createdtest forms ought to have been equated. They were not. Example 2: I989 AP Chemktry Test. In the 1989versionof the AP Chemistry Test much had changed,but there wasstill examineechoice. Figure 1 showsthe expectedscoreon each of two choice items (examineescould opt to answerjust one of thesetwo) asa function of examineeproficiencyasmeasuredby the entire test (Wainer, Wang, & Thissen, 1991). To calculatethesetrace lines, an untestable assumptionabout unobservedexamineeperformancehad to be made; this assumptionwill be discussedin detail later. It will suffice for now to note that, even with the assumptionsthat usually underlie item responsetheory, we also TABLE 2 Average scoreson AP Chemistry 1968

Choice grOUP 1

Choice problem

MC section

4

11.7

8.2

5

163

Wuirrcrcm-i Tlrisscrr

6-

-3

1

-2

2

3

Proficiency

FIGURE 1. A cotnparisorl of tlrc wpcctcd score trace litm for the two choice problem irt Part II, Section B, of the 1989 AP Exartrirratiorain Cl~emistry had to assumethat relativelypoorerperformance on Problem3 by an examinee who otherwiseperformedequivnlcn\tlyto an examineewho choseProblem 2 must be reflected in differential difficulty of the choice problems, not multidimensionality. There is about a one-point advant?ge for an examineewho takesProblem2 and whoseproficiency lies between0 and2 onthisscale.This isan improvement

on the 1968 test, but it is by no meansa trivial difference.To put this in perspective,there are about160pointson the test, and a scoreof about45 is sufficientto obtain collegecredit for completinga one-semesterchemistry course.One point is not a trivial amount. Example 3: 2988 AP hited States History Test. The first two examples indicatedthat thosewho chosethe more difficuIt problemswere placedat a disadvantage. In thisexample,weidentifymorespecificallythe examinees who tendedto choosemore difficultitems.Considerthe resultsshownin Figure2 from the 1988 administration of the College Board’s Advanced Placement Test in United States History (hereafter AP US History). This test comprises 100

164

On Examinee Choice in Testing

multiple-choiceitems and two essays.The first essayis mandatory; the secondis chosen by the examinee from among five topics. Shown in Figure 2 are the averagescoresgivenfor eachof thosetopics,aswell asthe proportion of men and women who chose them. Essay3 had the lowest average scoresfor both men and women. The usual interpretation of this finding is that the topic of Essay 3 was the most difficult. This topic wasabout twice aspopular amongwomen asamongmen. An altemative interpretation of these findings might be that the lowestproficiency examineeschosethis topic and that a greater proportion of women than men fell into this category.This illustratesagain how any finding within a self-selectedsample yields ambiguousinterpretations. This same phenomenon showsitself on the chemistry test as well. Shown in Table 3 are the summary proportions of male and female examineeschoosing varioustriples in a choose-3-of-5sectionof the 1989 AP Chemistry Test. Of the five choiceitems, Item 6 is the most difficult, and Ite_ms8 and 9 are the easiest. Note that female examineeschoseItem triples5,6,7 and 5,6,8 considerablymore often than male examinees;men chosethe easieritems more often than women. Becauseall items counted equally, men had an advantage: not becauseof any superior knowledge of chemistry but merely because they tended to choose easier items to answer. Similar studieswith similar findingshavebeencarried out on all of the AP tests that allow choice (DeMauro, 1991; Pomplun, Morgan, & Nellikunnel, 1992). Fitzpatrick and Yen (1993) alsoreportedsubstantialsexand ethnic differencesin

The most difficult essay(3) was twice as popular with Women as with Men 0.4

1

&way 3

5.4

5.6

5.8

6.0

6.2

6.4

6.6

Average scoreon choiceessay FIGURE 2. The relativeperformanceof men and women on the choiceessaysin the 1988 AP Examination in History

165

Wainer and Thissen

TABLE 3 Item 6 is the hardest; Items 8 and 9 are the easiest

Choice on SectionD

Proportionchoosing Males

Females

567 568

0.13 0.25

0.16 0.35

Hardest

789 578 589

0.06 0.28 0.10

0.03 0.24 0.07

1 Easiest

choiceon third-, fifth-, and eighth-gradepassage-based reading comprehension tests.In the data describedby Fitzpatrick and Yen, there is no obvioustendency of any particular group to selectmore or lessdifficult items; however, for any particular form of the test, the outcomeof the combinationof choiceand scoring was that some group was placed at a disadvantage. The phenomenonof studentschoosingpoorly is so widespreadthat evidence of it occurswheneverone looksfor it. It is instructiveto note that Powers,FOwles, Farnum, and Genitz (1992), in their descriptionof the effect of choiceon a testof basicwriting, found that the more that examineesliked a particular topic, the lowerthey scoredon an essaythey subsequentlywrote on that topic. Their results are summarized in Figure 3. An examinee’s preference for a topic does not predict how well he or she will do on it. Perhapsif examineeswere explicitly informed that they ought to choosea topic on the basisof how well they think they can score, and not how much they like the topic, they might choosemore wisely. Although test developerstry to make up choicequestionsthat are of equivalent difficulty, they do not appearto have been completely successful.Of course, this conclusionis clouded by possiblealternative explanations that have their origin in self-selection:For example,the choiceitems are not unequally difficult; rather, the people who chosethem are unequally proficient. So long as there is self-selection,this alternative cannotbe completelydismissed,althoughit can be discreditedthrough the use of covariate information. If it can be shown that choiceitems are not of equal difficulty, it follows that some individuals will be placedat a disadvantageby their choiceof item-they chooseto take a test some of the items of which are more difficult than those on a correspondingtest for other examinees-and this extra difficulty is not adjusted away in the scoring. The only unambiguousdata on choiceand difficulty that we know of involved data gathered and reported by Wang, Wainer, and Thissen (1993) in which examineeswere presentedwith a choiceof two itemsbut then required to answer both. In Table 4, we seethat, even thoughItem 12 wasmuch more difficult than Item 11, there were still some studentswho chose it. The item difficulties in Table 3 were obtained from an operational administrationof these items involving more than 18 thousandexaminees.PerhapsexamineeschoseItem 12 because they had somespecialknowledgethat made this item lessdifficult for them than Item 11. In Table 5 is shownthe performanceof examineeson eachof theseitems broken down by the items they chose.Note that 11% of those exam&es who 166

On Examinee Choice irt Teshg

Performance Declines with Preference!

S.0

5.5

6.0

6.5

7.0

Prefcrcnce FIGURE 3. Irt a testof basicwriting, erantirrcesscoredlower on essaytoph that they liked more Note. Adapted from Powers,Fowles,Farnum,and Gerritz, 1992, Table 3, p. 17. chose Item 12 responded correctly, whereas 69% of them answered Item 11 correctly. Moreover, this group performedmore poorly on both of the items than examineeswho choseItem 11. The obviousimplication drawn fronithis example is that examineesdo not alwayschoosewisely and that lessproficient examinees exacerbatematters through their unfortunate choices.Data from the rest of this experiment consistently supported these conclusions. Strategiesfor Fair Testing When Choice Is Allowed We will now define more carefully the goal of examinee choice. We will then examine two alternative strategiesfor achieving that goal. The first is to aid TABLE 4 The number of studentschoosirlgeach item and the dif/iculty of those item in the general test-takingpopulation

Item chosen 11 12

Number choosing 180 4s

Item difficulty (b) -2.5 1.6 167

Wainet and Thissen

TABLE 5 The proportion of studen& getting each item correct shown conditional on which item they preferred to answer ~-

Item answered 11

12

Item chosen 11

12

0.84 0.23

0.69 0.11

studentsin making wiser choices; the second is to diminish the unintended consequences of a poor choice through statisticalequating of the choice items.

What Do We Get WhenWeAllow Choice? When a test is given we ordinarily estimatea score. Following traditional IRT notation,2we will call the proficiency of a particular examinee taking the test 8 and the estimateof that proficiency 6. Ordinarily, each item on a test providesa somewhatdifferent value of the examinee’s estimated proficiency; sometimes the varianceof the distributionof thoseestimatesis usedas an index of measurement precision (Wainer & Wright, 1980). 8Ma. The usual estimate of proficiency is based on estimated performance from a random sampledrawn from the entire distribution of items. But suppose we allow choice.What are we estimating?Most practitionerswe have spokento, who favor allowing choice, argue that choice provides the opportunity for the examineesto show themselvesto best advantage.We shall call this proficien$y 8Max* Figure 4 is a graphicaldepictionof what might occurin a choicesituation. The top panel is the distributionof 0 basedon information obtained prior to administering the choiceitems. The distributionis centeredon 0. Beneath this panel is a second panel showing the trace lines for two items, assumingthat they are answeredcorrectly. Item 1 is relatively easy;Item 2 is somewhatmore difficult. Each curvein the bottom panel is constructedby multiplying the prior densityby the appropriatetrace line (or one minusthat trace line, for incorrect responses). Theseare called the posteriordensitiesafter choiceand representour knowledge of that examinee’s ability. eMQIis the characterizationof proficiencythat would be obtained if the examinees chose the item that would give them the highest score. When we allow choicewe are attemptingto estimateeMor.What we are actually obtaining is bMar. How is 8, related to eMar? To answerthis question,it is usefulto adopta Bayesianapproach. But first, we must specifythe scoringsystemfor the items, becausean examinee’s expected scoreis Pitem S,where Pi,,, is the probability the examineewill obtain scoresfrom the item. We will considertwo possibilities: 1. Both items are scoreds = 1, if correct,s = 0, if incorrect (simple summed scoring).In this case,only the information in the secondpanel of Figure4 is relevant to the choice: We see that PI > Pz for all values of 8, so all examineesmaximize their expectedscore by choosingItem 1. 168

011Examitree Choicein Tcstirtg

Oo( . -

3

. -2

. - I

Si

1

0 8

Si

+-I

S

-I1

+2

+3

+ S*

FIGURE 4. A graphical representatiortdepictshow theposterior densityis computed from the product of the prior derwity altd the appropriate item trace lines

2. The items are scored using IRT-for

example, using the mean of the posterior given the item response.In this case, the score associatedwith Item 1 correct issl’; that for Item 1 incorrect iss; The scoreassociatedwith Item 2 correct is s,‘; that for Item 2 incorrect is ~2’1’ The locations of the 169

Wairrcr arad Tltissert scoresarc shownin Figure 4. Now the examinee’s task is more difficult; the

choice must be Item 2 if and only if I’&

+ (1 - I’&-

> P,sl+ + (1 - PJS;:

To do this computation, examineesmustknow both PI and Pz (at least for their own values of 6), as well as the item scores. But the examinees,when choosingan item, do not know their probability of respondingcorrectly to the item. Instead, they have somesubjective idea of that probability. We have already provided strong evidence indicating that some examineesdo not choosewisely. Moreover, we have seenthat the propensity for making optimizing choicesvaries by sex and ethnic group. As we have already seen,choiceitems, ascurrently prepared, are typically not of equal difficulty. This fact, combinedwith the common practiceof not equating choice items for their differential difficulty, yields the inescapable conclusion that it matterswhat choicean examineemakes. Examinees who chosethe more difficult questionwill, on average,get lowerscoresthan would have been the case had they chosenthe easier item. The fact that all examineesdo not choosethose items that will show their proficiency to best advantagecompletesthis unhappy syllogism: examinee choice is not likely to yield credible estimates of eMaX. What Cart We Do to Improve Matters? There appear to be two paths that can be followed: eliciting wiser choicesby examineesor equating test forms. The secondoption removesthe necessityfor the first; in fact, it makes examinee choice unnecessary. How can we improve examinees judgment about which items to select? Estimation of eMUcan bc done optimally only by askingexamineesto answerall items and then scoring just those responsesthat yield the highest estimate of performance. This strategy is not without its drawbacks. First, it takes more testing time, and choice is often instituted to keep testing time within practical limits. Second, many examinees,,onhearing that “only one of the six items will be counted” will only answerone. Thus, this strategy may commingle measures of grit, choice wisdom, and risk aversion with those of proficiency. A more practical approach might be to try to improve the instructionsto the examineesabout how the test isgraded, to guide their choicesbetter. It would be well if the instructionsabout choice made it clear that there is no advantage to answering a hard itemcorrectly relative to answering an easy one, if such is indeed the case. Current instructionsdo not addressthis issue.For example, the instructionsabout choice on the 1989 AP Chemistry Test (CEEB, 1990, p. 23), reproduced in their entirety are: SolveONE of the two problcrnsin thispart. (A secondproblemwill not be scored.) Contrast this with the care that is taken to instructexamineesabout the hazards of guessing.These are taken from the same test (p. 3): Many candidateswonderwhetheror not to guessthe answersto questions about which they are not certain. In this sectionof the examination, as a correctionfor haphazard guessing,one-fourth of the number of questions you answer incorrectly will be subtracted from the number you answer 170

011 EkarrrirtceChoice irt Testittg

correctly.It is improbable,therefore,that mere guessingwill improveyour scoresignificantly; it may even lower your score, nnd it does take time, If, howcvcr,youare not sureof the correctanswerbut havesomeknowlcdgcof

the questionand are able to eliminateone or more of the answerchoicesas wrong,yourchanceof gettingthe rightanswerisimproved,and it maybe to your advantageto answersucha question. Perhaps, with better instructions, the quality of examinee choices can be improved. At the moment, there is no evidence supporting the conjecture that they can be, or if so, by how much. An experimental test of the value of improved instructionscould invoivc one randomly sclcctcd group with tlrc tradition;\1 instructionsand another with a more informative set; which group has higher averagescoreson the choice section?A more complex experiment could use a paradigm much like that employed by Wang, Wainer, anti Thissen (1993) in which examineeswere askedto choosefrom among several items but were then required to answer all of them. This sort of experiment would allow a detailed examinationof the changein choicebehaviordue to the instructions.While more explicit instructions may help matters somewhat, and ought to be included regardlessof their efficacy, we arc not sanguine about this option solving the problem of getting f$,= closerto OAIax . To do this requires reducing the impact of unwise choice. As we pointed out, there are two terms in the calculation of subjective posterior density. The first is the subjective probability of getting the item correct; improved instructions may help this. The second requires that the examinee have an accurate idea of the relative difficulty of the choice items. Pretesting, when it is possible, would allow us to present to examinees each item’s difficulty in the pretest population. It will not help to characterize the individual variations in item difficulty that are the principal reason for ahowing choice. A more promising path seemsto be to make all of the choice problems equally difficult (from the point of view of the entire examinee population) and allow the choice to be governed by whatever special knowledge or proficiency each individual examinee might possess.In this way, we can bc sure that, at least on average, the items are as fair as possible.The problem is that it is at least difficult, and perhapsimpossible,to build items that empirically turn out to be exactly equal in difficulty. Another option is to adjust the scoreson the choice items statisticallyfor their differential difficulty. We will refer to this statistical adjustment as equating, although the way it is carried out may not satisfy the strict rules that are sometimes associatedwith that term. How Does Equating Affect the Examinee’s Task? Equating appears,at first blush, to make the examinee’s task of choosingmore difficult still. If no equating is done, the instructionsto the examinee should bc: Answer that item that seemseasiestto you (and we hope that the examineeschoosecorrectly, but we will not know if they do not). If we equate the choiceitems (give more credit for harder items than easier ones), the instructionsshould be: Pick that item which, after WCadjust, will give you the highestscore.

Waitrer and Thtisero

This task could be akin to the problem faced by compctitivc divers, who choose their routine of dives from within various homogeneousgroups of dives. The diver’s decision is informed by: l l

l

knowledge of the degree of difficulty of each dive, knowledge of the concatenationrule by which the dive’s difficulty and the diver’s performance rating are combined (they are multiplied), and knowledge, obtained through long practice, of what his or her scoreis likely to be on all of the dives.

Armed with this knowledge, the diver can sclcct a set of dives that is most likely to maximize his or her total score. The diver scenariois one in which an individual’s informed choice provides a oMarthat seems to be close enough to OMarfor useful purposes. Is a similar scenario possiblewithin the plausible confines of standardized testing? Let us examine the aspectsof required knowledge point by point. Specifying how much each item will count in advance is possible, either by calculating the empirical characteristicsof each item from pretest data or, as is howmucheachonecountsby fiat. We favorthe currently thecase,by specifying

former, becauseit allowseachitem to contributeto total scorein a way that minimizes measurement error. An improvident choice of a priori weights can have a seriousdeleteriouseffect on measurementaccuracy(see Lukhele, Thissen,

& Wainer, 1993;Wainer& Thissen,1993a). Specifyingtheconcatenation rule(howexamineeperformanceanditem characteristicsinteract to contribute to the examinee’s score) in advance is also possiblebut may be quite complex, for example, if IRT is used. Perhapsa rough approximation can be worked out, or perhaps one could present a graphical solution like that shown in Figure 4, but for now thisremainsa question.The difficulties that we might have with specifyingthe concatenation rule are largely technical, and workable solutionscould probably be developed. A much more formidable obstacle is providing the examinees with enough information so that they can make wise choices. This seemscompletely out of reach, for, even if examineesknow how much a particular item will, if answered correctly, contribute to their final score, it does no good unlessthe examinees have a good idea of their likelihood of answeringthe item correctly. The extent to which suchknowledgeis imperfect would then correspondto the bias(used in its statisticalsense)associatedwith the use of 8,, to estimate OMar.The nature of security associatedwith modern large-scaletests makes impossible the sort of rehearsal that provides divers with accurate estimates of their performance under various choiceoptions. The prospectappearsbleakfor simultaneously allowingchoiceandsatisfying the canonsof good practice that require the equating of test forms of unequal difficulty. The task that examineesface in choosingitems when they are adjusted seemstoo difficult. But is it? There remain two glimmers of hope. The brighter of theserestson the possibilityof successfully equatingthevariouschoiceforms. If we can do this, the examineesshould be indifferent as to which items they answer, becausesuccessfulequating means that an examinee will receive, in expectation, the samescoreregardless of the form administered.This is happy

172

On ExarrtirteeChoice irt Testittg

but ironic news, for it appearsthat we can allow choice and have fair testsonly when choice is unnecessary. A dimmer possibilityis to try to improve examinees’estimatesof their success on the various choices. In a computer administered test, it may be possible to provide some model-basedestimatesof an examinee’s probable score on each item. To the extent that theseestimatesare accurate,they might help. Of course, if really good estimates were available, we would not need to test further. Moreover, the value of choicewould be greatestwhen an examinee’s likelihood of successis very different than that predicted from the rest of the test. To answerthe questionposedat the beginningof this section:When we do not equate selecteditems, the problem of choicefaced by the examinee can be both difficult and important. When we do equate, the sefection problem simultaneously becomesmuch more difficult but considerably less important. This conclusionnaturally brings us to the next question. Under What ConditionsCan We Equate Choice Items? How? Let us reconsider Harold Gulliksen’s (1950, p. 338) advice, “Alternative questionsshouldalwaysbe avoided.” WC lravcdiscussedone possiblereasonfor this-that it makes the examince’s task too difficult. Our conclusionwas that, while it does make the task difficult, this difficulty becomesirrelevant for most usesof the test scoreif the alternate formsthusconstructedcan be equated. This raises a second possible explanation for this advice: The equating task is too difficult. Certainly this explanation was the one favored by ‘Tucker (quoted earlier). The only way to equate test forms that are created by choice is to make some (untestable) assumptionsabout the structureof the missingdata that have resulted from the choice behavior. One possible assumptionis nzksing-completely-at-random. Underlying this assumptionis the notion that, if we had the examinee’s responsesto all of the items, a random deletion of some portion of them would yield, in expectation, the samescoreaswasobtained through the examinee’s choice. In simple terms, we assumethat the choice had no effect on the examinee’s score. If we really believed missing-completely-at-random,we could equate without any anchor items because an important consequenceof missing-completely-at-randomis that all choice groupswill have the sameproficiencydistribution. Data gathered from all of the Advanced Placement exams(Pomplun, Morgan, & Nellikunnel, 1992) suggestthat this is not credible. Thus, it is imperative to userequired anchoritems to establisha commonscale for the choice items. This can be done using traditional or IRT methods (see Dorans, 1990) and is justified if we believe that the missingresponsesyielded by examinee choice are generated by a processthat is, in Little and Rubin’s (1987) terminology, conditionally ignorable.’ What we mean by this weaker assumption is that the probability of an examinee choosingany particular item is independent of his or her likelihood of getting that item correct, conditional on 0. In graphical terms, this means that an item’s trace lines are the same for those individuals who chose it as they would,have been for those who omitted it. Subsequentdiscussionswill be clearer if we repeat our characterizationof the missing data assumption and the logic surrounding their genesis with more precision.

173

Waimr and TlC&sctt

Therefore,suppose yI is the scoreon test item Y,, and R, is a choicefunctionthat takesthe value 1 if Y, is chosenand 0 if not. In a choicesituation,we c:rnobscrvcthe distributionof scoresJ,(y), for those who optedto take an item. This can be denoted f,(y) = P( Y = y I R = 1). What we do not know, but what is crucialif we are to be able to equatethe differentchoiceitems,is the distributionof scores,fo(y),for thosewhodid not take the item. This is denoted fo(y) = P(Y = y I R = 0).

To beableto equate,weneedto knowthedistributionof scoresin theunselected population,g(y) = P(Y = y). Note that we can representg(y) as

g(y) =

J(Y)

x P(R = 1) + fo(Y) x P(R = Oh

(1)

The onlypieceof thiswhichisunknownisJ,(y), the distributionof scoresamong thoseindividuals whochosenotto answerit. Unlessoneengages in aspecialdata gatheringeffort, in whichthoseexaminees whodid not answerY, are forcedto, fo(y) isnotonlyunknownbutunknowable. Thus,the conundrumisthatwemust equateto ensurefairness,but we cannotequatewithout knowingfo(y). One approachto suchproblems,mixturemodeling(Glynn, Laird, & Rubin, 1986),involvesa hypothesized structureforfO(y).It isconvenientto assumethat the functionfo(y) is the sameasf,(y). In formal terms, fi(y)

= P(Y = ylR = 1,0) = P(Y = ylR = 0,O) = Ii(y) = P(Y = yl0).

(2)

Or: Weassumethatthetracelinesfor thechoiceitem wouldhavebeenthesame for thosewhodidn’t chooseit asit wasfor thosewhodid. If we couldgatherthe appropriatedata (forcingthosewho opted not to answerit to do so), this hypothesis couldeasily be testedusingstandardDIF technology(Holland & Wainer, 1993). Althoughtheconditionalindependence, givenX and8, expressed in Equation 2 hasa surfacesimifarityto theconditionalindependence, given8, thatunderlies all of IRT, Equation2 expresses a muchstrongerassumption thatmayor maynot betrue:Equation2statesthat,if 0 isknown,knowledgeof whethertheexaminee chooses to answeran itemor notdoesnotaffectthe modeledprobabilityof each response. Thisassumption iscertainlycontraryto the perceptions of examinees, whooftenfeelthat theirchoiceof an itemoptimizestheirscore.However,there is iittle evidenceavailablethat‘ilIuminatesthe relationshipbetweenexaminees’ preferencefora particularitemandtheireventualscore.Contraryto widespread belief, what little experimentalevidencethere is supportsthe assumptionexpressedin Equation2. Thus,in theabsence of contrarydata,andbecausethisassumption allowsusto employthe existingtechnologyof IRT to equate,we shall useit. For a fuller 174

description of the structure and consequencesof assumptionsabout missing data, the reader is referred to Allen and Holland (1993); of specialimportance in the examinee choice situation is their distinctionbetween ignorable and forgettable nonresponse. WJiatOlJler AsstrntptionsAre Necessaryfor Equating? While ignorable nonresponseis the only assumptionthat is new to this circumstance, it is not the only assumptionrequired. In addition, we need to assume unidimcnsionality and fit to the test scoringmodel employed. These two latter assumptionsare well known and can be tested with the test data ordinarily gathered; ignorable nonresponse cannot be. To test ignorable nonresponse requiresa specialdata gatheringeffort. One exampleis the sort of data gathering schemethat Wang, Wainer, and Thissen (1993) employed: asking examineesto chooseitems but then requiring them to answersome of the items they did not choose. This is called sarrtJdirtg from tlte unscfcctedpopulatiort and will be discussedin greater detail later. Equating test forms constructedby examinee choice can bc straightforward once we have made some assumptionsabout the unobserved distribution of scoresJ,(Y). While one can derive a formal equating procedure for many assumed characterizationsof the missing data, the assumptionof conditionally ignorable nonresponseallows us to immediately use the existing machinery for IRT equating. One merely enters the variousvectorsof item responsesand treats what’s missingas having not been presentedto the individual. We establish a commonscaleby requiring a subsetof items that all examineesmustanswer.This anchor test provides a set of items drawn from the unsefectedpopulation on which we can also test model fit and unidimensionality. I-low Cart We Test Our Assumptiorls? The specialassumptionrequired to equate choice items involves the distribution of scoreson the choice items from those who did not answer them: h(y). This distribution is necessaryto estimate g(y), the distribution of scoresin the unselectedpopulation. There are many waysto test the viability of this assumption, but they all require somesort of specialdata gathering. We will describetwo experimental designsthat can be used to accomplishthis. Des&t I: WitJzinsubjects. In a randomly chosen subset of the examinee population, examinees must be required to indicate their choice but then required to answer all items. This design allows us to estimate all three parts of Equation 1 and so aIlowsan explicit test of the assumptionstated as Equation 2. A variant of this designwas employed by Wang (1992) which asked examinees their choices both before and after they answeredthe questions. This designis subject to the criticism that examineesmight not be particularly judicious in their choiceswhen they know that they will have to answer all the questionsanyway. If this conjecture is true, it is likely to affect the estimatesof MY) andfi(r) more than that of the compositeg(y). Using the good estimatesof fi(y) we can get from the operational choice test and the estimatesofg(y) from the experimental administration, we can derive fo(y) through Equation 1. We used this design to test the assumptionof ignorable nonresponseamong some choice items in the 1989 AP Chemistry Test (Wang, Wainer, & Thissen, 175

Wainer and Thissen

1993), using IRT-based DIF technology(Thissen, Steinberg, & Wainer, 1988, 1993; Wainer, Sireci, & Thissen, 1991). Figure 5 showsthe estimated trace lines for choice Items 11 and 12 for those examineesthat choseeach item [fi(y)] as well as for those that did not [fO(y)]. The apparent difference between the two trace lines for Item 11 is somewhat unlikely (& = 4), whereas there is no difference at all between the two trace linesfor Item 12. Operationally, thismeansthe ordinarily untestableassumption that we usedto equate choiceformsmay be untrue for Item 11. A more extensive experiment seemsin order. Note that the differencesobservedin the trace lines for Item 11, although not quite achieving nominal levelsof statisticalsignificance,suggestthat Item 11 is

Tracelines

for item 11

1.o-

T(x) 0.5



Tracelines

for item 12

(‘x) 0.5 ’

0.0 4 -3

-2

-1

0

+l

+2

4 +3

8

FIGURE 5. Graphical testsof the ordinarily untestableassumptionthat choiceitems have the same trace linesfor those who chosethem as for those who did not 176

easierfor thosewhochoseit thanfor thoseexaminees who did not. This is not alwaysthe case.As part of the samestudy,we foundthat for anotherpair of choiceitemsthe reversewas true. In noneof the casescxamincdwere the differences betweenI, and/,(y) solargeasto generateerrorsin theequating largerthanwouldhavebeenthe cast had we not equated. Desigrl 2: Between subjecfi. In a randomlychosensubsetof the examince population,examinees mustbe randomlyassigned to eachof the choiceitems. Thiswill provideuswith unbiasedestimatesof g(y) for eachof the choiceitems andallowusto equate.It will not providedirectestimatesoffo(y) andfi(y), but thosecanbeobtainedfromtheportionof theexamin whichchoiceisallowed.As of this writing, an experimentthat will have this format is currently being considered for the GRE WritingTest. Test Dirrtertsiorrnlity

Because of theincreasing interestin thedevelopment of teststhatcombinethe psychometric advantages of multiple-choice itemswith other featuresof constructedresponseitems, the followingtwo questionsassumeimportance: 1. Are WCmeasuring thesamethingwith theconstructed response itemsthat we are measuringwith the multiple-choice questions? 2. Is it meaningfu1 to combinethescoreson theconstructed response sections with the multiple-choice scoreto yield a singlereportedtotal score? Answersto thesequestionsare necessary to build appropriatescore-reporting strategiesfor suchhybrid tests.As we shallsee, answeringthesequestionsis moredifficultwhentheexamineeispermittedto chooseto answera subsetof the time-consuming constructedresponsequestions(Wainer, Wang, & Thissen, 1991). The useof item responsetheory to scorethe test, or to equateforms comprisingchosenquestions,explicitlyrequiresthat the test (or forms) be essentiallyunidimensional-that all the itemsmeasuremore or lessthe same thing.Thus,wemustanswerthe dimensionality questionto be ableto scorethe testin a meaningfulway.This isexplicitlytruewhenusingIRT but alsomustbe true whena test scoreis calculatedin manyother, lessprincipledrubrics. Are hybridtestsunidimensional? The literatureon thissubjectis equivocal. Bennett,Rock,Braun;Frye,Spohrer,andSoloway(1991)fitted differentfactor structures to tworelativelysimilarcombinations of multiple-choice, constructed response, and constrained constructed responseitems;a one-factormodelwas sufficientfor one setof data, but a two-factormodelwasrequiredfor another similarsetof data. Bennett,Rock,andWang(1991)examineda particufartwofactormodelfor the combinedmultiple-choice and constructed responseitems ontheCollegeBoard’sAdvancedPlacement(AP) Testin ComputerScienceand concludedthat the one-factormodelprovideda more parsimonious fit, We reanalyzed(Thissen,Wainer,& Wang, 1993)the ComputerScienceAP data reported by Bennett et al. (1991) and showedthat significant,albeit relativelysmall,factorsexplainsomeof the observedlocaldependenceamong the constructed responseitems.We replicatedthisfindingusingdata from the AP testin chemistry.There wasclearevidencethat the constructedresponse problemson bothof thesetestsmeasuresomethingdifferentthanthe multiplechoicesections of thosetests:There werestatisticallysignificantfactorsfor the 177

Waker and Thkscrt

constructedresponseitems, orthogonal to the general factor. However, there was also clear evidence that the constructedresponseproblems predominantly measurethe same thing as the multiple-choicesections:The factor loadingsfor the constructedresponseitems were almost alwayslarger on the general (multiple-choice)factor than on the constructedresponsefactor(s). The loadingsof the constructedresponseitems on the specificallyconstructedresponsefactorswere small, indicating that the constructedresponseitems do not measuresomething different very well. Given the small size of the constructed response factor loadings, it is clear that it would take many constructedrcsponscitems to producea reliable scoreon the factor underlying the constructedresponseitems alone-many more items than arc currently used. When WCaskedthe practicalquestion,“Is it meaningful to combine the scores on the constructedresponsesectionswith the multiple-choice score to yield a single reported score?” we were driven to conclude that it probably is; indeed, given the small size of the loadingsof the constructedresponseitems on their own specificfactors, it would probably not be meaningful to attempt to report a constructedresponsescorescparatcly, bccrruseit would not be reliably distinct from the multiple-choice score. Our investigation,and hence the aboveconclusions,utilized much of the same factor analytic technology, founded on complete data, that has become the standardin dimensionality studies(Joreskog& S&born, 1986, 1988). The procedure assumesthat estimates of the covarianceswere obtained from what is essentially a random sample from the examinee population. However, when there are choice items, assuminga noninformative sampling process’ is not credible. What is analogousto Assumption2 that will allow us to factor analyze the observedcovariancesand treat the resultsasif they camefrom the unselected population? Obviously, missing-completely-at-randomwould suffice, but this is usually patently false in a choice situation. Can we weaken it? Unfortunately, not much. Supposewe make the obvious assumptionthat the covariancesthat we observeare the sameasthosewe do not. Does thisallow usto analyze what are observedasif they were the unconditioned covariances?It does not, even with this strong an assumption.To understand why, it is best if we trace the logic mathematically. What must we assumeto allow us to treat Cov(yr,y+ RI X R, = 1) as if they were Cov(y,,y,)? There are many possibleassumptions.One, parallel to ASsumption2, would be to assumethat the covarianceinvolving a choice item is the sameamongthoseexamineeswho did notchoosethatitemasit wasamong those that did-that is,

Condition1: Cov(y,,y,lR, x R, = 1) = Cov(y,,y,IR, x R, = 0). But this is not enough. We mustalsoassumethat the meansfor at leastone of the two items in the covariance must be the same for thosewho chose it as it would have been for those who did not. Condition 2: E(y,IR, = 1) = E(y,lR, = 0) E(y,IR, = 1) = E(y,IR, = 0).

or

A little algebra will confirm that these conditionswill yield the desired result! 178

Or1 Examittee Choice itt Teshg

How plausibleis it that thesetwo conditionswill be upheld in practice?Clearly, if one thought that they were likely to be true, what would be the point of providingchoiceto examinees?Yet, to bc able to justify the typical analysts used to answerthe crucial dimensionality question, one must posit performance for examineeson the choiceitems that is essentiallythe sameregardlessof whether or not the items were chosen.We find thiscompelling evidenceto look elsewhere for methodologiesto answer dimensionality questionswhen there is choice. The missingdata theory describedabove presentsa convincing argument for the necessity of a special data gathering effort to estimate the covariances associatedwith choice items. We have demonstrated that there is no easy and obviousmodel that would allow the credible useof the observedcovariancesasa proxy for the covariancesof interest. To obtain these, we need a special data gathering effort analogousto the ones describedearlier. Both kinds of designs require a samplefrom the unselectedpopulation. Design 1 is exactly the sameas describedearher. Design 2 is slightly different. Des&r 1: Wirlzirrsubjec&. In a randomly chosen subset of the examinee population, examinees must be required to indicate their choice but then required to answerall items. As before, this providesestimatesof the covariances involving the choiceitemsthat are uncontaminatedby self-selection.They might suffer the same shortcomingas before; that is, examinees might not be particularly judiciousin their choiceswhen they know that they will have to answerall the questionsanyway.This will affect any measuredrelations of yI and R, but will probably be satisfactoryfor estimatesof the covariancesbetween the items. We have no data to shed light on these conjectures. Desigtz2; Befweensubjecti. In a randomly chosen subset of the examinee population, examinees must be randomly assigned to all pairs of the choice items. This will provide us with unbiasedestimatesof Cov(y,,y,) for all pairs of the choiceitems. It will provide more stableestimatesof the covariancesbetween each choiceitem and all of the required items as well. It will thus allow US to do dimensionality studies. Obviously, because this design does not gather any choice information, it cannot provide estimates of Cov(y,, RI).

What Can We Learn Fkom Choice Ikhrrvior? Thus far, our proposedrequirementsprior to implementing examinee choice fairly require a good deal of work on the part of both the examinee and the examiner. We are aware that extra work andexpenseare not partof theplanfor many choicetests.Often, choiceis allowed becausethere are too many plausible items to be asked and too little time to answer them. Is all of this work really 7 Almost surely. At a minimum, one cannot know whether it is necessary. necessaryunlessit is done. To paraphraseDerek Bok’s comment on the cost of education, if you think doing it right is expensive,try doing it wrong. Yet many well-meaning and otherwise clear-thinking individuals ardently support choice in exams. Why? The answerto this questionmust, perforce, be impressionistic.We have heard a variety of reasons. Some are nonscientific; an example is “To show the examinees that we care.” The implication is that, by allowing choice, we are giving examinees the opportunity to do their best. We find this justification 179

Waker ad

Tlukserr

difficult to accept, becausethere is overwhelmingevidence to indicate that this goal is unlikely to be accomplished.Which is more important-fairness or the appearanceof fairness?Ordinarily, the two go together, but, when they do not, we must be fair and do our best to explain why. A secondjustification (W. B. Schradcr,personalcommunication, March 7th, 1993) is that outstanding individuals are usuallyoutstandingon a small number of things. If the purposeof the exam is to find outstandingindividuals, WCought to allow them to have the option to showtheir maximum performance. We find this argument more convincing, but it is moot in a measurement task that is essentiallyunidimensional. A third justificationmight be termed instructionaldriven measurement(IDM). The argumentis that because,in the classroom,studentsare often provided with choice options evaluation instruments ought to as well. This argument can be compelling, especially if one thinks of the choice options being those that teachersmake: which topicsto cover, in what order, from what perspective.Why should studentssuffer the consequencesof unfortunate choice that were made on their behalf? The central question is: Can these issuesbe fairly addressed through the mechanism of allowing choice on exams? Let us consider more narrowly what we can learn from the choice behavior. Supposewe administer a test that is constructedof two sections.One section is mandatory, and everyone is required to answer all items. A second section containschoice. Equating of different test formsconstructedby choice behavior can be done, if we make the usual assumptionsrequired for IRT as well as an assumptionabout the shape of the choice items’ trace lines among those who opted for other items. Suppose,instead, we examine the estimatesof proficiency obtained from the mandatory section of the test. How well is proficiency predicted from the choices that examinees make? An illustration of such a test uses data drawn from the 1989 Advanced Hacement Examination in Chemistry (Wainer & Thissen, 1993b). A full descriptionof this test, the examinee population, and the scoringmodel is found in Wainer, Wang, and Thissen (1991). For the purposesof this illustration, we consideronly the five constructedresponseitems in Part II, Section D. SectionD has five problems (Problems 5, 6, 7, 8, and 9), of which the examinee must answer three. This section accountsfor 19% of the total grade. Becauseexamineeshad to answerthree out of the five questions,a total of IO choice groups was formed, with each group taking a somewhat different test form than the others. Each group had at leastone problem in common with every other group; this overlap can be usedto place all examinee selectedforms on a common scale. The common items serve the role of the mandatory section describedearlier. The fitting of a polytomous IRT model to all 10 forms simultaneously was described in Wainer, Wang, and Thissen (1991). As part of this procedure, we obtained estimates of the mean value of each choice group’s proficiency (b) as well as the marginal reliability of this sectionof the test. Our findings are summarized in Table 6. The proficiency scale had a standard deviation of one; those examineeswho chose the first three items (5,6, and 7) were considerably less proficient, on the average, than any other group. The groups labeled 2 through 7 were essentially indistinguishable in performance from one another. Groups 8, 9, and 10 were the best performing groups. 180

On Examinee Choice in Testing

TABLE 6 Summary statisticsfor the I0 groupsformed by examineechoiceon Problems 5-9

Group

8

9 10

Problems chosen

Mean group proficiency( ~.LJ

n

a

5,677

- 1.02

2,555

0.63

6,779 5368 5,799 5,798 6,798 5,699

-0.04 0.00* 0.04

0.09

121 5,227 753 4,918 1,392 457

0.65 0.57 0.64 0.51 0.54 0.67

0.40 0.43 0.47

407 898 1,707

0.57 0.59 0.59

0.08 0.08

WV 79%~

Cronbach’s

5,879 *The mean for Group 3, the largestgroup,is fixed at 0.0 to setthe locationof the proficiencyscale.

If we think of SectionD asa singleitem with an examineefalling into one of 10 possiblecategories,then the estimatedproficiencyof eachexamineeis the mean scoreof everyone in that category. How reliable is this one-item test? We can derive an analog of reliability (see the appendix for a de$vation), the squared correlation of proficiency (8) with estimatedproficiency(e), from the betweengroup variance [var(pJ] and the within-group variance (unity). This index of reliability,

r2(i,e)

= 4

ki)/[Var( pi) + 11,

is easilycalculated.The varianceof the pi is S7, and SOr*(6,9) is .15( = .17/1.17). It is informative to considerhow close.15 is to .57, the reliability of theseitems when actually scored. Supposewe think of the task of selectingthree out of five questionsto answeras a singletestlet. We can calculatethe reliability of a test made up of any number of suchtestletsusingthe Spearman-Brown prophesy formula. Thus, if we askthe examineeto pick three from five on one set of topics and then three from five on another, we have effectively doubled the test’s length, and its reliability risesfrom .15 to .26. The estimatedreliabilities for tests built of various numbers of suchchoice testletsare shown in Table 7. How much information is obtained by requiringexamineesto actually answer questionsand then gradingthem? The marginal gain for the AP Chemistry Test is very small; see Figure 6, which showsthat at all of the important choicepoints the error of measurementis virtually the samewhether the questionschosenare scoredfor the content of the answersor scoredby noting which choiceswere made. Thus, we have seenthat, for one test, the marginalgain in information by merely noting the choice is almost the same as that which is available from scoringthe items. Interestingly, we did not need to make any assumptionsabout choice behavior, as we did when we scoredthe item content in the presenceof choice-inducedmissingdata, becausethere is no missingdata if the data are the choices. 181

Wainer and Thissen

TABLE 7 Spearman-Brown extrapolationfor building a test of specifEd reliability Number of testlets*

Reliability

1 2 3 4 5 10 20

0.15 0.26 0.35 0.41 0.47 0.64 0.78

* Mere, each testlet comprisesthe task of selecting three questions out of five.

There is no doubt that more information is available from scoringthe constructedresponseitems of the chemistrytest than from merely observingwhich itemswere chosento answer.This is reflected in the difference in the sizeof the reliabilities of the choicetest versusthe traditionally scoredversion.This advantage may be diminished considerablyon tests based on constructedresponse itemsthat are holistically scored.Suchteststypically have much lower reliability than analytically scored tests. The reliabilities for the constructedresponsesectionsof 20 Advanced Placement Testsare shownin Table 8. Note that there is very little overlapbetweenthe There is little gain in accuracyfrom scoring the choiceitemsexceptat the highest levelsof proficiency

$‘ + ‘! \

I i

i : ! I !

i

I

! I : I

i !

-i

0 Proficiency

;

i

i

FIGURE 6. A comparison of the standard errors of estimateof profKiency for two versionsof the chemistrytestderivedby scoringthe choiceitems (84) or mereiy noting which items were chosen (79). At the selectionpoints of interest,scoring the choice itemsprovides almost no practical increasein precision.

182

On Examinee Choice in Testing TABLE 8 Reliabilitiesof constructedresponsesectionsof AP tests

Analytically scored

Score reliability

CalculusAB 0.85 PhysicsB 0.84 Computer Science 0.82 CalculusB C 0.80 FrenchLanguage 0.79 0.78 Chemistry Latin-Virgil 0.77 Latin-Catullus-Horace 0.76 PhysicsC-Electricity 0.74 Music Theory, BiologyI‘ 0.73 SpanishLanguage 0.72 PhysicsC-Mechanics 0.70 0.69 0.63 0.60 0.56 0.49 0.48 0.29

Holisticallyscored

History of Art FrenchLiterature SpanishLiterature EnglishLanguage& Composition EnglishLiterature & Composition American History EuropeanHistory Music: Listening& Literature

distributionsof reliability for analytically and holistically scoredtests,the latter beingconsiderablylessreliable. Chemistry is a little better than average,among analytically scored tests, with a reliability of .78 for its constructedresponse sections. It is soberingto considerhow well a test that usesonly the information about which options are chosenwould compare to one of the lessreliably scoredtests (i.e., any of the holistically scored tests). The structure of such a choice test might be to offer three or four setsof, say, five candidateessaytopics, ask the examineesto choosethree of those topicsin each set that they would write on, and then stop. Perhapsa more informative analysisof the information available in choices comparesit to other sortsof categoricalinformation. Figure 7 showsthat more Fisherian information is obtained from examineechoice than is obtained from knowledgeof examinee sexand ethnicity but that it is still lessthan the information obtained from just two (good) multiple-choice items. It is not our intention to suggestthat it is better to have examineeschoose questionsto answer than it is to actually have them answerthem.’ We observe only that, if the purpose is accurate measurement, some information can be obtained from the choices and that we can obtain this information without relying on untestable (and perhaps unlikely) assumptionsabout unobservable choicebehavior. Moreover, one should feel cautionedif the test administration and scoringschemeyield a measuringinstrumentlittle different in accuracythan wouldhave been obtained by ignoring the performanceof the examineeentirely. 183

Wairicr8and Thisscrt While there Ssmore infurmatlon in choice sdlcrns (htln In sex It Bthniclty, there is much more Mormntlon sliPI In Hemresponses

0.6

E .” Y

0.4

2 k E w

0.2 CholctIICIW

-t

stxdi Elhdclly

~~~~~~~~

0.

I 1

w-w

i,

j

t-

_______YI

Ii

1

--

3

i

1

W-B

I , 4’0 !i !

AP srortdtmerhtd Glftpritr thorn

Proficiency FIGURE 7. OtrI/teA P Chemistryexam, the information about chemistryknowledge provided by just two multiple-choiceitemsdwarfi that availablefrom sex and ethnicity or even choice behavior

Discussion We have painted a bleak psychometricpicture for the use of examinee choice within fair tests. To make testswith choice fair requires equating the test forms generated by the choice for their differential difficulty. Accomplishing this requires either some special data gathering effort or trust in assumptionsabout the unobserved responsesthat, if true, obviate the need for choice. If we can successfullyequate choiceitems, we have thusremoved the value of choice in any but the most superficial sense. To extend these considerations,we need to be explicit about the goals of the test. There are many possiblegoalsof a testing program. In this exposition, we will consider only three: contest, measurement, and device to induce social change. When a test is a contest,we are usingit to determine a winner. We might wish to choosea subsetof examineesfor admission,for an award, or for a promotion. In a contest,we are principally concernedwith fairness.All competitors must be judged under the same rules and under the same conditions. We are not concerned with accuracy,exceptto require that the test is sufficiently accurateto tell us the order of finish unambiguously. 184

On Exarttiriee Choice in Testing

When a test is usedfor measurement,we wishto make the mostaccurate possible determination of somecharacteiistic of an examinee.Usuallymcasurcmenthassomeactiotiassociated with it; we measurebloodpressureand then considerexerciseand diet; we measurea child’sreadingproficiencyand then choosesuitablebooks;we measurernathcmatical proficiencyand thenchoose thenextstepof instruction.Similarly,weemploymeasurement to determinethe efficacyof variousinterventions.How muchdid the diet lowerbloodpressure? Howmuchbetterwasonereadingprogramthananother?Whenmeasuring,we areprimarilyconcerned withaccuracy. Anythingthatreduceserrormayfairlybe includedon the test. When a test is a deviceto inducesocialchange,we are usingthe test to influencebehavior(Torrance,1993).Sometimesthe testis usedasa carrotor a stickto influencethe behaviorof students;we give the test to get studentsto studymoreassiduously. Sometimesthe testis usedto influencethe behaviorof teachers;we constructthe test to influenceteachers’choiceof materialto be covered.The recentliterature (Popham,1987) has characterizedthisgoal as measurement driveninstruction(MDI). MD1 hasengenderedrichandcontentiousdiscussions, andwe will not addto themhere. The interestedreadercan beginwithCizek(1993)andworkbackwardthroughthe references providedby him.At first,it mightappearthat, whena testisbeingusedin thisway,issuesof fairness andmeasurement precisionarenotimportant,althoughthe appearance of fairnessmaybe. However,that isfalse.Whena testis usedto inducechange, theobviousnextquestionmustbe, “How welldidit work?”If weusedthetestto getstudents to study’moreassiduously, or to studycertainspecificmaterial,or to studyin a differentway, howmuchdid they do so?How muchmore havethe studentslearnedthantheywouldhaveundersomeother condition?The other conditionmightbe no announcedtest,or it mightbe with a testof a different format.There are obviousexperimentaldesignsthat wouldallowusto investigatesuchquestions-butall requiremeasurement.’ .Thus,evenwhenthe purposeof thetestis to influencebehavior,that teststill oughtto satisfythecanons of goodmeasurement practice. Thusfar,wehaveconfinedourdiscussion to situationsin whichit isreasonable to assignanyof thechoiceitemsto anyexaminee.Suchan assumption underlies the notionsof equating,whichas we have usedthe term requiresessential unidimensionality (usingStout’s, 1990, usefulterminology),and ako of the experiments we havedescribedto ascertainthe difficultyof the choiceitemsin the unselected population.Situations in whichexaminees aregivensucha choice we call small choice. Small choiceisusedmostcommonlybecause it isfelt thatmeasurement of the underlyingconstructmaybecontaminated by theparticularcontextin whichthe materialisembedded.It issometimes thoughtthatby allowingexamineechoice from amongseveraldifferentcontextsa purer estimateof the underlyingconstructmaybeobtained.Consider,forexample,thefollowingtwomathproblems that are intendedto test the sameconceptualknowledge: 1. The distancebetweenthe Earth andtheSunis93 millionmiles.If a rocket shiptook 40 daysto make the trip, what wasits averagespeed? 2. The KentuckyDerby isoneandone-fourthmilesin length.WhenNorthern Dancerwonthe racewitha timeof 2 minutes,whatwashisaveragespeed? 185

Wairrer artd Thksert

The answerto both problemsmay be expressedin miles/hour.Both problemsare formally identical, except for differences in the difficulty of the arithmetic. Allowing an cxaminee to choosebetween theseitems might allow us to test the construct of interest (Does the student know the relation Rate x Time = Distance?), while at the same time letting the examinecspick the context within which they feel more comfortable. Big choice. In contrastto small choiceis a situation in which it makes no sense to insistthat all individuals attempt all tasks(e.g., it is of no interest or value to ask the editor of the schoolyearbook to quarterbackthe football team for a series of plays in order to gauge proficiency in that context). We call this sort of situation big choice. Using more preciselanguage,we would characterizesituations involving big choice as multidimensional. Making comparisons among individuals after those individuals have tnade a big choice is quite common. College admissionsofficers compare students who have chosen to take the French Achievement Test against those who opted for one in physics, even though their scoresare on completely different scales.Companies that reward employeeswith merit raisesusually have a limited pool of money available for raisesand, in the quest for an equitable distribution of that pool, must confront such imponderable questions as “is person A a more worthy carpenter than person B is a statistician?” At the beginning of this account, we set aside big choicewhile we attempted to deal with the easierproblemsassociatedwith small choice. Most of what we have discussedso far leansheavily on sampling responsesin an unseIcctedpopulation and thusappliesprimarily to the small choicesituation. Can we make useful comparisonsin the context of big choice? Yes, but only for testsascontests,at least for the moment. When there isbig choice, we can setout rules that will make the contestfair. We are not able to make the inferencesthat are usualfydesirable for measurement.To illustrate, let us considerthe scoring rules for the decathlon as an illustration of scoring a multidimensional test without choice. Building on this example, we will expand to the situation of multidimensionality and choice. The decathlon is a lo-part track event that is clearly multidimensional. There are strength events like discus, speed events like the 1OOmdash, endurance eventslike the 1,SOOmrun, and eventsthat stressagility, like the pole vault. Of course, underlying all of these events is some notion of generalized athletic ability, which may predict performancein all eventsreasonablyaccurately.‘How is the decathlon scored?In a word, arbitrarily. Each event is counted “equally” in that an equal number of points is allocatedfor someonewho equaledthe world record that existedin that event at the time that the scoringruleswere specified.” How closely one approachesthe world record determines the number of points received (i.e., if one is within 90% of the world record, one gets 90% of the points). As the world record in separateeventschanges,so too doesthe number of points allocated. If the world record got 10% better, then 10% more points would be allocated to that event. Let us examine the two relevant questions:Is this accurate measurement? Is this a fair contest? To judge the accuracyof the procedureasmeasurement,we need to know the qualitiesof the scalesodefined. Can we considerdecathlonscoresto be on a ratio scale?Is an athlete who scores8,000 points twice asgood assomeonewho scores 186

On Exartuhee Choice iIt Tcstirrg 4,000? Most exdertswouldagreethatsuchstatements are nonsensical. Can we

considerdecathlonscoresto be on an interval scale?Is the difference between an athlete who scores8,000 and one whoscores7,000 in any way the sameas the

diffcrcncebetweenonewhoscores 2,000andanother~110scoresl,OOO? Again, expertsagreethat this is not true in any meaningfulsense. Can1WCconsiderdecathlonscoresto be ordinallyscaled?Yes. A demonstrationusesstandardmathematical notationandisvirtuallyidenticalto thedcscription givenin Krantz, Lucc, Suppcs,and Tversky(1971, p. 14): Definition: Let A be aset and 2 be a binary relation on A, i:e. 2 is asubset of A X A. TIte relatiortal structure (A, 2) is a weak order ifand only if, for all a, b, c E A, the followirlg two axioms are satisfied: I. Cortrtectcdrtess; Either a LZb or b 2 a. 2. Trarrsitivity:If a r: b arrd b 2 c, then a ;F:c.

If sucha definitionholds,it can be provedthat If A is a fir&

noncrrtpty set and if (A, 2) is a weak order, then there existsa realvalued fumtion $I on A suchthat for aN a, b E A,

a2b

if and only if +(a) 2: 4(b).

+ is then an ordinal scale. Translatingthisinto the current context, A might represent tile collection of performances ononeof thevariousdecathlon events,scaledin seconds or meters or whatever.+ is the scoringfunctionthat translatesall of thoseperformances intopoints.It isstraightforward to examineanyparticularscoringfunctionto see if it satisfiestheseconditions.. Obviously,any functionthat is monotonicwill satisfythem. We concludethat decathlonscoringsatisfiesthe conditionsfor an ordinal scale. A fair contest must. This raisesan important and interesting issue: If we are using a test as a contest and we wish it to be fair, we must,gather data that would allow us to test the viability of the assumptionsstated in the definition above. The most interesting condition is that of transitivity. The condition comsuggeststwo possibleoutcomesin a situationinvolvingmultidimensional

parisons: 1. Tllere may existinstances in wlrichPersonA is preferredto PersonB and Person B to PersonC, and, last, PersonC is preferredto PersonA. This happenssufficientlyoftensothat we cannotalwaysattributeit to random error. It meansthat, in somemultidimensional situations,no ordinalscale exists. 2. Data that allow the occurrenceof an intransitive triad are not gathered. This means that while the scalingschememay fail to satisfythe require-

mentsof an ordinalscale,wllichare crucialfor a fair contest,wewill never know. In a situationinvolvingbig choice,wedo notknowif theconnectedness axiom is satisfied.Howcanwetesttheviabilityof thisaxiomif wecanobserveonlya on one personand only b on another? 187

Waker and Thissen

To get a better sense of the quality of measurement represented by the decathlon,let usconsiderwhat noncontestusesmight be made of the scores.The

most obvious use would be as a measureof the relative advantage of different training methods. Supposewe had two competingtraining methods-for example, one emphasizingstrengthand the other endurance. We could then conduct an experiment in which we randomly assignedathletes to one or the other of these two methods. In a pretest, we could get a decathlon score for each competitor and then another after the training period had ended. We couldthen rate each method’s efficacy as a function of the mean improvement in total decathlonscore.While one might find this an acceptablescheme, it may be less than desirable. Unless all events showed the same direction of effect, some athletes might profit more from a training regime that emphasizesstrength: others might need more endurance.It seemsthat it would be far better not to combine scoresbut, instead, to treat the 10 component scoresas a vector. Of course,each competitor would almostsurely want to combine scoresto seehow much his total had increased,but that is later in the process.The measurement task, from which we are trying to understandthe relation between training and performance, is better done at the disaggregatedlevel. It is only for the contest portion that the combination takes place. We concludethat scoringmethodsthat resemblethose used in the decathlon can only be characterizedas measurementin an ordinal sense. And thus, the measuresobtained are only suitable for crude sorts of inferences. When Is a Contest Fair?

In addition to the requirement of an ordinal scale, fair measurementalso requiresthat all competitorsknow the rulesin advance,that the samerulesmust apply to all competitorsequally, and that there is nothing in the rules that gives one competitor an advantageover another becauseof some characteristicunrelated to the competition. How well do the decathlon rules satisfy thesecriteria? Certainly the scoring rules, arcane as they might be, are well known to all competitors,and they apply evenhandedlyto everyone. Moreover, the measurements in each event are equally accurate for every competitor. Thus, if two competitors both throw the shot the same distance, they will get the same number of points. Last, is a competitor placed at a disadvantagebecauseof unrelated characteristics?No; each competitor’s score is determined solelyby his performance in the events. We conclude that the decathlon’s scoringrules comprisea fair contesteven thoughthey comprisea somewhat limited measuring instrument. The decathlon representsa good illustration of what can be done with multidimensional tests. Sensiblescoringcan yield a fair contest, but it is not good measurement.There has been an attempt to somehowcount all eventsequally, balancing the relative value of an extra inch in the long jump againstan extra secondin the 1,500 meter run. But no one would contend that they are matched in any formal way. Suchformal matchingis possible,but it requiresagreementon the metric. The decathlonis a multidimensionaltest, but it is not big choiceaswe have previouslydefined it. Every competitor provides a scorein each event (on every item). How much deteriorationwould result if we add big choiceinto this mix? 188

011 Exarttirtee Choice itt Tcshg

Big choicemakesthesituationworse.One maybe ableto inventscoringrules that yield a fair contest but do not give an accurate measurement. As one example, consider ABC’s “Super Star’s Competition,” a popular TV pscudosportin which athletes from varioussportsare gathered togetherto compete

in a scricsof scvcndifferentevents.The athleteseachselectfive eventsfrom amongtheseven.The winner of each cvcnt isawarded 10 points, secondplrlce7, third place5, and so on. The overall winner isthe one who accumulatesthe most points.Someeventsare“easier”thanothersbecause fewerand/orlesserathletes electedto competein that event;nevertheless, the samenumberof pointsare awarded.Thisis bigchoiceby our definition,in that thereare eventsthat some athletescouldnotcompetein (i.e., Jot Frazier,a former world champion boxer, chosenot to competein swimmingbecausehe could not swim). Are thescoresin sucha competition measurement?No. Is the contestfair? By the rulesof fairness the data makes checking key underlying assumptionsproblematic. The current state of the art allows us to use big choice in a multidimensional context and, under limited circumstances,to have fair contests.WC cannot yet have measurement in this context at a level of accuracy that can be called anything other than crude. As such, we do not believe that inferencesbasedon suchproceduresshould depend on any characteristicother than their fairness. This being the case, usersof big choice should work hard to assurethat their scoringschemesare indeed as fair as they can make them. Wainer (1993) and Wainer and Deveaux (1994) provide two detailed case studiesdescribing how this might be accomplished. Whereis it nof fair? Paul Holland (Allen, Holland, & Thayer, 1993, p. 5) calls big choice “easy choice,” becauseoften big choiceis really no choiceat all.

describedabove,yes, althoughthe missingncss of some of

Considera choiceitemin whichanexamineeisaskedto discuss theplotof either (a) The Pickwick Papers or (b) Crime and Punishment from a Marxistperspective. If thestudent’steacherchoseThe Pickwick Papers, therereallyisnochoice. At least, the studenthad no choice. Becausemany times in a bigchoicesituation the examineereaIly hasno choice,in that it is not plausibleto answerany but a singleoption,fairnessrequiresthe variousoptionsto be of equaldifficulty.This returnsus to the primary point of this account.How are we to ascertainthe relativedifficultyof big choiceitems? Is Big Choice Useful When the I‘c ’ st’s Goal Is to Itrduce Social CJlange?

If wewishto usethe test to influence instruction,we mightevahratethesuccess of the enterprise by surveying the field before and after the test became widespread.But this is surely only a superficial goal. The primary goal is not the structureof instructionbut ratherthe effectsof that instructionon the students. Thus, any attempt to measurethe efficacyof an intervention(in this casea particularkind of test structure)musteventuallyusesomesort of measuring instrument.We mustalsopay carefulattentionthat the useof a testto induce changedoesnot compromiseits fairness.We knowof onestandardized science test that introduced a very easy item on a new topic asa possiblechoice.The goal was to influence teachers to cover this new area. Examineeswhoseteachers covered this topichad a distinctadvantageoverexaminees whoseteachershad not. Since the choicewasreallymademonthsbeforethetest,andby theteacher, , not the student,is this fair? 189

Wainer and Tlriksert

Conclusions This summary of researchis far from conclusive;many questionsremain. It would be good to know how far away from unidimensionality a test can be and still yield acceptable measurement when choices are allowed. How far from ignorable can nonresponsebe and still be acceptably adjusted for statistically? What kinds of conditioning variablesare helpful in suchadjustments?What are the most efficient kinds of data-gathering designs?Such questionslend themselvesto solutionsthrough careful experimentation and computer simulation. Can the uncritical use of choice lead us seriously astray? While there are several sourcesof evidencesummarizedin this article about the size of choice effects, we focusedon just one seriesof exams.Summariesfrom other sources, albeit analyzed in several different ways, lead us to believe that the Advanced PlacementTests,often referred to becausethey currently involve choice, are not unusual.In fact, they may be considerablybetter than average.A recent experience with allowing choice in an experimental SAT is instructive (Lawrence, 1992). It haslong been felt by math teachersthat it would be better if examinees were allowed to use calculators on the mathematics portion of the SAT. An experiment wasperformed in which examineeswere allowed to use a calculator, if they wished. The hope was that it would have no effect on the scores. CaIculatorsdid improvescores.The experimentalsoshowedthat examineeswho used more elaborate calculators got higher scores than those who used more rudimentary ones. Sadly, a preliminary announcement had already been made indicating that the future SAT-M would allow examinees the option of using whatever calculator they wished, or not using one at all. A testingsituation correspondsto measuringpeople’s heightsby having them stand with their backsto a wall. Allowing examineesto bring a calculator to the testing’situation, or not, but not knowingfor sure whether they had one, or what kind, correspondsto having some personsto be measured for height, unbeknownst to you, bring a stool of unknown and varying height on which to stand. Accurate and fair measurement is no longer possible in either case. Our discussionhas concentrated on explicitly defined choice in tests, or aherrrative queshms in the languageof the first half of this century. However, in the case of portfolio assessment,the element of choice is implicit and not amenable to many of the kinds of analysis that have been described here. Portfolio assessmentmay be more or less structured in its demands on the examinee-that is, it may specify the elements of the portfolio more or less specifically. However, to the extent that the elements of the portfolio are left to the choice of the examinee, portfolio assessmentmore closely resemblesABC’s “Super Star’s Competition” than even the decathlon. In portfolio assessment, how many forms of the test are created by examinee choice? Often, as many as there are examinees!If that is the case,can those forms be statisticallyequated? No. This fact has clear consequencesin the results obtained with portfolio assessment;for instance, Koretz, McCaffrey, Klein, Bell, and Stecher (1992) report that the reliability of the 1992 Vermont portfolio program measureswas substantiallylessthan is expectedfor useful measurement. Can it be otherwise, when the examinees (effectively) constructtheir own tests?” Is building examinee choice into a test possible? Yes, but it requires extra work. Approaches that ignore the empirical possibility that different items do 190

On Exantirtee Choice irt Testing not havethesamedifficultywill not satisfythe canonsof goodtestingpractice,

nor will they yield fair tests.But, to assess the difficulty of choiceitems,onemust

haveresponses from an unselcctcd sampleof fully motivatedexaminees.This requiresa specialsort of data gatheringeffort. What arc we estimatingwhen WCuseexamineesclcctcditems?If WCMC interestedin O,,,M, thenweneedto choosetheitemsfor theexaminees. The belief that the estimateof 0Afarobtained from cxamincc select4 items is (7ccuratchas

beendisconfirmedby the data gatheredso far. Although thesedata are of modestscope,theyindicsrtc whatsortsof data needto bc gathcrcdto exuminc this questionmore fully. What can we do if the assumptionsrequired for equating are not satisfied acrossthe choice items? If test forms arc built that cannotbe equated(made comparable),scorescomparingindividualson incomparableformshavetheir validity compromisedby the portion of the test that is not comparable. Thus, we cannot fairly allow choice if the processof choosing cannot be adjusted away. Choice is anathema to standardizedtesting unlessthose aspectsthat charactcrizc the choice are irrelevant to what is being tcstcd. Notes ’ SectionII of the 1921examaskedthe examineeto answer5 of 26 questions.This aloneyieldedmorethan 65 thousanddifferentpossible“forms.” When coupledwith SectionIII (pick one essaytopic from among15) and SectionI (“Answer 1 of the following3”), we arrive at the unlikely figure shownin Table 1. ‘We will useboth the languageandnotationof item responsetheory (IRT). This is not necessary; our argumentcouldbe phrasedin traditional true scoretheoryterms. WCchoseto placethis argumentwithin an IRT frameworkbecauseit allowsgreater precisionof explanation.This is espccinlly important in later sectionswhcrc being ex licit about the estimandand the assumptions is critical. PWe are oversimplifyingIRT scoringhere; for most IRT models, the scoreassociatedwith eachitem responseactuallydependson the other item responses, andso it may be different for each examinee. ‘This variesa little from Little and Rubin’s (1987) conception.They wouldrequire the independenceof choice given some observedconditioning variable. In our construction,the conditioningvariable, 0, is latent. In any operational test, this differenceisonly a technicalone for, whenthetestislongish,rawscore(observable) and 0 can be transformedfrom one to the other easily. This does not apply in situationslike adaptive testingin which raw scoreis unrelated to 0. ‘A samplingprocessis noninformativein this case if, by knowing an individual’s choice,we learn nothing about how well they will do on the item. 60ur thanksto Nick Longford for pointing this out to us. ‘Our colleagueNick Longfordcommentedthat this “suits perfectly the current American culture in which no one ever actually does anything but is concerned insteadwith management.” ‘It is not uncommonin educationfor innovationsto be tried without an explicit designto aid in determining the efficacy of the intervention. Harold Gulliksen (personalcommunication,October26,1965) wasfond of recountingthe responsehe receivedwhenhe askedwhat the controlconditionwas againstwhich the particular educationinnovationwas to be measured.The responsewas “We didn’t have a control becauseit was only an experiment.” ‘Actually, it only predicts accurately for top-ranked competitors, who tend to perform “equally” well in all events. There are some athleteswho are very much

191

Wairter and

Tlukw~t

better in oue event or another, but they tend to have much lower overall performance than generalists who appear more evenly talented. ‘“The Olympic Decathlon scoringruleswere first establishedin 1912 and allocated 1,000 points in each event for a world record performance. These scoringruleshave been revised in 1936, 1950, 1964, and 1985. It is interesting to note (Mislevy, 1992) that the 1932goldmedalwinnerwouldhavefinishedsecondunderthe current(1985)

rules. “The idea of portfolio assessment includestwo components,one of which is examineechoiceof material to submit,and the other is that the material is collcctcd oversomelonger period of time than in a conventional test. The latter idea, collecting responses over a long periodof time, is certainly a usefulone. However,the former idea, letting the examineeschoosetheir test, leadsto noncomparable(and unreliable) scores.Long-term data collection is certainly possiblewith welLspecified prompts,questions,or itemsthat leavethe examineeno choice.Portfolioassessment would provide better measurementto the extent that the element of choice was removed. I2We are grateful to Charles Lewis who suggestedthis analog for reliability, provideda derivation, and cautionedagainstits too broad usage. APPENDXX

I2

How can we calculate a reliability coefficient from the classificationof examinees by their choice of items? Let us assume that we know the mean proficiency of alt examineesin each choice group. We will index examinees byj and choice groups by i, and the model we use is 4,

=

Pi

+

ZiJ

(Al)

where the proficiency of personj in group i is 0, and is distributed normally with mean p+and variance 1. We represerltthe deviation of eachpersonj within group i from that group’smeanas zlJ. If we estimate 8, with pLr,the mean of group i, that is

61, =

VW

Pi*

The correlation between8, and&, isanalogous to validityif we thinkof 0, asthe analogof true score. The square of this correlation can be thought of as a measure of reliability. Keeping this in mind, we can derive a computational formula for r2(6,,, 4,)

by notingthat

r2(6iJ, e,,) = [cw(~,,ei,~l’/[v~~(ei,) x W6dl=

VW

In the numerator,

COV(~,,, e,,) = coWeiJ 19, J3(6,, 1i)] e+ E[COV(B,,, 8, I i)] =

covb,

pi)

=

var(p,).

+

E[Cov(Oi,,

p, I i)]

The rightmosttermin theinitialexpression (E[Cov(b,,,8,,I i)]} iszero,andhence the expression reducesto the covariance of ~~with itself or the varianceof pd. 192

On Examirlee Choice irt Testiq

This is the expressionin the numcrator’that we need to compute (A3). The denominator requiresthe varianceof both 6, and O,,.These arc easily computed from Var(O,,) = Var[E(O, I’i)] + E[Var(O,j 1 i)] =

Vilr(~J

+

1,

(A4)

=

var(k,).

(AS)

and var(eij)

Substitutingthese results into (A3) yields r2($,, 01,) = var(p,)/[var(CLI) +

11

l

VW

The estimate of var(h) we obtained from SectionD of AP Chemistry is .17, and hence the estimated reliability [from (A(i)] is .15. Rckrcnccs Allen, N. L., & Holland, P. W. (1993). A model for missinginformationabout the group membershipof examincesin DIF studies.In P. W. Holland & I-I. Wainer (Eds.), Dq[crenliat item functiohg (pp. 241-252). Hillsdale, NJ: Erlbaum. Allen, N. L., Holland, P. W., StThayer, D. T. (1993). The optional essayprobIem and the hypothesisof equaldifficulty (ETS Tech. Rep. No, 93-94). Princeton,NJ: EducationalTestingService. Bennett, R. E., Rock, D. A., Braun, H. I., Frye, D., Spohrer,J. C,, & Soloway,E. (1991). The relationshipof expert-system scoredconstrainedfree-responseitems to multiple-choiceand open-endeditems. Applied Psychological Measuremcrtt, I#, 151-162.

Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalenceof free-responseand multipIe-choiceitems. Jowlal of Educational Measurerrlertt, 28, 77-92. Brigham, C. C. (1934). The reading of the comprehensiveexamination irr [email protected] Princeton,NJ: PrincetonUniversity Press. Cizek, G. J. (1993). Rethinking psychomctricians’ beliefs about learning. Educational Researcher,22(4), 4-9. CollegeEntranceExaminationBoard. (1905). Qucstiotzssetat the exarrrirtatiorlsheld June 19-24, 1905. New York: Ginn. CollegeEntranceExaminationBoard. (1990). T/teJ989Advartced PlacemcrrtExamirtatiorts irt Chemistry and their [email protected] Princeton, NJ: Advanced Placement Programs. DeMauro, G. E. (1991). The effectsof the availability of alternativesand the useof multiple choice or essayanchor tests011cortstructedresponseconstructs(Draft Report). Princeton,NJ: EducationalTestingService. Dorans, N. J. (1990). Scalingand equating.In H. Wainer with N. J. Dorans, R. Flaugher,B. F. Green, R. J. Mislevy,L. Steinberg,and D. Thissen,Computerized adaptive testing:A primer (pp. 137-160). Hillsdale, NJ: Erlbaum. Fitzpatrick,A. R., & Yen, W. M. (1993, April). The psychometriccharacteristics of choiceitems. Paperpresentedat the Annual Meeting of the National Council on Measurementin Education, Atlanta. Fremer,J., Jackson,R., & McPeek,M. (1968).Reviewofthepsychonretriccharactcristicsof the Advanced PlacementTestsitt Chemistry,American History, and French

(Internal Memorandum). Princeton,NJ: EducationalTestingService. 193

Waitrer and Thissen Glynn, R. J., Laird, N. M., & Rubin, D. B. (1986). Selection modcling versus mixture modeling with nonignorable nonresponse. In II. Wainer (Ed.), Drawing irrfererrces from seZfhelectedsamples(pp. 115-142). New York: Springer-Verlag. Gulliksen, H. 0. (1950). A f/reory ofr!tental fesls. New York: Wiley. (Reprinted, 1987, Hillsdale, NJ: Erlbourn). HoIland, I? W., & Wainer, H. (1993). D iff erential item functioning. Hillsdale, NJ: Erlbaum. Joreskog, K. J., & S&born, D. (1986). PRELIS: A program for multivariate data screenhg altd data summarizatiort. Chicago, IL: Scientific Software. Jlireskog, K. J., & S&born, D. (1988). LISREL 7: A guide IO the program arid applications. Chicago, IL: SPSS. Kierkegaard, S. (1986). Eiherlor. New York: Harper & Row. Koretz, D., McCaffrey, D., Klein, S., Bell, R., & Stecher, B. (1992). The reliability ofscoresfrom the 1992 Vermont Portfolio AssessmerttProgram (Interim Report). Santa Monica, CA: RAND Institute on Education and Training. Krantz, D. H., Lute, R. D., Suppes, I?, & Tversky, A. (1971). Fourrdatiorzsof measurement,Vol. 1. New York: Academic. Lawrence, I, (1992). Effect of calculator use on SAT-M score cotrversiortsartd equating (Draft Report). Princeton, NJ: Educational Testing Service. Little, R. J. A., & Rubin, D. B. (1987). Statisticalanalysis with missingdata. New York: Wiley. Lukhele, R., Thissen, D., & Wainer, H. (1993). On the relative value of mulfiplechoice,free-response,artd ei;amirtee-selected items in two achievementtests(ETS Tech. Rep. No. 93-28). Princeton, NJ: Educational Testing Service. (Also in press, Journal of Educational Measurement, 3 I .) Mislevy, R. J. (1992). Linking educationalassessmert&s: Concepts, hues, mehods, and prospects(Draft Report). Princeton, NJ: Educational Testing Service. Pomplun, M., Morgan, R., & Nellikunnel, A. (1992). Choicein AdvarrcedPlacement Tests(Unpublished Statistical Report No. SR-92-51). Princeton, NJ: Educational Testing Service. Popham, W. J. (1987). The merits of measurement-driven instruction. Phi Delta Kappart, 68, 679-682. Powers,D. E., Fowles, M. E., Farnum, M., & Gerritz, K. (1992). Giving a choiceof topicson a testof basicwritingski&: Does it make any difference(Research Report No. 92019)? Princeton, NJ: Educational Testing Service. Stout, W. F. (1990). A new item responsetheory modeling approach with applications to unidimensionality assessmentand ability estimation. Psychometrika, 55, 293-325. Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item responsetheory in the study of group differences in trace lines. In H. Wainer & H. Braun (Eds.), Test validhy (pp. 147-169). Hillsdale, NJ: Erlbaum. Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parametersof item responsemodeis. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Erlbaum. Thissen, D., Wainer, H., & Wang, X. B. (1993). How unidimensiortal are tests comprisirrgbotlt multiple-choicealtd free-responseitems?An analysis of two tests (ETS Tech. Rep. No. 93-32). Princeton, NJ: Educational Testing Service. (Also in press, Journal of Educational Measurement, 31.) Torrance, H. (1993). Combining measurement-driven instruction with authentic assessment:Some initial observations of the national assessmentin England and Wales. Educational Evaluatiort and Policy Arralysir, 15, 81-90. Waincr, II. (1993). How much more efficiently can humans run than swim? Clrntzce, 6, 17-21.

194

On Examirtee Choice irt [email protected]

Waincr, H., & Dcvcaux, R. (1994). Rcsizingtriathlonsfor fairness.Clrartce,7(l), 20-25. Wainer, H., Sireci, S. G., & Thissen, D. (1991). Differential testlet functioning: Definitions and detection. Journal oJ Educatiottal Measurement, 28, 197-219. Wainer, I-I., & Thisscn,D. (1993b). Choosirtg:A test(ETS Tech. Rep. No. 92-25). Princeton,NJ: EducationalTestingService. Wainer, H., Wang, X. B., & Thissen,D. (1991). Mow well can WCequate testJorms by cxnttrirzccs (Tech. Rep.No. 91-15)?Princeton, NJ: Educathat are corrs/rrtc/ed tionalTestingScrvicc. (Also in press,Jourru~loJ EdrrcatiortalMeasurcmcrtt,31,) Wainer,I-I,, &Wright, B. D. (1980). Robustestimationof ability in the Raschmodel. Psychomctrika, 45, 373-391. Wang, X. B. (1992). Achieving equity irt sel/-scfectedsubsep of test items. Unpublisheddoctoraldissertation,University of Hawaii at Manoa, Honolulu. Wang,X. B., Wainer, H., & Thissen,D. (1993). 011the viability ofsome urttestable assumptionsin equating tx(1771s tltat allow exnmineechoice (ETS Tech. Rep. No. 93-31). Princeton,NJ: EducationalTestingService. Authors

HOWARD WAINER is Principal ResearchScientist,EducationalTestingService, T-15, 666 RosedaleRd., Princeton, NJ 08541. He specializesin statisticsand psychometrics. DAVID THISSEN is Professorand Director of the GraduateProgramin Quantitative Psychologyand Acting Director of the L. L. ThurstonePsychometricLaboratory at the University of North Carolina, Chapel Hill, CB# 3270, Davie Hall, Chapel Hill, NC 2759903270.He specializesin psychometricsand quantitative psyc11010gy. ReceivedJune 25, 1993 RevisionreceivedSeptember20, 1993 Accepted October 4, 1993

195