Detecting Construct-Irrelevant Variance in an Open-Ended, Computerized Mathematics Task

Ann Gallagher Randy Elliot Bennett Cara Cahalan

GRE Board Report No. 9513P

October 2000

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle. EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, the modernized ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service. Educational Testing Service, Princeton, NJ 08541. Copyright © 2000 by Educational Testing Service. All rights reserved.

Abstract

The purpose of this study was to evaluate whether variance due to computer-based presentation was associated with performance on a new constructed-response type -- Mathematical Expression -- that requires examinees to build mathematical expressions using a mouse and an on-screen tool palette. Participants took parallel computer-based and paper-based tests consisting of Mathematical Expression items, plus a test of their skill in entering and editing data using the computer interface. Comparisons of mean performance, reliability, speededness, and relations with external indicators were conducted across the paper-based and computer-based tests; also, computer-based math score was regressed on edit/entry score after controlling for paper-and-pencil math score and background information. Although no statistical evidence of construct-irrelevant variance was detected, some examinees reported mechanical difficulties in responding and indicated a preference for the paper-and-pencil test.

Keywords: Computer-based testing, Item sets, Mathematics, Speededness

Table of Contents

Introduction
Method
    Participants
    Instruments
    Procedure
    Data Analysis
Results
Conclusion
Tables and Figures
References
Author Note
Appendix

List of Tables

Table 1. A Mathematical Expression Key and Example Responses
Table 2. Means, Standard Deviations, and Coefficient Alpha Reliabilities for Mathematical Expression and Edit/Entry Tests
Table 3. Correlations Between the Mathematical Expression Test, Edit/Entry Test, and Other Variables
Table 4. Hierarchical Multiple Regression of Computer-Based Mathematical Expression Scores on Paper-and-Pencil Scores, Background Variables, and Edit/Entry Test
Table 5. An Example Paper-and-Pencil Response That Would Not Have Fit in the Computer-Based Mathematical Expression Answer Box

List of Figures

Figure 1. The Mathematical Expression interface with an example item and a correct response

Introduction

One of the promises of computer-based testing is the ability to present examinees with open-ended tasks that are more like the ones they encounter in academic and work settings (Bennett, 1993). Mathematical Expression (ME) is one such response type. ME was created as part of an experimental test for admission to quantitatively oriented graduate programs. This response type can be used with any question for which the answer is a rational expression, including questions that ask the examinee to mathematically model a problem situation. ME is particularly exciting because it permits the developers of computer-based mathematics tests to use automatically scorable, open-ended items, the correct answers to which may take many different surface forms (see Table 1 for an example key and a few equivalent responses). Because these responses can be scored in real time using symbol manipulation techniques, ME items can be included in computer-adaptive tests.
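The report does not describe the scoring engine beyond "symbol manipulation techniques"; purely as an illustration, an equivalence check of that general kind can be sketched with the open-source SymPy library and the Table 1 key (this is a hypothetical sketch, not the ETS implementation):

```python
"""Illustrative sketch only (not the ETS scoring engine): check whether an
examinee's expression is algebraically equivalent to the key using SymPy."""
from sympy import simplify, sympify

key = sympify("(m - 2*p)*(n - 2*p)/4")

# Two of the equivalent surface forms shown in Table 1.
responses = ["(n - 2*p)*(m - 2*p)/4", "p**2 - p*n/2 - p*m/2 + m*n/4"]

for text in responses:
    # A difference that simplifies to zero means the response matches the key.
    is_correct = simplify(key - sympify(text)) == 0
    print(text, "->", "correct" if is_correct else "incorrect")
```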

In delivering a test on computer, one key concern is finding a way for examinees to respond that is insensitive to individual differences in computer familiarity. For open-ended items, the challenge is particularly complex. By definition, these items require examinees to enter more information and, thus, could potentially require greater computer skill.

In developing the ME interface, considerable care was taken to keep computer-skill requirements to a minimum. For example, the interface is completely mouse driven: Examinees build their expressions by clicking symbols in an on-screen palette (see Figure 1). This strategy circumvents the need for keyboard facility as well as the problem that some mathematical symbols have no keyboard equivalents. On the palette, digits and arithmetic operators appear in the standard calculator configuration, which makes them easy to find.

In addition, the interface provides for exponent and subscript modes, so that users do not have to enter syntactic markers, such as carets, to denote these positions. The user simply clicks on the Exponent or Subscript button to make the next number he or she selects appear in the intended position. The interface also provides graphical displays of complex expressions involving division that use a horizontal division bar rather than the less visually meaningful slash. The natural, graphical representation of exponents, subscripts, and division makes it easier for users to parse expressions they have just entered and minimizes the chances of a mismatch between the system's interpretation of an expression and the user's intention.

To limit construct-irrelevant errors (such as typos), and to facilitate interpretation and scoring, the ME interface imposes certain minimal constraints on the entry of expressions. For example, the interface disables certain buttons on the tool palette based on the entry mode selected. If the user has selected exponent mode, for instance, the interface disables the entry of certain mathematical operators, like multiplication and division, as well as alphabetic characters. Also, when users submit their final answers, the interface checks these expressions for syntactic correctness and flags those that display inappropriately juxtaposed operators (e.g., a multiplication symbol followed immediately by a division symbol), malformed numbers (e.g., a number containing two decimal points), or unbalanced parentheses (a rough illustrative sketch of such checks appears below).

The ME interface obviously requires some orientation. To accomplish this, a brief tutorial is used to familiarize examinees with the response type prior to taking the test. The tutorial introduces the symbol palette and demonstrates how examinees can formulate expressions using the Subscript and Exponent buttons, the variable and constants menu (accessed by pressing the a-z key shown in Figure 1), and other features.

Although every effort was made to design an ME interface that required minimal computer skill, building an expression with the interface is still a more complex task than writing one with paper and pencil. For this reason, facility with the ME interface could well produce an unwanted performance effect. Preliminary evidence provided by Bennett, Steffen, Singley, Morley, and Jacquemin (1997) seems to indicate that ME tasks do not introduce any more construct-irrelevant variance than do other task types. These investigators compared the functioning of ME items to other computer-delivered item types, including standard multiple-choice questions, questions requiring entry of numeric values, and questions asking the examinee to shade portions of a coordinate system. Their results showed that ME items have roughly the same distribution of difficulty as these other response types. In addition, ME questions had item-total correlations similar to those for the other items. Third, ME items took no longer to answer than other constructed-response problems written to measure mathematical modeling skills (though both types took longer than multiple-choice modeling questions). Finally, ME showed gender differences comparable to those for the other quantitative questions.

Whereas the data provided by Bennett et al. (1997) are encouraging, they provide only an indirect evaluation of whether the ME interface introduces irrelevant variance. In the current study, our goal was to test more directly the hypothesis that individual differences in facility with the ME interface affect performance on computer-based mathematical tests.
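As a rough illustration of the kinds of syntactic checks described above, the following sketch flags juxtaposed operators, malformed numbers, and unbalanced parentheses; it is purely hypothetical and not the actual ME interface code:

```python
"""Hypothetical sketch of the syntactic checks described in the text; not the
actual ME interface code."""
import re

def find_syntax_problems(expression: str) -> list[str]:
    problems = []
    compact = expression.replace(" ", "")
    # Inappropriately juxtaposed operators, e.g. a multiplication symbol
    # followed immediately by a division symbol.
    if re.search(r"[+\-*/][*/]", compact):
        problems.append("juxtaposed operators")
    # Malformed numbers, e.g. a number containing two decimal points.
    if re.search(r"\d*\.\d*\.", compact):
        problems.append("malformed number")
    # Unbalanced parentheses.
    depth = 0
    for ch in compact:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            break
    if depth != 0:
        problems.append("unbalanced parentheses")
    return problems

print(find_syntax_problems("(m - 2*p)*(n - 2*p)/4"))  # []
print(find_syntax_problems("3*/x + (1..5"))           # flags all three problems
```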


Method

Participants

We recruited 226 volunteers from 10 colleges and universities located in different regions of the United States to participate in this study. Of these individuals, 48 were eliminated because they either were not enrolled in quantitatively oriented undergraduate majors or were not close to making the transition to graduate school. Of the 178 remaining participants, 57% were college seniors and 43% were first-year graduate students. Thirty-six percent of the participants were women and 79% were U.S. citizens. The racial/ethnic distribution of the sample was 58% White, 15% Asian American, 13% Hispanic, 7% other, and 5% Black. Most participants (53%) reported an undergraduate major in engineering, with the remainder distributed among mathematics (23%), physical science (16%), and computer science (8%). The largest group (47%) indicated an intention to pursue a master's degree, while many (35%) said they would be pursuing doctorates.

Of the 178 students in the sample, 75 (42%) reported a score from the quantitative section of the Graduate Record Examinations (GRE®) General Test. Of these 75 participants, most (71%) were first-year graduate students, and very likely a more select group than the sample as a whole. The mean score of those participants reporting GRE scores was 759 (SD = 41), which is substantially above the average scores for all of their undergraduate fields. For example, in our sample, engineering majors had a mean GRE quantitative score of 760, whereas in the 1995-96 academic year, students intending graduate study in engineering scored a mean of 687 (Graduate Record Examinations Board, 1997).

All but one of our participants reported an undergraduate grade-point average (UGPA). UGPA data were reported in six categories ranging from "Below 1.5" to "3.5-4.0," with the latter marking the high end of the scale. Most participants reported a UGPA of either 3.5-4.0 (41%) or 3.0-3.49 (33%).

Instruments

Mathematical Expression test. Two 16-item ME tests were created for the study. These tests were designed to contain equal proportions of easy and difficult items, based on both mathematics content and the procedural complexity of entering the response.

Edit/entry test. This computer-based test was designed to measure participants' skill in using the ME interface. The test consisted of five editing items and five entry items. Editing items required the examinee to modify a given mathematical expression to match a given example. Entry items asked the examinee to enter a given expression. Editing and entry items were designed to cover a range of difficulty, with emphasis on mathematical expressions that were somewhat more complex than those that would normally appear on an operational mathematical reasoning test.

Questionnaire and interview. Participants also completed a questionnaire about their personal background, computer experience, perception of the ME tasks, and plans for graduate study. A debriefing interview was conducted to ensure that important information about the interface was not overlooked and to respond to any questions or concerns subjects may have had.

Procedure

Each examinee took part in a three-hour session, for which they received $45. All individuals took both ME tests, one on paper and the other on computer, with one hour allotted for each test. Students were assigned randomly to one of four order conditions:

• ME test 1 on computer, ME test 2 on paper, edit/entry test
• ME test 2 on computer, ME test 1 on paper, edit/entry test
• ME test 1 on paper, ME test 2 on computer, edit/entry test
• ME test 2 on paper, ME test 1 on computer, edit/entry test

The edit/entry test was administered after the ME tests to avoid providing additional practice to students before taking the computer-based ME test. The session concluded with the questionnaire and debriefing interview.

Data Analysis

To locate evidence of irrelevant variance due to the ME interface, we conducted several analyses. The first set of analyses was targeted at determining the extent to which the paper-and-pencil ME test forms were approximately equivalent to their computer-delivered counterparts. To the extent that they were equivalent, we presumed the case for irrelevant variance would be considerably weakened.


To assess equivalence, we first compared coefficient alpha reliabilities across test modes -- computer versus paper-and-pencil -- within each ME test form. Second, we compared mean scores resulting from different test modes within test forms, and vice versa. For the former comparison, we used a between-subjects one-way analysis of variance for each test form, with ME scores as the dependent variable and test mode as the independent variable. For the latter comparison, we used a between-subjects one-way analysis of variance for each test mode, with ME score as the dependent variable and test form as the independent variable.
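By way of illustration (with placeholder data rather than the study's data files), the two equivalence checks could be computed along these lines:

```python
"""Minimal sketch (placeholder data, not the authors' analysis code):
coefficient alpha for an item-score matrix and a between-subjects one-way
ANOVA comparing delivery modes within a single ME test form."""
import numpy as np
from scipy import stats

def coefficient_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha from an (examinees x items) score matrix."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(0)
# Placeholder right/wrong scores on a 16-item ME form, split by delivery mode.
computer_items = rng.integers(0, 2, size=(45, 16))
paper_items = rng.integers(0, 2, size=(44, 16))

print("alpha, computer mode:", round(coefficient_alpha(computer_items), 2))
print("alpha, paper mode:   ", round(coefficient_alpha(paper_items), 2))

# Mode comparison: one-way ANOVA on total scores with test mode as the factor.
f_stat, p_value = stats.f_oneway(computer_items.sum(axis=1), paper_items.sum(axis=1))
print(f"mode effect: F = {f_stat:.2f}, p = {p_value:.3f}")
```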

Third, we looked at speededness across test modes within each test form, computing the proportion of students completing the test and the proportion reaching all but the last item. These measures are, at best, a very loose approximation of speededness and one that is not precisely comparable across test modes, because in computer mode, we required examinees to respond to an item before they could be presented with another item -- something we could not control on the paper version. As a result, participants' skipping behavior is readily detected on paper as a blank response; on computer, omits are less obvious as test takers could skip questions simply by making any response.
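The report does not name the test used for these completion-rate comparisons; the standard large-sample choice for comparing two independent proportions is the pooled two-proportion z statistic:

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}.
$$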

Fourth, we compared the pattern of relations of the paper-and-pencil and computer test modes with other variables, including edit/entry scores, GRE quantitative scores, undergraduate major (coded as engineering vs. other), gender, and level of education (college senior vs. first-year graduate student).¹ For this and subsequent analyses, we combined ME scores across test forms within computer and paper-and-pencil test modes to increase statistical power. To achieve this combination, we first standardized participants' ME scores for each 16-item test form within each mode, and then we collapsed them across the order conditions.
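As a concrete, hypothetical illustration of that combination step (the column names and values below are placeholders, not the study data):

```python
"""Minimal sketch (not the authors' code): standardize ME scores within each
form-by-mode cell, then pool across forms to get one score per mode."""
import pandas as pd

# Placeholder long-format records; the real data would have 178 participants.
scores = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3, 4, 4],
    "mode": ["computer", "paper"] * 4,
    "form": [1, 2, 1, 2, 2, 1, 2, 1],
    "me_score": [12, 9, 10, 7, 6, 11, 8, 13],
})

# z-score within each form x mode cell ...
scores["me_z"] = scores.groupby(["mode", "form"])["me_score"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=1)
)
# ... then collapse to one standardized score per participant and mode.
combined = scores.groupby(["participant", "mode"])["me_z"].mean().unstack()
print(combined)
```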

For our second set of analyses, we used hierarchical multiple regression to examine the extent to which skill in using the ME interface was directly related to performance on the computer-based ME test. For this analysis, we used ME score on the computer-delivered test as the dependent variable. We first entered paper-and-pencil ME score into the equation, followed by background information -- major (coded as engineering vs. other), level of education (college senior vs. first-year graduate student), and gender -- to control for any group differences in computer-based ME performance. Finally, we entered edit/entry score -- our measure of mechanical skill in responding to the computer-based ME test.

¹ We used "engineering versus other" for undergraduate major because just over half of our sample indicated an engineering major.


Here, we presumed that any significant effect for edit/entry score, after controlling for paper-and-pencil ME score and background information, would suggest construct-irrelevant variance due to lack of facility with the ME interface.
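A sketch of this three-step hierarchy, again with placeholder data, could look like the following; the quantity of interest is the R-squared increment at the final step:

```python
"""Minimal sketch (placeholder data, not the authors' code) of the hierarchical
regression: computer-based ME score on (1) paper ME score, (2) background
variables, (3) edit/entry score, tracking the R-squared gain at each step."""
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 178  # sample size in the study; the values below are simulated
paper_me = rng.normal(size=n)
engineering = rng.integers(0, 2, size=n)
grad_student = rng.integers(0, 2, size=n)
gender = rng.integers(0, 2, size=n)
edit_entry = rng.normal(size=n)
computer_me = 0.8 * paper_me + rng.normal(scale=0.6, size=n)

steps = [
    ("paper ME", np.column_stack([paper_me])),
    ("+ background", np.column_stack([paper_me, engineering, grad_student, gender])),
    ("+ edit/entry", np.column_stack([paper_me, engineering, grad_student, gender, edit_entry])),
]

previous_r2 = 0.0
for label, predictors in steps:
    fit = sm.OLS(computer_me, sm.add_constant(predictors)).fit()
    print(f"{label}: R2 = {fit.rsquared:.3f}, delta R2 = {fit.rsquared - previous_r2:.3f}")
    previous_r2 = fit.rsquared
```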

Results

Table 2 shows mean performance and coefficient alpha reliabilities for both test forms for both the computer-based and paper-based ME tests, and for the edit/entry measure. The reliabilities for the ME tests ranged from .79 to .85, with no indication of differences between the computer-delivered and paper-and-pencil versions. The reliability of the 10-item edit/entry task was .72.

Analyses of the mean scores showed no performance differences between the computer and paper test versions (F[1, 176] = .55, p > .05 for the first paper-and-pencil form vs. the first computer-based form; F[1, 176] = .29, p > .05 for the second paper-and-pencil form vs. the second computer-based form). There were mean differences, however, between the two ME paper-and-pencil forms (F[1, 176] = 25.99, p < .001), and between the two forms delivered on computer (F[1, 176] = 22.48, p < .001), suggesting that one form was harder than the other. Eta-squared was computed for each within-mode comparison and revealed effect sizes of .13 and .11 for the paper-based and computer-based tests, respectively. According to Cohen (1988), these eta-squared values are characterized as medium effect sizes.
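For a one-way design, eta-squared is simply the between-groups share of the total sum of squares:

$$
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}.
$$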

With respect to timing, 98% of those taking the paper version of ME test 1 finished the test, compared with 85% of those taking that test on computer, a statistically significant difference (z = 3.06, p < .01). For ME test 2, 85% completed the paper version and 90% finished the computer version, which was not a significant difference (z = -.91, p > .05). Regarding the percentages of participants who reached all but the last item on each test, the differences were significant for both ME forms, but in opposite directions. For ME test 1, 100% of those taking the paper version reached the next-to-last item, while 93% of those taking the computerized test went that far (z = 2.53, p < .05). For ME test 2, 87% of examinees taking the paper-and-pencil test completed the penultimate question, while 96% of those taking the computer-based test did so (z = -2.12, p < .05).

Table 3 shows correlations among ME score, edit/entry test score, and various external criteria after combining the standardized scores on the two ME forms. The observed correlation between the ME paper-based and computer-based scores was .78; corrected for attenuation, that value was .97, suggesting that the two modes were measuring the same construct.² Consistent with this suggestion, the ME computer-based and paper-based scores also showed the same pattern of relations with external criteria; no statistically significant differences were found between the correlation of the ME computer-delivered test with any given external variable and the correlation of the ME paper-and-pencil test with the same external variable (t range = -.40 to 1.69, df range = 72 to 175). Both ME versions were significantly related to UGPA, GRE quantitative score, gender, and level of education. Similarly, both ME tests were unrelated to the edit/entry test or to undergraduate major. Finally, the edit/entry test was unrelated to any measure of accomplishment -- GRE quantitative score, UGPA, or level of education -- suggesting that, although reliable, the construct it measured was generally irrelevant to academic study.

Table 4 presents the results of regressing computer-based ME score on paper-based ME score, background variables, and the edit/entry test. The paper-based ME score accounted for 61% of the variance in computer-delivered ME score (F[1, 176] = 272.92, p < .001). Adding the background information accounted for another 3% of the variance. Finally, and most importantly, no significant variance was attributable to the edit/entry measure.³

Compiled responses to the ME interface questionnaire can be found in the Appendix. With respect to computer familiarity, all participants indicated using a computer almost daily, and all but one indicated almost always using a mouse. Regarding the computer-based format, 57% found it easy to use the computer to take the ME test, 42% found it somewhat difficult, and 2% thought it was very difficult. Of those who found it somewhat or very difficult, the difficulty cited by the largest portion of participants (29%) was that the on-screen palette was hard to use. When asked if they had difficulty entering fractions, exponents/subscripts, or expressions involving square roots, 48% said that they had no difficulty with any of these functions, but 30% cited problems with entering fractions.

² The correction for attenuation requires a reliability for each measure and the correlation between the two measures. Because there were two paper-based ME forms and two computer-based ME forms, we estimated a reliability for the two paper-and-pencil measures by taking the (geometric) mean of their coefficient alpha reliabilities, and then estimated a reliability for the two computer-delivered measures in the same way. To estimate the relationship between the computer-delivered and paper-and-pencil measures, we computed the paper-computer correlation for each of the four administration orders and then took the mean of these four values using the r-to-z transformation.

³ We reran this regression including participants who had been eliminated because they either did not have quantitatively oriented undergraduate majors or were not close to making the transition to graduate school. Even with this larger and more diverse sample (n = 219), the results were substantively identical to those presented here.
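In standard notation (these are the textbook forms, not formulas quoted from the report), the correction and the averaging described in footnote 2 are

$$
r_{\text{corrected}} = \frac{r_{cp}}{\sqrt{r_{cc}\, r_{pp}}},
\qquad
z = \tfrac{1}{2}\ln\frac{1 + r}{1 - r},
\qquad
r = \tanh(z),
$$

where r_cp is an observed computer-paper correlation and r_cc and r_pp are the estimated reliabilities of the computer-based and paper-based scores.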

Polled as to whether they would prefer to take an ME test on computer or paper, 77% opted for paper-and-pencil and only 7% chose computer. Consistent with this preference, 44% of participants felt that taking the test on computer was more tiring than taking it on paper, compared with 15% who found it more tiring on paper and 41% who believed the two modes were equivalent. Finally, 48% thought that, had the test been real, they would have been more anxious about taking the computer-delivered test than they would the paper-and-pencil form; 43% would have felt about as anxious either way, and only 8% would have felt less anxious with the computer version.

Conclusion

This study found no strong evidence to support the hypothesis that individual differences in facility with the ME computer interface would affect performance on open-ended, computerized mathematics tasks. Mean performance, reliability, and relations with other variables were closely similar for both paper-and-pencil and computerized test modes. Although one computer-based test form appeared speeded relative to its paper-and-pencil counterpart, the reverse was true for the second test form, weakening any claim that speededness might be a result of lack of interface familiarity. Regression results also showed no signs of irrelevant variance connected with the ME interface. Our edit/entry test added nothing to the prediction of computer-based mathematical performance and, indeed, had about the same level of zero-order relationship to the computer-based ME test as it did to the paper-and-pencil one. These results complement the indirect evidence, reported by Bennett et al. (1997), that ME items function similarly to other computer-based response types (including multiple-choice) written to test advanced mathematical content.

Whereas the statistical evidence does not support the presence of an interface competency effect, examinee perceptions did suggest that the interface was not always easy to use. This perception came through most clearly with respect to the use of the on-screen palette, the method by which examinees create mathematical expressions. Using this palette is clearly more time-consuming and cumbersome than writing an expression by hand, especially if the expression is a complex one.

To better understand this phenomenon, we retrospectively sampled examinee paper-and-pencil responses and then tried to enter them on computer, finding that some paper responses were, in fact, too long for the on-screen answer box (see Table 5). We suppose that some examinees tried to enter such expressions on the computer-based ME test, but were forced to reformulate them to make them fit the


required frame. If this is so, these individuals were able to complete this reformulation quickly enough to avoid a negative impact on their scores (which we otherwise should have detected in our statistical analyses). With more stringent time limits than those imposed here, however, an effect might well have appeared.

The fact that some students had difficulty with the interface suggests that we should continue our efforts to improve it, or at least that we should make sure time limits are generous enough to allow for the mechanics of responding using the interface. In the end, however, it is hard to envision a mouse-driven interface that is as natural for entering mathematical expressions as paper and pencil. Given that, the ideal solution may be handwriting the expression on some digital surface that recognizes free-form symbolic input and that is connected to the computing device on which the testing software resides. This concept is evident in today's personal digital assistants, which recognize a form of textual entry.

While the current findings provide some insights, this study had several limitations. First, the sample size was relatively small, so marginal effects could not easily be detected. Second, for those who did report GRE quantitative scores, the mean was unusually high. Thus, our findings may not be generalizable to students with lower mathematical ability levels; such students might experience greater difficulty with the ME interface. Third, our failure to find support for the irrelevant-variance hypothesis does not confirm that such contamination is absent, as the null hypothesis cannot be proven.

Finally, this study needs to be viewed as one part of a larger validation program. The study is meaningful only in the context of theoretical rationales and empirical results that converge to support a larger validity argument (Messick, 1989). As a response type, ME is characteristic of a growing class of open-ended computer-based tasks. The larger validity argument for these tasks begins with the contention that, by their open-ended nature, they replicate some of the complexity inherent in the problems encountered in academic and work settings. At the same time, however, our renditions of these tasks can add irrelevant complexity in, among other things, the way we structure the human-computer interaction. This research highlights the need to approach with care how we render those tasks and illustrates one method of monitoring the success of our development efforts.

Tables and Figures

Table 1. A Mathematical Expression Key and Example Responses

Mathematical Expression key:
(m - 2p)(n - 2p) / 4

Some example correct responses:
(n - 2p)(m - 2p) / 4
.25(-2p + m)(-2p + n)
p² - pn/2 - pm/2 + mn/4

Table 2. Means, Standard Deviations, and Coefficient Alpha Reliabilities for Mathematical Expression and Edit/Entry Tests

Test                   Mean    Standard deviation    Coefficient alpha
ME test 1
  Computer-based       10.07          4.09                  .85
  Paper-based          10.51          3.83                  .83
ME test 2
  Computer-based        7.34          3.58                  .79
  Paper-based           7.63          3.70                  .80
Edit/entry test         6.29          2.56                  .72

Note. Each Mathematical Expression (ME) test contained 16 items. The edit/entry test included 10 items. Eighty-nine participants took each ME test, while all 178 participants took the edit/entry test.


Table 3. Correlations Between the Mathematical Expression Test, the Edit/Entry Test, and Other Variables

Variable                  ME (computer)  ME (paper)  Edit/entry   UGPA   GRE quant.  Major   Gender
ME -- paper version           .78**
Edit/entry test               .08           .10
UGPA                          .46**         .43**        .09
GRE quantitative score        .55**         .52**        .18       -.03
Undergraduate major          -.11          -.13         -.08        .15      .13
Gender                        .25**         .S7**        .09        .26*
Level of education            .41**         .36**        .12       -.14                        .21**

Note. All correlations are based on a sample size of 177-178, except for those with GRE quantitative score, which are based on 75 participants. Undergraduate major was coded as engineering (0) versus other (1). Gender was coded as female (0) versus male (1). Level of education was coded as college senior (0) versus first-year graduate student (1). * p < .05. ** p < .01.