
Section Pre-equating in the Presence of Practice Effects

Paul W. Holland and Dorothy T. Thayer

PROGRAM STATISTICS RESEARCH TECHNICAL REPORT NO. 84-44

EDUCATIONAL TESTING SERVICE PRINCETON, NEW JERSEY 08541


Program Statistics Research Technical Report No. 84-44 Research Report RR-84-7

February 1984

This work is the result of a long collaboration and the order of authorship is alphabetical. We also wish to acknowledge the invaluable aid and counsel given by R. Durso, J. Faggen, L. Hecht, L. Leary, M. McPeek, E. Stewart, L. Wightman, and D. Rubin in the work underlying this paper. The research and development activities underlying this paper were supported by the ETS Program Statistics Research project and by the Program Research Planning Council.

Copyright © 1984 by Educational Testing Service. All rights reserved.

The Program Statistics Research Technical Report Series is designed to make the working papers of the Research Statistics Group at Educational Testing Service generally available. The series consists of reports by the statisticians in the Research Statistics Group as well as their external and visiting statistical consultants.

Reproduction of any portion of a Program Statistics Research Technical Report requires the written consent of the author(s).

Abstract

Section pre-equating (SPE) is a new method for equating tests that was developed in response to test disclosure legislation. It is designed to equate a new test to an old test prior to the actual use of the new test, and makes extensive use of experimental sections of a testing instrument. This paper extends the theory behind SPE to allow for the effects of practice on both the old and the new tests, and gives a unified and generalized account of SPE.

Table of Contents

1. Introduction
2. Review of Linear, Observed-score Test Equating
3. A Standard Equating Method Revisited
4. Section Pre-equating
4.1 Tests With Sections
4.2 Missing Data Techniques
References

1. Introduction

Test equating is an important responsibility of the statistical staff of any large testing program. The comparability of scores obtained on different forms of a test depends on the accurate equating of the scores obtained on these forms. When several new forms of a test are routinely developed and used each year, the problem of test equating can be quite acute. There is an extensive literature on test equating; see, for example, the review article by Angoff (1971) and, more recently, the volume edited by Holland and Rubin (1982), which includes a bibliography on test equating and related topics.

In this paper we discuss a new method of test equating called section pre-equating (SPE), which was developed in response to test disclosure legislation. Section pre-equating is designed to equate a new test to an old test prior to the actual use of the new test. It makes extensive use of experimental or "non-operational" sections of a testing instrument. Section pre-equating is also discussed in Holland and Wightman (1982) and in Holland and Thayer (1981). In this paper we extend the research reported previously by allowing for the effect of practice to be explicitly taken into account. In addition, the discussion given here provides a more unified account of SPE than do the previous papers.*

*The version of section pre-equating described in this paper was developed by a team of ETS staff which included R. Durso, J. Faggen, L. Hecht, L. Leary, M. McPeek, E. Stewart, L. Wightman, and the authors. D. Rubin suggested several crucial ideas.

The essential ingredients of SPE are specific ways to collect and analyze data, and test equating is really just one of several by-products of these ingredients. We believe that the methods used in SPE have wider potential applications than the equating of tests because they allow us to piece together information about an entire test from information about small portions of the test. For example, the methods used in SPE automatically provide an estimate of the "parallel-form" reliability of a test without the requirement of administering two entire forms of the test to each member of a sample of examinees.

The remainder of this paper is organized as follows. Section 2 briefly reviews the elements of linear, observed-score test equating. In Section 3 we discuss a standard test equating method from a perspective that provides the logical basis for SPE. Section 4 extends this discussion to tests with sections and develops the theory behind SPE.

2. Review of Linear, Observed-score Test Equating

Let X and Y denote two forms of a test as well as the respective scores from these two test forms. X and Y are assumed to be parallel in form and content throughout this paper. For a given population of examinees there are potential distributions of X-scores and of Y-scores that we could observe if the population were tested with X or Y respectively. Let $\mu_X$ and $\sigma_X^2$ denote the mean and variance of the potential distribution of X-scores and let $\mu_Y$ and $\sigma_Y^2$ denote the corresponding quantities for Y. We can estimate $\mu_X$ and $\sigma_X^2$ by testing a random sample of examinees from the population with test X. Similarly, we can estimate $\mu_Y$ and $\sigma_Y^2$ by testing a (different) random sample of examinees with test Y.

The linear equating function L(y) that equates Y to X is given by the formula

$$L(y) = \mu_X + \frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y). \tag{1}$$

The function L(y) has the property that it transforms the distribution of Y-scores so that it has the same mean and variance as X in the population. The statistical problem associated with linear, observed-score test equating is the estimation of the function L(y). This is accomplished by estimating the four parameters $\mu_X$, $\mu_Y$, $\sigma_X$ and $\sigma_Y$ and applying these estimates to equation (1). For a more thorough discussion of test equating along the lines sketched here the reader is referred to Braun and Holland (1982).

3. A Standard Equating Method Revisited

Angoff (1971) describes a variety of test equating methods including one that he refers to as Design II: "Random groups -- both tests administered to each group, counterbalanced." This method provides the logical basis for SPE and hence we discuss it before turning to the more complex considerations that arise in SPE.

Suppose a sample of examinees is tested first with form X and then with form Y. We would then obtain an X- and a Y-score from each examinee. However, the Y-score could have been affected by the fact that it was obtained after taking a test that was parallel to it. Thus, the X-score and Y-score so obtained are not on an equal footing: Y is possibly subject to a practice effect while X is not. If the order of X and Y is reversed then it is X that is possibly subject to the practice effect. Thus we can envision four potential scores for each examinee in the population. These are:

a) X-score unpracticed, $X$
b) X-score practiced, $X_p$
c) Y-score unpracticed, $Y$
d) Y-score practiced, $Y_p$.

Throughout this paper we shall use the term practiced to mean prior exposure to a distinct but parallel form of the test in an appropriately short period of time. The four potential scores $(X, Y, X_p, Y_p)$ cannot all be observed for a given examinee. It is only possible, within the confines of this discussion, to observe either $(X, Y_p)$ or $(X_p, Y)$ for each examinee.

In Design II equating, as described by Angoff (1971), two distinct random samples of examinees are drawn from the population. For one group, group $\alpha$, $(X, Y_p)$ is observed while for the other group, group $\beta$, $(X_p, Y)$ is observed. This is usually described by saying that X and Y are administered to the examinees in randomized, counter-balanced order. It is important to explicitly distinguish $X$ from $X_p$ and $Y$ from $Y_p$ in the analysis of Design II equating. Figure 1 represents the data obtained in Design II equating in terms of patterns of the observed and unobserved data.

SCORE         X               Y               X_p             Y_p
Group α       observed        not observed    not observed    observed
Group β       not observed    observed        observed        not observed

FIGURE 1: The pattern of observed and unobserved scores in the two groups that arise in Design II equating.

It is clear that from group $\alpha$ we can estimate the means and variances of $X$ and $Y_p$ in the population, i.e. $\mu_X$, $\sigma_X^2$, $\mu_{Y_p}$, $\sigma_{Y_p}^2$, and the population correlation between $X$ and $Y_p$, i.e. $\rho_{XY_p}$. Furthermore, in group $\beta$ we can estimate these five parameters: $\mu_Y$, $\sigma_Y^2$, $\mu_{X_p}$, $\sigma_{X_p}^2$ and $\rho_{X_pY}$. However, with the data collected in Design II equating we do not have direct estimates of $\rho_{XX_p}$, $\rho_{XY}$, $\rho_{X_pY_p}$ and $\rho_{YY_p}$. Fortunately, in order to estimate the linear equating function in (1) we do not need to estimate any of these correlations in the application of standard Design II equating. In SPE, however, these inestimable correlations do play a role.

One could simply use the sample means and variances of $X$ in group $\alpha$ and of $Y$ in group $\beta$ to estimate L(y). However, this wastes the rest of the data, namely the sample means and variances of $X_p$ and $Y_p$. To rectify this apparent waste of data, Lord (1950) proposed a simple model that relates the parameters of $(X_p, Y_p)$ to those of $(X, Y)$. This model is specified as follows:

$$\mu_{X_p} = \mu_X + K\sigma_X \tag{2}$$

$$\mu_{Y_p} = \mu_Y + K\sigma_Y \tag{3}$$

$$\sigma_{X_p} = \sigma_X \tag{4}$$

$$\sigma_{Y_p} = \sigma_Y. \tag{5}$$
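Under (2)-(5) the practiced scores in each group carry information about the unpracticed means and variances once the common constant K is accounted for. The sketch below shows one simple moment-based way to pool the two groups. It is an illustration only and is not claimed to reproduce the formulas developed by Lord; the function names, the simulated data and the parameter values (`MU_X`, `SD_X`, `MU_Y`, `SD_Y`, `RHO`, `K`) are all hypothetical.

```python
import math
import random
import statistics

# Hypothetical population values used only to generate example data.
MU_X, SD_X, MU_Y, SD_Y, RHO, K = 50.0, 10.0, 48.0, 9.0, 0.8, 0.2


def simulate_group(n, rng, practiced_form):
    """Draw (X-type, Y-type) score pairs; the practiced form is shifted by K * sd."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = MU_X + SD_X * z1
        y = MU_Y + SD_Y * (RHO * z1 + math.sqrt(1 - RHO ** 2) * z2)
        if practiced_form == "Y":
            y += K * SD_Y   # group alpha observes (X, Y_p)
        else:
            x += K * SD_X   # group beta observes (X_p, Y)
        pairs.append((x, y))
    return pairs


def pooled_design2_estimates(x_a, yp_a, xp_b, y_b):
    """Moment-based pooling under (2)-(5); an illustrative estimator only."""
    n_a, n_b = len(x_a), len(xp_b)
    w_a, w_b = n_a / (n_a + n_b), n_b / (n_a + n_b)
    # (4) and (5): practiced and unpracticed scores share a variance, so pool.
    var_x = w_a * statistics.pvariance(x_a) + w_b * statistics.pvariance(xp_b)
    var_y = w_a * statistics.pvariance(yp_a) + w_b * statistics.pvariance(y_b)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    # (2) and (3): each standardized mean shift estimates the same K.
    k_from_x = (statistics.fmean(xp_b) - statistics.fmean(x_a)) / sd_x
    k_from_y = (statistics.fmean(yp_a) - statistics.fmean(y_b)) / sd_y
    k = 0.5 * (k_from_x + k_from_y)
    # Remove the estimated practice effect, then pool the two groups' means.
    mu_x = w_a * statistics.fmean(x_a) + w_b * (statistics.fmean(xp_b) - k * sd_x)
    mu_y = w_b * statistics.fmean(y_b) + w_a * (statistics.fmean(yp_a) - k * sd_y)
    return {"mu_x": mu_x, "mu_y": mu_y, "sd_x": sd_x, "sd_y": sd_y, "k": k}


rng = random.Random(1984)
alpha = simulate_group(4000, rng, practiced_form="Y")
beta = simulate_group(4000, rng, practiced_form="X")
est = pooled_design2_estimates(
    [x for x, _ in alpha], [y for _, y in alpha],
    [x for x, _ in beta], [y for _, y in beta],
)
print(est)
```

The point of the sketch is that all four sample means and all four sample variances contribute to the estimates, rather than only those of the unpracticed scores.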

It is then a simple matter to combine the practiced and unpracticed data from both groups to get estimates of $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ that do not waste the data on $X_p$ and $Y_p$. The formulas developed by Lord for these improved estimates are given in detail in Angoff (1971). (Koutsopoulos, 1961, also discusses practice effect models.)

The model specified by (2)-(5) is sufficiently detailed for use in Design II equating. However, we need to refine it further for SPE. In particular we need to specify the entire joint distribution of the vector

$$(X,\; Y,\; X_p,\; Y_p). \tag{6}$$

While there are many ways to do this to be consistent with (2)-(5), the simplest is the "constant practice effect model," in which we assume that

$$X_p = X + \tau_X \tag{7}$$

and

$$Y_p = Y + \tau_Y \tag{8}$$

and, to make things consistent with (2) and (3), we further assume that the practice effects are proportional to the standard deviations, i.e.

$$\tau_X = K\sigma_X \tag{9}$$

$$\tau_Y = K\sigma_Y. \tag{10}$$

From (7), (8), (9), and (10) it follows that (2), (3), (4), and (5) are satisfied under the constant practice effect model. But in addition (7) and (8) imply that all of the correlations involving $X$ or $X_p$ with $Y$ or $Y_p$ are the same, i.e.

$$\rho_{XY} = \rho_{XY_p} = \rho_{X_pY} = \rho_{X_pY_p} = \rho. \tag{11}$$

In addition (7) and (8) imply that

$$\rho_{XX_p} = \rho_{YY_p} = 1. \tag{12}$$

Thus the correlation matrix for the vector in (6), taken in the order $(X, Y, X_p, Y_p)$, has the following pattern structure

$$\begin{array}{c|cc|cc}
 & X & Y & X_p & Y_p \\ \hline
X & 1 & \rho & 1 & \rho \\
Y & \rho & 1 & \rho & 1 \\ \hline
X_p & 1 & \rho & 1 & \rho \\
Y_p & \rho & 1 & \rho & 1
\end{array} \tag{13}$$

We shall see that this pattern structure will arise in SPE in a generalized form.

There are two different correlations that arise in (13). The first is $\rho$, the common correlation between some version of X (i.e., either practiced or unpracticed) and some version of Y (also either practiced or unpracticed). This may be viewed as a parallel-forms reliability coefficient. The other correlation is that between $X$ and $X_p$ (or $Y$ and $Y_p$). The correlation between $X$ and $X_p$ might appear to be like a test-retest reliability, but it is not quite so directly interpretable. $X$ is the score obtained on form X if it is taken first while $X_p$ is the score obtained on form X if taken after taking the (parallel) form Y. There is no real experiment that we could do to observe both $X$ and $X_p$ on the same individuals, but both variables $X$ and $X_p$ are well defined and can be observed separately. We call this the "uncertainty principle of test theory" because it is reminiscent of the Heisenberg Uncertainty Principle of physics. The point of this discussion is that there is no way of rejecting the assumption that $\rho_{XX_p} = \rho_{YY_p} = 1$ using data. This must be done on other grounds.
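The correlation pattern implied by the constant practice effect model can be checked numerically. The sketch below is only an illustration: it simulates all four potential scores for each examinee, which no real administration can provide, and the parameter values and helper name `corr` are hypothetical.

```python
import math
import random


def corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


# Hypothetical parameter values, chosen only for illustration.
RHO, K, SD_X, SD_Y = 0.8, 0.2, 10.0, 9.0
rng = random.Random(0)

X, Y = [], []
for _ in range(2000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    X.append(50.0 + SD_X * z1)
    Y.append(48.0 + SD_Y * (RHO * z1 + math.sqrt(1 - RHO ** 2) * z2))

Xp = [x + K * SD_X for x in X]  # equation (7) with tau_X from (9)
Yp = [y + K * SD_Y for y in Y]  # equation (8) with tau_Y from (10)

scores = {"X": X, "Y": Y, "Xp": Xp, "Yp": Yp}
names = list(scores)
R = [[corr(scores[a], scores[b]) for b in names] for a in names]

for name, row in zip(names, R):
    print(f"{name:>2}", " ".join(f"{r:6.3f}" for r in row))
```

Because $X_p$ and $Y_p$ differ from $X$ and $Y$ only by constants, the printed matrix shows ones in the $(X, X_p)$ and $(Y, Y_p)$ positions and a single common value in every X-with-Y position, as in (13).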

While the constant practice effect model is possibly controversial, it does lead to a consistent system of correlations, which is the main use to which we shall put it in the next section.

4. Section Pre-equating

This section is divided into two parts. In the first we describe how those issues that arise in Design II equating can be applied to tests with sections. In the second part we show how missing data techniques can be applied to this problem.

4.1 Tests with sections

Design II equating, described in Section 3, has the drawback of requiring the full administration of both X and Y to each examinee. In many circumstances, and certainly for those for which SPE was developed, this is impractical. However, it is often possible to administer a portion of form Y along with all of form X to examinees. This is especially easy if the two tests are composed of separately timed sections. Section pre-equating makes extensive use of separately timed sections.

Suppose the test for which X and Y are two parallel forms contains m separately timed sections and let both the i-th section of X as well as the score on the i-th section be denoted by $X_i$. Similarly for $Y_i$. Define the vectors of section scores, $\mathbf{X}$ and $\mathbf{Y}$, by

$$\mathbf{X} = (X_1, \ldots, X_m) \tag{14}$$

and

$$\mathbf{Y} = (Y_1, \ldots, Y_m). \tag{15}$$

We assume that section $Y_i$ is parallel in form and content to section $X_i$. Let $\boldsymbol{\mu}_X$ and $\boldsymbol{\mu}_Y$ denote the mean vectors of $\mathbf{X}$ and $\mathbf{Y}$ respectively over a population of examinees. The scores to be equated are assumed to be the sums*

$$X = X_1 + \cdots + X_m \tag{16}$$

and

$$Y = Y_1 + \cdots + Y_m. \tag{17}$$

In order to estimate the means and variances of X and Y we make use of the formulas:

$$\mu_X = \mathbf{e}\,\boldsymbol{\mu}_X^{T}, \qquad \mu_Y = \mathbf{e}\,\boldsymbol{\mu}_Y^{T} \tag{18}$$

$$\sigma_X^2 = \mathbf{e}\,\Sigma_{XX}\,\mathbf{e}^{T}, \qquad \sigma_Y^2 = \mathbf{e}\,\Sigma_{YY}\,\mathbf{e}^{T} \tag{19}$$

where $\mathbf{e} = (1, 1, \ldots, 1)$ and $T$ denotes transpose. Hence, it is sufficient to estimate $\boldsymbol{\mu}_X$, $\boldsymbol{\mu}_Y$ and

$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}, \tag{20}$$

the joint covariance matrix of $(\mathbf{X}, \mathbf{Y})$. This is the line of attack taken in SPE. Estimates of $(\boldsymbol{\mu}_X, \boldsymbol{\mu}_Y)$ and $\Sigma$ are found and then the linear equating parameters are found via (18) and (19) and applied to (1).

*As will become evident, any fixed linear combination of section scores could be used and the same equating techniques could be applied.

A testing instrument can contain all of the sections of form X and also contain one or more variable sections into which sections of form Y can be placed. For example, consider a test with four sections, i.e., m = 4. Suppose further that the actual testing instrument contains six sections -- four of the sections of X and two "variable" sections. Figure 2 illustrates this with an example in which the variable sections are the second and fifth physical sections of the testing instrument. Figure 2 also assumes that the contents of the variable sections are pairs of sections of Y in half of the possible orderings. This is merely for the sake of illustration and is not a requirement of the method. This results in six versions (or subforms) of the testing instrument which are characterized by the four X-sections and two of the Y-sections as indicated in Figure 2.
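The subform design just described, with its contents transcribed from Figure 2, can be written down as data and inspected for the pattern of observed scores it produces. The following sketch is illustrative only; the names `SUBFORMS`, `observed_sections`, `y_exposure_counts` and `jointly_observed_y_pairs` are hypothetical.

```python
from itertools import combinations

# Contents of physical positions 1..6 for each subform, transcribed from
# Figure 2. X sections are fixed; positions 2 and 5 are the variable sections.
SUBFORMS = {
    1: ["X1", "Y1", "X2", "X3", "Y2", "X4"],
    2: ["X1", "Y1", "X2", "X3", "Y3", "X4"],
    3: ["X1", "Y4", "X2", "X3", "Y1", "X4"],
    4: ["X1", "Y3", "X2", "X3", "Y2", "X4"],
    5: ["X1", "Y4", "X2", "X3", "Y2", "X4"],
    6: ["X1", "Y3", "X2", "X3", "Y4", "X4"],
}


def observed_sections(subform):
    """Set of section scores that an examinee on this subform provides."""
    return set(SUBFORMS[subform])


def y_exposure_counts():
    """Number of subforms in which each Y section appears."""
    counts = {}
    for layout in SUBFORMS.values():
        for section in layout:
            if section.startswith("Y"):
                counts[section] = counts.get(section, 0) + 1
    return counts


def jointly_observed_y_pairs():
    """Map each unordered pair of Y sections to the subforms containing both."""
    pairs = {}
    for subform, layout in SUBFORMS.items():
        ys = tuple(sorted(s for s in layout if s.startswith("Y")))
        pairs.setdefault(ys, []).append(subform)
    return pairs


print(sorted(observed_sections(3)))
print(y_exposure_counts())
print(jointly_observed_y_pairs())
```

In this particular design every Y section appears in three subforms and every pair of Y sections appears together in exactly one subform, so each covariance between two Y sections is informed by at least one group of examinees.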

                 PHYSICAL POSITION OF SECTIONS IN TEST
SUBFORM       1      2      3      4      5      6
              X1     V      X2     X3     V      X4
   1          X1     Y1     X2     X3     Y2     X4
   2          X1     Y1     X2     X3     Y3     X4
   3          X1     Y4     X2     X3     Y1     X4
   4          X1     Y3     X2     X3     Y2     X4
   5          X1     Y4     X2     X3     Y2     X4
   6          X1     Y3     X2     X3     Y4     X4

(V = variable section.)

Figure 2: Illustration of the contents of a six section test for six possible subforms containing all four sections of X in fixed locations and two sections of Y in the variable sections.

In the section pre-equating applications that interest us form X is "operational" in the sense that the reported scores computed for examinees are solely a result of their scores on $X_1, X_2, \ldots, X_m$. On the other hand, the contents of the variable sections do not count towards the examinee's score. Y is a pre-operational form which we desire to equate to X. In these applications Y will be used as an operational form at a later date. Thus, for example, in Figure 2 subform 3 contains all of the sections of the operational form X and sections 1 and 4 of the preoperational form Y.

Just as practice may affect performance on X and Y in Design II equating, practice can also affect performance on the sections of X and Y. The actual pattern of practiced and unpracticed sections will depend on the positions of the variable sections and the operational sections of the test and on the contents of the variable sections. Each subform of the testing instrument gives us values for each examinee for some but not all of the elements of the (4m)-vector

$$\mathbf{D} = (\mathbf{X},\; \mathbf{Y},\; \mathbf{X}_p,\; \mathbf{Y}_p) \tag{21}$$

where

$$\begin{aligned}
\mathbf{X} &= (X_1, \ldots, X_m) \\
\mathbf{X}_p &= (X_{1p}, \ldots, X_{mp}) \\
\mathbf{Y} &= (Y_1, \ldots, Y_m) \\
\mathbf{Y}_p &= (Y_{1p}, \ldots, Y_{mp}).
\end{aligned} \tag{22}$$

4.2 Missing Data Techniques

The subforms that are used in a particular application of SPE determine a pattern of observed and unobserved data in the (4m)-vector $\mathbf{D} = (\mathbf{X}, \mathbf{Y}, \mathbf{X}_p, \mathbf{Y}_p)$. Thus, the approach used in SPE is to use modern missing data techniques to estimate the mean and covariance matrix of $\mathbf{D}$ and then to extract from these estimates the quantities needed to estimate L(y). The mean and covariance matrix of $\mathbf{D}$ can be partitioned into

$$\boldsymbol{\mu} = (\boldsymbol{\mu}_X,\; \boldsymbol{\mu}_Y,\; \boldsymbol{\mu}_{X_p},\; \boldsymbol{\mu}_{Y_p}) \tag{23}$$

and

$$\Sigma = \left(\begin{array}{cc|cc}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XX_p} & \Sigma_{XY_p} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YX_p} & \Sigma_{YY_p} \\ \hline
\Sigma_{X_pX} & \Sigma_{X_pY} & \Sigma_{X_pX_p} & \Sigma_{X_pY_p} \\
\Sigma_{Y_pX} & \Sigma_{Y_pY} & \Sigma_{Y_pX_p} & \Sigma_{Y_pY_p}
\end{array}\right). \tag{24}$$

From (23) we see that we only need the estimates of the first 2m elements of $\boldsymbol{\mu}$, and from (24) we see that we only need estimates of the upper left hand portion of $\Sigma$ to equate Y to X.
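The extraction described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that estimates of the mean vector and covariance matrix are already available as plain lists; the function name `equating_from_section_moments` and the numerical values are hypothetical. Only the first 2m entries of the mean vector and the upper-left 2m x 2m block of the covariance matrix are read, so the example supplies just those.

```python
import math


def equating_from_section_moments(mu, sigma, m):
    """Build L(y) from the unpracticed part of the section-score moments.

    mu    : list whose first 2m entries are (mu_X1..mu_Xm, mu_Y1..mu_Ym)
    sigma : nested list whose upper-left 2m x 2m block is the covariance
            matrix of (X_1..X_m, Y_1..Y_m)
    """
    # Total-score means: sum of the section means.
    mu_x = sum(mu[:m])
    mu_y = sum(mu[m:2 * m])
    # Total-score variances: sum of all entries of Sigma_XX and of Sigma_YY.
    var_x = sum(sigma[i][j] for i in range(m) for j in range(m))
    var_y = sum(sigma[i][j] for i in range(m, 2 * m) for j in range(m, 2 * m))
    slope = math.sqrt(var_x / var_y)
    return lambda y: mu_x + slope * (y - mu_y)


# Hypothetical estimates for a test with m = 2 sections.
mu_hat = [10.0, 12.0, 9.0, 11.0]
sigma_hat = [
    [10.0, 3.0, 4.0, 1.0],
    [3.0, 9.0, 1.0, 4.0],
    [4.0, 1.0, 5.0, 2.0],
    [1.0, 4.0, 2.0, 7.0],
]

L = equating_from_section_moments(mu_hat, sigma_hat, m=2)
print(L(20.0), L(24.0))  # prints 22.0 27.0
```

Here the total-score means are 22 and 20 and the total-score variances are 25 and 16, so the slope of L(y) is 5/4. The cross-covariance block between X and Y sections is never used in the equating function itself, although it matters when the moments are estimated from incomplete data.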

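The generalization of the constant practice effect model to tests with sections is

$$\mathbf{X}_p = \mathbf{X} + \boldsymbol{\tau}^{(X)}, \qquad \mathbf{Y}_p = \mathbf{Y} + \boldsymbol{\tau}^{(Y)} \tag{25}$$

where

$$\boldsymbol{\tau}^{(X)} = (K_1\sigma_{X_1},\; K_2\sigma_{X_2},\; \ldots,\; K_m\sigma_{X_m}) \tag{26}$$

and

$$\boldsymbol{\tau}^{(Y)} = (K_1\sigma_{Y_1},\; K_2\sigma_{Y_2},\; \ldots,\; K_m\sigma_{Y_m}). \tag{27}$$

Furthermore the covariance matrix of $\mathbf{D}$ has the following pattern under the constant practice effect model:

$$\Sigma = \left(\begin{array}{cc|cc}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XX} & \Sigma_{XY} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YX} & \Sigma_{YY} \\ \hline
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XX} & \Sigma_{XY} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YX} & \Sigma_{YY}
\end{array}\right). \tag{28}$$

The essential steps in SPE may be reduced to the following.

SPE 1) Design a set of subforms that leads to a reasonable set of patterns of missing data for $\mathbf{D}$.

SPE 2) Assign the examinees in a testing population at random to these subforms.

SPE 3) Estimate $\boldsymbol{\mu}$ and $\Sigma$ from the set of incompletely observed data that results.

SPE 4) Compute $\mu_X$, $\mu_Y$, $\sigma_X$ and $\sigma_Y$ from the estimates of $\boldsymbol{\mu}$ and $\Sigma$, and apply (1) to estimate L(y).

In performing step SPE 3, we have made heavy use of the EM algorithm developed by Dempster, Laird, and Rubin (1977) applied to the problem of estimating the parameters of a multivariate normal distribution. The reader is referred to that paper for details in the use of the EM algorithm.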
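To give a feel for step SPE 3, the sketch below applies the EM algorithm to the smallest possible case: a bivariate normal distribution in which each examinee supplies one or both of two scores. It is a simplified illustration and not the implementation used by the authors; the general problem involves the full (4m)-vector and patterned covariance matrices. The function name `em_bivariate_normal`, the simulated data and the parameter values are hypothetical, and the missingness is assumed to arise from random assignment.

```python
import math
import random


def em_bivariate_normal(rows, n_iter=100):
    """Maximum likelihood estimates for a bivariate normal with missing values.

    rows is a list of (x, y) pairs in which None marks an unobserved score.
    Every row must contain at least one observed score.
    Returns (mean_x, mean_y, var_x, var_y, cov_xy).
    """
    xs = [x for x, _ in rows if x is not None]
    ys = [y for _, y in rows if y is not None]
    # Starting values from the available cases, with zero covariance.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vxx = sum((x - mx) ** 2 for x in xs) / len(xs)
    vyy = sum((y - my) ** 2 for y in ys) / len(ys)
    vxy = 0.0
    n = len(rows)
    for _ in range(n_iter):
        sx = sy = sxx = syy = sxy = 0.0
        for x, y in rows:
            if y is None:
                # E-step: expected y and y^2 given the observed x.
                ey = my + vxy / vxx * (x - mx)
                eyy = ey * ey + (vyy - vxy * vxy / vxx)
                ex, exx, exy = x, x * x, x * ey
            elif x is None:
                # E-step: expected x and x^2 given the observed y.
                ex = mx + vxy / vyy * (y - my)
                exx = ex * ex + (vxx - vxy * vxy / vyy)
                ey, eyy, exy = y, y * y, ex * y
            else:
                ex, ey, exx, eyy, exy = x, y, x * x, y * y, x * y
            sx += ex
            sy += ey
            sxx += exx
            syy += eyy
            sxy += exy
        # M-step: update the moments from the expected sufficient statistics.
        mx, my = sx / n, sy / n
        vxx = sxx / n - mx * mx
        vyy = syy / n - my * my
        vxy = sxy / n - mx * my
    return mx, my, vxx, vyy, vxy


# Hypothetical data: one third complete, one third x only, one third y only.
rng = random.Random(7)
rows = []
for i in range(3000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x = 50.0 + 10.0 * z1
    y = 48.0 + 9.0 * (0.8 * z1 + 0.6 * z2)
    rows.append([(x, y), (x, None), (None, y)][i % 3])

mean_x, mean_y, var_x, var_y, cov_xy = em_bivariate_normal(rows)
print(mean_x, mean_y, cov_xy / math.sqrt(var_x * var_y))
```

With complete data a single iteration returns the ordinary sample moments. With incomplete data the E-step replaces each missing score and its square by their conditional expectations given the observed score, and the M-step recomputes the moments from those expected values.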

References

Angoff, W.H. Scales, norms, and equivalent scores. In Educational Measurement, second edition (R.L. Thorndike, ed.). Washington, D.C.: American Council on Education, 1971, 508-600.

Beaton, A.E. The use of special matrix operations in statistical calculus. Princeton, N.J.: Educational Testing Service, Research Bulletin, 1964, No. 64-51.

Braun, H.I. and Holland, P.W. Observed-score test equating: A mathematical analysis of some ETS equating procedures. In Test Equating (P.W. Holland and D.B. Rubin, ed.). New York: Academic Press Inc., 1982.

Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), 39, 1977, 1-38.

Holland, P.W. and Rubin, D.B. (ed.). Test Equating. New York: Academic Press Inc., 1982.

Holland, P.W. and Thayer, D.T. Section pre-equating the Graduate Record Examination. Princeton, N.J.: Educational Testing Service, Program Statistics Research Technical Report, 1981, No. 81-13.

Holland, P.W. and Wightman, L.E. Section pre-equating: A preliminary investigation. In Test Equating (P.W. Holland and D.B. Rubin, ed.). New York: Academic Press Inc., 1982.

Koutsopoulos, C.J. A linear practice effect solution for the counter-balanced case of equating. Princeton, N.J.: Educational Testing Service, Research Bulletin, 1961, No. 61-19.

Lord, F.M. Notes on comparable scores for test scores. Princeton, N.J.: Educational Testing Service, Research Bulletin, 1950, No. 50-48.

Rubin, D.B. Characterizing the estimation of parameters in incomplete-data problems. Journal of the American Statistical Association, 69, 1974, 467-474.

Rubin, D.B. and Szatrowski, T.H. Finding MLE of patterned covariance matrices by the EM algorithm. Biometrika, 69(3), 1982, 657-660.