Section Pre-equating in the Presence of Practice Effects
Paul W. Holland and Dorothy T. Thayer
Program Statistics Research Technical Report No. 84-44
Research Report RR-84-7
Educational Testing Service, Princeton, New Jersey 08541
February 1984
This work is the result of a long collaboration and the order of authorship is alphabetical. We also wish to acknowledge the invaluable aid and counsel of R. Durso, J. Faggen, L. Hecht, L. Leary, M. McPeek, E. Stewart, L. Wightman, and D. Rubin in the work underlying this paper. The research and development activities underlying this paper were supported by the ETS Program Statistics Research project and by the Program Research Planning Council.
Copyright © 1984 by Educational Testing Service. All rights reserved.
The Program Statistics Research Technical Report Series is designed to make the working papers of the Research Statistics Group at Educational Testing Service generally available. The series consists of reports by the statisticians in the Research Statistics Group as well as their external and visiting statistical consultants. Reproduction of any portion of a Program Statistics Research Technical Report requires the written consent of the author(s).
Abstract

Section pre-equating (SPE) is a new method for equating tests that was developed in response to test disclosure legislation. It is designed to equate a new test to an old test prior to the actual use of the new test, and makes extensive use of experimental sections of a testing instrument. This paper extends the theory behind SPE to allow for the effects of practice on both the old and the new tests, and gives a unified and generalized account of SPE.
Table of Contents

1. Introduction
2. Review of Linear, Observed-score Test Equating
3. A Standard Equating Method Revisited
4. Section Pre-equating
4.1 Tests With Sections
4.2 Missing Data Techniques
References
1. Introduction

Test equating is an important responsibility of the statistical staff of any large testing program. The comparability of scores obtained on different forms of a test depends on the accurate equating of the scores obtained on these forms. When several new forms of a test are routinely developed and used each year, the problem of test equating can be quite acute. There is an extensive literature on test equating; see, for example, the review article by Angoff (1971) and, more recently, the volume edited by Holland and Rubin (1982), which includes a bibliography on test equating and related topics.

In this paper we discuss a new method of test equating called section pre-equating (SPE), which was developed in response to test disclosure legislation. Section pre-equating is designed to equate a new test to an old test prior to the actual use of the new test. It makes extensive use of experimental or "non-operational" sections of a testing instrument. Section pre-equating is also discussed in Holland and Wightman (1982) and in Holland and Thayer (1981). In this paper we extend the research reported previously by allowing the effect of practice to be taken into account explicitly. In addition, the discussion given here provides a more unified account of SPE than do the previous papers.*

*The version of section pre-equating described in this paper was developed by a team of ETS staff which included R. Durso, J. Faggen, L. Hecht, L. Leary, M. McPeek, E. Stewart, L. Wightman, and the authors. D. Rubin suggested several crucial ideas.
The essential ingredients of SPE are specific ways to collect and analyze data, and test equating is really just one of several by-products of these ingredients. We believe that the methods used in SPE have wider potential applications than the equating of tests because they allow us to piece together information about an entire test from information about small portions of the test. For example, the methods used in SPE automatically provide an estimate of the "parallel-form" reliability of a test without the requirement of administering two entire forms of the test to each member of a sample of examinees.

The remainder of this paper is organized as follows. Section 2 briefly reviews the elements of linear, observed-score test equating. In Section 3 we discuss a standard test equating method from a perspective that provides the logical basis for SPE. Section 4 extends this discussion to tests with sections and develops the theory behind SPE.
2. Review of Linear, Observed-score Test Equating

Let X and Y denote two forms of a test as well as the respective scores from these two test forms. X and Y are assumed to be parallel in form and content throughout this paper. For a given population of examinees there are potential distributions of X-scores and of Y-scores that we could observe if the population were tested with X or Y respectively. Let $\mu_X$ and $\sigma_X^2$ denote the mean and variance of the potential distribution of X-scores and let $\mu_Y$ and $\sigma_Y^2$ denote the corresponding quantities for Y. We can estimate $\mu_X$ and $\sigma_X^2$ by testing a random sample of examinees from the population with test X. Similarly, we can estimate $\mu_Y$ and $\sigma_Y^2$ by testing a (different) random sample of examinees with test Y.

The linear equating function L(y) that equates Y to X is given by the formula

$$L(y) = \mu_X + \frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y). \tag{1}$$

The function L(y) has the property that it transforms the distribution of Y-scores so that it has the same mean and variance as X in the population. The statistical problem associated with linear, observed-score test equating is the estimation of the function L(y). This is accomplished by estimating the four parameters $\mu_X$, $\mu_Y$, $\sigma_X$ and $\sigma_Y$ and applying these estimates to equation (1). For a more thorough discussion of test equating along the lines sketched here the reader is referred to Braun and Holland (1982).
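As a minimal sketch of the estimation problem just described, the following Python fragment estimates the four parameters from two independent samples and returns the fitted function, taking equation (1) to be the usual mean-and-variance-matching form L(y) = mu_X + (sigma_X / sigma_Y)(y - mu_Y). The function name `linear_equating`, the sample scores, and the use of the unbiased sample variance are illustrative assumptions, not part of the original report.

```python
import numpy as np


def linear_equating(x_scores, y_scores):
    """Return an estimate of L(y), which maps Y-scores onto the X scale
    by matching the sample means and standard deviations."""
    mu_x, mu_y = np.mean(x_scores), np.mean(y_scores)
    # ddof=1 (unbiased sample variance) is an assumption of this sketch.
    sd_x, sd_y = np.std(x_scores, ddof=1), np.std(y_scores, ddof=1)

    def equate(y):
        return mu_x + (sd_x / sd_y) * (np.asarray(y, dtype=float) - mu_y)

    return equate


# Hypothetical scores from two different random samples of examinees.
x_sample = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_sample = np.array([20.0, 24.0, 28.0, 32.0, 36.0])
L = linear_equating(x_sample, y_sample)
equated = L(y_sample)  # Y-scores expressed on the X scale
```

After the transformation, the equated Y-scores have the same sample mean and standard deviation as the X-scores, which is the defining property of L(y).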
3. A Standard Equating Method Revisited

Angoff (1971) describes a variety of test equating methods including one that he refers to as Design II: "Random groups -- both tests administered to each group, counterbalanced." This method provides the logical basis for SPE and hence we discuss it before turning to the more complex considerations that arise in SPE.

Suppose a sample of examinees is tested first with form X and then with form Y. We would then obtain an X- and a Y-score from each examinee. However, the Y-score could have been affected by the fact that it was obtained after taking a test that was parallel to it. Thus, the X-score and Y-score so obtained are not on an equal footing: Y is possibly subject to a practice effect while X is not. If the order of X and Y is reversed then it is X that is possibly subject to the practice effect. Thus we can envision four potential scores for each examinee in the population. These are:

a) X-score unpracticed, X
b) X-score practiced, $X_p$
c) Y-score unpracticed, Y
d) Y-score practiced, $Y_p$

Throughout this paper we shall use the term practiced to mean prior exposure to a distinct but parallel form of the test in an appropriately short period of time. The four potential scores (X, Y, $X_p$, $Y_p$) cannot all be observed for a given examinee. It is only possible, within the confines of this discussion, to observe either (X, $Y_p$) or ($X_p$, Y) for each examinee.
In Design II equating, as described by Angoff (1971), two distinct random samples of examinees are drawn from the population. For one group, group $\alpha$, (X, $Y_p$) is observed while for the other group, group $\beta$, ($X_p$, Y) is observed. This is usually described by saying that X and Y are administered to the examinees in randomized, counter-balanced order. It is important to explicitly distinguish X from $X_p$ and Y from $Y_p$ in the analysis of Design II equating. Figure 1 represents the data obtained in Design II equating in terms of patterns of the observed and unobserved data.

                      SCORE
    Group      X      Xp     Y      Yp
    alpha      +      -      -      +
    beta       -      +      +      -

    + indicates that the value of the score is observed in the group.
    - indicates that the value of the score is not observed in the group.

FIGURE 1: The pattern of observed and unobserved scores in the two groups that arise in Design II equating.

It is clear that from group $\alpha$ we can estimate the means and variances of X and $Y_p$ in the population, i.e. $\mu_X$, $\sigma_X^2$, $\mu_{Y_p}$, $\sigma_{Y_p}^2$, and the population correlation between X and $Y_p$, i.e. $\rho_{XY_p}$. Furthermore, in group $\beta$ we can estimate these five parameters: $\mu_Y$, $\sigma_Y^2$, $\mu_{X_p}$, $\sigma_{X_p}^2$ and $\rho_{X_pY}$. However, with the data collected in Design II equating we do not have direct estimates of $\rho_{XX_p}$, $\rho_{XY}$, $\rho_{X_pY_p}$ and $\rho_{YY_p}$. Fortunately, in order to estimate the linear equating function in (1) we do not need to estimate any of these correlations in the application of standard Design II equating. In SPE, however, these inestimable correlations do play a role.

One could simply use the sample means and variances of X in group $\alpha$ and of Y in group $\beta$ to estimate L(y). However, this wastes the rest of the data, namely the sample means and variances of $X_p$ and $Y_p$.
To rectify this apparent waste of data, Lord (1950) proposed a simple model that relates the parameters of ($X_p$, $Y_p$) to those of (X, Y). This model is specified as follows:

$$\mu_{X_p} = \mu_X + K\sigma_X \tag{2}$$
$$\mu_{Y_p} = \mu_Y + K\sigma_Y \tag{3}$$
$$\sigma_{X_p} = \sigma_X \tag{4}$$
$$\sigma_{Y_p} = \sigma_Y. \tag{5}$$

It is then a simple matter to combine the practiced and unpracticed data from both groups to get estimates of $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ that do not waste the data on $X_p$ and $Y_p$. The formulas developed by Lord for these improved estimates are given in detail in Angoff (1971). (Koutsopoulos, 1961, also discusses practice effect models.) The model specified by (2)-(5) is sufficiently detailed for use in Design II equating.
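To make the idea of pooling concrete, here is one simple way the practiced and unpracticed data might be combined under a model of the form (2)-(5). It is an illustrative sketch only and is not Lord's set of formulas, which are given in Angoff (1971): the standard deviations are pooled across the practiced and unpracticed versions of each form, and the four group means are then fitted by unweighted least squares to the three unknowns mu_X, mu_Y and K. The function name, the unweighted fit, and the data are assumptions.

```python
import numpy as np


def pooled_design2_estimates(x_a, yp_a, xp_b, y_b):
    """Pool practiced and unpracticed scores under a model like (2)-(5).

    x_a, yp_a: X and practiced-Y scores observed in group alpha.
    xp_b, y_b: practiced-X and Y scores observed in group beta.
    """
    def pooled_sd(u, v):
        # (4)-(5): practice leaves the SD unchanged, so pool the two
        # sample variances, weighting by degrees of freedom.
        dof_u, dof_v = len(u) - 1, len(v) - 1
        pooled_var = (dof_u * np.var(u, ddof=1) + dof_v * np.var(v, ddof=1)) / (dof_u + dof_v)
        return np.sqrt(pooled_var)

    sd_x, sd_y = pooled_sd(x_a, xp_b), pooled_sd(y_b, yp_a)

    # (2)-(3): four sample means, three unknowns (mu_X, mu_Y, K).
    design = np.array([
        [1.0, 0.0, 0.0],   # mean of X  = mu_X
        [1.0, 0.0, sd_x],  # mean of Xp = mu_X + K * sigma_X
        [0.0, 1.0, 0.0],   # mean of Y  = mu_Y
        [0.0, 1.0, sd_y],  # mean of Yp = mu_Y + K * sigma_Y
    ])
    means = np.array([np.mean(x_a), np.mean(xp_b), np.mean(y_b), np.mean(yp_a)])
    (mu_x, mu_y, k), *_ = np.linalg.lstsq(design, means, rcond=None)
    return {"mu_x": mu_x, "mu_y": mu_y, "sd_x": sd_x, "sd_y": sd_y, "K": k}


# Hypothetical data constructed so that the model holds exactly with K = 0.2.
estimates = pooled_design2_estimates(
    x_a=[40.0, 50.0, 60.0], yp_a=[47.0, 57.0, 67.0],
    xp_b=[42.0, 52.0, 62.0], y_b=[45.0, 55.0, 65.0],
)
```

With real data the four means will not fit the three parameters exactly, and a weighting that reflects the two group sizes would be a natural refinement.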
However, we need to refine it further for SPE. In particular we need to specify the entire joint distribution of the vector

$$(X,\; X_p,\; Y,\; Y_p). \tag{6}$$
While there are many ways to do this that are consistent with (2)-(5), the simplest is the "constant practice effect model," in which we assume that

$$X_p = X + \tau_X \tag{7}$$

and

$$Y_p = Y + \tau_Y \tag{8}$$

and, to make things consistent with (2) and (3), we further assume that the practice effects are proportional to the standard deviations, i.e.

$$\tau_X = K\sigma_X \tag{9}$$
$$\tau_Y = K\sigma_Y. \tag{10}$$

From (7), (8), (9), and (10) it follows that (2), (3), (4), and (5) are satisfied under the constant practice effect model. But in addition, because adding a constant to a score changes neither its variance nor its covariance with any other score, (7) and (8) imply that all of the correlations involving X or $X_p$ with Y or $Y_p$ are the same, i.e.

$$\rho_{XY} = \rho_{XY_p} = \rho_{X_pY} = \rho_{X_pY_p} = \rho. \tag{11}$$

In addition (7) and (8) imply that

$$\rho_{XX_p} = \rho_{YY_p} = 1. \tag{12}$$

Thus the correlation matrix for the vector in (6) has the following pattern structure

$$\begin{array}{c|cc|cc}
      & X    & X_p  & Y    & Y_p  \\ \hline
X     & 1    & 1    & \rho & \rho \\
X_p   & 1    & 1    & \rho & \rho \\ \hline
Y     & \rho & \rho & 1    & 1    \\
Y_p   & \rho & \rho & 1    & 1
\end{array} \tag{13}$$
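The pattern in (13) can be checked numerically. The sketch below simulates the four potential scores under a constant practice effect of the form described above (practiced score equals unpracticed score plus K times the standard deviation) and computes their sample correlation matrix. The bivariate normal distribution for (X, Y), the parameter values, and the function name are assumptions made only for illustration; in real data only two of the four scores are ever observed for any examinee.

```python
import numpy as np


def simulate_constant_practice(n, rho, sd_x, sd_y, k, seed=0):
    """Simulate rows (X, Xp, Y, Yp) for n hypothetical examinees."""
    rng = np.random.default_rng(seed)
    cov = np.array([[sd_x**2, rho * sd_x * sd_y],
                    [rho * sd_x * sd_y, sd_y**2]])
    x, y = rng.multivariate_normal([50.0, 55.0], cov, size=n).T
    xp = x + k * sd_x  # practiced X: constant shift proportional to sigma_X
    yp = y + k * sd_y  # practiced Y: constant shift proportional to sigma_Y
    return np.vstack([x, xp, y, yp])


scores = simulate_constant_practice(n=2000, rho=0.8, sd_x=10.0, sd_y=12.0, k=0.3)
corr = np.corrcoef(scores)  # rows and columns ordered X, Xp, Y, Yp
```

The two diagonal blocks of `corr` are filled with ones and the two off-diagonal blocks share a single value close to the chosen rho, which is the two-parameter structure of (13).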
We shall see that this pattern structure will arise in SPE in a generalized form.

There are two different correlations that arise in (13). The first is $\rho$, the common correlation between some version of X (i.e., either practiced or unpracticed) and some version of Y (also either practiced or unpracticed). This may be viewed as a parallel-forms reliability coefficient. The other correlation is that between X and $X_p$ (or Y and $Y_p$). The correlation between X and $X_p$ might appear to be like a test-retest reliability, but it is not quite so directly interpretable. X is the score obtained on form X if it is taken first, while $X_p$ is the score obtained on form X if it is taken after the (parallel) form Y. There is no real experiment that we could do to observe both X and $X_p$ on the same individuals, but both variables X and $X_p$ are well defined and can be observed separately. We call this the "uncertainty principle of test theory" because it is reminiscent of the Heisenberg Uncertainty Principle of physics. The point of this discussion is that there is no way of rejecting the assumption that $\rho_{XX_p} = \rho_{YY_p} = 1$ using data. This must be done on other grounds. While the constant practice effect model is possibly controversial, it does lead to a consistent system of correlations, which is the main use to which we shall put it in the next section.

4. Section Pre-equating

This section is divided into two parts. In the first we describe how those issues that arise in Design II equating can be applied to tests with sections. In the second part we show how missing data techniques can be applied to this problem.

4.1 Tests With Sections

Design II equating, described in Section 3, has the drawback of requiring the full administration of both X and Y to each examinee. In many circumstances, and certainly for those for which SPE was developed, this is impractical. However, it is often possible to administer a portion of form Y along with all of form X to examinees. This is especially easy if the two tests are composed of separately timed sections. Section pre-equating makes extensive use of separately timed sections. Suppose the test for which X and Y are two parallel forms contains m separately timed sections and let both the i-th section of X as well as the score on the i-th section be denoted by $X_i$. Similarly for $Y_i$. Define the vectors of section scores, $\mathbf{X}$ and $\mathbf{Y}$, by

$$\mathbf{X} = (X_1, \ldots, X_m) \tag{14}$$

and

$$\mathbf{Y} = (Y_1, \ldots, Y_m). \tag{15}$$
We assume that section $Y_i$ is parallel in form and content to section $X_i$. Let $\boldsymbol{\mu}_X$ and $\boldsymbol{\mu}_Y$ denote the mean vectors of $\mathbf{X}$ and $\mathbf{Y}$ respectively over a population of examinees. The scores to be equated are assumed to be the sums*

$$X = X_1 + \cdots + X_m \tag{16}$$

and

$$Y = Y_1 + \cdots + Y_m. \tag{17}$$

In order to estimate the means and variances of X and Y we make use of the formulas:

$$\mu_X = \mathbf{e}^T \boldsymbol{\mu}_X, \qquad \mu_Y = \mathbf{e}^T \boldsymbol{\mu}_Y \tag{18}$$

$$\sigma_X^2 = \mathbf{e}^T \Sigma_{XX}\, \mathbf{e}, \qquad \sigma_Y^2 = \mathbf{e}^T \Sigma_{YY}\, \mathbf{e} \tag{19}$$

where $\mathbf{e} = (1, 1, \ldots, 1)^T$ and T denotes transpose. Hence, it is sufficient to estimate $(\boldsymbol{\mu}_X, \boldsymbol{\mu}_Y)$ and

$$\Sigma = \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix}, \tag{20}$$

the joint covariance matrix of $(\mathbf{X}, \mathbf{Y})$. This is the line of attack taken in SPE. Estimates of $(\boldsymbol{\mu}_X, \boldsymbol{\mu}_Y)$ and $\Sigma$ are found and then the linear equating parameters are found via (18) and (19) and applied to (1).

*As will become evident, any fixed linear combination of section scores could be used and the same equating techniques could be applied.

A testing instrument can contain all of the sections of form X and also contain one or more variable sections into which sections of form Y can be placed. For example, consider a test with four sections, i.e., m = 4. Suppose further that the actual testing instrument contains six sections -- four of the sections of X and two "variable" sections. Figure 2 illustrates this with an example in which the variable sections are the second and fifth physical sections of the testing instrument. Figure 2 also assumes that the contents of the variable sections are pairs of sections of Y in half of the possible orderings. This is merely for the sake of illustration and is not a requirement of the method. This results in six versions (or subforms) of the testing instrument which are characterized by the four X-sections and two of the Y-sections as indicated in Figure 2.
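The six-subform design of Figure 2 can be written down as data, which makes it easy to check which pairs of Y sections are ever administered to the same examinee. The sketch below is illustrative: the dictionary simply transcribes the contents of the two variable sections (physical positions 2 and 5) for each subform as read from Figure 2, and the names `VARIABLE_CONTENTS`, `subform_layout` and `jointly_observed_pairs` are hypothetical.

```python
from itertools import combinations

# Y sections placed in physical positions 2 and 5, by subform (from Figure 2).
VARIABLE_CONTENTS = {1: (1, 2), 2: (1, 3), 3: (4, 1), 4: (3, 2), 5: (4, 2), 6: (3, 4)}


def subform_layout(subform):
    """Physical order of the six sections in the given subform."""
    first, second = VARIABLE_CONTENTS[subform]
    return ["X1", f"Y{first}", "X2", "X3", f"Y{second}", "X4"]


def jointly_observed_pairs():
    """Unordered pairs of Y sections that appear together in some subform."""
    return {frozenset(pair) for pair in VARIABLE_CONTENTS.values()}


# All unordered pairs of the m = 4 sections of Y.
all_pairs = {frozenset(pair) for pair in combinations(range(1, 5), 2)}
covered = jointly_observed_pairs() == all_pairs
```

In this design every unordered pair of Y sections occurs in exactly one subform, so each pair of Y-section scores is observed together for some examinees, while no examinee sees more than two sections of Y.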
                 PHYSICAL POSITION OF SECTIONS IN TEST
    SUBFORM      1      2      3      4      5      6
                 X1     V      X2     X3     V      X4
    1            X1     Y1     X2     X3     Y2     X4
    2            X1     Y1     X2     X3     Y3     X4
    3            X1     Y4     X2     X3     Y1     X4
    4            X1     Y3     X2     X3     Y2     X4
    5            X1     Y4     X2     X3     Y2     X4
    6            X1     Y3     X2     X3     Y4     X4

    (V = variable section)

Figure 2: Illustration of the contents of a six-section test for six possible subforms containing all four sections of X in fixed locations and two sections of Y in the variable sections.

In the section pre-equating applications that interest us, form X is "operational" in the sense that the reported scores computed for examinees are solely a result of their scores on $X_1, X_2, \ldots, X_m$. On the other hand, the contents of the variable sections do not count towards the examinee's score. Y is a pre-operational form which we desire to equate to X. In these applications Y will be used as an operational form at a later date. Thus, for example, in Figure 2 subform 3 contains all of the sections of the operational form X and sections 1 and 4 of the pre-operational form Y.

Just as practice may affect performance on X and Y in Design II equating, practice can also affect performance on the sections of X and Y.
The actual pattern of practiced and unpracticed sections will depend on the positions of the variable sections and the operational sections of the test and on the contents of the variable sections. Each subform of the testing instrument gives us values for each examinee for some but not all of the elements of the (4m)-vector

$$\mathbf{D} = (\mathbf{X},\; \mathbf{Y},\; \mathbf{X}_p,\; \mathbf{Y}_p) \tag{21}$$

where

$$\begin{aligned}
\mathbf{X} &= (X_1, \ldots, X_m), & \mathbf{X}_p &= (X_{1p}, \ldots, X_{mp}), \\
\mathbf{Y} &= (Y_1, \ldots, Y_m), & \mathbf{Y}_p &= (Y_{1p}, \ldots, Y_{mp}).
\end{aligned} \tag{22}$$

4.2 Missing Data Techniques

The subforms that are used in a particular application of SPE determine a pattern of observed and unobserved data in the (4m)-vector $\mathbf{D} = (\mathbf{X}, \mathbf{Y}, \mathbf{X}_p, \mathbf{Y}_p)$. Thus, the approach used in SPE is to use modern missing data techniques to estimate the mean and covariance matrix of $\mathbf{D}$ and then to extract from these estimates the quantities needed to estimate L(y). The mean and covariance matrix of $\mathbf{D}$ can be partitioned into

$$\boldsymbol{\mu} = (\boldsymbol{\mu}_X,\; \boldsymbol{\mu}_Y,\; \boldsymbol{\mu}_{X_p},\; \boldsymbol{\mu}_{Y_p}) \tag{23}$$

and

$$\Sigma = \begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XX_p} & \Sigma_{XY_p} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YX_p} & \Sigma_{YY_p} \\
\Sigma_{X_pX} & \Sigma_{X_pY} & \Sigma_{X_pX_p} & \Sigma_{X_pY_p} \\
\Sigma_{Y_pX} & \Sigma_{Y_pY} & \Sigma_{Y_pX_p} & \Sigma_{Y_pY_p}
\end{pmatrix}. \tag{24}$$

From (23) we see that we only need the estimates of the first 2m elements of $\boldsymbol{\mu}$, and from (24) we see that we only need estimates of the upper left hand portion of $\Sigma$ to equate Y to X.
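The extraction just described can be sketched in a few lines. Given estimates of the mean vector and covariance matrix of D, with the unpracticed X sections in the first m positions and the unpracticed Y sections in the next m, the total-score means and variances are sums over the relevant blocks, and the linear equating function follows in the usual form L(y) = mu_X + (sigma_X / sigma_Y)(y - mu_Y). The function name, the element ordering, and the numbers are illustrative assumptions; the correlation returned at the end is the total-score correlation between X and Y implied by the same block sums.

```python
import numpy as np


def equating_from_estimates(mu_d, sigma_d, m):
    """Total-score moments and linear equating coefficients from the first
    2m elements of mu and the upper-left 2m x 2m block of Sigma."""
    mu_d = np.asarray(mu_d, dtype=float)
    sigma_d = np.asarray(sigma_d, dtype=float)
    e = np.ones(m)

    mu_x = e @ mu_d[:m]                              # sum of X-section means
    mu_y = e @ mu_d[m:2 * m]                         # sum of Y-section means
    var_x = e @ sigma_d[:m, :m] @ e                  # variance of the X total
    var_y = e @ sigma_d[m:2 * m, m:2 * m] @ e        # variance of the Y total
    cov_xy = e @ sigma_d[:m, m:2 * m] @ e            # covariance of the totals

    slope = np.sqrt(var_x / var_y)
    intercept = mu_x - slope * mu_y                  # L(y) = intercept + slope * y
    rho = cov_xy / np.sqrt(var_x * var_y)
    return slope, intercept, rho


# Hypothetical estimates for m = 2 sections; only the upper-left block is used.
m = 2
mu_hat = np.array([10.0, 20.0, 14.0, 20.0, 0.0, 0.0, 0.0, 0.0])
sigma_hat = np.zeros((4 * m, 4 * m))
sigma_hat[:2, :2] = 1.0
sigma_hat[2:4, 2:4] = 4.0
sigma_hat[:2, 2:4] = 1.0
sigma_hat[2:4, :2] = 1.0
slope, intercept, rho = equating_from_estimates(mu_hat, sigma_hat, m)
```

Because the score totals are fixed linear combinations of section scores, replacing the vector of ones by any other fixed weight vector would give the corresponding result for a weighted composite.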
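The generalization of the constant practice effect model to tests with sections is

$$\mathbf{X}_p = \mathbf{X} + \boldsymbol{\tau}^{(X)} \tag{25}$$

$$\mathbf{Y}_p = \mathbf{Y} + \boldsymbol{\tau}^{(Y)} \tag{26}$$

where

$$\boldsymbol{\tau}^{(X)} = (K_1\sigma_{X_1}, K_2\sigma_{X_2}, \ldots, K_m\sigma_{X_m})
\quad\text{and}\quad
\boldsymbol{\tau}^{(Y)} = (K_1\sigma_{Y_1}, K_2\sigma_{Y_2}, \ldots, K_m\sigma_{Y_m}). \tag{27}$$

Furthermore the covariance matrix of $\mathbf{D}$ has the following pattern under the constant practice effect model.

$$\Sigma = \begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XX} & \Sigma_{XY} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YX} & \Sigma_{YY} \\
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XX} & \Sigma_{XY} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YX} & \Sigma_{YY}
\end{pmatrix} \tag{28}$$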
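Under a constant practice effect, the mean and covariance matrix of the full (4m)-vector are determined by the mean and covariance of the unpracticed section scores together with the practice constants. The sketch below builds both, assuming the element order (X, Y, Xp, Yp), assuming that the same constant K_i applies to section i of both forms, and using a Kronecker product to tile the 2m x 2m block into the four-block pattern. The function name and the numbers are hypothetical.

```python
import numpy as np


def practice_model_moments(mu_base, sigma_base, k):
    """Mean vector and covariance matrix of D = (X, Y, Xp, Yp).

    mu_base:    length-2m mean vector of the unpracticed scores (X, Y).
    sigma_base: 2m x 2m covariance matrix of the unpracticed scores.
    k:          length-m practice constants K_1, ..., K_m (assumed shared
                by the X and Y versions of each section).
    """
    mu_base = np.asarray(mu_base, dtype=float)
    sigma_base = np.asarray(sigma_base, dtype=float)
    # Practice effects: K_i times the section standard deviation.
    tau = np.tile(np.asarray(k, dtype=float), 2) * np.sqrt(np.diag(sigma_base))
    mu_full = np.concatenate([mu_base, mu_base + tau])
    # Adding constants leaves covariances unchanged, so every block repeats.
    sigma_full = np.kron(np.ones((2, 2)), sigma_base)
    return mu_full, sigma_full


# Hypothetical single-section example (m = 1).
mu_full, sigma_full = practice_model_moments(
    mu_base=[10.0, 20.0],
    sigma_base=[[4.0, 1.0], [1.0, 9.0]],
    k=[0.5],
)
```

The resulting covariance matrix is singular by construction, since each practiced score is an exact shift of its unpracticed counterpart; it has only as many free parameters as the 2m x 2m block it is built from.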
The essential steps in SPE may be reduced to the following.

SPE 1) Design a set of subforms that leads to a reasonable set of patterns of missing data for $\mathbf{D}$.

SPE 2) Assign the examinees in a testing population at random to these subforms.

SPE 3) Estimate $\boldsymbol{\mu}$ and $\Sigma$ from the set of incompletely observed data that results.

SPE 4) Compute $\mu_X$, $\mu_Y$, $\sigma_X$, and $\sigma_Y$ from the estimates of $\boldsymbol{\mu}$ and $\Sigma$, and apply (1) to estimate L(y).

In performing step SPE 3, we have made heavy use of the EM algorithm developed by Dempster, Laird, and Rubin (1977) applied to the problem of estimating the parameters of a multivariate normal distribution. The reader is referred to that paper for details in the use of the EM algorithm.
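For orientation, a generic EM iteration for the mean and covariance matrix of a multivariate normal distribution with missing entries can be sketched as follows. This is a textbook-style illustration, not the authors' program: it assumes missing values are coded as NaN, that every row has at least one observed value, and that every pair of variables is observed together in at least some rows. It does not impose the patterned covariance structure of the constant practice effect model, which SPE needs because some pairs of elements of D are never observed together. The function name `em_mvn` and the fixed iteration count are assumptions.

```python
import numpy as np


def em_mvn(data, n_iter=50):
    """EM estimates of the mean and covariance of a multivariate normal
    from rows whose missing entries are coded as np.nan."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    observed = ~np.isnan(data)
    mu = np.nanmean(data, axis=0)                # start from available-case means
    sigma = np.diag(np.nanvar(data, axis=0))     # and a diagonal covariance

    for _ in range(n_iter):
        sum_x = np.zeros(p)
        sum_xx = np.zeros((p, p))
        for row, obs in zip(data, observed):
            mis = ~obs
            x = row.copy()
            cond_cov = np.zeros((p, p))
            if mis.any():
                s_oo = sigma[np.ix_(obs, obs)]
                s_om = sigma[np.ix_(obs, mis)]
                # Regression of the missing on the observed coordinates.
                coef = np.linalg.solve(s_oo, s_om).T
                # E-step: conditional mean and covariance of the missing part.
                x[mis] = mu[mis] + coef @ (row[obs] - mu[obs])
                cond_cov[np.ix_(mis, mis)] = sigma[np.ix_(mis, mis)] - coef @ s_om
            sum_x += x
            sum_xx += np.outer(x, x) + cond_cov
        # M-step: update from the expected sufficient statistics.
        mu = sum_x / n
        sigma = sum_xx / n - np.outer(mu, mu)
    return mu, sigma
```

A call such as `mu_hat, sigma_hat = em_mvn(scores_with_nans)` returns the estimates after a fixed number of iterations; a production version would monitor the change in the log-likelihood or in the parameters instead.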
References

Angoff, W.H. Scales, norms, and equivalent scores. In Educational Measurement, second edition (R.L. Thorndike, ed.). Washington, D.C.: American Council on Education, 1971, 508-600.

Beaton, A.E. The use of special matrix operations in statistical calculus. Princeton, N.J.: Educational Testing Service, Research Bulletin, 1964, No. 64-51.

Braun, H.I. and Holland, P.W. Observed-score test equating: A mathematical analysis of some ETS equating procedures. In Test Equating (P.W. Holland and D.B. Rubin, ed.). New York: Academic Press, Inc., 1982.

Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), 39, 1977, 1-38.

Holland, P.W. and Rubin, D.B. (ed.). Test Equating. New York: Academic Press, Inc., 1982.

Holland, P.W. and Thayer, D.T. Section pre-equating the graduate record examination. Princeton, N.J.: Educational Testing Service, Program Statistics Research Technical Report, 1981, No. 81-13.

Holland, P.W. and Wightman, L.E. Section pre-equating: A preliminary investigation. In Test Equating (P.W. Holland and D.B. Rubin, ed.). New York: Academic Press, Inc., 1982.

Koutsopoulos, C.J. A linear practice effect solution for the counter-balanced case of equating. Princeton, N.J.: Educational Testing Service, Research Bulletin, 1961, No. 61-19.

Lord, F.M. Notes on comparable scores for test scores. Princeton, N.J.: Educational Testing Service, Research Bulletin, 1950, No. 50-48.

Rubin, D.B. Characterizing the estimation of parameters in incomplete-data problems. Journal of the American Statistical Association, 69, 1974, 467-474.

Rubin, D.B. and Szatrowski, T.H. Finding MLE of patterned covariance matrices by the EM algorithm. Biometrika, 69(3), 1982, 657-660.