RR-81-47

RESEARCH REPORT

SUBJECT MATTER EXPERTS' ASSESSMENT OF ITEM STATISTICS

Isaac I. Bejar

November 1981

Educational Testing Service, Princeton, New Jersey

Subject Matter Experts' Assessment of Item Statistics

Isaac I. Bejar

August 1981 Final Report

Abstract

The study was conducted to determine the degree to which subject matter experts could predict the difficulty and discrimination of items from the Test of Standard Written English. Two studies were conducted. The first study was meant to serve as a training exercise, while the second study was meant to simulate the conditions under which ratings would be obtained in practice. It was concluded that despite the extended training period the raters did not approach a high level of accuracy, nor were they able to pinpoint the factors that contribute to difficulty and discrimination. Further research should attempt to uncover those factors by examining the items from a linguistic and psycholinguistic perspective. It was argued that by coupling linguistic features of the items with subject matter ratings it may be possible to attain more accurate predictions of difficulty and discrimination.

Acknowledgements

I am grateful to the Test Development staff of the College Board division at ETS for serving as raters in this study, as well as for contributing important insights to the design of the study, and to Thomas Donlon for useful suggestions.

Subject Matter Experts' Assessment of Item Statistics

The present study is concerned with determining how well subject matter experts can estimate the statistical characteristics of items. If it is found that they can do so to a reasonably good approximation, then in principle not all newly generated items need to be pretested, and new forms can be assembled, at least partially, from unpretested items based on subject matter experts' assessment of item statistics, thereby minimizing the exposure of items and possibly test development costs. There is some evidence, to be reviewed below, that raters are able to accurately rank the difficulty of mathematics items. The present study, however, is concerned with items that measure writing ability, which would appear to be a far more challenging task. Unlike mathematics items, where the mathematical operation required to solve the problem largely determines the difficulty, for items that attempt to measure writing skill a much greater variety of factors would seem to determine difficulty. Thus, it is probably not sufficient to determine what error is present in a given item, for the semantic and syntactic context in which that error is presented may influence item statistics significantly. This suggests that it may be useful to identify which factors contribute to difficulty. It is clear that subject matter experts are required in this process.

The outcome of this process could be important both practically and theoretically. From a practical point of view, once these factors have been identified, they can be transmitted to other subject matter experts. Although the creation of items is likely to remain largely a creative endeavor, knowledge of what factors affect difficulty and discrimination may provide better control over the statistical characteristics of the items produced. Alternatively, additional subject matter experts could be easily trained to accurately assess the statistical characteristics of items. From a theoretical point of view, the identification of facets that account for the variability in difficulty and discrimination across items would seem to be an important step in the construct validation of multiple-choice tests of writing ability. For example, other things being equal, the syntactic context in which a subject-verb agreement error is presented may be related to difficulty in a psychometric sense. Such a relationship might, in turn, be explained in terms of information-processing constructs upon further research. Doing so will, in the end, contribute to the construct validity of the test.

Review of Literature

A series of investigations by Lorge, dealing exclusively with item difficulty, represents the most important component of the published literature on this topic.

The first study, by Lorge & Kruglov (1952),

was guided by the assumption that exposing raters to item of known difficulty would improve their ratings of other items.

To test the

hypothesis, a group of eight doctoral candidates taking courses in test construction were divided into two groups.

One group was given 120 items

to rate after being shown 30 items with known difficulty. group was given all 150 items to rate.

The other

Raters were instructed to rank

all items as well as to determine the percent of eighth grade students that would pass each item.

-3The results indicated a somewhat larger degree of interrater agreement for the group who had been exposed to the items of known difficulty.

However, both groups estimated relative difficulty equally

well but both groups of raters underestimated actual difficulty.

The

correlation between the mean of the difficulty ratings and estimated difficulty was .84 for the group with no prior information and .83 for the group with prior information. An obvious difficulty with the study is that the raters were not familiar with the testing population.

Such familiarity would seem to be

necessary to unbiasedly estimate difficulty.

In a subsequent investigation

Lorge & Kruglov (1953) attempted to remedy the problem by using raters that were candidates for advanced degrees in the teaching of mathematics. Two groups (A and B) of seven raters each were formed. two sets of 45 items each. parallel as possible.

Each judge rated

The two sets of 45 items were chosen to be as

Group A rated first Set I and then Set II and was

given the estimated difficulty estimates of 10 of the 45 items of Set II. Group B first rated Set II followed by Set I and was given the estimated difficulty estimate of 10 of the 45 items for Set I.

The estimated

difficulties of the items ranged from .02 to .97 for Set I and .08 to .99 for Set II. Under the no information condition the correlation of mean rated difficulty and empirical difficulty were .844 for Group A and .797 for Group B.

Under the information condition the correlations were .874 and

.853 for goups A and B respectively.

That is, the information condition

raters were better able to estimate difficulty.

-4These two studies largely corroborate the results of an earlier study by Tinkelman (1947) and together suggest that judges are also able to judge the relative difficulty of a set of items. order items in terms of difficulty.

That is, they can

Since those studies were conducted

little has been published in the psychometric literature on this topic. Some work (e.g., Prestwood & Weiss, 1977; Munz and Jacobs, 1971; Bratfisch, Borg & Donnie, 1971) has been done on perception of item difficulty by test takers, which suggests that test takers are also capable of rating difficulty accurately, with correlations of perceived difficulty and estimated difficulty as high as .94. More recently Thorndike (1980) investigated the usefulness of a procedure where items of known difficulty are scattered among the unpretested items.

For these items there were two difficulty estimates, the empirical

and the rated.

Thorndike found correlations ranging from .72 to .83 for

the correlation between estimated difficulty and the mean rating of 20 raters majoring in measurement.

The content of the items were verbal

analogies, quantitative relations and figure analogies. that the raters were not expert in test development.

Thorndike notes

Nevertheless, he

concluded rather pessimistically that if the purpose of the rating exercise is to pre-equate tests, the rating procedure is not sufficiently accurate.

He suggested that further attention should be focused on the

selection and training of judges. The investigations reviewed above all rely on the global ratings of judges.

A second paradigm consists of identifying structural determinants

of item difficulty and discrimination.

Although subject matter experts

may be involved in the identification of such determinants, the integration

-5of the structural information into a rating is done statistically.

Two

examples of these basic paradigm are the study of Millman (1978) and Searle t Lorton, and Suppes (1974).

In the Searle, et al. study the

presence or observation of a large number of item parameters (for example t how many mathematical operations were required by the problem) was determined for each item.

The estimated item difficulty was then regressed

on these factors to obtain the equation for predicting item difficulty. They found a multiple correlation of .81. The study by Millman is of a conceptual rather than empirical interest since the item statistics were based on 30 students only.

The

conceptual interest lies in his demonstration that items could be put together by computer while controlling the difficulty of the items.

That

is, rather than storing items in a computer, the computer stores the subroutine that generates the item.

Actually, in Millman's study the

information about the difficulty of different versions of an item was derived empirically; that is, there was no theory that could predict the item's difficulty.

In some domains, notably

~patial

ability, the computer

could not only generate the item but also an estimate of its difficulty. This is possible, in principle, because there is a fair amount of theoretical evidence (e.g. t Cooper t 1976) which suggests that for certain spatial tasks response latency is a known function of physical characteristics of the task. Summary:

Implications for the Present Study

The survey of the literature suggests that a fairly strong relationship may be obtained between rated and empirical item difficulty. All but one of the investigations reviewed, however, dealt with quantitative items and did not consider at all the possibility of rating discrimination. Therefore, it remains to be seen to what extent raters can accurately portray the difficulty and discrimination of non-quantitative items.

It can be reasonably anticipated that the task of estimating item characteristics for non-quantitative items is more challenging. In anticipation of this problem the raters in the present investigation were subject matter experts who, in addition, were test development experts. Nevertheless, it is likely that their ability to accurately rate items was not fully developed, since ratings are not normally required in their everyday work. Therefore, one consideration of the present study was to provide a situation whereby their judgments could be refined. A second consideration in the present study was to uncover the factors that contribute to the difficulty of TSWE items. To the extent that this is possible it may be feasible to train others, not necessarily test development experts, to accurately rate item difficulty and discrimination.

Overview of the Study

The study is based on items from the Test of Standard Written English (TSWE).

The investigation was conducted as two consecutive substudies. In the first study, subject matter experts (test development staff at Educational Testing Service, College Board Division) were required to rate the difficulty and discrimination of items that had appeared in TSWE final forms. This substudy served as a "training" period. As part of the exercise the raters discussed their ratings and were given feedback on each item. The second substudy was meant to assess how accurately judges could rate difficulty and discrimination, under fairly realistic conditions, after "training" was completed.

Study 1

Method

Subject matter experts.

Four subject matter experts participated in the study. All raters are staff members of the College Board Test Development Division at Educational Testing Service. Their experience with the Test of Standard Written English and other tests of writing skills ranged from 3 to 20 years.

Instructions to raters.

The raters were assembled as a group and, for a given item, instructed to examine it and write down their estimates of difficulty and discrimination; then they revealed their ratings and discussed among themselves the rationale behind them. After this discussion the experts rated the item a second time. At that point estimated difficulty, discrimination, and other statistical information were revealed. The "other" statistical information included the distribution of responses across alternatives for quintiles as well as the mean criterion score of students choosing each of the incorrect alternatives.

The raters were instructed to rate difficulty on the delta metric. The delta difficulty index is a nonlinear transformation of the proportion correct statistic given by

Δ = Φ⁻¹(1 − p)

where Φ⁻¹ is the inverse normal function and p is the proportion of students answering the item correctly. The proportion correct is estimated on those students who attempt the item. The values used in this study were equated deltas. The equating process is performed to account for the fact that the testing populations at different testing times differ in ability, and it ensures that the measure of difficulty for different forms is on the same metric.

For discrimination the raters were instructed to rate the biserial correlation of the item. The biserial correlation is computed as follows:

r = [(M_R − M_W) / S_T] · [p(1 − p) / y]

where

M_R is the mean criterion score for students choosing the right answer,
M_W is the mean criterion score for students not choosing the correct alternative,
S_T is the standard deviation of criterion scores for all students, and
y is the ordinate of the normal density function corresponding to p or (1 − p), whichever is smaller.

It should be pointed out that these statistics are meaningful to the raters in this study since they use them in their everyday work.
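The sketch below (mine, not part of the original report) simply restates these two indices in code. The names are invented, the criterion score is assumed to be the total test score, and the optional rescaling of delta to the conventional 13 ± 4 reporting scale is an added assumption that is not contained in the formula given above.

```python
# Illustrative sketch only -- not from the report. It restates the delta and
# biserial definitions given in the text; names and the 13 +/- 4 rescaling of
# delta are assumptions.
import numpy as np
from statistics import NormalDist

_ND = NormalDist()

def delta_from_p(p: float, rescale: bool = False) -> float:
    """Delta difficulty index: the inverse normal of (1 - p). If rescale is
    True, map onto the conventional reporting scale (mean 13, SD 4)."""
    z = _ND.inv_cdf(1.0 - p)
    return 13.0 + 4.0 * z if rescale else z

def biserial(correct, criterion) -> float:
    """Biserial correlation of an item with a criterion score.
    correct: 0/1 per student; criterion: e.g., total test score per student."""
    correct = np.asarray(correct, dtype=bool)
    criterion = np.asarray(criterion, dtype=float)
    p = correct.mean()                          # proportion answering correctly
    m_r = criterion[correct].mean()             # mean criterion score, right answer
    m_w = criterion[~correct].mean()            # mean criterion score, other students
    s_t = criterion.std()                       # SD of criterion scores, everyone
    y = _ND.pdf(_ND.inv_cdf(min(p, 1 - p)))     # normal ordinate at p (or 1 - p)
    return (m_r - m_w) / s_t * p * (1 - p) / y
```

For example, under this (assumed) rescaling a proportion correct of about .84 gives a delta near 9, which is in the neighborhood of the mean equated deltas reported in Table 1.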

After the estimated statistics were revealed, the raters were encouraged to explain any discrepancies among their own ratings and with the estimated statistics. As part of this process, they formulated hypotheses to account for these discrepancies and were able to test them with items presented subsequently.

Description of the TSWE.

The TSWE is a 30-minute multiple choice test introduced in 1974 as a companion test to the Scholastic Aptitude Test (SAT), with which it is administered. Its purpose is to help colleges place students in appropriate English Composition courses. It is not recommended as an admission instrument.

The test consists of 50 items of two types. Items 1-25 and 41-50 are called Usage items. The testee is expected to recognize writing that does not follow the conventions of standard written English. Examples of this type of item follow:

Directions: The following sentences contain problems in grammar, usage, diction (choice of words), and idiom. Some sentences are correct. No sentence contains more than one error. You will find that the error, if there is one, is underlined and lettered. Assume that all other elements of the sentence are correct and cannot be changed. In choosing answers, follow the requirements of standard written English. If there is an error, select the one underlined part that must be changed in order to make the sentence correct, and blacken the corresponding space on the answer sheet. If there is no error, mark answer space E.¹

Examples:

I.  He spoke bluntly and angrily to we spectators.   No error
    A B C D                                           E

II. He works every day so that he would become financially independent in his old age.   No error
    A B C D                                                                                E

¹The current directions are somewhat different.

The other 15 items, 26-40, are called Sentence Correction items. In these items the student is expected to recognize unacceptable usage and structure and to choose the best way of phrasing the sentence. Examples of this type of item follow:

Directions: In each of the following sentences, some part of the sentence or the entire sentence is underlined. Beneath each sentence you will find five ways of phrasing the underlined part. The first of these repeats the original; the other four are different. If you think the original is better than any of the alternatives, choose answer A; otherwise choose one of the others. Select the best version and blacken the corresponding space on your answer sheet.

This is a test of correctness and effectiveness of expression. In choosing the answer, follow the requirements of standard written English: that is, pay attention to grammar, choice of words, sentence construction, and punctuation. Choose the answer that produces the most effective sentence--clear and exact, without awkwardness or ambiguity. Do not make a choice that changes the meaning of the original sentence.²

Examples:

I.  Caroline is studying music because she has always wanted to become it.

    (A) it              (B) one of them         (C) a musician
    (D) one in music    (E) this

²The current directions are somewhat different.

II. Because Mr. Thomas was angry, he spoke in a loud voice.

    (A) he spoke        (B) and speaking        (C) and he speaks
    (D) as he spoke     (E) he will be speaking

The Usage items test a variety of problems in grammar: agreement; tense and verb form; the use of pronouns, diction, and idiom; and the use of adjectives and verbs. The Sentence Correction items test problems in sentence structure related to complete sentences, placement of modifying phrases and clauses, the logical relation of ideas, the logical comparison of things or ideas, parallelism, conciseness, and clarity.

Research on the TSWE has shown it to be a reliable and valid instrument. Table 1 shows some sample statistics for forms E3-E8. As can be seen, standard errors of measurement are about 4.0 and reliabilities are in the upper .80's. For a thirty-minute test these figures are satisfactory. Research by Breland (1976) has also provided evidence of the construct validity of scores derived from the TSWE. For example, the correlation between TSWE scores and essay scores is typically higher than the correlation between SAT verbal scores and essay scores. This is to be expected if indeed the TSWE measures skills related to writing ability rather than a more general reading ability and extent of vocabulary.

Insert Table 1 about here

The rating material.

The pool of items from which the items were drawn consisted of five early forms of the test. The items from each form were separated into two groups corresponding to the item types. Three sets of twenty Usage items and three sets of ten Sentence Correction items were taken at random from the pool and assembled into booklets. The twenty Usage items were placed first in the booklet, followed by the ten Sentence Correction items. Since the booklets were formed by randomly choosing from the pool, there was no control over the number of items testing the different types of errors.

For purposes of the study the "true" item statistics were taken to be the estimated equated delta and estimated biserial correlation based on the first national administration of the form. The estimates are based on a random sample of close to 2000 students from the population of students taking the test.

Rating sessions.

Three rating sessions, A, B, and C, were held. All sessions were held in the morning from 9 to 12.

Analysis and Results

Interrater reliability.

A question of obvious relevance is the degree of agreement among the raters. In particular, since they rated the items before and after discussion, it is of interest to assess interrater reliability before and after discussion. However, while the raters may agree, they may be biased in the sense of over- or underestimating the empirical item statistics; we will examine both issues.

Table 2 shows the interrater reliability for each session before and after discussion for both discrimination and difficulty. Two major trends are evident. Interrater reliability is higher for difficulty ratings than for discrimination. For example, in session A the interrater reliability for difficulty is .80 after discussion. By contrast, the corresponding reliability for discrimination is .38. It is also evident from Table 2 that the process of discussion, not surprisingly, did have the effect of increasing the agreement among the raters. This was true for both difficulty and discrimination. For example, for the three combined sessions the interrater agreement increased from .61 to .77 for difficulty and from .35 to .54 for discrimination. However, these increases in interrater reliability within session must be taken with a grain of salt since they may just reflect correlated errors.
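The report does not state which formula was used for the interrater reliabilities in Table 2. One common choice for the reliability of a set of raters is coefficient alpha computed over the rater-by-item matrix, and the sketch below (illustrative only, with invented data) shows that computation under that assumption.

```python
# Illustrative only: coefficient alpha over a raters-by-items rating matrix,
# one common way to express interrater reliability. Whether this is the exact
# formula behind Table 2 is an assumption; the data below are invented.
import numpy as np

def interrater_alpha(ratings: np.ndarray) -> float:
    """ratings: array of shape (n_raters, n_items)."""
    k = ratings.shape[0]                      # number of raters
    composite = ratings.sum(axis=0)           # composite rating per item
    rater_vars = ratings.var(axis=1, ddof=1)  # variance of each rater's ratings
    total_var = composite.var(ddof=1)         # variance of the composite
    return k / (k - 1) * (1 - rater_vars.sum() / total_var)

rng = np.random.default_rng(1)
true_delta = rng.normal(9.0, 2.0, size=22)               # 22 hypothetical items
ratings = true_delta + rng.normal(0, 1.5, size=(4, 22))  # 4 noisy hypothetical raters
print(round(interrater_alpha(ratings), 2))
```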

In order to determine the effect of the learning that is presumably taking place on interrater agreement, it is necessary to compare the "before" agreement across the three sessions. As can be seen, for difficulty, agreement actually decreases. For discrimination the trend is not stable. These patterns suggest no permanent effect resulting from discussion.

Insert Table 2 about here

As expected, there were individual differences among the raters with respect to the elevation and variability of their ratings and, more importantly, their "reliability" as judges.

Tables 3-6 show the correlation of every rater with the mean rating of the other three raters, for difficulty and discrimination ratings, before and after discussion. The correlation of every rater with the estimated difficulty and discrimination is also shown.

Insert Tables 3-6 about here

For session A, raters 1 and 3 seem to do best in reflecting their colleagues' ratings, both before and after discussion. More importantly, this is also true with respect to the actual estimated statistics. For discrimination, neither rater seems to perform well with respect to predicting either the other three judges' ratings or the estimated statistics.
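As a concrete restatement of the quantities in Tables 3-6 (a sketch of my own, not taken from the report), the rater-total correlation pairs each rater with the average of the other three raters, and the rater-delta correlation pairs each rater with the estimated equated delta:

```python
# Illustrative sketch: the rater-total and rater-delta correlations
# reported in Tables 3-6. Inputs are hypothetical arrays.
import numpy as np

def rater_total_and_delta(ratings: np.ndarray, est_delta: np.ndarray):
    """ratings: shape (n_raters, n_items); est_delta: shape (n_items,).
    Returns (rater-total, rater-delta) correlations for each rater."""
    out = []
    for i in range(ratings.shape[0]):
        others = np.delete(ratings, i, axis=0).mean(axis=0)  # mean of the other raters
        r_total = np.corrcoef(ratings[i], others)[0, 1]
        r_delta = np.corrcoef(ratings[i], est_delta)[0, 1]
        out.append((round(r_total, 2), round(r_delta, 2)))
    return out
```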

In general, discussion enhances the agreement among the raters as well as the correlation of each rater with the estimated statistics. The positive effect of discussion on rater reliability and validity is also seen in session B. It can be seen in Table 4 that, with respect to difficulty, the rater-total and rater-delta correlations increase after discussion. This is also the case with respect to discrimination, but only for the rater-total correlation. In fact, the correlation of each rater with the estimated discriminations tended to decrease after discussion. In session C (Table 5) rater reliability tended to increase after discussion. However, the raters appeared to have little success in predicting the estimated delta or r-biserial for these items. By way of summary, Table 6 combines the data for sessions A, B, and C. It can be seen that rater 1 appears to be the "best" rater with respect to difficulty and discrimination.

Table 7 shows the mean and standard deviation of the difficulty and discrimination ratings for each rater separately as well as for the mean rating of all four judges. In addition, the mean and standard deviation of the estimated item statistics are shown.

Insert Table 7 about here

The raters overestimated the difficulty of the items for sessions A and B but underestimated it for session C. Also, the raters underestimated the variability of the equated deltas. For discrimination, the raters underestimated the mean and standard deviation of the biserial correlation for sessions A and C but were fairly successful in session B. As a matter of fact, the mean and standard deviation of the combined rating matched the empirical statistics exactly. Thus, raters appear not to be consistently biased.

Validity of ratings.

From a practical point of view the key consideration is how well the estimated statistics can be predicted from the combined ratings. Table 8 shows the simple correlation of the mean rating with the corresponding estimated statistics before and after discussion for each session. As can be seen, no firm trend is evident. That is, neither discussion nor experience seems to affect the magnitude of the relationship.

Insert Table 8 about here

In view of these results, it was important to probe further into the data to determine why the raters appear to perform so poorly. Examination of the residuals showed a clear pattern, namely, that on the whole they were smaller for Sentence Correction items. Table 9 shows the mean and standard deviation of the residuals for each item type in each of the sessions. For difficulty, the mean residual is lower for the Sentence Correction items and the standard deviation is smaller as well. No such trend is evident with respect to discrimination. This suggests that the raters were better able to predict the estimated delta for Sentence Correction items. Table 10 shows the correlation with estimated delta and r-biserial before and after discussion for both item types. The correlation is higher for Sentence Correction for both difficulty and discrimination. However, for discrimination the correlations are not significantly different from zero.

Insert Tables 9 and 10 about here

Discussion

The results are disappointing from a practical point of view.

When Usage and Sentence Correction items are considered together, the interrater reliability is fairly low and does not seem to increase as a function of learning. That is, even though the interrater reliability increased after discussion, to some extent this could be due to correlated errors introduced in the discussion process. There is some reason to believe, however, that the basis for the agreement among the raters is related to the characteristics of the items rather than to peer pressure or some other social factor. This is suggested by the fact that the rater-total correlation tended to go up and down with the rater-delta correlation, at least for difficulty.

Unfortunately, the correlations of the composite ratings with the estimated item statistics did not even reach .50 when both Usage and Sentence Correction items are considered simultaneously. Analysis of the residuals suggested that the raters were not equally successful with the two types of items. In fact, the raters were more accurate in predicting the difficulty of Sentence Correction items.

As indicated in the methods section, the raters had been instructed to try to detect what factors they felt contributed to difficulty and discrimination. Appendix A contains the reports of the four raters. They show that even though the raters tried to identify facets of items which they believed would contribute to the difficulty and discrimination, they were not successful. The problem was compounded by the frustration experienced by the raters when the estimated statistics were revealed and found to disagree with their ratings.

Study 2

Additional Training

It was hypothesized that the raters' performance could be improved by additional training. For this purpose items from five TSWE forms were collected and the items sorted into the major error categories to which each item had been previously assigned by Test Development staff. There are 24 possible error categories, including a no-error category. However, only 19 of them were represented in the five forms which were used in the study.

Table 11 shows the mean and standard deviations of the estimated delta and discrimination for items in each of these categories. The mean estimated delta and discrimination, plus and minus two standard errors, are shown in Figures 1 and 2, respectively.

Insert Table 11 about here

Insert Figures 1 and 2 about here

It is evident from Figure 1 that for most categories the mean difficulty is within a narrow interval. However, some categories appear to be more difficult, most notably categories 6 and 7, and perhaps 9, whereas categories 4 and 13 appear to be easier than the other categories. For discrimination, most categories have a mean discrimination clustered about .50, except for category 1, no-error, which has a mean of .40, and category 13, which has a mean close to .60.

The additional training was conducted in four three-hour sessions. During each session discussions were focused on specific error categories. This gave the raters an opportunity to examine items which tested a single error yet varied in their difficulty and discrimination. The information in Table 11 was made available to the raters as they discussed items within a given category. The raters found it useful to discuss the items among themselves and then "guess" at the statistics of the items. From time to time they postulated hypotheses that could account for difficulty. While the raters did not seem able to develop a theory to account for the variation in the difficulty and discrimination of items measuring a single error, by their own account they found the exercise both professionally useful and frustrating.

Once this training was completed, a final session was held in which the subject matter experts rated two sets of 50 items each.

Method

All four subject matter experts participated in this phase of the study. The rating material consisted of two sets of 50 items each. Each set contained 35 Usage items followed by 15 Sentence Correction items. For this study raters were instructed to rate the estimated delta and discrimination of items after discussing each item among themselves. However, unlike Study 1, no feedback was given to the raters after each item (Table 11 was made available to the raters to guide their ratings). In short, the situation was designed to simulate the conditions which would be likely to prevail in a practical situation in which item statistics were estimated by raters. The two sets of 50 items were rated in a 3-hour session.

Analysis and Results

The interrater reliability for difficulty ratings for sessions 4 and 5 was .95 and .91, respectively. The interrater reliability for Usage and Sentence Correction items, across both sessions, was .94 and .91, respectively. For discrimination ratings, the interrater reliability for sessions 4 and 5 was .88 and .86, respectively. The reliability for Usage and Sentence Correction items across sessions was .88 and .81, respectively.

These reliabilities are substantially higher than those found in Study 1. However, the raters were encouraged to use the information on item statistics available in Table 11. Therefore, it is likely that the increase in their agreement is due in part to their use of this information.

Table 12 shows the correlation of each rater with the other raters.

The correlations are reported separately for Usage and Sentence Correction items and for both item types combined. The rater-total correlations give an indication of how well a given rater agrees with the other three raters. As can be seen, rater 1 continued to be the most representative rater for difficulty and discrimination.

Insert Table 12 about here

Table 12 also shows the correlation of each rater with the estimated delta and r-biserial. The value shown in parentheses was computed as follows: Each item was assigned the mean value of the difficulty and discrimination for the error category to which that item belonged. This value will be referred to as the "empirical" rating. The correlation of the empirical rating with the estimated statistic is the value shown in parentheses.
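Restated as code (an illustration of my own, not the report's procedure; the data structures are invented), the "empirical" rating is simply the category mean assigned back to each item:

```python
# Illustrative sketch: the "empirical" rating described above. Each item is
# assigned the mean estimated statistic of its error category, and that
# category-mean prediction can then be correlated with the item-level estimates.
import numpy as np

def empirical_rating(categories, estimates):
    """categories: error-category label per item; estimates: estimated delta
    (or r-biserial) per item. Returns the category mean assigned to each item."""
    categories = np.asarray(categories)
    estimates = np.asarray(estimates, dtype=float)
    means = {c: estimates[categories == c].mean() for c in np.unique(categories)}
    return np.array([means[c] for c in categories])

# Correlation of the empirical rating with the estimated statistic:
# np.corrcoef(empirical_rating(categories, deltas), deltas)[0, 1]
```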

As can be seen, for difficulty the raters outperformed the empirical rating for Usage items but not for Sentence Correction items. When both item types are combined, each rater outperforms the empirical rating. A similar result is observed for discrimination.

Insert Table 13 about here

To see this in more detail, Table 13 shows the correlation of the combined raters and of the empirical rating with the estimated item statistics. The multiple correlation of the combined raters and the empirical rating with the estimated item statistics is also shown. As can be seen, for Usage items the raters outperform the empirical rating for both difficulty and discrimination. For Sentence Correction items the empirical rating slightly outperforms the combined raters for difficulty, but for discrimination the raters do slightly better. When both item types are considered together, Table 13 shows that the raters do better on both difficulty and discrimination. Examination of the multiple correlations shows that they are very close to the larger of the two single-order correlations. That is, it seems that the ratings from the subject matter experts and the empirical rating measure the same variable. In the case of Usage items the subject matter experts measure that variable with more accuracy. By contrast, in Study 1 the raters appeared to be more accurate rating Sentence Correction items.

Discussion

The results from Study 2 suggest that even after an extended period of practice and training, the accuracy in estimating item statistics of four subject matter experts does not approach the level that would be required to substitute ratings of item statistics for pretesting. In order for ratings to become practical substitutes for pretesting, their correlation with empirical estimates should exceed .80. The results of Study 2 indicate that the attainment of this goal in a cost-effective manner would not be possible at this level of rater performance.

In principle, the needed level of correlation can be achieved by adding more raters. However, since the correlation of the ratings with the estimated statistics is low in relation to the reliability of the ratings, it is likely that a fairly large number of raters would be required to achieve a high level of correlation. That is, it may be expensive to attain a high validity.
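One way to make the "more raters" argument concrete is the standard formula for the validity of a composite of k parallel raters (a sketch of my own; the report does not state how the required number of raters would be computed, and the input values below are hypothetical, chosen only so that roughly 20 raters reach a validity of .80, in line with the cost illustration that follows).

```python
# Illustrative sketch: validity of the mean of k parallel raters, given a
# hypothetical single-rater validity and a hypothetical average inter-rater
# correlation. The specific numbers are assumptions, not results of the study.
import math

def composite_validity(k: int, r_validity: float, r_interrater: float) -> float:
    return r_validity * math.sqrt(k) / math.sqrt(1 + (k - 1) * r_interrater)

for k in (1, 4, 10, 20):
    print(k, round(composite_validity(k, r_validity=0.40, r_interrater=0.20), 2))
# With these hypothetical inputs the composite reaches roughly .8 near k = 20.
```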

For example, at a rate of 100 items per three-hour rating session, the ratings may be estimated to cost approximately $2.00 per item per rater, or $100.00 per 50-item set per rater (a TSWE form has 50 items). Thus, if 20 raters were required to achieve a validity of .80, the ratings would cost $2000 (or $100 x 20). Furthermore, this assumes that it is possible to identify such a large number of qualified raters. It should be pointed out that the cost of processing and assembling a TSWE pretest is currently about $4000.

In short, the findings of this study and the evaluations of the raters themselves (see the appendix) suggest that ratings cannot be substituted for actual pretesting of TSWE items. This is discouraging from a practical point of view, since the savings and the minimization of item exposure that would result from such substitution were among the motivating factors behind the study.

The second motivating factor behind the study, however, was theoretical and consisted of an intent to uncover certain principles that could account for the variation in difficulty and discrimination among items. Such principles could, in turn, be used to train additional experts or to gain more control over the item generation process by alerting item writers to the factors that contribute to difficulty. Although the raters attempted to uncover those principles, by their own account they were not successful. Nevertheless, based on the second study there is reason to believe that the raters, individually and collectively, are able to predict estimated item statistics better than what we called the empirical rating (that is, the mean difficulty and discrimination computed on items sorted by error categories).

This suggests that subject matter experts can add a unique valid component of their own to the prediction of difficulty and discrimination. From a theoretical perspective, however, it is troubling that the raters, subject matter specialists and skilled test developers, were not able to articulate "theories" that could account for the variations among items in difficulty and discrimination. Clearly a fairly elaborate theory must be required to account for that variation. In considering how one might proceed in formulating an initial theory, an appropriate place to look for inspiration is linguistic and psycholinguistic research.

One of the most significant concepts in modern linguistic theory is the dichotomy between the deep and surface structure of sentences. The deep structure of a sentence is a representation of the meaning of that sentence (e.g., Langendoen, 1969). The surface structure is a manifestation of that meaning, that is, the sentence as we read it. It would appear that further research on the problems of predicting or anticipating the statistical characteristics of items could benefit from a linguistic analysis of test items. The basic hypothesis would be that the degree of difficulty in judging the grammaticalness of a sentence is related to the linguistic features of the stem. We may, for example, examine the deep structure of the stems of the easiest and most difficult items testing a specific grammatical error. Contrasting the resulting deep structures may give us some clues as to why the difficult items are difficult and the easy ones easy.

Alternatively, the key to the difficulty of items may not lie in their deep structure as such, but in the transformations that are applied to the deep structure to produce the surface structure. Consider the following two sentences taken from Langendoen (1969, Ch. 8):

1. The rumor that that the report which the advisory committee submitted was suppressed is true is preposterous.

2. The rumor is preposterous that it is true that the report which the advisory committee submitted was suppressed.

These two sentences have the same deep structure, that is, they mean the same thing, and yet they have surface realizations which differ substantially in their comprehensibility. If we were to base TSWE items on these sentences, it is likely that an item derived from sentence 1 would be more difficult.

It is to be expected that a syntactic analysis alone will not account fully for the statistical characteristics of items.

After all, the TSWE intends to measure correct usage, not just correct grammar, and usage refers to "... the attitudes speakers of a language have toward different aspects of their language ..." (Postman & Weingartner, 1966, p. 80). The problem is compounded by the fact that those attitudes, as well as other non-syntactic factors, probably are not invariant across subpopulations. Nevertheless, so long as the mixture of subpopulations taking the test is fairly constant, it may be feasible to use subject matter experts as a means of tapping the non-syntactic determinants of item difficulty and discrimination. It remains to be seen whether the integration of syntactic and non-syntactic factors improves the predictability of item statistics.
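As a closing illustration of the kind of combined prediction proposed here (entirely my own sketch; the feature set, data, and model are invented and only meant to show the shape of such an analysis), linguistic features of the stem and the mean expert rating could be entered jointly into a regression on the estimated delta:

```python
# Illustrative sketch of the proposed integration: regress estimated delta on
# (a) hypothetical linguistic features of the item stem and (b) the mean
# subject-matter-expert rating. Everything here is invented for illustration.
import numpy as np

def combined_prediction(features: np.ndarray, expert_mean: np.ndarray,
                        est_delta: np.ndarray) -> float:
    """features: (n_items, n_features) syntactic/psycholinguistic predictors;
    expert_mean: (n_items,) mean expert rating; est_delta: (n_items,) criterion.
    Returns the multiple correlation of the fitted values with estimated delta."""
    X = np.column_stack([np.ones(len(est_delta)), features, expert_mean])
    beta, *_ = np.linalg.lstsq(X, est_delta, rcond=None)
    fitted = X @ beta
    return np.corrcoef(fitted, est_delta)[0, 1]
```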

References

Bratfisch, O., Borg, G., & Dornic, S. Perceived item difficulty in three tests of intellectual capacity. (Report No. 28). Stockholm: University of Stockholm, Institute of Applied Psychology, 1972.

Breland, H. M. A study of College English placement and the Test of Standard Written English. (Project Report 77-1). Princeton, NJ: Educational Testing Service, 1976.

Cooper, L. A. Demonstration of a mental analog of external rotation. Perception and Psychophysics, 1976, 296-302.

Langendoen, D. T. The study of syntax: The generative-transformational approach to American English. New York: Holt, Rinehart, and Winston, 1969.

Lorge, I., & Kruglov, L. A suggested technique for the improvement of difficulty prediction of test items. Educational and Psychological Measurement, 1952, 554-561.

Lorge, I., & Kruglov, L. The improvement of estimates of test difficulty. Educational and Psychological Measurement, 1953, 34-36.

Millman, J. Determinants of item difficulty: A preliminary investigation. (Report No. 114). Los Angeles: University of California, Center for the Study of Evaluation, 1978.

Munz, D. C., & Jacobs, P. D. An evaluation of perceived item-difficulty sequencing in academic testing. British Journal of Educational Psychology, 1971, 195-205.

Postman, N., & Weingartner, C. Linguistics: A revolution in teaching. New York: Dell Publishing Co., 1966.

Prestwood, J. S., & Weiss, D. J. Accuracy of perceived test item difficulties. (Research Report 77-3). Minneapolis, MN: University of Minnesota, Department of Psychology, 1977.

Searle, B. W., Lorton, P., & Suppes, P. Structural variables affecting CAI performance on arithmetic word problems of disadvantaged and deaf students. Educational Studies in Mathematics, 1974, 371-384.

Thorndike, R. L. Item and score conversion by pooled judgment. Educational Testing Service Research Conference on Test Equating, Princeton, NJ, 1980.

Tinkelman, S. Difficulty prediction of test items. (Teachers College Contributions to Education, No. 941.) New York: Columbia University, 1947.

Table 1
Item and test analysis results for various TSWE forms

Form              E3      E4      E5      E6      E7      E8
Admin. Date       12/74   2/75    2/75    4/75    11/75   12/75
N                 1765    1920    1790    1685    1895    1830
Reliability       .890    .885    .872    .867    .893    .874
SEM (scaled)      3.7     3.8     4.1     4.0     3.6     4.2
Mean R. Bis.      .51     .49     .47     .46     .51     .49
Mean Equated Δ    9.2     9.4     9.4     9.6     9.1     8.9

Table 2
Interrater reliability, before and after discussion, of difficulty and discrimination ratings for each session and across sessions

                          Difficulty    Discrimination
Session A (N = 22)
  Before                    .65           negative(a)
  After                     .80           .38
Session B (N = 25)
  Before                    .62           .57
  After                     .79           .66
Session C (N = 22)
  Before                    .45           .27
  After                     .68           .44
Combined (N = 69)
  Before                    .61           .35
  After                     .77           .54

(a) The negative reliability resulted from a rater whose ratings correlated negatively with the other raters.

-29-

Table 3 Correlation of each individual rater with the mean rating and estimated delta and discrimination for session A (N = 22)

Difficulty Rater-total Rater-delta correlation correlation Before Rater 1 2 3

.66 .12 .63

4

.11

After Rater 1 2 3 4

.86 .43 .66 .54

-

.44 .14 .46 .05 .53 .12 .45

.11

Discrimination Rater-total Rater-r-biserial correlation correlation

-

-

.19 .07 .02 .38

-

.06 .26 .22 .38

.59

.21

.33

- .23

.39 .26

.15 .26

The rater-total correlation is the correlation of each rater with the average rating of the other three raters. The rater-delta or rater-r-biserial correlation is the correlation of each rater with the estimated equated delta and estimated item biserial correlation, respectively. The approximate critical value of the correlation for α = .05 is .42.

-30-

Table 4 Correlation of each individual rater with the mean rating and estimated delta and discrimination for session B (N = 25)

Difficulty Rater-delta Rater-total correlation correlation Before Rater 1 2

3 4

After Rater 1 2

3 4

.36 .54 .30 .44

.38 .10 .38

.70 .56 .55 .63

Discrimination Rater-total Rater-r-biserial correlation correlation

.32

.27 .46 .35 .39

.60 .48 .05 .11

.46 .14 .37 .38

.45 .52 .42 .44

.48 .17 .10 .10

The rater-total correlation is the correlation of each rater with the average rating of the other three raters. The rater-delta or rater-r-biserial correlation is the correlation of each rater with the estimated equated delta and estimated item biserial correlation, respectively. The approximate critical value of the correlation for α = .05 is .40.

-31-

Table 5 Correlation of each individual rater with the mean rating and estimated delta and discrimination before and after discussion for session C (N = 22)

Difficulty Rater-total Rater-delta correlation correlation Before Rater 1 2 3

4 After Rater 1 2 3

4

.57 .59 .10 .71

.19 .08

.27 .15 .05 .10

.00 .08 - .09 .00

.19 .35 .27 .15

.57 .30 .00 .20

.12 .28 .00 .46

.21 .39

.33

.50 - .12 .44

-

Discrimination Rater-total Rater-r-biserial correlation correlation

The rater-total correlation is the correlation of each rater with the average rating of the other three raters. The rater-delta or rater-r-biserial correlation is the correlation of each rater with the estimated equated delta and estimated item biserial correlation, respectively. The approximate critical value of the correlation for α = .05 is .42.

-32-

Table 6 Correlation of each individual rater with the mean rating and estimated delta and discrimination before and after discussion for session A, B, and C (N = 69)

Difficulty Rater-total Rater-delta correlation correlation Before Rater 1 2 3

Discrimination Rater-total Rater-r-biserial correlation correlation

4

.45 .44 .29 .40

.33 .29 .32 .05

.24 .23 .20 .08

.27 .04 .05 .17

After Rater 1 2 3 4

.69 .54 .48 .62

.37 .31

.48 .40 .32 .14

.29 .08 .08

.34 .20

.32

The rater-total correlation is the correlation of each rater with the average rating of the other three raters. The rater-delta or rater-r-biserial correlation is the correlation of each rater with the estimated equated delta and estimated item biserial correlation, respectively. The approximate critical value of the correlation for α = .05 is .24.

-33-

Table 7 Mean and standard deviation of difficulty and discrimination ratings by rater after discussion

N

Session A Rater 1 2 3 4

Difficulty Hean S. D.

Discrimination Hean S. D.

22 9.54 9.55 9.26 9.39

1.91 1.77 2.11

.44 .43 .42 .43

.08 .08 .08 .08

Hean Rating

9.43

1. 46

.43

.05

Empirical Estimates

8.85

2.38

.50

.10

9.22 9.50 9.59 9.59

1. 69 1. 80 2.15 1.51

.47 .41 .44 .43

.08 .09

Mean Rat ing

9.48

1. 41

.44

.07

Empirical Estimates

9.32

2.03

.44

.07

9.02 8.75 8.22 8.78

1.81 1. 99 1. 36

.46 .43 .47 .45

.08 .07 .08 .06

Mean Rating

8.69

1.24

.45

.04

Empirical Estimates

9.23

2.21

.49

.09

Session B Rater 1 2 3 4

Session C Rater 1 2 3 4

1. 60

25

.13 .07

22

1.72

Table 8
Correlation between the combined ratings and empirical item statistics before and after discussion for each session

                          Difficulty          Discrimination
                        Before   After       Before   After
Session A (N = 22)       .42      .47         .17      .17
Session B (N = 25)       .43      .42         .33      .28
Session C (N = 22)       .31      .35         .01      .32

The approximate critical values of the correlation for α = .05 and N of 22 and 25 are .42 and .40, respectively.

-35-

Table 9 Mean and standard deviation of absolute residuals by item type

N

Difficulty Mean S. D.

Discrimination Hean S. D.

Session A Usage Sentence Correction

15 7

1.85 1.52

1. 24 .70

Session B Usage Sentence Correction

15 10

1.80 .61

1. 34

.10

.71

.08

Session C Usage Sentence Correction

15

1. 76 1. 52

1. 20

.07

1.14

.05

7

.06 .08

.07 .05 .07 .07 .07 .02

Table 10
Correlation between ratings and empirical item statistics for sessions A, B, and C combined, for each item type and both item types combined

                           Difficulty          Discrimination
                      N   Before   After      Before   After
Usage                45    .30      .32        .17      .22
Sentence Correction  24    .56      .63        .34      .34
Both                 69    .36      .40        .21      .25

The approximate critical values of the correlation for α = .05 and N of 45, 24, and 69 are .29, .40, and .24, respectively.

Table 11
Mean and standard deviation of equated delta and biserial correlation by major categories

                                      Difficulty        Discrimination
Category                             Mean    S.D.       Mean    S.D.      N
No error (0)                         8.92    1.52       .40     .08       35
Subject-verb agreement (1)           9.06    2.12       .53     .09       20
Tense (2)                            9.19    1.83       .57     .06       15
Verb form (3)                        6.87    1.41       .51     .05       10
Connective (4)                       9.78    1.26       .53     .10       16
Logical agreement (5)               10.70    1.69       .55     .07        5
Logical comparison (6)              12.23    1.43       .53     .03        4
Modifier (7)                         9.77    1.42       .54     .08       11
Pronoun (8)                         10.33    1.65       .52     .11       29
Diction (9)                          9.59    1.24       .49     .10       11
Idiom (10)                           9.62    2.02       .48     .10       18
Parallelism (11)                     9.53    1.30       .52     .06       15
Sentence fragment (12)               7.30    1.84       .59     .06       10
Comma splice (13)                    9.31    2.32       .46     .09       10
Improper subordination (14)          9.35    1.78       .50     .10       12
Improper coordination (15)           8.70    2.62       .52     .10       10
Dangling modifier (16)               9.90    2.34       .45     .08        9
Redundancy/economy/constancy (17)    9.92    1.51       .47     .09        5
Vague pronoun reference (18)         8.80    2.64       .46     .08        5

Table 12
Correlation of each individual rater with the mean rating and estimated delta and discrimination for sessions D and E

                                 Difficulty                     Discrimination
                          Rater-total   Rater-delta      Rater-total   Rater-r-biserial
Usage (N = 70)
  Rater 1                    .93           .44               .84            .45
  Rater 2                    .84           .48               .67            .39
  Rater 3                    .78           .28               .76            .44
  Rater 4                    .92           .49               .70            .44
  (Empirical rating)                      (.16)                            (.40)
Sentence Correction (N = 29)
  Rater 1                    .92           .20               .72           -.10
  Rater 2                    .72           .15               .59            .05
  Rater 3                    .76           .18               .58            .27
  Rater 4                    .87           .26               .66            .12
  (Empirical rating)                      (.30)                            (.07)
Both (N = 99)
  Rater 1                    .93           .38               .82            .39
  Rater 2                    .80           .37               .66            .34
  Rater 3                    .77           .26               .73            .42
  Rater 4                    .90           .43               .69            .30
  (Empirical rating)                      (.22)                            (.31)

The approximate critical values of the correlation for α = .05 and N of 70, 29, and 99 are .23, .37, and .20, respectively.

Table 13
Simple and multiple correlations of ratings with difficulty and discrimination for Usage, Sentence Correction, and both item types combined

                                          Usage          Sentence Correction        Both
                                      Diff.   Disc.         Diff.   Disc.       Diff.   Disc.
Correlation of raters with estimates   .46     .50           .21     .13         .39     .45
Correlation of empirical rating
  with estimates                       .16     .40           .30     .07         .22     .31
Multiple correlation                   .48     .51           .30     .13         .39     .45
N                                       70                    29                  99

The approximate critical values of the single-order correlation for α = .05 and N of 70, 29, and 99 are .23, .37, and .20, respectively.

Figure 1

MEAN DELTA BY MAJOR ERROR CATEGORIES

[Figure: mean equated delta, plus and minus two standard errors, plotted by error category. Vertical axis: delta value; horizontal axis: error categories.]

Figure 2

MEAN DISCRIMINATION BY MAJOR ERROR CATEGORIES

[Figure: mean discrimination, plus and minus two standard errors, plotted by error category. Vertical axis: discrimination (0.0 to 1.0); horizontal axis: error categories.]


APPENDIX

Reactions from the Participating Raters


Rater 1

My strongest reaction to participating in the study was one of frustration. Initially I was interested in the possibilities that could result from such a study and excited about the exchange of information that could go on between my colleagues and myself. However, after the very first session, I began to seriously question whether the process was a feasible one. I feel that all of the participants approached the task earnestly, bringing to the study a tremendous amount of experience with the item types being scrutinized. Yet, I found that because so many factors, known and unknown, affect the difficulty of items, we were unable to maintain any level of consensus. When we accurately assessed an item at the proper level of difficulty, I felt that it was merely a matter of luck because logical analysis, in almost all cases, could not be applied to other items testing the same problems at the same levels of difficulty. Although the experiment did not work, it was useful in two ways. First, it allowed for several of us to sit together on several occasions to discuss elements that may affect the performance of items. I feel this is a very necessary part of test development work, a part that often must be neglected because of the demands imposed by workloads and time schedules. Second, the study served to reinforce my belief in the importance of pretesting. I think the study clearly showed that dependence on guesses of item difficulty by test assemblers would be a dangerous procedure to implement because no matter how careful we are in trying to determine item difficulty, our estimates are at best only educated guesses.


Rater 2

The TSWE rating study was useful to me in several respects: it demonstrated the hazards of attempting to predict the difficulty of TSWE items; it enabled me to understand the value of the no-error option in TSWE items; and it provided an opportunity for discussion with my colleagues in test development of particular characteristics in items that are likely to affect candidate response. The knowledge I have gained from the experience will surely help me in TSWE test development work. It helps to explain why, for example, we have had only partial success in writing items for a particular level of difficulty, even when we use items of known difficulty as models. It tells us that the no-error distracter (and therefore the no-error item) is integral to the usage item type, even though no-error items often have r-biserials lower than we would like. And it makes me yet more aware of the peculiarities of language and sentence structure that must be considered in items testing writing skills. As best I can discover, my strategy in arriving at ratings consisted of a series of considerations followed by a largely intuitive judgment. In assessing the difficulty of a particular item I tried to recall my experience in test development with items of the same kind, looking primarily at the problem tested at the key, then at those tested at the distracters, and attempting to recover a sense of how such items had performed. (I was also helped, of course, by knowing that the items in the study had appeared in final forms and were therefore likely to fall within the TSWE range of difficulty and discrimination.) I also looked at the complexity of the sentence structure and the level of vocabulary in which the problems were presented. But the estimated rating itself was the result of a single intuitive judgment that occurred after and was in some sense distinct from all such considerations. In spite of all I learned, however, and in spite of the good company in which I learned it, the task itself was an extremely difficult one and did not become any easier as the study progressed. I am not at all sure that I ever learned to predict the difficulty of TSWE items. In fact, when I attempted at various times in the progress of the study to rate items on the basis of experience with items previously rated, I found the procedure more frustrating and the results less successful than when I used my accustomed strategy (described above). On the basis of my experience in the study, then, I would think it extremely imprudent to abandon or curtail pretesting in favor of a system based on the estimates of test assemblers, however experienced and well-trained.

Rater 3

The research project proposed to those of us in Test Development who had had experience with items for the Test of Standard Written English seemed a reasonable one. We, after all, were the ones most familiar with the items: we created them; we reviewed them; we assembled them into tests. Although we do not estimate item statistics for individual items before pretesting, we are highly successful in writing and choosing items for pretests that suit the skills of the population. Rarely do we include in a pretest an item that is too difficult for the population; most of our "failures" in pretesting occur in particular areas (testing "between you and I," for example) and we therefore expect items testing those problems to yield statistics unsuitable for the test. When I was a newcomer to Test Development many years ago, I was told not to waste my time estimating the statistics of individual items; I would be much better off estimating the difficulty level of the entire test, on the theory that I would go wrong on individual items but not on an entire test. I suppose I assumed that that instruction was a bow to my inexperience at the time because I certainly went to the rating sessions believing that I would find it easy to estimate accurately the difficulty of the items. However, I knew that I would not be able to estimate r-biserials accurately, for, barring noticing whether an item has an r-biserial sufficiently high to allow the item to be used in a test, I pay little attention to that statistic--a function of the fact that most items are not pretested on the population taking the final form (as TSWE items are) and therefore have r-biserials that will change somewhat when the items are used in final forms. Even so, I thought the task would be fairly simple. We would be dealing with items used in TSWE final forms; no matter how those items were mixed, it would be logical to assume that there would be few, if any, items above delta 13.0 and few, if any, items with r-biserials below .30. What happened at our first meeting, therefore, was hardly what I expected: If I estimated an item as having a delta of 11.0, my colleagues would estimate 9.0, and the final form statistics would indicate that the item was an 8.0. If all of us agreed that the item was an 8.0, the final form statistics would indicate that it was a 12.0. Every estimate was wrong. Trying to learn from experience, I postulated theories, none of which proved valid. Whether the key in Type I items was located at the beginning or end of a line made no difference; the relative difficulty of the vocabulary in the sentence made no difference; clues in the sentence made no difference; the relative rarity of the structure or the tense made no difference--and on and on. Everything I had ever learned or thought about language had no relevance. Nevertheless, I remain grateful that I had the chance to be a rater.


The value of the study, to me, lies in the opportunity it provided to look at items in detail, to learn how little we really know about the ways students respond to items and why some options prove attractive and others do not. I would like to see studies done that would provide us with the information we lack--a study, for example, that would require individual students to respond to items and then to tell the investigator why they responded as they did. I would also like to see an analysis of the way the context in which an item is placed affects the item, if it does indeed do so. (Test assemblers always check for word and option overlap and avoid placing items with similar problems and similar sentence structure next to each other. What factors are we missing? What makes the same words in the same structure--for example, according to--attractive in one item and not worthy of choice in a second?) Already under way are plans to secure more data concerning those unusual items, turned up during the ratings, that varied in difficulty level from form to form of the test. The experience of being a rater was frustrating, interesting, challenging, defeating--all at the same time. It proved one's ignorance and increased one's desire to learn. The opportunity to look at items closely just to discuss with colleagues particular points in those items and particular theories of language learning was one I had never had before. We tend to look at items closely only to evaluate the clarity and style of the sentence or the validity of the key. To share ideas with, and to learn from, colleagues as one rates the items is invaluable, despite all the frustration and discouragement of the rating sessions.

Rater 4

Foremost among my reactions to participation in the item rating study is a sense of frustration. Despite earnest efforts, I did not perceive that my ability to rate items improved with experience. I suspect that I was more accurate in the first session than I was in the last session. Of the four participants, I have had the least experience with the two item types we rated, and yet, I did not feel that my colleagues' initial ratings or subsequent ratings were more accurate than mine. Neither logic nor common sense nor comparisons with similar items nor remembrances of items past seemed to add accuracy to our ratings. The most valuable information present was the knowledge that the items we were rating had appeared in final forms. This information enabled me to deduce that most of the items had r's above .30, that few would be above delta 12 or below delta 6, and that the numbers on the items were significant clues as to how easy or difficult the pretested items had been when the final form was assembled. Had I been working with unpretested items, I would, I suspect, have been considerably less accurate in my ratings. The experience was useful in that it convinced me of the importance of pretesting and the danger of over-reliance on estimates, even the most carefully considered, of item difficulties. The experience was extremely useful as an exercise in close examination of the item types presented.