
A Procedure for Estimating Intrasubject Behavior Consistency

Educational and Psychological Measurement, Volume 66, Number 3, June 2006, pp. 417-434. © 2006 Sage Publications. DOI: 10.1177/0013164405275667. http://epm.sagepub.com hosted at http://online.sagepub.com

José M. Hernández, Víctor J. Rubio, Javier Revuelta, and José Santacreu
Universidad Autónoma de Madrid

Trait psychology implicitly assumes the consistency of personal traits. Mischel, however, argued against the idea of a general consistency of human beings. The present article aims to design a statistical procedure, based on an adaptation of the π* statistic, to measure the degree of intraindividual consistency independently of the measure used. Three studies were carried out to test the suitability of the π* statistic and to estimate the proportion of subjects who act consistently. The results show the appropriateness of the proposed statistic and indicate that the percentage of consistent individuals depends on whether the test items can be assumed to be equivalent and on the number of response alternatives they contain. The results suggest that the percentage of consistent subjects is far from 100% and that this percentage decreases when items are equivalent. Moreover, the greater the number of response options, the smaller the percentage of consistent individuals.

Keywords: consistency; objective tests; item response theory

Authors' Note: This research was supported by the Spanish Science and Technology Ministry (grant reference BSO2003-02162) and by the project AENA-UAM/785002. Please send correspondence concerning this article to José Santacreu, Facultad de Psicología, Universidad Autónoma de Madrid, 28049 Madrid, Spain; e-mail: [email protected].

The notion of intraindividual consistency lies at the basis of all personality psychology. Ever since Mischel (1968) formulated his theoretical and methodological criticism of the concept of consistency sustained by trait psychologists, various authors have debated the various conceptions of consistency (Bem, 1983; Epstein, 1983a, 1983b, 1984; Funder, 1983a, 1983b; Funder & Ozer, 1983; Kenrick & Funder, 1988; Mischel & Peake, 1982, 1983; Ozer, 1986), chiefly throughout the 1980s. The debate originated with the conception defended by trait psychology that there exists a universal structure of human personality, expressed in the ontological assumption underlying the proposition "All human beings are consistent." Even though such an ontological supposition is unsustainable considering, as Zuckerman (1991) pointed out, that perfect consistency is impossible given the continual interaction between phenotype and environment, the idea of a general consistency of human beings underlies most of personality psychology. This conception supposes that because people are consistent in some of their behaviors, we can identify the same kind of consistent behaviors, corresponding to the major dimensions of personality, in all subjects. The initial empirical work of Cattell and Guilford indicates this possibility.

Based on this, methodological work focused on designing instruments capable of revealing this universal structure. The first step, then, was to identify those behaviors, or those verbal statements that individuals make about themselves, that tend to be grouped in the dimensions that would make up this personality structure. Whether a behavior (or its verbal description) belongs to a dimension is established by the covariation of this behavior with other behaviors belonging to the same category across a broad sample of subjects. With this approach, any interindividual differences reveal the various degrees to which persons present the assessed feature. Given this, a good instrument for assessing a personality trait is one made up of elements that can detect the consistency of individuals irrespective of the magnitude of the trait variable in each of them. Consequently, the process of constructing an assessment instrument that aims to measure personality traits demands the elimination of those items that do not contribute to increasing the internal consistency of the test, because it is supposed that such items are not measuring the dimension or trait being explored. That the individuals may not be consistent is not doubted. This is why, despite the evident fact that there are two sources of variation in the scores of a test, (a) the items of the test and (b) the individuals tested, methodological and analytical procedures have not been implemented to allot separate values to intraindividual consistency and to the consistency of the elements that make up the test.

At the end of the 1960s, Mischel (1968) questioned the ontological supposition of the consistency of individuals and, by extension, the method for constructing tests derived from it. The consistency paradox (Mischel, 1968) refers to the fact that the starting point supposes the consistency of each individual, although the empirical data gathered by the author do not support intraindividual consistency and consequently suggest that the situation contributes the fundamental part of the variance that explains the behavior. Beyond the various responses given to this approach (among the best known are those of Bem & Allen, 1974; Bem & Funder, 1978; and Epstein, 1979) and the evolution of Mischel's own positions (Mischel, 1990; Mischel & Shoda, 1995), we will highlight two aspects of the debate.

The first has to do with the kind of situation in which we assess the person's behavior. Because the behavior of individuals occurs in a specific situation, to study the consistency of the behavior, the situation must be of a kind that allows several behaviors and does not itself determine which specific behavior will be executed.
A situation that (because a given behavior is reinforced to a greater degree than others or has some kind of preeminence) determines one and only one of the possible behaviors for all those who find themselves in that situation will contribute nothing to the analysis of an individual's personal behavior style, because interindividual variation will not be possible (see the analysis proffered by Harzem,
1984, and later that of Ribes, 1990a, 1990b, and Ribes & Sánchez, 1992, of the contingencies established in an assessment situation). This is the case of an item lacking interindividual variability.

The second aspect has to do with the differences between the concepts of intraindividual consistency and the statistical consistency of the measuring instrument. The consistency of an assessment instrument refers to the consistency among the scores of its items, which in turn refer to the behaviors of the person in the situation considered in each element of the test. Setting aside the supposition of the general consistency of people, attempts are occasionally made to confirm the consistency of individuals, on the understanding that this consistency can be detected irrespective of the agents used to reveal it, by calculating the correlation between the items of a single test. Attempts are also made, however, to establish the homogeneity or communality of the items of an instrument that assesses a trait by calculating the internal consistency of that test. In fact, as Hogan, Benjamin, and Brezonski (2000) found, 75% of the reliability coefficients reported in the Directory of Unpublished Experimental Mental Measures, Volume 7 (Goldman, Mitchel, & Egelson, 1997), were estimates of internal consistency.

We will consider two paradoxes in this respect. The first is that the measurement of internal consistency, which is not a direct measurement of reliability but rather of the degree to which the items of a test group together to measure one and the same construct (see Henson, 2001), has been strongly criticized as a measurement of the unidimensionality of an instrument (see Hattie, 1984) for the purpose of verifying this assumption in item response theory models. The second paradox applies to all the replication designs for establishing the reliability of the scores of an instrument. Since the development of classical test theory, reliability has been defined as the proportion of variance of the empirical scores due to true score variance or, given that true scores cannot be accessed directly, as the degree of consistency of a subject's scores across replications of the measurement procedure (Brennan, 2001). In the case of two measurements (the test-retest replication design), should significant differences appear between the two applications of the test (which have supposedly controlled the time between testings, the testing conditions, the examiner, etc.), such differences are attributed to the instrument (as a whole or to some of its elements). But the instrument has not varied either. To reject that the origin of the variability is as attributable to the instrument as to the subjects, to the conditions, or to any combination of these is, in any case, limited reasoning with no logical basis.

We consider the case of internal consistency to be a (restrictive) variation of the replication designs, one that assumes a priori that the items are equivalent (see Brennan, 2001). The degree of equivalence required in each assessment instrument has varied (see the review by Lord & Novick, 1968). Nevertheless, the fact is that the two most frequent strategies to increase the covariation between items, particularly in the case of personality questionnaires, are, on the one hand, the semantic duplication of items and, on the other, the generation of items that represent more global and general situations. Either of them feeds an intrasubject consistency that can be more an artifact of
the agents that make up the measurement instruments than a reflection of the consistency of the individuals. Although several technical solutions for this problem have been proposed in psychometrics, personality psychology has not proposed any solid methods.

Hence, the objective of this article is to test a statistical procedure that, given a subject's scores on an objective assessment test, will determine the degree of intraindividual consistency revealed, independently of the measurement instrument used. The hypothesis to be verified is contrary to the theoretical presuppositions of trait psychology: it states that the percentage of individuals showing perfect consistency will differ significantly from 100%, which is what would be expected under the ontological supposition of consistency. To test this hypothesis, it is necessary to design situations, one in each item, that are exactly equal both morphologically and functionally.

Estimating the Proportion of Inconsistent Subjects

To resolve the question posed, we propose a procedure that allows estimating the proportion of consistent and inconsistent subjects, and the probability of the consistency of each subject, independently of the subjects' level on the variable measured and independently of whether the test items are equivalent. The procedure is an adaptation to the psychometric context of the π* statistic proposed by Rudas, Clogg, and Lindsay (1994) for the analysis of contingency tables.

The π* Goodness-of-Fit Index

Suppose that θ indicates the trait level of an individual. From a psychometric point of view, consistency means that the value and dimensionality of θ are invariant over the whole test. This investigation addresses the problem of estimating the proportion of inconsistent subjects in a given population, which is divided into two groups: a group of consistent subjects and a group of inconsistent ones. Consistent subjects have a constant θ and follow a given response model over the whole test, whereas no assumptions are made about the behavior of inconsistent subjects. The probabilities of sampling an individual at random from each group are 1 − π* and π*, respectively. The main objective of the current investigation is to introduce a method for estimating π*.

Let x_h = (x_h1, . . . , x_hI) be a subject's response pattern on a test of I items, with h = 1, . . . , H, where H is the number of different response patterns. The probability of each response is given by an item response model (Birnbaum, 1968). The most general model we will use is the nominal response model (NRM; Bock, 1972), according to which the probability of obtaining response r on item i, composed of A alternatives, is

$$P(x_{hi} = r) = \frac{\exp(a_{ir}\theta + b_{ir})}{\sum_{a=1}^{A}\exp(a_{ia}\theta + b_{ia})}, \qquad (1)$$
where a_ir and b_ir are the parameters that describe response r. Parameters a_i0 and b_i0 are fixed to zero to identify the model. We will also use the two-parameter logistic model (2pl), which is the same as the NRM but assumes that data are dichotomous. The one-parameter logistic model (1pl) applies to dichotomous data and assumes that a_i1 takes the same value for all the items. These models are frequently applied for analyzing attitude and personality questionnaires (Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001; Reise & Waller, 1990; Steinberg & Thissen, 1995).

Suppose that θ follows a distribution f(θ) in the population. It is assumed from now on that f(θ) is standard normal. Then, assuming that the subject is consistent, the marginal probability of the response pattern is given by

$$P(x_h) = \int_{-\infty}^{\infty} \prod_{i=1}^{I} P(x_{hi})\, f(\theta)\, d\theta. \qquad (2)$$
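To make Equations 1 and 2 concrete, here is a minimal Python sketch. It is our own illustration rather than code from the article; the function names and example parameter values are invented, and a simple grid quadrature stands in for the exact integral.

```python
import numpy as np

def nrm_probs(theta, a, b):
    """Equation 1: response probabilities of one item under the nominal
    response model at trait level theta. a and b hold the slope (a_ir) and
    intercept (b_ir) of the A alternatives; the first alternative is fixed
    to a = b = 0 to identify the model."""
    z = np.exp(a * theta + b)
    return z / z.sum()

def pattern_probability(x, a, b, n_nodes=81):
    """Equation 2: marginal probability P(x_h) of a response pattern,
    integrating theta out against a standard normal approximated on an
    equally spaced grid. x[i] is the index of the alternative chosen on
    item i; a[i] and b[i] are that item's parameter arrays."""
    nodes = np.linspace(-6.0, 6.0, n_nodes)
    weights = np.exp(-0.5 * nodes ** 2)
    weights /= weights.sum()                       # discretized N(0, 1)
    like = np.ones_like(nodes)
    for xi, ai, bi in zip(x, a, b):
        like *= np.array([nrm_probs(t, ai, bi)[xi] for t in nodes])
    return float(np.sum(weights * like))

# Hypothetical two-item test with three alternatives per item.
a = [np.array([0.0, 0.6, 1.2]), np.array([0.0, 0.8, 0.4])]
b = [np.array([0.0, -0.3, 0.4]), np.array([0.0, 0.2, -0.5])]
print(nrm_probs(0.5, a[0], b[0]))          # P(x_hi = r) for r = 1, 2, 3
print(pattern_probability([2, 0], a, b))   # alternative 3 on item 1, alternative 1 on item 2
```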

The assumption of consistency does not imply that the probability P_ir is the same for every item i, only that the subject's trait (θ) remains invariable across the stimuli. For this reason, the values of a_ir and b_ir can vary across the items even when consistency holds. For the group of inconsistent subjects, in contrast, no invariable trait level across all the stimuli can be asserted. The probability of the response pattern for these subjects can be called, in general terms, R(x_h). Because there is no model of how an inconsistent subject acts, we shall not provide an analytical expression for R(x_h). Under these conditions, the marginal probability of the response pattern of the subject is

$$f(x_h) = (1 - \pi^*)P(x_h) + \pi^* R(x_h). \qquad (3)$$

The value of 1 − π* can be interpreted as a lower limit on the proportion of consistent subjects, because there exists no information about what happens with the subjects that fall outside the model (i.e., about the residual distribution R(x)). Finding a certain proportion of individuals outside the model does not necessarily mean that the value of θ changes throughout the test, as would happen if the consistency assumption were not met.

A sample of n subjects is required to estimate the parameters. Let n_h be the observed frequency of response pattern h. The probability of the observed data n = (n_1, . . . , n_H) follows the multinomial distribution:

$$g(n) \propto \prod_{h=1}^{H} f(x_h)^{n_h}. \qquad (4)$$
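The pieces of Equations 3 and 4 translate directly into code. The sketch below is our own; the full procedure of Rudas et al. (1994) re-estimates the item parameters and the unrestricted residual distribution at each candidate value of π*, which is omitted here.

```python
import numpy as np

def mixture_probs(pi_star, p_model, r_residual):
    """Equation 3: f(x_h) = (1 - pi*) P(x_h) + pi* R(x_h), combining the
    model-implied pattern probabilities with an unrestricted residual."""
    return (1.0 - pi_star) * p_model + pi_star * r_residual

def g2(counts, probs):
    """Likelihood ratio statistic G2 comparing the multinomial likelihood
    of Equation 4 with the saturated model: 2 * sum n_h log(n_h / (n f_h)).
    Patterns with zero observed frequency contribute nothing."""
    n = counts.sum()
    nz = counts > 0
    return 2.0 * float(np.sum(counts[nz] * np.log(counts[nz] / (n * probs[nz]))))
```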

Maximum likelihood estimation consists in maximizing g(n) with respect to π* and the item parameters. This is achieved by using marginal estimation (Bock & Aitkin, 1981) with the expectation-maximization algorithm (Dempster, Laird, & Rubin, 1977). Because R(x) is unrestricted, the distribution g(n) fits the data whenever π* is large enough. The likelihood ratio statistic (G²) is used to assess the fit, and the estimator of π* is the smallest value of that parameter for which G² = 0 (Rudas et al., 1994). These authors also describe the interval estimator of π*, which takes the form (π*_L, 1). The lower limit depends on the level of confidence, and the upper limit is 1. If the confidence level is 95%, π*_L is the value of π* for which G² = 2.7. The difference between this estimation procedure and the one commonly used in psychometrics is that the latter assumes that all the subjects fit the model (are consistent) and therefore that π* = 0.

Once the parameters have been estimated, it is possible to determine the probability of each subject belonging to the group of consistent or inconsistent subjects. The conditional probability that a subject is consistent is given by

$$f(\text{consistent} \mid x_h) = \frac{(1 - \pi^*)P(x_h)}{f(x_h)}. \qquad (5)$$
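Equation 5, and the way the estimates of π* and π*_L are defined, can be sketched as follows. This is our illustration only: the numbers are made up, and in the real procedure P(x_h) and R(x_h) come from the EM estimation.

```python
def posterior_consistent(pi_star, p_h, r_h):
    """Equation 5: posterior probability that a subject showing pattern
    x_h belongs to the consistent group."""
    f_h = (1.0 - pi_star) * p_h + pi_star * r_h    # Equation 3
    return (1.0 - pi_star) * p_h / f_h

# The point estimate of pi* is the smallest value for which G2 = 0, and
# the 95% lower bound pi*_L is the smallest value for which G2 <= 2.7
# (Rudas et al., 1994); both are found by scanning candidate pi* values
# and refitting the remaining parameters at each step (omitted here).

# A pattern that is likelier under the consistent model than under the
# residual distribution yields a high posterior:
print(posterior_consistent(pi_star=0.3, p_h=0.02, r_h=0.005))  # about 0.90
```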

Other Indices Used in Item Response Theory

Two statistics have been used in the context of item response models to assess consistency. The first is the lz statistic, proposed by Drasgow, Levine, and Williams (1985). The second is the α statistic proposed by Ferrando (2004).

The lz statistic is a goodness-of-fit statistic for the subject and is applied with dichotomous models. It is useful for detecting subjects who do not behave according to the model, for example, those responding to the test without paying attention to some questions. To apply it, one must have a test previously calibrated with one or more of the dichotomous models in their unidimensional version. The distribution of lz is approximately standard normal; therefore, for each subject, one tests the hypothesis that the subject acts according to the model. The π* statistic presents some theoretical differences with respect to lz. First, π* is independent of the model and can be calculated with any model that specifies how consistent subjects respond. Second, lz does not directly provide an estimator of how many subjects in a population are consistent; the consistency hypothesis must be verified subject by subject, which increases the probability of Type I error across the set of subjects assessed.

The α index was proposed to assess the degree of consistency of each subject, understood as the variability of θ throughout the test. It applies only to the 1pl model and assumes that an inconsistent subject is one whose θ value varies throughout the test; α is precisely a measurement of this variation. There are three main differences between π* and α. First, π* is defined for the population of subjects and α for each particular subject. Moreover, α is based on the supposition that all subjects are inconsistent and assesses the degree of inconsistency of each one, whereas π* divides the population into a consistent group and an inconsistent group. Last, α considers only one type of inconsistency, the variation of θ throughout the test assuming the 1pl, whereas π* assumes nothing about how inconsistent subjects act.
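For reference, the lz index has a simple closed form for dichotomous data. The sketch below is our reconstruction of the standardized log-likelihood index of Drasgow et al. (1985), assuming the item probabilities at the subject's estimated θ are already available from a calibrated model.

```python
import numpy as np

def lz_statistic(x, p):
    """Standardized log-likelihood person-fit index lz for a 0/1 response
    vector x and model probabilities p evaluated at the subject's
    estimated theta. Approximately N(0, 1) for model-fitting subjects;
    large negative values flag aberrant response patterns."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))    # observed log-likelihood
    mu = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))    # its expectation
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)    # its variance
    return float((l0 - mu) / np.sqrt(var))
```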


The π* statistic is a goodness-of-fit statistic for the whole model; f(consistent | xh), lz, and α are goodness-of-fit statistics applied to each particular subject. Despite their theoretical differences, the three indices may give similar results in empirical studies. We address this issue in the applications described next.

Three Empirical Studies

Three studies were conducted to illustrate how the π* index is applied in practice; they are described below.

Study 1

The first study has two objectives: (a) to estimate the percentage of subjects who act consistently in a task requiring conscientiousness (described in Hernández, Sánchez-Balmisa, Madrid, & Santacreu, 2003) and (b) to assess the equivalence of the various tasks or items that comprise the test.

Method

A sample of 428 participants carried out 15 items of a conscientiousness test with a Cronbach's α coefficient of .76 (Hernández et al., 2003). The task consists in identifying and marking an object (a type of tree) in a matrix containing that object mixed with other objects (other types of tree). The maximum time for solving each item is 20 seconds. Each item presents a matrix of 12 rows by 10 columns (120 cells), and 60% of the cells (72) contain a figure. Of these 72, 14 contain the element to identify, whereas 58 contain a distractor (a different type of tree). The elements to identify are distributed in 7 rows and 7 columns, each with two "target" elements. The score is obtained by considering the order (by rows and/or columns) followed in solving the matrix: each item earns a point only if the individual clicks consecutively on two targets belonging to the same row or column. Because the configuration of each matrix is different, each item is morphologically different (the type and the position of the tree to be identified vary), although functionally the same, because all items can be solved by using the same behavioral strategy. The range of scores of each item corresponds to an ordinal scale from 0 to 7, where a score of 0 corresponds to no conscientiousness and a score of 7 to maximum conscientiousness. Because of the small number of subjects, only the first 7 items were analyzed. Furthermore, the responses were dichotomized using the value of 3.5 as the cutoff point. This way, the number of possible response patterns is 2^7 = 128, which allows assessing the model's goodness of fit. Figure 1 illustrates an example item.

Figure 1
Example of a Conscientiousness Test Item
[Figure: screenshot of the matrix task. Squares on the screen represent targets already clicked by the subject.]

The nature of the task might suggest that it assesses a perception-related competence or aptitude. The data indicate that this is not so. For the sample analyzed, the mean number correct was 13.65 (over a theoretical maximum of 14), and the mean number wrong was 0.63 (Hernández et al., 2003). Moreover, the maximum execution time was 14.95 seconds, clearly less than the 20 seconds allowed, and the correlation between the efficiency rate and that of meticulousness was r = .030. All this indicates that competence was not a variable that interfered with our results, because all the participants were shown to be highly competent.

Three models were estimated to address both objectives. Model 1 is the 2pl; it assumes that the items are not equivalent (parameters a and b can differ across items) and that the subjects are consistent (θ constant). Model 2 is the 1pl; it also assumes that the items are not equivalent. Model 3 assumes that the items are equivalent (equal parameters). Model 3 is nested in Model 2, and Model 2 is nested in Model 1. This allows testing each model's goodness of fit by means of the likelihood ratio statistic (G²); a computational sketch is given at the end of this study.

Results

Table 1 contains the estimated values of the parameters and the goodness-of-fit statistics. Only Models 1 and 2 fit the data. Although Model 1 fits the data, more than 36% of the subjects fall outside of it. This result may be due to the small sample size and the numerous response patterns with low or null frequencies. About 50% of the subjects fall outside Model 2. The comparison of Models 1 and 2 shows that conclusions about consistency depend on the model assumed for the subjects' responses.

Discussion

As to the equivalence hypothesis, the result is clear: the items are not equivalent. The various items cannot be considered tasks that are equivalent to each other, despite our attempt to construct equal tasks differing only in the location of the targets.
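As a computational note (our sketch, not part of the original analysis code), the nested-model comparison mentioned in the Method reduces to referring the difference of two G² values to a chi-square distribution; the figures below are the Model 2 and Model 3 results from Table 1.

```python
from scipy.stats import chi2

g2_model2, df_model2 = 106.57, 119     # 1pl, nonequivalent items (Table 1)
g2_model3, df_model3 = 1202.24, 125    # equivalent items (Table 1)

delta_g2 = g2_model3 - g2_model2       # likelihood ratio test statistic
delta_df = df_model3 - df_model2
p_value = chi2.sf(delta_g2, delta_df)
print(f"delta G2 = {delta_g2:.2f} on {delta_df} df, p = {p_value:.3g}")
# The vanishing p value rejects the item equivalence restriction.
```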


Table 1
Results for Study 1: Item Parameter Estimates and Goodness-of-Fit Statistics

        Nonequivalents (2pl)   Nonequivalents (1pl)      Equivalents
Item        a        b             a        b             a        b
1        0.912   -0.296         1.004   -0.300         0.400    0.165
2        0.970   -2.256         1.004   -2.271         0.400    0.165
3        0.497    2.618         1.004    2.896         0.400    0.165
4        1.204    0.019         1.004    0.176         0.400    0.165
5        0.401    1.973         1.004    2.240         0.400    0.165
6        1.840   -1.509         1.004   -1.166         0.400    0.165
7        1.002    0.542         1.004    0.544         0.400    0.165
G²       92.06                 106.57               1,202.24
df         113                    119                    125
p         .926                   .786                   .000
π*       0.363                  0.501                  0.994
π*_L     0.260                  0.445                  0.990

Note: G² = goodness-of-fit statistic; df = degrees of freedom; π* = proportion of subjects outside the model; π*_L = lower bound of π*. Sample size is 428.

The data allow sustaining that approximately 64% of the participants behave consistently. The method does not allow determining why the model does not fit the remaining subjects, whether because of a lack of consistency or for some other reason. Also, the small sample size can cause an overestimation of π*. In conclusion, it is possible to test both hypotheses at the same time, but doing so requires large samples. With tests that have more than 10 items, which is common in psychological assessment, or with items that have more than two response categories, the number of subjects should be on the order of several thousand.

Study 2

Because the dichotomization of the scores in the foregoing study restricted the variability of the responses, and because π* could have been overestimated because of the number of subjects analyzed, we attempt to replicate the analyses using a much larger sample and expanding the variability of the responses by including polytomous scores.

Method

A sample of 12,295 participants was assessed using the conscientiousness test employed in Study 1. As in that study, 7 of the 15 items were used. The responses (gathered on a scale ranging from 0 to 7) were recoded in two ways: (a) dichotomized data, for which responses 0 to 3 became 1 and responses 4 to 7 became 2, and (b) polytomous data, for which responses 0, 1, and 2 became 1; 3 and 4 became 2; and 5, 6, and 7 became 3. A sketch of this recoding follows.
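The two recodings can be written compactly; this sketch is ours, with a randomly generated scores matrix standing in for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 8, size=(12295, 7))   # hypothetical 0-7 item scores

dichotomous = np.where(scores <= 3, 1, 2)      # 0-3 -> 1, 4-7 -> 2

polytomous = np.ones_like(scores)              # 0, 1, 2 -> 1
polytomous[(scores == 3) | (scores == 4)] = 2  # 3, 4    -> 2
polytomous[scores >= 5] = 3                    # 5, 6, 7 -> 3
```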


Table 2
Results for Study 2: Item Parameter Estimates and Goodness of Fit for Dichotomous Data and Models of Equivalent and Nonequivalent Items

        Nonequivalents (2pl)   Nonequivalents (1pl)      Equivalents
Item        a        b             a        b             a        b
1        0.143    0.247         0.676    0.378         0.160    0.139
2        0.238   -0.047         0.676   -0.065         0.160    0.139
3        0.132   -1.106         0.676   -1.709         0.160    0.139
4        0.174   -0.761         0.676   -1.139         0.160    0.139
5        0.125   -0.337         0.676   -0.526         0.160    0.139
6        0.257    0.663         0.676    0.916         0.160    0.139
7        0.269    0.380         0.676    0.519         0.160    0.139
G²      347.45                 532.80               6,821.82
df         113                    119                    125
p         .000                   .000                   .000
π*       0.261                  0.289                  0.992
π*_L     0.201                  0.262                  0.990

Note: G² = goodness-of-fit statistic; df = degrees of freedom; π* = proportion of subjects outside the model; π*_L = lower bound of π*. Sample size is 12,295.

The dichotomized data were analyzed with the 2pl and the 1pl, and the polytomous data with the NRM, which is the extension of the 2pl to more than two response categories. For both sets of data, we estimated the model corresponding to the supposition of equivalent items and the model corresponding to the supposition of nonequivalent items.

Results

Dichotomized data. The number of different possible patterns is 128, none of which had zero frequency. Table 2 shows the results of the estimation of the models. The table contains the estimated parameters, the goodness-of-fit statistic, its degrees of freedom, the critical level, the estimated value of π*, and the lower limit of the interval with a confidence level of .95. To test the item equivalence hypothesis, we used the difference between the G² values. The result is that this hypothesis is rejected. None of the models fits the data according to G². Nevertheless, item response models are commonly rejected when the sample size is very large in comparison with the number of possible response patterns. The G²/df ratio takes the value 3.07 for the 2pl, which is interpreted as an adequate fit. The statistic π* shows similar values for the 1pl and the 2pl and indicates that the equivalence model must be rejected. The conclusion is that approximately 26% of the participants are not consistent.

The main reason for estimating the 1pl is to compare the results of the statistics f(consistent | xh), α, and lz. Table 3 shows the correlations between these statistics, the score on the test (X), and the response frequency of the pattern (nh).


Table 3
Results for Study 2 and the 1pl: Correlations Between the Person-Fit Statistics and the Response Frequency of the Response Pattern

           nh      lz       α     f(consistent | xh)
X        .186    .084    .141    -.101
nh               .813    .345     .205
lz                       .357     .228
α                                 .323

Note: nh = response frequency of the response pattern; X = number-right score; α, lz, and f(consistent | xh) = person-fit statistics.

The lz statistic correlates strongly with the response frequency of the pattern. This may be due to the procedure for estimating the parameters, which tries to equate the probability of each pattern with its relative frequency. In contrast, the value of α shows an almost exact, quadratic relationship with the score on the test; therefore, the linear correlation between the two quantities is low. This result might be due to the information function of the test: the variability of the estimator of θ decreases as the test information increases, which could be reflected in a lower value of the α index. Moreover, the test score and the pattern frequency are hardly related to f(consistent | xh). The correlations between the three subject goodness-of-fit indices, α, lz, and f(consistent | xh), are positive but low; apparently, then, they assess different causes of subject misfit. Figure 2 contains the dispersion plots for α, lz, and f(consistent | xh) in relation to frequency and test score.

Figure 2
Dispersion Plots of the Response Frequency and the Test Score With Three Person-Fit Statistics: The Posterior Probability of Being Consistent, lz, and α
[Figure: six scatter plots, one per combination of statistic (posterior probability, α, lz) and horizontal axis (response pattern frequency, test score).]

Polytomous data. Three response categories provide 3^7 = 2,187 different patterns, of which 750 (34%) had zero frequency. The results for the nonequivalent item model and for the equivalent item model are shown in Table 4. The G² values shown in Table 4 are significant. The estimated value of π* in the nonequivalent items model indicates that more than 95% of the participants are outside the consistency model. The estimated value of π* in the restricted model, which supposes that the items are equivalent, indicates that 98% of the subjects are outside the consistency model.

Discussion

In all cases, the results show that the supposition that the subjects are consistent cannot be maintained. The percentage of inconsistent subjects is much larger if we consider the items to be equivalent; statistically, however, this equivalence cannot be maintained. The estimated value of π* depends on the type of data used. The important distinction is not whether the data are dichotomized or polytomous but rather the sample size in relation to the number of possible response patterns. Rudas et al. (1994) showed that small samples can overestimate π*. This is why the use of polytomous data increases the estimated percentage of subjects outside the model.


Table 4
Results for Study 2: Item Parameter Estimates and Goodness of Fit for Polytomous Data and Models of Equivalent and Nonequivalent Items

                  Nonequivalents                        Equivalents
Item      a2      b2      a3      b3        a2      b2      a3      b3
1      0.050  -0.868   0.156   0.365     0.053  -1.039   0.192  -0.185
2      0.072  -0.993   0.278  -0.017     0.053  -1.039   0.192  -0.185
3      0.040  -1.866   0.147  -1.820     0.053  -1.039   0.192  -0.185
4      0.054  -1.590   0.214  -1.218     0.053  -1.039   0.192  -0.185
5      0.055  -1.950   0.194  -0.413     0.053  -1.039   0.192  -0.185
6      0.087  -0.526   0.299   1.056     0.053  -1.039   0.192  -0.185
7      0.102  -0.501   0.319   0.726     0.053  -1.039   0.192  -0.185
G²           2,730.77                         12,252.47
df              2,158                             2,182
p                .000                              .000
π*              0.959                             0.982
π*_L            0.791                             0.974

Note: G² = goodness-of-fit statistic; df = degrees of freedom; π* = proportion of subjects outside the model; π*_L = lower bound of π*. Sample size is 12,295.

From all of this, it can be deduced that the estimated percentage of inconsistent persons is tied to the particular model assumed for the behavior of the consistent persons. At the same time, this study, like the preceding one, reveals that the items of the test used are not equivalent despite their evident similarity and the similarity of the task to be performed by the subjects.

Study 3

The third study attempts to verify whether the results of the second study hold with a different test. Because the previous models suggested item nonequivalence, this time we analyze the data of a test that measures the subjects' risk-taking tendency. The task is described in Arend, Botella, Contreras, Hernández, and Santacreu (2003). The test consists in choosing one of four responses to each of the items on an undetermined number of occasions; consequently, the items of the test are strictly equal. The hypothesis states that, in this case, the model will show the equivalence of the items, although we cannot suppose this beforehand. Moreover, we estimate the percentage of consistent subjects with dichotomized data given that, as we saw in Study 2, this is the condition of greatest accuracy (smaller confidence interval for dichotomized data) as well as the condition in which the estimated number of consistent persons is greatest. Consequently, we can suppose that the number of inconsistent persons would be greater with polytomous data.


Method

A sample of 9,324 participants was assessed with a test that measures risk-taking behavior through 10 items with 4 response options each; Cronbach's α is .87 (Arend et al., 2003). The task consists in betting on the outcome of throwing two dice. The 10 items are exactly the same in both form and function. A score of 1 (the sum of both dice higher than 4) corresponds to the least risk assumed, and a score of 4 (a sum of 12) corresponds to the greatest risk. The analyses used only dichotomized data (responses 1 and 2 were recoded as 1, and responses 3 and 4 became 2). Because the number of possible patterns with the dichotomized data was still too large for the sample size, only 6 items were selected for the analysis; the first 4 items, which could in any case be considered training items, were discarded (see the recoding sketch below).

Results

The sample contained all 64 possible response patterns. Table 5 shows the estimated parameters for the equivalent item and nonequivalent item models. Several conclusions can be drawn from the table. The items are quite similar to each other. Although, statistically, the equivalence hypothesis can be rejected, the parameter estimates are similar across all the items, and the value of G² varies between the models much less than in Study 2 (G² = 1,075 vs. 1,165). The value of π* allows associating the lack of fit with the percentage of inconsistent subjects. If the items are assumed to be equivalent, the percentage of inconsistent individuals is 38%; with nonequivalent items, it is 26%.

Discussion

The type of data used has an important effect on the estimator of π* in a task in which the items are morphologically and functionally equal, and it indicates the percentage of nonconsistent subjects much more accurately. It is evident, however, that the nonrestrictive model suggests a certain degree of item nonequivalence. This is difficult to explain unless we consider that an item differs from another by the mere fact of having been executed before, even when there is no feedback about its execution beyond an indication that it has been answered. In any case, the models applied indicate that, in the best case, 26% of the subjects are inconsistent. A drawback of the statistic is that it requires large samples: this study shows that, to obtain clear results, the observed frequencies must be high enough for the estimator to be stable.
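The recoding and item selection referred to above can be sketched as follows (our sketch; the responses matrix is a hypothetical stand-in for the real data).

```python
import numpy as np

rng = np.random.default_rng(1)
responses = rng.integers(1, 5, size=(9324, 10))  # hypothetical options 1-4

dichotomous = np.where(responses <= 2, 1, 2)     # 1, 2 -> 1; 3, 4 -> 2
analyzed = dichotomous[:, 4:]                    # discard the 4 training items
assert analyzed.shape == (9324, 6)
```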

Table 5
Results for Study 3: Item Parameter Estimates and Goodness of Fit for Dichotomous Data and Models of Equivalent and Nonequivalent Items

        Nonequivalents (2pl)   Nonequivalents (1pl)      Equivalents
Item        a        b             a        b             a        b
1        1.177    1.071         1.092    1.046         1.093    1.111
2        1.019    1.055         1.092    1.078         1.093    1.111
3        0.879    1.059         1.092    1.125         1.093    1.111
4        0.993    1.110         1.092    1.143         1.093    1.111
5        1.179    1.131         1.092    1.103         1.093    1.111
6        1.415    1.294         1.092    1.178         1.093    1.111
G²    1,075.02              1,148.77               1,165.20
df          51                    56                     61
p         .000                  .000                   .000
π*       0.259                 0.320                  0.381
π*_L     0.242                 0.315                  0.379

Note: G² = goodness-of-fit statistic; df = degrees of freedom; π* = proportion of subjects outside the model; π*_L = lower bound of π*. Sample size is 9,324.

General Conclusions

The statistic π* must be interpreted bearing in mind that it is a goodness-of-fit statistic. Its usefulness lies in quantifying the degree of lack of fit instead of simply concluding whether the fit is good or bad. This is useful for comparing different models subject to theoretically motivated restrictions, as is the case of item equivalence. In any case, the comparison of different values of π* is only descriptive. For example, the imposition of the restriction that the items be equivalent in Studies 2 and 3 makes the model more restrictive, so its fit must worsen; the π* statistic attributes this lack of fit to an increase in the number of subjects that fall outside the consistency model.

The results of the three studies have shown that the percentage of inconsistent subjects varies across samples according to a series of parameters: first, the assumption of item equivalence; second, the dichotomized or polytomous nature of the data used; and third, the size of the sample. When the sample is small and dichotomized data are used, the percentage of inconsistent subjects is about 36%. With large samples, the crucial parameter determining the percentage of persons with inconsistent behaviors is the number of response categories used. Therefore, as Study 2 shows, when dichotomized data are used, the percentage of individuals who do not behave consistently is considerably smaller than when polytomous data are used.

In summary, the estimator of π* makes it possible to quantify the proportion of subjects who fall outside a given model. This model can incorporate various theoretical suppositions: consistency, equivalence of the items, and the form of the function P_i. Together with the sample size, these affect the value of π*, which must be interpreted in light of the conditions under which it has been estimated.

The general conclusion of this article is that there is no empirical basis for the supposition of consistency stated in the proposition "all human beings are consistent." On the contrary, this proposition is meaningful only if a model is specified for the behavior of consistent subjects. Even in that case, the value of π* will not in general be 0 but rather provides an indication of the fraction of the population that falls outside
the model. We can assert that the more equivalent the items that comprise a scale, and consequently the greater the scale's internal consistency, the greater the percentage of inconsistent persons will be. Likewise, the more precise (less error-prone) the measurement of a behavioral response, the more inconsistent individuals we will observe. This is particularly significant considering the tendency to increase the degree of discrimination in the scales that assess personality traits: the first formulation of the Eysenck Personality Inventory (Eysenck, 1959) presented a dichotomous response format, whereas the NEO Personality Inventory-Revised developed by Costa and McCrae (1992) contains five response categories. Our data suggest that the inconsistency attributable to the subjects increases as items are made more equivalent and as the number of possible responses increases.

It seems evident that the statistic G² allows us to test item equivalence and that the statistic π* allows us to calculate how many subjects of a sample will not be consistent on a given test. This information is of great importance insofar as we can describe a given test by (a) the consistency or equivalence of its items and (b) the percentage of subjects who are shown to be inconsistent on that test. Moreover, if we can indeed identify those persons who are not consistent on a test, then insofar as that test corresponds to a personality variable, we should conclude that we cannot predict the behavior of these persons on the test: they have not demonstrated consistency on the task. For the calculation of π*, their effect is the same as that of an item that does not contribute to increasing the Cronbach's α statistic. Consequently, the elimination of these subjects from the sample would provide greater stability of the test scores at two different points in time.

Although the statistics used prove their effectiveness in discriminating between the consistency of the items of a test and intraindividual consistency, they still present some problems, the main one being the size of the sample necessary for the calculations. Nevertheless, the pertinent analyses could be carried out with tests that are widely used in the study of personality and have generated data based on broad samples. This would address the problems that Hogan et al. (2000), Brennan (2001), and Henson (2001) raise with respect to estimates of the consistency, reliability, or unidimensionality of personality assessment instruments.

Undoubtedly, the strategy that we have followed in this article is a methodological alternative to the theoretical strategy followed by Mischel and Shoda (1995), Shoda (1999), and Mendoza-Denton, Ayduk, Mischel, Shoda, and Testa (2001) to close the issue of the consistency paradox. The conceptual solutions put forward by these authors (Shoda & Mischel, 2000), using new concepts such as "coherence" and "behavioral signature" within the Cognitive Affective Personality System model, aim to close a debate that has undoubtedly run its course but has not produced the desired synthesis and integration of the data. In this sense, the use of biological, psychosocial, and cultural variables in determining the patterns of organization and psychological functioning (Mischel, 2004) is an unquestionable theoretical step forward, but in our judgment, it does not cover the methodological aspect. Taking this into account, the methodological alternative that we put forward, in addition to proposing a statistic for calculating the percentage of consistent subjects, could be a new approach to personality assessment instruments.

References

Arend, I., Botella, J., Contreras, M. J., Hernández, J. M., & Santacreu, J. (2003). A betting dice test to study the interactive style of risk-taking behavior. The Psychological Record, 53, 217-230.
Bem, D. J. (1983). Further déjà vu in the search for cross-situational consistency: A response to Mischel and Peake. Psychological Review, 90, 390-393.
Bem, D. J., & Allen, A. (1974). Predicting some of the people some of the time: The search for cross-situational consistencies in behavior. Psychological Review, 81, 506-520.
Bem, D. J., & Funder, D. C. (1978). Predicting more of the people more of the time: Assessing the personality of situations. Psychological Review, 85, 485-501.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395-479). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D., & Aitkin, M. A. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Brennan, R. L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38, 295-317.
Chernyshenko, O. S., Stark, S., Chan, K. Y., Drasgow, F., & Williams, B. (2001). Fitting item response theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36, 523-562.
Costa, P. T., & McCrae, R. R. (1992). The NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.
Drasgow, F., Levine, M., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Epstein, S. (1979). The stability of behavior: On predicting most of the people much of the time. Journal of Personality and Social Psychology, 37, 1097-1126.
Epstein, S. (1983a). Aggregation and beyond: Some basic issues on the prediction of behavior. Journal of Personality, 51, 360-392.
Epstein, S. (1983b). The stability of confusion: A reply to Mischel and Peake. Psychological Review, 90, 179-184.
Epstein, S. (1984). The stability of behavior across time and situations. In R. A. Zucker, J. Aronoff, & A. I. Rabin (Eds.), Personality and prediction of behavior (pp. 209-268). New York: Academic Press.
Eysenck, H. J. (1959). Eysenck Personality Inventory. London: University of London Press.
Ferrando, P. J. (2004). Person reliability in personality measurement: An item response theory analysis. Applied Psychological Measurement, 28, 126-140.
Funder, D. C. (1983a). The consistency controversy and the accuracy of personality judgement. Journal of Personality, 51, 346-359.
Funder, D. C. (1983b). Three issues in predicting more of the people: A reply to Mischel and Peake. Psychological Review, 90, 283-289.
Funder, D. C., & Ozer, D. J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107-112.
Goldman, B. A., Mitchel, D. F., & Egelson, P. E. (1997). Directory of unpublished experimental mental measures (Vol. 7). Washington, DC: American Psychological Association.
Harzem, P. (1984). Experimental analysis of individual differences and personality. Journal of the Experimental Analysis of Behavior, 42, 385-395.
Hattie, J. (1984). An empirical study of various indices for determining unidimensionality. Multivariate Behavioral Research, 19, 49-78.
Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34, 177-189.
Hernández, J. M., Sánchez-Balmisa, C., Madrid, B., & Santacreu, J. (2003). La evaluación objetiva de la minuciosidad: Diseño de una prueba conductual [The objective assessment of conscientiousness: The design of a behavioral test]. Análisis y Modificación de Conducta, 29, 457-479.
Hogan, T. P., Benjamin, K., & Brezonski, K. L. (2000). Reliability methods: A note on the frequency of use of various types. Educational and Psychological Measurement, 60, 523-531.
Kenrick, D. T., & Funder, D. C. (1988). Profiting from controversy: Lessons from the person-situation debate. American Psychologist, 43, 23-34.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Mendoza-Denton, R., Ayduk, O. N., Mischel, W., Shoda, Y., & Testa, A. (2001). Person × Situation interactionism in self-encoding (I am . . . when . . . ): Implications for affect regulation and social information processing. Journal of Personality and Social Psychology, 80, 533-544.
Mischel, W. (1968). Personality and assessment. New York: John Wiley.
Mischel, W. (1990). Personality dispositions revisited and revised: A view after three decades. In L. A. Pervin (Ed.), Handbook of personality: Theory and research (pp. 111-134). New York: Guilford.
Mischel, W. (2004). Toward an integrative science of the person. Annual Review of Psychology, 55, 1-22.
Mischel, W., & Peake, P. K. (1982). Beyond déjà vu in the search for cross-situational consistency. Psychological Review, 89, 730-755.
Mischel, W., & Peake, P. K. (1983). Some facets of consistency: Replies to Epstein, Funder, and Bem. Psychological Review, 90, 394-402.
Mischel, W., & Shoda, Y. (1995). A cognitive-affective system theory of personality: Reconceptualizing situations, dispositions, dynamics, and invariance in personality structure. Psychological Review, 102, 246-268.
Ozer, D. J. (1986). Consistency in personality: A methodological framework. Berlin: Springer.
Reise, S. P., & Waller, N. G. (1990). Fitting the two-parameter model to personality data. Applied Psychological Measurement, 14, 45-58.
Ribes, E. (1990a). La individualidad como problema psicológico: El estudio de la personalidad [Individuality as a psychological problem: The study of personality]. Revista Mexicana de Análisis de Conducta, 16, 7-24.
Ribes, E. (1990b). Problemas conceptuales en el análisis del comportamiento [Conceptual problems in behavior analysis]. México, DF: Trillas.
Ribes, E., & Sánchez, S. (1992). Individual behaviour consistencies as interactive styles: Their relation to personality. The Psychological Record, 42, 369-387.
Rudas, T., Clogg, C. C., & Lindsay, B. G. (1994). A new index of fit based on mixture methods for the analysis of contingency tables. Journal of the Royal Statistical Society, Series B, 56, 623-639.
Shoda, Y. (1999). A unified framework for the study of behavioral consistency: Bridging person × situation interaction and the consistency paradox. European Journal of Personality, 13, 361-387.
Shoda, Y., & Mischel, W. (2000). Reconciling contextualism with the core assumptions of personality psychology. European Journal of Personality, 14, 407-428.
Steinberg, L., & Thissen, D. (1995). Item response theory in personality research. In P. E. Shrout & S. T. Fiske (Eds.), Personality research, methods, and theory: A festschrift honoring Donald W. Fiske (pp. 161-181). Hillsdale, NJ: Lawrence Erlbaum.
Zuckerman, M. (1991). Psychobiology of personality. New York: Cambridge University Press.