
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1972, 32, 879-886.

THE RELIABILITY OF DIFFERENCES BETWEEN LINEAR REGRESSION WEIGHTS IN APPLIED DIFFERENTIAL PSYCHOLOGY

FRANK L. SCHMIDT
Michigan State University

In many areas of psychology, education, sociology, economics, and other behavioral and social sciences, a relatively common research design is one that calls for the prediction of the standing of a person or thing on one variable, often designated the criterion, from his or its standing on a number of other variables, often called the predictors. When the relationships in question are linear, least squared error multiple regression weights are most commonly used in weighting the predictors into a composite. These weights minimize the sum of the squared deviations of the observed from the predicted criterion scores (Anderson, 1958). In practice, the sample regression weights (b) are often computed on relatively small samples and, as a result, are only rough approximations to the population regression weights (β), which are, by definition, the most effective set of predictor weights possible. If applied to the entire population, b would produce some correlation, ρ(b), and β would produce ρ(β), the maximum correlation. Although ρ(b) will vary depending on the chance differences between different b, ρ(β) is a parameter and thus has only one value. A previous study (Schmidt, 1971) showed that, for certain combinations of N (sample size) and p (number of predictors) in applied differential psychology, simple unit weighting of predictors (summing of z scores of predictors) produces, on the average, a larger correlation in the long run, i.e., in the population, than b. With small N and large p, these differences in predictive efficiency favoring simple unit predictor weights over b were large enough, in some cases, to be of practical significance in applied situations (e.g., .12-.13 correlation units).
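As a concrete illustration of the distinction drawn above (not part of the original study), the following Python sketch uses a small hypothetical four-variable correlation matrix: it computes the population weights β, draws one small sample, computes the sample weights b from the sample correlation matrix, and evaluates the correlation that β, b, and simple unit weights would each produce in the population. The matrix and all names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population correlation matrix: three predictors plus the criterion
# in the last row/column (illustration only, not data from the study).
Sigma = np.array([
    [1.00, 0.30, 0.20, 0.40],
    [0.30, 1.00, 0.25, 0.35],
    [0.20, 0.25, 1.00, 0.30],
    [0.40, 0.35, 0.30, 1.00],
])
p = Sigma.shape[0] - 1
Rxx, rxy = Sigma[:p, :p], Sigma[:p, p]

def population_validity(w):
    """Correlation with the criterion that the composite w'x would attain in the population."""
    return (w @ rxy) / np.sqrt(w @ Rxx @ w)

beta = np.linalg.solve(Rxx, rxy)   # population regression weights: the best possible weights
unit = np.ones(p)                  # simple unit weights (sum of predictor z scores)

N = 25                             # a small sample, as is common in applied work
X = rng.multivariate_normal(np.zeros(p + 1), Sigma, size=N)
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix
b = np.linalg.solve(R[:p, :p], R[:p, p])   # sample regression weights

print("rho(beta):", population_validity(beta))  # the parameter; one value
print("rho(b)   :", population_validity(b))     # varies with the sample drawn
print("rho(unit):", population_validity(unit))  # fixed; often competitive at small N
```

Rerunning with other seeds shows ρ(b) fluctuating below ρ(β) while ρ(unit) stays fixed; at small N and larger p the unit-weight composite is frequently the better of the two, which is the Schmidt (1971) finding summarized above.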



Obviously, these results obtained because b, as a function of random sampling fluctuation, contained much error.¹ That is, differences between elements in the b were a reflection of sampling error as well as of actual differences between elements in the corresponding β. Error variance associated with a given regression weight is usually conceptualized as the variance of the distribution that would result if, holding N and p constant, an infinite number of estimates of a given regression weight were computed, each from a new sample. This is the conceptual model underlying the formula for the standard error of the regression weight. However, another error model may be of more explanatory value here: a simple variance components model inspired by classical test theory. If we assume that sampling error is independent of such parameters as β and the population validity vector, then the total variance within estimates of these vector parameters can be viewed as the sum of the population variance of the parameters and error variance due to sampling. Suppose, for example, that the variance across a set of population regression weights, β, is .0520 and the variance of a sample estimate of these weights, b, is .0640. Then, using the present model, variance due to sampling error is .0640 minus .0520, or .0120. Total obtained variance (.0640 here) equals true variance (.0520 here) plus error variance (.0120 here).² In equation form:

    σ²(b) = σ²(β) + σₑ²,

where σₑ² = variance due to sampling error.

¹ Sampling fluctuations were the only source of error in the b. The data used fit the linear multivariate normal model and there was no post hoc selection of p′ predictors (where p′ < p).

² This model is, of course, applicable only in instances in which parameters are vectors rather than scalar values. Scalar-value parameters obviously have no population variances, but when the parameter of interest is a vector, we may speak of the variance in the population across the values in the vector.

Using this model, one can compute the reliability of the difference between elements in an estimate of a population parameter vector. For example, the reliability of differences between elements within a given set of sample regression weights, b, would be σ²(β)/σ²(b).

This ratio corresponds to the basic definition of reliability as the ratio of true to total variance, or the proportion of total variance that is true variance. Of course, σ²(β) is usually not known in practice, and so for most purposes of practical reliability estimation this model is not of much value. However, a Monte Carlo approach to this problem developed by the writer calls for specification of the "population" correlation matrix (Σ_xy) by the investigator. When Σ_xy is known, β, and thus σ²(β), can be computed directly. b and σ²(b) are then computed from sample correlation matrices (R) generated from Σ_xy. For any N and Σ_xy, generation of a number of R allows the single estimate of σ²(b) to be replaced by the average of a number of such estimates, designated σ̄²(b).

The present study had two purposes. The first was to examine systematically, via the above ratio, the average reliabilities of differences between regression weights in the data domain of applied differential psychology. Multiple regression techniques are used almost routinely to weight predictors differentially; often the obtained weights are interpreted directly, with relative size being used as an index of the theoretical or practical significance of variables. The question of the reliability of the differences between these weights is therefore an important one. The second purpose of this study was to ascertain the magnitudes of these reliabilities necessary, under various combinations of N and p, for regression weights to equal and to exceed simple unit predictor weights in predictive efficiency. To the present author's knowledge, the literature to date contains no studies addressed to these two problems.
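Although the study's own program is described in the Method section below, the Monte Carlo logic just outlined can be sketched in a few lines of Python. This is a minimal sketch under an assumed example matrix and assumed names, not the original program, and it samples observations directly from the multivariate normal distribution: specify Σ_xy, compute σ²(β) from it directly, generate sample correlation matrices R at a given N, compute σ²(b) for each, and divide σ²(β) by the average of those values.

```python
import numpy as np

def weight_variance(C):
    """Variance across the elements of the regression-weight vector implied by
    correlation matrix C (predictors first, criterion in the last row/column)."""
    p = C.shape[0] - 1
    w = np.linalg.solve(C[:p, :p], C[:p, p])
    return w.var()

def reliability_of_weight_differences(Sigma, N, n_reps=100, seed=0):
    """Estimate sigma^2(beta) / mean sigma^2(b) by repeated sampling from N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    true_var = weight_variance(Sigma)           # sigma^2(beta): computable because Sigma is specified
    sample_vars = []
    for _ in range(n_reps):
        X = rng.multivariate_normal(np.zeros(Sigma.shape[0]), Sigma, size=N)
        R = np.corrcoef(X, rowvar=False)        # sample correlation matrix
        sample_vars.append(weight_variance(R))  # sigma^2(b) for this sample
    return true_var / np.mean(sample_vars)

# Hypothetical 3-predictor "population" matrix, for illustration only.
Sigma = np.array([
    [1.00, 0.30, 0.20, 0.40],
    [0.30, 1.00, 0.25, 0.35],
    [0.20, 0.25, 1.00, 0.30],
    [0.40, 0.35, 0.30, 1.00],
])
for N in (25, 100, 500):
    print(N, round(reliability_of_weight_differences(Sigma, N), 3))
```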

Method

Sampling Plan for Σ_xy Matrices

In order to insure that the Σ_xy used were similar to those actually existing in the data domain of interest, it seemed appropriate to employ R matrices from the data domain as estimates of Σ_xy. For all practical purposes, the R are unbiased estimates of their respective Σ_xy. Four journals, Educational and Psychological Measurement (EPM), Journal of Applied Psychology (JAP), Journal of Educational Psychology (JEP), and Personnel Psychology (PP), were selected as representing the data domain of applied differential psychology, and the years 1959-1969 were selected for examination.


For two of the journals (EPM and JEP) the odd years were sampled, and for the other two, the even years. In both cases, correlation matrices of odd-numbered dimensions from 3 × 3 to 11 × 11 were recorded. In some cases, only parts of larger matrices were used. Correlation vectors containing negative or zero values were not used as validity vectors, and it was sometimes necessary to rearrange rows and columns in order to meet this condition. An attempt was made to keep all validities above .20, a value chosen as approximately the minimum that would ordinarily be used in practice. For each matrix size, a random sample of 10 matrices was drawn from the pool of recorded matrices of that size and used as estimates of the Σ_xy. For certain of the larger matrices, a sample of 10 could not be obtained using only the journal volumes designated in the sampling plan, and additional samples had to be taken from the previously unused volumes. Even so, only eight 11 × 11 matrices could be found, and two had to be taken from another source (Wechsler, 1949, p. 10). Obviously, the sampling fraction was much larger for the large than for the small matrices.
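A minimal Python sketch of the screening rules just described (a hypothetical helper, not code from the study; it assumes the criterion occupies the last row and column of each recorded matrix):

```python
import numpy as np

def usable_as_sigma(R, min_validity=0.20):
    """Rough screen mirroring the rules above: odd dimension between 3 x 3 and
    11 x 11, and every validity (last column, off the diagonal) above the floor,
    which also excludes negative and zero validities."""
    k = R.shape[0]
    if k % 2 == 0 or not 3 <= k <= 11:
        return False
    return bool(np.all(R[:-1, -1] > min_validity))

example = np.array([[1.00, 0.30, 0.45],
                    [0.30, 1.00, 0.25],
                    [0.45, 0.25, 1.00]])
print(usable_as_sigma(example))   # True: 3 x 3, validities .45 and .25
```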

The Program

Sample correlation matrices (R) computed from randomly drawn samples from a multivariate normal distribution are distributed as W(N, p, Σ_xy), the Wishart distribution. For each given N and Σ_xy combination, 100 R matrices were generated from this distribution and 100 b vectors were computed, which, in turn, yielded 100 σ²(b) values.³ The average of these 100 values of σ²(b) was taken as the estimate of σ̄²(b) for a specific Σ_xy at a given N level, and σ²(β)/σ̄²(b) was taken as the estimate⁴ of the average reliability of differences between regression weights for that Σ_xy, given N. These reliability estimates were then averaged across the Σ_xy to provide an estimate of mean reliability in the data domain as a whole at given N and p values.

³ This program was written by Dr. Vernon Urry, now at the University of Washington. The process of R-generation works by means of the Bartlett decomposition of the Wishart distribution (Bartlett, 1933; Kshirsagar, 1959; Wijsman, 1957). This approach to sampling from N(μ, Σ) has been discussed and employed by Browne (1968) and Herzberg (1969), both of whom have carried out tests of the simulated data showing that the required assumptions are met. In addition, Herzberg (1969) showed that the results from simulated data were almost identical with the results from a large sample of empirical data.

⁴ It is well known that, strictly speaking, E(k/X) ≠ k/E(X) (where k is a constant and X is a random variable), but, as Browne (1969) has shown, the discrepancy is so small as to be negligible for psychometric purposes.
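Footnote 3 names the Bartlett decomposition as the mechanism for generating the R matrices. The following Python sketch is a reconstruction of that general technique, not Urry's program; the function names and example matrix are assumptions. It draws a Wishart-distributed cross-product matrix from a specified Σ_xy and rescales the implied sample covariance matrix to a sample correlation matrix R for a sample of size N.

```python
import numpy as np

def wishart_bartlett(Sigma, df, rng):
    """Draw one W ~ Wishart(df, Sigma) via the Bartlett decomposition:
    with Sigma = L L', W = (L A)(L A)', where A is lower triangular,
    A[i, i] = sqrt(chi-square with df - i degrees of freedom), and
    A[i, j] ~ N(0, 1) for j < i."""
    p = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    A = np.zeros((p, p))
    for i in range(p):
        A[i, i] = np.sqrt(rng.chisquare(df - i))
        A[i, :i] = rng.standard_normal(i)
    LA = L @ A
    return LA @ LA.T

def sample_correlation_matrix(Sigma, N, rng):
    """Sample correlation matrix R for a sample of size N from N(0, Sigma),
    generated from the Wishart distribution rather than from raw data."""
    S = wishart_bartlett(Sigma, N - 1, rng) / (N - 1)   # implied sample covariance matrix
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

rng = np.random.default_rng(0)
Sigma = np.array([[1.00, 0.30, 0.40],
                  [0.30, 1.00, 0.35],
                  [0.40, 0.35, 1.00]])
R = sample_correlation_matrix(Sigma, N=50, rng=rng)
print(np.round(R, 3))
```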


Values of N and p lying within the ranges most frequently encountered in practice were chosen. The Ns investigated were 25, 50, 75, 100, 150, 200, 500, and 1000. The values of p used were 2, 4, 6, 8, and 10.

Results and Discussion

Table 1 presents the average reliability of differences between regression weights for all N and p values for all matrices. It can be seen that differences between regression weights do not, in general, attain levels of reliability generally considered satisfactory for most psychometric instruments until Ns of 500 or more are reached. With one exception, all reliabilities for sample sizes of 75 or under are .50 or less, even when p is small. Reliabilities in the .20's and .30's are much in evidence, and some are even lower than .20. The last column in Table 1 shows the average reliabilities at each N level across all p values. In the last row of Table 1, where N = ∞, σ̄²(b) becomes σ²(β) and the reliability ratio becomes 1.000, indicating that differences between regression weights are perfectly reliable when the "sample" used is the entire population.

TABLE 1
Average Reliabilities of Differences between Regression Weights across All Matrices of Sample Regression Weights for All Combinations of N and p

    N        p = 2    p = 4    p = 6    p = 8    p = 10   Means across all p
    25       .3063    .2891    .2513    .1931    .1131    .2305
    50       .4921    .4452    .3946    .3609    .2632    .3912
    75       .5262    .5074    .4701    .4443    .3601    .4616
    100      .6183    .5558    .5376    .5294    .4338    .5350
    150      .6110    .6398    .6196    .6300    .5355    .6071
    200      .6771    .6919    .6549    .6786    .6045    .6608
    500      .7724    .8153    .7998    .8374    .7784    .8066
    1000     .8055    .8977    .8624    .9329    .8812    .8759
    ∞       1.0000   1.0000   1.0000   1.0000   1.0000    1.0000

An examination of the β for each of the 50 Σ_xy revealed that 31 had one or more suppressor variables. Since suppressor variables are rarely used in applied differential psychology (Adkins, 1947), it is probably hazardous to generalize to this data domain from the present matrix sample. Therefore, Table 2, showing the average reliabilities of differences between regression weights for those matrices without suppressors, was computed.


TABLE 2
Average Reliabilities of Differences between Regression Weights of Sample Regression Weights for All Combinations of N and p, for Those Matrices without Suppressors

    N          p = 2    p = 4     p = 6    p = 8     p = 10   Means across all p
    25         .2469    .19037    .0925    .09575    .0755    .1402
    50         .4351    .32835    .1629    .23200    .1877    .2692
    75         .4580    .44017    .2436    .30745    .2987    .3495
    100        .5597    .52550    .2761    .35375    .3576    .4145
    150        .5489    .60657    .3605    .46830    .4537    .4916
    200        .6303    .67015    .4654    .55700    .5505    .5746
    500        .7059    .83510    .7079    .72820    .7770    .7508
    1000       .7663    .89750    .7837    .88595    .8596    .8385
    ∞         1.0000   1.0000    1.0000   1.0000    1.0000    1.0000
    No. of
    matrices      8        4         4        2         1       19

In the bottom row of Table 2 may be seen the number of matrices remaining at each p value after those containing suppressors were removed. Again, the mean reliabilities across all p values are given in the last column of the table. Comparing Tables 1 and 2, one can easily see that the average reliabilities are consistently higher when matrices containing suppressors are included. This results from the fact that the existence of suppressors leads to larger differences between elements in β, which, in turn, leads to larger values of σ²(β). Since error variance is independent of σ²(β) and depends only on N and p, the effect is to increase the average reliability ratio [σ²(β)/σ̄²(b)] at all N and p values. Thus differences between computed regression weights will tend to be more reliable when one is dealing with a population which contains suppressor variables. Reliabilities that would generally be considered adequate for most psychometric instruments for most uses are, in general, achieved for differences between the regression weights for the matrices without suppressors at N levels somewhere between 500 and 1000.⁵ For all sample sizes below 75, irrespective of p, these reliabilities are below .50. At smaller sample sizes, reli-

⁵ With a large sample of Σ_xy at each p value, it would be expected that mean reliability would consistently decrease as p increased within each sample size, since the addition of each predictor leads to a loss of one degree of freedom. In fact, this pattern does obtain in Table 1 up to and including a sample size of 100. But as N increases from 100 to ∞, error variance due to sampling (σₑ²) becomes less important relative to σ²(β) in determining mean reliability at each p value. Due to sampling fluctuations in selecting the original Σ_xy sample, σ²(β) varied from .0155 for the p = 10 sample to .1147 for the p = 4 sample. The effect of this variation was to mask and often reverse the expected decline in reliability with increases in p within N values. In Table 2, where sample sizes for the Σ_xy are smaller, this effect, as would be expected, is even more pronounced.


abilities below .20, and even below .10, are much in evidence. Since sample sizes used in research in most areas of applied psychology tend to be relatively small (Lawshe and Schucker, 1959), it is probably reasonable to conclude from these data that the reliabilities of differences between most regression weights reported in the literature fall between .10 and .60. Implications for the practice of directly interpreting the relative magnitudes of regression weights computed on small samples are obvious.

In addition, it should be pointed out that the reliability estimates in both Table 1 and Table 2 are probably overestimates of the actual reliabilities in their respective data domains. The use of sample matrices from the literature as estimates of the Σ_xy, the population matrices, tends to inflate values of σ²(β), thus inflating the reliability ratio [σ²(β)/σ̄²(b)]. The explanation for this inflation of σ²(β) values is relatively straightforward. In Σ_xy matrices, the greater the variance of predictor intercorrelations and validities, other things being equal, the more the individual regression weights in β will differ from each other. Because of the addition of variance due to sampling error, the variance of predictor intercorrelations and validities can be expected to be higher in sample matrices taken from the literature than in their corresponding population matrices. The result⁶ is larger values of σ²(β).

⁶ Another result, closely related, is the tendency for a larger number of suppressor variables to appear than probably exist in the parent population matrices.
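The inflation argument just given can be illustrated with a small simulation (a sketch under assumed values, not an analysis from the study): take a hypothetical population matrix whose regression weights are exactly equal, so that σ²(β) is zero, and note how much spread appears in the weights implied by sample R matrices of moderate size. That spurious spread is what gets counted as true variance when a published R is treated as Σ_xy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "population" matrix (4 predictors + criterion) with uniform
# intercorrelations and validities, so the true regression weights are all
# equal and sigma^2(beta) is exactly zero.
p = 4
Sigma = np.full((p + 1, p + 1), 0.30)
np.fill_diagonal(Sigma, 1.0)
Sigma[:p, p] = 0.35
Sigma[p, :p] = 0.35

def weight_variance(C):
    """Variance across the regression weights implied by correlation matrix C."""
    k = C.shape[0] - 1
    w = np.linalg.solve(C[:k, :k], C[:k, k])
    return w.var()

print("variance of true weights:", round(weight_variance(Sigma), 6))

# A matrix published in the literature is itself a sample R at moderate N; the
# weights it implies are more spread out, so treating R as the population
# overstates sigma^2(beta) and hence the reliability ratio.
N = 100
spreads = [weight_variance(np.corrcoef(
               rng.multivariate_normal(np.zeros(p + 1), Sigma, size=N),
               rowvar=False))
           for _ in range(200)]
print("mean variance of weights implied by sample R:", round(float(np.mean(spreads)), 6))
```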


For correlation matrices in applied differential psychology without suppressors, regression weights begin to be superior to unit weights, on the average (across p values), when the sample size is about 100 (Schmidt, 1971). The average reliability of differences between regression weights at this sample size can be seen in Table 2 to be .4145. When suppressors are allowed, regression weights begin, on the average (across p values), to be superior to unit weights at a sample size of about 50 (Schmidt, 1971). At this N level, the average reliability of differences between regression weights is only .3912, as can be seen in the last column of Table 1. For both matrix samples, then, more than half the variance of the regression weights is error variance at that N level where regression weights overtake unit weights in predictive power. If we arbitrarily assume that .0150 correlation units is the minimum increase in predictive power, for most practical purposes, that will render the computation of regression weights worthwhile, then for the entire Σ_xy sample, averaging across levels of p, the minimum sample size needed is 60 (Schmidt, 1971). At this sample size, the average reliability of differences between regression weights is approximately .420. For matrices without suppressors, the corresponding sample size is 184, and the average reliability at this N level is approximately .550. Both of these reliability figures are low relative to commonly accepted standards for psychometric devices. Even when regression weights do show more predictive efficiency than simple unit weights, differences in magnitude between the individual weights are still very unreliable. Once again, the implications for the practice of directly interpreting relative magnitudes of computed regression weights are obvious.

REFERENCES

Adkins, D. C. Construction and analysis of achievement tests. Washington, D.C.: U. S. Government Printing Office, 1947.
Anderson, T. W. Introduction to multivariate statistical analysis. New York: Wiley, 1958.
Bartlett, M. S. On the theory of statistical regression. Proceedings of the Royal Society of Edinburgh, 1933, 53, 260-283.
Browne, M. W. A comparison of factor analytic techniques. Psychometrika, 1968, 33, 267-334.
Browne, M. W. Precision of prediction. Research Bulletin 69-69. Princeton, N.J.: Educational Testing Service, 1969.
Herzberg, P. A. The parameters of cross-validation. Psychometric Monograph, No. 16, 1969.
Kshirsagar, A. M. Bartlett decomposition and Wishart distribution. The Annals of Mathematical Statistics, 1959, 30, 239-241.
Lawshe, C. H. and Schucker, R. E. The relative efficiency of four test weighting methods in multiple prediction. Educational and Psychological Measurement, 1959, 19, 103-114.
Schmidt, F. L. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 1971, 31, 699-714.
Wechsler, D. Wechsler intelligence scale for children. New York: Psychological Corporation, 1949.
Wijsman, R. A. Random orthogonal transformations. The Annals of Mathematical Statistics, 1957, 28, 415-423.
