
Chasing the Title VII Holy Grail: The Pitfalls of Guaranteeing Adverse Impact Elimination

Winfred Arthur Jr., Dennis Doverspike, Gerald V. Barrett, and Rosanna Miguel



Journal of Business and Psychology (2013) 28:473–485. DOI 10.1007/s10869-013-9289-6
Published online: 15 March 2013. © Springer Science+Business Media New York 2013

Abstract

Purpose: Title VII of the Civil Rights Act of 1964 provided industrial/organizational (I/O) psychologists with a unique role as professional test developers and consultants involved in assisting organizations in establishing the job-relatedness/validity defense to charges of discrimination, specifically charges based on an adverse or disparate impact theory. However, these activities have transmogrified into the fairly common occurrence of public municipalities and organizations demanding the reduction or absence of adverse impact as part of the scope of work or contracts, and of practitioners and consultants guaranteeing adverse impact reduction or elimination a priori. Plaintiffs and their experts also routinely argue that the observed adverse impact could have been allayed or eliminated if the defendant had only just used alternative testing methods. This then begs the following question: "Are there well established techniques and procedures that can reduce, minimize, or eliminate adverse impact in a predictable, generalizable, and replicable fashion in the same manner that we might guarantee validity?" The present paper seeks to answer this question.

Approach and Findings: With the preceding as a backdrop, the present paper identifies and discusses four overlooked critical attributes of adverse impact that collectively and in conjunction work against and obviate adverse impact reduction and elimination guarantees.

Conclusions and Implications: We conclude that the search for guaranteed adverse impact reduction or elimination is a "Holy Grail" and that we should avoid predictions and guarantees regarding adverse impact elimination in specific situations, including those based on the inclusion of "alternative" selection devices. However, in the context of civil rights legislation, and the intersection of I/O psychologists with said legislation, what we can guarantee as a science and profession are sound and valid tests and assessment devices that can be defended accordingly should the use of said tests and devices be challenged.

Keywords: Civil Rights Act · Subgroup differences · Adverse impact · Uniform guidelines · Personnel selection

W. Arthur Jr. (corresponding author), Department of Psychology, Texas A&M University, 4235 TAMU, College Station, TX 77843-4235, USA; e-mail: [email protected]
D. Doverspike, University of Akron, Akron, OH, USA; e-mail: [email protected]
G. V. Barrett, Barrett & Associates, Inc., Cuyahoga Falls, OH, USA; e-mail: [email protected]
R. Miguel, John Carroll University, University Heights, OH, USA; e-mail: [email protected]

Title VII of the Civil Rights Act (CRA) of 1964 provided industrial/organizational (I/O) psychologists with a unique role as professional test developers and consultants involved in assisting organizations in establishing the job-relatedness/validity defense to charges of discrimination, specifically charges based on an adverse or disparate impact theory. However, I/O psychologists were not only involved in developing and validating tests, but a concomitant search soon began for methods of reducing or eliminating the adverse impact of selection devices on protected groups. But, as with many such historical quests, it has not always been clear whether there was actually a prize to be found at the end of the crusade. Indeed, in terms of professional practice, this search has transmogrified into what now seems to be the common occurrence of public municipalities and organizations demanding the reduction or absence of adverse impact as part of the scope of work or contracts, and of practitioners and consultants guaranteeing adverse impact reduction or elimination a priori. For instance, we commonly see language along these lines in the requests for proposals (often with indemnification clauses) that we receive in our practices. Proclamations about adverse impact reduction and elimination can also be found at the web sites of some consulting firms. And it is also common, in the absence of any direct evidence to support their claims, for plaintiffs and their experts to routinely argue that the observed adverse impact could have been allayed or eliminated if the defendant had only just used alternative testing methods. Court cases such as Dwight Bazile et al. v. City of Houston (2012), Howe v. City of Akron (2009), Ricci v. DeStefano (2009), U.S. v. The City of New York (2010), and U.S. v. City of New York (2011) are very good examples of this. Finally, it is not uncommon in both the scientific and applied literature to see frequent references to adverse impact reduction and elimination, as if this were something that could, a priori, be guaranteed and predicted.

This then begs the following question: "Are there well established techniques and procedures that can reduce, minimize, or eliminate adverse impact in a predictable, generalizable, and replicable fashion in the same manner that we might guarantee validity?" Specifically, can we, as I/O psychologists, in an a priori manner, guarantee adverse impact reduction or elimination for applicant samples that are often selected in a highly non-random manner from the larger, hypothetical pool of applicants or normative population? Can we sincerely say to clients, "We can design a selection device that will be meaningfully related to job performance and will simultaneously not display adverse impact?" And, if we cannot answer the preceding question in the affirmative, then what are the implications of this for our professional role, and what recommendations might we offer to legislators drafting future civil rights legislation should the occasion arise?

Section 60-3.3.B ("Consideration of suitable alternative selection procedures") of the Equal Employment Opportunity Commission (EEOC 1978) Uniform Guidelines codified into regulations the importance of the search for methods of reducing adverse impact by stating that "Where two or more selection procedures are available which serve the user's legitimate interest in efficient and trustworthy workmanship, and which are substantially equally valid for a given purpose, the user should use the procedure which has been demonstrated to have the lesser adverse impact." The 1991 amendment to the CRA also added to civil rights legislation specific recognition of the process for establishing adverse impact (SEC. 2000e-2, Section 70) and the language corresponding to the search for alternatives, specifically, "the complaining party makes the demonstration described in subparagraph (C) with respect to an alternative employment practice and the respondent refuses to adopt such alternative employment practice" (from 1.A.ii). Thus, advice to organizations typically takes the form not only of methods for improving selection tests and procedures but also of how to "reduce or eliminate adverse impact," usually via the means of alternative job-related constructs or alternative testing methods that supposedly do not display adverse impact (e.g., see Aguinis et al. 2009). Consequently, as previously noted, it is not uncommon in professional practice to see public municipalities and organizations demanding adverse impact reduction or absence in their requests for proposals and scopes of work, and also guarantees of adverse impact reduction or even elimination from both consultants proffering such tests and plaintiffs' experts attacking test validation reports.

However, if the argument is that our field has advanced to the point where we can offer tests that will demonstrate no adverse impact, then a historical analysis of discrimination charges and cases belies the effectiveness of such guarantees. Indeed, based on our practice, experience, and reading of the literature, it is our strong impression that employment litigation, as it pertains to employment-related decision making, is very "alive and well"; it is not decreasing or tapering off. This is reflected in highly visible instances where critical cases involving basic issues of adverse impact from the use of professionally developed tests or from subjective decision making continue to reach the Supreme Court, as illustrated by Ricci v. DeStefano (2009), Lewis v. City of Chicago (2010), and Wal-Mart v. Dukes (2011). Indeed, from Tonowski's (2011a) perspective, in terms of the preponderance and nature of litigation, "in public safety selection testing, it's 1980 again" (p. 6; see also Tonowski 2011b). This state of affairs does not suggest or support a proposition that would allow us to conclude that the field or profession is at a point where we can promise or guarantee adverse impact reduction or elimination. We would agree with McDaniel et al.'s (2011a) conclusion that racial differences in adverse impact on job-related constructs are likely to be with us for a long time. But, if as a profession we have developed methods of eliminating adverse impact, then we are either not offering them to our clients or we are not doing a very good job of implementing them in practice, because adverse impact remains a common occurrence. However, the other possibility is that we simply do not have approaches that will guarantee both equivalent validity and the absence of adverse impact in real-world situations involving actual applicants. We subscribe to the latter explanation.

So, with the preceding as a backdrop, the present paper identifies and discusses four overlooked critical attributes of adverse impact that collectively and in conjunction work against and obviate adverse impact reduction and elimination guarantees. These are as follows: (1) our psychological research data are on standardized subgroup differences, not adverse impact (i.e., subgroup differences are not the same as adverse impact); (2) standardized differences for one or more protected classes exist on most ability constructs of interest to the field and a wide array of variables in psychology in general; (3) adverse impact is situationally specific and administrative in nature; and (4) changes in technology and demographics have combined to increase the number of subgroups and number of applicants, thus increasing the likelihood of obtaining support for an adverse impact finding in most situations. The implications of these attributes for civil rights legislation, and the intersection of I/O psychologists with said legislation, are discussed.

Subgroup Difference is Not Adverse Impact

Although the extant literature often uses them interchangeably [several instances of this can be found in the "The Uniform Guidelines Are a Detriment to The Field of Personnel Selection" focal article and commentaries in Industrial and Organizational Psychology: Perspectives on Science and Practice (2011)], technically speaking, subgroup differences and adverse impact are distinct concepts, a distinction that has important implications for the adverse impact issue. Subgroup differences can be described as psychological, scientific phenomena that are represented or conceptualized as standardized mean differences between groups on measures of psychological constructs. In most if not all instances, there is some theory or conceptual basis postulated to explain said differences. Consequently, a focus on subgroup differences versus adverse impact translates into quite different research questions, with the former having a "why" and explanatory emphasis. So, for instance, there are well-documented sex differences in physical ability (Hough et al. 2001; Russell et al. 1994; The Cooper Institute 2011) for which there are sound and established physiological mechanisms and explanations. Adverse impact, on the other hand, is a legal and administrative concept, which follows from the logic of the psychological phenomenon of subgroup differences, but is also concerned with the equality of outcomes in real-world decision making. Adverse impact results from the implementation of some decision rule to a set of scores (e.g., pass/fail; hire/not hire; promote/not promote) in a specific assessment or testing situation. So, although adverse impact is very closely tied to psychological subgroup differences, they are not the same.

Furthermore, as is discussed in a subsequent section of this paper, the generalizability of subgroup differences is often contravened by the highly non-random situational specificity of adverse impact (Biddle 2010). So, for instance, in our own practice, we have found oral exercises that result in almost no adverse impact based on race in one city and then, in the next administration, result in substantial adverse impact (see also Howe v. City of Akron 2009). From a scientific, validity generalization perspective, we would have to argue that at the population level, the subgroup mean difference between African-Americans and Whites on the oral exercise has not changed. What have changed are the unique situational variables that influence and determine adverse impact. Such variables include the nature and demographics of the applicant samples, the education level of the applicant samples, the motivation level of the applicant samples, the hiring rates, and the specific methods used to make the final selection decisions. In theory, we could predict the relationship between subgroup differences and obtained levels of adverse impact if we knew all the variables influencing the hiring rates for the applicant samples and if we had reliable, valid data for these variables. Unfortunately, rarely, if ever, can we estimate these values a priori. We know that applicant samples are highly unlikely to be randomly drawn from the applicant population; however, we are almost always surprised by the direction and extent of the non-random nature of the draw. In the final analysis, irrespective of the presence of substantial data on subgroup differences, we lack the necessary data to make firm predictions about the likely degree of adverse impact encountered in operational applied selection situations.

The distinction between subgroup differences and adverse impact is further highlighted by the fact that the techniques for reducing subgroup differences are quite different from those for reducing adverse impact. Furthermore, while subgroup difference reduction techniques may influence levels of adverse impact, the converse is not true; adverse impact reduction techniques have no effect on observed subgroup differences.
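The degree to which such a prediction depends on situational parameters can be illustrated with a simple model. The sketch below is our own illustration under stated assumptions, not an analysis of data from this paper: it assumes normally distributed scores in both groups, a fixed standardized difference of d = 0.5 (the subgroup mean difference expressed in pooled standard deviation units), and strict top-down selection against a single cutoff; the function name and all parameter values are hypothetical. It shows that the identical d can produce adverse impact ratios (the minority pass rate divided by the majority pass rate) ranging from benign to well below the conventional four-fifths benchmark of the Uniform Guidelines, depending only on the overall selection ratio.

```python
# Illustrative sketch (assumed values, not data from the paper): with scores
# taken to be normal in both groups and a fixed standardized difference d,
# the adverse impact ratio produced by a single cutoff is driven by the
# overall selection ratio and the applicant mix.
from scipy.optimize import brentq
from scipy.stats import norm

def adverse_impact_ratio(d, selection_ratio, prop_minority):
    """Minority pass rate / majority pass rate when one cutoff is set so the
    mixed applicant pool passes at the given overall selection ratio.
    Majority scores ~ N(0, 1); minority scores ~ N(-d, 1)."""
    def overall_pass(c):
        p_majority = 1 - norm.cdf(c)
        p_minority = 1 - norm.cdf(c + d)  # minority mean sits d SDs lower
        return (1 - prop_minority) * p_majority + prop_minority * p_minority

    c = brentq(lambda x: overall_pass(x) - selection_ratio, -6.0, 6.0)
    return (1 - norm.cdf(c + d)) / (1 - norm.cdf(c))

# Same subgroup difference (d = 0.5), three hypothetical selection ratios:
for sr in (0.9, 0.5, 0.1):
    ai = adverse_impact_ratio(d=0.5, selection_ratio=sr, prop_minority=0.3)
    print(f"selection ratio {sr:.1f} -> adverse impact ratio {ai:.2f}")
# The printed ratios fall from roughly 0.9 to roughly 0.4: the same d clears
# the four-fifths benchmark at lenient cutoffs and fails it at stringent ones.
```

The mapping is deterministic here only because every situational value is assumed to be known; in operational settings, as argued above, those values are rarely knowable in advance.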

Reducing Subgroup Differences

Subgroup difference reduction techniques are pre-test administration techniques. That is, they are implemented as part of the test design and development process and are predicated not on removing differences between groups on the focal construct, but instead on removing observed differences in the focal construct that may be arising from construct-irrelevant variance, such that at the end of the day, one can state that observed differences are real and not due to an irrelevant construct (that the groups differ on) that is present in the observed scores. Hence, subgroup difference reduction techniques are nothing more than standard good test design and development practices and, as commonly practiced, include (a) identifying and removing internal test bias; (b) increasing the favorability of test-taker perceptions of and subsequent reactions to tests and, ultimately, test performance; and (c) pre-test coaching. Additional techniques which are also often listed in this category and are described in some detail below are (d) changing the focal construct or combining cognitive with other non-cognitive predictor constructs and (e) changing the test method (Arthur and Doverspike 2005; Ployhart and Holtz 2008).

As previously noted, techniques intended to reduce subgroup differences are standard good test design and development practices and techniques, which are intended to minimize or eliminate the influence of construct-irrelevant variance. Thus, although these techniques may affect the level of adverse impact by removing construct-irrelevant variance, if there are "real" differences on the construct of interest (e.g., physical ability), these techniques will not reduce or eliminate subgroup differences, which may or may not in turn translate into adverse impact (depending on other factors such as the selection ratio).

A widely touted subgroup difference (and by inference, adverse impact) reduction approach is the method-change approach (e.g., see Aguinis et al. 2009). For instance, in their list of "recommendation[s] for what organizations should do to minimize the diversity-validity dilemma," Ployhart and Holtz (2008, p. 168) state "[u]se alternative predictor measurement methods (interviews, SJTs, biodata, accomplishment record, assessment centers) when feasible. Supplementing a cognitive predictor measurement method can produce sizeable reductions in adverse impact (if they are not too highly correlated)." However, an often overlooked fact in the recommended use of the method-change approach is that it appears to "work" primarily because there is a concomitant change in the focal construct to more non-cognitive constructs (Arthur and Villado 2008). That is, functionally, the change is in effect a construct and not a method change. We acknowledge that this issue can be most directly and definitively addressed by a study that holds the constructs constant (e.g., general mental ability vs. interpersonal skills) and varies the methods (e.g., multiple-choice test vs. interview vs. situational judgment test vs. assessment center). Unfortunately, we are aware of no studies that have done this. Specifically, there is no research that has investigated and demonstrated that when the constructs are held constant, a change to assessment centers, situational judgment tests, or interviews results in reductions in subgroup differences and ultimately adverse impact reduction or elimination.

That being said, we think that the conclusion that the so-called alternative methods generally tend to be used to assess primarily non-cognitive or less g-loaded constructs is not an unreasonable one, and it is indeed supported by the extant literature. For instance, as highlighted by the summary in Table 1, a review of the construct labels reported in large-scale construct-level meta-analyses of interviews (Huffcutt et al. 2001), situational judgment tests (Christian et al. 2010), and assessment centers (Arthur et al. 2003) clearly indicates that these "alternative" predictor methods are commonly used to measure non-cognitive constructs, more so than cognitive ones. For example, in Christian et al.'s (2010) situational judgment test meta-analysis, only 4 % of the data points were coded as measuring knowledge and skill; the remaining 96 % were non-cognitive constructs. Likewise, in Arthur et al.'s (2003) assessment center meta-analysis, 19 % of the constructs were coded as measuring problem solving (which we are willing to consider as a proxy for cognitive ability) and the remaining 81 % were non-cognitive constructs. (These data and information can be readily extracted from the tables presented in the specified articles.) Consequently, method-change approaches are often in effect functionally construct-change approaches. Furthermore, the amount of variance that is due to methods is unclear and, more importantly, there is little or no general theory for method-based subgroup differences (cf. Roth et al. 2010). In summary, the method-change approach translates into the reduction of subgroup differences through the use of non-cognitive constructs on which there are reduced subgroup differences for some protected classes.

To be clear, (a) we are not necessarily advocating for the use of cognitive constructs over non-cognitive constructs in employment testing, (b) nor are we necessarily advocating for the use of multiple-choice tests as the preferred testing method of choice, and (c) we are not alleging that the non-cognitive constructs commonly measured by "alternative methods" such as assessment centers, situational judgment tests, and interviews are not important. Our position is simply that the method-change approach as often practiced is in effect nothing more than a switch to non-cognitive constructs on which there appear to be smaller subgroup differences. Consequently, it is a viable subgroup difference reduction technique only to the extent that (a) the inclusion of the specified (non-cognitive) constructs is supported by the job analysis and subsequent validity evidence, (b) the differences across all subgroups are small (e.g., see Ryan et al. 1998), and (c) there is indeed no adverse impact on the final assessment score.


Table 1 Constructs commonly measured by interviews, situational judgment tests, and assessment centers per three major construct-level meta-analyses

Constructs | Interviews (Huffcutt et al. 2001) | Situational judgment tests (Christian et al. 2010) | Assessment centers (Arthur et al. 2003)

Cognitive constructs
Cognitive ability | General intelligence; Applied mental skills | | Problem solving
Job knowledge and skills | Job knowledge and skills; Education and training; Experience and work history | Job knowledge and skills |

Non-cognitive constructs
Communication | Communication skills | | Communication
Interpersonal and teamwork skills | Interpersonal skills | Interpersonal skills; Teamwork skills | Consideration/awareness of others
Leadership | Leadership; Persuading and negotiating | Leadership | Influencing others
Agreeableness | Agreeableness | Agreeableness |
Conscientiousness | Conscientiousness | Conscientiousness |
Extroversion | Extroversion | |
Emotional stability | Emotional stability | |
Openness to experience | Openness to experience | |

Other constructs
| Creativity, organizational fit, occupational interest, hobbies, physical attributes | Personality composites; Heterogeneous constructs; Unable to classify (all non-cognitive) | Organizing and planning; Drive; Tolerance for stress/uncertainty

Finally, we should probably call it exactly what it functionally is—a construct-change approach.

Reducing Adverse Impact

In contrast to subgroup difference reduction techniques, adverse impact reduction techniques are post-test administration techniques and primarily entail the elimination of differences in observed outcomes after the test has been administered. That is, they are deployed to deal with observed differences in outcomes. These techniques typically entail some effort to alter the selection outcomes and specifically take the form of focusing on the cut score, test weights, and even the use of banding. Although these techniques may be effective in reducing adverse impact, there are strong concerns about the extent to which most of them are legally permissible because of their post-test administration nature, since Section 106 of the Civil Rights Act of 1991 explicitly states that "It shall be an unlawful employment practice for a respondent, in connection with the selection or referral of applicants or candidates for employment or promotion, to adjust the scores of, use different cutoff scores for, or otherwise alter the results of, employment related tests on the basis of race, color, religion, sex, or national origin."

Effectiveness of Subgroup Difference and Adverse Impact Reduction Techniques

Although they do not maintain the subgroup differences versus adverse impact distinction as strongly as we do here, detailed reviews of the effectiveness of the preceding reduction and elimination techniques have been presented elsewhere (e.g., see Arthur and Doverspike 2005; Ployhart and Holtz 2008; Sackett et al. 2001). For instance, concerning the effectiveness of the various strategies, Ployhart and Holtz (2008) conclude that "using predictors with smaller subgroup differences and combining/manipulating predictors" (p. 164) are the "most effective." By "manipulating predictors," Ployhart and Holtz mean post-test administration adverse impact reduction techniques such as banding and score adjustments and the explicit weighting of predictors and criteria to favor those with smaller levels of adverse impact. Of course, while these techniques may be effective in minimizing adverse impact, they are deployed after subgroup differences have been obtained. Furthermore, as previously noted, their legality is questionable. This is particularly the case with banding, the effectiveness of which, as an adverse impact reduction technique, is dependent on the selection out of the bands being based on the protected class status variable of interest (Barrett et al. 1995; Bobko and Roth 2004; Ployhart and Holtz 2008). That is, noticeable reductions in adverse impact are obtained only if one uses subgroup status preference (e.g., race or sex) to select out of the bands—a practice that is in violation of CRA 1991. In addition, as previously noted, the touted method-change approach works only to the extent that there is a concomitant change to non-cognitive constructs.

Summary of Subgroup Differences and Adverse Impact

In summary, although adverse impact is very closely tied to psychological subgroup differences, they are distinct concepts. From a scientific standpoint, we would agree that professionals can make accurate predictions concerning the degree of subgroup differences occurring for different types of constructs. However, such findings often do not generalize to the highly non-random applicant situations encountered in practical selection situations, including much of high-stakes testing. Thus, in most situations, there will be insufficient data on other variables to make accurate predictions regarding adverse impact outcomes. Furthermore, techniques for reducing subgroup differences may or may not reduce adverse impact, and adverse impact reduction techniques have no effect on observed subgroup differences. Indeed, the disconnect between subgroup differences and adverse impact is perhaps best illustrated by Howe v. City of Akron (2009). In this case, highly similar tests (designed, developed, and implemented by the same consulting firm) resulted in adverse impact against African-Americans at one level of promotional testing (Fire Lieutenant) and adverse impact against Whites at another level of promotional testing (Fire Captain). It is difficult to imagine what type of subgroup differences would have led to a prediction of adverse impact in opposite directions for highly similar tests.

Standardized Differences Exist on Most Ability Constructs

General mental ability and other cognitively loaded tests (e.g., knowledge tests) are widely used for selection purposes because of the high validities associated with their use. However, they also pose potential legal threats to organizations because of their documented race-based subgroup differences (Gottfredson 1988; Herrnstein and Murray 1994; Hunter and Hunter 1984; Jensen 1969; Roth et al. 2001). Consequently, discussions of adverse impact reduction strategies have focused largely on reducing differences in race-based selection outcomes when cognitively loaded assessments are used. However, there are well-documented subgroup differences on a host of ability constructs of interest to the field and a wide array of variables in psychology in general. For instance, in terms of physical ability, males score significantly higher than females on a number of attributes including muscular strength, power, and endurance (Russell et al. 1994; The Cooper Institute 2011; see also Table 1 of Ployhart and Holtz 2008). A similar pattern is observed for mechanical comprehension, with magnitudes that sometimes exceed a standard deviation in favor of males (Feingold 1988; Schmidt 2011). In a meta-analysis of sex differences on spatial ability, Voyer et al. (1995) found differences favoring males for mental rotation (d = 0.56, k = 78) and spatial perception (d = 0.44, k = 92). In contrast, females perform better than males on tests of clerical speed and accuracy (Russell et al. 1994). Communication/social skills tend to favor females. Huffcutt et al. (2001) found that females scored higher on communication and interpersonal skills. Roth et al. (2010) summarized research indicating that females perform significantly better than males on measures of written skills. For spatial/perceptual speed and accuracy, Hartigan and Wigdor (1989) found significant age differences on a General Aptitude Test Battery (GATB) composite. For dexterity, Hartigan and Wigdor found significant age differences on a GATB composite. Our intent here is not to provide a comprehensive review of the literature on differences for various protected subgroups, but merely to point out that critical differences are not limited to African-American–White comparisons on cognitive ability tests. There are well-documented and established age, sex, race, and/or ethnic differences on measures of (a) general mental ability, (b) specific mental abilities, (c) physical ability, (d) mechanical comprehension, (e) clerical speed and accuracy, (f) psychomotor abilities, (g) spatial and perceptual speed and accuracy, (h) dexterity, (i) communication and social skills, (j) memory, (k) reaction time, and (l) personality traits, among others.

Subgroup Differences on Non-Cognitive Constructs

While their magnitudes may be relatively smaller, as previously noted, subgroup differences exist on a host of non-cognitive constructs, such as personality traits, as well (Foldes et al. 2008; Hough et al. 2001; Ones and Anderson 2002). The theoretical and scientific embeddedness of these effects is such that it has been common practice for personality test manuals to report male and female norms for the interpretation of test scores. Finally, it is worth noting that summary tables of observed subgroup differences on a wide range of non-cognitive variables can be found in Foldes et al. (2008), Halpern (1997), Hough et al. (2001), McDaniel et al. (2011a), Ones and Anderson (2002), and Ployhart and Holtz (2008).

Subgroup Differences as a Function of Assessment Method

The use of other assessment methods, usually as alternatives to cognitive-based multiple-choice testing, is often suggested as a means of reducing subgroup differences (Ployhart and Holtz 2008; Sackett et al. 2001) and, subsequently, adverse impact (Aguinis et al. 2009). Although widely acclaimed, the method-change approach, with a few rare exceptions (e.g., Arthur et al. 2002; Chan and Schmitt 1997; Edwards and Arthur 2007), is conceptually suspect because it overlooks the fundamental construct-method distinction (Arthur and Villado 2008). Specifically, the presence and magnitude of subgroup differences is more likely to be a function of the constructs assessed and less so the method of assessment (Arthur and Day 2011; Bobko et al. 2005; Dean et al. 2008; see also Foldes et al. 2008; Hough et al. 2001). Hence, as previously discussed and highlighted in Table 1, the ostensibly lower levels of adverse impact observed for "alternative methods" such as assessment centers, interviews, and situational judgment tests are primarily due to the fact that they are typically designed to measure non-cognitively loaded constructs (Arthur et al. 2003; Christian et al. 2010; Huffcutt et al. 2001). Again, we are unaware of any research that has investigated and demonstrated that when the constructs are held constant, a change to assessment centers, situational judgment tests, or interviews results in reductions in subgroup differences and ultimately adverse impact (cf. Arthur et al. 2002; Chan and Schmitt 1997; Edwards and Arthur 2007; Pulakos and Schmitt 1996; Richman-Hirsch et al. 2000; Schmitt and Mills 2001).

In short, any method of assessment can display high or low levels of subgroup differences; it depends on the construct(s) being measured (Anderson et al. 2006; Goldstein et al. 2001). Phrased another way, all things being equal, whether a specified assessment method will display subgroup differences or not is largely a function of the constructs or content that it is designed to measure. Hence, in the absence of information about the content or the specific knowledge, skills, abilities, and other attributes being assessed, there is no assurance that any particular method will reduce or eliminate adverse impact. This is vividly illustrated in reviews of assessment center and situational judgment test court cases (e.g., see Thornton and Rupp 2006). A reading of these cases leads one to conclude that there is no a priori favorable or unfavorable legal predisposition toward assessment centers and situational judgment tests, and rightfully so. The ability of an assessment center or situational judgment test to withstand legal challenge is primarily a function of the demonstrable job-relatedness of that specific assessment center or situational judgment test and not some general magical or inherent ability to reduce or eliminate adverse impact.

Explaining Subgroup Differences

So, in the presence of observed subgroup differences, it is appropriate to seek or explore the conceptual basis or explanation for the observed differences. Two implicit competing explanations for observed subgroup differences can be labeled the "source-of-fire" and "thermometer" hypotheses. The source-of-fire hypothesis posits that observed subgroup differences are due to or caused by the test itself, that there is something about the test that is generating or causing the observed differences. In contrast, the thermometer hypothesis posits that the test is simply the "bearer of news"; it is merely an indicator and not the cause of the observed differences. If one subscribes to the source-of-fire explanation, then one would try to somehow fix the test or correct the results of the test to eliminate or reduce the observed differences and their associated outcomes. However, to the extent that the observed differences are real and not an artifact of the test, the thermometer hypothesis best explains the observed differences. Subscribing to the thermometer hypothesis also implies that attempts to reduce or eliminate subgroup differences are complex and finding a solution will continue to be a challenge (McDaniel et al. 2011a, b).

Adverse Impact Is Situationally Specific

The third obviating issue is that adverse impact, unlike validity and even subgroup differences, is a very situationally specific phenomenon, the manifestation of which is a function of the makeup of the specific test-taker pool and test-taker behaviors. Unfortunately, we have little control over any restrictions in the test-taker pool that result in it being a non-random sample from the larger, hypothetical pool of applicants. By definition, such characteristics or features are situationally specific; that is, they vary from one situation to the next. So, for instance, in our practices it is not uncommon for us to use what is basically the same test (battery) for the same position on one occasion with no adverse impact, repeat it for the next administration, and have substantially different outcomes. Or, similarly, to use the same fundamental test development process to develop test batteries for different positions, with one resulting in adverse impact and the other no adverse impact. Indeed, this situational specificity characteristic of adverse impact serves as the basis for Biddle's (2010) recommendation to employers to rely on local validation studies instead of validity generalization to support the use of employment tests in Title VII situations. Once again, Howe v. City of Akron (2009) is a particularly noteworthy illustration of this phenomenon because, for what were basically the same test methods (developed by the same consulting firm), it was ruled that the promotional process adopted by the City of Akron had an adverse impact on the basis of race against African-Americans at the Fire Lieutenant level and Whites at the Fire Captain level.

Some Situational Factors Affecting Adverse Impact

There are several factors that highlight the situational specificity of adverse impact as a phenomenon. That is, adverse impact is a function of more than the test itself. For instance, adverse impact can be a function of the following:

(a) The hiring rate. With higher hiring rates, there is likely to be less adverse impact. Hence, there may be subgroup differences in the test scores, but if everyone is selected or promoted (selection ratio = 1), then there will be no adverse impact.

(b) The proportion of minority applicants/test takers. For the same number of minority candidates passing the test, the adverse impact ratio is a function of the number of minority test takers, such that, all things being equal, although it is at odds with typical recruitment practices and goals, a smaller number of minorities will translate into smaller levels of adverse impact.

(c) The variance of the scores for the subgroups. A larger variance is likely to translate into a larger number of individuals near the top of the selection list.

(d) The skewness of scores for the subgroups. A negative skew is likely to translate into a larger number of individuals near the top of the selection list.

(e) Non-randomness in the decision to apply. Psychometrics, including much of our reasoning concerning the outcomes of selection decisions, is based on the assumption that the sample of test takers is a random sample from the larger population, in this case the hypothetical population of applicants for which subgroup difference data have been calculated. Randomness rarely occurs in determining who will apply for jobs. For example, earlier decisions concerning affirmative action hiring or quotas resulting from consent decrees or other court-mandated settlements will lead to differences in the characteristics of incumbents sitting for promotional tests (Kroeck et al. 1983).

(f) Test-taker motivation. Test takers can also have idiosyncratic and different test-taking motivations that cannot be predicted in a systematic manner from one test administration to the next (e.g., taking the test to get some practice for the next time, and a host of other possible motivations).

The situational specificity of adverse impact is also reflected in the fact that, as previously noted, unlike subgroup difference reduction techniques (which are pre-test administration techniques), adverse impact reduction techniques are post-test administration in nature. So, for example, although a specific cutoff score or set of test weights may reduce or eliminate adverse impact in one situation, it cannot by definition be guaranteed to have the same effect in other situations or on other occasions.
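A small Monte Carlo sketch makes this situational specificity concrete. Everything below is an assumed illustration, not data from the paper: a population-level difference of d = 0.3, a fixed cut score, and 120 majority and 40 minority applicants per administration. Holding the test and the populations constant, sampling variation alone produces administrations with substantial adverse impact, administrations with none, and some with impact in the reverse direction.

```python
# Illustrative sketch: with the population-level subgroup difference held
# fixed, repeated applicant samples of realistic size yield widely varying
# adverse impact ratios under the same cutoff. All values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, cutoff = 0.3, 0.5              # assumed population difference and cut score
n_majority, n_minority = 120, 40  # assumed applicant counts per administration

ratios = []
for _ in range(1000):             # 1,000 simulated test administrations
    majority = rng.normal(0.0, 1.0, n_majority)
    minority = rng.normal(-d, 1.0, n_minority)
    pass_majority = (majority > cutoff).mean()
    pass_minority = (minority > cutoff).mean()
    if pass_majority > 0:
        ratios.append(pass_minority / pass_majority)

ratios = np.array(ratios)
print("median adverse impact ratio:", round(float(np.median(ratios)), 2))
print("share of administrations below 4/5:", round(float((ratios < 0.8).mean()), 2))
print("share of administrations above 1.0:", round(float((ratios > 1.0).mean()), 2))
# Identical instruments and populations, divergent outcomes: the same test
# can show substantial adverse impact in one administration and none, or
# impact in the reverse direction, in the next.
```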

Increased Likelihood of Finding Adverse Impact

From a practitioner perspective, at one time the calculation of adverse impact seemed rather simple and straightforward. One calculated hiring rates for African-American–White and male–female comparisons (usually using nothing fancier than paper, pencil, and a handheld calculator) and then determined whether there were differences. However, there have been substantial changes, and practice is now a whole lot more complicated. Statistical procedures have become more complex and sophisticated, and with the use of computers, it is now possible and routine to run a large number of statistical tests and calculations for various job grouping breakdowns. The advent of technology-enhanced internet-based assessments has also resulted in huge numbers of test takers. The number of ethnic groups has greatly expanded (National Research Council 2004), and the range of protected classes has increased to include age, disabilities, and now perhaps the unemployed (The White House 2011).

In this section, we address the statistical issues in the determination of adverse impact. In particular, we note how the odds favor a likelihood of obtaining support for an adverse impact finding. In part, this is because as increasingly more statistical tests are performed on increasingly larger groups of people, one is de facto more likely to obtain statistically significant results. For an in-depth and detailed treatment of a number of issues and associated problems in the statistical analysis of adverse impact, the interested reader is referred to the Technical Advisory Committee Report on Best Practices in Adverse Impact Analyses (Cohen et al. 2010).

Too Many Subgroups

Adverse impact no longer involves a simple comparison of African-American and White selection ratios (National Research Council 2004). The EEOC and the courts have endorsed "intersectional discrimination," which is the intersection of two or more protected groups. For example, African-American women can be seen as distinct from White women and African-American men (EEOC 2006b). This also applies to the intersection of race and characteristics covered by other statutes such as age (Kimble v. Wisconsin Department of Workforce Development 2010). The theory underlying this endorsement is that these intersections lead to distinct stereotypes and subsequent discrimination threats that go beyond race (Semple 1990–1991). Thus, if we have six racio-ethnic groups disaggregated by sex being compared to a White, male majority group, then the calculation of adverse impact could very well entail the comparison of 11 or more groups. And, if one were to add a second job group (e.g., a second promotional level), then one would now have 22 or more statistical comparisons and tests being performed. This raises the likelihood of obtaining a finding of adverse impact close to a certainty.

The issue of "classification ambiguity" is particularly germane here as well (we thank the editor for drawing our attention to this point). Increasingly, the lines between demographic groups along the standard protected classes are becoming blurred. For instance, an increasing number of individuals are multiracial, multiethnic, or even transgendered. This is reflected in the EEOC's revised EEO-1 race and ethnicity reporting categories (EEOC 2006a), which increased the number of race and ethnic reporting categories from five single categories to seven categories that also permitted compound- or cross-classifications including "Two or More Races (Not Hispanic or Latino)." In addition, as this article was being written, there were serious efforts to further expand civil rights employment legislation to increase the protection offered to a number of groups including the unemployed (see Subtitle D—Prohibition of Discrimination in Employment on the Basis of an Individual's Status as Unemployed, The American Jobs Act, The White House 2011), the disabled in general (Office of Federal Contract Compliance Programs [OFCCP] 2011), the learning disabled (EEOC 2011), and lesbian, gay, bisexual, and transgender workers [e.g., see the Employment Non-Discrimination Act (ENDA), a proposed bill in the U.S. Congress which, with the exception of the 109th, has been introduced in every Congress since 1994]. Furthermore, the ongoing Kronos case (EEOC v. Kronos 2010) looked at the possibility that not only might personality tests have adverse impact against a number of groups but also whether personality tests might be discriminatory toward the disabled, including those with social disabilities. Such expansions and designations of new and additional protected classes increase the likelihood that any specific test will be found to have adverse impact against some group, based on both underlying subgroup differences and also the sheer number of statistical tests being performed.
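The arithmetic behind this near-certainty is worth making explicit. The sketch below assumes, purely for illustration, that each comparison is an independent significance test with a 5 % false-positive rate under the null; real comparisons overlap and are not independent, so the exact figures differ, but the direction of the effect does not.

```python
# Illustrative sketch: if each of k independent subgroup comparisons has a
# false-positive rate of alpha even when no true disparity exists, the
# chance that at least one "significant" finding appears grows quickly.
alpha = 0.05
for k in (1, 11, 22):
    print(k, "comparisons ->", round(1 - (1 - alpha) ** k, 2))
# 1 -> 0.05, 11 -> 0.43, 22 -> 0.68: with 22 comparisons, a finding of
# adverse impact somewhere is more likely than not under the null.
```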


Too Many People

In addition to increases in the number of subgroups, the advent of technology-enhanced internet-based assessments has made high-volume testing and assessment programs quite common. In our experience, applicant samples in excess of 10,000 are no longer uncommon. The occurrence of such large applicant samples increases the likelihood that even small differences in hiring rates will be found to be statistically significant (Glaze et al. 2011), as shown in the sketch following this section.

Use of Multiple Statistical Procedures

Statistical experts have at their disposal a number of methods for testing for adverse impact (e.g., see Biddle and Morris 2011). They also have to make a variety of decisions regarding data aggregation and disaggregation (Biddle 2011; Gastwirth 1984). The result is that for any particular set of data, a large number of potential statistical analyses can be run. The expert can then use the most favorable set of outcomes to support her/his case. The ability to conduct multiple tests leads to a number of problems, but the basic issue is that the ability to conduct multiple tests using various cuts of the data increases the likelihood that an adverse impact finding will be obtained. In addition, the OFCCP's use of either demographic or applicant flow data increases the odds of obtaining an adverse impact finding. The plaintiff can calculate statistics based on both applicant flow data and census-based availability. The two results can be quite different, but the plaintiff can choose the result supporting an adverse impact finding.

Summary on Statistical Analysis

Although we maintain that it is easy to demonstrate adverse impact in many operational applied situations, we are not arguing that one can never produce a finding of no adverse impact. In many cases in our own work, we do find situations where there is no adverse impact. However, our point is that we cannot predict when adverse impact will be found based on either the construct measured or the format used to measure the construct.
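To illustrate the large-sample point raised under "Too Many People," here is a minimal two-proportion z-test with hypothetical counts of our own choosing: a three-percentage-point difference in pass rates, which few would call practically meaningful, is nonetheless statistically significant at these sample sizes.

```python
# Illustrative sketch with assumed counts: in large applicant pools, a small
# difference in pass rates is statistically significant even when the
# adverse impact ratio itself looks unremarkable.
from scipy.stats import norm

n1, n2 = 8000, 2000            # assumed majority and minority applicants
p1, p2 = 0.50, 0.47            # assumed pass rates
p = (n1 * p1 + n2 * p2) / (n1 + n2)           # pooled pass rate
se = (p * (1 - p) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1 - p2) / se
print("two-proportion z =", round(z, 2),
      "; two-sided p =", round(2 * (1 - norm.cdf(abs(z))), 4))
# z is about 2.4 (p < .05): statistically significant, although the adverse
# impact ratio (0.47 / 0.50 = 0.94) is well above the four-fifths benchmark.
```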

Implications for Civil Rights Legislation

The CRA had the beneficial effect of setting off a search for tests that would reduce adverse impact, while still maintaining high levels of validity. The case made in the present paper is (a) that subgroup differences and adverse impact are construct-level phenomena, and (b) that the field of I/O psychology has not arrived at formats and methods that will guarantee the absence of adverse impact while still maintaining high levels of validity. The conundrum is that adverse impact is one of those problems for which we have no guaranteed solution no matter how much time or money we spend on it. Under these circumstances, it is at the very least inappropriate for us, as a profession, to engage in the practice of guaranteeing the reduction and elimination of adverse impact. Indeed, the burgeoning realization of the hollowness of these guarantees has led to calls by I/O psychologists and attorneys for the revision or elimination of the Uniform Guidelines (McDaniel et al. 2011a, b; see also comments by Sackett 2011; Sharf 2011). We would argue that a revision of the Uniform Guidelines is insufficient and that what is required is a revision to the CRA so as to modify and clarify the applicability of the theory of disparate impact (Barrett et al. 2011).

So, while we cannot and should not be guaranteeing adverse impact reduction or elimination, what we can guarantee as a science and profession are sound and valid tests and assessment devices that can be defended accordingly should the use of said tests and devices be challenged. To this end, we call for legislation and federal guidelines that clearly state that differences in group outcomes alone cannot be labeled discriminatory. [We should note that we draw a distinction here between the psychometric use and meaning of the term "discrimination" (e.g., as in item discriminability, or the ability of a test to discriminate between high and low performers) and the legal use of the term. We are, of course, referring to the latter in this discussion.] The label "discriminatory" should only be applied when (1) there is no logical basis for the assessment device or procedure, (2) the assessment device or procedure has no job or business relevance, and (3) there is evidence that the scores on the assessment device or procedure did not represent the level of ability or latent trait on the characteristic, either because of construct-irrelevant variance (for example, requiring reading on the test when it is not required on the job) or because of past discrimination [individuals failed the promotional test because the company refused to allow them to attend training because of their race or sex; this would be consistent with the original concept in Griggs v. Duke Power (1971) of correcting the "present effects of past discrimination"].

Furthermore, if the Uniform Guidelines are to be revised, as some have called for (e.g., McDaniel et al. 2011a), then we would suggest that such a revision should include a clearer, simpler standard related to the concept of job or business relevance. In both Guardians v. Civil Service Commission of the City of New York (1980) and Ricci, the court appeared to argue for a more straightforward approach to addressing the question of the adequacy of a test battery. Finally, the provision regarding demonstrations of an alternative device should be substantially altered to make such an argument more consonant with the extant literature. Thus, the plaintiff or plaintiff's expert should have to demonstrate that the alternative (a) has equal validity (i.e., measures constructs or KSAs that are equally important and job-related), (b) is equally feasible for the situation under consideration, and (c) will result in less adverse impact in the specific situation. The plaintiff's expert should offer evidence that they have actually reduced adverse impact through the implementation of the proposed alternative in similar situations.

Implications for Professional Roles

As psychologists in general, and I/O psychologists in particular, our efforts should be directed at designing and developing the best tests possible (i.e., eliminating sources of construct-irrelevant variance). However, in addition to this, instead of tinkering with post-test administration interventions that do not address or speak to the fundamental reasons for the observed subgroup differences—because the extant literature is more consonant with the thermometer hypothesis than the source-of-fire hypothesis—we should engage in meaningful and substantive efforts to unearth the extra-test factors (e.g., educational, societal) that account for the observed differences and to determine what can be done about them (e.g., adding our voice to policy debates concerning public education) to minimize or eliminate their effects. Indeed, interestingly, this is the level of discourse for the same phenomenon in K-12 standardized educational testing (e.g., the Department of Education's National Assessment of Educational Progress program). In fact, we have always found it very perplexing and discordant that in K-12 standardized educational testing (e.g., No-Child-Left-Behind programs), as per the thermometer hypothesis, observed subgroup differences serve as the basis for initiating policy-level decisions (e.g., failing schools, poor teachers, large class sizes, inadequate resources, etc.). In sharp contrast, in employment testing, the same observed differences are viewed through the source-of-fire lens and interpreted as "discriminatory."

It is our view that in our role as professional test developers and consultants, guarantees of adverse impact elimination or reduction by I/O psychologists are suspect at best. Real-world situations are complex, messy, and non-random. Population validities and subgroup differences may indeed be generalizable across situations. However, adverse impact is a characteristic of the situation and is affected by situational variables. As a result, I/O psychologists should avoid predictions or guarantees regarding adverse impact elimination in specific selection situations, including those based on the inclusion of "alternative" selection devices. However, I/O psychologists can promise to use sound test development practices and procedures that minimize construct-irrelevant variance. They can argue that the inferences from said test scores are being made based on the best cumulative validity and scientific data available. Finally, they can state that the use of the test scores can be defended based on the validity evidence should said test scores result in adverse impact. However, they cannot and should not claim to be able to guarantee the elimination of adverse impact a priori.

Although it is difficult to see how it would be done operationally, as is now the case in the K-12 education standardized testing literature, it is also time to acknowledge that mean differences do exist and that they do not necessarily represent discriminatory processes in testing. In 1910, Woodworth published a paper on the psychological testing of different cultural and ethnic groups in different regions of the world. Woodworth (1910) pointed out that merely reporting mean differences between ethnic groups was of little value because the overlap among groups on all tested attributes is considerable. Rosenblum (2000) discussed the evolution of analytical proof and statistics in equal employment opportunity litigation. He pointed out that the simple differences between African-Americans and Whites that he and other experts presented in the 1960s and 1970s should no longer be probative. In 2004, the National Research Council produced a book on statistical techniques to determine racial discrimination in various areas, including employment, in which the simple mean score difference approach is not even discussed. It is time to reject the simple mean difference approach underlying the analysis of adverse impact.

Conclusion

The CRA of 1964 has helped to expedite many needed societal and organizational-level changes in the treatment of protected classes. For I/O psychology, the CRA served to reinvigorate research into the constructs our tests measure and their job-relatedness, the form such tests take, the importance of considering applicant perceptions including the need for socially conscious and socially valid tests, and the need to consider biases in subjective decision making. However, we would also agree with McDaniel et al. (2011a, b) that we are unlikely to be able to guarantee any time in the near future that there will be no group-level differences in outcomes resulting from the use of our employment tests. At this point, it is difficult to see how a professional psychologist could guarantee that any test or test battery would not have adverse impact against any protected subgroup, let alone all protected subgroups, unless the hiring rate was so high that close to 100 % of the applicants were hired or the selection process was basically random.

As professionals, it would seem fruitful for us to recognize and acknowledge that, unlike validity inferences, adverse impact is a situationally specific phenomenon, the elimination or reduction of which cannot be predicted or promised a priori. This does not mean that we should give up on the search for methods of reducing subgroup differences; to the extent that we can continue to improve on our test development practices and minimize the role of construct-irrelevant variance, it is a worthy and challenging endeavor. Nevertheless, while we continue to pursue this goal, we should admit that we have limited control over findings of adverse impact; adverse impact is situationally determined, and in most instances, we lack sufficient information to predict a priori the extent of adverse impact that will be observed in specific selection situations.

References

Aguinis, H., Cascio, W., Goldstein, I., Outtz, J., & Zedeck, S. (2009). In The Supreme Court of the United States: Ricci v. DeStefano: Brief of Industrial-Organizational Psychologists as Amici Curiae in support of respondents.
Anderson, N., Lievens, F., van Dam, K., & Born, M. (2006). A construct-driven investigation of gender differences in a leadership-role assessment center. Journal of Applied Psychology, 91, 555–566.
Arthur, W., Jr., & Day, E. A. (2011). Assessment centers. In S. Zedeck (Ed.), APA handbook of industrial and organizational psychology: Vol. 2. Selecting and developing members for the organization (pp. 205–235). Washington, DC: APA.
Arthur, W., Jr., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). Meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–154.
Arthur, W., Jr., & Doverspike, D. (2005). Achieving diversity and reducing discrimination in the workplace through human resource management practices: Implications of research and theory for staffing, training, and rewarding performance. In R. L. Dipboye & A. Colella (Eds.), Discrimination at work: The psychological and organizational bases (pp. 305–327). Mahwah, NJ: LEA.
Arthur, W., Jr., Edwards, B. D., & Barrett, G. V. (2002). Multiple-choice and constructed-response tests of ability: Race-based subgroup performance differences on alternative paper-and-pencil test formats. Personnel Psychology, 55, 985–1008.
Arthur, W., Jr., & Villado, A. J. (2008). The importance of distinguishing between constructs and methods when comparing predictors in personnel selection research and practice. Journal of Applied Psychology, 93, 435–442.
Barrett, G. V., Doverspike, D., & Arthur, W., Jr. (1995). The current status of the judicial review of banding: A clarification. The Industrial-Organizational Psychologist, 33(1), 39–41.


Barrett, G. V., Miguel, R. F., & Doverspike, D. (2011). The Uniform Guidelines: Better the devil you know. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 534–536.
Biddle, D. A. (2010). Should employers rely on local validation studies or validity generalization (VG) to support the use of employment tests in Title VII situations? Public Personnel Management, 39, 307–326.
Biddle, D. A. (2011). Adverse impact and test validation: A practitioner's handbook. Concord, MA: Infinity Publishing.
Biddle, D. A., & Morris, S. B. (2011). Using Lancaster's mid-P correction to the Fisher's exact test for adverse impact analyses. Journal of Applied Psychology, 96, 956–965.
Bobko, P., & Roth, P. L. (2004). Personnel selection with top-score-referenced banding: On the inappropriateness of current procedures. International Journal of Selection and Assessment, 12, 291–298.
Bobko, P., Roth, P. L., & Buster, M. A. (2005). Work sample tests and expected reduction in adverse impact: A cautionary note. International Journal of Selection and Assessment, 13, 1–10.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82, 143–159.
Christian, M. S., Edwards, B. D., & Bradley, J. C. (2010). Situational judgment tests: Constructs assessed and a meta-analysis of their criterion-related validities. Personnel Psychology, 63, 83–117. doi:10.1111/j.1744-6570.2009.01163.x.
Civil Rights Act of 1964, Public Law 88-352 (78 Stat. 241).
Civil Rights Act of 1991, 42 U.S.C. §§ 1981, 2000e et seq.
Cohen, D. B., Aamodt, M. G., & Dunleavy, E. M. (2010). Technical advisory committee report on best practices in adverse impact analyses. Washington, DC: Center for Corporate Equality.
Dean, M. A., Roth, P. L., & Bobko, P. (2008). Ethnic and gender subgroup differences in assessment center ratings: A meta-analysis. Journal of Applied Psychology, 93, 685–691.
Dwight Bazile et al., v. City of Houston. (2012). C.A. No. H-08-2404. United States District Court for the Southern District of Texas, Houston Division.
Edwards, B. D., & Arthur, W., Jr. (2007). An examination of factors contributing to a reduction in subgroup differences on a constructed-response paper-and-pencil test of scholastic achievement. Journal of Applied Psychology, 92, 794–801.
Equal Employment Opportunity Commission. (2006a). Equal Employment Opportunity Standard Form 100, Rev January 2006, Employer Information Report EEO-1 Instruction Booklet. Washington, DC: EEOC.
Equal Employment Opportunity Commission. (2006b). EEOC Compliance Manual, Section 15 of the New Compliance Manual on "Race and Color Discrimination". Number 915.003.
Equal Employment Opportunity Commission. (2011). ADA: Qualification standards; disparate impact. Retrieved November 17, 2011, from www.eeoc.gov/eeoc/foia/letters/2011/ada_qualification_standards.html.
Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, & Department of Justice. (1978). Adoption by four agencies of uniform guidelines on employee selection procedures. Federal Register, 43, 38290–38315.
Equal Employment Opportunity Commission v. Kronos. (2010). Case no. 2:09-mc-00079-ajs. www.ca3.uscourts.gov/opinarch/093219p.pdf.
Feingold, A. (1988). Cognitive gender differences are disappearing. American Psychologist, 43, 95–103.
Foldes, H., Duehr, E. E., & Ones, D. S. (2008). Group differences in personality: Meta-analyses comparing five US racial groups. Personnel Psychology, 61, 579–616.


Gastwirth, J. L. (1984). Statistical methods for analyzing claims of employment discrimination. Industrial and Labor Relations Review, 38, 75–86.
Glaze, R. M., Jarrett, S. M., Arthur, W., Jr., Schurig, I., & Taylor, J. E. (2011). Comparative evaluation of three situational judgment test response formats in terms of construct-related validity and subgroup differences. Paper presented at the 26th Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, IL.
Goldstein, H. W., Yusko, K. P., & Nicolopoulos, V. (2001). Exploring Black-White subgroup differences of managerial competencies. Personnel Psychology, 54, 783–807.
Gottfredson, L. S. (1988). Reconsidering fairness: A matter of social and ethical priorities. Journal of Vocational Behavior, 33, 293–319.
Griggs v. Duke Power Co. (1971). 401 U.S. 424.
Guardians v. Civil Service Commission of the City of New York, 630 F.2d 79 (2nd Cir. 1980).
Halpern, D. F. (1997). Sex differences in intelligence: Implications for education. American Psychologist, 52, 1091–1102.
Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.
Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure in American life. New York, NY: Free Press.
Hough, L. M., Oswald, F. L., & Ployhart, R. E. (2001). Determinants, detection and amelioration of adverse impact in personnel selection procedures: Issues, evidence and lessons learned. International Journal of Selection and Assessment, 9, 152–194.
Howe v. City of Akron. (2009). U.S. Dist. LEXIS 137344 (N.D. Ohio 2010).
Huffcutt, A. I., Conway, J. M., Roth, P. L., & Stone, N. J. (2001). Identification and meta-analytic assessment of psychological constructs measured in employment interviews. Journal of Applied Psychology, 86, 897–913.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98. doi:10.1037/0033-2909.96.1.72.
Industrial and Organizational Psychology: Perspectives on Science and Practice, Vol. 4, No. 4, pp. 494–570 (2011).
Jensen, A. R. (1969). How much can we boost IQ and scholastic achievement? Harvard Educational Review, 39, 1–123.
Kimble v. Wisconsin Department of Workforce Development, 690 F. Supp. 2d 765 (E.D. Wis. 2010).
Kroeck, K., Barrett, G. V., & Alexander, R. A. (1983). Imposed quotas and personnel selection: A computer simulation study. Journal of Applied Psychology, 68, 123–136.
Lewis v. City of Chicago. (2010). U.S. LEXIS 4165 (Sup. Ct. Feb. 22, 2010).
McDaniel, M. A., Kepes, S., & Banks, G. C. (2011a). The Uniform Guidelines are a detriment to the field of personnel selection. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 494–514.
McDaniel, M. A., Kepes, S., & Banks, G. C. (2011b). Encouraging debate on the Uniform Guidelines and the disparate impact theory of discrimination. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 470–566.
National Research Council. (2004). Measuring racial discrimination. Panel on Methods for Assessing Discrimination. R. M. Blank, M. Dabady, & C. F. Citro (Eds.). Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
Office of Federal Contract Compliance Programs. (2011). Affirmative action and nondiscrimination obligations of contractors and subcontractors regarding individuals with disabilities. Federal Register, 76, Number 237. Retrieved December 9, 2011, from www.regulations.gov/#!documentDetail;D=OFCCP-2010-0001-0130.
Ones, D. S., & Anderson, N. (2002). Gender and ethnic group differences on personality scales in selection: Some British data. Journal of Occupational and Organizational Psychology, 75, 255–276.
Ployhart, R. E., & Holtz, B. C. (2008). The diversity-validity dilemma: Strategies for reducing racioethnic and sex subgroup differences and adverse impact in selection. Personnel Psychology, 61, 153–172.
Pulakos, E. D., & Schmitt, N. (1996). An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance, 9, 241–258.
Ricci v. DeStefano. (2009). 129 S. Ct. 2658.
Richman-Hirsch, W. L., Olson-Buchanan, J. B., & Drasgow, F. (2000). Examining the impact of administration medium on examinee perceptions and attitudes. Journal of Applied Psychology, 85, 880–887.
Rosenblum, M. (2000). On the evolution of analytical proof, statistics, and the use of experts in EEO litigation. In J. L. Gastwirth (Ed.), Statistical science in the courtroom (pp. 161–194). New York: Springer-Verlag.
Roth, P. L., Bevier, C. A., Bobko, P., Switzer, F. S., & Tyler, P. (2001). Ethnic group differences in cognitive ability in employment and educational settings: A meta-analysis. Personnel Psychology, 54, 297–330. doi:10.1111/j.1744-6570.2001.tb00094.x.
Roth, P. L., Buster, M. A., & Barnes-Farrell, J. (2010). Work sample exams and gender adverse impact potential: The influence of self-concept, social skills, and written skills. International Journal of Selection and Assessment, 18, 117–130.
Russell, T. L., Reynolds, D. H., & Campbell, J. P. (1994). Building a joint-service classification research roadmap: Individual differences measurement (AL/HR-TP-1994-0009). Brooks AFB, TX: Armstrong Laboratory (AFMC), Human Resources Directorate, Manpower and Personnel Research Division.
Ryan, A. M., Ployhart, R. E., & Friedel, L. A. (1998). Using personality testing to reduce adverse impact: A cautionary note. Journal of Applied Psychology, 83, 298–307.
Sackett, P. R. (2011). The Uniform Guidelines is not a scientific document: Implications for expert testimony. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 545–546.
Sackett, P. R., Schmitt, N., Ellingson, J., & Kabin, M. B. (2001). High-stakes testing in employment, credentialing, and higher education: Prospects in a post-affirmative-action world. American Psychologist, 56, 302–318.
Schmidt, F. L. (2011). A theory of sex differences in technical aptitude and some supporting evidence. Perspectives on Psychological Science, 6, 560–573.
Schmitt, N., & Mills, A. E. (2001). Traditional tests and job simulations: Minority and majority performance and test validities. Journal of Applied Psychology, 86, 451–458.
Semple, J. B. (1990–1991). Note, Invisible Man: Black and male under Title VII. Harvard Law Review, 104, 749–768.
Sharf, J. C. (2011). Equal employment versus equal opportunity: A naked political agenda covered by a scientific fig leaf. Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 537–539.
The Cooper Institute. (2011). The Cooper standards. Dallas, TX: The Cooper Institute. Retrieved December 18, 2011, from http://www.cooperinstitute.org/law-fire-military.
The White House. (2011). The American Jobs Act: President Obama's plan to grow jobs now. Washington, DC: The White House. Released September 11, 2011. Retrieved from www.whitehouse.gov/sites/default/files/omb/legislative/reports/american-jobs-act.pdf.
Thornton, G. C., III, & Rupp, D. E. (2006). Assessment centers in human resource management: Strategies for prediction, diagnosis, and development. Mahwah, NJ: LEA.
Tonowski, R. (2011a, September 6–7). In public safety testing, it's 1980 again. Assessment Center Council News.
Tonowski, R. (2011b, March 20–23). The best of times, the worst of times (for discrimination cases). Assessment Center Council News.
U.S. v. The City of New York. (2010). 683 F. Supp. 2d 225; 2010 U.S. Dist. LEXIS 2056; 108 Fair Empl. Prac. Cas. (BNA) 415, January 13, 2010.
U.S. v. The City of New York. (2011). 2011 U.S. Dist. LEXIS 115074 (E.D.N.Y., October 5, 2011).
Voyer, D., Voyer, S., & Bryden, M. P. (1995). Magnitude of sex differences in spatial abilities: A meta-analysis and review. Psychological Bulletin, 117, 250–270. doi:10.1037/0033-2909.117.2.250.
Wal-Mart Stores, Inc. v. Dukes. (2011). U.S. LEXIS 4567 (Sup. Ct. 2011).
Woodworth, R. S. (1910). Racial differences in mental traits. Science, 31, 171–186.
