
Journal of Experimental Psychopathology JEP Volume 2 (2011), Issue 2, 197–209 ISSN 2043-8087 / DOI: 10.5127/jep.008310

Best Practices for Using Median Splits, Artificial Categorization, and their Continuous Alternatives

Jamie DeCoster (a), Marcello Gallucci (b), Anne-Marie R. Iselin (c)

(a) Institute for Social Science Research, University of Alabama
(b) The University of Milano-Bicocca
(c) Duke University

Abstract

Methodologists have long discussed the costs and benefits of using medians or other cut points to artificially turn continuous variables into categorical variables. The current paper attempts to provide a perspective on this literature that will be of practical use to experimental psychopathologists. After discussing the reasons that clinical researchers might use artificial categorization, we summarize the arguments both for and against this procedure. We then provide a number of specific suggestions related to the use of artificial categorization, including our thoughts on when researchers should use artificial categories, how their use can be justified, what continuous alternatives are available, and how the continuous alternatives should be used.

© Copyright 2011 Textrum Ltd. All rights reserved.

Keywords: median splits, dichotomization, artificial categorization, continuous measures, dimensional measures, threshold, cut points

Correspondence to: Jamie DeCoster, Institute for Social Science Research, University of Alabama, 306 Bryant Drive, Tuscaloosa, AL 35487-0216, USA, email: [email protected]

1. Institute for Social Science Research, University of Alabama, 306 Bryant Drive, Tuscaloosa, AL 35487-0216, USA, email: [email protected]
2. Faculty of Psychology, The University of Milano-Bicocca, piazzale Ateneo Nuovo 1 (U6), 20126 Milan, Italy, email: [email protected]
3. Center for Child and Family Policy, Duke University, Box 90545, Durham, NC 27705, USA, email: [email protected]

Received 31-May-2010; received in revised form 24-Sep-2010; accepted 29-Sep-2010


Table of Contents
Introduction
Why Do People Use Artificial Categorization?
How Do People Use Artificial Categorization?
What Are the Problems with Artificial Categorization?
Are There Valid Justifications for Using Artificial Categorization?
How Should Artificial Categorization and Their Continuous Alternatives Be Used?
References

Introduction

Human beings have a natural tendency to categorize and classify things in their environment. Grouping similar concepts simplifies our thinking, makes it easier to understand objects as instances of a type, and makes it easier to describe the objects to others. Despite the fact that this tendency is well understood by psychologists (e.g., Fiske & Taylor, 1991), they are by no means exempt from its effects. When trying to describe how a psychological trait or ability is distributed throughout a population, it is common for clinical psychologists to think in terms of “normal” and “abnormal” groups, even when the characteristic is known to vary more continuously. This tendency has also impacted the way that psychologists analyze their data in research studies, where it is common practice to perform “median splits” to transform a continuous variable into a categorical variable with “high” and “low” groups. Median splits are a specific example of “artificial categorization”, which refers to the more general process of defining categorical variables based on the value of a numeric variable. Although this typically simplifies data analysis and the presentation of results, statisticians have often criticized the use of artificially categorized variables on the grounds that they distort research findings.

The purpose of the present article is to evaluate the use of artificial categorization in experimental psychopathology. To this end, we start with an overview of the reasons that psychopathology researchers employ artificial categorization. We next provide practical information on common forms of artificial categorization and how they are used. We then discuss the major criticisms that have been raised against artificial categorization and the justifications that other researchers have provided for this procedure. Finally, we conclude by presenting a consolidated view of the costs and benefits associated with artificial categorization, providing recommendations for its use, and suggesting specific alternative methods for situations where median splits and other artificial categorizations are inappropriate.

Why Do People Use Artificial Categorization?

Researchers in psychopathology have historically made use of artificial categorization for both statistical and clinical reasons. The primary statistical advantage of using artificial categories is that they simplify the interpretations of the variables, the analyses, and the presentation of the results from a study. When trying to interpret a variable, it is much easier to consider differences between a limited number of groups than it is to consider differences along a continuum. It is often not clear how important specific numeric differences are: How meaningful is a 10-point difference on an IQ test? How about a 5-point or 2-point difference? By working with categories, researchers circumvent these questions because there are only a small number of values to compare, and it can be assumed that each comparison is meaningful.

With regard to analysis, it is commonly easier (or at least more traditional) for researchers to analyze a variable categorically than continuously. Psychologists are typically more accustomed to using analysis of variance (ANOVA) to test influences on an outcome variable, which requires a categorical predictor
variable. If someone wanted to predict an individual's current mood from their score on a measure of depression, they might choose to artificially categorize these scores so that the data fit within an ANOVA framework. Most psychologists receive training in ANOVA, so authors can expect that this type of analysis will be understood by those reading their articles. For similar reasons, in fields where logistic regression is more common, researchers sometimes choose to perform median splits on their outcome variables so that they can analyze their data using logistic regression instead of linear regression or ANOVA, because logistic regression is more familiar and acceptable to their audiences. Finally, artificial categorization can simplify the presentation of results. One of the easiest ways to illustrate a significant effect is to provide a table or graph of the mean scores for different groups. There will not naturally be group means, however, if the effect involves a continuous variable. In this case, the analyst interprets the effects by considering the slope relating the predictor variable to the outcome variable and how it changes across the levels of other predictors. For example, if a researcher wanted to investigate a gender by age interaction effect on violence ratings, he or she would need to plot separate regression lines between age and violence for each gender and observe how they differ to interpret the effect. This is more complicated than simply comparing the difference between the violence ratings for older and younger males to the difference between older and younger females. Psychopathology researchers also use categorical variables for several important clinical reasons, especially as they pertain to treatments of psychological disorders. All human beings have difficulties and problems in their life, but not everyone’s problems reach a level that impairs their daily living, nor is everyone in need of treatment. 
Clinicians have developed criteria to help diagnose when these difficulties reach a level that is abnormal, categorizing people into those who have a mental health disorder and those who do not. An array of treatment decisions are made based on whether an individual is diagnosed with a psychological disorder, with treatment being typically provided only to individuals categorized as having a disorder in order to conserve time and resources (e.g., Kraemer, Noda, & O’Hara, 2004). Categorical labels also enable professionals across disciplines to clearly communicate about the psychological states of those seeking mental health services (e.g., Kamphuis & Noordhof, 2009). Finally, many clinical practices use diagnostic categories because they are necessary for them to receive reimbursement for the provision of clinical services such as therapy or psychiatric hospitalization (e.g., Kamphuis & Noordhof, 2009).
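
The gender by age example from the statistical discussion above can be made concrete with a small sketch. It contrasts the continuous interpretation (one regression slope of violence on age per gender) with the categorized interpretation (older-minus-younger mean differences per gender after a median split on age). Everything here is an illustrative assumption rather than anything taken from a real study: the simulated data, the 0/1 gender coding, the slope values, and the function names `simulate`, `slopes_by_group`, and `mean_diff_by_group` are all hypothetical.

```python
import numpy as np

def simulate(n=400, seed=1):
    """Simulate hypothetical data in which the age slope differs by gender."""
    rng = np.random.default_rng(seed)
    gender = rng.integers(0, 2, n)                 # hypothetical 0/1 coding
    age = rng.uniform(18, 60, n)
    slope = np.where(gender == 1, -0.10, -0.02)    # made-up slopes for illustration
    violence = 10 + slope * age + rng.normal(0, 1, n)
    return gender, age, violence

def slopes_by_group(gender, age, outcome):
    """Continuous view: least-squares slope of outcome on age within each gender."""
    return {
        int(g): float(np.polyfit(age[gender == g], outcome[gender == g], 1)[0])
        for g in np.unique(gender)
    }

def mean_diff_by_group(gender, age, outcome):
    """Categorized view: older-minus-younger mean difference within each gender,
    where 'older' means above the sample median age."""
    older = age > np.median(age)
    diffs = {}
    for g in np.unique(gender):
        in_g = gender == g
        diffs[int(g)] = float(outcome[in_g & older].mean()
                              - outcome[in_g & ~older].mean())
    return diffs

gender, age, violence = simulate()
print("slopes by gender:", slopes_by_group(gender, age, violence))
print("older - younger mean difference by gender:",
      mean_diff_by_group(gender, age, violence))
```

Both summaries point to the same interaction in this simulated case. The difference is in what the reader has to interpret: two slopes in the continuous analysis versus four group means in the categorized one.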

How Do People Use Artificial Categorization?

Standard median splits can be used on either continuous or ordinal variables to turn them into dichotomous variables (that is, categorical variables with two groups). This is done by putting all cases that are below the median into a “low” group and all cases that are above the median into a “high” group. Values exactly at the median can be put into either group, and are usually assigned in a way that will make the groups most equivalent in size. A median split will naturally create equal groups when the original variable is continuous, but median splits of ordinal variables may produce unequal groups when the original variable has a limited number of possible values. After it is created, the median split variable is used in place of the original continuous or ordinal variable in the researcher's analyses.

There are several ways to artificially create a categorical variable other than by using a median split. As one example, researchers are not limited to using the median as a cutoff when dichotomizing a continuous variable. Median splits do tend to give the best results when the original variable has a symmetric distribution (such as when the original variable is normally distributed, see Cohen, 1983). However, sometimes researchers want to use a continuous variable to identify a distinct subgroup of individuals who are known to be only a small percentage of the population. In this case, the researcher
can perform a “proportional split,” where they determine the cut point for categorization so that it creates group sizes that match the proportions found in the original population. For example, an investigator might want to measure schizophrenia symptoms to create groups of individuals with schizophrenia and those without schizophrenia in a normative population. In this case, the investigator would want to determine group membership using a score on this measure that is approximately equivalent to the 99th percentile instead of at the median, because this more closely matches the established prevalence rate of schizophrenia in normative populations. Another option is to create categorical variables with more than two groups. The more groups a variable has, the better able it is to replicate the patterns found in the original continuous variable (Peters & Van Voorhis, 1940). This can be particularly useful if the variable being dichotomized has a curvilinear relation with the outcome measure. A dichotomous variable only estimates the value of the outcome variable at two points, which does not allow for changes in the slope of the relation between the predictor and outcome variable. A categorical variable with three levels, however, allows for the slope between the low and middle points to be different from the slope between the middle and high points. This enables the categorical variable to appropriately represent U-shaped or inverted U-shaped relations, such as would be found when predicting the amount of agitation in dementia patients from their cognitive abilities (Burgio, Park, Hardin, & Sun, 2007), where the most agitation is found in individuals with moderate cognitive abilities. Categorical variables with even more groups will be able to represent even more complex relations. One final variant that is worth mentioning is extreme group analysis. 
In this procedure, researchers select individuals on the basis of having very high or very low scores on a variable and then only use these extreme individuals in their study (Preacher, Rucker, MacCallum, & Nicewander, 2005). Often this is done by selecting the upper and lower quarters, excluding the middle 50% of the distribution. It has been known for some time that it is possible to increase the power of tests relating a continuous independent variable to a dependent variable by recruiting people from the extreme ends of the distribution on the independent variable (Feldt, 1961). By focusing on the extreme ends of the distribution, researchers increase the differences found within their samples, which in turn enhances the observed effects. Once the extreme individuals have been selected from the distribution, it is common practice to create a categorical variable indicating whether each individual was in the high group or in the low group. While there is still variability within the upper and lower parts of the distribution following this procedure, the distribution of the extremitized variable is strongly bimodal and easily lends itself to categorization.
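
The four schemes described in this section (median split, proportional split, categorization into more than two groups, and extreme group selection) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a recommended implementation: the function names are hypothetical, ties at a cut point are sent to the lower group for the two dichotomous splits, and the three-group version uses equal-frequency cut points, which is only one of several possible choices.

```python
import numpy as np

def median_split(x):
    """Return 1 ('high') for values above the median, 0 ('low') otherwise.
    Assumption: values exactly at the median go to the low group."""
    x = np.asarray(x, dtype=float)
    return (x > np.median(x)).astype(int)

def proportional_split(x, prop_high):
    """Dichotomize so that roughly `prop_high` of cases land in the high group,
    e.g. prop_high=0.01 for a cut near the 99th percentile."""
    x = np.asarray(x, dtype=float)
    cut = np.quantile(x, 1.0 - prop_high)
    return (x > cut).astype(int)

def three_groups(x):
    """Assign 0 (low), 1 (middle), or 2 (high) using equal-frequency cut points."""
    x = np.asarray(x, dtype=float)
    cuts = np.quantile(x, [1 / 3, 2 / 3])
    return np.digitize(x, cuts)

def extreme_groups(x):
    """Keep only the lower and upper quarters of the distribution.
    Returns a boolean mask of retained cases and a 0/1 group label for them."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.quantile(x, [0.25, 0.75])
    keep = (x <= lo) | (x >= hi)
    group = (x[keep] >= hi).astype(int)
    return keep, group

scores = np.arange(1, 101)            # hypothetical scores 1..100
keep, group = extreme_groups(scores)
print(median_split(scores).sum(), proportional_split(scores, 0.01).sum(),
      keep.sum(), group.sum())
```

Note that `extreme_groups` as written selects cases after the data are in hand; it only shows how the labels are formed and says nothing about when the selection should take place.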

What Are the Problems with Artificial Categorization?

Statisticians have long considered the effects of artificially categorizing continuous variables on calculated test statistics. Soon after introducing the correlation, Karl Pearson considered ways to correct for the effects of dichotomizing continuous variables in its calculation (Pearson, 1900). Artificial categorization was originally popular as a method to save labor when making statistical calculations (Cohen, 1983). When it was still common to perform statistical analyses by hand, some researchers would choose to aggregate groups of observations to reduce the number of values that had to be entered into their computations. At the time, it was well understood that this aggregation would weaken the observed relations between variables. Peters and Van Voorhis (1940, p. 398) showed that median splits will reduce correlations between normally distributed variables by 20.2%, so that if the original correlation between two continuous variables was .500, we would expect to observe a correlation of .399 if one of the variables was dichotomized and a correlation of .318 if both of the variables were
dichotomized. Peters and Van Voorhis (1940) specifically suggested that researchers should always correct for this reduction when they artificially categorize data into six or fewer groups. Methodologists started criticizing the practice of categorizing continuous data when researchers began to use it for reasons other than ease of computation, and when authors failed to consider the effects of categorization on their results. Humphreys and Fleishman (1974) and Humphreys (1978) argued against using artificial categorization to allow researchers to analyze the influence of continuously measured personality variables using ANOVA. These authors suggested that the use of ANOVA in this circumstance misrepresents the relations among variables found in the real world, gives the illusion of experimental control to designs that lack it, and reduces the size of the observed relations. They proposed that continuously measured variables should instead be left in their original form and be investigated with correlations and regression analysis. The issues surrounding artificially categorizing naturally continuous variables were summarized and brought to the psychological literature by Cohen (1983), who concluded that the benefits of artificial categorization were not worth the costs in terms of statistical power and accuracy of the estimated relations. In addition to reducing the ability to detect relations, Maxwell and Delaney (1993) noted that the use of artificial categorization can lead to spuriously significant results when two artificially categorized predictor variables are examined together in a multifactor ANOVA. The authors mathematically showed that there will be inflated Type I error rates for the test of the interaction between the two categorized variables if they are correlated with each other and one of them is either unrelated to or has a nonlinear relation with the outcome variable. 
Vargha, Rudas, Delaney, and Maxwell (1996) further showed that the spuriously significant results discussed by Maxwell and Delaney will also occur when only one of the two predictors is artificially categorized. Taken together, these findings indicate that artificial categorization can not only reduce the power of statistical tests but can also produce falsely significant results. MacCallum, Zhang, Preacher, and Rucker (2002) presented a thorough review of the problems of artificial categorization and provided a practical explanation as to why it reduces the observed relations among variables. Compared with the original continuous measure, an artificially categorized variable is less precise because it does not allow the researcher to discriminate between differently scoring members in the same group. For example, someone who is just barely above the cutoff value on a median split is treated the same as someone who is near the maximum value on the scale. All of the information that distinguishes an observation from other members of its group is necessarily lost— essentially, everyone within the group is treated as having a value equal to the group mean. Losing this information makes it more difficult to use the dichotomized variable to predict participants’ characteristics on other measures. Tests based on dichotomized variables will therefore have less power than those performed with the original continuous measures. Effect size estimates based on dichotomized scores will also typically be smaller than those based on the original continuous measures.
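
The attenuation figures quoted earlier in this section (.500 dropping to .399 and .318) can be reproduced with a short calculation. For a normally distributed variable split at its median, the multiplier on the correlation is sqrt(2/pi), about .798, which is the 20.2% reduction. The sketch below applies that multiplier once per dichotomized variable, which is the arithmetic that yields the quoted values; it makes no attempt to derive the two-variable case from the bivariate normal distribution. It also checks the one-variable case against simulated data. The function names, sample size, and seed are illustrative assumptions.

```python
import math
import numpy as np

# Multiplier on a correlation when one normal variable is split at its median:
# normal density at 0 divided by sqrt(0.5 * 0.5), which equals sqrt(2 / pi).
ATTENUATION = math.sqrt(2.0 / math.pi)

def expected_r(r, n_dichotomized):
    """Expected correlation after median-splitting 0, 1, or 2 of the variables,
    applying the multiplier once per split (the arithmetic behind the quoted values)."""
    return r * ATTENUATION ** n_dichotomized

def simulate_one_split(r=0.5, n=200_000, seed=0):
    """Simulate normal x and y with correlation r, median-split x,
    and return the observed correlation between the split variable and y."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = r * x + math.sqrt(1.0 - r ** 2) * rng.standard_normal(n)
    x_split = (x > np.median(x)).astype(float)
    return float(np.corrcoef(x_split, y)[0, 1])

print(round(ATTENUATION, 3))          # multiplier, i.e. a 20.2% reduction
print(round(expected_r(0.5, 1), 3))   # one variable dichotomized
print(round(expected_r(0.5, 2), 3))   # multiplier applied twice
print(round(simulate_one_split(), 3)) # simulated check of the one-variable case
```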

Are There Valid Justifications for Using Artificial Categorization?

In contrast to these admonitions against artificially categorizing variables, Farrington and Loeber (2000) highlighted some potential justifications in support of this practice. They suggest that artificially categorizing variables is one way of handling variables with highly skewed distributions such as the number of times someone is arrested, where the majority of people would report never being arrested and a small minority of people would report being arrested several times. The authors also suggest that artificially categorizing variables can be beneficial when a variable is not linearly related to an outcome, such as the relation between mother’s age at childbirth and delinquency, where delinquency rates are highest among younger and older mothers. Furthermore, the authors indicate that categorizing variables
improves communication among researchers, clinicians, and policy-makers by making results more interpretable and easier to understand. Finally, they claim that the costs of artificial categorization in terms of power are relatively small, and worth the gains it provides in interpretability. Noticing that the prior warnings against artificial categorization have not curtailed its use, DeCoster, Iselin, and Gallucci (2009) contacted a number of researchers who had published articles using artificial categorization, as well as a number of leading methodologists, to obtain possible justifications for this procedure that may not have been adequately addressed in prior reviews. DeCoster et al. (2009) classified the different justifications into ten broad categories and then evaluated their validity using Monte Carlo simulations and logical arguments. They concluded that using the original continuous variable was preferable to artificial categorization in most circumstances, although there were some situations where it could be justified. Specifically, the authors claimed that artificial categorization should be viewed as acceptable when the underlying variable is naturally categorical, the observed measure has high reliability, and the relative group sizes of the dichotomized indicator match those of the underlying variable. All three of these criteria must be met - violating any of them leads to a situation where the artificial categories perform substantially worse than the original continuous variable. Artificial categorization also appears to be justified when the data have been subjected to extreme group analysis. Contradicting Farrington and Loeber (2000), DeCoster et al.’s (2009) simulations showed that neither the presence of skewed distributions nor the presence of a nonlinear relation between the predictor and the outcome variable justified the use of artificial categorization. DeCoster et al. 
(2009) considered a number of other justifications related to artificial categories being easier to use and interpret, as well as there being a history of using categories in analysis, but found strong enough evidence against these justifications to suggest that they are not acceptable reasons to artificially categorize variables. The only other justification that they found to be valid is when researchers specifically want to evaluate how well a diagnostic or artificially categorized measure performs in the field.

When discussing the use of diagnostic measures in the field, it is important to consider recent discussions and proposed changes to the assessment of psychological disorders. The forthcoming Diagnostic and Statistical Manual of Mental Disorders (DSM-V) will be recommending that dimensional measures of diagnoses be provided in addition to current categorical measures (Helzer et al., 2008). This change to more continuous descriptors of psychological functioning is driven by several reasons. First, individuals commonly meet criteria for several mental health disorders, resulting in a large number of people having mixed disorders. Empirical research, however, typically focuses narrowly on individuals who meet criteria for only one categorical diagnosis or individuals who have unspecified comorbid disorders (Regier, Narrow, Kuhl, & Kupfer, 2009). Within a strictly categorical framework, it is difficult to identify the developmental course and appropriate treatment for mixed disorders when the majority of research fails to consider their unique effects (Regier et al., 2009). Second, empirical evidence suggests that many mental disorders have naturally continuous underlying distributions, supporting the use of dimensional measures (e.g., Marcus, Siji, & Edens, 2004; Ruscio, Ruscio, & Keane, 2002). Finally, current categorical diagnoses are highly comorbid with each other, suggesting that fundamental underlying dimensions may be giving rise to this overlap (Kamphuis & Noordhof, 2009).

How Should Artificial Categorization and Their Continuous Alternatives Be Used?

While many researchers, statisticians, and methodologists have commented on the appropriateness of artificial categorization more generally, we have seen very little practical advice on the best ways to use
artificial categorization, or exactly how to use their continuous alternatives when artificial categorization is inappropriate. In this final section, we provide a number of specific suggestions related to the use of artificial categorization, including our thoughts on when researchers can use artificial categorization, how it can be justified, what continuous alternatives are available, and how the continuous alternatives should be used.

The methodological literature consistently supports the superiority of continuous measures over artificially categorized measures in most circumstances. In addition, the simulations of DeCoster et al. (2009) indicated that when the continuous measures are superior to the artificially categorized measures the difference can be very large, whereas the differences are very small in the more limited circumstances when artificially categorized measures outperform the original continuous measures. This combination seems to suggest that whenever researchers are not sure whether they should work with continuous or artificially dichotomized measures, they would be best off working with the continuous measures. This suggestion is further supported by the fact that reviewers who prefer continuous measures are much more likely to criticize the use of artificial categorizations than reviewers who prefer artificial categorizations are to criticize the use of continuous measures. Researchers should be sure that the benefits they gain by working with artificially dichotomized measures are worth the closer inspection and possible disparagement their use will bring to a manuscript.

Researchers should be especially careful about dichotomizing a continuous measure using sample-specific criteria, such as by using median splits, when samples are likely to vary on the distribution of the measure. In this case, individuals who are labeled as “high” on the measure in one sample might qualify as being “low” on the measure in other samples.
Any estimates using this measure will only be accurate for other groups possessing similar distributions on the measure. However, this does not call into question the validity of significant relations found using the dichotomized measure, since these depend on covariances rather than on specific predicted values. Researchers choosing to use artificial categorizations should expect that they will need to defend this choice with evidence that they are in one of the situations where the categorization will not strongly bias their results. Justifying the use of artificial categorization, however, requires that a researcher has a great deal of theoretical knowledge about and practical experience with the measures being used in the study. DeCoster et al. (2009) showed that dichotomization can be justified if the underlying latent variable is truly categorical, the observed continuous measure has high reliability, and the proportions of the artificially categorized variable match those found in the underlying latent variable. Researchers will rarely be able to provide evidence that the underlying latent variable is truly categorical just by examining the distribution of the observed variable: in fact, the entire field of taxometrics was developed because of this difficulty (Waller & Meehl, 1998). Instead, researchers will most commonly justify the categorical nature of the construct based on a consensus of opinion within a literature. Referencing a strong prior history with the variable is also the most common way to determine what the appropriate proportions are for each of the categories. Finally, the reliability of the measure must be such that the proportion of misclassified cases in the artificially categorized variable is approximately 10% or less. It is very difficult to establish this unless the performance of the measure has been previously compared to a gold standard, such as a clinical interview. 
It is therefore unlikely that a researcher will ever be able to justify the artificial categorization of a measure unless it has a substantial history and its properties are well understood. The choice between continuous and artificially categorized measures offers two different ways to analyze the same research question. While it can be debated exactly which method is most appropriate for any given problem, we can clearly state that it is inappropriate to analyze the data both ways and



then present whichever one gives the more favorable results. Performing multiple tests of the same hypothesis inappropriately takes advantage of the idiosyncratic patterns found in the particular data set, inflating the chances of making a Type I error (Miller, 1981). Researchers should instead decide whether the variable in question is best treated as a continuous or artificially categorized variable before they analyze the data and accept whatever results are obtained using the method they chose. One of the things that gives fuel to those who criticize artificial categorization is that the decision to use the procedure is often based on a strategic evaluation of what produces significant results instead of theoretical beliefs about what is the best way to handle the data. This makes critics suspicious of anyone who uses artificial categorization, even when the procedure is used for valid reasons. A more conceptual handling of the issue by all researchers will help those with differing views respect each other's opinions.

Whether psychologists work in a research or a clinical context will influence their decision to use continuous or artificially categorized indices. We have established that the majority of research contexts should use continuous measures; however, clinical decision-making contexts typically require the use of categorical measures. When the aim is to enhance diagnostic development, evidence must be drawn from and applied to both of these contexts. This leads to a situation where evidence obtained in one context reciprocally informs the practice in the other context. How might this look in practice? Clinicians would first determine how a given mental health diagnosis will be used in the clinical decision-making context (e.g., as exclusion or inclusion criteria). An optimal cut-point for categorizing people into normal and abnormal groups would be derived based on evidence from research that used a continuous measure of mental health symptoms.
Clinicians would use this empirically based cut-point to create the diagnostic categories upon which they would rely in clinical decision-making contexts (Kraemer et al., 2004). Evidence from the clinical uses and outcomes of these categories then feeds back to research designs, enhancing empirical evidence on the disorder. Diagnostic development is ideally a process that “spirals” toward a true diagnosis through a reciprocal relation between research and clinical decision making (Kraemer et al., 2004). In practice, however, diagnostic development is less iterative than it should be, because clinical decision-making rarely changes based on feedback from research evidence. Clinicians would be well served by paying greater attention to research findings based on continuous measures, since methodologists have shown that the models examined in these studies will be more likely to represent the true relationships surrounding the construct being studied than models based on artificially categorized measures.

It is important to distinguish the issue of artificially categorizing a continuous measure after it has been collected from limiting the number of response options in a scale as it is administered. A number of studies have shown that, under certain circumstances, scales with a smaller number of response options can produce more valid and reliable data than those with a larger number of response options (e.g., Matell & Jacoby, 1971; Miethe, 1985; Wikman & Warneryd, 1990). Researchers should feel free to use a smaller number of response options when doing so will improve the quality of the data, and should not feel the need to justify this in the same way that they would need to justify artificially categorizing a measure after the data have been collected.

Researchers need to consider how they want to analyze the data when designing their measures.
If they plan to analyze a variable in a categorical fashion, it would be better to assess the construct categorically to avoid future criticism. If they plan to analyze a variable in a continuous fashion, it would be better to assess the construct continuously. Hsu and Feldt (1969) have demonstrated that ordinal variables with five or more categories can be analyzed as continuous variables without notably influencing the error rates of the calculated statistics. Therefore, researchers interested in analyzing a measure continuously should implement it as a naturally continuous variable, the mean or sum of multiple items producing a composite with at least five different values, or individual items on ordinal rating scales with at least five different values.


Researchers supporting the use of artificial categorization (e.g., Farrington & Loeber, 2000) have often discussed how this procedure makes it easier to analyze and interpret data. However, there are many times when researchers will find it easier to analyze their data when they are treated continuously. Most of the statistics that are taught at an introductory level, including correlations, t-tests, linear regression, and ANOVA, require that the outcome variables in the analysis be continuous. Artificially categorizing the outcome variables can force people to use more complicated analyses such as logistic regression and nonparametric rank statistics, which are typically less powerful and require more effort to explain. In addition, it is typically easier for psychologists to understand continuous effect sizes (e.g., correlations) than categorical effect sizes (e.g., odds ratios). Even when an artificial categorization might simplify the interpretation of the variable itself, it may complicate the statistics that use the variable. Researchers should therefore consider all of these effects before choosing whether or not to artificially categorize a variable.

Whether researchers choose to analyze their data continuously or categorically, it is important that the language they use to present their results accurately reflects the way in which the data were analyzed. At the same time, the descriptions of the findings should be phrased in a way that reflects the relations between the abstract constructs and is not irrevocably tied to the specific analytic procedure. As an example, let us say that a group of researchers observed a significant positive relation between the amount of bullying a child received (measured continuously) and their fear of social situations.
A good way to phrase this finding would be that “children who were bullied more often tended to show greater fear in social situations.” If the researchers had instead sampled children who were bullied and those who were not bullied and chose to analyze this variable categorically, they might phrase the finding as “bullied children showed greater fear of social situations than children who were not bullied.” The general interpretation of the two findings is the same, which is as it should be: Both reflect the association of being bullied with a fear of social situations. However, the expression of the finding takes the exact evidence provided by the study (whether it was the presence of a significant correlation or a significant mean difference) into account. In addition, both phrases are written in a way that focuses on the conceptual relation instead of the analytic method, so that they would be interpretable by researchers who tend to think about their variables continuously and those who think about their variables categorically. We mentioned earlier that use of extreme group analysis provides an acceptable justification for artificial categorization. We would like to supplement this with a few words about the appropriate use of extreme group analysis. The typical reason that researchers choose to work with extreme groups is to increase the power of their tests. Restricting your focus to those at extreme ends of the distribution increases the differences between individuals on the variable, which should in turn lead to increased differences in any other variables that are related to the extremitized variable (Preacher et al., 2005). However, this advantage only appears if the selection of the extreme participants is made at recruitment. Sometimes researchers will recruit and collect data on the full range of participants, dropping those who are in the middle of the distribution when analyzing the data. 
This procedure, referred to as “post-hoc subgrouping,” is distinct from extreme group analysis and does not lead to an increase in statistical power (Alf & Abrahams, 1975). In post-hoc subgrouping, the increase in power gained by working with extreme values is counteracted by the decrease in power suffered by the loss of participants. In addition, both extreme group analysis and post-hoc subgrouping have undesirable side effects, such as distorting effect sizes, reducing scale reliabilities, and making it difficult to observe nonlinear relations between variables (Preacher et al., 2005). Post-hoc subgrouping should therefore be avoided because it does not substantially increase the power of statistical tests and yet still creates all of the problems associated with extreme group analysis.
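The difference between selecting extreme participants at recruitment and discarding the middle of an already collected sample can be illustrated with a small Monte Carlo sketch. Everything in it is an assumption made for illustration: a population correlation of .20, 100 participants completing each study, quartile-based extreme groups, population quartile cut points of ±0.6745 treated as known at screening, and a rough critical value of 2.0 in place of exact t quantiles. The function names are hypothetical.

```python
import math
import random
import statistics

CRIT = 2.0  # rough two-tailed .05 critical value for t (an approximation)


def draw_pair(rng, rho):
    """Draw one (x, y) observation from a standard bivariate normal."""
    x = rng.gauss(0, 1)
    y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    return x, y


def t_two_groups(a, b):
    """Welch t statistic comparing the means of two lists."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def t_correlation(pairs):
    """t statistic for testing a Pearson correlation against zero."""
    n = len(pairs)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    r = sxy / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def recruit_extremes(rng, rho, per_group, cut=0.6745):
    """Screen on x until per_group people beyond each assumed quartile cut are found."""
    low, high = [], []
    while len(low) < per_group or len(high) < per_group:
        x, y = draw_pair(rng, rho)
        if x < -cut and len(low) < per_group:
            low.append(y)
        elif x > cut and len(high) < per_group:
            high.append(y)
    return low, high


def estimate_power(rho=0.2, n=100, reps=2000, seed=1):
    """Proportion of significant results under three designs."""
    rng = random.Random(seed)
    hits = {"full": 0, "post_hoc": 0, "extreme": 0}
    quarter = n // 4
    for _ in range(reps):
        # Full sample of n, analyzed continuously.
        sample = sorted(draw_pair(rng, rho) for _ in range(n))
        hits["full"] += abs(t_correlation(sample)) > CRIT
        # Post-hoc subgrouping: keep only the outer quarters of that same sample.
        low = [y for _, y in sample[:quarter]]
        high = [y for _, y in sample[-quarter:]]
        hits["post_hoc"] += abs(t_two_groups(high, low)) > CRIT
        # Extreme groups chosen at recruitment: n participants, all extreme.
        low, high = recruit_extremes(rng, rho, n // 2)
        hits["extreme"] += abs(t_two_groups(high, low)) > CRIT
    return {design: count / reps for design, count in hits.items()}


power = estimate_power()
print(power)
```

Under these assumptions the recruitment-stage design should show the highest rejection rate, while the post-hoc design should not exceed the continuous analysis of the full sample. The sketch says nothing about the side effects on effect sizes and reliabilities noted above.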


Sometimes researchers will use artificial categorization to prevent outliers from having an undue influence on statistical analyses. It has been well established that such outliers can drastically change the characteristics of an estimated regression line (Neter, Kutner, Nachtsheim, & Wasserman, 1996). By definition, continuous measures can accommodate a wider range of responses than dichotomized measures, which also means that the potential impact of unusual or erroneous responses is greater for continuous measures. Some researchers therefore argue that artificial categorization should be used to prevent outliers from strongly biasing statistical tests. However, we would suggest that the most appropriate way to handle an outlier depends on why the observation is unusual (Barnett & Lewis, 1994). Sometimes outlying observations occur because of simple typographical errors, in which case the appropriate action is to change the value of the outlier. Other times, an outlier could represent an observation that is not a member of the population being studied (e.g., someone with normal IQ in a study of individuals with intellectual disabilities), in which case the appropriate action would be to remove the observation from the analysis. It is also possible that outliers may simply represent unusual individuals or events that still fit within the population of interest, in which case the researchers may keep the observation as it is, or restrict its influence by transforming, truncating, or Winsorizing the data (Ruppert, 1988). Rather than simply applying artificial categorization in which all outliers are treated in exactly the same way, researchers are better off intentionally looking for outliers and then handling each in an individualized fashion. 
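For the last of these cases, where an unusual observation is kept but its influence is restricted, Winsorizing can be sketched in a few lines. This is a minimal illustration and not a procedure taken from Ruppert (1988): the function name `winsorize`, the choice to adjust one value in each tail, and the scores are all hypothetical. It is not meant for typographical errors or for observations from outside the population, which the text above handles by correction and removal.

```python
def winsorize(values, k=1):
    """Replace the k smallest and k largest values with their nearest retained neighbours."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value untouched")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Clamp each value into [low, high] while keeping the original order.
    return [min(max(v, low), high) for v in values]


# Hypothetical scores containing one extreme but legitimate observation.
scores = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
adjusted = winsorize(scores, k=1)

print(adjusted)                       # [14, 14, 15, 15, 16, 17, 18, 19, 20, 20]
print(sum(scores) / len(scores))      # 24.1
print(sum(adjusted) / len(adjusted))  # 16.8
```

Unlike a median split, this keeps every other observation at its original value, so the ordering and spacing of the non-extreme scores are preserved.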
Although methodologists have argued about whether variables with nonlinear relations are better tested using continuous or artificially categorized variables, neither a basic ANOVA nor a basic linear regression analysis is the most appropriate procedure in these cases. Instead, researchers should apply methods that are specifically designed to handle nonlinear relations. Nonlinear relations can follow many different types of functions, including quadratic, logarithmic, exponential, and logistic. Artificially categorized variables can represent some types of nonlinear relations using polynomial contrasts (Neter et al., 1996), but more accurate estimates and more powerful tests of the underlying nonlinear functions can be obtained when the variables are treated continuously. In this case, the nonlinear relations can be examined using transformations, where a function is used to change the nonlinear relation into a linear relation (Cohen, Cohen, West, & Aiken, 2003); using polynomial regression, where quadratic, cubic, or other higher-order terms are added to a regression equation to fit a polynomial function to the data (Aiken & West, 1991; Seber & Lee, 2003); or by using generalized linear models, which can incorporate a nonlinear “link function” into the estimation of the relation between two variables (Agresti, 2002). One common reason that researchers artificially categorize their variables is to allow them to examine interaction effects. Although it takes a little more work to examine interactions between continuous variables, the procedures for doing so have been fully developed. The texts by Judd and McClelland (1989) and Aiken and West (1991) provide comprehensive treatments of how to test interaction effects in regression, including how to examine interactions between a continuous variable and a categorical variable. 
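For orientation, the kind of model these texts cover can be written out explicitly. The equation below is the standard two-predictor moderated regression model as presented in sources such as Aiken and West (1991). The symbols (X and Z for the predictors, b_0 to b_3 for the coefficients) are notation chosen here for illustration and do not come from the present article.

```latex
\begin{align}
\hat{Y} &= b_0 + b_1 X + b_2 Z + b_3 XZ \\
        &= (b_0 + b_2 Z) + (b_1 + b_3 Z)\,X
\end{align}
```

Read in the second form, the slope relating X to the outcome is b_1 + b_3 Z, so b_1 on its own is the slope of X only at Z = 0.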
Whereas most programs (such as SPSS) automatically test for interactions among categorical independent variables, interactions involving continuous predictors commonly must be set up by hand through the creation of multiplicative interaction terms. The way the variables and models are defined must also be considered carefully because the inclusion of an interaction term can change the interpretation of the main effects involved in the interaction. This can be appropriately addressed through the use of Type II sums of squares when testing effects (Driscoll & Borror, 2000) or by centering the continuous predictors before creating the interaction terms (Aiken & West, 1991).

It can be difficult to illustrate the nature of interaction effects involving continuous variables, even after they have been tested accurately. Interactions involving continuous predictors are typically examined by comparing the “simple slopes,” which are regression lines relating one of the predictors to the outcome at a particular combination of the levels of the other predictors involved in the interaction (Aiken & West, 1991). The researcher can understand the interaction by comparing the simple slopes for one variable across different levels of the other variables involved in the interaction. This is done by using the estimated regression equation to determine the predicted values at different values of the predictor variables and then graphing the results in a “simple slopes plot.” Unfortunately, most statistical software packages will not automatically provide simple slopes following a regression analysis, so they must instead be manually computed by substituting the appropriate values into the estimated regression equation. The following websites contain tools that will help researchers create simple slopes plots:
http://www.stat-help.com/spreadsheets.html
http://www.jeremydawson.co.uk/slopes.htm
http://people.ku.edu/~preacher/interact/index.html
http://www.upa.pdx.edu/IOA/newsom/macros.htm
http://www.davidakenny.net/cm/moderation.htm

One alternative researchers can use when they believe that artificially dichotomizing a variable simplifies the presentation of their results is to first analyze the data continuously and then use the means from an artificial categorization to interpret the results. All of the reported statistical analyses would be taken from the model that treated the variable continuously. However, the graphs or tables used to interpret the exact nature of the effects would be taken from models employing an artificial categorization. The significance of the effects from the models using artificial categorization is irrelevant and would never be presented. Instead, the researchers would simply explain that the model is solely used to help interpret the effects that were found when the variables were treated continuously.
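The two-step approach just described might look like the following sketch, which borrows the bullying example used earlier in this section. The data are invented for illustration, and the helper names `ols_slope` and `split_means` are hypothetical. The slope is the quantity that would be tested and reported, and the group means are computed only to help describe it.

```python
import statistics


def ols_slope(x, y):
    """Slope and intercept from a simple least-squares regression of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def split_means(x, y):
    """Mean of y at or below and above the median of x, for display purposes only."""
    cut = statistics.median(x)
    low = [b for a, b in zip(x, y) if a <= cut]
    high = [b for a, b in zip(x, y) if a > cut]
    return statistics.mean(low), statistics.mean(high)


# Hypothetical data: bullying frequency and social-fear scores for ten children.
bullying = [0, 1, 1, 2, 3, 4, 5, 6, 8, 9]
fear = [10, 12, 11, 14, 13, 17, 16, 19, 22, 21]

# Step 1: the reported analysis treats bullying continuously.
slope, intercept = ols_slope(bullying, fear)

# Step 2: median-split means are used only to illustrate the effect.
low_mean, high_mean = split_means(bullying, fear)

# Crude check that the illustration points the same way as the continuous result.
same_direction = (slope > 0) == (high_mean > low_mean)
print(round(slope, 2), low_mean, high_mean, same_direction)
```

The final comparison checks direction only; it is a starting point for comparing the two presentations, not a substitute for inspecting both models.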
One problem with this method, however, is that there will be times when the effects found when the variable is treated continuously will differ from the effects found when the variable is artificially categorized. It is therefore up to the researcher to examine the two models and make sure that the categorical presentation of the results accurately reflects the findings that are observed in the continuous model. In this article, we sought to provide an understandable and unbiased summary of the literature on artificial categorization. Our conclusions are that it is usually preferable to work with the original continuous variables, but there are some specific situations where the use of artificial categorization can be justified. Providing a convincing justification is not easy, however, and typically requires that researchers have a strong theoretical understanding and a large amount of practical experience working with the measure in question. Often the additional criticism that researchers draw from using artificial categorizations is not worth the simplification that they provide. Many of the circumstances where people commonly choose to artificially categorize their variables, such as when examining interaction effects, can still be appropriately handled while treating the variables continuously. There are also other advantages to using continuous variables, such as the ability to directly model nonlinear relations. We would disagree with methodologists who indiscriminately condemn all forms of artificial categorization, and feel that researchers should be allowed to use this procedure in some circumstances. However, we would recommend that researchers work with continuous variables by default and only consider artificial dichotomization when it provides important advantages and can be strongly justified.

References
Agresti, A. (2002). Categorical Data Analysis. Hoboken, NJ, USA: John Wiley & Sons, Inc. doi:10.1002/0471249688
Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage Publications Inc.
Alf, E. F., Jr., & Abrahams, N. M. (1975). The use of extreme groups in assessing relationships. Psychometrika, 40, 563–572. doi:10.1007/BF02291557
Barnett, V., & Lewis, T. (1994). Outliers in statistical data (3rd ed.). Chichester, England: Wiley.
Burgio, L. D., Park, N. S., Hardin, J. M., & Sun, F. (2007). A longitudinal examination of agitation and resident characteristics in the nursing home. The Gerontologist, 47, 642–649.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 247–253. doi:10.1177/014662168300700301
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
DeCoster, J., Iselin, A. R., & Gallucci, M. (2009). A conceptual and empirical examination of justifications for dichotomization. Psychological Methods, 14, 349–366. doi:10.1037/a0016956
Driscoll, M. F., & Borror, C. M. (2000). Sums of squares and expected mean squares in SAS. Quality and Reliability Engineering International, 16, 423–433. doi:10.1002/10991638(200009/10)16:53.0.CO;2-W
Farrington, D. P., & Loeber, R. (2000). Some benefits of dichotomization in psychiatric and criminological research. Criminal Behaviour and Mental Health, 10, 100–122. doi:10.1002/cbm.349
Feldt, L. S. (1961). The use of extreme groups to test for the presence of a relationship. Psychometrika, 26, 307–316. doi:10.1007/BF02289799
Fiske, S. T., & Taylor, S. E. (1991). Social Cognition (2nd ed.). New York: McGraw-Hill.
Helzer, J. E., Kraemer, H. C., Krueger, R. F., Wittchen, H-U, Sirovatka, P. J., & Regier, D. A. (2008). Dimensional Approaches in Diagnostic Classification: Refining the Research Agenda for DSM-V. American Psychiatric Publishing: Washington, D.C.
Hsu, T. C., & Feldt, L. S. (1969). The effect of limitations on the number of criterion score values on the significance level of the F-test. American Educational Research Journal, 6, 515–527.
Humphreys, L. G. (1978). Research on individual differences requires correlational analysis, not ANOVA. Intelligence, 2, 1–5. doi:10.1016/0160-2896(78)90010-7
Humphreys, L. G., & Fleishman, A. (1974). Pseudo-orthogonal and other analysis of variance designs involving individual-difference variables. Journal of Educational Psychology, 66, 464–472. doi:10.1037/h0036539
Judd, C. M., & McClelland, G. H. (1989). Data analysis: A model comparison approach. New York: Harcourt Brace Jovanovich.
Kamphuis, J. H., & Noordhof, A. (2009). On categorical diagnoses in DSM-V: Cutting dimensions at useful points? Psychological Assessment, 21, 294–301. doi:10.1037/a0016697
Kraemer, H. C., Noda, A., & O'Hara, R. (2004). Categorical versus dimensional approaches to diagnosis: Methodological challenges. Journal of Psychiatric Research, 38, 17–25. doi:10.1016/S0022-3956(03)00097-9
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40. doi:10.1037/1082-989X.7.1.19
Marcus, D. K., Siji, L. J., & Edens, J. F. (2004). A taxometric analysis of psychopathic personality. Journal of Abnormal Psychology, 113, 626–635. doi:10.1037/0021-843X.113.4.626
Matell, M. S., & Jacoby, J. (1971). Is there an optimal number of alternatives for Likert scale items? Study I: Reliability and validity. Educational and Psychological Measurement, 31, 657–674. doi:10.1177/001316447103100307
Maxwell, S. E., & Delaney, H. D. (1993). Bivariate median splits and spurious statistical significance. Psychological Bulletin, 113, 181–190. doi:10.1037/0033-2909.113.1.181
Miethe, T. D. (1985). The validity and reliability of value measurements. Journal of Psychology, 119, 441–453.
Miller, R. G. (1981). Simultaneous Statistical Inference (2nd ed.). Springer-Verlag: New York, NY.
Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996). Applied linear statistical models (4th ed.). Chicago: Irwin.
Pearson, K. (1900). Mathematical contributions to the theory of evolution: VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society of London, 195A, 1–47.
Peters, C. C., & Van Voorhis, W. R. (1940). Statistical procedures and their mathematical bases. New York: McGraw-Hill.
Preacher, K. J., Rucker, D. D., MacCallum, R. C., & Nicewander, W. A. (2005). Use of the extreme groups approach: A critical reexamination and new recommendations. Psychological Methods, 10, 178–192. doi:10.1037/1082-989X.10.2.178
Regier, D. A., Narrow, W. E., Kuhl, E. A., & Kupfer, D. J. (2009). The conceptual development of DSM-V. The American Journal of Psychiatry, 166, 645–650. doi:10.1176/appi.ajp.2009.09020279
Ruppert, D. (1988). Trimming and Winsorization. In S. Kotz, N. L. Johnson, & C. B. Read (Eds.), Encyclopedia of statistical sciences: Vol. 9 (pp. 348–353). New York: Wiley.
Ruscio, A. M., Ruscio, J., & Keane, T. M. (2002). The latent structure of post-traumatic stress disorder: A taxometric investigation of reactions to extreme stress. Journal of Abnormal Psychology, 111, 290–301. doi:10.1037/0021-843X.111.2.290
Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis. New York: Wiley.
Vargha, A., Rudas, T., Delaney, H. D., & Maxwell, S. E. (1996). Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral Statistics, 21, 264–282.
Waller, N. G., & Meehl, P. E. (1998). Multivariate taxometric procedures: Distinguishing types from continua. Thousand Oaks, CA: Sage Publications.
Wikman, A., & Warneryd, B. (1990). Measurement errors in survey questions: Explaining response variability. Social Indicators Research, 22, 199–212. doi:10.1007/BF00354840