Null Hypothesis Significance Testing: Effect Size Matters

Human Dimensions of Wildlife, 6:291–301, 2001
Copyright © 2001 Taylor & Francis

JEFFREY A. GLINER
Department of Occupational Therapy, Colorado State University, Fort Collins, Colorado, USA

JERRY J. VASKE
Human Dimensions in Natural Resources Unit, Colorado State University, Fort Collins, Colorado, USA

GEORGE A. MORGAN
School of Education, Colorado State University, Fort Collins, Colorado, USA

Abstract: A statistically significant outcome indicates only that some relationship between variables is likely; it does not describe the extent (strength) of that relationship. In this article, emphasis is placed on the importance of assessing the strength of the relationship between the independent and dependent variables using effect size indices. Effect size indices for the d family and r family are introduced, along with formulas for their direct and indirect computation for both the t test and the chi-square test. A subset of the variables and concepts examined in the Whittaker and Manfredo (1997) study is reported here to demonstrate why an effect size index should be computed. Statistical analyses (either t test or chi-square test) were performed on the original sample of 796 and on three smaller samples (398, 200, and 100) randomly selected from the initial sample. Effect size indices were computed for each statistical test. The results indicated that the size of the sample directly affects the t or chi-square statistic and p, but the effect size was independent of the sample size. Effect sizes should, therefore, accompany reported p values.

Keywords: effect sizes, null hypothesis significance testing

Funding for the empirical example in this paper was provided by the Alaska Department of Fish and Game and the Human Dimensions in Natural Resources Unit at Colorado State University.

Address correspondence to Jerry J. Vaske, 244 Forestry Building, Human Dimensions in Natural Resources Unit, Department of Natural Resource Recreation and Tourism, Colorado State University, Fort Collins, Colorado 80523, USA. E-mail: [email protected]


Introduction

A major problem with null hypothesis significance testing (NHST) concerns the interpretation of statistical significance. Misinterpretation occurs when one assumes that a statistically significant outcome provides information about the strength of the outcome. A statistically significant outcome indicates only that some relationship between the variables is likely; it does not tell the extent of that relationship. Therefore, in addition to information on statistical significance, it is important to state the size of the effect.

Recognition of the need to report effect sizes is apparent in a variety of disciplines (e.g., education, psychology, wildlife biology) and several recent articles (see, for example, Anderson, Burnham, & Thompson, 2000; Johnson, 1999; Kirk, 1996; Snyder & Lawson, 1993; Thompson, 1996, 1997). The Publication Manual of the American Psychological Association (APA, 1994) recommends that researchers report effect sizes, although relatively few researchers did so before 1999. Similarly, the APA Task Force on Statistical Inference stated that effect sizes should always be reported for primary results (Wilkinson & The Task Force, 1999). It is likely that most future articles will discuss the size of the effect as well as whether or not the result was statistically significant and in the predicted direction. In this article, we emphasize the importance of assessing the strength of the relationship between the independent and dependent variables using effect sizes.

Effect Size Measures

An effect size is defined as the strength of the relationship between the independent variable and the dependent variable. Effect size computations have been divided into two major types, often referred to as the d family of indices and the r family of indices (Rosenthal, 1994). The d effect size indices are expressed in standard deviation units and computed by finding the difference between the means of two groups of interest to a researcher (e.g., those who voted for a particular wildlife initiative and those who voted against the issue) and then dividing by some estimate of the standard deviation of the study. Two common d effect size indices are Glass's Δ and Hedge's g. Both are referred to as direct computations of the effect size index because they are computed directly from the means and standard deviations. Although we use Hedge's g as the d effect size indicator in this article, the formulas for both follow:

$$\text{Glass's } \Delta = \frac{\bar{X}_E - \bar{X}_C}{S_{\text{control group}}}$$

$$\text{Hedge's } g = \frac{\bar{X}_1 - \bar{X}_2}{S_{\text{pooled}}}$$
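The direct computations are straightforward to implement. The following is a minimal sketch in Python (ours, not from the original article), assuming two arrays of raw scores and the usual n − 1 sample standard deviations:

```python
import numpy as np

def glass_delta(experimental, control):
    """Glass's Delta: mean difference divided by the control group's SD."""
    return (np.mean(experimental) - np.mean(control)) / np.std(control, ddof=1)

def hedges_g(group1, group2):
    """Hedge's g: mean difference divided by the pooled SD."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    s_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / s_pooled
```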

A second general family of indices is the effect size expressed as a correlation coefficient, r (Rosenthal, 1994). Using this method, effect sizes are always less than 1.0 in absolute value, varying between –1.0 and +1.0. Researchers have debated whether it is better to express effect size as r or r² (eta or eta²). The squared versions are common because they indicate the percentage of variance in the dependent variable that can be predicted from the independent variable(s). Rosenthal (1994), however, argues that these usually small percentages give one an underestimated impression of the strength or importance of the effect. Thus, we argue that it is better to use r, eta, or R (multiple correlation).

If the original means and standard deviations are not available (e.g., as might occur in a meta-analysis based on published literature), d and r indices can be computed indirectly following, for example, a t test:

$$\text{Hedge's } g = \frac{2t}{\sqrt{N}} \quad \text{for equal sample sizes}$$

$$\text{Hedge's } g = t\sqrt{\frac{n_1 + n_2}{n_1 n_2}} \quad \text{for unequal sample sizes}$$

$$r = \sqrt{\frac{t^2}{t^2 + df}}$$
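As an illustrative sketch (again ours, not the authors'), these indirect conversions need only the t statistic and the sample sizes or degrees of freedom reported in a published study:

```python
import math

def g_from_t_equal(t, n_total):
    """Hedge's g from t when both groups are the same size (n_total = N)."""
    return 2 * t / math.sqrt(n_total)

def g_from_t_unequal(t, n1, n2):
    """Hedge's g from t for unequal group sizes."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))

def r_from_t(t, df):
    """Effect size r from t and its degrees of freedom."""
    return math.sqrt(t ** 2 / (t ** 2 + df))
```

Note that the two g formulas agree when n1 = n2 = N/2, since √((n1 + n2)/(n1n2)) then reduces to 2/√N.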

Effect size indices can be used for many different statistical tests. A common statistical procedure used with surveys is the chi-square test (χ²). As applied to the 2 × 2 contingency table, the effect size index determined from the chi-square is phi (φ):

$$\phi = \sqrt{\frac{\chi^2}{N}}$$

Phi is a product moment correlation, so as an effect size, its interpretation is the same as r (Cohen, 1988).
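A one-line sketch of the same computation, assuming the chi-square statistic and total N are already in hand:

```python
import math

def phi_from_chi_square(chi2, n):
    """Phi for a 2 x 2 table: the square root of chi-square over N."""
    return math.sqrt(chi2 / n)

# e.g., chi2 = 25.0 on N = 400 respondents gives phi = 0.25
```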

Statistical Significance and Effect Size

Table 1 provides guidelines for interpreting and roughly equating four effect size indices: g, rp, r, and φ (Cohen, 1988). The point biserial correlation (rp) is the result of an effect size, r, computed indirectly from a t test (i.e., a nominal level independent variable and a normally distributed dependent variable); it is similar to the Pearson correlation, r (see Cohen, 1988, p. 82, for details on converting between the two measures). Cohen (1988) provides research examples of small, medium, and large effects to support the suggested d and r values. Although researchers may not consider a correlation (r) of .5 to be very strong, Cohen argues that a d of .8 and an r of .5 (which he shows are mathematically similar) indicate "…grossly perceptible and therefore large differences, as does the mean difference in height between 13- and 18-year-old girls" (p. 27). Similarly, a correlation of .5 is, according to Cohen (1988, p. 81), "about as high as they come" in predictive effectiveness in applied psychology.

TABLE 1
Interpretation of Four Effect Size Indices

                              d family            r family
Effect size interpretation       g          rp        r        φ
small effect                    .2        .100       .1       .1
medium effect                   .5        .243       .3       .3
large effect                    .8        .371       .5       .5
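If the Table 1 benchmarks are treated as lower bounds for each verbal label (an interpretive convention we adopt for illustration; Cohen presents them as rough reference points, not strict cutoffs), a small helper can attach a label to a computed g:

```python
def interpret_g(g):
    """Verbal label for Hedge's g, using Cohen's (1988) benchmarks from Table 1."""
    g = abs(g)
    if g >= 0.8:
        return "large effect"
    if g >= 0.5:
        return "medium effect"
    if g >= 0.2:
        return "small effect"
    return "smaller than 'small'"
```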

Although most agree that some index of the strength of relationship should accompany the reporting of statistical tests, there is some disagreement about reporting effect sizes following analyses that are not statistically significant. Robinson and Levin (1997) introduced a two-step procedure for the evaluation and reporting of empirical results: "First convince us that a finding is not due to chance and only then, assess how impressive it is (i.e., estimate its magnitude)" (Robinson & Levin, 1997, p. 23). In other words, only report effect sizes after a statistically significant finding. The rationale for this approach is that effect sizes reported for outcomes that are not statistically significant represent chance deviations, assuming adequate power of the study.

The role of power, however, is especially important for the reporting of effect size. Consider a study that had a relatively small sample size and hence inadequate power to reject the null hypothesis. Some might suggest that such studies need to be replicated with larger sample sizes before reporting effect size. Even though these small sample studies may not find a statistically significant outcome, a case can be made for reporting the results and including an effect size. For example, if a meta-analysis were conducted on the topic, combining these small sample studies with other similar studies would help address overall sample size and power issues. What is persuasive about this argument is that if only statistically significant studies are reported in the literature, a meta-analysis performed on these topics will tend to overestimate the effect sizes. If all studies are reported in the literature (those with and those without statistically significant outcomes), the resulting effect sizes will more accurately reflect the true state of affairs.

We recommend a three-step approach for interpreting inferential statistics. First, decide whether to reject the null hypothesis. Second, if the outcome is statistically significant, determine the direction of the effect: if testing a difference between groups, state which group performed better; if testing a correlation, indicate whether the relationship is positive or negative. Finally, estimate the size of the effect.¹ The effect size should be included in the description of results; a minimal sketch of this three-step sequence follows.
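The sketch below is a hypothetical helper of ours (the function name and output format are assumptions, not the authors' code), using SciPy's independent-samples t test for a two-group comparison:

```python
from scipy import stats

def three_step_report(group1, group2, alpha=0.05):
    """Step 1: significance; Step 2: direction; Step 3: effect size."""
    t, p = stats.ttest_ind(group1, group2)
    result = {"t": t, "p": p, "significant": p < alpha}
    if result["significant"]:
        result["direction"] = "group1 higher" if t > 0 else "group2 higher"
    # The effect size is reported either way, so the study can still
    # contribute to a later meta-analysis.
    df = len(group1) + len(group2) - 2
    result["r"] = (t ** 2 / (t ** 2 + df)) ** 0.5
    return result
```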

To illustrate the relationships among statistical significance and effect size, the following example has been included.

Empirical Example

Study Area

Anchorage, Alaska (population 260,000) is a city with abundant wildlife, including moose. With relatively low hunting harvest and the existence of extensive parkland and open space, moose populations in and around Anchorage have doubled since the late 1970s to about 1,900 animals. Of these, 200 to 300 are resident year-round in the developed part of the city (the "Anchorage Bowl"), whereas an additional 500 to 700 migrate into the Bowl each winter from adjacent lands in Chugach State Park and two large military reservations (Alaska Department of Fish and Game [ADF&G], 2000).

These large moose populations provide high quality viewing opportunities, but can cause problems. An average of 156 moose were killed in moose-vehicle collisions per year from 1994–1998 (ADF&G, 2000), and damage to vehicles and associated human injuries as a result of these accidents is substantial (Thomas, 1995). Moose also damage homeowners' gardens and landscaping, may aggressively protect their young, or may block sidewalks and ski trails during the winter. Anchorage moose have stomped two people to death (in 1993 and 1995), and are estimated to injure or kill 50 to 100 dogs each year (ADF&G, 2000).

Although small controlled moose hunts have been conducted in Anchorage, hunting as a management tool has been controversial (ADF&G, 2000). A few hunts have been proposed for parks in the city, but most proposals focus on a hunt in Chugach State Park. A survey of Anchorage residents was conducted to explore public attitudes toward this potential moose hunt (Whittaker & Manfredo, 1997). Of the 1,654 individuals sent a mail questionnaire, 971, or 59%, completed and returned the surveys. Tests for nonresponse bias suggested that differences between respondents and nonrespondents were small and not significant.

Analysis Variables

A subset of the variables and concepts examined in the Whittaker and Manfredo (1997) study is reported here to demonstrate why an effect size index should be computed. The grouping (independent) variable was the respondents' reported intention to vote for or against the hunt if a referendum were held.² Results indicated that 481 would vote for the hunt, whereas 315 would vote against it, and 137 were unsure. The analyses presented here focus on those who stated an intention to vote either for (60%) or against (40%) the hunt (n = 796). To illustrate the relationship between statistical significance and effect size, different sample sizes of approximately 400, 200, and 100 were selected at random from the 796. For each of these random samples, the 60% (for) versus 40% (against) ratio was maintained; a sketch of this stratified draw follows.
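The subsampling procedure can be sketched as follows. This is our reconstruction, not the authors' code; the function name, the NumPy string-array representation of the vote variable, and the hard-coded 60/40 split are all assumptions:

```python
import numpy as np

def stratified_subsample(values, votes, size, seed=0):
    """Draw `size` respondents, preserving the 60% for / 40% against split."""
    rng = np.random.default_rng(seed)
    idx_for = np.flatnonzero(votes == "for")
    idx_against = np.flatnonzero(votes == "against")
    n_for = round(size * 0.60)
    keep = np.concatenate([
        rng.choice(idx_for, n_for, replace=False),
        rng.choice(idx_against, size - n_for, replace=False),
    ])
    return values[keep], votes[keep]
```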

A total of nine dependent variables were examined. Five of these variables were measured at the interval/ratio level. Following techniques developed by Fishbein and Ajzen (1975), attitude toward the proposed hunt was measured by combining respondents' answers to a series of questions about the possible outcomes from the hunt (see Whittaker, Manfredo, Sinnott, Miller, Fix, & Vaske, 2001, for details on how this measure was computed). Two wildlife value orientation indices (use-protection, appreciative) were computed based on the procedures outlined in Fulton, Manfredo, and Lipscomb (1996). The remaining two ratio level measures included a summated index of the reported number of negative experiences with moose (e.g., been in a vehicle that has hit a moose, been charged by a moose, had a pet injured or killed by a moose), and a single-item question that asked respondents how many years they had lived in Anchorage. Four nominal level variables were also included in the analyses: current hunting participation (yes or no), membership in hunting/fishing organizations (yes or no), membership in animal rights/welfare organizations (yes or no), and the sex of the respondent.

Statistical Significance and Effect Size Results

Statistical significance tests and effect size measures for the relationship between the respondents' behavioral intention to vote for or against the proposed moose hunt and the nine dependent variables are shown in Tables 2 and 3. Table 2 presents t statistics, and the Hedge's g and rp effect size indices, for the five interval/ratio dependent measures and for each of the four different sample sizes (i.e., n = 796, 398, 200, 100). Both g and rp were computed indirectly from the t statistic. Table 3 presents the χ², and the effect size φ, for each of the four nominal dependent variables and the four sample sizes.

Four out of five t tests with the original sample size (n = 796) yielded p values of less than .001 (Table 2). These p values imply that the probability of obtaining this outcome, assuming a true null hypothesis, is quite remote (less than 1 in 1,000); the null hypothesis is therefore unlikely to be true. According to the logic of null hypothesis significance testing, the null hypothesis should be rejected in favor of the alternative hypothesis. The p value, however, does not provide any information about the strength of the relationship between the variables. Determining the strength of the relationship necessitates examination of the effect size index.³ In this example, the effect sizes (g and rp) for "attitude toward the hunt" and the "use-protection value orientation" were large (g > .8 and rp > .371; see Table 1). Although the "appreciative value orientation" and "number of negative moose experiences" had the same small p values (< .001) for the original sample size (n = 796), the effect sizes for these measures were small to medium, and "years in Anchorage" had a small effect.
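The pattern in Tables 2 and 3 — test statistics and p values that track sample size while effect sizes stay put — is easy to reproduce with synthetic data. The sketch below uses simulated scores (not the survey data), drawing 60/40 splits of decreasing size from two populations separated by half a pooled standard deviation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (796, 398, 200, 100):
    n_for = round(n * 0.6)
    group_for = rng.normal(0.5, 1.0, n_for)          # population mean 0.5
    group_against = rng.normal(0.0, 1.0, n - n_for)  # population mean 0.0
    t, p = stats.ttest_ind(group_for, group_against)
    g = t * np.sqrt(n / (n_for * (n - n_for)))       # indirect g, unequal n
    print(f"n={n:4d}  t={t:6.2f}  p={p:.4g}  g={g:.2f}")
# t shrinks and p grows as n falls, but g hovers near the true 0.5
```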

TABLE 2
t Values and Effect Size Indices for Four Sample Sizes

Variable: Attitude toward the hunt²

Sample size    t-value    df       p value      Mean         Effect size¹   Effect size¹
                                   (2-tailed)   difference   Hedge's g      rp
796            –29.39     792      < .001
398            –19.33     260.5    < .001
200            –14.64     132      < .001
100            –11.27     98       < .001
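Because g and rp in Table 2 were computed indirectly from t, they can be approximately recovered from the reported t values. A quick check for the full-sample row, using the 481/315 voter split reported above (our computation; small discrepancies from the published values are possible):

```python
import math

t, df = -29.39, 792        # full-sample row of Table 2 (n = 796)
n1, n2 = 481, 315          # intended "for" / "against" voters
g = abs(t) * math.sqrt((n1 + n2) / (n1 * n2))
rp = math.sqrt(t ** 2 / (t ** 2 + df))
print(f"g = {g:.2f}, rp = {rp:.2f}")   # g ≈ 2.13, rp ≈ .72
# Both land well past the "large" benchmarks in Table 1, consistent
# with the text's description of "attitude toward the hunt".
```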