Variable Criteria Sequential Stopping Rules for Nonparametric Tests

Note: This is a slightly altered copy of a version of a manuscript that was rejected for publication by Behavior Research Methods on 18 Jan 2010. The reviewers disagreed strongly, and the editor rejected it on the basis that it is more suitable for "a journal that is more oriented to users of nonparametric methods". There was no criticism of the methods and results. I provide it for those who do use nonparametric tests. Please read the article first to be sure a nonparametric test is right for you. Doug Fitts

Variable-Criteria Sequential Stopping Rules for Nonparametric Tests

Douglas A. Fitts

Short title: Nonparametric SSRs

Douglas A. Fitts, Ph.D.
Office of Animal Welfare and IACUC
University of Washington
Box 357160, Seattle, WA 98195
voice (206) 685-0784
fax (206) 616-5664
[email protected]

Keywords: Heterogeneous variances, Mann-Whitney-Wilcoxon rank test, Monte Carlo simulation, Null hypothesis significance test, Reduction of subjects, Replication, Kruskal-Wallis rank test, Wilcoxon signed-rank test.

Abstract

The variable-criteria sequential stopping rule (SSR) is a simple method that allows an investigator to conduct an experiment in stages, testing significance after the addition of one or more subjects at each stage, while maintaining a stable Type I error rate (observed alpha) and good power. The observed alpha is controlled by comparing the obtained p value with a pair of criteria obtained from a table for the corresponding nominal alpha, such as .05. Previously published criteria are valid for independent and dependent t tests and a one-way completely randomized ANOVA with four groups, but those tables are not optimized for p values from analogous nonparametric tests. New criteria and power curves are presented here for the Mann-Whitney-Wilcoxon rank test for two independent groups, the Wilcoxon signed-rank test for two dependent samples, and the Kruskal-Wallis rank test for four or six independent groups. The criteria are valid for one- or two-tailed tests. These nonparametric tests should not be used in place of parametric tests merely because the assumption of homogeneity of variances is violated, because heterogeneity biases the observed alpha of the nonparametric tests as much as, or more than, that of the parametric tests. Simulations indicate that the variable-criteria SSR method used with nonparametric tests can decrease the number of subjects required for significance on average, compared with a test of comparable power conducted with a fixed sample size based on a power analysis.

An intuitive and efficient approach for determining whether a research hypothesis is worth pursuing is to test for significance sequentially during the course of an experiment. Investigators conduct an experiment with a few "pilot" subjects and then test to see if the results are promising. If they are, the investigators may invest additional subjects, time, and money. If not, they can stop the experiment before excess time and resources have been expended. Sequential testing is not valid, however, if the investigators test at each stage of the experiment with the usual, desired alpha, such as .05. Doing so rapidly inflates the observed alpha, the rate of Type I errors, to unacceptable levels. Investigators who do this will incorrectly conclude that effects are significant much more frequently than they realize or believe. Sequential stopping rules (SSRs) are available that allow an investigator to conduct an experiment in stages, testing significance after the addition of one or more subjects at each stage, without causing a large increase in the Type I error rate (Botella, Ximénez, Revuelta, & Suero, 2006; Fitts, 2010; Frick, 1998; Ximénez & Revuelta, 2007). Sequential testing traces back to Wald (1947), among others. The simplest SSRs provide two values of p as criteria for stopping the experiment at each stage: a lower criterion for concluding that an effect is significant, and an upper criterion for concluding that the effect is not worth pursuing. If the p from a test is between these two criteria, the investigator may add additional subjects to the groups and test again. In some cases, the criteria are guaranteed only to hold the observed alpha below the nominal or desired alpha (Botella et al., 2006; Frick, 1998; Ximénez & Revuelta, 2007). Depending on the number of subjects added at each step (the n added), the observed alpha may deflate to .02 or lower instead of staying at .05 in a typical test. This is good in the sense that one is
assured that the Type I error rate is not increased. However, alpha is one contributor to the power of a test, and a deflation of alpha also deflates power. The variable-criteria SSR (Fitts, 2010) was designed not only to stop the inflation of alpha during sequential testing but also to hold the observed alpha in the experiment very near to the nominal (desired) alpha. That is, it prevents both inflation and deflation of alpha during sequential testing. The benefit of the variable-criteria technique is that it retains the power of the test throughout the various stages of the experiment. The drawback is that it is a bit more difficult to use, because one must look up the criteria for an experiment in a table instead of using a single pair of criteria for any sort of test. The criteria vary with the starting sample size and the n added at each step. The method also imposes an upper bound on the number of subjects that will be used. This upper bound is based on the assumption that investigators usually do not want to invest an indefinite number of subjects in the experiment. After a relatively large number of subjects have been invested, investigators often decide that an effect is either nonexistent or too weak to bother with. Setting an upper bound for sample size allows the variable-criteria SSR to maintain good power. A table of criteria was previously generated for a two-tailed t test for two independent groups, and the same criteria were determined to be valid for a dependent-samples t test, for a one-tailed t test of either type, or for a one-way ANOVA with 4 independent groups (Fitts, 2010). These criteria could be used with nonparametric tests such as the Mann-Whitney-Wilcoxon rank test for two independent groups (Mann & Whitney, 1947; Wilcoxon, 1945), the Wilcoxon signed-rank test for dependent samples (Wilcoxon, 1945), or the Kruskal-Wallis rank test for multiple independent groups
(Kruskal & Wallis, 1952). In unpublished simulations, trials using the criteria from the parametric tests with nonparametric tests held the observed alpha below the nominal value, but they also caused a serious deflation of alpha with the least powerful tests (small n or conservative alpha). For this reason, the present paper provides new power curves and tables of criteria for these three commonly used nonparametric tests. The new criteria hold the observed alpha very near the nominal alpha, with a few exceptions, for all of the models used.

Terminology throughout the article is as follows. The lower and upper criteria are the probability values that allow one to declare significance (lower criterion) or to stop the experiment because the p is too high (upper criterion). The lower and upper bounds are the sample size per group at the first iteration (lower bound) and the size beyond which the experiment will not proceed in later iterations (upper bound). The fixed stopping rule is the rule in which a fixed number of subjects is tested before stopping and analyzing the data, for better or worse. This is the way significance tests were originally designed. The variable-criteria SSR discussed here uses customized criteria for each set of lower and upper bounds and each level of n added in order to hold the observed alpha very near the nominal alpha and to maintain power. These criteria are decided upon at the beginning of the experiment, and they are obtained from power curves and a table.

Method

Simulations of 100,000 experiments each were conducted using the Mann-Whitney-Wilcoxon rank test for two independent samples, the Wilcoxon signed-rank test for two dependent measures, or the Kruskal-Wallis rank test for 4 or 6 independent
groups. The procedures were nearly identical except for differences in the experimental design. I computed criteria for each of the three nonparametric tests so that each observed alpha would be held stable near the nominal level and power would be maximal for that individual test.

Derivation of new criteria

Samples of n subjects (the lower bound) were randomly generated from normal distributions in which the point null hypothesis was true (no difference between or among means or conditions). See the section "Technical details of simulations" below. An appropriate statistical test was conducted and the resulting p value was stored in a computer file. Then one or more subjects were added to each group (n added) and the statistical test was recalculated with the augmented sample size. The resulting p value was stored in the second column in the computer file. Subjects continued to be added in equal increments until the addition of n added subjects would exceed the upper bound, and each successive p value was added as a new column in the file. For example, for the model using 4 subjects as a lower bound and 10 subjects as an upper bound (the 4/10 model), and for an n added value of 2 per group per iteration, the 4 columns contained p values when the sample sizes were 4, 6, 8, and 10 subjects. The experiment was stopped at that point because adding 2 more subjects would exceed the upper bound of 10. This exact procedure was duplicated 100,000 times so that, in the example cited, the total size of the file was 4 columns by 100,000 lines of p values. Separate files were generated for each of several values of n added for each model consisting of a lower and upper bound. In total, the models of lower and upper bounds included 3/9, 3/15, 4/10, 4/18, 5/12, 5/19, 6/12, 6/18, 7/14, 7/21, 8/24, 8/32, 9/27, 9/36, 10/30, and 10/40. The n added values
ranged from 1 to 6 in smaller models up to 1 to 10 in the largest models (see Tables 1, 2, or 3). These models and levels of n added are identical to the previously published simulations for the t test and one-way ANOVA (Fitts, 2010). Altogether, this basic database for each nonparametric statistical test consisted of 118 files for the derivation of criteria based on the null hypothesis. These files were then probed with estimated starting criteria using the nominal alpha as the lower criterion and .50 as the upper criterion. First, the lines of the file were sorted into ascending order based on the values in the first column. That is, when the value in the first column moved during sorting, all other values in the other columns on that line moved with it so that the identity of the experiment was kept intact. Then, for each row in the file, the value in the first column was compared with the lower and upper criteria. To continue with the example for the 4/10 model with an n added of 2 and a nominal alpha of .05, the starting lower and upper criteria would be .05 and .50, respectively. If the p value for an n of 4 (column 1) was less than or equal to .05, the experiment was deemed significant at the .05 level, and that line and all previous lines with lower p values were eliminated from further sorting or analysis. Experiments in which the p value was greater than .50 were stopped and not analyzed further. The remaining lines in the file, in which the p value was greater than .05 and less than or equal to .50, were deemed uncertain, and these were kept for further analysis. Of 100,000 random experiments, after analysis of the first column only, there were on average about 5,000 significant experiments (5%, alpha), 45,000 uncertain experiments (45%), and 50,000 nonsignificant experiments (50%).
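To make the counting concrete, here is a minimal C sketch of the classification step, assuming the p values are already in memory as a matrix with one row per simulated experiment and one column per stage. (The sorting described above is an efficiency device in the original programs; classifying each row independently, as here, yields the same count. All names and the toy data are illustrative, not the original code.)

```c
#include <stdio.h>

#define N_STAGES 4   /* p values per experiment, e.g., n = 4, 6, 8, 10 */

/* Estimate the empirical proportion of rejections (EPR) for one pair of
   criteria.  p[i][j] is the p value for experiment i at stage j.  An
   experiment is significant at the first stage where p <= lower; it stops
   without significance at the first stage where p > upper; otherwise it is
   "uncertain" and continues.  Ending the last stage still uncertain counts
   as not significant. */
double epr(const double p[][N_STAGES], int n_expts, double lower, double upper)
{
    long n_sig = 0;
    for (int i = 0; i < n_expts; i++) {
        for (int j = 0; j < N_STAGES; j++) {
            if (p[i][j] <= lower) { n_sig++; break; } /* reject H0, stop */
            if (p[i][j] > upper) break;               /* retain H0, stop */
            /* otherwise: add n added subjects and test again (next column) */
        }
    }
    return (double)n_sig / n_expts;
}

int main(void)
{
    /* Toy demonstration with three "experiments" (the real runs used
       100,000 rows per file). */
    const double p[3][N_STAGES] = {
        { .030, .900, .900, .900 },  /* significant at the first stage     */
        { .200, .600, .900, .900 },  /* stopped (p > upper) at stage 2     */
        { .200, .300, .120, .060 },  /* uncertain throughout: not rejected */
    };
    printf("EPR = %.3f\n", epr(p, 3, .05, .50));  /* prints 1/3 = 0.333 */
    return 0;
}
```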

The ~45,000 uncertain experiments in our example of the 4/10 model were then sorted on the p values in column 2, which represents an additional 2 subjects added to the original 4 subjects for a total of 6 subjects per group. Any p values less than or equal to .05 at this point were deemed significant at the .05 level, counted, and eliminated from further analysis. Any p values greater than .50 were deemed not significant and were eliminated from further analysis. Note that the proportions in these categories were no longer 5, 45, and 50%, because the observations are not independent of the results of column 1. The remaining uncertain experiments were then sorted on column 3 (n = 8 for the 4/10 example). This process of elimination of significant and not significant experiments, and the augmentation of n and sorting of uncertain experiments, continued until all columns had been analyzed. The empirical proportion of rejections (EPR, an estimate of alpha for these specific criteria when the null hypothesis was true) was computed as the total number of significant results in all columns divided by 100,000 (see Fitts, 2010, Table 1, for a detailed example). If this value was not within 2% of the nominal alpha (as frequently happened when starting with the nominal alpha as the lower criterion), the criteria were adjusted in the correct direction by the algorithm, the original file was re-loaded intact, and the process was repeated. The process ended when a set of criteria had been identified that produced an EPR within 2% of the nominal alpha, or else when it was observed that the EPR values were repeating above and below alpha without getting any closer. This happened far more frequently with the discrete sampling distributions of these nonparametric tests than was ever observed with the continuous distributions of
parametric tests. A cut-off of 40 tries was established to end these repeating trials, because allowing 50 tries did not perform any better.

Adjustment of criteria during searching

In the process of locating appropriate lower and upper criteria, the criteria were adjusted when the EPR was more than 2% from the nominal alpha by increasing or decreasing either the lower or the upper criterion, depending on the sign and magnitude of the error. As demonstrated previously (Fitts, 2010), large changes in the upper criterion (say, from .20 to .50) produce only fine adjustments in the EPR. If two upper criteria of .20 and .50 resulted in EPRs that bracketed the nominal alpha, the upper criterion alone was adjusted to find a pair of criteria that produced an EPR with an acceptable error. If this process produced an upper criterion that required many decimal digits, the lower criterion was adjusted slightly and the process was repeated. This often generated an acceptable answer with fewer decimal digits. If the nominal alpha in the original trial was not bracketed by the EPRs resulting from upper criteria of .20 and .50, the lower criterion was adjusted to produce a larger movement in the correct direction until the alpha was bracketed. The search included an additional constraint that the lower criterion contain a reasonable number of decimal digits. In this fashion, the search continued until a pair of criteria was identified that positioned the EPR within 2% of alpha without running to excess decimal digits, or else until the end of 40 tries.

Estimation of power and mean sample size

Once criteria were established that constrained the observed alpha near the nominal alpha when the null hypothesis was true (all population means equal), a
simulation of 100,000 tests was conducted with a new null condition to validate the criteria, and an additional 100,000 tests were conducted with each of a set of standardized effect sizes. For the Mann-Whitney-Wilcoxon test, the standardized effect d was calculated as the difference between the means divided by the pooled standard deviation (Cohen, 1988). For the Wilcoxon test, d was calculated as the mean of the difference scores divided by the standard deviation of the difference scores. For the Mann-Whitney-Wilcoxon and Wilcoxon tests, these effect sizes were 0 (null), 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, and 2.0. For the Kruskal-Wallis test, the null condition plus eight standardized effect sizes among 4 or 6 means were generated with f values ranging from .40 to .75 in increments of .05 (Cohen, 1988). For each simulation, the experiment had more than one opportunity to become significant, depending on the value of n added in the simulation. For example, in the 6/18 model with an n added value of 1, the experiment might become significant on any of 13 different iterations -- i.e., when the n per group was 6, 7, 8, or any other integer value up to 18. If the n added value was 3, the experiment could become significant with an n of 6, 9, 12, 15, or 18. If the n added value was 5, the experiment could become significant with an n of 6, 11, or 16 (the experiment would be stopped at that point because the addition of n added subjects would exceed the upper bound of 18). Whenever an experiment became significant, the simulation was stopped, and the sample size at that point in the simulation was recorded. Thus, it was possible to calculate the mean sample size for all simulations that eventually became significant. If the null hypothesis was false (population means not equal) in a simulation of a particular model, the total number of significant experiments out of 100,000 (i.e., the EPR) was an estimate of the power of
the model. The mean sample size was an estimate of the number of subjects necessary to achieve that power with the model. This power was then compared with the power of the same test for that given sample size in a simulation with the fixed stopping rule to determine which was more efficient.

Heterogeneous variances

Simulations were conducted to determine the observed alpha (rate of Type I errors) for two-tailed Mann-Whitney-Wilcoxon tests and t tests of identical design in order to examine the effects of heterogeneous variances under the fixed stopping rule and under the variable-criteria SSR procedure. For the fixed stopping rule, sample sizes of 5, 10, 20, 30, and 40 per group were used. For the variable-criteria SSR, the models 4/18, 6/18, 8/32, and 10/40 were employed, with criteria and n added as given in Table 1 for the Mann-Whitney-Wilcoxon tests and in Table 2 of Fitts (2010) for the t tests. The data for the variable-criteria SSR were averaged over all levels of n added. The ratio of the standard deviations of the two groups was 1, 2, 3, or 4; the sample sizes were always equal; the population means were identical in the sampling distribution; and each test included 100,000 repetitions.

Technical details of simulations

Programs were written in the C programming language and executed in a computing environment as previously reported (Fitts, 2010). Data were sampled using a pseudorandom number generator based on "Ran2()" (Press, Teukolsky, Vetterling, & Flannery, 1992). Normalized deviates were transformed linearly using the desired means and standard deviations to create the generated samples. All simulations were conducted 100,000 times. The four levels of experimentwise alpha were .005, .01, .05, and .10.
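A minimal sketch of this sampling step follows. The Box-Muller transform and the C library's rand() stand in for the higher-quality Ran2() generator of Press et al. (1992) purely to keep the sketch self-contained; the linear transform and the correlated-pair construction (used for the dependent-samples simulations described below, with rho = .50) follow from standard results, and all names are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* One standard-normal deviate via the Box-Muller transform.  (rand() is a
   stand-in here; the original programs used the Ran2() generator of Press
   et al., 1992.) */
static double std_normal(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* in (0,1) */
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979324 * u2);
}

/* Linear transform to the desired population mean and SD. */
static double normal(double mean, double sd)
{
    return mean + sd * std_normal();
}

/* Correlated unit-normal pair for the dependent-samples simulations:
   y = rho*x + sqrt(1 - rho^2)*z has unit variance and correlation rho
   with x, so with rho = .50 the difference scores x - y have a population
   mean of 0.0 and a standard deviation of 1.0, as stated below. */
static void normal_pair(double rho, double *x, double *y)
{
    double z = std_normal();
    *x = std_normal();
    *y = rho * *x + sqrt(1.0 - rho * rho) * z;
}

int main(void)
{
    srand(12345);
    /* A null-hypothesis sample of n = 4 (the 4/10 model's lower bound). */
    for (int i = 0; i < 4; i++)
        printf("score %d: %f\n", i + 1, normal(0.0, 1.0));

    double a, b;
    normal_pair(0.50, &a, &b);
    printf("correlated pair: %f %f\n", a, b);
    return 0;
}
```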

For dependent samples (otherwise known as one-sample, matched-sample, or correlated-sample tests) for the Wilcoxon signed-rank method, pairs of scores were sampled from two populations of scores in which the distributions were unit normal and the correlation between the scores in the population was 0.50. The difference scores for pairs of scores sampled in this fashion have a population mean of 0.0 and a population standard deviation of 1.0. The computation for all tests included the recommended corrections for tied scores if they occurred. In the Wilcoxon test, tied scores affect the sample size: because the n used in the Wilcoxon test is the number of nonzero difference scores, the elimination of zero differences reduces the sample size and may reduce the power of the test. For this reason, I recorded the number of zero difference scores in the Wilcoxon tests to be sure that I knew what the actual sample sizes were. I never found a tied score, so all of the n values represented in the models were also the actual numbers used in the Wilcoxon tests. The lack of ties is not surprising, given that the random numbers were being generated with floating-point precision.

Exact probabilities for the Mann-Whitney-Wilcoxon test and the Wilcoxon signed-rank test were calculated with floating-point precision for smaller sample sizes, and probabilities were estimated from a normal approximation with larger sample sizes (larger than 12 for the Mann-Whitney-Wilcoxon test and larger than 31 for the Wilcoxon signed-rank test). Because of the discrete nature of the underlying sampling distributions for these nonparametric tests, some changes in procedure were instituted with respect to my previous publication for parametric tests. With the parametric tests, it was relatively easy
to identify p values for criteria representing any desired value on the continuum within 1% error. With the nonparametric tests, and in particular with the Mann-Whitney-Wilcoxon and Wilcoxon signed-rank tests with very small sample sizes, it was often impossible to find criteria that yielded an estimate of alpha within 1% of the nominal alpha. After relaxing this rule to 2%, I was able to include a large number of additional pairs of criteria that would otherwise have been handled differently. Thus, for a nominal alpha of .05, the criteria were included if the observed alpha fell within the range .049 to .051 during the trial simulations. Larger errors in the dataset (some as high as 60% higher or lower) were obviously not sampling errors but characteristics of the sampling distribution itself. These were handled as follows. Simulations in which I was unable to locate a pair of criteria that produced an observed alpha within 2% of the nominal alpha fell into two categories. Some models simply did not have a nonzero solution at all within the simulation. In these cases, the criteria were deleted from the tables as impossible. The rest of the cases had nonzero observed alphas, but they were farther than 2% from the nominal alpha. If the sign of the difference was negative (the observed alpha was less than the nominal alpha), the criteria were included in the table as is, because they represented the smallest absolute error that I was able to find and the observed alpha was less than the nominal alpha. All of the rest of the cases had positive errors larger than 2% (observed alpha greater than nominal alpha), and in many cases these were much larger. In these cases, I abandoned the criteria for the nearest observed alpha and used instead the criteria that would yield an observed alpha lower than the nominal alpha, even though this negative error would be absolutely greater than the positive error. I call this process a "demotion" of the criteria.
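The selection rules of the last two paragraphs can be summarized in a short sketch. This is a reconstruction of the logic, not the original search code; the candidate values in the demonstration are invented.

```c
#include <stdio.h>
#include <math.h>
#include <stddef.h>

/* One candidate pair of stopping criteria and the EPR it produced in the
   trial simulations. */
typedef struct { double lower, upper, epr; } Candidate;

/* Pick criteria in the spirit of the rules above: accept the first
   candidate whose EPR is within 2% of the nominal alpha; failing that,
   "demote" to the candidate with the largest EPR that does not exceed the
   nominal alpha.  NULL means no usable (nonzero, not-above-nominal)
   candidate exists, i.e., the entry is deleted from the table. */
static const Candidate *select_criteria(const Candidate *c, size_t n,
                                        double nominal)
{
    const Candidate *best_below = NULL;
    for (size_t i = 0; i < n; i++) {
        if (fabs(c[i].epr - nominal) <= 0.02 * nominal)
            return &c[i];                        /* within the 2% tolerance */
        if (c[i].epr > 0.0 && c[i].epr <= nominal &&
            (best_below == NULL || c[i].epr > best_below->epr))
            best_below = &c[i];                  /* closest from below */
    }
    return best_below;                           /* demoted, or NULL */
}

int main(void)
{
    /* Invented candidates for a nominal alpha of .05: the first overshoots
       (EPR .0560), so the second is chosen by demotion. */
    const Candidate tries[] = { { .055, .40, .0560 }, { .050, .46, .0439 } };
    const Candidate *pick = select_criteria(tries, 2, .05);
    if (pick != NULL)
        printf("lower = %.3f, upper = %.2f, EPR = %.4f\n",
               pick->lower, pick->upper, pick->epr);
    return 0;
}
```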

In all, there were 118 pairs of criteria at each level of alpha for all of the models and levels of n added for each nonparametric test. For the Wilcoxon test, 15 of these pairs of criteria were demoted at the .005 level, 12 at the .01 level, 8 at the .05 level, and 9 at the .10 level. The averages of the demoted EPRs alone were .0041, .0086, .0439, and .0818 for the respective alphas. Thirteen pairs of criteria at various alphas could not be determined in the Wilcoxon dataset and had to be eliminated altogether. All but one of these were in the 3/9 and 3/15 models. These two entire models were deleted for this reason (see Results). For the Mann-Whitney-Wilcoxon test, of 118 pairs at each level of alpha, 8 were demoted at the .005 level, 10 at the .01 level, 6 at the .05 level, and 4 at the .10 level. The averages of the demoted EPRs alone were .0044, .0082, .0444, and .0720. No criteria were deleted for the Mann-Whitney-Wilcoxon test. The Kruskal-Wallis test had no positive or negative errors greater than 2% in the generation phase, and no criteria had to be demoted or deleted.

Results

The newly derived criteria for the Mann-Whitney-Wilcoxon, Wilcoxon, and Kruskal-Wallis tests are presented in Tables 1, 2, and 3, respectively. After the criteria were derived in one set of simulations in which the null hypothesis was true, a new, identical set of simulations was conducted to validate the alpha for those criteria, and these data are presented in Figure 1. The data in the figure have been averaged over 6 to 10 levels of n added for each model, and the standard deviations of these observations are included in the figure. The criteria produced excellent stability of the observed alpha very close to the nominal alpha, with few exceptions.

"Demotion" of individual criteria (see Method) led to a reduced observed alpha and larger variability in some cases with the Mann-Whitney-Wilcoxon and Wilcoxon tests. Error bars in Figure 1 are drawn symmetrically, but in reality all of the distributions with demoted criteria were forced to negative skewness. I am not certain why the 5/12 model in the Mann-Whitney-Wilcoxon test required multiple demotions, but I replicated the effect independently in the generation-of-criteria phase of modeling with nearly the same result. The 3/9 and 3/15 models were deleted from the Wilcoxon tests because many of the simulations did not yield any lower criterion between zero and the nominal alpha (see the technical details in the Method section). Observed alphas for the Kruskal-Wallis test are presented in the lower half of Figure 1. The Kruskal-Wallis criteria in Table 3 were derived from simulations with 4 groups, and the alphas shown on the lower right of the figure for the 6-group Kruskal-Wallis tests used the same criteria from Table 3. The figure demonstrates that the criteria in Table 3 can be used successfully with either 4 or 6 groups.

An interesting observation after examining the tables of criteria is that some of the smallest models use a lower criterion that is larger than the nominal alpha in order to yield an observed alpha near the nominal alpha. This seems paradoxical until one considers what is happening with these very small sample sizes. Mann-Whitney-Wilcoxon or Wilcoxon experiments beginning with a sample size of 3 or 4 may be impossible to stop with a significant effect after the first test, especially when testing at the .005 or .01 level of significance. The only options on this first test are to stop with a p value greater than the upper criterion or to add n added subjects and retest with the augmented sample size. No Type I errors accrue on this first test in the simulation, but
many tests may be stopped because of a large p. The second test with the augmented sample size is thus the first opportunity to observe a Type I error, but because it is being selected from a reduced number of tests, the observed alpha is less than the nominal alpha. The only way to compensate for this is to use a lower criterion greater than the nominal alpha. Doing so increases the number of Type I errors on the second iteration so that it approximates the nominal alpha for the entire experiment.

Power curves are presented for the Mann-Whitney-Wilcoxon test in Figure 2, for the Wilcoxon test in Figure 3, for the Kruskal-Wallis test with 4 groups in Figure 4, and for the Kruskal-Wallis test with 6 groups in Figure 5. These curves should be used to determine which model to use in an experiment. See Recommendations for Use in the Discussion section and the detailed examples in Fitts (2010).

Figure 6 demonstrates the power of the variable-criteria SSR relative to the fixed stopping rule for the Mann-Whitney-Wilcoxon test, the Wilcoxon test, and the Kruskal-Wallis test with 4 groups when tested at the .05 level of significance. The ordinate for each dot in the figure represents the average power across 6-10 levels of n added for a particular model simulated at a given level of effect size. There are 16 total models for the Mann-Whitney-Wilcoxon and Kruskal-Wallis tests and 14 total models for the Wilcoxon (models with a lower bound of 3 were omitted). There were 7 nonzero effect sizes tested with the Mann-Whitney-Wilcoxon and Wilcoxon tests and 8 nonzero effect sizes tested with the Kruskal-Wallis test. Thus, there were 112 points in the Mann-Whitney-Wilcoxon graph, 98 points in the Wilcoxon graph, and 128 points in the Kruskal-Wallis graph. The abscissa for each point was determined as follows. The mean sample size at the time of the rejection of the null hypothesis in the simulations with the
variable-criteria SSR was calculated (see Method). A single simulation of 100,000 tests was then conducted with the same effect size and sample size as was observed for the ordinate point. The number of rejections out of 100,000 tests represents an estimate of the power of the fixed stopping rule for that effect size and sample size. The line in each scatterplot represents the point of equivalent power on the ordinate and abscissa. Thus, the clustering of points above this line indicates that the variable-criteria SSR tests were more powerful than the same tests with the same average sample sizes and effect sizes when all of the data were collected and significance was tested only once (fixed stopping rule). Figure 6 illustrates the efficiency of the variable-criteria SSR, in that each subject is worth more "power points" (power per subject) when used with the SSR instead of with the fixed stopping rule. It implies that fewer subjects are required with the SSR to achieve the same power as the fixed stopping rule. For example, the average power for the variable-criteria SSR with the two-tailed Mann-Whitney-Wilcoxon test is .807 if we assume the 7/21 model, a .05 level of significance, and an effect size of 1.2 standard deviations (see Figure 2). The average number of subjects per group in the model at the time of the rejection of the null hypothesis was 10.36±0.72. Simulations of the power of the Mann-Whitney-Wilcoxon test with the fixed stopping rule yield powers of .668 with 10 per group, .729 with 11 per group, .763 with 12 per group, and .814 with 13 per group. Thus, on average in this example, the use of the variable-criteria SSR instead of the fixed stopping rule generates the same level of power, ~80%, with ~6 fewer subjects per experiment (3 per group).

It is of interest to determine whether Tables 1 and 2 for two-tailed tests can also be used with one-tailed tests, as I previously found with t tests (Fitts, 2010). An additional 100,000 simulations were conducted using the variable-criteria SSR procedure with each of the models 4/18, 6/18, 8/32, and 10/40, using a one-tailed test instead of a two-tailed test for the Mann-Whitney-Wilcoxon and Wilcoxon tests. The results are displayed in Table 4. In the simulations, the p value from the one-tailed test was compared with the lower and upper criteria in exactly the same fashion as in the two-tailed tests reported above. The empirical percentage of rejections (EPR%) in the table is the observed alpha averaged over 6-10 levels of n added, and the error in estimating the nominal alpha is given in standard deviation units. Of the 32 estimates in the table, only one was more than a standard deviation away from the nominal alpha (the .05 level for the Wilcoxon 6/18 model), and the percentage error (100*(4.932 - 5.000)/5.000) for that instance was ~1.4%. I consider this an excellent fit, and I conclude that Tables 1 and 2 can be used with either one- or two-tailed p values.

The power of a one-tailed test is greater than that of the corresponding two-tailed test, so the power curves for the two-tailed Mann-Whitney-Wilcoxon and Wilcoxon tests in Figures 2 and 3 may underestimate the power of a corresponding one-tailed test. To give an approximate measure of the power of the one-tailed tests, I included in Table 4 the effect size that produced at least 80% power on average for each model. Suppose we planned to do a one-tailed Mann-Whitney-Wilcoxon test at the 5% level of significance and needed to detect an effect as small as 1.2 standard deviations with 80% power. From the models listed in the table, we should prefer the 8/32 model instead of the 4/18 or 6/18
models, because the smaller models require effect sizes of 1.8 and 1.6 standard deviations, respectively, to generate 80% power.

The effect of heterogeneous variances on the Mann-Whitney-Wilcoxon test and a t test of equivalent design was determined for the fixed stopping rule (Figure 7) and for the variable-criteria SSR (Figure 8). The ratio of the standard deviations of the two groups ranged from 1 to 4 in each simulation. Both tests suffered an increase in the number of Type I errors compared with the nominal alpha, and the inflation was at least as bad for the Mann-Whitney-Wilcoxon test as for the t test.

In Figure 7, at a ratio of 1 (equal standard deviations), the Type I error rate was near the nominal .05 level for all except the smallest sample size for the Mann-Whitney-Wilcoxon test. For that test with a sample size of 5 per group, the nearest p value less than .05 in the sampling distribution is ~.032, which is close to the observed error rate. Increasing the ratio of the standard deviations from 1 to 4 increased the Type I error rate for the Mann-Whitney-Wilcoxon test for all sample sizes. For n = 5, the "inflation" was tempered by the low initial value. By contrast, the results with the t test demonstrated that the inflation of alpha by heterogeneous variances was greatest at the smallest sample size and declined to very near .05 with sample sizes as large as 20-40 per group. Thus, inflation of alpha was worst with large sample sizes in the Mann-Whitney-Wilcoxon test and worst with small sample sizes in the t test.

In Figure 8, at a ratio of 1 (equal standard deviations), the Type I error rate was near the nominal .05 level for all tests, as previously demonstrated. Increasing the ratio of the standard deviations from 1 to 4 progressively increased the Type I error rate for the Mann-Whitney-Wilcoxon test for all models. By contrast, the results with the t test
demonstrated that the inflation of alpha by heterogeneous variances was greatest with the smallest sample-size model (4/18) and declined progressively toward .05 with larger models.
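The t test half of this pattern is easy to reproduce. The sketch below is a simplified stand-in for the simulations behind Figures 7 and 8 (fixed stopping rule only, pooled-variance t test, and rand()/Box-Muller in place of Ran2()): it draws two equal-n groups with equal means and an SD ratio of 1 to 4 and counts rejections at the two-tailed .05 level. An exact Mann-Whitney-Wilcoxon counterpart is omitted for brevity. The hardcoded critical t values (2.306 for df = 8, 1.991 for df = 78) come from standard tables.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define REPS 100000

static double std_normal(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979324 * u2);
}

int main(void)
{
    /* Two sample sizes and their two-tailed .05 critical t (df = 2n - 2). */
    const int sizes[] = { 5, 40 };
    const double crit[] = { 2.306, 1.991 };

    srand(12345);
    for (int s = 0; s < 2; s++) {
        int n = sizes[s];
        for (int ratio = 1; ratio <= 4; ratio++) {
            long rejections = 0;
            for (long r = 0; r < REPS; r++) {
                double g1[40], g2[40], m1 = 0, m2 = 0, ss1 = 0, ss2 = 0;
                for (int i = 0; i < n; i++) {
                    g1[i] = std_normal();          /* mean 0, SD 1     */
                    g2[i] = ratio * std_normal();  /* mean 0, SD ratio */
                    m1 += g1[i]; m2 += g2[i];
                }
                m1 /= n; m2 /= n;
                for (int i = 0; i < n; i++) {
                    ss1 += (g1[i] - m1) * (g1[i] - m1);
                    ss2 += (g2[i] - m2) * (g2[i] - m2);
                }
                double sp2 = (ss1 + ss2) / (2 * n - 2);  /* pooled variance */
                double t = (m1 - m2) / sqrt(sp2 * 2.0 / n);
                if (fabs(t) >= crit[s]) rejections++;
            }
            printf("n = %2d, SD ratio %d: observed alpha = %.4f\n",
                   n, ratio, (double)rejections / REPS);
        }
    }
    return 0;
}
```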

Discussion

Previously published tables of criteria for the variable-criteria SSR cannot be used with common nonparametric tests without a deflation of alpha and a loss of power. New power curves and tables of criteria are presented here for the Mann-Whitney-Wilcoxon rank test, the Wilcoxon signed-rank test, and the Kruskal-Wallis rank test for multiple independent groups. These criteria hold alpha relatively stable for any of the tested lower bounds or levels of n added. Where no criteria could be found that produced an observed alpha within 2% of the nominal alpha, nearby criteria were chosen that produce an observed alpha less than the nominal alpha, as the current tables for the Mann-Whitney-Wilcoxon and Wilcoxon tests reflect. The criteria for the Kruskal-Wallis test were established using 4 independent groups, but they also produced satisfactory levels of alpha and power when the simulations included 6 independent groups. The tables were established using simulations of two-tailed tests, but the Mann-Whitney-Wilcoxon and Wilcoxon tests can also be used with one-tailed p values without modification. To do this, one conducts a test and obtains a one-tailed p value to compare with the lower and upper criteria.

The fixed stopping rule, which is the way most investigators were originally taught to use null hypothesis tests, is less efficient and less powerful than the variable-criteria SSR with all but the smallest sample-size models. Investigators who already do
sequential testing at the .05 level should stop that practice and switch to a valid SSR (Botella et al., 2006; Fitts, 2010; Frick, 1998; Ximénez & Revuelta, 2007). The SSR will require a few more subjects to detect significance if a difference exists, but it will not cause an unacceptable increase in Type I errors if there is not a difference. The variable-criteria SSR has been demonstrated to be slightly more powerful and efficient than other published SSRs, but all SSRs are much more powerful and efficient than the fixed stopping rule (Fitts, 2010). On average, users will save time, money, and subjects by using an SSR instead of the fixed stopping rule.

Recommendations for Use and Precautions

For an extended discussion of recommendations, please see Fitts (2010). Briefly, SSRs are appropriate for some but not all experimental circumstances that use null hypothesis significance testing. They are good when the goal is to determine whether or not a treatment has a significant effect, and when the investigator is satisfied with a yes or no answer to that question (Frick, 1998). The efficiency of the method allows an answer with fewer subjects on average than the fixed stopping rule. The nonparametric tests described in this paper should be used when the experimental circumstances meet the above recommendations, when the nonparametric test is appropriate, and when a parametric test is not appropriate. These nonparametric tests are sometimes taken to be exact surrogates for the t test or ANOVA, but users should be cautious when interpreting nonparametric tests (see below). Parametric tests should be used whenever appropriate, because they use all of the information available in the data and are more powerful. As previously demonstrated, the variable-criteria SSR yields a Type I error rate very close to the nominal alpha with no decrement in power
after a violation of the assumption of normality of the underlying distributions (Fitts, 2010), as does the basic t test (Boneau, 1960). I would suggest that users continue using parametric tests when working with moderately skewed distributions instead of resorting to nonparametric tests if that is the only violation of the assumptions. Users should always try to maintain fairly equal sample sizes. The nonparametric tests discussed here are definitely appropriate when the level of scaling is only ordinal and cannot meet the requirement of equal intervals between adjacent scores.

Nonparametric tests are sometimes used in place of parametric tests when the ratio of the variances between the groups is high, because it is a common misimpression that nonparametric tests are free from the effects of heterogeneous variances. They are not (Zimmerman, 1999, 2000). Using computer simulations to compare the Type I error rates of Mann-Whitney-Wilcoxon tests with those of t tests, and to compare Kruskal-Wallis tests with ANOVA tests, Zimmerman (2000) found that the probability of rejecting the null hypothesis with equal means and medians more than doubled when the ratio of the standard deviations increased from 1:1 to 1:8 in the Mann-Whitney-Wilcoxon and Kruskal-Wallis tests. The t test and ANOVA also suffered increases of alpha at small sample sizes, but these Type I error rates decreased as the sample sizes increased. By contrast, the elevated Type I error rates of the nonparametric tests were not reduced by increases in sample size. The author made the interesting comment that the rate of rejection of the null hypothesis with the nonparametric tests under conditions of heterogeneous variances is more a measure of the weak power of these tests to detect differences in variability than a measure of the Type I error rate.

I verified and extended Zimmerman's data for the fixed stopping rule (Figure 7). For the Mann-Whitney-Wilcoxon test, the amount of increase in the rate of Type I errors was about the same regardless of sample size as the ratio of the standard deviations increased from 1 to 4. However, unlike Zimmerman, who used "natural significance levels" of .028, .058, and .114 for small sample sizes (4 to 8 per group), I used the traditional levels of .005, .01, .05, and .10 for all sample sizes (as do published tables of the Mann-Whitney test). The expected Type I error rate at the .05 level for n = 5 in the sampling distribution is about .032, and this is close to the observed rate in Figure 7. Consequently, I did not find bad inflation of alpha above the nominal level of .05 with the smallest sample size, but this was only because the observed alpha was inflated from a reduced baseline level at the 1:1 ratio. The values in Figure 7 are the approximate error rates that would be experienced by investigators who use the Mann-Whitney-Wilcoxon tables to determine significance, or who compare obtained p values with the traditional significance levels of .005, .01, .05, or .10. There is nothing wrong with using the "natural significance levels" from the discrete sampling distribution as long as one can tolerate the increase in Type I errors. This a priori decision should be clearly stated in one's method section.

Figure 8 displays similar data for Type I errors as a function of heterogeneity of variances in the variable-criteria SSR for several sample-size models. Because all of these models hold alpha near .05 with equal variances, the data are more straightforward than for the fixed stopping rule in Figure 7. An increase in the variance ratio caused an increase in Type I errors of about the same magnitude for all sample-size models with the Mann-Whitney-Wilcoxon tests. Type I errors were also increased when using the t test
with the lowest sample-size model, but unlike with the Mann-Whitney-Wilcoxon test, the rate of Type I errors decreased with larger sample-size models. This is consistent with the findings using the fixed stopping rule, both in Figure 7 and in Zimmerman's (2000) data. One can mitigate the effects of heterogeneous variances by increasing sample size with the parametric tests but not with the nonparametric tests.

For another illustration of this problem with the Mann-Whitney-Wilcoxon test, consider that this test can generate a significant result when the means of the two groups are equal but the distributions of scores are different. For example, suppose we have two groups of 6 scores each, and we have planned a one-tailed test at the .05 level: Group A, 1, 2, 3, 4, 5, 45; and Group B, 10, 10, 10, 10, 10, 10. The mean of each group is exactly 10, so an independent-groups t test would yield a t value of 0, which is not significant. However, in the one-tailed Mann-Whitney-Wilcoxon distribution the probability of an event this extreme or more extreme is .032, which would be significant if the difference is in the correct direction for the one-tailed test. But which is the correct direction? Some have suggested that the results of the Mann-Whitney-Wilcoxon test apply to the medians rather than the means of the distributions, so a part of the answer may be that Group B has the larger median. However, from the previous paragraphs it is apparent that another part of the answer is that Group A has the larger variance. Note that the Mann-Whitney-Wilcoxon result is identical in this example whether the top score in Group A is 45 (mean of 10, same as Group B), 11 (mean of 4.3), or 585 (mean of 100). The results of the Mann-Whitney-Wilcoxon test are sensitive to several aspects of the shapes of the two distributions as well as to the central tendency, and the interpretation of the result depends on a careful examination of the distributions of the scores.
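The .032 in this example can be reproduced by brute force. The sketch below (illustrative code, not from the original paper) enumerates all C(12,6) = 924 equally likely assignments of the 12 rank positions to Group B and counts those with a rank sum at least as large as the one observed. With no ties between the groups, the six 10s of Group B occupy ranks 6-11 (average rank 8.5 each), a rank sum of 51, regardless of whether Group A's top score is 11, 45, or 585.

```c
#include <stdio.h>

/* Exact one-tailed Mann-Whitney-Wilcoxon p value for the Group A/B example:
   count the 6-subsets of the ranks 1..12 whose sum is at least the observed
   Group B rank sum of 51 (the direction "B larger than A"). */
int main(void)
{
    const int observed_rank_sum = 51;   /* ranks 6+7+8+9+10+11 */
    int extreme = 0, total = 0;

    for (int mask = 0; mask < (1 << 12); mask++) {
        int bits = 0, sum = 0;
        for (int r = 0; r < 12; r++)
            if (mask & (1 << r)) { bits++; sum += r + 1; }
        if (bits != 6) continue;        /* keep only 6-of-12 assignments */
        total++;
        if (sum >= observed_rank_sum) extreme++;
    }
    /* Prints p = 30/924 = 0.0325, the .032 quoted above. */
    printf("p = %d/%d = %.4f\n", extreme, total, (double)extreme / total);
    return 0;
}
```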

My recommendations for the use of the Mann-Whitney-Wilcoxon test with the variable-criteria SSR are: (1) If the data are at no better than an ordinal level of scaling, use the Mann-Whitney-Wilcoxon test instead of a t test; (2) If the data are of at least an interval level of scaling, give strong consideration to the t test instead of the Mann-Whitney-Wilcoxon test, because the t test with the variable-criteria SSR is not badly affected by differences in the shape of the distribution (Fitts, 2010), and because it is not affected any worse than the Mann-Whitney-Wilcoxon test by heterogeneous variances (Figure 8). Transforms may help to normalize distributions or stabilize variances. Other considerations being equal, the t test will have greater power, especially with small sample sizes. Users should always beware of the drastic distortion of alpha that results from a combination of unequal variances and unequal sample sizes. This problem can be remedied in part by equating the sample sizes as much as possible. Users contending with unequal population variances should consider using the modifications of the t test suggested by Satterthwaite (1946) and Welch (1947). The problem cannot be fixed by resorting to a nonparametric test (Zimmerman, 2000).

Effect Size and the Variable-Criteria SSR

I have observed that many investigators whose strength lies in the content of their area of biomedical or biobehavioral science rather than in statistics are baffled by the a priori use of a power analysis. A power analysis may be required by regulatory or granting agencies, such as an Institutional Animal Care and Use Committee or the National Institutes of Health, to justify the number of subjects requested. These investigators wonder how they are supposed to know the size of an effect and the standard deviation before they have conducted the study. If the study had already been
done so that these parameters were known, the investigators would not be doing the study now. However, these same investigators have little trouble suggesting that they should use a sample size of 8 per group based on 20 years of experience in the area. What they don't realize is that they have just used their 20 years of experience to conduct an informal power analysis. All they need to do is specify their level of alpha and power, and they can complete an effect-size analysis by working backward from the sample size. If 8 subjects are expected to be sufficient for significance in the planned study, one needs only to enter the desired alpha and power into an effect-size calculator (many power calculators are freely available on the World Wide Web, although they vary in accuracy) to determine that the experiment would require an effect size of ~1.5 standard deviations in order to be significant in a two-tailed t test. The only thing that is missing is for the investigators to begin thinking of their effect sizes in standard deviation units. Do they really expect their effect to be that large? Is this the smallest effect that they would consider important? If so, they have just completed a power analysis. If not, they should alter the proposed sample size.
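The working-backward step is simple enough to sketch. The normal approximation below (illustrative code, not a substitute for a real power program) gives d ≈ (z for alpha/2 + z for power) × sqrt(2/n) for a two-group, two-tailed t test; with n = 8, alpha = .05, and power = .80 it returns d ≈ 1.40, and the exact noncentral-t calculation that a power calculator performs raises this to roughly the 1.5 quoted above.

```c
#include <stdio.h>
#include <math.h>

/* Normal-approximation "effect-size analysis": the smallest standardized
   effect d detectable with the given per-group n, two-tailed alpha, and
   power, for a two-group t test.  The z values are from the standard
   normal table. */
int main(void)
{
    const double z_alpha = 1.9600;   /* z for two-tailed alpha = .05 */
    const double z_power = 0.8416;   /* z for power = .80            */
    const int n = 8;                 /* planned subjects per group   */

    double d = (z_alpha + z_power) * sqrt(2.0 / n);
    printf("detectable effect size: d = %.2f SD\n", d);   /* ~1.40 */
    return 0;
}
```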
The best way to begin thinking of effect sizes in standard deviation units is to begin reporting them in manuscripts that way. Standardized effect sizes are the raw data for meta-analyses, so these will help greatly to put one's study into the greater context later on. It is an excellent way for an investigator with extensive experience in an area to describe the anticipated size of an effect and the number of subjects required by a study to a colleague in a different area, such as a member of the IACUC.

This having been said, it is true that a small error in the prediction of an effect size in a sample-size analysis can make a fairly large difference in the number of subjects required. If one is using a fixed stopping rule, one requests a certain number of subjects for the study based on a power analysis, but if the effect size is a bit smaller than predicted, the result may not be significant. One cannot increase the sample size at that point without increasing the Type I error rate. A way to diminish this problem is to plan enough sample size to detect the smallest effect that one would consider important. However, the best way is to use an SSR from the beginning, which will be more forgiving of these errors in prediction. If the effect is not quite significant on the first iteration, it is expected that the investigator will add subjects and try again. If the effect does not exist, there is a high chance that the experiment will be stopped before many subjects and dollars are expended.

Steps for Using the Variable-Criteria SSR

(1) Conduct a power analysis and estimate the anticipated size of the effect in standardized units. Now, examine the power curves in Figures 2, 3, 4, and 5 using the correct test, the desired alpha and power, and the anticipated standardized effect size to determine the available models (combinations of lower and upper bounds) that are capable of producing the desired amount of power under the selected conditions.

(2) Select any of the models from step 1 based on the needs and constraints of the experiment. Using larger lower bounds generally provides more power. One should use a lower bound that is large enough to persuade reviewers and readers that an effect is plausible if it is reported as significant. It should also be large enough to convince oneself that the hypothesis has been given a fair trial. One will have to stop testing if the p with that number of subjects is less than the lower criterion or greater than the upper criterion.

(3) Decide how many subjects to add per iteration (n added) and look up the criteria for the selected model in the appropriate table (Tables 1, 2, or 3). You must use these same lower and upper criteria and n added throughout the experiment. (See the comment about losses to one's sample size below.)

(4) Test the number of subjects per group at the lower bound. If p is less than or equal to the lower criterion, stop testing and reject the null hypothesis at the selected experimentwise alpha. If p is greater than the upper criterion, stop testing and retain the null hypothesis. Otherwise, add n added subjects (step 3) and reanalyze with the augmented sample size. For independent-samples tests, this number will be added to each group. Repeat this procedure until you have rejected the null hypothesis, retained the null hypothesis, or reached the upper bound. If adding n added subjects to the sample size would exceed the upper bound, and if the p value is still in the "uncertain" region, one must retain the null hypothesis: There is not sufficient evidence (or power) to declare the result significant. (A code sketch of this decision rule appears after this list.)

(5) Sometimes an investigator might need to stop an experiment while the result is still in the "uncertain" region and the upper bound of sample size has not yet been attained. From the rule just stated, this result cannot be declared significant, because the p at the end of the last iteration was not less than or equal to the lower criterion. The observed alpha in this circumstance could easily be estimated from the data in the simulations, and it will always be less than the nominal alpha for the procedure. Stopping early without a significant result cannot inflate the Type I error rate, because a Type I error can be made only when the result is significant.
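A minimal sketch of the decision rule in steps (3)-(4), with the caveat that the criteria shown (.055 and .45) and the p values are invented for the walkthrough; real criteria must come from Tables 1, 2, or 3 for the chosen model and n added.

```c
#include <stdio.h>

enum ssr_result { SSR_REJECT, SSR_RETAIN, SSR_CONTINUE };

/* One stage of the variable-criteria SSR: compare the obtained p value
   with the tabled criteria; stop, or add n_added subjects per group and
   retest, never exceeding the upper bound. */
static enum ssr_result ssr_step(double p, double lower, double upper,
                                int n, int n_added, int upper_bound)
{
    if (p <= lower) return SSR_REJECT;      /* significant: stop           */
    if (p > upper)  return SSR_RETAIN;      /* not worth pursuing: stop    */
    if (n + n_added > upper_bound)
        return SSR_RETAIN;                  /* still uncertain at the end  */
    return SSR_CONTINUE;                    /* add n_added per group, retest */
}

int main(void)
{
    /* Hypothetical walk through the 4/10 model with n added = 2 and
       invented criteria (.055/.45) and p values at n = 4, 6, 8. */
    const double p[] = { .30, .12, .05 };
    int n = 4;
    for (int i = 0; i < 3; i++, n += 2) {
        enum ssr_result r = ssr_step(p[i], .055, .45, n, 2, 10);
        printf("n = %d, p = %.3f -> %s\n", n, p[i],
               r == SSR_REJECT ? "reject H0" :
               r == SSR_RETAIN ? "retain H0" : "add 2 per group");
        if (r != SSR_CONTINUE) break;
    }
    return 0;
}
```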

It is important to note that alpha is inflated by the intention of the investigator to add additional subjects when the result is not quite significant, rather than by the actual addition of subjects to an experiment when the result is not quite significant (Frick, 1998). If the investigator begins an experiment with the intention of using the variable-criteria SSR with an alpha of, say, .05, and finds that the resulting p is less than .01 after the first test at the lower bound, the result is significant at p ≤ .05 (not