Some notes on nonparametric inferences and permutation tests

MARCO MAROZZI (*)


Contents: 1. Introduction. 2. The parametric and nonparametric approach for hypothesis testing. 3. Permutation tests: theoretical simplicity and practical efficacy? 3.1. Brief history and fundamental definitions. 3.2. Features of permutation tests. 4. A recent interesting proposal on permutation testing. 5. Concluding remarks. Acknowledgments. References. Summary. Riassunto. Key words.

1. Introduction

Permutation tests are not generally endorsed very enthusiastically by either theoretical or applied statisticians, but these tests may be very useful, especially for studying complex multivariate and multiaspect problems, as is shown by a recently proposed method. Theoretical and practical features of permutation tests are presented and discussed in a non-technical way to emphasize both their theoretical rigour and their practical efficacy. These features should encourage their wide use in many fields of scientific research.

(*) Dipartimento di Scienze Statistiche dell’Università di Bologna. For correspondence: Via IV Novembre, n. 7, 40066 Pieve di Cento (BO), tel. 051/973247, e-mail: [email protected]

2. The parametric and nonparametric approach for hypothesis testing

The two main approaches to hypothesis testing are the parametric and the nonparametric approach. The parametric approach is based on specific assumptions regarding the nature of the underlying population distribution: usually one assumes that its form is known except for a few parameters. Given the right set of assumptions, test statistics are developed by means of rigorous mathematics. It is important to stress that the conclusions one may reach using these methods are exactly valid only when these assumptions are met in practice.

Within the nonparametric approach, specific distributional assumptions are replaced by very general ones. The most commonly required assumption is simply that the population is continuous. Sometimes the assumption of a symmetric population is required as well. The testing procedure is based on statistics whose distribution does not depend on the distribution function of the underlying population.

Box & Anderson (1955) suggested a two-point principle for establishing the goodness of a test: “to fulfil the needs of the experimenter, statistical criteria should (1) be sensitive to changes in the specific factors tested, (2) be insensitive to changes, of a magnitude likely to occur in practice, in extraneous factors”. The first criterion of Box & Anderson seems related to both “good power” and unbiasedness, the second one to robustness. It should be noted that a parametric test is almost always developed in order to satisfy the first point, provided that its assumptions are met in practice. A nonparametric test, instead, is inherently robust, since it requires rather mild assumptions. According to the criterion suggested by Box & Anderson, the main rule for choosing a parametric test should therefore be based on robustness, whereas power should be the principle for assessing the goodness of a nonparametric test.

An interesting paper by Ludbrook & Dudley (1998) is devoted to how statistical analyses of the results of biomedical research are carried out. The unit selection schemes of 252 comparative studies published in five important biomedical journals were reviewed.
The authors pointed out that truly random samples were used in only 11 cases out of 252, while in the remainder randomization of nonrandom samples was used to construct the experimental groups. It is worth noting that in such studies the available samples are often selection-biased samples. They also reported that the Student t test and the Snedecor F test were the procedures used for testing location shifts in 189 of the 225 studies in which non-random samples were used (see Table I). The authors underline that the median group size varies from 4 to 9 in the case of random samples and from 6 to 12 in the case of randomization of nonrandom samples. Finally, on the basis of these considerations, they suggest using permutation tests, emphasizing that these tests are valid under random sampling as well as under randomization of nonrandom samples and therefore that they are particularly useful in biomedical studies.

Table I: 236 studies published in five biomedical journals, classified with respect to the type of statistical analysis techniques employed.

                                                   test type
unit selection scheme                    t or F    rank    permutation    total
random sampling (and randomization)           8       2              1       11
randomization of nonrandom samples          189      36              0      225
total                                       197      38              1      236 (1)

Source: Ludbrook & Dudley.

In the next sections, we discuss permutation tests, presenting the aspects that should encourage their wide use in many fields of scientific research, and not only in the biomedical field.

3. Permutation tests: theoretical simplicity and practical efficacy?

3.1. Brief history and fundamental definitions

The earliest contributions to permutation testing were Fisher’s (1934, 1935) Statistical Methods for Research Workers and The Design of Experiments. The first book introduces the well-known and much used (2) exact test for 2 × 2 contingency tables, while the second one presents a test for comparing two means (namely the permutation version of the Student t test). The first theoretical foundation was developed by Pitman (1937a, 1937b, 1938). It is very interesting to note that Fisher (1935) demands the assumption of sample randomness to apply the test, while Pitman (1937a), after presenting the foundation of permutation tests for comparing two populations in the context of random sampling, claims that these tests may be used even when samples are not random (of course, units should be randomly allocated to the different treatment levels).

(1) The total in Table I is less than 252 since no hypothesis testing procedure is performed in some of the studies.
(2) It is worth observing that McKinney et al. (1989) emphasize that the majority of the studies published in six major international medical journals contain an improper application of Fisher’s exact test.

Other classical contributions are those by Lehmann and Stein (1949), in which the theory of optimal permutation tests is developed, by Hoeffding (1952) on the asymptotic power behaviour of permutation tests, and the above-mentioned paper by Box & Anderson (1955). Shortly after the publication of Pitman’s papers, a large number of rank tests were developed; the central article of Wilcoxon (1945) should be mentioned.

A permutation test is a statistical procedure for hypothesis testing in which one calculates the values that the test statistic T assumes on the observed data and on all permutations of the data in order to decide whether to accept or reject the null hypothesis. More precisely, the p-value of a permutation test is computed as the proportion of permutations whose test statistic is greater than or equal to the observed test statistic. It is then clear that permutation testing is a conditional statistical procedure, where the conditioning is with respect to the observed data set.

Considering the most frequently used kind of rank transformation, that based on natural numbers, it is easy to realize that a rank test is a permutation test applied to the ranks of the observations rather than to their original values. Except for those circumstances in which the context and the purpose of the investigation lead us to consider the ranks of statistical units rather than the observation values, rank transformation was formerly an expedient for overcoming the huge computational burden of the calculations required to perform the testing procedure. This was especially true when computing resources were scarce and expensive. The computational saving assured by rank tests is due to the fact that ranks, unlike observed values, are natural numbers.
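
As a minimal sketch of the definition given above, the following Python code enumerates all splits of the pooled data in a two-sample problem and computes the p-value as the proportion of splits whose statistic is at least as large as the observed one. The function names, the choice of the first-sample sum as the statistic T, the one-sided direction and the data are illustrative assumptions, not taken from the works cited. Applying the same function to the ranks of the pooled observations gives the corresponding rank-sum test.

```python
from itertools import combinations
from math import comb


def exact_permutation_pvalue(x, y):
    """One-sided exact p-value for the two-sample problem.

    The statistic T is the sum of the first sample; large values of T
    count as evidence against the null hypothesis.
    """
    pooled = list(x) + list(y)
    n1 = len(x)
    t_obs = sum(x)
    # Each choice of n1 pooled units for the first sample is one
    # permutation of the data (up to reordering within the samples).
    n_extreme = sum(
        1
        for idx in combinations(range(len(pooled)), n1)
        if sum(pooled[i] for i in idx) >= t_obs
    )
    return n_extreme / comb(len(pooled), n1)


def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


# Hypothetical integer-valued measurements (integers keep sums exact).
x = [121, 143, 138, 150]   # first sample
y = [102, 115, 129, 98]    # second sample

p_raw = exact_permutation_pvalue(x, y)

# Rank version: the same procedure applied to the ranks of the pooled data.
pooled_ranks = ranks(x + y)
p_rank = exact_permutation_pvalue(pooled_ranks[:len(x)], pooled_ranks[len(x):])
```

With two samples of four units there are only 70 distinct splits, so full enumeration is immediate; the number of splits grows combinatorially with the sample sizes.
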
The permutation distribution of a rank test statistic is therefore not conditional on the observed values and needs to be computed and tabulated only once for each sample size.

Since the use of permutation tests is strictly dependent on computing facilities, the history of these methods is characterized by numerous rediscoveries as newer technologies were developed. By the middle of the seventies, computers were at the disposal of many statisticians, and the development of a wide variety of computer-intensive techniques based on resampling dates back to that time. These techniques include bootstrapping, density estimation and permutation tests themselves. Technological progress has heavily affected theoretical research too. While in the fifties the study of asymptotic approximations of the permutation distribution function of the test statistic was considered very important, later on this matter became less important, partly because of the increasing interest in developing new and more efficient computational algorithms. On this aspect, classical contributions are Mehta & Patel (1983) and Pagano & Tritchler (1983). By the eighties, the great improvement in computer power gave rise to a big revival of resampling-based techniques that continues to the present.

Compared with the early monographs on nonparametric statistics (which dealt particularly with ranks), books on permutation testing are more recent. The works by Edgington (1995), Good (2000) and Pesarin (2001) should be noted. The book by Edgington deals with randomization tests, namely permutation tests based on the randomization of the data only, without taking into account whether the dataset has been randomly collected or not. The arguments are almost exclusively practical. The book by Good also deals mostly with applications. The main aim of the author is to persuade the reader that permutation tests are very useful for tackling problematic situations with missing or censored data and outliers. It should also be noted that the author lists fifty-five different fields in which these techniques have already been applied with good results and reports a bibliography of more than a thousand references. The author considers theoretical arguments only in the last chapter, which is very similar to the part on permutation testing in Lehmann (1986). Pesarin, instead, gives great importance to the presentation and discussion of the theoretical aspects. The second and very important feature of this book is that it deals with multidimensional problems, which are tackled through the Nonparametric Combination of Dependent Tests methodology, due to the same author. It is worth noting that the majority of the methods presented are new (see section 4).
We also mention the theoretical contribution of Bell & Sen (1984), and the books of Manly (1997), Sprent (1998) and Lunneborg (1999) on the practical application of permutation tests and of other resampling techniques. Finally, we would like to emphasize that Lehmann (1975, 1986) devoted an important part of his classical books “Nonparametrics” and “Testing Statistical Hypotheses” to permutation testing.

3.2. Features of permutation tests

The main characteristic of permutation tests lies in their adaptability to many different applications. These tests may be applied to continuous, ordered or categorical data, to normally or non-normally distributed data, to homogeneous or heterogeneous data, in both univariate and multivariate settings, and to single or repeated measurements. Pesarin (2001) underlines that there exist problems that can be treated only within a permutation framework. He also outlines a sort of Bayesian approach to permutation testing.

Regarding problems that have already been solved within a parametric framework, it is very often possible to consider the permutation version of the suggested test statistic. The performance of the latter is generally similar to that of the parametric test when the assumptions behind the parametric test are met; otherwise its performance may be better (see Good, 2000, and the references therein).

The permutation framework allows us to propose ad hoc statistics for the particular problem we are dealing with. We are forced neither to use the permutation version of statistics based on the classical theory nor to examine the usual alternative hypotheses. For example, it is well known that the permutation version of the F statistic offers a good performance for testing the null hypothesis of equality of the means of several populations against the generic alternative where at least two populations have different means. For the problem of testing for an ordered dose response, the F statistic is not a proper choice. In this case, Good suggests using the Pitman (1937b) correlation test, which is more sensitive to directional alternatives. As a further example of the versatility of permutation testing, Wan et al. (1997) emphasize the possibility of developing a “permutation based reference distribution for the estimate of the regression coefficient that is motivated by genetic principles rather than by standard regression procedures”.

Another quality of permutation tests is their robustness. A parametric test is exact only if the underlying distribution is that on which the test is based.
In this case, the test is often the best available test. If the underlying distribution is different, the parametric test loses these properties and its performance may be poor. In contrast, provided that the observations are exchangeable under the null hypothesis, a permutation test is always exact and unbiased against shifts in the direction of higher values (Good). Regarding the essential assumptions behind permutation testing, Good emphasizes that “contrary to statements that have appeared in several recent journal articles - we withhold the names to protect the guilty - permutation tests cannot be employed without one or both of these essential assumptions”. The assumptions at issue are those that guarantee the exchangeability of the observations under H0. In fact, if the observations are not exchangeable under H0, it does not make sense to permute them. Let us consider, for example, the problem of comparing two variances: the permutation test based on the squares of the observations is appropriate only if the two population means are known, or unknown but equal, since only in these cases are the observations exchangeable under the null hypothesis of equal variances.

Pesarin emphasizes that under a null hypothesis which states that the dataset is a random sample taken from an unknown distribution function F, the construction of a permutation test through conditioning on a set of sufficient statistics for F under H0 (like the sample itself) enables one to draw inferences that are invariant with respect to F. Therefore such permutation tests are truly distribution-free and nonparametric.

We consider the two-sample location problem in order to show some other features of permutation tests. The permutation test based on the sum of the observations of one of the two samples (a statistic that is permutationally equivalent both to the difference of the sample means and to the Student t statistic) is exact and unbiased. Asymptotically, the associated test is strongly consistent and, under mild conditions, the test statistic is normally distributed (Puri & Sen, 1971). It is well known that the power of any test depends heavily on the underlying population distribution; if the two samples are normally distributed, then the permutation test has asymptotically the same power as the Student t test (namely the best test for the problem at hand). The theory that ensures this result can be outlined as follows: given a test which is asymptotically optimal according to a certain principle, a permutation test based on the same statistic inherits its optimal properties.
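
The permutational equivalence mentioned above for the two-sample location problem can be checked numerically. The sketch below, under illustrative assumptions (hypothetical data, hypothetical function names, a one-sided alternative and the pooled-variance form of the Student t statistic), enumerates all splits of the pooled data and computes the exact p-value separately for the sum of the first sample, the difference of the sample means and the t statistic. Since the three statistics are increasing functions of one another once the pooled data are fixed, the three p-values should coincide.

```python
from itertools import combinations
from math import sqrt


def two_sample_statistics(first, second):
    """Sum of the first sample, difference of means, pooled-variance t."""
    n1, n2 = len(first), len(second)
    m1, m2 = sum(first) / n1, sum(second) / n2
    ss = sum((v - m1) ** 2 for v in first) + sum((v - m2) ** 2 for v in second)
    pooled_var = ss / (n1 + n2 - 2)
    t = (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return sum(first), m1 - m2, t


def permutation_pvalues(x, y, tol=1e-9):
    """One-sided exact p-values for the three statistics over all splits."""
    pooled = list(x) + list(y)
    n, n1 = len(pooled), len(x)
    observed = two_sample_statistics(x, y)
    counts = [0, 0, 0]
    n_splits = 0
    for idx in combinations(range(n), n1):
        chosen = set(idx)
        first = [pooled[i] for i in idx]
        second = [pooled[i] for i in range(n) if i not in chosen]
        n_splits += 1
        for k, value in enumerate(two_sample_statistics(first, second)):
            # The tolerance guards against floating-point noise in ties.
            if value >= observed[k] - tol:
                counts[k] += 1
    return [c / n_splits for c in counts]


# Hypothetical data: distinct values, so the pooled variance is never zero.
x = [121, 143, 138, 129]
y = [102, 115, 150, 98]
p_sum, p_mean_diff, p_t = permutation_pvalues(x, y)
```

In practice this means the cheapest of the three statistics, the plain sum, can be used for the computation without changing the outcome of the test.
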
It is worth noting that a permutation test has these properties not only for the particular system of hypotheses just considered, but also for more complex problems (Pesarin).

In section 2, we noted that Ludbrook & Dudley (1998) criticized the common incorrect habit of applying parametric tests to non-random samples. It is important to stress that permutation tests make full sense also within the randomization model briefly described in section 3.1, because under the null hypothesis of ineffective treatment the observations are clearly exchangeable. Parametric tests, on the contrary, act exclusively within the population model, which is based on a hypothetical population with a specific distribution and of usually infinite size from which the samples are randomly drawn. Nevertheless, the generalization of results obtained by using a permutation test on experimental groups obtained through randomization of non-random samples should be made with care. In these cases, Edgington (1995) states that inference has to be restricted to the subjects used in the experiment, while inference on other subjects has to be of a non-statistical type, that is, without a probabilistic basis. He also underlines that non-statistical inferences represent a very common scientific procedure according to which the results obtained from the observed units are generalized to all units that are similar with respect to those characteristics considered important for designing the experiment. The opinion of Ludbrook & Dudley is very similar. As far as they are concerned, one may generalize the result of the study to other experimental units (provided they are similar to those analysed), but the “arguments must be verbal rather than statistical”.

Some authors, such as Good, claim that permutation tests have many similarities with bootstrap tests, underlining that both procedures employ only the data at hand to draw inference on the null hypothesis, require minimal assumptions and are computer-intensive. However, according to other authors, there are few similarities between these methods, because permutation tests refer to the context of conditional inference, while bootstrap tests do not and ensure asymptotic properties only. According to results due principally to Romano (1989) and discussed in Pesarin, in many common situations the permutation and the bootstrap distributions of a certain test statistic are asymptotically the same. Bootstrap tests have an advantage over permutation tests: they can also be used to test null hypotheses other than that of invariance, whereas permutation tests may be used only to test null hypotheses of invariance with respect to the permutation model underlying the problem.
However, permutation tests are generally preferred because they are exact even for small samples and, being conditional on a set of sufficient statistics, they have properties which are not satisfied by bootstrap tests. In fact, bootstrap tests are, at least for small samples, neither exact nor conservative, since the probability of a type I error may be greater than the nominal significance level.

The main practical drawback of the permutation testing methodology is that, except for very small samples, the number of all possible permutations is usually impractically large. To deal with this problem, one may estimate the exact p-value by taking a random sample from all permutations. Conversely, if the number of observations is very small, it may be impossible to obtain p-values less than or equal to, e.g., five percent (as happens when the number of permutations is less than twenty). It is also worth noting that the permutation distribution is discrete, so the test can almost never attain the usual significance levels exactly, unless one defines an uncertainty zone in which the decision on H0 may be randomized.

In spite of the good features presented in this section, permutation tests do not find many supporters either among methodologists or among researchers who apply statistical methods. Our opinion is similar to that of Berger (2000), who thinks that permutation tests are not generally endorsed very enthusiastically by either theoretical or applied statisticians because their simplicity has induced many theoretical statisticians to look elsewhere for more complex challenges, while applied statisticians seem to be reluctant to use these tests, being more familiar with the classical ones.

4. A recent interesting proposal on permutation testing

Pesarin (1992) presented an interesting methodology, based on the Nonparametric Combination of Dependent Tests, for tackling complex multivariate and multiaspect problems. At the basis of this methodology there is a natural idea, that of breaking down a (complex) problem into a set of easier-to-solve sub-problems, each of which is related to a particular aspect of the original problem.

In a typical complex problem, the dataset consists of C ≥ 2 independent random samples in which the values of q ≥ 1 random variables, generally not independent, are observed. Suppose that the null hypothesis is that the C samples come from the same population, $H_0: F_1 = F_2 = \ldots = F_C$, where $F_j$ denotes the distribution function of the j-th population (j = 1, ..., C), against $H_1: F_h \neq F_j$ for some pair (h, j). Suppose that k partial aspects may be emphasized, so that $H_0$ may be broken down into k sub-hypotheses as $H_0 = \bigcap_{i=1}^{k} {}_iH_0$, that is to say that $H_0$ is true if all the ${}_iH_0$ are jointly true. Note that $H_1 = \overline{H_0} = \overline{\bigcap_{i=1}^{k} {}_iH_0} = \bigcup_{i=1}^{k} \overline{{}_iH_0}$ and that for most multivariate problems k = q and $H_0 = \bigcap_{i=1}^{k} {}_iH_0 = \bigcap_{i=1}^{q} ({}_iF_1 = {}_iF_2 = \ldots = {}_iF_C)$, where ${}_iF_j$ denotes the distribution function of the i-th random variable in the population underlying the j-th sample.

To solve the problem, as a first step, we may test each of the k partial null hypotheses. Then, if we have a method to manage jointly the results of the first step, we can solve the problem through these two steps. Pesarin provided a useful method to implement this approach.

As regards testing the partial hypothesis ${}_iH_0$, we assume that an unbiased and consistent permutation test is available. Without loss of generality, this test is assumed to be significant for large values of the test statistic. These assumptions are rather general because it is usually not difficult to find tests which satisfy them. It follows that the permutation distribution functions of the partial test statistics are stochastically larger under the alternative hypothesis ${}_iH_1 = \overline{{}_iH_0}$ than under ${}_iH_0$ and so the associated p-values are positively dependent (3).

After accomplishing the first step, by computing the values of the k partial test statistics, one can perform the final step, by combining the partial p-values, which are permutationally equivalent to the test statistic values, through an appropriate function in order to test the global null hypothesis. After pointing out that the considered global null hypothesis implies the exchangeability of the observations and thus allows the use of a permutation procedure, it is worth emphasizing that the major feature of the Nonparametric Combination of Dependent Tests is that one is not required to specify the dependence structure of the partial tests. Since the underlying dependence structure is nonparametrically captured by the combining procedure, one may concentrate on the set of partial tests (Pesarin, 2001, p. 179). This aspect is very important for non-normal or categorical variables, for which dependence relations are generally difficult to define and manage successfully. Therefore the researcher has only to make sure that the partial tests satisfy the rather general assumptions which we reported earlier. If these assumptions are satisfied, then the combined test is exact, unbiased and usually has good power behaviour.
Moreover, it can be proved, under rather general assumptions, that in certain situations the asymptotic behaviour of the combined permutation test is equivalent to that of the most powerful parametric counterpart (provided that it exists). It should also be pointed out that this method can be easily extended to study repeated multisample multivariate observations.

(3) We do not go into the details of these statements because what is argued here has the essential aim of presenting the methodology of the Nonparametric Combination of Dependent Tests, emphasizing its empirical effectiveness and its theoretical rigour, with no pretence at formal rigour.

Pesarin presents solutions to problems of great scientific interest for which a parametric solution has not yet been proposed. These include, among others, problems on multivariate categorical variables, data missing either completely at random or not, multidimensional analysis of variance with variables partly quantitative and partly categorical, and situations in which the number of units is less than the number of observed variables (as often happens with repeated measurements). Simulation results supporting the good power behaviour of the suggested combination tests are reported as well.

In our opinion, one of the most interesting examples of the good qualities of the Nonparametric Combination methodology is that of testing monotonic stochastic ordering, namely the C-sample problem concerning experiments where units are randomly assigned to C groups which are defined according to increasing levels of a treatment. When C > 2 and the group sizes are not equal, and especially in the multivariate case, no parametric solution has yet been proposed.
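
The two-step procedure of this section can be illustrated with a small Python sketch. It is only an approximation of the idea under stated assumptions, not the procedure as specified by Pesarin: it is restricted to C = 2 samples, uses the sum of each variable in the first sample as partial statistic, estimates the permutation distributions from a random sample of permutations (as suggested in section 3.2 for large problems), counts the observed data among the reference set, and combines the partial p-values with Fisher's combining function, which is one possible choice of "appropriate function". The function name `npc_test` and the data are hypothetical.

```python
import math
import random


def npc_test(samples, n_perm=499, seed=0):
    """Sketch of a two-sample nonparametric combination test.

    samples: pair of lists of units; each unit is a tuple of q variables.
    Returns the q partial p-values and the combined (global) p-value.
    """
    rng = random.Random(seed)
    first, second = samples
    pooled = list(first) + list(second)
    n1, q = len(first), len(first[0])

    def partial_stats(units):
        # Partial statistic i: sum of variable i over the first n1 units
        # (large values are taken as significant).
        return [sum(u[i] for u in units[:n1]) for i in range(q)]

    # Row 0 holds the observed statistics. The other rows come from random
    # permutations of whole units, so the dependence among the variables
    # is carried along without being modelled.
    table = [partial_stats(pooled)]
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        table.append(partial_stats(shuffled))
    total = len(table)

    def partial_pvalues(row):
        # Proportion of rows whose i-th statistic is >= the one in this row.
        return [
            sum(1 for other in table if other[i] >= row[i]) / total
            for i in range(q)
        ]

    def fisher(pvalues):
        # Fisher's combining function: large values are significant.
        return -2.0 * sum(math.log(p) for p in pvalues)

    combined = [fisher(partial_pvalues(row)) for row in table]
    p_partial = partial_pvalues(table[0])
    p_global = sum(1 for c in combined if c >= combined[0]) / total
    return p_partial, p_global


# Hypothetical bivariate data (integers keep ties between sums exact).
treated = [(14, 31), (15, 29), (13, 33), (16, 30), (17, 35), (15, 32)]
control = [(10, 25), (11, 24), (9, 27), (12, 22), (10, 26), (8, 23)]
p_partial, p_global = npc_test((treated, control), n_perm=499, seed=1)
```

The key design point is that the same permutations are used for all partial statistics, so the joint permutation distribution, and hence the dependence among the partial tests, is never specified explicitly.
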

5. Concluding remarks

We reviewed and discussed the main features of permutation tests, thanks to which we think we can answer yes to the question that acts as the title of section 3. We outlined a recently proposed method that makes it possible to transfer the good features of univariate permutation tests to complex multivariate and multiaspect problems. Despite these arguments, and despite this method, permutation tests are unfortunately still not very popular among either methodological statisticians or researchers who apply statistical methods. We hope that this work may induce many practitioners to consider the use of a permutation testing procedure for tackling their problems.

Acknowledgments

The author thanks F. Pesarin and the referee for their helpful comments.

REFERENCES

Bell, C. B. and Sen, P. K. (1984) Randomization procedures, Handbook of Statistics, 4, 1-29, Elsevier Sciences, North Holland.
Berger, V. W. (2000) Pros and cons of permutation tests in clinical trials, Statistics in Medicine, 19, 1319-1328.
Box, G. E. P. and Anderson, S. L. (1955) Permutation theory in the derivation of robust criteria and the study of departures from assumption, Journal of the Royal Statistical Society, B, 17, 1-34.
Edgington, E. S. (1995) Randomization tests, 3rd ed., Marcel Dekker, New York.
Fisher, R. A. (1934) Statistical methods for research workers, Oliver & Boyd, Edinburgh.
Fisher, R. A. (1935) The design of experiments, Oliver & Boyd, Edinburgh.
Good, P. (2000) Permutation tests, a practical guide to resampling methods for testing hypotheses, 2nd ed., Springer-Verlag, New York.
Hoeffding, W. (1952) The large-sample power of tests based on permutations of observations, Annals of Mathematical Statistics, 23, 169-192.
Lehmann, E. L. and Stein, C. (1949) On the theory of some non-parametric hypotheses, Annals of Mathematical Statistics, 20, 28-45.
Lehmann, E. L. (1975) Nonparametrics: statistical methods based on ranks, Holden Day, San Francisco.
Lehmann, E. L. (1986) Testing statistical hypotheses, 2nd ed., John Wiley, New York.
Ludbrook, J. and Dudley, H. (1998) Why permutation tests are superior to t and F tests in biomedical research, The American Statistician, 52, 127-132.
Lunneborg, C. E. (1999) Data analysis by resampling: concepts and applications, Duxbury Press, Belmont.
Manly, B. F. J. (1997) Randomization, bootstrap and Monte Carlo methods in biology, 2nd ed., Chapman & Hall, London.
McKinney, P. W., Young, M. J., Hartz, A., and Bi-Fong Lee, M. (1989) The inexact use of Fisher’s exact test in six major medical journals, Journal of the American Medical Association, 261, 3430-3433.
Mehta, C. R. and Patel, N. R. (1983) A network algorithm for performing Fisher’s exact test in r × c contingency tables, Journal of the American Statistical Association, 78, 427-434.
Pagano, M. and Tritchler, D. (1983) On obtaining permutation distributions in polynomial time, Journal of the American Statistical Association, 78, 435-441.
Pesarin, F. (1992) A resampling procedure for nonparametric combination of several dependent tests, Journal of the Italian Statistical Society, 1, 87-101.
Pesarin, F. (2001) Multivariate permutation tests with applications in biostatistics, John Wiley, Chichester.
Pitman, E. J. G. (1937a) Significance tests which may be applied to samples from any population, Journal of the Royal Statistical Society, B, 4, 119-130.
Pitman, E. J. G. (1937b) Significance tests which may be applied to samples from any population. II. The correlation coefficient, Journal of the Royal Statistical Society, B, 4, 225-232.
Pitman, E. J. G. (1938) Significance tests which may be applied to samples from any population. III. The analysis of variance test, Biometrika, 29, 322-335.
Puri, M. L. and Sen, P. K. (1971) Nonparametric methods in multivariate analysis, John Wiley, New York.
Romano, J. P. (1989) Bootstrap and randomization tests of some nonparametric hypotheses, Annals of Statistics, 17, 141-159.
Sprent, P. (1998) Data driven statistical methods, Chapman & Hall, London.
Wan, Y., Cohen, J., and Guerra, R. (1997) A permutation test for the robust sib-pair linkage method, Annals of Human Genetics, 61, 79-87.
Wilcoxon, F. (1945) Individual comparison by ranking methods, Biometrics, 1, 80-83.

Summary

This paper deals with nonparametric methods for hypothesis testing with special regard to permutation tests. After the role of parametric and nonparametric methods for hypothesis testing is debated, fundamental historical aspects of permutation tests are reviewed. Then theoretical and practical features of these tests are presented and discussed in a non-technical way, the main aim being that of emphasizing both their theoretical rigour and high practical efficacy. Such qualities also pertain to a recently proposed method for the analysis of complex multivariate and multiaspect problems. In our opinion, permutation tests do not seem to enjoy the consideration they deserve in a vast part of the scientific world.

Riassunto

This work discusses nonparametric methods for hypothesis testing, with particular reference to permutation tests. After the role of parametric and nonparametric methods for hypothesis testing has been discussed, the fundamental historical aspects of permutation tests are outlined. The theoretical and applied features of these tests are then presented and discussed in a non-formal way, with the essential aim of highlighting both their theoretical rigour and their great empirical effectiveness. These qualities also belong to a recently proposed method for dealing with complex multivariate and multiaspect problems. It is argued that permutation tests do not seem to enjoy the due and deserved consideration in a large part of the scientific world.

Key words Statistical inference; Nonparametric statistics; Resampling techniques; Permutation tests; Nonparametric combination of dependent tests.

[Manuscript received November 2001; final version received September 2002.]