Journal of Modern Applied Statistical Methods
May 2003, Vol. 2, No. 1, 27-49

Copyright © 2003 JMASM, Inc.
1538 – 9472/02/$30.00

REGULAR ARTICLES

Fast Permutation Tests that Maximize Power Under Conventional Monte Carlo Sampling for Pairwise and Multiple Comparisons

J.D. Opdyke
DataMineIt
Marblehead, MA

While the distribution-free nature of permutation tests makes them the most appropriate method for hypothesis testing under a wide range of conditions, their computational demands can be runtime prohibitive, especially if samples are not very small and/or many tests must be conducted (e.g. all pairwise comparisons). This paper presents statistical code that performs continuous-data permutation tests under such conditions very quickly – often more than an order of magnitude faster than widely available commercial alternatives when many tests must be performed and some of the sample pairs contain a large sample. Also presented is an efficient method for obtaining a set of permutation samples containing no duplicates, thus maximizing the power of a pairwise permutation test under a conventional Monte Carlo approach with negligible runtime cost (well under 1% when runtimes are greatest). For multiple comparisons, the code is structured to provide an additional speed premium, making permutation-style p-value adjustments practical to use with permutation test p-values (although for relatively few comparisons at a time). "No-replacement" sampling also provides a power gain for such multiple comparisons, with similarly negligible runtime cost.

Key words: Permutation test, Monte Carlo, multiple comparisons, variance reduction, multiple testing procedures, permutation-style p-value adjustments, oversampling, no-replacement sampling

Introduction

Permutation tests are as old as modern statistics (see Fisher (1935)), and their statistical properties are well understood and thoroughly documented in the statistics literature (see Pesarin (2001) and Mielke and Berry (2001) for extensive bibliographies). Though not always as powerful as their parametric counterparts that rely on asymptotic theory, they sometimes have equal or even greater power (see Andersen and Legendre (1999) for just one example). In addition to their utility when asymptotic theory falls short (e.g. small samples and the Central Limit Theorem), permutation tests are unbiased, and when fully enumerated, they provide gratifyingly exact results. Most important, however, is that with few exceptions, valid permutation tests rely on no distributional assumptions – only the requirement that the data satisfies the condition of exchangeability (i.e. distributional invariance under the null hypothesis to permutations of the subscripts of the data points). This gives permutation tests a very broad range of application.

Until recently the major drawback of permutation tests has been their high computational demands. Even when sampling from the permutation sample space, as is typically done, rather than fully enumerating it, computer runtimes still have been prohibitive, especially if samples are not very small. Recent advances in computing speed and capacity increasingly have relaxed this constraint, but the continual development of new and computationally intensive statistical methods is easily keeping pace with such advances.

J.D. Opdyke is President of DataMineIt, a statistical data mining consultancy (jdopdyke@datamineit.com, www.datamineit.com). I owe special thanks to Geri S. Costanza, M.S., for a number of valuable insights. Any errors are my own.


For example, Westfall and Young (1993) convincingly demonstrated, under a broad range of real-world data conditions, the need for resampling-based multiple testing procedures. However, if the unadjusted p-values themselves are derived from resampling methods, such as permutation tests, the multiple comparisons p-value adjustment requires a computationally intensive nested loop, where a large number (thousands) of additional permutation tests must be performed for each original permutation test to properly adjust its p-value. Obviously, even if each permutation test requires just a few seconds, runtimes quickly become prohibitive if there are many p-values that need to be adjusted. Similarly, power estimation of tests based on resampling methods requires the same intensive nested loop structure (see Boos and Zhang (2000) for a useful computation reduction technique), while power estimation of the multiple comparisons adjustment procedure mentioned above requires an additional (third) loop. Such examples clearly demonstrate the ongoing need to develop faster code and algorithms that are also increasingly statistically efficient, since variance reduction lessens sampling requirements which, all else equal, increases speed. The goal of the methods described below is to contribute to these efforts.

Widely Available Permutation Sampling Procedures

Three procedures in SAS® v8.2 – PROC NPAR1WAY, PROC MULTTEST, and PROC PLAN – and one procedure in Cytel's Proc StatXact® v5.0 – PROC TWOSAMPL – can be used to perform two-sample nonparametric permutation tests. All but PROC PLAN sample the input dataset itself, while PROC PLAN generates a record-by-record list, each record containing a number identifying the corresponding record on the input dataset to include in the "permutation" samples. This list subsequently must be merged with the original data to obtain the corresponding data points, something PROC MULTTEST does automatically by directly generating all the "permutation" samples it uses for permutation-style p-value adjustments (these samples, however, can be used instead as the samples for the actual permutation tests). In contrast, both PROC NPAR1WAY and PROC TWOSAMPL actually conduct the permutation test and provide a p-value, whereas the samples from both PROC MULTTEST and PROC PLAN must be manipulated "by hand" to calculate the value of the test statistic associated with the original sample pair, and then compare it to all those associated with each of the "permutation" samples to obtain a p-value. Nonetheless, effective use of PROC PLAN, as shown in benchmarks in the Results section below, is much faster than these other procedures – often more than an order of magnitude faster when one of the samples is large.

The only potential problem with using PROC PLAN is that it has a sample size constraint – the product of the sum of the two sample sizes (n1 + n2) and the number of "permutation" samples being drawn (T) cannot exceed 2³¹ (about 2.1 billion, the largest representable integer in SAS) or the procedure terminates. However, this can be circumvented by inserting calls to PROC PLAN in a loop which cycles roundup((n1 + n2) × T / 2³¹) times, each loop drawing T × [roundup((n1 + n2) × T / 2³¹)]⁻¹ samples until T samples have been drawn (see code in Appendix C). This looping in and of itself does not slow execution of the procedure.

All of the abovementioned procedures can perform conventional Monte Carlo sampling without replacement within a sample, as required of all but a few stylized permutation tests, but none can avoid the possibility of drawing the same sample more than once. In other words, when drawing the sample of "permutation" samples, these procedures can only draw from the sample space of samples (conditional on the data) with replacement (WR). This problem of drawing duplicate samples, its effect on the statistical power of the permutation test, and a proposed solution that maximizes power under conventional Monte Carlo sampling for both pairwise and multiple comparisons are discussed in the Methodology section below. First, the background issues of determining the number of "permutation" samples to draw, and sampling approaches other than conventional Monte Carlo, are addressed below.
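For illustration only (this is not the paper's SAS code), the following Python sketch reproduces the looping arithmetic just described for the 2³¹ constraint; the sample sizes used in the example call are hypothetical.

# Sketch of the looping arithmetic for PROC PLAN's size constraint: (n1 + n2) * T <= 2**31,
# so the T samples are drawn over ceil((n1 + n2) * T / 2**31) calls.
import math

def plan_call_schedule(n1: int, n2: int, t: int, limit: int = 2**31):
    n_calls = math.ceil((n1 + n2) * t / limit)
    per_call = math.ceil(t / n_calls)   # last call can draw fewer so the total is exactly T
    return n_calls, per_call

print(plan_call_schedule(100, 2_000_000, 1_901))   # -> (2, 951) for these (made-up) sizes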


Determining the Number of Permutation Samples

When drawing samples from the permutation sample space, one must determine how many samples should be drawn. Obtaining an exact p-value from a permutation test via full enumeration – i.e. by generating all possible sample combinations by reshuffling the data points of the samples at hand – quickly becomes infeasible as sample sizes increase. As shown in (1), the number of possible sample combinations becomes very large even for relatively small sample sizes (two samples of 29 observations each, for example, have 30,067,266,499,541,000 possible sample combinations).

# of two-sample combinations = nCn1 = (n1 + n2)! / (n1! n2!)   (1)

where n1 = sample one's size, n2 = sample two's size, and n = n1 + n2.

Network algorithms (see Mehta and Patel (1983)) expand the sample size range over which exact p-values realistically may be obtained, but the rapid combinatorial expansion of the "permutation" sample space – defined as conditional on the data in (1) – still limits the full enumeration of continuous data samples to relatively small sample sizes.

Sampling from the permutation sample space, however, can provide an estimate of the exact p-value via a conventional Monte Carlo approach, whereby the probability of drawing any particular sample is equal to one divided by the number of possible sample combinations, as in (2) below:

Pr(S = s) = 1 / nCn1   (2)

(Note that permutations of the same sample do not affect this probability.) A (one-sided) permutation test p-value is simply the proportion of test statistic values, each corresponding to a "permutation" sample, at least as large as that based on the observed data samples; therefore, the estimated p-value based on conventional Monte Carlo sampling is simply an estimated proportion distributed binomially. The normal approximation to the binomial distribution allows one easily to obtain specified levels of precision for this estimate, based either on the standard error (se) or the coefficient of variation (cv), as a function of T = the number of samples drawn. This is done by straightforward solutions of (3) and (4) respectively (see Brown et al. (2001) for descriptions of the "Agresti-Coull" and "Wilson" intervals – superior, if slightly more complex, alternatives to the commonly used Wald approximation shown in (3)).

se ≈ sqrt[ p-value (1 − p-value) / T ], and 95% ci ≈ p-value ± (1.96 × se)   (3)

cv = se / p-value   (4)

For example, requiring cv = 0.10 for a p-value of 0.05 gives sqrt[ 0.05 (1 − 0.05) / T ] / 0.05 = 0.10 ⇒ T = 1,900, and for cv < 0.10, T = 1,901.
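A small Python sketch (not part of the original paper's code) that reproduces this example from (3)-(4) under the Wald approximation:

# Coefficient of variation of the Monte Carlo p-value estimate, per (3)-(4).
import math

def cv_for(t: int, p: float) -> float:
    """cv = se / p-value, with se = sqrt(p(1-p)/T)."""
    return math.sqrt(p * (1 - p) / t) / p

# At p-value = 0.05, T = 1,900 gives cv = 0.10, so cv < 0.10 requires T = 1,901.
# (Equivalently, solving cv = 0.10 for T gives T = p(1-p)/(cv*p)^2 = 1,900.)
print(cv_for(1_900, 0.05), cv_for(1_901, 0.05))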

When the achieved significance level (asl) falls below α (the Monte Carlo error), the adjusted critical value for NR sampling will be larger than that of WR sampling ((5.10) – (5.13)). This gives permutation tests based on NR sampling greater power:

⇒ σ²NR < σ²WR and c*α,NR > c*α,WR   (5.4)

⇒ powerNR > powerWR   (5.5)

σ²bin = np p(1 − p)   (5.6)

σ²hyp = np p(1 − p)(nCn1 − np) / (nCn1 − 1)   (5.7)

where np = number of permutation samples drawn,

aslWR = Pr(S ≤ npα) = (1/np) Σ_{i=0 to np} Σ_{k=0 to npα} (np choose k) (i/np)^k (1 − i/np)^(np − k)   (5.8)

aslNR = Pr(S ≤ npα) = (1/N) Σ_{S=0 to N} Σ_{k=0 to npα} (S choose k)(N − S choose np − k) / (N choose np)   (5.9)

where S = number of "successes" (number of "permutation" sample test statistic values ≥ observed sample test statistic value) among the np permutation samples drawn, N/np is an integer, and c*α = the critical value adjusted for Monte Carlo error.
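For illustration only, the Python sketch below contrasts the WR (binomial) and NR (hypergeometric) variances in (5.6)-(5.7); the values of np and p are made up, while C = 1,028,790 corresponds to the paper's n1 = 4, n2 = 68 example.

# Binomial vs. hypergeometric variance of the Monte Carlo p-value count, per (5.6)-(5.7).
from math import comb

def var_binomial(n_p: int, p: float) -> float:
    return n_p * p * (1 - p)                        # (5.6)

def var_hypergeometric(n_p: int, p: float, c: int) -> float:
    return n_p * p * (1 - p) * (c - n_p) / (c - 1)  # (5.7), finite-population correction

n_p, p, c = 1_901, 0.05, comb(72, 4)                # C = nCn1 = 1,028,790
print(var_binomial(n_p, p), var_hypergeometric(n_p, p, c))
# The hypergeometric (NR) variance is smaller by the factor (C - n_p)/(C - 1), about 0.998 here.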

(Note that above, the critical p-value of the test is adjusted, rather than the p-values themselves, solely for heuristic and computational purposes when demonstrating the power differential between NR and WR sampling in (5.1)-(5.5). In practice, it is the p-values themselves which should be adjusted for ease of interpretation of the test results. Both adjustments yield identical results statistically.)

The discreteness of both the binomial and hypergeometric distributions prevents the attainment of adjusted critical p-values yielding asl = α exactly. However, interpolation between α and the largest attainable p-value yielding asl < α provides the Monte Carlo-error adjusted critical value.

a)

Pr( min_{1≤j≤k} Pj,NR ≤ pi | H0^C ) < Pr( min_{1≤j≤k} Pj,WR ≤ pi | H0^C )   (7.2)

⇒ p̃i,NRa < p̃i,WRa   (7.3)

⇒ powerNRa > powerWRa   (7.4)

where

pi = original p-value,
p*j = data-based p-value vector of j p-values,
Pj = joint random variable of j p-values,
H0^C = the complete null hypothesis, i.e. assuming that all null hypotheses included in the family of multiple comparisons are true, and
p̃i,NR = the adjusted p-value of pi.

b) Another source of power gain from NR sampling is the smaller p-values of the original permutation tests themselves, after adjustment for Monte Carlo error as described in the previous section. Assume that none of the "simulated" p-values in each vector are generated using NR sampling, but that the original p-values are generated, and then Monte Carlo-error adjusted, using NR sampling instead of WR sampling. Because the p-values of the former are smaller (8.1), the probability of the same minimum p-value being less than or equal to the original p-value is smaller for NR sampling (8.2). This means the corresponding numerator (the count) of the adjusted p-value will be smaller, and thus so will the adjusted p-value, giving greater power:

⇒ powerNRb > powerWRb   (8.4)

Therefore, to maximize NR sampling power gains when using permutation-style p-value adjustments in multiple comparisons of permutation test p-values, combine both a) and b) – use NR sampling to generate both the original Monte Carlo-error adjusted p-values, as well as the "simulated" p-value vectors when making the multiple comparisons adjustment ((9.1) – (9.3)).

Pr( min_{1≤j≤k} pj,NR ≤ pi,NR | H0^C ) < Pr( min_{1≤j≤k} pj,WR ≤ pi,WR | H0^C )   (9.1)

⇒ p̃i,NR < p̃i,WR   (9.2)

⇒ powerNR > powerWR   (9.3)

The same rationale applies to stepwise multiple comparisons adjustments. Whenever NR sampling is used to generate either or both the minimum p-value and the original Monte Carlo error-adjusted p-values, its variance reduction will yield greater power (these derivations, (7.1)-(9.3), were first presented in Opdyke (2002b)). Efficient simulation of the power differential shown in (9.1) – (9.3), which requires a computationally intensive nested loop with three levels, is the topic of continuing research. However, its magnitude may very well be larger than that of a single pairwise comparison since variance reduction is achieved from two sources – both a) and b) above – rather than from b) alone.
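A minimal Python sketch (not the paper's code) of the single-step "min p" permutation-style adjustment referred to above (Westfall and Young, 1993); the simulated p-value vectors here are uniform random stand-ins rather than permutation-based vectors.

# Single-step min-p adjustment: the adjusted p-value is the proportion of simulated
# p-value vectors whose minimum is <= the original p-value.
import numpy as np

def single_step_minp(p_orig: np.ndarray, p_sim: np.ndarray) -> np.ndarray:
    """p_orig: (k,) original p-values; p_sim: (B, k) simulated p-value vectors under H0^C."""
    min_sim = p_sim.min(axis=1)                                   # minimum of each simulated vector
    counts = (min_sim[:, None] <= p_orig[None, :]).sum(axis=0)    # the "numerator" counts
    return counts / p_sim.shape[0]                                # adjusted p-values

rng = np.random.default_rng(0)
p_orig = np.array([0.004, 0.03, 0.20])
p_sim = rng.uniform(size=(10_000, 3))                             # stand-in simulated vectors
print(single_step_minp(p_orig, p_sim))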

Before presenting the asymptotic power calculations for a single pairwise comparison, the next section derives and presents an efficient method for performing NR sampling based on any procedure which uses WR sampling, as do all the "permutation" sampling procedures examined in this paper and known to this author. "Oversampling," in effect, efficiently converts any WR sampling procedure into an NR sampling procedure, as shown below.

"Oversampling" to Avoid Duplicate Samples

"Oversampling" involves simply drawing more than the desired T samples (say, r samples), deleting any duplicate samples, and then randomly selecting T samples from the remaining set (this method, and its results in Table 1, were first presented in Opdyke (2002a)). This approach does not alter the probability of drawing any particular sample (see (2)), so "oversampling" is a statistically valid approach for obtaining T distinct samples. The next question to address is: what is the optimal size of (r − T)? The goal is to minimize expected runtime, which is a function of (r − T), or simply r, and the size of r involves the following runtime tradeoff: larger r will contribute to longer runtimes due to the extra time required to generate more samples, but also will diminish the probability that fewer than T unique samples will be drawn, which would require another draw of r samples and increase overall runtime; smaller r will require less time to generate fewer samples, but at the price of an increased probability of being left with fewer than T unique samples and having to redraw the samples all over again. Expected runtime is simply the product of a) the expected number of times r samples need to be drawn to obtain at least T unique samples, and b) the time it takes to draw r samples. So if expected runtime = g(r, x, y, …), we seek r such that ∂g/∂r = 0 (and ∂²g/∂r² > 0).

Minimizing Expected Runtime

a) The number of times r samples must be drawn before obtaining at least T unique samples is a random variable that follows the geometric distribution, which identifies the number of events occurring before the first success:

Pr(S = s) = p (1 − p)^(s − 1)   (10)

where p indicates the probability of success (of obtaining at least T unique samples) for each event (each call to PROC PLAN, or whichever WR sampling procedure is being used). The expected value of the geometric distribution is E[S] = 1/p, and p is derived from a general form of the familiar (coupon or baseball card) collector's problem. This problem asks the question, "How many card packets must one purchase to collect a complete set of baseball cards?" or equivalently, "How many samples must one draw, when sampling with replacement (because the sample size is so large), to obtain a complete set of all samples from the sampling distribution?" The more general problem, which is the relevant one for this analysis, is "How many samples are required, when sampling with replacement, to obtain T distinct samples from the sampling distribution?" The number of samples "required" follows a probability mass function (11) which is the sum of geometric random variables.

Pr(# unique samples = j) = [ (nCn1)! / ( (nCn1 − j)! j! (nCn1)^r ) ] Σ_{i=0 to j} [ (−1)^i j! (j − i)^r / ( i! (j − i)! ) ]   (11)

where r = # of samples drawn and j ≤ r.

However, we are interested in the probability of obtaining at least T unique samples, which is simply the cumulative probability of obtaining T, T+1, T+2, …, r−1, and r unique samples, as shown below:

p = Pr(j ≥ T) = Σ_{j=T to r} { [ (nCn1)! / ( (nCn1 − j)! j! (nCn1)^r ) ] Σ_{i=0 to j} [ (−1)^i j! (j − i)^r / ( i! (j − i)! ) ] }   (12)

where T ≤ r. Thus, the expected number of times r samples must be drawn to obtain at least T unique samples is a function of the number of possible sample combinations and r, as shown in (13) below:

expected # of calls to PROC PLAN = CTPP(nCn1, r, T) = 1/p
= { Σ_{j=T to r} [ (nCn1)! / ( (nCn1 − j)! j! (nCn1)^r ) ] Σ_{i=0 to j} [ (−1)^i j! (j − i)^r / ( i! (j − i)! ) ] }^(−1)   (13)
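A Python sketch of (11)-(13) (not the paper's Mathematica code), using exact integer/rational arithmetic and small, made-up values of C, r, and T so that the sums evaluate instantly:

# Exact collector's-problem probability of obtaining at least T unique samples in r
# equally likely WR draws from a space of C possible samples, per (12); 1/p is (13).
from fractions import Fraction
from math import comb

def prob_at_least_t_unique(c: int, r: int, t: int) -> Fraction:
    total = Fraction(0)
    for j in range(t, r + 1):
        # number of ways r draws hit exactly j specified values, via inclusion-exclusion
        inner = sum((-1) ** i * comb(j, i) * (j - i) ** r for i in range(j + 1))
        total += Fraction(comb(c, j) * inner, c ** r)
    return total

p = prob_at_least_t_unique(c=50, r=12, t=10)   # small made-up case
print(float(p), float(1 / p))                  # p and the expected number of calls, 1/p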

Graph 2 illustrates the functional relationship between p, 1/p, and r for n1 = 4, n2 = 68, and T = 1,901:

[Graph 2: Probability of at least T Unique Samples (p) and Expected # of Calls to PROC PLAN (1/p) by r (for n1=4, n2=68, and T=1,901)]

b) Now to return to the other factor determining expected sampling runtime – the time it takes PROC PLAN to draw a sample of r samples. This is simply the runtime of PROC PLAN as a function of, interestingly, not the number of possible two-sample combinations, but rather the sum of the two sample sizes (n1 + n2), as well as the number of samples drawn, r. This is shown in Graph 3 (see Appendix A for simulation details). Obviously, r and (n1 + n2) are correlated, but runtime is very well predicted (adjusted R² = 0.9884) by the simple ordinary least squares multivariate regression equation in (14):

[Graph 3: PROC PLAN Runtime by n1+n2 by r – real time (seconds) versus n1 + n2, for r = 1,901; 2,700; 3,500]

PROC PLAN Runtime = PPRT(n1, n2, r) = β0 + β1*(n1 + n2) + β2*r + β3*(n1 + n2)*r   (14)

Nonlinearity at about (n1 + n2) = 65,500 and (n1 + n2) = 73,500 prompted the inclusion of dummy and interaction terms, leading to the near perfect prediction (adjusted R² = 0.9927) for PPRT(n1, n2, r) presented in Appendix B (see Graph 4, which is simply a magnification of Graph 3 up to (n1 + n2) = 100,000).

[Graph 4: PROC PLAN Runtime by n1+n2 by r – real time (seconds) versus n1 + n2 up to 100,000, for r = 1,901; 2,700; 3,500]
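A Python sketch of fitting the runtime model (14) by ordinary least squares; the timing data below are entirely hypothetical (the paper's actual fit uses the simulated PROC PLAN runtimes described in Appendices A and B).

# OLS fit of PPRT(n1, n2, r) = b0 + b1*(n1+n2) + b2*r + b3*(n1+n2)*r on made-up timings.
import numpy as np

# columns: n = n1 + n2, r, measured runtime in seconds (all values hypothetical)
data = np.array([
    [1_000,   1_901, 0.06],
    [10_000,  1_901, 0.09],
    [100_000, 1_901, 0.45],
    [1_000,   3_500, 0.10],
    [10_000,  3_500, 0.16],
    [100_000, 3_500, 0.80],
])
n, r, y = data[:, 0], data[:, 1], data[:, 2]

X = np.column_stack([np.ones_like(n), n, r, n * r])   # design matrix for (14)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)        # fitted b0..b3
print(X @ beta)    # predicted runtimes at the design points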

Thus, expected runtime g(n1, n2, r, T) is the product of PROC PLAN Runtime and the expected number of calls to PROC PLAN:

expected runtime = g(n1, n2, r, T) = (14) × (13) = PPRT(n1, n2, r) × CTPP(nCn1, r, T)
= [ β0 + β1*(n1 + n2) + β2*r + β3*(n1 + n2)*r + d1*β4 + d1*β5*(n1 + n2) + d1*β6*r + d1*β7*(n1 + n2)*r + d2*β8 + d2*β9*(n1 + n2) + d2*β10*r + d2*β11*(n1 + n2)*r ]
× { Σ_{j=T to r} [ (nCn1)! / ( (nCn1 − j)! j! (nCn1)^r ) ] Σ_{i=0 to j} [ (−1)^i j! (j − i)^r / ( i! (j − i)! ) ] }^(−1)   (15)

To get an intuitive feel for r as a function of n1 and n2 (for a given T), note again that the second term of (15) is a combinatorial function of the sample sizes while the first term is merely a linear function of the sample sizes (see Graph 5).

[Graph 5: Estimated PROC PLAN Runtime by r (for n1=4, n2=68, and T=1,901 – based on PPRT in Appendix B)]

The combinatorial terms in the second term of (15) end up dominating as sample sizes increase, asymptotically converging to 1.0 (one call to PROC PLAN) faster than the first term (each PROC PLAN runtime) diverges. Hence, for all but very small sample sizes, an optimal r in terms of expected runtime (where ∂g/∂r = 0) will be fairly close to T. Graphs 6 and 7 below present g(n1, n2, r, T) – the product of 1/p in Graph 2 and PPRT in Graph 5 above – and demonstrate an optimal r, r* = 1,908, for T = 1,901, n1 = 4, and n2 = 68 (and nCn1 = C = 1,028,790). Graph 7 magnifies the relevant expected runtime range.

[Graph 6: Expected Runtime (1/p × each runtime) by r (for n1=4, n2=68, and T=1,901)]

[Graph 7: Expected Runtime (1/p × each runtime) by r (for n1=4, n2=68, and T=1,901)]
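A Python sketch of the expected-runtime minimization in (15), pairing the exact collector's-problem probability (the same calculation as the earlier sketch of (11)-(13)) with a hypothetical linear runtime model and a direct search over r; small made-up values keep the exact arithmetic fast.

# Direct search for r* minimizing (per-call runtime) * (expected number of calls, 1/p).
from fractions import Fraction
from math import comb

def prob_at_least_t_unique(c, r, t):
    return sum(Fraction(comb(c, j) * sum((-1) ** i * comb(j, i) * (j - i) ** r
                                         for i in range(j + 1)), c ** r)
               for j in range(t, r + 1))

def expected_runtime(c, n, r, t, b=(0.05, 1e-6, 1e-5)):   # hypothetical runtime coefficients
    per_call = b[0] + b[1] * n + b[2] * r                  # simplified linear runtime model
    return per_call / float(prob_at_least_t_unique(c, r, t))

c, n, t = 2_000, 40, 60                                    # small made-up case
best_r = min(range(t, t + 15), key=lambda r: expected_runtime(c, n, r, t))
print(best_r, expected_runtime(c, n, best_r, t))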

Unfortunately, the high level of precision needed to calculate numeric solutions for r* based on (15), for different sample sizes and different values of T, requires use of a symbolic programming language (the Mathematica® v4.1 code used to obtain the exact probabilities in Table 1 is available from the author upon request). Thus, exact solutions cannot be implemented "on the fly" in SAS, or any statistical software package, for encountered values of n1 and n2. Good approximations to the probability mass function of the collector's problem, however, do exist (see Kuonen (2000) and Read (1998), as well as Lindsay (1992) for a unique approach to the problem), but whether using exact or approximate probabilities, for all practical purposes r* need not be calculated for each and every combination of values of n1 and n2. Nearly optimal r can be calculated for ranges of C because, as shown in Graph 7, the marginal runtime cost of drawing r slightly larger than r* is negligible (though the marginal runtime cost of drawing r smaller than r* is relatively large). Thus, if we define appropriate ranges of C, and for the lower bound of each range identify r*, these "low-end" r*s always will be larger than any other r* corresponding to any of the sample pairs within their respective ranges. In other words, though not optimal for every combination of sample sizes within its range, the "low-end" r* will be nearly optimal because it will be slightly larger (never smaller) than all other r* for sample size pairs within its range, and the marginal runtime cost of being slightly larger than r* is negligible. Table 1 below shows the values of r used in the permutation test program – the "low-end" r*s – for ranges of C. Although g(n1, n2, r, T) is a function of both C and n1 + n2, and n1 + n2 does vary for (essentially) constant C, the effect of this can be ignored since, as an empirical matter, it never affects the calculation of each of the "low-end" r*s. In other words, CTPP (13) strongly dominates PPRT (14) because 1/p converges to one so quickly.

The code in Appendix C proposes an efficient method for generalizing the results from Table 1, i.e. for obtaining estimates of the optimal "low-end" r*s for any value of T. This method is very fast, perhaps even faster than Kuonen (2000), although it provides only estimates of the exact solution. It first utilizes optimal "low-end" r*s already calculated for a particular value of T (as in Table 1) as the basis for conservative estimates of the distance (in standard deviations) between a new T and the mean of the collector's problem mass function. Different r*s are then tested via any of several straightforward convergence algorithms (false position converges more quickly than bisection and, surprisingly, Newton-Raphson in this context) to find those r*s yielding distances arbitrarily close to the original conservative distance estimates, typically within just several iterations. The method performs well in practice because of the shape of the runtime function (Graph 7): as long as the original distance estimates are conservative, i.e. slightly larger than necessary, the corresponding estimates of the optimal "low-end" r*s also will be slightly larger than necessary, causing only negligible runtime increases over use of the true optimal "low-end" r*s.

TABLE 1. Nearly Optimal r ("low-end" r*), Probability (p) of T ≥ 1,901 Unique Samples, and Expected # of Calls to PROC PLAN (1/p), by Ranges of # of Sample Combinations, C

C = nCn1                            "low-end" r*    p (lower bound)       1/p (lower bound)
C < 10,626 (assuming C ≥ T)         C               1.0                   1.0
10,626 ≤ C < 52,360                 2,138           0.997929320330667     1.002074976280530
52,360 ≤ C < 101,270                1,956           0.999058342955471     1.000942544598290
101,270 ≤ C < 521,855               1,934           0.999429717692296     1.000570607715190
521,855 ≤ C < 1,028,790             1,912           0.999726555240808     1.000273519551680
1,028,790 ≤ C < 10,009,125          1,908           0.999512839120371     1.000487398321020
10,009,125 ≤ C < 25,637,001         1,904           0.999961594180711     1.000038407294350
25,637,001 ≤ C < 100,290,905        1,903           0.999944615376581     1.000055387691050
100,290,905 ≤ C < 5,031,771,045     1,902           0.999839691379204     1.000160334323770
5,031,771,045 ≤ C                   1,901           0.999641154940541     1.000358973875460

It is worth noting that, for T = 1,901, the largest value of C for which one has to actually "oversample" (although one must still check for duplicate samples and redraw if necessary) is relatively small – about 5×10⁹. This corresponds to sample sizes of only n1 = 17 and n2 = 18 for small n = n1 + n2, and n1 = 2 and n2 = 100,000 for large n. This is due, of course, to the fantastic combinatorial growth of C, which causes 1/p's rapid convergence to one. This convergence indicates that using "oversampling" as outlined above to perform NR sampling should be applicable to any WR sampling procedure, even if its runtime function, unlike (14), is not linear in n (i.e. even if it is convex and steep in n).
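A Python sketch of the generic oversampling wrapper described in this section; draw_wr() below is a hypothetical stand-in for PROC PLAN or any other WR sampling procedure.

# Oversampling: draw r index-sets with a WR procedure, delete duplicates, redraw if fewer
# than T unique sets remain, then randomly keep exactly T distinct samples.
import random

def draw_wr(n1: int, n: int, r: int) -> list[tuple[int, ...]]:
    """Each sample is n1 indices drawn without replacement from 1..n; the r samples
    themselves are drawn with replacement from the sample space."""
    return [tuple(sorted(random.sample(range(1, n + 1), n1))) for _ in range(r)]

def draw_nr(n1: int, n: int, t: int, r: int) -> list[tuple[int, ...]]:
    """Oversample until at least T distinct samples are obtained, then keep exactly T."""
    while True:
        unique = list(set(draw_wr(n1, n, r)))
        if len(unique) >= t:
            return random.sample(unique, t)

# n1 = 4, n2 = 68, so n = 72 and C = 1,028,790; Table 1 gives r = 1,908 for T = 1,901.
samples = draw_nr(n1=4, n=72, t=1_901, r=1_908)
print(len(samples), len(set(samples)) == len(samples))   # 1901 True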