Sample Size Determination With False Discovery Rate Adjustment for Experiments With High-Dimensional Data Gengqian C AI, Xiwu L IN, and Kwan L EE

In high-dimensional data analyses, such as microarray experiments, the false discovery rate (FDR) has been widely used as an appropriate criterion for controlling the false positive error rate, and some progress has been made on the issue of sample size calculation. However, there is still a lack of a simple and practically useful method for routine use. This article investigates the power, and the related problem of sample size determination, of current FDR-controlling procedures under a mixture model involving independent test statistics. An approach is proposed in which one can use a traditional sample size calculation for a single hypothesis with an appropriately adjusted Type I error rate. This adjustment is based on a simple relationship between the desired FDR and power level and the individual Type I error rate. Simulation results show that our approach can be applied successfully under both an independence assumption and certain commonly used correlation structures. Key Words: Microarray; Mixture model; Multiple testing; Positive FDR; Power.

1. Introduction

Today, more and more technologies are available to scientists in drug development to explore a large number of markers simultaneously. One example is the microarray experiment, which allows researchers to simultaneously investigate a large number of genes. One of the major objectives in a microarray analysis is to discover the genes

that are expressed differentially between different groups or conditions. This is commonly formulated as a multiple testing problem, with the null hypothesis for each gene representing no association of expression level with the group or condition. Due to the large number of hypotheses, the multiplicity effect cannot be ignored, or there will be too many false positives. On the other hand, traditional methods to control the family-wise error rate (FWER), such as the one with Bonferroni adjustment, are too conservative in the sense that too few false positives are allowed to occur, which in turn increases false negatives beyond practical usefulness. Researchers would like to have as many true positives as possible while controlling the proportion of false positives. Methods controlling the false discovery rate (FDR) come into play for this purpose.

The concept of FDR as an overall measure of false positives in testing multiple hypotheses has been receiving considerable attention in high-dimensional data analyses. Consider the possible outcomes, as shown in Table 1, where m null hypotheses H1, ..., Hm are simultaneously tested against some alternatives. The FDR is the expected proportion of false positives among the rejected null hypotheses, that is,

FDR = E[V/R | R > 0] · Pr{R > 0}.   (1)

Table 1. Result of testing multiple hypotheses

                    Accepted   Rejected   Total
True null               U          V        m0
True alternative        T          S        m1
Total                   A          R        m

© American Statistical Association. Statistics in Biopharmaceutical Research 0000, Vol. 00, No. 00, DOI: 10.1198/sbr.2009.0058

Along with the concept of FDR, Benjamini and Hochberg (1995) proposed a procedure, known as the BH procedure, to control such an error rate based on independent p-values. They proved that the BH procedure controls the FDR at m0 η/m, which is below the prespecified level η, assuming independence. A slightly different concept of false discovery rate, known as the positive FDR (pFDR), was proposed by Storey (2002) as

pFDR(α) = E[V(α)/R(α) | R(α) > 0]   (2)

for a given p-value threshold α, where V(α) is the number of false rejections and R(α) is the number of total rejections of null hypotheses. Assuming a mixture model, which is reviewed in Section 2, Storey (2003) gave a Bayesian interpretation to the pFDR and presented an algorithm, based on q-values, to estimate it. The pFDR was referred to as the Bayesian FDR by Efron (2003). While controlling the FDR at a preassigned level is usually the main objective in high-dimensional data experiments, and the BH and Storey's procedures are most commonly used for that purpose because of their ease of application, determining the minimum sample size providing the desired control of the FDR with reasonable power at a specified alternative is an important design issue. In other words, it is often important to answer the question: what is the minimum sample size for an FDR-controlling procedure, like the BH or Storey's procedure, given a desired power γ, FDR level η, and other design parameters? Several articles have tried to address the question of power and sample size in a multiple testing situation where a large number of hypotheses are tested simultaneously. Hwang, Schmitt, and Stephanopoulos (2002) used the traditional sample size calculation method without any multiplicity adjustment; Pan, Lin, and Le (2002) used a Bonferroni adjustment, thus controlling the FWER; Lee and Whitmore (2002) calculated the sample size controlling the expected number of false positives, E(V). More recently, a few articles have investigated sample size and power calculation while controlling the FDR, since it is a legitimate question raised at the design stage of high-dimensional data experiments given that a certain

FDR-controlling method will be used for the data analysis. Simon et al. (2004) proposed a very simple modification of the traditional sample size calculation method, using Type I error α = 0.001 and Type II error β = 0.05 for microarray experiments to adjust for multiplicity. The operating characteristics of this modification are discussed in Section 3 under the mixture model setup. Pounds and Cheng (2005) proposed general algorithms that iteratively search for the sample size giving the desired power and FDR level, which can be computationally intensive. As in our article, the methods of Pawitan et al. (2005), Jung (2005), and Liu and Hwang (2007) are all based on the mixture model result (Bayesian interpretation) of Storey (2002). Pawitan et al. (2005) investigated the relationship of FDR, sensitivity, and sample size with several operating characteristic curves based on a two-sample t-test. Jung (2005) and Liu and Hwang (2007) derived a general equation relating the FDR level, the proportion of nondifferentially expressed genes, and the posterior probabilities in the microarray setting. Although the same result from the Bayesian framework of Storey (2002) was the foundation for these three articles, the FDR and power were given misleading interpretations there, as discussed in Sections 2 and 6. Also based on Storey (2002), Shao and Tseng (2007) introduced a method with a dependence adjustment to calculate sample size with an FDR-controlling property. However, the correlations need to be estimated through a complicated process, which requires pilot data and makes their method less practical. In Shao's method, power is defined as overall power instead of the more commonly used average power that we define in Section 2. In this article, we examine the power of the BH procedure and Storey's Q-value procedure in terms of average power and recommend a simple and proper choice of n based on controlling the FDR while maintaining the power at a desirable value.
The derivation of our method is based on p-values and is therefore distribution free, as long as the power and sample size can be determined in the traditional way for testing a single hypothesis. In the next section we present the mixture model and simple formulas under this model for the FDR, pFDR, and power. These formulas provide an explicit formulation of the sample size calculation methodologies, which are described in Section 3. Section 4 discusses a specific application of our method in a microarray situation where both m and the proportion of true null hypotheses are large. Our methodology, derived under an independence assumption, is further investigated through simulations under some dependent cases in Section 5. Some concluding remarks are presented in Section 6.


2. Mixture Model

In this section, we recall the mixture model considered by Efron et al. (2001), Storey (2002), and Efron (2003), and present the formulas related to the FDR and power under this model. Let P1, ..., Pm be the p-values corresponding to the m test statistics for testing m hypotheses. Let Hi = 0 indicate that Hi is true and Hi = 1 otherwise. Assume that the (Pi, Hi) are iid random pairs, and denote the distribution function of Pi given Hi = 0 by F0, which is a uniform distribution, and that of Pi given Hi = 1 by F1, which is the power function for testing a single hypothesis. Further assume that the same rejection region is used for each test, which makes the tests identical. We finally assume that the Hi's are independent Bernoulli random variables with Pr{Hi = 0} = π0 and Pr{Hi = 1} = 1 − π0 = π1. Therefore the distribution function of Pi becomes

F(α) = π0 F0(α) + π1 F1(α) = π0 α + π1 F1(α),

where α > 0 is the common p-value cutoff, or Type I error rate, for each hypothesis tested. Under the above mixture model, the pFDR can be expressed as

pFDR(α, m) = π0 α / [π0 α + π1 F1(α)] = Pr{Hi = 0 | Pi ≤ α};   (3)

and the FDR as

FDR(α, m) = E[V(α)/R(α) | R(α) > 0] · Pr{R(α) > 0}
          = {π0 α / [π0 α + π1 F1(α)]} × [1 − (1 − π0 α − π1 F1(α))^m];   (4)

see, for example, Sarkar (2006) and Storey (2003).

Sample size calculations for high-dimensional data experiments must be based on a statistical measure of power that is appropriate for the multiple testing setting. The concept of power can be generalized from single to multiple testing in a number of ways. Nevertheless, the one that is often used is the average power, defined as the expected proportion of nonnull (alternative) hypotheses that are correctly rejected,

γ(α, m) = E[S/m1].   (5)

Readers are referred to Benjamini and Liu (1999), Dudoit, Shaffer, and Boldrick (2003), and Kwong, Holland, and Cheung (2002) for more details on average power. It is important to note that in the above definition m1 is assumed to be positive, and that under a mixture model it is a random variable rather than a fixed value (as is m0). However, in Pawitan et al. (2005), Jung (2005), and Liu and Hwang (2007), m0 and m1 are used as fixed values even though the mixture model is used to derive the relationship between the FDR and the Type I error. Also under the mixture model, the random variable m1 can be zero with positive probability, which requires this definition to be adjusted by appropriately defining the ratio S/m1 when m1 = 0. We define S/m1 to be 1 when m1 = 0; that is,

γ(α, m) = E[S/m1 | m1 > 0] · Pr{m1 > 0} + Pr{m1 = 0}.   (6)

This makes it consistent with the other concepts related to power in a multiple testing setting, such as the false nondiscovery rate (FNR) introduced independently by Genovese and Wasserman (2002) and Sarkar (2004), or the pFNR defined in Storey (2003). To simplify the formula for the average power under the mixture model, we first see, by arguing as in Storey (2003) and conditioning on m1 (without loss of generality, assuming the first m1 hypotheses are the alternatives), that

E[S(α) | m1 = k] = ∑_{i=1}^{k} Pr{Pi < α | Hi = 1} = k · F1(α),   (7)

where k > 0. Therefore, we have

γ(α, m) = F1(α) · Pr{m1 > 0} + Pr{m1 = 0} = F1(α)(1 − π0^m) + π0^m.   (8)

Although the difference is small in high-dimensional experiments, this is conceptually different from the definition γ (α, m) = F1 (α) in Pawitan et al. (2005), Jung (2005), and Liu and Hwang (2007), where m 1 is incorrectly treated as a fixed value.
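These closed forms are easy to evaluate numerically. The sketch below (a minimal illustration; the function name and argument layout are ours, not from the article) computes the pFDR (3), the FDR (4), and the average power (8) for a given cutoff α, prior π0, single-test power F1(α), and number of hypotheses m:

```python
def mixture_quantities(alpha, pi0, f1_alpha, m):
    """Evaluate pFDR (3), FDR (4), and average power (8) under the iid
    mixture model with common p-value cutoff alpha."""
    pi1 = 1.0 - pi0
    p_reject = pi0 * alpha + pi1 * f1_alpha         # Pr{P_i <= alpha} = F(alpha)
    pfdr = pi0 * alpha / p_reject                   # Equation (3)
    fdr = pfdr * (1.0 - (1.0 - p_reject) ** m)      # Equation (4)
    power = f1_alpha * (1.0 - pi0 ** m) + pi0 ** m  # Equation (8)
    return pfdr, fdr, power
```

For example, with α = 0.001, π0 = 0.95, F1(α) = 0.95, and m = 2000, the pFDR is 1/51 ≈ 0.0196, matching the discussion of Simon's method in Section 3.1, and FDR ≈ pFDR because Pr{R > 0} ≈ 1 for large m.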

3. Sample Size Determination

3.1 Richard Simon's Method

In Simon et al. (2004), a traditional α-adjustment method using a smaller Type I error of α = 0.001 and β = 0.05 is proposed for the design of microarray experiments. It is not specifically proposed to control the FDR at any specific value, and a potential drawback of this method is that α = 0.001 may not be used as the actual cutoff of p-values in the analysis stage; the sample size/power calculation may therefore not be consistent with the analysis method. However, the simplicity of the recommendation merits further investigation of its possible relationship to FDR control. In other words, we would like to know whether Simon's simple adjustment method can control the FDR at a reasonably small value in most practical situations.

In the following, we investigate this method, in the mixture model setup, assuming α = 0.001 as the p-value cutoff in the final analysis. It is clear that F1(α) = 1 − β = 0.95 for each hypothesis, and as we mentioned in Section 2, in the high-dimensional setup the average power in (8) reduces to

γ(α, m) ≈ F1(α) = 0.95.   (9)

The pFDR in (3) can be rewritten as an increasing function of π0 as

pFDR = 1 / {1 + [(1 − π0)/π0] · [F1(α)/α]} = 1 / {1 + [(1 − π0)/π0] · 950}.   (10)

Apparently, this method does not take π0 into account. With α = 0.001 as the p-value cutoff, it can control the pFDR at the desired level η with an average power of about 95% as long as

π0 < 950 · η / (1 + 949 · η).

Therefore, for example, for any π0 < 0.99, this approach can give a sample size controlling the pFDR at 10% with an average power of 95%. But the method becomes conservative, that is, it leads to a sample size larger than is needed, when π0 is relatively small (e.g., < 0.95). Below we propose our method, which takes the actual value of π0 into account and therefore eliminates this conservativeness.

3.2 Method Based on Mixture Model

We assume the above mixture model and consider the power, in terms of γ(α, m), of the BH and Storey's procedures controlling the FDR or pFDR at level η. For Storey's Q-value approach, based on (3), control of the pFDR at η is achieved if the cutoff α is chosen subject to

α/F1(α) ≤ (π1/π0) · η/(1 − η).   (11)

Apparently, the least conservative case is when equality holds in (11). Liu and Hwang (2007) derived a result similar to (11) and proved that a unique solution α exists for (11) with t and F distributions. We now generalize the proof to all distributions with the monotone likelihood ratio (MLR) property. Note that for a test statistic X with a probability density function fθ(x) having the MLR property in x, the ratio Fθ0(α)/Fθ1(α) is an increasing function of α for any θ1 > θ0, where Fθ(α) is the cdf of the p-value of a right-tailed test based on X; see Lehmann (1986) for details on MLR. Many commonly used distributions, such as the normal, noncentral t, noncentral chi-square, and noncentral F distributions, belong to this family. Since α/F1(α) goes from 0 to 1 as α increases, the equation

α/F1(α) = (π1/π0) · η/(1 − η)   (12)

has a unique solution α for statistics with an MLR distribution. For Storey's procedure we propose the following three-step procedure to calculate the sample size:

1. Based on (8), for given π0, m, and the desired average power γ, calculate

   F1* = (γ − π0^m)/(1 − π0^m).

2. Based on (11), calculate the least conservative cutoff value α for the given pFDR level η as

   α = F1* · (π1/π0) · η/(1 − η).

3. Perform a sample size calculation for testing a single hypothesis with Type I error rate α and Type II error rate β = 1 − F1*.

Similarly for the BH procedure, which controls the FDR at the π0 η level, let B = Pr{R > 0} = 1 − [1 − π0 α − π1 F1(α)]^m. Based on (4), the rejection region should satisfy

α/F1(α) ≤ π1 η/(B − π0 η)   (13)

in order to control the FDR at the π0 η level. With known π0, the BH procedure can be further improved by adjusting for π0 as in Storey's procedure. That is, we can use η/π0 as the nominal FDR level, and the actual FDR will then be controlled at level η. In that case, the rejection region should satisfy

α/F1(α) ≤ π1 η/[π0 (B − η)].   (14)

Similar to Storey's procedure, the sample size for the BH procedure can be derived with the three-step procedure proposed above, except that (13) or (14) needs to be used when solving for α in Step 2.
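As an illustration of the three-step procedure, the sketch below assumes the single-hypothesis test is a two-sided two-sample z-test (known variance), so that Step 3 has the familiar closed form n = 2(z_{1−α/2} + z_{1−β})²/Δ² per group. The normal-quantile helper and function names are ours; an exact noncentral t/F calculation, as used for the tables later in the article, would give somewhat larger n.

```python
import math

def z_quantile(p):
    """Standard normal quantile by bisection on the erf-based cdf."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def storey_sample_size(pi0, m, gamma, eta, delta):
    """Three-step sample size for Storey's pFDR procedure.

    Step 1: single-test power target F1* from Equation (8);
    Step 2: least conservative cutoff alpha from equality in (11);
    Step 3: traditional z-test sample size at (alpha, beta = 1 - F1*).
    Returns (alpha, n per group) for standardized effect size delta.
    """
    pi1 = 1.0 - pi0
    f1_star = (gamma - pi0 ** m) / (1.0 - pi0 ** m)          # Step 1
    alpha = f1_star * (pi1 / pi0) * eta / (1.0 - eta)        # Step 2
    n = 2.0 * (z_quantile(1.0 - alpha / 2.0)
               + z_quantile(f1_star)) ** 2 / delta ** 2      # Step 3
    return alpha, math.ceil(n)
```

For instance, with π0 = 0.90, m = 100, γ = 0.85, η = 0.05, and Δ = 2, this gives α ≈ 0.005 and n = 8 per group under the z-approximation.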


Note that (11) does not depend on m, while (13) and (14) do. This is not a surprise and should actually be the case, since the pFDR is expressed as a posterior probability for a single common hypothesis in the mixture model and ignores the situation in which there is no rejection, whereas the critical values and the probability of at least one rejection in the BH procedure depend on m. Interestingly, comparing the right sides of (11) and (14), since B < 1, the α derived from (14) is larger than that from (11). That means that, theoretically, the power of the BH procedure adjusted for π0, which controls the FDR at level η, is larger than that of Storey's procedure controlling the pFDR at η. This again should be true, because if the pFDR is controlled at η, the corresponding FDR is only ηB < η. The comparison of Storey's procedure (11) with the original BH procedure (13) is not as straightforward and depends on the magnitudes of B and π0.

4. Application in Microarray Experiments

In this section we continue the discussion of Section 3 for situations with large m and π0 close to 1, which are typical of microarray experiments. In such cases, the probability of at least one rejection, B, gets very close to 1 when π0 < 1. With B = 1, (13) reduces to the result in Genovese and Wasserman (2002), where it is derived asymptotically for large m, without the mixture model assumption, with π0 the fixed proportion of true null hypotheses. That means that with large m, with or without the mixture model, the result will be similar for the BH procedure. Furthermore, it is reasonable to assume π0 < B in this case. Thus, in theory, the original BH procedure will be less powerful than the pFDR procedure, since α from (11) would be larger than that from (13), although the difference can be very small. With B = 1 in microarray experiments, (14) is the same as (11), which means that whether we use the BH procedure adjusted for π0 or Storey's Q-value procedure, the sample size calculation results will be very close. For simplicity, we do not distinguish pFDR from FDR in the remainder of this section.

As we have mentioned, for large m the average power in (8) reduces to F1(α), the traditional power function, which does not depend on the number of hypotheses m either. Therefore, for microarray experiments, the three-step procedure proposed in Section 3 reduces to the following easy and practical two-step sample size calculation, independent of m:

1. For given π0, desired FDR level η, and average power γ, calculate α from (11) with F1(α) equal to the desired average power γ:

   α = γ · (π1/π0) · η/(1 − η).

2. Find the minimum sample size such that F1(α) ≥ γ; this is the same as a traditional power calculation with Type I error rate α and Type II error rate β = 1 − γ.

Although the same Equation (11) was used, unlike in Liu and Hwang (2007), where an extensive computational search is required, the procedure proposed above uses simple calculations and is therefore much less computationally intensive. Apparently π0 is an important factor influencing the sample size/power calculation: the larger the π0, the smaller the α, and therefore the smaller the average power F1(α) for a given sample size and other design parameters. Note that since the sample size must be an integer, the last step of our procedure may yield a sample size such that F1(α) > γ. Since the left side of (11) is an increasing function of α, it is not difficult to show that the sample size obtained from our method is the minimum sample size achieving an average power no less than the desired value.
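The two-step procedure takes only a few lines of code. In the sketch below (our own minimal stand-in) the single-test power function F1 is approximated by a two-sided two-sample z-test, whereas the article's tables use the exact noncentral F distribution, which yields somewhat larger sample sizes:

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_quantile(p):
    """Standard normal quantile by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if norm_cdf(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def two_step_sample_size(pi0, gamma, eta, delta):
    """Two-step procedure for large m (Section 4).

    Step 1: alpha = gamma * (pi1/pi0) * eta/(1 - eta), i.e. (11) with
            F1(alpha) set to the desired average power gamma.
    Step 2: smallest per-group n whose approximate two-sided two-sample
            z-test power at effect size delta reaches gamma.
    """
    alpha = gamma * ((1.0 - pi0) / pi0) * eta / (1.0 - eta)       # Step 1
    z_crit = norm_quantile(1.0 - alpha / 2.0)
    n = 1
    while norm_cdf(delta * math.sqrt(n / 2.0) - z_crit) < gamma:  # Step 2
        n += 1
    return alpha, n
```

With π0 = 0.95, γ = 0.90, η = 0.10, and Δ = 2, this gives α ≈ 0.0053 and n = 9 per group under the z-approximation; the exact F-based entry for the corresponding design in Table 3 is n = 11.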

5. Comparison and Simulations

5.1 One-Way ANOVA Setup

Although the approaches discussed in this article do not depend on a particular distribution, we first formulate a balanced one-way ANOVA setup (i.e., an F-test) for microarray experiments for illustration purposes. Differences between approaches and the cases with correlated hypotheses will then be discussed under this setting. Suppose that there are m genes on each microarray chip and a balanced one-way design is used to compare K groups based on a sample of size n from each group. For the kth sample of the jth group of gene i, the expression Xijk (usually on the log scale) is modeled as

Xijk = μij + eijk,  i = 1, ..., m, j = 1, ..., K, k = 1, ..., n,   (15)

where eijk ~ N(0, σi²); that is, we assume homogeneous variance across the groups within each gene. We are interested in testing the null hypothesis Hi : μi1 = ... = μiK for each i simultaneously. Denote by Fζ[K − 1, K(n − 1)] the F distribution with degrees of freedom K − 1 and K(n − 1) and noncentrality parameter ζ. Each null hypothesis is tested using a one-way ANOVA based on a test statistic having the F distribution with ζ = 0 under Hi. Let Δi = δi/σi be the effect size for gene i, where δi is the minimum mean


Table 2. Sample size per treatment group for different π0's in the balanced one-way setup (desired FDR level η, desired average power γ, effect size Δ, and K = 4)

                                π0
η     Δ     γ      0.90  0.92  0.94  0.95  0.96  0.98  0.99
0.05  1.00  0.80     37    39    40    41    43    46    50
            0.85     40    42    43    45    46    50    54
            0.90     44    46    48    49    50    55    59
            0.95     51    53    55    56    58    62    67
      1.50  0.80     18    18    19    20    20    22    24
            0.85     19    20    21    21    22    24    25
            0.90     21    22    23    23    24    26    28
            0.95     24    25    26    26    27    29    31
      2.00  0.80     11    11    12    12    12    14    15
            0.85     12    12    13    13    13    14    16
            0.90     13    13    14    14    14    16    17
            0.95     14    15    15    16    16    17    19
      2.50  0.80      8     8     8     9     9    10    10
            0.85      8     9     9     9     9    10    11
            0.90      9     9    10    10    10    11    12
            0.95     10    10    11    11    11    12    13
0.10  1.00  0.80     33    34    36    37    38    42    46
            0.85     36    37    39    40    42    46    50
            0.90     40    41    43    44    46    50    54
            0.95     46    48    50    51    53    57    62
      1.50  0.80     16    16    17    18    18    20    22
            0.85     17    18    19    19    20    22    24
            0.90     19    19    20    21    22    24    26
            0.95     22    22    23    24    25    27    29
      2.00  0.80     10    10    11    11    11    12    14
            0.85     10    11    11    12    12    13    14
            0.90     11    12    12    13    13    14    16
            0.95     13    13    14    14    15    16    17
      2.50  0.80      7     7     8     8     8     9    10
            0.85      7     8     8     8     9     9    10
            0.90      8     8     9     9     9    10    11
            0.95      9     9    10    10    10    11    12

difference to be detected and σi is the standard deviation of the gene expression for gene i. Under the alternative hypotheses, the conservative value of the noncentrality parameter for gene i will be

ζi = (n/2) Δi².

We further assume that Δi = Δ for all i; that is, the minimum effect size to be detected is the same for all genes. The m hypotheses then become identical: Hi : ζi = 0 against Ki : ζi = (n/2)Δ². The assumption of a common minimum effect size makes more sense than assuming the same absolute size of signal (e.g., fold change), since the latter does not take into account the different variation of the markers. The procedure we proposed in Section 4 does not depend on the total number of hypotheses m in high-dimensional cases, and thus allows convenient tables such as Table 2 to be generated, containing the sample sizes for the balanced one-way ANOVA setup with given design parameters.

5.2 Comparison

In this section we demonstrate graphically the differences in average power between approaches with and without FDR adjustment under the balanced one-way ANOVA setup. With K = 4 and Δ = 2, Figure 1 shows the average power curve without adjustment (π0 = No Adj.) with a 10% Type I error rate, along with the average power curves based on our procedure in Section 4 for different values



Figure 1. Average powers for the balanced one-way ANOVA. π0 = No Adj. corresponds to the case without any multiplicity adjustment; π0 = Simon's corresponds to Richard Simon's method, where the Type I error is set at 0.001; the other curves correspond to our method with different π0's. The Type I error is α = 10% for the no-adjustment case, and the FDR level is set at η = 10% for the cases with FDR adjustment; effect size Δ = 2; the x-axis is the sample size for each of the K = 4 groups.

of π0 with desired FDR level η = 0.1. Richard Simon's method is also shown, by the curve labeled π0 = Simon's. Given the same sample size, it is obvious that when π0 is large, the power with FDR adjustment can be much smaller than that from the traditional power calculation without multiplicity adjustment. Moreover, Richard Simon's cutoff α = 0.001 leads to small average power in most cases with π0 < 0.99; that is, as mentioned before, the sample size derived from Richard Simon's method (with β = 0.05) will be conservative (larger than needed to control the FDR at 10% in this case) in most cases with π0 < 0.99.

5.3 Simulations

Our procedure is formulated under an independence assumption; below we investigate through simulation studies the performance of our methodology under two types of correlation structures. We first assume common positive correlation among test statistics following a multivariate normal distribution; that is, the correlation matrix has the form (1 − ρ)I + ρJ with ρ > 0, where I is the identity matrix and J is the square matrix of 1's. We performed simulations assuming m = 2000, K = 2, Δ = 2, π0 = (0.90, 0.95, 0.98, 0.99), desired average power γ = (0.80, 0.85, 0.90), desired FDR level η = (0.05, 0.10), and common correlation ρ = (0, 0.2, 0.5, 0.8). For each combination of the parameters, the sample size n was first determined with the procedure proposed in Section 4, and then 10,000 simulations were run with the BH procedure adjusted for π0. The FDR and average power calculated from the simulations are presented in Table 3.

In high-dimensional data experiments such as microarrays, a sparse block-diagonal correlation is more realistic and is widely adopted in relevant simulation studies. With the same design parameters as in the common-correlation cases, we also performed simulations with such a sparse block correlation structure, assuming a block size of 20 with common correlation ρ within blocks and independence between blocks. The simulation results are presented in Table 4.

Benjamini and Yekutieli (2001) and Sarkar (2002) proved that under a positive correlation structure the BH procedure still controls the FDR, but in a conservative manner. Our simulations confirm this result, and with the common correlation structure the achieved FDR level becomes much smaller than the nominal level. More interestingly, our simulation results show that even when the hypotheses are correlated, the sample size derived from our procedure can still achieve the desired average power in most cases. Only in a few cases with the common correlation structure could the achieved average power be less than the desired value, when the desired average
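The independent (ρ = 0) case of this simulation can be sketched as follows. This is a minimal stand-in for the authors' study, not their code: each gene is reduced to a two-sided two-sample z-statistic, only independent tests are drawn, far fewer than 10,000 replications are used, and all names and defaults are illustrative.

```python
import math
import random

def bh_reject(pvals, level):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k
    is the largest rank with p_(k) <= k * level / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * level / m:
            k = rank
    return set(order[:k])

def simulate_fdr_power(m=2000, pi0=0.95, n=10, delta=2.0, eta=0.10,
                       reps=100, seed=1):
    """Monte Carlo estimate of FDR and average power for the BH procedure
    adjusted for pi0 (nominal level eta/pi0); each gene is a two-sided
    two-sample z-test approximation with per-group sample size n."""
    rng = random.Random(seed)
    shift = delta * math.sqrt(n / 2.0)  # mean of the z-statistic under H1
    fdp_sum = power_sum = 0.0
    for _ in range(reps):
        alt = [rng.random() < 1.0 - pi0 for _ in range(m)]
        pvals = []
        for a in alt:
            z = rng.gauss(shift if a else 0.0, 1.0)
            pvals.append(2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))))
        rej = bh_reject(pvals, eta / pi0)          # BH adjusted for pi0
        v = sum(1 for i in rej if not alt[i])      # false rejections
        m1 = sum(alt)
        fdp_sum += v / max(len(rej), 1)            # FDP, with 0/0 := 0
        power_sum += (len(rej) - v) / m1 if m1 > 0 else 1.0
    return fdp_sum / reps, power_sum / reps
```

With the defaults (n = 10, π0 = 0.95, η = 0.10, corresponding to the γ = 0.85 row of Table 3), the estimates should land near the table's ρ = 0 entries, i.e., FDR close to the nominal 0.10 and average power around the desired level.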


Table 3. Simulated FDR and average power (desired FDR level η, desired average power γ, K = 2, Δ = 2, common correlation ρ)

Design                     Simulated average power          Simulated FDR
η     π0    γ     n     ρ=0.0   0.2    0.5    0.8      ρ=0.0    0.2     0.5     0.8
0.05  0.90  0.80   9    0.812  0.803  0.785  0.764    0.0499  0.0494  0.0452  0.0300
            0.85  10    0.880  0.876  0.866  0.846    0.0498  0.0493  0.0449  0.0293
            0.90  11    0.925  0.924  0.920  0.905    0.0497  0.0490  0.0446  0.0306
      0.95  0.80  11    0.865  0.861  0.850  0.833    0.0498  0.0486  0.0423  0.0260
            0.85  11    0.865  0.861  0.850  0.833    0.0498  0.0486  0.0423  0.0260
            0.90  12    0.913  0.911  0.906  0.892    0.0498  0.0489  0.0424  0.0263
      0.98  0.80  12    0.830  0.825  0.813  0.799    0.0492  0.0486  0.0386  0.0227
            0.85  13    0.886  0.884  0.876  0.864    0.0499  0.0489  0.0398  0.0224
            0.90  14    0.925  0.924  0.921  0.912    0.0502  0.0483  0.0391  0.0221
      0.99  0.80  13    0.823  0.818  0.806  0.795    0.0497  0.0484  0.0377  0.0205
            0.85  14    0.879  0.876  0.869  0.859    0.0503  0.0482  0.0375  0.0200
            0.90  15    0.917  0.916  0.913  0.904    0.0493  0.0476  0.0372  0.0198
0.10  0.90  0.80   8    0.829  0.822  0.805  0.776    0.0999  0.0985  0.0877  0.0578
            0.85   9    0.897  0.894  0.888  0.867    0.1000  0.0981  0.0870  0.0604
            0.90  10    0.938  0.937  0.938  0.932    0.1000  0.0978  0.0864  0.0635
      0.95  0.80   9    0.812  0.805  0.787  0.765    0.0999  0.0976  0.0827  0.0525
            0.85  10    0.880  0.877  0.871  0.852    0.0996  0.0973  0.0822  0.0528
            0.90  11    0.925  0.924  0.925  0.914    0.0994  0.0967  0.0819  0.0542
      0.98  0.80  11    0.845  0.841  0.831  0.815    0.0996  0.0957  0.0754  0.0437
            0.85  12    0.899  0.897  0.894  0.880    0.0992  0.0960  0.0754  0.0431
            0.90  13    0.935  0.935  0.936  0.930    0.0995  0.0961  0.0757  0.0447
      0.99  0.80  12    0.836  0.831  0.822  0.810    0.0987  0.0954  0.0704  0.0389
            0.85  13    0.890  0.889  0.885  0.874    0.0997  0.0960  0.0717  0.0385
            0.90  14    0.928  0.928  0.928  0.921    0.1000  0.0947  0.0700  0.0385

power γ is relatively low (80%, for example) or the correlation is high (such as ρ = 0.80), and the loss of power in such cases is relatively small (< 5%).

6. Concluding Remarks

This article is motivated by the need for a practical framework for sample size determination with FDR-controlled multiple testing procedures in high-dimensional data experiments. Considering a mixture model involving independent test statistics, which is commonly used for high-dimensional data experiments, we discuss the properties of the approach suggested by Simon et al. (2004) and propose a simple method to determine the sample size for FDR-controlled multiple testing procedures. We show that, although Richard Simon's simple method does not take π0 into account, it is conservative in most cases with π0 < 0.99, and it can therefore be used to get a good estimate of an upper bound


of sample size even in the absence of any information on π0. Our method relies simply on examining the performance of an FDR-controlled procedure through the average power. We give explicit formulas for the measures related to the FDR and average power under the mixture model. These formulas, although quite simple, not only provide a clear explanation of our methodology but, more importantly, also help to clarify certain misconceptions about these measures that seem to persist among some of their users. For instance, in Pawitan et al. (2005), who also addressed the issue of sample size determination in microarray experiments subject to control of the FDR and a reasonable measure of power, the FDR under the same mixture model is incorrectly said to be given by

FDR = π0 t / [π0 t + π1 F1(t)].

While in typical microarray experiments with a large number of genes this does not cause much of a problem, since the expression is very close to the actual formula, it nevertheless presents a misleading interpretation of the FDR under the mixture model. Moreover, Pawitan et al. (2005), Jung (2005), and Liu and Hwang (2007) do not appear to have taken into account that the number of true null hypotheses under the assumed mixture model is a random quantity in their arguments on the pros and cons of different operating characteristics of a multiple testing procedure. Of course, they have nicely articulated the need for sample size determination in microarray experiments and provided some insight for this area of research. We show in this article that for high-dimensional data experiments, where the number of hypotheses m is large, the BH procedure adjusted for π0 leads to results similar to Storey's Q-value procedure. In a high-dimensional setup, the sample size calculation procedure we propose is very similar to the traditional method for testing a single hypothesis, and it depends on π0 but not on the total number of hypotheses m. Although our approach is based on an independence assumption, the simulation results show that it can also be applied successfully to correlated cases such as the sparse block correlation and common correlation structures.

Table 4. Simulated FDR and average power (desired FDR level η, desired average power γ, K = 2, 1 = 2, block common correlation ρ)

        Design               Simulated average power            Simulated FDR
 η     π0    γ    n     ρ = 0.0   0.2    0.5    0.8     ρ = 0.0    0.2     0.5     0.8
0.05  0.90  0.80   9     0.812   0.810  0.808  0.806    0.0499   0.0499  0.0498  0.0496
            0.85  10     0.880   0.879  0.878  0.877    0.0498   0.0498  0.0497  0.0493
            0.90  11     0.925   0.925  0.924  0.923    0.0497   0.0500  0.0500  0.0496
      0.95  0.80  11     0.865   0.864  0.862  0.860    0.0498   0.0497  0.0497  0.0496
            0.85  11     0.865   0.864  0.862  0.860    0.0498   0.0497  0.0497  0.0496
            0.90  12     0.913   0.912  0.911  0.909    0.0498   0.0498  0.0497  0.0491
      0.98  0.80  12     0.830   0.828  0.823  0.817    0.0492   0.0494  0.0494  0.0484
            0.85  13     0.886   0.884  0.882  0.877    0.0499   0.0508  0.0502  0.0499
            0.90  14     0.925   0.924  0.922  0.918    0.0502   0.0503  0.0499  0.0489
      0.99  0.80  13     0.823   0.816  0.806  0.797    0.0497   0.0506  0.0504  0.0489
            0.85  14     0.879   0.875  0.868  0.859    0.0503   0.0509  0.0505  0.0483
            0.90  15     0.917   0.915  0.910  0.903    0.0493   0.0496  0.0497  0.0491
0.10  0.90  0.80   8     0.829   0.827  0.826  0.824    0.0999   0.0999  0.0999  0.0991
            0.85   9     0.897   0.896  0.895  0.893    0.1000   0.0998  0.0997  0.0991
            0.90  10     0.938   0.938  0.937  0.936    0.1000   0.0998  0.0995  0.0986
      0.95  0.80   9     0.812   0.810  0.807  0.803    0.0999   0.0999  0.0996  0.0986
            0.85  10     0.880   0.879  0.877  0.875    0.0996   0.0996  0.0995  0.0982
            0.90  11     0.925   0.925  0.924  0.922    0.0994   0.1000  0.0999  0.0988
      0.98  0.80  11     0.845   0.842  0.837  0.833    0.0996   0.0989  0.0991  0.0977
            0.85  12     0.899   0.897  0.893  0.890    0.0992   0.0993  0.0989  0.0967
            0.90  13     0.935   0.934  0.932  0.929    0.0995   0.1000  0.1000  0.0980
      0.99  0.80  12     0.836   0.829  0.820  0.810    0.0987   0.0990  0.0984  0.0947
            0.85  13     0.890   0.887  0.879  0.870    0.0997   0.1011  0.1003  0.0982
            0.90  14     0.928   0.926  0.921  0.913    0.1000   0.1008  0.0992  0.0964

[Received October 2008. Revised July 2009.]
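The sample size recipe described above can be sketched in code. The following is an illustrative sketch under stated assumptions, not the authors' exact procedure: the per-test Type I error is obtained by equating the mixture-model (p)FDR expression at the rejection threshold to the target level η when each false null is rejected with probability γ, and is then plugged into a textbook two-sided two-sample z-test formula for a standardized effect size delta. The function names and the normal approximation are ours.

```python
# Illustrative sketch: adjusted per-test alpha from a target FDR level,
# then a standard single-hypothesis sample size calculation.
import math
from statistics import NormalDist


def adjusted_alpha(eta: float, gamma: float, pi0: float) -> float:
    """Per-test alpha t solving pi0*t / (pi0*t + (1 - pi0)*gamma) = eta,
    i.e., the mixture-model FDR at threshold t equals eta when each
    false null is rejected with probability gamma."""
    return eta * (1.0 - pi0) * gamma / ((1.0 - eta) * pi0)


def two_sample_n_per_group(alpha: float, power: float, delta: float) -> int:
    """Two-sided two-sample z-test sample size per group for a
    standardized effect size delta (normal approximation)."""
    z = NormalDist().inv_cdf
    return math.ceil(2.0 * (z(1.0 - alpha / 2.0) + z(power)) ** 2 / delta ** 2)


# Example: target FDR 0.05, average power 0.80, pi0 = 0.90, delta = 2.
alpha_star = adjusted_alpha(0.05, 0.80, 0.90)           # about 0.00468
n = two_sample_n_per_group(alpha_star, 0.80, 2.0)       # 7 per group
```

With these inputs the normal approximation gives alpha_star ≈ 0.00468 and n = 7 per group; an exact calculation based on the t distribution, as a simulation study like that of Table 4 would require, is somewhat more conservative and can give a larger n.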

REFERENCES

Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Series B, 57, 289–300.
Benjamini, Y., and Liu, W. (1999), "A Step-Down Multiple Hypotheses Testing Procedure That Controls the False Discovery Rate Under Independence," Journal of Statistical Planning and Inference, 82, 163–170.
Benjamini, Y., and Yekutieli, D. (2001), "The Control of the False Discovery Rate in Multiple Testing Under Dependency," The Annals of Statistics, 29, 1165–1188.
Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003), "Multiple Hypothesis Testing in Microarray Experiments," Statistical Science, 18, 71–103.
Efron, B. (2003), "Robbins, Empirical Bayes and Microarrays," The Annals of Statistics, 31, 366–378.


Statistics in Biopharmaceutical Research: Vol. 0, No. 0

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151–1160.
Genovese, C., and Wasserman, L. (2002), "Operating Characteristics and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Series B, 64, 499–517.
Hwang, D., Schmitt, W. A., and Stephanopoulos, G. (2002), "Determination of Minimum Sample Size and Discriminatory Expression Patterns in Microarray Data," Bioinformatics, 18, 1184–1193.
Jung, S.-H. (2005), "Sample Size for FDR-Control in Microarray Data Analysis," Bioinformatics, 21, 3097–3104.
Kwong, K. S., Holland, B., and Cheung, S. H. (2002), "A Modified Benjamini-Hochberg Multiple Comparisons Procedure for Controlling the False Discovery Rate," Journal of Statistical Planning and Inference, 104, 351–362.
Lee, M.-L. T., and Whitmore, G. A. (2002), "Power and Sample Size for DNA Microarray Studies," Statistics in Medicine, 21, 3543–3570.
Lehmann, E. L. (1986), Testing Statistical Hypotheses, New York: Wiley.
Liu, P., and Hwang, J. T. G. (2007), "Quick Calculation for Sample Size While Controlling False Discovery Rate With Application to Microarray Analysis," Bioinformatics, 23, 739–746.
Pan, W., Lin, J., and Le, C. T. (2002), "How Many Replicates of Arrays Are Required to Detect Gene Expression Changes in Microarray Experiments? A Mixture Model Approach," Genome Biology, 3, 1–10.
Pawitan, Y., Michiels, S., Koscielny, S., Gusnanto, A., and Ploner, A. (2005), "False Discovery Rate, Sensitivity and Sample Size for Microarray Studies," Bioinformatics, 21, 3017–3024.


Pounds, S., and Cheng, C. (2005), "Sample Size Determination for the False Discovery Rate," Bioinformatics, 21, 4263–4271.
Sarkar, S. K. (2002), "Some Results on False Discovery Rate in Stepwise Multiple Testing Procedures," The Annals of Statistics, 30, 239–257.
Sarkar, S. K. (2004), "FDR-Controlling Stepwise Procedures and Their False Negatives Rates," Journal of Statistical Planning and Inference, 125, 125–137.
Sarkar, S. K. (2006), "False Discovery and False Non-discovery Rates in Single-Step Multiple Testing Procedures," The Annals of Statistics, 34, 394–415.
Shao, Y., and Tseng, C.-H. (2007), "Sample Size Calculation With Dependence Adjustment for FDR-Control in Microarray Studies," Statistics in Medicine, 26, 4219–4237.
Simon, R. M., Korn, E. L., McShane, L. M., Radmacher, M. D., Wright, G. W., and Zhao, Y. (2004), Design and Analysis of DNA Microarray Investigations, New York: Springer.
Storey, J. D. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Series B, 64, 479–498.
Storey, J. D. (2003), "The Positive False Discovery Rate: A Bayesian Interpretation and the Q-value," The Annals of Statistics, 31, 2013–2035.

About the Authors Gengqian Cai is a Statistician, Quantitative Sciences, GlaxoSmithKline, UW2350, 709 Swedeland Road, King of Prussia, PA 19406. Xiwu Lin is Assistant Director, and Kwan Lee is Director, Quantitative Sciences, GlaxoSmithKline, UP4335, 1250 South Collegeville Road, Collegeville, PA 19426. E-mail for correspondence: [email protected].