
WATER RESOURCES RESEARCH, VOL. 45, W09410, doi:10.1029/2008WR007563, 2009


Statistical power calculation and sample size determination for environmental studies with data below detection limits

Quanxi Shao (1) and You-Gan Wang (2)

Received 2 November 2008; revised 16 June 2009; accepted 6 July 2009; published 16 September 2009.

[1] Power calculation and sample size determination are critical in designing environmental monitoring programs. The traditional approach based on comparing the mean values may become statistically inappropriate and even invalid when substantial proportions of the response values are below the detection limits or censored, because strong distributional assumptions have to be made on the censored observations when implementing the traditional procedures. In this paper, we propose a quantile methodology that is robust to outliers and can also handle data with a substantial proportion of below-detection-limit observations without the need to impute the censored values. As a demonstration, we applied the methods to a nutrient monitoring project, which is a part of the Perth Long-Term Ocean Outlet Monitoring Program. In this example, the sample size required by our quantile methodology is, in fact, smaller than that required by the traditional t-test, illustrating the merit of our method.

Citation: Shao, Q., and Y.-G. Wang (2009), Statistical power calculation and sample size determination for environmental studies with data below detection limits, Water Resour. Res., 45, W09410, doi:10.1029/2008WR007563.

(1) CSIRO Mathematical and Information Sciences, Floreat, Western Australia, Australia.
(2) CSIRO Mathematical and Information Sciences, Indooroopilly, Queensland, Australia.

Published in 2009 by the American Geophysical Union.

1. Introduction

[2] In an environmental monitoring program, a key objective of the pilot study is to calculate the statistical power and determine the sample size required for future implementation. To accurately assess the impact on, or change of, environmental conditions at the monitoring sites, we need sufficient statistical power to draw our conclusions. Here the statistical power $1 - \beta$ is the complement of the type II error $\beta$, which is the probability of denying the impact or change when it has in fact occurred. At the same time, to protect ourselves from incorrectly concluding that an impact or change has occurred when it has not, we need a critical value. The probability of reaching this incorrect conclusion is called the type I error and is denoted by $\alpha$. Traditionally, the monitoring sites need to be compared with the so-called reference (or control) sites at which the impact or change has not occurred (or at least is assumed not to have occurred). On the basis of the observations from reference and monitoring sites, a hypothesis test needs to be constructed, and the sample size with specified type I error $\alpha$ and type II error $\beta$ can then be determined [Zar, 2009].

There are at least two major issues in the traditional hypothesis testing procedure which may arise in some studies. Firstly, the comparison is commonly made between the sample means. Because of the limitation of laboratory measuring instruments, concentrations below a certain level (detection limit) cannot be detected and therefore are reported as "below-detection-limit". Such observations are called censored data in statistical analysis. For example, in the Perth Long-Term Ocean Outlet Monitoring Program (PLOOM) discussed in section 3, the detection limits are 3 µg/L for ammonia (amm), and 2 µg/L for ortho-phosphate (orth) and nitrate + nitrite (NO23). When some observations are censored in a data set, the mean value cannot be calculated without distributional assumptions. In practice, one may simply replace the censored observations by a set of fixed values (typically zero or half of the detection limit) or delete the censored observations by trimming methods (such as the winsorized mean [Berthouex and Brown, 1994]). Under the normality assumption on the transformed data (such as the logarithm), one can use the regression method for order statistics [Gilliom and Helsel, 1986], the maximum likelihood method [Cohen, 1959; El-Shaarawi, 1989], or the D-log procedure [Hinton, 1994].

Secondly, the standard formulas are based on normal distribution theory, which can be a very strong assumption. In fact, the distributions of many nutrient concentrations are not approximately normal. The common practice is to produce an approximately normal or symmetric distribution via data transformation [Nakamura and Diaz-Frances, 1994]. However, such a transformation is sometimes difficult to find, and the results for the transformed data do not retain the same statistical meaning as those for the original data. For example, under the assumption that the log-transformed data are normally distributed, the mean value of the transformed data in fact corresponds to the median of the original data. Different transformations are often needed for different response variables, making interpretation of the results even more difficult. Underwood and Chapman [2003] discussed the problems of statistical power and sample size calculation in assessing environmental impact.
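To illustrate the first issue, the following minimal sketch contrasts the substituted mean with an order-based quantile. The data, the single detection limit, and the helper names `substituted_mean` and `censored_quantile` are hypothetical and are not taken from the PLOOM study; the quantile uses a simple inverse-CDF definition chosen only for illustration.

```python
import math
import statistics

DETECTION_LIMIT = 3.0  # hypothetical detection limit (µg/L)

# Hypothetical sample; None marks a below-detection-limit (censored) result.
sample = [None, None, None, 3.2, 3.9, 4.1, 4.8, 5.5, 6.3, 7.4]


def substituted_mean(values, fill):
    """Mean after replacing each censored value by the constant `fill`."""
    return statistics.mean(fill if v is None else v for v in values)


def censored_quantile(values, q):
    """Inverse-CDF sample quantile; None if it lies below the detection limit.

    With a single detection limit, censored values are known to be smaller
    than every detected value, so they can be ranked at the bottom of the
    ordered sample without being imputed.
    """
    detected = sorted(v for v in values if v is not None)
    n_censored = len(values) - len(detected)
    k = max(math.ceil(q * len(values)), 1) - 1  # 0-based rank of the quantile
    if k < n_censored:
        return None  # the quantile itself is censored
    return detected[k - n_censored]


# The mean depends on the arbitrary substitution value ...
for fill in (0.0, DETECTION_LIMIT / 2, DETECTION_LIMIT):
    print(f"fill = {fill:.1f}: mean = {substituted_mean(sample, fill):.2f}")

# ... whereas quantiles above the detection limit do not.
print("median:", censored_quantile(sample, 0.5))
print("80th percentile:", censored_quantile(sample, 0.8))
```

In this sketch the mean changes with every substitution rule, while the median and the 80th percentile are determined by the detected values alone as long as fewer than half of the observations are censored.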
[3] In this paper, we propose a new inference procedure for the statistical power calculation and sample size determination for a data set with a substantial proportion of below-detection-limit observations. Unlike the traditional methods, which rely heavily on weakly justified distributional assumptions, our procedure is based on the comparison between quantiles of monitoring and reference sites. Note that the quantile-based hypotheses do not suffer from the deficiencies of the traditional methods with below-detection-limit observations and nonnormal data. Details are given in section 2. We compare our method with the traditional method based on the normality assumption in section 3. For illustration, in section 4 we apply our method to the Perth Long-Term Ocean Outlet Monitoring Program (PLOOM), which is the motivation for our proposed method. Conclusions and discussion are given in section 5.

2. Statistical Framework

[4] Our quantile method is a modification of the traditional method. To see this, we consider a simple one-sided hypothesis test between monitoring and reference sites. Assume that the reference data are generated from random variable $X$ and the monitoring data from $Y$. Let $\mu_X$ and $\mu_Y$ be the means of random variables $X$ and $Y$, respectively. The simple one-sided hypothesis is given as

$$H_0: \mu_Y = \mu_X \quad \text{vs} \quad H_A: \mu_Y > \mu_X. \qquad (1)$$

Suppose both $X$ and $Y$ are normally distributed with variances $\sigma_X^2$ and $\sigma_Y^2$, respectively. For given $(\sigma_X, \sigma_Y)$ and significance level $\alpha$, the null hypothesis $H_0$ is rejected if $(\bar{Y} - \bar{X}) > Z_{1-\alpha}\,\mathrm{se}(\bar{Y} - \bar{X})$, where $Z_{1-\alpha}$ is the $(1 - \alpha)$ quantile of the standard normal distribution, $\bar{X}$ and $\bar{Y}$ are the sample means of the reference and monitoring sites, respectively, and $\mathrm{se}(\cdot)$ stands for standard error. If $(\sigma_X, \sigma_Y)$ are unknown, $Z$ will be replaced by the $t$-distribution, with the number of degrees of freedom approximated as a function of the sample variances and sample sizes of both reference and monitoring sites [see Satterthwaite, 1946; Welch, 1947].

[5] In order to calculate the required sample size, we need to specify a minimum difference $\delta$ we wish to detect between the monitoring and reference sites. That is,

$$\mu_Y - \mu_X \ge \delta > 0. \qquad (2)$$

Here $\delta$ is a predefined constant, also known as the effect size. It plays an important role in the statistical power calculation and sample size determination.

[6] Note that the mean and median are the same for normal distributions, that is, $\mu_X = \mu_{0.5}$ and $\mu_Y = \xi_{0.5}$. Here $\mu_\tau$ denotes the $\tau$-th quantile of $X$ for the reference site and $\xi_\tau$ the $\tau$-th quantile of $Y$ for the monitoring site. Therefore we can view the above hypothesis as a comparison of quantiles between monitoring and reference sites. To be explicit, we can generalize the above mean comparison to a quantile comparison as

$$H_0: \xi_{0.5} = \mu_{0.5} \quad \text{vs} \quad H_A: \xi_{0.5} > \mu_{0.5}. \qquad (3)$$

Similarly, in order to calculate the required sample size, we have to specify an effect size. Note that the test based on equation (1) is in fact a comparison between the median $\xi_{0.5}$ (or mean $\mu_Y$) of the monitoring site and the $(1 - \alpha)$-th quantile $\mu_X + \sigma_X Z_{1-\alpha}$ of the reference site. Therefore the effect size in equation (2) is $\delta = \sigma_X Z_{1-\alpha}$. A sensible way to generalize equation (1) to the quantile setup is to define $\delta = \mu_q - \mu_{0.5}$ as the effect size with an appropriate $q$. When the difference is at least $\delta$, $\xi_{0.5} - \mu_{0.5} \ge \delta$ is equivalent to

$$\xi_{0.5} \ge \mu_q, \qquad (4)$$

indicating that the median at monitoring sites is at least as large as the $q$-th quantile at reference sites.

[7] We can specify a statistical power (such as 80%) when $\xi_{0.5} = \mu_q$, where $q > 0.5$ is a predefined quantity. For example, $q = 0.8$ is recommended by the ANZECC/ARMCANZ [2000a] guidelines and used in the Perth Long-Term Ocean Outlet Monitoring Program (PLOOM), in which the following procedure is used for nutrient concentrations.

[8] 1. Obtain an estimate $\hat{\mu}_{0.8}$ of the 80th percentile $\mu_{0.8}$ from the reference sites (from $n_X$ independent and identically distributed samples).

[9] 2. Obtain an estimate $\hat{\xi}_{0.5}$ of the median $\xi_{0.5}$ from the monitoring sites (from $n_Y$ independent and identically distributed samples).

[10] 3. Evaluate $\hat{\xi}_{0.5} - \hat{\mu}_{0.8}$.

[11] 4. If $\hat{\xi}_{0.5} - \hat{\mu}_{0.8} \le 0$, that is, the median of the monitoring site does not exceed the 80th percentile of the reference site, we declare compliance. Otherwise, if the median of the monitoring site exceeds the 80th percentile of the reference site, we declare noncompliance and call for further investigation.

[12] Of course, if the median of the monitoring site exceeds the ANZECC water quality guideline value [ANZECC/ARMCANZ, 2000a, 2000b; Shao, 2000], further investigation is needed. However, the comparison with the ANZECC water quality guideline value is not our interest here. The difference may be defined as $\mu_q - \xi_{0.5}$ if a lower concentration of the stressor is desirable.

[13] Statistical power and sample size calculations depend on the test statistic we use. A simple test statistic under the null hypothesis is $\hat{D} = \hat{\xi}_{0.5} - \hat{\mu}_{0.5}$, where $\hat{\mu}_{0.5}$ and $\hat{\xi}_{0.5}$ are the sample medians of the reference and monitoring sites, respectively. We declare a significant difference if $\hat{D}/\mathrm{se}(\hat{D}) > Z_{1-\alpha}$, where $\mathrm{se}(\hat{D})$ is the standard deviation of $\hat{D}$ (often estimated from data) and $Z_{1-\alpha}$ is the critical value, chosen to ensure that the type I error is preserved at $\alpha$.
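The four-step compliance check described above can be sketched as follows. This is only an illustration: the data are hypothetical, the function names `sample_quantile` and `compliance_check` are not from the PLOOM documentation, and the quantile estimator (linear interpolation between order statistics) is one of several common definitions rather than the one prescribed by the guidelines.

```python
import math


def sample_quantile(values, q):
    """Sample quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    h = (len(ordered) - 1) * q
    lower = math.floor(h)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])


def compliance_check(reference, monitoring, q=0.8):
    """Compare the monitoring median with the reference q-th quantile."""
    mu_q_hat = sample_quantile(reference, q)           # step 1
    xi_median_hat = sample_quantile(monitoring, 0.5)   # step 2
    difference = xi_median_hat - mu_q_hat              # step 3
    return difference, difference <= 0                 # step 4: True = compliance


# Hypothetical nutrient concentrations (same units at both site types).
reference = [2.1, 2.8, 3.0, 3.4, 3.9, 4.2, 4.6, 5.1, 5.8, 6.5, 7.0]
monitoring = [3.1, 3.6, 4.0, 4.4, 4.9, 5.2, 5.9, 6.1, 6.8]

difference, compliant = compliance_check(reference, monitoring)
print(f"difference = {difference:.2f}, compliant = {compliant}")
```

Because only order statistics above the detection limit enter the comparison, the check can still be carried out when the lower part of either sample is censored.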
We now obtain the $q$-th quantile $\hat{\mu}_q$ from the reference site on the basis of $n_X$ samples, and the median $\hat{\xi}_{0.5}$ from the monitoring site on the basis of $n_Y$ samples.

[14] Suppose $f(\cdot)$ and $g(\cdot)$ are the density functions of $X$ (for the reference site) and $Y$ (for the monitoring site), respectively. We can write the asymptotic variance of $\hat{D} = \hat{\xi}_{0.5} - \hat{\mu}_{0.5}$ as

$$\sigma^2_{a,D,0} = \left\{4 n_Y g^2(\xi_{0.5})\right\}^{-1} + \left\{4 n_X f^2(\mu_{0.5})\right\}^{-1} \qquad (5)$$

(see Theorem 4.1 of Koenker and Bassett [1978]). Note that $\sigma_{a,D,0}$ is approximately the standard deviation of $\hat{D}$ under the null hypothesis. We will declare a significant difference only when $\hat{D}/\sigma_{a,D,0} > c$. Again, the critical value $c$ is chosen to control the type I error of the null hypothesis (1). For our one-sided test, we declare a significant difference if $\hat{D} > c\,\sigma_{a,D,0}$, with a probability of $\mathrm{pr}(\hat{D}/\sigma_{a,D,0} > c \mid H_0)$ under the null hypothesis.
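As a minimal sketch of how equation (5) feeds the one-sided test, the code below evaluates $\sigma_{a,D,0}$ and the standardized statistic for given inputs. The function names, the sample medians, and the density values are hypothetical; in practice the densities at the medians would have to be estimated (for example by a kernel density estimate), which is not shown here.

```python
from math import sqrt
from statistics import NormalDist


def sigma_null(n_y, n_x, g_at_median, f_at_median):
    """Equation (5): asymptotic standard deviation of D-hat under H0."""
    variance = (1.0 / (4 * n_y * g_at_median ** 2)
                + 1.0 / (4 * n_x * f_at_median ** 2))
    return sqrt(variance)


def one_sided_median_test(xi_median_hat, mu_median_hat, n_y, n_x,
                          g_at_median, f_at_median, alpha=0.05):
    """Return (z, critical value, reject H0?) for the test D-hat / sigma > c."""
    d_hat = xi_median_hat - mu_median_hat
    sigma = sigma_null(n_y, n_x, g_at_median, f_at_median)
    critical = NormalDist().inv_cdf(1 - alpha)  # c = Z_{1-alpha}
    z = d_hat / sigma
    return z, critical, z > critical


# Hypothetical inputs: sample medians, sample sizes, and density estimates.
z, critical, reject = one_sided_median_test(
    xi_median_hat=4.9, mu_median_hat=4.2,
    n_y=30, n_x=30, g_at_median=0.25, f_at_median=0.25,
)
print(f"z = {z:.3f}, critical value = {critical:.3f}, reject H0 = {reject}")
```

The variance shrinks as either sample size or either density value grows, so sparse data around the median make a given difference harder to declare significant.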


2.1. Type I Error

[15] The type I error $\alpha$ is the probability that a significant difference between the reference and monitoring sites is declared when this is in fact not the case (i.e., $H_0$ is true). If we wish to control this probability to be less than $\alpha$ (0.05, say), we have

$$\alpha = \mathrm{pr}\left(\hat{D}/\sigma_{a,D,0} > c \mid H_0\right) = \mathrm{pr}\left\{(\hat{\xi}_{0.5} - \hat{\mu}_{0.5})/\sigma_{a,D,0} > c \mid H_0\right\} = \mathrm{pr}\{Z > c \mid H_0\} = 1 - \Phi(c), \qquad (6)$$

where $Z = (\hat{\xi}_{0.5} - \hat{\mu}_{0.5})/\sigma_{a,D,0} \sim N(0, 1)$, asymptotically. In practice, we estimate $\sigma^2_{a,D,0}$ by $s^2_{D,0}$, replacing each unknown quantity required in (5) by its estimate. The effect of ignoring the uncertainties in estimating $\sigma^2_{a,D,0}$ is to underestimate the required sample size (for sample size determination), but not to a large extent unless the sample size used (in calculation) is really small ( [...]

2.2. Power Function

[17] Suppose that we are interested in detecting a difference when the median concentration $\xi_{0.5}$ at the monitoring sites is at least as high as $\mu_q$ (the $q$-th quantile concentration for the reference site). We will specify the effect size as $\delta = \mu_q - \mu_{0.5}$. The statistical power is the probability of declaring a significant impact or change when the difference is $\delta$ and is given by

$$\mathrm{Power} = \mathrm{pr}\left\{\hat{D}/\sigma_{a,D,0} > c \mid \xi_{0.5} = \mu_q\right\},$$

which is $1 - \beta$, the complement of the type II error $\beta$. A monitoring program with a large Power (or small $\beta$) value will be less likely to miss a real impact or change. Therefore a reasonably large Power value is desirable. Traditionally the Power is chosen as 0.8 or 0.9 for practical sample size determination.

[18] Note that when $\xi_{0.5} = \mu_q$, the variance of $\hat{D} = \hat{\xi}_{0.5} - \hat{\mu}_{0.5}$, given by equation (5), can be written as

$$\sigma^2_{a,D,A} = \left\{4 n_Y g^2(\mu_q)\right\}^{-1} + \left\{4 n_X f^2(\mu_{0.5})\right\}^{-1} \qquad (7)$$

(as $\xi_{0.5} = \mu_q$). Bear in mind that under the null hypothesis $\xi_{0.5} = \mu_{0.5}$, $\sigma^2_{a,D,0}$ is given by equation (5).

[19] It follows that when $\xi_{0.5} = \mu_q$, the Power function is

$$\mathrm{Power}(\mu_q) = \mathrm{pr}\left\{\hat{\xi}_{0.5} - \hat{\mu}_{0.5} > c\,\sigma_{a,D,A} \mid \xi_{0.5} = \mu_q\right\} = \mathrm{pr}\left\{Z > c - (\mu_q - \mu_{0.5})/\sigma_{a,D,A} \mid \xi_{0.5} = \mu_q\right\} = \Phi\left(\frac{\mu_q - \mu_{0.5}}{\sigma_{a,D,A}} - c\right),$$

where $Z = \{\hat{\xi}_{0.5} - \mu_q - (\hat{\mu}_{0.5} - \mu_{0.5})\}/\sigma_{a,D,A} \sim N(0, 1)$. As we can see, when $q = 0.5$ (i.e., under the null hypothesis), this Power coincides with the type I error, as the median $\xi_{0.5} = \mu_{0.5}$.

2.3. Sample Size Determination

[20] It is possible to control the statistical power to be at least 80%, say, when $\xi_{0.5} \ge \mu_q$ ($q \ge 0.5$, and traditionally $q$ can be 0.8 or 0.9). If we are only interested in testing whether the monitoring sites are within a predetermined range [Armitage and Berry, 1987], we can simply carry out a significance test using 50% of the range width as the effect size. In this case, we only need to control the type I error. However, a larger sample size will be necessary to achieve this statistical power requirement while preserving the significance level given in equation (6).

[21] If the cutoff value $c$ is chosen so that the significance level is $\alpha$, we have $c = Z_{1-\alpha}$. Once the critical value is determined, the statistical power will be a function of the sample size. Therefore, for a given Power, we can inversely work out the necessary sample size as

$$n_Y = \left(\frac{Z_{1-\alpha} + Z_{1-\beta}}{\mu_q - \mu_{0.5}}\right)^2 \left\{\frac{1}{4 g^2(\mu_q)} + \frac{1}{4 R f^2(\mu_{0.5})}\right\}, \qquad (8)$$

where $R = n_X/n_Y$ is the sample size ratio of the reference to monitoring sites. The sample size for the reference sites is $n_X = R\,n_Y$. Alternatively, for a given sample size we can evaluate the Power.

[22] The standard normal quantiles at 80% and 90% are 0.84 and 1.28, respectively. For a 5% significance level ($\alpha = 0.05$) at the null hypothesis and 80% statistical power at the alternative hypothesis $\mu_q = \mu_{0.8}$, one can calculate the corresponding sample size for the monitoring sites as $n_Y =$ [...]

[...] $\mu_q$, where $q > 0.7$. It would be interesting to investigate different hypotheses and assess their performances. In practice, it is always a challenge to ensure that our quantile formulation is appropriate for the monitoring program to be established.

[47] Acknowledgments. We wish to thank the Water Corporation of Western Australia for allowing us to use the PLOOM data and Oceanica Consulting Pty. Ltd. for the constructive comments during our data analysis. This research is also supported by the Water Information Research and Development Alliance (WIRADA) between the Australian Bureau of Meteorology (BoM) Water Division and the CSIRO Water for a Healthy Country Flagship Program. Thanks also to Brent Henderson for his helpful comments during CSIRO's internal review process, and to the editor Tom Torgersen, the associate editor Richard Katz, and two anonymous referees for their constructive comments that led to the much improved quality of the paper.
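Returning to the power function and the sample size formula of sections 2.2 and 2.3, the following sketch evaluates equations (7) and (8) numerically. It is an illustration under stated assumptions, not the calculation reported for PLOOM: the function names are hypothetical, the reference distribution is taken to be standard normal, and the monitoring distribution is assumed to be the same shape shifted so that its median equals $\mu_{0.8}$.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sigma_alt(n_y, n_x, g_at_mu_q, f_at_median):
    """Equation (7): asymptotic standard deviation of D-hat when xi_0.5 = mu_q."""
    return sqrt(1.0 / (4 * n_y * g_at_mu_q ** 2)
                + 1.0 / (4 * n_x * f_at_median ** 2))


def power(effect, n_y, n_x, g_at_mu_q, f_at_median, alpha=0.05):
    """Power function: Phi(effect / sigma_alt - c) with c = Z_{1-alpha}."""
    c = STD_NORMAL.inv_cdf(1 - alpha)
    sigma = sigma_alt(n_y, n_x, g_at_mu_q, f_at_median)
    return STD_NORMAL.cdf(effect / sigma - c)


def required_n_y(effect, g_at_mu_q, f_at_median, ratio=1.0,
                 alpha=0.05, target_power=0.8):
    """Equation (8): monitoring sample size n_Y, with ratio R = n_X / n_Y."""
    z_sum = STD_NORMAL.inv_cdf(1 - alpha) + STD_NORMAL.inv_cdf(target_power)
    spread = (1.0 / (4 * g_at_mu_q ** 2)
              + 1.0 / (4 * ratio * f_at_median ** 2))
    return (z_sum / effect) ** 2 * spread


# Illustrative assumption: X ~ N(0, 1), and Y is X shifted so that its
# median sits at mu_0.8; both densities are then evaluated at a median.
effect = STD_NORMAL.inv_cdf(0.8) - STD_NORMAL.inv_cdf(0.5)  # mu_0.8 - mu_0.5
density = STD_NORMAL.pdf(0.0)

n_exact = required_n_y(effect, density, density, ratio=1.0)
n_y = ceil(n_exact)
print(f"n_Y from equation (8): {n_exact:.2f}, rounded up to {n_y}")
print(f"power at n_Y = n_X = {n_y}: {power(effect, n_y, n_y, density, density):.3f}")
```

Under these assumptions equation (8) gives roughly 27.4, so 28 samples per site type; plugging the unrounded value back into the power function returns the target power, and setting the effect size to zero returns $\alpha$, matching the remark that the Power coincides with the type I error when $q = 0.5$.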

References

ANZECC/ARMCANZ (2000a), Australian and New Zealand Guidelines for Fresh and Marine Water Quality, National Water Quality Management Strategy Pap. 4, Australia and New Zealand Environment and Conservation Council and Agriculture and Resource Management Council of Australia and New Zealand, Canberra, ACT.

ANZECC/ARMCANZ (2000b), Australian and New Zealand Guidelines for Fresh and Marine Water Quality, National Water Quality Management Strategy Pap. 7, Australia and New Zealand Environment and Conservation Council and Agriculture and Resource Management Council of Australia and New Zealand, Canberra, ACT.

Armitage, P., and G. Berry (1987), Statistical Methods in Medical Research, 2nd ed., Blackwell, Oxford, U. K.

Berthouex, P. M., and L. C. Brown (1994), Statistics for Environmental Engineers, Lewis Publishers, Boca Raton, Fla.

Cohen, A. C. (1959), Simplified estimators for the normal distribution when samples are singly censored or truncated, Technometrics, 1(3), 217-237.

El-Shaarawi, A. H. (1989), Inferences about the mean from censored water quality data, Water Resour. Res., 25(4), 685-690.

Garsd, A., G. E. Ford, G. O. Waring III, and L. S. Rosenblatt (1983), Sample size for estimating the quantiles of endothelial cell-area distribution, Biometrics, 39, 385-394.

Gilliom, R. J., and D. R. Helsel (1986), Estimation of distributional parameters for censored trace level water quality data: I. Estimation techniques, Water Resour. Res., 22(2), 135-146.

Hinton, S. W. (1994), Statistical procedures for addressing nondetect results in environmental analyses, Tappi J., 77(4), 83-90.

Hyndman, R. J., and Y. Fan (1996), Sample quantiles in statistical packages, Am. Stat., 50, 361-365.

Koenker, R., and G. Bassett (1978), Regression quantiles, Econometrica, 46, 33-50.

Nakamura, M., and E. Diaz-Frances (1994), Transformation to symmetry for censored data caused by detection limit, Environmetrics, 5, 399-416.

Satterthwaite, F. E. (1946), An approximate distribution of estimates of variance components, Biom. Bull., 2, 110-114.

Shao, Q. (2000), Estimation for hazardous concentrations based on NOEC toxicity data: An alternative approach, Environmetrics, 11, 583-595.

Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall, New York.

Underwood, A. J., and M. G. Chapman (2003), Power, precaution, type II error and sampling design in assessment of environmental impacts, J. Exp. Mar. Biol. Ecol., 296(1), 49-70.

Wang, Y.-G., and M. Zhu (2006), Rank-based regression for analysis of repeated measures, Biometrika, 93, 459-464.

Welch, B. L. (1947), The generalization of "Student's" problem when several different population variances are involved, Biometrika, 34, 28-35.

Zar, J. H. (2009), Biostatistical Analysis, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ.

 

Q. Shao, CSIRO Mathematical and Information Sciences, Leeuwin Centre, 65 Brockway Road, Floreat, WA 6014, Australia.

Y.-G. Wang, CSIRO Mathematical and Information Sciences, 120 Meiers Road, Indooroopilly, Qld 4068, Australia.
