Environmental Management https://doi.org/10.1007/s00267-018-1037-2

On Abandoning Hypothesis Testing in Environmental Standard Compliance Assessment

Song S. Qian1 · Robert J. Miltner2


Received: 7 March 2017 / Accepted: 27 March 2018 © Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
We use basic characteristics of statistical significance tests to argue for abandoning hypothesis testing in environmental standard (or criterion) compliance assessment. The typical sample size used for environmental assessment is small, and the natural variation of many water quality constituent concentrations is high. These conditions lead to low statistical power of the hypothesis tests used in the assessment process. As a result, hypothesis testing is often inefficient at detecting noncompliance, and when a noncompliance is detected, it is frequently due to sampling or other types of error. We illustrate these problems using two examples, through which we argue that the problems cannot be resolved under the current practice of assessing compliance one water at a time. We recommend that the hypothesis testing framework be replaced by a statistical estimation approach, which can more effectively leverage information from assessments of similar waters using a probabilistic assessment approach.

Keywords: 303(d) listing · Compliance assessment · Nutrient criteria · Statistics



Introduction

The U.S. Clean Water Act (CWA) requires that states periodically submit a list of impaired waters, waters that are “too polluted or otherwise degraded to meet water quality standards.” This requirement is part of Section 303(d) of the CWA, and the process of compiling the list is known as 303(d) listing. States are further required under the CWA to establish priority rankings for waters on the lists and to develop total maximum daily loads (TMDLs) for these waters. As with all U.S. laws, the U.S. government is responsible for developing rules and regulations to implement the law, including setting specific standards for compliance assessment. Enforcement actions take effect once a noncompliance is identified.

* Song S. Qian, [email protected]

1 Department of Environmental Sciences, The University of Toledo, 2801 West Bancroft Street, MS# 604, Toledo, OH 43606-3390, USA

2 Ohio Environmental Protection Agency, 4675 Homer-Ohio Lane, Groveport, OH 43125, USA



Consequently, compliance assessment is often the most important part of environmental management. The U.S. Environmental Protection Agency (EPA) is the government agency responsible for the CWA. To implement Section 303(d), EPA issued a series of documents clarifying its requirements. These documents, reviewed by Keller and Cavallaro (2008), do not provide states with a specific methodology for assessing the water quality of a waterbody. As a result, states have “resorted to developing their own methodology for collecting and evaluating water quality data to determine if water quality standards are exceeded” (Keller and Cavallaro 2008). The first question in developing an assessment method is the interpretation of the standard; specifically, whether or not the standard is a measure of the mean concentration. Stephan et al. (1985) explained that a water quality standard (WQS) for a toxicant is determined by a dose–response model of how selected aquatic organisms respond to a number of different concentrations of the toxicant. Because the experimental organisms were exposed to a constant concentration while the pollutant concentration fluctuates naturally in a waterbody, the lab-derived standard should be compared to a mean concentration in the field. This line of thinking led to the discussion of the three aspects of a WQS—magnitude, duration, and frequency (Qian 2015).

That a WQS refers to the mean concentration is also reflected in legal interpretation in at least one U.S. court decision (Qian 2015). Because the mean concentration of a pollutant cannot be directly measured and an estimate from a limited number of samples is always uncertain, EPA used a conservative “10% rule,” requiring that a water be listed as impaired if more than 10% of the samples exceed the WQS. Smith et al. (2001) discussed EPA’s raw score method from a statistical decision (decision under uncertainty) perspective, particularly the inevitably high probabilities of making two types of errors based on the Neyman-Pearson lemma. Accordingly, Smith et al. (2001) proposed that compliance assessments be based on statistical hypothesis testing to minimize the chances of making these two types of errors. Specifically, a binomial test should be used, with “10% of the samples” interpreted as “10% of the time;” that is, a binomial test with the null hypothesis that the probability of water quality criterion violation is less than or equal to 0.1. The change from “10% of the samples” to “10% of the time” interprets a WQS as the 90th percentile of the pollutant concentration distribution, which contradicts the traditional interpretation of a WQS (Qian 2015).

Many states accepted the interpretation of Smith et al. (2001) and developed hypothesis-testing-based methods for determining whether a waterbody is in compliance with a WQS. For example, Florida’s Administrative Code 62-303 (Identification of Impaired Surface Waters) requires a binomial test to ensure that a WQS is not exceeded more than 10% of the time. Colorado compares the 85th percentile of a chemical pollutant’s concentration distribution to the respective WQS. California specifies that a binomial test must be used in assessing the attainment of almost all WQSs, except bacteria (fecal coliform). In Kansas, a binomial test is required if the raw score test fails (more than 10% of the samples exceed the standard). Most of these requirements are based on a statistical null hypothesis test of an upper quantile of the concentration distribution of a pollutant. See Keller and Cavallaro (2008) for more details.

Two important issues related to the use of hypothesis testing on an upper percentile of a concentration distribution have been ignored: the lack of statistical confidence in estimating an upper quantile of a probability distribution, and the volatile nature of a hypothesis testing result, especially when the test has low statistical power. These two issues are related in the context of 303(d) listing: when testing a quantile of a distribution farther from the median, the power of the test is reduced compared to testing the mean.
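To make the contrast between the two decision rules concrete, the following minimal sketch (Python with scipy; the monitoring record is hypothetical) applies both the raw score method and a binomial test of the kind proposed by Smith et al. (2001) to the same data:

```python
from scipy import stats

# Hypothetical monitoring record: 2 of 10 samples exceed the WQS.
n, exceedances = 10, 2

# Raw score method: list the water if more than 10% of samples exceed the WQS.
raw_score_impaired = exceedances / n > 0.10   # True: the water is listed

# Binomial test of H0: exceedance frequency <= 0.1 vs. Ha: frequency > 0.1.
result = stats.binomtest(exceedances, n, p=0.10, alternative="greater")
binomial_impaired = result.pvalue < 0.05      # p ~ 0.26: the water is not listed

print(raw_score_impaired, binomial_impaired)
```

With 2 exceedances in 10 samples, the raw score method lists the water while the binomial test does not (p ≈ 0.26), illustrating how the two rules can disagree on identical data.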

Many authors have recognized that routine monitoring programs often do not generate enough data to adequately describe the status of a water. For example, Lazzarotto et al. (2005) pointed out that spot sampling of water quality is imperfect because concentrations are often event-driven; Yoder and Rankin (1998) concluded that condition assessments based on chemical monitoring alone underestimate the amount of impaired water compared to monitoring programs that include a biological component. Some states in the U.S. have used biological monitoring programs to develop nutrient criteria. However, compliance assessment of these biologically based nutrient criteria is still largely based on hypothesis testing, because screening individual parameters for exceedances against a WQS is less complex than identifying causal stressors (Townsend et al. 2008).

In this paper, we first discuss the technical difficulties of using hypothesis testing in the 303(d) listing process. We argue that these problems render the hypothesis testing approach undesirable. We then propose a probabilistic-assessment-based alternative, which is suitable not only for simple compliance testing problems but also for broader application to managing pollution, especially for parameters for which a standard does not yet exist.

Statistical Issues of Using Hypothesis Testing

Statistical issues arise because of the low statistical power often associated with the tests used for 303(d) listing. Halsey et al. (2015) discussed the problem of a low power test in the context of obtaining reproducible results. At issue is the volatile nature of a hypothesis testing result, because the p-value is a random variable. When a test has low statistical power (e.g., because of a small sample size and/or a large natural variability), the p-value tends to have a large variation, such that the test is inherently unstable (and hence undesirable). First, low power means that the test is unlikely to detect, in the context of this paper, the non-attainment of the water under study. A statistically significant result (confirming non-attainment) is inevitably associated with a sample mean concentration that is much higher than the WQS. We use a numeric example to illustrate the problem. A typical low power test is characterized by an effect size that is small relative to the standard error of the sample mean. In a one-sample t-test of H0: μ ≤ 0 vs. Ha: μ > 0, the test has a power of 0.14 to detect a true mean (or effect size) of 0.1 (the alternative hypothesis mean) when n = 10 and the population standard deviation is σ = 0.5 (a standard error of σ/√n ≈ 0.16). That is, when the effect size is 0.1, we will be able to reject the null hypothesis only 14% of the time, an undesirably low success rate. Second, a low power test (e.g., H0: μ ≤ 0 vs. Ha: μ = 0.1, n = 10, and σ = 0.5) will return a statistically significant result only when the sample mean is much larger than the alternative (or true) mean (in this case, x̄ ≥ 0.29). In this simple t-test example, the sample mean must be almost three times as large as the alternative mean to be statistically significant.
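Both numbers in this example, the power of 0.14 and the rejection threshold of roughly 0.29, can be reproduced directly; a minimal sketch (Python with scipy) of the one-sided t-test power calculation under the stated n, σ, and effect size:

```python
import numpy as np
from scipy import stats

n, sigma, effect, alpha = 10, 0.5, 0.1, 0.05
se = sigma / np.sqrt(n)              # standard error of the mean, ~0.16
df = n - 1
t_crit = stats.t.ppf(1 - alpha, df)  # one-sided rejection threshold on the t scale

# Smallest sample mean that rejects H0: mu <= 0 (here ~0.29, almost
# three times the alternative mean of 0.1).
xbar_min = t_crit * se

# Power to detect a true mean of 0.1: probability that the test statistic
# exceeds t_crit under the noncentral t distribution (~0.14).
power = stats.nct.sf(t_crit, df, effect / se)

print(xbar_min, power)
```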

Although “three times as large” is specific to the σ and n values used in this example, the point of the calculation is that a much larger sample mean is needed to reject the null hypothesis when the test has low power. In many cases, such a large sample mean concentration would be driven by data points we would otherwise treat as outliers. In other words, whatever the outcome of the test, we are left in doubt: if the test concludes attainment (the null is not rejected), it may be because of low power; if the test concludes non-attainment (the null is rejected), the implausibly large sample mean may lead us to believe that the result is due to sampling error. Although we used a simple t-test as an example, the problem of hypothesis testing under low power is the same no matter which test we use. The binomial test commonly used for 303(d) listing is no exception. A particular problem with the binomial test used for 303(d) listing is that we are testing an upper quantile, which translates into a small probability of success in the binomial test. With a small probability of success, we need a relatively large number of observed successes to reject the null hypothesis. For example, the raw score method would list a water as impaired with 1 exceedance in 10 samples. When using the binomial test with a probability of success of 0.1, we need more than 3 exceedances in 10 samples to reject the null hypothesis at a significance level of 0.05. This is because of the relatively high likelihood of observing one (0.39) or two (0.19) exceedances under the null model (the frequency of exceeding the WQS is less than or equal to 10%). We need more than three times the single exceedance expected (10% of 10 samples) under the null hypothesis to cast doubt on it. If the true exceedance probability is 0.15, the power of the test is only about 5% for n = 10 and 21% for n = 50 (Fig. 1). A typical 303(d) listing problem often has a sample size much smaller than 50. Consequently, the binomial test is almost always unsatisfactory.
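A sketch of this power calculation (Python with scipy; the helper function is ours): it finds the smallest number of exceedances that rejects the one-sided binomial null hypothesis and then evaluates the probability of observing at least that many under the true exceedance rate.

```python
from scipy import stats

def binomial_power(n, p_true, p0=0.1, alpha=0.05):
    """Power of the one-sided exact binomial test of H0: p <= p0."""
    # Smallest number of exceedances k with P(X >= k | p0) <= alpha;
    # binom.sf(k - 1, n, p) is P(X >= k).
    k = 1
    while stats.binom.sf(k - 1, n, p0) > alpha:
        k += 1
    # Power: probability of at least k exceedances when the true rate is p_true.
    return k, stats.binom.sf(k - 1, n, p_true)

print(binomial_power(10, 0.15))  # (4, ~0.05): 4 exceedances needed, power ~5%
print(binomial_power(50, 0.15))  # (10, ~0.21): 10 exceedances needed, power ~21%
```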

Fig. 1 Power of a binomial test as a function of sample size and effect size (the difference between the actual exceedance rate and the null hypothesis rate of 0.1). The line labeled “n = 3, p = 0.2” shows the power of the binomial test used by the Florida Department of Environmental Protection

The high threshold for rejecting the null hypothesis under low power may have created the impression that states chose hypothesis testing over the raw score method to reduce the number of listed waters. However, the case for using hypothesis testing made by Smith et al. (2001) was convincing, and the weakness of the raw score method was thoroughly discussed, which led to a textbook case study in Qian (2016). An unintended problem of using the binomial test is the potential for over-protection. When using the binomial test, we compare the 90th percentile of the concentration distribution to the WQS. The 90th percentile is a function of both the mean and the standard deviation. For the 90th percentile to be at or below a WQS, the mean must be far less than the WQS. If we use the common assumption that a concentration variable follows a log-normal distribution, the 90th percentile in the natural logarithmic scale is the log-mean plus 1.28σ, where σ is the log-standard deviation. If a WQS is 20 μg/L, a mean of 13.8 μg/L or less is necessary for a distribution with a coefficient of variation (cv) of 20% (σ = 0.2) to be deemed in compliance. For many water quality constituents, a cv of 100% (or σ = 0.83) is not uncommon. In that case, for a water to achieve compliance, the mean concentration must be at or below 6.88 μg/L, about one-third of the WQS (Fig. 2).
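The arithmetic behind Fig. 2 can be sketched as follows (Python; the function name is ours, and it computes the largest median, i.e., geometric mean, of a log-normal concentration distribution whose 90th percentile still meets the WQS):

```python
import numpy as np

def compliant_median(wqs, sigma_log, z90=1.28):
    """Largest median of a log-normal distribution whose 90th percentile
    is at or below the WQS."""
    # In natural log scale, the 90th percentile is log-mean + 1.28 * sigma,
    # so compliance requires log-mean <= log(WQS) - 1.28 * sigma.
    return np.exp(np.log(wqs) - z90 * sigma_log)

print(compliant_median(20, 0.83))  # ~6.9 ug/L, roughly one-third of the WQS
```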

Although these problems of statistical null hypothesis testing are well known, their consequences in environmental standard compliance assessment are unfamiliar to many in the field.

Examples

Ohio’s Biological Assessment Method—A Classical Low Power Test Case

Fig. 2 In order for the 90th percentile of the concentration distribution to meet the standard, the median of the distribution is a function of the distribution variance

Although biological measurements (often represented by counts of the relative composition of macroinvertebrate taxa for stream and river assessment) are made with error, the variation in taxon composition is relatively small compared with the variation in grab-sampled chemical concentrations. Consequently, including biological indicators can identify 50% more impairment than a water chemistry approach alone (Yoder and Rankin 1998). As a result, Ohio employs biological criteria as the arbiter of whether or not a waterbody is impaired and uses chemical standards to support 303(d) listings for impaired waters, as endpoints in TMDL estimations, and to protect beneficial uses (e.g., to impose permitted discharge limits by invoking reasonable potential). We reviewed data from four surveys conducted in various parts of Ohio in the last several years. Of the 300 sites surveyed, 177 were judged impaired using biological indicators. Of the impaired sites, 49 (42%) had no exceedances of chemical WQSs. In statistical terms, the relatively small variability in biological indicators results in higher statistical power when assessing compliance using biological indicators.

Furthermore, most numeric chemical standards are based on either toxicological or bioaccumulative endpoints. However, because primary nutrients like phosphorus and oxidized nitrogen are not toxic at concentrations typically discharged to the environment, protective endpoints for nutrients have been developed by retrospectively examining the association between biological indicators and nutrient concentrations in large data sets (Miltner and Rankin 1998; Wang et al. 2007; Chambers et al. 2012), and by tracing the causal pathway between stressor and response indicators in focused studies (Miltner 2010; Chambers et al. 2012). Although these studies point to similar biologically relevant concentrations for phosphorus (~0.05 mg/L) and nitrogen (~1.0 mg/L), the effect size on biological communities in the aggregate is modest. That is, the differences in mean concentrations between biologically impaired and unimpaired sites are small. Furthermore, nutrient concentrations at a given site are highly variable through time. A small effect size and a large variance inevitably lead to a test with low power. As a result, formulating and administering a WQS based on a hypothesis testing approach is challenging.

Miltner (2010) used data from small rivers and streams in Ohio to derive nutrient criteria. Under Ohio’s biological criteria, based on the invertebrate community index (ICI) and EPT taxa richness (EPT, the number of Ephemeroptera, Plecoptera, and Trichoptera taxa), a stream is classified as biologically in compliance when EPT > 10 and ICI > 46, and out of compliance otherwise. The mean concentrations of dissolved inorganic nitrogen (DIN) are 1.1 mg/L for streams that are in compliance and 1.7 mg/L for streams that are out of compliance.
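The low power implied by these summaries can be made concrete with a short calculation (Python with scipy; the log-scale summaries are those reported for DIN below, while the sample size of 20 streams per group and the helper function are our hypothetical illustration):

```python
import numpy as np
from scipy import stats

# Log-scale DIN summaries from the text: in-compliance vs. out-of-compliance streams.
mu_in, mu_out, sd_log = -0.3, -0.11, 1.0
d = (mu_out - mu_in) / sd_log        # standardized effect size, ~0.19

def two_sample_power(n, d, alpha=0.05):
    """Power of a one-sided two-sample t-test with n observations per group."""
    df = 2 * n - 2
    t_crit = stats.t.ppf(1 - alpha, df)
    # Noncentrality parameter for equal group sizes: d * sqrt(n / 2).
    return stats.nct.sf(t_crit, df, d * np.sqrt(n / 2))

print(two_sample_power(20, d))   # ~0.14: a classic low power test
```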

Fig. 3 Overlapping histograms compare the DIN concentration distributions from streams that are in compliance with Ohio’s biological criteria (dark shaded histogram) and the same from streams that are out of compliance (light shaded). The dotted and solid lines are empirically estimated density functions of the respective concentration distributions

However, because of the large variance of DIN concentrations (standard deviations of 1.1 and 2.0 mg/L, respectively), the distributions of DIN concentrations from these two groups of streams overlap substantially (Fig. 3). Suppose that we derive a DIN criterion based on the DIN distribution of streams that are in compliance. In natural log scale, the DIN concentration distribution has a mean of −0.3 and a standard deviation of 1.0. The log-mean of DIN from streams that are out of compliance is −0.11 (with a log-standard deviation of about 1). The difference in the two log-means (the effect size) is