Statistical hypothesis testing

Testing Statistical Hypotheses

In statistical hypothesis testing, the basic problem is to decide whether or not to reject a statement about the distribution of a random variable. The statement must be expressible in terms of membership in a well-defined class. The hypothesis can therefore be expressed by the statement that the distribution of the random variable X is in the class PH = {Pθ : θ ∈ ΘH}. An hypothesis of this form is called a statistical hypothesis. Testing is a statistical decision problem.

Issues

• optimality of tests: most powerful
• Neyman-Pearson Fundamental Lemma: the optimal procedure for testing one simple hypothesis versus another simple hypothesis
• uniformly optimal
  ∗ impose restrictions, such as unbiasedness or invariance
    · find optimal tests under those restrictions
  ∗ define uniformity in terms of a global averaging

Issues

• general methods for constructing tests
• asymptotic properties of likelihood ratio tests
• nonparametric tests
• sequential tests
• multiple tests

Statistical Hypotheses

We are given (or assume) a broad family of distributions, P = {Pθ : θ ∈ Θ}. As in other problems in statistical inference, the objective is to decide whether the given observations arose from some subset of distributions PH ⊂ P. The statistical hypothesis is a statement of the form "the family of distributions is PH", where PH ⊂ P, or perhaps "θ ∈ ΘH", where ΘH ⊂ Θ.

Statistical Hypotheses

The full statement consists of two pieces, one part an assumption, "assume the distribution of X is in the class P", and the other part the hypothesis, "θ ∈ ΘH, where ΘH ⊂ Θ". Given the assumptions and the definition of ΘH, we often denote the hypothesis as H, and write it as H : θ ∈ ΘH.

Two Hypotheses

While, in general, to reject the hypothesis H would mean to decide that θ ∉ ΘH, it is generally more convenient to formulate the testing problem as one of deciding between two statements:

  H0 : θ ∈ Θ0 and H1 : θ ∈ Θ1,

where Θ0 ∩ Θ1 = ∅. We do not treat H0 and H1 symmetrically; H0 is the hypothesis (or "null hypothesis") to be tested and H1 is the alternative. This distinction is important in developing a methodology of testing.

Tests of Hypotheses

To test the hypotheses means to choose one hypothesis or the other; that is, to make a decision, d. We have a sample X from the relevant family of distributions and a statistic T(X). A nonrandomized test procedure is a rule δ(X) that assigns two decisions to two disjoint subsets, C0 and C1, of the range of T(X). We equate those two decisions with the real numbers 0 and 1, so δ(X) is a real-valued function,

  δ(x) = 0 for T(x) ∈ C0,
         1 for T(x) ∈ C1.

Note that for i = 0, 1, Pr(δ(X) = i) = Pr(T(X) ∈ Ci).

We call C1 the critical region, and generally denote it by just C. If δ(X) takes the value 0, the decision is not to reject; if δ(X) takes the value 1, the decision is to reject. If the range of δ(X) is {0, 1}, the test is a nonrandomized test. Sometimes it is useful to choose the range of δ(X) as some other set of real numbers, such as {d0, d1 } or even a set with cardinality greater than 2. If the range is taken to be the closed interval [0, 1], we can interpret a value of δ(X) as the probability that the null hypothesis is rejected. If it is not the case that δ(X) equals 0 or 1 a.s., we call the test a randomized test.

Errors in Decisions Made in Testing

There are four possibilities in a test of an hypothesis: the hypothesis may be true, and the test may or may not reject it, or the hypothesis may be false, and the test may or may not reject it. The result of a statistical hypothesis test can be incorrect in two distinct ways: it can reject a true hypothesis or it can fail to reject a false hypothesis. We call rejecting a true hypothesis a "type I error", and failing to reject a false hypothesis a "type II error".

Errors

Our standard approach in hypothesis testing is to control the level of the probability of a type I error under the assumptions, and to try to find a test subject to that level that has a small probability of a type II error. We call the maximum allowable probability of a type I error the "significance level", and usually denote it by α. We call the probability of rejecting the null hypothesis the power of the test, and will denote it by β. If the alternative hypothesis is the true state of nature, the power is one minus the probability of a type II error. It is clear that we can easily decrease the probability of one type of error (if its probability is positive) at the cost of increasing the probability of the other.

Errors

In a common approach to hypothesis testing under the given assumptions on X, we choose α ∈ ]0, 1[ and require that δ(X) be such that

  Pr(δ(X) = 1 | θ ∈ Θ0) ≤ α,

and, subject to this, find δ(X) so as to minimize

  Pr(δ(X) = 0 | θ ∈ Θ1).

Optimality of a test δ is defined in terms of this constrained optimization problem. Notice that the restriction on the type I error applies ∀θ ∈ Θ0. We call

  sup_{θ∈Θ0} Pr(δ(X) = 1 | θ)

the size of the test.

Errors

In common applications, Θ0 ∪ Θ1 forms a connected region in IR^k, Θ0 contains the set of common closure points of Θ0 and Θ1, and Pr(δ(X) = 1 | θ) is a continuous function of θ; hence the sup is generally a max. If the size is less than the level of significance, the test is said to be conservative, and in that case we often refer to α as the "nominal size".

Example 1 Testing in an exponential distribution

Suppose we have observations X1, . . . , Xn i.i.d. as exponential(θ). The Lebesgue PDF is

  pθ(x) = θ^{-1} e^{-x/θ} I]0,∞[(x),

with θ ∈ ]0, ∞[. Suppose now we wish to test

  H0 : θ ≤ θ0 versus H1 : θ > θ0.

We know that X̄ is sufficient for θ. A reasonable test may be to reject H0 if T(X) = X̄ > c, where c is some fixed positive constant; that is, δ(X) = I]c,∞[(T(X)).

Knowing the distribution of X̄ to be gamma(n, θ/n) (shape n, scale θ/n), we can now work out Pr(δ(X) = 1 | θ) = Pr(T(X) > c | θ), which, for θ ≤ θ0, is the probability of a type I error. For θ > θ0, 1 − Pr(δ(X) = 1 | θ) is the probability of a type II error. These probabilities, as functions of θ, are shown in the figure below.

[Figure: Performance of the test, showing, as functions of θ over the H0 region (θ ≤ θ0) and the H1 region (θ > θ0), the probability of a type I error, the probability of a type II error, and the probability of a correct rejection.]

Now, for a given significance level α, we choose c so that Pr(T(X) > c | θ ≤ θ0) ≤ α. This is satisfied for c such that, for a random variable Y with a gamma(n, θ0/n) distribution, Pr(Y > c) = α.
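As a concrete illustration (not part of the original notes), the critical value can be computed in R; the values of theta0, n, and alpha below are hypothetical.

```r
## Critical value for H0: theta <= theta0 vs H1: theta > theta0, based on
## T(X) = Xbar; at theta = theta0, Xbar ~ gamma(shape = n, scale = theta0/n).
## theta0, n, and alpha are illustrative values.
theta0 <- 2
n      <- 10
alpha  <- 0.05

## choose c so that Pr(Xbar > c | theta0) = alpha
c_crit <- qgamma(1 - alpha, shape = n, scale = theta0 / n)

## the test: reject H0 (return 1) iff the sample mean exceeds c_crit
delta <- function(x) as.integer(mean(x) > c_crit)
```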

p-Values

Note that there is a difference between choosing the test procedure and using the test. The question of the choice of α arises again. Does it make sense to choose α first, and then proceed to apply the test just to end up with a decision d0 or d1? It is not likely that this rigid approach would be very useful for most objectives. In statistical data analysis our objectives are usually broader than just deciding which of two hypotheses appears to be true. On the other hand, if we have a well-developed procedure for testing the two hypotheses, the decision rule in this procedure could be very useful in data analysis.

p-Values

One common approach is to use the functional form of the rule, but not to pre-define the critical region; then, given the same setup of null hypothesis and alternative, to collect data X = x and determine the smallest value α̂(x) at which the null hypothesis would be rejected. The value α̂(x) is called the p-value of x associated with the hypotheses. The p-value indicates the strength of the evidence of the data against the null hypothesis.

Example 2 Testing in an exponential distribution; p-value

Consider again the problem where we had observations X1, . . . , Xn i.i.d. as exponential(θ), and wished to test

  H0 : θ ≤ θ0 versus H1 : θ > θ0.

Our test was based on T(X) = X̄ > c, where c was some fixed positive constant chosen so that Pr(Y > c) = α, where Y is a random variable distributed as gamma(n, θ0/n). Suppose instead of choosing c, we merely compute Pr(Y > x̄), where x̄ is the mean of the set of observations. This is the p-value for the null hypothesis and the given data. If the p-value is less than a prechosen significance level α, then the null hypothesis is rejected.
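A minimal R sketch of this p-value computation (the simulated data and theta0 are hypothetical):

```r
## p-value for H0: theta <= theta0 in the exponential example;
## under theta = theta0, Xbar ~ gamma(shape = n, scale = theta0/n)
pval_exp <- function(x, theta0) {
  n <- length(x)
  pgamma(mean(x), shape = n, scale = theta0 / n, lower.tail = FALSE)
}

## illustrative usage with simulated data
set.seed(1)
x <- rexp(10, rate = 1 / 3)   # a sample with true theta = 3
pval_exp(x, theta0 = 2)
```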

Example 3 Sampling in a Bernoulli distribution; p-values and the likelihood principle revisited

We have considered the family of Bernoulli distributions formed from the class of probability measures Pπ({1}) = π and Pπ({0}) = 1 − π on the measurable space (Ω = {0, 1}, F = 2^Ω). Suppose now we wish to test

  H0 : π ≥ 0.5 versus H1 : π < 0.5.

As we indicated before, there are two ways we could set up an experiment to make inferences on π. One approach is to take a random sample of size n, X1, . . . , Xn, from the Bernoulli(π) distribution, and then use some function of that sample as an estimator. An obvious statistic to use is the number of 1's in the sample, that is, T = Σ Xi.

To assess the performance of an estimator using T, we would first determine its distribution and then use the properties of that distribution to decide what would be a good estimator based on T. A very different approach is to take a sequential sample, X1, X2, . . ., until a fixed number t of 1's have occurred. This yields N, the number of trials until t 1's have occurred. The distribution of T is binomial with parameters n and π; its PDF is

  p_T(t; n, π) = (n choose t) π^t (1 − π)^{n−t},   t = 0, 1, . . . , n.

The distribution of N is the negative binomial with parameters t and π, and its PDF is

  p_N(n; t, π) = (n−1 choose t−1) π^t (1 − π)^{n−t},   n = t, t + 1, . . . .

Suppose we do this both ways. We choose n = 12 for the first method and t = 3 for the second method. Now, suppose that for the first method we observe T = 3, and for the second method we observe N = 12. The ratio of the two likelihoods does not involve π, so by the likelihood principle we should make the same conclusions about π.
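Concretely (this display is added for clarity), with T = 3, n = 12, t = 3, and N = 12 the two likelihoods differ only by a constant factor:

\[
\frac{p_T(3;\,12,\pi)}{p_N(12;\,3,\pi)}
  = \frac{\binom{12}{3}\,\pi^3(1-\pi)^9}{\binom{11}{2}\,\pi^3(1-\pi)^9}
  = \frac{220}{55} = 4,
\]

which is free of π.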

Let us now compute the respective p-values. For the binomial setup we get p = 0.073 (using the R function pbinom(3,12,0.5)), but for the negative binomial setup we get p = 0.033 (using the R expression 1-pnbinom(8,3,0.5), in which the first argument is the number of "failures" before the number of "successes" specified in the second argument). The p-values are different, and in fact, if we had decided to perform the test at the α = 0.05 significance level, in one case we would reject the null hypothesis and in the other case we would not.
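The two computations cited above, with comments on the parameterizations:

```r
## binomial design: T = 3 ones in n = 12 trials; H1: pi < 0.5,
## so small T is evidence against H0, and the p-value is Pr(T <= 3)
p_binom <- pbinom(3, size = 12, prob = 0.5)        # 0.0730

## negative binomial design: N = 12 trials to reach t = 3 ones; large N is
## evidence against H0.  pnbinom counts failures, and N <= 11 corresponds
## to at most 8 failures, so Pr(N >= 12) = 1 - Pr(failures <= 8)
p_nbinom <- 1 - pnbinom(8, size = 3, prob = 0.5)   # 0.0327

c(p_binom, p_nbinom)
```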

This illustrates a problem with the likelihood principle: it ignores the manner in which information is collected. The problem often arises in this kind of situation, in which we have either an experiment whose conduct is completely independent of the information being gathered or an experiment whose conduct depends on what is observed. The latter type of experiment is often a type of Markov process in which there is a stopping time.

Power of a Statistical Test

We call the probability of rejecting H0 the power of the test, and denote it by β, or, for the particular test δ(X), by βδ. The power in the case that H1 is true is 1 minus the probability of a type II error. The probability of a type II error is generally a function of the true distribution of the sample Pθ, and hence so is the power, which we may emphasize by the notation βδ(Pθ) or βδ(θ). We now can focus on the test under either hypothesis (that is, under either subset of the family of distributions) in a unified fashion.

Power of a Statistical Test

We define the power function of the test, for any given P ∈ P, as βδ(P) = E_P(δ(X)). Thus, minimizing the probability of a type II error is equivalent to maximizing the power within Θ1. Because the power is generally a function of θ, what does maximizing the power mean?

Power of a Statistical Test

That is, maximize it for what values of θ? Ideally, we would like a procedure that yields the maximum for all values of θ; that is, one that is most powerful for all values of θ. We call such a procedure a uniformly most powerful or UMP test. For a given problem, finding such procedures, or establishing that they do not exist, will be one of our primary objectives.
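As an added illustration, the power function of the test in Example 1 is easily computed and plotted in R (the constants are the same hypothetical ones as before):

```r
## power function of the Example 1 test: beta(theta) = Pr(Xbar > c | theta),
## with Xbar ~ gamma(shape = n, scale = theta/n)
power_fn <- function(theta, c_crit, n) {
  pgamma(c_crit, shape = n, scale = theta / n, lower.tail = FALSE)
}

theta0 <- 2; n <- 10; alpha <- 0.05
c_crit <- qgamma(1 - alpha, shape = n, scale = theta0 / n)

curve(power_fn(x, c_crit, n), from = 0.5, to = 8,
      xlab = expression(theta), ylab = "power")
abline(v = theta0, lty = 2)   # boundary between H0 and H1
abline(h = alpha,  lty = 3)   # size of the test
```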

Decision Theoretic Approach

In the decision-theoretic formulation of a statistical procedure, the decision space is {0, 1}, corresponding respectively to not rejecting and rejecting the hypothesis. As in the decision-theoretic setup, we seek to minimize the risk:

  R(P, δ) = E(L(P, δ(X))).

In the case of the 0-1 loss function and the four possibilities, the risk is just the probability of either type of error. We want a test procedure that minimizes the risk. The issue above of a uniformly most powerful test is equivalent to the issue of a uniformly minimum risk test.

Randomized Tests

Just as in the theory of point estimation, where we found randomized procedures useful for establishing properties of estimators or as counterexamples to some statement about a given estimator, we can use randomized test procedures to establish properties of tests. While randomized estimators rarely have application in practice, randomized test procedures can actually be used to increase the power of a conservative test. Use of a randomized test in this way would not make much sense in real-world data analysis, but if there are regulatory conditions to satisfy, it might be useful.

Randomized Tests

We define a function δR that maps X into the decision space, and we define a random experiment R that has two outcomes, associated with not rejecting or rejecting the hypothesis, such that Pr(R = d0) = 1 − δR(x) and so Pr(R = d1) = δR(x). A randomized test can be constructed using a test δ(x) whose range is {d0, d1} ∪ DR, with the rule that if δ(x) ∈ DR, then the experiment R is performed with δR(x) chosen so that the overall probability of a type I error is the desired level. After δR(x) is chosen, the experiment R is independent of the random variable to whose distribution the hypothesis applies.

Optimal Tests

Optimal tests are those that minimize the risk; the risk considers the total expected loss. In the testing problem, we generally prefer to restrict the probability of a type I error and then, subject to that, minimize the probability of a type II error, which is equivalent to maximizing the power under the alternative hypothesis.

An Optimal Test in a Simple Situation

First, consider the problem of picking the optimal critical region C in a problem of testing the hypothesis that a discrete random variable has the probability mass function p0(x) versus the alternative that it has the probability mass function p1(x). We will develop an optimal test for any given significance level based on one observation. For x such that p0(x) > 0, let

  r(x) = p1(x)/p0(x),

and label the values of x for which r is defined so that r(x_{r_1}) ≥ r(x_{r_2}) ≥ · · · . Let N be the set of x for which p0(x) = 0 and p1(x) > 0.

Assume that there exists a j such that

  Σ_{i=1}^{j} p0(x_{r_i}) = α.

If S is the set of x for which we reject the hypothesis, the significance level is

  Σ_{x∈S} p0(x),

and the power over the region of the alternative hypothesis is

  Σ_{x∈S} p1(x).

Then it is clear that if C = {x_{r_1}, . . . , x_{r_j}} ∪ N, then Σ_{x∈C} p1(x) is maximized over all sets S subject to the restriction on the size of the test.

If there does not exist a j such that Σ_{i=1}^{j} p0(x_{r_i}) = α, the rule is to put x_{r_1}, . . . , x_{r_j} in C so long as

  Σ_{i=1}^{j} p0(x_{r_i}) = α∗ < α.

We then define a randomized auxiliary test R with

  Pr(R = d1) = δR(x_{r_{j+1}}) = (α − α∗)/p0(x_{r_{j+1}}).

It is clear that in this way Σ_{x∈S} p1(x) is maximized subject to the restriction on the size of the test.

Example 4 Testing between two discrete distributions

Consider two distributions with support on a subset of {0, 1, 2, 3, 4, 5}. Let p0(x) and p1(x) be the probability mass functions. Based on one observation, we want to test H0 : p0(x) is the mass function versus H1 : p1(x) is the mass function. Suppose the distributions are as shown in the table, where we also show the values of r and the labels on x determined by r.

  x        0     1     2     3     4     5
  p0      .05   .10   .15    0    .50   .20
  p1      .15   .40   .30   .05   .05   .05
  r        3     4     2     -    1/10  1/4
  label    2     1     3     -     5     4

Thus, for example, we see x_{r_1} = 1 and x_{r_2} = 0. Also, N = {3}.

For given α, we choose C such that

  Σ_{x∈C} p0(x) ≤ α

and so as to maximize

  Σ_{x∈C} p1(x).

We find the optimal C by first ordering r(x_{r_1}) ≥ r(x_{r_2}) ≥ · · · and then satisfying Σ_{x∈C} p0(x) ≤ α. The ordered possibilities for C in this example are

  {1} ∪ {3},   {1, 0} ∪ {3},   {1, 0, 2} ∪ {3},   · · · .

Notice that including N in the critical region does not cost us anything (in terms of the type I error that we are controlling).

Now, for any given significance level, we can determine the optimum test based on one observation.

• Suppose α = .10. Then the optimal critical region is C = {1, 3}, and the power against the alternative is βδ(p1) = .45.

• Suppose α = .15. Then the optimal critical region is C = {0, 1, 3}, and the power against the alternative is βδ(p1) = .60.

• Suppose α = .05. We cannot put 1 in C with probability 1, but if we put 3 in C and put 1 in C with probability 0.5, the α level is satisfied, and the power against the alternative is βδ(p1) = .25.

• Suppose α = .20. We choose C = {0, 1, 3} with probability 2/3 and C = {0, 1, 2, 3} with probability 1/3. The α level is satisfied, and the power against the alternative is βδ(p1) = 2/3 × .60 + 1/3 × .90 = .70.

All of these tests are most powerful based on one observation for the given values of α. A general sketch of this construction in R is given below.
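The following R sketch (added for illustration; the function and variable names are our own) carries out the construction for Example 4: order the sample points by the likelihood ratio, fill the critical region until the size α is exhausted, and randomize on the boundary point if necessary.

```r
## most powerful test construction for the discrete problem of Example 4
x  <- 0:5
p0 <- c(.05, .10, .15, 0, .50, .20)
p1 <- c(.15, .40, .30, .05, .05, .05)

np_test <- function(alpha) {
  free <- x[p0 == 0 & p1 > 0]             # points of N: no cost in size
  pos  <- which(p0 > 0)
  ord  <- pos[order(p1[pos] / p0[pos], decreasing = TRUE)]
  csum <- cumsum(p0[ord])
  j    <- sum(csum <= alpha)              # points included with probability 1
  used <- if (j > 0) csum[j] else 0
  gam  <- if (j < length(ord)) (alpha - used) / p0[ord[j + 1]] else 0
  pow  <- sum(p1[p0 == 0 & p1 > 0]) + sum(p1[ord[seq_len(j)]]) +
          (if (j < length(ord)) gam * p1[ord[j + 1]] else 0)
  list(C = c(x[ord[seq_len(j)]], free),
       boundary_point = if (gam > 0) x[ord[j + 1]] else NA,
       boundary_prob  = gam, power = pow)
}

np_test(0.10)   # C = {1, 3}, power .45
np_test(0.20)   # randomize on x = 2 with probability 1/3, power .70
```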

We can extend this idea to tests based on two observations. We see immediately that the ordered critical regions are

  C1 = {1, 3} × {1, 3},   C1 ∪ ({1, 3} × {0, 3}),   · · · .

Extending this direct enumeration would be tedious, but, at this point we have grasped the implication: the ratio of the likelihoods is the basis for the most powerful test. This is the Neyman-Pearson Fundamental Lemma.

The Neyman-Pearson Fundamental Lemma

Example 4 illustrates the way we can approach the problem of testing any simple hypothesis against another simple hypothesis. Notice the pivotal role played by the ratio r; this is a ratio of likelihoods. For testing H0, that the distribution of X is P0, versus the alternative H1, that the distribution of X is P1, given 0 < α < 1, under very mild assumptions the Neyman-Pearson Fundamental Lemma tells us that a most powerful test of size α exists, is essentially unique, and has a critical region of the form C = {x : L(θ1; x) ≥ kL(θ0; x)}, where k is chosen so that the size is α.

Proof. Let A be any critical region of size α. We want to prove

  ∫_C L(θ1) − ∫_A L(θ1) ≥ 0.

We can write this as

  ∫_C L(θ1) − ∫_A L(θ1) = ∫_{C∩A} L(θ1) + ∫_{C∩A^c} L(θ1) − ∫_{A∩C} L(θ1) − ∫_{A∩C^c} L(θ1)
                        = ∫_{C∩A^c} L(θ1) − ∫_{A∩C^c} L(θ1).

By the given condition, L(θ1; x) ≥ kL(θ0; x) at each x ∈ C, so

  ∫_{C∩A^c} L(θ1) ≥ k ∫_{C∩A^c} L(θ0),

and L(θ1; x) ≤ kL(θ0; x) at each x ∈ C^c, so

  ∫_{A∩C^c} L(θ1) ≤ k ∫_{A∩C^c} L(θ0).

Hence

  ∫_C L(θ1) − ∫_A L(θ1) ≥ k ( ∫_{C∩A^c} L(θ0) − ∫_{A∩C^c} L(θ0) ).

But

  ∫_{C∩A^c} L(θ0) − ∫_{A∩C^c} L(θ0) = ∫_{C∩A^c} L(θ0) + ∫_{C∩A} L(θ0) − ∫_{A∩C} L(θ0) − ∫_{A∩C^c} L(θ0)
                                     = ∫_C L(θ0) − ∫_A L(θ0)
                                     = α − α
                                     = 0.

Hence ∫_C L(θ1) − ∫_A L(θ1) ≥ 0.

This simple statement of the Neyman-Pearson Lemma and its proof should be in your bag of easy pieces. The Lemma applies more generally by use of a random experiment so as to achieve the level α. Shao gives a clear statement and proof of this.

Generalizing the Optimal Test to Hypotheses of Intervals

Although it applies to a simple alternative (and hence "uniform" properties do not make much sense), the Neyman-Pearson Lemma gives us a way of determining whether a uniformly most powerful (UMP) test exists, and if so, how to find one. We are often interested in testing hypotheses in which either or both of Θ0 and Θ1 are continuous regions of IR (or IR^k).

We must look at the likelihood ratio as a function of both θ and x. The question is whether, for given θ0 and any θ1 > θ0 (or, equivalently, any θ1 < θ0), the likelihood ratio is monotone in some function of x; that is, whether the family of distributions of interest is parameterized by a scalar in such a way that it has a monotone likelihood ratio (see Chapter 1 in the Companion notes). In that case, it is clear that we can extend the test to be uniformly most powerful for testing H0 : θ = θ0 against an alternative H1 : θ > θ0 (or H1 : θ < θ0).

The exponential class of distributions is important because UMP tests are easy to find for families of distributions in that class. Discrete distributions are especially simple, but there is nothing special about them. As an example, work out the test for H0 : θ ≥ θ0 versus the alternative H1 : θ < θ0 in a one-parameter exponential distribution. The one-parameter exponential distribution, with density θ^{-1} e^{-x/θ} over the positive reals, is a member of the exponential class. Two easy pieces you should have are the construction of a UMP test for the hypotheses in the one-parameter exponential distribution (above), and the construction of a UMP test for testing H0 : π ≥ π0 versus the alternative H1 : π < π0 in a binomial(π, n) distribution. A sketch of the first is given below.
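As an illustration only (the constants and names are hypothetical, and the gamma distribution of X̄ is as noted in Example 1), the first easy piece can be coded as:

```r
## UMP test for H0: theta >= theta0 vs H1: theta < theta0 in the
## one-parameter exponential distribution; by monotone likelihood ratio,
## reject for small Xbar, with c set by Xbar ~ gamma(n, theta0/n) at theta0
ump_exp <- function(x, theta0, alpha = 0.05) {
  n <- length(x)
  c_crit <- qgamma(alpha, shape = n, scale = theta0 / n)
  as.integer(mean(x) < c_crit)   # 1 = reject H0
}

## illustrative usage
set.seed(2)
ump_exp(rexp(20, rate = 1), theta0 = 2)   # sample with true theta = 1 < theta0
```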

Use of Sufficient Statistics

It is a useful fact that if there is a sufficient statistic S(X) for θ, and δ̃(X) is an α-level test for an hypothesis specifying values of θ, then there exists an α-level test for the same hypothesis, δ(S(X)), that depends only on S(X) and has power at least as great as that of δ̃(X). We see this by factoring the likelihoods.