That P-values should not be used

G. K. ROBINSON
Commonwealth Scientific and Industrial Research Organisation, Digital Productivity and Services Flagship, Private Bag 33, Clayton South, Victoria 3169, Australia
Email: [email protected]
Version 49, August 21, 2014

Abstract

The paradigm for statistical inference which is most widely used in applied statistics is often called “classical statistics” or “the P-value paradigm”. This paper argues that P-values should never be used for indicating the strength of statistical evidence. For some simple situations, it shows that likelihood ratios are always preferable to P-levels for indicating strength of evidence. The main argument behind this preference is that three-region tests between simple hypotheses are a better basis for assessing the amount of statistical evidence than are the two-region tests used by Neyman & Pearson (1933).

Keywords: Foundations of statistical inference; P-value; Likelihood ratio; Hypothesis testing; Confidence interval.

1 Introduction

It is widely accepted that something is wrong in the current state of the foundations and/or practice of statistical inference. Ziliak & McClosky (2008, page xvi) summarise the size of this problem in the following way. “Can so many scientists have been wrong over the eighty years since 1925? Unhappily, yes. . . . Statistical significance is surely not the only error in modern science, although it has been . . . an exceptionally damaging one.”

However, there is not general agreement as to precisely what is wrong. Bayesians tell us to discard classical statistics in favour of Bayesian statistics in order that our methods be coherent. However, Bayesian data analyses are inevitably subjective because they depend on the prior distributions chosen.

Some people are concerned with the differences between Fisher’s and Neyman’s approaches to P-values and confidence intervals. See Lehmann (1986), Lehmann (1993) and Hubbard & Bayarri (2003). To understand the precise nature of these differences seems to require a very careful reading of documents written by the protagonists, but this careful effort seems to yield few useful consequences for applied statistics.1

1 For instance, Fisher (1956) argued that the Behrens–Fisher test for the two means problem is better than the Welch–Aspin test because there are negatively-biased relevant subsets for the Welch–Aspin test. This is an important example of difference between the two approaches because the Welch–Aspin test aims to have approximately the nominal coverage probability, while Fisher viewed the Behrens–Fisher test as being based on his theory of fiducial probability (which had few adherents then and has virtually no adherents now). Fisher (1959, page 93) wrote “In fact, as a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or agency, of such evidence”, indicating that he did not regard coverage probability as a particularly important property.


Some other people such as Yates (1951), Carver (1978), Gardner & Altman (1986), Sterne & Smith (2001), Ziliak & McClosky (2008) and Cumming (2012) have emphasised the issue that it is more helpful to report interval estimates and effect sizes than to report the results of hypothesis tests. Several articles in Harlow et al. (1997) deal with this preference. It has been widely debated, particularly in psychology and medicine. This preference is not relevant to the arguments made in this paper. The suggestion that interval estimates should be preferred to hypothesis tests is just as applicable when likelihood or Bayesian methods are used to derive those interval estimates and hypothesis tests as it is when P-value-based arguments are used.

A concern of Yates (1951, page 32) and others is that people often test null hypotheses of little or no practical importance. Such undesirable statistical practice may have been encouraged by some documents which used the P-value approach to significance testing, but it is not essentially associated with P-values. Likelihood and Bayesian methods can also be used to test null hypotheses of little or no practical importance.

In the terminology of Dempster (1964), P-values are predictive statements. They provide a measure of reliability which is an average over all possible sets of data and which is meaningful before the data is observed. They are not a suitable basis for making postdictive statements.

The criticism of classical statistical practice which is presented in this paper could have been presented in 1933, although some of the terminology used is more modern. It is based on the fact that Neyman & Pearson (1933) formulated the problem of testing between two simple hypotheses in a way which does not allow postdictive assessment of the strength of evidence. This is argued in section 2. Their formulation involves dividing the sample space into two regions and is consistent with the idea of P-values. The alternative formulation involves dividing the sample space into three regions and leads to the use of likelihood ratios for measuring strength of evidence, rather than P-levels and power. Section 3 extends this argument to situations involving one-parameter families of simple hypotheses, for both hypothesis testing and interval estimation.

2 Tests between simple hypotheses

The usual Bayesian measure of the strength of evidence for an alternative hypothesis H1 relative to a null hypothesis H0 is called a “Bayes factor”. It is the posterior odds for H1 divided by the prior odds for H1 . It is equal to the ratio of marginal likelihoods when H0 and H1 are composite hypotheses. The Bayesian amount of evidence is the logarithm of this ratio. For testing between two simple hypotheses, the Bayesian amount of evidence does not depend on the prior distribution (provided that both hypotheses have non-zero prior probability). It therefore provides an objective measure of the amount of evidence. In more complicated situations the Bayesian measure of the amount of evidence depends on the prior distribution, and so does not provide an objective measure of the amount of evidence.
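As a minimal numerical illustration of these definitions (the likelihoods and prior odds below are hypothetical, chosen only for the sketch), the posterior odds are the prior odds multiplied by the Bayes factor, which for two simple hypotheses is just the likelihood ratio and so does not depend on the prior:

```python
import math

# Hypothetical likelihoods of the observed data under two simple hypotheses.
f0 = 0.05   # likelihood under H0
f1 = 0.60   # likelihood under H1

bayes_factor = f1 / f0                    # for simple hypotheses this is the likelihood ratio, 12
amount_of_evidence = math.log(bayes_factor)

for prior_odds in (0.25, 1.0, 4.0):       # prior odds for H1, chosen arbitrarily
    posterior_odds = prior_odds * bayes_factor
    # Posterior odds / prior odds is 12 whatever the prior odds.
    print(prior_odds, posterior_odds, posterior_odds / prior_odds)
```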

2.1 Two-region hypothesis tests

Neyman & Pearson (1933) consider hypothesis tests between two simple hypotheses where
a test is equivalent to a region, R, of the sample space in which the null hypothesis is rejected. In the remainder of the sample space the null hypothesis is not rejected. We will refer to the class of tests considered by Neyman & Pearson (1933) as “two-region” tests. The following artificial example illustrates a potential inadequacy of such tests.

Example 1 Consider a situation with four possible outcomes. Their probabilities under two hypotheses are given in Table 1. The situation might be interpreted as a blood test which generally gives X = 0 under H0 that a person does not have some disease and generally gives X = 3 under H1 that the person does have the disease. The two other possible outcomes X = 1 and X = 2 have fairly small probabilities under both H0 and H1, with X = 1 being slightly more likely under H0 and X = 2 being slightly more likely under H1.

Table 1: Probabilities for a blood test.

Hypothesis    X = 0    X = 1    X = 2    X = 3
H0             0.90     0.05     0.04     0.01
H1             0.01     0.04     0.05     0.90

One possible hypothesis test which might be described as a Neyman–Pearson test or a “two-region” test would reject H0 if X is 2 or 3 and reject H1 if X is 0 or 1. This test has probability 0.05 of Type I error (falsely rejecting H0) and probability 0.05 of Type II error (falsely rejecting H1). However, to report only which hypothesis is rejected by this test seems intuitively unsatisfactory in that the outcomes X = 1 and X = 2 are almost equally likely under both hypotheses, so when they are observed the amount of evidence seems intuitively to be negligible rather than being well summarized by the error rates. Another way of stating this criticism is that the strength of evidence (interpreted in an intuitive sense) is much less when the observed outcome is marginal (1 or 2) than when the outcome is conclusive (0 or 3), yet the reported strength of evidence is the same if the Type I and Type II error rates are the only features reported.

We could also say that the test has poor conditional properties. Conditional on X being in {1, 2}, the test which rejects H0 if X = 2, and rejects H1 if X = 1, has probability 0.04/0.09 = 0.444 of Type I error and the same probability of Type II error. Conditional on X being in {0, 3}, the test which rejects H0 if X = 3, and rejects H1 if X = 0, has probability 0.01/0.91 = 0.011 of Type I error and the same probability of Type II error. The set {1, 2} would be called a negatively-biased relevant subset by Buehler (1959). The existence of such a subset is regarded by Buehler (1959) as a very serious criticism of the stated level of confidence, partly because it corresponds to a “recognizable” subset in the sense used by Fisher (1959).

The problem of poor conditional properties is far from being peculiar to this particular example. For many tests between two simple hypotheses it is easy to find a negatively-biased relevant subset which is a set of possible results with likelihood ratio near to the cut-off value. Proposition 1 in the Appendix shows that negatively-biased relevant betting procedures generally exist when the observable random variable is discrete. The essential reason why two-region tests do a poor job of reporting the strength of evidence is that with only two possible reported conclusions there is no room to differentiate between values of X for which the reported conclusion is less reliable and values of X for which the reported conclusion is more reliable.
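These unconditional and conditional error rates can be checked with a minimal sketch, assuming only the probabilities in Table 1 (the code and variable names are illustrative):

```python
# Probabilities from Table 1.
p0 = {0: 0.90, 1: 0.05, 2: 0.04, 3: 0.01}   # under H0
p1 = {0: 0.01, 1: 0.04, 2: 0.05, 3: 0.90}   # under H1

reject_H0 = {2, 3}                            # the two-region test: reject H0 if X is 2 or 3

type1 = sum(p0[x] for x in reject_H0)                             # 0.05
type2 = sum(p1[x] for x in p1 if x not in reject_H0)              # 0.05

# Conditional error rates given the marginal outcomes {1, 2}.
marginal = {1, 2}
type1_given_marginal = p0[2] / sum(p0[x] for x in marginal)       # 0.04/0.09 = 0.444
type2_given_marginal = p1[1] / sum(p1[x] for x in marginal)       # 0.04/0.09 = 0.444

# Conditional error rates given the conclusive outcomes {0, 3}.
conclusive = {0, 3}
type1_given_conclusive = p0[3] / sum(p0[x] for x in conclusive)   # 0.01/0.91 = 0.011
type2_given_conclusive = p1[0] / sum(p1[x] for x in conclusive)   # 0.01/0.91 = 0.011

print(type1, type2, type1_given_marginal, type1_given_conclusive)
```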

An obvious way to expand the formulation of hypothesis tests in order to give some information about the strength of evidence is a four-region test. This would divide the sample space into regions R0 , R1 , R0weak and R1weak such that hypothesis H0 is strongly rejected if X ∈ R0 , hypothesis H1 is strongly rejected if X ∈ R1 , hypothesis H0 is weakly rejected if X ∈ R0weak and hypothesis H1 is weakly rejected if X ∈ R1weak . However, having four regions seems to be unnecessarily complicated. Regions R0weak and R1weak can be combined into a single region where no strong conclusion can be drawn.

2.2 Three-region tests

For Example 1 the obvious three-region test accepts H0 if X = 0, makes no conclusion if X = 1 or X = 2, and accepts H1 if X = 3. The probability of accepting the true hypothesis is 0.9; the probability of accepting the wrong hypothesis is 0.01; and the probability of making no conclusion is 0.09. In cases where an hypothesis is accepted, the likelihood ratio in favour of the accepted hypothesis is 90. We now consider a simple situation in which the observable random variable is continuous rather than discrete. It is easy to construct a two-region test with P-value precisely 0.05.

Example 2 Suppose that a random variable, X, is normally distributed with unit variance and mean either −1.645 under H0 or 1.645 under H1. The situation is illustrated in Figure 1. It is symmetrical about X = 0.

Figure 1: Probability densities of possible values for a normally distributed random variable. Densities under H0 that the mean is −1.645 are shown by a continuous line, and densities under H1 that the mean is 1.645 are shown by a dot-dashed line.

A symmetrical two-region hypothesis test rejects H0 if X > 0. The P-value is 0.05 and the Type II error rate is also 0.05. Negatively-biased relevant subsets can again be readily found by looking at sets of X values near to the cut-off value, X = 0. For instance, the error rates conditional on |X| < 0.14 are both 0.4428.

A symmetrical three-region test accepts H0 whenever X < −0.8950 (i.e. the likelihood ratio for H0 relative to H1 is at least 19:1) and accepts H1 whenever X > +0.8950 (i.e. the likelihood ratio for H1 relative to H0 is at least 19:1). Under H0, the probability of accepting H0 is 0.7733, the probability of accepting H1 is 0.005544, and the probability of making no conclusion is 0.2211. For the two-region test, the P-value 0.05 is the probability under H0 of the region where H0 is rejected. For the three-region test, there is no obvious event whose probability is equal to 0.05. The likelihood ratio cut-off value, 19, is a much more natural measure of the reliability of any conclusions reached.

In general for discriminating between pairs of simple hypotheses, if we are constrained to draw a conclusion no matter what the observed data then the classical two-region tests cannot be improved. Their optimality follows from a result that was proved by Neyman & Pearson (1933) and which has become known as the Neyman–Pearson lemma. This result is stated in the Appendix as Proposition 2 and is proved in a manner which allows a natural extension to Proposition 3, which shows that the best three-region tests must be based on the likelihood ratio and be of the form: reject H0 if f1(x)/f0(x) > λ0, reject H1 if f0(x)/f1(x) > λ1, and otherwise make no decision.

Within the classical school of statistics, the Neyman–Pearson lemma is usually regarded as providing an answer to the question “What test statistics should be used when calculating P-values and judging whether a data set is unusual?” Looking from within the classical school of statistics, three-region tests provide the same answers to this question as do two-region tests. In that sense, the classical school might regard three-region tests as being just as satisfactory as two-region tests. However, looking from outside the classical school of statistics, the three-region approach is more realistic for assessment of evidence, but it is not compatible with the idea of P-values.

In order to compare P-values and likelihood ratios as measures of the amount of statistical evidence, we must put them on the same scale. That scale could be the scale of probability, p; the scale of odds ratios, (1 − p)/p; or the scale of support, log[(1 − p)/p]. For instance, a P-value of 0.05 corresponds to a likelihood ratio of 19 and to a support of log(19).

For two-region tests, P-values generally over-state the amount of evidence compared to the cut-off likelihood ratios. Suppose that we are testing H0 against the alternative H1 using a Neyman–Pearson test and the Type II error rate, β, is larger than the Type I error rate, p. If we reject H0 whenever X is in a rejection region, R, in which the likelihood ratio f1(x)/f0(x) ≥ λ, then the P-value for this test satisfies

p = ∫_{x∈R} f0(x) dx = ∫_{x∈R} [f0(x)/f1(x)] f1(x) dx ≤ (1/λ) ∫_{x∈R} f1(x) dx = (1 − β)/λ ≤ (1 − p)/λ.          (1)

Hence λ ≤ (1 − p)/p or, equivalently, p ≤ 1/(1 + λ). These inequalities can be interpreted as saying that the cut-off likelihood ratio suggests that there is less statistical evidence than does the P-value.
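The Example 2 numbers, and inequality (1), can be checked with a short sketch assuming only the stated means ±1.645 and unit variance (the cut-off 0.8950 follows from the likelihood ratio formula in the comment):

```python
from math import log
from scipy.stats import norm

mu0, mu1, lam = -1.645, 1.645, 19.0

# The likelihood ratio f1(x)/f0(x) equals exp((mu1 - mu0) * x), so the three-region
# cut-off c solves exp(3.29 * c) = 19, i.e. c = log(19)/3.29.
c = log(lam) / (mu1 - mu0)
print(c)                                               # ≈ 0.8950

print(norm.cdf(-c, loc=mu0))                           # P(accept H0 | H0) ≈ 0.7733
print(norm.sf(c, loc=mu0))                             # P(accept H1 | H0) ≈ 0.0055
print(norm.cdf(c, loc=mu0) - norm.cdf(-c, loc=mu0))    # P(no conclusion | H0) ≈ 0.2211

# Check inequality (1) for the rejection region {x : f1(x)/f0(x) >= 19}, i.e. X > c.
p = norm.sf(c, loc=mu0)      # Type I error of that region
beta = norm.cdf(c, loc=mu1)  # Type II error of that region
print(p, (1 - beta) / lam, p <= (1 - beta) / lam)      # the inequality holds
```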

2.3 Sequential tests

The distinction being made between two-region tests and three-region tests seems to have been understood by Barnard (1947). I believe that he would have regarded what I am calling “three-region tests” as a form of sequential analysis which is terminated after one stage of data collection. He wrote that “. . . sequential analysis poses the question in a more natural manner than the classical theory of testing hypotheses. In the classical approach, the question
is put: Which of the two hypotheses, H or H′, should we adopt, on the basis of the data R? As if we were always compelled to choose one or other of these two alternatives. Sequential analysis, on the other hand, poses the question: Are the data R sufficient ground for adopting H, or for adopting H′, [or] are the data insufficient? In other words, we ask, is the likelihood ratio L′/L so large that we can safely accept H′, is it so small that we can safely accept H, or is it so near to 1 that we have no safe grounds for decision? A rule for answering this question will take the form of fixing two numbers, A > 1 and B < 1, and prescribing that we are to accept H′ if the likelihood ratio is greater than A, we are to accept H if the likelihood ratio is less than B, while we consider the data insufficient if the likelihood ratio lies between A and B.”

When sequential probability ratio tests (SPRTs) were developed by Wald (1947) they were justified within the P-value paradigm. It was assumed that the most important measure of performance of an SPRT for choosing between two simple hypotheses was the P-value. Computing these P-values often requires complicated calculations or approximations. However, if likelihood ratios are used to measure strength of evidence, then an SPRT is immediately justifiable because the test stops when a specified strength of evidence is reached or exceeded.

Example 3 Consider testing between two hypotheses about the probability of heads using independent tosses of a coin. Under H0, the probability of heads is 0.442. Under H1, the probability of heads is 0.558. A non-sequential trial which seems reasonable according to classical criteria is to conduct 201 tosses and to reject H0 if the number of heads is greater than 100. The probabilities of Type I error and of Type II error are both 0.0492. An SPRT with likelihood ratio cut-off 19 will terminate when the absolute difference between the number of heads and the number of tails is 13. The actual likelihood ratio in favour of the hypothesis accepted is always 20.69. The probability of accepting the correct hypothesis is 0.954; the probability of accepting the incorrect hypothesis is 0.046; and the expected number of tosses is 101.7.

When considering the relative merits of non-sequential and sequential trials at the time of choosing an experimental design, these two trials would be regarded as having similar discriminatory power from within the P-level paradigm. The trial of fixed size might be preferred because it is easier to manage. However, the discriminatory power of the non-sequential trial seems less similar to the discriminatory power of the SPRT when its results will be analysed as a three-region test with a likelihood ratio cut-off of 19. The probability of accepting the correct hypothesis is 0.7894; the probability of being undecided is 0.2043; and the probability of accepting the incorrect hypothesis is 0.0062.

An interesting point which is not related to criticism of two-region hypothesis tests and P-values is that, for a given cost structure, the cut-off likelihood ratio used for making a conclusion with a sequential test is likely to be different from the cut-off likelihood ratio used for making a conclusion according to a three-region test. Suppose that utilities are +1 for making a correct decision, −19 for making an incorrect decision, 0 for making no decision, and −0.003 for each coin toss observed.
The expected utility of the two-region hypothesis test with 201 tosses which accepts H0 if X ≤ 100 and accepts H1 if X > 100 is −0.5879. This is negative because the procedure often makes incorrect decisions. The expected utility of the three-region hypothesis test with 201 tosses which accepts H0 if X ≤ 94, accepts H1 if X > 106, and makes no decision if 95 ≤ X ≤ 106 is +0.0681. This expected utility is higher because the test does not make decisions when the data are not sufficiently conclusive. This three-region test has the highest expected utility amongst tests based on 201 observations with the number of observations being fixed in advance.

The sequential test which stops as soon as the likelihood in favour of one of the hypotheses is greater than 19:1 has an expected utility of −0.2273. This average utility is poor largely because almost 5% of the decisions made are incorrect. Another sequential test which has much higher expected utility (0.3472) is one which stops only when the difference between the numbers of heads and tails reaches 23 and the likelihood ratio therefore reaches 212.8. Other sequential probability ratio tests with slightly different stopping criteria are almost as good.

A good test with fewer decision points is a two-stage test with 101 tosses in Stage 1 and an additional 106 tosses in Stage 2 which is only conducted if the likelihood ratio at the completion of Stage 1 is between 40:1 and 1:40. This test has expected utility +0.1417. It is better than the one-stage three-region test with 201 observations, but not as good as the best sequential test. The expected number of tosses is 171.2 and the probability of making no conclusion is 0.1853.
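A Monte Carlo sketch of the sequential probability ratio test in Example 3; the coin probabilities and the stopping boundary come from the text above, while the simulation itself and its replication count are illustrative choices:

```python
import random

p0, p1 = 0.442, 0.558          # probability of heads under H0 and under H1
boundary = 13                  # stop when |heads - tails| reaches 13 (likelihood ratio about 20.7)

def run_sprt(p_true, rng):
    """Toss the coin until the head/tail difference hits the boundary; return (accept_H1, tosses)."""
    diff, tosses = 0, 0
    while abs(diff) < boundary:
        diff += 1 if rng.random() < p_true else -1
        tosses += 1
    return diff > 0, tosses     # a surplus of heads is evidence for H1

rng = random.Random(1)
n_rep = 100_000
wrong = 0
total_tosses = 0
for _ in range(n_rep):
    accept_H1, tosses = run_sprt(p0, rng)   # simulate under H0
    wrong += accept_H1
    total_tosses += tosses

print(wrong / n_rep)            # probability of accepting the incorrect hypothesis, ≈ 0.046
print(total_tosses / n_rep)     # expected number of tosses, ≈ 101.7
```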

2.4 Choice of tests and measures of evidence

The main point made in this section concerns postdictive assessment of strength of evidence: three-region hypothesis tests are more appropriate than two-region hypothesis tests for summarizing the strength of evidence between two simple hypotheses. With a two-region hypothesis test, the hypothesis chosen after seeing the data should be better supported than the other hypothesis, but the difference in the degree of support may be small. With a three-region test, if an hypothesis is chosen after seeing the data then this choice must also be better supported than the alternative of waiting for more evidence, and this says something about the strength of evidence.

From our preference for three-region hypothesis tests, it follows that likelihood ratios are preferable to P-values for summarizing evidence for tests between two simple hypotheses. This point has been made before, though based on different arguments, for instance by Royall (1997). Another good feature of likelihood ratios for testing between two simple hypotheses is that they provide a perfect answer to the problem of how to combine information from independent trials. This is often called meta-analysis. If the evidence provided by a first experiment is summarized by the likelihood ratio l1 and the evidence provided by a second experiment is summarized by the likelihood ratio l2 then the total evidence is l1 × l2.2

2 Note however, that this simple rule only works perfectly for tests between two simple hypotheses.

3 Statistical procedures involving one-parameter families of simple hypotheses

We now consider testing between a simple null hypothesis and a compound alternative hypothesis consisting of two simple hypotheses.

Example 4 For a situation like the non-sequential test of Example 3, let X denote the number of heads observed in 201 tosses of a coin. Unlike in Example 3, hypothesis H0 is that the probability of heads is 0.5 and H1 is the composite hypothesis that the probability of heads is either 0.3739 or 0.6261. A Neyman–Pearson two-sided test of H0 would reject H0 if X ≤ 86 or X ≥ 115. This test has probability 0.048 of Type I error and probability 0.050 of Type II error. Just as for Example 3, the conditional properties of this test are unsatisfactory, as can be
shown by conditioning on X being near the cut-off values. For instance, conditional on X ∈ (84, 91) ∪ (110, 117) the probabilities of Type I and Type II error are 0.179 and 0.396, respectively.

A three-region hypothesis test which makes no conclusion unless there is a likelihood ratio of at least 19:1 accepts H0 if 94 ≤ X ≤ 107 and accepts H1 if X < 82 or X > 119. (Note that the likelihood of the alternative hypothesis is taken to be the maximum likelihood over the two simple hypotheses.) The probabilities of accepting the true hypothesis, accepting the wrong hypothesis and making no conclusion are, respectively, 0.6766, 0.0072 and 0.3162 under H0 and 0.8228, 0.0041 and 0.1731 under H1.

For Example 4, as for Example 3, the three-region test is conservative compared to the two-region test in that whenever the three-region test draws a conclusion it is the same as the conclusion drawn by the two-region test, but elsewhere the two-region test makes a conclusion and the three-region test makes no conclusion.
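A sketch that recovers the acceptance regions quoted for Example 4; only the binomial model with n = 201 and the stated success probabilities is assumed:

```python
from scipy.stats import binom

n, cutoff = 201, 19.0
p_null = 0.5
p_alts = (0.3739, 0.6261)

accept_H0, accept_H1 = [], []
for x in range(n + 1):
    f0 = binom.pmf(x, n, p_null)
    f1 = max(binom.pmf(x, n, p) for p in p_alts)   # likelihood of the composite alternative
    if f0 / f1 >= cutoff:
        accept_H0.append(x)
    elif f1 / f0 >= cutoff:
        accept_H1.append(x)

print(min(accept_H0), max(accept_H0))              # 94 107: accept H0 if 94 <= X <= 107
print(max(x for x in accept_H1 if x < n / 2),
      min(x for x in accept_H1 if x > n / 2))      # 81 120: accept H1 if X < 82 or X > 119

# Probabilities of each conclusion under H0.
print(sum(binom.pmf(x, n, p_null) for x in accept_H0),   # ≈ 0.6766
      sum(binom.pmf(x, n, p_null) for x in accept_H1))   # ≈ 0.0072
```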

3.1 Hypothesis testing as a basis for interval estimation

Now we consider situations where there is a family of simple hypotheses indexed by a single real parameter. Many practically important situations are in this class. Both the two-region and three-region approaches to testing statistical hypotheses can be used to construct interval estimates by finding the most likely simple hypothesis for given data, testing all other simple hypotheses against the most likely simple hypothesis, and regarding the set of parameter values which would not be rejected as an interval estimate for the parameter.

Example 5 Suppose that X is normally distributed with unknown mean µ and unit variance. Let φ(.) denote the density function and let Q(.) denote the right-hand tail area for the standard normal distribution. If we observe X = x then µ̂ = x is the most likely parameter value.

A two-sided two-region hypothesis test with Type I error rate 2Q(δ) will not reject hypotheses between µ = x − δ and µ = x + δ. The P-value paradigm would report both the reliability of these hypothesis tests and the reliability of the interval estimate from µ = x − δ to µ = x + δ using the P-value 2Q(δ). A one-sided two-region hypothesis test with Type I error rate Q(δ) will not reject hypotheses with µ ≤ x + δ. The reliability of these hypothesis tests and the reliability of the interval estimate from −∞ to µ = x + δ would both be reported using the P-value Q(δ).

A three-region hypothesis test will just reject the hypothesis µ = x + δ in favour of the hypothesis µ = x if the no-decision region is from X = x to X = x + δ and the likelihood ratio φ(0)/φ(δ) = exp(δ²/2) is equal to a chosen cut-off value. Similar logic applies to the hypothesis µ = x − δ. Hence the interval estimate from µ = x − δ to µ = x + δ might be reported with likelihood ratio exp(δ²/2) as a measure of its reliability. Another possible report of strength of evidence is the Bayes factor for a Cauchy prior with quartiles at ±2. This has been included as an intuitive guide for readers who are familiar with the use of Bayes factors, but is not essential to the paper’s main line of argument.

The relationship between δ and these measures of strength of evidence is shown in Table 2. The P-values are the measures of reliability for the two-region hypothesis tests. Two-tailed and one-tailed P-values are given in the second and fourth columns. They are also expressed on the scale of odds. The likelihood ratios are the measures of reliability for the three-region hypothesis tests. The upper portion of the table shows some commonly-used P-values, and the lower portion shows commonly-used likelihood ratios.

Table 2: P-values, likelihood ratios and Bayes factors for hypothesis tests based on a normal distribution as discussed in Example 5

   δ      Two-sided    As       One-sided    As        Likelihood    Bayes
          P-value      odds     P-value      odds      ratio         factor
 1.282    0.20000        4.00   0.10000        9.00        2.27        0.63
 1.645    0.10000        9.00   0.05000       19.00        3.87        0.95
 1.960    0.05000       19.00   0.02500       39.00        6.83        1.48
 2.326    0.02000       49.00   0.01000       99.00       14.97        2.80
 2.576    0.01000       99.00   0.00500      199.00       27.59        4.63
 3.090    0.00200      499.00   0.00100      999.00      118.48       15.78
 3.291    0.00100      999.00   0.00050     1999.00      224.48       27.34

 1.177    0.23903        3.18   0.11952        7.37        2.00        0.57
 1.794    0.07279       12.74   0.03640       26.47        5.00        1.16
 2.146    0.03188       30.37   0.01594       61.74       10.00        2.02
 2.327    0.01995       49.12   0.00998       99.24       15.00        2.80
 3.035    0.00241      414.54   0.00120      830.08      100.00       13.66
 3.717    0.00020     4957.73   0.00010     9916.46     1000.00      100.89
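The entries in Table 2 can be reproduced along the following lines. The Cauchy prior with quartiles at ±2 corresponds to scale parameter 2; the numerical integration scheme and its limits are illustrative choices rather than the author's calculation:

```python
import numpy as np
from scipy.stats import norm, cauchy
from scipy.integrate import quad

def table2_row(delta):
    one_sided = norm.sf(delta)                    # Q(delta)
    two_sided = 2 * one_sided
    lr = np.exp(0.5 * delta ** 2)                 # phi(0)/phi(delta)
    # Bayes factor for a Cauchy(0, 2) prior on the mean (quartiles at +/-2),
    # evaluated at the observation x = delta.
    marginal, _ = quad(lambda mu: norm.pdf(delta - mu) * cauchy.pdf(mu, scale=2), -50, 50)
    bf = marginal / norm.pdf(delta)
    return (two_sided, (1 - two_sided) / two_sided,
            one_sided, (1 - one_sided) / one_sided, lr, bf)

for delta in (1.282, 1.645, 1.960, 2.326, 2.576, 3.090, 3.291):
    print(delta, *(round(v, 2) for v in table2_row(delta)))
# e.g. delta = 1.960 gives two-sided P = 0.05 (odds 19), likelihood ratio 6.83
# and a Bayes factor of about 1.48.
```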

We can see that the likelihood ratio in column 6 is always less than the one-sided P-value expressed as an odds ratio in column 5. This can be interpreted as saying that P-values over-state the strength of evidence relative to likelihood ratios. Inequality (1) shows that the one-sided P-value necessarily gives a larger estimate of the strength of evidence than does the likelihood ratio. The two-sided P-value also gives a larger estimate of the strength of evidence than does the likelihood ratio for the range of tests shown, but this is not true for two-sided P-values larger than 0.42.

The likelihood ratio is the likelihood of the most likely alternative simple hypothesis divided by the likelihood of H0. This is equal to the Bayesian strength of evidence only if the prior probability for the alternative hypotheses is concentrated on the simple hypothesis for which x is the true mean. In all other circumstances the likelihood ratio exceeds the Bayesian strength of evidence. In this sense, the observed likelihood ratio over-states the amount of evidence in favour of H1. Given that the one-sided P-value over-states the strength of evidence in favour of H1 compared to the likelihood ratio and the likelihood ratio over-states the amount of evidence compared to all possible Bayesian strengths of evidence, it is always better to quote the likelihood ratio than to quote the P-value.

It is not being claimed that the likelihood ratio is a particularly good measure of strength of evidence. The fact that it over-states the strength of evidence compared to the Bayesian strength of evidence for all possible prior distributions makes it seem likely that better ways of testing H0 might be possible. Simulation suggests that for this situation the average observed log likelihood ratio exceeds the log likelihood ratio between the true mean and µ = 0 by 0.5.3 However, the P-value is a worse measure of the strength of evidence.

The Bayes factor given in the last column of Table 2 is the best measure of strength of evidence if the prior distribution used to calculate it is reasonable. Note that x = 1.960 corresponds to a two-sided P-value of 0.05, which is often regarded as a strength of evidence sufficient to justify publication in an academic journal, but gives a Bayes factor little larger

than unity for the prior distribution chosen. This illustrates the point that the difference between approaches can make a substantial difference.

3 The Akaike Information Criterion suggests that when comparing models with different numbers of parameters the preferred model is the one with the smallest value of 2k − 2 log(L), or equivalently the largest value of log(L) − k, where k is the number of parameters and L is the likelihood. The observed difference of 0.5 is not consistent with the AIC.

Many applications of statistics are somewhat similar to the template provided by Example 5. Another situation of substantial practical interest is data that follows a Student’s t-distribution.

Example 6 Suppose that for some mean, µ, X − µ has a t-distribution with three degrees of freedom. The null hypothesis, H0, is that µ = 0 and the alternative hypothesis, H1, is that µ ≠ 0. If we observe X = x then the most likely value of µ is x. The likelihood ratio for H1 relative to H0 is the likelihood for µ = x divided by the likelihood for µ = 0. The observation X = δ provides just sufficient strength of evidence to reject H0 for values given in Table 3. As with Table 2, columns 2 and 3 are for two-sided two-region tests, columns 4 and 5 are for one-sided two-region tests, and column 6 is for three-region tests. Table 2 may be regarded as the corresponding table for infinitely many degrees of freedom. Bayes factors have not been included in Table 3 because a sensible prior distribution needs to specify a joint distribution for the population mean and the population standard deviation; so the Bayes factors vary with both the sample mean and the sample standard deviation and cannot be presented in a single column.

Table 3: P-values and likelihood ratios for hypothesis tests based on a t-distribution with 3 degrees of freedom.

                          Classical test                                        Neyman–Pearson test
    δ      Two-sided    As        One-sided    As        Likelihood    One-sided    As
           P-value      odds      P-value      odds      ratio         P-value      odds
  1.638    0.20000        4.00    0.10000        9.00        3.59      0.09345         9.70
  2.353    0.10000        9.00    0.05000       19.00        8.10      0.04188        22.88
  3.182    0.05000       19.00    0.02500       39.00       19.15      0.01758        55.87
  4.541    0.02000       49.00    0.01000       99.00       61.98      0.00505       197.00
  5.841    0.01000       99.00    0.00500      199.00      153.07      0.00184       542.58
 10.215    0.00200      499.00    0.00100      999.00     1280.13      0.00015      6603.51
 12.924    0.00100      999.00    0.00050     1999.00     3212.22      0.00005     20168.00

  1.115    0.34622        1.89    0.17311        4.78        2.00      0.16941         4.90
  1.926    0.14980        5.68    0.07490       12.35        5.00      0.06738        13.84
  2.547    0.08417       10.88    0.04208       22.76       10.00      0.03396        28.44
  2.936    0.06072       15.47    0.03036       31.94       15.00      0.02257        43.30
  5.196    0.01385       71.22    0.00692      143.44      100.00      0.00297       335.19
  9.585    0.00241      413.99    0.00120      828.98     1000.00      0.00020      4906.58
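A corresponding sketch for Table 3. The Neyman–Pearson one-sided error rate is obtained by integrating the null density over the interval on which the likelihood ratio for the simple alternative µ = δ exceeds its value at X = δ; the root-finding bracket is an illustrative choice:

```python
from scipy.stats import t
from scipy.optimize import brentq

df = 3

def table3_row(delta):
    classical_one_sided = t.sf(delta, df)
    lr = t.pdf(0, df) / t.pdf(delta, df)          # likelihood for mu = delta over likelihood for mu = 0
    # Likelihood ratio in favour of the simple alternative mu = delta, as a function of x.
    ratio = lambda x: t.pdf(x - delta, df) / t.pdf(x, df)
    # For the t distribution this ratio rises above lr just beyond x = delta and falls back to lr
    # at some upper point, so the Neyman-Pearson rejection region is the interval (delta, upper).
    upper = brentq(lambda x: ratio(x) - lr, delta + 1e-3, 100.0)
    np_one_sided = t.sf(delta, df) - t.sf(upper, df)
    return classical_one_sided, lr, upper, np_one_sided

for delta in (3.182, 4.541, 5.841):
    print(delta, table3_row(delta))
# delta = 4.541: classical one-sided P = 0.010, likelihood ratio 61.98,
# rejection interval (4.541, 5.862), Neyman-Pearson P about 0.00505.
```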

In some cases the one-sided P-values are smaller on the odds scale than the likelihood ratios. This seems to be in contradiction with inequality (1). However, the classical t-test is not equivalent to a Neyman–Pearson test between the null hypothesis and the most likely simple alternative hypothesis. For instance, when testing between H0 and µ = 4.541, the P-level for a one-tailed test which rejects H0 whenever X > 4.541 is 0.01. However, a Neyman–Pearson test which rejects H0 whenever the likelihood ratio in favour of the alternative exceeds 61.98 only rejects H0 when 4.541 < X < 5.862. Its probability of Type I error is only 0.00505, and is given in the row of Table 3 for δ = 4.541.

The last two columns of Table 3 concern Neyman–Pearson hypothesis tests. They give the one-tailed Type I error rates as probabilities and as numbers on a scale of odds. The Neyman–Pearson odds ratio is always larger than the likelihood ratio, which is consistent with inequality (1).

3.2 Interpreting P-levels, likelihood ratios and Bayes factors

People might ask whether it is possible to simply translate between P-values and likelihood ratios (or between P-values and Bayes factors, particularly for situations more complex than the ones discussed in detail in this paper). For a fixed situation, there is generally a 1:1 relationship between the P-value and the likelihood ratio, so my argument that likelihood ratios are preferable to P-values for quantifying the amount of statistical evidence might be thought to be merely a matter of interpretation and calibration. However, these relationships are generally different for different situations. For instance, according to Tables 2 and 3, a one-sided P-level of 0.025 corresponds to likelihood ratios of 6.83 or 19.15 for hypothesis testing based on a normal variable or based on a variable having a t-distribution with three degrees of freedom. For sequential testing the corresponding likelihood ratio could be as large as 39. Hence there is no general way of translating between P-values and likelihood ratios.

Another important issue of interpretation is how likelihood ratios should be interpreted as strengths of evidence. In particular, what cut-off values should be used in deciding whether a research paper in, say, medicine or psychology is worthy of publication? Some opinions which have been expressed are as follows.

Fisher (1970, pages 80–81) suggested that P-values smaller than 0.01, 0.02 and 0.05 corresponded respectively to the null hypothesis being “definitely disproved”, that the null hypothesis “fails to account for the whole of the facts”, and that the data is “seldom to be disregarded”. Fisher (1926, page 504) said “Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.” Over many years, a P-value of 0.05 has come to be interpreted as “statistically significant”, suggesting that this is a level of evidence which is quite substantial, despite Fisher’s view that it corresponds to a level of evidence which is just high enough for it not to be automatically ignored. Ioannidis (2005) has suggested that the commonly-used 5% P-level is a publication barrier that is lower than is desirable in medicine, on the grounds that it results in a large proportion of published results being later found to be substantially inaccurate.

Fisher (1959, page 72) advocated likelihood ratio cut-offs of 2, 5 and 15. These correspond approximately to error rates of 0.2, 0.07 and 0.02 for two-tailed tests or to 0.1, 0.04 and 0.01 for one-tailed tests, according to Table 2. Jeffreys (1961, Appendix B) suggested that 10^0.5 ≈ 3.2 should be the lower limit on the likelihood ratio for “substantial evidence”, 10 should be the lower limit for “strong evidence”, 10^1.5 ≈ 32 should be the lower limit for “very strong evidence”, and 10^2 should be the lower limit for “decisive”. Royall (2000, page 761) cites Royall (1997) as providing a basis for a likelihood ratio of 8 being interpreted as “fairly strong evidence” and a likelihood ratio of 32 being interpreted as “strong evidence”. Kass & Raftery (1995) suggest that 3 be the lower limit for “positive” evidence, 20 be the lower limit for “strong” evidence, and 150 be the lower limit for “very strong” evidence. In an unpublished but often cited paper, Evett (1991) argues that for forensic evidence alone to be conclusive in a criminal trial, the Bayesian posterior odds for guilt should be at least 1000.
Considering these opinions, I would like to suggest that a likelihood ratio of 10 be regarded as indicative, 100 be regarded as reliable, and 1000 be regarded as very reliable.

On this scale, a “very reliable” result would represent three times as much evidence as an “indicative” result.

3.3 Practical consequences

One important practical consequence of using likelihood ratios rather than P-values for expressing the strength of the evidence provided by a set of data would be that more data would generally be required before a result was considered statistically significant.4 We will now look at two epidemiological trials involving tests of compound alternative hypotheses against simple null hypotheses where using likelihood ratios rather than P-values would have made a substantial difference.

Example 7 This situation concerns the effect of hormone replacement therapy (HRT) on coronary heart disease (CHD). Ignoring the fact that many other quantities were also measured,5 consider the 2 × 2 table in Table 4 which is taken from Rossouw et al. (2002, Table 2).

Table 4: Results of HRT trial at the time of its termination

            CHD    No CHD    Total
HRT         164      8342      8506
Placebo     122      7980      8102
Total       286     16322     16608

For the given marginal total frequencies (i.e. that there were 8506 women treated with HRT, 8102 treated with placebo, and a total of 286 suffered coronary heart disease), the number in the first cell in the table has a hypergeometric distribution. Under the null hypothesis that HRT has no effect on coronary heart disease, the probability that the number in the first cell will be 164 or more is 0.0210. Low numbers in the first cell might be regarded as being as surprising as high numbers, and numbers of 128 or less are all less likely than 164 under the null hypothesis. The total probability of these outcomes under the null hypothesis is 0.0210 + 0.0159 = 0.0369, so classical statistical practice would quote the two-tailed P-value 0.0369 as the basis for judging statistical significance.

Common practice suggests that a P-value less than 0.05 allows us to say that the result is “statistically significant”, so the data in Table 4 have been considered to provide a basis for the statement that the link between HRT and coronary heart disease is “statistically significant”. In general, a statement like this increases the chances that a research result will get published. In this particular case, the statement also led to the study being terminated, on the grounds that it would be unethical to expose the participants in the study to increased risk of coronary heart disease. For a review of the medical and study design issues with the benefit of hindsight, see Langer et al. (2012).

A problem with the calculation of the P-level is that the experimental procedure was actually sequential. In the terminology of Armitage (1975), the “nominal error rate” ignores the fact that the steadily-accumulating body of data may have already been examined on many occasions and might be examined on many subsequent occasions.

4 For pure hypothesis testing, it is not possible to calculate likelihood ratios. Table 2 could be used to translate in an approximate way between P-values and the likelihood scale, leading to the suggestion that a one-sided P-value of 0.016 be regarded as indicative, 0.0012 be regarded as reliable, and 0.0001 be regarded as very reliable. Compared to current practice, such a change would amount to a raising of the barriers that P-values are required to meet in order to warrant various amounts of attention.

5 Selection of evidence is generally dangerous, but that issue is not the topic of this paper.


The level of statistical significance ought to be based on the “overall error rate” which is the total probability under the null hypothesis that on one of those occasions the study would be terminated and the null hypothesis would be rejected. The overall error rate cannot be easily computed in this case because the protocol for reviewing the results during the trial is not described explicitly, but I guess that it is approximately 0.15. If this had been reported rather than the nominal error rate then the conclusion that HRT increases the risk of CHD might not have been regarded as statistically significant; the result might not have been published; and the study might not have been terminated. The overall error rate is always at least as large as the nominal error rate, so researchers might be expected to prefer to report the nominal error rate in order to increase the chance that their results will be considered publishable. In contrast, a likelihood analysis is not affected by whether observations-to-date have been or will be examined on many occasions.

The parameter which indexes the alternative hypotheses for this example is the relative risk of CHD for women on HRT compared to women on placebo. The continuous line on Figure 2 shows the likelihood of alternative hypotheses relative to the null hypothesis that HRT has no effect on CHD. The most likely relative risk is 1.286. The likelihood ratio is 8.98 at this relative risk. The relative risks for which the likelihood is 1/19 of its maximum are 0.961 and 1.727. The interval between these two points is a likelihood-based interval estimate. The best single-number summary of the strength of evidence against the null hypothesis that the relative risk is unity is the likelihood ratio 8.98. This is smaller than the value of 19 which some people naively imagine corresponds to the significance level 0.05, and is also smaller than the likelihood ratio of 10 which is advocated above as the minimum for a result to be regarded as “indicative”. If the trial were regarded as a sequential trial to distinguish between H0 that HRT has no effect on the risk of CHD and H1 that the relative risk was increased by a factor of, say, F, and the study was to be terminated when the likelihood ratio between these hypotheses exceeded 10, then there is no value for F such that the study would have been terminated according to the sequential protocol at the time when it was actually terminated.

Some additional participants suffered CHD after the decision was made to terminate this study, as discussed in Manson et al. (2003). The final version of the data is given in Table 5. For this data the two-tailed P-value is 0.0771, which is much less extreme than the P-value of 0.0369 which had been found for the data in Table 4.

Table 5: Final results from HRT trial

            CHD    No CHD    Total
HRT         188      8318      8506
Placebo     147      7955      8102
Total       335     16273     16608

The likelihood ratio as a function of the relative risk for the final results is shown in Figure 2 as a dotted line. The maximum likelihood ratio is 5.21, for a relative risk of 1.223. This is not strong evidence for the suggested relationship. The relative risks for which the likelihood is 1/19 of its maximum are 0.935 and 1.605.
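A sketch of the Example 7 calculations. The P-values use Fisher's exact test; for the likelihood ratio the sketch maximizes an unconditional two-binomial likelihood over the risks in both arms, since the paper does not spell out whether a conditional or unconditional likelihood was used, so the maximized ratios may differ slightly from the quoted 8.98 and 5.21:

```python
import numpy as np
from scipy.stats import fisher_exact

def binom_loglik(events, total, rate):
    return events * np.log(rate) + (total - events) * np.log(1 - rate)

def max_likelihood_ratio(a, n1, b, n2):
    """Maximized likelihood of separate risks in the two arms, relative to a common risk."""
    l_alt = binom_loglik(a, n1, a / n1) + binom_loglik(b, n2, b / n2)
    l_null = binom_loglik(a + b, n1 + n2, (a + b) / (n1 + n2))
    return np.exp(l_alt - l_null)

# Table 4 (2002, at termination) and Table 5 (2003, final): CHD events and women per arm.
for label, (hrt, n_hrt, plac, n_plac) in {
        "2002": (164, 8506, 122, 8102),
        "2003": (188, 8506, 147, 8102)}.items():
    table = [[hrt, n_hrt - hrt], [plac, n_plac - plac]]
    _, p_two_sided = fisher_exact(table, alternative="two-sided")
    _, p_greater = fisher_exact(table, alternative="greater")
    lr = max_likelihood_ratio(hrt, n_hrt, plac, n_plac)
    print(label, round(p_two_sided, 4), round(p_greater, 4), round(lr, 2))
# Expected output is close to the values quoted above:
# 2002: two-sided P ≈ 0.037, one-sided P ≈ 0.021, maximized likelihood ratio ≈ 9.0
# 2003: two-sided P ≈ 0.077, maximized likelihood ratio ≈ 5.2
```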


Figure 2: Likelihood of data in Table 4 as a function of the relative risk. The continuous line gives the likelihood based on the data available in 2002 when the study was terminated, and the dotted line gives the likelihood based on all of the data in 2003.

Example 8 Another example of practical importance where using likelihood methods rather than P-values would affect our conclusions is data from an HIV vaccine trial in Thailand which was reported in Rerks-Ngarm et al. (2009) and is summarized in Table 6. Only 51 out of 8197 people in the vaccine arm of the trial became infected with HIV, compared with 74 out of 8198 people who received a saline shot as placebo. Using Fisher’s exact test for 2 × 2 tables, the P-level for a two-sided test is 0.0478 and the P-level for a one-sided test is 0.0239.

Table 6: Results from an HIV vaccine trial in Thailand

               Infected   Uninfected    Total
Vaccine arm          51         8146     8197
Placebo arm          74         8124     8198
Total               125        16270    16395

The parameter which indexes the simple hypotheses is again a relative risk: the risk of HIV infection for people given the vaccine as a multiple of the risk for people given the placebo. A likelihood-based analysis would report that the maximum relative likelihood is 8.52 for a relative risk of 0.816. This is not as high as the likelihood ratio of 10 which was suggested earlier as a possible minimum for a result to be regarded as “indicative”. The relative risks for which the likelihood is 1/19 of its maximum are 0.438 and 1.066. These are the bounds for a likelihood-based interval estimate of the relative risk.

This data set can also be used to illustrate two common problems with two-region significance tests and P-levels which have not previously been mentioned in this paper but which would be eliminated by using three-region hypothesis tests and likelihood ratios.

The first problem is choosing between one-sided and two-sided tests. Here the P-levels for one-sided and two-sided tests are 0.0239 and 0.0478, respectively. Such P-levels commonly differ by a factor of approximately 2, yet it is sometimes not clear which should be used. Before the vaccine trial had been conducted, I imagine that it would have been considered very unlikely that the real risk of HIV is higher for the vaccine than for the placebo, so a one-sided test seems to me to be appropriate. However, it is not impossible that the vaccine might increase the risk, so many people would advise using a two-sided test. Within the P-level paradigm, the strength of evidence reported changes by a factor of approximately two depending on whether it was intended to use a one-sided significance test or a two-sided significance test. Formally, this is in conflict with the Likelihood Principle. Intuitively, it seems unreasonable for any set of data which results in rejection of the null hypothesis, because the simple alternative hypothesis on the other side of the null is then very unlikely.

The second problem is that under some circumstances, P-levels change abruptly as a function of the data. If there had been two extra people in the vaccine arm of the HIV vaccine trial and neither of them had become infected with HIV then the one-sided P-level would have changed from 0.02394 to 0.02387 but the two-sided P-level would have changed from 0.0478 to 0.0393. This dramatic change occurs because for the original 2 × 2 table the probability of getting 51 HIV infections in the placebo arm and 74 infections in the vaccine arm is slightly smaller than the probability of 51 HIV infections in the vaccine arm and 74 in the placebo arm, so the probability of this possible outcome is included in the two-sided P-level; but for the modified data the probability of getting 51 HIV infections in the placebo arm is slightly larger than the probability of 51 HIV infections in the vaccine arm so its probability is not included in the two-sided P-level. Such a dramatic change in the reported strength of evidence for a minor change in the data seems intuitively unreasonable.

Table 7: Results from a hypothetical HIV vaccine trial

               Infected   Uninfected   Total
Vaccine arm           1           23      24
Placebo arm           7           17      24
Total                 8           40      48

Example 9 Consider now the hypothetical set of results in Table 7 that could have arisen from a much smaller trial of the HIV vaccine. The two-sided and one-sided P-levels for Fisher’s exact test using this data are 0.0479 and 0.0240, respectively. They are almost the same as for the Thailand trial presented as Table 6. However, likelihood analysis gives a maximum likelihood ratio of 19.74, which is much larger than the maximum likelihood ratio of 8.52 for the Thailand trial. Likelihood-based analysis indicates that this hypothetical set of results provides substantially stronger evidence against the null hypothesis than do the Thailand results, despite P-value analysis suggesting that the strength of evidence is almost the same. This reinforces the point made in section 3.2 that using likelihood ratios to assess strength of evidence is not equivalent to taking a transformation of the P-level.
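The contrast between Example 8 and Example 9 can be reproduced with the same kind of sketch, again using Fisher's exact test and an unconditional two-binomial likelihood maximized over both infection risks (an assumption about the likelihood used in the paper; the helper function is the same illustrative one used for Example 7):

```python
import numpy as np
from scipy.stats import fisher_exact

def max_likelihood_ratio(a, n1, b, n2):
    """Maximized two-binomial likelihood relative to a common infection risk."""
    def loglik(events, total, rate):
        return events * np.log(rate) + (total - events) * np.log(1 - rate)
    l_alt = loglik(a, n1, a / n1) + loglik(b, n2, b / n2)
    l_null = loglik(a + b, n1 + n2, (a + b) / (n1 + n2))
    return np.exp(l_alt - l_null)

trials = {
    "Thailand trial (Table 6)": (51, 8197, 74, 8198),
    "Hypothetical trial (Table 7)": (1, 24, 7, 24),
}
for label, (vac, n_vac, plac, n_plac) in trials.items():
    table = [[vac, n_vac - vac], [plac, n_plac - plac]]
    _, p_two_sided = fisher_exact(table, alternative="two-sided")
    lr = max_likelihood_ratio(vac, n_vac, plac, n_plac)
    print(label, round(p_two_sided, 4), round(lr, 2))
# The two-sided P-levels are almost identical (about 0.048), while the maximized
# likelihood ratios differ substantially (about 8.5 versus about 19.7).
```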

4 Discussion

The most fundamental argument presented here is that for testing between simple hypotheses the formulation of Neyman & Pearson (1933) should not be relied upon as a basis for assessing strength of evidence because of its focus on two-region hypothesis tests. Two-region tests are only appropriate when a choice must be made between two simple hypotheses no matter how flimsy the evidence. Three-region tests are much more appropriate for assessing strength of evidence. An additional argument is that one-sided P-values always overstate the amount of statistical evidence compared to quoting likelihood ratios and that likelihood ratios overstate the amount of statistical evidence compared to Bayes factors.
For testing between two simple hypotheses, the likelihood ratio is a completely satisfactory measure of the amount of evidence, so P-values should never be used.

These arguments can be extended to tests between simple null hypotheses and compound alternative hypotheses. Again, likelihood ratios tend to overstate the amount of statistical evidence relative to Bayesian measures, and P-values overstate it further, so likelihood ratios are always better than P-values as a summary of the strength of statistical evidence. Interval estimates may be constructed as sets of values for a parameter which cannot be rejected by hypothesis tests, so interval estimates based on likelihood ratios are better than interval estimates based on P-values (i.e. confidence intervals). Robinson (1975) presented some artificial situations where interval estimates based on P-values have poor conditional properties. For all of them, likelihood-based interval estimates give more sensible summaries of the available evidence than do interval estimates based on P-values (i.e. confidence intervals).

This paper criticizes the use of P-values in situations where alternatives to the null hypothesis are well-defined. This includes much of applied statistics, including a very large number of research papers in medical, psychological, environmental and other journals. I consider that editorial guidelines for many academic journals should be modified to encourage reporting of likelihood ratios and Bayes factors instead of encouraging the reporting of P-values.

A convenient consequence of using likelihood ratios rather than P-values is that likelihood calculations are generally simpler than calculations of P-levels, though this might not be immediately apparent because statistical software is currently set up to calculate P-values. Cox & Hinkley (1974, page 93) wrote “The determination of a likelihood ratio critical region of size α requires two steps. The first and entirely straightforward one is the calculation of the likelihood ratio [.]; the second and more difficult part is the determination of the distribution of the [likelihood ratio as a] random variable under H0.” Barnard (1949, pages 115–6) suggested “The adoption of the likelihood ratio as the method of expressing numerically the strength of the evidence provided by a set of data would have the immediate practical advantages that sequential tests have over classical ones—in particular, no special distribution problems would need to be solved before they could be applied.” For multi-stage epidemiological trials, accurate calculation of the P-value requires integration over many possible amounts of data for the various times when the data might have been reviewed, but the likelihood ratio depends only on the data available at the current time.

Another important consequence of using likelihood methods for data analysis in simple situations is that sequential experimental designs might become more popular. I consider that such a change is desirable because it would lead to more efficient research.

I do not expect individual users of statistical procedures to stop using P-values and to use likelihood ratios and Bayes factors instead without concern about how their work will be regarded by others. Currently, the style guidelines for many academic journals strongly encourage use of P-levels, whether in the form of hypothesis testing or confidence intervals. I hope that journal editors will provide leadership in a move away from use of P-values.
One of the reasons why the arguments in this paper seem to be difficult for readers to accept is that I am suggesting that P-values should be discarded without saying that a preferred alternative (likelihood methods) is itself the correct way to do statistical inference. In the terminology of Kuhn (1970), I am asking readers to discard one paradigm without suggesting that some alternative paradigm is correct.6 P-values are worse than the preferred alternative, so it would be unreasonable to continue to use them.

6 Personally, I do not expect any ultimately correct way of doing statistical inference to ever be found.


For the simple problems discussed in this paper, I consider that likelihood methods are always preferable to methods based on P-values. However even for testing between a simple null hypothesis and a compound alternative, I do not consider that likelihood methods are the uniquely correct way (or even always a good way) to do statistical inference in the sense of solving the first of the problems of Royall (1997): “What evidence is there?”. For more complicated problems I suspect that methods based on Bayesian calculations with reasonably diffuse or improper prior distributions will be considered preferable to both P-values and simple likelihood methods. However, that is beyond the scope of this paper. The purpose of this paper is merely to denigrate the use of P-values for simple situations.

References

Armitage, P. (1975). Sequential Medical Trials. Blackwell Scientific, Oxford, 2nd ed.
Barnard, G. A. (1947). Book review of Sequential Analysis by Abraham Wald. Journal of the American Statistical Association 42, 658–664.
Barnard, G. A. (1949). Statistical inference (with discussion). Journal of the Royal Statistical Society, Series B 11, 115–149.
Buehler, R. J. (1959). Some validity criteria for statistical inferences. The Annals of Mathematical Statistics 30, 845–863.
Carver, R. P. (1978). The case against significance testing. Harvard Educational Review 48, 378–399.
Chernoff, H. & Scheffe, H. (1952). A generalization of the Neyman–Pearson fundamental lemma. The Annals of Mathematical Statistics 2, 213–225.
Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London.
Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge, New York.
Dantzig, D. B. & Wald, A. (1951). On the fundamental lemma of Neyman and Pearson. The Annals of Mathematical Statistics 22, 87–93.
Dempster, A. P. (1964). On the difficulties inherent in Fisher’s fiducial arguments. Journal of the American Statistical Association 59, 56–66.
Evett, I. W. (1991). Implementing Bayesian methods in forensic science. Paper presented at the Fourth Valencia International Meeting on Bayesian Statistics.
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain 33, 505–513.
Fisher, R. A. (1956). On a test of significance in Pearson’s Biometrika Tables (No. 11). Journal of the Royal Statistical Society. Series B 18, 56–60.
Fisher, R. A. (1959). Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh, 2nd ed.
Fisher, R. A. (1970). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 14th ed.
Gardner, M. J. & Altman, D. G. (1986). Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 292, 746–750.
Harlow, L. L., Muliak, S. A. & Steiger, J. H., eds. (1997). What If There Were No Significance Tests? Lawrence Erlbaum Associates Publishers.
Hubbard, R. & Bayarri, M. J. (2003). Confusion over measures of evidence (P’s) versus errors (α’s) in classical statistical testing. The American Statistician 57, 171–177.
Ioannidis, J. P. A. (2005). Contradicted and initially stronger effects in highly cited clinical research. Journal of the American Medical Association 294, 218–228.
Jeffreys, H. (1961). Theory of Probability. Oxford University Press, 3rd ed.
Kass, R. E. & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association 90, 773–795.

Kuhn, T. S. (1970). The Structure of Scientific Revolutions. The University of Chicago Press, 2nd ed. Langer, R. D., Manson, J. E. & Allison, M. A. (2012). Have we come full circle — or moved forward? The Womens Health Initiative 10 years on. Climacteric 15, 206–212. Lehmann, E. L. (1986). Testing Statistical Hypotheses. Wiley, New York, 2nd ed. Lehmann, E. L. (1993). The Fisher, Neyman–Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association 88, 1242–1249. Manson, J. E., Hsia, J., Johnson, K. C., Rossouw, J. E., Assaf, A. R., Lasser, N. L., Trevisan, M., Black, H. R., Heckert, S. R., Detrano, R., Strickland, O. L., Wong, N. D., Crouse, J. R., Stein, E. & Cushman, M. (2003). Estrogen plus progestin and the risk of coronary heart disease. The New England Journal of Medicine 349, 523–534. Neyman, J. & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A 231, 289–337. Rerks-Ngarm, S., Pitisuttithum, P., Nitayaphan, S., Kaewkungwal, J., Chiu, J., Paris, R., Premsri, N., Namwat, C., de Souza, M., Adams, E., Benenson, M., Gurunathan, S., Tartaglia, J., McNeil, J. G., Francis, D. P., Stablein, D., Birx, D. L., Chunsuttiwat, S., Khamboonruang, C., Thongcharoen, P., Robb, M. L., Michael, N. L., Kunasol, P. & Kim, J. H. (2009). Vaccination with ALVAC and AIDSVAX to Prevent HIV-1 infection in Thailand. The New England Journal of Medicine 361, 2209–2220. Robinson, G. K. (1975). Some counterexamples to the theory of confidence intervals. Biometrika 62, 155–161. Rossouw, J. E., Anderson, G. L., Prentice, R. L., LaCroix, A. Z., Kooperberg, C., Stefanick, M. L., Jackson, R. D., Beresford, S. A. A., Howard, B. V., Johnson, K. C., Kotchen, J. M. & Ockene, J. (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women’s Health Initiative randomized controlled trial. Journal of the American Medical Association 288, 321–333. Royall, R. (2000). On the probability of observing misleading evidence. Journal of the American Statistical Association 95, 760–768. Royall, R. M. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman and Hall, New York. Sterne, J. A. C. & Smith, G. D. (2001). Sifting the evidence—what’s wrong with significance tests? British Medical Journal 322, 226–31. Wagner, D. H. (1969). Nonlinear functional versions of the Neyman–Pearson lemma. SIAM Review 11, 52–65. Wald, A. (1947). Sequential Analysis. Wiley, New York. Yates, F. (1951). The influence of Statistical Methods for Research Workers on the development of the science of statistics. Journal of the American Statistical Association 46, 19–34. Ziliak, S. T. & McClosky, D. N. (2008). The Cult of Statistical Significance: How the Standard Error Cost Us Jobs, Justice, and Lives. The University of Michigan Press, Ann Arbor.

A Mathematical results

Proposition 1 Suppose that there are two points x0 and x1 in a discrete sample space such that we accept H0 with confidence α when x = x0 and we accept H1 with confidence α when x = x1. If the likelihood ratios at x0 and x1 are not very different from one another, in the sense that

$$ \frac{f_0(x_0)\, f_1(x_1)}{f_1(x_0)\, f_0(x_1)} < \left( \frac{\alpha}{1-\alpha} \right)^2 , $$

then we can construct a negatively-biased relevant betting procedure which bets that the confidence α is too large, betting a fraction p0 of the times that X = x0 and a fraction p1 of the times that X = x1.

Proof:

First, choose a positive constant δ < α such that

$$ \frac{f_0(x_0)\, f_1(x_1)}{f_1(x_0)\, f_0(x_1)} < \left( \frac{\delta}{1-\delta} \right)^2 . $$

Such a δ exists because the condition of the proposition is a strict inequality. Next, choose the betting fractions p0 and p1 so that

$$ \frac{f_0(x_0)}{f_0(x_1)} \cdot \frac{1-\delta}{\delta} \;<\; \frac{p_1}{p_0} \;<\; \frac{f_1(x_0)}{f_1(x_1)} \cdot \frac{\delta}{1-\delta} ; $$

this interval is non-empty precisely because of the inequality just imposed on δ. Since δ < α, the interval with δ lies strictly inside the corresponding interval with α in place of δ, so the ratio p1/p0 also satisfies the bounds with α strictly. Betting against the stated confidence α at the implied odds, so that each bet gains α when the accepted hypothesis is false and loses 1 − α when it is true, the expected yields per observation are

$$ p_1 f_0(x_1)\,\alpha - p_0 f_0(x_0)(1-\alpha) \quad \text{under } H_0 , \qquad p_0 f_1(x_0)\,\alpha - p_1 f_1(x_1)(1-\alpha) \quad \text{under } H_1 , $$

and the choice of p0 and p1 makes both of these strictly positive. Our expected yield from betting is therefore strictly positive under both hypotheses, so the betting procedure is relevant. The betting procedure is negatively biased because it is always betting that the confidence level is too high. □
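As an informal illustration, and not part of the original derivation, the following sketch checks the condition of Proposition 1 numerically and confirms that the resulting bets have strictly positive expected yield under both hypotheses. All of the numbers (the point probabilities, the confidence level and the betting fractions) are invented, and the payout convention is the one described in the proof above.

    # Numerical illustration of Proposition 1 (all numbers are hypothetical).
    f0 = {"x0": 0.05, "x1": 0.02}   # P[X = x | H0] at the two points
    f1 = {"x0": 0.02, "x1": 0.05}   # P[X = x | H1] at the two points
    alpha = 0.95                    # stated confidence of the accepted hypothesis

    # Condition of Proposition 1.
    lhs = f0["x0"] * f1["x1"] / (f1["x0"] * f0["x1"])
    rhs = (alpha / (1.0 - alpha)) ** 2
    assert lhs < rhs

    # Bet against the stated confidence a fraction p0 of the times that X = x0
    # and a fraction p1 of the times that X = x1.  Each bet gains alpha when the
    # accepted hypothesis is false and loses (1 - alpha) when it is true.
    p0, p1 = 1.0, 1.0

    yield_h0 = p1 * f0["x1"] * alpha - p0 * f0["x0"] * (1.0 - alpha)
    yield_h1 = p0 * f1["x0"] * alpha - p1 * f1["x1"] * (1.0 - alpha)
    print(yield_h0, yield_h1)       # both strictly positive for these numbers

With these numbers both expected yields equal 0.0165 per observation, so the bettor comes out ahead no matter which hypothesis is true, which is what it means for the betting procedure to be relevant.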

Proposition 2 Suppose that one of two simple hypotheses, H0 and H1, must be rejected after observing some data, X. Let R0 denote the region where H0 is rejected. The probability of incorrectly rejecting H0 is α = P[X ∈ R0 | H0] and the probability of correctly rejecting H0 is γ = P[X ∈ R0 | H1]. If R0 maximizes γ for given α = a, then points inside R0 have larger values of the likelihood ratio f1(x)/f0(x) than points outside R0.

Proof: If R0 maximizes γ for given α = a, then the method of Lagrange multipliers tells us that a scalar, λ, and the set R0 must be stationary points for the quantity

$$ \gamma - \lambda(\alpha - a) = \lambda a + \int_{R_0} \left[ f_1(x) - \lambda f_0(x) \right] dx , $$

regarded as a function of λ and R0. The integral can be maximized over R0 for fixed λ by taking R0 = {x : f1(x) − λf0(x) > 0}, so the critical region R0 must be of the form {x : f1(x)/f0(x) > λ}, except that it may also include points where f1(x)/f0(x) = λ. □

This argument requires some regularity conditions which are not important for the present purpose. See Dantzig & Wald (1951), Chernoff & Scheffe (1952) and Wagner (1969).
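The content of Proposition 2 can also be checked numerically. The sketch below, which is illustrative only, uses an invented pair of discrete likelihoods and an invented size budget: it builds the rejection region from the points with the largest likelihood ratios and then verifies by brute force that no other region of no larger size under H0 has greater power under H1.

    from fractions import Fraction as F
    from itertools import chain, combinations

    # Hypothetical two-hypothesis example on a six-point sample space.
    xs = list(range(6))
    f0 = [F(40, 100), F(25, 100), F(15, 100), F(10, 100), F(7, 100), F(3, 100)]  # P[X = x | H0]
    f1 = [F(3, 100), F(7, 100), F(10, 100), F(15, 100), F(25, 100), F(40, 100)]  # P[X = x | H1]
    alpha_budget = F(10, 100)                    # allowed size under H0

    # Build R0 from the points with the largest likelihood ratios f1(x)/f0(x),
    # as in Proposition 2.
    ordered = sorted(xs, key=lambda x: f1[x] / f0[x], reverse=True)
    R0, size = [], F(0)
    for x in ordered:
        if size + f0[x] <= alpha_budget:
            R0.append(x)
            size += f0[x]
    power = sum(f1[x] for x in R0)

    # Brute-force check: no region with size <= size(R0) under H0 is more powerful.
    def regions(points):
        return chain.from_iterable(combinations(points, k) for k in range(len(points) + 1))

    best = max(sum((f1[x] for x in R), F(0)) for R in regions(xs)
               if sum((f0[x] for x in R), F(0)) <= size)
    assert best == power
    print(sorted(R0), size, power)   # [4, 5] 1/10 13/20

For this example the brute-force search and the likelihood-ratio construction agree, as Proposition 2 asserts. In a discrete sample space the achievable sizes are limited to sums of point probabilities, which is why the comparison is made at the size actually achieved rather than at an arbitrary nominal level.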

Proposition 3 Let R0 denote the region where H0 is rejected and let R1 denote the region where H1 is rejected. The probabilities of incorrectly rejecting hypotheses are α0 = P[X ∈ R0 | H0] and α1 = P[X ∈ R1 | H1]. The probabilities of correctly rejecting hypotheses are γ0 = P[X ∈ R0 | H1] and γ1 = P[X ∈ R1 | H0]. If R0 and R1 maximize γ0 and γ1 for given α0 = a0 and α1 = a1, then points inside R0 have larger values of the likelihood ratio f1(x)/f0(x) than points outside R0 ∪ R1, which in turn have larger values of the likelihood ratio than points in R1.

Proof: If R0 and R1 maximize γ0 and γ1 for given α0 = a0 and α1 = a1, then the method of Lagrange multipliers tells us that scalars λ0 and λ1 and sets R0 and R1 must be stationary points for the quantity

$$ \gamma_0 + \gamma_1 - \lambda_0(\alpha_0 - a_0) - \lambda_1(\alpha_1 - a_1) = \lambda_0 a_0 + \lambda_1 a_1 + \int_{R_0} \left[ f_1(x) - \lambda_0 f_0(x) \right] dx + \int_{R_1} \left[ f_0(x) - \lambda_1 f_1(x) \right] dx , $$

regarded as a function of λ0, λ1, R0 and R1. The integrals can be maximized over R0 and R1 for fixed λ0 and λ1 by taking R0 = {x : f1(x) − λ0 f0(x) > 0} and R1 = {x : f0(x) − λ1 f1(x) > 0}, possibly also including points where equality holds. Hence the critical region R0 must be of the form {x : f1(x)/f0(x) > λ0} and the critical region R1 must be of the form {x : f0(x)/f1(x) > λ1}, although both may include points where equality holds. □
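A three-region test of the kind described in Proposition 3 can be sketched in the same informal way. In the hypothetical example below, the likelihoods and the thresholds λ0 and λ1 are invented: the sample space splits into a region where H0 is rejected, a region where H1 is rejected, and a middle region where neither is rejected, and the code reports the two error probabilities and checks the likelihood-ratio ordering asserted by the proposition.

    # Hypothetical three-region test in the spirit of Proposition 3
    # (likelihoods and thresholds invented for illustration).
    xs = list(range(6))
    f0 = [0.40, 0.25, 0.15, 0.10, 0.07, 0.03]   # P[X = x | H0]
    f1 = [0.03, 0.07, 0.10, 0.15, 0.25, 0.40]   # P[X = x | H1]
    lam0, lam1 = 3.0, 3.0                       # thresholds for rejecting H0 and H1

    R0 = [x for x in xs if f1[x] / f0[x] > lam0]             # reject H0
    R1 = [x for x in xs if f0[x] / f1[x] > lam1]             # reject H1
    middle = [x for x in xs if x not in R0 and x not in R1]  # reject neither

    alpha0 = sum(f0[x] for x in R0)    # P[reject H0 | H0]
    alpha1 = sum(f1[x] for x in R1)    # P[reject H1 | H1]
    gamma0 = sum(f1[x] for x in R0)    # P[reject H0 | H1]
    gamma1 = sum(f0[x] for x in R1)    # P[reject H1 | H0]

    # Likelihood-ratio ordering asserted by Proposition 3: ratios in R0 exceed
    # ratios in the middle region, which exceed ratios in R1.
    ratio = [f1[x] / f0[x] for x in xs]
    assert all(ratio[a] > ratio[b] for a in R0 for b in middle)
    assert all(ratio[b] > ratio[c] for b in middle for c in R1)
    print(R0, middle, R1, alpha0, alpha1, gamma0, gamma1)

In this example the middle region consists of the two points at which the data are treated as insufficient to reject either hypothesis, which is exactly the third outcome that a two-region test does not allow.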
