Journal of Clinical Epidemiology 87 (2017) 70–77

The distribution of P-values in medical research articles suggested selective reporting associated with statistical significance

Thomas V. Perneger*, Christophe Combescure

Division of Clinical Epidemiology, Faculty of Medicine, University of Geneva, and Geneva University Hospitals, 6 rue Gabrielle-Perret-Gentil, CH-1211 Geneva, Switzerland

Accepted 4 April 2017; Published online 9 April 2017

Abstract

Objectives: Published P-values provide a window into the global enterprise of medical research. The aim of this study was to use the distribution of published P-values to estimate the relative frequencies of null and alternative hypotheses and to seek irregularities suggestive of publication bias.

Study Design and Setting: This cross-sectional study included P-values published in 120 medical research articles in 2016 (30 each from the BMJ, JAMA, Lancet, and New England Journal of Medicine). The observed distribution of P-values was compared with the expected distributions under the null hypothesis (i.e., uniform between 0 and 1) and under the alternative hypothesis (strictly decreasing from 0 to 1). P-values were categorized according to conventional levels of statistical significance and in one-percent intervals.

Results: Among 4,158 recorded P-values, 26.1% were highly significant (P < 0.001), 9.1% were moderately significant (P ≥ 0.001 to <0.01), 11.7% were weakly significant (P ≥ 0.01 to <0.05), and 53.2% were nonsignificant (P ≥ 0.05). We noted three irregularities: (1) a high proportion of P-values <0.001, especially in observational studies; (2) an excess of P-values equal to 1; and (3) about twice as many P-values just below 0.05 as just above 0.05. The latter finding was seen in both randomized trials and observational studies, and in most types of analyses, excepting heterogeneity tests and interaction tests. Under plausible assumptions, we estimate that about half of the tested hypotheses were null and the other half were alternative.

Conclusion: This analysis suggests that statistical tests published in medical journals are not a random sample of null and alternative hypotheses; rather, selective reporting is prevalent. In particular, significant results are about twice as likely to be reported as nonsignificant results.

© 2017 Elsevier Inc. All rights reserved.

Keywords: Statistical tests; P-values; Publication bias; Practice of research

Conflict of interest: None.
*Corresponding author. E-mail address: [email protected] (T.V. Perneger).
http://dx.doi.org/10.1016/j.jclinepi.2017.04.003
0895-4356/© 2017 Elsevier Inc. All rights reserved.

1. Introduction

Most medical research studies, regardless of design or purpose, report results accompanied by P-values or by confidence intervals [1]. The aggregate population of published P-values (or confidence intervals) can be seen as a collective artifact of the medical research enterprise, one that may reveal useful clues about the conduct of science and the dissemination of scientific results. The main issue that hampers the empirical study of P-values is selection bias [2]. This bias can occur both through the researcher's ingenuity in finding a "statistically significant" result (a practice sometimes called "P-hacking" [3]) and through preferential publication of significant results [4], attributable to both researchers and journal editors. Recent studies found an unusually high occurrence of P-values just below the threshold of statistical significance [5–7]. For example, in abstracts that reported results as odds ratios or relative risks, Gøtzsche found 46 P-values between 0.0400 and 0.0499 but only five between 0.0500 and 0.0599 [5], which would be highly unlikely without selection bias. When Jager and Leek estimated the "science-wise false discovery rate" (i.e., the proportion of published significant findings that correspond to type-1 errors) by applying statistical models developed for genomic studies [8], their approach was criticized chiefly because publication bias renders the statistical model untrustworthy [9–12]. Previous studies have not fully reflected what happens in medical research because they examined only abstracts [5,7,8] or only a subset of P-values [6].


What is new?

Key findings
• The distribution of >4,000 P-values published in medical research articles suggested a pervasive selection bias associated with statistical significance.
• This bias was observed for most study designs and most types of analyses, including randomized trials and primary analyses, but excepting interaction tests and heterogeneity tests.

What this adds to what was known?
• Previous studies have shown that P-values published in abstracts are highly selected to highlight results that are statistically significant.
• This study suggests that selective reporting of P-values affects medical research articles globally.

What is the implication and what should change now?
• The focus on statistical significance distorts the published record of medical research and the evidence available for medical decision making.
• Methods of statistical inference other than P-values, and other methods for disseminating research results, deserve consideration.

In this study, we describe the distribution of P-values in full medical research articles, to verify whether this distribution matches the shape expected from a mixture of null and alternative hypotheses, and to identify irregularities that may reveal selection bias. We compare distributions of P-values according to study design and type of analysis. Here we observe scientific practice but do not attempt to judge the appropriateness of the statistical tests that were performed, nor the adequacy of their interpretation. Neither do we address the fundamental merits and limitations of P-values as measures of evidence; this issue is addressed elsewhere (e.g., [13,14]).

2. Methods

We included in this cross-sectional study medical research articles published in four prominent journals (BMJ, JAMA, Lancet, and New England Journal of Medicine) starting on April 1, 2016. We identified articles that analyzed original numerical data and included at least one P-value, in order of publication, until 30 eligible articles were retrieved from each journal (120 articles in total). We also noted the use of estimation methods (typically confidence intervals, in a few cases credible intervals), either alone or in conjunction with P-values, in all screened articles.

All reported P-values were retrieved from the selected articles, as published, except when the result was only given as significant or not at the 0.05 level or described verbally as such. We did not retrieve P-values from appendices or attachments. For each article, we first abstracted P-values from the tables (including footnotes), then from the figures, and finally from the text, taking care to skip P-values that duplicated those in tables or figures. For each P-value, we recorded the following information: (1) appearance in the abstract; (2) whether it was a primary analysis (according to the Methods section of each article); (3) whether it came from a parsimonious model or was otherwise described as a significant result selected among a larger number of results; (4) whether it was a baseline comparison from a randomized trial; (5) whether it was an interaction test that was not the primary analysis; (6) whether it was a heterogeneity test from a meta-analysis; and (7) whether any correction for multiple testing had been applied. In the latter case, we did not back-compute uncorrected P-values, as insufficient detail was provided in all instances.

The sample size for this study was chosen to obtain a sufficiently detailed description of the distribution of P-values. We aimed for at least 20 P-values in each one-percent interval; because we expected the distribution to be skewed to the right, we decided to retrieve about 4,000 P-values in total. As we also wanted to include the same number of articles from each journal, the final number of P-values was 4,158.

For the analysis, P-values smaller than 0.01 that were reported as inequalities were imputed to the midpoint of the corresponding interval; for example, for P < 0.001, we imputed 0.0005. Based on the initial exploratory analysis, which identified irregular frequencies at both extremities of the distribution and at P = 0.05, we classified the P-values into six categories: (1) <0.001; (2) 0.001 to <0.01; (3) 0.01 to <0.05; (4) 0.05 to <0.09; (5) 0.09 to <0.99; and (6) ≥0.99.
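A minimal Python sketch of this imputation and classification (our illustration, not the authors' code; the study's analyses were run in SPSS, and the function names are hypothetical):

```python
# Sketch of the midpoint imputation and six-category classification described
# above (our reconstruction; the parsing rule for inequalities is an assumption).

def impute_midpoint(reported: str) -> float:
    """Convert a reported P-value to a number; inequalities below 0.01
    (e.g., '<0.001') are imputed to the midpoint of their interval."""
    if reported.startswith("<"):
        return float(reported[1:]) / 2.0  # '<0.001' -> 0.0005
    return float(reported)

def categorize(p: float) -> str:
    """Assign a P-value to one of the six categories used in the analysis."""
    if p < 0.001:
        return "<0.001"
    elif p < 0.01:
        return "0.001 to <0.01"
    elif p < 0.05:
        return "0.01 to <0.05"
    elif p < 0.09:
        return "0.05 to <0.09"
    elif p < 0.99:
        return "0.09 to <0.99"
    else:
        return ">=0.99"

print(categorize(impute_midpoint("<0.001")))  # -> '<0.001'
```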

We compared distributions of P-values across journals, study designs, types of analyses (primary, ordinary, baseline comparison from a randomized controlled trial (RCT), parsimonious model, interaction test, heterogeneity test, test with adjustment for multiplicity), and locations within a paper (abstract, tables, figures, text). The aim of these analyses was descriptive; given the presence of biased sampling and the lack of independence of observations, we refrained from statistical tests for these comparisons.

To obtain a more detailed distribution, we also defined 100 one-percent-wide intervals of P-values and obtained frequencies for each interval. We plotted the logarithm of each frequency against the logarithm of the midpoint of each interval. In interpreting this distribution, we assume that the observed P-values come either from null hypotheses or from alternative hypotheses [15,16]. If the null hypothesis H0 is true and the test statistic is continuous, the distribution of P-values is uniform [15,16]. This is because the probability of finding a P-value equal to or smaller than the one observed equals the probability of finding the observed result or a more extreme one, that is, Pr(P ≤ p | H0) = Pr(X ≥ x | H0), where P is the random variable corresponding to the observed P-value and X the test statistic. But Pr(X ≥ x | H0) is the definition of the P-value p. So the cumulative distribution function of P under the null hypothesis is the identity function, CDF0(p) = p, between 0 and 1, and its probability density function is uniform on this interval. Of note, with discrete data the distribution of P-values is irregular and therefore not neatly uniform [15,17]. The distribution of P-values under a specific alternative hypothesis is a decreasing function between 0 and 1 [16,18], but true alternative hypotheses differ from one analysis to the next, and the distribution of P-values resulting from multiple tests is a mixture that cannot be described in closed form. Nevertheless, this distribution should be smooth and decreasing between 0 and 1 [18,19]. In analyses of genetic studies, it is often assumed that P-values pooled over alternative hypotheses can be represented by a beta distribution with parameters a < 1 and b = 1 [16]. With b = 1, the beta density simplifies to f(p) = c·p^(a−1), and therefore log10 f(p) = log10 c + (a − 1)·log10 p. Hence, if all hypotheses were alternative, the plot of log10(frequency) versus log10(P) would be linear. The admixture of a uniform component bends the function, but it should remain smooth. Furthermore, the intercept (i.e., the density at P = 1) provides an upper bound on the proportion of null hypotheses among all those tested [16].
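To make these expected shapes concrete, the following simulation (our illustration; the 50:50 mixture and the shape parameter a = 0.3 are arbitrary assumptions) draws P-values from a uniform null component and a Beta(a, 1) alternative component, bins them into one-percent intervals, and fits the log-log regression described above. The resulting curve is smooth, with no step at 0.05:

```python
# Simulated P-value mixture: uniform under H0, Beta(a, 1) under H1 (a sketch;
# mixture proportions and shape parameter are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
n_null, n_alt, a = 2000, 2000, 0.3
p = np.concatenate([
    rng.uniform(0.0, 1.0, n_null),  # P-values from true null hypotheses
    rng.beta(a, 1.0, n_alt),        # P-values from alternative hypotheses
])

# Bin into 100 one-percent intervals, as in the study
freq, edges = np.histogram(p, bins=100, range=(0.0, 1.0))
mid = (edges[:-1] + edges[1:]) / 2.0

# Log-log regression: for a pure Beta(a, 1) sample the plot is linear with
# slope a - 1; the uniform admixture bends it, but the curve stays smooth.
mask = (mid >= 0.05) & (freq > 0)
slope, intercept = np.polyfit(np.log10(mid[mask]), np.log10(freq[mask]), 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```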

To assess publication bias associated with statistical significance, we examined frequencies of P-values between 0.01 and 0.09, in one-percent intervals. This distribution should be decreasing (if at least some alternative hypotheses were tested). In addition, in the absence of publication bias, the distribution should be smooth. If P-values >0.05 were less likely to be published than P-values <0.05, we would expect a discrete step at P = 0.05. Analyses were run using SPSS (IBM SPSS Statistics for Windows, Version 22.0; Armonk, NY: IBM Corp.).

3. Results

3.1. Included articles

We screened 149 articles that contained numerical results and retained 120. Twenty-nine articles were excluded because they reported P-values only as inequalities (<0.05 vs. ≥0.05; N = 2), reported only confidence intervals (N = 19), or did not report inference methods (N = 8). The 120 retained articles all included at least one P-value, but most included confidence intervals as well, either alongside the corresponding P-values or without P-values (Table 1). The use of inference methods was similar across journals. Almost half of the articles (46.7%) were randomized trials and 40.0% were observational studies; the rest were modeling studies, meta-analyses, and phase-I trials (Table 1). Randomized trials were less common in the BMJ than in the other journals in this sample.

3.2. Distribution of P-values

We retrieved 4,158 P-values from the 120 articles, on average 35.6 per article (range 1 to 141; Table 1). At the 0.05 level of statistical significance, 2,211 P-values (53.2%) were nonsignificant and 1,947 (46.8%) were significant. Splitting the P-values into one-percent intervals yielded a more detailed picture of the distribution (Fig. 1). The 94 intervals between 0.05 and 0.989 appeared to be scattered along a straight line.

Table 1. Excluded and included articles that reported numerical results in four general medical journals, starting on April 1, 2016, and study designs of included articles

|                                                      | Total        | BMJ          | JAMA         | Lancet       | New England Journal of Medicine |
| ---------------------------------------------------- | ------------ | ------------ | ------------ | ------------ | ------------------------------- |
| Screened articles that reported numerical results, N | 149          | 43           | 35           | 39           | 32                              |
| Articles included in study                           |              |              |              |              |                                 |
| At least one P-value, N (%)                          | 120 (100)    | 30           | 30           | 30           | 30                              |
| P-values without confidence intervals, N (%)         | 109 (90.8)   | 28           | 27           | 24           | 30                              |
| P-values with confidence intervals, N (%)            | 91 (75.8)    | 18           | 22           | 26           | 25                              |
| Confidence intervals without P-values, N (%)         | 91 (75.8)    | 27           | 25           | 23           | 16                              |
| Study design                                         |              |              |              |              |                                 |
| Randomized clinical trial, N (%)                     | 56 (46.7)    | 2            | 15           | 17           | 22                              |
| Prospective, N (%)                                   | 36 (30.0)    | 20           | 9            | 3            | 4                               |
| Cross-sectional or case-control, N (%)               | 12 (10.0)    | 2            | 4            | 5            | 1                               |
| Modeling, N (%)                                      | 4 (3.3)      | 1            | 2            | 1            | 0                               |
| Meta-analysis, N (%)                                 | 8 (6.7)      | 5            | 0            | 3            | 0                               |
| Phase-I trial, N (%)                                 | 4 (3.3)      | 0            | 0            | 1            | 3                               |
| P-values per article, mean (range)                   | 35.6 (1–141) | 35.5 (3–114) | 43.0 (1–141) | 30.2 (2–112) | 29.9 (1–88)                     |
| Significant P-values per article at 0.05 level, mean (range) | 16.2 (0–104) | 16.9 (0–72) | 23.7 (0–104) | 11.9 (1–57) | 12.3 (0–46)              |
| Nonsignificant P-values per article, mean (range)    | 18.4 (0–90)  | 18.6 (0–90)  | 19.3 (0–72)  | 18.3 (1–90)  | 17.6 (0–77)                     |


Fig. 1. Logarithm-transformed distribution of 4,158 P-values retrieved from four medical journals, grouped in one-percent intervals (e.g., the first interval represents P-values <0.01, the second P-values ≥0.01 and <0.02, etc.). The abscissa is the base-10 logarithm of the midpoint of each interval; for example, the first point is log10(0.005) = −2.30. Black dots represent statistically significant P-values (<0.05; five dots on the left) and P = 1 (on the right); circles represent the other P-values. The dotted line is the linear regression obtained from P-values between 0.05 and 0.99 (circles).

The four intervals between 0.01 and 0.049 had higher frequencies than expected from the extrapolated linear function. The most obvious outliers were the extreme intervals, P < 0.01 and P ≥ 0.99. Given these results, we did not attempt to model the whole distribution as a beta-uniform mixture. We performed an exploratory linear regression on the 94 one-percent intervals between 0.05 and 0.989, after logarithm transformation, which yielded an intercept (on the log10 scale) of −2.448, corresponding to an expected relative frequency of 0.0036 in the interval between 0.99 and 1. Setting aside the issue of publication bias, this puts an upper limit of 36% on the percentage of null hypotheses among those tested (i.e., 100 × 0.0036 over the interval between 0 and 1). Frequencies of P-values across one-percent intervals between 0.01 and 0.09 displayed an abrupt drop above 0.05 (Fig. 2).
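The computation behind this upper bound can be sketched as follows (our reconstruction, not the authors' SPSS code; the bin counts are assumed to be available as arrays):

```python
# Upper bound on the proportion of null hypotheses from the log-log regression
# (a sketch under the assumptions stated in the text).
import numpy as np

def null_upper_bound(counts: np.ndarray, midpoints: np.ndarray, n_total: int) -> float:
    """Fit log10(count) on log10(midpoint) over the 94 bins in [0.05, 0.99),
    extrapolate the relative frequency at P = 1, and convert it into an upper
    bound on the proportion of null hypotheses (a uniform density spreads
    its mass evenly over the 100 one-percent bins)."""
    slope, intercept = np.polyfit(np.log10(midpoints), np.log10(counts), 1)
    rel_freq_at_1 = 10.0 ** intercept / n_total  # log10(midpoint) = 0 at P = 1
    return 100.0 * rel_freq_at_1

# With the study's fitted intercept of -2.448 on the relative-frequency scale:
print(100 * 10 ** -2.448)  # ~0.36, i.e., at most ~36% null hypotheses
```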

3.3. Stratified analysis

To facilitate the comparison of subsets of P-values, we categorized the distribution into six classes: 26.1% of the P-values were <0.001, 9.1% were between 0.001 and 0.009, 11.7% between 0.010 and 0.049, 4.4% between 0.05 and 0.089, 46.9% between 0.090 and 0.989, and 1.9% were 0.99 or more (Table 2). This distribution was stable across journals, except that highly significant results were more common in the BMJ and JAMA than in the other two journals. Very small P-values were less frequent in randomized trials and in meta-analyses than in other study designs; these two designs also accounted for most P-values between 0.99 and 1. The step at P = 0.05 was seen in all study designs except meta-analyses. About 85% of the P-values came from generic analyses; that is, they were neither prespecified primary analyses, nor interaction tests, heterogeneity tests, baseline comparisons in clinical trials, or parsimonious models. Primary analyses were predominantly statistically significant, and the decrease in frequency above P = 0.05 was sharp. The 63 P-values from parsimonious models were virtually all significant (the only exception was a P-value equal to 0.05).


Fig. 2. Distribution of 667 P-values ≥0.01 and <0.09, in one-percent intervals, retrieved from four medical journals. For all intervals denoted [L, U), the lower bound L is included in the interval and the upper bound U is not.

Table 2. Categories of P-values according to the characteristics of the analysis (row %)

|                         | N (column %) | <0.001 | ≥0.001 to <0.01 | ≥0.01 to <0.05 | ≥0.05 to <0.09 | ≥0.09 to <0.99 | ≥0.99 |
| ----------------------- | ------------ | ------ | --------------- | -------------- | -------------- | -------------- | ----- |
| Total                   | 4,158 (100)  | 26.1   | 9.1             | 11.7           | 4.4            | 46.9           | 1.9   |
| Journal                 |              |        |                 |                |                |                |       |
| BMJ                     | 1,065 (25.6) | 31.6   | 6.0             | 10.0           | 4.6            | 46.5           | 1.2   |
| JAMA                    | 1,290 (31.0) | 32.2   | 10.1            | 12.9           | 4.5            | 38.4           | 1.9   |
| Lancet                  | 906 (21.8)   | 18.5   | 9.5             | 11.4           | 4.6            | 53.9           | 2.1   |
| NEJM                    | 897 (21.6)   | 18.4   | 10.8            | 12.0           | 3.7            | 52.7           | 2.3   |
| Study design            |              |        |                 |                |                |                |       |
| RCT                     | 1,899 (45.7) | 14.3   | 8.2             | 11.5           | 4.2            | 59.2           | 2.7   |
| Prospective             | 1,506 (36.2) | 35.1   | 10.2            | 12.6           | 4.7            | 36.1           | 1.2   |
| Cross-sectional         | 321 (7.7)    | 43.0   | 9.7             | 12.8           | 4.0            | 29.9           | 0.6   |
| Modeling                | 78 (1.9)     | 65.4   | 6.4             | 5.1            | 2.6            | 20.5           | 0.0   |
| Meta-analysis           | 283 (6.8)    | 23.0   | 2.8             | 8.8            | 6.0            | 56.9           | 2.5   |
| Phase-I trial           | 71 (1.7)     | 43.7   | 32.4            | 9.9            | 0.0            | 14.1           | 0.0   |
| Type of analysis        |              |        |                 |                |                |                |       |
| Common                  | 3,406 (81.9) | 26.6   | 9.9             | 11.0           | 4.2            | 46.4           | 2.0   |
| Primary                 | 133 (3.2)    | 32.3   | 13.5            | 22.6           | 3.0            | 28.6           | 0.0   |
| Parsimonious            | 63 (1.5)     | 49.2   | 12.7            | 36.5           | 1.6            | 0.0            | 0.0   |
| Heterogeneity           | 137 (3.3)    | 15.3   | 5.1             | 9.5            | 7.3            | 60.6           | 2.2   |
| Interaction             | 272 (6.5)    | 2.6    | 2.2             | 9.6            | 6.3            | 77.9           | 1.5   |
| Multiplicity correction | 119 (2.9)    | 64.7   | 0.8             | 10.1           | 3.4            | 19.3           | 1.7   |
| RCT baseline comparison | 28 (0.7)     | 3.6    | 3.6             | 28.6           | 10.7           | 53.6           | 0.0   |
| Location in study       |              |        |                 |                |                |                |       |
| Abstract                | 262 (6.3)    | 35.5   | 15.6            | 22.1           | 5.3            | 21.4           | 0.0   |
| Tables                  | 2,903 (69.8) | 25.0   | 8.0             | 9.0            | 4.5            | 51.3           | 2.3   |
| Figures                 | 417 (10.0)   | 22.1   | 9.4             | 12.0           | 3.4            | 52.0           | 1.2   |
| Text only               | 576 (13.9)   | 30.2   | 11.3            | 20.3           | 4.2            | 33.0           | 1.0   |

Abbreviation: RCT, randomized controlled trial.


In contrast, interaction tests and heterogeneity tests were mostly nonsignificant and displayed no increase in frequency below P = 0.05. There were too few baseline comparisons in randomized trials to draw conclusions about such tests. P-values reported in abstracts and in the text were more often significant than P-values presented in tables and figures. The drop in frequencies above P = 0.05 was seen in all sections of the studies. To examine the abrupt "step" at P = 0.05, we examined the frequencies of P-values grouped in one-percent intervals from 0.01 to 0.09 (Table 3). The frequencies of P-values decreased sharply at 0.05 in most subgroups, but the decrease was less marked in some of them: in the Lancet, in abstracts, in meta-analyses, and especially in heterogeneity and interaction tests.

4. Discussion

4.1. Overview

The distribution of P-values published in four major medical journals displayed a globally smooth pattern, but with some substantial irregularities. When the P-values were binned into one-percent-wide intervals, the relative frequency distribution decreased linearly from 0.05 to 0.99 on a logarithmic scale. This is compatible with the notion that most nonsignificant P-values originate from a mixture of null and alternative hypotheses of varying power. The irregularities consisted of a high frequency of P-values <0.01 (even more so <0.001), a sharp discontinuity in the frequencies of P-values at 0.05, and a high frequency of P-values between 0.99 and 1.


Globally, these irregularities suggest that published P-values are not an unselected sample of P-values, such as can be obtained from a genome-wide association study.

4.2. Excess of highly significant P-values

In our sample, more than a quarter of the P-values were below 0.001, and 35% were below 0.01. This first one-percent interval (P < 0.01) was an outlier when assessed against the distribution of P-values between 0.05 and 0.99. The reason for this finding is unclear. The use of a beta distribution to represent the distribution of P-values under the alternative hypothesis is merely a convenient approximation. The shape of the distribution depends on the power of the tests [18], and obviously, no two published tests are identical in this regard. Thus, it is entirely possible that the apparent excess of very small P-values stems from the performance of a large number of very powerful tests.
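As a stylized illustration (ours, not from the study), consider a two-sided z-test: with the noncentrality that yields 80% power at the 0.05 level, a substantial share of P-values already falls below 0.001, so high-powered tests alone can generate many "highly significant" results.

```python
# How power shapes the P-value distribution under the alternative [18]:
# a stylized two-sided z-test calculation (our illustration).
from scipy.stats import norm

alpha, power = 0.05, 0.80
delta = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # noncentrality, ~2.80
for threshold in (0.01, 0.001):
    # Probability of a two-sided P-value below the threshold, ignoring the
    # negligible mass in the opposite tail
    pr = norm.cdf(delta - norm.ppf(1 - threshold / 2))
    print(f"Pr(P < {threshold}) ~ {pr:.2f}")  # ~0.59 and ~0.31
```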

However, researchers are usually motivated to conduct studies that have a reasonable power, typically of 80% or 90%, to limit costs and participant burden. Alternatively, tests may sometimes be performed when the pattern in the data is obvious and a test is unnecessary. For instance, one included study compared medical diagnoses and drug treatments among trial participants who took different numbers of concomitant medications; for 61 of 64 comparisons, patients who took more medications differed significantly (P < 0.001) from those who took fewer. The associations seem almost tautological, and the P-values add little to the description (but we suspect that had the investigators not reported them, they would have been called to order during peer review).

Table 3. Frequencies (N) of published P-values between 0.01 and 0.09, by subgroup (subgroups with <20 observations not shown)

|                   | [0.01, 0.02) | [0.02, 0.03) | [0.03, 0.04) | [0.04, 0.05) | [0.05, 0.06) | [0.06, 0.07) | [0.07, 0.08) | [0.08, 0.09) |
| ----------------- | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| Total             | 128          | 138          | 119          | 100          | 50           | 46           | 47           | 39           |
| Journal           |              |              |              |              |              |              |              |              |
| BMJ               | 33           | 27           | 22           | 25           | 15           | 14           | 9            | 11           |
| JAMA              | 34           | 55           | 46           | 32           | 11           | 16           | 16           | 15           |
| Lancet            | 39           | 24           | 21           | 19           | 15           | 8            | 13           | 6            |
| NEJM              | 22           | 32           | 30           | 24           | 9            | 8            | 9            | 7            |
| Study design      |              |              |              |              |              |              |              |              |
| RCT               | 56           | 64           | 54           | 44           | 22           | 19           | 21           | 17           |
| Prospective       | 52           | 60           | 40           | 38           | 17           | 19           | 17           | 18           |
| Cross-sectional   | 16           | 8            | 9            | 8            | 5            | 2            | 5            | 1            |
| Meta-analysis     | 2            | 5            | 11           | 7            | 5            | 5            | 4            | 3            |
| Type of analysis  |              |              |              |              |              |              |              |              |
| Common            | 100          | 118          | 79           | 76           | 40           | 35           | 39           | 29           |
| Primary           | 12           | 4            | 9            | 5            | 0            | 1            | 1            | 1            |
| Parsimonious      | 6            | 3            | 7            | 7            | 1            | 0            | 0            | 0            |
| Heterogeneity     | 0            | 2            | 7            | 4            | 3            | 4            | 2            | 1            |
| Interaction       | 6            | 7            | 7            | 6            | 5            | 4            | 3            | 5            |
| Location in study |              |              |              |              |              |              |              |              |
| Abstract          | 20           | 15           | 15           | 8            | 5            | 4            | 2            | 3            |
| Tables            | 70           | 71           | 64           | 55           | 36           | 29           | 35           | 30           |
| Figures           | 11           | 18           | 8            | 13           | 3            | 6            | 4            | 1            |
| Text only         | 27           | 34           | 32           | 24           | 6            | 7            | 6            | 5            |

Abbreviation: RCT, randomized controlled trial.


Finally, it is possible that the high proportion of very small P-values is due to selection bias. Even researchers who claim statistical significance for P < 0.05 may find a P-value below 0.001 more convincing, and the same holds for reviewers and readers of the article. Many journals do not publish exact P-values below 0.001, which suggests that this threshold has acquired a special status of extreme statistical significance, beyond which any distinctions are deemed meaningless (only genetic studies have broken this mold and use significance levels of 10⁻⁷ or 10⁻⁸ as a precaution against type-1 errors [20]). The desirable status of P-values <0.001 may motivate researchers to "dredge" their analyses to identify such "highly significant" results.

4.3. Excess of P-values equal to 1

At the opposite end of the range, we found a high frequency of P-values equal to 1 or >0.99. The most likely explanation is that when discrete variables (such as proportions) are compared, an exact equality is a possibility, especially when the sample sizes are equal and small. Simulations show that in such cases the distribution of P-values under the null hypothesis is not uniform [15,17] but distinctly lumpy, and P = 1 is a "lump" common to most situations. This indicates that the beta-uniform mixture is not an ideal model for the distribution of P-values, especially those from chi-square tests.
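A small simulation (ours) illustrates this lump: with two small equal-sized groups and a binary outcome, Fisher's exact test returns P = 1 whenever the two groups show identical counts, which under the null happens far more often than the 1% that a uniform distribution would allot to the last interval.

```python
# Null distribution of Fisher's exact test P-values with small, equal groups
# (our illustration; group size and event probability are arbitrary choices).
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
n_per_group, prob, n_sim = 10, 0.5, 5000
at_one = 0
for _ in range(n_sim):
    a = rng.binomial(n_per_group, prob)  # events in group 1 under H0
    b = rng.binomial(n_per_group, prob)  # events in group 2 under H0
    _, p = fisher_exact([[a, n_per_group - a], [b, n_per_group - b]])
    at_one += p >= 0.999
print(f"Share of P-values at 1: {at_one / n_sim:.2f}")  # well above 0.01
```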

4.4. Step at P = 0.05

The relative frequency of reported P-values decreased by about half above 0.05, the common threshold of statistical significance. This provides a rough estimate of the selection bias that affects P-values: nonsignificant results are about 50% less likely to make it into print than significant results. This estimate is in line with the results of previous cohort studies of protocols approved by research ethics committees: Easterbrook et al. obtained an odds ratio of publication of 2.32 for significant study results compared with nonsignificant results [21], and Dickersin et al. found an odds ratio of 2.54 [22]. These studies analyzed the main result of a given study, whereas our estimate concerns the total population of tests. In addition, Easterbrook et al. found no publication bias among randomized trials, whereas we found evidence of publication bias for all original study designs, including RCTs, and also for analyses that were described as primary. The exceptions were meta-analyses, particularly heterogeneity tests, and also interaction tests. Although these exceptions represent only a small fraction of published P-values, they are important because they show that a smooth density around P = 0.05 is achievable when statistical significance is not particularly desirable. We also found that the contrast between P-values ≥0.05 and <0.05 was particularly strong for tests highlighted in free text, and less so for results reported in tables.

4.5. Null and alternative hypotheses

The evidence of selection bias was so overwhelming that we renounced fitting a beta-uniform mixture distribution to the data. However, we will hazard a back-of-the-envelope calculation (detailed in Table 4; a computational transcription follows the table), which uses two results: (1) a proportion of 0.36 of null hypotheses among the published tests, obtained by multiplying by 100 the model-based relative frequency of P-values between 0.99 and 1, that is, 0.0036, and (2) underreporting of statistically nonsignificant results by 50%, based on Fig. 2. From this, we estimate that 47% of tested hypotheses (2,994/6,369) were null and 53% were alternative. The null hypotheses produced by definition 5% of type-1 errors (150/2,994), and among the alternative hypotheses, the average power was 53% (1,797/3,375). Among nonsignificant results, the negative predictive value was 64% (2,844/4,422), and among significant results, the positive predictive value was 92% (1,797/1,947). These estimates are only as good as the underlying assumptions and should not be taken at face value.

Table 4. Estimation of the frequencies of null and alternative hypotheses, by statistical significance, in published articles and for all tests performed, with computation steps in footnotes

|                                                | Null hypothesis | Alternative hypothesis | Total          |
| ---------------------------------------------- | --------------- | ---------------------- | -------------- |
| Tests in published articles (study sample)     |                 |                        |                |
| Significant result                             | 150 (step 3)    | 1,797 (step 4)         | 1,947          |
| Nonsignificant result                          | 1,422 (step 2)  | 789 (step 4)           | 2,211          |
| Total                                          | 1,572 (step 3)  | 2,586 (step 4)         | 4,158          |
| Tests performed (including unpublished tests)  |                 |                        |                |
| Significant result                             | 150 (step 3)    | 1,797 (step 4)         | 1,947          |
| Nonsignificant result                          | 2,844 (step 3)  | 1,578 (step 4)         | 4,422 (step 1) |
| Total                                          | 2,994 (step 3)  | 3,375 (step 4)         | 6,369 (step 1) |

Step 1: Assuming that all significant P-values but only 50% of nonsignificant P-values are published (Fig. 2), 4,422 nonsignificant P-values were in fact obtained, and 6,369 P-values in total.
Step 2: The estimated prevalence of null hypotheses was 36% among the 4,158 published P-values, that is, 1,497, of which 95% are by definition nonsignificant, that is, 1,422.
Step 3: As only 50% of nonsignificant P-values are published, 2,844 nonsignificant P-values were in fact obtained from null hypotheses, plus 150 type-1 errors, all published. In total, 2,994 null hypotheses were tested and 1,572 were published.
Step 4: Totals of tests under the alternative hypothesis are obtained by subtraction.
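The steps in the footnotes can be transcribed directly (a sketch under the stated assumptions: 36% null prevalence among published tests, 50% underreporting of nonsignificant results, and type-1 errors always published):

```python
# Back-of-the-envelope reconstruction of Table 4 (under the stated assumptions).
published_sig, published_nonsig = 1947, 2211
published_total = published_sig + published_nonsig        # 4,158

# Step 1: only 50% of nonsignificant results are assumed published
performed_nonsig = 2 * published_nonsig                   # 4,422
performed_total = published_sig + performed_nonsig        # 6,369

# Step 2: 36% of published tests are assumed null; 95% of null tests are
# nonsignificant by definition of the 0.05 level
published_null = round(0.36 * published_total)            # 1,497
published_null_nonsig = round(0.95 * published_null)      # 1,422

# Step 3: scale null nonsignificant results up for underreporting; type-1
# errors (5% of all null tests performed) are assumed always published
performed_null_nonsig = 2 * published_null_nonsig         # 2,844
performed_null = round(performed_null_nonsig / 0.95)      # 2,994
type1_errors = performed_null - performed_null_nonsig     # 150

# Step 4: alternative-hypothesis cells follow by subtraction
performed_alt = performed_total - performed_null          # 3,375
power = (published_sig - type1_errors) / performed_alt    # ~0.53
npv = performed_null_nonsig / performed_nonsig            # ~0.64
ppv = (published_sig - type1_errors) / published_sig      # ~0.92
print(f"null share = {performed_null / performed_total:.2f}, "
      f"power = {power:.2f}, NPV = {npv:.2f}, PPV = {ppv:.2f}")
```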


4.6. Strengths and limitations

Our study has strengths but also limitations. An original aspect is the analysis of all P-values published in research studies, not merely those from abstracts. Furthermore, we were able to stratify the results by study design and type of analysis. One limitation is the lack of independence of P-values derived from the same study, and the existence of correlated clusters of P-values, as when the same association is evaluated in several statistical models or repeated over several strata. Another limiting factor is that confidence intervals were at least as common in quantitative articles as P-values (see Table 1); indeed, many articles used a mixture of inference methods. Thus, we can no longer assume that P-values capture all or most statistical inferences in clinical research. We do not fully understand when researchers resort to tests, to confidence intervals, or to both, but this decision is probably not random, and this adds another opportunity for selection bias. Finally, we included articles from four prominent medical journals. It is possible that publication bias is more pronounced in high-profile journals than elsewhere, as authors will try to publish their most exciting results in the most visible media.

4.7. Implications

The medical research community appears keen to find and report statistically significant results. This distorts the published record of science and may have untoward consequences for patient care, because clinical decisions are based at least in part on published research. Solutions to this problem include a move to inference methods other than statistical tests, such as confidence intervals, likelihood ratios [23], or Bayesian methods [24], which are not (yet) burdened by a dichotomous threshold that signals scientific desirability (although this is debatable for confidence intervals [25]). However, such methods have been available for many years and have not much reduced the popularity of statistical tests. Another line of action would attempt to dissociate the dissemination of factual scientific results from the reward mechanisms that sustain researchers' careers, but this may require an overhaul of the current publication model.

References

[1] Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of reporting of P values in the biomedical literature, 1990-2015. JAMA 2016;315:1141–8.
[2] Sterling TD. Publication decisions and their possible effects on inferences drawn from tests of significance, or vice versa. J Am Stat Assoc 1959;54:30–4.


[3] Head ML, Holman L, Kahn AT, Jennions MD. The extent and consequences of P-hacking in science. PLoS Biol 2015;13:e1002106.
[4] Begg CB, Berlin JA. Publication bias: a problem in interpreting medical data. J R Stat Soc Ser A 1988;151:419–45.
[5] Gøtzsche P. Believability of relative risks and odds ratios in abstracts: cross sectional study. BMJ 2006;333:231–4.
[6] Masicampo EJ, Lalande DR. A peculiar prevalence of p values just below .05. Q J Exp Psychol 2012;65:2271–9.
[7] Ginsel B, Aggrawal A, Xuan W, Harris I. The distribution of probability values in medical abstracts: an observational study. BMC Res Notes 2015;8:721.
[8] Jager LR, Leek JT. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 2014;15:1–12.
[9] Benjamini Y, Hechtlinger Y. Discussion: an estimate of the science-wise false discovery rate and application to the top medical literature by Jager and Leek. Biostatistics 2014;15:13–6.
[10] Gelman A, O'Rourke K. Discussion: difficulties in making inferences about scientific truth from distributions of published p-values. Biostatistics 2014;15:18–23.
[11] Goodman SN. Discussion: an estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 2014;15:23–7.
[12] Ioannidis JPA. Discussion: why "An estimate of the science-wise false discovery rate and application to the top medical literature" is false. Biostatistics 2014;15:28–36.
[13] Goodman SN. Toward evidence-based medical statistics. 1: the P value fallacy. Ann Intern Med 1999;130:995–1004.
[14] Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat 2016;70:129–31.
[15] Murdoch DJ, Tsai YL, Adcock J. P-values are random variables. Am Stat 2008;62:242–5.
[16] Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 2003;19:1236–42.
[17] Bland M. Do baseline P-values follow a uniform distribution in randomised trials? PLoS One 2013;8:e76010.
[18] Hung HMJ, O'Neill RT, Bauer P, Köhne K. The behavior of the p-value when the alternative hypothesis is true. Biometrics 1997;53:11–22.
[19] Sellke T, Bayarri MJ, Berger JO. Calibration of p values for testing precise null hypotheses. Am Stat 2001;55:62–71.
[20] Jannot AS, Ehret G, Perneger T. P < 5 × 10⁻⁸ has emerged as a standard of statistical significance for genome-wide association studies. J Clin Epidemiol 2015;68:460–5.
[21] Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet 1991;337:867–72.
[22] Dickersin K, Min Y, Meinert CL. Factors affecting publication of research results: follow-up of applications submitted to two Institutional Review Boards. JAMA 1992;267:374–8.
[23] Goodman SN. Toward evidence-based medical statistics. 2: the Bayes factor. Ann Intern Med 1999;130:1005–13.
[24] Hopper R. The Bayesian interpretation of a P-value depends only weakly on statistical power in realistic situations. J Clin Epidemiol 2009;62:1242–7.
[25] Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol 1998;51:355–60.