Significance of Significance

U.S. v. Harkonen: The Criminalization of Scientific Inference in Public Speech
Nathan A. Schachtman

Nathan A. Schachtman, Esq., PC Ulmer & Berne LLP, Of Counsel

http://petrieflom.law.harvard.edu/events/details/biostatistics-and-fda-regulation

Idiopathic Pulmonary Fibrosis (IPF): Some Background Facts
• IPF – disease(s) of unknown cause(s)
• IPF causes rapidly progressive, fatal fibrosis of the lungs
• No approved medications for treatment of IPF (now Pirfenidone approved in Europe)
• Interferon γ-1b – cytokine with anti-fibrotic effects
• InterMune had obtained FDA approval for Actimmune® (interferon γ-1b) for treatment of chronic granulomatous disease and severe, malignant osteopetrosis

The Austrian Clinical Trial
• 1999 NEJM: small RCT of prednisolone + Interferon γ-1b vs. prednisolone alone
• 18 patients, with 9 in each arm
• All had “mild” or early IPF – no significant heart disease
• Lung function of all patients randomized to Interferon γ-1b improved; all patients on corticosteroid alone worsened.
R. Ziesche, et al., “A Preliminary Study of Long-Term Treatment with Interferon γ-1b and Low-Dose Prednisolone in Patients with Idiopathic Pulmonary Fibrosis,” 341 New Engl. J. Med. 1264 (1999).

InterMune’s RCTs – Key Data
• InterMune’s Phase II trial continued the Austrian trial
• InterMune’s Phase III trial of Actimmune vs. placebo:
  1˚ end point = “progression-free survival” (composite of mortality & PFTs)
  2˚ end points (9), including survival
• Why 9 secondaries? Unclear how to measure pulmonary function benefit. Mortality, however, is the ultimate outcome.
• 28/168 patients on placebo died, while only 16/162 on Actimmune died: a relative reduction in mortality of about 40% on therapy (p-value = 0.084).

Did InterMune Recruit Patients With Disease Too Far Advanced to Benefit?
• Survival benefit seen for a non-prespecified subgroup that had mild-to-moderate IPF (by pulmonary function criteria) at the outset of the trial.
• Survival benefit was 70% in this non-prespecified subgroup (77% of participants): 6 of 126 patients on Actimmune died, versus 21 of 128 on placebo (p = 0.004)
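The death counts quoted on the two slides above can be checked with simple arithmetic. The sketch below is only that: it compares crude proportions of deaths, which is an assumption of this illustration and not the time-to-event analysis behind the reported p-values, so it does not reproduce them. The function name `mortality_summary` is hypothetical.

```python
def mortality_summary(deaths_treated, n_treated, deaths_control, n_control):
    """Crude mortality comparison from raw counts (illustrative only)."""
    risk_treated = deaths_treated / n_treated
    risk_control = deaths_control / n_control
    return {
        "risk_treated": risk_treated,
        "risk_control": risk_control,
        # Absolute difference in the proportion who died.
        "absolute_reduction": risk_control - risk_treated,
        # Reduction expressed as a fraction of the control-arm risk.
        "relative_reduction": 1 - risk_treated / risk_control,
    }


# Counts as quoted on the slides: full Phase III population, then the
# mild-to-moderate subgroup (the placebo arm is inferred from the totals).
overall = mortality_summary(16, 162, 28, 168)
subgroup = mortality_summary(6, 126, 21, 128)

for label, result in (("overall", overall), ("subgroup", subgroup)):
    print(
        f"{label}: treated {result['risk_treated']:.1%}, "
        f"control {result['risk_control']:.1%}, "
        f"absolute reduction {result['absolute_reduction']:.1%}, "
        f"relative reduction {result['relative_reduction']:.1%}"
    )
```

On these counts, the "40%" and "70%" figures correspond to relative reductions in the proportion who died. The absolute differences are much smaller.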

Data Dredging?


Indictment vs. Dr. Harkonen – Two Counts
• The actus reus was the issuance of a press release, which itself was required by SEC regulations
• Felony misbranding, 21 U.S.C. §§ 331(k), 333(a)(2), 352(a) -- ACQUITTED BY JURY
• Wire fraud, under 18 U.S.C. § 1343, for false statements -- CONVICTED BY JURY
United States v. Harkonen, No. C 08–00164 MHP, 2010 WL 2985257, at *1 (N.D. Cal. July 27, 2010), aff’d, 510 F. App’x 633, 636 (9th Cir. Mar. 4, 2013), cert. denied, ___ U.S. ___ (Dec. 16, 2013)


Crime Scene in U.S. v. Harkonen


Some Startling Facts About Harkonen
• The defense failed to call a statistician or a clinician.
• The gov’t called no statistical expert witness.
• Prof. Fleming testified that inferring causation from a statistically non-significant, post-hoc analysis was “false as a matter of statistics.” ER2498
• Fleming testified that p-values < 0.05 were “magic numbers.” U.S. v. Harkonen, 2010 WL 2985257, at *5 (N.D. Cal. 2010)

• The trial court never instructed the jury on how to interpret expert testimony or on its power to disregard Fleming’s opinion.
• Post trial, new defense counsel submitted affidavits from Profs. Goodman and Rubin on the diversity of views on p-values.

More Startling Facts About Harkonen
Dr. Harkonen’s statement (or opinion) was based upon more than a single RCT. When he issued the press release, he had
• the Austrian RCT,
• the InterMune extension of that trial, as well as
• the InterMune Phase III RCT, in addition to
• basic research on interferon gamma-1b.

In a per-protocol analysis (using actual rather than assigned treatment), the survival benefit for Actimmune patients who met inclusionary eligibility criteria was 48% greater than for those randomized to placebo, p = 0.055. ER2070.
Harkonen made no prediction that InterMune would obtain FDA approval.

Significance, Tail, or Attained Probability
• Fisher proposed and popularized the use of p-values in 1925, as a quasi-formal inferential method.
• Definition: the probability, assuming no effect (the null hypothesis H0), of obtaining a result equal to or more extreme than what was actually observed [p-value = P(extreme data | H0)]

R.A. Fisher, Statistical Methods for Research Workers (1925)
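To make the definition concrete, here is a minimal sketch of a tail probability under a null hypothesis. The coin-toss setting (20 tosses, 15 heads, fair coin as H0) is an assumed toy example and has nothing to do with the trial data. The function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(k_observed, n, p_null=0.5):
    """P(X >= k_observed) when X ~ Binomial(n, p_null), i.e. under H0."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(k_observed, n + 1)
    )


# Toy example: 15 heads in 20 tosses, H0 = the coin is fair.
# The p-value is the chance of a result at least this extreme if H0 holds.
p_value = binomial_upper_tail(15, 20)
print(f"one-sided p-value = {p_value:.4f}")
```

The number is a statement about the data given H0, not about the probability that H0 is true.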

Statistical Orthodoxy Made Into Criminal Prohibition
Fleming’s orthodoxy of a strictly dichotomous test was rejected by Fisher and many leading statisticians since. See Sir Ronald A. Fisher, Statistical Methods and Scientific Inference 42 (1956) (ridiculing rigid hypothesis testing as “absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”).

Ordinary Language Published studies frequently present post-hoc analyses without revealing their non-preplanned nature. Chantal Boonacker, et al., A Comparison of Subgroup Analyses in Grant Applications and Publications, 174 Am. J. Epidem. 291, 291 (2011).

“Demonstrate” has multiple meanings including “to show,” in addition to apodictic certainty. The Oxford English Dictionary (2013) (entry 4)

Common Practice
Scientific authors frequently use “demonstrate” for less than certain conclusions. See, e.g., Jerome Goldstein, et al., “Treatment of Severe, Disabling Migraine Attacks in an Over-the-Counter Population of Migraine Sufferers,” 19 Cephalalgia 684, 689 (1999) (reporting post-hoc subgroup analysis based upon multiple clinical trials; “results of this post-hoc analysis demonstrate that AAC is effective. . . .”).

It was a press release, for goodness sake!

Steven Woloshin, et al., “Press Releases by Academic Medical Centers: Not So Academic?,” 150 Ann. Intern. Med. 613 (2009).

Multiplicity vs. Duplicity
• There is no consensus on whether such a p-value could or should be adjusted, especially given the statistical dependency among the secondary end points.
• The government never showed that the p-value in dispute (p = 0.004) would have exceeded 0.05 if adjusted for multiple comparisons.
• The “sin” of multiplicity lies in the inflation of the Type I error rate (false positive findings or false alarms), but the government would soon take a very different position in Matrixx Initiatives from the orthodoxy it sponsored in the Harkonen prosecution.

Multiple Testing and P-values
Assuming that H0 is true, for a test at a pre-specified significance level (alpha) of 0.05, we fail to reject H0 95% of the time in the long run, and we can calculate the probability of at least one false-positive (Type I) error when we make multiple (n) independent tests in the same set of data:

$$P[\text{none}] = (1 - \alpha)^n$$

$$P[\geq 1] = 1 - (1 - \alpha)^n$$

For n = 20 and α = 0.05:

$$P[\geq 1] = 1 - (1 - \alpha)^n = 1 - (0.95)^{20} = 0.64$$
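The calculation on this slide can be sketched in a few lines. The sketch assumes, as the slide does, that the n tests are independent and that every null hypothesis is true. The function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests, all H0 true."""
    return 1 - (1 - alpha) ** n_tests


# How the chance of at least one false alarm grows with the number of tests.
for n in (1, 5, 10, 20):
    print(f"n = {n:2d}: P[>= 1 false positive] = {familywise_error_rate(0.05, n):.2f}")
```

With statistically dependent end points, as in the secondary end points discussed earlier, the independence assumption fails and this formula is only a rough guide.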

Bonferroni Correction
In general, if we have k independent significance tests of null hypotheses, all true, at the α level, then the probability of no significant results is $(1 - \alpha)^k$. If we make α sufficiently small, we can keep the overall probability that none of the tests will be significant at 0.95. Since α will be small:

$$0.95 = (1 - \alpha)^k \approx 1 - k \cdot \alpha$$

Then:

$$\alpha = \frac{0.05}{k}$$

So the probability of having at least one of k tests yield a p-value < α, if all null hypotheses are true, is maintained at about 0.05.
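A minimal sketch of the Bonferroni arithmetic follows. The values of k are hypothetical: the source does not say how many comparisons an adjustment would have counted, and the independence assumption is the slide's, not a claim about the trial's end points. The function names are also hypothetical.

```python
def bonferroni_alpha(k, familywise_alpha=0.05):
    """Per-test significance level that holds the familywise rate near the target."""
    return familywise_alpha / k


def bonferroni_adjusted_p(p_value, k):
    """Equivalent view: scale the p-value by k (capped at 1) and compare to 0.05."""
    return min(1.0, p_value * k)


def exact_familywise_rate(k, per_test_alpha):
    """Exact P(at least one false positive) for k independent tests, all H0 true."""
    return 1 - (1 - per_test_alpha) ** k


# Hypothetical numbers of comparisons, chosen only for illustration.
for k in (5, 10, 20):
    alpha_k = bonferroni_alpha(k)
    print(
        f"k = {k:2d}: per-test alpha = {alpha_k:.4f}, "
        f"exact familywise rate = {exact_familywise_rate(k, alpha_k):.4f}, "
        f"adjusted p for 0.004 = {bonferroni_adjusted_p(0.004, k):.3f}"
    )
```

The exact familywise rate stays just under 0.05, which is the sense in which the linear approximation is slightly conservative.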

Post-Hoc Analysis Published in NEJM
“Analysis of the treatment-adherent cohort of patients showed … a relative reduction in the risk of 66 percent (5 percent of 126 patients in the interferon gamma-1b group and 14 percent of 143 patients in the placebo group died, P = 0.02). The hazard ratio for death in the interferon gamma-1b group, as compared with the placebo group, was 0.3 (95 percent confidence interval, 0.1 to 0.9).”
Raghu, et al., “A Placebo-Controlled Trial of Interferon γ-1b in Patients with Idiopathic Pulmonary Fibrosis,” 350 New Engl. J. Med. 125, 129-30 (2004)


DUPLICITY
U.S. v. Harkonen Post-Trial “Matrixx” Motion
U.S. Solicitor General (with counsel for FDA) filed an amicus brief in Matrixx. The brief introduces its discussion of statistical significance with a heading: “Statistical significance is a limited and nonexclusive tool for inferring causation.” Brief for the United States as Amicus Curiae Supporting Respondents, in Matrixx Initiatives, Inc. v. Siracusano, 2010 WL 4624148, at *13 (Nov. 12, 2010).

Solicitor General’s Amicus Brief
The amicus brief in Matrixx then disclaimed the necessity for statistical significance: “[w]hile statistical significance provides some indication about the validity of a correlation between a product and a harm, a determination that certain data are not statistically significant … does not refute an inference of causation.” Id. at *14
(Compare with Matrixx opinion: “A lack of statistically significant data does not mean that medical experts have no reliable basis for inferring a causal link between a drug and adverse events.” 131 S. Ct. at 1319.)

Solicitor General on Statistical Significance
In a footnote, the U.S. described its position as applying to both safety and efficacy outcomes: “[t]he same principle applies to studies suggesting that a particular drug is efficacious. A study in which the cure rate for cancer patients who took a drug was twice the cure rate for those who took a placebo could generate meaningful interest even if the results were not statistically significant.” Id. at *15 n.2
U.S. Amicus Brief at *13, 14 (leaving out conditional or cumulative probability)


2013 WL 5915131, 2013 WL 6174902 (filed Sept. 4, 2013)

http://schachtmanlaw.com/wp-content/uploads/2010/03/KJR-TLL-NAS-Amicus-Brief-in-US-v-Harkonen-090413A.pdf