invited review Fundamental concepts in statistics: elucidation and illustration DOUGLAS CURRAN-EVERETT, SUE TAYLOR, AND KAREN KAFADAR Departments of Pediatrics and of Preventive Medicine and Biometrics, School of Medicine, University of Colorado Health Sciences Center, Denver, 80262; and Department of Mathematics, University of Colorado at Denver, Denver, Colorado 80217-3364 Curran-Everett, Douglas, Sue Taylor, and Karen Kafadar. Fundamental concepts in statistics: elucidation and illustration. J. Appl. Physiol. 85(3): 775–786, 1998.—Fundamental concepts in statistics form the cornerstone of scientific inquiry. If we fail to understand fully these fundamental concepts, then the scientific conclusions we reach are more likely to be wrong. This is more than supposition: for 60 years, statisticians have warned that the scientific literature harbors misunderstandings about basic statistical concepts. Original articles published in 1996 by the American Physiological Society’s journals fared no better in their handling of basic statistical concepts. In this review, we summarize the two main scientific uses of statistics: hypothesis testing and estimation. Most scientists use statistics solely for hypothesis testing; often, however, estimation is more useful. We also illustrate the concepts of variability and uncertainty, and we demonstrate the essential distinction between statistical significance and scientific importance. An understanding of concepts such as variability, uncertainty, and significance is necessary, but it is not sufficient; we show also that the numerical results of statistical analyses have limitations. confidence interval; estimation; tolerance interval; uncertainty; variability
There are very few things which we know, which are not capable of being reduc’d to a Mathematical Reasoning, . . . and where a Mathematical Reasoning can be had, it’s as great folly to make use of any other, as to grope for a thing in the dark when you have a Candle standing by you. John Arbuthnot (1692) STATISTICS IS ONE KIND of mathematical reasoning. Its concepts and principles are ubiquitous in science: as researchers, we use them to design experiments, analyze data, report results, and interpret the published findings of others. Indeed, it is from this foundation of statistical concepts and principles that scientific knowledge is accumulated. If we fail to understand fully these fundamental statistical concepts and principles—if our statistical reasoning is faulty—then we are more likely to reach wrong scientific conclusions. Wrong conclusions based on faulty reasoning is shoddy science; it is also unethical (1, 21, 30). Regrettably, faulty reasoning in statistics rears its head in the practice of science: for 60 years, statisticians have documented statistical errors in the scientific literature (3, 4, 17, 33, 50). In part, these errors exist because many introductory textbooks of statistics paradoxically hinder literacy in statistics: they empha-
size methods rather than concepts, they contain glaring errors, or they perpetuate misconceptions (4, 11, 12). In his editorial prelude to a series of statistical papers, Yates (51) wrote that the papers were designed to raise statistical consciousness and thereby reduce statistical errors in journals published by the American Physiological Society. Rather than reinforce concepts, these papers reviewed methods: analysis of variance (20), linear regression (37, 46), mathematical modeling (22, 29, 40), risk assessment (36), and statistical packages (34). The proper use of any statistical technique, however, requires an understanding of the fundamental statistical concepts behind the technique. How well do physiologists understand fundamental concepts in statistics? One way to answer this question is to examine the empirical incidence of basic statistical quantities such as standard deviations, standard errors, and confidence intervals. These quantities characterize different statistical features: standard deviations characterize variability in the population, whereas standard errors and confidence intervals characterize uncertainty about the estimated values of population parameters, e.g., means. Of the original articles published in 1996 by the American Physiological Society, the overwhelming majority (69–93%, range) report standard errors, apparently not as estimates of uncer-
8750-7587/98 $5.00 Copyright r 1998 the American Physiological Society
Table 1. Manuscripts for the American Physiological Society’s journals in 1996: use of statistics and statisticians %Research Manuscripts That Report
Am. J. Physiol. Cell Physiol. Endocrinol. Metab. Gastrointest. Liver Physiol. Heart Circ. Physiol. Lung Cell. Mol. Physiol. Regulatory Integrative Comp. Physiol. Renal Fluid Electrolyte Physiol. J. Appl. Physiol. J. Neurophysiol.
43 28 26 60 25 41 27 62 58
21 18 8 17 20 17 15 24 36
88 86 92 87 84 88 93 79 69
0 0 0 0 0 0 0 0 2
7 4 4 10 4 15 7 6 5
0 4 12 3 4 12 4 10 7
n, No. of research manuscripts reviewed. In 1996, these journals published a total of 3,693 original articles. No. of articles reviewed represents a 10% sample (selected by systematic random sampling, fixed start) of articles published by each journal. * Precise P value: for example, P 5 0.02 (rather than P , 0.05) or P 5 0.13 (rather than P $ 0.05 or P 5 not significant). † We assessed collaboration with a statistician using author affiliation and acknowledgments. We recognize that a statistician may be affiliated with another department, e.g., medicine. Using our criterion, however, few articles (0–12%, range) report formal collaboration of a physiologist with a statistician, a partnership that typically reaps great rewards.
tainty but as estimates of variability (Table 1). Virtually no articles (0–2%, range) report confidence intervals, recommended by statisticians (2, 5, 9, 10, 28, 39) as interval estimates of uncertainty about the values of population parameters. Moreover, few articles (4–15%, range) report precise P values, which precludes personal assessment of statistical significance. In this review, we summarize the primary scientific uses of statistics. Then, we illustrate several fundamental concepts: variability, uncertainty, and significance. Last, we illustrate that although an understanding of concepts such as variability, uncertainty, and significance is necessary, it is not sufficient: it is essential to realize also that the numerical results of statistical analyses have limitations. Glossary a Ave 5q6 µ n n N (µ, s2 ) P Pr 5A6 s sy s s2 s2 SE 5q6 Var 5q6 Y yi y
Critical significance level Average of the quantity q Population mean Degrees of freedom Number of observations Normal (Gaussian) distribution with mean µ and variance s2 Achieved significance level Probability of event A Population standard deviation Standard deviation of the sampling distribution of the sample mean Sample standard deviation Population variance Sample variance Standard error of the quantity q Variance of the quantity q Random variable Y Sample observation i, where i 5 1, 2, . . . , n Sample mean
SCIENTIFIC USES OF STATISTICS
In science, there are two main uses of statistics: hypothesis testing and estimation. Most researchers
use statistics solely for hypothesis testing. In many situations, statisticians play down hypothesis testing and prefer estimation instead. Hypothesis testing. To test a scientific hypothesis, a researcher must formulate the hypothesis before any data are collected, then design and execute an experiment that is relevant to it. Because the hypothesis is most often one of no difference, the hypothesis is called, by tradition, the null hypothesis.1 Using data from the experiment, the researcher must next compute the observed value T of a test statistic. Finally, the researcher must compare the observed value T with some critical value T *, chosen from the distribution of the test statistic that is based on the null hypothesis. If T is more extreme than T *, then that is a surprising result if the null hypothesis is true, and the researcher is entitled, on statistical grounds, to become skeptical about the scientific validity of the null hypothesis. The statistical test of a null hypothesis is useful because it assesses the strength of the evidence: it helps guard against an unwarranted conclusion, or it helps argue for a real experimental effect (19, 48). Nevertheless, a null hypothesis is often an artificial construct: before any data are recorded, the investigator knows—at least, suspects—that the null hypothesis is not exactly true. Moreover, the only question a hypothesis test can answer is a trivial one: is there anything other than random variation here?2 Statisticians have emphasized repeatedly the limited value of hypothesis testing (2, 4, 9, 18, 24, 28, 31, 38, 50). In fact, the P values that result from hypothesis tests have been described as ‘‘absurdly academic’’3 (25) and as having a ‘‘strictly limited role’’ (19) in data analysis. Within the scientific community, unwar1 The adjective ‘‘null’’ can be misleading: this hypothesis need not be one of no difference. The use of null persists because of historical inertia. 2 Kruskal (38) reviews other drawbacks to hypothesis testing. 3 Sir Ronald Fisher, the author of this phrase, developed many statistical procedures, including the analysis of variance.
ranted focus on hypothesis testing has blurred the distinction between statistical significance and scientific importance (3, 13, 19). Most investigators appear to reach scientific conclusions that are based not on their knowledge of science but solely on the probabilities of test statistics (16); this is an untenable approach to scientific discovery. The limited utility of hypothesis testing can be demonstrated with an example. Suppose a clinician wants to assess the impact of a placebo and the b-blockers bisoprolol and metoprolol on heart rate variability in patients with left heart failure. Suppose also that the clinician constructs the null and alternative hypotheses, H0 and H1, as H0: treatments have identical effects on heart rate variability H1: treatments have different effects on heart rate variability The result of this hypothesis test will fail to convey any information about the direction or magnitude of the treatment effects on heart rate variability. Direction and magnitude are important: in patients with left heart failure, decreases in heart rate variability are associated with increases in the risk of sudden cardiac catastrophe (49). Direction and magnitude of an effect reflect scientific importance; they are obtained by estimation. Estimation. Regardless of the statistical result of a hypothesis test, the crucial question concerns the scientific result: is the experimental effect big enough to be relevant? A point estimate of a population parameter4 and an interval estimate of the uncertainty about the value of that parameter help answer this question. For example, one point estimate of a population mean is the sample mean; one interval estimate of the uncertainty about the value of the populations mean is a confidence interval. Interval estimates circumvent the drawbacks inherent to hypothesis testing, yet they provide the same statistical information as a hypothesis test (15, 18, 28, 38). More important, point and interval estimates convey information about scientific importance. Practical considerations. Estimation focuses attention on the magnitude and uncertainty of the experimental results. We must emphasize that hypothesis testing can have value beyond assessing the strength of the experimental evidence: for example, hypothesis testing is useful if an investigator wants to evaluate the importance of between-subject variability in an experiment. In practice, estimation should be done whenever it is relevant and feasible; the precise P value from the associated hypothesis test should be reported with the point and interval estimates. When more than one hypothesis is tested in an experiment, the problem of multiple comparisons becomes relevant. Nevertheless, a discussion of the issues involved in multiple-compari4 A parameter is a numerical constant: for example, the population mean.
son procedures is beyond the scope of this review; Refs. 2, 9, 42, and 48 summarize these issues. For the rest of this review, we focus our attention on several aspects of estimation. USING SAMPLES TO LEARN ABOUT POPULATIONS
As researchers, we use samples to make inferences about populations. A sample interests us not because of its own merits but because it helps us estimate selected characteristics of the underlying population: for example, the sample mean y estimates the population mean µ.5 As an illustration, suppose the random variable Y represents the change in systolic blood pressure after some intervention. Suppose also that the distribution of Y conforms to a normal distribution. A normal distribution is specified completely by two parameters: the mean and variance. The population mean µ conveys the location of the center of the distribution; the population standard deviation s, the square root of the population variance s2, conveys the spread of the distribution. The distribution of possible outcomes of the random variable Y is described by the normal probability density function ( f ), which incorporates µ and s2 f(y) 5
·exp 52( y 2 µ) 2/(2·s2 )6, (1) for 2` , y , 1`
In Fig. 1, the distributions for three different populations are theoretical: each depicts the distribution of population values as if we had observed the entire population.6 Suppose we want to estimate µ1 5 215, the mean of population 1, in Fig. 1. To do this, we would measure the change in systolic blood pressure in a sample of n independent observations, y1, y2, . . . , yn, from the population. For simplicity, assume we limit the sample to 10 observations. One random sample is 233, 215, 26, 0, 18, 23, 8, 222, 222, 27 The average of these sample observations is the sample mean y y5
o y 5 28.2 i
Because of intrinsic variability in the population, the sample mean y differs from the population mean µ1; only because this is a contrived example do we know the true magnitude of the discrepancy.7 Next, we review measures that estimate variability in the population. 5
References 2, 9, 42, and 48 discuss other aspects of sampling. Statistical calculations and exercises were executed by using SAS Release 6.04 (SAS Institute, Cary, NC, 1987). 7 We address the discrepancy between the value of the sample estimate of a population parameter and the value of the population parameter itself in ESTIMATING UNCERTAINTY ABOUT A POPULATION PARAMETER. 6
Fig. 1. Using samples to learn about populations: 3 normal distributions. These distributions differ in location, reflected in the mean µ, or spread, reflected in the standard deviation s. A normal probability density function (Eq. 1) describes the distribution of each population.
ESTIMATING VARIABILITY IN THE POPULATION
The preceding sample observations, 233, 215, . . . , 27, differ because the population from which they were drawn is distributed over a range of possible values. This intrinsic variability is more than a distraction: it is an integral part of statistics, and the careful study of variability may reveal something about underlying scientific processes (25). The most common measure of the variability among sample observations is the sample standard deviation s, the square root of the sample variance s 2 s 5 2
o ( y 2 y) i
(See also Refs. 2, 9, 42, and 48.) The sample standard deviation characterizes the typical distance of an observation from the distribution center; in other words, it reflects the dispersion of individual sample observations about the sample mean. The sample standard deviation s also estimates the population standard deviation s: the standard deviation of the sample observations 233, 215, . . . , 27 is s 5 15.2, which estimates s 5 20. Most journals would publish the preceding sample mean and standard deviation as 28.2 mmHg 6 15.2 The 6 symbol, however, is superfluous: the standard deviation is a single positive number. A standard deviation can be reported clearly with notation of this form 28.2 mmHg (SD 15.2) In a table, the symbol SD can be omitted without loss of clarity as long as the table legend identifies the parenthetical value as a standard deviation. The standard deviation is often a useful index of variability, but in many experimental situations it may
be a deceptive one: even subtle departures from a normal distribution can render useless the standard deviation as an index of variability (43); often, the distribution of a biological variable differs grossly from a normal distribution. As one example, the distribution of values for plasma creatinine (26) resembles the skewed distribution depicted in Fig. 2. When the tails of a distribution are elongated, as is the right tail of this skewed distribution, the sample standard deviation will be an inflated measure of variability in the population (43, 48). There are two remedies to this misrepresentation of variability by the standard deviation: use another measure of variability, or transform the data. Alternative measures of variability. Two measures of variability that are useful with a variety of distributions are the mean absolute deviation and the interquartile range. The mean absolute deviation (Ave 50 dev06) is the average distance of the sample observations from the sample mean 1 n 0y 2 y 0 Ave 50dev 06 5 · n i51 i
The interquartile range (often designated as IQR) encompasses the middle 50% of a distribution and is the difference between the 75th and 25th percentiles. For 0 , w , 1, the 100wth percentile is the value below which 100w% of the distribution is found. Data transformation. When the sample observations happen to be drawn from a population that has a skewed distribution (e.g., a constituent of blood or the growth rate of a tumor), a transformation may change the shape of their distribution so that the distribution of the transformed observations is more symmetric (14, 23, 26, 32, 48). Common transformations include the logarithmic, inverse, square root, and arc sine transformations. The APPENDIX reviews a useful family of data transformations.
Fig. 2. Estimating variability in the population: a skewed distribution. The lognormal probability density function (Eq. A1) describes this skewed distribution in which the Pr 5Y # 6.16 5 0.50 and the Pr 52.1 # Y # 16.46 5 0.68 (gray area). For a normal distribution with the same mean and variance (inset), the Pr 5Y # 10.06 5 0.50, and the Pr 523.1 # Y # 23.16 5 0.68 (gray area). See APPENDIX for further explanation.
In the next section, we revisit the unknown discrepancy between the sample estimate of a population parameter and the population parameter itself. ESTIMATING UNCERTAINTY ABOUT A POPULATION PARAMETER
In the sampling exercise from USING SAMPLES TO the sample mean y 5 28.2 (Eq. 2) estimated the population mean µ1 5 215. If we had calculated this sample mean from experimental observations, then we would be uncertain about the magnitude of the discrepancy between the sample estimate y and the population parameter µ1. The ability to estimate the level of uncertainty about the value of a population parameter by using the sample estimate of that parameter is a powerful aspect of statistics (47). Suppose we measure the same response variable, the change in systolic blood pressure, in a second sample of 10 independent observations drawn from the same population. We know beforehand that because of random sampling the mean of the second sample, y2, will differ from the mean of the first sample, y1 5 28.2. If we measure the change in systolic blood pressure in 100 samples of 10 independent observations, then we expect 100 different estimates of the population mean µ1; for example
LEARN ABOUT POPULATIONS,
y1 5 28.2,
y2 5 28.1,
· · ·,
y100 5 222.5
If we treat these 100 observed sample means as 100 observations, then we can calculate their mean and standard deviation, designated as Ave5y6 and SD5y6 Ave 5y6 5 214.5
and variance s2, which are known; the notation for this normal distribution is Y , N(µ, s2 ). If an infinite number of samples, each with n independent observations, is drawn from this normal distribution, then the sample means y1, y2, . . . , y` will also be distributed normally.8 The average of the sample means, Ave 5y6, is the population mean µ, but the variance of the sample means (Var 5y6) is smaller than the population variance s2 by a factor of 1/n Ave 5y6 5 µ
Var 5y6 5 s2y 5 s2/n
(The APPENDIX derives these expressions. Figure 3 develops these expressions using empirical examples.) Therefore, the standard deviation of the theoretical distribution of the sample mean, sy, is sy 5 s/În If the sample size n increases, then the standard deviation sy will decrease: that is, the more sample observations we have, the more certain we will be that the point estimate y is near the actual population mean µ. The standard deviation of the theoretical distribution of the sample mean is known also as the standard error of the sample mean, SE 5y6; that is SE 5y6 5 s/În In estimation, the standard error of the mean has no particular value; instead, it is useful because of its role
SD 5y6 5 6.07
We can generalize from this empirical distribution of sample means to a theoretical distribution of the sample mean for a sample of size n. Consider a random variable Y that is distributed normally with mean µ
8 The Central Limit Theorem states that the theoretical distribution of the sample mean will be approximately normal, regardless of the distribution of the original observations (35, 42). If the distribution of the original observations happens to be normal, then the theoretical distribution of the sample mean will be exactly normal.
Fig. 3. Estimating uncertainty about a population parameter: empirical distributions of sample means. These distributions are based on 1,000 samples of 5 (A), 10 (B), 20 (C), or 40 (D) observations drawn at random from population 1, for which the mean µ 5 215 and the variance s2 5 400. For each empirical distribution, the average of the sample means, Ave 5 y6, happens to be 215.1. As sample size increases, however, the sample means become concentrated more closely about Ave 5 y6. When sample size doubles, the variance of the sample means, Var 5 y6, is approximately halved.
in the calculation of a confidence interval for the population mean µ.9 Confidence intervals. When we construct a confidence interval for the population mean, we assign numerical bounds to the expected discrepancy between the sample mean y and the population mean µ. In essence, a confidence interval is a range that we expect, with some level of confidence, to include the actual value of the population mean. Below, we use the theoretical distribution of the sample mean to derive the confidence interval for the population mean µ.10 In the theoretical distribution of the sample mean, 100(1 2 a)% of the possible sample means is included in the interval [µ 2 a, µ 1 a]
where the allowance a is a 5 za/2 ·SE 5y6
In Eq. 5, za/2 is the 100[1 2 (a/2)]th percentile from the standard normal distribution, i.e., a normal distribution with mean 0 and variance 1, and SE 5y6 is defined by Eq. 3. Therefore, when the population standard deviation s is known, 95% of the possible sample means are within 1.96·SE 5y6 of the population mean µ. The interval in Eq. 4 can be written as the probability expression Pr 5µ 2 a # y # µ 1 a6 5 1 2 a which declares that the probability is 1 2 a that a sample mean lies within the interval [µ 2 a, µ 1 a]. After algebraic rearrangement, this expression can be written
the interval [y 2 a, y 1 a]
is called the 100(1 2 a)% confidence interval for the population mean µ. In practice, the sample standard deviation s estimates the population standard deviation s, which means that s/În estimates the standard error of the mean (Eq. 3). In calculating a 100(1 2 a)% confidence interval for the mean µ, this uncertainty about the actual value of s is handled by replacing za/2 in Eq. 5 with ta/2,n, the 100[1 2 (a/2)]th percentile from a Student t distribution with n 5 n 2 1 degrees of freedom. Therefore, the allowance applied to the sample mean to obtain the 100(1 2 a)% confidence interval for the population mean (Eq. 6) is a 5 ta/2,n ·SE 5y6 where SE 5y6 5 s/În. Note that this allowance exceeds the allowance in Eq. 5: there is greater uncertainty about the value of the population mean µ. This happens because if n , `, then ta/2,n . za/2 for all values of a. Suppose we want to calculate a confidence interval for the population mean µ1 5 215 by using the observations 233, 215, . . . , 27 of the first sample. The mean and standard deviation of these 10 observations are y 5 28.2 and s 5 15.2. Therefore, the estimated standard error of the mean is SE 5y6 5 s/În 5 15.2/Î10 5 4.81
Pr 5y 2 a # µ # y 1 a6 5 1 2 a
Because n 5 10, there are n 5 n 2 1 5 9 degrees of freedom. If we want a 95% confidence interval, then a 5 0.05, ta/2,n 5 2.26, and the allowance a 5 2.26 3 4.81 5 10.9. Therefore, the 95% confidence interval is
but note that the randomness resides in the parameter estimate y, not in the actual parameter µ. In this form,
9 References 2, 9, 42, and 48 discuss the calculation of confidence intervals for other population parameters. 10 Moses (Ref. 42, p. 113–117) illustrates further the concept of a confidence interval by using empirical examples.
In other words, we can declare, with 95% confidence, that the population mean is included in the interval [219.1, 12.7]. Bear in mind that a single confidence interval either does or does not include the value of the population
Fig. 4. Estimating uncertainty about a population parameter: 95% confidence intervals for a population mean. These confidence intervals are for 100 samples of 10 observations drawn at random from population 1 in Fig. 1. It is because of the random sampling that the position and length of the confidence interval vary from sample to sample. About 95 of these intervals—the actual number will vary—are expected to cover the population mean of 215 mmHg. In this example, 98 of the confidence intervals cover the population mean µ; the 2 exceptions are highlighted (heavy black lines numbered 1 and 2).
parameter; in experimental situations, we are uncertain about which of these outcomes has occurred. Instead, the level of confidence in a confidence interval is based on the concept of drawing a large number of samples, each with n observations, from the population. When we measured the change in systolic blood pressure in 100 random samples, we obtained 100 different sample means and 100 different sample standard deviations. As a consequence, we will calculate 100 different 100(1 2 a)% confidence intervals; we expect ,100(1 2 a)% of these observed confidence intervals to include the actual value of the population mean (see Fig. 4). A confidence interval characterizes the uncertainty about the estimated value of a population parameter. Sometimes, an investigator may be interested less in the value of the population parameter and more in the distribution of individual observations. A tolerance interval characterizes the uncertainty about the estimated distribution of those individual observations (see APPENDIX). Next, we illustrate the distinction between statistical significance and scientific importance. Last, we show that the numerical results of statistical analyses have limitations. STATISTICAL AND SCIENTIFIC SIGNIFICANCE DIFFER
Hypothesis testing, as the primary scientific use of statistics, has a drawback: the result of a hypothesis test conveys mere statistical significance. In contrast, estimation conveys scientific significance.11 This distinction is obvious if we use the results of a recent clinical trial. In this trial, the Systolic Hypertension in the Elderly Program (SHEP) Cooperative Research Group 11 The word ‘‘significance,’’ when used to refer to scientific consequence, is ambiguous. Hereafter, we use the word ‘‘importance.’’
(45) evaluated the impact of antihypertensive drugs on the incidence of stroke in persons with isolated systolic hypertension. When compared with placebo, these drugs reduced by 36% (P 5 0.0003) the incidence of stroke. Associated with this reduced incidence of stroke was a greater decrease in systolic blood pressure. To appreciate the distinction between statistical significance and scientific importance, consider two populations that represent the theoretical distributions of the decreases in systolic blood pressure for the two groups. Let the decrease in systolic blood pressure of the placebo group be designated Y1 and that of the drug treatment group be designated Y2. Assume that Y1 and Y2 are distributed normally Y1 , N (µ1, s21 )
and Y2 , N (µ2, s22 )
The normal probability density function (Eq. 1), in which approximate values for the observed sample means and variances from the SHEP trial, yi and s2i , are substituted for the population means and variances, generates the population distributions depicted in Fig. 5 y1 5 215 ⇒ µ1, s 21 5 400 ⇒ s21 and y2 5 225 ⇒ µ2, s 22 5 400 ⇒ s22 Suppose our objective is to estimate the difference between population means µ2 2 µ1 5 225 2 (215) 5 210 mmHg The SHEP group established convincingly that the difference µ2 2 µ1, which represents the greater decrease in systolic blood pressure after drug therapy, was important. To estimate µ2 2 µ1, we would sample at random from each population: the difference between sample means, y2 2 y1, estimates the difference between population means, µ2 2 µ1.
Fig. 5. Statistical and scientific significance differ: placebo (black) and drug-treatment (gray) populations. The populations represent theoretical distributions of changes in systolic blood pressure during year 5 of the Systolic Hypertension in the Elderly Program clinical trial (see Ref. 45). The distributions are described by the normal probability density function (Eq. 1) in which the sample means and variances, yi and s2i , are substituted for the population means and variances. To generate samples of size n from each population, observations (Obs) were drawn at random from the placebo population; corresponding observations from the drug-treatment population were obtained by subtracting 10 from each placebo observation. The sampling procedure is illustrated for n 5 2.
By drawing samples of 2–128 observations from each population (Table 2) and by forcing y2 2 y1 5 210 (see Fig. 5), the distinction between statistical significance and scientific importance becomes clear. As sample size n grows, the statistical significance increases, from P 5 0.71 for n 5 2 to P , 0.001 for n 5 128. Regardless of sample size, one aspect of scientific importance, that reflected by the difference y2 2 y1, remains constant. As sample size increases, uncertainty about the actual difference µ2 2 µ1, another aspect of scientific importance characterized by the numerical bounds of the confidence interval, decreases. Practical considerations. In experimental situations, the distinction between statistical significance and Table 2. Statistical and scientific significance differ: statistical results n
y2 2 y1
SE 5 y2 2 y1 6
95% Confidence Interval*
Pr 5µ2 2 µ1 5 06‡
2 4 8 10 15 20 25 32 64 128
210 210 210 210 210 210 210 210 210 210
23.1 22.6 12.3 10.1 7.3 5.8 5.3 4.7 3.7 2.4
2110 to 190 265 to 145 236 to 116 231 to 111 225 to 15 222 to 12 221 to 11 219 to 21 217 to 23 215 to 25
20.43 20.44 20.81 20.99 21.38 21.74 21.88 22.13 22.70 24.25
0.71 0.67 0.43 0.34 0.18 0.09 0.07 0.04 ,0.01 ,0.001
n, Sample size drawn from placebo ( population 1) and drug treatment (population 2) populations (see Fig. 5). * Confidence interval for the difference between population means, µ2 2 µ1 (see Eq. A2). † Test statistic used to evaluate statistical significance of the difference y2 2 y1 (see Eq. A3). ‡ Probability (2-tailed) that µ2 2 µ1 5 0; this is the significance level P for the null hypothesis H0 : µ2 2 µ1 5 0. The difference y2 2 y1 and the 95% confidence interval for the difference µ2 2 µ1 reflect the magnitude and uncertainty of the experimental results. The test statistic t and its associated P value reflect statistical significance. An increase in the no. of observations drawn from each population decreases SE 5 y2 2 y1 6: as a consequence, the statistical significance increases (irregularly, because of random sampling), but the estimated difference between population means remains constant at y2 2 y1 5 210. The APPENDIX details the statistical equations required to perform this sampling exercise.
scientific importance can be maintained by routinely addressing two questions: how likely is it that the experimental effect is real, and is the experimental effect large enough to be relevant? The first question can be answered simply: compare the P value, obtained in the hypothesis test, with the critical significance level a, chosen before any data are collected; if P , a, then the experimental effect is likely to be real. The second question can be answered in two steps: calculate a confidence interval for the population parameter, and then assess the numerical bounds of that confidence interval for scientific importance; if either bound of the confidence interval is important from a scientific perspective, then the experimental effect may be large enough to be relevant. Consider the results when 15 sample observations were drawn from the placebo and drug treatment populations: when compared with placebo, the greater decrease in systolic blood pressure after drug therapy was unconvincing from a statistical perspective (P 5 0.18). Because the 95% confidence interval was [225, 15], uncertainty about the actual impact of drug treatment on systolic blood pressure is relatively large. Note, however, that the additional decrease in systolic blood pressure gained by drug treatment may have been as pronounced as 25 mmHg. From a scientific perspective, further studies, designed with greater statistical power, are warranted. To illustrate that a significant statistical result may have little scientific importance, imagine that systolic blood pressure had been measured in mmH2O rather than in mmHg. Consider the results when 128 sample observations were drawn from the two populations: the greater decrease in systolic blood pressure after drug therapy was compelling from a statistical perspective (P , 0.001). If the confidence interval [215, 25] is expressed in mmHg (by dividing each bound by 13.6), then the investigator can declare, with 95% confidence, that the magnitude of the greater decrease in systolic blood pressure was 0.4–1.1 mmHg. In this example, the
investigator can be quite certain of a trivial experimental effect. Whatever the statistical result of a hypothesis test, assessment of the corresponding confidence interval incorporates the scientific importance of the experimental result. LIMITATIONS OF STATISTICS
Although the process of scientific discovery requires an understanding of fundamental concepts in statistics, the use of statistics does have limitations. For example, not many of us would accept, solely on the basis of a close temporal relationship, that solar radiation governs stock market prices (Fig. 6). The limitations of statistics are more subtle if an association is plausible. Imagine this scenario: a neurological syndrome results from impaired production of some neurotransmitter. Drugs A and B, derivatives of the same parent compound, both stimulate production of this neurotransmitter. Just one of the drugs, however, continues to increase neurotransmitter production over its entire therapeutic range. At higher doses, the second drug becomes less effective at boosting neurotransmitter production and causes neurotoxicity. For each drug, Table 3 lists administered drug concentrations and measured increases in neurotransmitter production. If you rely on only the regression statistics in Table 3, which drug is which? If you are unfortunate and happen to have this hypothetical syndrome, then your choice assumes added importance. From the regression statistics alone, it is impossible to differentiate the drugs. Their identities are plain, however, when the data are plotted (Fig. 7): drug A increases neurotransmitter production over the entire range of drug concentrations; the increase in neurotransmitter production begins to fall at higher concentrations of drug B. Practical considerations. Data graphics are essential also if the requisite assumptions behind a particular
Fig. 6. Limitations of statistics: solar radiation and New York stock market prices during 1929 (after Ref. 27). In general, increases in stock prices were associated with decreases in solar radiation. This nonsensical association illustrates the phenomenon of spurious correlation.
Table 3. Limitations of statistics: raw data and regression statistics Drug A x
10 8 13 9 11 14 6 4 12 7 5
Drug B x
For each drug:
8.04 10 9.14 No. of observations (n) 5 11 6.95 8 8.14 Average of x values (x) 5 9.0 7.58 13 8.74 Average of y values (y) 5 7.5 8.81 9 8.77 Equation of regression line: yˆ 5 3 1 0.5x 8.33 11 9.26 Standard error of estimate of slope [SE5b16] 5 0.118 9.96 14 8.10 t5H0 : slope (b1 ) 5 06 5 4.24 7.24 6 6.13 Sum of squares of x values [S(x 2 x)2] 5 110.0 4.26 4 3.10 Regression sum of squares 5 27.50 10.84 12 9.13 Residual sum of squares 5 13.75 4.82 7 7.26 Correlation coefficient (r) 5 0.82 5.68 5 4.74 %Total sum of squares explained by regression (R 2 ) 5 67%
For drugs A and B, values are administered drug concentration x, measured increase in neurotransmitter production y, and statistics from regression analysis of the first-order model Y 5 b0 1 b1X 1 e, where e is error. Additional regression analyses (23) reveal that this model is inappropriate for drug B (see Fig. 7). Data are from Anscombe (6).
statistical technique are to be verified. For examples in regression, see chapt. 3 in Ref. 23. SUMMARY
It is depressing to find how much good biological work is in danger of being wasted through incompetent and misleading analysis . . . Frank Yates and Michael J. R. Healy (1964)
This scathing remark, written almost 35 years ago (50) but relevant even now (4), reflects the frustrations felt by statisticians over the statistical misconceptions held by scientists. These misconceptions exist in large part because of shortcomings in the cursory statistics education we received in graduate or medical school (4, 11, 12). The major defect in most introductory courses in statistics is that fundamental concepts in statistics, the cornerstone of scientific inquiry (47), are neglected rather than emphasized (4, 7, 17, 44, 50). Statisticians share responsibility with other faculty for ensuring
Fig. 7. Limitations of statistics: scatterplots of drug concentration x and increase in neurotransmitter production y. For each drug, the fitted first-order model yˆ 5 3 1 0.5x and corresponding regression statistics are identical (see Table 3). For only drug A, however, is this first-order relationship plausible. For drug B, a second-order model of the form Y 5 b0 1 b1 X 1 b2 X 2 1 e is required.
that introductory courses in statistics are relevant and sound (7, 44, 50). In this review, we have reiterated the primary role of statistics within science to be one of estimation: estimation of a population parameter or estimation of the uncertainty about the value of that parameter. Moreover, we have demonstrated the essential distinction between statistical significance and scientific importance; of the two, scientific importance merits more consideration. We have shown also that without data graphics, data analysis is a game of chance. And last, that this review was written by a physiologist and two statisticians embodies one of the most basic notions in all science: collaboration. APPENDIX
This APPENDIX reviews the lognormal distribution (a distribution that reveals limitations of the standard deviation as an estimate of variability), a versatile family of data transformations, the theoretical distribution of the sample mean, tolerance intervals, the statistical equations required to perform the significance sampling exercise, and the confidence interval for the difference between two population means. Lognormal distribution. The lognormal distribution is a common probability distribution model for skewed data. The random variable Y is distributed lognormally if the logarithm of Y is distributed normally with mean t and variance j2, or ln Y , N(t, j2 ). Formally, the lognormal probability density function g is g( y) 5
· exp 52ln2 ( y/e t )/(2 · j2 )6, for y . 0
The mean µg and variance s2g of the lognormal distribution specified by Eq. A1 are µg 5 e t1(j
5 e 2t1j · (e j 2 1) j2
For the distribution in Fig. 2, t 5 1.803 and 5 1; therefore, µg 5 10 and s2g 5 172. A family of data transformations. Box and Cox (14) have described a family of power transformations in which an observed variable y is transformed into the variable w by using the parameter l w5
( y l 2 1)/l
for l Þ 0, and
for l 5 0
The inverse (l 5 21) and square root transformations (l 5 0.5) are members of this family. Draper and Smith (Ref. 23, p. 225–226) summarize the steps required to estimate the parameter l so that the distribution of w is as normal (Gaussian) as possible. Theoretical distribution of the sample mean. Suppose some random variable X is distributed normally with mean µ and variance s2: that is, X , N(µ, s2 ). When a sample of n independent observations, x1, x2, . . . , xn, is drawn repeatedly from this distribution, the observed sample means can be treated as observations. These sample means will be distributed normally with mean µ and variance s2/n, or Ave 5x6 5 µ
Var 5x6 5 s2x 5 s2/n
As you might expect, there is a mathematical foundation to these relationships.
Consider the linear function L L 5 k1X11 k2X2 1 · · · 1 kmXm For i 5 1, 2, . . . , m, each ki is a real constant, and each Xi , N(µi, s2i ). The mean of L, Ave 5L6, is m
Ave 5L 6 5 k1µ1 1 k2µ2 1 · · · 1 kmµm 5
If X1, X2, . . . , Xm are mutually independent, then the variance of L, Var 5L6, is m
Var 5L 6 5 k 21s21 1 k 22s22 1 · · · 1 k 2ms2m 5
ok s i51
2 2 i i
If the function L is x, the mean of the n sample observations x1, x2, . . . , xn, then m 5 n, and furthermore, for i 5 1, 2, . . . , n ki 5 1/n
and Xi , N (µ, s2 )
Ave 5L 6 5
o k µ 5 o µ/n 5 n · (µ/n) 5 µ 5 Ave 5x6 i i
Var 5L 6 5
k 2i s2i 5
o s /n 2
5 n · (s2/n 2 ) 5 s2/n 5 Var 5x6
Tolerance intervals. A tolerance interval identifies the bounds that are expected to contain some percentage of a population, not just a single population parameter such as the mean (41). If a normal distribution has mean µ and variance s2, which are known, then the 100w% tolerance interval is [µ 2 5z(12w)/2 · s6, µ 1 5z(12w)/2 · s6] where z(12w)/2 is the 100[1 2 5(1 2 w)/26]th percentile from the standard normal distribution, i.e., N(0, 1). This tolerance interval covers exactly 100w% of the distribution. If w 5 0.95, then z(12w)/2 5 1.96. For the population that represented the change in systolic blood pressure after some intervention (see USING SAMPLES TO LEARN ABOUT POPULATIONS), µ 5 215 and s 5 20; therefore, the exact 95% tolerance interval is [254, 124] In practice, the sample statistics y and s are used to estimate the population parameters µ and s. This element of uncertainty about the values of µ and s is handled by replacing z(12w)/2 with the confidence coefficient k, where k depends on w as well as the sample size n. Therefore, the estimated 100w% tolerance interval is [ y 2 ks, y 1 ks] [If w 5 0.95 and n 5 `, then k 5 z(12w)/2 5 1.96 as above, when µ and s were known.] The coefficient k is chosen to enable the declaration, with 100(1 2 a)% confidence, that the estimated tolerance interval covers 100w% of the distribution (see Table XIV in Ref. 41). For the observations listed in USING SAMPLES TO LEARN ABOUT POPULATIONS, y 5 28.2 and s 5 15.2. Suppose we want to estimate with 95% confidence a 90% tolerance interval based on these results. When we use these percentages and the sample size of 10, the coefficient k 5 2.839. Therefore, the
tolerance interval is [251, 135] In other words, we can declare, with 95% confidence, that 90% of persons will have a change in systolic blood pressure of between 251 and 135 mmHg after the intervention. Note that this statement differs markedly from our previous assertion, made also with 95% confidence, that the population mean µ was included in the interval [219.1, 12.7]. The tolerance intervals outlined above are appropriate only if the distribution of the underlying population is normal; other formulas exist to construct tolerance intervals when the population is distributed nonnormally. Equations for the significance sampling exercise. For two samples of equal size n, the standard error of the difference between sample means, SE 5 y2 2 y1 6, is estimated as SE 5 y2 2 y1 6 5
1 s 21 )/n
where s 2j is sample variance. The 100(1 2 a)% confidence interval for µ2 2 µ1, the difference between population means, is [5( y2 2 y1 ) 2 a6, 5( y2 2 y1 ) 1 a6]
The allowance a applied to the difference y2 2 y1 is a 5 ta/2,n · SE 5 y2 2 y1 6 where ta/2,n is the 100[1 2 (a/2)]th percentile from a Student t distribution with n 5 2n 2 2 degrees of freedom. In this sampling exercise, we use the t distribution because we assume the standard deviations of the populations are unknown (42). The test statistic used to evaluate statistical significance of the difference y2 2 y1 is t 5H0 : µ2 2 µ1 5 06 5
( y2 2 y1 ) 2 0 SE 5 y2 2 y1 6
Confidence interval for the difference between population means. In the significance sampling exercise (see STATISTICAL AND SCIENTIFIC SIGNIFICANCE DIFFER), we calculated a confidence interval for the difference between two population means. Rather than construct a confidence interval for this difference, a researcher could construct a confidence interval for each population mean: if the two confidence intervals fail to overlap, the researcher would conclude that the population means differ. This approach is conservative. Consider the results when 32 sample observations were drawn from the placebo and drug treatment populations: when compared with placebo, drug therapy was associated with a greater decrease in systolic blood pressure (P 5 0.04), and the 95% confidence interval for the difference between population means was [219, 21]. That this confidence interval excludes 0 corroborates that the population means differ at the a 5 0.05 level. The observed sample means for the placebo and drug treatment groups were y1 5 29.9
and y2 5 219.9
the standard errors of these sample means were SE 5 y1 6 5 SE 5 y2 6 5 3.32 Because n 5 32, each sample has n 5 n 2 1 5 31 degrees of freedom.
If we want a 95% confidence interval for each population mean (Eq. 6), then a 5 0.05, ta/2,n 5 2.04, and the allowance a 5 2.04 3 3.32 5 6.8. Therefore, the 95% confidence interval for the mean of the placebo population is [217, 23] the 95% confidence interval for the mean of the drug treatment population is [227, 213] Because these individual confidence intervals overlap, we might conclude that there is insufficient evidence to declare that the two population means differ. In this example, 86% confidence intervals for the population means would just fail to overlap: that is, we could declare that the population means differ at the P 5 0.14 level. Note that when we calculate a confidence interval for the difference between these population means, we are more confident that an actual difference exists. We thank Brenda B. Rauner, Publications Manager and Executive Editor, APS Publications, for providing the information about research manuscripts published by the American Physiological Society. This review was supported in part by the Dept. of Pediatrics (M. Douglas Jones, Jr., Chair); by a Grant-in-Aid from the American Heart Association of Colorado and Wyoming (to D. Curran-Everett); and by the National Science Foundation Grant DMS 95-10435 (to K. Kafadar). Address for reprint requests: D. Curran-Everett, Dept. of Pediatrics, B-195, Univ. of Colorado Health Sciences Center, 4200 East 9th Ave., Denver, CO 80262 (E-mail: [email protected]
). REFERENCES 1. Altman, D. G. Misuse of statistics is unethical. In: Statistics in Practice, edited by S. M. Gore and D. G. Altman. London: Br. Med. Assoc., 1982, p. 1–2. 2. Altman, D. G. Practical Statistics for Medical Research. New York: Chapman & Hall, 1991. 3. Altman, D. G. Statistics in medical journals: developments in the 1980s. Stat. Med. 10: 1897–1913, 1991. 4. Altman, D. G., and J. M. Bland. Improving doctors’ understanding of statistics. J. R. Stat. Soc. Ser. A 154: 223–267, 1991. 5. Altman, D. G., S. M. Gore, M. J. Gardner, and S. J. Pocock. Statistical guidelines for contributors to medical journals. Br. Med. J. 286: 1489–1493, 1983. 6. Anscombe, F. J. Graphs in statistical analysis. Am. Statistician 27: 17–21, 1973. 7. Appleton, D. R. What statistics should we teach medical undergraduates and graduates? Stat. Med. 9: 1013–1021, 1990. 8. Arbuthnot, J. Of the Laws of Chance. London: Benj. Motte, 1692. 9. Armitage, P., and G. Berry. Statistical Methods in Medical Research (3rd ed.). Cambridge, MA: Blackwell Scientific, 1994. 10. Bailar, J. C., III, and F. Mosteller. Guidelines for statistical reporting in articles for medical journals. Ann. Intern. Med. 108: 266–273, 1988. 11. Bland, J. M., and D. G. Altman. Caveat doctor: a grim tale of medical statistics textbooks. Br. Med. J. 295: 979, 1987. 12. Bland, J. M., and D. G. Altman. Misleading statistics: errors in textbooks, software and manuals. Int. J. Epidemiol. 17: 245–247, 1988. 13. Boring, E. G. Mathematical vs. scientific significance. Psychol. Bull. 16: 335–338, 1919. 14. Box, G. E. P., and D. R. Cox. An analysis of transformations. J. R. Stat. Soc. Ser. B 26: 211–243, 1964. 15. Burkholder, D. L., and J. Pfanzagl. Estimation. In: International Encyclopedia of the Social Sciences, edited by D. L. Sills. New York: Macmillan & The Free Press, 1968, vol. 5, p. 142–157. 16. Burnand, B., W. N. Kernan, and A. R. Feinstein. Indexes and boundaries for ‘‘quantitative significance’’ in statistical decisions. J. Clin. Epidemiol. 43: 1273–1284, 1990.
17. Colditz, G. A., and J. D. Emerson. The statistical content of published medical research: some implications for biomedical education. Med. Educ. 19: 248–255, 1985. 18. Colton, T. Statistics in Medicine. Boston, MA: Little, Brown, 1974. 19. Cox, D. R. Statistical significance tests. Br. J. Clin. Pharmacol. 14: 325–331, 1982. 20. Denenberg, V. H. Some statistical and experimental considerations in the use of the analysis-of-variance procedure. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R403– R408, 1984. 21. Denham, M. J., A. Foster, and D. A. J. Tyrrell. Work of a district ethical committee. Br. Med. J. 2: 1042–1045, 1979. 22. DiStefano, J. J., III, and E. M. Landaw. Multiexponential, multicompartmental, and noncompartmental modeling. I. Methodological limitations and physiological interpretations. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R651– R664, 1984. 23. Draper, N. R., and H. Smith. Applied Regression Analysis (2nd ed.). New York: Wiley, 1981. 24. Evans, S. J. W., P. Mills, and J. Dawson. The end of the p value? Br. Heart J. 60: 177–180, 1988. 25. Fisher, R. A. Statistical Methods and Scientific Inference (3rd ed.). New York: Hafner, 1973. 26. Flynn, F. V., K. A. J. Piper, P. Garcia-Webb, K. McPherson, and M. J. R. Healy. The frequency distributions of commonly determined blood constituents in healthy blood donors. Clin. Chim. Acta 52: 163–171, 1974. 27. Garcia-Mata, C., and F. I. Shaffner. Solar and economic relationships: a preliminary report. Q. J. Economics 49: 1–51, 1934. 28. Gardner, M. J., and D. G. Altman. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br. Med. J. 292: 746–750, 1986. 29. Garfinkel, D., and K. A. Fegley. Fitting physiological models to data. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R641–R650, 1984. 30. Gray, B. H., R. A. Cooke, and A. S. Tannenbaum. Research involving human subjects. Science 201: 1094–1101, 1978. 31. Healy, M. J. R. Significance tests. Arch. Dis. Child. 66: 1457– 1458, 1991. 32. Healy, M. J. R. Data transformations. Arch. Dis. Child. 69: 260–264, 1993. 33. Hill, A. B. Principles of medical statistics. XII—Common fallacies and difficulties. Lancet i: 706–708, 1937. 34. Hofacker, C. F. Abuse of statistical packages: the case of the general linear model. Am. J. Physiol. 245 (Regulatory Integrative Comp. Physiol. 14): R299–R302, 1983.
35. Hogg, R. V., and A. T. Craig. Introduction to Mathematical Statistics (4th ed.). New York: Macmillan, 1978. 36. Iberall, A. S. The problem of low-dose radiation toxicity. Am. J. Physiol. 244 (Regulatory Integrative Comp. Physiol. 13): R7–R13, 1983. 37. Jackson, T. E. Comparison of a class of regression equations. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R271–R276, 1984. 38. Kruskal, W. H. Tests of significance. In: International Encyclopedia of the Social Sciences, edited by D. L. Sills. New York: Macmillan & The Free Press, 1968, vol. 14, p. 238–250. 39. Land, T. A., and M. Secic. How to Report Statistics in Medicine. Philadelphia, PA: Am. College Physicians, 1997. 40. Landaw, E. M., and J. J. DiStefano III. Multiexponential, multicompartmental, and noncompartmental modeling. II. Data analysis and statistical considerations. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R665–R677, 1984. 41. Montgomery, D. C., and G. C. Runger. Applied Statistics and Probability for Engineers. New York: Wiley, 1994, p. 361–363. 42. Moses, L. E. Think and Explain with Statistics. Reading, MA: Addison-Wesley, 1986. 43. Mosteller, F., and J. W. Tukey. Data Analysis and Regression. Reading, MA: Addison-Wesley, 1977. 44. Murray, G. D. How we should approach the future. Stat. Med. 9: 1063–1068, 1990. 45. SHEP Cooperative Research Group. Prevention of stroke by antihypertensive drug treatment in older persons with isolated systolic hypertension. Final results of the systolic hypertension in the elderly program (SHEP). JAMA 265: 3255–3264, 1991. 46. Slinker, B. K., and S. A. Glantz. Multiple regression for physiological data analysis: the problem of multicollinearity. Am. J. Physiol. 249 (Regulatory Integrative Comp. Physiol. 18): R1–R12, 1985. 47. Snedecor, G. W. The statistical part of the scientific method. Ann. NY Acad. Sci. 52: 792–799, 1950. 48. Snedecor, G. W., and W. G. Cochran. Statistical Methods (7th ed.). Ames: Iowa State Univ. Press, 1980. 49. Tuininga, Y. S., D. J. van Veldhuisen, J. Brouwer, J. Haaksma, H. J. G. M. Crijns, A. J. Man in’t Veld, and K. I. Lie. Heart rate variability in left ventricular dysfunction and heart failure: effects and implications of drug treatment. Br. Heart J. 72: 509–513, 1994. 50. Yates, F., and M. J. R. Healy. How should we reform the teaching of statistics? J. R. Stat. Soc. Ser. A 127: 199–210, 1964. 51. Yates, F. E. Contribution of statistics to ethics of science. Am. J. Physiol. 244 (Regulatory Integrative Comp. Physiol. 13): R3–R5, 1983.