RESEARCH METHODS AND STATISTICS
Parametric Versus Nonparametric Statistical Tests: The Length of Stay Example
Munirih Qualls, MD, MPH, Daniel J. Pallin, MD, MPH, and Jeremiah D. Schuur, MD, MHS
Abstract

Objectives: This study examined selected effects of the proper use of nonparametric inferential statistical methods for analysis of nonnormally distributed data, as exemplified by emergency department length of stay (ED LOS). The hypothesis was that parametric methods have been used inappropriately for evaluation of ED LOS in most recent studies in leading emergency medicine (EM) journals. To illustrate why such a methodologic flaw should be avoided, a demonstration, using data from the National Hospital Ambulatory Medical Care Survey (NHAMCS), is presented. The demonstration shows how inappropriate analysis of ED LOS increases the probability of type II errors.

Methods: Five major EM journals were reviewed, January 1, 2004, through December 31, 2007, and all studies with ED LOS as one of the reported outcomes were reviewed. The authors determined whether ED LOS was analyzed correctly by ascertaining whether nonparametric tests were used when indicated. An illustrative analysis of ED LOS was constructed using 2006 NHAMCS data, to demonstrate how inferential testing for statistical significance can deliver differing conclusions, depending on whether nonparametric methods are used when indicated.

Results: Forty-nine articles were identified that studied ED LOS; 80% did not perform a test of normality on the ED LOS data. Data were not normally distributed in all 10 of the studies that did perform such tests. Overall, 43% failed to use appropriate nonparametric methods. Analysis of NHAMCS data confirmed that failure to use nonparametric bivariate tests results in type II statistical error and in multivariate models with less explanatory power (a smaller R² value).

Conclusions: ED LOS, a key ED operational metric, is frequently analyzed incorrectly in the EM literature. Applying parametric statistical tests to such nonnormally distributed data reduces power and increases the probability of a type II error, which is the failure to find true associations. Appropriate use of nonparametric statistics should be a core component of statistical literacy because such use increases the validity of ED research and quality improvement projects.

ACADEMIC EMERGENCY MEDICINE 2010; 17:1113–1121 © 2010 by the Society for Academic Emergency Medicine

Keywords: length of stay, parametric, nonparametric, methodology, statistics, quality
Statistical literacy is a critical skill for users of the medical literature, for clinicians and researchers alike. A key premise of evidence-based medicine
From the Department of Emergency Medicine, Brigham and Women’s Hospital (MQ, DJP, JDS), Boston, MA; Harvard Medical School (DJP, JDS), Boston, MA; and the Division of Emergency Medicine, Children’s Hospital Boston (DJP), Boston, MA. Received December 15, 2009; revisions received March 21 and April 7, 2010; accepted April 11, 2010. Presented at the Society for Academic Emergency Medicine annual meeting, New Orleans, LA, May 14–17, 2009. Disclosures: Part of Dr. Schuur’s time is supported by a Jahnigen Career Development Award, funded by the Atlantic Philanthropies and the Hartford Foundation. Supervising Editor: Gary Gaddis, MD, PhD. Address for correspondence: Munirih Qualls, MD, MPH; e-mail: [email protected]
Reprints will not be available.
© 2010 by the Society for Academic Emergency Medicine doi: 10.1111/j.1553-2712.2010.00874.x
(EBM) is that clinicians should depend on primary medical literature to inform patient care decisions.1 To practice EBM well requires the ability to understand and recognize sources of bias in the medical literature. Biased studies are more likely to derive incorrect conclusions that can mislead practitioners of EBM. Bias can occur due to faulty study design and/or methodology, as well as inappropriate choice of inferential statistical tests. Consequently, biostatistical training has become more emphasized in medical schools2 and emergency medicine (EM) residency programs.3 Despite this recent increased emphasis on statistical literacy, many physicians cannot demonstrate competence with basic biostatistical concepts.1,4,5 This leads to mistakes in statistical analysis6 and reluctance to conduct research.7,8 This limitation is particularly germane to EM, which has been highly self-critical regarding research methodology.9,10 Although the rigor of EM research has increased with the maturation of the
ISSN 1069-6563 PII ISSN 1069-6563583
Qualls et al.
specialty, and the creation of dedicated research training activities such as research fellowships,11 there is still room for improvement. Appropriate use of nonparametric statistical analyses for nonnormally distributed data represents just such an opportunity for improvement.6,12 EM researchers must understand how to choose appropriate statistical methods, and clinicians and peer reviewers who read their studies should be able to recognize basic errors.

Emergency department length of stay (ED LOS) is a continuously distributed interval variable with a nonnormal distribution due to its frequent high degree of skew, as illustrated in Figure 1. ED LOS is an important, frequently studied outcome variable in EM operations research because it is a key indicator of operational efficiency. Parametric statistical tests are usually inappropriate13–17 for analysis of ED LOS. Since the Institute of Medicine identified ED crowding as an obstacle to high-quality emergency care, and a series of studies have linked crowding to adverse safety and quality outcomes, ED LOS has become a proxy for quality-of-care processes.18–23 The National Quality Forum (NQF) has endorsed median ED LOS as an indicator of safety and efficiency,24 and the Joint Commission announced that it will include the NQF’s ED LOS measures in its 2010 hospital specifications manuals.25 These are the first steps toward inclusion of ED LOS in mandatory hospital quality measures, making it likely that within a few years every hospital’s ED LOS will be available for the public to view on the Medicare website, Hospital Compare.26

Our study has two goals. The first goal is to test the hypothesis that inferential statistical analysis of ED LOS in published studies is usually performed inappropriately, because of a suspicion that ED LOS is usually analyzed using parametric methods. The second goal is to demonstrate, by example, that inappropriate use of parametric methods increases the probability of committing a type II error. We present several analyses of
PARAMETRIC VS. NONPARAMETRIC STATISTICAL TESTS
ED LOS, using data from the National Hospital Ambulatory Medical Care Survey (NHAMCS). We test the theory that type II errors, and not type I errors, predominate when inappropriate parametric tests are used.17

METHODS

Study Design
We hand-reviewed all original research articles published between January 1, 2004, and December 31, 2007, in Academic Emergency Medicine, the American Journal of Emergency Medicine, Annals of Emergency Medicine, the Canadian Journal of Emergency Medicine, and Emergency Medicine Journal. The first three journals are the three highest ranked U.S. EM journals by impact factor.27 The others are the two leading English language non-U.S. EM journals. Using the 2006 NHAMCS database, we conducted illustrative analyses of ED LOS using both parametric and nonparametric methods.

Data Sources
Our literature review included any original research article using ED LOS as a primary or secondary outcome. We used the ED Benchmarking Alliance consensus definition of ED LOS: time from ED arrival to discharge, admission, or death in the ED.28 For our illustrative analyses, we used the ED component of the 2006 NHAMCS. This is a multistage probability survey of U.S. ED visits in institutional, general, and short-stay hospitals, whose methods have been detailed previously.29 NHAMCS defines ED LOS as “length of visit, calculated from arrival time to time of discharge from the ED.”29

Study Protocol
We determined standards used to describe or test nonnormally distributed data and then compared the
Figure 1. Distribution of ED LOS in the NHAMCS (n = 26,618). The graph is truncated at ED LOS ≤ 1,800 minutes. ED LOS exceeded 1,800 minutes for 123 visits. LOS = length of stay; NHAMCS = National Hospital Ambulatory Medical Care Survey.
ACAD EMERG MED • October 2010, Vol. 17, No. 10
Table 1
Standards Used to Describe or Test Nonnormally Distributed Data

Criterion | Appropriate Standard
Test of normality | LOS data must be evaluated for nonnormal distribution.
Descriptor of central tendency | LOS data must be described using median, or median and mean.
Aggregation of data | Aggregate LOS data, such as weekly mean ED LOS, should not be used in statistical analyses.
Bivariate statistical tests | Nonparametric tests of significance (Mann-Whitney, Wilcoxon, or Kruskal-Wallis) should be used to analyze data with a nonnormal distribution, rather than parametric tests such as Student’s t-test.
Transformation of LOS in regression analysis | Nonnormal LOS data must be appropriately transformed prior to regression analysis (e.g., log-transformation).
LOS = length of stay.

selected articles against these standards (Table 1). Descriptor of central tendency was categorized as mean, median, or both. For all other criteria, we determined a rating of “Yes” or “No” based on the statistical techniques used. To assess whether a test of normality was performed, we recorded whether the authors stated that they performed a test of normality on the ED LOS data or if they explicitly mentioned the distribution of ED LOS. For articles that did not describe the distribution of ED LOS data, we considered ED LOS to have a nonnormal distribution, unless a normal distribution was reported or unless sufficient data were provided for us to determine that ED LOS was normally distributed.

When ED LOS was not normal, we deemed the study to have conducted appropriate nonparametric analysis if it met three criteria: 1) reported a median as the description of central tendency, 2) used appropriate nonparametric bivariate tests of significance (e.g., Mann-Whitney, Wilcoxon, or Kruskal-Wallis),30 and 3) conducted appropriate log-transformations of ED LOS data prior to regression analyses.31 In contrast, we deemed a study with nonnormal ED LOS data not to have used appropriate methods if it: 1) reported only the mean as a measure of central tendency for nonnormally distributed ED LOS data, 2) used inappropriate bivariate tests of significance (such as Student’s t-test), and 3) failed to perform a log-transformation prior to linear regression analysis.

In our NHAMCS analyses, we included all adult (age ≥ 18 years) ED visits in the NHAMCS 2006 data set and evaluated associations between ED LOS and three categories of variables, which were hospital type, visit events, and patient demographic characteristics. To evaluate bivariate predictors of ED LOS, we selected 10 dichotomous variables for which there existed a clinical or administrative rationale to suggest an association with ED LOS. The hospital type variable was urban hospital, defined by location in a metropolitan statistical area (Yes/No). The visit event variables were diagnostic tests performed (Yes/No), imaging performed (Yes/No), care by a midlevel provider (nurse practitioner or physician assistant; Yes/No), and arrival by ambulance (Yes/No). The patient demographic characteristics were Hispanic ethnicity (Yes/No), race (white vs. nonwhite), sex (male/female), age (18–65 years, > 65 years), and presenting level of pain (dichotomized to severe/not severe [none/mild/moderate]).

For the multivariate analysis, we included the 10 variables listed above, plus five additional categorical variables that were not eligible for bivariate analysis because they could not easily be dichotomized. The additional variables were as follows:
1. Hospital ownership (proprietary; voluntary nonprofit; government, nonfederal).
2. Hospital region (Northeast, Midwest, South, West).
3. Number of medications given in the ED.
4. Day of the week.
5. Number of procedures performed (e.g., wound care or IV fluids) during the visit.
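
The rating rules described for articles with nonnormal ED LOS data can be expressed as a small decision function. The Python sketch below is hypothetical: the class and field names are invented for illustration, treating a criterion as not applicable when an article ran no bivariate test or no regression is an assumption, and so is the label "mixed" for articles that meet some but not all applicable criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleReview:
    """Hypothetical record of how one article handled nonnormal ED LOS data."""
    reports_median: bool
    # None means the article performed no bivariate test (assumed convention).
    nonparametric_bivariate: Optional[bool] = None
    # None means the article fitted no regression model (assumed convention).
    transformed_before_regression: Optional[bool] = None


def classify(review: ArticleReview) -> str:
    """Return 'appropriate', 'inappropriate', or 'mixed' (assumed labels)."""
    applicable = [review.reports_median]
    for criterion in (review.nonparametric_bivariate,
                      review.transformed_before_regression):
        if criterion is not None:
            applicable.append(criterion)
    if all(applicable):
        return "appropriate"
    if not any(applicable):
        return "inappropriate"
    return "mixed"


print(classify(ArticleReview(True, True, None)))
print(classify(ArticleReview(False, False, False)))
print(classify(ArticleReview(True, False, None)))
```

An article that reports a median but relies on Student's t-test falls into the "mixed" group under these assumed rules, which corresponds to the articles described as using a combination of parametric and nonparametric methods.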
Outcome Measures
The primary outcome measure in the literature review portion of this manuscript was the proportion of published articles that inappropriately employed parametric methods for description or analysis of ED LOS data.

Data Analysis
We analyzed the results of the literature review using descriptive statistics. We tested the NHAMCS data for normality using three methods. The Anderson-Darling test is a formal test of normality; a p-value less than 0.01 signifies a nonnormal distribution.32 The second method is “mean-median difference,” which represents the degree of skewness by calculating the difference between the median and the mean as a percentage of the mean. A small percentage difference (i.e., 1% to 5%) suggests that the mean and the median are close to each other, and the data are likely to be normally distributed. A larger difference suggests that the mean and the median are far from each other and the data are not normally distributed.33 The third method is the “standard deviation to mean ratio.” If the standard deviation (SD) is more than half of the mean, the distribution is likely to be nonnormal.33 We examined the bivariate relationship between NHAMCS ED LOS and the 10 dichotomizable covariates with parametric (t-test) and nonparametric (Wilcoxon rank sum test) bivariate tests.33 We created a multivariate regression model from the 15 independent variables (10 dichotomous and five nondichotomous, as described above) using NHAMCS data. We constructed one model with raw ED LOS as the dependent variable and another with log-transformed ED LOS. The R² values, the F-values, and the p-values of the two models were compared. The meaning of these parameters is defined elsewhere.33 In brief, the R² is a measure of the total amount of
variability described by the equation constituting a multivariate linear regression. The F-value represents the proportion of the variance explained by each predictor variable. The p-value is a means of interpreting the F-value in terms of statistical significance. For purposes of the modeling exercise, we have focused on the change in R² and F-values that occurs when one log-transforms existing explanatory variables to better normalize their distributions, without addressing issues of collinearity, residual analysis, and outliers. We considered a two-sided p < 0.01 to be significant, as recommended by the National Center for Health Statistics for analysis of NHAMCS data, due to the large size of the data set and the frequency with which it is queried.34 We performed bivariate parametric tests and multivariate linear regression models using both standard and weighted survey techniques that account for the design characteristics of NHAMCS. We performed nonparametric tests using standard techniques, as survey weighting cannot be accounted for with nonparametric techniques. All statistical analyses were performed with SAS 9.1 software (SAS Institute, Cary, NC).

RESULTS

Literature Review
We identified 49 articles with ED LOS as a primary or secondary outcome. Ten of the 49 (20%) articles included a test of normality on the ED LOS data; all 10 of these articles reported that the data were not normally distributed. Of the 39 articles that did not perform a test of normality, 17 reported sufficient data to allow calculation of the distribution of the ED LOS data set using two of the summary methods described above. Ten of these 17 articles (59%) had a nonnormal distribution of ED LOS. The ED LOS of the remaining 22 articles was assumed to be nonnormally distributed, consistent with the methodology described above.
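
The two summary-statistic screens described under Data Analysis need only a reported mean, median, and SD, which is why they can be applied to published articles. A minimal Python sketch follows. The function names, the use of 5% as the cutoff for the mean-median difference, and the rule that either screen alone flags nonnormality are assumptions made for illustration, not the authors' stated procedure. The example values are the rounded NHAMCS summary statistics reported in this article (mean 229, median 164, SD 257 minutes), so the results may differ slightly from figures computed on the full data set.

```python
def mean_median_difference(mean: float, median: float) -> float:
    """Absolute gap between mean and median, as a fraction of the mean."""
    return abs(mean - median) / mean


def sd_to_mean_ratio(sd: float, mean: float) -> float:
    """Standard deviation divided by the mean."""
    return sd / mean


def looks_nonnormal(mean, median, sd, diff_cutoff=0.05, ratio_cutoff=0.5):
    """Flag a likely nonnormal distribution from summary statistics alone.

    Assumption: exceeding either cutoff is enough to raise the flag.
    """
    return (mean_median_difference(mean, median) > diff_cutoff
            or sd_to_mean_ratio(sd, mean) > ratio_cutoff)


# Rounded NHAMCS adult ED LOS summary statistics, in minutes.
nhamcs = {"mean": 229.0, "median": 164.0, "sd": 257.0}
diff = mean_median_difference(nhamcs["mean"], nhamcs["median"])
ratio = sd_to_mean_ratio(nhamcs["sd"], nhamcs["mean"])
flag = looks_nonnormal(**nhamcs)
print(f"mean-median difference: {diff:.0%}; SD/mean: {ratio:.2f}; nonnormal: {flag}")
```

Both screens are heuristics based on rounded summary values; where raw data are available, a formal test such as the Anderson-Darling test remains the stronger check.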
Sixteen of 49 articles (33%) appropriately accounted for the nonnormal distribution of ED LOS by reporting median ED LOS and analyzing ED LOS using a nonparametric bivariate test or transforming ED LOS prior to multivariate regression. Twenty-one of the 49 articles (43%) failed to account for the nonnormal distribution of ED LOS and used only parametric methods of analysis. The remaining 12 articles used a combination of parametric and nonparametric methods and descriptive statistics. Among all 49 articles with ED LOS as an outcome, 47% exclusively reported mean LOS as the description of central tendency. Of the 32 articles that performed bivariate tests of significance, 14 (44%) used only the parametric Student’s t-test. Of the 14 articles that created multivariate regression models with ED LOS as the dependent variable, 10 (71%) used only raw ED LOS data. Two articles (4%) used other appropriate techniques to account for skewed data: one included a histogram, a graphic representation of normality, and another reported trimming the data of outliers that contributed to a rightward skew of the data. Five articles (10%)
performed statistical analyses on aggregate mean LOS data. Three used daily mean, one used weekly mean, and one used monthly mean. This is an inappropriate approach, as it does not change the underlying nonnormal distribution of ED LOS data and does not have the power advantages of nonparametric statistical techniques.

NHAMCS Analysis
Mean ± SD adult ED LOS for U.S. EDs was 229 ± 257 minutes, and median ED LOS was 164 minutes (interquartile range [IQR] = 93–272 minutes). Adult ED LOS was nonnormally distributed in the 2006 NHAMCS data set by all three tests of normality. The Anderson-Darling test revealed nonnormality, with p < 0.005. Mean-median difference was 30% of the value of the mean. The mean to SD ratio was 0.9; that is, the SD exceeded the mean, far above the threshold of half the mean. A histogram graphically illustrates the rightward skew of adult ED LOS compared to the normal distribution (Figure 1). The distribution of ED LOS closely fits a log-normal distribution.

In our bivariate analysis, we found that three of the 10 variables evaluated as predictors of ED LOS did not meet our a priori threshold for statistical significance using the parametric Student’s t-test (Table 2). In contrast, all 10 variables were significant (p < 0.001) using the nonparametric Wilcoxon rank sum test. The three variables that were significant in nonparametric analysis and not significant in parametric analysis were (with t-test p-values) sex (p = 0.806), care by a midlevel provider (p = 0.021), and pain (p = 0.091).

In our multivariate linear regression analysis, we found that log-transforming ED LOS resulted in a better fitting model than raw ED LOS, as determined by larger F-values (with the exception of “Hospital Ownership”) for the predictor variables, and a larger R² value for the regression model (Table 3). Only one variable was significant in the transformed ED LOS model and not significant in the raw ED LOS model (sex, p = 0.049).
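
The pattern reported here, in which a rank-based test and a log-transformed model detect an association more readily than the raw-scale parametric analysis, can be reproduced on synthetic right-skewed data. The sketch below is an illustration only. It uses deterministic log-normal quantiles rather than NHAMCS records; the group size, the log-scale shift, and all function names are assumptions; and the rank sum statistic uses the large-sample normal approximation without a tie correction. It is written in Python with the standard library and does not reproduce the authors' SAS analysis.

```python
from math import exp, log, sqrt
from statistics import NormalDist, mean, variance


def lognormal_quantiles(mu, sigma, n):
    """n evenly spaced quantiles of a log-normal distribution (no randomness)."""
    std_normal = NormalDist()
    return [exp(mu + sigma * std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]


def two_sample_t(a, b):
    """Unpooled two-sample t statistic; equals Student's t when group sizes match."""
    return (mean(b) - mean(a)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_sum_z(a, b):
    """Wilcoxon rank sum of group b, expressed as a normal-approximation z score."""
    labeled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    w = sum(rank for rank, (_, group) in enumerate(labeled, start=1) if group == 1)
    n1, n2 = len(a), len(b)
    expected = n2 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - expected) / sd


def r_squared(x, y):
    """R^2 of a one-predictor linear regression (squared Pearson correlation)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical groups: same skewed shape, group_b shifted upward on the log scale.
group_a = lognormal_quantiles(mu=5.0, sigma=1.0, n=100)
group_b = lognormal_quantiles(mu=5.3, sigma=1.0, n=100)

t_raw = two_sample_t(group_a, group_b)
z_rank = rank_sum_z(group_a, group_b)

indicator = [0] * len(group_a) + [1] * len(group_b)
los = group_a + group_b
r2_raw = r_squared(indicator, los)
r2_log = r_squared(indicator, [log(v) for v in los])

print(f"t (raw) = {t_raw:.2f}, z (rank sum) = {z_rank:.2f}")
print(f"R^2 raw = {r2_raw:.4f}, R^2 log = {r2_log:.4f}")
```

On these synthetic data the rank-based statistic and the log-scale R² are larger than their raw-scale counterparts, in line with the loss of power the article attributes to parametric analysis of skewed data. The size of the gap depends on the assumed skew and shift.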
In both our bivariate and our multivariate analyses, we found no type I errors resulting from inappropriate use of parametric tests.

Sensitivity Analysis: Weighted Survey Techniques
As the NHAMCS is a multistage probability sample, it is recommended that users account for the visit weights when drawing inferences from the data. Although we used the data as an example of a nonnormally distributed data set and are not proposing conclusions based on relationships between variables, we performed bivariate tests and multivariate linear regression using both survey-weighted and unweighted techniques, as readers familiar with NHAMCS will expect. When the weighted techniques were used for the bivariate tests (t-tests), one more variable was nonsignificant with the t-test (Hispanic ethnicity, p = 0.02), but all other directions of effect and conclusions were unchanged from the results reported above. When the multivariate models were run with survey weights applied, two more variables were nonsignificant in the raw ED LOS model (hospital region, p = 0.02; and ownership, p = 0.08) and one in the log-transformed ED LOS model (hospital ownership, p = 0.12). Most importantly, no examples of type I
Table 2
Results of Bivariate Parametric and Nonparametric Analysis of ED LOS From the NHAMCS*

Variable | Mean ED LOS, minutes (95% CI)
Urban hospital | 239 (236–243)
Nonurban hospital | 157 (149–166)
No diagnostic tests performed | 133 (128–138)
Diagnostic tests performed | 251 (247–255)
No imaging performed | 197 (193–202)
Imaging performed | 264 (259–268)
No care by midlevel provider | 230 (227–233)
Care by midlevel provider | 216 (205–228)
No arrival in ambulance | 208 (207–213)
Arrival in ambulance | 291 (283–300)
Hispanic ethnicity | 257 (248–267)
Non-Hispanic ethnicity | 225 (222–228)
Nonwhite race | 248 (242–254)
White race | 222 (218–226)
Female | 229 (225–234)
Male | 228 (224–234)
Age > 65 yr | 256 (249–263)
Age < 65 yr | 223 (219–226)
High pain | 225 (220–231)
Low pain | 220 (216–224)