
USING CFA

Bryant et al. • CONFIRMATORY FACTOR ANALYSIS

Statistical Methodology: VIII. Using Confirmatory Factor Analysis (CFA) in Emergency Medicine Research

FRED B. BRYANT, PHD, PAUL R. YARNOLD, PHD, EDWARD A. MICHELSON, MD

Abstract. How many underlying characteristics (or factors) does a set of survey questions measure? When subjects answer a set of self-report questions, is it more appropriate to analyze the questions individually, to pool responses to all of the questions to form one global score, or to combine subsets of related questions to define multiple underlying factors? Factor analysis is the statistical method of choice for answering such questions. When researchers have no idea beforehand about what factors may underlie a set of questions, they use exploratory factor analysis to infer the best explanatory model from observed data "after the fact." If, on the other hand, researchers have a hypothesis beforehand about the underlying factors, then they can use confirmatory factor analysis (CFA) to evaluate how well this model explains the observed data and to compare the model's goodness-of-fit with that of other competing models. This article describes the basic rules and building blocks of CFA: what it is, how it works, and how researchers can use it. The authors begin by placing CFA in the context of a common research application, namely, assessing quality of medical outcome using a patient satisfaction survey. They then explain, within this research context, how CFA is used to evaluate the explanatory power of a factor model and to decide which model or models best represent the data. The information that must be specified in the analysis to estimate a CFA model is highlighted, and the statistical assumptions and limitations of this analysis are noted. Analyzing the responses of 1,614 emergency medical patients to a commonly used "patient satisfaction" questionnaire, the authors demonstrate how to: 1) compare competing factor models to find the best-fitting model; 2) modify models to improve their goodness-of-fit; 3) test hypotheses about relationships among the underlying factors; 4) examine mean differences in "factor scores"; and 5) refine an existing instrument into a more streamlined form that has fewer questions and better conceptual and statistical precision than the original instrument. Finally, the role of CFA in developing new instruments is discussed. Key words: CFA; surveys; research methodology; statistical methods; confirmatory factor analysis; emergency medicine. ACADEMIC EMERGENCY MEDICINE 1999; 6:54–66

From the Department of Psychology, Loyola University Chicago (FBB), Chicago, IL; the Divisions of General Internal Medicine (PRY) and Emergency Medicine (EAM), Northwestern University Medical School, Chicago, IL; and the Department of Psychology, University of Illinois at Chicago (PRY), Chicago, IL. Series editor: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor–UCLA Medical Center, Torrance, CA. Received April 3, 1996; revision received October 17, 1996; accepted September 14, 1998. Address for correspondence and reprints: Fred B. Bryant, PhD, Department of Psychology, Loyola University Chicago, 6525 North Sheridan Road, Chicago, IL 60626. Fax: 773-508-8713; e-mail: [email protected]

QUESTIONNAIRES are commonly used in emergency medicine (EM) as a research tool to compare various treatments, to survey practitioners to define state of the art, and to assess patient and practitioner satisfaction with certain types of care and clinical experience.1,2 Although writing a survey instrument at first seems fairly straightforward, analysis of the survey responses for statistically valid conclusions becomes challenging. When subjects answer a set of self-report questions, is it more appropriate to analyze the questions individually, to pool responses to form one global score, or to group subsets of related questions to define the underlying characteristics (or factors) being measured? Can the results of such an analysis be used to guide improvements in the survey instrument, or to shorten or streamline the questions without sacrificing measurement reliability? Factor analysis is the statistical method of choice for understanding how people respond to self-report surveys. The basic focus of factor analysis is on the "structure" underlying a set of measures collected from a group of subjects. By structure, we mean the ways in which responses to the individual measures interrelate (if they do) to define one or more underlying characteristics, or factors. When researchers have a hypothesis beforehand about the factors that underlie survey responses, they can use confirmatory factor analysis (CFA) to evaluate how well the survey results


match the hypothesized measurement model. An explicit structural framework for explaining responses to a set of questions is known as a measurement model. In addition, researchers can compare how well competing models fit the collected data, compared with the hypothesized model. On the other hand, the factors underlying a set of survey questions may not be obvious to the investigator prior to data analysis. In this instance, exploratory factor analysis may be used to infer the best measurement model to "explain" observed data "after the fact."

This article describes the basic rules and building blocks of CFA: what it is, how it works, and how researchers can use it. Besides determining the appropriate number of factors or dimensions underlying responses to a set of survey questions, factor analysis also enables researchers to interpret the meaning of each factor and to label each factor in theory-relevant terms. Factor analysis quantifies how strongly each question characterizes each underlying factor, thereby pinpointing the specific subsets of questions that define the factors. Questions that strongly reflect a particular factor are said to have strong "loadings" on that factor, or to "load" highly on it (i.e., each question's factor loadings reflect how strongly each underlying factor influences responses to that question).

Think of the set of survey questions as a bank of telescopes, each providing a separate field of vision with more or less clarity of focus. Factor analysis integrates the separate images from the bank of telescopes (analyzes the interrelationships among the questions) to determine the number of separate objects (underlying factors, if any) being viewed, the clarity with which each telescope focuses on its target object (factor loadings), and the identity of the objects (factors) under scrutiny.
When more than one factor underlies a set of questions (i.e., a multidimensional model), factor analysis can also be used to determine how strongly the multiple factors relate to one another. For example, researchers can test the hypothesis that the underlying factors are unrelated to one another (i.e., an orthogonal model), or that the underlying factors are correlated with one another (i.e., an oblique model). In the latter case, factor analysis computes the correlations among the factors. For example, if it were determined that factor 1 (e.g., ratings of satisfaction with care received from physicians) correlated at r = 0.5 with factor 2 (e.g., ratings of satisfaction with care received from nurses), then these factors would share 0.5² × 100%, or 25%, of their variance in common. Thus, although the two factors (i.e., ratings of physicians and of nurses) are correlated, they each measure primarily unique aspects of care (i.e., the remaining 75% of their variance is unique).
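The shared-variance arithmetic above (a factor correlation of r implies r² × 100% common variance) can be sketched as a few lines of Python; the 0.5 value is the text's illustrative correlation, not an estimate from the article's data:

```python
# Shared variance between two correlated factors: a factor correlation
# of r implies r^2 * 100% common variance between the factors.
def shared_variance_pct(r: float) -> float:
    """Percentage of variance two factors share, given their correlation r."""
    return (r ** 2) * 100

physician_nurse_r = 0.5  # illustrative correlation between the two factors
common = shared_variance_pct(physician_nurse_r)
unique = 100 - common
print(f"shared: {common:.0f}%, unique: {unique:.0f}%")  # shared: 25%, unique: 75%
```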


Factor analysis was originally developed for use in situations in which investigators had no prior knowledge about the structure underlying the answers to survey questions, and so-called "exploratory factor analysis" was used as an empirical method of analysis. CFA has several important advantages over exploratory factor analysis. First, CFA enables investigators to systematically test specific prior hypotheses about the structure underlying survey results and to compare alternative measurement models with respect to explanatory power. Second, researchers can use CFA to refine a suboptimal model into a simpler form that is both parsimonious and reliable, thus improving conceptual and statistical precision. This makes CFA a valuable tool for both theory testing and theory building.

Confirmatory factor analysis provides a means of systematically testing hypotheses about factor structure: it enables researchers to test and compare alternative models of the factors underlying a set of questions, to verify or confirm the most appropriate measurement model to explain responses to the questions. Researchers often begin by testing a "null" model that assumes there are no underlying factors (i.e., that the survey questions share nothing in common). Researchers usually wish to reject this null model in favor of a more conceptually meaningful structure; thus, the null model serves as a baseline against which to contrast the fit of more complex structural models. After testing the null model, one then evaluates increasingly complex models, starting with a one-factor (unidimensional) model that assumes the set of questions reflects a single, global underlying factor. Next, one tests multifactor (multidimensional) models, the simplest of which includes two underlying factors that could be either uncorrelated (orthogonal model) or correlated (oblique model).
Table 1 shows two areas assessed by a frequently used ED patient satisfaction survey (Press, Ganey Associates, Inc., South Bend, IN). The survey developer grouped questions under two general subheadings: Nurses and Doctors. Some questions are virtually parallel (e.g., ‘‘. . . nurses were courteous . . .’’ and ‘‘. . . doctors were courteous . . .’’), whereas other questions are unique for physicians (e.g., ‘‘The doctors were concerned about my comfort’’) and for nurses (e.g., ‘‘The nurses were technically skilled’’). We use factor analysis to see how well the intended Nurses and Doctors groupings explain patients’ responses to the survey questions. Both exploratory and confirmatory factor analyses operate on matrices of correlations among responses to the survey questions. (Although CFA can also analyze covariances, i.e., the correlation


TABLE 1. Areas Examined by Use of the Questionnaire*

ED Staff: Nurses
  LISREL Code Name   Patient Perception Item
  ncourtesy           1. The nurses were courteous to me.
  ntookser            2. The nurses took my problem seriously.
  nattentn            3. The nurses paid enough attention to me.
  ninform             4. The nurses informed me about my treatment.
  nprivacy            5. The nurses maintained my privacy.
  nskill              6. The nurses were technically skilled.

ED Staff: Emergency physicians
  LISREL Code Name   Patient Perception Item
  waittime            7. Before I was seen by a doctor, my waiting time in the treatment area was satisfactory.
  dcourtesy           8. The doctors were courteous to me.
  dtookser            9. The doctors took my problem seriously.
  dcomfort           10. The doctors were concerned about my comfort.
  dexptest           11. The doctors explained my tests and treatment.
  dexpprob           12. The doctors explained my problem.
  dresident          13. The residents (wearing blue coats) gave me good care.
  dselfcar           14. I received good advice about self-care at home, or about follow-up care.

*The questionnaire used was The Emergency Department Survey, from Press, Ganey Associates, Inc., South Bend, IN. All items were scored on a Likert-type scale from 1 = very poor to 5 = excellent. The LISREL code names identify measured variables in the LISREL analysis (see Fig. 1 and Tables 3 and 6). The complete questionnaire consists of separate, explicitly titled sections evaluating perceptions of registration, nurses, emergency physicians, tests and other staff, and final ratings. For present purposes, only the sections on nurses and emergency physicians were included in the analyses. We selected these sections because they had mostly complete data, whereas the other sections had more missing data (subjects with any missing data are automatically dropped from confirmatory factor analysis, i.e., casewise deletion). In addition, the use of only two factors greatly streamlined the presentation. The above items have been abridged from the original questionnaire.

between two measures multiplied by both standard deviations, the use of correlations standardizes results, making them easier to interpret.) Factor analysis requires that the data be measured on at least a continuous, interval scale (e.g., age, blood pressure, temperature). When data are not interval-scale (e.g., yes/no, true/false), researchers often use procedures to estimate what the correlations would have been had the variables been measured on an interval scale.4,5

Although most available multivariate statistical software packages (e.g., SPSS, BMDP, SAS) offer exploratory factor analysis capability, not all packages are capable of conducting CFA to impose specific models on the data, to systematically compare these models with alternative models for goodness-of-fit, and to refine models to make them more parsimonious and reliable. The three computer programs most commonly used to perform CFA are: LISREL,5 which stands for LInear Structural RELationships (supported by Scientific Software International, Chicago, IL); EQS,6 pronounced "X" (supported by Multivariate Software, Inc., Encino, CA); and CALIS,7 which stands for Covariance Analysis of LInear Structural equations (supported by SAS, Cary, NC).8 LISREL has been available the longest and is currently the most popular, a larger literature having accrued concerning its use and its accuracy under conditions violating statistical assumptions. CFA software is available for use with either mainframe or personal computers. Some CFA programs, such as LISREL, EQS, and AMOS,9 are available in Windows versions and even provide publication-quality diagrams of the model being evaluated (e.g., Fig. 1).

In conducting CFA, the computer program:

1. uses the raw data to compute the actual, observed correlations among the survey questions;
2. uses the CFA model hypothesized by the user to predict what the observed correlations among the survey questions should have been, assuming that the hypothesized model is accurate;
3. determines the differences between the correlations predicted by the user's model and the correlations that were actually observed; and
4. computes a maximum-likelihood chi-square value estimating the probability (p-value) that the differences between the predicted and actual observed correlations would occur by chance, assuming the hypothesized model is correct.

Contrary to other inferential statistical tests, for which significant p-values represent a positive result, in CFA a statistically significant chi-square denotes a model that fails to predict the observed data accurately (i.e., the intercorrelations predicted by the CFA model differ from the actual, observed intercorrelations). CFA users thus seek models that have nonsignificant chi-square values. If a CFA model fits the data well, then the


Figure 1. This graphical representation shows that we believe answers to the 14 questions on the patient satisfaction questionnaire are influenced by two fundamental factors, one due to nurses and one due to doctors (Table 1), and that these two underlying factors are correlated. In diagramming a confirmatory factor analysis (CFA) model, each observed indicator is enclosed in a square and designated by a Roman letter x. In the hypothesized model, there are 14 observed indicators, labeled x1–x14. The effect of measurement error on each observed indicator is marked by a small straight line to the indicator, and each unique error term is designated by a Greek letter delta (δ). The correlation matrix of these unique-error terms is referred to as Theta delta (Θδ). In the hypothesized model, the unique errors for the 14 observed indicators (survey questions) are designated δ1–δ14, all assumed to be independent of one another (i.e., no correlated-error terms are included). Each underlying factor is enclosed in a circle and designated by a Greek letter xi (ξ). The hypothesized model has two underlying factors, Nurses and Doctors, labeled ξ1 and ξ2, respectively. The effect of a factor on an observed indicator (i.e., a factor loading) is marked by a straight line from the factor to that indicator and is designated by a Greek letter lambda (λ). The matrix of factor loadings is referred to as Lambda x (Λx). In the hypothesized model, there are 14 estimated factor loadings in Λx; the first six indicators (x1–x6) have loadings (λ1,1–λ6,1) on the first factor (ξ1), and the last eight indicators (x7–x14) have loadings (λ7,2–λ14,2) on the second factor (ξ2). All other factor loadings (i.e., for x1–x6 on ξ2, and for x7–x14 on ξ1) have been fixed at zero. (This specifies that the first six indicators apply only to Nurses, and that the following eight indicators apply only to Doctors.) The correlations among the underlying factors are marked with curved paths in CFA diagrams. In LISREL notation, each factor intercorrelation is designated by a Greek letter phi (φ), and the matrix of factor variances and factor intercorrelations is referred to as Phi (Φ).
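The matrices named in the caption combine to produce the correlations a CFA model predicts: for a standardized solution, the model-implied correlation matrix is Λ·Φ·Λᵀ + Θδ. A minimal numeric sketch (the four loading values below are made up for illustration, not estimates from the article's 14-item model):

```python
import numpy as np

# Model-implied correlation matrix for a tiny standardized two-factor CFA:
#   Sigma = Lambda @ Phi @ Lambda.T + Theta_delta
# Lambda holds factor loadings, Phi the factor correlations, and
# Theta_delta the diagonal matrix of unique-error variances.
loadings = np.array([[0.9, 0.0],   # item 1 loads only on factor 1 ("Nurses")
                     [0.8, 0.0],   # item 2 loads only on factor 1
                     [0.0, 0.85],  # item 3 loads only on factor 2 ("Doctors")
                     [0.0, 0.7]])  # item 4 loads only on factor 2
phi = np.array([[1.0, 0.5],        # standardized factors: unit variances,
                [0.5, 1.0]])       # factor correlation set at 0.5
common = loadings @ phi @ loadings.T
theta_delta = np.diag(1.0 - np.diag(common))  # unique variance = 1 - communality
sigma = common + theta_delta                  # implied matrix has unit diagonal
print(np.round(sigma, 3))
```

Fitting a model amounts to choosing the free entries of Λ, Φ, and Θδ so that this implied matrix comes as close as possible to the observed correlation matrix.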

correlations that it predicts are not statistically different from the actual, observed correlations, resulting in a nonsignificant p-value. To ensure sufficient statistical power for hypothesis-testing purposes, researchers typically aim for a large total sample (e.g., 500–1,000 subjects). A rough guideline for the minimum ratio of subjects to questions is 5:1. Models that allow each survey question to tap more than one underlying factor require a larger sample size than models that force each question to tap only one factor. Because the maximum-likelihood chi-square statistic obtained from CFA is sensitive to sample size, researchers usually rely on other criteria to gauge how well a given model fits the data. With large samples, even reasonable models are likely to produce statistically significant chi-square values. For this reason, differences in chi-square values (Δχ²) across models may be more informative than the chi-square values themselves. Two models are said to be "nested" when they are identical except that one model includes more "restrictions" than the other (i.e., more factor loadings or factor intercorrelations are fixed at a specific value, such as zero, instead of being estimated). When competing models are nested, their chi-squares can be directly contrasted to test the hypothesis that one model fits the data better than the other. In the present example, an orthogonal (uncorrelated factors) version of the two-factor model is obtained when the correlation between the Nurses and Doctors factors is fixed at zero. This orthogonal model has all of the same parameters as the less restrictive, oblique (correlated factors) two-factor model, except that the factor correlation has been omitted. Thus, the orthogonal model is nested within the oblique model, and the chi-square for the oblique model can be subtracted from that for the orthogonal model to obtain a difference in chi-squares (Δχ²), to test the hypothesis that the Doctors and Nurses factors are correlated.

Besides contrasting the chi-square values of nested models to test "incremental fit," researchers also report a variety of relative-fit indices, including the goodness-of-fit index (GFI) and the adjusted (for the number of parameters estimated in the model) goodness-of-fit index (AGFI) (Table 2). Despite variations in their formulas, most comparative fit indices basically reflect how much better the given factor model fits the data, relative to the baseline "null" model, which assumes there are no underlying factors. Thus, fit indices reflect the proportion of shared variance that the model explains among the set of survey questions. All fit indices range between 0 and 1, with higher values indicating better fit, and 0.90 is generally considered to be a minimum acceptable level of fit.

The maximum-likelihood procedure, which carries its own assumptions, is the engine that drives CFA. Maximum-likelihood estimation assumes that a multivariate normal distribution underlies the data and that observations are sampled randomly and independently. Multivariate normality is often difficult to obtain in applied research. To meet this criterion, not only must patient responses on every question analyzed be normally distributed, but the distribution of scores obtained by any possible linear function of the questions, e.g., [0.5 × (question 1)] − [0.75 × (question 7)] + [0.33 × (question 18)], must also be normally distributed. Because violating this criterion invalidates the type I error estimates obtained using CFA, researchers usually ignore the model's p-value as an absolute measure of fit, focusing instead on measures of relative fit (which have fewer distributional assumptions). When interval-scale data are not multivariate normal, researchers sometimes transform them into a matrix of asymptotic covariances and use the weighted least-squares method of estimation to obtain more accurate results.4,5 In the present example, the patient satisfaction data diverge from multivariate normality.
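A quick univariate screen for the kind of violation just described is to check skewness, both of individual items and of linear combinations of them; near-zero skewness everywhere is necessary (though not sufficient) for multivariate normality. A sketch with simulated Likert-type responses exhibiting a ceiling effect (the values are illustrative, not the article's data):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Simulated 1-5 Likert responses with a ceiling effect, mimicking the kind
# of negative skew common in satisfaction ratings (illustrative data only).
q1 = np.clip(np.round(rng.normal(4.3, 0.9, size=2000)), 1, 5)
q7 = np.clip(np.round(rng.normal(3.8, 1.2, size=2000)), 1, 5)

# Multivariate normality requires every linear combination of the items to
# be normal, so marked skewness in any item or composite flags a violation.
print("skew(q1):   ", round(skew(q1), 2))
print("skew(combo):", round(skew(0.5 * q1 - 0.75 * q7), 2))
```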

TABLE 2. Summary Table for Confirmatory Factor Analysis (CFA)

Alternate names and related methods
  Also known as measurement modeling, structural equation modeling with latent factors, covariance structure modeling, and analysis of latent covariance structures. Related exploratory techniques include principal-components analysis and common-factor analysis.

Data type
  A matrix of correlations or covariances among a set of interval- or ratio-scale variables.

Assumptions
  Multivariate normality, sufficient sample size (see Limitations, below).

Principal results
  1. Maximum-likelihood goodness-of-fit chi-square (χ²) and its associated degrees of freedom, for testing the hypothesis that the data matrix predicted by the particular factor model differs from the observed data matrix. Smaller chi-squares reflect better goodness-of-fit.
  2. Indices of relative fit, including the goodness-of-fit index and the adjusted goodness-of-fit index. Each fit index reflects how much better the particular model fits the data, relative to a null model that assumes there is no common variance (i.e., that sampling error alone explains observed correlations among survey questions). These indices range from 0 (the fit of the worst-case null model) to 1 (perfect fit), with 0.90 as a rough guideline in judging minimally acceptable fit. Also included are modification indices, useful in improving model fit.
  3. Standardized estimates of factor loadings relating survey questions to hypothesized underlying factors, factor intercorrelations controlling for measurement error, and the proportion of common variance that the particular model explains in each survey question.

Strengths
  CFA allows systematic comparison of alternative structural models, to test hypotheses about relative fit and to determine the most appropriate measurement model(s) for a set of measured variables. CFA can be used to refine a model to increase its parsimony and statistical precision.

Limitations
  1. Low sample size reduces CFA's power to detect meaningful differences between the predicted and the observed data matrices and thus biases models to have better apparent goodness-of-fit. A rough guideline for the minimum ratio of respondents to measured variables is 5:1.
  2. Skewness and range restriction in measured variables attenuate correlations and factor loadings, worsening goodness-of-fit and making multivariate normality difficult to obtain. To correct this problem, data are sometimes transformed before analysis.
  3. Missing data are omitted using casewise deletion, which can shrink sample size considerably when missing responses are scattered throughout the data set. To avoid this problem, researchers sometimes construct and analyze data matrices using pairwise deletion. Although this maximizes average sample size, it sometimes produces matrices that are ill-conditioned for factor analysis when there are numerous missing responses within subsets of respondents.
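The casewise-versus-pairwise trade-off in limitation 3 can be illustrated with pandas, whose `DataFrame.corr` applies pairwise deletion by default (each correlation uses every respondent who answered both items), while dropping incomplete rows first gives the casewise matrix a CFA program would use. The toy responses below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy survey responses; NaN marks a skipped question (illustrative data).
df = pd.DataFrame({
    "q1": [5, 4, 3, np.nan, 5, 2, 4, 3],
    "q2": [4, 4, 2, 5, np.nan, 2, 5, 3],
    "q3": [5, 3, 3, 4, 5, np.nan, 4, 2],
})

# Pairwise deletion: each pairwise correlation keeps every respondent who
# answered both items (pandas' default NaN handling in DataFrame.corr).
pairwise = df.corr()

# Casewise (listwise) deletion, as CFA programs apply: any respondent with
# any missing answer is dropped before correlating.
casewise = df.dropna().corr()

print("respondents kept casewise:", df.dropna().shape[0], "of", df.shape[0])
print(pairwise.round(2))
print(casewise.round(2))
```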


TABLE 3. Correlations, Means, and Standard Deviations (SDs) for the 14 Patient Satisfaction Questions*

Question        1   2   3   4   5   6   7   8   9  10  11  12  13  14
Nurses
 1. ncourtsy
 2. ntookser   86
 3. nattentn   80  85
 4. ninform    75  78  86
 5. nprivacy   73  74  79  81
 6. nskill     78  77  75  75  77
Doctors
 7. waittime   47  51  57  56  51  47
 8. dcourtsy   54  57  56  55  53  56  53
 9. dtookser   55  62  58  57  54  56  52  83
10. dcomfort   55  61  59  58  57  57  51  80  81
11. dexptest   54  58  56  60  54  55  50  74  78  78
12. dexpprob   52  57  56  59  52  54  49  70  75  74  86
13. dresidnt   57  61  58  58  54  56  50  74  75  74  75  76
14. dselfcar   52  55  54  58  52  54  47  67  69  67  76  78  69

Mean          4.4 4.3 4.1 4.0 4.1 4.3 3.8 4.4 4.4 4.3 4.3 4.2 4.3 4.2
SD            0.9 0.9 1.0 1.1 1.0 0.9 1.2 0.8 0.9 0.9 1.0 1.0 0.9 1.0

*N = 1,614. Decimals have been omitted from correlations. For the meaning of the questions, see Table 1.

Responses to all survey questions are negatively skewed. Because the data violate the multivariate normality assumption, and because sample size is relatively large, we have chosen to ignore CFA models' p-values as indicators of absolute fit, and to use relative fit indices instead to gauge how well models explain the data. (Analyzing the asymptotic covariance matrix for the present data using weighted least-squares estimation revealed that the final model identified using maximum likelihood provides an excellent fit to the data: χ²(19, n = 1,614) = 64.1; GFI = 0.99; AGFI = 0.99.)

In addition to relative fit indices, the output of CFA includes other information useful in gauging model fit. For example, the root mean square residual (RMSR) is a measure of the average absolute difference between the correlations predicted by the user's model and the actual, observed correlations. As RMSR decreases and approaches zero, the fit of the given model improves. The CFA output also includes information about the set of survey questions that is useful in assessing model fit. For example, the "squared multiple correlation" for each survey question estimates the proportion of variance that the model explains in subjects' responses to that question. The "total coefficient of determination" reflects how well the survey questions jointly serve as measures of the underlying factors, and approaches 1.0 as the model's fit improves. Table 2 provides a summary of critical information about CFA, and Figure 1 is a schematic diagram (with LISREL notation) of the CFA model originally hypothesized for the 14 patient satisfaction questions in Table 1.
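A minimal sketch of the RMSR idea described above, computed as a root mean square over the unique off-diagonal residuals; the two 3×3 matrices are invented for illustration, not the article's observed and implied matrices:

```python
import numpy as np

def rmsr(observed: np.ndarray, implied: np.ndarray) -> float:
    """Root mean square residual over the unique (lower-triangle) elements."""
    idx = np.tril_indices_from(observed, k=-1)  # off-diagonal, each pair once
    resid = observed[idx] - implied[idx]
    return float(np.sqrt(np.mean(resid ** 2)))

# Illustrative observed vs. model-implied correlation matrices.
obs = np.array([[1.00, 0.86, 0.80],
                [0.86, 1.00, 0.85],
                [0.80, 0.85, 1.00]])
imp = np.array([[1.00, 0.84, 0.82],
                [0.84, 1.00, 0.86],
                [0.82, 0.86, 1.00]])
print(round(rmsr(obs, imp), 3))  # small residuals -> RMSR near zero: 0.017
```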

Having discussed the motivation, data requirements, assumptions, and logic underlying CFA, we are ready to discuss the components of the CFA model. Aspects of the CFA model that the investigator must specify in order to conduct the analysis include:

1. the number of factors underlying the set of questions;
2. which questions measure which factor(s);
3. the nature of the relationship(s) between factors, if the model has more than one factor, i.e., whether the factors are correlated (oblique) or independent (orthogonal); and
4. the error term (i.e., unexplained variance) for each survey question, and the interrelationships (if any) among these error terms.

Finally, to improve models with inadequate fit, researchers often examine the modification index (MI) for each parameter (i.e., each factor loading, factor intercorrelation, or correlated-error term) that has a fixed value of zero in the hypothesized model. The MI estimates the reduction in chi-square that would result if a particular fixed parameter were freely estimated, as well as the expected change in the value of that parameter, given its inclusion in the model. This exploratory probing is known as a "specification search." Based on specification searches, researchers often free additional factor loadings or allow for correlated measurement errors in their CFA models, particularly when these modifications are plausible and improve the fit of models that are well-grounded in preexisting theory.
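The first two specification choices (how many factors, and which questions measure which) amount to declaring a loading pattern: which entries of the Λx matrix are free and which are fixed at zero. A sketch of the pattern for the hypothesized two-factor model of Figure 1 (1 marks a loading to be estimated, 0 a loading fixed at zero):

```python
import numpy as np

# Loading pattern for the hypothesized 14-question, two-factor model:
# rows are survey questions, columns are factors (Nurses, Doctors).
pattern = np.zeros((14, 2), dtype=int)
pattern[0:6, 0] = 1    # questions 1-6 load only on factor 1 (Nurses)
pattern[6:14, 1] = 1   # questions 7-14 load only on factor 2 (Doctors)

# Third choice: free the factor correlation (oblique model) or fix it
# at zero (orthogonal model).
phi_free = True  # oblique; set False for the orthogonal version

print("free loadings:", int(pattern.sum()))          # 14, as in Figure 1
print("free factor correlations:", int(phi_free))
```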


TABLE 4. Goodness-of-fit Statistics for Various Factor Models of the Patient Satisfaction Survey*

Factor Model                                            χ²        df   GFI    AGFI   RMSR
One factor (total score)                                6,200.2   77   0.48   0.29   0.092
Two correlated factors: Nurses and Doctors              1,556.9   76   0.87   0.81   0.042
Modified two-factor model (dual λs for waiting time)    1,426.3   75   0.87   0.82   0.026
Modified two-factor model with two correlated δs          948.9   73   0.92   0.88   0.025
Modified two-factor model with four correlated δs         756.3   71   0.94   0.91   0.020

*N = 1,614. GFI = the goodness-of-fit index. AGFI = the adjusted goodness-of-fit index. Each of these indices reflects how much better the particular model fits the data, relative to a null model that assumes there is no common variance (i.e., that sampling error alone explains observed correlations among the survey questions). GFI and AGFI approach 1 as the fit of the given model improves, with 0.90 considered a minimally acceptable level of fit. RMSR = root mean square residual (i.e., a measure of the average difference between the correlations actually observed among the survey questions and the correlations predicted by a particular model), which approaches 0 as the fit of the given model improves.

SAMPLE DATA

Data were obtained for an urban, 800-bed, university-based Level 1 trauma center. Staffed by attending physicians 24 hours a day, the ED has an annual census of 48,000 patients. The hospital serves as a primary site for an EM residency, and housestaff from other specialties rotate through the ED for one-month periods. The majority of patients are cared for by both an attending and one or more resident physicians. For the purpose of this study, data collection involved retrieval of archival data: information about patient identity was not recorded, and the study did not require institutional review board approval. Each year the ED surveys all discharged patients who were not admitted, mailing them, one week after their visit, a two-page questionnaire that is widely used to assess satisfaction with various aspects and components of care (Press, Ganey Associates, Inc., South Bend, IN). For present purposes, we used questions from two sections of the questionnaire, Nurses and Doctors, in a CFA (Table 1). We analyzed the responses of 1,614 patients who returned the survey between April and September 1995 and who completed every question in the Nurses and Doctors sections.

ANALYSIS

How well does the intended two-factor (Nurses and Doctors) model explain the pattern of correlations among the 14 satisfaction questions (Table 3)? To answer this question, we used LISREL5 to impose three different measurement models on the data: 1) a one-factor model that assumes patient satisfaction is unidimensional; 2) the hypothesized, oblique two-factor model (correlated Nurses and Doctors factors); and 3) an orthogonal version of the intended two-factor model in which ratings of Doctors and Nurses are assumed to be uncorrelated. In estimating these models, we standardized the underlying factors (i.e., fixed their means at zero and their variances at one) to define the measurement scale for each factor. This also allowed us to interpret the standardized factor interrelationships as correlation coefficients. We used two goodness-of-fit indices (GFI and AGFI) and the RMSR to gauge model fit, and we compared the chi-squares of nested models to assess incremental fit.

Table 4 presents the goodness-of-fit statistics for these three measurement models. Contrary to the view of patient satisfaction as unidimensional, the one-factor model explained less than half of the variance that the satisfaction questions have in common (GFI = 0.48). As hypothesized, the intended two-factor model in oblique (correlated factors) form fit the data better than the one-factor model, Δχ²(1, n = 1,614) = 4,643.3; p < 0.00001. Confirming the appropriateness of correlated factors, the orthogonal version of the two-factor model (with the factor intercorrelation fixed at zero) produced a set of predicted correlations among the survey questions that was mathematically impossible to analyze (i.e., the matrix of predicted correlations was ''not positive definite''). LISREL's inability to estimate the orthogonal model indicates that the orthogonal model is untenable in the face of the data (i.e., a case of so-called ''model misspecification''). However, even the oblique version of the intended two-factor model fell just short (GFI = 0.87) of the minimum acceptable GFI value (0.90).

In search of an appropriate measurement model, we closely inspected the LISREL solution for the oblique two-factor model, looking for ways to improve model fit. Examining modification indices for factor loadings, we found that allowing Doctors question 1 (i.e., satisfaction with time in the treatment area before being seen by a doctor) to load also on the Nurses factor would significantly improve the two-factor model's fit, MI = 127.2, p < 0.00001.
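The chi-square difference test used above to compare nested models can be sketched in a few lines. The function name and interface below are our own illustration (using SciPy's chi-square distribution), not the LISREL procedure itself; the restricted model is the one with more degrees of freedom and, necessarily, the larger chi-square:

```python
from scipy.stats import chi2

def chi_square_difference(chi2_restricted, df_restricted,
                          chi2_full, df_full):
    """Compare two nested models. Returns the chi-square difference,
    its degrees of freedom, and the upper-tail p-value."""
    delta = chi2_restricted - chi2_full
    ddf = df_restricted - df_full
    p_value = chi2.sf(delta, ddf)  # upper-tail probability
    return delta, ddf, p_value

# Illustrative numbers: a restriction that worsens fit by 4.0
# chi-square units at the cost of one degree of freedom
delta, ddf, p = chi_square_difference(104.0, 20, 100.0, 19)
print(delta, ddf, round(p, 4))  # 4.0 1 0.0455
```

A significant difference indicates that the extra restriction significantly worsens fit, so the less constrained model is preferred; a nonsignificant difference favors the more parsimonious model.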
ACADEMIC EMERGENCY MEDICINE • January 1999, Volume 6, Number 1

The modification analysis further revealed that this ''waiting time'' question had an estimated factor loading of 0.36 on the Nurses factor (the remaining 13 questions, in contrast, all had estimated factor loadings below 0.08 on their alternate factor). In retrospect, it seems plausible that patients may attribute responsibility for their waiting time in the treatment area to both nurses and physicians. We therefore decided to modify the oblique two-factor model to allow the ''waiting time'' question to load on both factors. Using this procedure to modify an initial model represents a blend of exploratory and confirmatory approaches, and it is common practice when models that are grounded in strong theory achieve less than adequate goodness-of-fit. However, such modified models must be replicated with an independent sample of respondents before they can truly be termed ''confirmed'' measurement models in a technical sense. Thus, when sample size is large, researchers sometimes randomly split their data set in half, using one (training) sample to develop a final measurement model and the other (hold-out) sample to confirm the model's cross-sample generalizability. Although not reported for brevity, this procedure yielded identical models for the present data. When the sample size is too small to split the data set into separate training and hold-out samples, an independent sample is needed to confirm the model's replicability.

Confirmatory factor analysis revealed that this modified model (i.e., allowing the ''waiting time'' question to reflect both Doctors and Nurses factors) fit the data significantly better than the original two-factor model, Δχ²(1, n = 1,614) = 130.6; p < 0.0001. However, the modified model still had goodness-of-fit indices below 0.90 (Table 4). Accordingly, we conducted another specification search, this time inspecting the modification indices of the unique-error terms in the LISREL solution for the modified model.
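The split-half cross-validation strategy just described — develop the model on a random training half, then impose it on the hold-out half — can be sketched as follows. The function and variable names are illustrative:

```python
import numpy as np

def split_half(n_respondents, seed=0):
    """Randomly split respondent indices into a training half
    (for model modification) and a hold-out half (for confirmation)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_respondents)
    half = n_respondents // 2
    return order[:half], order[half:]

train_idx, holdout_idx = split_half(1614)
# Fit and modify the model using rows train_idx, then impose the final
# model on rows holdout_idx to check cross-sample generalizability.
print(len(train_idx), len(holdout_idx))  # 807 807
```

Fixing the random seed makes the split reproducible, which is useful when the analysis must be documented or rerun.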
Two correlated-error terms had MIs greater than 200 (p < 0.00001) — one for Nurses questions (between ''courtesy of the nurses'' and ''degree to which the nurses took your problem seriously'') and one for Doctors questions (between ''doctors' concern to explain your test and treatment'' and ''advice you were given about caring for yourself at home, or obtaining follow-up medical care''). Two other correlated-error terms had MIs between 100 and 200 (p < 0.00001), both for Doctors questions: 1) between ''overall explanation of your illness/injury'' and ''advice you were given about caring for yourself at home, or obtaining follow-up medical care''; and 2) between ''courtesy of the doctor'' and ''degree to which the doctor took your problem seriously'' (paralleling the measurement error shared between comparable questions for Nurses). These reliable correlated-error terms reflect either: 1) unmeasured sources of influence other than Nurses or Doctors that affect responses to pairs of questions (e.g., situational variation in patient load or time demands on staff) or 2) shared sources of error variance (e.g., patients' subjective impressions of the personalities of nurses or physicians).

We decided to rerun the modified two-factor model, first freeing the two largest correlated errors. As seen in Table 4, the modified two-factor model (with dual loadings for the ''waiting time'' question) fit the data significantly better when it included the two correlated-error terms with the largest MIs, Δχ²(1, n = 1,614) = 477.4; p < 0.0001. Although the model's GFI was now 0.91, its AGFI was still below 0.90. We thus freed the other two correlated errors with sizable MIs and reestimated the model. As seen in Table 4, including all four correlated-error terms significantly improved the model's fit, compared with including only the two largest correlated-error terms, Δχ²(1, n = 1,614) = 192.6; p < 0.0001. The two-factor model with four correlated errors also had acceptable goodness-of-fit when assessed by both GFI (0.94) and AGFI (0.91), making it an appropriate measurement model for the 14-item instrument. In the standardized LISREL solution for this final model (Chart 1), the Nurses and Doctors factors correlate 0.73 (p < 0.00001), indicating that about half (i.e., 0.73² = 0.53) of the variance in patient satisfaction with emergency physicians is related to patient satisfaction with emergency nurses, and vice versa. Chart 2 shows the LISREL 7 computer code used to conduct this final analysis.

Confirmatory factor analysis also spotlights survey questions that are poorly focused. For example, the LISREL solution (Chart 1) reveals that 13 of the 14 patient satisfaction questions have relatively small unique errors (δs ≤ 0.36) and sizable squared multiple correlations (i.e., ≥ 0.64); thus, the two-factor model explains nearly twice as much common variance as it leaves unexplained for all but one of the questions.
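The quantities just discussed — factor loadings, the factor correlation, and unique errors — combine to produce the model-implied correlations that CFA compares against the observed data. With standardized factors and items, the implied matrix is ΛΦΛ′ + Θ. The small example below uses hypothetical loadings, not the values in Chart 1, though it borrows the 0.73 factor correlation from the text:

```python
import numpy as np

# Hypothetical loadings: items 1-2 on factor 1, items 3-4 on factor 2
L = np.array([[0.90, 0.00],
              [0.85, 0.00],
              [0.00, 0.90],
              [0.00, 0.85]])
Phi = np.array([[1.00, 0.73],   # factor correlation, as in the text
                [0.73, 1.00]])
common = L @ Phi @ L.T          # variance the factors have in common
Theta = np.diag(1.0 - np.diag(common))  # unique error for standardized items
Sigma = common + Theta          # model-implied correlation matrix

print(np.round(Sigma, 4))
# Implied correlation between items on different factors is
# loading * phi * loading, e.g., items 1 and 3: 0.90 * 0.73 * 0.90 = 0.5913
```

Fitting a CFA amounts to choosing the free parameters in Λ, Φ, and Θ so that Sigma reproduces the observed correlations as closely as possible; the residual differences are what the RMSR summarizes.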
The single discrepant question — wait time (''waiting time in the treatment area, before you were seen by a doctor'') — had a relatively high proportion of unique error (0.602) and a relatively low squared multiple correlation (0.398); thus, the model explains only two-thirds as much common variance as it leaves unexplained for this question. These findings suggest that patients attribute responsibility for the time they spend waiting in the treatment area largely to factors other than emergency medical staff. For example, perhaps patients blame delays in treatment on bureaucratic hospital procedures (e.g., the need to check records) or on circumstances beyond the control of emergency medical staff (e.g., the number of other sick patients being treated).

CHART 1. LISREL Output for the Final Two-factor Model*

*The above is taken directly from the LISREL 7 output for the final two-factor solution. For the meaning of the survey questions, see Table 1.

Having determined an appropriate measurement model for the patient satisfaction questionnaire, one can then use the factors — rather than the original variables — as more reliable and more parsimonious measures of patient satisfaction, which in turn may be used as outcome measures in the study of potential predictors of patient satisfaction. In the present context, this would require that, for each patient, the responses to the 14 questions (Table 1) be summarized using two correlated factor scores — one score for satisfaction with nurses, and another score for satisfaction with doctors. The most common method for obtaining factor scores is known as unit weighting. In the present case, because all 14 questions were measured using the same five-point scale, unit weighting simply involves summation of the responses to the different questions (if different measurement scales
were used, then the responses would have to be transformed into a common metric before unit weighting could meaningfully be conducted). Using the full 14-item measurement model, for example, each patient's score on the Nurses factor would be computed by summing the patient's responses to the six questions in the Nurses section, plus the first question (waiting time) in the Doctors section: scores on this Nurses factor could range between 7 and 35. Each patient's score on the Doctors factor would be computed by summing the patient's responses to the eight questions in the Doctors section: scores on the Doctors factor could range between 8 and 40. Note that if one's hypothesis involves comparing mean scores on the Nurses and Doctors factors within subjects, then it is essential that the factor scores share a common metric. To contrast mean scores on the two factors within subjects, we could divide scores on the Nurses factor by 7 (the number of questions) and divide scores on the Doctors factor by 8. This would allow us to meaningfully compare 1) each respondent's level of satisfaction with nurses with 2) the same respondent's level of satisfaction with doctors, using the original ''metric'' of the five-point rating scale for each factor score.

Clearly, in the present context, unit weighting ignores the fact that the variables did not have identical factor loadings in the final measurement model (Chart 1), and therefore theoretically should not be weighted equivalently. Unit weighting also ignores the influence of the four correlated-error terms (Table 4, Chart 1). However, simulation research has shown that unit-weighted factors are often nearly perfectly correlated with ''true'' factor scores (e.g., computed by LISREL, EQS, or CALIS), which include all aspects of the measurement model. Indeed, using the present data, unit-weighted and true factor scores correlated 0.99 for both factors. Nevertheless, as user-friendly software becomes more widely available, increasing numbers of researchers are reporting results based on true factor scores, rather than unit-weighted factor scores.

CHART 2. LISREL 7 Computer Code for the Final Two-factor Model

Besides indicating how to obtain appropriate Nurses and Doctors factor scores based on the current 14-item patient satisfaction questionnaire, CFA may also be useful in creating a more streamlined survey having fewer questions than the original questionnaire, but equivalent goodness-of-fit. For example, the two-factor model for the 14-item data set indicated that it was more appropriate to let the ''waiting time'' question load on both Nurses and Doctors factors, rather than on the Doctors factor only, as was originally postulated. However, the factor loading coefficient for the waiting time question was relatively low for both Nurses (0.36) and Doctors (0.31), suggesting that this question did not tap either factor well in an absolute sense (all other questions loaded at least 0.80). Thus, eliminating the waiting time question would improve the conceptual clarity and reliability of the Nurses and Doctors factors. Although this reduces the number of questions only by one in the present case, it is not unusual to distill a smaller ''core'' of survey questions from instruments that were originally larger.

In addition, the final two-factor model contained four correlated-error terms. Correlated errors are theoretically problematic, since a basic assumption of classic measurement theory involves the absence of shared measurement error between variables.10 We therefore inspected the factor loadings for each of the four pairs of questions that shared measurement error in the final two-factor model, in order to discard the question with the lower factor loading. Questions with shared measurement error and their factor loadings were: 1) ''degree to which the nurses took your problem seriously'' and ''courtesy of the nurses,'' with loadings of 0.89 and 0.86, respectively, on the Nurses factor; 2) ''overall explanation of your illness/injury'' and ''advice you were given about caring for yourself at home, or obtaining follow-up medical care,'' with loadings of 0.85 and 0.80, respectively, on the Doctors factor; 3) ''doctors' concern to explain your test and treatment'' and ''advice you were given about caring for yourself at home, or obtaining follow-up medical care,'' with loadings of 0.89 and 0.80, respectively, on the Doctors factor; and 4) ''degree to which the doctor took your problem seriously'' and ''courtesy of the doctor,'' with loadings of 0.89 and 0.86, respectively, on the Doctors factor.

For each pair of questions with correlated errors, we omitted the question with the lower factor loading and then reestimated the final two-factor model for this new subset of ten questions. Results revealed large modification indices for two more correlated-error terms in the model: ''doctors' concern to explain your test and treatment'' and ''overall explanation of your illness/injury'' (MI = 232.8); and ''degree to which the doctor took your problem seriously'' and ''doctors' concern for your comfort while treating you'' (MI = 106.4). In both cases, the latter question had a lower loading on the Doctors factor. Omitting the question with the lower loading in each pair yielded a final refined survey containing five Nurses questions and three Doctors questions, for a total of eight questions (as opposed to 14 questions on the original survey). CFA revealed that two correlated factors — Nurses and Doctors — provided an acceptable measurement model for this subset of eight questions, χ²(19, n = 1,614) = 350.5; GFI = 0.95; AGFI = 0.90; RMSR = 0.021. Also, the correlation between the Nurses and Doctors factors was 0.73 — identical to that found analyzing all 14 questions. The refined measurement model achieves a goodness-of-fit comparable to that of the full 14-item model, but contains no correlated errors and uses only about half as many questions (i.e., is more parsimonious). As mentioned earlier, however, before relying on the refined questionnaire in actual practice, it is best to collect data using it with an independent random sample, and then to impose the measurement model from the original analysis on these new data to evaluate the model's goodness-of-fit.

The logic of the CFA approach to measurement modeling can also help researchers in generating new questionnaires. That is, by knowing in advance how CFA can be used to identify the structure underlying responses to an instrument, one can design a new questionnaire from the outset to have good measurement properties. For example, an important issue in questionnaire construction concerns the number of questions that should be included to measure each underlying factor. Using the same number of questions for each factor provides direct comparability of mean factor scores within subjects, and direct comparability of estimates of the internal consistency and stability of factor scores, without requiring correction for attenuation due to differences in the number of composite questions. If possible, all questions should share the same response scale.
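The unit-weighting arithmetic described above — sum each patient's responses within a factor, then divide by the number of items when factors must be compared on a common metric — can be sketched as follows. The question indices and the sample response vector are illustrative, not taken from the study's data:

```python
def unit_weighted_scores(responses, nurses_items, doctors_items):
    """Compute unit-weighted factor scores for one patient.

    responses: list of 14 ratings on the five-point scale.
    Returns raw sums and per-item means for each factor."""
    nurses_sum = sum(responses[i] for i in nurses_items)
    doctors_sum = sum(responses[i] for i in doctors_items)
    return {
        "nurses_sum": nurses_sum,                       # range 7-35
        "doctors_sum": doctors_sum,                     # range 8-40
        "nurses_mean": nurses_sum / len(nurses_items),  # back on 1-5 scale
        "doctors_mean": doctors_sum / len(doctors_items),
    }

# Illustrative indexing: questions 0-5 form the Nurses section, 6-13 the
# Doctors section; the waiting-time question (index 6) also scores with
# Nurses, per the final measurement model.
nurses_items = list(range(6)) + [6]   # 7 items
doctors_items = list(range(6, 14))    # 8 items
patient = [4, 5, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 5, 4]
print(unit_weighted_scores(patient, nurses_items, doctors_items))
```

The per-item means put both factors back on the original five-point scale, which is what allows a meaningful within-subject comparison of satisfaction with nurses versus doctors.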
In addition, response scales should be designed so as to avoid extreme values of skewness, because skewed data attenuate correlations and constrain relative-fit indices to maximum values less than one.10 Finally, the complexity of the language used to construct questions should be comparable both within and across factors, and questions with highly similar wording should be avoided to prevent correlated errors.

Another important resource for researchers attempting to construct new questionnaires is the Health and Psychosocial Instruments (HaPI) database of measures in the health and social sciences.11 Produced by Behavioral Measurement Database Services (Pittsburgh, PA) and available online and on CD-ROM from Ovid Technologies (New York, NY), HaPI provides more than 63,000 records, including summary descriptions; information about reliability, validity, and appropriate populations; and instructions for obtaining copies of a wide variety of self-report questionnaires, survey forms, projective measures, and observational checklists in the health and social sciences. HaPI enables developers of new instruments to build on the prior work of others without ''reinventing the wheel,'' and it provides a means of systematically selecting criterion measures for use in assessing the convergent and discriminant validity of new measures. By examining methods that prior researchers have used to construct questions and response scales, and by studying findings of factor analyses of these prior instruments, researchers not only may gain insights into how they wish to pursue their specific measurement needs, but may also avoid pitfalls encountered in prior research.
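The point above about skewness attenuating correlations can be illustrated by simulation: imposing a ceiling on one of two correlated variables (producing a skewed, top-heavy distribution, as when most patients answer ''very satisfied'') lowers the observed Pearson correlation. The data below are simulated for illustration, not drawn from the study:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Two standard-normal variables correlated about 0.7
x = rng.standard_normal(n)
y = 0.7 * x + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal(n)

r_full = np.corrcoef(x, y)[0, 1]

# Impose a ceiling on y: everything above the median is recorded at the
# maximum, mimicking a skewed satisfaction item with a strong ceiling effect
y_ceiling = np.minimum(y, 0.0)
r_skewed = np.corrcoef(x, y_ceiling)[0, 1]

print(round(r_full, 3), round(r_skewed, 3))  # the second value is smaller
```

Because the ceiling discards real variation in one variable, the observed correlation understates the true association, which in turn caps how well any factor model can appear to fit.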

CONCLUSION

Several benefits can be gained by applying CFA in EM research. First, CFA improves conceptual precision (i.e., specificity) by enhancing our ability to label the variables under investigation in theory-relevant terms. CFA thus contributes to the construct validity of empirical research by refining our understanding of the underlying factors being measured. Theoretical concepts that are otherwise ''fuzzy'' can be explicated by dissecting them into their constituent parts, to improve conceptual clarity.

A second benefit of CFA is improved statistical precision (i.e., reliability) in assessing dependent variables. CFA improves the researcher's ability to detect true relationships by statistically controlling for unreliability in the underlying factors that is due to measurement error. By helping researchers develop parsimonious, reliable measurement models for their data, CFA improves the validity of the statistical conclusions drawn in empirical research.

Confirmatory factor analysis also enables researchers to take fuller advantage of the multivariate nature of their data. Rather than analyzing dependent measures piecemeal or in arbitrary sets, researchers can use CFA to determine the measurement model that best explains subjects' responses to the entire questionnaire, to obtain the most reliable measure of the research variables in question. CFA can thus improve conceptual and statistical precision in EM research. The reader interested in learning more about CFA may consult several sources that provide a thorough and comprehensive overview of the logic, application, and interpretation of CFA.3,5–7,9,12–15

References

1. Bjorvell H, Stieg J. Patients' perception of the health care received in an emergency department. Ann Emerg Med. 1991; 20:734–8.


2. Carey RG, Seibert JH. A patient survey system to measure quality improvement: questionnaire reliability and validity. Med Care. 1993; 31:234–45.
3. Bryant FB, Yarnold PR. Principal-components analysis and exploratory and confirmatory factor analysis. In: Grimm LG, Yarnold PR (eds). Reading and Understanding Multivariate Statistics. Washington, DC: American Psychological Association Books, 1995, pp 99–136.
4. Joreskog KG, Sorbom DG. PRELIS: A Program for Multivariate Data Screening and Data Summarization. Chicago, IL: Scientific Software, 1989.
5. Joreskog KG, Sorbom DG. LISREL 7 User's Reference Guide. Chicago, IL: Scientific Software, 1989.
6. Bentler PM. EQS Structural Equations Program Manual. Los Angeles, CA: BMDP Statistical Software, 1989.
7. SAS. User's Guide. Version 6. Vol 1. Cary, NC: SAS Institute, 1990.
8. Waller NG. Software review — seven confirmatory factor analysis programs: EQS, EzPATH, LINCS, LISCOMP, LISREL 7, SIMPLIS, and CALIS. Appl Psychol Meas. 1993; 17:73–100.
9. Arbuckle JL. AMOS 3.5 User's Guide. Chicago, IL: SmallWaters Corporation, 1994.
10. Magnusson D. Test Theory. Reading, MA: Addison-Wesley, 1967.
11. Health and Psychosocial Instruments (HaPI)-CD Version 7.06. New York: Ovid Technologies, 1998.
12. Bollen KA. Structural Equations with Latent Variables. New York: Wiley, 1989.
13. Byrne BM. A Primer for LISREL: Basic Applications and Programming for Confirmatory Factor Analytic Models. New York: Springer-Verlag, 1989.
14. Hayduk LA. Structural Equation Modeling with LISREL. Baltimore, MD: Johns Hopkins University Press, 1987.
15. Long JS. Confirmatory Factor Analysis. Beverly Hills, CA: Sage, 1983.


