Regular Article

Measuring Clinically Meaningful Change Following Mental Health Treatment

Susan V. Eisen, PhD
Gayatri Ranganathan, MS
Pradipta Seal, MS
Avron Spiro III, PhD

Abstract

Assessment of clinically meaningful change is useful for treatment planning, monitoring progress, and evaluating treatment response. Outcome studies often assess statistically significant change, which may not be clinically meaningful. Study objectives are to: (1) evaluate responsiveness of the BASIS-24* using three methods for determining clinically meaningful change: reliable change index (RCI), effect size (ES), and standard error of measurement (SEM); and (2) determine which method provides an estimate of clinically meaningful change most concordant with other change measures. BASIS-24* assessments were obtained at two time points for 1,397 inpatients and 850 outpatients. The proportion showing clinically meaningful change using each method was compared to the proportion showing change in global mental health, retrospectively reported change, and clinician-assessed change. BASIS-24* demonstrated responsiveness at both aggregate and individual levels. Regarding clinically meaningful improvement and decline, SEM was most concordant with all three outcome measures; regarding no change, RCI was most concordant with all three measures.

This work was done at the Center for Health Quality, Outcomes and Economic Research, a Health Services Research and Development Center of Excellence of the Veterans Administration. The Center is located at the Edith Nourse Rogers Memorial Veterans Hospital, Bedford, MA. The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs.

Address correspondence to Susan V. Eisen, PhD, Professor, Health Research Scientist, Center for Health Quality, Outcomes and Economic Research (CHQOER), Edith Nourse Rogers Memorial Veterans Hospital, 200 Springs Road (152), Bedford, MA 01730, USA.
Susan V. Eisen, PhD, Professor, Department of Health Policy and Management, Boston University School of Public Health, Boston, MA, USA. E-mail: [email protected].
Gayatri Ranganathan, MS, Biostatistician, MetaWorks Inc., 10 President’s Landing, Medford, MA 02155, USA. E-mail: [email protected]
Pradipta Seal, MS, Data Analyst, Center for Health Quality, Outcomes and Economic Research (CHQOER), Edith Nourse Rogers Memorial Veterans Hospital, 200 Springs Road (152), Bedford, MA 01730, USA. E-mail: [email protected]
Avron Spiro III, PhD, Health Research Scientist, Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, 150 South Huntington Ave, Boston, MA 02130, and Associate Professor, Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA. E-mail: [email protected]

Journal of Behavioral Health Services & Research, 2007. © 2007 National Council for Community Behavioral Healthcare.

The Journal of Behavioral Health Services & Research

34:3

July 2007

Introduction

An important criterion for instruments designed to assess treatment effectiveness is that they be sensitive to change in the construct they were designed to measure.1,2 Sensitivity, or responsiveness to change, should be considered a distinct criterion in assessing the psychometric properties of health status instruments used for outcome evaluation.3 Studies of treatment effectiveness commonly use repeated measures designs in which outcomes are determined by comparing pre–post differences in symptomatology or functioning. Statistical significance of change is often assessed at the aggregate level using paired t tests between T1 and T2, repeated measures analysis of variance or covariance, or generalized estimating equations.4,5 However, with reasonably large samples, small differences that may not be clinically meaningful can reach statistical significance. In addition, these types of aggregate data analysis are not useful for determining individual changes, which can be very valuable in clinical practice for treatment planning, monitoring course of illness, and evaluating response to treatment.5–7 Liang3 differentiates sensitivity, defined as an instrument's capacity to measure any change, from responsiveness, an instrument's capacity to measure clinically important or meaningful change.

Clinically meaningful change can be defined as a noticeable, appreciable difference that is of value to the patient or the health professional, and that exceeds variation attributable to chance.3,8 Minimally important change has been defined as "the difference in score on a health-related ...instrument that corresponds to the smallest change in status that stakeholders (persons, patients, significant others, or clinicians) consider important".9 Determining clinically meaningful or important change can facilitate interpretation of scores obtained on self-report measures of health status used to evaluate intervention and treatment effects at both the aggregate and individual level.5 Recent work by Hays et al.7 and by Atkins et al.10 discussed and evaluated multiple methods for determining clinically meaningful change. However, no study to date has used multiple criteria to determine which method is most concordant with other widely used measures of change in mental health status.

The purposes of this article are: (1) to evaluate responsiveness of the 24-item revised Behavior and Symptom Identification Scale (BASIS-24*)11 using three distribution-based methods for identifying clinically meaningful change: the reliable change index (RCI), effect size (ES), and standard error of measurement (SEM); and (2) to determine which of the three methods provides an estimate of change that is most concordant with three other mental health outcome measures: change in self-reported global rating of mental health, a self-reported retrospective (transition) rating of change, and change based on a clinical rating of impairment [Global Assessment of Functioning (GAF)].8,11–13 Because individuals treated at different levels of care and in specialized versus generic programs might reasonably be expected to differ across a range of domains, including demographic and clinical characteristics and the amount of change they are likely to experience in different mental health domains, assessment of clinically meaningful change was done separately for inpatients and outpatients treated in mental health or substance abuse/dual diagnosis programs.

Methods for determining clinically meaningful change

The problems inherent in traditional methods of evaluating treatment effectiveness were noted almost 30 years ago, and since then, a number of methods have been developed to assess "clinically significant", "meaningful", or "minimally important" change.14–17 However, there is still no clear consensus regarding standards for determining clinically meaningful change.3,6,18,19

Effect size was first developed by Cohen to assess the magnitude of a treatment effect, originally using aggregate data, but later to assess change in individuals as well.20–22 Effect size is based on the ratio of the difference between baseline and follow-up scores to the standard deviation of the baseline score. Unlike significance tests, effect size is independent of sample size;

Measuring Change After Mental Health Treatment

EISEN et al.


consequently, it will not increase just by increasing sample size. Because they provide standardized measures of change, effect sizes can be used as benchmarks for understanding changes in health status.21 They can also be viewed as indicative of clinically meaningful change, based on research suggesting that a medium effect size corresponds to an amount of change that is noticeable to a careful observer.23 Based on Cohen's d, effect sizes of 0.2 are considered small, 0.5–0.6 are considered medium, and ≥0.80 are considered large.20,23

Jacobson and Truax developed the reliable change index (RCI) as the first step of a two-step process for determining clinical significance of change.24 The RCI is calculated to determine whether the magnitude of change is statistically reliable. The posttreatment score is subtracted from the pretreatment score and divided by the standard error of the differences. If the absolute value of this ratio ("t") is greater than 1.96, then change is considered statistically reliable. If statistically reliable change is established, a second criterion is suggested for determining clinical significance: that the posttreatment score fall within the range of scores for a "normal" population. After development of this method, several refinements and enhancements were suggested.25–29 Although reliable change can be computed for any sample, identification of clinically meaningful change requires that norms be available for a "normal" population, which is sometimes not the case. Because there is no well-established standard for determining clinical significance, and because normative data from an untreated population are not available, the RCI analysis used in this study is limited to determination of statistically reliable change.

McHorney and Tarlov proposed the standard error of measurement (SEM) as a useful statistic for assessing individual change on health-related quality of life instruments, and its use has been described for evaluating meaningful change in a number of medical, cognitive, and behavioral conditions.6–8 The SEM is the standard deviation of an individual score, estimated by multiplying the standard deviation for a sample by the square root of one minus its reliability coefficient.30,31 Although varying statistical thresholds have been used to determine clinically meaningful change using the SEM, recent research has reported that one SEM consistently corresponded to a minimal clinically important intra-individual change.5,30,31 Consequently, in this paper, one SEM will be used as the criterion for clinically meaningful change in BASIS-24*.
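The three statistics described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the example values (a baseline SD of 1.0, a reliability of 0.84, and a score dropping from 2.5 to 1.5) are hypothetical, and the standard error of the differences is taken as sqrt(2) times the SEM, the usual Jacobson and Truax form, which the text above does not spell out.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM: sample SD times the square root of one minus the reliability."""
    return sd * math.sqrt(1.0 - reliability)

def effect_size(score_t1, score_t2, sd_t1):
    """Baseline-to-follow-up difference divided by the baseline SD."""
    return (score_t1 - score_t2) / sd_t1

def simple_rci(pre, post, sem):
    """Pre minus post, divided by the standard error of the differences.
    Assumption: S_diff = sqrt(2) * SEM (standard Jacobson-Truax form)."""
    return (pre - post) / (math.sqrt(2.0) * sem)

# Hypothetical values chosen only for illustration.
sd_t1, reliability = 1.0, 0.84
pre, post = 2.5, 1.5

sem_value = standard_error_of_measurement(sd_t1, reliability)
change = pre - post
es = effect_size(pre, post, sd_t1)
rci = simple_rci(pre, post, sem_value)

print(f"SEM = {sem_value:.2f}, change = {change:.2f}")
print(f"ES = {es:.2f} (medium or larger: {es >= 0.5})")
print(f"RCI = {rci:.2f} (reliable: {abs(rci) > 1.96})")
```

With these made-up numbers the one-point change exceeds one SEM and a medium effect size, yet falls short of the 1.96 cutoff for reliable change, which shows how the RCI can be the stricter criterion.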

Methods

Sample

The sample consisted of 2,247 English-speaking adults treated in one of 27 inpatient (n=1,397) or outpatient (n=850) mental health and/or substance abuse treatment programs who completed self-report mental health assessments at two time points: admission and within 24 h before discharge (for inpatients), or intake and 30–90 days later (for outpatients). This sample included all sites participating in a field test of the revised Behavior and Symptom Identification Scale. Details regarding site and sample selection have been previously reported.11 The majority of both the inpatient and outpatient samples were between 25 and 44 years old, had at least a high school education, and were unemployed or employed less than 10 h per week in the previous 30 days. Sample characteristics are presented in Table 1.

Instruments

The revised 24-item Behavior and Symptom Identification Scale (BASIS-24*) was used to assess change following treatment. BASIS-24* is a revised version of the BASIS-32 mental health outcome instrument consisting of 24 self-report items assessing six domains: depression/functioning, difficulty in interpersonal relationships, self-harm, emotional lability, psychotic symptoms,


Table 1
Characteristics of the inpatient and outpatient samples

                                            Inpatient sample    Outpatient sample
                                            (N=1,397)           (N=850)
Characteristic                              N        %          N        %
Age
  18–24                                     214      15.3       160      18.8
  25–34                                     324      23.2       253      29.8
  35–44                                     439      31.4       252      29.7
  45–54                                     300      21.5       129      15.2
  55+                                       120      8.6        56       6.6
Gender
  Male                                      773      55.3       380      44.7
  Female                                    624      44.7       470      55.3
Race/ethnicity
  White                                     860      61.8       654      77.5
  Black/African-American                    385      27.7       67       7.9
  Latino                                    56       4.0        59       7.0
  Other/multiracial                         90       6.5        64       7.6
Marital status
  Never married                             660      47.2       366      43.5
  Married                                   271      19.4       253      30.1
  Separated/divorced                        419      30.0       213      25.3
  Widowed                                   46       3.3        9        1.1
Education
  8th grade or less                         95       6.9        14       1.7
  Some high school                          264      19.1       99       11.8
  High school graduate/GED                  444      32.2       249      29.6
  Some college                              350      25.4       292      34.7
  4-year college graduate                   226      16.4       187      22.2
Employed in past 30 days
  No                                        913      65.5       385      45.8
  Yes, 1–10 h                               77       5.5        49       5.8
  Yes, 11–30 h                              112      8.0        104      12.4
  Yes, more than 30 h                       292      21.0       303      36.0
Primary psychiatric diagnosis
  Schizophrenia/schizoaffective disorder    339      25.1       25       3.5
  Depressive disorder                       342      25.4       231      32.6
  Bipolar disorder                          189      14.0       79       11.1
  Alcohol/drug use disorder                 363      26.9       200      28.2
  Anxiety disorder                          34       2.5        57       8.0
  Adjustment disorder                       38       2.8        85       12.0
  Other disorder                            43       3.2        32       4.5
Insurance
  Self pay                                  78       5.8        110      21.1
  Commercial                                382      28.3       233      44.6
  Medicare                                  271      20.1       27       5.2
  Medicaid                                  214      15.8       130      24.9
  Uninsured                                 405      30.0       22       4.2
Program type
  Mental health                             1,164    83.3       593      69.8
  Substance abuse/dual diagnosis            233      16.7       257      30.2

N = number of patients; % = percentage.


and substance abuse.11,32 In addition, an overall mental health score is computed. Items are rated on a 5-point scale, with higher numbers indicating greater symptom/problem frequency or severity. Three other measures of change were included to provide alternative measures of mental health status from which change can be determined or inferred. First, a self-reported global rating of mental health during the past week, also assessed on a 5-point rating scale (0=poor, 1=fair, 2=good, 3=very good, 4=excellent), was used. Previous research has reported a correlation of 0.65 for inpatients and 0.76 for outpatients between this rating and the BASIS-24* overall mental health score.11 Second, a 5-point rating of perceived change (0=much worse, 1=somewhat worse, 2=about the same, 3=somewhat better and 4=much better) was used. This rating corresponds to transition ratings of change, which have also been used to determine clinically meaningful change in health status.12 Third, the clinician-rated GAF, extracted from medical records or administrative databases, provided an external measure of level of impairment. The GAF is a single-item, clinician rating scale assessing overall psychological symptoms, social, and occupational functioning.13 It is part of the multi-axial (Axis V), psychiatric diagnostic system.33 Ratings can range from 1 (worst functioning) to 100 (best functioning). It is the most widely used measure of psychiatric impairment, and previous research has reported good reliability and validity.34 Because not all study sites required GAF ratings at two time points, they were not available for all study participants. However, GAF ratings at two time points were available for 385 inpatients and 328 outpatients. Demographic characteristics were collected by respondent self-report. Payer and psychiatric diagnosis, including GAF ratings, were obtained from medical records or administrative databases. 
Procedure

BASIS-24* and the global mental health rating were administered twice: upon admission to an inpatient service or intake to an outpatient program (T1), and in the 24-h period before discharge (for inpatients) or 4–8 weeks after intake (for outpatients) (T2). The transition rating of change was obtained at T2 for inpatients and outpatients. GAF ratings at T1 and T2 were extracted from medical records or administrative databases. Data collection was undertaken by program staff within the context of continuous quality improvement activities. Verbal consent was obtained from all participants. This data collection process was approved by the Institutional Review Board of the grantee institution and by each participating site.

Data analysis

BASIS-24* subscale and overall scores at T1 and T2 were first computed,35 and statistical significance of the differences was assessed using paired t tests. Aggregate effect sizes were then computed as the difference between T1 and T2 group means, $M_1 - M_2$, divided by the standard deviation at T1, $s_1$; hence, $ES = (M_1 - M_2)/s_1$.20 Individual change scores were then computed using the three methods described above (RCI, ES, and SEM). The RCI was computed using a modification suggested by Hageman and Arrindell and recently used by Jerrell to evaluate reliable change of the BASIS-32.28,36 This modification produces an "improved" standardized Time 1–Time 2 difference score by adjusting for regression to the mean. It is calculated as

$$RCI_{ID} = \frac{(x_1 - x_2)\, r_{DD} + (M_1 - M_2)(1 - r_{DD})}{\sqrt{SEM_1^2 + SEM_2^2}}$$

where $RCI_{ID}$ = the reliable change index using improved difference (ID) scores, $x_1$ = Time 1 score, $x_2$ = Time 2 score, $M_1$ = mean of Time 1 scores, $M_2$ = mean of Time 2 scores, $SEM_1$ = standard error of measurement of Time 1 scores, $SEM_2$ = standard error of measurement of Time 2 scores, and $r_{DD}$ = reliability of difference scores. Study participants were then placed into three groups based on their RCI scores: reliable decline (RCI < −1.96), no reliable change (RCI between −1.96 and +1.96), and reliable improvement (RCI > 1.96).

Effect size for individuals was computed as the difference between each individual's T1 and T2 scores, divided by the group standard deviation at T1, $s_1$; hence, $ES = (x_1 - x_2)/s_1$.22 Participants were then grouped based on whether their individual ES suggested a medium to large decline (ES < −0.50), no effect or a small effect (−0.49 to 0.49), or a medium to large improvement (ES > 0.50).20 The ES > 0.50 criterion was selected based on a review of 29 research studies in which it was found that the mean minimally important difference was almost exactly equal to an effect size of 0.50.23

The SEM was computed as $s_1\sqrt{1 - r_{xx}}$, where $s_1$ is the standard deviation at Time 1 and $r_{xx}$ is the internal consistency reliability coefficient at Time 1.30,31 As for the RCI and ES, participants were then categorized as declined if their T1–T2 difference scores declined by at least one SEM, stable if their T1–T2 difference scores were less than one SEM, and improved if their T1–T2 difference scores improved by at least one SEM.

Change in global mental health was calculated by subtracting Time 1 from Time 2 global mental health ratings, yielding change scores ranging from −4 to +4. All positive scores were categorized as improved, scores of 0 were categorized as unchanged, and all negative scores were categorized as declined. Transition ratings were categorized into three groups: worse (including ratings of "much worse" and "somewhat worse"), same, and improved ("somewhat better" and "much better"). Change in GAF ratings was computed by subtracting T1 from T2 ratings. Difference scores were considered improved if T2 ratings were 10 or more points higher than T1 scores, worse if T2 scores were 10 or more points lower than T1 scores, and unchanged if the T2–T1 difference was less than 10 points. The 10-point difference criterion was used because the GAF uses 10-point ranges to define impairment severity levels, so that a 10-point change represents a change in level of impairment based on the clinician's assessment.

The weighted kappa statistic was used to determine agreement in the proportion of participants categorized as declined, stable, or improved based on each of the three methods for evaluating individual change.37 To determine which of the three methods is most concordant with clinically meaningful change on the BASIS overall mental health score, the proportion of individuals identified as meaningfully improved, unchanged, or declined, who were also categorized as improved, unchanged, or declined based on the global mental health rating, was calculated using RCI, ES, and SEM, respectively. This analysis was then replicated for the two other change measures (transition rating of improvement and change in GAF rating) to determine concordance of change on these measures with change in the BASIS overall score.
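The individual-level classification rules described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' analysis code. The group means, SDs, and reliabilities are the overall-score values reported for inpatients in mental health programs (Tables 2 and 4); the difference-score reliability of 0.74 is borrowed from a different group mentioned later in the article and is only a placeholder here; the two patients' scores and all function names are hypothetical.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def rci_id(x1, x2, m1, m2, sem1, sem2, r_dd):
    """Reliable change index on improved difference scores,
    which adjusts the raw difference for regression to the mean."""
    numerator = (x1 - x2) * r_dd + (m1 - m2) * (1.0 - r_dd)
    return numerator / math.sqrt(sem1 ** 2 + sem2 ** 2)

def label(stat, cutoff, inclusive=False):
    """Map a signed statistic (positive = fewer symptoms at T2) to a group."""
    if stat > cutoff or (inclusive and stat == cutoff):
        return "improved"
    if stat < -cutoff or (inclusive and stat == -cutoff):
        return "declined"
    return "stable"

def classify(x1, x2, m1, m2, s1, s2, rxx1, rxx2, r_dd):
    """Classify one person's T1-T2 change under the three methods."""
    sem1, sem2 = sem(s1, rxx1), sem(s2, rxx2)
    return {
        "RCI": label(rci_id(x1, x2, m1, m2, sem1, sem2, r_dd), 1.96),
        "ES": label((x1 - x2) / s1, 0.50),
        # "at least one SEM" is treated as an inclusive cutoff
        "SEM": label((x1 - x2) / sem1, 1.0, inclusive=True),
    }

# Group statistics: overall score, inpatient mental health programs.
# r_dd = 0.74 is an assumed placeholder, not a value reported for this group.
group = dict(m1=1.84, m2=1.05, s1=0.86, s2=0.67, rxx1=0.87, rxx2=0.87, r_dd=0.74)

patient_a = classify(2.4, 1.2, **group)  # hypothetical large drop in symptoms
patient_b = classify(2.0, 1.6, **group)  # hypothetical modest drop in symptoms
print(patient_a)
print(patient_b)
```

In this sketch the large drop counts as improvement under all three methods, while the modest drop exceeds one SEM but not the ES or RCI cutoffs, mirroring the pattern in which SEM identifies the most individuals as improved.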

Results

Aggregate change over time

Table 2 presents mean Time 1 and Time 2 scores and aggregate effect sizes for study participants. Although all but one of the Time 1–Time 2 differences were statistically significant, effect sizes varied by symptom/problem domain and by program type. Among inpatients, large effect sizes (>0.80) were found for the depression/functioning domain and for the overall score among individuals treated in both mental health and substance abuse programs. Among inpatients in substance abuse programs, a large effect was also found in the alcohol/drug use domain. Effect sizes in the other domains were moderate for inpatients. Among outpatients, effect sizes were low to moderate for those in mental health programs (0.21–0.45). In all domains except for alcohol/drug use, effect sizes were smaller for those treated in substance abuse programs than for those treated in mental health programs.


Table 2
Mean symptom/problem scores (SD) at two time points for inpatients and outpatients treated in mental health and substance abuse programs*

Inpatient mental health (n=1,164)
BASIS-24* subscale            Time 1         Time 2         Effect size
Depression/functioning        2.19 (1.17)    1.15 (0.89)    0.89
Interpersonal relationships   1.79 (1.07)    1.31 (1.06)    0.45
Self-harm                     1.23 (1.28)    0.43 (0.76)    0.63
Emotional lability            1.95 (1.15)    1.23 (0.96)    0.63
Psychotic symptoms            1.20 (1.17)    0.70 (0.90)    0.43
Alcohol/drug use              0.94 (1.15)    0.61 (0.87)    0.29
Overall mean                  1.84 (0.86)    1.05 (0.67)    0.92

Inpatient substance abuse (n=233)
BASIS-24* subscale            Time 1         Time 2         Effect size
Depression/functioning        2.35 (0.94)    1.31 (0.78)    1.11
Interpersonal relationships   1.63 (0.99)    1.15 (0.91)    0.49
Self-harm                     0.70 (1.01)    0.26 (0.61)    0.44
Emotional lability            2.01 (0.99)    1.43 (0.85)    0.59
Psychotic symptoms            0.63 (0.86)    0.35 (0.63)    0.33
Alcohol/drug use              2.67 (0.93)    1.65 (0.91)    1.1
Overall mean                  1.89 (0.71)    1.14 (0.57)    1.1

Outpatient mental health (n=593)
BASIS-24* subscale            Time 1         Time 2         Effect size
Depression/functioning        2.05 (1.04)    1.61 (1.00)    0.42
Interpersonal relationships   1.51 (0.97)    1.27 (0.92)    0.25
Self-harm                     0.62 (0.95)    0.41 (0.77)    0.22
Emotional lability            2.04 (1.04)    1.69 (1.01)    0.34
Psychotic symptoms            0.66 (0.83)    0.49 (0.76)    0.20
Alcohol/drug use              0.53 (0.82)    0.36 (0.61)    0.21
Overall mean                  1.64 (0.78)    1.30 (0.74)    0.44

Outpatient substance abuse (n=257)
BASIS-24* subscale            Time 1         Time 2         Effect size
Depression/functioning        1.30 (1.10)    0.95 (0.85)    0.32
Interpersonal relationships   1.15 (1.08)    0.97 (0.87)    0.17
Self-harm                     0.23 (0.59)    0.14 (0.43)    0.15
Emotional lability            1.43 (1.17)    1.19 (0.95)    0.21
Psychotic symptoms            0.49 (0.86)    0.33 (0.60)    0.20
Alcohol/drug use              1.12 (1.11)    0.70 (0.79)    0.38
Overall mean                  1.13 (0.86)    0.85 (0.64)    0.33

*For inpatients, all differences between Time 1 and Time 2 are statistically significant (p<0.001). For outpatients, all differences between Time 1 and Time 2 are statistically significant (p<0.001) except self-harm for individuals treated in a substance abuse program.
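As a quick arithmetic check, the aggregate effect sizes in Table 2 can be recomputed from the reported means and Time 1 standard deviations using ES = (M1 − M2)/s1. The Python sketch below does this for the inpatient mental health group; the dictionary layout and names are illustrative, and small discrepancies are expected because the published means and SDs are themselves rounded to two decimals.

```python
# (Time 1 mean, Time 1 SD, Time 2 mean, reported effect size), taken from
# the inpatient mental health columns of Table 2.
inpatient_mental_health = {
    "Depression/functioning": (2.19, 1.17, 1.15, 0.89),
    "Interpersonal relationships": (1.79, 1.07, 1.31, 0.45),
    "Self-harm": (1.23, 1.28, 0.43, 0.63),
    "Emotional lability": (1.95, 1.15, 1.23, 0.63),
    "Psychotic symptoms": (1.20, 1.17, 0.70, 0.43),
    "Alcohol/drug use": (0.94, 1.15, 0.61, 0.29),
    "Overall mean": (1.84, 0.86, 1.05, 0.92),
}

def aggregate_effect_size(mean_t1, sd_t1, mean_t2):
    """Group-level effect size: mean change divided by the Time 1 SD."""
    return (mean_t1 - mean_t2) / sd_t1

recomputed = {
    name: aggregate_effect_size(m1, s1, m2)
    for name, (m1, s1, m2, _reported) in inpatient_mental_health.items()
}

for name, (_m1, _s1, _m2, reported) in inpatient_mental_health.items():
    gap = abs(recomputed[name] - reported)
    print(f"{name:<28} recomputed {recomputed[name]:.3f}  reported {reported:.2f}  gap {gap:.3f}")
```

Each recomputed value agrees with the reported effect size to within about 0.01, which is the precision one can expect from rounded inputs.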

Individual change over time

Table 3 presents the proportion of individuals who showed clinically meaningful improvement on each BASIS-24* subscale and on the overall score, using each of the three methods. Consistent with the aggregate results, all three methods showed a higher proportion of inpatients than outpatients with clinically meaningful improvement. The proportion of inpatients showing clinically meaningful improvement on each BASIS-24* subscale and on the overall score ranged from 8–54% based on RCI, 29–73% based on ES, and 33–75% based on SEM. Corresponding rates of clinically meaningful improvement for outpatients ranged from <1–24% for RCI, 13–43% for ES, and 13–53% for SEM. For both inpatients and outpatients, the RCI method identified the fewest individuals as meaningfully improved. SEM identified the most individuals as meaningfully improved, with a few exceptions. For three program types, ES identified the same proportion of individuals as meaningfully improved as SEM (psychotic symptoms among inpatients treated in dual diagnosis/substance abuse programs, emotional lability among inpatients treated in mental health programs, and self-harm among outpatients treated in dual diagnosis/substance abuse programs). For three domains within three program types, ES identified somewhat more individuals as meaningfully improved than SEM (emotional lability and alcohol/drug use among inpatients in dual diagnosis/substance abuse programs and psychotic symptoms among outpatients in mental health programs). Across program type, of those who did not show clinically meaningful improvement, most showed no meaningful change (46–99% based on RCI, 23–79% based on ES, and 19–79% based on SEM; data not shown). Relatively few individuals showed clinically meaningful decline (0–5% based on RCI, 3–18% based on ES, and 3–22% based on SEM). Results varied by both symptom/problem domain and program type. Although the highest proportion of individuals improved in the depression/functioning domain and in overall mental health regardless of program type, more individuals treated in substance abuse or dual diagnosis programs improved on the alcohol/drug use subscale than individuals in mental health programs, a finding that is consistent with the problems of individuals treated in substance abuse programs.

Agreement among the methods

Agreement among the three methods for assessing clinically meaningful change varied by domain and program type. With respect to the depression/functioning domain, agreement between ES and SEM was fairly high, with weighted kappas ranging from 0.73 to 0.93 (data not shown). However, agreement between the RCI and SEM for the same domain was low to moderate (weighted kappas ranged from 0.38 to 0.55), and agreement between ES and RCI was moderate (0.54 to 0.65). Regarding the BASIS-24* overall score, all three methods identified 49% of inpatients as meaningfully improved, 23% as unchanged, and <1% as meaningfully declined. Across all change groups (declined, stable, and improved), the three methods were in full agreement for 72% of inpatients. For outpatients, all three methods identified 23% as meaningfully improved, 34% as unchanged, and 2% as meaningfully worse. Agreement across all groups was 59%.

Relationship among the measures

This level of agreement can be explained by the common elements shared by each method for computing meaningful change. Expressed as the minimum T1–T2 difference required, the criterion for clinically meaningful change under the SEM method is at least 1 standard error of measurement, $SEM_1$; for a medium or larger effect size, it is

$$\frac{0.5}{\sqrt{1 - r_{xx}}}\, SEM_1$$

and for the RCI, it is

$$\frac{1.96\sqrt{SEM_1^2 + SEM_2^2} - (M_1 - M_2)(1 - r_{DD})}{r_{DD}}.$$

It can be shown mathematically that the minimum change required for clinically meaningful improvement is proportional to the standard error of measurement.38
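To make the comparison concrete, the three minimum-change criteria can be expressed in units of the Time 1 SEM. The Python sketch below is an illustration only: the function names are invented here, and the inputs to the RCI example (a mean-improvement-to-SEM ratio of 1.0, equal SEMs at both time points, and a difference-score reliability of 0.80) are hypothetical values, not figures from the study.

```python
import math

def sem_threshold():
    """SEM method: the minimum change is one SEM by definition."""
    return 1.0

def es_threshold(r_xx, cutoff=0.5):
    """ES method: cutoff * s1, rewritten in SEM units via SEM = s1 * sqrt(1 - r_xx)."""
    return cutoff / math.sqrt(1.0 - r_xx)

def rci_threshold(mean_ratio, sem_var_ratio, r_dd, z=1.96):
    """RCI method in SEM1 units.

    mean_ratio    = (M1 - M2) / SEM1
    sem_var_ratio = SEM2**2 / SEM1**2
    """
    return (z * math.sqrt(1.0 + sem_var_ratio) - mean_ratio * (1.0 - r_dd)) / r_dd

# ES threshold for several Time 1 reliabilities.
for r_xx in (0.70, 0.75, 0.85):
    print(f"r_xx = {r_xx:.2f}: ES threshold = {es_threshold(r_xx):.3f} SEM")

# Hypothetical inputs for the RCI threshold.
rci_units = rci_threshold(mean_ratio=1.0, sem_var_ratio=1.0, r_dd=0.80)
print(f"RCI threshold = {rci_units:.3f} SEM")
```

At a Time 1 reliability of exactly 0.75 the ES threshold equals one SEM; above 0.75 the ES method demands a larger change than the SEM method, and below 0.75 a smaller one. With the hypothetical inputs shown, the RCI threshold is the largest of the three.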


Table 3
Percent of individuals showing clinically meaningful improvement based on effect size (ES) >0.50, reliable change index (RCI), and one standard error of measurement (SEM)

Inpatient
                              Mental health (n=1,164)   Substance abuse/dual Dx (n=233)
BASIS-24 subscale             RCI     ES      SEM       RCI     ES      SEM
Depression/functioning        48.3    62.9    68.0      53.2    73.4    77.2
Interpersonal relationships   23.0    45.1    48.2      22.3    43.8    47.2
Self-harm                     32.5    43.6    48.8      25.7    30.0    36.1
Emotional lability            14.6    52.3    54.0      13.7    55.8    55.8
Psychotic symptoms            8.1     37.8    38.8      9.0     33.9    33.9
Alcohol/drug use              8.8     29.5    32.8      20.6    70.0    68.2
Overall mean                  47.6    63.5    70.3      55.4    70.8    76.8

Outpatient
                              Mental health (n=593)     Substance abuse/dual Dx (n=257)
BASIS-24 subscale             RCI     ES      SEM       RCI     ES      SEM
Depression/functioning        24.3    43.3    52.3      19.8    31.9    44.8
Interpersonal relationships   8.4     34.1    36.3      10.5    25.7    30.7
Self-harm                     11.8    23.6    27.8      8.9     13.2    13.2
Emotional lability            0.5     41.0    41.0      3.9     32.3    34.6
Psychotic symptoms            0.5     24.6    24.5      5.8     21.0    21.8
Alcohol/drug use              4.9     21.4    21.8      13.2    31.9    34.6
Overall mean                  23.6    42.3    52.6      20.6    31.5    48.3

Comparing SEM with ES, if T1 reliability ($r_{xx}$) is greater than 0.75, larger T1–T2 differences are needed to show clinically meaningful change based on ES than SEM, and all individuals who improve based on ES will also improve based on one SEM.8 The reverse phenomenon will happen when T1 $r_{xx}$ is less than 0.75, in which case larger T1–T2 differences are required to show clinically meaningful change based on SEM than ES. When T1 $r_{xx}$ = 0.75, the same T1–T2 difference will identify individuals as meaningfully improved for both ES and SEM. As reported in Table 3, SEM identified more individuals as meaningfully improved than ES in all except six of the 84 cells, and in five of these six cells T1 reliability was 0.75 or less (Table 4). The only exception occurs in the self-harm domain for outpatients in substance abuse programs, in which both ES and SEM identified 13.2% of individuals as meaningfully improved, and T1 reliability (0.81) exceeded 0.75. In this case, the minimum T1–T2 difference required for meaningful improvement based on ES is greater than 1.11 times the SEM at T1. With respect to outpatients in substance abuse programs, all those who showed meaningful improvement in the self-harm domain based on SEM had T1–T2 difference scores greater than 1.11 times the T1 SEM. Consequently, the rates of improvement based on ES and SEM were the same.

Comparing ES to RCI, if $r_{xx}$ is less than 0.94 and the ratio of sample mean improvement to SEM at T1, $(M_1 - M_2)/SEM_1$, is greater than 1.96, then ES will identify more people as meaningfully improved if $r_{DD}$ satisfies Eq. 1 below:

$$r_{DD} > \frac{\dfrac{M_1 - M_2}{SEM_1} - 1.96\sqrt{1 + \dfrac{SEM_2^2}{SEM_1^2}}}{\dfrac{M_1 - M_2}{SEM_1} - \dfrac{0.5}{\sqrt{1 - r_{xx}}}} \qquad (1)$$

RCI could identify more people as meaningfully improved than ES if the ratio of sample mean improvement to SEM at T1 is greater than 1.96 and $r_{DD}$ is low. Using an example from the current dataset, the ratio of the mean difference to $SEM_1$ on the BASIS-24* overall mental health score for inpatients treated in substance abuse programs was 2.73. $r_{DD}$ for this group was 0.74, which is greater than the quantity (0.20) derived from Eq. 1, as shown in Eq. 2 below:

$$\frac{2.73 - 1.96\sqrt{1 + 0.56}}{2.73 - \dfrac{0.5}{\sqrt{1 - 0.85}}} = 0.20 \qquad (2)$$
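The arithmetic in Eq. 2 can be reproduced with a few lines of Python. This is only a numerical check of the bound in Eq. 1 using the values quoted in the text (a ratio of 2.73, a squared SEM ratio of 0.56, and a reliability of 0.85); the function and argument names are illustrative.

```python
import math

def rdd_bound_es_vs_rci(mean_ratio, sem_var_ratio, r_xx):
    """Right-hand side of Eq. 1.

    mean_ratio    = (M1 - M2) / SEM1
    sem_var_ratio = SEM2**2 / SEM1**2
    r_xx          = Time 1 reliability
    """
    numerator = mean_ratio - 1.96 * math.sqrt(1.0 + sem_var_ratio)
    denominator = mean_ratio - 0.5 / math.sqrt(1.0 - r_xx)
    return numerator / denominator

bound = rdd_bound_es_vs_rci(mean_ratio=2.73, sem_var_ratio=0.56, r_xx=0.85)
observed_r_dd = 0.74

print(f"Eq. 1 bound: {bound:.2f}")
print(f"ES identifies more individuals than RCI: {observed_r_dd > bound}")
```

The bound rounds to 0.20, well below the observed difference-score reliability of 0.74, so the condition in Eq. 1 holds for this group.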

Table 4
Cronbach's alpha (α) reliability coefficients for BASIS-24* subscales by program type

                              Inpatient                                  Outpatient
                              Mental health    Substance abuse/          Mental health    Substance abuse/
                              (n=1,164)        dual Dx (n=233)           (n=593)          dual Dx (n=257)
BASIS-24* subscale            T1 α    T2 α     T1 α    T2 α              T1 α    T2 α     T1 α    T2 α
Depression/functioning        0.88    0.86     0.86    0.84              0.90    0.91     0.92    0.91
Interpersonal relationships   0.82    0.87     0.84    0.82              0.81    0.83     0.89    0.85
Self-harm                     0.88    0.82     0.89    0.86              0.87    0.88     0.81    0.78
Emotional lability            0.77    0.76     0.78    0.72              0.75    0.76     0.83    0.76
Psychotic symptoms            0.77    0.76     0.75    0.76              0.73    0.73     0.86    0.72
Alcohol/drug use              0.86    0.80     0.68    0.63              0.80    0.75     0.82    0.73
Overall mean                  0.87    0.87     0.87    0.85              0.90    0.90     0.94    0.91

If rDD was lower than the quantity in Eq. 1, then RCI could identify more individuals as meaningfully improved than ES. However, this never occurred in the dataset, and it is quite unlikely to occur because an instrument with an rDD that low would not be considered reliable enough to measure change in health status.

Comparing SEM to RCI, when the T1 SEM is greater than the sample mean difference (M1 − M2), or if M1 − M2 is less than 1.96 times the T1 SEM, then the minimum difference required for meaningful improvement based on SEM will always be lower than the minimum difference required for meaningful improvement based on RCI, and SEM will identify more individuals as meaningfully improved than RCI. When the ratio of mean difference to SEM1 > 1.96, SEM will still identify more individuals as improved than RCI if

$$r_{DD} \geq \frac{\dfrac{M_1 - M_2}{SEM_1} - 1.96\sqrt{1 + \dfrac{SEM_2^2}{SEM_1^2}}}{\dfrac{M_1 - M_2}{SEM_1} - 1} \tag{3}$$

where (M1 − M2)/SEM1 is the ratio of sample mean improvement to SEM1 at T1.
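As a check on the arithmetic, the right-hand side of Eq. 3 can be evaluated for the earlier example (inpatients treated in substance abuse programs, ratio of mean difference to SEM1 of 2.73). The worked substitution below assumes that the 0.56 appearing in Eq. 2 is the ratio SEM2²/SEM1² for the same group:

$$\frac{2.73 - 1.96\sqrt{1 + 0.56}}{2.73 - 1} \approx \frac{2.73 - 2.45}{1.73} \approx 0.16$$

The numerator is the same as in Eq. 2. The bound is smaller than the 0.20 obtained there because the denominator subtracts 1 (the one-SEM criterion) rather than 0.5/√(1 − 0.85) ≈ 1.29 (the half-SD criterion expressed in SEM1 units).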

Using the previous example and Eq. 3, rDD would need to be greater than 0.16 (a lower rDD requirement than for ES) for SEM to identify more individuals as improved than RCI, which, again, was always the case in this dataset. It is possible for RCI to identify more individuals as meaningfully improved than SEM only when rDD is less than the quantity on the right-hand side of Eq. 3. However, as with ES, this is highly unlikely because an instrument with an rDD that low is not reliable enough to measure change.

Which method is most concordant with change assessed by other outcome measures?

Across mental health and substance abuse programs, 61% of inpatients and 41% of outpatients were improved based on change in their rating of global mental health. Twenty-nine percent of inpatients and 42% of outpatients were unchanged, and 10% of inpatients and 17% of outpatients were worse (Table 5). The proportions of these "globally improved" individuals who were also identified as clinically meaningfully improved, same, or worse on BASIS-24* overall scores are presented in Table 5 for each of the three methods. The SEM method identified the greatest number of globally improved individuals and showed the highest concordance with both clinically meaningful improvement and decline in the BASIS-24* overall mental health score. For all four treatment groups, agreement between SEM-based improvement and global mental health improvement ranged from 67 to 85%. The RCI identified the smallest number of "globally improved" individuals and had the lowest levels of concordance with clinically meaningful improvement and decline in overall mental health (32 to 58%), but RCI had the highest concordance for individuals showing no change.

Replication of these analyses using the transition rating and change in GAF scores, presented in Tables 6 and 7, respectively, was generally consistent with the analysis of change based on the global mental health rating. For both of these measures, the SEM method for determining clinically meaningful change was most concordant with both the transition measure of change and with change based on the clinician-rated GAF. The RCI criterion for no meaningful change was most concordant with no change based on the transition measure and the GAF. With respect to clinically meaningful decline, there was little or no difference in concordance among the three methods of determining clinically meaningful change. However, the number of cases in the "declined" groups was very small (n < 16), except for those in inpatient or outpatient mental health programs. In these programs, the SEM method for assessing clinically meaningful decline was more concordant than either RCI or ES.

The criterion for improvement on the global mental health rating was any change in a positive direction. To determine whether a higher threshold for improvement in global mental health


The Journal of Behavioral Health Services & Research

34:3

July 2007


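Table 5

Percent of individuals reporting improved, same, or worse global mental health and clinically meaningful improvement, no change, or decline in BASIS-24 overall score based on RCI, ES, and SEM

| Level of care | Program type | Global mental health rating | n | RCI (%) | ES (%) | SEM (%) |
|---|---|---|---|---|---|---|
| Inpatient | Mental health | Improved | 708 | 64 | 79 | 84 |
| Inpatient | Mental health | Same | 338 | 77 | 49 | 36 |
| Inpatient | Mental health | Worse | 119 | 3 | 21 | 24 |
| Inpatient | Substance abuse/dual Dx | Improved | 144 | 66 | 82 | 85 |
| Inpatient | Substance abuse/dual Dx | Same | 64 | 55 | 38 | 27 |
| Inpatient | Substance abuse/dual Dx | Worse | 25 | 4 | 20 | 20 |
| Outpatient | Mental health | Improved | 261 | 51 | 67 | 78 |
| Outpatient | Mental health | Same | 241 | 85 | 64 | 46 |
| Outpatient | Mental health | Worse | 91 | 18 | 32 | 38 |
| Outpatient | Substance abuse/dual Dx | Improved | 85 | 44 | 56 | 74 |
| Outpatient | Substance abuse/dual Dx | Same | 119 | 83 | 72 | 45 |
| Outpatient | Substance abuse/dual Dx | Worse | 53 | 21 | 28 | 42 |

n in each group is the number of cases improved, same, or worse based on the global rating of mental health.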


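Table 6

Percent of individuals reporting retrospective improvement, no change, or worse mental health and clinically meaningful improvement, no change, or decline in BASIS-24 overall score based on RCI, ES, and SEM

| Level of care | Program type | Retrospective rating | n | RCI (%) | ES (%) | SEM (%) |
|---|---|---|---|---|---|---|
| Inpatient | Mental health | Improved | 904 | 53 | 69 | 75 |
| Inpatient | Mental health | Same | 212 | 68 | 45 | 33 |
| Inpatient | Mental health | Worse | 49 | 2 | 24 | 27 |
| Inpatient | Substance abuse/dual Dx | Improved | 179 | 60 | 74 | 79 |
| Inpatient | Substance abuse/dual Dx | Same | 43 | 56 | 30 | 21 |
| Inpatient | Substance abuse/dual Dx | Worse | 11 | 9 | 9 | 9 |
| Outpatient | Mental health | Improved | 178 | 38 | 55 | 66 |
| Outpatient | Mental health | Same | 331 | 80 | 50 | 37 |
| Outpatient | Mental health | Worse | 84 | 7 | 18 | 23 |
| Outpatient | Substance abuse/dual Dx | Improved | 111 | 29 | 41 | 60 |
| Outpatient | Substance abuse/dual Dx | Same | 131 | 82 | 64 | 40 |
| Outpatient | Substance abuse/dual Dx | Worse | 15 | 13 | 20 | 20 |

n in each group is the number of cases improved, same, or worse based on retrospective rating of improvement.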


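Table 7

Percent of individuals with improved, same, or worse clinician rating of impairment (GAF) and clinically meaningful improvement, no change, or decline in BASIS-24 overall score based on RCI, ES, and SEM

| Level of care | Program type | GAF rating | n | RCI (%) | ES (%) | SEM (%) |
|---|---|---|---|---|---|---|
| Inpatient | Mental health | Improved | 284 | 60 | 77 | 82 |
| Inpatient | Mental health | Same | 45 | 56 | 27 | 11 |
| Inpatient | Mental health | Worse | 16 | 0 | 6 | 6 |
| Inpatient | Substance abuse/dual Dx | Improved | 30 | 60 | 77 | 80 |
| Inpatient | Substance abuse/dual Dx | Same | 10 | 70 | 50 | 30 |
| Inpatient | Substance abuse/dual Dx | Worse | 0 | 0 | 0 | 0 |
| Outpatient | Mental health | Improved | 49 | 41 | 57 | 67 |
| Outpatient | Mental health | Same | 167 | 72 | 47 | 32 |
| Outpatient | Mental health | Worse | 7 | 14 | 14 | 14 |
| Outpatient | Substance abuse/dual Dx | Improved | 32 | 28 | 41 | 63 |
| Outpatient | Substance abuse/dual Dx | Same | 71 | 85 | 63 | 42 |
| Outpatient | Substance abuse/dual Dx | Worse | 2 | 0 | 50 | 50 |

n in each group is the number of cases improved, same, or worse based on GAF.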

would yield different levels of concordance for SEM, ES, and RCI, the threshold for improvement was raised by considering only those who had improved by at least two rating scale points. This threshold resulted in a much lower proportion of individuals classified as improved (30% of inpatients and 12% of outpatients). However, SEM still yielded the highest rates of concordance with global improvement, ranging from 88 to 96% depending on program type and level of care (data not shown). Concordance rates for RCI remained lower than those for SEM and ES but were substantially higher than they were with the lower improvement threshold, ranging from 72 to 74% agreement.
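The concordance figures reported in Tables 5 to 7 can be read as conditional proportions: among individuals placed in a category by the anchor measure, the share that a given method places in the same category. The minimal sketch below illustrates that reading. The function name, the labels, and the eight-person data set are hypothetical and are not taken from the study.

```python
def concordance(anchor, method, category):
    """Share of cases rated `category` by the anchor measure that the
    change method also classifies as `category`.

    Returns None when no case has that anchor rating (an empty group).
    """
    matched = [m for a, m in zip(anchor, method) if a == category]
    if not matched:
        return None
    return sum(1 for m in matched if m == category) / len(matched)


# Hypothetical ratings for eight individuals (illustration only).
anchor    = ["improved", "improved", "improved", "improved",
             "same", "same", "same", "worse"]
sem_based = ["improved", "improved", "improved", "same",
             "improved", "same", "same", "worse"]
rci_based = ["improved", "same", "same", "same",
             "same", "same", "same", "same"]

print(concordance(anchor, sem_based, "improved"))  # 0.75
print(concordance(anchor, rci_based, "improved"))  # 0.25
print(concordance(anchor, rci_based, "same"))      # 1.0
```

In this toy data set a lenient criterion agrees more often with anchor-rated improvement, while a stringent criterion agrees more often with anchor-rated "same", which is the qualitative pattern the results describe for SEM and RCI.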

Discussion

This study used three methods of evaluating responsiveness of the BASIS-24* mental health instrument and then evaluated the concordance of each method in identifying clinically meaningful change by using three other measures of improvement. Several major points emerge from the analysis. First, the BASIS-24* instrument showed responsiveness to change following treatment of both inpatients and outpatients in mental health and substance abuse/dual diagnosis programs. At both the aggregate and individual level, change was greater among inpatients than outpatients, although aggregate results mask the finding that a large proportion of individuals show no meaningful change, and some decline. Consistent with previous findings reported in the literature, both the SEM and ES methods identified a higher proportion of individuals as meaningfully improved than did the RCI method.7,10,17

Second, change varied as a function of both program type and symptom/problem domain. In general, more individuals showed clinically meaningful improvement in overall mental health and in the depression/functioning domain than in other domains. However, those treated in substance abuse programs were more likely to show clinically meaningful change in the substance abuse domain than those treated in mental health programs, highlighting the importance of examining outcomes in the domains most relevant to the focus of particular programs or diagnosis groups.

Third, analyses reporting concordance between each of the three methods and three other measures of change showed that the SEM identified a higher proportion of individuals as improved or declined than either ES or RCI. These results support Wyrwich and Wolinsky's suggestion that the SEM may be a better method for determining meaningful change than ES because ES uses the standard deviation, which is sample dependent.5 In contrast, the SEM does not vary from sample to sample, providing a more stable method for determining clinically meaningful change.5,18 These results are also consistent with the suggestion that the RCI may be too stringent a criterion for determining clinically meaningful change, particularly for individuals with severe and persistent mental illness.25,39–41 Comparison of results of this study to those reported for the original BASIS-32® in a sample of individuals with severe and persistent mental illness treated as outpatients in South Carolina, and using the same RCI formula used here, shows an almost identical proportion of cases with improvement on the overall mean score: 24.5% for the South Carolina sample and 23.6% for the national sample reported here, despite the longer follow-up period (3–6 months) used in South Carolina compared with the current study (1–3 months).36

There are a number of limitations to this study. First, although the sample is relatively large and includes individuals receiving mental health treatment at many different sites and geographic regions of the U.S., it is not necessarily representative of all mental health consumers. Second, because normative data for an untreated population are not available for BASIS-24*, it was not possible to determine the proportion of individuals whose mental health status moved into the range of a "normal" population.24,25,41 Although three other mental health outcome measures were used to determine which method of measuring meaningful change showed the highest concordance with change in BASIS-24*, the other measures used had limitations. Two of them


were self-report measures; consequently, they were not external "anchor" measures, although these types of measures (global evaluations and transition ratings) have been used in previous research.5,16,42 The third measure (GAF) was an external measure because it is determined by the clinician. However, GAF ratings were available for only 32% of the sample. Furthermore, training of the clinicians who made the GAF ratings could not be determined, nor could reliability or validity of the GAF ratings extracted from the medical records or administrative databases be assessed. However, a number of recent studies have reported satisfactory reliability and validity of GAF ratings obtained from these sources.43–45 In the absence of a "gold standard" measure of mental health status and functioning (which does not currently exist46,47), further research using multiple anchor-based methods would provide additional evidence regarding which method yields the best estimate of clinically meaningful change, particularly with regard to mental health outcomes for individuals with severe and persistent mental illness. However, results of a simulation study exploring the relationship between an anchor-based approach and effect size found a near-linear relationship, suggesting that the proportion of individuals showing meaningful improvement can be directly estimated from ES, a distribution-based method.48

Another concern in evaluating clinical meaningfulness of individual change is the reliability of values obtained for the instrument. Reliability levels of 0.90 have been recommended as minimum standards for interpretation of individual-level results.49 However, this level of reliability is rarely achieved even for medical vital signs such as blood pressure.7 Recent work has suggested that this reliability level may be too stringent and that although caution should be exercised when interpreting results of assessments with less than optimal reliability, use of individual assessment data can still be valuable.7
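To make the role of reliability concrete, the sketch below uses the standard psychometric relation SEM = SD × √(1 − r) to show how the smallest individual change flagged by a one-SEM rule and by a 1.96-based RCI rule shrinks as reliability rises. The standard deviation of 0.70 is hypothetical (it is not a BASIS-24* estimate), the function names are illustrative, and the RCI threshold assumes, for simplicity, the same SEM at both time points.

```python
import math


def sem_from_reliability(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def minimum_changes(sd, reliability):
    """Smallest change flagged by a 1-SEM rule and by an RCI > 1.96 rule,
    assuming (for illustration) equal SEM at T1 and T2."""
    sem = sem_from_reliability(sd, reliability)
    rci_min = 1.96 * math.sqrt(sem ** 2 + sem ** 2)
    return sem, rci_min


HYPOTHETICAL_SD = 0.70  # illustrative only; not taken from the study

for r in (0.80, 0.85, 0.90):
    sem_min, rci_min = minimum_changes(HYPOTHETICAL_SD, r)
    print(f"r={r:.2f}  1-SEM threshold={sem_min:.3f}  RCI threshold={rci_min:.3f}")
```

Under these assumptions both thresholds fall as reliability increases, and the RCI threshold is always about 2.8 times the one-SEM threshold, which is one way to see why a less reliable scale makes the RCI criterion especially demanding for individuals.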

Implications for Behavioral Health

Many mental health assessment instruments, including BASIS-24*, were developed to monitor treatment outcomes at the aggregate level for quality improvement and accountability purposes. However, routine outcomes assessment can be both costly and of questionable value to clinical treatment providers.50,51 Timely knowledge about progress and treatment outcomes for individuals can provide opportunities for clinicians to improve care for individuals, utilize individual outcome data to guide future treatment, and engage consumers in the treatment process. Several mechanisms can be used to obtain immediate feedback of outcome data, including hand-scoring templates, which plot individual progress against a benchmark, or automated scoring software/services, which provide immediate scoring and graphical printouts of mental health status and outcome scores for use by clinicians. These methods, which are widely available for psychological tests sold by commercial testing companies, are increasingly available for mental health outcome measures as well (including for BASIS-24*). However, to be used by clinicians, the reports must be perceived as valid, culturally sensitive, interpretable, and user-friendly.51

Assessing clinical meaningfulness of change is one step toward providing interpretable outcome data at the individual level. This study used several methods for assessing clinically meaningful change, and results indicated that the SEM method consistently identified a somewhat higher proportion of people as clinically meaningfully improved than ES, and a substantially higher proportion than RCI. In addition, with regard to showing change, the SEM method was more concordant with three other measures of change, irrespective of the threshold established for clinically meaningful change. On the other hand, RCI estimates of no meaningful change were more concordant with other measures indicating no change; and among outpatients, more individuals showed no change than improvement on all three outcome measures. This finding was counterbalanced by the small percentage of people showing declines in mental health status, resulting in statistically significant improvement in aggregate scores, despite the fact that more


outpatients showed no change than showed improvement. Use of the RCI may be more appropriate for populations on whom the measure was first developed and reported; that is, outpatient psychotherapy clients, who are likely to be less impaired than individuals receiving more intensive and long-term services such as those treated in public mental health systems. In contrast, the SEM criterion is widely used in assessing clinically meaningful improvement among individuals with chronic medical conditions,7,30 populations which may be more similar in some ways to those with long-standing behavioral health conditions.
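For readers who want to see how the three criteria can disagree for a single person, the following minimal sketch classifies one individual's change under each rule. The thresholds are assumptions inferred from the form of Eqs. 2 and 3 (RCI: 1.96 × √(SEM1² + SEM2²); ES: 0.5 baseline SD; SEM: one SEM1), improvement is taken as T1 minus T2 on the assumption that lower scores indicate better functioning, and the function name and all numeric inputs are hypothetical.

```python
import math


def classify_change(t1, t2, sem1, sem2, sd1):
    """Classify one person's change under RCI, ES, and SEM criteria.

    Assumes lower scores mean better functioning, so improvement = t1 - t2.
    Thresholds are illustrative assumptions, not the authors' exact code:
      RCI -> 1.96 * sqrt(sem1**2 + sem2**2)
      ES  -> 0.5 * sd1
      SEM -> 1.0 * sem1
    """
    improvement = t1 - t2
    cutoffs = {
        "RCI": 1.96 * math.sqrt(sem1 ** 2 + sem2 ** 2),
        "ES": 0.5 * sd1,
        "SEM": sem1,
    }
    labels = {}
    for method, cutoff in cutoffs.items():
        if improvement > cutoff:
            labels[method] = "improved"
        elif improvement < -cutoff:
            labels[method] = "declined"
        else:
            labels[method] = "no change"
    return labels


# Hypothetical scores and statistics, for illustration only.
labels = classify_change(t1=2.0, t2=1.5, sem1=0.25, sem2=0.25, sd1=0.70)
print(labels)  # {'RCI': 'no change', 'ES': 'improved', 'SEM': 'improved'}
```

In this example a half-point improvement clears the SEM and ES thresholds but not the RCI threshold, mirroring the ordering of the three methods reported in the study.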

Acknowledgment

The authors thank Colleen McHorney, Ph.D., for suggesting use of the SEM to assess clinically meaningful change, and Joel Reisman for his helpful comments on an earlier version of this manuscript. This research was supported by grant R01 MH58240 from the National Institute of Mental Health and by the Veterans Affairs Health Services Research & Development program.

References

1. Newman FL, Ciarlo JA, Carpenter D. Guidelines for selecting psychological instruments for treatment planning and outcome assessment. In: Maruish M, ed. The Use of Psychological Testing for Treatment Planning and Outcomes Assessment (2nd ed.). Mahwah, NJ: Lawrence Erlbaum; 1999:153–170.
2. Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. Journal of Chronic Disease. 1987;40:171–178.
3. Liang MH. Longitudinal construct validity: Establishment of clinical meaning in patient evaluation instruments. Medical Care. 2000;38(Suppl II):S84–S90.
4. Hays RD, Anderson R, Revicki D. Psychometric considerations in evaluating health-related quality of life measures. Quality of Life Research. 1993;2:441–449.
5. Wyrwich KW, Wolinsky FD. Identifying meaningful intra-individual change standards for health-related quality of life measures. Journal of Evaluation in Clinical Practice. 2000;6:39–49.
6. McHorney CA, Tarlov AR. Individual patient monitoring in clinical practice: are available health status surveys adequate? Quality of Life Research. 1995;4(4):293–307.
7. Hays RD, Brodsky M, Johnston MF, et al. Evaluating the statistical significance of health-related quality-of-life change in individual patients. Evaluation & the Health Professions. 2005;28:160–171.
8. Wyrwich KW, Bullinger M, Aaronson N, et al. Estimating clinically significant differences in quality of life outcomes. Quality of Life Research. 2005;14:285–295.
9. Glossary. Health outcomes methodology. Medical Care. 2000;38(9 Suppl. II,II-7-II-13):2–10).
10. Atkins DC, Bedics JD, McGlinchey JB, et al. Assessing clinical significance: Does it matter which method we use? Journal of Consulting & Clinical Psychology. 2005;73:982–989.
11. Eisen SV, Normand SLT, Belanger AJ, et al. The revised Behavior and Symptom Identification Scale (BASIS-24*): Reliability and validity. Medical Care. 2004;42:1230–1241.
12. Guyatt GH, Norman GR, Juniper EF, et al. A critical look at transition ratings. Journal of Clinical Epidemiology. 2002;55:900–908.
13. Spitzer RL, Gibbon M, Endicott J. Global Assessment Scale (GAS), Global Assessment of Functioning (GAF) Scale, Social and Occupational Functioning Assessment Scale (SOFAS). In: Handbook of Psychiatric Measures. Washington DC: American Psychiatric Association; 2000:96–100.
14. Kazdin AE. Assessing the clinical or applied importance of behavior change through social validation. Behavior Modification. 1977;1:427–452.
15. Streiner DL, Norman GR. Health Measurement Scales. A Practical Guide to Their Development and Use (3rd ed.). New York: Oxford University Press; 2003.
16. Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimally important difference. Controlled Clinical Trials. 1991;10:407–415.
17. Hurst H, Bolton J. Assessing the clinical significance of change scores recorded on subjective outcome measures. Journal of Manipulative and Physiological Therapeutics. 2004;27:26–35.
18. Ferguson RJ, Robinson AB, Splaine M. Use of the reliable change index to evaluate clinical significance in SF-36 outcomes. Quality of Life Research. 2002;11:509–516.
19. Clancy C, Eisenberg J. Outcomes research care: measuring the end results of health care. Science. 1998;282:245–246.
20. Cohen J. Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.
21. Testa M. Interpreting quality of life clinical trial data for use in the clinical practice of antihypertensive therapy. Journal of Hypertension Supplement. 1987;5:S9–S13.
22. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Medical Care. 1989;27(Suppl 3):S178–S189.


23. Norman GR, Sloan JA, Wyrwich KW. Interpretation of changes in health-related quality of life. The remarkable universality of half a standard deviation. Medical Care. 2003;5:582–592.
24. Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology. 1991;59:12–19.
25. Jacobson NS, Follette WC, Revenstorf D. Psychotherapy outcome research. Methods for reporting variability and evaluating clinical significance. Behavior Therapy. 1984;15:336–352.
26. Jacobson NS, Roberts LJ, Berns SB, McGlinchey JB. Methods for defining and determining the clinical significance of treatment effects: description, application and alternatives. Journal of Consulting and Clinical Psychology. 1999;67:300–307.
27. Speer D. Methodological developments. Clinically significant change: Jacobson and Truax (1991) revisited. Journal of Consulting and Clinical Psychology. 1992;69(3):402–408.
28. Hageman WJM, Arrindell WA. A further refinement of the reliable change (RC) index by improving the pre–post difference score: introducing RCID. Behaviour Research and Therapy. 1993;31:693–700.
29. Anastasi A, Urbina S. Psychological Testing (7th ed.). Upper Saddle River, NJ: Prentice-Hall; 1997.
30. Wyrwich KW, Tierney WM, Wolinsky FD. Further evidence supporting an SEM-based criterion for identifying meaningful intraindividual changes in health-related quality of life. Journal of Clinical Epidemiology. 1999;52:861–873.
31. Wyrwich KW, Nienaber NA, Tierney WM, Wolinsky FD. Linking clinical relevance and statistical significance in evaluating intraindividual changes in health-related quality of life. Medical Care. 1999;37:469–478.
32. Eisen SV, Dill DL, Grob MC. Reliability and validity of a brief patient-report instrument for psychiatric outcome evaluation. Hospital & Community Psychiatry. 1994;45:242–247.
33. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. 4th ed. (DSM-IV). Washington D.C.: American Psychiatric Association; 1994.
34. Burlingame GM, Dunn TW, Chen S, et al. Selection of outcome assessment instruments for inpatients with severe and persistent mental illness. Psychiatric Services. 2005;56:444–451.
35. Eisen SV, Gerena M, Ranganathan G, et al. Reliability and validity of the BASIS-24* Mental Health Survey for Whites, African-Americans and Latinos. Journal of Behavioral Health Services & Research. 2006;33:304–323.
36. Jerrell JM. Behavior and Symptom Identification Scale 32. Sensitivity to change over time. Journal of Behavioral Health Services & Research. 2005;32:341–346.
37. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20:37–46.
38. Seal P, Glickman M, Eisen SV. The relationship between SEM, effect size and RCI when classifying clinically meaningful change. Manuscript in preparation; 2006.
39. Lambert MJ, Lambert JM. Use of psychological tests for assessing treatment outcome. In: Maruish ME, ed. The Use of Psychological Testing for Treatment Planning and Outcomes Assessment (2nd ed.). Mahwah, NJ: Lawrence Erlbaum; 1999:115–152.
40. Jacobson NS, Roberts LJ, Berns SB, et al. Methods for defining and determining the clinical significance of treatment effects. Journal of Consulting and Clinical Psychology. 1999;67:300–307.
41. Ogles BM, Lambert MJ, Sawyer JD. Clinical significance of the National Institute of Mental Health treatment of depression collaborative research program data. Journal of Consulting and Clinical Psychology. 1995;63:321–326.
42. Guyatt H, Osoba D, Wu A, et al. Methods to explain the clinical significance of health status measures. Mayo Clinic Proceedings. 2002;77(4):373–383.
43. Jones SH, Thornicroft G, Coffey M, et al. A brief mental health outcome scale-reliability and validity of the Global Assessment of Functioning (GAF). British Journal of Psychiatry. 1995;166:654–659.
44. Startup M, Jackson MC, Bendix S. The concurrent validity of the Global Assessment of Functioning (GAF). British Journal of Clinical Psychology. 2002;41:417–422.
45. Greenberg GA, Rosenheck RA. Using the GAS as a national mental health outcome measure in the Department of Veterans Affairs. Psychiatric Services. 2005;56:420–426.
46. Wilkinson G, Hesdon B, Wild D, et al. Self-report quality of life measure for people with schizophrenia: the SQLS. British Journal of Psychiatry. 2000;177:42–46.
47. Srebnick D, Hendryx M, Stevenson J, et al. Development of outcome indicators for monitoring the quality of public mental health care. Psychiatric Services. 1997;48:903–909.
48. Norman GR, Sridhar FG, Guyatt GH. Relation of distribution- and anchor-based approaches in interpretation of changes in health-related quality of life. Medical Care. 2001;39:1039–1047.
49. Nunnally JC. Psychometric Theory. New York: McGraw-Hill; 1994.
50. Gilbody SM, House AO, Sheldon T. Routine administration of health related quality of life (HRQoL) and needs assessment instruments to improve psychological outcome: a systematic review. Psychological Medicine. 2002;32:1345–1356.
51. Garland AF, Kruse M, Aarons GA. Clinicians and outcome measurement: What's the use? Journal of Behavioral Health Services & Research. 2003;30:393–405.
