COMPARING UTILITY SCORES DERIVED FROM THE SHORT-FORM 36, STANDARD GAMBLE, AND HEALTH UTILITIES INDEX

Working Paper 04-05

David Feeny 1,2,3 and Ken Eng 1

1. Institute of Health Economics, Edmonton, Alberta, Canada
2. Departments of Economics and Public Health Sciences, Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, Alberta, Canada
3. Health Utilities Incorporated, Dundas, Ontario, Canada

Legal Deposit 2000 National Library Canada ISSN 1481-3823

August 2004

TABLE OF CONTENTS
ABSTRACT
    Background
    Objective
    Methods
    Results
    Conclusions
INTRODUCTION
METHODS
    Patients and Procedures
    Measures
        SF-36
        SF-6D
        HUI2 and HUI3
        VAS and SG
        Quality of Well-Being (QWB)
        QWBSF36
        HUI2SF36
        HUI3SF12
    Statistical Analysis, Agreement
    Statistical Analysis, Area under the Curve (AUC)
RESULTS
    Patient Entry and Baseline Characteristics
    Baseline
    Pre-surgery
    Post-surgery
    ICC
    Area Under the Curve
DISCUSSION
CONCLUSIONS
REFERENCES

Institute of Health Economics Working Paper 04-05


LIST OF TABLES
TABLE 1: Demographic and Clinical Characteristics of Patients at Baseline, n=86
TABLE 2: Utility Scores
TABLE 3: Agreement: Intra-Class Correlation Coefficients (ICC) and 95% Confidence Intervals Between Utility Scores
TABLE 4: Difference (Post-Surgery minus Pre-Surgery) in Utility Scores
TABLE 5: Area Under the Curve Estimates


ACKNOWLEDGMENTS

Financial Support: The study, “The Effect of Waiting for Elective Hip Arthroplasty on Health-Related Quality of Life,” was supported by a grant from Physician Services Incorporated (PSI) of Ontario to Dr Jeffrey Mahon (Grant #94-30). The analyses reported in this paper were supported by grants from the Alberta Heritage Foundation for Medical Research (AHFMR) (#199909) and the Institute of Health Economics (IHE) to David Feeny. PSI, AHFMR, and IHE played no role in the design, interpretation, or analysis of the project and have not reviewed or approved this manuscript.

The authors gratefully acknowledge the contributions of Dr Jeffrey L. Mahon, Dr Robert Bourne, Dr Cecil Rorabeck, Larry Stitt, and Susan Webster-Bogaert to the hip arthroplasty study that generated the data upon which this paper is based. The authors also gratefully acknowledge the help of the patients and surgeons who participated in the study: London, Ontario orthopedic surgeons Drs Harvey Bailey, David Chess, Wayne Grainger, Paul Kim, and Richard McCalden. The authors acknowledge the contributions of Dr Chris Blanchard and Lieling Wu to the project upon which the paper is based. Finally, the authors acknowledge the constructive comments of a reviewer.

Conflict of Interest: It should be noted that David Feeny has a proprietary interest in Health Utilities Incorporated, Dundas, Ontario, Canada. HUInc. distributes copyrighted Health Utilities Index (HUI) materials and provides methodological advice on the use of HUI.

* The analyses of agreement among utility scores reported in this paper extend results for comparisons among standard gamble, Health Utilities Index Mark 2, Health Utilities Index Mark 3, and Short-Form 6D scores previously reported in Feeny, Wu, and Eng (2004).


ABSTRACT

Background
Single-summary scores representing the health-related quality of life (HRQL) associated with health states are important patient-reported outcomes used to assess the effectiveness of healthcare interventions. Single-summary scores derived from preference-based measures are key inputs in the estimation of quality-adjusted life years (QALYs), cost-effectiveness analyses, and cost-utility analyses. Because many studies did not include preference-based measures in their original data collection, a variety of mapping and regression-based algorithms have been developed to “predict” preference scores on the basis of scores from measures of health status such as the Short-Form 36 (SF-36).

Objective
The objective of the study is to compare a variety of directly assessed and algorithm-derived preference-based scores based on the standard gamble (SG), SF-36, Health Utilities Index Mark 2 (HUI2), Health Utilities Index Mark 3 (HUI3), and Quality of Well-Being (QWB) scale in the context of elective total hip arthroplasty (THA) for osteoarthritis.

Methods
SF-36, HUI2, and HUI3 were serially administered to a cohort of patients waiting for and undergoing elective THA. The patients were also asked to evaluate the HRQL of their current health with the SG. Agreement among scores was assessed using the intra-class correlation coefficient (ICC) at the baseline, pre-surgery, and post-surgery assessment points. Estimates of change associated with THA were compared among measures. The time path of HRQL experienced by the cohort of patients was compared among measures by computing the area under the curve (AUC).

Results
Mean SG, HUI2, and SF-6D scores are similar. Mean HUI3, HUI3SF12 (based on a regression model predicting HUI3 scores from SF-12 scores), and QWBSF36 (based on a regression model predicting QWB scores from SF-36 scores) scores are lower; mean HUI2SF36 (based on a regression model predicting HUI2 scores from SF-36 scores) scores are the highest. At baseline, agreement is poor for 11 of 21 pairs of scores (52%) and moderate to good for 10 pairs (48%). At the pre-surgery assessment, agreement is poor for 13 pairs (62%) and moderate to good for 8 (38%). At the post-surgery assessment, agreement is poor for 10 pairs (48%), moderate to good for 10 (48%), and excellent for one pair (5%). In general, the range and variability of scores were substantially lower for utility scores derived from SF-36 than for HUI and/or SG scores. Estimates of the change associated with THA vary substantially among measures. Change is highest for HUI3 and HUI2, intermediate for the SG and HUI3SF12, and lowest for HUI2SF36, SF-6D, and QWBSF36. AUC estimates also vary considerably among measures.

Conclusions
Scores derived from direct (SG) and multi-attribute measures (HUI) and various scores derived from SF-36 are not interchangeable. Scores based on SF-36 seem to be subject to floor effects. The choice of measure may importantly affect the interpretation of the results. Further investigation comparing utility scores derived by various approaches is warranted.


INTRODUCTION

The Short-Form 36 (SF-36) (Ware and Sherbourne 1992; Ware et al. 1993; Ware 1996) is one of the most commonly used measures of health status. However, the SF-36 does not generate utility scores on the conventional dead = 0.00 to perfect health = 1.00 scale. A number of investigators have estimated the relationship between scores derived from various multi-attribute systems and scores on the SF-36. In addition, Brazier et al. (2002) recently provided an algorithm for computing utility scores based on descriptive health status information collected using SF-36.

The validity and usefulness of utility scores derived from SF-36 may be affected by a number of factors. Potentially important factors include the nature of the SF-36 system itself. Is it subject to floor effects and/or ceiling effects? For the regression-model based algorithms used to predict Quality of Well-Being (QWB) or Health Utilities Index (HUI) scores on the basis of information on health status reflected in SF-36, the range and variability in the data set used for the regression analysis may affect the generalizability and validity of the predicted scores. If there were few, if any, observations in severe health states, the regression model may not capture that part of the space adequately. For the Short-Form 6D system (Brazier et al. 2002), based directly on SF-36 and estimated from standard gamble (SG) scores, performance may be influenced by the choice of functional form and multicollinearity in preferences among attributes or dimensions of health status.

The objective of this paper is to compare utility scores derived from the SF-36, scores derived from the Health Utilities Index Mark 2 (HUI2) and Mark 3 (HUI3), and directly measured SG scores obtained in a cohort study of patients undergoing elective total hip arthroplasty (THA) for osteoarthritis (OA). Results reported here extend the results reported in Feeny, Wu, and Eng (2004), which included only comparisons among the SG, HUI2, HUI3, and SF-6D.

Several types of comparisons are made. First, scores are compared descriptively. Second, the degree of agreement among scores is assessed using the intra-class correlation coefficient (ICC) (Shrout and Fleiss 1979; Deyo et al. 1991). Third, changes in scores between the pre-surgery and post-surgery assessments are compared descriptively across the measures. Fourth, estimates of the area under the curve (AUC) associated with THA generated by each utility measure are compared. Each estimate of the AUC is based on a series of utility scores tracking the health-related quality of life (HRQL) of each patient from the baseline through the post-surgery assessments.

METHODS

Patients and Procedures

The data used in the paper were obtained from a study documenting HRQL while waiting for, undergoing, and recovering from elective THA for OA. The study has been described previously (Mahon et al. 2002; Blanchard et al. QLR 2004; Blanchard et al. JCE 2003; Feeny et al. 2003; Feeny, Blanchard, Mahon et al. QLR 2004) and is described here briefly. Prior to commencing the study, approval was obtained from the local Human Ethics Committee. All patients who were referred for “hip disease” between November 1993 and 1996 to any of seven surgeons performing THA in London, Ontario, were potentially eligible to participate in the study. Eligible patients who provided consent were invited to attend an outpatient department for a baseline assessment. Upon arrival at the clinic, the following data were collected from the patients: (a) age, gender, home address, employment status, duration of hip disease symptoms, and presence of comorbid conditions; (b) SF-36; (c) HUI2 and HUI3; (d) a number of disease-specific measures of health status and health-related quality of life; (e) visual analogue scale (VAS) and SG scores for three hypothetical marker states (mild, moderate, and severe OA) and for the patient’s subjectively-defined current health state (SDCS); and (f) the six-minute walk test.

Patients who were put onto a waiting list for THA continued to participate in a longitudinal study examining HRQL after THA (Mahon et al. 2002). For the purposes of the present study, which focuses on comparing various utility scores, we analyze data from the baseline, pre-surgery, and post-surgery assessments of patients eventually put on the waiting list for THA, in order to focus on patients with documented hip disease severe enough to warrant THA. In general, the health status of patients declined while they were waiting for surgery and then improved dramatically following surgery.
Assessing agreement at multiple points in time during which health status changed provides evidence on the extent to which agreement is stable over time. These analyses were also conducted because THA patients experience substantial changes in health status over time, providing an opportunity to examine whether the extent of agreement depends on the mix of health states and degree of severity.

Measures

SF-36. SF-36 includes 36 items and generates a score for each of eight domains of health status: physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health (Ware and Sherbourne 1992; Ware et al. 1993; Ware 1996). SF-36 also provides two summary scores: the physical component summary score (PCS) and mental component summary score (MCS). Domains that contribute positively to the PCS include physical functioning, role-physical, bodily pain, and general health. Domains that contribute positively to the MCS are vitality, social functioning, role-emotional, and mental health. Scores for each domain range from 0 (the lowest score) to 100 (a perfect score). MCS and PCS scores are often reported on a scale standardized so that the score for the general population is 50.

SF-6D. The Short-Form 6D (SF-6D) system includes seven of the eight domains from the SF-36: physical functioning, role limitations (a combination of role-physical and role-emotional), social functioning, pain, mental health, and vitality (Brazier et al. 2002; Brazier et al. 2004). Overall utility scores for SF-6D are based on a modified ad hoc linear additive functional form estimated on the basis of SG scores obtained from a random sample of the adult population in the United Kingdom. The lottery used to generate the utility scores used to estimate the scoring function consisted of the all-best SF-6D health state (perfect health) and the all-worst SF-6D health state. Scores range from 0.00 (dead) to 0.29 (the SF-6D state with the lowest score; the SF-6D health state with every attribute at its lowest level has a score of 0.30) to 1.00 (perfect health in the SF-6D system).

HUI2 and HUI3. HUI is a family of preference-based multi-attribute measures of HRQL.
HUI2 includes seven health attributes (sensation [vision, hearing, speech], mobility, emotion, cognition, self-care, pain, and fertility) with three to five levels within each attribute for a total of 24,000 unique health states (Feeny et al. 1992, 1996; Furlong et al. 2001; Torrance et al. 2002; Horsman et al. 2003). (Fertility is not addressed in this study and is assumed to be normal.)


HUI3 has eight attributes: vision, hearing, speech, ambulation, dexterity, cognition, emotion, and pain, with five or six levels for each attribute for a total of 972,000 unique health states (Feeny et al. 1996, 2002). HUI2 overall utility scores are on a scale in which the all-worst HUI2 state has a score of -0.03, dead has a score of 0.00, and perfect health is 1.00 (Torrance et al. 1996). The HUI2 scoring system is based on visual analogue scale (VAS) and SG scores obtained from a random sample of parents in Hamilton, Ontario. The lottery for the SG consisted of dead and perfect health. The HUI2 and HUI3 scoring functions use a multiplicative functional form. Overall scores for the HUI3 are on a scale in which the all-worst HUI3 state has a score of -0.36, dead is 0.00, and perfect health is 1.00 (Feeny et al. 2002). The HUI3 scoring system is based on VAS and SG scores obtained from a random sample of the general population in Hamilton, Ontario. The lottery for the SG consisted of the all-worst HUI3 health state and the all-best HUI3 health state (perfect health).

VAS and SG. Interviews to obtain VAS and SG scores were conducted in person by a trained research assistant who was not involved in providing clinical care. Respondents were first asked to rank the three marker states, perfect health, and dead on the VAS. They were then asked to place an arrow on the VAS to correspond to their valuation of their SDCS. After completing the VAS exercise, the three marker states and the SDCS were evaluated using the SG. The SG was implemented using a Chance Board (Furlong et al. 1990). The lottery for the SG consisted of perfect health and dead.

Quality of Well-Being (QWB). The QWB is a preference-based health-related quality of life measure (Kaplan and Anderson 1996). The QWB has four attributes: mobility, physical activity, social activity, and a problem-symptom complex.
The QWB scoring system is based on VAS scores obtained from a random sample of the general population in San Diego, California. Scores range from 0.0 (dead) to 1.0 (optimum function). The QWB uses a linear additive functional form.

QWBSF36. Fryback et al. (1997) estimated an equation to predict QWB scores from SF-36. They used six independent variables in the regression model: domain scores for physical functioning, mental health, and bodily pain, and interactions of three pairs of domain scores (general health with role-physical, physical functioning with bodily pain, and mental health with bodily pain). The adjusted coefficient of determination was 0.57.

HUI2SF36. Nichol et al. (2001) estimated a regression model to predict HUI2 scores from t-scores for the eight domain scores of SF-36. The adjusted coefficient of determination was 0.50.

HUI3SF12. Franks et al. (2003) estimated a regression model to predict HUI3 scores from t-scores for the physical and mental component summary scores of the SF-12. The adjusted coefficient of determination was 0.58. Franks et al. (2004) have recently published a regression model to predict EQ-5D scores on the basis of SF-12 scores. Analysis of the performance of this new regression model is beyond the scope of this paper.

Statistical Analyses, Agreement

Four utility scores were derived from health status data obtained using the SF-36: SF-6D, QWBSF36, HUI2SF36, and HUI3SF12. HUI2 and HUI3 scores were derived from health status assessments obtained on the same individuals on the same occasions using standard HUI questionnaires and algorithms. SG scores for each respondent’s SDCS were obtained directly from the patients. Agreement among the various utility scores was assessed for each pair-wise comparison of scores. ICCs were estimated to determine agreement at three assessment points: baseline, pre-surgery, and post-surgery (Shrout and Fleiss 1979; Deyo et al. 1991). Agreement statistics were interpreted according to guidelines provided by Nunnally (1978): ICC < 0.40 implies poor agreement; 0.40 through 0.75 implies moderate to good agreement; and > 0.75 implies excellent agreement.

Statistical Analyses, Area under the Curve (AUC)

Estimates of the AUC for each patient for each utility score were derived covering the period from baseline through the post-surgery assessment.
Computation of the AUC requires complete data for the utility score of interest at each assessment. Furthermore, comparing estimates of the AUC across measures requires complete data for each of the underlying sources of utility scores: HUI2, HUI3, the SG, and SF-36. Given that our objective was to compare estimates of AUC across measures, we decided not to impute any missing data. If we had imputed missing data, the resulting AUC estimates would reflect both the measure itself and the imputation algorithm. As a result, the estimates of AUC without any imputation of missing data facilitate the comparisons among measures, but the estimates are not accurate descriptions of the experience of the patients because data were missing more frequently among patients with lower health status.

AUC estimates are derived using guidelines developed by Drummond et al. (1997). For instance, the average of the scores observed at assessment 1 and assessment 2 is computed for each patient and multiplied by the time elapsed between the two assessments for that patient. The AUC for that patient is computed by adding up all of the segments from baseline through the post-surgery assessment. Estimates for each measure are presented in two ways: the simple summation and the summation normalized for a 12-month time period.
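
The segment-by-segment AUC calculation described above can be sketched as follows. This is an illustrative sketch only, not the code used in the study: the function names, the use of days as the time unit, and the proportional rescaling to a 365-day period are assumptions, since the paper does not specify how the 12-month normalization was carried out. The example patient is made up.

```python
from typing import Sequence


def auc_trapezoid(days: Sequence[float], scores: Sequence[float]) -> float:
    """Sum of trapezoid segments for one patient, in utility-days.

    Each segment is the average of two consecutive utility scores
    multiplied by the time elapsed between the two assessments.
    """
    if len(days) != len(scores) or len(days) < 2:
        raise ValueError("need at least two assessments with matching times and scores")
    total = 0.0
    for i in range(len(days) - 1):
        elapsed = days[i + 1] - days[i]
        if elapsed <= 0:
            raise ValueError("assessment times must be strictly increasing")
        total += 0.5 * (scores[i] + scores[i + 1]) * elapsed
    return total


def auc_normalized(days: Sequence[float], scores: Sequence[float],
                   period_days: float = 365.0) -> float:
    """AUC rescaled proportionally to a common period (assumed: 365 days)."""
    area = auc_trapezoid(days, scores)  # validates the inputs first
    follow_up = days[-1] - days[0]
    return area * period_days / follow_up


# Hypothetical patient: baseline, pre-surgery, and post-surgery assessments.
example_days = [0, 100, 300]
example_scores = [0.60, 0.50, 0.90]
print(auc_trapezoid(example_days, example_scores))
print(auc_normalized(example_days, example_scores))
```

Because no missing scores are imputed, a patient would only enter such a calculation if every assessment in the series has a score for the measure in question.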

RESULTS

Patient Entry and Baseline Characteristics

Of 553 patients referred for hip disease, 338 did not complete a baseline assessment because (a) they did not fulfill entry criteria at the time of initial contact (N = 224, including 34 patients who could not be contacted), or (b) they refused to participate despite meeting initial entry criteria (N = 48, unwilling to travel to London; N = 9, “too busy”; N = 57, reason not specified). Of the remaining 215 patients completing baseline assessments, 92 patients were not put on the waiting list (N = 62, THA judged not indicated by surgeon; N = 20, diagnosis other than OA; N = 6, failure to attend appointment with surgeon; N = 4, surgeon recommended THA but patient declined), leaving 123 patients eligible for surgery who entered the waiting list. Of the 123 patients eligible for surgery, 7 had their THA elsewhere, 2 did not have surgery, 15 did not return for a post-surgery assessment, and 9 had large amounts of missing data (< 30% of questions completed) across measures, leaving 90 patients. Complete data for HUI2, HUI3, SF-6D, and the SG were available for 86 patients at baseline. Demographic characteristics for the patients for whom we have complete data at baseline


are displayed in Table 1. Patterns were similar for males and females. Characteristics for the 86 patients are similar to those for the full cohort of patients put on the waiting list. As one would expect, most patients were retired and had been experiencing symptoms for a number of years.

Baseline

Baseline scores for all measures are reported in Panel A of Table 2. Mean scores for the SF-6D, HUI2, and SG are the same or very similar. Mean scores for the HUI3, HUI3SF12, and QWBSF36 are lower, whereas the HUI2SF36 had the highest mean score of all measures. The HUI3 and SG had a larger range of scores with relatively high standard deviations. The ranges of the SF-36-based scores (QWBSF36, SF-6D, and HUI2SF36) are systematically smaller, with relatively small standard deviations. The range and variability are intermediate for the HUI3SF12.

Pre-Surgery

The results at pre-surgery, for which 63 patients have complete data, are comparable to the results at baseline. For 28 of these patients, the baseline assessment and the pre-surgery assessment are the same. For patients who waited longer for their surgery, the pre-surgery assessment was the second assessment for 19 patients and the third or later assessment for 16 patients. HUI2SF36 has the highest mean score; HUI3SF12 has the lowest. HUI2, HUI3, and QWBSF36 scores are relatively low.

Post-Surgery

The 63 patients at post-surgery are the same patients as at the pre-surgery assessment. All measures have higher mean scores at post-surgery than at pre-surgery. The HUI2SF36 has the highest mean score; HUI3SF12 has the lowest. Higher minimum scores are observed at the post-surgery assessment. The range of scores is reduced relative to the baseline and pre-surgery assessments. The amount of time elapsed between the pre- and post-surgery assessments was variable.
The mean time in days between the pre- and post-surgery assessments was 166 (SD = 66; median = 158; mode = 189) with a minimum of 86 and a maximum of 535.


ICC

Agreement was assessed for each pair-wise combination at each assessment (Table 3). At baseline there is poor agreement for the following 11 pairs (52%): HUI2/QWBSF36, HUI3/HUI2SF36, HUI3/QWBSF36, QWBSF36/HUI2SF36, SF-6D/HUI3, SG/SF-6D, SG/HUI2, SG/HUI3, SG/HUI2SF36, SG/QWBSF36, and HUI3SF12/SG. Agreement is moderate to good for 10 pairs (48%): SF-6D/HUI2SF36, SF-6D/QWBSF36, HUI2/HUI2SF36, SF-6D/HUI2, HUI2/HUI3, HUI3SF12/SF-6D, HUI3SF12/HUI2, HUI3SF12/HUI3, HUI3SF12/HUI2SF36, and HUI3SF12/QWBSF36.

SF-6D/HUI2SF36 had the highest level of agreement (ICC = 0.66).
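
For readers who wish to reproduce this kind of pair-wise agreement analysis on their own data, a minimal sketch follows. The paper cites Shrout and Fleiss (1979) but does not state which ICC form was used; the two-way random effects, absolute agreement, single-measure form, ICC(2,1), is assumed here purely for illustration. The function names are hypothetical, and the cut-offs in `classify_agreement` are the ones stated in the Methods.

```python
from itertools import combinations
from typing import Sequence

# The seven utility scores compared in this paper; 7 scores give 21 pairs.
MEASURES = ["SG", "HUI2", "HUI3", "SF-6D", "QWBSF36", "HUI2SF36", "HUI3SF12"]
PAIRS = list(combinations(MEASURES, 2))


def icc_2_1(data: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) of Shrout and Fleiss (1979); the choice of form is an assumption.

    `data` has one row per patient and one column per measure
    (two columns for a pair-wise comparison).
    """
    n, k = len(data), len(data[0])
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    bms = ss_rows / (n - 1)                                     # between patients
    jms = ss_cols / (k - 1)                                     # between measures
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    denominator = bms + (k - 1) * ems + k * (jms - ems) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when there is no variation in the data")
    return (bms - ems) / denominator


def classify_agreement(icc: float) -> str:
    """Apply the interpretation thresholds stated in the Methods."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "moderate to good"
    return "excellent"


print(len(PAIRS))                # number of pair-wise comparisons
print(classify_agreement(0.66))  # baseline ICC reported for SF-6D/HUI2SF36
```

Because this form penalizes systematic differences in level between two scores as well as differences in ranking, two measures can be highly correlated and still show poor agreement.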

Results at the pre-surgery assessment are similar. Again SF-6D/HUI2SF36 has the highest level of agreement, almost excellent (ICC = 0.74). Agreement is poor for 13 pairs (62%) and moderate to good for 8 pairs (38%). At post-surgery there are 10 pairs (48%) with poor agreement and 10 pairs (48%) with moderate to good agreement. Agreement for HUI2/HUI3 was excellent (ICC = 0.82), although it was only moderate to good at baseline and pre-surgery.

Differences between the post- and pre-surgery scores for each measure are reported in Table 4. The largest differences are observed for HUI3 and HUI2; intermediate differences are observed for the SG, HUI3SF12, and HUI2SF36; the smallest differences are observed for SF-6D and QWBSF36.

Area under the Curve

Complete longitudinal data were available for 51 patients. Of those patients, 20 had 2 assessments and 31 had 3 or more assessments (range 3 to 8). In this latter group of 31 patients, there were 16, 5, 6, 2, and 2 patients with 3, 4, 5, 6, and 8 assessments respectively. Estimates of the AUC with and without normalization to 12 months are provided in Table 5. HUI3, HUI3SF12, and QWBSF36 had the lowest mean AUC; estimates of the AUC are higher for HUI2, HUI2SF36, SF-6D, and SG.

DISCUSSION

Mean utility scores at the baseline and pre-surgery assessments are similar across most of the measures. However, mean HUI3, HUI3SF12, and QWBSF36 scores are systematically lower. The tendency for mean HUI3 scores to be lower than HUI2 and/or SG scores has been noted in previous studies (Blanchard et al. QLR 2004; Feeny et al. 2003; Feeny, Furlong, Saigal et al. 2004; Neumann et al. 2000).

Although the mean scores are fairly similar, the distribution of scores differs substantially among the measures. The range of scores is much wider for HUI2, HUI3, HUI3SF12, and the SG. Utility scores derived from the SF-36 (SF-6D, QWBSF36, and HUI2SF36) have minimums that are much higher than the minimums for the SG and HUI. Furthermore, SF-6D, QWBSF36, and HUI2SF36 scores have much more limited ranges. At baseline HUI2, the SG, and HUI3SF12 generate minimum scores that are in the “vicinity” of dead (0.00); the minimum HUI3 score is in fact negative, a state worse than dead. In contrast, the minimums for SF-6D, QWBSF36, and HUI2SF36 are in the vicinity of 0.4 to 0.5. The patterns at the pre-surgery assessment are similar. The standard deviations of the SF-36-based scores are also much lower. These results may, in part, be due to well-known floor effects associated with the SF-36 (Hollingworth et al. 2002; Longworth and Bryan 2003; Szende et al. 2004; see also Lee et al. 2003).

Agreement between SG scores and HUI scores, which are based directly on a multi-attribute system, is low. That mean scores approximately agree while agreement at the individual level is poor has been observed in other studies (Feeny, Furlong, Saigal et al. 2004). SG scores embody heterogeneity both in health status and in the valuation of that health status. In contrast, scores from multi-attribute systems reflect the heterogeneity in health status but, because they use community preferences, suppress individual-level heterogeneity in preferences. Agreement between SG and SF-6D is also low.
Agreement is higher at the post-surgery assessment than at baseline and pre-surgery. This may be in part a result of floor effects with SF-36 that are more evident at baseline and pre-surgery but less empirically relevant after surgery.


The difference between post- and pre-surgery scores varied substantially by measure. The estimate of change is highest for HUI3 and HUI2, intermediate for the SG and HUI3SF12, and lowest for HUI2SF36, SF-6D, and QWBSF36. In comparing HUI2SF36, SF-6D, and QWBSF36 scores pre- and post-lung transplant, Lobo et al. (2004) also found considerable variability in the estimates among measures.

HUI2, SF-6D, SG, and HUI2SF36 provide similar estimates of the AUC. The estimates for HUI3, HUI3SF12, and QWBSF36 are substantially lower. In the case of HUI3, the lower AUC estimate reflects the observation that HUI3 scores are systematically lower than HUI2 and most other scores. With respect to the QWBSF36, the lower scores may be due to the use of VAS scores as the basis for the scoring system for the QWB. Typically VAS scores are lower than TTO or SG scores. The lower AUC using QWBSF36 may also reflect floor effects with SF-36.

CONCLUSIONS

The choice of approach for obtaining utility scores matters a great deal with respect to the estimate of the gain in HRQL between the pre-surgery and post-surgery assessments. The estimates of gain based on most of the SF-36-based measures (SF-6D, QWBSF36, and HUI2SF36) were much lower than the gain estimated by the SG. Estimates of the gain according to directly obtained HUI scores were higher still. Clearly, cost-utility estimates of quality-adjusted life years gained would be affected by the choice of utility score.

The comparisons of the SF-36 derived scores, HUI scores, and SG are confounded by a number of factors, including the nature of the underlying scoring system used. SF-6D and HUI2SF36 scores are based on the SG; QWBSF36 scores are based on the VAS. These differences are a potential source of the differences in the estimated gain in HRQL associated with surgery. However, a probably more important source of the difference is floor effects associated with the SF-36. As several other investigators have noted (Hollingworth et al. 2002; Longworth and Bryan 2003; Lee et al. 2003; Szende et al. 2004; Brazier et al. 2004), SF-36 based utility scores may not perform well in contexts in which patients have moderate and/or severe problems. Evidence presented here corroborates that observation.
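
The effect on cost-utility estimates can be illustrated with a short calculation. The mean gains below are the mean differences reported in Table 4. The ten-year duration of benefit, the absence of discounting, and the incremental cost of 10,000 are hypothetical assumptions chosen only to show how the choice of measure propagates into a cost per QALY; they are not results of this study.

```python
# Mean post-surgery minus pre-surgery differences, from Table 4.
mean_gain = {
    "HUI3": 0.23,
    "HUI2": 0.22,
    "SG": 0.16,
    "HUI3SF12": 0.16,
    "HUI2SF36": 0.12,
    "SF-6D": 0.10,
    "QWBSF36": 0.07,
}

YEARS_OF_BENEFIT = 10.0     # hypothetical duration, undiscounted
INCREMENTAL_COST = 10000.0  # hypothetical cost of surgery, arbitrary currency

# QALYs gained = utility gain x years over which the gain is sustained.
qalys = {m: gain * YEARS_OF_BENEFIT for m, gain in mean_gain.items()}
cost_per_qaly = {m: INCREMENTAL_COST / q for m, q in qalys.items()}

for measure in mean_gain:
    print(f"{measure:9s} QALYs gained {qalys[measure]:.1f}   "
          f"cost per QALY {cost_per_qaly[measure]:,.0f}")
```

Under these assumptions the cost per QALY based on QWBSF36 is more than three times the figure based on HUI3, although both describe the same patients and the same surgery.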

As has been previously noted (Feeny et al. 2003 IJTAHC; Feeny, Furlong, Saigal et al. 2004; Feeny, Wu, and Eng 2004), mean utility scores derived from various multi-attribute systems and mean SG scores are similar, yet agreement at the individual level is poor. This result is not surprising. The scoring systems for multi-attribute measures are based on the preferences of a “person-mean,” an estimate of the central tendency of preference scores. Of course, there is considerable heterogeneity in preference scores. In a sense, on average the multi-attribute scoring systems come close to “getting it right,” which is why mean scores are similar across measures. At the individual level, however, the multi-attribute scoring functions suppress the heterogeneity in preferences, and this leads to poor agreement. Scores from multi-attribute systems are useful for comparing groups but are not suitable for individual-level analyses. Direct utility scores such as the SG are useful both for individual-level analyses and decisions and for making comparisons among groups.

Additional studies comparing SF-36 based, HUI, and direct utility scores in other clinical contexts appear to be warranted.
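
The “person-mean” argument can be illustrated with a toy simulation. Everything in it is hypothetical: the number of patients, the range of community values, and the size of the individual preference deviations are arbitrary choices, and the data are simulated rather than drawn from the study. Each simulated patient gets a community-based score (the mean value of his or her health state) and a direct score (that value plus the patient's own deviation in preferences).

```python
import random

rng = random.Random(2004)  # fixed seed so the sketch is repeatable

N_PATIENTS = 2000
community_scores = []  # multi-attribute style: community mean value of the state
direct_scores = []     # SG style: the patient's own valuation of the same state

for _ in range(N_PATIENTS):
    state_value = rng.uniform(0.3, 0.7)      # hypothetical community value
    own_deviation = rng.uniform(-0.3, 0.3)   # hypothetical preference heterogeneity
    community_scores.append(state_value)
    direct_scores.append(state_value + own_deviation)

mean_community = sum(community_scores) / N_PATIENTS
mean_direct = sum(direct_scores) / N_PATIENTS
mean_abs_gap = sum(
    abs(d - c) for d, c in zip(direct_scores, community_scores)
) / N_PATIENTS

print(f"group means: {mean_community:.2f} vs {mean_direct:.2f}")
print(f"mean absolute individual gap: {mean_abs_gap:.2f}")
```

Because the individual deviations average out to roughly zero, the two group means nearly coincide, while the typical gap between the two scores for a single patient remains large. This mirrors the pattern of similar means and poor individual-level agreement described above.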

REFERENCES

1. Blanchard, Chris, David Feeny, Jeffrey L. Mahon, Robert Bourne, Cecil Rorabeck, Larry Stitt, and Susan Webster-Bogaert, “Is the Health Utilities Index Valid in Total Hip Arthroplasty Patients?” Quality of Life Research, Vol. 13, No. 2, March, 2004, pp 339-348.
2. Blanchard, Chris, David Feeny, Jeffrey L. Mahon, Robert Bourne, Cecil Rorabeck, Larry Stitt, and Susan Webster-Bogaert, “Is the Health Utilities Index Responsive in Total Hip Arthroplasty Patients?” Journal of Clinical Epidemiology, Vol. 56, No. 11, November, 2003, pp 1046-1054.
3. Bosch, Johanna L., Elkan F. Halpern, and G. Scott Gazelle, “Comparison of Preference-Based Utilities of the Short-Form 36 Health Survey and Health Utilities Index Before and After Treatment of Patients with Intermittent Claudication.” Medical Decision Making, Vol. 22, No. 5, September-October, 2002, pp 403-409.
4. Brazier, John, Jennifer Roberts, and Mark Deverill, “The Estimation of a Preference-Based Measure of Health Status from the SF-36.” Journal of Health Economics, Vol. 21, No. 2, March, 2002, pp 271-292.
5. Brazier, John, Jennifer Roberts, Aki Tsuchiya, and Jan Busschbach, “A Comparison of the EQ-5D and SF-6D Across Seven Patient Groups.” Health Economics, 2004, in press.
6. Deyo, Richard A., Paul Diehr, and Donald L. Patrick, “Reproducibility and Responsiveness of Health Status Measures: Statistics and Strategies for Evaluation.” Controlled Clinical Trials, Vol. 12, No. 4, Supplement, August, 1991, pp 142S-158S.
7. Drummond, Michael F., Bernie O'Brien, Greg Stoddart, and George W. Torrance, Methods for the Economic Evaluation of Health Care Programmes. Second Edition. Oxford: Oxford University Press, 1997.
8. Feeny, David, William Furlong, Ronald D. Barr, George W. Torrance, Peter Rosenbaum, and Sheila Weitzman, “A Comprehensive Multiattribute System for Classifying the Health Status of Survivors of Childhood Cancer.” Journal of Clinical Oncology, Vol. 10, No. 6, June, 1992, pp 923-928.
9. Feeny, David, George W. Torrance, and William Furlong, “Health Utilities Index,” In Bert Spilker ed., Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia, PA: Lippincott-Raven, 1996, pp 239-251.
10. Feeny, David, William Furlong, George W. Torrance, Charles H. Goldsmith, Zenglong Zhu, Sonja DePauw, Margaret Denton, and Michael Boyle, “Multi-Attribute and Single-Attribute Utility Functions for the Health Utilities Index Mark 3 System.” Medical Care, Vol. 40, No. 2, February, 2002, pp 113-128.
11. Feeny, David, Chris Blanchard, Jeffrey L. Mahon, Robert Bourne, Cecil Rorabeck, Larry Stitt, and Susan Webster-Bogaert, “Comparing Community-Preference Based and Direct Standard Gamble Utility Scores: Evidence from Elective Total Hip Arthroplasty.” International Journal of Technology Assessment in Health Care, Vol. 19, No. 2, Spring, 2003, pp 362-372.
12. Feeny, David, Chris M. Blanchard, Jeffrey L. Mahon, Robert Bourne, Cecil Rorabeck, Larry Stitt, and Susan Webster-Bogaert, “The Stability of Utility Scores: Test-Retest Reliability and the Interpretation of Utility Scores in Elective Total Hip Arthroplasty.” Quality of Life Research, Vol. 13, No. 1, February, 2004, pp 15-22.
13. Feeny, David, William Furlong, Saroj Saigal, and Jian Sun, “Comparing Directly Measured Standard Gamble Scores to HUI2 and HUI3 Utility Scores: Group and Individual-Level Comparisons.” Social Science & Medicine, Vol. 58, No. 4, February, 2004, pp 799-809.
14. Feeny, David, Lieling Wu, and Ken Eng, “Comparing Short Form 6D, Standard Gamble, and Health Utilities Index Mark 2 and Mark 3 Utility Scores: Results from Total Hip Arthroplasty Patients.” Quality of Life Research, 2004, in press.
15. Franks, Peter, Erica I. Lubetkin, Marthe R. Gold, and Daniel J. Tancredi, “Mapping the SF-12 to Preference-Based Instruments: Convergent Validity in a Low-Income Minority Population.” Medical Care, Vol. 41, No. 11, November, 2003, pp 1277-1283.
16. Franks, Peter, Erica I. Lubetkin, Marthe R. Gold, Daniel J. Tancredi, and Haomiao Jia, “Mapping the SF-12 to the EuroQol EQ-5D Index in a National U.S. Sample.” Medical Decision Making, Vol. 24, No. 3, May-June, 2004, pp 247-254.
17. Fryback, Dennis G., William F. Lawrence, Patricia A. Martin, Ronald Klein, and Barbara Klein, “Predicting Quality of Well-Being Scores from the SF-36: Results from the Beaver Dam Health Outcomes Study.” Medical Decision Making, Vol. 17, No. 1, January-March, 1997, pp 1-9.
18. Furlong, William, David Feeny, George W. Torrance, Ronald Barr, and John Horsman, “Guide to Design and Development of Health-State Utility Instrumentation.” McMaster University Centre for Health Economics and Policy Analysis Working Paper No 90-9, June 1990.
19. Furlong, William J., David H. Feeny, George W. Torrance, and Ronald D. Barr, “The Health Utilities Index (HUI) System for Assessing Health-Related Quality of Life in Clinical Studies.” Annals of Medicine, Vol. 33, No. 5, July, 2001, pp 375-384.
20. Hollingworth, William, Richard A. Deyo, Sean D. Sullivan, Scott S. Emerson, Darryl T. Gray, and Jeffrey G. Jarvik, “The Practicality and Validity of Directly Elicited and SF-36 Derived Health State Preferences in Patients with Low Back Pain.” Health Economics, Vol. 11, No. 1, January, 2002, pp 71-85.
21. Horsman, John, William Furlong, David Feeny, and George Torrance, “The Health Utilities Index (HUI®): Concepts, Measurement Properties and Applications.” Health and Quality of Life Outcomes (electronic journal), Vol. 1: 54, October 16, 2003, http://www.hqlo.com/content/1/1/54
22. Kaplan, Robert M., and John P. Anderson, “The General Health Policy Model: An Integrated Approach,” Chapter 32 in Bert Spilker, ed. Quality of Life and Pharmacoeconomics in Clinical Trials. Second Edition. Philadelphia: Lippincott-Raven Press, 1996, pp 309-322.
23. Laupacis, Andreas, Robert Bourne, Cecil Rorabeck, et al., “Effect of Elective Total Hip Replacement Upon Health-Related Quality of Life.” Journal of Bone and Joint Surgery, Vol. 75-A, No. 11, November, 1993, pp 1619-1626.
24. Lee, Todd A., William Hollingworth, and Sean D. Sullivan, “Comparison of Directly Elicited Preferences to Preferences Derived from the SF-36 in Adults with Asthma.” Medical Decision Making, Vol. 23, No. 4, July-August, 2003, pp 323-334.
25. Lobo, Francis S., Cynthia R. Gross, and Barbara J. Matthees, “Estimation and Comparison of Derived Preferences Scores from the SF-36 in Lung Transplant Patients.” Quality of Life Research, Vol. 13, No. 2, March, 2004, pp 377-388.
26. Longworth, Louise, and Stirling Bryan, “An Empirical Comparison of EQ-5D and SF-6D in Liver Transplant Patients.” Health Economics, Vol. 12, No. 12, December, 2003, pp 1061-1067.
27. Mahon, Jeffrey L., Robert Bourne, Cecil Rorabeck, David Feeny, Larry Stitt, and Susan Webster-Bogaert, “The Effect of Waiting for Elective Total Hip Arthroplasty on Health-Related Quality of Life,” Canadian Medical Association Journal, Vol. 167, No. 10, November 12, 2002, pp 1115-1121.
28. McDowell, Ian, and Claire Newell, Measuring Health: A Guide to Rating Scales and Questionnaires. Second Edition. New York: Oxford University Press, 1996.
29. Neumann, Peter J., Eileen A. Sandberg, Sally S. Araki, Karen M. Kuntz, David Feeny, and Milton C. Weinstein, “A Comparison of HUI2 and HUI3 Utility Scores in Alzheimer’s Disease.” Medical Decision Making, Vol. 20, No. 4, October-December, 2000, pp 413-422.
30. Nichol, Michael B., Nishan Sengupta, and Denise R. Globe, “Evaluating Quality-Adjusted Life Years: Estimation of the Health Utility Index (HUI) from the SF-36.” Medical Decision Making, Vol. 21, No. 2, March-April, 2001, pp 105-112.
31. Nunnally, J. C., Psychometric Theory, Second Edition. New York: McGraw-Hill, 1978.
32. O’Brien, Bernie J., Marian Spath, Gordon Blackhouse, J. L. Severens, Paul Dorian, and John Brazier, “A View from the Bridge: Agreement Between the SF-6D Utility Algorithm and the Health Utilities Index.” Health Economics, Vol. 12, No. 11, November, 2003, pp 975-981.
33. Shrout, Patrick E., and Joseph L. Fleiss, “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin, Vol. 86, No. 2, 1979, pp 420-428.
34. Szende, Agota, Klas Svensson, Elisabeth Stahl, Agnes Meszaros, and Gyula Y. Berta, “Psychometric and Utility-Based Measures of Health Status of Asthmatic Patients with Different Levels of Disease Control Level.” Pharmacoeconomics, Vol. 22, No. 8, 2004, pp 537-547.
35. Torrance, George W., David H. Feeny, William J. Furlong, Ronald D. Barr, Yueming Zhang, and Qinan Wang, “Multi-Attribute Preference Functions for A Comprehensive Health Status Classification System: Health Utilities Index Mark 2.” Medical Care, Vol. 34, No. 7, July 1996, pp 702-722.
36. Torrance, George W., William Furlong, and David Feeny, “Health Utility Estimation.” Expert Review of Pharmacoeconomics and Outcomes Research, Vol. 2, No. 2, 2002, pp 99-108.
37. Ware, John E. and Cathy Donald Sherbourne, “The MOS 36-Item Short Form Health Survey (SF-36).” Medical Care, Vol. 30, No. 6, June, 1992, pp 473-483.
38. Ware, John E., K. K. Snow, Mark Kosinski, and B. Gandek, SF-36 Health Survey manual and interpretation guide. Boston: New England Medical Center, The Health Institute, 1993.
39. Ware, John E., Jr., “The SF-36 Health Survey,” In Bert Spilker ed., Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia, PA: Lippincott-Raven, 1996, pp 337-345.

Table 1. Demographic and Clinical Characteristics of Patients at Baseline, n = 86

Sex, % male                                              54
Age, mean in years                                       69
Co-Morbidities, % with problem or disease
  Cardiovascular                                         7
  Cancer                                                 13
  Hypertension                                           36
  Coronary artery disease                                15
  Diabetes mellitus                                      5
Mean Duration of Symptoms, years                         6.87
Employment Status at Time of Enrollment in Study, %
  Working Full Time                                      12
  Retired                                                81
  Other                                                  7

Table 2: Utility Scores

Panel A. Baseline (n=86)

Measure            Minimum   Maximum   Median   Mean    Std Dev
SF-6D              0.41      0.87      0.59     0.61    0.1
HUI2               0.16      0.94      0.65     0.62    0.19
HUI3               -0.08     0.97      0.53     0.53    0.22
Standard Gamble    0.05      0.95      0.75     0.62    0.32
HUI2SF36           0.48      0.92      0.68     0.68    0.09
QWBSF36            0.49      0.76      0.55     0.57    0.05
HUI3SF12           0.16      0.91      0.54     0.54    0.14

Note: HUI2SF36 n=80, QWBSF36 n=83, HUI3SF12 n=78

Panel B. Pre-Surgery (n=63)

Measure            Minimum   Maximum   Median   Mean    Std Dev
SF-6D              0.37      0.84      0.58     0.59    0.1
HUI2               0.11      0.88      0.58     0.55    0.2
HUI3               -0.14     0.84      0.52     0.49    0.21
Standard Gamble    0.05      0.95      0.75     0.61    0.33
HUI2SF36           0.32      0.92      0.69     0.66    0.12
QWBSF36            0.49      0.72      0.55     0.56    0.04
HUI3SF12           -0.33     0.87      0.49     0.47    0.21

Note: HUI2SF36 n=58, QWBSF36 n=62, HUI3SF12 n=57

Panel C. Post-Surgery (n=63)

Measure            Minimum   Maximum   Median   Mean    Std Dev
SF-6D              0.42      0.94      0.68     0.69    0.11
HUI2               0.21      0.94      0.75     0.76    0.15
HUI3               0.26      0.97      0.74     0.72    0.18
Standard Gamble    0.05      1         0.85     0.76    0.25
HUI2SF36           0.35      0.94      0.79     0.78    0.11
QWBSF36            0.5       0.77      0.61     0.63    0.07
HUI3SF12           -0.03     0.93      0.63     0.62    0.18

Note: HUI2SF36 n=59, QWBSF36 n=59, HUI3SF12 n=57

Table 3. Agreement: Intra-Class Correlation Coefficients (ICC) and 95% Confidence Intervals Between Utility Scores

Panel A. Baseline (n=86)

Pair of Measures      ICC     95% Confidence Interval
SF6D-HUI2SF36         0.66    0.00 to 0.82
SF6D-QWBSF36          0.5     0.00 to 0.67
HUI2-HUI2SF36         0.43    0.01 to 0.57
HUI2-QWBSF36          0.27    0.00 to 0.43
HUI3-HUI2SF36         0.25    0.00 to 0.46
HUI3-QWBSF36          0.2     0.00 to 0.37
QWBSF36-HUI2SF36      0.28    0.00 to 0.56
SG-SF6D               0.13    0.00 to 0.30
SF6D-HUI2             0.47    0.2 to 0.59
SF6D-HUI3             0.28    0.00 to 0.46
SG-HUI2               0.11    0.00 to 0.28
SG-HUI3               0.12    0.00 to 0.30
HUI2-HUI3             0.63    0.01 to 0.77
SG-HUI2SF36           0.2     0.00 to 0.37
SG-QWBSF36            0.09    0.00 to 0.27
HUI3SF12-SG           0.25    0.00 to 0.41
HUI3SF12-SF6D         0.59    0.00 to 0.75
HUI3SF12-HUI2         0.44    0.00 to 0.60
HUI3SF12-HUI3         0.47    0.03 to 0.59
HUI3SF12-HUI2SF36     0.49    0.00 to 0.77
HUI3SF12-QWBSF36      0.46    0.01 to 0.59

Panel B. Pre-Surgery (n=63)

Pair of Measures      ICC     95% Confidence Interval
SF6D-HUI2SF36         0.74    0.00 to 0.88
SF6D-QWBSF36          0.34    0.00 to 0.52
HUI2-HUI2SF36         0.45    0.00 to 0.65
HUI2-QWBSF36          0.13    0.00 to 0.33
HUI3-HUI2SF36         0.34    0.00 to 0.59
HUI3-QWBSF36          0.15    0.00 to 0.35
QWBSF36-HUI2SF36      0.17    0.00 to 0.41
SG-SF6D               0.04    0.00 to 0.25
SF6D-HUI2             0.49    0.01 to 0.63
SF6D-HUI3             0.36    0.00 to 0.55
SG-HUI2               0.04    0.00 to 0.25
SG-HUI3               0.12    0.00 to 0.32
HUI2-HUI3             0.68    0.02 to 0.78
SG-HUI2SF36           0.13    0.00 to 0.33
SG-QWBSF36            0.02    0.00 to 0.23
HUI3SF12-SG           0.18    0.00 to 0.37
HUI3SF12-SF6D         0.49    0.00 to 0.69
HUI3SF12-HUI2         0.53    0.01 to 0.68
HUI3SF12-HUI3         0.62    0.33 to 0.73
HUI3SF12-HUI2SF36     0.49    0.00 to 0.76
HUI3SF12-QWBSF36      0.12    0.00 to 0.33

Panel C. Post-Surgery (n=63)

Pair of Measures      ICC     95% Confidence Interval
SF6D-HUI2SF36         0.69    0.00 to 0.85
SF6D-QWBSF36          0.47    0.00 to 0.68
HUI2-HUI2SF36         0.68    0.13 to 0.77
HUI2-QWBSF36          0.25    0.00 to 0.50
HUI3-HUI2SF36         0.6     0.01 to 0.73
HUI3-QWBSF36          0.32    0.00 to 0.53
QWBSF36-HUI2SF36      0.3     0.00 to 0.61
SG-SF6D               0.16    0.00 to 0.35
SF6D-HUI2             0.57    0.00 to 0.73
SF6D-HUI3             0.49    0.02 to 0.63
SG-HUI2               0.31    0.09 to 0.48
SG-HUI3               0.2     0.00 to 0.39
HUI2-HUI3             0.82    0.03 to 0.89
SG-HUI2SF36           0.21    0.00 to 0.40
SG-QWBSF36            0.07    0.00 to 0.28
HUI3SF12-SG           0.19    0.00 to 0.39
HUI3SF12-SF6D         0.67    0.01 to 0.81
HUI3SF12-HUI2         0.51    0.00 to 0.72
HUI3SF12-HUI3         0.58    0.00 to 0.74
HUI3SF12-HUI2SF36     0.57    0.00 to 0.82
HUI3SF12-QWBSF36      0.54    0.34 to 0.67

Table 4. Differences (Post-Surgery minus Pre-Surgery) in Utility Scores

Measure      Minimum   Maximum   Median   Mean Difference   Std. Dev.
SF-6D        0.05      0.16      0.1      0.10              0.02
SG           0         0.5       0.1      0.16              0.14
HUI2         0.06      0.37      0.2      0.22              0.07
HUI3         0.13      0.4       0.22     0.23              0.04
HUI2SF36     -0.23     0.46      0.14     0.12              0.11
QWBSF36      -0.13     0.18      0.06     0.07              0.07
HUI3SF12     -0.18     0.54      0.17     0.16              0.16

Note: Results for SF-6D, SG, HUI2, and HUI3 were previously reported in Feeny, Wu, and Eng 2004.

Table 5. Area under the Curve Estimates

Panel A (number of months normalized to 12)

             HUI2     HUI3     SF-6D    SG       HUI2SF36   QWBSF36   HUI3SF12
Sum          25.47    21.56    24.67    26.23    25.88      22.07     21.46
Mean         0.50     0.42     0.48     0.51     0.51       0.43      0.42
Median       0.41     0.36     0.39     0.45     0.40       0.34      0.35
Std. Dev.    0.31     0.28     0.26     0.36     0.29       0.24      0.27
Minimum      0.11     0.08     0.17     0.04     0.21       0.16      0.10
Maximum      1.85     1.65     1.54     2.05     1.81       1.51      1.67

Panel B (raw results without normalization)

             HUI2     HUI3     SF-6D    SG       HUI2SF36   QWBSF36   HUI3SF12
Sum          34.76    29.81    33.17    33.45    35.91      30.00     29.13
Mean         0.68     0.59     0.65     0.66     0.70       0.59      0.57
Median       0.70     0.61     0.65     0.75     0.70       0.59      0.60
Std. Dev.    0.15     0.18     0.09     0.26     0.97       0.41      0.15
Minimum      0.31     0.14     0.40     0.05     0.41       0.51      0.08
Maximum      0.88     0.91     0.83     0.95     0.92       0.68      0.83
