Quality of Life Research 12: 903–912, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands.


The responsiveness of headache impact scales scored using ‘classical’ and ‘modern’ psychometric methods: A re-analysis of three clinical trials

M. Kosinski1, J.B. Bjorner1,2, J.E. Ware Jr1,3, A. Batenhorst4 & R.K. Cady5

1QualityMetric Incorporated, Lincoln, RI, USA (E-mail: [email protected]); 2National Institute of Occupational Health, Copenhagen, Denmark; 3Health Assessment Lab, Boston, MA, USA; 4GlaxoSmithKline, Research Triangle Park, NC, USA; 5Headache Care Center, Springfield, MO, USA

Abstract
Background: While item response theory (IRT) offers many theoretical advantages over classical test theory in the construction and scoring of patient-based measures of health, few studies have compared scales constructed with the two methodologies head to head.
Objective: To compare the responsiveness to treatment of migraine-specific scales scored using summated rating scale methods vs. IRT methods.
Methods: The data came from three clinical studies of migraine treatment that used the Migraine Specific Quality of Life Questionnaire (MSQ). Five methods of quantifying responsiveness were used to evaluate and compare changes from pre- to post-treatment in MSQ scales scored using Likert and IRT scaling methods.
Results: Changes in all MSQ scale scores from pre- to post-treatment were highly significant in all three studies. A single index scored from the MSQ using IRT methods was more responsive than any one of the MSQ subscales across the five methods used to quantify responsiveness. In 13 of the 15 tests conducted (5 responsiveness methods × 3 studies), the single index scored from the MSQ using IRT methods was the most responsive measure.
Conclusions: IRT methods increased the responsiveness of the MSQ to the treatment of migraine. The results agree with the psychometric evidence suggesting that it is feasible to score a single index from the MSQ using IRT methods. This approach warrants further testing with other measures of migraine impact.
Key words: Headache impact scales, Psychometric methods, Responsiveness
Abbreviations: ANOVA – analysis of variance; CI – confidence interval; GSRM – Guyatt’s standardized response mean; IRT – item response theory; MCO – managed-care organization; MSQ – Migraine Specific Quality of Life Questionnaire; NSHI – National Survey of Headache Impact; RV – relative validity; SEM – standard error of measurement; SES – standardized effect size; SRM – standardized response mean

Introduction
The detection and treatment of migraine are often complicated by the heterogeneity of symptoms that characterize migraine attacks and by the lack of clear-cut clinical or biological markers of the severity of the condition [1, 2]. The clinical management of migraine is further hampered by the considerable communication gap that remains between patients and physicians, particularly over the extent to which migraine affects the patient’s life [3]. Standardized questionnaires [4–9] that measure the impact of migraine on the patient’s life have proven very useful in quantifying the burden of migraine and headache and in evaluating the effectiveness of treatment in reducing that burden [10–20], and they could potentially improve communication between doctors and patients.

Recent years have seen the emergence of modern psychometric methods, such as IRT, for analyzing and scoring standardized health status questionnaires [21–28]. These methods, in combination with computer adaptive assessments, offer a promising solution to the tradeoff between respondent burden and measurement precision in patient-based measures of health [27]. A necessary step towards computer adaptive assessments is testing the feasibility of applying IRT methods to analyzing and scoring standardized health questionnaires. A companion paper [29] presented important psychometric evidence of the feasibility of using IRT methods to analyze and score the MSQ Version 1.0 [6]. The aim of this study was to extend that psychometric evidence with a head-to-head comparison, in three clinical trials, of the responsiveness to the treatment of migraine of the MSQ scored using IRT methods and using the developer’s methods.

Methods
Samples
The data come from three open-label studies of oral or subcutaneous sumatriptan. Study 1 was a single-center study designed to assess the impact of sumatriptan on the cost of migraine headache in a managed care setting [14]. Men and women aged 18 years or older who had been members of a group-model HMO for at least 1 year were eligible for the study. Patients were eligible if they had at least a 1-year history of migraine diagnosed according to the International Headache Society (IHS) criteria [30], with a recent history of 2–6 moderate or severe migraine attacks per month. Eligible patients also had at least two visits during the past year for the treatment or evaluation of migraine. Patients could treat an unlimited number of migraine attacks of at least mild intensity for up to 12 months with subcutaneous sumatriptan (6 mg), which they self-administered at home. Patients were instructed to use subcutaneous sumatriptan as the first treatment for migraine, and could administer a second 6 mg dose if the migraine returned within 24 hours. Sumatriptan injections had to be separated by at least 1 hour, and no more than two injections could be used within 24 hours. Rescue medication was allowed only if the participant remained symptomatic after treatment with the sumatriptan injection.

Study 2 was a multi-center study designed to examine the effects of oral sumatriptan on migraine-related costs from the perspectives of the patient, employer, and healthcare provider [13]. Men and women aged 18 years or older were eligible if they were licensed nurses involved in patient care activities, had at least a 1-year history of migraine with or without aura diagnosed according to IHS criteria [30] with a recent history of 2–6 moderate or severe migraine attacks per month, and if migraine had affected their ability to perform usual activities at work in the previous 2 months. Eligible patients had to be using one or more prescription medications other than sumatriptan to treat migraine. The study protocol consisted of two phases. During the first phase, patients used their usual (non-sumatriptan) therapy to treat an unlimited number of migraine attacks for 2 months. Patients who successfully completed the usual-therapy phase were enrolled in the second phase, in which they could treat an unlimited number of migraine attacks for up to 6 months with 100 mg of oral sumatriptan. Patients were instructed to use oral sumatriptan as the first treatment of migraine pain of at least mild intensity. If migraine pain returned within 24 hours after taking sumatriptan, patients could repeat the 100 mg dose twice within 24 hours, for a total daily dose of up to 300 mg. Patients were instructed to avoid rescue medication unless they continued to experience pain after taking sumatriptan. The data used in this study come from the second (sumatriptan) phase.

Study 3 was a prospective, observational study conducted in a mixed-model managed-care organization (MCO) [18].
Patients who received new prescriptions for sumatriptan were identified through the MCO’s pharmacy prescription authorization system and screened for eligibility by a study coordinator at the MCO. Men and women aged 18 and over were eligible for the study if they had a physician diagnosis of migraine, had not previously taken any sumatriptan to treat migraine during the study period, had been enrolled in the MCO for at least 6 months before their first sumatriptan prescription, and had been continuously enrolled in the MCO for at least 6 months after study enrollment. Patients meeting the eligibility requirements received prescriptions for sumatriptan from their participating MCO physicians, and the prescriptions were filled in accordance with the MCO’s prescribing policies. The timing of sumatriptan administration was not specified by the study protocol, and patients’ use of other medications for acute migraine during the study period was not restricted.

Measures
In all three studies, the MSQ Version 1.0 was used to assess the effectiveness of treatment in reducing the impact of migraine on functioning and well-being. The MSQ Version 1.0 consists of 16 questions measuring the aspects of functioning and well-being most affected by migraine [6]. Three scales are scored from the MSQ: (1) role function – restrictive (the degree to which performance of normal daily activities is restricted by migraine); (2) role function – preventative (the degree to which performance of normal daily activities is prevented by migraine); and (3) emotional function (the emotional impact of migraine). The MSQ was administered at baseline (pre-treatment) and 3 months after baseline (post-treatment) in each study.

For this study, the MSQ was scored in three ways. First, the three MSQ scales described above were scored according to the developer’s algorithms [6]: the item response values for each MSQ scale were summed, and the sums were transformed to a 0 (least favorable) to 100 (most favorable) scale (see Table 1 for the items in each scale). Second, a single IRT-based score was estimated from 15 of the 16 MSQ items using item parameters (slopes and thresholds) estimated from a generalized partial credit model [31], as detailed in the companion paper [29]. Item 11 was excluded from the IRT-based score because it did not fit the IRT model developed for the MSQ. Third, a total sum score was calculated by summing the response values of the same 15 items (excluding item 11) and transforming the sum to a 0 (least favorable) to 100 (most favorable) scale.

In standard applications of IRT, the scale of the latent trait (here, migraine-related functioning and well-being) is defined by setting the mean of the study sample’s distribution to 0 and its standard deviation to 1. In this study, we instead defined the latent migraine scale by the mean and standard deviation of the baseline sample combined across the three studies, and linearly transformed it to a mean of 50 and a standard deviation of 10. To facilitate comparisons across scales scored using the different methods, the 0–100 scores for each MSQ sum score scale were also linearly transformed to have a mean of 50 and a standard deviation of 10 in the combined baseline samples from the three studies. To remain consistent with the direction of scores prescribed by the developers of the MSQ, all scales were scored such that a higher score was more favorable.

Table 1. Abbreviated content of items in the MSQ

Scale / Item #   Abbreviated item content
Role function – restrictive
  1    …interfered with how well you get along with family, friends and others
  2    …interrupted your leisure time activities such as reading or exercising
  3    …difficulty in performing work or other daily activities
  4    …kept you from getting as much accomplished as you normally do at work/home
  5    …limited your ability to work or do other activities as carefully as you usually do
  12   …left you with limited energy levels
  13   …limited the number of days you have felt full of pep
Role function – preventative
  6    …had to cancel or delay work or social activities
  7    …needed help of other people in handling routine tasks
  8    …had to stop work or other activities
  9    …difficult for you to go to social events such as parties
  11   …able to return to your normal self as quickly as you expected
Emotional function
  10   …avoided social or family activities
  14   …felt fed up or frustrated
  15   …felt like you were a burden to others
  16   …been afraid of letting others down

Analyses
Since all participants in each trial were treated for their migraine attacks, the entire sample was expected to improve in migraine-related functioning and well-being. Accordingly, the criterion used to evaluate responsiveness was the change in MSQ scale scores between the pre- and post-treatment assessments. Change scores were calculated for each MSQ scale by subtracting the baseline (pre-treatment) score from the 3-month follow-up (post-treatment) score. Within each study, a one-way analysis of variance (ANOVA) was conducted to determine whether the change in each MSQ scale score differed significantly from 0. The sample size was held constant across MSQ scales within each study so as not to bias the comparison of the relative performance of the scales.

Five different methods of expressing responsiveness were used to evaluate the relative performance of each MSQ scale in responding to treatment: (1) the relative validity (RV) coefficient, (2) the standardized effect size (SES), (3) the standardized response mean (SRM), (4) a variant of the SRM suggested by Guyatt (GSRM), and (5) categories of change scores (better, same, or worse). Each method expresses a standardized ratio of signal (observed change) to noise (variability), which facilitates comparisons across scales.
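The three scoring routes described under Measures can be sketched in Python. This is a minimal illustration, not the developer’s or the companion paper’s code: the 1–6 item response range, the slopes and thresholds, the quadrature grid, and the use of expected a posteriori (EAP) estimation with a standard normal prior are all assumptions made for the example, because the item parameters and the estimation method are not reported in this section. The function names (`sum_to_0_100`, `gpcm_probs`, `eap_score`, `to_t_scale`) are hypothetical.

```python
import math


def sum_to_0_100(responses, item_min=1, item_max=6):
    """Sum item responses and rescale the sum linearly to 0-100.

    Assumes responses are already coded so that higher = more favorable,
    and that every item shares the same (assumed) 1..6 response range.
    """
    n = len(responses)
    lowest, highest = n * item_min, n * item_max
    return 100.0 * (sum(responses) - lowest) / (highest - lowest)


def gpcm_probs(theta, slope, thresholds):
    """Category probabilities P(X = k | theta), k = 0..m, under a
    generalized partial credit model (written without the 1.7 constant)."""
    # Category k uses the cumulative sum of slope * (theta - b_v) for v <= k;
    # category 0 has an empty sum (0).
    z = [0.0]
    for b in thresholds:
        z.append(z[-1] + slope * (theta - b))
    top = max(z)  # subtract the max before exponentiating, for stability
    ez = [math.exp(v - top) for v in z]
    total = sum(ez)
    return [v / total for v in ez]


def eap_score(responses, items, n_points=81, bound=4.0):
    """EAP estimate of theta and its posterior SD (a person-level SE).

    responses: category indices 0..m, one per item.
    items: (slope, thresholds) pairs, one per item (hypothetical values).
    Uses a standard normal prior on an evenly spaced grid over [-bound, bound].
    """
    grid = [-bound + 2 * bound * i / (n_points - 1) for i in range(n_points)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior density
        for x, (slope, thresholds) in zip(responses, items):
            w *= gpcm_probs(t, slope, thresholds)[x]  # multiply in likelihood
        weights.append(w)
    total = sum(weights)
    est = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - est) ** 2 * w for t, w in zip(grid, weights)) / total
    return est, math.sqrt(var)


def to_t_scale(score, ref_mean, ref_sd):
    """Linear transform to mean 50, SD 10 in a reference (baseline) sample."""
    return 50.0 + 10.0 * (score - ref_mean) / ref_sd
```

Passing the mean and SD of the combined baseline sample to `to_t_scale` puts both the 0–100 sum scores and the latent-trait estimates on the common 50/10 metric. The posterior SD returned by `eap_score` depends on the response pattern, which is one way the person-specific precision of an IRT score arises.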

Standardized effect size
The SES was calculated by dividing the mean observed change score of each MSQ scale by the standard deviation of the baseline score [34]. The SES provides a standardized estimate of the magnitude of the change in scores [35].

Standardized response mean
The SRM was calculated by dividing the mean observed change score of each MSQ scale by the standard deviation of the change score [36]. The SRM provides a more effective summary of the signal-to-noise ratio than the SES because it avoids the standard error of the mean in the denominator and is therefore less influenced by sample size [37].

Guyatt’s standardized response mean
Guyatt’s standardized response mean (GSRM) was calculated by dividing the mean observed change score of each MSQ scale by the standard deviation of the change scores observed among stable subjects [38]. To calculate the GSRM, we used the standard deviation of the change in each MSQ scale observed in a cohort of stable subjects participating in the National Survey of Headache Impact (NSHI) [27]. In that study, the MSQ was completed twice, 3 months apart, by a subset (n = 300) of the NSHI participants. At the 3-month follow-up, participants were asked: ‘Compared to 3 months ago, do your headaches bother you more or less now?’ The variance estimates used to compute the GSRM came from the stable cohort (n = 156) who responded ‘about the same’ to this question (SD for role restrictive = 7.82; SD for role preventative = 7.52; SD for role emotion = 7.98; SD for total sum score scale = 7.19; SD for total IRT scale = 7.07).
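The SES, SRM and GSRM share a numerator (the mean change) and differ only in their denominators. The sketch below assumes per-patient paired scores are available and uses the sample (n − 1) standard deviation, which the text does not specify; the function name `responsiveness` is hypothetical.

```python
from statistics import mean, stdev


def responsiveness(baseline, followup, sd_stable):
    """SES, SRM and GSRM for paired baseline/follow-up scores.

    change = follow-up - baseline (higher scores are more favorable).
    sd_stable: SD of change scores in an external stable cohort (GSRM).
    """
    if len(baseline) != len(followup):
        raise ValueError("baseline and followup must be paired")
    changes = [f - b for b, f in zip(baseline, followup)]
    mean_change = mean(changes)
    return {
        "SES": mean_change / stdev(baseline),  # SD of baseline scores
        "SRM": mean_change / stdev(changes),   # SD of change scores
        "GSRM": mean_change / sd_stable,       # SD of change, stable subjects
    }


# Summary-level check with the study 1 role restrictive values in Table 2
# (mean change 13.9, baseline SD 9.0) and the stable-cohort SD of 7.82.
print(round(13.9 / 9.0, 2))   # SES  -> 1.54
print(round(13.9 / 7.82, 2))  # GSRM -> 1.78
```

Both values match Table 3. Dividing the tabled change mean by the tabled change SD (13.9 / 10.9) gives 1.28 rather than the reported SRM of 1.27, a difference that is probably due to rounding of the published means and SDs.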

Relative validity
The RV coefficient expresses, in proportional terms, the empirical validity of the scale in question relative to the most valid scale in a specific test [32, 33]. RV coefficients were estimated for each scale by dividing the F-statistic of a given scale by the largest F-statistic observed among all scales in a given test. The F-statistics came from the one-way ANOVA conducted for each MSQ scale.

Categories of change scores

The percentage of patients categorized as ‘better’ was compared across MSQ scales. Patients were categorized as ‘better’ if the amount of change on a given scale was larger than the 95% confidence interval (CI) around an individual patient score on that scale. For example, if the 95% CI extends 5 points on either side of a score, the change necessary to be categorized as ‘better’ is +5 points or greater. The calculation of the 95% CI around an individual patient score differed between scales scored using sum score and IRT methods. For scales scored using the sum score method, the 95% CI was calculated by multiplying the standard error of measurement (SEM) by 1.96. The SEM was calculated by multiplying the baseline standard deviation of each scale by the square root of one minus the reliability (internal consistency) [39]. This yielded a 95% CI that was the same for all patients. For scales scored using IRT methods, the 95% CI was calculated by multiplying the estimated person-level standard error by 1.96. This yielded a 95% CI that differed across patients, since measurement precision is assumed to vary across the score range under the IRT model.

Results in the companion paper [29] showed that the unidimensional IRT model was not satisfactory for a minority of patients. These patients had a distinctive response pattern on the MSQ items relating to emotional functioning. To test whether analyzing emotional functioning separately from the total IRT score would produce clinically relevant information, the responsiveness analyses were repeated within this minority of patients.
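The ‘better’ / ‘same’ / ‘worse’ classification can be sketched as follows. Assumptions: the threshold is inclusive (following the ‘+5 points or greater’ example rather than ‘larger than’), the reliability of 0.84 used in the example is hypothetical because no reliability coefficients are reported here, and the function names are illustrative.

```python
import math


def sem_half_width(sd_baseline, reliability, z=1.96):
    """Half-width of the 95% CI around an individual sum-scale score:
    z * SEM, with SEM = baseline SD * sqrt(1 - reliability)."""
    return z * sd_baseline * math.sqrt(1.0 - reliability)


def classify(change, half_width):
    """'better' or 'worse' if the change reaches the CI half-width in
    that direction (inclusive, as assumed above), otherwise 'same'."""
    if change >= half_width:
        return "better"
    if change <= -half_width:
        return "worse"
    return "same"


def percent_better(changes, half_widths):
    """Percentage of patients classified 'better'.

    half_widths holds one value per patient: the same value repeated for
    a sum scale, or 1.96 * person-level SE for an IRT score.
    """
    if len(changes) != len(half_widths):
        raise ValueError("need one half-width per patient")
    labels = [classify(c, h) for c, h in zip(changes, half_widths)]
    return 100.0 * labels.count("better") / len(labels)


# Sum scale: baseline SD 10 and a hypothetical reliability of 0.84
# give SEM = 4.0 and a common half-width of 1.96 * 4.0 = 7.84 points.
hw = sem_half_width(10.0, 0.84)
print(percent_better([8, 5, -8, 10], [hw] * 4))
```

For an IRT score the only change is the second argument: a per-patient list such as `[1.96 * se for se in person_ses]`, so two patients with the same change can land in different categories.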

Results
Patient characteristics
In study 1, 126 eligible patients were enrolled and took the study drug, and 110 of them had complete data across the MSQ scales. Most patients were women (94%) and Caucasian (91%), and their mean age was 44 years. More than half (58%) had not used sumatriptan before entering the study. Patients experienced a mean of 9.6 migraine attacks during the first 3 months of the study; 82% of these attacks were treated initially with sumatriptan, and 95% were treated with sumatriptan at some point. Across migraine attacks, more than two-thirds of patients experienced relief (moderate or severe pain reduced to mild or none) 2 hours after treatment with sumatriptan.

In study 2, 218 of the 263 eligible patients enrolled in the usual-therapy phase were enrolled in the sumatriptan phase of the study. These patients had taken the study drug on at least one migraine day and provided enough data to estimate migraine disability days. A total of 159 patients had complete data across MSQ scales. Most patients were women (98%) and Caucasian (93%), and their mean age was 41 years. Patients experienced an average of 3 migraine attacks per month during the sumatriptan phase, and 76% reported headache relief (moderate or severe pain reduced to mild or none) 2 hours post-treatment during this phase.

In study 3, 178 of the 220 eligible patients completed the entire study protocol, and 144 had complete data across MSQ scales. Most patients were women (90%) and Caucasian (96%), and their mean age was 39 years. Approximately 26% of the patients reported having migraine attacks at least once a month, and 69% reported having been diagnosed with migraine more than 12 months before study enrollment.

Responsiveness
Table 2 shows the mean baseline (pre-treatment), 3-month follow-up (post-treatment) and change scores for all MSQ scales, together with their standard deviations and the F-statistics testing whether each change differed from 0. All MSQ scales were highly responsive to the diminished impact of migraine after initiation of sumatriptan treatment in each of the three studies: the change in scores on each MSQ scale differed significantly from 0. The largest score changes were observed in study 1, followed by study 2 and then study 3. In study 1, changes in scores ranged from 13.5 to 16.2 points across MSQ scales.
Changes in scores of this magnitude were greater than a full standard deviation on each scale. In study 2, changes in scores ranged from 7.7 to 10.8 points across MSQ scales, which corresponds to eight-tenths of a standard deviation to a full standard deviation. In study 3, changes in scores ranged from 3.1 to 4.9 points across scales, in the range of one-third to one-half of a standard deviation. Across all three studies, the F-statistic observed for the IRT-based MSQ score was the largest.

Table 2. Comparison of pre- and post-treatment IRT-based and original sum scale scores: three studies of migraine treatment

Scale                   # items   Baseline        Follow-up       Change          F
                                  Mean    SD      Mean    SD      Mean    SD
Study 1 (n = 110)
MSQ role restrictive    7         46.5    9.0     60.4    8.6     13.9    10.9    178.2a
MSQ role preventative   5         45.5    9.3     61.7    9.5     16.2    12.6    184.0a
MSQ emotion             4         44.7    9.0     58.2    7.8     13.5    9.3     229.7a
MSQ total sum           15        45.1    8.7     61.3    8.6     16.2    10.6    257.2a
MSQ total IRT           15        45.5    7.7     61.6    9.9     16.1    10.1    274.4a
Study 2 (n = 159)
MSQ role restrictive    7         51.7    9.2     60.8    8.3     9.1     10.4    121.3a
MSQ role preventative   5         52.0    8.6     62.8    8.8     10.8    11.1    150.3a
MSQ emotion             4         51.9    9.3     59.6    7.8     7.7     9.4     107.8a
MSQ total sum           15        52.1    9.0     62.3    8.4     10.2    10.1    160.1a
MSQ total IRT           15        52.2    9.0     62.8    9.3     10.6    10.0    176.0a
Study 3 (n = 144)
MSQ role restrictive    7         48.4    10.9    53.2    10.6    4.8     8.8     43.4a
MSQ role preventative   5         50.4    11.1    53.5    11.3    3.1     11.8    9.8b
MSQ emotion             4         49.4    10.6    52.9    10.2    3.5     8.2     26.8a
MSQ total sum           15        49.1    10.9    53.6    10.4    4.5     8.7     38.1a
MSQ total IRT           15        48.7    10.2    53.6    11.1    4.9     8.7     44.7a
a p < 0.001; b p < 0.01.
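The F-statistics in Table 2 and the RV coefficients in Table 3 are linked by a simple ratio. The sketch below assumes that the test of ‘change differs from 0’ is equivalent to a one-sample (paired) t-test on the change scores, so that F = t² = n · mean² / SD²; this assumption is not stated in the text, but it reproduces the tabled F values closely. The function names are hypothetical.

```python
def f_change(mean_change, sd_change, n):
    """F for H0: mean change = 0 in a single group.

    With one group this is the squared one-sample t statistic,
    t = mean / (sd / sqrt(n)), so F = n * mean**2 / sd**2.
    """
    return n * mean_change ** 2 / sd_change ** 2


def relative_validity(f_stats):
    """RV coefficient: each scale's F divided by the largest F in the test."""
    f_max = max(f_stats.values())
    return {scale: f / f_max for scale, f in f_stats.items()}


# Study 1 F-statistics as reported in Table 2.
study1_f = {
    "role restrictive": 178.2,
    "role preventative": 184.0,
    "emotion": 229.7,
    "total sum": 257.2,
    "total IRT": 274.4,
}
for scale, rv in relative_validity(study1_f).items():
    print(f"{scale:18s} {rv:.2f}")

# Role restrictive from the tabled summary values: about 178.9 vs 178.2.
print(round(f_change(13.9, 10.9, 110), 1))
```

The printed RV values (0.65, 0.67, 0.84, 0.94, 1.00) match the study 1 RV column of Table 3. The small gap between 178.9 and 178.2 is consistent with the table reporting rounded means and SDs.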

Summary statistics used to compare the responsiveness of each MSQ scale are presented in Table 3. Across the summary statistics used to assess responsiveness, there was a decided advantage for the IRT-based scoring of the MSQ. In study 1, the IRT-based MSQ scale was the most responsive to treatment according to the RV, SES, SRM and GSRM statistics; only the total MSQ sum score scale showed a slightly larger percentage of patients classified as ‘better’ than the IRT-based MSQ scale. In study 2, the IRT-based MSQ scale was the most responsive according to the RV, SRM and GSRM statistics and the percentage of patients categorized as ‘better’, while the MSQ role preventative scale was the most responsive under the SES approach. In study 3, the IRT-based MSQ scale was the most responsive according to all five summary statistics.

Table 4 presents the results of the responsiveness analyses conducted among the minority of patients for whom the unidimensional IRT model was unsatisfactory. All scales were highly responsive to treatment among these patients. The total MSQ sum score was the most responsive scale according to the RV and SRM statistics, the IRT-based MSQ scale was the most responsive according to the SES, and the MSQ role restrictive scale was the most responsive according to the GSRM and the proportion of patients classified as ‘better’. The MSQ role emotion scale was the least responsive according to the RV, SES, SRM and GSRM statistics; on the percentage classified as ‘better’, the role preventative scale was lowest (28.6% vs. 32.1% for role emotion).

Table 3. Responsiveness statistics for IRT-based scores and original sum score scales: three studies of migraine treatment

Scale                   # items   RV      SES     SRM     GSRM    % Better
Study 1 (n = 110)
MSQ role restrictive    7         0.65    1.54    1.27    1.78    74.6
MSQ role preventative   5         0.67    1.74    1.28    2.15    67.3
MSQ emotion             4         0.84    1.50    1.45    1.69    70.9
MSQ total sum           15        0.94    1.86    1.53    2.25    85.5
MSQ total IRT           15        1.00    2.09    1.59    2.28    82.7
Study 2 (n = 159)
MSQ role restrictive    7         0.69    0.99    0.88    1.16    57.2
MSQ role preventative   5         0.85    1.25    0.97    1.43    47.8
MSQ emotion             4         0.61    0.83    0.82    0.96    40.9
MSQ total sum           15        0.91    1.13    1.01    1.42    67.9
MSQ total IRT           15        1.00    1.18    1.06    1.50    69.8
Study 3 (n = 144)
MSQ role restrictive    7         0.97    0.44    0.54    0.61    40.3
MSQ role preventative   5         0.22    0.28    0.26    0.41    22.3
MSQ emotion             4         0.60    0.33    0.43    0.44    24.3
MSQ total sum           15        0.85    0.41    0.52    0.62    40.3
MSQ total IRT           15        1.00    0.48    0.56    0.69    49.3

Table 4. Comparison of the responsiveness of IRT-based and original sum scale scores: patients for whom the unidimensional IRT model was not satisfactory (n = 28)

Scale                   # items   Change                   Responsiveness statistics
                                  Mean    SD      F        RV      SES     SRM     GSRM    % Better
MSQ role restrictive    7         11.5    12.6    23.3a    0.86    1.09    0.91    1.47    71.4
MSQ role preventative   5         7.3     11.3    11.6b    0.43    0.59    0.65    0.97    28.6
MSQ emotion             4         4.9     8.4     9.8b     0.36    0.45    0.58    0.61    32.1
MSQ total sum           15        9.3     9.4     27.0a    1.00    1.11    0.99    1.29    64.3
MSQ total IRT           15        10.2    11.7    20.7a    0.77    1.32    0.87    1.44    67.9
a p < 0.001; b p < 0.01.
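The tally behind ‘13 of the 15 tests’ can be checked directly against Table 3. This sketch simply transcribes the Table 3 values and, for each study and statistic, finds the scale with the largest value; the names `TABLE3` and `winners` are hypothetical.

```python
SCALES = ["role restrictive", "role preventative", "emotion",
          "total sum", "total IRT"]

# Table 3: study -> statistic -> values in SCALES order.
TABLE3 = {
    "study 1": {
        "RV": [0.65, 0.67, 0.84, 0.94, 1.00],
        "SES": [1.54, 1.74, 1.50, 1.86, 2.09],
        "SRM": [1.27, 1.28, 1.45, 1.53, 1.59],
        "GSRM": [1.78, 2.15, 1.69, 2.25, 2.28],
        "%Better": [74.6, 67.3, 70.9, 85.5, 82.7],
    },
    "study 2": {
        "RV": [0.69, 0.85, 0.61, 0.91, 1.00],
        "SES": [0.99, 1.25, 0.83, 1.13, 1.18],
        "SRM": [0.88, 0.97, 0.82, 1.01, 1.06],
        "GSRM": [1.16, 1.43, 0.96, 1.42, 1.50],
        "%Better": [57.2, 47.8, 40.9, 67.9, 69.8],
    },
    "study 3": {
        "RV": [0.97, 0.22, 0.60, 0.85, 1.00],
        "SES": [0.44, 0.28, 0.33, 0.41, 0.48],
        "SRM": [0.54, 0.26, 0.43, 0.52, 0.56],
        "GSRM": [0.61, 0.41, 0.44, 0.62, 0.69],
        "%Better": [40.3, 22.3, 24.3, 40.3, 49.3],
    },
}


def winners(table):
    """Map (study, statistic) to the scale with the largest value."""
    out = {}
    for study, stats in table.items():
        for stat, values in stats.items():
            best = max(range(len(values)), key=values.__getitem__)
            out[(study, stat)] = SCALES[best]
    return out


w = winners(TABLE3)
irt_wins = sum(scale == "total IRT" for scale in w.values())
print(f"IRT-based score most responsive in {irt_wins} of {len(w)} tests")
for key, scale in w.items():
    if scale != "total IRT":
        print("exception:", key, "->", scale)
```

The two exceptions it reports are the percentage classified as ‘better’ in study 1 (total sum score) and the SES in study 2 (role preventative).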

Discussion
A companion paper [29] showed the feasibility of implementing scoring algorithms based on IRT methodology in the measurement of migraine-related functional status and well-being using the MSQ. That evidence showed that it was possible to fit a unidimensional IRT model to the items in the MSQ and that a single IRT-based score from the MSQ satisfactorily summarized the information provided by the three subscales conceptualized by the developers of the MSQ. This study sought to extend that evidence, which was primarily psychometric, with empirical tests of the feasibility of scoring a single index from the MSQ using IRT methods. We conducted a head-to-head comparison of the responsiveness of the MSQ, scored using IRT and using the developer’s methods, to the treatment of migraine in three studies.

The results showed large and statistically significant improvements in MSQ scale scores from pre- to post-treatment, regardless of scoring method. The use of IRT methods to score a single MSQ scale did not compromise the validity of the MSQ in responding to the treatment of migraine. With few exceptions, the IRT-based score was more responsive to the treatment of migraine than the three MSQ subscales and the total sum score scale in all three studies.

The difference in responsiveness across scales was due in large part to the smaller standard deviations of the baseline and change scores of the IRT-based scale compared with the MSQ subscales. The smaller standard deviations produced larger signal-to-noise ratios for the IRT-based scale on each responsiveness statistic. For example, in study 1 the change in scores on the IRT-based scale was roughly equivalent to the changes observed on the role preventative and total sum score scales. However, the baseline standard deviation was 12–18% smaller for the IRT-based scale than for the role preventative and total sum score scales, which contributed to the larger effect size estimates (SES) observed for the IRT-based scale. Similarly, the standard deviation of the change in scores on the IRT-based scale was 5–20% smaller than for the role preventative and total sum score scales, which led to larger RV coefficients and SRMs for the IRT scale. Lastly, the standard deviation of change for the IRT-based scale in the stable cohort was the smallest, which led to the larger GSRM for the IRT scale. Similar patterns were observed in studies 2 and 3, where the differences in standard deviations accounted for much of the differences in responsiveness across the scales.

The results of this study support an approach that uses a single scale scored from the MSQ items. With few exceptions, each of the responsiveness statistics showed that either the IRT-based scale or the total sum score scale was more responsive to the treatment of migraine than the three MSQ subscales. From a practical point of view, the use of a single migraine disability scale eases the interpretation of study results. More importantly, using a single scale instead of three subscales strengthens analyses in clinical trials. In clinical trials, it is customary to use formal methods of adjusting for multiple comparisons when more than one scale is used as an endpoint. Many of these methods, such as the Bonferroni correction, are overly conservative [40], and they reduce statistical power because the probability level required for significance becomes more stringent. For example, with the three MSQ subscales as endpoints, a Bonferroni correction requires each to be tested at 0.05/3 ≈ 0.017 instead of 0.05.

The companion paper [29] showed that the unidimensional IRT model was not satisfactory among a minority of patients. These patients showed a unique pattern of responses to the role emotion items of the MSQ that was not consistent with what the IRT model would predict.
To investigate whether the role emotion items produced clinically relevant information not captured by the items in the other scales, we conducted the responsiveness tests among these patients separately. Among these patients, the role emotion scale was the least responsive of all scales on four of the five responsiveness statistics. The implication is that the different behavior of the role emotion items was not likely due to clinically relevant information missed by the other scales, but rather to poor fit to the IRT model for this subset of patients. The unique pattern of responses to the role emotion items diminished the performance of the single IRT-based scale among these patients. However, even in this worst-case group (with respect to IRT fit), the IRT-based scale did not perform consistently worse across all measures, or noticeably worse on any single responsiveness measure.

In conclusion, the results of this study agree with the psychometric evidence reported in the companion paper, which supported a unidimensional IRT model to summarize the data from the MSQ. Our objective was not to propose an alternative method of scoring the MSQ. Rather, the primary objective was to provide evidence supporting the feasibility of applying IRT methods to analyzing and scoring measures of headache impact. The companion paper showed that the use of IRT methods was psychometrically feasible. In this paper we showed that IRT methods did not compromise the validity of the MSQ in tests that more closely approximated the intended use of the questionnaire. Admittedly, the paper would be strengthened by demonstrating that the advantage of the IRT scoring of the MSQ was statistically and/or clinically important; however, our goal was to demonstrate the feasibility of using IRT methods to analyze and score measures of headache impact, which in our view was a necessary first step towards building a more comprehensive pool of items for computer adaptive assessments. Towards that end, the evidence presented in this paper, along with the companion paper, met our objectives.

References
1. Stewart WF, Shechter A, Lipton RB. Migraine heterogeneity, disability, pain intensity, and attack frequency and duration. Neurology 1994; 44(Suppl 4): S24–S39.
2. Holmes WF, MacGregor AE, Dodick D. Migraine-related disability. Neurology 2001; 56(Suppl 1): S13–S19.
3. Lipton RB, Amatniek JC, Ferrari MD. Migraine: Identifying and removing barriers to care. Neurology 1994; 44(Suppl 4): S63–S68.
4. Jacobson GP, Ramadan NM, Aggarwal SK, Newman CW. The Henry Ford Hospital Headache Disability Inventory (HDI). Neurology 1994; 44: 837–842.

5. Hartmaier SL, Santanello NC, Epstein RS, Silberstein SD. Development of a brief 24-hour migraine specific quality of life questionnaire. Headache 1995; 35: 320–329.
6. Jhingran P, Osterhaus JT, Miller DW, Lee JT, Kirchdoerfer L. Development and validation of the Migraine Specific Quality of Life Questionnaire. Headache 1998; 38: 295–302.
7. Stewart WF, Lipton RB, Simon D, Liberman J, Von Korff M. Validity of an illness severity measure for headache in a population sample of migraine sufferers. Pain 1999; 79: 291–301.
8. Martin BC, Pathak DS, Sharfman MI. Validity and reliability of the migraine-specific quality of life questionnaire (MSQ Version 2.1). Headache 2000; 40: 204–215.
9. Stewart WF, Lipton RB, Dowson AJ, Sawyer J. Development and testing of the Migraine Disability Assessment (MIDAS) Questionnaire to assess headache-related disability. Neurology 2001; 56(Suppl 1): S20–S28.
10. Dahlof CGH. Assessment of health-related quality of life in migraine. Cephalalgia 1993; 13: 233–237.
11. Osterhaus JT, Townsend RJ, Gandek B, Ware JE Jr. Measuring the functional status and well-being of patients with migraine headache. Headache 1994; 34: 337–343.
12. Solomon GD, Skobieranda FG, Genzen JR. Quality of life assessment among migraine patients treated with sumatriptan. Headache 1995; 35(8): 449–454.
13. Adelman JU, Sharfman M, Johnson R, et al. Impact of oral sumatriptan on workplace productivity, health-related quality of life, healthcare use, and patient satisfaction with medication in nurses with migraine. Am J Manag Care 1996; 2: 1407–1416.
14. Cohen JA, Beall D, Miller DW, Beck A, Pait GD, Clements B. Subcutaneous sumatriptan for the treatment of migraine: Humanistic, economic, and clinical consequences. Fam Med 1996; 28: 171–177.
15. Jhingran P, Cady RK, Rubino J, Miller D, Grice RB, Gutterman DL. Improvements in health-related quality of life with sumatriptan treatment for migraine. J Fam Pract 1996; 42(1): 36–42.
16. Dahlof C, Bouchard J, Cortelli P, et al. A multinational investigation of the impact of subcutaneous sumatriptan: Health-related quality of life. PharmacoEconomics 1997; 11(Suppl 1): 24–34.
17. Monzon MJ, Lainez MJ. Quality of life in migraine and chronic daily headache patients. Cephalalgia 1998; 18(9): 638–643.
18. Lofland JH, Johnson NE, Batenhorst AS, Nash DB. Changes in resource use and outcomes for patients with migraine treated with sumatriptan. Arch Intern Med 1999; 159: 857–863.
19. Lipton RB, Hamelsky SW, Kolodner KB, Steiner TJ, Stewart WF. Migraine, quality of life, and depression. Neurology 2000; 55(5): 629–635.
20. Muscari-Tomaioli G, Allegri F, Miali E, Pomposelli R, Tubia P, Targhetta A. Observational study of quality of life in patients with headache, receiving homeopathic treatment. Br Homeopath J 2001; 90(4): 189–197.
21. Haley SM, McHorney CA, Ware JE. Evaluation of the MOS SF-36 physical functioning scale (PF-10): I. Unidimensionality and reproducibility of the Rasch item scale. J Clin Epidemiol 1994; 47: 671–684.
22. McHorney CA, Haley SM, Ware JE. Evaluation of the MOS SF-36 physical functioning scale (PF-10): II. Comparison of relative precision using Likert and Rasch scoring methods. J Clin Epidemiol 1997; 50: 451–461.
23. Fisher WP Jr, Eubanks RL, Marier RL. Equating the MOS SF-36 and LSU HIS physical functioning scales. J Outcome Meas 1997; 1: 329–362.
24. Prieto L, Alonso J, Lamarca R, Wright BD. Rasch measurement for reducing the items in the Nottingham Health Profile. J Outcome Meas 1998; 2: 285–301.
25. Hays R, Morales L, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care 2000; 38(9 Suppl II): II28–II42.
26. McHorney CA, Cohen AS. Equating health status measures with item response theory: Illustrations with functional status items. Med Care 2000; 38(9 Suppl II): II43–II59.
27. Ware JE, Bjorner JB, Kosinski M. Practical implications of item response theory and computerized adaptive testing: A brief summary of ongoing studies of widely used headache impact scales. Med Care 2000; 38(9 Suppl II): II73–II82.
28. Jenkinson C, Fitzpatrick R, Garratt A, Peto V, Stewart-Brown S. Can item response theory reduce patient burden when measuring health status in neurological disorders? Results from Rasch analysis of the SF-36 physical functioning scale (PF-10). J Neurol Neurosurg Psychiatry 2001; 71: 220–224.
29. Bjorner JB, Kosinski M, Ware JE. The feasibility of applying item response theory to measures of migraine impact: A re-analysis of three clinical studies. Qual Life Res 2003; 12: 887–902.
30. Headache Classification Committee of the International Headache Society. Classification and diagnostic criteria for headache disorders, cranial neuralgias and facial pain. Cephalalgia 1988; 8: 1–96.
31. Muraki E. A Generalized Partial Credit Model. In: van der Linden WJ, Hambleton RK (eds), Handbook of Modern Item Response Theory. Berlin: Springer, 1997: 153–164.
32. McHorney CA, Ware JE, Raczek AE. The MOS 36-Item Short-Form Health Status Survey (SF-36). II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care 1993; 31: 247–263.
33. Ware JE, Kosinski M, Bayliss MS, McHorney CA, Rogers WH, Raczek AE. Comparison of methods for scoring and statistical analysis of SF-36 health profiles and summary measures: Summary of results from the Medical Outcomes Study. Med Care 1995; 33(Suppl 4): AS264–AS279.
34. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Hillsdale, New Jersey: Lawrence Erlbaum Associates, 1988.
35. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989; 27(Suppl 3): S178–S189.
36. Liang MH, Fossel AH, Larson MG. Comparison of five health status instruments for orthopedic evaluation. Med Care 1990; 28: 632–642.
37. Beaton DE, Hogg-Johnson S, Bombardier C. Evaluating changes in health status: Reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol 1997; 50(1): 79–93.
38. Guyatt GH, Walter SD, Norman GR. Measuring change over time: Assessing the usefulness of evaluative instruments. J Chronic Dis 1987; 40: 171–178.
39. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd ed. New York (NY): McGraw Hill, 1994.
40. Wu AW, Gray SM, Brookmeyer R. Application of random effects models and methods to the analysis of multidimensional quality of life data in an AIDS trial. Med Care 1999; 37(3): 249–258.

Address for correspondence: Mark Kosinski, QualityMetric Incorporated, 640 George Washington Highway, Lincoln, RI, USA. Phone: +1-401-334-8800; Fax: +1-401-334-8801; E-mail: [email protected]