Quality of Life Research 12: 349–362, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands.


On assessing responsiveness of health-related quality of life instruments: Guidelines for instrument evaluation

C.B. Terwee1, F.W. Dekker2, W.M. Wiersinga3, M.F. Prummel3 & P.M.M. Bossuyt4
1Institute for Research in Extramural Medicine, VU University Medical Center (E-mail: cb.terwee.emgo@med.vu.nl); 2Department of Clinical Epidemiology, Leiden University Medical Center; 3Departments of Endocrinology and Metabolism; 4Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands

Accepted in revised form 5 July 2002

Abstract

A lack of clarity exists about the definition of responsiveness and about an adequate approach for evaluating it. An overview is presented of the different categories of definitions and methods used for calculating responsiveness, identified through a literature search. Twenty-five definitions and 31 measures were found. When these were applied to a general and a disease-specific quality of life questionnaire, large variation in results was observed, partly explained by the different goals of existing methods. Four major issues need to be considered to substantiate the claim that an evaluative health-related quality of life (HRQL) instrument is useful. Their relation with responsiveness is discussed. The confusion about responsiveness arises mostly from a lack of distinction between cross-sectional and longitudinal validity, and from a lack of distinction between responsiveness defined as the effect of treatment and responsiveness defined as the correlation of changes in the instrument with changes in other measures. All measures of what is currently called responsiveness can be regarded either as measures of longitudinal validity or as measures of treatment effect. The latter tell us little about how well the instrument serves its purpose and are only of use in interpreting score changes. We therefore argue that the concept of responsiveness can be rejected as a separate measurement property of an evaluative instrument.

Key words: Guidelines, Health-related quality of life, Questionnaire development, Responsiveness, Review

Introduction

Health-related quality of life (HRQL) is nowadays considered to be one of the most important outcome measures in many clinical studies. This trend makes it all the more necessary to have a clear methodology for the development and use of HRQL instruments. Traditionally, there is consensus that newly developed instruments should be tested for validity and reliability before they can be used in clinical studies. For evaluative instruments, designed to measure longitudinal changes in HRQL over time, 'responsiveness' or 'sensitivity to change' has been proposed as a third requirement [1–15].

Despite the fact that some authors consider responsiveness to be the most essential property of an evaluative instrument, the methodology of assessing responsiveness tends to be less well understood [6, 9]. There is discussion whether responsiveness should be considered a separate property of an instrument or just an aspect of validity [16–19]. In addition, there is an evident lack of clarity in the literature about the definition of responsiveness. Definitions differ in the kind of change that a responsive instrument should be able to detect, e.g., (clinically) important changes over time [6, 20], changes due to treatment effects [21, 22], or changes in the true value of the underlying construct [12]. As a consequence, there is

considerable inconsistency in the methods for calculating it and consensus on an adequate approach for evaluating responsiveness is lacking [20, 21, 23–26]. In order to justify the claim that a HRQL instrument is adequate for use in clinical studies, there is a clear need to standardize the terminology and corresponding methodology for assessing responsiveness. This paper shows that different measures of responsiveness lead to different conclusions because they have different purposes. Therefore, standardization of methodology is needed. Our main conclusion is that the concept of responsiveness is redundant when considered next to reliability and validity.

In the first part of this article a review of the available literature on the methodology for assessing responsiveness since 1985 is presented. We applied all identified measures to a general and a disease-specific quality of life questionnaire to illustrate the differences resulting from using various methods. In the second part of this article we discuss the methodology of responsiveness. Relations between validity, reliability and responsiveness are discussed and guidelines are provided for the assessment of these aspects in the evaluation of HRQL instruments.

A review of the literature on responsiveness

We searched the literature from 1985 onwards for definitions of and methods for calculating responsiveness. The purpose of our search was to identify all different categories of definitions and methods, in order to present as complete an overview as possible. We first scanned our own collection of articles and textbooks on the methodology of HRQL instrument development and evaluation. In addition, a PubMed search was performed for articles with 'responsiveness' or 'sensitivity to change' as a text word and 'quality of life' and/or 'questionnaire' as a MeSH term. This search yielded 270 abstracts, which were scanned for methodological considerations about responsiveness. Additional articles were sought by using the 'related articles' option of PubMed, by reviewing reference lists of key articles and by hand searching the latest issues of key journals (Journal of Clinical Epidemiology, Quality of Life Research and Medical Care). Because our literature search was limited to articles from

1985 and we considered it too time-consuming to find and scan all articles that assess measurement properties of HRQL instruments (we expected to find thousands of articles), we have not identified all existing definitions and methods; other examples have been presented by Beaton et al. [27]. However, we found no indications that including other definitions or methods would have led to other categories, and we think that our theoretical arguments on the methodology of assessing responsiveness would also apply to literature prior to 1985.

In Table 1 an overview is presented of the definitions of responsiveness found through our search. Although there are many similarities between the definitions, important differences can be observed. We identified three categories of definitions. Our classification is based on the kind of change that a responsive instrument should be able to detect, although other groupings are possible, such as internal vs. external responsiveness [53] or distribution-based vs. anchor-based measures. In the first group responsiveness is defined as the ability to detect change in general [2, 5, 8–10, 15, 21–23, 26, 28, 29]. This could be any kind of change, regardless of whether it is relevant or meaningful. It is often defined as a statistically significant change after treatment. This definition equates to the concept of 'sensitivity to change' used by Liang [54]. In the second group responsiveness is more specifically defined as the ability to detect clinically important change [1, 3–6, 11, 20, 30–45, 47, 52]. These definitions differ from those in the first group because they require an explicit, although often subjective, judgment on what is considered important. In the third group responsiveness is defined as the ability to detect real changes in the concept being measured [12, 14, 48, 50]. This definition can be considered a further extension of the previous two, as it requires not only a judgment on what changes are important but also a gold standard for the concept being measured.

Within each of the three groups a variety of methods for calculating responsiveness has been proposed in the literature (Table 2). We found 31 different measures of responsiveness. The same statistics sometimes appear in multiple groups, depending on the patient group in which the

Table 1. Definitions of responsiveness (grouped by the kind of change that a responsive instrument should be able to detect)

Group 1: Responsiveness as the ability to detect change in general
- The ability to detect change over time [8, 15, 23, 28, 29]
- The ability of an index to measure clinical change [26]
- The ability of an instrument to detect the overall effect of treatments [21]
- The ability to detect a change due to treatment [9, 22]
- When an instrument detects differences between interventions [5]
- The ability to detect small within-patient change over time [10]
- The sample size required to observe a small, medium or large change or effect size in the population; or the power to detect a difference when one is present; or the effect of random and systematic error on the power of a test [2]

Group 2: Responsiveness as the ability to detect clinically important change
- The ability of an instrument to detect important change over time [30]
- The ability to detect change in outcomes that matter to persons with a health condition, to their significant others, or to their providers [11]
- The ability to detect (sensitivity to) clinically important (significant, relevant, meaningful) changes in health status over time [1, 6, 20, 31–37]
- The ability to detect meaningful and statistical clinical change [38]
- When statistically significant changes occur in a concurrent direction in subjects with clinically meaningful changes of symptoms [39]
- The ability to detect clinically important changes in health status over time, which are changes that clinicians and patients think are discernible and important, have been detected with an intervention of known efficacy or are related to well-established physiologic measures [40]
- The ability to detect clinically important changes in health status over time, even if these changes are small [3, 5]
- The ability to detect (sensitivity to) small but (clinically) important changes [41–45]
- The ability to measure small but clinically important within-subject change after an effective therapeutic intervention [46]
- The ability to detect MCID [4]
- Detecting clinically meaningful changes and differences between treatments [47]

Group 3: Responsiveness as the ability to detect real change in the concept being measured
- The level to which an instrument is able to detect changes in the concept being measured [48]
- Whether changes in the attribute are reflected in changes in the scores on the QOL instrument [14]
- The capacity to detect changes in health status whenever they actually exist [49]
- The extent to which a measure is sensitive to real change in HRQL [50]
- A measure of the association between change in the observed score and changes in the true value of the construct [12]
- The ability of a measure to respond to the underlying dynamic characteristic of a construct, such as HRQL, in response to an intervention [51]
- The ability to discriminate between those who improve and those who do not [52]

measure is being calculated. For example, an effect size calculated only in patients who have improved (measure 18) was placed in the second group because it involves a judgment on improvement, while the same effect size calculated in all patients who underwent treatment (measure 10) does not involve this kind of judgment and was therefore placed in the first group. Some methods are similar from a statistical point of view. For example, the paired t-test (measure 1) is similar to an effect size (measure 10) and the relative efficiency statistic

(measure 16) can be transformed into the F statistic included in measure 3. However, because of the influence of sample size on p-values, these methods can lead to different conclusions about responsiveness. For example, a certain change in an instrument can be found statistically significant (in a large patient sample), interpreted as good responsiveness of the instrument, but at the same time this change may represent a small effect size, indicating low responsiveness of the instrument.
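To make this point concrete, the following minimal sketch (illustrative only; the mean change, standard deviations and sample sizes are hypothetical and not taken from any study reviewed here) contrasts a paired t-test p-value with an effect size and a standardized response mean computed from the same simulated change, for a small and a large sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate(n, mean_change=3.0, sd_baseline=15.0, sd_change=12.0):
    """Simulate before/after scores with a fixed true mean change (hypothetical values)."""
    before = rng.normal(50.0, sd_baseline, n)
    after = before + rng.normal(mean_change, sd_change, n)
    return before, after

for n in (25, 2500):  # hypothetical small and large samples
    before, after = simulate(n)
    t, p = stats.ttest_rel(after, before)                          # measure 1: paired t-test
    es = (after - before).mean() / before.std(ddof=1)              # measure 8: effect size (baseline SD)
    srm = (after - before).mean() / (after - before).std(ddof=1)   # measure 10: SRM (SD of change)
    print(f"n={n:5d}  p={p:.4f}  ES={es:.2f}  SRM={srm:.2f}")

# With the same underlying mean change, the p-value shrinks as n grows,
# while the ES and SRM stay at roughly the same (small) magnitude.
```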

Table 2. Measures of responsiveness
GS – gold standard, i.e. the source used to define an important change: P – important change according to the patient; D – important change according to the doctor; PD – important change according to both the patient and the doctor; T – change due to treatment effect. In the formulas, test1 and test2 denote the scores before and after the change being evaluated.

Group 1: Responsiveness as the ability to detect change in general (GS: T for all measures)
1. Paired t-test (or Wilcoxon) in patients who underwent treatment (prospective); reported as p-value [55, 56]
2. Unpaired t-test or ANOVA between treatment and control groups (prospective); reported as p-value [4, 57, 58]
3. Repeated measures analysis of variance; reported as p-value of the main effect of time [51]
4. Repeated measures analysis of variance; reported as p-value of the time-by-therapy interaction [59]
5. Relative efficiency statistic (RE) in patients who underwent treatment (prospective); reported as (t-statistic1/t-statistic2)² [26, 30, 35, 60]
6. Standard error of measurement (SEM); reported as √(within-subject variance) [61]
7. Measurement sensitivity; reported as σ²contrast/σ²total [62]
8. Effect size; reported as mean(test1 − test2)total group / SD(test1)total group [23, 26, 30, 38, 41, 55, 59, 63, 64, 94]
9. Effect size for difference (coefficient of responsiveness); reported as [mean(test1 − test2)treatment − mean(test1 − test2)control] / SD(test1)pooled [58, 65]
10. Effect size / standardized response mean; reported as mean(test1 − test2)total group / SD(test1 − test2)total group [25, 26, 29, 30, 32, 38, 48, 66]
11. Responsiveness index; reported as mean(test1 − test2)total group / √(σ²(baseline1 − baseline2)) [46]
12. Sample size requirements; reported as n [4, 66, 67]

Group 2: Responsiveness as the ability to detect clinically important change
13. Paired t-test in patients who did and did not change (retrospectively); reported as p-value [4, 68, 69]
14. Unpaired t-test between patients who did and did not change clinically (retrospective); reported as p-value [52]
15. Number of dimensions with significant pre–post-treatment changes in patients who improved (retrospectively); reported as % [49, 57, 70]
16. Relative efficiency statistic (RE) in patients who clinically improved (retrospective); reported as (t-statistic1/t-statistic2)² [28]
17. Responsiveness coefficient; reported as σ²change / (σ²change + σ²error) [21, 71]
18. Effect size / standardized effect size; reported as mean(test1 − test2)improved / SD(test1)improved [6, 20, 31, 64]
19. Normalized ratio; reported as mean(test1 − test2)improved / SD(test1)stable [39]
20. Responsiveness statistic; reported as mean(test1 − test2)total group / SD(test1)stable [41]
21. Standardized response mean; reported as mean(test1 − test2)improved / SD(test1 − test2)improved [20, 41]
22. Responsiveness statistic / responsiveness index; reported as mean(test1 − test2)total group / SD(test1 − test2)stable [22, 26]
23. Responsiveness ratio / standardized response mean; reported as mean(test1 − test2)improved / SD(test1 − test2)stable [34, 69]
24. Guyatt's responsiveness statistic / responsiveness coefficient; reported as MCID / SD(test1 − test2)stable [4, 31, 44]
25. Responsiveness index; reported as MCID / √(σ²(baseline1 − baseline2)) [46]

Group 3: Responsiveness as the ability to detect real change in the concept being measured
26. Sensitivity/specificity [1]
27. ROC curve; reported as area under the curve (AUC) [34]
28. Relation with a global measure of change (mean changes per category of the global measure) and differences in changes between relevant subgroups (Kruskal–Wallis test; p-value) [1, 20, 31–33, 41, 72]
29. Correlations with changes in clinical variables; reported as Pearson correlation coefficient [42, 73]
30. Correlation with overall improvement; reported as Pearson correlation coefficient [1, 26, 52]
31. Regression models; reported as regression coefficient [53]
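As an illustration of how some of the more common indices in Table 2 are computed, the sketch below implements three of them in Python. The input arrays are hypothetical baseline and follow-up scores; in practice the 'stable' group used for Guyatt's statistic would be identified by an external judgment (patient or physician), as discussed in the text.

```python
import numpy as np

def effect_size(baseline, followup):
    """Measure 8: mean change divided by the SD of the baseline score."""
    change = followup - baseline
    return change.mean() / baseline.std(ddof=1)

def standardized_response_mean(baseline, followup):
    """Measure 10: mean change divided by the SD of the change score."""
    change = followup - baseline
    return change.mean() / change.std(ddof=1)

def guyatt_responsiveness_statistic(mcid, baseline_stable, followup_stable):
    """Measure 24: MCID divided by the SD of change in stable patients."""
    change_stable = followup_stable - baseline_stable
    return mcid / change_stable.std(ddof=1)

# Hypothetical 0-100 scores for treated patients and for stable patients.
rng = np.random.default_rng(0)
base_treated = rng.normal(60, 15, 100)
follow_treated = base_treated + rng.normal(8, 10, 100)
base_stable = rng.normal(60, 15, 50)
follow_stable = base_stable + rng.normal(0, 6, 50)

print("ES :", round(effect_size(base_treated, follow_treated), 2))
print("SRM:", round(standardized_response_mean(base_treated, follow_treated), 2))
print("RS :", round(guyatt_responsiveness_statistic(6.0, base_stable, follow_stable), 2))
```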

Some important conceptual differences can be identified within the second and third groups of measures. First, there is variation in who should be the judge of an important change: the physician (as in measures 13 and 14) or the patient (as in measures 16 and 17). Second, there is additional variation in the definition of the standard deviation in the denominator of the effect sizes (ESs) and standardized response means (SRMs). Finally, there is also variation in the terminology and calculation of the ES and SRM, which may hamper the unambiguous interpretation of results.

Application of different responsiveness measures

Many authors use multiple indices of responsiveness and conclude that their instrument is responsive when all indices point in the same direction [20, 26, 30, 31, 38, 41, 74]. This approach would be justifiable if different methods evaluated the same measurement property and led to the same conclusions. We will show that this is not necessarily the case for current measures of responsiveness. As a test, we tried to apply all measures in Table 2 to one disease-specific and one general HRQL questionnaire. The GO-QOL was used as the disease-specific HRQL questionnaire. It has been developed for patients with Graves' ophthalmopathy (GO), a thyroid-related eye disease. The GO-QOL consists of two subscales: one measuring problems with visual functioning and one measuring the psychosocial consequences of a changed appearance, both on a scale from 0 to 100, higher

scores indicating better health [75, 76]. The SF-36 was used as a general HRQL questionnaire [77]. Scores of the SF-36 were summarized in a physical component summary score (PCS) and a mental component summary score (MCS), both standardized to have a mean score of 50 with a standard deviation of 10 in the general population [78]. A group of 164 GO patients completed both questionnaires before and 3–6 months after therapy, which consisted of either radiotherapy or surgery. Two global questions were asked about perceived overall changes in visual functioning and in appearance. Table 3 lists the results. It was not possible to calculate all measures because not all the required data were available. There appears to be large variation in results between as well as within the three groups of measures. For example, different conceptual opinions regarding the patient group in which the measure should be calculated or different ways of selecting the judge of what constitutes an important change lead to different conclusions about the responsiveness of the instruments. Some differences can be explained by the fact that different methods have different goals. Statistics such as the t-test statistics and SRMs using the standard deviation of the change in score are meant to show statistically significant changes (direct or indirect), while effect sizes using the standard deviation of the baseline score are meant to quantify the amount of change (small, moderate, large effect). The effect of these differences in approach can be shown by a hypothetical example. If everyone

Table 3. Application of different responsiveness measures to a disease-specific and a general HRQL questionnaire
For each measure the table reports the gold standard used to define an important change (GS, coded as in Table 2: P – patient; D – doctor; PD – both; T – change due to treatment effect) and the resulting values for four scales: GO-QOL visual functioning, SF-36 physical component summary (PCS), GO-QOL appearance and SF-36 mental component summary (MCS). (Individual values are not reproduced here.)

Group 1 (detecting change in general; GS: T): paired t-test (p-value) in all patients; relative efficiency (GO-QOL compared with the SF-36); effect size; standardized response mean; required sample size.

Group 2 (detecting clinically important change): paired t-tests (p-values) in patients who rated themselves better, unchanged or worse; unpaired t-test (p-value) between improved and not-improved patients; relative efficiency (GO-QOL compared with the SF-36); effect size; normalized ratio; responsiveness statistic; standardized response mean; responsiveness index, based on the SD of stable patients in a test–retest study (see reference 15; no data available for the SF-36); responsiveness ratio; responsiveness coefficient, with the MCID defined as six points for the GO-QOL subscales based on a related study [96] (no data available for the SF-36).

Group 3 (detecting real change in the concept being measured): sensitivity and specificity for an improvement of at least six points vs. not improved; ROC curves (area under the curve) for improved vs. not improved and for better vs. not better; mean (SD) change per category of the global rating of change (better, no change, worse; for some categories no SD could be calculated because n = 2); correlations of change scores with changes in clinical variables (visual acuity, diplopia, elevation, proptosis, soft tissue involvement, lid aperture); correlation with overall improvement.

responds almost exactly one baseline standard deviation to treatment, then responsiveness calculated as the paired t-test (e.g., measure 1) is infinite because the change is very significant, responsiveness calculated as the SRM, using the standard deviation of the change score (e.g., measure 10), also approaches infinity since the variance in change scores is almost zero, yet responsiveness calculated as an effect size using the standard deviation of the baseline score (e.g., measure 8) approaches 1. This lack of consensus about the goal of responsiveness tests makes the concept a rather elusive one.

Comparisons of the relative responsiveness of instruments were found to be relatively insensitive to the particular approach of calculating responsiveness. Most measures indicated better responsiveness of the disease-specific GO-QOL compared with the general SF-36. In contrast, the absolute value of the responsiveness measure, and consequently the interpretation of the magnitude of responsiveness, depends on the methodological choices made, such as the goal of the test, the definition of the kind of change the instrument should be able to detect and the chosen judge of an important change.
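The hypothetical example above can be reproduced numerically. In this sketch (hypothetical data, with an arbitrarily small amount of noise added so that the statistics remain finite) every patient improves by roughly one baseline standard deviation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
baseline = rng.normal(50, 10, n)           # baseline SD is about 10
change = 10 + rng.normal(0, 0.1, n)        # everyone improves by roughly 1 baseline SD
followup = baseline + change

t, p = stats.ttest_rel(followup, baseline)
es = change.mean() / baseline.std(ddof=1)   # effect size (baseline SD): close to 1
srm = change.mean() / change.std(ddof=1)    # SRM (SD of change): enormous

print(f"paired t = {t:.0f} (p = {p:.1e}), ES = {es:.2f}, SRM = {srm:.0f}")
# The t statistic and the SRM explode as the change-score variance shrinks,
# while the effect size settles near 1: three 'responsiveness' values, one data set.
```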

Methodology of assessing responsiveness

In an attempt to understand the goal and methodology of responsiveness assessment, we started from the perspective of developers and potential users of HRQL instruments. The main question we will focus on is: What do we need to know about the instrument and its measurement abilities in order to substantiate the claim that the instrument serves its purpose? After searching review articles and textbooks on guidelines for HRQL instrument development and evaluation [2, 3, 8–11, 13, 14, 21, 44, 79], we summarized these guidelines into four major issues that need to be considered:

1. We need to specify our measurement goals. What concepts do we want to measure?
2. We need to make sure that the instrument we developed or selected actually measures these concepts.
3. We need to know how well the instrument is able to measure these concepts.
4. We need to know how to interpret the outcomes of the instrument.

In the next sections, we will discuss each of these issues and their relation to responsiveness.

What do we want to measure?

The measurement goals of a HRQL instrument should be defined in terms of the specific HRQL concepts that the instrument intends to measure, e.g., physical or mental functioning, and the target population for which the instrument is intended, and should also include the purpose of the instrument, i.e., discrimination, evaluation or prediction [10]. In addition, the magnitude of change (or minimal clinically important difference, MCID) that we consider important to discriminate between patient groups, to evaluate change over time or to predict changes should be specified in advance for each specific study, to facilitate interpretation of study results and to enable sample size calculations [80–83]. From these measurement targets it follows naturally that a good evaluative HRQL instrument should be able to measure '(minimal) important changes in specific HRQL domains over time'. Looking at the definitions of responsiveness listed in Table 1, we see that the third group of definitions comes closest to the measurement aim of an evaluative instrument as defined above. Stating that a HRQL instrument should be able to measure clinically important changes (group 2) can induce unnecessary confusion, because this requires a definition of 'clinical importance' in terms of relevant HRQL changes.
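As a brief illustration of the sample-size point made above, the sketch below uses the standard normal-approximation formula for a two-group comparison of mean change scores; the MCID of 6 points and the SD of 15 points are hypothetical planning values, not recommendations.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(mcid, sd_change, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect the MCID between two groups
    of change scores (two-sided test, normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd_change / mcid) ** 2)

# Hypothetical planning values: MCID = 6 points, SD of change scores = 15 points.
print(n_per_group(mcid=6.0, sd_change=15.0))   # roughly 100 patients per group
```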


It should be emphasized that the MCID cannot and should not be considered a fixed property of the instrument. It depends on the setting of the study in which the instrument is going to be used. For example, the MCID in HRQL scores after treatment will depend on a weighing of the side effects and other costs of the treatment, as well as on the available treatment alternatives. The MCID will therefore vary from study to study and across populations. Furthermore, defining a MCID for a specific study requires a judgment of what an important change is, which is necessarily subjective, to be made either by the physician, the patient, or society. This is a critical point that is often overlooked in clinical practice. Who the judge should be depends on the purpose of the study. If the purpose is to measure changes in HRQL from the patient's perspective, then the patient is the sole judge of what is important. Following this line of argument, it can be concluded that, because the definition of an important change depends on a study-specific value judgment, a judgment about what constitutes a good evaluative instrument will also vary from study to study.

How can we be sure that the instrument really measures the changes that we want to measure?

Gathering evidence that an instrument really measures what it is supposed to measure has generally been referred to as testing the validity of the instrument [21]. Demonstrating that an evaluative HRQL instrument really measures important changes in specific HRQL domains over time has therefore analogously been considered an aspect of validity [16–19]. In an oft-quoted article, Guyatt et al. have emphasized that it is necessary to distinguish responsiveness from validity. Their main argument was that an instrument can be valid yet fail to detect important changes when they occur [5]. Hays et al. have countered this conclusion by arguing that any valid instrument, by definition, should be responsive to change. If not, an initially valid instrument would lose its validity over time and thus no longer be able to measure the underlying construct at a later time point [18, 19].

The confusion arises from a lack of distinction between the assessment of validity of a single score

in a cross-sectional design (where you correlate related and unrelated scores) and a similar evaluation of the validity of the change score in a longitudinal design (where you correlate related and unrelated score changes). For convenience, some authors speak of cross-sectional and longitudinal (or evaluative) validity, respectively [2, 10, 31, 84, 85]. It is well known that, because study goals and patient populations differ from study to study, an instrument can be valid in one setting but invalid in another. In analogy, conclusions about cross-sectional and longitudinal validity do not have to be the same. For example, an instrument can validly measure the HRQL of a certain patient group at one point in time but may be unable to measure changes in HRQL in some of those patients, for example if they are near the top or bottom of the scale. Such ceiling or floor effects can mask important changes [8, 9, 12]. Because of the non-linearity of a scale, changes in the lower end of the scale may be less or more important than comparable changes in the upper end of the scale [12]. As a consequence, correlations with related measures obtained in a cross-sectional design may differ from those obtained in a longitudinal design. By selecting a homogeneous population at baseline one could also create a situation where there is longitudinal validity but no cross-sectional validity. Many authors fail to distinguish cross-sectional validity from longitudinal validity and consequently confuse longitudinal validity and responsiveness. Although the results of assessing cross-sectional validity and longitudinal validity can differ, the basic methodological principles are the same. One can therefore successfully argue that in this situation responsiveness should not be considered a separate property of the instrument but just an aspect of validity in a longitudinal setting [16–19].

Guyatt et al. [5] also claimed that an instrument can be responsive but not valid. They argued that if you ask subjects whether they felt better as a result of an intervention, such an approach is likely to be responsive, since many patients will report improvement. However, the subjects' responses may not be a valid indicator of the concept that one intends to measure, in this case changes in HRQL. One may actually be measuring satisfaction with the intervention or willingness to please the doctor.
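A small simulation can illustrate the ceiling-effect scenario described above (all numbers are hypothetical): an instrument that correlates well with a criterion measure cross-sectionally may still show weak correlations between change scores when many patients start near the top of the scale.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300

true_baseline = rng.normal(88, 8, n)      # many patients near the top of the construct
true_change = rng.normal(8, 5, n)         # true improvement in the construct
true_followup = true_baseline + true_change

def observe(true_score):
    """Score the construct on a 0-100 scale with a little noise and a hard ceiling."""
    return np.clip(true_score + rng.normal(0, 2, len(true_score)), 0, 100)

obs_base, obs_follow = observe(true_baseline), observe(true_followup)
crit_base = true_baseline + rng.normal(0, 2, n)    # hypothetical criterion, no ceiling
crit_follow = true_followup + rng.normal(0, 2, n)

cross_sectional = np.corrcoef(obs_base, crit_base)[0, 1]
longitudinal = np.corrcoef(obs_follow - obs_base, crit_follow - crit_base)[0, 1]

print(f"cross-sectional validity (scores):      r = {cross_sectional:.2f}")
print(f"longitudinal validity (change scores):  r = {longitudinal:.2f}")
# The ceiling truncates observed change in patients who start high, so the
# change-score correlation is noticeably weaker than the score correlation.
```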

This example shows one of the main sources of confusion about the definition of responsiveness. Responsiveness in this example was defined as the ability to measure the effect of treatment (groups 1 and 2, Table 1), while the ability to measure important changes in HRQL over time (group 3, Table 1) was considered validity. This distinction between responsiveness and longitudinal (construct) validity has been made by other authors as well [2, 10, 85]. Stucki et al. [74] distinguished responsiveness ('the ability to detect important clinical changes') from discriminative ability ('the ability to distinguish between those who improve clinically and those who do not'). Husted et al. [53] have called it internal responsiveness ('the ability to measure change over time') and external responsiveness ('the extent to which changes in a measure over time relate to corresponding changes in a reference measure of health status'). They suggest distinguishing external responsiveness from longitudinal construct validity, because external responsiveness requires an accepted indication of change as the external standard. However, for the assessment of HRQL changes no such gold standard exists. Finally, the distinction between responsiveness and longitudinal construct validity is also closely related to the concepts of distribution-based methods and anchor-based methods, used to assess interpretability [86, 87].

Norman et al. [25] recognized that these two definitions of responsiveness originated long ago from two broad approaches within psychology, described by Cronbach in 1957. Responsiveness calculated as the magnitude of the treatment effect (groups 1 and 2, Table 2) is consistent with the experimental approach, while responsiveness defined as the correlation of the changes in the instrument with changes in some other measure (group 3, Table 2) is related to the correlational approach. These correlations have no direct relationship with the treatment effect [25]. Thus, if responsiveness is defined as a measure of treatment effect, it should not be calculated with correlations. Similarly, the example of Guyatt shows that the treatment effect tells you nothing about the validity of the instrument, i.e. its ability to measure changes in the construct that you intend to measure. Therefore, if responsiveness is defined as longitudinal validity, it should not be calculated with measures of the treatment effect.

These two different definitions of responsiveness clearly show why different currently used measures of responsiveness can lead to different conclusions. In order to be sure that the instrument really measures the changes that we want to capture, we believe that only the correlational approach suits this purpose. Responsiveness measures that measure treatment effects – such as effect sizes – can only be useful for assessing the longitudinal validity of the instrument when prespecified hypotheses about the magnitude of a treatment effect, in terms of expected HRQL changes, are being tested (the use of p-values and test statistics should be discouraged altogether, because they depend on the sample size). Without predefined hypotheses, an effect size in itself tells us nothing about the instrument's ability to serve its purpose.

How well can the instrument measure these changes?

Gathering evidence that the instrument is measuring precisely, with an acceptable amount of error, has generally been referred to as testing its reliability [21]. In analogy with the assessment of validity, one can assess the reliability of a single score (assessed with repeated measurements close in time) or the reliability of a change score (assessed with repeated measurements of the baseline score as well as of the follow-up score) [21, 36]. Both types of reliability can be expressed as an intraclass correlation coefficient (ICC). Although the results can differ, the underlying methodological principles are the same. For evaluative instruments, the reliability of the change score is of particular importance, sometimes referred to as longitudinal reproducibility [36]. Longitudinal reproducibility depends on the internal consistency of the score at each time point. Longitudinal reproducibility is directly related to responsiveness defined as the effect of treatment (groups 1 and 2): the better the reproducibility, the higher the SRM [36]. Longitudinal reproducibility is almost completely neglected in the development and evaluation of instruments.
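To make the distinction concrete, the sketch below computes a single-measures, absolute-agreement ICC (Shrout and Fleiss type 2,1) for repeated baseline scores and for repeated change scores; the simulated data and error levels are hypothetical.

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.
    `data` is an (n subjects x k occasions) array."""
    n, k = data.shape
    grand = data.mean()
    ms_rows = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((data.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = data - data.mean(axis=1, keepdims=True) - data.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(3)
n = 150
true_base = rng.normal(50, 12, n)
true_change = rng.normal(5, 8, n)

def measure(true_score):
    return true_score + rng.normal(0, 5, n)   # measurement error, SD = 5 (hypothetical)

# Two administrations at baseline and two at follow-up.
base1, base2 = measure(true_base), measure(true_base)
fol1, fol2 = measure(true_base + true_change), measure(true_base + true_change)

icc_single = icc_2_1(np.column_stack([base1, base2]))
icc_change = icc_2_1(np.column_stack([fol1 - base1, fol2 - base2]))
print(f"ICC single score: {icc_single:.2f}   ICC change score: {icc_change:.2f}")
# Measurement error enters a change score twice, so with the same instrument the
# change score is typically less reliable than a single score unless true change
# varies widely between patients.
```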

How do we interpret changes in score on the instrument?

Once we have established that an instrument can measure the changes that we want it to measure, without too much measurement error, we can decide whether or not to use the instrument in clinical studies. In practice, no HRQL instrument will ever be used without a reference of what a certain (change in) score means for the patient and the decision-maker. Interpretation is important in cross-sectional as well as in longitudinal studies. For evaluative instruments, one should be able to answer questions such as: 'Does a change in score of 10 points in a certain patient group lead to an important improvement in HRQL for these patients?'; 'Is a change from 10 to 20 points as meaningful as a change from 80 to 90 points?'; 'Is a change of 15 points substantially more than a change of 10 points?'. For sample size calculations some idea is required of the MCID in HRQL for a specific patient group in a specific setting (e.g. after a certain treatment). These aspects of interpretation are among the most important issues of instrument development and evaluation but are often neglected by researchers [11, 80, 86, 88, 89].

Although the interpretation of (changes in) scores depends on the validity and reliability of these (changes in) scores and can be based on the same (longitudinal) data, the presentation of data that facilitate the interpretability of scores requires special attention. Many measures of what is currently called responsiveness – especially those that measure treatment effects (groups 1 and 2) – contain information that is of help in the interpretation of change scores, but only if presented in the right way [90]. The meaning of score changes cannot be determined from the magnitude of ICCs or correlations. It has to be determined from their respective components, such as means and variances. Many authors only present ESs without presenting the means and standard deviations separately. Guidelines for ESs have been proposed, defining what constitutes a small, moderate or large effect [91]. A disadvantage of these guidelines is that they have been based on statistical arguments (distribution-based), rather than on patients' judgments of what constitutes an important change (anchor-based). Recent studies examining the association between distribution-based and anchor-based methods have suggested that in certain situations they provide equivalent information [92]. For the purpose of interpretability, the standard deviation of the baseline score should be used in the ES, rather than the standard deviation of the change in score, as in the SRM, since the aim here is to describe the magnitude of the change, not the statistical significance [93, 94].
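One way to present change data so that they support interpretation, rather than as a single index, is sketched below: mean change per category of a patient global rating of change (the anchor), alongside an anchor-based MCID estimate expressed in baseline SD units. All values, category boundaries and labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 240
change = rng.normal(6, 12, n)          # hypothetical change scores on a 0-100 scale
baseline_sd = 15.0                     # hypothetical SD of baseline scores

# Hypothetical patient global ratings, loosely tied to the true change.
bins = [-5, 3, 12]                     # hypothetical category boundaries
labels = ["worse", "unchanged", "a little better", "much better"]
rating = np.digitize(change + rng.normal(0, 4, n), bins)

for i, label in enumerate(labels):
    grp = change[rating == i]
    print(f"{label:16s} n={grp.size:3d}  mean change={grp.mean():5.1f}  SD={grp.std(ddof=1):4.1f}")

mcid_anchor = change[rating == 2].mean()    # mean change in the 'a little better' group
print(f"anchor-based MCID estimate: {mcid_anchor:.1f} points "
      f"(= {mcid_anchor / baseline_sd:.2f} baseline SD units)")
```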

As Testa [88] pointed out, examination of the variability of scores in a variety of populations and treatments can help in understanding scores and score changes, similar to gaining clinical experience with signs and symptoms after examining thousands of patients. Testa [88] and Guyatt [95] provide several strategies for relating HRQL data to independent standards that are themselves interpretable, such as global ratings of change by the patient or functional measures familiar to clinicians. Husted et al. [53] recommend the use of regression models for assessing what they called external responsiveness. Norman [25] showed that retrospective classifications, such as mean changes per category of a global measure of change, may lead to confounded estimates of the overall treatment effect. However, these standards can be used to define differences that are meaningful to patients (MCID) in different situations and to determine the proportions of patients that benefit from treatment. All estimates of the MCID require a study-specific value judgment. Whether a single estimate of the MCID will lead to biased inferences will depend on the magnitude of the variability of these judgments across studies [95].

Additional guidelines for instrument evaluation

In addition to the four major issues that we discussed above, we feel that a final remark about the design of a validation study is worth making. It is well known that reliability as well as validity depends on the population of patients and the setting in which the instrument is being used [21]. The same applies to longitudinal reproducibility and longitudinal validity. For example, the variability of score changes will be larger in a heterogeneous population than in a homogeneous population. Similarly, the variability of changes will be larger when the treatment under study is very effective in some patients but not at all in others, compared with a situation where a treatment is slightly effective in all patients, although the mean score change may be the same in both cases. The differences that we found when applying all existing measures of responsiveness to two HRQL instruments (Table 3) could be partly explained by differences in the patient populations used in the

calculation of these measures. In general, the assessment of validity, reliability and interpretability should be performed in a patient population that is representative of the population in which the instrument is going to be used in future studies. As such future studies will often not be known in advance, a clear description of the study population and design in which the instrument was developed and evaluated is important to enable potential users of the instrument to generalize the findings to their patients.

Finally, we acknowledge that there is no standard recipe for assessing validity and the interpretation of changes, just as these properties cannot be expressed in a single gold standard value. The long-standing problems related to the lack of a gold standard for changes in HRQL, and the difficult interpretation of indices of validity and reliability, remain an important challenge for future research on instrument development and evaluation. In the meantime, with time and experience in using an instrument in different patient populations, evidence of validity will increase and interpretation will become easier.

Concluding remarks

What to do with responsiveness?

We identified four methodological issues that need to be considered in developing a useful evaluative HRQL instrument. These issues can be summarized as (1) defining the measurement goals of the instrument; (2) testing longitudinal validity; (3) testing longitudinal reproducibility; and (4) assessing the interpretability of score changes on the instrument. In theory, we think that everything we need to know in advance before using an evaluative HRQL instrument is included in one of these four issues. For this reason we see no need for an additional concept like responsiveness. In practice, we have demonstrated that all measures of what is currently called responsiveness can actually be looked at as measures of either longitudinal validity or the magnitude of the treatment effect. The latter tell us nothing about how well the instrument serves its purpose and are therefore only useful for assessing the interpretability of score changes. We feel it is safe

to conclude that there is no need for responsiveness as a separate measurement property of an evaluative instrument.

Acknowledgements

The authors would like to acknowledge the time and effort that one of the anonymous reviewers of Quality of Life Research spent on reviewing the previous versions of our manuscript. We would like to thank this reviewer for making substantive comments that were very helpful in clarifying the issues raised in this paper and that we believe improved the paper considerably.

References

1. Deyo RA, Inui TS. Toward clinical applications of health status measures: Sensitivity of scales to clinically important changes. Health Serv Res 1984; 19: 275–289.
2. Kirshner B, Guyatt G. A methodological framework for assessing health indices. J Chron Dis 1985; 38: 27–36.
3. Guyatt GH, Bombardier C, Tugwell PX. Measuring disease-specific quality of life in clinical trials. CMAJ 1986; 134: 889–895.
4. Guyatt G, Walter S, Norman G. Measuring change over time: Assessing the usefulness of evaluative instruments. J Chron Dis 1987; 40: 171–178.
5. Guyatt GH, Deyo RA, Charlson M, Levine MN, Mitchell A. Responsiveness and validity in health status measurement: A clarification. J Clin Epidemiol 1989; 42: 403–408.
6. Fitzpatrick R, Ziebland S, Jenkinson C, Mowat A. Importance of sensitivity to change as a criterion for selecting health status measures. Qual Health Care 1992; 1: 89–93.
7. Fitzpatrick R, Fletcher A, Gore S, Jones D, Spiegelhalter D, Cox D. Quality of life measures in health care. I: Applications and issues in assessment. Br Med J 1992; 305: 1074–1077.
8. Guyatt GH, Kirshner B, Jaeschke R. Measuring health status: What are the necessary measurement properties? J Clin Epidemiol 1992; 45: 1341–1345.
9. Fletcher A. Quality-of-life measurements in the evaluation of treatment: Proposed guidelines. Br J Clin Pharmac 1995; 39: 217–222.
10. Juniper EF, Guyatt GH, Jaeschke R. How to develop and validate a new health-related quality of life instrument. In: Spilker B (ed), Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven Publishers, 1996; 49–56.
11. Lohr KN, Aaronson NK, Alonso J, et al. Evaluating quality of life and health status instruments: Development of scientific review criteria. Clin Ther 1996; 18: 979–992.
12. Testa MA, Simonson DC. Assessment of quality of life outcomes. N Engl J Med 1996; 334: 835–840.

13. Guyatt GH, Naylor D, Juniper E, Heyland DK, Jaeschke R, Cook DJ. Users' guides to the medical literature. XII: How to use articles about health-related quality of life? JAMA 1997; 277: 1232–1237.
14. Dijkers M. Measuring quality of life. Am J Phys Med Rehabil 1999; 78: 286–300.
15. Guyatt GH, Jaeschke R, Feeny DH, Patrick DL. Measurement in clinical trials: Choosing the right approach. In: Spilker B (ed), Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven Publishers, 1996; 41–48.
16. Williams JL, Naylor CD. How should health status measures be assessed? Cautionary notes on procrustean frameworks. J Clin Epidemiol 1992; 45: 1347–1351.
17. Stratford PW, Binkley JM, Riddle DL. Health status measures: Strategies and analytic methods for assessing change scores. Phys Ther 1996; 76: 1109–1123.
18. Hays RD, Hadorn D. Responsiveness to change: An aspect of validity, not a separate dimension. Qual Life Res 1992; 1: 73–75.
19. Hays RD, Anderson RT, Revicki D. Assessing reliability and validity of measurement in clinical trials. In: Staquet MJ, Hays RD, Fayers PM (eds), Quality of Life Assessment in Clinical Trials: Methods and Practice. Oxford: Oxford University Press, 1998; 169–182.
20. Beaton DE, Hogg-Johnson S, Bombardier C. Evaluating changes in health status: Reliability and responsiveness of five generic health status measures in workers with musculoskeletal disorders. J Clin Epidemiol 1997; 50: 79–93.
21. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. New York: Oxford University Press, 1989.
22. Tuley MR, Mulrow CD, McMahan CA. Estimating and testing an index of responsiveness and the relationship of the index to power. J Clin Epidemiol 1991; 44: 417–421.
23. Murawski MM, Miederhoff PA. On the generalizability of statistical expressions of health-related quality of life instrument responsiveness: A data synthesis. Qual Life Res 1998; 7: 11–22.
24. McHorney CA. Methodological inquiries in health status assessment. Med Care 1998; 36: 445–448.
25. Norman GR, Stratford P, Regehr G. Methodological problems in the retrospective computation of responsiveness to change: The lesson of Cronbach. J Clin Epidemiol 1997; 50: 869–879.
26. Wright JG, Young NL. A comparison of different indices of responsiveness. J Clin Epidemiol 1997; 50: 239–246.
27. Beaton DE, Bombardier C, Katz JN, Wright JG. A taxonomy for responsiveness. J Clin Epidemiol 2001; 54: 1204–1217.
28. Stockler MR, Osoba D, Goodwin P, Corey P, Tannock IF. Responsiveness to change in health-related quality of life in a randomized clinical trial: A comparison of the Prostate Cancer Specific Quality of Life Instrument (PROSQOLI) with analogous scales from the EORTC QLQ-C30 and a trial specific module. J Clin Epidemiol 1998; 51: 137–145.
29. Kirkley A, Griffin S, McLintock H, Ng L. The development and evaluation of a disease-specific quality of life measurement tool for shoulder instability. Am J Sports Med 1998; 26: 764–772.

30. Stadnyk K, Calder J, Rockwood K. Testing the measurement properties of the Short Form-36 Health Survey in a frail elderly population. J Clin Epidemiol 1998; 51: 827–835.
31. O'Keeffe ST, Lye M, Donnellan C, Carmichael DN. Reproducibility and responsiveness of quality of life assessment and six minute walk test in elderly heart failure patients. Heart 1998; 80: 377–382.
32. Garratt AM, Ruta DA, Abdalla MI, Russell IT. Responsiveness of the SF-36 and a condition-specific measure of health for patients with varicose veins. Qual Life Res 1996; 5: 223–234.
33. Mohtadi N. Development and validation of the quality of life outcome measure (questionnaire) for chronic anterior cruciate ligament deficiency. Am J Sports Med 1998; 26: 350–359.
34. van der Windt DAWM, van der Heijden GJMG, de Winter AF, Koes BW, Devillé W, Bouter LM. The responsiveness of the Shoulder Disability Questionnaire. Ann Rheum Dis 1998; 57: 82–87.
35. Liang MH, Larson MG, Cullen KE, Schwartz JA. Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum 1985; 28: 542–547.
36. Giraudeau B, Ravaud P, Chastang C. Importance of reproducibility in responsiveness issues. Biomet J 1998; 40: 685–701.
37. Heald SL, Riddle DL, Lamb RL. The shoulder pain and disability index: The construct validity and responsiveness of a region-specific disability measure. Phys Ther 1997; 77: 1079–1089.
38. Bessette L, Sangha O, Kuntz KM, et al. Comparative responsiveness of generic versus disease-specific and weighted versus unweighted health status measures in carpal tunnel syndrome. Med Care 1998; 36: 491–502.
39. Reilly MC, Zbrozek AS. Assessing the responsiveness of a quality of life instrument and the measurement of symptom severity in essential hypertension. Pharmacoeconomics 1992; 2: 54–66.
40. Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Med Care 1989; 27: S217–S232.
41. Patrick DL, Martin ML, Bushnell DM, Yalcin I, Wagner TH, Buesching DP. Quality of life of women with urinary incontinence: Further development of the Incontinence Quality of Life Instrument (I-QOL). Urology 1999; 53: 71–76.
42. Deyo RA. Measuring functional outcomes in therapeutic trials for chronic disease. Control Clin Trials 1984; 5: 223–240.
43. Feinstein AR, Josephy BR, Wells CK. Scientific and clinical problems in indexes of functional disability. Ann Intern Med 1986; 105: 413–420.
44. Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health status measures. Control Clin Trials 1991; 12: 142S–158S.
45. Guyatt GH, Feeny DH, Patrick DL. Measuring health related quality of life. Ann Intern Med 1993; 118: 622–629.

46. Flemons WW, Reimer MA. Measurement properties of the Calgary Sleep Apnea Quality of Life Index. Am J Respir Crit Care Med 2002; 165: 159–164.
47. Anderson R, Rajagopalan R. Responsiveness of the dermatology-specific quality of life (DSQL) instrument to treatment for acne vulgaris in a placebo-controlled clinical trial. Qual Life Res 1998; 7: 723–734.
48. de Bruin AF, Diederiks JPM, de Witte LP, Stevens FCJ, Philipsen H. Assessing responsiveness of a functional status measure: The Sickness Impact Profile versus the SIP68. J Clin Epidemiol 1997; 50: 529–540.
49. Badia X, Podzamczer D, Casado A, Lopez-Lavid CC, Garcia M. Evaluating changes in health status in HIV-infected patients: Medical Outcomes Study-HIV and Multidimensional Quality of Life-HIV quality of life questionnaires. AIDS 2000; 14: 1439–1447.
50. Parkerson GR, Willke RJ, Hays RD. An international comparison of the reliability and responsiveness of the Duke Health Profile for measuring health-related quality of life of patients treated with alprostadil for erectile dysfunction. Med Care 1999; 37: 56–67.
51. Conner-Spady B, Cumming C, Nabholtz JM, Jacobs P, Stewart D. Responsiveness of the EuroQol in breast cancer patients undergoing high dose chemotherapy. Qual Life Res 2001; 10: 479–486.
52. Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: An analogy to diagnostic test performance. J Chron Dis 1986; 39: 897–906.
53. Husted JA, Cook RJ, Farewell VT, Gladman DD. Methods for assessing responsiveness: A critical review and recommendations. J Clin Epidemiol 2000; 53: 459–468.
54. Liang M. Longitudinal construct validity: Establishment of clinical meaning in patient evaluative instruments. Med Care 2000; 38: II-84–II-90.
55. Fletcher AE, Ellwein LB, Selvaraj S, Vijaykumar V, Rahmathullah R, Thulasiraj RD. Measurement of visual function and quality of life in patients with cataracts in southern India. Arch Ophthalmol 1997; 115: 767–774.
56. Moayyedi P, Duffett S, Braunholtz D, et al. The Leeds Dyspepsia Questionnaire: A valid tool for measuring the presence and severity of dyspepsia. Aliment Pharmacol Ther 1998; 12: 1257–1262.
57. Ware JE, Kemp JP, Buchner DA, Singer AE, Nolop KB, Goss TF. The responsiveness of disease-specific and generic health measures to changes in the severity of asthma among adults. Qual Life Res 1998; 7: 235–244.
58. Guyatt GH, King DR, Feeny DH, Stubbing D, Goldstein RS. Generic and specific measurement of health-related quality of life in a clinical trial of respiratory rehabilitation. J Clin Epidemiol 1999; 52: 187–192.
59. Anderson JJ, Firschein HE, Meenan RF. Sensitivity of a health status measure to short-term clinical changes in arthritis. Arthritis Rheum 1989; 32: 844–850.
60. Mangione CM, Goldman L, Orav EJ, et al. Health-related quality of life after elective surgery: Measurement of longitudinal changes. J Gen Intern Med 1997; 12: 686–697.
61. Stofmeel MA, Post MW, Kelder JC, Grobbee DE, van Hemel NM. Changes in quality-of-life after pacemaker implantation: Responsiveness of the Aquarel questionnaire. Pacing Clin Electrophysiol 2001; 24: 288–295.

62. Lipsey MW. A scheme for assessing measurement sensitivity in program evaluation and other applied research. Psychol Bull 1983; 94: 152–165.
63. Hemingway H, Stafford M, Stansfeld S, Shipley M, Marmot M. Is the SF-36 a valid measure of change in population health? Results from the Whitehall II study. Br Med J 1997; 315: 1273–1279.
64. Clark JA, Rieker P, Propert KJ, Talcott JA. Changes in quality of life following treatment for early prostate cancer. Urology 1999; 53: 161–168.
65. Wiebe S, Rose K, Derry P, McLachlan R. Outcome assessment in epilepsy: Comparative responsiveness of quality of life and psychosocial instruments. Epilepsia 1997; 38: 430–438.
66. Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Med Care 1990; 28: 632–642.
67. Shields RK, Ruhland JL, Ross MA, Saehler MM, Smith KB, Heffner ML. Analysis of health-related quality of life and muscle impairment in individuals with amyotrophic lateral sclerosis using the Medical Outcome Survey and the Tufts Quantitative Neuromuscular Exam. Arch Phys Med Rehabil 1998; 79: 855–862.
68. Chren MM, Lasek RJ, Quinn LM, Mostow EN, Zyzanski SJ. Skindex, a quality of life measure for patients with skin disease: Reliability, validity, and responsiveness. J Invest Dermatol 1996; 107: 707–713.
69. Perpina M, Belloch A, Marks GB, Martinez-Moragón E, Pascual LM, Compte L. Assessment of the reliability, validity, and responsiveness of a Spanish asthma quality of life questionnaire. J Asthma 1998; 35: 513–521.
70. Hjortswang H, Strom M, Almeida RT, Almer S. Evaluation of the RFIPC, a disease-specific health-related quality of life questionnaire, in Swedish patients with ulcerative colitis. Scand J Gastroenterol 1997; 32: 1235–1240.
71. Norman GR. Issues in the use of change scores in randomized trials. J Clin Epidemiol 1989; 42: 1097–1105.
72. Shaw JW, Coons SJ, Foster SA, Leischow SJ, Hays RD. Responsiveness of the Smoking Cessation Quality of Life (SCQoL) questionnaire. Clin Ther 2001; 23: 957–969.
73. Meenan RF, Anderson JJ, Kazis LE, et al. Outcome assessment in clinical trials: Evidence for the sensitivity of a health status measure. Arthritis Rheum 1984; 27: 1344–1352.
74. Stucki G, Liang MH, Fossel AH, Katz JN. Relative responsiveness of condition-specific and generic health status measures in degenerative lumbar spinal stenosis. J Clin Epidemiol 1995; 48: 1369–1378.
75. Terwee CB, Gerding MN, Dekker FW, Prummel MF, Wiersinga WM. Development of a disease-specific quality of life questionnaire for patients with Graves' ophthalmopathy: The GO-QOL. Br J Ophthalmol 1998; 82: 773–779.
76. Terwee CB, Gerding MN, Dekker FW, Prummel MF, van der Pol JP, Wiersinga WM. Test–retest reliability of the GO-QOL: A disease-specific quality of life questionnaire for patients with Graves' ophthalmopathy. J Clin Epidemiol 1999; 52: 875–884.

77. Ware JE, Sherbourne CD. The MOS 36-item Short-Form Health Survey (SF-36). I: Conceptual framework and item selection. Med Care 1992; 30: 473–483.
78. Ware JE, Kosinski M, Keller SD. SF-36 Physical and Mental Health Summary Scales: A User's Manual, 2nd ed. Boston, MA: The Health Institute, 1994.
79. Staquet MJ, Hays RD, Fayers PM. Quality of Life Assessment in Clinical Trials: Methods and Practice. New York: Oxford University Press, 1998.
80. Jaeschke R, Singer J, Guyatt GH. Measurement of health status: Ascertaining the minimal clinically important difference. Control Clin Trials 1989; 10: 407–415.
81. Juniper EF, Guyatt GH, Willan A, Griffith LE. Determining a minimal important change in a disease-specific quality of life questionnaire. J Clin Epidemiol 1994; 47: 81–87.
82. Ravaud P, Giraudeau B, Auleley GR, Edouard-Noël R, Dougados M, Chastang C. Assessing smallest detectable change over time in continuous structural outcome measures: Application to radiological change in knee osteoarthritis. J Clin Epidemiol 1999; 52: 1225–1230.
83. Jacobson NS, Truax P. Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol 1991; 59: 12–19.
84. Erickson P. Assessment of the evaluative properties of health status instruments. Med Care 2000; 38: II-95–II-99.
85. Moran LA, Guyatt GH, Norman GR. Establishing the minimal number of items for a responsive, valid, health-related quality of life instrument. J Clin Epidemiol 2001; 54: 571–579.
86. Lydick E, Epstein RS. Interpretation of quality of life changes. Qual Life Res 1993; 2: 221–226.
87. Lydick EG, Epstein RS. Clinical significance of quality of life data. In: Spilker B (ed), Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd ed. Philadelphia: Lippincott-Raven Publishers, 1996; 461–465.

88. Testa MA. Interpretation of quality of life outcomes: Issues that affect magnitude and meaning. Med Care 2000; 38: II-166–II-174.
89. Lohr KN. Health outcomes methodology symposium: Summary and recommendations. Med Care 2000; 38: II-194–II-208.
90. Beaton DE. Understanding the relevance of measured change through studies of responsiveness. Spine 2000; 25: 3192–3199.
91. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press, 1977.
92. Norman GR, Gwadry Sridhar F, Guyatt GH, Walter SD. Relation of distribution- and anchor-based approaches in interpretation of changes in health-related quality of life. Med Care 2001; 39: 1039–1047.
93. Katz JN, Larson MG, Phillips CB, Fossel AH, Liang MH. Comparative measurement sensitivity of short and longer health status instruments. Med Care 1992; 30: 917–925.
94. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989; 27: S178–S189.
95. Guyatt GH. Making sense of quality of life data. Med Care 2000; 38: II-175–II-179.
96. Terwee CB, Dekker FW, Mourits MPh, Gerding MN, Baldeschi L, Kalmann R, Prummel MF, Wiersinga WM. Interpretation and validity of changes in scores on the Graves' ophthalmopathy quality of life questionnaire (GO-QOL) after different treatments. Clin Endocrinol 2001; 54: 391–398.

Address for correspondence: Ms C.B. Terwee, Institute for Research in Extramural Medicine (EMGO Institute), Vrije Universiteit van Amsterdam, Academic Medical Center, Van der Boechorststraat 7, 1081 BT Amsterdam, The Netherlands Phone: +31-20-444-8187; Fax: +31-20-444-8181 E-mail: [email protected]