

Facial Surgery

Development and Validation of a New Clinically-Meaningful Rating Scale for Measuring Lateral Canthal Line Severity

Aesthetic Surgery Journal 32(3) 275–285 © 2012 The American Society for Aesthetic Plastic Surgery, Inc. Reprints and permission: http://www.sagepub.com/journalsPermissions.nav DOI: 10.1177/1090820X12437784 www.aestheticsurgeryjournal.com

Michael A. Kane, MD; Andrew Blitzer, MD, DDS; Fredric S. Brandt, MD; Richard G. Glogau, MD; Gary D. Monheit, MD; Rhoda S. Narins, MD; Jean A. Paty, PhD; and Jacob M. Waugh, MD

Abstract

Background: Several scales have been employed for evaluating the effects of cosmetic treatments in the periorbital area. The Food and Drug Administration (FDA) has recently issued new recommendations specifying a rigorous process to validate new aesthetic scales.

Objectives: The authors describe and validate a new clinical rating scale: the Investigator’s Global Assessment of Lateral Canthal Line (IGA-LCL) severity scale.

Methods: The new FDA recommendations were utilized to validate the new scale. The first step was concept elicitation (based on direct input from clinicians, patients, and literature) and evaluation of content validity (appropriateness of concepts). The resulting five-point scale provided detailed descriptions of the lateral canthal lines (LCL), including quantitative assessment of LCL length and depth. Performance parameters, including intra- and interrater reproducibility and construct validity, were then evaluated in clinical studies. Finally, the scale’s threshold for clinically-meaningful benefit and the ability of the scale to detect change were confirmed in two Phase 2b clinical studies involving a total of 270 subjects.

Results: Content validity was established and the IGA-LCL scale showed excellent intrarater reliability (weighted Kappa = 0.89) and interrater reliability (weighted Kappa = 0.77; Kendall’s coefficient of concordance = 0.89). In clinical trials, the scale was sensitive enough to detect clinically-meaningful one- and two-point changes in LCL severity following treatment with topical botulinum toxin type A (BoNT-A). The authors observed statistically-significant correlations between the physician-rated IGA-LCL results and patient-reported outcomes.

Conclusions: The IGA-LCL scale was shown to be reliable, appropriate, and clinically meaningful for measuring LCL severity.

Keywords: botulinum toxin, topical gel, lateral canthal lines, scale validation, facial surgery

Botulinum toxin type A (BoNT-A) blocks cholinergic neurotransmission by preventing acetylcholine release at peripheral neuromuscular junctions.1 Local injection of BoNT-A is effective for the temporary improvement of facial lines, including glabellar lines and lateral canthal lines (LCL), and results in minimal if any systemic exposure.2-6 Treatments for glabellar lines target a single muscle group (procerus and corrugator muscles) that produces lines at both maximal and neutral expression. Unlike the glabellar wrinkle pattern, however, LCL at neutral facial position (at rest) and LCL at maximum expression (smile) are produced by different muscle groups. Specifically, resting tone or spasm of the lateral orbicularis oculi muscles (the target of BoNT-A treatment) contributes to the lines at rest, while a combination of the zygomaticus, levator, and orbicularis oculi groups contributes to the wrinkles involved with smiling.7,8

Dr. Kane is an attending physician at Manhattan Eye, Ear & Throat Institute, New York, New York. Dr. Blitzer is Professor of Clinical Otolaryngology, Columbia University College of Physicians and Surgeons, New York, New York. Dr. Brandt is Director, Dermatology Research Institute, Coral Gables, Florida. Dr. Glogau is Clinical Professor of Dermatology, University of California, San Francisco, California. Dr. Monheit is Clinical Associate Professor, Departments of Dermatology and Ophthalmology, University of Alabama at Birmingham, Birmingham, Alabama. Dr. Narins is Clinical Associate Professor, Department of Dermatology, New York University School of Medicine, New York, New York. Dr. Paty is Senior Vice President, Scientific, Quality & Regulatory Affairs, invivodata, Inc., Pittsburgh, Pennsylvania. Dr. Waugh is Chief Scientific Officer and Medical Director, Revance Therapeutics, Inc., Newark, California.

Corresponding Author: Dr. Michael A. Kane, 115 East 67th Street, New York, NY 10065 USA.

Downloaded from aes.sagepub.com by guest on March 7, 2012


Table 1. Parameters of Lateral Canthal Line (LCL) Indications

| Recognized Clinical Condition | LCL at Smile | LCL at Rest |
| --- | --- | --- |
| Muscle involved–primary | Contraction of zygomaticus major | Spasm or tone of orbicularis oculi |
| Other muscles | Orbicularis oculi, levator anguli oris major, levator anguli oris minor | None |
| Typical age | Teen to thirties | Thirties and older |
| Stage of wrinkle progression | Glogau 2 | Glogau 3 |
| Contraction | Voluntary | Involuntary resting spasm |
| Patients seek treatment | Rarely | Majority seeking LCL treatment |
| Target clinical benefit according to literature | Glogau 1 | Glogau 2 (early Glogau 3) |
| Perception of LCL in psychological literature | Genuine, happy emotion | Increasing age |
| Assessment | LCA at maximum expression | LCA at neutral facial position |

Clinical literature has demonstrated that LCL at a neutral facial position and LCL at maximum expression are recognized conditions that can be distinguished from each other by target muscle, age and progression of condition, and nature of contraction (Table 1).7-13 The literature documents that improvement at rest is overwhelmingly sought by subjects and physicians. As detailed later, the literature also identifies the reasons which underlie the clear preference and the resulting need for a quantitative LCL grading scale at rest. Age-related progression of wrinkles has been described and categorized as a sequential progression of stages by Glogau.10 LCL present at smile but absent at rest (Glogau Stage 2) occur naturally beginning in the mid-teenage years and reflect a positive emotional connection that is not directly tied to a perception of age.10 Patients almost universally seek treatment only after they see lines while facial expression is at rest (Glogau Stage 3).9 Wrinkles at rest are tied to the perception of increasing age.7,9,10,14,15 As such, LCL at rest have been shown to be a major factor in the perception of facial age, and a 20% improvement in these LCL has been shown to be both readily detected and clinically significant.15 Clinical recommendations have evolved through years of off-label injection of BoNT-A in the LCL area and there has been a progressive decline in dose in order to achieve the current stated aesthetic goal of improving most (but not necessarily all) wrinkles apparent at rest while preserving nearly-normal LCL at smile.7,9,12,14,16-18 Consistent with these refinements, intradermal injections have evolved as a means of achieving improvement in wrinkles apparent at rest while not markedly impairing wrinkles at maximum voluntary contraction.13,19-22 Thus, it has become necessary to develop a grading scale specifically designed to quantify these changes at neutral facial expression. 
It has been recently confirmed that marked reduction in LCL at smile leads to negative self-perception as well as negative perception by others.14,17,18,22-24 Lack of LCL at smile is perceived as a “photo-smile,” or a posed or fake smile, which is cortical in origin, voluntary, and does not reflect genuine emotion.14,17,18,23,24 The majority of published scales have been validated for facial areas other than LCL, and those designed for LCL focus on smile or nonquantitative assessments.25-27 Given the importance of LCL at rest, we sought to develop a quantitative, appropriate, fit-for-purpose instrument for their assessment.

RT001 Botulinum Toxin Type A Topical Gel (RT001) is in development for the temporary reduction of LCL in the neutral, resting facial position. To evaluate RT001 for the treatment of LCL, we attempted to develop a new physician-rated scale for the assessment of LCL severity. We followed updated guidelines from the Food and Drug Administration (FDA) that define and standardize qualification and validation of clinical rating scales. Specifically, these guidelines define relevant measurement principles for clinical outcome assessments (COA) in clinical trials of new drugs. These COA include clinician-reported outcome (ClinRO) measures such as the Investigator’s Global Assessment of Lateral Canthal Line (IGA-LCL), patient-reported outcome (PRO) measures, and observer-reported outcome (ObsRO) measures. All COA are intended to follow a similar development and validation process, as first outlined in 2009 in the “FDA Guidance for Industry: Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims” (FDA PRO Guidance, 2009). To be utilized as a clinical outcome assessment per FDA recommendations, a scale should be shown in a sequential process to be (1) valid in terms of content; (2) reliable; (3) valid in terms of construct; (4) able to detect clinical change; and (5) able to establish a threshold for treatment benefit.
Content validity is established if the scale is “appropriate and comprehensive relative to the intended measurement concept, population, and use” (FDA PRO Guidance, 2009). Two components establish content validity: (1) identification of the relevant key concepts to measure based on the literature, clinician input, and direct patient input (through interviews termed “concept elicitation”) and (2) demonstration that intended users understand the instrument and what it is attempting to measure, which typically occurs through formal Clinical Advisory Board (CAB) evaluation or structured interviews, termed “cognitive debriefing” for subjects.

Once content validity has been established, several measurement properties of the scale must be evaluated through clinical studies. Reliability refers to the reproducibility of a measure and is typically established through high intra- and interobserver correlation values. Kappa values are used to assess concordance and range between 0 (no agreement) and 1 (absolute agreement). It is generally accepted that < 0.20 designates poor agreement, 0.21-0.40 shows fair agreement, 0.41-0.60 shows moderate agreement, 0.61-0.80 shows good agreement, and 0.81-1.00 shows very good (near perfect) agreement.28 Construct validity is a critical component of validation; it is represented by confirmation that a scale measures what it is designed to measure. Construct validity is typically demonstrated through high correlations to other scales measuring similar concepts. The ability to detect change can be evaluated by looking at posttreatment changes and is established by an ability to detect treatment differences significantly and consistently. Finally, the threshold for treatment benefit is assessed through performance versus clinical benchmark scales. This full process has been followed in development and validation of the IGA-LCL, with details of each step provided below.
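The Kappa bands quoted above can be expressed as a small lookup, which may help when reading the site-level values reported later. This is only an illustrative sketch: the function name `agreement_band` is hypothetical, and because the text leaves the band containing exactly 0.20 unspecified, the sketch assumes each upper bound is inclusive.

```python
def agreement_band(kappa: float) -> str:
    """Return the verbal agreement band for a Kappa value.

    Bands follow the ranges quoted in the text. Treating each upper
    bound as inclusive (e.g. 0.20 -> "poor") is an assumption.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("Kappa must lie between -1 and 1")
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


if __name__ == "__main__":
    # The two headline reliability values reported in the abstract.
    for value in (0.89, 0.77):
        print(value, agreement_band(value))
```

Under these bands, the abstract's intrarater value of 0.89 falls in the "very good" range and the interrater value of 0.77 in the "good" range.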

Methods

Ethics and Clinical Study Design

Clinical studies adhered to the ethical principles of the Declaration of Helsinki, 21 CFR Part 50, and the International Conference on Harmonisation of Good Clinical Practice guidelines. Clinical protocols were approved by the Institutional Review Board/Independent Ethics Committee of each participating study center. Written informed consent was obtained from all subjects prior to performance of any study-related procedure. Healthy male and female subjects between 18 and 65 years of age were enrolled in all clinical studies. Subjects were required to have bilateral (both eyes) LCL graded as either moderate (3) or severe (4) at rest based on the IGA-LCL Severity Scale (ratings of 0-4 as detailed in Table 2). Patients received up to 0.5 mL of RT001 or control applied to each lateral canthal area (LCA) for 30 minutes. A nonadhesive occlusive dressing was utilized to ensure that patients did not inadvertently transfer the drug during the dwell time. A proprietary cleansing step was completed after the dwell time to remove and inactivate residual RT001. IGA-LCL assessments were performed by investigators who had received training in application of the scale, including an evaluator tool to measure LCL length (Figure 1).

Table 2. Investigator’s Global Assessment of Lateral Canthal Line Severity

| Rating Score | Wrinkle Severity at Rest | Description |
| --- | --- | --- |
| 0 | Absent | No visible wrinkles |
| 1 | Minimal | Minimal wrinkles, within 1.5-cm radius of the lateral canthus and may be minimally etched |
| 2 | Mild | Shallow wrinkles, extending between 1.5- to 2.5-cm radius of the lateral canthus and minimally etched |
| 3 | Moderate | Moderately deep wrinkles, extending between 1.5- to 2.5-cm radius of the lateral canthus and moderately etched |
| 4 | Severe | Very long wrinkles, extending beyond 2.5-cm radius of the lateral canthus and may be deeply etched |

Figure 1.  Caliper for assessment of mean lateral canthal line length.

Concept Elicitation and Initial Development of a New Clinical Scale for Evaluating the Severity of LCL

The new clinical rating scale for the assessment of LCL was initially developed through in-depth interviews with an advisory board of physicians who were experienced in the evaluation and treatment of LCL as well as scale design. The goal of these interviews was to obtain expert opinion to assist in defining outcome measures for the evaluation of LCL treatments, particularly to define primary and secondary attributes of LCL. With the guidance of CAB meetings in 2008, 2009, and 2010, which reviewed and refined the concepts being measured and the specific wording of the IGA-LCL, a new clinician rating instrument was developed to evaluate the severity of LCL when patients are in a neutral facial position (at rest). Patient perspectives on outcome measures for the evaluation of LCL treatments were elicited through interviews with 31 BoNT-A-naive patients, representing a diverse range of age, gender, ethnicity, and LCL severity. The concepts elicited in these studies corresponded to those that formed the basis of an initial IGA scale. This scale underwent refinement in exploratory studies with content validation in patient and subject interviews.

Intrarater and Interrater Reliability of the IGA-LCL Scale

Intrarater reliability was assessed through five Phase 2 clinical studies which prospectively assessed intraobserver correlations across a total of 17 investigators and 451 subjects (RT001-CL010LCL, RT001-CL012LCL, RT001-CL015LCL, RT001-CL017LCL, and RT001-CL024LCL). Evaluation of intrarater reliability was based on the comparison of pretreatment IGA-LCL scores recorded by trained investigators at two separate study visits two weeks apart on live subjects. Photographs were not used as a basis for these assessments.

Interrater reliability was evaluated through two interobserver correlation studies (RT001-CL022LCL and RT001-MK001). In the first study, ratings from two pairs of investigators and subinvestigators were compared across 31 subjects. A second study (RT001-MK001) was then undertaken to evaluate interrater correlations across a broader range of investigators. It compared results from eight trained investigators, each of whom evaluated LCL on both sides of the face in ten in-person models spanning all five grades. Again, photographs were not used as a basis for these assessments. Each investigator rated each model once and all investigators rated all models.
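Because every investigator rated every model in RT001-MK001, the ratings form a complete raters-by-items grid, which is the layout Kendall's coefficient of concordance (named under Statistical Methods) expects. The sketch below shows one standard way to compute it, with the usual correction for tied ranks, since a five-grade scale inevitably produces ties. It is not the authors' code; the function names and the toy inputs are illustrative assumptions.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from 1..n, giving tied scores the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for ratings[r][i] = grade given by rater r to item i.

    Uses the tie-corrected form
    W = 12*S / (m^2 * (n^3 - n) - m * sum(t^3 - t)).
    """
    m = len(ratings)
    n = len(ratings[0])
    if any(len(row) != n for row in ratings):
        raise ValueError("every rater must rate every item")
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    ties = sum(t**3 - t for row in ratings for t in Counter(row).values())
    denominator = m**2 * (n**3 - n) - m * ties
    if denominator == 0:
        raise ValueError("W is undefined when every rater ties all items")
    return 12 * s / denominator


if __name__ == "__main__":
    # Toy data (not from the study): three raters in full agreement.
    print(kendalls_w([[0, 1, 2, 3, 4]] * 3))
```

W ranges from 0 (no concordance among raters) to 1 (identical rankings), so it complements pairwise Kappa when more than two raters are compared at once.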

Clinical Relevance and Construct Validity

Given the importance of patient-based outcome measures in any aesthetic indication, correlations between the investigator-rated IGA-LCL and the responses on a PRO scale (the Patient Global Impression of Change [PGIC]) were examined to evaluate construct validity in two clinical trials (RT001-CL017LCL and RT001-CL024LCL). Correlations in severity scores between the IGA-LCL and a patient self-rated static score of severity (the Patient Severity Assessment [PSA]) were also examined in study RT001-CL024LCL. The PGIC and PSA were both developed and tested through the in-depth interviews with 31 BoNT-A-naive patients. These instruments assessed patients’ perceived improvement (PGIC) and severity (PSA) in their LCL. PGIC was based on a seven-point scale (much improved, improved, a little improved, no change, a little worse, worse, much worse), while PSA mirrored the IGA-LCL as a five-point scale (absent, minimal, mild, moderate, and severe). Both scales encompassed concepts similar to those of the IGA-LCL and thus represented appropriate benchmarks for clinical relevance and construct validity.

Ability to Detect Change

The suitability of the refined IGA-LCL scale for measuring treatment response was tested in two clinical studies. In study RT001-CL017LCL, 180 subjects were randomized in a one-to-one ratio to RT001 or control. In study RT001-CL024LCL, 90 subjects were randomized one-to-one to RT001 or placebo. Treatment effect was evaluated in both studies at week four after a single bilateral application of RT001.

Threshold for Clinically-Meaningful Response

During the interviews conducted for the development of the IGA-LCL, physicians and patients were asked to describe the criteria for successful treatment. Clinically-meaningful responses to treatment were described in terms of outcome on the respective clinical scales. Determination of clinical relevance and definition of meaningful response thresholds were then confirmed by anchor-weighted measures to patient self-perception of change.

Statistical Methods

Intra- and interrater reliability were evaluated through Kappa statistics of agreement. Rows and columns for the interrater Kappa analysis included half-integer scores. Kendall’s coefficient of concordance was also calculated for the interrater reliability analysis with multiple raters. Intrarater reliability was evaluated through Kappa statistics for each rater’s two scores for the same subject. For construct validity and for establishment of the threshold for clinical benefit, correlations between the IGA-LCL scale results and subject (PGIC) results were calculated with Pearson and Spearman correlations. For determination of the ability to detect change, each LCA was evaluated separately. Responders were defined as those who improved at or beyond the designated threshold (an improvement of one or two points) in both eyes versus baseline. Responder rates between each RT001 group and the control group were compared with the CMH chi-square test or, if 25% or more of the expected cell frequencies were less than five, with Fisher’s exact test.
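For readers unfamiliar with the weighted Kappa statistic used here, the following is a minimal sketch of how it can be computed from two sets of ratings on an ordered scale. The source does not state which weighting scheme was used, so the choice of linear (or quadratic) disagreement weights is an assumption, as are the function name and the toy ratings. Passing the ordered category list explicitly allows half-integer scores to be included as extra rows and columns.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weight="linear"):
    """Weighted Cohen's Kappa for two raters on an ordered scale.

    categories: ordered list of all allowed scores (may include
    half-integer grades). weight: "linear" or "quadratic" (assumed
    options; the paper does not name its scheme).
    """
    k = len(categories)
    n = len(ratings_a)
    if k < 2 or n == 0 or n != len(ratings_b):
        raise ValueError("need two equal-length, non-empty rating lists")
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weight == "linear" else d * d

    obs = sum(disagreement(i, j) * observed[i][j] for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("Kappa is undefined when both raters use one category")
    return 1 - obs / exp


if __name__ == "__main__":
    grades = [0, 1, 2, 3, 4]
    visit_1 = [0, 1, 2, 3, 4]  # toy ratings, not study data
    visit_2 = [0, 1, 2, 3, 3]
    print(round(weighted_kappa(visit_1, visit_2, grades), 3))
```

The weighting is what distinguishes this from unweighted Kappa: a one-grade disagreement is penalized less than a two-grade disagreement, which suits an ordinal severity scale.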

Results

Concept Identification and Content Validity

The identification of important and relevant concepts for evaluating LCL was based upon three activities: a literature review, clinician input, and patient input. For clinician input, independent interviews were initially conducted with physicians experienced with LCL evaluation and treatment. Collectively, the physicians confirmed that both depth and length are the primary aspects of LCL, with depth being the most important. While two of the physicians considered the number of LCL as an additional indirect attribute, all agreed that this attribute should be strongly deemphasized as a relevant treatment factor to consider when evaluating the effectiveness of LCL treatments. This conclusion relied upon the fact that most available treatments target direct characteristics of the wrinkles themselves (ie, depth and length) rather than the number.

For patient input, concept elicitation interviews were conducted with 31 BoNT-A-naive patients with varying LCL severity ratings, including patients ranging from no LCL to severe LCL. Results from the concept elicitation interviews indicated that patients with varying LCL severity considered depth and length the most important and relevant concepts related to treatment of their LCL. Results also suggested that patients thought about changes in each of these attributes individually and in combination when considering posttreatment LCL improvement. The majority of the 31 subjects mentioned different features of their LCL and all subjects referred to more than one of these features. In interviews, they spontaneously identified depth (47%), length (33%), and quantity (23%) as important attributes. When prompted by the interviewer, 81%, 68%, and 58% of all patients reported depth, length, and quantity as relevant concepts for their LCL, respectively. In order to focus the development process of the instrument on the population enrolled in the Revance (Revance Therapeutics, Inc.; Newark, CA) Phase 3 clinical trials, a specific analysis was conducted on a subset of 14 subjects with moderate to severe LCL, as scored by live investigator assessments. Similar to the full sample, these 14 subjects focused on the depth and length of their LCL and, to a lesser extent, a number of qualitative descriptors related broadly to skin quality. Again, the most important concepts that emerged with regard to LCL for the majority of subjects were depth and length.
Of note, 100% of subjects confirmed that they would have answered the same if “severity” had been replaced with the phrase “depth and length” as a term in the questionnaire. CAB meetings were convened in 2008, 2009, and 2010 to review and (if needed) refine the concepts being measured and the concomitant wording of the IGA-LCL. Each CAB was composed of clinicians who were experienced in treating LCL with BoNT-A. Each CAB confirmed that depth and length of LCL were central to clinical assessment of their severity. In 2012, the CAB members reviewed the final IGA-LCL instrument for content validity and unanimously agreed that the scale was appropriate, with no objections or additional recommendations voiced.

Reliability

Evaluation of intrarater reliability (scores from the same rater on two different occasions) was based on the comparison of severity assessments two weeks apart and is summarized in Table 3. The overall Kappa estimate was 0.89. Interrater reliability was evaluated in two studies.

Table 3. Intrarater Reliability of Investigator’s Global Assessment of Lateral Canthal Line Scale

| Site | Lateral Canthal Areas Assessed, No. | Exact Matches, No. (%) | Scores Differing by 1, No. (%) | Scores Differing by 2, No. (%) | Kappa(a) |
| --- | --- | --- | --- | --- | --- |
| 007 | 20 | 20 (100) | 0 (0) | 0 (0) | 1.0000 |
| 008 | 78 | 76 (97.4) | 2 (2.6) | 0 (0) | 0.9484 |
| 009 | 36 | 36 (100) | 0 (0) | 0 (0) | 1.0000 |
| 010 | 24 | 24 (100) | 0 (0) | 0 (0) | 1.0000 |
| 013 | 36 | 22 (61.1) | 14 (38.9) | 0 (0) | 0.2500 |
| 014 | 108 | 105 (97.2) | 3 (2.8) | 0 (0) | 0.9423 |
| 016 | 60 | 60 (100) | 0 (0) | 0 (0) | 1.0000 |
| Overall | 362 | 343 (94.8) | 19 (5.2) | 0 (0) | 0.8945 |

(a) Weighted and unweighted Kappa were identical.

Detailed results for the study using eight investigators are presented in Table 4; this study evaluated live models selected to represent all five grades. The weighted Kappa estimate for this study was 0.77. These results parallel those of the initial interrater validation study, which employed only pairs of investigators and arrived at a weighted Kappa of 0.81 for interobserver agreement.

Clinical Relevance and Construct Validity

When IGA-LCL results were correlated with patient-reported assessments of PGIC improvement in LCL severity, both Spearman and Pearson correlation coefficients showed a statistically-significant agreement between the IGA-LCL and the PGIC scales (r = 0.3317-0.3972, P = 0.048-0.0006 for Pearson correlation; r = 0.3697-0.4673, P = 0.027 to P < .0001 for Spearman correlation). In general, the results indicated that when clinicians evaluated a positive change in LCL, patients also perceived improvement in their LCL. Likewise, when physicians reported no change, lower levels of patient-reported improvements were also observed. The clinical relevance of improvement on IGA-LCL was confirmed by a traditional anchor-based approach, which correlated the IGA-LCL to the anchor (standard) of PGIC as a standard for aesthetic outcome.29,30 Correlation between IGA-LCL and PGIC change was extremely high (Table 5), with Spearman correlations of r = 0.70 for right-eye IGA change to PGIC (P < .0001) and r = 0.73 for left-eye IGA change to PGIC (P < .0001). Responders at the levels of “Improved, or Much Improved” on PGIC had a two-point or greater IGA improvement at the proposed Phase 3 dose (RT001-CL017LCL and RT001-CL024LCL) in 88% of LCA (49 of 56; Table 6). Responders at the levels of “A Little Improved, Improved, or Much Improved” on PGIC had one-point or greater IGA improvement in 88% of LCA (77 of 88; Table 6).


Table 4. Interrater Reliability of Investigator’s Global Assessment of Lateral Canthal Line Scale (RT001-MK001)

| Site | Lateral Canthal Areas Assessed, No. | Exact Matches, No. (%) | Scores Differing by 1, No. (%) | Scores Differing by 2, No. (%) | Kappa | Weighted Kappa |
| --- | --- | --- | --- | --- | --- | --- |
| 007/009 | 16 | 14 (87.5) | 2 (12.5) | 0 (0) | 0.8261 | 0.9126 |
| 013 | 16 | 6 (37.5) | 10 (62.5) | 0 (0) | 0.2271 | 0.5855 |
| 014 | 16 | 11 (68.8) | 4 (25.0) | 1 (6.3) | 0.6117 | 0.7405 |
| 009 | 16 | 13 (81.3) | 3 (18.8) | 0 (0) | 0.7333 | 0.8696 |
| 015 | 16 | 7 (43.8) | 9 (56.3) | 0 (0) | 0.2727 | 0.5955 |
| 016 | 16 | 15 (93.8) | 1 (6.3) | 0 (0) | 0.9179 | 0.9592 |
| 010 | 16 | 14 (87.5) | 2 (12.5) | 0 (0) | 0.8333 | 0.9116 |
| 017 | 16 | 7 (43.8) | 9 (56.3) | 0 (0) | 0.2727 | 0.5814 |
| Overall | 128 | 87 (68.0) | 40 (31.3) | 1 (0.8) | 0.5795 | 0.7717 |

Table 5. PGIC Score Relationship at Week 4 With IGA-LCL Change Scores Between Baseline and Week 4, No. (%)(a)

| PGIC | Left: 0 | Left: 1 | Left: 2 | Left: 3 | Right: 0 | Right: 1 | Right: 2 | Right: 3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Much worse | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| Worse | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| A little worse | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) | 0 (0.0) |
| No change | 30 (34.1) | 10 (11.4) | 4 (4.5) | 0 (0.0) | 25 (28.4) | 14 (15.9) | 5 (5.7) | 0 (0.0) |
| A little improved | 6 (6.8) | 4 (4.5) | 6 (6.8) | 0 (0.0) | 5 (5.7) | 6 (6.8) | 5 (5.7) | 0 (0.0) |
| Improved | 0 (0.0) | 4 (4.5) | 10 (11.4) | 1 (1.1) | 0 (0.0) | 3 (3.4) | 11 (12.5) | 1 (1.1) |
| Much improved | 0 (0.0) | 0 (0.0) | 8 (9.1) | 5 (5.7) | 0 (0.0) | 0 (0.0) | 9 (10.2) | 4 (4.5) |

Column groups are IGA-LCL change scores (points of improvement) for the left and right sides. Spearman correlation: left, r = 0.73, P ≤ .0001; right, r = 0.70, P ≤ .0001.

(a) PGIC, Patient Global Impression of Change; IGA-LCL, Investigator’s Global Assessment of Lateral Canthal Line.

Similarly, 80% of subjects with two-point or greater bilateral IGA improvement attained PGIC scores of “Improved, or Much Improved.” Perhaps more importantly, 79% of responders with one-point or greater bilateral IGA improvement exhibited the generally-recognized definition of a clinically-relevant improvement threshold (“A Little Improved, Improved, or Much Improved”) on PGIC. Thus, in the majority of subjects, clinical relevance by improvement on the validated PGIC corresponded with improvement on IGA-LCL and vice versa. The pattern and magnitude of the Spearman correlations between the scores for the IGA-LCL and a subject rating of severity measuring a similar concept (PSA) were examined in study RT001-CL024LCL. As expected, there was a positive relationship between the two instruments. The IGA-LCL demonstrated substantial agreement with PSA scores (right side, Kappa = 0.80; left side, Kappa = 0.76).

Ability to Detect Change

Treatment response was examined in two Phase 2b clinical studies utilizing the final instruments (RT001-CL017LCL and RT001-CL024LCL). Results are summarized in Table 6. Statistically-significant changes from baseline were observed at both one-point (75.7%) and two-point (51.5%) responder thresholds versus placebo controls. Of note, placebo rates with the IGA-LCL were notably low for topical products and for an aesthetic scale, with rates of 22.0% for one-point and 10.6% for two-point thresholds. Response on IGA-LCL was mirrored by response on other measures (PSA and PGIC).

Table 6. Lateral Canthal Areas With Improvement in Lateral Canthal Line Severity at Rest From Baseline(a)

| Day | Improvement on IGA-LCL | RT001 25 ng/mL (n = 136) | Control (n = 132) | P |
| --- | --- | --- | --- | --- |
| 28 | ≥ 1 point | 103 (75.7%) | 29 (22.0%) | < .0001 |
| 28 | ≥ 2 points | 70 (51.5%) | 14 (10.6%) | < .0001 |

(a) IGA-LCL, Investigator’s Global Assessment of Lateral Canthal Line. P value from CMH.

Table 7. IGA-LCL Mean Change Scores From Baseline to Week 4 by PGIC Response(a)

| PGIC Group | IGA (Left Side), Mean Change at 4 Weeks | IGA (Right Side), Mean Change at 4 Weeks | Mean of IGA, Mean Change at 4 Weeks |
| --- | --- | --- | --- |
| Much improved | −2.38 (n = 13) | −2.31 (n = 13) | −2.345 (n = 26) |
| Improved | −1.80 (n = 15) | −1.87 (n = 15) | −1.835 (n = 30) |
| A little improved | −1.00 (n = 16) | −1.00 (n = 16) | −1.000 (n = 32) |
| No change | −0.41 (n = 44) | −0.55 (n = 44) | −0.480 (n = 88) |
| A little worse | (n = 0) | (n = 0) | (n = 0) |
| Worse | (n = 0) | (n = 0) | (n = 0) |
| Much worse | (n = 0) | (n = 0) | (n = 0) |
| Overall P | < .0001 | < .0001 | < .0001 |

(a) IGA-LCL, Investigator’s Global Assessment of Lateral Canthal Line; PGIC, Patient Global Impression of Change.
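The responder rule from the Statistical Methods and the percentages in Table 6 can be illustrated with a short sketch. The first function encodes the stated rule that a subject responds only if both eyes improve by at least the threshold; the second recomputes the Table 6 percentages from the reported counts, which are given per lateral canthal area. Function and variable names are illustrative assumptions, and no significance test is attempted because the CMH analysis depends on strata that are not reported here.

```python
def is_bilateral_responder(baseline, week4, threshold):
    """True if BOTH sides improved by at least `threshold` points.

    baseline and week4 are (left, right) IGA-LCL grades; a lower grade
    at week 4 means improvement.
    """
    return all(b - w >= threshold for b, w in zip(baseline, week4))


def rate_pct(responders, total):
    """Responder rate as a percentage rounded to one decimal place."""
    return round(100 * responders / total, 1)


# Counts of improved lateral canthal areas as reported in Table 6:
# (number improved, number assessed).
TABLE_6 = {
    ">=1 point": {"rt001": (103, 136), "control": (29, 132)},
    ">=2 points": {"rt001": (70, 136), "control": (14, 132)},
}

if __name__ == "__main__":
    for threshold, arms in TABLE_6.items():
        active = rate_pct(*arms["rt001"])
        control = rate_pct(*arms["control"])
        print(threshold, active, control, round(active - control, 1))

    # Hypothetical subject: severe/moderate at baseline, mild on both sides at week 4.
    print(is_bilateral_responder((4, 3), (2, 2), threshold=1))
```

The recomputed rates match the published values (75.7% vs 22.0% and 51.5% vs 10.6%), giving absolute differences of roughly 54 and 41 percentage points at the one-point and two-point thresholds.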

Threshold Determination for Clinically-Meaningful Response Increments

Physician interviews conducted during the development of the IGA-LCL and PGIC scales determined the level of clinically-meaningful change for each scale in both the clinical study environment and clinical practice. For the IGA-LCL scale, a bilateral change (improvement) of two points from baseline to four weeks posttreatment was considered to be a very robust change by the physicians. However, clinicians noted that a one-point improvement in wrinkles at rest was a meaningful clinical improvement, which is consistent with clinical practice as well as the majority of clinical literature on LCL.2,3,6 Consistent with the investigators’ perspective, patient interviews demonstrated that patients considered a one-point improvement in wrinkle severity at four weeks posttreatment to be a meaningful improvement that made the treatment worthwhile. Correlation to PGIC confirmed these results for IGA-LCL, as summarized in Tables 5 and 7.
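The link between Tables 5 and 7 can be made explicit with a short worked calculation. The sketch below takes the cell counts from Table 5 and recomputes the mean IGA-LCL change for each PGIC group, along with the proportions of lateral canthal areas meeting the one- and two-point thresholds. The "No change" row is taken to be the Table 5 row whose counts sum to 44 per side, matching the group size in Table 7. All names are illustrative.

```python
# Table 5 counts per PGIC group for IGA-LCL improvements of 0, 1, 2, and 3 points.
TABLE_5 = {
    "left": {
        "No change": (30, 10, 4, 0),
        "A little improved": (6, 4, 6, 0),
        "Improved": (0, 4, 10, 1),
        "Much improved": (0, 0, 8, 5),
    },
    "right": {
        "No change": (25, 14, 5, 0),
        "A little improved": (5, 6, 5, 0),
        "Improved": (0, 3, 11, 1),
        "Much improved": (0, 0, 9, 4),
    },
}


def mean_change(counts):
    """Mean change score; negative values denote improvement, as in Table 7."""
    total = sum(counts)
    return -sum(points * n for points, n in enumerate(counts)) / total


def areas_meeting_threshold(groups, min_points):
    """Count areas (both sides pooled) in `groups` improving >= min_points."""
    hits = total = 0
    for side in TABLE_5.values():
        for group in groups:
            counts = side[group]
            hits += sum(counts[min_points:])
            total += sum(counts)
    return hits, total


if __name__ == "__main__":
    for side, groups in TABLE_5.items():
        for group, counts in groups.items():
            print(side, group, round(mean_change(counts), 2))
    print(areas_meeting_threshold(["Improved", "Much improved"], 2))
    print(areas_meeting_threshold(
        ["A little improved", "Improved", "Much improved"], 1))
```

The recomputed means agree with Table 7 in magnitude for every group, and the right-side value for "A little improved" comes out as −1.00, the same as the left side. The pooled counts reproduce the 49 of 56 and 77 of 88 figures quoted in the construct validity results, both of which round to 88%.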

Table 8. Concept Support

| Exclusive Lateral Canthal Line | Literature | Clinician Input | Patient Input |
| --- | --- | --- | --- |
| Length | Yes | Yes | Yes |
| Depth | Yes | Yes | Yes |

Discussion

An indication for a drug must be based on a recognized clinical need. For an aesthetic indication, the condition to be treated is an aesthetic problem and the desire that initially caused the patient to seek treatment. It is well known that patients seek treatment once wrinkles are present at rest (Glogau Stage 3).10 Clinically, the term “crow’s feet” refers to LCL present at rest from the perspective of the typical patient’s desire for treatment. For the reasons detailed above, it is now generally held that most patients desire an aesthetic outcome that reflects a more youthful look with a normal, genuine smile (with some LCL), but reduced wrinkles at rest. Published scales for evaluating facial line severity have used both static (at rest) and dynamic (at frown or at smile) assessments.25,26,31,32 The IGA-LCL was developed to reliably and appropriately assess LCL in a resting neutral facial position. This position allows the most direct evaluation of the action of the drug on its target muscle, the orbicularis oculi, and thus provides an appropriate and specific means of evaluating the product. Development of all future clinical scales will likely follow the newly-emerging FDA recommendations mentioned earlier. As detailed in our research, the IGA-LCL was developed, refined, and validated following the processes specified by these recommendations. First, concept elicitation was undertaken and content validity was established. The identification of important and relevant concepts for evaluating LCL was based on a literature review, clinician input, and patient input. Two concepts (depth and length of LCL) consistently emerged as the central focus for both physicians and patients when considering severity of LCL (Table 8). Since neither could be omitted for a clinically-relevant scale, these became the basis of a quantitative scale development effort. Simple instrumentation (calipers) can be employed to standardize length.
While depth is more difficult to measure, it remains essential to include, so the decision was made that the clinicians would rely on their breadth of clinical experience and acumen to separate depth into one of two simple categories: shallow or deep. Training was provided to highlight features that defined the two categories (ridging, heaping at rest, shadowing) as well as features that could be confused with depth but, in fact, were not. Since length could be more objectively measured, the length of the LCA was tiered into three categories based on a physical measurement performed as a live assessment. Both length by category and depth by category are separately evaluated as part of the clinician assessment. The combination of the two attributes dictates a unique, nonoverlapping grade, as illustrated in Table 9. The attributes are organized to ensure that a subject’s LCL must improve


Table 9. Combination of Length and Depth Affords a Unique Investigator’s Global Assessment of Lateral Canthal Line Grade

Length categories, cm: > 2.5; 1.5-2.5; < 1.5
Depth categories and grades: Deep, grades 4 and 3; Shallow, grades 2 and 1; Absent, grade 0

in both length and depth in order to achieve a two-point improvement from their baseline scores of moderate or severe. Pre- and posttreatment photographs depicting these types of changes in average length and depth of lines are provided in Figure 2.

Following identification and justification of the concepts to be measured, the single-item IGA-LCL was developed based on a literature review, along with clinician and patient input. The resulting instruments were then evaluated by clinicians and patients to demonstrate that there was a clear understanding that they measured what was intended. This evaluation confirmed the content validity of the IGA-LCL.

After confirmation of content validity, traditional validation studies were undertaken to evaluate scale reliability through intra- and interobserver correlations. Statistical estimates of consistency assessed the degree of agreement between different individuals (interobserver) and the reproducibility of response by the same individual (intraobserver). Kappa statistics were applied to assess concordance for the IGA-LCL. Evaluation of intrarater reliability (the same rater on two different occasions) was based on the comparison of screening and baseline severity assessments.
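As an illustration of the Kappa statistics described above, the following is a minimal sketch of a weighted Cohen's Kappa for two sets of ordinal ratings (for example, one rater's screening and baseline grades). The function name `weighted_kappa`, the choice of linear disagreement weights, and the sample ratings are assumptions made for illustration; the weighting scheme actually used in the studies is not stated here.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two equal-length lists of ordinal ratings.

    `categories` lists every possible grade in ascending order.
    `weighting` is "linear" or "quadratic" (an illustrative choice, not
    taken from the study reports).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {grade: i for i, grade in enumerate(categories)}
    n = len(ratings_a)

    # Observed joint proportions for every pair of grades.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n

    # Marginal proportions for each set of ratings.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d * d

    # Weighted disagreement actually observed vs. expected by chance.
    obs = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    exp = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    if exp == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1 - obs / exp


# Hypothetical grades on the five-point scale (0-4) for ten subjects.
screening = [3, 3, 4, 4, 3, 4, 2, 3, 4, 3]
baseline = [3, 3, 4, 3, 3, 4, 2, 3, 4, 4]
print(round(weighted_kappa(screening, baseline, [0, 1, 2, 3, 4]), 2))
```

With weighting, a one-grade disagreement is penalized less than a disagreement spanning several grades, which suits an ordinal severity scale.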

Kappa estimates of 0.89 and 0.88 (derived from the same rater on two different occasions) demonstrated very good intrarater reliability across a total of 17 raters and 451 subjects. Since subjects enrolled in these studies were required to be moderate or severe at baseline, the intrarater correlations were anticipated to be relatively high. Nonetheless, there was very good agreement within raters in all studies that implemented the IGA-LCL.

Two studies were conducted to evaluate interrater reliability. The first study included two pairs of raters who evaluated 31 subjects. Based on encouraging Kappa estimates of 0.81 for that study, a second study was undertaken in which a larger number of investigators were compared directly. This study allowed eight physicians with experience in aesthetic outcomes to individually evaluate ten live models encompassing all ratings on the IGA-LCL. All ratings were performed on live subjects rather than photographs. The overall weighted Kappa estimate for this study was 0.77, confirming good-to-very-good agreement between raters with the IGA-LCL.

After establishment of appropriate reliability, the other required measurement properties were evaluated in turn. The construct validity of an assessment tool indicates how well it measures what it is designed to measure. Construct validity and clinical relevance were demonstrated here by confirming that the IGA-LCL was directly related to patient-based measures of LCL, including patients’ self-perception of improvement (PGIC) and severity (PSA).

“Clinically meaningful” is defined by the condition to be addressed. Here, the condition was a baseline severity of LCL (in a neutral facial position) that the patient seeks to improve. The patient was the sole driver for treatment and thus defined the clinical meaningfulness and importance of the result in this indication. The investigator scale for an aesthetic indication should provide objectivity and clinical validation of the patient’s own outcome. Since the IGA-LCL is the most

Figure 2. Illustrative photographs of a 40-year-old woman (A) pretreatment and (B) four weeks after a single treatment with RT001. This patient demonstrates changes in both length and depth of her lateral canthal lines. Because her wrinkles have shortened and softened, with visible changes in both the average length of the crow’s feet and the depth of the lines themselves, this improvement would represent at least a two-point change based on the scale outlined in Table 9.


empirically-designed and objective scale of its type, the increments and results must be clinically meaningful based on patient responses. Thus, the gold standard for such an evaluation is patient-based outcomes, an approach confirmed by the clinical literature.29,30,33-35

Correlations between the scores for the IGA-LCL and patient self-perception of severity were examined in two studies. In both studies, the expected positive relationship between the two instruments was confirmed. Similarly, improvement as assessed by the IGA-LCL was closely related in both studies to a commonly-employed, validated clinical scale of self-perceived improvement. Thus, the IGA-LCL showed positive correlations with both patient-based instruments measuring a similar concept, which supports construct validity and establishes a clinical correlate for relevance of improvement.

From a clinical perspective, a level of improvement on the IGA-LCL is considered clinically meaningful once a simple majority of subjects rate themselves as improved at the same time. Improvement scales have become the gold standard for these correlations. These scales are particularly important because they factor in the patients’ desires and expectations in seeking treatment and then indicate whether those expectations were met. The standard for establishing a clinically-significant change is a score noting any level of improvement (ie, the top three categories) by subject self-rating. Marked improvement is usually noted by the top two categories of improvement.29,30,33-35 Here, 79% of patients with one point or greater of improvement by investigator assessment noted significant self-perceived improvement on the PGIC. Thus, the clinically-relevant level of improvement for the overwhelming majority of patients occurred when the IGA-LCL scale exhibited a one-point or greater change. These results are further confirmed by standard statistical methods as discussed below. Similarly, 80% of patients with a two-point or greater IGA-LCL improvement attained top-two-category improvement by patient self-assessment. Thus, a change of two points or greater denotes marked improvement by patient self-rating, which factors in the magnitude of the perceived result versus expectations.

After establishing construct validity, we evaluated the IGA-LCL for its ability to detect change. Ability to detect change can be assessed by studying posttreatment changes. If the instruments are sensitive to change, posttreatment changes should be observed and the IGA-LCL should show strong correlations and agreement with other scales when change occurred. The ability of the IGA-LCL to detect change was prospectively examined in two Phase 2 studies, RT001-CL017LCL and RT001-CL024LCL. Specifically, Spearman correlations were calculated for the change from pretreatment to the four-week posttreatment follow-up visit on the IGA-LCL Severity Scale. All comparisons were statistically significant (P < .0001) and strong in magnitude, with an r > 0.60.36 In summary, in the context of treatment, the IGA-LCL shows change and is correlated with patient severity.

As required, the IGA-LCL was able to discriminate treatment effect reliably and with notably low placebo rates. Sensitivity to change (treatment response) in the IGA-LCL was characterized by its ability to generate scores that reflect actual changes in LCL severity. Significant one-point or greater and, separately, significant two-point or greater improvements were observed on the IGA-LCL across both studies for RT001 versus controls. Improvement on the IGA-LCL scale was shown to be reliable, clinically meaningful, sensitive, and statistically robust as an endpoint in comparisons of RT001 at various doses and across timepoints versus controls.

The final and key consideration in establishing the IGA-LCL as fit for purpose is determining a threshold for treatment benefit. The FDA PRO Guidance recommends the application of anchor-based methods (with the anchor being a standard determined by patient rating, to which results are then compared) to establish a responder definition, which is a score change that will indicate treatment benefit on the IGA-LCL. Based on the results from study RT001-CL024LCL, a traditional anchor-based approach was followed by using the PGIC to evaluate the level of change on the IGA-LCL at which patients reported being “improved” or “much improved.” This analysis focused on marked change (“improved” or “much improved”), rather than any improvement (“a little improved” or better). Tables 6 and 7 represent this analysis in two different ways. Table 7 shows that the average IGA-LCL change score for individual left (–1.80) and right (–1.87) LCL among patients who reported being “improved” on the PGIC supports a two-point improvement as a marked change. Table 6 summarizes the proportion of patients at each level of change on the PGIC and IGA-LCL, and it also clearly supports a two-point improvement as being a marked change.

Beyond establishing a responder definition rigorous enough for use as a primary endpoint, the level of change that represents a threshold for clinically-meaningful benefit was also determined. With the data from Tables 6 and 7, the change scores on the IGA-LCL at which patients reported being improved at all (“a little improved” or better) were evaluated. Table 7 shows the average change score on individual left (–1.00) and right (–1.00) IGA-LCL for patients who reported being “a little improved” on the PGIC. Table 6 summarizes the proportion of patients at each level of change on the PGIC and the IGA-LCL, and it confirms a one-point improvement as a clinically-important level of improvement. This is consistent with the percentage improvement response threshold discussed above.
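The anchor-based reasoning above can be sketched in a few lines of code. This is a hedged illustration only: the function names, the record layout, the "no change" label, and every data value are hypothetical and are not drawn from Tables 6 or 7. The sketch computes the mean IGA-LCL change score within each PGIC category and, separately, the share of patients at a given level of investigator-rated improvement whose self-rating falls in a chosen set of anchor categories.

```python
from collections import defaultdict
from statistics import mean


def mean_change_by_anchor(records):
    """Mean IGA-LCL change score for each PGIC category.

    `records` is an iterable of (pgic_label, iga_change) pairs, where a
    negative change means the investigator-rated grade improved.
    """
    groups = defaultdict(list)
    for label, change in records:
        groups[label].append(change)
    return {label: mean(changes) for label, changes in groups.items()}


def proportion_meeting_anchor(records, min_improvement, anchor_labels):
    """Share of patients improving by at least `min_improvement` points
    whose PGIC rating is one of `anchor_labels`.

    Returns None when no patient reaches that level of improvement.
    """
    reached = [label for label, change in records
               if change <= -min_improvement]
    if not reached:
        return None
    return sum(label in anchor_labels for label in reached) / len(reached)


# Hypothetical patients: (PGIC self-rating, change in IGA-LCL grade).
records = [
    ("much improved", -2), ("much improved", -3),
    ("improved", -2), ("improved", -2), ("improved", -1),
    ("a little improved", -1), ("a little improved", -1),
    ("no change", 0), ("no change", -1), ("no change", 0),
]

MARKED = {"improved", "much improved"}
ANY = MARKED | {"a little improved"}

print(mean_change_by_anchor(records))
print(proportion_meeting_anchor(records, 1, ANY))
print(proportion_meeting_anchor(records, 2, MARKED))
```

Under the simple-majority criterion described earlier, a level of improvement would count as clinically meaningful when the returned proportion exceeds 0.5.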

Conclusions

The IGA-LCL scale has been shown to be reliable, appropriate, and clinically meaningful for measuring LCL severity. This scale is appropriate for evaluating improvement in LCL severity after treatment with topical or injectable neuromodulators.

Disclosures

The authors were compensated for their contributions as investigators for the studies. Dr. Kane is a member of the advisory board, a consultant, a stockholder, and a member of the speaker bureau for Allergan; a consultant and a member of the advisory board for Mentor; a member of the advisory board, a consultant, a stockholder, a member of the speaker bureau, and a clinical investigator for Medicis; a member of the advisory board and speaker bureau, and a consultant for Sanofi-Aventis; a member of the advisory board for Stiefel; a member of the advisory board, a consultant, a clinical investigator, and an investigator for Revance Therapeutics; a consultant for Shire; a consultant for Galderma; a consultant for Johnson & Johnson; a consultant for QMed; a consultant for Canfield; a consultant and a clinical investigator for Coapt; a consultant for Merz; a consultant for Teoxane; and a consultant for Kythera. Dr. Blitzer receives research funding from and is a consultant for Allergan; receives research funding from Merz; is a consultant and shareholder for Myotech; receives royalty income from Xomed/Medtronic; and is a clinical investigator for Revance Therapeutics. Dr. Brandt is a consultant and a clinical investigator for Allergan; a consultant and a clinical investigator for Medicis; a clinical investigator for Sanofi-Aventis; a clinical investigator for Anika Therapeutics; a clinical investigator for Mentor; a clinical investigator for Suneva Medical; a clinical investigator for Fibrocell; a clinical investigator for Contura; an investigator for Merz; a clinical investigator for Revance Therapeutics; a clinical investigator for Galderma Laboratories; a clinical investigator for Teoxane Laboratories; and a clinical investigator for Noven. Dr. Glogau is an executive advisory consultant and a clinical investigator for Allergan; an executive advisory consultant and a clinical investigator for Medicis; a clinical investigator for Contura; a clinical investigator and a consultant for Revance Therapeutics; a consultant for Gerson Lehman; a consultant for MedaCorp; and a clinical investigator for Liposinix. Dr. Monheit is a consultant and a clinical investigator for Allergan; a clinical investigator for Dermik Laboratories; a consultant and a clinical investigator for Genzyme Corporation; a consultant for Johnson & Johnson; a clinical investigator for Contura; a consultant and a clinical investigator for Ipsen/Medicis; a consultant and a clinical investigator for Electro-Optical Sciences, Inc.; a consultant and a clinical investigator for Revance Therapeutics; a clinical investigator for Kythera; a consultant and a clinical investigator for Galderma; a consultant and a clinical investigator for Mentor; a consultant and a clinical investigator for Merz; and a consultant for MyoScience. Dr. Narins is a member of the global board and a clinical investigator for Allergan; a consultant and a clinical investigator for Contura; a consultant and a clinical investigator for Revance Therapeutics; a consultant and a clinical investigator for Merz/BioForm; a clinical investigator for Suneva; and a clinical investigator for Galderma. Dr. Paty is an employee and a stockholder of invivodata. Dr. Waugh is an employee and a stockholder of Revance Therapeutics.

Funding

The clinical studies were sponsored and funded by Revance Therapeutics, Inc., Newark, CA. The authors received no financial support for the authorship of this article.

References

1. Carruthers A, Carruthers J. Botulinum toxin type A. J Am Acad Dermatol 2005;53(2):284-290.
2. Lowe N, Lask G, Yamauchi P, Moore D. Bilateral, double-blind, randomized comparison of 3 doses of botulinum toxin type A and placebo in patients with crow’s feet. J Am Acad Dermatol 2002;47(6):834-840.
3. Lowe NJ, Ascher B, Heckmann M, Kumar C, Fraczek S, Eadie N. Double-blind, randomized, placebo-controlled, dose-response study of the safety and efficacy of botulinum toxin type A in subjects with crow’s feet. Dermatol Surg 2005;31(3):257-262.
4. Carruthers JA, Lowe NJ, Menter MA, et al. A multicenter, double-blind, randomized, placebo-controlled study of the efficacy and safety of botulinum toxin type A in the treatment of glabellar lines. J Am Acad Dermatol 2002;46(6):840-849.
5. Carruthers A, Bogle M, Carruthers JD, et al. A randomized, evaluator-blinded, two-center study of the safety and effect of volume on the diffusion and efficacy of botulinum toxin type A in the treatment of lateral orbital rhytides. Dermatol Surg 2007;33(5):567-571.
6. Ascher B, Rzany BJ, Grover R. Efficacy and safety of botulinum toxin type A in the treatment of lateral crow’s feet: double-blind, placebo-controlled, dose-ranging study. Dermatol Surg 2009;35(10):1478-1486.
7. Kane MA. Classification of crow’s feet patterns among Caucasian women: the key to individualizing treatment. Plast Reconstr Surg 2003;112(5)(suppl):33S-39S.
8. Kane MA. The functional anatomy of the lower face as it applies to rejuvenation via chemodenervation. Facial Plast Surg 2005;21(1):55-64.
9. Olson JJ. Balanced Botox chemodenervation of the upper face: symmetry in motion. Semin Plast Surg 2007;21(1):47-53.
10. Glogau RG. Aesthetic and anatomic analysis of the aging skin. Semin Cutan Med Surg 1996;15(3):134-138.
11. Branford OA, Dann SC, Grobbelaar AO. The quantitative assessment of wrinkle depth: turning the microscope on botulinum toxin type A. Ann Plast Surg 2010;65(3):285-293.
12. Carruthers A, Carruthers JD, Glogau RG, Blitzer A; the Facial Aesthetics Consensus Group Faculty. Advances in facial rejuvenation: botulinum toxin Type A, hyaluronic acid dermal fillers, and combination therapies: consensus recommendations. Plast Reconstr Surg 2008;121(5)(suppl):5S-30S.
13. Carruthers J, Fagien S, Matarasso SL; Botox Consensus Group. Consensus recommendations on the use of botulinum toxin type A in facial aesthetics. Plast Reconstr Surg 2004;114(6)(suppl):1S-22S.
14. Joshua ID, Senghas A, Brandt F, Ochsner KN. The effects of BOTOX injections on emotional experience. Emotion 2010;10(3):433-440.
15. Samson N, Fink B, Matts PJ, Dawes NC, Weitz S. Visible changes of female facial skin surface topography in relation to age and attractiveness perception. J Cosmet Dermatol 2010;9(2):79-88.
16. Blitzer A, Binder W, Aviv J, Keen M, Brin M. The management of hyperfunctional facial lines with botulinum toxin: a collaborative study of 210 injection sites in 162 patients. Arch Otolaryngol Head Neck Surg 1997;123(4):389-392.
17. Havas DA, Glenberg AM, Gutowski KA, Lucarelli MJ, Davidson RJ. Cosmetic use of botulinum toxin-A affects processing of emotional language. Psychol Sci 2010;21(7):895-900.
18. Tamietto M, Castelli L, Vighetti S, et al. Unseen facial and bodily expressions trigger fast emotional reactions. Proc Natl Acad Sci U S A 2009;106(42):17661-17666.
19. Wolff K, Goldsmith LA, Katz SI, Gilchrest BA, Paller AS, Leffell DJ, eds. Fitzpatrick’s Dermatology in General Medicine. 7th ed. New York, NY: McGraw-Hill; 2008.
20. Fagien S. Botox for the treatment of dynamic and hyperkinetic facial lines and furrows: adjunctive use in facial aesthetic surgery. Plast Reconstr Surg 1999;103(2):701-703.
21. Petchngaovilai C. Midface lifting with botulinum toxin: intradermal technique. J Cosmet Dermatol 2009;8(4):312-316.
22. Chang SP, Tsai HH, Chen WY, Lee WR, Chen PL, Tsai TH. The wrinkles soothing effect on the middle and lower face by intradermal injection of botulinum toxin type A. Int J Dermatol 2008;47(12):1287-1294.
23. Alam M, Barrett KC, Hodapp RM, Arndt KA. Botulinum toxin and the facial feedback hypothesis: Can looking better make you feel happier? J Am Acad Dermatol 2008;58(6):1061-1072.
24. Rinn WE. The neuropsychology of facial expression: a review of the neurological and psychological mechanisms for producing facial expressions. Psychol Bull 1984;95(1):52-77.
25. Day DJ, Littler CM, Swift RW, Gottlieb S. The wrinkle severity rating scale: a validation study. Am J Clin Dermatol 2004;5:49-52.
26. Hund T, Ascher B, Rzany B. Reproducibility of two four-point clinical severity scores for lateral canthal lines (crow’s feet). Dermatol Surg 2006;32:1256-1260.
27. Kim EJ, Reeck JB, Maas CS. A validated rating scale for hyperkinetic facial lines. Arch Facial Plast Surg 2004;6:253-256.
28. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174.
29. Norman GR, Sridhar FG, Guyatt GH, Walter SD. Relation of distribution- and anchor-based approaches in interpretation of changes in health-related quality of life. Med Care 2001;39(10):1039-1047.
30. McLeod LD. Evaluating minimal clinically important differences for the acne-specific quality of life questionnaire. Pharmacoeconomics 2003;21(15):1069-1079.
31. Honeck P, Weiss C, Sterry W, Rzany B. Reproducibility of a four-point clinical severity score for glabellar frown lines. Br J Dermatol 2003;149:306-310.
32. Carruthers A, Carruthers J, Hardas B, et al. A validated grading scale for crow’s feet. Dermatol Surg 2008;34(suppl 2):S173-S178.
33. Guyatt GH, Juniper EF, Walter SD, Griffith LE, Goldstein RS. Interpreting treatment effects in randomised trials. BMJ 1998;316(7132):690-693.
34. Bajaj-Luthra A, Mueller T, Johnson PC. Quantitative analysis of facial motion components: anatomic and nonanatomic motion in normal persons and in patients with complete facial paralysis. Plast Reconstr Surg 1997;99(7):1894-1902.
35. Chan KS, Chen WH, Gan TJ, et al. Development and validation of a composite score based on clinically meaningful events for the opioid-related symptom distress scale. Qual Life Res 2009;18(10):1331-1440.
36. Hinkle DE, Wiersma W, Jurs SG. Applied Statistics for the Behavioral Sciences. Boston, MA: Houghton Mifflin Company; 1998.
