Drug Information Journal, Vol. 35, pp. 1517–1527, 2001 Printed in the USA. All rights reserved.

0092-8615/2001 Copyright © 2001 Drug Information Association Inc.

CHOICE OF DELTA IN EQUIVALENCE TESTING*

TIE-HUA NG, PHD
Mathematical Statistician, Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Rockville, Maryland

A valid interpretation of an active control equivalence study without a concurrent placebo control depends on the assumptions that 1. The active control is effective in the current trial (ie, assay sensitivity), and 2. The effect size is the same across the studies (ie, constancy assumption). The equivalence margin δ should be a small fraction (eg, 0.2) of the therapeutic effect of the active control as compared to placebo. However, a larger δ can be justified if the objective is to establish the efficacy of the experimental treatment as compared to placebo through its comparison to the standard therapy without claiming equivalence. The proposed δ may be interpreted as preserving a percentage of the active control effect as compared to placebo. The assumption that the active control effect is constant across studies may be discounted by using a smaller δ. Preservation and discounting are two distinct concepts, although they are indistinguishable mathematically. Placebo controls are not necessarily unethical when known effective therapy exists for a condition. When a placebo control is ethical, it is a clear choice if the study objective is to establish the efficacy of the test treatment. A three-arm trial (test treatment, active control, and placebo) would be an ideal design if the study objective is to establish the efficacy of the test treatment relative to an active control. When a placebo control is unethical and there can be no concurrent placebo, an evaluation of the efficacy of the test treatment depends on the discount factor to be used. The discount factor is often difficult to justify. In such a situation, an evaluation of the efficacy of the test treatment may be supplemented by other designs such as an “add on” design or an early escape design. 
In fact, a hybrid of the active control design with the “add on” design (test treatment, active control, and combination) is an ideal design when the test treatment and the active control possess different pharmacologic mechanisms. On the other hand, the discounting factor plays a less important role in an evaluation of the relative efficacy of the test treatment.

Key Words: Active control; Equivalence margin; Equivalence testing; Discounting; Preservation

Presented at the DIA Workshop “Statistical Methodology in Clinical R&D,” April 2–4, 2001, Vienna, Austria. Reprint address: Tie-Hua Ng, 1401 Rockville Pike, #273S, HFM-217, Rockville, MD 20852–1448. E-mail: [email protected]. *The views expressed in this paper are those of the author and do not necessarily reflect policies of the U.S. Food and Drug Administration.

EQUIVALENCE TESTING

There are two major types of equivalence in clinical research: therapeutic equivalence and bioequivalence. Therapeutic equivalence is sometimes referred to as clinical equivalence, and often arises in active control equivalence studies in which an experimental or test treatment is compared to a standard therapy or active control based on clinical endpoints. The objective is to show that the experimental treatment produces the same benefit as the active control. Bioequivalence arises from studies in which a test product is compared to a reference product with respect to pharmacokinetic parameters such as the area under the concentration-time curve (AUC), the maximum concentration (Cmax), and so forth. The objective of bioequivalence studies is to show that the pharmacologic activity of one product is similar to that of another.

In active control equivalence studies, a major issue that has been raised by many authors is the efficacy of the active control relative to placebo in the current trial. This is the assay sensitivity of the trial, as discussed in the International Conference on Harmonization (ICH) E10 document (1). Assay sensitivity cannot be validated in a study without a concurrent placebo. If assay sensitivity is not established, the efficacy of the experimental treatment cannot be inferred when the study shows that the experimental treatment is not much different from the active control. Therefore, a concurrent placebo is often recommended if placebo use is ethical in the setting of the study.

On the other hand, a placebo does not play any role in bioequivalence studies. However, for biologics products, such as immunoglobulin, one needs to take into account the endogenous level (eg, subtracting the endogenous level before computing the AUC). For drugs, bioequivalence studies are the basis for evaluating generic products.
For biologics, these studies are conducted to show comparability of production lots when the sponsor makes significant manufacturing changes, such as scaling up pilot plant production or building new facilities, that do not require efficacy studies or extensive safety data.

Equivalence testing with two treatment groups can be one-sided or two-sided. One-sided equivalence studies are also known as noninferiority studies. Therapeutic equivalence is often one-sided; that is, we wish to know whether the experimental treatment is not worse than the active control. Bioequivalence with biologics products can also be one-sided because we are not concerned with whether the test product is more bioavailable than the reference product. However, that does not preclude two-sided bioequivalence testing. On the other hand, bioequivalence with drugs is two-sided, since greater bioavailability may pose a safety concern (eg, adverse events). Therapeutic equivalence can also be two-sided. For example, when comparing a twice-a-day to a once-a-day regimen, one would be interested in a difference in either direction.

For two-sided equivalence testing, it is not possible to show absolute equality of the two means. Therefore, statistical methods are developed to show δ-equivalence, that is, the absolute mean difference being less than a prespecified δ > 0. More specifically, the null hypothesis

H0(2): |µt − µs| ≥ δ,

is tested against the alternative hypothesis

H1(2): |µt − µs| < δ,

where µt and µs denote the mean response for the test (or experimental) treatment and the standard therapy (or active control), respectively. The superscript (2) indicates testing for two-sided equivalence. The two one-sided tests procedure (2) and the confidence interval approach are often used for testing this null hypothesis.

For one-sided equivalence testing, assume that the larger the response, the better the treatment. Therefore, µt − µs ≥ 0 means that the test treatment is not inferior to the standard therapy. To show noninferiority, we test the null hypothesis

H0: µt − µs < 0,

against the alternative hypothesis

H1: µt − µs ≥ 0,

so that noninferiority can be concluded when the null hypothesis is rejected. To show superiority, we test the null hypothesis

H0: µt − µs ≤ 0,

against the alternative hypothesis

H1: µt − µs > 0.

The only difference between these two sets of hypotheses is the boundary zero. More specifically, the boundary zero is included in the null hypothesis for testing for superiority but not for noninferiority. Consequently, testing for noninferiority would be the same as testing for superiority. Therefore, instead of showing statistically that µt − µs ≥ 0, statistical methods are developed to show that the experimental treatment is δ-no-worse than the active control, that is, the experimental treatment is not worse than the active control by δ or more. Thus, for noninferiority, we test the null hypothesis

H0(1): µt − µs ≤ −δ,

against the alternative hypothesis

H1(1): µt − µs > −δ,

where the superscript (1) indicates testing for one-sided equivalence. Equivalence testing for average bioequivalence (bioequivalence in average bioavailabilities in terms of pharmacokinetic parameters such as AUC and Cmax) (3) is well developed and accepted. This paper will focus on therapeutic equivalence.
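The hypothesis tests above can be sketched numerically. The following Python is an illustrative sketch, not from the paper: it assumes large samples so that a normal (z) approximation is adequate, and all function names and trial numbers are hypothetical.

```python
# Illustrative sketch: one-sided noninferiority and two-sided delta-equivalence
# tests for a difference of means, using a large-sample z approximation.
# All names and numbers are hypothetical, not from the paper.
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def noninferiority_p(diff: float, se: float, delta: float) -> float:
    """P-value for H0: mu_t - mu_s <= -delta vs H1: mu_t - mu_s > -delta."""
    z = (diff + delta) / se
    return 1.0 - norm_cdf(z)

def tost_p(diff: float, se: float, delta: float) -> float:
    """Two one-sided tests for H0: |mu_t - mu_s| >= delta.
    The overall p-value is the larger of the two one-sided p-values."""
    p_lower = 1.0 - norm_cdf((diff + delta) / se)  # H0: diff <= -delta
    p_upper = norm_cdf((diff - delta) / se)        # H0: diff >= +delta
    return max(p_lower, p_upper)

# Hypothetical trial: observed difference 0.5, standard error 1.0, margin 3.0.
diff, se, delta = 0.5, 1.0, 3.0
print(round(noninferiority_p(diff, se, delta), 4))
print(round(tost_p(diff, se, delta), 4))
```

Rejecting the one-sided null at a small p-value supports noninferiority; rejecting both one-sided nulls in the TOST procedure supports two-sided δ-equivalence, mirroring the confidence interval approach mentioned in the text.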

THE EQUIVALENCE MARGIN δ

What is δ? δ is the usual term for the equivalence or noninferiority margin. The following is a list of definitions of δ found in the literature. It is by no means a complete list and it is not a random sample. Some authors used other notations, such as Θ0 and ∆, instead of δ:

1. “An equivalence margin should be specified in the protocol; this margin is the largest difference which can be judged as being clinically acceptable and . . . ” (4),
2. “This margin is the degree of inferiority of the test treatments to the control that the trial will attempt to exclude statistically.” (1),
3. “Choice of a meaningful value for Θ0 is crucial, since it defines levels of similarity sufficient to justify use of the experimental treatment.” (5),
4. “ . . . ; that a test treatment is not inferior to an active treatment by more than a specified, clinically irrelevant amount (Noninferiority trials); . . . ” (6),
5. “ . . . , but is determined from the practical aspects of the problem in such a way that the treatments can be considered for all practical purposes to be equivalent if their true difference is unlikely to exceed the specified ∆.” (7),
6. “In a study designed to show equivalence of the therapies, the quantity δ is sufficiently small that the therapies are considered equivalent for practical purposes if the difference is smaller than δ.” (8),
7. “An objective of ACES is the selection of the new treatment when it is not worse than the active control by more than some difference judged to be acceptable by the clinical investigator.” (9),
8. “Hence, if a new therapy and an accepted standard therapy are not more than irrelevantly different concerning a chosen outcome measure both therapies are called therapeutically equivalent.” (10),
9. “The δ is a positive number that is a measure of how much worse B could be than A and still be acceptable.” (11),

10. “For regulatory submissions, the goal is to pick the allowance, δ, so that there is assurance of effectiveness of the new drug when the new drug is shown to be clinically equivalent to an old drug used as an active control. For trials of conservative therapies, the δ represents the maximum effect with respect to the primary clinical outcome that one is willing to give up in return for the other benefits of the new therapy.” (11),
11. “ . . . , where δ represents the smallest difference of medical importance. . . . These approaches depend on the specification of a minimal difference in efficacy that one is willing to tolerate.” (12),
12. “The noninferiority/equivalence margin, δ, is the degree of acceptable inferiority between the test and active control drugs that a trial needs to predefine at the trial design stage.” (13), and
13. “In general, the difference δ should represent the largest difference that a patient is willing to give up in efficacy of the standard treatment C for the secondary benefits of the experimental treatment E.” (14)

Most definitions relate δ to a clinical judgment; others relate δ to other benefits. The ICH E10 document referred to δ as the degree of inferiority of the test treatments to the control that the trial will attempt to exclude statistically. It says exactly what the statistical inference does. However, it provides no advice for choosing the δ. The document then states that if the confidence interval for the difference between the test and control treatments excludes a degree of inferiority of the test treatment as large as, or larger than, the margin, the test treatment can be declared noninferior. There is no problem with the statement if δ is small, but it could be misleading if δ is too large.

CHOICE OF DELTA

How do you choose δ? The following is a list of suggestions in the literature on the choice of δ. Again, it is by no means a complete list and it is not a random sample:

1. “ . . . should be smaller than differences observed in superiority trials of the active comparator.” (4),
2. “The margin chosen for a noninferiority trial cannot be greater than the smallest effect size that the active drug would be reliably expected to have compared with placebo in the setting of the planned trial. . . . In practice, the noninferiority margin chosen usually will be smaller than that suggested by the smallest expected effect size of the active control because of interest in ensuring that some clinically acceptable effect size (or fraction of the control drug effect) was maintained.” (1),
3. “Θ0 must be considered reasonable by clinicians and must be less than the corresponding value for placebo compared to standard treatment, if that is known. . . . The choice of Θ0 depends on the seriousness of the primary clinical outcome, as well as the relative advantages of the treatments in considerations extraneous to the primary outcome.” (5),
4. “In general, the equivalence limits depend upon the response of the reference drug.” (15),
5. “On the other hand, for one-sided therapeutic equivalence, the lower limit L may be determined from previous experience about estimated relative efficacy with respect to placebo and from the maximum allowance which clinicians consider to be therapeutically acceptable. . . . Therefore, the prespecified equivalence limit for therapeutic equivalence evaluated in a noninferiority trial should always be selected as a quantity smaller than the difference between the standard and placebo that a superiority trial is designed to detect.” (16),
6. “The extent of accepted difference (inferiority) may depend on the size of the difference between standard therapy and placebo.” (10),
7. “A basis for choosing the δ for assurance of effectiveness in prior placebo-controlled trials of the active control in the same population to be studied in the new trial.” (11),
8. “This margin chosen for a noninferiority trial should be smaller (usually a fraction) than the effect size, ∆, that the active control would be reliably expected to have compared with placebo in the setting of the given trial.” (13), and
9. “The difference δ must be no greater than the efficacy of C relative to P and will in general be a fraction of this quantity δc.” (14)

Ng (17,18,19) proposed that the equivalence margin δ should be a small fraction (eg, 0.2) of the therapeutic effect of the active control as compared to placebo. This proposal is in line with the view of many authors that δ should depend on the effect size of the active control, but it recommends more specifically that δ should be a small fraction of that effect size.

Motivations

The proposed δ (17,18,19) was first motivated during the review of a study in a New Drug Application submission in 1987. In this study, a new formulation and the old formulation of a cholesterol-lowering agent were compared with respect to their ability to maintain the depressed LDL cholesterol level (see Figure 1). All patients were on the old formulation during the baseline run-in period of two weeks. Patients were then randomized to a four-week treatment period with either the old formulation or the new formulation.

FIGURE 1. Old and new formulations for cholesterol-lowering agents are compared.


The LDL level was presumed to have been stabilized at baseline and was assumed to remain at the baseline level through the end of the experiment if patients continued on the old formulation. That would also be true for patients taking the new formulation if the two formulations were equivalent. However, if we put patients on placebo, or if the new formulation was not effective, then the LDL level would be expected to eventually rise to the prebaseline level. Therefore, to say that the two formulations are equivalent with respect to their ability to maintain the depressed LDL cholesterol level, we want the new formulation to be near the depressed level rather than near the prebaseline level at the end of the treatment period. Thus, δ should be a small fraction of the decrease in LDL level from prebaseline to baseline. For active control equivalence studies, this means that δ should be a small fraction of the therapeutic effect of the standard therapy as compared to placebo.

To further justify the proposal, let us consider the situation where we are comparing an antihypertensive agent versus a standard therapy (17,18). Assume that there is a fictitious concurrent placebo control. The primary efficacy endpoint is the reduction in diastolic blood pressure. Let µt, µs, and µp denote the mean response for the test (or experimental) treatment, the standard therapy (or active control), and the placebo, respectively. Let δsp = µs − µp. So, δsp is the therapeutic effect of the standard therapy as compared to placebo, or the effect size.

Let us consider the four scenarios in Figure 2. In scenario 1, µs is known, but not µp. In scenario 2, µp is known, but not µs. Therefore, in the first two scenarios, δsp is unknown. In scenario 3, δsp = 5 and in scenario 4, δsp = 10. Is δ = 6 mm Hg acceptable? In the first two scenarios we do not have enough information to answer the question.
FIGURE 2. Four scenarios.

In scenario 3, δ = 6 mm Hg is unacceptable because placebo and the standard therapy would be δ-equivalent. In scenario 4, δ = 6 mm Hg is questionable because if µt is halfway between placebo and the standard therapy (ie, µt = 10 mm Hg), then the test drug
is δ-equivalent to both placebo and standard therapy. For the two treatments to be considered equivalent, the test drug should be closer to the standard therapy than to the placebo. So, in scenario 4, δ should be less than 5 mm Hg. Therefore, δ should be less than one half of the effect size, or mathematically δ = εδsp, where 0 < ε < 0.5. This leads to the proposal that δ should be a small fraction of the effect size. Estimation of δsp will be discussed later.

Choice of ε and Interpretation

What small fraction should be used? Before we answer this question, let us see how the proposed δ can be interpreted. Recall that for one-sided equivalence, we are testing the null hypothesis that µt − µs ≤ −δ. At the boundary, µt − µs = −δ = −εδsp, so we can write µt as

µs − εδsp, or µp + (1 − ε)δsp.

Thus, we put µt on a scale of 0 to 1 through ε, with ε = 0 meaning that the test treatment is the same as the standard therapy (µt = µs), while ε = 1 means that the test treatment is the same as placebo (µt = µp) (see Figure 3).

FIGURE 3. Interpretation.

For ε = 0.2, µt = µs − εδsp means that the test treatment is 80% as effective as the standard therapy. Many authors use different words, such as “the test treatment preserves 80% of the effect.” It should be noted that when the null hypothesis is rejected, we would conclude that the test treatment preserves more than 80% of the effect, not 80% or at least 80% of the effect. To claim “equivalence” or noninferiority, ε should be small. How small is small? Perhaps ε should depend on other benefits (eg, better safety profiles) relative to the primary endpoint.

As noted earlier, with ε = 0.2, we would conclude that the test treatment preserves more than 80% of the effect when the null hypothesis is rejected. On the other hand, to claim the efficacy of the experimental treatment as compared to placebo, but not to claim “equivalence” or noninferiority, a larger δ can be justified. In fact, ε can be as large as one. If we want to claim that the test treatment preserves more than x% of the effect, then ε = 1 − x%. In that case, we cannot claim “equivalence”/noninferiority if x is too low. Therefore, a larger δ can be justified if the objective is to establish the efficacy of the experimental treatment as compared to placebo through its comparison to the standard therapy without claiming equivalence.

For example, in 1992, the Food and Drug Administration (FDA) Cardio-Renal Advisory Committee recommended the use of half of the effect size of the standard therapy as the noninferiority margin for trials of new thrombolytics (20). This recommendation is reasonable if the objective is to establish the efficacy of the test product (as compared to placebo), but not to claim “equivalence” of the test product and the active control. For the latter objective, a much smaller δ should be used. We could also claim preservation of more than 50% of the effect when the null hypothesis is rejected, if the constancy assumption holds. The constancy assumption will be discussed in the following section.

EFFICACY

How can we conclude efficacy of the experimental treatment without a concurrent placebo? We can do so only if we assume that the effect size of the active control is positive. That is the key assumption, and it implicitly assumes assay sensitivity. In addition, we assume that the effect size is known. However, in practice, we need to estimate the effect size. Suppose that there is a series of studies comparing the active control versus a placebo, and let the overall effect size of this historical data be denoted by µs* − µp*. When the effect size is estimated from historical data, we need to make another key assumption: that the effect size in the current study is the same as the overall effect size of the historical data, or mathematically,

µs − µp = µs* − µp*.

In other words, the effect size for the current study is equal to the effect size for the historical data. This is known as the constancy assumption.
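The arithmetic of the margin-as-a-fraction proposal can be illustrated with a short sketch. The Python below uses hypothetical numbers, and the function names are mine, not the paper's: it encodes δ = εδsp and the preservation interpretation, and checks the Figure 2 lesson that ε must be below 0.5 for the margin to exclude a treatment halfway between placebo and the standard therapy.

```python
# Hypothetical numbers illustrating the margin-as-a-fraction rule
# delta = eps * delta_sp, where delta_sp = mu_s - mu_p is the active
# control's effect over placebo. Only the relation itself is from the text.

def margin(eps: float, delta_sp: float) -> float:
    """Equivalence margin as a fraction eps of the control effect size."""
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    return eps * delta_sp

def preserved_fraction(eps: float) -> float:
    """Rejecting H0: mu_t - mu_s <= -eps*delta_sp lets us claim the test
    treatment preserves MORE THAN this fraction of the control effect."""
    return 1.0 - eps

# Scenario 4 of Figure 2: delta_sp = 10 mm Hg. A margin of 6 mm Hg is
# questionable: a treatment halfway between placebo and control would be
# delta-equivalent to both. Requiring eps < 0.5 avoids this.
delta_sp = 10.0
for eps in (0.6, 0.5, 0.2):
    d = margin(eps, delta_sp)
    print(f"eps={eps}: delta={d} mm Hg, "
          f"claims >{100 * preserved_fraction(eps):.0f}% preservation, "
          f"excludes halfway point: {d < delta_sp / 2}")
```

Only ε = 0.2 in this sketch yields a margin that both excludes the halfway point and supports a strong (more than 80%) preservation claim.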
The driving force behind the use of previous placebo-controlled studies of the active control drug to infer efficacy of the test drug in an active control equivalence study is the Code of Federal Regulations (CFR) (21,22). The CFR states that the analysis should explain why the drugs should be considered effective in the study, for example, by reference to results in previous placebo-controlled studies of the active control drug. However, it does not provide any direction on how to use the previous studies to infer efficacy of the test treatment. In 1987, Fleming (23) provided a more explicit direction: using information on the relationship of the new drug to the active control and of the active control to no treatment, one can estimate the relationship of the new drug to no treatment and thereby obtain the desired quantitative assessment of the new drug effect. Ng (17) translated the last statement into formulas and proposed a test statistic for inferring the efficacy of the test drug as compared to placebo. It is simple and straightforward, and there is no need to specify δ, but it depends on the constancy assumption.

Two approaches to establishing the efficacy of a test drug as compared to placebo are discussed by Hauck and Anderson (11). The first approach is to determine δ using the lower confidence limit of the effect size from the historical data with, say, a 90% confidence level, and then to test for noninferiority. If H0(1) is rejected, we would conclude efficacy of the test drug. The second approach is to incorporate δ = εδsp = µs − µp (ε = 1, for efficacy; see the subsection on “Choice of ε and Interpretation”) into H0(1). So, the null hypothesis for noninferiority becomes

H0(1): µt − µs ≤ −(µs − µp), or equivalently, H0(1): µt ≤ µp,

and we would conclude efficacy of the test drug (ie, µt > µp) when H0(1) is rejected. The

null hypothesis H0(1) can be tested by writing H0(1) as

H0(1): (µt − µs) + (µs* − µp*) ≤ 0,

due to the constancy assumption. The left-hand side of the above hypothesis can be estimated from the current trial and the historical data, leading to the same test statistic proposed by Ng (17), as discussed earlier. Both approaches rely on the constancy assumption. Deviation from the constancy assumption may lead to inflation of the type I error rate in concluding efficacy, for example, when the effect size in the current trial is smaller than the overall effect size of the historical data, that is, µs − µp < µs* − µp*.

One way to alleviate the problem is to use a smaller δ to discount the effect size estimated from the previous studies. To do so, let µs − µp = γ(µs* − µp*), where 0 < γ ≤ 1. So, γ = 1 is the situation with no discounting; if there is a y% discounting, γ = 1 − y%, for 0 < y < 100. This should not be confused with the concept of preservation as discussed in the subsection on “Choice of ε and Interpretation.” See also the next section.

Recall that δ = εδsp = µs − µp = γ(µs* − µp*), for efficacy (ie, ε = 1) and for discounting. Using the first approach, we compute δ as

γ · Lower confidence limit for the effect size from the historical data,

and then test for noninferiority using the resulting δ. Using the second approach, we incorporate δ = γ(µs* − µp*) into H0(1), and test the resulting null hypothesis

H0(1): µt − µs + γ(µs* − µp*) ≤ 0.

Again, the left-hand side of the above hypothesis can be estimated from data from the current trial and the historical data. A test statistic can be defined accordingly. Note that there is no preservation because ε = 1.

PRESERVATION OF A PERCENTAGE OF THE ACTIVE CONTROL EFFECT

To claim preservation of a percentage of the active control effect with discounting, we let δ = εδsp = ε(µs − µp) = εγ(µs* − µp*). Using the first approach, we compute δ as

εγ · Lower confidence limit for the effect size from the historical data,

and then test for noninferiority using the resulting δ. Using the second approach, we incorporate δ = εγ(µs* − µp*) into H0(1), and test the resulting null hypothesis

H0(1): µt − µs + εγ(µs* − µp*) ≤ 0.

If the null hypothesis is rejected in either approach, we can claim that the test treatment preserves more than 100(1 − ε)% of the effect with 100(1 − γ)% discounting. For example, using the first approach, we compute δ as

0.72 · Lower confidence limit for the effect size from the historical data,

and then test for noninferiority using the resulting δ, where εγ = 0.72. Using the second approach, we incorporate δ = 0.72(µs* − µp*) into H0(1), and test the resulting null hypothesis

H0(1): µt − µs ≤ −0.72(µs* − µp*).

If this null hypothesis is rejected in either approach, we can claim that the test treatment preserves more than 28% of the effect with no discounting (ε = 0.72, γ = 1). Or we can claim that the test treatment is efficacious (with no preservation) with 28% discounting (ε = 1, γ = 0.72). Or we can claim that the test treatment preserves more than 10% of the effect with 20% discounting (ε = 0.9,

γ = 0.8). Of course, other pairs of ε and γ may be used (eg, ε = 0.8, γ = 0.9).

DISCUSSION

A prerequisite for an active control used in an active control equivalence study without a concurrent placebo is that the active control has been shown to be effective, with a reliable effect size, in a series of studies. A valid interpretation of an active control equivalence study without a concurrent placebo control depends on the assumptions that the active control is effective (ie, assay sensitivity) and that the effect size is the same across the studies (ie, constancy assumption). It should be noted that the prerequisite and the constancy assumption imply assay sensitivity.

The constancy assumption leads to the design issues in the current study as discussed in the ICH E10 document (1). The current study design should be similar to that of the previous studies. For example, the current study should enroll a similar patient population, using the same inclusion/exclusion criteria. The same dosage regimen and the same efficacy endpoint should be used, with the same duration/follow-up, and so forth.

The conduct of the current study also has an impact on the constancy assumption. For example, mixing up treatment assignments would bias toward similarity of the two treatments and reduce the effect size of the active control when the test treatment is not effective or is less effective than the active control. Setting aside the conduct of the study, it is reasonable to believe that the constancy assumption would hold if the current study follows the same protocol, with the same centers/investigators, as a recent placebo control study of the active control. In fact, it is logical to expect both studies to have the same (true) mean response for the active control (as well as for the placebo, if placebo were also included in the current study), although reality may show otherwise (most likely due to the conduct of the study).
Such an expectation implies the constancy assumption, but not vice versa. In other words, the constancy assumption is weaker than the assumption that the active control has the same (true) mean response in both studies. Changes in the protocol for the current study from that of the placebo control study of the active control would likely invalidate the latter assumption, while the constancy assumption may still hold. It is difficult to assess the impact of such changes on the constancy assumption, leading to difficulties in deciding the discounting factor to be used. Bringing the conduct of the study into the picture, it is not clear how the discounting factor could be determined to account for the bias due to the conduct of the study.

If there is a concurrent placebo control in the current study, we could incorporate δ = ε(µs − µp) into H0(1), and test the resulting null hypothesis

H0(1): µt − (1 − ε)µs − εµp ≤ 0.

If this null hypothesis is rejected, we can claim that the test treatment preserves more than 100(1 − ε)% of the effect. For two-sided equivalence, incorporating δ = ε(µs − µp) into H0(2) results in testing the following two sets of one-sided hypotheses:

H01: (1 + ε)µs − εµp − µt ≤ 0, against H11: (1 + ε)µs − εµp − µt > 0,

and

H02: (1 − ε)µs + εµp − µt ≥ 0, against H12: (1 − ε)µs + εµp − µt < 0.

See Ng (17) for details and an example. Since there are no historical data, no constancy assumption is needed.

Snapinn (24) discussed two ways of discounting historical data for making inferences about the efficacy of the test treatment relative to placebo:

1. Using the first approach discussed in the section on “Efficacy” as a way of discounting, for its inefficiency in utilizing the data, and
2. Preserving a fraction of the active control’s effect.

Although discounting may be viewed as an inefficient method, the degree of discounting depends on the confidence level used in computing the δ, and there is no quantitative measure for such discounting. Mathematically, preservation and discounting are indistinguishable, as seen in the previous section. Using preservation as a form of discounting would be confusing, which may lead to inappropriate “double credits.” “Double credits” means claiming more than 28% preservation with 28% discounting in the example in the previous section.

The first approach discussed by Hauck and Anderson (11) requires determining δ using the lower confidence limit of the effect size from the historical data, while no explicit determination of δ is required for the second approach (see the section on “Efficacy”). Using the 75.2% (as opposed to 84%) one-sided lower confidence bound from the historical data as the δ, at the one-sided 0.05 level, the two approaches are approximately the same if the standard deviations of the treatment difference from both studies are approximately equal (11). Thus, the first approach could be very conservative for inference with regard to the efficacy of the test treatment when a confidence level much larger than 75.2% is used, and could be too liberal when a confidence level much smaller than 75.2% is used. Consequently, the second approach is preferable to the first approach.

Temple and Ellenberg (25) argue that placebo controls are not necessarily unethical when known effective therapy exists for a condition. The choice of placebo control versus active control depends on the ethical issues and the study objective. When a placebo control is ethical and the study objective is to establish the efficacy of the test treatment, placebo control is a clear choice.
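The contrast between Hauck and Anderson's two approaches, discussed earlier, can be made concrete with a sketch. The Python below uses hypothetical numbers and my own function names; it assumes the constancy assumption holds and a large-sample z approximation applies, with εγ = 0.72 as in the earlier worked example. With a 95% lower confidence limit (well above 75.2%), the fixed-margin statistic is the more conservative of the two for these numbers.

```python
# Sketch (not the paper's code) of the two approaches, under the constancy
# assumption and a large-sample z approximation. All numbers hypothetical.
#   d_cur  = estimated mu_t - mu_s from the current active control trial
#   d_hist = estimated mu_s* - mu_p* from historical placebo-controlled data
#   eps_gamma = eps * gamma (preservation times discounting)
import math

def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fixed_margin_z(d_cur, se_cur, d_hist, se_hist, eps_gamma, conf_z=1.645):
    """Approach 1: delta = eps*gamma times the lower confidence limit of the
    historical effect, then an ordinary noninferiority z statistic."""
    delta = eps_gamma * (d_hist - conf_z * se_hist)
    return (d_cur + delta) / se_cur

def synthesis_z(d_cur, se_cur, d_hist, se_hist, eps_gamma):
    """Approach 2: test H0: (mu_t - mu_s) + eps*gamma*(mu_s* - mu_p*) <= 0
    directly, combining the variances of the two independent estimates."""
    est = d_cur + eps_gamma * d_hist
    se = math.sqrt(se_cur ** 2 + (eps_gamma * se_hist) ** 2)
    return est / se

# Hypothetical numbers, with eps*gamma = 0.72 as in the worked example.
z1 = fixed_margin_z(-1.0, 2.0, 10.0, 2.0, 0.72)
z2 = synthesis_z(-1.0, 2.0, 10.0, 2.0, 0.72)
print("fixed-margin:", round(z1, 3), "p =", round(1 - norm_cdf(z1), 4))
print("synthesis:   ", round(z2, 3), "p =", round(1 - norm_cdf(z2), 4))
```

For these inputs the synthesis statistic is larger (smaller p-value), illustrating why the fixed-margin approach with a high confidence level is the more conservative route to the same preservation-with-discounting claim.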
When a placebo control is ethical and the study objective is to establish the efficacy of the test treatment relative to an active control, an active control is a natural choice; however, because of the problems in assessing an active control equivalence study (26), a three-arm trial that includes a concurrent placebo would be an ideal design. When a placebo control is unethical, an active control is a natural choice regardless of the study objective, although alternative designs, such as a dose-response study, an “add-on” design, or an early escape design (1,25,26), could be considered when the study objective is to establish the efficacy of the test treatment.

Inference about the efficacy of the test treatment in an active control equivalence study without a concurrent placebo depends on the constancy assumption. Although discounting may alleviate the problem when the constancy assumption is invalid, it is not clear how the discounting factor should be determined. No design is perfect; each has strengths and weaknesses (1). When a placebo control is unethical, there is no clear “winner” among the active control and alternative designs for establishing the efficacy of the test treatment. Perhaps several study designs should be used, so that the studies may supplement one another in the evaluation of the efficacy of the test treatment. In fact, a hybrid of the active control design and the “add-on” design results in a three-arm trial (test treatment, active control, and combination) in which assay sensitivity may be shown by comparing the combination with either of its components. As in the “add-on” design, this three-arm trial is likely to succeed only when the test treatment and the active control possess different pharmacologic mechanisms, although there are exceptions (1).
In a three-arm trial with a concurrent placebo, the relative efficacy of the test treatment, as measured by a given percentage (eg, 80%) of preservation of the control effect, may be tested, and no constancy assumption is needed, as discussed earlier in this section. When a placebo control is unethical and there can be no concurrent placebo, the relative efficacy of the test treatment can be evaluated on the basis of the percentage of the control effect preserved, as in a three-arm trial, but a discounting factor must be determined for such an evaluation. Discounting plays a much more important role in the evaluation of the efficacy of the test treatment than in the evaluation of its relative efficacy, because erroneously concluding efficacy is far more serious than concluding that more than, say, 80% of the control effect is preserved when in fact only 75% is preserved.

Acknowledgment—The author thanks Dr. Dieter Hauschke for his invitation to make the presentation at the DIA workshop, and Drs. Susan Ellenberg and Peter Lachenbruch and the anonymous referees for their helpful comments.

REFERENCES

1. International Conference on Harmonization E10 Guideline. Choice of Control Groups in Clinical Trials. International Conference on Harmonization; 2000. http://www.fda.gov/cder/guidance/index.htm.
2. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm. 1987;15:657–680.
3. Chow S-C, Liu J-P. Individual equivalence. In: Chow S-C, ed. Encyclopedia of Biopharmaceutical Statistics. New York: Marcel Dekker; 2000:259–266.
4. International Conference on Harmonization E9 Guideline. Statistical Principles for Clinical Trials. International Conference on Harmonization; 1998. http://www.fda.gov/cder/guidance/index.htm.
5. Blackwelder WC. Equivalence trials. In: Armitage P, Colton T, eds. Encyclopedia of Biostatistics. New York: John Wiley; 1998:1367–1372.
6. Hauschke D, Schall R, Luus HG. Statistical significance. In: Chow S-C, ed. Encyclopedia of Biopharmaceutical Statistics. New York: Marcel Dekker; 2000:493–497.
7. Dunnett CW, Gent M. Significance testing to establish equivalence between treatments, with special reference to data in the form of 2 × 2 tables. Biometrics. 1977;33:593–602.
8. Blackwelder WC. “Proving the null hypothesis” in clinical trials. Control Clin Trials. 1982;3:345–353.

9. Makuch R, Johnson M. Active control equivalence studies: planning and interpretation. In: Peace K, ed. Statistical Issues in Drug Research and Development. New York: Marcel Dekker; 1990:238–246.
10. Windeler J, Trampisch H-J. Recommendations concerning studies on therapeutic equivalence. Drug Inf J. 1996;30:195–200.
11. Hauck WW, Anderson S. Some issues in the design and analysis of equivalence trials. Drug Inf J. 1999;33:109–118.
12. Simon R. Bayesian design and analysis of active control clinical trials. Biometrics. 1999;55:484–487.
13. Hwang IK, Morikawa T. Design issues in noninferiority/equivalence trials. Drug Inf J. 1999;33:1205–1218.
14. Simon R. Therapeutic equivalence trials. In: Crowley J, ed. Handbook of Statistics in Cancer Trials. New York: Marcel Dekker; 2001 (in press).
15. Liu J-P. Equivalence trials. In: Chow S-C, ed. Encyclopedia of Biopharmaceutical Statistics. New York: Marcel Dekker; 2000:188–194.
16. Liu J-P. Therapeutic equivalence. In: Chow S-C, ed. Encyclopedia of Biopharmaceutical Statistics. New York: Marcel Dekker; 2000:515–520.
17. Ng T-H. A specification of treatment difference in the design of clinical trials with active controls. Drug Inf J. 1993;27:705–719.
18. Ng T-H. Active control equivalence studies. Proceedings of the Biopharmaceutical Section, American Statistical Association. 1997:124–128.
19. Ng T-H. Statistical issues in equivalence testing—FDA reviewer’s perspectives. Proceedings of the Biopharmaceutical Section, American Statistical Association. 1999:209–213.
20. Gupta G, Hsu H, Ng T-H, Tiwari T, Wang C. Statistical review experiences in equivalence testing at FDA/CBER. Proceedings of the Biopharmaceutical Section, American Statistical Association. 1999:220–223.
21. Code of Federal Regulations. 21 CFR 314.126. Adequate and well-controlled studies; 1985.
22. Code of Federal Regulations. 21 CFR 314.126. Adequate and well-controlled studies; 2000.
23. Fleming TR. Treatment evaluation in active control studies. Cancer Treat Rep. 1987;71:1061–1065.
24. Snapinn SM. Alternatives for discounting historical data in the analysis of non-inferiority trials. Int Chinese Stat Assoc Bulletin. January 2001:29–33.
25. Temple R, Ellenberg SS. Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: ethical and scientific issues. Ann Intern Med. 2000;133(6):455–463.
26. Temple R. Problems in interpreting active control equivalence trials. Accountability in Research. 1996;4:267–275.