Composite reference standard in diagnostic research

Composite reference standard in diagnostic research: a new approach to reduce bias in the presence of imperfect reference tests Shaowu Tang, Parichehr Hemyari and Jesse A. Canchola Roche Molecular Systems, Inc., CA 94588, USA April 24, 2015

Abstract A common challenge in diagnostic research is how to accurately evaluate a new diagnostic test in the absence of a gold standard. Several approaches, such as discrepant analysis, the composite reference standard (CRS) method and latent class analysis (LCA), are commonly used for this purpose. However, discrepant analysis and latent class analysis are not recommended by regulatory agencies because of their over-optimistic estimates or problematic model assumptions, while the CRS method with the "any-positive" rule has become popular in recent FDA 510(k) submissions for its ease of use. This paper studies CRS methods and evaluates the properties of the two extreme CRS methods, i.e., combining multiple reference test results by the "any positive" rule or by the "all positive" rule. Theoretical results show that the CRS method with the "any-positive" rule generally produces a more biased estimate of sensitivity, and thus a new approach based on these two extreme methods is proposed to reduce the biases of sensitivity and specificity simultaneously. Simulations covering a broad spectrum of possible real-world data situations were performed, and the proposed approach was applied to three real datasets. The results demonstrate that the proposed approach can reduce the total bias significantly and is quite robust to violations of the model assumptions; it is therefore strongly recommended for future applications.

Keywords: Composite Reference Standard; Diagnostic Research; Discrepant Analysis; Latent Class Analysis; Sensitivity; Specificity.

1 Introduction

Infectious diseases in general, and emerging infectious diseases in particular, impose significant burdens on global economies and public health. The agents that cause infectious diseases have produced suffering throughout history, despite continual developments in science and medicine. Over the last 200 years, vaccines and antibiotics have made great strides in combating these scourges, so much so that in December 1967 Surgeon General William H. Stewart declared victory over infectious diseases. Unfortunately, no one told the germs. Infectious agents evolve constantly, and new strains resistant to current treatments emerge frequently. Today, infectious diseases remain one of the leading causes of death worldwide, and they are the third-biggest killer in the United States, behind cancer and heart disease [Jones et al. (2008)].

One widely used diagnostic technique for the detection of infectious agents is the cell culture method followed by cytotoxicity testing, which is time-consuming (usually taking 5 days or longer) and has been reported to have low sensitivity for some infectious agents [Johnson et al. (2000); Bachmann et al. (2009)]. Early detection is crucial to the control and prevention of outbreaks of emerging, re-emerging, and novel infectious diseases, which demands fast and accurate diagnostic tests. In the past two decades, a series of in vitro nucleic acid amplification techniques (NAATs) have been developed for this purpose, which rely on the enzymatic amplification of specific nucleic acid sequences by polymerase chain reaction (PCR), transcription-mediated amplification (TMA), ligase chain reaction (LCR) or other amplification methods [Duck et al. (1990); Kwoh et al. (1989); Saiki et al. (1988); Wu and Wallace (1989)]. Compared to traditional cell culture methods, NAATs are faster (producing results in hours) and more sensitive [Yolken (2002)], which has made them widely accepted for routine use. Since 1995, the US Food and Drug Administration (FDA) has approved or cleared a long list of nucleic acid based tests for the detection of various microbes such as Adenovirus, Bacillus anthracis, Clostridium difficile, Chlamydia trachomatis/Neisseria gonorrhoeae, Dengue virus, etc. (http://www.fda.gov/MedicalDevices/ProductsandMedicalProcedures/InVitroDiagnostics/ucm330711.htm), and the list is still expanding.

Evaluating characteristics such as sensitivity and specificity of a new diagnostic test demands a gold standard (or error free reference) to determine the true disease status of all participants. However, in practice, a gold standard is not always available for all participants and sometimes there is even no uniformly accepted gold standard. Therefore new diagnostic tests are often evaluated by comparing them with the best available test, which usually is not considered sufficiently accurate [Enøe et al. (2000)]. Using an imperfect reference will introduce bias, because the misclassification of participants’ true status results in changes in the final counts of the 2-by-2 contingency table and the estimates derived from them. A main challenge faced by researchers, manufacturers and regulatory agencies is how to evaluate the accuracy of a new NAAT diagnostic test when a single gold standard is not available.

Several approaches, such as discrepant analysis [Schachter et al. (1994); Buimer et al. (1996)], composite reference standard (CRS) methods [Pepe (2003); Alonzo and Pepe (1999); Baughman et al. (2008); Naaktgeboren et al. (2013)] and latent class analysis (LCA) [Hui and Zhou (1998); Pepe and Janes (2007); Qu et al. (1996); Goodman (1974); Joseph et al. (1995); Rindskopf (1986)], have been proposed to handle this important problem by performing and combining multiple (reference) tests in the absence of a single gold standard. These methods have been reviewed extensively [Rutjes et al. (2007); Hawkins et al. (2001)], and the main advantages and disadvantages of each method are briefly summarized below.

In discrepant analysis (also known as discordant analysis), the samples with discordant index and reference test results are retested by an additional method, and the disease status of those samples is usually determined by the additional test results. This method is easy to implement and less costly, and has therefore been widely used. However, since the index test results are involved in the construction of the reference standard, the performance of the index test is usually over-estimated, which results in potentially serious bias. As a consequence, the method has been criticized by many researchers [Hadgu (1996, 1999); Miller (1998)], and regulatory agencies have suggested avoiding it (FDA Guidance, 2007: Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests; see http://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm071287.pdf for detail).

Latent class analysis is another widely used technique when results from several (imperfect) reference tests are available for all participants [Walter and Irwig (1988); Dawid and Skene (1979)]. In this approach, a likelihood function is built to represent the relationship between an index test and two or more imperfect reference tests, and the unobserved disease status and parameters such as prevalence, sensitivity and specificity are estimated by maximizing the likelihood function. This approach is statistically sound and very flexible in dealing with different types of test results (i.e., dichotomous, ordinal and continuous). A major concern about latent class analysis is its lack of robustness against violations of its model assumptions (e.g., conditional independence among all tests) [Albert and Dodd (2004)]. Since these model assumptions are hard to verify, the approach appears problematic to regulatory agencies and is not recommended for estimating sensitivity and specificity (FDA Guidance, 2007: Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests).

In the absence of a gold standard, test results from multiple imperfect tests can be combined to generate a pseudo-reference. Usually the culture method is used as the gold standard to evaluate a new NAAT test. Because the culture method has been reported to have low sensitivity for the detection of some infectious agents, the FDA recommends that manufacturers compare the new NAAT test to a composite reference derived by the "any positive" rule from the results of culture and a more sensitive FDA cleared/approved NAAT test (FDA Guideline, 2014: Class II Special Controls Guideline: Nucleic Acid-Based In Vitro Diagnostic Devices for the Detection of Mycobacterium tuberculosis Complex and Genetic Mutations Associated with Mycobacterium tuberculosis Complex Antibiotic Resistance Respiratory Specimens; see http://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/UCM357642.pdf for detail). Under the "any positive" rule, all samples with discordant reference test results are classified as positive. It is well known that the culture method is almost 100% specific (i.e., culture-positive samples are true positives); therefore, under the "any positive" rule, the positive samples missed by the culture method have a second chance of being captured by the additional NAAT test. To the best of our knowledge, at least 3 recent FDA-cleared NAAT assays for the detection of Clostridium difficile were required by the FDA to be compared with the composite reference derived by the "any positive" rule from the direct culture method followed by cytotoxicity testing and the enrichment culture method followed by cytotoxicity testing (Cepheid Xpert C. difficile, 2009, 510(k) number: K091109; Cepheid Xpert C. difficile/Epi, 2011, 510(k) number: K110203; BD MAX Cdiff Assay, 2013, 510(k) number: K130470).

The "any positive" rule is used to improve the sensitivity of the composite reference. However, false-positive results may be introduced by the additional test unless it is 100% specific. Another way to derive a composite reference is to use the "all positive" rule, where all samples with discordant reference test results are classified as negative. The "all positive" rule can be used to improve the specificity of the composite reference without sacrificing sensitivity when all the reference tests are 100% sensitive. Both algorithms are shown for $k = 2$ in Table 1 below and can be easily generalized to $k \geq 3$, where 1 (or 0) represents a positive (or negative) test result, and + (or −) represents the classified disease status or outcome of interest.

Table 1: Composite reference standard: "any positive" rule and "all positive" rule

| Pattern | Reference Test 1 | Reference Test 2 | "Any Positive" Rule | "All Positive" Rule |
|---|---|---|---|---|
| 1 | 1 | 1 | + | + |
| 2 | 1 | 0 | + | − |
| 3 | 0 | 1 | + | − |
| 4 | 0 | 0 | − | − |
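As a minimal sketch of the two rules in Table 1, the following Python functions classify one sample from its $k$ reference test results coded as 0/1. The function names `f_any` and `f_all` are chosen here for illustration; they simply encode "at least one positive" and "all positive".

```python
from math import prod
from typing import Sequence


def f_any(results: Sequence[int]) -> int:
    """'Any positive' rule: 1 if at least one reference test is positive."""
    return 1 - prod(1 - t for t in results)


def f_all(results: Sequence[int]) -> int:
    """'All positive' rule: 1 only if every reference test is positive."""
    return prod(results)


# The four result patterns of Table 1 (k = 2).
for pattern in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(pattern, "any:", f_any(pattern), "all:", f_all(pattern))
```

The two rules agree on concordant patterns and differ only on discordant ones, which is where the choice of rule matters.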

When more than two reference tests are performed, CRS methods other than the "any positive" rule or the "all positive" rule have also been reported, such as the majority vote [Johnson et al. (2000); Waggoner et al. (2013)], where usually the samples with more than two positive test results were classified as positive.

Unlike latent class analysis, the CRS methods are easy to implement, require no complex statistical model specification and validation, and avoid using the index test results to construct the pseudo-reference. This makes them very appealing in practice, and they have been recommended by both researchers and regulatory agencies [Hadgu (1997); Miller (1998); and the FDA Guideline 2014 mentioned above].

Since the CRS by the "any-positive" rule is recommended by regulatory agencies (e.g., FDA), there is an urgent need to investigate the following questions thoroughly. Is the assumption of conditional independence still necessary for CRS methods and, if so, how robust are they against violation of this assumption? What is the impact of using different CRS methods, and how should a composite reference be selected or constructed to minimize the bias due to an imperfect reference? This paper addresses these questions in Sections 2 through 5 and is organized as follows. In Section 2, the impact of an imperfect reference is investigated under the assumption of conditional independence, and explicit forms of the bias for sensitivity and specificity are given. In Section 3, the properties of CRS methods are studied, and a new approach based on the CRS methods by the "any positive" rule and by the "all positive" rule is proposed to reduce the bias. In Section 4, simulations are performed for various scenarios to compare the performance of the different approaches and their robustness against violation of the assumption of conditional independence. These methods are applied to three real datasets in Section 5. Section 6 presents conclusions and discussion. All proofs are deferred to the appendix.

2 Impact of using an imperfect reference

In this section, we investigate the impact of using an imperfect reference on the evaluation of a new diagnostic test when the two are conditionally independent. We show that if the new diagnostic test is better than a random guess, its sensitivity and specificity will always be under-estimated. For simplicity, denote by $R$ the reference, by $I$ the index test, by $T$ the true disease status, by $\pi$ the disease prevalence, and by $(s_R, c_R)$ and $(s_I, c_I)$ the sensitivity and specificity of $R$ and $I$, respectively. For a given imperfect reference $R$, the estimated sensitivity and specificity $(\hat{s}_I, \hat{c}_I)$ of the index test $I$ are

$$\hat{s}_I = P(I+ \mid R+) \quad \text{and} \quad \hat{c}_I = P(I- \mid R-). \tag{1}$$

In this paper we assume that the index test $I$ and the reference $R$ are conditionally independent, i.e., for a given subject, the result of one test has no impact on the result of the other. This assumption is crucial to eliminate one source of bias and should always be pursued when selecting a reference test, since the sensitivity and specificity are usually over-estimated under positive correlation and under-estimated under negative correlation between the index test $I$ and the reference test $R$ [Pepe (2003)]. Under the assumption of conditional independence, the following results hold.

Theorem 2.0.1. Given $\{\pi, s_R, c_R, s_I, c_I\}$ and assuming conditional independence between the index test $I$ and an imperfect reference $R$, it holds that

$$\begin{aligned}
\hat{s}_I &= s_I - (c_I + s_I - 1)\,\frac{(1-\pi)(1-c_R)}{\pi s_R + (1-\pi)(1-c_R)} = s_I - (c_I + s_I - 1)(1 - \mathrm{PPV}_R), \\
\hat{c}_I &= c_I - (c_I + s_I - 1)\,\frac{\pi(1-s_R)}{(1-\pi)c_R + \pi(1-s_R)} = c_I - (c_I + s_I - 1)(1 - \mathrm{NPV}_R),
\end{aligned} \tag{2}$$

where $\mathrm{PPV}_R$ and $\mathrm{NPV}_R$ represent the positive predictive value (PPV) and the negative predictive value (NPV) of the reference $R$, respectively.

Recall that the index test is better than a random guess provided $s_I + c_I > 1$ (i.e., the point with false positive rate $1 - c_I$ and true positive rate $s_I$ lies above the 45-degree diagonal, the line of no discrimination, in ROC space, or equivalently, Youden's index $s_I + c_I - 1$ is positive). Therefore, the following result shows that $s_I$ and $c_I$ are always under-estimated, provided the reference $R$ is imperfect and the index test is better than a random guess.

Corollary 2.0.2. If the index test is better than a random guess (i.e., $s_I + c_I - 1 > 0$), then $s_I$ and $c_I$ are always under-estimated by $\hat{s}_I$ and $\hat{c}_I$, unless $s_R = 1$ and/or $c_R = 1$, provided the index test $I$ and an imperfect reference $R$ are conditionally independent. Furthermore, it holds that

$$\begin{aligned}
s_I &= \hat{s}_I + (\hat{s}_I + \hat{c}_I - 1) \cdot \frac{1 - \mathrm{PPV}_R}{\mathrm{PPV}_R + \mathrm{NPV}_R - 1}, \\
c_I &= \hat{c}_I + (\hat{s}_I + \hat{c}_I - 1) \cdot \frac{1 - \mathrm{NPV}_R}{\mathrm{PPV}_R + \mathrm{NPV}_R - 1}.
\end{aligned}$$
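The source defers all proofs to the appendix. As a reading aid, the sensitivity identity in (2) can be sketched as follows; this is our own short reconstruction under the stated conditional independence assumption, not the authors' proof. Conditioning on the true status $T$ and using $P(I+ \mid T, R+) = P(I+ \mid T)$ gives

```latex
\begin{align*}
\hat{s}_I = P(I+ \mid R+)
  &= P(I+ \mid T+)\,P(T+ \mid R+) + P(I+ \mid T-)\,P(T- \mid R+) \\
  &= s_I\,\mathrm{PPV}_R + (1 - c_I)\,(1 - \mathrm{PPV}_R) \\
  &= s_I - (s_I + c_I - 1)\,(1 - \mathrm{PPV}_R),
\qquad
\mathrm{PPV}_R = \frac{\pi s_R}{\pi s_R + (1-\pi)(1-c_R)} .
\end{align*}
```

The specificity identity follows in the same way by conditioning on $R-$ and replacing $\mathrm{PPV}_R$ with $\mathrm{NPV}_R$.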

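Theorem 2.0.1 and Corollary 2.0.2 confirm Result 7.11 in Pepe (2003), and in addition provide explicit forms for the biases $\hat{s}_I - s_I$ and $\hat{c}_I - c_I$, i.e.,

$$\begin{aligned}
\mathrm{Bias}(\hat{s}_I) &= \hat{s}_I - s_I = -(\hat{s}_I + \hat{c}_I - 1) \cdot \frac{1 - \mathrm{PPV}_R}{\mathrm{PPV}_R + \mathrm{NPV}_R - 1}, \\
\mathrm{Bias}(\hat{c}_I) &= \hat{c}_I - c_I = -(\hat{s}_I + \hat{c}_I - 1) \cdot \frac{1 - \mathrm{NPV}_R}{\mathrm{PPV}_R + \mathrm{NPV}_R - 1},
\end{aligned} \tag{3}$$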

which show that $\mathrm{Bias}(\hat{s}_I)$ and $\mathrm{Bias}(\hat{c}_I)$ each contain two components, one related to $\hat{s}_I$ and $\hat{c}_I$ and the other related to the predictive values of the reference $R$. The above results also demonstrate that $\mathrm{PPV}_R$ has more impact on $\hat{s}_I - s_I$, while $\mathrm{NPV}_R$ has more impact on $\hat{c}_I - c_I$, since $\hat{s}_I - s_I \to 0$ only when $\mathrm{PPV}_R \to 1$ (or equivalently $c_R \to 1$), and $\hat{c}_I - c_I \to 0$ only when $\mathrm{NPV}_R \to 1$ (or equivalently $s_R \to 1$), provided $0 < \pi < 1$ and $\hat{s}_I + \hat{c}_I \neq 1$. Although $\mathrm{Bias}(\hat{s}_I)$ and $\mathrm{Bias}(\hat{c}_I)$ are given explicitly, several obstacles prevent using them to correct the biases in practice. Firstly, the biases are derived under the assumption of conditional independence, which is hard to verify. Secondly, it is equally difficult to estimate $s_R$ and $c_R$ accurately if the index test $I$ is not perfect. Finally, in diagnostic research it is usually hard to guarantee that the enrolled subjects are a random sample of the target population, so the sample prevalence $\hat{\pi}$ is unlikely to be close to the true population prevalence $\pi$.
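The closed forms of Theorem 2.0.1 can be checked numerically. The sketch below computes the apparent sensitivity and specificity directly from the joint probabilities under conditional independence and compares them with the PPV/NPV expressions. The function names and the parameter values in the example are illustrative assumptions, not values taken from the paper.

```python
def apparent_accuracy(pi, s_r, c_r, s_i, c_i):
    """(s_hat, c_hat) = (P(I+|R+), P(I-|R-)) assuming I and R are
    conditionally independent given the true status."""
    p_r_pos = pi * s_r + (1 - pi) * (1 - c_r)   # P(R+)
    p_r_neg = pi * (1 - s_r) + (1 - pi) * c_r   # P(R-)
    s_hat = (pi * s_r * s_i + (1 - pi) * (1 - c_r) * (1 - c_i)) / p_r_pos
    c_hat = (pi * (1 - s_r) * (1 - s_i) + (1 - pi) * c_r * c_i) / p_r_neg
    return s_hat, c_hat


def closed_form(pi, s_r, c_r, s_i, c_i):
    """The same quantities written via the predictive values of R."""
    ppv = pi * s_r / (pi * s_r + (1 - pi) * (1 - c_r))
    npv = (1 - pi) * c_r / ((1 - pi) * c_r + pi * (1 - s_r))
    youden = s_i + c_i - 1
    return s_i - youden * (1 - ppv), c_i - youden * (1 - npv)


# Illustrative values only: a culture-like reference (low sensitivity,
# high specificity) and a good index test.
params = dict(pi=0.3, s_r=0.7, c_r=0.98, s_i=0.9, c_i=0.95)
print(apparent_accuracy(**params))
print(closed_form(**params))
```

Both calls should print the same pair up to floating-point rounding, with each value below the true $s_I$ and $c_I$ supplied in `params`.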

When no gold standard is available, it is common to perform two imperfect reference tests, the first being very specific and the second being more sensitive. The second reference test is used to capture the positive cases missed by the first one and hence increases the sensitivity of the composite reference. However, "two wrongs don't make a right", and usually the gain in one parameter (e.g., specificity) is accompanied by a loss in the other (e.g., sensitivity). In the next section we show that using the "any positive" rule maximizes the sensitivity but minimizes the specificity of the composite reference, and vice versa for the "all positive" rule. Explicit forms of the sensitivity and specificity of the CRS by the "any-positive" rule or by the "all-positive" rule are also provided when the reference tests are conditionally independent. Furthermore, a new CRS approach other than the "any positive" rule or the "all positive" rule is proposed to reduce $\mathrm{Bias}(\hat{s}_I)$ and $\mathrm{Bias}(\hat{c}_I)$ simultaneously.

3 CRS methods

3.1 CRS methods by the "any positive" rule or by the "all positive" rule

In this section the CRS methods are investigated in detail, and a new CRS approach other than the "any positive" rule or the "all positive" rule is proposed to reduce $\mathrm{Bias}(\hat{s}_I)$ and $\mathrm{Bias}(\hat{c}_I)$ simultaneously. For simplicity, denote by $t_{ij}$ the test result of sample $i$ by reference test $j$ for $i = 1, 2, \cdots, N$ and $j = 1, \cdots, k$. Every CRS method classifies a sample as positive or negative by a pre-defined function $f(\cdot)$, with $f(t_{i1}, \cdots, t_{ik})$ taking the value 1 or 0 for the observed test result pattern $(t_{i1}, \cdots, t_{ik})$. Usually $f(\cdot)$ satisfies $f(1, \cdots, 1) = 1$ and $f(0, \cdots, 0) = 0$ for concordant test results. The functions $f_{\mathrm{any}}(\cdot)$ and $f_{\mathrm{all}}(\cdot)$ for the "any positive" rule and the "all positive" rule are defined as

$$f_{\mathrm{any}}(t_{i1}, \cdots, t_{ik}) = 1 - \prod_{j=1}^{k} (1 - t_{ij}), \qquad f_{\mathrm{all}}(t_{i1}, \cdots, t_{ik}) = \prod_{j=1}^{k} t_{ij}. \tag{4}$$

By defining the function space

$$\mathcal{F} = \{ f : f(t_{i1}, \cdots, t_{ik}) \in \{0, 1\},\ f(1, \cdots, 1) = 1 \text{ and } f(0, \cdots, 0) = 0 \}, \tag{5}$$

each $f \in \mathcal{F}$ defines a composite pseudo-reference $R_f$. Obviously $f_{\mathrm{any}} \in \mathcal{F}$, $f_{\mathrm{all}} \in \mathcal{F}$ and $f_j \in \mathcal{F}$ hold for $j = 1, \cdots, k$, where $f_j(t_{i1}, \cdots, t_{ik}) \equiv t_{ij}$ is the function corresponding to the $j$th reference test alone.

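Denote by $s_{\mathrm{any}}, c_{\mathrm{any}}$ and $s_{\mathrm{all}}, c_{\mathrm{all}}$ the sensitivity and specificity of the pseudo-reference derived by the "any positive" rule and by the "all positive" rule, respectively, and by $s_f$ and $c_f$ the sensitivity and specificity of $R_f$ for any $f \in \mathcal{F}$. Then the following holds.

Theorem 3.1.1. $s_{\mathrm{any}}$, $c_{\mathrm{any}}$, $s_{\mathrm{all}}$ and $c_{\mathrm{all}}$ satisfy

$$s_{\mathrm{any}} = \max_{f \in \mathcal{F}} \{s_f\}, \quad c_{\mathrm{any}} = \min_{f \in \mathcal{F}} \{c_f\}, \quad s_{\mathrm{all}} = \min_{f \in \mathcal{F}} \{s_f\}, \quad c_{\mathrm{all}} = \max_{f \in \mathcal{F}} \{c_f\}. \tag{6}$$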

Theorem 3.1.1 implies that, among all CRS methods defined by $f \in \mathcal{F}$, the "any positive" rule maximizes the sensitivity of the composite reference at the cost of minimizing its specificity, and vice versa for the "all positive" rule. Hence neither the "any positive" rule nor the "all positive" rule can improve the sensitivity and specificity of the composite reference simultaneously.

It is worth pointing out that Theorem 3.1.1 holds without the assumption of conditional independence between the reference tests that latent class analysis requires. However, by additionally assuming conditional independence between the reference tests, explicit forms of $s_{\mathrm{any}}$, $c_{\mathrm{any}}$, $s_{\mathrm{all}}$ and $c_{\mathrm{all}}$ are available. The assumption of conditional independence between reference tests implies that the conditional probabilities $P((t_{i1}, \cdots, t_{ik}) \mid T+)$ and $P((t_{i1}, \cdots, t_{ik}) \mid T-)$ for the observed test result pattern $(t_{i1}, \cdots, t_{ik})$ can be written as

$$P((t_{i1}, \cdots, t_{ik}) \mid T+) = \prod_{j=1}^{k} P(t_{ij} \mid T+) \quad \text{and} \quad P((t_{i1}, \cdots, t_{ik}) \mid T-) = \prod_{j=1}^{k} P(t_{ij} \mid T-), \tag{7}$$

with $P(t_{ij} = 1 \mid T+) = P(R_{f_j}+ \mid T+) = s_j$, $P(t_{ij} = 0 \mid T+) = P(R_{f_j}- \mid T+) = 1 - s_j$, $P(t_{ij} = 1 \mid T-) = P(R_{f_j}+ \mid T-) = 1 - c_j$ and $P(t_{ij} = 0 \mid T-) = P(R_{f_j}- \mid T-) = c_j$. The theorem below provides explicit formulas for $s_{\mathrm{any}}$, $c_{\mathrm{any}}$, $s_{\mathrm{all}}$ and $c_{\mathrm{all}}$, provided all reference tests are conditionally independent.

Theorem 3.1.2. Given $\{s_j, c_j\}_{j=1}^{k}$, under the assumption of conditional independence between the reference tests, the following hold:

(1) For sensitivity, $s_{\mathrm{any}} = 1 - \prod_{j=1}^{k} (1 - s_j)$ and $s_{\mathrm{all}} = \prod_{j=1}^{k} s_j$, with $s_{\mathrm{any}} \geq \max_j \{s_j\}$ and $s_{\mathrm{all}} \leq \min_j \{s_j\}$.

(2) For specificity, $c_{\mathrm{all}} = 1 - \prod_{j=1}^{k} (1 - c_j)$ and $c_{\mathrm{any}} = \prod_{j=1}^{k} c_j$, with $c_{\mathrm{any}} \leq \min_j \{c_j\}$ and $c_{\mathrm{all}} \geq \max_j \{c_j\}$.

Obviously Theorem 3.1.2 is a special case of Theorem 3.1.1 under the assumption of conditional independence.
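To make Theorem 3.1.2 concrete, the following sketch computes the sensitivity and specificity of a composite rule by summing pattern probabilities under conditional independence, which can then be compared with the product formulas. The helper names and the three sensitivity/specificity values are illustrative assumptions, not values from the paper.

```python
from itertools import product
from math import prod


def pattern_prob(pattern, pos_rates):
    """P(pattern | status) under conditional independence, where
    pos_rates[j] = P(t_j = 1 | status)."""
    return prod(p if t else 1 - p for t, p in zip(pattern, pos_rates))


def accuracy_of_rule(rule, sens, spec):
    """Sensitivity and specificity of the composite defined by
    rule(pattern) -> 0/1, obtained by enumerating all 2^k patterns."""
    patterns = list(product((0, 1), repeat=len(sens)))
    false_pos_rates = [1 - c for c in spec]
    s_f = sum(pattern_prob(p, sens) for p in patterns if rule(p) == 1)
    c_f = sum(pattern_prob(p, false_pos_rates) for p in patterns if rule(p) == 0)
    return s_f, c_f


# Illustrative reference tests: one culture-like test and two NAAT-like tests.
sens = [0.65, 0.85, 0.90]
spec = [0.98, 0.90, 0.92]

s_any, c_any = accuracy_of_rule(lambda p: int(any(p)), sens, spec)
s_all, c_all = accuracy_of_rule(lambda p: int(all(p)), sens, spec)
print("any positive:", s_any, c_any)
print("all positive:", s_all, c_all)
```

Passing any other rule in $\mathcal{F}$ (for example a majority vote) to `accuracy_of_rule` gives a quick way to see where it falls between the two extremes.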

The assumption of conditional independence also provides a rationale for defining $f(1, \cdots, 1) = 1$ and $f(0, \cdots, 0) = 0$ for all composite reference methods. In fact, for the concordant test result patterns $(1, \cdots, 1)$ and $(0, \cdots, 0)$, it holds that

$$P((1, \cdots, 1) \mid +) = \prod_{j=1}^{k} s_j, \quad P((1, \cdots, 1) \mid -) = \prod_{j=1}^{k} (1 - c_j) \tag{8}$$

and

$$P((0, \cdots, 0) \mid +) = \prod_{j=1}^{k} (1 - s_j), \quad P((0, \cdots, 0) \mid -) = \prod_{j=1}^{k} c_j. \tag{9}$$

Obviously the conditions

$$\prod_{j=1}^{k} s_j > \prod_{j=1}^{k} (1 - c_j) \quad \text{and} \quad \prod_{j=1}^{k} c_j > \prod_{j=1}^{k} (1 - s_j) \tag{10}$$

are sufficient for defining $f(1, \cdots, 1) = 1$ and $f(0, \cdots, 0) = 0$ when comparing $P((t_{i1}, \cdots, t_{ik}) \mid +)$ with $P((t_{i1}, \cdots, t_{ik}) \mid -)$ for the observed test result pattern $(t_{i1}, \cdots, t_{ik})$. Recall that $s_j + c_j > 1$ is always satisfied provided reference test $j$ is better than a random guess; this gives $s_j > 1 - c_j$ and $c_j > 1 - s_j$ for every $j$, and hence (10). Therefore, the assumption $f(1, \cdots, 1) = 1$ and $f(0, \cdots, 0) = 0$ is reasonable if all reference tests are conditionally independent and better than a random guess, which implies that one only needs to determine the composite reference for subjects with discordant reference test results.

It has been shown that $s_R$ (or $c_R$) has more impact on $\hat{c}_I$ (or $\hat{s}_I$), and thus neither the CRS by the "any-positive" rule nor the CRS by the "all-positive" rule alone can be used to improve both $\hat{s}_I$ and $\hat{c}_I$ simultaneously. In the next section a new approach is proposed that combines the results from the CRSs by the "any positive" rule and by the "all positive" rule to reduce the bias of both $\hat{s}_I$ and $\hat{c}_I$.

3.2 A new CRS approach by the "any-all positive" rule

Recall that when using an imperfect reference $R$, $\mathrm{Bias}(\hat{s}_I) \to 0$ only if $c_R \to 1$, and $\mathrm{Bias}(\hat{c}_I) \to 0$ only if $s_R \to 1$, provided $0 < \pi < 1$ and $\hat{s}_I + \hat{c}_I \neq 1$. This implies that finding a composite reference with both high sensitivity and high specificity is crucial to reduce $\mathrm{Bias}(\hat{s}_I)$ and $\mathrm{Bias}(\hat{c}_I)$ simultaneously. However, Theorems 3.1.1 and 3.1.2 demonstrate that no single composite pseudo-reference can have high sensitivity and high specificity simultaneously, which motivates estimating $s_I$ and $c_I$ separately with different composite pseudo-references. The new CRS approach, called the CRS by the "any-all positive" rule, is defined as

$$\hat{s}_I := \hat{s}_{I,\mathrm{all}} \quad \text{and} \quad \hat{c}_I := \hat{c}_{I,\mathrm{any}}, \tag{11}$$

i.e., the pseudo-reference derived by the "any-positive" rule is used to evaluate $c_I$, and the pseudo-reference derived by the "all-positive" rule is used to evaluate $s_I$. Since $s_{\mathrm{any}}$ maximizes $s_f$ and $c_{\mathrm{all}}$ maximizes $c_f$ over $f \in \mathcal{F}$, in general $\mathrm{Bias}(\hat{s}_I)$ and $\mathrm{Bias}(\hat{c}_I)$ can be reduced simultaneously by this approach.

Since more than one composite pseudo-reference is used in this approach, computing the 95% confidence intervals (CIs) for $\hat{s}_I$ and $\hat{c}_I$ is not trivial. We recommend reporting empirical 95% CIs generated by the bootstrap method.
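The "any-all positive" estimates and a bootstrap interval can be sketched as follows. The source does not specify which bootstrap variant is used, so the percentile bootstrap over subjects shown here is an assumption, as are the function names and the toy data.

```python
import random


def any_all_estimate(index, refs):
    """Return (s_hat, c_hat) under the 'any-all positive' rule.

    index: list of 0/1 index test results.
    refs:  list of tuples of 0/1 reference results, one tuple per subject.
    """
    # Sensitivity: among subjects positive by the 'all positive' reference.
    pos_all = [i for i, r in zip(index, refs) if all(r)]
    # Specificity: among subjects negative by the 'any positive' reference.
    neg_any = [i for i, r in zip(index, refs) if not any(r)]
    s_hat = sum(pos_all) / len(pos_all) if pos_all else float("nan")
    c_hat = sum(1 - i for i in neg_any) / len(neg_any) if neg_any else float("nan")
    return s_hat, c_hat


def percentile_ci(values, alpha=0.05):
    """Empirical (percentile) interval, ignoring undefined replicates."""
    vals = sorted(v for v in values if v == v)  # v == v is False for NaN
    lo = vals[int((alpha / 2) * (len(vals) - 1))]
    hi = vals[int((1 - alpha / 2) * (len(vals) - 1))]
    return lo, hi


def bootstrap_ci(index, refs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CIs for (s_hat, c_hat), resampling subjects."""
    rng = random.Random(seed)
    n = len(index)
    s_vals, c_vals = [], []
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        s, c = any_all_estimate([index[i] for i in rows], [refs[i] for i in rows])
        s_vals.append(s)
        c_vals.append(c)
    return percentile_ci(s_vals, alpha), percentile_ci(c_vals, alpha)


# Toy data, made up for illustration: 8 subjects, 2 reference tests.
index = [1, 1, 1, 0, 0, 0, 0, 1]
refs = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 0), (0, 1), (0, 0)]
print(any_all_estimate(index, refs))  # (1.0, 0.75)
```

Subjects with discordant reference results enter neither estimate, so both denominators shrink as the reference tests disagree more often; with small samples the resulting intervals can be wide.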

4 Simulation results

In this section, three simulation scenarios are implemented to evaluate the accuracy and robustness of the proposed approach using three imperfect references when (1) all four tests are conditionally independent; (2) the index test and the reference tests are conditionally independent, but the three reference tests are conditionally dependent; (3) all four tests are conditionally dependent.

To mimic real data, we assume that for the index test $s_I \in (0.7, 1.0)$ and $c_I \in (0.8, 1.0)$. For the three imperfect reference tests, we assume that $s_{R_1} \in (0.5, 0.8)$, $c_{R_1} \in (0.95, 1.0)$ (to mimic the culture method), and $s_{R_2}, s_{R_3} \in (0.7, 1.0)$, $c_{R_2}, c_{R_3} \in (0.8, 1.0)$; i.e., reference test 1 has low sensitivity but high specificity, while reference tests 2 and 3 have both high sensitivity and high specificity. In each simulation, the parameters $\{s_I, c_I\}$ and $\{s_{R_i}, c_{R_i}\}_{i=1}^{3}$ were not fixed but were randomly selected from their ranges. The simulation algorithm for scenario 1 (i.e., all four tests are conditionally independent) is given below; it can be easily modified for scenarios 2 and 3.

Simulation Algorithm for Scenario 1: for given parameters $\{N, \pi\}$ ($\pi$ is the prevalence), do the following:

1. Sample $\{s_I, c_I\}$ and $\{s_{R_i}, c_{R_i}\}_{i=1}^{3}$ randomly from their ranges.
2. Generate $N = 1000$ samples with $N\pi$ being positive (with value 1) and $N(1 - \pi)$ being negative (with value 0).
3. Generate random numbers $\{(y_i, x_{i1}, x_{i2}, x_{i3})\}_{i=1}^{N}$ from the uniform distribution Uniform(0, 1).
4. For reference test $j = 1, 2, 3$:
   a. For positive samples, assign 0 if $x_{ij} \geq s_{R_j}$ and 1 otherwise.
   b. For negative samples, assign 1 if $x_{ij} \geq c_{R_j}$ and 0 otherwise.
5. For the index test:
   a. For positive samples, assign 0 if $y_i \geq s_I$ and 1 otherwise.
   b. For negative samples, assign 1 if $y_i \geq c_I$ and 0 otherwise.
6. For the simulated test results of the three reference tests, do the following:
   a. Generate the composite pseudo-reference by the "any positive" rule;
   b. Generate the composite pseudo-reference by the "all positive" rule.
7. Estimate $\hat{s}_I$ and $\hat{c}_I$ by each single reference, by the CRSs, and by latent class analysis.
8. Repeat Step 1 to Step 7 1000 times and report the means of $|\mathrm{Bias}(\hat{s}_I)|$ and $|\mathrm{Bias}(\hat{c}_I)|$.

When conditional dependence is present, $\{(x_{i1}, x_{i2}, x_{i3})\}_{i=1}^{N}$ for simulation scenario 2 (or $\{(y_i, x_{i1}, x_{i2}, x_{i3})\}_{i=1}^{N}$ for simulation scenario 3) were correlated uniform random numbers generated from a Gaussian copula, where the correlation coefficient $\rho_{jj'} \in (0.1, 0.9)$ was randomly selected in each simulation. In all three scenarios, $\pi \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ was used to investigate the impact of prevalence on the estimates. We assume positive correlation in the presence of conditional dependence because it is common in practice. Latent class analysis was performed using the R package randomLCA. Tables 2 to 4 below summarize the simulation results for scenarios 1 to 3, respectively, where the standard errors of $\mathrm{Bias}(\hat{s}_I)$ ranged from 0.0003 to 0.0047 for Tables 2 and 3, and from 0.0003 to 0.0026 for Table 4; the standard errors of $\mathrm{Bias}(\hat{c}_I)$ ranged from 0.0002 to 0.0051 for Tables 2 to 4.

Table 2: Simulation results for Scenario 1: all four tests are conditionally independent

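| Estimate | π | Single reference R1 | Single reference R2 | Single reference R3 | CRS Any | CRS All | CRS Any-all | LCA |
|---|---|---|---|---|---|---|---|---|
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.1 | 0.168 | 0.415 | 0.356 | 0.506 | 0.043 | 0.043 | 0.034 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.3 | 0.090 | 0.175 | 0.201 | 0.292 | 0.013 | 0.013 | 0.013 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.5 | 0.024 | 0.079 | 0.076 | 0.127 | 0.018 | 0.018 | 0.014 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.7 | 0.020 | 0.043 | 0.042 | 0.068 | 0.013 | 0.013 | 0.008 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.9 | 0.010 | 0.015 | 0.014 | 0.018 | 0.009 | 0.009 | 0.005 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.1 | 0.032 | 0.020 | 0.015 | 0.005 | 0.047 | 0.005 | 0.003 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.3 | 0.108 | 0.052 | 0.055 | 0.016 | 0.146 | 0.016 | 0.014 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.5 | 0.218 | 0.109 | 0.114 | 0.024 | 0.273 | 0.024 | 0.019 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.7 | 0.308 | 0.236 | 0.213 | 0.025 | 0.402 | 0.025 | 0.019 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.9 | 0.613 | 0.548 | 0.419 | 0.107 | 0.675 | 0.107 | 0.039 |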
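The Scenario 1 algorithm can be sketched in Python as below. This is a simplified illustration, not the authors' code: it covers only the three CRS estimates (the single-reference and LCA estimates of Step 7 are omitted), all function names are hypothetical, and the example uses fewer replicates than the 1000 stated in Step 8 to keep the run short.

```python
import random
from statistics import mean


def accuracy_against(index, reference):
    """Apparent sensitivity and specificity of the index test against a 0/1 reference."""
    pos = [i for i, r in zip(index, reference) if r == 1]
    neg = [i for i, r in zip(index, reference) if r == 0]
    return sum(pos) / len(pos), sum(1 - i for i in neg) / len(neg)


def simulate_once(n, pi, rng):
    """One replicate of Steps 1-7 for the composite references only."""
    # Step 1: draw the accuracy parameters from the ranges given in the text.
    s_i, c_i = rng.uniform(0.7, 1.0), rng.uniform(0.8, 1.0)
    s_r = [rng.uniform(0.5, 0.8), rng.uniform(0.7, 1.0), rng.uniform(0.7, 1.0)]
    c_r = [rng.uniform(0.95, 1.0), rng.uniform(0.8, 1.0), rng.uniform(0.8, 1.0)]
    # Step 2: fixed numbers of truly positive and truly negative samples.
    n_pos = round(n * pi)
    truth = [1] * n_pos + [0] * (n - n_pos)
    index, refs = [], []
    for t in truth:
        # Steps 3-5: independent Uniform(0, 1) draws, thresholded per test.
        y = rng.random()
        index.append(int(y < s_i) if t else int(y >= c_i))
        xs = [rng.random() for _ in range(3)]
        refs.append([int(x < s) if t else int(x >= c)
                     for x, s, c in zip(xs, s_r, c_r)])
    # Step 6: composite pseudo-references.
    ref_any = [int(any(r)) for r in refs]
    ref_all = [int(all(r)) for r in refs]
    # Step 7: apparent accuracy of the index test against each composite.
    s_any, c_any = accuracy_against(index, ref_any)
    s_all, c_all = accuracy_against(index, ref_all)
    return {
        "any": (abs(s_any - s_i), abs(c_any - c_i)),
        "all": (abs(s_all - s_i), abs(c_all - c_i)),
        "any_all": (abs(s_all - s_i), abs(c_any - c_i)),
    }


def mean_abs_bias(n_rep, n, pi, seed=0):
    """Step 8: average |Bias| of (s_hat, c_hat) over n_rep replicates."""
    rng = random.Random(seed)
    runs = [simulate_once(n, pi, rng) for _ in range(n_rep)]
    return {rule: (mean(r[rule][0] for r in runs), mean(r[rule][1] for r in runs))
            for rule in ("any", "all", "any_all")}


if __name__ == "__main__":
    print(mean_abs_bias(n_rep=200, n=1000, pi=0.3))
```

Because the random number generator and seed differ from the original study, the output will not reproduce Table 2 digit for digit, but the ordering of the three rules should be comparable.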

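Table 3: Simulation results for Scenario 2: the index test and the reference tests are conditionally independent, but the three reference tests are conditionally dependent

| Estimate | π | Single reference R1 | Single reference R2 | Single reference R3 | CRS Any | CRS All | CRS Any-all | LCA |
|---|---|---|---|---|---|---|---|---|
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.1 | 0.186 | 0.249 | 0.385 | 0.470 | 0.025 | 0.025 | 0.024 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.3 | 0.019 | 0.145 | 0.146 | 0.218 | 0.019 | 0.019 | 0.022 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.5 | 0.020 | 0.057 | 0.078 | 0.122 | 0.023 | 0.023 | 0.013 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.7 | 0.023 | 0.046 | 0.043 | 0.072 | 0.013 | 0.013 | 0.012 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.9 | 0.009 | 0.010 | 0.008 | 0.013 | 0.009 | 0.009 | 0.010 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.1 | 0.032 | 0.015 | 0.017 | 0.005 | 0.048 | 0.005 | 0.006 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.3 | 0.088 | 0.050 | 0.046 | 0.015 | 0.126 | 0.015 | 0.014 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.5 | 0.182 | 0.061 | 0.094 | 0.019 | 0.229 | 0.019 | 0.021 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.7 | 0.332 | 0.231 | 0.158 | 0.025 | 0.408 | 0.025 | 0.023 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.9 | 0.591 | 0.418 | 0.435 | 0.132 | 0.627 | 0.132 | 0.113 |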
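For Scenarios 2 and 3, the correlated uniform draws from a Gaussian copula can be sketched with the standard library alone. The version below is a simplification: it uses a one-factor construction with a single common correlation `rho` on the latent normal scale, whereas the text draws a separate $\rho_{jj'}$ for each pair of tests. The function name is hypothetical.

```python
import math
import random
from statistics import NormalDist


def correlated_uniforms(k, rho, rng):
    """Draw k Uniform(0, 1) variates linked by a one-factor Gaussian copula.

    Each latent normal is sqrt(rho) * common + sqrt(1 - rho) * own noise,
    so every pair of latent normals has correlation rho (0 <= rho <= 1).
    """
    std_normal = NormalDist()
    common = rng.gauss(0.0, 1.0)
    draws = []
    for _ in range(k):
        z = math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        draws.append(std_normal.cdf(z))  # map the normal margin to Uniform(0, 1)
    return draws


rng = random.Random(0)
rho = rng.uniform(0.1, 0.9)               # one correlation per simulation run
x1, x2, x3 = correlated_uniforms(3, rho, rng)
print(rho, x1, x2, x3)
```

These draws would replace the independent uniforms for the three reference tests in Scenario 2 (with `k = 4`, including $y_i$, for Scenario 3), with the thresholding steps of the algorithm left unchanged. Allowing distinct pairwise correlations would require a full correlation matrix that is positive definite, for example via a Cholesky factor.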

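Table 4: Simulation results for Scenario 3: all four tests are conditionally dependent

| Estimate | π | Single reference R1 | Single reference R2 | Single reference R3 | CRS Any | CRS All | CRS Any-all | LCA |
|---|---|---|---|---|---|---|---|---|
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.1 | 0.187 | 0.378 | 0.298 | 0.455 | 0.029 | 0.029 | 0.028 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.3 | 0.046 | 0.154 | 0.157 | 0.237 | 0.031 | 0.031 | 0.030 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.5 | 0.049 | 0.079 | 0.061 | 0.114 | 0.020 | 0.020 | 0.013 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.7 | 0.013 | 0.039 | 0.029 | 0.063 | 0.027 | 0.027 | 0.013 |
| $\lvert\mathrm{Bias}(\hat{s}_I)\rvert$ | 0.9 | 0.010 | 0.008 | 0.012 | 0.028 | 0.013 | 0.013 | 0.008 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.1 | 0.029 | 0.017 | 0.017 | 0.013 | 0.040 | 0.013 | 0.011 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.3 | 0.083 | 0.038 | 0.046 | 0.014 | 0.116 | 0.014 | 0.014 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.5 | 0.208 | 0.149 | 0.129 | 0.027 | 0.276 | 0.027 | 0.022 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.7 | 0.295 | 0.238 | 0.235 | 0.026 | 0.416 | 0.026 | 0.017 |
| $\lvert\mathrm{Bias}(\hat{c}_I)\rvert$ | 0.9 | 0.588 | 0.376 | 0.400 | 0.102 | 0.613 | 0.102 | 0.074 |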

Simulation results demonstrate that the prevalence $\pi$ is negatively correlated with $|\mathrm{Bias}(\hat{s}_I)|$ but positively correlated with $|\mathrm{Bias}(\hat{c}_I)|$, i.e., $|\mathrm{Bias}(\hat{s}_I)|$ decreases as $\pi$ increases, while $|\mathrm{Bias}(\hat{c}_I)|$ increases as $\pi$ increases. For example, in Table 3, for reference test 1, $|\mathrm{Bias}(\hat{s}_I)|$ decreases from 0.186 ($\pi = 0.1$) to 0.009 ($\pi = 0.9$), while $|\mathrm{Bias}(\hat{c}_I)|$ increases from 0.032 ($\pi = 0.1$) to 0.591 ($\pi = 0.9$). The simulation results are consistent with Theorem 2.0.1, from which one knows that, holding $s_I, c_I, s_R, c_R$ constant, $\hat{s}_I \to s_I$ as $\pi \to 1$ and $\hat{c}_I \to c_I$ as $\pi \to 0$. Simulation results also confirm that $s_R$ has more impact on $\hat{c}_I$ and $c_R$ has more impact on $\hat{s}_I$. Recall that reference test 1 is used to mimic the culture method and is designed to have lower sensitivity but higher specificity than reference tests 2 and 3. The simulation results in Tables 2, 3 and 4 show consistently that reference test 1 produced a less biased estimate $\hat{s}_I$ but a more biased estimate $\hat{c}_I$ for almost all $\pi \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ when compared to reference tests 2 and 3.

Simulation results also demonstrate that a single imperfect reference usually introduces a large total bias $|\mathrm{Bias}(\hat{s}_I)| + |\mathrm{Bias}(\hat{c}_I)|$. For example, for reference test 3 in Table 2, the total bias ranged from 0.076 + 0.114 = 0.190 ($\pi = 0.5$) to 0.014 + 0.419 = 0.433 ($\pi = 0.9$). Furthermore, simulation results confirm that the CRS by the "any-positive" rule does reduce $\mathrm{Bias}(\hat{c}_I)$ but increases $\mathrm{Bias}(\hat{s}_I)$, and vice versa for the CRS by the "all-positive" rule. Simulation results show that the CRS by the "any-positive" rule, which is recommended by regulatory agencies, produced a large total bias in all three simulation scenarios: the total bias ranged from 0.093 ($\pi = 0.7$) to 0.511 ($\pi = 0.1$) in scenario 1, from 0.097 ($\pi = 0.7$) to 0.475 ($\pi = 0.1$) in scenario 2, and from 0.089 ($\pi = 0.7$) to 0.468 ($\pi = 0.1$) in scenario 3.

For all three simulation scenarios, the CRS by the "any-all positive" rule produced the smallest total bias $|\mathrm{Bias}(\hat{s}_I)| + |\mathrm{Bias}(\hat{c}_I)|$ compared to a single reference or the CRSs by the "any-positive" rule or by the "all-positive" rule. Moreover, except for $\pi = 0.9$, the total bias produced by the CRS by the "any-all positive" rule is quite robust against violation of the assumption of conditional independence: for $\pi < 0.9$ it varies from 0.029 ($\pi = 0.3$) to 0.048 ($\pi = 0.1$) for scenario 1 (Table 2), from 0.03 ($\pi = 0.1$) to 0.042 ($\pi = 0.5$) for scenario 2 (Table 3), and from 0.042 ($\pi = 0.1$) to 0.053 ($\pi = 0.7$) for scenario 3 (Table 4). When $\pi = 0.9$, $|\mathrm{Bias}(\hat{c}_I)|$ became large for all approaches across all three simulation scenarios, due to the small number of negative samples.
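The total-bias figures quoted for scenario 1 can be recomputed from the CRS columns of Table 2. The short script below does this arithmetic; the dictionary layout and variable names are illustrative, and the numbers are copied from Table 2.

```python
# |Bias| values from the "CRS Any" and "CRS All" columns of Table 2 (scenario 1),
# keyed by prevalence.
bias_s = {0.1: {"any": 0.506, "all": 0.043}, 0.3: {"any": 0.292, "all": 0.013},
          0.5: {"any": 0.127, "all": 0.018}, 0.7: {"any": 0.068, "all": 0.013},
          0.9: {"any": 0.018, "all": 0.009}}
bias_c = {0.1: {"any": 0.005, "all": 0.047}, 0.3: {"any": 0.016, "all": 0.146},
          0.5: {"any": 0.024, "all": 0.273}, 0.7: {"any": 0.025, "all": 0.402},
          0.9: {"any": 0.107, "all": 0.675}}

total = {"any": {}, "all": {}, "any_all": {}}
for pi in bias_s:
    total["any"][pi] = round(bias_s[pi]["any"] + bias_c[pi]["any"], 3)
    total["all"][pi] = round(bias_s[pi]["all"] + bias_c[pi]["all"], 3)
    # "Any-all": sensitivity bias from the "all" rule, specificity bias from the "any" rule.
    total["any_all"][pi] = round(bias_s[pi]["all"] + bias_c[pi]["any"], 3)

for rule, values in total.items():
    print(rule, values)
```

At every prevalence the "any-all" total is no larger than either single-rule total, because it pairs the smaller of the two sensitivity biases with the smaller of the two specificity biases.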

Simulation results also showed that LCA is robust, and in most cases it produced less biased estimates than the CRS by the "any-all positive" rule. However, the difference between the two approaches is very small, except for Bias(ĉI) when π = 0.9. In scenario 1 (Table 2), the difference in Bias(ŝI) between the two approaches ranged from 0 (π = 0.3) to 0.009 (π = 0.1), and the difference in Bias(ĉI) ranged from 0.002 (π = 0.1, 0.3) to 0.006 (π = 0.7) for π ≤ 0.7 and was 0.068 for π = 0.9. In scenario 2 (Table 3), the difference in Bias(ŝI) ranged from −0.003 (π = 0.3) to 0.01 (π = 0.5), and the difference in Bias(ĉI) ranged from −0.002 (π = 0.5) to 0.002 (π = 0.7) for π ≤ 0.7 and was 0.019 for π = 0.9. Similarly, in scenario 3 (Table 4), the difference in Bias(ŝI) ranged from 0.001 (π = 0.1, 0.3) to 0.014 (π = 0.7), and the difference in Bias(ĉI) ranged from 0 (π = 0.3) to 0.009 (π = 0.7) for π ≤ 0.7 and was 0.028 for π = 0.9.

5 Practical Applications

5.1 Chlamydia trachomatis

In the first example, we analyzed the data reported in Gaydos et al. (2004), in which urine specimens from 506 subjects were tested by 3 NAATs: the Abbott LCx (LCx), the BD ProbeTec ET (ProbeTec), and the Gen-Probe APTIMA Combo 2 (AC2) for detection of Chlamydia trachomatis. The data is summarized in Table 8.

Of 506 subjects, 75 were true positive resulting in a sample prevalence π = 75/506 ≈ 0.148. The sensitivity was 0.96 (95% CI: 0.889 to 0.986), 0.96 (95% CI: 0.889 to 0.986) and 1.0 (95% CI: 0.951 to 1.0) for LCx, ProbeTec and AC2, respectively, and the specificity was 0.991 (95% CI: 0.976 to 0.996), 1.0 (95% CI: 0.991 to 1.0) and 0.988 (95% CI: 0.973 to 0.995) for LCx, ProbeTec and AC2, respectively. In this example all three NAATs have both high sensitivity (≥ 0.96) and high specificity (≥ 0.988).
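The point estimates above follow directly from the pattern counts in Table 8. The sketch below recomputes the sensitivities and attaches a Wilson score interval; the source does not name its interval method, so the use of the Wilson interval is an inference from the fact that it reproduces the reported limits to three decimals. The names `TABLE8`, `wilson_interval` and `sensitivity_counts` are illustrative.

```python
from math import sqrt

# Table 8 patterns: (LCx, ProbeTec, AC2, true_status, frequency)
TABLE8 = [
    (1, 1, 1, 1, 69),
    (1, 0, 1, 1, 3),
    (0, 1, 1, 1, 3),
    (1, 0, 0, 0, 4),
    (0, 0, 1, 0, 5),
    (0, 0, 0, 0, 422),
]


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def sensitivity_counts(rows, index):
    """(true positives detected by test `index`, all true positives)."""
    detected = sum(row[4] for row in rows if row[3] == 1 and row[index] == 1)
    diseased = sum(row[4] for row in rows if row[3] == 1)
    return detected, diseased


for name, index in (("LCx", 0), ("ProbeTec", 1), ("AC2", 2)):
    detected, diseased = sensitivity_counts(TABLE8, index)
    low, high = wilson_interval(detected, diseased)
    print(f"{name}: sensitivity {detected}/{diseased} = "
          f"{detected / diseased:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

The same two helpers applied to the true-negative rows would give the specificities in the same way.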

Because the true status of each subject is known, Bias(ŝI) and Bias(ĉI) for LCx, ProbeTec and AC2 could be evaluated using a single reference, the CRSs and LCA. The results are summarized in Table 5.


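Table 5: Comparison of three NAATs for detection of Chlamydia trachomatis

| Reference | Bias(ŝI) LCx | Bias(ŝI) ProbeTec | Bias(ŝI) AC2 | Bias(ĉI) LCx | Bias(ĉI) ProbeTec | Bias(ĉI) AC2 |
|---|---|---|---|---|---|---|
| Single reference: LCx | | -0.0521 | -0.0526 | | -0.0070 | -0.0070 |
| Single reference: ProbeTec | -0.0017 | | 0 | -0.0068 | | -0.0068 |
| Single reference: AC2 | -0.06 | -0.06 | | -0.0001 | 0 | |
| CRS: Any Positive | -0.06 | -0.1029 | -0.0506 | -0.0001 | 0 | -0.0001 |
| CRS: All Positive | -0.0017 | -0.0017 | 0 | -0.0068 | -0.0069 | -0.0136 |
| CRS: Any-all Positive | -0.0017 | -0.0017 | 0 | -0.0001 | 0 | -0.0001 |
| LCA | -0.0017 | -0.001 | -1.33 × 10^-6 | -0.0001 | -2.33 × 10^-7 | -0.0002 |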

Results in Table 5 showed that, in all comparisons, ŝI and ĉI were under-estimated whenever they were biased. When using ProbeTec as reference to evaluate LCx or AC2, the biases were very low for both sensitivity (< 0.2%) and specificity (< 0.7%). The CRS by the "any-positive" rule reduced Bias(ĉI) (≤ 0.01%) but increased Bias(ŝI). It is worth pointing out that, although all three NAATs have both high sensitivity and high specificity, the CRS by the "any-positive" rule still introduced a large bias in ŝI, which was under-estimated by 0.1029 for ProbeTec, by 0.06 for LCx, and by 0.0506 for AC2. In contrast, the CRS by the "all-positive" rule reduced Bias(ŝI) (≤ 0.2%) but increased Bias(ĉI) slightly (up to 1.36%). As expected, the biases were reduced for both ŝI (< 0.2%) and ĉI (≤ 0.01%) when using the CRS by the "any-all positive" rule. Recall that Bias(ŝI) = 0 holds provided the index test I and reference R are conditionally independent and cR = 1, and Bias(ĉI) = 0 holds provided I and R are conditionally independent and sR = 1. Since Bias(ŝI) ≈ 0 when using ProbeTec as reference, and Bias(ĉI) ≈ 0 when using AC2 as reference, we conclude that LCx, ProbeTec and AC2 are conditionally independent, which also explains why LCA performed as well as the CRS by the "any-all positive" rule in this example.
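The CRS rows of Table 5 can be recomputed from the pattern counts in Table 8. The following is a minimal sketch, assuming that each index test is evaluated against a composite of the other two tests and that the "any-all positive" estimate pairs ŝI from the "all-positive" CRS with ĉI from the "any-positive" CRS, which is the pairing the table displays. The function names are illustrative.

```python
# Table 8 patterns: (LCx, ProbeTec, AC2, true_status, frequency)
TABLE8 = [
    (1, 1, 1, 1, 69),
    (1, 0, 1, 1, 3),
    (0, 1, 1, 1, 3),
    (1, 0, 0, 0, 4),
    (0, 0, 1, 0, 5),
    (0, 0, 0, 0, 422),
]
NAMES = ("LCx", "ProbeTec", "AC2")


def estimates(rows, index, reference):
    """s_hat = P(I+ | R+) and c_hat = P(I- | R-) for index test `index`.

    `reference` maps the other tests' results to a truth value (e.g. the
    built-ins `any` or `all`); pass None to use the true status instead.
    """
    both_pos = ref_pos = both_neg = ref_neg = 0
    for *results, truth, freq in rows:
        others = [r for j, r in enumerate(results) if j != index]
        ref = truth if reference is None else int(reference(others))
        if ref:
            ref_pos += freq
            both_pos += freq * results[index]
        else:
            ref_neg += freq
            both_neg += freq * (1 - results[index])
    return both_pos / ref_pos, both_neg / ref_neg


def crs_bias(rows, index):
    """(Bias(s_hat), Bias(c_hat)) for the any, all and any-all rules."""
    s_true, c_true = estimates(rows, index, None)
    s_any, c_any = estimates(rows, index, any)
    s_all, c_all = estimates(rows, index, all)
    return {
        "any": (s_any - s_true, c_any - c_true),
        "all": (s_all - s_true, c_all - c_true),
        "any-all": (s_all - s_true, c_any - c_true),
    }


for index, name in enumerate(NAMES):
    for rule, (bias_s, bias_c) in crs_bias(TABLE8, index).items():
        print(f"{name:9s} {rule:8s} Bias(s)={bias_s:+.4f}  Bias(c)={bias_c:+.4f}")
```

For ProbeTec, for example, the "any-positive" reference is positive for 84 subjects, of whom 72 are ProbeTec-positive, so ŝI = 72/84 ≈ 0.857 against a true sensitivity of 0.96.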

5.2 Clostridium difficile

In this section, we analyzed the data summarized in Humphries et al. (2013), in which liquid stool specimens from 296 patients were tested by a NAAT, enzyme immunoassays (EIA), and toxigenic culture (culture) for detection of C. difficile. The data is summarized in Table 9.

Of 296 subjects, 143 were true positive, resulting in a sample prevalence π = 143/296 ≈ 0.483. The sensitivity was 0.979 (95% CI: 0.940 to 0.993), 0.531 (95% CI: 0.450 to 0.611) and 0.867 (95% CI: 0.802 to 0.913) for NAAT, EIA and culture, respectively, and the specificity was 0.961 (95% CI: 0.917 to 0.982), 0.941 (95% CI: 0.892 to 0.969) and 1.0 (95% CI: 0.976 to 1.0) for NAAT, EIA and culture, respectively. Unlike in the first example, two tests have relatively low sensitivity (0.531 for EIA and 0.867 for culture).

Similarly, Bias(ŝI) and Bias(ĉI) for NAAT, EIA and culture were evaluated using a single reference, the CRSs and LCA. The results are summarized in Table 6 below.


Table 6: Comparison of NAAT, EIA and culture for detection of Clostridium difficile

| Reference | Bias(ŝI) NAAT | Bias(ŝI) EIA | Bias(ŝI) culture | Bias(ĉI) NAAT | Bias(ĉI) EIA | Bias(ĉI) culture |
|---|---|---|---|---|---|---|
| Single reference: NAAT | | -0.0109 | -0.0384 | | -0.0012 | -0.02 |
| Single reference: EIA | -0.0849 | | -0.0436 | -0.2925 | | -0.2559 |
| Single reference: culture | -0.0032 | 0.0330 | | -0.1061 | -0.0284 | |
| CRS: Any Positive | -0.0654 | -0.0214 | -0.0865 | -0.0818 | -0.0024 | -0.0213 |
| CRS: All Positive | 0.0210 | 0.0470 | 0.0539 | -0.2971 | -0.0269 | -0.2455 |
| CRS: Any-all Positive | 0.0210 | 0.0470 | 0.0539 | -0.0818 | -0.0024 | -0.0213 |
| LCA | 0.0210 | 0.0485 | 0.0659 | -0.0012 | -0.0609 | -0.020 |

Results in Table 6 showed that the biases are small when NAAT is used as reference (ŝI is 1.09% lower than sI for EIA and 3.84% lower for culture; ĉI is 0.12% lower than cI for EIA and 2.0% lower for culture). However, Bias(ĉI) became large when using EIA (29.25% lower for NAAT and 25.59% lower for culture) or culture (10.61% lower for NAAT and 2.84% lower for EIA) as reference, which is due to the low sensitivity of EIA and culture. When using the CRS by the "any-positive" rule, Bias(ĉI) was reduced substantially relative to using EIA alone as reference (−0.0818 − (−0.2925) = 0.2107 for NAAT and −0.0213 − (−0.2559) = 0.2346 for culture).

When using total bias |Bias(ŝI)| + |Bias(ĉI)| as a measure, results in Table 6 demonstrated that the CRS by the "any-all positive" rule outperforms all other approaches except for the evaluation of NAAT, where the CRS by the "any-all positive" rule gave about 8.0% more bias for ĉI than LCA did. The main reason is that the sensitivity of the CRS by the "any-positive" rule is still low (only 90.9%) after combining EIA and culture results. It is worth pointing out that this will usually not happen when evaluating a new NAAT, since, as the FDA recommended, the additional reference test should be more sensitive than culture in order to capture the subjects that culture misses.
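The 90.9% figure and the resulting bias for NAAT can be traced back to the counts in Table 9. The sketch below is illustrative only: the variable names are hypothetical, and the composite reference is taken to be "EIA or culture positive", as described above.

```python
# Table 9 patterns: (true_status, NAAT, EIA, culture, frequency)
TABLE9 = [
    (1, 1, 1, 1, 70),
    (1, 1, 1, 0, 6),
    (1, 1, 0, 1, 51),
    (1, 1, 0, 0, 13),
    (1, 0, 0, 1, 3),
    (0, 1, 0, 0, 6),
    (0, 0, 1, 0, 9),
    (0, 0, 0, 0, 138),
]

diseased = sum(f for t, _, _, _, f in TABLE9 if t == 1)
healthy = sum(f for t, _, _, _, f in TABLE9 if t == 0)

# Sensitivity of the any-positive CRS built from EIA and culture.
crs_sensitivity = sum(f for t, _, eia, cul, f in TABLE9
                      if t == 1 and (eia or cul)) / diseased

# Specificity estimate of NAAT against that CRS, and its bias.
ref_neg = sum(f for _, _, eia, cul, f in TABLE9 if not (eia or cul))
naat_neg_ref_neg = sum(f for _, naat, eia, cul, f in TABLE9
                       if not (eia or cul) and not naat)
c_hat_naat = naat_neg_ref_neg / ref_neg
c_true_naat = sum(f for t, naat, _, _, f in TABLE9 if t == 0 and not naat) / healthy

print(f"CRS sensitivity = {crs_sensitivity:.3f}")
print(f"c_hat(NAAT) = {c_hat_naat:.4f}, Bias = {c_hat_naat - c_true_naat:+.4f}")
```

Thirteen truly positive subjects are negative by both EIA and culture, so they enter the CRS-negative group and pull ĉI for NAAT down to 138/157.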

5.3 B. pertussis

A dataset on the diagnosis of B. pertussis (Baughman et al. (2008)) was analyzed in this section, in which swab or serum specimens from 212 persons with unknown B. pertussis status were tested by culture, PCR and four serologic assays for antibodies to B. pertussis (IgG-PT, IgA-PT, IgG-FHA and IgA-FHA). In addition to laboratory test results, classification by the B. pertussis clinical case definition was also available. The four serologic assays are highly conditionally dependent, so for simplicity their results were combined into one (denoted by "IgX") using the "any-positive" rule. The data is summarized in Table 10.

In Table 7 below, ŝI and ĉI for culture, PCR, IgX and clinical case definition are reported using a single reference, the CRSs or LCA.


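Table 7: Comparison of culture, PCR, IgX and Clinical case for detection of B. pertussis

| Reference | ŝI culture | ŝI PCR | ŝI IgX | ŝI Clinical case | ĉI culture | ĉI PCR | ĉI IgX | ĉI Clinical case |
|---|---|---|---|---|---|---|---|---|
| Single reference: culture | | 0.625 | 0.75 | 1.0 | | 0.975 | 0.922 | 0.157 |
| Single reference: PCR | 0.50 | | 0.60 | 1.0 | 0.985 | | 0.921 | 0.158 |
| Single reference: IgX | 0.273 | 0.273 | | 0.955 | 0.989 | 0.979 | | 0.163 |
| Single reference: Clinical case | 0.044 | 0.056 | 0.117 | | 1.0 | 1.0 | 0.969 | |
| CRS: Any Positive | 0.044 | 0.055 | 0.117 | 0.963 | 1.0 | 1.0 | 0.969 | 0.168 |
| CRS: All Positive | 0.667 | 0.667 | 0.8 | 1.0 | 0.981 | 0.971 | 0.913 | 0.154 |
| CRS: Any-all Positive | 0.667 | 0.667 | 0.8 | 1.0 | 1.0 | 1.0 | 0.969 | 0.168 |
| LCA | 0.615 | 0.608 | 1.0 | 0.729 | 1.0 | 0.989 | 0.937 | 0.161 |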

All approaches demonstrated that classification by clinical case definition has very low specificity (ranging from 0.154 to 0.168), which explains why ŝI was so low when using clinical case definition as reference to evaluate culture, PCR and IgX. Similarly, all approaches except LCA showed that classification by clinical case definition has very high sensitivity (ranging from 0.955 to 1.0), which implies that ĉI will approximate cI well for culture, PCR and IgX when using classification by clinical case definition as reference. Likewise, all approaches demonstrated that culture has high specificity (ranging from 0.981 to 1.0), which implies that ŝI will approximate sI well for PCR, IgX and classification by clinical case definition when using culture as reference.

However, results in Table 7 showed that ŝI estimated by the CRS by the "any-positive" rule is unusually low for culture (0.044), PCR (0.055) and IgX (0.117). Therefore, the CRS by the "any-positive" rule is not an appropriate approach for this example. From Table 7 we concluded that the CRS by the "any-all positive" rule outperformed all other approaches by providing better approximations of sI and cI for all four diagnostic tests, while with LCA sI was over-estimated (> 0.20) for IgX and under-estimated (≈ 0.20) for clinical case definition.
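The collapse of ŝI under the "any-positive" rule can be seen directly in the pattern counts. The sketch below assumes the column order (culture, PCR, clinical case, IgX) for the Table 10 patterns; this order is an assumption recovered from the flattened table, chosen because it reproduces the Table 7 entries. The helper name `s_hat` is illustrative.

```python
# Table 10 patterns, ASSUMED column order:
# (culture, PCR, clinical_case, IgX, frequency)
TABLE10 = [
    (1, 1, 1, 1, 4),
    (1, 1, 1, 0, 1),
    (1, 0, 1, 1, 2),
    (1, 0, 1, 0, 1),
    (0, 1, 1, 1, 2),
    (0, 1, 1, 0, 3),
    (0, 0, 1, 1, 13),
    (0, 0, 0, 1, 1),
    (0, 0, 1, 0, 154),
    (0, 0, 0, 0, 31),
]
NAMES = ("culture", "PCR", "clinical case", "IgX")


def s_hat(rows, index, rule):
    """Return (P(index+ | CRS+), number of CRS-positive subjects), where the
    CRS combines the other three results with `rule` (built-in any or all)."""
    ref_pos = both_pos = 0
    for *results, freq in rows:
        others = [r for j, r in enumerate(results) if j != index]
        if rule(others):
            ref_pos += freq
            both_pos += freq * results[index]
    return both_pos / ref_pos, ref_pos


for index, name in enumerate(NAMES):
    any_est, any_n = s_hat(TABLE10, index, any)
    all_est, all_n = s_hat(TABLE10, index, all)
    print(f"{name:13s} any: {any_est:.3f} (CRS+ n={any_n})   "
          f"all: {all_est:.3f} (CRS+ n={all_n})")
```

With culture as the index test, 181 of 212 subjects are positive by the "any-positive" CRS, almost all of them through the clinical case definition alone, so the 8 culture-positive subjects yield ŝI = 8/181 ≈ 0.044.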

6 Discussion

Evaluating the performance of a new NAAT is a very important task in diagnostic research, and is far from trivial when a gold standard is not available. Several categories of approaches have been proposed to deal with this problem, among which discrepant analysis, CRS methods and LCA are the most popular and widely used. Unlike discrepant analysis or latent class analysis, a CRS method does not use the results of the index test to construct the pseudo-reference, which is considered an advantage of this approach. Furthermore, the assumption of conditional independence among reference tests, which is usually hard to verify in advance, can be relaxed for CRS methods. This makes them very appealing in practice, and they have been recommended by many researchers and regulatory agencies (e.g., FDA). However, theoretical and simulation results both showed that no single composite pseudo-reference can reduce Bias(ŝI) and Bias(ĉI) simultaneously. In this paper, sI and cI were shown to be always under-estimated except in some extreme cases (e.g., sR = 1, cR = 1 or π = 1), and explicit forms of Bias(ŝI) and Bias(ĉI) were provided under the assumptions of conditional independence and sI + cI > 1.

In this paper, a new approach, called the CRS by the "any-all positive" rule, was proposed to reduce Bias(ŝI) and Bias(ĉI) simultaneously. This approach takes advantage of the strengths of the CRSs by both the "any-positive" rule and the "all-positive" rule while avoiding their weaknesses. Its performance and robustness were compared with those of other approaches using simulations covering a broad spectrum of possible real-world data situations. The simulation results demonstrated that the CRS by the "any-all positive" rule and LCA outperformed other approaches, and both are quite robust to violation of the assumption of conditional independence. LCA performed slightly better than the CRS by the "any-all positive" rule, except for Bias(ĉI) when π = 0.9.

Three real-world datasets for diagnosis of Chlamydia trachomatis, Clostridium difficile and B. pertussis were analyzed in this paper; the true disease status of each subject is available for the Chlamydia trachomatis and Clostridium difficile datasets. The results showed that the CRS by the "any-all positive" rule performed as well as or even better than LCA for all three examples, except for Bias(ĉI) for NAAT when analyzing the Clostridium difficile data, where cI of NAAT was under-estimated by about 0.082 by the CRS by the "any-all positive" rule. The reason is that both EIA and culture have low sensitivity, and the sensitivity remains low even after combining the results of EIA and culture by the "any-positive" rule. However, this situation is rare in practice, since presumably no manufacturer is willing to use two less sensitive references to evaluate its NAAT.

Simulation results and three real examples demonstrated that the CRS by the "any-positive" rule, which is recommended by regulatory agencies, provided a poor estimate of sI, while the CRS by the "any-all positive" rule is robust, is easy to implement without complex statistical model specification and validation, and can provide improved estimates of both sI and cI simultaneously. Therefore, the CRS by the "any-all positive" rule is strongly recommended for the evaluation of new NAATs. However, it is worth pointing out that ĉI from this approach may not be accurate when the sample prevalence is high (e.g., π ≥ 0.9), but this is a common problem faced by all approaches. For example, simulation results showed that when π = 0.9, LCA produced large biases for ĉI as well (0.039 for scenario 1, 0.113 for scenario 2 and 0.074 for scenario 3), which implies that more negative samples should be enrolled to obtain a better estimate of cI when the prevalence π is high (e.g., π ≥ 0.9), no matter which approach is used.

SUPPLEMENTARY MATERIAL


Proof of Theorem 2.0.1

Proof. We will show how to obtain ŝI; the proof for ĉI is similar. By Bayes' theorem and the assumption of conditional independence, it holds that

\begin{align*}
\hat{s}_I = P(I+ \mid R+)
&= \frac{P(I+, R+)}{P(R+)} \\
&= \frac{P(I+, R+ \mid T+)P(T+) + P(I+, R+ \mid T-)P(T-)}{P(R+ \mid T+)P(T+) + P(R+ \mid T-)P(T-)} \\
&= \frac{P(I+ \mid T+)P(R+ \mid T+)P(T+) + P(I+ \mid T-)P(R+ \mid T-)P(T-)}{P(R+ \mid T+)P(T+) + P(R+ \mid T-)P(T-)} \\
&= \frac{\pi s_R s_I + (1-\pi)(1-c_R)(1-c_I)}{\pi s_R + (1-\pi)(1-c_R)} \\
&= s_I + \frac{(1-\pi)(1-c_R)(1-c_I-s_I)}{\pi s_R + (1-\pi)(1-c_R)}.
\end{align*}

Proof of Theorem 3.1.1

Proof. Here we only show $S_{any} = \max_{f \in F}\{s_f\}$; the rest can be proved similarly. Recall that all samples with discordant test results are classified as positive by the "any-positive" rule, which implies that $R_{any}$ is positive whenever $R_f$ is positive. Therefore, for any $f \in F$, it holds that

\[
S_{any} = P(R_{any}+ \mid T+) \ge P(R_f+ \mid T+) = s_f, \quad \forall f \in F. \tag{12}
\]

Proof of Theorem 3.1.2

Proof. We only show the proof of (1) here. The proof for specificity is similar and is omitted.

(1): Recall that by the "any-positive" rule, all test patterns except $(0, \cdots, 0)$ produce positive results. Therefore the sensitivity $S_{any}$ by the "any-positive" rule satisfies

\[
S_{any} = 1 - P((0, \cdots, 0) \mid +) = 1 - \prod_{j=1}^{k} P(t_j = 0 \mid +) = 1 - \prod_{j=1}^{k} (1 - s_j). \tag{13}
\]

Similarly, $S_{all} = \prod_{j=1}^{k} s_j$ holds. Furthermore, for any $1 \le j_0 \le k$, it is easy to get

\[
S_{any} = 1 - \prod_{j=1}^{k} (1 - s_j) \ge 1 - (1 - s_{j_0}) = s_{j_0}, \qquad
S_{all} = \prod_{j=1}^{k} s_j \le s_{j_0}. \tag{14}
\]
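As a numerical check on (13) and (14), the product formulas can be compared with a brute-force sum over all test patterns. This is a sketch under the same conditional independence assumption used in the proof; the three sensitivities and the function names are hypothetical.

```python
from itertools import product
from math import prod


def pattern_probability(pattern, sens):
    """P(pattern | diseased) for conditionally independent tests."""
    return prod(s if t else 1 - s for t, s in zip(pattern, sens))


def crs_sensitivity_bruteforce(sens, rule):
    """Sum P(pattern | diseased) over every pattern the rule calls positive."""
    return sum(pattern_probability(p, sens)
               for p in product((0, 1), repeat=len(sens))
               if rule(p))


sens = [0.70, 0.85, 0.90]               # hypothetical reference sensitivities
s_any = 1 - prod(1 - s for s in sens)   # equation (13)
s_all = prod(sens)                      # product form of S_all

print(f"S_any: formula {s_any:.4f}, brute force {crs_sensitivity_bruteforce(sens, any):.4f}")
print(f"S_all: formula {s_all:.4f}, brute force {crs_sensitivity_bruteforce(sens, all):.4f}")
print(f"bounds (14): {s_all:.4f} <= min={min(sens)} and max={max(sens)} <= {s_any:.4f}")
```

The brute-force and closed-form values agree, and the bounds in (14) hold for every individual sensitivity in the list.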

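Three real datasets

Chlamydia trachomatis

Table 8: Summary of Chlamydia trachomatis data

| LCx | ProbeTec | AC2 | True Status | Frequency |
|---|---|---|---|---|
| 1 | 1 | 1 | + | 69 |
| 1 | 0 | 1 | + | 3 |
| 0 | 1 | 1 | + | 3 |
| 1 | 0 | 0 | - | 4 |
| 0 | 0 | 1 | - | 5 |
| 0 | 0 | 0 | - | 422 |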


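Clostridium difficile

Table 9: Summary of Clostridium difficile data

| True Status | NAAT | EIA | culture | Frequency |
|---|---|---|---|---|
| + | 1 | 1 | 1 | 70 |
| + | 1 | 1 | 0 | 6 |
| + | 1 | 0 | 1 | 51 |
| + | 1 | 0 | 0 | 13 |
| + | 0 | 0 | 1 | 3 |
| - | 1 | 0 | 0 | 6 |
| - | 0 | 1 | 0 | 9 |
| - | 0 | 0 | 0 | 138 |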

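B. pertussis

Table 10: Summary of B. pertussis data

| culture | PCR | Clinical case | IgX | Frequency |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 4 |
| 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 | 2 |
| 0 | 1 | 1 | 0 | 3 |
| 0 | 0 | 1 | 1 | 13 |
| 0 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 154 |
| 0 | 0 | 0 | 0 | 31 |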

References

Albert, P. and L. Dodd (2004). A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics 60 (2), 427–435.

Alonzo, T. and M. Pepe (1999). Using a combination of reference tests to assess the accuracy of a new diagnostic test. Statistics in Medicine 18 (22), 2987–3003.

Bachmann, L., R. Johnson, H. Cheng, L. Markowitz, J. Papp, and E. H. III (2009). Nucleic acid amplification tests for diagnosis of Neisseria gonorrhoeae oropharyngeal infections. Journal of Clinical Microbiology 47 (4), 902–907.

Baughman, A., K. Bisgard, M. Cortese, W. Thompson, G. Sanden, and P. Strebel (2008). Utility of composite reference standards and latent class analysis in evaluating the clinical accuracy of diagnostic tests for pertussis. Clinical and Vaccine Immunology 15 (1), 106–114.

Buimer, M., G. van Doornum, S. Ching, P. Peerbooms, P. Plier, D. Ram, and H. Lee (1996). Detection of Chlamydia trachomatis and Neisseria gonorrhoeae by ligase chain reaction-based assays with clinical specimens from various sites: Implications for diagnostic testing and screening. Journal of Clinical Microbiology 34 (10), 2395–2400.

Dawid, A. and A. Skene (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 28 (1), 20–28.

Duck, P., G. Alvarado-Urbina, B. Burdick, and B. Collier (1990). Probe amplifier system based on chimeric cycling oligonucleotides. Biotechniques 9 (2), 142–148.

Enøe, C., M. Georgiadis, and W. Johnson (2000). Estimation of sensitivity and specificity of diagnostic tests and disease prevalence when the true disease state is unknown. Preventive Veterinary Medicine 45 (1-2), 61–81.

Gaydos, C., M. Theodore, N. Dalesio, B. Wood, and T. Quinn (2004). Comparison of three nucleic acid amplification tests for detection of Chlamydia trachomatis in urine specimens. Journal of Clinical Microbiology 42 (7), 3041–3045.

Goodman, L. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61 (2), 215–231.

Hadgu, A. (1996). The discrepancy in discrepant analysis. Lancet 348 (9027), 592–593.

Hadgu, A. (1997). Bias in the evaluation of DNA-amplification tests for detecting Chlamydia trachomatis. Statistics in Medicine 16 (12), 1391–1399.

Hadgu, A. (1999). Discrepant analysis: A biased and an unscientific method for estimating test sensitivity and specificity. Journal of Clinical Epidemiology 52 (12), 1231–1237.

Hawkins, D., J. Garrett, and B. Stephenson (2001). Some issues in resolution of diagnostic tests using an imperfect gold standard. Statistics in Medicine 20 (13), 1987–2001.

Hui, S. and X. Zhou (1998). Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research 7 (4), 354–370.

Humphries, R., D. Uslan, and Z. Rubin (2013). Performance of Clostridium difficile toxin enzyme immunoassay and nucleic acid amplification tests stratified by patient disease severity. Journal of Clinical Microbiology 51 (3), 869–873.

Johnson, R., T. Green, J. Schachter, R. Jones, E. H. III, C. Black, D. Martin, M. Louis, and W. Stamm (2000). Evaluation of nucleic acid amplification tests as reference tests for Chlamydia trachomatis infections in asymptomatic men. Journal of Clinical Microbiology 38 (12), 4382–4386.

Jones, K., N. Patel, M. Levy, A. Storeygard, D. Balk, J. Gittleman, and P. Daszak (2008). Global trends in emerging infectious diseases. Nature 451 (7181), 990–993.

Joseph, L., T. Gyorkos, and L. Coupal (1995). Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. American Journal of Epidemiology 141 (3), 263–272.

Kwoh, D., G. Davis, K. Whitfield, H. Chappelle, L. DiMichele, and T. Gingeras (1989). Transcription-based amplification system and detection of amplified human immunodeficiency virus type 1 with a bead-based sandwich hybridization format. PNAS 86 (4), 1173–1177.

Miller, W. (1998). Bias in discrepant analysis: When two wrongs don't make it a right. Journal of Clinical Epidemiology 51 (3), 219–231.

Naaktgeboren, C., L. Bertens, M. van Smeden, J. de Groot, K. Moons, and J. Reitsma (2013). Value of composite reference standards in diagnostic research. BMJ 347, F5605.

Pepe, M. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press Inc.

Pepe, M. and H. Janes (2007). Insights into latent class analysis of diagnostic test performance. Biostatistics 8 (2), 474–484.

Qu, Y., M. Tan, and M. Kutner (1996). Random effects models in latent class analysis for evaluating accuracy of diagnostic test. Biometrics 52 (3), 797–810.

Rindskopf, D. (1986). A model for evaluating sensitivity and specificity for correlated diagnostic tests in efficacy studies with an imperfect reference test. Statistics in Medicine 5 (1), 21–27.

Rutjes, A., J. Reitsma, A. Coomarasamy, K. Khan, and P. Bossuyt (2007). Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technology Assessment 11 (50), ix–51.

Saiki, R., D. Gelfand, S. Stoffel, S. Scharf, R. Higuchi, G. Horn, K. Mullis, and H. Erlich (1988). Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science 239 (4839), 487–491.

Schachter, J., W. Stamm, T. Quinn, W. Andrews, J. Burczak, and H. Lee (1994). Ligase chain reaction to detect Chlamydia trachomatis infection of the cervix. Journal of Clinical Microbiology 32 (10), 2540–2543.

Waggoner, J., J. Abeynayake, M. Sahoo, L. Gresh, Y. Tellez, K. Gonzalez, G. Ballesteros, A. Balmaseda, K. Karunaratne, E. Harris, and B. Pinsky (2013). Development of an internally controlled real-time reverse transcriptase PCR assay for pan-dengue virus detection and comparison of four molecular dengue virus detection assays. Journal of Clinical Microbiology 51 (7), 2172–2181.

Walter, S. and L. Irwig (1988). Estimation of test error rates, disease prevalence, and relative risk from misclassified data: A review. Journal of Clinical Epidemiology 41 (9), 923–937.

Wu, D. and R. Wallace (1989). The ligation amplification reaction (LAR) - amplification of specific DNA sequences using sequential rounds of template-dependent ligation. Genomics 4 (4), 560–569.

Yolken, R. (2002). Nucleic acid amplification assays for microbial diagnosis: Challenges and opportunities. Journal of Pediatrics 140 (3), 290–292.