Comparison of three-class classification performance metrics: a case study in breast cancer CAD

Amit C. Patel*a and Mia K. Markey†b
aThe University of Texas at Austin, Department of Electrical and Computer Engineering
bThe University of Texas at Austin, Department of Biomedical Engineering

ABSTRACT

Receiver Operating Characteristic (ROC) analysis is a widely used method for analyzing the performance of two-class classifiers. Advantages of ROC analysis include the fact that it explicitly considers the trade-offs in sensitivity and specificity, includes visualization methods, and has clearly interpretable summary metrics. Currently, no widely accepted analog of ROC analysis exists for evaluating an N-class classifier (N>2). The purpose of this study was to empirically compare methods that have been proposed to evaluate the performance of N-class classifiers (N>2). These methods are, in one way or another, extensions of ROC analysis. This report focuses on three-class classification performance metrics, but most of the methods can be extended easily to more than three classes. The methods studied were pairwise ROC analysis, the Hand and Till M function (HTM), one-versus-all ROC analysis, a modified HTM, and Mossman's "three-way ROC" method. A three-class classification task from breast cancer computer-aided diagnosis (CADx) is taken as an example to illustrate the advantages and disadvantages of the alternative performance metrics.

Keywords: Diagnosis, Computer-Assisted; ROC Curve; Classification; Mammography; Breast Neoplasms

1. INTRODUCTION

1.1 Computer-aided Diagnosis

Classification is the categorization of objects according to their properties. Clinical diagnostic decisions can be framed as classification problems. For example, when a radiologist reads a diagnostic mammogram, s/he must classify the breast lesion either as suspicious enough for cancer that biopsy is recommended or as insufficiently suspicious, so that biopsy is deemed unnecessary. Computer-aided detection (CAD) and computer-aided diagnosis (CADx) systems are being developed to aid radiologists in mammographic interpretation1. CAD assists the radiologist in locating abnormalities on the mammogram that may indicate breast cancer, thus reducing the chance of overlooking an abnormality; CAD systems are already commercially available. CADx systems analyze the abnormalities found on the mammogram and assist in the decision between follow-up and biopsy. Most CADx systems have been designed for a two-class classification of mammographic abnormalities as benign or malignant. Recently there has been growing interest in CADx systems that can handle three (or more) classes. For example, a CADx system that could classify lesions into the three categories of benign, malignant but non-invasive, and malignant invasive may be more beneficial than a system that only classifies lesions as benign or malignant.

1.2 Classifier Performance Evaluation

Accurate assessment of CAD and CADx systems is critical to improving breast cancer care. Receiver Operating Characteristic (ROC) analysis is widely used to analyze two-class classifier performance2,3. ROC analysis shows the trade-off between sensitivity and specificity as a decision threshold varies. Sensitivity is the fraction of positive cases that are correctly classified as positive; specificity is the fraction of negative cases that are correctly classified as negative.

*[email protected]; phone +1.512.471.8660; http://www.bme.utexas.edu/research/informatics/
[email protected]; phone +1.512.471.1771; fax +1.512.471.0616; http://www.bme.utexas.edu/research/informatics/

Medical Imaging 2005: Image Perception, Observer Performance, and Technology Assessment, edited by Miguel P. Eckstein, Yulei Jiang, Proceedings of SPIE Vol. 5749 (SPIE, Bellingham, WA, 2005) · 1605-7422/05/$15 · doi: 10.1117/12.595763


[Figure 1 here showed an empirical ROC curve (TPF versus FPF, both axes from 0 to 1) generated from the following example data:]

True state:        0     1     0     0     1     0     1     1     1
Classifier output: 0.15  0.23  0.31  0.44  0.56  0.62  0.71  0.89  0.99

When the threshold is 0.60, every case with output above the threshold is called positive:
FPF = false positives / total negatives = 1/4
TPF = true positives / total positives = 3/5

Figure 1 - Example ROC curve and calculation

An example of the generation of an ROC curve is shown in Figure 1. A decision threshold is applied to the continuous classifier outputs to classify each sample as positive (state 1) or negative (state 0). As the decision threshold is varied, different levels of sensitivity and specificity are obtained. An ROC curve is a plot of sensitivity vs. (1 - specificity), or equivalently the true positive fraction (TPF) vs. the false positive fraction (FPF). For the example data in Figure 1, if the threshold were set at 0.60, the FPF would be 0.25 and the TPF would be 0.60. Varying the threshold from 0 to 1 produces the ROC curve shown in Figure 1.

The area under the ROC curve (AUC) is a numeric performance metric that summarizes how separable the two classes are. An AUC of 1.00 indicates that the classifier can always distinguish a positive from a negative. The straight line along which the FPF and TPF are equal indicates chance classification and has an AUC of 0.50: posed with the task of distinguishing a positive from a negative, a chance classifier can at best guess the state of the object. If the AUC is less than 0.50, the classifier is effectively performing the opposite task and the definitions of the states should be switched. An advantage of ROC analysis is that it is not necessary to specify misclassification costs. The visual and numeric metrics associated with the method allow great flexibility in performance analysis.

Progress in developing CAD/CADx systems for three-class classification problems has been slow, in part because there is no established method for analyzing their performance. There is considerable interest in extending the basic principles of ROC analysis into a multi-class performance metric.
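The Figure 1 calculation can be reproduced in a few lines of MATLAB, the language of the toolbox described later in Section 2.4. This is a minimal sketch with illustrative variable names, not part of that toolbox:

    % Example data from Figure 1
    state  = [0 1 0 0 1 0 1 1 1];                    % true states
    output = [0.15 0.23 0.31 0.44 0.56 0.62 0.71 0.89 0.99];

    % Sweep the threshold across all outputs (plus the two extremes)
    thr = [-inf sort(output) inf];
    tpf = zeros(size(thr));  fpf = zeros(size(thr));
    for i = 1:numel(thr)
        positive = output > thr(i);                  % classify as state 1
        tpf(i) = sum(positive & state == 1) / sum(state == 1);
        fpf(i) = sum(positive & state == 0) / sum(state == 0);
    end
    % At a threshold of 0.60: fpf = 1/4 and tpf = 3/5, as in Figure 1

    % Sort the operating points by FPF and integrate by the trapezoidal rule
    [fpf, idx] = sort(fpf);  tpf = tpf(idx);
    auc = trapz(fpf, tpf)                            % 0.80 for these data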


1.3 Proposed Performance Metrics

The purpose of this study was to provide an empirical comparison of methods that have been proposed for analyzing three-class classification problems, using a mammographic CADx task as an example. A short description of each performance metric examined is provided here.

1.3.1 Pairwise Comparison

Pairwise comparisons break down a C-class classification problem into separate binary one-versus-one comparisons. For a C-class classification, there are C(C-1) different binary comparisons, so this method returns C(C-1) pairwise AUCs. This technique breaks the problem down into multiple binary classifications after a multi-class classifier has been applied to a dataset. Notice that this is different from pairwise classification, in which the problem is broken down into binary classification problems before classification, such that two-class classifiers are used. For example, when classifying breast lesions as benign, malignant invasive, or malignant non-invasive, pairwise classification could be performed by designing separate two-class classifiers for the benign versus malignant invasive, malignant invasive versus malignant non-invasive, and benign versus malignant non-invasive tasks. By comparison, what is referred to as "pairwise comparison" in this study is pairwise performance evaluation of a multi-class classifier.

1.3.2 Hand and Till M Function

The Hand and Till M function (HTM) is the average of all the pairwise comparison AUCs4. Since this method averages multiple AUCs, it can be interpreted in a manner similar to the AUC for two-class classification problems. Hand and Till proposed it as an extension of ROC analysis because they desired a multi-class performance metric that makes no assumptions about misclassification costs. When testing the method, they trained classifiers on several types of multi-class datasets and used the HTM to rank which classifiers performed best; these rankings were compared to rankings produced by two error-rate functions.

1.3.3 One-versus-all (OVA) Comparisons

One-versus-all comparisons cast a C-class classification problem as separate binary one-versus-all comparisons. A C-class classification requires C different binary one-versus-all comparisons. Each of these C comparisons has its own AUC, which can be used as a metric of how well the classifier separates one class from all of the other classes.

1.3.4 Modified HTM

Ferri et al.5 proposed a modified HTM, which is an average over all of the one-versus-all comparisons. However, their modified HTM uses an inverse ROC AUC instead of the normal ROC curve. An inverse ROC curve is a plot of the false negative fraction (FNF) versus the FPF; note that FNF = 1 - TPF. The three inverse OVA AUCs obtained from this inverse ROC graph are averaged to form the modified HTM. The authors used the modified HTM to see how well it could approximate the volume under the ROC hypersurface, which they are also investigating as a possible multi-class metric.

1.3.4.1 1-point Inverse ROC AUC

For the modified HTM function, a confusion ratio matrix is used to create a plot that has only three points: (0,1), (1,0), and (FPF, FNF). The authors chose to define the ROC in this manner because they believed it was more coherent for multiple classes5. The inverse ROC curve for an FPF of 0.3 and an FNF of 0.4 is shown in Figure 2. The area above the curve is actually the area of interest, but it will be referred to as the area under the curve for the purposes of this paper. This area is shaded in Figure 2 and can be calculated non-parametrically with the trapezoidal rule.

Figure 2 - A 1-point inverse ROC5. [Plot of FNF versus FPF through (0,1), (0.3,0.4), and (1,0), with the area above the polyline shaded.]
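Because the polyline passes through (0,1), (FPF,FNF), and (1,0), the trapezoidal rule reduces to a closed form: the area under the polyline is FPF*(1+FNF)/2 + (1-FPF)*FNF/2 = (FPF+FNF)/2, so the area of interest is 1 - (FPF+FNF)/2. A two-line check (a sketch, not a CPET function):

    % Area of interest for a 1-point inverse ROC through (0,1), (fpf,fnf), (1,0)
    auc_inv = @(fpf, fnf) 1 - (fpf + fnf)/2;
    auc_inv(0.3, 0.4)      % Figure 2 example: returns 0.65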

1.3.5 Cobweb Representation

A C-class classification problem can also be represented by the C(C-1) misclassification values from a confusion ratio matrix5. The values chosen from a three-class classification confusion ratio matrix form a six-dimensional point: for classes A, B, and C, the point (A->B, A->C, B->A, B->C, C->A, C->B) is obtained, where A->B corresponds to the cell of the confusion ratio matrix that gives the fraction of class A objects misclassified as class B objects. An equilateral polygon with C(C-1) vertices can then be created to map the point obtained from the confusion ratio matrix. A chance classification is shown in Figure 3; the point represented is (0.33, 0.33, 0.33, 0.33, 0.33, 0.33). Misclassification rates of 0.33 show that, when confronted with an object of class A, the classifier is equally likely to assign it to any of the three classes A, B, or C.

Figure 3 - Chance misclassification cobweb. [Radar plot over the six misclassification axes, radial scale 0 to 0.35; the chance hexagon connects the value 0.33 on every axis.]
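Extracting the six-dimensional point from a confusion matrix is straightforward. A minimal MATLAB sketch, assuming rows index the true class and columns the decided class (the layout of Table 3 later in this paper); cobweb_point is an illustrative name, distinct from the toolbox's COBWEB function:

    function p = cobweb_point(C)
    % Six-dimensional misclassification point (A->B, A->C, B->A, B->C, C->A, C->B)
    % from a 3x3 confusion matrix with rows = true class, columns = decided class.
    R  = C ./ sum(C, 2);    % confusion ratio matrix (row-normalized; R2016b+)
    Rt = R';                % transpose so column-major indexing walks rows of R
    p  = Rt(~eye(3))';      % off-diagonal cells, in row order of R
    end

For a chance classifier, every row of the ratio matrix is (1/3, 1/3, 1/3), and this returns the (0.33, ..., 0.33) point plotted in Figure 3.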

1.3.6 Mossman Three-Way ROC

Mossman's three-way ROC technique uses the correct classification rates, as two separate decision thresholds are varied, to form a three-dimensional plot6. The volume under the surface (VUS) of this plot serves as the performance metric. The main reason this performance analysis uses the correct rates is to allow for a three-dimensional graph similar in nature to the two-dimensional ROC curve.
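The estimator is described here and in Section 2.4.6 only at a high level, so the following MATLAB sketch is one plausible reading, taking Section 3.5's "one-versus-all comparison followed by a one-versus-one comparison" literally; vus_3afc and all variable names are illustrative, and the toolbox's TWRVUS may differ in detail:

    function vus = vus_3afc(truth, scores, classes, order, iters)
    % Monte Carlo three-alternative forced-choice estimate of Mossman's VUS.
    % truth: n-by-1 labels; scores: n-by-3 likelihoods, column j for classes(j);
    % order: class priority, e.g. [5 3 4]; iters: number of random triples.
    [~, col] = ismember(order, classes);          % score columns in priority order
    idx = arrayfun(@(c) find(truth == c), order, 'UniformOutput', false);
    correct = 0;
    for t = 1:iters
        pick = cellfun(@(v) v(randi(numel(v))), idx);   % one case per class
        s = scores(pick, col);                    % rows/columns in priority order
        [~, first] = max(s(:, 1));                % one-versus-all: priority-1 class
        rest = setdiff(1:3, first);
        [~, k] = max(s(rest, 2));                 % one-versus-one: priority-2 class
        correct = correct + (first == 1 && rest(k) == 2);
    end
    vus = correct / iters;                        % chance level is 1/6, i.e. 0.17
    end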

2. METHODOLOGY

2.1 Dataset

Non-palpable mammographically suspicious lesions that underwent biopsy at Duke University Medical Center from 1990 to 2000 were used in this study. Seven radiologists summarized each case in accordance with the Breast Imaging Reporting and Data System (BI-RADS®) lexicon7. For each case, the patient's age and the radiologist's "gut assessment" of the likelihood of malignancy were also recorded. The "gut assessment" was reported on a scale of 1 (benign), 2 (likely benign), 3 (indeterminate), 4 (likely malignant), or 5 (malignant). The data for this study consisted of 326 non-calcified masses without associated or special findings for which the "gut assessment" was 3, 4, or 5. Of the 326 cases, 140 had "gut assessment" = 3, 84 had "gut assessment" = 4, and 102 had "gut assessment" = 5. A classifier was developed to predict the "gut assessment" from the patient age and the BI-RADS® descriptors mass margin, mass shape, mass density, and mass size.

2.2 Classifier

A k-nearest-neighbor (KNN) classifier was used to predict the "gut assessment" from the BI-RADS® descriptors and patient age. KNN classifies a test point based on the k points in a training set that are most similar to the test point, i.e., the k nearest neighbors. In this study, the value of k was empirically optimized: the value of k that maximized the accuracy (percent correct) was used. Leave-one-out cross-validation was used in the analyses presented in this paper. In leave-one-out cross-validation, one point is selected to be the test datum and the remaining n-1 points are used as the training data; this process is repeated n times, until every case has been used as the test case.
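A minimal sketch of the classifier stage, assuming Euclidean distance over the feature columns; X, truth, and knn_loo are illustrative names, not part of the toolbox described below:

    function likeli = knn_loo(X, truth, classes, k)
    % Leave-one-out KNN: each case is scored by the k nearest remaining cases.
    % Returns n-by-C percent likelihoods (fraction of neighbors in each class).
    n = size(X, 1);  C = numel(classes);
    likeli = zeros(n, C);
    for i = 1:n
        d = sqrt(sum((X - X(i, :)).^2, 2));   % Euclidean distances (R2016b+)
        d(i) = inf;                           % exclude the test case itself
        [~, nn] = sort(d);                    % neighbors, nearest first
        votes = truth(nn(1:k));
        for j = 1:C
            likeli(i, j) = mean(votes == classes(j));
        end
    end
    end

The value of k would then be chosen by repeating this for candidate values of k and keeping the one with the highest leave-one-out accuracy.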


The purpose of this study was to compare performance metrics; thus, the same classifier was used throughout. However, depending upon which performance metric was used, the classifier output was required to be continuous or discrete. The continuous output of the KNN consists of the three percent likelihoods that an object belongs to each class; the percent likelihood was determined by calculating the fraction of the k nearest neighbors belonging to each class. The discrete output returns a confusion matrix, classifying each object into the class with the greatest sum of inverse Euclidean distances.

2.3 AUC Calculations

Using the true object state and the continuous KNN classifier output, a non-parametric (empirical) ROC curve can be generated following the method described in Section 1.2. The trapezoidal method of integration was used to estimate the area under the ROC curves in this study.

2.4 Performance Metrics

The Classifier Performance Evaluation Toolbox (CPET), written in MATLAB® (The MathWorks, Natick, MA), was created to implement the following metrics and is available on The University of Texas at Austin Biomedical Informatics Lab website (http://www.bme.utexas.edu/research/informatics/). A description of the implementation is provided here. However, the Biomedical Informatics Lab makes no warranty, either expressed or implied, including, but not limited to, any implied warranties of merchantability or fitness for a particular purpose regarding these materials, and makes such materials available solely on an "as-is" basis. You bear the entire risk as to the use of this toolbox for your purposes and as to its quality and performance. We are not liable for loss of goodwill, work stoppage, loss of data, computer failure or malfunction, or any other side effects that may occur from using this toolbox. While a KNN classifier was used for the case study presented in this paper, the toolbox can be used to evaluate the outputs of any classifier used for three-class classification.

2.4.1 Pairwise Comparison

The pairwise comparisons are done through the function named pairwise and require the classifier output in continuous form. There are two inputs to the pairwise function: the first is the continuous classifier output, and the second identifies which columns of the classifier output correspond to which state. The columns of the first input are of the form [True State, Percent Likelihood Category 1, Percent Likelihood Category 2, Percent Likelihood Category 3]. The second input is a 1x3 vector [Category 1, Category 2, Category 3] that tells the function which columns of input 1 correspond to which category's percent likelihood. The six pairwise AUCs are the function outputs. An example call to the function for this case study is pairwise(continuous_classifier_output, class_order).

2.4.2 Hand and Till M Function

The Hand and Till M method (HTM) takes the six AUCs generated from the pairwise comparison and averages them into one numeric metric4. The HTM metric is calculated using the pairwise function in the toolbox: if the third input to pairwise is 1, only the HTM is output; if it is 2, both the HTM and the pairwise comparisons are output. An example call to pairwise for only the HTM is pairwise(continuous_classifier_output, class_order, 1).
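The computation behind pairwise and the HTM can be sketched as follows. This is an illustrative reimplementation, not the CPET code itself; it uses the Wilcoxon-Mann-Whitney count, which for an empirical ROC curve equals the trapezoidal AUC:

    function [htm, pairAUC] = htm_pairwise(truth, scores, classes)
    % All C(C-1) pairwise AUCs and their average, the HTM (Hand & Till, 2001).
    % truth: n-by-1 labels; scores: n-by-C likelihoods, column j for classes(j).
    C = numel(classes);
    pairAUC = nan(C, C);                      % pairAUC(a,b): class a vs class b
    for a = 1:C
        for b = 1:C
            if a == b, continue; end
            % Restrict to cases from classes a and b; class a has priority,
            % so class a's likelihood column separates the two classes
            pos = scores(truth == classes(a), a);
            neg = scores(truth == classes(b), a);
            % Wilcoxon-Mann-Whitney AUC estimate (ties count one half)
            [P, N] = meshgrid(pos, neg);
            pairAUC(a, b) = (sum(P(:) > N(:)) + 0.5*sum(P(:) == N(:))) ...
                            / (numel(pos) * numel(neg));
        end
    end
    htm = mean(pairAUC(~isnan(pairAUC)));
    end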
2.4.3 One-versus-all (OVA) Comparisons

The one-versus-all comparisons function, named OVA, takes exactly the same inputs as pairwise and generates three separate AUCs. An example call that outputs only the OVA comparisons is OVA(continuous_classifier_output, class_order). If a third input of 1 is supplied, the average of the three AUCs is also returned: OVA(continuous_classifier_output, class_order, 1).

2.4.4 Modified HTM

The modified HTM is calculated using the 1-point inverse ROC extension5. The function, named MHTM, takes a confusion matrix calculated from a discrete classifier; the output is the modified HTM metric. An example call is MHTM(discrete_classifier_output).
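Since each one-versus-all 1-point inverse area is 1 - (FPF + FNF)/2 (Section 1.3.4.1), the modified HTM collapses to a few lines. A sketch assuming rows index the true class and columns the decided class; applied to the confusion matrix later shown in Table 3, it gives about 0.735, consistent with the 0.73 reported in Section 3.3 (modified_htm is an illustrative name, distinct from the toolbox's MHTM):

    function mhtm = modified_htm(C)
    % Modified HTM: average of the three one-versus-all 1-point inverse ROC areas.
    N = sum(C(:));
    n = sum(C, 2);                                    % cases per true class
    vals = zeros(size(C, 1), 1);
    for k = 1:size(C, 1)
        fnf = (n(k) - C(k, k)) / n(k);                % class-k cases decided otherwise
        fpf = (sum(C(:, k)) - C(k, k)) / (N - n(k));  % non-k cases decided as k
        vals(k) = 1 - (fpf + fnf)/2;                  % 1-point inverse ROC area
    end
    mhtm = mean(vals);
    end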


2.4.5 Cobweb Representation

Currently, the cobweb graph is not fully implemented in our toolbox. Calling the COBWEB function generates the six-dimensional point; the input to COBWEB is the discrete confusion matrix output from the classifier. An example call would be COBWEB(discrete_classifier_output). The graphical cobweb representation can be made from the toolbox output using a "radar" plot function in a program such as Microsoft Excel®.

2.4.6 Mossman Three-Way ROC

The VUS, the performance metric associated with three-way ROC, can be estimated by a three-alternative forced-choice decision task6. The first input to the function TWRVUS is the continuous classifier output, in the same form as described for the pairwise function. The second input is the number of iterations the program should run, and the third input is the class priority order. The only other requirement is that the likelihood columns of the classifier output must also be in the class priority order. An example function call is TWRVUS(continuous_classifier_output, iterations, class_order). For this case study, the last three columns of continuous_classifier_output were the percent likelihoods for classes 5, 3, and 4, so the class_order was [5 3 4].

3. RESULTS

3.1 Pairwise Comparisons and HTM

Class comparison:  3vs4   4vs3   4vs5   5vs4   3vs5   5vs3
AUC:               0.83   0.71   0.55   0.69   0.92   0.92

Table 1 - Pairwise AUCs for class a vs. class b; class a has priority in the calculation of the AUC.

Pairwise comparisons allow for an in-depth view of classifier performance. The advantage is that problem areas can be pinpointed; the disadvantage is that the large number of values makes interpretation cumbersome. Each pairwise AUC has a chance value of 0.50 and a perfect value of 1.00, and can be interpreted individually in the same way as the usual two-class ROC AUC. The pairwise results for the breast cancer CADx case study are shown in Table 1. The HTM metric is obtained by averaging all of the pairwise values; for this case study, HTM = (0.83 + 0.71 + 0.55 + 0.69 + 0.92 + 0.92)/6 = 0.77. A classifier with ideal performance would have an HTM of 1.00 and a classifier with chance performance an HTM of 0.50. The advantage of the HTM method is that it provides a view of the classifier as a whole.

3.2 OVA Comparisons

Class comparison:  3vsALL   4vsALL   5vsALL
AUC:               0.88     0.65     0.83

Table 2 - OVA AUCs for each class vs. all other classes.

OVA comparisons allow a slightly less detailed view of classifier performance than pairwise comparisons. Each OVA AUC has a chance value of 0.50 and a perfect value of 1.00, so each can be interpreted in the same way as the two-class ROC AUC. The OVA results for the breast cancer CADx case study are shown in Table 2.

3.3 Modified HTM

                       Decided State
                    Class 3   Class 4   Class 5
True State Class 3    118        18         4
           Class 4     29        31        24
           Class 5     10        21        71

Table 3 - Confusion matrix for the k = 20 KNN.

The inverse 1-point OVA ROC AUCs are calculated from three manipulations of the confusion matrix. The confusion matrix for the case study is shown in Table 3. These AUCs are averaged to obtain the modified HTM measure; for the breast cancer CADx case study, the modified HTM was 0.73. The measure equals 1.00 for a perfect classifier and 0.50 for a chance classifier, and can be interpreted similarly to a two-class ROC AUC.


3.4 Cobweb Method

Figure 4 - Cobweb graph: misclassification cobweb (true state -> decided state) for the k = 20 KNN, plotted against the chance-performance hexagon. [Radar plot over the axes 3->4, 3->5, 4->3, 4->5, 5->3, 5->4; radial scale 0 to 0.35.]

                       Decided State
                    Class 3   Class 4   Class 5
True State Class 3    0.84      0.13      0.03
           Class 4    0.35      0.37      0.29
           Class 5    0.10      0.21      0.70

Confusion ratio matrix for the k = 20 KNN.
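The ratio matrix above is simply the row-normalized Table 3, e.g.:

    % Row-normalize the Table 3 counts to reproduce the confusion ratio matrix
    C = [118 18 4; 29 31 24; 10 21 71];
    R = C ./ sum(C, 2)    % first row: 0.84 0.13 0.03 (R2016b+ expansion)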

The cobweb graphical performance representation provides a quick way to visualize classifier performance. A polygon lying within the chance-performance hexagon indicates better-than-chance performance. Currently, there is no numeric metric associated with the cobweb graphical representation. The CADx classification performance is represented visually in Figure 4, created from the confusion ratio matrix shown alongside it. The classification is better than chance for most cases, except for the 4->5 and 4->3 misclassification rates, which are near or worse than chance.

3.5 Mossman Three-Way ROC

The "three-way ROC" VUS can be calculated in six different ways; the choice depends upon which classes are given priority in the three-alternative forced-choice decision task. The method is similar to a one-versus-all comparison followed by a one-versus-one comparison. Chance performance gives a VUS of 0.17 and perfect performance a VUS of 1.00; thus, this measure does not retain the chance level of the AUC from a two-class ROC. The VUS can be reported in two ways: using a single VUS, as this study did, or averaging all of the VUSs. In the breast cancer CADx case study, the VUS giving class 5 "gut assessment" first priority and class 3 "gut assessment" second priority was used, since it generated the highest of the six possible values (VUS = 0.50). This makes intuitive sense, since the greatest difference is likely to exist between the extremes of malignant ("gut assessment" = 5) and indeterminate ("gut assessment" = 3).

4. DISCUSSION

The purpose of this study was to empirically compare performance metrics derived from multi-class extensions of ROC analysis on a multi-class breast cancer CADx task. A KNN classifier was used to predict radiologists' "gut assessment" of the likelihood of malignancy of mammographic lesions from BI-RADS® descriptors and patient age. The classifier performance was evaluated using pairwise comparisons, the HTM metric, one-versus-all comparisons, the modified HTM, the cobweb graph, and the three-way ROC method.

The pairwise breakdown provides the most detailed information of the performance measures studied, offering the ability to identify specific problem areas of a classifier. For example, in the case study presented, the pairwise results showed that separating classes 4 and 5 from each other poses difficulty. This makes intuitive sense because class 4 and class 5 masses both concern malignancy, differing only in severity. The HTM eliminates the need to evaluate C(C-1) pairwise comparisons for a C-class problem, a task that can be somewhat tedious, by averaging the pairwise comparisons. The single numeric metric allows for easy comparisons between two different classifiers. However, information is lost in the translation from pairwise comparisons to the HTM that could be valuable in interpreting classifier performance.

Proc. of SPIE Vol. 5749

587

For example, in the breast cancer CADx case study, the HTM was unable to show that the main problem is separating class 4 from class 5.

One-versus-all comparisons are a simpler version of the pairwise comparison scheme: instead of C(C-1) numbers, only C numeric metrics are required for a C-class classification problem. For example, our case study one-versus-all results show that separating the two extremes (classes 3 and 5) is fairly easy, but separating the middle class (class 4) causes problems. This method was able to discern that the problem lies with separating class 4, but it did not demonstrate that the specific problem is separating classes 4 and 5 from each other, as the pairwise comparisons did.

The modified HTM averages the inverse 1-point OVA ROC AUCs into a single numeric metric, and has the same benefits and flaws as the regular HTM. It should be stated again that this metric is not the same as averaging the OVA AUCs, because its creators use an inverse 1-point AUC. In the case study presented, the modified HTM = 0.73, which shows that the classifier performance is mediocre but does not pinpoint which class poses the major problem.

The cobweb graph allows for a unique visual representation of classifier performance (Figure 4). Generally speaking, any polygon lying within the chance hexagon indicates better-than-chance performance. One can see from Figure 4 that most of the KNN performance lies within the bounds of the chance hexagon, except for the 4->3 misclassifications. It can also be seen that the 4->5 misclassification rate, although better than chance, is still worrisome. These results suggest that classifying lesions of class 4 "gut assessment" is the weakness of our classifier. However, visual inspection of the graph does not allow for precise conclusions; the addition of a numeric metric might be beneficial, although developing one from this graph may be difficult.

Mossman's "three-way ROC" method allows for both a single numeric metric and a visual representation of classifier performance6,8. The VUS calculated for the case study, 0.50, showed that the classifier performed better than chance (a VUS of 0.17), but was low enough to show that the classifier performance was mediocre. Like the other single-valued metrics, it loses important information about where the major classification problem occurs. One advantage of this metric is that it can be ordered by class priority, as in this study, or averaged when no class has priority; this gives researchers some flexibility in how they report their results or build their classifiers. The visual representation could be helpful, and we plan to add it to the toolbox in the future.

Most of the performance evaluation methods reviewed in this study are robust enough to be extended without modification into more-than-three-class performance metrics if necessary. The advantages and disadvantages of the different methods demonstrated with a three-class case study in this paper are expected to generalize to N>3 problems, though some of the contrasts may become more extreme. For example, as the number of classes increases, pairwise comparisons quickly become very cumbersome.
Previous studies have created, tested, and used the performance metrics discussed in this paper, but none has compared the strengths and weaknesses of the different performance metrics by means of a CADx case study. None of the methods has an absolute advantage over the others, and the importance of their different strengths and weaknesses will depend on the classification task. There are many other research possibilities that could further expand the area of multi-class performance metrics. For example, the derivation of a numeric metric from the cobweb representation and further exploration of hypersurface volumes are promising areas for future research; recent work has begun to address the feasibility of, and the derivation of metrics associated with, hypersurface volumes5,9,10.

ACKNOWLEDGEMENTS

The authors would like to thank Zack Madhavi for technical assistance and the rest of the members of the University of Texas Biomedical Informatics Lab for their support. We would also like to acknowledge our colleagues at Duke University Medical Center; the case study data set was assembled by the Duke Advanced Imaging Laboratories, and particular thanks are due to Carey E. Floyd, Jr. for this contribution.


REFERENCES

1. Sampat, M.P., Markey, M.K., and Bovik, A.C., "Computer-aided detection and diagnosis in mammography," Handbook of Image and Video Processing, 2nd ed. (forthcoming).
2. Metz, C.E., "Basic principles of ROC analysis," Seminars in Nuclear Medicine, 1978, 8(4): 283-298.
3. Metz, C.E., "ROC methodology in radiologic imaging," Investigative Radiology, 1986, 21(9): 720-733.
4. Hand, D.J. and Till, R.J., "A simple generalisation of the area under the ROC curve for multiple class classification problems," Machine Learning, 2001, 45(2): 171-186.
5. Ferri, C., Hernandez-Orallo, J., and Salido, M.A., "Volume under the ROC surface for multi-class problems: exact computation and evaluation of approximations," Technical report, Univ. Politecnica de Valencia, 2003, 1-40.
6. Mossman, D., "Three-way ROCs," Medical Decision Making, 1999, 19(1): 78-89.
7. American College of Radiology, ACR BI-RADS - Mammography, Ultrasound & Magnetic Resonance Imaging, 4th ed., Reston, VA: American College of Radiology, 2003.
8. Heckerling, P.S., "Parametric three-way receiver operating characteristic surface analysis using Mathematica," Medical Decision Making, 2001, 21(5): 409-417.
9. Edwards, D.C., Metz, C.E., and Nishikawa, R.M., "Hypervolume under the ROC hypersurface of a 'near-guessing' ideal observer in a three-class classification task," Proc. SPIE Medical Imaging 2004, 5372: 128-137.
10. Edwards, D.C., Metz, C.E., and Kupinski, M.A., "Ideal observers and optimal ROC hypersurfaces in N-class classification," IEEE Transactions on Medical Imaging, 2004, 23(7): 891-895.
