Classification of Epidemiological Data: A Comparison of Genetic Algorithm and Decision Tree Approaches

Clare Bates Congdon
Department of Computer Science, Colby College
5846 Mayflower Hill Drive, Waterville, ME 04901
[email protected]

Abstract: This paper describes an application of genetic algorithms (GA's) to classify epidemiological data, which is often challenging to classify due to noise and other factors. For such complex data, which requires a large number of very specific rules to achieve high accuracy, smaller rule sets composed of more general rules may be preferable, even if they are less accurate. The GA presented here allows the user to encourage smaller rule sets by setting a parameter. The rule sets found are also compared to those created by standard decision-tree algorithms. The results illustrate tradeoffs involving the number of rules, descriptive accuracy, predictive accuracy, and accuracy in describing and predicting positive examples across different rule sets.

1 Introduction

Complex common diseases, such as Coronary Artery Disease (CAD), hypertension, diabetes, and cancer, are difficult for epidemiologists to model because traditional statistical modeling approaches do not take into account the possibility of multiple etiologies or of interactions between risk factors and risk of disease. However, it is believed that in most cases an individual's risk of disease is a consequence of many interacting factors [8], and it is likely that a single etiological explanation is insufficient to adequately describe the risk associated with different subgroups of individuals.

Classification techniques in computer science typically encourage multiple-etiology (disjunctive) solutions. However, the typical methodology and assumptions for using these techniques are often incompatible with the needs of epidemiologists. Prediction accuracy is the metric most commonly used by machine learning researchers to compare algorithms on classification tasks: the data is divided into distinct "training" and "testing" data sets; a model is formed based on the former and then evaluated on the latter; and the prediction accuracy is the percent of examples correctly classified in the testing data. This metric is very useful for machine learning researchers because it can be applied to virtually any supervised classification task and is straightforward to apply. However, it is overly simplistic as a gauge of success for some real-world classification tasks. For example:

1. Prediction accuracy is "symmetric": it assumes that the cost of misclassifying a positive example as negative is the same as the cost of misclassifying a negative example as positive. There are numerous examples of real tasks in which this is not the case; for example, in epidemiology, early identification of individuals who are at high risk for a fatal disease could enable life-saving intervention.

2. Prediction accuracy does not incorporate a measure of the complexity of the rule set. There are several reasons that a simple rule set might be considered preferable to a more complex one:

   - For rules to be useful for humans, fewer rules may be more desirable, even at the expense of accuracy. Epidemiologists would like to know not just who is at risk for disease, but also to gain insight into why these individuals might be at risk.

   - For complex datasets, rules that are 100% correct may not be "statistically sound": on a complex problem, to achieve 100% accuracy on the training data, a typical decision-tree system will form many rules that explain only one or two examples. Rules that are "less accurate" but explain more examples may be preferable from a statistical perspective.

   - Disjunctive explanations of data are not a familiar concept in the medical community.

This research was done in collaboration with epidemiologists to address their specific needs; in particular, the desired output from the system is a set of rules that describes the positive examples (individuals at risk) as well as possible, in a form that is easily interpretable by humans. The needs of the epidemiologists distinguish this task from most in machine learning in several ways:

1. Negative examples do not need to be described (in the rule set) as well as positive examples.

2. Negative examples do not need to be predicted as well as positive examples.

3. Smaller rule sets may be preferable to larger rule sets, even if they are slightly less accurate.

4. The classification task is more complex than many typically studied in machine learning:

   - A typical dataset is small, relative to the size of the search space. (Typically, only a few hundred examples are available; this is generally due to the expense of collecting epidemiological data.)

   - A typical dataset is noisy, for several possible reasons, including day-to-day variations in an individual's physiology.

   - Typical rule sets are large, relative to the number of examples in the dataset. That is, the rule sets formed by most systems have a low example/rule ratio.

The research described in this paper acknowledges these concerns, and focuses on the third issue: exploring the effects of encouraging smaller rule sets, even at the expense of the accuracy of the rules.

2 Experimental Methods

The research described here was initially developed for epidemiologists, in part to illustrate how a disjunctive model (multiple rules) could be preferable to the single-etiology models usually used in epidemiology. It represents an attempt to bridge the differences in modeling practices between the machine learning and epidemiology communities: machine learning researchers generally don't consider the number of rules needed, and epidemiologists generally don't consider disjunctive explanations of data. The program described here allows researchers from both fields to compare the effects of models that require different numbers of rules.

2.1 Questions Addressed in This Research

The desiderata described in Section 1 conflict with each other; it is not practical to attempt to design a classification system that would achieve all of the stated goals. Therefore, this research attempts to evaluate some of the tradeoffs that one must make when straying from simple prediction accuracy as the sole metric for evaluation. (In the following discussion, "simplicity" is a measure of the number of examples/rule.) The questions addressed in this research are:

1. How does increasing the simplicity of rule sets affect the prediction accuracy?

2. How does increasing the simplicity of rule sets affect the prediction accuracy on positive examples?

3. How does increasing the simplicity of rule sets affect the descriptive accuracy (on the training data)?

4. How does increasing the simplicity of rule sets affect the descriptive accuracy on the positive examples in the training data?

Questions 2 and 4 are perhaps the most important for our epidemiological task.

2.2 Machine Learning Systems Used

GA's were chosen for this task for several reasons. One reason is that they are more flexible than most classification systems in terms of the models that may be implemented by the system. In contrast, many machine learning systems are an implementation of a specific theory of classification; such systems may be modified to some extent to match a particular task, but not always enough to keep the underlying assumptions from interfering with the task at hand.

GA's were also promising in that they have shown success with complex problems and may be able to exploit interactions between attributes, whereas many machine learning systems would ignore such interactions. In addition, GA's do not require that the attributes be independent, as do some machine learning systems.

Decision trees are an attractive point of comparison to GA's because they are a successful and popular tool in machine learning. They are also well suited to this task in that they create symbolic rules that are interpretable by humans, they are fast, and they are reasonably good at a variety of classification tasks. They may have shortcomings for this particular task in that standard decision-tree systems are not able to detect and exploit interacting attributes.

Autoclass, Cobweb, and neural nets were also considered, but the output from these systems is not easily interpretable by humans. These systems may provide an interesting comparison point, but they have not yet been evaluated on the dataset described here.

2.3 An Illustrative Dataset

The program described in this paper was developed for use with a proprietary dataset that the author cannot publish at this time. Instead, the approach has been applied to another dataset, which is publicly available and similar in structure. The results on the two datasets are comparable, although some differences do appear, as will be described below.

All of the datasets available at the UCI repository [9] were considered and evaluated for structural similarity to the original task (Footnote 2):

- two classes

- discrete attributes (or attributes that can easily be converted to discrete attributes)

- at least several hundred examples

- difficult to classify (a low example/rule ratio when classified by a typical decision-tree system, and less than 100% prediction accuracy)

Eleven of the UCI datasets were evaluated by running them through a decision-tree system, GID3* [4]; this ruled out "mushrooms" and "congressional voting", for example, as being too easy to classify. Seven datasets survived the initial screening: breast cancer (Ljubljana), breast cancer (Wisconsin), credit ratings, heart disease, hepatitis, promoter gene sequences, and tic-tac-toe. Ljubljana was attractive as being representative of the original dataset because the UCI documentation indicated that the highest reported prediction accuracy on the task was less than 80%, indicating that, like the original task, the Ljubljana data was virtually impossible to predict with 100% accuracy. Future experiments will explore the additional datasets.

Footnote 2: The approach described in this paper is not restricted to binary classification tasks or to discrete attributes, but this was the structure of the original task. Furthermore, more classes and continuous-valued attributes tend to cloud the issues addressed.

Figure 1: An illustration of the possible values for each attribute in the Ljubljana dataset. (The nine attributes are age, meno, t-size, i-nodes, caps, deg, breast, quad, and irrad; for each attribute, the figure lists its possible values and its number of values.)

Idiosyncrasies of Epidemiological Datasets

There are additional arguments for using another epidemiological task as the illustrative dataset. For example, most epidemiological datasets will contain a degree of noise, which generally makes both description and prediction difficult. To list a few reasons for this:

- For epidemiological data, one cannot even hope to measure all the possibly relevant factors that contribute to disease. For example, it is impossible to go back and evaluate the diet of a patient several years prior to an epidemiological study.

- Individual physiology varies day by day (as well as on many other time scales), but is typically measured only once. For example, an individual's blood pressure fluctuates with sleep patterns and stress levels.

- Whether an individual is recorded as a positive or negative example in the dataset has some inaccuracy. For example, a person who might have had a heart attack on one day may have died in a car accident the day before, and so is not recorded as a positive example of heart disease. Less extremely, the onset of a complex disease is a continuous process, but is typically measured by discrete values.

It is doubtful that one should ever hope to achieve 100% prediction accuracy in such a domain. Furthermore, it is doubtful that one should ever strive for 100% accuracy in describing the training data, as is typically done in machine learning research. Such rule sets are bound to be overspecific.

The Ljubljana Breast-Cancer Dataset

The Ljubljana breast-cancer dataset (Footnote 3) contains nine discrete attributes and a binary class attribute. The nine attributes each have from two to 11 values. Two of the attributes have missing values; this was handled as an additional possible value, "missing".

Footnote 3: This breast cancer domain was originally obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks to M. Zwitter and M. Soklic for providing the data.

The Ljubljana data differs from the original dataset in a few notable aspects:

1. The original dataset has 23 attributes; Ljubljana has only nine.

2. There are known interactions between some of the physiological traits measured in the original data, which make it difficult to classify well. There may not be many interactions in the Ljubljana dataset.

3. The Ljubljana data is "ambiguous": there are five pairs of examples and one triple that have all attribute-values the same but belong to different classes. The original dataset was not ambiguous. This does not seem to have been an important difference (Footnote 4).

4. The Ljubljana dataset includes only 20% positive examples; the original dataset contained 44.3% positive examples. This difference led to an interesting result with the Ljubljana data that was not observed in the original data.

Footnote 4: The original dataset would likely be ambiguous also, if it were restricted to nine attributes.

The possible values for each of the nine attributes are illustrated in Figure 1. The notation for these values is simplified from that in the original dataset. For example, age cohorts such as "20-29" are abbreviated as "20"; also, values not occurring in the actual data are not listed.

2.4 Metrics for Evaluating Rules and Rule Sets

There are five numbers that must be considered in evaluating the rule sets for this epidemiological task, as listed in Section 2.1. A weighting function may be constructed to evaluate the tradeoffs of emphasizing some of these five desiderata over others:

    eval(a, b, c, d, e) = w_a*a + w_b*b + w_c*c + w_d*d + w_e*e,
    where w_a + w_b + w_c + w_d + w_e = 1.0

Four of the desiderata use a form of "accuracy", which is already on the [0..1] range; the "simplicity" of the rule sets should also be expressed as a function on the [0..1] range. This could be done by taking the inverse of the number of rules needed to explain the positive examples; however, that metric is not linear in the number of rules (causing it to favor small rule sets perhaps excessively). As an alternative, observe that the maximum number of rules needed to explain 85 positive examples is 85, and the minimum is 1. A linear metric of the size of the rule set is therefore (85 - r); to bring this into the [0..1] range, we divide by 84: (85 - r)/84. This metric will be called the simplicity of the rule set.

Using the following abbreviations for our desiderata: simplicity = "s", descriptive accuracy = "d", predictive accuracy = "p", descriptive accuracy on positive examples = "dp", predictive accuracy on positive examples = "pp", our function is realized as:

    eval(s, d, p, dp, pp) = w_s*s + w_d*d + w_p*p + w_dp*dp + w_pp*pp,
    where w_s + w_d + w_p + w_dp + w_pp = 1.0
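To make the weighting function concrete, the following is a minimal Python sketch of the simplicity metric and the weighted evaluation defined above. It is an illustration rather than the software used in this study; the weight values shown are arbitrary placeholders.

```python
# Illustrative sketch (not the original implementation) of the simplicity
# metric and the weighted evaluation function described in Section 2.4.

def simplicity(num_rules, num_positives=85):
    """Linear simplicity of a rule set: 1.0 for a single rule,
    0.0 when every positive example needs its own rule."""
    return (num_positives - num_rules) / (num_positives - 1)

def evaluate(s, d, p, dp, pp, weights):
    """Weighted combination of the five desiderata; each value and each
    weight is on [0..1], and the weights sum to 1.0."""
    w_s, w_d, w_p, w_dp, w_pp = weights
    assert abs((w_s + w_d + w_p + w_dp + w_pp) - 1.0) < 1e-9
    return w_s * s + w_d * d + w_p * p + w_dp * dp + w_pp * pp

# Example with arbitrary weights emphasizing the positive-example metrics:
score = evaluate(s=simplicity(10), d=0.80, p=0.72, dp=0.85, pp=0.70,
                 weights=(0.2, 0.1, 0.1, 0.3, 0.3))
print(round(score, 3))
```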

2.5 Testing and Training Methodology

For each approach evaluated, six runs were done:

- Each system (GA or decision-tree variant) was run on the original dataset to evaluate its descriptive ability on the entire dataset, both in terms of the descriptive accuracy and the example/rule ratio.

- Five training/testing runs: each system was trained on a random 80% of the data and tested on the remaining 20%. The five testing datasets are not mutually exclusive, although each example from the original dataset appears in at least one training dataset. (The same training and testing datasets were used for all systems.)

Statistical tests were then done to compare and evaluate the results of the five training/testing runs; t-tests were used to compare the five runs to random and biased coins, and paired t-tests were used to compare the runs to each other.
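The split-and-compare protocol can be sketched as follows. This is a hypothetical Python illustration (assuming numpy and scipy are available), not the scripts used in the study; the accuracy lists and the coin accuracy are placeholder values.

```python
# Hypothetical sketch of the testing methodology: five random 80/20
# splits shared by all systems, then t-tests on the resulting accuracies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def make_splits(n_examples, n_splits=5, train_fraction=0.8):
    """Return (train_indices, test_indices) pairs for each split."""
    splits = []
    for _ in range(n_splits):
        order = rng.permutation(n_examples)
        cut = int(train_fraction * n_examples)
        splits.append((order[:cut], order[cut:]))
    return splits

splits = make_splits(n_examples=286)            # a few hundred examples
print(len(splits[0][0]), len(splits[0][1]))     # 228 training, 58 testing

# Placeholder prediction accuracies for two systems on the five test sets:
ga_acc = [0.71, 0.68, 0.74, 0.70, 0.69]
dt_acc = [0.73, 0.70, 0.72, 0.75, 0.71]

# Compare one system against a biased coin (one-sample t-test) ...
print(stats.ttest_1samp(ga_acc, popmean=0.70))  # 0.70 = placeholder coin accuracy
# ... and the two systems against each other (paired t-test).
print(stats.ttest_rel(ga_acc, dt_acc))
```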

3 Machine-Learning Systems Used

This section describes specifics of the GA's and decision trees used.

3.1 The Genetic Algorithms Model

Space prohibits a complete description of the genetic algorithms program used in this research. The program is based on the Genesis software package [7], and is described in more detail in [3]. Each string in the GA represents a rule; the GA fitness function evaluates how well each particular string describes the positive examples (Footnote 5). The nine attributes of the dataset were represented as a 23-bit string. In addition to the values shown in Figure 1, each attribute was allowed an additional value to represent that the attribute is unused. This results in a simple conjunctive representation.

Footnote 5: Rules explain positive examples only because that is what we are most interested in describing. This is important because of the lopsidedness of the dataset; striving to explain both positive and negative examples tends to mean that negative examples are "easier" to explain well than positive examples, so the explanation of positive examples suffers.
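To illustrate what a simple conjunctive representation means here, the sketch below models a rule as each attribute being either unused or constrained to one value. The attribute names follow Figure 1, but the dictionary form and the helper function matches() are assumptions made for illustration, not the paper's actual bit-level encoding.

```python
# Hypothetical illustration of a conjunctive rule over the Ljubljana
# attributes: each attribute is either unused (None) or must match.
# This is not the paper's actual 23-bit encoding.

UNUSED = None

rule = {
    "age": "40",       # age cohort 40-49
    "meno": UNUSED,    # attribute not used by this rule
    "t-size": "25",
    "i-nodes": UNUSED,
    "caps": "yes",
    "deg": UNUSED,
    "breast": UNUSED,
    "quad": UNUSED,
    "irrad": UNUSED,
}

def matches(rule, example):
    """An example satisfies the rule iff every used attribute matches."""
    return all(example.get(attr) == value
               for attr, value in rule.items()
               if value is not UNUSED)

example = {"age": "40", "meno": "pre-m", "t-size": "25", "i-nodes": "0",
           "caps": "yes", "deg": "2", "breast": "left", "quad": "l-u",
           "irrad": "no"}
print(matches(rule, example))   # True: all used attributes match
```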

The basic Genesis program was modified in three primary ways:

1. Most GA's converge to a single optimum. The fitness evaluation was changed to a two-step process, to encourage multiple optima, or niching.

2. Immigration was used [1, 2]; examples from the training dataset were converted into GA strings each generation. Empirical evidence suggests that this helps to maintain the diversity of the GA population and to increase the coverage of the rule set.

3. A parameter was added to the fitness function to allow for penalizing strings that did not explain a minimum number of examples.

Parameter settings include a population size of 200; 200,000 total trials (1000 generations); and 100 runs with each experiment, but the overall parameter settings were not "fine-tuned" to this particular task.

Instead of accuracy, a derivative of an odds-ratio function was used as the basic fitness function. (If we call the number of true positives a, the number of false positives b, the number of false negatives c, and the number of true negatives d, accuracy is (a + d) / (a + b + c + d), while odds ratio is ad / bc.) Odds ratio is a metric often used by epidemiologists; it was used instead of accuracy because empirical tests showed it to be better at encouraging rules that described positive examples well. We took log(odds ratio) and scaled it to the [0..1] interval for the basic fitness function.

The calculation of fitness was modified in two ways:

1. The evaluation of the population was made into a two-step process to encourage the GA to find multiple rules. This modification is similar to the approach described by Greene & Smith [6]; strings only get to count themselves as explaining positive examples that have not already been explained by a higher-fitness string. The process is illustrated in Figure 2.

2. A parameter was added to specify the minimum number of examples that should be described by each string; strings below the threshold were penalized. The effect of the penalty function for strings that explain fewer than 30 examples is illustrated in the third graph in Figure 3. Strings that fall below the threshold have their fitness divided by the difference between the threshold and the number of examples they explain. (For example, with a threshold of 30 examples, a string that explains 28 examples has its fitness divided by 2.)

3.2 Decision Tree Systems

Three variations on decision trees were investigated in this research: ID3, GID3*, and O-Btree, using TreeMaker software from the University of Michigan and JPL. The first was chosen because it is somewhat of a standard in the machine learning community; the latter two were chosen because they were available via TreeMaker and because they compare favorably to other decision-tree systems in terms of their predictive accuracy on a range of tasks, e.g., [4, 5]. These decision-tree systems all represent rules as a conjunct of disjuncts: for each attribute specified in the rule, one or more possible values is given.


Figure 2: Modifications to the fitness evaluation for GA strings to encourage the GA to find a rule set, rather than the single best rule. First, the fitness of all strings in the population is calculated, using the fitness function; second, the strings are sorted by this “raw” fitness; and finally, the new fitness is calculated, based only on the number of examples that have not been explained by strings that have a higher raw fitness.
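The two-step ("operative") fitness assignment that Figure 2 depicts can be sketched as follows. This is an illustrative simplification in which a string's operative fitness is taken to be the count of newly explained positive examples (the paper recomputes the fitness function against the reduced dataset); the helper operative_fitness() and the toy data are assumptions for illustration, not the Genesis-based implementation.

```python
# Illustrative sketch of the two-step fitness evaluation from Figure 2:
# sort strings by raw fitness, then credit each string only with the
# positive examples not already explained by a higher-fitness string.

def operative_fitness(population, positives):
    """population: list of (rule_id, raw_fitness, explained_positive_ids).
    Returns {rule_id: operative fitness}, here simplified to the number of
    newly explained positive examples."""
    ranked = sorted(population, key=lambda entry: entry[1], reverse=True)
    already_explained = set()
    result = {}
    for rule_id, raw_fit, explained in ranked:
        new = (set(explained) & set(positives)) - already_explained
        result[rule_id] = len(new)
        already_explained |= new
    return result

# Toy example with three rules over ten positive examples:
positives = set(range(10))
population = [("r1", 0.99, {0, 1, 2, 3, 4}),
              ("r2", 0.94, {3, 4, 5, 6}),
              ("r3", 0.87, {5, 6, 7})]
print(operative_fitness(population, positives))
# {'r1': 5, 'r2': 2, 'r3': 1}
```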

[Figure 3 (panel titles): log(odds ratio) scaled to [0..1]; log(odds ratio) over a reduced range; penalty as a function of (a + b), the number of examples a string explains.]
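A minimal sketch of the raw fitness (log of the odds ratio, scaled to [0..1]) and the minimum-coverage penalty described in Section 3.1 appears below. The scaling constant LOG_ODDS_CAP and the guard against zero counts are assumptions made for illustration; the paper does not specify these details.

```python
# Illustrative sketch (not the original Genesis-based code) of the raw
# fitness, log(odds ratio) scaled to [0..1], and the penalty applied to
# strings that explain fewer than a threshold number of examples.
import math

LOG_ODDS_CAP = 6.0   # assumed scaling constant; not specified in the paper

def raw_fitness(a, b, c, d):
    """a=true positives, b=false positives, c=false negatives,
    d=true negatives. Odds ratio is (a*d)/(b*c); take its log and
    clamp/scale onto [0..1]."""
    odds_ratio = (a * d) / max(b * c, 1)          # guard against empty cells
    scaled = math.log(max(odds_ratio, 1.0)) / LOG_ODDS_CAP
    return min(max(scaled, 0.0), 1.0)

def penalized_fitness(fitness, examples_explained, threshold=30):
    """Strings below the coverage threshold have their fitness divided by
    (threshold - examples explained); e.g. 28 of 30 -> divided by 2."""
    if examples_explained >= threshold:
        return fitness
    return fitness / (threshold - examples_explained)

f = raw_fitness(a=40, b=10, c=45, d=191)
print(round(f, 3), round(penalized_fitness(f, examples_explained=28), 3))
```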