Cost-sensitive Discretization of Numeric Attributes

Tom Brijs and Koen Vanhoof

Limburg University Centre, Faculty of Applied Economic Sciences, B-3590 Diepenbeek, Belgium

Presented at the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, Nantes, France, September 23-26, 1998

Abstract. Many algorithms in decision tree learning have not been designed to handle numerically-valued attributes very well. Therefore, discretization of the continuous feature space has to be carried out. In this article we introduce the concept of cost-sensitive discretization as a preprocessing step to induction of a classifier and as an elaboration of error-based discretization, to obtain an optimal multi-interval splitting for each numeric attribute. A transparent description of the method and the steps involved in cost-sensitive discretization is given. We also assess its performance against two other well-known methods, i.e. entropy-based discretization and pure error-based discretization, on an authentic financial dataset. From an algorithmic perspective, we show that an important deficiency of error-based discretization methods can be solved by introducing costs. From the application perspective, we found that using a discretization method is recommendable. Finally, we use ROC curves to illustrate that under particular conditions cost-based discretization can be optimal.

1 Introduction

Many algorithms for learning decision trees from examples have not been designed to handle numeric attributes. Therefore, discretization of continuous-valued features must be carried out as a preprocessing step. Many researchers have already contributed to the issue of discretization. However, as far as we know, insufficient effort has been made to include the concept of misclassification costs in finding an optimal multi-split. Discretization also has some additional appeals. Kohavi & Sahami [1996] mention that discretization itself may be considered a form of knowledge discovery, in that critical values in a continuous domain may be revealed. Catlett [1991] also reported that, for very large data sets, discretization significantly reduces the time needed to induce a classifier.

Traditionally, five different axes can be used to classify the existing discretization methods: error-based vs. entropy-based, global vs. local, static vs. dynamic, supervised vs. unsupervised, and top-down vs. bottom-up. These distinctions are elaborated in section 3, which covers the related work in this domain. Our method is an error-based, global, static, supervised method combining a top-down and a bottom-up approach. However, it is not just an error-based method: through the introduction of a misclassification cost matrix, candidate cutpoints are evaluated against a cost function to minimize the overall misclassification cost of false positive and false negative errors instead of just the total sum of errors. False positive (resp. false negative) errors, in our experimental design, are companies incorrectly classified as not bankrupt (resp. bankrupt) although they actually are bankrupt (resp. not bankrupt).

The objective of this paper is to evaluate the performance of our cost-sensitive discretization method against Fayyad & Irani's entropy-based method. Firstly, we evaluate the effectiveness of both methods in finding the critical cutpoints that minimize an overall cost function as a result of the preprocessing knowledge discovery step. Secondly, both methods are compared after induction of the C5.0 classifier to evaluate their contribution to decision-tree learning.

In section 2, we introduce the concept of cost-sensitive discretization and give a transparent description of the steps involved. In section 3, we elaborate on related work in the domain of discretization of continuous features. In section 4, an empirical evaluation of both methods is carried out on a real-life dataset. Section 5 provides a summary of this work.

2 Cost-sensitive discretization

Cost-sensitive discretization can be defined as taking into account the cost of making errors instead of just minimizing the total sum of errors. This implies that discretizing a numeric feature involves searching for a discretization of the attribute value range that minimizes a given cost function. The specification of this cost function depends on the costs assigned to the different error types. Potential interval splittings can then be generated and subsequently evaluated against this cost function.

To illustrate the process of cost-sensitive discretization we consider a hypothetical example of a numeric attribute for which 3 boundary points are identified (see figure 1). In each interval, the number of cases together with their class labels is given. Intuitively, a boundary point is a value V in between two sorted attribute values U and W such that all examples having attribute value U have a different class label compared to the examples having attribute value W, or U and W have a different class frequency distribution. Previous work [Fayyad and Irani 1992] has contributed substantially to identifying potential cutpoints: they proved that it is sufficient to consider boundary points as potential cutpoints, because optimal splits always fall on boundary points. Formally, the concept of a boundary point is defined as:

Definition 1 [Fayyad and Irani 1992] A value T in the range of the attribute A is a boundary point iff in the sequence of examples sorted by the value of A, there exist two examples s1, s2 ∈ S, having different classes, such that valA(s1) < T < valA(s2); and there exists no other example s' ∈ S such that valA(s1) < valA(s') < valA(s2).
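Operationally, Definition 1 amounts to a single scan over the sorted examples. The following minimal Python sketch (our own illustrative code and names, not the authors' implementation) collects the candidate cutpoints, using the relaxed condition that the class frequency distributions of two adjacent distinct values differ:

```python
# Minimal sketch of boundary-point detection following Definition 1
# (illustrative code, not the authors' implementation).
from collections import Counter

def boundary_points(values, labels):
    """Return candidate cutpoints: midpoints between adjacent distinct
    attribute values whose class frequency distributions differ."""
    dist = {}
    for v, c in sorted(zip(values, labels)):
        dist.setdefault(v, Counter())[c] += 1
    vals = sorted(dist)
    return [(u + w) / 2
            for u, w in zip(vals, vals[1:])
            if dist[u] != dist[w]]

# Example: eight sorted values with classes X X X Y X Y Y Y
print(boundary_points([1, 2, 3, 4, 5, 6, 7, 8], list("XXXYXYYY")))
# -> [3.5, 4.5, 5.5]
```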

a | XXXY | b | XYY | c | XYYY | d | YYYY | e

Figure 1 An example (numerically-valued) attribute and its boundary points.

Now suppose C(YX) = 3, i.e. misclassifying a class X case as belonging to class Y costs 3, and C(XY) = 1. For an authentic dataset these cost parameters can be found in the business context of the dataset. However, in many cases exact cost parameters are not known. Usually the cost parameters of true positive and true negative classifications are set to zero, while the cost values for false positive and false negative errors reflect their relative importance against each other. This results in what is called a cost matrix. The cost matrix indicates that the cost of misclassifying a case is a function of the predicted class and the actual class; the number of entries in the cost matrix depends on the number of classes of the target attribute.

Consequently, for each potential interval the minimal cost can be calculated by multiplying the false positive cost (respectively, the false negative cost) by the number of false positive (respectively, false negative) errors made as a result of assigning one of both classes to the interval, and picking the minimal cost of both assignments, as illustrated in figure 2.

Figure 2 Intervals with corresponding minimal costs (e.g. a-b: 1, b-c: 2, c-d: 3, d-e: 0, a-c: 3, c-e: 3, a-d: 6, a-e: 10).
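Written out for the example above, with rows indexing the predicted class and columns the actual class (a layout convention we adopt here; the paper itself does not fix one), the cost matrix is:

$$
C = \begin{pmatrix} C(XX) & C(XY) \\ C(YX) & C(YY) \end{pmatrix}
  = \begin{pmatrix} 0 & 1 \\ 3 & 0 \end{pmatrix}
$$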


For example, the total minimal cost for the overall interval (from a to e) is 10, which can be found as follows:
- the number of X cases in interval a-e is 5 and the number of Y cases is 10;
- classifying all cases in a-e as 'X' gives a cost of 10 * C(XY) = 10 * 1 = 10;
- classifying all cases in a-e as 'Y' gives a cost of 5 * C(YX) = 5 * 3 = 15;
- therefore, the minimal cost for the overall interval a-e is 10.

Suppose the maximum number of intervals k is set to 3. A network can now be constructed as depicted in figure 3 (not all costs are included for the sake of visibility). The value of k may depend on the problem being studied, but Elomaa & Rousu [1996] advise to keep the value of k relatively low. Increasing parameter k reduces the misclassification cost after discretization, but it has a negative impact on the interpretability of the classification tree after induction because the tree will become wider.
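In symbols, writing $n_X(I)$ and $n_Y(I)$ for the number of X and Y cases in a candidate interval $I$ (notation introduced here for compactness), the minimal interval cost used above and the objective encoded by the network are:

$$
\mathrm{cost}(I) = \min\bigl(n_Y(I)\,C(XY),\; n_X(I)\,C(YX)\bigr),
\qquad
\min_{m \le k}\;\min_{I_1,\dots,I_m} \sum_{j=1}^{m} \mathrm{cost}(I_j),
$$

where the intervals $I_1, \dots, I_m$ partition the attribute range and may only meet at boundary points.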


Figure 3 Shortest route network.

The shortest route linear programming approach can be used to identify the optimal number and placement of the intervals that yield the overall minimal cost for the discretization of this numeric attribute. This short illustration shows that several phases must be undertaken to discretize numeric attributes in a cost-sensitive way. First, all boundary points for the attribute under consideration must be identified. Second, a cost matrix must be constructed to specify the cost of making false positive and false negative errors. Third, costs must be calculated and assigned to all potential intervals. Fourth, the maximum number of intervals k must be specified. Fifth, a shortest route network can be constructed from all potential intervals with their corresponding minimal costs. Finally, this shortest route network can be solved using the shortest route linear programming routine.
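As a concrete illustration, the sketch below reproduces the example end-to-end in Python. The recursive shortest-route search (plain dynamic programming rather than the linear programming routine mentioned above), the function names and the data layout are our own illustrative assumptions:

```python
# Minimal sketch of cost-sensitive discretization as a shortest-route
# problem (illustrative, not the authors' implementation).

# Class counts (#X, #Y) in the elementary intervals of figure 1:
# a [XXXY] b [XYY] c [XYYY] d [YYYY] e
counts = [(3, 1), (1, 2), (1, 3), (0, 4)]

C_XY = 1  # cost of misclassifying a Y case as X
C_YX = 3  # cost of misclassifying an X case as Y

def interval_cost(i, j):
    """Minimal cost of the merged interval spanning elementary
    intervals i..j-1: label it X (all Y cases are errors) or
    Y (all X cases are errors) and keep the cheaper choice."""
    x = sum(n for n, _ in counts[i:j])
    y = sum(n for _, n in counts[i:j])
    return min(y * C_XY, x * C_YX)

def min_cost(start, k):
    """Cheapest way to cover elementary intervals start..end with at
    most k merged intervals; returns (cost, right cut positions)."""
    if start == len(counts):
        return 0, []
    if k == 0:
        return float("inf"), []
    best_cost, best_cuts = float("inf"), []
    for end in range(start + 1, len(counts) + 1):
        rest_cost, rest_cuts = min_cost(end, k - 1)
        total = interval_cost(start, end) + rest_cost
        if total < best_cost:
            best_cost, best_cuts = total, [end] + rest_cuts
    return best_cost, best_cuts

print(min_cost(0, 3))  # -> (6, [1, 2, 4]): intervals a-b, b-c, c-e
```

With k = 1 the routine returns the cost of 10 computed above for the single interval a-e; with k = 3 it finds a total cost of 6 using the intervals a-b, b-c and c-e.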

3 Related work

As already mentioned in section 1, our discretization method is an error-based, global, static, supervised method combining a top-down and bottom-up approach. In order to clarify this statement, we elaborate on the distinctions that can be drawn among the known discretization methods.

Traditional error-based methods, for example Maas [1994], evaluate candidate cutpoints against an error function and explore a search space of boundary points to minimize the sum of false positive and false negative errors on the training set. Entropy-based methods, for example Fayyad and Irani [1993], use entropy measures to rate candidate cutpoints. Our method is an error-based discretization method. However, through the introduction of a misclassification cost matrix, candidate cutpoints are evaluated against a cost function to minimize the overall cost of false positive and false negative errors instead of just the total sum of errors. Kohavi and Sahami [1996] show that the error-based discretization method has an important deficiency: it will never generate two adjacent intervals when in both intervals a particular class prevails, even when the class frequency distributions differ between the two intervals. Our cost-sensitive discretization method does not suffer from this deficiency because it takes these class frequency differences into account. By increasing the error cost of the minority class, the weight of the minority class is leveraged so that, eventually, different class labels will be assigned to the two intervals, indicating a potential cutpoint (a numeric illustration is given at the end of this section).

The distinction between global [Chmielewski & Grzymala-Busse 1994] and local [Quinlan 1993] discretization methods depends on when discretization is performed. Global discretization, such as binning, involves discretizing each numeric attribute before induction of a classifier, whereas local methods, such as C5.0, carry out discretization during the induction process. In fact, global methods use the entire value domain of a numeric attribute for discretization while local methods produce partitions that are applied to localized regions of the instance space. A possible advantage of global discretization over local methods is that it is less prone to variance in estimation from small fragmented data.

The distinction between static [Catlett 1991, Fayyad & Irani 1993, Pfahringer 1995, Holte 1993] and dynamic [Fulton, Kasif & Salzberg 1995] methods depends on whether the method takes feature interactions into account. Static methods, such as binning, entropy-based partitioning and the 1R algorithm, determine the number of partitions for each attribute independently of the other features. Dynamic methods, on the other hand, conduct a search through the space of possible k partitions for all features simultaneously, thereby capturing interdependencies in feature discretization.

Another distinction depends on whether the method takes class information into account. Several discretization methods, such as equal-width interval binning, do not make use of class membership information during the discretization process; these are referred to as unsupervised methods [Van de Merckt 1993]. In contrast, discretization methods that use class labels for carrying out discretization are referred to as supervised methods [Holte 1993, Fayyad and Irani 1993]. Previous research has indicated that supervised methods are slightly better than unsupervised methods. This is not surprising, because with unsupervised methods there is a larger risk of combining values that are strongly associated with different classes into the same bin, making effective classification much more difficult.

A final distinction can be made between top-down [Fayyad and Irani 1993] and bottom-up [Kerber 1992] discretization methods. Top-down methods start off with one big interval containing all known values of a feature and partition this interval into smaller and smaller subintervals until a certain stopping criterion, for example MDL (Minimum Description Length), or an optimal number of intervals is reached. In contrast, bottom-up methods initially consider a number of intervals, determined by the set of boundary points, and combine these intervals during execution until a certain stopping criterion, such as a χ2 threshold, or an optimal number of intervals is reached.
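As a numeric illustration of how costs repair the deficiency noted above (the figures are our own, not taken from the dataset): consider two adjacent intervals, I1 containing 4 X and 6 Y cases and I2 containing 1 X and 9 Y cases. A pure error-based method labels both intervals Y, since Y prevails in each, and therefore never separates them. With C(YX) = 3 and C(XY) = 1, labelling I1 as Y costs 4 * 3 = 12 whereas labelling it X costs 6 * 1 = 6, so I1 receives label X; for I2, label Y costs 1 * 3 = 3 against 9 * 1 = 9 for label X, so I2 keeps label Y. The two intervals now carry different labels, and the boundary between them survives as a cutpoint.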

4 Empirical evaluation

4.1 The data set

To carry out our experiments we used an authentic dataset of 549 tuples, each tuple representing a different company. Each company is described by 18 continuously valued attributes, i.e. different financial features on liquidity, solvability, profitability, and others; for a complete list, we refer to appendix 1. The target attribute is a 2-class nominal attribute indicating whether the company went bankrupt (class 0) or not (class 1) during that particular year. The class distribution in this data set is highly unbalanced, containing only 136 (24.77%) companies that went bankrupt. This data set was assembled from the official financial statements of these companies, which are available at the National Bank of Belgium; in Belgium, medium to large-sized enterprises are obliged to report their financial statements in detail.

4.2 Position of cutpoints and induced cost

In a first experiment, we evaluated the performance of Fayyad & Irani's discretization method against our cost-sensitive method relative to a specified cost function. We know our method yields optimal results, i.e. it achieves minimal costs given a maximum of k intervals. We are therefore interested in whether Fayyad & Irani's entropy-based method is able to come close to this optimal solution under the same cost function. We discretized all numeric attributes separately for different misclassification cost matrices, ranging from a false positive cost parameter of 1 (the uniform cost problem) to 8 (false positive errors are severely punished relative to false negative errors). For the sake of simplicity, we call this cost parameter the discretization cost. Parameter k was arbitrarily set to 2n + 1, i.e. 5, with n the number of classes.

We are particularly interested in the extent to which our cost-sensitive method achieves significantly superior results on the overall cost function compared to the method of Fayyad & Irani. Our experiments revealed that for all attributes, cost-sensitive discretization achieved significantly better results than Fayyad & Irani's entropy-based discretization. On average, over all discretization costs and all attributes, entropy-based discretization resulted in an 8.7% increase in cost relative to our method. Table 1 shows the percentage increase in cost of Fayyad & Irani's method against our method for different discretization costs.


Table 1 Average percentage increase in cost, Fayyad vs. cost-sensitive

Discretization cost    1     2     3     4     5     6     7     8
% increase in cost     6.3   6.9   7.1   7.1   8.7   9.0   11.4  13.1

This large performance gap is mainly due to the fact that our cost-sensitive method exploits local class differences to achieve lower costs, whereas the entropy-based method finds thresholds that minimize the entropy function and is consequently not as heavily distracted by local differences. To illustrate this effect, we have plotted the class distributions for a (highly discriminative) attribute in figure 4. The actual value range of this attribute is wider than shown in the figure, but to focus on the important data we concentrate on this smaller range.

[Figure 4: frequency distributions of the not-bankrupt and bankrupt classes over the attribute range, with interesting cutpoints A, B, C and D' marked; below the graph, the cutpoints found by Fayyad & Irani and by cost-sensitive discretization with discretization costs 1, 2, 5 and 8 are indicated.]

Figure 4 Class distribution of an attribute

This figure clearly illustrates the skewness of the class distributions. It can readily be observed that for lower values of the attribute the probability of bankruptcy is higher than for high attribute values. An interesting phenomenon, indicated by the arrow, is a rather unexpected peak in the data. Such local peaks are not unusual in financial datasets, because they may be caused by sector differences or some financial factor. The objective of cost-sensitive discretization is to determine the optimal multi-interval splitting of this numerically valued attribute range, given a maximum of k intervals.

In fact, a good discretization method should find the 4 cutpoints labeled A, B, C and D', because these points indicate very specific situations. A is positioned where the lower tails of both distributions intersect. B is also important because it is situated at the peak of the class 1 distribution. At position C the frequency of class 0 cases becomes very small relative to the other class. Point D' is interesting because it indicates a local peak in both distributions. The placement of the cutpoints for the different discretization methods is depicted below the graph. It is also interesting to note that entropy-based discretization reveals only 2 cutpoints, whereas cost-sensitive discretization discovers 4 cutpoints, resulting in the maximum allowed number of intervals, i.e. 5. On average, entropy-based discretization resulted in fewer intervals (3 to 4) than cost-sensitive discretization (4 to 5).

It is important to notice that no method is able to recognize all of these important cutpoints. Fayyad & Irani's method finds cutpoints B and C, whereas our cost-sensitive method finds different cutpoints depending on the discretization cost being used. For low false positive costs, cost-sensitive discretization only subdivides the range where the frequency of the minority class equals that of the majority class or where small frequency differences exist, resulting in sensitivity to local class frequency differences; it discovers cutpoints A, B and D' but is insensitive to cutpoint C. For high false positive costs only the common range with high class frequency differences is subdivided: cutpoints B and C are found, together with some other cutpoints in the same neighbourhood, again as a result of local class frequency differences. Discretization with high false positive costs, however, ignores cutpoints A and D'.

It can be observed that increasing the discretization cost of the minority class shifts the placement of the cutpoints to the right, into the distribution of the minority class, as illustrated in the dashed box below figure 4. This shift in cutpoint positions influences the false positive and false negative error rates. This degree of freedom can be used to optimize classifiers (minimal misclassification cost or minimal error rate). At first sight, cost-sensitive discretization looks very promising, but we want to find out whether this result still holds after induction of a classifier.


4.3 Comparison of entropy-based versus cost-based discretization

4.3.1 False positive (FP) and false negative (FN) error rates

We induced the C5.0 classifier and used repeated 3-fold cross-validation on the discretized data sets to compare the FP and FN error rates of both discretization methods. The C5.0 algorithm was used with increasing FP cost; increasing this cost results in a lower FP error rate and a higher FN error rate. In order to visualize the differences between the discretization methods, we normalized the error rates and used entropy-based discretization as the base line. This means that a given percentage is the cost-sensitive error rate divided by the entropy-based error rate.

First, we want to investigate the interaction effect of a given discretization cost and a changing C5.0 cost parameter on the FP and FN error rates. When the discretization cost equals 1, the method is a pure error-based method. Figure 5 shows that for this method the FP error rate is higher, but the FN error rate lower, than the base line (entropy-based).

[Figure 5 FP and FN error rates, discretization cost 1 vs. Fayyad & Irani: ratio of the FP and FN error rates to the entropy-based base line (0.60 to 1.40), plotted against the C5.0 cost parameter (1 to 8).]

The higher number of FP errors results from the fact that, for error-based discretization, the range beyond cutpoint D' in figure 4 is fully classified as the majority class 1, i.e. not bankrupt, while still containing almost 50% of the bankrupt cases. Calculation of the frequency of FP errors (43.6%) confirmed this observation. With the given class distribution, the global accuracy is higher with error-based discretization. Figures 6 and 7 give the error rates for other discretization costs (2 and 6, respectively).


[Figure 6 FP and FN error rates, discretization cost 2 vs. Fayyad & Irani (same axes as figure 5).]


[Figure 7 FP and FN error rates, discretization cost 6 vs. Fayyad & Irani (same axes as figure 5).]

For cost-sensitive discretization the FP error rate is lower, and the FN error rate higher, than the base line. These figures show two surprising results. The first concerns the shape of the curves: by increasing the C5.0 cost parameter, the FP error rate should decrease. This is the case for both methods, but for the cost-sensitive discretizer the decrease is much faster (it occurs at a lower C5.0 cost parameter); for higher C5.0 cost parameters, the entropy-based method has a lower FP error rate. Secondly, the first bar(s) show a decrease of the FP error rate with only a slight increase in the FN error rate. Combining these two results indicates that the best results can be obtained by using a low C5.0 cost parameter.

4.3.2 Misclassification costs

We compare the performance of the classifiers obtained with the different discretization methods in a ROC graph [Provost & Fawcett 1997]. On a ROC graph the true positive rate (TP) is plotted on the Y axis and the false positive rate (FP) on the X axis. One point in the ROC graph (representing one classifier with given parameters) is better than another if it lies to the north-west (TP is higher, FP is lower, or both). A ROC graph illustrates the behaviour of a classifier without regard to class distributions and error costs, so that it decouples classification performance from these factors. Figure 8 shows the ROC graph for the different sets of classifiers; we decided not to show all classifiers, only those most relevant to the performance evaluation. Provost & Fawcett [1997] have shown that a classifier is potentially optimal if and only if it lies on the north-west boundary of the convex hull [Barber, Dobkin and Huhdanpaa 1993] of the set of points in the ROC graph. From the figure we can see that the sets of classifiers with the entropy-based and cost-based discretization methods are potentially optimal. From a visual interpretation, we can rank the methods for the region with an FP rate lower than 35 as follows: entropy-based, better than cost-based, better than error-based discretization. For the region with an FP rate higher than 35: firstly cost-based, and secondly entropy-based and error-based discretization.
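The comparison below relies on iso-performance lines; for reference, the underlying condition from Provost & Fawcett [1997] is that two points $(FP_1, TP_1)$ and $(FP_2, TP_2)$ on a ROC graph have the same expected cost exactly when

$$
\frac{TP_2 - TP_1}{FP_2 - FP_1} \;=\; \frac{p(N)\,C(FP)}{p(P)\,C(FN)},
$$

where $p(P)$ and $p(N)$ are the prior probabilities of the positive and negative class and $C(FP)$, $C(FN)$ are the error costs. With the Belgian estimates quoted below (an FP cost 30 times the FN cost and negative-to-positive prior odds of 1 to 95), the right-hand side evaluates to the slope of 30/95 used in the text.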

[Figure 8 ROC curve for the different classifiers: TP rate (70 to 90) plotted against FP rate (15 to 45); plotted sets: Fayyad (entropy), error-based, cost-based with C5.0 cost 1, cost-based with higher C5.0 cost, and not discretized.]

Choosing the optimal (minimal misclassification cost) classifier requires knowledge of the error costs and the class distributions. In Belgium, the FP error cost is estimated to be 30 to 50 times higher than the FN error cost, and the prior probability (the true distribution) of negative versus positive classes is estimated to be approximately 1 in 95. With this information a set of iso-performance lines [Provost & Fawcett 1997] with a slope of 30/95 can be constructed. On an iso-performance line, all classifiers corresponding to points on the line have the same expected cost, and the slope of the line depends on the a priori cost and class frequency distributions. This yields an instrument to choose the optimal classifier from the given sets of classifiers. If only the single best classifier is to be chosen, then under the known cost and class frequency distributions the error-based classifier (indicated by a circle) slightly outperforms the cost-based discretization method with C5.0 cost 1 (indicated by a right triangle), as can be seen in figure 8. However, if we use the average performance over the entire performance space as the decision criterion, cost-sensitive discretization with low C5.0 costs performs best. Altogether, the same figure shows that discretizing numeric attributes prior to induction is always better than discretizing while inducing the C5.0 classifier.

4.3.3 Overfitting

By using a discretization method, better estimated accuracies are obtained with our dataset because overfitting is reduced, as can clearly be seen in tables 3 and 4. When the C5.0 cost functionality is used, this observation is even strengthened: increasing the C5.0 cost parameter results in increasing resubstitution error rates, as illustrated in table 2. The overfitting pattern is similar to that of the false negative error rates shown in section 4.3.1. From these observations it can also be seen that cost-sensitive discretization is less robust than entropy-based discretization, but more robust than C5.0 without discretization. So, cost-sensitive discretization is better able to lower the false positive error rate but is more sensitive to overfitting than entropy-based discretization.

Table 2 Resubstitution error rate in absolute percentages

C5.0 cost parameter              1      2      3      4      5      6      7      8
Not discretized                  7.42   4.99   7.65   8.13   9.15   8.03  10.11  11.44
Fayyad                           7.14   6.71   7.42  10.15  10.69  12.33  14.19  14.49
Average of all cost-sensitive    8.27   7.52   8.39  10.04  11.33  12.52  13.61  15.11

Table 3 Estimated accuracy in absolute percentages (repeated cross-validation)

C5.0 cost parameter              1      2      3      4      5      6      7      8
Not discretized                 80.63  79.96  78.67  76.81  76.37  78.10  77.03  76.63
Fayyad                          81.91  80.98  81.88  80.09  79.86  79.14  77.81  77.68
Average of all cost-sensitive   82.57  80.73  79.08  78.58  77.73  77.00  76.51  75.48

Table 4 Overfitting in absolute percentages

C5.0 cost parameter              1      2      3      4      5      6      7      8
Not discretized                 11.96  15.06  13.68  15.07  14.48  13.88  12.87  11.93
Fayyad                          10.95  12.31  10.69   9.77   9.45   8.53   8.00   7.83
Average of all cost-sensitive    9.16  11.75  12.54  11.39  10.94  10.48   9.88   9.42

5 Conclusion

The concept of misclassification costs is an important contribution to the work on error-based discretization, because in many real-world problems the costs of making different kinds of mistakes differ, whereas error-based discretization treats false positive and false negative classifications equally. A discretization method that is cost-sensitive has been implemented, tested and compared with a well-known discretization method on an authentic financial dataset. From an algorithmic point of view, it has been shown that an important deficiency of error-based discretization methods can be solved by introducing costs, i.e. two adjacent intervals with different class labels can be generated even when in both intervals a particular class prevails. From the application point of view, we can conclude that using a discretization method is recommendable: C5.0 overfits the financial dataset, and it is easier to reduce this overfitting by prior discretization than by tuning the C5.0 pruning parameter. Choosing the optimal discretization method is more difficult. Firstly, the results are only valid for this small dataset. Secondly, depending on the evaluation procedure and the distributions used, different choices are possible. The three methods that have been considered are all potentially optimal, but cost-sensitive discretization of numeric attributes has proven to be worth considering in future research.


References

Barber C., Dobkin D., and Huhdanpaa H. (1993). The quickhull algorithm for convex hull. Technical Report GCG53, University of Minnesota.

Catlett J. (1991). On changing continuous attributes into ordered discrete attributes. In Proceedings of the Fifth European Working Session on Learning, 164-178. Berlin: Springer-Verlag.

Chmielewski M. R., and Grzymala-Busse J. W. (1994). Global discretization of continuous attributes as preprocessing for machine learning. In Third International Workshop on Rough Sets and Soft Computing, 294-301.

Dougherty J., Kohavi R., and Sahami M. (1995). Supervised and unsupervised discretization of continuous features. In Machine Learning: Proceedings of the Twelfth Int. Conference, 194-202. Morgan Kaufmann.

Elomaa T., and Rousu J. (1996). Finding optimal multi-splits for numerical attributes in decision tree learning. Technical Report NC-TR-96-041, University of Helsinki.

Fayyad U., and Irani K. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8, 87-102.

Fayyad U., and Irani K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the Thirteenth Int. Joint Conference on Artificial Intelligence, 1022-1027. Morgan Kaufmann.

Fulton T., Kasif S., and Salzberg S. (1995). Efficient algorithms for finding multi-way splits for decision trees. In Proceedings of the Twelfth Int. Conference on Machine Learning, 244-251. Morgan Kaufmann.

Holte R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11, 63-90.

Kerber R. (1992). ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth Nat. Conference on Artificial Intelligence, 123-128. MIT Press.

Kohavi R., and Sahami M. (1996). Error-based and entropy-based discretization of continuous features. In Proceedings of the Second Int. Conference on Knowledge Discovery and Data Mining, 114-119. AAAI Press.

Maas W. (1994). Efficient agnostic PAC-learning with simple hypotheses. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, 67-75. ACM Press.

Pfahringer B. (1995). Compression-based discretization of continuous attributes. In Proceedings of the Twelfth Int. Conference on Machine Learning. Morgan Kaufmann.


Provost F., and Fawcett T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the Third Int. Conference on Knowledge Discovery and Data Mining, 43-48. AAAI Press.

Quinlan J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, California.

Van de Merckt T. (1993). Decision trees in numerical attribute spaces. In Proceedings of the Thirteenth Int. Joint Conference on Artificial Intelligence, 1016-1021. Morgan Kaufmann.
