A Global Optimal Algorithm for Class-Dependent Discretization of Continuous Data

Lili Liu∗, Andrew K. C. Wong∗, and Yang Wang†

November 19, 2003

Abstract

This paper presents a new method to convert continuous variables into discrete variables for inductive machine learning. The method can be applied to pattern classification problems in machine learning and data mining. The discretization process is formulated as an optimization problem. We first use the normalized mutual information, which measures the interdependence between the class labels and the variable to be discretized, as the objective function, and then use fractional programming (iterative dynamic programming) to find its optimum. Unlike the majority of class-dependent discretization methods in the literature, which only find a local optimum of the objective function, the proposed method, OCDD (Optimal Class-Dependent Discretization), finds the global optimum. The experimental results demonstrate that this algorithm is very effective for classification when coupled with popular learning systems such as C4.5 decision trees and the Naive-Bayes classifier. It can be used to discretize continuous variables for many existing inductive learning systems.

1 Introduction

In machine learning and data mining research, inductive learning systems are widely used to acquire classification knowledge from a set of given samples. Classification rules and/or models are generated from these pre-labelled samples. For historical reasons, most classification algorithms in machine learning can only be applied to nominal data; they cannot effectively deal with continuous attributes [3, 14] directly. In practice, however, a large portion of real-world data sets contains continuous data and/or data of mixed types (continuous, discrete, ordinal or nominal). To apply inductive learning systems to such data, the continuous variables need to be discretized. Recently, researchers have also found that even if some systems are explicitly designed for

∗ PAMI Lab, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada ({lililiu, akcwong}@pami.uwaterloo.ca)
† Pattern Discovery Software Systems, Ltd., 550 Parkside Drive, Unit B9, Waterloo, Ontario N2L 5V4, Canada ([email protected])


continuous attributes, they can attain a higher accuracy when the data are given appropriate discrete values. Hence, the limitation of most inductive learning algorithms can be overcome by discretizing the continuous attributes appropriately before feeding the data into the learning systems [3, 6, 7, 23]. Discretization is the process of partitioning the value space of a continuous attribute into a finite number of intervals and attaching a nominal value to each of them. Each interval can be considered an event in a discrete event space. After discretization, we can treat both continuous and discrete data uniformly as events in appropriate discrete event spaces. In this paper we describe a new technique for discretizing the values of continuous attributes. It is based on an information measure reflecting the interdependence between the continuous variable and the class membership, coupled with a global optimization algorithm. Traditionally, two important decisions have to be made in the process of partitioning continuous data: the number of intervals and the widths of the intervals. They must either be determined by the discretization algorithm or provided by the user. Many partition algorithms require the user to supply the appropriate number of intervals; the widths of the intervals can then be determined by the boundaries of the discretized intervals. A good algorithm should require as few inputs from the user as possible. In a specific classification problem, the available class information can provide crucial help in the discretization process. Several class-dependent discretization methods have been proposed [4, 29, 22, 20, 17]. Most of them can automatically determine the number of intervals and the interval boundaries. Nevertheless, some problems remain: usually, either the class-dependent objective function is too simple to effectively utilize the class information, or no effective global optimization algorithm is known for the more complex objective functions encountered in real-world situations. In the proposed method, the discretization process is viewed as the partitioning of the range space of a continuous random variable into a number of ordered, adjacent, disjoint discrete intervals with a certain probability distribution. The expected mutual information I(C : A) between the class (C) and the variable (A), which measures the interdependence between the class and that variable, is the objective function for the discretization. We then use fractional programming (iterative dynamic programming) to find a global optimum of the expected mutual information. In addition to global optimization, another advantage of our method is that it can efficiently partition bimodal or multi-modal continuous data, a problem that has not been well solved by other methods. Bimodal and multi-modal data refer to data whose distributions have, respectively, two or more separate and distinct peaks, each of which may correspond to a high-frequency subclass. To evaluate the performance of the proposed discretization method, we feed data discretized by our method into two well-known inductive learning systems, C4.5 decision trees [25, 24, 5] and the Naive-Bayes classifier [15], and compare the classification accuracies of our method with those yielded by other discretization schemes. The experimental results demonstrate that our algorithm is more effective. The rest of this paper is organized as follows. After this introduction, the second section

reviews some popular discretization methods developed in the machine learning community. The third section presents the proposed class-dependent discretization scheme and the iterative dynamic programming algorithm for optimal partitioning. The fourth section introduces methods to deal with real-world data containing noise. The fifth section briefly introduces the inductive learning systems that we use to evaluate the discretization performance. The sixth section presents the experimental results on a set of synthetic data as well as on real-world data: the discretization results of different methods are fed into the learning systems, and the classification accuracies are compared and discussed. The last section concludes by summarizing the advantages and disadvantages of the proposed method.

2 Related Works in Discretization of Continuous Data

As most machine learning algorithms focus on nominal data, proper discretization methods, which transform the domain of a continuous variable into a finite alphabet, can significantly improve the speed of the inductive learning process and avoid over-fitting. Discretization methods have been used in clustering and classifying continuous and mixed-mode data since as early as the late 1970s and 1980s [31, 28, 30]. The literature on discretization is rich. We can view these methods from three different perspectives: 1) global versus local, 2) supervised versus unsupervised, and 3) static versus dynamic [6].

2.1 Global vs. Local

In general, local methods produce intervals by partitioning one subspace (one dimension) of the instance space and make the partition decision based on partial information. For example, Hierarchical Maximum Entropy [4], C4.5 decision trees [10], and VQ (Vector Quantization) [16] discretizers are local methods. The vector quantization algorithm attempts to split an N-dimensional continuous space into a Voronoi tessellation and then represents the set of points in each region by the region into which they fall [6]. The C4.5 algorithm is a well-known classifier that can also be used as a discretizer; its versatility with respect to multiple data types makes it easy to construct trees in selected subspaces for problems with binary, discrete, continuous or categorical features. Local discretization methods start the search for the interval boundaries at a coarse, local level and then refine the boundaries step by step; the results are locally optimal partitions. In contrast, global discretization methods produce a partition over the entire continuous instance space. Two typical examples are the 1R (One Rule) [9] and ChiMerge [14] discretizers. The ChiMerge method is based on a statistically justified heuristic: the algorithm initially places each observed value into its own interval and then uses the χ² test to determine when adjacent intervals should be merged; the χ² threshold controls the extent of the merging process [15]. The 1R method is a simple classifier that produces a single rule, known as a one-rule. With this algorithm

it is easy to get reasonable accuracy on many tasks by simply looking at one attribute [6]. Holte suggested using 1R to test learners on data sets that do not contain complex relationships [9]. The 1R method will miss some relationships and is therefore not a good discretization method for problems with complex relationships.

2.2 Supervised vs. Unsupervised

Supervised and unsupervised methods are two common categories of discretization algorithms in the machine learning community. Unsupervised (class-blind, or class-independent) methods simply apply a prescribed scheme to partition the continuous values without making use of the attribute-class information, whereas supervised (class-aware, or class-dependent) methods do take the attribute-class information into account. Theoretically, guided by the class information, supervised methods should automatically determine the best number of intervals of each given continuous attribute for classification.

2.2.1 Unsupervised Methods

Examples of unsupervised methods are Equal-Width, Equal-Frequency [4], K-means [12] and Maximal Entropy [31, 30] partitioning. The simplest discretization method, Equal-Width, merely divides the range of observed values of a variable into k equal-sized bins, where k is a user-supplied parameter. A related method is Equal-Frequency: given m examples, it divides a continuous variable into k bins, where each bin contains m/k attribute values. A typical problem of this group of methods is that it is difficult to determine how many intervals are best for a given attribute; in practice, the user has to employ some kind of heuristic to determine the number of intervals.
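As a concrete illustration of these two unsupervised schemes, the sketch below bins a list of values with a user-supplied k; it is only a minimal sketch, and the function names are ours rather than any standard implementation.

```python
import numpy as np

def equal_width_bins(values, k):
    """Split the observed range into k equal-sized bins; returns bin indices 0..k-1."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), k + 1)
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)

def equal_frequency_bins(values, k):
    """Put roughly len(values)/k samples into each bin; returns bin indices 0..k-1."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, k + 1))[1:-1]
    return np.digitize(values, edges)
```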

2.2.2 Supervised Methods

Examples of this group of methods are CADD [13], Zeta [8], the Paterson–Niblett algorithm [22], ChiMerge [14], Chi2 [20], CAIM [17] and 1R [9]. A successful supervised discretization algorithm should be able to find the minimal number of discrete intervals and, at the same time, should not weaken the interdependence between the attribute values and the class label. CADD discretizes data by heuristically maximizing the interdependence between the class and the continuous-valued attribute [13]. Zeta is a measure of association between nominal variables based on the minimization of the error rate when each value of the independent variable must predict a different value of the dependent variable [8]; this measure is only useful for variables with a small number of values. The difference between CAIM and CADD is that CAIM uses a different objective function to capture the dependency relationship between the class labels and the continuous-valued attribute while keeping the number of discrete intervals as small as possible [17].


Unsupervised data discretization does not always attain the optimal interdependence between the classes and the attributes of the learning examples. Most supervised methods, such as CADD [13] and CAIM [17], cannot efficiently find the global optimum of the objective function that captures the class-attribute dependence; they often rely on heuristics and attain only a local optimum of the objective function.

2.3 Static vs. Dynamic

A static method [26] carries out one discretization pass of the data for each feature separately, once the maximum number of intervals is given. These algorithms can also be viewed as merging N adjacent intervals at a time until a certain threshold is reached. In fact, almost all of the discretization methods mentioned above are static. Static discretization may destroy the complex interactions among multiple variables. A dynamic method discretizes one variable based not only on the information of that variable but also on its interaction with the other variables in the data set, exploiting higher-order relationships; therefore, it can produce better partitions. Bay [2] proposed a multivariate discretization method for set mining. It discretizes one attribute by considering the effects of all attributes in the data set: two intervals should be merged into one if the sample points falling into them have similar multivariate distributions. The advantage of this method is that hidden complex patterns are not destroyed by the discretization process.

3 A Global Optimum Class-Dependent Discretization Scheme

3.1 Class-Attribute Dependence Measure

Our method pertains to supervised learning and global optimization. It uses the class-attribute dependency information as the criterion for optimal discretization. To better present this method, we first introduce some basic concepts of our approach. Given a classification problem, suppose that there are M′ training instances, each of which has been preclassified into one of K classes c_k (k = 1, ..., K). Let C_k denote the set of instances with class label c_k. Assume that each training sample is described by L attributes A_l, l = 1, ..., L. Without loss of generality, assume that all attributes A_l are continuous. We define the interval [a_l, b_l] as the value domain of attribute A_l (1 ≤ l ≤ L). For notational simplicity we use A to represent any attribute A_l and [a, b] for its domain (a could be negative infinity and b positive infinity, in which case the domain is written (a, b)). Let A^{ψ_R} denote a partition of attribute A into R intervals, where ψ_R is a sequence (e_0, e_1, ..., e_{R-1}, e_R) such that a = e_0 < e_1 < ... < e_R = b; these are the boundaries of the R intervals. After discretization, a continuous attribute can be treated as a discrete random variable. The class label of each instance is also treated as an outcome of the class random variable. We can then obtain a two-dimensional quanta matrix as shown in Table 1.

Table 1: Contingency table between the classes and the discretization intervals

                Interval marked by its upper bound e_r
class     e_1     e_2     ...    e_r     ...    e_R     total
c_1       q_11    q_12    ...    q_1r    ...    q_1R    q_1+
...       ...     ...     ...    ...     ...    ...     ...
c_k       q_k1    q_k2    ...    q_kr    ...    q_kR    q_k+
...       ...     ...     ...    ...     ...    ...     ...
c_K       q_K1    q_K2    ...    q_Kr    ...    q_KR    q_K+
total     q_+1    q_+2    ...    q_+r    ...    q_+R    M′
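To make the construction of Table 1 concrete, the short sketch below assembles the K × R count matrix from a sample of attribute values, their class labels and a set of interval boundaries; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def quanta_matrix(values, labels, boundaries, n_classes):
    """Build the K x R count matrix of Table 1.

    `boundaries` are the upper interval bounds e_1 < ... < e_R, with e_R no smaller
    than the largest value; interval r collects samples with e_{r-1} < x <= e_r.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # searchsorted with side="left" returns the first r with x <= e_r,
    # which matches the half-open intervals (e_{r-1}, e_r].
    interval = np.searchsorted(boundaries, values, side="left")
    q = np.zeros((n_classes, len(boundaries)), dtype=int)
    for k, r in zip(labels, interval):
        q[k, r] += 1
    return q
```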

In Table 1, the element q_{kr} denotes the number of observed samples belonging to class c_k whose attribute value falls within the interval between e_{r-1} and e_r. From this table, we can estimate the joint probability P_{kr} that a sample belongs to class c_k and has an attribute value falling within the interval demarcated by the boundary pair (e_{r-1}, e_r]. Let x denote an instance of the data set, x_C its class label, and x_A its value on attribute A. Then we have

$$P_{kr} = P(x \mid x_C = c_k,\; e_{r-1} < x_A \le e_r) = \frac{q_{kr}}{M'} \tag{3.1}$$

where M′ is the total number of observed samples. We can also estimate the marginal probabilities of class c_k and of the r-th interval of attribute A respectively as

$$P_{k+} = P(x \mid x_C = c_k) = \frac{q_{k+}}{M'} \tag{3.2}$$

$$P_{+r} = P(x \mid x_A \in (e_{r-1}, e_r]) = \frac{q_{+r}}{M'} \tag{3.3}$$

where $q_{k+} = \sum_{r=1}^{R} q_{kr}$ and $q_{+r} = \sum_{k=1}^{K} q_{kr}$. With these notations, we can define the following terms:

(For a detailed exposition, please refer to [1].)

Definition 3.1 The class-attribute (CA) mutual information between the class label C and the attribute A (with intervals as outcomes) is defined as

$$I(C : A) = \sum_{k=1}^{K}\sum_{r=1}^{R} P_{kr}\log\frac{P_{kr}}{P_{k+}P_{+r}} \tag{3.4}$$

I(C : A) is a measure of interdependence (or, more precisely, a measure of the expected deviation from independence) between the class label C and the attribute A. I(C : A) is asymptotically χ²-distributed, i.e.

$$I(C : A) \sim \frac{1}{2M'}\,\chi^2_{(R-1)(K-1)} \tag{3.5}$$

with (R − 1)(K − 1) degrees of freedom. With I(C : A), we can test whether C and A are statistically interdependent. To make the discretized attribute maximally useful for classification, we should maximize the class-attribute dependence during the discretization process. The mutual information initially appears to be a good candidate for such a discretization criterion; however, its value increases with the number of intervals. In fact, the expected CA mutual information is at its maximum before any discretization, and it decreases as the number of intervals is reduced. Therefore, we must add a regularizer that normalizes I(C : A) with respect to the degrees of freedom. This yields the interdependence redundancy, a normalized measure of dependence between random variables used in information theory.

Definition 3.2 The joint entropy of the class label C and the attribute A is defined as

$$H(C, A) = -\sum_{k=1}^{K}\sum_{r=1}^{R} P_{kr}\log P_{kr} \tag{3.6}$$

With the class-attribute mutual information and the class-attribute joint entropy in hand, we can normalize the CA mutual information as follows.

Definition 3.3 The interdependence redundancy measure R between the class label C and attribute A is defined as

$$R(C : A) = \frac{I(C : A)}{H(C, A)} \tag{3.7}$$

Note that both I(C : A) and H(C, A) are non-negative; hence, R(C : A) is non-negative. Its value depends not only on the number of class labels and attribute outcomes but also on the mutual information between the class and the attribute. According to [1], R(C : A) reflects the degree of deviation from independence between C and A. If R(C : A) = 1, the attribute and the classes are strictly dependent; if R(C : A) = 0, they are statistically independent; and if 0 < R(C : A) < 1, classes C and attribute A are partially dependent. The definition of R shows that it is independent of the composition of the attribute and class variables. This implies that

the number of attribute values can be reduced without destroying the interdependence relationship between the class outcomes and the attribute values. Thus, discretization can be regarded as a process that removes the redundancy introduced by having too many possible attribute values; at the same time, the discretization process should minimize the loss of correlation between the class labels and the attribute. These properties render the interdependence redundancy measure an ideal candidate for a class-dependent discretization criterion, and it is used in our discretization method as the optimization criterion. Therefore, the discretization problem can be formalized as finding the partition of attribute A such that the class-attribute interdependence redundancy measure R(C : A) is maximized. Let Ψ represent the set of all possible finite partition schemes. Given a class-attribute pair, we need to find a ψ_max ∈ Ψ such that

$$\forall\,\psi \in \Psi,\quad R(C : A^{\psi_{\max}}) \ge R(C : A^{\psi}) \tag{3.8}$$
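As a small numerical companion to Definitions 3.1–3.3 and the objective in (3.8), the sketch below computes I(C : A), H(C, A) and R(C : A) from a K × R quanta matrix such as the one in Table 1; it is a plain NumPy sketch under our own naming, not the authors' code.

```python
import numpy as np

def interdependence_redundancy(quanta):
    """R(C:A) = I(C:A) / H(C,A) (Eqs. 3.4, 3.6 and 3.7) for a K x R quanta matrix."""
    q = np.asarray(quanta, dtype=float)
    P = q / q.sum()                       # joint probabilities P_kr
    Pk = P.sum(axis=1, keepdims=True)     # class marginals P_k+
    Pr = P.sum(axis=0, keepdims=True)     # interval marginals P_+r
    nz = P > 0                            # 0*log(0) is treated as 0
    I = np.sum(P[nz] * np.log(P[nz] / (Pk @ Pr)[nz]))   # Eq. (3.4)
    H = -np.sum(P[nz] * np.log(P[nz]))                   # Eq. (3.6)
    return I / H if H > 0 else 0.0                       # Eq. (3.7)
```

Maximizing this quantity over all partitions ψ is exactly the problem stated in (3.8).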

3.2 Iterative Dynamic Programming

Before proceeding to our proposed optimization algorithm, we review the relevant background on dynamic programming and fractional programming, mostly based on Moshe Sniedovich's work [27]. Fractional programming is a branch of nonlinear optimization involving ratio functions. The problem can be stated as follows. Let

$$r(z) = \frac{\nu(z)}{\omega(z)}$$

where ν(·) and ω(·) are real-valued functions on a certain set Z and ω(z) > 0 for all z ∈ Z. Then

$$c = \max_{z \in Z} r(z) \tag{3.9}$$

Let Z* denote the set of optimal solutions to this problem. Assuming that Z* is not empty, this problem can be solved as a parametric problem formulated as follows. Let

$$r_\lambda(z) = \nu(z) - \lambda\,\omega(z) \tag{3.10}$$

Then

$$\alpha(\lambda) = \max_{z \in Z} r_\lambda(z) \tag{3.11}$$

where λ is a real number. Let Z*(λ) denote the set of optimal solutions for a given λ, and assume that the problem has at least one optimal solution. The problem can be solved using Dinkelbach's algorithm, which proceeds as follows:

1. Set k = 1, select some z ∈ Z, and let z⁽¹⁾ = z and λ⁽¹⁾ = r(z⁽¹⁾).
2. Solve the problem α(λ⁽ᵏ⁾) and select some z ∈ Z*(λ⁽ᵏ⁾).
3. If α(λ⁽ᵏ⁾) = 0, set z⁰ = z and λ⁰ = r(z⁰) = ν(z⁰)/ω(z⁰), and stop; z⁰ is the optimal solution. Otherwise, set z⁽ᵏ⁺¹⁾ = z and λ⁽ᵏ⁺¹⁾ = r(z⁽ᵏ⁺¹⁾).
4. Set k = k + 1 and go to step 2.

Termination of this algorithm in a finite number of steps is guaranteed by the following theorem, proved by Dinkelbach.

Theorem 3.1 If ω(z) > 0 for all z ∈ Z, then either Dinkelbach's algorithm terminates (in which case λ⁰ = c and z⁰ ∈ Z*), or the sequence {λ⁽ᵏ⁾} that it generates converges superlinearly to c. Termination is assured if Z is finite.

With the above theoretical background, we propose a new globally optimal algorithm for class-dependent discretization of continuous data, described below. The algorithm, OCDD (Optimal Class-Dependent Discretization), has two important components. One attains the maximal value of the objective function by dynamic programming, optimizing the class-attribute dependence for a fixed parameter. The other is an iterative process that uses the first component to drive towards the final globally optimal solution. The process is described in detail in the following sections.
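Before specializing the procedure to discretization, the following minimal sketch shows the generic Dinkelbach iteration over a finite candidate set; `candidates`, `nu` and `omega` stand in for the abstract Z, ν(·) and ω(·) above, and in OCDD the inner maximization is carried out by dynamic programming (Section 3.4) rather than by this brute-force scan.

```python
def dinkelbach(candidates, nu, omega, tol=1e-12):
    """Generic Dinkelbach iteration for maximizing nu(z)/omega(z) over a finite set Z."""
    candidates = list(candidates)
    z = candidates[0]                        # step 1: arbitrary starting point
    lam = nu(z) / omega(z)                   # lambda^(1) = r(z^(1))
    while True:
        # step 2: maximize the parametric objective r_lambda(z) = nu(z) - lam*omega(z)
        z = max(candidates, key=lambda c: nu(c) - lam * omega(c))
        alpha = nu(z) - lam * omega(z)
        if abs(alpha) <= tol:                # step 3: alpha(lambda) = 0, optimum reached
            return z, lam
        lam = nu(z) / omega(z)               # otherwise update lambda and repeat
```

In our setting, ν corresponds to I(C : A^ψ), ω to H(C, A^ψ), and Z to the finite set of partitions ψ.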

3.3 Global Optimal Class-Dependent Discretization

In our attribute partitioning method, the objective is to maximize R(C : A) = I(C : A)/H(C, A). We come up with the following iterative algorithm.

Algorithm OCDD
1. Assume an arbitrary initial partition ψ of attribute A. This partition can be represented by a quanta matrix, from which we can calculate q_{k+}, q_{+r}, M′ and P_{kr}. Initialize u = I(C : A^ψ)/H(C, A^ψ).
2. Given u, use Algorithm A1 (shown in the next subsection) to calculate a new partition ψ′ such that I(C : A^{ψ′}) − u·H(C, A^{ψ′}) is maximized. This step is the key component of our algorithm.
3. Obtain a new value u′ = I(C : A^{ψ′})/H(C, A^{ψ′}).
4. Compare u and u′. If u = u′, then ψ′ is the optimal partition and we stop. Otherwise, set u = u′ and repeat steps 2 to 4.

Lemma 3.2 Algorithm OCDD always terminates.

Proof. Given a data set and an attribute A, the total number of data samples M′ of attribute A is finite; thus the number of possible partitions of A is finite. Moreover, for any partition, H(C, A) is positive. Therefore, based on Theorem 3.1, the Dinkelbach algorithm used here always terminates. Furthermore, Algorithm OCDD converges superlinearly to its optimal solution. □

3.4 Dynamic Programming Algorithm

Now we describe the dynamic programming Algorithm A1, which obtains the partition maximizing I(C : A) − u·H(C, A) for any given u ≥ 0. For a given attribute A, assume its training attribute values are x_1 ≤ x_2 ≤ ... ≤ x_{M′} and that it is partitioned into R intervals whose boundary values have subscripts i_1, i_2, ..., i_R (i.e., e_r = x_{i_r}, 1 ≤ r ≤ R, are the boundary values). Let X_{ij} (i ≤ j) = {x_i, ..., x_j}, where x_i and x_j are the lower and upper bounds of the set X_{ij}. Let F(i_1, i_2, ..., i_R) denote the value of I(C : A) − u·H(C, A) for a given partition (i_1, i_2, ..., i_R). We have

$$
\begin{aligned}
F(i_1, i_2, \ldots, i_R) &= \sum_{r=1}^{R}\sum_{k=1}^{K}\left(P_{kr}\log\frac{P_{kr}}{P_{k+}P_{+r}} + u\,P_{kr}\log P_{kr}\right)\\
&= \sum_{r=1}^{R}\sum_{k=1}^{K}\left(\frac{q_{kr}}{M'}\log\frac{q_{kr}/M'}{(q_{k+}/M')(q_{+r}/M')} + u\,\frac{q_{kr}}{M'}\log\frac{q_{kr}}{M'}\right)\\
&= \sum_{r=1}^{R}\sum_{k=1}^{K}\left(\frac{q_{kr}}{M'}\log\frac{q_{kr}\,M'}{q_{k+}\,q_{+r}} + u\,\frac{q_{kr}}{M'}\log\frac{q_{kr}}{M'}\right)
\end{aligned} \tag{3.12}
$$

where q_{kr} = |X_{i_{r-1},i_r} ∩ C_k|, q_{k+} = |C_k| and q_{+r} = |X_{i_{r-1},i_r}|. Therefore, F(i_1, i_2, ..., i_R) can be regarded as the sum of R items, each of which is

$$
\sum_{k=1}^{K}\left(\frac{q_{kr}}{M'}\log\frac{q_{kr}\,M'}{q_{k+}\,q_{+r}} + u\,\frac{q_{kr}}{M'}\log\frac{q_{kr}}{M'}\right), \qquad 1 \le r \le R, \tag{3.13}
$$

and is fixed once the lower and upper bounds x_{i_{r-1}} and x_{i_r} of the r-th interval are given. If we fix the s-th boundary of the partition at x_m (i.e., i_s = m), then the partition of the first m attribute values is independent of the partition of the last M′ − m attribute values in terms of the value of F(i_1, i_2, ..., i_R). It is this observation that enables us to use a dynamic programming algorithm to optimize F(i_1, i_2, ..., i_R). Let g(m, s) represent the sum of the first s (s ≤ R) items of F(i_1, i_2, ..., i_R) given that x_m is the s-th partition boundary, i.e.


$$
g(m, s) = \sum_{r=1}^{s}\sum_{k=1}^{K}\left(\frac{q_{kr}}{M'}\log\frac{q_{kr}\,M'}{q_{k+}\,q_{+r}} + u\,\frac{q_{kr}}{M'}\log\frac{q_{kr}}{M'}\right) \tag{3.14}
$$

Let T(m, s) be the set of all possible partition schemes in which the first m continuous values are partitioned into s intervals; we write T(m, s) = {(i_1, ..., i_s, ..., i_R) | i_s = m}. Let f(m, s) denote the optimal value of g(m, s) among all possible partitions in T(m, s), i.e.,

$$
f(m, s) = \max_{(i_1, i_2, \ldots, i_R)\in T(m, s)} g(m, s)
$$

For an attribute with a maximum of M′ possible values, as given in the data set, it can be partitioned into at most M′ intervals. Then

$$
\max\bigl(I(C : A) - u\,H(C, A)\bigr) = \max_{1 \le R \le M'} f(M', R) \tag{3.15}
$$

In fact, f(m, s) gives the optimal partition of the first m attribute values X_m = {x_1, x_2, ..., x_m} given that they are partitioned into s intervals. That is,

$$
\begin{aligned}
f(m, s) &= \max_{(i_1, i_2, \ldots, i_R)\in T(m, s)} \sum_{r=1}^{s}\sum_{k=1}^{K}\left(\frac{q_{kr}}{M'}\log\frac{q_{kr}\,M'}{q_{k+}\,q_{+r}} + u\,\frac{q_{kr}}{M'}\log\frac{q_{kr}}{M'}\right)\\
&= \max_{(i_1, i_2, \ldots, i_R)\in T(m, s)} \sum_{j=1}^{s}\sum_{k=1}^{K}\left(\frac{|X_{i_{j-1},i_j}\cap C_k|}{M'}\log\frac{|X_{i_{j-1},i_j}\cap C_k|\,M'}{|X_{i_{j-1},i_j}|\,|C_k|} + u\,\frac{|X_{i_{j-1},i_j}\cap C_k|}{M'}\log\frac{|X_{i_{j-1},i_j}\cap C_k|}{M'}\right)
\end{aligned} \tag{3.16}
$$

In order to apply the dynamic programming algorithm to calculate f(m, s), we need a recursive equation for f(m, s). Assume the s-th interval of the partition contains t attribute values; then we obtain the following recursive formula based on Equation 3.16:

$$
\begin{aligned}
f(m, s) &= \max_{(i_1, i_2, \ldots, i_R)\in T(m, s)} \sum_{j=1}^{s}\sum_{k=1}^{K}\left(\frac{|X_{i_{j-1},i_j}\cap C_k|}{M'}\log\frac{|X_{i_{j-1},i_j}\cap C_k|\,M'}{|X_{i_{j-1},i_j}|\,|C_k|} + u\,\frac{|X_{i_{j-1},i_j}\cap C_k|}{M'}\log\frac{|X_{i_{j-1},i_j}\cap C_k|}{M'}\right)\\
&= \max_{1 \le t \le m-s+1}\left(f(m-t, s-1) + \sum_{k=1}^{K}\left(\frac{|X_{m-t+1,m}\cap C_k|}{M'}\log\frac{|X_{m-t+1,m}\cap C_k|\,M'}{|X_{m-t+1,m}|\,|C_k|} + u\,\frac{|X_{m-t+1,m}\cap C_k|}{M'}\log\frac{|X_{m-t+1,m}\cap C_k|}{M'}\right)\right)
\end{aligned} \tag{3.17}
$$

We also have the following initial conditions:

$$
f(m, 1) = \sum_{k=1}^{K}\left(\frac{|X_{1,m}\cap C_k|}{M'}\log\frac{|X_{1,m}\cap C_k|\,M'}{|X_{1,m}|\,|C_k|} + u\,\frac{|X_{1,m}\cap C_k|}{M'}\log\frac{|X_{1,m}\cap C_k|}{M'}\right) \qquad (\forall\,1 \le m \le M') \tag{3.18}
$$

$$
f(1, n) = \begin{cases}\displaystyle\sum_{k=1}^{K}\left(\frac{|X_{1,1}\cap C_k|}{M'}\log\frac{|X_{1,1}\cap C_k|\,M'}{|C_k|} + u\,\frac{|X_{1,1}\cap C_k|}{M'}\log\frac{|X_{1,1}\cap C_k|}{M'}\right) & n = 1\\[2ex] -\infty & \text{otherwise}\end{cases} \tag{3.19}
$$

Based on the formulas above, we can formulate the following dynamic programming algorithm.

Algorithm A1:
1. Create a dynamic programming table of size U × U, where U is the number of unique values of attribute A (in practice, the number of intervals after partitioning is far smaller than U). The element in the m-th row and the s-th column represents the value of f(m, s).
2. Initialize the elements in the first column and the first row of the table according to Equation 3.18 and Equation 3.19.
3. Calculate all the remaining elements of the table using the recursive formula for f(m, s) (Equation 3.17).
4. Find the maximum value in the last row. Assuming the maximum value lies in the s*-th column, the optimal partition consists of s* intervals. Trace back through the table to obtain the optimal interval boundaries.

With this, we have a superlinearly convergent algorithm based on Theorem 3.1 and Lemma 3.2. Theoretically, this method obtains an optimal partitioning. However, there are a number of practical issues to consider in real-world applications. In the following sections, we discuss these issues and evaluate the performance of the algorithm.
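For illustration, a compact Python sketch of Algorithm A1 is given below. It assumes the samples are already sorted by attribute value and represented only by integer class labels, allows a boundary at every position (ignoring ties and the minimum-interval-size heuristic of Section 4), and brute-forces the full table, so it is a readable sketch rather than an efficient implementation; all names are ours.

```python
import math

def best_partition(labels, u, n_classes):
    """Algorithm A1 (sketch): for a fixed u, maximize I(C:A) - u*H(C,A) by dynamic
    programming over class labels sorted by attribute value."""
    M = len(labels)
    class_tot = [labels.count(k) for k in range(n_classes)]
    prefix = [[0] * n_classes]                    # prefix[i][k] = #class-k among first i samples
    for y in labels:
        row = prefix[-1][:]
        row[y] += 1
        prefix.append(row)

    def interval_term(i, j):
        """Contribution of one interval covering samples i..j-1 (Eq. 3.13)."""
        size = j - i
        val = 0.0
        for k in range(n_classes):
            q = prefix[j][k] - prefix[i][k]
            if q:
                p = q / M
                val += p * math.log(q * M / (size * class_tot[k])) + u * p * math.log(p)
        return val

    NEG = float("-inf")
    f = [[NEG] * (M + 1) for _ in range(M + 1)]   # f[m][s] as in Eq. 3.17
    back = [[0] * (M + 1) for _ in range(M + 1)]
    for m in range(1, M + 1):
        f[m][1] = interval_term(0, m)             # initial condition, Eq. 3.18
        for s in range(2, m + 1):
            for prev in range(s - 1, m):          # last interval covers samples prev..m-1
                v = f[prev][s - 1] + interval_term(prev, m)
                if v > f[m][s]:
                    f[m][s], back[m][s] = v, prev
    s_best = max(range(1, M + 1), key=lambda s: f[M][s])   # Eq. 3.15
    bounds, m = [], M
    for s in range(s_best, 0, -1):                # trace the optimal boundaries back
        bounds.append(m)
        m = back[m][s]
    return f[M][s_best], sorted(bounds)
```

The OCDD loop of Section 3.3 would call this routine repeatedly, each time setting u to I/H of the partition just returned, until u stops changing.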

4 Methods to Reduce the Number of Intervals

Real-world data are seldom clean. Noise caused by measurement errors or the entry of incorrect attribute values often introduces small intervals in the discretization process; this is an inherent drawback of the class-dependence model. Because the objective function relies heavily on the relationship between class and attribute, the total number of partitioned intervals is sometimes far too large. Hence, the proposed model alone cannot deal with high-frequency noise, which yields too many intervals in the discretization results. In order to minimize the effect of noise on discretization, we should consider noise-suppressing techniques such as binning, clustering and regression. For example, using binning to handle noisy data, we could first sort the data and partition them into many small bins. Then

the data can be smoothed by bin means, by bin medians, or by bin boundaries. The smoothed data can then be discretized with the proposed method. In this paper, we propose the following heuristics to handle the noise and the problem of too many intervals.

4.1 Setting the Minimal Number of Data Points in Each Interval

In fact, if the number of data points contained in an interval is too small, the interval is insignificant, yet it would still affect the classification accuracy of the learning system. Hence, we could set a minimal width for the intervals. We solve this problem by stipulating the minimal number of data points contained in each interval. If we set this parameter too large, some useful intervals containing fewer data points than the parameter could be merged away after discretization. Conversely, if we ignore this parameter, then, owing to the characteristics of the objective function, the algorithm would spend much time discretizing very small intervals caused by noisy data. Usually, we set the minimal number of data points in each interval to two. This restriction would not alter the outcome of the algorithm, but it decreases the computational time because the number of possible partitions becomes smaller. As a matter of fact, when dealing with real-world data, we can always employ some prior knowledge to fix the minimal number of data points. For example, consider a data set containing information on the relationship among health (good, general, bad), age, degree, profession, and so on. If the age attribute is to be discretized, an interval that contains only one or two samples, or whose data all fall into the same age, is meaningless. Usually, we set the minimal number of samples for each interval to two to avoid such cases.

4.2 Determining the Optimal Number of Intervals

In real-world applications, the number of intervals to be partitioned for a continuous attribute is an important parameter besides the discretization criterion. According to information theory, the absolute mutual information is greatest when the number of intervals assumes the largest possible value. Hence, the most important thing is to select the maximum allowable number of intervals without violating the statistical assumptions for estimating the second-order probabilities needed for class-dependent discretization. Therefore, given the number of classes K, the number of intervals R for attribute A should be less than M′/(K · N); usually, N = 3 is preferred for estimation. After discretization, each interval is treated as one unique value, and the goal of discretization is to reduce the number of unique values. In addition, a large number of intervals will make the learning process slow and inefficient. Thus, a comprehensive and efficient discretization method must maximize the interdependence between class labels and attributes without significant loss of the class-attribute mutual dependence [1]. The mutual information I(C : A), which measures the interdependence between class C and attribute A, is used for a statistical test. At any intermediate discretization stage, we use the

following inequality to test the statistical significance of the class-attribute interdependence:

$$I(C : A) > \frac{1}{2M'}\,\chi^2_{(K-1)(R-1)} \tag{4.1}$$

By normalizing both sides of the above inequality by H(C, A), we get

$$R(C : A) \ge \frac{\chi^2_{(K-1)(R-1)}}{2M'\,H(C, A)} \tag{4.2}$$

If Equation 4.2 holds, the class and the attribute are statistically interdependent. We should eliminate any "redundant" intervals so that the number of intervals is minimized. Given an intermediate partition result with a boundary set and its associated quanta matrix, all neighboring pairs of intervals are analyzed, one pair at a time. In practice, we calculate the partial mutual information I(C : A) between two neighboring intervals r and r+1 by

$$I(C : A) = \sum_{i=1}^{K}\sum_{j=r}^{r+1} P_{ij}\log\frac{P_{ij}}{P_{i+}P_{+j}} \tag{4.3}$$

and the corresponding interdependence redundancy

$$R_{ij} = I(C : A)/H(C, A) = \left(\sum_{i=1}^{K}\sum_{j=r}^{r+1} P_{ij}\log\frac{P_{ij}}{P_{i+}P_{+j}}\right)\Big/\left(-\sum_{i=1}^{K}\sum_{j=r}^{r+1} P_{ij}\log P_{ij}\right) \tag{4.4}$$

The statistical test of Equation 4.2 can then be used to determine whether the frequency distribution over the two neighboring intervals and the class labels is significantly interdependent. Recall that 0 ≤ R_ij ≤ 1: if R_ij = 0, class and attribute are totally independent, and if R_ij = 1, they are totally dependent. If the result of the test indicates significance at a certain confidence level, the analysis moves on to the next pair of neighboring intervals. If the test fails, we conclude that the two intervals, kept separate, would not likely contribute to classification and can therefore be combined into a single interval.
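One possible reading of this merge test, assuming natural logarithms and the total sample count M′ used in Eq. (4.2), is sketched below with SciPy's chi-square quantile; the function, its arguments and the default significance level are our own choices, not the paper's.

```python
import numpy as np
from scipy.stats import chi2

def neighbours_interdependent(q_pair, n_total, alpha=0.05):
    """Test of Eq. (4.2) restricted to two neighbouring intervals.

    `q_pair` is the K x 2 slice of the quanta matrix for intervals r and r+1 and
    `n_total` is the overall sample count M'. Returns False when the test fails,
    i.e. when the two intervals may be merged into one.
    """
    q = np.asarray(q_pair, dtype=float)
    P = q / q.sum()
    Pk = P.sum(axis=1, keepdims=True)             # class marginals
    Pr = P.sum(axis=0, keepdims=True)             # interval marginals
    nz = P > 0
    I = np.sum(P[nz] * np.log(P[nz] / (Pk @ Pr)[nz]))     # Eq. (4.3)
    H = -np.sum(P[nz] * np.log(P[nz]))
    if H == 0:                                     # degenerate pair: nothing to keep apart
        return False
    R_ij = I / H                                   # Eq. (4.4)
    df = (q.shape[0] - 1) * (q.shape[1] - 1)       # (K-1)(R-1) with R = 2 intervals
    threshold = chi2.ppf(1 - alpha, df) / (2 * n_total * H)   # right-hand side of Eq. (4.2)
    return R_ij >= threshold
```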

4.3 Smoothing the Original Data before Partitioning

Before feeding the continuous data into our discretization algorithm, we use the following smoothing technique (Algorithm Noise Filtering) to filter noise.

1. Choose two parameters: a threshold t > 1 and a width w.
2. For any attribute value x_i, define a segment s_i centered at x_i with radius w, i.e., s_i = {x_{i−w}, ..., x_{i+w}}.
3. Find the class label c_max that occurs most frequently within the segment s_i, and let f_max denote its frequency. Then calculate the frequency f_i of the class label of x_i within the segment s_i.
4. If the ratio f_max/f_i is greater than the threshold t, change the class label of x_i to c_max.

The result of partitioning is sensitive to the values of the threshold t and the width w. The smaller the threshold, the more data are treated as noise and smoothed; after smoothing, the number of intervals can be smaller. Theoretically, the value of w should be related to the number of classes: the larger the number of classes, the greater w should be. To retain statistical significance, w cannot be too small. The selection of the threshold and the width is very important for this smoothing algorithm, as it directly affects the partitioning result. While a user can choose the values of these two parameters based on experience and domain knowledge, t can also be chosen using a probabilistic argument: given the width w and the number of classes K, if the probability of f_max/f_i > t is very small (say, less than five percent), then the class label of x_i can be regarded as noise. The parameter w subtly affects the interval-merging result and is not easy to set. Consider the case of an attribute with much noise that is likely to be discretized into many intervals if no merging or preprocessing is carried out. Generally, the bigger w is, the fewer intervals we obtain, but this rule is not always strictly obeyed: if w is large (say, more than 10% of the total number of attribute values), we might get more intervals than with a smaller w. Conversely, for some attributes with little noise, it is hard to determine the best w. If we set w to a small constant (say, two) or a big constant (say, 10% of the number of attribute values), the algorithm might generate more intervals than when no smoothing is applied at all. In some cases, w has little impact on the discretization results. There is also a trade-off in setting w: after setting w and discretizing the attribute, we obtain fewer intervals, but we may also lose some class-attribute information contained in the original data. Based on many experiments, we concluded that setting w to five is a good default for most data, i.e., we can expect better results without losing much class-attribute information. We fix 1.3 as the default value of t. In fact, the discretization results are not very sensitive to t; the outcome is unlikely to change much if t is altered by less than 0.2. These three methods do not conflict with one another. Instead, they can be combined for better results while requiring no major modifications of our discretization algorithm. At the same time, the speed of our discretization process can be increased by dozens of times.
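A direct transcription of the Noise Filtering steps, using the default values recommended above (w = 5, t = 1.3), might look as follows; it operates on the class labels of samples sorted by attribute value, and the names are ours.

```python
from collections import Counter

def smooth_labels(labels, w=5, t=1.3):
    """Algorithm Noise Filtering (sketch): relabel samples dominated by another class
    within a window of radius w when the frequency ratio exceeds the threshold t."""
    out = list(labels)
    n = len(labels)
    for i in range(n):
        seg = labels[max(0, i - w): i + w + 1]     # segment s_i centred at x_i
        counts = Counter(seg)
        c_max, f_max = counts.most_common(1)[0]    # most frequent class in s_i
        f_i = counts[labels[i]]                    # frequency of x_i's own class in s_i
        if f_max / f_i > t:                        # step 4: relabel x_i
            out[i] = c_max
    return out
```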

5 Inductive Learning Systems for Algorithm Evaluation

To evaluate the performance of our method, many different learning systems are available, such as ID3 [10], C4.5 [11], CN2 [5], and Naive-Bayes. In this paper, we test the effectiveness of different discretizers on two learning systems, C4.5 with both pre-pruning and post-pruning, and Naive-Bayes. We would like to investigate whether OCDD can significantly improve the performance of these two commonly used inductive learning methods.

C4.5. C4.5 [24, 25] is a well-known inductive learning algorithm first proposed by Quinlan. Given a set of training patterns with class labels, C4.5 constructs a decision tree using a top-down divide-and-conquer approach. The resulting decision tree can later be used to classify new examples. Each interior node of the tree denotes a single attribute; at each interior node, the corresponding attribute values are partitioned into two or more intervals according to an information-gain criterion. Any path from the tree root to an intermediate node or a leaf node represents one conjunctive condition. All training data under the subtree rooted at a node N satisfy the conjunctive condition represented by the path from the root to N. Each leaf node is assigned a class label for all training data that belong to that class and satisfy the conjunctive condition represented by the path from the root to that leaf. C4.5 treats all attribute values as discrete. For a large set of noisy or de facto continuous data, C4.5 will generate a very large, over-fitted tree; therefore, some pruning strategy must be used without significantly affecting the classification accuracy of the resulting tree. According to when the tree is pruned during the tree-construction procedure, the pruning strategies of C4.5 can be divided into two groups:

• pre-pruning: During the tree-construction process, a statistical test is used to measure the information gain when an attribute's values are partitioned into two or more intervals. The tree-growing process is stopped when no attribute is found to significantly increase the information gain.

• post-pruning: The pre-pruning strategy is in fact a local method; it often misses important information that cannot be detected locally. To solve this problem, the post-pruning strategy performs the pruning after the complete tree is constructed. Such a post-pruning strategy often demands a trade-off between the complexity of the decision tree and its observed classification accuracy.

Naive-Bayes. The Naive-Bayes classifier [18, 19] is one of the most successful known algorithms. It has been shown to be surprisingly accurate on many classification tasks even when the conditional independence assumption on which it is based is violated. Bayesian classification has become increasingly popular in recent years, in part due to active developments in the area; the simplest Bayesian classifier is the widely used Naive-Bayes classifier. Given an instance, the Naive-Bayes classifier computes the instance's probability under each class, based on the assumption that all attributes are mutually independent, and then assigns the instance to the class with the highest probability. An optimal classifier is obtained as long as both the

actual and the estimated distributions agree on the most-probable class. Most of the work on Naive-Bayes compares its performance to other classifiers on particular benchmark problems.
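To make the classification rule just described concrete, here is a minimal categorical Naive-Bayes with Laplace smoothing for already-discretized attributes (values coded as small non-negative integers); it is an illustrative sketch, not the MLC++ implementation used in the experiments, and all names are ours.

```python
import numpy as np

def naive_bayes_predict(train_X, train_y, test_X, n_classes, alpha=1.0):
    """Predict classes for test_X under the attribute-independence assumption."""
    train_X, train_y, test_X = map(np.asarray, (train_X, train_y, test_X))
    n_attrs = train_X.shape[1]
    log_prior = np.log(np.bincount(train_y, minlength=n_classes) + alpha) \
                - np.log(len(train_y) + alpha * n_classes)
    preds = []
    for x in test_X:
        scores = log_prior.copy()
        for c in range(n_classes):
            Xc = train_X[train_y == c]
            for a in range(n_attrs):
                n_vals = int(max(train_X[:, a].max(), x[a])) + 1   # codes assumed 0..n_vals-1
                count = np.sum(Xc[:, a] == x[a])
                scores[c] += np.log((count + alpha) / (len(Xc) + alpha * n_vals))
        preds.append(int(np.argmax(scores)))
    return preds
```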

6 Experiments and Discussion

6.1 Experiments with Synthetic Data

In this section, we attempt to understand how OCDD works by analyzing its performance on a set of synthetic data. We generate a data set containing 500 samples, preclassified into five classes. The attribute values are generated stochastically. Sorted in ascending order, the data are assigned class labels as shown in Table 2. Then, we randomly add 20% noise to perturb the class assignment, i.e., the class labels of 20% randomly chosen data points are changed randomly. An ideal partition would divide this data into five intervals with the correct boundaries shown in Table 2.

Table 2: The synthetic data

Data interval    Class assignment
0-77             1
78-250           2
251-274          3
275-312          4
313-500          5

Table 3: Discretization results of different methods on the synthetic data

Method             Partition boundaries
Equal-Freq         40 68 96 124 152 180 208 236 264 292 319 347 375 403 431 459 500
Equal-Width        38 151 174 202 227 255 283 312 342 367 399 429 459 500
1R                 27 63 106 201 236 311 384 411 445 476 500
Maximum Entropy    77 260 312 500
OCDD               77 251 274 312 500

We then use different methods to partition this data set. As shown in Table 3, our algorithm perfectly partitions the data into five intervals with the same boundaries as the given ones. The Equal-Width algorithm merely divides the range of the data into equal-sized intervals; it partitions this set into 14 intervals when the minimal number of data points per interval is fixed at 20. In the same experiment, the Equal-Frequency algorithm partitions the data set into 17 intervals, each of which contains almost the same number of data points. The 1R algorithm is better than Equal-Width and Equal-Frequency on this data set, but still falls short of the optimal solution. Maximum Entropy's result is very close to the optimal solution: it partitions the data set into 4 intervals, but it is misled by the noise added in the segment from the 77th to the 312th data point. Although we did not compare the experimental results with CADD directly, we can infer theoretically that OCDD is better than CADD, owing to its ability to achieve the global optimum; when bimodal or multi-modal problems arise for certain classes, as will be discussed later, CADD fails to produce good results. In summary, for this data set, our algorithm obtains the perfect solution, far better than most of the popular methods.
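For readers who wish to reproduce a comparable test, the snippet below follows the stated protocol (500 stochastic samples, class labels assigned by rank order as in Table 2, then 20% of the labels perturbed at random); the authors' exact generator is not specified, so this is only an approximation.

```python
import random

def make_synthetic(seed=0):
    """Approximate the synthetic data of Section 6.1."""
    rng = random.Random(seed)
    values = sorted(rng.uniform(0, 500) for _ in range(500))
    rank_to_class = [(77, 1), (250, 2), (274, 3), (312, 4), (500, 5)]   # Table 2
    labels = [next(c for bound, c in rank_to_class if rank <= bound)
              for rank in range(500)]
    noisy = list(range(500))
    rng.shuffle(noisy)
    for i in noisy[:100]:                      # 20% of the 500 samples
        labels[i] = rng.randint(1, 5)          # random (possibly unchanged) class label
    return values, labels
```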

6.2 Experiments with Bimodal Data

As mentioned earlier, bimodal or multi-modal data have two or more separate and distinct peaks, each of which may correspond to a separate high-frequency class. In the real world, bimodally or multi-modally distributed data are not uncommon. Without loss of generality, we only test OCDD with bimodal data to demonstrate its capability to handle this kind of problem. In the test, we generate a bimodal data set consisting of 800 data points belonging to 4 classes, where the data belonging to each class are bimodal. The original data distribution and the partition result are shown in Figure 1. The X-axis denotes the sequence of the data corresponding to the classes, and the Y-axis denotes the value of the data. The four different symbols ("*", "o", "+", ".") represent the data points of the four classes respectively. It can be observed that the data in each class have two separate and distinct peaks. The horizontal lines in the figure denote the boundaries of the intervals after partitioning. Our algorithm divides this bimodal data set into eight intervals. The figure clearly shows a reasonable and effective partition result: the first interval mainly contains the data in class 1 (represented by ".") mixed with some data from another class, and similarly, each of the following seven intervals mainly contains the data of just one class mixed with some data from other classes. The reason that our algorithm can successfully partition this bimodal data set is that, once all the continuous data are sorted, the data values themselves have no impact on the algorithm or the partition result apart from their class dependence. In line with the objective function, when dealing with continuous data we do not use the data values themselves, but rather the ordering of the data and their class labels, to calculate the mutual information and the interdependence redundancy. Therefore, OCDD can deal with bimodal or even multi-modal data very effectively.


Figure 1: Partition results on the bimodal data set.


6.3 Experiments with Real-World Data

In order to evaluate our class-dependent discretization algorithm in a real-world setting, we tested several sets of mixed-mode data from the UCI Machine Learning Repository [21]. The fifteen datasets used to test OCDD are shown in Table 4; more details on these data can be found on the UCI website. Figure 2 shows the partition results for the 10th attribute of the Vehicle Silhouettes data set. This data set has 4 classes (double decker bus, Chevrolet van, Saab 9000 and Opel Manta 400), 846 samples and 18 attributes [21]. The purpose of this data set is to evaluate how effectively classifiers can determine which of the four types of vehicle a silhouette belongs to, using a set of features extracted from the silhouette. The features were derived from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS. The 10th attribute is MAX LENGTH RECTANGULARITY, one of the scale-independent features based on heuristic measures. Each point in Figure 2 represents a data point with a class label. The distribution of this data set is very noisy, since the data of the four classes are heavily intermixed.

Figure 2: The partition results of the vehicle data's 10th attribute.

As shown in Figure 2, our algorithm generates four intervals for this attribute. The majority of the sample data contained in each interval belong to at most two classes. All sample data in the first interval belong to class 3. Most of the data contained in the second interval belong to class 1 and class 2, and only a few points are in class 3. The third interval is dominated by the sample

data in class 3 and class 4. Most of the data in the last interval are from class 4. This partition clearly shows that there is a certain interdependence between these attribute values and the 4 classes.

Table 4: Details of the datasets considered in the experimentation

Dataset          # of classes   # of examples   # of attributes   # of continuous attributes
anneal           6              898             38                6
australian       2              690             14                6
breast           2              699             10                10
cleve            2              303             13                6
diabetes         2              768             8                 8
german           2              1000            20                7
glass            7              214             9                 9
glass2           2              163             9                 9
heart            2              270             14                13
hepatitis        2              155             20                6
horse-colic      2              368             28                7
hypothyroid      2              3163            25                7
iris             3              150             4                 4
sick-euthyroid   2              2108            25                7
vehicle          4              846             18                18

In the next set of experiments, we compare the results of OCDD with those obtained by other discretization methods: Equal-Width binning, 1R, and Maximum Entropy. We use these algorithms to discretize all the continuous attribute values of the 15 datasets listed in Table 4. To test the performance of each algorithm, we use C4.5 with both pre-pruning and post-pruning, and Naive-Bayes, to evaluate how the discretization results impact the classification accuracy of these two machine learning systems. The classification results of the two learning systems based on the various discretization results are tabulated in Table 5 and Table 6. In these tables, the column labelled "Continuous" denotes the classification accuracy when running C4.5 and Naive-Bayes respectively on the original, undiscretized continuous data. The columns labelled "Bin-log l" and "Ten Bins" denote the accuracies when the equal-width binning method is applied to the continuous attribute values with the number of intervals fixed at log l and 10, respectively. The column labelled "Entropy" denotes the classification results when the Maximum Entropy discretization method is used to discretize the continuous data. Among these, 1R is a supervised discretization method, while the others are unsupervised. As shown in Table 5, OCDD achieves the highest classification accuracy with C4.5

Table 5: Accuracies using C4.5 with different discretization methods on the 15 datasets

Data Set        Continuous      Bin-log l       Entropy         1RD             Ten Bins        OCDD
anneal          91.65 ± 1.60    90.32 ± 1.06    89.65 ± 1.00    87.20 ± 1.66    89.87 ± 1.30    94.57 ± 0.38
australian      85.36 ± 0.74    84.06 ± 0.97    85.65 ± 1.82    85.22 ± 1.35    84.20 ± 1.20    85.51 ± 1.10
breast          94.71 ± 0.37    94.85 ± 1.28    94.42 ± 0.89    94.99 ± 0.68    94.57 ± 0.97    94.14 ± 0.94
cleve           73.62 ± 2.25    76.57 ± 2.60    79.24 ± 2.41    79.23 ± 2.48    77.58 ± 3.31    80.20 ± 2.80
diabetes        70.84 ± 1.67    73.44 ± 1.07    76.04 ± 0.85    72.40 ± 1.72    72.01 ± 1.07    78.78 ± 1.28
german          72.30 ± 1.37    71.10 ± 0.37    74.00 ± 1.62    70.10 ± 0.94    70.10 ± 0.48    70.20 ± 1.18
glass           65.89 ± 2.38    59.82 ± 3.21    69.62 ± 1.95    59.31 ± 2.07    59.83 ± 2.04    76.23 ± 2.55
glass2          74.20 ± 3.72    80.42 ± 3.55    76.67 ± 1.63    71.29 ± 5.10    74.32 ± 3.80    79.12 ± 1.04
heart           77.04 ± 2.84    78.52 ± 1.72    81.11 ± 3.77    82.59 ± 3.39    80.74 ± 0.94    81.75 ± 2.15
hepatitis       78.06 ± 2.77    80.00 ± 2.37    75.48 ± 1.94    79.35 ± 4.28    80.00 ± 2.37    82.00 ± 3.52
horse-colic     84.78 ± 1.31    85.33 ± 1.23    85.60 ± 1.25    85.60 ± 1.24    85.33 ± 1.23    87.40 ± 1.24
hypothyroid     99.20 ± 0.27    97.30 ± 0.49    99.20 ± 0.30    98.00 ± 0.43    96.30 ± 0.58    98.20 ± 0.24
iris            94.67 ± 1.33    96.00 ± 1.25    94.00 ± 1.25    94.00 ± 1.25    96.00 ± 1.25    96.00 ± 1.47
sick-euthyroid  97.70 ± 0.46    94.10 ± 0.72    97.30 ± 0.49    97.40 ± 0.49    95.70 ± 0.62    97.60 ± 0.18
vehicle         69.86 ± 1.84    68.45 ± 2.19    69.62 ± 1.57    66.80 ± 3.39    68.33 ± 2.12    72.57 ± 1.18
Average         81.99           82.01           83.15           81.47           81.65           84.95

for eight of the fifteen data sets (anneal, cleve, diabetes, glass, hepatitis, horse-colic, sick-euthyroid, and vehicle), and the second highest for four other data sets (australian, glass2, heart, iris). For those data sets on which OCDD did not obtain the best result, its accuracy is very close to the best. OCDD outperforms the other discretization methods significantly on cleve, diabetes, glass, hepatitis and vehicle, which are known to be difficult data sets for discretization; on the glass data, it increases the classification accuracy by 7% to 17%. OCDD slightly decreases the classification performance on some data sets that are relatively easy to discretize. The average classification accuracy over all data sets with C4.5 is 1.8% better than that of the second-best method (Entropy) and 3.3% better than that of the Ten Bins method. In general, the accuracy improvements of OCDD over the other methods are fairly consistent across the various kinds of data. Again, as shown in Table 6, the test results with Naive-Bayes confirm that OCDD consistently delivers better classification performance. OCDD achieves the highest classification accuracy with Naive-Bayes for eight of the fifteen data sets (anneal, cleve, diabetes, glass, glass2, heart, hepatitis and sick-euthyroid), and the second highest for five other data sets (australian, breast, german, horse-colic and vehicle). The average classification improvement over all datasets with Naive-Bayes is 2.2% over the next best method (Entropy) and about 9.5% over the continuous data. In summary, the classification results of C4.5 and Naive-Bayes show that the OCDD algorithm

Table 6: Accuracies using Naive-Bayes with different discretization methods on the 15 datasets

Data Set        Continuous      Bin-log l       Entropy         1RD             Ten Bins        OCDD
anneal          64.48 ± 1.47    95.99 ± 0.59    97.66 ± 0.37    95.44 ± 1.02    96.22 ± 0.64    97.55 ± 0.27
australian      77.10 ± 1.58    85.65 ± 0.84    86.09 ± 1.06    84.06 ± 1.02    85.07 ± 0.75    86.23 ± 1.32
breast          96.14 ± 0.74    97.14 ± 0.50    97.14 ± 0.50    97.14 ± 0.60    97.28 ± 0.52    97.14 ± 0.64
cleve           84.19 ± 2.01    83.86 ± 3.10    82.87 ± 3.11    81.86 ± 1.84    82.21 ± 2.63    84.52 ± 2.49
diabetes        75.00 ± 1.77    74.87 ± 1.39    74.48 ± 0.89    72.14 ± 1.52    75.00 ± 1.74    77.61 ± 1.63
german          72.60 ± 2.65    75.60 ± 0.87    73.30 ± 1.38    71.80 ± 1.29    74.40 ± 1.19    75.10 ± 1.57
glass           47.19 ± 0.71    70.13 ± 2.39    71.52 ± 1.93    69.19 ± 3.18    62.66 ± 3.11    79.94 ± 2.59
glass2          59.45 ± 2.83    76.04 ± 3.06    79.17 ± 1.71    82.86 ± 1.46    77.88 ± 2.52    86.47 ± 0.88
heart           84.07 ± 2.24    82.22 ± 2.72    81.48 ± 3.26    81.85 ± 2.44    82.96 ± 2.77    84.07 ± 2.21
hepatitis       84.52 ± 3.29    83.87 ± 4.08    84.52 ± 4.61    83.87 ± 4.67    85.81 ± 4.16    89.79 ± 2.99
horse-colic     80.14 ± 2.45    79.60 ± 2.52    80.96 ± 2.50    80.13 ± 3.17    80.14 ± 2.09    80.93 ± 2.75
hypothyroid     97.82 ± 0.44    97.54 ± 0.47    98.58 ± 0.36    98.29 ± 0.40    97.25 ± 0.50    97.76 ± 0.26
iris            95.33 ± 1.33    96.00 ± 1.25    94.00 ± 1.25    93.33 ± 1.05    95.33 ± 1.70    94.67 ± 1.33
sick-euthyroid  86.84 ± 1.11    88.44 ± 0.98    95.64 ± 0.62    94.98 ± 0.67    91.09 ± 0.87    96.40 ± 0.17
vehicle         44.21 ± 1.58    60.76 ± 1.75    60.76 ± 1.75    59.22 ± 1.56    62.18 ± 1.88    61.25 ± 2.32
Average         76.45           83.18           83.77           83.27           82.86           85.96

generates discretization results that are, by and large, better than those generated by the other discretization methods, particularly in cases where the data are more difficult to discretize. OCDD is clearly the best discretizer among all methods compared in this paper as far as these two inductive learning systems are concerned, with the Entropy method having the second-best overall results. It is safe to conclude that class-dependent discretization is well suited to supervised learning applications when coupled with a globally optimal search algorithm.

7 Conclusion

In this work, we are concerned with the handling of continuous attribute-valued data in inductive learning applications. We present a globally optimal class-dependent discretization method based on the concept of maximum class-attribute interdependence redundancy. As a key pre-processing component for inductive learning systems, many discretization algorithms have been proposed. Some of the existing approaches to discretization ignore the associative information between the continuous attributes and the class assignments; algorithms such as Equal-Width, Equal-Frequency and Maximum Entropy are typical representatives of this kind. Compared to these methods, ours has unparalleled advantages in classification accuracy due to its use of the class information. The experiments on both the synthetic data and the real-world data

consistently show that OCDD is better than its unsupervised counterparts. Others (such as CADD) use the associative class information but cannot find the global optimum of the mutual dependency between the attributes and the class, because they rely only on local or heuristic search methods. Local search methods have two disadvantages: first, they can only find a locally optimal solution; second, they can easily be trapped in an infinite loop. We use iterative dynamic programming to attain the global optimum of the mutual dependency in each partition, so that we can exploit the full relationship between the class and the attributes. This is a unique advantage of our method. It is worth mentioning that our algorithm converges superlinearly, i.e., it can generate results very quickly, in contrast to the general impression that a globally optimal algorithm is always time-consuming. The proposed discretization method works well with data sets consisting of either completely continuous or mixed continuous and discrete valued data, even with uncertainty and noise.

Although the discretization results of OCDD show that the present algorithm is computationally effective, it is not possible to conclude at this point that it creates the best possible discretization. Indeed, our experiments suggest that further improvements are needed:

I. In some cases we obtain small, noisy intervals because the objective function is not always suitable. Although the objective function measures the relationship between the class and the attribute, it may not be the most appropriate and comprehensive measure for every supervised problem. Hence, finding a good and efficient objective function for most kinds of data sets remains one of our important tasks.

II. The denoising measures used to smooth the data require different levels of domain knowledge, especially for determining the parameters used in the smoothing algorithms. One of our next endeavors is to develop a more objective test criterion to assist in fixing the parameters required by the denoising process for different data.

III. Another radical enhancement would be to modify the discretization procedure so that it can divide the continuous variable into fewer categories than there are classes.

IV. Given the number of unique sample values U and the maximum number of intervals N, our algorithm consumes O(U·N) memory. If U is very large and there is no good way to limit N to a small number (i.e., we have to set N = U in the dynamic programming algorithm), there is a good chance that the algorithm will run out of memory. In order to deal with such large-scale data sets, we need to find ways to improve the memory consumption of our algorithm.

8 Acknowledgements

The research presented in this paper was supported by Pattern Discovery Software Systems Ltd. and Natural Sciences and Engineering Research Council of Canada (NSERC No. 4716). 24

References

[1] A.K.C. Wong. Information Pattern Analysis, Synthesis and Discovery, chapter 7, pages 254–257. University of Waterloo, 1998.
[2] Stephen D. Bay. Multivariate discretization for set mining. Knowledge and Information Systems, 3(4):491–512, 2001.
[3] J. Catlett. On changing continuous attributes into ordered discrete attributes. In Y. Kodratoff, editor, Proc. 5th European Working Session on Learning, pages 164–178, Porto, Portugal, March 1991. Springer-Verlag, Heidelberg.
[4] D. Chiu, A. Wong, and B. Cheung. Information discovery through hierarchical maximum entropy. Journal of Experimental and Theoretical Artificial Intelligence, 2:117–129, 1990.
[5] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Proc. Fifth European Working Session on Learning, pages 151–163, Berlin, 1991. Springer.
[6] James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In International Conference on Machine Learning, pages 194–202, San Francisco, CA, 1995.
[7] Usama M. Fayyad and Keki B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8(1):87–102, 1992.
[8] K. M. Ho and P. D. Scott. Zeta: A global method for discretization of continuous variables. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors, Knowledge Discovery and Data Mining, pages 191–194, Menlo Park, 1997. AAAI Press.
[9] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–90, 1993.
[10] J.R. Quinlan. Induction of decision trees. Machine Learning, 1, 1986.
[11] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[12] J.T. Tou and R.C. Gonzalez. Pattern Recognition Principles. Addison-Wesley, 1974.
[13] J.Y. Ching, A.K.C. Wong, and K.C.C. Chan. Class-dependent discretization for inductive learning from continuous data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):641–651, 1995.
[14] R. Kerber. ChiMerge: Discretization of numeric attributes. In Proceedings of the 9th International Conference on Artificial Intelligence, pages 123–128, Menlo Park, CA, 1992.


[15] R. Kohavi, G. John, R. Long, D. Manley, and K. Pfleger. MLC++: A machine learning library in C++, 1994.
[16] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, Germany, 1989.
[17] L. Kurgan and K.J. Cios. Discretization algorithm that uses class-attribute interdependence maximization. In Proceedings of the 2001 International Conference on Artificial Intelligence (IC-AI 2001), pages 980–987, Las Vegas, Nevada, June 2001.
[18] P. Langley. Induction of recursive Bayesian classifiers. In P. Brazdil, editor, ECML-93, volume 667 of LNAI, pages 153–164, Berlin, 1993. Springer-Verlag.
[19] Pat Langley, Wayne Iba, and Kevin Thompson. An analysis of Bayesian classifiers. In National Conference on Artificial Intelligence, pages 223–228, San Jose, California, 1992.
[20] Huan Liu and Rudy Setiono. Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9(4):642–645, 1997.
[21] P.M. Murphy and D.W. Aha. UCI repository of machine learning databases, 1994.
[22] A. Paterson and T.B. Niblett. ACLS Manual. Technical report, Intelligent Terminals Ltd., Edinburgh, 1987.
[23] Bernhard Pfahringer. Compression-based discretization of continuous attributes. In International Conference on Machine Learning, pages 456–463, San Francisco, CA, 1995.
[24] J.R. Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 27(3):221–234, 1987.
[25] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[26] M. Richeldi and M. Rossotto. Class-driven statistical discretization of continuous attributes (extended abstract). Springer-Verlag, Berlin, Heidelberg, 1995.
[27] Moshe Sniedovich. Dynamic Programming, appendix, pages 348–350. New York, 1992.
[28] C.C. Wang and A.K.C. Wong. Classification of discrete-valued data with feature space transformation. IEEE Trans. on Automatic Control, AC-24(3):434–437, 1979.
[29] A.K.C. Wong and D.K.Y. Chiu. Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(6):796–805, November 1987.


[30] A.K.C. Wong and D.K.Y. Chiu. Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(6):796–805, November 1987.
[31] A.K.C. Wong and C.C. Wang. DECA: A discrete-valued data clustering algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 1(4):342–349, 1979.
