Ordered estimation of missing values

Oscar Ortega Lobo and Masayuki Numao
Department of Computer Science, Tokyo Institute of Technology
2-12-1 Ookayama, Meguro-ku, Tokyo 152-8552, Japan
e-mail: {beatriz,[email protected]}

Abstract. When attempting to discover, by learning, concepts embedded in data, it is not uncommon to find that information is missing from the data. Such missing information can diminish the confidence in the concepts learned from the data. This paper describes a new approach to filling missing values in the examples provided to a learning algorithm. A decision tree is constructed to determine the missing values of each attribute by using the information contained in the other attributes. In addition, an ordering for the construction of the decision trees for the attributes is formulated. Experimental results on three datasets show that completing the data by using decision trees leads to final concepts with less error under different rates of random missing values. The approach should be suitable for domains with strong relations among the attributes, and for which improving accuracy is desirable even if computational cost increases.

1 Introduction

Machine learning techniques have been successfully employed to extract concepts embedded in data describing instances from a particular domain. When the instances are described by attributes and by propositions on attribute values for each instance, this type of learning is called propositional learning. Algorithms already exist that manage to build concepts from the above type of data [8, 2].

One troublesome aspect of data sets used in machine learning is the occurrence of unknown attribute values for some instances in the available data. The missing-values phenomenon is likely to occur when by-products are generated from different data collections, an operation commonly carried out during the process of knowledge discovery [5]. When missing values occur in the data, the learning algorithm fails to find an accurate representation of the concept (e.g., decision trees or rules). Properly filling missing values in the data can help reduce the error rate of the learned concepts. Thus, the purpose of this paper is to introduce and evaluate a mechanism to fill missing values in the data employed by a propositional learning algorithm.

This paper is organized as follows. First, the approach for estimating missing values is explained; then two experimental scenarios and their results on several datasets are presented, followed by a discussion of the results and related work. Finally, suitable domains, restrictions, and further improvements are discussed.

2 Estimating Missing Values

Having established the need to fill missing values in the data of a particular domain, it is advisable to make the most efficient use of the information already available in the data. That is to say, it seems worthwhile to design a method that uses the maximum amount of information derivable from the data, while at the same time holding computational demands to a sustainable level. In this section, a new method for filling missing values is described in terms of how these two requirements have been met.

In order to fulfill the first requirement, decision trees are constructed for each attribute by using a reduced training set containing only those examples that have known values for the attribute. The reason for this is that decision trees are suitable for representing relations among the most important attributes when determining the value of a target attribute. In addition, decision tree learning algorithms are fast at formulating accurate concepts. After constructing a decision tree for filling the missing values of an attribute, it makes sense to use the data with filled values in order to construct a decision tree for filling the missing values of other attributes. Therefore, the order followed when constructing attribute trees and filling the missing values per attribute becomes important.

The ordering proposed here is based on the concept in information theory called mutual information, which has been successfully used as a criterion for attribute selection in decision tree learning [8]. Mutual information between two ensembles X and Y is defined by

H(X; Y) \equiv \sum_{x \in A_X} P(x) \log \frac{1}{P(x)} - \sum_{y \in A_Y} P(y) \left[ \sum_{x \in A_X} P(x \mid y) \log \frac{1}{P(x \mid y)} \right]    (1)

Mutual information measures the average reduction in uncertainty about X that results from learning the value of Y, or vice versa. Thus, by measuring the mutual information between the attributes and the class, inferences can be made about the strength of the relations between them. In propositional learning, attributes that have low mutual information with respect to the class have less chance of participating in the final concept, so properly filling the missing values of such attributes will have very low impact on the accuracy of the final concept. In contrast, attributes having high mutual information with respect to the class have a higher chance of being incorporated into the final concept, making it worthwhile to obtain a finer filling of their missing values.

Considering the previous discussion and the requirement of holding computational demands to a sustainable level, the ordering proposed in this approach can be expressed as follows. Let C be the class variable. First, construct decision trees and fill the missing values for the attributes that have less mutual information with respect to C. When constructing the decision tree for any particular attribute Ai, discard from the training set those attributes Ak for which H(Ai; C) < H(Ak; C).
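To make the procedure concrete, the following is a minimal sketch of the ordered filling scheme in Python, assuming categorical attributes stored in a pandas DataFrame with missing entries marked as NaN. The mutual_info_score, get_dummies, and DecisionTreeClassifier calls are standard scikit-learn/pandas functions; the function itself is an illustrative reading of the method described above, not the authors' implementation.

import pandas as pd
from sklearn.metrics import mutual_info_score
from sklearn.tree import DecisionTreeClassifier

def fill_by_attribute_trees(df, class_col):
    """Fill missing attribute values with per-attribute decision trees,
    processed in increasing order of mutual information with the class."""
    df = df.copy()
    attrs = [c for c in df.columns if c != class_col]

    # Estimate H(A; C) for each attribute A on the rows where A is known.
    mi = {}
    for a in attrs:
        known = df[a].notna()
        mi[a] = mutual_info_score(df.loc[known, class_col], df.loc[known, a])

    # Attributes with less mutual information w.r.t. the class are filled first.
    for a in sorted(attrs, key=mi.get):
        missing = df[a].isna()
        if not missing.any():
            continue
        # Keep only predictors Ak with H(Ak; C) <= H(A; C); attributes with
        # higher mutual information are discarded, as in the ordering above.
        predictors = [p for p in attrs if p != a and mi[p] <= mi[a]]
        if not predictors:
            continue
        # One-hot encode the (categorical) predictors; rows where a predictor
        # is still missing simply get all-zero indicator columns.
        X = pd.get_dummies(df[predictors].astype(object))
        tree = DecisionTreeClassifier().fit(X[~missing], df.loc[~missing, a])
        df.loc[missing, a] = tree.predict(X[missing])
    return df

Because attributes are processed in increasing order of mutual information, the lower-ranked attributes used as predictors have already had their missing values filled by the time a higher-ranked attribute is reached.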

3 Experiments and Discussion

The experimental focus of this paper is to compare the accuracy of decision tree learning from data sets whose missing values have been filled by different methods. In the experiments, the attribute trees method is compared with two other methods used in machine learning for dealing with missing values: the majority method [4] and the probabilistic method [3]. Ten-fold cross-validation experiments were carried out for each of the three methods. Each experiment was conducted for rates of artificial missing values ranging from 10% to 70%. Artificial missing values were generated in identical proportions and following the same distribution for each attribute. Figure 1 shows a summary of the characteristics of the three datasets used for the evaluations.

Fig. 1. Summary of Datasets

  name       instances   attr     class
  Soybean    307         35 cat   19
  BreastCW   699         9 num    2
  Mushroom   8124        21 cat   2

Fig. 2. Results on Soybean: classification error (%) vs. missing value rate (10-70%) for the PROB, MAJOR, and ATTTREE methods.

Fig. 3. Results on BreastCW: classification error (%) vs. missing value rate (10-70%) for the PROB, MAJOR, and ATTTREE methods.

Fig. 4. Results on Mushroom: classification error (%) vs. missing value rate (10-70%) for the PROB, MAJOR, and ATTTREE methods.

When looking at the effects of missing data, the most reasonable assumption is that future data will contain the same proportion and kinds of missing values as the present data [2]. Accordingly, the experiments conducted in this study included artificial missing values in identical proportion and distribution in both training and test data.
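As a rough illustration of how such artificial missing values might be injected, the sketch below blanks out a fixed fraction of each attribute's entries uniformly at random; the exact sampling scheme used in the experiments is only summarized above, so the function name and the uniform masking are assumptions made for the example.

import numpy as np

def inject_missing(df, attrs, rate, seed=0):
    """Return a copy of df in which a fraction `rate` (e.g. 0.1-0.7) of the
    entries of each attribute in `attrs` has been replaced by NaN."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n = len(out)
    k = int(round(rate * n))
    for a in attrs:
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=k, replace=False)] = True
        out[a] = out[a].mask(mask)  # masked entries become NaN
    return out

# The same rate would be applied to both the training and the test folds,
# e.g. train_masked = inject_missing(train_df, attribute_names, rate=0.3).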

Figures 2, 3, and 4 plot the average classification error for concepts learned on each of the target domains under each of the three methods. The difference in error performance between the attribute trees method and the other two methods was found to be significant at the 95% confidence level for all tested rates of missing values in all data sets. These results indicate that when missing values occur in both the training and test instances, the attribute trees method is superior in modeling the missing values in the three domains tested. The worst performance was obtained with the majority method, as was expected, since this method cannot be used for filling missing values in the test data. In contrast, the probabilistic and attribute trees methods are more complete in the sense that they can deal with missing values in both training and test data. The probabilistic method constructs a model of the missing values that depends only on the prior distribution of the values of each attribute being tested in a node of the tree. This approach is adequate when most of the attributes are independent, so that the model can rely on the values of each attribute without regard to the other attributes. Attribute trees are more complete models because they can represent the complex relations among attributes that appear when there is high dependency among the attributes.
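For comparison, the following are rough data-level sketches of the two baselines, with two caveats: the majority method needs the class of the instance (which is why it cannot be applied to test data), and the probabilistic method of [3], as integrated into C4.5, actually weights an instance across all branch values rather than rewriting the data, so the sampling version below is only an approximation for illustration.

import numpy as np

def fill_majority(df, attr, class_col):
    """Majority method: most frequent value of `attr` among instances of the
    same class (usable only where the class is known, i.e. training data)."""
    modes = df.groupby(class_col)[attr].agg(lambda s: s.mode().iloc[0])
    out = df[attr].copy()
    missing = out.isna()
    out[missing] = df.loc[missing, class_col].map(modes)
    return out

def fill_probabilistic(df, attr, seed=0):
    """Probabilistic flavor: sample replacements from the prior distribution
    of the known values of `attr`."""
    rng = np.random.default_rng(seed)
    prior = df[attr].value_counts(normalize=True)
    out = df[attr].copy()
    missing = out.isna()
    out[missing] = rng.choice(prior.index.to_numpy(),
                              size=int(missing.sum()),
                              p=prior.to_numpy())
    return out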

4 Related Work

In propositional learning, one of the first approaches for dealing with missing values was to ignore instances with missing data [8]. This approach was soon found to be weak in the sense of not profiting from the useful information present in instances with some missing attribute values. Thus, a method that considered the most frequent attribute value as a good candidate for filling the missing value was proposed, and later extended to take the most frequent attribute value for the class of the instance that has the missing value [4]. This approach is referred to as the majority method. Another approach is to assign all possible values for the attribute, weighted by their prior probabilities estimated from the known distribution of the values of the attribute [3]. This approach is referred to as the probabilistic method, and it was adopted in the implementation of C4.5 [7]. In fact, Quinlan decided to choose the probabilistic approach after extensive experimentation on several domains [6], comparing the three methods mentioned above and a fourth method that uses a decision tree for each attribute. The approach presented in this paper and the last method tested by Quinlan differ in two respects. First, here the attribute trees are constructed following an ordering; second, only the attributes with less mutual information with respect to the class are taken as input for the construction of a tree for a particular attribute.

On the statistics side of research on decision tree learning, the surrogate splits method was formulated by Breiman [2] in his work on binary regression trees. This method always keeps secondary attributes to be tested at each node of the decision tree when the value of the primary attribute happens to be missing. In fact, this method can be viewed as a specific case of the more general approach of using decision trees to fill the missing values of the attributes [6].

5 Concluding Remarks

A method for estimating missing values has been proposed and successfully tested on several data sets. On the tested domains, the new method is seen to provide significantly better performance than the two methods currently used to deal with missing values in propositional learning. Domains with high dependency among the attributes are thought to be the most suitable for application of the approach introduced in this paper.

All the datasets tested here have discrete values for their attributes. This restriction follows from the nature of the decision tree learner used to construct the attribute trees. Further experimentation using a decision tree learner that can deal with continuous classes is advisable. The increase in computational cost was not evaluated here. Indeed, the approach is thought to be suitable for domains for which an increase in computational cost is worth the benefit obtained by lowering the classification error.

6 Acknowledgments

This research has been done under a grant from the Japanese Ministry of Education. The data sets used in this research were obtained from the UCI Machine Learning Repository [1].

References

1. C. Blake, E. Keogh, and C.J. Merz. UCI repository of machine learning databases. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. University of California, Department of Information and Computer Science, Irvine, CA, 1998.
2. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Chapman & Hall, 1993.
3. B. Cestnik, I. Kononenko, and I. Bratko. Assistant-86: A knowledge-elicitation tool for sophisticated users. In Ivan Bratko and Nada Lavrac, editors, Progress in Machine Learning. Sigma Press, Wilmslow, UK, 1987.
4. I. Kononenko and E. Roškar. Experiments in automatic learning of medical diagnostic rules. Technical report, Jozef Stefan Institute, Ljubljana, Yugoslavia, 1984.
5. W.Z. Liu, A.P. White, S.G. Thompson, and M.A. Bramer. Techniques for dealing with missing values in classification. In Proc. of Advances in Intelligent Data Analysis (IDA'97), volume 1280 of Lecture Notes in Computer Science, pages 527-536. Springer, 1997.
6. J.R. Quinlan. Unknown attribute values in induction. In Proceedings of the Sixth International Machine Learning Workshop, pages 164-168. Morgan Kaufmann, 1989.
7. J.R. Quinlan. Unknown attribute values. In C4.5: Programs for Machine Learning, pages 27-32. Morgan Kaufmann, 1993.
8. J.R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.