Decision Tree Induction Methods and Their Application to Big Data

Petra Perner, Decision Tree Induction Methods and Their Application to Big Data. In: Fatos Xhafa, Leonard Barolli, Admir Barolli, Petraq Papajorgji (Eds.), Modeling and Processing for Next-Generation Big-Data Technologies With Applications and Case Studies, Modeling and Optimization in Science and Technologies, Vol. 4, Springer Verlag, 2015, pp. 57-88.

Petra Perner

Abstract. Data mining methods are widely used across many disciplines to identify patterns, rules or associations among huge volumes of data. While in the past mostly black-box methods such as neural nets and support vector machines have been heavily used for the prediction of patterns, classes, or events, methods with explanation capability, such as decision tree induction, have seldom been preferred. In this chapter we therefore give an introduction to decision tree induction. We first describe the basic principle, the advantageous properties of decision tree induction methods, and the representation of decision trees, so that a user can understand and describe a tree in a common way. We then explain the overall decision tree induction algorithm as well as different methods, developed by us and by others, for the most important functions of a decision tree induction algorithm, such as attribute selection, attribute discretization, and pruning. We explain how the learnt model can be fitted to the expert's knowledge and how the classification performance can be improved. The problem of feature subset selection by decision tree induction is described. The quality of the learnt model should not be checked based on the overall accuracy alone; we explain more specific measures that describe the performance of the model in more detail. We present a new quantitative measure that can describe changes in the structure of a tree in order to help the expert interpret the differences between two trees learnt from the same domain. Finally, we summarize our chapter and give an outlook.

1 Introduction

Data mining methods are widely used across many disciplines to identify patterns, rules or associations among huge volumes of data. While in the past mostly black-box methods such as neural nets and support vector machines have been heavily used for the prediction of patterns, classes, or events, methods with explanation capability, such as decision tree induction, have seldom been preferred. Yet it is very important to understand the classification result, not only in medical applications but increasingly also in technical domains. Nowadays, data mining methods with explanation capability are used more heavily across disciplines, after more work on the advantages and disadvantages of these methods has been done. Decision tree induction is one of the methods that have explanation capability. Its advantages are ease of use and fast processing of the results. Decision tree induction methods can easily learn a decision tree without heavy user interaction, while in neural nets a lot of time is spent on training the net. Cross-validation methods can be applied to decision tree induction methods, while this is not the case for neural nets. These methods ensure that the calculated error rate comes close to the true error rate. In most domains, such as medicine, marketing, or nowadays even technical domains, explanation capability, ease of use, and speed of model building are among the most preferred properties of a data mining method.

Several decision tree induction algorithms are known. They differ in the way they select the most important attributes for the construction of the decision tree, in whether they can deal with numerical and/or symbolical attributes, and in how they reduce noise in the tree by pruning. A basic understanding of the way a decision tree is built is necessary in order to select the right method for the problem at hand and in order to interpret the results of a decision tree. In this chapter, we review several decision tree induction methods. We focus on the most widely used methods and on methods we have developed. We rely on generalization methods and do not focus on methods that model subspaces of the decision space, such as decision forests, since the explanation capability of these methods is limited.

The preliminary concepts and the background are given in Section 2. This is followed by an overall description of a decision tree induction algorithm in Section 3. Different methods for the most important functions of a decision tree induction algorithm are described in Section 4 for attribute selection, in Section 5 for attribute discretization, and in Section 6 for pruning. The explanations given by the learnt decision tree must make sense to the domain expert, since he has often already built up some partial knowledge. We describe in Section 7 what problems can arise and how the expert can be satisfied with the explanations about his domain. Decision tree induction is a supervised method and requires labeled data. The necessity to check the labels by an oracle-based classification approach is also explained in Section 7, as is the feature subset-selection problem. It is explained in what way feature subset selection can be used to improve the model beyond the normal outcome of a decision tree induction algorithm. Section 8 deals with the question of how to interpret a learnt decision tree. Besides the well-known overall accuracy, different, more specific accuracy measures are given and their advantages are explained. We introduce a new quantitative measure that can describe changes in the structure of trees learnt from the same domain and help the user to interpret them. Finally, we summarize our chapter in Section 9.
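As an illustration of the cross-validation remark above, the following minimal sketch (assuming Python with scikit-learn and its bundled Iris data; the data set and parameters are purely illustrative, not part of the original text) shows how a k-fold cross-validation estimate of a decision tree's error rate can be obtained.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation: train on nine folds, test on the held-out fold,
# and average the accuracies to approximate the true error rate.
scores = cross_val_score(tree, X, y, cv=10)
print("estimated error rate: %.3f" % (1.0 - scores.mean()))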

2 Preliminary Concepts and Background

The input to a decision tree induction algorithm is a data set that contains attributes in its columns and data entries, with their attribute values, in its rows (see Fig. 1). From this, the decision tree induction algorithm can automatically derive a set of rules that generalizes the data. The set of rules is represented as a tree. The decision tree recursively partitions the solution space into subspaces based on the attribute splits until a final solution is reached. The resulting hierarchical representation is very natural to the human problem-solving process. During the construction of the decision tree, only those attributes that are most relevant for the classification problem are selected from the whole set of attributes. Therefore, a decision tree induction method can also be seen as a feature selection method. Once the decision tree has been learnt and the developer is satisfied with the quality of the model, the tree can be used to predict the outcome for new samples. This learning method is also called supervised learning, since the samples in the data collection have to be labeled with their class. Most decision tree induction algorithms allow the use of numerical attributes as well as categorical attributes. Therefore, the resulting classifier can make its decision based on both types of attributes.
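To make this basic principle concrete, the following minimal Python sketch (not the author's implementation; scikit-learn and pandas are assumed, and the attribute names and values are invented for illustration) learns a tree from a small labeled data table with one numerical and one categorical attribute and then predicts the class of a new sample.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Attributes in the columns, one labeled sample per row (invented values).
data = pd.DataFrame({
    "temperature": [37.2, 39.1, 38.5, 36.8],     # numerical attribute
    "cough":       ["no", "yes", "yes", "no"],    # categorical attribute
    "class":       ["healthy", "ill", "ill", "healthy"],
})

# Categorical attributes are encoded numerically before learning.
X = pd.get_dummies(data[["temperature", "cough"]])
y = data["class"]

tree = DecisionTreeClassifier().fit(X, y)         # learn the model (the tree)

# Predict the outcome for a new, unlabeled sample.
new_sample = pd.get_dummies(pd.DataFrame({"temperature": [38.9], "cough": ["yes"]}))
print(tree.predict(new_sample.reindex(columns=X.columns, fill_value=0)))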

Fig. 1. Basic Principle of Decision Tree Induction

A decision tree is a directed acyclic graph consisting of edges and nodes (see Fig. 2). The node that no edge enters is called the root node. The root node contains samples of all class labels. Every node except the root node has exactly one entering edge. A node having no successor is called a leaf node or terminal node. All other nodes are called internal nodes. The nodes of the tree contain decision rules such as IF attribute A ≤ constant c THEN D. Such a decision rule is a function f that maps the attribute A to D. The rule described above results in a binary tree: the sample set in each node is split into two subsets based on the constant c for the attribute A. This constant c is called the cut-point. In the case of a binary tree, the decision is either true or false. In the case of an n-ary tree, the decision is based on several constants ci. Such a rule splits the data set into i subsets.

Geometrically, a split describes a partition orthogonal to one of the coordinates of the decision space. A terminal node should contain only samples of one class. If there is more than one class in the sample set, we say there is class overlap. This class overlap in the terminal nodes is responsible for the error rate. An internal node always contains more than one class in its assigned sample set. A path in the tree is a sequence of edges (v1,v2), (v2,v3), ..., (vn-1,vn). We say the path is from v1 to vn and has length n-1. There is a unique path from the root to each node. The depth of a node v in a tree is the length of the path from the root to v. The height of a node v in a tree is the length of the longest path from v to a leaf. The height of a tree is the height of its root. The level of a node v in a tree is the height of the tree minus the depth of v.
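The structural notions above (depth, height, level) can be made concrete with a small Python sketch; the hand-built tree below is arbitrary and only serves to illustrate the definitions.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def height(node):
    # Height: length of the longest path from the node down to a leaf.
    return 0 if not node.children else 1 + max(height(c) for c in node.children)

def depth(root, target, d=0):
    # Depth: length of the unique path from the root to the node.
    if root is target:
        return d
    for child in root.children:
        found = depth(child, target, d + 1)
        if found is not None:
            return found
    return None

leaf = Node("leaf_1")
tree = Node("root", [Node("internal", [leaf, Node("leaf_2")]), Node("leaf_3")])

print(height(tree))                        # height of the tree = height of its root: 2
print(depth(tree, leaf))                   # depth of leaf_1: 2
print(height(tree) - depth(tree, leaf))    # level of leaf_1: 0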

Fig. 2. Representation of a Decision Tree

A binary tree is an ordered tree such that each successor of a node is distinguished either as a left son or a right son. No node has more than one left son, nor more than one right son. Otherwise, it is an n-ary tree. Let us now consider the decision tree learnt from Fisher's Iris data set. This data set has three classes (1 - Setosa, 2 - Versicolor, 3 - Virginica) with fifty observations for each class and four predictor variables (petal length, petal width, sepal length, and sepal width). The learnt tree is shown in Fig. 3. It is a binary tree.

Fig. 3. Decision Tree learnt from Iris Data Set

The average depth of the tree is (1+3+3+2)/4 = 9/4 = 2.25. The root node contains the attribute petal_length. Along a path, the rules are combined by the AND operator. Following two of the paths from the root node, we obtain, for example, the following rules:
RULE 1: IF petal_length ≤ 2.45 THEN Setosa
RULE 2: IF petal_length > 2.45 AND petal_length > 4.9 THEN Virginica.
In the latter rule we can see that the attribute petal_length is used twice during the problem-solving process, each time with a different cut-point.
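A tree similar to the one in Fig. 3 can be reproduced with the following hedged sketch; it uses scikit-learn's CART implementation rather than the method used here, so the exact cut-points and tree shape may differ from Fig. 3.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Each root-to-leaf path corresponds to one rule; the conditions along the
# path are combined with AND, e.g. "IF petal length <= 2.45 THEN Setosa".
print(export_text(tree, feature_names=list(iris.feature_names)))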

3 Subtasks and Design Criteria for Decision Tree Induction

The overall procedure of the decision tree building process is summarized in Fig. 4. Decision trees recursively split the decision space into subspaces based on the decision rules in the nodes (see Fig. 5) until the final stopping criterion is reached or the remaining sample set does not suggest further splitting. For this recursive splitting, the tree building process must always pick, among all attributes, the attribute that shows the best result on the attribute selection criterion for the remaining sample set. Whereas for categorical attributes the partition of the attribute values is given a priori, the partition of the attribute values for numerical attributes must be determined. This process is called the attribute discretization process.

Fig. 4. Overall Tree Induction Procedure (flow chart: DO WHILE the tree termination criterion is not met; DO FOR all features: if the feature is numerical, apply the splitting procedure; apply the feature selection procedure; split the examples; build the tree)
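The loop shown in Fig. 4 can be made concrete with the following simplified, self-contained Python sketch (not the author's implementation), restricted to categorical attributes and an entropy-based selection criterion; the toy data at the end is invented for illustration.

from collections import Counter
from math import log2

def entropy(samples):
    # Class entropy of a sample set given as (attribute_dict, class_label) pairs.
    counts = Counter(label for _, label in samples)
    total = len(samples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def split(samples, attribute):
    # Partition the sample set according to the values of the attribute.
    subsets = {}
    for features, label in samples:
        subsets.setdefault(features[attribute], []).append((features, label))
    return subsets

def information_gain(samples, attribute):
    subsets = split(samples, attribute).values()
    remainder = sum(len(s) / len(samples) * entropy(s) for s in subsets)
    return entropy(samples) - remainder

def build_tree(samples, attributes):
    labels = {label for _, label in samples}
    if len(labels) == 1 or not attributes:            # termination criterion
        return Counter(l for _, l in samples).most_common(1)[0][0]
    # Attribute selection: pick the attribute with the best criterion value.
    best = max(attributes, key=lambda a: information_gain(samples, a))
    # Split the examples on that attribute and build the subtrees recursively.
    return {best: {value: build_tree(subset, attributes - {best})
                   for value, subset in split(samples, best).items()}}

data = [({"outlook": "sunny", "windy": "no"},  "play"),
        ({"outlook": "rain",  "windy": "yes"}, "stay"),
        ({"outlook": "sunny", "windy": "yes"}, "play"),
        ({"outlook": "rain",  "windy": "no"},  "stay")]
print(build_tree(data, {"outlook", "windy"}))          # {'outlook': {'sunny': 'play', 'rain': 'stay'}}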

Fig. 5. Demonstration of the Recursive Splitting of the Decision Space based on two Attributes of the IRIS Data Set

The attribute discretization process can be done before or during the tree building process [1]. We consider the case where attribute discretization is done during the tree building process. The discretization must be carried out before the attribute selection process, since the selected partition of the attribute values of a numerical attribute highly influences the prediction power of that attribute. After the attribute selection criterion has been calculated for all attributes, based on the remaining sample set at the particular level of the tree, the resulting values are evaluated and the attribute with the best value of the attribute selection criterion is selected for further splitting of the sample set. The tree is then extended by two or more further nodes, to each node is assigned the subset created by splitting on the attribute values, and the tree building process repeats. Attribute splits can be done:
- univariate on numerically or ordinally ordered attributes A, such as A ≤ c,
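How the cut-point c for such a univariate split A ≤ c can be determined during tree building is illustrated by the following hedged Python sketch: candidate cut-points are placed midway between successive attribute values, and the one with the largest entropy-based information gain is kept. The attribute values and class labels below are invented for illustration; they are not the Iris data.

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_cut_point(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain

values = [1.4, 1.3, 4.7, 4.5, 5.1, 5.9]                # a numerical attribute
labels = ["A", "A", "B", "B", "C", "C"]                # the class labels
print(best_cut_point(values, labels))                   # -> (2.95, 0.918...)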