Hierarchical classifier

Igor T. Podolak*, Sławomir Biel, Marcin Bobrowski

Institute of Computer Science, Jagiellonian University, Nawojki 11, Kraków, Poland
[email protected]

Abstract. Artificial Intelligence (AI) methods are used to build classifiers that give different levels of accuracy and solution explication. The intent of this paper is to provide a way of building a hierarchical classifier composed of several artificial neural networks (ANN's) organised in a tree-like fashion. This method of construction allows for partitioning the original problem into several sub-problems which can be solved with simpler ANN's, and built quicker than a single ANN. As the extracted sub-problems become independent of one another, this paves the way to realising the solutions for the individual sub-problems in parallel. It is observed that incorrect classifications are not random and can therefore be used to find clusters defining sub-problems.

1 Introduction

In data mining, the given data sets composed of patterns are described with models. In a predictive model the output variable Y is expressed as a function of other explanatory variables X. For a categorical variable Y, the task is a classification task [2,11]. The possible approaches, e.g. perceptrons, linear discriminant models, decision trees, various clustering methods, the Naive Bayes model, neural networks (ANN's), etc., vary greatly in terms of accuracy and explanatory power. ANN's provide high accuracy, but their predictions lack any reason why a particular answer was given, whereas decision trees give answers easy for humans to understand but do not generalise well. Moreover, it is hard to find an optimum ANN architecture for a problem; at best it is a lengthy process whose successful outcome depends more on experience and luck than on clear rules. Clearly, there is a need for a classifier that is easy and quick to build, accurate, with easy-to-explain predictions, and fit for parallel implementation. In this paper we propose a methodology of building a hierarchical classifier (HC) composed of several small ANN's, each constituting an easy-to-build weak classifier. Composition of weak classifiers' predictions provides for a more accurate one [3]. Each such classifier divides the problem into sub-problems whose training is independent, and therefore fit for parallel implementation. From some of the classifiers (mainly those near the tree root) rules can be extracted, making the predictions explainable.

* Corresponding author: [email protected]; research was funded by KBN grant 3 T11C 054 26 and the Jagiellonian University grant "Multiagent systems".


The paper is organised as follows: first the problem and the model used are described, then the actual algorithm is depicted, followed by analyses of the algorithm. We also apply the proposed HC to the problem of electric power consumption prediction. The paper ends with conclusions.

2 Problem definition and model

A classifier may be defined as a function Cl

Cl : X ∋ x → p ∈ P   (1)

that assigns each example x, defined with a vector of features, to a class p from a finite set of possible classes P. Cl is a realisation of an unknown function F(·) that is defined with the training data set D comprising pairs (x, p), where x is a vector of features and p is the class that x belongs to. It is possible to build an ANN that realises F(·) with error less than any ε > 0 [4]. On the other hand, ANN's that better assign classes to training examples need more neurons and have a lower generalisation level, which is defined as the ability to correctly classify examples that were not used during training. It is possible to construct committee machines in which responses from several predictors are combined. Each predictor may have a different starting point, a different training subset, etc., therefore the overall combined answer may give better generalisation (see [4] for a discussion). We propose to construct a tree classifier with a classifier Cli at each node

Cli : Di ∋ x → p ∈ Pi   (2)

where Pi = {pi1, . . . , pik} is the set of classes of examples from Di ⊂ D, and D is the original set of examples defining the whole problem. To have a quick algorithm, a small ANN is used to realise Cli; therefore the classifier is a weak one, i.e. it does not have perfect accuracy, but still a good generalisation rate. Some of the classes are confused with each other more frequently, therefore they are combined into groups, using confusion matrix analysis and a merging algorithm as described below. If m groups of classes were formed,

Qij = {pil ∈ Pi | l = 1, . . . , nQij},  j = 1, . . . , m   (3)

where nQij is the number of classes in Qij, and together they form the set of groups Qi

Qi = {Qij | j = 1, . . . , m}   (4)

then the original classifier Cli is replaced with the classifier Clmod

Clmod : Di ∋ x → q ∈ Qi   (5)

In other words, the new classifier Clmod does not answer with a class from the original problem, but tells which subset Qij the actual class most probably belongs to. The data set Di is divided into subsets Dij corresponding to the groups Qij, and new ANN classifiers are built for them. The leaf-node ANN's classify into original-problem classes. The tree is built recursively and the algorithm stops when a satisfying accuracy is achieved.
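The recursive classification scheme described above can be sketched as follows; the node structure and names (HCNode, classify) are our illustration of the idea, not the paper's implementation:

```python
# Sketch of recursive classification in a hierarchical classifier (HC) tree.
# Each internal node holds a classifier mapping an example to a group index;
# each leaf holds a classifier mapping an example to an original-problem class.

class HCNode:
    def __init__(self, clf, children=None):
        self.clf = clf              # callable: example -> group index or class
        self.children = children    # list of HCNode, or None for a leaf

    def is_leaf(self):
        return self.children is None

def classify(node, x):
    """Recursively pass x down the tree until a leaf returns a class."""
    answer = node.clf(x)
    if node.is_leaf():
        return answer               # leaf: answer is an original-problem class
    return classify(node.children[answer], x)

# Toy usage: the root separates group {0, 1} from {2}; one child discerns 0 vs 1.
leaf_a = HCNode(lambda x: 0 if x < 0.5 else 1)
leaf_b = HCNode(lambda x: 2)
root = HCNode(lambda x: 0 if x < 1.0 else 1, children=[leaf_a, leaf_b])
```

Each node only needs to answer a coarser question than the original problem, which is what keeps the individual ANN's small.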


3 The hierarchical algorithm

At each level of the tree classifier, the original problem is subdivided into sub-problems. At each node an ANN classifier is trained, but to save time and obtain better generalisation, ANN's are trained only as weak classifiers, i.e. ones that classify examples only somewhat better than the average (the minimal correct classification rate grows at each level). Thanks to that, the networks in all the nodes can be kept small, resulting in short learning times. The algorithm partitions the whole input space into sub-problems through analysis of the confusion matrix.

3.1 Confusion matrix based problem partition

Since the classifiers are weak, some of the examples are classified incorrectly. Nevertheless, we postulate that classification, although incorrect, is not random, i.e. if an example from class A (a label from the training set) gets classified as, say, class B, this means that classes A and B are similar. This is a clustering effect. Therefore, we partition the problem at a given node (the problem is given by the examples used to train the ANN at this node) into sub-problems by selecting groups of examples which frequently get classified similarly. Each such group defines a new sub-problem for which a new ANN would be trained. This is done by inspecting the confusion matrix M generated at the end of ANN training. Therefore, if we have the set of training examples D

D = {x1 , x2 , . . . , xN }   (6)

and a set of classes P that these examples can be assigned to

P = {p1 , p2 , . . . , pM }   (7)

and functions

α(xi) = pk, i.e. example xi is from class pk   (8)
ϕ(xi) = pk, i.e. example xi was classified by the ANN as belonging to class pk   (9)

then the confusion matrix M can be defined as

M[i][j] = a   (10)

where a is the number of examples from class pi classified as pj, i.e. the number of elements in the set

{xk ∈ D | α(xk) = pi ∧ ϕ(xk) = pj}   (11)

A perfect classifier would have non-zero elements only on the diagonal of matrix M. On the other hand, an ANN which is not perfect, but still better than a majority-voting-type classifier, confuses examples from a group of classes, while some classes get easily separated. This can easily be seen in the confusion matrix: the examples from some groups of classes are frequently inter-classified, and the corresponding elements of M are non-zero, while some classes are never mistaken for one another and the corresponding elements of M are zero. This conveys information about the clusters of examples easily mistaken by the ANN. With that information the groups can easily be found, see Fig. 1.
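The row-wise group extraction described above can be sketched as follows; the function names, and the helper for the generalised accuracy of Definition 1 in Sect. 3.2, are illustrative assumptions, not the paper's code:

```python
# A minimal sketch of confusion-matrix-based grouping: build M from the true
# labels α(x) and ANN predictions ϕ(x), then read a candidate group off each
# row as the set of classes that row's true class gets classified as.

def confusion_matrix(true_cls, pred_cls, n_classes):
    """M[i][j] = number of examples with α(x) = i classified as ϕ(x) = j."""
    M = [[0] * n_classes for _ in range(n_classes)]
    for a, f in zip(true_cls, pred_cls):
        M[a][f] += 1
    return M

def candidate_groups(M):
    """For each true class i (row of M), the set of classes it is classified as."""
    return [set(j for j, cnt in enumerate(row) if cnt > 0) for row in M]

def generalised_accuracy(true_cls, pred_cls, group_of):
    """Fraction of examples predicted inside the group of their true class."""
    hits = sum(1 for a, f in zip(true_cls, pred_cls) if f in group_of[a])
    return hits / len(true_cls)

# Classes 0 and 1 are confused with each other; class 2 is well separated.
true_cls = [0, 0, 1, 1, 2, 2]
pred_cls = [0, 1, 0, 1, 2, 2]
M = confusion_matrix(true_cls, pred_cls, 3)
groups = candidate_groups(M)          # [{0, 1}, {0, 1}, {2}]
```

Here the standard accuracy is 4/6, while the generalised accuracy over the extracted groups is 1.0, since every misclassification stays inside the {0, 1} group.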


3.2 Construction of the classifier tree

For each desired output class p, based on the confusion matrix M, we find all the classes it is mistaken for (corresponding to the non-zero elements in the matrix M's row, see Fig. 1). Each group consists of the classes that examples from the desired output class (corresponding to the matrix row) are classified as. In this way groups of classes are produced. In order not to have too many classes in any of the groups, only the k maximum values are taken (k is usually 1/4 of the number of classes).

Definition 1. The generalised accuracy is a function that gives the probability that examples from a given class p are classified by an ANN into the group to which p belongs.

The normally used notion of accuracy, hereafter referred to as standard accuracy, is a special case of generalised accuracy in which all the classes are singletons. If an ANN has a standard accuracy of 50% for examples from a class pi1, but 95% of these examples are classified into the group pi1 ∪ pi2 ∪ . . . ∪ pil, then by grouping all these classes into one, the generalised accuracy would be 95%. This enables us to construct a strong classifier out of weak ones. Each of the groups corresponds to a new sub-problem which has fewer output classes than the original one. Therefore, we postulate that this problem is easier to solve. A new classifier is constructed for each of the sub-problems. The objective of each sub-classifier is to distinguish among classes that were similar (and therefore grouped together) and harder to discern by the original classifier. In this way the classifier tree represents a hierarchical partition of the original problem. The partition is continued until a satisfying accuracy rate is achieved. During actual classification, an example is classified recursively by the classifiers at each level. Each classifier decides to which of the groups the example belongs, thereby selecting the sub-classifier that the example is passed to.
Leaf classifiers would return the actual answer.

3.3 Group merging

With the partition algorithm described, at each tree level a separate sub-classifier would be constructed for each of the candidate groups. This would result in an enormous number of classifiers. The recursive equation for the number of classifiers can be written as

T(x) = x · T(x/q)   (12)

so, using the "master method" [12] for solving recursive equations, the number of classifiers can be approximated as

M^(log_q M)   (13)

But each of the trained classifiers finds regularities within the training set, i.e. some classes are similar to others. Therefore we propose to reduce the number of classifiers by using Sequential Agglomerative Hierarchical Nesting (SAHN) [7],


which is a bottom-up clustering algorithm. Each group is represented as a binary-valued vector with bits set at the positions that correspond to the classes occurring in that group. SAHN finds the two closest groups, i.e. output-class vectors, and combines them together. Similarity is defined with the Hamming distance

H(x, y) = the number of positions at which the vectors x and y differ   (14)

Groups are merged with SAHN as long as the number of classes in any of the groups does not exceed a threshold. For the threshold we use λ · n, where n is the number of all classes and λ ∈ (0, 1); we have used λ = 0.5. For higher λ we obtain fewer sub-classifiers, but with more classes in each of them. Bigger groups would resemble the original problem more, and therefore less would be achieved by partitioning. New classifiers are constructed only for the resulting groups of classes, i.e. we decrease their number, making the whole tree construction feasible. The resulting hierarchical classifier for the zoo problem [1] is shown in Fig. 1.
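A minimal sketch of this merging procedure, assuming a naive pairwise search for the closest groups and treating the λ · n threshold as a cap on the merged group size (our reading of the stopping rule):

```python
# SAHN-style bottom-up merging sketch: groups are binary vectors over all
# classes; the two closest groups under the Hamming distance are merged
# repeatedly, stopping once the closest merge would exceed λ·n classes.

def hamming(x, y):
    """H(x, y): the number of positions at which the vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def sahn_merge(groups, lam=0.5):
    groups = [tuple(g) for g in groups]
    limit = lam * len(groups[0])          # threshold λ·n on group size
    while len(groups) > 1:
        # find the pair of groups with the smallest Hamming distance
        d, i, j = min((hamming(groups[i], groups[j]), i, j)
                      for i in range(len(groups))
                      for j in range(i + 1, len(groups)))
        merged = tuple(a | b for a, b in zip(groups[i], groups[j]))
        if sum(merged) > limit:           # merged group would exceed λ·n: stop
            break
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return groups

# Four singleton class groups over n = 4 classes, λ = 0.5 (cap of 2 classes):
merged = sahn_merge([(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)])
```

With λ = 0.5 the four singletons collapse into two groups of two classes each, after which any further merge would exceed the cap.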

Fig. 1. The confusion matrix with cells shaded according to the number of examples classified, and the hierarchical classifier for the zoo problem [1].

3.4 Rule generation for selected networks

The rule-extraction algorithm for trained ANN's is based on FERNN (Fast Extraction of Rules from Neural Networks) [9]. The following steps are performed to generate a rule classifier (RC):
1. train a neural network,
2. build a decision tree that classifies the patterns in terms of the network's hidden-unit activation values,
3. generate classification rules.

Neural network training An ANN with a single hidden layer is trained to minimise the augmented cross-entropy error function (15). Each pattern xp is from one of the C possible classes; tip is the target value for pattern p (p = 1, 2, . . . , P) at output unit i; Sip is the ANN's output-unit value; J is the number of hidden units; vij is the weight of the connection from hidden unit j to output unit i, and wjk is the weight of the connection from input unit k to hidden unit j.


The ANN is trained to minimise the augmented cross-entropy error function:

θ(w, v) = F(w, v) − Σ_{i=1..C} Σ_{p=1..P} [ tip log Sip + (1 − tip) log(1 − Sip) ]   (15)

where F(w, v) is a penalty function with positive parameters ε1, ε2, β, added to encourage weight decay:

F(w, v) = ε1 ( Σ_{j=1..J} Σ_{i=1..C} βvij² / (1 + βvij²) + Σ_{j=1..J} Σ_{k=1..K} βwjk² / (1 + βwjk²) ) + ε2 ( Σ_{j=1..J} Σ_{i=1..C} vij² + Σ_{j=1..J} Σ_{k=1..K} wjk² )   (16)

The penalty function causes irrelevant connections to have very small weights. Connections are pruned, beginning from the smallest, for as long as the network error remains acceptable. The result is an ANN which can easily be converted to the RC; it is used for some nodes of the HC.

Construction of a decision tree The decision tree is built using the hidden-unit activations of the correctly classified patterns along with the patterns' class labels. The C4.5 [8] algorithm is used to build the decision tree. In the construction of the HC, we used WEKA's [5] J48 decision tree implementation.

Rule generation A sigmoid function is used for hidden-unit activation. Node-splitting conditions in the decision tree can be written as follows:

if σ( Σ_{k=0..nj} wjk xk ) ≤ Sv then LeftNode else RightNode

By computing the inverse of the sigmoid function σ⁻¹(Sv) for all node-splitting conditions in the decision tree, we obtain conditions that are linear combinations of the input attributes of the data. Below is a set of rules generated from the root classifier of the HC for the zoo problem [1], where the objective is to classify an animal defined with 18 features into one of seven classes (mammals, birds, sea animals, fish, reptiles, insects, mollusks). One can see that the output of the classifier is frequently not a single class, but a group of classes found to be similar. Subsequent classifiers in the HC tree would find the actual classification.

if(+3.03 *"eggs" -2.14 *"milk" -1.0 *"legs" +3.84 *"tail"
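The inverse-sigmoid transformation used in rule generation can be sketched as follows; the function names are ours, and we assume the standard sigmoid, for which σ⁻¹(Sv) = ln(Sv / (1 − Sv)):

```python
# Sketch: a tree split on a hidden unit's sigmoid activation,
# σ(Σ_k w_k·x_k) <= S_v, is rewritten as the linear condition
# Σ_k w_k·x_k <= σ⁻¹(S_v), valid because σ is strictly increasing.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def inverse_sigmoid(s):
    """σ⁻¹(s) = ln(s / (1 − s)) for s in (0, 1)."""
    return math.log(s / (1.0 - s))

def linear_split(weights, s_v):
    """Rewrite 'σ(Σ w_k·x_k) <= s_v' as 'Σ w_k·x_k <= t'; return (weights, t)."""
    return weights, inverse_sigmoid(s_v)

# Illustrative weights over input attributes; threshold S_v = 0.5 maps to t = 0.
w, t = linear_split([3.03, -2.14, -1.0, 3.84], 0.5)
```

Since both sides of the rewrite agree on every input, the extracted rules classify exactly as the decision tree over hidden-unit activations does.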
