International Journal of Knowledge-based and Intelligent Engineering Systems 9 (2005) 239–248 IOS Press


Local voting of weak classifiers

S. Kotsiantis* and P. Pintelas
Educational Software Development Laboratory, Department of Mathematics, University of Patras, P.O. Box 1399, Patra 26500, Greece

*Corresponding author. Tel.: +30 2610 997833; +30 2610 992965; +30 2610 997313; Fax: +30 2610 992965; E-mail: sotos@math.upatras.gr.

Abstract. Many data mining problems involve an investigation of relationships between features in heterogeneous datasets, where different learning algorithms can be more appropriate for different regions. We propose herein a technique of localized voting of weak classifiers. This technique identifies local regions which have similar characteristics and then uses the votes of each local expert to describe the relationship between the data characteristics and the target class. We performed a comparison with other well-known combining methods on standard benchmark datasets and the accuracy of the proposed method was greater.

Keywords: Machine learning, data mining, classification, ensembles of classifiers

1. Introduction

In recent years researchers have continuously argued for the benefits of using multiple classifiers to solve complex classification problems. The main idea behind combining classifiers is based on the assumption that different classifiers, which use different data representations, different concepts and different modeling techniques, are most likely to arrive at classification results with different patterns of generalization [15]. Visible evidence of such diversity among classifiers takes the form of different occurrences of classification errors for different classifiers over a set of input instances. Most combination functions benefit from disagreement regarding the errors of the individual learning algorithms: the greater this disagreement, the lower the impact of individual errors on the final classification, and therefore the lower the combined classification error [7]. When the size of the training set is small compared to the complexity of the classifier, the learning algorithm frequently overfits the noise in the training set. Thus, effective control of the complexity of a learning algorithm plays a basic role in achieving good generalization.

Some theoretical and experimental results [23] indicate that a local learning algorithm provides a practicable solution to this problem. In local learning, each local model is trained independently of all other local models, so that the total number of local models in the learning system does not directly affect how complex a function can be learned. This property avoids overfitting, provided that a robust learning scheme exists for training the individual local models. To the best of the authors' knowledge, a characteristic of all the previously proposed ensemble methods is that they work globally [7]. In this study, on the contrary, we propose a technique that combines learning methods locally. Local learning [2] is an extension of instance-based learning. Local ensembles wait until they have seen the test instances before making a prediction. This allows the ensemble to make predictions based on the specific training instances that are most similar to the test instances. Local learning can be understood as a general theory that allows extending learning algorithms, in the case of complex data for which the algorithm's assumptions would not necessarily hold globally, so that they become valid locally. A simple example is the assumption of linear separability, which in general is not satisfied globally in classification problems. Yet any learning algorithm able to find only a linear separation can be used inside a local learning procedure, yielding an algorithm able to model complex non-linear class boundaries.




In this study, we propose a local voting scheme of classifiers. We performed a comparison with other well-known combining methods on standard benchmark datasets, and the proposed technique performed more accurately. For the experiments, three weak algorithms from three well-known machine learning techniques (decision trees, Bayesian networks and rule learners) were used: in more detail, we used OneR [12], Decision Stump [13] and Naïve Bayes [8]. Using weak classifiers with the proposed ensemble has at least two benefits. Firstly, weak classifiers are less likely to suffer locally from over-fitting problems, since they avoid learning outliers or, quite possibly, a noisy decision boundary. Secondly, the training time for generating multiple weak classifiers is often less than that for training even one strong classifier. In the next section, current ensemble approaches are described. In Section 3 we describe the proposed method and investigate its advantages and limitations. In Section 4, we evaluate the proposed method on several UCI datasets by comparing it with other ensemble methods. Finally, Section 5 concludes the paper and suggests directions for further research.

2. Ensembles of classifiers

No single learning algorithm can uniformly outperform other algorithms over all data sets. When faced with the question "Which algorithm will be most accurate on a particular classification problem?", the predominant approach is to estimate the accuracy of the candidate algorithms on the problem, usually by means of cross-validation, and select the one that appears to be most accurate. This approach was investigated in a small study with three learning algorithms on five UCI datasets in [20]. Schaffer's [20] conclusions are that, on the one hand, this procedure is on average better than working with a single learning algorithm but, on the other hand, the cross-validation procedure often picks the wrong base algorithm on individual problems. One of the most active areas of research in supervised learning has been the study of methods for constructing good ensembles of classifiers. The main discovery is that ensembles are often much more accurate than the individual classifiers that make them up. Although classifier combination is widely applied in many fields, theoretical analysis of combination schemes can be very difficult. The net result is that only simple combinations have been explained up to now, and mostly from a theoretical point of view [15]. In many cases the performance of a combination method cannot be accurately estimated theoretically but can only be evaluated on an experimental basis under specific working conditions (a specific set of classifiers, training data and sessions, etc.).
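The selection-by-cross-validation procedure discussed above can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: it uses scikit-learn estimators and a bundled dataset purely for convenience (the paper's experiments were run with the Weka implementations [26]), and the candidate learners are arbitrary stand-ins.

```python
# Hedged sketch of selecting the apparently best learner by cross-validation
# (a BestCV-style procedure in the spirit of [20]); scikit-learn is assumed
# purely for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
candidates = {
    "naive_bayes": GaussianNB(),
    "decision_stump": DecisionTreeClassifier(max_depth=1),  # weak stand-in learner
    "shallow_tree": DecisionTreeClassifier(max_depth=3),
}
# Estimate each candidate's accuracy by 10-fold cross-validation and keep the
# one that appears most accurate on this particular dataset.
scores = {name: cross_val_score(clf, X, y, cv=10).mean() for name, clf in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores, "-> selected:", best_name)
```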

Numerous ensemble methods have been proposed in the literature. Ensembles can be grouped, according to the representation of the input patterns, into two large groups: those whose members use the same representation of the inputs and those whose members use different representations. Bagging [4] is a "bootstrap" ensemble method that creates individual classifiers by training the same learning algorithm on a random redistribution of the training set. Each classifier's training set is generated by randomly drawing, with replacement, N instances, where N is the size of the original training set; many of the original instances may be repeated in the resulting training set while others may be left out. After the construction of several classifiers, the final prediction is made by taking a vote over the predictions of each classifier. Boosting keeps track of the performance of the learning algorithm and concentrates on instances that have not been correctly learned. It chooses training instances in such a manner as to favour the instances that have not been accurately learned. After several cycles, the prediction is performed by taking a weighted vote of the predictions of each classifier, with the weights being proportional to each classifier's accuracy on its training set. AdaBoost is a practical implementation of the boosting approach [11]. MultiBoosting [24] is a method that can be thought of as utilising wagging committees formed by AdaBoost. Wagging is a variant of bagging: bagging uses resampling to obtain the datasets for training and producing a weak hypothesis, whereas wagging uses reweighting of each training example, pursuing the effect of bagging in a different way. Output Coding (OC) methods break down a multiclass classification problem into a set of two-class subproblems, and then reconstitute the original problem by combining the subproblems' outputs to obtain the class label. An equivalent way of thinking about these methods consists in encoding each class as a bit string (named a codeword), and in training a different two-class base learner to separately learn each codeword bit. Error Correcting Output Coding [6] is the most studied OC method. This analytical method tries to improve the error-correcting capabilities of the codes generated by the breakdowns through the maximization of the minimum distance between each pair of codewords.
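As a concrete illustration of the bagging procedure just described, the hedged sketch below draws N instances with replacement for each committee member and combines the members by a plain majority vote. The learner and dataset are arbitrary stand-ins, not the ones used in the paper.

```python
# Illustrative bagging sketch: each member is fitted on a bootstrap sample
# (N draws with replacement from the training set) and the final prediction
# is the majority vote over the members.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
members = []
for _ in range(25):                                    # 25 bootstrap replicates
    idx = rng.integers(0, len(X_tr), size=len(X_tr))   # N instances drawn with replacement
    members.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

votes = np.stack([m.predict(X_te) for m in members])   # shape: (n_members, n_test)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("bagged accuracy:", (majority == y_te).mean())
```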


In another study [18], a meta-learner (DECORATE), which uses a learner that provides high accuracy on the training data to build a diverse committee, was presented. This is accomplished by adding different randomly constructed examples to the training set when building new committee members. These artificially constructed examples are given category labels that disagree with the current decision of the committee, thereby directly increasing diversity when a new classifier is trained on the augmented data and added to the committee. Another approach for building ensembles of classifiers is to use a variety of data mining algorithms on all of the training data and combine their predictions. Among the combination techniques, majority vote is the simplest to implement, since it requires no prior training [14]. If we have a dichotomic classification problem and L hypotheses whose error is lower than 0.5, then the resulting majority voting ensemble has an error lower than that of a single classifier, as long as the errors of the base learners are uncorrelated. Stacked generalization [27], or stacking, is another technique. Stacking combines multiple learning algorithms to induce a higher-level classifier with improved classification accuracy. A learning algorithm is used to determine how the outputs of the base classifiers should be combined. The original dataset constitutes the level-zero data, and all the base learning algorithms run at this level. The level-one data are the outputs of the base learning algorithms. Another learning process occurs using the level-one data as input and the final classification as output. In [22], base-level learning algorithms whose predictions are probability distributions over the set of class values, rather than single class values, are used. The meta-level attributes are thus the probabilities of each of the class values returned by each of the base-level learning algorithms. Multi-response linear regression (MLR) was then used for meta-level learning. Recently, other authors have modified the method so as to use only the class probabilities associated with the true class, and the accuracy seems to have improved [21].
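To make the level-zero/level-one construction concrete, here is a hedged sketch of stacked generalization along the lines of [22]: the level-one attributes are the class-probability distributions of the base learners, obtained on held-out folds, and a logistic-regression model stands in for the MLR meta-learner. The base learners and dataset are illustrative assumptions, not the paper's setup.

```python
# Hedged stacking sketch: level-one data = out-of-fold class probabilities of the
# base learners; logistic regression stands in for the MLR meta-learner of [22].
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base_learners = [GaussianNB(), DecisionTreeClassifier(max_depth=1)]

# Level-one attributes: each base learner's probability distribution over the classes,
# produced on held-out folds so the meta-learner does not see resubstitution outputs.
meta_X = np.hstack([cross_val_predict(b, X, y, cv=5, method="predict_proba")
                    for b in base_learners])
meta_learner = LogisticRegression(max_iter=1000).fit(meta_X, y)

# At prediction time the base learners are refit on all level-zero data and their
# probability outputs are stacked and passed to the meta-learner.
level_one_new = np.hstack([b.fit(X, y).predict_proba(X) for b in base_learners])
print("training accuracy of the stacked model:", meta_learner.score(level_one_new, y))
```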

3. Proposed technique

All the ensemble techniques described in Section 2 work globally: since all training instances are taken into account when classifying a new test case, the ensemble of classifiers works as a global method. However, in order to generalize, global learning methods tend to overlook local singularities. On the other hand, local learning methods do not suffer from this limitation and are better suited for the classification of examples close to class boundaries, where local structure becomes most important.


In truth, local learning is not a new concept; it appeared in the early years of pattern recognition. The obvious example is the k-nearest neighbor method: given a testing instance, its class is estimated from the closest instances in the training set. A list of objections to k-nearest neighbor algorithms might include concerns regarding the following: a) the voting used to combine the classes of the nearest k instances, b) the uniform neighborhood shape (spherical) regardless of instance location and c) the uniform weight given to all features, instances and neighbors. To the best of the authors' knowledge, only one local ensemble technique has been presented so far [17]. It used boosting of localized weak classifiers. It is known that simple boosting algorithms are sensitive to noise; in the case of local boosting, the algorithm could handle reasonable noise and was at least as good as boosting, if not better, on noise-free data as well. However, that local ensemble technique was based on a single learning algorithm. In the current study, we propose to use a variety of learning algorithms locally and to combine their predictions. To be more specific, we propose to build a number of classifiers for each point to be estimated, taking into account only a subset of the training points. This subset is chosen on the basis of a selected distance metric between the testing point and the training points in the input space. For each testing point, a vote of the classifiers is then taken using only the training points lying close to the current testing point. As one can understand, the proposed method is quite general: it can be combined with a number of different learning algorithms, aggregation rules and distance metrics. In this study, we chose to use weak classifiers. Using weak classifiers with the proposed ensemble has at least two benefits. Firstly, weak classifiers are less likely to suffer from over-fitting problems, since they avoid learning outliers or, quite possibly, a noisy decision boundary, as mentioned earlier. Although simple aggregation rules compete with more complex aggregation schemes involving second-level training, they are susceptible to incorrect estimates of confidence by the individual classifiers [9]; such confidences might be generated when using over-fitted, sub-optimal classifiers. Secondly, the training time for generating multiple weak classifiers is often less than that for training even one strong classifier, as was also previously mentioned.



[Fig. 1. Proposed Technique. Learning phase: a subset of the training set, selected for the testing instance (x, ?), is used to train NB, OneR and DS, which produce the hypotheses h1, h2 and h3. Application phase: the testing instance is labelled (x, y*) using h* = SumRule(h1, h2, h3).]

Each classifier (NB, OneR, DS) generates a hypothesis h1, h2, h3, respectively, for each testing instance, using a subset of the training set (see Fig. 1). The a-posteriori probabilities generated by the individual classifiers are correspondingly denoted p1(i), p2(i), p3(i) for each output class i. Next, the class represented by the maximum sum of the a-posteriori probabilities is selected as the voting hypothesis (h*). The predicted class is computed by the rule:

Predicted Class = arg max_i Σ_{j=1}^{3} p_j(i),   i = 1, ..., number of classes.

Using a voting methodology as the aggregation rule with the weak classifiers in the proposed algorithm, we expect to obtain good results, based on the belief that the majority of classifiers are more likely to be correct in their decision when they agree in their opinion. Voters can express the degree of their preference using a confidence score, i.e. the probabilities of the classifiers' predictions. In the current implementation, the sum rule is used: each voter gives the probability of its prediction for each candidate, all confidence values are then added for each candidate, and the candidate with the highest sum wins the election. It must be mentioned that the sum rule was preferred since it is one of the best voting methods for classifier combination according to [9], and the number of training instances available to each local learner does not allow second-level learning machines on top of the set of base learners, such as stacking. As is well known, voting methods fail if the weak learners cannot achieve at least 50% accuracy on the specific data set. For this reason, the proposed method reduces multi-class problems to a set of binary problems. We used the One-Per-Class (OPC) approach: a particular classifier is trained with all of the examples of the chosen class as positive labels and all other examples as negative labels, and the final output is the class that corresponds to the classifier with the highest output value. For the current implementation we used the three most common weak machine learning algorithms, OneR [12], Decision Stump [13] and Naïve Bayes [8], as learners. These weak classifiers combine rapid learning with acceptable classification accuracy.
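A minimal sketch of the One-Per-Class decomposition combined with the sum rule follows. It is an illustration under assumptions: scikit-learn's one-vs-rest wrapper plays the role of the OPC scheme, and since OneR has no scikit-learn counterpart a second decision stump stands in for it.

```python
# Hedged sketch of the sum rule over OPC-decomposed weak learners: each weak
# learner is wrapped one-vs-rest, the experts' a-posteriori probabilities p_j(i)
# are summed per class i, and the class with the largest sum is predicted.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
weak_learners = [GaussianNB(),                                              # Naive Bayes
                 DecisionTreeClassifier(max_depth=1),                       # decision stump
                 DecisionTreeClassifier(max_depth=1, criterion="entropy")]  # OneR stand-in

# Each weak learner is wrapped in a one-per-class (one-vs-rest) scheme and fitted.
experts = [OneVsRestClassifier(w).fit(X, y) for w in weak_learners]

summed = sum(e.predict_proba(X) for e in experts)         # sum rule over the experts
predicted = experts[0].classes_[summed.argmax(axis=1)]    # class with the highest summed probability
print("agreement with the labels on the training data:", (predicted == y).mean())
```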

In other words, the proposed ensemble consists of the five steps in Fig. 2. The proposed ensemble still has some free parameters, such as the distance metric. In our experiments, we used the most well known, the Euclidean similarity function, as the distance metric. For two data points X = <x1, x2, x3, ..., x_{n−1}> and Y = <y1, y2, y3, ..., y_{n−1}>, the Euclidean similarity function is defined as d²(X, Y) = Σ_{i=1}^{n−1} (x_i − y_i)². The proposed algorithm also requires choosing the value of K. There are several ways to do this. A first, simple solution is to fix K a priori, before the beginning of the learning process. However, the best K for a specific dataset is obviously not the best one for another dataset. A second, more time-consuming solution is therefore to determine this best K automatically through the minimization of a cost criterion. The idea is to apply a model selection process over the different hypotheses that can be built. One way to do that is to evaluate the estimation error on a test set and keep as K the value for which the error is smallest. In the current implementation we decided to use a fixed value of K (= 50) in order to a) keep the training time low and b) because about this number of instances is appropriate for a simple algorithm to build a precise model, according to [10].
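The model-selection alternative for K mentioned above can be sketched as follows. This is a hedged illustration: a plain K-nearest-neighbour classifier stands in for the local ensemble simply to keep the snippet self-contained, and the grid of K values and the dataset are arbitrary.

```python
# Hedged sketch of choosing K by minimising the estimation error on a held-out set,
# instead of fixing K = 50 a priori; a plain k-NN classifier is used as a stand-in
# for the local ensemble so that the example stays self-contained.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

errors = {}
for k in (10, 25, 50, 75):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)  # local model with neighbourhood size k
    errors[k] = np.mean(model.predict(X_val) != y_val)           # estimation error on the held-out set
best_k = min(errors, key=errors.get)
print("validation error per K:", errors, "-> chosen K:", best_k)
```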

4. Experiments

We experimented with 31 datasets from the UCI repository [3]. These datasets cover many different types of problems, having discrete, continuous and symbolic variables; some datasets have missing values, and some have a mixture of all the above variable types.

Table 1. The used datasets

Dataset          Instances  Categorical features  Numerical features  Classes
audiology          226          69                    0                 24
autos              205          10                   15                  6
badges             294           4                    7                  2
balance-scale      625           0                    4                  3
breast-cancer      286           9                    0                  2
breast-w           699           0                    9                  2
colic              368          15                    7                  2
credit-g          1000          13                    7                  2
credit-rating      690           9                    6                  2
diabetes           768           0                    8                  2
glass              214           0                    9                  6
grub-damage        155           6                    2                  4
haberman           306           0                    3                  2
heart-c            303           7                    6                  5
heart-h            294           7                    6                  5
heart-statlog      270           0                   13                  2
hepatitis          155          13                    6                  2
ionosphere         351          34                    0                  2
iris               150           0                    4                  3
lymphography       148          15                    3                  4
monk3              122           6                    0                  2
primary-tumor      339          17                    0                 21
sonar              208           0                   60                  2
soybean            683          35                    0                 19
students           344          11                    0                  2
titanic           2201           3                    0                  2
vehicle            846           0                   18                  4
vote               435          16                    0                  2
vowel              990           3                   10                 11
wine               178           0                   13                  3
zoo                101          16                    1                  7

We used data sets from the domains of pattern recognition (iris, zoo), image recognition (ionosphere, sonar), medical diagnosis (breast-cancer, breast-w, colic, diabetes, heart-c, heart-h, heart-statlog, hepatitis, lymphography, primary-tumor), commodity trading (autos, credit-rating, credit-g), computer games (badges, monk3), various control applications (balance), agricultural applications (grub-damage, wine) [5] and prediction of student dropout (students) [16]. The specific datasets are listed in Table 1. In order to calculate the classifiers' accuracy, the whole training set was divided into ten mutually exclusive and equal-sized subsets, and for each subset the classifier was trained on the union of all the other subsets. This cross-validation was then run 10 times for each algorithm, and the average value of the 10 cross-validations was calculated. It must be mentioned that for most of the algorithms we used the freely available source code of [26] in our experiments. We conducted three experiments. During the first experiment we compared the proposed ensemble with the plain classifiers NB, DS and OneR, with their local versions using 50 instances as the local region, and with simple 50-Nearest Neighbors [1].


During the second experiment, we compared the proposed ensemble with the boosting versions of the NB, DS and OneR classifiers, as well as with the local boosting versions of the same classifiers [17]. We did not test Bagging since in [4] the important observation was made that instability (responsiveness to changes in the training data) is a prerequisite for bagging to be effective; for this reason bagging is not effective with weak learners that have a strong bias. We also did not test Decorate ensembles, since this method needs a strong learner as the base classifier [18]. Finally, during the third experiment, we compared the proposed ensemble methodology with other well-known ensembles, such as voting and stacking, using the same learning algorithms as base learners. As already mentioned, during the first experiment we empirically compared the proposed ensemble with the plain classifiers NB, DS and OneR, with their local versions using 50 instances as the local region, and with simple 50NN [1]. Throughout, we speak of two results for a dataset as being "significantly different" if the difference is statistically significant at the 5% level according to the corrected resampled t-test [19], with each pair of data points consisting of the estimates obtained in one of the 100 folds for the two learning methods being compared. In the last rows of the tables, the aggregated results are given in the form (a/b/c): "a" means that the proposed ensemble is significantly less accurate than the compared algorithm in a out of 31 datasets, "c" means that the proposed algorithm is significantly more accurate than the compared algorithm in c out of 31 datasets, while in the remaining b cases there is no significant statistical difference (draws). The presented ensemble is significantly more accurate than single NB in 10 out of the 31 datasets, while it has a significantly higher error rate in 5 datasets. What is more, the proposed ensemble is significantly more accurate than DS and OneR in 21 and 20 out of the 31 datasets, respectively, whilst it has a significantly higher error rate in one dataset.
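For reference, the significance test used above can be sketched as follows. This is a hedged illustration of the corrected resampled t-test of Nadeau and Bengio [19]: the variance of the per-fold accuracy differences is inflated by a factor that accounts for the overlap of the training sets across the 100 folds, and the per-fold accuracies below are synthetic, made up purely to show the call.

```python
# Hedged sketch of the corrected resampled t-test [19]: the per-fold accuracy
# differences d are compared to zero, with the usual variance inflated by
# (1/k + n_test/n_train) to compensate for the overlapping training sets.
import numpy as np
from scipy import stats

def corrected_resampled_ttest(acc_a, acc_b, test_train_ratio=1.0 / 9.0):
    d = np.asarray(acc_a) - np.asarray(acc_b)     # differences over the k = 100 folds
    k = len(d)
    t = d.mean() / np.sqrt((1.0 / k + test_train_ratio) * d.var(ddof=1))
    p = 2 * stats.t.sf(abs(t), df=k - 1)          # two-sided p-value
    return t, p

# Synthetic per-fold accuracies standing in for a 10x10-fold cross-validation.
rng = np.random.default_rng(0)
acc_local_voting = 0.85 + 0.02 * rng.standard_normal(100)
acc_other = 0.83 + 0.02 * rng.standard_normal(100)
print(corrected_resampled_ttest(acc_local_voting, acc_other))
```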



1) Determine a suitable distance metric.
2) Find the k nearest neighbors of the testing instance using the selected distance metric.
3) Train the weak classifiers using these k instances as the training set.
4) Apply sum-rule voting to the decisions of the weak classifiers.
5) The answer of the voting ensemble is the prediction for the testing instance.
Fig. 2. Local Voting ensemble.
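A compact sketch of these five steps follows, under stated assumptions: scikit-learn stand-ins replace the Weka learners (GaussianNB for Naive Bayes, one depth-1 tree for the decision stump and, since OneR is not available there, a second depth-1 tree), the Euclidean metric is used, K defaults to 50 as in the paper's implementation, and the OPC reduction of Section 3 is omitted for brevity.

```python
# Hedged sketch of the Local Voting ensemble of Fig. 2 (OPC reduction omitted;
# the weak learners are scikit-learn stand-ins for the Weka implementations).
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

WEAK_LEARNERS = [GaussianNB(),                                              # Naive Bayes
                 DecisionTreeClassifier(max_depth=1),                       # decision stump
                 DecisionTreeClassifier(max_depth=1, criterion="entropy")]  # OneR stand-in

def local_vote_predict(X_train, y_train, X_test, k=50):
    classes = np.unique(y_train)
    # Steps 1-2: Euclidean distance metric, k nearest neighbours of every test instance.
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(X_train)
    neighbours = nn.kneighbors(X_test, return_distance=False)

    predictions = np.empty(len(X_test), dtype=classes.dtype)
    for i, idx in enumerate(neighbours):
        X_loc, y_loc = X_train[idx], y_train[idx]          # local training set for this instance
        summed = np.zeros(len(classes))
        for learner in WEAK_LEARNERS:
            model = clone(learner).fit(X_loc, y_loc)       # step 3: train each weak classifier locally
            proba = model.predict_proba(X_test[i:i + 1])[0]
            for cls, p in zip(model.classes_, proba):      # step 4: sum rule over the experts
                summed[np.searchsorted(classes, cls)] += p
        predictions[i] = classes[summed.argmax()]          # step 5: highest summed probability wins
    return predictions

# Usage example on a UCI-style dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
print("local voting accuracy:", (local_vote_predict(X_tr, y_tr, X_te) == y_te).mean())
```

Because the weak learners are refit on a K-instance neighbourhood for every test case, the per-instance training cost stays low, but, as noted in the conclusion, the whole training set must be stored and queried at prediction time.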

Likewise, the proposed ensemble is significantly more accurate than local DS and local OneR in 11 and 10 out of the 31 datasets, respectively, whilst it has a significantly higher error rate in no dataset. Furthermore, the presented ensemble is significantly more accurate than local NB in 6 out of the 31 datasets, whilst it has a significantly higher error rate in 4 datasets. Finally, the presented ensemble is significantly more accurate than the simple 50NN algorithm in 14 out of the 31 datasets, whilst it has a significantly higher error rate in two datasets. Subsequently, we empirically compared the proposed ensemble with the boosting versions of the NB, DS and OneR classifiers using 25 sub-classifiers, as well as with the local boosting versions of the same classifiers [17]. In the last rows of Table 3 one can see the aggregated results. The proposed algorithm is significantly more accurate than boosting DS with 25 classifiers in 11 out of the 31 data sets; in 2 other data sets, the proposed algorithm has a significantly higher error rate. In addition, the proposed ensemble is significantly more accurate than boosting OneR with 25 classifiers in 15 out of the 31 data sets, while on no data set does it have a significantly higher error rate. Furthermore, the proposed ensemble is significantly more accurate than boosting NB with 25 classifiers in 6 out of the 31 data sets, while on 4 data sets it has a significantly higher error rate. The proposed ensemble is significantly more accurate than the local boosting DS and local boosting OneR ensembles in 5 out of the 31 data sets; on the other hand, in no data set does the proposed ensemble have a significantly higher error rate. Moreover, the proposed ensemble is significantly more accurate than local boosting NB in 8 out of the 31 data sets, while on two data sets it has a significantly higher error rate. After that, we compared the proposed ensemble methodology with other well-known ensembles using the same base learners:
– The methodology of selecting the best classifier among NB, DS and OneR according to 3-fold cross-validation (BestCV) [20].

– Voting using NB, DS and OneR as base classifiers [14].
– Stacking using NB, DS and OneR as base classifiers with MLR as the meta-classifier [22].
– StackingC using NB, DS and OneR as base classifiers and only the class probabilities associated with the true class as the meta-data set [21].
In the last row of Table 4 one can see the aggregated results. The proposed ensemble is significantly more accurate than stacking with MLR and than StackingC in six out of the 31 data sets, while it is significantly less accurate in two data sets. Similarly, the proposed ensemble is significantly more accurate than BestCV in 8 out of the 31 data sets, while it is significantly less accurate in 4 data sets. Moreover, the proposed ensemble is significantly more accurate than simple voting in 14 out of the 31 data sets, while on 1 data set it has a significantly higher error rate. Figure 3 shows the advantage of Local Voting over Simple Voting as (accuracy of Local Voting)/(accuracy of Simple Voting) for each dataset; above the solid line, Local Voting is better than Simple Voting, while below it, it is worse. Similarly, Fig. 4 shows the advantage of Local Voting over Stacking as (accuracy of Local Voting)/(accuracy of Stacking) for each dataset; above the solid line, Local Voting is better than Stacking, while below it, the opposite is true. In conclusion, our approach performs better than the existing ensembles tested: the average relative accuracy improvement of the proposed methodology is more than 2% in relation to the other methods scrutinised. The reason for the good performance of the proposed technique is that local voting is more task-oriented, since it omits an intermediate modelling step in classification tasks. It does not intend to build an accurate model that fits the observed data globally: this local ensemble tries to employ a subset of input points around the separating hyperplane, while global ensembles try to describe the overall phenomena by utilizing all input points.



Table 2. Comparing the proposed ensemble with the plain classifiers NB, DS, OneR and their local versions, as well as with simple 50NN (accuracy, %)

Dataset          Local voting  Local NB  Local DS  Local OneR  NB       DS       OneR     50NN
audiology        82.27         77.25     70.59     72.31       72.64    46.46    46.46    35.95
autos            81.07         75.93     69.86     76.49       57.41    44.9     61.77    48.18
badges           99.86         96.54     99.97     28.55       99.66    100      28.55    99.69
balance-scale    88.7          90.18     82.9      87.73       90.53    56.72    57.09    89.01
breast-cancer    73.59         72.82     73.38     72.69       72.7     69.27    66.91    70.75
breast-w         96.45         96.32     96.22     96.27       96.07    92.33    92.01    95.9
colic            82.39         81.77     81.3      81.17       78.7     81.52    81.52    84.04
credit-g         72.76         75.23     71.54     70.93       75.16    70       66.13    71.96
credit-rating    85.39         82.87     82.35     83.59       77.86    85.51    85.51    86.16
diabetes         71.9          71.84     72.36     69.76       75.75    71.8     71.98    74.68
glass            75.48         72.66     69.44     68.52       49.45    44.89    56.84    56.16
grub-damage      43.86         44.7      44.28     39.01       49.76    33.2     42.15    44.21
haberman         69.89         71.3      70.65     68.88       75.06    71.57    72.53    72.91
heart-c          80.53         81.36     78.68     79.23       83.34    72.93    72.53    81.58
heart-h          79.68         80.66     78.31     79.14       83.95    81.78    80.69    83.98
heart-statlog    78.85         80.41     74.63     76.3        83.59    72.3     71.26    83.74
hepatitis        84.71         86.18     83.76     82.02       83.81    77.62    82.05    79.38
ionosphere       88.24         81.91     87.56     88.24       82.17    82.57    82.59    71.65
iris             94.53         95.67     93.8      94          95.53    66.67    93.53    90.53
lymphography     82            82.95     75.61     79.6        83.13    75.31    74.77    80.59
monk3            92.4          91.66     92.64     92.4        93.45    76.01    77.88    82.46
primary-tumor    44.9          44.46     43.28     43.8        49.71    28.91    27.74    39.26
sonar            82.45         86.6      77.81     73.57       67.71    72.25    62.12    68.25
soybean          94.86         93.29     91.98     92.25       92.94    27.96    39.75    62.34
students         84.6          83.38     84.02     84.36       85.7     87.22    87.22    78.89
titanic          78.95         78.95     78.94     78.95       77.85    77.6     77.6     77.56
vehicle          72.01         74.94     69.12     67.37       44.68    39.81    52.36    63.47
vote             96            95.49     95.88     95.47       90.02    95.63    95.63    90.41
vowel            91.23         94.56     36.35     74.81       62.9     17.47    33.05    8.88
wine             97.07         98.92     95.67     94.83       97.46    57.91    77.93    96.46
zoo              97.11         97.53     88.05     42.59       94.97    60.43    42.59    55.11
Average accuracy 82.06         81.88     77.77     75.32       78.18    65.76    66.48    71.75
W/D/L                          4/21/6    0/20/11   0/21/10     5/16/10  1/9/21   1/10/20  2/15/14

That is why local voting is more direct, which results in more accurate and efficient performance. A local ensemble has significant advantages when the probability measure defined on the space of features for each class is very complex but can still be described by a collection of less complex local approximations. Another reason why local voting outperforms other ensembles is small disjuncts. A target concept, i.e. a class, is highly disjunctive when the instances of that class are very dissimilar globally and similar locally. In data sets containing highly disjunctive target concepts, similar instances of the same class are only found in small clusters. The expectation is that a local ensemble will generally perform better than global ensembles on such data, since it retains all information concerning disjuncts (no matter how small), while global methods tend to overgeneralise and overlook important small disjuncts.


5. Conclusion

One of the main difficulties found in global learning methodologies is the model selection problem. More precisely, no matter what one's approach, one still needs to select a suitable and appropriate model and its parameters in order to represent the observed data. This is still an open and ongoing research topic. Some researchers have argued that it is difficult, if not impossible, to obtain general and accurate global learning. Hence, local learning has recently attracted much interest. Local learning algorithms defer processing of the dataset until they receive requests for classification: a dataset of observed input-output data is always kept, and the estimate for a new operating point is derived from an interpolation based on the neighborhood of the query point. Our experiments on several UCI datasets show that the local voting ensemble method outperforms the other combining methods we tried, as well as any individual classifier.


Table 3. Comparing the proposed ensemble with the boosting versions of NB, DS, OneR as well as with their local boosting versions (accuracy, %)

Dataset          Local Voting  Adaboost DS  Adaboost NB  Adaboost OneR  Local boost DS  Local boost NB  Local boost OneR
audiology        82.27         46.46        78.2         46.46          73.12           77.22           74.53
autos            81.07         44.9         57.12        65.47          75.49           77.48           80.87
badges           99.86         100          99.66        28.55          100             93.68           28.55
balance-scale    88.7          71.77        92.11        72.81          84.82           84.33           88.3
breast-cancer    73.59         71.55        68.57        69.49          72.75           68.36           72.58
breast-w         96.45         95.28        95.55        95.55          95.9            95.91           96
colic            82.39         82.72        77.46        81.17          78.91           77.47           80.05
credit-g         72.76         72.6         75.09        64.35          72.92           71.09           71.46
credit-rating    85.39         85.57        81.16        81.86          84.94           80.17           82.45
diabetes         71.9          75.37        75.88        69.91          73.77           71.22           70.2
glass            75.48         44.89        49.63        56.56          70.39           71.88           71.7
grub-damage      43.86         33.2         49.9         41.22          42.43           40.48           39.65
haberman         69.89         74.06        73.94        71.65          68.9            68.47           65.73
heart-c          80.53         83.11        83.14        72.61          80.4            78.62           77.33
heart-h          79.68         82.42        84.67        76.59          79.06           78.49           78.04
heart-statlog    78.85         81.81        82.3         73.59          78.15           78.07           77.78
hepatitis        84.71         81.5         84.23        77.41          84.45           83.98           82.62
ionosphere       88.24         92.34        91.12        88.43          90.01           86.81           90.09
iris             94.53         95.07        95.07        93.67          94.47           95.87           93.33
lymphography     82            75.44        80.67        79.83          84.6            85.58           84.63
monk3            92.4          90.92        91.07        90.67          90.68           87.21           90.07
primary-tumor    44.9          28.91        49.71        27.38          43.22           44.45           44.04
sonar            82.45         81.06        81.21        65.43          83.89           86.47           77.85
soybean          94.86         27.96        92.02        40.48          92.99           91.95           93.05
students         84.6          87.16        85.12        86.52          82.86           80.4            82.83
titanic          78.95         77.83        77.86        77.74          79.05           78.95           78.97
vehicle          72.01         39.81        44.68        51.63          70.98           77.77           70.89
vote             96            96.41        95.19        96.62          96.02           94.62           95.79
vowel            91.23         17.47        81.32        31.13          76.46           95.66           81.03
wine             97.07         91.57        96.57        92.07          97.47           98.65           96.46
zoo              97.11         60.43        97.23        42.59          88.84           97.31           42.59
Average accuracy 82.06         70.63        79.60        68.05          80.26           80.60           76.11
W/D/L                          2/18/11      4/21/6       0/16/15        0/26/5          2/21/8          0/26/5

[Fig. 3. Local Voting vs. Simple Voting: the improvement of Local Voting over Simple Voting in each dataset and vice versa (accuracy ratio plotted per dataset, 1–31).]

Due to the encouraging results obtained from these experiments, we can expect that the proposed technique can be effectively applied to classification tasks in real-world cases and perform more accurately than traditional data mining approaches.

The benefit of voting multiple local models is somewhat offset by the cost of storing and querying the training dataset for each test instance, which means that lazy ensembles do not scale well to very large datasets.



Table 4. Comparing the proposed ensemble with BestCV, Voting, Stacking and StackingC (accuracy, %)

Dataset          Local voting  StackingC  Best-3CV  Stacking with MLR  Voting
audiology        82.27         74.09      72.64     66.11              46.46
autos            81.07         65.31      57.43     73.78              66.45
badges           99.86         100        99.83     100                99.66
balance-scale    88.7          89.89      90.53     90.03              57.28
breast-cancer    73.59         72.63      71.19     72.24              67.26
breast-w         96.45         95.99      96.07     95.99              94.48
colic            82.39         83.88      81.3      83.88              81.52
credit-g         72.76         75.15      75.16     75.13              69.82
credit-rating    85.39         85.51      85.51     85.51              85.51
diabetes         71.9          75.51      75.7      75.62              72.72
glass            75.48         62.55      55.22     63.48              60.21
grub-damage      43.86         50.55      49.76     48.48              41.89
haberman         69.89         73.13      74.4      73.07              72.56
heart-c          80.53         83.28      83.34     83.51              73.19
heart-h          79.68         83.78      83.95     83.95              81.58
heart-statlog    78.85         83.41      83.59     83.63              72.67
hepatitis        84.71         84.12      83.28     84.32              82.18
ionosphere       88.24         89.89      82.02     89.89              90.03
iris             94.53         95.13      95.13     95.13              93.53
lymphography     82            81.98      83.13     82.25              75.91
monk3            92.4          93.45      93.45     93.45              78.45
primary-tumor    44.9          49.29      49.71     46.64              28.21
sonar            82.45         70.19      68.74     69.75              68.98
soybean          94.86         93         92.94     93.06              64.2
students         84.6          87.04      86.98     86.99              87.22
titanic          78.95         77.83      77.76     77.83              77.6
vehicle          72.01         56.77      52.4      58.06              46.37
vote             96            95.63      95.63     95.56              95.63
vowel            91.23         63.77      62.9      64.14              35.74
wine             97.07         97.69      97.46     97.63              86.02
zoo              97.11         95.95      94.97     94.87              95.76
Average accuracy 82.06         80.21      79.10     80.13              72.55
W/D/L                          2/23/6     4/19/8    2/23/6             1/16/14

[Fig. 4. Local Voting vs. Stacking: the improvement of Local Voting over Stacking in each dataset and vice versa (accuracy ratio plotted per dataset, 1–31).]

For this reason, in a future project we will focus on the problem of reducing the size of the stored set of instances while trying to maintain or even improve generalization accuracy by avoiding noise and over-fitting. Numerous instance selection methods that can be combined with the proposed technique can be found in [25].

References

[1] D. Aha, Lazy Learning, Dordrecht: Kluwer Academic Publishers, 1997.
[2] C.G. Atkeson, A.W. Moore and S. Schaal, Locally weighted learning for control, Artificial Intelligence Review 11 (1997), 75–113.
[3] C.L. Blake and C.J. Merz, UCI Repository of machine learning databases, Irvine, CA: University of California, Department of Information and Computer Science, 1998. [http://www.ics.uci.edu/~mlearn/MLRepository.html].
[4] L. Breiman, Bagging predictors, Machine Learning 24(3) (1996), 123–140.
[5] J.G. Cleary, G. Holmes, S.J. Cunningham and I.H. Witten, MetaData for database mining, in: Proc. IEEE Metadata Conference, Silver Spring, MD, April 1996.
[6] T.G. Dietterich and G. Bakiri, Solving multiclass learning problems via error correcting output codes, Journal of Artificial Intelligence Research 2 (1995), 263–286.
[7] T.G. Dietterich, Ensemble methods in machine learning, in: Multiple Classifier Systems (LNCS Vol. 1857), J. Kittler and F. Roli, eds, Springer, 2001, pp. 1–15.
[8] P. Domingos and M. Pazzani, On the optimality of the simple Bayesian classifier under zero-one loss, Machine Learning 29 (1997), 103–130.
[9] M. Van Erp, L.G. Vuurpijl and L.R.B. Schomaker, An overview and comparison of voting methods for pattern recognition, in: Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, Niagara-on-the-Lake, Canada, 2002, 195–200.
[10] E. Frank, M. Hall and B. Pfahringer, Locally weighted naive Bayes, in: Proc. of the 19th Conference on Uncertainty in Artificial Intelligence, Acapulco, Mexico, Morgan Kaufmann, 2003.
[11] Y. Freund and R. Schapire, Experiments with a new boosting algorithm, in: Proceedings of ICML'96, 148–156.
[12] R.C. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning 11 (1993), 63–90.
[13] W. Iba and P. Langley, Induction of one-level decision trees, in: Proc. of the Ninth International Machine Learning Conference, Aberdeen, Scotland, Morgan Kaufmann, 1992.
[14] C. Ji and S. Ma, Combinations of weak classifiers, IEEE Transactions on Neural Networks 8(1) (1997), 32–42.
[15] J. Kittler, M. Hatef, R.P.W. Duin and J. Matas, On combining classifiers, IEEE T-PAMI 20(3) (1998), 226–239.
[16] S. Kotsiantis, C. Pierrakeas and P. Pintelas, Preventing student dropout in distance learning systems using machine learning techniques, in: Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, Oxford, Sept. 3–5, Lecture Notes in Computer Science 2774, Springer-Verlag, 2003, 267–274.
[17] S. Kotsiantis and P. Pintelas, Local boosting of weak classifiers, in: Proceedings of Intelligent Systems Design and Applications (ISDA 2004), August 26–28, 2004, Budapest, Hungary.
[18] P. Melville and R. Mooney, Constructing diverse classifier ensembles using artificial training examples, in: Proc. of IJCAI-2003, Acapulco, Mexico, August 2003, 505–510.
[19] C. Nadeau and Y. Bengio, Inference for the generalization error, Machine Learning 52(3) (2003), 239–281.
[20] C. Schaffer, Selecting a classification method by cross-validation, Machine Learning 13 (1993), 135–143.
[21] A.K. Seewald, How to make stacking better and faster while also taking care of an unknown weakness, in: Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), C. Sammut and A. Hoffmann, eds, Morgan Kaufmann, 2002, 554–561.
[22] K. Ting and I. Witten, Issues in stacked generalization, Journal of Artificial Intelligence Research 10 (1999), 271–289.
[23] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[24] G.I. Webb, MultiBoosting: A technique for combining boosting and wagging, Machine Learning 40 (2000), 159–196.
[25] D. Wilson and T. Martinez, Reduction techniques for instance-based learning algorithms, Machine Learning 38 (2000), 257–286.
[26] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Mateo, CA, 2000.
[27] D. Wolpert, Stacked generalization, Neural Networks 5(2) (1992), 241–260.