Diversified Random Forests Using Random Subspaces

Khaled Fawagreh, Mohamed Medhat Gaber, and Eyad Elyan
IDEAS, School of Computing Science and Digital Media, Robert Gordon University, Garthdee Road, Aberdeen, AB10 7GJ, UK

Abstract. Random Forest is an ensemble learning method used for classification and regression. In such an ensemble, multiple classifiers are used, with each classifier casting one vote for its predicted class label. Majority voting is then used to determine the class label of unlabelled instances. Since it has been shown empirically that ensembles tend to yield better results when there is significant diversity among the constituent models, many extensions have been developed over the past decade that aim to induce diversity in the constituent models in order to improve the performance of Random Forests in terms of both speed and accuracy. In this paper, we propose a method to promote Random Forest diversity by using randomly selected subspaces, assigning a weight to each subspace according to its predictive power, and using this weight in majority voting. An experimental study on 15 real datasets showed favourable results, demonstrating the potential of the proposed method.

1 Introduction

Random Forest (RF) is an ensemble learning technique used for classification and regression. Ensemble learning is a supervised machine learning paradigm in which multiple models are used to solve the same problem [22]. Since single classifier systems have limited predictive performance [27] [22] [18] [24], ensemble classification was developed to overcome this limitation [22] [18] [24], thus boosting classification accuracy. In such an ensemble, multiple classifiers are used. In its basic form, majority voting determines the class label of unlabelled instances: each classifier in the ensemble is asked to predict the class label of the instance being considered, and once all the classifiers have been queried, the class that receives the greatest number of votes is returned as the final decision of the ensemble. Three widely used ensemble approaches can be identified, namely boosting, bagging, and stacking. Boosting is an incremental process of building a sequence of classifiers, where each classifier works on the incorrectly classified instances of the previous one in the sequence. AdaBoost [12] is the representative of this class of techniques; however, it is prone to overfitting. Another class of ensemble approaches is bootstrap aggregating (bagging) [7]. Bagging involves building each classifier in the ensemble using a randomly drawn sample of the data, with each classifier giving an equal vote when labelling unlabelled


instances. Bagging is known to be more robust than boosting against model overfitting. Random Forest (RF) is the main representative of bagging [8]. Stacking (sometimes called stacked generalisation) extends the cross-validation technique that partitions the dataset into a held-in dataset and a held-out dataset, trains the models on the held-in data, and then chooses whichever of the trained models performs best on the held-out data. Instead of choosing among the models, stacking combines them, thereby typically achieving better performance than any single one of the trained models [26]. The ensemble method relevant to our work in this paper is RF, which has proved to be a state-of-the-art ensemble classification technique. Since it has been shown empirically that ensembles tend to yield better results when there is significant diversity among the models [16] [9] [1] [25], this paper investigates how to inject more diversity by using random subspaces to construct an RF that is divided into a number of sub-forests, each of which is based on a random subspace. This paper is organised as follows. First, an overview of RFs is presented in Section 2. This is followed by Section 3, where our method is presented. Experimental results demonstrating the superiority of the proposed extension over the standard RF are detailed in Section 4. In Section 5, we describe related work. The paper is then concluded with a summary and pointers to future directions in Section 6.
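As a concrete illustration of the plain majority voting used as the baseline combination rule throughout this paper, the following minimal Python sketch tallies the votes of an ensemble's base classifiers (the helper name and example labels are illustrative, not from the paper):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label that receives the most votes.

    `predictions` holds one predicted label per base classifier.
    """
    votes = Counter(predictions)
    # most_common(1) -> [(label, count)] for the top-voted label
    return votes.most_common(1)[0][0]

# Five base classifiers vote on one unlabelled instance
print(majority_vote(["spam", "ham", "spam", "spam", "ham"]))  # -> spam
```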

2 Random Forests: An Overview

Random Forest is an ensemble learning method used for classification and regression. Developed by Breiman [8], the method combines Breiman's bagging approach [7] and the random selection of features, introduced independently by Ho [14] [15] and by Amit and Geman [2], in order to construct a collection of decision trees with controlled variation. Using bagging, each decision tree in the ensemble is constructed from a sample drawn with replacement from the training data. Statistically, such a sample is expected to contain about 63% of the instances at least once; these are referred to as in-bag instances, while the remaining instances (about 37%) are referred to as out-of-bag instances. Each tree in the ensemble acts as a base classifier to determine the class label of an unlabelled instance. During the construction of the individual trees in the RF, randomisation is also applied when selecting the feature to split on at each node: only a random subset of the features is considered, typically of size √F, where F is the number of features in the dataset. Breiman [8] introduced additional randomness during the construction of decision trees using the classification and regression trees (CART) technique. Using this technique, the subset of features selected in each interior node is evaluated with the Gini index heuristic, and the feature yielding the highest Gini gain (i.e., the largest decrease in Gini impurity) is chosen as the split feature in that node. The Gini index was introduced into decision tree learning by Breiman et al. [19], although it was first proposed by the Italian statistician Corrado Gini in 1912. The index is a function that is used to measure the impurity of


data, i.e., how uncertain we are about whether an event will occur. In classification, this event is the determination of the class label [4]. In the basic RF technique [8], it was shown that the RF error rate depends on correlation and strength. Increasing the correlation between any two trees in the RF increases the forest error rate. A tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the RF error rate. These findings are consistent with a study by Bernard et al. [5], which showed that the error rate statistically decreases when the strength is maximised and the correlation is minimised jointly. Key advantages of RF over its AdaBoost counterpart are robustness to noise and to overfitting [8] [17] [23] [6]. In the original RF paper, Breiman [8] outlined four further advantages: first, its accuracy is as good as that of AdaBoost and sometimes better; second, it is faster than bagging or boosting; third, it gives useful internal estimates of error, strength, correlation, and variable importance; and last, it is simple and easily parallelised.
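To make the CART-style split criterion concrete, the short Python sketch below (illustrative helpers under stated assumptions, not code from the paper) computes the Gini impurity of a set of class labels and the impurity decrease obtained by a candidate split; the split with the largest decrease would be chosen:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent_labels, child_groups):
    """Decrease in Gini impurity obtained by splitting `parent_labels`
    into `child_groups` (each group weighted by its relative size)."""
    n = len(parent_labels)
    weighted = sum(len(g) / n * gini_impurity(g) for g in child_groups)
    return gini_impurity(parent_labels) - weighted

# A split that separates the two classes perfectly gives the maximal decrease
parent = ["a", "a", "b", "b"]
print(gini_decrease(parent, [["a", "a"], ["b", "b"]]))  # -> 0.5
```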

3 Diversified Random Forests

To some extent, the standard RF already injects some diversity into the classifiers built during the construction of the forest. In a nutshell, two levels of diversity are applied. The first level arises because each decision tree is constructed from a sample drawn with replacement from the training data; the samples are likely to differ from one another as they are drawn at random. The second level is achieved by the randomisation applied when selecting the split feature at each node. The ultimate objective of this paper is to inject more diversity into an RF. From the training set, we create a number of subspaces, determined as follows:

Subspaces = \alpha \times Z    (1)

where α denotes the subspace factor such that 0 < α ≤ 1, and Z is the size of the diversified RF to be created. Each subspace will contain a fixed randomised subset of the total number of features and will correspond to a sub-forest. A projected training dataset will be created for each subspace and will be used to create the trees in the corresponding sub-forest. The number of trees in each sub-forest is given by

Trees = \frac{Z}{Subspaces}    (2)
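For instance, with the parameter values used later in the experiments in Section 4 (a subspace factor of 2% and Z = 500 trees), Equations 1 and 2 give 10 subspaces of 50 trees each; a trivial sketch of this arithmetic:

```python
alpha, Z = 0.02, 500                   # subspace factor and desired DRF size
subspaces = int(alpha * Z)             # Equation 1 -> 10 subspaces
trees_per_subforest = Z // subspaces   # Equation 2 -> 50 trees per sub-forest
print(subspaces, trees_per_subforest)  # -> 10 50
```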

We will refer to the resulting forest as the Diversified Random Forest (DRF), shown in Figure 1.

[Fig. 1. Diversified Random Forest. The training set is split into n random subspaces (subspace_1 ... subspace_n); each subspace yields a projected training set from which a sub-forest of k trees (t_1 ... t_k) is built, and a corresponding projected testing set is derived from the testing set for the testing phase.]

A weight is then assigned to each projected training dataset using the Absolute Predictive Power (APP) given by Cuzzocrea et al. [11]. Given a dataset S, the APP is defined by the following equation:

APP(S) = \frac{1}{|Att(S)|} \sum_{A \in Att(S)} \frac{I(S, A)}{E(S)}    (3)

where E(S) is the entropy of a given dataset S having K instances and I(S, A) is the information gain of a given attribute A in dataset S. E(S) is a measure of the uncertainty in a random variable and is given by the following equation:

E(S) = \sum_{i=1}^{K} -p(x_i) \log_2 p(x_i)    (4)

where x_i refers to a generic instance of S and p(x_i) denotes the probability that the instance x_i occurs in S. I(S, A) is given by

I(S, A) = E(S) - \sum_{v \in Val(A)} \frac{|S_v|}{|S|} E(S_v)    (5)

where E(S) denotes the entropy of S, Val(A) denotes the set of possible values for A, S_v refers to the subset of S for which A has the value v, and E(S_v) denotes the entropy of S_v. This weight will be inherited by the corresponding sub-forest and will be used in the voting process. This means that the standard voting technique used in the standard RF is replaced by a weighted voting technique. Algorithm 1 summarises the main steps involved in the construction of a DRF. To measure diversity, we can use the entropy in Equation 4 to find the average entropy over a validation set V having K instances, as given in the following equation:

Diversity(V) = \frac{E(V)}{K}    (6)
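As a concrete (and simplified) reading of Equations 3–5, the Python sketch below computes entropy over the class-label distribution of a small dataset, the information gain of each attribute, and the resulting APP weight. The tiny dataset, the attribute names, and the dictionary representation are illustrative assumptions, not data from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of the class-label distribution (a reading of Equation 4)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Equation 5: entropy reduction from partitioning on `attribute`."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def absolute_predictive_power(rows, labels):
    """Equation 3: average of I(S, A) / E(S) over all attributes of S."""
    attributes = list(rows[0].keys())
    e = entropy(labels)
    return sum(information_gain(rows, labels, a) / e for a in attributes) / len(attributes)

# Tiny hypothetical projected training set (attribute dicts plus class labels)
rows = [{"outlook": "sunny", "windy": True},
        {"outlook": "sunny", "windy": False},
        {"outlook": "rainy", "windy": True},
        {"outlook": "rainy", "windy": False}]
labels = ["play", "play", "stay", "stay"]
print(absolute_predictive_power(rows, labels))  # -> 0.5 ("outlook" is fully predictive)
```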


Algorithm 1. Diversified Random Forest Algorithm

{User Settings}
input Z: the desired size of the DRF to be created
input S: the number of features to sample in each sub-forest
input α: the subspace factor

{Process}
Create an empty vector DRF
Create an empty vector Weights
Using Equation 1, create N subspaces, each containing 70% of the features chosen at random from all features F
For each subspace i, create a projected training set TR_i
Repeat the previous step to create a projected testing set TS_i for each subspace (required for the testing phase)
Using Equation 3, assign a weight to each projected training set and add this weight to Weights
Using Equation 2, determine the number of trees in each subspace: treesPerSubspace ⇐ Z/N
for i = 1 → N do
  for j = 1 → treesPerSubspace do
    Create an empty tree T_j
    repeat
      Sample S out of all features in the corresponding projected training set TR_i using bootstrap sampling
      Create a vector F_S of the S sampled features
      Find the best split feature B(F_S)
      Create a new node using B(F_S) in T_j
    until no more instances to split on
    Add T_j to DRF
  end for
end for

{Output}
A vector of trees DRF
A vector of weights Weights (to replace standard voting by weighted voting during the testing phase)
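For readers who want to prototype the procedure, the sketch below approximates Algorithm 1 by building one standard random forest per subspace (scikit-learn's RandomForestClassifier standing in for each sub-forest) and combining the sub-forests by weighted soft voting at prediction time. This is a simplified approximation under stated assumptions, not the authors' implementation: the per-node feature sampling inside each sub-forest is delegated to the library, and all function and parameter names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_drf(X, y, Z=500, alpha=0.02, feature_fraction=0.7, app_weight=None, seed=0):
    """Build a list of (feature_indices, sub_forest, weight) triples.

    `app_weight`, if given, is a callable scoring a projected training set
    (e.g. an APP-style function); otherwise every sub-forest gets weight 1.
    """
    rng = np.random.default_rng(seed)
    n_subspaces = max(1, int(alpha * Z))            # Equation 1
    trees_per_subforest = Z // n_subspaces          # Equation 2
    n_features = X.shape[1]
    k = max(1, int(feature_fraction * n_features))  # features per subspace
    sub_forests = []
    for i in range(n_subspaces):
        feats = rng.choice(n_features, size=k, replace=False)   # random subspace
        X_proj = X[:, feats]                                    # projected training set
        weight = app_weight(X_proj, y) if app_weight else 1.0   # Equation 3 weight
        forest = RandomForestClassifier(n_estimators=trees_per_subforest,
                                        random_state=seed + i).fit(X_proj, y)
        sub_forests.append((feats, forest, weight))
    return sub_forests

def predict_drf(sub_forests, X):
    """Weighted voting: each sub-forest's class probabilities are scaled by
    its weight and summed; the class with the highest total wins."""
    classes = sub_forests[0][1].classes_
    scores = np.zeros((X.shape[0], len(classes)))
    for feats, forest, weight in sub_forests:
        scores += weight * forest.predict_proba(X[:, feats])
    return classes[np.argmax(scores, axis=1)]
```

Soft (probability-based) voting is used here purely for brevity; hard majority voting per tree, as in the paper, could be substituted without changing the overall structure.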

Two examples clarify this equation. Assume we have n validation instances and a DRF of 500 trees, where each tree predicts a class label. In the first example, consider an instance x. Since we have 500 trees, we obtain class labels (c_1, c_2, ..., c_500) for this instance. Assuming all trees agree on one class, the entropy in Equation 4 yields −1 log₂ 1 = 0. If we use Equation 6 to average over the n validation instances, we can measure the diversity. In the second example, consider another instance y for which half of the trees chose one class label and the other half chose another. For such an instance, the entropy in Equation 4 yields −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1. We can see that when the trees diversified, the entropy increased. To summarise, the entropy increases as the trees disagree; therefore, the higher the disagreement, the higher the entropy.
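These two worked examples translate directly into a short check (illustrative code, using the same base-2 entropy as above):

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Base-2 entropy of the class labels predicted by the trees."""
    n = len(votes)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(votes).values())

print(vote_entropy(["A"] * 500))                # all 500 trees agree -> 0.0
print(vote_entropy(["A"] * 250 + ["B"] * 250))  # votes split evenly  -> 1.0
```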

4 Experimental Study

RF and DRF were tested on 15 real datasets from the UCI repository [3], and both forests had a size of 500 trees. For our DRF, we used a subspace factor of 2%; by Equation 1, this produced 10 subspaces and hence 10 sub-forests. We used a random 70% of the features for each subspace, and by Equation 2, each sub-forest contained 50 trees. Results of the experiment, based on the average of 10 runs, are shown in Table 1 below (the datasets where DRF outperformed RF are marked with an asterisk). Table 1 shows that DRF outperformed RF on the majority of the datasets. It is worth noting that our technique showed exceptionally better performance on the medical datasets, namely diabetes and breast-cancer, with 9 and 10 features respectively; this interesting outcome is worth further investigation. As shown in Table 1, in the few cases where RF outperformed DRF, the difference was a very small fraction ranging from 0.01% to 2.22%. On the other hand, where DRF performed better, the difference ranged from 0.18% to 5.63%. Applying a paired t-test to the 8 cases where DRF outperformed RF showed that the results are statistically significant, with p-value = 0.01772.
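A paired t-test of this kind can be run directly on the accuracies reported in Table 1 below; a minimal sketch, assuming SciPy is available and with the values transcribed from the eight datasets on which DRF won:

```python
from scipy.stats import ttest_rel

# Accuracies from Table 1 for the eight datasets on which DRF outperformed RF
rf  = [20.0, 75.97059, 0.7042253, 63.333332, 73.71648, 12.328766, 71.855675, 60.000004]
drf = [22.76, 76.147064, 0.9859154, 63.809525, 79.34865, 17.123287, 75.46391, 61.11111]

t_stat, p_value = ttest_rel(drf, rf)  # paired (dependent-samples) t-test
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
```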

Table 1. Performance Comparison of RF & DRF (datasets where DRF outperformed RF are marked with *)

Dataset            Number of Features   RF (%)       DRF (%)
soybean            36                   77.543106    75.43103
eucalyptus         20                   20.0         22.76      *
car                7                    62.108845    59.93197
credit             21                   75.97059     76.147064  *
sonar              61                   0.7042253    0.9859154  *
white-clover       32                   63.333332    63.809525  *
diabetes           9                    73.71648     79.34865   *
glass              10                   12.328766    17.123287  *
vehicle            19                   73.81944     73.40277
vote               17                   97.97298     97.36487
audit              13                   96.30882     96.29411
breast-cancer      10                   71.855675    75.46391   *
pasture            23                   41.666668    40.833336
squash-stored      25                   55.555553    53.333344
squash-unstored    24                   60.000004    61.11111   *

5 Related Work

The random subspace method was initially introduced by Ho [15]. Unlike the approach proposed in this paper, however, the subspaces were not weighted according to their predictive power. The method became so popular that many researchers adopted it to enhance ensemble techniques. Breiman [8], for example, combined his bagging approach [7] and the random subspace method in


order to construct an RF containing a collection of decision trees with controlled variation. Cai et al. [10] proposed a weighted subspace approach to improve performance, but their approach was tailored for bagging ensembles [7]. García-Pedrajas and Ortiz-Boyer [13] presented a novel approach to improving boosting ensembles that combines the random subspace method with AdaBoost [12]. Opitz [20] developed an ensemble feature selection approach based on genetic algorithms; the approach not only finds features relevant to the learning task and learning algorithm, but also finds a set of feature subsets that promotes disagreement among the ensemble's classifiers, and it demonstrated improved performance over the popular and powerful ensemble approaches AdaBoost and bagging. Panov and Džeroski [21] presented a new method for constructing ensembles of classifiers that combines bagging and the random subspace method, with the added advantage of being applicable to any base-level algorithm without the need to randomise the latter.

6 Conclusion and Future Work

In this paper, we have proposed an extension to Random Forest called the Diversified Random Forest. The proposed forest is constructed from sub-forests, each of which is built from a randomly selected subspace whose corresponding projected training dataset is assigned a weight. We have used the Absolute Predictive Power (APP) to weight each projected dataset. As demonstrated in Section 4, the extended RF outperformed the standard RF on the majority of the datasets. In our experiments, we used a subspace factor of 2%, a forest of 500 trees, and 70% of the features in each subspace; in future work, we will experiment with different values for these parameters. Generally, we expect our extended forest to work well with higher-dimensional datasets, since the more features there are, the more likely the subspaces are to be diverse.

References

1. Adeva, J.J.G., Beresi, U., Calvo, R.: Accuracy and diversity in ensembles of text categorisers. CLEI Electronic Journal 9(1) (2005)
2. Amit, Y., Geman, D.: Shape quantization and recognition with randomized trees. Neural Computation 9(7), 1545–1588 (1997)
3. Bache, K., Lichman, M.: UCI machine learning repository (2013)
4. Bader-El-Den, M., Gaber, M.: GARF: towards self-optimised random forests. In: Huang, T., Zeng, Z., Li, C., Leung, C.S. (eds.) ICONIP 2012, Part II. LNCS, vol. 7664, pp. 506–515. Springer, Heidelberg (2012)
5. Bernard, S., Heutte, L., Adam, S.: A study of strength and correlation in random forests. In: Huang, D.-S., McGinnity, M., Heutte, L., Zhang, X.-P. (eds.) ICIC 2010. CCIS, vol. 93, pp. 186–191. Springer, Heidelberg (2010)
6. Boinee, P., De Angelis, A., Foresti, G.L.: Meta random forests. International Journal of Computational Intelligence 2(3), 138–147 (2005)


7. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
8. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
9. Brown, G., Wyatt, J., Harris, R., Yao, X.: Diversity creation methods: a survey and categorisation. Information Fusion 6(1), 5–20 (2005)
10. Cai, Q.-T., Peng, C.-Y., Zhang, C.-S.: A weighted subspace approach for improving bagging performance. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), pp. 3341–3344. IEEE (2008)
11. Cuzzocrea, A., Francis, S.L., Gaber, M.M.: An information-theoretic approach for setting the optimal number of decision trees in random forests. In: 2013 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1013–1019. IEEE (2013)
12. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997)
13. García-Pedrajas, N., Ortiz-Boyer, D.: Boosting random subspace method. Neural Networks 21(9), 1344–1362 (2008)
14. Ho, T.K.: Random decision forests. In: Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
15. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
16. Kuncheva, L.I., Whitaker, C.J.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51(2), 181–207 (2003)
17. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
18. Maclin, R., Opitz, D.: Popular ensemble methods: An empirical study. arXiv preprint arXiv:1106.0257 (2011)
19. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regression trees. Wadsworth International Group (1984)
20. Opitz, D.W.: Feature selection for ensembles. In: AAAI/IAAI, pp. 379–384 (1999)
21. Panov, P., Džeroski, S.: Combining bagging and random subspaces to create better ensembles. Springer (2007)
22. Polikar, R.: Ensemble based systems in decision making. IEEE Circuits and Systems Magazine 6(3), 21–45 (2006)
23. Robnik-Šikonja, M.: Improving random forests. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 359–370. Springer, Heidelberg (2004)
24. Rokach, L.: Ensemble-based classifiers. Artificial Intelligence Review 33(1-2), 1–39 (2010)
25. Tang, K., Suganthan, P.N., Yao, X.: An analysis of diversity measures. Machine Learning 65(1), 247–271 (2006)
26. Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
27. Yan, W., Goebel, K.F.: Designing classifier ensembles with constrained performance requirements. In: Defense and Security, pp. 59–68. International Society for Optics and Photonics (2004)