RBF Networks from Boosted Rules∗

Juan José Rodríguez, Vanesa Paniego, Leticia Villar
Lenguajes y Sistemas Informáticos
Universidad de Burgos, Spain

Carlos J. Alonso
Grupo de Sistemas Inteligentes
Departamento de Informática
Universidad de Valladolid, Spain

[email protected], [email protected]

∗ This work has been supported by the Spanish MCyT project DPI2001-4404-E and the “Junta de Castilla y León” project VA101/01.

Abstract

A novel method for constructing RBF networks is presented. It is based on boosting, an ensemble method that combines several classifiers obtained using any other classification method. If the classifiers that are going to be combined by boosting are radial-basis functions, then the boosting method produces a RBF network as a result. The method for constructing a RBF is based on obtaining a decision rule and using the attributes and values that appear in the rule for selecting the centers and radii of the RBF.

Keywords: boosting, radial-basis function networks, decision rules, ensemble methods, machine learning.

Figure 1. A RBF network: the inputs x_1, ..., x_n feed a layer of RBF units h_1, ..., h_p, whose outputs are combined linearly into the outputs y_1, ..., y_m.

1. Introduction

In a Radial-Basis Function (RBF), the response decreases (or increases) monotonically with the distance from a central point [8]. The most typical RBF is the Gaussian:

$$h(x) = \exp\left(-\frac{\|x - c\|^2}{r^2}\right)$$

where $c$ is the center and $r$ the radius. Although RBF can be used in any type of network, traditionally a RBF network is considered a single-layer network, such as the one shown in figure 1. The inputs, $x_1, \ldots, x_n$, are passed to a set of RBF, $h_1, \ldots, h_p$. Each output is a linear combination of the RBF:

$$y_j = \sum_{k=1}^{p} w_{jk}\, h_k(x)$$
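To make the notation concrete, the following is a minimal sketch in Python/NumPy of how such a single-layer network computes its outputs. The function names, array shapes, and toy values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_rbf(x, center, radius):
    """Gaussian RBF: h(x) = exp(-||x - c||^2 / r^2)."""
    return np.exp(-np.sum((x - center) ** 2) / radius ** 2)

def rbf_network_output(x, centers, radii, weights):
    """Outputs y_j = sum_k w_jk * h_k(x) of a single-layer RBF network.

    centers: (p, n) array, one center per hidden unit
    radii:   (p,) array, one radius per hidden unit
    weights: (m, p) array, one row of output weights per output y_j
    """
    h = np.array([gaussian_rbf(x, c, r) for c, r in zip(centers, radii)])
    return weights @ h  # shape (m,)

# Toy usage: 2 inputs, 3 hidden RBF units, 2 outputs (values are arbitrary).
x = np.array([0.5, -1.0])
centers = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])
radii = np.array([1.0, 0.5, 2.0])
weights = np.random.randn(2, 3)
print(rbf_network_output(x, centers, radii, weights))
```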

If the RBF network is going to be used for regression, it will have only one output; if it is going to be used for classification, one of the possibilities is to have as many outputs as classes. This paper focuses on classification.

The traditional approach for RBF construction works in two phases. First, the RBF are selected; then the weights are adjusted. The weights can be obtained using the delta rule or a pseudoinverse matrix [7]. Some approaches for selecting the centers are to use all the training examples, a random subset, or a clustering method such as k-means. It is also possible to obtain the RBF from a decision tree [7].

Boosting [11] is an ensemble method that linearly combines different classifiers obtained using any other classification method, typically decision trees. A RBF network is a linear combination of RBF. Hence, it is possible to obtain this sort of network using boosting, if the classification method produces a RBF as a result. This idea has already been explored in [10]. In that work, the method for obtaining a RBF was to randomly select a small subset of examples, consider the distances to these examples as new features, and then select the best feature and threshold. The center of the RBF was the selected example and the radius was a function of the threshold. The use of these RBF networks for time series classification, with a distance more adequate for that domain, was considered in [9].

In this work the same approach is used. The difference is that the RBF are constructed from decision rules, in the same way that decision trees are converted to RBF in [7]. The first advantage is that the attributes that do not appear in the rule will not appear in the RBF, so there is a way to deal with irrelevant features. The second is that the radius is not uniform for each RBF nor for each dimension of each RBF. If decision trees can be converted to RBF networks [7], RBF networks can be constructed by boosting [10], and boosted rules compare favorably with decision trees [4], then it is sensible to explore the possibility of constructing RBF networks from boosted rules.

The rest of the paper is organized as follows. Section 2 presents the boosting method. The method for the construction of RBF networks from boosting and decision rules is presented in section 3. Section 4 is devoted to the experimental validation. Finally, section 5 presents some concluding remarks.

2. Ensemble Methods and Boosting

One of the research areas in Machine Learning is the generation of ensembles. The basic idea is to use more than one classifier, in the hope that the accuracy will be better. It is possible to combine heterogeneous classifiers, where each of the classifiers is obtained with a different method. Nevertheless, it is also possible to combine homogeneous classifiers. In this case all the classifiers have been obtained with the same method. In order to avoid identical classifiers, it is necessary to change something, at least the random seed. There are methods that alter the data set. Bagging [2] obtains a new data set by resampling the original data set. An instance can be selected several times, so some instances will not be present in the new data set. The Random Subspaces [6] method obtains a new data set by deleting some attributes.

Boosting [11] is a family of methods. The most prominent member is AdaBoost. Figure 2 shows this method. AdaBoost works as follows. Every example has an associated weight. Initially, all the examples have the same weight. In each iteration a base (also named weak) classifier is constructed, according to the distribution of weights. Afterwards, the weight of each example is readjusted, based on the correctness of the class assigned to the example by the base classifier. The final result is obtained by weighted votes of the base classifiers.

The version of boosting used in this paper is LogitBoost [5]. Figure 3 shows this method. The reason for using this version is that the task of the base learner is not to classify (as happens with AdaBoost) but to predict a numeric value, and the output of a RBF is numeric.

Given $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in \mathcal{X}$, $y_i \in \{-1, +1\}$.
Initialize $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:
  • Train the weak learner using distribution $D_t$.
  • Get a weak hypothesis $h_t : \mathcal{X} \to \mathbb{R}$.
  • Choose $\alpha_t \in \mathbb{R}$.
  • Update
    $$D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
    where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
Output the final hypothesis:
$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

Figure 2. AdaBoost (from [12]).
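For readers who prefer code, here is a minimal Python sketch of the loop in figure 2, shown only for contrast with LogitBoost (the version actually used in this paper). The use of decision stumps as the weak learner and of $\alpha_t = \frac{1}{2}\ln((1-\epsilon_t)/\epsilon_t)$ are illustrative choices; figure 2 leaves both open. All names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    """Sketch of the AdaBoost loop of figure 2 (y is a NumPy array in {-1, +1})."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for _ in range(T):
        # Train the weak learner (a decision stump here) on the weighted data.
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)  # one common choice of alpha_t
        D = D * np.exp(-alpha * y * pred)      # reweight the examples
        D /= D.sum()                           # normalize (the Z_t factor)
        hypotheses.append(h)
        alphas.append(alpha)

    def H(X_new):
        # Final hypothesis: H(x) = sign(sum_t alpha_t * h_t(x)).
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(scores)
    return H
```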

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$ and probability estimates $p(x_i) = \frac{1}{2}$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Compute the working response and weights:
       $$z_i = \frac{(y_i + 1)/2 - p(x_i)}{p(x_i)(1 - p(x_i))}, \qquad w_i = p(x_i)(1 - p(x_i))$$
   (b) Fit the function $f_m(x)$ by a weighted least-squares regression of $z_i$ to $x_i$ using weights $w_i$.
   (c) Update
       $$F(x) \leftarrow F(x) + \frac{1}{2} f_m(x), \qquad p(x) \leftarrow \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}$$
3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}\left[\sum_{m=1}^{M} f_m(x)\right]$.

Figure 3. LogitBoost (from [5]).
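A compact Python sketch of the procedure of figure 3 follows. The base regressor used here (a shallow regression tree) is only a stand-in for illustration; in this paper the base learner is the single-RBF model described in section 3. The names and parameter values are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y, M=10):
    """Two-class LogitBoost as in figure 3 (y is a NumPy array in {-1, +1})."""
    N = len(y)
    F = np.zeros(N)                                   # F(x) = 0
    p = np.full(N, 0.5)                               # p(x_i) = 1/2
    learners = []
    for _ in range(M):
        w = np.clip(p * (1 - p), 1e-10, None)         # weights w_i
        z = ((y + 1) / 2 - p) / w                     # working response z_i
        f = DecisionTreeRegressor(max_depth=2).fit(X, z, sample_weight=w)
        learners.append(f)
        F += 0.5 * f.predict(X)                       # F <- F + f_m / 2
        p = 1.0 / (1.0 + np.exp(-2 * F))              # same as e^F / (e^F + e^-F)

    def classify(X_new):
        F_new = 0.5 * sum(f.predict(X_new) for f in learners)
        return np.sign(F_new)                         # sign[F(x)]
    return classify
```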

3. The Method

3.1. From Rules to RBF

In [7] it is described how to obtain a RBF from a decision tree: for each leaf, that is, a decision rule, a RBF is obtained. We will use the same approach. A multidimensional Gaussian is the product of several one-dimensional Gaussians:

$$h(x) = \exp\left(-\frac{\|x - c\|^2}{r^2}\right) = \prod_{i=1}^{n} \exp\left(-\frac{(x_i - c_i)^2}{r^2}\right)$$

This transformation allows the radius to be different for each dimension:

$$h(x) = \prod_{i=1}^{n} \exp\left(-\frac{(x_i - c_i)^2}{r_i^2}\right)$$

If some feature is irrelevant, it can be excluded from the product. This is equivalent to having an infinite radius in that dimension. The antecedent of a rule is formed by a conjunction of comparisons between an attribute and a threshold. If an attribute does not appear in the rule, it will not appear in the RBF. If the values of the attribute are bounded from below and from above, then the value for the center of the RBF in that dimension is the center of the interval. If the values are only bounded from above (respectively, from below), then the value for the center of the RBF in that dimension is the minimum (respectively, the maximum) value of the attribute in the whole data set. The radii for each dimension of the RBF are given by the width of the interval that bounds that attribute in the rule. If the attribute is not completely bounded in the rule, the minimum or the maximum value of the attribute in the data set is used as a boundary. The radius for each dimension is set to the width of the interval ($I_i$) multiplied by a constant ($\alpha$): $r_i = \alpha I_i$. The value of $\alpha$ is a global parameter of the method.
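The following Python sketch illustrates this rule-to-RBF conversion. The representation of a rule as per-attribute (lower, upper) bounds, the function names, and the example values are assumptions introduced only for illustration; the paper does not prescribe a value for the constant α.

```python
import numpy as np

def rule_to_rbf(rule_bounds, data_min, data_max, alpha):
    """Build the center c_i and radius r_i of a RBF from a decision rule.

    rule_bounds: {attribute_index: (lower, upper)}, with None where the rule
                 leaves that side unbounded. Attributes absent from the rule
                 are simply left out of the product (infinite radius).
    data_min, data_max: per-attribute minima/maxima over the whole data set.
    alpha: the global multiplicative constant, r_i = alpha * I_i.
    """
    centers, radii, dims = {}, {}, []
    for i, (lo, hi) in rule_bounds.items():
        if lo is None:                      # only bounded from above
            centers[i] = data_min[i]        # center is the data-set minimum
            interval = hi - data_min[i]
        elif hi is None:                    # only bounded from below
            centers[i] = data_max[i]        # center is the data-set maximum
            interval = data_max[i] - lo
        else:                               # bounded on both sides
            centers[i] = (lo + hi) / 2.0    # center of the interval
            interval = hi - lo
        radii[i] = alpha * interval         # r_i = alpha * I_i
        dims.append(i)

    def h(x):
        # Product of one-dimensional Gaussians over the bounded dimensions only.
        return np.prod([np.exp(-(x[i] - centers[i]) ** 2 / radii[i] ** 2)
                        for i in dims])
    return h

# Example: rule "x2 > 0.3 and x5 <= 1.2" on a 6-attribute data set
# (alpha = 1.0 is an arbitrary choice for this example).
data_min, data_max = np.zeros(6), np.full(6, 2.0)
h = rule_to_rbf({2: (0.3, None), 5: (None, 1.2)}, data_min, data_max, alpha=1.0)
print(h(np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.5])))
```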

3.2. Calculating the Weights

Once a RBF, $h_k(x)$, has been obtained from a rule, it is necessary to calculate the weights for the output of this RBF. The LogitBoost method (figure 3) asks the base learner to fit a function $f_m(x)$. The method used for obtaining $f_m$ from $h_k$ is simply linear regression. That is, $f_m(x) = a\,h_k(x) + b$, where $a$ and $b$ are coefficients. The $b$ coefficient can be considered as a contribution to the bias.
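As a minimal sketch of how such a fit could be done, the coefficients a and b can be obtained by a weighted least-squares solve over the single feature h_k(x). The NumPy-based helper and its names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fit_linear_on_rbf(h_values, z, w):
    """Fit f_m(x) = a*h_k(x) + b by weighted least squares of z on h_k(x).

    h_values: 1-D array with h_k(x_i) for every training example i
    z, w:     working responses and weights from the LogitBoost step
    """
    A = np.column_stack([h_values, np.ones_like(h_values)])   # columns [h_k(x_i), 1]
    sw = np.sqrt(w)                                            # weight each row by sqrt(w_i)
    a, b = np.linalg.lstsq(A * sw[:, None], z * sw, rcond=None)[0]
    return a, b   # b acts as a contribution to the bias
```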

3.3. Recalculating the Weights

One of the advantages of the method, considering storage and efficiency, is that many of the connections between the neurons of consecutive layers are unnecessary. Only the inputs that appear in the rule are connected to the corresponding RBF, and the output of each hidden neuron is connected to only one of the outputs. On the other hand, it is sensible to speculate that if all the RBF were connected to all the outputs, the accuracy of the network could be improved. That is, the weights assigned by the boosting method are ignored; boosting is used only for selecting the different RBF. Then, any other method is used for the calculation of the weights. In the experimental validation, given that we are considering classification problems, Logistic Regression [3] will be used.
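A sketch of this variant is shown below: the boosted RBFs are kept only as feature extractors, and a new model is fitted on their outputs so that every RBF is connected to every output. Using scikit-learn's LogisticRegression as a stand-in for the logistic regression of [3] (the paper uses the WEKA implementation) is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def recalculate_weights(rbfs, X_train, y_train):
    """Refit all output weights over the RBFs selected by boosting.

    rbfs: list of callables h_k(x) obtained from the boosting process.
    """
    # Hidden-layer outputs, one column per RBF.
    H = np.array([[h(x) for h in rbfs] for x in X_train])
    model = LogisticRegression(max_iter=1000).fit(H, y_train)
    return model   # connects every RBF to every output

def predict(model, rbfs, X):
    H = np.array([[h(x) for h in rbfs] for x in X])
    return model.predict(H)
```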

4. Experimental Validation

The data sets used are shown in table 1. They are the ones used in [7] and are available in the UCI Repository [1]. The number of iterations of the boosting process was 10. For the experiments, 10-fold stratified cross-validation was used.

The results are shown in table 2. This table includes, in the first two columns, the results reported in [7] for decision trees (C4.5) and RBF networks obtained from decision trees (TB-RBF, tree based). The third column shows the results obtained with RBF networks obtained from boosted rules (BR-RBF in the table). These results are worse than the results for TB-RBF in 2 of 5 cases. Especially disappointing are the results for the vowels data set, which are even worse than the results obtained with C4.5. Nevertheless, when boosting is used only for obtaining the RBFs and the weights are recalculated using Logistic Regression (BR+L-RBF in the table), the results are always better than the ones obtained with TB-RBF.

The selection of weights using Logistic Regression instead of using the ones provided by boosting is not always advantageous. The results are worse for 3 of 5 data sets. These 3 data sets are the ones that had better results with BR-RBF than with TB-RBF.

data set    attributes  classes  examples
bupa             6          2       345
diabetes         8          2       768
glass            9          6       214
vehicles        18          4       846
vowels          10         11       990

Table 1. Characteristics of the data sets.

data set    C4.5   TB-RBF  BR-RBF  BR+L-RBF  R-RBF  KM-RBF
bupa        65.3    68.8    71.0     69.6     60.9   60.3
diabetes    71.8    74.8    77.6     77.0     65.5   68.4
glass       61.7    62.8    73.4     68.2     65.9   65.4
vehicles    64.9    74.6    71.7     80.6     57.9   60.9
vowels      71.0    80.1    68.9     94.3     83.9   88.1

Table 2. Experimental results (accuracy). The first two columns are from [7].

Although recalculating the weights can worsen the results, it is worth considering, because for some data sets it improves the results very substantially.

It could be argued that the comparison between TB-RBF and BR+L-RBF is not fair, because the weights are calculated using different methods: TB-RBF uses a pseudoinverse matrix [7], while BR+L-RBF uses Logistic Regression. Nevertheless, the author of [7] states that he believes the method for calculating the weights is not decisive.

Two additional sets of experiments were performed. They also use Logistic Regression for calculating the weights, but the centers are selected with more classical approaches: randomly (R-RBF) and using a clustering method, k-means (KM-RBF). The number of centers selected was the same as when using boosting, that is, 10 per class. These results are worse than the ones of BR+L-RBF.
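For reference, the following Python sketch shows one way the KM-RBF centers could be selected with scikit-learn's KMeans. The paper only states that 10 centers per class were selected; running k-means separately within each class, and all names and parameters, are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_centers_per_class(X, y, centers_per_class=10, seed=0):
    """Select RBF centers by running k-means within each class (assumed setup)."""
    centers = []
    for label in np.unique(y):
        Xc = X[y == label]
        k = min(centers_per_class, len(Xc))   # guard against small classes
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        centers.append(km.cluster_centers_)
    return np.vstack(centers)
```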

5. Conclusions

A new method for constructing RBF networks has been presented. It is based on boosting. The base classifiers are formed by only one RBF, and each RBF is obtained from a decision rule. The experimental validation shows the validity of the method. If, for the considered data set, the Euclidean distance between examples is adequate for classification purposes, then, a priori, it would be more sensible to use classical techniques. The presented method seems more adequate when the importance of the features is very different and/or there are irrelevant features.

The method shares all the advantages of the construction of RBF networks using decision trees [7]:

• Every neuron in the hidden layer is connected only to a subset of the inputs, the ones that appear in the rule. This means that, for the same number of neurons in the hidden layer, the space requirements are smaller and the classification is faster.

• The rule (and decision tree) construction process selects the most relevant attributes, so there is a way to deal with irrelevant attributes.

• The RBFs do not have to share the same radius. Moreover, within each RBF, the radius is not the same for all the dimensions.

• There are very few parameters to adjust. For the presented method, these are only the number of iterations of the boosting algorithm (that is, the number of neurons in the hidden layer) and the multiplicative constant α in the RBF. In the experimental validation they were not adjusted in any way, but fixed a priori.

Most of the classical approaches for the construction of RBF networks have two separate phases: first, the centers and radii are selected; second, the weights are selected. One possible pitfall of this approach is that there is no way of adding new RBFs: once the weights are selected, it may turn out that more RBFs are necessary. The method presented in this paper is incremental: after each RBF is selected, its weights are calculated.

Acknowledgements

To the donors of the different data sets and the maintainers of the UCI Repository [1]. The methods used for constructing rules, LogitBoost, and Logistic Regression are from the WEKA library [13]. Hence, we are indebted to its developers.

References

[1] C. Blake and C. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996. ftp.stat.berkeley.edu/pub/users/breiman/bagging.ps.Z.
[3] S. le Cessie and J. van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201, 1992.
[4] W. W. Cohen and Y. Singer. A simple, fast, and effective rule learner. In AAAI-99, 1999. http://www.research.att.com/~wcohen/postscript/aaai-99-slipper.ps.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 2000. http://www-stat.stanford.edu/~jhf/ftp/boost.ps.
[6] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998. http://cm.bell-labs.com/cm/cs/who/tkh/papers/df.ps.gz.
[7] M. Kubat. Decision trees can initialize radial-basis function networks. IEEE Transactions on Neural Networks, 9:813–821, 1998. http://www.cacs.usl.edu/~mkubat/publications/dtrbf.ps.
[8] M. J. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, Scotland, UK, 1996. http://www.anc.ed.ac.uk/~mjo/papers/intro.ps.gz.
[9] J. J. Rodríguez Diez and C. J. Alonso González. Building RBF networks for time series classification by boosting. In D. Chen and X. Cheng, editors, Pattern Recognition and String Matching, volume 13 of Combinatorial Optimization. Kluwer, 2002. http://pisuerga.inf.ubu.es/juanjo/publs/prsm.ps.gz.
[10] J. J. Rodríguez Diez and C. J. Alonso González. Learning classification RBF networks by boosting. In J. Kittler and F. Roli, editors, Multiple Classifier Systems: Second International Workshop, MCS 2001, Lecture Notes in Computer Science, pages 43–52. Springer, 2001. http://pisuerga.inf.ubu.es/juanjo/publs/mcs01.ps.gz.
[11] R. E. Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. http://www.cs.princeton.edu/~schapire/papers/msri.ps.gz.
[12] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
[13] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999. http://www.cs.waikato.ac.nz/ml/weka/index.html.