Support Vector Machines for computing action mappings in Learning Classifier Systems Daniele Loiacono, Andrea Marelli, Pier Luca Lanzi Abstract— XCS with Computed Action, briefly XCSCA, is a recent extension of XCS to tackle problems involving a large number of discrete actions. In XCSCA the classifier action is computed with a parameterized function learned in a supervised fashion. In this paper, we introduce XCSCAsvm that extends XCSCA using Support Vector Machines to compute classifier action. We compared XCSCAsvm and XCSCA on the learning of several binary functions. The experimental results show that XCSCAsvm reaches the optimal performance faster than XCSCA.

I. INTRODUCTION

Learning Classifier Systems are a genetics-based machine learning technique for solving problems through interaction with an unknown environment. They maintain a population of condition-action rules, called classifiers, which represents the current solution to the target problem. Each classifier represents a small portion of the overall solution. The classifier condition c identifies a part of the problem domain. The classifier action a represents a decision on the subproblem identified by the condition c. In learning classifier systems, the genetic component works on classifier conditions, searching for an adequate decomposition of the target problem into a set of subproblems. In the vast majority of learning classifier systems, each classifier has an associated prediction (or strength [1]) parameter. The classifier prediction, p, provides an estimate of how valuable action a is, in terms of problem solution, on the subproblem identified by condition c. Learning classifier systems usually exploit reinforcement learning techniques for estimating the classifier prediction in each subproblem. Recently, the idea of computed prediction has been introduced to improve the estimation of the classifier prediction in terms of problem solution [2]. On the other hand, Bernadó et al. [3] introduced the sUpervised Classifier System (UCS), which replaces the usual reinforcement learning paradigm of learning classifier systems with supervised learning. In UCS the system is fed with the best action for each subproblem, and this information is used to evaluate the classifiers directly, without the need of any prediction parameter. In a recent work, Lanzi et al. [4] introduced XCS with Computed Action (XCSCA), which combines UCS with previous work on computed prediction [2], [5]. While in UCS the best action

Daniele Loiacono ([email protected]) is with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy.
Andrea Marelli ([email protected]) is with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy. Pier Luca Lanzi ([email protected]) is with the Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy, and is also a member of the Illinois Genetic Algorithms Laboratory (IlliGAL), University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

is used only for evaluating the classifiers, in XCSCA the classifiers also learn to compute the best possible action in a supervised learning fashion. In this paper, we introduce XCSCAsvm, which extends XCSCA using Support Vector Machines (SVMs) to compute the classifier action. Extending XCSCA with Support Vector Machines has two main advantages: (i) it casts the problem of learning the classifier action as a well-posed quadratic programming problem, and (ii) it can tackle nonlinear action mappings thanks to the so-called "kernel trick" [6]. In XCSCAsvm each classifier exploits an SVM for learning the best action, which involves a nontrivial optimization problem with quadratic requirements both in terms of time and memory. To deal with this issue we introduced a limited-complexity version of the chunking method, a well-known approach for applying SVMs to on-line learning problems [6], [7], [8], [9]. In order to test our approach, we compared XCSCAsvm and XCSCA on the learning of several binary functions introduced in [4]. Our results show that although XCSCAsvm generally learns faster than XCSCA, it is computationally more expensive than the perceptrons and the neural networks used in XCSCA for computing the action. Our results suggest that the choice of the action function heavily depends on the problem considered. In particular, XCSCAsvm seems a good choice for solving problems where the samples are sparsely distributed over the whole input space.

II. SUPPORT VECTOR MACHINES

In this section we give the basic background on Support Vector Machines (SVMs) needed for the remainder of the paper. For a comprehensive introduction to SVMs we refer the interested reader to [10]. Consider a given data set {⟨x1, y1⟩, . . . , ⟨xℓ, yℓ⟩} ⊂ X × L, where X is the input space (e.g., X ≡ ℝ^d) and L is a set of labels. We define the classification task as the problem of finding a mapping f : x → y such that f(xi) = yi for all the samples in the data set.
In the rest of this section we focus only on binary classification problems, i.e., L = {−1, +1}. The SVM approach consists of finding the best linear decision surface, f(x) = sign(w · x + b), for solving the classification problem. More formally, SVMs solve the following optimization problem:

    minimize    (1/2)‖w‖² + C Σ_i ξi                        (1)

    subject to  yi(xi · w + b) ≥ 1 − ξi,  ∀i                (2)

where the slack variables ξi [11] account for the samples that are not correctly classified; the parameter C ∈ ℝ⁺ is used for

1-4244-1340-0/07/$25.00 © 2007 IEEE

controlling the trade-off between the number of misclassified samples and the distance between the decision surface and the closest sample (equal to 1/‖w‖). In practice the input samples are usually far from being linearly separable, and thus SVMs have also been extended to nonlinear decision surfaces: the input space X can be mapped with a function, φ, onto a suitable feature space H where the classification task can still be solved effectively with a linear decision surface. Unfortunately this approach cannot be used directly, because the dimensionality of such a feature space grows exponentially, quickly becoming computationally infeasible. On the other hand, it has been noted [12] that the optimization problem defined by Equation 1 and Equation 2 can be solved effectively in its dual formulation without using the mapping function φ directly. In fact it can be solved solely in terms of the kernel function, k(x, x′) = φ(x) · φ(x′), i.e., the inner product in H. This key observation, usually referred to as the kernel trick, makes it feasible to solve classification problems with a nonlinear decision surface, because a large class of mapping functions φ(·) admits an easy-to-compute kernel. Applying the kernel trick, the solution of the classification task can be expressed as:

    f(x) = sign( Σ_i αi yi k(xi, x) + b )                   (3)

where αi is the Lagrangian coefficient associated with the sample xi. The values of αi and b can be found by solving the dual optimization problem associated with the one described by Equation 1 and Equation 2. It is worthwhile observing that only a subset, usually small, of the data samples affects the final solution, i.e., the samples ⟨xi, yi⟩ with a nonzero coefficient αi. Such samples are usually called Support Vectors (SVs).

III. XCSCA

XCS with Computed Action (XCSCA) extends the typical XCS structure to tackle problems involving a large number of discrete actions. It is inspired by UCS [3], the extension of XCS to supervised learning, and by XCSF [2], the extension of XCS with computed prediction. XCSCA borrows from UCS the idea of having the correct action (the correct output) as a part of the system input, and from XCSF the idea of replacing a classifier parameter with a parameterized function. Accordingly, XCSCA does not learn a complete mapping of the target problem but, like UCS, it focuses on the correct output, which in XCSCA is computed from the current input.

A. Classifiers

In XCSCA, classifiers consist of a condition, which specifies the set of inputs that the classifier matches, and four parameters: the vector w, used to compute a discrete classifier action; the error ε, which estimates the error affecting the computed action; the fitness F, which estimates the accuracy of the computed action; and the numerosity num, a counter used to represent different copies of the same classifier. In XCSCA, classifiers have no action. The classifier action is computed


using a parameterized function af(x, cl.w) that computes the classifier's discrete action based on the current input x and the parameter vector w associated with each classifier. In XCSCA, classifiers have no prediction since, as in UCS, there is no incoming reward.

B. Performance Component

XCSCA works similarly to UCS [3]. At each time step t, XCSCA receives as input the current input example xt and the associated output yt. XCSCA builds a match set [M] containing the classifiers in the population [P] whose condition matches the current input xt. If [M] is empty, covering takes place and a new classifier that matches the current input is inserted into the population. The covering classifier is generated as follows: the classifier condition is created as in XCS [3], [13], the parameter vector w is initialized with zero values, and all the other parameters are initialized as in XCS [13], [14]. At this point, the behavior of XCSCA differs depending on whether the system is working in learning mode or in testing mode. During learning, XCSCA exploits the incoming information about the desired output yt to update the classifiers in [M] following the procedure described below. During testing, for each classifier cl in [M], XCSCA computes the current (discrete) action af(xt, cl.w). Then, for each action a computed from the classifiers in the match set [M], XCSCA computes the classification accuracy of action a for the input xt, C(xt, a), as the average fitness of the classifiers in [M] that advocate action a, i.e.,

    C(xt, a) = ( Σ_{cl ∈ [M](xt, a)} cl.F ) / |[M](xt, a)|      (4)

where [M](xt, a) is the set of classifiers in [M] that advocate action a for the input xt. Finally, XCSCA selects the action with the highest classification accuracy.

C. Classifier Update

XCSCA works in a supervised fashion, as UCS [3] does, and thus has no incoming reward. In UCS, the incoming correct output yt is used to build the set [C] of correct classifiers and the set [!C] of incorrect classifiers.
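The action-selection step of Equation 4 can be sketched as follows. This is an illustrative fragment, not the authors' code; the dictionary-based classifier representation and the `action_of` callback are assumptions made for the example.

```python
# Hypothetical sketch of XCSCA's action selection (Eq. 4).
# A classifier is modeled here as a dict with a "fitness" key.

def select_action(match_set, x, action_of):
    """Pick the action with the highest classification accuracy C(x, a).

    match_set : list of classifiers in [M]
    action_of : callable computing a classifier's action for input x,
                i.e. af(x, cl.w) in the paper's notation
    """
    fitness_by_action = {}
    for cl in match_set:
        a = action_of(cl, x)
        fitness_by_action.setdefault(a, []).append(cl["fitness"])
    # C(x, a) = average fitness of the classifiers in [M] advocating a
    return max(fitness_by_action,
               key=lambda a: sum(fitness_by_action[a]) / len(fitness_by_action[a]))
```

With three classifiers advocating two actions, the action backed by the higher average fitness wins, regardless of how many classifiers advocate it.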
In XCSCA, the same information is exploited to train the classifiers in the match set [M]. The classifier error is updated first, then the weight vector, and finally the classifier fitness. The error ε of the classifiers in [M] is updated as

    cl.ε ← cl.ε + β(εf(xt, yt, a) − cl.ε)                   (5)

where the error function εf(xt, yt, a) is equal to 0 if the classification is correct (i.e., yt = a) and to 1000 if the classification is incorrect (i.e., yt ≠ a). Then the weight vectors of the classifiers in [M] are updated according to the correct incoming action. This training phase depends on the action function af(xt, cl.w) used and is illustrated in detail in Section III-E. Finally, classifier fitness is updated from the classifier error as in XCS [13].
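The error update of Equation 5 is a standard Widrow-Hoff step toward the current error value. A minimal sketch, with illustrative names (the constants mirror the values stated in the text, β = 0.2 being the setting used later in Section VI):

```python
# Hedged sketch of the classifier error update (Eq. 5).

BETA = 0.2            # learning rate beta, as in the experiments
ERROR_WRONG = 1000.0  # error-function value for a misclassification

def update_error(cl_error, computed_action, correct_action, beta=BETA):
    """Move cl.eps toward the error of the current classification."""
    eps_f = 0.0 if computed_action == correct_action else ERROR_WRONG
    return cl_error + beta * (eps_f - cl_error)
```

A misclassified sample pulls the error toward 1000, a correct one pulls it back toward 0, so the error tracks a moving average of the misclassification rate scaled by 1000.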

2007 IEEE Congress on Evolutionary Computation (CEC 2007)


D. Discovery Component

The genetic algorithm works as in XCS [15] except for the parameter vector cl.w. The parameter vectors cl.w of offspring classifiers can either be copied from the parents (as done in the experiments presented here) or be obtained by recombining the parents' vectors.

E. Action Functions

Several functions may be used to compute classifier actions, depending on the problem being solved. With Boolean functions [13], any function returning two values is feasible. When more actions are involved, more elaborate solutions need to be considered.

The perceptron. The perceptron takes the current input xt and outputs 0 or 1 through a two-stage process: first the linear combination cl.w · xt of the input xt and the weight vector cl.w is calculated; then the perceptron outputs 1 if cl.w · xt is greater than zero, and 0 otherwise. The original binary input xt is enriched with the usual constant input x0 and, since zero values for inputs must generally be avoided [16], binary inputs are mapped into integer values by replacing zeros with -5 and ones with +5. Given the current input xt and the desired output yt, the weight wi associated with input xi is updated as [17]:

    wi ← wi + η(yt − ot)xi                                  (6)

where ot is the perceptron output for input xt, and η is the usual learning rate. Extending the perceptron action function to problems that involve more than two actions is straightforward: an array of perceptrons can be used to compute the binary-encoded value of the classifier action.

Neural networks. When the problem involves many actions, it is possible to employ either an array of simple Boolean functions, such as an array of perceptrons, or a more adequate solution such as a neural network. The action function af(xt, cl.w) is computed by a neural network with n inputs, one for each component of xt, h hidden nodes, and as many outputs as required by the problem. The activation functions for both hidden and output nodes are the usual sigmoid [16]. Given the current input xt and the desired output yt, the network weights are updated using on-line backpropagation [16].

IV. COMPUTING ACTIONS WITH SVM

In this section we introduce XCSCA with SVMs, briefly XCSCAsvm. XCSCAsvm extends XCSCA in two main respects: (i) the action function of each classifier is computed by an SVM, and the usual parameter vector, cl.w, is replaced by a set of Support Vectors, cl.SVs; (ii) the usual update of the parameter vector, cl.w, is replaced by the training of an SVM for computing the classifier action. In the following we describe these extensions in detail and finally discuss the computational complexity of XCSCAsvm.
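The structural difference between the two classifier types can be sketched as data structures. This is purely illustrative (the paper does not publish code); field names and types are assumptions.

```python
# Illustrative sketch of the classifier structures in XCSCA and XCSCAsvm.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ClassifierXCSCA:
    condition: str         # e.g. a ternary string over {0, 1, #}
    w: List[float]         # parameter vector of the action function
    error: float = 0.0     # eps: error of the computed action
    fitness: float = 0.01  # F: accuracy of the computed action
    numerosity: int = 1    # num: copies represented by this classifier

@dataclass
class ClassifierXCSCAsvm:
    condition: str
    # <x_i, y_i, alpha_i> triples, as in Eq. 7; size bounded by theta_SV
    svs: List[Tuple[List[float], int, float]] = field(default_factory=list)
    b: float = 0.0         # bias term cl.b of the SVM decision function
    error: float = 0.0
    fitness: float = 0.01
    numerosity: int = 1
```

The only change is the replacement of the fixed-size vector `w` by the bounded set `svs` plus the bias `b`; the bookkeeping parameters are untouched.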

A. The SVM Action Function

The SVM action function takes x as input and returns either 0 or 1. For computing the classifier action, cl.w is replaced by the coefficient cl.b and a set of Support Vectors, cl.SVs, defined as:

    cl.SVs = {⟨xi, yi, αi⟩}                                 (7)

According to Equation 3, the action is thus computed as:

    af(x, cl.SVs, cl.b) = sign( Σ_{⟨xi, yi, αi⟩ ∈ cl.SVs} αi yi k(xi, x) + cl.b )      (8)

where k(·, ·) is a suitable kernel function. When the problem involves k > 2 actions, we follow a well-known approach for extending SVMs to multiclass problems, called one-against-one (in brief, 1-v-1) [18], [19]: the classifier action is computed as the one receiving the most votes from an ensemble of k(k − 1)/2 binary SVMs (one for each pair of actions). When the action function is updated, each SVMij in the ensemble is updated with all the samples available for the actions i and j. In addition, the set of SVs, cl.SVs, is shared among all the binary SVMs in the ensemble: (i) the scalar Lagrangian coefficient αi associated with each Support Vector is replaced by a vector αi of coefficients; (ii) in order to be a Support Vector for the ensemble, a sample must be a Support Vector for at least one of the SVMs in the ensemble.

B. Update of the SVM Action Function

As in XCSCA, in XCSCAsvm at each time step the incoming sample ⟨xt, yt⟩ is used to update the classifiers' action function. Unfortunately, training an SVM involves the solution of a nontrivial optimization problem that has quadratic requirements both in terms of time and memory. Many efficient algorithms have been introduced [20], [21] for solving this optimization problem, but they require all the training samples at once, i.e., they solve a batch learning problem. In XCSCAsvm, instead, we have to solve an on-line learning problem, i.e., at each time step a new training sample is available. Thus, we used a slightly modified version of the chunking method, one of the most used approaches for applying SVMs to on-line learning problems [6], [7], [8], [9]. The basic idea behind the chunking method is rather simple: at each time step the SVM is trained from scratch using the current set of SVs and the latest available sample; accordingly, the set of SVs is entirely replaced by the new one.
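The binary decision function of Equation 8, with the RBF kernel used later in the experiments, can be sketched in a few self-contained lines. All names are illustrative assumptions; the function returns 0/1 so it can serve directly as a binary classifier action.

```python
# Minimal sketch of the SVM action function (Eq. 8) with an RBF kernel.
import math

def rbf_kernel(x1, x2, gamma):
    """k(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)

def svm_action(x, svs, b, gamma):
    """af(x, cl.SVs, cl.b): sign of the kernel expansion over the SVs.

    svs : iterable of (x_i, y_i, alpha_i) triples with y_i in {-1, +1}
    """
    s = sum(alpha * y * rbf_kernel(xi, x, gamma)
            for xi, y, alpha in svs) + b
    return 1 if s > 0 else 0
```

Note that the cost of one evaluation is linear in the number of Support Vectors, which is why XCSCAsvm bounds |cl.SVs| by θSV.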
The main drawback of the chunking method is that the number of Support Vectors may grow indefinitely during learning. To deal with this issue, in XCSCAsvm we used a slightly modified version of the chunking method, which updates the action function as follows. At each time step t the current sample, ⟨xt, yt⟩, is used to update the action function of each classifier cl in the match set [M] according to Algorithm 1. First, the classifier action is computed and compared to the correct action yt (line 2, Algorithm 1); if the computed action is correct, the action function is not updated; otherwise a training set, T, is built by adding the current sample, ⟨xt, yt⟩, to the current set of Support Vectors, cl.SVs (line


3, Algorithm 1). When the size of T is greater than a fixed threshold, θSV, the oldest sample is removed from T. Finally, a new SVM is trained from scratch on T, and the resulting set of Support Vectors replaces cl.SVs (line 7, Algorithm 1). We implemented XCSCAsvm using the well-known LIBSVM library [22] to perform the underlying batch training of SVMs. In all the experiments reported in this paper we used an RBF kernel, defined as k(x, x′) = exp(−γ‖x − x′‖²), where γ ∈ ℝ⁺ is a kernel parameter.

Algorithm 1 XCSCAsvm: action function update.
1: procedure UPDATE(cl, ⟨xt, yt⟩)
2:   if (yt ≠ cl.af(xt, cl.SVs, cl.b)) then
3:     T ← {⟨xt, yt⟩} ∪ {⟨xi, yi⟩ : ⟨xi, yi, αi⟩ ∈ cl.SVs}
4:     if |T| > θSV then
5:       Remove the oldest sample from T
6:     end if
7:     cl.SVs ← TRAIN BATCH(T)
8:   end if
9: end procedure

C. Computational Complexity

Even if the action function in XCSCAsvm was specifically designed to have limited complexity, it is still computationally more expensive than the action functions used in XCSCA. Table I reports the asymptotic complexity of all the action functions compared in this paper. The column Parameters reports the memory requirements for storing all the parameters of the action function; the columns Output and Update report, respectively, the time requirements for computing the classifier action and for updating the action function. We used the following notation: n is the number of input attributes; k is the number of actions involved in the problem, while k′ ∈ [1, min{θSV, k}] is the number of different actions among the samples in cl.SVs; h is the number of hidden nodes of the neural network; θSV is the bound on the size of cl.SVs. It can be noticed that the array of perceptrons is the action function with the lowest complexity, having both time and memory requirements that scale linearly with respect to n and to the number of bits necessary for a binary encoding of the action (log₂ k).
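The control flow of Algorithm 1 can be sketched as follows. This is an illustrative fragment: `train_batch` stands in for a real batch SVM solver (the paper uses LIBSVM) and is here a trivial placeholder that keeps every sample as a Support Vector with α = 1, purely so the bookkeeping is runnable.

```python
# Sketch of Algorithm 1: the bounded chunking update of cl.SVs.

THETA_SV = 10  # bound on |cl.SVs|, as in the experiments of Section VI

def train_batch(samples):
    """Placeholder for LIBSVM-style batch training (returns an SV set)."""
    return [(x, y, 1.0) for x, y in samples]

def update_action_function(cl_svs, compute_action, x_t, y_t,
                           theta_sv=THETA_SV):
    """Update a classifier's SV set with the incoming sample <x_t, y_t>."""
    if compute_action(x_t, cl_svs) == y_t:
        return cl_svs                        # correct: no update (line 2)
    # line 3: training set = current SVs plus the incoming sample
    training_set = [(x, y) for x, y, _ in cl_svs] + [(x_t, y_t)]
    if len(training_set) > theta_sv:         # line 4
        training_set.pop(0)                  # drop the oldest sample (line 5)
    return train_batch(training_set)         # line 7: retrain from scratch
```

The θSV cap is what keeps both the retraining cost and the evaluation cost of Equation 8 bounded per time step.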
On the other hand, we can see that the neural network has both time and memory requirements that are linear in the number of weights of the network [23]. Finally, it is worthwhile observing that the memory and time requirements of the SVM [22], [19] do not depend on the number of actions involved in the problem but on the number of different actions among the samples in cl.SVs. Even more interesting, the computational complexity of updating the action function depends on neither k nor k′. In conclusion, the comparison between the action function requirements is heavily affected by the problem considered. Generally speaking, we can say that the SVM is more expensive than the array of perceptrons and the neural network in small and simple problems, i.e., when n and log₂ k are smaller than θSV. On the other hand, the SVM has competitive memory and time requirements in sparse


problems with a huge input space and many actions, i.e., when n and log₂ k are greater than θSV. However, in the experiments considered in this paper the SVM is generally more expensive, even if in the most difficult problems considered here such overhead is very small.

V. EXPERIMENTAL DESIGN

In this paper we applied XCSCAsvm and XCSCA to the learning of several binary functions. Each experiment consists of a number of problems that the system must solve. Each problem is either a learning problem or a test problem. During learning problems, the system exploits the input xt and the desired output yt to train the classifiers in the match set. During test problems, the system always selects the action with the highest classification accuracy for the input xt, and no update is performed. The genetic algorithm is enabled only during learning problems and is turned off during test problems. The covering operator is always enabled, but operates only if needed. Learning problems and test problems alternate. The performance is computed as the percentage of correct answers. All the reported statistics are averages over 20 experiments.

Statistical Analysis: To analyze the results reported in this paper, we followed the procedure introduced in [24] for the comparison of performance curves. For each experiment, for every setting we tested, we considered all the performance curves; we sampled the curves, considering only one point every 1000 problems; we applied an analysis of variance (ANOVA) [25] on the resulting data to test whether there was some statistically significant difference; finally, we applied four post hoc tests [25], Tukey HSD, Scheffé, Bonferroni, and Student-Newman-Keuls, to find which settings performed significantly differently.

VI. EXPERIMENTAL RESULTS

A. Boolean Multiplexer

In the first set of experiments, we compared XCSCAsvm and XCSCA on the Boolean multiplexer [13], [26], a typical testbed for learning classifier systems.
The Boolean multiplexer is defined over binary strings of length n, where n = k + 2^k; the first k bits, x0, . . . , xk−1, represent an address which indexes the remaining 2^k bits, y0, . . . , y_{2^k−1}; the function returns the value of the indexed bit. For instance, in the 6-multiplexer function, mp6, we have that mp6(100010) = 1 while mp6(000111) = 0. At first, we compared XCSCA and XCSCAsvm on the 20-multiplexer with the following parameter setting [27]: N = 2000; β = 0.2; α = 0.1; ε0 = 10; ν = 5; χ = 0.8; μ = 0.04; P# = 0.5; θdel = 20; θGA = 25; δ = 0.1; GA subsumption and action-set subsumption are on, with θsub = 20 and θAS = 100; in XCSCA with the neural network we set h = 10; in XCSCAsvm we set θSV = 10, γ = 0.0001, and C = 100. Figure 1a compares the performance of XCSCAsvm with that of XCSCA when the action is computed by a neural network or a perceptron. Both XCSCA and XCSCAsvm reach the optimal performance. Even in a simple problem like the 20-multiplexer, XCSCAsvm converges faster than XCSCA,


TABLE I
Time and memory requirements of the different action functions (when k ≥ 2).

af          | Parameters          | Output              | Update
PERCEPTRONS | O(n · log₂ k)       | O(n · log₂ k)       | O(n · log₂ k)
NN          | O(h · (n + log₂ k)) | O(h · (n + log₂ k)) | O(h · (n + log₂ k))
SVM         | O(θSV · (n + k′))   | O(n · θSV · k′)     | O(n · θSV²)

Fig. 1. XCSCA and XCSCAsvm applied to the 20-multiplexer: (a) performance; (b) number of macroclassifiers.

in particular when compared to the version with the neural network. This is not surprising, because the neural network is a more powerful classifier than the perceptron but requires more samples to learn the optimal values of the weights. It is also worthwhile observing that while both the perceptron and the neural network are updated with a gradient-descent method, the SVM is updated by solving a quadratic programming problem, which allows a faster convergence. In addition, XCSCAsvm is also able to generalize faster than XCSCA, as shown in Figure 1b, even if the final generalization is basically the same. Notice that XCSCA with the neural network generalizes even more slowly and, after 50000 learning problems, evolved a solution that is not as compact as the ones evolved by XCSCA with the perceptron and by XCSCAsvm. The statistical analysis of the results showed that the differences between XCSCAsvm and the two versions of XCSCA are significant at a 99.99% confidence level. In conclusion, it is worthwhile observing that in the 20-multiplexer problem the neural network and the SVM have the same space and time complexity, while the array of perceptrons has lower time and memory requirements.

B. Binary Shift
In the second set of experiments we move to a quite simple binary function, the binary shift. The binary shift of size m, briefly shift_m, takes as input a binary string x = ⟨x1, . . . , xm⟩ and returns the binary string y of size m obtained by shifting the m input bits to the right, i.e., y = ⟨0, x1, . . . , xm−1⟩. For example, if m = 8 and x = 11011011, then shift8(11011011) returns 01101101. It is interesting to observe that each bit of the output string actually depends on the value of a single bit of the input string. Thus, the binary shift is quite a simple problem for XCSCA: both a single array of perceptrons and a single neural network can easily compute the correct action in the whole input space. On the other hand, this problem is more challenging for XCSCAsvm: instead of computing each bit of the output string independently, the SVM has to learn a mapping between the whole input string and the whole output string. In addition, the number of possible actions is exponential in the number of outputs (there are in fact 2^{m−1} actions). We applied XCSCAsvm and XCSCA to the binary shift problem with m = 8. Figure 2 compares (a) the performance and (b) the population size of XCSCAsvm and of XCSCA with the perceptrons and with the neural network. The parameter setting is the same as in the previous experiment, except for P# = 0.3. As expected, computing the classifier action with an array of perceptrons or a neural network allows XCSCA to solve the problem using a single classifier, while XCSCAsvm needs more classifiers to reach the optimal performance. Figure 2a shows that XCSCAsvm converges only slightly more slowly than XCSCA (the difference is statistically significant at a confidence level of 99.99%), while the two versions of XCSCA do not perform significantly differently.

C. Binary Sum

The binary shift involves many actions; however, it is a rather easy problem because each action bit of the output can be independently computed.
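For reference, the shift_m target function itself is one line of code; this sketch operates on binary strings, matching the example shift8(11011011) = 01101101 given above.

```python
# The shift_m target function on binary strings.

def binary_shift(x: str) -> str:
    """Right-shift the m input bits by one, inserting a leading 0."""
    return "0" + x[:-1]
```

The difficulty for the learner thus lies not in the function itself but in how the output is represented: bit-wise (easy for an array of perceptrons) versus as a single label out of 2^{m−1} possible actions (the SVM's view).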
Instead, the next set of experiments is performed on a more difficult function, the binary sum, in which the action bits are correlated. The binary sum of size m, briefly sum_m, takes as input a binary string of size 2m, representing two binary numbers of size m, x and y, and returns a binary string of size m + 1 obtained as the sum x + y. For example, suppose that m = 3, x = 100, and y = 101; then sum3(100101) returns 1001. If x = 010 and y = 001, then sum3(010001) returns 0011. We compared XCSCA and XCSCAsvm on the sum4


Fig. 2. XCSCA and XCSCAsvm applied to shift8: (a) performance; (b) number of macroclassifiers.

Fig. 3. XCSCA and XCSCAsvm applied to sum4: (a) performance; (b) number of macroclassifiers.

problem with the same parameter setting as in the previous experiment, except that action-set subsumption is off. Figure 3a reports the performance of XCSCAsvm and of XCSCA with the perceptrons and with the neural network. The results show that XCSCAsvm is able to converge faster than XCSCA because it exploits the SVMs to learn the action mapping more effectively (the difference is statistically significant at a 99.99% confidence level). In addition, notice that XCSCA with the perceptrons converges slightly faster than XCSCA with the neural network. Although the difference is statistically significant, it is much smaller than the one found for the 20-multiplexer problem: the more difficult the problem, the more suitable powerful action functions, like a neural network, become for solving it. When we look at the generalization capabilities (Figure 3b), we note that XCSCAsvm generalizes slightly faster than XCSCA and evolves almost the same number of macroclassifiers as XCSCA with the array of perceptrons. On the other hand, XCSCA with the neural network generalizes more slowly but requires fewer macroclassifiers. In order to confirm our hypothesis, we performed a second set of experiments on the binary sum, with m = 5. Figure 4a reports the performance of XCSCAsvm and XCSCA applied to the sum5 problem with the same parameter settings as in the previous experiment, except for the population size, N, which was set to 5000. In this case XCSCAsvm converges slightly faster than XCSCA. Notice that in this problem XCSCA with the neural network outperforms XCSCA with the perceptrons, confirming our previous hypothesis: more powerful action functions, like the SVM and the neural network, are more suitable for difficult problems. As expected, XCSCAsvm and

XCSCA evolve roughly the same number of macroclassifiers, while XCSCA with the neural network is able to solve the problem with fewer macroclassifiers, suggesting that XCSCA with the neural network may have higher generalization capabilities on this kind of problems. All the discussed differences in the results are statistically significant at a 99.99% confidence level.
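The sum_m mapping used in these experiments can also be stated directly as code, matching the examples sum3(100101) = 1001 and sum3(010001) = 0011 given earlier; the carry propagation is what correlates the output bits.

```python
# The sum_m target function on binary strings.

def binary_sum(s: str) -> str:
    """Split the 2m input bits into x and y; return x + y on m + 1 bits."""
    m = len(s) // 2
    x, y = int(s[:m], 2), int(s[m:], 2)
    return format(x + y, "0{}b".format(m + 1))
```

Unlike the binary shift, no output bit here can be computed from a fixed subset of input bits in isolation, which is what makes the action mapping harder for the bit-wise perceptron array.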


D. Anticipatory Functions

In the last set of experiments, we compared XCSCAsvm and XCSCA on a problem that is well known to the learning classifier systems community, i.e., the learning of anticipatory behavior [14]. Anticipatory classifier systems [14] extend the typical classifier structure by adding the prediction of the next state that the system will encounter after the classifier action is performed. In this work, as in [4], we simply applied XCSCAsvm to learn a model of the environment which may be used to implement anticipations. For a proper anticipatory classifier system we refer the reader to [14], and to [28] for a neural-based implementation of anticipatory classifier systems. We focused on typical multistep environments, namely the woods environments, which have already been used both with XCS [13] and with ACS [14]. We applied XCS to Woods1 [13] and Maze5 [29] and traced the state transitions that XCS performed during the learning steps. Each transition records the current state st, the performed action at, and the next state st+1 encountered after action at was performed in st. For each environment, we generated ten data sets containing 100000 transitions. The sequence of transitions contained in each data set was fed to XCSCA, which had to learn a mapping from state-

2007 IEEE Congress on Evolutionary Computation (CEC 2007)

Authorized licensed use limited to: Politecnico di Milano. Downloaded on February 14,2010 at 07:01:14 EST from IEEE Xplore. Restrictions apply.

(a)

(a)

(b)

(b)

Fig. 4. XCSCA and XCSCAsvm applied to sum5 : (a) performance; (b) number of macroclassifiers.

Fig. 5. XCSCA and XCSCAsvm applied to the learning of the model for Woods1: (a) performance; (b) number of macroclassifiers.

action pairs (represented as strings of 19 bits) into the next states (represented as strings of 16 bits). It is important to notice that even if, in principle, this problem is defined over a huge input-output space (219 input configurations and 216 possible outputs), it involves in practice a much smaller input-output space: the number of input configurations cannot be greater than |S × A| and there are no more than |S| possible outputs, where S and A are respectively the state and action space of the underlying multistep environment modeled. In the typical multistep environments considered here, the state space S is rather small and |A| = 8. Thus, learning the model of such environments would not be, at least in principle, more difficult than learning the binary sum. However applying XCSCA to the learning of an anticipatory function still involves using as action function an array of perceptrons or a neural network with 19 input variables and 16 output variables. In XCSCAsvm instead the action function is computed only on the basis of a particular subset of the problem samples, the Support Vectors. For this reason, in XCSCAsvm the resources of the action function (i.e., the action function parameters) are used only for the meaningful regions of the input space. We applied XCSCAsvm and XCSCA to learn the model of Woods1 and Maze5 with the same parameters used for the sum5 experiment. Figure 5 reports the performance and the population size of XCSCAsvm and XCSCA on Woods1. The problem is rather simple: there are only 128 different input configurations and 17 actions. As expected XCSCAsvm converges faster than XCSCA, especially when compared to the neural network (Figure 5). On the other hand, all the systems reach the same level of generalization (Figure 5b) after 100000 learning

problems. Figure 6a reports the performance on the more difficult problem of learning the model of Maze5, which involves 288 different input configurations and 37 actions. The results confirm our hypothesis: XCSCAsvm reaches the optimal performance roughly as fast as in the previous experiments, while XCSCA with the array of perceptrons or the neural network is significantly slower. Also in this case all the systems reach the same level of generalization (Figure 6b) at the end of the experiment. Notice that, according to the statistical analysis performed on the results, the discussed differences are significant at the 99.99% confidence level. In conclusion, it is worth observing that in the anticipatory function problems XCSCAsvm has smaller memory requirements than XCSCA with the neural network or the array of perceptrons; on the other hand, it still has a higher computational complexity.

VII. CONCLUSIONS

We introduced XCSCAsvm, which extends XCSCA by using Support Vector Machines (SVMs) to compute the action function. Unfortunately, training an SVM involves the solution of a nontrivial quadratic programming problem. To deal with this issue, in XCSCAsvm we used a limited-complexity approach inspired by the well-known chunking methods. The analysis of computational complexity suggests that, even if XCSCAsvm is more expensive than XCSCA, the additional overhead can be limited and may be negligible for some kinds of problems. We then compared XCSCAsvm and XCSCA on the learning of several binary problems introduced in [4]. Our experimental results show that, even using a limited-complexity approach, the SVM is able to compute effectively





Fig. 6. XCSCA and XCSCAsvm applied to the learning of the model for Maze5: (a) performance; (b) number of macroclassifiers.

the classifier action function: in almost all the problems considered, XCSCAsvm reaches the optimal performance faster than XCSCA. According to the experimental results reported in this paper, we can suggest some guidelines for choosing the action function depending on the problem at hand: (i) even in very simple problems XCSCAsvm may be slightly faster than XCSCA, but XCSCA with the array of perceptrons is probably the best trade-off between performance and computational complexity; (ii) if the action bits can ideally be computed independently (as in the binary shift problem), XCSCA may outperform XCSCAsvm; (iii) in difficult problems XCSCAsvm may learn faster than XCSCA, but XCSCA with the neural network is able to evolve a more compact solution; (iv) in problems where the samples are sparsely distributed over the whole input space, XCSCAsvm may outperform XCSCA in terms of convergence speed and offer a more convenient representation of the solution.

REFERENCES

[1] J. Holland, "Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems," in Machine Learning, An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell, Eds. Los Altos, CA: Morgan Kaufmann, 1986, vol. 2, ch. 20, pp. 593-623.
[2] S. W. Wilson, "Classifiers that approximate functions," Natural Computing, vol. 1, no. 2-3, pp. 211-234, 2002.
[3] E. Bernadó-Mansilla and J. Garrell, "Accuracy-based learning classifier systems: Models, analysis and applications to classification tasks," Evolutionary Computation, vol. 11, pp. 209-238, 2003.
[4] P. L. Lanzi and D. Loiacono, "Classifier systems that compute action mappings," Illinois Genetic Algorithms Laboratory, University of Illinois at Urbana-Champaign, Tech. Rep. 2007002, 2007.


[5] P. L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg, "XCS with computed prediction for the learning of Boolean functions," in Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2005). Edinburgh, UK: IEEE, Sep. 2005, pp. 588-595.
[6] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1995.
[7] L. Ralaivola and F. d'Alché-Buc, "Incremental support vector machine learning: A local approach," Lecture Notes in Computer Science, vol. 2130, pp. 322-??, 2001.
[8] R. Rosipal and M. Girolami, "An adaptive support vector regression filter: A signal detection application," 1999.
[9] N. Syed, H. Liu, and K. Sung, "Incremental learning with support vector machines," 1999.
[10] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. New York, NY, USA: Cambridge University Press, 2000.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[12] B. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992.
[13] S. W. Wilson, "Classifier fitness based on accuracy," Evolutionary Computation, vol. 3, no. 2, pp. 149-175, 1995. [Online]. Available: http://prediction-dynamics.com/
[14] M. V. Butz and S. W. Wilson, "An algorithmic description of XCS," Soft Computing, vol. 6, no. 3-4, pp. 144-153, 2002.
[15] S. W. Wilson, "Mining oblique data with XCS," ser. Lecture Notes in Computer Science, P. L. Lanzi, W. Stolzmann, and S. W. Wilson, Eds., vol. 1996. Springer-Verlag, Apr. 2001, pp. 158-174.
[16] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, 1998.
[17] F. Rosenblatt, Principles of Neurodynamics. New York: Spartan Books, 1962.
[18] J. H. Friedman, "Another approach to polychotomous classification," Stanford University, Department of Statistics, Tech. Rep., Oct. 1996.
[19] C. Hsu and C. Lin, "A comparison of methods for multiclass support vector machines," 2001.
[20] R. Collobert, S. Bengio, and J. Mariéthoz, "Torch: a modular machine learning software library," IDIAP Research Report 02-46, Martigny, Switzerland, 2002. Software available at www.torch.ch.
[21] J. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," 1998.
[22] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2006. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[23] C. M. Bishop, Neural Networks for Pattern Recognition. New York, NY, USA: Oxford University Press, Inc., 1995.
[24] J. H. Piater, P. R. Cohen, X. Zhang, and M. Atighetchi, "A randomized ANOVA procedure for comparing performance curves," in Machine Learning: Proceedings of the Fifteenth International Conference (ICML). Madison, Wisconsin: Morgan Kaufmann, Jul. 1998, pp. 430-438.
[25] S. A. Glantz and B. K. Slinker, Primer of Applied Regression & Analysis of Variance, 2nd ed. McGraw Hill, 2001.
[26] S. W. Wilson, "Generalization in the XCS classifier system," in Genetic Programming 1998: Proceedings of the Third Annual Conference. Morgan Kaufmann, 1998, pp. 665-674.
[27] M. V. Butz, K. Sastry, and D. E. Goldberg, "Strong, stable, and reliable fitness pressure in XCS due to tournament selection," Genetic Programming and Evolvable Machines, vol. 6, pp. 53-77, 2005.
[28] T. O'Hara and L. Bull, "Building anticipations in an accuracy-based learning classifier system by use of an artificial neural network," in IEEE Congress on Evolutionary Computation. IEEE Press, 2005, pp. 2046-2052.
[29] P. L. Lanzi, "An analysis of generalization in the XCS classifier system," Evolutionary Computation, vol. 7, no. 2, pp. 125-149, 1999.

