BCS: A Binary Cuckoo Search Algorithm for Feature Selection

D. Rodrigues, L. A. M. Pereira, T. N. S. Almeida, J. P. Papa
Department of Computing
UNESP - Univ Estadual Paulista
Bauru, São Paulo, Brazil
[email protected]

A. N. Souza, C. C. O. Ramos
Department of Electrical Engineering
University of São Paulo
São Paulo, São Paulo, Brazil
[email protected]

Xin-She Yang
School of Engineering and Information Sciences
Middlesex University
Hendon, London, United Kingdom
[email protected]

Abstract—Feature selection has been actively pursued in recent years, since finding the most discriminative set of features can enhance recognition rates and also make feature extraction faster. In this paper, we propose a new feature selection technique called Binary Cuckoo Search, which is based on the behavior of cuckoo birds. The experiments were carried out in the context of theft detection in power distribution systems on two datasets obtained from a Brazilian electrical power company, and have demonstrated the robustness of the proposed technique in comparison with several other nature-inspired optimization techniques.

I. INTRODUCTION

Feature selection attempts to find the most discriminative subset of features, i.e., the one that leads to suitable recognition rates for some classifier. Given a problem with d features, there are 2^d possible solutions, which makes an exhaustive search impracticable for high-dimensional feature spaces. With this in mind, several works have addressed the task of feature selection as an optimization problem, in which the fitness function can be a measure of feature space separability or even some classifier's recognition rate.

In recent years, meta-heuristic algorithms based on biological behavior and physical systems in nature have been proposed to solve optimization problems. Kennedy and Eberhart [1], for instance, proposed a binary version of the well-known Particle Swarm Optimization (PSO) called BPSO, in which the traditional PSO algorithm was modified in order to handle binary optimization problems. Later, Firpi and Goodman [2] extended BPSO to the context of feature selection. Some years after that, Rashedi et al. [3] proposed a binary version of the Gravitational Search Algorithm (GSA) called BGSA for feature selection, and Ramos et al. [4] presented their version of the Harmony Search (HS) for the same purpose in the context of theft detection in power distribution systems. Nakamura et al. [5] introduced their version of the Bat Algorithm (BA) for binary optimization problems, called BBA.

Recently, Yang and Deb [6] proposed a new meta-heuristic method for continuous optimization named Cuckoo Search (CS), which is based on the fascinating reproduction strategy of cuckoo birds.

The authors would like to thank FAPESP grants #2009/16206-1 and #2011/14094-1, CAPES, and also CNPq grant #303182/2011-3.

Several species engage in brood parasitism, laying their eggs in the nests of other host birds. This approach has been demonstrated to outperform some well-known nature-inspired optimization techniques, such as PSO and Genetic Algorithms.

In this paper, we propose a binary version of Cuckoo Search for feature selection purposes, called BCS, in which the search space is modeled as a d-cube, where d stands for the number of features. The main idea is to associate with each nest a set of binary coordinates that denote whether a feature will belong to the final set of features or not, and the function to be maximized is the accuracy of a supervised classifier. As the quality of a solution must be computed for every nest, we need to evaluate each one of them by training a classifier with the features encoded by its eggs and then classifying an evaluating set. Thus, we need a fast and robust classifier, since we have one instance of it for each nest. As such, we opted for the Optimum-Path Forest (OPF) classifier [7], [8], which has been demonstrated to be as effective as Support Vector Machines, but faster for training. It is worth noting that Nozarian et al. [9] have also proposed a binary version of Cuckoo Search, but its main purpose was to solve the knapsack problem. The proposed algorithm has been compared with the Binary Bat Algorithm, the Binary Firefly Algorithm [10], the Binary Gravitational Search Algorithm and Binary Particle Swarm Optimization in the context of non-technical loss detection in power distribution systems, on two datasets obtained from a Brazilian electrical power company: one composed of industrial profiles and the other composed of commercial profiles.

The remainder of the paper is organized as follows. In Section II we revisit the Cuckoo Search algorithm and present the proposed approach for binary optimization using CS; the experimental results are discussed in Section III. Finally, conclusions are stated in Section IV.

II. CUCKOO SEARCH

The parasitic behavior of some cuckoo species is extremely intriguing. These birds can lay their eggs in host nests, mimicking external characteristics of the host eggs, such as color and spots. If this strategy is unsuccessful,

the host can throw the cuckoo's egg away, or simply abandon its nest and build a new one elsewhere. Based on this context, Yang and Deb [6] developed a novel evolutionary optimization algorithm named Cuckoo Search (CS), which they summarized with three rules, as follows:
1) Each cuckoo chooses a nest at random in which to lay its eggs.
2) The number of available host nests is fixed, and nests with high-quality eggs will carry over to the next generations.
3) If a host bird discovers the cuckoo egg, it can throw the egg away or abandon the nest and build a completely new one.
Algorithmically, each host nest n is defined as an agent which can contain a single egg x (one-dimensional problem) or more than one, when the problem concerns multiple dimensions. CS starts by placing the nest population randomly in the search space. In each iteration of the algorithm, the nests are updated using a random walk via Lévy flights:

\[ x_i^j(t) = x_i^j(t-1) + \alpha \oplus \text{Lévy}(\lambda) \tag{1} \]

and

\[ \text{Lévy} \sim u = s^{-\lambda}, \quad (1 < \lambda \le 3), \tag{2} \]

where s is the step size and α > 0 is the step-size scaling factor. Here the entrywise product ⊕ is similar to the one used in PSO, and x_i^j stands for the j-th egg (feature) at nest i (solution), with i = 1, 2, ..., m and j = 1, 2, ..., d. Lévy flights employ a random step length drawn from a Lévy distribution; therefore, the CS algorithm is more efficient at exploring the search space, as its step lengths are much longer in the long run [6]. Finally, the nests holding the eggs with the lowest quality are replaced by new ones according to a probability p_a ∈ [0, 1].

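Neither [6] nor this paper fixes a concrete sampling scheme for Equation (2); as an illustration only, a common choice in CS implementations is Mantegna's algorithm. The sketch below (Python with NumPy; levy_step and update_nest are our illustrative names, not part of the original method) draws Lévy-distributed steps and applies the random walk of Equation (1):

    import numpy as np
    from math import gamma, pi, sin

    def levy_step(beta=1.5, size=1):
        # Mantegna's algorithm: one common way to draw Levy-distributed
        # step lengths (an assumption; the paper only states the
        # power-law form of Equation (2)).
        sigma_u = (gamma(1 + beta) * sin(pi * beta / 2) /
                   (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
        u = np.random.normal(0.0, sigma_u, size)
        v = np.random.normal(0.0, 1.0, size)
        return u / np.abs(v) ** (1 / beta)

    def update_nest(x, alpha=0.1):
        # Random walk of Equation (1): x(t) = x(t-1) + alpha (*) Levy(lambda).
        return x + alpha * levy_step(size=x.shape)
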
A. BCS: Binary Cuckoo Search

In traditional CS, the solutions are updated in the search space towards continuous-valued positions. In BCS for feature selection, by contrast, the search space is modeled as a d-dimensional boolean lattice, in which the solutions are updated across the corners of a hypercube. In addition, as the problem is to select a given feature or not, a binary solution vector is employed, where 1 indicates that a feature will be selected to compose the new dataset and 0 otherwise. In order to build this binary vector, we employ Equations (3) and (4), which restrict the new solutions to binary values only:

\[ S(x_i^j(t)) = \frac{1}{1 + e^{-x_i^j(t)}} \tag{3} \]

\[ x_i^j(t+1) = \begin{cases} 1 & \text{if } S(x_i^j(t)) > \sigma, \\ 0 & \text{otherwise}, \end{cases} \tag{4} \]

in which σ ∼ U(0, 1) and x_i^j(t) denotes the new egg's value at time step t.

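For concreteness, the binarization of Equations (3) and (4) can be sketched as follows (Python/NumPy; binarize is our illustrative name):

    import numpy as np

    def binarize(x):
        # Equation (3): squash each continuous coordinate into (0, 1).
        s = 1.0 / (1.0 + np.exp(-x))
        # Equation (4): compare against sigma ~ U(0,1), drawn per coordinate.
        sigma = np.random.uniform(0.0, 1.0, size=x.shape)
        return (s > sigma).astype(int)

    # e.g., binarize(np.array([0.5, -2.0, 3.0])) might return array([1, 0, 1])
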
Algorithm 1 presents the proposed BCS algorithm for feature selection, using the OPF classifier as the objective function.

Algorithm 1: BCS FEATURE SELECTION ALGORITHM
INPUT: Labeled training set Z1 and evaluating set Z2, loss parameter p_a, α value, number of nests m, dimension d, number of iterations T.
OUTPUT: Global best position g.
AUXILIARY: Fitness vector f with size m and variables acc, maxfit, globalfit and maxindex.
 1.  For each nest n_i (∀i = 1, ..., m), do
 2.      For each dimension j (∀j = 1, ..., d), do
 3.          x_i^j(0) ← Random{0, 1}
 4.      f_i ← −∞
 5.  globalfit ← −∞
 6.  For each iteration t (t = 1, ..., T), do
 7.      For each nest n_i (∀i = 1, ..., m), do
 8.          Create Z1' and Z2' from Z1 and Z2, respectively, such that
 9.          both contain only the features of n_i for which x_i^j(t) ≠ 0, ∀j = 1, ..., d
10.          Train OPF over Z1', evaluate it over Z2',
11.          and store the accuracy in acc
12.          If (acc > f_i), then
13.              f_i ← acc
14.              For each dimension j (∀j = 1, ..., d), do
15.                  x̂_i^j ← x_i^j(t)
16.      [maxfit, maxindex] ← max(f)
17.      If (maxfit > globalfit), then
18.          globalfit ← maxfit
19.          For each dimension j (∀j = 1, ..., d), do
20.              g_j ← x_maxindex^j(t)
21.
22.      For each nest n_i (∀i = 1, ..., m), do
23.          For each dimension j (∀j = 1, ..., d), do
24.              Select the worst nests according to p_a ∈ [0, 1]
25.              and replace them with new solutions
26.      For each nest n_i (∀i = 1, ..., m), do
27.          For each dimension j (∀j = 1, ..., d), do
28.              x_i^j(t) ← x_i^j(t − 1) + α ⊕ Lévy(λ)
29.              If (σ < S(x_i^j(t))), then
30.                  x_i^j(t) ← 1
31.              else x_i^j(t) ← 0

The algorithm starts with the loop in Lines 1-4, which initializes each nest with a vector of binary values (Line 3); Lines 6-31 stand for the main algorithm loop. In the first part of this loop (Lines 7-16), the new training and evaluating sets are composed with the selected features, and each nest is evaluated in order to update its fitness value (Lines 13-16). Lines 17-21 find the nest with the best solution so far and store its binary vector in g. The loop in Lines 22-25 is responsible for replacing the nests with the worst solutions according to the probability p_a, generating new nests randomly as described in [11]. Finally, Lines 26-31 update the binary vector of each nest, restricting the solutions generated via Lévy flights (Equations 1-2) through the sigmoid function (Equations 3-4).

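Putting the pieces together, the sketch below (our code, not the authors' implementation) mirrors Algorithm 1, reusing levy_step and binarize from the earlier sketches. The fitness callback stands in for Lines 8-11, i.e., training OPF on Z1' and measuring accuracy on Z2'; any classifier could be plugged in:

    import numpy as np

    def bcs_feature_selection(fitness, d, m=30, T=10, alpha=0.1, pa=0.25):
        X = np.random.uniform(-1.0, 1.0, (m, d))  # continuous nest positions
        masks = binarize(X)                       # Lines 1-3: random binary init
        f = np.full(m, -np.inf)                   # Line 4: best fitness per nest
        g, globalfit = np.zeros(d, dtype=int), -np.inf   # Line 5
        best = masks.copy()
        for t in range(T):                        # Lines 6-31
            for i in range(m):
                acc = fitness(masks[i])           # Lines 8-11 (guard against
                if acc > f[i]:                    # all-zero masks if needed)
                    f[i] = acc                    # Lines 12-15
                    best[i] = masks[i].copy()
            if f.max() > globalfit:               # Lines 16-21
                globalfit = f.max()
                g = best[f.argmax()].copy()
            # Lines 22-25: replace a fraction pa of the nests (here chosen
            # at random, one common reading of [11]).
            drop = np.random.rand(m) < pa
            X[drop] = np.random.uniform(-1.0, 1.0, (drop.sum(), d))
            X = X + alpha * levy_step(size=(m, d))  # Line 28: Levy random walk
            masks = binarize(X)                     # Lines 29-31
        return g, globalfit

A call such as bcs_feature_selection(my_opf_fitness, d=8), with my_opf_fitness a hypothetical wrapper around OPF training and evaluation, would return the best 8-bit feature mask and its accuracy; the defaults m = 30 and T = 10 match the experimental setup of Section III-C.
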
III. EXPERIMENTAL RESULTS

A. Datasets

We used two labeled datasets obtained from a Brazilian electric power company, B_i and B_c. The former is a dataset
composed of 3486 industrial profiles, and the latter contains 5645 commercial profiles. Each industrial and commercial profile is represented by eight features, as follows (a compact in-code representation is sketched after the list):
1) Demand Billed (DB): demand value of the active power considered for billing purposes, in kilowatts (kW);
2) Demand Contracted (DC): the value of the demand for continuous availability requested from the energy company, which must be paid whether the electric power is used by the consumer or not, in kilowatts (kW);
3) Demand Measured or Maximum Demand (Dmax): the maximum actual demand for active power, verified by measurement at fifteen-minute intervals during the billing period, in kilowatts (kW);
4) Reactive Energy (RE): energy that flows through the electric and magnetic fields of an AC system, in kilovolt-ampere reactive hours (kVArh);
5) Power Transformer (PT): the power transformer installed for the consumer, in kilovolt-amperes (kVA);
6) Power Factor (PF): the ratio between the consumed active power and the apparent power in a circuit;
7) Installed Power (Pinst): the sum of the nominal power of all electrical equipment installed and ready to operate at the consumer unit, in kilowatts (kW);
8) Load Factor (LF): the ratio between the average demand (Daverage) and the maximum demand (Dmax) of the consumer unit.

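For concreteness, one possible in-code representation of such a profile (a sketch; the field names are ours) is shown below; the 8-bit masks reported later in Section III-C index these eight fields in this order:

    from dataclasses import dataclass

    @dataclass
    class ConsumerProfile:
        demand_billed: float       # 1) DB, kW
        demand_contracted: float   # 2) DC, kW
        demand_measured: float     # 3) Dmax, kW
        reactive_energy: float     # 4) RE, kVArh
        power_transformer: float   # 5) PT, kVA
        power_factor: float        # 6) PF, ratio of active to apparent power
        installed_power: float     # 7) Pinst, kW
        load_factor: float         # 8) LF, Daverage / Dmax
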
B. Methodology

Suppose we have a fully labeled dataset Z = Z1 ∪ Z2 ∪ Z3 ∪ Z4, in which Z1, Z2, Z3 and Z4 stand for the training, learning, validating, and test sets, respectively. The idea is to employ Z1 and Z2 to find the subset of features that maximizes the accuracy over Z2, such accuracy being the fitness function. Each agent (particle, cuckoo, bat, etc.) is initialized with random binary positions, and the original dataset is mapped to a new one which contains the features selected in this first sampling. Further, the fitness function of each agent is set to the recognition rate of a classifier over Z2 after training on Z1. As soon as an agent changes its position, a new training on Z1 followed by classification of Z2 needs to be performed. As the reader can see, such a formulation requires fast training and classification steps. This is the reason why we have employed the Optimum-Path Forest (OPF) classifier [7], [8], since it is a non-parametric and very robust classifier.

However, in order to allow a fair comparison, we have added a third set to the experiments, called the validating set (Z3): the idea is to establish a threshold that ranges from 10% to 90%, and for each value of this threshold we mark the features that were selected in at least that percentage of the runs (10 in total) of the learning process over Z1 and Z2, as described above. For instance, a threshold of 40% means we choose the features that were selected in at least 40% of the runs. For each threshold, we compute the fitness function over the validation set Z3 to evaluate the generalization capability of the selected solution. Thus, the final subset will be the one that maximizes the curve over the range of threshold values, i.e., the features that maximize the accuracy over Z3. Further, these selected features are then applied to assess the accuracy over Z4. Notice the fitness function employed in this paper is the accuracy measure proposed by Papa et al. [7], which is capable of handling unbalanced classes. Notice we used 30% of the original dataset for Z1, 20% for Z2, 20% for Z3, and 40% for Z4; these percentages have been empirically chosen. It is important to highlight that all experiments were carried out on a PC with an Intel Core i7 Q740 1.73 GHz processor and 3 GB RAM, running Ubuntu 10.04 as the operating system.

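The thresholding scheme just described can be sketched as follows (Python/NumPy; the names are ours), where run_masks stacks the binary feature masks obtained in the 10 runs of the learning phase:

    import numpy as np

    def features_above_threshold(run_masks, threshold):
        # run_masks: (n_runs, d) binary array, one selected-feature mask per run.
        # A feature survives if it was selected in at least `threshold`
        # (a fraction in [0, 1]) of the runs.
        frequency = np.mean(run_masks, axis=0)
        return (frequency >= threshold).astype(int)

    # Sweep thresholds from 10% to 90% and keep the mask that maximizes
    # the fitness (OPF accuracy) over the validating set Z3.
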
C. Experiments

We have compared BCS with the Binary Bat Algorithm (BBA), the Binary Firefly Algorithm (BFFA), the Binary Gravitational Search Algorithm (BGSA) and Binary Particle Swarm Optimization (BPSO). Table I presents the parameters employed for each evolutionary-based technique. Notice that for all techniques we employed 30 agents and 10 iterations. These values and the ones shown in Table I have been empirically set.

TABLE I. PARAMETERS USED FOR EACH OPTIMIZATION APPROACH

Technique | Parameters
BCS       | α = 0.1, p_a = 0.25
BBA       | α = 0.9, γ = 0.9
BFFA      | γ = 0.7, β0 = 1.0, α = 0.01
BGSA      | G0 = 100
BPSO      | c1 = 2.0, c2 = 2.0, w = 0.7

In the first round of experiments, we executed the OPF classifier over the original datasets in order to assess the accuracy using all features. For that, we employed only the Z1 and Z4 sets, since Z2 and Z3 are used by the feature selection algorithm. The training and test sets are the same as those used in the feature learning process. Table II shows the averaged results using a cross-validation procedure with 10 runs. Notice a Bayesian classifier (Bayes) has been employed in this first round just for comparison purposes.

TABLE II. MEAN CLASSIFICATION ACCURACIES OVER Z4 USING THE ORIGINAL DATASETS

Dataset    | OPF         | Bayes
Commercial | 62.39%±1.85 | 62.50%±1.82
Industrial | 65.88%±1.91 | 65.45%±1.76

The recognition rates of the two classifiers are similar to each other on both datasets if we consider the standard deviation. In the second round of experiments, we take the feature selection process into account. The main question that arises is: can we increase the classification rate while using a pruned set of features? Figure 1 displays the accuracy curve over Z3 for the Commercial dataset. As we can see, all techniques achieved the same recognition rate for all thresholds, except BGSA for values below 40%. This means we can use the features that appeared in only 40% of the runs. Table III shows the recognition rates after feature selection and also the selected features.

Fig. 1. OPF accuracy curve over Z3 for the Commercial dataset (classification accuracy [%] versus threshold [%], with curves for BCS, BBA, BFFA, BGSA and BPSO).

TABLE III. MEAN EXECUTION TIMES (S), THRESHOLD, SELECTED FEATURES AND ACCURACY OVER Z4 (COMMERCIAL DATASET)

Technique | Execution time [s] | Threshold | Selected features | Classification Accuracy
BCS       | 1141.221069        | 40%       | 10101111          | 91.44%
BBA       | 1033.354248        | 40%       | 10101111          | 91.44%
BFFA      | 1116.442017        | 40%       | 10101111          | 91.44%
BGSA      | 1133.031860        | 40%       | 10101111          | 91.44%
BPSO      | 1078.860962        | 40%       | 10101111          | 91.44%

We can see that all techniques achieved the same recognition rate over the test set (Z4) using the same subset of features. The bit string denotes whether each feature has been selected or not to compose the final set of features, following the same order as the features' description in Section III-A; i.e., the bit string 10101111 means the features DB, Dmax, PT, PF, Pinst and LF were selected. In regard to the execution time, the fastest technique was BBA. The recognition rates have also been improved in comparison with the results shown in Table II.

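Decoding the reported bit strings back into feature names follows the ordering of Section III-A; a small helper (ours, for illustration) makes the mapping explicit:

    FEATURES = ["DB", "DC", "Dmax", "RE", "PT", "PF", "Pinst", "LF"]

    def decode(bits):
        # Map a bit string such as "10101111" to the selected feature names.
        return [name for name, b in zip(FEATURES, bits) if b == "1"]

    print(decode("10101111"))  # ['DB', 'Dmax', 'PT', 'PF', 'Pinst', 'LF']
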
Figure 2 displays the accuracy curve over Z3 for the Industrial dataset.

Fig. 2. OPF accuracy curve over Z3 for the Industrial dataset (classification accuracy [%] versus threshold [%], with curves for BCS, BBA, BFFA, BGSA and BPSO).

We can see that all techniques achieved the maximum accuracy over Z3 using a threshold of 20%, and the recognition rates improved by 47.73% when compared to the ones obtained with the original dataset (from 65.88% in Table II to 97.33%, i.e., (97.33 − 65.88)/65.88 ≈ 47.73%). Except for BGSA, the remaining optimization techniques obtained the same accuracy over the test set Z4, with BCS being the fastest technique, followed by BPSO. An interesting point here is that BCS has one ad hoc parameter fewer than BPSO, which makes a difference in terms of meta-optimization efficiency.

TABLE IV. MEAN EXECUTION TIMES (S), THRESHOLD, SELECTED FEATURES AND ACCURACY OVER Z4 (INDUSTRIAL DATASET)

Technique | Execution time [s] | Threshold | Selected features | Classification Accuracy
BCS       | 674.715332         | 20%       | 00100111          | 97.33%
BBA       | 784.976318         | 20%       | 00100111          | 97.33%
BFFA      | 948.428711         | 20%       | 00100111          | 97.33%
BGSA      | 1071.962036        | 20%       | 01100111          | 96.67%
BPSO      | 685.916077         | 20%       | 00100111          | 97.33%

IV. CONCLUSIONS

In this paper we addressed the task of feature selection as an optimization problem, proposing a binary version of Cuckoo Search called Binary Cuckoo Search (BCS). The experiments were carried out over two datasets aiming to detect thefts in power distribution systems. We have shown that theft recognition can be increased by up to 47.73% by selecting a proper set of features, and we have also concluded that BCS is suitable for feature selection, being the fastest approach for the Industrial dataset.

REFERENCES

[1] J. Kennedy and R. C. Eberhart, "A discrete binary version of the particle swarm algorithm," in IEEE International Conference on Systems, Man, and Cybernetics, vol. 5, 1997, pp. 4104-4108.
[2] H. A. Firpi and E. Goodman, "Swarmed feature selection," in Proceedings of the 33rd Applied Imagery Pattern Recognition Workshop. Washington, DC, USA: IEEE Computer Society, 2004, pp. 112-118.
[3] E. Rashedi, H. Nezamabadi-pour, and S. Saryazdi, "BGSA: binary gravitational search algorithm," Natural Computing, vol. 9, pp. 727-745, 2010.
[4] C. Ramos, A. Souza, G. Chiachia, A. Falcão, and J. Papa, "A novel algorithm for feature selection using harmony search and its application for non-technical losses detection," Computers & Electrical Engineering, vol. 37, no. 6, pp. 886-894, 2011.
[5] R. Y. M. Nakamura, L. A. M. Pereira, K. A. Costa, D. Rodrigues, J. P. Papa, and X.-S. Yang, "BBA: A binary bat algorithm for feature selection," in Proceedings of the XXV SIBGRAPI Conference on Graphics, Patterns and Images, 2012, accepted for publication.
[6] X.-S. Yang and S. Deb, "Cuckoo search via Lévy flights," in Proceedings of the NaBIC 2009 World Congress on Nature & Biologically Inspired Computing, 2009, pp. 210-214.
[7] J. P. Papa, A. X. Falcão, and C. T. N. Suzuki, "Supervised pattern classification based on optimum-path forest," International Journal of Imaging Systems and Technology, vol. 19, no. 2, pp. 120-131, 2009.
[8] J. P. Papa, A. X. Falcão, V. H. C. Albuquerque, and J. M. R. S. Tavares, "Efficient supervised optimum-path forest classification for large datasets," Pattern Recognition, vol. 45, no. 1, pp. 512-520, 2012.
[9] N. S., S. H., and J. M. V., "A binary model on the basis of cuckoo search algorithm in order to solve the problem of knapsack 1-0," in Proceedings of the ICSEM 2012 International Conference on System Engineering and Modelling, vol. 32, 2012, pp. 67-71.
[10] R. Falcón, M. Almeida, and A. Nayak, "Fault identification with binary adaptive fireflies in parallel and distributed systems," in Proceedings of the IEEE Congress on Evolutionary Computation. IEEE, 2011, pp. 1359-1366.
[11] X.-S. Yang and S. Deb, "Engineering optimisation by cuckoo search," International Journal of Mathematical Modelling and Numerical Optimisation, vol. 1, pp. 330-343, 2010.