A new variable selection method for classification

Silvia Casado Yusta
Departamento de Economía Aplicada, Universidad de Burgos

Joaquín Pacheco Bonrostro
Departamento de Economía Aplicada, Universidad de Burgos

Laura Núñez Letamendia
Instituto de Empresa, Madrid

ABSTRACT

This work proposes a new “ad hoc” method for variable selection in classification, specifically in Discriminant Analysis. The new method is based on the metaheuristic strategy Tabu Search. From a computational point of view, variable selection is an NP-Hard problem (NP = Nondeterministic Polynomial Time), so no efficient algorithm is known that guarantees finding the optimum solution. This means that when the size of the problem is large, finding an optimum solution is infeasible in practice. As in other optimization problems, metaheuristic techniques have proved to be good at solving this type of problem. Although there are many references in the literature regarding the selection of variables for use in classification, there are very few key references on the selection of variables for use in Discriminant Analysis. In fact, the best-known statistical packages continue to use classic selection methods such as Stepwise, Backward or Forward. After performing some tests, it is found that Tabu Search obtains significantly better results than the Stepwise, Backward or Forward methods used by classic statistical packages.

Keywords: Variable Selection; Classification; Discriminant Analysis; Metaheuristics; Tabu Search.

JEL (Journal of Economic Literature) Classification: C44, C61.

Subject area: Statistics applied to quantitative methods

XV Jornadas de ASEPUMA y III Encuentro Internacional

Casado S., Pacheco J. y Núñez L.

1. INTRODUCTION

Variable selection plays an important role in classification. Before designing a classification method in which many variables are involved, only those variables that are really required should be selected; that is, the first step is to eliminate the less significant variables from the analysis. There can be many reasons for selecting only a subset of the variables instead of the whole set of candidate variables (Reunanen, 2003): (1) it is cheaper to measure only a reduced set of variables; (2) prediction accuracy may be improved through the exclusion of redundant and irrelevant variables; (3) the predictor to be built is usually simpler and potentially faster when fewer input variables are used; and (4) knowing which variables are relevant can give insight into the nature of the prediction problem and allows a better understanding of the final classification model.

The aim in the classification problem is to classify instances that are characterized by attributes or variables; that is, to determine which class every instance belongs to. Based on a set of examples (whose class is known), a set of rules is designed and generalised to classify the set of instances with the greatest possible precision. There are several methodologies for dealing with this problem: Classic Discriminant Analysis, Logistic Regression, Neural Networks, Decision Trees, Instance-Based Learning, etc. Linear Discriminant Analysis and Logistic Regression search for linear functions and then use them for classification purposes. The use of linear functions enables better interpretation of the results (e.g., the importance and/or significance of each variable in instance classification) by analysing the values of the coefficients obtained. Not every classification method is suited to this type of analysis, and in fact some are regarded as “black box” models. Thus, classic discriminant analysis continues to be an interesting methodology.

This work proposes a new “ad hoc” method for variable selection in classification, specifically in discriminant analysis. The new method is based on the metaheuristic strategy Tabu Search and yields significantly better results than the classic methods (Stepwise, Backward and Forward) used by statistical packages such as SPSS and BMDP, as shown below. Different tests were used to analyse the efficacy of this new method and to compare it with previous methods. Our method is designed for problems with 2 classes.



The remainder of this work is organized as follows. Section 2 is devoted to the background. In Section 3 the problem is modelled, the Tabu Search algorithm is described and the computational results are presented. Finally, Section 4 offers the main conclusions.

2. BACKGROUND

Research in variable selection started in the early 1960s (Lewis, 1962; Sebestyen, 1962). Over the past four decades, extensive research into feature selection has been conducted. Much of the work is related to medicine and biology (e.g., Ganster et al., 2001; Inza et al., 2002; Shy and Suganthan, 2003; and Tamoto et al., 2004). The selection of the best subset of variables for building the predictor is not a trivial question, because the number of subsets to be considered grows exponentially with the number of candidate variables. Even with a moderate number of candidate variables, not all the possible subsets can be evaluated; moreover, feature selection is an NP-Hard computational problem (NP = Nondeterministic Polynomial; see Cotta et al., 2004). This means that when the size of the problem is large, finding an optimum solution is not feasible in practice.
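To give a sense of the growth mentioned above, the following short Python sketch counts the subsets that an exhaustive search would have to evaluate. The function name `count_subsets` and the value m = 57 are illustrative choices for this sketch only.

```python
import math


def count_subsets(m: int, p: int) -> int:
    """Number of distinct subsets of size p drawn from m candidate variables."""
    return math.comb(m, p)


if __name__ == "__main__":
    m = 57  # illustrative number of candidate variables
    for p in (3, 5, 8):
        print(f"p = {p}: {count_subsets(m, p):,} subsets of size p")
    # Without a fixed size, every one of the 2**m subsets is a candidate.
    print(f"subsets of any size: {2 ** m:,}")
```

Each candidate subset requires fitting and evaluating a classifier, which is why enumeration quickly becomes impractical as m and p grow.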

For variable selection problems two different methodological approaches have been developed: optimal or exact techniques (enumerative techniques), which can guarantee an optimal solution but are applicable only to small sets; and heuristic techniques, which can find good solutions (although they cannot guarantee the optimum) in a reasonable amount of time. Among the former, the Narendra and Fukunaga (1977) algorithm is one of the best known, but as Jain and Zongker (1997) pointed out, this algorithm is impractical for problems with very large feature sets. Recent references on implicit enumerative techniques for feature selection adapted to regression models can be found in Gatu and Kontoghiorghes (2003, 2005, 2006). On the other hand, among the heuristic techniques we find works based on genetic algorithms (see Bala et al. (1996), Jourdan et al. (2001), Inza et al. (2001a, 2001b) and Wong and Nandi (2004)) and the recent work by García et al. (2006), who present a method based on Scatter Search. As in other optimization problems, metaheuristic techniques have proved to be superior methodologies.

However, although there are many references in the literature regarding the selection of variables for use in classification, there are very few key references on the selection of variables for use in discriminant analysis. For this specific purpose the Stepwise method (Efroymson, 1960) and all its variants, such as O’Gorman's (2004), as well as the Backward and Forward methods, can be found in the literature. These are simple selection procedures based on statistical criteria which have been incorporated into some of the best-known statistical packages, such as SPSS and BMDP. As highlighted by Huberty (1994), these methods are not very efficient, and when there are many original variables the optimum is rarely achieved.

3. SOLVING THE PROBLEM

3.1. Setting out the problem

We can formulate the problem of selecting the subset of variables with the best classification performance as follows. Let V be a set of m variables, V = {1, 2, ..., m}, and let A be a set of instances (also called the “training” set), for each of which we also know the class it belongs to. Given a predefined value p ∈ N, p < m, we have to find a subset S ⊂ V of size p with the greatest classification capacity, f(S). To be precise, for discriminant analysis the function f(S) is defined as the percentage of hits in A obtained through the variables of S with Fisher's classifier. Specifically, let B be the “between-group” variance matrix and W the “within-group” (intra-group) variance matrix. Fisher's linear classifier, F(x), is given by the eigenvector of the matrix W^{-1}B associated with the largest eigenvalue of this matrix (for more detail see Salvador, 2000).
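As an illustration of how f(S) could be evaluated for two classes, the sketch below uses NumPy. It rests on several assumptions that are not stated in the text: class labels are coded 0 and 1, the cut-off is the midpoint of the projected class means, and a pseudo-inverse is used so that a singular within-group matrix does not stop the computation. The function name `fisher_hit_rate` is hypothetical. For two classes, the leading eigenvector of W^{-1}B is proportional to W^{-1}(mu1 - mu0), which is what the code computes.

```python
import numpy as np


def fisher_hit_rate(X, y, S):
    """Fraction of cases in (X, y) classified correctly using only the columns in S.

    X: (n_cases, m) array, y: array of 0/1 labels, S: iterable of column indices.
    """
    Xs = X[:, sorted(S)]
    X0, X1 = Xs[y == 0], Xs[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-group scatter matrix W, pooled over both classes.
    W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # Two-class Fisher direction: proportional to the leading eigenvector of W^{-1}B.
    w = np.linalg.pinv(W) @ (mu1 - mu0)
    # Assumed cut-off: midpoint between the projected class means.
    threshold = w @ (mu0 + mu1) / 2.0
    predicted = (Xs @ w > threshold).astype(int)
    return float(np.mean(predicted == y))


if __name__ == "__main__":
    X = np.array([[0.0, 5.0], [1.0, 3.0], [0.5, 4.0],
                  [10.0, 4.0], [11.0, 5.0], [10.5, 3.0]])
    y = np.array([0, 0, 0, 1, 1, 1])
    print(fisher_hit_rate(X, y, {0}))  # column 0 separates the two classes
    print(fisher_hit_rate(X, y, {1}))  # column 1 carries no class information
```

The function returns a fraction between 0 and 1 rather than a percentage.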

3.2. Solution approach: Tabu Search

Tabu Search (TS) is a strategy proposed by Glover (1989, 1990). A comprehensive tutorial on Tabu Search can be found in Glover and Laguna (2002). Our Tabu Search algorithm includes a method for building an initial solution and a basic procedure for exploring the space of solutions around this initial solution. The algorithm is outlined as follows:

Tabu Search Procedure
Build an initial solution
Repeat
    Execute Basic Tabu Search
until a stopping condition is reached

In this work a limit of 30 minutes of computational time is used as the stopping condition. These elements are described next.

3.2.1. Initial solution

The initial solution is built as follows: starting from the empty solution, a variable is added in each iteration until the solution S reaches p variables (|S| = p). To decide which variable is added to the solution in each iteration, the value of f is used.
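The construction of the initial solution can be sketched as follows. The text only says that the value of f guides the choice, so this sketch assumes the natural greedy reading: at each step, add the variable that gives the highest f for the enlarged subset. The name `greedy_initial_solution`, the callable `f` and the tie-breaking rule (lowest index wins) are assumptions of the sketch.

```python
from typing import Callable, FrozenSet, Iterable, Set


def greedy_initial_solution(
    variables: Iterable[int],
    p: int,
    f: Callable[[FrozenSet[int]], float],
) -> Set[int]:
    """Build a subset of size p by adding, one at a time, the variable with the best f."""
    S: Set[int] = set()
    remaining = set(variables)
    while len(S) < p and remaining:
        # sorted() makes ties deterministic: max() keeps the first best candidate.
        best = max(sorted(remaining), key=lambda j: f(frozenset(S | {j})))
        S.add(best)
        remaining.remove(best)
    return S


if __name__ == "__main__":
    # Toy objective standing in for the classification capacity f(S).
    weights = {1: 0.1, 2: 0.5, 3: 0.3, 4: 0.2}
    toy_f = lambda S: sum(weights[j] for j in S)
    print(greedy_initial_solution(range(1, 5), 2, toy_f))
```

Each step evaluates f once per remaining variable, so building a solution of size p costs on the order of p·m evaluations.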

3.2.2. Description of a basic algorithm

Our Tabu Search algorithm uses neighbouring moves that consist of exchanging, at each step, an element in the solution S for an element outside it. In order to avoid repetitive looping, when a move exchanging j from S for j’ from V−S is performed, element j is prevented from returning to S for a certain number of iterations. We define vector_tabu(j) = the number of the iteration in which element j left S. Some 'tabu' moves can be permitted under specific conditions (“aspiration criterion”), for example, when they improve on the best solution found. The basic Tabu Search method is described next, where S is the current solution and S* the best solution found. The Tabu_Tenure parameter indicates the number of iterations during which an element is not allowed to return to S. After different tests, Tabu_Tenure was set to p.

Basic Tabu Search Procedure
(a) Read initial solution S
(b) Do vector_tabu(j) = −Tabu_Tenure, j = 1..m; niter = 0, iter_better = 0 and S* = S
(c) Repeat
    (c.1) niter = niter + 1
    (c.2) Calculate vjj’ = f(S ∪ {j’} − {j}), ∀ j ∈ S, j’ ∉ S
    (c.3) Determine vj*j’* = max {vjj’ / j ∈ S, j’ ∉ S verifying: niter > vector_tabu(j’) + Tabu_Tenure or vjj’ > f(S*) (‘aspiration criterion’)}
    (c.4) Do S = S ∪ {j’*} − {j*} and vector_tabu(j*) = niter
    (c.5) If f(S) > f(S*) then do: S* = S, f* = f(S) and iter_better = niter
until niter > iter_better + 2·m

That is, this procedure terminates when 2·m iterations have taken place without improvement.
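The basic procedure can be sketched in Python as below. This is an illustrative reading rather than the authors' Delphi implementation: `f` is any callable returning the classification capacity of a subset, and the tabu test is applied to the entering variable, following the description that a variable which has left S may not return for Tabu_Tenure iterations. If every move is tabu and none satisfies the aspiration criterion, the sketch simply skips the move for that iteration, which is an assumption.

```python
from typing import Callable, FrozenSet, Set, Tuple


def basic_tabu_search(
    S0: Set[int],
    m: int,
    f: Callable[[FrozenSet[int]], float],
    tabu_tenure: int,
) -> Tuple[Set[int], float]:
    """Swap-based tabu search over subsets of {1, ..., m} with the size of S0."""
    S = set(S0)
    best_S, best_f = set(S), f(frozenset(S))
    # Iteration at which each variable last left S; the initial value makes all free.
    vector_tabu = {j: -tabu_tenure for j in range(1, m + 1)}
    niter = iter_better = 0
    while True:
        niter += 1
        best_move, best_val = None, float("-inf")
        for j_out in sorted(S):
            for j_in in range(1, m + 1):
                if j_in in S:
                    continue
                val = f(frozenset((S - {j_out}) | {j_in}))
                not_tabu = niter > vector_tabu[j_in] + tabu_tenure
                aspiration = val > best_f  # a tabu move is allowed if it beats S*
                if (not_tabu or aspiration) and val > best_val:
                    best_move, best_val = (j_out, j_in), val
        if best_move is not None:
            j_out, j_in = best_move
            S = (S - {j_out}) | {j_in}
            vector_tabu[j_out] = niter
            if best_val > best_f:
                best_S, best_f, iter_better = set(S), best_val, niter
        # Stop once more than 2*m iterations have passed without improvement.
        if niter > iter_better + 2 * m:
            return best_S, best_f


if __name__ == "__main__":
    toy_f = lambda S: float(sum(S))  # toy objective: prefer high indices
    print(basic_tabu_search({1, 2}, m=5, f=toy_f, tabu_tenure=2))
```

Every iteration evaluates f for all p·(m − p) possible swaps, so in practice the cost of one evaluation of f dominates the running time.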

3.3. Computational Results

To check and compare the efficacy of this new method, a series of experiments was run with different test problems. Specifically, three data sets were used. These data sets can be found in the well-known data repository of the University of California, UCI (see Murphy and Aha, 1994), available at www.ics.uci.edu/~mlearn/MLRepository.html. The following databases were used:

- Spambase Database: 57 variables, 2 classes and 4,601 cases. From these, 600 cases were randomly selected as the training set. From the remaining cases, 10 sets of 200 cases were randomly selected as test sets.

- Mushrooms Database: 22 variables, 2 classes and 8,100 cases. The 22 nominal variables were transformed into 121 binary variables: 1 for each binary variable and 1 per possible answer for the remaining variables. For the training set, 1,300 cases were randomly selected from the cases without missing data. From the remaining cases, 10 sets of 200 cases were randomly selected as test sets for evaluating the model with independent data.

- Covertype Database: This is a forestry database, with 54 explanatory variables, 8 classes and more than 580,000 cases or instances. For the training set, a random selection of 600 cases from the first two groups was made. From the remaining cases, 10 sets of 200 cases were randomly selected for evaluating the model with independent data (test sets).
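The sampling scheme shared by the three databases (one random training set, then 10 random test sets of 200 cases taken from the remaining cases) can be sketched as follows. The text does not say whether the 10 test sets are mutually disjoint; this sketch draws each one independently, so they may overlap. The function name `split_cases` and the fixed seed are illustrative assumptions.

```python
import random
from typing import List, Tuple


def split_cases(
    n_cases: int,
    n_train: int,
    n_test_sets: int = 10,
    test_size: int = 200,
    seed: int = 0,
) -> Tuple[List[int], List[List[int]]]:
    """Return training indices and a list of test-index sets drawn from the rest."""
    rng = random.Random(seed)
    indices = list(range(n_cases))
    rng.shuffle(indices)
    train, rest = indices[:n_train], indices[n_train:]
    # Assumption: each test set is sampled independently from the held-out cases.
    tests = [rng.sample(rest, test_size) for _ in range(n_test_sets)]
    return train, tests


if __name__ == "__main__":
    train, tests = split_cases(n_cases=4601, n_train=600)  # Spambase sizes
    print(len(train), len(tests), len(tests[0]))
```

Because the test cases never include training cases, the hit rates measured on them estimate performance on independent data.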

The experiments consist of comparing our Tabu Search algorithm with the classic Stepwise, Backward and Forward procedures used for discriminant analysis in some well-known statistical software packages, such as SPSS and BMDP. All the experiments were run on a Pentium IV 2.4 GHz PC using the Borland Delphi compiler (version 5.0).



Table 1 presents a summary of the solutions obtained with the training set for each value of p considered (classification capacity in the intermediary steps). The results of the Forward method are omitted because they are the same as those obtained by the Stepwise method. The best solution for each case is marked with an asterisk.

Data        m     p   Stepwise  Backward  Tabu S
Spam        57    3   0.830     0.830     0.847*
Spam        57    4   0.848     0.848     0.860*
Spam        57    5   0.860     0.860     0.882*
Spam        57    6   0.868     0.868     0.887*
Spam        57    7   0.877     0.873     0.893*
Spam        57    8   0.878     0.878     0.905*
Mushrooms   121   3   0.953     0.922     0.982*
Mushrooms   121   4   0.965     0.912     0.999*
Mushrooms   121   5   0.995     0.928     1.000*
Cover       54    3   0.755     0.755     0.767*
Cover       54    4   0.765     0.755     0.773*
Cover       54    5   0.765     0.758     0.777*
Cover       54    6   0.768     0.762     0.778*
Cover       54    7   0.765     0.762     0.780*

Table 1: Comparison of Tabu Search Algorithm and traditional methods for discriminant analysis.

The following points can be made regarding Table 1:

- The Backward method performs similarly to or, in some cases, slightly worse than the Stepwise and Forward methods.

- Our Tabu Search algorithm improves on the solutions of the classic methods in every case.

Next, we evaluate the models previously obtained with independent data. Specifically, the data we use are the 10 test sets obtained from each database, as described at the beginning of Section 3.3. Since Backward obtains similar or, in some cases, worse results than Stepwise, for these data we only compare our Tabu Search method with Stepwise. Table 2 summarises the results obtained with these test sets. For every case the table shows the mean value of f over the 10 test sets.

Data        m     p   Stepwise  Tabu S
Spam        57    3   0.787     0.804
Spam        57    4   0.812     0.825
Spam        57    5   0.827     0.844
Spam        57    6   0.827     0.852
Spam        57    7   0.831     0.883
Spam        57    8   0.850     0.883
Mushrooms   121   3   0.952     0.976
Mushrooms   121   4   0.950     0.989
Mushrooms   121   5   0.989     1.000
Cover       54    3   0.740     0.732
Cover       54    4   0.752     0.730
Cover       54    5   0.752     0.749
Cover       54    6   0.749     0.746
Cover       54    7   0.755     0.726

Table 2: Comparison in test sets for discriminant analysis.

In Table 2 we can observe that the results obtained by Tabu Search on these independent data sets are very similar to those obtained on the training sets. We can also see that for the Covertype Database the results obtained with Tabu Search are slightly worse than those obtained with Stepwise.

4. CONCLUSIONS

A Tabu Search method to select variables that are subsequently used in discriminant analysis is proposed and analysed. Although there are many references in the literature regarding the selection of variables for use in classification, there are very few key references on the selection of variables for use in discriminant analysis. After performing some tests, it is found that Tabu Search obtains significantly better results than the Stepwise, Backward or Forward methods used by classic statistical packages.

5. ACKNOWLEDGEMENTS

The authors are grateful for financial support from the Spanish Ministry of Education and Science (National Plan of R&D - Project SEJ2005-08923/ECON) and from the Regional Government of “Castilla y León” (“Consejería de Educación” – Project BU008A06).

6. REFERENCES

• BALA J., DEJONG K., HUANG J., VAFAIE H. AND WECHSLER H. (1996). “Using Learning to Facilitate the Evolution of Features for Recognizing Visual Concepts”. Evolutionary Computation 4 (3): 297-311.
• COTTA C., SLOPER C. AND MOSCATO P. (2004). “Evolutionary Search of Thresholds for Robust Feature Set Selection: Application to the Analysis of Microarray Data”. Lecture Notes in Computer Science 3005: 21-30.
• EFROYMSON M.A. (1960). “Multiple Regression Analysis”. In Mathematical Methods for Digital Computers (Ralston A. and Wilf H.S., eds.), Vol. 1. Wiley, New York.
• GANSTER H., PINZ A., ROHRER R., WILDLING E., BINDER M. AND KITTLER H. (2001). “Automated Melanoma Recognition”. IEEE Transactions on Medical Imaging 20 (3): 233-239.
• GARCÍA F.C., GARCÍA M., MELIÁN B., MORENO J.A. AND MORENO M. (2004). “Solving Feature Selection Problem by a Parallel Scatter Search”. In press in European Journal of Operational Research.
• GATU C. AND KONTOGHIORGHES E.J. (2003). “Parallel Algorithms for Computing all Possible Subset Regression Models Using the QR Decomposition”. Parallel Computing 29: 505-521.
• GATU C. AND KONTOGHIORGHES E.J. (2005). “Efficient Strategies for Deriving the Subset VAR Models”. Computational Management Science 2 (4): 253-278.
• GATU C. AND KONTOGHIORGHES E.J. (2006). “Branch-and-bound Algorithms for Computing the Best-Subset Regression Models”. Journal of Computational and Graphical Statistics 15 (1): 139-156.
• GLOVER F. AND LAGUNA M. (2002). “Tabu Search”. In Handbook of Applied Optimization, P. M. Pardalos and M. G. C. Resende (Eds.), Oxford University Press, pp. 194-208.
• GLOVER F. (1989). “Tabu Search: Part I”. ORSA Journal on Computing 1: 190-206.
• GLOVER F. (1990). “Tabu Search: Part II”. ORSA Journal on Computing 2: 4-32.
• HUBERTY C.J. (1994). “Applied Discriminant Analysis”. Wiley-Interscience.
• INZA I., SIERRA B. AND BLANCO R. (2002). “Gene Selection by Sequential Search Wrapper Approaches in Microarray Cancer Class Prediction”. Journal of Intelligent & Fuzzy Systems 12 (1): 25-33.
• INZA I., MERINO M., LARRANAGA P., QUIROGA J., SIERRA B. AND GIRALA M. (2001a). “Feature Subset Selection by Genetic Algorithms and Estimation of Distribution Algorithms - A Case Study in the Survival of Cirrhotic Patients Treated with TIPS”. Artificial Intelligence in Medicine 23 (2): 187-205.
• INZA I., LARRANAGA P. AND SIERRA B. (2001b). “Feature Subset Selection by Bayesian Networks: A Comparison with Genetic and Sequential Algorithms”. International Journal of Approximate Reasoning 27 (2): 143-164.
• JAIN A. AND ZONGKER D. (1997). “Feature Selection: Evaluation, Application, and Small Sample Performance”. IEEE Trans. Pattern Analysis and Machine Intelligence 19 (2): 153-158.
• JOURDAN L., DHAENENS C. AND TALBI E. (2001). “A Genetic Algorithm for Feature Subset Selection in Data-Mining for Genetics”. MIC 2001 Proceedings, 4th Metaheuristics International Conference, 29-34.
• LEWIS P.M. (1962). “The Characteristic Selection Problem in Recognition Systems”. IEEE Trans. Information Theory 8: 171-178.
• MURPHY P.M. AND AHA D.W. (1994). “UCI Repository of Machine Learning”. University of California, Department of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html
• NARENDRA P.M. AND FUKUNAGA K. (1977). “A Branch and Bound Algorithm for Feature Subset Selection”. IEEE Trans. Computers 26 (9): 917-922.
• O’GORMAN T.W. (2004). “Using Adaptive Methods to Select Variables in Case-Control Studies”. Biometrical Journal 46 (5): 595-605.
• REUNANEN J. (2003). “Overfitting in Making Comparisons between Variable Selection Methods”. Journal of Machine Learning Research 3 (7/8): 1371-1382.
• SHY S. AND SUGANTHAN P.N. (2003). “Feature Analysis and Classification of Protein Secondary Structure Data”. Lecture Notes in Computer Science 2714: 1151-1158.
• SALVADOR FIGUERAS M. (2000). “Análisis Discriminante” [online]. 5campus.com, Estadística, http://5campus.com/lección/discri [accessed 10 February 2005].
• SEBESTYEN G. (1962). “Decision-Making Processes in Pattern Recognition”. New York: MacMillan.
• TAMOTO E., TADA M., MURAKAWA K., TAKADA M., SHINDO G., TERAMOTO K., MATSUNAGA A., KOMURO K., KANAI M., KAWAKAMI A., FUJIWARA Y., KOBAYASHI N., SHIRATA K., NISHIMURA N., OKUSHIBA S.I., KONDO S., HAMADA J., YOSHIKI T., MORIUCHI T. AND KATOH H. (2004). “Gene Expression Profile Changes Correlated with Tumor Progression and Lymph Node Metastasis in Esophageal Cancer”. Clinical Cancer Research 10 (11): 3629-3638.
• WONG M.L.D. AND NANDI A.K. (2004). “Automatic Digital Modulation Recognition Using Artificial Neural Network and Genetic Algorithm”. Signal Processing 84 (2): 351-365.



