Procedia Computer Science 1 (2012) 2425-2432

www.elsevier.com/locate/procedia

International Conference on Computational Science, ICCS 2010

Rough set and Tabu search based feature selection for credit scoring

Jue Wang a,∗, Kun Guo b, Shouyang Wang a

a Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100190, China
b Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, 100190, China

Abstract

As the credit industry has grown rapidly, huge numbers of consumer credit records are collected by the credit departments of banks, and credit scoring has become a very important issue. Credit datasets usually involve a large amount of redundant information and features, which leads to lower accuracy and higher complexity of the credit scoring model. Effective feature selection methods are therefore necessary for credit datasets with large numbers of features. In this paper, a novel approach to feature selection based on rough set and tabu search, called FSRT, is proposed. In FSRT, conditional entropy serves as the heuristic guiding the search for optimal solutions. The proposed method is applied to credit scoring, and the Japanese credit dataset from the UCI repository is selected to demonstrate its competitive performance. Moreover, FSRT shows a superior performance in saving computational costs and improving classification accuracy.

© 2012 Published by Elsevier Ltd. Open access under CC BY-NC-ND license.

Keywords: Feature selection, Rough set, Tabu search, Credit scoring

1. Introduction

The credit industry has experienced two decades of rapid growth with significant increases in auto-financing, credit card debt, and so on. Credit scoring models have been widely studied in the areas of statistics, machine learning, and artificial intelligence (AI) [1, 2, 3]. The advantages of credit scoring include reducing the cost of credit analysis, enabling faster credit decisions, closer monitoring of existing accounts, and prioritizing collections [4, 5]. With the growth of the credit industry and the large loan portfolios under management today, the credit industry is actively developing more accurate credit scoring models. However, credit scoring datasets often contain a huge number of features. In such cases, feature selection is necessary to significantly improve the accuracy of the credit scoring models [6, 7, 8]. This effort is leading to the investigation of effective approaches to feature selection for credit scoring applications [9, 10, 11].

Due to the abundance of noisy, irrelevant, or misleading features, the ability to handle imprecise and inconsistent information in real-world problems has become one of the most important requirements for feature selection. Rough set theory, proposed by Pawlak, is a mathematical tool for handling uncertain, vague, and inconsistent data [12, 13, 14]. The rough set approach to feature selection is to select a subset of features (or attributes) that can predict or classify the decision concepts as well as the original feature set does [15].

∗ Corresponding author. Email: [email protected]

1877-0509 © 2012 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. doi:10.1016/j.procs.2010.04.273


Obviously, feature selection is an attribute subset selection process, where the selected attribute subset not only retains the representational power of the original set but also has minimal redundancy. In this paper, a novel method of feature selection based on rough set and tabu search (FSRT) is proposed for credit scoring data. Tabu search (TS), originally proposed by Glover in 1986, is a heuristic method that has shown successful performance in solving many combinatorial search problems in the operations research literature [16, 17, 18]. FSRT uses a binary representation of solutions in the process of feature selection. Conditional entropy is invoked to measure the quality of a solution and is regarded as a heuristic. The TS neighborhood search methodology is intensified in this method, i.e., the search avoids returning to recently visited solutions and adopts downhill moves to escape from local maxima. Some search history is retained to help the search process behave more intelligently [19, 20, 21]. The numerical results indicate that the proposed method shows a superior performance in saving computational costs and improving the accuracy of several classification models.

The rest of the paper is organized as follows. In the next section, we briefly review the principles of rough sets. In Section 3, we highlight the main components of FSRT and present the algorithm formally. In Section 4, numerical results of FSRT are illustrated, and the classification results are compared with those of the models without feature selection. Finally, Section 5 concludes.

2. Rough set preliminaries

An information system is a formal representation of a dataset to be analyzed. It is defined as a pair S = (U, A), where U is a non-empty set of finite objects, called the universe of discourse, and A is a non-empty set of attributes. With every attribute a ∈ A, a set of its values Va is associated. In practice, we are mostly interested in a special case of information system called a decision system. It consists of a pair S = (U, C ∪ D), where C is called the conditional attribute set and D the decision attribute set. Rough set theory is based on the observation that objects may be indiscernible (indistinguishable) because of limited available information. For a subset of attributes P ⊆ A, the indiscernibility relation IND(P) is defined by

IND(P) = {(ξ, η) ∈ U × U | ∀a ∈ P, a(ξ) = a(η)}.

It is easily shown that IND(P) is an equivalence relation on the set U. The relation IND(P), P ⊆ A, constitutes a partition of U, which is denoted U/IND(P) [13]. If (ξ, η) ∈ IND(P), then ξ and η are indiscernible by attributes from P. The equivalence classes of the P-indiscernibility relation are denoted [ξ]P. For a subset Ξ ⊆ U, the P-lower approximation of Ξ is defined as PΞ = {ξ | [ξ]P ⊆ Ξ}.

For an information system S = (U, A) and a partition of U with classes Xi, 1 ≤ i ≤ n, the entropy of an attribute set B ⊆ A is defined by

H(B) = − Σ_{i=1}^{n} p(Xi) log p(Xi),

where U/IND(B) = {X1, X2, . . . , Xn}, p(Xi) = |Xi|/|U|, and |·| denotes cardinality. The conditional entropy of an attribute set D with reference to another attribute set B is defined as

H(D|B) = − Σ_{i=1}^{n} p(Xi) Σ_{j=1}^{m} p(Yj|Xi) log p(Yj|Xi),

where p(Yj|Xi) = |Yj ∩ Xi|/|Xi|, U/IND(D) = {Y1, Y2, . . . , Ym}, U/IND(B) = {X1, X2, . . . , Xn}, 1 ≤ i ≤ n, 1 ≤ j ≤ m.
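To make these definitions concrete, the following Python sketch (ours, not part of the original paper) computes U/IND(B), H(B) and H(D|B) for a small decision table; the toy table and all function names are illustrative.

import math
from collections import defaultdict

def partition(table, attrs):
    # U/IND(attrs): group row indices by their values on the attributes attrs.
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def entropy(table, attrs):
    # H(attrs) = -sum_i p(X_i) log p(X_i) over the partition U/IND(attrs).
    n = len(table)
    return -sum((len(X) / n) * math.log(len(X) / n) for X in partition(table, attrs))

def conditional_entropy(table, d_attrs, b_attrs):
    # H(D|B) = -sum_i p(X_i) sum_j p(Y_j|X_i) log p(Y_j|X_i).
    n = len(table)
    h = 0.0
    for X in partition(table, b_attrs):
        counts = defaultdict(int)
        for i in X:
            counts[tuple(table[i][a] for a in d_attrs)] += 1
        for c in counts.values():
            p = c / len(X)
            h -= (len(X) / n) * p * math.log(p)
    return h

# Toy decision table: columns 0-2 are conditional attributes C,
# column 3 is the decision attribute D.
table = [
    (0, 1, 0, 'good'),
    (0, 1, 1, 'good'),
    (1, 0, 1, 'bad'),
    (1, 1, 1, 'bad'),
]
C, D = [0, 1, 2], [3]
print(entropy(table, C))                 # H(C) = log 4: four distinct rows
print(conditional_entropy(table, D, C))  # H(D|C) = 0: a consistent table
R = [0]
print(conditional_entropy(table, D, R) == conditional_entropy(table, D, C))  # True

On this toy table H(D|C) = 0, i.e., the table is consistent, and the single attribute R = [0] already gives H(D|R) = H(D|C), illustrating the reduct condition discussed below.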


Several theorems based on these measures have been proved in the literature [22, 23].

Theorem 2.1. H(D|B) = H(D ∪ B) − H(B).

Theorem 2.2. If P and Q are attribute sets of an information system S = (U, A) and IND(Q) = IND(P), then H(Q) = H(P).

Theorem 2.3. If P and Q are attribute sets of an information system S = (U, A), P ⊆ Q and H(Q) = H(P), then IND(Q) = IND(P).

Theorem 2.4. An attribute r in an attribute set R ⊆ C is reducible if and only if H(r | R − {r}) = 0.

Theorem 2.5. Given a relatively consistent decision table S = (U, C ∪ D), an attribute set R is relatively independent if and only if H(D|R) ≠ H(D|R − {r}) for every r ∈ R.

Theorem 2.6. Given a relatively consistent decision table S = (U, C ∪ D), an attribute set R ⊆ C is a reduct of C if and only if (1) H(D|R) = H(D|C) and (2) the attribute set R is relatively independent.

One of the major applications of rough set theory is attribute reduction, that is, the elimination of attributes considered to be redundant, while avoiding information loss [13, 14]. The reduction of attributes is achieved by comparing the equivalence relations generated by sets of attributes. Using conditional entropy as the measure, attributes are removed so that the reduced set provides the same dependency degree as the original. In a decision system, a reduct is formally defined as a subset R of the conditional attribute set C such that H(D|R) = H(D|C), where D is the decision attribute set. A given dataset may have many reducts. Thus the set ℛ of all reducts is defined as [13]:

ℛ = {R : R ⊆ C; H(D|R) = H(D|C)}.

The intersection of all the sets in ℛ is called the core:

Core(ℛ) = ∩_{R ∈ ℛ} R.

The elements of the core are those attributes that cannot be eliminated without introducing more contradictions into the dataset.
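For small attribute sets these definitions can be checked exhaustively. The sketch below (ours, reusing the conditional_entropy helper and toy table from the sketch in Section 2) enumerates the reducts of Theorem 2.6, i.e., subsets satisfying condition (1) that are also relatively independent, and derives the core from them; this is feasible only for tiny |C|, which motivates the heuristic search of Section 3.

from itertools import combinations

def all_reducts(table, C, D, cond_entropy):
    # Enumerate reducts: H(D|R) = H(D|C) and R relatively independent.
    target = cond_entropy(table, D, C)
    reducts = []
    for k in range(1, len(C) + 1):
        for R in combinations(C, k):
            if abs(cond_entropy(table, D, list(R)) - target) > 1e-12:
                continue
            # relative independence: removing any attribute changes H(D|.)
            if all(abs(cond_entropy(table, D, [a for a in R if a != r]) - target) > 1e-12
                   for r in R):
                reducts.append(set(R))
    return reducts

reducts = all_reducts(table, C, D, conditional_entropy)  # [{0}] for the toy table
core = set.intersection(*reducts)                        # {0}
r_min = min(reducts, key=len)                            # a minimum-cardinality reduct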

In the process of attribute reduction, a set Rmin ⊆ ℛ of reducts with minimum cardinality is sought:

Rmin = {R ∈ ℛ : |R| ≤ |S|, ∀S ∈ ℛ},

which is called the minimal reduct set. It is well known that finding a minimal reduct is NP-hard [13]. For most applications, only one minimal reduct is required, so all the calculations involved in discovering the rest are pointless. Therefore, an alternative strategy is required for large datasets.

3. Credit scoring based on feature selection

In this section, the proposed feature selection method for credit scoring is presented, which is based on rough set and tabu search. Three credit scoring models and several performance measures are involved in the illustration of the proposed method.

3.1. Feature Selection based on Rough Set and Tabu Search

FSRT starts with an initial solution x0 and uses the tabu restriction rule to generate trial solutions in a neighborhood of the current iterate. FSRT continues generating trial solutions until no improvement can be obtained through consecutive iterations. If the number of consecutive iterations without improvement exceeds Idiv, FSRT restarts the search from a diverse solution xdiv. If it exceeds Ishak, the number of non-improvement iterations before shaking, FSRT applies a shaking strategy to improve the best reduct xbest obtained so far, if one exists, and the search is continued from xbest. The search is terminated when the number of iterations exceeds the maximal number of iterations Imax, or the number of consecutive iterations without improvement exceeds the maximal number of non-improvement iterations Iimp. Finally, Elite Reducts Inspiration is applied as a final diversification-intensification mechanism. The FSRT algorithm is stated formally below; the details of the algorithm can be found in [24].
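Before the formal statement, a rough Python illustration of the core loop is given below. It is a simplified sketch, not the authors' MATLAB implementation; the fitness callable, default parameter values, and variable names are our assumptions.

import random

def fsrt_sketch(num_attrs, fitness, max_iter=100, max_noimp=40, tl_size=5, seed=0):
    # Simplified tabu search over binary attribute masks (fitness is maximized).
    # fitness(mask) is assumed to score a candidate reduct, e.g. by rewarding
    # H(D|R) = H(D|C) while penalizing the number of selected attributes.
    rng = random.Random(seed)
    extremes = {(0,) * num_attrs, (1,) * num_attrs}   # kept tabu throughout
    x = tuple(rng.randint(0, 1) for _ in range(num_attrs))
    tabu = []                                         # short-term memory
    best, best_f, noimp = x, fitness(x), 0
    for _ in range(max_iter):
        # Trial solutions: single-bit flips at round(|C|/2) random positions.
        flips = rng.sample(range(num_attrs), max(1, round(num_attrs / 2)))
        trials = [x[:j] + (1 - x[j],) + x[j + 1:] for j in flips]
        trials = [y for y in trials if y not in tabu and y not in extremes]
        if not trials:
            break
        x = max(trials, key=fitness)      # best admissible move, even downhill
        tabu = (tabu + [x])[-tl_size:]
        f = fitness(x)
        if f > best_f:
            best, best_f, noimp = x, f, 0
        else:
            noimp += 1
            if noimp > max_noimp:         # stands in for shaking/diversification
                break
    return best, best_f

A full implementation would additionally maintain the frequency vector vF and the reduct set RedSet, and would implement the shaking, diversification, and Elite Reducts Inspiration steps of Algorithm 3.1 instead of simply stopping.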


Algorithm 3.1. FSRT

1. Let the tabu list (TL) contain the two extreme solutions (0, . . . , 0) and (1, . . . , 1), set vF to be a zero vector, and set RedSet to be an empty set. Choose an initial solution x0, and set the counter k := 0. Select Imax, Iimp, Ishak and Idiv such that Imax > Iimp > Ishak > Idiv.

2. Generate ℓ neighborhood trial solutions {y1, . . . , yℓ} around xk.

3. Set xk+1 equal to the best trial solution among y1, . . . , yℓ, and update TL, vF, xbest and RedSet. Set k := k + 1.

4. If the number of iterations exceeds Imax, or the number of iterations without improvement exceeds Iimp, go to Step 7.

5. If the number of iterations without improvement exceeds Ishak, apply the shaking strategy to improve xbest, set xk := xbest, and go to Step 2.

6. If the number of iterations without improvement exceeds Idiv, apply the diversification strategy to obtain a new diverse solution xdiv, set xk := xdiv, and go to Step 2.

7. Apply Elite Reducts Inspiration to obtain xERI. Update xbest by xERI if the latter is better, and terminate.

3.2. Credit scoring based on FSRT

This section investigates the credit scoring accuracy of three credit scoring models, radial basis function network (RBF), support vector machine (SVM), and logistic regression, both without feature selection and with feature selection based on rough set and tabu search. They use different algorithms to estimate the unknown credit scoring function and employ different training methods to acquire information from the available credit scoring examples. Artificial neural networks have been widely used in credit scoring problems, and it has been reported that their accuracy is superior to that of traditional statistical methods [25]. RBF is the most frequently used neural network architecture in commercial applications, including credit scoring [26]. Support vector machines have been applied successfully in many classification problems, including credit scoring [27]. Logistic regression is one of the most popular statistical tools for classification problems. Unlike other statistical tools (e.g., discriminant analysis or ordinary linear regression), logistic regression can accommodate various kinds of distribution functions, such as Gumbel, Poisson, and normal, and is more suitable for credit scoring problems [28, 29]. Depending on the credit features selected and the parameter settings of the models, the estimation of the credit scoring function may produce different classification results. The credit scoring models are tested using 10-fold cross validation on a real-world dataset, the Japanese credit data.

3.3. Performance measure

In addition to prediction accuracy, the ROC curve is used to measure the performance of the models [30]. Given a classifier and an instance, there are four possible outcomes. If the instance is positive and is classified as positive, it is counted as a true positive (TP); if the instance is negative and is classified as positive, it is counted as a false positive (FP). The tp rate (true positive rate), fp rate (false positive rate), Precision, Recall and F-measure are expressed as:

tp rate = TP/P,   fp rate = FP/N,   Precision = TP/(TP + FP),   Recall = TP/P,

F-measure = 2 / (1/Precision + 1/Recall),

where P is the total number of positives and N the total number of negatives. The ROC is a two-dimensional graph in which the true positive rate is plotted on the Y-axis and the false positive rate on the X-axis, as illustrated in Figure 1. AUC is the area under the ROC curve; generally, a model with a larger AUC will have better average performance.
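For reference, these measures can be computed directly from labels and scores, as in the following self-contained Python sketch (ours; the example data are arbitrary). AUC is obtained here via the Mann-Whitney rank-sum formulation, which equals the area under the ROC curve.

def classification_measures(y_true, y_pred):
    # tp rate, fp rate, precision, F-measure for binary labels (1 = positive).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp_rate = tp / pos            # equals recall
    fp_rate = fp / neg
    precision = tp / (tp + fp)
    f_measure = 2 / (1 / precision + 1 / tp_rate)
    return tp_rate, fp_rate, precision, f_measure

def auc(y_true, scores):
    # Area under the ROC curve via the Mann-Whitney rank statistic.
    pairs = sorted(zip(scores, y_true))
    ranks, i = [0.0] * len(pairs), 0
    while i < len(pairs):                 # 1-based ranks; ties share the mean rank
        j = i
        while j + 1 < len(pairs) and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    pos = sum(t for _, t in pairs)
    neg = len(pairs) - pos
    rank_sum = sum(r for r, (_, t) in zip(ranks, pairs) if t == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

print(classification_measures([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # (2/3, 1/2, 2/3, 2/3)
print(auc([1, 1, 0, 0, 1], [0.9, 0.4, 0.3, 0.6, 0.8]))             # 5/6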


Table 1: FSRT parameter setting

Parameter   Definition                                                    Value
|TL|        size of tabu list                                             5
ℓ           size of neighborhood                                          round(|C|/2)
nR          number of best reducts used for xERI                          3
Imax        max number of iterations                                      100
Iimp        max number of non-improvement iterations                      40
Idiv        number of non-improvement iterations before diversification   20
Ishak       number of non-improvement iterations before shaking           30

4. Numerical Experiments

This research investigates the potential of feature selection on the Japanese credit dataset, which is available from the UCI Repository of Machine Learning Databases.

4.1. Data set description

The Japanese credit dataset, derived from real applications, is used to evaluate the performance of the proposed algorithm. It is publicly available from the UCI Repository of Machine Learning Databases and is often used to compare the performance of classification models. The dataset consists of 299 examples of creditworthy applicants and 365 examples where credit should not be extended. For each applicant, 15 variables describe job status, gender, marital status, house area, age, savings, monthly loan repayment, number of months to repay, and so on.

4.2. Parameter setting

For the Japanese credit dataset, the FSRT MATLAB code was run 20 times with different initial solutions. Before discussing the results, we explain the setting of the FSRT parameters. Table 1 summarizes all parameters used in the FSRT algorithm with their assigned values. These values are based on common settings in the literature or on our numerical experiments. The performance of FSRT was tested using different values of these parameters. First, the tabu list size |TL| was set to 5, to save the three most recently visited solutions in addition to the all-zeros and all-ones solutions. The number ℓ of neighborhood trial solutions was set to round(|C|/2); since ℓ depends on the problem size, this may help avoid a deterioration of the method as the problem size increases. The numbers Idiv and Ishak of consecutive non-improvement iterations before applying diversification and shaking were set to 20 and 30, respectively. The number nR of best reducts was set to 3, which helps FSRT improve the best reduct found so far if it is not a global one. Finally, the maximum numbers Imax and Iimp of iterations and of consecutive non-improvement iterations, respectively, are set as in Table 1. The numerical experiments showed that these settings were reasonable to avoid spending too many non-improvement iterations, and sufficient to let the search process explore the current region before supporting it with diversification or intensification.

4.3. Feature selection and credit scoring

This section provides a discussion of the experimental procedure and the results of the experiment. To evaluate the performance of feature selection and the classification accuracy of the credit scoring models, the dataset was randomly partitioned into training sets and independent test sets via k-fold cross validation. Each of the k subsets acted as an independent holdout test set for the model trained with the remaining k − 1 subsets. The advantages of cross validation are that all of the test sets are independent and the reliability of the results can be improved. A typical experiment uses k = 10, so we divided the credit data into 10 subsets for cross validation and carried out our implementation on the 10 training sets.
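A minimal sketch of this 10-fold protocol using Python/scikit-learn is shown below. The paper's experiments were run in MATLAB; here the synthetic data merely mimic the dataset's shape (664 applicants, 15 features), and the selected feature subset is an illustrative stand-in for an FSRT reduct.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in with the dataset's shape; replace with the real
# Japanese credit data from the UCI repository.
rng = np.random.default_rng(0)
X = rng.normal(size=(664, 15))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=664) > 0).astype(int)

selected = [0, 2, 5, 6, 9, 11, 13]   # illustrative FSRT reduct of size 7

for name, cols in [("all features", list(range(15))), ("FSRT subset", selected)]:
    aucs = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[tr][:, cols], y[tr])
        aucs.append(roc_auc_score(y[te], model.predict_proba(X[te][:, cols])[:, 1]))
    print(name, round(float(np.mean(aucs)), 3))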


Table 2: Feature selection results for the Japanese credit dataset. Each entry c(k) means that a reduct with c attributes was found in k of the 20 runs.

Fold      FSRT result              Average
1         7(8)  8(6)  9(6)         7.9
2         7(8)  8(12)              7.6
3         7(10) 8(8)  9(2)         7.6
4         7(4)  8(16)              7.8
5         7(2)  8(16) 9(2)         8.0
6         7(4)  8(12) 9(4)         8.0
7         7(18) 8(2)               7.1
8         7(8)  8(12)              7.6
9         7(14) 8(2)  9(4)         7.5
10        7(4)  8(8)  9(8)         8.2
Average                            7.73

Table 3: Performance comparison for different models on the Japanese dataset

Model      Feature Selection   TP rate   FP rate   Precision   Recall   F-measure   AUC
RBF        NO                  0.798     0.226     0.807       0.798    0.794       0.867
RBF        YES                 0.816     0.190     0.816       0.816    0.816       0.888
SVM        NO                  0.861     0.126     0.872       0.861    0.862       0.868
SVM        YES                 0.861     0.126     0.872       0.861    0.862       0.868
Logistic   NO                  0.854     0.144     0.856       0.854    0.854       0.905
Logistic   YES                 0.864     0.131     0.867       0.864    0.865       0.921

Table 2 shows the feature selection results for the Japanese credit dataset; the reduced feature sets help decrease the complexity of classification and improve the accuracy of the credit scoring models. In the following experiment, a feature subset with the minimum number of features is fed to the credit scoring models.

Three classifiers were used in this study, namely RBF, SVM and Logistic. A comparison was made between using all features with no pre-selection and using the features pre-selected by FSRT. The results for the Japanese credit dataset are summarized in Table 3. From Table 3, we can see that there is no significant difference in performance between the SVM classifier with feature selection and the baseline SVM classifier (all features). There is, however, a marginal difference between the RBF and Logistic classifiers with feature selection and the corresponding baseline classifiers. For the Japanese credit dataset, the precision of these two models without feature selection is 80.7% and 85.6%, respectively; after feature selection it reaches 81.6% and 86.7%, an obvious improvement over the baseline classifiers. There is also a higher true positive rate with a lower false positive rate after FSRT.

The performance of all models on the Japanese dataset is shown in the ROC space in Figure 1. Informally, one point in ROC space is better than another if it lies to the northwest, where the tp rate is higher and the fp rate lower. The points in Figure 1 show that the Logistic classifier with FSRT has the best performance, as its point (FS-logistic) appears in the northwest corner of the ROC space. Furthermore, the average performance of the classifiers with feature selection is better than that of the baseline classifiers. Moreover, the models achieve a higher AUC after feature selection; taking the RBF model as an example, its ROC curve and AUC are shown in Figure 2. The RBF model with FSRT performs better on average, as its ROC curve lies almost always above that of the baseline model. In short, the models with fewer attributes after FSRT not only save computational cost but also improve classification accuracy on the Japanese credit dataset.


Figure 1: Performance comparison for different models


Figure 2: ROC curve and AUC for RBF model

5. Conclusion

A novel approach to feature selection based on rough set and tabu search has been proposed for the credit scoring problem. New diversification and intensification elements have been embedded in FSRT to achieve better performance and to fit the considered problem. Numerical experiments on the Japanese credit dataset have been presented to show the efficiency of FSRT. Comparisons with models without pre-selection reveal that FSRT is promising and less expensive in computational cost. This research contributes to the effort of developing more refined and accurate credit scoring models.

Acknowledgements

This work is supported by the Chinese Academy of Sciences (CAS) Foundation for Planning and Strategy Research (KACX1YW-0906) and the National Natural Science Foundation of China (NSFC No. 70801058 and No. 70921061).

References

[1] Loretta J. Mester. What's the point of credit scoring? Business Review, (9):3-16, 1997.
[2] D. J. Hand, W. E. Henley. Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society, Series A, 160(3):523-541, 1997.
[3] Yong Shi. Current Research Trend: Information Technology and Decision Making in 2008. International Journal of Information Technology and Decision Making, 8(1):1-5, 2009.
[4] David West. Neural network credit scoring models. Computers & Operations Research, 27(11):1131-1152, 2000.
[5] Ligang Zhou, Kin Keung Lai, Jerome Yen. Credit Scoring Models with AUC Maximization Based on Weighted SVM. International Journal of Information Technology and Decision Making, 8(4):677-696, 2009.
[6] Isabelle Guyon, Andre Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3(7):1157-1182, 2003.
[7] Huan Liu, Hiroshi Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
[8] Jianping Li, Zhenyu Chen, Liwei Wei, Weixuan Xu, Gang Kou. Feature Selection via Least Squares Support Feature Machine. International Journal of Information Technology and Decision Making, 6(4):671-686, 2007.
[9] R. Jensen, Q. Shen. Finding Rough Set Reducts with Ant Colony Optimization. Proceedings of the 2003 UK Workshop on Computational Intelligence, 15-22, 2003.
[10] R. Jensen, Q. Shen. Semantics-Preserving Dimensionality Reduction: Rough and Fuzzy-Rough-Based Approaches. IEEE Transactions on Knowledge and Data Engineering, 16(12):1457-1471, 2004.
[11] S. Tan. A Global Search Algorithm for Attributes Reduction. AI 2004: Advances in Artificial Intelligence, G. I. Webb and Xinghuo Yu (Eds.), LNAI 3339:1004-1010, 2004.
[12] Z. Pawlak. Rough sets. International Journal of Computer and Information Sciences, 11(5):341-356, 1982.
[13] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning About Data. Kluwer Academic Publishing, 1991.


[14] Z. Pawlak, A. Skowron. Rough Set Methods and Applications: New Developments in Knowledge Discovery in Information Systems. Studies in Fuzziness and Soft Computing, L. Polkowski, T. Y. Lin, S. Tsumoto (Eds.), 56, Physica-Verlag, 2000.
[15] Roman W. Swiniarski, Andrzej Skowron. Rough Set Methods in Feature Selection and Recognition. Pattern Recognition Letters, 24(6):833-849, 2003.
[16] C. Rego, B. Alidaee. Metaheuristic Optimization via Memory and Evolution. Springer-Verlag, Berlin, 2005.
[17] F. Glover, M. Laguna. Tabu Search. Kluwer Academic Publishers, MA, USA, 1997.
[18] F. Glover, G. Kochenberger. New Optimization Model for Data Mining. International Journal of Information Technology and Decision Making, 5(4):605-609, 2006.
[19] F. Glover. Tabu Search, Part I. ORSA Journal on Computing, 1(3):190-206, 1989.
[20] F. Glover. Tabu Search, Part II. ORSA Journal on Computing, 2(1):4-32, 1990.
[21] A. Hedar, M. Fukushima. Tabu Search Directed by Direct Search Methods for Nonlinear Global Optimization. European Journal of Operational Research, 170(2):329-349, 2006.
[22] Guoyin Wang. Algebra View and Information View of Rough Sets Theory. Proceedings of SPIE, 4384:200-207, 2001.
[23] T. Y. Lin, Y. Y. Yao, L. A. Zadeh. Data Mining, Rough Sets and Granular Computing. Springer-Verlag, Berlin, 2002.
[24] A. Hedar, Jue Wang, M. Fukushima. Tabu Search for Attribute Reduction in Rough Set Theory. Soft Computing, 12(9):909-918, 2008.
[25] Vijay S. Desai, Jonathan N. Crook, George A. Overstreet Jr. A Comparison of Neural Networks and Linear Scoring Models in the Credit Union Environment. European Journal of Operational Research, 95(1):24-37, 1996.
[26] Estefane Lacerda, Andre C. P. L. F. Carvalho, Antonio Padua Braga, Teresa Bernarda Ludermir. Evolutionary Radial Basis Functions for Credit Assessment. Applied Intelligence, 22(3):167-181, 2005.
[27] Cheng Lung Huang, Mu Chen Chen, Chieh Jen Wang. Credit Scoring with a Data Mining Approach Based on Support Vector Machines. Expert Systems with Applications, 33(4):847-856, 2007.
[28] Erkki K. Laitinen. Predicting a Corporate Credit Analyst's Risk Estimate by Logistic and Linear Models. International Review of Financial Analysis, 8(2):97-121, 1999.
[29] Timothy H. Lee, Ming Zhang. Bias Correction and Statistical Test for Developing Credit Scoring Model through Logistic Regression Approach. International Journal of Information Technology and Decision Making, 2(3):299-311, 2003.
[30] T. Fawcett. ROC Graphs: Notes and Practical Considerations for Researchers. Intelligent Enterprise Technologies Laboratory, HP Laboratories Palo Alto, 2004.