
Algorithmic Finance xx (20xx) x–xx DOI:10.3233/AF-180227 IOS Press


A new variable selection method applied to credit scoring

Dalila Boughaci a,∗ and Abdullah A.K. Alkhawaldeh b

a LRIA/Computer Science Department, University of Sciences and Technology Houari Boumediene, USTHB BP 32 El-Alia, Bab-Ezzouar, Algiers, Algeria
b Department of Accounting, Faculty of Economics and Administrative Sciences, The Hashemite University, Zarqa, Jordan


Abstract. Credit scoring (CS) is an important process in both banking and finance. Lenders and creditors use CS to predict the probability that a borrower will default or become delinquent. CS is usually based on variables related to the applicant, such as age, payment history and behavior. This paper first proposes a new method for variable selection. The proposed method (VS-VNS) is based on the variable neighborhood search meta-heuristic and selects a set of significant variables for the data classification task. VS-VNS is then combined with a Bayesian network (BN) to build models for CS and select counterparties. Further, six search methods are studied for BN on different sets of variables. The different techniques and combinations are evaluated on well-known financial datasets. The numerical results are promising and show the benefits of the newly proposed approach (VS-VNS) for data classification and credit scoring.

Keywords: Credit scoring, variable selection, variable neighborhood search, search technique, Bayesian network, hill climbing, tabu search, simulated annealing, TAN, classification


1. Introduction


Variable selection, also called attribute selection or feature selection, is the operation of selecting a subset of relevant or significant variables to be used in the data classification task. It is a pre-processing step performed before launching the classification task. In this work, we are interested in variable selection and classification for credit scoring (CS). As shown by Mester (1997), CS is a crucial problem in financial institutions and banks. In order to select counterparties, financial institutions need good techniques to distinguish between "bad" and "good" counterparties and to decide whether credit will


∗ Corresponding author: Dalila Boughaci, LRIA/Computer Science Department, University of Sciences and Technology Houari Boumediene, USTHB - BP 32 El-Alia, Bab-Ezzouar, 16111, Algiers, Algeria. E-mails: dalila [email protected] and [email protected].

be granted or not. For example, in banks, lenders or creditors use CS to predict the probability that a borrower will default or become delinquent. CS is generally based on variables related to the applicant in order to evaluate his or her creditworthiness; these variables can include the applicant's age, payment history, guarantees, default rates and so on. Various studies in finance and banking have shown the importance of CS. For instance, credit scores are one of the most powerful predictors of risk, as shown by Miller (2003). CS provides an objective analysis of an applicant's creditworthiness, which reduces discrimination and credit risk; it can support the decision of whether or not to grant credit to an applicant, and it can also be used in corporate and collection scorecards.

To handle CS, researchers have used several techniques, among which we cite the following. Abdou (2009) proposed a genetic programming technique for CS. Desay et al. (1996) compared neural networks and linear scoring models in the credit union environment. Henley and Hand (1996) used a k-nearest neighbor (k-NN) classifier for assessing consumer credit risk. A support vector machine (SVM) approach was proposed by Bellotti and Crook (2009) for credit scoring and the discovery of significant features, and Sousa et al. (2015) proposed a dynamic modeling framework for credit risk assessment. Statistical methods have also been studied for CS, for example the linear regression proposed by Hand and Henley (1997), the decision trees proposed by Quinlan (1987) and the classification and regression trees (CART) studied by Breiman et al. (1984). Wiginton (1980) proposed discriminant analysis and logistic regression based methods, which are among the most broadly established statistical techniques used to classify clients as "good" or "bad". Friedman et al. (1997) proposed the Bayesian networks (BN) that may be used to build models for CS.

In this work, we study the impact of variable selection and search method on CS when combined with BN. This paper makes two main contributions:



1. We propose a new variable selection method called VS-VNS. It is based on the variable neighborhood search meta-heuristic and selects a significant set of variables for the data classification task. This new technique is compared with two filtering methods to show its performance.
2. We study the impact of different search methods on BN when combined with variable selection methods for CS.


An extensive experiment is conducted on four credit datasets to evaluate the performance of the different proposed combinations and techniques for CS. We perform the credit scoring task on the Australian, German, Japanese and the large "Give me some credit" datasets, and we discuss the performance of the different combinations using various metrics.

The rest of this paper is organized as follows: Section 2 gives background on BN and the search methods considered in this research. Section 3 details the proposed new technique for variable selection and the different combinations studied in this work. Section 4 presents the empirical studies on the four credit datasets. Finally, Section 5 concludes and gives some perspectives.

2. Background

The aim of this section is to give an overview of some important concepts used in this study.


2.1. Bayes Networks (BN)


A Bayesian network (BN), also called a Bayes network or belief network, is a well-known machine learning technique. It is a statistical model that combines a directed acyclic graph of nodes and links with a set of conditional probability tables. As shown by Friedman et al. (1997) and John and Langley (1995), a BN is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. The default BN uses the "K2" search method, a greedy algorithm that is run several times with random orderings of the variables.
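To make this concrete, a BN classifier with the K2 search can be assembled in the WEKA package used later in this paper. The following is a minimal sketch under our assumptions: the file name credit.arff is hypothetical, and the class and method names are WEKA's public API as we know it, not code published by the authors.

```java
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.local.K2;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BnWithK2 {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset; the last attribute is assumed to be the class.
        Instances data = DataSource.read("credit.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        BayesNet bn = new BayesNet();
        K2 k2 = new K2();          // greedy K2 structure search
        k2.setRandomOrder(true);   // random variable ordering, as described above
        bn.setSearchAlgorithm(k2);
        bn.buildClassifier(data);
        System.out.println(bn);    // prints the learned network
    }
}
```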



2.2. Search methods and meta-heuristics

Meta-heuristics are computational search techniques that have been used successfully for solving optimization problems in several areas. They can be divided into two main categories: the population-based methods and the single-solution-oriented methods (also called trajectory methods). The population-based methods maintain and evolve a population of solutions, whereas the trajectory methods work on a single current solution. Among the population-based methods for optimization problems, we cite the genetic algorithms used by Gen (2006) and the evolutionary computation used by Li et al. (2011). Among the trajectory methods, we find local search, proposed by Hansen (1986), Hoos and Boutilier (2000) and Boughaci et al. (2010), simulated annealing, developed by Kirkpatrick et al. (1983), and tabu search, proposed by Glover (1989). In this work, we are interested in the trajectory methods, especially hill climbing, tabu search and simulated annealing.







• Local search based method (LS): a simple hill-climbing technique proposed by Hansen (1986). LS starts with a randomly generated solution and tries to find better solutions in the current neighborhood. Neighbor solutions are obtained by modifying one position of the solution vector. LS is an iterative process that is repeated until a certain number of iterations or a stopping criterion is reached.


• Simulated annealing (SA): a local search based method. SA starts with a randomly generated solution and then generates neighboring configurations by using a move operator. A new configuration is accepted when it has a lower energy than the previous one; otherwise, it can still be accepted with a certain probability p, which helps the system escape from local minima. The SA process is repeated for a certain time fixed empirically, as presented by Kirkpatrick et al. (1983).
• Tabu search (TS): a local search meta-heuristic initially proposed by Glover (1989). TS starts with an initial random solution and then tries to locate good solutions by applying iterative modifications to the current solution. To avoid local optima effectively, TS uses a list to store solutions already visited; this list records the solution trajectory and prevents cycling back to local optima.
• Tree augmented Naive Bayes (TAN): a Bayes network learning algorithm augmented with a tree. The tree is formed by computing the maximum weight spanning tree using the Chow and Liu algorithm, which learns a BN with a tree structure that maximizes the likelihood of the training data.
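To illustrate the trajectory methods above, the acceptance rule at the heart of SA can be sketched as follows. This is our own generic Java illustration, not code from the paper; the temperature, cooling rate and move operator are arbitrary placeholders.

```java
import java.util.Random;

public class SaSketch {
    // Metropolis rule: always accept an improving move; accept a worsening
    // move with probability p = exp(-delta / temperature).
    static boolean accept(double delta, double temperature, Random rng) {
        return delta < 0 || rng.nextDouble() < Math.exp(-delta / temperature);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        double energy = 10.0;   // energy of the current configuration
        double t = 1.0;         // initial temperature (illustrative)
        double cooling = 0.95;  // geometric cooling schedule

        for (int step = 0; step < 1000; step++) {
            // Hypothetical move operator: perturb the current energy.
            double neighborEnergy = energy + rng.nextGaussian();
            if (accept(neighborEnergy - energy, t, rng)) {
                energy = neighborEnergy; // move to the neighbor configuration
            }
            t *= cooling; // lower the temperature over time
        }
        System.out.println("Final energy: " + energy);
    }
}
```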

3. Proposed method and combinations


Variable selection is a pre-processing step that can be launched before any classification task. It is the process that selects variables for the data classification task and removes redundant variables deemed irrelevant to it. Several methods have been studied for variable selection. They can be divided into two main families: the wrapper methods proposed by Kohavi and John (1996) and the filtering methods studied by Caruana and Freitag (1994).

• The filtering methods are usually based on heuristics. They eliminate and filter out the undesirable variables before launching the classification task.
• The wrapper methods are generally based on a machine learning algorithm that searches for the best subset of variables, i.e., the variables giving the highest classification accuracy. However, the wrapper methods are time consuming compared to the filtering methods because the machine learning algorithm is run iteratively while selecting variables.







For the filtering methods, we choose the best-first search proposed by Kohavi and John (1996) and the information gain based ranking filter proposed by Caruana and Freitag (1994), both from the WEKA package available at Waikato (2017). In the rest of this section, we detail the new proposed variable selection method VS-VNS for data classification. We first give the variable vector representation used in VS-VNS and then detail the main components of the proposed method.
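For reference, both filters can be invoked through WEKA's attribute selection API roughly as follows. This is a hedged sketch: the class names come from the weka.attributeSelection package as we know it, and credit.arff is a hypothetical input file.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FilterMethods {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // CFS with best-first search over variable subsets.
        AttributeSelection cfs = new AttributeSelection();
        cfs.setEvaluator(new CfsSubsetEval());
        cfs.setSearch(new BestFirst());
        cfs.SelectAttributes(data);
        System.out.println("CFS subset: "
                + java.util.Arrays.toString(cfs.selectedAttributes()));

        // Information gain based ranking filter.
        AttributeSelection ig = new AttributeSelection();
        ig.setEvaluator(new InfoGainAttributeEval());
        ig.setSearch(new Ranker());
        ig.SelectAttributes(data);
        System.out.println("Ranked variables (index, score):");
        for (double[] pair : ig.rankedAttributes()) {
            System.out.println((int) pair[0] + "\t" + pair[1]);
        }
    }
}
```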


3.1. The variable vector representation

The aim of variable selection is to search for a significant set of variables to be used with the classifier in the classification task. The variable vector can be represented as a binary vector denoting which variables of the dataset are present, with the length of the vector equal to n, the number of variables. We use the following assignment: if a variable is selected, the value 1 is assigned to it; otherwise, the value 0 is assigned. For example, Fig. 1 represents such an assignment for a dataset of seven variables where the second, the third and the sixth variables are selected.

Fig. 1. The variable vector representation.
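In code, the representation is a plain bit vector. The following small Java sketch (ours, for illustration) encodes the assignment of Fig. 1:

```java
public class VariableVector {
    public static void main(String[] args) {
        // Seven variables; the second, third and sixth are selected,
        // matching the example of Fig. 1 (positions are 1-based in the text).
        int[] x = {0, 1, 1, 0, 0, 1, 0};

        int count = 0;
        for (int bit : x) count += bit;
        System.out.println("Number of selected variables: " + count); // 3
    }
}
```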



3.2. Proposed VS-VNS for variable selection

The proposed VS-VNS is a variable selection method based on variable neighborhood search (VNS). VNS is a local search meta-heuristic that works on a set of different neighborhoods. The basic idea is a systematic change of k neighborhoods combined with a local search, as shown by Mladenovic and Hansen (1997). Neighbor solutions are generated by randomly adding or deleting a variable from the variable vector of size n. We use three neighborhood structures (k = 3), which are:


• N1: the neighbor solution x′ of the solution x is obtained by modifying only one bit, as done in local search methods. For example, if the number of variables n equals 8 and x = 11111111 is the current solution vector, the possible neighbor solutions in N1 are: 01111111, 10111111, 11011111, 11101111, 11110111, 11111011, 11111101, 11111110.
• N2: the neighbor solution x′ of the solution x is obtained by modifying two consecutive bits simultaneously. For example, if x = 11111111 is the current solution vector, the possible neighbor solutions in N2 are: 00111111, 10011111, 11001111, 11100111, 11110011, 11111001, 11111100.
• N3: the neighbor solution x′ of the solution x is obtained by modifying three consecutive bits simultaneously. For the same current solution vector x = 11111111, the possible neighbor solutions in N3 are: 00011111, 10001111, 11000111, 11100011, 11110001, 11111000.
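Reading the examples above as flips of k consecutive bits, the three neighborhood structures can be generated as in the following sketch (our reconstruction of the neighborhoods, not the authors' code):

```java
import java.util.ArrayList;
import java.util.List;

public class Neighborhoods {
    // Neighbors of x in N_k: flip k consecutive bits, which reproduces the
    // N1, N2 and N3 examples listed above (8, 7 and 6 neighbors for n = 8).
    static List<int[]> neighbors(int[] x, int k) {
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start + k <= x.length; start++) {
            int[] y = x.clone();
            for (int j = start; j < start + k; j++) {
                y[j] = 1 - y[j]; // flip one bit
            }
            result.add(y);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] x = {1, 1, 1, 1, 1, 1, 1, 1};
        for (int k = 1; k <= 3; k++) {
            System.out.println("N" + k + " has " + neighbors(x, k).size() + " neighbors");
        }
    }
}
```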




To avoid exploiting the same region, the proposed VS-VNS performs a certain number of local steps that combine intensification and diversification strategies to locate a good solution:

• Step 1: intensification, which consists in selecting the best neighbor solution, i.e., the one with the best objective function value.
• Step 2: diversification, which consists in selecting a random neighbor solution.

The intensification step is applied with a fixed probability wp > 0 and the diversification step with probability (1 − wp), where wp is fixed empirically.

The proposed VS-VNS is combined with BN. The overall method starts with an initial solution containing all the variables and then iteratively tries to find a better solution in the whole neighborhood. A BN classifier is built for each candidate solution constructed by the VS-VNS method, and the solution is evaluated according to both accuracy and ROC (the area under the ROC curve). The solution quality is thus measured by the objective function f = (Accuracy + ROC)/2. The objective is to find an optimal subset of variables by finding optimal combinations of variables from the dataset. The VS-VNS process is repeated for a certain number of iterations, max_iterations, fixed empirically. The overall VS-VNS algorithm for variable selection is sketched below.
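The following condensed sketch puts the pieces together. It reuses the neighbors helper sketched earlier and treats the BN-based evaluation as a stub, since the text defines only the objective f = (Accuracy + ROC)/2; the names and structure are our reconstruction, not the authors' listing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class VsVnsSketch {
    static final Random RNG = new Random();

    // Stub: in the real method, a WEKA BayesNet would be trained and tested
    // on the variables selected by x, returning (accuracy + ROC) / 2.
    static double f(int[] x) {
        return 0.0; // hypothetical placeholder
    }

    static int[] search(int n, int maxIterations, double wp) {
        int[] best = new int[n];
        Arrays.fill(best, 1); // initial solution: all variables selected
        double bestF = f(best);

        for (int iter = 0; iter < maxIterations; iter++) {
            for (int k = 1; k <= 3; k++) { // systematic change of neighborhoods N1..N3
                List<int[]> nbrs = Neighborhoods.neighbors(best, k);
                int[] candidate;
                if (RNG.nextDouble() < wp) {
                    // Intensification: best neighbor w.r.t. the objective.
                    candidate = nbrs.get(0);
                    for (int[] y : nbrs) if (f(y) > f(candidate)) candidate = y;
                } else {
                    // Diversification: random neighbor.
                    candidate = nbrs.get(RNG.nextInt(nbrs.size()));
                }
                if (f(candidate) > bestF) {
                    bestF = f(candidate);
                    best = candidate;
                    k = 0; // improvement found: restart from N1 (VNS convention)
                }
            }
        }
        return best;
    }
}
```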

3.3. Combination techniques

The second contribution of this paper is the study of the impact of the search method on BN when combined with variable selection methods. We study six variants of BN for CS: BN with K2, BN with hill climbing, BN with repeated hill climbing, BN with TS, BN with SA and BN with TAN. For each variant, we consider the three variable selection methods: the best-first search given by Kohavi and John (1996), the ranking filter information gain method given by Caruana and Freitag (1994) and our new VS-VNS method. The numerical results are detailed in the next section.
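In WEKA terms, the six variants differ only in the search algorithm plugged into the BayesNet classifier, roughly as below. This is a sketch based on the class names of WEKA's weka.classifiers.bayes.net.search.local package as we know them; all parameters are left at their defaults.

```java
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.local.HillClimber;
import weka.classifiers.bayes.net.search.local.K2;
import weka.classifiers.bayes.net.search.local.RepeatedHillClimber;
import weka.classifiers.bayes.net.search.local.SimulatedAnnealing;
import weka.classifiers.bayes.net.search.local.TAN;
import weka.classifiers.bayes.net.search.local.TabuSearch;

public class BnVariants {
    public static void main(String[] args) {
        BayesNet bn = new BayesNet();
        // Exactly one search method is active per variant:
        bn.setSearchAlgorithm(new K2());                     // BN with K2
        // bn.setSearchAlgorithm(new HillClimber());         // BN with hill climbing
        // bn.setSearchAlgorithm(new RepeatedHillClimber()); // BN with repeated hill climbing
        // bn.setSearchAlgorithm(new TabuSearch());          // BN with TS
        // bn.setSearchAlgorithm(new SimulatedAnnealing());  // BN with SA
        // bn.setSearchAlgorithm(new TAN());                 // BN with TAN
        System.out.println(bn.getSearchAlgorithm().getClass().getSimpleName());
    }
}
```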








4. Empirical study for credit scoring

All experiments were run on an Intel Core(TM) i5-2217U [email protected] GHz with 6 GB of RAM under Windows 8, 64 bits (x64 processor). The source code is written in Java under the NetBeans IDE 8.2 and uses the WEKA machine learning package.

4.1. Datasets description

We perform the credit scoring task on four financial datasets: the German, Australian and Japanese datasets, available from the UCI (University of California at Irvine) Machine Learning Repository1. We also consider the


"Give me some Credit" dataset from Kaggle2. The description of the four datasets is given in Table 1. Tables 2 to 5 give the summary statistics of the quantitative variables of the Australian, German, Japanese and "Give me some credit" datasets, respectively. The column Min gives the minimum value, Max the maximum value, Mean the average value and stdDev the standard deviation.

1 https://archive.ics.uci.edu/ml/datasets
2 https://www.kaggle.com/c/GiveMeSomeCredit

Table 1
Description of the datasets used in the study

Dataset               #Loans   #Good Loans   #Bad Loans   #Variables
Australian            690      307           383          14
German                1000     700           300          20
Japanese              690      307           383          15
Give me Some Credit   150000   139974        10026        10

Table 2
Summary statistics of quantitative variables of the Australian dataset

Variables   Min     Max      Mean       stdDev
A2          13.75   80.25    31.568     11.853
A3          0       28       4.759      4.978
A7          0       28.5     2.223      3.347
A10         0       67       2.4        4.863
A13         0       2000     184.014    172.159
A14         1       100001   1018.396   5210.103

Table 3
Summary statistics of quantitative variables of the German dataset

Variables                     Min   Max     Mean       stdDev
A2 (duration)                 4     72      20.903     12.059
A5 (credit amount)            250   18424   3271.258   2822.737
A8 (installment commitment)   1     4       2.973      1.119
A11 (residence since)         1     4       2.845      1.104
A13 (age)                     19    75      35.546     11.375
A16 (existing credits)        1     4       1.407      0.578
A18 (num dependents)          1     2       1.155      0.362

Table 4
Summary statistics of quantitative variables of the Japanese Credit dataset

Variables   Min    Max      Mean       stdDev
A2          13.5   80.25    31.568     11.958
A3          0      28       4.759      4.978
A8          0      28.5     2.223      3.347
A11         0      67       2.4        4.863
A14         0      2000     184.015    173.807
A15         0      100000   1017.386   5210.103

4.2. Evaluation measures

We used both a split training/test partition and a 10-fold cross-validation to evaluate the models. An experiment (not reported here) showed that the splitting partition is more effective than cross-validation in our case. Consequently, the evaluation technique adopted in this study is to run the BN classifier on the training data to get a model and then apply this model to the test data to find the appropriate class. The example data are partitioned into training and test examples, approximately in the proportion of 66.6% to 33.4%, respectively.

We use several metrics to evaluate the performance of the credit scoring models. Table 6 gives the confusion matrix, where True Positives (TP) is the number of positive examples labeled as such, False Positives (FP) the number of negative examples labeled as positive, True Negatives (TN) the number of negative examples labeled as such, and False Negatives (FN) the number of positive examples labeled as negative. The diagonal elements of the confusion matrix in Table 6 (TP and TN) represent the data properly classified by the classifier, while the off-diagonal elements (FN and FP) represent the misclassified data.
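The evaluation protocol described above can be reproduced in WEKA roughly as follows (a hedged sketch: australian.arff is a hypothetical file name and the API calls are WEKA's as we know them):

```java
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("australian.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // Split into ~66.6% training and ~33.4% test examples.
        int trainSize = (int) Math.round(data.numInstances() * 0.666);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        BayesNet bn = new BayesNet();
        bn.buildClassifier(train);      // learn the model on the training data

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(bn, test);   // apply the model to the test data
        System.out.println("ACC%     = " + eval.pctCorrect());
        System.out.println("ROC area = " + eval.weightedAreaUnderROC());
    }
}
```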

We consider the following metrics presented by Powers (2011):

• Recall, sensitivity or true positive rate (TPR): Recall = TPR = TP/P = TP/(TP + FN)
• Specificity or true negative rate (TNR): TNR = TN/N = TN/(TN + FP)
• Precision or positive predictive value: Precision = TP/(TP + FP)
• False positive rate (FPR): FPR = FP/N = FP/(FP + TN) = 1 − TNR
• The harmonic mean of precision and sensitivity (F-measure): F-measure = 2TP/(2TP + FP + FN)
• Matthews correlation coefficient (MCC): MCC = (TP·TN − FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
• Accuracy (ACC): ACC = (TP + TN)/(P + N) = (TP + TN)/(TP + TN + FP + FN)
• The area under the precision-recall curve (PRC Area), where PRC curves plot precision versus recall.
• The area under the ROC curve (AUC). The ROC area is a common evaluation metric for binary classification problems; the ROC curve plots the recall (TPR) against the false positive rate (FPR).

We note that both the ROC and PRC areas are important performance parameters. However, ROC is more robust than PRC in the imbalanced-class case because ROC is independent of the fraction of the test population that belongs to class 0 or class 1.
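All of these quantities follow directly from the four confusion-matrix counts; the sketch below computes them in Java (the counts are illustrative numbers of ours, not results from the paper):

```java
public class Metrics {
    public static void main(String[] args) {
        // Illustrative confusion-matrix counts (not taken from the paper).
        double tp = 90, fn = 10, fp = 15, tn = 85;

        double recall    = tp / (tp + fn);             // TPR, sensitivity
        double tnr       = tn / (tn + fp);             // specificity
        double precision = tp / (tp + fp);             // positive predictive value
        double fpr       = fp / (fp + tn);             // = 1 - TNR
        double fMeasure  = 2 * tp / (2 * tp + fp + fn);
        double acc       = (tp + tn) / (tp + tn + fp + fn);
        double mcc       = (tp * tn - fp * fn)
                / Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));

        System.out.printf("Recall=%.3f TNR=%.3f Precision=%.3f FPR=%.3f%n",
                recall, tnr, precision, fpr);
        System.out.printf("F-measure=%.3f ACC=%.3f MCC=%.3f%n", fMeasure, acc, mcc);
    }
}
```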


Table 5
Summary statistics of quantitative variables of the Give me some Credit dataset

Variables                                       Min   Max       Mean       stdDev
Revolving utilization of unsecured lines        0     50708     6.048      249.755
Age                                             0     109       52.295     14.772
Number of times 30-59 days past due not worse   0     98        0.421      4.193
Debt ratio                                      0     392664    353.005    2037.819
Monthly income                                  0     3008750   5348.139   13152.057
Number of open credit lines and loans           0     58        8.453      5.146
Number of times 90 days late                    0     98        0.266      4.169
Number of real estate loans or lines            0     54        1.018      1.13
Number of times 60-89 days past due not worse   0     98        0.24       4.155
Number of dependents                            0     20        0.737      1.107

Table 6
Confusion Matrix

Real class \ Predicted   Positive   Negative   Total
Positive                 TP         FN         P = TP + FN
Negative                 FP         TN         N = FP + TN

4.3. Impact of variable selection

In this section, we study the impact of the three variable selection methods on CS: the two filtering methods and our VS-VNS. We evaluate their performance when they are combined with search methods for BN. The same variable selection methods are used on the four datasets; the corresponding empirical studies are given in the following. Table 7 gives the number of selected variables with the three variable selection methods on the four datasets.

As already mentioned, we consider two filtering methods for variable selection. CFS is a correlation based feature selection method presented by Hall (1999) that selects the best set of variables under the assumption that the variables are conditionally independent. CFS is, however, not able to select all relevant variables when there are strong dependencies; for this reason, we also use the information gain based ranking filter and VS-VNS. The ranking filter finds the best subset of variables from the original dataset by using a score: the variables are weighted by using a proxy measure rather than the error rate, as shown by Hall (1999) and Caruana and Freitag (1994). VS-VNS applies the process already given in Section 3.2 to select the relevant variables.

Table 7
Number of selected variables with the three variable selection methods

#Selected variables   Australian   German   Japanese   Give me Some Credit
ALL variables         14           20       15         10
with CFS              7            3        7          4
with Ranking filter   14           16       15         10
with VS-VNS           12           19       15         10

The variables selected with CFS are as follows:

• For the Australian dataset, 7 variables are selected: A4, A5, A7, A8, A10, A13 and A14.
• For the German dataset, only 3 variables are chosen: A1, A2 and A3.
• For the Japanese dataset, 7 variables are selected: A4, A6, A8, A9, A11, A14 and A15.
• For the Give me some credit dataset, 4 variables are selected: A2, A3, A7 and A9.


The variables selected with the ranking filter are as follows:


• For the Australian dataset, all the variables are ranked and selected: A8, A10, A9, A14, A7, A5, A6, A3, A13, A4, A2, A12, A11 and A1. The corresponding rank values are, respectively: 0.425709, 0.213511, 0.156286, 0.110235, 0.110022, 0.10916, 0.050189, 0.041099, 0.036708, 0.029603, 0.022884, 0.010036, 0.000721 and 0.000139.
• For the German dataset, there are 16 selected variables among 20: A1, A3, A2, A6, A4, A5, A12, A7, A15, A13, A14, A9, A20, A10, A17 and A19, with the following rank values, respectively: 0.094739, 0.043618, 0.0329, 0.028115, 0.024894, 0.018709, 0.016985, 0.013102, 0.012753, 0.011278, 0.008875, 0.006811, 0.005823, 0.004797, 0.001337 and 0.000964.
• For the Japanese dataset, the selected variables, sorted by rank, are: A9, A11, A10, A15, A8, A6, A14, A7, A3, A5, A4, A2, A13, A12 and A1. The corresponding rank values are, respectively: 0.425709, 0.213511, 0.156286, 0.110235, 0.110022, 0.107525, 0.05371, 0.049456, 0.041099, 0.02875, 0.02875, 0.021239, 0.010036, 0.000721 and 0.000423.
• For the Give me some credit dataset, the ranking filter selects all the variables, sorted by rank as follows: A1, A7, A3, A9, A2, A6, A4, A5, A8 and A10. The ranks are: 0.0529, 0.04621, 0.03813, 0.03171, 0.01085, 0.00556, 0.00473, 0.00405, 0.00299 and 0.00155.



Table 8
BN with variable selection and search methods on the Australian Credit dataset

Search Method   Variable Selection   #Selected variables   TPR%   FPR%   Precision%   Recall%   F-Measure%   MCC%   ROC%   PRC%
K2              CFS                  7                     88.5   11.4   88.6         88.5      88.5         77.0   93.7   93.4
K2              Ranking              14                    85.5   14.4   85.6         85.5      85.5         71.1   92.9   92.7
K2              VS-VNS               18                    88.0   12.6   88.0         88.0      88.0         75.6   93.6   93.5
Hill climbing   CFS                  7                     88.5   11.4   88.6         88.5      88.5         77.0   93.7   93.4
Hill climbing   Ranking              14                    86.0   14.0   86.0         86.0      86.0         71.9   92.8   92.5
Hill climbing   VS-VNS               12                    87.8   12.7   87.8         87.8      87.8         75.3   93.5   93.4
Repeated Hill   CFS                  7                     88.5   11.4   88.6         88.5      88.5         77.0   93.7   93.4
Repeated Hill   Ranking              14                    86.0   14.0   86.0         86.0      86.0         71.9   92.8   92.5
Repeated Hill   VS-VNS               12                    87.8   12.7   87.8         87.8      87.8         75.3   93.5   93.4
TS              CFS                  7                     88.5   11.4   88.6         88.5      88.5         77.0   93.7   93.4
TS              Ranking              14                    86.0   14.0   86.0         86.0      86.0         71.9   92.8   92.5
TS              VS-VNS               13                    87.2   13.3   87.2         87.2      87.2         74.1   93.5   93.3
SA              CFS                  7                     87.2   12.5   87.5         87.2      87.2         74.7   94.1   93.9
SA              Ranking              14                    86.0   13.7   86.2         86.0      86.0         72.2   94.8   94.7
SA              VS-VNS               12                    89.0   11.8   89.0         89.0      89.0         77.7   95.9   95.9
TAN             CFS                  7                     88.1   11.6   88.3         88.1      88.1         76.4   94.0   93.8
TAN             Ranking              14                    87.2   12.8   87.2         87.2      87.2         74.4   93.1   92.6
TAN             VS-VNS               13                    88.1   13.0   88.2         88.1      88.1         75.9   94.3   94.0



The sets of variables selected with the proposed VS-VNS when SA is applied as the search method are as follows:


• For the Australian dataset, VS-VNS selects 12 variables; the unselected variables are A1 and A6.
• For the German dataset, there are 19 selected variables; the variable A5 is removed.
• For the Japanese dataset, there are 15 selected variables; the variable A12 is removed.
• For the Give me some credit dataset, all the variables are selected.



4.4. Variable selection with search method for BN


In this section, we present the results obtained when considering the six search methods combined with the three variable selection techniques in BN (CFS, Ranking and VS-VNS). The max_iterations parameter of VS-VNS is fixed empirically to 10 and the wp value is equal to 0.6. The numerical results are reported in Tables 8 to 11.

According to the numerical results, we can say that, in general, BN with K2, BN with hill climbing, BN with repeated hill climbing and BN with TS are comparable on the four considered datasets. The numerical results show a slight advantage in favor of BN with TAN when compared to BN with K2, BN with hill climbing, BN with repeated hill climbing and BN with TS. Moreover, BN succeeds in finding good results when combined with the SA search method on all the considered datasets compared to the other BN variants. For example, on the Australian dataset, BN with SA gives a ROC% value equal to 94.1 when CFS is used as the variable selection method, while BN with K2 gives a ROC% equal to 93.7. The same behavior can be seen when we use the Ranking and VS-VNS variable selection methods. Further, the results are much better when we use SA with VS-VNS: the resulting method gives the best ROC% value, equal to 95.9, for the Australian dataset. The performance of BN with SA is confirmed on the German, Japanese and Give me some credit datasets, where BN with SA provides good results compared to the other BN variants considered in this study. According to the numerical results, we can say that the Bayesian network with the SA search method is able to find high quality results in terms of both ROC and PRC; the other measures (TPR, FPR, Precision, Recall, F-Measure, MCC) confirm this result, as shown in Tables 8 to 11.

When we compare the variable selection methods, we can see that VS-VNS provides good results compared to both the CFS and Ranking methods. For




Table 9
BN with variable selection and search methods on the German Credit dataset

Search Method   Variable Selection   #Selected variables   TPR%   FPR%   Precision%   Recall%   F-Measure%   MCC%   ROC%   PRC%
K2              CFS                  3                     73.8   50.0   71.7         73.8      72.3         26.8   75.5   77.7
K2              Ranking              16                    75.3   35.9   76.0         75.3      75.6         38.4   80.1   82.4
K2              VS-VNS               17                    76.6   35.2   76.0         76.6      76.2         42.8   81.2   81.8
Hill Climbing   CFS                  3                     73.8   50.0   71.7         73.8      72.3         26.8   75.5   77.7
Hill Climbing   Ranking              16                    74.7   45.4   73.4         74.7      73.9         31.4   76.8   80.4
Hill Climbing   VS-VNS               19                    74.6   35.3   74.6         74.6      74.6         39.4   78.3   79.5
Repeated Hill   CFS                  3                     73.8   50.0   71.7         73.8      72.3         26.8   75.5   77.7
Repeated Hill   Ranking              16                    74.7   45.4   73.4         74.7      73.9         31.4   76.8   80.4
Repeated Hill   VS-VNS               19                    74.6   35.3   74.6         74.6      74.6         39.4   78.3   79.5
TS              CFS                  3                     73.8   50.0   71.7         73.8      72.3         26.8   75.5   77.7
TS              Ranking              16                    75.9   37.8   75.9         75.9      75.9         38.0   77.8   80.5
TS              VS-VNS               19                    76.2   36.7   75.4         76.2      75.7         41.3   79.5   79.9
SA              CFS                  3                     73.8   50.0   71.7         73.8      72.3         26.8   75.5   77.7
SA              Ranking              16                    73.2   46.6   72.1         73.2      72.5         28.1   74.8   79.6
SA              VS-VNS               19                    81.9   31.9   81.4         81.9      81.1         54.6   85.1   86.5
TAN             CFS                  3                     73.8   50.0   71.7         73.8      72.3         26.8   74.1   76.9
TAN             Ranking              16                    74.7   40.4   74.5         74.7      74.6         34.6   78.3   80.9
TAN             VS-VNS               19                    78.7   32.6   78.1         78.7      78.3         47.8   83.1   84.4


Table 10
BN with variable selection and search methods on the Japanese Credit dataset

Search Method   Variable Selection   #Selected variables   TPR%   FPR%   Precision%   Recall%   F-Measure%   MCC%   ROC%   PRC%
K2              CFS                  7                     85.5   15.4   85.8         85.5      85.4         71.1   91.4   90.5
K2              Ranking              14                    86.0   15.2   86.6         86.0      85.8         72.3   90.3   89.5
K2              VS-VNS               13                    87.1   13.4   87.1         87.1      87.1         73.9   93.3   93.0
Hill Climbing   CFS                  7                     85.5   15.4   85.8         85.5      85.4         71.1   91.4   90.5
Hill Climbing   Ranking              14                    86.0   15.2   86.6         86.0      85.8         72.3   90.5   89.7
Hill Climbing   VS-VNS               14                    87.1   13.2   87.1         87.1      87.1         73.9   93.3   93.0
Repeated Hill   CFS                  7                     85.5   15.4   85.8         85.5      85.4         71.1   91.4   90.5
Repeated Hill   Ranking              14                    86.0   15.2   86.6         86.0      85.8         72.3   90.5   89.7
Repeated Hill   VS-VNS               14                    87.1   13.2   87.1         87.1      87.1         73.9   93.3   93.0
TS              CFS                  7                     85.5   15.4   85.8         85.5      85.4         71.1   91.4   90.5
TS              Ranking              14                    86.0   15.2   86.6         86.0      85.8         72.3   90.5   89.7
TS              VS-VNS               13                    87.7   12.7   87.7         87.7      87.7         75.0   93.3   93.0
SA              CFS                  7                     85.5   15.6   86.1         85.5      85.4         71.3   91.8   90.9
SA              Ranking              14                    83.8   17.6   84.7         83.8      83.6         68.1   91.0   90.0
SA              VS-VNS               14                    89.7   10.4   89.7         89.7      89.7         79.2   95.8   95.7
TAN             CFS                  7                     85.5   15.6   86.1         85.5      85.4         71.3   90.5   89.4
TAN             Ranking              14                    85.5   15.6   86.1         85.5      85.4         71.3   91.3   90.4
TAN             VS-VNS               13                    89.0   11.7   89.0         89.0      89.0         77.7   94.8   94.5

example, on the Australian dataset, BN with SA gives a ROC% value equal to 94.1 when combined with CFS and a ROC% value equal to 94.8 when we use Ranking as the variable selection method; the best result is found when we use VS-VNS, where the ROC% value is equal to 95.9.

In conclusion, promising results are obtained when combining BN, the SA search method and our new VS-VNS technique. This improvement can be observed on all the considered datasets: the proposed method gives a ROC% value equal to 95.9 and a PRC% equal to 95.9 for the Australian dataset, a ROC% value equal to 85.1 and a PRC% equal to 86.5 for the German dataset, a ROC% value equal to 95.8 and a PRC% equal to 95.7 for the Japanese dataset, and a ROC% value equal to 86.2 and a PRC% equal to 94.7 for Give me some credit. To show this performance clearly, we draw the curves in Figs 2 to 5 for the four datasets.


4.5. ANOVA statistical analysis

To show the statistical significance of our results, we use the ANOVA (analysis of variance) statistical tool. In our case, we use the ROC% values to compare the effect of the VS-VNS variable selection when combined with the SA search method in BN for CS.


Table 11
BN with variable selection and search methods on the Give me some Credit dataset

Search Method   Variable Selection   #Selected variables   TPR%   FPR%   Precision%   Recall%   F-Measure%   MCC%   ROC%   PRC%
K2              CFS                  4                     93.2   64.4   92.1         93.2      92.5         35.0   81.1   93.5
K2              Ranking              10                    92.6   54.9   92.4         92.6      92.5         38.6   85.5   94.5
K2              VS-VNS               8                     92.7   54.9   92.4         92.7      92.6         39.4   85.9   94.5
Hill Climbing   CFS                  4                     93.2   64.4   92.1         93.2      92.5         35.0   81.1   93.5
Hill Climbing   Ranking              10                    92.6   54.9   92.4         92.6      92.5         38.6   85.5   94.5
Hill Climbing   VS-VNS               8                     92.7   54.9   92.4         92.7      92.6         39.4   85.9   94.5
Repeated Hill   CFS                  4                     93.2   64.4   92.1         93.2      92.5         35.0   81.1   93.5
Repeated Hill   Ranking              10                    92.6   54.9   92.4         92.6      92.5         38.6   85.5   94.5
Repeated Hill   VS-VNS               8                     92.7   54.9   92.4         92.7      92.6         39.4   85.9   94
TS              CFS                  4                     93.6   79.9   91.7         93.6      91.8         26.2   73.1   92.0
TS              Ranking              10                    92.9   57.4   92.4         92.9      92.6         38.1   85.3   94.5
TS              VS-VNS               10                    92.7   56.0   92.4         92.7      92.5         38.7   85.5   94.5
SA              CFS                  4                     93.7   79.1   91.9         93.7      91.9         27.8   81.2   93.5
SA              Ranking              10                    93.5   70.4   92.0         93.5      92.4         32.6   85.0   94.4
SA              VS-VNS               10                    93.6   69.2   92.2         93.6      92.5         34.7   86.2   94.7
TAN             CFS                  4                     93.5   71.6   92.0         93.5      92.3         32.0   81.2   93.4
TAN             Ranking              10                    93.3   64.5   92.2         93.3      92.6         35.9   85.5   94.5
TAN             VS-VNS               9                     93.3   64.7   92.2         93.3      92.6         36.1   86.0   94.6


Fig. 2. ROC% and PRC% found with variable selection with SA for the Australian dataset.
Fig. 3. ROC% and PRC% found with variable selection with SA for the German dataset.
Fig. 4. ROC% and PRC% found with variable selection with SA for the Japanese dataset.
Fig. 5. ROC% and PRC% found with variable selection with SA for the Give me some credit dataset.


We compare all the variable selection methods (CFS, Ranking and VS-VNS) when used with SA in BN, and we also compare with BN when all the variables are considered (ALL). Table 12 describes the ANOVA tests, where the column df represents the degrees of freedom, SS the sum of squares, MS the mean square and F-value the F-statistic; the p-value supports the interpretation and analysis of the results. The p-values are < 0.05, which indicates that the ROC% values produced by VS-VNS are significantly different from those produced by the other BN variants. This means that the quality of the results obtained with our proposed technique combined with SA in BN is statistically better than that of the other methods.
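A one-way ANOVA of this kind can be reproduced with a standard statistics library; the sketch below uses Apache Commons Math. This is an assumption of ours: the paper does not say which tool computed Table 12, and we also guess that each test compares the per-dataset ROC% samples of two variants, here taken from the SA rows of Tables 8 to 11.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.commons.math3.stat.inference.OneWayAnova;

public class AnovaCheck {
    public static void main(String[] args) {
        // ROC% of BN+SA on the four datasets (from Tables 8 to 11).
        double[] vsVns = {95.9, 85.1, 95.8, 86.2}; // with VS-VNS
        double[] cfs   = {94.1, 75.5, 91.8, 81.2}; // with CFS

        List<double[]> groups = Arrays.asList(vsVns, cfs);
        OneWayAnova anova = new OneWayAnova();
        System.out.println("F = " + anova.anovaFValue(groups));
        System.out.println("p = " + anova.anovaPValue(groups));
    }
}
```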



Table 12
ANOVA test for BN with the SA search method

BN with SA search method   df   SS       MS       F-value   p-value
VS-VNS vs. CFS             1    120.91   120.91   73.12     0.00336
VS-VNS vs. Ranking         1    105.14   105.14   15.22     0.0299
VS-VNS vs. ALL             1    107.02   107.02   17.03     0.0258


5. Conclusion

Credit scoring is an important process in financial institutions and banks. It helps in decision making and permits distinguishing between "bad" and "good" counterparties. CS uses variables related to applicants to evaluate their creditworthiness. In this work, we studied the impact of variable selection and classification on credit scoring. We proposed a new variable selection method called VS-VNS for CS and compared it with two filtering methods. Further, we explored different combinations of variable selection and various search methods for BN. We considered six search methods: K2, hill climbing, repeated hill climbing, TS, SA and TAN. The different techniques were combined with BN to build models for CS, and the different combinations were evaluated on the German, Australian, Japanese and the large Kaggle "Give me some credit" datasets. The numerical results are promising and show the benefit of the proposed combinations. Further, the proposed VS-VNS succeeds in finding promising results compared with the other variable selection techniques, in particular when combined with SA in BN. We plan to combine several variable selection strategies to further enhance the performance, and it would be interesting to evaluate the considered combinations with other classifiers on other datasets.


References


Abdou, H.A., 2009. Genetic programming for credit scoring: The case of Egyptian public sector banks, Expert Systems with Applications 36, 11402–11417.
Bellotti, T., Crook, J., 2009. Support vector machines for credit scoring and discovery of significant features, Expert Systems with Applications 36, 3302–3308.
Bonilla Huerta, E., Duval, B., Hao, J.K., 2006. A hybrid GA/SVM approach for gene selection and classification of microarray data. In Rothlauf et al. (Eds.), EvoWorkshops 2006, LNCS 3907, pp. 34–44.
Boughaci, D., Benhamou, B., Drias, H., 2010. Local search methods for the optimal winner determination problem in combinatorial auctions, Journal of Mathematical Modelling and Algorithms 9(2), 165–180.
Breiman, L., Friedman, J., Olshen, R., Stone, C., 1984. Classification and Regression Trees, Belmont, CA: Wadsworth.
Caruana, R., Freitag, D., 1994. Greedy attribute selection. In Proceedings of the Eleventh International Conference on Machine Learning (ICML 1994, New Brunswick, New Jersey), Morgan Kaufmann, pp. 28–36.
Desay, V., Crook, J.N., Overstreet, G.A., 1996. A comparison of neural networks and linear scoring models in the credit union environment, European Journal of Operational Research 95, 24–37.
Friedman, N., Geiger, D., Goldszmidt, M., 1997. Bayesian network classifiers, Machine Learning 29, 131–163.
Gen, M., 2006. Genetic algorithms and their applications. In H. Pham (Ed.), Springer Handbook of Engineering Statistics, pp. 749–773. London: Springer.
Glover, F., 1989. Tabu search, part I, ORSA Journal on Computing 1, 190–206.
Hall, M., 1999. Correlation-based feature selection for machine learning, Methodology 21i195-i20 (April), 1–5.
Hand, D.J., Henley, W.E., 1997. Statistical classification methods in consumer credit scoring, Journal of the Royal Statistical Society, Series A (Statistics in Society) 160, 523–541.
Hansen, P., 1986. The steepest ascent mildest descent heuristic for combinatorial programming. Presented at the Congress on Numerical Methods in Combinatorial Optimization, Capri, Italy.
Henley, W.E., Hand, D.J., 1996. A k-nearest neighbour classifier for assessing consumer credit risk, Statistician 45, 77–95.
Hoos, H.H., Boutilier, C., 2000. Solving combinatorial auctions using stochastic local search. In Proceedings of the 17th National Conference on Artificial Intelligence, pp. 22–29.
John, G.H., Langley, P., 1995. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo: Morgan Kaufmann, pp. 338–345.
Jolliffe, I.T., 2002. Principal Component Analysis, Springer Series in Statistics, 2nd ed., Springer, NY. ISBN 978-0-387-95442-4.
Kirkpatrick, S., Gelatt Jr, C.D., Vecchi, M.P., 1983. Optimization by simulated annealing, Science 220(4598), 671–680.
Kohavi, R., John, G., 1996. Wrappers for feature subset selection, Artificial Intelligence, Special issue on relevance, pp. 273–324.
Li, J., Wei, L., Li, G., Xu, W., 2011. An evolution strategy-based multiple kernels multi-criteria programming approach: The case of credit decision making, Decision Support Systems 51, 292–298.
Mester, L.J., 1997. What's the point of credit scoring? Business Review (September), 3–16.
Miller, M., 2003. Research confirms value of credit scoring, National Underwriter 107(42), 30.
Mladenovic, N., Hansen, P., 1997. Variable neighbourhood decomposition search, Computers and Operations Research 24, 1097–1110.
Powers, D.M.W., 2011. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation, Journal of Machine Learning Technologies 2(1), 37–63.
Quinlan, J.R., 1987. Simplifying decision trees, International Journal of Man-Machine Studies 27, 221–234.
Sousa, M.R., Gama, J., Brandao, E., 2015. A new dynamic modeling framework for credit risk assessment, Expert Systems with Applications, October 16, 2015.
Waikato Environment for Knowledge Analysis (WEKA), Version 3.9. The University of Waikato, Hamilton, New Zealand. Software available at http://www.cs.waikato.ac.nz/ml/weka/. Accessed: November 2017.
Wiginton, J.C., 1980. A note on the comparison of logit and discriminant models of consumer credit behavior, Journal of Financial and Quantitative Analysis 15, 757–770.
