A Transfer Learning Based Classifier Ensemble Model for Customer Credit Scoring

2014 Seventh International Joint Conference on Computational Sciences and Optimization

Jin Xiao, Runzhe Wang
Business School, Sichuan University, Chengdu, China
e-mail: [email protected]

Geer Teng
The Faculty of Social Development & Western China Development Studies, Sichuan University, Chengdu, China
e-mail: [email protected]

Yi Hu*
School of Management, University of Chinese Academy of Sciences, Beijing 100190, China
e-mail: [email protected]

* Yi Hu is the corresponding author.


Abstract—Customer credit scoring is an important concern for numerous domestic and global industries. It is difficult for traditional models, constructed on the assumption that the training and test data are subject to the same distribution, to achieve satisfactory performance, because customers usually come from different districts and may be subject to different distributions in reality. This study combines ensemble learning with transfer learning and proposes a clustering and selection based dynamic transfer ensemble (CSTE) model, which transfers knowledge from related source domains to the target domain to assist in modeling. The experimental results on a large customer credit scoring dataset show that the CSTE model outperforms two traditional credit scoring models, as well as three existing transfer learning models.

Keywords—customer credit scoring; transfer ensemble model; selective transfer learning; noise elimination; multiple classifier ensemble

I. INTRODUCTION

Over the past decades, global credit businesses have developed rapidly, and credit institutions face more and more credit fraud, which results in huge losses. It is therefore important to build a scientific customer credit scoring model that can provide decision support for the related practitioners and decrease the losses. At present, the most popular approaches to consumer risk assessment treat loan decisions as binary classification problems [1], i.e., they divide applicants into two categories: applicants with good credit and applicants with bad credit. The commonly used credit scoring models include logistic regression, artificial neural networks (ANN), Bayesian networks, support vector machines (SVM), etc.

In real credit scoring, the class distribution is often imbalanced, which means that the applicants with bad credit constitute only a very small minority of the data [2]. In this case, the misclassification rate of the above models for bad credit applicants is much higher than that for good credit applicants. However, the value of accurately classifying a bad credit applicant is often higher than that of a good credit applicant [3].

At present, two types of approaches are used to deal with the class imbalance issue in customer credit scoring [4]: data-level and algorithm-level solutions. Data-level solutions mainly use resampling techniques, such as random over-sampling and random down-sampling, to balance the class distribution of the training set. For example, Marqués et al. [5] investigated the suitability and performance of several resampling techniques when applied to credit data sets. Algorithm-level solutions mainly introduce cost-sensitive learning techniques and assign different misclassification costs to the customers from different classes. For instance, Zou et al. [6] constructed a cost-sensitive support vector machine for credit scoring. In recent years, some scholars have introduced the multiple classifiers ensemble (MCE) technique to customer credit scoring to improve generalization ability. For example, Paleologo et al. [7] proposed an ensemble classification technique, Subagging, for highly imbalanced credit scoring data, and Xiao et al. [3] proposed a dynamic classifier ensemble method for imbalanced customer data (DCEID).

The common characteristic of the two types of methods above is that they use only the original information in the inner system (the target domain) to handle the class imbalance issue and do not introduce new information. According to statistical learning theory, for a given amount of sample information, model accuracy has an upper bound [8]. Meanwhile, a common phenomenon exists in real customer credit scoring: there are large amounts of customer data in related source domains, which may come from different districts, businesses, or periods, or from different enterprises in the same industry. Although the customer data in the source and target domains are very similar, they are often subject to different distributions. Achieving satisfactory performance in this case is difficult for most traditional models, because they are all based on the assumption that the training data and the test data are subject to the same distribution [9]. Therefore, effectively integrating the data from the source and target domains can be expected to improve credit scoring performance under an imbalanced class distribution.

Transfer learning, proposed in the machine learning area, provides a new idea for this issue; its main idea is to utilize the data of related source domain tasks to assist in modeling the target task [9]. In recent years, transfer learning has been applied in many areas, such as text mining and image recognition. Combining transfer learning with multiple classifiers ensemble (MCE) [10], this study proposes a clustering and selection based dynamic transfer ensemble (CSTE) model and applies it to customer credit scoring. The experimental results on a customer credit scoring dataset show that CSTE achieves better performance than traditional credit scoring models, as well as some existing transfer learning models.

The remainder of this study is organized as follows: Section II briefly introduces transfer learning theory; Section III presents the working principle and detailed steps of CSTE; Section IV presents the experimental design and a detailed analysis of the results; finally, Section V concludes.




II. TRANSFER LEARNING THEORY

In the past decades, research on transfer learning has attracted more and more attention [9]. Transfer learning refers to the ability of a system to recognize and apply knowledge learned in previous tasks to novel tasks: it aims to extract knowledge from one or more source tasks and apply it to a target task. There are three key questions in transfer learning [9]: (1) what to transfer; (2) how to transfer; and (3) when to transfer. These remain the main research topics in transfer learning today. "What to transfer" considers which part of the knowledge can be transferred across domains or tasks. After discovering which knowledge can be transferred, learning algorithms need to be developed to transfer it, which corresponds to the "how to transfer" issue. "When to transfer" considers in which situations transfer should be performed at all. When the source domain and target domain are not related to each other, brute-force transfer may hurt the performance of learning in the target domain, a phenomenon known as negative transfer.

In recent years, many scholars have focused on transfer learning strategies, and the representative ones are as follows: the feature-based transfer learning strategy TFS [11], the instance-based TrBagg strategy [12], and the instance-based TrAdaBoost strategy [13]. Although these strategies have their own advantages, none of them takes the imbalanced class distribution of the data into consideration, so they can hardly perform credit scoring correctly when applied in the CRM field.


III. CLUSTERING AND SELECTION BASED TRANSFER ENSEMBLE MODEL

A. The basic idea of the CSTE model

The CSTE model aims to transfer useful information from related source domains to assist in modeling the target domain. When learning a transfer model, it is necessary to consider how to avoid negative transfer. In Ref. [14], we proposed a feature selection based transfer ensemble model, which supposed that there was only one source domain related to the target domain. In the real world, however, there may be many source domains, and how to construct a transfer ensemble model in this case is the focus of this study.

Suppose that T is the target domain of a credit scoring issue and that there are p source domains Sri (i=1, 2, ..., p) related to T. Both T and the Sri contain two types of samples: bad credit samples with class label "1" and good credit samples with class label "2". Further, T is divided into a target training set T1 and a target test set T2.

Figure 1. The flow chart of the CSTE model.


In order to avoid negative transfer effectively, the CSTE model proposed in this study contains four phases (see Figure 1); a Python sketch of the first two phases follows this list:

1) Cluster the target training set T1;
2) Transfer the source domain datasets selectively;
3) Eliminate the noise data in the training set;
4) Train base classifiers and classify the target test set T2.
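The following fragment illustrates how Phases 1 and 2 could be wired together. It is a minimal sketch, not the authors' MATLAB implementation: it assumes scikit-learn's KMeans, uses scikit-learn's normalized_mutual_info_score as a stand-in for the consistence measure of Eq. (3) defined in Section III-C, and assumes that the two clustering programs are compared on the target samples only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def select_source_domains(T1_X, source_domains, k1, seed=0):
    """Phases 1-2 of CSTE (sketch): keep the half of the source domains whose
    merged clustering agrees most with the target-only clustering program."""
    # Phase 1: initial clustering program C_I on the target training set T1.
    initial = KMeans(n_clusters=k1, n_init=10, random_state=seed).fit_predict(T1_X)

    # Phase 2: re-cluster T1 merged with each source domain Sr_i, then score
    # the consistence of the two programs on the target samples (assumption).
    scores = []
    for S_X in source_domains:
        merged = np.vstack([T1_X, S_X])
        new = KMeans(n_clusters=k1, n_init=10, random_state=seed).fit_predict(merged)
        scores.append(normalized_mutual_info_score(initial, new[:len(T1_X)]))

    # Keep the half of the source domains with the higher consistence values.
    order = np.argsort(scores)[::-1]
    return order[:len(source_domains) // 2]

# Toy usage: of three synthetic source domains, the one drawn from the same
# distribution as the target should be preferred for transfer.
rng = np.random.default_rng(0)
T1_X = rng.normal(0.0, 1.0, size=(200, 2))
sources = [rng.normal(0.0, 1.0, size=(100, 2)),   # similar to the target
           rng.normal(4.0, 1.0, size=(100, 2)),   # shifted
           rng.normal(8.0, 1.0, size=(100, 2))]   # strongly shifted
print(select_source_domains(T1_X, sources, k1=2))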


B. Determine the cluster number with information entropy

To determine the number of clusters, we adopt the information entropy (IE) based method [15], whose basic steps are as follows:

Step 1: Calculate the degree of deviation. Suppose a dataset including n samples X1, X2, ..., Xn is divided into k clusters, and Yj = {Xi | Xi ∈ Yj} represents the j-th cluster. The degree of deviation between the i-th sample and the center of the j-th cluster is:

$$p_{ij} = \frac{d_{ij}}{\sum_{t=1}^{n} d_{tj}}, \qquad (1)$$

where dij is the distance between the i-th sample and the center of the j-th cluster.

Step 2: Calculate the IE of the clustering program as follows:

$$S = -\sum_{j=1}^{k} \sum_{X_i \in Y_j} p_{ij} \ln p_{ij}. \qquad (2)$$

With the increase of the cluster number, the IE of the clustering program will also gradually increase.

Step 3: Find the optimal cluster number. Firstly, the following definitions are given:

Status: if the whole dataset is divided into j clusters, it is defined as the j-th status. The IE of the j-th status is Sj.

The IE jump value: the increment of IE when the status changes from the (j-1)-th to the j-th, i.e., Sj - Sj-1.

The IE transition value: the difference between the IE jump value from the (j-1)-th status to the j-th status and that from the j-th status to the (j+1)-th status, that is, |(Sj+1 - Sj) - (Sj - Sj-1)|.

Borrowing the concept of transition from atomic physics, when |(Sj+1 - Sj) - (Sj - Sj-1)| reaches its smallest value, j is just the optimal number of clusters. At this point, the uncertainty in the whole dataset is minimal, and it is unnecessary to increase the number of clusters further.

C. Measure the consistence of clustering programs with mutual information

When selecting the source domains, it is important to measure the consistence of two clustering programs. In this study we adopt the mutual information (MI) based measure [16]. Suppose that we cluster a dataset with n samples by the k-means method and obtain two labeled vectors with k clusters, λ(a) = {C1(a), C2(a), ..., Ck(a)} and λ(b) = {C1(b), C2(b), ..., Ck(b)}. Further, suppose that there are ni samples in cluster Ci(a), nj samples in Cj(b), and nij shared samples between Ci(a) and Cj(b). Then the normalized MI can be defined as follows:

$$\mathrm{NMI}\left(\lambda^{(a)}, \lambda^{(b)}\right) = \frac{2}{n} \sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij} \log_{k^{2}}\!\left(\frac{n_{ij}\, n}{n_i\, n_j}\right). \qquad (3)$$

The NMI value is between 0 and 1; the larger the value, the more consistent the two clustering programs are.

D. Detailed description of the CSTE model

The CSTE model contains four phases, and its pseudo-code is as follows:

Phase 1: Cluster the target training set
1) Determine the cluster number k1 in the target training set T1 with the IE method;
2) Divide T1 into k1 clusters by the k-means method and obtain the initial clustering program CI;

Phase 2: Transfer the source domain datasets selectively
1) Combine each source domain dataset Sri (i=1, 2, ..., p) with T1 respectively, and cluster again by the k-means algorithm; suppose the new clustering program is Cli;
2) Calculate the consistence between the initial clustering program CI and the new clustering program Cli based on Eq. (3);
3) Select the half of the source domains with higher consistence, add them to the target training set T1, and construct the new training set TR;

Phase 3: Eliminate the noise data in the new training set
1) Utilize IE theory again to determine the cluster number k2 in TR, and then cluster TR with the k-means method to obtain the clustering result Clus1, Clus2, ..., Clusk2;
2) For each cluster Clusi (i=1, 2, ..., k2), if it contains samples with two different class labels, it is called an inconsistent cluster, and we divide it into two sub-clusters according to the samples' class labels: Clusi1 and Clusi2;
3) For any cluster generated, if it contains only a few samples, i.e., an isolated cluster, it is deleted directly; finally, we number all the remaining clusters;
4) Set the cluster number as the new class label of the samples in each cluster, and combine all the samples to obtain the final training set Tf;

Phase 4: Train base classifiers and classify the target test set
1) Sample N training subsets from Tf randomly with replacement, and balance the class distribution of each training subset with random over-sampling;
2) Train a classifier on each training subset to obtain the base classifier set C = {C1, C2, ..., CN};
3) Classify the test set T2 in the target domain with each base classifier Ci (i=1, 2, ..., N), and suppose the classification results are Ri;
4) Ensemble the classification results of the N base classifiers with weighted voting to obtain the final classification results for T2.
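To make Sections III-B and III-C concrete, the sketch below implements Eqs. (1)-(3) directly with NumPy and scikit-learn's k-means. The handling of edge cases (zero distances, empty cells in the contingency table) is our assumption, not part of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_entropy(X, labels, centers):
    """IE of a clustering program, per Eqs. (1)-(2)."""
    S = 0.0
    for j, c in enumerate(centers):
        d_all = np.linalg.norm(X - c, axis=1)      # d_tj for all n samples
        if d_all.sum() == 0:
            continue                               # degenerate cluster (assumption)
        p = d_all[labels == j] / d_all.sum()       # p_ij for X_i in Y_j, Eq. (1)
        p = p[p > 0]
        S -= np.sum(p * np.log(p))                 # Eq. (2)
    return S

def optimal_k(X, k_max=10, seed=0):
    """Pick j minimizing the IE transition value |(S_{j+1}-S_j)-(S_j-S_{j-1})|."""
    S = {}
    for k in range(1, k_max + 2):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        S[k] = clustering_entropy(X, km.labels_, km.cluster_centers_)
    return min(range(2, k_max + 1),
               key=lambda j: abs((S[j + 1] - S[j]) - (S[j] - S[j - 1])))

def nmi_eq3(a, b, k):
    """Consistence of two k-cluster programs a and b, per Eq. (3)."""
    n = len(a)
    total = 0.0
    for i in range(k):
        for j in range(k):
            nij = np.sum((a == i) & (b == j))
            if nij > 0:
                ni, nj = np.sum(a == i), np.sum(b == j)
                # log base k^2, written as log(x) / log(k*k)
                total += nij * np.log(nij * n / (ni * nj)) / np.log(k * k)
    return 2.0 * total / n
```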

IV. EMPIRICAL ANALYSIS

To analyze the performance of the CSTE model proposed in this study, we conducted experiments on a large customer credit scoring dataset and compared the CSTE model with the following five strategies:


1) the traditional customer credit scoring model Subagging [7] utilizing all data (Subagging), which trains N classifiers by uniting the data in the source domains with those in the target domain, without distinction;
2) traditional Subagging utilizing target domain data only (Subagg-OT), which trains N classifiers using only the data in the target domain;
3) the transfer feature selection method TFS [11];
4) the instance-based transfer learning strategy TrBagg [12];
5) the instance-based transfer learning strategy TrAdaBoost [13].


A. Dataset and data processing

The data are from the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2009 Data Mining Competition (http://sede.neurotech.com.br:443/PAKDD2009), and the task is a credit scoring problem. The dataset contains three subsets: a modeling dataset with 50,000 samples, a leaderboard dataset with 10,000 samples, and a prediction dataset with 10,000 samples. However, only the samples in the modeling dataset have class labels; therefore, we select just the modeling dataset for the experiments.

There are 31 features in the data. After preliminary data cleaning, 49,904 samples and 20 features remain in the modeling dataset (see Table I). Note that x6 is the modified residential phone area code. Its different values imply that the customers come from different regions, and these customers may be subject to different distributions. Therefore, we divide the 49,904 customer samples into 67 subsets according to the values of x6. We regard the subset x6=5 (which contains 2,471 samples: 341 customers with bad credit and 2,130 with good credit) as the target domain. Further, most of the remaining 66 subsets contain very few samples (one or two), which makes transfer meaningless, so we select as source domains only the 10 subsets that contain at least 10 samples (the first three columns of Table II list these 10 source domains).

To determine whether the distributions of the 10 source domains and the target domain differ, we introduce the multivariate two-sample testing procedure proposed in [17]. In this study, we set the significance level to 0.05; the last two columns of Table II show the test results, where each row tests whether there is a difference between the distributions of the target domain and the corresponding source domain. It can be seen that there are significant differences between the distribution of the target domain and those of all 10 source domains.
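The domain construction just described amounts to a simple group-by on x6. The sketch below assumes the cleaned modeling data are available as a CSV file (the file name is hypothetical) with the 20 features of Table I as columns.

```python
import pandas as pd

# Hypothetical file holding the cleaned 49,904-sample modeling dataset,
# with columns x1..x20 as in Table I plus the binary class label.
data = pd.read_csv("pakdd2009_modeling_clean.csv")

# Target domain T: the subset with residential phone area code x6 = 5.
target = data[data["x6"] == 5]

# Source domains: every other x6 subset with at least 10 samples.
counts = data["x6"].value_counts()
source_codes = [c for c in counts.index if c != 5 and counts[c] >= 10]
sources = {c: data[data["x6"] == c] for c in source_codes}

print(f"target: {len(target)} samples, {len(sources)} source domains")
```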

TABLE I. DESCRIPTION OF THE FEATURES IN THE PAKDD2009 DATASET

Final feature   Feature name
x1              Id_Shop
x2              Sex
x3              Marital_Status
x4              Age
x5              Flag_Residencial_Phone
x6              Area_Code_Residencial_Phone
x7              Payment_Day
x8              Shop_Rank
x9              Residence_Type
x10             Months_In_Residence
x11             Flag_Mothers_Name
x12             Flag_Fathers_Name
x13             Flag_Residence_Town=Working_Town
x14             Flag_Residence_State=Working_State
x15             Time in the current job in months
x16             Profession_Code
x17             Mate_Income
x18             Flag_Residencial_Address=Postal_Address
x19             Personal_Net_Income
x20             Quant_Additional_Cards_In_The_Application

TABLE II. DESCRIPTION OF SOURCE DOMAINS AND RESULTS OF THE MULTIVARIATE TWO-SAMPLE TESTING

Number   Source domain   Number of samples   |t̂|     |t̂0.95|
1        x6=23             994               96.83    1.898
2        x6=24             120               73.66    1.976
3        x6=27              27               78.68    1.865
4        x6=31           34992               78.36    1.892
5        x6=32              33               94.54    1.675
6        x6=38              12               70.65    1.728
7        x6=42              14               83.67    1.934
8        x6=49              48               92.32    1.847
9        x6=50           11071               76.43    1.856
10       x6=56              10               65.43    2.034

B. Experimental setup

Before training the models, we partition the target domain T into a target training set T1 and a target test set T2. In this study, we adopted random sampling without replacement to select 30% of the patterns in T to construct T2; the remaining patterns compose T1.

We used the support vector machine (SVM) as the basic classification algorithm. Through experimental comparison, we found that the classifier based on the radial basis kernel (RBK) obtained the best performance, so we designated it as the kernel function; the kernel parameter of the RBK and the regularization parameter were set to their default values. For the six models in the experiment, the number of base classifiers is an important parameter. Note that all six models belong to the SCE family, and Tsymbal et al. [18] found that SCE models usually achieve their best performance when the number of base classifiers in the ensemble equals 50. Therefore, we set the size of the base classifier pool to 50. The other parameters of the TFS, TrBagg, and TrAdaBoost models were set, through repeated experiments, to the values that made each model perform best.

Meanwhile, apart from the CSTE, Subagging, and Subagg-OT models, the other three models do not consider the impact of class imbalance on performance. To ensure a fair comparison, we balanced the class distribution of the data by the random over-sampling technique before training their base classifiers. All experiments were performed on the MATLAB 6.5 platform with a dual-processor 2.1 GHz Pentium 4 Windows computer. For each model, the final classification result was the average of the results over 10 iterations of the experiment.
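The training setup can be mirrored along the following lines. This is a sketch under stated assumptions: scikit-learn's SVC replaces the original MATLAB SVM, and the random over-sampling helper is our own minimal version.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def oversample_minority(X, y, rng):
    """Random over-sampling: replicate minority samples until classes balance."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, counts.max() - len(idx), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def train_pool(X, y, n_classifiers=50, seed=0):
    """Pool of 50 RBF-kernel SVMs, each trained on a balanced bootstrap subset."""
    rng = np.random.default_rng(seed)
    pool = []
    for _ in range(n_classifiers):
        idx = rng.choice(len(X), size=len(X), replace=True)  # with replacement
        Xb, yb = oversample_minority(X[idx], y[idx], rng)
        # Default RBF and regularization parameters, mirroring the paper's choice.
        pool.append(SVC(kernel="rbf", probability=True).fit(Xb, yb))
    return pool

# 30% of the target domain is held out (without replacement) as the test set:
# X_T1, X_T2, y_T1, y_T2 = train_test_split(X_T, y_T, test_size=0.3)
```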




C. Evaluation criterion

In this study, the ability of the models to discriminate between "good" and "bad" applicants is evaluated by receiver operating characteristic (ROC) curve analysis [19]. ROC curves can be used to compare the discrimination performance of two or more classifiers: the closer a curve follows the left and top borders of the ROC space, the more accurate the model. However, it is sometimes difficult to compare the ROC curves of different models directly, so the area under the ROC curve (AUC) is more convenient and widely used.
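As a sketch, the AUC criterion can be computed from the pool's averaged probability outputs; simple averaging is used here in place of the paper's weighted voting.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_auc(pool, X_test, y_test, bad_label=1):
    """AUC of the ensemble: average the base classifiers' bad-credit probabilities."""
    scores = np.mean(
        [clf.predict_proba(X_test)[:, list(clf.classes_).index(bad_label)]
         for clf in pool],
        axis=0,
    )
    return roc_auc_score(y_test == bad_label, scores)
```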



D. Performance comparison with other models

Table III shows the classification results of the six models on the PAKDD2009 dataset. Based on this table, the following conclusions can be drawn:

1) The AUC value of the proposed CSTE model is the largest, followed by TrBagg, TrAdaBoost, Subagging, TFS, and Subagg-OT; thus the performance of the CSTE model is the best.
2) Among the six models, the standard deviation of the CSTE model is the smallest, i.e., it has the best stability.
3) The credit scoring performance of the Subagg-OT model is poorer than that of the four transfer learning strategies CSTE, TrBagg, TrAdaBoost, and TFS, which demonstrates the importance of transferring the source domains to the target domain.
4) The performance of the Subagging model is poorer than that of the three transfer learning strategies CSTE, TrBagg, and TrAdaBoost, and superior only to the transfer learning strategy TFS. This result implies that most of the existing transfer learning strategies can eliminate some noise data in the source domains through their different internal mechanisms, and thus achieve better customer credit scoring performance than the traditional multiple classifiers ensemble model Subagging.

TABLE III. CUSTOMER CREDIT SCORING PERFORMANCE OF SIX MODELS ON THE PAKDD2009 DATASET

Criteria   CSTE     Subagging   Subagg-OT   TFS      TrBagg   TrAdaBoost
AUC        0.6689   0.6548      0.6456      0.6482   0.6593   0.6581
s.d.       0.0345   0.0462      0.0732      0.0832   0.0492   0.0583

V. CONCLUSIONS

Customer credit scoring is an important concern for numerous domestic and global industries. This study combines transfer learning with multiple classifier ensemble and proposes the CSTE model for customer credit scoring. Unlike the traditional research paradigm in customer credit scoring, which utilizes only the customer data in the target domain, CSTE uses the data in the target domain as well as the data in related source domains to assist in modeling. The experimental results on a customer credit scoring dataset show that CSTE outperforms not only two traditional credit scoring strategies but also three existing transfer learning strategies.


ACKNOWLEDGMENT

This research is partly supported by the Natural Science Foundation of China under Grant Nos. 71101100, 71301160 and 70731160635; the New Teachers' Fund for Doctor Stations, Ministry of Education, under Grant No. 20110181120047; the Excellent Youth Fund of Sichuan University under Grant No. 2013SCU04A08; the Frontier and Cross-innovation Foundation of Sichuan University under Grant No. skqy201352; and the Research Start-up Project of Sichuan University under Grant No. 2014SCU11053.

REFERENCES

[1] S. Finlay, The Management of Consumer Credit: Theory and Practice. Basingstoke: Palgrave Macmillan, 2008.
[2] K. Falangis, "The use of MSD model in credit scoring," Oper. Res., 2007, 7(3), pp. 481-503.
[3] J. Xiao, et al., "Dynamic classifier ensemble model for customer classification with imbalanced class distribution," Expert Syst. Appl., 2012, 39(3), pp. 3668-3675.
[4] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE T. Knowl. Data En., 2009, 21(9), pp. 1263-1284.
[5] A. Marqués, V. García, and J. Sánchez, "On the suitability of resampling techniques for the class imbalance problem in credit scoring," J. Oper. Res. Soc., 2013, 64(7), pp. 1060-1070.
[6] P. Zou, Y. Y. Hao, and Y. J. Li, "Customer value segmentation based on cost-sensitive learning support vector machine," Int. J. Serv. Tech. Manage., 2010, 14(1), pp. 126-137.
[7] G. Paleologo, A. Elisseeff, and G. Antonini, "Subagging for credit scoring models," Eur. J. Oper. Res., 2010, 201(2), pp. 490-499.
[8] V. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons, 1998.
[9] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE T. Knowl. Data En., 2010, 22(10), pp. 1345-1359.
[10] J. Kittler, et al., "On combining classifiers," IEEE T. Pattern Anal., 1998, 20(3), pp. 226-239.
[11] W. Bi, Y. Shi, and Z. Lan, "Transferred feature selection," IEEE International Conference on Data Mining Workshops, 2009.
[12] T. Kamishima, M. Hamasaki, and S. Akaho, "TrBagg: a simple transfer learning method and its application to personalization in collaborative tagging," IEEE International Conference on Data Mining, 2009, pp. 219-228.
[13] W. Dai, et al., "Boosting for transfer learning," The 24th International Conference on Machine Learning, 2009, pp. 193-200.
[14] J. Xiao, et al., "Feature-selection-based dynamic transfer ensemble model for customer churn prediction," Knowl. Inf. Syst., DOI: 10.1007/s10115-013-0722-y, forthcoming.
[15] C. Wu, D. Wu, and N. Jiang, "A kind of client market segmentation based on the dynamic clustering algorithm," Comput. Syst. Appl., 2008, 2, pp. 21-24.
[16] W. Tang and Z. Zhou, "Bagging-based selective clusterer ensemble," J. Soft., 2005, 16(4), pp. 496-502.
[17] J. H. Friedman, "On multivariate goodness-of-fit and two-sample testing," Proceedings of PhyStat, 2003, pp. 1-3.
[18] A. Tsymbal, S. Puuronen, and D. W. Patterson, "Ensemble feature selection with the simple Bayesian classification," Inform. Fusion, 2003, 4(2), pp. 87-100.
[19] F. L. Chen and F. C. Li, "Combination of feature selection approaches with SVM in credit scoring," Expert Syst. Appl., 2010, 37(7), pp. 4902-4909.