Supportive Instances for Regularized Multiple ... - Semantic Scholar

2 downloads 0 Views 159KB Size Report
Dec 4, 2008 - The charge-off policy may vary among authorized institutions. Delinquent accounts are accounts that haven't paid the minimum balances for ...
Supportive Instances for Regularized Multiple Criteria Linear Programming Classification

Volume 14, Number 4 December 2008, pp. 249-263

Peng Zhang, Yingjie Tian, Zhiwang Zhang, Xingsen Li Research Center on Fictitious Economy & Data Sci., Chin. Academy of Sci., Beijing,China ([email protected]) ({tyj, zzw05, lixs}@ gucas.ac.cn) Yong Shi Research Center on Fictitious Economy & Data Sci., Chin. Academy of Sci., Beijing,China College of Inform. Sci. & Tech., Univ. of Nebraska Omaha, Omaha, NE 68182, USA ([email protected])

Although classification models based on multiple criteria progarmming receive many attentions in recent years, their essence of taking every training instances into considering makes them too senstive to noisy or imbalanced samples. To overcome this shortage, in this paper, we propose a clusering based algorithm to find the representative (also called supportive) instances for a most recent multiple criteria programming classification model, the Rregularized Multiple Criteria Linear Programming (RMCLP) model, just as Support Vector Machine (SVM) finding the support vectors. Our new algorithm selects instances which locate around the clustering center as the supportive instances, and then bulid RMCLP model using only these supportive instancs. Experimental results on synthetic and real-life datasets show that our method not only improves the performance of RMCLP, but also reduces the number of training instances, which can significantly save costs in business world because labeling training samples is usually expensive and sometimes impossible. Keywords: MCLP, RMCLP, Noisy Dataset, Imbalance Dataset, Sample Selection.

1. Introduction Recent years have witnessed a large body of research work on mining useful knowledge by multiple criteria mathematical programming method where various of classification models have been proposed and received great success in business intelligence [1]. All these models are created mainly by adapting the objective functions of the original multiple criteria linear programming (MCLP) model [2] to improve MCLP’s accuracy and stability, and lots of research works have exhibited their powerful ability to classfiy different kinds of real-life data, such as credit card data, network intrusion data, VIP Email data, and bioinformatic data. Among all these models, the most recent Reguarlized Multiple Criteria Linear Programming (RMCLP) model [3] which created by adding two regularized objective functions

REGULARIZED MCLP CLASSIFICATION

250

onto the original MCLP model, have been theoretically demonstrated that it is stable in finding the global optimal solution. Following empirical studies on both synthetic and real-life UCI datasets also told us that RMCLP is superior to other multiple criteria programming models, which makes RMCLP as the most promisingydel in multiple critier programming classification field.

(a)

(c)

(b)

(d)

Figure 1 (a) the original RMCLP model built on an ideal training sample; (b) when adding two noisy instances in the left side, the classification boundary shifts towards the left side; (c) when the training sample is imbalanced, the boundary also shifts significantly; (d) if we select representative training instances which locate around the distribution centers (inside the circle), the classifcaition boundary becomes satisfactory.

Although RMCLP performs excellently in classifying lots of banchmark datasets, its shortage is also obvious. By taking account of every training instances into consideration, RMCLP is senstive to noisy and imblanced training samples. In other words, the classification boundary may shift significantly even if there is merely a slight change of training samples. This difficluty can be described in Figure 1, assume there is a two groups classification problem, the 1st group is denoted by “.” and the second group is denoted by “ ”. We can observe that it is a linear-sperable dataset and the classification boundary is denoted by a line ”/”. Figure 1(a) shows that on an ideal training sample, RMCLP successfully classify all the instances. In Figure 1(b), when we add some noisy instances into the 1st group, the classification boundary shifts towards the 1st group, making more instances in the 1st group misclassifed. In Figure 1(c), we can observe that when we add instances into the 2nd group to make the number of instances in two groups imbalanced, the classification boundary also changes significantly, causing a great number of misclassification. In Figure 1(d), we can see that if we choose some representative instances (also called supportive instances) for RMCLP, which locate inside the blue circle, then although more noisy and imbalanced instances are added into the training sample, the classification boundary always keeps unchanged and will have a good ability to do prediction. That is to say, building RMCLP model only on supportive instances can improve its accuracy and stability.

ZHANG, TIAN, ZHANG, LI, SHI

251

According to the above oberservation, in this paper, we propose a clustering based sample selection method, which chooses the instances in the clustering center as the supportive samples (just as SVM [4] chooses the support vectors to draw a classification boundary). Experimental results on synthetic and real-life datasets show that our new method not only can significantly improve the prediction accuracy, but also can dramatically reduce the number of training instances. The rest of this paper is organized as follows: In section two, we will give a survey of classfication models based on multiple criteria programming. In section three, we will introduce the original MCLP model, and the two groups RMCLP model and multiple groups RMCLP model. In section four, we will introduce our clustering based sample selection method for RMCLP in detail. In section five, we will study the performance of our new algorithm on both synthetic and real-life datasets on two groups and multiple groups RMCLP model. Finally, we summarize this paper with conclusions in section six.

2. Literature Survey The fields of data mining and optimization are increasingly intertwined [5]. Some researchers who traditionally work on optimization are now gradually moving to data mining field. They mainly focus on two topics: how to develop novel large-scale optimization algorithms to solve the existing data mining models or how to develop novel data mining models using the existing optimization algorithms, such as quadratic, linear, second-order cone, semi-definite, and semi-infinite programs. Shi and his group belong to the latter. Since 1998, Shi and his group have proposed a series of novel data mining models using multiple criteria mathematical programming. During the last few years, the multiple criteria programming models have received great success in both theoretical and applied fields. In 2001, Shi et al. [2,6,7,8] proposed a Multiple Criteria Linear Programming (MCLP) to a real-life credit card classification. In 2004, He et al. [9,10] proposed a Fuzzy Multiple Criteria Linear Programming (FMCLP) model and a Multiple Criteria Nonlinear Programming (MCNP) model for credit card analysis. In 2004, Kou et al. [11,12] proposed a Mutiple Criteria Qudratic Programming model by adapting the linear objective functions of MCLP to quadratic ones, and in the following year, Kou [13,14] also proposed a Multi-group MCLP model to solve the multiple groups classification problem of MCLP. Since the promising results of MCQP, Kou [15] stepped forward and proposed a kernel based MCQP method, called MCVQP method which extends MCQP to nonlinear classification problem. Followed by that, Zhang [16] et al. proposed a kernelized MCLP by adopting the inner product form of SVM to MCLP, which is a popular method to extend a linear classifier to non-linear one. In 2007, Zhang et al. [17, 18] proposed a MQLC model for VIP E-Mail Analysis. Based on rough set thoery, Zhang et al. [19] proposed a rough set based MCLP model and report its efficiency on several UCI benchmark datasets. Recently, Shi et al. [3, 20] proposed a RMCLP model and compare its accuracy on both synthetic and the UCI benchmark datasets with the former multiple criteria programming methods, such as MCLP and MCQP, and the well known SVM model. The results show that RMCLP is a promising method for classification, it is superior

REGULARIZED MCLP CLASSIFICATION

252

to other models in many different kinds of datasets. The family of multiple criteria programming models [21,22] is described in Figure 2.

Figure 2 The family of multiple criteria programming models for classfication. All of the models , such as fuzzy MCLP, rough MCLP, MCQP, RMCLP, are proposed to enhance MCLP’s performance. Besides, to solve multiple groups classification problem, several multiple groups classificaiton models are listed in the right side; to solve the nonlinear classification problem, some kernel based models are listed in the left side.

3. Regularized Multiple Criteria Linear Programming (RMCLP) Model In this section, we will introduce the two groups and multiple groups RMCLP model. Since RMCLP model is originated from MCLP model, we will introduce the MCLP model first. 3.1 Multiple Criteria Linear Programming (MCLP) model Assume we have a training set A = { A1 , A2 ,..., An } which has n instances, each instance has r attributes, we define a boundary vector b to distinguish the first group G1 and the second group G2 . Then we can establish the following linear non-equality functions [23] :

A i x < b, ∀Ai ∈ G1 , A i x ≥ b, ∀Ai ∈ G2 .

(1)

To formulate the criteria functions and complete constraints for data separation, some other variables need to be introduced. As depicted in Figure 3, we define external measurement α i to be the overlapping distance between boundary and a training instance, say Ai . When a record Ai ∈ G1 has been wrongly classified into group G2 or a record Ai ∈ G2 has been wrongly classified into group G1 , α i will equal to | Ai x − b | . We also define internal measurement β i to be the distance of a

ZHANG, TIAN, ZHANG, LI, SHI

253

record Ai from its adjusted boundary b* . When Ai is correctly classified, distance βi will equal to | Ai x − b* | , where b* = b + α i or b* = b − α i . To separate the two groups as far as possible, we design two objective functions which minimize the overlapping distances and maximize the distances between classes. Suppose || α || pp denotes for the relationship of all overlapping α i while

|| β ||qq denotes for the aggregation of all distances βi . The final correctly classified instances is depended on simultaneously minimize || α || pp and maximize || β ||qq . Thus, a generalized bi-criteria programming model can be formulated as [24]: Minimize || α || pp and Maximize

|| β ||qq

(2)

Subject to:

Ai x − α i + βi − b = 0 , ∀Ai ∈ G1

Ai x + α i − β i − b = 0 , ∀Ai ∈ G2 where

Ai is given, X and b are unrestricted, and α = (α 1 ,..., α n ) T , β = (β1 ,..., β n )T ;

α i , β i ≥ 0, i = 1,..., n . When choosing linear formulation for (2), we get the original multiple criteria linear programming (MCLP) model as follows: n

(MCLP)

Minimize wα

∑α i =1

n i

− wβ ∑ β i

(3)

i =1

Subject to:

Ai x − α i + β i − b = 0, ∀Ai ∈ G1 Ai x + α i − β i − b = 0, ∀Ai ∈ G2 where Ai is given, x and b are unrestricted, α i and βi ≥ 0 . 3.2 Two Groups Regularized Multiple Criteria Linear Programming (RMCLP) Lots of empirical studies have shown that MCLP is a powerful tool for classificaiton. However, there is no theoretical work on whether MCLP always can find an optimal solution under different kinds of training samples. To go over this difficluty, recently, Shi et.al [3] proposed a RMCLP model by adding two regularized items 1 xT Hx and 2 1 T α Q α on MCLP as follows: 2

Minimize

1 T 1 x Hx + α T Qα + d T α − cT β , 2 2

(4)

REGULARIZED MCLP CLASSIFICATION

254 Subjet to:

Ai x − α i + β i = b, ∀Ai ∈ G1; Ai x + α i − β i = b, ∀Ai ∈ G2 ;

α i , βi ≥ 0. where H ∈ R r *r , Q ∈ R n*n are symmetric positive definite matrices. d T , cT ∈ R n .The RMCLP model is a convex quadratic program. Theoretically studies [3] have shown that RMCLP can always find a global optimal solution.

Figure 3 Two groups MCLP classification, where instances in G1 are denoted by black dot while in G2 are denoted by stars. If the instances are perfectly classified, then classification boundary is b, otherwise, two adjusted boundaries b − α * and b + α * are used to separate the misclassified instances.

3.3 Multiple Groups Regularized Multiple Criteria Linear Programming (RMCLP) model Besides two groups classification problem, a recent work [25] also introduced a multiple groups RMCLP model. As far as three groups classification problem be considered, we first find a projection direction x and a group of hyper planes (b1, b2), to an arbitrary training instance Ai, if Ai x < b1 , then Ai ∈ G1 ; if b1 ≤ Ai x < b2 , then Ai ∈ G2 ; and if Ai x ≥ b2 , then Ai ∈ G3 . Extending this method to n group classification, we can also find a direction x and n-1 dimension vector b = [b1 , b2 ,..., bn −1 ] ∈ R n −1 , to make sure that to any training instance Ai:

Ai x < b1 , ∀Ai ∈ G1 ; b j −1 ≤ Ai x < b j , ∀Ai ∈ Gi ,1 < i < n;

(5)

Ai x ≥ bn −1 , ∀Ai ∈ Gn ; We first define ci = bi −1 + bi as the midline in group i ( 1 < i < n ). Then, to the 2

ZHANG, TIAN, ZHANG, LI, SHI

255

misclassified records, we define

α i+

as the distance from

ci to Ai x , which equals

( ci − Ai x ), when misclassify a group i’s record into group j ( ji ) . Similarly, to the correct classified records, we define βi− when Ai is in the left side of ci , and we define β i+ when Ai is in the right side of + i

ci . When we have a n groups training sample with size m, we have

− i

α = {α , α } ∈ R m*2 , β = ( β i+ , β i− ) ∈ R m*2 , and we can build a multiple groups Regularized Multi-Criteria Linear Programming (SRMCLP) as follows: Minimize 1 xT Hx + 1 α T Qα + d T α + cT β 2 2 Subject to:

(6)

1 Ai x − α i− − βi− + β i+ = b1 , ∀Ai ∈ G1 ; 2 1 Ai x − α i− + α i+ − β i− + β i+ = (bi −1 + bi ), ∀Ai ∈ Gi ,1 < i < n; 2 + − + Ai x + α i − β i + β i = 2bn −1 , ∀Ai ∈ Gn ;

α i− , α i+ , β i− , βi+ ≥ 0

Since this multiple groups RMCLP model is mainly designed to solve the ordinal seperable dataset, we also call it Oridnal RMCLP model [25].

4. Clustering Based Sample Selection Method for RMCLP Figure 4 gives the whole precedure of the sample selection algorithm. The main idea of our algorithm is that it iteratively discards training instances in each group which are far away from the clustering center until the clustering center for each group is stable (with the given threshold ε ), then the remained instances will be taken as the supportive instances (just as the support vectors to SVM) and used to build a classifier. From Figure 4, We can observe that this algorithm is similar to the well known k-means algorithm. However, the main difference between them is that, our algorithm is based on supervised learning framework, while k-means is an unsupervised learning algorithm. In our algorithm, although the clustering centers shift in each iteration, each instance keeps a constant class label. But in k-means, the class label of each instance may change frequently. An important issue of k-means clustering is how to choose the initial points, if we choose a good initial point, we can get a global optimal solution; otherwise, we may only get a local optimal solution. On the contrast, our sample selection method can avoid this problem. It always lead to a global minimal solution. There are some important parameters in our algorithm. The first important parameter is ε , which determinates when the algorithm stops. The second parameter is the exclusion percentage s, which indicates how many instances that are far away from the clustering center should be discarded in each iteration. This parameter, in

REGULARIZED MCLP CLASSIFICATION

256

fact, determines the convergence speed. The larger value of s, the faster algorithm converges. To analyze the computation complexity of our new algorithm, we take an extremely bad situation into consideration. Assume there are n instances in the training sample, we assign the values s=1 and ε = 0 . Then, the algorithm will discard only one instances in each iteration. To the worst case, after n times iterations, the algorithm converges to the clustering center. In the ith iteration, it needs to calculate the (n-i) instances to get the clustering center, so we can roughly infer that the compuation complexity is about O(n2). --------------------------------------------------------------------------------------------

Input: training sample Tr, testing sample Ts, parameter ε , exclusion percentage s Output: selected sample Tr’ Begin 1. Set Tr’=Tr 2. While ( |PrevClusteringCenter-CurrClusteringCenter| < ε ) { 2.1. Calculate current clustering center; cent =

1 ∑ xi | Tr ' | i

2.2. For each instances i ∈ Tr do { 2.2.1 Calculate the Euclidean distance of the clustering center,

disi =

∑ | cent

i

r

2

− Trr |

r∈R

2.2.2 get s% of the instances which are farthest from the clustering center, denoted as the subset {P} 2.2.3 exclude Figure {p} 4from Clustering the training method sample to get, Tr’ the=supportive Tr\{P}. sample } } 3. Return the selected sample Tr’ End -----------------------------------------------------------------------------------------------

5. Experiments To investigate whether our new algorithm works, we use two synthetic datasets and a well known US bank’s real-life credit card dataset for testing. In our experiments, the RMCLP is implemented by Visual Fortran 6.5. 5.1 Experimental Results on Synthetic Datasets Two Groups Synthetic Dataset We now create a synthetic dataset and show how the effectiveness of our method on RMCLP. Assume there is a two groups classification problem, all of the samples follow the distribution x ~ N ( µ , ∑ ) , more specifically, G 1 ~ N (1, Σ ) , G 2 ~ N (5, Σ ) . There are 6 instances in each group, additionally, a noisy instance -100 is added in G1, and a noisy instance -200 is add in G2. We will see how these two noisy instances affect the boundary of RMCLP:

ZHANG, TIAN, ZHANG, LI, SHI

257

G1= {1.11, 1.20, 1.19, 0.82, 0.90, -50} G2 = {5.20, 5.22, 4.76, 4.99, -100}

(7)

Case 1: We directly build RMCLP model on this dataset, the optimal parameters of RMCLP are H=105, Q=1, d=105, c=100. The projection director x=-0.798034, and boundary b=7.33516. The objective value of each element in G1 is {-0.886, -0.958, 0.950, -0.654, -0.718, 39.902}, in G2 is {-4.150, -4.166, -3.799, -3.982, -4.086, 79.803}. We can figure out that the accuracy is 83% on G1 and 16% on G2. Case 2: Now we call our Algorithm first to take out the noisy instances -50 from G1 and -100 from G2, and then we build RMCLP model on the purified and balanced dataset and get the optimal parameters H=105, Q=1, d=105 , c=104. The x=1.98336 and b=5.50112. The objective value of G1 is {2.202, 2.380, 2.360, 1.626, 1.785} and of G1 is {10.313, 10.353, 9.441, 9.897, 10.155}. So the accuracies of the two groups dataset are both 100%. This means we’ve gotten a perfect training sample. Multiple Groups Synthetic Dataset To investigate the performance of multiple groups RMCLP, we create another synthetic dataset in Table 1. As we can see, there are three groups (G1, G2 and G3) in our synthetic dataset. The first group G1 has three instances, A1, A2, and A3. The second group G2 has two instances, A4 and A5. And the third group has three instances, A6, A7 and A8. Each instance has two attributes R1 and R2. We suppose the separation hyper-planes be b1=2 and b2=4, and the H and Q be the identity matrix. We also suppose d and c be the vectors with all elements equal to 1. If we directly use multiple groups RMCLP model on this dataset, then we can get the projection direction x = (1.20105, -1.1626) . And we can get the location of each instance on the projection direction x as follows (we also list the resluts in the 5th column in Table 1): G1: A1x=0.038454; A8x=3.845