Novel Design of Decision-Tree-Based Support Vector Machines Multi-class Classifier

Liaoying Zhao1, Xiaorun Li2, and Guangzhou Zhao2

1 Institute of Computer Application Technology, Hangzhou Dianzi University, Hangzhou 310018, China
2 College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
[email protected]

Abstract. Designing the hierarchical structure is a key issue for decision-tree-based (DTB) support vector machines multi-class classification. Inter-class separability is an important basis for designing the hierarchical structure. A new method based on vector projection is proposed to measure inter-class separability. Furthermore, two different DTB support vector multi-class classifiers are designed based on the inter-class separability: one uses the structure of DTB-balanced branches and the other uses the structure of DTB-one-against-all. Experimental results on three large-scale data sets indicate that the proposed method speeds up the decision-tree-based support vector machines multi-class classifiers and yields higher precision.

Keywords: Pattern classification, Support vector machines, Vector projection, Inter-class separability.

1 Introduction

Support vector machines (SVMs), motivated by statistical learning theory, are a machine learning technique proposed by Vapnik and co-workers [1]. The main feature of SVMs is that they use structural risk minimization rather than empirical risk minimization. SVMs have been successful as high-performance classifiers in several domains, including pattern recognition [2, 3], fault diagnosis [4], and bioinformatics [5]. They have strong theoretical foundations and good generalization capability. The SVMs approach was originally developed for two-class, or binary, classification, whereas practical classification applications are commonly multi-class problems. Forming a multi-class classifier by combining several binary classifiers is the commonly used approach; methods such as one-against-all (OAA) [6], one-against-one (OAO) [7], and DAG (decision directed acyclic graph) support vector machines [8] are all based on binary classification. Decision-tree-based SVMs (DTBSVMs) [9-12], which combine SVMs and decision trees, are also a good way to solve multi-class problems. However, additional work is required to effectively design the hierarchical structure of the DTBSVMs.





The classification performances of DTBSVMs multi-class classifiers with different hierarchical structures differ a lot. The inter-class separability is an important basis for designing the hierarchical structure. In this paper, a new method based on vector projection is proposed to measure inter-class separability, and two ways are presented to design the hierarchical structure of the multi-class classifier based on the inter-class separability. This paper is organized as follows. In Section 2, the structure of decision-tree-based SVMs is briefly described; in Section 3, the separability measure is defined based on vector projection. Two algorithms for designing DTBSVMs are given in Section 4, and the simulation experiments and results are given in Section 5.

2 The Structure of the Decision-Tree-Based SVMs Classifier

The DTBSVMs classifier decomposes the C-class classification problem into C−1 sub-problems, each separating a pair of micro-classes. Two structures of the DTBSVMs classifier for a 4-class classification problem are shown in Fig. 1. Fig. 1(a) is the partial binary tree structure, also called DTB-one-against-all (DTB-OAA); it represents a simplification of the OAA strategy obtained through its implementation in a hierarchical context. Fig. 1(b) is the DTB-balanced-branches (DTB-BB) structure. The DTBSVMs classifiers discussed in papers [9], [10] and [11] are all based on the DTB-OAA strategy, while in [12] a DTB-BB strategy is described. In this paper, we investigate a new design method for the two different DTB hierarchies.

[Figure: two binary-tree diagrams with internal nodes SVM1-SVM3 and leaf classes w1-w4]

Fig. 1. Structures of the DTBSVMs classifier: (a) the DTB-OAA structure; (b) the DTB-BB structure

The distance between the separating hyperplane and the closest data points of the training set is called the margin. The following lemma [13] gives the relation between the margin and the generalization error of the classifier.

Lemma 1. Suppose we are able to classify an m sample of labeled examples using a perceptron decision tree, and suppose that the tree obtained contains k decision nodes with margin $\gamma_i$ at node $i$, $i = 1, 2, \ldots, k$. Then we can bound the generalization error with probability greater than $1 - \delta$ to be less than

$$\frac{130 R^2}{m}\left[ D' \log(4em)\log(4m) + \log\frac{\binom{2k}{k}(4m)^{k+1}}{(k+1)\delta} \right] \qquad (1)$$

where $D' = \sum_{i=1}^{k} \frac{1}{\gamma_i^2}$, $\delta > 0$, and R is the radius of a sphere containing the support of the unknown (but fixed) distribution P.
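To make the role of each quantity in (1) concrete, the following short Python sketch evaluates the bound as reconstructed above for hypothetical values of m, k, R, δ and the per-node margins γ_i. All numbers here are purely illustrative and are not taken from the paper.

```python
import math

def ptd_generalization_bound(margins, m, R, delta):
    """Evaluate the perceptron-decision-tree bound (1) of Lemma 1.

    margins: per-node margins gamma_i, i = 1..k
    m:       number of training samples
    R:       radius of a sphere containing the support of the distribution
    delta:   confidence parameter (bound holds with probability > 1 - delta)
    """
    k = len(margins)
    d_prime = sum(1.0 / g ** 2 for g in margins)  # D' = sum_i 1 / gamma_i^2
    log_term = math.log(math.comb(2 * k, k) * (4 * m) ** (k + 1) / ((k + 1) * delta))
    return (130.0 * R ** 2 / m) * (d_prime * math.log(4 * math.e * m) * math.log(4 * m) + log_term)

# Illustrative values only: 3 internal nodes, 10 000 samples.
print(ptd_generalization_bound(margins=[0.5, 0.8, 1.2], m=10_000, R=1.0, delta=0.05))
```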

According to Lemma 1, for a given set of training samples, the smaller the number of nodes, the smaller the generalization error of the classifier, and the larger the margin, the higher the generalization ability of the classifier. Thus, in order to obtain better generalization ability, the margin in the DTB is an important basis for designing the hierarchical structure. Different classes occupy different domains in the sample space. If the domains of two classes do not intersect, the margin is larger and the two classes are more separable. The margin is smaller if the domains of two classes intersect, and a larger ratio of intersected samples to the total number of samples of the two classes makes the classes harder to separate. The problem is therefore how to judge whether two classes intersect and how to estimate the separability between two classes.

3 The Inter-class Separability Measure

This section mainly discusses how to measure the inter-class separability between two classes. To make the presentation comprehensible, we first discuss the separability measure in linear space and then generalize it to the nonlinear feature space.

3.1 The Separability Measure in Linear Space

First we give some definitions.

Definition 1 (sample center $m_i$). Consider the set of samples $X_i = \{x_1, x_2, \ldots, x_n\}$; the sample center of class-i is defined by

$$m_i = \frac{1}{n}\sum_{j=1}^{n} x_j \qquad (2)$$

Definition 2 (feature direction). Define the direction of vector $\overrightarrow{m_1 m_2}$ as the feature direction of pattern-1, and the direction of vector $\overrightarrow{m_2 m_1}$ as the feature direction of pattern-2.


Definition 3 (feature distance). Let $x_i \in X_1 = \{x_1, x_2, \ldots, x_n\}$, let $x_i^o$ be the projection of $x_i$ onto the feature direction of pattern-1, and let $m_1$ be the sample center of $X_1$. The feature distance of $x_i$ is defined as

$$\left\| \overrightarrow{m_1 x_i^o} \right\|_2 = \left\| m_1 - x_i^o \right\|_2 \qquad (3)$$
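As a concrete illustration, the following Python sketch (our own, assuming NumPy; the function name is not from the paper) computes the projection $x_i^o$ of a sample onto the feature direction of its class and the resulting feature distance of Definition 3.

```python
import numpy as np

def feature_distance(x, m1, m2):
    """Feature distance of Definition 3 for a sample x of class 1.

    m1, m2: sample centers of class 1 and class 2 (Definition 1).
    The feature direction of pattern-1 is the direction of m1 -> m2
    (Definition 2); x is projected onto that direction and the distance
    from m1 to the projected point is returned.
    """
    direction = (m2 - m1) / np.linalg.norm(m2 - m1)   # unit feature direction
    t = np.dot(x - m1, direction)                     # signed projection length
    x_proj = m1 + t * direction                       # projected point x^o
    return np.linalg.norm(m1 - x_proj)                # ||m1 - x^o||_2, equals |t|

# Tiny illustrative example with made-up 2-D points.
m1 = np.array([0.0, 0.0])
m2 = np.array([4.0, 0.0])
print(feature_distance(np.array([1.0, 2.0]), m1, m2))  # -> 1.0
```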

It is easy to prove the following theorem by contradiction.

Theorem 1. Let $d = \|m_1 - m_2\|_2$ be the distance between the sample centers of the data sets $X_1 = \{x_1, x_2, \ldots, x_{l_1}\}$ and $X_2 = \{y_1, y_2, \ldots, y_{l_2}\}$, and compute the feature distances of $x_i$ as $\|\overrightarrow{m_1 x_i^o}\|_2$ and of $y_j$ as $\|\overrightarrow{m_2 y_j^o}\|_2$ respectively. Let

$$r_1 = \max_{x_i \in X_1} \left\| \overrightarrow{m_1 x_i^o} \right\|_2 \qquad (4)$$

$$r_2 = \max_{y_j \in X_2} \left\| \overrightarrow{m_2 y_j^o} \right\|_2 \qquad (5)$$

Then the data domains of $X_1$ and $X_2$ do not intersect if $r_1 + r_2 < d$; conversely, if the data domains of $X_1$ and $X_2$ intersect, then necessarily $r_1 + r_2 \geq d$.

According to Theorem 1, the inter-class separability measure can be defined on the principle that the smaller the measure value, the larger the margin.

Definition 4. If $r_1 + r_2 < d$, the inter-class separability is defined as

$$se_{12} = se_{21} = -d \qquad (6)$$

If $r_1 + r_2 \geq d$, let $tr_1$ be the number of data points in $X_1$ that satisfy $d - r_2 \leq \|\overrightarrow{m_1 x_i^o}\|_2 \leq r_1$, and let $tr_2$ be the number of data points in $X_2$ that satisfy $d - r_1 \leq \|\overrightarrow{m_2 y_j^o}\|_2 \leq r_2$. The inter-class separability is then defined as

$$se_{12} = se_{21} = (tr_1 + tr_2)/(l_1 + l_2) \qquad (7)$$
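The following Python sketch (our own illustrative implementation, assuming NumPy) puts Definitions 1-4 and Theorem 1 together: it computes $d$, $r_1$, $r_2$ and the separability measure $se_{12}$ for two classes in the input (linear) space.

```python
import numpy as np

def linear_separability(X1, X2):
    """Inter-class separability se12 of Definition 4 in linear space.

    X1, X2: arrays of shape (l1, n) and (l2, n) holding the samples of the
    two classes. Returns -d if the class domains do not intersect
    (r1 + r2 < d), otherwise the ratio (tr1 + tr2) / (l1 + l2).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)          # sample centers (Def. 1)
    d = np.linalg.norm(m1 - m2)                        # center distance
    u = (m2 - m1) / d                                  # feature direction of pattern-1

    # Feature distances (Def. 3): projection lengths onto the feature directions.
    f1 = np.abs((X1 - m1) @ u)                         # distances of X1 samples from m1
    f2 = np.abs((X2 - m2) @ (-u))                      # distances of X2 samples from m2

    r1, r2 = f1.max(), f2.max()                        # Eqs. (4) and (5)
    if r1 + r2 < d:                                    # domains do not intersect
        return -d                                      # Eq. (6)
    tr1 = np.sum((f1 >= d - r2) & (f1 <= r1))          # X1 samples in the overlap band
    tr2 = np.sum((f2 >= d - r1) & (f2 <= r2))          # X2 samples in the overlap band
    return (tr1 + tr2) / (len(X1) + len(X2))           # Eq. (7)

# Illustrative call with random data (not from the paper's experiments).
rng = np.random.default_rng(0)
print(linear_separability(rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))))
```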


3.2 The Separability Measure in Nonlinear Space

The following lemma [14] gives the formula for the Euclidean distance between two vectors in the feature space.

Lemma 2. If two vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are projected into a high-dimensional feature space by a nonlinear map $\Phi(\cdot)$, the Euclidean distance between $x$ and $y$ in the corresponding feature space is given by

$$d_H(x, y) = \sqrt{k(x, x) - 2k(x, y) + k(y, y)} \qquad (8)$$

where $k(x, y) = \Phi(x) \cdot \Phi(y)$ is a kernel function.

According to Lemma 2, the center distance between class-i and class-j is

$$d_H = \left\| \Phi(m_i) - \Phi(m_j) \right\|_2 = \sqrt{k(m_i, m_i) - 2k(m_i, m_j) + k(m_j, m_j)} \qquad (9)$$

Lemma 3. Consider three vectors $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$ and $z = (z_1, \ldots, z_n)$, and suppose $\Phi(\cdot)$ is a feature map. Let $\overrightarrow{\Phi(x)\Phi(z^o)}$ be the projection of the vector $\overrightarrow{\Phi(x)\Phi(z)}$ onto the vector $\overrightarrow{\Phi(x)\Phi(y)}$. Then the feature distance is given by

$$\left\| \overrightarrow{\Phi(x)\Phi(z^o)} \right\|_2 = \frac{k(z, y) - k(z, x) - k(x, y) + k(x, x)}{\sqrt{k(x, x) - 2k(x, y) + k(y, y)}} \qquad (10)$$

The inter-class separability measure in nonlinear space can be defined in the same way as in linear space.

Definition 5. Suppose $d_H = \|\Phi(m_1) - \Phi(m_2)\|_2$ is the distance between the sample centers of the data sets $X_1 = \{x_1, x_2, \ldots, x_{l_1}\}$ and $X_2 = \{y_1, y_2, \ldots, y_{l_2}\}$ in the feature space, and compute the feature distances of $x_i$ as $\|\overrightarrow{\Phi(m_1)\Phi(x_i^o)}\|_2$ and of $y_j$ as $\|\overrightarrow{\Phi(m_2)\Phi(y_j^o)}\|_2$ respectively. Let

$$r_1 = \max_{x_i \in X_1} \left\| \overrightarrow{\Phi(m_1)\Phi(x_i^o)} \right\|_2 \qquad (11)$$

$$r_2 = \max_{y_j \in X_2} \left\| \overrightarrow{\Phi(m_2)\Phi(y_j^o)} \right\|_2 \qquad (12)$$

If $r_1 + r_2 < d_H$, the inter-class separability is defined as

$$se_{12} = se_{21} = -d_H \qquad (13)$$

If $r_1 + r_2 \geq d_H$, let $tr_1$ be the number of data points in $X_1$ that satisfy $d_H - r_2 \leq \|\overrightarrow{\Phi(m_1)\Phi(x_i^o)}\|_2 \leq r_1$, and let $tr_2$ be the number of data points in $X_2$ that satisfy $d_H - r_1 \leq \|\overrightarrow{\Phi(m_2)\Phi(y_j^o)}\|_2 \leq r_2$. The inter-class separability is then defined as

$$se_{12} = se_{21} = (tr_1 + tr_2)/(l_1 + l_2) \qquad (14)$$
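As in the linear case, the kernel-space measure can be sketched in a few lines of Python. The sketch below is our own illustration, assuming NumPy and the RBF kernel of Section 5; following Definition 5, the kernel is evaluated on the input-space class centers $m_1$ and $m_2$, and Lemmas 2 and 3 supply the distances.

```python
import numpy as np

def rbf(a, b, gamma):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / gamma), Section 5 parameterization."""
    return np.exp(-np.sum((a - b) ** 2) / gamma)

def kernel_separability(X1, X2, gamma=1.0):
    """Inter-class separability se12 of Definition 5 in the RBF feature space."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                 # input-space centers
    k = lambda a, b: rbf(a, b, gamma)

    # Lemma 2 / Eq. (9): center distance in feature space.
    d_H = np.sqrt(k(m1, m1) - 2 * k(m1, m2) + k(m2, m2))

    # Lemma 3 / Eq. (10): feature distance of each sample from its own center.
    def feat_dist(z, mx, my):
        return (k(z, my) - k(z, mx) - k(mx, my) + k(mx, mx)) / \
               np.sqrt(k(mx, mx) - 2 * k(mx, my) + k(my, my))

    f1 = np.array([feat_dist(x, m1, m2) for x in X1])
    f2 = np.array([feat_dist(y, m2, m1) for y in X2])

    r1, r2 = f1.max(), f2.max()                               # Eqs. (11), (12)
    if r1 + r2 < d_H:
        return -d_H                                           # Eq. (13)
    tr1 = np.sum((f1 >= d_H - r2) & (f1 <= r1))
    tr2 = np.sum((f2 >= d_H - r1) & (f2 <= r2))
    return (tr1 + tr2) / (len(X1) + len(X2))                  # Eq. (14)

# Illustrative call with random data (not the paper's data sets).
rng = np.random.default_rng(1)
print(kernel_separability(rng.normal(0, 1, (40, 3)), rng.normal(3, 1, (40, 3)), gamma=2.0))
```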

4 Constructing the DTBSVMs Classifier

In the classification phase of a DTBSVMs classifier, starting from the top of the decision tree, we calculate the value of the decision function for the input data x, and according to that value we determine which node to go to next. We iterate this procedure until we reach a leaf node and classify the input data into the class associated with that node. Under this classification procedure, not all decision functions need to be calculated, and the more data are misclassified at the upper nodes of the decision tree, the worse the overall classification performance becomes. Therefore, the classes that are easily separated should be separated at the upper nodes of the decision tree. Suppose $S_j$, $j = 1, 2, \ldots, c$, are the sets into which the $l$ training pairs are partitioned among the $c$ classes, and $y_i = j$ if $x_i \in S_j$. The new design procedures of DTB-OAA and DTB-BB are described in turn below; a code sketch of the traversal procedure follows this paragraph.
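A minimal sketch of this traversal, assuming each internal node stores a trained binary SVM with a `decision_function` method and references to its positive/negative children; the class and attribute names here are ours and purely illustrative.

```python
class DTBNode:
    """One node of the decision tree: either an internal SVM node or a leaf."""
    def __init__(self, svm=None, pos_child=None, neg_child=None, label=None):
        self.svm = svm                # trained binary SVM (internal nodes only)
        self.pos_child = pos_child    # subtree taken when the decision value >= 0
        self.neg_child = neg_child    # subtree taken when the decision value < 0
        self.label = label            # class label (leaf nodes only)

def dtb_classify(root, x):
    """Walk the tree from the root; one binary decision per level until a leaf."""
    node = root
    while node.label is None:
        value = node.svm.decision_function([x])[0]
        node = node.pos_child if value >= 0 else node.neg_child
    return node.label
```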

4.1 DTB-OAA

For the DTB-OAA classifier, one class is separated from the remaining classes at the hyperplane corresponding to each SVM of the decision tree. For convenience of implementation, an array L keeps the markers of the classes, ordered by their separability in descending order. The algorithm of DTB-OAA is proposed as follows (a code sketch of Steps 1-3 is given after the step list).

Step 1. Calculate the separability measures in feature space $se_{ij}$, $i, j = 1, 2, \ldots, c$, $i \neq j$, with $se_{ij} = se_{ji}$, and construct the symmetric matrix of separability measures

$$SE = \begin{bmatrix} 0 & se_{1,2} & \cdots & se_{1,c-1} & se_{1,c} \\ se_{1,2} & 0 & \cdots & se_{2,c-1} & se_{2,c} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ se_{c-1,1} & se_{c-1,2} & \cdots & 0 & se_{c-1,c} \\ se_{c,1} & se_{c,2} & \cdots & se_{c,c-1} & 0 \end{bmatrix}$$

Step 2. Define the array D_no = [1, 2, ..., c], let i = 1, and let SE(k, :) denote row k of SE. For j = 1 to c − 2, repeat the following procedure to extract the class that is most easily separated from the remaining classes:

1) Calculate $k_0 = \arg\min_{k = 1, \ldots, c+1-j} \mathrm{sum}(SE(k, :))$ and set L(i) = D_no(k_0). If the minimum is attained for several k, take the first one;

2) Set SE(k_0, :) = null, SE(:, k_0) = null, D_no(k_0) = null, and i = i + 1.

Step 3. Set L(c − 1) = D_no(1) and L(c) = D_no(2).

Step 4. Define a structure array node to keep the information of each node (including the support vectors, weights α, threshold b, etc.). For j = 1 to c − 1, repeat the following procedure to construct the classifier: take class L(j) as the positive samples of SVM-j and the union of the remaining classes L(j+1), ..., L(c) as the negative samples of SVM-j, then train SVM-j to obtain the structure information of node(j).
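A compact Python sketch of Steps 1-3 (our illustration, assuming NumPy and some pairwise separability function such as `kernel_separability` sketched earlier): it builds the symmetric matrix SE and produces the ordering array L by repeatedly removing the class with the smallest row sum.

```python
import numpy as np

def dtb_oaa_order(class_data, separability):
    """Return the class ordering L of Steps 1-3 of the DTB-OAA algorithm.

    class_data:   list of arrays, one per class (samples of that class)
    separability: function(Xi, Xj) -> se_ij, e.g. kernel_separability
    Class markers are 0-based here, unlike the 1-based notation of the paper.
    """
    c = len(class_data)
    SE = np.zeros((c, c))
    for i in range(c):                         # Step 1: symmetric matrix SE
        for j in range(i + 1, c):
            SE[i, j] = SE[j, i] = separability(class_data[i], class_data[j])

    d_no = list(range(c))                      # remaining class markers
    L = []
    for _ in range(c - 2):                     # Step 2: peel off the most separable class
        row_sums = SE[np.ix_(d_no, d_no)].sum(axis=1)
        k0 = int(np.argmin(row_sums))          # first index on ties, as in Step 2.1
        L.append(d_no.pop(k0))
    L.extend(d_no)                             # Step 3: append the last two classes
    return L
```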

4.2 DTB-BB

In the DTB-BB strategy, the tree is defined in such a way that each node (SVM) discriminates between two groups of classes with maximum margin. The algorithm that implements the DTB-BB strategy is described as follows (a code sketch of the construction follows the steps).

Steps 1-3. The same as for DTB-OAA, yielding the array L.

Step 4. Define a binary tree structure $\theta = \{node(i)\}$. The structure variable node(i) keeps the information of each node (including the support vectors, weights α, threshold b, etc.). Let node(i).I keep the markers of the classes included in node(i), and let the variable endnodes be the number of leaf nodes. Set i = 1, node(1).I = L, t = 1, j = 1, endnodes = 0.

Step 5. If length(node(i).I) = 1, go to Step 9.

Step 6. Let num = length(node(i).I) and divide the classes in node(i) into two groups in such a way that node(i).pl = j + 1, node(i).pr = j + 2, node(j+1).I = node(i).I(1, ..., [num/2]), node(j+2).I = node(i).I([num/2]+1, ..., num).

Step 7. Regard the classes in node(t).pl as the positive samples and the classes in node(t).pr as the negative samples of classifier-t, and train the SVM to obtain the information of node(t).

Step 8. Set i = i + 1, j = j + 1, t = t + 1, and go to Step 5.

Step 9. Set endnodes = endnodes + 1. If endnodes = c, stop; otherwise set i = i + 1 and go to Step 5.
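A minimal recursive sketch of the DTB-BB construction (our illustration; `train_binary_svm` stands in for whatever SVM trainer is used and is an assumption, not part of the paper). The ordered class list L is split in half at every node, which reproduces the balanced-branch grouping of Step 6; the `DTBNode` class from the traversal sketch in Section 4 is reused for the tree nodes.

```python
def build_dtb_bb(L, class_data, train_binary_svm):
    """Recursively build the DTB-BB tree for the ordered class list L.

    L:                class markers ordered by separability (Steps 1-3)
    class_data:       dict mapping class marker -> list/array of training samples
    train_binary_svm: function(pos_samples, neg_samples) -> trained binary SVM
    Returns a DTBNode (see the traversal sketch above) whose leaves carry
    single class labels.
    """
    if len(L) == 1:                                   # leaf node: one class left
        return DTBNode(label=L[0])
    half = len(L) // 2                                # balanced split, as in Step 6
    left, right = L[:half], L[half:]
    pos = [x for c in left for x in class_data[c]]    # samples of the left group
    neg = [x for c in right for x in class_data[c]]   # samples of the right group
    svm = train_binary_svm(pos, neg)                  # one binary SVM per internal node
    return DTBNode(svm=svm,
                   pos_child=build_dtb_bb(left, class_data, train_binary_svm),
                   neg_child=build_dtb_bb(right, class_data, train_binary_svm))
```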

5 Experimental Results

The experiments reported in this section were conducted to evaluate the performance of the two DTBSVMs multi-class classifiers proposed in this paper, in comparison with the OAO algorithm. The experiments focus on three issues: classification accuracy, execution efficiency, and the number of support vectors. The kernel function used in the experiments is the radial basis function kernel $k(x, y) = \exp(-\|x - y\|^2 / \gamma)$. Table 1 lists the main characteristics of the three large data sets used in our experiments. The data sets are from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). In these experiments, the SVM software used is SVM_V0.51 [15] with the radial basis kernel. Cross validation was conducted on the training set to determine the optimal parameter values to be used in the testing phase. Table 2 lists the optimal parameters for each data set, where C is the penalty coefficient of the SVMs and ones(1, n) denotes an all-ones vector of size 1 × n.

Table 1. Benchmark data sets used in the experiments

Data set | # training samples | # testing samples | # classes | # attributes
Letter   | 15 000             | 5 000             | 26        | 16
Satimage | 4 435              | 2 000             | 6         | 36
Shuttle  | 43 500             | 14 500            | 7         | 9
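For readers who want to reproduce the OAO baseline without the SVM_V0.51 software, the following hedged sketch uses scikit-learn (which the authors did not use). Note that scikit-learn parameterizes the RBF kernel as exp(−gamma_sk·||x − y||²), so gamma_sk = 1/γ for the paper's parameterization; the values shown are those listed in Table 2 for the Letter data set.

```python
from sklearn.svm import SVC

# Letter data set, Table 2: paper's gamma = 8, C = 64 for the OAO classifier.
paper_gamma, C = 8.0, 64.0
clf = SVC(kernel="rbf", gamma=1.0 / paper_gamma, C=C,
          decision_function_shape="ovo")   # one-against-one, as in the OAO baseline

# X_train, y_train, X_test, y_test would be loaded from the UCI Letter data set.
# clf.fit(X_train, y_train)
# accuracy = clf.score(X_test, y_test)
```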

Table 3 compares the results delivered by the alternative classification algorithms on the three large benchmark data sets, where Tc/s is the testing time in seconds, Tx/s is the training time in seconds, #SVs denotes the number of all support vectors (counted with repetition across the binary classifiers), u_SVs denotes the number of distinct support vectors, and CRR denotes the correct recognition rate. As Table 3 shows, the two DTBSVMs classifiers and the OAO classifier deliver essentially the same level of accuracy. OAO needs more support vectors in training, but the numbers of distinct support vectors are approximately equal. For Letter, the test time of OAO is much higher than that of DTB-OAA and DTB-BB. For Satimage, the test time of OAO is more than twice that of DTB-OAA and almost three times that of DTB-BB. For Shuttle, the test time of OAO is close to that of DTB-OAA and almost twice that of DTB-BB. Table 3 also shows that DTB-BB is more efficient than DTB-OAA in both accuracy and speed. This is consistent with the theoretical analysis in paper [12].

Table 2. The optimal parameters for each data set

Data set | γ   | C (OAO) | C (DTB-OAA)                          | C (DTB-BB)
Letter   | 8   | 64      | 64×ones(1, 25)                       | 64×ones(1, 25)
Satimage | 1.5 | 3048    | 3048×ones(1, 5)                      | 3048×ones(1, 5)
Shuttle  | 212 | 4096    | [4096, 1024, 1024, 1024, 1024, 1024] | [4096, 1024, 1024, 1024, 1024, 1024]

Table 3. Comparison of the results

Data set | Method  | Tx/s  | Tc/s | #SVs  | u_SVs | CRR %
Letter   | OAO     | 397   | 348  | 33204 | 7750  | 97.4
Letter   | DTB-OAA | 3916  | 58   | 7389  | 5087  | 96.4
Letter   | DTB-BB  | 2068  | 18   | 8489  | 5475  | 96.5
Satimage | OAO     | 60    | 35   | 3404  | 1510  | 91.8
Satimage | DTB-OAA | 43    | 17   | 2191  | 1428  | 91.2
Satimage | DTB-BB  | 53    | 13   | 2208  | 1529  | 92.0
Shuttle  | OAO     | 7182  | 26   | 1239  | 382   | 99.9
Shuttle  | DTB-OAA | 15452 | 28   | 1219  | 499   | 99.8
Shuttle  | DTB-BB  | 6807  | 14   | 703   | 417   | 99.9

6 Conclusion

In this paper we proposed a new design of SVMs for multi-class problems. A novel inter-class separability measure based on vector projection is given, and two algorithms are presented to design DTBSVMs multi-class classifiers based on this inter-class separability. Classification experiments on three large-scale data sets show that the two DTBSVMs classifiers deliver essentially the same level of accuracy as the OAO classifier while shortening the execution time. Based on the study presented in this paper, several issues deserve further study. The first is experimenting with the proposed algorithms on other benchmark data sets or on real data sets, such as remote sensing images, to verify their effectiveness. The second is a more reasonable design of the structure of the DTB-BB classifier. The third is the choice of the kernel function parameters.


Acknowledgments. This work was supported by the Natural Science Basic Research Plan in Zhejiang Province of China under Grant Y106085 to L.Y. Zhao.

References

1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
2. Ma, C., Randolph, M.A., Drish, J.: A Support Vector Machines-Based Rejection Technique for Speech Recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (2001) 381-384
3. Brunelli, R.: Identity Verification Through Finger Matching: A Comparison of Support Vector Machines and Gaussian Basis Functions Classifiers. Pattern Recognition Letters 27 (2006) 1905-1915
4. Ma, X.X., Huang, X.Y., Chai, Y.: 2PTMC Classification Algorithm Based on Support Vector Machines and Its Application to Fault Diagnosis. Control and Decision 18 (2003) 272-276
5. Jin, B., Tang, Y.C., Zhang, Y.Q.: Support Vector Machines with Genetic Fuzzy Feature Transformation for Biomedical Data Classification. Information Sciences 177 (2007) 476-489
6. Bottou, L., Cortes, C., Denker, J.: Comparison of Classifier Methods: A Case Study in Handwriting Digit Recognition. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, Jerusalem. IEEE (1994) 77-82
7. Kreßel, U.: Pairwise Classification and Support Vector Machines. In: Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA (1999) 255-258
8. Platt, J., Cristianini, N., Shawe-Taylor, J.: Large Margin DAGs for Multiclass Classification. In: Advances in Neural Information Processing Systems 12. MIT Press, Cambridge, MA (2000) 547-553
9. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multi-Class Support Vector Machines. IEEE Transactions on Neural Networks 13 (2002) 415-425
10. Wang, X.D., Shi, Z.W., Wu, C.M., Wang, W.: An Improved Algorithm for Decision-Tree-Based SVM. In: Proceedings of the 6th World Congress on Intelligent Control and Automation, Dalian, China (2006) 4234-4237
11. Sahbi, H., Geman, D., Perona, P.: A Hierarchy of Support Vector Machines for Pattern Detection. Journal of Machine Learning Research 7 (2006) 2087-2123
12. Zhao, H., Rong, L.L., Li, X.: New Method of Design Hierarchical Support Vector Machine Multi-class Classifier. Application Research of Computers 23 (2006) 34-37
13. Bennett, K.P., Cristianini, N., Shawe-Taylor, J.: Enlarging the Margins in Perceptron Decision Trees. Machine Learning 3 (2004) 295-313
14. Li, Q., Jiao, L.C., Zhou, W.D.: Pre-Extracting Support Vectors for Support Vector Machine Based on Vector Projection. Chinese Journal of Computers 28 (2005) 145-152
15. Platt, J.C.: Fast Training of Support Vector Machines Using Sequential Minimal Optimization. http://research.microsoft.com/~jplatt