On Invariance of Support Vector Machines

Shigeo Abe
Graduate School of Science and Technology, Kobe University, Kobe, Japan
[email protected]
http://frenchblue.scitec.kobe-u.ac.jp/~abe/index.html

Abstract. In this paper we discuss the conditions under which L1 and L2 soft margin support vector machines with dot product, polynomial, neural network, or radial basis function kernels give the same solution under translation, scaling, and rotation of the input variables. Specifically, we show the conditions under which the solution is invariant to transformation of the input range from [0, 1] to [−1, 1], or vice versa, for dot product and RBF kernels.

1 Introduction

In general, the classification performance of a classifier is affected by linear transformations of the input variables. If a classifier is not affected by such a transformation, it is called invariant to the transformation. Invariance is an important property: if a classifier is scale invariant, for example, we need not consider how each input is scaled. Neural networks are not invariant; a transformation of the inputs may make the network converge to a different solution. Fuzzy classifiers with ellipsoidal regions [1, pp. 208–209], in contrast, are based on the Mahalanobis distance and are therefore invariant to linear transformations, in particular to translation, scaling, and rotation.

For most classifiers, which are not transformation invariant, we usually scale the range of each input variable into [0, 1] or [−1, 1] to keep variables with large input ranges from dominating the generalization ability, but there is no theory telling us which range to choose. It is known that support vector machines [2] with some kernels, such as radial basis function (RBF) kernels, are translation and rotation invariant [3]. In [3, pp. 333–358] and [4], invariance in the sense that a small variation of the input does not affect the classification result is discussed, but little has been said about how a linear transformation affects the solution of a support vector machine.

In this paper, we discuss the invariance of L1 and L2 support vector machines under linear transformations of the input variables, in particular translation, scaling, and rotation. Namely, we clarify the conditions on the kernel parameters under which support vector machines give the same solution under rotation, scaling, and translation. We then derive the conditions on the kernel parameters that give the same solution when the input range is converted from [0, 1] to [−1, 1] or vice versa.


In Section 2, we describe L1 and L2 soft margin support vector machines; in Section 3, we discuss linear transformation invariance of support vector machines; in Section 4, we clarify the conditions under which the same solution is obtained when the input range is converted from [0, 1] to [−1, 1] or vice versa; and in Section 5, we show the validity of our analysis by computer experiments.

2 Soft Margin Support Vector Machines

In soft margin support vector machines, we consider the linear decision function

\[
D(x) = w^t g(x) + b \tag{1}
\]

in the feature space, where w is the l-dimensional weight vector, g(x) is the mapping function that maps the m-dimensional input x into the l-dimensional feature space, and b is a scalar. We determine the decision function so that the classification error for the training data and unknown data is minimized. This can be achieved by minimizing

\[
\frac{1}{2}\|w\|^2 + \frac{C}{p}\sum_{i=1}^{M} \xi_i^p \tag{2}
\]

subject to the constraints

\[
y_i\,(w^t g(x_i) + b) \ge 1 - \xi_i \quad \text{for} \quad i = 1, \ldots, M, \tag{3}
\]

where ξ_i are the positive slack variables associated with the training data x_i, M is the number of training data, y_i are the class labels (1 or −1) for x_i, C is the margin parameter, and p is either 1 or 2. When p = 1, we call the support vector machine the L1 soft margin support vector machine (L1 SVM), and when p = 2, the L2 soft margin support vector machine (L2 SVM).

The dual problem for the L1 SVM is to maximize

\[
Q(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j H(x_i, x_j) \tag{4}
\]

subject to the constraints

\[
\sum_{i=1}^{M} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \tag{5}
\]

where α_i are the Lagrange multipliers associated with the training data x_i and H(x, x') = g(x)^t g(x') is the kernel function. In the following study, we use the following four kernel functions:

1. dot product kernels: H(x, x') = x^t x',


2. neural network kernels: H(x, x') = 1 / (1 + exp(ν x^t x' − a)), where ν and a are parameters,

3. polynomial kernels: H(x, x') = (x^t x' + 1)^d, where d is a positive integer,

4. RBF kernels: H(x, x') = exp(−γ ‖x − x'‖²), where γ is a positive parameter.

Since neural network kernels satisfy Mercer's condition only for specific values of ν and a, they are less frequently used.

The dual problem for the L2 SVM is to maximize

\[
Q(\alpha) = \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j \left( H(x_i, x_j) + \frac{\delta_{ij}}{C} \right) \tag{6}
\]

subject to the constraints

\[
\sum_{i=1}^{M} y_i \alpha_i = 0, \quad \alpha_i \ge 0. \tag{7}
\]
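To make the four kernels above concrete, the following NumPy sketch evaluates each of them for a pair of input vectors; the function and variable names are ours, introduced only for illustration.

```python
import numpy as np

def dot_kernel(x, x2):
    # H(x, x') = x^t x'
    return x @ x2

def nn_kernel(x, x2, nu, a):
    # H(x, x') = 1 / (1 + exp(nu x^t x' - a)); Mercer's condition holds
    # only for particular values of nu and a.
    return 1.0 / (1.0 + np.exp(nu * (x @ x2) - a))

def poly_kernel(x, x2, d):
    # H(x, x') = (x^t x' + 1)^d
    return (x @ x2 + 1.0) ** d

def rbf_kernel(x, x2, gamma):
    # H(x, x') = exp(-gamma ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x2) ** 2))

x = np.array([0.2, 0.8])
x_prime = np.array([0.5, 0.1])
print(dot_kernel(x, x_prime), poly_kernel(x, x_prime, 2), rbf_kernel(x, x_prime, 1.0))
```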

3 Linear Transformation Invariance

The Euclidean distance is used for calculating margins; it is rotation and translation invariant but not scale invariant. Therefore, support vector machines with dot product kernels are rotation and translation invariant but not scale invariant. In general the Euclidean distance is not scale invariant, but if all the input variables are scaled by the same factor, the Euclidean distance changes by that factor. We therefore consider the following transformation:

\[
z = s A x + c, \tag{8}
\]

where s (> 0) is a scaling factor, A is an orthogonal matrix satisfying A^t A = I, and c is a constant vector.

3.1 RBF Kernels

Now the RBF kernel H(z, z') for the transformed variables is given by

\[
H(z, z') = \exp(-\gamma' \|sAx + c - sAx' - c\|^2)
         = \exp(-\gamma' \|sA(x - x')\|^2)
         = \exp(-\gamma' s^2 \|x - x'\|^2). \tag{9}
\]


Therefore, RBF kernels are translation and rotation invariant. Furthermore, for a scaling factor s, if

\[
\gamma' s^2 = \gamma, \tag{10}
\]

then H(z, z') = H(x, x'). Thus, if (10) is satisfied, the optimal solutions for a training data set and for the data set transformed by (8) are the same.
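As a quick numerical check of (9) and (10) (our own example, not from the paper): for an arbitrary orthogonal A, translation c, and scale s, choosing γ' = γ/s² reproduces the original kernel value.

```python
import numpy as np

rng = np.random.default_rng(0)
x, x_prime = rng.random(3), rng.random(3)

# Transformation (8): z = s A x + c with an orthogonal A (here a QR factor).
s = 2.0
A, _ = np.linalg.qr(rng.standard_normal((3, 3)))
c = rng.random(3)
z, z_prime = s * A @ x + c, s * A @ x_prime + c

gamma = 0.5
gamma_prime = gamma / s**2          # condition (10): gamma' s^2 = gamma

H_x = np.exp(-gamma * np.sum((x - x_prime) ** 2))
H_z = np.exp(-gamma_prime * np.sum((z - z_prime) ** 2))
print(np.isclose(H_x, H_z))         # True: H(z, z') = H(x, x')
```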

3.2 Neural Network Kernels

The neural network kernel H(z, z') for the transformed variables is given by

\[
H(z, z') = \frac{1}{1 + \exp\left(\nu' (sAx + c)^t (sAx' + c) - a\right)}. \tag{11}
\]

If c ≠ 0, (11) is not invariant. Setting c = 0, (11) becomes

\[
H(z, z') = \frac{1}{1 + \exp(\nu' s^2 x^t x' - a)}. \tag{12}
\]

Therefore, neural network kernels are rotation invariant. If

\[
\nu' s^2 = \nu \tag{13}
\]

is satisfied, H(z, z') = H(x, x'). Thus, if (13) is satisfied, the optimal solutions for a training data set and for the data set transformed by (8) with c = 0 are the same.
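The analogous check for (13) (again our own illustrative example): with c = 0 and ν' = ν/s², the neural network kernel value is unchanged by the transformation.

```python
import numpy as np

rng = np.random.default_rng(1)
x, x_prime = rng.random(3), rng.random(3)

s = 3.0
A, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # orthogonal A, c = 0
z, z_prime = s * A @ x, s * A @ x_prime

nu, a = 0.8, 1.0
nu_prime = nu / s**2                 # condition (13): nu' s^2 = nu

H_x = 1.0 / (1.0 + np.exp(nu * (x @ x_prime) - a))
H_z = 1.0 / (1.0 + np.exp(nu_prime * (z @ z_prime) - a))
print(np.isclose(H_x, H_z))          # True when c = 0
```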

3.3 Dot Product Kernels

For the dot product kernel, H(z, z') is given by

\[
H(z, z') = (sAx + c)^t (sAx' + c) = s^2 x^t x' + s c^t A x' + s x^t A^t c + c^t c. \tag{14}
\]

Training of the L1 support vector machine with a data set transformed by (8) is as follows: find α'_i (i = 1, ..., M) that maximize

\[
Q(\alpha') = \sum_{i=1}^{M} \alpha'_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha'_i \alpha'_j y_i y_j \left( s^2 x_i^t x_j + s c^t A x_j + s x_i^t A^t c + c^t c \right) \tag{15}
\]

subject to the constraints

\[
\sum_{i=1}^{M} y_i \alpha'_i = 0, \quad 0 \le \alpha'_i \le C'. \tag{16}
\]


Using the equality constraint in (16), the terms in (15) that involve c vanish, and (15) becomes

\[
Q(\alpha') = \sum_{i=1}^{M} \alpha'_i - \frac{1}{2} \sum_{i,j=1}^{M} \alpha'_i \alpha'_j y_i y_j s^2 x_i^t x_j
           = s^{-2} \left( \sum_{i=1}^{M} s^2 \alpha'_i - \frac{1}{2} \sum_{i,j=1}^{M} s^2 \alpha'_i \, s^2 \alpha'_j \, y_i y_j x_i^t x_j \right). \tag{17}
\]

Thus, setting α_i = s² α'_i, the inequality constraint in (16) becomes

\[
0 \le \alpha_i \le s^2 C'. \tag{18}
\]

Therefore, the optimal solutions of the L1 SVM with the dot product kernel for a training data set and for the data set transformed by (8) are the same when

\[
C = s^2 C'. \tag{19}
\]

This also holds for L2 SVMs.
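A small scikit-learn sketch of (19) on synthetic data (our own setup, not the paper's experiments): training a dot product (linear kernel) SVM on data transformed by (8) with C' = C/s² yields the same classification as training on the original data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X, y = make_classification(n_samples=200, n_features=4, random_state=2)

# Transformation (8) applied to row vectors: z = s A x + c.
s = 2.0
A, _ = np.linalg.qr(rng.standard_normal((4, 4)))
c = rng.random(4)
Z = s * X @ A.T + c

C = 10.0
clf_x = SVC(kernel='linear', C=C).fit(X, y)           # margin parameter C
clf_z = SVC(kernel='linear', C=C / s**2).fit(Z, y)    # C' = C / s^2, cf. (19)

# The two machines realize the same decision (up to the solver's numerical tolerance).
print(np.all(clf_x.predict(X) == clf_z.predict(Z)))
```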

3.4 Polynomial Kernels

For the polynomial kernel, H(z, z') is given by

\[
H(z, z') = \left( (sAx + c)^t (sAx' + c) + 1 \right)^d
         = \left( s^2 x^t x' + s c^t A x' + s x^t A^t c + c^t c + 1 \right)^d. \tag{20}
\]

Therefore, polynomial kernels are rotation invariant but neither scale nor translation invariant. This is also true for L2 support vector machines with polynomial kernels. Assuming that (20) is approximated by the term with the highest degree of s,

\[
H(z, z') = s^{2d} (x^t x')^d, \tag{21}
\]

then, similarly to the discussion for the dot product kernel, the support vector machines trained with a data set and with the data set transformed by (8) perform similarly when

\[
C = s^{2d} C'. \tag{22}
\]
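The approximation (21) can be checked numerically; the sketch below (our own example) shows the ratio of (20) to its leading term approaching 1 as the scaling factor s grows.

```python
import numpy as np

rng = np.random.default_rng(3)
x, x_prime = rng.random(3), rng.random(3)
A, _ = np.linalg.qr(rng.standard_normal((3, 3)))
c = rng.random(3)
d = 3

for s in (1.0, 10.0, 100.0):
    z, z_prime = s * A @ x + c, s * A @ x_prime + c
    exact = (z @ z_prime + 1.0) ** d              # polynomial kernel (20)
    leading = s ** (2 * d) * (x @ x_prime) ** d   # leading term (21)
    print(s, exact / leading)                     # ratio tends to 1 as s grows
```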

4 Relation between the Ranges [0, 1] and [−1, 1]

In training support vector machines, we normalize the range of the input variables into [0, 1] or [−1, 1] without knowing the difference between the two. Using the previous discussions, however, we can clarify the relation between the resulting solutions. Since the transformation from [0, 1] to [−1, 1] is given by

\[
z = 2x - 1, \tag{23}
\]

it is a combination of translation and scaling (s = 2). Thus, according to the previous discussions, we can obtain the parameter values that give the same or roughly the same results for the two input ranges. Table 1 summarizes this result.

Table 1. Parameters that give the same or roughly the same solutions

  Kernel         [0, 1]       [−1, 1]
  Dot product    4C           C
  Polynomial     ≈ 4^d C      C
  RBF            4γ           γ
  NN             ≈ 4ν         ν
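The following scikit-learn sketch illustrates the dot product and RBF rows of Table 1 on synthetic data (the data set and parameter values are ours, chosen only for illustration): an SVM trained on inputs in [0, 1] with 4C (dot product) or 4γ (RBF) classifies the data in the same way as one trained on inputs in [−1, 1] with C or γ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_raw, y = make_classification(n_samples=200, n_features=4, random_state=4)
X01 = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))  # range [0, 1]
X11 = 2.0 * X01 - 1.0                                                        # range [-1, 1], cf. (23)

gamma, C = 1.0, 100.0
rbf_01 = SVC(kernel='rbf', gamma=4.0 * gamma, C=C).fit(X01, y)   # 4*gamma for [0, 1]
rbf_11 = SVC(kernel='rbf', gamma=gamma, C=C).fit(X11, y)         # gamma for [-1, 1]

dot_01 = SVC(kernel='linear', C=4.0 * C).fit(X01, y)             # 4C for [0, 1]
dot_11 = SVC(kernel='linear', C=C).fit(X11, y)                   # C for [-1, 1]

# Identical classifications up to the solver's numerical tolerance.
print(np.all(rbf_01.predict(X01) == rbf_11.predict(X11)))
print(np.all(dot_01.predict(X01) == dot_11.predict(X11)))
```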

5 Numerical Evaluation

To see the validity of Table 1, especially for the polynomial kernels, we conducted simulations using the blood cell and thyroid data sets [1]. For both data sets, we selected the data for Classes 2 and 3. The numbers of training and test data are listed in Table 2. We trained the L1 support vector machine for the blood cell data and the L2 support vector machine for the thyroid data. For the input range [−1, 1] we set C = 5000, and for [0, 1] we set C according to Table 1. For the polynomial kernels, we changed C for [0, 1] from 4 × 5000 = 20000 to 4^d × 5000.

Table 2. Training and test data for the blood cell and thyroid data

  Data         Training data           Test data
               Class 2    Class 3      Class 2    Class 3
  Blood cell   399        400          400        400
  Thyroid      191        3488         177        3178
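For readers who want to reproduce this kind of comparison, the sketch below shows how the reported quantities (training rate, number of support vectors, and the dual objective Q(α) of (4)) can be computed with scikit-learn; it uses a synthetic stand-in rather than the blood cell or thyroid data, so the numbers are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Synthetic stand-in for the paper's data sets (hypothetical, for illustration only).
X_raw, y = make_classification(n_samples=400, n_features=10, random_state=5)
X01 = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))  # inputs in [0, 1]
X11 = 2.0 * X01 - 1.0                                                        # inputs in [-1, 1]

def train_and_report(X, y, C, gamma):
    clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y)
    sv = clf.support_vectors_
    alpha_y = clf.dual_coef_.ravel()               # y_i * alpha_i for the support vectors
    K = rbf_kernel(sv, sv, gamma=gamma)
    Q = np.abs(alpha_y).sum() - 0.5 * alpha_y @ K @ alpha_y   # dual objective (4)
    return clf.score(X, y), len(sv), Q

print(train_and_report(X01, y, C=5000.0, gamma=4.0))
print(train_and_report(X11, y, C=5000.0, gamma=1.0))   # expected to match the [0, 1] run
```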

Table 3 lists the recognition rates of the blood cell test and training data, the number of support vectors, and the value of Q(α) for the L1 support vector machine. The numerals in parentheses show the numbers of bounded support vectors. For the dot product kernel, as the theory tells us, the solution with the range [0, 1] and C = 20000 and that with [−1, 1] and C = 5000 are the same. For the RBF kernels also, the solution with [0, 1] and γ = 4 and that with [−1, 1] and γ = 1 are the same. For the polynomial kernel with d = 2, the solution with [0, 1] and C = 4² × 5000 = 80000 and that with [−1, 1] and C = 5000 are similar: the value of Q(α) with C = 80000 is close to 16 times the value of Q(α) with C = 5000. Similar results hold for d = 3, although the differences between the values of Q(α) are wider than for d = 2.


Table 3. Solutions of the L1 SVM for the blood cell data

  Kernel   Range     Parameter    Test rate (%)   Train. rate (%)   SVs        Q(α)
  dot      [0, 1]    C = 20000    87.00           90.23             103 (89)   1875192
           [−1, 1]   C = 5000     87.00           90.23             103 (89)   1875192/4
  d2       [0, 1]    C = 5000     88.50           92.23             101 (52)   331639
           [0, 1]    C = 20000    86.25           94.24             103 (51)   1191424
           [0, 1]    C = 80000    86.75           95.99             96 (34)    4060006
           [−1, 1]   C = 5000     86.75           95.49             99 (35)    4137900/16
  d3       [0, 1]    C = 5000     88.25           96.24             99 (31)    237554
           [0, 1]    C = 20000    86.00           97.49             98 (19)    672345
           [0, 1]    C = 80000    85.75           99.00             97 (4)     1424663
           [0, 1]    C = 320000   86.50           100               93         1839633
           [−1, 1]   C = 5000     86.00           100               90 (1)     2847139/64
  RBF      [0, 1]    γ = 4        89.00           92.48             99 (58)    358168
           [−1, 1]   γ = 1        89.00           92.48             99 (58)    358168

Table 4 lists the recognition rates of the thyroid test and training data, the number of support vectors, and the value of Q(α) for the L2 support vector machine. For the dot product and RBF kernels, the solutions with the range [0, 1] and the associated solutions with [−1, 1] are the same. For the polynomial kernels, the solution with [0, 1] and C = 4^d × 5000 and that with [−1, 1] and C = 5000 are similar.

Table 4. Solutions of the L2 SVM for the thyroid data

  Kernel   Range     Parameter    Test rate (%)   Train. rate (%)   SVs    Q(α)
  dot      [0, 1]    C = 20000    97.50           98.34             474    2096156
           [−1, 1]   C = 5000     97.50           98.34             474    2096156/4
  d2       [0, 1]    C = 5000     98.12           99.18             275    298379
           [0, 1]    C = 20000    98.18           99.29             216    974821
           [0, 1]    C = 80000    98.21           99.37             191    3360907
           [−1, 1]   C = 5000     97.85           99.37             201    3357314/16
  d3       [0, 1]    C = 5000     97.97           99.40             217    206335
           [0, 1]    C = 20000    98.15           99.57             168    642435
           [0, 1]    C = 80000    98.06           99.76             131    1993154
           [0, 1]    C = 320000   97.94           99.86             106    5626809
           [−1, 1]   C = 5000     97.65           99.92             125    4691633/64
  RBF      [0, 1]    γ = 4        97.91           97.35             237    3816254
           [−1, 1]   γ = 1        97.91           97.35             237    3816254

Acknowledgments

We are grateful to Professor N. Matsuda of Kawasaki Medical School for providing the blood cell data and to Mr. P. M. Murphy and Mr. D. W. Aha of the University of California at Irvine for organizing the databases including the thyroid data (ftp://ftp.ics.uci.edu/pub/machine-learning-databases).

6 Conclusions

We discussed the conditions under which support vector machines with dot product, polynomial, neural network, or radial basis function kernels are invariant under translation, scaling, and rotation of the input variables. Specifically, we showed the conditions under which the solution is invariant to transformation of the input range from [0, 1] to [−1, 1] or vice versa for dot product and RBF kernels. The validity of the analysis was demonstrated by computer experiments on two data sets.

References

1. S. Abe. Pattern Classification: Neuro-fuzzy Methods and Their Comparison. Springer-Verlag, London, UK, 2001.
2. V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, NY, 1998.
3. B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.
4. C. J. C. Burges. Geometry and invariances in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 89–116. The MIT Press, Cambridge, MA, 1999.