
Journal of Machine Learning Research ? (?)

Submitted 7/15; Published ?/??

A Unified View on Multi-class Support Vector Classification: Supplement

Ürün Doğan

[email protected]

Microsoft Research

Tobias Glasmachers

[email protected]

Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany

Christian Igel

[email protected]

Department of Computer Science, University of Copenhagen, Denmark

Editor: Ingo Steinwart

A. Aggregation Operators as Linear Programs

Aggregation operators, which combine the $d$ margin violations into a single cost value, can be understood as computing the value of the linear program

$$\Delta(v, y) \;=\; \min_{\xi} \sum_{r \in R_y} \xi_r \qquad \text{s.t.} \quad \forall p \in P_y:\ \xi_{s_y(p)} \ge v_p(f(x), y)$$

for the variables $\xi = (\xi_r)_{r \in R_y}$, where $P_y \subset Y$, $R_y$ is an index set, and $s_y : P_y \to R_y$ is surjective. The set $P_y$ lists all margin violations that enter the loss, $R_y$ lists the slack variables, and $s_y$ assigns slack variables to margin components, depending on the particular loss in use. Table 1 lists the configurations of the linear programs corresponding to the different aggregation operators.

| aggregation operator | $P_y$ | $R_y$ | $s_y$ |
|---|---|---|---|
| $\Delta_{\text{self}}$ | $\{y\}$ | $\{*\}$ | $p \mapsto *$ |
| $\Delta_{\text{o-max}}$ | $Y \setminus \{y\}$ | $\{*\}$ | $p \mapsto *$ |
| $\Delta_{\text{t-max}}$ | $Y$ | $\{*\}$ | $p \mapsto *$ |
| $\Delta_{\text{t-sum}}$ | $Y$ | $Y$ | $\mathrm{id}$ |
| $\Delta_{\text{o-sum}}$ | $Y \setminus \{y\}$ | $Y \setminus \{y\}$ | $\mathrm{id}$ |

Table 1: Aggregation operators and the corresponding linear programs, expressed in terms of the sets $P_y$ and $R_y$, and the assignment $s_y : P_y \to R_y$, for each $y \in Y$.
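As an illustration (not part of the original derivation), the following Python sketch evaluates the aggregation linear program for two configurations of Table 1 with a generic LP solver and compares the result against the corresponding direct max/sum aggregation; the function names and the toy margin violations are placeholders chosen for this example.

```python
# Illustrative sketch: the aggregation LP of Table 1, solved with a generic
# LP solver, reproduces the direct max/sum aggregation of margin violations.
import numpy as np
from scipy.optimize import linprog

def aggregate_lp(v, P_y, R_y, s_y):
    """Value of  min_xi sum_{r in R_y} xi_r  s.t.  xi_{s_y(p)} >= v_p  for p in P_y."""
    index = {r: j for j, r in enumerate(R_y)}
    c = np.ones(len(R_y))                      # objective: sum of slack variables
    A_ub, b_ub = [], []
    for p in P_y:                              # one inequality per margin violation
        row = np.zeros(len(R_y))
        row[index[s_y(p)]] = -1.0              # encode xi_{s_y(p)} >= v_p as -xi <= -v_p
        A_ub.append(row)
        b_ub.append(-v[p])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * len(R_y))  # slacks are unconstrained here
    return res.fun

# Toy margin violations v_p for classes Y = {0, 1, 2} and true class y = 0.
Y, y = [0, 1, 2], 0
v = {0: -0.5, 1: 2.0, 2: 1.2}
others = [c for c in Y if c != y]

# Delta_o-max: P_y = Y \ {y}, a single shared slack, s_y constant.
print(aggregate_lp(v, others, ["*"], lambda p: "*"), max(v[c] for c in others))
# Delta_o-sum: P_y = R_y = Y \ {y}, s_y = id.
print(aggregate_lp(v, others, others, lambda p: p), sum(v[c] for c in others))
```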

As for the margin function definition based on the sparse coefficients $\nu_{y,p,m}$, the true underlying degrees of freedom for aggregation operators are far more restricted than it seems, in particular if classes are treated symmetrically. For symmetry reasons, the sets $P_y$ can take only the three values $\{y\}$, $Y$, and $Y \setminus \{y\}$, since all classes $c \neq y$ are to be treated the same way. The same argument implies that $s_y$ has to be either injective or constant when restricted to the atomic invariant subsets $\{y\}$ and $Y \setminus \{y\}$. This again leaves only few choices for $R_y$ under the restriction that $s_y$ is surjective.

The hinge loss $L_{\text{hinge}}(\mu) = \max\{0, 1 - \mu\}$ can also be expressed as a linear program, namely

$$L_{\text{hinge}}(\mu) \;=\; \min_{u}\ u \qquad \text{s.t.} \quad u \ge 1 - \mu, \quad u \ge 0\ .$$

The two linear programs can be combined into one:

$$L(f(x), y) \;=\; \min_{\xi} \sum_{r \in R_y} \xi_r \qquad \text{s.t.} \quad \forall p \in P_y:\ \xi_{s_y(p)} \ge 1 - \mu_p(f(x), y), \qquad \forall r \in R_y:\ \xi_r \ge 0\ .$$

The first constraint can be rewritten as

$$\mu_p(f(x), y) \;=\; \sum_{m} \nu_{y,p,m} \cdot f_m(x) \;\ge\; 1 - \xi_{s_y(p)}\ .$$

Thus, the decision function values enter a multi-class loss based on the hinge loss as parameters of a linear program.
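For example (an instantiation spelled out here only for illustration, not part of the original text), plugging the $\Delta_{\text{o-sum}}$ configuration from Table 1, with $P_y = R_y = Y \setminus \{y\}$ and $s_y = \mathrm{id}$, into the combined program decouples the slack variables and recovers the familiar sum of hinge losses over the competing classes:

$$L(f(x), y) = \min_{\xi} \Big\{ \sum_{c \in Y \setminus \{y\}} \xi_c \;\Big|\; \xi_c \ge 1 - \mu_c(f(x), y),\ \xi_c \ge 0 \Big\} = \sum_{c \in Y \setminus \{y\}} \big[ 1 - \mu_c(f(x), y) \big]_+\ .$$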

B. Deriving the Uniform Dual Problems

For deriving the dual problem from the primal, we introduce Lagrange multipliers $\alpha_{i,p} \ge 0$, $\beta_{i,r} \ge 0$, $\eta \in H$, and $\tau \in \mathbb{R}$ corresponding to the constraints of the primal problem, and compute the Lagrangian

$$\begin{aligned} L ={}& \frac{1}{2} \sum_c \|w_c\|^2 + C \cdot \sum_{i,r} \xi_{i,r} \\ &+ \sum_{i,p} \alpha_{i,p} \left[ \gamma_{y_i,p} - \sum_c \nu_{y_i,p,c} \big( \langle w_c, \phi(x_i) \rangle + b_c \big) - \xi_{i, s_{y_i}(p)} \right] - \sum_{i,r} \beta_{i,r} \xi_{i,r} \\ &+ \Big\langle \eta, \sum_c w_c \Big\rangle + \tau \sum_c b_c \end{aligned}$$

with derivatives

$$\frac{\partial L}{\partial w_c} = w_c - \sum_{i,p} \alpha_{i,p} \nu_{y_i,p,c}\, \phi(x_i) + \eta = 0 \quad\Rightarrow\quad w_c = \sum_{i,p} \alpha_{i,p} \nu_{y_i,p,c}\, \phi(x_i) - \eta \tag{1}$$

$$\frac{\partial L}{\partial b_c} = -\sum_{i,p} \alpha_{i,p} \nu_{y_i,p,c} + \tau = 0 \quad\Rightarrow\quad \sum_{i,p} \alpha_{i,p} \nu_{y_i,p,c} = \tau$$

$$\frac{\partial L}{\partial \xi_{i,r}} = C - \sum_{p \in P_{y_i}^r} \alpha_{i,p} - \beta_{i,r} = 0 \quad\Rightarrow\quad \sum_{p \in P_{y_i}^r} \alpha_{i,p} \le C\ .$$

The sets $P_y^r$, $r \in R_y$, are defined as $P_y^r = s_y^{-1}(\{r\}) = \{p \in P_y \mid s_y(p) = r\}$. They form a partition of the set $P_y$ of constraints. To derive the dual in the absence of the sum-to-zero constraint we just set the dual variables $\eta$ and $\tau$ to zero. Then the first derivative above gives us an expression of $w_c$ in terms of $\alpha$. In the case with sum-to-zero constraint we get

$$0 = \sum_c w_c = \sum_c \left( \sum_{i,p} \alpha_{i,p} \nu_{y_i,p,c}\, \phi(x_i) - \eta \right) \quad\Rightarrow\quad \eta = \frac{1}{d} \sum_c \sum_{i,p} \alpha_{i,p} \nu_{y_i,p,c}\, \phi(x_i)$$

and thus

$$w_c = \sum_{i,p} \alpha_{i,p} \left[ \sum_m \left( \delta_{m,c} - \frac{1}{d} \right) \nu_{y_i,p,m} \right] \phi(x_i)\ .$$

To get to the dual problem, we plug this expression into the Lagrangian using the identity

$$\sum_c \left( \delta_{m,c} - \frac{1}{d} \right) \left( \delta_{n,c} - \frac{1}{d} \right) = \delta_{m,n} - \frac{1}{d}\ .$$
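For completeness (a short check added here for illustration, not part of the original text), the identity follows by expanding the product and summing over the $d$ classes:

$$\sum_c \left( \delta_{m,c} - \frac{1}{d} \right)\left( \delta_{n,c} - \frac{1}{d} \right) = \sum_c \delta_{m,c}\delta_{n,c} - \frac{1}{d}\sum_c \delta_{m,c} - \frac{1}{d}\sum_c \delta_{n,c} + \frac{d}{d^2} = \delta_{m,n} - \frac{2}{d} + \frac{1}{d} = \delta_{m,n} - \frac{1}{d}\ .$$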

C. Proof of Theorem 5

In the following, we outline a proof of Theorem 5. Let $L(f(x), y)$ denote either the loss function used by the AMO machine or the loss function used by the ATM machine, that is, the loss resulting from application of either the max-over-others or the total-max operator to absolute margins:

$$L(f(x), y) = \max_{c \in Y \setminus \{y\}} \left\{ v_c^{\mathrm{abs}}(f(x), y) \right\} = \left[ 1 + \max_{c \in Y \setminus \{y\}} f_c(x) \right]_+ \tag{AMO}$$

or

$$L(f(x), y) = \max_{c \in Y} \left\{ v_c^{\mathrm{abs}}(f(x), y) \right\} = \max\left\{ \left[ 1 + \max_{c \in Y \setminus \{y\}} f_c(x) \right]_+ ,\ \big[ 1 - f_y(x) \big]_+ \right\}\ . \tag{ATM}$$

Then Theorem 5 states that the minimizer $f^*$ of the corresponding risk $R = \mathrm{E}[L(f(x), y)]$, subject to the sum-to-zero constraint $\sum_{c \in Y} f_c(x) = 0$, satisfies:


• If there exists a majority class $y \in Y$ such that $P_y > (d-1)/d$, then $f_y^*(x) = d-1$ and $f_c^*(x) = -1$ for all $c \in Y \setminus \{y\}$.

• If $P_y < (d-1)/d$ for all $y \in Y$, then $f^*(x) = 0$.

Proof. We present the proof for the AMO loss function. Following Liu (2007), we argue that $f_c^*(x) \ge -1$ for all $c \in Y$. Suppose $f_c(x) < -1$; then it is easy to see that $\tilde{f}$ defined as $\tilde{f}_c(x) = -1$ and $\tilde{f}_e(x) = f_e(x) + (f_c(x) + 1)/(d-1)$ for $e \neq c$ fulfills $R_x(\tilde{f}) \le R_x(f)$, contradicting the optimality of $f^*$. Restricting the solution space to $f_c(x) \ge -1$ allows us to write the point-wise risk as

$$R_x = \sum_{y \in Y} P_y \cdot \Big[ 1 + \max\big\{ f_c(x) \mid c \in Y \setminus \{y\} \big\} \Big]\ .$$

Now we pick $y \in \arg\max\{f_c(x) \mid c \in Y\}$ and treat the value $f_y(x) \ge 0$ (which is non-negative because of the sum-to-zero constraint) as fixed from now on. We write the point-wise risk as

$$R_x = P_y \cdot \Big[ 1 + \max\big\{ f_c(x) \mid c \in Y \setminus \{y\} \big\} \Big] + \sum_{c \neq y} P_c \cdot \big[ 1 + f_y(x) \big]\ .$$

The best we can do to keep this risk low is to set all components $f_c(x)$, $c \neq y$, to the same value: $f_c(x) = \sum_{c \neq y} f_c(x)/(d-1) = -f_y(x)/(d-1)$ for all $c \in Y \setminus \{y\}$. It holds

$$\begin{aligned} R_x &= P_y \cdot \left( 1 - \frac{f_y(x)}{d-1} \right) + \sum_{c \neq y} P_c \cdot \big( 1 + f_y(x) \big) = P_y \cdot \left( 1 - \frac{f_y(x)}{d-1} \right) + (1 - P_y) \cdot \big( 1 + f_y(x) \big) \\ &= 1 - P_y \cdot \frac{f_y(x)}{d-1} + (1 - P_y) \cdot f_y(x) = 1 + \left( 1 - \frac{d}{d-1}\, P_y \right) f_y(x)\ . \end{aligned}$$

For $P_y > (d-1)/d$ this expression is a decreasing function of $f_y(x)$, resulting in the optimum $f_y^*(x) = d-1$ and $f_c^*(x) = -1$ for $c \neq y$, which maximizes $f_y(x)$ under the constraints $\sum_c f_c(x) = 0$ and $\forall c: f_c(x) \ge -1$. In contrast, for $P_y < (d-1)/d$ the risk is lower bounded by one. In this case $f^*(x) = 0$ minimizes the expression, yielding $R_x = 1$. The analogous result for the ATM loss function can be proven with exactly the same arguments.
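The statement can also be checked numerically. The following Python sketch (an illustration, not part of the proof) minimizes the point-wise AMO risk by brute force over a grid of sum-to-zero score vectors for $d = 3$; all function names and the grid resolution are choices made for this example only.

```python
# Numerical sanity check of Theorem 5 for the AMO loss with d = 3 classes.
import itertools
import numpy as np

def amo_loss(f, y):
    """AMO loss [1 + max_{c != y} f_c(x)]_+ for a score vector f and true class y."""
    return max(0.0, 1.0 + np.delete(f, y).max())

def pointwise_risk(f, P):
    """Point-wise risk R_x = sum_y P_y * L(f, y)."""
    return sum(P[y] * amo_loss(f, y) for y in range(len(P)))

def minimize_risk(P, grid):
    """Brute-force minimizer of R_x over sum-to-zero score vectors on a grid."""
    best_f, best_risk = None, np.inf
    for partial in itertools.product(grid, repeat=len(P) - 1):
        f = np.array(partial + (-sum(partial),))   # last component enforces sum-to-zero
        risk = pointwise_risk(f, P)
        if risk < best_risk:
            best_f, best_risk = f, risk
    return best_f, best_risk

grid = np.linspace(-1.0, 2.0, 31)
# Majority class 0 with P_0 > (d-1)/d = 2/3: minimizer should be (2, -1, -1).
print(minimize_risk(np.array([0.8, 0.1, 0.1]), grid))
# No majority class: minimizer should be (0, 0, 0) with risk 1.
print(minimize_risk(np.array([0.4, 0.3, 0.3]), grid))
```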

D. Data Sets

The descriptive statistics of the 12 UCI data sets used in both the linear and the non-linear SVM experiments are given in Table 2. The additional data sets used in the linear SVM experiments are described in Table 3.

| Data set    | d  | ℓ_train | ℓ_test | p  |
|-------------|----|---------|--------|----|
| Abalone     | 27 | 3133    | 1044   | 10 |
| Car         | 4  | 1209    | 519    | 6  |
| Glass       | 6  | 149     | 65     | 9  |
| Iris        | 3  | 105     | 45     | 4  |
| Opt. digits | 10 | 3823    | 1797   | 64 |
| Page blocks | 5  | 3831    | 1642   | 10 |
| Sat         | 7  | 4435    | 2000   | 36 |
| Segment     | 7  | 1617    | 693    | 19 |
| Soy bean    | 19 | 214     | 93     | 35 |
| Vehicle     | 4  | 592     | 254    | 18 |
| Red wine    | 10 | 1120    | 479    | 11 |
| White wine  | 10 | 3428    | 1470   | 11 |

Table 2: Descriptive statistics of the 12 UCI data sets used in the non-linear SVM study. The columns d, ℓ_train, ℓ_test, and p contain the number of classes, the number of training examples, the number of test examples, and the input space dimension (number of features), respectively.

| Data set  | d   | ℓ_train | ℓ_test  | p      |
|-----------|-----|---------|---------|--------|
| Covertype | 7   | 406,707 | 174,305 | 54     |
| Letter    | 26  | 15,000  | 5,000   | 16     |
| News-20   | 20  | 15,935  | 3,993   | 62,061 |
| Sector    | 105 | 6,412   | 3,207   | 55,197 |
| Usps      | 10  | 7,291   | 2,007   | 256    |

Table 3: Descriptive statistics of the additional data sets used in the linear SVM experiments. The columns d, ℓ_train, ℓ_test, and p contain the number of classes, the number of training examples, the number of test examples, and the input space dimension (number of features), respectively.


E. Model Selection Results

The best parameter configurations (C, γ) for the non-linear SVMs are found in Table 4. The values of the parameter C for the linear SVM experiments are listed in Table 5.

| Data set    | OVA      | MMR     | WW       | CS       | LLW      | AMO      | ATS      | ATM      | RM       |
|-------------|----------|---------|----------|----------|----------|----------|----------|----------|----------|
| Abalone     | (-6, 1)  | (-3, 0) | (3, -4)  | (0, 3)   | (0, 0)   | (9, 3)   | (0, 0)   | (8, 4)   | (-1, 0)  |
| Car         | (3, -1)  | (2, 0)  | (4, -1)  | (4, -1)  | (6, -1)  | (8, -1)  | (5, -1)  | (6, -1)  | (6, -2)  |
| Glass       | (0, -2)  | (0, -1) | (-2, -1) | (2, -2)  | (2, 0)   | (0, 3)   | (4, -3)  | (-3, 3)  | (2, -2)  |
| Iris        | (7, -6)  | (1, -2) | (10, -9) | (1, -3)  | (7, -5)  | (1, -2)  | (13, -9) | (1, -2)  | (9, -7)  |
| Opt. digits | (3, -6)  | (1, -3) | (3, -5)  | (3, -5)  | (7, -6)  | (-3, -2) | (7, -6)  | (9, -6)  | (6, -6)  |
| Page blocks | (1, -1)  | (-2, 2) | (5, -3)  | (7, -4)  | (11, -4) | (4, -1)  | (10, -4) | (4, -1)  | (9, -4)  |
| Sat         | (2, -2)  | (1, 0)  | (3, -2)  | (3, -2)  | (5, -2)  | (6, -2)  | (4, -2)  | (6, -2)  | (3, -2)  |
| Segment     | (6, 0)   | (1, 1)  | (7, 0)   | (9, -5)  | (9, 0)   | (12, -4) | (9, 0)   | (9, -2)  | (8, 0)   |
| Soy bean    | (2, -5)  | (1, -3) | (1, -5)  | (3, -5)  | (6, -6)  | (9, -5)  | (8, -7)  | (9, -5)  | (8, -7)  |
| Vehicle     | (12, -7) | (0, -2) | (11, -7) | (11, -8) | (14, -7) | (17, -8) | (16, -8) | (17, -8) | (16, -9) |
| Red wine    | (-1, 0)  | (-1, 0) | (-4, 0)  | (-1, 0)  | (2, 0)   | (3, 0)   | (2, 0)   | (2, 0)   | (1, 0)   |
| White wine  | (5, 0)   | (1, 0)  | (-1, 0)  | (6, 0)   | (1, 1)   | (4, 0)   | (3, 0)   | (4, 0)   | (5, 0)   |

Table 4: Best hyperparameter values (log2(C), log2(γ)) found by the model selection procedure.



| Data set    | OVA | MMR | WW  | CS | LLW | AMO | ATS | ATM | RM  |
|-------------|-----|-----|-----|----|-----|-----|-----|-----|-----|
| Cover type  | -1  | -18 | -11 | 2  | 7   | -5  | 5   | -5  | -4  |
| Letter      | 12  | -10 | -4  | 8  | 14  | -12 | 11  | -12 | -8  |
| News-20     | 1   | 0   | 1   | 2  | 6   | 10  | 6   | 10  | 5   |
| Sector      | 1   | 1   | 1   | 2  | 8   | 14  | 8   | 14  | 7   |
| Usps        | -3  | -11 | 1   | -2 | 17  | -4  | 13  | -4  | 0   |
| Abalone     | -9  | -14 | 3   | -5 | -5  | -3  | -3  | -3  | -5  |
| Car         | 1   | -17 | 10  | 0  | 19  | -11 | 0   | -11 | -1  |
| Glass       | 7   | -3  | 2   | 5  | 8   | 9   | 6   | 9   | 8   |
| Iris        | 2   | -6  | 1   | -2 | 2   | -8  | 3   | -8  | 4   |
| Opt. digits | -1  | -7  | -2  | -3 | -13 | -6  | -12 | -6  | -6  |
| Page blocks | 11  | -5  | 13  | 5  | 3   | 5   | 4   | 5   | 2   |
| Sat         | 1   | -23 | 2   | 5  | 5   | 0   | -7  | 0   | -11 |
| Segment     | -1  | -10 | 12  | 8  | 2   | 14  | 7   | 14  | 6   |
| Soybean     | -2  | -6  | 3   | 10 | 22  | 5   | 8   | 5   | 9   |
| Vehicle     | 9   | -4  | 8   | 7  | 7   | 11  | 11  | 11  | -4  |
| Red wine    | -1  | -8  | -1  | 2  | 6   | -11 | -3  | -11 | 3   |
| White wine  | -5  | -5  | 0   | -2 | 0   | -8  | 1   | -8  | 5   |

Table 5: Best hyperparameter values (log2(C)) for linear models found by the model selection procedure.

References

Y. Liu. Fisher consistency of multicategory support vector machines. In M. Meila and X. Shen, editors, Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), volume 2 of JMLR W&P, pages 289–296, 2007.