Lower Bounds on Rate of Convergence of Cutting Plane Methods

Xinhua Zhang, Dept. of Computing Sciences, University of Alberta, [email protected]
Ankan Saha, Dept. of Computer Science, University of Chicago, [email protected]
S. V. N. Vishwanathan, Dept. of Statistics and Dept. of Computer Science, Purdue University, [email protected]

Abstract

In a recent paper, Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ε accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.

1 Introduction

There has been an explosion of interest in machine learning over the past decade, much of which has been fueled by the phenomenal success of binary Support Vector Machines (SVMs). Driven by numerous applications, there has recently been increasing interest in support vector learning with linear models. At the heart of SVMs is the following regularized risk minimization problem:

    min_w J(w) := (λ/2)‖w‖² + Remp(w),  with  Remp(w) := (1/n) Σ_{i=1}^n max(0, 1 − yi⟨w, xi⟩),   (1)

where (λ/2)‖w‖² is the regularizer and Remp(w) is the empirical risk.
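To make (1) concrete, here is a minimal numeric sketch; the data, the trial point w, and the value of λ below are made up for illustration and are not from the paper:

```python
import numpy as np

def J(w, X, y, lam):
    """Regularized risk of (1): (lam/2)*||w||^2 + average hinge loss."""
    margins = y * (X @ w)                    # y_i <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)   # max(0, 1 - y_i <w, x_i>)
    return 0.5 * lam * (w @ w) + hinge.mean()

# toy data: four labeled points in R^2
X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5], [-2.0, -0.3]])
y = np.array([1.0, -1.0, 1.0, -1.0])
print(J(np.array([0.1, 0.1]), X, y, lam=0.1))
```

The same function is what every solver discussed below is trying to minimize; only the access pattern (subgradients, cutting planes, dual variables) differs.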

Here we assume access to a training set of n labeled examples {(xi, yi)}_{i=1}^n where xi ∈ R^d and yi ∈ {−1, +1}, and use the squared Euclidean norm ‖w‖² = Σᵢ wᵢ² as the regularizer. The parameter λ controls the trade-off between the empirical risk and the regularizer.

There has been significant research devoted to developing specialized optimizers which minimize J(w) efficiently. In an award-winning paper, Joachims [1] presented a cutting plane method (CPM)¹, SVM-Perf, which was shown to converge to an ε accurate solution of (1) in O(1/ε²) iterations, with each iteration requiring O(nd) effort. This was improved by Teo et al. [2], who showed that their Bundle Method for Regularized Risk Minimization (BMRM), which encompasses SVM-Perf as a special case, converges to an ε accurate solution in O(nd/ε) time.

While online learning methods are becoming increasingly popular for solving (1), a key advantage of CPM such as SVM-Perf and BMRM is their ability to directly optimize nonlinear multivariate performance measures such as the F1-score, the ordinal regression loss, and the ROCArea, which are widely used in some application areas. In this case Remp does not decompose into a sum of losses over individual data points as in (1), and hence one has to employ batch algorithms. Letting ∆(y, ȳ) denote the multivariate discrepancy between the correct labels y := (y1, …, yn)ᵀ and a candidate labeling ȳ (to be concretized later), the Remp for the multivariate measure is formulated by [3] as

¹In this paper we use the term cutting plane methods to denote specialized solvers employed in machine learning. While clearly related, they must not be confused with the cutting plane methods used in optimization.


    Remp(w) = max_{ȳ∈{−1,1}ⁿ} [ ∆(y, ȳ) + (1/n) Σ_{i=1}^n ⟨w, xi⟩ (ȳi − yi) ].   (2)

In another award-winning paper by Joachims [3], the regularized risk minimization problems corresponding to these measures are optimized by using a CPM. Given the widespread use of CPM in machine learning, it is important to understand their convergence guarantees in terms of the upper and lower bounds on the number of iterations needed to converge to an ε accurate solution. The tightest, O(1/ε), upper bound on the convergence speed of CPM is due to Teo et al. [2], who analyzed a restricted version of BMRM which only optimizes over one dual variable per iteration. However, on practical problems the observed rate of convergence is significantly faster than predicted by theory. Therefore, it had been conjectured that the upper bounds might be further tightened via a more refined analysis. In this paper we construct counter examples for both decomposable Remp as in equation (1) and non-decomposable Remp as in equation (2), on which CPM require Ω(1/ε) iterations to converge, thus disproving this conjecture². We will work with BMRM as our prototypical CPM. As Teo et al. [2] point out, BMRM includes many other CPM such as SVM-Perf as special cases.

Our results lead to the following natural question: Do the lower bounds hold because regularized risk minimization problems are fundamentally hard, or is this an inherent limitation of CPM? In other words, to solve problems such as (1), does there exist a solver which requires less than O(nd/ε) effort (better in n, d, and ε)? We provide partial answers. To understand our contribution one needs to understand the two standard assumptions that are made when proving convergence rates:

• A1: The data points xi lie inside an L2 (Euclidean) ball of radius R, that is, ‖xi‖ ≤ R.
• A2: The subgradient of Remp is bounded, i.e., at any point w, there exists a subgradient g of Remp such that ‖g‖ ≤ G < ∞.

Clearly assumption A1 is more restrictive than A2. By adapting a result due to [6] we show that one can devise an O(nd/√ε) algorithm for the case when assumption A1 holds. Finding a fast optimizer under assumption A2 remains an open problem.

Notation: Lower bold case letters (e.g., w, µ) denote vectors, wi denotes the i-th component of w, 0 refers to the vector with all zero components, ei is the i-th coordinate vector (all 0's except 1 at the i-th coordinate), and ∆k refers to the k-dimensional simplex. Unless specified otherwise, ⟨·, ·⟩ denotes the Euclidean dot product ⟨x, w⟩ = Σᵢ xᵢwᵢ, and ‖·‖ refers to the Euclidean norm ‖w‖ := ⟨w, w⟩^{1/2}. We denote R̄ := R ∪ {∞}, and [t] := {1, …, t}.

Our paper is structured as follows. We briefly review BMRM in Section 2. Two types of lower bounds are subsequently defined in Section 3, and Section 4 contains descriptions of the various counter examples that we construct. In Section 5 we describe an algorithm which provably converges to an ε accurate solution of (1) in O(1/√ε) iterations under assumption A1. The paper concludes with a discussion and outlook in Section 6. Technical proofs and a ready reckoner of the convex analysis concepts used in the paper can be found in the supplementary material.

²Because of the specialized nature of these solvers, lower bounds for general convex optimizers such as those studied by Nesterov [4] and Nemirovski and Yudin [5] do not apply.

2 BMRM

At every iteration, BMRM replaces Remp by a piecewise linear lower bound R_k^cp and optimizes [2]

    min_w Jk(w) := (λ/2)‖w‖² + R_k^cp(w),  where  R_k^cp(w) := max_{1≤i≤k} ⟨w, ai⟩ + bi,   (3)

to obtain the next iterate wk. Here ai ∈ ∂Remp(w_{i−1}) denotes an arbitrary subgradient of Remp at w_{i−1} and bi = Remp(w_{i−1}) − ⟨w_{i−1}, ai⟩. The piecewise linear lower bound is successively tightened until the gap

    εk := min_{0≤t≤k} J(wt) − Jk(wk)   (4)

falls below a predefined tolerance ε. Since Jk in (3) is a convex objective function, one can compute its dual. Instead of minimizing Jk with respect to w, one can equivalently maximize the dual [2]:

    Dk(α) = −(1/(2λ))‖Ak α‖² + ⟨bk, α⟩,   (5)


Algorithm 1: qp-bmrm, solving the inner loop of BMRM exactly via a full QP.
Require: Previous subgradients {ai}_{i=1}^k and intercepts {bi}_{i=1}^k.
1: Set Ak := (a1, …, ak), bk := (b1, …, bk)ᵀ.
2: αk ← argmax_{α∈∆k} { −(1/(2λ))‖Ak α‖² + ⟨α, bk⟩ }.
3: return wk = −λ⁻¹ Ak αk.

Algorithm 2: ls-bmrm, solving the inner loop of BMRM approximately via line search.
Require: Previous subgradients {ai}_{i=1}^k and intercepts {bi}_{i=1}^k.
1: Set Ak := (a1, …, ak), bk := (b1, …, bk)ᵀ.
2: Set α(η) := (η α_{k−1}ᵀ, 1 − η)ᵀ.
3: ηk ← argmax_{η∈[0,1]} { −(1/(2λ))‖Ak α(η)‖² + ⟨α(η), bk⟩ }.
4: αk ← (ηk α_{k−1}ᵀ, 1 − ηk)ᵀ.
5: return wk = −λ⁻¹ Ak αk.
and set αk = argmax_{α∈∆k} Dk(α). Note that Ak and bk in (5) are defined in Algorithm 1. Since maximizing Dk(α) is a quadratic programming (QP) problem, we call this algorithm qp-bmrm. Pseudo-code can be found in Algorithm 1.

Note that at iteration k the dual Dk(α) is a QP with k variables. As the number of iterations increases, the size of the QP also increases. In order to avoid the growing cost of the dual optimization at each iteration, [2] proposed using a one-dimensional line search to calculate an approximate maximizer αk on the line segment {(η α_{k−1}ᵀ, 1 − η)ᵀ : η ∈ [0, 1]}; we call this variant ls-bmrm. Pseudo-code can be found in Algorithm 2. We refer the reader to [2] for details.

Even though qp-bmrm solves a more expensive optimization problem Dk(α) per iteration, Teo et al. [2] could only show that both variants of BMRM converge at O(1/ε) rates:

Theorem 1 ([2]) Suppose assumption A2 holds. Then for any ε < 4G²/λ, both ls-bmrm and qp-bmrm converge to an ε accurate solution of (1) as measured by (4) after at most the following number of steps:

    log₂( λJ(0) / G² ) + 8G² / (λε) − 1.

Generality of BMRM. Thanks to the formulation in (3), which only uses Remp, BMRM is applicable to a wide variety of Remp. For example, when used to train binary SVMs with Remp specified by (1), it yields exactly the SVM-Perf algorithm [1]. When applied to optimize a multivariate score, e.g. the F1-score with Remp specified by (2), it immediately leads to the optimizer given by [3].
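The defining property behind (3)-(5) is that each cut ⟨w, ai⟩ + bi, and hence the model R_k^cp, never overestimates the convex risk Remp; the analysis of [2] rests on this. The sketch below checks it numerically on a random hinge-loss instance (the data, dimensions, and sample points are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.choice([-1.0, 1.0], size=20)

def remp(w):
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

def remp_subgrad(w):
    active = (1.0 - y * (X @ w)) > 0            # examples with positive hinge
    return -(y[active, None] * X[active]).sum(axis=0) / len(y)

# cuts (a_i, b_i) taken at arbitrary points w_{i-1}, as in (3)
cuts = []
for _ in range(10):
    wi = rng.normal(size=5)
    a = remp_subgrad(wi)
    cuts.append((a, remp(wi) - a @ wi))

def r_cp(w):
    """Piecewise linear model R_k^cp(w) = max_i <w, a_i> + b_i."""
    return max(a @ w + b for a, b in cuts)

for _ in range(100):
    w = rng.normal(size=5)
    assert r_cp(w) <= remp(w) + 1e-9            # model lower-bounds Remp
print("R_k^cp lower-bounds Remp at all sampled points")
```

The inequality is just the subgradient inequality applied to each cut, so it holds for any choice of anchor points w_{i−1}.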

3 Upper and Lower Bounds

Since most rates of convergence discussed in the machine learning community are upper bounds, it is important to rigorously define the meaning of a lower bound with respect to ε, and to study its relationship with the upper bounds. At this juncture it is also important to clarify an important technical point. Instead of minimizing the objective function J(w) defined in (1), if we minimize a scaled version cJ(w), this scales the approximation gap (4) by c. Assumptions such as A1 and A2 fix this degree of freedom by bounding the scale of the objective function.

Given a function f ∈ F and an optimization algorithm A, suppose {wk} are the iterates produced by the algorithm A when minimizing f. Define T(ε; f, A) as the first step index k at which wk becomes an ε accurate solution³:

    T(ε; f, A) = min { k : f(wk) − min_w f(w) ≤ ε }.   (6)

Upper and lower bounds are both properties of a pair F and A. A function g(ε) is called an upper bound of (F, A) if for all functions f ∈ F and all ε > 0, it takes at most order g(ε) steps for A to reduce the gap to less than ε, i.e.,

    (UB)  ∀ ε > 0, ∀ f ∈ F, T(ε; f, A) ≤ g(ε).   (7)

³The initial point also matters, as in the best case we can just start from the optimal solution. Thus the quantity of interest is actually T(ε; f, A) := max_{w0} min{k : f(wk) − min_w f(w) ≤ ε, starting point being w0}. However, without loss of generality we assume some pre-specified way of initialization.


                 Assuming A1                    Assuming A2
Algorithm   UB        SLB       WLB        UB       SLB      WLB
ls-bmrm     O(1/ε)    Ω(1/ε)    Ω(1/ε)     O(1/ε)   Ω(1/ε)   Ω(1/ε)
qp-bmrm     O(1/ε)    open      open       O(1/ε)   open     Ω(1/ε)
Nesterov    O(1/√ε)   Ω(1/√ε)   Ω(1/√ε)    n/a      n/a      n/a

Table 1: Summary of the known upper bounds and our lower bounds. Note: A1 ⇒ A2, but not vice versa. SLB ⇒ WLB, but not vice versa. A UB is tight if it matches the WLB.

On the other hand, lower bounds can be defined in two different ways depending on how the above two universal quantifiers are flipped to existential quantifiers.

• Strong lower bound (SLB): h(ε) is called a SLB of (F, A) if there exists a function f̃ ∈ F such that for all ε > 0 it takes at least h(ε) steps for A to find an ε accurate solution of f̃:

    (SLB)  ∃ f̃ ∈ F, s.t. ∀ ε > 0, T(ε; f̃, A) ≥ h(ε).   (8)

• Weak lower bound (WLB): h(ε) is called a WLB of (F, A) if for any ε > 0 there exists a function f_ε ∈ F, depending on ε, such that it takes at least h(ε) steps for A to find an ε accurate solution of f_ε:

    (WLB)  ∀ ε > 0, ∃ f_ε ∈ F, s.t. T(ε; f_ε, A) ≥ h(ε).   (9)

Clearly, the existence of a SLB implies a WLB. However, it is usually much harder to establish SLBs than WLBs. Fortunately, WLBs are sufficient to refute upper bounds or to establish their tightness. The size of the function class F affects the upper and lower bounds in opposite ways. Suppose F′ ⊂ F. Proving upper (resp. lower) bounds on (F′, A) is usually easier (resp. harder) than proving upper (resp. lower) bounds for (F, A).

4 Constructing Lower Bounds

Letting the minimizer of J(w) be w*, we are interested in bounding the primal gap of the iterates wk: J(wk) − J(w*). We will explicitly construct datasets whose resulting objective J(w) attains the lower bounds for the respective algorithms. The Remp for both the hinge loss in (1) and the F1-score in (2) will be covered; our results are summarized in Table 1. Note that as assumption A1 implies A2 and SLB implies WLB, some entries of the table imply others.

4.1 Strong Lower Bounds for Solving Linear SVMs using ls-bmrm

We first prove the Ω(1/ε) lower bound for ls-bmrm on SVM problems under assumption A1. Consider a one dimensional training set with four examples: (x1, y1) = (−1, −1), (x2, y2) = (−1/2, −1), (x3, y3) = (1/2, 1), (x4, y4) = (1, 1). Setting λ = 1/16, the regularized risk (1) can be written as (using w instead of w as it is now a scalar):

    min_{w∈R} J(w) = (1/32) w² + (1/2)[1 − w/2]₊ + (1/2)[1 − w]₊.   (10)

The minimizer of J(w) is w* = 2, which can be verified by the fact that 0 is in the subdifferential of J at w*: 0 ∈ ∂J(2) = { 2/16 − (1/2)(1/2)α : α ∈ [0, 1] }. So J(w*) = 1/8. Choosing w0 = 0, we have

Theorem 2 lim_{k→∞} k (J(wk) − J(w*)) = 1/4, i.e. J(wk) converges to J(w*) at a 1/k rate.

The proof relies on two lemmata. The first shows that the iterates generated by ls-bmrm on J(w) satisfy the following recursive relations.

Lemma 3 For k ≥ 1, the following recursive relations hold true:

    w_{2k+1} = 2 + 8α_{2k−1,1} / ( w_{2k−1} (w_{2k−1} + 4α_{2k−1,1}) ) > 2,  and
    w_{2k}   = 2 − 8α_{2k−1,1} (w_{2k−1} − 4α_{2k−1,1}) / w_{2k−1} ∈ (1, 2),   (11)

    α_{2k+1,1} = ( (w_{2k−1}² + 16α_{2k−1,1}²) / (w_{2k−1} + 4α_{2k−1,1})² ) α_{2k−1,1},   (12)

where α_{2k+1,1} is the first coordinate of α_{2k+1}.

The proof is lengthy and is relegated to Appendix B. These recursive relations allow us to derive the convergence rate of α_{2k−1,1} and wk (see the proof in Appendix C):

Lemma 4 lim_{k→∞} k α_{2k−1,1} = 1/4. Combining with (11), we get lim_{k→∞} k |2 − wk| = 2.

Now that wk approaches 2 at the rate of O(1/k), it is straightforward to translate this into the rate at which J(wk) approaches J(w*). See the proof of Theorem 2 in Appendix D.
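The 1/k rate of Theorem 2 can also be observed by simply running ls-bmrm (Algorithm 2) on the objective (10). The sketch below maintains the scalars s_k = A_k α_k and β_k = ⟨α_k, b_k⟩, so each line search is a closed-form scalar quadratic in η. The particular subgradient choices at the kinks of the hinges are ours (they match the cuts used in Appendix B), and the code is an illustrative simulation, not the paper's implementation:

```python
lam = 1.0 / 16.0

def remp(w):  # empirical risk of (10)
    return 0.5 * max(0.0, 1.0 - w / 2.0) + 0.5 * max(0.0, 1.0 - w)

def subgrad(w):  # one subgradient of remp at w
    g = 0.0
    if 1.0 - w / 2.0 > 0:
        g -= 0.25
    if 1.0 - w > 0:
        g -= 0.5
    return g

def J(w):
    return 0.5 * lam * w * w + remp(w)

w, s, beta = 0.0, 0.0, 0.0   # w_0 = 0; s = A_k alpha_k, beta = <alpha_k, b_k>
K = 5000
for k in range(1, K + 1):
    a = subgrad(w)
    b = remp(w) - a * w
    if k == 1:
        s, beta = a, b       # alpha_1 = (1)
    else:
        # maximize D(eta) = -(1/(2 lam))(eta*s + (1-eta)*a)^2
        #                   + eta*beta + (1-eta)*b over eta in [0, 1]
        u = s - a
        if abs(u) < 1e-15:
            eta = 1.0 if beta >= b else 0.0
        else:
            eta = min(1.0, max(0.0, (lam * (beta - b) - u * a) / (u * u)))
        s = eta * s + (1.0 - eta) * a
        beta = eta * beta + (1.0 - eta) * b
    w = -s / lam             # w_k = -lam^{-1} A_k alpha_k

gap = J(w) - 0.125           # J(w*) = 1/8
print(w, K * gap)            # w near 2; K*gap near 1/4 per Theorem 2
```

The first few iterates (w1 = 12, w2 = 4/3, w3 = 7/3, …) oscillate around w* = 2 with the even/odd pattern of Lemma 3.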

4.2 Weak Lower Bounds for Solving Linear SVMs using qp-bmrm

Theorem 1 gives an upper bound on the convergence rate of qp-bmrm, assuming that Remp satisfies assumption A2. In this section we further demonstrate that this O(1/ε) rate is also a WLB (hence tight), even when Remp is specialized to SVM objectives satisfying A2.

Given ε > 0, define n = ⌈1/ε⌉ and construct a dataset {(xi, yi)}_{i=1}^n as yi = (−1)^i and xi = (−1)^i (n e_{i+1} + √n e1) ∈ R^{n+1}. Then the corresponding objective function (1) is

    J(w) = ‖w‖²/2 + Remp(w),  where  Remp(w) = (1/n) Σ_{i=1}^n [1 − yi⟨w, xi⟩]₊ = (1/n) Σ_{i=1}^n [1 − √n w1 − n w_{i+1}]₊.   (13)

It is easy to see that the minimizer is w* = (1/2)(1/√n, 1/n, 1/n, …, 1/n)ᵀ and J(w*) = 1/(4n). Indeed, simply check that yi⟨w*, xi⟩ = 1, so ∂J(w*) = { w* − ( (1/√n) Σ_{i=1}^n αi, α1, …, αn )ᵀ : αi ∈ [0, 1] }, and setting all αi = 1/(2n) yields the subgradient 0. Our key result is the following theorem.

Theorem 5 Let w0 = (1/√n, 0, 0, …)ᵀ. Suppose running qp-bmrm on the objective function (13) produces iterates w1, …, wk, …. Then it takes qp-bmrm at least 2/(3ε) steps to find an ε accurate solution. Formally,

    min_{i∈[k]} J(wi) − J(w*) = 1/(2k) + 1/(4n)  for all k ∈ [n],  hence  min_{i∈[k]} J(wi) − J(w*) > ε  for all k < 2/(3ε).

Indeed, after taking n steps, wn will cut a subgradient a_{n+1} = 0 and b_{n+1} = 0, and then the minimizer of J_{n+1}(w) gives exactly w*.

Proof Since Remp(w0) = 0 and ∂Remp(w0) = { −(1/n) Σ_{i=1}^n αi yi xi : αi ∈ [0, 1] }, we can choose

    a1 = −(1/n) y1 x1 = ( −1/√n, −1, 0, … )ᵀ,  b1 = Remp(w0) − ⟨a1, w0⟩ = 0 + 1/n = 1/n,  and
    w1 = argmin_w { ‖w‖²/2 − (1/√n) w1 − w2 + 1/n } = ( 1/√n, 1, 0, … )ᵀ.

In general, we claim that the k-th iterate wk produced by qp-bmrm is given by

    wk = ( 1/√n, 1/k, …, 1/k, 0, … )ᵀ  (k copies of 1/k).

We prove this claim by induction on k. Assume the claim holds true for steps 1, …, k. Then it is easy to check that Remp(wk) = 0 and ∂Remp(wk) = { −(1/n) Σ_{i=k+1}^n αi yi xi : αi ∈ [0, 1] }. Thus we can again choose

    a_{k+1} = −(1/n) y_{k+1} x_{k+1},  and  b_{k+1} = Remp(wk) − ⟨a_{k+1}, wk⟩ = 1/n,  so
    w_{k+1} = argmin_w { ‖w‖²/2 + max_{1≤i≤k+1} { ⟨ai, w⟩ + bi } } = ( 1/√n, 1/(k+1), …, 1/(k+1), 0, … )ᵀ  (k+1 copies of 1/(k+1)),

which can be verified by checking that ∂J_{k+1}(w_{k+1}) = { w_{k+1} + Σ_{i∈[k+1]} αi ai : α ∈ ∆_{k+1} } ∋ 0. All that remains is to observe that J(wk) = 1/(2k) + 1/(2n) while J(w*) = 1/(4n), from which it follows that J(wk) − J(w*) = 1/(2k) + 1/(4n) as claimed.
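The closed-form iterates in the proof are easy to verify numerically. The sketch below builds the dataset of (13) for one fixed n (chosen arbitrarily), checks J(w*) = 1/(4n), and confirms J(wk) − J(w*) = 1/(2k) + 1/(4n) for the claimed wk:

```python
import numpy as np

n = 50
# dataset of (13): x_i = (-1)^i (n e_{i+1} + sqrt(n) e_1), y_i = (-1)^i
X = np.zeros((n, n + 1))
y = np.empty(n)
for i in range(1, n + 1):
    sign = (-1.0) ** i
    y[i - 1] = sign
    X[i - 1, 0] = sign * np.sqrt(n)   # sqrt(n) e_1 component
    X[i - 1, i] = sign * n            # n e_{i+1} component (0-based column i)

def J(w):
    return 0.5 * (w @ w) + np.maximum(0.0, 1.0 - y * (X @ w)).mean()

w_star = np.concatenate(([1.0 / np.sqrt(n)], np.full(n, 1.0 / n))) / 2.0
assert abs(J(w_star) - 1.0 / (4 * n)) < 1e-12

for k in range(1, n + 1):
    wk = np.zeros(n + 1)
    wk[0] = 1.0 / np.sqrt(n)
    wk[1:k + 1] = 1.0 / k
    # Theorem 5: J(w_k) - J(w*) = 1/(2k) + 1/(4n)
    assert abs(J(wk) - J(w_star) - (1.0 / (2 * k) + 1.0 / (4 * n))) < 1e-10
print("closed-form iterates of Theorem 5 verified for n =", n)
```

Note that every hinge term is exactly zero at wk and w*, so the gap is driven purely by the slowly shrinking regularizer term, which is what makes the instance hard for qp-bmrm.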

As an aside, the subgradient of the Remp in (13) does have Euclidean norm √(2n) at w = 0. However, in the above run of qp-bmrm, ∂Remp(w0), …, ∂Remp(wn) always contains a subgradient with norm 1. So if we restrict the feasible region to {n^{−1/2}} × [0, ∞)^n, then J(w) does satisfy assumption A2 and the optimal solution does not change. This is essentially a local satisfaction of A2. In fact, having a bounded subgradient of Remp at all wk is sufficient for qp-bmrm to converge at the rate in Theorem 1.

However, when we assume A1, which is more restrictive than A2, it remains an open question whether the O(1/ε) rates are optimal for qp-bmrm on SVM objectives. Also left open is the SLB for qp-bmrm on SVMs.

4.3 Weak Lower Bounds for Optimizing F1-score using qp-bmrm

The F1-score is defined via the contingency table: F1(ȳ, y) := 2a/(2a + b + c), where

                y = 1    y = −1
    ȳ = 1        a        b
    ȳ = −1       c        d

Given ε > 0, define n = ⌈1/ε⌉ + 1 and construct a dataset {(xi, yi)}_{i=1}^n as follows: xi = −(n/(2√3)) e1 − (n/2) e_{i+1} ∈ R^{n+1} with yi = −1 for all i ∈ [n−1], and xn = (√3 n/2) e1 + (n/2) e_{n+1} ∈ R^{n+1} with yn = +1. So there is only one positive training example. Then the corresponding objective function is

    J(w) = ‖w‖²/2 + max_ȳ [ 1 − F1(y, ȳ) + (1/n) Σ_{i=1}^n yi ⟨w, xi⟩ (yi ȳi − 1) ].   (14)

Theorem 6 Let w0 = (1/√3) e1. Then qp-bmrm takes at least 1/(3ε) steps to find an ε accurate solution:

    J(wk) − min_w J(w) ≥ (1/2)( 1/k − 1/(n−1) )  ∀ k ∈ [n−1],  hence  min_{i∈[k]} J(wi) − min_w J(w) > ε  ∀ k < 1/(3ε).

Proof A rigorous proof can be found in Appendix E; we provide a sketch here. The crux is to show that

    wk = ( 1/√3, 1/k, …, 1/k, 0, … )ᵀ  (k copies of 1/k)  ∀ k ∈ [n−1].   (15)

We prove (15) by induction. Assume it holds for steps 1, …, k. Then at step k + 1 we have

    (1/n) yi ⟨wk, xi⟩ = { 1/6 + 1/(2k)  if i ∈ [k];   1/6  if k+1 ≤ i ≤ n−1;   1/2  if i = n }.   (16)

For convenience, define the term in the max in (14) as

    Υk(ȳ) := 1 − F1(y, ȳ) + (1/n) Σ_{i=1}^n yi ⟨wk, xi⟩ (yi ȳi − 1).

Then it is not hard to see that the following assignments of ȳ (among others) maximize Υk: a) the correct labeling, b) misclassifying only the positive training example xn (i.e., ȳn = −1), c) misclassifying only one negative training example among x_{k+1}, …, x_{n−1} into positive. And Υk equals 0 at all these assignments. For a proof, consider two cases. If ȳ misclassifies the positive training example, then F1(y, ȳ) = 0, and by (16) we have

    Υk(ȳ) = 1 − 0 + (1/n) Σ_{i=1}^{n−1} yi ⟨wk, xi⟩ (yi ȳi − 1) + (1/2)(−1 − 1)
          = ( (k+3)/(6k) ) Σ_{i=1}^{k} (yi ȳi − 1) + (1/6) Σ_{i=k+1}^{n−1} (yi ȳi − 1) ≤ 0.

Suppose ȳ correctly labels the positive example, but misclassifies t1 examples among x1, …, xk and t2 examples among x_{k+1}, …, x_{n−1} (into positive). Then F1(y, ȳ) = 2/(2 + t1 + t2), and

    Υk(ȳ) = 1 − 2/(2 + t1 + t2) − (1/3 + 1/k) t1 − (1/3) t2
          = t/(2 + t) − (1/3 + 1/k) t1 − (1/3) t2 ≤ (t − t²)/(3(2 + t)) ≤ 0   (t := t1 + t2).

So we can pick ȳ as (−1, …, −1, +1, −1, …, −1, +1)ᵀ (with the first +1 in position k+1), which misclassifies only x_{k+1}, and get

    a_{k+1} = −(2/n) y_{k+1} x_{k+1} = −(1/√3) e1 − e_{k+2},   b_{k+1} = Remp(wk) − ⟨a_{k+1}, wk⟩ = 0 + 1/3 = 1/3,
    w_{k+1} = argmin_w { ‖w‖²/2 + max_{i∈[k+1]} { ⟨ai, w⟩ + bi } } = ( 1/√3, 1/(k+1), …, 1/(k+1), 0, … )ᵀ  (k+1 copies of 1/(k+1)),

which can be verified by ∂J_{k+1}(w_{k+1}) = { w_{k+1} + Σ_{i=1}^{k+1} αi ai : α ∈ ∆_{k+1} } ∋ 0 (just set all αi = 1/(k+1)). So (15) holds for step k + 1, completing the induction.

All that remains is to observe that J(wk) = (1/2)(1/3 + 1/k) while min_w J(w) ≤ J(w_{n−1}) = (1/2)(1/3 + 1/(n−1)), from which it follows that J(wk) − min_w J(w) ≥ (1/2)(1/k − 1/(n−1)) as claimed in Theorem 6.
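For small n, the claim that Remp(wk) = max_ȳ Υk(ȳ) = 0 can be verified by brute force over all 2ⁿ labelings. A sketch, with n and k chosen small enough for enumeration (the values are ours, for illustration only):

```python
import itertools
import numpy as np

n, k = 6, 2
# dataset of Section 4.3: x_i = -(n/(2 sqrt(3))) e_1 - (n/2) e_{i+1}, y_i = -1
# for i in [n-1], and x_n = (sqrt(3) n/2) e_1 + (n/2) e_{n+1}, y_n = +1
X = np.zeros((n, n + 1))
y = -np.ones(n)
for i in range(n - 1):
    X[i, 0] = -n / (2.0 * np.sqrt(3.0))
    X[i, i + 1] = -n / 2.0
X[n - 1, 0] = np.sqrt(3.0) * n / 2.0
X[n - 1, n] = n / 2.0
y[n - 1] = 1.0

wk = np.zeros(n + 1)
wk[0] = 1.0 / np.sqrt(3.0)
wk[1:k + 1] = 1.0 / k                    # the iterate of (15)

def f1(ybar):
    a = np.sum((ybar == 1) & (y == 1))   # true positives
    b = np.sum((ybar == 1) & (y == -1))  # false positives
    c = np.sum((ybar == -1) & (y == 1))  # false negatives
    return 2.0 * a / (2.0 * a + b + c) if a > 0 else 0.0

margins = y * (X @ wk) / n               # (1/n) y_i <w_k, x_i>, cf. (16)
vals = []
for ybar in itertools.product([-1.0, 1.0], repeat=n):
    yb = np.array(ybar)
    vals.append(1.0 - f1(yb) + np.sum(margins * (y * yb - 1.0)))
assert abs(max(vals)) < 1e-9             # Remp(w_k) = max Upsilon_k = 0
print("max Upsilon_k over all", 2 ** n, "labelings:", max(vals))
```

The maximizers found this way are exactly the assignments a)-c) listed in the proof sketch.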

5 An O(nd/√ε) Algorithm for Training Binary Linear SVMs

The lower bounds we proved above show that CPM such as BMRM require Ω(1/ε) iterations to converge. We now show that this is an inherent limitation of CPM and not an artifact of the problem. To demonstrate this, we will show that one can devise an algorithm for problems (1) and (2) which converges in O(1/√ε) iterations. The key difficulty stems from the non-smoothness of the objective function, which renders second and higher order algorithms such as L-BFGS inapplicable. However, thanks to Theorem 11 (see Appendix A), the Fenchel dual of (1) is a convex smooth function with a Lipschitz continuous gradient, which is easy to optimize. To formalize the idea of using the Fenchel dual, we can abstract from the objectives (1) and (2) a composite form of objective function used in machine learning with linear models:

    min_{w∈Q1} J(w) = f(w) + g*(Aw),  where Q1 is a closed convex set.   (17)

Here, f(w) is a strongly convex function corresponding to the regularizer, Aw stands for the output of a linear model, and g* encodes the empirical risk measuring the discrepancy between the correct labels and the output of the linear model. Let the domain of g be Q2. It is well known [e.g. 7, Theorem 3.3.5] that under some mild constraint qualifications the adjoint form of J(w),

    D(α) = −g(α) − f*(−Aᵀα),  α ∈ Q2,   (18)

satisfies J(w) ≥ D(α) and inf_{w∈Q1} J(w) = sup_{α∈Q2} D(α).

Example 1: binary SVMs with bias. Let A := −Y Xᵀ where Y := diag(y1, …, yn) and X := (x1, …, xn), f(w) = (λ/2)‖w‖², and g*(u) = min_{b∈R} (1/n) Σ_{i=1}^n [1 + ui − yi b]₊, which corresponds to g(α) = −Σᵢ αi. Then the adjoint form turns out to be the well known SVM dual objective function:

    D(α) = Σᵢ αi − (1/(2λ)) αᵀ Y Xᵀ X Y α,  α ∈ Q2 = { α ∈ [0, n⁻¹]ⁿ : Σᵢ yi αi = 0 }.   (19)

Example 2: multivariate scores. Denote by A the 2ⁿ-by-d matrix whose ȳ-th row is (1/n) Σ_{i=1}^n xiᵀ (ȳi − yi) for each ȳ ∈ {−1, +1}ⁿ, and let f(w) = (λ/2)‖w‖² and g*(u) = max_ȳ { ∆(y, ȳ) + u_ȳ }, which corresponds to g(α) = −n Σ_ȳ ∆(y, ȳ) α_ȳ. Then we recover the primal objective (2) for multivariate performance measures. Its adjoint form is

    D(α) = −(1/(2λ)) αᵀ A Aᵀ α + n Σ_ȳ ∆(y, ȳ) α_ȳ,  α ∈ Q2 = { α ∈ [0, n⁻¹]^{2ⁿ} : Σ_ȳ α_ȳ = 1/n }.   (20)

In a series of papers [6, 8, 9], Nesterov developed optimal gradient based methods for minimizing composite objectives with primal (17) and adjoint (18). A sequence of wk and αk is produced such that, under assumption A1, the duality gap J(wk) − D(αk) is reduced to less than ε after at most k = O(1/√ε) steps. We refer the readers to [8] for details.

5.1 Efficient Projections in Training SV Models with Optimal Gradient Methods

However, applying Nesterov's algorithm is challenging, because it requires an efficient subroutine for computing projections onto the set of constraints Q2. This projection can be either a Euclidean projection or a Bregman projection.

Example 1: binary SVMs with bias. In this case we need to compute the Euclidean projection onto the Q2 defined in (19), which entails solving a quadratic programming problem with a diagonal Hessian, many box constraints, and a single equality constraint. We present an O(n) algorithm for this task in Appendix F. Plugging this into the algorithm described in [8], and noting that all intermediate steps of the algorithm can be computed in O(nd) time, directly yields an O(nd/√ε) algorithm. A self-contained description of the algorithm is given in Appendix G.

Example 2: multivariate scores. Since the dimension of Q2 in (20) is exponential in n, Euclidean projection is intractable and we resort to Bregman projection. Given a differentiable convex function F on Q2, a point α, and a direction g, the Bregman projection is defined as:

    V(α, g) := argmin_{ᾱ∈Q2} F(ᾱ) − ⟨∇F(α) − g, ᾱ⟩.

Scaling up α by a factor of n, we can choose F(α) to be the negative entropy F(α) = Σᵢ αi log αi. Then the application of the algorithm in [8] maintains a distribution over all possible labelings:

    p(ȳ; w) ∝ exp( c ∆(ȳ, y) + Σᵢ ai ⟨xi, w⟩ ȳi ),  where c and the ai are constant scalars.   (21)

The solver will request the expectation E_ȳ [ Σᵢ ai xi ȳi ], which in turn requires the marginal distributions p(ȳi). This is not as straightforward as in graphical models because ∆(ȳ, y) may not decompose. Fortunately, for multivariate scores defined by contingency tables, it is possible to compute the marginals in O(n²) time by using dynamic programming, and this cost is similar to that of the algorithm proposed by [3]. The idea of the dynamic programming can be found in Appendix H.
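Appendix F gives the O(n) algorithm for the Euclidean projection of Example 1. As a simpler (though slower) illustrative alternative, the same projection can be computed by bisection on the Lagrange multiplier θ of the single equality constraint; this sketch is ours, not the algorithm from the appendix:

```python
import numpy as np

def project(z, y, C, tol=1e-12):
    """Euclidean projection of z onto {a in [0, C]^n : <y, a> = 0}.

    KKT conditions give a_i(theta) = clip(z_i - theta*y_i, 0, C);
    h(theta) = <y, a(theta)> is nonincreasing in theta, so the root
    can be bracketed and found by bisection.
    """
    def h(theta):
        return y @ np.clip(z - theta * y, 0.0, C)
    hi = np.abs(z).max() + C + 1.0
    lo = -hi                     # h(lo) >= 0 >= h(hi) by construction
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return np.clip(z - 0.5 * (lo + hi) * y, 0.0, C)

rng = np.random.default_rng(2)
n = 10
y = rng.choice([-1.0, 1.0], size=n)
z = rng.normal(size=n)
a = project(z, y, C=1.0 / n)
assert np.all(a >= -1e-9) and np.all(a <= 1.0 / n + 1e-9)   # box
assert abs(y @ a) < 1e-6                                     # equality
# the projection is no farther from z than other feasible points
for _ in range(100):
    f = project(rng.normal(size=n), y, 1.0 / n)  # an arbitrary feasible point
    assert np.linalg.norm(z - a) <= np.linalg.norm(z - f) + 1e-6
print("projection feasible and closest among sampled feasible points")
```

The bisection costs O(n log(1/tol)) per projection; the point of Appendix F is that the multiplier can in fact be located in O(n) time, which is what the overall O(nd/√ε) complexity relies on.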

6 Outlook and Conclusion

CPM are widely employed in machine learning, especially in the context of structured prediction [16]. While upper bounds on their rates of convergence were known, lower bounds had not been studied before. In this paper we set out to fill this gap by exhibiting counter examples on which CPM require Ω(1/ε) iterations. This is a fundamental limitation of these algorithms and not an artifact of the problem. We show this by devising an O(1/√ε) algorithm borrowing techniques from [8]. However, this algorithm assumes that the dataset is contained in a ball of bounded radius (assumption A1, Section 1). Devising an O(1/√ε) algorithm under the less restrictive assumption A2 remains an open problem.

It is important to note that the linear time algorithm in Appendix F is the key to obtaining the O(nd/√ε) computational complexity for binary SVMs with bias mentioned in Section 5.1. However, this method has been rediscovered independently by many authors (including us), with the earliest known reference, to the best of our knowledge, being [10] in 1990. Some recent work in optimization [11] has focused on improving the practical performance, while in machine learning [12] gave an exact projection algorithm in linear time and [13] gave an expected linear time algorithm for the same problem.

Choosing an optimizer for a given machine learning task is a trade-off between a number of potentially conflicting requirements. CPM are one popular choice, but there are others. If one is interested in classification accuracy alone, without requiring deterministic guarantees, then online to batch conversion techniques combined with stochastic subgradient descent are a good choice [17]. While the dependence on ε is still Ω(1/ε) or worse [18], one gets bounds independent of n. However, as we pointed out earlier, these algorithms are applicable only when the empirical risk decomposes over the examples. On the other hand, one can employ coordinate descent in the dual, as is done in the Sequential Minimal Optimization (SMO) algorithm of [19]. However, as [20] show, if the kernel matrix XᵀX, obtained by stacking the xi into a matrix X, is not strictly positive definite, then SMO requires O(1/ε) iterations, with each iteration costing O(nd) effort. When the kernel matrix is strictly positive definite, one can obtain an O(n² log(1/ε)) bound on the number of iterations, which has a better dependence on ε, but is prohibitively expensive for large n.

References

[1] T. Joachims. Training linear SVMs in linear time. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD). ACM, 2006.
[2] Choon Hui Teo, S. V. N. Vishwanathan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11:311–365, January 2010.
[3] T. Joachims. A support vector method for multivariate performance measures. In Proc. Intl. Conf. Machine Learning, pages 377–384, San Francisco, California, 2005. Morgan Kaufmann Publishers.
[4] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.
[5] Arkadi Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.
[6] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Soviet Math. Dokl., 269:543–547, 1983.
[7] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Canadian Mathematical Society, 2000.
[8] Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. on Optimization, 16(1):235–249, 2005.
[9] Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, CORE Discussion Paper, UCL, 2007.
[10] P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46:321–328, 1990.
[11] Y.-H. Dai and R. Fletcher. New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Mathematical Programming: Series A and B, 106(3):403–421, 2006.
[12] Jun Liu and Jieping Ye. Efficient Euclidean projections in linear time. In Proc. Intl. Conf. Machine Learning. Morgan Kaufmann, 2009.
[13] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. Intl. Conf. Machine Learning, 2008.
[14] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 25–32, Cambridge, MA, 2004. MIT Press.
[15] Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, and Peter Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res., 9:1775–1822, 2008.
[16] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
[17] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. Intl. Conf. Machine Learning, 2007.
[18] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Neural Information Processing Systems, 2009.
[19] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[20] Nikolas List and Hans Ulrich Simon. SVM-optimization and steepest-descent line search. In Sanjoy Dasgupta and Adam Klivans, editors, Proc. Annual Conf. Computational Learning Theory, LNCS. Springer, 2009.
[21] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, I and II, volumes 305 and 306. Springer-Verlag, 1993.


Supplementary Material

A Concepts from Convex Analysis

The following four concepts from convex analysis are used in the paper.

Definition 7 Suppose a convex function $f : \mathbb{R}^n \to \mathbb{R}$ is finite at $w$. Then a vector $g \in \mathbb{R}^n$ is called a subgradient of $f$ at $w$ if, and only if,
\[
f(w') \ge f(w) + \langle w' - w, g \rangle \qquad \text{for all } w'.
\]
The set of all such vectors $g$ is called the subdifferential of $f$ at $w$, denoted by $\partial_w f(w)$. For any convex function $f$, $\partial_w f(w)$ must be nonempty. Furthermore, if it is a singleton then $f$ is said to be differentiable at $w$, and we use $\nabla f(w)$ to denote the gradient.

Definition 8 A convex function $f : \mathbb{R}^n \to \mathbb{R}$ is strongly convex with respect to a norm $\|\cdot\|$ if there exists a constant $\sigma > 0$ such that $f - \frac{\sigma}{2}\|\cdot\|^2$ is convex. $\sigma$ is called the modulus of strong convexity of $f$, and for brevity we will call $f$ $\sigma$-strongly convex.

Definition 9 Suppose a function $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable on $Q \subseteq \mathbb{R}^n$. Then $f$ is said to have Lipschitz continuous gradient (l.c.g.) with respect to a norm $\|\cdot\|$ if there exists a constant $L$ such that
\[
\|\nabla f(w) - \nabla f(w')\| \le L\,\|w - w'\| \qquad \forall\, w, w' \in Q.
\]
For brevity, we will call $f$ $L$-l.c.g.

Definition 10 The Fenchel dual of a function $f : \mathbb{R}^n \to \mathbb{R}$ is the function $f^\star : \mathbb{R}^n \to \mathbb{R}$ defined by
\[
f^\star(w^\star) = \sup_{w \in \mathbb{R}^n} \left\{ \langle w, w^\star \rangle - f(w) \right\}.
\]

Strong convexity and l.c.g. are related by Fenchel duality according to the following lemma:

Theorem 11 ([21, Theorems 4.2.1 and 4.2.2])
1. If $f : \mathbb{R}^n \to \mathbb{R}$ is $\sigma$-strongly convex, then $f^\star$ is finite on $\mathbb{R}^n$ and $f^\star$ is $\frac{1}{\sigma}$-l.c.g.
2. If $f : \mathbb{R}^n \to \mathbb{R}$ is convex, differentiable on $\mathbb{R}^n$, and $L$-l.c.g., then $f^\star$ is $\frac{1}{L}$-strongly convex.

Finally, the following lemma gives a useful characterization of the minimizer of a convex function.

Lemma 12 ([21, Theorem 2.2.1]) A convex function $f$ is minimized at $w^*$ if, and only if, $0 \in \partial f(w^*)$. Furthermore, if $f$ is strongly convex, then its minimizer is unique.
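Theorem 11 can be checked numerically in one dimension. The sketch below is our own illustration (the value $\sigma = 4$ and the grid bounds are arbitrary choices): it computes the Fenchel dual of $f(w) = \frac{\sigma}{2}w^2$ by brute-force maximization over a grid and compares it with the closed form $f^\star(s) = s^2/(2\sigma)$, whose gradient $s \mapsto s/\sigma$ is $\frac{1}{\sigma}$-Lipschitz, as the theorem predicts.

```python
import numpy as np

sigma = 4.0                        # modulus of strong convexity (illustrative value)
f = lambda w: 0.5 * sigma * w**2   # f is sigma-strongly convex

# Fenchel dual via brute-force sup over a fine grid (Definition 10)
w_grid = np.linspace(-10.0, 10.0, 200001)
def f_star(s):
    return np.max(s * w_grid - f(w_grid))

# closed form: f*(s) = s^2 / (2*sigma), finite everywhere (Theorem 11, part 1)
for s in [-3.0, 0.5, 2.0]:
    assert abs(f_star(s) - s**2 / (2 * sigma)) < 1e-6

# gradient of f* is (1/sigma)-Lipschitz: |grad(s1) - grad(s2)| <= |s1 - s2| / sigma
grad_f_star = lambda s: s / sigma
s1, s2 = 1.0, 2.5
assert abs(grad_f_star(s1) - grad_f_star(s2)) <= abs(s1 - s2) / sigma + 1e-12
```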

B Proof of Lemma 3

We prove the lemma by induction on $k$. Obviously, Lemma 3 holds for $k = 1$. Suppose it holds for indices up to some $k - 1$ ($k \ge 2$). Let $p = 2k - 1$ ($p \ge 3$). Then
\[
\bar b_p = \left(1, 0, \tfrac12, \ldots, 0, \tfrac12\right), \qquad
A_p = \left(-\tfrac34, 0, -\tfrac14, \ldots, 0, -\tfrac14\right),
\]
\[
w_p = -16 A_p \alpha_p = (-16)\left(-\tfrac34\alpha_{p,1} - \tfrac14\alpha_{p,3} - \tfrac14\alpha_{p,5} - \cdots - \tfrac14\alpha_{p,p-2} - \tfrac14\alpha_{p,p}\right)
\]
\[
\Longrightarrow\quad \alpha_{p,3} + \cdots + \alpha_{p,p-2} + \alpha_{p,p} = \frac{w_p}{4} - 3\alpha_{p,1}.
\]
So
\[
\bar b_p \alpha_p = \alpha_{p,1} + \tfrac12\alpha_{p,3} + \tfrac12\alpha_{p,5} + \cdots + \tfrac12\alpha_{p,p-2} + \tfrac12\alpha_{p,p} = \tfrac18 w_p - \tfrac12 \alpha_{p,1}.
\]
Since $w_p > 2$, we have $a_{p+1} = 0$ and $b_{p+1} = 0$, so $A_{p+1} = (A_p, 0)$ and $\bar b_{p+1} = (\bar b_p, 0)$. Let $\alpha_{p+1} = (\eta \alpha_p, 1 - \eta)$; then $D_{p+1}(\eta) = 8\eta^2 (A_p\alpha_p)^2 - \eta\, \bar b_p\alpha_p$. So
\[
\eta_{p+1} = \frac{\bar b_p \alpha_p}{16 (A_p\alpha_p)^2} = \frac{2w_p - 8\alpha_{p,1}}{w_p^2},
\qquad
w_{p+1} = -16 A_p\alpha_p\, \eta_{p+1} = w_p \eta_{p+1} = 2 - \frac{8\alpha_{p,1}}{w_p} < 2,
\tag{22}
\]
which proves the claim in (11) for even iterates, as $p + 1 = 2k$. Since $\alpha_{2,1} = \tfrac19$, $p \ge 3$, and $\alpha_{k,1} \ge \alpha_{k+1,1}$ due to the update rule of ls-bmrm, we have
\[
8\alpha_{p,1} \le \tfrac89 < 2 < w_p, \qquad \text{hence} \qquad w_{p+1} > 1.
\tag{23}
\]
In the next step, since $w_{p+1} \in (1, 2)$, we have $a_{p+2} = -\tfrac14$ and $b_{p+2} = \tfrac12$, so $A_{p+2} = (A_p, 0, -\tfrac14)$ and $\bar b_{p+2} = (\bar b_p, 0, \tfrac12)$. Let $\alpha_{p+2}(\eta) = \big(\eta\eta_{p+1}\alpha_p,\ \eta(1 - \eta_{p+1}),\ 1 - \eta\big)$. Then
\[
A_{p+2}\alpha_{p+2} = \eta\eta_{p+1} A_p\alpha_p - \tfrac14(1 - \eta),
\qquad
\bar b_{p+2}\alpha_{p+2} = \eta\eta_{p+1} \bar b_p\alpha_p + \tfrac12(1 - \eta),
\]
\[
D_{p+2}(\eta) = 8 (A_{p+2}\alpha_{p+2})^2 - \bar b_{p+2}\alpha_{p+2}
= \frac{\eta^2}{2}\left(4\eta_{p+1}A_p\alpha_p + 1\right)^2 - \left(4\eta_{p+1}A_p\alpha_p + \eta_{p+1}\bar b_p\alpha_p + \tfrac12\right)\eta + \mathrm{const},
\]
where $\mathrm{const}$ collects terms independent of $\eta$. So
\[
\eta_{p+2} = \operatorname*{argmin}_{\eta\in[0,1]} D_{p+2}(\eta)
= \frac{4\eta_{p+1}A_p\alpha_p + \eta_{p+1}\bar b_p\alpha_p + \tfrac12}{\left(4\eta_{p+1}A_p\alpha_p + 1\right)^2}
= \frac{w_p^2 + 16\alpha_{p,1}^2}{(w_p + 4\alpha_{p,1})^2},
\tag{24}
\]
\[
w_{p+2} = -16 A_{p+2}\alpha_{p+2} = -16\eta_{p+2}\eta_{p+1}A_p\alpha_p + 4(1 - \eta_{p+2})
= 2 + \frac{8\alpha_{p,1}(w_p - 4\alpha_{p,1})}{w_p(w_p + 4\alpha_{p,1})},
\]
where the last step follows by plugging in the expression for $\eta_{p+1}$ from (22) and for $\eta_{p+2}$ from (24). Now using (23) we get
\[
w_{p+2} - 2 = \frac{8\alpha_{p,1}(w_p - 4\alpha_{p,1})}{w_p(w_p + 4\alpha_{p,1})} > 0.
\]

C Proof of Lemma 4

The proof is based on (12). Let $\beta_k = 1/\alpha_{2k-1,1}$; then $\lim_{k\to\infty}\beta_k = \infty$ because $\lim_{k\to\infty}\alpha_{2k-1,1} = 0$. Now
\[
\lim_{k\to\infty} k\,\alpha_{2k-1,1}
= \left(\lim_{k\to\infty}\frac{1}{k\,\alpha_{2k-1,1}}\right)^{-1}
= \left(\lim_{k\to\infty}\frac{\beta_k}{k}\right)^{-1}
= \left(\lim_{k\to\infty}\left(\beta_{k+1} - \beta_k\right)\right)^{-1},
\]
where the last step is by the discrete version of L'Hôpital's rule. To compute $\lim_{k\to\infty}(\beta_{k+1}-\beta_k)$ we plug the definition $\beta_k = 1/\alpha_{2k-1,1}$ into (12), which gives
\[
\frac{1}{\beta_{k+1}} = \frac{1}{\beta_k}\cdot\frac{w_{2k}^2 + \frac{16}{\beta_k^2}}{\left(w_{2k} + \frac{4}{\beta_k}\right)^2}
\qquad\Longrightarrow\qquad
\beta_{k+1} - \beta_k = 8\,\frac{w_{2k}\,\beta_k^2}{w_{2k}^2\beta_k^2 + 16} = 8\,\frac{w_{2k}}{w_{2k}^2 + \frac{16}{\beta_k^2}}.
\]
Since $\lim_{k\to\infty} w_k = 2$ and $\lim_{k\to\infty}\beta_k = \infty$, we get $\lim_{k\to\infty}(\beta_{k+1}-\beta_k) = \frac{8\cdot 2}{2^2} = 4$, so
\[
\lim_{k\to\infty} k\,\alpha_{2k-1,1} = \left(\lim_{k\to\infty}\left(\beta_{k+1} - \beta_k\right)\right)^{-1} = \frac14.
\]

D Proof of Theorem 2

Denote $\delta_k = 2 - w_k$; then $\lim_{k\to\infty} k\,|\delta_k| = 2$ by Lemma 4. If $\delta_k > 0$, then
\[
J(w_k) - J(w^*) = \tfrac{1}{32}(2-\delta_k)^2 + \tfrac14\delta_k - \tfrac18 = \tfrac18\delta_k + \tfrac{1}{32}\delta_k^2 = \tfrac18|\delta_k| + \tfrac{1}{32}\delta_k^2.
\]
If $\delta_k \le 0$, then
\[
J(w_k) - J(w^*) = \tfrac{1}{32}(2-\delta_k)^2 - \tfrac18 = -\tfrac18\delta_k + \tfrac{1}{32}\delta_k^2 = \tfrac18|\delta_k| + \tfrac{1}{32}\delta_k^2.
\]
Combining these two cases, we conclude that $\lim_{k\to\infty} k\left(J(w_k) - J(w^*)\right) = \tfrac14$.

E Proof of Theorem 6

The crux of the proof is to show that
\[
w_k = \Big(\tfrac{1}{\sqrt 3},\ \underbrace{\tfrac1k, \ldots, \tfrac1k}_{k \text{ copies}},\ 0, \ldots\Big)^{\!\top} \qquad \forall\, k \in [n-1].
\tag{25}
\]
At the first iteration, we have
\[
\frac{1}{n}\, y_i \langle w_0, x_i\rangle = \begin{cases} \frac16 & \text{if } i \in [n-1] \\[2pt] \frac12 & \text{if } i = n \end{cases}.
\tag{26}
\]
For convenience, define the term in the max of (14) as
\[
\Upsilon_0(\bar y) := 1 - F_1(y, \bar y) + \frac1n \sum_{i=1}^n y_i \langle w_0, x_i\rangle\, (y_i\bar y_i - 1).
\]
The key observation in the context of the $F_1$ score is that $\Upsilon_0(\bar y)$ is maximized at any of the following assignments of $(\bar y_1, \ldots, \bar y_n)$, and it is easy to check that they all give $\Upsilon_0(\bar y) = 0$:
\[
(-1, \ldots, -1, +1),\ (-1, \ldots, -1, -1),\ (+1, -1, -1, \ldots, -1, +1),\ \ldots,\ (-1, \ldots, -1, +1, +1).
\]
The first assignment is just the correct labeling of the training examples. The second assignment misclassifies the only positive example $x_n$ as negative. The remaining $n - 1$ assignments each misclassify a single negative example as positive. To prove that they maximize $\Upsilon_0(\bar y)$, consider two cases of $\bar y$. First, the positive training example is misclassified. Then $F_1(y, \bar y) = 0$ and by (26) we have
\[
\Upsilon_0(\bar y) = 1 - 0 + \frac1n\sum_{i=1}^{n-1} y_i\langle w_0, x_i\rangle (y_i\bar y_i - 1) + \frac12(-1-1) = \frac16\sum_{i=1}^{n-1}(y_i\bar y_i - 1) \le 0.
\]
Second, consider the case of $\bar y$ where the positive example is correctly labeled, while $t \ge 1$ negative examples are misclassified. Then $F_1(y, \bar y) = \frac{2}{2+t}$, and
\[
\Upsilon_0(\bar y) = 1 - \frac{2}{2+t} + \frac16\sum_{i=1}^{n-1}(y_i\bar y_i - 1) = \frac{t}{2+t} - \frac{t}{3} = \frac{t - t^2}{3(2+t)} \le 0, \qquad \forall\, t \in [1, n-1].
\]
So now suppose we pick
\[
\bar y^1 = (+1, -1, -1, \ldots, -1, +1)^\top,
\]
i.e., we just misclassify the first negative training example. Then
\[
a_1 = \frac{-2}{n}\, y_1 x_1 = \Big(-\tfrac{1}{\sqrt 3}, -1, 0, \ldots\Big)^{\!\top},
\qquad
b_1 = R_{\mathrm{emp}}(w_0) - \langle a_1, w_0\rangle = 0 + \tfrac13 = \tfrac13,
\]
\[
w_1 = \operatorname*{argmin}_{w}\left\{\tfrac12\|w\|^2 + \langle a_1, w\rangle + b_1\right\} = -a_1 = \Big(\tfrac{1}{\sqrt 3}, 1, 0, \ldots\Big)^{\!\top}.
\]
Next, we prove (25) by induction. Assume that it holds for steps $1, \ldots, k$. Then at step $k + 1$ it is easy to check that
\[
\frac1n\, y_i\langle w_k, x_i\rangle = \begin{cases} \frac16 + \frac{1}{2k} & \text{if } i \in [k] \\[2pt] \frac16 & \text{if } k+1 \le i \le n-1 \\[2pt] \frac12 & \text{if } i = n \end{cases}.
\tag{27}
\]
Define
\[
\Upsilon_k(\bar y) := 1 - F_1(y, \bar y) + \frac1n\sum_{i=1}^n y_i\langle w_k, x_i\rangle\,(y_i\bar y_i - 1).
\]
Then it is not hard to see that the following $\bar y$ (among others) maximize $\Upsilon_k$: a) the correct labeling, b) misclassifying only the positive training example $x_n$, c) misclassifying only one negative training example among $x_{k+1}, \ldots, x_{n-1}$. And $\Upsilon_k$ equals 0 at all these assignments. For proof, again consider two cases. If $\bar y$ misclassifies the positive training example, then $F_1(y, \bar y) = 0$ and by (27) we have
\[
\Upsilon_k(\bar y) = 1 - 0 + \frac1n\sum_{i=1}^{n-1} y_i\langle w_k, x_i\rangle(y_i\bar y_i - 1) + \frac12(-1-1)
= \sum_{i=1}^{k}\Big(\frac16 + \frac{1}{2k}\Big)(y_i\bar y_i - 1) + \frac16\sum_{i=k+1}^{n-1}(y_i\bar y_i - 1) \le 0.
\]
If $\bar y$ correctly labels the positive example, but misclassifies $t_1$ examples in $x_1, \ldots, x_k$ and $t_2$ examples in $x_{k+1}, \ldots, x_{n-1}$ (as positive), then $F_1(y, \bar y) = \frac{2}{2+t_1+t_2}$, and
\[
\Upsilon_k(\bar y) = 1 - \frac{2}{2+t_1+t_2} + \Big(\frac16 + \frac{1}{2k}\Big)\sum_{i=1}^{k}(y_i\bar y_i - 1) + \frac16\sum_{i=k+1}^{n-1}(y_i\bar y_i - 1)
= \frac{t_1+t_2}{2+t_1+t_2} - \frac{t_1+t_2}{3} - \frac{t_1}{k}
\le \frac{t - t^2}{3(2+t)} \le 0 \qquad (t := t_1 + t_2).
\]
So we can pick
\[
\bar y = (-1, \ldots, -1, +1, -1, \ldots, -1, +1)^\top
\]
(the first $k$ entries are $-1$ and entry $k+1$ is $+1$), which only misclassifies $x_{k+1}$, and get
\[
a_{k+1} = \frac{-2}{n}\, y_{k+1} x_{k+1} = -\tfrac{1}{\sqrt 3}\, e_1 - e_{k+2},
\qquad
b_{k+1} = R_{\mathrm{emp}}(w_k) - \langle a_{k+1}, w_k\rangle = 0 + \tfrac13 = \tfrac13,
\]
\[
w_{k+1} = \operatorname*{argmin}_{w}\left\{\tfrac12\|w\|^2 + \max_{i\in[k+1]}\{\langle a_i, w\rangle + b_i\}\right\}
= \Big(\tfrac{1}{\sqrt 3},\ \underbrace{\tfrac{1}{k+1}, \ldots, \tfrac{1}{k+1}}_{k+1 \text{ copies}},\ 0, \ldots\Big)^{\!\top},
\]
which can be verified via $\partial J_{k+1}(w_{k+1}) = \big\{ w_{k+1} + \sum_{i=1}^{k+1}\alpha_i a_i : \alpha \in \Delta_{k+1} \big\} \ni 0$ (setting all $\alpha_i = \frac{1}{k+1}$). All that remains is to observe that $J(w_k) = \frac12\big(\frac13 + \frac1k\big)$, while $\min_w J(w) \le J(w_{n-1}) = \frac12\big(\frac13 + \frac{1}{n-1}\big)$, from which it follows that $J(w_k) - \min_w J(w) \ge \frac12\big(\frac1k - \frac{1}{n-1}\big)$, as claimed by Theorem 6.

F A linear time algorithm for a box constrained diagonal QP with a single linear equality constraint

It can be shown that the dual optimization problem $D(\alpha)$ from (1) can be reduced to a box-constrained diagonal QP with a single linear equality constraint. In this section, we focus on the following simple QP:
\[
\min_{\alpha} \quad \frac12\sum_{i=1}^n d_i^2(\alpha_i - m_i)^2
\qquad \text{s.t.} \quad l_i \le \alpha_i \le u_i \quad \forall i \in [n];
\qquad \sum_{i=1}^n \sigma_i\alpha_i = z.
\]
Without loss of generality, we assume $l_i < u_i$ and $d_i \ne 0$ for all $i$. We also assume $\sigma_i \ne 0$, because otherwise $\alpha_i$ can be solved for independently. To ensure the feasible region is nonempty, we further assume
\[
\sum_i \sigma_i\big(\delta(\sigma_i > 0)\, l_i + \delta(\sigma_i < 0)\, u_i\big) \ \le\ z\ \le\ \sum_i \sigma_i\big(\delta(\sigma_i > 0)\, u_i + \delta(\sigma_i < 0)\, l_i\big).
\]


The algorithm we describe below stems from [10] and finds the exact optimal solution in $O(n)$ time, faster than the $O(n\log n)$ complexity of [13]. With the simple change of variable $\beta_i = \sigma_i(\alpha_i - m_i)$, the problem simplifies to
\[
\min_{\beta} \quad \frac12\sum_{i=1}^n \bar d_i^2\beta_i^2
\qquad \text{s.t.} \quad l_i' \le \beta_i \le u_i' \quad \forall i\in[n];
\qquad \sum_{i=1}^n \beta_i = z',
\]
where
\[
l_i' = \begin{cases} \sigma_i(l_i - m_i) & \text{if } \sigma_i > 0 \\ \sigma_i(u_i - m_i) & \text{if } \sigma_i < 0 \end{cases},
\qquad
u_i' = \begin{cases} \sigma_i(u_i - m_i) & \text{if } \sigma_i > 0 \\ \sigma_i(l_i - m_i) & \text{if } \sigma_i < 0 \end{cases},
\qquad
\bar d_i^2 = \frac{d_i^2}{\sigma_i^2},
\qquad
z' = z - \sum_i \sigma_i m_i.
\]
We derive its dual via the standard Lagrangian:
\[
L = \frac12\sum_i \bar d_i^2\beta_i^2 - \sum_i \rho_i^+(\beta_i - l_i') + \sum_i \rho_i^-(\beta_i - u_i') - \lambda\Big(\sum_i \beta_i - z'\Big).
\]
Taking the derivative,
\[
\frac{\partial L}{\partial\beta_i} = \bar d_i^2\beta_i - \rho_i^+ + \rho_i^- - \lambda = 0
\quad\Longrightarrow\quad
\beta_i = \bar d_i^{-2}\left(\rho_i^+ - \rho_i^- + \lambda\right).
\tag{28}
\]
Substituting into $L$, we get the dual optimization problem
\[
\min_{\lambda,\rho^+,\rho^-} \ D(\lambda, \rho^+, \rho^-) = \frac12\sum_i \bar d_i^{-2}\left(\rho_i^+ - \rho_i^- + \lambda\right)^2 - \sum_i \rho_i^+ l_i' + \sum_i \rho_i^- u_i' - \lambda z'
\qquad \text{s.t.} \quad \rho_i^+ \ge 0,\ \rho_i^- \ge 0 \quad \forall i\in[n].
\]
Taking the derivative of $D$ with respect to $\lambda$, we get
\[
\sum_i \bar d_i^{-2}\left(\rho_i^+ - \rho_i^- + \lambda\right) - z' = 0.
\tag{29}
\]
The KKT conditions give
\[
\rho_i^+(\beta_i - l_i') = 0, \tag{30a}
\]
\[
\rho_i^-(\beta_i - u_i') = 0. \tag{30b}
\]
Now we enumerate four cases.
1. $\rho_i^+ > 0$, $\rho_i^- > 0$. This implies $l_i' = \beta_i = u_i'$, which contradicts our assumption $l_i' < u_i'$.
2. $\rho_i^+ = 0$, $\rho_i^- = 0$. Then by (28), $\beta_i = \bar d_i^{-2}\lambda \in [l_i', u_i']$, hence $\lambda \in [\bar d_i^2 l_i',\ \bar d_i^2 u_i']$.
3. $\rho_i^+ > 0$, $\rho_i^- = 0$. By (30) and (28), we have $l_i' = \beta_i = \bar d_i^{-2}(\rho_i^+ + \lambda) > \bar d_i^{-2}\lambda$, hence $\lambda < \bar d_i^2 l_i'$ and $\rho_i^+ = \bar d_i^2 l_i' - \lambda$.
4. $\rho_i^+ = 0$, $\rho_i^- > 0$. By (30) and (28), we have $u_i' = \beta_i = \bar d_i^{-2}(-\rho_i^- + \lambda) < \bar d_i^{-2}\lambda$, hence $\lambda > \bar d_i^2 u_i'$ and $\rho_i^- = \lambda - \bar d_i^2 u_i'$.
In sum, we have $\rho_i^+ = [\bar d_i^2 l_i' - \lambda]_+$ and $\rho_i^- = [\lambda - \bar d_i^2 u_i']_+$. Now (29) turns into
\[
f(\lambda) := \sum_i \underbrace{\bar d_i^{-2}\Big([\bar d_i^2 l_i' - \lambda]_+ - [\lambda - \bar d_i^2 u_i']_+ + \lambda\Big)}_{=:\,h_i(\lambda)} - z' = 0.
\tag{31}
\]
In other words, we only need to find the root of $f(\lambda)$ in (31).

[Figure 1: plot of $h_i(\lambda)$. The function equals $l_i'$ for $\lambda \le \bar d_i^2 l_i'$, grows linearly with slope $\bar d_i^{-2}$ on $[\bar d_i^2 l_i',\ \bar d_i^2 u_i']$, and equals $u_i'$ for $\lambda \ge \bar d_i^2 u_i'$.]

$h_i(\lambda)$ is plotted in Figure 1. Note that $h_i(\lambda)$ is a monotonically increasing function of $\lambda$, so the whole $f(\lambda)$ is monotonically increasing in $\lambda$. Since $f(\infty) \ge 0$ by $z' \le \sum_i u_i'$ and $f(-\infty) \le 0$ by $z' \ge \sum_i l_i'$, the root must exist. Considering that $f$ has at most $2n$ kinks (nonsmooth points) and is linear between two adjacent kinks, the simplest idea is to sort $\{\bar d_i^2 l_i',\ \bar d_i^2 u_i' : i\in[n]\}$ into $s_{(1)} \le \ldots \le s_{(2n)}$. If $f(s_{(i)})$ and $f(s_{(i+1)})$ have different signs, then the root must lie between them and can easily be found because $f$ is linear on $[s_{(i)}, s_{(i+1)}]$. This approach takes at least $O(n\log n)$ time because of the sorting. However, the complexity can be reduced to $O(n)$ by exploiting the fact that the median of $n$ (unsorted) elements can be found in $O(n)$ time. Notice that due to the monotonicity of $f$, the median of a set $S$ gives exactly the median of the function values, i.e., $f(\mathrm{MED}(S)) = \mathrm{MED}(\{f(x) : x \in S\})$.

Algorithm 3: $O(n)$ algorithm to find the root of $f(\lambda)$. Ignoring boundary condition checks.
1: Set kink set $S \leftarrow \{\bar d_i^2 l_i' : i\in[n]\} \cup \{\bar d_i^2 u_i' : i\in[n]\}$.
2: while $|S| > 2$ do
3:   Find the median of $S$: $m \leftarrow \mathrm{MED}(S)$.
4:   if $f(m) \ge 0$ then
5:     $S \leftarrow \{x \in S : x \le m\}$.
6:   else
7:     $S \leftarrow \{x \in S : x \ge m\}$.
8:   end if
9: end while
10: Return the root $\frac{l\, f(u) - u\, f(l)}{f(u) - f(l)}$, where $S = \{l, u\}$.

Algorithm 3 sketches this binary-search idea. The while loop terminates in $\log_2(2n)$ iterations because the set $S$ is halved in each iteration, and each iteration takes time linear in $|S|$, the size of the current $S$. So the total complexity is $O(n)$. Note that evaluating $f(m)$ potentially involves summing $n$ terms as in (31); however, by suitable aggregation of slopes and offsets this can be reduced to $O(|S|)$.
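Algorithm 3 translates almost directly into code. Below is a minimal Python sketch of the median-based root finder for the simplified problem in $\beta$ (the function name `solve_diag_qp` and the argument names are ours; boundary cases are handled only loosely, as in the pseudocode, and `np.partition` is used for average-case $O(|S|)$ median selection):

```python
import numpy as np

def solve_diag_qp(dbar2, lo, hi, zp):
    """Minimize 0.5 * sum(dbar2 * beta**2) s.t. lo <= beta <= hi, sum(beta) = zp.

    Assumes dbar2 > 0 and lo < hi elementwise, and sum(lo) <= zp <= sum(hi)
    so that the feasible region is nonempty, mirroring the text.
    """
    def f(lam):
        # f(lambda) = sum_i h_i(lambda) - z'; each h_i is the clipped linear map
        return np.clip(lam / dbar2, lo, hi).sum() - zp

    # kink set: the lambda values at which some h_i changes slope
    S = np.unique(np.concatenate([dbar2 * lo, dbar2 * hi]))  # sorted, distinct
    while len(S) > 2:
        # select a median element in O(|S|) average time via partial sort
        m = np.partition(S, len(S) // 2)[len(S) // 2]
        S = S[S <= m] if f(m) >= 0 else S[S >= m]
    l, u = S[0], S[-1]
    # f is linear on [l, u] with f(l) <= 0 <= f(u); interpolate for the root
    lam = l if f(u) == f(l) else (l * f(u) - u * f(l)) / (f(u) - f(l))
    return np.clip(lam / dbar2, lo, hi)
```

To recover the original variables one would map back via $\alpha_i = m_i + \beta_i/\sigma_i$. Filtering by $\le m$ or $\ge m$ always keeps a contiguous range of the sorted kinks, so the two survivors are adjacent kinks and the final linear interpolation is exact.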

G Solving Binary Linear SVMs using Nesterov's Algorithm

Now we present the algorithm of [8] in Algorithm 4. It requires a $\sigma_2$-strongly convex prox-function on $Q_2$, namely $d_2(\alpha) = \frac{\sigma_2}{2}\|\alpha\|^2$, and sets $D_2 = \max_{\alpha\in Q_2} d_2(\alpha)$. Let the Lipschitz constant of $\nabla D(\alpha)$ be $L$. Algorithm 4 is based on two mappings $\alpha_\mu(w) : Q_1 \mapsto Q_2$ and $w(\alpha) : Q_2 \mapsto Q_1$, together with an auxiliary mapping $v : Q_2 \mapsto Q_2$. They are defined by
\[
\alpha_\mu(w) := \operatorname*{argmin}_{\alpha\in Q_2}\ \mu\, d_2(\alpha) - \langle Aw, \alpha\rangle + g(\alpha)
= \operatorname*{argmin}_{\alpha\in Q_2}\ \frac{\mu}{2}\|\alpha\|^2 + w^\top XY\alpha - \sum_i \alpha_i,
\tag{32}
\]
\[
w(\alpha) := \operatorname*{argmin}_{w\in Q_1}\ \langle Aw, \alpha\rangle + f(w)
= \operatorname*{argmin}_{w\in\mathbb{R}^d}\ -w^\top XY\alpha + \frac{\lambda}{2}\|w\|^2 = \frac{1}{\lambda}\, XY\alpha,
\tag{33}
\]
\[
v(\alpha) := \operatorname*{argmin}_{\alpha'\in Q_2}\ \frac{L}{2}\|\alpha' - \alpha\|^2 - \langle\nabla D(\alpha),\ \alpha' - \alpha\rangle.
\tag{34}
\]

Algorithm 4: Solving Binary Linear SVMs using Nesterov's Algorithm.
Require: $L$, a conservative estimate of (i.e., no less than) the Lipschitz constant of $\nabla D(\alpha)$.
Ensure: Two sequences $w_k$ and $\alpha_k$ which reduce the duality gap at $O(1/k^2)$ rate.
1: Initialize: Randomly pick $\alpha_{-1}$ in $Q_2$. Let $\mu_0 = 2L$, $\alpha_0 \leftarrow v(\alpha_{-1})$, $w_0 \leftarrow w(\alpha_{-1})$.
2: for $k = 0, 1, 2, \ldots$ do
3:   Let $\tau_k = \frac{2}{k+3}$, $\beta_k \leftarrow (1 - \tau_k)\alpha_k + \tau_k\,\alpha_{\mu_k}(w_k)$.
4:   Set $w_{k+1} \leftarrow (1 - \tau_k)w_k + \tau_k\, w(\beta_k)$, $\alpha_{k+1} \leftarrow v(\beta_k)$, $\mu_{k+1} \leftarrow (1 - \tau_k)\mu_k$.
5: end for

Equations (32) and (34) are examples of a box-constrained QP with a single equality constraint; Appendix F provides a linear-time algorithm to find the minimizer of such a QP. The overall complexity of each iteration is thus $O(nd)$, due to the gradient calculation in (34) and the matrix multiplication in (33).
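Algorithm 4 inherits its $O(1/k^2)$ rate from Nesterov's acceleration. As a self-contained illustration of that mechanism (this is the basic accelerated gradient scheme of [6] on a least-squares objective, not Algorithm 4 itself, which additionally requires the projections of Appendix F; the problem sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 80                       # wide system: convex but not strongly convex
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient of f

f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)
w_star = np.linalg.pinv(A) @ b      # a minimizer (min-norm least-squares solution)
f_star = f(w_star)
d0 = np.sum(w_star ** 2)            # ||w_0 - w*||^2 with w_0 = 0

K = 100

# plain gradient descent: guaranteed f(w_K) - f* <= L * d0 / (2K)
w = np.zeros(n)
for _ in range(K):
    w = w - grad(w) / L
gd_gap = f(w) - f_star

# Nesterov's accelerated gradient: guaranteed f(w_K) - f* <= 2 * L * d0 / (K+1)^2
w, w_prev = np.zeros(n), np.zeros(n)
for k in range(1, K + 1):
    y = w + (k - 1) / (k + 2) * (w - w_prev)  # extrapolation (momentum) step
    w_prev = w
    w = y - grad(y) / L
agm_gap = f(w) - f_star
```

On typical instances the accelerated gap is far below the plain gradient-descent gap, matching the $O(1/k^2)$ versus $O(1/k)$ guarantees.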

H Dynamic Programming to Compute the Marginals of (21)

For convenience, we repeat the joint distribution here:
\[
p(\bar y; w) \propto \exp\Big(c\,\Delta(\bar y, y) + \sum_i a_i\langle x_i, w\rangle\,\bar y_i\Big).
\]
Clearly, the marginal distributions can be computed efficiently if we are able to efficiently compute sums of the form $\sum_{\bar y}\exp\big(c\,\Delta(\bar y, y) + \sum_i a_i\langle x_i, w\rangle\,\bar y_i\big)$. Notice that $\Delta(\bar y, y)$ depends only on the "sufficient statistics" of the contingency table, namely the number of false positives and the number of false negatives ($b$ and $c$ in the contingency table, respectively), so our idea is to enumerate all possible values of $b$ and $c$. For any fixed value of $b$ and $c$, we just need to sum up
\[
\exp\Big(\sum_i a_i\langle x_i, w\rangle\,\bar y_i\Big) = \prod_i \exp\big(a_i\langle x_i, w\rangle\,\bar y_i\big)
\]
over all $\bar y$ which have $b$ false positives and $c$ false negatives. Let us call this set of labelings $C_n(b, c)$. If $y_n = +1$, then
\[
V_n(b, c) := \sum_{\bar y\in C_n(b,c)}\ \prod_{i=1}^n \exp\big(a_i\langle x_i, w\rangle\,\bar y_i\big)
\]
\[
= \exp\big(a_n\langle x_n, w\rangle\big) \sum_{(\bar y_1,\ldots,\bar y_{n-1})\in C_{n-1}(b,c)}\ \prod_{i=1}^{n-1}\exp\big(a_i\langle x_i, w\rangle\,\bar y_i\big)
\qquad (\bar y_n = +1)
\]
\[
\quad + \exp\big(-a_n\langle x_n, w\rangle\big) \sum_{(\bar y_1,\ldots,\bar y_{n-1})\in C_{n-1}(b,c-1)}\ \prod_{i=1}^{n-1}\exp\big(a_i\langle x_i, w\rangle\,\bar y_i\big)
\qquad (\bar y_n = -1)
\]
\[
= \exp\big(a_n\langle x_n, w\rangle\big)\, V_{n-1}(b, c) + \exp\big(-a_n\langle x_n, w\rangle\big)\, V_{n-1}(b, c-1).
\]
If $y_n = -1$, then
\[
V_n(b, c) = \exp\big(a_n\langle x_n, w\rangle\big)\, V_{n-1}(b-1, c) + \exp\big(-a_n\langle x_n, w\rangle\big)\, V_{n-1}(b, c).
\]
In practice, the recursion starts from $V_1$, and no specific pair $(b, c)$ is fixed in advance: $V_k(p, q)$ is computed for all values of $p$ and $q$. At completion, we therefore obtain $V_n(b, c)$ for all possible values of $b$ and $c$. The cost for computation and storage is $O(n^2)$.
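The recursion can be sketched directly. Below is a minimal Python sketch (the function name is ours; `s[i]` stands for $a_i\langle x_i, w\rangle$, and the table is keyed by the pair of error counts (false positives, false negatives)):

```python
import math
from collections import defaultdict

def partition_by_errors(y, s):
    """V[(b, c)] = sum over labelings ybar with b false positives and c false
    negatives of prod_i exp(s[i] * ybar[i]), where s[i] stands for a_i * <x_i, w>."""
    V = {(0, 0): 1.0}            # empty prefix: no errors, empty product = 1
    for yi, si in zip(y, s):
        Vnew = defaultdict(float)
        for (b, c), val in V.items():
            if yi == +1:
                Vnew[(b, c)] += math.exp(si) * val       # ybar_i = +1: correct
                Vnew[(b, c + 1)] += math.exp(-si) * val  # ybar_i = -1: false negative
            else:
                Vnew[(b + 1, c)] += math.exp(si) * val   # ybar_i = +1: false positive
                Vnew[(b, c)] += math.exp(-si) * val      # ybar_i = -1: correct
        V = dict(Vnew)
    return V

# sanity check: summing V over all (b, c) gives prod_i 2*cosh(s_i)
y, s = [1, -1, 1], [0.5, -0.2, 0.1]
V = partition_by_errors(y, s)
assert abs(sum(V.values()) - math.prod(2 * math.cosh(si) for si in s)) < 1e-12
```

The weighted sum $\sum_{b,c} V_n(b,c)\,\exp(c'\,\Delta(b,c))$ then yields the normalization constant, and analogous tables restricted to $\bar y_j = +1$ yield the marginals.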
