On Efficient Randomized Methods for Convex Optimization
Robert M. Freund, MIT

Banff International Research Station, Banff, Canada, November 2006

A Quotation
“The final test of any theory is its capacity to solve the problems which originated it.”
–George B. Dantzig, opening sentence of Linear Programming and Extensions, 1963

In conversation at U. Florida, 1999
Don Ratliff: “Tell me something, do any of you people at MIT ever do anything practical?”
Rob Freund: “In theory, yes.”

Motivation and Scope of Talk

• Until recently, randomized methods have played mostly a minor role in theory and practice in continuous convex optimization
• Perhaps now is the time to explore the possible contributions of randomized algorithms in convex optimization

Herein we survey some concepts of randomized methods, two recent research papers, and attempt to look ahead at future research

Preamble
We know how to do at least three types of random sampling:
1. compute a random vector on the unit sphere S^{n−1} ⊂ IR^n
2. compute a random point uniformly distributed on a convex body S, given an initial interior point v^0 ∈ S and a membership oracle for S
3. compute a random point exponentially distributed on a convex body S, given an initial interior point v^0 ∈ S and a membership oracle for S
These and other random sampling methods lead to some very interesting methods for convex optimization

Uniform Vector on the Sphere


Uniform Vector on a Convex Body


Exponentially Distributed Vector on a Convex Body


Outline

Problem Setting          Randomization                   Authors
Compute x ∈ S            Random Walk on Convex Body      Bertsimas/Vempala
Convex Optimization      Exponential/Annealing           Kalai/Vempala
Solve Ax ∈ intK          Random Points on Sphere         Belloni/F./Vempala

• Probabilistic Complexity: New Paradigms
• References

Probabilistic Algorithm Notions
• Las Vegas algorithm: always outputs the correct answer, but allows a small probability of a large running time
• “High Probability”: the computational complexity of the algorithm with probability of success 1 − δ is O(· · · ln(1/δ))
• The complexity depends on δ only through ln(1/δ)

Computing a Point x ∈ S by Random Walks
Bertsimas and Vempala (2004)

Convex Body
A convex body is a compact convex set with nonempty interior

Notation

• B∞(x, r) is the L∞ norm ball of radius r centered at x
• For a positive definite matrix Σ, ‖v‖_Σ := √(v^T Σ^{−1} v)

Separation Oracle for a Convex Set
A separation oracle for a set P: given a point x, it either certifies that x ∈ P or outputs a vector d satisfying ‖d‖₂ = 1 and d^T y ≤ d^T x for all y ∈ P

Computing a Point x ∈ S by Random Walks
S ⊂ IR^n is a convex set given by a separation oracle
The goal is to compute a point x ∈ S
Assume that B∞(v, r) ⊂ S ⊂ B∞(0, R) for some v, r, R
Assume that we know R

Cut through the Center of Mass
Let µ denote the center of mass of the convex body P. Any halfspace H that contains µ contains at most (1 − 1/e) of the volume of P.
This implies Vol(P ∩ H) ≤ (1 − 1/e) Vol(P)

Exact Center of Mass Algorithm
Levin’s Algorithm (1965): Input: Separation Oracle for S, scalar R
Step 0. (Initialization) P ← B∞(0, R), µ ← 0
Step 1. (Oracle call) If µ ∈ S, stop. Else compute separator d
Step 2. (Compute Halfspace) H ← {x ∈ IR^n : d^T x ≤ d^T µ}
Step 3. (Cut P) P ← P ∩ H
Step 4. (Compute new center of mass) µ ← µ(P)
Step 5. (Repeat) Goto Step 1.

Complexity of Exact Center of Mass Algorithm
Assume that B∞(v, r) ⊂ S ⊂ B∞(0, R) for some v, r, R, and that R is known
The algorithm will compute x ∈ S in at most ⌈2.2 n ln(R/r)⌉ iterations
Problem: computing µ(P) is #P-complete
The method appears to be worthless

Basic Idea of Probabilistic Algorithm


Computing a Point x ∈ S by Random Walks
B-V Algorithm: Input: Separation Oracle for S, scalar R
Step 0. (Initialization) P ← B∞(0, R), v̂ ← 0
Step 1. (Oracle call) If v̂ ∈ S, stop. Else compute separator d
Step 2. (Compute Halfspace) H ← {x ∈ IR^n : d^T x ≤ d^T v̂}
Step 3. (Cut P) P ← P ∩ H
Step 4. (Sample) Sample M random points v^1, …, v^M ∼ U(P)
Step 5. (Estimate Mean) v̂ ← (1/M) Σ_{i=1}^M v^i
Step 6. (Repeat) Goto Step 1.

How does the volume of P ∩ H decrease? How should the number of samples M be chosen? (A sketch of this loop in code follows below.)
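A schematic Python sketch of this cutting loop (my illustration, not the authors' code; `separation_oracle` and `sample_uniform` are assumed stubs, the latter standing in for a geometric random walk such as Hit-and-Run):

```python
import numpy as np

def bv_feasibility(separation_oracle, sample_uniform, n, R, M, max_iters=1000):
    """Bertsimas-Vempala-style localization: find a point of S by cutting
    through an estimated center of mass.

    separation_oracle(x) -> None if x in S, else a unit vector d with
                            d.y <= d.x for all y in S.
    sample_uniform(cuts, M) -> M approximately uniform samples from the
                               current set P = B_inf(0, R) cut by halfspaces.
    """
    cuts = []                      # halfspaces {x : d.x <= b} defining P
    v_hat = np.zeros(n)            # estimate of the center of mass of P
    for _ in range(max_iters):
        d = separation_oracle(v_hat)
        if d is None:
            return v_hat           # v_hat is in S: stop
        cuts.append((d, float(d @ v_hat)))  # cut P through v_hat
        pts = sample_uniform(cuts, M)       # Step 4: sample P
        v_hat = pts.mean(axis=0)            # Step 5: sample mean ~ centroid
    return None
```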

The Uniform Distribution on S: Bounding Ellipsoids
Let X be a random vector uniformly distributed on a convex body S ⊂ IR^d, and let f_U denote the uniform density function on S:
f_U(x) := 1/Vol(S) for x ∈ S, and f_U(x) := 0 for x ∉ S
µ := E[X] and Σ := E[(X − µ)(X − µ)^T]
Theorem: B_Σ(µ, √((d+2)/d)) ⊂ S ⊂ B_Σ(µ, √(d(d+2)))
(This yields a d-rounding of S)

Covariance Matrix and Bounding Ellipsoids


Bounding the Decrease in Volume
Let µ and Σ denote the mean and covariance of the uniform distribution on P
µ is the center of mass of P
Define ‖w‖_Σ := √(w^T Σ^{−1} w)
v̂ is an approximation of µ
Theorem (Bertsimas/Vempala): If ‖v̂ − µ‖_Σ ≤ t, then any halfspace containing v̂ contains at most (1 − 1/e + t) of the volume of P.

Main Computational Complexity Result
Run the algorithm for K ≤ ⌈3n ln(R/r)⌉ iterations, where (recall) B∞(v, r) ⊂ S ⊂ B∞(0, R) for some v, r, R
Let 1 − δ be the probability of success of the algorithm
Theorem (Bertsimas/Vempala): Let M = ⌈838 n ln(3n ln(R/r)/δ)⌉ be the number of sample points at each iteration. Then the algorithm will compute a point x ∈ S, and therefore stop, with probability at least 1 − δ

“High Probability”
Let T be the number of iterations needed to have success with probability p = 0.125

Probability of Success    Iteration Bound
0.9                       18T
0.99                      36T
0.999                     54T
0.9999                    72T

Uniform Sampling on a Convex Body
• Notation
• Convex Geometry
• Logconcave Functions
• Uniform Sampling via Geometric Random Walk
• Estimating µ and Σ

Notation
⟨·, ·⟩ is the Euclidean inner product
‖v‖₂ := √⟨v, v⟩
B(x, r) is the Euclidean ball of radius r centered at x
For a positive definite matrix Σ, ‖v‖_Σ := √(v^T Σ^{−1} v)
B_Σ(c, r) is the ball of radius r centered at c in the norm ‖·‖_Σ
O*(·) notation ignores all logarithm terms: for example, O(n³ ln(m)) = O*(n³)

Logconcave Functions
A function f : IR^d → IR₊ is logconcave if for all x, y ∈ IR^d and α ∈ [0, 1]:
f(αx + (1 − α)y) ≥ f(x)^α f(y)^{1−α}
(i.e., log f is concave)
• The uniform, Gaussian, and exponential densities on a convex set S are all logconcave
• The product and minimum of two logconcave functions are logconcave
Theorem (Leindler, Prékopa): Let f be a logconcave density function.
• All marginal density functions are logconcave
• The probability distribution function is logconcave
• The convolution of two logconcave functions is logconcave

Logconcave Functions, continued
Let X ∈ IR^d be a random vector whose density function f is logconcave
• µ := E[X] and Σ := E[(X − µ)(X − µ)^T]
• If H is a halfspace containing µ, then P(X ∈ H) ≥ 1/e (Lovász/Vempala)
• P(‖X − µ‖_Σ ≥ t) ≤ e^{−t/√d}


Uniform Sampling on S by Random Walks
X ∼ U(S) denotes that X obeys the uniform distribution on the convex body S
The goal is to generate a random vector whose density function is approximately U(S)
Geometric random walks: Grid Walk, Ball Walk, Hit-and-Run, others
These random walks only require:
• a membership oracle for S
• a starting point v^0 ∈ intS

Hit-and-Run Random Walk
Step 0. Initialize with v^0 ∈ intS, k = 0; the number of steps N is given
Step 1. Choose d ∼ U(S^{d−1}) in IR^d
Step 2. Compute the endpoints p and q of the chord {v^k + td} ∩ S
Step 3. Choose v^{k+1} ∼ U([p, q])
Step 4. Set k ← k + 1. If k < N goto Step 1
(A sketch in code follows below.)
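A compact Python sketch of the walk (illustrative only; `membership` is the assumed oracle, and the chord endpoints of Step 2 are found by bisection rather than in closed form):

```python
import numpy as np

def hit_and_run(membership, v0, num_steps, rng=np.random.default_rng()):
    """Run Hit-and-Run from an interior point v0 of a convex body S.

    membership(x) -> True iff x is in S (the only access to S we assume).
    """
    def chord_end(x, d, lo=0.0, hi=1.0):
        # Grow hi until x + hi*d leaves S, then bisect to the boundary.
        while membership(x + hi * d):
            hi *= 2.0
        for _ in range(50):                  # 50 bisections suffice for a sketch
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if membership(x + mid * d) else (lo, mid)
        return lo

    v = np.asarray(v0, dtype=float)
    for _ in range(num_steps):
        d = rng.standard_normal(v.size)
        d /= np.linalg.norm(d)               # Step 1: uniform direction
        t_plus = chord_end(v, d)             # boundary along +d
        t_minus = chord_end(v, -d)           # boundary along -d
        v = v + rng.uniform(-t_minus, t_plus) * d  # Step 3: uniform on the chord
    return v
```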

Hit-and-Run


Sampling on a Convex Body: Stationary Distribution
Hit-and-Run on S induces a Markov chain on S
• the state space is the set of density functions on S
Theorem: U(S) is the stationary distribution of this Markov chain
v^N denotes the current point after N steps of the random walk
f_{v^N} denotes the density function of v^N
Asymptotically, f_{v^N} → f_U as N → ∞

Sampling: Convergence of Random Walk
Convergence to the uniform distribution is measured in the Total Variation (TV) norm:
‖f_{v^N} − f_U‖_TV := (1/2) ∫ |f_{v^N}(x) − f_U(x)| dx
(This is essentially the L₁ norm for functions)

Sampling on a Convex Body: Complexity
‖f_{v^N} − f_U‖_TV := (1/2) ∫ |f_{v^N}(x) − f_U(x)| dx
Suppose that B₂(w, r) ⊂ S ⊂ B₂(y, R) for some w, y, r, R
Then in order for ‖f_U − f_{v^N}‖_TV ≤ ε it suffices to take N steps of Hit-and-Run, where
N = O( d³ (R/r)² ln( R/(ε · dist₂(v^0, ∂S)) ) )

Sampling on a Convex Body: Complexity, continued
N = O( d³ (R/r)² ln( R/(ε · dist₂(v^0, ∂S)) ) ) steps of Hit-and-Run, where B₂(w, r) ⊂ S ⊂ B₂(y, R)
As written this is “exponential” in ln(R/r)
R/r can be replaced by √d through a rounding subroutine
The rounding subroutine requires O*(d⁴) random walk steps

Sampling on a Convex Body: Rounding Subroutine
We have M sample vectors v^1, …, v^M
v̂ := (1/M) Σ_{i=1}^M v^i and Σ̂ := (1/M) Σ_{i=1}^M (v^i − v̂)(v^i − v̂)^T
Rounding (affine) transformation: Y := Σ̂^{−1/2}(X − v̂)
µ_Y → 0 and Σ_Y → I as M → ∞
The rounding subroutine requires N = O*(d⁴) steps of Hit-and-Run
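In code, the rounding step is an empirical whitening transform. A small sketch (my illustration):

```python
import numpy as np

def rounding_transform(samples):
    """Build the affine rounding map Y = Sigma_hat^(-1/2) (X - v_hat)
    from M sample vectors (the rows of `samples`)."""
    v_hat = samples.mean(axis=0)
    centered = samples - v_hat
    sigma_hat = centered.T @ centered / len(samples)   # empirical covariance
    w, Q = np.linalg.eigh(sigma_hat)                   # symmetric eigendecomposition
    sigma_inv_half = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    return lambda x: sigma_inv_half @ (x - v_hat)      # whitening map
```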

Importance of Rounding Subroutine


Convex Optimization via Simulated Annealing
Kalai and Vempala (2005)

Convex Optimization via Simulated Annealing
Approximately solve:
(CP): min_x ⟨c, x⟩ s.t. x ∈ S
Assumptions:
• S is a convex body given by a membership oracle
• B(v, r) ⊂ S ⊂ B(w, R) for some v, r, w, R

Classical Simulated Annealing
• Typically applied to a non-convex problem min_x {⟨c, x⟩ : x ∈ X} where X is non-convex
• cooling schedule T_1 > T_2 > · · · → 0
• x is the current iterate, T_i is the current temperature
• Choose y at random in a neighborhood of x
• x ← y with probability min(1, e^{(⟨c,x⟩−⟨c,y⟩)/T_i})
• i ← i + 1 and repeat
Remark: T ≈ ∞ is “uniform” on X
Remark: T ≈ 0 “concentrates on the optima”
(The acceptance step is sketched in code below.)
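A minimal Python sketch of the Metropolis acceptance step above (illustrative; `neighbor` and `objective` are assumed problem-specific stubs):

```python
import math
import random

def annealing_step(x, f_x, neighbor, objective, T):
    """One classical simulated-annealing move at temperature T.

    Downhill moves are always accepted; uphill moves are accepted
    with probability exp((f(x) - f(y)) / T).
    """
    y = neighbor(x)            # random candidate near x
    f_y = objective(y)
    if f_y <= f_x or random.random() < math.exp((f_x - f_y) / T):
        return y, f_y          # move accepted
    return x, f_x              # move rejected
```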

The Exponential Distribution on S
• Consider the convex problem min_x {⟨c, x⟩ : x ∈ S} for the convex body S
• Exponential density on S with cooling rate T:
f_T(x) = e^{−⟨c,x⟩/T} / ∫_S e^{−⟨c,y⟩/T} dy
• “X ∼ Exp(T)”

Exponential Distribution on S


Annealing Algorithm for Convex Optimization
We wish to approximately solve: min_x {⟨c, x⟩ : x ∈ S}
• Choose X_0 ∼ U(S), T_0 ← n, i ← 1
• Update the cooling temperature: T_i := T_{i−1}(1 − 1/√n)
• Sample: X_1, …, X_{O(n)} ∼ Exp(T_i) to obtain estimates of µ_{T_i} and Σ_{T_i}
• If i ≥ 2√n ln(n/ε) then Stop
• i ← i + 1 and repeat
(The schedule is sketched in code below.)
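The outer loop is just a temperature schedule wrapped around a sampler. A schematic sketch (my illustration; `sample_exp` is an assumed stub for the Exp(T) sampler described later, and the per-temperature sample count 8n is only a placeholder for the O(n) samples):

```python
import math
import numpy as np

def annealing_convex_opt(sample_exp, c, n, eps):
    """Kalai-Vempala-style cooling schedule around an Exp(T) sampler.

    sample_exp(T, m) -> m approximate samples from density ~ exp(-<c,x>/T)
    on S (itself a Hit-and-Run walk; stubbed here). Returns the best sample.
    """
    T = float(n)                                   # T0 = n: nearly uniform on S
    num_iters = math.ceil(2 * math.sqrt(n) * math.log(n / eps))
    best = None
    for _ in range(num_iters):
        T *= 1.0 - 1.0 / math.sqrt(n)              # T_i = n (1 - 1/sqrt(n))^i
        pts = sample_exp(T, 8 * n)                 # O(n) samples per temperature;
                                                   # their mean/covariance drive rounding
        for x in pts:
            if best is None or np.dot(c, x) < np.dot(c, best):
                best = x
    return best
```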

Iteration Complexity
Lemma: For a unit vector c ∈ IR^n, temperature T, and X ∼ Exp(T):
E_{f_T}[⟨c, X⟩] ≤ OPTVAL + nT
Complexity Implications:
• For T* := ε/n we have E_{f_{T*}}[⟨c, X⟩] ≤ OPTVAL + ε
• “Major” Iterations: O(√n ln(n/ε))
• Overall Complexity: O*(n^{4.5})

Comparison: Simulated Annealing and IPM

                         Simulated Annealing               IPM
Parameter                T_i (temperature)                 µ_i (barrier parameter)
Driving Force            E_{T_i}[X]                        min_x ⟨c, x⟩ + µ_i F(x)
Iterate                  Density/Sample of f_{T_i}(x)      Current Iterate
Tracking                 Path of Densities                 Central Path
Staying Close            f_X(·) near f_{T_i}(·)            x near x(µ_i)
Information/Iteration    µ̄_{T_i}, Σ_{T_i}                 ∇F(x), ∇²F(x)
Work/Iteration           Steps of Random Walk              Newton Step
Complexity               O*(n^{4.5})                       O*(√ϑ n³)
Information              Membership Oracle                 Barrier Function
Implementation           Easy?                             Less Easy
Performance              ?                                 Very Good

Hit-and-Run for the Exponential Distribution on S
f_T(x) = e^{−⟨c,x⟩/T} / ∫_S e^{−⟨c,y⟩/T} dy
Hit-and-Run for the exponential distribution:
Step 0. Initialize with v^0 ∈ intS, k = 0; the number of steps N is given
Step 1. Sample d ∼ U(S^{n−1}) in IR^n
Step 2. Sample v^{k+1} from the 1-dimensional exponential distribution on {v^k + td} ∩ S
Step 3. Set k ← k + 1. If k < N goto Step 1
(Step 2 is sketched in code below.)
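Only Step 2 differs from uniform Hit-and-Run: the point is drawn from the restriction of e^{−⟨c,x⟩/T} to the chord, a truncated one-dimensional exponential that can be sampled by inverse-CDF. A sketch (my own derivation, reusing chord endpoints p and q):

```python
import numpy as np

def sample_chord_exponential(p, q, c, T, rng=np.random.default_rng()):
    """Sample from density ~ exp(-<c,x>/T) restricted to the segment [p, q].

    Parametrize x = p + s (q - p) with s in [0, 1]; the density in s is
    ~ exp(-a s) with a = <c, q - p>/T, a truncated exponential on [0, 1].
    """
    a = np.dot(c, q - p) / T
    u = rng.uniform()
    if abs(a) < 1e-12:
        s = u                                            # flat: uniform on the chord
    else:
        s = -np.log(1.0 - u * (1.0 - np.exp(-a))) / a    # inverse CDF on [0, 1]
    return p + s * (q - p)
```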

Sampling from the Exponential Distribution
Hit-and-Run for the exponential distribution:

Sampling from the Exponential Distribution on S
• Can be done efficiently in theory
• Is less efficient than sampling from the uniform distribution on S
• For T̂ ≈ 0, sampling directly requires at least O*(n⁶) Hit-and-Run steps to compute one sample point
• The indirect method uses a homotopy of samples for the family f_{T_i}(·) for T_1 > … > T_N = T̂
• The rounding subroutine is critical to the efficiency of the homotopy methodology



An Efficient Re-scaled Perceptron Algorithm for Conic Systems
Belloni, F., Vempala (2006)

Problem of Interest
Compute a solution to:
(P): Ax ∈ intK, x ∈ X
A : X → Y is a linear operator from n-dimensional Euclidean space X to m-dimensional Euclidean space Y
Assume that (P) has a solution
K is the inclusion cone, and is a regular convex cone
F := {x ∈ X : Ax ∈ K} is the feasibility cone

Uniform Random Vector on the Sphere
It is elementary to compute a random vector uniformly distributed on the unit sphere S^{n−1} := {v ∈ IR^n : ‖v‖ = 1}
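A minimal sketch of this primitive (assuming the standard Gaussian-normalization construction):

```python
import numpy as np

def uniform_on_sphere(n, rng=np.random.default_rng()):
    """Uniform random vector on S^{n-1}: normalize a standard Gaussian,
    whose rotational symmetry makes the direction uniform."""
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)
```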

A Useful Property of the Sphere
S^{n−1} := {v ∈ IR^n : ‖v‖ = 1}
Proposition: Given v̄ ∈ S^{n−1}, sample x ∼ U(S^{n−1}). Then:
Pr( ⟨v̄, x⟩ ≥ 1/√n ) ≥ 1/8
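A quick Monte Carlo sanity check of the proposition (my illustration; for large n the probability approaches P(Z ≥ 1) ≈ 0.16 ≥ 1/8):

```python
import numpy as np

def check_sphere_property(n=50, trials=100_000, seed=0):
    """Monte Carlo estimate of Pr(<v_bar, x> >= 1/sqrt(n)) for x ~ U(S^{n-1})."""
    rng = np.random.default_rng(seed)
    v_bar = np.zeros(n); v_bar[0] = 1.0            # any fixed unit vector works
    g = rng.standard_normal((trials, n))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)
    return float(np.mean(x @ v_bar >= 1.0 / np.sqrt(n)))

print(check_sphere_property())   # roughly 0.16 for n = 50, comfortably >= 1/8
```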

Notation
⟨·, ·⟩ is the Euclidean inner product
‖v‖ := √⟨v, v⟩
B(x, r) is the Euclidean ball of radius r centered at x
Let A : X → Y denote a linear operator from n-dimensional Euclidean space X to m-dimensional Euclidean space Y
A* : Y → X denotes the adjoint operator associated with A

Interior Separation Oracle
An interior separation oracle for a convex set S: given a point x, it either certifies that x ∈ intS or computes a vector d satisfying ‖d‖ = 1 and ⟨d, y⟩ ≤ ⟨d, x⟩ for all y ∈ S

Cones
C is a closed convex cone in a finite-dimensional space
The dual cone of C is C* := {d : ⟨d, x⟩ ≥ 0 for all x ∈ C}
extray C denotes the set of extreme rays of C
C is a regular cone if C is a pointed closed convex cone with non-empty interior
A useful intersection property:
Proposition: If C is a regular cone, then intC ∩ intC* ≠ ∅

Width and Center of a Cone
The width of a convex cone C is:
τ_C := max_{x,r} {r : ‖x‖ ≤ 1, B(x, r) ⊂ C}
The center of C is the z̄ for which B(z̄, τ_C) ⊂ C and ‖z̄‖ = 1
(the “inner measure” of C, Goffin 1980)

Width of a Cone


t-Relaxation of a Cone
A convex cone C can be characterized as:
C := {x : ⟨d, x⟩ ≥ 0 for all d ∈ extray C*}
  = {x : ⟨d, x⟩/(‖d‖‖x‖) ≥ 0 for all d ∈ extray C*}
A t-relaxation of C for a given relaxation value t > 0 is:
C_t := {x : ⟨d, x⟩/(‖d‖‖x‖) ≥ −t for all d ∈ extray C*}

t-Relaxation of a Cone
C_t looks like:

“Deep Separation Oracle” Motivating Idea
For a given x ≠ 0 and a scalar t > 0, determine whether or not x lies in the t-relaxation of the cone C

Deep Separation Oracle Definition
Deep separation oracle for a cone C: for a given x ≠ 0 and a scalar t > 0, either:
(I) correctly identifies that ⟨d, x⟩/(‖d‖‖x‖) ≥ −t for all d ∈ extray C*, or
(II) returns d ∈ C* satisfying ‖d‖ = 1 and ⟨d, x⟩/(‖d‖‖x‖) < −t
(I) states that x ∈ C_t
(II) states that x is “deeply” separated from C by d:
⟨d, y⟩ ≥ 0 for all y ∈ C (that is, d ∈ C*), and ⟨d, x⟩ < −t‖d‖‖x‖

Perceptron Algorithm for Linear Inequalities
Compute a solution to:
(P): Ax > 0, x ∈ IR^n
A ∈ IR^{m×n}, and assume that (P) has a solution
F := {x ∈ IR^n : Ax ≥ 0} is the feasibility cone
K = IR^m_+ := {s ∈ IR^m : s_i ≥ 0, i = 1, …, m} is the inclusion cone

Perceptron Algorithm
(P): Ax > 0, x ∈ IR^n
Perceptron Algorithm for Homogeneous Linear Inequalities:
(a) Let x be the origin in IR^n. Repeat:
(b) If Ax > 0, Stop. Otherwise, pick i for which A_i x ≤ 0, and set x ← x + A_i/‖A_i‖.
(A transcription in code follows below.)
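A direct Python transcription of this loop (illustrative; assumes the system is strictly feasible so that, by the theorem two slides below, the loop terminates):

```python
import numpy as np

def perceptron(A, max_iters=100_000):
    """Find x with A @ x > 0 (componentwise) for a strictly feasible system.

    Classical perceptron: add the normalized violated row. Terminates within
    1/tau_F^2 iterations when the feasibility cone has width tau_F.
    """
    m, n = A.shape
    x = np.zeros(n)                          # (a) start at the origin
    for _ in range(max_iters):
        viol = np.flatnonzero(A @ x <= 0)    # rows with A_i x <= 0
        if viol.size == 0:
            return x                         # (b) Ax > 0: done
        i = viol[0]
        x = x + A[i] / np.linalg.norm(A[i])  # perceptron update
    return None                              # iteration budget exhausted
```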

Perceptron Algorithm, continued


Perceptron Algorithm Complexity
Theorem (≈ Rosenblatt 1962): Let τ_F denote the width of the feasibility cone of (P). The perceptron algorithm will compute a solution of (P) in at most ⌊1/τ_F²⌋ iterations.

Perceptron Algorithm for a Conic System
Compute a solution to:
(P): Ax ∈ intK, x ∈ X
A maps n-dimensional Euclidean space X to m-dimensional Euclidean space Y
Assume that (P) has a solution
F := {x ∈ X : Ax ∈ K} is the feasibility cone
K is the inclusion cone; we assume that K is a regular cone

Conic Perceptron Algorithm
(P): Ax ∈ intK, x ∈ X
Perceptron Algorithm for a Conic System:
(a) Let x be the origin in X. Repeat:
(b) If Ax ∈ intK, Stop. Otherwise, call the interior separation oracle for K at Ax, returning 0 ≠ λ ∈ K* such that ⟨λ, Ax⟩ ≤ 0; set d = A*λ/‖A*λ‖ and x ← x + d.

Conic Perceptron Algorithm, continued


Conic Perceptron Algorithm Complexity
(P): Ax ∈ intK, x ∈ X
F := {x ∈ X : Ax ∈ K}
Theorem: Let τ_F denote the width of the feasibility cone of (P). The perceptron algorithm will compute a solution of (P) in at most ⌊1/τ_F²⌋ iterations.

Proof of Conic Perceptron Complexity Theorem
Proof: Let z̄ be the center of F, which satisfies B(z̄, τ_F) ⊂ F and ‖z̄‖ = 1
Consider the potential function π(x) = ⟨x, z̄⟩/‖x‖
At each iteration the numerator of π(x) increases by at least τ_F:
⟨x + d, z̄⟩ = ⟨x, z̄⟩ + ⟨d, z̄⟩ ≥ ⟨x, z̄⟩ + τ_F,
and the denominator does not increase too quickly:
‖x + d‖² = ‖x‖² + 2⟨x, d⟩ + ⟨d, d⟩ ≤ ‖x‖² + 1,
since ⟨x, d⟩ ≤ 0, ⟨d, d⟩ = 1, and ⟨d, z̄⟩ ≥ τ_F
After k = ⌊1/τ_F²⌋ + 1 iterations we would have ⟨x, z̄⟩/‖x‖ = π(x) ≥ kτ_F/√k > 1, which is a contradiction.

Separation Oracle for F when K = S^{k×k}_+
(P): Ax ∈ intS^{k×k}_+, x ∈ X, with K = S^{k×k}_+
Suppose that Ax ∉ intK. Compute any eigenvector v of Ax associated with a non-positive eigenvalue, and define d = A*(vv^T).
⟨d, x⟩ = v^T(Ax)v ≤ 0
For all y ∈ F we have: ⟨d, y⟩ = v^T(Ay)v ≥ 0
Therefore d ∈ F* and ⟨d, x⟩ ≤ 0

Can τ_F be Improved by a Linear Transformation?
Select v ∈ intF ∩ intF*, and consider for θ > 0:
Â := A ∘ [I + θvv^T] and F̂ := {x : Âx ∈ K}
If x ∈ F, then Âx = Ax + θ(Av)(v^T x) ∈ intK, since Ax ∈ K, Av ∈ intK, and v^T x > 0
Proposition: τ_F̂ > τ_F, and τ_F̂ → 1 as θ → ∞

Linear Transformation to Improve τF


Linear Transformation to Improve τ_F
Select v ∈ intF ∩ intF*, and define for θ > 0:
Â := A ∘ [I + θvv^T] and F̂ := {x : Âx ∈ K}
Proposition: τ_F̂ > τ_F
Problem: Computing v is at least as hard as solving the original problem (P)

A Computable Linear Transformation to Improve τ_F
Proposition (Dunagan/Vempala): Let z̄ denote the center of F, i.e., B(z̄, τ_F) ⊂ F and ‖z̄‖ = 1. Suppose that x satisfies:
(a): ⟨z̄, x⟩/(‖z̄‖‖x‖) ≥ 1/√n, and
(b): ⟨d, x⟩/(‖d‖‖x‖) ≥ −1/(32n) for all d ∈ extray F*.
Define Â := A ∘ [I + xx^T/⟨x, x⟩] and F̂ := {x : Âx ∈ K}
Then τ_F̂ ≥ τ_F (1 + 1/(3.02n))

Interpreting Suppositions (a) and (b)
(a): ⟨z̄, x⟩/(‖z̄‖‖x‖) ≥ 1/√n
This states that x makes a slightly acute angle with the center z̄ of the feasibility cone
Although we do not know z̄, it turns out that this condition will be easy to satisfy with high probability

Interpreting Suppositions (a) and (b), continued
(b): ⟨d, x⟩/(‖d‖‖x‖) ≥ −1/(32n) for all d ∈ extray F*
This states that x lies in the t = 1/(32n)-relaxation of the feasibility cone F, i.e., x ∈ F_t
Easy to check if we have a deep separation oracle for F
Also, such an x will be easy to compute (with high probability) if we have a deep separation oracle for F

Computing a Point in the t-Relaxation of a Cone
For a given cone C, consider the following algorithm for computing a point x ∈ C_t:
Perceptron Improvement Subroutine:
(a) Let x ∼ U(S^{n−1}). Repeat at most ⌊(1/t²) ln(n)⌋ times:
(b) Call the deep separation oracle for C at x with relaxation parameter t. If ⟨d, x⟩ ≥ −t‖d‖‖x‖ for all d ∈ extray C* (condition I), Stop. Otherwise, the oracle returns d ∈ C*, ‖d‖ = 1, such that ⟨d, x⟩ ≤ −t‖d‖‖x‖ (condition II); set x ← x − ⟨d, x⟩ d. If x = 0, restart at (a).
(A sketch in code follows below.)
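A Python sketch of the subroutine (illustrative; `deep_oracle(x, t)` is an assumed stub returning None for condition I, or a unit vector d for condition II):

```python
import math
import numpy as np

def perceptron_improvement(deep_oracle, n, t, max_restarts=100,
                           rng=np.random.default_rng()):
    """Attempt to produce a point x in the t-relaxation C_t of a cone C.

    deep_oracle(x, t) -> None (condition I: x in C_t), or a unit vector
                         d in C* with <d, x> <= -t ||d|| ||x|| (condition II).
    """
    max_inner = int((1.0 / t**2) * math.log(n))
    for _ in range(max_restarts):
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)                # (a) x ~ U(S^{n-1})
        for _ in range(max_inner):
            d = deep_oracle(x, t)             # (b)
            if d is None:
                return x                      # condition I holds: stop
            x = x - np.dot(d, x) * d          # project away the deep violation
            if np.linalg.norm(x) < 1e-12:
                break                         # x = 0: restart at (a)
    return None
```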

Evaluation of the Perceptron Improvement Subroutine
Lemma: Let z̄ be the center of C. With probability at least 1/8, the Perceptron Improvement Subroutine returns x̃ satisfying:
• ⟨z̄, x̃⟩/(‖z̄‖‖x̃‖) ≥ 1/√n, and
• ⟨d, x̃⟩/(‖d‖‖x̃‖) ≥ −t for every d ∈ extray C*

Proof of Conic Perceptron Improvement Lemma
Proof: Let z̄ be the center of C, which satisfies B(z̄, τ_C) ⊂ C and ‖z̄‖ = 1
Consider the potential function π(x) = ⟨z̄, x⟩/‖x‖
With probability at least 1/8, the subroutine’s starting point x^0 satisfies ⟨z̄, x^0⟩/(‖z̄‖‖x^0‖) ≥ 1/√n
At each iteration the numerator of π(x) does not decrease:
⟨z̄, x − ⟨d, x⟩ d⟩ ≥ ⟨z̄, x⟩ since ⟨d, x⟩ ≤ 0 and ⟨z̄, d⟩ ≥ 0
At each iteration the denominator of π(x) decreases rapidly:
‖x − ⟨d, x⟩ d‖² = ‖x‖² − ⟨d, x⟩² ≤ (1 − t²)‖x‖², since ⟨x, d⟩ ≤ −t‖d‖‖x‖
After k = ⌊(1/t²) ln(n)⌋ + 1 iterations we would have ⟨z̄, x⟩/‖x‖ = π(x) > 1, which is a contradiction.

Putting it All Together
Re-scaled Perceptron Algorithm for a Conic System
Step 1 (Initialization): Set B = I and σ = 1/(32n).
Step 2 (Perceptron Algorithm for a Conic System): Run the perceptron algorithm for the conic system for at most ⌊1/σ²⌋ iterations, stopping if Ax ∈ intK.
Step 3 (Perceptron Improvement Subroutine):
(a) Let x be a random unit vector in X. Repeat at most ⌊(1/σ²) ln(n)⌋ times:
(b) Call the deep separation oracle for F at x with t = σ. If ⟨d, x⟩ ≥ −σ‖d‖‖x‖ for all d ∈ extray F* (condition I), Goto Step 4. Otherwise, the oracle returns d ∈ F*, ‖d‖ = 1, such that ⟨d, x⟩ ≤ −σ‖d‖‖x‖ (condition II); set x ← x − ⟨d, x⟩ d. If x = 0, restart at (a).
(c) Call the deep separation oracle for F at x with t = σ. If the oracle returns condition (II), restart at (a).
Step 4 (Re-scaling): A ← A ∘ [I + xx^T/⟨x, x⟩], B ← B ∘ [I + xx^T/⟨x, x⟩], and Goto Step 2.

Probabilistic Analysis of the Algorithm
It turns out that even if the perceptron improvement subroutine fails, it cannot fail too badly:
Lemma: Suppose that τ_F ≤ 1/(32n), and let x̃ be the output of the perceptron improvement subroutine. Let Â := A ∘ [I + x̃x̃^T/⟨x̃, x̃⟩] and F̂ := {x : Âx ∈ K}. Then:
• With probability at least 1/8, τ_F̂ ≥ (1 + 1/(3.02n)) τ_F
• Regardless, τ_F̂ ≥ (1 − 1/(32n) − 1/(512n²)) τ_F

Complexity of the Algorithm
Theorem: Suppose that n ≥ 2. If (P) has a solution, the re-scaled conic perceptron algorithm will compute a solution in at most
T = max{ 4096 ln(1/δ), 139 n ln(1/(32nτ_F)) } = O( n ln(1/τ_F) + ln(1/δ) )
iterations, with probability at least 1 − δ. Moreover, with probability at least 1 − δ, the algorithm makes at most O(T n² ln(n)) calls of a deep separation oracle for F and at most O(T n²) calls of a separation oracle for F.

“High Probability”
Let T be the number of iterations needed to have success with probability p = 0.125

Probability of Success    Iteration Bound
0.9                       18T
0.99                      36T
0.999                     54T
0.9999                    72T

Deep Separation Oracles for F
This methodology is based on the availability of an efficient deep separation oracle for the feasibility cone F of:
(P): Ax ∈ intK, x ∈ X
F := {x ∈ X : Ax ∈ K} is the feasibility cone
K is the inclusion cone
The availability of an efficient deep separation oracle for F depends on the structure of the inclusion cone K

Deep Separation Oracles for F, continued
An efficient deep separation oracle for F is fairly easy to construct when (P) has the format:
A_L x ∈ int IR^m_+
A_i x ∈ int Q^{n_i}, i = 1, …, q
x ∈ int S^{k×k}_+
where
IR^m_+ := {s ∈ IR^m : s_j ≥ 0, j = 1, …, m}
Q^k := {s ∈ IR^k : ‖(s_1, s_2, …, s_{k−1})‖ ≤ s_k}
S^{k×k} := {X ∈ IR^{k×k} : X = X^T}
S^{k×k}_+ := {X ∈ S^{k×k} : ⟨v, Xv⟩ ≥ 0 for all v ∈ IR^k}

Deep Separation Oracles for F, continued
A somewhat efficient deep separation oracle for F can be constructed when (P) has the format:
Ax ∈ int S^{k×k}_+

Simple Calculus of Deep Separation Oracles
Suppose we have deep separation oracles for F_1 and F_2 of the instances:
A_1 x ∈ intK_1, x ∈ X   with   F_1 = {x : A_1 x ∈ K_1}
A_2 x ∈ intK_2, x ∈ X   with   F_2 = {x : A_2 x ∈ K_2}
Consider:
A_1 x ∈ intK_1, A_2 x ∈ intK_2, x ∈ X
F = {x : A_1 x ∈ K_1, A_2 x ∈ K_2} = {x : Ax ∈ K}

Simple Calculus, continued
A_1 x ∈ intK_1, A_2 x ∈ intK_2, x ∈ X
K = K_1 × K_2
F = {x : A_1 x ∈ K_1, A_2 x ∈ K_2} = {x : Ax ∈ K}
F = F_1 ∩ F_2, where F_i = {x : A_i x ∈ K_i}, i = 1, 2
F* = F_1* + F_2*
Therefore: extray F* ⊂ (extray F_1* ∪ extray F_2*)

Simple Calculus, continued
A_1 x ∈ intK_1, A_2 x ∈ intK_2, x ∈ X
extray F* ⊂ (extray F_1* ∪ extray F_2*)
Deep Separation Oracle for F_1 ∩ F_2:
Given a scalar t > 0 and x ≠ 0, call the deep separation oracles for F_1 and F_2 at x.
If both oracles report Condition I, return Condition I.
Otherwise, at least one oracle reports Condition II and provides d ∈ F_i* ⊂ F*, ‖d‖ = 1, such that ⟨d, x⟩ < −t‖d‖‖x‖; return d and Stop.

Deep Separation Oracle for F when K = IR^m_+
F = {x ∈ IR^n : Ax ≥ 0}
F* = {A*λ : λ ≥ 0}
extray F* ⊂ {A_1, …, A_m}
Deep separation oracle for F at x ≠ 0 with relaxation parameter t:
If ⟨A_i, x⟩/(‖A_i‖‖x‖) ≥ −t for all i = 1, …, m, report Condition (I).
If there exists i ∈ {1, …, m} for which ⟨A_i, x⟩/(‖A_i‖‖x‖) < −t, return d = A_i/‖A_i‖ and report Condition (II).
Complexity is O(mn) operations
(A sketch in code follows below.)
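This oracle is a single pass over the rows of A. A Python sketch (my illustration):

```python
import numpy as np

def deep_oracle_orthant(A, x, t):
    """Deep separation oracle for F = {x : Ax >= 0} (K = IR^m_+).

    Returns None for Condition I (x in F_t), or a unit d = A_i/||A_i||
    with <d, x> < -t ||d|| ||x|| for Condition II. O(mn) work.
    """
    row_norms = np.linalg.norm(A, axis=1)
    cosines = (A @ x) / (row_norms * np.linalg.norm(x))
    i = int(np.argmin(cosines))        # most violated extreme ray
    if cosines[i] >= -t:
        return None                    # Condition I
    return A[i] / row_norms[i]         # Condition II: deep separator
```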

Deep Separation Oracle for F when K = Q^k
For convenience use the notation F = {x : ‖Mx‖ ≤ g^T x}, where M is (k − 1) × n and g is an n-vector
Suppose that x is given and that ‖Mx‖ > g^T x, whereby x ∉ F
Solve for d:
t* := min_d { x^T d : ‖d‖ = 1, d ∈ F* }
If t*/‖x‖ ≥ −t, then Condition I is satisfied
If t*/‖x‖ < −t, then Condition II is satisfied; return d
This requires O(kn² + n ln ln(1/t) + n ln ln(1/min{τ_F, τ_{F*}})) operations

Deep Separation Oracle: F when K = Q^k, continued
t* := min_d { x^T d : ‖d‖ = 1, d ∈ F* }
x ∉ F implies t* < 0, so we can replace “=” with “≤”, obtaining the following dual pair of problems:
t* := min_d { x^T d : ‖d‖ ≤ 1, d ∈ F* }
v* := max_y { −‖y − x‖ : y ∈ F }

Deep Separation Oracle: F when K = Q^k, continued
t* := min_d { x^T d : ‖d‖ ≤ 1, d ∈ F* }
v* := max_y { −‖y − x‖ : y ∈ F }
The dual problem can be written as the following nice minimum-norm problem:
−v* := min_y { ‖y − x‖ : ‖My‖ ≤ g^T y }

Deep Separation Oracle: F when K = Q^k, continued
−v* := min_y { ‖y − x‖ : ‖My‖ ≤ g^T y }
This is a very well-behaved problem, and it almost has a closed-form solution. The work in solving it lies in two tasks:
• Decompose M^T M − gg^T = QDQ^T, where Q is orthonormal and D is diagonal (D will have exactly one negative entry, re-scaled to −1)
• Solve for the unique solution γ̄ of f(γ) = 0 in the range γ ∈ (0, 1), where
f(γ) := Σ_{i=1}^n D_i (x^T Q_i)² / (1 + D_i γ)²   or   f(γ) := Σ_{i=1}^n D_i^{−1} (x^T Q_i)² / (1 + D_i^{−1} γ)²

Deep Separation Oracle: F when K = Q^k, continued
Solve for the unique solution γ̄ of f(γ) = 0 in the range γ ∈ (0, 1), where
f(γ) := Σ_{i=1}^n D_i (x^T Q_i)² / (1 + D_i γ)²   or   f(γ) := Σ_{i=1}^n D_i^{−1} (x^T Q_i)² / (1 + D_i^{−1} γ)²
f(γ) is monotone on (0, 1), and enhanced binary search/Newton using ideas of Ye 1992 will yield the necessary tolerance in O(ln ln(1/t)) steps.
Total complexity is O(kn² + n ln ln(1/t) + n ln ln(1/min{τ_F, τ_{F*}})) operations.

Deep Separation Oracle for F = S^{k×k}_+
F = S^{k×k}_+
Deep separation oracle for S^{k×k}_+ at X ≠ 0 with relaxation parameter t:
Check the condition “X + t‖X‖I ⪰ 0”
If X + t‖X‖I ⪰ 0, report Condition (I).
If there exists v such that v^T(X + t‖X‖I)v < 0, return d = vv^T/‖v‖² and report Condition (II).
Complexity is O(k³) operations in practice
(A sketch in code follows below.)
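A sketch via a symmetric eigendecomposition (my illustration; using the Frobenius norm for ‖X‖ is my assumption):

```python
import numpy as np

def deep_oracle_psd(X, t):
    """Deep separation oracle for the PSD cone at symmetric X != 0.

    Condition I:  X + t||X|| I is PSD -> return None.
    Condition II: an eigenvector v with v^T (X + t||X|| I) v < 0 exists
                  -> return d = v v^T / ||v||^2 (unit Frobenius norm).
    """
    norm_X = np.linalg.norm(X)       # Frobenius norm (assumed choice)
    w, V = np.linalg.eigh(X)         # eigenvalues ascending, O(k^3) work
    if w[0] >= -t * norm_X:
        return None                  # Condition I
    v = V[:, 0]                      # eigenvector of the most negative eigenvalue
    return np.outer(v, v)            # ||v|| = 1, so ||v v^T||_F = 1
```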

Deep Separation Oracle for F when K = S^{k×k}_+
F = {x : Ax ∈ S^{k×k}_+}
t > 0 is the relaxation parameter
Suppose that x̄ ≠ 0 is given and x̄ ∉ F
We seek d satisfying:
(S): ⟨x̄, d⟩/(‖x̄‖‖d‖) < −t, d ∈ F*
If d is feasible for (S), then Condition II is satisfied; return d
If (S) has no solution, then report Condition I

Deep Separation Oracle: F when K = S^{k×k}_+, continued
(S): ⟨x̄, d⟩/(‖x̄‖‖d‖) < −t, d ∈ F*
is equivalent to
(S): ⟨x̄, d⟩ + t‖x̄‖‖d‖ < 0, d ∈ F*
Proposition: F* = cl{A*w : w ∈ intS^{k×k}_+}
Substituting d = A*w:
(S): ⟨w, Ax̄⟩ + t‖x̄‖‖A*w‖ < 0, w ∈ S^{k×k}_+

Deep Separation Oracle: F when K = S^{k×k}_+, continued
(S): ⟨w, Ax̄⟩ + t‖x̄‖‖A*w‖ < 0, w ∈ intS^{k×k}_+
This has a single SOC constraint, plus the variables must lie in S^{k×k}_+
It can be solved by the re-scaled conic perceptron algorithm itself
We bound the complexity using Renegar’s condition measure C(A)

Deep Separation Oracle: F when K = S^{k×k}_+, continued
Recall Renegar’s condition measure for (P):
(P): Ax ∈ intK, x ∈ X
Let M denote those operators A : X → Y for which (P) has a solution. For A ∈ M,
ρ(A) := min_{∆A} { ‖∆A‖ : A + ∆A ∉ M }
C(A) := ‖A‖/ρ(A) is a scale-invariant reciprocal of the smallest perturbation of A that would render (P) infeasible
C(A) is used to bound the complexity of both the ellipsoid method and IPMs for solving (P)

Deep Separation Oracle: F when K = S^{k×k}_+, continued
(S): ⟨w, Ax̄⟩ + t‖x̄‖‖A*w‖ < 0, w ∈ intS^{k×k}_+
Half-deep-separation oracle for F, for x ≠ 0, σ > 0, and L > 0:
Set t := σ/2, and run the re-scaled perceptron algorithm to compute a solution w̃ of (S) for at most
T̂ := max{ 4096 ln(1/δ), 139 n ln(6L/τ_{K*}) }
iterations.
If a solution w̃ of (S) is computed, return d := A*w̃/‖A*w̃‖, report Condition II, and Stop.
If no solution is computed within T̂ iterations, report “either Condition I is satisfied, or L < C(A),” and Stop.

Deep Separation Oracle: F when K = S^{k×k}_+, continued
Theorem: Combined with binary search, this deep-separation oracle yields an algorithm for computing a solution of (P) in time polynomial in n, ln(C(A)), ln(1/δ), ln(1/τ_K), and ln(1/τ_{K*}).

Unexplored Issues
• The case when (P) is not feasible and the alternative system has a solution
• A more efficient deep separation oracle for S^{k×k}_+

Randomized Methods: Comments
• Implementations tend to be easy
• Recent advances in theoretical efficiency
• Approximations to some very hard problems with high probability
  • Computing the volume of S within ε
  • Computing a rank-k approximation of a given matrix M
• Limits of randomized methods?
  • Cannot approximate the diameter of S within a factor of √d
• Perhaps the time has come for randomized algorithms in convex optimization

References
A. Belloni and R. M. Freund, “Projective Preconditioners for Improving the Behavior of Homogeneous Conic Systems”
A. Belloni, R. M. Freund, and S. Vempala, “An Efficient Re-scaled Perceptron Algorithm for Conic Systems”
D. Bertsimas and S. Vempala, “Solving convex programs by random walks”
A. Brieden, P. Gritzmann, R. Kannan, V. Klee, L. Lovász, and M. Simonovits, “Approximation of Radii and norm-maxima”
J. Dunagan and S. Vempala, “A simple polynomial-time rescaling algorithm for solving linear programs”
A. Kalai and S. Vempala, “Linear Programming via Simulated Annealing”

References, continued
L. Lovász and S. Vempala, “Simulated annealing in convex bodies and an O*(n⁴) volume algorithm”
A. Prékopa, “On Logarithmic Concave Measures and Functions”
S. Vempala, “Geometric Random Walks: A Survey”
F. Solis and R. J-B. Wets, “Minimization by Random Search Techniques”