Quantum gradient descent and Newton’s method for constrained polynomial optimization

Patrick Rebentrost,1, ∗ Maria Schuld,2, † Francesco Petruccione,2, 3 and Seth Lloyd1, 4, ‡

arXiv:1612.01789v1 [quant-ph] 6 Dec 2016

1 Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139
2 University of KwaZulu-Natal, Durban 4000, South Africa
3 National Institute for Theoretical Physics, South Africa
4 Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
(Dated: December 7, 2016)

Abstract

Solving optimization problems in disciplines such as machine learning is commonly done by iterative methods. Gradient descent algorithms find local minima by moving along the direction of steepest descent while Newton’s method takes into account curvature information and thereby often improves convergence. Here, we develop quantum versions of these iterative optimization algorithms and apply them to homogeneous polynomial optimization with a unit norm constraint. In each step, multiple copies of the current candidate are used to improve the candidate using quantum phase estimation, an adapted quantum principal component analysis scheme, as well as quantum matrix multiplications and inversions. The required operations scale polylogarithmically in the dimension of the solution vector, an exponential speed-up over classical algorithms, which scale polynomially. The quantum algorithm can therefore be beneficial for high-dimensional problems where a relatively small number of iterations is sufficient.

∗ [email protected]
† [email protected]
‡ [email protected]

I. INTRODUCTION

Optimization plays a vital role in various fields. For example, in many machine learning and artificial intelligence applications such as neural networks, support vector machines and regression, the objective function to minimize is a least-squares loss or error function f(x) that takes a vector-valued input to a scalar-valued output [1]. For convex functions there exists a unique global minimum, and in the case of equality-constrained quadratic programming the solution to the optimization problem reduces to a matrix inversion problem. In machine learning we often deal with either convex objective functions beyond quadratic programming, or non-convex objective functions with multiple minima. In these situations no closed-form solution is known and the solution has to be found by an iterative search in the landscape defined by the objective function. One popular approach focuses on gradient descent methods such as the famous backpropagation algorithm in the training of neural networks. Gradient descent finds a minimum starting from an initial guess by iteratively proceeding along the negative gradient of the objective function. Because it only takes into account the first derivatives of f(x), gradient descent may involve many steps when the problem has an unfortunate landscape (for example, one featuring long and narrow valleys). To overcome this, Newton’s method multiplies the gradient of the function by the inverse Hessian. By taking the curvature information into account in this manner, the number of steps required to find the minimum is often dramatically reduced, at the cost of computing and inverting the matrix of second derivatives of the function with respect to all inputs. More precisely, once the method arrives in the vicinity of a minimum, the algorithm enters a realm of quadratic convergence, where the number of correct bits in the solution doubles with every step [2].

As quantum computation becomes more realistic and ventures into the field of machine learning, it is worthwhile to consider in what way optimization algorithms can be translated into a quantum computational framework. Optimization has been considered in various implementation proposals of large-scale quantum computing [3, 4]. The adiabatic quantum computing paradigm [5] and its famous sibling, quantum annealing, are strategies to find the ground state of a Hamiltonian and can therefore be understood as ‘analogue’ algorithms for optimization problems. The first commercial implementation of quantum annealing, the D-Wave machine, solves certain quadratic unconstrained optimization problems and has been tested for machine learning applications such as classification [6, 7] and sampling for the training of Boltzmann machines [8].

More recently, in the gate model of quantum computation, quadratic optimization problems deriving from machine learning tasks such as least-squares regression [9, 10] and the least-squares support vector machine [11] were tackled with the powerful quantum matrix inversion technique [12], demonstrating an exponential speedup for such single-shot solutions under certain conditions.

In this work, we provide quantum algorithms for iterative optimization methods, specifically gradient descent and Newton’s method. We thereby extend the quantum machine learning literature by a method that can be used in non-quadratic convex or even non-convex optimization problems. The main idea is that at each step we take multiple copies of a quantum state |x^{(t)}⟩ to produce multiple copies of another quantum state |x^{(t+1)}⟩ by considering the gradient vector and the Hessian matrix of the objective function. For small step sizes, this quantum state improves the objective function. Since at each step we consume multiple copies to prepare a single copy of the next step, the algorithms scale exponentially in the number of steps performed. While this exponential scaling in the number of steps precludes the use of our algorithms for optimizations that require many iterations, it is acceptable in cases where only a few steps are needed to arrive at a reasonably good solution, especially in the case of Newton’s method. A possible application in machine learning could be the fine-tuning in the training of deep neural networks that have been layer-wise pre-trained by unsupervised learning algorithms [13, 14], and which exhibit a large number of parameters. Here, we consider optimization for the special case of homogeneous polynomials with a normalization constraint. We show how to explicitly obtain the gradient of the objective function and the Hessian matrix using quantum techniques such as quantum principal component analysis (QPCA), without resorting to oracle abstractions for these operations.

II. PROBLEM STATEMENT

Gradient descent is formulated as follows. Let f : R^N → R be the objective function one intends to minimize. Given an initial point x^{(0)} ∈ R^N, one iteratively updates the candidate using information on the steepest descent of the objective function in a neighborhood of the current point,

x^{(t+1)} = x^{(t)} − η ∇f(x^{(t)}),    (1)

where η > 0 is a hyperparameter called the learning rate, which may in general be step-dependent. Newton’s method extends this strategy by taking into account information on the curvature, i.e. the second derivatives of the objective function, in every step. The iterative update therefore includes the Hessian matrix H of the objective function,

x^{(t+1)} = x^{(t)} − η H^{−1} ∇f(x^{(t)}),    (2)

with the entries H_{ij} = ∂²f / (∂x_i ∂x_j) evaluated at the current point x^{(t)}.
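As a point of reference, the two update rules of Eqs. (1) and (2) can be sketched classically as follows (a minimal illustration; the quadratic test function, starting point and learning rate are placeholder choices, not taken from the paper):

import numpy as np

def gradient_descent_step(x, grad, eta):
    # Eq. (1): move along the negative gradient
    return x - eta * grad(x)

def newton_step(x, grad, hess, eta):
    # Eq. (2): precondition the gradient with the inverse Hessian
    return x - eta * np.linalg.solve(hess(x), grad(x))

# toy convex example f(x) = x^T A x / 2 with a positive definite A (illustrative only)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
hess = lambda x: A
x = np.array([1.0, -1.0])
for _ in range(5):
    x = newton_step(x, grad, hess, eta=1.0)
print(x)  # Newton's method reaches the minimum at the origin in a single step here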

The objective function we seek to minimize here is a polynomial of order 2p defined over x ∈ R^N,

f(x) = (1/2) Σ_{i_1,...,i_2p=1}^{N} a_{i_1...i_2p} x_{i_1} ⋯ x_{i_2p},    (3)

with the N^{2p} coefficients a_{i_1...i_2p} ∈ R and x = (x_1, ..., x_N)^T.

Moving to a higher-dimensional space, we can write this function as a quadratic form,

f(x) = (1/2) (x^T ⊗ ⋯ ⊗ x^T) A (x ⊗ ⋯ ⊗ x).    (4)

This reformulation of Equation (3) will be important for the quantum algorithm. It projects the original vector x ∈ R^N into a space constructed by taking p tensor products of the original space, R^N ⊗ ⋯ ⊗ R^N. (For example, for p = 2 the two-dimensional input x = (x_1, x_2)^T gets projected to a vector containing the polynomial terms up to second order, x ⊗ x = (x_1², x_1 x_2, x_2 x_1, x_2²).) The coefficients become entries of a p-dimensional tensor A ∈ R^{N×N} ⊗ ⋯ ⊗ R^{N×N}.

Note that such a matrix A can always be written as a sum of tensor products on each of the p spaces that make up the ‘larger space’,

A = Σ_{α=1}^{K} A^α_1 ⊗ ⋯ ⊗ A^α_p,    (5)

where each A^α_i is an N × N matrix for i = 1, ..., p and K is the number of terms in the sum needed to specify A.

Such a representation simplifies the computation of the gradient and the Hessian matrix of f(x). We assume in the following that this decomposition of the coefficient tensor is known. Equations (3) and (4) describe a homogeneous polynomial (also called an algebraic form) of even order, which by choosing appropriate coefficients can express any polynomial of lower order as well. For the more familiar case of p = 1, the objective function reduces to f(x) = x^T A x and a common quadratic optimization problem. If A is positive definite, the problem becomes convex and the optimization landscape has a single global minimum. In this paper we consider general but sparse matrices A. For gradient descent and Newton’s method, we need to compute expressions for the gradient and the Hessian matrix of the objective function.

In the tensor formulation of Eq. (5), the gradient of the objective function at point x can be written as

∇f(x) = (1/2) Σ_{α=1}^{K} Σ_{j=1}^{p} ( ∏_{i≠j} x^T A^α_i x ) ((A^α_j)^T + A^α_j) x =: D x,    (6)

which can be interpreted as an operator D, which is a sum of matrices A^α_j with x-dependent coefficients ∏_{i≠j} x^T A^α_i x, applied to a single vector x. In a similar fashion, the Hessian matrix at the same point reads

H(f(x)) = (1/2) Σ_{α=1}^{K} Σ_{j,k=1; j≠k}^{p} ( ∏_{i≠j,k} x^T A^α_i x ) ((A^α_k)^T + A^α_k) x x^T ((A^α_j)^T + A^α_j)
          + (1/2) Σ_{α=1}^{K} Σ_{j=1}^{p} ( ∏_{i≠j} x^T A^α_i x ) ((A^α_j)^T + A^α_j) =: H_A + D,    (7)

defining the operator H_A. The core of the quantum algorithms for gradient descent and Newton’s method will be to implement these matrices as quantum operators acting on the current state and successively shifting it towards the desired minimum. Since the quantum methods we propose represent vectors x as quantum states, a normalization constraint appears naturally, ensuring that the vectors are normalized to unit length, x^T x = 1. In the optimization literature this is known as a spherical constraint, and applications of such optimization problems appear in image and signal processing, biomedical engineering, speech recognition and quantum mechanics [15]. The problem we solve is thus given by min_x f(x) subject to x^T x = 1. A well-known adaptation of gradient descent for such constrained problems is projected gradient descent [16, 17], where after each iteration the current solution is projected onto the feasible set (here corresponding to a renormalization to x^T x = 1). It turns out that our quantum algorithms naturally follow this method: the renormalization is explicitly realized in each update of the quantum state. We therefore obtain the general convergence properties of this classical method.
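To make the constrained setting concrete, the following sketch builds the operator D of Eq. (6) from a given decomposition {A^α_j} and runs projected gradient descent with renormalization after every update. It is a classical illustration under assumed toy data (K = 1, p = 2, random matrices); the function and variable names are ours, not the paper’s:

import numpy as np

def f(x, As):
    # Eqs. (3)-(5): f(x) = 1/2 * sum_alpha prod_j x^T A^alpha_j x
    return 0.5 * sum(np.prod([x @ Aj @ x for Aj in Aalpha]) for Aalpha in As)

def D_operator(x, As):
    # Eq. (6): D = 1/2 * sum_alpha sum_j (prod_{i != j} x^T A^alpha_i x) ((A^alpha_j)^T + A^alpha_j)
    D = np.zeros((len(x), len(x)))
    for Aalpha in As:
        vals = [x @ Aj @ x for Aj in Aalpha]
        for j, Aj in enumerate(Aalpha):
            coeff = np.prod([v for i, v in enumerate(vals) if i != j])
            D += 0.5 * coeff * (Aj.T + Aj)
    return D

def projected_gradient_step(x, As, eta):
    # gradient step followed by projection back onto the unit sphere
    x_new = x - eta * D_operator(x, As) @ x
    return x_new / np.linalg.norm(x_new)

rng = np.random.default_rng(1)
N = 4
As = [[rng.normal(size=(N, N)), rng.normal(size=(N, N))]]   # K = 1, p = 2 (toy choice)
x = rng.normal(size=N); x /= np.linalg.norm(x)
for _ in range(20):
    x = projected_gradient_step(x, As, eta=0.05)
print(f(x, As))  # the objective decreases along the iterates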

III. QUANTUM GRADIENT DESCENT ALGORITHM

To implement a quantum version of the gradient descent algorithm we assume without loss of generality that the dimension N of the vectors x is N = 2^n, where n is an integer. As mentioned, we consider the case of a normalization constraint x^T x = 1. Following previous quantum algorithms [11, 12, 18], we represent the entries of x = (x_1, ..., x_N)^T as the amplitudes of an n-qubit quantum state |x⟩ = Σ_{j=1}^{N} x_j |j⟩, where |j⟩ is the j-th computational basis state. We will first describe how to implement quantumly the update of the current candidate |x^{(t)}⟩ at the t-th step of the gradient descent method via a quantum state |∇f(x^{(t)})⟩ representing the gradient ∇f(x^{(t)}) at the current step. We then discuss how to prepare |∇f(x^{(t)})⟩ in more detail.

First, it is important to note that with the amplitude encoding method the classical scalar coefficients x^T A^α_i x that occur as weighting factors in Equations (6) and (7) can be written as expectation values of the operators A^α_i with respect to the quantum state |x⟩, i.e. ⟨x| A^α_i |x⟩ = tr{A^α_i ρ}, where ρ = |x⟩⟨x| is the corresponding density matrix. Of course, the A^α_i are not symmetric (Hermitian) in general and therefore do not correspond to quantum observables. However, with this formal relation the full operator D = Σ_α Σ_j ( ∏_{i≠j} x^T A^α_i x ) ((A^α_j)^T + A^α_j) can be reproduced by a quantum operator that is equal to

D = tr_{1...p−1}{ρ^{⊗(p−1)} M_D},    (8)

where ρ^{⊗(p−1)} = ρ ⊗ ⋯ ⊗ ρ is the joint quantum state of p − 1 copies of ρ. This operator D acts on another copy of ρ. The operator M_D is given by

M_D = (1/2) Σ_α Σ_j ( ⊗_{i≠j} A^α_i ) ⊗ ((A^α_j)^T + A^α_j).    (9)

More informally stated, we construct an operator M_D from the A^α_j, j = 1, ..., p, of Equation (5) in such a way that for each term in the sum the expectation values of the first p − 1 subsystems correspond to the desired weighting factor, while the last subsystem remains as the operator acting on ρ. In general, Eq. (9) results in a non-symmetric operator, which we can embed in the symmetric extended matrix

( 0      M_D )
( M_D^T  0   ).    (10)

For the following discussion we therefore assume without loss of generality that M_D is symmetric. Regarding the sparsity of M_D, if each A^α_j, j = 1, ..., p, α = 1, ..., K, is at most d column/row sparse, then each term A^α_1 ⊗ ⋯ ⊗ A^α_p is at most d^p column/row sparse. The full matrix M_D has sparsity s_D, which is at most s_D = K d^p. The sparsity is of order O(poly log N) if d and K only grow polylogarithmically with N and p is a constant.
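As a sanity check of Eqs. (8) and (9), the following sketch (p = 2, made-up A^α_j) builds M_D from Kronecker products and verifies that tracing out the first copy of ρ reproduces the operator D of Eq. (6):

import numpy as np

rng = np.random.default_rng(2)
N = 3
As = [(rng.normal(size=(N, N)), rng.normal(size=(N, N))) for _ in range(2)]  # K = 2 toy terms
x = rng.normal(size=N); x /= np.linalg.norm(x)
rho = np.outer(x, x)

# Eq. (9) for p = 2: M_D acts on two copies of the N-dimensional space
MD = sum(0.5 * (np.kron(A2, A1.T + A1) + np.kron(A1, A2.T + A2)) for A1, A2 in As)

# Eq. (8): D = tr_1{ (rho ⊗ I) M_D }, i.e. a partial trace over the first copy
X = (np.kron(rho, np.eye(N)) @ MD).reshape(N, N, N, N)
D_from_trace = np.einsum('abad->bd', X)

# Eq. (6) evaluated directly for p = 2
D_direct = sum(0.5 * ((x @ A2 @ x) * (A1.T + A1) + (x @ A1 @ x) * (A2.T + A2)) for A1, A2 in As)
print(np.allclose(D_from_trace, D_direct))  # True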

A. Single step in the gradient descent method

Let x^{(0)} = (x^{(0)}_1, ..., x^{(0)}_N)^T be the initial point we choose as a candidate for finding the minimum x^*, with Σ_i |x^{(0)}_i|² = 1. We assume that the initial quantum state |x^{(0)}⟩ corresponding to x^{(0)} can be prepared efficiently, either as an output of a quantum algorithm or by preparing the state from quantum random access memory. For example, efficient algorithms exist for preparing states corresponding to integrable [19] and bounded [20] distributions. Similarly, as shown in [21–23], states can be prepared efficiently from random access memory if their entries are evenly distributed.

Now assume we are at step t and have prepared multiple copies of the current solution |x^{(t)}⟩ to an accuracy ǫ^{(t)} > 0. We would like to propagate a single copy of |x^{(t)}⟩ to an improved |x^{(t+1)}⟩. The improved copy shall be prepared to accuracy ǫ^{(t+1)} > ǫ^{(t)}. We prepare the state

(cos θ |0⟩ − i sin θ |1⟩) |x^{(t)}⟩,    (11)

where θ is an external parameter, the first register contains a single ancilla qubit, and the second register contains the state |x^{(t)}⟩. As will be presented in detail below, we can multiply the operator D onto |x^{(t)}⟩ conditioned on the ancilla being in state |1⟩ to obtain

|ψ⟩ = (1/√P_D) ( cos θ |0⟩ |x^{(t)}⟩ − i C_D sin θ |1⟩ D |x^{(t)}⟩ ).    (12)

Here, C_D = O(1/κ_D) is an appropriately chosen factor, where κ_D is the condition number of D. The success probability is P_D = cos²θ + C_D² sin²θ ⟨x^{(t)}| D² |x^{(t)}⟩. As shown in the next section and Appendix D, this step takes O( (1/√P_D) (1 + p²(ǫ^{(t)})²/ǫ_D) (1/ǫ_D³) ) copies of the current state to perform the matrix multiplication with D to accuracy ǫ_D. Each copy of the current state requires Õ(p s_D log N) operations for the matrix exponentiation methods. The state |ψ⟩ is prepared to accuracy O(ǫ_D + ǫ^{(t)}).

We now perform the required vector addition of the current solution and the state related to the first derivative. We measure the state in the basis |yes⟩ = (1/√2)(|0⟩ + i|1⟩) and |no⟩ = (1/√2)(i|0⟩ + |1⟩). Measuring the ‘yes’ basis state results in the quantum system being in a state

|x^{(t+1)}⟩ = (1/√(2 P_D P_yes)) ( cos θ |x^{(t)}⟩ − C_D sin θ D |x^{(t)}⟩ ).    (13)

Choosing θ such that

cos θ = 1/√(1 + η²/C_D²),    sin θ = η / (C_D √(1 + η²/C_D²)),    (14)

leads to the state

|x^{(t+1)}⟩ = (1/C^{(t+1)}) ( |x^{(t)}⟩ − η |∇f(x^{(t)})⟩ ),    (15)

with the normalization factor given by (C^{(t+1)})² = 1 − 2η ⟨x^{(t)}| D |x^{(t)}⟩ + η² ⟨x^{(t)}| D² |x^{(t)}⟩. The probability of obtaining this state through a successful ‘yes’ measurement is given by

P_yes = 1/2 − 2η ⟨x^{(t)}| D |x^{(t)}⟩ / (1 + η² ⟨x^{(t)}| D² |x^{(t)}⟩),    (16)

where ⟨x^{(t)}| D |x^{(t)}⟩ is simply the inner product of the current solution x^{(t)} with the gradient ∇f at this point, and ⟨x^{(t)}| D² |x^{(t)}⟩ is the squared length of the gradient. Note that P_yes = 1/2 − 2η ⟨x^{(t)}| D |x^{(t)}⟩ + O(η²), which is large for small η. To prepare the state |x^{(t+1)}⟩ successfully, we have to repeat O(1/P_yes) times.

The quantum state |x^{(t+1)}⟩ is an approximation to a normalized version of the classical vector x^{(t+1)} that would result from a classical gradient descent update departing from a normalized vector x^{(t)}. The quantum measurement normalizes the quantum state at each step. Appendix A gives an argument that this projected gradient descent method improves the objective function if the gradient has a component tangential to the unit ball. Below, we also provide a method to relax the normalization constraint by treating the normalization as another optimization parameter. As stated, errors arise from the previous state |x^{(t)}⟩, denoted by ǫ^{(t)}, and from preparing D|x^{(t)}⟩, denoted by ǫ_D. The error in preparing |x^{(t+1)}⟩ is ǫ^{(t+1)} = O(η ǫ_D + ǫ^{(t)}). Thus we have prepared a normalized quantum state representing an improved solution for our original norm-constrained polynomial problem.
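The single step above can be emulated classically in a few lines; the sketch below reproduces the postselected state of Eqs. (13)-(15) and the associated probabilities for a normalized x and a symmetric D (an illustrative emulation, not the quantum circuit itself):

import numpy as np

def projected_step(x, D, eta, CD):
    # theta chosen as in Eq. (14)
    cos_t = 1.0 / np.sqrt(1.0 + eta**2 / CD**2)
    sin_t = eta / (CD * np.sqrt(1.0 + eta**2 / CD**2))
    Dx = D @ x
    branch = cos_t * x - CD * sin_t * Dx             # Eq. (13) up to normalization; proportional to x - eta*D x
    P_D = cos_t**2 + (CD * sin_t)**2 * (Dx @ Dx)     # ancilla postselection probability, cf. Eq. (B6)
    P_yes = (branch @ branch) / (2 * P_D)            # probability of the 'yes' outcome
    return branch / np.linalg.norm(branch), P_D, P_yes

For small η the 'yes' probability stays close to 1/2, so on average only a constant number of repetitions of the addition step is needed.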

B. Computing the gradient

We now proceed to show details on how to implement the multiplication by the operator D, |x⟩ ↦ D|x⟩ = |∇f(x)⟩, used for the quantum gradient descent step above (the index t indicating the current step is omitted for readability). The idea is to implement D |x⟩ through matrix exponentiation e^{iDΔt} |x⟩, adapting the quantum principal component analysis (QPCA) procedure outlined in [18], and subsequent phase estimation. Since D depends on the current state |x⟩, we cannot use the oracular exponentiation methods of [24]. Instead we exponentiate the sparse matrix M_D given in Eq. (9). The matrix M_D is specified by the problem and assumed to be given in oracular form with sparsity s_D. For a short time Δt, we use multiple copies of ρ = |x⟩⟨x| and perform a matrix exponentiation of M_D. In the reduced space of the last copy of ρ, we observe the operation

tr_{1...p−1}{e^{−i M_D Δt} ρ^{⊗p} e^{i M_D Δt}} = e^{−i D Δt} ρ e^{i D Δt} + O(Δt²).    (17)

Since the current solution |x⟩ is given to an error ǫ_0, we apply Eq. (17) using additional copies for error correction compared to the method in [18], as shown in Appendix D. In order to perform phase estimation and extract the eigenvalues of D, we need to implement a series of powers of the exponentials in Equation (17) in superposition with an index or time register. The infinitesimal application of the M_D operator can be modified as in Ref. [18] to |0⟩⟨0| ⊗ I + |1⟩⟨1| ⊗ e^{−i M_D Δt} and applied to a state |q⟩⟨q| ⊗ ρ, where |q⟩ is an arbitrary single control qubit state. For the use in phase estimation, we use a multi-qubit register with O(⌈log(2 + 1/2ǫ_D)⌉) control qubits forming an eigenvalue register. In this manner, we can apply Σ_{l=1}^{S_ph} |lΔt⟩⟨lΔt| (e^{−i D Δt})^l for performing the phase estimation. Via phase estimation, this conditioned application of the matrix exponentials allows eigenvalues to be resolved with accuracy ǫ_D in a total time S_ph Δt = O(1/ǫ_D). The result of the phase estimation algorithm is a quantum state proportional to

Σ_j β_j |u_j(D)⟩ |λ_j(D)⟩,    (18)

where |x⟩ = Σ_j β_j |u_j(D)⟩ is the original state |x⟩ written in the eigenbasis {|u_j(D)⟩} of D, and |λ_j(D)⟩ is an additional register encoding the corresponding eigenvalue in binary representation. As in [9, 12], we can now conditionally rotate and measure an extra ancilla qubit, resulting in

Σ_j λ_j(D) β_j |u_j(D)⟩.    (19)

This performs the matrix multiplication with D within the phase estimation. As shown in Appendix D, we require O( (1/√P_D) (1 + p²ǫ_0²/ǫ_D) (1/ǫ_D³) ) copies of the current state to perform the matrix multiplication to error ǫ_D, where P_D is the success probability of the conditional rotation.

To achieve a constant success probability we require O(1/P_D) repetitions, which can be reduced to O(1/√P_D) repetitions using amplitude amplification [9, 25]. Each copy requires Õ(p s_D log N) operations for the sparse matrix exponentiation methods [24]. For multiple steps we will need a number of copies that grows exponentially with the number of steps. It is obvious that the exponential growth of the resources in both space and run time with the number of steps is prohibitive. We therefore need to look at extensions of the gradient descent method which guarantee fast convergence to the minimum in only a few steps. One such option is Newton’s method, which requires inverting the Hessian matrix of f(x) at point x, an operation that can become expensive on a classical computer while being well suited for a quantum computer.
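The reduced-dynamics relation of Eq. (17), which underlies the adapted QPCA step, can be checked numerically for p = 2; symmetric toy matrices are assumed so that M_D is Hermitian without the embedding of Eq. (10):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
N = 3
A1 = rng.normal(size=(N, N)); A1 = 0.5 * (A1 + A1.T)
A2 = rng.normal(size=(N, N)); A2 = 0.5 * (A2 + A2.T)
x = rng.normal(size=N); x /= np.linalg.norm(x)
rho = np.outer(x, x)

MD = 0.5 * (np.kron(A2, A1.T + A1) + np.kron(A1, A2.T + A2))          # Eq. (9), p = 2
D = 0.5 * ((x @ A2 @ x) * (A1.T + A1) + (x @ A1 @ x) * (A2.T + A2))    # Eq. (6), p = 2

dt = 1e-2
U = expm(-1j * MD * dt)
reduced = np.einsum('abad->bd', (U @ np.kron(rho, rho) @ U.conj().T).reshape(N, N, N, N))
target = expm(-1j * D * dt) @ rho @ expm(1j * D * dt)
print(np.max(np.abs(reduced - target)))  # deviation is O(dt^2)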

IV. QUANTUM NEWTON’S METHOD

A quantum algorithm for Newton’s method follows the quantum gradient descent scheme, but in addition to the conditional implementation of D |x⟩ = |∇f(x)⟩, one also has to apply an operator H^{−1}, representing the inverse Hessian matrix, to |∇f(x)⟩. The Hessian is given from Eq. (7) by

H = H_A + D,    (20)

with H_A and D as above. To obtain the eigenvalues of H via phase estimation, we exponentiate H by exponentiating the matrices H_A and D sequentially using the standard Lie product formula,

e^{i H Δt} ≈ e^{i H_A Δt} e^{i D Δt} + O(Δt²).    (21)

To implement the individual exponentiations themselves we use a similar trick as before. We associate a simulatable matrix M_{H_A} with H_A, and the matrix M_D with D as before. See Appendix C for the matrix M_{H_A} and further details of the quantum Newton’s method. This matrix M_{H_A} is sparse if the constituent matrices A^α_j are sparse, and it can be constructed if the A^α_j are given in oracular form. Denote by s_{H_A} the sparsity of M_{H_A}. Let σ be an arbitrary state on which the matrix exponential is to be applied; then we can use multiple copies of the current state ρ = |x⟩⟨x| to perform

tr_{1...p−1}{e^{−i M_{H_A} Δt} (ρ ⊗ ⋯ ⊗ ρ) ⊗ σ e^{i M_{H_A} Δt}} ≈ e^{−i H_A Δt} σ e^{i H_A Δt}.    (22)

The error for a small time step is O(Δt²) from this simulation method, O(Δt²) from the Lie product formula, and O(ǫ_0) from the current solution |x⟩. Similarly to before, and as described in Appendix D, we require O( (1 + p²ǫ_0²/ǫ_H) (1/ǫ_H³) ) copies of the current state to perform phase estimation to accuracy ǫ_H. Each copy is processed via sparse simulation methods in Õ(p s_{H_A} log N) operations. With P_H the success probability of an ancilla measurement after a conditional rotation, we require O(1/P_H) repetitions to achieve a large enough success probability for the matrix inversion. The number of repetitions can be reduced to O(1/√P_H) by amplitude amplification [25].

For Newton’s method the performance is determined by computing the gradient, inverting the Hessian, and the subsequent vector addition as before. Thus, the overall cost of a single Newton step of step size η to prepare an improved solution to accuracy O(η ǫ_D + η ǫ_H + ǫ_0) from a current solution given to accuracy ǫ_0 is

O(1/P_yes) × O(1/√(P_D P_H)) × O( (1 + p²ǫ_0²/ǫ_H) (1/ǫ_H³) ) × O( (1 + p²ǫ_0²/ǫ_D) (1/ǫ_D³) ) × Õ(p s log N).

Here, we define s = s_D + s_{H_A} > s_D.
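For intuition, a classical analogue of one Newton step on the constrained problem can be sketched as follows; the eigenvalue filter plays the role of the bounded conditional rotation with C_H = O(1/κ_H), and the final renormalization mirrors the projection onto the unit sphere. All names and the cutoff value are illustrative assumptions (p = 2):

import numpy as np

def hessian_and_gradient_op(x, As):
    # Eq. (7) for p = 2: H = H_A + D
    N = len(x)
    HA, D = np.zeros((N, N)), np.zeros((N, N))
    for A1, A2 in As:
        S1, S2 = A1.T + A1, A2.T + A2
        HA += 0.5 * (S2 @ np.outer(x, x) @ S1 + S1 @ np.outer(x, x) @ S2)
        D += 0.5 * ((x @ A2 @ x) * S1 + (x @ A1 @ x) * S2)
    return HA + D, D

def projected_newton_step(x, As, eta, kappa_max=1e3):
    H, D = hessian_and_gradient_op(x, As)
    w, V = np.linalg.eigh(0.5 * (H + H.T))
    # keep only well-conditioned eigenvalues, mimicking C_H = O(1/kappa_H)
    w_inv = np.where(np.abs(w) > np.abs(w).max() / kappa_max, 1.0 / w, 0.0)
    x_new = x - eta * (V * w_inv) @ V.T @ (D @ x)
    return x_new / np.linalg.norm(x_new)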

V. MULTIPLE STEPS

We have presented a single step of the quantum versions of gradient descent and Newton’s method for polynomial optimization. When performing multiple steps, t = 0, ..., T, quantities such as the condition numbers and success probabilities become step dependent. We have:

η → η^{(t)},    (23)
P_D → P_D^{(t)},    (24)
P_H → P_H^{(t)},    (25)
P_yes → P_yes^{(t)}.    (26)

The step size parameter η is usually varied as one gets closer to the target. To denote the worst-case performance, we define the quantities

η̂ := max_t η^{(t)},    (27)
P̌_D := min_t P_D^{(t)},    (28)
P̌_H := min_t P_H^{(t)},    (29)
P̌_yes := min_t P_yes^{(t)}.    (30)

Let ǫ > 0 be the final desired accuracy after T steps. As discussed, for a single gradient descent step we require

O( (1/(P_yes^{(t)} √P_D^{(t)})) (1 + p²(ǫ^{(t)})²/ǫ_D^{(t)}) (1/(ǫ_D^{(t)})³) )    (31)

copies of the current solution to prepare a next copy to accuracy

ǫ^{(t+1)} = O(η^{(t)} ǫ_D^{(t)} + ǫ^{(t)}).    (32)

If we take ǫ^{(t)} = O(ǫ) and ǫ_D^{(t)} = (ǫ^{(t+1)} − ǫ^{(t)})/η^{(t)} = O(ǫ), we obtain a requirement of

O( (1/(P̌_yes √P̌_D)) (1 + p²ǫ) (1/ǫ³) )    (33)

copies of the current solution at a single step. For T iterations of the gradient descent method, we need at most

O( [ (1/(P̌_yes √P̌_D)) (1 + p²ǫ) (1/ǫ³) ]^T )    (34)

copies of the initial state |x^{(0)}⟩. For T iterations of Newton’s method, we need at most

O( (1/P̌_yes)^T [ (1/√P̌_H) (1 + p²ǫ) (1/ǫ³) ]^T [ (1/√P̌_D) (1 + p²ǫ) (1/ǫ³) ]^T )    (35)

copies of the initial state |x^{(0)}⟩. For both methods, each copy requires Õ(p s log N) operations, where we define s = s_D + s_{H_A}. Considering the exponential dependence on the number of steps T, recall that in Newton’s method, in the vicinity of an optimal solution x^* the accuracy δ := |x^* − x^{(T)}| of the estimate often improves quadratically with the number of iterations, δ ∝ O(exp(−T²)), for example for unconstrained convex problems. That is, even though in the quantum algorithm the number of initial copies grows exponentially in the number of iterations required, the accuracy of the approximate solution yielded by the algorithm can in certain cases improve even faster in the number of iterations.
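Plugging representative worst-case values into Eqs. (34) and (35) gives a quick feel for how the copy count grows with T; the numbers below are arbitrary illustrative choices with all O-constants set to one:

def copies_gradient(T, eps, p, P_yes, P_D):
    # Eq. (34): per-step factor raised to the number of iterations T
    per_step = (1.0 / (P_yes * P_D**0.5)) * (1 + p**2 * eps) / eps**3
    return per_step**T

def copies_newton(T, eps, p, P_yes, P_D, P_H):
    # Eq. (35): gradient and Hessian-inversion factors per step
    per_step = (1.0 / P_yes) * (1 + p**2 * eps)**2 / (eps**6 * (P_D * P_H)**0.5)
    return per_step**T

print(f"{copies_gradient(T=3, eps=0.1, p=2, P_yes=0.4, P_D=0.5):.2e}")
print(f"{copies_newton(T=3, eps=0.1, p=2, P_yes=0.4, P_D=0.5, P_H=0.5):.2e}")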

VI. DISCUSSION AND CONCLUSION

The present work has considered polynomial optimization. We have developed a quantum algorithm for the optimization of the restricted class of homogeneous polynomials, for which we can find expressions for the gradient and the Hessian matrix. A simple linear inhomogeneity can be added by considering the polynomial function f(x) + c^T x, where f(x) is the homogeneous part as before and c is a vector specifying the inhomogeneous part. The gradient becomes Dx + c, which leads to another vector addition in the quantum algorithm. Beyond polynomials, one can envision a setting where copies of quantum states representing the current solution are consumed for evaluating the first derivative of the objective function. If we can implement the operation

|x⟩ ⊗ ⋯ ⊗ |x⟩ |0⟩ ↦ |ψ⟩ |∇f(x)⟩,    (36)

where |ψ⟩ is an arbitrary scrambled state, we can use the same basic gradient descent steps as discussed in this work, with a performance similar to that presented here for polynomial optimization.

Our gradient descent algorithm is, at each step, repeated a number of times that scales with the success probability of the vector addition of the gradient to the current solution. In reference [26], Kimmel et al. discuss a way to coherently add two vectors via the principal component algorithm. Employing this method can lead to an improvement in the number of copies required compared to the stochastic vector addition of the present work. Because the success probability of the stochastic vector addition depends on the step size, this method can also lead to an effectively increased step size and faster convergence of the quantum gradient descent algorithm. In reference [27], Childs et al. presented an exponential improvement of the matrix inversion error dependence, 1/ǫ → log 1/ǫ, by using approximation polynomials instead of phase estimation. It remains to be seen if this algorithm can be combined with our Hamiltonian simulation method using the QPCA algorithm to improve the Hessian matrix inversion step.

Our algorithms scale exponentially in the number of steps performed. However, we envision a few scenarios where the algorithm can nevertheless be useful. One scenario is when the size of the vectors dominates the dependence on other parameters such as condition number and error raised to the power of the number of steps. In this case classical computation is prohibitively expensive, while our quantum method scales logarithmically in the size of the vectors and could allow optimization. Often, a constant number of steps can yield a significant improvement on the initial guess, even without finding a local minimum, whereas the problem would be intractable on a classical computer. Another case where our quantum algorithm yields an exponential speed-up is the application of Newton’s method to (locally) convex problems. In such problems, the number of iterations of Newton’s method required to find a highly accurate solution is only weakly dependent on the dimension of the system [2], and often the minimum can be found in around 5–20 steps. The optimization algorithms presented here yield a performance O(poly log N) in the dimension N of the vector space over which the optimization is performed, as long as the number of iterations required is small. When searching for a small number of solutions in a featureless landscape, a large number of iterations is required, and the algorithms scale polynomially in N: in particular, the algorithms cannot do better than Grover search, which finds M solutions in an essentially ‘gradient-free’ landscape in time O(√(N/M)). In problems such as weight assignment for machine learning, however, where gradient descent does indeed lead to good solutions after a few iterations, the algorithms presented here provide an exponential improvement over their classical counterparts.

[1] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] C. Dürr and P. Hoyer. A quantum algorithm for finding the minimum. arXiv preprint quant-ph/9607014, 1996.
[4] Edward Farhi and Aram W. Harrow. Quantum supremacy through the quantum approximate optimization algorithm. arXiv:1602.07674, 2016.
[5] Edward Farhi, Jeffrey Goldstone, Sam Gutmann, and Michael Sipser. Quantum computation by adiabatic evolution. arXiv preprint quant-ph/0001106, 2000. MIT-CTP-2936.
[6] Vasil Denchev, Nan Ding, Hartmut Neven, and S. V. N. Vishwanathan. Robust classification with adiabatic quantum optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 863–870, 2012.
[7] Hartmut Neven, Vasil S. Denchev, Geordie Rose, and William G. Macready. Training a large scale classifier with the quantum adiabatic algorithm. arXiv preprint arXiv:0912.0779, 2009.
[8] Marcello Benedetti, John Realpe-Gómez, Rupak Biswas, and Alejandro Perdomo-Ortiz. Estimation of effective temperatures in a quantum annealer and its impact in sampling applications: A case study towards deep learning applications. arXiv preprint arXiv:1510.07611, 2015.
[9] Nathan Wiebe, Daniel Braun, and Seth Lloyd. Quantum algorithm for data fitting. Physical Review Letters, 109(5):050505, 2012.
[10] Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. Linear regression on a quantum computer. Submitted to Physical Review A.
[11] Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113:130503, 2014.
[12] Aram W. Harrow, Avinatan Hassidim, and Seth Lloyd. Quantum algorithm for linear systems of equations. Physical Review Letters, 103(15):150502, 2009.
[13] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[14] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[15] Simai He, Zhening Li, and Shuzhong Zhang. Approximation algorithms for homogeneous polynomial optimization with quadratic constraints. Mathematical Programming, 125(2):353–383, 2010.
[16] Alan A. Goldstein. Convex programming in Hilbert space. Bulletin of the American Mathematical Society, 70(5):709–710, 1964.
[17] Evgeny S. Levitin and Boris T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.
[18] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10:631–633, 2014.
[19] L. Grover and T. Rudolph. Creating superpositions that correspond to efficiently integrable probability distributions. arXiv preprint quant-ph/0208112, 2002.
[20] Andrei N. Soklakov and Rüdiger Schack. Efficient state preparation for a register of quantum bits. Physical Review A, 73(1):012307, 2006.
[21] V. Giovannetti, S. Lloyd, and L. Maccone. Quantum random access memory. Physical Review Letters, 100:160501, 2008.
[22] V. Giovannetti, S. Lloyd, and L. Maccone. Architectures for a quantum random access memory. Physical Review A, 78:052310, 2008.
[23] F. De Martini, V. Giovannetti, S. Lloyd, L. Maccone, E. Nagali, L. Sansoni, and F. Sciarrino. Experimental quantum private queries with linear optics. Physical Review A, 80:010302, 2009.
[24] Dominic W. Berry and Andrew M. Childs. Black-box Hamiltonian simulation and unitary implementation. Quantum Information and Computation, 12(1-2):29–62, 2012.
[25] A. Ambainis. Variable time amplitude amplification and a faster quantum algorithm for solving systems of linear equations. arXiv:1010.4458, 2010.
[26] S. Kimmel, C. Yen-Yu Lin, G. H. Low, M. Ozols, and T. J. Yoder. Hamiltonian simulation with optimal sample complexity. arXiv:1608.00281, 2016.
[27] A. M. Childs, R. Kothari, and R. D. Somma. Quantum linear systems algorithm with exponentially improved dependence on precision. arXiv:1511.02306, 2015.


Appendix A: Convergence of the projected gradient descent method

We provide a simple proof of convergence for the projected gradient descent method in the limit of small step sizes in the vicinity of a minimum of a locally convex function for which, without loss of generality, f(|x^{(t)}⟩) > 0. Let the unnormalized gradient step be

|φ⟩ = |x^{(t)}⟩ − η |∇f(x^{(t)})⟩.    (A1)

The projected (normalized) gradient step is

|x^{(t+1)}⟩ = (1/C^{(t+1)}) ( |x^{(t)}⟩ − η |∇f(x^{(t)})⟩ ),    (A2)

with the normalization factor given by (C^{(t+1)})² = 1 − 2η ⟨x^{(t)}| D |x^{(t)}⟩ + η² ⟨x^{(t)}| D² |x^{(t)}⟩. For the unnormalized state there exists η_0 > 0 such that for all small η < η_0 we have

η ⟨∇f(x^{(t)})| ∇f(x^{(t)})⟩ + O(η²) = f(|x^{(t)}⟩) − f(|φ⟩) ≥ 0.    (A3)

For the normalized state:

f(|x^{(t)}⟩) − f(|x^{(t+1)}⟩) = f(|x^{(t)}⟩) − (1/(C^{(t+1)})^{2p}) f(|φ⟩)    (A4)
  = f(|x^{(t)}⟩) − (1/(C^{(t+1)})^{2p}) ( f(|x^{(t)}⟩) − η ⟨∇f(x^{(t)})| ∇f(x^{(t)})⟩ + O(η²) )
  = f(|x^{(t)}⟩) − ( 1 + 4ηp ⟨x^{(t)}| ∇f(x^{(t)})⟩ + O(η²) ) ( f(|x^{(t)}⟩) − η ⟨∇f(x^{(t)})| ∇f(x^{(t)})⟩ + O(η²) )
  = η ⟨∇f(x^{(t)})| ∇f(x^{(t)})⟩ − 4ηp ⟨x^{(t)}| ∇f(x^{(t)})⟩ f(|x^{(t)}⟩) + O(η²).

Thus, we obtain the condition

⟨x^{(t)}| ∇f(x^{(t)})⟩ ≤ ⟨∇f(x^{(t)})| ∇f(x^{(t)})⟩ / (4p f(|x^{(t)}⟩)).    (A5)

Hence we expect our projected method to work well if ⟨x^{(t)}| ∇f(x^{(t)})⟩ ≪ 1, which means the gradient has a large component orthogonal to the current solution. Intuitively, this means that the gradient points tangentially to the unit ball. If the current solution and the gradient are parallel, we cannot optimize further on the unit ball and our norm-constrained problem is solved.
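A small numerical illustration of this condition (p = 2, toy matrices; the assumption f(|x^{(t)}⟩) > 0 made above is enforced by resampling the starting point):

import numpy as np

rng = np.random.default_rng(5)
N, p, eta = 4, 2, 1e-3
A1, A2 = rng.normal(size=(N, N)), rng.normal(size=(N, N))
f = lambda x: 0.5 * (x @ A1 @ x) * (x @ A2 @ x)
grad = lambda x: 0.5 * ((x @ A2 @ x) * (A1.T + A1) + (x @ A1 @ x) * (A2.T + A2)) @ x

x = rng.normal(size=N); x /= np.linalg.norm(x)
while f(x) <= 0:                       # Appendix A assumes f(|x>) > 0
    x = rng.normal(size=N); x /= np.linalg.norm(x)

g = grad(x)
phi = x - eta * g                      # unnormalized step, Eq. (A1)
x_next = phi / np.linalg.norm(phi)     # projected step, Eq. (A2)
holds = (x @ g) <= (g @ g) / (4 * p * f(x))   # condition (A5)
print("condition (A5) holds:", holds, "| f decreased:", f(x_next) < f(x))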

Appendix B: Computing the gradient

We show the intermediate steps for computing a single gradient descent step. The eigenstates of D are given by |u_j(D)⟩ and the eigenvalues by λ_j(D). Start from

(cos θ |0⟩ + i sin θ |1⟩) |x⟩.    (B1)

After the conditional phase estimation, we obtain:

|ψ⟩ = cos θ |0⟩ |x⟩ |0⟩ + i sin θ |1⟩ Σ_j β_j |u_j(D)⟩ |λ_j(D)⟩,    (B2)

where β_j = ⟨u_j(D)|x⟩. Now perform a conditional rotation of another ancilla and uncompute the eigenvalue register to arrive at the state

cos θ |0⟩ |x⟩ |1⟩ + i sin θ |1⟩ Σ_j β_j |u_j(D)⟩ ( √(1 − (C_D λ_j(D))²) |0⟩ + C_D λ_j(D) |1⟩ ).    (B3)

We choose a constant C_D = O(1/κ_D), where κ_D is the condition number of D. A measurement of the ancilla in |1⟩ arrives at

(1/√P_D) ( cos θ |0⟩ |x⟩ + i sin θ |1⟩ Σ_j C_D λ_j(D) β_j |u_j(D)⟩ ),    (B4)

which is the desired state

(1/√P_D) ( cos θ |0⟩ |x⟩ + i C_D sin θ |1⟩ D |x⟩ ).    (B5)

The success probability of the measurement is given by:

P_D = cos²θ + C_D² sin²θ ⟨x| D² |x⟩.    (B6)
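For reference, the chain (B2)-(B6) can be mimicked classically by expanding |x⟩ in the eigenbasis of D and weighting each component by C_D λ_j(D); the postselected |1⟩-branch is then proportional to D|x⟩ (toy symmetric D; names and parameter values are ours):

import numpy as np

rng = np.random.default_rng(7)
N = 4
D = rng.normal(size=(N, N)); D = 0.5 * (D + D.T)    # assume a symmetric (Hermitian) D
x = rng.normal(size=N); x /= np.linalg.norm(x)

lam, U = np.linalg.eigh(D)
CD = 1.0 / np.abs(lam).max()        # ensures |C_D lambda_j| <= 1; the paper takes C_D = O(1/kappa_D)
beta = U.T @ x                      # beta_j = <u_j(D)|x>, cf. Eq. (B2)
theta = np.pi / 4

branch1 = np.sin(theta) * CD * lam * beta                                   # |1>-branch after the rotation
P_D = np.cos(theta)**2 + np.sin(theta)**2 * np.sum((CD * lam * beta)**2)    # Eq. (B6)
print(np.allclose(U @ branch1, np.sin(theta) * CD * (D @ x)), round(P_D, 4))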

Appendix C: Quantum Newton’s method details

The matrix M_{H_A} corresponding to the Hessian part H_A defined in Eq. (7), used for the sparse matrix simulation methods, is given by

M_{H_A} = (1/2) Σ_α Σ_{j≠k} ( ⊗_{i≠j,k} A^α_i ) ⊗ ( I ⊗ ((A^α_k)^T + A^α_k) ) S ( I ⊗ ((A^α_j)^T + A^α_j) ).    (C1)

Here, S is the swap matrix between the last two registers. The matrix M_{H_A} is sparse if the constituent matrices A^α_j are sparse. We assume that the matrices A^α_j are given via an oracle. The operator H_A is obtained from

H_A = tr_{1...p−1}{ρ^{⊗(p−1)} M_{H_A}},    (C2)

as before. In case M_{H_A} is non-symmetric, we can define a symmetric matrix analogous to Eq. (10).

The quantum Newton step goes as follows. Assume we have prepared the gradient descent step up to the state:

|ψ⟩ = (1/√P_D) ( cos θ |0⟩ |x⟩ + i C_D sin θ |1⟩ D |x⟩ ).    (C3)

The procedure for obtaining a Newton step is as follows, similar to the steps for the gradient before. The eigenstates of H are given by |u_j(H)⟩ and the eigenvalues by λ_j(H). First, perform phase estimation with H conditioned on the ancilla being |1⟩ to arrive at

(1/√P_D) ( cos θ |0⟩ |x⟩ |0⟩ |1⟩ + i C_D sin θ |1⟩ Σ_j β_j(H) |u_j(H)⟩ |λ_j(H)⟩ ).    (C4)

Perform a conditional rotation of another ancilla and uncompute the eigenvalue register:

(1/√P_D) ( cos θ |0⟩ |x⟩ |1⟩ + i C_D sin θ |1⟩ Σ_j β_j(H) |u_j(H)⟩ ( √(1 − (C_H/λ_j(H))²) |0⟩ + (C_H/λ_j(H)) |1⟩ ) ).    (C5)

Here, C_H = O(1/κ_H), where κ_H is the condition number of H. Measurement of the ancilla in |1⟩ arrives at

(1/√P_H) ( cos θ |0⟩ |x⟩ + i C_D C_H sin θ |1⟩ Σ_j (1/λ_j(H)) β_j(H) |u_j(H)⟩ ),    (C6)

which is the desired state

(1/√P_H) ( cos θ |0⟩ |x⟩ + i C_D C_H sin θ |1⟩ H^{−1} D |x⟩ ).    (C7)

The success probability of the measurement is given by:

P_H = cos²θ + sin²θ C_D² C_H² ⟨x| D H^{−2} D |x⟩.    (C8)
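Analogously, the Newton chain (C3)-(C8) amounts to expanding the gradient branch in the eigenbasis of H and dividing each component by λ_j(H); a classical mimic with toy symmetric matrices (illustrative only):

import numpy as np

rng = np.random.default_rng(9)
N = 4
D = rng.normal(size=(N, N)); D = 0.5 * (D + D.T)
H = rng.normal(size=(N, N)); H = 0.5 * (H + H.T)
x = rng.normal(size=N); x /= np.linalg.norm(x)

lamH, UH = np.linalg.eigh(H)
CH = np.abs(lamH).min()                          # ensures |C_H / lambda_j(H)| <= 1; paper: C_H = O(1/kappa_H)
CD = 1.0 / np.abs(np.linalg.eigvalsh(D)).max()
theta = np.pi / 4

beta_H = UH.T @ (CD * (D @ x))                   # gradient branch in the eigenbasis of H, cf. Eq. (C4)
branch1 = np.sin(theta) * (CH / lamH) * beta_H   # conditional rotation of Eq. (C5), |1>-branch
Hinv = np.linalg.inv(H)
P_H = np.cos(theta)**2 + np.sin(theta)**2 * CD**2 * CH**2 * (x @ D @ Hinv @ Hinv @ D @ x)   # Eq. (C8)
print(np.allclose(UH @ branch1, np.sin(theta) * CD * CH * np.linalg.solve(H, D @ x)),
      np.isclose(np.cos(theta)**2 + branch1 @ branch1, P_H))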

Appendix D: Quantum principal component analysis with erroneous states

The simulation of the operators D and H incurs additional errors because the current solution |x⟩ used to construct the operators is itself given to an error. We provide a simple argument for an error averaging at the cost of the number of copies required for the simulation. First note that D = tr_{1...p−1}{ρ^{⊗(p−1)} M_D} with ρ = |x⟩⟨x|. Thus, if |x⟩ is given to an error ǫ_0, then D is given to an error O(p ǫ_0), and the same holds for H.

For the error discussion, assume for simplicity a simple binomial distribution, where the errors come as +p ǫ_0 D̃ and −p ǫ_0 D̃, where D̃ is some operator with ‖D̃‖ ≤ 1. The argument also holds for other distributions. Two simulation steps with opposite signs are given by

e^{i (D + p ǫ_0 D̃) Δt} e^{i (D − p ǫ_0 D̃) Δt} = (I + i (D + p ǫ_0 D̃) Δt + O(Δt²)) (I + i (D − p ǫ_0 D̃) Δt + O(Δt²)) = (I + i D Δt)² + O(Δt²).    (D1)

The task is to simulate e^{−i D t} for a time t = m Δt to obtain an overall error ǫ > 0. Note that after m steps and averaging over the signs, we have:

‖ e^{i (D ± p ǫ_0 D̃) Δt} ⋯ e^{i (D ± p ǫ_0 D̃) Δt} − e^{i D t} ‖ = O(√m p ǫ_0 Δt + m Δt²) =: ǫ.    (D2)

The error is given by:

ǫ = O(√m p ǫ_0 Δt + m Δt²) = O( p ǫ_0 t / √m + t² / m ).    (D3)

Thus, solving for the number of steps m leads to

m = O( p²ǫ_0² t²/ǫ² + t²/ǫ ) = O( (1 + p²ǫ_0²/ǫ) t²/ǫ ).    (D4)

This compares to m = O(t²/ǫ) for the usual QPCA algorithm assuming no errors in the state encoding the Hamiltonian. Since we run the algorithm for a time t = 1/ǫ, we obtain a total number of copies

m = O( (1 + p²ǫ_0²/ǫ) (1/ǫ³) ).    (D5)
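The sign-averaging argument of Eqs. (D1)-(D2) is easy to test numerically: alternating the sign of a bounded perturbation D̃ in successive short-time steps keeps the total deviation from e^{iDt} within the stated bound (toy Hermitian matrices and arbitrary illustrative parameters, constants set to one):

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(11)
N = 8
def rand_herm(n):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    M = 0.5 * (M + M.conj().T)
    return M / np.linalg.norm(M, 2)     # spectral norm 1, so ||D~|| <= 1

D, Dtil = rand_herm(N), rand_herm(N)
p, eps0, t, m = 2, 0.05, 1.0, 200
dt = t / m

U = np.eye(N, dtype=complex)
for k in range(m):
    sign = 1 if k % 2 == 0 else -1      # errors of opposite sign pair up as in Eq. (D1)
    U = expm(1j * (D + sign * p * eps0 * Dtil) * dt) @ U

error = np.linalg.norm(U - expm(1j * D * t), 2)
bound = np.sqrt(m) * p * eps0 * dt + m * dt**2   # right-hand side of Eq. (D2)
print(error, bound)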