Applied Mathematical Sciences, Vol. 3, 2009, no. 22, 1071–1080

A Gradient Descent Solution to the Monge-Kantorovich Problem

Rick Chartrand and Brendt Wohlberg
Los Alamos National Laboratory, Theoretical Division T-5, MS B284, Los Alamos, NM 87545, USA

Kevin R. Vixie
Washington State University, Department of Mathematics, PO Box 643113, 203 Neill Hall, Pullman, WA 99164-3113, USA

Erik M. Bollt
Clarkson University, Department of Mathematics & Computer Science, Potsdam, NY 13699-5815, USA

Abstract

We present a new, simple, and elegant algorithm for computing the optimal mapping for the Monge-Kantorovich problem with quadratic cost. The method arises from a reformulation of the dual problem into an unconstrained minimization of a convex, continuous functional, for which the derivative can be explicitly found. The Monge-Kantorovich problem has applications in many fields; examples from image warping and medical imaging are shown.

Mathematics Subject Classification: 49D10

Keywords: Monge-Kantorovich problem, optimal transport, image warping

1 Introduction

The original problem, posed by G. Monge [11] in 1781, was to determine the optimal way to move a pile of dirt to a hole of the same volume. Here "optimal" means that the total distance the dirt is moved, one infinitesimal unit of volume at a time, should be minimal. A modern and suitably generalized version is the following: let μ1 and μ2 be compactly supported, absolutely continuous measures on R^n, with supports K1 and K2 and densities f1 and f2. Assume that the measures have the same total mass. Call a measurable function s : K1 → K2 feasible if s is injective off a set of measure zero and pushes μ1 forward to μ2, in the sense that μ1 ∘ s⁻¹ = μ2. We wish to find the feasible s that minimizes the total-cost functional

    I(s) = \int_{K_1} c(x, s(x)) \, d\mu_1(x),    (1)

where c ∈ C(K1 × K2) is a nonnegative function, thought of as measuring the cost of moving a unit of mass from a point in K1 to a point in K2. In addition to being mathematically interesting, this problem has applications in many fields, a few of which are economics, meteorology, astrophysics, and image processing (see [12], [3], and [1] for discussion and references).

The purpose of this paper is to present a simple algorithm for the numerical computation of the optimal mapping s. Other methods in the literature include linear programming [3], computational fluid mechanics [1], and minimizing flows [6]. Linear programming is simple but inefficient; the other two methods are much more complex in their justification and implementation.

Once the mapping s has been computed, one can obtain from (1) a measure of the distance between the measures μ1 and μ2, known as the Wasserstein distance. However, the mapping s contains much more information about the relationship between the two measures, and any geometric properties of s will likely be of great relevance to the particular application. See Section 4 for a simple example.

The above formulation of the problem assumes that all the dirt at a point x ∈ K1 must be moved to the same point s(x) ∈ K2. This restriction was relaxed by Kantorovich [7], replacing the mapping s with a measure π ∈ M(K1 × K2) that specifies the joint distribution of dirt-hole correspondences. The measure π is called feasible if it has μ1 and μ2 as marginal distributions; that is, if

    \pi(\cdot \times K_2) = \mu_1 \quad \text{and} \quad \pi(K_1 \times \cdot) = \mu_2.    (2)

The relaxed problem is to find the feasible π that minimizes

    J(\pi) = \int_{K_1 \times K_2} c \, d\pi.    (3)
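Once discretized, (2)–(3) is a finite linear program; this is how the linear-programming approach [3] mentioned above proceeds. The following minimal sketch (our illustration, with made-up data; the grids and variable names are hypothetical, not part of the original) solves a tiny instance with SciPy:

```python
# Sketch of the discrete relaxed problem (2)-(3): find pi >= 0 with
# prescribed marginals mu1, mu2 minimizing <c, pi>, as a linear program.
import numpy as np
from scipy.optimize import linprog

# Hypothetical tiny example: 3 source points, 3 target points on the line.
x = np.array([0.0, 1.0, 2.0])          # support of mu1
y = np.array([0.5, 1.5, 2.5])          # support of mu2
mu1 = np.array([0.2, 0.5, 0.3])        # masses, equal total mass
mu2 = np.array([0.4, 0.4, 0.2])

# Quadratic cost c(x, y) = |x - y|^2 / 2, flattened row-major.
C = 0.5 * (x[:, None] - y[None, :]) ** 2
m, n = C.shape

# Marginal constraints: row sums of pi equal mu1, column sums equal mu2.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # pi(. x K2) = mu1
for j in range(n):
    A_eq[m + j, j::n] = 1.0            # pi(K1 x .) = mu2
b_eq = np.concatenate([mu1, mu2])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi = res.x.reshape(m, n)
print("optimal cost J(pi) =", res.fun)
print("optimal plan:\n", pi.round(3))
```

As noted above, this approach does not scale well; the method of this paper avoids the linear program entirely.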

Gangbo and McCann [5] show that if c is strictly convex, the relaxed problem and the original problem have the same, unique solution: one has min I(s) = min J(π), both functionals have unique minimizers, and the minimizers are related by π(E) = μ1({x ∈ K1 : (x, s(x)) ∈ E}) for all E ⊂ K1 × K2.

2 Duality

Kantorovich also formulated a dual problem [8]: maximize

    K(u, v) = \int_{K_1} u \, d\mu_1 + \int_{K_2} v \, d\mu_2    (4)

among u ∈ C(K1), v ∈ C(K2) satisfying

    u(x) + v(y) \le c(x, y) \quad \text{for all } x \in K_1, \ y \in K_2.    (5)

(Note that since M(K1 × K2) = C(K1 × K2)*, this should really be called a predual problem.) It is a dual problem in the sense that sup K(u, v) = min J(π). For the rest of this paper, we specialize to the case of the quadratic cost c(x, y) = ½|x − y|², a strictly convex function. Ideas from convex analysis can be brought into play by substituting u(x) = ½|x|² − ϕ(x), v(y) = ½|y|² − ψ(y) into (4). The resulting problem is to minimize

    L(\varphi, \psi) = \int_{K_1} \varphi \, d\mu_1 + \int_{K_2} \psi \, d\mu_2    (6)

among ϕ ∈ C(K1), ψ ∈ C(K2) satisfying

    \varphi(x) + \psi(y) \ge x \cdot y \quad \text{for all } x \in K_1, \ y \in K_2.    (7)
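To see why the substitution converts the constraint (5) into (7), note (a one-line check, supplied here for completeness) that for the quadratic cost,

```latex
u(x) + v(y) \le \tfrac{1}{2}|x - y|^2
  \iff \tfrac{1}{2}|x|^2 - \varphi(x) + \tfrac{1}{2}|y|^2 - \psi(y)
       \le \tfrac{1}{2}|x|^2 - x \cdot y + \tfrac{1}{2}|y|^2
  \iff \varphi(x) + \psi(y) \ge x \cdot y.
```

Moreover, K(u, v) and −L(ϕ, ψ) differ only by the constant ∫_{K1} ½|x|² dμ1 + ∫_{K2} ½|y|² dμ2, so maximizing K is equivalent to minimizing L.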

The value of this substitution is the following result, due to Knott and Smith [9] and Brenier [2].

Proposition 2.1. The functional L has a unique minimizing pair (ϕ, ψ) of functions, which are convex conjugates:

    \psi(y) = \varphi^*(y) := \max_{x \in K_1} \big( x \cdot y - \varphi(x) \big)    (8)

and

    \varphi(x) = \psi^*(x) := \max_{y \in K_2} \big( y \cdot x - \psi(y) \big).    (9)

Furthermore, s = ∇ϕ solves the Monge-Kantorovich problem (1).

The mapping ϕ → ϕ* defined by (8) is a variant of the Legendre-Fenchel transform, the difference being that the Legendre-Fenchel transform takes extended real-valued functions on a Banach space X to functions on the dual space X*. The proposition can be phrased in terms of the true Legendre-Fenchel transform for R^n by defining ϕ ≡ ∞ outside K1 and similarly for ψ, after which one obtains ψ|_{K2} = ϕ*|_{K2} and ϕ|_{K1} = ψ*|_{K1}. By this means, one can deduce the following from the corresponding result (see [13, Proposition 11.3]) for the Legendre-Fenchel transform on R^n.
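On a grid, the conjugate (8) can be computed directly from its definition. The following sketch (our illustration; the discretization and function name are assumptions, not the authors' implementation) computes ϕ* and the biconjugate ϕ** in one dimension, where x · y is an ordinary product:

```python
# Minimal sketch of the conjugate (8) on finite grids:
# psi[j] = max_i (x[i] * y[j] - phi[i]).
import numpy as np

def conjugate(phi, x, y):
    """Discrete Legendre-Fenchel-type transform of phi on grid x onto grid y."""
    # Broadcast x[i]*y[j] - phi[i] over all pairs, then maximize over i.
    return np.max(x[:, None] * y[None, :] - phi[:, None], axis=0)

# Example: phi(x) = x^2/2 on [-1, 1]; its conjugate is y^2/2 for |y| <= 1.
x = np.linspace(-1.0, 1.0, 201)
y = np.linspace(-1.0, 1.0, 201)
phi = 0.5 * x ** 2
psi = conjugate(phi, x, y)                 # approximately y^2/2
phi_ss = conjugate(psi, y, x)              # biconjugate phi**
print(np.max(np.abs(psi - 0.5 * y ** 2)))  # small discretization error
print(np.max(np.abs(phi_ss - phi)))        # phi convex => phi** ~ phi
```

For the convex ϕ in this example, ϕ** recovers ϕ up to discretization error, consistent with the biconjugation facts discussed in Remark 2.3 below.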


Lemma 2.2. Let ϕ, ψ be convex conjugates in the sense of (8) and (9). Then

    \partial\varphi(x) = \operatorname*{argmax}_{y \in K_2} \big( y \cdot x - \psi(y) \big)    (10)

for all x ∈ K1, and for all y ∈ K2,

    \partial\psi(y) = \operatorname*{argmax}_{x \in K_1} \big( x \cdot y - \varphi(x) \big).    (11)

In particular, where ϕ (resp. ψ) is differentiable, there is a unique maximizer on the right side of (10) (resp. (11)).

Remark 2.3. It is easily seen from the definition (8) that for any function ϕ on K1, ϕ* (and hence ϕ**) is convex and Lipschitz. Hence for ϕ = ϕ** to hold, ϕ must be convex and Lipschitz. Unlike the case of the Legendre-Fenchel transform on R^n, however, this is not sufficient, as ϕ** depends on the choice of K2. It is true, however, that ϕ* = ϕ*** for any function ϕ, so that ϕ* and ϕ** will always be convex conjugates.

3 A gradient descent iteration

We can now state the main result of the paper: the Monge-Kantorovich problem can be solved via an unconstrained minimization problem, for which the derivative can be explicitly found.

Theorem 3.1. Let f1 ∈ L¹(K1), f2 ∈ L¹(K2). Define M on C(K1) by

    M(\varphi) = \int_{K_1} \varphi f_1 + \int_{K_2} \varphi^* f_2.    (12)

The functional M is convex, Lipschitz, and Hadamard differentiable. In particular,

    M'(\varphi) = f_1 - \big( f_2 \circ \nabla\varphi^{**} \big) \det D^2\varphi^{**},    (13)

where the matrix-valued function D²ϕ** is defined in the Aleksandrov sense. Furthermore, M has a unique, convex minimizer ϕ, for which s = ∇ϕ is the solution to the Monge-Kantorovich problem (1).

Remark 3.2. If ϕ is such that ϕ = ψ* for some ψ, then

    M'(\varphi) = f_1 - \big( f_2 \circ \nabla\varphi \big) \det D^2\varphi.    (14)
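To make (13)–(14) concrete, the following sketch (ours; a one-dimensional finite-difference discretization chosen for brevity, not the scheme used for the figures below) evaluates the simplified derivative (14), where det D²ϕ reduces to ϕ'' and ∇ϕ to ϕ':

```python
# Sketch of the derivative formula (14) in one dimension.
# f1 is a sampled density, f2 a callable density; phi is assumed convex.
import numpy as np

def M_prime(phi, f1_vals, f2_func, h):
    """Approximate M'(phi) = f1 - (f2 o phi') * phi'' on a uniform 1-D grid."""
    dphi = np.gradient(phi, h)     # centered difference for phi'
    d2phi = np.gradient(dphi, h)   # second difference for phi''
    return f1_vals - f2_func(dphi) * d2phi

# Hypothetical check: f1 = f2 = uniform on [0, 1]; phi = x^2/2 gives the
# identity map s(x) = x, which is optimal, so M'(phi) should vanish.
h = 1.0 / 255
x = np.arange(256) * h
f1 = np.ones_like(x)
f2 = lambda y: np.ones_like(y)
g = M_prime(0.5 * x ** 2, f1, f2, h)
print(np.abs(g).max())             # ~0 up to boundary differencing error
```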


Remark 3.3. The theorem suggests that a potential for the optimal Monge-Kantorovich mapping can be computed by a gradient descent iteration of the form

    \varphi_{n+1} = \varphi_n - \alpha_n M'(\varphi_n),    (15)

where αn is a stepsize parameter. However, in general M'(ϕn) may fail to be continuous. In practice, with discontinuous f1 and f2 we find that the iteration (15) produces a reasonable approximation of the optimal mapping before numerical instabilities occur. A method for improving the performance of the algorithm is described in Section 4.

Proof. That M has a unique, convex minimizer which is a potential for the solution of the Monge-Kantorovich problem follows immediately from Proposition 2.1. To show the convexity of M, since the first term of (12) depends linearly on ϕ, it suffices to show the pointwise convexity of ϕ → ϕ*(y):

    \big( t\varphi_1 + (1 - t)\varphi_2 \big)^*(y)
        = \max_{x \in K_1} \big( x \cdot y - t\varphi_1(x) - (1 - t)\varphi_2(x) \big)
        \le \max_{x \in K_1} t \big( x \cdot y - \varphi_1(x) \big)
          + \max_{x \in K_1} (1 - t) \big( x \cdot y - \varphi_2(x) \big)
        = t\varphi_1^*(y) + (1 - t)\varphi_2^*(y).    (16)

The Lipschitz continuity of M is an immediate consequence of the contractive property of the Legendre-Fenchel transform, namely that ‖ϕ1* − ϕ2*‖∞ ≤ ‖ϕ1 − ϕ2‖∞. This property is well known, and the proof, a simple consequence of the definitions, is omitted.

The heart of the theorem, and the key to its usefulness, is the differentiability of M. We begin by computing the one-sided directional derivative of M at ϕ ∈ C(K1) in the direction of v ∈ C(K1). (A similar computation can be found in a paper [4] by W. Gangbo, but in a different context.)

    D_v M(\varphi) = \lim_{t \to 0^+} \frac{M(\varphi + tv) - M(\varphi)}{t}
        = \int_{K_1} v f_1 + \lim_{t \to 0^+} \int_{K_2} \frac{(\varphi + tv)^* - \varphi^*}{t} \, f_2.    (17)

Since ϕ* is a convex function, ϕ* is differentiable almost everywhere. Fix y ∈ K2 such that ∇ϕ*(y) = x0 exists. Then by Lemma 2.2, x0 is the unique maximizer of y · x − ϕ(x), the quantity whose maximum is ϕ*(y). Similarly, for t > 0 choose xt ∈ ∂(ϕ + tv)*(y) = argmax_{x ∈ K1} (x · y − (ϕ + tv)(x)). Then

    (\varphi + tv)^*(y) - \varphi^*(y)
        = \big( x_t \cdot y - \varphi(x_t) - t v(x_t) \big) - \big( x_0 \cdot y - \varphi(x_0) \big).    (18)


Replacing x0 with xt in (18) results in a larger quantity, while replacing xt with x0 results in a smaller quantity. Rearranging gives

    0 \le \frac{(\varphi + tv)^*(y) - \varphi^*(y)}{t} + v(x_0) \le v(x_0) - v(x_t).    (19)

Since tv(xt) converges uniformly to 0, any convergent subsequence of the family (xt) will converge to a maximizer of x · y − ϕ(x). Since x0 is the unique such maximizer, it follows that xt → x0, hence v(x0) − v(xt) → 0. Therefore

    D_v M(\varphi) = \int_{K_1} v f_1 - \int_{K_2} (v \circ \nabla\varphi^*) f_2.    (20)

The Hadamard differentiability of M at ϕ is equivalent (since M is Lipschitz) to the existence of a measure M'(ϕ) = σ ∈ M(K1) = C(K1)* such that D_v M(ϕ) = ∫_{K1} v dσ for arbitrary v ∈ C(K1); when σ is absolutely continuous, we identify M'(ϕ) with its density function. We obtain this by the change of variables y = ∇ϕ**(x) in (20), first obtaining

    D_v M(\varphi) = \int_{K_1} v f_1
        - \int_{(\nabla\varphi^{**})^{-1}(K_2)} \big( v \circ \nabla\varphi^* \circ \nabla\varphi^{**} \big)
          \big( f_2 \circ \nabla\varphi^{**} \big) \det D^2\varphi^{**}.    (21)

This change of variables is justified by work of McCann [10]: if we denote (v ∘ ∇ϕ*) f2 by g, then ∇ϕ** pushes (g ∘ ∇ϕ**) det D²ϕ** forward to g, where the Aleksandrov derivative D²ϕ** is the absolutely continuous part of the distributionally defined Hessian of ϕ**. The purpose of this change of variables is to employ the cancellation property of gradients of convex-conjugate functions, namely

    \nabla\varphi^* \circ \nabla\varphi^{**}(x) = x    (22)

whenever the left side exists, a consequence of the uniqueness of the maximizers in (10) and (11). Although the convex functions ϕ* and ϕ** are differentiable almost everywhere, it may be that ∇ϕ** maps a set of positive measure into the set where ∇ϕ* fails to exist. On the other hand, the Aleksandrov derivative will be singular on such a set (see [10]), and so the cancellation property (22) holds on the support of the second integrand in (21).

The two integrals in (21) can be combined, as (∇ϕ**)⁻¹(K2) = K1. Indeed, at any x ∈ K1 such that ∇ϕ**(x) = y0 exists, y0 is the unique maximizer of x · y − ϕ*(y); in particular, y0 ∈ K2. Combining this with the cancellation property (22), we obtain

    D_v M(\varphi) = \int_{K_1} \big( f_1 - ( f_2 \circ \nabla\varphi^{**} ) \det D^2\varphi^{**} \big) \, v.    (23)

This establishes the existence of the Hadamard derivative (13).

4 Examples

In this section, we present two examples in which we use the result of Theorem 3.1 to compute optimal mappings for warping images. The first demonstrates the effectiveness of the algorithm on a standard pair of images from the image processing literature. The second is a simple example from medical imaging of how geometric properties of the optimal mapping between two images carry information about the relationship between them.

In the context of image warping, f1 and f2 are discretely approximated by the intensity values of the pixels in two greyscale images. In practice, we find it works as well to replace the iteration (15) with

    \varphi_{n+1} = \varphi_n - \alpha_n \big( f_1 - ( f_2 \circ \nabla\varphi_n ) \det D^2\varphi_n \big),    (24)

which amounts to approximating ϕn** by ϕn. With natural images, we find that the iteration (24) can produce good-quality warpings. An example with two 256 × 256-pixel images is shown in Figure 1. A Lax-type numerical scheme was used: values of ϕ were computed at pixel vertices; derivatives of ϕ were computed at pixel centers using centered differencing; the second term of (24) was computed at the pixel centers; and ϕn+1 was then updated from ϕn at each vertex by averaging over the surrounding centers. The resulting warp of the Lena image is shown in Figure 1(c), where 190 iterations with stepsize parameter 1 were used, starting with the potential of the identity mapping as the initial function ϕ0. Numerical artifacts are just beginning to appear where the residue of the mirror of the Lena image meets Tiffany's eyelashes. These artifacts worsen upon further iteration, precluding further iterations to remove the mirror residue and the dark remnants of Lena's hair.

To improve the quality of the warp, we employ a multiresolution approach, sketched in code below. We begin with smoothed versions of the two images, obtained by convolution with a Gaussian kernel, and run the algorithm to obtain a warping potential. We then repeat the algorithm with the images smoothed to a lesser degree, using the final potential from the previous step as the initial function. We repeat this procedure, then at some point use the warping potential obtained as the initial function for the unsmoothed images. Figure 1(d) shows the result of using just one step of this iterated smoothing procedure, with a Gaussian kernel of size 16 and width 4. The resulting warp of the Lena image is almost identical to the Tiffany image; compare with the corresponding images in [6].
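The following sketch illustrates the iteration (24) together with the multiresolution smoothing. It is an illustration only: the single-grid layout, periodic neighbor averaging, interpolation choice, and the names `warp_step` and `solve` are our simplifications of the vertex/center Lax-type scheme described above, not a transcription of it. Images are assumed to be float arrays normalized to equal total mass.

```python
# Sketch of iteration (24) with Lax-type smoothing and multiresolution.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def warp_step(phi, f1, f2, alpha):
    """One update phi <- phi - alpha*(f1 - (f2 o grad phi) det D^2 phi)."""
    # First derivatives (unit pixel spacing assumed).
    px, py = np.gradient(phi)
    # Hessian entries by differencing the gradient again.
    pxx, pxy = np.gradient(px)
    _, pyy = np.gradient(py)
    det = pxx * pyy - pxy ** 2
    # Sample f2 at the mapped positions grad phi(x) by bilinear interpolation.
    f2_warp = map_coordinates(f2, [px, py], order=1, mode='nearest')
    g = f1 - f2_warp * det
    # Lax-type stabilization: average the update over neighboring pixels
    # (periodic boundaries here, purely for brevity).
    g = 0.25 * (np.roll(g, 1, 0) + np.roll(g, -1, 0)
                + np.roll(g, 1, 1) + np.roll(g, -1, 1))
    return phi - alpha * g

def solve(f1, f2, alpha=1.0, iters=190, sigmas=(4.0, 0.0)):
    """Multiresolution driver: descend on smoothed images, then refine."""
    ii, jj = np.meshgrid(np.arange(f1.shape[0]), np.arange(f1.shape[1]),
                         indexing='ij')
    # Potential of the identity mapping, phi(x) = |x|^2 / 2.
    phi = 0.5 * (ii.astype(float) ** 2 + jj.astype(float) ** 2)
    for sigma in sigmas:                    # coarse-to-fine smoothing
        g1 = gaussian_filter(f1, sigma) if sigma > 0 else f1
        g2 = gaussian_filter(f2, sigma) if sigma > 0 else f2
        for _ in range(iters):
            phi = warp_step(phi, g1, g2, alpha)
    return phi                              # s = grad phi warps f1 onto f2
```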


Figure 1: (a) Lena image. (b) Tiffany image. (c) Lena to Tiffany warp, without smoothing. (d) Lena to Tiffany warp, using smoothed images first to compute the initial function.


An example is presented in Figure 2 of a way that geometric information contained in the optimal Monge-Kantorovich mapping can be used to infer something about the relationship between two images. Figures 2(a) and 2(b) show images of the same brain slice, before and after a tumor has developed. Figure 2(c) plots the vector field s(x) − x, where s is the Monge-Kantorovich mapping. One can see that most of the deformation of the domain is in the tumor region. This region can be identified using the divergence of s, displayed in Figure 2(d). The dark color corresponds to negative divergence, as the mapping s compresses the domain to create the dark tumor region. The nearby light areas show where surrounding tissue has been compressed by the growth of the tumor.
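These diagnostics require only the gradient of the computed potential. A short sketch (ours, using the same simplified grid conventions as above; the function name is hypothetical) follows:

```python
# Sketch of the Figure 2 diagnostics: displacement s(x) - x and div s.
import numpy as np

def displacement_and_divergence(phi):
    """From a potential phi, return the displacement field and div s."""
    sx, sy = np.gradient(phi)              # s = grad phi, unit spacing
    ii, jj = np.meshgrid(np.arange(phi.shape[0]), np.arange(phi.shape[1]),
                         indexing='ij')
    disp = (sx - ii, sy - jj)              # vector field s(x) - x
    # div s = trace of D^2 phi; negative values indicate compression.
    div = np.gradient(sx)[0] + np.gradient(sy)[1]
    return disp, div
```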


Figure 2: (a) Healthy brain. (b) Same brain with tumor. (c) Vector-field plot of the displacement s(x) − x. (d) Divergence of the optimal mapping.


References

[1] J.-D. Benamou and Y. Brenier, A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem, Numer. Math., 84 (2000), 375–393.

[2] Y. Brenier, Décomposition polaire et réarrangement monotone des champs de vecteurs, C. R. Acad. Sci. Paris Sér. I Math., 305 (1987), 805–808.

[3] L. C. Evans, Partial differential equations and Monge-Kantorovich mass transfer, in Current developments in mathematics, 1997 (Cambridge, MA), Int. Press, Boston, MA, 1999, 65–126.

[4] W. Gangbo, An elementary proof of the polar factorization of vector-valued functions, Arch. Rational Mech. Anal., 128 (1994), 381–399.

[5] W. Gangbo and R. J. McCann, The geometry of optimal transportation, Acta Math., 177 (1996), 113–161.

[6] S. Haker and A. Tannenbaum, On the Monge-Kantorovich problem and image warping, in Mathematical methods in computer vision, vol. 133 of IMA Vol. Math. Appl., Springer, New York, 2003, 65–85.

[7] L. V. Kantorovich, On the translocation of masses, Dokl. Akad. Nauk SSSR, 37 (1942), 227–229.

[8] L. V. Kantorovich, On a problem of Monge, Uspekhi Mat. Nauk, 3 (1948), 225–226.

[9] M. Knott and C. S. Smith, On the optimal mapping of distributions, J. Optim. Theory Appl., 43 (1984), 39–49.

[10] R. J. McCann, A convexity principle for interacting gases, Advances in Math., 128 (1997), 153–179.

[11] G. Monge, Mémoire sur la théorie des déblais et des remblais, Histoire de l'Académie Royale des Sciences de Paris, (1781), 666–704.

[12] S. T. Rachev and L. Rüschendorf, Mass transportation problems. Vol. II: Applications, Probability and its Applications, Springer-Verlag, New York, 1998.

[13] R. T. Rockafellar and R. J.-B. Wets, Variational analysis, vol. 317 of Grundlehren der Mathematischen Wissenschaften, Springer-Verlag, Berlin, 1998.

Received: November, 2008