Convergence of Contrastive Divergence Algorithm in Exponential Family

arXiv:1603.05729v2 [stat.ML] 6 May 2016

Tung-Yu Wu∗, Bai Jiang∗, Yifan Jin and Wing H. Wong†
Department of Statistics, 390 Serra Mall, Stanford, CA 94305
e-mail: [email protected]
url: http://web.stanford.edu/group/wonglab/

Abstract: This paper studies the convergence properties of the contrastive divergence (CD) algorithm for parameter inference in exponential families, by relating it to Markov chain theory and the stochastic stability literature. We prove that, under mild conditions and given a finite data sample X_1, ..., X_n ∼ p_{θ∗} i.i.d., on an event with probability approaching 1 the sequence {θ_t}_{t≥0} generated by the CD algorithm is a positive Harris recurrent chain, and thus possesses a unique invariant distribution π_n. The invariant distribution concentrates around the Maximum Likelihood Estimate at a rate arbitrarily slower than √n, and the number of steps in the Markov Chain Monte Carlo run only affects the coefficient factor of the concentration rate. Finally we conclude that, as n → ∞,

\[
\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s-\theta^*\Bigr\|
\xrightarrow{\;p\;}0 .
\]

MSC 2010 subject classifications: Primary 68W48, 60J20; secondary 93E15. Keywords and phrases: Contrastive Divergence Algorithm, Exponential Family, Harris Recurrent Chain, Convergence Rate.

∗ Contributed equally to this paper.
† Corresponding author.

1. Introduction

The Contrastive Divergence (CD) algorithm [1] has been widely used for parameter inference in Markov Random Fields. The first example of its application was given by Hinton [1] to train Restricted Boltzmann Machines, the essential building blocks of Deep Belief Networks [2, 3, 4]. The key idea behind CD is to approximate the computationally intractable term in the likelihood gradient by running a small number (m) of steps of a Markov Chain Monte Carlo (MCMC) run. It is thus much faster than conventional MCMC methods, which run a large number of steps to reach the equilibrium distribution. Despite CD's empirical success, theoretical understanding of its behavior is far less satisfactory. Both computer simulation and theoretical analysis show that CD may fail to converge to the correct solution [5, 6, 7]. Studies of its theoretical convergence properties have thus been motivated. Yuille relates the


algorithm to the stochastic approximation literature and gives very restrictive convergence conditions [8]. Others show that for Restricted Boltzmann Machines the CD update is not the gradient of any function [9], but that for fully-visible Boltzmann Machines the CD update can be viewed as the gradient of the pseudo-likelihood function if a simple scheme of Gibbs sampling is adopted [10]. In any case, the fundamental question of why CD with finite m can work asymptotically in the limit n → ∞ has not been answered.

This paper studies the convergence properties of the CD algorithm in exponential families and gives convergence conditions involving the number of steps of Markov kernel transitions m, the spectral gap of the Markov kernels, the concavity of the log-likelihood function, and the learning rate η of the CD updates (assumed fixed in our analyses). This enables us to establish the convergence of CD with a fixed m to the true parameter as the sample size n increases.

Section 2 describes the CD algorithm for an exponential family with parameter θ and data X. Section 3 states our main result: denoting by {θ_t}_{t≥0} the parameter sequence generated by the CD algorithm from an i.i.d. data sample X_1, ..., X_n ∼ p_{θ∗}, a sufficiently large m can guarantee

\[
\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s-\theta^*\Bigr\|
\xrightarrow{\;p\;}0
\quad\text{as } n\to\infty,
\]

under mild conditions. Section 4 shows that {θ_t}_{t≥0} is a Markov chain under P_x, the conditional probability measure given any realization of the data sample X = (X_1, ..., X_n), and imposes three constraints on X, which hold asymptotically with probability 1. Thereafter Sections 5-8 study {θ_t}_{t≥0} under P_x in the framework of Markov chain theory, and show that the chain is positive Harris recurrent and thus possesses a unique invariant distribution π_n. The invariant distribution π_n concentrates around the MLE θ̂_n at a rate arbitrarily slower than √n, and m only affects the coefficient factor of the concentration rate. Section 9 completes the proof of the main result. For the convenience of the reader, we assume throughout Sections 3-9 that the exponential family under study is a set of continuous probability distributions, and show in Section 10 how to obtain a similar conclusion for the case of discrete probability distributions. We also provide two numerical experiments to illustrate the theory in Section 11.

2. Contrastive Divergence in Exponential Family

Consider an exponential family over X ⊆ R^p with parameter θ ∈ Θ ⊆ R^d,

\[
p_\theta(x) = c(x)\, e^{\theta\cdot\phi(x)-\Lambda(\theta)},
\]

where c(x) is the carrier measure, φ(x) ∈ R^d is the sufficient statistic, and Λ(θ) is the cumulant generating function

\[
\Lambda(\theta) = \log \int_{\mathcal{X}} c(x)\, e^{\theta\cdot\phi(x)}\, dx .
\]


We assume φ(X) is bounded; then the natural parameter domain {θ ∈ R^d : Λ(θ) < ∞} = R^d (if it is not empty). Λ(θ) is convex and differentiable at any interior point of the natural parameter domain, and both the gradient and Hessian of the cumulant generating function Λ(θ) exist:

\[
\mu(\theta) \triangleq \nabla\Lambda(\theta) = E_\theta[\phi(X)],
\qquad
\Sigma(\theta) \triangleq \nabla^2\Lambda(\theta) = \mathrm{Cov}_\theta[\phi(X)].
\]

Given an i.i.d. sample X = (X_1, ..., X_n) generated from a certain underlying distribution p_{θ∗}, the log-likelihood function is

\[
l(\theta) = \frac{1}{n}\sum_{i=1}^{n}\log c(X_i)
          + \theta\cdot\frac{1}{n}\sum_{i=1}^{n}\phi(X_i) - \Lambda(\theta),
\]

and the gradient is

\[
g(\theta) \triangleq \nabla l(\theta)
          = \frac{1}{n}\sum_{i=1}^{n}\phi(X_i) - \mu(\theta).
\]

Assuming the positive definiteness of Σ(θ), the Maximum Likelihood Estimate (MLE) θ̂_n uniquely exists and satisfies g(θ̂_n) = 0, or equivalently

\[
\mu(\hat\theta_n) = \frac{1}{n}\sum_{i=1}^{n}\phi(X_i).
\]

Maximum likelihood learning can be done by gradient ascent

\[
\theta^{\mathrm{new}} = \theta + \eta\, g(\theta)
 = \theta + \eta\Bigl[\frac{1}{n}\sum_{i=1}^{n}\phi(X_i) - \mu(\theta)\Bigr],
\]

where the learning rate η > 0. When computing the gradient g(θ), the first term (1/n) Σ_{i=1}^n φ(X_i) is easy to compute, but it is usually difficult to compute the second term µ(θ), which involves a complicated integral over X. Markov Chain Monte Carlo (MCMC) methods may generate a random sample from a Markov chain with equilibrium distribution p_θ(x) and approximate µ(θ) by the sample average. However, Markov chains take a large number of steps to reach the equilibrium distribution. To address this problem, Hinton proposed the Contrastive Divergence (CD) method [1]. The idea of CD is to replace µ(θ) and g(θ) with

\[
\mu_{\mathrm{cd}}(\theta) \triangleq \frac{1}{n}\sum_{i=1}^{n}\phi\bigl(X_i^{(m)}\bigr),
\qquad
g_{\mathrm{cd}}(\theta) \triangleq \frac{1}{n}\sum_{i=1}^{n}\phi(X_i) - \mu_{\mathrm{cd}}(\theta),
\]

respectively, where X_i^{(m)} is obtained by a small number (m) of steps of an MCMC run starting from the observed sample X_i. Formally, denote by k_θ(x, y)


the Markov transition kernel with p_θ(x) as the equilibrium distribution. CD first runs Markov chains for m steps,

\[
X_i \xrightarrow{k_\theta} X_i^{(1)} \xrightarrow{k_\theta} X_i^{(2)}
    \xrightarrow{k_\theta} \cdots \xrightarrow{k_\theta} X_i^{(m)},
\quad\text{for } i = 1, ..., n,
\]

and then makes the update

\[
\theta^{\mathrm{new}} = \theta + \eta\, g_{\mathrm{cd}}(\theta)
 = \theta + \eta\Bigl[\frac{1}{n}\sum_{i=1}^{n}\phi(X_i)
 - \frac{1}{n}\sum_{i=1}^{n}\phi\bigl(X_i^{(m)}\bigr)\Bigr].
\]
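The update above can be sketched in code (an illustrative Python skeleton under our own toy setup, not the authors' implementation). Here `mcmc_step` stands for one transition of any kernel k_θ(x, ·) with equilibrium distribution p_θ; for concreteness we plug in a random-walk Metropolis kernel for the Gaussian location family p_θ(x) ∝ e^{−x²/2 + θx}, where φ(x) = x and the MLE is the sample mean:

```python
import math
import random

def cd_update(theta, data, phi, mcmc_step, m=1, eta=0.1):
    """One CD-m update: theta_new = theta + eta * g_cd(theta)."""
    n = len(data)
    model_samples = []
    for x in data:
        y = x
        for _ in range(m):                 # m kernel transitions from each X_i
            y = mcmc_step(y, theta)
        model_samples.append(y)
    data_term = sum(phi(x) for x in data) / n            # (1/n) sum phi(X_i)
    model_term = sum(phi(y) for y in model_samples) / n  # mu_cd(theta)
    return theta + eta * (data_term - model_term)

def metropolis_step(x, theta, step=1.0):
    # random-walk Metropolis targeting p_theta(x) ∝ exp(theta*x - x^2/2)
    y = x + random.gauss(0.0, step)
    log_accept = theta * (y - x) - 0.5 * (y * y - x * x)
    return y if random.random() < math.exp(min(0.0, log_accept)) else x

random.seed(1)
theta_star = 1.0
data = [random.gauss(theta_star, 1.0) for _ in range(500)]

theta = 0.0
for _ in range(200):
    theta = cd_update(theta, data, phi=lambda x: x,
                      mcmc_step=metropolis_step, m=3, eta=0.5)
print(theta)  # random-walks in a small neighborhood of the MLE (sample mean)
```

The step size, learning rate, and iteration counts here are arbitrary illustrative choices; the point is only the structure of the two-term gradient estimate.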

Denote by K_θ the Markov operator associated with k_θ(x, y), i.e.

\[
(K_\theta f)(x) = \int_{\mathcal{X}} f(y)\, k_\theta(x, y)\, dy,
\]

and by α(θ) the second largest absolute eigenvalue of K_θ. The Markov kernel K_θ is said to have L²-spectral gap 1 − α(θ) if α(θ) < 1. The convergence rate of MCMC depends on the L²-spectral gap [11]. Throughout the paper, k_θ^m denotes the m-step transition kernel of k_θ, k_θ^m p_{θ_0}(·) denotes the distribution of the Markov chain after m transition steps starting from the initial distribution p_{θ_0}, and K_θ^m denotes the m-step Markov operator of K_θ. We also let ‖·‖ denote the l²-norm ‖·‖₂.

3. Main Result

We base the convergence properties of the CD algorithm for exponential families of continuous distributions on the assumptions (A1), (A2), (A3), (A4), (A5), (A6). Theorem 3.1 states our main result, whose proof is presented in Sections 4-9. We later show in Section 10 a similar conclusion for the case of discrete distributions.

(A1) φ(x) is bounded, i.e. there exists some constant C such that φ(x) ∈ [−C, C]^d for any x ∈ X.
(A2) Θ ⊆ R^d is convex and compact, and the true parameter θ∗ is an interior point of Θ.
(A3) For any θ ∈ Θ, the φ_j(X), 1 ≤ j ≤ d, are linearly independent under p_θ, and thus Σ(θ) is positive definite. This assumption, together with the continuity of Σ(θ) and (A2), immediately implies bounds on the smallest and largest eigenvalues of Σ(θ):

\[
\lambda_{\min} \triangleq \inf_{\Theta} \lambda_{\min}(\theta) > 0,
\qquad
\lambda_{\max} \triangleq \sup_{\Theta} \lambda_{\max}(\theta) < \infty.
\]

(A4) Define a metric ρ on the set of Markov operators {K_θ : θ ∈ Θ} as

\[
\rho(K_\theta, K_{\theta'}) \triangleq \sup_{f:\,|f|\le 1}\ \sup_{x\in\mathcal{X}} \bigl|(K_\theta f)(x) - (K_{\theta'} f)(x)\bigr|,
\]

and assume the Lipschitz continuity of K_θ in the sense that ρ(K_θ, K_{θ'}) ≤ ζ‖θ − θ'‖ for any θ, θ' ∈ Θ.


(A5) The Markov operators K_θ have L²-spectral gap 1 − α(θ), and

\[
\alpha \triangleq \sup_{\Theta} \alpha(\theta) < 1.
\]

(A6) φ(X) is a convex set. Using the MCMC transition kernel k_θ(x, y), φ(y) has a conditional probability density function p(φ|θ, x) conditioning on θ and x. For any x ∈ X, inf_{θ∈Θ} inf_{φ∈φ(X)} p(φ|θ, x) > 0.

Theorem 3.1. Assume (A1), (A2), (A3), (A4), (A5), (A6), and the data sample X_1, ..., X_n ∼ p_{θ∗} i.i.d. There exists some constant L > 0 such that, for any m and learning rate η satisfying

\[
\lambda_{\min}^2 - \sqrt{d}\,CL\alpha^m \lambda_{\max}
 - \frac{\eta\lambda_{\max}}{2}\bigl(\lambda_{\max} + \sqrt{d}\,CL\alpha^m\bigr)^2 > 0,
\]

the CD algorithm generates a sequence {θ_t}_{t≥0} such that for any ε > 0

\[
\lim_{n\to\infty} P\Bigl(\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s-\theta^*\Bigr\| > \varepsilon\Bigr) = 0.
\]

4. Conditioning on Data Sample

It is not hard to see that CD generates a Markov chain {θ_t}_{t≥0} in the parameter space Θ given any realization of the data sample X = x. Indeed, denote by X_t^{(m)} = (X_{t,1}^{(m)}, X_{t,2}^{(m)}, ..., X_{t,n}^{(m)}) the m-step MCMC random sample used to estimate the CD gradient from θ_{t−1} to θ_t. The filtration

\[
\mathcal{G}_t \triangleq \sigma\text{-algebra}\bigl(X, \theta_0, X_1^{(m)}, \theta_1, X_2^{(m)}, ..., \theta_{t-1}, X_t^{(m)}, \theta_t\bigr)
\]

contains all historical information up to the t-th step of CD. The CD update

\[
\theta_t = \theta_{t-1} + \eta\Bigl[\frac{1}{n}\sum_{i=1}^{n}\phi(x_i)
 - \frac{1}{n}\sum_{i=1}^{n}\phi\bigl(X_{t,i}^{(m)}\bigr)\Bigr]
\]

is merely a function of the data X = x, the current parameter θ_{t−1} and the m-step MCMC sample X_t^{(m)} = (X_{t,1}^{(m)}, ..., X_{t,n}^{(m)}). Conditioning on the data sample x and the current parameter, X_{t,i}^{(m)} is independent of the history of CD updates. Thus {θ_t}_{t≥0} is indeed a homogeneous G_t-adapted Markov chain under P_x, the conditional probability measure given X = x. The remainder of the paper therefore studies the CD path {θ_t}_{t≥0} in the framework of Markov chain theory. From now on, P_θ^x denotes the conditional probability measure given the data sample x and parameter θ, and E_θ^x and Cov_θ^x denote the expectation and covariance under P_θ^x.

Next we impose three constraints (4.1), (4.2), (4.3) on the data sample X, which are shown in Lemma 4.1 to hold asymptotically with probability 1 as n → ∞. We


later show that the Markov chain {θt }t≥0 converges to an invariant distribution under Px if the data sample X = x satisfies these constraints.

\begin{align}
&\Bigl\|\frac{1}{n}\sum_{i=1}^{n}\phi(X_i) - \mu(\theta^*)\Bigr\| < n^{-1/2+\gamma_1}
\tag{4.1}\\
&\bigl\|\hat\theta_n(X_1, ..., X_n) - \theta^*\bigr\| < n^{-1/2+\gamma_1}
\tag{4.2}\\
&\sup_{\theta\in\Theta}\Bigl\|\frac{1}{n}\sum_{i=1}^{n}\int\phi(y)\,k_\theta^m(X_i, y)\,dy
 - \int\phi(y)\,k_\theta^m p_{\theta^*}(y)\,dy\Bigr\| < n^{-1/2+\gamma_1}
\tag{4.3}
\end{align}

where θ∗ is the true parameter and γ₁ is any number between 0 and 1/2.

Lemma 4.1. Assume (A1), (A2), (A3), (A4), and X_1, ..., X_n ∼ p_{θ∗} i.i.d. Then

\[
\lim_{n\to\infty} P\bigl((4.1), (4.2), (4.3)\bigr) = 1
\]

for any γ₁ ∈ (0, 1/2).

The result that (4.1) and (4.2) hold asymptotically with probability 1 follows from standard theorems in large sample theory [12]. Therefore it suffices to show that (4.3) holds asymptotically with probability 1. To this end, we define

\[
f_\theta : x \mapsto \int_{\mathcal{X}} \phi(y)\, k_\theta^m(x, y)\, dy
\]

and bound the tail of the empirical process

\[
v_n(f_\theta) \triangleq n^{-1/2}\sum_{i=1}^{n}\Bigl(f_\theta(X_i) - \int_{\mathcal{X}} f_\theta(x)\, p_{\theta^*}(x)\, dx\Bigr)
 = n^{-1/2}\sum_{i=1}^{n}\Bigl(\int_{\mathcal{X}}\phi(y)\,k_\theta^m(X_i, y)\,dy - \int_{\mathcal{X}}\phi(y)\,k_\theta^m p_{\theta^*}(y)\,dy\Bigr)
\]

by Theorem 2.14.9 in [13], which relates the tail of an empirical process to the covering number of a function class. Details of the proof are provided in the Appendix.

5. Gradient Approximation Error

Our study of the Markov chain {θ_t}_{t≥0} starts by appropriately bounding the bias and variance of the CD gradient g_cd(θ) under P_x. Lemma 5.1 shows that the bias of g_cd(θ) is O(n^{−1/2+γ₁}) + O(α^m ‖θ − θ̂_n‖), depending on the mixing rate α of the chains in MCMC, the MCMC step number m, the sample size n and the distance between θ and the MLE θ̂_n; and that the covariance of g_cd(θ) is O(1/n), depending on the sample size n. Write the gradient approximation error Δg(θ) = g_cd(θ) − g(θ).


Lemma 5.1. Assume (A1), (A2) and (A5), and that the data sample x satisfies (4.2) and (4.3). Then

\[
\bigl\|E_\theta^x[\Delta g]\bigr\|
\le \bigl(1 + \sqrt{d}\,CL\alpha^m\bigr)\, n^{-1/2+\gamma_1}
 + \sqrt{d}\,CL\alpha^m \bigl\|\theta - \hat\theta_n\bigr\|
\]

for some constant L > 0, where 1 − α is the L²-spectral gap of the Markov operators K_θ in MCMC and γ₁ is introduced by inequalities (4.2) and (4.3). Moreover,

\[
\mathrm{trace}\bigl[\mathrm{Cov}_\theta^x\, \Delta g(\theta)\bigr] \le \frac{dC^2}{n}.
\]

Proof. For simplicity of notation, we abbreviate Δg(θ) as Δg. We have

\[
E_\theta^x[\Delta g]
= \mu(\theta) - \frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{X}}\phi(y)\,k_\theta^m(x_i, y)\,dy
= \Bigl[\mu(\theta) - \int_{\mathcal{X}}\phi(y)\,k_\theta^m p_{\theta^*}(y)\,dy\Bigr]
 + \Bigl[\int_{\mathcal{X}}\phi(y)\,k_\theta^m p_{\theta^*}(y)\,dy
 - \frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{X}}\phi(y)\,k_\theta^m(x_i, y)\,dy\Bigr],
\]

implying

\[
\bigl\|E_\theta^x[\Delta g]\bigr\|
\le \Bigl\|\mu(\theta) - \int_{\mathcal{X}}\phi(y)\,k_\theta^m p_{\theta^*}(y)\,dy\Bigr\|
 + \Bigl\|\frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{X}}\phi(y)\,k_\theta^m(x_i, y)\,dy
 - \int_{\mathcal{X}}\phi(y)\,k_\theta^m p_{\theta^*}(y)\,dy\Bigr\|.
\]

The second term is bounded by n^{−1/2+γ₁} by inequality (4.3). For the first term, consider each j = 1, ..., d:

\begin{align*}
\Bigl|\mu_j(\theta) - \int_{\mathcal{X}}\phi_j(y)\,k_\theta^m p_{\theta^*}(y)\,dy\Bigr|
&= \Bigl|\int_{\mathcal{X}}\phi_j(y)\,k_\theta^m p_\theta(y)\,dy
 - \int_{\mathcal{X}}\phi_j(y)\,k_\theta^m p_{\theta^*}(y)\,dy\Bigr| \\
&= \Bigl|\int_{\mathcal{X}} K_\theta^m\phi_j(y)\,\Bigl(\frac{p_{\theta^*}(y)}{p_\theta(y)} - 1\Bigr) p_\theta(y)\,dy\Bigr| \\
&= \Bigl|\int_{\mathcal{X}} K_\theta^m\bigl(\phi_j(y) - \mu_j(\theta)\bigr)\Bigl(\frac{p_{\theta^*}(y)}{p_\theta(y)} - 1\Bigr) p_\theta(y)\,dy\Bigr| \\
&\le \alpha(\theta)^m \sqrt{\int_{\mathcal{X}}\bigl(\phi_j(y) - \mu_j(\theta)\bigr)^2 p_\theta(y)\,dy}\;
 \sqrt{\int_{\mathcal{X}}\Bigl(\frac{p_{\theta^*}(y)}{p_\theta(y)} - 1\Bigr)^2 p_\theta(y)\,dy} \\
&\le \alpha^m C \sqrt{\int_{\mathcal{X}}\Bigl(\frac{p_{\theta^*}(y)}{p_\theta(y)} - 1\Bigr)^2 p_\theta(y)\,dy} \\
&= \alpha^m C \sqrt{e^{-2\Lambda(\theta^*)+\Lambda(\theta)+\Lambda(2\theta^*-\theta)} - 1} \\
&\le CL\alpha^m \|\theta - \theta^*\| \tag{5.1}
\end{align*}

where L is the Lipschitz constant of the continuously differentiable function f : θ ∈ Θ ↦ √(e^{−2Λ(θ∗)+Λ(θ)+Λ(2θ∗−θ)} − 1), and the last step follows from |f(θ)| = |f(θ) − 0| = |f(θ) − f(θ∗)| ≤ L‖θ − θ∗‖.


Putting (4.2), (4.3) and (5.1) together yields

\begin{align*}
\|E_\theta^x \Delta g\|
&\le n^{-1/2+\gamma_1} + \sqrt{d}\,CL\alpha^m \|\theta - \theta^*\| \\
&\le n^{-1/2+\gamma_1} + \sqrt{d}\,CL\alpha^m \|\hat\theta_n - \theta^*\|
 + \sqrt{d}\,CL\alpha^m \|\theta - \hat\theta_n\| \\
&\le \bigl(1 + \sqrt{d}\,CL\alpha^m\bigr) n^{-1/2+\gamma_1}
 + \sqrt{d}\,CL\alpha^m \|\theta - \hat\theta_n\|.
\end{align*}

Also, noting that the X_i^{(m)} | x, θ ∼ k_θ^m(x_i, ·) are conditionally independent (but not identically distributed), since we run n Markov chains independently starting from the different x_i, i = 1, ..., n, we write

\[
\mathrm{trace}\bigl[\mathrm{Cov}_\theta^x\,\Delta g\bigr]
= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\int_{\mathcal{X}}
 \Bigl(\phi_j(y) - \int_{\mathcal{X}}\phi_j(z)\,k_\theta^m(x_i, z)\,dz\Bigr)^2 k_\theta^m(x_i, y)\,dy
\le \frac{dC^2}{n}.
\]

6. Drift Towards MLE

When studying the convergence of the CD method, we hypothesize that, starting at some θ₀ far away from θ̂_n, the exact gradient g(θ_t) is large enough to dominate the approximation error of the m-step MCMC sampling and brings a positive drift in log-likelihood, until θ_t is close to θ̂_n and g(θ_t) fails to suppress the MCMC error. To precisely characterize this phenomenon, we give the definitions of drift and establish the drift condition with the Lyapunov function u(θ) ≜ l(θ̂_n) − l(θ), the log-likelihood gap at θ compared to the MLE θ̂_n.

Definition 6.1 (drift). Let V : S → R₊ be some function on the state space of a Markov chain {Z_t}_{t≥0}. The one-step drift of V is defined as E_z V(Z₁) − V(z).

Definition 6.2 (drift condition). V satisfies the drift condition if E_z V(Z₁) − V(z) ≤ −δ for z ∈ B^c with some δ > 0 and some subset B of the state space. V is called a Lyapunov function for the Markov chain {Z_t}_{t≥0}.

Lemma 6.1 shows that the function u(θ) ≜ l(θ̂_n) − l(θ) satisfies the drift condition outside of the closed balls B_β = {θ ∈ Θ : ‖θ − θ̂_n‖ ≤ βr_n} with β > 1 and r_n = O(n^{−1/2+γ₁}).
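To make the drift condition concrete, the one-step drift of u can be estimated by Monte Carlo in a toy Gaussian location model (our own illustration, not from the paper; we use an exact sampler in place of the m-step kernel, i.e. the best-case, perfectly mixing MCMC). Far from the MLE the drift is strongly negative; at the MLE it is a small positive number of order η²/n:

```python
import random

def drift(theta, data, eta=0.2, reps=2000, seed=0):
    """Monte Carlo estimate of the one-step drift E[u(theta_1)] - u(theta)
    for the Gaussian location family p_theta = N(theta, 1), phi(x) = x,
    with an exact sampler standing in for the m-step MCMC kernel."""
    rng = random.Random(seed)
    n = len(data)
    m_bar = sum(data) / n                    # the MLE theta_hat equals m_bar
    u = lambda t: 0.5 * (t - m_bar) ** 2     # log-likelihood gap u(theta)
    total = 0.0
    for _ in range(reps):
        model_term = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        theta1 = theta + eta * (m_bar - model_term)   # one CD update
        total += u(theta1) - u(theta)
    return total / reps

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
far = drift(3.0, data)                       # far from the MLE
near = drift(sum(data) / len(data), data)    # at the MLE
print(far, near)  # strongly negative vs. a tiny positive value
```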


Lemma 6.1. Assume (A1), (A2), (A3), (A5), and that the data sample x satisfies (4.2) and (4.3). For any m and learning rate η satisfying

\[
a \triangleq \lambda_{\min}^2 - \sqrt{d}\,CL\alpha^m\lambda_{\max}
 - \frac{\eta\lambda_{\max}}{2}\bigl(\lambda_{\max} + \sqrt{d}\,CL\alpha^m\bigr)^2 > 0,
\]

the chain {θ_t}_{t≥0} has Lyapunov function u(θ) = l(θ̂_n) − l(θ), which satisfies the drift condition outside the closed ball B_β = {θ ∈ Θ : ‖θ − θ̂_n‖ ≤ βr_n} for any β > 1, with

\[
\delta \ge \eta(\beta^2 - 1)c_n,
\qquad
r_n = \frac{b_n + \sqrt{b_n^2 + 4ac_n}}{2a} \asymp n^{-1/2+\gamma_1},
\]

where

\begin{align*}
b_n &= \lambda_{\max}\bigl(1 + \sqrt{d}\,CL\alpha^m\bigr)\bigl(1 + \eta\lambda_{\max} + \eta\sqrt{d}\,CL\alpha^m\bigr)\, n^{-1/2+\gamma_1},\\
c_n &= \frac{\eta\lambda_{\max}}{2}\Bigl[dC^2 n^{-2\gamma_1} + \bigl(1 + \sqrt{d}\,CL\alpha^m\bigr)^2\Bigr]\, n^{-1+2\gamma_1}.
\end{align*}

Proof. For simplicity of notation, we abbreviate g(θ) as g, g_cd(θ) as g_cd, and Δg(θ) as Δg. The difference u(θ₁) − u(θ) when moving from θ to θ₁ = θ + ηg_cd is bounded from above as

\begin{align*}
u(\theta_1) - u(\theta)
&= -\eta\langle g, g_{\mathrm{cd}}\rangle + \frac{1}{2}\eta^2\langle\Sigma(\theta')g_{\mathrm{cd}}, g_{\mathrm{cd}}\rangle \\
&\le -\eta\langle g, g_{\mathrm{cd}}\rangle + \frac{1}{2}\eta^2\lambda_{\max}\langle g_{\mathrm{cd}}, g_{\mathrm{cd}}\rangle \\
&= -\eta\langle g, g\rangle - \eta\langle g, \Delta g\rangle + \frac{1}{2}\eta^2\lambda_{\max}\langle g_{\mathrm{cd}}, g_{\mathrm{cd}}\rangle,
\end{align*}

where θ' is a convex combination of θ and θ₁. The first term ‖g‖² is constant. The expectation of the second term satisfies

\[
E_\theta^x[-\langle g, \Delta g\rangle] = -\langle g, E_\theta^x[\Delta g]\rangle
\le \|g\|\, \|E_\theta^x[\Delta g]\|,
\tag{6.1}
\]

and the expectation of the third term satisfies

\[
E_\theta^x[\langle g_{\mathrm{cd}}, g_{\mathrm{cd}}\rangle]
= \mathrm{trace}\bigl[\mathrm{Cov}_\theta^x\, g_{\mathrm{cd}}\bigr] + \|E_\theta^x[g_{\mathrm{cd}}]\|^2
\le \mathrm{trace}\bigl[\mathrm{Cov}_\theta^x\, \Delta g\bigr] + \bigl(\|g\| + \|E_\theta^x[\Delta g]\|\bigr)^2.
\tag{6.2}
\]

Since g(θ) = g(θ) − g(θ̂_n) = µ(θ̂_n) − µ(θ),

\[
\lambda_{\min}\|\theta - \hat\theta_n\| \le \|g(\theta)\| \le \lambda_{\max}\|\theta - \hat\theta_n\|.
\tag{6.3}
\]

Putting (6.1), (6.2) and (6.3) together with Lemma 5.1 yields

\[
E_\theta^x[u(\theta_1) - u(\theta)]
\le -\eta\bigl(a\|\theta - \hat\theta_n\|^2 - b_n\|\theta - \hat\theta_n\| - c_n\bigr).
\tag{6.4}
\]


The RHS of (6.4) is quadratic in ‖θ − θ̂_n‖, and it is clear that a > 0 for sufficiently large m and sufficiently small η. Then

\[
\|\theta - \hat\theta_n\| \ge r_n = \frac{b_n + \sqrt{b_n^2 + 4ac_n}}{2a}
\;\Longrightarrow\;
E_\theta^x[u(\theta_1) - u(\theta)] \le 0.
\]

In particular, the drift condition holds outside any closed ball B_β centered at the MLE of radius βr_n with β > 1, i.e.

\[
\|\theta - \hat\theta_n\| > \beta r_n
\;\Longrightarrow\;
E_\theta^x[u(\theta_1) - u(\theta)] \le -\delta < 0
\]

with

\[
\delta = \eta\bigl(a\beta^2 r_n^2 - b_n\beta r_n - c_n\bigr)
 = \eta\bigl[a(\beta^2 - 1)r_n^2 - b_n(\beta - 1)r_n\bigr]
 \ge \eta(\beta^2 - 1)\bigl(ar_n^2 - b_n r_n\bigr)
 = \eta(\beta^2 - 1)c_n.
\]

Remark. A function h(·) is called superharmonic for a transition probability p(x, ·) at x if ∫h(y)p(x, dy) ≤ h(x), and strongly superharmonic if

\[
\int h(y)\, p(x, dy) \le h(x) - \delta
\]

for some positive δ. We actually prove in Lemma 6.1 that u(θ) is strongly superharmonic on B_β^c. We will see in Theorem 7.1 a nice connection between strongly superharmonic functions, positive recurrence of Markov chains, and supermartingales.

7. Positive Harris Recurrence

Tweedie [15] connected the drift condition in Definition 6.2 to the positive recurrence of sets in the state space by a Markov chain. We restate this result in Theorem 7.1 and provide a proof based on supermartingales and superharmonic functions in the Appendix. Next, Corollary 7.1 combines Lemma 6.1 with Theorem 7.1 and concludes that the closed balls B_β centered at the MLE θ̂_n with radius βr_n are positive recurrent for the chain {θ_t}_{t≥0}.

Theorem 7.1 (Theorem 6.1 in [15]). Suppose a Markov chain {Z_t}_{t≥0} has a non-negative function V on the state space satisfying the drift condition in Definition 6.2 with some δ > 0 and set B. Let T = min{t ≥ 1 : Z_t ∈ B} be the first hitting time of B if starting from Z_0 ∈ B^c, or the first return time otherwise. Then

\[
E_z T \le
\begin{cases}
V(z)/\delta & \text{for } z \in B^c, \\
1 + \frac{1}{\delta}\int_{B^c} V(z_1)\, p(z, dz_1) & \text{for } z \in B,
\end{cases}
\]


where p(z, dz₁) is the transition probability of the chain. Thus, if

\[
\sup_{z\in B}\int_{B^c} V(z_1)\, p(z, dz_1) < \infty
\]

also holds, then the set B is positive recurrent.

Lyapunov functions are widely used in the stochastic stability and optimal control literature [16]. As we have seen in Theorem 7.1, a suitably designed Lyapunov function can establish the positive recurrence of sets of a Markov chain. We proceed to apply Theorem 7.1 to the Markov chain {θ_t}_{t≥0}, for which u(θ) satisfies the drift condition outside of any closed ball B_β by Lemma 6.1, and conclude in Corollary 7.1 that the closed balls B_β centered at the MLE are positive recurrent for the chain {θ_t}_{t≥0}.

Corollary 7.1. Following Theorem 7.1 and Lemma 6.1, the balls B_β for each β > 1 are positive recurrent for the chain {θ_t}_{t≥0}.

Proof. Let T ≜ min{t ≥ 1 : θ_t ∈ B_β} be the first hitting or return time of B_β by the chain {θ_t}_{t≥0}. Lemma 6.1 establishes the drift condition for the likelihood gap function u(θ) outside of B_β, i.e. E_θ^x u(θ₁) − u(θ) ≤ −δ for θ ∈ B_β^c. The compactness of Θ and the continuity of u(θ) imply the boundedness of u(θ), so that

\[
\sup_{\theta\in B_\beta}\int_{B_\beta^c} u(\theta_1)\, p(\theta, d\theta_1)
\le \sup_{\Theta} u(\theta_1) < \infty.
\]

Both conditions of Theorem 7.1 are satisfied, thus B_β is positive recurrent.

Next we prove the positive Harris recurrence of the chain {θ_t}_{t≥0}, which further implies the convergence of the distribution of the Markov chain in total variation.

Definition 7.1. An accessible set B is called a small set of a Markov chain {Z_t}_{t≥0} if P_z(Z₁ ∈ ·) ≥ ε I_B(z) q(·) for some ε > 0 and probability measure q(·) over the state space.

Definition 7.2. A Markov chain {Z_t}_{t≥0} is called Harris recurrent if there exists a set B such that (a) B is recurrent, and (b) B is a small set. If B is in addition positive recurrent, then the chain is called positive Harris recurrent.

Lemma 7.1. Assume (A1), (A2), (A3), (A4), (A5), (A6). For sufficiently large n, for any m and learning rate η satisfying

\[
a \triangleq \lambda_{\min}^2 - \sqrt{d}\,CL\alpha^m\lambda_{\max}
 - \frac{\eta\lambda_{\max}}{2}\bigl(\lambda_{\max} + \sqrt{d}\,CL\alpha^m\bigr)^2 > 0,
\]


and a data sample x satisfying (4.1), (4.2) and (4.3), the chain {θ_t} generated by the CD updates is positive Harris recurrent.

Proof. Since Corollary 7.1 ensures the positive recurrence of B_β, it suffices to show that B_β is a small set by checking Definition 7.1. Since θ∗ is an interior point of Θ and µ : Θ → φ(X) is a continuous mapping, µ(θ∗) is an interior point of φ(X). Denoting by ∂φ(X) the boundary of φ(X),

\[
\inf_{\phi\in\partial\phi(\mathcal{X})} \|\mu(\theta^*) - \phi\| > 0.
\]

If (4.1), i.e. ‖(1/n) Σ_{i=1}^n φ(x_i) − µ(θ∗)‖ < n^{−1/2+γ₁}, holds, then

\[
\inf_{\phi\in\partial\phi(\mathcal{X})}\Bigl\|\frac{1}{n}\sum_{i=1}^{n}\phi(x_i) - \phi\Bigr\|
\ge \inf_{\phi\in\partial\phi(\mathcal{X})}\|\mu(\theta^*) - \phi\|
 - \Bigl\|\frac{1}{n}\sum_{i=1}^{n}\phi(x_i) - \mu(\theta^*)\Bigr\|
\ge 2\beta r_n/\eta
\]

for sufficiently large n. Then for any θ ∈ B_β,

\[
B_\beta \subseteq \theta + \eta\Bigl[\frac{1}{n}\sum_{i=1}^{n}\phi(x_i) - \phi(\mathcal{X})\Bigr].
\]

Assumption (A6) implies that (1/n) Σ_{i=1}^n φ(X_i^{(m)}) has a positive density over φ(X), strictly bounded away from 0 for any θ, and so does

\[
\theta_1 = \theta + \eta\Bigl[\frac{1}{n}\sum_{i=1}^{n}\phi(x_i)
 - \frac{1}{n}\sum_{i=1}^{n}\phi\bigl(X_i^{(m)}\bigr)\Bigr]
\]

over θ + η[(1/n) Σ_{i=1}^n φ(x_i) − φ(X)]. Denote by p(θ, θ₁) the transition kernel of the chain {θ_t}; then

\[
\inf_{\theta,\theta_1\in B_\beta} p(\theta, \theta_1)
\ge \inf_{\theta\in B_\beta}\inf\Bigl\{p(\theta, \theta_1) : \theta_1 \in \theta + \eta\Bigl[\frac{1}{n}\sum_{i=1}^{n}\phi(x_i) - \phi(\mathcal{X})\Bigr]\Bigr\} > 0.
\]

Let q(·) be the uniform measure on B_β. Then there exists some constant ε > 0 such that

\[
\int_A p(\theta, \theta_1)\, d\theta_1 \ge \varepsilon\, I_{B_\beta}(\theta)\, q(A)
\]

for any Borel set A ⊆ Θ, completing the proof.

As stated in Theorems 6.8.5, 6.8.7 and 6.8.8 in [17], any aperiodic, positive Harris recurrent chain {Z_t}_{t≥0} possesses a unique invariant distribution π, and the chain {Z_t}_{t≥0} converges to the invariant distribution π in total variation for π-a.e. starting point z. We strengthen these results for the chain {θ_t}_{t≥0} in Corollary 7.2.

Corollary 7.2. Following Lemma 7.1,

for any Borel set A ⊆ Θ, completing the proof. As stated in Theorems 6.8.5, 6.8.7, 6.8.8 in [17], any aperiodic, positive Harris recurrent chain {Zt }t≥0 processes an unique invariant distribution π, and the chain {Zt }t≥0 converges to the invariant distribution π in total variation for πa.e. starting point z. We strengthen these results for chain {θt }t≥0 in Corollary 7.2. Corollary 7.2. Following Lemma 7.1,


(a) The positive Harris recurrent chain {θ_t}_{t≥0} has a unique invariant probability measure π_n.
(b) Let Θ₁ be the set of θ ∈ Θ such that

\[
\lim_{t\to\infty}\bigl\|P_\theta^x(\theta_t \in \cdot) - \pi_n(\cdot)\bigr\|_{\text{total variation}} = 0.
\tag{7.1}
\]

(c) π_n and the Lebesgue measure are absolutely continuous with respect to each other.
(d) π_n has a positive density function over Θ.
(e) (7.1) holds for almost all θ ∈ Θ (in the sense of the Lebesgue measure).
(f) For any f such that ∫_Θ |f(θ)| π_n(θ) dθ < ∞,

\[
\lim_{t\to\infty}\frac{1}{t}\sum_{s=1}^{t} f(\theta_s) = \int_{\Theta} f(\theta)\, \pi_n(\theta)\, d\theta.
\tag{7.2}
\]

Proof. Clearly the chain {θ_t}_{t≥0} is aperiodic. Parts (a) and (b) are then immediate consequences of Lemma 7.1 and Theorems 6.8.5, 6.8.7, 6.8.8 in [17]. We proceed to prove part (c). On the one hand, part (b) and the absolute continuity of P_θ^x(θ_t ∈ ·) with respect to the Lebesgue measure imply the absolute continuity of π_n with respect to the Lebesgue measure. On the other hand, the invariant measure π_n is a maximal irreducibility measure (see Definition 7.3) by Theorems 10.0.1 and 10.1.2 in [16]. Hence the Lebesgue measure is absolutely continuous with respect to π_n, completing the proof of part (c). Further, parts (b) and (c) imply (d) and (e). Part (f) is another consequence of part (a) (see details in Section 17.1 in [16]).

Definition 7.3. Let π be a positive measure on the state space of a chain {Z_t}. If for any z and any set A in the state space, π(A) > 0 implies the accessibility of the set A by the chain from any starting point z, we say π is an irreducibility measure for the chain (or the chain is π-irreducible). An irreducibility measure π∗ specifying the minimal family of null sets, i.e. π∗(A) = 0 =⇒ π(A) = 0 for any irreducibility measure π, is called a maximal irreducibility measure.

8. Concentration of the Invariant Distribution

Lemma 8.1 shows that the invariant distribution π_n concentrates on the positive recurrent ball B_β.

Lemma 8.1. Following Corollary 7.2, the invariant probability measure π_n concentrates on B_β, in the sense that π_n(B_β^c) ≍ n^{−2γ₂} for any γ₂ ∈ (0, 1/2 − γ₁) and β ≍ n^{γ₂} increasing with n, while the ball B_β shrinks with radius βr_n ≍ n^{−1/2+γ₁+γ₂}.

Proof. By (6.4) in Lemma 6.1 and the invariance of π_n,

\[
0 = \int_{\Theta} E_\theta^x u(\theta_1)\, \pi_n(d\theta) - \int_{\Theta} u(\theta)\, \pi_n(d\theta)
\le -\eta\int_{\Theta}\bigl(a\|\theta - \hat\theta_n\|^2 - b_n\|\theta - \hat\theta_n\| - c_n\bigr)\, \pi_n(d\theta).
\]


Rearranging terms yields

\[
\int_{B_\beta^c}\bigl(a\|\theta - \hat\theta_n\|^2 - b_n\|\theta - \hat\theta_n\| - c_n\bigr)\, \pi_n(d\theta)
\le \int_{B_\beta} -\bigl(a\|\theta - \hat\theta_n\|^2 - b_n\|\theta - \hat\theta_n\| - c_n\bigr)\, \pi_n(d\theta).
\]

At θ ∈ B_β^c,

\[
a\|\theta - \hat\theta_n\|^2 - b_n\|\theta - \hat\theta_n\| - c_n
\ge a\beta^2 r_n^2 - b_n\beta r_n - c_n
\ge a(\beta^2 - 1)r_n^2 - b_n(\beta - 1)r_n
\ge (\beta^2 - 1)\bigl(ar_n^2 - b_n r_n\bigr)
= (\beta^2 - 1)c_n.
\]

At θ ∈ B_β,

\[
-\bigl(a\|\theta - \hat\theta_n\|^2 - b_n\|\theta - \hat\theta_n\| - c_n\bigr)
\le c_n + \frac{b_n^2}{4a}.
\]

Thus

\[
\frac{\pi_n(B_\beta^c)}{\pi_n(B_\beta)} \le \frac{1}{\beta^2 - 1}\cdot\frac{c_n + b_n^2/4a}{c_n}.
\]

Noting that b_n, c_n^{1/2} ≍ n^{−1/2+γ₁}, letting β ≍ n^{γ₂} increase with n yields

\[
\frac{\pi_n(B_\beta^c)}{\pi_n(B_\beta)} \le \frac{1}{\beta^2 - 1}\cdot\frac{c_n + b_n^2/4a}{c_n} \asymp n^{-2\gamma_2},
\]

while the ball B_β has shrinking radius βr_n ≍ n^{γ₂} × n^{−1/2+γ₁} = n^{−1/2+γ₁+γ₂}.

9. Convergence of the CD Estimator in Probability

Now we can complete the proof of the main result in Theorem 3.1: the estimator (1/t) Σ_{s=1}^t θ_s converges to the true parameter θ∗ in probability as the sample size n → ∞, in the sense that for any ε > 0,

\[
\lim_{n\to\infty} P\Bigl(\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \theta^*\Bigr\| > \varepsilon\Bigr) = 0.
\]

Proof. With Lemma 4.1, it suffices to show that

\[
\lim_{n\to\infty} P_x\Bigl(\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \theta^*\Bigr\| > \varepsilon\Bigr) = 0
\]


for any realization of the data sample x satisfying (4.1), (4.2), (4.3). Write

\[
\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \theta^*\Bigr\|
\le \Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \int_{\Theta}\theta\,\pi_n(d\theta)\Bigr\|
 + \Bigl\|\int_{\Theta}\theta\,\pi_n(d\theta) - \hat\theta_n\Bigr\|
 + \|\hat\theta_n - \theta^*\|.
\]

The first term, by part (f) of Corollary 7.2, converges to 0, i.e.

\[
\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \int_{\Theta}\theta\,\pi_n(d\theta)\Bigr\| = 0 \quad P_x\text{-a.s.}
\]

The second term, by Jensen's inequality and Lemma 8.1, vanishes as n increases:

\begin{align*}
\Bigl\|\int_{\Theta}\theta\,\pi_n(d\theta) - \hat\theta_n\Bigr\|
&\le \int_{\Theta}\|\theta - \hat\theta_n\|\,\pi_n(d\theta) \\
&= \int_{B_\beta}\|\theta - \hat\theta_n\|\,\pi_n(d\theta)
 + \int_{B_\beta^c}\|\theta - \hat\theta_n\|\,\pi_n(d\theta) \\
&\le \pi_n(B_\beta)\times\beta r_n + \pi_n(B_\beta^c)\times\max_{\theta\in\Theta}\|\theta - \hat\theta_n\| \\
&\lesssim n^{-\min\{1/2-\gamma_1-\gamma_2,\, 2\gamma_2\}}.
\end{align*}

The third term satisfies ‖θ̂_n − θ∗‖ < n^{−1/2+γ₁} as in (4.2). Therefore, as n → ∞,

\begin{align*}
&P_x\Bigl(\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \theta^*\Bigr\| > \varepsilon\Bigr) \\
&\le P_x\Bigl(\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \int_{\Theta}\theta\,\pi_n(d\theta)\Bigr\|
 + \Bigl\|\int_{\Theta}\theta\,\pi_n(d\theta) - \hat\theta_n\Bigr\| + \|\hat\theta_n - \theta^*\| > \varepsilon\Bigr) \\
&\le P_x\Bigl(\limsup_{t\to\infty}\Bigl\|\frac{1}{t}\sum_{s=1}^{t}\theta_s - \int_{\Theta}\theta\,\pi_n(d\theta)\Bigr\| > \varepsilon/3\Bigr)
 + P_x\Bigl(\Bigl\|\int_{\Theta}\theta\,\pi_n(d\theta) - \hat\theta_n\Bigr\| > \varepsilon/3\Bigr)
 + P_x\bigl(\|\hat\theta_n - \theta^*\| > \varepsilon/3\bigr) \\
&\to 0.
\end{align*}

10. Results for Discrete Exponential Family

If the sufficient statistic φ(X) of the exponential family is discrete, we have a similar conclusion, stated in Theorem 10.1. In contrast to the positive Harris recurrence of {θ_t}_{t≥0} in Theorem 3.1 for the continuous case, {θ_t}_{t≥0} in the discrete case is a Markov chain with countable state space, and may admit multiple invariant distributions.

Theorem 10.1. Consider an exponential family of discrete probability distributions. Assume (A1), (A2), (A3), (A4), (A5) and


(A7) φ(X) is finite, and for each j = 1, ..., d, the elements of φ_j(X) have rational ratios.

Then the conclusion of Theorem 3.1 is also true.

Proof. If the sufficient statistic φ(X) of the exponential family is discrete and has finitely many possible values, the CD gradient ĝ(θ) = (1/n) Σ_{i=1}^n φ(x_i) − (1/n) Σ_{i=1}^n φ(X_i^{(m)}) has finitely many possible values g_1, g_2, ..., g_s. Starting from any initial parameter guess θ_0 ∈ Θ, the chain

\[
\theta_t = \theta_0 + \eta\sum_{j=1}^{t}\hat g(\theta_j)
 = \theta_0 + \eta\sum_{k=1}^{s}\Bigl[\sum_{j=1}^{t} I\bigl(\hat g(\theta_j) = g_k\bigr)\Bigr] g_k
\]

lies in a countable state space Θ̃ ⊂ Θ, and Θ̃ ∩ B_β is a finite set due to (A7). By the decomposition theorem, Θ̃ can be partitioned uniquely as

\[
\tilde\Theta = \tilde\Theta_0 \cup \tilde\Theta_1 \cup \tilde\Theta_2 \cup \cdots
\]

where Θ̃_0 is the set of transient states and the Θ̃_i, i ≥ 1, are disjoint, irreducible closed sets of recurrent states. The chain either remains forever in Θ̃_0, or eventually lies in some Θ̃_i, i ≥ 1. We argue by contradiction that the first of these possibilities does not occur. Suppose, for the sake of contradiction, that the chain {θ_t} lies forever in Θ̃_0. By Corollary 7.1, B_β is positive recurrent. Thus the chain visits Θ̃_0 ∩ B_β infinitely many times, and hence visits at least one state in Θ̃_0 ∩ B_β infinitely many times, since Θ̃_0 ∩ B_β ⊆ Θ̃ ∩ B_β is finite. Such a state is recurrent, contradicting the fact that it belongs to the set of transient states Θ̃_0.

Therefore the chain eventually lies in the first irreducible set of recurrent states it enters, and converges to the corresponding invariant distribution π_n. Every invariant distribution π_n concentrates in B_β with

\[
\frac{\pi_n(B_\beta^c)}{\pi_n(B_\beta)} \le \frac{1}{\beta^2 - 1}\cdot\frac{c_n + b_n^2/4a}{c_n} \asymp n^{-2\gamma_2}.
\]

The same convergence rate as in Theorem 3.1 can then be obtained.

Theorem 10.1 establishes the convergence property of the CD algorithm for relatively simple exponential families of discrete distributions, namely those satisfying assumption (A7). In particular, it guarantees the success of the CD algorithm for Restricted Boltzmann Machines (RBM), which have φ(X) = {0, 1}^d. It is also noteworthy that the CD algorithm converges to the MLE even in more complicated cases, in which φ(X) is infinite and/or elements of φ(X) have irrational ratios, if one takes the finite precision of computation into account. Due to the finite precision of any computer, numbers are always rounded or truncated. In practice, the update rule of the CD algorithm is

\[
\tilde\theta^{\mathrm{new}} = \theta^{\mathrm{new}} + \varepsilon
 = \theta + \eta\Bigl[\frac{1}{n}\sum_{i=1}^{n}\phi(X_i)
 - \frac{1}{n}\sum_{i=1}^{n}\phi\bigl(X_i^{(m)}\bigr)\Bigr] + \varepsilon
\]


where ε is the numerical error incurred when θ^new is replaced by its nearby grid point θ̃^new. Since

\[
u(\tilde\theta_1) - u(\theta) = \bigl[u(\tilde\theta_1) - u(\theta_1)\bigr] + \bigl[u(\theta_1) - u(\theta)\bigr]
\le O(\|\varepsilon\|) + \bigl[u(\theta_1) - u(\theta)\bigr],
\]

we have the following drift condition akin to (6.4):

\[
E_\theta^x\bigl[u(\tilde\theta_1) - u(\theta)\bigr]
\le -\eta\Bigl(a\|\theta - \hat\theta_n\|^2 - b_n\|\theta - \hat\theta_n\| - \bigl(c_n + O(\|\varepsilon\|)\bigr)\Bigr).
\]

If the precision of computation copes well with the sample size n, i.e. ‖ε‖ = O(c_n), the chain {θ̃_t}_{t≥0} is positive recurrent to the ball B_β centered at the MLE θ̂_n. Noting that the ball B_β contains finitely many grid points, an argument similar to that of Theorem 10.1 shows that the chain {θ̃_t}_{t≥0} admits invariant distributions concentrating around the MLE.

11. Numerical Experiments

11.1. Bivariate Normal Distribution

We conduct numerical experiments on the bivariate normal model

\[
p_\mu(x) = \frac{1}{\sqrt{(2\pi)^2 |\Sigma|}}
 \exp\Bigl(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Bigr)
\tag{11.1}
\]

with unknown mean µ (the parameter θ) and known covariance matrix

\[
\Sigma = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}.
\]

Figures 1-3 show the CD paths {µ_t}_{t≥0} given a sample X of size n = 50, 100, 500 with true parameter µ∗ = (0, 0), respectively. For each data sample, CD-3 (i.e. CD with m = 3) is applied to learn the parameter µ. In each of Figures 1-3, subplot (a) shows the CD paths {µ_t}_{t≥0} in the parameter space with the different starting points µ = (3, 3), (−3, 3), (3, −3), (−3, −3). They illustrate that the estimated parameter initially moves directly towards the true parameter but eventually random-walks around it. Furthermore, a comparison of Figures 1(a), 2(a) and 3(a) shows that the region of the random walk decreases in size as the sample size n increases. Subplot (b) shows the true gradient of the likelihood function in each case. Subplot (c) presents the approximate gradient used by CD: for each grid point, we run CD-3 five times and draw the 5 estimated gradients to reveal the randomness of the CD approach. Subplot (d) shows the directions of these gradients by normalizing their magnitudes. From plots (b) and (c), it can be observed that the gradient becomes smaller in magnitude and more stochastic in direction as the point moves closer to the true parameter. Comparing the three figures, we see that the range of the randomness decreases as the sample size increases. This is exactly what our theory predicts.
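This experiment can be reproduced in outline with the following sketch (our own illustrative code, not the authors'; the Metropolis kernel, its step size, and the learning rate are arbitrary choices standing in for any kernel k_µ with N(µ, Σ) as equilibrium distribution):

```python
import math
import random

SIGMA_INV = [[4.0/3.0, -2.0/3.0], [-2.0/3.0, 4.0/3.0]]  # inverse of [[1,.5],[.5,1]]

def log_p(x, mu):
    # log density of N(mu, Sigma), up to an additive constant
    d = (x[0] - mu[0], x[1] - mu[1])
    q = sum(d[i] * SIGMA_INV[i][j] * d[j] for i in range(2) for j in range(2))
    return -0.5 * q

def metropolis_step(x, mu, rng, step=0.8):
    # one random-walk Metropolis transition with N(mu, Sigma) as target
    y = (x[0] + rng.gauss(0, step), x[1] + rng.gauss(0, step))
    if math.log(max(rng.random(), 1e-300)) < log_p(y, mu) - log_p(x, mu):
        return y
    return x

def cd_update(mu, data, rng, m=3, eta=0.2):
    # CD-m: approximate E_mu[X] by m Metropolis steps from each data point;
    # for simplicity we update mu directly with the moment difference
    n = len(data)
    sx = sy = 0.0
    for x in data:
        y = x
        for _ in range(m):
            y = metropolis_step(y, mu, rng)
        sx += y[0]
        sy += y[1]
    gx = sum(x[0] for x in data) / n - sx / n
    gy = sum(x[1] for x in data) / n - sy / n
    return (mu[0] + eta * gx, mu[1] + eta * gy)

rng = random.Random(0)
def draw():
    # sample N((0,0), Sigma) via the Cholesky factor of Sigma
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (z0, 0.5 * z0 + math.sqrt(0.75) * z1)
data = [draw() for _ in range(500)]   # n = 500, true parameter mu* = (0, 0)

mu = (3.0, 3.0)                       # one of the four starting points
for _ in range(200):
    mu = cd_update(mu, data, rng)
print(mu)  # ends up random-walking near the true parameter (0, 0)
```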

T. Wu & B. Jiang et al./Convergence of CD in Exponential Family

Fig 1. Simulation results of the bivariate normal distribution with N = 50.

Fig 2. Simulation results of the bivariate normal distribution with N = 100.


Fig 3. Simulation results of the bivariate normal distribution with N = 500.

11.2. Restricted Boltzmann Machines

In our next experiment, we simulate data from a Restricted Boltzmann Machine (RBM). The CD method is the standard way to fit an RBM during the training of a deep belief network [2]. An RBM has $m$ visible units $V = (V_1, \ldots, V_m)$ representing observable data and $n$ hidden units $H = (H_1, \ldots, H_n)$ capturing dependencies between the observed variables. In the simulation we focus on the binary RBM, in which $(V, H)$ takes values $(v, h) \in \{0,1\}^{m+n}$ and the joint distribution of $(V, H)$ is $p(v, h) = \frac{1}{Z} e^{-E(v,h)}$ with the energy function
\[
E(v, h) = -\sum_{i=1}^{n}\sum_{j=1}^{m} w_{ij} h_i v_j - \sum_{j=1}^{m} b_j v_j - \sum_{i=1}^{n} c_i h_i. \tag{11.2}
\]
In the simulation, the data sets are generated from an RBM with weight matrix
\[
w = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}
\]
and $c_i = b_j = 0$ for $i, j = 1, 2$. Figures 4-6 show the approximate gradients for $N = 10^2$, $10^4$ and $10^6$. In each figure, the subplots in the lower-triangular part show the approximate gradients. Let $x = (x_1, x_2, x_3, x_4)^T = (w_{11}, w_{21}, w_{12}, w_{22})^T$. Subplot $(i, j)$ in the lower-triangular part gives the projections of the gradient onto the plane $(x_i, x_j)$, at points $\tilde{x}$ whose coordinates $x_k$, for $k \ne i, j$, are approximately equal to 0.5. The corresponding directions of these gradients are given in the upper-triangular part of each figure.


Again, the behaviour of the CD approximate gradients is in agreement with our theory.

Fig 4. Simulation results of RBM with $N = 10^2$.
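For the RBM model (11.2), the CD-$k$ approximate gradient is computed by running $k$ alternating block-Gibbs steps started at the data. A minimal sketch follows (our own illustration, not the authors' code; the function `cd_k_grad` and all variable names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_grad(V, W, b, c, k, rng):
    """CD-k approximation of the log-likelihood gradient w.r.t. W for a
    binary RBM with energy E(v,h) = -h'Wv - b'v - c'h.
    V: (N, m) batch of visible data; W: (n, m); b: (m,); c: (n,)."""
    N = len(V)
    # positive phase: E[h v'] under P(h | v) at the data
    ph_pos = sigmoid(V @ W.T + c)             # (N, n), P(h_i = 1 | v)
    pos = ph_pos.T @ V / N
    # negative phase: k steps of alternating block Gibbs from the data
    v = V.copy()
    for _ in range(k):
        h = (rng.random((N, W.shape[0])) < sigmoid(v @ W.T + c)).astype(float)
        v = (rng.random(V.shape) < sigmoid(h @ W + b)).astype(float)
    ph_neg = sigmoid(v @ W.T + c)
    neg = ph_neg.T @ v / N
    return pos - neg                           # ascend this to increase likelihood

rng = np.random.default_rng(0)
W = 0.5 * np.ones((2, 2))                      # the paper's 2x2 weight matrix
b, c = np.zeros(2), np.zeros(2)
V = (rng.random((1000, 2)) < 0.5).astype(float)  # placeholder binary data
g = cd_k_grad(V, W, b, c, k=3, rng=rng)
```

Evaluating `cd_k_grad` repeatedly at a grid of weight matrices, as done in Figures 4-6, exposes the Monte Carlo randomness of the approximate gradient field.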

Appendix A: Proof of Lemma 2.1 and Theorem 6.1

A.1. Proof of Lemma 2.1

In Lemma 4.1, the result that (4.1) and (4.2) hold with probability approaching 1 follows from standard theorems in large-sample theory [12]. Therefore it suffices to show that inequality (4.3) also holds with probability approaching 1. To this end, we apply a limit theorem for empirical processes [13], which relates the tail of an empirical process to the covering number of a function class.

Definition A.1 (covering number). Let $(\mathcal{F}, D)$ be an arbitrary semi-metric space. The covering number $N(\epsilon, \mathcal{F}, D)$ is the minimal number of balls of radius $\epsilon > 0$ needed to cover $\mathcal{F}$. Formally,
\[
N(\epsilon, \mathcal{F}, D) = \min\left\{k : \exists f_1, \ldots, f_k \in \mathcal{F} \text{ s.t. } \sup_{f \in \mathcal{F}} \min_{1 \le j \le k} D(f, f_j) < \epsilon\right\}.
\]

Theorem A.1 (Theorem 2.14.9 in [13]). Let $X_1, \ldots, X_n$ be i.i.d. and $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to [0, 1]$. If
\[
\sup_Q N(\epsilon, \mathcal{F}, L_{2,Q}) \le \left(\frac{C_1}{\epsilon}\right)^s, \quad \forall\, 0 < \epsilon < C_1,
\]
where $s, C_1$ are constants, the supremum is over probability measures $Q$ on $\mathcal{X}$, and
\[
L_{2,Q}(f_1, f_2) \triangleq \sqrt{\int_{\mathcal{X}} (f_1(x) - f_2(x))^2\, Q(dx)},
\]
then for every $t > 0$,
\[
P\left(\sup_{f \in \mathcal{F}} \left|\frac{1}{\sqrt{n}} \sum_{i=1}^{n} [f(X_i) - \mathbb{E}f(X_1)]\right| > t\right) \le \left(\frac{C_2 t}{\sqrt{s}}\right)^s e^{-2t^2},
\]
where the constant $C_2$ depends only on $C_1$.

We proceed to bound the tail of $\sup_\Theta \|v_n(f_\theta)\|$ by using Theorem A.1.

Lemma A.1. Assume (A1), (A2) and (A4), and let $X_1, \ldots, X_n \overset{i.i.d.}{\sim} p_{\theta^*}$. Then
\[
P\left(\sup_{\theta \in \Theta} \left\|\frac{1}{n}\sum_{i=1}^{n} \int_{\mathcal{X}} \phi(y) k_\theta^m(X_i, y)\, dy - \int_{\mathcal{X}} \phi(y) k_\theta^m p_{\theta^*}(y)\, dy\right\| < n^{-\frac{1}{2}+\gamma_1}\right) \to 1
\]
as $n \to \infty$, for any $\gamma_1 \in (0, 1/2)$.

Proof. Let $f_\theta: x \mapsto \int_{\mathcal{X}} \phi(y) k_\theta^m(x, y)\, dy$. Then
\[
\int_{\mathcal{X}} \phi(y) k_\theta^m(X_i, y)\, dy = f_\theta(X_i)
\]
and
\[
\int_{\mathcal{X}} \phi(y) k_\theta^m p_{\theta^*}(y)\, dy = \int_{\mathcal{X}} \phi(y) \left(\int_{\mathcal{X}} k_\theta^m(x, y)\, p_{\theta^*}(x)\, dx\right) dy = \int_{\mathcal{X}} \left(\int_{\mathcal{X}} \phi(y) k_\theta^m(x, y)\, dy\right) p_{\theta^*}(x)\, dx = \int_{\mathcal{X}} f_\theta(x)\, p_{\theta^*}(x)\, dx.
\]


For $j = 1, \ldots, d$, let
\[
v_n(f_\theta^j) \triangleq n^{-1/2} \sum_{i=1}^{n} \left[f_\theta^j(X_i) - \mathbb{E}f_\theta^j(X_1)\right] = n^{-1/2} \sum_{i=1}^{n} \left(\int_{\mathcal{X}} \phi_j(y) k_\theta^m(X_i, y)\, dy - \int_{\mathcal{X}} \phi_j(y) k_\theta^m p_{\theta^*}(y)\, dy\right),
\]
and view $v_n(f_\theta^j)$ as a stochastic process indexed by $\theta \in \Theta$. For $\theta, \theta' \in \Theta$,
\begin{align*}
f_\theta^j(x) - f_{\theta'}^j(x) &= \int_{\mathcal{X}} \phi_j(y) k_\theta^m(x, y)\, dy - \int_{\mathcal{X}} \phi_j(y) k_{\theta'}^m(x, y)\, dy \\
&= \sum_{i=0}^{m-1} \left[\int_{\mathcal{X}} \phi_j(y)\, k_\theta^{m-i} k_{\theta'}^{i}(x, y)\, dy - \int_{\mathcal{X}} \phi_j(y)\, k_\theta^{m-i-1} k_{\theta'}^{i+1}(x, y)\, dy\right] \\
&= \sum_{i=0}^{m-1} \int_{\mathcal{X}} \left[K_\theta K_\theta^{m-i-1} \phi_j - K_{\theta'} K_\theta^{m-i-1} \phi_j\right](y)\, k_{\theta'}^{i}(x, y)\, dy,
\end{align*}
implying
\[
|f_\theta^j(x) - f_{\theta'}^j(x)| \le \sum_{i=0}^{m-1} \rho(K_\theta, K_{\theta'}) \times \sup_{y \in \mathcal{X}} |K_\theta^{m-i-1}\phi_j(y)| \times \int_{\mathcal{X}} k_{\theta'}^{i}(x, y)\, dy \le mC\zeta\|\theta - \theta'\|,
\]
where $\rho$ is the metric and $\zeta$ the Lipschitz constant introduced by Assumption (A4). It follows that
\[
\sup_Q L_{2,Q}(f_\theta^j, f_{\theta'}^j) \le mC\zeta\|\theta - \theta'\|.
\]
Denoting by $\mathcal{F}^j$ the function class $\{f_\theta^j : \theta \in \Theta\}$, we obtain
\[
\sup_Q N(\epsilon, \mathcal{F}^j, L_{2,Q}) \le N(\epsilon/(mC\zeta), \Theta, \|\cdot\|) = O(\epsilon^{-d}).
\]
Applying Theorem A.1 to the function class $\mathcal{F}^j$ yields
\[
P\left(\sup_\Theta |v_n(f_\theta^j)| > \frac{n^{\gamma_1}}{\sqrt{d}}\right) \to 0
\]
as $n \to \infty$. Further,
\[
P\left(\sup_\Theta \|n^{-1/2} v_n(f_\theta)\| > n^{-1/2+\gamma_1}\right) \le \sum_{j=1}^{d} P\left(\sup_\Theta |v_n(f_\theta^j)| > \frac{n^{\gamma_1}}{\sqrt{d}}\right) \to 0
\]
as $n \to \infty$, completing the proof.
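The covering-number estimate $N(\epsilon/(mC\zeta), \Theta, \|\cdot\|) = O(\epsilon^{-d})$ used in the proof follows from the standard volumetric bound; a short justification (our addition, assuming the compact set $\Theta \subset \mathbb{R}^d$ is contained in a Euclidean ball $B(0, R)$):

```latex
% Take a maximal \epsilon-separated subset {\theta_1, ..., \theta_k} of \Theta.
% By maximality it is an \epsilon-net, so N(\epsilon, \Theta, \|\cdot\|) \le k.
% The balls B(\theta_j, \epsilon/2) are disjoint and contained in
% B(0, R + \epsilon/2); comparing Lebesgue volumes gives
\[
  N(\epsilon, \Theta, \|\cdot\|)
  \;\le\; \frac{\operatorname{vol} B(0, R + \epsilon/2)}{\operatorname{vol} B(0, \epsilon/2)}
  \;=\; \Bigl(1 + \frac{2R}{\epsilon}\Bigr)^{d}
  \;=\; O(\epsilon^{-d}).
\]
```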


A.2. Proof of Theorem 6.1

Proof. We first show that, if $Z_0 \in B^c$, then $M_t = V(Z_{t \wedge T}) + (t \wedge T)\delta$ is a supermartingale adapted to the canonical filtration $\mathcal{G}_t$ of $\{Z_t\}$. Adaptedness of $M_t$ to $\mathcal{G}_t$ follows from $T$ being a $\mathcal{G}_t$-stopping time. It suffices to show $\mathbb{E}_z[M_{t+1} - M_t \mid \mathcal{G}_t] \le 0$; integrability of the non-negative $M_t$ then follows by induction: $\mathbb{E}_z M_t \le \mathbb{E}_z M_{t-1} \le \cdots \le \mathbb{E}_z M_0 = V(z) < \infty$. Indeed,
\begin{align*}
(M_{t+1} - M_t)\, I(T \le t) &= \left[(V(Z_T) + T\delta) - (V(Z_T) + T\delta)\right] I(T \le t) = 0, \\
(M_{t+1} - M_t)\, I(T \ge t+1) &= \left[(V(Z_{t+1}) + (t+1)\delta) - (V(Z_t) + t\delta)\right] I(T \ge t+1) \\
&= \left[V(Z_{t+1}) - V(Z_t) + \delta\right] I(T \ge t+1),
\end{align*}
implying, for $z \in B^c$,
\begin{align*}
\mathbb{E}_z[M_{t+1} - M_t \mid \mathcal{G}_t] &= \mathbb{E}_z[(M_{t+1} - M_t)\, I(T \le t) \mid \mathcal{G}_t] + \mathbb{E}_z[(M_{t+1} - M_t)\, I(T \ge t+1) \mid \mathcal{G}_t] \\
&= \mathbb{E}_z[(V(Z_{t+1}) - V(Z_t) + \delta)\, I(T \ge t+1) \mid \mathcal{G}_t] \\
&\overset{(i)}{=} \mathbb{E}_z[V(Z_{t+1}) - V(Z_t) + \delta \mid \mathcal{G}_t]\, I(T \ge t+1) \\
&\overset{(ii)}{=} \mathbb{E}_z[V(Z_{t+1}) - V(Z_t) + \delta \mid Z_t]\, I(T \ge t+1) \\
&\overset{(iii)}{\le} (-\delta + \delta)\, I(T \ge t+1) = 0,
\end{align*}
where (i) follows since $T$ is a $\mathcal{G}_t$-stopping time and thus $\{T \ge t+1\} \in \mathcal{G}_t$, (ii) is due to the Markov property of $\{Z_t\}$, and (iii) follows since $Z_t \in B^c$ (given $T \ge t+1$ and $z \in B^c$) together with the assumed drift condition. Consequently, $\mathbb{E}_z M_t \le \mathbb{E}_z M_0 = V(z)$ for $z \in B^c$; that is,
\[
\mathbb{E}_z V(Z_{t \wedge T}) + \delta\, \mathbb{E}_z (t \wedge T) \le V(z),
\]
implying, by the non-negativity of $V$,
\[
\delta\, \mathbb{E}_z (t \wedge T) \le V(z).
\]
Letting $t \to \infty$, the monotone convergence theorem yields $\mathbb{E}_z T \le V(z)/\delta$ for $z \in B^c$. Furthermore, a one-step analysis gives, for $z \in B$,
\[
\mathbb{E}_z T = P_z(Z_1 \in B) + \int_{B^c} (1 + \mathbb{E}_{z_1} T)\, p(z, z_1)\, dz_1 = 1 + \int_{B^c} (\mathbb{E}_{z_1} T)\, p(z, z_1)\, dz_1 \le 1 + \frac{1}{\delta} \int_{B^c} V(z_1)\, p(z, z_1)\, dz_1,
\]
completing the proof.
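The hitting-time bound $\mathbb{E}_z T \le V(z)/\delta$ is easy to check numerically. A toy sketch (our own, with illustrative constants): a real-valued chain that steps 0.5 toward the origin plus small noise satisfies the drift condition outside $B = [-1, 1]$ with $V(z) = |z|$ and $\delta \approx 0.49$, so from $z_0 = 5$ the expected hitting time of $B$ should not exceed $5/0.49 \approx 10.2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def hitting_time(z0):
    """Steps until the chain Z_{t+1} = Z_t - 0.5*sign(Z_t) + 0.1*N(0,1)
    first enters B = [-1, 1], starting from z0."""
    z, t = z0, 0
    while abs(z) > 1.0:
        z = z - 0.5 * np.sign(z) + 0.1 * rng.standard_normal()
        t += 1
    return t

z0, delta = 5.0, 0.49
avg_T = np.mean([hitting_time(z0) for _ in range(2000)])
bound = abs(z0) / delta          # Theorem 6.1's bound V(z0)/delta
# empirically avg_T comes out close to 8, below the bound of about 10.2
```

The empirical mean hitting time sits below, but within a constant factor of, the supermartingale bound, as the proof suggests.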


Acknowledgements

This work is supported by the NSF of the US under grant DMS-1407557. The authors would like to thank Prof. Lester Mackey, Dr. Rachel Wang and Weijie Su for valuable advice.

References

[1] Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation 14(8) 1771-1800.
[2] Hinton, G. E., Osindero, S. and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation 18(7) 1527-1554.
[3] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science 313(5786) 504-507.
[4] Bengio, Y., Lamblin, P., Popovici, D. and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19 153.
[5] MacKay, D. (2001). Failures of the one-step learning algorithm. Available electronically at http://www.inference.phy.cam.ac.uk/mackay/abstracts/gbm.html.
[6] Teh, Y., Welling, M., Osindero, S. and Hinton, G. (2003). Energy-based models for sparse overcomplete representations. The Journal of Machine Learning Research 4 1235-1260.
[7] Williams, C. K. I. and Agakov, F. V. (2002). An analysis of contrastive divergence learning in Gaussian Boltzmann machines. Institute for Adaptive and Neural Computation.
[8] Yuille, A. (2005). The convergence of contrastive divergence. In Advances in Neural Information Processing Systems 17 1593-1600.
[9] Sutskever, I. and Tieleman, T. (2010). On the convergence properties of contrastive divergence. In International Conference on Artificial Intelligence and Statistics 789-795.
[10] Hyvärinen, A. (2006). Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation 18(10) 2283-2292.
[11] Rudolf, D. (2011). Explicit error bounds for Markov chain Monte Carlo. arXiv preprint arXiv:1108.3201.
[12] Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. Springer Science and Business Media.
[13] Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
[14] Foster, F. G. (1953). On the stochastic matrices associated with certain queuing processes. The Annals of Mathematical Statistics 355-360.
[15] Tweedie, R. L. (1976). Criteria for classifying general Markov chains. Advances in Applied Probability 737-771.
[16] Meyn, S. P. and Tweedie, R. L. (2012). Markov Chains and Stochastic Stability. Springer Science and Business Media.

[17] Durrett, R. (2010). Probability: theory and examples. Cambridge university press.

Fig 5. Simulation results of RBM with $N = 10^4$.


Fig 6. Simulation results of RBM with $N = 10^6$.
