Hamiltonian Monte Carlo with Energy Conserving Subsampling

KHUE-DUNG DANG, MATIAS QUIROZ, ROBERT KOHN,

arXiv:1708.00955v1 [stat.CO] 2 Aug 2017

MINH-NGOC TRAN AND MATTIAS VILLANI

Abstract. Hamiltonian Monte Carlo (HMC) has recently received considerable attention in the literature due to its ability to overcome the slow exploration of the parameter space inherent in random walk proposals. In tandem, data subsampling has been extensively used to overcome the computational bottlenecks in posterior sampling algorithms that require evaluating the likelihood over the whole data set, or its gradient. However, while data subsampling has been successful in traditional MCMC algorithms such as Metropolis-Hastings, it has been demonstrated to be unsuccessful in the context of HMC, both in terms of poor sampling efficiency and in producing highly biased inferences. We propose an efficient HMC-within-Gibbs algorithm that utilizes data subsampling to speed up computations and simulates from a slightly perturbed target, which is within O(m^(-2)) of the true target, where m is the size of the subsample. We also show how to modify the method to obtain exact inference on any function of the parameters. Contrary to previous unsuccessful approaches, we perform subsampling in a way that conserves energy, but for a modified Hamiltonian. We can therefore maintain high acceptance rates even for distant proposals. We apply the method to simulate from the posterior distribution of a high-dimensional spline model for bankruptcy data and document speed-ups of several orders of magnitude compared to standard HMC and, moreover, demonstrate a negligible bias.

Keywords: Bayesian inference, Big Data, Markov chain Monte Carlo, Estimated likelihood.

Affiliations: Dang, Quiroz and Kohn: Australian School of Business, University of New South Wales. Tran: Discipline of Business Analytics, University of Sydney. Villani: Division of Statistics and Machine Learning, Department of Computer and Information Science, Linköping University.

1. Introduction

Bayesian inference relies on the computation of the posterior distribution of the parameter vector θ ∈ Θ ⊂ R^d given the data y, π(θ) = p_Θ(θ)p(y|θ)/p(y), where p_Θ(θ) and p(y|θ) denote
the prior and likelihood, respectively, and p(y) is a normalization constant. The functional form of the posterior distribution often does not correspond to a known distribution, and hence obtaining independent samples is difficult, especially when the dimension of θ is moderate to large. Markov Chain Monte Carlo (MCMC) is a generic sampling algorithm which produces (correlated) samples from π(θ) by quantifying the so-called typical set (Cover and Thomas, 2012), which consists of regions with high probability mass (density multiplied by volume), so that high-dimensional integrals can be estimated from the iterates, circumventing the curse of dimensionality (Betancourt, 2017).

Metropolis-Hastings (MH) (Metropolis et al., 1953; Hastings, 1970) is arguably the most popular MCMC algorithm. Its most common implementation uses a random walk proposal, in which a new sample is proposed based on the current state of the Markov chain. While this proposal is easy to implement, it explores the typical set very slowly in high-dimensional problems, resulting in a chain that moves slowly with highly correlated draws, which gives imprecise estimators of integrals over the posterior distribution (Betancourt, 2017).

Hamiltonian Monte Carlo (HMC) (Duane et al., 1987) can produce distant proposals while maintaining a high acceptance probability (Neal, 2011; Betancourt, 2017). HMC augments the target variables by corresponding fictitious momentum variables and carries out the sampling on an extended target distribution which is proportional to the exponential of a Hamiltonian function that describes the total energy of the system: the sum of the potential energy (= − log π(θ)) and the kinetic energy (minus the log-density of the fictitious variables). Hamiltonian dynamics, specified by Hamilton's equations, describes how the Hamiltonian, i.e. the total energy, evolves through time.
One particularly interesting feature of the Hamiltonian is that it conserves energy as "time" evolves, a property that is approximately maintained even when the dynamics is approximated in discrete time. Hence, an MH proposal (on the extended space) that is obtained by simulating the dynamics has approximately the same value of the extended target density as the current draw, resulting in a high acceptance probability even when the proposed draw is far from the current draw. This, and the characteristic that the trajectory is confined within the typical
set, avoids the inherently slow exploration of the parameter space evident in random walk proposals. However, HMC comes with the extra computational burden of computing the gradient of the log-posterior in each step when simulating the dynamics, and typically a large number of steps is performed. This burden is magnified in the presence of large data sets, which are increasingly available to practitioners. Our article speeds up computation by using subsets of the data to estimate both the dynamics and the subsequent MH correction performed when deciding whether to accept a proposal. More precisely, we propose an HMC-within-Gibbs scheme that alternates between sampling (i) small subsets of data and (ii) target variables, where the latter update is performed by HMC with a dynamics estimated using the subset from (i), and the corresponding MH correction uses the estimate of the likelihood based on the same subset. The acceptance decision in (i) and the choice of likelihood estimator give us a marginal target distribution for θ which is within O(m^(-2)) of π(θ), where m is the subsample size. Moreover, we present an alternative unbiased approach which allows us to compute expectations of any function of the parameters without bias. The paper is organized as follows. Section 2 reviews previous research, with a special emphasis on the literature that discusses the incompatibility of data subsampling and Hamiltonian Monte Carlo. Section 3 presents our methodology and argues that it circumvents this apparent incompatibility. Finally, Section 4 demonstrates the usefulness of the method by estimating a high-dimensional non-linear spline model in a firm bankruptcy application and documents a dramatic improvement over the standard HMC algorithm without subsampling.

2. Previous research There has recently been a surge of interest in developing posterior simulation algorithms that scale both with respect to the number of observations n and the number of parameters d. Contrary to posterior optimization algorithms, in which only the posterior mode needs to be located, simulation methods require further iterations as typically a reasonably large number of posterior samples N is necessary for reliable inference. While it may be argued
that posterior optimization is better suited for big data in terms of computational effort, it is undoubtedly preferable to have an estimate of the entire posterior distribution of the parameters when quantifying uncertainty, a central task in Bayesian inference. Although there exist optimization methods that aim to approximate the entire posterior distribution, e.g. Variational Bayes (Blei et al., 2017) or the Laplace approximation (Bernardo and Smith, 2001, Chapter 5), in practice it is nearly impossible to know how they perform without comparing the results to a posterior simulation method. It is thus important to develop simulation methods that:

(i) remain computationally feasible when n is large;
(ii) explore the posterior distribution efficiently when d is large, in order to avoid a prohibitively large number of posterior samples N.

Two distinct approaches exist to resolve (i). The first is to utilize parallel computing by dividing the n data observations into different parts, performing independent posterior simulation (each based on a single part of the data) and subsequently merging the draws to represent the full data posterior (based on all n data points). See, for example, Scott et al. (2013); Neiswanger et al. (2014); Wang and Dunson (2014); Minsker et al. (2014); Nemeth and Sherlock (2016). The second approach, which is the focus of our article, is to work instead with a subsample of m observations to estimate the full data posterior (Maclaurin and Adams, 2014; Korattikara et al., 2014; Bardenet et al., 2014, 2015; Quiroz et al., 2017c,a,b) or the gradient thereof (Welling and Teh, 2011; Chen et al., 2014; Ma et al., 2015; Baker et al., 2017). The simulation algorithm then uses the estimated quantities in place of the true quantities to reduce the computational effort. The rest of this section reviews samplers which utilize gradient information about π(θ). The primary problem confronted in (ii) is not the size of n, but rather how to generate proposals which maintain a high acceptance probability and are also distant enough to avoid a highly persistent Markov chain for θ. A useful approach is to simulate a discretized Langevin diffusion (Roberts and Rosenthal, 1998; Nemeth et al., 2016) or, more generally,
Hamilton's equations (Duane et al., 1987), and use the simulated draw as a proposal in the MH algorithm, so as to correct for the bias introduced by the discretization (Neal, 2011). HMC provides a solution to (ii) (Neal, 2011; Betancourt, 2017), but when combined with (i), the algorithm becomes computationally infeasible since simulating the Hamiltonian dynamics requires a large number of costly gradient evaluations for every proposed θ. One might try to estimate the dynamics, simulating them using a fixed subsample of the data through each discretized trajectory step when estimating the gradient of the potential energy unbiasedly, and skip the MH correction (to avoid the full data evaluation). Betancourt (2015) demonstrates that this simple strategy produces highly biased trajectories, where the bias depends upon the quality of the gradient estimator. Moreover, Betancourt (2015) shows that renewing the subsample in each step of the trajectory (hoping that the bias averages out) still performs poorly; see also the so-called naive stochastic gradient HMC in Chen et al. (2014). Betancourt (2015) illustrates that adding an MH correction step based on the full data (which would remove all bias) causes the high acceptance probability of HMC to deteriorate rapidly as d increases, and concludes that there is a fundamental incompatibility between HMC and data subsampling. As a remedy to the poor performance of the naive stochastic gradient HMC, Chen et al. (2014) propose adding a friction term to the dynamics to correct for the bias, but this sacrifices the coherent exploration of the Hamiltonian flow (Betancourt, 2015). A different approach, but related in the sense that it uses an estimated gradient, is Stochastic Gradient Langevin Dynamics (Welling and Teh, 2011, SGLD).
SGLD combines Stochastic Gradient Optimization (Robbins and Monro, 1951, SGO) and Langevin dynamics (Roberts and Rosenthal, 1998) by allowing the initial iterates to resemble an SGO and gradually traverse to Langevin dynamics so as to not collapse to the mode. SGLD avoids a costly MH correction as the discretization error of the Langevin dynamics is decreased as a function of the iterates. However, the drawback of SGLD is that the consistency rate of estimators based on its output is N^(-1/3) (Teh et al., 2016), as opposed to N^(-1/2) for MCMC (Roberts and Rosenthal, 2004), and in particular HMC. Moreover, it was demonstrated by Bardenet et al. (2015) to give a poor approximation of the full posterior distribution (although getting
the mode right) on a toy example with d = 2 and highly redundant data, i.e. superfluous amounts of data in relation to the complexity of the model. Recently, Dubey et al. (2016) improve SGLD using control variates; see also Baker et al. (2017) who, in addition, use control variates in the algorithm proposed in Chen et al. (2014). The next section presents our implementation of subsampling HMC and discusses why our approach avoids the pitfalls described in Betancourt (2015).

3. Methodology

Our objective is to obtain samples from π(θ) given a dataset with a very large number of observations n and a possibly large number of variables d, with d < n. Although n is large, we stress that the data are not redundant in our application in Section 4.

3.1. Background. Quiroz et al. (2017c) devise a pseudo-marginal MCMC (Andrieu and Roberts, 2009) scheme targeting the augmented posterior

(3.1)    π^(m)(θ, u) ∝ exp( ℓ̂_m(θ) − ½ σ̂²_m(θ) ) p_Θ(θ) p_U(u),

where ℓ̂_m(θ) is an unbiased log-likelihood estimator whose variance σ²(θ) is estimated by σ̂²_m(θ), and u ∈ U ⊂ {1, . . . , n}^m, with density p_U(u), is the set of indices for the subsampled data points; see Section 3.3. Note that all estimated quantities also depend on n and d, which are suppressed to avoid notational clutter. The exponentiated term in (3.1) is an estimate of the likelihood, and it is unbiased by construction if ℓ̂_m(θ) ∼ N(ℓ(θ), σ²(θ)) and the correction term is replaced with the true σ²(θ); otherwise the estimator is biased. However, with our choice of control variate in (3.6), Theorem 1 in Quiroz et al. (2017c) gives the upper bound on the fractional error of the perturbed target π_m(θ) = ∫_u π(θ, u) du:

(3.2)    |error(θ)| = O( d³ / (n m²) ),    where    error(θ) = (π_m(θ) − π(θ)) / π(θ).
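The role of the variance correction in the exponent of (3.1) can be checked numerically: if ℓ̂ ∼ N(ℓ, σ²), then exp(ℓ̂ − σ²/2) has expectation exp(ℓ). A minimal sketch in Python, with illustrative constants of our choosing:

```python
import numpy as np

# Check of the bias correction in (3.1): if ell_hat ~ N(ell, sigma2),
# then exp(ell_hat - sigma2/2) has expectation exp(ell). The values of
# ell and sigma2 are illustrative, not from the paper.
rng = np.random.default_rng(1)
ell, sigma2 = 2.0, 0.3
ell_hat = ell + np.sqrt(sigma2) * rng.normal(size=500_000)

corrected = np.exp(ell_hat - 0.5 * sigma2)   # unbiased for exp(ell)
naive = np.exp(ell_hat)                      # biased upward by exp(sigma2/2)
mean_est = corrected.mean()
naive_mean = naive.mean()
```

Without the correction term, the naive estimator overestimates the likelihood by a factor of exp(σ²/2), which is what the subtraction of σ̂²_m(θ)/2 in (3.1) compensates for.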

Quiroz et al. (2017c) show that m = m(n) is sublinear in n and allow d to grow with n in their analysis; see also Section 3.6. However, they did not consider datasets with large d because of the well-known limitation of random walk proposals in high dimensions. The
main contribution of our article is to extend their approach to high-dimensional problems with the help of Hamiltonian Monte Carlo.

3.2. Outline of the idea. We introduce a fictitious continuous momentum vector p ∈ R^d of the same dimension as θ, which we henceforth assume to be continuous. Augmenting the target in (3.1), we obtain

(3.3)    π^(m)(θ, p, u) ∝ exp( −Ĥ(θ, p) ) p_U(u),    with    Ĥ(θ, p) = Û(θ) + K(p),

where

Û(θ) = −ℓ̂_m(θ) + ½ σ̂²_m(θ) − log p_Θ(θ)    and    K(p) = ½ p′ M^(−1) p,

and M is a symmetric, positive-definite matrix. In (3.3) we have assumed, without loss of generality for our subsampling approach, that the Hamiltonian Ĥ is separable. We propose to sample from (3.3) using Gibbs sampling (Geman and Geman, 1984), alternating between

(1) u | θ, p, y: a Metropolis-Hastings update (Section 3.4);
(2) θ, p | u, y: a Hamiltonian Monte Carlo update given the set u from Step 1 (Section 3.5).

A Metropolis-Hastings step within Gibbs (MH-within-Gibbs) is valid (Johnson et al., 2013), and so is a Hamiltonian step (Neal, 2011); therefore this scheme has (3.3) as its invariant distribution. Integrating out the momentum variables yields π^(m)(θ, u) in (3.1) and, further integrating out u, gives π_m(θ) with the error rate given in (3.2). The next subsections show how to obtain accurate estimates and perform the Gibbs updates efficiently. In particular, we explain why our approach does not fundamentally compromise the Hamiltonian flow.

3.3. Efficient estimators. Quiroz et al. (2017c) propose sampling m observations with replacement and estimating the additive log-likelihood

ℓ(θ) = log p(y|θ) = Σ_{k=1}^n ℓ_k(θ),    ℓ_k(θ) = log p(y_k | θ),
by the unbiased difference estimator

(3.4)    ℓ̂_m(θ) = Σ_{k=1}^n q_k(θ) + (1/m) Σ_{i=1}^m ( ℓ_{u_i}(θ) − q_{u_i}(θ) ) / ω_{u_i},    u_i ∈ {1, . . . , n} iid with Pr(u_i = k) = ω_k,

where the q_k(θ) ≈ ℓ_k(θ) are control variates. If the q_k(θ) approximate the ℓ_k(θ) reasonably well, we obtain an efficient estimator by taking ω_k = 1/n for all k. Hence, the sampling probabilities need not be computed prior to sampling, which can be prohibitively expensive. With these ω_k we can estimate σ²(θ) = V[ℓ̂_m(θ)] by

(3.5)    σ̂²_m(θ) = (n²/m²) Σ_{i=1}^m ( e_{u_i}(θ) − ē_u(θ) )²,    with    e_{u_i}(θ) = ℓ_{u_i}(θ) − q_{u_i}(θ),

where ē_u denotes the mean of the e_{u_i} in the sample u = (u_1, . . . , u_m). We follow Bardenet et al. (2015) and let q_k(θ) be a Taylor approximation of ℓ_k(θ) around a central value θ*,

(3.6)    q_k(θ) = ℓ_k(θ*) + ∇_θ ℓ_k(θ*)^T (θ − θ*) + ½ (θ − θ*)^T H_k(θ*) (θ − θ*),    H_k(θ*) := ∇_θ ∇_θ^T ℓ_k(θ*).
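As a concrete illustration, a minimal Python sketch of the estimators (3.4)-(3.5) on a simulated logistic model. For brevity it uses first-order Taylor control variates; the paper's (3.6) adds the quadratic term. All names and constants below are ours, not from the paper:

```python
import numpy as np

# Sketch of the difference estimator (3.4) and variance estimate (3.5)
# on a simulated logistic model, with FIRST-order Taylor control
# variates for brevity (the paper's (3.6) is second-order).
rng = np.random.default_rng(0)
n, d, m = 10_000, 3, 100
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)

def ell(theta, idx):
    """Per-observation log-likelihoods log p(y_k | theta)."""
    z = X[idx] @ theta
    return y[idx] * z - np.log1p(np.exp(z))

theta_star = np.zeros(d)                  # central value theta*

# One O(n) pass before the MCMC: per-observation gradients at theta*,
# and the summaries needed to evaluate sum_k q_k(theta) cheaply.
z_star = X @ theta_star
grad_star = (y - 1.0 / (1.0 + np.exp(-z_star)))[:, None] * X
ell_star_sum = ell(theta_star, np.arange(n)).sum()
A = grad_star.sum(axis=0)

def q(theta, idx):
    """First-order control variates q_k(theta)."""
    return ell(theta_star, idx) + grad_star[idx] @ (theta - theta_star)

def ell_hat(theta):
    """Estimator (3.4) with omega_k = 1/n, and variance estimate (3.5)."""
    u = rng.integers(0, n, size=m)        # m draws with replacement
    e = ell(theta, u) - q(theta, u)       # residuals e_{u_i}
    est = ell_star_sum + A @ (theta - theta_star) + (n / m) * e.sum()
    var = (n / m) ** 2 * np.sum((e - e.mean()) ** 2)
    return est, var

# At theta* the control variates are exact: the estimator returns the
# full-data log-likelihood and the estimated variance is zero.
est0, var0 = ell_hat(theta_star)
exact0 = ell(theta_star, np.arange(n)).sum()
est1, _ = ell_hat(theta_star + 0.05)
exact1 = ell(theta_star + 0.05, np.arange(n)).sum()
```

Near θ* the residuals are second-order small, which is why only m out of n observations need to be touched per iteration.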

After processing the full data once before the MCMC to compute simple summary statistics, Σ_{k=1}^n q_k(θ) can be computed in O(1) (Bardenet et al., 2015). It is straightforward to modify (3.4) to instead provide an unbiased estimator of the gradient of the log-likelihood. With ω_k = 1/n and ∇_θ q_k(θ) = ∇_θ ℓ_k(θ*) + H_k(θ*)(θ − θ*),

(3.7)    ∇_θ ℓ̂_m(θ) = A(θ*) + B(θ*)(θ − θ*) + (n/m) Σ_{i=1}^m ( ∇_θ ℓ_{u_i}(θ) − ∇_θ q_{u_i}(θ) ),

where

A(θ*) := Σ_{k=1}^n ∇_θ ℓ_k(θ*) ∈ R^d    and    B(θ*) := Σ_{k=1}^n H_k(θ*) ∈ R^(d×d)

are obtained at the cost of computing over the full dataset once before the MCMC, since θ* is fixed. It is also straightforward to compute ∇_θ σ̂²_m(θ) using (3.5) with this choice of control variate.
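A sketch of the gradient estimator (3.7) on a toy Gaussian model of our own. Since each ℓ_k is quadratic in θ there, the control variate (3.6) is exact and the estimator reproduces the full-data gradient for any subsample:

```python
import numpy as np

# Sketch of (3.7) on a toy model y_k ~ N(theta, 1), where
# ell_k(theta) = -0.5*(y_k - theta)^2 is quadratic, so the Taylor
# control variate (3.6) is exact and (3.7) reproduces the full-data
# gradient for ANY subsample. Names are ours.
rng = np.random.default_rng(5)
n, m = 50_000, 100
y = rng.normal(loc=3.0, size=n)

theta_star = 0.0
grad_star = y - theta_star            # per-observation gradients at theta*
A = grad_star.sum()                   # A(theta*) = sum of gradients
B = -float(n)                         # B(theta*) = sum of Hessians (H_k = -1)

def grad_ell_hat(theta):
    u = rng.integers(0, n, size=m)    # subsample with replacement
    grad_u = y[u] - theta             # exact per-observation gradients
    grad_q_u = grad_star[u] + (-1.0) * (theta - theta_star)
    return A + B * (theta - theta_star) + (n / m) * np.sum(grad_u - grad_q_u)

theta = 1.7
exact = np.sum(y - theta)             # full-data gradient
est = grad_ell_hat(theta)             # matches `exact` up to rounding
```

For non-quadratic models the agreement is only approximate, with an error driven by the quality of the Taylor expansion around θ*.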


3.4. Gibbs update of the data subset. Given θ^(j−1), p^(j−1) and u^(j−1), at iteration j we propose u' ∼ p_U(u) and set u^(j) = u' with probability

(3.8)    α_u = min{ 1, exp( ℓ̂_m(θ^(j−1); u') − ½ σ̂²_m(θ^(j−1); u') ) / exp( ℓ̂_m(θ^(j−1); u^(j−1)) − ½ σ̂²_m(θ^(j−1); u^(j−1)) ) },

where the notation emphasizes that the estimators are based on different data subsets. If u' is rejected, u^(j) = u^(j−1). Since u' is proposed independently of u^(j−1), the log of the ratio in (3.8) can be highly variable, possibly getting the sampler stuck when the numerator is significantly overestimated. Inducing a high correlation between the estimators at the current and proposed draws, either by correlating u (Deligiannidis et al., 2016) or by block updates of u (Tran et al., 2016), reduces the variance of the log of the ratio dramatically. We implement the block update of u for data subsampling in Quiroz et al. (2017c) with G blocks, which gives a correlation of roughly 1 − 1/G between the ℓ̂_m at the current and proposed draws (Tran et al., 2016). Setting G = 100 helps the chain to mix well.

3.5. Gibbs update of the parameters. Given u^(j), we can utilize Hamilton's equations

(3.9)    dθ_l/dt = ∂Ĥ(θ, p)/∂p_l,    dp_l/dt = −∂Ĥ(θ, p)/∂θ_l,    l = 1, . . . , d,

to propose θ and p. Note that, since u^(j) is fixed through time t, this trajectory follows the Hamiltonian flow for Ĥ, viewed as a function of θ and p for a given data subset (indicated by u^(j)). We obtain the proposal as in standard HMC, using a Leapfrog integrator with integrating time L, but with Ĥ in place of H. Specifically, at iteration j, given the data subset u^(j), if the Leapfrog integrator starts at (θ^(j−1), p_0) with p_0 ∼ K(p) and ends at (θ_L, −p_L), we let (θ^(j), p^(j)) = (θ_L, −p_L) with probability

(3.10)    α_{θ,p} = min{ 1, exp( −Ĥ(θ_L, −p_L) + Ĥ(θ^(j−1), p_0) ) },

with Ĥ in (3.3). If (θ_L, −p_L) is rejected, we set (θ^(j), p^(j)) = (θ^(j−1), p_0). In practice, it is unnecessary to store the sampled momentum.


Using the terminology in Betancourt (2015), we can think of the dynamics in (3.9) as generating a trajectory that follows a modified level set, as opposed to the exact level set obtained using the dynamics without data subsampling (H with potential energy U(θ) = − log π(θ)). We discretize the Hamiltonian with a symplectic integrator which, given a sensible step length ε, introduces an error of O(ε²) (Neal, 2011) relative to the modified level set, and hence the discretization error is very small. However, the modified level set, and hence also the discretized trajectory, might be very far from the exact level set, no matter how small ε is. Betancourt (2015) shows that adding an MH correction based on the exact Hamiltonian H (with the potential U(θ) = − log π(θ), attempting to give the algorithm π(θ) as invariant distribution) results in a very low acceptance probability. This is intuitive: the trajectory was generated using a different Hamiltonian Ĥ (with potential energy Û(θ)), and hence energy conservation with respect to the exact Hamiltonian is impossible. Instead, our approach uses the Hamiltonian which actually generated the trajectory in the MH correction and hence, albeit not having π(θ) as its invariant distribution, does not compromise the main mechanism behind the high efficiency of HMC: energy conservation, even for distant proposals. This is also clear from the fact that, since we use the same u throughout a single update of θ and p, we perform a standard HMC update with an estimated Hamiltonian based on u, using the exact gradient of the estimated log-posterior (with the same u). Because of this feature, we name our algorithm Hamiltonian Monte Carlo with Energy Conserving Subsampling (HMC-ECS). Section 3.2 gives the asymptotics of the fractional error of the invariant distribution of our algorithm relative to π(θ), the invariant distribution of HMC without data subsampling; Section 3.6 studies this further.
Algorithm 1 shows one iteration of our proposed HMC-ECS algorithm based on the Leapfrog integrator.

3.6. Asymptotic properties. Quiroz et al. (2017c) show that, by using the control variates in Section 3.3, the variance of the log-likelihood estimator is

(3.11)    σ²(θ) = O( d³ / (nm) ).


Algorithm 1: One iteration of HMC-ECS.

Input: current values u^(j−1) and θ^(j−1), stepsize ε and integrating time L.

Propose u' ∼ p_U(u) and set u^(j) ← u' with probability α_u in (3.8); otherwise u^(j) ← u^(j−1).
Given u^(j):
  p_0 ∼ K(p);  θ_0 ← θ^(j−1);  Ĥ_0 ← Ĥ(θ_0, p_0)
  p_0 ← p_0 − (ε/2) ∇_θ Û(θ_0)
  for l = 1 to L do
    θ_l ← θ_{l−1} + ε M^(−1) p_{l−1}
    if l < L then p_l ← p_{l−1} − ε ∇_θ Û(θ_l) else p_L ← p_{L−1} − (ε/2) ∇_θ Û(θ_L)
  end
  p_L ← −p_L;  Ĥ_L ← Ĥ(θ_L, p_L)
  Set (θ^(j), p^(j)) ← (θ_L, p_L) with probability α_{θ,p} in (3.10); otherwise (θ^(j), p^(j)) ← (θ^(j−1), p_0).
Output: u^(j), θ^(j).

To obtain a variance which is bounded in n we can take, for example, d = O(n^(1/2)) and m = O(n^(1/2)), which results in (3.2) being

(3.12)    (π_m(θ) − π(θ)) / π(θ) = O( 1/√n ).
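Assuming a toy conjugate Gaussian model of our own (for which the quadratic control variates are exact, so the sketch illustrates the mechanics of Algorithm 1 rather than the perturbation), one full HMC-ECS run can be sketched as:

```python
import numpy as np

# End-to-end sketch of Algorithm 1 on y_k ~ N(theta, 1) with a
# N(0, 10^2) prior. ell_k is quadratic, so the control variates (3.6)
# are exact: ell_hat equals the full-data log-likelihood and
# sigma2_hat ~ 0. Names and constants are ours.
rng = np.random.default_rng(7)
n, m, G = 2_000, 200, 10
y = rng.normal(loc=1.5, scale=1.0, size=n)
tau2 = 100.0                                  # prior variance

theta_star = 0.0
grad_star = y - theta_star
A, B = grad_star.sum(), -float(n)             # one O(n) pass before sampling
ell_star = -0.5 * np.sum((y - theta_star) ** 2)

def ell_var_hat(theta, u):
    """Estimators (3.4)-(3.5); residuals vanish here up to rounding."""
    dth = theta - theta_star
    e = (-0.5 * (y[u] - theta) ** 2) - (
        -0.5 * (y[u] - theta_star) ** 2 + grad_star[u] * dth - 0.5 * dth ** 2)
    est = ell_star + A * dth - 0.5 * n * dth ** 2 + (n / m) * e.sum()
    return est, (n / m) ** 2 * np.sum((e - e.mean()) ** 2)

def U_hat(theta, u):                          # estimated potential energy
    est, var = ell_var_hat(theta, u)
    return -est + 0.5 * var + 0.5 * theta ** 2 / tau2

def grad_U_hat(theta, u):                     # exact gradient of U_hat
    dth = theta - theta_star
    g = A + B * dth + (n / m) * np.sum((y[u] - theta) - (grad_star[u] - dth))
    return -g + theta / tau2

M = float(n) + 1.0 / tau2      # mass ~ posterior precision (cf. Section 4.2)
u = rng.integers(0, n, size=m)
theta, eps, L, blk = 0.0, 0.1, 16, m // G
draws = []
for _ in range(2_000):
    # Step 1: Gibbs update of u, refreshing one of the G blocks.
    g_idx = rng.integers(G)
    u_prop = u.copy()
    u_prop[g_idx * blk:(g_idx + 1) * blk] = rng.integers(0, n, size=blk)
    e0, v0 = ell_var_hat(theta, u)
    e1, v1 = ell_var_hat(theta, u_prop)
    if np.log(rng.uniform()) < (e1 - 0.5 * v1) - (e0 - 0.5 * v0):
        u = u_prop
    # Step 2: leapfrog HMC update under the estimated Hamiltonian.
    p0 = rng.normal(scale=np.sqrt(M))
    H0 = U_hat(theta, u) + 0.5 * p0 ** 2 / M
    th, p = theta, p0 - 0.5 * eps * grad_U_hat(theta, u)
    for l in range(1, L + 1):
        th = th + eps * p / M
        p = p - (0.5 if l == L else 1.0) * eps * grad_U_hat(th, u)
    HL = U_hat(th, u) + 0.5 * p ** 2 / M      # K is even: flip is implicit
    if np.log(rng.uniform()) < H0 - HL:
        theta = th
    draws.append(theta)

post_mean = np.mean(draws[500:])    # conjugate posterior mean ~ y.mean()
post_sd = np.std(draws[500:])       # ~ 1/sqrt(n + 1/tau2)
```

Because the MH correction uses the same estimated Hamiltonian that generated the trajectory, the acceptance rate in Step 2 stays high even though only m of the n observations are used.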

Note that the asymptotic analysis allows for increased model complexity through an increasing d, while we are still able to target σ²(θ) = O(1) using a subsample size which is sublinear in n. If the value of σ²(θ) is large, it is important to perform block updates of u as discussed in Section 3.4, to ensure that the Gibbs step u | θ, p, y does not get stuck. However, Quiroz et al. (2017c) argue that increasing σ²(θ) increases the perturbation error. They devised an
approach to compute the perturbation error given a value of θ. Nevertheless, an unacceptably large σ 2 (θ) may occur when the dimension of the parameter is very large, and the algorithm may then produce estimates with non-negligible bias. This can be avoided by an alternative approach which we now discuss.

3.7. Exact Hamiltonian-within-Gibbs. Quiroz et al. (2017a) propose using the Poisson estimator (Wagner, 1987) to estimate the likelihood unbiasedly in a subsampling MCMC; we now briefly describe their approach. Generate G ∼ Poisson(µ) and let u^(h) ∼ p_U(u) with m_b observations for each h = 1, . . . , G. The Poisson estimator of the likelihood is

(3.13)    L̂_m(θ) = exp(a + µ) Π_{h=1}^G ( ℓ̂^(h)_{m_b} − a ) / µ,

where a is a constant, µ = E[G] (if G is zero the product is defined to be 1) and ℓ̂_{m_b} is defined in (3.4). Defining the augmented density targeted by HMC-ECS using this estimator yields

(3.14)    π̃^(m)(θ, p, u) ∝ L̂_m(θ) exp( −K(p) ) p_Θ(θ) p_U(u).

Remarkably,

∫_p ∫_u π̃^(m)(θ, p, u) du dp = π(θ),

and thus HMC-ECS with (3.13) simulates from the desired target, given that (3.14) is a proper density, which holds only if Pr( L̂_m(θ) ≥ 0 ) = 1. This requires a in (3.13) to be a lower bound of the ℓ̂^(h)_{m_b} (Jacob and Thiery, 2015), which results in a prohibitively costly estimator (Quiroz et al., 2017a). Instead, we follow Quiroz et al. (2017a), who use the approach of Lyne et al. (2015) for exact inference on the expectation of an arbitrary function ψ(θ) with respect to π(θ). This is achieved by considering the target

π̃^(m)(θ, p, u, s) ∝ s(θ, u) π̃^(m)(θ, p, u),    with π̃^(m)(θ, p, u) in (3.14)
and s(θ, u) = sign( L̂_m(θ) ). For any function ψ,

I = E_θ[ψ] = ∫_θ ψ(θ) π(θ) dθ / ∫_θ π(θ) dθ
  = ∫_θ ∫_p ∫_u ψ(θ) π̃^(m)(θ, p, u) du dp dθ / ∫_θ ∫_p ∫_u π̃^(m)(θ, p, u) du dp dθ
  = ∫_θ ∫_p ∫_u ψ(θ) s(θ, u) π̃^(m)(θ, p, u, s) du dp dθ / ∫_θ ∫_p ∫_u s(θ, u) π̃^(m)(θ, p, u, s) du dp dθ.

Hence, we can use our scheme to simulate from (3.14), record the sign s^(j) of the estimate at the jth iteration, and estimate the expectation by

Î_N = Σ_{j=1}^N ψ(θ^(j)) s^(j) / Σ_{j=1}^N s^(j).
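The Poisson estimator (3.13) and the sign-corrected average can be checked numerically. A sketch with illustrative constants of our own, where ℓ̂ is a Gaussian stand-in for the subsample estimator (3.4):

```python
import numpy as np

# Sketch of the Poisson estimator (3.13) and the sign-corrected average.
# ell_hat is a Gaussian stand-in for (3.4); ell, a and mu are
# illustrative choices of ours.
rng = np.random.default_rng(3)
ell, a, mu = 1.0, -1.0, 3.0

def ell_hat():
    return ell + rng.normal(scale=0.5)        # unbiased estimate of ell

def L_hat():
    """Poisson estimator of exp(ell); empty product = 1 when G == 0."""
    G = rng.poisson(mu)
    prod = 1.0
    for _ in range(G):
        prod *= (ell_hat() - a) / mu
    return np.exp(a + mu) * prod

draws = np.array([L_hat() for _ in range(100_000)])
mean_est = draws.mean()                        # close to exp(ell)
frac_neg = np.mean(draws < 0)                  # rare; controlled by a

# Sign-corrected estimate of E[psi(theta)]; with psi = 1 it is exactly 1.
s = np.sign(draws)
I_hat = np.sum(np.ones_like(draws) * s) / np.sum(s)
```

Choosing a well below the typical values of ℓ̂ keeps negative factors, and hence sign flips, rare, which is what keeps the variance of Î_N small.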

Lyne et al. (2015) derive the variance of Î_N, which is small if most estimates have the same sign; this can in turn be controlled by tuning a (Quiroz et al., 2017a). Finally, to reduce the variance of the ratio when updating u in Section 3.4, Quiroz et al. (2017a) propose correlating the Poisson random variates G at the current and proposed draws through a copula transformation of the Gaussian random variates used in the correlated pseudo-marginal approach of Deligiannidis et al. (2016).

4. Application

4.1. Data and Model. Our dataset consists of annual observations on the bankruptcy status (binary y) of Swedish firms over the period 1991-2008. The firm-specific financial ratios, all scaled with respect to total assets, are earnings (x1), leverage (x2), cash (x3) and tangible assets (x4). The dataset also contains the log of firm age in years (x5) and the log of deflated total sales (x6), to control for age and size effects. Finally, the macroeconomic variables yearly GDP growth rate (x7) and interest rate (x8) are included. Nonlinear bankruptcy models for this dataset have been analyzed in Quiroz and Villani (2013) and Giordani et al. (2014). We follow the approach of Giordani et al. (2014), who model the log odds of the firm failure probability as a non-linear function of the covariates
using an additive spline model. They estimate the parameters using a frequentist Generalized Additive Model (GAM) approach (Hastie and Tibshirani, 1990). Our article uses a Bayesian approach and simulates from the 89-dimensional posterior distribution (an intercept and 10 knots plus a linear term for each covariate) given the n = 4,748,089 firm-year observations. This is clearly a nontrivial problem and requires an efficient MCMC scheme. We consider the logistic regression

p(y_k | x_k, θ) = ( 1 / (1 + exp(x_k^T θ)) )^(y_k) ( 1 / (1 + exp(−x_k^T θ)) )^(1−y_k),    with    p_Θ(θ) = N(θ | 0, I/λ²),

where, for simplicity, the shrinkage factor is λ = 0.1 and I is the identity matrix. We estimate the model using HMC-ECS and compare its performance against standard (non-subsampling) HMC.

4.2. Tuning HMC and HMC-ECS. Hamiltonian Monte Carlo is well known to be difficult to tune, and hence so is HMC-ECS. Section 3.5 argues that HMC-ECS (based on the subset u) is a Hamiltonian Monte Carlo update and can thus be tuned using the same methods. To select the step size ε in the Leapfrog integrator, we utilize the dual averaging approach of Hoffman and Gelman (2014), which requires a predetermined trajectory length L. This is found by preliminary runs of HMC with a small ε (to ensure a small discretization error) and a large L (to move far), with the aim of finding a trajectory length which gives reasonable sampling efficiency. We found that L ≈ 2 is sensible for our problem. The dual averaging algorithm uses this trajectory length and adaptively changes ε throughout a training period of 1,000 iterations in order to achieve a desired acceptance rate δ. We follow Hoffman and Gelman (2014) and set δ = 0.8. The choice of the positive-definite matrix M is crucial for the performance of HMC (and therefore also of HMC-ECS): an M that closely resembles the inverse covariance of the posterior facilitates sampling, especially when θ is highly correlated in the posterior (Neal, 2011; Betancourt, 2017). In logistic regression, we can approximate M = Σ^(−1)(θ*), the negative Hessian of the log posterior evaluated at some θ*. The closer θ* is to the posterior mode, the better the approximation. We update θ*, and then M, every 100 iterations during the training period, using the latest 10% of the iterates. Note that
the tuning discussed here might be suboptimal for HMC-ECS. For example, the choice of kinetic energy is based on Σ of the full data posterior, which would favor the performance of HMC. However, both algorithms have equal, state-of-the-art performance with respect to sampling efficiency. For the settings specific to HMC-ECS, we let G = 100 and m = 10,000, so that each block of u contains 100 random variables u_i. With N = 1,000 iterations, every block is updated (on average) 10 times. Tran et al. (2016) show that in large samples the dependence between θ and u in the posterior is weak, and hence in practice one obtains good mixing even if u moves slowly. We acknowledge that the tuning of ε, M and L in this particular problem, which results in essentially iid sampling in Section 4.3, is a consequence of an ideal situation with the logistic model. For example, Σ can be computed analytically, which is useful for selecting both M and ε (see above). This is unlikely in a more complex model. However, we stress that this does not put HMC-ECS in a more favorable position than HMC. Our article is the first to demonstrate a serious example of subsampling MCMC and, by using optimal implementations of both HMC and HMC-ECS, we focus on comparing their core performance rather than on tuning issues. That being said, we hope to further develop HMC-ECS in future research; see Section 5.
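For this logistic model, the per-observation quantities entering the control variates (3.6) have closed forms. A sketch with our own function names, using the paper's parameterization Pr(y_k = 1 | x_k) = 1/(1 + exp(x_k^T θ)):

```python
import numpy as np

# Per-observation log-likelihood, gradient and Hessian for the logistic
# model of Section 4.1, parameterized so that Pr(y_k = 1 | x_k) =
# 1/(1 + exp(x_k' theta)). These are the ingredients of the control
# variates (3.6); the function names are ours.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ell_k(theta, x, yk):
    z = x @ theta
    return -yk * np.log1p(np.exp(z)) - (1.0 - yk) * np.log1p(np.exp(-z))

def grad_ell_k(theta, x, yk):
    return ((1.0 - yk) - sigmoid(x @ theta)) * x

def hess_ell_k(theta, x, yk):
    s = sigmoid(x @ theta)
    return -s * (1.0 - s) * np.outer(x, x)

def q_k(theta, theta_star, x, yk):
    """Quadratic control variate (3.6) built from the terms above."""
    dth = theta - theta_star
    return (ell_k(theta_star, x, yk)
            + grad_ell_k(theta_star, x, yk) @ dth
            + 0.5 * dth @ hess_ell_k(theta_star, x, yk) @ dth)
```

Note that the Hessian does not depend on y_k, which is what makes the full-data summaries A(θ*) and B(θ*) cheap to precompute.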

4.3. Results. Figure 1 illustrates the accuracy of HMC-ECS compared to full data HMC, where both algorithms run for 1,000 iterations. The figure shows the kernel density estimates of the marginal posterior densities for the earnings-ratio (x1 with effect β1 ) and the corresponding coefficients of the non-linear parts. We have verified similar accuracy for the other coefficients (not shown here). Figure 2 also illustrates how well the model fits the data by comparing the realized bankruptcy frequencies against those obtained by the logistic spline model; see the caption for details. The figure shows that the posterior mean and credible intervals obtained by HMC and HMC-ECS are indistinguishable.

[Figure 1 here: kernel density panels for β1 and the spline coefficients β9-β18; legend: HMC (red solid), HMC-ECS (green dashed).]
Figure 1. Kernel density estimates of marginal posteriors for the effects of the earnings-ratio (linear term followed by spline terms). The figure shows the results for HMC (red solid line) and HMC-ECS (green dashed line).

Table 1 shows that both methods are very efficient as measured by the Effective Sample Size ESS = N/IF, where the Inefficiency Factor (IF) is defined as

(4.1)    IF := 1 + 2 ∑_{l=1}^{∞} ρ_l

and ρ_l is the lag-l autocorrelation of the chain. We estimate ESS and IF using the CODA package in R (Plummer et al., 2006). Both algorithms are as efficient as iid sampling according to these measures. The table also shows that HMC-ECS maintains an acceptance rate close to that of HMC even in this high-dimensional example. A reasonable measure of overall cost is the Computational Time (CT), which also accounts for the computing cost of the algorithm. We define the computational time of an algorithm A as

CT_A := IF_A × (total number of density evaluations).

Figure 3 shows a histogram of the Relative CT (RCT) over the parameters, defined as

RCT := CT_HMC / CT_HMC-ECS.
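The quantities IF, ESS and RCT can be sketched in code as follows. This is a hedged illustration with hypothetical names; it truncates the autocorrelation sum at the first non-positive term, a deliberately simple rule, whereas the CODA package used in the paper estimates ESS from the spectral density at frequency zero.

```python
import numpy as np

def inefficiency_factor(chain, max_lag=100):
    """IF := 1 + 2 * sum_{l>=1} rho_l, estimated from a single chain.

    The infinite sum is truncated at the first non-positive sample
    autocorrelation (a simple rule; CODA uses a spectral estimator).
    """
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)  # acf[0] = 1
    total = 0.0
    for l in range(1, min(max_lag, n - 1)):
        if acf[l] <= 0:
            break
        total += acf[l]
    return 1.0 + 2.0 * total

def relative_ct(if_a, evals_a, if_b, evals_b):
    """RCT = CT_A / CT_B, with CT := IF x total density evaluations."""
    return (if_a * evals_a) / (if_b * evals_b)

rng = np.random.default_rng(1)
iid_chain = rng.normal(size=1000)            # iid draws: IF close to 1
rw_chain = np.cumsum(rng.normal(size=1000))  # random walk: IF much larger
ess = len(iid_chain) / inefficiency_factor(iid_chain)  # ESS = N / IF
```

For two samplers with equal IF, RCT reduces to the ratio of density evaluations, which is why subsampling pays off when the subsample is much smaller than the full dataset.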

Figure 2. Realized and estimated bankruptcy probabilities. The figure shows the results with respect to the earnings ratio for HMC (left panel) and HMC-ECS (right panel). The data is divided into 100 equally sized groups based on the earnings-ratio. For each group, the empirical estimate of the bankruptcy probability is the fraction of bankrupt firms. These empirical estimates are represented as blue dots, where the corresponding x-value (earnings-ratio) has been set to the mean within the group. The model estimates for each of the 100 groups are obtained by, for each posterior sample θ, averaging Pr(y_k = 1 | θ) over all observations k in a group, and subsequently computing the posterior mean (solid line) and 90% credible interval (5% and 95% quantiles, shaded region).
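The grouping procedure in the caption can be sketched as follows, using synthetic data and hypothetical names (`binned_fit_check`, with `prob_draws` standing in for posterior draws of the per-observation predicted probabilities Pr(y_k = 1)).

```python
import numpy as np

def binned_fit_check(x, y, prob_draws, n_groups=100):
    """Compare realized event frequencies with model probabilities.

    x          : (n,) covariate used for grouping (e.g. earnings ratio).
    y          : (n,) binary outcomes (1 = bankrupt).
    prob_draws : (S, n) posterior draws of Pr(y_k = 1) per observation.

    Splits the data into n_groups equally sized groups by x and returns,
    per group: mean x, empirical frequency, and the posterior mean and
    5%/95% quantiles of the group-averaged model probability.
    """
    order = np.argsort(x)
    groups = np.array_split(order, n_groups)
    rows = []
    for idx in groups:
        group_prob = prob_draws[:, idx].mean(axis=1)  # one value per draw
        rows.append((x[idx].mean(),
                     y[idx].mean(),          # empirical bankruptcy frequency
                     group_prob.mean(),      # posterior mean of model estimate
                     np.quantile(group_prob, 0.05),
                     np.quantile(group_prob, 0.95)))
    return np.array(rows)

# toy check with a known logistic relationship (not the paper's data)
rng = np.random.default_rng(2)
n, S = 20_000, 50
x = rng.uniform(-0.75, 0.75, size=n)
p_true = 1.0 / (1.0 + np.exp(-(-2.0 + 2.0 * x)))
y = rng.binomial(1, p_true)
prob_draws = np.clip(p_true + rng.normal(0.0, 0.01, size=(S, n)), 0.0, 1.0)
table = binned_fit_check(x, y, prob_draws)
```

When the model fits, the blue-dot column (empirical frequencies) tracks the posterior-mean column within binomial sampling noise, which is exactly the visual comparison Figure 2 makes.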

We conclude that significant gains, with RCT ranging from 180 to 300, are obtained for HMC-ECS compared to using HMC on the full dataset. Finally, Table 2 uses the fractional error formula derived in Quiroz et al. (2017c) to show that the perturbation error in (3.2) is at most 8.501 × 10^{-14}, and hence that our approximate approach is very accurate. We acknowledge that the small perturbation error obtained in this example is due to HMC-ECS using a good reference θ*, which results in accurate control variates. However, we stress that when σ²(θ) is prohibitively large, we can instead use the Poisson estimator described in Section 3.7 to avoid noticeable biases.

Table 1. Summary of settings and efficiencies of HMC and HMC-ECS.

            α_{θ,p}   α_u     Lε    ε       IF    ESS
HMC         0.818     —       2     0.454   1     1000
HMC-ECS     0.793     0.997   2     0.464   1     1000

The table shows the average acceptance probability (over HMC/HMC-ECS iterations, post training period) when sampling θ (α_{θ,p}) and u (α_u), calculated as in (3.10) and (3.8), respectively. The step size ε is the final step size from the dual averaging described in Section 4.2. The number of steps L in the integrator is set so as to obtain a predetermined trajectory length Lε for a given value of ε. The Inefficiency Factor (IF) and Effective Sample Size (ESS) are computed using (4.1) on the post training iterations, and are summarized by their mean across all 89 parameters.

Figure 3. Relative Computational Time (RCT) of HMC compared to HMC-ECS. The figure shows a histogram using d = 89 RCT values (one for each parameter) when HMC-ECS uses m = 10, 000.

Table 2. Perturbation error of the invariant distribution of HMC-ECS.

            Mean              Min               Max
HMC-ECS     2.648 × 10^{-15}  1.373 × 10^{-20}  8.501 × 10^{-14}

The table shows the mean, min and max of the absolute value of (3.2) over 100 iterations of HMC-ECS. The fractional error has been computed using the approach in Quiroz et al. (2017c).

5. Conclusions and Future Research

We propose a method to speed up Bayesian inference, while maintaining high sampling efficiency in high-dimensional parameter spaces, by combining data subsampling and Hamiltonian Monte Carlo. Although our asymptotic results allow the dimension of θ to increase with the size of the dataset n, eventually the variance of the log-likelihood estimator σ²(θ) becomes so large that the error in the posterior simulation is undesirably large with the unbiased log-likelihood estimator. In this scenario, we can instead use the Poisson estimator, which is unbiased for the likelihood and, although it has a larger σ²(θ), is exact in the sense that the iterates of HMC-ECS can be used to consistently estimate the expectation of any function of the parameters. Of course, at some limit of the dimension, σ²(θ) becomes so large that the first step of our algorithm gets stuck and the algorithm stops mixing over u. HMC-ECS therefore does not scale as well as HMC, but in exchange it is considerably faster when applicable. Keeping in mind that Bardenet et al. (2015) and Quiroz et al. (2017c) demonstrate that most subsampling approaches cannot even beat standard (non-subsampling) MH on toy examples with d = 2 and highly redundant data, we believe that our method is a good step forward in scaling up MCMC for huge datasets.

Similarly to HMC, HMC-ECS is difficult to tune. Self-tuning algorithms such as the No-U-Turn sampler (Hoffman and Gelman, 2014) have been proposed for HMC, and it would be interesting to see if our ideas can be applied there. It would also be interesting to consider Riemann manifold HMC (Girolami and Calderhead, 2011), which has been demonstrated to be very effective when a high-dimensional posterior exhibits strong correlations. Scaling up such an algorithm would open up the possibility of estimating highly complex models with huge datasets. Finally, until very recently, one of the limitations of HMC was its inability to cope with discrete parameters. Nishimura et al. (2017) overcome this limitation, and perhaps HMC-ECS can be extended in this direction. We leave these extensions for future research.

6. Acknowledgments

Khue-Dung Dang, Matias Quiroz and Robert Kohn were partially supported by the Australian Research Council Center of Excellence grant CE140100049. Mattias Villani was partially supported by the Swedish Foundation for Strategic Research (Smart Systems: RIT 15-0097).

References

Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, pages 697–725.

Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. (2017). Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439.

Bardenet, R., Doucet, A., and Holmes, C. (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning, pages 405–413.

Bardenet, R., Doucet, A., and Holmes, C. (2015). On Markov chain Monte Carlo methods for tall data. arXiv preprint arXiv:1505.02827.

Bernardo, J. M. and Smith, A. F. (2001). Bayesian Theory.

Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In International Conference on Machine Learning, pages 533–540.

Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association (just-accepted).


Chen, T., Fox, E., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691.

Cover, T. M. and Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.

Deligiannidis, G., Doucet, A., and Pitt, M. K. (2016). The correlated pseudo-marginal method. arXiv preprint arXiv:1511.04992v3.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.

Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. (2016). Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pages 1154–1162.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741.

Giordani, P., Jacobson, T., von Schedvin, E., and Villani, M. (2014). Taking the twists into account: Predicting firm bankruptcy risk with splines of financial ratios. Journal of Financial and Quantitative Analysis, 49(4):1071–1099.

Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Wiley Online Library.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Hoffman, M. D. and Gelman, A. (2014). The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.

Jacob, P. E. and Thiery, A. H. (2015). On nonnegative unbiased estimators. The Annals of Statistics, 43(2):769–784.


Johnson, A. A., Jones, G. L., and Neath, R. C. (2013). Component-wise Markov chain Monte Carlo: Uniform and geometric ergodicity under mixing and composition. Statistical Science, pages 360–375.

Korattikara, A., Chen, Y., and Welling, M. (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 181–189.

Lyne, A.-M., Girolami, M., Atchade, Y., Strathmann, H., and Simpson, D. (2015). On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science, 30(4):443–467.

Ma, Y.-A., Chen, T., and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925.

Maclaurin, D. and Adams, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014).

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.

Minsker, S., Srivastava, S., Lin, L., and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1656–1664.

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162.

Neiswanger, W., Wang, C., and Xing, E. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), pages 623–632.

Nemeth, C. and Sherlock, C. (2016). Merging MCMC subposteriors through Gaussian-process approximations. arXiv preprint arXiv:1605.08576.


Nemeth, C., Sherlock, C., and Fearnhead, P. (2016). Particle Metropolis-adjusted Langevin algorithms. Biometrika, 103(3):701–717.

Nishimura, A., Dunson, D., and Lu, J. (2017). Discontinuous Hamiltonian Monte Carlo for sampling discrete parameters. arXiv preprint arXiv:1705.08510.

Plummer, M., Best, N., Cowles, K., and Vines, K. (2006). Coda: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11.

Quiroz, M., Tran, M.-N., Villani, M., and Kohn, R. (2017a). Exact subsampling MCMC. arXiv preprint arXiv:1603.08232.

Quiroz, M., Tran, M.-N., Villani, M., and Kohn, R. (2017b). Speeding up MCMC by delayed acceptance and data subsampling. Journal of Computational and Graphical Statistics (to appear).

Quiroz, M. and Villani, M. (2013). Dynamic mixture-of-experts models for longitudinal and discrete-time survival data. Sveriges Riksbank Working Paper Series, No. 268.

Quiroz, M., Villani, M., Kohn, R., and Tran, M.-N. (2017c). Speeding up MCMC by efficient data subsampling. arXiv preprint arXiv:1404.4178v5.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268.

Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71.

Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H., George, E., and McCulloch, R. (2013). Bayes and big data: the consensus Monte Carlo algorithm. In EFaB@Bayes 250 conference, volume 16.

Teh, Y. W., Thiery, A. H., and Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17:1–33.


Tran, M.-N., Kohn, R., Quiroz, M., and Villani, M. (2016). Block-wise pseudo-marginal Metropolis-Hastings. arXiv preprint arXiv:1603.02485v3.

Wagner, W. (1987). Unbiased Monte Carlo evaluation of certain functional integrals. Journal of Computational Physics, 71(1):21–33.

Wang, X. and Dunson, D. B. (2014). Parallel MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605v2.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.