Bayesian model selection for exponential random graph models via adjusted pseudolikelihoods

Lampros Bouranis, Nial Friel∗, Florian Maire

arXiv:1706.06344v1 [stat.CO] 20 Jun 2017

School of Mathematics and Statistics & Insight Centre for Data Analytics, University College Dublin, Ireland

Abstract

Models with intractable likelihood functions arise in many areas including network analysis and spatial statistics, especially those involving Gibbs random field models. Posterior parameter estimation in these settings has been termed a doubly intractable problem because both the likelihood function and the posterior distribution are intractable. The comparison of statistical models following the Bayesian paradigm is often based on the statistical evidence, the integral of the unnormalised posterior distribution over the model parameters, which is rarely available in closed form, giving rise to a third layer of intractability. Consequently, the selection of the model that best describes an observed network among a collection of exponential random graph models for social network analysis is a daunting task. Pseudolikelihoods offer a tractable approximation to the likelihood but should be treated with caution because they can lead to unreasonable inference. This paper proposes the use of an adjusted pseudolikelihood as a reasonable, yet tractable, approximation to the likelihood for Bayesian model selection. This, in turn, allows us to implement widely used computational methods for evidence estimation and pursue Bayesian model selection of exponential random graph models for the analysis of real-world social networks. Empirical comparisons to existing methods across various experiments show that our procedure yields similar evidence estimates, but at a lower computational cost.

Keywords: Bayes factors, Evidence, Exponential random graph models, Intractable normalising constants, Pseudolikelihood, Tractable approximation.

1. Introduction

Bayesian inference for models that are characterized by an intractable likelihood function has received considerable attention from the statistical community, notably the class of Gibbs random fields (GRFs).
Popular examples include the autologistic model (Besag, 1972), used to model the spatial distribution of binary random variables defined on a lattice or grid, and the exponential random graph model (ERGM) for social network analysis (Robins et al., 2007). Despite their popularity, posterior parameter estimation for GRFs presents considerable difficulties because the

∗Corresponding author. Email addresses: [email protected] (Lampros Bouranis), [email protected] (Nial Friel), [email protected] (Florian Maire)


normalising constant z(θ) of the likelihood density

f(y | θ) = q(y | θ) / z(θ)    (1)

is typically intractable for all but trivially small graphs. The posterior distribution, defined as

π(θ | y) = f(y | θ) p(θ) / π(y) = f(y | θ) p(θ) / ∫_Θ f(y | θ) p(θ) dθ,    (2)

is termed doubly-intractable because of the intractability of the normalising term of the likelihood model within the posterior and the intractability of the posterior normalising term. Bayesian model comparison is often achieved by estimating the Bayes' factor (Kass and Raftery, 1995), which relies upon the marginal likelihood or model evidence, π(y), of each of the competing models. However, for many models of interest with intractable likelihoods, such as GRFs, estimation of the marginal likelihood adds another layer of difficulty. This paper addresses this problem in the context of Bayesian model comparison of exponential random graph models.

Related work by Friel (2013) and Everitt et al. (2017) has the same objective as our study, namely to estimate the evidence in the presence of an intractable likelihood normalising constant. Contrary to our method, their proposed algorithms rely heavily on repeated simulations from the likelihood. Friel (2013) devised a "population" version of the exchange algorithm (Møller et al., 2006); however, for evidence estimation it is limited to models with a small number of parameters. Everitt et al. (2017) describe an importance sampling approach for estimating the evidence, which is promising for low-dimensional models. However, when moving to higher dimensional settings their approach makes use of a particle filter to estimate the evidence, which is naturally more computationally demanding. Additionally, those approaches rely on simulation to circumvent the evaluation of the intractable likelihood, adding significantly to the computational burden.

Motivated by overcoming the intractability of the likelihood in (1), a natural approach is to use composite likelihoods as a plug-in for the true likelihood (Varin et al., 2011). The pseudolikelihood (Besag, 1975) is an antecedent of composite likelihoods and was developed in the context of ERGMs by Strauss and Ikeda (1990).
Building on the work of Stoehr and Friel (2015), Bouranis et al. (2017) proposed an alternative approach to Bayesian inference for ERGMs. The replacement of the true likelihood with the pseudolikelihood approximation in Bayes' formula yields what we term a pseudo-posterior distribution, a tractable Bayesian model from which it is straightforward to sample. Bayesian inference based on the pseudolikelihood can be problematic, however, as in some cases the posterior mean estimates are biased and the posterior variances are typically underestimated. Bouranis et al. (2017) developed an approach to allow for correction of a sample from the pseudo-posterior distribution so that it is approximately distributed from the target posterior distribution.

While parameter inference with the adjusted pseudo-posterior distribution yields reasonable results (Bouranis et al., 2017), evidence estimation is inefficient, a point which is explained in Section 5. Based on this observation, in this paper we consider adjusting the pseudolikelihood directly and then replacing the likelihood function in the model evidence with this fully adjusted pseudolikelihood. These adjustments involve a correction of the mode, the curvature and the magnitude at the mode of the pseudolikelihood function, as outlined in Figure 1. The crucial point is that this adjusted pseudolikelihood function renders the corresponding posterior distribution


amenable to standard evidence estimation methods from the Bayesian toolbox. A non-exhaustive list of such methods includes Chib's method (Chib, 1995) and its extension (Chib and Jeliazkov, 2001), importance sampling (Liu, 2001), annealed importance sampling (Neal, 2001), bridge sampling (Meng and Wong, 1996) and path sampling/thermodynamic integration (Gelman and Meng, 1998; Lartillot and Philippe, 2006; Friel and Pettitt, 2008; Calderhead and Girolami, 2009), among others.

[Figure 1 appears here: three panels (Pseudo, Mode-Curv. adj., Full adj.) plotting the log likelihood against Theta, alongside the true log likelihood.]

Figure 1: A graphical representation of the steps involved in the adjustment of the log-pseudolikelihood. A mode and curvature-adjusted log-pseudolikelihood (red curve) stems from the unadjusted log-pseudolikelihood (green curve). The magnitude adjustment ensures equality with the true log-likelihood (black curve) at the mode.

The tractability of the fully adjusted pseudolikelihood allows for evidence estimation with such methods, thereby allowing Bayesian model selection of exponential random graph models.

The outline of the paper is as follows. Section 2 introduces the reader to the concept of Bayesian model comparison. A basic description of exponential random graph models is provided in Section 3. In Section 4 we discuss how to perform the adjustments of the pseudolikelihood for ERGMs with the goal of obtaining an approximation of the marginal likelihood, and in Section 5 we derive an approximation to the Bayes' factor. In Section 6 we assess the efficiency of the marginal likelihood approximation with a Potts model example for spatial analysis (Potts, 1952), where the size of the lattice allows for exact estimation of the marginal likelihood. Detailed ERG model selection experiments are presented in Section 7. We conclude the paper in Section 8 with final remarks and recommendations to practitioners based on the accuracy of evidence and Bayes' factor estimates and on computational speed.

2. Overview of Bayesian model selection

Consider the countable model set M = {M1, M2, M3, . . .}. The data y are assumed to have been generated by one of the models in that set. Bayesian model selection aims at calculating

the posterior model probability for model m, π(Mm | y), where it may be of interest to obtain a posteriori a single most probable model or a subset of likely models. We associate each model m with a parameter vector θm. The prior beliefs for each model are expressed through a prior distribution p(Mm) (with ∑_{m∈M} p(Mm) = 1) and for the parameters within each model through p(θm | Mm). These specifications allow Bayesian inference to proceed by examining the posterior distribution π(θm, Mm | y) ∝ f(y | θm, Mm) p(θm | Mm) p(Mm). The within-model posterior appears as π(θm | y, Mm) ∝ f(y | θm, Mm) p(θm | Mm). The constant of proportionality, termed the marginal likelihood or evidence, for model Mm is expressed by

π(y | Mm) = ∫_{Θm} f(y | θm, Mm) p(θm | Mm) dθm,

assuming a proper prior distribution for θm. Precise estimation of the above integral is challenging as it involves a high-dimensional integration over a usually complicated and highly variable function, so in most cases the model evidence is not analytically tractable. Knowledge of the evidence is required to deduce the posterior model probability

π(Mm | y) = π(y | Mm) p(Mm) / ∑_{j=1}^{|M|} π(y | Mj) p(Mj)

using Bayes' theorem. The probability π(Mm | y) is treated as a measure of uncertainty for model Mm. Comparison of two competing models in the Bayesian setting is performed through the Bayes factor,

BF_{m,m′} = π(y | Mm) / π(y | Mm′),    (3)

which provides evidence in favour of model Mm compared with model Mm′. The larger BF_{m,m′} is, the greater the evidence in favour of Mm compared to Mm′. The reader is referred to Kass and Raftery (1995), who present a comprehensive review of Bayes' factors. Friel (2013) described the estimation of the evidence arising from a doubly intractable posterior distribution as a "triply intractable problem" when intractable normalising constants are present. In this paper we are concerned with approaches based solely on within-model simulation, where the posterior distribution within model Mm is examined separately for each m. Recent reviews comparing popular methods based on MCMC sampling can be found in Friel and Wyse (2012) as well as in Ardia et al. (2012).

3. Exponential random graph models

Consider the set of all possible graphs on n nodes (actors), Y. An n × n random adjacency matrix Y on n nodes and a set of edges (relationships) describes the connectivity pattern of a graph that represents the network data. A realisation of Y is denoted by y and the presence or


absence of a tie between the pair of nodes (i, j) is coded as yij = 1 if (i, j) are connected, and yij = 0 otherwise. An edge connecting a node to itself is not permitted, so yii = 0.

Exponential random graph models represent a general class of models for specifying the probability distribution for a set of random graphs or networks based on exponential-family theory (Wasserman and Pattison, 1996). Local structures in the form of meaningful subgraphs model the global structure of the network. ERGMs model the network directly through the likelihood function

f(y | θ) = q(y | θ) / z(θ) = exp{θᵀs(y)} / ∑_{y∈Y} exp{θᵀs(y)},   θᵀs(y) = ∑_{j=1}^{d} θj sj(y),    (4)

where q(y | θ) is the un-normalised likelihood, s : Y → R^d are sufficient statistics based on the adjacency matrix and θ ∈ Θ ⊆ R^d are the model parameters (Snijders et al., 2006; Hunter and Handcock, 2006). Our focus lies on ERG models that are edge-dependent and whose likelihood is intractable. The evaluation of z(θ) is feasible for only trivially small graphs, as this sum involves 2^{n(n−1)/2} terms for undirected graphs. Recent studies on the inference of ERGMs with the Bayesian approach include Koskinen (2004), Caimo and Friel (2011), Wang and Atchade (2014), Caimo and Mira (2015), Thiemichen et al. (2016) and Bouranis et al. (2017). Bayesian model selection for exponential random graph models has been explored by Caimo and Friel (2013), Friel (2013), Thiemichen et al. (2016) and Everitt et al. (2017).

A reparameterization of (4) can express the distribution of the Bernoulli variable Yij under the conditional form

logit(p(yij = 1 | y−ij, θ)) = θᵀδs(y)ij,

where δs(y)ij = s(y⁺ij) − s(y⁻ij) denotes the vector of change statistics. The pseudolikelihood method defines the approximation to the full joint distribution in (4) as the product of the full conditionals for the individual observations/dyads:

fPL(y | θ) = ∏_{i≠j} p(yij | y−ij, θ) = ∏_{i≠j} p(yij = 1 | y−ij, θ)^{yij} {1 − p(yij = 1 | y−ij, θ)}^{1−yij},

where y−ij denotes y\{yij}.

4. Adjusting the pseudolikelihood

Analytical or computational intractability of the likelihood function poses a major challenge to the Bayesian approach, as well as to all likelihood-based inferential approaches. A natural strategy to deal with such model intractability is to substitute the full likelihood with a surrogate composite likelihood (Lindsay, 1988; Varin et al., 2011) that shares similar properties with the full likelihood. The pseudolikelihood (Besag, 1975, 1977) is a special case of the composite likelihood and can serve as a proxy to the full likelihood when the assumption of conditional independence of the variables is reasonable.
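To make the pseudolikelihood concrete, the following sketch evaluates log fPL(y | θ) for a toy undirected network. The choice of statistics (edges and triangles) and the 4-node graph are illustrative assumptions, not taken from the paper's experiments; each dyad contributes a Bernoulli log-likelihood term with success probability logit⁻¹(θᵀδs(y)ij).

```python
import math

def change_stats(y, i, j):
    """δs(y)_ij for an illustrative edges + triangles ERGM: adding edge
    (i, j) adds one edge and closes one triangle per common neighbour."""
    d_edges = 1.0
    d_tri = float(sum(y[i][k] * y[j][k] for k in range(len(y))))
    return [d_edges, d_tri]

def log_pseudolikelihood(y, theta):
    """log f_PL(y | θ) = Σ_{i<j} [ y_ij θᵀδs − log(1 + exp(θᵀδs)) ],
    counting each dyad of an undirected graph once."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            eta = sum(t * d for t, d in zip(theta, change_stats(y, i, j)))
            total += y[i][j] * eta - math.log1p(math.exp(eta))
    return total

# toy 4-node adjacency matrix (hypothetical data)
y = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(log_pseudolikelihood(y, [-0.5, 0.3]))  # ≈ -4.56
```

Unlike the full likelihood (4), this quantity requires no sum over the 2^{n(n−1)/2} graphs, which is what makes the pseudolikelihood attractive as a plug-in.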


This assumption is usually unrealistic, though. The drawback of the pseudolikelihood is that it ignores strong dependencies in the data and can, therefore, lead to biased estimation. We propose to perform adjustments on the pseudolikelihood to obtain a reasonable approximation to the likelihood. Adjustments to composite likelihood functions have previously been suggested by Ribatet et al. (2012). Following the proposals of Stoehr and Friel (2015) and Bouranis et al. (2017) for GRFs, we initially adjust the pseudolikelihood itself by matching its first two moments with the first two moments of the likelihood through a model-specific invertible and differentiable mapping

g : Θ → Θ,   θ ↦ θ̂MPLE + W(θ − θ̂MLE),    (5)

which depends on the maximum likelihood estimate, θ̂MLE, the maximum pseudolikelihood estimate, θ̂MPLE, and a transformation matrix W. The mode and curvature-adjusted pseudolikelihood is defined as the function y ↦ fPL(y | g(θ)). Figure 1 displays a difference in magnitude from f(y | θ); a magnitude adjustment of the mode and curvature-adjusted pseudolikelihood results in the fully adjusted pseudolikelihood

f̃(y | θ) = C · fPL(y | g(θ)).    (6)

The remainder of this section provides guidelines for estimating C and g(θ).

4.1. Mode adjustment

Empirical analysis by Stoehr and Friel (2015) and Bouranis et al. (2017) showed that the Bayesian estimators resulting from using the pseudolikelihood function as a plug-in for the true likelihood function are biased and their variance can be underestimated. It is, therefore, natural to consider a correction of the mode of the pseudolikelihood approximation. Paramount to the approach is the ability to estimate the maxima of the likelihood and the pseudolikelihood,

θ̂MLE = arg max_θ log f(y | θ),   θ̂MPLE = arg max_θ log fPL(y | θ).    (7)

While the MPLE is fast and straightforward to obtain because of the closed form of the pseudolikelihood, care is needed when estimating the MLE for ERGMs. MCMC estimation procedures (Snijders, 2002) can be considered for approximating a solution to (7). By construction of g in (5) we have

θ̂MLE = arg max_θ log fPL(y | g(θ)).
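To see why the MPLE is cheap, consider the edges-only toy case, where every dyad has change statistic 1 and the MPLE reduces to the logit of the observed density. This closed form is a hypothetical special case used purely for illustration; for richer statistics one runs a logistic regression of yij on the change statistics instead.

```python
import math

def mple_edges(y):
    """MPLE for a toy edges-only model: dyads are independent
    Bernoulli(σ(θ)) variables, so θ̂_MPLE = logit(observed density)."""
    n = len(y)
    ties = sum(y[i][j] for i in range(n) for j in range(i + 1, n))
    dyads = n * (n - 1) // 2
    p = ties / dyads
    return math.log(p / (1 - p))

# hypothetical 4-node network: 4 ties out of 6 dyads
y = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
print(mple_edges(y))  # log(2) ≈ 0.693
```

In contrast, no such shortcut exists for θ̂MLE of an edge-dependent model, which is why MCMC-based maximum likelihood procedures are needed.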

4.2. Curvature adjustment

Composite likelihoods have previously been shown to modify the correlation between the variables (Stoehr and Friel, 2015). The mapping in (5) ensures that the adjusted pseudolikelihood and the full likelihood have the same mode and aims to recover the overall geometry of the distribution


(Figure 1). We choose the transformation matrix W that satisfies

∇θ log f(y | θ)|θ̂MLE = Wᵀ ∇θ log fPL(y | θ)|θ̂MPLE,
∇²θ log f(y | θ)|θ̂MLE = Wᵀ ∇²θ log fPL(y | θ)|θ̂MPLE W,    (8)

so that the gradient and the Hessian of the log-likelihood and of log f̃(y | θ) are the same. It is possible to estimate the gradient and the Hessian of the log-likelihood using the following two identities:

∇θ log f(y | θ) = s(y) − z′(θ)/z(θ) = s(y) − ∑_{y∈Y} s(y) exp{θᵀs(y)} / ∑_{y∈Y} exp{θᵀs(y)} = s(y) − E_{y|θ}[s(y)],

and

∇²θ log f(y | θ) = −[z′′(θ)z(θ) − z′(θ)z′(θ)ᵀ] / z²(θ) = −(E_{y|θ}[s(y)s(y)ᵀ] − E_{y|θ}[s(y)] E_{y|θ}[s(y)]ᵀ) = −V_{y|θ}[s(y)],

where V_{y|θ}[s(y)] denotes the covariance matrix of s(y) with respect to f(y | θ). The presence of the normalising term renders exact evaluation of E_{y|θ}[s(y)] and V_{y|θ}[s(y)] intractable. We resort to Monte Carlo sampling from f(y | θ) in order to estimate these quantities. The Hessian matrices at the maximum of their respective distributions are negative definite and therefore admit a Cholesky decomposition,

−∇²θ log f(y | θ)|θ̂MLE = NᵀN,   −∇²θ log fPL(y | θ)|θ̂MPLE = MᵀM,    (9)

which yields W = M⁻¹N (Ribatet et al., 2012).

4.3. Magnitude adjustment

The magnitude adjustment aims to scale the mode and curvature-adjusted pseudolikelihood to the appropriate magnitude by performing a linear transformation of the vertical axis. The constant C in (6) is defined so that f̃(y | θ̂MLE) = f(y | θ̂MLE), which implies

C = q(y | θ̂MLE) · z⁻¹(θ̂MLE) / fPL(y | g(θ̂MLE)).    (10)
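Putting the mode and curvature pieces together, a minimal sketch of the mapping g with W = M⁻¹N follows, using hand-rolled 2×2 linear algebra and made-up Hessian estimates in place of the Monte Carlo estimates described above.

```python
import math

def chol_upper(a):
    """Upper-triangular Cholesky factor of a 2x2 SPD matrix: A = NᵀN."""
    n00 = math.sqrt(a[0][0])
    n01 = a[0][1] / n00
    n11 = math.sqrt(a[1][1] - n01 * n01)
    return [[n00, n01], [0.0, n11]]

def matvec(a, x):
    return [sum(a[i][k] * x[k] for k in range(len(x))) for i in range(len(a))]

def backsolve(u, b):
    """Solve U x = b for 2x2 upper-triangular U, i.e. apply U⁻¹."""
    x1 = b[1] / u[1][1]
    x0 = (b[0] - u[0][1] * x1) / u[0][0]
    return [x0, x1]

def g(theta, theta_mle, theta_mple, neg_hess_lik, neg_hess_pl):
    """g(θ) = θ̂_MPLE + W(θ − θ̂_MLE) with W = M⁻¹N, where
    −∇² log f |MLE = NᵀN and −∇² log f_PL |MPLE = MᵀM (eqs (5), (9))."""
    n = chol_upper(neg_hess_lik)
    m = chol_upper(neg_hess_pl)
    diff = [t - s for t, s in zip(theta, theta_mle)]
    w_diff = backsolve(m, matvec(n, diff))   # W(θ − θ̂_MLE) = M⁻¹ N (θ − θ̂_MLE)
    return [s + d for s, d in zip(theta_mple, w_diff)]

# made-up estimates: at θ = θ̂_MLE the map returns θ̂_MPLE, as required
theta_mle, theta_mple = [0.2, -0.1], [0.5, 0.0]
print(g(theta_mle, theta_mle, theta_mple,
        [[4.0, 0.2], [0.2, 1.0]], [[2.0, 0.1], [0.1, 0.5]]))  # [0.5, 0.0]
```

By construction the adjusted pseudolikelihood then attains its maximum at θ̂MLE, matching the mode condition of Section 4.1.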

Since z(θˆ MLE ) is intractable, we estimate C by replacing the normalising constant with an estimator which we now describe, following Friel (2013). We introduce an auxiliary variable t ∈ [0, 1]


discretised as 0 = t0 < t1 < . . . < tL = 1 and consider the distributions

f(y | tjθ) = q(y | tjθ) / z(tjθ) = exp{(tjθ)ᵀs(y)} / ∑_{y∈Y} exp{(tjθ)ᵀs(y)},   j ∈ {0, . . ., L}.

An estimate of z(θ̂MLE) can be obtained using

z(θ̂MLE) / z(0) = z(tLθ̂MLE) / z(t0θ̂MLE) = ∏_{j=0}^{L−1} z(tj+1θ̂MLE) / z(tjθ̂MLE),    (11)

where z(0) = 2^{n(n−1)/2} for undirected graphs and n is the number of nodes. Note that in the case of a Potts/autologistic model, z(0) = 2^N, where N is the size of the lattice. Importance sampling is used to estimate the ratios of normalising constants in (11). We take the un-normalised likelihood q(y | tjθ) as an importance distribution for the "target" distribution f(y | tjθ), noting that

z(tj+1θ̂MLE) / z(tjθ̂MLE) = E_{y|tjθ̂MLE}[ q(y | tj+1θ̂MLE) / q(y | tjθ̂MLE) ].

An unbiased importance sampling estimate of this expectation can be obtained by simulating multiple draws y1^(j), . . ., yK^(j) ∼ f(y | tjθ̂MLE), yielding

ẑ(tj+1θ̂MLE) / ẑ(tjθ̂MLE) = (1/K) ∑_{k=1}^{K} q(yk^(j) | tj+1θ̂MLE) / q(yk^(j) | tjθ̂MLE).
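The telescoping-product estimator can be checked on a toy model where z(θ) is available exactly. The sketch below assumes an edges-only model (an illustrative choice), so that dyads are independent Bernoulli variables, exact simulation from each tempered distribution is trivial, and z(θ) = (1 + e^θ)^dyads in closed form.

```python
import math
import random

random.seed(1)

def estimate_log_z(theta, dyads, L=20, K=200):
    """log ẑ(θ) = log z(0) + Σ_j log[(1/K) Σ_k q(y_k | t_{j+1}θ)/q(y_k | t_jθ)],
    with y_k drawn from f(y | t_jθ); here s(y) = number of ties."""
    ts = [l / L for l in range(L + 1)]
    log_z = dyads * math.log(2.0)                    # z(0) = 2^dyads
    for j in range(L):
        p = 1.0 / (1.0 + math.exp(-ts[j] * theta))   # exact draws from f(y | t_jθ)
        acc = 0.0
        for _ in range(K):
            ties = sum(random.random() < p for _ in range(dyads))
            acc += math.exp((ts[j + 1] - ts[j]) * theta * ties)
        log_z += math.log(acc / K)
    return log_z

exact = 10 * math.log1p(math.exp(0.4))     # log z(θ) = dyads * log(1 + e^θ)
print(estimate_log_z(0.4, dyads=10), exact)  # both ≈ 9.13
```

For a real ERGM the exact draws above would be replaced by MCMC simulation from each tempered distribution; the telescoping structure of the estimator is unchanged.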

Increasing the number of temperatures L and the number of simulated graphs will lead to a more precise estimate of z(θ̂MLE) but will also increase the computational burden. As we shortly illustrate, however, this does not add significantly to the overall computational cost. We note that estimation of z(θ̂MLE) is performed once upfront for each competing model. The estimate of C is

Ĉ = q(y | θ̂MLE) · ẑ⁻¹(θ̂MLE) / fPL(y | g(θ̂MLE)).

5. Approximation of the Bayes' factor

Replacing the likelihood with the unadjusted pseudolikelihood approximation in Bayes' formula yields the within-model pseudo-posterior distribution

πPL(θm | y, Mm) = fPL(y | θm, Mm) p(θm | Mm) / πPL(y | Mm) = fPL(y | θm, Mm) p(θm | Mm) / ∫_{Θm} fPL(y | θm, Mm) p(θm | Mm) dθm.    (12)

In analogy to (3), the Bayes' factor based on the unadjusted pseudolikelihood approximation is

BF^PL_{m,m′} = πPL(y | Mm) / πPL(y | Mm′) = ∫_{Θm} fPL(y | θm, Mm) p(θm | Mm) dθm / ∫_{Θm′} fPL(y | θm′, Mm′) p(θm′ | Mm′) dθm′.
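Once log-evidence estimates are available, whether from the unadjusted or the adjusted pseudolikelihood, Bayes' factors and posterior model probabilities follow by simple arithmetic. A numerically stable sketch, with made-up log-evidence values for two hypothetical models:

```python
import math

def model_probs(log_evidence, log_prior=None):
    """Posterior model probabilities π(M_m | y) from log-evidences,
    computed with the log-sum-exp trick to avoid underflow."""
    if log_prior is None:
        log_prior = [0.0] * len(log_evidence)   # uniform prior over models
    log_post = [le + lp for le, lp in zip(log_evidence, log_prior)]
    shift = max(log_post)
    w = [math.exp(v - shift) for v in log_post]
    total = sum(w)
    return [v / total for v in w]

# hypothetical estimates: log BF_{1,2} = -150.2 - (-153.4) = 3.2
print(model_probs([-150.2, -153.4]))  # ≈ [0.961, 0.039]
```

Working on the log scale matters here: evidences of real networks are of the order e^{-150} or smaller and would underflow if exponentiated directly.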


A naive implementation of the pseudolikelihood or any other higher-order composite likelihood is likely to give misleading marginal likelihood estimates, as we illustrate in Section 6. Having completed the adjustment steps, we propose to approximate the within-model posterior distribution by

π̃(θ | y, Mm) = f̃(y | θm, Mm) p(θm | Mm) / π̃(y | Mm) = f̃(y | θm, Mm) p(θm | Mm) / ∫_{Θm} f̃(y | θm, Mm) p(θm | Mm) dθm.    (13)

Working with (13) we can now approximate (3) by

BF̃_{m,m′} = π̃(y | Mm) / π̃(y | Mm′) = ∫_{Θm} f̃(y | θm, Mm) p(θm | Mm) dθm / ∫_{Θm′} f̃(y | θm′, Mm′) p(θm′ | Mm′) dθm′.

The aforementioned framework offers one possibility in the Bayesian setting to obtain an approximation to the within-model posterior distribution. Another possibility has been explored by Bouranis et al. (2017), whose ERGM experiments showed that estimation with the pseudo-posterior distribution is biased. The authors presented an algorithm to draw an approximate sample from the intractable posterior distribution π(θ | y). Their suggested approach first samples from the pseudo-posterior distribution (12). Then an invertible and differentiable mapping φ : Θ → Θ is considered to transform the entire sample {θi}_{i=1}^{T} so that it is a sample from an approximation of the posterior distribution, whose density is

π̂(θ | y, Mm) = πPL(φ⁻¹(θ) | y, Mm) · |∂φ⁻¹(θ)/∂θ|.

Following this approach and applying a change of variables, the model evidence π(y | Mm) is approximated by

π̂(y | Mm) = ∫_Θ πPL(φ⁻¹(θ), y | Mm) |∂φ⁻¹(θ)/∂θ| dθ = πPL(y | Mm),

and it follows that there is no gain from the transformation of the pseudo-posterior distribution when the aim is to obtain a reasonable approximation of the marginal likelihood. As such, while the correction algorithm of Bouranis et al. (2017) is appropriate for conducting Bayesian inference on the model parameters, it is not suitable for model selection.

6. Potts simulation study

The Ising model has been a popular approach to modeling spatial binary data y = {y1, . . ., yN} ∈ {−1, 1}^N on a lattice of size N = υ × ν. A lattice with N nodes has 2^N possible realizations; the normalising constant z(θ) in (1) is a summation over all of these realizations and becomes analytically intractable even for moderately sized lattices. The autologistic model (Besag, 1972) extends the Ising model to allow for unequal abundances of each state value, while the Potts model (Potts, 1952) allows each lattice point to take one of S ≥ 2 possible values/states.
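For a first-order neighbourhood structure, the Potts sufficient statistic simply counts agreeing neighbour pairs. A small sketch on a hypothetical 3 × 3 grid of states (the grid values are made up for illustration):

```python
def potts_stat(x, width, height):
    """s(y) = Σ_{i~j} 1{y_i = y_j}: number of agreeing first-order
    (horizontal/vertical) neighbour pairs, each pair counted once."""
    s = 0
    for r in range(height):
        for c in range(width):
            if c + 1 < width and x[r][c] == x[r][c + 1]:
                s += 1
            if r + 1 < height and x[r][c] == x[r + 1][c]:
                s += 1
    return s

grid = [[1, 1, 2],
        [1, 2, 2],
        [2, 2, 2]]
print(potts_stat(grid, 3, 3))  # 8
```

This is the one-dimensional statistic whose intractable normalising constant the 15 × 15 lattice experiment below works around.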


Figure 2: Example of a first-order neighborhood graph. The closest neighbors of the node in red are represented by nodes in blue.

In this example we investigate the efficiency of the approximation to the marginal likelihood when the likelihood is replaced by f̃(y | θm, Mm), with a small dataset for which we can carry out exact computations. 30 realizations from an isotropic 2-state Potts model with interaction parameter θ = 0.4 defined on a lattice of size 15 × 15 were exactly sampled via Friel and Rue (2007) and Stoehr et al. (2016). The sufficient statistic for the Potts model is the number of corresponding neighbors in the graph,

s(y) = ∑_{i∼j} 1{yi = yj},

where i ∼ j denotes a pair of first-order neighbors, each pair counted once.

There is a striking difference between the unadjusted pseudolikelihood-based estimates of the evidence and those estimates based on the fully adjusted pseudolikelihood. All Bayes' factor estimates slightly favour Model 1, which is in contrast with the Auto-RJ results and the results based on fully adjusted pseudolikelihoods. Therefore, we can conclude that the benefits of conducting model selection for this network based on the unadjusted pseudolikelihood approximation are reduced. All in all, we highly recommend approximation of the evidence with the fully adjusted pseudolikelihood, which comes at a negligible computational cost.

8. Discussion

In this paper we have presented a novel approach to marginal likelihood estimation for models with intractable normalising constants, which we applied to the challenging setting of exponential random graph models for social network analysis. We approximated a doubly intractable posterior distribution with an intractable likelihood by a "standard" singly intractable posterior distribution

with a tractable likelihood approximation. Our methodology is highly compatible with a plethora of evidence estimation techniques from the Bayesian toolbox.

Our experiments suggest that the one-block Metropolis-Hastings approach yields marginal likelihood estimators with low variability. It comes at a low computational cost and is suitable because it requires no further tuning of the MCMC algorithm, and the low-dimensional parameter spaces in our experiments allow the parameter vector to be updated in a single block. For higher-dimensional MCMC problems, though, multi-block Metropolis-within-Gibbs updating strategies will be more suitable (Chib and Jeliazkov, 2001). The Power posteriors algorithm can come at a higher computational cost, depending on the inverse temperature scheme. We suggest the use of the improved trapezoidal scheme of Friel et al. (2014) and the control variate technique (Oates et al., 2016) to vastly improve the statistical efficiency of the evidence estimate. The Stepping stones estimators suffer from high variability.

We note that our approach avoids the heavy computational burden of repeated likelihood simulations, as in Friel (2013) and Everitt et al. (2017). Here the likelihood simulations are performed only once for each competing model. The empirical results presented above suggest that our approach gives estimates of the Bayes' factor similar to those of the computationally intensive population exchange algorithm, but at a fraction of the time. Simulation procedures for approximating a solution to the likelihood equation are more challenging when larger datasets are considered, and any approach with increased dependence on likelihood simulations will be infeasible under these conditions. Our work should offer a more scalable approach.
Overall, the modest computational effort required when working with our likelihood approximation extends the applicability of the proposed approach to other complex models such as Gaussian Markov random fields and autologistic models; it will be particularly interesting to apply our proposed framework to large grids and to networks with hundreds of nodes or models with many parameters.

Appendix A. Methods to compute the model evidence

Appendix A.1. Chib and Jeliazkov's method

A popular method of marginal likelihood estimation within each model is that of Chib (1995). Following from Bayes' formula, the model evidence is given by the basic marginal likelihood identity

π(y) = f(y | θ) p(θ) / π(θ | y).    (A.1)

An estimate of the marginal likelihood,

log π(y) = log f(y | θ†) + log p(θ†) − log π(θ† | y),

is obtained by evaluating this expression on the log scale at some specific point θ† ∈ Θ. By approximating the true likelihood with the adjusted pseudolikelihood and π(θ | y) with π̃(θ | y), the first two terms are available by direct calculation. However, an estimate of the posterior ordinate π̃(θ† | y) is still required. For estimation efficiency, the point θ† is chosen in a high density region of the support Θ.

An extension of this work is provided in Chib and Jeliazkov (2001), where output from a Metropolis-Hastings algorithm for the posterior π̃(θ | y) can be used to estimate the model evidence. Among others, the authors suggested the one-block Metropolis-Hastings approach for estimating the evidence, in the case where the parameter vector θ can be updated in a single block.


For a Metropolis-Hastings transition from θ to θ′, let h(θ, θ′) denote the candidate generating density and α̃(θ, θ′) denote the probability of accepting the proposed move when the corresponding Markov chain has π̃(θ | y) as the limiting distribution. The proposed estimate of the posterior ordinate at θ† is

π̂(θ† | y) = [M⁻¹ ∑_{m=1}^{M} α̃(θ^(m), θ†) h(θ^(m), θ†)] / [L⁻¹ ∑_{l=1}^{L} α̃(θ†, θ^(l))],

where {θ^(m)} are the sampled draws from the (corrected) pseudo-posterior and {θ^(l)} are the draws from h(θ†, θ) given the fixed point θ†. This gives

log π̃(y) = log f̃(y | θ†) + log p(θ†) − log π̂(θ† | y).
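Given precomputed acceptance probabilities and proposal density values, the ordinate estimate is just a ratio of two Monte Carlo averages. A sketch with hypothetical numbers standing in for the MCMC output:

```python
import math

def cj_log_posterior_ordinate(alpha_to_star, h_to_star, alpha_from_star):
    """Chib-Jeliazkov ordinate estimate at θ†: the numerator averages
    α̃(θ^(m), θ†) h(θ^(m), θ†) over posterior draws, the denominator
    averages α̃(θ†, θ^(l)) over draws from the proposal h(θ†, ·)."""
    num = sum(a * h for a, h in zip(alpha_to_star, h_to_star)) / len(alpha_to_star)
    den = sum(alpha_from_star) / len(alpha_from_star)
    return math.log(num / den)

# hypothetical acceptance probabilities and proposal densities
print(cj_log_posterior_ordinate([1.0, 1.0], [0.5, 0.5], [0.25, 0.75]))  # 0.0
```

Subtracting this ordinate from log f̃(y | θ†) + log p(θ†) then yields the evidence estimate above; with the adjusted pseudolikelihood every ingredient is tractable.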

Chib and Jeliazkov (2001) generalised the method to settings with high-dimensional parameter spaces, which may require sampling of the parameters in several smaller blocks, e.g. with a Metropolis-within-Gibbs strategy.

Appendix A.2. Power posteriors

Friel and Pettitt (2008) and Friel et al. (2014) demonstrated how the marginal likelihood can be computed via Markov chain Monte Carlo methods on modified posterior distributions for each model with the method of thermodynamic integration. They denote the power posterior by

πt(θ | y) ∝ f(y | θ)^t p(θ),   t ∈ [0, 1],

where z(y | t) = ∫_Θ f(y | θ)^t p(θ) dθ is the corresponding normalising constant. The inverse temperature t ∈ [0, 1] has the effect of tempering the likelihood; at the extreme ends of the inverse temperature range, πt=0(θ | y) and πt=1(θ | y) correspond to the prior and the posterior, respectively. Here we assume a proper prior, so that πt(θ | y) is proper and z(y | t) exists for all t ∈ [0, 1]. When working with the corrected pseudo-posterior distribution, the respective version of the power posterior is

π̃t(θ | y) ∝ f̃(y | θ)^t p(θ),   t ∈ [0, 1].    (A.2)

The normalising constant is then expressed by

z(y | t) = ∫_Θ f̃(y | θ)^t p(θ) dθ.

At zero temperature the integration is over the prior with respect to θ, thus z(y | t = 0) = 1. We note that

(d/dt) log z(y | t) = E_{θ|y,t} log f̃(y | θ),

therefore the log-evidence is given by

[log z(y | t)]_{t=0}^{t=1} = log z(y | t = 1) = ∫₀¹ E_{θ|y,t} log f̃(y | θ) dt.    (A.3)

To form an estimator based on (A.3), the inverse temperature range t ∈ [0, 1] is discretised as 0 = t0 < t1 < . . . < tm = 1. A trapezoidal rule can be used to approximate the log-evidence:

log π̃(y) ≈ ∑_{j=1}^{m} (tj − tj−1) [E_{θ|y,tj−1} log f̃(y | θ) + E_{θ|y,tj} log f̃(y | θ)] / 2.

For each tj, an MCMC sample from the posterior π̃tj(θ | y) can be used to estimate E_{θ|y,tj} log f̃(y | θ), using a burn-in phase of B < N iterations:

E_{θ|y,tj} log f̃(y | θ) ≈ (1/(N − B)) ∑_{i=B+1}^{N} log f̃(y | θj^(i)).

Three sources of error are present: error from the discretisation of the temperature scheme, Monte Carlo error from approximating E_{θ|y,tj} log f̃(y | θ), and error coming from using the corrected pseudo-posterior. Estimating the marginal likelihood via the power posterior approach is relatively straightforward but computationally costly. To minimise the cost, it is desirable to use as few t values as possible. However, choosing the temperature schedule presents an immediate difficulty with this approach. A revised scheme by Friel et al. (2014) used an improved trapezoidal scheme for a more accurate quadrature approximation, which also requires the variance of the log likelihood. The authors proposed reducing the bias in the estimation of the marginal likelihood by observing that differentiation of E_{θ|y,t} log f̃(y | θ) with respect to t yields (d/dt) E_{θ|y,t} log f̃(y | θ) = V_{θ|y,t} log f̃(y | θ). They improved upon the standard trapezoidal rule used to numerically integrate the expected log deviance by incorporating this derivative information at a minimal extra computational cost, yielding

log π̃(y) ≈ ∑_{j=1}^{m} { (tj − tj−1) [E_{θ|y,tj−1} log f̃(y | θ) + E_{θ|y,tj} log f̃(y | θ)] / 2
− [(tj − tj−1)² / 12] [V_{θ|y,tj} log f̃(y | θ) − V_{θ|y,tj−1} log f̃(y | θ)] },    (A.4)

having replaced the true likelihood with the pseudolikelihood approximation. The results presented in Section 7 are based on this improved trapezoidal scheme.
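The plain trapezoidal estimator of the thermodynamic integral is straightforward to sketch; here the tempered expectations are made-up stand-ins for the MCMC averages described above.

```python
def log_evidence_ti(expected_logL, ts):
    """Trapezoidal thermodynamic-integration estimate of log π̃(y):
    approximates ∫₀¹ E_{θ|y,t}[log f̃(y|θ)] dt given the expectation
    at each inverse temperature t_j (in practice, MCMC averages)."""
    total = 0.0
    for j in range(1, len(ts)):
        total += (ts[j] - ts[j - 1]) * (expected_logL[j - 1] + expected_logL[j]) / 2.0
    return total

# hypothetical tempered expectations, linear in t for illustration
ts = [j / 10 for j in range(11)]
e = [-200.0 + 60.0 * t for t in ts]
print(log_evidence_ti(e, ts))  # trapezoid of a linear function is exact: -170
```

The variance-corrected scheme (A.4) would subtract an additional (Δt²/12) term per panel, built from the tempered variances, at essentially no extra cost since both moments come from the same MCMC output.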

Appendix A.3. Applying control variates to evidence estimation

A key concern about the thermodynamic integral is that the corresponding evidence estimator can suffer from high variability. Oates et al. (2016) extended the zero-variance (ZV) control variate technique (Assaraf and Caffarel, 1999; Mira et al., 2013), aiming to improve the estimator by decreasing its variance. Together with the numerical integration scheme (A.4), the authors showed that this can yield a dramatic improvement in the statistical efficiency of the evidence estimate by efficiently estimating $\mathbb{E}_{\theta \mid y, t_j} \log \tilde{f}(y \mid \theta)$ and $\mathbb{V}_{\theta \mid y, t_j} \log \tilde{f}(y \mid \theta)$ for each temperature $t_j \in [0, 1]$, at very little extra computational cost.

The basic idea behind the control variate technique is to estimate the posterior expectation $\mathbb{E}_{\theta \mid y}[k(\theta)]$ by constructing a modified function $\tilde{k}(\theta) = k(\theta) + \phi_1 h_1(\theta) + \dots + \phi_m h_m(\theta)$ such that $\tilde{k}(\theta)$ has the same posterior expectation as $k(\theta)$ but reduced posterior variance. This requires that each of the "control variates" $h_j(\theta)$ has zero posterior expectation, that the collection $[h_1(\theta), \dots, h_m(\theta)]$ has strong posterior canonical correlation with the target function $k(\theta) = \log \tilde{f}(y \mid \theta)$, and that the coefficients $\phi_1, \dots, \phi_m$ are appropriately selected.

Mira et al. (2013) and Oates et al. (2016) considered the class of control variates that are expressed as functions of the score vector $u(\theta \mid y)$ of the target density. Taking the target density to be the power posterior $\tilde{\pi}_t(\theta \mid y)$, the score vector is
\[
u(\theta \mid y, t) = \nabla_\theta \log \tilde{\pi}_t(\theta \mid y) = t \cdot \nabla_\theta \log \tilde{f}(y \mid \theta) + \nabla_\theta \log p(\theta).
\]

Following Oates et al. (2016), the ZV control variates are
\[
h(\theta \mid y, t) = \Delta_\theta \left[ P(\theta \mid \phi(y, t)) \right] + \nabla_\theta \left[ P(\theta \mid \phi(y, t)) \right] \cdot u(\theta \mid y, t),
\]
where $\Delta_\theta = \nabla_\theta \cdot \nabla_\theta$ is the Laplacian operator and the "trial function" $P(\cdot)$ belongs to the family $\mathcal{P}$ of polynomials in $\theta$. The coefficients $\phi \equiv \phi(y, t)$ of the polynomial $P$ depend on both the data $y$ and the inverse temperature $t$. This framework is highly compatible with the adjustment strategy proposed in this paper: replacing the true (intractable) likelihood by the tractable likelihood approximation yields ZV control variates that are available in closed form. The "controlled thermodynamic integral" (CTI) is then
\[
\log \tilde{\pi}(y) = \int_0^1 \mathbb{E}_{\theta \mid y, t} \left[ \log \tilde{f}(y \mid \theta) + h(\theta \mid y, t) \right] dt.
\]

In this work we restrict our attention to degree 2 (quadratic) polynomials. Denoting the model dimension by $d$, second degree polynomials can be expressed as $P(\theta) = c^T \theta + \frac{1}{2} \theta^T B \theta$, where $c$ is $d \times 1$ and $B$ is $d \times d$. This leads to ZV control variates of the form
\[
h(\theta \mid y, t) = \mathrm{tr}(B) + (c + B\theta)^T u(\theta \mid y, t),
\]
where $c$ and $B$ denote the quadratic polynomial coefficients and $\mathrm{tr}(B)$ is the trace of $B$. It is assumed that $B$ is symmetric, although this is not required in general. See Oates et al. (2016) for further discussion of the ZV strategy with degree 2 polynomials.

Estimation of the control variates is performed with the same MCMC samples that are used to estimate $\mathbb{E}_{\theta \mid y, t} \log \tilde{f}(y \mid \theta)$ and $\mathbb{V}_{\theta \mid y, t} \log \tilde{f}(y \mid \theta)$. The control variates are stored along each Markov chain and their sample covariance is computed after the MCMC has terminated, leading to only a negligible increase in the total computational cost. The optimal choice of polynomial coefficients, $\phi^*(y, t)$, that minimises the variance of the estimator of the model log-evidence is given by
\[
\phi^*(y, t) \approx -\hat{\mathbb{V}}_{\theta \mid y, t} \left[ u(\theta \mid y, t) \right]^{-1} \hat{\mathbb{C}}_{\theta \mid y, t} \left[ \log \tilde{f}(y \mid \theta), u(\theta \mid y, t) \right],
\]
where $\hat{\mathbb{V}}_{\theta \mid y, t}[u(\theta \mid y, t)]$ and $\hat{\mathbb{C}}_{\theta \mid y, t}[\log \tilde{f}(y \mid \theta), u(\theta \mid y, t)]$ denote the estimated variance and cross-covariance matrices, respectively. To implement the second order quadrature rule proposed by Friel et al. (2014) we used the identity
\[
\mathbb{V}_{\theta \mid y, t} \left[ \log \tilde{f}(y \mid \theta) + h(\theta \mid y, t) \right] = \mathbb{V}_{\theta \mid y, t} \left[ \log \tilde{f}(y \mid \theta) \right] + \mathbb{V}_{\theta \mid y, t} \left[ h(\theta \mid y, t) \right] + 2\, \mathbb{C}_{\theta \mid y, t} \left[ \log \tilde{f}(y \mid \theta), h(\theta \mid y, t) \right].
\]
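The degree-2 ZV construction can be illustrated on a toy problem where the score is available exactly: a standard normal target, whose score is $u(\theta) = -\theta$, and the target function $k(\theta) = \theta^2$, whose expectation is 1. This is a minimal sketch of the machinery (control variates built from the score, coefficients estimated from the same samples), not the ERGM setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target: theta ~ N(0, 1), so the score is u(theta) = d/dtheta log pi = -theta.
theta = rng.normal(size=100_000)
u = -theta

# Target function whose expectation we want: k(theta) = theta^2 (true value 1).
k = theta**2

# Degree-2 polynomial P(theta) = c*theta + 0.5*B*theta^2 generates two
# control variates of the form h = Laplacian(P) + grad(P) * u:
h1 = u                # from the c-term, P = theta
h2 = 1.0 + theta * u  # from the B-term, P = theta^2 / 2
H = np.column_stack([h1, h2])

# Optimal coefficients phi* = -Var[H]^{-1} Cov[H, k], estimated from samples.
cov = np.cov(np.column_stack([H, k]), rowvar=False)
phi = -np.linalg.solve(cov[:2, :2], cov[:2, 2])

k_zv = k + H @ phi  # same expectation as k, (much) smaller variance

print(k.mean(), k_zv.mean())  # both close to 1
print(k.var(), k_zv.var())    # the ZV variance collapses
```

For this particular target the optimal coefficients are $c = 0$, $B = 1$, so the controlled function is exactly constant; in general the variance is only reduced, not eliminated.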

If $\log \tilde{f}(y \mid \theta)$ and $h(\theta \mid y, t)$ are strongly negatively correlated, so that the magnitude of the covariance term on the right-hand side exceeds the variance of $h(\theta \mid y, t)$, then a variance improvement has been made over the original estimation problem.

Appendix A.4. The Stepping stones sampler

The Stepping stones sampler (Xie et al., 2011) uses the idea of powered posteriors (A.2), treating them as a series of intermediate distributions between the prior and the posterior. Based on importance sampling, the intermediate distributions are utilised as importance distributions, avoiding numerical integration. Following the notation of Friel et al. (2014), the method generates samples from each of the power posteriors from $t_0 = 0$ up to $t_{m-1}$, estimating the ratio of consecutive normalising constants
\[
r_k = \frac{z(y \mid t_{k+1})}{z(y \mid t_k)} = \int_\Theta \tilde{f}(y \mid \theta)^{t_{k+1} - t_k} \, \tilde{\pi}_{t_k}(\theta \mid y) \, d\theta
\]
with
\[
\hat{r}_k = \frac{1}{N - B} \sum_{i=B+1}^{N} \tilde{f}(y \mid \theta^{(i)})^{t_{k+1} - t_k}, \quad k = 0, \dots, m - 1.
\]
Here $N - B$ denotes the number of MCMC samples post burn-in, and the corresponding sampled values $\{\theta^{(i)}\}$ are drawn from $\tilde{\pi}_{t_k}$. Assuming that the prior is normalised, the final estimate of the model evidence will be the product of these $m$ independent estimates, $\prod_{k=0}^{m-1} \hat{r}_k$, or on the log scale:
\[
\log \tilde{\pi}(y) = \sum_{k=0}^{m-1} \log \hat{r}_k.
\]
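A minimal sketch of the Stepping stones estimator on a toy conjugate model (one Gaussian observation with a Gaussian prior), where each power posterior is Gaussian and can be sampled exactly and the true log evidence is known in closed form. The temperature ladder and sample sizes are illustrative choices; the ratios are accumulated on the log scale for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: single observation y ~ N(theta, 1), prior theta ~ N(0, 1),
# so the power posterior at t is N(t*y / (1 + t), 1 / (1 + t)).
y = 1.3

def loglik(theta):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2

m, n_samples = 25, 50_000
ts = (np.arange(m + 1) / m) ** 5  # powered temperature ladder

log_ev = 0.0
for k in range(m):
    # Samples from the power posterior at t_k estimate r_k = z(t_{k+1})/z(t_k).
    prec = 1.0 + ts[k]
    theta = rng.normal(ts[k] * y / prec, np.sqrt(1.0 / prec), size=n_samples)
    a = (ts[k + 1] - ts[k]) * loglik(theta)
    # log r_k via a stable log-mean-exp
    amax = a.max()
    log_ev += amax + np.log(np.mean(np.exp(a - amax)))

# Analytic log-evidence: marginally y ~ N(0, 2).
exact = -0.5 * np.log(2 * np.pi * 2) - y**2 / 4
print(log_ev, exact)
```

Because consecutive power posteriors overlap substantially under this ladder, each importance-sampling ratio is well behaved; with poorly placed temperatures the weights $\tilde{f}(y \mid \theta)^{t_{k+1}-t_k}$ can become highly variable.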

The Stepping stones estimator is unbiased for the marginal likelihood and slightly biased for the log marginal likelihood. Xie et al. (2011) compared the performance of the Stepping stones sampler to that of power posteriors for estimating the log evidence. Their findings indicated that the Stepping stones approach slightly outperformed the power posterior approach, but the two were comparable when the inverse temperature allocation was well chosen or when the number of inverse temperatures was large. The power posterior estimates performed relatively poorly when few inverse temperatures were used or when they were badly placed. We observed the same comparable behaviour in our experiments for a large number of rungs in the temperature ladder. Since the computational burden is mostly due to the simulation of the tempered distributions, one can easily calculate both estimates.

Appendix A.5. Population exchange algorithm

A population-based MCMC extension of the exchange algorithm (Møller et al., 2006), leading to realisations from the posterior distribution $\pi(\theta \mid y)$, was presented by Friel (2013). The algorithm was modified to allow for an unbiased estimate of the normalising constant of the likelihood, $z(\theta)$, for each draw $\theta$ from the posterior distribution. The idea behind the method is to augment the target distribution with a sequence of tempered distributions that moves slowly from the prior, $\pi_{t_0}$, to the posterior, $\pi_{t_n}$. The augmented target distribution is
\[
\pi_{t_0}(\theta_0 \mid y) \times \dots \times \pi_{t_n}(\theta_n \mid y), \quad t_j \in [0, 1],
\]

where $\pi_{t_j}(\theta_j \mid y) \propto f(y \mid \theta_j)^{t_j} p(\theta_j)$. The distribution of each chain in the population is further augmented in order to tackle the problem of likelihood intractability, yielding
\[
\pi_{t_0}(\theta_0, \theta'_0, y'_{01}, \dots, y'_{0s} \mid y) \times \dots \times \pi_{t_n}(\theta_n, \theta'_n, y'_{n1}, \dots, y'_{ns} \mid y),
\]
where $\theta'_j$ is the auxiliary parameter value for the swap/exchange move and $y'_{j1}, \dots, y'_{js} \sim f(y \mid \theta'_j)^{t_j}$ are draws from the tempered likelihood. At iteration $i$ of the Markov chain, these auxiliary draws are used to estimate the normalising constant by
\[
\hat{z}(\theta_n^{(i)}) = \prod_{j=0}^{n-1} \left[ \frac{1}{s} \sum_{k=1}^{s} \frac{q(y_{jk}^{\prime (i)} \mid t_{j+1}, \theta_{j+1}^{(i)})}{q(y_{jk}^{\prime (i)} \mid t_j, \theta_j^{(i)})} \right] \times z(0).
\]
The algorithmic output includes draws $\{\theta_n^{(i)}\}$ from the posterior distribution and associated estimates $\{\hat{z}(\theta_n^{(i)})\}$, which are used to approximate the marginal likelihood (A.1). Friel (2013) estimates the model evidence by
\[
\hat{\pi}(y) = \frac{1}{r} \sum_{b=1}^{r} \hat{\pi}_{\theta_b}(y) = \frac{1}{r} \sum_{b=1}^{r} \frac{q(y \mid \theta_b) \, p(\theta_b)}{\hat{z}(\theta_b) \, \hat{\pi}(\theta_b \mid y)},
\]

for a range of draws $\{\theta_b\}_{b=1}^{r}$ from the high posterior density region, which requires kernel density estimation to estimate $\hat{\pi}(\theta_b \mid y)$.

Acknowledgments

The Insight Centre for Data Analytics is supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289. Nial Friel's research was also supported by a Science Foundation Ireland grant: 12/IP/1424.

References

Ardia, D., Baştürk, N., Hoogerheide, L., and van Dijk, H. (2012). A comparative study of Monte Carlo methods for efficient evaluation of marginal likelihood. Computational Statistics and Data Analysis, 56:3398–3414.

Assaraf, R. and Caffarel, M. (1999). Zero-variance principle for Monte Carlo algorithms. Physical Review Letters, 83(23):4682–4685.

Besag, J. (1972). Nearest-neighbour systems and the auto-logistic model for binary data. Journal of the Royal Statistical Society, Series B, 34(1):75–83.

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24:179–195.

Besag, J. (1977). Efficiency of pseudolikelihood estimation for simple Gaussian fields. Biometrika, 64:616–618.

Bouranis, L., Friel, N., and Maire, F. (2017). Efficient Bayesian inference for exponential random graph models by correcting the pseudo-posterior distribution. Social Networks, 50:98–108.

Caimo, A. and Friel, N. (2011). Bayesian inference for exponential random graph models. Social Networks, 33:41–55.

Caimo, A. and Friel, N. (2013). Bayesian model selection for exponential random graph models. Social Networks, 35(1):11–24.

Caimo, A. and Friel, N. (2014). Bergm: Bayesian exponential random graphs in R. Journal of Statistical Software, 61(2):1–25.

Caimo, A. and Mira, A. (2015). Efficient computational strategies for doubly intractable problems with applications to Bayesian social networks. Statistics and Computing, 25(1):113–125.

Calderhead, B. and Girolami, M. (2009). Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis, 53:4028–4045.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321.

Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96:270–281.

Everitt, R., Johansen, A., Rowing, E., and Hogan, M. (2017). Bayesian model selection with un-normalised likelihoods. Statistics and Computing, 27(2):403–422.

Everitt, R. G. (2012). Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics, 21(4):940–960.

Friel, N. (2013). Evidence and Bayes factor estimation for Gibbs random fields. Journal of Computational and Graphical Statistics, 22(3):518–532.

Friel, N., Hurn, M., and Wyse, J. (2014). Improving power posterior estimation of statistical evidence. Statistics and Computing, 24:709–723.

Friel, N. and Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society, Series B, 70(3):589–607.

Friel, N. and Rue, H. (2007). Recursive computing and simulation-free inference for general factorizable models. Biometrika, 94(3):661–672.

Friel, N. and Wyse, J. (2012). Estimating the evidence: a review. Statistica Neerlandica, 66(3):288–308.

Gelman, A. and Meng, X. (1998). Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science, 13(2):163–185.

Hunter, D. and Handcock, M. (2006). Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics, 15(3):565–583.


Hunter, D., Handcock, M., Butts, C., Goodreau, S., and Morris, M. (2008). ergm: A package to fit, simulate and diagnose exponential-family models for networks. Journal of Statistical Software, 24(3):1–29.

Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.

Koskinen, J. (2004). Bayesian analysis of exponential random graphs: estimation of parameters and model selection. Technical Report 2, Department of Statistics, Stockholm University.

Lartillot, N. and Philippe, H. (2006). Computing Bayes factors using thermodynamic integration. Systematic Biology, 55:195–207.

Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics, 80:221–239.

Liu, J. (2001). Monte Carlo Strategies in Scientific Computing. Springer Publishing Company, Incorporated.

Martin, A., Quinn, K., and Park, J. (2011). MCMCpack: Markov chain Monte Carlo in R. Journal of Statistical Software, 42(9):1–21.

Meng, X. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, 6:831–860.

Mira, A., Solgi, R., and Imparato, D. (2013). Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662.

Møller, J., Pettitt, A., Reeves, R., and Berthelsen, K. (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika, 93:451–458.

Morris, M., Handcock, M., and Hunter, D. (2008). Specification of exponential-family random graph models: terms and computational aspects. Journal of Statistical Software, 24(4):1–24.

Neal, R. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125–139.

Oates, C., Papamarkou, T., and Girolami, M. (2016). The controlled thermodynamic integral for Bayesian model evidence evaluation. Journal of the American Statistical Association, 111(514):634–645.

Pearson, M. and Michell, L. (2000). Smoke rings: social network analysis of friendship groups, smoking and drug-taking. Drugs: Education, Prevention and Policy, 7(1):21–37.

Potts, R. B. (1952). Some generalized order-disorder transformations. Proceedings of the Cambridge Philosophical Society, 48:106–109.

R Development Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Reeves, R. and Pettitt, A. N. (2004). Efficient recursions for general factorisable models. Biometrika, 91(3):751–757.


Ribatet, M., Cooley, D., and Davison, A. (2012). Bayesian inference from composite likelihoods, with an application to spatial extremes. Statistica Sinica, 22:813–845.

Robins, G., Snijders, T., Wang, P., Handcock, M., and Pattison, P. (2007). Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29:192–215.

Rosenthal, J. and Roberts, G. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367.

Snijders, T. (2002). Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3(2):1–40.

Snijders, T., Pattison, P., Robins, G., and Handcock, M. (2006). New specifications for exponential random graph models. Sociological Methodology, 36(1):99–153.

Stoehr, J. and Friel, N. (2015). Calibration of conditional composite likelihood for Bayesian inference on Gibbs random fields. In AISTATS, Journal of Machine Learning Research: W & CP, volume 38, pages 921–929.

Stoehr, J., Pudlo, P., and Friel, N. (2016). GiRaF: Gibbs random fields analysis. R package version 1.0.

Strauss, D. and Ikeda, M. (1990). Pseudolikelihood estimation for social networks. Journal of the American Statistical Association, 85:204–212.

Thiemichen, S., Friel, N., Caimo, A., and Kauermann, G. (2016). Bayesian exponential random graph models with nodal random effects. Social Networks, 46:11–28.

Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21(1):5–42.

Wang, J. and Atchade, Y. F. (2014). Bayesian inference of exponential random graph models for large social networks. Communications in Statistics - Simulation and Computation, 43:359–377.

Wasserman, S. and Pattison, P. (1996). Logit models and logistic regression for social networks: I. An introduction to Markov graphs and p*. Psychometrika, 61:401–425.

Xie, W., Lewis, P. O., Fan, Y., Kuo, L., and Chen, M. (2011). Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology, 60(2):150–160.
