Bayesian Nonparametric Inference for the Power Likelihood

Stephen G. Walker∗

Isadora Antoniano Villalobos†

∗ Stephen G. Walker is Professor of Statistics, School of Mathematics, Statistics & Actuarial Science, University of Kent, Canterbury, U.K. (email: [email protected])
† PhD research funded by CONACyT

Abstract

The aim in this paper is to provide a means by which to undertake Bayesian inference for mixture models when the likelihood function is raised to a power between 0 and 1. The main purpose for doing this is to guarantee a strongly consistent model, and hence it is possible to compare the consistent model with the correct model, looking for signs of discrepancy. This will be explained in detail in the paper. Another purpose would be for simulated annealing algorithms. In particular, for the widely used mixture of Dirichlet process model, it is far from obvious how to undertake inference via MCMC methods when the likelihood is raised to a power other than 1. In the present paper we demonstrate how posterior sampling can be carried out when using a power likelihood of this type.

Keywords: Consistency; Latent model; MCMC; Mixture of Dirichlet process model.

1  Introduction

The progress of Bayesian nonparametric methods (see Hjort et al., 2010) has led to a surge in theory involving posterior consistency. It has been pointed out that infinite dimensional Bayesian models can be inconsistent (see, for example, Barron et al., 1999). Therefore, it is important to find sufficient conditions on the prior to guarantee that the sequence of posterior distributions accumulates around the true sampling density function f_0.

More formally, a prior Π(df) is constructed on a space of density functions F. The posterior is obtained, given a sequence X = (X_1, …, X_n), as

    Π_n(df) = Π(df | X_1, …, X_n) = R_n(f) Π(df) / ∫ R_n(f) Π(df),

where

    R_n(f) = ∏_{i=1}^{n} f(X_i)

is the likelihood function. In order to study the behavior of the posterior as n grows, it is assumed that the (X_i) are independent and identically distributed from some "true" density f_0. The aim then is to find conditions on the prior Π which ensure that Π_n(A_ε) → 0 a.s. for all ε > 0, where A_ε = {f : H(f, f_0) > ε} and H refers to the Hellinger distance, which is equivalent to the L_1 distance.

A benchmark assumption, related to the support of the prior (Barron et al., 1999; Ghosal et al., 1999; Walker, 2004), is that Π assigns positive mass to all Kullback–Leibler neighborhoods of f_0; i.e.

    Π(f : d_KL(f_0, f) < δ) > 0   for all δ > 0.

Here d_KL(f_0, f) = ∫ f_0 log(f_0/f). This condition, referred to as the Kullback–Leibler property, is not sufficient for consistency. A counterexample is given in Barron et al. (1999). They present an additional condition, required to ensure that the numerator of Π_n(A_ε), i.e.

    ∫_{A_ε} R_n(f) Π(df),

has a suitable upper bound. Alternative approaches to this have been explored, none of them trivial to implement in practice, especially in the multivariate case.


However, as pointed out by Walker and Hjort (2001), the use of a square root likelihood guarantees consistency with only the Kullback–Leibler support property. That is,

    Q_n(A_ε) = ∫_{A_ε} R_n(f)^{1/2} Π(df) / ∫ R_n(f)^{1/2} Π(df)

is such that Q_n(A_ε) → 0 a.s. for all ε > 0. In fact, this is true for any power 1 − α, with α ∈ (0, 1).

A popular model in Bayesian nonparametrics is the mixture of Dirichlet process model, introduced by Lo (1984) and based on the Dirichlet process (Ferguson, 1973). For illustrative purposes, and to keep the notation simple, we consider here a simple univariate version of this model; however, the results can be applied to more general versions of the model, including multivariate settings. In what follows, we assume that the prior generates random density functions of the type

    f_{σ,P}(x) = ∫ (2π)^{−1/2} σ^{−1} m((x − θ)/σ) dP(θ),

where m denotes the standard normal kernel, m(x) = exp(−x²/2). The parameters θ and σ are the mean and the standard deviation of the mixture components. Also, P denotes a random distribution function over the mean parameter, taken from a Dirichlet process prior, and the standard deviation σ is assigned a prior independent of P. Bayesian inference for this model, via MCMC methods, is now routine (Escobar, 1988; MacEachern and Müller, 1998; Neal, 2000; Ishwaran and James, 2001; Papaspiliopoulos and Roberts, 2008; Kalli et al., 2011). Yet it is not clear that inference can be performed when using a power likelihood. For this we need to estimate

    Q_n(df) ∝ Π(df) ∏_{i=1}^{n} { ∫ σ^{−1} m((x_i − θ)/σ) dP(θ) }^{1−α}.

The aim of this paper is to demonstrate how to do MCMC inference for this power likelihood model. We do not claim such inference would perform better than the correct Bayesian posterior inference. In fact, the quantity Q_n does not have a clear interpretation other than that of an approximate model (when α is small) which is consistent. The motivation for using the power likelihood is to implement an updating procedure which guarantees consistency for a particular Π without needing to check non-trivial conditions. Additionally, by using different values of α > 0, and comparing the results with those obtained with α = 0, the validity of the inference obtained for a given sample can be assessed empirically. This idea will be explained later in the paper. Furthermore, there are other reasons for undertaking posterior inference with power likelihoods, and we refer to Ibrahim and Chen (2000) and Friel and Pettitt (2008) for examples. The algorithms have been coded in Matlab (R2012a) and will be demonstrated later in the paper.

In Section 2 we describe a latent model which is equivalent to using the power likelihood for the mixture of Dirichlet process model. Section 3 describes the MCMC algorithm for inference on the latent model, and Section 4 contains an illustration of the comparison between results using the power likelihood with α = 0 and α > 0, for an inconsistent model. We also show results concerning the use of the MDP model for real and simulated data. Finally, Section 5 contains a concluding discussion.

2  A latent model for the power likelihood

Our approach relies on the use of latent variables. We define a latent model which is marginally equivalent to the use of the power likelihood for the model of interest (see, for example, Besag and Green, 1993; Damien et al., 1999). We wish to base inference on the power likelihood

    ∏_{i=1}^{n} f_{σ,P}(x_i)^{1−α}.

There is no direct use for this expression, so we may use the stick-breaking representation for P,

    P = Σ_{j=1}^{∞} w_j δ_{θ_j},

to obtain an equivalent expression for which the 1 − α power is applied to objects bounded by 1,

    σ^{−n(1−α)} ∏_{i=1}^{n} { Σ_{j=1}^{∞} w_j m((x_i − θ_j)/σ) }^{1−α}.

Here, the (w_j) are based on a sequence of independent and identically distributed τ_j ∼ beta(1, c) for some c > 0, and the (θ_j) are independent and identically distributed from some distribution P_0 (see Sethuraman, 1994, for more details).

We could remove the 1 − α using a power series, since, for any 0 < q < 1,

    q^{1−α} = Σ_{k=0}^{∞} (−1)^k a_k (1 − q)^k,

for some positive constants (a_k). However, this is not convenient, as the resulting negative terms would invalidate the mixture model representation. On the other hand, we see that

    q^{−α} = Σ_{k=0}^{∞} b_k (1 − q)^k,

where the (b_k) are all positive. In fact, b_0 = 1, b_1 = α and, for k > 1, b_k = k!^{−1} α^{(k)} = k!^{−1} α(α + 1) ⋯ (α + k − 1).
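As a quick numerical sanity check of this expansion (a minimal sketch in Python rather than the authors' Matlab code; the function name, the truncation level K and the test values of q are illustrative choices only):

```python
import numpy as np

def b_coeffs(alpha, K):
    """Coefficients b_0, ..., b_K in q**(-alpha) = sum_k b_k (1 - q)**k,
    with b_0 = 1 and b_k = alpha (alpha + 1) ... (alpha + k - 1) / k!."""
    b = np.empty(K + 1)
    b[0] = 1.0
    for k in range(1, K + 1):
        b[k] = b[k - 1] * (alpha + k - 1) / k   # rising factorial divided by k!
    return b

alpha, K = 0.25, 200
b = b_coeffs(alpha, K)
for q in (0.9, 0.5, 0.2):
    series = np.sum(b * (1.0 - q) ** np.arange(K + 1))
    print(q, series, q ** (-alpha))   # the last two numbers should agree closely
```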

Therefore, we can rewrite the power likelihood as

    ∏_{i=1}^{n} f_{σ,P}(x_i) × f_{σ,P}(x_i)^{−α},

which is equivalent to

    σ^{nα} ∏_{i=1}^{n} f_{σ,P}(x_i) Σ_{k_i=0}^{∞} b_{k_i} { 1 − Σ_{j=1}^{∞} w_j m((x_i − θ_j)/σ) }^{k_i}.

We now consider K_n = (k_1, …, k_n) as a latent variable and rearrange terms to obtain

    p(x, K_n | σ, P) ∝ σ^{nα} ∏_{i=1}^{n} f_{σ,P}(x_i) b_{k_i} { 1 − Σ_{j=1}^{∞} w_j m((x_i − θ_j)/σ) }^{k_i}.

This remains a complicated expression, but we can now introduce latent variables D_n = (d_{l,i} : i = 1, …, n; l = 1, …, k_i) to substitute the term

    { 1 − Σ_{j=1}^{∞} w_j m((x_i − θ_j)/σ) }^{k_i}

by the latent expression

    ∏_{l=1}^{k_i} w_{d_{l,i}} (1 − m((x_i − θ_{d_{l,i}})/σ)).

Here the d_{l,i} ∈ {1, 2, …} and, summing these variables over the positive integers, we recover the desired term. Therefore, we now have the latent model

    p(X, D_n, K_n | σ, P) ∝ σ^{nα} ∏_{i=1}^{n} f_{σ,P}(x_i) b_{k_i} ∏_{l=1}^{k_i} w_{d_{l,i}} (1 − m((x_i − θ_{d_{l,i}})/σ)).

Furthermore, the term f_{σ,P}(x_i) can be dealt with in the usual way, which involves introducing latent variables (d_i : i = 1, …, n). Once again, we replace f_{σ,P}(x_i) by the latent term σ^{−1} w_{d_i} m((x_i − θ_{d_i})/σ). Hence, we arrive at the full latent model, given by

    σ^{−n(1−α)} ∏_{i=1}^{n} b_{k_i} w_{d_i} m((x_i − θ_{d_i})/σ) ∏_{l=1}^{k_i} w_{d_{l,i}} (1 − m((x_i − θ_{d_{l,i}})/σ)).

It is easy to verify that summing over the ((d_{l,i})_{l=1}^{k_i}, d_i, k_i)_{i=1}^{n} returns the (1 − α) power likelihood, with (σ, P) = (σ, (w_j, θ_j)_{j=1}^{∞}). At this point, we are essentially ready to undertake inference for the power likelihood, via MCMC.
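To make the bookkeeping concrete, the unnormalized full latent model can be coded directly from the display above. The following is a minimal Python sketch with hypothetical argument names and a finite truncation of the (w_j, θ_j); it is not the authors' implementation:

```python
import numpy as np

def m(z):
    """Standard normal kernel m(z) = exp(-z**2 / 2), bounded by 1."""
    return np.exp(-0.5 * z ** 2)

def b_coeff(alpha, k):
    """b_0 = 1, b_k = alpha (alpha + 1) ... (alpha + k - 1) / k!."""
    out = 1.0
    for j in range(1, k + 1):
        out *= (alpha + j - 1) / j
    return out

def full_latent_model(x, d, dl, k, theta, w, sigma, alpha):
    """Unnormalized joint of (x, D_n, K_n) given (sigma, (w_j, theta_j)).
    x: data of length n; d[i]: (0-based) index of the atom allocated to x[i];
    k[i]: number of extra latent indices for x[i]; dl[i]: list of the k[i] indices d_{l,i};
    theta, w: atoms and weights, truncated to a finite length for illustration."""
    n = len(x)
    val = sigma ** (-n * (1.0 - alpha))
    for i in range(n):
        val *= b_coeff(alpha, k[i]) * w[d[i]] * m((x[i] - theta[d[i]]) / sigma)
        for j in dl[i]:
            val *= w[j] * (1.0 - m((x[i] - theta[j]) / sigma))
    return val
```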

3  The MCMC algorithm

The joint latent model is complemented by the prior for (σ, P). Together, they provide all the variables which need to be sampled for posterior estimation, i.e. (σ, P, ((d_{l,i})_{l=1}^{k_i}, d_i, k_i)_{i=1}^{n}). However, there is still an issue due to the infinite choice of the (d_{l,i}, d_i). We can deal with this in the same way as in Kalli et al. (2011), which involves the technique of slice sampling. In order to reduce the choices represented by (d_{l,i}, d_i) to a finite set, we can introduce (v_i)_{i=1}^{n} which interact with the (d_i) in the joint model as

    1(v_i < e^{−ξ d_i})

for some ξ > 0. We can do the same for the (d_{l,i}) by introducing

    1(v_{l,i} < e^{−ξ d_{l,i}}).

These variables then allow for finite choices and the easy sampling of the index variables. Hence,

    P(d_i = j | ···) ∝ w_j m((x_i − θ_j)/σ) 1(1 ≤ j ≤ J_i),

where J_i = ⌊−ξ^{−1} log v_i⌋. Likewise,

    P(d_{l,i} = j | ···) ∝ w_j (1 − m((x_i − θ_j)/σ)) 1(1 ≤ j ≤ J_{l,i}),

where J_{l,i} = ⌊−ξ^{−1} log v_{l,i}⌋. These values of (J_i, J_{l,i}) then tell us exactly how many of the (θ_j, w_j) we need to sample at each iteration of the MCMC algorithm. That is, at any given iteration, a sampler with the correct target distribution only needs to sample these for j = 1, …, J, where J = max_{l,i} {J_i, J_{l,i}}. This is because none of the variables that need to be sampled depend on the values beyond J.
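A minimal Python sketch of this step follows; the function and variable names are hypothetical, the weights w and atoms theta are assumed to be available up to the relevant truncation (with w[0] corresponding to j = 1), and the categorical draws follow the stated conditionals:

```python
import numpy as np

rng = np.random.default_rng(0)

def m(z):
    return np.exp(-0.5 * z ** 2)           # standard normal kernel, bounded by 1

def sample_d(x_i, v_i, theta, w, sigma, xi):
    """Draw d_i from P(d_i = j | ...) prop. to w_j m((x_i - theta_j)/sigma), 1 <= j <= J_i."""
    J_i = int(np.floor(-np.log(v_i) / xi))  # J_i = floor(-log(v_i) / xi)
    J_i = max(J_i, 1)                       # guard so the standalone sketch always has a choice
    p = w[:J_i] * m((x_i - theta[:J_i]) / sigma)
    p /= p.sum()
    return rng.choice(J_i, p=p) + 1         # returns j in 1, ..., J_i

def sample_dl(x_i, v_li, theta, w, sigma, xi):
    """Draw d_{l,i} from P(d_{l,i} = j | ...) prop. to w_j (1 - m((x_i - theta_j)/sigma))."""
    J_li = max(int(np.floor(-np.log(v_li) / xi)), 1)
    p = w[:J_li] * (1.0 - m((x_i - theta[:J_li]) / sigma))
    p /= p.sum()
    return rng.choice(J_li, p=p) + 1
```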

To sample (σ, (θ_j, w_j)_{j=1}^{J}) we first introduce the variables (u_{l,i}) which change

    ∏_{l=1}^{k_i} (1 − m((x_i − θ_{d_{l,i}})/σ))

to

    ∏_{l=1}^{k_i} 1(u_{l,i} < 1 − m((x_i − θ_{d_{l,i}})/σ)).

It is more convenient to work with λ = σ^{−2}, and the conditional for λ is then given by

    p(λ | ···) ∝ π(λ) λ^{n(1−α)/2} exp(−(λ/2) Σ_{i=1}^{n} (x_i − θ_{d_i})²) 1(λ > L),

where

    L = max_{l,i} { −2 log(1 − u_{l,i}) / (x_i − θ_{d_{l,i}})² }.
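For instance, if π(λ) is taken to be a Gamma(a, b) density (shape-rate parameterization, an assumption made here only for illustration), the conditional above is a gamma distribution truncated to (L, ∞), which can be sampled by inverting the c.d.f.; a Python sketch:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)

def sample_lambda(x, theta_d, u_l, x_l, theta_l, a, b, alpha):
    """Sample lambda = sigma**(-2) from its truncated gamma full conditional.
    x, theta_d: data x_i and the allocated atoms theta_{d_i};
    u_l, x_l, theta_l: slice variables u_{l,i} with the matching x_i and theta_{d_{l,i}};
    a, b: shape and rate of the gamma prior pi(lambda)."""
    x, theta_d = np.asarray(x), np.asarray(theta_d)
    u_l, x_l, theta_l = map(np.asarray, (u_l, x_l, theta_l))
    shape = a + len(x) * (1.0 - alpha) / 2.0
    rate = b + 0.5 * np.sum((x - theta_d) ** 2)
    L = np.max(-2.0 * np.log(1.0 - u_l) / (x_l - theta_l) ** 2) if u_l.size else 0.0
    g = gamma(a=shape, scale=1.0 / rate)              # gamma with rate parameter `rate`
    p_L = g.cdf(L)
    return g.ppf(p_L + (1.0 - p_L) * rng.uniform())   # inverse-cdf draw restricted to (L, inf)
```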

Hence, if π(λ) is a gamma distribution, then the conditional is a truncated gamma distribution, for which numerous sampling routines are available.

We next describe how to sample the (w_j)_{j=1}^{J}. As is well known, these can be constructed from the sequence of independent (τ_j), making w_1 = τ_1 and w_j = τ_j ∏_{j′<j} (1 − τ_{j′}) for j > 1. The conditional for each τ_j is then beta(1 + n_j, c + m_j), where

    n_j = Σ_i 1(d_i = j) + Σ_{l,i} 1(d_{l,i} = j)   and   m_j = Σ_i 1(d_i > j) + Σ_{l,i} 1(d_{l,i} > j).
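A sketch of the weight update in Python (hedged: the beta(1 + n_j, c + m_j) form is the standard stick-breaking update implied by the latent model above; arrays are 0-based, so position j − 1 holds the j-th stick variable):

```python
import numpy as np

rng = np.random.default_rng(0)

def update_weights(d, dl_flat, c, J):
    """Resample tau_1, ..., tau_J and rebuild w_1, ..., w_J.
    d: allocation indices d_i (values in 1..J); dl_flat: all d_{l,i} pooled together."""
    d = np.asarray(d)
    dl_flat = np.asarray(dl_flat)
    tau = np.empty(J)
    for j in range(1, J + 1):
        n_j = np.sum(d == j) + np.sum(dl_flat == j)
        m_j = np.sum(d > j) + np.sum(dl_flat > j)
        tau[j - 1] = rng.beta(1.0 + n_j, c + m_j)
    # w_1 = tau_1, w_j = tau_j * prod_{j' < j} (1 - tau_{j'})
    w = tau * np.concatenate(([1.0], np.cumprod(1.0 - tau[:-1])))
    return w, tau
```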

The sampling of the (θ_j) is also not problematic. For each j,

    p(θ_j | ···) ∝ π(θ_j) exp(−(λ/2) Σ_{d_i=j} (x_i − θ_j)²) 1(θ_j ∈ ∩_{i=1}^{n} A_{j,i}),

where

    A_{j,i} = (−∞, x_i − a_{j,i}] ∪ [x_i + a_{j,i}, ∞),
    a_{j,i} = max_l { √(−2 λ^{−1} log(1 − u_{l,i})) : d_{l,i} = j },

and A_{j,i} = (−∞, ∞) if d_{l,i} ≠ j for every l. So, if the prior is a normal distribution, then the conditional here will be a truncated normal distribution.

Finally, we need to describe how to update each k_i. Since the dimension of the sampling space changes with k_i, we use ideas involving reversible jump MCMC (Green, 1995; Godsill, 2001). We propose a move from k_i to either k_i + 1 or k_i − 1, with probability 1/2 each. The move from k_i to k_i + 1 is accepted with probability

    min{ 1, (b_{k_i+1}/b_{k_i}) [1 − m((x_i − θ_{d_{k_i+1,i}})/σ)] }.

On the other hand, the move from k_i to k_i − 1 is accepted with probability

    min{ 1, b_{k_i−1} / (b_{k_i} [1 − m((x_i − θ_{d_{k_i,i}})/σ)]) }.

For the move upwards we need to allocate a value to d_{k_i+1,i}. To ensure reversibility, we do this using the probabilities (w_j); hence, we take d_{k_i+1,i} = j with probability w_j. This can be implemented straightforwardly, paying special attention to the case when k_i = 0, for which we can only propose the move to k_i + 1 (and not to k_i − 1).
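A Python sketch of the k_i update (hypothetical names; the weights are truncated at a finite J and renormalized for the proposal, and the k_i = 0 boundary is handled by forcing the upward proposal, which is the special case mentioned above and may need extra care in a full implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def m(z):
    return np.exp(-0.5 * z ** 2)

def update_k(x_i, k_i, dl_i, theta, w, sigma, alpha):
    """One reversible-jump move k_i -> k_i +/- 1.
    dl_i: current list [d_{1,i}, ..., d_{k_i,i}] of 1-based atom indices."""
    go_up = (k_i == 0) or (rng.uniform() < 0.5)
    if go_up:
        j = rng.choice(len(w), p=w / w.sum()) + 1     # propose d_{k_i+1,i} = j w.p. w_j
        # (alpha + k_i) / (k_i + 1) = b_{k_i+1} / b_{k_i}
        ratio = ((alpha + k_i) / (k_i + 1.0)) * (1.0 - m((x_i - theta[j - 1]) / sigma))
        if rng.uniform() < min(1.0, ratio):
            return k_i + 1, dl_i + [j]
    else:
        j = dl_i[-1]                                  # candidate index d_{k_i,i} to drop
        # k_i / (alpha + k_i - 1) = b_{k_i-1} / b_{k_i}
        ratio = (k_i / (alpha + k_i - 1.0)) / (1.0 - m((x_i - theta[j - 1]) / sigma))
        if rng.uniform() < min(1.0, ratio):
            return k_i - 1, dl_i[:-1]
    return k_i, dl_i
```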


4  Illustrations

In this section we present some examples. The first one, where the data are simulated from a known density, illustrates the behavior of density estimates based on the power likelihood with different values of α, as the sample size increases. The second example involves density estimation for a real data set. In both cases, the MDP model used is known to be consistent. The third and last example shows how the density estimate obtained using the "true" likelihood (α = 0) for an inconsistent model diverges from those obtained using α > 0, which are known to be consistent.

4.1  A consistent model

We consider a basic simulation set-up. Observations are generated from a bimodal distribution, defined as a mixture of three normal components with means θ_1 = −1, θ_2 = 0 and θ_3 = 3, common variance σ² = 1, and weights w_1 = 0.1, w_2 = 0.4 and w_3 = 0.5, respectively. To describe the settings for the model and algorithm, we took ξ = 0.1, and the prior for the (θ_j) was taken to be normal with mean 1.2 (roughly the mid-range of the data) and variance 10.

The purpose of this example is to illustrate the effect of increasing sample sizes on the density estimates obtained using the power likelihood with different values of α, when the model is consistent. Therefore, in order to eliminate any additional noise, we fixed the variance of the mixture components at the true value σ² = 1. We estimated the posterior density for sample sizes of n = 10, n = 100 and n = 1000 observations, using the power likelihood with α = 0.25, 0.5 and α = 0 (the "true" likelihood). Each time, we ran the algorithm for 15,000 iterations and kept 5,000 samples after a burn-in of 10,000 iterations.

Figure 1 shows the true density f_0 from which the data were generated, and MCMC estimates of the predictive density. When n = 10, we can clearly see the smoothing effect of using α > 0. However, since the model is consistent, as the sample size increases all the estimated densities eventually merge, as the posterior accumulates around the true density.
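For reference, data from the sampling density described above can be generated as follows (a minimal Python sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_data(n):
    """Draw n observations from the mixture 0.1 N(-1,1) + 0.4 N(0,1) + 0.5 N(3,1)."""
    means = np.array([-1.0, 0.0, 3.0])
    weights = np.array([0.1, 0.4, 0.5])
    comp = rng.choice(3, size=n, p=weights)
    return rng.normal(loc=means[comp], scale=1.0)

x = simulate_data(1000)
```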


4.2  Real data

Here we consider the galaxy data, which consist of the velocities of 82 distant galaxies diverging from our own galaxy. Once again, we took ξ = 0.1, and the prior for the (θ_j) was taken to be normal with mean equal to the mid-range of the data and variance equal to the range. The prior for λ = 1/σ² was chosen to be standard exponential, and we defined a Gamma(0.5, 0.1) hyper-prior for c.

Figure 2 shows a histogram of the data and the estimated predictive density for α = 0, 2/3000, 4/3000 and 1/300. It can be seen that, for values of α close to 0, the power likelihood density estimate approaches the density estimated using the "true" consistent model.

The first example illustrates that use of the power likelihood does not highlight a discrepancy between the α = 0 "true" model and those with α > 0, which are known to be consistent. We now present an example where, with a known inconsistent model, the α > 0 models clearly highlight a discrepancy with the α = 0 model and hence raise an issue as to the consistency of the "true" model.

4.3  An inconsistent model

The results found in the literature present conditions for consistency which are sufficient only. Therefore, in many cases, even when consistency for a model cannot be established, this does not imply it is inconsistent. Hence, if the practitioner wanders away from these sufficient conditions, she would be interested in diagnosing a possible case of inconsistency.

In this part we present an interesting example, constructed by Barron et al. (1999), to show that posterior inconsistency can occur when nonparametric densities are involved. The idea is to construct a prior which assigns equal probability to a set F_0 of continuous densities and a set F_∗ of piecewise constant densities. The role of the first set is to ensure the Kullback–Leibler property is satisfied, while the second ensures posterior probability does not accumulate almost surely on arbitrarily small Hellinger neighborhoods of the true density.

The i.i.d. observations are generated from the uniform density on [0, 1]; that is, f_0(x) = 1. To construct the prior, first consider, for each positive integer N, the following partition of [0, 1]:

    I_N = { [0, 1/(2N²)), [1/(2N²), 2/(2N²)), …, [(2N² − 1)/(2N²), 1] }.

Let F_N be the set of all density functions which are constant on every interval of I_N and take only the values 0 and 2. Then the cardinality of F_N is q_N = (2N² choose N²). The prior will assign equal mass

    Π(f) = 1 / (C q_N 2N²)

to every function f ∈ F_N, where C is a normalizing constant,

    C = Σ_{N=1}^{∞} 1/N².

Making F_∗ = ∪_{N=1}^{∞} F_N, this means exactly 1/2 of the prior probability is assigned to F_∗. The rest of the prior mass is assigned to the parametric family

    F_0 = { f_θ(x) = exp(−θ + √(2θ) Φ^{−1}(x)) : θ ∈ (0, 1) },

where Φ denotes the standard normal c.d.f. The prior probability on F_0 is induced by the density on the parameter space

    Π(θ) ∝ exp(−1/θ) 1_{(0,1)}(θ).

For every f_θ ∈ F_0, the Kullback–Leibler divergence to the true density is d_KL(f_θ, f_0) = θ, so the Kullback–Leibler property is satisfied. At the same time, the squared Hellinger distance between f_0 and any density f ∈ F_∗ is

    H²(f, f_0) = 2 − √2,

from which it follows that

    lim sup_{n→∞} Π_n(F_∗) = 1   a.s.

Therefore, the model is not strongly consistent (see Barron et al., 1999, for a proof).
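The properties of the family F_0 are easy to check numerically. The sketch below (Python) uses the form f_θ(x) = exp(−θ + √(2θ) Φ^{−1}(x)) written above and the Hellinger convention H²(f, g) = 1 − ∫ √(fg), under which the formula H(f_θ, f_0) = √(1 − exp(−θ/4)) used later holds; both choices are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def check_ftheta(theta, n=10**6):
    """Monte Carlo check that d_KL(f_theta, f_0) = theta and
    H(f_theta, f_0) = sqrt(1 - exp(-theta/4)) when f_0 = Uniform(0,1)."""
    z = rng.standard_normal(n)                   # z = Phi^{-1}(x) with x ~ Uniform(0,1)
    log_f = -theta + np.sqrt(2.0 * theta) * z    # log f_theta(x)
    kl = np.mean(np.exp(log_f) * log_f)          # int f_theta log(f_theta / f_0)
    hell = np.sqrt(1.0 - np.mean(np.exp(0.5 * log_f)))
    return kl, hell

for theta in (0.1, 0.5, 0.9):
    kl, hell = check_ftheta(theta)
    print(theta, kl, hell, np.sqrt(1.0 - np.exp(-theta / 4.0)))
```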

Both the prior and the posterior for this model are nonparametric mixtures over the space of densities F = F_0 ∪ F_∗. Therefore, a posterior sample can be obtained using slice sampling techniques, introducing a latent variable to index the mixture component from which the density is sampled (see Kalli et al., 2011, for details). When the latent variable takes the value 0, we obtain f = f_θ by sampling the parameter from the corresponding posterior density. In this case, the Hellinger distance to the true density is given by

    H(f_θ, f_0) = √(1 − exp(−θ/4)).

When the latent variable takes a value N > 0, we know f ∈ F_N and H(f, f_0) = √(2 − √2). So we may calculate an MCMC estimate of the Hellinger distance between f_0 and a realization f from the posterior Π_n.

Our results are shown in Figure 3. The horizontal axis corresponds to the sample size n, while the vertical axis shows the MCMC estimate of the Hellinger distance between f_0 and the predictive density, for different choices of α. We see that, for small n and small α, the behavior of the estimate is similar to that of α = 0. However, for n large enough, all the density estimates obtained using α > 0 approach the true density f_0 (the estimated Hellinger distances decrease towards zero), as expected from the consistency property. The estimated distance for α = 0, on the other hand, remains constant for large n, since the "true" model is inconsistent. Hence, in this example, if the model were not known to be inconsistent, the plots of posterior predictive distributions would look different for α = 0 and for choices of α > 0, and this discrepancy would, or should, lead the practitioner to question the appropriateness of the model.

5  Discussion

With mixture models now developing at a pace, both towards multivariate versions and regression functions, it is becoming harder to establish conditions for consistency. On the other hand, there is no work to be done in this direction if one is willing to work with an α > 0, however small. While using a mixture model with a power less than 1 for the likelihood solves the problem of consistency, it brings up the issue of how to do Bayesian inference via MCMC. This paper has shown and demonstrated how this can be done. The trick is to see the power likelihood as R_n(f) × R_n(f)^{−α} rather than R_n(f)^{1−α}, and to use a power series representation for q^{−α}, with 0 < q < 1, which is guaranteed to have positive weights.

We have shown how the likelihood for an MDP model can be appropriately manipulated to ensure we obtain a quantity bounded by 1. There are other examples. We can consider, for instance, an exponential model where m(x; θ) = θ exp(−xθ). In this case, we can use

    m(x; θ) ∝ θ exp(−xθ) / (x^{−1} e^{−1}),

which is, again, bounded above by 1.
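Indeed, for fixed x > 0 the map θ ↦ θ exp(−xθ) is maximized at θ = 1/x, where it equals x^{−1} e^{−1}, so the ratio never exceeds 1. A small numerical check in Python (the grid values are illustrative only):

```python
import numpy as np

x = np.linspace(0.01, 10.0, 500)[:, None]       # x > 0
theta = np.linspace(0.01, 20.0, 500)[None, :]   # theta > 0
ratio = theta * np.exp(-x * theta) / (x ** (-1) * np.exp(-1.0))
print(ratio.max())   # should not exceed 1 (up to floating point error)
```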

For a consistent model and small α there is no difference between using the likelihood raised to the power 1 − α (i.e. R_n(f)^{1−α}) and using the "true" likelihood (α = 0). Yet, for any choice of 0 < α < 1, the model is proved to be consistent (Walker and Hjort, 2001); therefore, for large enough n, the corresponding estimate should move away from the estimate produced by α = 0 if the model is inconsistent.

Since results for consistency involve conditions which are sufficient only, there are models for which consistency may be present but not theoretically verifiable. In such cases, we propose the use of the power likelihood for inference, or for checking for discrepancies with the true model. Results can then be compared between α = 0 and different values of 0 < α < 1. If the density estimates are similar, the estimation produced by the "true" model may be considered adequate. However, if the estimates seem different, this may be taken as a warning sign that the model may not be consistent. Even for inconsistent models, the power likelihood may be used to produce a consistent estimate of the predictive density, or to assess, by comparison, the quality of an estimate obtained using the "correct" likelihood.

Consistency may serve as a motivation for the use of a power likelihood, but it is not the only one. In general, raising the likelihood to some power smaller than 1 has a smoothing effect, which can be useful in some situations. See, for example, Friel and Pettitt (2008) for the use of the power likelihood in the context of simulated annealing algorithms, or Ibrahim and Chen (2000), also concerned with raising likelihoods to powers.

Acknowledgements. The authors are grateful for the comments of an Associate Editor and two referees on an earlier version of the paper.

References

Barron, A., M. J. Schervish, and L. Wasserman (1999). The consistency of posterior distributions in nonparametric problems. The Annals of Statistics 27(2), 536–561.

Besag, J. and P. J. Green (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society: Series B (Methodological) 55(1), 25–37.

Damien, P., J. Wakefield, and S. Walker (1999). Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(2), 331–344.

Escobar, M. D. (1988). Estimating the means of several normal populations by nonparametric estimation of the distribution of the means. Ph.D. thesis, Department of Statistics, Yale University.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1(2), 209–230.

Friel, N. and A. N. Pettitt (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B 70(3), 589–607.

Ghosal, S., J. K. Ghosh, and R. V. Ramamoorthi (1999). Posterior consistency of Dirichlet mixtures in density estimation. The Annals of Statistics 27(1), 143–158.

Godsill, S. J. (2001). On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics 10(2), 230–248.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732.

Hjort, N. L., C. Holmes, P. Müller, and S. G. Walker (2010). Bayesian Nonparametrics. Cambridge University Press.

Ibrahim, J. G. and M. H. Chen (2000). Power prior distributions for regression models. Statistical Science 15(1), 46–60.

Ishwaran, H. and L. F. James (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161–173.

Kalli, M., J. E. Griffin, and S. G. Walker (2011). Slice sampling mixture models. Statistics and Computing 21, 93–105.

Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics 12(1), 351–357.

MacEachern, S. N. and P. Müller (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics 7(2), 223–238.

Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9(2), 249–265.

Papaspiliopoulos, O. and G. O. Roberts (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika 95(1), 169–186.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.

Walker, S. G. (2004). New approaches to Bayesian consistency. The Annals of Statistics 32(5), 2028–2043.

Walker, S. G. and N. L. Hjort (2001). On Bayesian consistency. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(4), 811–821.

[Figure 1 here. Panels: n = 10, n = 100, n = 1000. Curves: true density, α = 0, α = 0.25, α = 0.5.]

Figure 1: Simulated data from the MDP model: estimated predictive density based on the (1 − α) power likelihood.

[Figure 2 here. Histogram of the data with estimated predictive densities for α = 0, 1/300, 4/3000 and 2/3000.]

Figure 2: Galaxy data: estimated predictive density based on the (1 − α) power likelihood.

[Figure 3 here. Curves: α = 0, α = 0.1, α = 0.25, α = 0.5; horizontal axis: sample size n up to 1000.]

Figure 3: Inconsistent model: estimated Hellinger distance between the true density f_0 and the estimated predictive density based on the (1 − α) power likelihood, for increasing sample size.
