On the Convergence Rate of Random Permutation Sampler and ECR Algorithm in Missing Data Models

Panagiotis Papastamoulis and George Iliopoulos¹

Department of Statistics and Insurance Science, University of Piraeus, 80 Karaoli & Dimitriou str., 18534 Piraeus, Greece

¹ Corresponding author. e-mail: [email protected]

Abstract

Label switching is a well-known phenomenon that occurs in MCMC outputs targeting the parameters' posterior distribution of many latent variable models. Although its appearance is necessary for the convergence of the simulated Markov chain, it turns out to be a problem in the estimation procedure. In a recent paper, Papastamoulis and Iliopoulos (2010) introduced the Equivalence Classes Representatives (ECR) algorithm as a solution to this problem in the context of finite mixtures of distributions. In this paper, label switching is considered under a general missing data model framework that includes as special cases finite mixtures, hidden Markov models, and Markov random fields. The use of the ECR algorithm is extended to this general framework and it is shown that the relabelled sequence it produces converges to its target distribution at the same rate as the Random Permutation Sampler of Frühwirth-Schnatter (2001), and that both converge at least as fast as the Markov chain generated by the original MCMC output.

Keywords: MCMC methods; label switching; latent variables; ECR algorithm; random permutation sampler.

1 Introduction

Let x = (x1, . . . , xn), xi ∈ X, i = 1, . . . , n, be the observed data. Suppose that the joint distribution x ∼ f(x|η) belongs to a specific parametric family of distributions with unknown parameter η = (η1, . . . , ηk) ∈ H, where k ≥ 2 is a fixed constant. Typically, the complexity of the distribution f makes the use of analytical methods for estimating the parameter vector η prohibitive. In many cases, the complexity of a model originates from the lack of further information. However, one can often assume that this information can be thought of as an unobserved sequence z = (z1, . . . , zn) of discrete random variables taking values in a fixed finite set. So, if one had observed the complete data, that is, the sequence (x, z), inference about the characteristics of the population would become simpler. Since the z's are unobserved, they can be treated as missing data. The EM algorithm (Dempster et al., 1977) and the Gibbs sampler (Gelfand and Smith, 1990) are well-known methods for handling missing data problems from a frequentist and a Bayesian point of view, respectively. In this paper the latter case is considered. Furthermore, it is assumed that an MCMC algorithm to (approximately) sample from the joint posterior distribution of (z, η|x) is available.

In many latent class models the posterior distribution of the parameters is symmetric with respect to permutations of the parameter labels. This property burdens the estimation of the marginal posterior distributions when using traditional MCMC methods, since they suffer from the so-called label switching problem (Redner and Walker, 1984). Early attempts to handle this problem consist of imposing artificial identifiability constraints on the parameters so as to prevent the label switching phenomenon (see Richardson and Green, 1997). Frühwirth-Schnatter (2001) faced this problem by running a constrained sampler, where each simulated parameter vector is permuted in order to satisfy a particular ordering constraint. The constraint is selected from a preliminary run of the random permutation sampler (RPS), that is, a sampler augmented with a random permutation step in order to ensure the presence of the label switching phenomenon. More sophisticated methods are also available (see Stephens, 1997, 2000, and Celeux et al., 2000) but their high computational cost restricts their applicability. For a review of methods that attempt to handle the label switching problem in Bayesian analysis of mixtures of distributions see Jasra et al. (2005).


Recently, Papastamoulis and Iliopoulos (2010) developed a relabelling procedure, namely the Equivalence Classes Representatives (ECR) algorithm, and showed that the label switching problem can be efficiently solved in the case of mixtures of distributions with a considerably smaller amount of computational effort. Note that at about the same time another relabelling procedure was developed by Sperrin et al. (2010).

The contribution of the present paper is twofold. First, we present a general statistical model that includes as special cases many popular models such as finite mixtures, hidden Markov models and Markov random fields, and provide conditions under which the label switching phenomenon takes place in MCMC outputs targeting the parameters' posterior distribution. Secondly, we deal with the convergence rates of the RPS and the ECR algorithm to the corresponding posterior distributions in terms of the total variation distance. More specifically, we formally show that the RPS converges to the posterior distribution at least as fast as the original sampler, and that the ECR's relabelled sequence converges to its target distribution (which by construction is different from the posterior distribution) at least as fast as the original sequence.

The rest of the paper is organized as follows. In Section 2 the above mentioned general statistical model is introduced and it is proven that under certain conditions it exhibits the label switching phenomenon. Section 3 briefly reviews the random permutation sampler of Frühwirth-Schnatter (2001) and the ECR algorithm of Papastamoulis and Iliopoulos (2010). Section 4 provides the results about the convergence rates of the RPS and the ECR algorithm. The paper concludes with a discussion.

2 The label switching problem in a general class of missing data models

Let k ≥ 2 be a known, fixed integer. Assume that there are some unobserved data z = (z1, . . . , zn) ∈ Z := {1, . . . , k}^n such that the distribution of x = (x1, . . . , xn) arises via marginalization, that is,

    f(x|η) = Σ_{z∈Z} f(x, z|η),   η = (η1, . . . , ηk) ∈ H.   (1)

The density f(x|η) as a function of η will be called the observed likelihood, while f(x, z|η) (for given z) will be called the complete likelihood. Let f(η), η ∈ H, be the prior distribution of the parameters. In the sequel we will need the following definition.

Definition 2.1. Let Tk be the set of permutations of {1, . . . , k}. Then, for any τ = (t1, . . . , tk) ∈ Tk the corresponding permutations of z = (z1, . . . , zn) ∈ Z and η = (η1, . . . , ηk) ∈ H are defined as τz := (t_{z_1}, . . . , t_{z_n}) ∈ Z and τη := (η_{t_1}, . . . , η_{t_k}), respectively.

For the basic missing data model (1) we will assume the following general properties:

• The parameter space H is permutation invariant, i.e.,

    τH = H,   ∀ τ ∈ Tk,   (2)

in the sense that η ∈ H ⇔ τη ∈ H, ∀ τ ∈ Tk.

• The prior distribution of the parameters η is permutation invariant, that is,

    f(η) = f(τη),   ∀ τ ∈ Tk.   (3)

• The complete likelihood has the property

    f(x, τz|η) = f(x, z|τη),   ∀ τ ∈ Tk.   (4)
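To make Definition 2.1 and property (4) concrete, here is a minimal Python sketch (not part of the paper) that implements the two permutation actions for a Gaussian mixture and checks identity (4) numerically. The unit component variances and the particular parameter values are arbitrary illustrative choices, and labels are 0-based.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def permute_z(tau, z):
    # tau z := (t_{z_1}, ..., t_{z_n})
    return tau[z]

def permute_eta(tau, p, theta):
    # tau eta := (eta_{t_1}, ..., eta_{t_k}), applied blockwise to (p, theta)
    return p[tau], theta[tau]

def complete_loglik(x, z, p, theta):
    # log f(x, z | eta) = sum_i log( p_{z_i} * phi(x_i; theta_{z_i}, 1) )
    return np.sum(np.log(p[z]) + norm.logpdf(x, loc=theta[z], scale=1.0))

k, n = 3, 10
x = rng.normal(size=n)
z = rng.integers(0, k, size=n)
p = np.array([0.2, 0.3, 0.5])
theta = np.array([-1.0, 0.0, 2.0])

for _ in range(5):
    tau = rng.permutation(k)
    lhs = complete_loglik(x, permute_z(tau, z), p, theta)      # f(x, tau z | eta)
    rhs = complete_loglik(x, z, *permute_eta(tau, p, theta))   # f(x, z | tau eta)
    assert np.isclose(lhs, rhs)                                # property (4)
```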

Assumptions (2)-(4) are quite general and hold for a variety of popular statistical models. Some important cases of the missing data model (1) are finite mixtures of distributions, hidden Markov models, as well as more general Markov random fields. For instance, the classical independence mixture model Σ_{j=1}^k p_j f(x_i|θ_j), θ_j ∈ Θ, j = 1, . . . , k, arises by assuming that P(z_i = j) = p_j and x_i|z_i ∼ f(·|θ_{z_i}), independently for i = 1, . . . , n, with p = (p_1, . . . , p_k) ∈ P_k := {p ∈ (0, 1)^k : Σ_{j=1}^k p_j = 1}. In terms of expression (1), we have that η = ((p_1, θ_1), . . . , (p_k, θ_k)), H is the restriction of ((0, 1) × Θ)^k due to the constraint p ∈ P_k, while f(x, z|η) = Π_{i=1}^n p_{z_i} f(x_i|θ_{z_i}).

By relaxing the independence assumption among the allocation variables, various generalizations of the mixture model arise. Assuming that the sequence of z's is a k-state homogeneous Markov chain with transition matrix P = (p_{ij}), i, j = 1, . . . , k, expression (1) defines a hidden Markov model (see Cappé et al., 2005, or Frühwirth-Schnatter, 2006). If p_i := (p_{i1}, . . . , p_{ik}) denotes the i-th row of the transition matrix P, then in terms of the missing data model formulation (1) we have that η = ((p_1, θ_1), . . . , (p_k, θ_k)) and H = (P_k × Θ)^k. Note that the dependence structure among the z's can exhibit a more complex pattern, for example a Potts model, leading to the class of Markov random fields (see Brémaud, 1999). In all of these models, the validity of properties (2) and (4) follows from routine calculations. Finally, the choice of standard label invariant priors guarantees (3) (see for example Marin et al., 2005).

Next we show that properties (2), (3) and (4) are actually sufficient conditions for the presence of the label switching phenomenon.

Proposition 2.1. Any model of the form (1) satisfying (2)-(4) exhibits a symmetric likelihood function and a symmetric posterior distribution.

Proof. Assumption (2) guarantees that each η_j, j = 1, . . . , k, lies in the same set, something that is necessary for the symmetry of the likelihood function and the joint posterior distribution. On the other hand, (4) implies that the observed likelihood is symmetric, that is, ∀ τ ∈ Tk we have

    f(x|τη) = Σ_{z∈Z} f(x, z|τη) = Σ_{z∈Z} f(x, τz|η) = Σ_{ξ∈Z} f(x, ξ|η) = f(x|η),   (5)

for all η ∈ H. Finally, the joint posterior of η = (η1, . . . , ηk) is symmetric as well, because ∀ τ ∈ Tk we have that

    f(τη|x) = f(x|τη)f(τη)/f(x) = f(x|η)f(η)/f(x) = f(η|x),   (6)

for all η ∈ H, by (5) and (3).

Equation (6) is the core of the label switching phenomenon. Although its presence is necessary for the convergence of the MCMC sampler (see Marin et al., 2005), it turns out to be a problem when considering estimation of characteristics of the marginal posterior distributions of either the η_j's or the z_i's, due to the following facts.

Proposition 2.2. Any model of the form (1) satisfying (2)-(4) is marginally unidentifiable, that is,

    f(η1|x) = · · · = f(ηk|x)   (7)

and, for all i = 1, . . . , n,

    P(z_i = j|x) = 1/k,   ∀ j = 1, . . . , k.   (8)

Proof. By (6), η1, . . . , ηk are a posteriori exchangeable and this implies (7). In order to prove the second argument, denote by w(z|x) the posterior weight of z and observe that for all z ∈ Z it holds

    w(z|x) = w(τz|x),   ∀ τ ∈ Tk,   (9)

since, making the change of variable η = τη′ (valid by (2)), we get

    w(z|x) = f(x, z)/f(x) = ∫_H f(x, z|η) f(η) dη / f(x) = ∫_H f(x, z|τη′) f(τη′) dη′ / f(x) = ∫_H f(x, τz|η′) f(η′) dη′ / f(x) = w(τz|x),

where the last integral representation uses (4) and (3). Next, for any i = 1, . . . , n, j = 1, . . . , k, and τ = (t1, . . . , tk) ∈ Tk we have

    P(z_i = j|x) = Σ_{z_ℓ=1, ℓ≠i}^k P(z_1, . . . , z_i = j, . . . , z_n|x)
                 = Σ_{z_ℓ=1, ℓ≠i}^k w(z_1, . . . , j, . . . , z_n|x)
                 = Σ_{t_{z_ℓ}=1, ℓ≠i}^k w(t_{z_1}, . . . , t_j, . . . , t_{z_n}|x)
                   (by making the change of variable z = τz′ and using (9))
                 = P(z_i = t_j|x).

Since t_j can be any integer in {1, . . . , k}, (8) follows immediately.

In the literature, η1, . . . , ηk are referred to as state specific parameters. However, in many models there exist additional parameters that may appear in the data density (either the observed or the complete one), as prior hyperparameters, or as both. As long as these parameters are not related to the state labels 1, . . . , k, they play no role in the label switching phenomenon and the above results still hold. Therefore, in general, we will suppress their presence from the notation.
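As an illustration outside the paper, the uniformity in (8) can be checked by brute force on a tiny state space: any weight function that has been symmetrized over relabelings, as (9) requires, necessarily has uniform allocation marginals. A minimal Python sketch, with an arbitrary synthetic weight function standing in for w(z|x):

```python
import itertools
import numpy as np

k, n = 2, 3
Z = list(itertools.product(range(k), repeat=n))
idx = {z: i for i, z in enumerate(Z)}

rng = np.random.default_rng(2)
q = rng.random(len(Z))                  # arbitrary nonnegative weights

def swap(z):                            # the only nontrivial relabeling for k = 2
    return tuple(1 - zi for zi in z)

# symmetrized weights play the role of w(z|x): w(z) = w(tau z), as in (9)
w = np.array([q[i] + q[idx[swap(z)]] for i, z in enumerate(Z)])
w /= w.sum()

for i in range(n):
    for j in range(k):
        pj = sum(w[m] for m, z in enumerate(Z) if z[i] == j)
        assert np.isclose(pj, 1.0 / k)  # P(z_i = j | x) = 1/k, property (8)
```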

3 Random Permutation Sampler and ECR Algorithm

Suppose that we have at hand an MCMC sampler targeting the symmetric posterior distribution. Frühwirth-Schnatter (2001) ensures that the chain visits all k! symmetric modal regions by applying at each step a random permutation to the simulated allocation and parameter vectors. In general, a random permutation sampler proceeds as follows.

Random Permutation Sampler (RPS)

At iteration m = 1, 2, . . .:

1. Generate (z, η)^(m) ≡ (z^(m), η^(m)) via a standard MCMC step.

2. Select randomly some permutation τ = (t1, . . . , tk) ∈ Tk.

3. Set (z, η)^(m) = (τz^(m), τ^{-1}η^(m)), where τ^{-1} denotes the inverse permutation of τ.

Frühwirth-Schnatter (2001) proved that the RPS does not alter the target of the simulated chain. It is intuitively clear that, since this move forces the appearance of the label switching phenomenon, the convergence to the symmetric posterior distribution should be at least as fast as that of the original algorithm (which does not involve the permutation step). This has been mentioned in the literature (see for example Jasra et al., 2005) but it has not been formally proven.
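To fix ideas, here is a minimal sketch (ours, not the paper's) of the random permutation step wrapped around a generic MCMC kernel; `mcmc_step` is a hypothetical user-supplied function, `eta` is an array whose first axis indexes the k components, and labels are 0-based.

```python
import numpy as np

def rps_step(z, eta, mcmc_step, rng):
    """One RPS iteration: a standard MCMC step followed by a uniformly
    random relabeling (steps 2-3 of the algorithm above)."""
    z, eta = mcmc_step(z, eta)        # 1. standard MCMC step (hypothetical kernel)
    k = eta.shape[0]
    tau = rng.permutation(k)          # 2. random permutation of {0, ..., k-1}
    tau_inv = np.argsort(tau)         #    inverse permutation
    return tau[z], eta[tau_inv]       # 3. (tau z, tau^{-1} eta)
```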


Papastamoulis and Iliopoulos (2010) developed a method for solving the label switching problem in the classical mixture setting. Their method is based on partitioning the allocation space Z = {1, . . . , k}^n into k! subsets. Each subset is called a set of representatives of the classes implied by an equivalence relation defined on the allocation space. More specifically, two allocation vectors z_1 and z_2 are said to be equivalent if there exists τ ∈ Tk such that z_1 = τz_2. Let Ξ_z = {τz : τ ∈ Tk} be the equivalence class of z. This set contains exactly k!/(k − k0(z))! elements, where k0(z) denotes the number of different symbols appearing in a given allocation vector z. Consider now an arbitrary set Z0, consisting of exactly one representative from each equivalence class, and define

    f_{Z0}(η|x) := Σ_{z∈Z0} [k!/(k − k0(z))!] w(z|x) f(η|x, z),   (10)

where w(z|x) denotes the posterior weight of the allocation vector z. It is easy to verify that for any choice of Z0, Σ_{z∈Z0} k! w(z|x)/(k − k0(z))! = 1, and so f_{Z0}(η|x) is a probability density function with the same support as f(η|x). Moreover, f_{Z0} is nonsymmetric for any choice of Z0 and can reproduce the symmetric posterior distribution through the relation

    f(η|x) = (1/k!) Σ_{τ∈Tk} f_{Z0}(τη|x).   (11)

Papastamoulis and Iliopoulos (2010) solved the label switching problem by choosing a set of equivalence classes representatives (ECR) Z0 and transforming the original MCMC output so that it targets f_{Z0}. They then used the transformed output in order to estimate the posterior characteristics of the component specific parameters.

The Equivalence Classes Representatives (ECR) Algorithm

1. Determine Z0.

2. Generate a sequence (z, η)^(m), m = 1, 2, . . ., via a standard MCMC algorithm.

3. For m = 1, 2, . . .

   (a) Select randomly a permutation τ_m ∈ T_{z^(m)} := {τ ∈ Tk : τz^(m) ∈ Z0}.

   (b) Set (z, η)^(m) = (τ_m z^(m), τ_m^{-1} η^(m)).

The validity of the algorithm follows from the fact that if (z, η)^(m), m = 1, 2, . . ., is a Markov chain with limit distribution f(z, η|x) and τ_m is uniformly distributed on T_{z^(m)}, then the sequence (τ_m z^(m), τ_m^{-1} η^(m)), m = 1, 2, . . ., has limit distribution f_{Z0}(z, η|x). This has been proven in the context of classical mixtures by Papastamoulis and Iliopoulos (2010, Proposition 3.1) but the result applies to the general model (1) as well. For further use, note that the set T_z contains exactly (k − k0(z))! permutations.
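As a concrete illustration (not from the paper), the sketch below performs step 3 for one particular choice of Z0 that is easy to code: the representatives whose labels appear in order of first appearance. The paper's pivot-based choice of Z0 differs; this is only meant to show how τ_m ∈ T_{z^(m)} can be drawn uniformly, with the (k − k0(z))! freedom coming from the labels that do not appear in z.

```python
import numpy as np

def ecr_relabel(z, eta, rng):
    """One ECR relabeling: draw tau uniformly from T_z = {tau : tau z in Z0}
    and return (tau z, tau^{-1} eta). Here Z0 is the illustrative choice of
    representatives whose labels occur in order of first appearance."""
    k = eta.shape[0]
    tau = np.full(k, -1)
    nxt = 0
    for zi in z:                        # used labels: forced by this choice of Z0
        if tau[zi] == -1:
            tau[zi] = nxt
            nxt += 1
    unused = np.flatnonzero(tau == -1)  # labels absent from z:
    tau[unused] = nxt + rng.permutation(unused.size)  # (k - k0(z))! completions
    tau_inv = np.argsort(tau)
    return tau[z], eta[tau_inv]

rng = np.random.default_rng(1)
z = np.array([2, 2, 0, 2])              # k0(z) = 2 distinct labels, k = 3
eta = np.array([-1.0, 0.0, 2.0])
z_new, eta_new = ecr_relabel(z, eta, rng)
print(z_new)                            # always [0 0 1 0], its class representative
```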


4 Convergence Rate

In this section we compare the convergence rates of the outputs obtained by the standard MCMC sampler and the RPS to the (original) posterior distribution with that of the output produced by the ECR algorithm to its limit distribution f_{Z0}, in terms of the total variation distance. First we formally show that the RPS converges at least as fast as the original MCMC sampler and then that its convergence rate coincides with that of the ECR algorithm.

For notational convenience, we avoid writing explicitly that the distributions are conditional on the data x, although they must be understood as such. Let f_{Z0}^(m)(z, η) and f^(m)(z, η) be the densities of the generated (z, η)^(m) according to the ECR algorithm and the original sampler, respectively. Let also f∗^(m)(z, η) denote the density of (z, η)^(m) according to the RPS. By the definition of the ECR algorithm, every (z, η) generated from the original sampler is transformed to (τz, τ^{-1}η), where τ is selected among the (k − k0(z))! permutations that switch z back to Z0, each with probability 1/(k − k0(z))!. This implies that

    f_{Z0}^(m)(z, η) = Σ_{τ∈Tk} f^(m)(τz, τ^{-1}η)/(k − k0(z))!,   ∀ z ∈ Z0, η ∈ H.   (12)

For later use, observe that for any function g(z) it holds

    Σ_{z∈Z} g(z) = Σ_{z∈Z0} Σ_{τ∈Tk} g(τz)/(k − k0(z))!.   (13)
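Identity (13) says that summing g over all of Z is the same as summing over each equivalence class via its representative, with the factor (k − k0(z))! correcting for the fact that the map τ ↦ τz hits each member of the class exactly (k − k0(z))! times. A brute-force numerical check (ours, not the paper's) for small k and n, using the first-appearance representatives as a concrete Z0 and an arbitrary test function g:

```python
import itertools
import math

k, n = 3, 4
Z = list(itertools.product(range(k), repeat=n))

def canonical(z):
    # representative of the class of z: labels renamed in order of
    # first appearance (one concrete choice of Z0)
    ren, out = {}, []
    for zi in z:
        ren.setdefault(zi, len(ren))
        out.append(ren[zi])
    return tuple(out)

def k0(z):
    return len(set(z))

g = {z: (i % 7) + 1.0 for i, z in enumerate(Z)}    # arbitrary test function

lhs = sum(g[z] for z in Z)
Z0 = {z for z in Z if canonical(z) == z}
rhs = sum(g[tuple(tau[zi] for zi in z)] / math.factorial(k - k0(z))
          for z in Z0 for tau in itertools.permutations(range(k)))
assert abs(lhs - rhs) < 1e-9                        # identity (13)
```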

In what follows we assume that the transition density of the original chain is invariant with respect to permutations, in the sense that

    f(τz′, τ^{-1}η′|τz, τ^{-1}η) = f(z′, η′|z, η),   ∀ τ ∈ Tk.   (14)

It can be verified that (14) holds when the simulations are performed directly from the full conditional distributions, i.e., when running a Gibbs sampler. This fact is proven in Lemma A.1 in the Appendix. Using similar arguments, it can also be shown that (14) holds when using the slice sampler instead (as do Robert et al., 2000), as well as when Metropolis steps with permutation invariant proposals are involved.
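For intuition, property (14) can be verified by exhaustive enumeration on a toy discrete model. The sketch below (our own toy construction, not from the paper) takes k = 2, a single allocation z, and one η-block with one binary entry per label, builds the one-sweep Gibbs kernel from a relabeling-invariant joint, and checks (14) for every pair of states:

```python
import itertools
import numpy as np

# Toy check of (14) for k = 2: state s = (z, (e0, e1)), where z is the
# allocation and (e0, e1) is a single eta-block updated jointly (cf. Lemma A.1).
A, B = np.array([1.0, 3.0]), np.array([2.0, 5.0])  # arbitrary positive weights

def J(z, e):
    # relabeling-invariant joint: J(tau s) = J(s) for the swap tau
    return A[e[z]] * B[e[1 - z]]

Zs = [0, 1]
Es = list(itertools.product([0, 1], repeat=2))

def cond_z(zp, e):                    # full conditional f(z' | e)
    return J(zp, e) / sum(J(z, e) for z in Zs)

def cond_e(ep, zp):                   # full conditional f(e' | z'), whole block
    return J(zp, ep) / sum(J(zp, e) for e in Es)

def K(sp, s):                         # one Gibbs sweep: z first, then the block
    (zp, ep), (z, e) = sp, s
    return cond_z(zp, e) * cond_e(ep, zp)

def tau(s):                           # the nontrivial relabeling for k = 2
    z, e = s
    return (1 - z, (e[1], e[0]))

for s in itertools.product(Zs, Es):
    for sp in itertools.product(Zs, Es):
        assert np.isclose(K(tau(sp), tau(s)), K(sp, s))   # property (14)
```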


Lemma 4.1. Under assumption (14) it holds

    f∗^(m)(z, η) = (1/k!) Σ_{τ∈Tk} f^(m)(τz, τ^{-1}η),   for all z ∈ Z, η ∈ H.   (15)

Furthermore, f∗^(m)(τz, τ^{-1}η) = f∗^(m)(z, η) for all τ ∈ Tk.

Proof. The result will be proven by induction. By the definition of the random permutation move it clearly holds for m = 1. Supposing that it holds for some m ≥ 1, we have that

    f∗^(m+1)(z′, η′) = Σ_{z∈Z} ∫_H f(z′, η′|z, η) f∗^(m)(z, η) dη
        = Σ_{z∈Z} ∫_H f(z′, η′|z, η) [(1/k!) Σ_{τ∈Tk} f^(m)(τz, τ^{-1}η)] dη   (by (15))
        = (1/k!) Σ_{τ∈Tk} Σ_{z∈Z} ∫_H f(τz′, τ^{-1}η′|τz, τ^{-1}η) f^(m)(τz, τ^{-1}η) dη   (by (14))
        = (1/k!) Σ_{τ∈Tk} Σ_{z̃∈Z} ∫_H f(τz′, τ^{-1}η′|z̃, η̃) f^(m)(z̃, η̃) dη̃
          (by making the change of variables (z̃, η̃) = (τz, τ^{-1}η) and using the fact that τZ = Z, τ^{-1}H = H, ∀ τ ∈ Tk)
        = (1/k!) Σ_{τ∈Tk} f^(m+1)(τz′, τ^{-1}η′)

and we are done. For the second argument, observe that for any permutations τ = (t1, . . . , tk) and ρ = (r1, . . . , rk) it holds

    ρ(τz) = ρ(t_{z_1}, . . . , t_{z_n}) = (r_{t_{z_1}}, . . . , r_{t_{z_n}}) = (τρ)z,   (16)

because τρ = (r_{t_1}, . . . , r_{t_k}). Moreover, for all η = (η1, . . . , ηk) we obtain that

    ρ(τη) = ρ(η_{t_1}, . . . , η_{t_k}) = (η_{t_{r_1}}, . . . , η_{t_{r_k}}) = (ρτ)η.   (17)

Using (15), for every τ ∈ Tk we have

    f∗^(m)(τz, τ^{-1}η) = (1/k!) Σ_{ρ∈Tk} f^(m)(ρ(τz), ρ^{-1}(τ^{-1}η))
        = (1/k!) Σ_{ρ∈Tk} f^(m)((τρ)z, (ρ^{-1}τ^{-1})η)   (by (16) and (17))
        = (1/k!) Σ_{ρ∈Tk} f^(m)((τρ)z, (τρ)^{-1}η)   (because (τρ)^{-1} = ρ^{-1}τ^{-1})
        = (1/k!) Σ_{γ∈Tk} f^(m)(γz, γ^{-1}η)   (by making the change of variables τρ = γ)
        = f∗^(m)(z, η),

and this completes the proof.

Let g, h be densities with respect to some measure µ and recall that their total variation distance can be expressed as ||g − h|| = (1/2) ∫ |g(x) − h(x)| dµ(x). Propositions 4.1 and 4.2 below state our main results.
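For discrete distributions the total variation distance reduces to half an ℓ1 norm, and averaging over the group of relabelings (as the RPS does, via (15)) can only move a distribution closer to a symmetric target; this is the intuition behind Proposition 4.1 below. A short illustration (ours, not the paper's):

```python
import numpy as np

def tv_distance(g, h):
    # ||g - h|| = (1/2) * sum_x |g(x) - h(x)| for discrete densities
    return 0.5 * np.abs(np.asarray(g) - np.asarray(h)).sum()

f_sym = np.array([0.5, 0.5])     # a symmetric target (k = 2 relabelings)
g = np.array([0.8, 0.2])         # some nonsymmetric approximation of it
g_avg = 0.5 * (g + g[::-1])      # average over the two relabelings
assert tv_distance(g_avg, f_sym) <= tv_distance(g, f_sym)
```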

Proposition 4.1. Assume that (14) holds. Then,

    ||f∗^(m) − f|| ≤ ||f^(m) − f||,   ∀ m ≥ 1.

Proof. Before proceeding, notice that an immediate consequence of (3) and (4) is the invariance property of the joint posterior distribution, that is,

    f(τz, τ^{-1}η|x) = f(z, η|x),   ∀ τ ∈ Tk, z ∈ Z, η ∈ H.   (18)

Let Z0 be any set of classes representatives. Then,

    2||f∗^(m) − f|| = Σ_{z∈Z} ∫_H |f∗^(m)(z, η) − f(z, η)| dη
        = Σ_{z∈Z0} Σ_{τ∈Tk} ∫_H |f∗^(m)(τz, η)/(k − k0(z))! − f(τz, η)/(k − k0(z))!| dη   (by (13))
        = Σ_{z∈Z0} Σ_{τ∈Tk} ∫_H |f∗^(m)(τz, τ^{-1}η̃)/(k − k0(z))! − f(τz, τ^{-1}η̃)/(k − k0(z))!| dη̃
          (by making the change of variables η̃ = τη)
        = Σ_{z∈Z0} Σ_{τ∈Tk} ∫_H |f∗^(m)(z, η̃)/(k − k0(z))! − f(z, η̃)/(k − k0(z))!| dη̃
          (since f∗^(m)(τz, τ^{-1}η̃) = f∗^(m)(z, η̃) by Lemma 4.1 and f(τz, τ^{-1}η̃) = f(z, η̃) by (18))
        = Σ_{z∈Z0} ∫_H |k! f∗^(m)(z, η̃)/(k − k0(z))! − k! f(z, η̃)/(k − k0(z))!| dη̃   (19)
        = Σ_{z∈Z0} ∫_H |Σ_{τ∈Tk} f^(m)(τz, τ^{-1}η̃)/(k − k0(z))! − k! f(z, η̃)/(k − k0(z))!| dη̃   (by (15))
        ≤ Σ_{z∈Z0} Σ_{τ∈Tk} ∫_H |f^(m)(τz, τ^{-1}η̃)/(k − k0(z))! − f(z, η̃)/(k − k0(z))!| dη̃
        = Σ_{z∈Z0} Σ_{τ∈Tk} ∫_H |f^(m)(τz, η)/(k − k0(z))! − f(z, τη)/(k − k0(z))!| dη
          (by making the change of variables τη = η̃)
        = Σ_{z∈Z0} Σ_{τ∈Tk} ∫_H |f^(m)(τz, η)/(k − k0(z))! − f(τz, η)/(k − k0(z))!| dη   (by (18))
        = Σ_{z∈Z} ∫_H |f^(m)(z, η) − f(z, η)| dη   (by (13))
        = 2||f^(m) − f||,

and the proof is completed.

Proposition 4.2. For any set Z0 of classes representatives it holds

    ||f_{Z0}^(m) − f_{Z0}|| = ||f∗^(m) − f||,   ∀ m ≥ 1.

Proof. Observe that

    2||f_{Z0}^(m) − f_{Z0}|| = Σ_{z∈Z0} ∫_H |f_{Z0}^(m)(z, η) − f_{Z0}(z, η)| dη
        = Σ_{z∈Z0} ∫_H |Σ_{τ∈Tk} f^(m)(τz, τ^{-1}η)/(k − k0(z))! − k! f(z, η)/(k − k0(z))!| dη   (by (12))
        = Σ_{z∈Z0} ∫_H |k! f∗^(m)(z, η)/(k − k0(z))! − k! f(z, η)/(k − k0(z))!| dη   (by (15))
        = 2||f∗^(m) − f||

by (19).

An immediate consequence of Propositions 4.1 and 4.2 is the following:

Corollary 4.1. Assume that the transition density of the original chain enjoys property (14). Then, for all m ≥ 1 it holds

    ||f_{Z0}^(m) − f_{Z0}|| ≤ ||f^(m) − f||.
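Proposition 4.1 can be watched numerically on the smallest possible example: k = 2, n = 1, no continuous parameter, a permutation-invariant two-state kernel, and the uniform (symmetric) target. The following toy sketch (ours, not from the paper) confirms the inequality at every iteration:

```python
import numpy as np

def tv(g, h):
    return 0.5 * np.abs(g - h).sum()

P = np.array([[0.9, 0.1],
              [0.1, 0.9]])            # kernel satisfying (14): P[1-i, 1-j] = P[i, j]
f = np.array([0.5, 0.5])              # symmetric target distribution
fm = np.array([1.0, 0.0])             # original chain started at state 0

for m in range(1, 11):
    fm = fm @ P                       # f^(m): original sampler after m steps
    f_star = 0.5 * (fm + fm[::-1])    # f*^(m) via (15): average over relabelings
    assert tv(f_star, f) <= tv(fm, f) + 1e-12   # Proposition 4.1
```

On this toy the RPS density is exactly uniform after a single step, which is the extreme case of the "at least as fast" statement.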

5 Discussion

In this paper, a general presentation of the label switching phenomenon in a broad class of missing data models has been given. Many popular statistical models are special cases of model (1) and typically possess properties (2)-(4). It is shown that any model that belongs to this family is marginally unidentifiable and so the label switching problem occurs in MCMC outputs, making the estimation of several quantities of interest difficult. The ECR algorithm, which was originally developed for solving the problem in the case of finite mixtures of distributions, can be used in this general context as well. It is also proven that the Markov chain produced by this algorithm converges to its target distribution at the same rate as that of the random permutation sampler of Frühwirth-Schnatter (2001), and both converge at least as fast as the chain produced by the original MCMC sampler to the posterior distribution.

It is well known that many MCMC methods, such as the Gibbs sampler, are often unable to explore all k! symmetric modal regions of the posterior distribution, especially when they are far apart from each other (see Marin et al., 2005). In such cases, the simulated chain obtained by the original MCMC algorithm in practice fails to converge to f(η|x). However, the ECR algorithm has as its target the distribution f_{Z0}(η|x), which is, roughly speaking, concentrated on just one modal region of f(η|x). This means that even if the original MCMC algorithm gets stuck in a single high posterior probability area, the reordered output behaves as if all of the symmetric areas had been explored. This is further supported by Proposition 4.2, which states that the rate of convergence to f_{Z0} is the same as that of the RPS to f, and by the fact that the RPS is by definition forced to visit all symmetric areas of the posterior. Of course, the performance of the ECR algorithm depends on the ability of the available MCMC algorithm to explore at least one modal region: the better the MCMC algorithm, the more information there is to be extracted by the ECR algorithm.


The results of Section 4 imply that the convergence rate of the ECR algorithm is not affected by the choice of Z0. However, the more f_{Z0} breaks the symmetry of the posterior distribution, the better the solution of the label switching problem. Papastamoulis and Iliopoulos (2010) suggested choosing Z0 by means of a pivot vector z∗ that coincides with the z-value of the posterior mode (z, η)^(MAP) := arg max f(z, η|x). (In fact, due to symmetry there are k! modes, but they obviously belong to the same equivalence class.) Then, the representatives of the remaining classes are selected based on their similarity to z∗. Nevertheless, since analytical evaluation of the mode is prohibitive even for moderate sample sizes, they proposed to estimate z^(MAP) by the z-value of the observed mode (ẑ, η̂)^(MAP) := arg max_{1≤m≤M} f(z^(m), η^(m)|x), where M is the length of the simulated chain. Of course, one may argue that determination of Z0 by the MCMC output itself jeopardizes the convergence rate results, but fortunately this is not the case: due to the discreteness of the state space of the z's, the event C = {ẑ^(MAP) = z^(MAP)} will occur sooner or later. An even better approach that accelerates the occurrence of C can be used in cases where it is easy to maximize f(z, η|x) with respect to η for any z ∈ Z. By defining g(z) = max_{η∈H} f(z, η|x), z∗ may be taken equal to arg max_{1≤m≤M} g(z^(m)). Indeed, since z^(MAP) = arg max_{z∈Z} g(z) and Z is a discrete set, the MCMC output will return z∗ = z^(MAP) in finite time with probability one.
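In code, the pivot selection amounts to scanning the stored draws for the largest (log-)posterior value; the sketch below assumes a hypothetical array `log_post` holding log f(z^(m), η^(m)|x) (or log g(z^(m)) when the profile maximization over η is available) alongside the draws:

```python
import numpy as np

def select_pivot(z_draws, log_post):
    """z* = z-value of the observed mode: the draw maximizing
    log f(z^(m), eta^(m) | x) (or log g(z^(m))) over m = 1, ..., M."""
    return z_draws[int(np.argmax(log_post))]

# hypothetical usage with M stored allocation vectors of length n:
# z_star = select_pivot(z_draws, log_post)
```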

Appendix: Gibbs sampler and property (14)

Rearrange the elements of η in blocks: η = (η_1, . . . , η_m), where η_l = (η_{l1}, . . . , η_{lk}) for all l = 1, . . . , m. For example, in a normal mixture model these blocks consist of the weights, means and variances, that is, m = 3 and η_1 = p, η_2 = µ and η_3 = σ². Thus, we can express the parameter vector as (z, η).

Lemma A.1. Suppose that at each iteration of the original MCMC sampler the parameter vector is updated via a Gibbs sampling scheme using the full conditional distributions of the blocks η_l, l = 1, . . . , m, and z_i, i = 1, . . . , n, in the following manner. If η^(m) = η′ and z^(m) = z′, then at the (m + 1)-th step the updates are

    z_i ∼ f(z_i|z_{1:(i−1)}, z′_{(i+1):n}, η′, x),   i = 1, . . . , n,
    η_l ∼ f(η_l|η_{1:(l−1)}, η′_{(l+1):m}, z, x),   l = 1, . . . , m.

Then property (14) is valid.

Proof. The transition density of the Markov chain can be written as

    f(z, η|z′, η′) = Π_{i=1}^n f(z_i|z_{1:(i−1)}, z′_{(i+1):n}, η′, x) × Π_{l=1}^m f(η_l|η_{1:(l−1)}, η′_{(l+1):m}, z, x).

According to the specific update scheme we have that

    f(z, η|z′, η′) = Π_{i=1}^n f(z_i|z_{1:(i−1)}, z′_{(i+1):n}, η′, x) Π_{l=1}^m f(η_l|η_{1:(l−1)}, η′_{(l+1):m}, z, x)
        ∝ Π_{i=1}^n f(z_{1:(i−1)}, z_i, z′_{(i+1):n}, η′|x) Π_{l=1}^m f(η_{1:(l−1)}, η_l, η′_{(l+1):m}, z|x)
        = Π_{i=1}^n f(z_{1:i}, z′_{(i+1):n}, η′|x) Π_{l=1}^m f(η_{1:l}, η′_{(l+1):m}, z|x)
        = Π_{i=1}^n f(τ^{-1}z_{1:i}, τ^{-1}z′_{(i+1):n}, τη′|x) Π_{l=1}^m f(τη_{1:l}, τη′_{(l+1):m}, τ^{-1}z|x)   (by (18))
        ∝ f(τ^{-1}z, τη|τ^{-1}z′, τη′),   ∀ τ ∈ Tk,   (20)

and the proof is completed, since the normalizing constant is the same in all cases.

References

[1] Brémaud, P. (1999). Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, New York.

[2] Cappé, O., Moulines, E. and Rydén, T. (2005). Hidden Markov Models. Springer-Verlag, New York.

[3] Celeux, G., Hurn, M. and Robert, C.P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95, 957–970.

[4] Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.

[5] Diebolt, J. and Robert, C.P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B, 56, 363–375.

[6] Frühwirth-Schnatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association, 96, 194–209.

[7] Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer, New York.

[8] Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.

[9] Jasra, A., Holmes, C.C. and Stephens, D.A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modelling. Statistical Science, 20, 50–67.

[10] Marin, J.M., Mengersen, K. and Robert, C.P. (2005). Bayesian modelling and inference on mixtures of distributions. In Handbook of Statistics, 25, D. Dey and C.R. Rao (eds). Elsevier-Sciences.

[11] Marin, J.M. and Robert, C.P. (2007). Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer-Verlag, New York.

[12] Papastamoulis, P. and Iliopoulos, G. (2010). An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. Journal of Computational and Graphical Statistics, 19, 313–331.

[13] Redner, R.A. and Walker, H.F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195–239.

[14] Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B, 59, 731–792.

[15] Robert, C.P., Rydén, T. and Titterington, D.M. (2000). Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. Journal of the Royal Statistical Society, Series B, 62, 57–75.

[16] Sperrin, M., Jaki, T. and Wit, E. (2010). Probabilistic relabelling strategies for the label switching problem in Bayesian mixture models. Statistics and Computing, 20, 357–366.

[17] Stephens, M. (1997). Bayesian methods for mixtures of normal distributions. D.Phil. dissertation, Department of Statistics, University of Oxford.

[18] Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B, 62, 795–809.