Posterior Convergence and Model Estimation in Bayesian Change-point Problems

Heng Lian
Division of Mathematical Sciences, SPMS, Nanyang Technological University, Singapore, 637371

Summary. We study the posterior distribution of the Bayesian multiple change-point regression problem when the number and the locations of the change-points are unknown. While it is relatively easy to apply the general theory to obtain the O(1/√n) rate up to some logarithmic factor, showing the exact parametric rate of convergence of the posterior distribution requires additional work and assumptions. Additionally, we demonstrate the asymptotic normality of the segment levels under these assumptions. For inferences on the number of change-points, we show that the Bayesian approach can produce a consistent posterior estimate. Finally, we argue that the point-wise posterior convergence property as demonstrated might have bad finite sample performance, in that a posterior that is consistent for model selection necessarily implies that the maximal squared risk will be asymptotically larger than the optimal O(1/√n) rate. This is the Bayesian version of the same phenomenon that has been noted and studied by other authors.

Keywords: change-point problems, posterior distribution, rate of convergence

1. Introduction

We consider the regression problem of estimating a piece-wise constant function when the number of segments as well as the locations of the change-points are unknown. This is an old problem that has attracted much attention recently (Goldenshluger et al., 2006; Ben Hariz et al., 2007; Fearnhead, 2008). Applications of multiple change-point models surged after efficient computation using reversible jump MCMC was discovered (Green, 1995); Green (1995) applied piece-wise constant functions to the coal mining disaster data in the context of a Poisson process. A more recent line of analysis that dispenses with MCMC for change-point problems starts with Liu and Lawrence (1999), where a dynamic programming approach is used to marginalize over segment levels and change-point locations; their original motivation was the problem of partitioning DNA sequences into homogeneous segments. This dynamic programming approach was later extended by Fearnhead (2006) and Lian (2008).
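To make this marginalization concrete, the following is a minimal sketch in the spirit of the recursions of Liu and Lawrence (1999) and Fearnhead (2006), but not code from those papers: it assumes a conjugate N(0, τ²) prior on segment levels and geometric segment lengths with change probability p, both our own illustrative choices.

```python
import numpy as np

def log_segment_marginal(y, t, s, tau2=1.0):
    """Log marginal likelihood of y[t:s+1] as one segment: level a ~ N(0, tau2),
    unit-variance Gaussian noise, so y[t:s+1] ~ N(0, I + tau2 * 11')."""
    seg = y[t:s + 1]
    m = seg.size
    quad = np.sum(seg**2) - tau2 * np.sum(seg)**2 / (1.0 + tau2 * m)
    return -0.5 * m * np.log(2 * np.pi) - 0.5 * np.log(1.0 + tau2 * m) - 0.5 * quad

def log_evidence(y, p=0.05, tau2=1.0):
    """Backward recursion over all O(n^2) candidate segments:
    Q[t] = log P(y[t:] | a segment starts at t), with geometric segment
    lengths (change probability p), so Q[0] is the marginal likelihood of
    the data with levels and change-points integrated out."""
    n = y.size
    Q = np.zeros(n + 1)  # Q[n] = 0 closes the recursion
    for t in range(n - 1, -1, -1):
        terms = []
        for s in range(t, n):
            # no change inside the segment; a change right after s unless s is last
            lp = (s - t) * np.log1p(-p) + (np.log(p) if s < n - 1 else 0.0)
            terms.append(log_segment_marginal(y, t, s, tau2) + lp + Q[s + 1])
        Q[t] = np.logaddexp.reduce(terms)
    return Q[0]

rng = np.random.default_rng(0)
y = np.concatenate((rng.normal(0.0, 1.0, 60), rng.normal(2.0, 1.0, 40)))
print(log_evidence(y))  # log marginal likelihood of the simulated data
```

A complementary backward-sampling pass through the same quantities would recover the posterior over change-point configurations; the sketch only computes the normalizing constant.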

Unlike the above studies, in this paper we are concerned only with the asymptotic properties of Bayesian multiple change-point problems, and we investigate from the frequentist viewpoint the posterior contraction characteristics of a simplified model. Although a piece-wise constant function involves only a finite number of parameters, as we only consider the case where an upper bound on the number of change-points is available a priori, the problem is nevertheless best studied from an infinite-dimensional viewpoint, placing the estimation problem in the context of function spaces. Until recently, little was known about the behavior of the posterior distribution of infinite-dimensional models. For consistency, Schwartz (1965) showed that the posterior is consistent when certain tests can be established for the true distribution versus the complement of its neighborhoods. Barron et al. (1999) further developed the theory using sieve constructions and metric entropy bounds. Convergence rates are studied in two independent and to some extent overlapping but complementary works, Ghosal et al. (2000) and Shen and Wasserman (2001). In particular, Ghosal et al. (2000) extended the idea of constructing suitable tests in order to bound the convergence rates for both nonparametric and parametric problems, and Ghosal and Van Der Vaart (2007) further extended the approach to non-i.i.d. observations. Tests for many specific problems can be found in the existing literature, although sometimes new tests need to be carefully designed.

In nonparametric Bayesian analysis, we have an i.i.d. sample Z1, . . . , Zn from a distribution P0 with density p0 with respect to some measure on the sample space (Z, B). The model space is denoted by P, which is known to contain the true distribution P0. Given a prior distribution Π on P, the posterior is the random measure

\[ \Pi_n(A \mid Z_1, \ldots, Z_n) = \frac{\int_A \prod_{i=1}^n p(Z_i)\, d\Pi(P)}{\int \prod_{i=1}^n p(Z_i)\, d\Pi(P)} . \]

For ease of notation, we omit the explicit conditioning and write Πn(A) for the posterior distribution. We say that the posterior is consistent if Πn(P ∈ P : d(P, P0) > ǫ) → 0 in P0n-probability for any ǫ > 0, where d is some suitable distance between probability measures. To study rates of convergence, let ǫn be a sequence decreasing to zero; we say the rate is at least ǫn if, for a sufficiently large constant M, Πn(P : d(P, P0) > Mǫn) → 0 in P0n-probability. We also need a slightly weaker definition of the rate of convergence, obtained by replacing M with a sequence Mn and requiring that the above posterior mass converge to zero for any sequence Mn that diverges to infinity. This definition is usually needed in parametric problems to get rid of the extra log n factor in the convergence rates.

In our regression problem, we observe an i.i.d. sample Z = (Z1, . . . , Zn), with the distribution of Zi = (Xi, Yi) defined structurally by Yi = θ0(Xi) + ǫi for i.i.d. Gaussian noise ǫi ∼ N(0, 1), where θ0 is a piece-wise constant function on [0, 1) with unknown locations of change-points. We can write

\[ \theta_0(t) = \sum_{j=1}^{k_0} a_j I(t_{j-1} \le t < t_j), \qquad t_0 = 0 < t_1 < t_2 < \cdots < t_{k_0} = 1, \]

using the indicator function, so that θ0 is parameterized by (a, t), a = (a1, . . . , ak0), t = (t0, . . . , tk0). For simplicity, we assume the Xi are i.i.d. uniform on [0, 1), and note that it is straightforward to extend all the following results to any distribution of X with density bounded away from zero and infinity. Note that P0 is fully determined by θ0 under these assumptions, and thus we also use the space of piece-wise constant functions as our model space, which is equivalent to using P. The measure induced by θ is denoted by Pθ, so that Pθ0 is the same as P0, the true distribution.
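As a concrete illustration of this data-generating mechanism, the following minimal simulation sketch draws a sample from the model; the particular values of k0, a and t are arbitrary choices of ours, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# true piece-wise constant function: k0 = 3 segments on [0, 1)
t = np.array([0.0, 0.3, 0.7, 1.0])   # change-point locations, t0 = 0 < ... < t_{k0} = 1
a = np.array([-1.0, 0.5, 2.0])       # segment levels

def theta0(x):
    """theta_0(x) = sum_j a_j * I(t_{j-1} <= x < t_j)."""
    return a[np.searchsorted(t, x, side="right") - 1]

n = 500
X = rng.uniform(0.0, 1.0, size=n)        # design: i.i.d. uniform on [0, 1)
Y = theta0(X) + rng.standard_normal(n)   # unit-variance Gaussian noise
```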

Lian (2007a) studied the consistency issue for the above model, with the exception that there the Xi's are deterministically chosen on a grid; in that paper, consistency is proved also for the case where the true regression function is in a Lipschitz class. Another related work is Scricciolo (2007), where the Bayesian density estimation problem is studied with the density approximated by piece-wise constant functions. Besides the fact that that paper is interested in density estimation instead of regression, its focus is very different from the current one: it is mostly concerned with approximating a smooth density by step functions, aiming to achieve the optimal rate up to a logarithmic factor. For density functions that are piece-wise constant, a parametric rate of convergence is proved there, also with an extra logarithmic factor. A diverging number of grid points is used, and thus that approach cannot be used to estimate the number of segments when the density is truly piece-wise constant.

One simplification of our model, compared with the works mentioned at the beginning of this section that focus on computational issues, is that the variance of the noise is assumed to be known (and equal to 1 without loss of generality). Bayesian regression with unknown noise level can run into additional technical difficulties, especially in the design of appropriate tests. Consistency of a regression problem with unknown noise level is addressed in Choi and Schervish (2007); we hope to address this problem in our context in a future paper.

In this paper, we focus on the case where θ0 is piece-wise constant and aim to achieve the exact parametric O(1/√n) rate of convergence; we also study posterior consistency in the estimation of the number of change-points, which we refer to as the model selection problem. The proofs for the estimation rates involve direct application of general theorems in Ghosal et al. (2000), but the calculation of the covering number is nontrivial in this case. In order to achieve the exact parametric rate, an additional assumption needs to be made to exclude functions with segment lengths that are too short.

2. Main results

Consider the case where we have a priori bounds for the number of change-points as well as for the segment levels {aj}. The model space is defined as

\[ \Theta = \Big\{ \theta : \theta(t) = \sum_{j=1}^{k} a_j I(t_{j-1} \le t < t_j),\ t_0 = 0 < t_1 < \cdots < t_k = 1,\ k \le k_{\max},\ |a_j| \le K \Big\} . \]

By convention, we say that θ with tk = 1 has k change-points, which is the same as the number of segments. Another equivalent representation of Θ is Θ = {(a, t) ∈ [−K, K]^k × Tk : 1 ≤ k ≤ kmax}, where Tk is the set of (k + 1)-tuples (t0, . . . , tk) with tj < tj+1. We will not distinguish between these two representations, and θ can denote either a function or the tuple (a, t); the ambiguity can always be resolved by the context.

For rates of convergence, the distance d we use is the L2 norm of the function,

\[ \|\theta\| = \Big( \int_0^1 \theta^2(x)\, dx \Big)^{1/2} . \]

Since we only consider uniformly bounded functions, the L2 norm is equivalent to the Hellinger distance (e.g. Ghosal and Van Der Vaart (2007), Section 7.7).

We now specify a prior on Θ using a hierarchical approach. Let Θk be the subspace of Θ consisting of functions with k change-points; the prior Π is specified as a mixture

\[ \Pi = \sum_{k=1}^{k_{\max}} p(k)\, \Pi_k, \qquad p(k) > 0, \qquad \sum_k p(k) = 1, \]
with Πk the prior measure on Θk. We assume that Πk has a density πk(θ), which can be further decomposed as

\[ \pi_k(a, t) = \pi_k^a(a \mid t)\, \pi_k^t(t) . \]

The assumption we make on the prior is:

(A) The densities πka(a|t) and πkt(t) are bounded away from zero and infinity on [−K, K]^k and Tk, respectively.

This assumption is satisfied, for example, when t1, t2, . . . , tk−1 are distributed as the order statistics of k − 1 points uniformly distributed on [0, 1), while the segment levels are independent and uniformly distributed on [−K, K]. The first simple result shows that the posterior rate of convergence is n^{−1/2} up to a logarithmic factor.

Theorem 1. Under assumption (A), the posterior rate of convergence is at least ǫn = (log n/n)^{1/2}, i.e., Πn(θ : ||θ − θ0|| > Mǫn) → 0 in P0n-probability for sufficiently large M.

Theorem 1 concerns the convergence rate of the estimation problem. A different problem is the convergence of the posterior for the number of change-points. Under no additional assumptions, we can show that the posterior concentrates on the true number of change-points with probability converging to 1.

Theorem 2. Under the same assumption (A), we have Πn(k = k0) → 1 in P0n-probability, where k0 is the number of change-points of the true function θ0.

Nonparametric Bayesian model selection has been investigated in Ghosal et al. (2008). The focus of that paper is on conditions under which adaptive rates are achieved when simultaneously considering models with different rates of contraction, so it seems the results presented there cannot be directly applied in our case.

To get rid of the extra logarithmic factor in Theorem 1, one would use the more refined Theorem 2.4 in Ghosal et al. (2000), which relies on a local covering number instead of the global one. Nevertheless, as shown by Lemma 4 in the appendix, the local covering number for Θ is not bounded, as would be required if we set ǫn = O(1/√n). Instead, we consider the smaller model space

\[ \Theta^{\delta} = \{ \theta \in \Theta : \min_j |t_j - t_{j-1}| \ge \delta \} . \]

We can define Θδk in a similar way, and assumption (A) can be modified accordingly. Specifying a prior on Θδ is easy: conceptually, we can just restrict πkt(t) to be supported on Θδ and renormalize the density. Reversible jump algorithms can easily be modified to take the constraint into account, and dynamic programming can also incorporate the pre-specified shortest possible segment length (Lian, 2007b). Theorems 1 and 2 remain true on Θδ with few modifications of the proofs. Green (1995) also noticed the practical advantage of avoiding short steps, proposing the use of the even-numbered order statistics of 2k − 1 uniformly distributed points so that short segment lengths are penalized.

As shown in the appendix, putting a lower bound on the segment lengths makes the local covering number bounded by a constant; this requires a rather detailed construction of the covering. Using this more refined bound on the covering number, we can achieve the exact parametric rate.
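For concreteness, a draw from a prior of this kind on Θδ can be generated as follows. This is a minimal sketch under the uniform example above (uniform p(k), order-statistics change-points, uniform levels); the rejection step is one simple way to realize the renormalized density restricted to Θδ, and all numeric values are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_from_prior(k_max=5, K=3.0, delta=0.05):
    """One draw theta = (a, t) from the hierarchical prior on Theta^delta:
    k uniform on {1, ..., k_max}, change-points the order statistics of
    k-1 uniforms (rejected until every segment has length >= delta),
    and levels i.i.d. uniform on [-K, K]."""
    k = rng.integers(1, k_max + 1)           # p(k) = 1/k_max > 0
    while True:
        t = np.concatenate(([0.0], np.sort(rng.uniform(0.0, 1.0, size=k - 1)), [1.0]))
        if np.min(np.diff(t)) >= delta:      # restrict pi_k^t to Theta^delta
            break
    a = rng.uniform(-K, K, size=k)           # segment levels
    return a, t

a, t = draw_from_prior()
print(a, t)
```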


Theorem 3. For any δ > 0, under assumption (A), the posterior rate of convergence on Θδ is at least ǫn = O(1/√n). That is, for every Mn → ∞, we have Πn(θ ∈ Θδ : ||θ − θ0|| > Mnǫn) → 0 in P0n-probability.

Combining Theorems 2 and 3 immediately gives the rate of convergence for the change-point locations:

Corollary 1. Under the same assumptions as above, the posterior convergence rate for the change-point locations is at least ǫn² = O(1/n); that is, for any sequence Mn → ∞, Πn(max1≤j≤k0 |tj − t0j| > Mnǫn²) → 0 in P0n-probability.

This rate agrees, of course, with many frequentist approaches, say those based on cumulative sums. It is well known that the posterior distribution in regular parametric models converges to a Gaussian distribution under weak conditions. Since the previous results show that the number and the locations of the change-points can be consistently estimated, one naturally conjectures that the posterior distribution of the segment levels converges to a multivariate Gaussian distribution. This is indeed the case, as stated in the following theorem.

Theorem 4. Suppose the true segment lengths are lj = t0j − t0j−1, j = 1, . . . , k0. Denote by Πna|Z the posterior distribution of a = (a1, . . . , ak) restricted to the event k = k0 (which has posterior probability converging to 1), and let I0 = diag(lj · n) be the information matrix. Then

\[ E_{Z|\theta_0} \big\| \Pi^n_{a|Z} - N(\hat a(t_0), I_0^{-1}) \big\|_{TV} \to 0 , \]

where ||P − Q||TV is the total variation distance between probability measures P and Q, and â(t0) is the maximum likelihood estimator of a assuming the true locations of the change-points are known.
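Theorem 4 lends itself to a direct numerical check in the simplest setting. The sketch below (our own illustration, not part of the paper; the jump size, change-point location and grids are arbitrary choices) uses one change-point with flat priors on the two levels and a uniform prior on the change-point location, and computes the total variation distance between the marginal posterior of the first segment level, with the location integrated out, and the Gaussian limit of Theorem 4.

```python
import numpy as np

rng = np.random.default_rng(2)

def tv_posterior_vs_limit(n, jump=1.0, t0=0.5):
    """TV distance between the marginal posterior of the first level a1
    (change-point location tau integrated out, flat priors on the levels)
    and the limit N(a1_hat(t0), 1/n1), where a1_hat uses the true t0."""
    X = rng.uniform(0.0, 1.0, n)
    Y = (X >= t0) * jump + rng.standard_normal(n)
    taus = np.linspace(0.1, 0.9, 161)
    logw, means, sds = [], [], []
    for tau in taus:
        left = X < tau
        n1 = int(left.sum()); n2 = n - n1
        if n1 == 0 or n2 == 0:
            logw.append(-np.inf); means.append(0.0); sds.append(1.0)
            continue
        y1, y2 = Y[left], Y[~left]
        ss = ((y1 - y1.mean())**2).sum() + ((y2 - y2.mean())**2).sum()
        # log marginal likelihood of tau (levels integrated out)
        logw.append(-0.5 * ss - 0.5 * np.log(n1) - 0.5 * np.log(n2))
        means.append(y1.mean()); sds.append(n1 ** -0.5)
    logw = np.array(logw)
    w = np.exp(logw - logw.max()); w /= w.sum()
    grid = np.linspace(-1.0, 1.0, 4001); dx = grid[1] - grid[0]
    norm = lambda m, s: np.exp(-0.5 * ((grid - m) / s)**2) / (s * np.sqrt(2 * np.pi))
    mix = sum(wi * norm(m, s) for wi, m, s in zip(w, means, sds))  # posterior of a1
    left0 = X < t0
    lim = norm(Y[left0].mean(), left0.sum() ** -0.5)               # Theorem 4 limit
    return 0.5 * np.sum(np.abs(mix - lim)) * dx

for n in (200, 2000):
    print(n, tv_posterior_vs_limit(n))   # the distance shrinks as n grows
```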

The above theorems show that the Bayesian procedure possesses very good properties: on the one hand, the exact parametric rate is achieved for the estimation problem in the function space; on the other hand, the number of change-points can be consistently estimated. This is reminiscent of the recent literature on the oracle property of penalized estimators. As shown in Fan and Li (2001), the SCAD estimator, when the smoothing parameter is chosen appropriately, estimates the zero coefficients in a linear regression model as exactly zero with probability converging to one as the sample size increases; at the same time, the estimator is still consistent for the nonzero coefficients, and its asymptotic distribution is the same whether or not the correct zero positions are known. This is called the oracle property by Fan and Li (2001). More recently, Leeb and Pötscher (2008) showed that the oracle property might be misleading in terms of the estimator's finite sample performance, and that it is impossible to adapt to the unknown zero restrictions without paying a price. The caveat lies in the point-wise nature of the asymptotic theory laid out in Fan and Li (2001): Leeb and Pötscher (2008) show that an unbounded (normalized) risk results for any estimator possessing the sparsity property. Another related work is Yang (2005), where the author shows that AIC has a minimax property which cannot be shared by any model-selection-consistent estimator in a regression problem.

In our context, a similar conclusion can be drawn for the Bayesian multiple change-point problem. Theorem 2 and Theorem 3 apply to a fixed true piece-wise constant function, and thus the convergence as stated is point-wise in nature. It is not difficult to see from the proof of Theorem 3 that the 1/√n rate is not actually uniform over the class Θδ. The reason is that, in the bound for the local covering number (Lemma 5 in the appendix), the constants involved do depend on θ0. In particular, the derivation of the lemma requires a lower bound on the size of the jumps between neighboring segments, and thus the convergence is not uniform over Θδ. Intuitively, small jumps make the estimation more difficult, and heavier penalization by the prior must be entertained (possibly by using a prior that depends on the sample size) to achieve model selection consistency, at the cost of losing estimation accuracy. As seen in the proof of Theorem 5, the difficulty occurs when the size of the jump is of order O(1/√n), in which case it becomes difficult to detect the change-point. Nevertheless, as discussed above, the convergence is uniform if we further restrict attention to the sub-class

\[ \Theta^{\delta_1, \delta_2} = \{ \theta \in \Theta^{\delta_1} : \min_j |a_j - a_{j-1}| \ge \delta_2 \} . \]

We state the uniform convergence as a proposition without proof:

Proposition 1. For any fixed δ1, δ2 > 0, the rate of convergence is uniformly at least ǫn = O(1/√n). That is, for any Mn → ∞,

\[ \sup_{\bar\theta \in \Theta^{\delta_1,\delta_2}} E_{Z|\bar\theta}\, \Pi^n\big( \theta \in \Theta^{\delta_1,\delta_2} : \|\theta - \bar\theta\| > M_n \epsilon_n \big) \to 0 . \]

The property of model selection consistency is still satisfied in this case. On the other hand, the following result confirms that we cannot expect the posterior to converge uniformly over the class Θδ if the method adapts to the number of change-points. Note that the theorem applies to any Bayesian posterior distribution for the change-point problem, not just the specific prior we constructed.

Theorem 5. Suppose the posterior distribution satisfies the model selection consistency condition Πn(k = k0) → 1 in P0n-probability. Then the maximal L2 convergence of θ is necessarily slower than the parametric rate ǫn = O(1/√n). That is, for some Mn → ∞,

\[ \sup_{\bar\theta \in \Theta^{\delta}} E_{Z|\bar\theta}\, \Pi^n\big( \theta \in \Theta^{\delta} : \|\theta - \bar\theta\| > M_n \epsilon_n \big) \to 1 . \]

The above theorem demonstrates the trade-off between function estimation and model selection for our Bayesian multiple change-point problem.

3. Discussion

In this paper, we investigated in detail some asymptotic properties of Bayesian multiple change-point problems when the noise level is assumed known. We proved an estimation rate of convergence as well as model selection consistency of the posterior distribution. The main contribution of the paper is to show that the exact parametric rate is achieved for a restricted class of piece-wise constant functions, and that this optimal rate cannot be achieved uniformly over the class. Our theory still leaves some gaps. For example, it is still unknown whether it is absolutely necessary to restrict the functions to have not-too-short segment lengths in order to achieve the optimal rate; this restriction makes the local covering number bounded, as needed to apply Theorem 2.4 of Ghosal et al. (2000). Besides, the situation with unknown noise level is of significant practical importance, in which case one should also put a prior on the noise level. The convergence property in this case is still an open problem.


Appendix

Some Lemmas

In preparation for the proofs of the main results, we first collect some lemmas. Lemma 4 below shows that the local covering number is unbounded, as remarked in the main text, and is not used in any other proofs. The letter C denotes a generic constant which may differ from place to place. Note that, since we only consider a uniformly bounded class of functions, the Hellinger distance, the Kullback-Leibler divergence, and the second moment of the likelihood ratio are all equivalent to the L2 norm of the regression function. In the following, we set δ0 = min{minj |t0j − t0j−1|, minj |a0j − a0j−1|} > 0, which bounds the segment lengths as well as the jump sizes from below.

Lemma 1. Under condition (A), we have the following lower bound for the prior concentration when ǫn → 0:

\[ \Pi(\theta : \|\theta - \theta_0\| \le \epsilon_n) \ge C\, p(k_0)\, \epsilon_n^{3k_0 - 2} . \]

Proof. When θ = Σ_{j=1}^{k0} aj I(tj−1 ≤ t < tj) ∈ Θk0 with |aj − a0j| < ǫn/2, 1 ≤ j ≤ k0, and |tj − t0j| < ǫn²/(8K²kmax), 1 ≤ j ≤ k0 − 1, it is easy to show that ||θ − θ0||² < ǫn². Since the prior density for (a, t) is bounded away from zero, we get

\[ \Pi(\theta : \|\theta - \theta_0\| \le \epsilon_n) \ge p(k_0)\, \Pi_{k_0}(\theta : \|\theta - \theta_0\| \le \epsilon_n) \ge C\, p(k_0)\, \epsilon_n^{3k_0 - 2} . \]
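For completeness, the one-line computation behind "it is easy to show" can be spelled out (this expansion is ours, not in the original): on the overlap of corresponding segments the levels differ by less than ǫn/2, while the k0 − 1 mismatch regions each have length less than ǫn²/(8K²kmax), on which the difference is at most 2K, so that

\[ \|\theta - \theta_0\|^2 \le \Big(\frac{\epsilon_n}{2}\Big)^2 \cdot 1 + (2K)^2 (k_0 - 1)\, \frac{\epsilon_n^2}{8K^2 k_{\max}} < \frac{\epsilon_n^2}{4} + \frac{\epsilon_n^2}{2} < \epsilon_n^2 . \]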

Lemma 2. Let δ′ = √(δ0³/(4kmax)), with δ0 defined immediately before Lemma 1. When ǫ < δ′, we have

\[ \Pi_k(\theta \in \Theta_k : \|\theta - \theta_0\| \le \epsilon) \le C \epsilon^{3k_0 - 2}, \qquad k = 1, \ldots, k_{\max}, \]

where C is a constant that depends on K, kmax and δ0. For k > k0, the bound can be refined to Cǫ^{3k0−1}.

Proof. First we consider the case k < k0 and θ ∈ Θk. By the definition of δ0, the k0 − 1 intervals (t0j − δ0/2, t0j + δ0/2), j = 1, . . . , k0 − 1, are nonoverlapping, so at least one segment of θ includes one of these intervals. Thus the distance between θ and θ0 is at least √(δ0(δ0/2)²) ≥ δ′, and hence Πk(θ ∈ Θk : ||θ − θ0|| ≤ ǫ) = 0.

When k ≥ k0, θ ∈ Θk and ||θ − θ0|| ≤ ǫ, for any j let s(j) be the index of the segment [ts(j)−1, ts(j)) of θ which has the largest overlap with [t0j−1, t0j). Obviously the length of the overlap is at least δ0/kmax. This implies |as(j) − a0j| ≤ ǫ√(kmax/δ0) (otherwise the squared L2 distance between θ and θ0 would be at least (as(j) − a0j)²·δ0/kmax > ǫ²). Similarly, let t(j) be the index of the change-point of θ that is closest to t0j; then |tt(j) − t0j| ≤ 4ǫ²/δ0² (otherwise the squared distance would be bigger than (4ǫ²/δ0²)(δ0/2)² = ǫ²). The above considerations give k0 constraints on the segment levels of θ as well as k0 − 1 constraints on the change-point locations; thus, under assumption (A), the prior probability Πk(θ ∈ Θk : ||θ − θ0|| ≤ ǫ) is bounded by Cǫ^{3k0−2}.

For the refined bound, we consider k = k0 + 1 only, for simplicity. Without loss of generality, assume t(j) = j; then ||θ − θ0|| ≤ ǫ implies the additional restriction (a0k0 − ak0+1)²(1 − tk0) ≤ ǫ², which gives an additional factor of ǫ in the bound.

Lemma 3. log D(ǫ, Θ) ≤ b log(1/ǫ) + c for some constants b, c > 0 that depend on K and kmax, where D(ǫ, Θ) is the ǫ-covering number of Θ.

Proof. Choose a grid on the domain [0, 1) and another grid on [−K, K]:

\[ \Delta_t = \Big\{ \frac{\epsilon^2}{8K^2 k_{\max}} \cdot i,\ i \in \mathbb{N} \Big\} \cap [0, 1], \qquad \Delta_y = \{ \epsilon \cdot i,\ i \in \mathbb{Z} \} \cap [-K, K] . \]

Let Θ̃ = {θ ∈ Θ : θ jumps only at points in ∆t and takes segment levels in ∆y}. It is then easy to show that Θ̃ is an ǫ-covering of Θ, with covering number bounded by

\[ \binom{\lfloor 8K^2 k_{\max}/\epsilon^2 \rfloor + 1}{k_{\max}} \Big( \frac{2K}{\epsilon} + 1 \Big)^{k_{\max}} . \]
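The rounding argument behind this covering is easy to verify numerically. The sketch below (ours; the parameter values are arbitrary) rounds the change-points of a random θ ∈ Θ onto ∆t and its levels onto ∆y, and checks that the resulting L2 distance is at most ǫ.

```python
import numpy as np

rng = np.random.default_rng(3)
K, k_max, eps = 3.0, 5, 0.2

def l2_dist(a1, t1, a2, t2):
    """Exact L2 distance between two step functions given as (levels, breakpoints)."""
    cuts = np.unique(np.concatenate((t1, t2)))
    sq = 0.0
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        mid = 0.5 * (lo + hi)
        v1 = a1[np.searchsorted(t1, mid) - 1]
        v2 = a2[np.searchsorted(t2, mid) - 1]
        sq += (v1 - v2)**2 * (hi - lo)
    return np.sqrt(sq)

# random theta in Theta, then its rounding onto the Lemma 3 grids
k = rng.integers(1, k_max + 1)
t = np.concatenate(([0.0], np.sort(rng.uniform(0, 1, k - 1)), [1.0]))
a = rng.uniform(-K, K, k)
dt = eps**2 / (8 * K**2 * k_max)             # spacing of the time grid Delta_t
t_round = np.concatenate(([0.0], np.round(t[1:-1] / dt) * dt, [1.0]))
a_round = np.round(a / eps) * eps            # spacing eps of the level grid Delta_y

d = l2_dist(a, t, a_round, t_round)
print("L2 distance:", d, "eps:", eps)
assert d <= eps
```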

The next lemma considers the local covering/packing number; in particular, it shows why we cannot apply Theorem 2.4 of Ghosal et al. (2000) to obtain the exact parametric rate on Θ.

Lemma 4. D(ǫ/2, {θ ∈ Θ : ǫ ≤ ||θ − θ0|| ≤ 2ǫ}) ≥ C/ǫ² for some constant C; in particular, the local covering number is unbounded as ǫ → 0.

Proof. Without loss of generality, consider θ0 = 0; we construct a lower bound for the packing number. For simplicity, assume 1/(4ǫ²) is an integer. Using the partition [0, 1) = ∪_{i=1}^{1/(4ǫ²)} [(i − 1)·4ǫ², i·4ǫ²), construct the piece-wise constant functions θi(t) = I((i − 1)·4ǫ² ≤ t < i·4ǫ²). Obviously, for this set of functions, ||θi|| = 2ǫ and ||θi − θj|| = 2√2ǫ. The lower bound for the covering number then follows from the standard relationship between covering and packing numbers.

Lemma 5. For 2ǫ < δ′ = √(δ0³/(4kmax)),

\[ \log D\big(\epsilon/2, \{\theta \in \Theta^\delta : \epsilon \le \|\theta - \theta_0\| \le 2\epsilon\}\big) \le C \]

for some constant C that depends on δ, δ0, K and kmax but does not depend on ǫ.

Proof. Suppose that ||θ − θ0|| ≤ 2ǫ. From the proof of Lemma 2, each change-point of θ0 has a corresponding change-point of θ satisfying |tt(j) − t0j| ≤ 16ǫ²/δ0². For any segment level a0j of θ0, denote by r(j) the index of the segment of θ that overlaps [t0j−1, t0j) on a set of length at least δ/2; by the same argument as in Lemma 2, |ar(j) − a0j| ≤ 2√2ǫ/√δ.

To construct a covering, we partition [0, 1) into nonoverlapping intervals; in the following, M, B and N are sufficiently large integers to be chosen later. First, each interval [t0j − 16ǫ²/δ0², t0j + 16ǫ²/δ0²] is partitioned into M subintervals of equal length. The rest of [0, 1) is partitioned into intervals with lengths between δ/(2B) and δ/B. Obviously the total number of subintervals does not depend on ǫ. These subintervals fall into two types: (i) those containing some change-point of θ0; (ii) those entirely contained in some segment of θ0. The function class F that forms a covering is defined as the set of functions that are piece-wise constant with respect to the partition, take the value 0 on type (i) subintervals, and, on type (ii) subintervals contained in segment j of θ0, take values of the form a0j + (2√2ǫ/(N√δ))·i, i = −N, −(N − 1), . . . , N. The size of F is a constant independent of ǫ, and we show next that it is indeed an ǫ/2-covering.

On subintervals of type (i), the squared L2 distance between F and θ, restricted to these intervals, is at most (16ǫ²/(Mδ0²))·kmax·K². Type (ii) subintervals can be further divided into three types: (iii) those containing a change-point of θ that is closest to some change-point of θ0; (iv) those containing a change-point of θ other than the ones closest to some change-point of θ0; (v) those entirely contained in some segment of θ. On subintervals of type (iii) the squared distance is at most (16ǫ²/(Mδ0²))·kmax·K². On subintervals of type (iv) the squared distance is at most (δ/B)·kmax·(2√2ǫ/√δ)². On subintervals of type (v) the squared distance is at most (2√2ǫ/(N√δ))². Thus, when M, B and N are large enough, we have an ǫ/2-covering.

Proofs of the main results

Proof of Theorem 1. We apply Theorem 2.1 in Ghosal et al. (2000) with ǫn = C(log n/n)^{1/2}. Condition (2.2) for that theorem is verified in Lemma 3, condition (2.3) is trivially satisfied, and condition (2.4) is verified in Lemma 1.

Proof of Theorem 2. Theorem 1 immediately implies that the under-estimation probability Πn(k < k0) → 0 in P0n-probability. For over-estimation, it is sufficient to show that

\[ P_0^n\Big( \int_{\Theta_{k_0}} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\pi_{k_0}(\theta) < C n^{-(3k_0 - 2 + 2\xi)/2} \Big) \to 0 \]

for some 0 < ξ < 1/2, and that

\[ P_0^n\Big( \int_{\Theta_k} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\pi_k(\theta) > (\log n)^{-1} n^{-(3k_0 - 2 + 2\xi)/2} \Big) \to 0 \quad \text{when } k > k_0 . \]

Step 1. Let Un = {t ∈ Tk0 : t = t0 + u, u ∈ Rk0+1, u0 = uk0 = 0, |ui| < c/n}, so that Πtk0(Un) ≥ c′n^{−k0+1}, where Πtk0 is the prior measure on the locations of the change-points. For any fixed t ∈ Un, with probability converging to 1, by considering a small neighborhood of the maximum likelihood estimator â(t) for the given t, as in the Laplace approximation, we have

\[ \int \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\pi_{k_0}(a \mid t) \ge \frac{C}{n^{k_0/2}}\, \frac{p_{(\hat a(t), t)}^n(Z)}{p_0^n(Z)} \ge \frac{C}{n^{k_0/2}}\, \frac{p_{(a^0, t)}^n(Z)}{p_0^n(Z)} . \]

For any t ∈ Un, conditional on {Xi}, log(p_{(a^0,t)}^n(Z)/p_0^n(Z)) is normally distributed with mean −f(t)²/2 and variance f(t)², where f(t)² = Σ_{j=1}^{k0−1} (a0_{j+1} − a0_j)²·nj and nj is the number of Xi falling into the subinterval [t0j, tj) or [tj, t0j) (depending on the sign of uj). Since nj is binomially distributed with mean less than c, we have f(t)² = Op(1), and thus log(p_{(a^0,t)}^n(Z)/p_0^n(Z)) ≥ −ξ log n with probability converging to 1. Thus, with probability converging to 1,

\[ \int_{\Theta_{k_0}} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\pi_{k_0}(\theta) \ge \Pi_{k_0}^t(U_n)\, \frac{C}{n^{k_0/2}}\, n^{-\xi} = C n^{-(3k_0 - 2 + 2\xi)/2} . \]

Step 2. Let δn = (2 log n)^{-1} n^{-(3k_0 - 2 + 2\xi)/2} and ǫn = C(log n/n)^{1/2}. We have

\[ P_0^n\Big( \int_{\Theta_k} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\pi_k(\theta) > (\log n)^{-1} n^{-(3k_0 - 2 + 2\xi)/2} \Big) \le P_0^n\Big( \int_{\{\|\theta - \theta_0\| \le \epsilon_n\} \cap \Theta_k} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\pi_k(\theta) > \delta_n \Big) + P_0^n\Big( \int_{\{\|\theta - \theta_0\| > \epsilon_n\} \cap \Theta_k} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\pi_k(\theta) > \delta_n \Big) . \]

By the Markov inequality and Fubini's theorem, the first term above is bounded by δn^{−1}πk(||θ − θ0|| ≤ ǫn) ≤ δn^{−1}ǫn^{3k0−1} → 0, where we have used the refined bound of Lemma 2. For the second term, we apply Theorem 1 of Shen and Wasserman (2001) with the ǫ in that theorem replaced by the ǫn defined above. Using

\[ \int_{\epsilon^2/2^8}^{\sqrt{2}\,\epsilon} \sqrt{b \log(1/u) + c}\; du \le \sqrt{b \log(2^8/\epsilon^2) + c} \cdot \sqrt{2}\,\epsilon , \]

the entropy condition in that theorem can be verified for ǫ = ǫn. Thus, when C is large enough, the second term also converges to 0.

Proof of Theorem 3. We apply Theorem 2.4 in Ghosal et al. (2000) with ǫn = A/√n for A sufficiently large. Condition (2.7) for that theorem is verified in Lemma 5, and (2.8) is trivially satisfied. It remains to verify (2.9), for which we need to bound Π(jǫn ≤ ||θ − θ0|| ≤ 2jǫn)/Π(||θ − θ0|| ≤ ǫn). When j < δ′√n/(2A), we have 2jǫn < δ′ and Lemma 2 can be applied directly to obtain Πk(θ ∈ Θδk : ||θ − θ0|| ≤ 2jǫn) ≤ C(jǫn)^{3k0−2}; together with the lower bound of Lemma 1, we get Π(jǫn ≤ ||θ − θ0|| ≤ 2jǫn)/Π(||θ − θ0|| ≤ ǫn) ≤ Cj^{3k0−2} ≤ C exp(A²j²/2). For j ≥ δ′√n/(2A), we bound the numerator by 1, and Π(jǫn ≤ ||θ − θ0|| ≤ 2jǫn)/Π(||θ − θ0|| ≤ ǫn) ≤ C(1/ǫn)^{3k0−2} ≤ C exp(A²j²/2) for this range of j.

Proof of Corollary 1. By Theorem 2, we may assume that the number of change-points of θ is also k0. Then max1≤j≤k0 |tj − t0j| > Mnǫn² implies that ||θ − θ0||² > (δ0/2)²Mnǫn². Thus Πn(max1≤j≤k0 |tj − t0j| > Mnǫn²) ≤ Πn(θ ∈ Θδ : ||θ − θ0|| > (δ0/2)√Mn ǫn) → 0.

Proof of Theorem 4. Fix one t ∈ TCk0 = {t ∈ Tk0 : maxj |tj − t0j| ≤ C/n}, and denote the maximum likelihood estimator of a for this t by â(t). Let Πna|t,Z and Πnt|Z be the posterior measure of a conditional on t and the posterior measure of t, respectively. The classical Bernstein–von Mises theorem implies that E0||Πna|t,Z − N(â(t), I0^{−1})||TV → 0. We have

\[ E_0 \big\| \Pi^n_{a|Z} - N(\hat a(t_0), I_0^{-1}) \big\|_{TV} \le E_0 \Big\| \int_{T^C_{k_0}} \Pi^n_{a|t,Z}\, d\Pi^n_{t|Z} - N(\hat a(t_0), I_0^{-1}) \Big\|_{TV} + E_0 \Big\| \int_{(T^C_{k_0})^c} \Pi^n_{a|t,Z}\, d\Pi^n_{t|Z} - N(\hat a(t_0), I_0^{-1}) \Big\|_{TV} = (I) + (II) . \]

(I) can be bounded by

\[ E_0 \Big\| \int_{T^C_{k_0}} \Pi^n_{a|t,Z}\, d\Pi^n_{t|Z} - N(\hat a(t_0), I_0^{-1}) \Big\|_{TV} \le E_0 \Big[ \int_{T^C_{k_0}} \big\| \Pi^n_{a|t,Z} - N(\hat a(t), I_0^{-1}) \big\|_{TV}\, d\Pi^n_{t|Z} \Big] + E_0 \Big[ \int_{T^C_{k_0}} \big\| N(\hat a(t), I_0^{-1}) - N(\hat a(t_0), I_0^{-1}) \big\|_{TV}\, d\Pi^n_{t|Z} \Big] . \]

The first term converges to zero by the boundedness of the TV norm and Fubini's theorem; the second term converges to zero since ||â(t) − â(t0)|| = op(1/√n). Letting n go to infinity and then C go to infinity, we conclude that E0||Πna|Z − N(â(t0), I0^{−1})||TV → 0.

Proof of Theorem 5. Fix any number M > 0 and γ > 2M. Define θ0 = 0 and θn = (γ/√n)·I(1/2 ≤ t < 1), a function with a single change-point and jump size γ/√n. We trivially have ||θ − θn|| ≥ γ/(2√n) > M/√n for all θ ∈ Θ1 (i.e., for θ a constant function). Under θ0, the posterior probability of Θ1 converges to 1 by Theorem 2. This gives

\[ E_{Z|\theta_0} \Pi^n(\theta : \|\theta - \theta_n\| > M/\sqrt{n}) \ge E_{Z|\theta_0} \Pi^n(\theta : \|\theta - \theta_n\| > M/\sqrt{n},\ \theta \in \Theta_1) \ge E_{Z|\theta_0} \Pi^n(\Theta_1) \to 1 . \]

Since the measure P0n induced by θ0 and the measure Pθnn induced by θn are mutually contiguous (a straightforward extension of Theorem 7.2 in van der Vaart (1998)), we have

\[ \sup_{\bar\theta \in \Theta^\delta} E_{Z|\bar\theta}\, \Pi^n(\theta \in \Theta^\delta : \|\theta - \bar\theta\| > M \epsilon_n) \ge E_{Z|\theta_n} \Pi^n(\theta \in \Theta^\delta : \|\theta - \theta_n\| > M \epsilon_n) \to 1 . \]

Since this is true for any M, it also holds for some slowly diverging sequence Mn as in the statement of the theorem.

References

Barron, A., M. J. Schervish, and L. Wasserman (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics 27(2), 536–561.

Ben Hariz, S., J. J. Wylie, and Q. Zhang (2007). Optimal rate of convergence for nonparametric change-point estimators for nonstationary sequences. Annals of Statistics 35(4), 1802–1826.

Choi, T. and M. J. Schervish (2007). On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis 98(10), 1969–1987.

Fan, J. Q. and R. Z. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348–1360.

Fearnhead, P. (2006). Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing 16(2), 203–213.

Fearnhead, P. (2008). Computational methods for complex stochastic systems: a review of some alternatives to MCMC. Statistics and Computing 18(2), 151–171.

Ghosal, S., J. K. Ghosh, and A. W. van der Vaart (2000). Convergence rates of posterior distributions. Annals of Statistics 28(2), 500–531.

Ghosal, S., J. Lember, and A. van der Vaart (2008). Nonparametric Bayesian model selection and averaging. Electronic Journal of Statistics 2, 63–89.

Ghosal, S. and A. van der Vaart (2007). Convergence rates of posterior distributions for noniid observations. Annals of Statistics 35(1), 192–223.

Goldenshluger, A., A. Tsybakov, and A. Zeevi (2006). Optimal change-point estimation from indirect observations. Annals of Statistics 34(1), 350–372.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732.

Leeb, H. and B. M. Pötscher (2008). Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics 142(1), 201–211.

Lian, H. (2007a). On the consistency of Bayesian function approximation using step functions. Neural Computation 19(11), 2871–2880.

Lian, H. (2007b). Some topics on statistical theory and applications. Ph.D. thesis, Brown University.

Lian, H. (2008). Bayes and empirical Bayes inference in change-point problems. Communications in Statistics - Theory and Methods, to appear.

Liu, J. S. and C. E. Lawrence (1999). Bayesian inference on biopolymer models. Bioinformatics 15(1), 38–52.

Schwartz, L. (1965). On Bayes procedures. Z. Wahrsch. Verw. Gebiete 4, 10–26.

Scricciolo, C. (2007). On rates of convergence for Bayesian density estimation. Scandinavian Journal of Statistics 34(3), 626–642.

Shen, X. T. and L. Wasserman (2001). Rates of convergence of posterior distributions. Annals of Statistics 29(3), 687–714.

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.

Yang, Y. H. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92(4), 937–950.