Non Parametric Methods for Genomic Inference

Peter J. Bickel, James B. Brown, Haiyan Huang, Nancy R. Zhang

November 26, 2007

1 Introduction

1.1 Background

This paper grew out of a number of examples arising in data coming from the ENCODE project (Birney et al., 2007). Variations of some of the methods described here have been applied at various places in that paper, as well as in Margulies et al. (2007), for assessing significance and computing confidence bounds for statistics that operate along a genomic sequence. The background on these methods is described in cookbook form in the supplements to those papers, and it is the goal of this paper to describe them in more detail and rigor. We begin with some concrete examples from the data mentioned in the papers above, as well as other types of genomic data, in Section 1.2, and proceed with a motivated description of our model in Section 2. Our methods are discussed both qualitatively and mathematically in Sections 3 and 4. Section 5 contains results from simulation studies and real data analysis. Finally, an appendix with proofs of the theorems stated in Sections 3 and 4 completes the paper.

Essentially, we will argue that, in making inference about statistics computed from "large" stretches of the genome, in the absence of real knowledge about the evolutionary path which led to the genome in question, the best we can do is to think of position in the genome as corresponding to time, and the observed genome as being modelled by a piecewise stationary ergodic time series. The variables of the series could be base pair composition or some other local feature of the genome, such as binding site information. In the purely stationary case, some of the types of questions that we will address, such as tests for independence of point processes, confidence bounds for expectations of local functions, and goodness of fit of the model, have been considered extensively. However, we do not believe that inference in the piecewise stationary case has been investigated to any great extent, perhaps because there was no particular reason to do so. With the advent of enormous amounts of genomic data, all sorts of inferential questions have arisen. The proposed model may be the only truly nonparametric approach to the genome, although, just as in ordinary nonparametric statistics, there are many possible ways of carrying out inference. Our methods are based on a development of the subsampling schemes of Politis and Romano (1998), see Politis, Romano, and Wolf (1999), and the block bootstrap methods of Künsch (1989). For many applications, as we shall see, Gaussian approximations can replace these schemes. But in these applications, as with the ordinary bootstrap, we believe that a subsampling approach is valuable for the following reasons: (1) Letting the computer do the approximation is much easier. (2) Some statistics, such as tests of the Kolmogorov-Smirnov type, are functions of stochastic processes to which a joint Gaussian approximation applies; their limiting distributions can then only be computed by simulation. (3) Perhaps most importantly, the bootstrap distributions of our statistics show us whether the approximate Gaussianity we have invoked for the "true" distribution of these statistics is in fact warranted. This visual confirmation is invaluable.

1.2 Motivating Examples

We start with several practical examples in genomic studies.

1. Association of functional elements in the human genome. Note: discuss in general the association between annotated functional features. The features physically overlap (cite Figure 11 in the Nature paper).

2. Cooperativity between transcription factor binding sites. Note: for each predicted TFBS, we can extend it at both sides to consider the interactions between neighboring TFBSs; e.g., for a predicted TFBS at position i with length l, we could consider (i − 100, i + l + 100) as the binding region and use it to calculate the region overlap for two different types of TFBSs. The "features" do not physically overlap, but when they are close to each other, it implies functional significance.

3. DNA copy number changes vs. transcribed genes. Note: Nancy may have the ideas?

As we have seen in these examples, the major question we need to address is the following: Given the position vectors of two features in the genome, e.g. "conservation between species" and "transcription start sites", and a measure of relatedness between the features, e.g. base or region percentage overlap, how significant is the observed value of the measure? How does it compare with that which might be observed "at random"? The essential challenge in the statistical formulation of this problem is the appropriate modeling of the randomness of the genome, since we observe only one of the multitude of possible genomes that evolution might have produced for our and other species.

How have such questions been answered previously? Existing methods employ varied ways to simulate the locations of features within genomes, but all center around the uniformity assumption on the features' start positions: the features must occur homogeneously in the studied genome region (e.g. Blakesley et al. (2004) and Redon et al. (2006)). This assumption ignores the natural clumping of features as well as the heterogeneity of genome sequences. Clumps of features appear quite commonly along the genome, due either to the feature's own characteristics, e.g. transcription factor binding sites (TFBSs) tend to occur in clusters, or to the genome's evolutionary constraints, e.g. conserved elements are often found in a dense conservation neighborhood [REF]. Ignoring these natural properties could result in misleading conclusions. On the basis of a more biologically meaningful elaboration of the randomness of the features' start positions, this paper introduces a reliable method for evaluating feature relationships that are defined through linear statistics.

2 The Block Stationary Model

We postulate the following for the observed genomes or stretches of genomes:

1. They can be thought of as a concatenation of a number of regions, each of which is homogeneous in a way we describe below.

2. Features that are located very far from each other on average have little to do with each other.

3. The number of regions is small compared to the total length of the genomic segments we consider.

These assumptions are motivated by earlier studies of DNA sequences, which show that there are global shifts in base composition, but that certain sequence characteristics are locally unchanging. One such characteristic is the GC content. Bernardi et al. (1985) coined the term "isochore" to denote large segments (of length greater than 300 Kb) that have fairly homogeneous base composition and, especially, constant GC composition. Even earlier evidence of segmental DNA structure can be found in the chromosomal banding of polytene chromosomes in Drosophila, visible through the microscope, which results from underlying physical or chemical structure. These banding patterns are stable enough to be used for the identification of chromosomes and for genetic mapping. The experimental evidence for segmental genome structure and the increasing availability of DNA sequence data have inspired attempts to computationally segment DNA into statistically homogeneous regions. The paper by Braun and Müller (1998) offers a nice review of statistical methods developed for detecting and modeling the inhomogeneity in DNA sequences. Statistical methods have been developed to segment DNA sequences by both base composition (Fu and Curnow (1990), Churchill (1989, 1992), Li et al. (2002)) and chemical characteristics (Li et al. (1998)). Most of these computational studies concluded that a model that assumes block-wise stationarity gives a significantly better fit to the data than stationary models (see, for example, the conclusions of two very different studies by Fickett, Torney, and Wolf (1992) and Li et al. (1998)).

A subtle issue in the definition of "homogeneity" is the scale at which we are viewing the genome. Inhomogeneity at the kilobase resolution, for example, might be "smoothed out" if we look at the megabase level. The level of resolution is a modeling issue that must be considered carefully with the goal of the analysis in mind. In this paper, we propose a combined segmentation-subsampling method, in which the size of the subsample L is chosen based on the rate of mixing of the sequence. Then, homogeneity is defined at the scale of windows of length L.

In mathematical terms, the block stationarity model assumes that we observe a sequence of random variables {X_1, ..., X_n} positioned linearly along the genomic region of interest. X_i may be base composition, or some other measurable feature. We assume that there exist integers 0 = τ_0 < τ_1 < ... < τ_I = n such that the collections of variables {X_{τ_i}, ..., X_{τ_{i+1}}} are separately weakly stationary for each i = 0, ..., I − 1. We let n_i = τ_i − τ_{i−1} be the length of the i-th region. For convenience, we introduce the mapping π : {1, ..., n} → {(i, j) : 1 ≤ i ≤ I, 1 ≤ j ≤ n_i}, which relates the relabeled sequence {X_{ij} : 1 ≤ i ≤ I, 1 ≤ j ≤ n_i} to the original sequence {X_1, ..., X_n}. We write π = (π_1, π_2), with π(k) = (i, j) if and only if k = τ_i + j. We will use the notations X_{ij} and X_k interchangeably. For any i, j, let F_i^j be the σ-field generated by X_i, ..., X_j. Define m(k) to be the standard Rosenblatt mixing number (c.f. Herrndorf, 1984),

$$m(k) = \sup\{\,|P(AB) - P(A)P(B)| : A \in \mathcal{F}_1^l,\ B \in \mathcal{F}_{l+k}^n,\ 1 \le l \le n-k\,\}.$$

Then, assumptions 1-3 stated at the beginning of this section translate to the following:

A1. {X_{ij}} are piecewise stationary. That is, {X_{ij} : 1 ≤ j ≤ n_i} is a stationary sequence for i = 1, ..., I.

A2. m(k) ≤ c k^{−β} for all k, some β > 1, and some constant c.

A3. I/n → 0.

There are other, more technical assumptions of our model that are needed by the various results in Sections 3 and 4; these will be given in the appendix. An immediate and important consequence of A1-A3 is that for any fixed small k, if we define U_1 = (X_1, ..., X_k), U_2 = (X_{k+1}, ..., X_{2k}), ..., U_m = (X_{n−k+1}, ..., X_n), where m = n/k, then {U_1, ..., U_m} also obey A1-A3. This is useful, for example, in the region overlap example considered in the next section. The remarkable feature of these assumptions, which are more general than any made heretofore, is that they still allow us to conduct most of the statistical inference of interest. Not surprisingly, these assumptions lead to more conservative estimates of significance than any of the previous methods.
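To make the model concrete, the following sketch simulates a sequence satisfying A1-A3: within each region the series is stationary AR(1) noise (geometrically mixing, so A2 holds) around a region-specific mean, and means shift between regions. The region lengths, means, and AR coefficient are illustrative choices, not quantities taken from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_block_stationary(region_lengths, region_means, phi=0.5):
    """Simulate a piecewise-stationary sequence in the sense of A1-A3:
    each region is mean-shifted stationary AR(1) noise."""
    pieces = []
    for n_i, mu_i in zip(region_lengths, region_means):
        z = np.empty(n_i)
        # draw the first value from the stationary AR(1) distribution
        z[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - phi ** 2))
        for t in range(1, n_i):
            z[t] = phi * z[t - 1] + rng.normal()
        pieces.append(mu_i + z)
    return np.concatenate(pieces)

# three regions: few change-points relative to total length (A3)
x = simulate_block_stationary([5000, 20000, 8000], [0.2, 0.5, 0.35])
```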

3 Linear Statistics and Gaussian Approximation

As an illustration, consider the ENCODE data examples, and suppose that we are interested in the base pair overlap between Feature A and Feature B. We can represent the count of the base pair overlap by defining I_i = 1 if position i belongs to Feature A and 0 otherwise, and J_i = 1 if position i belongs to Feature B and 0 otherwise. We can then define X_i = I_i J_i to be the indicator that position i belongs to both Feature A and Feature B. Then, for the n = 30 megabases of the ENCODE regions, the mean base pair overlap is equal to

$$\bar{X} = \sum_{i=1}^{n} X_i / n.$$

Similarly, if we consider the raw region overlap, we can let X_i = I_i J_i (1 − I_{i+1} J_{i+1}), since the boundary of a region is marked by a position i which belongs to both features, followed by one which belongs to only one or neither of the features. Then, the quantity of interest is again X̄. We focus our attention on statistics that can be expressed as a function of X̄. By the flexible definition of X_i, this encompasses a wide class of situations.

First, consider the mean statistic X̄. Under conditions detailed in Section 7, X̄ is approximately Gaussian with the following mean and variance:

$$\mu = E(\bar{X}) = \sum_{i=1}^{I} f_i \mu_i, \tag{3.1}$$

$$\sigma^2(n) = \mathrm{Var}(\sqrt{n}\,\bar{X}) = \sum_{i=1}^{I} f_i \sigma_i^2(n f_i), \tag{3.2}$$

where

$$f_i = n_i / n, \qquad \mu_i = E X_{i1}, \tag{3.3, 3.4}$$

$$\sigma_i^2(m) = \sigma_{i,0}^2 + 2 \sum_{l=1}^{m} \sigma_{i,l}^2 \left(1 - \frac{l-1}{m}\right), \tag{3.5}$$

with σ_{i,l}^2 the lag-l autocovariance within region i (cf. the estimate (3.7) below).

If the change-points τ̂ were known, the quantities above can be obtained using moment estimates from the data,

$$\hat{\mu}_i = \frac{1}{\hat{n}_i} \sum_{j=\hat{\tau}_i+1}^{\hat{\tau}_{i+1}} X_j, \tag{3.6}$$

$$\hat{\sigma}_{i,h}^2 = \frac{1}{\hat{n}_i - h} \sum_{j=\hat{\tau}_i+1}^{\hat{\tau}_{i+1}-h} \left[ X_j X_{j+h} - \hat{\mu}_i^2 \right]. \tag{3.7}$$

If the estimates τ̂ are consistent for τ, then the above estimates are also consistent. However, simply plugging σ̂_{i,h}^2 into (3.1) and (3.2) does not yield consistent estimates of σ^2, as is well known for the stationary case. Some regularization is necessary. We do not pursue this, but prefer to approach the question from a resampling point of view; see the next section.

In many cases, the statistics of interest are not linear. For example, in the analysis of the ENCODE data a more informative statistic is the %bp overlap, defined as

$$B = \frac{\bar{X}}{L}, \tag{3.8}$$

where

$$L = \sum_{i=1}^{n} I_i$$

is the total base count of Feature A. The same applies to the % regional overlaps. A standard delta method computation shows that the standard error of B can be approximated as follows: let μ(L) and μ(X̄) be, respectively, the expectations of L and X̄. Then,

$$\frac{\bar{X}}{L} - \frac{\mu(\bar{X})}{\mu(L)} \approx \frac{\bar{X} - \mu(\bar{X})}{\mu(L)} - \frac{\mu(\bar{X})}{\mu^2(L)}\,(L - \mu(L)),$$

and hence we can approximate X̄/L by a Gaussian variable with mean μ(X̄)/μ(L) and variance

$$\sigma^2(B) \approx \frac{\sigma^2(\bar{X})}{\mu^2(L)} + \frac{\mu^2(\bar{X})}{\mu^4(L)}\,\sigma^2(L) - 2\,\frac{\mu(\bar{X})}{\mu^3(L)}\,\mathrm{Cov}(\bar{X}, L), \tag{3.9}$$

where σ^2(B), σ^2(X̄), and σ^2(L) are the corresponding variances, and the covariance Cov(X̄, L) can be obtained by a formula generalizing (3.2). In doing inference, we can use the approximate Gaussianity of B, with σ^2(B) estimated using the above formula with regularized sample moments replacing the true moments. Again, we prefer the subsampling method, which does not require the analytic approximations of (3.9) and, in practice, gives better results. We also note, although we do not pursue this here, that goodness of fit or equality of population test statistics, such as Kolmogorov-Smirnov and many others, can be viewed as functions of empirical distributions, which are themselves infinite-dimensional linear statistics, and the subsampling theory applies to them too under weak conditions.
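As a small illustration of the statistics in this section, the sketch below computes the mean base pair overlap X̄, the %bp overlap of (3.8), and the raw region-overlap count from two 0-1 feature tracks. The function name and the zero-padding convention at the right boundary are our own choices.

```python
import numpy as np

def overlap_stats(I, J):
    """Base-pair and region overlap between two 0/1 feature tracks."""
    I = np.asarray(I)
    J = np.asarray(J)
    X = I * J                          # X_i = I_i J_i, joint coverage
    bp_overlap_mean = X.mean()         # X-bar
    pct_bp_overlap = X.sum() / I.sum() # eq. (3.8): joint bases / bases of A
    # region overlap: count positions i with X_i = 1 and X_{i+1} = 0,
    # i.e. right boundaries of jointly covered runs (pad so a run
    # ending at position n is also counted)
    Xp = np.append(X, 0)
    n_joint_regions = int(np.sum(Xp[:-1] * (1 - Xp[1:])))
    return bp_overlap_mean, pct_bp_overlap, n_joint_regions
```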

4 Subsampling Based Methods

4.1 Estimating Confidence Intervals

As an alternative to the analytic formulas (3.1)-(3.2), we propose an extension of a subsampling-based approach for stationary sequences due to Künsch (1989) and Politis and Romano (1996), which is based on the following fact: let the observations X_1, ..., X_n be stationary. If the distribution of a statistic such as B̂ based on a sequence of length n, call it B̂_n, is asymptotically Gaussian, then so is the distribution of B̂_L based on observing a sequence of length L < n, as L → ∞. The means of B̂_n and B̂_L should be equal, and the variance should scale by a factor of n/L. The subsampling approaches in both Künsch (1989) and Politis and Romano (1996) rely on this fact to obtain standard error estimates of B̂ based on repeatedly drawing smaller contiguous blocks of the data. Künsch's approach is referred to as a moving blocks bootstrap, where b blocks of size L are drawn with replacement from the data and strung together in the order in which they were drawn, to produce a bootstrapped data set of length bL. In contrast, Politis and Romano (1996) studied the case b = 1, which is simply referred to as "subsampling". When the data are stationary, the block bootstrap method with proper rescaling is consistent for L → ∞ and any choice of b < n/L with n/L → ∞. A code sketch of this scheme appears below.
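The following sketch implements the moving blocks bootstrap and, as the special case b = 1, plain subsampling with the √(L/n) rescaling of the standard error. Names and defaults are illustrative, and the rescaling assumes the statistic is a √n-consistent mean-like quantity.

```python
import numpy as np

rng = np.random.default_rng(1)

def moving_blocks_bootstrap(x, L, b):
    """Draw b length-L contiguous blocks with replacement and
    concatenate them; b = 1 gives plain subsampling."""
    n = len(x)
    starts = rng.integers(0, n - L + 1, size=b)
    return np.concatenate([x[s:s + L] for s in starts])

def subsample_se(x, L, B=500, stat=np.mean):
    """Subsampling standard error of stat on the full sequence,
    rescaled by sqrt(L/n).  Valid for stationary x; under block
    stationarity it tends to overestimate (Section 7.2)."""
    n = len(x)
    vals = [stat(moving_blocks_bootstrap(x, L, 1)) for _ in range(B)]
    return np.std(vals) * np.sqrt(L / n)
```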

Since our sequence is only block stationary, neither the moving blocks bootstrap of Künsch nor the subsampling approach of Politis and Romano would give the correct variance estimates. In fact, both methods would tend to overestimate the variance, as shown in Section 7.2. In order to obtain the correct variance estimate by subsampling, some sort of stratification needs to be applied, so that the proportion of the subsample that comes from each homogeneous region matches the proportion of the entire sequence that belongs to that region. To achieve this effect, we propose the following stratified subsampling strategy:

Algorithm 4.1. Given a segmentation t = (0 = t_0, t_1, ..., t_m = n), estimate the variance as follows:

1. For each subsample, draw integers N = {N_1, ..., N_m}, with N_i chosen uniformly from {t_{i−1} + 1, ..., t_i}. Let λ_i(t) = ⌈(t_i − t_{i−1}) L / n⌉. Form the subsample

$$(X_1^*, \ldots, X_L^*) = (X_{N_1}, \ldots, X_{N_1+\lambda_1(t)-1},\ X_{N_2}, \ldots, X_{N_2+\lambda_2(t)-1},\ \ldots,\ X_{N_m}, \ldots, X_{N_m+\lambda_m(t)-1}).$$

Compute the statistic $\bar{X}^* = \bar{X}^*(t) = \frac{1}{L} \sum_{i=1}^{L} X_i^*$.

2. Repeat step one B times to obtain X̄^{*,1}, ..., X̄^{*,B}. Form the subsampling estimate of variance by

$$\widehat{\mathrm{Var}}^*_t = \frac{1}{B} \sum_{b=1}^{B} \left[ \bar{X}^{*,b} - \bar{X}^{*,\cdot} \right]^2,$$

where $\bar{X}^{*,\cdot}$ denotes the average of the B replicates.

The above algorithm assumes a given segmentation t, which estimates the true change-points τ^(n). In order for the algorithm to perform well, a good segmentation is critical. In Section 4.4 we will propose a method for estimating the true segmentation that, when used with the above stratified subsampling algorithm, does indeed do well. In Section 7.3, we will prove that, under certain conditions, the estimate of variance obtained from Algorithm 4.1, coupled with the segmentation algorithm proposed in Section 4.4, is consistent. Next, we discuss hypothesis testing for pre-segmented data, delaying the illustration of the segmentation method and the discussion of the choice of the subsample size L to Sections 4.4 and 4.5. A code sketch of Algorithm 4.1 follows.
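A minimal implementation of Algorithm 4.1 might look as follows; the random number generator, the handling of blocks that run past a region boundary, and the default B are our own simplifications.

```python
import numpy as np
from math import ceil

rng = np.random.default_rng(2)

def stratified_subsample_var(x, t, L, B=500):
    """Algorithm 4.1: stratified subsampling estimate of Var(X-bar*).
    t = (0 = t_0 < t_1 < ... < t_m = n) is a given segmentation."""
    n = len(x)
    means = np.empty(B)
    for b in range(B):
        pieces = []
        for i in range(1, len(t)):
            lam = ceil((t[i] - t[i - 1]) * L / n)  # region i's share of L
            N = rng.integers(t[i - 1], t[i])       # uniform start in region i
            pieces.append(x[N:N + lam])            # block may cross t_i slightly
        means[b] = np.concatenate(pieces).mean()
    return np.var(means)  # hat-Var*_t of Algorithm 4.1, step 2
```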

4.2 Testing the Null Hypothesis of No Association

As we discussed in Section 1.2, the inference problem typically posed in high-throughput genomics is that of the association of two features. In terms of our framework, we have two 0-1 processes {I_k}_{k=1,...,n} and {J_k}_{k=1,...,n}, both defined on a segment of length n of the genome. We assume that the joint process {(I_k, J_k)} is piecewise stationary and mixing, and we want to test the hypothesis that the two point processes {I_k}_{k=1,...,n} and {J_k}_{k=1,...,n} are independent. We have studied two fairly natural test statistics in (ENCODE): the "percent basepair overlap",

$$O_n = \frac{\sum_{k=1}^{n} I_k J_k}{\sum_{k=1}^{n} I_k},$$

and the "regional overlap", which we define as

$$R_n = \frac{\sum_{k=1}^{n} I_k J_k (1 - I_{k-1} J_{k-1})}{\sum_{k=1}^{n} I_k J_k},$$

with large values of these statistics indicating dependence. The major problem we face in constructing a test is what critical values o_{nα}, r_{nα} we should specify so that

$$P_{H_0}[O_n \ge o_{n\alpha}] \approx \alpha, \tag{4.1}$$

and similarly for R_n. Here H_0 is the hypothesis that the vectors (I_1, ..., I_n)^T and (J_1, ..., J_n)^T are independent. We propose the following method to generate an approximation to the null distribution, first in the one-stationary-region case:

1. Pick at random without replacement two starting points, K_1 and K_2, of blocks of length L from {1, ..., n − L}.

2. Let (I_{K_1+1}, ..., I_{K_1+L})^T and (J_{K_1+1}, ..., J_{K_1+L})^T, (I_{K_2+1}, ..., I_{K_2+L})^T and (J_{K_2+1}, ..., J_{K_2+L})^T be the two sets of two feature indicators. Consider O_n, with R_n being treated analogously.

3. Form

$$O_{nL}^* = \frac{\sum_{l=1}^{L} (I_{K_1+l} J_{K_2+l} + I_{K_2+l} J_{K_1+l})}{\sum_{l=1}^{L} (I_{K_1+l} + I_{K_2+l})}$$

and let O^*_{nL1}, ..., O^*_{nLB} be obtained by choosing (K_{1b}, K_{2b}), b = 1, ..., B, independently as usual. Define $\bar{O}^*_{nL} = \frac{1}{B} \sum_{b=1}^{B} O^*_{nLb}$ and $\tilde{O}^*_{nLb} = O^*_{nLb} - \bar{O}^*_{nL}$, b = 1, ..., B, and write $\tilde{O}^*_{nL}$ for a single $\tilde{O}^*_{nLb}$.

4. We use the following c_{nLα} as a critical value for O_n at level α:

$$c_{nL\alpha} = \bar{O}^*_{nL} + \left(\frac{L}{n}\right)^{1/2} \tilde{O}^*_{([B(1-\alpha)])},$$

where $\tilde{O}^*_{(1)} \le \cdots \le \tilde{O}^*_{(B)}$ are the ordered $\tilde{O}^*_{nLb}$ and [·] denotes integer part.

If the sequence is piecewise stationary with estimated segments j = 1, ..., s as in Section 4.4, we draw independently B sets of starting points, K^{(j)}_{11}, ..., K^{(j)}_{1B} and K^{(j)}_{21}, ..., K^{(j)}_{2B}, of blocks of length λ̂_j L from each segment j = 1, ..., s, where each pair is drawn at random without replacement. Here Σ_{j=1}^{s} λ̂_j = 1 and λ̂_j is proportional to the length of estimated segment j. Then piece the numerator and denominator together as

$$M_b^* = \sum_{j=1}^{s} \sum_{l=1}^{\hat{\lambda}_j L} \left( I_{K^{(j)}_{1b}+l}\, J_{K^{(j)}_{2b}+l} + I_{K^{(j)}_{2b}+l}\, J_{K^{(j)}_{1b}+l} \right),$$

$$N_b^* = \sum_{j=1}^{s} \sum_{l=1}^{\hat{\lambda}_j L} \left( I_{K^{(j)}_{1b}+l} + I_{K^{(j)}_{2b}+l} \right),$$

b = 1, ..., B, form

$$O^*_{nLb} = \frac{M_b^*}{N_b^*},$$

and then proceed as before.

Theorem 4.2. If L_0, P_0 denote distributions under the hypothesis of independence, L → ∞, L/n → 0, and A(1)-A(6) hold, then:

1. $$\mathcal{L}_0\left(\sqrt{n}\,(O_n - E_0(O_n))\right) \Rightarrow N(0, \sigma_0^2). \tag{4.2}$$

2. With probability tending to 1,

$$\mathcal{L}_0^*\left(\sqrt{L}\,\tilde{O}^*_{nL}\right) \Rightarrow N(0, \sigma_0^2). \tag{4.3}$$

3. If L/n^{1/2} → 0, in addition to our other conditions, then, under H_0,

$$\bar{O}^*_{nL} = E_0(O_n) + o_p(n^{-1/2})$$

and

$$P_0\left[O_n \ge \bar{O}^*_{nL} + \left(\frac{L}{n}\right)^{1/2} \tilde{O}^*_{([B(1-\alpha)])}\right] \to \alpha. \tag{4.4}$$

Thus, the test we propose has asymptotic level of significance α. The same results hold for R_n. We give the proof for s = 1 in detail and sketch the general arguments in the Appendix. A code sketch of the one-region test follows.
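For the one-region case (s = 1), steps 1-4 can be coded directly; the sketch below returns the observed O_n, the critical value c_{nLα}, and the test decision. Function and variable names are ours, and no guard is included for blocks in which Feature A happens to be empty.

```python
import numpy as np

rng = np.random.default_rng(3)

def overlap_null_test(I, J, L, B=1000, alpha=0.05):
    """Block-shift approximation to the null distribution of O_n
    (Section 4.2, one stationary region)."""
    I = np.asarray(I)
    J = np.asarray(J)
    n = len(I)
    On = I @ J / I.sum()                       # observed percent-bp overlap
    O_star = np.empty(B)
    for b in range(B):
        K1, K2 = rng.choice(n - L, size=2, replace=False)
        i1, j1 = I[K1:K1 + L], J[K1:K1 + L]
        i2, j2 = I[K2:K2 + L], J[K2:K2 + L]
        # feature A from one block against feature B from the other
        O_star[b] = (i1 @ j2 + i2 @ j1) / (i1.sum() + i2.sum())
    O_bar = O_star.mean()
    O_tilde = np.sort(O_star - O_bar)
    c = O_bar + np.sqrt(L / n) * O_tilde[int(B * (1 - alpha)) - 1]
    return On, c, On >= c
```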


4.3 An Alternative General Model for Testing the Hypothesis of No Association

An alternative general model for testing association between inhomogeneous features may be found in (REF). The assumption is that the {j : I_j = 1} are events in a nonhomogeneous Poisson process with intensity function λ(·). Thus, if T_(1) < T_(2) < ... < T_(M) are the times of occurrence of the events (i.e. I_{T_j} = 1), where M is the total number of events, then, given M = n, (T_1, ..., T_n) is a random permutation of T_(1), ..., T_(n), and the T_i are i.i.d. from a density f, with f = F′ and λ = f / (1 − F). A simulation sketch follows.
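For intuition, this alternative model is easy to approximate on the integer lattice: with a small intensity, independent Bernoulli draws with position-dependent success probability approximate a nonhomogeneous Poisson process for the feature's start positions. The intensity function used here is purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def nhpp_feature(n, intensity):
    """Approximate a nonhomogeneous Poisson process on {0,...,n-1}:
    position j carries an event with probability lambda(j), which for
    small lambda(j) mimics the Poisson model of this section."""
    lam = intensity(np.arange(n))
    return (rng.random(n) < lam).astype(int)

# example: intensity rising linearly across the region (hypothetical)
I = nhpp_feature(10_000, lambda j: 0.001 + 0.004 * j / 10_000)
```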

4.4 Segmentation Method

The primary objective of the segmentation step is to divide the original data sequence X into "homogeneous" regions, so that the variance $\widehat{\mathrm{Var}}^*_t$ estimated in Algorithm 4.1 approximates the true variance of X̄_L. By Corollary 7.4 in the Appendix, if the goal is only to estimate the variance correctly, the homogeneity of the segmentation needs only to be true for the expectation, and not for the second order moments. That is, as long as, for all regions j, the expectation E[X_i] remains constant for all i ∈ [t_{j−1}, t_j), the variance obtained from the subsampling Algorithm 4.1 will be consistent. Therefore, we focus here on the segmentation of X into regions of constant mean.

First consider the simple case where X_1, ..., X_n are independent with variance 1. In testing the null hypothesis H_0 : E[X_i] = μ versus the alternative H_A that there exists 1 < τ < n such that E[X_i] = μ_1 for i < τ and E[X_i] = μ_2 ≠ μ_1 for i ≥ τ, one can show that the following is the generalized likelihood ratio test: reject H_0 if

$$\max_{1 < j < n} M(j) > c.$$

A sketch of this scan, applied recursively, appears below.
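Since the surviving text does not display M(j) explicitly, the sketch below uses the standard standardized two-sample form of the generalized likelihood ratio statistic for a mean change at split j (cf. Sen and Srivastava, 1975); that choice should be read as an assumption, as should the minimum-segment guard.

```python
import numpy as np

def glr_changepoint(x):
    """Scan statistic for a single mean change (independent,
    variance-1 case): standardized difference of the means left and
    right of each candidate split j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    cs = np.cumsum(x)
    j = np.arange(1, n)
    left = cs[j - 1] / j
    right = (cs[-1] - cs[j - 1]) / (n - j)
    z = np.abs(left - right) * np.sqrt(j * (n - j) / n)
    return j[np.argmax(z)], z.max()

def binary_segmentation(x, c, lo=0, out=None):
    """Recursive binary segmentation: keep splitting while the scan
    statistic exceeds the threshold c."""
    if out is None:
        out = []
    tau, zmax = glr_changepoint(x)
    if zmax > c and 10 < tau < len(x) - 10:   # crude minimum-segment guard
        out.append(lo + tau)
        binary_segmentation(x[:tau], c, lo, out)
        binary_segmentation(x[tau:], c, lo + tau, out)
    return sorted(out)
```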

… for every ε > 0, there exists k_1(ε) such that for all k ≥ k_1(ε),

$$\frac{1}{n} \sum_{a=1}^{n} \sum_{b=1}^{n} |\mathrm{Cov}[X_a, X_b]|\, I(\pi_1(a) \ne \pi_1(b),\ |a-b| > k) \le \epsilon. \tag{7.4}$$

Now, by A4, since

$$\sum_{a=1}^{n} I(\pi_1(a+k) > \pi_1(a) + 1) \le \sum_{i:\, n_i \le k} n_i = o(n),$$

by A2 and (7.3) we have

$$\frac{1}{n} \sum_{a=1}^{n} \sum_{b=1}^{n} |\mathrm{Cov}[X_a, X_b]|\, I(\pi_1(a) \ne \pi_1(b),\ |a-b| \le k) \le \frac{C^2}{n} \sum_{a=1}^{n} \sum_{b=1}^{n} I(\pi_1(b) = \pi_1(a) + 1,\ |a-b| \le k) + o(1). \tag{7.5}$$

The first term on the right hand side of (7.5) is bounded by

$$\frac{C^2}{n} \sum_{i=1}^{I} \sum_{j=n_i-k+1}^{n_i} 1 \le 2 C^2 I k / n. \tag{7.6}$$

Thus by A3, the above expression is o(1). Combining (7.4)-(7.6) we obtain (7.2). Evidently, A_n = σ_n^2, and thus Theorem 1 follows.

Subsampling Unsegmented Data \

For any event A ∈ F∞ , we define law L∗\ by L∗\ (A) = P(A|X∞ , . . . , X\ ). By convergence of L∗\ to a law L in probability, we mean that ρ(L∗\ , L) → 0 in probability, where ρ is any metric for weak converegence, e.g. the Prohorov metric. We use P∗ , E∗ to correspond respectively to probability measure and expectation conditional on X1 , . . . , Xn . In consistence with our earlier notation, let N be uniformly distributed on {1, . . . , n}, and let n

µ ˆ∗L

1X wl Xl I(N ≤ l ≤ (N + L) ∧ n), = n l=1

where wl−1 = P∗ [N ≤ l ≤ (N + L) ∧ n]  1 ≤ l ≤ L − 1;  l/n, L/n, L ≤ l ≤ n − L + 1; =  (n − l)/n, n − L + 2 ≤ l ≤ n. Then, E∗ µ ˆ∗L = n−1

n X

¯ Xi ≡ X.

(7.7)

i=1

7.2.1

The stationary case (I = 1)

For the stationary case, Politis et al. (2001) have already established that, under our (or weaker) conditions, if L/n → 0, L → ∞, the mean of a block 19

drawn randomly from X1 , . . . , Xn n

µ ˜∗L

1X ≡ Xl I(N ≤ l ≤ (N + L) ∧ n) L l=1

satisfies



L

¯ ⇒ N (0, 1) (7.8) (˜ µ∗ − X) σn L in probability. That the weighted mean µ ˆ∗L behaves the same way is clear since L/n → 0 implies P (ˆ µ∗L 6= µ ˜∗L ) → 0. p ¯ as a n/L(ˆ µL − X) The conclusion is that we may treat the distribution of √ ¯ proxy for the unknown distribution of n(X − µ). 7.2.2

7.2.2 The block stationary case (I > 1)

In general we may not treat the distribution of √L (μ̂_L − X̄) as a proxy for the unknown distribution of √n (X̄ − μ). To see why this is true, write

$$\hat{\mu}_L - \bar{X} = (\hat{\mu}_L - \bar{X}_{\pi_1(N)}) + (\bar{X}_{\pi_1(N)} - \bar{X}).$$

If n̲ = min_i n_i and L → ∞, L/n̲ → 0, then it is easy to see under our assumptions that

$$\sqrt{L}\,(\hat{\mu}_L - \bar{X}_{\pi_1(N)}) \Rightarrow \sum_{i=1}^{I} f_i Z_i, \tag{7.9}$$

where Z_i ∼ N(0, σ_i^2(n_i)), with σ_i^2(n_i) defined in (3.5), and the limit denotes the mixture placing mass f_i on the law of Z_i (cf. the moment generating function computation in Section 7.2.3). Furthermore,

$$\bar{X}_{\pi_1(N)} - \bar{X} \Rightarrow \sum_{i=1}^{I} f_i\, \delta_{(\mu_i - \mu)},$$

where δ_x is a point mass at x. Moreover, the limiting distributions of the two components are independent.

7.2.3 Conservativeness

If the regions are relatively homogeneous with respect to μ_i, so that

$$\sum_{i=1}^{I} f_i\, \delta_{(\mu_i - \mu)} \approx \delta_0$$

in the strong sense that

$$\sqrt{L}\,(\mu_{\pi_1(N)} - \mu) \to 0, \tag{7.10}$$

we are still in error when I > 1, since we are using (7.9) as our approximation to the distribution of √n (S_n/n − μ), which is approximately N(0, Σ_{i=1}^{I} f_i σ_i^2(n_i)). However, the approximation is conservative, since a mixture of Gaussians with mean 0 is heavier tailed than a Gaussian with the same mean and variance (Mallows, 198?). For instance, the moment generating function of the limit (7.9) is

$$\sum_{i=1}^{I} f_i\, e^{t^2 \sigma_i^2(n_i)/2} \ge e^{t^2 \sum_{i=1}^{I} f_i \sigma_i^2(n_i)/2},$$

which is the mgf of N(0, σ_n^2), by Jensen's inequality. Conservativeness is even more pronounced if (7.10) does not hold, since then the second term adds yet another component to the variance.

7.2.4 Sufficient conditions for (7.8) to hold for I > 1

Assume that, in addition to A1-A6, we have the following conditions:

A7. 1 − max_i f_i = o(1).

A8. I/n = o(1/L).

Theorem 7.2. Under assumptions A1-A8, (7.8) holds.

Proof: By A1-A3,

$$E^*(\bar{X}_{\pi_1(N)} - \bar{X})^2 = \sum_{i=1}^{I} f_i\, (\hat{\mu}_i - \bar{X})^2 \le \sum_{i=1}^{I} f_i\, (\hat{\mu}_i - \mu_i)^2,$$

and

$$E \sum_{i=1}^{I} f_i\, (\hat{\mu}_i - \mu_i)^2 = \sum_{i=1}^{I} f_i\, \frac{\sigma_i^2(n_i)}{n} = o(L^{-1})$$

by A8. On the other hand, if ‖·‖ denotes the variational norm and the regions are labeled so that f_1 = max_i f_i,

$$\left\| \sum_{i=1}^{I} f_i\, N(0, \sigma_i^2(n_i)) - N(0, \sigma_n^2) \right\| \le (1 - f_1) + \left\| \sum_{i=2}^{I} f_i\, N(0, \sigma_i^2(n_i)) \right\| \le 2(1 - f_1) = o(1)$$

by A7. □

Remark: The validity of the conditions A1-A5 can be checked informally through Q-Q Gaussian plotting of the empirical distribution of √L (μ̂*_L − X̄). We give some examples of the plots made in connection with the ENCODE project (Birney et al., 2007). If A5 and A6 were not approximately valid, we would expect to see distributions such as (7.9), which are distinctly more heavy tailed than Gaussians.

7.3 Subsampling Pre-segmented Data

We start by introducing some notation. Let t^(n) = (0 = t_0^(n), t_1^(n), ..., t_{Î_n}^(n) = n) be a set of change-points estimated for a data set of size n, and let Î_n = |t^(n)|. Define n̂_i = t_{i+1}^(n) − t_i^(n) and f̂_i = n̂_i / n. These quantities are meant to be estimates of the true change-points τ^(n) = (0 = τ_0^(n), τ_1^(n), ..., τ_{I_n}^(n)) and their related quantities n_i = τ_{i+1}^(n) − τ_i^(n) and f_i = n_i / n. We sometimes suppress in our notation the reliance of these quantities on n. For any set of change-points t, we let R_i(t) = (t_{i−1} + 1, ..., t_i) be the i-th region in the segmentation produced by t. We let k_i(t) = Σ_{j=1}^{I_n} I(t_{i−1} < τ_j < t_i) be the number of true change-points that lie within region i of the segmentation. Hence, each region R_i(t) in the segmentation produced by t can be thought of as composed of k_i + 1 underlying homogeneous regions. We let f_{i,j} be the fraction of total length of, and X̄_{i,j} the mean of, the j-th homogeneous region in R_i(t). Thus,

$$\sum_{j} f_{i,j} = \hat{f}_i, \qquad \sum_{j} f_{i,j}\, \bar{X}_{i,j} = \hat{f}_i\, \bar{X}_i.$$

When we obtain a random subsample of the data given a segmentation t^(n), we let N = N(t^(n)) = (N_1(t^(n)), ..., N_{Î_n}(t^(n))) be the starting indices of the subsampled blocks, with N_i(t^(n)) ∈ [t_{i−1}^(n) + 1, t_i^(n)). The length of the block starting at N_i(t^(n)) is λ̂_i = L n̂_i / n, where L = L_n, the total block size, satisfies the following:

A9. L_n → ∞.

A10. L_n / n → 0.

We also need the following assumptions on the segmentation t^(n):

A11. Î_n / n → 0.

A12. (1/n) Σ_{i:\, n̂_i ≤ k} n̂_i → 0 for all k < ∞.

These are the analogues of assumptions A3-A4, with the estimated change-points t^(n) replacing the true change-points τ^(n). To simplify discourse, we sometimes omit the reliance on t^(n) when referring to the subsample, although it always relies on the underlying segmentation. We let X*_i = (X_{N_i}, ..., X_{N_i+λ̂_i−1}) be the i-th block in the subsample, and let

$$\bar{X}_i^* = \frac{1}{\hat{\lambda}_i} \sum_{j=0}^{\hat{\lambda}_i - 1} X_{N_i + j}$$

be the mean of the i-th block. The complete subsample is the concatenation of the X*_i's, which for simplicity we renumber as X* = X*(N) = (X_1^*, ..., X_L^*). The mean of the subsample,

$$\bar{X}^* = \frac{1}{L} \sum_{i=1}^{L} X_i^* = \sum_{i=1}^{\hat{I}_n} \hat{f}_i\, \bar{X}_i^*, \tag{7.11}$$

is a weighted mean of the means of the blocks from each region. We let P* = P*_{t^(n)}, E* = E*_{t^(n)}, and Var* = Var*_{t^(n)} denote, respectively, the probability measure, expectation, and variance conditional on the data

X and the segmentation t^(n). That is, the randomness under P* arises only from the random sampling of N. Hence, to clarify, we have

$$P^*(\bar{X}^* < z) = P(\bar{X}^* < z \mid t^{(n)}, X) = \frac{1}{\prod_i \hat{n}_i} \sum_{N_1, \ldots, N_{\hat{I}_n}} I(\bar{X}^*(N) < z). \tag{7.12}$$

We first state and prove the following theorem, which describes how the error in the subsampling estimate of σ^2(L_n) depends on the segmentation t^(n).

Theorem 7.3. Let assumptions A1-A6 hold for the sequence X_1, ..., X_n, and assume that the sequence of segmentations t^(n) satisfies A11-A12. Then, as n → ∞,

$$L_n\, \mathrm{Var}^*_{t^{(n)}}(\bar{X}^*) - \sigma^2(L_n) = \sum_{j=1}^{I_n} f_j A_j(t^{(n)}) + \sum_{i=1}^{\hat{I}_n} \hat{f}_i B_i(t^{(n)}) + o_p(1), \tag{7.13}$$

where

$$A_j(t^{(n)}) = \sum_{i:\, R_i(t^{(n)}) \cap R_j(\tau^{(n)}) \ne \emptyset} \frac{f_{ij}}{f_j}\, \left[\sigma_j^2(\hat{\lambda}_i) - \sigma_j^2(\lambda_j)\right], \tag{7.14}$$

$$B_i(t^{(n)}) = L_n \sum_{j=1}^{k_i} f_{ij}\, (\bar{X}_{i,j} - \bar{X}_i)^2. \tag{7.15}$$

Before going into the details of the proof, we first explain the meaning of the terms A_j(t^(n)) and B_i(t^(n)). If we were to know τ^(n), then the scaled variance of a subsample conditioned on τ^(n) would be

$$\mathrm{Var}^*_{\tau^{(n)}}\left(\sqrt{L_n}\, \bar{X}^*\right) = \sum_{j=1}^{I_n} f_j\, \sigma_j^2(\lambda_j) + o_p(1),$$

where the o_p(1) remainder term is due to the covariance between the blocks X̄*_i, which is negligible given the mixing assumptions. A_j(t^(n)) is the cost of not estimating σ_j^2(λ_j) correctly, due to the discrepancy between t^(n) and τ^(n). The terms B_i(t^(n)) can be thought of as a chi-square statistic for testing for a change in the first moment at the true change-points within R_i(t^(n)), and compensate for the fact that we do not know τ^(n) when estimating the means X̄*_i.

Proof of Theorem 7.3: By (7.11), Var*_{t^(n)}(X̄*) = V_1 + V_2, where

$$V_1 = \sum_{i=1}^{\hat{I}_n} \hat{f}_i^2\, \mathrm{Var}^*[\bar{X}_i^*], \qquad V_2 = 2 \sum_{i_1 < i_2} \hat{f}_{i_1} \hat{f}_{i_2}\, \mathrm{Cov}^*[\bar{X}_{i_1}^*, \bar{X}_{i_2}^*].$$

If |t^(n)| = 2, then

$$V_1 = \sum_{i=1}^{I_n} \frac{\tau_i - \tau_{i-1}}{n}\, \left[\sigma_i^2(\hat{\lambda})/\hat{\lambda} + (\bar{X}_{\tau_{i-1}:\tau_i} - \bar{X}_{1:n})^2\right] + O_p(I_n L / n). \tag{7.20}$$

If |t^(n)| > 2, i.e. the segmentation divides the sequence into two or more regions, the same reasoning as above applies to each region separately. Hence, for general t^(n),

$$V_1 = \sum_{i=1}^{\hat{I}_n} \hat{f}_i^2 \sum_{j=1}^{k_i} \frac{f_{ij}}{\hat{f}_i}\, \left[\sigma_{i,j}^2(\hat{\lambda}_i)/\hat{\lambda}_i + (\bar{X}_{i,j} - \bar{X}_i)^2\right] + O_p(k_i L / n), \tag{7.21}$$

where σ_{i,j}^2(λ̂_i) = σ_r^2(λ̂_i) if and only if the r-th homogeneous region in the entire sequence is the j-th homogeneous region in R_i(t^(n)). If we rearrange all of the terms in (7.21) involving σ_{i,j}^2, we get

$$\sum_{i=1}^{\hat{I}_n} \hat{f}_i^2 \sum_{j=1}^{k_i} \frac{f_{ij}}{\hat{f}_i}\, \frac{\sigma_{i,j}^2(\hat{\lambda}_i)}{\hat{\lambda}_i} = \frac{1}{L} \sum_{i=1}^{\hat{I}_n} \sum_{j=1}^{k_i} f_{ij}\, \sigma_{i,j}^2(\hat{\lambda}_i) = \frac{1}{L} \sum_{j=1}^{I_n} f_j \left[ \sum_{i:\, R_i(t^{(n)}) \cap R_j(\tau^{(n)}) \ne \emptyset} \frac{f_{ij}}{f_j}\, \sigma_j^2(\hat{\lambda}_i) \right] = \frac{1}{L} \left[ \sum_{j=1}^{I_n} f_j A_j(t^{(n)}) + \sigma_n^2(L_n) \right].$$

Rearranging the terms involving (X̄_{i,j} − X̄_i)^2 gives the B_i(t^(n)) terms.

We now look at the term V_2. For any pair i_1 < i_2,

$$\mathrm{Cov}^*[\bar{X}_{i_1}^*, \bar{X}_{i_2}^*] = \frac{1}{\hat{n}_{i_1} \hat{n}_{i_2}} \sum_{j=1}^{\hat{n}_{i_1}} \sum_{k=1}^{\hat{n}_{i_2}} \left[\bar{X}_{t_{i_1}^{(n)}+j\,:\,t_{i_1}^{(n)}+j+\hat{\lambda}_{i_1}} - \bar{X}_{t_{i_1}^{(n)}:\,t_{i_1+1}^{(n)}}\right] \left[\bar{X}_{t_{i_2}^{(n)}+k\,:\,t_{i_2}^{(n)}+k+\hat{\lambda}_{i_2}} - \bar{X}_{t_{i_2}^{(n)}:\,t_{i_2+1}^{(n)}}\right].$$

By the mixing condition A2, we can take a window w_n = O(L_n), so that for terms in the above average that are separated by a distance w_n from each other, their (unconditional) covariance is less than c L_n^{−β}, for β > 1. Inside this window, assumption A5 guarantees that each term in the sum is less than C. Thus,

$$\mathrm{Cov}^*[\bar{X}_{i_1}^*, \bar{X}_{i_2}^*] \le \frac{2}{\hat{n}_{i_1} \hat{n}_{i_2}}\, C w_n + c L_n^{-\beta} + O(1/\sqrt{n_1 n_2}), \tag{7.22}$$

where the O(1/√(n_1 n_2)) error term comes from the approximation of the summation terms outside the window by c L_n^{−β}. Note that since V_2 is a weighted sum of Cov*[X̄*_{i_1}, X̄*_{i_2}], each with weight f̂_{i_1} f̂_{i_2} = (n̂_{i_1} n̂_{i_2})/n^2, and since assumption A12 guarantees that the combined size of "small" regions is also small, only those terms with n̂_{i_1} n̂_{i_2} = O_p(n^2) contribute asymptotically to the sum. Thus, from (7.22) we have

$$L_n V_2 = O_p\left(L_n^{1-\beta}\right) \to 0. \tag{7.23}$$

What remains is to show that √L_n Var(V_2) converges to 0 as n → ∞. This is true because Cov*[X̄*_i, X̄*_j] is an average of O(n^2) terms which belong to a stationary sequence satisfying the mixing condition A2; thus the average is of order O(1/n). This part of the proof will not be given here, but the interested reader is referred to the proof of a similar fact in Theorem 3.2.1 of Politis, Romano, and Wolf (1999). □

The following corollary states that it never hurts to over-segment.

Corollary 7.4. Assume the conditions of Theorem 7.3, together with the additional assumption

A13. σ_j^2(m) → σ_j^2(∞) uniformly for j = 1, ..., I_n.

Then, as n → ∞, if the following is true of the change-points t^(n):

$$\max_i \min_j \frac{|\tau_i - t_j^{(n)}|}{n} \to 0, \tag{7.24}$$

then

$$\frac{\mathrm{Var}^*_{\tau^{(n)}}(\bar{X}^*)}{\mathrm{Var}^*_{t^{(n)}}(\bar{X}^*)} \to 1.$$

Proof: Since

$$\frac{L_n\, \mathrm{Var}^*_{\tau^{(n)}}(\bar{X}^*)}{\sigma^2(L_n)} \to 1,$$

to prove this corollary we only need to show that the given assumptions imply

$$\sum_{j=1}^{I_n} f_j A_j(t^{(n)}) + \sum_{i=1}^{\hat{I}_n} \hat{f}_i B_i(t^{(n)}) \to 0.$$

Assumption (7.24) implies that

$$\max_{1 \le i \le \hat{I}_n}\ \max_{1 \le j \le k_i}\ \frac{f_{i,j}}{\hat{f}_i} \to 1, \tag{7.25}$$

and it is easy to see that this implies that B_i(t^(n)) → 0 for every 1 ≤ i ≤ Î_n, and thus that their weighted average, Σ_i f̂_i B_i(t^(n)), converges to 0 as well. To show that

$$\sum_{j=1}^{I_n} f_j A_j(t^{(n)}) \to 0, \tag{7.26}$$

note that for every ε > 0 there exists M such that for all M_1, M_2 > M and for all j, |σ_j^2(f_j M_1) − σ_j^2(f_j M_2)| < ε. By assumption A6, Σ_{j=1}^{I_n} f_j A_j(t^(n)) can be bounded by δ^{−1} n^{−1} Σ_{j:\, n_j ≤ M} n_j plus corresponding terms over the regions with n_j > M. The first two terms on the right-hand side converge to 0 in probability by assumptions A4 and A12. By (7.24), the last term converges to ε. Since ε is arbitrary, (7.26) is proved. □

The next corollary states that Var*_{t^(n)}(X̄*) is "asymptotically monotone decreasing" in |t^(n)|:

Corollary 7.5. Under the conditions of Corollary 7.4, given two segmentations t^(n) and t_♥^(n), where t^(n) ⊂ t_♥^(n) and both satisfy A12, we have

$$\lim_{n \to \infty} \frac{\mathrm{Var}^*_{t^{(n)}}(\bar{X}^*)}{\mathrm{Var}^*_{t_\heartsuit^{(n)}}(\bar{X}^*)} \ge 1.$$

Proof: It is easy to see that, at every n,

$$\sum_{i=1}^{\hat{I}_n} \hat{f}_i B_i(t^{(n)}) \ge \sum_{i=1}^{\hat{I}_n} \hat{f}_i B_i(t_\heartsuit^{(n)}).$$

Thus, all we need is that, for every j,

$$|A_j(t^{(n)}) - A_j(t_\heartsuit^{(n)})| \to 0. \tag{7.27}$$

This fact is true because, as in Corollary 7.4, assumptions A6, A12, and A13 imply that A_j(t^(n)) → 0 and A_j(t_♥^(n)) → 0. □

The above two corollaries hint at a strategy for segmenting the data so as to estimate Var*_{τ^(n)}(X̄*) ≈ σ^2(n): since Var*_{t^(n)}(X̄*) is monotone decreasing in |t^(n)| (albeit in the asymptotic sense as n → ∞), with an asymptote as |t^(n)| → ∞ at Var*_{τ^(n)}(X̄*), one could use recursive binary segmentation to segment the data until Var*_{t^(n)}(X̄*) stabilizes. This variance can be estimated by subsampling at each level of the recursion. However, since subsampling is computationally expensive, we propose the analytic stopping rule 1, which does not require additional computing beyond the search for the optimal change-point in the binary segmentation algorithm.

Theorem 7.6. Let assumptions A1-A12 be true, and let the segmentation scheme be consistent for τ^(n) in the sense of (7.24). We denote by F*_n the distribution of X̄* sampled as in Algorithm 4.1, conditional on the data X and the segmentation t^(n), and by Φ_{σ_n^2} the Gaussian measure with variance σ_n^2. Then,

$$\rho(F_n^*, \Phi_{\sigma_n^2}) \to_p 0,$$

where ρ is the Prohorov metric and →_p denotes convergence in probability.

Proof: It is necessary and sufficient to prove the convergence point-wise for all u in some dense set …

… Let

$$\Delta_{1M} = \frac{1}{L} \sum_{l=1}^{L} \left[ 1((K_2 - K_1) > M + L)\, I_{K_1+l}\, (J_{K_2+l} - J^c_{K_2+l}) + 1((K_1 - K_2) > M + L)\, I_{K_2+l}\, (J_{K_1+l} - J^c_{K_1+l}) \right]$$

and interchange I and J to get Δ_{2M}. Then

$$|O_{nL} - O^c_{nL}| \le \frac{1}{2\delta(\epsilon)}\, \left(|\Delta_{1M}| + |\Delta_{2M}| + P[|K_1 - K_2| \le L + M]\right)$$

with probability ≥ 1 − ε, using (7.36). But

$$P[|\Delta_{1M}| + |\Delta_{2M}| \ne 0] \le 2 m(M), \qquad P[|K_1 - K_2| \le L + M] \le 2(L + M)/n. \tag{7.37}$$

Letting L → ∞ and also M → ∞, we conclude that

$$|O_{nL} - O^c_{nL}| = o_p(n^{-1/2}). \tag{7.38}$$

But by construction O^c_{nL} has the distribution of O_{nL} under H, and O^{c*}_{n1}, ..., O^{c*}_{nB} are an i.i.d. sample. Thus for B large enough, we can apply Theorem 7.3(?) and the delta method to conclude that (4.3) holds for Õ^c_{nL}, and hence, by (7.38), for Õ_{nL} as well.

For part (3) we use the argument of (7.36) and (7.37) to conclude that

$$E_0 \left| E^*(O_{nL} - O^c_{nL}) \right| \le C\left(m(M) + \frac{2(L+M)}{n}\right) \le C\left(M^{-\beta} + \frac{M}{n} + \frac{L}{n}\right) = o(n^{-1/2}) \tag{7.39}$$

under the conditions of the theorem, by taking M / n^{1/(2β)} → ∞ and M = o(n^{1/2}), which is possible since β > 1. □

References

[1] Bernardi, G. (2000). Isochores and the evolutionary genomics of vertebrates. Gene 241:3-17.

[2] Birney, E. et al. (2007). Nature 447:799-816.

[3] Braun, J. and Müller, H.-G. (1998). Statistical methods for DNA sequence segmentation. Statistical Science 13(2):142-162.

[4] Churchill, G.A. (1989). Stochastic models for heterogeneous genome sequences. Bulletin of Mathematical Biology 51:79-94.

[5] Churchill, G.A. (1992). Hidden Markov chains and the analysis of genome structure. Computers in Chemistry 16:107-115.

[6] Fickett, J.W., Torney, D.C., and Wolf, D.R. (1992). Base compositional structure of genomes. Genomics 13:1056-1064.

[7] Fu, Y.-X. and Curnow, R.N. (1990). Maximum likelihood estimation of multiple change-points. Biometrika 77:563-573.

[8] James, B., James, K.L., and Siegmund, D. (1987). Tests for a change-point. Biometrika 74:71-84.

[9] Künsch, H. (1989). The jackknife and the bootstrap for general stationary observations. Annals of Statistics 17:1217-1241.

[10] Li, W., Stolovitzky, G., Bernaola-Galván, P., and Oliver, J.L. (1998). Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes. Genome Research 8(9):916-928.

[11] Li, W., Bernaola-Galván, P., Haghighi, F., and Grosse, I. (2002). Applications of recursive segmentation to the analysis of DNA sequences. Computers & Chemistry 26(5):491-510.

[12] Morgan, J.N. and Sonquist, J.A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58:415-435.

[13] Politis, D., Romano, J., and Wolf, M. (1999). Subsampling. Springer-Verlag, New York.

[14] Politis, D. and Romano, J. (?)

[15] Sen, A. and Srivastava, M.S. (1975). On tests for detecting change in mean. The Annals of Statistics 3(1):98-108.

[16] Strassen, V. (1965). The existence of probability measures with given marginals. Ann. Math. Statist. 423-439. MR 31:1693.