3rd IEEE International Conference on Computer and Communications

Rate of Convergence for Data Augmentation and Inverse Bayes Formula Method in the Genetic Linkage Model

Shan Li
School of Math, Shandong University, Jinan, China
e-mail: shanlim@foxmail.com

Wei Shao
School of Management, Qufu Normal University, Rizhao, China
e-mail: [email protected]

Guoqing Zhao
School of Finance, Shandong University of Finance and Economics, Jinan, China
e-mail: [email protected]

Yingyu Zhang
School of Management, Qufu Normal University, Rizhao, China
e-mail: [email protected]

Abstract—Many statistical problems can be formulated as missing data problems. The data augmentation algorithm and the inverse Bayes formula are important tools for constructing iterative optimization or sampling schemes via the introduction of unobserved data or latent variables. As a Markov chain Monte Carlo method, the data augmentation algorithm exhibits autocorrelation. Nevertheless, the convergence rate of the data augmentation algorithm in the genetic linkage model is not known, and the same is true for the inverse Bayes formula method. In this article, we analyze the convergence rates of the data augmentation algorithm and the inverse Bayes formula method in the genetic linkage model, and we obtain their convergence rates through simulation.

Keywords-data augmentation; inverse Bayes formula; genetic linkage model; rate of convergence; Markov chain Monte Carlo

I. INTRODUCTION

With the development of the modern computer, Markov chain Monte Carlo methods have become important tools for complex statistical models [1] [2] [3]. After several years of development, Markov chain Monte Carlo methods [2] [4] [5] have greatly expanded our scientific horizon, from early computational physics to modern computational biology [6]. Tanner and Wong [7] first proposed using a Markov chain Monte Carlo method for missing data problems. They introduced the missing data into the Bayesian framework and treated the missing data in the same way as the parameters; they named the resulting method the data augmentation (DA) algorithm. Tan et al. [8] proposed applying the inverse Bayes formula (IBF) to missing data problems and obtained good results. The IBF includes the point-wise formula, the sampling-wise formula and the function-wise formula. The IBF for the missing data problem can be seen as an improvement of the DA algorithm, so the idea of the IBF sampling algorithm fits in the DA framework. First, we augment the observed data with latent data (missing data). Then we obtain the structure of the augmented distribution (with the augmented data and parameters unknown).

978-1-5090-6352-9/17/$31.00 ©2017 IEEE



Second, we sample the missing data using the IBF (or DA) algorithm. Finally, with the sampled missing data, we can easily sample the parameters from the augmented distribution. Furthermore, the DA algorithm can be used in big data analysis and high performance computing [9], [10].

The genetic linkage model can be seen as a latent data problem. As a missing data problem, this model was first analyzed by Dempster et al. [11]. Tanner and Wong [7] used the DA algorithm for the genetic linkage model and obtained the analogous result under the Bayesian framework. Tan et al. [8] proposed using the IBF algorithm for the genetic linkage model; the IBF algorithm avoids the autocorrelation in the Markov chain and obtains an accurate result. Hobert et al. [12] analyzed the convergence rate of the DA algorithm in the Bayesian mixture model. But as far as we know, none of these works compares the efficiency of these algorithms for the genetic linkage model. In this paper, we analyze the convergence rates of the DA algorithm and the IBF algorithm in the genetic linkage model. Simulation examples show the comparison results and the convergence rates of these algorithms.

The paper is organized as follows. Section II introduces the genetic linkage model. In Section III, we give the convergence rates of the DA algorithm and the IBF algorithm in the genetic linkage model. In Section IV, through the simulation results, we examine the convergence rates of these algorithms. Section V summarizes this paper with the conclusions and some further research on this work.

II. GENETIC LINKAGE MODEL

Dempster et al. [11] first treated the genetic linkage model as a missing data problem, and Tanner and Wong [7] reexamined this model with the DA algorithm in 1987. In the genetic linkage model, it is believed that 197 animals are distributed multinomially into four categories, $y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$, with cell probabilities

$$\left(\frac{1}{2} + \frac{\theta}{4},\ \frac{1-\theta}{4},\ \frac{1-\theta}{4},\ \frac{\theta}{4}\right).$$

The task is to estimate the parameter $\theta$ from the sample $y$.
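As a quick sanity check, the cell probabilities above can be encoded directly. This is a minimal sketch, not from the paper; the function and variable names (`cell_probs`, `y`) are ours.

```python
def cell_probs(theta):
    """Multinomial cell probabilities of the genetic linkage model."""
    return (0.5 + theta / 4, (1 - theta) / 4, (1 - theta) / 4, theta / 4)

y = (125, 18, 20, 34)   # observed counts; 197 animals in total

# The four probabilities sum to 1 for every theta in (0, 1).
assert abs(sum(cell_probs(0.3)) - 1.0) < 1e-12
```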

From the Bayesian analysis, if we set the flat distribution $U(0,1)$ as the prior distribution of $\theta$, then we get the posterior of $\theta$:

$$p(\theta \mid y) \propto \pi(\theta)\, p(y \mid \theta) \propto \left(\frac{1}{2}+\frac{\theta}{4}\right)^{y_1} \left[\frac{1}{4}(1-\theta)\right]^{y_2} \left[\frac{1}{4}(1-\theta)\right]^{y_3} \left(\frac{\theta}{4}\right)^{y_4} \propto (2+\theta)^{y_1}\,(1-\theta)^{y_2+y_3}\,\theta^{y_4}. \qquad (1)$$

But the computation of the posterior distribution (1) is difficult. We split the first cell $y_1$ into two cells, one having probability $\frac{1}{2}$ and the other having probability $\frac{\theta}{4}$. The augmented data set then becomes $x = (x_1, x_2, x_3, x_4, x_5)$, where $x_1 + x_2 = 125$, $x_3 = y_2$, $x_4 = y_3$ and $x_5 = y_4$, and the augmented posterior becomes:

$$p(\theta \mid x) \propto \theta^{x_2 + x_5}\,(1-\theta)^{x_3 + x_4}. \qquad (2)$$

We can treat $x_2$ as the latent data, and from the assumption we know that the conditional distribution of $x_2$ is binomial:

$$x_2 \mid \theta \sim \mathrm{Binom}\!\left(125,\ \frac{\theta}{\theta + 2}\right). \qquad (3)$$

From (2) we know that the conditional distribution of $\theta$ is a beta distribution:

$$\theta \mid x \sim \mathrm{Beta}(x_2 + x_5 + 1,\ x_3 + x_4 + 1). \qquad (4)$$

Then the implementation of the DA algorithm, which iterates from $(x_2^{(i)}, \theta^{(i)})$ to $(x_2^{(i+1)}, \theta^{(i+1)})$, is summarized below:

(Imputation Step) Draw $x_2^{(i+1)}$ from the binomial distribution (3) conditioning on $\theta^{(i)}$:
$$x_2^{(i+1)} \sim \mathrm{Binom}\!\left(125,\ \frac{\theta^{(i)}}{\theta^{(i)} + 2}\right).$$

(Posterior Step) Draw $\theta^{(i+1)}$ from the beta distribution (4) conditioning on the new sample $x_2^{(i+1)}$:
$$\theta^{(i+1)} \sim \mathrm{Beta}(x_2^{(i+1)} + x_5 + 1,\ x_3 + x_4 + 1).$$

Then we get the sequence $\{(x_2^{(i)}, \theta^{(i)})\}$, $i = 1, 2, \ldots$, and $\theta^{(i)}$ can be seen as a posterior sample from (1).

Clearly, the distribution of $x_2$ depends on $\theta$. We can use Bayes' formula to calculate the improper exact distribution of $x_2^{(i+1)}$:

$$P(x_2^{(i+1)} = k) \propto \frac{\mathrm{dbinom}\!\left(k;\ 125,\ \theta_0/(2+\theta_0)\right)}{\mathrm{dbeta}\!\left(\theta_0;\ k + y_4 + 1,\ y_2 + y_3 + 1\right)}, \qquad k = 0, 1, 2, \ldots, 125, \qquad (5)$$

where $\mathrm{dbinom}(\,\cdot\,; n, p)$ denotes the binomial probability function, $\mathrm{dbeta}(\,\cdot\,; a, b)$ denotes the beta probability density function, and $\theta_0$ is an arbitrary value of $\theta$.

If we replace the binomial distribution (3) in the DA algorithm by the improper exact distribution (5), we get the IBF algorithm (from $x_2^{(i)}$ to $x_2^{(i+1)}$) for the genetic linkage model:

(Imputation Step) Draw $x_2^{(i+1)}$ from the improper exact distribution (5).

(Posterior Step) Draw $\theta^{(i+1)}$ from the beta distribution (4) conditioning on the new sample $x_2^{(i+1)}$:
$$\theta^{(i+1)} \sim \mathrm{Beta}(x_2^{(i+1)} + x_5 + 1,\ x_3 + x_4 + 1).$$

Then we get the sequence $\{(x_2^{(i)}, \theta^{(i)})\}$, $i = 1, 2, \ldots$, and $\theta^{(i)}$ can again be seen as a posterior sample from (1).
III. CONVERGENCE RATE ANALYSIS FOR DA AND IBF

The last section illustrated the DA algorithm and the IBF algorithm for the genetic linkage model. In this section, we first recall the procedure of the DA algorithm, and then give the convergence rates of the DA algorithm and the IBF algorithm. The procedure of the DA algorithm is illustrated in Figure 1.

Figure 1. The procedure of the DA algorithm

From Figure 1, we find that the DA algorithm can be seen as a special case of the Gibbs sampler, so we can use the Gibbs sampler to infer the convergence rate of the DA algorithm. From the properties of the Gibbs sampler, we know that the convergence rate of the sequence $\{\theta^{(i)}\}_{i=1,2,\ldots}$ is equivalent to that of the sequence $\{x_2^{(i)}\}_{i=1,2,\ldots}$, so we use the sequence $\{x_2^{(i)}\}_{i=1,2,\ldots}$ to infer the convergence rate of the DA algorithm. The state space of $\{x_2^{(i)}\}$ is $\{0, 1, 2, \ldots, 125\}$. Let $P = \{p_{i,j}\}_{126 \times 126}$ be the transition matrix of the sequence $\{x_2^{(i)}\}_{i=1,2,\ldots}$; then we get:

$$p_{i,j} = \int_0^1 p(j \mid \theta)\, p(\theta \mid i)\, d\theta = \int_0^1 \mathrm{dbeta}(\theta;\ i + y_4 + 1,\ y_2 + y_3 + 1)\, \mathrm{dbinom}\!\left(j;\ 125,\ \frac{\theta}{\theta + 2}\right) d\theta \qquad (6)$$

$$= \binom{125}{j} 2^{125-j}\, \frac{\Gamma(i + 74)}{\Gamma(i + 35)\,\Gamma(39)} \int_0^1 \frac{\theta^{i+j+34}\,(1-\theta)^{38}}{(\theta + 2)^{125}}\, d\theta. \qquad (7)$$

Then we can get the transition matrix $P$. On the other hand, we know that the second largest eigenvalue of $P$ controls the convergence rate of the sequence $\{x_2^{(i)}\}_{i=1,2,\ldots}$, and thus we obtain the convergence rate of the sequence $\{\theta^{(i)}\}_{i=1,2,\ldots}$.

The convergence rate analysis of the IBF algorithm is the same as that of the DA algorithm. We can also get a sequence $\{x_2^{(i)}\}_{i=1,2,\ldots}$ from the IBF algorithm, and through this sequence we can infer the convergence rate of $\{\theta^{(i)}\}_{i=1,2,\ldots}$. Clearly, in the IBF algorithm the sequence $\{x_2^{(i)}\}_{i=1,2,\ldots}$ is drawn independently from the exact distribution (5), so we can easily get the transition matrix $P$ of the sequence $\{x_2^{(i)}\}$ in the IBF algorithm:

$$P = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}_{126 \times 1} (\pi_1, \pi_2, \ldots, \pi_{126}),$$

where $(\pi_1, \pi_2, \ldots, \pi_{126})$ is the probability distribution (5) of the sequence $\{x_2^{(i)}\}$. So the eigenvalues of the transition matrix $P$ are $\{1, 0\}$. Table I summarizes the largest five eigenvalues from the DA algorithm and the IBF algorithm.

TABLE I. EIGENVALUES FROM THE DA AND THE IBF

        λ1      λ2        λ3    λ4    λ5
DA      1       0.1337    –     –     –
IBF     1       0         0     0     0

IV. SIMULATION RESULTS

In this section, we give the simulation results for the genetic linkage model from the DA algorithm and the IBF algorithm. In order to compare the estimated results with the true values, we use the acceptance-rejection method [13] to draw samples directly from the posterior distribution (1). Theoretically, the acceptance-rejection method can be used for any target distribution, but it is not widely used because of its low sampling efficiency. The efficiency of the acceptance-rejection method depends crucially on the target distribution: if we can find a good envelope distribution, we can sample from the target distribution efficiently; otherwise, the efficiency of this method will be very low. However, in the genetic linkage model we do not consider the efficiency of the acceptance-rejection method, and only focus on the true samples from the posterior distribution (1). The exact computation of the posterior distribution of $\theta$ in the genetic linkage model can be seen in Figure 2 below.

Figure 2. The exact computation for the genetic linkage model. The left panel shows the autocorrelation plot of the sample; the right panel shows the histogram of the sample.

As can be seen from Figure 2, the autocorrelation of the sample is 0 between two different sample points, and the histogram of the sample is very smooth around 0.6.

Next, we use the DA algorithm for the genetic linkage model. We draw $10^5$ samples from this procedure, and the simulation results are illustrated in Figure 3. From Figure 3, we find that the sequence of the latent data $Z$ and the sequence of the parameters $\theta$ both exhibit autocorrelation, which is caused by the second largest eigenvalue of the transition matrix $P$.

We also use the IBF algorithm for the computation of the parameter $\theta$ in the genetic linkage model. A sample of size $10^5$ is drawn by the IBF algorithm, and the simulation results are illustrated in Figure 4. Similar to the exact computation from the acceptance-rejection method, the sequence $\theta$ has no autocorrelation, and the sequence of the latent data $Z$ is also independent between two different sample points.
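The two samplers compared above can be sketched side by side: the DA iteration alternates the steps (3) and (4), while the IBF draw samples the latent count i.i.d. from the normalized version of (5). This is a minimal sketch; the choice `theta0 = 0.5` and all helper names are ours, not the paper's.

```python
import math
import random

# Observed counts: x3 = y2, x4 = y3, x5 = y4 in the augmented data.
y2, y3, y4 = 18, 20, 34
rng = random.Random(7)

def da_step(theta):
    """One DA iteration: I-step from (3), then P-step from (4)."""
    p = theta / (theta + 2)
    x2 = sum(rng.random() < p for _ in range(125))        # x2 ~ Binom(125, p)
    return x2, rng.betavariate(x2 + y4 + 1, y2 + y3 + 1)  # theta ~ Beta

def ibf_probs(theta0=0.5):
    """Normalized version of the improper exact distribution (5)."""
    def log_dbinom(k, n, p):
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))
    def log_dbeta(x, a, b):
        return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
    p0 = theta0 / (2 + theta0)
    logw = [log_dbinom(k, 125, p0) - log_dbeta(theta0, k + y4 + 1, y2 + y3 + 1)
            for k in range(126)]
    m = max(logw)
    w = [math.exp(v - m) for v in logw]
    s = sum(w)
    return [v / s for v in w]

# DA: a Markov chain, with autocorrelation across iterations.
theta, da_chain = 0.5, []
for _ in range(10_000):
    x2, theta = da_step(theta)
    da_chain.append(theta)

# IBF: i.i.d. draws of x2 from (5), then theta from (4); no autocorrelation.
probs = ibf_probs()
ibf_draws = [rng.betavariate(rng.choices(range(126), weights=probs)[0] + y4 + 1,
                             y2 + y3 + 1) for _ in range(10_000)]
```

Both sets of draws should concentrate around 0.6, as in Figure 2; the difference is only in the dependence structure across iterations.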

Figure 5. The simulation results from the IBF algorithm

Therefore, compared with the exact computation of the genetic linkage model, the results from both the DA algorithm and the IBF algorithm are very accurate. Although the result from the DA algorithm has autocorrelation, the DA algorithm consumes less CPU time than the IBF algorithm.

V. CONCLUSIONS

In this paper, we analyze the convergence rates of the DA algorithm and the IBF algorithm. For the DA algorithm, we find that the convergence rate is controlled by the second largest eigenvalue (0.1337) of the transition matrix. Simulation results show that both the DA algorithm and the IBF algorithm can obtain accurate results for the genetic linkage model. Although the DA algorithm has a slower convergence rate, it consumes less CPU time than the IBF algorithm.
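The second-largest-eigenvalue computation behind this conclusion can be sketched numerically. This is a sketch under our own choices: the integral in (6)-(7) is evaluated by composite Simpson quadrature, and the second eigenvalue is extracted by power iteration on mean-centered vectors; the grid size and iteration count are assumptions, not the paper's.

```python
import math

y2, y3, y4 = 18, 20, 34
N = 126                      # states x2 = 0, ..., 125

def log_dbeta(x, a, b):
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def log_dbinom(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

# Composite Simpson rule on (0, 1); the integrand of (6) vanishes at 0 and 1.
G = 160                                   # even number of subintervals
ts = [t / G for t in range(G + 1)]
wts = [(1 if t in (0, G) else 4 if t % 2 else 2) / (3 * G) for t in range(G + 1)]

# A[i][t] = dbeta(ts[t]; i + 35, 39), B[j][t] = dbinom(j; 125, ts[t]/(ts[t]+2)).
A = [[0.0] * (G + 1) for _ in range(N)]
B = [[0.0] * (G + 1) for _ in range(N)]
for t in range(1, G):
    th = ts[t]
    p = th / (th + 2)
    for i in range(N):
        A[i][t] = math.exp(log_dbeta(th, i + y4 + 1, y2 + y3 + 1))
        B[i][t] = math.exp(log_dbinom(i, 125, p))

# Transition matrix (6) of the DA chain, assembled by quadrature.
P = [[sum(wts[t] * A[i][t] * B[j][t] for t in range(1, G)) for j in range(N)]
     for i in range(N)]

# Power iteration on mean-centered vectors: centering deflates the eigenvalue
# 1 (whose right eigenvector is constant), so the growth rate converges to the
# second largest eigenvalue; the paper reports 0.1337 for the DA chain.
f = [k - (N - 1) / 2 for k in range(N)]
scale = max(abs(v) for v in f)
f = [v / scale for v in f]
lam2 = 0.0
for _ in range(60):
    g = [sum(P[i][j] * f[j] for j in range(N)) for i in range(N)]
    mu = sum(g) / N
    g = [v - mu for v in g]
    lam2 = max(abs(v) for v in g)
    f = [v / lam2 for v in g]
```

The same routine applied to the rank-one IBF transition matrix would return a second eigenvalue of 0, which is the algebraic reason the IBF draws show no autocorrelation.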

Figure 3. The simulation results from the DA algorithm

Further research along this work can be done in several directions. First, we can analyze the convergence rate of the DA algorithm for other statistical models, for example, the Ising model. Second, for the IBF algorithm, we can improve the algorithm to reduce its excessive time consumption (for example, by simulated annealing [14]). Finally, the DA algorithm and the IBF algorithm can be efficiently used in big data analysis in distributed environments (such as high performance computing and the cloud [9], [10], [15]).

Figure 4. The simulation results from the DA algorithm

ACKNOWLEDGMENT

The authors thank the referees and editors for their careful reading and constructive suggestions. The authors would also like to thank the Natural Science Foundation of China (11501320, 71471101), the Natural Science Foundation of Shandong Province (ZR2014AP008, ZR2014GQ014, ZR2015GZ008, 2015M098, ZR2017MG009), the Humanities and Social Sciences Research Project of Shandong Higher Education Institutions (J14WF61) and the Natural Science Foundation of Qufu Normal University (bsqd20130114) for their partial support.

REFERENCES

[1] F. M. Liang, C. Liu, and R. J. Carroll, Advanced Markov Chain Monte Carlo Methods: Learning from Past Samples. Wiley, New York, 2010.
[2] J. S. Liu, Monte Carlo Strategies in Scientific Computing. Springer, New York, 2001.
[3] W. Shao, G. Guo, F. Meng, and S. Jia, "An efficient proposal distribution for Metropolis-Hastings using a B-splines technique," Computational Statistics and Data Analysis, vol. 57, pp. 465-478, 2013.
[4] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, pp. 97-109, 1970.
[5] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, "Equation of state calculations by fast computing machines," Journal of Chemical Physics, vol. 21, pp. 1087-1092, 1953.
[6] M. Yue and L. Cheng, "Investigating distributed approaches to efficiently extract textual evidences for biomedical ontologies," 14th IEEE International Conference on Bioinformatics and BioEngineering (BIBE'14), Boca Raton, United States, 2014.
[7] M. A. Tanner and W. H. Wong, "The calculation of posterior distributions by data augmentation," Journal of the American Statistical Association, vol. 82, pp. 528-540, 1987.
[8] M. T. Tan, G. L. Tian, and K. W. Ng, Bayesian Missing Data Problems: EM, Data Augmentation and Noniterative Computation. Chapman & Hall/CRC, New York, 2010.
[9] L. Cheng and T. Li, "Efficient data redistribution to speedup big data analytics in large systems," Proceedings of the 23rd IEEE International Conference on High Performance Computing, Hyderabad, India, pp. 91-100, 2017.
[10] L. Cheng, I. Tachmazidis, S. Kotoulas, and G. Antoniou, "Design and evaluation of small-large outer joins in cloud computing environments," Journal of Parallel and Distributed Computing, pp. 2-15, 2017.
[11] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, pp. 1-38, 1977.
[12] J. P. Hobert, V. Roy, and C. P. Robert, "Improving the convergence properties of the data augmentation algorithm with an application to Bayesian mixture modeling," Statistical Science, vol. 26, pp. 332-351, 2011.
[13] G. S. Fishman, Monte Carlo: Concepts, Algorithms, and Applications. Springer, New York, 1995.
[14] W. Shao, G. Guo, G. Zhao, and F. Meng, "Simulated annealing for the bounds of Kendall's tau and Spearman's rho," Journal of Statistical Computation and Simulation, vol. 84, pp. 2688-2699, 2014.
[15] L. Cheng, S. Kotoulas, T. E. Ward, and G. Theodoropoulos, "Robust and skew-resistant parallel joins in shared-nothing systems," CIKM'14: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pp. 1399-1408, 2014.