Thesis Proposal: Scalable Bayesian Nonparametrics

SCALABLE BAYESIAN NONPARAMETRICS
Avinava Dubey
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

ABSTRACT

Bayesian Nonparametric (BN) methods have gained popularity recently. They have an advantage over parametric methods in that they do not assume a fixed number of parameters but instead adjust the number of parameters to a given dataset. This means that such models can theoretically cope with data sets of arbitrary size and latent dimensionality, which makes BN methods a good choice for modeling real-world datasets. But BN methods are slow in practice and require large amounts of memory, which limits their usability. In this thesis we advance the use of Bayesian Nonparametrics (BN) in various real-world machine learning applications with a focus on scalability. We dissect BN models into different components that affect scalability: (a) the data representation, i.e. how the data is represented; (b) the model definition, i.e. how the model is defined for a particular task; (c) the parameters sampled from a BN prior; and (d) the inference algorithm picked to infer the parameters of the model. All of these components have a joint effect on the run-time complexity. In my thesis, I scale up BN models by redesigning one or more of the above components. In the first part of my thesis I propose to reformulate (a)-(c) such that the resulting method is equivalent to the original method. For the task of community detection we show that redesigning the data representation reduces the algorithmic complexity from O(U^2) to O(Ud), where U is the number of users in a network and d is the maximum number of neighbors of a user. For the task of kernel learning we formulate a BN model (i.e. redesign the model definition) for data-driven learning of kernels that reduces the algorithmic complexity from quadratic to linear in the number of data points. Next we examine how nonparametric-process-specific conditional independencies can be exploited to redesign existing BN models, thus reformulating how we draw samples from the BN prior (i.e. redesign (c)). We show that this helps us achieve the independencies necessary for parallel learning of parameters while maintaining the theoretical guarantees for consistency of the parameters. We propose a model-parallel and data-parallel Markov chain Monte Carlo (MCMC) inference method for our model and show a scale-up in inference time on shared-memory systems with multiple processors. We propose to implement a distributed version of our method in which multiple machines do not share memory and communication cost is a bottleneck. Even though reformulating BN models improves scalability, it may not be possible to reformulate every model. Instead, for a given model we may scale up by improving the inference algorithm. We first propose a BN model for modeling evolving hierarchies for which a scalable reformulation is difficult. To scale inference for this model we propose a data- and model-parallel memoized variational inference scheme that can distribute inference across multiple machines. This shows the ability of redesigning inference alone to achieve scalability. To make inference even more efficient, we propose to combine the fast convergence of variational algorithms with the better performance of MCMC-based methods. Specifically, we propose to adapt stochastic-gradient-based MCMC inference schemes to improve time to convergence for BN models.



1 THESIS INTRODUCTION AND SCOPE

Model adaptation is an integral part of machine learning. Many important decisions, such as the number of clusters, the number of latent states in latent variable models, and other such model parameters, fall under the ambit of model adaptation. Selecting an appropriate number of parameters is important if one wants to improve performance. One way of handling model adaptation is to use nonparametric methods (Orbanz & Teh, 2015), which allow these model parameters to change with the data. Bayesian Nonparametric (BN) models provide a Bayesian framework for model adaptation. BN models put a prior over an infinite-dimensional parameter space; conditioned on the data, the posterior concentrates on a finite number of parameters, thus adapting the model complexity to the data. A finite dataset is therefore modeled using a finite but random number of parameters, and if more data are observed the model parameters can grow in a consistent manner. Thus BN provides a principled way of model adaptation.

Unfortunately, while BN models can theoretically cope with data sets of arbitrary size and latent dimensionality, in practice inference can be slow and the memory requirements are high. Scalability is therefore often a major hindrance in using BN in modern machine learning (ML) applications that have to deal with huge real-world data. The main challenges in achieving scalability are the huge sizes of real-world datasets (web content, the human genome, social networks, etc.), the large state space of modern machine learning algorithms (mammoth graphical models, deep belief networks, etc.), and the algorithmic complexity of inference in BN models.

In this thesis I advance the use of Bayesian Nonparametrics (BN) in various real-world machine learning applications with a focus on scalability. To achieve scalability, let us first dissect a BN model into the components that affect its run-time. Assume that we want to perform task T on a dataset Z. A model M_Θ^T, with parameter Θ ∼ BN(λ), acts on a representation X of the data Z to solve task T. Here BN(λ) is a BN prior on the parameter, and thus Θ is a distribution over infinitely many parameters. Further assume that an algorithm L_{M_Θ^T} is given to infer the parameters and latent variables of the model. Before going further, let us review some specific examples of this generic formulation that we will be working with in this thesis.
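As a purely illustrative picture of this decomposition, the following Python sketch bundles the four components into one object; the class and field names are mine, not part of the proposal, and the sketch only fixes terminology for the discussion that follows.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class BNPipeline:
    """Illustrative container for the four components: X, M_Theta^T, Theta ~ BN(lambda), L."""
    representation: Callable[[Any], Any]          # (a) X = representation(Z)
    prior: Callable[[float], Any]                 # (c) draw Theta ~ BN(lambda)
    model: Callable[[Any, Any], Any]              # (b) M_Theta^T acting on X given Theta
    inference: Callable[[Any, Any, Any], Any]     # (d) L_{M_Theta^T}: infer Theta from X

    def run(self, raw_data: Any, lam: float) -> Any:
        X = self.representation(raw_data)              # data representation
        theta_init = self.prior(lam)                   # initial draw from the BN prior
        return self.inference(self.model, X, theta_init)  # inference for the chosen model
```

Scaling up then amounts to replacing one or more of these four fields while leaving the task T, and ideally the quality of its solution, unchanged.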

1.1 DIRICHLET PROCESS MIXTURE MODEL

The Dirichlet process (DP) is a distribution over discrete probability measures D = Σ_{k=1}^∞ π_k δ_{θ_k} with countably infinite support, whose finite-dimensional marginals are distributed according to a finite Dirichlet distribution. It is parameterized by a base probability measure H, which determines the distribution of the atom locations, and a concentration parameter α > 0, which acts like an inverse variance. The DP can be used as the distribution over mixing measures in a nonparametric mixture model. In the Dirichlet process mixture model (DPMM, Antoniak, 1974), data {x_i}_{i=1}^n are assumed to be generated according to

    D ∼ DP(α, H)                          (1)
    θ_i ∼ D,   x_i ∼ f(θ_i)               (2)

Here the task T is to cluster the data, X is a set of points in R^d, Θ = D as defined in eq. 1, and M_Θ^T is the mixture model defined in eq. 2. Many inference algorithms L_{M_Θ^T} have been proposed that maximize p(X).
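For concreteness, below is a minimal sketch of the generative process in eqs. 1-2, using a stick-breaking construction of the DP truncated at K atoms and a Gaussian base measure and likelihood; the truncation level and the Gaussian choices are illustrative assumptions, not part of the model definition above.

```python
import numpy as np

def sample_dpmm(n, alpha=1.0, K=100, base_std=5.0, obs_std=1.0, seed=0):
    """Draw n points from a truncated Dirichlet process mixture model.

    Stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k} (1 - beta_j);
    atoms theta_k ~ H = N(0, base_std^2); observations x_i ~ N(theta_{z_i}, obs_std^2).
    """
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=K)
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    pi = betas * sticks                        # truncated DP weights
    pi = pi / pi.sum()                         # renormalize so the weights sum to one
    theta = rng.normal(0.0, base_std, size=K)  # atom locations drawn from H
    z = rng.choice(K, size=n, p=pi)            # component of each point, i.e. theta_i ~ D
    x = rng.normal(theta[z], obs_std)          # x_i ~ f(theta_i)
    return x, z, theta

x, z, theta = sample_dpmm(500)
print(len(np.unique(z)), "components used for 500 points")  # random and data-dependent
```

Note how the number of occupied components is random and data-dependent, which is exactly the model-adaptation behavior described above.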

1.2 HIERARCHICAL DIRICHLET PROCESS MIXTURE MODEL

Hierarchical Dirichlet processes (HDP) extend the DP to model grouped data. The HDP is a distribution over probability distributions D_m, m = 1, . . . , U, each of which is conditionally distributed according to a DP. These distributions are coupled using a discrete common base measure, itself distributed according to a DP. Each distribution D_m can be used to model a collection of observations x_m := {x_mi}_{i=1}^{N_m}, where

    D_0 ∼ DP(α, H),   D_m ∼ DP(γ, D_0),
    θ_mi ∼ D_m,       x_mi ∼ f(θ_mi),          (3)

for m = 1, . . . , U and i = 1, . . . , N_m. HDPs have been used to model data including text corpora (Teh et al., 2006b), images (Sudderth et al., 2005), time series data (Fox et al., 2008), and genetic variation (Sohn & Xing, 2009). The task T is to model grouped data with admixtures, X is the data represented as a bag of bags of items, Θ = {D_0, D_1, . . . , D_U}, and the admixture model M_Θ^T is defined in eq. 3. As before, many inference algorithms L_{M_Θ^T} have been proposed to optimize p(X).
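A minimal sketch of the HDP generative process in eq. 3 follows; it uses the same truncated stick-breaking idea, sampling the shared measure D_0 over K atoms and then each group-level D_m ∼ DP(γ, D_0) as a Dirichlet over those atoms. The truncation level and the Gaussian likelihood are illustrative assumptions.

```python
import numpy as np

def stick_breaking(concentration, K, rng):
    """Truncated GEM weights: beta_k ~ Beta(1, c), w_k = beta_k * prod_{j<k} (1 - beta_j)."""
    betas = rng.beta(1.0, concentration, size=K)
    w = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return w / w.sum()

def sample_hdp(group_sizes, alpha=1.0, gamma=1.0, K=100, base_std=5.0, obs_std=1.0, seed=0):
    """Draw grouped data from a truncated HDP mixture (cf. eq. 3)."""
    rng = np.random.default_rng(seed)
    beta0 = stick_breaking(alpha, K, rng)       # weights of D_0 ~ DP(alpha, H) on K shared atoms
    theta = rng.normal(0.0, base_std, size=K)   # shared atom locations drawn from H
    groups = []
    for N_m in group_sizes:
        pi_m = rng.dirichlet(gamma * beta0)     # D_m ~ DP(gamma, D_0) restricted to the K atoms
        z = rng.choice(K, size=N_m, p=pi_m)     # theta_mi ~ D_m
        groups.append(rng.normal(theta[z], obs_std))  # x_mi ~ f(theta_mi)
    return groups, theta

groups, theta = sample_hdp([200, 150, 300])     # U = 3 groups sharing one set of atoms
```

Because every group reuses the same atoms with group-specific weights, components are shared across groups while their proportions vary, which is what makes the HDP suitable for the admixture tasks listed above.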

1.3 BAYESIAN NONPARAMETRIC NETWORK MODELING

Several mixed-membership models for community discovery have been proposed, including MMSB (Airoldi et al., 2008), Link-PLSA-LDA (Nallapati et al., 2008), Topic-Link LDA (Liu et al., 2009), and RTM (Chang, 2009). They draw a vector of community memberships for the two candidate entities (documents or users) and typically make a Bernoulli draw for a binary link between them using a notion of similarity between the community membership vectors. These methods directly model the adjacency matrix using Ω(U^2) latent variables, where U is the number of users (nodes) in the network. The generative model is

    D_0 ∼ DP(α, H),   D_m ∼ DP(γ, D_0),
    θ_mi ∼ D_m,       θ_im ∼ D_i,
    x_mi ∼ f(θ_mi),                            (4)

for m, i = 1, . . . , U. Here the task T is to discover communities, X is the adjacency matrix of the social network, Θ = {D_0, D_1, . . . , D_U}, and the admixture model M_Θ^T is defined in eq. 4. Again, a series of inference algorithms L_{M_Θ^T} has been proposed to maximize p(X).

The scalability of an algorithm depends upon the choices made for these four components: (a) the data representation, X; (b) the model definition, M_Θ^T; (c) the parameters sampled from a BN prior, Θ; and (d) the inference algorithm picked to infer the parameters of the model, L_{M_Θ^T}. In order to achieve scalability one can redesign one or more of the above, as long as that does not affect the performance on task T. It is important to note that redesigning one of X, Θ, and M_Θ^T may lead to remodeling the others, while L_{M_Θ^T} can be redesigned without changing the others. This leads to a natural categorization of ways to scale up BN models:

1. Reformulation, where one or more of X, Θ, M_Θ^T is redesigned to make the model more scalable.
2. Scalable Inference, where only L_{M_Θ^T} is redesigned to make the resultant model more scalable.

In my thesis I scale up BN models by redesigning one or more of the above components. Next we discuss these two broad categories in detail.
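Before doing so, a small sketch of component (a) in isolation: for the community-detection task above, a dense adjacency matrix costs Ω(U^2) memory, whereas a neighbor-list representation costs O(Ud) for maximum degree d, which is the kind of representation change referred to in the abstract. The code below is an illustrative sketch, not the thesis implementation.

```python
import numpy as np

def adjacency_matrix(edges, num_users):
    """Dense representation X: O(U^2) memory regardless of how sparse the network is."""
    A = np.zeros((num_users, num_users), dtype=np.int8)
    for u, v in edges:
        A[u, v] = 1
        A[v, u] = 1
    return A

def neighbor_lists(edges, num_users):
    """Sparse representation X': O(U d) memory, where d is the maximum degree."""
    nbrs = [[] for _ in range(num_users)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return nbrs

edges = [(0, 1), (1, 2), (2, 3)]
print(adjacency_matrix(edges, 4).nbytes)               # scales as U^2 entries
print(sum(len(n) for n in neighbor_lists(edges, 4)))   # scales with the number of edges <= U*d
```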

1.4 REFORMULATION

One way of achieving scalability is to reformulate an "original" method O_1 = {M_Θ^T, X, Θ, L_{M_Θ^T}} so that the resulting method O_2 = {M′_Θ′^T, X′, Θ′, L_{M′_Θ′^T}} is equivalent to the original method while simultaneously being more scalable. Reformulation has been at the heart of many advances in ML. Examples include finding global optima of a non-convex problem by reformulation (Freund, 2006), bounding non-convex problems by a convex upper bound (Dubey et al., 2009), reformulating the structured SVM so as to have convergence linear in time (Joachims, 2006), relaxing integer programs to obtain solutions much faster (Dubey et al., 2011), and using random features to speed up kernel learning (Rahimi & Recht, 2009). One can also redesign models so that it is simple to distribute inference over multiple machines. We note that there are two broad approaches to parallel inference:

• Data Parallel: In data-parallel designs (e.g. Asuncion et al. (2008)), approximations are introduced in the learning or inference process to distribute computation over the data. The data is divided across multiple processors and sufficient statistics are computed locally on the data partition residing on each processor. The overall (sufficient) statistics are computed by combining the statistics from the various processors.

• Model Parallel: In model-parallel designs (e.g. Gemulla et al. (2011)), the model is partitioned into various smaller models by using the conditional independence structure implicit in the model (or by introducing approximations that induce such independence). On each processor the sub-model is solved independently and the results are combined globally.
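As a toy illustration of the data-parallel pattern (not the MCMC scheme proposed in this thesis), the sketch below partitions the data across worker processes, computes local sufficient statistics for a single Gaussian component on each partition, and combines them into global statistics.

```python
import numpy as np
from multiprocessing import Pool

def local_stats(partition):
    """Sufficient statistics (count, sum, sum of squares) on one data partition."""
    x = np.asarray(partition)
    return x.size, x.sum(), (x ** 2).sum()

def combine(stats):
    """Combine per-partition statistics into a global mean and variance."""
    n = sum(s[0] for s in stats)
    sx = sum(s[1] for s in stats)
    sxx = sum(s[2] for s in stats)
    mean = sx / n
    return mean, sxx / n - mean ** 2

if __name__ == "__main__":
    data = np.random.default_rng(0).normal(2.0, 1.0, size=10_000)
    partitions = np.array_split(data, 4)           # divide the data across 4 workers
    with Pool(4) as pool:
        stats = pool.map(local_stats, partitions)  # local statistics, computed in parallel
    print(combine(stats))                          # global statistics from the local ones
```

The model-parallel pattern is analogous but partitions the parameters (the sub-models) rather than the data, relying on conditional independence to keep the per-processor problems decoupled.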


The data size and state-space issues are often related. For example, in Hidden Markov Models (HMM), the latent space increases with the length of the sequence. Thus, one must distribute both the data and the state space to truly achieve scalability. Moreover, using both data- and model-distributed designs significantly increases the speedup compared to using just one or the other alone (Zinkevich et al., 2010). While redesigning models to achieve data or model distribution, one must ensure that the resulting model is not very different from the original model in terms of solving task T. This can often be verified theoretically and empirically. In my thesis, I use this general concept of reformulation to propose novel BN methods that are scalable by design. I propose to design BN methods that either reduce the algorithmic complexity of the resultant model O_2, making it more scalable, i.e. run-time(M′_T)