Latent Process Model for Manifold Learning

Gang Wang, Weifeng Su, Xiangye Xiao, Frederick Lochovsky
Department of Computer Science
Hong Kong University of Science and Technology
Clear Water Bay, Hong Kong

Abstract

In this paper, we propose a novel stochastic framework for unsupervised manifold learning. Latent variables are introduced, and latent processes are assumed to characterize the pairwise relations of points over a high dimensional space and a low dimensional space. The elements in the embedding space are obtained by minimizing the divergence between the latent processes over the two spaces. Different priors for the latent variables, such as Gaussian and multinomial, are examined, and both the Kullback-Leibler divergence and the Bhattacharyya distance are investigated. The latent process model incorporates several existing embedding methods and gives a clear view of the properties of each method. The embedding ability of the model is illustrated on a collection of bitmaps of handwritten digits and on a set of synthetic data.

1 Introduction

Learning with latent variable models (LVMs) has drawn much attention in the machine learning community. Unlike models that describe only relationships among observed variables, LVMs allow the inclusion of one or more latent (hidden) variables. The inclusion of latent variables empowers LVMs to explore hidden structures underlying the observed variables, or to measure crucial concepts that cannot be observed directly. Dimensionality reduction is an important issue in machine learning and pattern recognition, and LVMs such as factor analysis (FA) and independent component analysis (ICA) are two important linear transformation methods for it. Recently, nonlinear dimensionality reduction techniques have become popular for visualization and other applications. Most nonlinear dimensionality reduction techniques begin with an affinity matrix between pairwise instances or variates, followed by an eigendecomposition; they are therefore also referred to as spectral embedding methods [1, 2]. Until now, there has been little work on LVMs for nonlinear dimensionality reduction. [5] proposed a latent generative model, the Gaussian process latent variable model (GPLVM), which uses a Gaussian process prior as a mapping from a latent space to a feature space induced by a kernel in order to obtain the embedding elements. Motivated by this work, we propose a latent process model for unsupervised nonlinear embedding. In our model, latent variables are introduced and latent processes are assumed to characterize the pairwise relations of the points over a high dimensional space and a low dimensional space, respectively. The embedding elements are obtained by minimizing the divergence between the latent processes over the two spaces. Recently, [3] proposed a probabilistic model called stochastic neighbor embedding (SNE), in which a probability distribution over all potential neighbors of each point in the high dimensional space is defined, and the points in the low dimensional space are obtained by keeping the corresponding distribution as close as possible to the one in the original space. We find that SNE is also a special case of our model.

2 Latent Process Model

Given a set of k high dimensional points X = {x_1, x_2, ..., x_k} ⊂ R^n, our objective is to embed them into a low dimensional space R^m, where m < n. Let Y = {y_1, y_2, ..., y_k} be the set of embedded coordinates corresponding to X. All pairwise affinity relations between elements of X are represented by R_{k×k} = [r_ij], and those over Y by S_{k×k} = [s_ij], where r_ij (s_ij) describes the relation from x_i (y_i) to x_j (y_j). Since our concern is dimensionality reduction, the points in the embedding space are unknown in advance; our task is to find the optimal coordinates of the points in the low dimensional space whose affinity relations are most similar to those in the high dimensional space. We introduce a set of latent variables l, which can be either continuous or discrete. In some sense, each possible value of the latent variables encodes some relations among the elements. We model the latent variables l as stochastic processes over X and Y, respectively. Depending on the assumptions made, the value ranges and the distributions of the latent variables differ. The latent process of l over X depends on R, and the latent process of l over Y depends on S. Since the data in the high dimensional space are fixed, the latent process p(l|R) is known. By tuning the coordinates of the elements in Y, the affinity matrix S changes correspondingly. We use the minimization of a divergence between the two latent processes over X and Y as the embedding criterion to reduce the difference between the affinity relations R and S. When the affinity matrices R and S are the same, all pairwise relations over X are the same as those over Y, and the divergence is zero. Consequently, the low dimensional embedding can be obtained by minimizing the divergence with a numerical optimization method. The Kullback-Leibler (KL) divergence is the most widely used measure of the distance between two distributions; however, it suffers from some drawbacks, so we also investigate the symmetric Bhattacharyya distance [4, 6]. Many authors have considered different forms of the Bhattacharyya distance. In this paper, we treat the Bhattacharyya distance measure as \Delta(\pi_1, \pi_2) = -\rho = -\int (pq)^{1/2} d\lambda, even though this distance may be negative.

2.1 Gaussian Process

We assume that l is a continuous k-vector and follows a Gaussian process on X and on Y, respectively. Specifically, p(l|R) = N(0, R) and q(l|S) = N(0, S), where R and S are k × k positive definite matrices (also called kernel matrices) over X and Y. R and S are symmetric, which indicates that the relation from point i to point j is equivalent to the relation from point j to point i. The latent variable l can be regarded as k regression values for the k elements. Since the latent process of l is determined by a whole kernel matrix, R or S, all pairwise relations are coupled to represent l. The affinity matrices R = [r_ij] and S = [s_ij] can be defined using kernel functions such as the linear kernel, the RBF kernel, etc. The cost function L^G_{KL} defined by the KL divergence is

L^G_{KL} = KL( p(l|R) || q(l|S) ) = \int p \log \frac{p}{q} \, dl
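For concreteness, the following Python sketch (our own illustration, not from the paper; the function and variable names are ours) evaluates L^G_{KL} for two small positive definite kernel matrices using the standard closed form for the KL divergence between zero-mean Gaussians, (1/2)[log(|S|/|R|) + tr(R S^{-1}) - k], which is the expression reported in Table 1, and checks it against a Monte Carlo estimate of the integral.

import numpy as np

def gaussian_kl_cost(R, S):
    """Closed-form L^G_KL = KL(N(0, R) || N(0, S)) for positive definite R, S."""
    k = R.shape[0]
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_R = np.linalg.slogdet(R)
    return 0.5 * (logdet_S - logdet_R + np.trace(R @ np.linalg.inv(S)) - k)

def monte_carlo_kl(R, S, n_samples=200000, seed=0):
    """Monte Carlo estimate of the integral of p log(p/q) with p = N(0,R), q = N(0,S)."""
    rng = np.random.default_rng(seed)
    k = R.shape[0]
    L = np.linalg.cholesky(R)
    x = rng.standard_normal((n_samples, k)) @ L.T      # samples from N(0, R)
    def log_density(x, C):
        _, logdet = np.linalg.slogdet(C)
        quad = np.einsum('ij,jk,ik->i', x, np.linalg.inv(C), x)
        return -0.5 * (quad + logdet + k * np.log(2 * np.pi))
    return np.mean(log_density(x, R) - log_density(x, S))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4)); B = rng.standard_normal((4, 4))
    R = A @ A.T + 4 * np.eye(4)    # small positive definite kernel matrices
    S = B @ B.T + 4 * np.eye(4)
    print(gaussian_kl_cost(R, S), monte_carlo_kl(R, S))   # the two values should agree closely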

The cost function L^G_B defined by the symmetric Bhattacharyya distance is

L^G_B = B( p(l|R) || q(l|S) ) = -\int p^{1/2} q^{1/2} \, dl

The integration results for the cost functions L^G_{KL} and L^G_B can be found in Table 1. The integration for L^G_B is tricky; the proof can be found in [4]. The cost function L^G_{KL} is just the log-likelihood function in twin kernel PCA [5], in which case the latent process model becomes GPLVM.
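As with the KL case, the Gaussian Bhattacharyya cost has a closed form. The sketch below is our own illustration (the function name and test matrices are ours); it uses the standard Bhattacharyya coefficient of two zero-mean Gaussians, rho = |R|^{1/4} |S|^{1/4} / |(R+S)/2|^{1/2}, which is algebraically consistent with the entry reconstructed in Table 1.

import numpy as np

def gaussian_bhattacharyya_cost(R, S):
    """L^G_B = -integral of N(0,R)^{1/2} N(0,S)^{1/2} for positive definite R, S.

    Computed via the Bhattacharyya coefficient of two zero-mean Gaussians,
    using log-determinants for numerical stability.
    """
    _, logdet_R = np.linalg.slogdet(R)
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_M = np.linalg.slogdet(0.5 * (R + S))
    return -np.exp(0.25 * logdet_R + 0.25 * logdet_S - 0.5 * logdet_M)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4)); B = rng.standard_normal((4, 4))
    R = A @ A.T + 4 * np.eye(4)
    S = B @ B.T + 4 * np.eye(4)
    print(gaussian_bhattacharyya_cost(R, S))   # equals -1 when R == S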

2.2 Multinomial Process

We assume that each point i in the high dimensional space picks its potential neighbors following a probability distribution, which can be specified as a multinomial process. Let l_i = (l_i1, ..., l_ik) be a latent (k-1)-vector for point i (the entry l_ii is excluded), in which all entries are zero except for a single entry equal to 1. Then we have the multinomial distribution over X,

p(l_i | r_i) = \prod_{j \neq i} r_ij^{l_ij},

subject to the constraint \sum_{j \neq i} r_ij = 1, where r_ij is the probability that point i picks j as its neighbor over X. The latent process p(l_i | r_i) characterizes the affinities by which point i is related to the remaining k-1 points. The latent variable l_i has an explicit physical meaning: if l_ij = 1, point i picks j as its neighbor, and since l_i has only one element with value 1, point i is related to one and only one point. Correspondingly, the latent variable l_i over Y is introduced, with

q(l_i | s_i) = \prod_{j \neq i} s_ij^{l_ij},

where s_ij is defined over Y subject to the constraint \sum_{j \neq i} s_ij = 1. Since point i is not related to itself, the latent variable l_i does not contain the element l_ii, and r_ii and s_ii are not defined (for compactness, we set them to zero). The KL divergence between p(l_i | r_i) and q(l_i | s_i) for point i is

KL( p(l_i|r_i) || q(l_i|s_i) ) = \sum_{l_i} p(l_i|r_i) \log \frac{p(l_i|r_i)}{q(l_i|s_i)} = \sum_j r_ij \log \frac{r_ij}{s_ij}

The Bhattacharyya distance between p(l_i | r_i) and q(l_i | s_i) for point i is

B( p(l_i|r_i) || q(l_i|s_i) ) = -\sum_{l_i} ( p(l_i|r_i) q(l_i|s_i) )^{1/2} = -\sum_j (r_ij s_ij)^{1/2}

Since l_i is defined only for point i, the overall latent variable is a matrix L = (l_1, ..., l_k)', unlike in Section 2.1, where the latent variable is a single vector. The related affinity matrices are R = (r_1, ..., r_k)' and S = (s_1, ..., s_k)'. The latent variables l_i (i = 1, ..., k) are assumed to be mutually independent, hence we obtain the cost functions L^M_{KL} and L^M_B shown in Table 1. The cost function L^M_{KL} is actually the same as the cost function defined in SNE.
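To make the two multinomial cost functions concrete, the following sketch (ours; the toy data and names are illustrative) computes L^M_KL = sum_i sum_j r_ij log(r_ij / s_ij) and L^M_B = -sum_i sum_j (r_ij s_ij)^{1/2} from two row-normalized affinity matrices with zero diagonals, as defined later in Section 3.

import numpy as np

def multinomial_kl_cost(R, S):
    """L^M_KL = sum_i sum_j r_ij * log(r_ij / s_ij); the same form as the SNE cost."""
    mask = R > 0                               # r_ii = s_ii = 0 are excluded
    return np.sum(R[mask] * np.log(R[mask] / S[mask]))

def multinomial_bhattacharyya_cost(R, S):
    """L^M_B = -sum_i sum_j sqrt(r_ij * s_ij); bounded below by -k, attained when R == S."""
    return -np.sum(np.sqrt(R * S))

def random_row_stochastic(k, rng):
    """Toy stand-in for the affinities of Section 3: zero diagonal, rows summing to 1."""
    A = rng.random((k, k))
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    R = random_row_stochastic(6, rng)          # affinities over the high dimensional space
    S = random_row_stochastic(6, rng)          # affinities over the embedding space
    print(multinomial_kl_cost(R, S))           # zero iff R == S (on the off-diagonal entries)
    print(multinomial_bhattacharyya_cost(R, S))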

2.3 Discussion

In the Gaussian process model, the latent variable l is a continuous k-vector. All pairwise affinities are coupled together and are represented by the Gaussian process prior. This model minimizes the divergence between the affinities over the two spaces, R and S, directly, provided that they are positive definite. In the multinomial process model, the latent variable is a matrix L = (l_1, ..., l_k)', where the l_i (i = 1, ..., k) are (k-1)-dimensional binary vectors and are independent of each other. The pairwise affinities are partially coupled: the latent process of l_i is determined by the relations from point i to the remaining k-1 points, hence these k-1 pairwise affinities originating from point i are coupled together. Since the multinomial process has the constraint that each affinity row sums to 1, this latent process model minimizes the divergence between corresponding rows of R and S, with the elements normalized by their rows.

Table 1. The cost functions of the latent process model with Gaussian and multinomial priors, using the KL divergence and the Bhattacharyya distance as the divergence measurement.

  Gaussian, KL divergence:        L^G_{KL} = (1/2) \log(|S|/|R|) + (1/2) tr(R S^{-1}) - k/2
  Gaussian, Bhattacharyya:        L^G_B = -2^{k/2} (|R|/|S|)^{1/4} |I + R S^{-1}|^{-1/2}
  Multinomial, KL divergence:     L^M_{KL} = \sum_i \sum_j r_ij \log(r_ij / s_ij)
  Multinomial, Bhattacharyya:     L^M_B = -\sum_i \sum_j (r_ij s_ij)^{1/2}

3 Affinity Matrix and Numeric Methods

Let R = [r_ij] and S = [s_ij] be affinity matrices, where r_ij and s_ij represent the relation from point i to point j. The affinity matrix R is computed from every pair of points in the original space. The definition of R can be based on different assumptions, and correspondingly it will induce different embeddings. The most direct approach is to compute the pairwise distances for the affinity matrix. For the Gaussian process prior, R and S are positive definite kernel matrices, which can be defined by kernel functions. If we use the RBF (Gaussian) kernel, then

r^G_ij = \exp( -\|x_i - x_j\|^2 / \sigma ),    s^G_ij = \exp( -\|y_i - y_j\|^2 / \gamma ),

where \sigma > 0 and \gamma > 0 are width parameters over X and Y, respectively. For the multinomial process prior, we need to ensure the constraints \sum_j r_ij = 1 and \sum_j s_ij = 1 (i = 1, ..., k). We define the asymmetric probability r^M_ij (i \neq j) in X as

r^M_ij = \frac{\exp(-d_ij^2)}{\sum_{l \neq i} \exp(-d_il^2)}    (1)

where d_ij^2 = \|x_i - x_j\|^2 / \sigma. The probability s^M_ij (i \neq j) in Y is defined similarly. It is clear that the probability r_ij (s_ij) that i picks j as its neighbor is related not only to the distance between i and j, but also to the distances between i and the other points. Note that r^M_i = (r^M_i1, ..., r^M_ik) is a normalized version of r^G_i = (r^G_i1, ..., r^G_ik), i.e., r^M_i = r^G_i / \sum_j r^G_ij.
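The two affinity definitions above are straightforward to compute. The following sketch (ours; the function names and the toy data are assumptions, and Euclidean distances are assumed) builds the Gaussian-prior RBF affinities r^G and the multinomial-prior affinities of equation (1), which are simply the row normalization of r^G with the diagonal set to zero.

import numpy as np

def rbf_affinity(X, width):
    """Gaussian-prior affinity r^G_ij = exp(-||x_i - x_j||^2 / width)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / width)

def multinomial_affinity(X, width):
    """Multinomial-prior affinity of equation (1): r^M_ij = exp(-d_ij^2) / sum_{l != i} exp(-d_il^2)."""
    RG = rbf_affinity(X, width)
    np.fill_diagonal(RG, 0.0)                  # r_ii is set to zero for compactness
    return RG / RG.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 10))           # toy high dimensional points
    RM = multinomial_affinity(X, width=20.0)   # sigma = 20 is the setting used for USPS (Figure 1)
    print(np.allclose(RM.sum(axis=1), 1.0))    # each row is a probability distribution over neighbors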

This affinity definition tries to map nearby points in the original space to nearby points in the embedding, and faraway points to faraway points. In many cases, however, high dimensional data may have its own intrinsic structure on a low dimensional manifold, and two points that are close in the original space may be far apart on the intrinsic manifold. For this reason, some embedding methods, such as ISOMAP and LLE, are optimized only to preserve the local configurations of nearest neighbors. Similarly, other information, such as class labels and side information, can be used to guide the distance measurement. We say that x_i is relational to x_j, denoted x_i ~ x_j for convenience, if x_i and x_j are nearest neighbors, or belong to the same class, or share the same side information. Several effective methods can be used to represent r_ij; here we give one candidate:

r^G_ij = 1/2 + (1/2) \exp( -\|x_i - x_j\|^2 / \beta )   if x_i ~ x_j,
r^G_ij = 1/2 - (1/2) \exp( -\|x_i - x_j\|^2 / \beta )   otherwise.    (2)

This affinity matrix R = [r_ij] possesses some nice discriminant properties. Its induced distance is a metric, and the affinity between two points with the same relation (e.g., in the same neighborhood) is never smaller than that between two points with different relations (e.g., in different neighborhoods). The affinities R^M for the multinomial prior are obtained by normalizing r^G. We can also use other kernels, such as the polynomial or sigmoid kernel, in which case the distance is measured in a different reproducing kernel Hilbert space. To obtain the coordinates in the embedding space, we need the gradient of the cost function with respect to y_i (i = 1, ..., k). Since the cost functions L^G_{KL} and L^G_B contain the determinant or the trace of the affinity matrices R and S, their gradients are not only complex but also require a matrix inversion, which is very time consuming for large datasets. Therefore, we do not consider the Gaussian process model in the experiments.
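The relational affinity of equation (2) is easy to implement. The sketch below is our own illustration: the neighbor rule, function names, and parameters such as n_neighbors are assumptions chosen for demonstration (n_neighbors = 12 mirrors the setting reported for Figure 2); class labels or side information could equally well define the relation.

import numpy as np

def relational_affinity(X, beta, n_neighbors=12):
    """Equation (2): r^G_ij = 1/2 + 1/2*exp(-||x_i-x_j||^2/beta) if x_i ~ x_j, else 1/2 - 1/2*exp(...).

    Here x_i ~ x_j is taken to mean that j is among the n_neighbors nearest neighbors
    of i, or vice versa.
    """
    k = X.shape[0]
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    order = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]   # nearest neighbors, excluding self
    related = np.zeros((k, k), dtype=bool)
    rows = np.repeat(np.arange(k), n_neighbors)
    related[rows, order.ravel()] = True
    related |= related.T                                   # symmetric neighbor relation
    gauss = 0.5 * np.exp(-d2 / beta)
    RG = np.where(related, 0.5 + gauss, 0.5 - gauss)
    np.fill_diagonal(RG, 0.0)                              # r_ii set to zero for compactness
    return RG

def normalize_rows(A):
    """Multinomial-prior affinities R^M obtained by normalizing each row of R^G."""
    return A / A.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 3))                      # e.g. points sampled in 3-D
    RM = normalize_rows(relational_affinity(X, beta=1.0, n_neighbors=12))
    print(RM.shape, np.allclose(RM.sum(axis=1), 1.0))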

When R^M and S^M are defined with (1), the gradient of the cost function L^M_{KL} with respect to y_i for the multinomial prior model is

\frac{\partial L^M_{KL}}{\partial y_i} = \frac{2}{\gamma} \sum_j (y_i - y_j)(r_ij - s_ij + r_ji - s_ji),

and the gradient of L^M_B with respect to y_i is

\frac{\partial L^M_B}{\partial y_i} = \frac{1}{\gamma} \sum_j (y_i - y_j)\Big( (r_ij s_ij)^{1/2} + (r_ji s_ji)^{1/2} + s_ij L_i + s_ji L_j \Big),

where L_i = -\sum_l (r_il s_il)^{1/2}. The differentiation is tedious, since each y_k affects s_ij through the normalization term, but the result is quite simple. The parameter \gamma can also be adaptively estimated by the optimizer; we set \gamma = 1 in the following experiments for computational simplicity. From the gradients of L^M_{KL} and L^M_B, we see that s_ij and s_ji (j = 1, ..., k) need to be estimated, hence all pairwise distances over Y from the last iteration must be computed to update y_i in the current iteration. When R^M and S^M are normalized from definition (1) with the entries of non-relational pairs, i.e., x_i not related to x_j, set to zero, the gradient of the cost function L^M_{KL} with respect to y_i for the multinomial prior simplifies to

\frac{\partial L^M_{KL}}{\partial y_i} = \frac{2}{\gamma} \sum_{j \sim i} (y_i - y_j)(r_ij - s_ij + r_ji - s_ji).

If the relational concept is defined by neighbor information, only a dozen or so terms need to be summed in these gradients, so the computational cost is greatly reduced. Given the gradient, there are many possible ways to minimize the cost function. We resort to the scaled conjugate gradient (SCG) algorithm, a very efficient optimizer, to obtain the y_i's. We treat the minimization of L({y_i}) as a multi-parameter optimization problem and employ a conditional SCG algorithm with a parallel-update scheme; that is, we find the (t+1)-st estimate y_i(t+1) of y_i by minimizing L(y_1(t), ..., y_{i-1}(t), y_i, y_{i+1}(t), ..., y_k(t)).
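The KL gradient above lends itself to a compact vectorized implementation. The sketch below is our own illustration: it computes the gradient of L^M_KL for all points at once and runs plain gradient descent with a fixed step size as a stand-in for the conditional SCG optimizer used in the paper (the step size, iteration count, random initialization, and function names are assumptions, not the authors' settings).

import numpy as np

def embedding_affinity(Y, gamma=1.0):
    """s^M_ij over the embedding space, defined analogously to equation (1)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d2 / gamma)
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

def kl_gradient(Y, R, gamma=1.0):
    """dL^M_KL/dy_i = (2/gamma) * sum_j (y_i - y_j)(r_ij - s_ij + r_ji - s_ji), for all i."""
    S = embedding_affinity(Y, gamma)
    W = R - S + R.T - S.T                           # coefficient of (y_i - y_j)
    diff = Y[:, None, :] - Y[None, :, :]            # shape (k, k, m)
    return (2.0 / gamma) * np.einsum('ij,ijm->im', W, diff)

def embed(R, m=2, n_iter=200, step=0.1, seed=0):
    """Minimize L^M_KL by plain gradient descent (a stand-in for the SCG optimizer)."""
    rng = np.random.default_rng(seed)
    Y = 1e-2 * rng.standard_normal((R.shape[0], m))  # the paper initializes with PCA instead
    for _ in range(n_iter):
        Y -= step * kl_gradient(Y, R)
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 10))
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    R = np.exp(-d2 / 20.0); np.fill_diagonal(R, 0.0)
    R = R / R.sum(axis=1, keepdims=True)            # equation (1) affinities with sigma = 20
    print(embed(R, m=2).shape)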

4 Experimental Results

The latent process models with the multinomial process, using the KL divergence and the Bhattacharyya distance, are implemented in the experiments. We apply the latent process model to a collection of bitmaps of handwritten digits (USPS) and to a set of synthetic data.

We follow the settings in [3] and choose a subset of 3000 of the digits 0-4 (600 for each digit) from a 16×16 grayscale version of the USPS database. In order to make the embedding preserve the same local and global distances as in the original space, the affinity matrices R and S are defined with equation (1). The two dimensional embeddings obtained with the two divergence criteria, KL and Bhattacharyya, are shown in Figure 1. The elements in the embedding are computed via SCG with a maximum of 200 iterations, and the initialization is based on standard linear PCA. Although no class label information is given, the elements in the embeddings are quite clearly separated according to the digit groups.

We randomly generate 2000 points on the S-curve and swiss-roll surfaces in three dimensional space. On these two manifolds, two points that are nearby in the original space may be far apart along the manifold surface. Hence, the obtained embedding should not preserve a faithful representation of every pairwise distance; instead, each point in the embedding tries to keep the same neighbors as in the original space. We integrate the neighbor information into the definition of the affinity matrix. The results are shown in Figure 2. Both the KL divergence and the Bhattacharyya distance induce good embeddings, in which the local neighborhood relations are preserved. PCA is used to initialize the coordinates of the embedding. Since PCA is a linear mapping, it generally projects points from different areas of the manifold onto the same region. Although the points are arranged fuzzily at the beginning of the iterations, the latent process model is able to rearrange them so that the original neighborhood relations are kept.

Figure 1. The two dimensional results of the latent process model on the USPS dataset: (a) the embedding obtained by the multinomial process with the KL divergence; (b) the embedding obtained by the multinomial process with the Bhattacharyya distance. The black, blue, green, red, and cyan points represent the digits 0, 1, 2, 3, and 4, respectively. \sigma is set to 20.


Figure 2. The embeddings obtained from the three dimensional data (S-curve and swiss-roll) using the latent process model with the multinomial prior. The panels are labeled Raw, PCA, KL, and Bhattacharyya for each dataset; the KL divergence and the Bhattacharyya distance give similar results. PCA is used to initialize the iteration. The number of nearest neighbors is set to 12.

5 Conclusions

In this paper, we propose a new framework for the nonlinear embedding problem, called the latent process model. In this model, the elements in the embedding space try to keep the same affinity relations as in the high dimensional space, and a divergence measure is used as the embedding criterion: the difference between the affinity relations over the two spaces is minimized when the divergence between the two latent processes is minimized. We investigate Gaussian and multinomial priors for the latent variables, and compare two divergence measures, the KL divergence and the Bhattacharyya distance. The latent process model provides a unified way to investigate many embedding methods and gives a clear view of the properties of each. Beyond the direct kernel distance, we can also define the affinity by the geodesic distance on the manifold, as in ISOMAP, or preserve local metric information, as in LLE. In addition, we can integrate class labels and side information into the affinity definition. Such a latent process model is therefore very general and provides versatile ways to discover the underlying structure of a dataset.

References

[1] M. Brand. A unifying theorem for spectral embedding and clustering. In The 9th International Conference on Artificial Intelligence and Statistics, Key West, Florida, 2003.

[2] F. R. Chung. Spectral Graph Theory. American Mathematical Society, 1997.

[3] G. Hinton and S. Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, 2003.

[4] W. J. Krzanowski. Distance between populations using mixed continuous and categorical variables. Biometrika, 70(1):235-243, 1983.

[5] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems 16, 2004.

[6] K. Matusita. Decision rule, based on the distance, for the classification problem. Annals of the Institute of Statistical Mathematics, 8:67-77, 1956.