
A Similarity Reinforcement Algorithm for Heterogeneous Web Pages

Ning Liu1, Jun Yan2, Fengshan Bai1, Benyu Zhang3, Wensi Xi4, Weiguo Fan4, Zheng Chen3, Lei Ji3, Chenyong Hu5, and Wei-Ying Ma3

1 Department of Mathematical Science, Tsinghua University, Beijing, P.R. China [email protected] [email protected]
2 LMAM, Department of Information Science, School of Mathematical Science, Peking University, Beijing, P.R. China [email protected]
3 Microsoft Research Asia, 49 Zhichun Road, Beijing, P.R. China
4 Computer Science, Virginia Polytechnic Institute and State University, U.S.A
5 Lab for Internet Software Technologies, Institute of Software, CAS, Beijing, P.R. China

Abstract. Many machine learning and data mining algorithms rely crucially on similarity metrics. However, most early research works, such as the Vector Space Model or Latent Semantic Indexing, used only a single relationship to measure the similarity of data objects. In this paper, we first use an Intra- and Inter-Type Relationship Matrix (IITRM) to represent a set of heterogeneous data objects and their inter-relationships. Then, we propose a novel similarity-calculating algorithm over the IITRM. The algorithm integrates information from heterogeneous sources through iterative computation and can help detect latent relationships among heterogeneous data objects. It is based on the intuition that the intra-relationship should affect the inter-relationship, and vice versa. Experimental results on the MSN logs dataset show that our algorithm outperforms the traditional Cosine similarity.

1 Introduction

The performance of many data mining algorithms, such as document clustering and text categorization, depends critically on a good metric that reflects the relationship between the data objects in the input space [1] [30]. It is therefore important to calculate the similarity as effectively as possible [28].

Most early research works used only a single relationship to measure the similarity of data objects. In the original vector space model (VSM) [23], “terms” (keywords or stems) were used to characterize queries and documents, creating a document-term relationship matrix in which it is straightforward to compute the similarities between and among terms and documents by taking the inner product of the two corresponding row or column vectors. The Dice, Jaccard and Cosine measures [20] are a few classical methods that use the document-term relationship to measure the similarity of documents for retrieval and clustering purposes. Deerwester and Dumais [9, 10] argued that the concepts in a document might not be well represented by the keywords it contains. In their Latent Semantic Indexing (LSI) work, instead of directly using the document-term matrix to compute the similarity of text objects, they first use Singular Value Decomposition (SVD) to map the document-term matrix into a lower-dimensional matrix in which each dimension is associated with a “hidden” concept; the similarity of text objects (documents or queries) is then measured by their relationships to these “concepts” rather than by the keywords they contain.

Other relationships, such as reference relationships among scientific articles, are also used to measure the similarity of data objects. Small [26] measured the similarity of two journals by counting the number of journals that cite them both; this method is called co-citation. Kessler [14] measured the similarity of two journal papers by counting the number of journals they both cite; this method is called bibliographic coupling. Co-citation and bibliographic coupling have been successfully used to cluster scientific journals [18]. With the advent of the World Wide Web, relationships among web objects, such as the hyperlink relationship, were also used to calculate the similarity of web objects. Dean [8] and Kleinberg [15] used hyperlinks within a community of web pages to discover similar web pages. Larson [16] and Pitkow [17] applied co-citation to the hyperlink structure of the web to measure the similarity of web pages. In the Collaborative Filtering [12] and Recommender Systems [21] fields, researchers analyzed the similarity of people by examining the people-document and people-artifact relationships, respectively. The research works introduced above used only a single type of relationship to measure the similarity of data objects.
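
The inner-product and Cosine measures over a document-term matrix mentioned above can be illustrated with a minimal sketch. The matrix `doc_term` and its counts are hypothetical and not taken from the paper; the zero-vector convention is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical document-term matrix: rows are documents, columns are terms.
doc_term = [
    [2, 1, 0, 0],  # d0
    [1, 1, 0, 0],  # d1 shares both of its terms with d0
    [0, 0, 3, 1],  # d2 shares no term with d0
]

sim_d0_d1 = cosine(doc_term[0], doc_term[1])
sim_d0_d2 = cosine(doc_term[0], doc_term[2])
```

Because d0 and d2 share no term, this single-relationship measure assigns them zero similarity, which is the limitation the rest of the paper addresses.
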
However, these approaches run into serious problems when information applications require a more realistic and accurate similarity measure in which multiple types of data objects and their relationships must be handled in an integrated manner. Thus, in the extended VSM [11], feature vectors of data objects were lengthened by adding attributes from objects of other related spaces via inter-type relationships. In this way, information from different sources is mapped directly into an enhanced vector space, and similarities are computed on these enhanced feature vectors. The extended feature vector has been used for document search [25] and clustering [6]. Following the same idea, Rocchio [22] and Ide [13] expanded the query vector with the frequent terms that appeared in the top documents retrieved by the query and improved search effectiveness; the idea of using terms found in related documents to extend the query term vector is also referred to as “Query Expansion”. Similarly, Brauen [4] modified document vectors by adding or deleting the terms of the queries related to them. Changing document vectors using related query terms is also referred to as the “Dynamic Document Space” method [24].

Recently, researchers have tried to calculate the similarity of two data objects by measuring the similarity of their related data objects. For example, Raghavan and Sever [18] measured the similarity of two queries by calculating the similarity of their corresponding search lists. Beeferman and Berger [3] clustered queries using the similarity of their clicked web pages, and clustered web pages using the similarity of the queries that led to the selection of those pages. Wen [19] and Su [27] calculated query similarity based on both the similarity of the query contents and the similarity of the documents retrieved by the queries; they calculated the similarity of documents in a similar way.

Although the research works introduced above used inter-type relationships to improve the similarity calculation of data objects, they did not consider the mutual reinforcement effect on the similarities of interrelated heterogeneous data objects. Most recently, Wang et al. [29] proposed an iterative reinforcement clustering algorithm for multi-type data objects, in which the clustering results for one type of data object are used to reinforce the clustering process of another type. Their method was shown to be effective for clustering and is closely related to the similarity reinforcement algorithm that we propose in this paper. However, because it aims at finding the best cluster for individual documents, Wang’s algorithm may not be very precise when used to calculate the similarity of individual data objects.

Davidson [7] proposed another closely related idea. In his two-page short paper, Davidson analyzed multiple term-document relationships by expanding the traditional document-term matrix into a matrix with term-term, doc-doc, term-doc, and doc-term sub-matrices. He proposed that the links of the search objects (web pages or terms) in the expanded matrix could be emphasized. With enough emphasis, the principal eigenvector of the extended matrix will have the search object on top, with the remaining objects ordered according to their relevance to the search object. Although his idea is sound, his paper did not explain why different kinds of relationships can be calculated in a unified manner.

In this paper, we propose a novel iterative similarity learning approach that measures the similarity among objects by combining inter-type relationships over the Intra- and Inter-Type Relationship Matrix (IITRM). Our proposed algorithm is based on the intuitive assumption that the intra-relationship should affect the inter-relationship, and vice versa.
It can help detect latent relationships (such as the latent term associations discovered by LSI) among heterogeneous data objects, which can be used to improve the quality of various information applications that require the combination of information from different data sources. Experimental results on the MSN logs dataset show that our algorithm outperforms the traditional Cosine similarity.

The rest of the paper is organized as follows. In Section 2, we give the formal matrix that represents both intra- and inter-type relationships among data objects from heterogeneous data sources in a unified manner, which we call the Intra- and Inter-Type Relationship Matrix (IITRM), together with the problem formulation of the similarity measure. In Section 3, we present the information-processing assumption that forms the theoretical basis of our study and the unified similarity-calculating algorithm that we propose. Experimental results for this algorithm are reported in Section 4. We conclude the paper and describe our future work in Section 5.

2 Intra- and Inter-Type Relationship Matrix

In this section, we first give the formal matrix that represents both intra- and inter-type relationships among data objects from two data sources in a unified manner, which we call the second-order Intra- and Inter-Type Relationship Matrix. We then present the general Intra- and Inter-Type Relationship Matrix (IITRM), which represents a set of heterogeneous data objects and their inter-relationships.

2.1 Second-order IITRM

Suppose we have N different data spaces $S_1, S_2, \ldots, S_N$. Data objects within the same data space are connected via intra-type relationships $R_i \subseteq S_i \times S_i$. Data objects from two different data spaces are connected via inter-type relationships $R_{ij} \subseteq S_i \times S_j$ ($i \neq j$). The intra-type relationship $R_i$ can be represented as an $m \times m$ adjacency matrix $L_i$ (m is the total number of objects in data space $S_i$), where cell $l_{xy}$ represents the intra-type relationship from the xth object to the yth object in the data space $S_i$. The inter-type relationship $R_{ij}$ can likewise be represented as an $m \times n$ adjacency matrix $L_{ij}$ (m is the total number of objects in $S_i$, and n is the total number of objects in $S_j$), where the value of cell $l_{xy}$ represents the inter-type relationship from the xth object in $S_i$ to the yth object in $S_j$.

Let us consider two data spaces $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ and their relationships $R_x$, $R_y$, $R_{xy}$, and $R_{yx}$. The adjacency matrices $L_x$ and $L_y$ stand for the intra-type relationships within the data spaces X and Y, respectively. $L_{xy}$ and $L_{yx}$ stand for the inter-type relationships from objects in X to objects in Y and from objects in Y to objects in X, respectively. If we merge data spaces X and Y into a unified data space U, then the previous inter- and intra-type relationships $R_x$, $R_y$, $R_{xy}$, and $R_{yx}$ all become part of the intra-type relationship $R_u$ in data space U. Suppose $L_u$ is the adjacency matrix of $R_u$. Then $L_u$ is an $(m+n) \times (m+n)$ matrix, with cell $l_{ij}$ representing the relationship from the ith object originally from X (if $i \le m$), or the (i-m)th object originally from Y (if $i > m$), to the jth object originally from X (if $j \le m$), or the (j-m)th object originally from Y (if $j > m$).
It is not difficult to see that the matrix $L_u$ combines $L_x$, $L_y$, $L_{xy}$ and $L_{yx}$ as shown in Eq. (1) below:

$$L_u = \begin{pmatrix} L_x & L_{xy} \\ L_{yx} & L_y \end{pmatrix} \tag{1}$$
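
As a minimal sketch of Eq. (1), the four blocks can be stacked into a single adjacency matrix. The helper `block_2x2` and the small example matrices (m = 2, n = 3) are hypothetical and chosen only for illustration.

```python
def block_2x2(L_x, L_xy, L_yx, L_y):
    """Stack four blocks into [[L_x, L_xy], [L_yx, L_y]] as a list of rows."""
    top = [row_x + row_xy for row_x, row_xy in zip(L_x, L_xy)]
    bottom = [row_yx + row_y for row_yx, row_y in zip(L_yx, L_y)]
    return top + bottom

# Hypothetical relationships for X = {x1, x2} and Y = {y1, y2, y3}.
L_x = [[0, 1],
       [1, 0]]            # intra-type, X to X (2 x 2)
L_y = [[0, 0, 1],
       [0, 0, 0],
       [1, 0, 0]]         # intra-type, Y to Y (3 x 3)
L_xy = [[1, 1, 0],
        [0, 1, 1]]        # inter-type, X to Y (2 x 3)
L_yx = [[1, 0],
        [1, 1],
        [0, 1]]           # inter-type, Y to X (3 x 2)

L_u = block_2x2(L_x, L_xy, L_yx, L_y)  # (m + n) x (m + n) = 5 x 5
```

With zero-based indexing, `L_u[i][j]` for `i < 2` and `j >= 2` reads from `L_xy`, matching the cell description given for $L_u$ in the text.
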

In this paper, we call the matrix $L_u$ the Second-order Intra- and Inter-Type Relationship Matrix and denote it as $L^{2}_{\mathrm{IITRM}}$. The $L^{2}_{\mathrm{IITRM}}$ matrix can be used to explain many real-world information application scenarios. For example, if we consider only one data space, the web pages, and one intra-type relationship, the hyperlink relationship, the $L^{2}_{\mathrm{IITRM}}$ matrix reduces to the link adjacency matrix of the web graph. If we want to analyze how user-browsing behaviors can affect the “popularity” of a web page as defined in the PageRank algorithm [5], we would actually be analyzing two data spaces (user and web page), one inter-type relationship (browsing), and two intra-type relationships (hyperlink and user endorsement), as shown in Figure 1. The figure can be represented as the IITRM:

$$L^{2}_{\mathrm{IITRM}} = \begin{pmatrix} L_{user} & L_{browse} \\ L_{browse}^{T} & L_{hyperlink} \end{pmatrix} \tag{2}$$

Here, $L_{user}$ is the endorsement relationship matrix for the user space, $L_{browse}$ is the browsing relationship matrix between the user space and the web page space, and $L_{hyperlink}$ is the hyperlink adjacency matrix for the web page space. Eq. (2) provides a general way of representing web objects and their relationships.

[Figure 1 labels: User, Profile Similarity, Browse, Web-page, Hyperlink]

Figure 1. A real-world scenario for the Second-order IITRM

2.2 Intra- and Inter-Type Relationship Matrix

As with the second-order IITRM, we now present the formal matrix that represents both intra- and inter-type relationships among data objects from heterogeneous data sources in a unified manner. Using the notation introduced in Section 2.1, Eq. (1) generalizes easily to the definition of the Intra- and Inter-Type Relationship Matrix $L^{N}_{\mathrm{IITRM}}$ for N interrelated data spaces, as shown in Eq. (3).

$$L^{N}_{\mathrm{IITRM}} = \begin{pmatrix} L_1 & L_{12} & \cdots & L_{1N} \\ L_{21} & L_2 & \cdots & L_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ L_{N1} & L_{N2} & \cdots & L_N \end{pmatrix} \tag{3}$$
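
The block layout of Eq. (3) can be sketched as follows. The function `assemble_iitrm`, the dictionary layout keyed by zero-based space indices, and the example sizes are assumptions made for this illustration; the paper numbers its spaces from 1.

```python
def assemble_iitrm(blocks, n_spaces):
    """Assemble the N x N block matrix whose (i, j) block is blocks[(i, j)].

    blocks[(i, i)] is the intra-type matrix of space i; blocks[(i, j)] with
    i != j is the inter-type matrix from space i to space j.
    """
    rows = []
    for i in range(n_spaces):
        height = len(blocks[(i, 0)])  # number of objects in space i
        for r in range(height):
            row = []
            for j in range(n_spaces):
                row.extend(blocks[(i, j)][r])
            rows.append(row)
    return rows

def zeros(n_rows, n_cols):
    return [[0] * n_cols for _ in range(n_rows)]

# Hypothetical example: three spaces holding 2, 1 and 3 objects.
sizes = [2, 1, 3]
blocks = {(i, j): zeros(sizes[i], sizes[j])
          for i in range(3) for j in range(3)}
blocks[(0, 2)] = [[1, 1, 0], [0, 1, 1]]    # space 0 to space 2
blocks[(2, 0)] = [[1, 0], [1, 1], [0, 1]]  # space 2 to space 0
blocks[(1, 1)] = [[1]]                     # intra-type, space 1

L_iitrm = assemble_iitrm(blocks, 3)        # 6 x 6 matrix
```
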

As discussed above, our problem is to reinforce the similarity among intra-type objects by incorporating the inter-type relationships. We can therefore divide the N data spaces $S_1, S_2, \ldots, S_N$ into two data spaces $S_x$ and $S_{\bar{x}}$, where $S_{\bar{x}}$ denotes all the data spaces except $S_x$. We can then rewrite the IITRM as:

$$L^{N}_{\mathrm{IITRM}} = \begin{pmatrix} L_x & L_{x\bar{x}} \\ L_{\bar{x}x} & L_{\bar{x}} \end{pmatrix} \tag{4}$$
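
One way to read Eq. (4) is as a regrouping of rows and columns: indices belonging to the chosen space form one group and all remaining indices form the other. The sketch below assumes this reading; the function `partition` and the labelled 4 x 4 matrix are hypothetical.

```python
def partition(L, sizes, x):
    """Split L into (L_x, L_x_rest, L_rest_x, L_rest) for space index x.

    sizes[i] is the number of objects in space i (zero-based); 'rest'
    stands for all spaces other than x.
    """
    start = sum(sizes[:x])
    in_x = list(range(start, start + sizes[x]))
    in_x_set = set(in_x)
    rest = [i for i in range(len(L)) if i not in in_x_set]

    def pick(rows, cols):
        return [[L[r][c] for c in cols] for r in rows]

    return pick(in_x, in_x), pick(in_x, rest), pick(rest, in_x), pick(rest, rest)

# Hypothetical 4 x 4 matrix whose entry at (r, c) is 10*r + c, so every
# extracted cell can be traced back to its original position.
L = [[10 * r + c for c in range(4)] for r in range(4)]
sizes = [1, 2, 1]

L_x, L_x_rest, L_rest_x, L_rest = partition(L, sizes, x=1)
```
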

3 Similarity Reinforcement Algorithm

In this section, we further argue that the similarity relationships between data objects from heterogeneous data sources can also be iteratively reinforced by the inter- and intra-type relationships among heterogeneous data spaces under a simple assumption. More specifically, this reinforcement process can be modeled as an iterative calculation over the IITRM. This is followed by the convergence proof of the algorithm.

3.1 Similarity Reinforcement Algorithm

First, let us interpret our basic assumption: “the intra-relationship should affect the inter-relationship, and vice versa.” We believe that iteratively reinforcing the similarity of a set of heterogeneous data objects through their inter- and intra-type relationships can better predict the similarity of two data objects, because the iterative similarity reinforcement calculation can discover hidden similarities between data objects, as illustrated in Figure 2.

[Figure 2 labels: Data Space A (q1, q2), Data Space B (p1, p2, p3), Intertype]

Figure 2. An example of iterative similarity reinforcement calculation
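
The scenario of Figure 2, in which q1 links to p1 and p2 while q2 links to p2 and p3, can be traced numerically. The sketch below is illustrative only and rests on several assumptions not stated in the paper: the update takes the form "new similarity in one space = decay × inter-type matrix × current similarity in the other space × reverse inter-type matrix", the reverse inter-type matrix is the transpose of `L_ab`, the similarity matrices start as identities, and the decay factors default to 1.0 to keep the numbers readable.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def scale(factor, A):
    return [[factor * value for value in row] for row in A]

def reinforce(L_ab, steps, decay_a=1.0, decay_b=1.0):
    """Run `steps` simultaneous similarity updates for spaces A and B."""
    L_ba = transpose(L_ab)            # assumption: symmetric inter-type links
    L_a = identity(len(L_ab))         # assumption: start from self-similarity
    L_b = identity(len(L_ab[0]))
    for _ in range(steps):
        new_a = scale(decay_a, matmul(matmul(L_ab, L_b), L_ba))
        new_b = scale(decay_b, matmul(matmul(L_ba, L_a), L_ab))
        L_a, L_b = new_a, new_b
    return L_a, L_b

# Rows are q1, q2 (space A); columns are p1, p2, p3 (space B).
L_ab = [[1, 1, 0],
        [0, 1, 1]]

L_a1, L_b1 = reinforce(L_ab, steps=1)
L_a2, L_b2 = reinforce(L_ab, steps=2)
```

After one step the entry for the pair (p1, p3), `L_b1[0][2]`, is still zero because no single object links to both; after two steps `L_b2[0][2]` becomes positive because q1 and q2 have themselves become similar. With decay factors of 1.0 the values grow at every step, which is why decay factors matter for convergence.
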

In Figure 2, in the first step, data objects q1 and q2 in data space A are connected with objects p1, p2 and p3 in data space B via some inter-type relationship. The objects q1 and q2 in A are considered similar because they link to the same object p2 in B. Likewise, p1 and p2 are similar because they are linked by the same object q1 in A, and p2 and p3 are similar because they are linked by the same object q2. In the second step, objects p1 and p3 in B can also be considered similar, because they are linked by the similar objects q1 and q2 in A. The similarity measure procedure continues iteratively until the similarity values of the objects converge. Under this assumption, we can therefore model this iterative calculation with the following equations over the IITRM:

$$L_{x}^{k+1} = \lambda_1 L_{x\bar{x}} L_{\bar{x}}^{k} L_{\bar{x}x} \quad \text{and} \quad L_{\bar{x}}^{k+1} = \lambda_2 L_{\bar{x}x} L_{x}^{k} L_{x\bar{x}},$$

where λ1 and λ2 are the decay factors. We will prove that if λ1