KEYNOTE: Keyword Search by Node Selection for Text Retrieval on DHT-based P2P Networks*

Zheng Zhang, Shuigeng Zhou, Weining Qian, and Aoying Zhou
{zhzhang1981, sgzhou, wnqian, ayzhou}@fudan.edu.cn
Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China

Abstract. Efficient full-text keyword search remains a challenging problem in P2P systems. Most traditional keyword search systems on DHT overlay networks perform the join operation of keywords at the document level, which incurs huge storage and bandwidth costs. In this paper, we present KEYNOTE, a novel keyword search system that performs the join operation at the node level. Compared to traditional keyword search systems on DHTs, KEYNOTE greatly reduces the storage and communication costs. To forward a query to the relevant nodes for searching documents, two effective node selection methods are presented. To address the hot-spot problem in Chord overlay networks, an efficient load balancing scheme is introduced. Simulated experimental evaluation with up to 8,000 nodes and over 600,000 real-world documents validates the practicality of the proposed system.

1 Introduction

Recently, Peer-to-Peer (P2P) systems have emerged as a scalable infrastructure that can provide large-scale, decentralized lookup services. File-name-based P2P search systems are already popular, while content-based search remains a challenge in P2P networks. Compared to other, more complex P2P information retrieval systems, keyword search systems on DHTs [1–4] are particularly attractive due to their simple yet efficient search mechanism and high search accuracy. Meanwhile, given the success of centralized search engines such as Google, it is worth studying whether P2P-based Web and text search can achieve equivalent precision and similar or even better performance, exploiting the low cost, ease of deployment, and scalability of P2P systems.

As with centralized Web search engines, users in P2P networks may only be interested in the top-k documents most relevant to the query. Since the top-k documents usually reside on a small number of nodes (or peers), selecting the relevant peers without maintaining information about all peers is the first challenge in making P2P-based content search feasible. Existing DHT-based keyword search technologies perform the join operation at the document level, consuming a large amount of storage and bandwidth. In the rest

* This work was supported by the National Natural Science Foundation of China (NSFC) under grant numbers 60373019, 60573183, 60496325 and 60503034, and the Shanghai Rising-Star Program (04QMX1404).

part of this paper, we call this the document-join approach. Fig. 1 illustrates its basic scheme. In this approach, both the keywords and the computers in the system are hashed into a common identifier space. A computer hosts those keywords whose hashed values fall between its address and the address of the next computer in the network. When a node joins the network, it publishes an inverted index containing each term, the global document ID and other metadata (e.g., the weight of the term in the document). To process a multi-term query, the system has to transmit the posting list of one term from one node to the nodes holding the other terms. This approach runs into difficulties when a huge number of documents exist in the P2P network: the node hosting the index of a term must spend considerable storage space on the index and metadata, and answering a multi-term query consumes much bandwidth to transmit the large indices. As indicated by [1], even with promising optimization techniques, current keyword search systems are not feasible for Internet-scale search.

A third problem of existing methods is that queries are not distributed evenly. Even with a randomized DHT, the query execution load is not balanced across the P2P system. Providing load balance on top of the DHT overlay is therefore a must-have feature of P2P-based keyword search systems.

To deal with the above three challenges, we present a novel search system. The basic idea is to select those peers that hold many documents relevant to the query. To consume less storage and bandwidth, our system performs the join operation at the node level. Fig. 2 illustrates its scheme. Instead of publishing term-document information, our system only publishes term-node data. The distributed index includes the term, the source node identifier and statistics about the term in the source node.
To process a query consisting of two terms, we only need to transmit the term-node statistics from one node to another. We call our system KEYNOTE: KEYword-search using NOde-selection for TExt-retrieval. We also provide an efficient load balancing algorithm on top of the DHT overlay we use, namely Chord [5]. In this paper, we make the following contributions:

– We propose KEYNOTE, a novel keyword search system on Chord overlay networks using node-level join, which greatly reduces both the communication cost and the storage cost.
– Two simple but effective node selection methods are proposed, which can be implemented without a centralized server.
– A load balancing algorithm on the Chord overlay is presented, which guarantees that load balancing of a hot term on a given node is achieved in O(log N) steps (N is the number of nodes in the network).

1.1 Related Work

PlanetP [6] uses Bloom filters to summarize the content of each node and distributes these summaries to the whole network. In [7], Lu et al. study content-based resource selection and document retrieval in hybrid P2P networks. pSearch [8] is a peer-to-peer information retrieval system based on CAN [9], which employs statistically derived conceptual indices for retrieval. P2P keyword search on top of DHTs is most relevant to our work. To reduce the communication cost of the keyword join, some optimization

techniques have been proposed. In [2], Reynolds et al. adopt Bloom filters, caches and incremental results to reduce the join cost. [1] reported that even by combining all the existing optimization techniques, it is still impossible to make P2P web search feasible on top of DHTs. In [4], Tang et al. propose eSearch, a hybrid global-local indexing mechanism that makes the communication cost independent of the number of documents in the network. However, this method consumes 6.8 times the storage of traditional keyword-join systems. In [10], Zhong et al. report that the communication cost of a keyword-join system can be reduced to 0.0175 times the original by combining Bloom filters, caching, pre-computation, query log mining and incremental set intersection.

The remainder of this paper is organized as follows. Section 2 presents KEYNOTE in detail. Section 3 describes a load balancing algorithm on the Chord overlay. Section 4 gives the experimental evaluation of our system. Section 5 concludes the paper.
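As a concrete illustration of the node-level join idea, here is a minimal Python sketch; the index layout (a plain dictionary from term to node IDs, with per-node statistics omitted) is an assumption made for illustration, not the system's actual wire format.

```python
# Sketch of node-level join: each term maps to the list of nodes that
# contain it. A multi-term query intersects these (short) node lists
# instead of document-level posting lists.

def node_level_join(term_node_index, query_terms):
    """Return the set of node IDs that contain every query term."""
    postings = [set(term_node_index[t]) for t in query_terms]
    result = postings[0]
    for p in postings[1:]:
        result &= p  # intersect node lists, not document lists
    return result

# Toy index: term -> node IDs holding that term.
index = {
    "p2p":    [1, 2, 5],
    "search": [2, 3, 5, 7],
}

candidates = node_level_join(index, ["p2p", "search"])  # {2, 5}
```

The query would then be forwarded only to the peers in `candidates` (after ranking them, as described in Section 2).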

Fig. 1. Document-join System

Fig. 2. Node-join System

2 KEYNOTE

In KEYNOTE, a large number of computers are organized by the DHT into a Chord [5] ring. Fig. 3 illustrates the system architecture of KEYNOTE. When a new peer wants

Fig. 3. Query processing in KEYNOTE

to join the network, it first joins the Chord network using the join protocol of Chord. Afterward, the new peer publishes its term-node statistics to the peers responsible for those terms in the Chord ring. When a multi-term query is issued, the term-node information for these terms is transmitted from one node to other

nodes to perform the node-level join. From the nodes that contain all the terms, the peer selection methods select the top-L (where L is a system parameter) relevant peers, to which the query is finally forwarded for processing. After the local top-k results from the selected peers are returned to the query peer, these partial results are merged into the final top-k documents. Node leaving and failure are handled as in Chord. In the following, we present the peer selection methods, term-node information publication, query processing and an analysis of system resources in detail.

2.1 Peer Selection Methods

Peer Selection Using Sum Statistics: PS-sum. Our first peer selection method (PS-sum for short) is inspired by the gGlOSS system [11], which proposes an approach to estimate the goodness of each text database for a given query and then ranks the text databases according to the estimated goodness. In KEYNOTE, the goodness of peer P_j with regard to the query vector q = {q_1, q_2, ..., q_m} is estimated as follows:

G_{sum}(q, P_j) = \sum_{i=1,...,m} q_i \times w_{sum}(i, j) = \sum_{i=1,...,m} q_i \times sum\_tf_i(P_j) \times idf_i \qquad (1)

where w_{sum}(i, j) is the sum weight of term t_i over the documents in P_j, and sum_tf_i(P_j) is the sum of the TF weights of t_i over the documents in P_j, i.e., sum\_tf_i(P_j) = \sum_{d_k \in P_j \wedge t_i \in d_k} tf_{ik}, where tf_{ik} is the TF weight of t_i in document d_k. In KEYNOTE we compute the weight of each term from global information. Therefore, in Equation 1, idf_i is defined as the global value \log(N/n_i), where n_i is the number of documents containing t_i in the whole network and N is the total number of documents in the network.

Peer Selection Using Sum and Max Statistics: PS-sum_max. The PS-sum method can fail in the following situation. Peer A contains only one document d relevant to the query q. Peer B contains many relevant documents whose summed relevance is much greater than the relevance of d, but the maximum relevance among these documents is less than the relevance of d. If a user issues a top-1 search for q and uses PS-sum to select one peer, peer B would be selected to process the query, and the user would miss the most relevant document d on peer A. To address this problem, we propose a new method, PS-sum_max, which considers both the summed relevance and the maximum relevance. Formally, we define the goodness of P_j with respect to q as:

G_{sum\_max}(q, P_j) = \frac{G_{sum}(q, P_j)}{2 \times Max\_G_{sum}(q)} + \frac{G_{max}(q, P_j)}{2 \times Max\_G_{max}(q)} \qquad (2)

where Max\_G_{sum}(q) = \max_j\{G_{sum}(q, P_j)\}, Max\_G_{max}(q) = \max_j\{G_{max}(q, P_j)\} and

G_{max}(q, P_j) = \sum_{i=1,...,m} q_i \times w_{max}(i, j) = \sum_{i=1,...,m} q_i \times max\_tf_i(P_j) \times idf_i \qquad (3)

where w_{max}(i, j) is the maximum weight of term t_i among all the documents in P_j, max\_tf_i(P_j) = \max\{tf_{ik} \mid d_k \in P_j\}, and idf_i has the same meaning as in Equation 1.

2.2 Term-Node Information Publication and Query Processing

To implement PS-sum and PS-sum_max, we need the global values of w_sum(i, j) and w_max(i, j) for each term t_i in peer P_j, which in turn requires the global values of N and n_i. For the value of n_i, we can look up the node holding t_i, but there is no dedicated node storing the value of N. In KEYNOTE, we let the nodes responsible for stop-words store the value of N. The process of publishing the statistics for each term t_i in P_j consists of three steps.

1. We publish the number of documents containing t_i to the node holding t_i.
2. We obtain the global values of N and n_i from the nodes holding stop-words and the node holding t_i, respectively, and compute the global weight of t_i in each document in P_j. Afterward, w_sum(i, j) and w_max(i, j) are computed.
3. Finally, the tuple (t_i, P_j, w_sum(i, j), w_max(i, j)) is published to the node responsible for t_i.

To answer a multi-term query, the term-node information of each term is first located using the Chord routing protocol. Then the inverted lists of these terms are transmitted from one node to other nodes to perform the join operation. From the nodes that contain all these terms, we choose the top-L relevant nodes and forward the query to these L nodes in parallel to obtain the local top-k documents from each of them. Finally, the global top-k documents are obtained by merging these local top-k results.

2.3 Analysis of System Resource Usage and Search Latency

For convenience of analysis, we make the following assumptions: (1) The query involves two terms t1 and t2; the user is interested in the top-10 results, and the system parameter L is 15.
(2) The link between each peer and its successors on the Chord ring is a pre-established TCP/IP link; when two non-neighbors want to transmit data, they have to establish a temporary UDP link. (3) The available network bandwidth per query is 1.5 Mbps, the bandwidth of a T1 link. (4) The link latency to establish a temporary link is 40 ms, and the time spent on local searching and computing at each site is omitted from our analysis, since in P2P systems we focus on the network latency.

Storage Cost. The storage cost is the total storage consumption for holding the distributed index over the whole network. In traditional document-join systems, we assume that 8 bytes are needed to represent the document ID and another 4 bytes to store the metadata (e.g., the weight of the term). Thus, the total storage consumption in these systems is 12 × n × t × d bytes, where n is the number of nodes in the network, t is the average number of distinct terms per node and d is the average number

of term-relevant documents per term on each node. In KEYNOTE, we use 4 bytes to represent the node ID (or node IP) and another 8 bytes to store the sum weight and max weight of each term. Thus, the storage cost in KEYNOTE is 12 × n × t bytes. Compared to document-join systems, KEYNOTE reduces the storage cost by a factor of d. Obviously, the value of d depends on the data collection; in our Reuters news collection, it is around 17.

Communication Cost. The communication cost is the bandwidth cost for transmitting data after the link is established. We first analyze the communication cost of document-join systems. To locate the two nodes holding t1 and t2, the communication cost is 2 × (1/2)·log(n) × (40 + 1 + 40) = 81·log n bytes, where (1/2)·log(n) is the average number of routing hops in a Chord overlay of n nodes and (40 + 1 + 40) is the size of the query message (a 40-byte TCP/IP header, a 1-byte message ID and 40 bytes of query text). To perform the join operation of t1 and t2, the communication cost is 12 × N_term_doc bytes, where N_term_doc is the average number of documents in which a term appears. We assume that the optimization techniques for the join operation (Bloom filters, caching, pre-computation, query log mining and incremental set intersection) are used in the traditional document-join system and reduce this cost to 0.0175 times the original¹. To return the search results, the communication cost is 28 + 1 + 10 × (8 + 4) = 149 bytes (a 28-byte UDP/IP header, a 1-byte message ID, and an 8-byte document ID plus a 4-byte similarity score for each of the top-10 results). Therefore, the total communication cost of document-join systems with optimization techniques is 81·log n + 0.0175 × 12 × N_term_doc + 149 bytes.

Now we analyze the communication cost of KEYNOTE. The cost to locate the two nodes holding t1 and t2 is the same as in document-join systems.
The communication cost of the node-level join operation is 12 × N_term_node bytes, where N_term_node is the average number of nodes in which a term appears. The cost of returning the list of selected peers to the query peer is 28 + 1 + 15 × (4 + 4) = 149 bytes (a 28-byte UDP/IP header, a 1-byte message ID, and a 4-byte node ID plus a 4-byte similarity score for each of the top-15 nodes). The communication cost of returning the final document list to the query peer is 28 + 1 + 15 × 10 × (8 + 4) = 1829 bytes (an 8-byte document ID and a 4-byte similarity score for each of the top-10 documents from each of the 15 selected peers). Therefore, the total communication cost of KEYNOTE with optimization techniques is 81·log n + 0.0175 × 12 × N_term_node + 149 + 1829 bytes. We can observe that, compared to document-join systems, KEYNOTE reduces the communication cost of Web searching roughly by a factor of N_term_doc/N_term_node. In our data set, N_term_doc/N_term_node is roughly 13.

Search Latency. The search latency in the document-join system is 2 × T_link_latency + (C_doc-join × 8)/(1.5 × 10^6) seconds, where two latencies for establishing UDP links are incurred: one to establish the UDP link for transmitting the data in the join operation, and the other to build the link for transmitting the

¹ [10] reports that combining these techniques can reduce the communication cost of the join operation to 0.0175 times the original.

top-k results to the query peer. The search latency in KEYNOTE is 3 × T_link_latency + (C_KEYNOTE × 8)/(1.5 × 10^6) seconds, where a total of 3 link latencies are involved: one for performing the join operation, a second for returning the list of selected peers to the query peer, and a third for locating the selected peers.
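The peer-selection formulas of Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration under an assumed data layout (per-peer dictionaries of summed and maximum TF weights); names such as `sum_tf` and `max_tf` mirror the notation of Equations (1)–(3) and are not part of any real API.

```python
def goodness_sum(query, peer, idf):
    # Eq. (1): G_sum(q, P_j) = sum_i q_i * sum_tf_i(P_j) * idf_i
    return sum(w * peer["sum_tf"].get(t, 0.0) * idf[t] for t, w in query.items())

def goodness_max(query, peer, idf):
    # Eq. (3): G_max(q, P_j) = sum_i q_i * max_tf_i(P_j) * idf_i
    return sum(w * peer["max_tf"].get(t, 0.0) * idf[t] for t, w in query.items())

def ps_sum_max(query, peers, idf):
    # Eq. (2): each score is normalized by its network-wide maximum,
    # then the two normalized scores are averaged.
    g_sum = {p: goodness_sum(query, s, idf) for p, s in peers.items()}
    g_max = {p: goodness_max(query, s, idf) for p, s in peers.items()}
    ms, mm = max(g_sum.values()), max(g_max.values())
    return {p: g_sum[p] / (2 * ms) + g_max[p] / (2 * mm) for p in peers}

# Toy example: peer A holds one highly relevant document, peer B many
# moderately relevant ones.
peers = {
    "A": {"sum_tf": {"chord": 1.0}, "max_tf": {"chord": 1.0}},
    "B": {"sum_tf": {"chord": 3.0}, "max_tf": {"chord": 0.5}},
}
scores = ps_sum_max({"chord": 1.0}, peers, {"chord": 1.0})
```

Peers would then be ranked by score and the query forwarded to the top-L of them.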

3 Load Balancing

The load balancing issue includes balancing both the storage load and the query execution load. In KEYNOTE, the storage cost on each node is greatly reduced by our term-node information publication mechanism, so in this paper we focus on balancing the query execution cost. In [12], Dabek et al. propose replicating popular files on all the nodes along the search path; we call this strategy the "all-cache" method. In KEYNOTE, we cache hot keys only on carefully selected nodes, and we call our method "selective cache". Algorithm 1 gives a formal description of our load balancing algorithm, invoked when a node Ni becomes overloaded for a hot key k.

Algorithm 1 ShedLoad(Node Ni, Key k)
1: i = 0
2: if Ni is not the original host for key k then
3:     Let d = (k + 2^m − Ni) mod 2^m
4:     while i < m AND d > 2^i do
5:         i = i + 1
6:     end while
7: end if
8: repeat
9:     Let Nj be the predecessor who hosts the key (k + 2^m − 2^i) mod 2^m
10:    i = i + 1
11: until i ≥ m OR Nj has no copy of the key k
12: Copy key k to Nj
13: if Ni is still overloaded due to frequent access of key k then
14:    ShedLoad(Ni, k)
15: end if
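Under the simplifying assumption of a full Chord ring with 2^m nodes (so the node numerically preceding an identifier is exactly its predecessor, and key k is originally hosted by node k), Algorithm 1 can be sketched in Python as follows; `replicas` and the `overloaded` callback stand in for the system's real replication and load-monitoring machinery.

```python
def shed_load(ni, k, m, replicas, overloaded):
    """Sketch of Algorithm 1 on a full Chord ring of 2**m nodes.

    replicas[k] is the set of node IDs currently holding key k;
    overloaded(node, key) reports whether the node is still overloaded.
    """
    ring = 2 ** m
    i = 0
    if ni != k % ring:                   # Ni is not the original host
        d = (k + ring - ni) % ring
        while i < m and d > 2 ** i:
            i += 1
    while True:                          # repeat ... until
        nj = (k + ring - 2 ** i) % ring  # predecessor at distance 2**i
        i += 1
        if i >= m or nj not in replicas[k]:
            break
    replicas[k].add(nj)                  # copy key k to Nj
    if overloaded(ni, k):                # still hot: shed load again
        shed_load(ni, k, m, replicas, overloaded)

# One shedding step for key 5 on a full ring with m = 4: the copy goes
# to the immediate predecessor, node 4.
replicas = {5: {5}}
shed_load(5, 5, 4, replicas, lambda n, key: False)
```

Each invocation places a copy at the nearest power-of-two predecessor that lacks one, which is what yields the O(log N) bound of Theorem 1.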

Theorem 1. In KEYNOTE, for a given node Ni and key k, our load balancing algorithm makes Ni's load for key k light in O(log N) steps.

For a detailed description of the algorithm and the proof of Theorem 1, please refer to our technical report [13].

4 Experimental Evaluation

We use Reuters news as our evaluation dataset, which contains over 600,000 news articles dated from 08.20.1996 to 08.19.1997. Each document in this data set is stored in XML format and has the tag "location.name" that denotes the source location of

this news article. We divide the Reuters news into different collections based on the tag "location.name". The total number of collections is 8,722, which represents 8,722 nodes in the Chord overlay network. To test the performance of our system on networks of different sizes, we scale the network size from 500 to 8,000. Four types of queries are used to evaluate the search performance of KEYNOTE: one-term, two-term, three-term and four-term. We give each of the four types the same probability of appearing and generate 1,000 queries. All the experimental results presented are obtained by averaging the results of these 1,000 queries. We measure the accuracy of our system by comparing the results returned by our system with the results of centralized searching. For each top-k query, let A be the search result in the centralized setting and B be the result returned by our system. The accuracy is defined as |A ∩ B| / |A|. To test the effectiveness of our load-balancing method, we conduct the experiment on both the full Chord ring and a half-full Chord ring with m = 16. We assume that there is one hot key in the network and that a node receiving more than 100 requests per time unit becomes overloaded. The number of requests for the hot key ranges from 1,000 to 11,000 per time unit, and all requests are evenly distributed over the Chord ring.

Fig. 4 illustrates the top-k search performance of KEYNOTE using our two peer selection methods on a 2000-node network. Fig. 4(a) shows the performance comparison between PS-sum and PS-sum_max for top-10 search. From Fig. 4(a) we can observe that PS-sum_max outperforms PS-sum, since when visiting the same number of nodes PS-sum_max returns more accurate documents than PS-sum does. Similar observations hold in Fig. 4(b) and Fig. 4(c). From Fig. 4, we can see that our peer selection methods are very effective in determining the relevance of each peer to the query.
For top-10 search, we achieve an accuracy of 93% by visiting only 15 nodes on a 2000-node network. Even for top-100 search, we achieve an accuracy of 87% by visiting 35 nodes on a 2000-node network.
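The accuracy measure used in these experiments is simply the fraction of the centralized top-k result that the P2P system recovers; a one-line Python sketch:

```python
def topk_accuracy(centralized, returned):
    """Accuracy = |A ∩ B| / |A|, where A is the centralized top-k list
    and B is the list returned by the P2P system."""
    a, b = set(centralized), set(returned)
    return len(a & b) / len(a)

# Two of the four centralized top-4 documents are recovered: accuracy 0.5.
acc = topk_accuracy(["d1", "d2", "d3", "d4"], ["d2", "d4", "d9"])
```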

(a) Top-10 search

(b) Top-50 search

(c) Top-100 search

Fig. 4. Top-k search performance of KEYNOTE on a 2000-node network

Fig. 5 shows the system performance for different network sizes. In Fig. 5 we fix the accuracy at 90% and measure the number of visited nodes required to achieve this accuracy at different network sizes. In Fig. 5(a), we can see that the number of visited nodes in KEYNOTE does not grow linearly with the network size. For top-100 search, the system needs to visit 19 nodes when the network size is 500, but when the network size is 8,000 (a 16× increase), the system visits only 50 nodes (a 2.5× increase). For top-10 and top-50 search, the number of visited nodes remains almost constant when

the network size grows from 4,000 to 8,000. Fig. 5(b) shows that the percentage of visited nodes relative to the network size decreases as the network grows larger.

(a) Number of visited nodes Vs. network size

(b) Percentage of visited nodes Vs. network size

Fig. 5. The impact of varying network size on system performance

Fig. 6 compares the performance of two load balancing techniques: our "selective cache" (SC) method and the "all cache" (AC) method. We can see that our "selective cache" method uses significantly fewer replications to achieve load balancing than the "all cache" method, on both the full Chord ring and the half-full ring.

(a) Replications on the full chord ring

(b) Replications on the half-full chord ring

Fig. 6. The performance of the load balancing algorithm

To project the performance of our system and of the document-join system for Internet-level search, we scale our evaluation results to 8 × 10^9 Web pages and 6 × 10^7 nodes. Table 1 shows that by sacrificing about 10% accuracy we obtain over 10× reduction in storage and communication cost as well as in search latency, which makes KEYNOTE feasible for Internet-level search.

Table 1. The scaled performance of top-10 search on a 6 × 10^7-node network with 8 × 10^9 pages

Techniques      | Total storage cost | Averaged comm. cost | Latency per query | Accuracy
Document-join   | 7814 GB            | 7.03 MB             | 37.5 seconds      | 100%
KEYNOTE         | 439 GB             | 541 KB              | 3.0 seconds       | 91%

5 Conclusion

In this paper, we presented KEYNOTE, a novel full-text search system on DHTs. By performing the join operation at the node level, KEYNOTE greatly reduces the storage and bandwidth cost. To address the hot-spot problem on the Chord overlay, an efficient load balancing algorithm is introduced. The experimental results validate the performance of KEYNOTE and indicate its feasibility for Internet-level search.

References

1. Li, J., Loo, B.T., Hellerstein, J., Kaashoek, F., Karger, D.R., Morris, R.: On the feasibility of peer-to-peer web indexing and search. In: Proceedings of IPTPS'03. (2003)
2. Reynolds, P., Vahdat, A.: Efficient peer-to-peer keyword searching. In: Proceedings of Middleware'03. (2003)
3. Shi, S., Yang, G., Wang, D., Yu, J., Qu, S., Chen, M.: Making peer-to-peer keyword searching feasible using multi-level partitioning. In: Proceedings of IPTPS'04. (2004)
4. Tang, C., Dwarkadas, S.: Hybrid global-local indexing for efficient peer-to-peer information retrieval. In: Proceedings of NSDI'04. (2004)
5. Stoica, I., Morris, R., Karger, D., Kaashoek, F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup service for internet applications. In: Proceedings of SIGCOMM'01. (2001)
6. Cuenca-Acuna, F., Peery, C., Martin, R., Nguyen, T.: PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities. In: Proceedings of HPDC'03. (2003)
7. Lu, J., Callan, J.: Content-based retrieval in hybrid peer-to-peer networks. In: Proceedings of CIKM'03. (2003)
8. Tang, C., Xu, Z., Dwarkadas, S.: Peer-to-peer information retrieval using self-organizing semantic overlay networks. In: Proceedings of SIGCOMM'03. (2003)
9. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content-addressable network. In: Proceedings of SIGCOMM'01. (2001)
10. Zhong, M., Moore, J., Shen, K., Murphy, A.: An evaluation and comparison of current peer-to-peer full-text keyword search techniques. In: Proceedings of WebDB'05. (2005)
11. Gravano, L., Garcia-Molina, H.: Generalizing GlOSS to vector-space databases and broker hierarchies. In: Proceedings of the 21st VLDB Conference. (1995)
12. Dabek, F., Brunskill, E., Kaashoek, M.F., Karger, D., Morris, R., Stoica, I., Balakrishnan, H.: Building peer-to-peer systems with Chord, a distributed lookup service. In: Proceedings of HotOS-VIII. (2001)
13. Zhang, Z., Zhou, S., Qian, W., Zhou, A.: KEYNOTE: Keyword search using node selection for text retrieval on DHT-based P2P networks. Technical report, Fudan University (2005)