A Peer-to-Peer Recommender System Based on Spontaneous Affinities

GIANCARLO RUFFO and ROSSANO SCHIFANELLA
Università degli Studi di Torino

Network analysis has proved to be very useful in many social and natural sciences, and Small World topologies in particular have been exploited in many application fields. In this article, we focus on P2P file sharing applications, where spontaneous communities of users are studied and analyzed. We define a family of structures that we call “Affinity Networks” (or Graphs) that show self-organized interest-based clusters. Empirical evidence proves that affinity networks are small worlds and show scale-free features. The relevance of this finding is augmented with the introduction of a proactive recommendation scheme, namely DeHinter, that exploits this natural feature. The intuition behind this scheme is that a user would trust her network of “elective affinities” more than anonymous and generic suggestions made by impersonal entities. The accuracy of the recommendation is evaluated by way of a 10-fold cross validation, and a prototype has been implemented for further feedback from the users.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems—Distributed applications; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering; J.4 [Computer Applications]: Social and Behavioral Sciences—Sociology

General Terms: Algorithms, Design, Human Factors, Measurement

Additional Key Words and Phrases: Peer-to-Peer, recommender system, complex networks, social networks, file sharing systems

ACM Reference Format: Ruffo, G. and Schifanella, R. 2009. A peer-to-peer recommender system based on spontaneous affinities. ACM Trans. Intern. Tech. 9, 1, Article 4 (February 2009), 34 pages. DOI = 10.1145/1462159.1462163 http://doi.acm.org/10.1145/1462159.1462163

This work was partially supported by the Italian Ministry for University and Research (MIUR), within the framework of the “PROFILES” project (PRIN). Authors’ address: Computer Science Department, Università degli Studi di Torino, 185 Corso Svizzera, 10149 Torino, Italy; email: {ruffo, schifane}@di.unito.it.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2009 ACM 1533-5399/2009/02-ART4 $5.00 DOI 10.1145/1462159.1462163 http://doi.acm.org/10.1145/1462159.1462163
ACM Transactions on Internet Technology, Vol. 9, No. 1, Article 4, Publication date: February 2009.

1. INTRODUCTION

Peer-to-Peer (P2P) systems have been deeply investigated during the last years, and a number of distributed services (e.g., storage, streaming, file sharing) have

been re-designed in pure decentralized domains. The motivations for this research trend lie in many interesting features of the P2P paradigm, such as the scalability of the applications, load balancing, availability and persistence of the information, and so on. In particular, recent works on reputation management, incentive schemes, distributed pricing, and auditing mechanisms have stimulated new contributions to the idea of emerging decentralized marketplaces. In such a scenario, users trade with each other without being controlled by a third party and without depending on the single point of failure typical of a client-server architecture. In order to speed up the process of designing other services for common users, researchers from industry and academia must deal with the important problem of providing services with a quality comparable to centralized ones. For instance, a lot of effort has been devoted to finding efficient search mechanisms, with the awareness that users who make daily use of Google have high expectations regarding the reliability and efficiency of query results. Moreover, in the most popular P2P file sharing systems, as on the Web, there is an explosive growth of the volume of information: users should be able to make choices without knowing all of the alternatives. In this case, a user expects a personalized assistance service, by way of sentences like: “Customers who bought this CD also bought: The Rolling Stones - Aftermath”; in fact, Recommender Systems (RS) have been introduced for sifting through very large sets of information, selecting the pieces that are relevant for the active user. When an RS is proposed in a decentralized domain, some considerations must be made.
First of all, traditional RSs rely on a central authority that manages the complete knowledge of the domain: the users and the items they purchased (or rated) are linked together (e.g., by means of association rules). The resulting scheme is used to derive user profiles, which are exploited for submitting recommendations to the active users. This is a very difficult task in a decentralized system, where, in principle, we cannot rely on central entities, and where no one has complete knowledge of the domain. Moreover, profiling is strongly dependent on a preliminary tagging system, which associates each item with a set of keywords and/or high-level meta information; for example, an MP3 song can contain the name of the author, the genre (i.e., folk, rock, classical, . . . ), the production year, and so on. In modern P2P file-sharing systems, it cannot be assumed that every user, when putting a file on the network, fills in the file’s meta information fields with correct data. As a matter of fact, the “fake files” phenomenon is very common in popular systems like Gnutella, eMule, and BitTorrent: even the file name is frequently unrelated to the real content of the file. We address the task of exploiting spontaneous partnerships between users to push suggestions to them. As in the real world, in virtual communities too, people can meet each other to talk about their favorite topics. Hence, a user would trust another user if both have many interests in common: they create a “de facto” word-of-mouth mechanism that helps them select an item, before buying or downloading it, from the huge volumes of data available on the Web and in P2P file sharing systems. Social network analysis is a fairly mature field that provides tools and structures for detecting self-organizing

aggregations of users. In this article, we introduce Affinity Networks that can be built on top of a given file sharing application, proving that such networks have a small world topology and a power-law degree distribution. We use these results to develop a novel recommender system that can be implemented and executed “as is” in many file sharing systems, without requiring that the other peers update their software.

1.1 Original Contributions

As the main result of the article, we show that many kinds of affinity networks with small world characteristics can be detected over a P2P file-sharing system. Accordingly, we propose a novel recommender system, called DeHinter, that supports the active user in selecting new files on the basis of the items shared in the self-organizing affinity cluster to which the user is connected. Amongst the various studies related to affinity networks, [Iamnitchi et al. 2004] is the most similar to our work. The authors found small world file sharing communities by means of so-called data-sharing graphs. We address the differences between these two structures, the analyses, and the aims of both lines of research in Section 2. We will point out the differences between DeHinter and other decentralized recommender systems in Section 5. We also underline the original contributions of this article with respect to our previous publication [Ruffo et al. 2006] on the same subject. We extend the affinity network analysis by reinforcing the empirical evidence of the existence of preference relationships between users. We do not restrict the analysis to audio files: video objects are also included (Section 4.2), confirming that small world patterns are independent of the given type of entertainment data. Then, we carry out a scale-free analysis (Section 4.3), in order to identify the presence of affinity hubs, that is, peers with many different interests or simply heavy downloaders. The formula used by the recommender system presented in Ruffo et al.
[2006] is proposed again, but, in this article, we add an empirical evaluation based on a well-known statistical test, namely 10-fold cross validation. We show that, if we reduce the recommendation problem to a classification task, where each item must be labeled as interesting or not interesting, the application of the proposed formula to the files actually possessed by a given peer classifies them as “interesting” with high accuracy.

1.2 Road Map

The rest of the article is organized as follows: Section 2 defines the concept of affinity networks, which will be used extensively from now on. Section 3 introduces some basic concepts of network analysis. Section 4 describes the experiments that we performed to provide evidence that affinity networks show small world patterns and that, after some transitions, they reveal scale-free characteristics. A brief survey of existing recommender systems (with a focus on decentralized approaches) is provided in Section 5, while Section 6 contains the description and evaluation of our recommender system, DeHinter. Finally, a prototype that we have developed for the Phex Gnutella client is presented in Section 7. This section also contains a list of considerations that the designer of a new DeHinter plug-in for any other file-sharing tool should follow.

2. AFFINITY NETWORKS

In order to model our domain in a more formal way, let us assume that a set of users U = {u_1, u_2, . . . , u_n} is sharing a set of items S = {s_1, s_2, . . . , s_l}. We assume a bijection between users and nodes in the system; hence u_i denotes both the ith node and the ith user. We define P(S) as the power set of S, that is, the set of all subsets of S. The function f : U → P(S) maps users to sets of items; in other words, f(u_i) is the set of items user u_i shares. It is clear that $\bigcup_{i=1}^{n} f(u_i) = S$. To take advantage of the power of social relationships, we need to define the concept of affinity among users. For this purpose, we first introduce the affinity function Aff : U^2 → N^+ as:

$$\mathrm{Aff}(u_i, u_j) = \left| f(u_i) \cap f(u_j) \right|. \qquad (1)$$

In other words, the friendship among users is defined as the number of resources they have in common. Now we introduce the idea of an “Affinity Network”, represented by a graph where nodes are users and there exists an edge between users u_i and u_j iff they share at least m files. More formally, we define a family of graphs G^m = (U, E^m) as:

$$e_{ij}^m \in E^m \Leftrightarrow \mathrm{Aff}(u_i, u_j) \geq m. \qquad (2)$$
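As a minimal illustration (our own sketch, not the authors' implementation), the affinity function of Eq. (1) and the construction of G^m from Eq. (2) can be expressed in a few lines of Python, assuming each user's shared files are represented as a set of hash codes:

```python
from itertools import combinations

def affinity(files_i, files_j):
    """Aff(u_i, u_j): number of resources the two users share (Eq. 1)."""
    return len(files_i & files_j)

def affinity_network(shared, m):
    """Build the edge set E^m of G^m (Eq. 2): an edge {u_i, u_j}
    exists iff Aff(u_i, u_j) >= m.

    `shared` maps each user id to the set of file hashes it shares.
    """
    return {frozenset((ui, uj))
            for ui, uj in combinations(shared, 2)
            if affinity(shared[ui], shared[uj]) >= m}

# Toy example: three users with overlapping sets of file hashes.
shared = {"u1": {"a", "b", "c"}, "u2": {"b", "c", "d"}, "u3": {"c", "e"}}
edges = affinity_network(shared, m=2)  # only u1 and u2 share >= 2 files
```

As in the definition, raising m prunes edges, leaving only pairs of users with a larger intersection of shared files.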

Clearly, we could also define more complex affinity functions by taking into consideration other kinds of user-related information as well, such as a high-level description of the peer’s profile or structured metadata regarding the shared resources. It is evident that the topology of the G^m graphs is strongly related to the degree m. When m increases, the network is less connected, since two users have to share more resources in order to be linked together. Such a stronger relationship connects users with an increasing level of likeness, due to a larger intersection of shared files. The definition of affinity networks is in some way related to the concept of data-sharing graphs presented in Leibowitz et al. [2003] and Iamnitchi et al. [2004], with some significant differences. In a data-sharing graph, two users are linked to each other iff they are looking for the same data in a time interval T. The main assumption is that similar requests, that is, queries regarding resources with identical names, can capture common user interests. On the contrary, in an affinity network of degree m, users are connected if they hold and share at least m files, unambiguously identified through their hash codes. Therefore, there are some main differences. First of all, the two structures differ at the data collection level: Leibowitz et al. [2003] focus on traffic generated by KaZaa clients, looking for users that download or search the same items referred to by their file names. It is clear that some interesting phenomena relevant in any file-sharing community are left out. First, the spread of fake files, that is, items whose names do not match their contents, can wrongly relate resources to user likings. Likewise, we have to consider the possibility that a file can have many replicas with different names.
Furthermore, focusing on queries does not take into consideration the impact of the common download-delete pattern, where a user downloads an item and immediately

deletes it on noticing that it is not of interest. To get rid of these effects, we collect Gnutella QueryHit [Klingberg and Manfredi 2002] messages in order to acquire information on the files actually held by the network.1 It is also important to underline that we aim to model persistent phenomena rather than to draw a snapshot of the network depending on temporal constraints. In fact, we make the assumption that, if a user is still sharing a file, it is because either the file was inserted in the network by that user, or it was downloaded earlier by him/her. In the former case, the user is obviously interested in it. In the latter case, the user downloaded the item, and we can assume that he/she is interested in it if it was not deleted immediately after the download. The direct consequence of this assumption is that temporal aspects are outside the scope of this investigation.2 An interesting study of the dynamics of unstructured overlay topologies in modern file-sharing applications can be found in Stutzbach et al. [2005]. Another approach exploiting a similar social paradigm can be found in Pouwelse et al. [2006]: the proposed peer-to-peer file sharing system, called Tribler, is based on the generation and maintenance of social networks in order to improve content discovery, search, and download performance. Moreover, Tribler proposes a recommendation mechanism that takes advantage of the concepts of friends, friends-of-friends, and taste communities. Each peer stores a list of the N most similar users (this idea of similarity between users is somehow related to the affinity function described above) along with their preference lists, that is, a set of their most-rated items. This information is disseminated amongst peers in the network by way of an epidemic protocol.
Even though Tribler proposes a decentralized recommendation algorithm based on standard collaborative filtering techniques comparable to our proposal (Section 6), there are some distinctive differences. In particular, Tribler does not take advantage of the topological properties of affinity networks in order to automatically discover friend peers: through the Tribler interface, a user can just mark a contact as a friend, thereby implementing something similar to a list of favorite users. On the contrary, our idea is to exploit the small world feature of affinity networks in order to derive a set of the most akin friends, by finding user triangles and evidence of dense clusters of spontaneous thematic communities. A very similar intuition is behind the work described in Sripanidkulchai et al. [2003], which presents a way to improve the inefficient flooding mechanism of the Gnutella search protocol by using the power of interest-based localities. Starting from the assumption that peers holding the content we are looking for share similar likings, the authors introduce links between these alike peers, called interest-based shortcuts. Such links speed up the search process by creating a preference overlay on top of the standard Gnutella topology.

1 In the Gnutella protocol, each file is unambiguously identified by its SHA1 hash code.
2 In other words, if a user shows a preference for a given item, then he/she will maintain an interest in it (e.g., if he/she likes the Beatles classic “Yesterday”, then he/she will very likely love that song in the future too).

3. SMALL WORLD EFFECT

The idea that people are connected through “six degrees of separation”, an expression made popular by a play by Guare [1990], stems from a well-known experiment carried out by the psychologist Stanley Milgram [Milgram 1967]. The goal of the experiment was to find short chains of acquaintances connecting people in the United States who did not know each other. For instance, a source person, randomly selected in Nebraska, mails a letter to a target person in Massachusetts. The sender knows some basic information about the recipient, such as their name, address, or occupation. Each node of the chain would forward the letter following a simple first-name-basis criterion, in an effort to transmit the letter as effectively as possible. Milgram estimated a mean number of intermediate steps between five and six, empirically supporting the idea that people live in a small world. As pointed out by Kleinberg [2000a, 2000b], Milgram's results demonstrate not only the existence of short paths in social relationships, but also the ability of people to find them using only local information. Note that there is a big difference between the existence of a path and the ability to discover it algorithmically. Before Milgram's study, the existence of the small-world effect had been hypothesized in a few works, such as a short story by the Hungarian writer Frigyes Karinthy [Karinthy 1929], and a more formal proposal by de Sola Pool and Kochen [1978], available in preprint form before Milgram's studies.
A detailed analysis of Milgram's experiments casts some doubt on their scientific validity [Kleinfeld 2001]: the great majority of the chains were never completed, and the target person was never reached.3 Furthermore, the choice of the letters' recipient, that is, a socially prominent person, seems somewhat unfair, since social and racial differences were not taken into consideration.4 Nonetheless, in the generations since Milgram's experiment, the “six degrees of separation” and small world ideas have entered the popular mindset, fostering a large set of empirical studies on the topology of complex networks spanning all of the sciences, from biology to art and sociology, going through physics, computer science, and electronics. Many scientists have looked into the topology of food webs [Cohen et al. 1986; Williams and Martinez 2000], electrical power grids, cellular and metabolic networks [Kohn 1999; Hartwell et al. 1999; Bhalla and Iyengar 1999; Jeong et al. 2000], the Internet backbone [Faloutsos et al. 1999], the Web [Broder et al. 2000], the neural network of the nematode worm Caenorhabditis elegans [Achacoso and Yamamoto 1991], telephone call graphs [Abello et al. 1998], co-authorship and citation networks of scientists [Newman 2001, 2004; Seglen 1992; Redner 1998], and the web of political and legal decisions [Pagallo 2006]. The importance of understanding the topology of a complex network in depth is clear. In fact, the structure of a network heavily affects its functionality, performance, and effectiveness. For instance, the topology of social networks

3 In an unpublished study by Milgram, only 5% of the letters reached the recipient. Even in his published studies, less than 30% of the folders went through the entire chain.
4 Several studies of social networks sketch a world divided by social class and racial distance, where low-income people are practically disconnected.

influences the spread of information and the relations between human beings. Similarly, in medicine, understanding the diffusion dynamics of a disease may help to devise an effective vaccination plan. Likewise, understanding the structure of the power grid is key to the robustness and stability of the electric lines, helping to avoid blackouts. A well-known example in computer science is the relationship between file popularity on the Web and cache size: a few documents are very popular, but the majority are rarely requested. The design of web cache algorithms follows the principle that the benefit of increasing the size of the cache is not linear [Barford et al. 1999; Breslau et al. 1999]. The search problem in a peer-to-peer system is another significant case study. Adamic et al. [2001] propose a mechanism for probabilistic search in power-law networks where the search process is guided first to nodes with a higher degree, improving the speed of network coverage. Sripanidkulchai et al. [2002] describe a protocol that improves search efficiency by adding a set of shortcuts between peers based on interests. Besides a large number of empirical studies, much effort has been put into the definition of analytical models that capture the nature of small world networks. Watts and Strogatz [1998] and Watts [1999, 1999b] propose a model based on a low-dimensional regular lattice where a small fraction of the edges are rewired, with probability p, by moving one end of the edge to a new location chosen randomly in the lattice, avoiding double or self edges. The original model by Watts and Strogatz was modified independently by Monasson [1999] and by Newman and Watts [1999], removing the constraints on edge rewiring. Another interesting proposal is suggested by Kleinberg [2000a, 2000b].
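The Watts–Strogatz rewiring procedure just described can be sketched in Python (an illustrative sketch of ours, not the original authors' code): starting from a ring lattice where each node is joined to its k nearest neighbors, each edge is rewired with probability p to a randomly chosen new endpoint, avoiding double or self edges.

```python
import random

def watts_strogatz(n, k, p, seed=None):
    """Ring lattice of n nodes, each joined to its k nearest neighbors
    (k even); each edge is then rewired with probability p."""
    rng = random.Random(seed)
    # Regular ring lattice: node i connects to i+1 .. i+k/2 (mod n).
    edges = [(i, (i + j) % n) for i in range(n) for j in range(1, k // 2 + 1)]
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    for u, v in edges:
        if rng.random() < p:
            # Move one end of the edge to a new random node,
            # avoiding self loops and duplicate edges.
            candidates = [w for w in range(n) if w != u and w not in adj[u]]
            if candidates:
                w = rng.choice(candidates)
                adj[u].discard(v); adj[v].discard(u)
                adj[u].add(w); adj[w].add(u)
    return adj

net = watts_strogatz(n=20, k=4, p=0.1, seed=42)
```

Even for small p, the few rewired "long-range" edges sharply reduce the typical path length while the lattice keeps its high clustering, which is what makes the resulting graph a small world.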
This model uses a two-dimensional grid as a base, with long-range random links added between any two nodes u and v with a probability proportional to d^{-2}(u, v), the inverse square of the lattice distance between u and v. In the basic model, each node has an undirected local link to each of its four grid neighbors and one directed long-range random link. In this setting, Kleinberg shows that a simple greedy algorithm using only local information finds routes between any source and destination using only O(log^2 n) expected links. An important contribution to the study and modeling of complex networks is given by Barabási and Albert [Barabási and Albert 1999; Barabási 2003; Albert and Barabási 2002]. Earlier models assume that we start with a fixed number N of vertices that are later connected or rewired randomly, without modifying N. In contrast, most real world networks describe open systems that grow by the continuous addition of new nodes. Starting from a small nucleus of nodes, the number of nodes increases throughout the lifetime of the network by the subsequent addition of new nodes. Moreover, in those earlier models the probability that two nodes are connected (or their connection is rewired) is independent of the nodes' degrees; in fact, the edges are added randomly. Most real networks, however, exhibit preferential attachment: the likelihood of connecting to a node depends on the node's degree. For example, a web page will more likely include hyperlinks to popular documents, because such highly connected items are very well known and, consequently, easy to link to. These two ingredients, growth and preferential attachment, inspired the introduction of the Barabási–Albert model, which led for the first time to a network with a power-law degree distribution.
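The two ingredients of the Barabási–Albert model, growth and preferential attachment, can be sketched as follows (our own illustrative Python, with the common simplification of attaching each new node to m distinct existing nodes chosen with probability proportional to their degree):

```python
import random

def barabasi_albert(n, m, seed=None):
    """Grow a network to n nodes; each new node attaches to m distinct
    existing nodes chosen with probability proportional to degree."""
    rng = random.Random(seed)
    # Start from a small fully connected nucleus of m + 1 nodes.
    adj = {i: {j for j in range(m + 1) if j != i} for i in range(m + 1)}
    # `stubs` holds one entry per edge endpoint, so a uniform draw from
    # it picks a node with probability proportional to its degree.
    stubs = [i for i in adj for _ in adj[i]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # preferential attachment
            targets.add(rng.choice(stubs))
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            stubs += [new, t]
    return adj

g = barabasi_albert(n=200, m=3, seed=1)
```

Because early, well-connected nodes keep accumulating stubs, a few hubs emerge and the degree distribution develops the power-law tail discussed below.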

3.1 Clustering and Transitivity

Short paths connecting most pairs of vertices are only the first distinctive property of small world networks. An evident deviation from the behavior of random graphs can be pointed out in the transitivity property: in many real networks, if a node A is connected to a node B and B is linked to a node C, then there is a high probability that node A is also connected to C. In a social context, the friend of your friends is probably also a friend of yours. Moreover, the transitivity property is central to applying an Affinity Network (see Section 2) in the design of our decentralized recommender system: if a user u_a shows similar likings to user u_b, and u_b shows a high affinity degree with user u_c, then u_a and u_c will almost certainly reveal a strong likeness. In terms of network topology, the transitivity characteristic leads to the presence of a large number of triangles, that is, sets of vertices each connected to the others. In order to quantify the phenomenon, we use the clustering coefficient C, which is much larger in a small world network than in a random network. Watts and Strogatz [1998] define C as the average of a local value:

$$C_i = \frac{\text{number of triangles connected to vertex } i}{\text{number of triples centered on vertex } i}, \qquad (3)$$

where a triple centered on vertex i is composed of i and two other nodes connected to it. Formally, let G = (V, E) be a graph, where V and E are, respectively, the set of vertices and the set of edges between nodes. If we define the set of neighbors of v_i as V_i = {v_j : v_j ∈ V, e_{ij} ∈ E}, then the degree of v_i is d_i = |V_i|, that is, d_i is the number of neighbors of the vertex. Note that D_i, the maximum number of links between neighbors of v_i, can be defined as a function of d_i: if G is a directed graph (i.e., e_{ij} ≠ e_{ji}), then D_i = d_i · (d_i − 1); if G is an undirected graph, D_i = d_i · (d_i − 1)/2. Let E_i = {e_{jk} : v_j, v_k ∈ V_i, e_{jk} ∈ E} be the actual set of edges between neighbors of v_i. Hence, the clustering coefficient of v_i, introduced in Eq. (3), can be rewritten as:

$$C_i = \frac{|E_i|}{D_i}. \qquad (4)$$
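For concreteness, the local coefficient of Eq. (4) for an undirected graph can be computed as in the following illustrative Python sketch (our own, not from the article):

```python
def local_clustering(adj, i):
    """C_i = |E_i| / D_i for an undirected graph given as an adjacency
    dict mapping each vertex to the set of its neighbors (Eq. 4)."""
    neighbors = adj[i]
    d = len(neighbors)
    if d < 2:
        return 0.0  # no links are possible among fewer than 2 neighbors
    # |E_i|: edges actually present between neighbors of i
    # (each edge is seen from both endpoints, hence the // 2).
    links = sum(1 for j in neighbors for k in adj[j] if k in neighbors) // 2
    return links / (d * (d - 1) / 2)  # divide by D_i (undirected case)

# A triangle 0-1-2 plus a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
c0 = local_clustering(adj, 0)  # one edge (1-2) out of 3 possible: C_0 = 1/3
```

Averaging this value over all vertices yields the graph-level coefficient of Watts and Strogatz.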

Observe that C_i = 0 means that the neighbors of v_i are not connected to each other (i.e., E_i = ∅), while C_i = 1 means that the subgraph G_i = (V_i ∪ {v_i}, E_i ∪ {e_{ij} : e_{ij} ∈ E}) is complete. Furthermore, the clustering coefficient of the graph G is defined as:

$$C = \frac{\sum_i C_i}{|V|}. \qquad (5)$$

An alternative, widely adopted definition of the clustering coefficient [Newman 2003] quantifies C as:

$$C = \frac{3 \times \text{number of triangles in the network}}{\text{number of connected triples of vertices}}, \qquad (6)$$

where a connected triple means a trio of nodes in which at least one is connected to both of the others. The factor of three in the numerator is due to the fact that each triangle contributes three connected triples. This definition is equivalent

to the concept used in sociology and known as the “fraction of transitive triples” [Wasserman and Faust 1994]. The value of the clustering coefficient C calculated by Eq. (5) and Eq. (6) is clearly different in many networks. In general, regardless of which definition is used, the property characterizing a small world topology is a value of C considerably higher than for a random graph with the same number of vertices and edges. Our research is based on definition (5) by Watts and Strogatz [1998].

3.2 Power-Law Degree Distribution

In the previous section we defined the degree d_i of a vertex i as the number of edges connected to i. As deeply investigated by Erdős and Rényi [1959, 1960, 1961], in a random graph the presence or absence of each edge has equal probability, shaping the degree distribution as binomial, or Poisson in the limit of large graph size. But the analysis of real networks reveals degree distributions far from Poisson behavior: we can note the presence of a long right tail characterized by a few nodes with very high degree and very many nodes with small degree. Defining P(k) as the probability that a node selected at random has degree k, we observe that P(k) decays as a power law, following P(k) ∼ k^{−γ}, where the constant γ is called the exponent of the power law. In the literature, a network showing a power-law degree distribution is also called a scale-free network, because the power law is the only distribution that is the same “whatever scale we look at it on” [Newman 2005]. The presence of a small fraction of nodes with a much higher degree than the average introduces the concept of hub. A hub can be imagined as a highly connected node able to link a large number of nodes to each other by way of short paths. Barabási et al. [Barabási and Albert 1999; Barabási 2003; Albert and Barabási 2002] explain the presence of hubs in some domains with the preferential attachment phenomenon: new nodes connect with higher probability to more connected nodes. Power-law distributions occur in an extraordinarily diverse range of phenomena: the size of earthquakes [Gutenberg and Richter 1944], moon craters [Neukum and Ivanov 1994], solar flares [Lu and Hamilton 1991], computer files [Crovella and Bestavros 1997] and wars [Roberts and Turcotte 1998], the frequency of use of words in any human language [Estoup 1916], the frequency of occurrence of personal names in most cultures [Zanette and Manrubia 2001], the number of papers scientists write [Lotka 1926], the number of citations received by papers [de Solla Price 1967], the number of hits on web pages [Adamic and Huberman 2000], the sales of books, music songs and other branded commodities [Kohli and Sah 2003; Cox et al. 1995], or the number of species in biological taxa [Willis and Yule 1922].

3.3 Network Resilience

Scale-free topologies also affect the resilience of a network. In general, many networks rely upon their connectivity, that is, the presence of paths between pairs of nodes, in order to perform their functions. The removal of vertices can increase the typical length of these paths as well as the number of disconnected

components. The deletion process can be performed by removing nodes randomly from the network or by targeting some specific classes of vertices. The choice produces very different consequences. Let us define the resilience of a network as its vulnerability to node deletion. Albert et al. [2000] claim that the Internet and the WWW are highly resilient to the random removal of nodes, but highly vulnerable to deliberate attacks on the nodes with the highest degree. Generally speaking, in a power-law network the distance between nodes is almost completely unaffected by random vertex removal. On the contrary, when removal is targeted at the highest-degree vertices, the effects become devastating.

3.4 Plotting and Analyzing Power-Law Distributions

In general, the standard strategy to identify a power-law distribution is to show that the histogram of the quantity analyzed, plotted on a log-log scale, appears as a straight line. However, in several cases, characterizing the tail of the distribution can be extremely tricky, due to the lack of enough measurements or because direct histograms usually look rather noisy. To counter these effects, we can construct a plot in which the bin sizes increase exponentially with the degree, mitigating a problem that nonetheless remains. An alternative way to present node degrees is to plot the cumulative distribution function P(k), which represents the probability that the degree is greater than or equal to k. The histogram of P(k) plotted on a log-log scale still shows a straight line, but with a shallower slope with respect to the plot of the raw data. Since the cumulative distribution function removes the binning drawbacks and represents the entire dataset, we take advantage of this technique in our study.
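As an illustrative Python sketch (our own, assuming the observed degrees are available as a list of integers), the cumulative distribution just described, together with the standard maximum-likelihood estimate of the power-law exponent from Newman [2005], can be computed as follows:

```python
import math
from collections import Counter

def ccdf(degrees):
    """P(k): probability that a randomly chosen node has degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    tail, result = n, {}
    for k in sorted(counts):
        result[k] = tail / n      # fraction of nodes with degree >= k
        tail -= counts[k]
    return result

def powerlaw_exponent(degrees, kmin=1):
    """MLE of gamma = 1 + n / sum(ln(x_i / kmin)) over the x_i >= kmin,
    with statistical error (gamma - 1) / sqrt(n)."""
    xs = [x for x in degrees if x >= kmin]
    n = len(xs)
    s = sum(math.log(x / kmin) for x in xs)
    gamma = 1 + n / s
    return gamma, (gamma - 1) / math.sqrt(n)
```

Plotting `ccdf` on a log-log scale gives the straight line with shallower slope mentioned above, and `powerlaw_exponent` estimates γ directly from the data without any binning.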
Finally, we present a simple and reliable method to calculate the exponent γ of the power law by employing the following formula [Newman 2005]:

γ = 1 + n [ Σ_{i=1}^{n} ln(x_i / x_min) ]^{−1},    (7)

where the values x_i represent the collected degrees and x_min is the lower bound from which the law holds. Furthermore, we can compute an estimate of the statistical error α on (7) by means of the relation

α = √n [ Σ_{i=1}^{n} ln(x_i / x_min) ]^{−1} = (γ − 1) / √n.    (8)

4. DATA COLLECTION AND NETWORK ANALYSIS

In Section 2, we introduced the concept of affinity network as a web of likeness among users. We assume that these networks show a small-world topology, characterized by short paths between randomly chosen pairs of nodes and a high clustering coefficient. This idea drove us to propose a decentralized recommendation scheme taking advantage of these self-organizing relationships between people [Ruffo et al. 2006].

ACM Transactions on Internet Technology, Vol. 9, No. 1, Article 4, Publication date: February 2009.
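A direct implementation of the estimator (7) and of the error (8) takes only a few lines; the sketch below is illustrative (Python, with synthetic degrees drawn for an exponent γ = 3; strictly speaking the formulas hold for continuous data):

```python
import math
import random

def powerlaw_fit(xs, xmin):
    """Maximum-likelihood estimate of gamma (Eq. 7) and its error alpha (Eq. 8)."""
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    gamma = 1 + n / sum(math.log(x / xmin) for x in tail)
    alpha = (gamma - 1) / math.sqrt(n)
    return gamma, alpha

# Synthetic sample with gamma = 3 via inverse-transform sampling:
# x = xmin * (1 - u)^(-1/(gamma - 1)) for uniform u in [0, 1).
random.seed(0)
xs = [(1 - random.random()) ** (-1 / 2.0) for _ in range(10000)]
gamma, alpha = powerlaw_fit(xs, 1.0)
print(f"gamma = {gamma:.2f} +/- {alpha:.2f}")  # close to 3
```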

A Peer-to-Peer Recommender System




In order to verify the above idea, we investigated the Gnutella file sharing network looking for empirical evidence of small-world patterns. Some interesting questions arise:

(1) How can we monitor (a portion of) the resources shared within the Gnutella network?
(2) How can we identify (a subset of) the files owned by a user in order to create the family of Gm graphs?

A short premise concerning the modern Gnutella architecture is useful to answer such questions. The basic structure consists of a two-tier overlay, where a set of interconnected ultrapeers forms the top-level overlay to which a large group of leaves are connected. Leaves never forward messages: they send queries to the ultrapeers and wait for a set of QueryHits matching the search criteria. An ultrapeer acts as a proxy to the Gnutella network for the leaves connected to it. Ultrapeers are connected to each other and to regular Gnutella hosts. QueryHit messages return to the querying user by reverse path forwarding; this ensures that only those servents that routed the Query message will get the returning QueryHit message. Therefore, an ultrapeer receives all the QueryHit messages addressed to its leaves. Since QueryHit messages contain information about the files, stored at the answering peers, that match the search criteria, they are a precious source that allows us to identify who shares what (Question 2). Furthermore, the two-tier architecture gives us the possibility of collecting QueryHits by means of passive monitoring of the Gnutella traffic that transits through an ultrapeer node (Question 1). In fact, an ultrapeer receives both the QueryHit replies from the leaf peers it is connected to and (part of) the traffic that the top-level ultrapeers forward to each other. Of course, we see only the tip of the iceberg, since the information extracted from QueryHit messages represents a small fraction of the overall resources shared by a peer.
In fact, we collected data about the most searched files, since QueryHits enclose replies regarding items that users are looking for. We are strongly confident that the whole picture, that is, a complete view of what users share, would strengthen the intuition that affinity networks show a small-world topology. Instead of implementing a Gnutella crawler from scratch, we modified the open-source client Phex [Phex Team], a pure Java, multi-platform file sharing application with a multi-source download feature, able to perform effective passive searching and snooping for files. This adapted client is forced to enter the network in ultrapeer mode, collecting and storing all the QueryHit messages it forwards. The crawler ran for seven days, from 19 October to 26 October 2005, within our department laboratories. As summarized in Table I, the traces collected are composed of more than 3 million search replies, generated by a community of about 283,000 clients that advertise more than 900,000 distinct files. In order to create the affinity graphs, we need a set of pairs in the form (ui, fi), meaning that user ui shares the resource fi. An interesting point is to choose criteria able to unambiguously identify both users and files in a Gnutella network. Instead of exploiting the name of a file, in our work we took advantage of the SHA1 hash codes, which bind identifiers to the content rather than to the name of a resource. In fact, hash codes can mitigate the phenomenon of fake files, that is, resources whose name does not match their content. Furthermore, SHA1 avoids the problem of identical items




Table I. Data Collected by the Gnutella Ultrapeer Crawler from 19 October to 26 October 2005

CHARACTERISTICS OF TRACES COLLECTED
Time Interval                         7 days

WHOLE DATASET
# IP (unique)                         283,431
# GUID (unique)                       470,333
# Files (distinct SHA1 hashes)        944,758
# (IP, SHA1) pairs                    3,092,794

WITHOUT PRIVATE CLASS ADDRESSES
# IP (unique)                         278,281
# GUID (unique)                       422,726
# Files (distinct SHA1 hashes)        714,640
# (IP, SHA1) pairs                    2,261,396

shared with different file names. We note that the user identification process is more troublesome. The Phex client uses a Global Unique IDentifier (GUID) to identify a Gnutella node: it is generated randomly each time a user session starts, according to an application-specific format. Since this code changes with every running instance, a user that quits and then starts a new session receives a different identity. On the other hand, the IP address can represent another feasible solution; in fact, our model binds each IP address to a distinct user. However, it is possible that the same IP address corresponds to different users, for example, in the case of shared workstations or the presence of a NAT/proxy. A private network environment provides a concrete example of this effect: let us suppose that the IP 192.168.1.10 publishes a set of resources R. We could relate the IP 192.168.1.10 to a particular user u and wrongly assert that u shares the files belonging to R. In fact, many distinct users in different networks can obtain this address, so that the QueryHit content cannot distinguish between these users. The effect is the presence of distinct IPs that seem to share large sets of files, affecting the fairness of the affinity graphs.5 Notice that the opposite phenomenon can be observed as well. For example, in a DHCP-based network the same user can obtain different IP addresses in distinct sessions. Therefore, a set of resources R that effectively belongs to a user u can be seen as the sum of the items shared by many users. Obviously, this phenomenon can smooth the hub behavior of the user u. However, we think that this effect does not impact our study, due to the relatively short time of trace collection. To get rid of the above problems, we filtered out all the IP addresses belonging to the private network class specification.6 Table I describes the characteristics of the dataset without the resources coming from such address classes.
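The address filtering just described can be reproduced with Python's standard ipaddress module; a minimal sketch (the sample pairs are invented for illustration):

```python
import ipaddress

# RFC 1918 private ranges [Rekhter et al. 1996] filtered out of the traces.
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_private(ip):
    """True when the address falls in one of the filtered ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_NETS)

def filter_pairs(pairs):
    """Drop (IP, SHA1) pairs whose address is private."""
    return [(ip, sha1) for ip, sha1 in pairs if not is_private(ip)]

pairs = [("192.168.1.10", "sha1-a"), ("130.192.1.5", "sha1-b")]
print(filter_pairs(pairs))  # only the public address survives
```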
A first analysis of the gathered data regards the file popularity distribution observed in our Gnutella snapshots. Figure 1 reveals that it follows a Zipf's law,

5 Indeed, these IPs behave like hubs, so they should amplify the small-world properties shown by the preference graphs.
6 We filtered out the following sets of IP addresses [Rekhter et al. 1996]: 10.x.x.x, 192.168.x.x, and the range from 172.16.0.0 to 172.31.255.255.





Fig. 1. Cumulative distribution of file popularity, plotted on a log-log scale, following a Zipf's law.

Fig. 2. Cumulative distribution of node degree, that is, the number of distinct files shared by a peer, plotted on a log-log scale.

as already observed in Iamnitchi et al. [2004]. In fact, we find a few very popular files along with a very large set of resources shared by only one or two individuals. Moreover, we investigated the distribution of the number of files shared by peers in the network. We observed that it follows a power law (Figure 2) characterized by an exponent γ = 3.17 and an error α = ±0.04. The key consequence is the proof of the existence of hub peers, namely users that share a large number of items and play a significant role in providing connectivity, short paths between pairs of peers, and a high clustering factor. Afterward, we focused our attention on the composition of the file types shared by users in the Gnutella network. Our goal is to understand which kinds of resources are most popular in this file sharing community, in order to perform a focused analysis of user preferences in a specific field, also reducing the complexity of the graph generation process. Table II shows the composition of resource types in relation to eight different categories: Image, Executable, Video, Audio, Archive formats, Documents in general, Web, and Other resources not easily classifiable. Table II





Table II. List of Extensions Belonging to the Different File Type Categories

FILE TYPE    FILE EXTENSIONS                                                          RATE (%)
Image        jpeg, gif, png, jpg, tiff, tif, bmp, ico                                    2.7
Executable   exe, bin, bat, dll, h, ini                                                  3.2
Video        avi, mpeg, wmv, asx, mov, wm, asf, scb, mpg, mp4, vob, rmvb,
             ogm, mpe, divx                                                             27.8
Audio        mp3, wav, mid, ogg, aac, m4a, aiff, wma, kar, m3u, aif, rm, rmj            58.0
Archive      zip, rar, gz, jar, iso, cdr, nrg, 7z, ace, tar, ccd, cue, img, cbr, vcd     6.0
Document     pdf, doc, txt, rtf, info, nfo, hlp, eml, awk, ps, ttf                       1.0
Web          html, htm, php, css, xml, swf, fla, url, lnk                                0.8
Others       n.a.                                                                        0.5

Table III. Average Shortest Path Length L and Clustering Coefficient C for the Affinity Networks Generated Focusing on Audio Files

                           AUDIO FILES
                     Affinity Graph        Random Graph
Gm    # Nodes   # Edges    L      C        Lrand     Crand
G2    22777     428931     3.29   0.43      3.418    0.0017
G3     9807      81088     3.53   0.37      4.351    0.0017
G4     4779      23378     3.68   0.35      5.336    0.0020
G5     2612       8519     3.81   0.33      6.655    0.0025
G6     1501       3617     3.93   0.28      8.316    0.0032
G7      891       1780     4.12   0.27      9.815    0.0045
G8      591        990     4.4    0.25     12.371    0.0057
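The two statistics reported in Table III can be computed from an adjacency representation with one BFS per node and a count of linked neighbor pairs; a minimal standard-library sketch (the 4-clique is a toy example, not our dataset):

```python
from collections import deque
from itertools import combinations

def avg_shortest_path(adj):
    """L: mean BFS distance over all connected ordered pairs of nodes."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(adj):
    """C: average over nodes of the fraction of neighbor pairs that are linked."""
    acc = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            acc += 2 * links / (k * (k - 1))
    return acc / len(adj)

# Toy check: in a 4-clique every pair is adjacent, so L = 1 and C = 1.
clique = {i: {j for j in range(4) if j != i} for i in range(4)}
print(avg_shortest_path(clique), clustering_coefficient(clique))  # → 1.0 1.0
```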

also shows the set of file extensions mapped to each category. As expected, the majority of the shared files are mp3 songs or other audio formats (58% of the overall resources). In addition, we noticed that video contents represent the second largest set of items (27.8%). Therefore, as described in the next sections, we generated the affinity graphs Gm from these two most popular categories in order to point out the hypothesized small-world behavior.

4.1 Analysis of Audio Files

In our evaluation we generated several affinity graphs, from G2 to G8, in order to investigate whether they show a small-world topology and whether this behavior is maintained as the value of m increases. Following the definition of small-world network given in Section 3, for each graph we computed: (i) the average shortest path length (L), that is, the average length of all the shortest paths from or to the vertices in the network, and (ii) the clustering coefficient (C). We compared both L and C with the same metrics estimated on random graphs with an identical number of nodes and edges. Table III shows that all the Gm graphs exhibit small-world topologies: in fact, for all the affinity networks, we have that C ≫ Crand and L ≈ Lrand (quite interestingly, it always happens that L < Lrand). The presence of a small-world pattern in the affinity graphs Gm depicts the Gnutella network as a set of strongly interconnected clusters, representing





Table IV. Example of C and L for Several Real Networks

Network                        L       C        Reference
WWW, site level, undir.        3.1     0.1078   Adamic, 1999
Movie actors                   3.65    0.79     Watts and Strogatz, 1998
LANL co-authorship             5.9     0.43     Newman, 2001a, 2001b, 2001c
MEDLINE co-authorship          4.6     0.0666   Newman, 2001a, 2001b, 2001c
SPIRES co-authorship           4.0     0.726    Newman, 2001a, 2001b, 2001c
NCSTRL co-authorship           9.7     0.496    Newman, 2001a, 2001b, 2001c
Math. co-authorship            9.5     0.59     Barabási et al., 2001
Neurosci. co-authorship        6       0.76     Barabási et al., 2001
E. coli, substrate graph       2.9     0.32     Wagner and Fell, 2000
E. coli, reaction graph        2.62    0.59     Wagner and Fell, 2000
Ythan estuary food web         2.43    0.22     Montoya and Solé, 2000
Silwood Park food web          3.40    0.15     Montoya and Solé, 2000
Words, co-occurrence           2.67    0.437    Ferrer i Cancho and Solé, 2001
Words, synonyms                4.5     0.7      Yook et al., 2001b
Power grid                     18.7    0.08     Watts and Strogatz, 1998
C. elegans                     2.65    0.28     Watts and Strogatz, 1998

Table V. Average Shortest Path Length L and Clustering Coefficient C for the Affinity Networks Generated Focusing on Video Files

                           VIDEO FILES
                     Affinity Graph        Random Graph
Gm    # Nodes   # Edges    L      C        Lrand     Crand
G2     7553     184230     3.11   0.53      2.796    0.0065
G3     3560      37984     3.10   0.483     3.454    0.0060
G4     1948      11693     3.03   0.49      4.226    0.0062
G5     1150       4719     3.12   0.47      4.992    0.0071
G6      705       2298     3.29   0.46      5.550    0.0093
G7      459       1305     3.13   0.51      5.866    0.0124
G8      330        834     2.37   0.48      6.255    0.0154

spontaneous thematic communities of users sharing similar files. Such communities are linked by way of a small fraction of peers, the hubs, that belong to multiple thematic groups, since they hold a huge amount of resources of all kinds. Table IV reports the values of L and C estimated in several known domains showing small-world phenomena [Albert and Barabasi 2002]. Such values are similar to those found in our experiments. This provides further evidence that the affinity networks Gm in the field of music preferences are small worlds.

4.2 Analysis of Video Files

The results of the previous section pose an important question: is the small-world property related to the thematic sphere investigated, or is it a more general structural property of self-organizing preference communities? We focused on video contents and generated all the graphs from G2 to G8. As reported in Table V, the relations C ≫ Crand and L ≈ Lrand hold for all of the graphs, clearly showing small-world characteristics. This outcome





provides further evidence that the small-world phenomenon does not depend on a particular field of interest. The main consequence is that our recommendation scheme (see Section 6) is free from thematic constraints and can be applied in different domains.

4.3 Are Affinity Networks Scale-Free?

In Section 3.2, we observed that in many real networks the probability P(k) that a randomly chosen node has degree k follows a power law, that is, P(k) ∼ k^−γ. These networks are called scale-free. After providing experimental evidence that affinity networks are small worlds, we investigated the distribution of node degrees in order to understand whether the graphs Gm are scale-free and whether this property remains unchanged for different values of m. In the evaluation presented in this section, we focus on video files; however, we obtained similar results in the domain of audio items. Figure 3 shows the cumulative distribution of node degrees calculated for the affinity graphs from G2 to G8. For each graph, the exponent γ and the error α have been computed with formulas (7) and (8), respectively. In Figure 3, we see that as the graph's degree, that is, the value of m, grows, the scale-free tendency of the affinity networks becomes more evident. In other words, we move from an affinity network with m = 2, where any pair of users are linked together if they share two or more files, to a graph with m = 8, where users must have eight or more files in common to remain connected to each other. Generally speaking, a graph Gm gets to the next level m + 1 by removing those links e^m_ij such that Aff(ui, uj) = m.
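The level-by-level construction of the Gm family can be sketched as follows (a minimal example; user names and file identifiers are invented):

```python
from itertools import combinations

def affinity_graphs(shared, max_m):
    """Edge sets of G_2 .. G_max_m from a mapping user -> set of file hashes:
    (u_i, u_j) is an edge of G_m when the users have at least m files in common."""
    weights = {}
    for ui, uj in combinations(sorted(shared), 2):
        w = len(shared[ui] & shared[uj])  # Aff(u_i, u_j)
        if w >= 2:
            weights[(ui, uj)] = w
    # G_{m+1} is obtained from G_m by dropping edges with Aff(u_i, u_j) = m.
    return {m: {e for e, w in weights.items() if w >= m}
            for m in range(2, max_m + 1)}

shared = {"a": {1, 2, 3, 4}, "b": {2, 3, 4}, "c": {3, 4, 5, 6}, "d": {9}}
graphs = affinity_graphs(shared, 3)
print(sorted(graphs[2]))  # → [('a', 'b'), ('a', 'c'), ('b', 'c')]
print(sorted(graphs[3]))  # → [('a', 'b')]
```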
It is worth noting that the topology of the affinity graphs when 4 ≤ m ≤ 5 shifts towards a less connected network: the overall number of connected nodes decreases, the clustering coefficient and the average shortest path remain stable (see Table V), but a few peers show evident hub characteristics: the remaining nodes in the graphs with m ≥ 5 stay connected to the network by means of these hubs. The existence of hubs in an affinity network can be explained if we assume that affinities are observed between users with similar tastes. Most users restrict their interests to quite limited fields (e.g., a specific music genre, or popular hits), and a few others range over different topics. Of course, hubs can also simply behave as "hyperphagic" users, who download everything they find. Because of their high-capacity hard disks, such users do not remove uninteresting files for days, and this can weaken the results of a superficial network analysis. In general, hubs reduce the diameter of a small-world network: they can be contacted in order to access the affinity clusters of different spontaneous communities. The presence of hubs has side effects in the recommendation scenario as well, as we will discuss in Section 6.4.

5. RELATED WORK ON RECOMMENDER SYSTEMS

Recommender systems are often suggested as an effective technique to cope with the problem of information overload in a wide range of domains. In real life, we frequently face a choice without having direct experience of the feasible alternatives [Resnick and Varian 1997]. For instance,



Fig. 3. Cumulative distribution of node degree of the affinity graphs Gm for the video file type. The estimated exponents and errors are: G2: γ = 1.412, α = ±0.004; G3: γ = 1.55, α = ±0.0092; G4: γ = 1.71, α = ±0.016; G5: γ = 1.88, α = ±0.026; G6: γ = 1.97, α = ±0.036; G7: γ = 1.99, α = ±0.045; G8: γ = 2.05, α = ±0.057.





consider a user surfing the Web looking for information regarding popular heavy metal bands. The amount of available data is impressive: thousands of different web pages, documents, albums, concerts, and any kind of related information. Of course, it is impossible for the user to make a choice with complete knowledge of all the possible alternatives. In such a context, it is evident that we need some sort of personalized advice able to aid the user in finding useful information. Recommender systems are devoted to playing this role. Before the recommender systems era, the information overload problem was addressed with pattern-matching techniques based on data indexing, retrieval, searching, and filtering [Pinkerton 2000; Howe and Dreilinger 1997; Yan and Garcia-Molina 1995]. The problem with these approaches is that they simply find a match between a query string and the content of the documents, without any consideration of the user's preferences. Therefore, the results of a typical search are not personalized to individual users, often providing too much irrelevant information or, conversely, missing relevant items.

5.1 Centralized Approaches

To date, the most relevant proposals in the recommendation area are the content-based, collaborative filtering, and demographic approaches. In content-based recommender systems, the items are analyzed in order to assess whether they are similar to previously rated items. The similarity amongst resources is evaluated by way of an objective and automatic analysis of the content, identifying some distinguishing features and comparing them to the user's profile. Such an approach is widely used in the field of information items. For example, the Syskill & Webert recommender system [Pazzani et al. 1996] proposes a mechanism to suggest Web documents based on textual analysis. Another interesting approach is NewsWeeder [Lang 1995], a netnews-filtering system.
However, a content-based technique is subject to a severe drawback: the items are analyzed in an objective way, whereas users often perceive the content in a subjective way. Different people looking for the same resource can evaluate it in completely different ways depending on personal preferences. Furthermore, this approach can work only with content that is in a machine-parsable format, like documents, web pages, or e-mails. In our study, we exploit the properties of the affinity networks generated by video or audio files to push recommendations to the user; therefore, the content-based paradigm cannot be applied at all. The collaborative filtering approach is based on the comparison of users' profiles rather than of item contents. A typical user description is composed of an array of items and ratings. The system uses such information to estimate a similarity value between users and recommends resources liked by people with similar preferences. In some cases, the ratings are simply binary, that is, a user likes or dislikes the item. More often, it is possible to define a range of values expressing the degree of the user's preference. Collaborative filtering recommenders have been implemented in a wide range of domains and applications. Tapestry [Goldberg et al. 1992] is the first recommender system using this technique in order to filter out in newsgroups





the documents that are off topic, or to select interesting contents within a multitude of posts that is impossible to read entirely; Ringo [Shardanand and Maes 1995] generates suggestions on music albums and artists based on a social filtering mechanism that automates the process of "word of mouth"; GroupLens [Konstan et al. 1997] helps users of Usenet, a high-volume, high-turnover discussion list service on the Internet, to find interesting articles; finally, MEMOIR [DeRoure et al. 2001] focuses on finding people with similar tastes and interests. One of the most relevant features of collaborative techniques is that they introduce a subjective evaluation of items without forcing the system to represent data in a machine-parsable form. Thus, collaborative filtering recommenders are completely independent of content representation problems, working well both for complex objects, like music songs and movies, and for classical textual documents or Web pages. The applicability to multimedia content and the subjective character of the suggestions make this approach well suited to our proposal. In any case, collaborative filtering approaches show some drawbacks:

—Sparsity Problem. The effectiveness of the recommendation engine is strongly related to the population of users in the system [Terveen and Hill 2001]. When few users have rated the same items, the collaborative filtering mechanism does not provide useful suggestions. In order to reach better performance, it is necessary to achieve a critical mass of participants.
—Cold Start Problem. When a new user enters the system, it is impossible to evaluate a degree of similarity with other people [Resnick and Varian 1997]. To get rid of this effect, a training period is necessary to refine the user's profile so that it accurately reflects his preferences.
In general, a user that exhibits only a slight intersection with other users' characterizations could receive bad suggestions due to an inherent lack of information, that is, in this context, ratings on items.
—Early-Rated Problem. A similar issue arises when a new item is added to the database [Montaner et al. 2003]. In such a case, any suggestion is impossible until a sufficient number of users have rated the document.

Another class of recommender systems exploits a demographic approach. Starting from a description of a user based on age, gender, profession, and other indicative features, each item is related to the people who are most likely to appreciate it [Krulwich 1997]. Generally speaking, each user is classified into a stereotype, as proposed in the Grundy system [Rich 1979], and an item is recommended to people with similar demographic profiles. For instance, it is more likely for a Ferrari sports car to be suggested to a rich businessman than to a Ph.D. student. Even if it is evident that the definition of stereotypes is a challenging task, in some cases they produce good quality recommendations. However, such an approach is not suitable when a user's preferences change over time, since the demographic mechanism is not able to adapt the profiles. As described above, each approach shows benefits and shortcomings; therefore, it is clear that no technique can be used effectively in all domains and for all





kinds of users. For all the above reasons, one common thread in recommender system research is to combine different approaches into a hybrid technique, in order to gather the advantages of each proposal. For example, the content-based and collaborative filtering approaches are normally integrated together. Fab [Balabanović and Shoham 1997] maintains user profiles based on content analysis, and compares them in order to identify similar users for collaborative recommendation. Other hybrid proposals can be found in Sarwar et al. [1998], Pazzani [1999], Popescul et al. [2001], and Claypool et al. [1999]. An interesting approach is proposed in Wei et al. [2005], which delineates a marketplace in which the various recommendation methods compete to offer their recommendations to the user. To conclude, we observe that one of the major shortcomings of recommender systems is their lack of transparency, that is, users would like to understand why specific items were recommended to them [Herlocker et al. 2000]. Herlocker et al. [2000] propose a mechanism to provide an explanation of the suggested resources, so that the users can understand the reasoning behind the recommendation process.

5.2 Decentralized Approaches

One common characteristic of the recommender systems described in the previous section is the use of a centralized client-server architecture. Focusing on the collaborative filtering approach, the information about items and ratings is stored in a central database that contains a complete knowledge of the domain. In other words, a recommender system usually creates a matrix in which rows are users and columns contain the votes concerning the evaluation of items. As a consequence, each vector represents a customer profile used to compute the correlation between users and to form good suggestions.
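This user-to-user correlation step can be sketched with a cosine similarity over sparse rating vectors (a minimal, illustrative example; the user names and votes are invented, and real systems often use Pearson correlation instead):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> vote)."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def recommend(target, profiles, k=1):
    """Suggest items rated by the k most similar users but unseen by target."""
    ranked = sorted(profiles.items(),
                    key=lambda p: cosine(target, p[1]), reverse=True)
    items = {}
    for _, profile in ranked[:k]:
        for item, vote in profile.items():
            if item not in target:
                items[item] = max(items.get(item, 0), vote)
    return sorted(items, key=items.get, reverse=True)

alice = {"song1": 5, "song2": 3}
others = {"bob": {"song1": 5, "song2": 4, "song3": 5},
          "carol": {"song9": 2}}
print(recommend(alice, others))  # → ['song3']
```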
For instance, the book store Amazon implements a popular centralized recommender system with complete knowledge of the books on sale and a central repository of user activities and profiles. In a decentralized environment, such as the peer-to-peer file sharing domain, these conditions are not achievable at all. On the one hand, there is no complete knowledge of the shared data: each peer holds a partial view resulting from the interaction with its neighbors and from the information included in query replies. Furthermore, since peers can share whatever they want, the content is not regulated, fostering phenomena such as fake files or identical resources with multiple names. On the other hand, the lack of a central entity means there is no full repository of user profiles. Usually, each peer holds the ratings concerning the items it has directly experienced and, by means of an exchange protocol, this information is routed to other peers in the network. It is evident that the recommendation is generated locally at the peer side by way of the partial data available. A first attempt to deal with a decentralized environment is proposed in Tveit [2001], which describes a recommender system able to suggest products and services in a marketplace populated by mobile customers. The main idea behind this approach is to translate the recommendation task into a search problem performed on a pure peer-to-peer topology like Gnutella. The queries





are propagated by way of the standard flooding mechanism, but instead of a set of keywords, the queries contain the rating vector of the user. Therefore, using a broadcast approach to spread votes can raise serious scalability and efficiency issues. In a mobile ad-hoc scenario where devices exchange information with users in their proximity, without accessing any remote online directory service, Schifanella et al. [2008] propose epidemic collaborative strategies to spread ratings over self-organized communities of users. The authors show that the proposed approach converges very quickly to the prediction accuracy measured in a domain with complete knowledge, with a sustainable overhead. A different approach, presented in Wang et al. [2005, 2006], introduces a probabilistic relevance model based on the concept of buddy tables. Related to an item and stored locally with it, a buddy table captures the similarity degree with respect to the other resources in the domain. The buddy table of an item is updated each time the resource is downloaded, reducing the communication burden due to the spreading of ratings. In order to build the users' profiles, they employ the list of previously downloaded items, which provides positive evidence of the user's interests. Other approaches have been proposed in the literature: the PipeCF scheme uses a Distributed Hash Table in order to spread rating information [Xie et al. 2004; Han et al. 2004]; the PocketLens project [Miller et al. 2004] and the Vineyard system [Oka et al. 2004] run a set of independent recommenders, each one dedicated to a specific domain. Compared with the above proposals, our recommendation scheme shows some relevant differences:

—we do not employ any form of user profile description or rating vectors, thereby removing the burden of spreading this information over the network. We suppose that if a user shares a file, then he/she has some interest in it.
In order to generate the affinity networks, we need to know at least the set of files shared by a given peer, an operation that is supported (even if often not automatically) by all file-sharing clients. Therefore, our recommender system can be easily implemented and employed without restrictions in any real file sharing community (such as Gnutella, eMule, and so on), regardless of any topological or structural issues.
—the system is completely self-organizing and autonomous.
—the recommendation engine is completely transparent: the suggested items are pushed to the user, who just has to use the system.

6. A DECENTRALIZED RECOMMENDATION SCHEME BASED ON SPONTANEOUS AFFINITIES

Starting from the study of the small-world properties of affinity networks and the theory concerning collaborative filtering recommender systems in a decentralized environment, here we focus on the highly informative power of self-organizing interest-based communities: users in the same cluster share a subset of common items and are probably interested in other files popular in


Fig. 4. An example of partners and friends of ui .

the cluster. The transitivity property may be used to enable reserved information lanes between users, in order to suggest items that are potentially of interest to members of the same cluster. In order to define a decentralized recommendation scheme exploiting these concepts, let us introduce the notation that we will use in the rest of the section. Given ui, uj ∈ U, we define the set of partners of ui as:

   F0(ui) = {uj : |f(ui) ∩ f(uj)| ≥ 1}.   (9)

Roughly speaking, the node of user ui maintains a list of the other users that share at least one file with him/her. It is clear that if a user downloads a resource, all the candidate sources can be tagged as ui's contacts, since they own the file that ui is downloading. In order to exploit the triangulation property of the affinity networks which the node is connected to, we also consider the set of partners of first order of ui (i.e., the list of partners of partners of ui):

   F1(ui) = ∪_{uj ∈ F0(ui)} F0(uj).   (10)

We introduce the list of friends of ui as follows:

   F(ui) = F0(ui) ∩ F1(ui).   (11)

Therefore, a friend of ui is a partner (that is, a peer from which ui has previously downloaded an item, or a candidate source for it) that is also a partner of another partner of ui. Obviously, this definition exploits the triangulation property, as depicted in Figure 4. The node ui stores an integer value m(ui, uj) for each reference in F(ui). More precisely, we define the partnership degree of the pair ui and uj as m : U² → N⁺, where m(ui, uj) = |f(ui) ∩ f(uj)|, that is, the number of files that they have in common. For the sake of simplicity, in the rest of the section, we will use the notation mij instead of m(ui, uj).⁷

⁷We note that the definition of the partnership degree is equivalent to the affinity function (1). We introduced two different notations since the Aff function proposes only one possible way to capture similarities between users, and in other domains its definition may change.
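The definitions above can be sketched in a few lines of Python. This is an illustration only, not the prototype code, and the users and file identifiers are hypothetical stand-ins for the SHA-1 hashes a real client would handle:

```python
# Sketch of formulas (9)-(11) and of the partnership degree m,
# computed from plain sets of shared-file identifiers.

shared = {
    "u1": {"a", "b", "p"},
    "u2": {"b", "c"},
    "u3": {"a", "c"},
    "u5": {"p"},        # shares a file with u1 only
}

def partners(u):
    """F0(u): users sharing at least one file with u (formula 9)."""
    return {v for v in shared if v != u and shared[u] & shared[v]}

def partners_of_partners(u):
    """F1(u): union of the partner sets of u's partners (formula 10)."""
    return set().union(*[partners(v) for v in partners(u)])

def friends(u):
    """F(u) = F0(u) intersected with F1(u) (formula 11)."""
    return partners(u) & partners_of_partners(u)

def m(u, v):
    """Partnership degree m(u, v) = |f(u) & f(v)|."""
    return len(shared[u] & shared[v])

print(sorted(partners("u1")))  # ['u2', 'u3', 'u5']
print(sorted(friends("u1")))   # ['u2', 'u3'] (u5 closes no triangle)
print(m("u1", "u2"))           # 1
```

Note how u5, which shares a file with u1 alone, is a partner of u1 but not a friend: the triangulation requirement filters it out.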



At the implementation level, the list F(ui) has a constant size, and it is ordered on the basis of the value of the partnership degree mij: the user at the top of the list has more files in common with ui than the other users do. On the contrary, the least "interesting" users (e.g., those with mij ≈ 0) are likely to be removed from the list. The reader should observe that, given m, it is possible to extract from this list the (known) neighbors of ui in the preference graph Gm; in fact, it is easy to see that m(ui, uj) = m ⇒ uj ∈ Ui^m, where Ui^m is the set of neighbors of ui in Gm.

For example, let us suppose that user ux downloaded s1 and s2 from uy. Moreover, he downloaded s3, s4 and s5 from uz, and s6 from uv. Finally, we also have that F0(ux) = F1(ux). As a consequence, we have that f(ux) = {s1, s2, s3, s4, s5, s6} and F(ux) = {uy, uz, uv}. Moreover, after having interacted with uy, uz and uv, the P2P client of ux also got their file lists. So, ux knows that f(uy) = {s1, s2, s4, s7}, f(uz) = {s3, s4, s5, s7, s8} and f(uv) = {s6, s7, s3, s4, s5}. The values of function m are updated after each interaction. After the last download, they are the following: mxy = 3, mxz = 3, and mxv = 4. F(ux) is ordered as (uv, uy, uz), as explained in Figure 5.

We also need to identify the set of files⁸ owned by the friends of ui, but not possessed by ui:

   Co-f(ui) = (∪_{uj ∈ F(ui)} f(uj)) − f(ui).   (12)

The state of a running node also includes a file map, which returns the partners owning a given resource that is not possessed by ui. Hence, we define the family of functions:

   mapi : Co-f(ui) → P(F(ui)),   (13)

where mapi(sk) = {uj ∈ F(ui) : sk ∈ Co-f(ui) ∩ f(uj)}. In the previous example, we have that Co-f(ux) = {s7, s8}, mapx(s7) = {uy, uz, uv}, and mapx(s8) = {uz} (see Figure 5).

6.1 Die Wahlverwandtschaften: The Intuition

The intuition behind the proposed recommendation scheme is based on the observation that the friends of a given peer build a cluster of nodes with different partnership degrees. We previously observed in Section 4 that nodes can be naturally grouped together on the basis of common interests. Moreover, we noted that some peers are more kindred to certain partners than to others; in fact, we found that an affinity network Gm is a small world, even for growing values of m. But not all the nodes involved in affinity networks with degrees lower than m are still involved in Gm. Thus, some relationships between nodes in the same cluster are stronger than others: even in the file-sharing community, elective affinities [von Goethe 1809] rule the social behavior of the users.

⁸Note that the P2P client of ui does not store all the friends' files, but only a unique reference to each of them (e.g., their SHA-1 hashes).
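As an illustrative sketch of formulas (12) and (13) on the worked example, plain Python sets can stand in for the hash references a real client would store:

```python
# Co-f(u): files owned by u's friends but missing at u (formula 12);
# file_map(u): missing file -> set of friends owning it (formula 13).
# Data reproduces the worked example of the text.

shared = {
    "ux": {"s1", "s2", "s3", "s4", "s5", "s6"},
    "uy": {"s1", "s2", "s4", "s7"},
    "uz": {"s3", "s4", "s5", "s7", "s8"},
    "uv": {"s6", "s7", "s3", "s4", "s5"},
}
friends_of = {"ux": {"uy", "uz", "uv"}}   # F(ux), as derived in the text

def co_f(u):
    """Files of u's friends that u does not own."""
    return set().union(*[shared[v] for v in friends_of[u]]) - shared[u]

def file_map(u):
    """For each missing file, the friends that can provide it."""
    return {s: {v for v in friends_of[u] if s in shared[v]}
            for s in co_f(u)}

print(sorted(co_f("ux")))            # ['s7', 's8']
print(sorted(file_map("ux")["s7"]))  # ['uv', 'uy', 'uz']
print(file_map("ux")["s8"])          # {'uz'}
```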


Fig. 5. An example of a feasible scenario characterized by four peers exchanging items with each other. In such a context, the recommendation list R(ux) is equal to {s7, s8}. In fact, we have that mapx(s7) = {uv, uy, uz} and, then, w(s7) = (mxv + mxy + mxz)/3 = 10/3 ≈ 3.33. Similarly, after having identified the relation mapx(s8) = {uz}, we derive w(s8) = mxz/3 = 3/3 = 1.

We want to sort the files in Co-f(ui) by means of the following criteria: (1) popularity in the cluster of partners of ui; (2) partnership degree of the friends storing the missing files. The recommendation list is defined as the ordered sequence R(ui) = (sk1, sk2, . . . , skℓ), where ℓ = |Co-f(ui)| and ∀h = 1, . . . , ℓ : skh ∈ Co-f(ui). Files in R(ui) are sorted (and, hence, recommended) on the basis of the weight defined below:

   w(skh) = Σ_{uj ∈ mapi(skh)} mij / maxd(|mapi(skd)|),   (14)

that is, ∀skd ∈ R(ui) : w(skd−1) ≥ w(skd) ≥ w(skd+1). In our example, files will be recommended to ux in this order: (s7, s8). In fact, we have that w(s7) = 10/3 ≈ 3.33 and w(s8) = 1.0 (as explained in Figure 5). Of course, in a practical environment, we can set a threshold in order to filter out recommendations with low weight. In the previous case, if the threshold were set to 2.0, only file s7 would be submitted to the user's attention.

6.2 Discussion

Let us numerically quantify the popularity of a file and the average partnership degree of the nodes hosting a given item as follows. Given a node ui, the popularity of a missing file skh is calculated by way of the family of functions popi : Co-f(ui) → [0, 1], where

   popi(skh) = |mapi(skh)| / maxd(|mapi(skd)|).   (15)

Given a node ui, the degree of a missing file skh is the average partnership degree of the nodes in F(ui) that store skh. This value is calculated by way of the family of functions degi : Co-f(ui) → R⁺, where

   degi(skh) = Σ_{uj ∈ mapi(skh)} mij / |mapi(skh)|.   (16)
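The quantities of formulas (14) through (16) can be checked numerically on the running example; the m values and file maps below are the ones derived in the text, and the short functions are illustrative sketches, not the prototype code:

```python
# w, pop and deg for the running example (formulas 14-16); the code also
# verifies that the weight factors as w = pop * deg.

m = {"uy": 3, "uz": 3, "uv": 4}          # partnership degrees of ux's friends
file_map = {"s7": {"uy", "uz", "uv"},    # map_x(s7)
            "s8": {"uz"}}                # map_x(s8)
max_map = max(len(owners) for owners in file_map.values())   # = 3

def w(s):      # formula (14)
    return sum(m[u] for u in file_map[s]) / max_map

def pop(s):    # formula (15): popularity, normalized to [0, 1]
    return len(file_map[s]) / max_map

def deg(s):    # formula (16): average partnership degree of the owners
    return sum(m[u] for u in file_map[s]) / len(file_map[s])

assert abs(w("s7") - 10 / 3) < 1e-12 and w("s8") == 1.0
assert all(abs(w(s) - pop(s) * deg(s)) < 1e-12 for s in file_map)
```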

Trivially, w(skh) = popi(skh) · degi(skh). The following theorem simply shows that the recommendations are sorted according to the criteria inspired by the affinity networks which the node is connected to: a missing file is suggested for its popularity amongst the friends of the user and for the affinity degree of the nodes that store the given file.

THEOREM 1. Given a node ui, and two files skx and sky in Co-f(ui) such that w(skx) > w(sky), the following statements are true:
(1) If the files have the same popularity, then skx is owned by nodes with a higher average partnership degree than those owning sky.
(2) If the files are owned by nodes with the same average partnership degree, then skx is more popular than sky in Co-f(ui).

PROOF. Note that the hypothesis says that w(skx) > w(sky), which means that

   popi(skx) · degi(skx) > popi(sky) · degi(sky).

It is easy to show that, when popi(skx) = popi(sky) (> 0), it follows that degi(skx) > degi(sky), which proves the first part of the theorem. The second statement assumes, on the contrary, that degi(skx) = degi(sky) (> 0); in this case, we have that popi(skx) > popi(sky), which proves the theorem.

6.3 An Empirical Evaluation

The validation of a recommender system is always quite a hazardous task, because of the difficulty of modeling the tastes of a given user: we need a quantitative measure of a qualitative service, which can be fairly evaluated only after feedback from the final user. Novel recommender systems are usually proposed and evaluated by way of well-known logs of user profiles and buddy tables, which contain lists of items with feedback ratings assigned by a given set of users. This information is cross-linked, and the precision of the recommendation is compared with that of other well-known (centralized) systems. This approach cannot be applied to our domain, because we do not have user profiles, and users are not required to give feedback to a data collector entity. Moreover, our objects are not structured, for example, in terms of author names, genres, or song or movie titles. We have unique identifiers (i.e., hash values) that are not coupled with the content, and recommendations are made only by way of user relationships and partnership degrees. The advantage of this approach is that it protects the privacy of users, since no information about their interests is disseminated. Moreover, a user does not have to waste time training his/her personal virtual assistant. Of course, DeHinter could be significantly improved by any semantic recommendation engine that uses reliable descriptions of items


and user feedback, and we will investigate this research direction in the near future.

As described in more detail in Ruffo and Schifanella [2007], we have reduced the recommendation problem to a classification task that, for each user u, labels every unseen file ski as "interesting" or "not interesting". This classification is performed by implementing formula (14) and using the following (normal recommendation) criterion: if w(ski) > 0, then file ski is considered "interesting" for the active user; otherwise, it is considered "not interesting". When we have a set of recommended files R(u) = (sk1, sk2, . . . , skℓ) sorted by weight, we may want to define a stronger criterion, such that a given item is considered "interesting" only if it is in the upper half of the list. Therefore, if medw(u) is the median of the w(ski) values, with 1 ≤ i ≤ ℓ, then we can use the following (strong recommendation) criterion: if w(ski) ≥ medw(u), then file ski is considered "interesting" for the active user; otherwise, it is considered "not interesting".

Cross-validation is a statistical test that fits the classification domain well and that is used to validate hypotheses, especially when further data are difficult to collect. A single run of a cross-validation test is made of two steps: the dataset is randomly partitioned into two parts, a training set and a test set. Training data are used to build the model, and test data are "hidden" in order to confirm and validate the initial analysis. Our empirical evaluation uses a particular kind of this test, called K-fold cross-validation (in our analysis, K = 10), where the dataset is partitioned into K subsets. Of the K parts, a single one is used to validate the classification, and the other K − 1 subsets are used for building the model. The process is repeated K times, with each of the partitions used exactly once as the test set. The mean of the results is then computed.
In our experiment, we considered the set containing all the data presented in Section 4, regardless of the media type. We transformed the dataset into a list of rows of the form [fi : ui1, ui2, . . .], where ui1, ui2, . . . is the list of peers storing file fi. We filtered out from this set all the rows related to files owned by only one peer, because they would be of no use for discovering relationships between users. After the cleaning process, we obtained a set of 105,995 rows. Then, we executed the 10-fold cross-validation test: the dataset is split into 10 parts, giving a test set of approximately 10,600 rows. During each fold, the recommendation weight of all the files in the test set is computed for each user. Formula (14) has been applied considering only the data in the training set (an example is given in Figure 6). The classification was made using both of the recommendation criteria defined above. The accuracy of the estimation has been calculated as follows: given a user u, if file ski is stored by u and has not been correctly classified as "interesting" by a given recommendation criterion, then we have an error; if erru is the number of errors, and nu is the number of files stored by u and correctly classified, then the accuracy of the estimation is nu / (nu + erru). Table VI reports the averaged accuracies for all the users and for each different fold of the test. The results are really good, with an average accuracy of 81% (with a confidence interval of [66%, 96%]) if the normal criterion is used.
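The accuracy measure can be sketched in a few lines; the ownership and prediction sets below are illustrative stand-ins for one user in one fold:

```python
# Per-user accuracy: n_u / (n_u + err_u), where n_u counts owned files
# correctly classified as "interesting" and err_u counts owned files
# wrongly classified as "not interesting" (one-sided errors only).

def accuracy(owned, predicted_interesting):
    n_u = len(owned & predicted_interesting)
    err_u = len(owned - predicted_interesting)
    return n_u / (n_u + err_u)

owned = {"s1", "s2", "s3"}          # files actually stored by u
predicted = {"s1", "s2", "s4"}      # files the criterion marked "interesting"
print(accuracy(owned, predicted))   # 2 / (2 + 1) = 2/3
```

Note that s4, which u does not own, plays no role in the score: as discussed in Section 6.4, no deduction can be made about files the user does not store.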


0.6 0.5

i

w(sk )

0.4 0.3 0.2 0.1 0 i

Fig. 6. Given a user u, the files ski in the test set are sorted by w(ski) and displayed as dots; if a file is also owned by u, it is additionally displayed as a square.

Table VI. 10-Fold Cross Validation Results: Average Accuracy and Variance for the Given Recommendation Criteria

          Normal recommendation       Strong recommendation
  Fold    Accuracy    σ               Accuracy    σ
  0       0.8145      0.145113        0.6682      0.189547
  1       0.8115      0.152825        0.6549      0.194969
  2       0.8174      0.149422        0.6657      0.189540
  3       0.8215      0.137835        0.6766      0.186365
  4       0.8261      0.143064        0.6629      0.192164
  5       0.8022      0.148108        0.6453      0.199965
  6       0.8097      0.148670        0.6517      0.191519
  7       0.8066      0.150512        0.6595      0.194688
  8       0.8035      0.166106        0.6560      0.199304
  9       0.8029      0.154615        0.6574      0.204398
  Tot     0.8116      0.149630        0.6598      0.194250

As expected, the strong recommendation criterion has a lower average accuracy, even if it is still quite high (66%).

6.4 Observations

First of all, it must be observed that we considered only one-sided errors, that is, items that are erroneously considered "not interesting": of course, we cannot make any deduction about files that are not owned by the users; in fact, we implemented a prototype (described in Section 7) for receiving, in the mid term, "satisfaction feedback" directly from the users. Another important measure for evaluating the efficacy of a recommender system is the sparsity of the population of the collaborative set (see Section 5). Potentially, a user of DeHinter can encounter the following problem: after running the engine for a while, the size of his/her friends list tends to shrink. This


could happen when he/she has become entangled in the web of a single affinity cluster. After a given interval of time, the set of suggested files can collapse, and the recommendation engine becomes useless if it does not find a way to leave the web and explore some other adjacent clusters. This is where the importance of hubs comes in: they are bridges to other affinity clusters (see Section 4.3). Even if such a refinement of DeHinter is planned for the near future, a worthwhile hub-detection mechanism can be run in order to explore the topology of the affinity network in the immediate neighborhood of the active node. When a hub is found, a discrimination procedure must be able to understand whether the hub is a portal (i.e., a user with many different interests and some level of competence, who can give access to other affinity clusters of interest) or just a greedy user (i.e., someone who downloads everything he/she finds). This observation opens a wide area of possible extensions of this work, because it suggests that it is possible to exploit the scale-free nature of affinity networks. Moreover, the reader should remember that hubs allow the crossing of a small-world network in a few hops: many different clusters characterizing some (non-disjoint) communities can be reached by following "semantic" links that can be generated and maintained at the hub side. Anyhow, these are only starting points for future work.

7. PROTOTYPE

The implementation of DeHinter has been guided by the following goals:
—Network Independence. Our aim is the creation of a mechanism unbound from any particular file-sharing network, such as Gnutella, eMule, KaZaA, or others. Though we have implemented our proposal within Phex, an open-source Java-based Gnutella client, it could easily be translated into a plug-in module for any modern file-sharing application, because it is free of particular structural or topological constraints.
Similarly, the implementation directives are free from any specific programming language, platform, or operating system.
—Transparency. We aim to reach a twofold level of transparency: with respect to (1) the user and (2) a generic client in the file-sharing communities. On the one hand, the user should simply search for and download favorite items, without attending to profile creation or complicated rating procedures. On the other hand, we want to design a mechanism able to exploit the active file-sharing communities without forcing all peers to run the recommendation module. Let us notice that this is a fundamental goal, since it allows a generic user who wants to take advantage of the recommendation feature to interact with a real network and, thus, with a preexistent virtual community composed of thousands of clients sharing a large amount of resources.
—Efficiency and Scalability. One of the most relevant concerns in the prototype implementation is certainly the management of the complex data structures, for example the affinity graphs, involved in the recommendation process. It is clear that the spatial overhead and the computational cost of making suggestions must be affordable for the peers.


Starting from these premises, we describe in the following sections how we have extended the Phex Gnutella client in order to implement our approach. Let us underline that we have only added the recommendation module, without any considerable alteration of the features of the standard Phex application. Of course, a similar task can be performed in other file-sharing tools.

7.1 Data Structures

In the previous section, we pointed out that one of the guidelines in the DeHinter implementation was attention to complexity and efficiency aspects. Much effort has been put into designing data structures able to optimize both spatial occupancy and efficiency in retrieving and updating information. It is evident that a single peer can manage only a bounded amount of data. In other words, it can store a limited portion of the affinity graphs with the related information about the files shared by peers. For these reasons, we have defined the following entities:
—FILESSET. The list of the items known by the selected user. Each resource si is unambiguously identified by its SHA-1 hash code and is combined with the following set of related information:
—{u1, u2, . . . , um}: the set of the owners of si.
—popAbs(si): si's popularity in the cluster of friends, that is, the number of friends that share si.
—w(si): the weight estimation of the item, according to formula (14).
—dimension: the file's size in bytes.
—PARTNERSSET. Includes the set of partners {u1, u2, . . . , uk} of the peer, according to formula (9). Each partner object has the following fields:
—destAddress: contains information about the peer, such as the IP address and port, or the DNS full name.
—partnership degree: counts the number of items that ui shares with the running user, that is, an estimation of the affinity between them.
—friend flag: marks whether ui is a friend of the active user.
As described above, the FilesSet and the PartnersSet cannot grow without bound, in order to ensure that the client can reasonably manage them. In such a context, it is clear that we have to define a replacement policy for the case in which the limit is reached. After an experimental phase, we set the maximal dimension of the FilesSet to 15,000 entries. When this threshold is exceeded, the prototype deletes each file that is owned by only one peer, since it shows a small popularity value. In general, we can implement this set as a priority queue in which the file's popularity represents the priority criterion. Similarly, the PartnersSet can manage a list of 500 users and is implemented as a priority queue based on the partnership degree. In this case, when a new user must be added and the list has reached the maximal dimension, we first remove the users that are offline, and then we exploit the partnership degree in order to delete the peers that are less kindred with the active user. In any case, such policies are grounded in the empirical analysis performed in Section 4: in fact, we have
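The PartnersSet replacement policy can be sketched as a bounded structure keyed by the partnership degree. This is a rough sketch, not the Phex module: the class and method names are ours, the offline-peer check is omitted, and the capacity is reduced for the demonstration:

```python
import heapq

class BoundedPartnersSet:
    """Keep at most `capacity` partners; when the bound is exceeded,
    evict the least kindred partner (lowest partnership degree)."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self.degree = {}                      # partner -> partnership degree

    def add(self, partner, degree):
        self.degree[partner] = degree
        if len(self.degree) > self.capacity:
            # offline-peer removal omitted; evict the least kindred partner
            victim = min(self.degree, key=self.degree.get)
            del self.degree[victim]

    def top(self, n):
        """The n partners most kindred with the active user."""
        return heapq.nlargest(n, self.degree, key=self.degree.get)

ps = BoundedPartnersSet(capacity=2)
for partner, deg in [("uy", 3), ("uz", 3), ("uv", 4)]:
    ps.add(partner, deg)
print(ps.top(2))  # ['uv', 'uz']: uy, tied at degree 3, was evicted first
```

A priority queue keyed on popularity would implement the FilesSet policy in the same way.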


Fig. 7. A screenshot of the DeHinter recommender module.

noticed that both file popularity and node degree in affinity networks follow a power-law distribution. In other words, removing the items characterized by lower popularity values does not heavily affect the system, because such items show a small connectivity degree. A similar consideration also applies to the PartnersSet replacement policy.

The PartnersSet and FilesSet are populated during the normal activity of the user. When a file sk is downloaded (often from many sources, or candidate peers), the system asks the user whether the item can be given to the recommendation engine. In fact, a user can listen to or watch the downloaded resource and decide that it is outside his/her interests. The idea is that only the relevant documents must be considered. This mechanism also makes it possible to deal with the fake-file phenomenon: when an item does not show the expected content, the user must be able to ignore it. If the user marks the file as "interesting", then all the sources are included in the PartnersSet. The file lists of these new partners are then browsed, and the FilesSet is updated as well. Then the friend flag of each partner is checked again and set to 1 for those partners that have some files in common. Finally, for each file stored by a friend of the active user, the weight is calculated, and a list R = {s1, s2, . . . , sN} of the top-N items sorted by weight is returned to the user.

Figure 7 shows a screenshot of DeHinter with a recommendation list sorted by weights. Each item in the list displays the popularity value, the average


partnership degree, and the SHA-1 hash code. All the file names that appear with that identifier can be displayed at the user's click. Furthermore, the user can select a suggested resource and download it simply by double-clicking the item in the list.

8. CONCLUSIONS

The main contribution of this article is twofold: first, we have shown that the transitivity and clustering of a small-world graph can be exploited for pushing personalized information to the users without building and maintaining profiles. Second, a novel (and quite simple) decentralized recommendation approach has been introduced. One of the main and original characteristics of our recommender system is that suggestions are made on the basis of spontaneous relationships between users, and that no a priori human knowledge is required for finding associations. This does not imply that meta-information and smart tagging are useless: on the contrary, we think that every recommendation engine would be greatly enhanced in terms of efficiency and accuracy if affinity networks were deeply studied and applied, and we plan to do this in the near future.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous referees for their insightful comments about how to improve this work. REFERENCES ABELLO, J., BUCHSBAUM, A. L., AND WESTBROOK, J. 1998. A functional approach to external graph algorithms. In Proceedings of the European Symposium on Algorithms. 332–343. ACHACOSO, T. B. AND YAMAMOTO, W. S. 1991. AYs Neuroanatomy of C Elegans for Computation. CRC-Press. ADAMIC, L. A. AND HUBERMAN, B. A. 2000. The nature of markets in the world wide web. Quarterly J. Electron. Commerce 1, 512. ALBERT, R. AND BARABASI, A. L. 2002. Statistical mechanics of complex networks. Rev. Modern Physics 74, 1. BALABANOVIC´ , M. AND SHOHAM, Y. 1997. Fab: content-based, collaborative recommendation. Comm. ACM 40, 3 66–72. BARABA´ SI, A.-L. 2003. Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life. Plume Books. BARABA´ SI, A.-L. AND ALBERT, R. 1999. Emergence of scaling in random networks. Science 286, 509. BARFORD, P., BESTAVROS, A., BRADLEY, A., AND CROVELLA, M. 1999. Changes in web client access patterns: Characteristics and caching implications. World Wide Web 2, 1-2, 15–28. BHALLA, U. S. AND IYENGAR, R. 1999. Emergent properties of networks of biological signaling pathways. Science 283, 381–387. BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. 1999. Web caching and zipf-like distributions: Evidence and implications. In Proceedings of INFOCOM. 126–134. BRODER, A., KUMAR, R., MAGHOUL, F., RAGHAVAN, P., RAJAGOPALAN, S., STATA, R., TOMKINS, A., AND WIENER, J. 2000. Graph structure in the web. Comput. Netw. 33, 309–320. CLAYPOOL, M., GOKHALE, A., MIRANDA, T., MURNIKOV, P., NETES, D., AND SARTIN, M. 1999. Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems: Algorithms and Evaluation. ACM. COHEN, J. E., BRIAND, F., AND NEWMAN, C. M. 1986. A stochastic theory of community food webs III. 
Predicted and observed lengths of food. Royal Soc. London Proc. Series B 228, 317–353. ACM Transactions on Internet Technology, Vol. 9, No. 1, Article 4, Publication date: February 2009.

4:32



G. Ruffo and R. Schifanella

COX, R. A. K., FELTON, J. M., AND CHUNG, K. C. 1995. The concentration of commercial success in popular music: an analysis of the distribution of gold records. J. Cultural Economics 19, 333–340. CROVELLA, M. E. AND BESTAVROS, A. 1997. Self-similarity in World Wide Web traffic: Evidence and possible causes. IEEE/ACM Trans. Netw. 5, 6, 835–846. DE SOLA POOL, I. AND KOCHEN, M. 1978. Contacts and influence. Social Netw. 1, 1–48. DEROURE, D., HALL, W., REICH, S., HILL, G., PIKRAKIS, A., AND STAIRMAND, M. 2001. MEMOIR − an open framework for enhanced navigation of distributed information. Inf. Process. Manage. 37, 1. DESOLLA PRICE, D. J. 1967. Networks of scientific papers. Science 155, 3767, 1213–1219. ˝ , P. AND R´ENYI, A. 1959. On random graphs. Publicationes Mathematicae 6. ERDOS ˝ , P. AND R´ENYI, A. 1960. On the evolution of random graphs. Publications of the MathematERDOS ical Institute of the Hungarian Academy of Sciences 5. ˝ , P. AND R´ENYI, A. 1961. On the strength of connectedness of a random graph. Acta MatheERDOS matica Scientia Hungary 12. ESTOUP, J. B. 1916. Les gammes stenographiques. Institut Stenographique de France. FALOUTSOS, M., FALOUTSOS, P., AND FALOUTSOS, C. 1999. On power-law relationships of the internet topology. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM’99). Vol. 29. ACM Press, New York, NY, 251–262. GOLDBERG, D., NICHOLS, D., OKI, B. M., AND TERRY, D. 1992. Using collaborative filtering to weave an information tapestry. Comm. ACM 35, 12, 61–70. GUTENBERG, B. AND RICHTER, R. F. 1944. Frequency of earthquakes in california. Bul. Seismological Soc. Amer. 34, 185–188. HAN, P., XIE, B., YANG, F., AND SHEN, R. 2004. A scalable p2p recommender system based on distributed collaborative filtering. Expert Syst. Appl. 27, 2, 203–210. HARTWELL, L. H., HOPFIELD, J. J., LEIBLER, S., AND MURRAY, A. W. 1999. From molecular to modular cell biology. Nature 402, 6761 Suppl. 
HERLOCKER, J. L., KONSTAN, J. A., AND RIEDL, J. 2000. Explaining collaborative filtering recommendations. In Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW’00). ACM Press, 241–250. HOWE, A. E. AND DREILINGER, D. 1997. SAVVYSEARCH: A metasearch engine that learns which search engines to query. AI Mag. 18, 2, 19–25. IAMNITCHI, A., RIPEANU, M., AND FOSTER, I. 2004. Small-world file-sharing communities. In The 23rd Conference of the IEEE Communications Society (INFOCOM’04). Vol. 2. 952–963. JEONG, H., TOMBOR, B., ALBERT, R., OLTVAI, Z. N., AND BARABA´ SI, A. L. 2000. The large-scale organization of metabolic networks. Nature 407, 6804, 651–654. KARINTHY, F. 1929. Chains. Everything is Different. Atheneum Press. KLEINBERG, J. M. 2000. The small-world phenomenon: An algorithmic perspective. In Proceedings of the 32nd ACM Symposium on Theory of Computing. KLEINFELD, J. S. 2001. Could it be a big world after all? the “six degrees of separation” myth. Society. KLINGBERG, T. AND MANFREDI, R. 2002. Gnutella protocol development. http://rfc-gnutella. sourceforge.net/src/rfc-0_6-draft.html (last access:1/13/09). KOHLI, R. AND SAH, R. 2003. Market shares: Some power law results and observations. Working paper 04.01, School of Public Policy, University of Chicago. KOHN, K. W. 1999. Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Mol. Biol. Cell 10, 8, 2703–2734. KONSTAN, J. A., MILLER, B. N., MALTZ, D., HERLOCKER, J. L., GORDON, L. R., AND RIEDL, J. 1997. Grouplens: applying collaborative filtering to usenet news. Comm. ACM 40, 3, 77–87. KRULWICH, B. 1997. Lifestyle finder: Intelligent user profiling using large-scale demographic data. AI Maga. 18, 2, 37–45. LANG, K. 1995. NewsWeeder: learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann Publishers Inc. 331–339. LEIBOWITZ, N., RIPEANU, M., AND WIERZBICKI, A. 2003. Deconstructing the kazaa network. 
In Proceedings of the 3rd IEEE Workshop on Internet Applications. IEEE Press. LOTKA, A. J. 1926. The frequency distribution of scientific production. J. Wash. Acad. Sci. 16, 317–323. ACM Transactions on Internet Technology, Vol. 9, No. 1, Article 4, Publication date: February 2009.

A Peer-to-Peer Recommender System



4:33

LU, E. T. AND HAMILTON, R. J. 1991. Avalanches and the distribution of solar flares. Astrophysical J. 380, L89–L92.
MILGRAM, S. 1967. The small world problem. Psych. Today 2, 60–67.
MILLER, B. N., KONSTAN, J. A., AND RIEDL, J. 2004. PocketLens: Toward a personal recommender system. ACM Trans. Inform. Syst. 22, 3, 437–476.
MONASSON, R. 1999. Diffusion, localization and dispersion relations on "small-world" lattices. European Physical J. B 12, 4, 555–567.
MONTANER, M., LÓPEZ, B., AND DE LA ROSA, J. L. 2003. A taxonomy of recommender agents on the Internet. AI Rev. 19, 4, 285–330.
NEUKUM, G. AND IVANOV, B. A. 1994. Crater size distributions and impact probabilities on Earth from lunar, terrestrial-planet, and asteroid cratering data. In Hazards Due to Comets and Asteroids, T. Gehrels, M. S. Matthews, and A. M. Schumann, Eds. The University of Arizona Press, 359–416.
NEWMAN, M. E. 2001. The structure of scientific collaboration networks. Proc. Nat. Acad. Sci. 98, 2, 404–409.
NEWMAN, M. E. J. 2003. The structure and function of complex networks. SIAM Rev. 45, 167.
NEWMAN, M. E. J. 2005. Power laws, Pareto distributions and Zipf's law. Contemp. Physics 46, 323.
NEWMAN, M. E. J. AND WATTS, D. J. 1999. Renormalization group analysis of the small-world network model. Physics Lett. A 263, 341–346.
OKA, T., MORIKAWA, H., AND AOYAMA, T. 2004. Vineyard: A collaborative filtering service platform in distributed environment. In Proceedings of the IEEE/IPSJ Symposium on Applications and the Internet Workshops.
PAGALLO, U. 2006. Teoria giuridica della complessità. Dalla "polis primitiva" di Socrate ai "mondi piccoli" dell'informatica. Un approccio evolutivo. Giappichelli, Torino, Italy.
PAZZANI, M. J. 1999. A framework for collaborative, content-based and demographic filtering. AI Rev. 13, 5-6, 393–408.
PAZZANI, M. J., MURAMATSU, J., AND BILLSUS, D. 1996. Syskill & Webert: Identifying interesting web sites. In Proceedings of AAAI/IAAI, Vol. 1, 54–61.
PHEX TEAM. 2003. Phex file-sharing Gnutella client. http://www.phex.org/mambo/ (last access: 1/13/09).
PINKERTON, B. 2000. WebCrawler: Finding what people want. Ph.D. thesis, University of Washington.
POPESCUL, A., UNGAR, L. H., PENNOCK, D. M., AND LAWRENCE, S. 2001. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI'01). Morgan Kaufmann Publishers Inc., San Francisco, CA, 437–444.
REDNER, S. 1998. How popular is your paper? An empirical study of the citation distribution. European Physical J. B 4, 131.
REKHTER, Y., MOSKOWITZ, B., KARRENBERG, D., DE GROOT, G. J., AND LEAR, E. 1996. Address allocation for private internets. RFC 1918, Internet Engineering Task Force.
RESNICK, P. AND VARIAN, H. R. 1997. Recommender systems: Introduction to the special section. Comm. ACM 40, 3, 56–58.
RICH, E. 1979. User modeling via stereotypes. Cognitive Sci. 3, 329–354.
ROBERTS, D. C. AND TURCOTTE, D. L. 1998. Fractality and self-organized criticality of wars. Fractals 6, 351–357.
RUFFO, G. AND SCHIFANELLA, R. 2007. Evaluating peer-to-peer recommender systems that exploit spontaneous affinities. In Proceedings of the ACM Symposium on Applied Computing (SAC'07). ACM, New York, NY, 1574–1578.
RUFFO, G., SCHIFANELLA, R., AND GHIRINGHELLO, E. 2006. A decentralized recommendation system based on self-organizing partnerships. Lecture Notes in Computer Science, vol. 3976. Springer, 618–629.
SARWAR, B. M., KONSTAN, J. A., BORCHERS, A., HERLOCKER, J., MILLER, B., AND RIEDL, J. 1998. Using filtering agents to improve prediction quality in the GroupLens research collaborative filtering system. In Proceedings of the ACM Conference on Computer Supported Cooperative Work (CSCW'98). ACM Press, New York, NY, 345–354.





SCHIFANELLA, R., PANISSON, A., GENA, C., AND RUFFO, G. 2008. MobHinter: Epidemic collaborative filtering and self-organization in mobile ad-hoc networks. In Proceedings of the ACM Conference on Recommender Systems (RecSys'08). ACM, New York, NY, 27–34.
SEGLEN, P. O. 1992. The skewness of science. J. Amer. Soc. Inform. Sci. 43, 9, 628–638.
SHARDANAND, U. AND MAES, P. 1995. Social information filtering: Algorithms for automating "word of mouth". In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI'95). ACM, 210–217.
SRIPANIDKULCHAI, K., MAGGS, B., AND ZHANG, H. 2003. Efficient content location using interest-based locality in peer-to-peer systems. In Proceedings of IEEE INFOCOM.
STUTZBACH, D., REJAIE, R., AND SEN, S. 2005. Characterizing unstructured overlay topologies in modern P2P file-sharing systems. In Proceedings of the ACM SIGCOMM Internet Measurement Conference.
TERVEEN, L. AND HILL, W. 2001. Beyond recommender systems: Helping people help each other. In HCI in the New Millennium. Addison-Wesley, 487–509.
TVEIT, A. 2001. Peer-to-peer based recommendations for mobile commerce. In Proceedings of the 1st International Workshop on Mobile Commerce (WMC'01). ACM Press, New York, NY, 26–29.
VON GOETHE, J. W. 1809. Die Wahlverwandtschaften. http://en.wikipedia.org/wiki/Elective_Affinities.
WANG, J., REINDERS, M. J. T., LAGENDIJK, R. L., AND POUWELSE, J. 2005. Self-organizing distributed collaborative filtering. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'05). ACM Press, New York, NY, 659–660.
WASSERMAN, S. AND FAUST, K. 1994. Social Network Analysis. Cambridge University Press, Cambridge, U.K.
WATTS, D. J. 1999. Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton University Press, Princeton, NJ.
WATTS, D. J. AND STROGATZ, S. H. 1998. Collective dynamics of 'small-world' networks. Nature 393, 6684, 440–442.
WEI, Y. Z., MOREAU, L., AND JENNINGS, N. R. 2005. A market-based approach to recommender systems. ACM Trans. Inform. Syst. 23, 3, 227–266.
WILLIAMS, R. J. AND MARTINEZ, N. D. 2000. Simple rules yield complex food webs. Nature 404, 6774, 180–183.
WILLIS, J. C. AND YULE, G. U. 1922. Some statistics of evolution and geographical distribution in plants and animals, and their significance. Nature 109, 177–179.
XIE, B., HAN, P., AND SHEN, R. 2004. PipeCF: A scalable DHT-based collaborative filtering recommendation system. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers and Posters (WWW Alt.'04). ACM Press, New York, NY, 224–225.
YAN, T. AND GARCIA-MOLINA, H. 1995. SIFT: A tool for wide-area information dissemination. In Proceedings of the USENIX Technical Conference, 177–186.
ZANETTE, D. H. AND MANRUBIA, S. C. 2001. Vertical transmission of culture and the distribution of family names. Physica A: Statist. Mechanics Appl. 295, 1-2, 1–8.

Received July 2006; accepted May 2007

ACM Transactions on Internet Technology, Vol. 9, No. 1, Article 4, Publication date: February 2009.