Proceedings of the Fourth International Conference on e-Technologies and Networks for Development, Lodz, Poland, 2015. ISBN: 978-1-4799-8450-3 ©2015 IEEE

Spam Detection through Link Authorization from Neighboring Nodes

Opoku-Mensah Eugene*, Zhang Fengli†, Opare Kwasi Adu-Boahen‡, and Baagyere Edward Yellakuor*
* School of Information and Software Engineering, UESTC, Chengdu, China. Email: [email protected]
† School of Information and Software Engineering, UESTC, Chengdu, China. Email: [email protected]
‡ School of Communications, UESTC, Chengdu, China. Email: [email protected]

Abstract—Current link spam techniques aim at manipulating both good and bad pages to boost the spammers' desired target page(s) and attract web surfers. The web structure of today includes links from bad to good pages and vice versa, as well as links between pages of the same kind. It is widely known that good pages seldom connect to bad ones; hence, spamming is assumed when such connections occur, and the good pages involved are penalized. However, such penalization tends to be unfair, since every web page has an honest and a dishonest part. Besides, several factors, such as page similarity, influence the distribution of web hyperlinks. Based on this, the paper proposes a Link Authorization model to detect link spam propagation onto neighboring pages. We design metrics with relevant link and content features to compute the angular similarity between connecting good-bad pages. Based on this angular similarity, we are able to predict page links as true or false authorizations. For every false authorization detected, the out-going page receives a penalization set by a pre-determined threshold. Our results show an average spamicity of 0.77 and a corresponding demotion of 0.60.

Keywords: web spam, link authorization, webpage similarity, webpage demotion, link distribution

I. INTRODUCTION

The web has grown exponentially in recent years; it holds huge amounts of data and has thereby become the most accessible information source for users' informational needs. The average web user turns to search engines as the first port of call for an information request, not just because of their large repositories but also because of their ability to rank results. Search engines adopt ranking techniques that aim to sort results for user requests by relevance, but spammers are often able to corrupt this, owing to the low precision of search engines [1], leading to the dissatisfaction of surfers. Web page ranking algorithms are metrics used to evaluate and score web pages according to the basic computational parameters of each algorithm. Notable among them is Google PageRank [2], proposed by Larry Page and Sergey Brin, which relies heavily on the quality of the in-links a page receives. The PageRank score reflects the popularity and importance of a page compared to other pages, which explains the various attempts at manipulation by spammers. Web spam, as defined by Egele et al. [3], is the manipulation of web pages or the exploitation of ranking algorithms to raise a page's position in users' search result lists. Web owners are motivated by the financial benefit of high search engine ratings to attract surfers to their sites by either legitimate or illegitimate means. All techniques that aim at a higher rank without a corresponding improvement in page quality and/or structure are considered illegitimate; examples include term spam (cloaking, flooded keywords, hidden text) and link spam (boosting links, page swapping, etc.) used to manipulate rankings and appear higher in search results. Others instead adopt legitimate means, using Search Engine Optimization (SEO) techniques such as title tag optimization, 'alt' tags in images, easily crawled site maps, and quality out-links. Google introduced Panda in 2011 and Penguin in 2012 to penalize pages that duplicate others and pages that manipulate search results, respectively. Similarly, in [4] the authors proposed pragmatic ways to combat term spam.

The web basically shows four (4) types of edges (links) and their proposed distributions, as described by Benczúr et al. in [5]: nonspam-to-nonspam, spam-to-spam, spam-to-nonspam, and nonspam-to-spam links. Nonspam pages are considered good, whereas spam pages are bad. Earlier researchers reckoned that good pages rarely point to bad ones, but currently one cannot take this for granted, because recent spam techniques (term and link spam) target all pages, both good and bad. The concern is that every spammed good-to-bad connection will certainly be used to boost several spam pages. Our contributions are as follows:


• We design an algorithm that predicts link spam from good to bad pages and present the characteristics of such links.
• We propose a penalization for the pages involved.

In this paper, all nodes and pages refer to webpages. Our approach, which is based on the authorization conferred on target nodes, seeks to detect link spam arising from both link farms and good nodes. This authorization can be true or false. By computing a similarity match, the authority of the received link can be accepted as legitimate or rejected as spam. This study is essential and worthwhile in that it seeks to demote nodes involved in manipulations towards high ranking while rewarding deserving so-called bad pages. Unlike other approaches such as [5] that focus only on demotion, our approach establishes a criterion that separates legitimate from illegitimate nonspam-to-spam connected pages.


In effect, deserving pages are rewarded in their rank score while undeserving ones are demoted. The significant benefit is for users, who can reach the relevant webpages they desire through search engines. The rest of this paper is organized as follows: Section II gives an overview of related research. Section III provides some basic concepts of the structure of web page links. Section IV describes the Link Authorization algorithm. Section V presents our methodology and experimental results. The last section concludes with the discussion and future work.

II. RELATED RESEARCH

The detection and removal of link spam has been approached with a varied range of techniques, including but not limited to classification from a seed set [6] [7] [8], neighboring-set influence [3] [9] and statistical approaches [10]. Gyöngyi et al.'s TrustRank [6] starts from a small set of good pages to find similarly reliable and trustworthy pages. Although the trust it assigns diminishes for documents farther from the seed set, it is limited to nodes along its direction of propagation. From a machine-learning view, Gan and Suel [11] analyzed the characteristics of spam and non-spam nodes and observed that spam nodes link to nonspam ones, but not vice versa. Using statistics, Benczúr et al. [10] later found that nonspam pages can connect to spam, but that this is minimal. Their SpamRank [10] detects spam nodes during ranking and penalizes outgoing nodes that deviate from the power law; the hypothesis is built on the benefit a spam node gains from a nonspam one. Though an improvement over TrustRank, it is likely to penalize genuine nonspam pages that connect to spam pages. Similarly, Metaxas, in "propagation of distrust" [13], identified spam through a backward propagation of distrust from neighboring nodes. From these approaches it is evident that propagating trust or distrust alone detects spam from only the honest or the dishonest side. Therefore, Zhang et al. [14] and Liu et al. [15] proposed that an anti-spam technique in which both trust and distrust are propagated simultaneously achieves higher efficiency than the above schemes. Similar to Zhang et al. [14], our anti-spam approach adopts a true/false link authorization propagated between connected nonspam and spam pages, and vice versa. Whereas the SimRank algorithm [16] computes relevance based on webpage-query similarity, our work determines link authority based on page-to-page similarity.

III. THE STRUCTURE OF WEB PAGE LINKS

The web structure is a directed graph G = (V, E), where V is the set of vertices (nodes) and E is the set of edges representing the hyperlink relationships between nodes. According to [17], web hyperlinks contain metadata used in web structural and content mining and in most web-metrics computations.

From Fig. 1, directed links are regarded as either in-links (sometimes referred to as back-links) or out-links. In-links are incoming links received from other pages, whereas out-links are links directed away from a web page to other pages. In-degree and out-degree refer to the number of in-links and out-links, respectively. Link authorization comes from the out-links. The page to which a link is directed is referred to as the target page. The popularity of a web page comes from a higher in-degree, whereas the richness of a page is reflected by its out-degree.
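To make the in-degree/out-degree terminology concrete, the following minimal Python sketch (ours, not the authors' Java implementation) counts both quantities from a directed edge list; the page names are invented:

from collections import Counter

# Hypothetical directed edges: (source_page, target_page) pairs.
edges = [("p_i", "p_j"), ("p_i", "p_l"), ("p_j", "p_l"), ("p_l", "p_j")]

out_degree = Counter(src for src, _ in edges)   # links directed away from a page
in_degree = Counter(dst for _, dst in edges)    # links received by a page

for page in sorted(set(out_degree) | set(in_degree)):
    print(page, "in:", in_degree[page], "out:", out_degree[page])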


Fig. 1. Link Authorization between webpages in web graph

The diagram in Fig. 1 shows the four (4) types of link distribution described in Section I. Links, by nature, primarily function either as navigation (a random walk through pages) or as authorization (recognition of richness) of the target node. By this, we establish that, unlike authorization links, navigational links do not affect ranking. Though identifying the function of a link falls outside our scope, we set the following guiding rules for describing navigational links, based on our observations (a short illustrative sketch follows this list):

• they occur within the main contents of the web page;
• they usually do not have keywords in their anchor tags;
• examples are links with anchor tags like 'next', 'up', 'top' and 'down'.
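The sketch below illustrates how the anchor-text rules above might be applied; the helper name, the keyword check and the example values are assumptions of this sketch and not the authors' implementation (the "main contents" rule is not modeled):

def looks_navigational(anchor_text: str, page_keywords: set) -> bool:
    """Heuristic from the guiding rules: positional anchor words, or anchors
    carrying none of the page's keywords, suggest a navigational link."""
    words = anchor_text.lower().split()
    if any(w in {"next", "up", "top", "down"} for w in words):
        return True
    return not any(w in page_keywords for w in words)

# Example with made-up values.
print(looks_navigational("next", {"football", "league"}))      # True
print(looks_navigational("premier league table", {"league"}))  # False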

Link Spam

Techniques that aim to manipulate the web link structure by creating highly interconnected links to boost a target page(s) are referred to as link spam. These techniques are usually associated with link-based ranking algorithms such as Google PageRank and Hyperlink-Induced Topic Search (HITS) [2]. Earlier link spam focused mainly on spam-controlled pages (link farms), where a large controlled set of nodes is generated and directed to boost the rank score of one or more target nodes. Over time, the pattern of link farms as well as other spam characteristics, such as large numbers of in-links from the same domain, overcrowded links per page, structural deficiency and pages with empty content, have been discovered [10] [18]. Spammers have therefore extended their approaches to benefit from good pages, using expensive schemes such as link baits, blogs and link reciprocity. From 2004, new anti-spam schemes [10] proposed penalizing good pages that connect to bad ones according to the power-law distribution.


TABLE I
SPAMICITY IN THE WEBSPAM-UK2006 DATASET

Spamicity Range    Frequency    Percentage
(0.0, 0.2)         1530         56%
(0.2, 0.4)         342          13%
(0.4, 0.6)         179          7%
(0.6, 0.8)         261          9%
(0.8, 1.0)         413          15%

From the webspam-uk2006 labels.txt [19] dataset, a feature termed Spamicity, Sp, which measures the likelihood of a page being spam, is defined such that 0 ≤ Sp ≤ 1. Pages are categorized as normal, border or spam, based on Sp, as follows:

\text{category}(Sp) = \begin{cases} \text{normal} & \text{if } 0 \le Sp < 0.5,\\ \text{border} & \text{if } Sp = 0.5,\\ \text{spam} & \text{if } 0.5 < Sp \le 1. \end{cases} \qquad (1)

The values of Sp clearly show that the different categories of pages exhibit both spam and non-spam features in different ratios. It can also be observed in Table I that 25% of the dataset in [19] lies within the range [0.4, 0.6]. Without necessarily focusing on the labeling process, we observe very little difference between the worst normal page and the best spam page. Therefore, hyperlinks directed from nonspam to spam pages, and vice versa, within the [0.4, 0.6] category have a lot in common. In this case, the notion of demoting good pages that connect to bad ones cannot be applied unless further analysis shows that they really contributed to spam; this is what our proposed approach does.
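A small sketch of the labelling rule in Eq. (1); the function name and the sample scores are ours:

def categorize(sp: float) -> str:
    """Eq. (1): map a spamicity score Sp in [0, 1] to a label."""
    if not 0.0 <= sp <= 1.0:
        raise ValueError("spamicity must lie in [0, 1]")
    if sp < 0.5:
        return "normal"
    if sp == 0.5:
        return "border"
    return "spam"

print([categorize(s) for s in (0.1, 0.5, 0.77)])  # ['normal', 'border', 'spam']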

IV. LINK AUTHORIZATION MODEL

The Link Authorization (LA) Model, as the name implies, uses the link connecting any two pages to examine the similarity match between the pages in terms of content and structure. Generally, spam and nonspam pages each have their respective characteristics [11]; however, this model proposes that a legitimate nonspam-to-spam link exhibits a strong relationship between the content and structure of the two pages, as shown between pj and pl in Fig. 1; otherwise, the connection is spam. The LA model is expressed by computing a similarity index between the two connected pages.

A. Features

Link and content based features have been widely used to discover how worthy and true a web page is. Our approach uses features similar to [11] for every page in the set of pages. The link features are: in-degree, out-degree, average in-degree to out-degree, link reciprocity (the fraction of out-links with in-links to pi), the ratio of the current PageRank to that of the maximum-ranked page of the website, and the number of in-links at distance 1 and 4 from the current page. The content features are: number of words, average word length, number of keywords, average count of all keywords, number of anchor texts in the page, fraction of visible text, number of words in the page title, fraction of words from popular words, Tf-idf (term frequency-inverse document frequency) of keywords, and the top-100 corpus recall.

B. LA Model Formulation

The aforementioned features are transformed into a P × K matrix A, where P is the number of pages and K is the number of features being considered. Hence,

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} & \cdots & a_{1K} \\ a_{21} & a_{22} & \cdots & a_{2k} & \cdots & a_{2K} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{P1} & a_{P2} & \cdots & a_{Pk} & \cdots & a_{PK} \end{bmatrix}

Element a_{pk} occupies the p-th row and the k-th column, where p = 1, 2, ..., P and k = 1, 2, ..., K. For each a_{pk} ∈ A, scaling is done as follows:

\bar{a}_{pk} = \frac{a_{pk} - a_k^{\min}}{a_k^{\max} - a_k^{\min}} \qquad (2)

where a_{pk} is the initial feature value, a_k^{\min} the minimum value in the k-th column and a_k^{\max} its maximum. Scaling is done per column and is necessary to convert the features to comparable weights. A new matrix A_T is formed, in which the scaled features \bar{a}_{pk} replace their initial values a_{pk}. The transformed matrix A_T becomes the input of the LA algorithm.
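The column-wise scaling of Eq. (2) can be sketched as follows; the feature values are invented, and a constant column (where a_k^max = a_k^min) would need separate handling:

import numpy as np

# Invented feature matrix: P = 3 pages, K = 3 features.
A = np.array([[12.0, 0.3, 250.0],
              [ 4.0, 0.9,  80.0],
              [ 7.0, 0.1, 600.0]])

col_min = A.min(axis=0)
col_max = A.max(axis=0)
# Eq. (2), applied column by column; assumes col_max > col_min for every feature.
A_T = (A - col_min) / (col_max - col_min)
print(A_T)   # scaled matrix, the input to the LA algorithm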

Link Authorization (LA) between any two pages a_p and a_{p+n}, through a hyperlink e (with e ∈ E), indicates the conferral of authority and recognition on a_{p+n} by page a_p. The necessary condition of connected node pairs is required for the algorithm. In the edge-list set E, each single edge e is represented in the format [a_p, a_{p+n}], where n = 1, 2, ..., P − p indicates that a_p (being nonspam) is the source and a_{p+n} (being spam) is the destination. Following the vector space approach [20], we express the angular similarity Sim between a_p and a_{p+n} as

Sim(a_p, a_{p+n}) = \frac{\sum_{k=1}^{K} \bar{a}_{p,k} \cdot \bar{a}_{p+n,k}}{L_2(\bar{a}_p) \times L_2(\bar{a}_{p+n})} \qquad (3)

where L_2(\bar{a}_p) and L_2(\bar{a}_{p+n}) are the L2 norms of pages a_p and a_{p+n}, respectively.
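A minimal sketch of the angular similarity in Eq. (3) on two scaled feature vectors; the vectors are invented:

import numpy as np

def sim(a_p: np.ndarray, a_pn: np.ndarray) -> float:
    """Eq. (3): cosine (angular) similarity of two scaled feature vectors."""
    return float(np.dot(a_p, a_pn) / (np.linalg.norm(a_p) * np.linalg.norm(a_pn)))

source = np.array([0.8, 0.2, 0.6])  # scaled features of the nonspam source page
target = np.array([0.7, 0.1, 0.5])  # scaled features of the spam target page
print(round(sim(source, target), 3))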


Based on the outcome of Sim, LA is expressed as

LA = \begin{cases} \text{True} & \text{if } \phi < Sim \le 1,\\ \text{False} & \text{if } 0 \le Sim \le \phi, \end{cases}

where φ is the threshold. False authorization is the conferral of authority on a target node without relevant similarity between the nodes. Its interpretation is that the target page does not enrich the information coming from the out-going page.


The LA model identifies such a link as a spam-contributed link, rejects it, and applies a demotion to the respective nodes. True authorization and false authorization are associated with a high Sim score (i.e. greater than φ) and a low Sim score, respectively. Algorithm 1 is used to predict the authorization scores between connected nodes.

Algorithm 1: Link Authorization — predict the link received at a target page as true or false authorization.
Require: transformed feature matrix A_T (input); edge set E exists
1: for all p ∈ P do
2:     L2-normalize the feature vector of page p
3: end for
4: while there is an edge from nonspam a_p to spam a_{p+n}, with n ≠ 0, do
5:     compute Sim(a_p, a_{p+n})
6:     if Sim(a_p, a_{p+n}) ≥ φ then
7:         LA ← True (distribute rank to a_{p+n})
8:     else
9:         LA ← False (demote a_p)
10:    end if
11: end while
Output: LA scores (true/false authorization)
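Putting the pieces together, the sketch below mirrors Algorithm 1 under our reading of it; the edge and feature data structures are assumptions of this sketch, not the authors' Java code:

import numpy as np

PHI = 0.82  # threshold reported as the best predictor in Section V

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def link_authorization(edges, features, phi=PHI):
    """edges: iterable of (nonspam_page, spam_page) pairs;
    features: dict mapping page id -> scaled feature vector (a row of A_T)."""
    results = {}
    for src, dst in edges:
        a_p = l2_normalize(features[src])
        a_pn = l2_normalize(features[dst])
        sim = float(np.dot(a_p, a_pn))          # cosine of the normalized vectors
        label = "True" if sim >= phi else "False"
        results[(src, dst)] = (label, sim)      # "False" marks src for demotion
    return results

features = {"good1": np.array([0.9, 0.4, 0.7]),
            "bad1":  np.array([0.2, 0.9, 0.1])}
print(link_authorization([("good1", "bad1")], features))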

Fig. 2. Degree properties of the webspam-uk2007 dataset: (a) out-degree distribution showing x_min = 30 and γ = 2.8; (b) ratio of out-degree to in-degree.

C. Demotion

Demotion of a node is the consequence of a false authorization and is applied to the out-going node. We explicitly issue personalized demotions to ensure that any given pair of connected nodes receives a unique penalization for its spamicity. For any two connected nodes a_p and a_{p+n} detected with false authorization (i.e. Sim < φ), a_p is demoted, and the corresponding rank score extended from a_p to a_{p+n} is reduced to zero. We express the demotion of a_p, in the range (0, 1), as d_v, where

d_v = 1 - \left[ \frac{Sim(a_p, a_{p+n})}{1 + rec(P_{XY})} \right] \qquad (4)

and rec(P_{XY}) is the count of reciprocity (bi-directional links) between host X and host Y, on which a_p and a_{p+n} are found, respectively. Mathematically, we express rec(P_{XY}) in Equation (5) as

rec(P_{XY}) = \sum_{i=1}^{N} \rho(X_i \leftrightarrow Y_i) \qquad (5)

where X_i ↔ Y_i is the reciprocity between the i-th page in host X and that of host Y, evaluated by the ρ function.
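Under our reading of Eqs. (4) and (5), the demotion value can be sketched as follows; the host page sets and the edges are invented:

def reciprocity(host_x_pages, host_y_pages, edges):
    """rec(P_XY): count of bi-directional (reciprocal) links between the two hosts."""
    edge_set = set(edges)
    return sum(1 for (u, v) in edge_set
               if (v, u) in edge_set and u in host_x_pages and v in host_y_pages)

def demotion(sim_value, rec_count):
    """Eq. (4): d_v grows as similarity falls and reciprocity rises."""
    return 1.0 - sim_value / (1.0 + rec_count)

edges = [("x1", "y1"), ("y1", "x1"), ("x2", "y2")]       # one reciprocal pair
rec = reciprocity({"x1", "x2"}, {"y1", "y2"}, edges)      # -> 1
print(round(demotion(0.60, rec), 2))                      # 1 - 0.60/2 = 0.70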

V. METHODS AND EXPERIMENT

A. Dataset Preprocessing

Web-related research requires a trusted and freely accessible dataset for experiments. The webspam-uk2007 public dataset [19], which was crawled from the .uk domain with 114,529 hosts, 100 million pages and about 3 billion arcs, is a commonly used dataset in this field. Specifically, we used the .arff version of the content and link features, with 4000 instances, available from [21], which was also used in 2008 for the Web Spam Challenge. First, in the data preprocessing we merged the content and link features into one dataset using the Weka tool, and further assessed the features with Ranker, Search-Eval and InformationGain to see how the attributes perform, in order to select the most significant and relevant ones. We then computed some statistics on the data, including the in- and out-degrees as well as the neighbors, to see the edge distribution as it was in the original web-graph data (which was compressed in .GRAPH format). The results showed that the graph was neither exponential nor random but followed a power-law distribution (with power-law exponent γ = 2.8 and x_min = 30), as shown in Fig. 2(a). We also plotted the distribution of the out-degree to in-degree ratio of the data, shown in Fig. 2(b), which clearly indicates more out-links than in-links. Using the power-law exponent and the total of 4000 nodes, we generated a set of 16,097 edges, of which 14,287 were used, as shown in Table II with their various link types.
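For readers who want to reproduce the power-law check, the sketch below uses the standard maximum-likelihood estimator γ = 1 + n / Σ ln(x_i / x_min); this is our choice of estimator, not necessarily the authors', and the degree sample is invented:

import math

def power_law_exponent(degrees, x_min=30):
    """MLE of the power-law exponent over the tail of the degree distribution."""
    tail = [d for d in degrees if d >= x_min]
    return 1.0 + len(tail) / sum(math.log(d / x_min) for d in tail)

# Invented degree sample; the paper reports gamma ~ 2.8 at x_min = 30 on its data.
degrees = [31, 45, 60, 35, 90, 120, 33, 40, 300, 55]
print(round(power_law_exponent(degrees), 2))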


TABLE II
DISTRIBUTION OF THE GENERATED EDGE LIST

Link Type              Number of edges    Percentage
spam to spam           20                 0.14
spam to nonspam        665                4.65
nonspam to spam        522                3.65
nonspam to nonspam     13080              91.55
Total                  14287              100

It is worth mentioning why the spam-to-spam connection type recorded a relatively small number; this is accounted for by the 208:3641 spam-to-nonspam label set in the webspam-uk2007.05 dataset.

TABLE III
SUMMARY OF VARIED THRESHOLDS WITH THEIR LA SCORES

Threshold    True Authorization    False Authorization
0.65         392                   130
0.70         364                   158
0.80         277                   245
0.90         176                   346
0.95         92                    430

B. Experiment

We set up the experiment in Java with the 522 nonspam-to-spam connected edges as our test sample. For each edge, a hash map is used to extract the source and destination information with their corresponding features from the feature dataset represented by matrix A in the LA algorithm. The features are normalized, after which we compute Sim(source, target) as described in Equation (3) of Section IV. First, a threshold of 0.50 was used, for which the numbers of false and true authorizations detected were 76 and 446, respectively. Other threshold values and their corresponding true and false authorizations are shown in Table III. A graphical representation of the true and false authorizations is also shown in Fig. 3.

C. Results and Discussion

The Sim computation on our test data gave an average score of 0.766, a median of 0.821, a minimum of 0.196 and a maximum of 0.997. We observed no perfect similarity (Sim = 1) or dissimilarity (Sim = 0), unlike other methods such as TrustRank [22] or entries in the Web Spam Challenge [23], where some pages attained scores of 1 or 0. Further, we observed the following in our analysis:



Fig. 3. Varied thresholds for determining True and False Link Authorization

• No two connected pages had either a perfect match or a perfect mismatch in terms of content and structure. This is expected, because a perfect match implies page duplication, which is itself a spam characteristic.
• Connected pages showed some degree of similarity, whether the connection was weak or strong, according to the Sim value.

The threshold φ was varied, as shown in Fig. 3, in order to choose the most suitable value. Increasing φ rendered significantly more pages as false authorizations while the number of true authorizations decreased. Because the LA considers the two connected pages in its computation, the threshold of 0.5 used in many spam-page detection schemes was unable to detect most of these spam effects efficiently. We therefore analyzed the connected page features statistically, together with the authorization patterns, against the different thresholds. As shown in Fig. 3, the equilibrium value of 0.82 gave the best predictor for the LA algorithm. Hence, all nodes to the left of the equilibrium point of the curves in Fig. 3 receive a demotion.
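The threshold sweep behind Table III and Fig. 3 can be sketched as below; the similarity scores are randomly generated stand-ins for the 522 test edges, so the counts and the crossing point will not match the paper's:

import numpy as np

rng = np.random.default_rng(0)
sim_scores = rng.uniform(0.2, 1.0, size=522)   # stand-in for the 522 test edges

def sweep(scores, thresholds):
    rows = []
    for phi in thresholds:
        true_auth = int((scores >= phi).sum())
        rows.append((phi, true_auth, len(scores) - true_auth))
    return rows

for phi, t, f in sweep(sim_scores, [0.5, 0.65, 0.7, 0.8, 0.9, 0.95]):
    print(f"phi={phi:.2f}  True={t}  False={f}")

# Equilibrium: the threshold where the True and False counts are closest.
equilibrium = min(sweep(sim_scores, np.arange(0.5, 1.0, 0.01)),
                  key=lambda r: abs(r[1] - r[2]))
print("equilibrium threshold ~", round(equilibrium[0], 2))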


Fig. 4. Demotion used in False Link Authorization

In Fig. 4, a graphical view of the false-authoritative nodes is shown, where the similarity score is inversely proportional to the demotion value: the higher the spamicity, the higher the demotion, and vice versa. An average demotion of 0.60 was found within the similarity-score range of 0.5 to 0.7. Interestingly, the pages that received the least demotion (in the range 0.2 to 0.5) came from forums and blogs, where very useful and relevant information is found; however, the lack of proper monitoring and control mechanisms on these forums caused the demotions.

Moreover, we observed a higher degree of reciprocity between pages found with false authorization. From Fig. 5, the rate of interconnected links between two hosts has a significant effect on demotion. Nodes with a lower Sim score but higher reciprocity suffered the greatest demotion. Compared with Qi et al.'s removal of unqualified links [12], LA further issues respective demotions to the web pages involved. Demotion is the usual approach used to persuade webmasters and web designers to ensure web sanity. This approach, along with others, can go a long way toward ensuring more reliable web information retrieval.



Fig. 5. Pattern of reciprocity of False Link Authorization

VI. CONCLUSION

This paper first described the nonspam-to-spam link-type scenario. It uses the Link Authorization model to examine the link received at the target page. With relevant content and link features, the computation of a similarity index, Sim, determines whether the link is spam or not, based on a 0.82 threshold. From our experiment, we found that a page's spamicity label (nonspam or spam) is independent of whether it acts as an agent of link spam; both good and bad pages do propagate link spam to other target pages. Nevertheless, the links of this type judged to be truly authoritative recorded an average similarity of 0.9. We also found, through the LA algorithm, that any two connected pages have a similarity score greater than zero but less than one. Our future work aims at enhancing the effectiveness of the LA algorithm through optimization techniques, so that it can quickly adapt to the frequently changing characteristics of individual page features. Moreover, by redirecting the focus of anti-spam techniques, we hope to stir researchers to explore measures that induce spammers to improve their pages in both relevance and structure, so as to receive the rank scores they deserve.

REFERENCES


[1] C. Sunitha, B. M. Preethi, and M. Akshay, "A Comparative Study over Search Engine Optimization on Precision and Recall Ratio," pp. 35–39, 2013.
[2] A. Jain, R. Sharma, G. Dixit, and V. Tomar, "Page ranking algorithms in Web mining, limitations of existing methods and a new method for indexing Web pages," in Proceedings of the 2013 International Conference on Communication Systems and Network Technologies (CSNT 2013), pp. 640–645, 2013.
[3] M. Egele, C. Kolbitsch, and C. Platzer, "Removing web spam links from search engine results," Journal in Computer Virology, vol. 7, pp. 51–62, 2011.
[4] K. Chellapilla and D. Chickering, "Improving cloaking detection using search query popularity and monetizability," in Proceedings of the SIGIR Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 17–24, 2006.
[5] A. A. Benczúr, K. Csalogány, T. Sarlós, and M. Uher, "SpamRank – Fully Automatic Link Spam Detection," in First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 1–14, 2005.
[6] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, "Combating web spam with TrustRank," in Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 576–587, 2004.
[7] S. Sahu, B. Dongre, and R. Vadhwani, "Web Spam Detection Using Different Features," no. 3, pp. 70–73, 2011.
[8] Y. I. Leon-Suematsu, K. Inui, S. Kurohashi, and Y. Kidawara, "Web spam detection by exploring densely connected subgraphs," in Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2011), vol. 1, pp. 124–129, 2011.
[9] K. L. Goh, R. K. Patchmuthu, and A. K. Singh, "Link-based web spam detection using weight properties," Journal of Intelligent Information Systems, vol. 43, pp. 129–145, 2014.
[10] A. A. Benczúr, K. Csalogány, T. Sarlós, and M. Uher, "SpamRank – Fully Automatic Link Spam Detection," in First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 1–14, 2005.
[11] Q. Gan and T. Suel, "Improving Web Spam Classifiers Using Link Structure," 2007.
[12] X. Qi, L. Nie, and B. D. Davison, "Measuring Similarity to Detect Qualified Links," 2007.
[13] P. Metaxas, "Enhancing Information Reliability through Backwards Propagation of Distrust," vol. 2, no. 2, pp. 214–225, 2009.
[14] X. Zhang, Y. Wang, and N. Mou, "Propagating Both Trust and Distrust with Target Differentiation for Combating Web Spam," Intelligence, pp. 1292–1297.
[15] X. Liu, Y. Wang, S. Zhu, and H. Lin, "Combating Web spam through trust-distrust propagation with confidence," Pattern Recognition Letters, vol. 34, no. 13, pp. 1462–1469, 2013.
[16] S. Qiao, T. Li, H. Li, Y. Zhu, J. Peng, and J. Qiu, "SimRank: A PageRank approach based on similarity measure," in Proceedings of the 2010 IEEE International Conference on Intelligent Systems and Knowledge Engineering (ISKE 2010), pp. 390–395, 2010.
[17] S. Agarwal, "Learning to rank on graphs," Machine Learning, vol. 81, pp. 333–357, 2010.
[18] C. Likitkhajorn, A. Surarerks, and A. Rungsawang, "A novel approach for spam detection using boosting pages," in Proceedings of the 2011 8th International Joint Conference on Computer Science and Software Engineering (JCSSE 2011), pp. 91–95, 2011.
[19] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna, "A reference collection for web spam," ACM SIGIR Forum, vol. 40, no. 2, pp. 11–24, 2006.
[20] C. D. Manning, P. Raghavan, and H. Schütze, "Scoring, term weighting and the vector space model," in Introduction to Information Retrieval, pp. 109–133, 2009.
[21] "Web Spam Challenge Phase III Features." [Online]. Available: http://webspam.lip6.fr/wiki/pmwiki.php?n=Main.PhaseIIIFeatures
[22] J. Pei and B. Zhou, "Data Mining Techniques for Web Spam Detection: Why Are Search Engines Useful?" tutorial notes, pp. 1–49.
[23] C. Castillo, K. Chellapilla, and B. D. Davison, "Adversarial Information Retrieval on the Web (AIRWeb 2007)," ACM SIGIR Forum, vol. 42, p. 68, 2008.
