A Structural, Content-Similarity Measure for Detecting Spam Documents on the Web Maria Soledad Pera Yiu-Kai Ng∗ Computer Science Department Brigham Young University Provo, Utah, U.S.A. Email: {[email protected], [email protected]}

Abstract Purpose - The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results downgrade the performance of search engines and frustrate users, who are required to filter out useless information. To improve the quality of Web searches, the number of spam documents on the Web must be reduced, if they cannot be eradicated entirely. Design/methodology/approach - In this paper, we present a novel approach for identifying spam Web documents, which have mismatched titles and bodies and/or a low percentage of hidden content in their markup data structure. Findings - By considering the content and markup of Web documents, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 84.5% of spam/legitimate Web documents, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. Research limitations/implications - Since the bigram-correlation values employed in our spam-detection approach are computed from the unigram-correlation factors, they impose additional computational time during the spam-detection process and could generate a higher number of misclassified spam Web documents. Originality/value - We have verified that our spam-detection approach outperforms existing anti-spam methods by at least 3% in terms of F-measure.

Keywords: Spam detection, structural content, content-similarity measure, accuracy and error rates, F-measure

Paper type: Research paper

∗ Corresponding Author


1

Introduction

The Web is populated with documents in many different subject areas, from personal health care to constitutional law to religious beliefs, as presented in online news articles, research papers, and customer-generated media (i.e., blogs), to name a few; virtually all kinds of information can be found on the Web. With the huge amount of information to sort through, users turn to Web search engines for assistance in locating information of interest. Hence, analyzing the content of (relevant) Web documents and ranking them according to the user's information need is a crucial process in Web information retrieval (IR). As a result, a search engine spider crawling the Web not only gathers useful information, but the resulting rankings of (the contents of) retrieved Web documents also bring significant financial rewards to the owners of the Web sites at which the documents are posted, such as an increased number of commercial transactions on those sites [26]. Hence, there is an economic incentive for manipulating a search engine's rankings [1] by creating Web documents that score high independently of their real contents, even though the intention is unethical. Gradually, more documents are introduced on the Web that are considered legitimate when in fact they are not, and are ranked high when they should not be by existing search engines. As reported in [7] and [23], a significant portion of existing Web documents (between 14% and 22% in 2006) were spam. Spamming is a serious Web IR problem, since it (i) affects the quality of Web searches, (ii) damages a search engine's reputation, and (iii) weakens users' confidence in retrieved results. In general, Web spamming is treated as an attempt by spammers to receive an unjustifiably favorable relevance or high ranking for their Web documents, regardless of the true values of the documents.
Spamming can also be defined as an attempt to deceive a search engine's relevancy ranking algorithm [25]. A number of existing spamming approaches rely on links within Web documents [11] to manipulate Web ranking algorithms, such as PageRank [6], whereas others rely on infecting the content of Web documents [23], e.g., stuffing popular words into, or concatenating words to be included in, Web documents to increase their chance of matching Web queries. However, due to their complexity, neither considering links nor considering multiple statistical features of Web document contents is an effective approach for identifying spam Web documents, as the volume of Web documents is huge. Fetterly et al. [12] use a semantic technique, i.e., actual word counts, on Web documents for detecting spam. Instead, we consider the actual content (i.e., words) of Web documents and use word-similarity values to identify and eliminate spam Web documents. We will show that by (i) considering the degree of similarity among the words in the title and body of a Web document D, which is computed by using their word-correlation factors, (ii) using the percentage of hidden content in the markup data structure within D, and/or (iii) considering the bigram or trigram phrase-similarity values of D, we can determine whether D is spam with high accuracy. The remaining sections are organized as follows. In Section 2, we present existing anti-spam methods and discuss their differences from ours. In Section 3, we introduce our spam-detection approach. In Section 4, we propose an alternative method which further enhances the accuracy in detecting spam Web documents. In Section 5, we analyze the experimental results, which verify the performance of our spam-detection method, and in

Section 5.6 we include a case study which demonstrates the various degrees of accuracy in detecting spam achieved by the different proposed spam-detection methods. In Section 6, we provide the time-complexity analysis of our spam-detection algorithm and its implementation. In Section 7, we conclude and present future work.

2

Related Work

Previous anti-spam work focuses on applying two different strategies: content analysis and link analysis. Ntoulas et al. [23] introduce and combine several anti-spam heuristics based on the content of a Web document D, which include, within D, (i) the number of words, (ii) the average length of the words, (iii) the amount of anchor text, (iv) the fraction of visible content, (v) the fraction of globally popular words, and (vi) the likelihood of independent n-grams. These heuristics are treated as features in classifying spam Web documents. Gyongyi et al. [15] also present several anti-spam heuristics based on the content of a document D, which include (i) detecting the inclusion of terms in the anchor text of D that are unrelated to the referenced Web documents, (ii) computing the amount of repetitive terms in D introduced by a spammer with the intention of increasing its relevance score, (iii) verifying the existence of a large number of unrelated terms, and (iv) identifying the presence of phrase stitching in D, which is intended to increase its degree of relevance to several posted queries. Likewise, Fetterly et al. [12] analyze the statistical features of the host component of a URL and the excessive replication of content to establish content-based features that can be used by Web spam classifiers. As a link-analysis approach, Becchetti et al. [2] introduce a damping function, which classifies spam and non-spam Web documents using the (incoming and outgoing) links within a document without considering its content. Other well-known anti-spam techniques that rely on link analysis are described in [14, 16]. Gyongyi et al. [14] present a semi-automatic algorithm, TrustRank, which uses the link structure of the Web to discover documents that are likely to be legitimate with respect to a small set of (seed) legitimate documents that were manually identified by experts. Since this approach requires human assistance, it is not fully automated.
A spam mass metric, on the other hand, is defined in [16] that reflects the impact of link spamming on the ranking of a document and is used for determining Web documents that are significant beneficiaries of link spamming. In yet another link-based method, Benczúr et al. [4] introduce SpamRank, which is based on PageRank, to identify documents linked by a large number of other documents intended to mislead search engines into ranking their targets higher. Following the premise that the purpose of spam Web sites is to obtain financial gain, Benczúr et al. [5] classify Web documents by extracting features from the documents, which are known to have high advertisement or spam value, to capture their semantic content based on keyword occurrences. These features include (i) the Online Commercial Intention value given to a Web site by the Microsoft adCenter Labs Demonstration, (ii) the distribution of Google AdSense words on a Web site, and (iii) the Yahoo! Mindset classification of a particular Web site as (non-)commercial, etc. The extracted features are combined with the set of features defined in [7] for identifying (content- and link-based) spam Web


documents. According to [5], determining the commercial intent of Web documents can significantly enhance the performance of the decision tree used in [5, 7] for detecting which ones are spam. As opposed to our spam-detection approach, the approach in [5] relies solely on the assumption that spammers seek profit from their Web sites; however, there is no verification of its effectiveness when the Web sites under evaluation are not set up for financial gain but exist for other purposes instead. Becchetti et al. [3] propose a link-based approach for detecting spam Web documents. After conducting a statistical study of the (outgoing and incoming) links in a large collection of Web documents, Becchetti et al., who extract features such as in-degree, out-degree, reciprocity, average in-degree of out-neighbors, etc., which are (some of) the features used thereafter in a C4.5 decision tree for classifying spam Web documents, claim that the classifier is as effective as other existing content-based Web spam-detection approaches. However, they have not considered combining content-based classifiers with their (or any other existing) link-based approach, which could enhance the performance of their spam-detection method in correctly identifying spam Web documents. Urvoy et al. [27] present a spam Web document detection method based on the internal structure of HTML documents rather than their contents. The authors develop an algorithm that relies on a style-similarity measure, computed using both the textual and extra-textual features in the HTML source code of a Web page, such as the average number of occurrences of a particular HTML tag and of anchor links in a large collection of Web documents, for clustering Web documents and identifying the spam clusters. Just like [3], Urvoy et al. emphasize the necessity of combining their spam-detection method with content-based spam Web document classifiers.
Although in our approach we also consider the HTML markup content in existing Web documents for spam detection, the effectiveness of our approach largely depends on the content (i.e., words) of the Web documents, which does not require any prior analysis of document structures in spam detection. Caverlee and Liu [9] introduce a ranking algorithm which, as opposed to PageRank or TrustRank, relies on the credibility of Web documents to assess their quality. The credibility value of a given Web document is computed according to the quality of the links specified in the document. What is more, the algorithm in [9] allows its users to determine different levels of "spam-tolerance" according to their preferences. Although the experimental results in [9] seem favorable, the evaluations were conducted by using only pornography-related Web documents as spam, which excludes a large number of Web pages that are not pornography-related but are nonetheless spam. The spam-detection approach of Liu et al. [20] is neither content- nor link-based; instead, it relies on user behaviors and Bayes learning. The proposed method analyzes user-behavior patterns as shown in a collected Web-access log and uses three different features (the search-engine-oriented visiting ratio, the number of clicks on hyperlinks in a Web document, and the number of sessions in which a user visits fewer than a previously-defined number of pages within a given Web site) for training a naive Bayes learner to detect spam Web documents. Since the approach is based on a Bayes learner that must be trained to accurately detect spam Web documents, it differs from our spam-detection approach, which does not require any training process. Goodstein and Vassilevska [13] adapt the approach in [28] to develop a spam-detection mechanism, which is set up as a game in which users, i.e., players, label the results retrieved by a search engine as relevant or non-relevant with respect to a particular query. Using the provided answers, a voting algorithm is invoked for classifying Web documents as (non-)spam. It is assumed that previously-generated Web document rankings are modified according to the voting results from the game, and thereafter Web documents labeled as spam are removed entirely from the rankings, preventing spammers from bypassing Web spam filters. As opposed to our spam-detection approach, the detection method in [13] is entirely dependent on users' classifications of Web documents, which may not always be reliable, since the users are only given a snapshot of the Web pages and a short period of time to decide whether they are (non-)spam. As a result, their answers might not be accurate, which could jeopardize the quality of the modified rankings. None of the approaches discussed above relies on an actual word-semantic measure on the content of a given Web document to detect spam Web documents, which is our spam-detection approach. In [24] we demonstrate that using the content of emails for detecting junk emails is effective, and we adapt the same strategy for finding spam Web documents in this paper.

3

Our spam-document detection approach

As discussed earlier, spam Web documents are a burden for (i) Web servers, since (among other things) undetected spam pages waste storage space and the processing time required to index and maintain them, and (ii) users, who must deal with low-quality retrieved results (caused by spamming) when performing Web searches. To neutralize existing spamming tricks, we rely on the content (i.e., words and phrases) and/or the proportion of markup structure of a given Web document D to determine whether D should be treated as spam.

3.1

Title and body of a Web document

In [17, 19] the authors claim that the title of a document often reflects its content, and we are confident that the same concept applies to Web documents as well: a legitimate Web document is a regular document with a title that describes its content, whereas the title of a spam Web document often does not reflect its content [21]. Consider the legitimate Web document (http://www.mothersbliss.co.uk/shopping/index) in Figure 1, in which the title reflects the content of the document, whereas Figure 2 shows a spam Web document (http://extc.co.uk) whose title and content mismatch. We analyze the content of a Web document to determine how closely its title is related to its body (as discussed in Sections 3.2 and 3.4) and calculate the fraction of hidden content (i.e., markup data structure) of a Web document (in Section 3.3), if necessary, in detecting spam Web documents. Hereafter, we discuss an enhanced content-similarity measure using bigrams and trigrams (in Section 3.4).

3.2

Degrees of similarity of Web documents

We rely on the similarity measure between the contents of, i.e., the sequences of words in, the title T and the body B of a Web document to identify spam documents. We determine the degree of similarity between T and B using the correlation

Figure 1: The title and (portion of the) content of a legitimate Web document

Figure 2: The title and (portion of the) content of a spam Web document

factors of words in T and B as defined in a precomputed word-correlation matrix. The matrix is generated by analyzing a set of 880,000 Wikipedia documents (downloaded from http://www.wikipedia.org/) to calculate the correlation factor (i.e., similarity value) of any two words1 based on their (i) frequency of co-occurrence and (ii) relative distance, as defined below:

    ci,j = Σwi∈V(i) Σwj∈V(j) 1 / d(wi, wj)    (1)

where d(wi, wj) denotes the distance between any two words wi and wj in any Wikipedia document D, and V(i) (V(j), respectively) is the set of stem variations of word i (j, respectively) in D. If wi and wj are consecutive words, then d(wi, wj) = 1, whereas if wi and wj do not co-occur in D, then d(wi, wj) = ∞, i.e., 1/d(wi, wj) = 1/∞ = 0. To avoid bias that occurs in documents of large size, we normalize the word-correlation factors as follows:

    Ci,j = ci,j / (|V(i)| × |V(j)|)    (2)

where ci,j is as defined in Equation 1.

3.2.1

Content similarity of the title and body

Using the normalized word-correlation factors, we can compute the degree of similarity (between the words) of the title T and the body B of a Web document D, denoted SimT B. We focus only on T and B of D, since as shown in the Experimental Results section (Section 5), SimT B of D can accurately determine whether D is spam or legitimate2 . To determine the SimT B value of D, we calculate the similarity value of each word t in T with respect to each word b in B of D. The higher the correlation factors among t and the words in B, the higher the similarity value between t and B, denoted µt,B . 1 Words in the documents were stemmed (i.e., reduced to their grammatical roots, e.g., “computer”, “computing”, and “computation” are converted to “compute”) after all the stopwords (i.e., words that carry little meaning, e.g., articles, prepositions, and conjunctions) were removed, which minimized the number of words to be considered. 2 In the case in which there is no title in D, we consider the hidden content (i.e., the markup structure) of D. (See Section 3.3.)
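To make the computation of Equations 1 and 2 concrete, the following is a minimal Python sketch over a single toy document; the function name and the position-based treatment of stem variations are our own illustrative assumptions, not the authors' implementation (the actual factors are precomputed over the 880,000-document Wikipedia corpus).

```python
# Illustrative sketch of Equations 1 and 2 (not the authors' code).
# `doc` is one document given as a list of stemmed, stopword-free words.
from itertools import product

def correlation_factor(doc, word_i, word_j):
    """Normalized word-correlation factor C_{i,j} of Equations 1 and 2."""
    # V(i): positions of the occurrences (stem variations) of word_i in doc
    pos_i = [p for p, w in enumerate(doc) if w == word_i]
    pos_j = [p for p, w in enumerate(doc) if w == word_j]
    if not pos_i or not pos_j:
        return 0.0  # non-co-occurring words: 1/d = 1/infinity = 0
    # Equation 1: sum the inverse distances over all occurrence pairs
    c_ij = sum(1.0 / abs(pi - pj)
               for pi, pj in product(pos_i, pos_j) if pi != pj)
    # Equation 2: normalize by |V(i)| * |V(j)| to remove document-length bias
    return c_ij / (len(pos_i) * len(pos_j))

doc = "baby name baby announce birth".split()
print(correlation_factor(doc, "baby", "name"))    # → 1.0 (two adjacent pairs)
print(correlation_factor(doc, "name", "flight"))  # → 0.0 (never co-occur)
```

A real deployment would aggregate these factors over every corpus document and store them in the precomputed matrix that the paper queries at detection time.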


            Names      Announce   Time       Diet       Baby       Answer     ...  µ value
Baby        8.9×10−8   1.6×10−7   9.8×10−8   7.1×10−8   1          7.3×10−8   ...  1
Pregnancy   5.8×10−4   4.1×10−5   1.4×10−7   6.1×10−7   3.7×10−3   2.5×10−10  ...  1
Discover    6.1×10−8   7.1×10−2   1.4×10−8   3.3×10−8   8.9×10−8   8.1×10−9   ...  2.3×10−4
Magic       6.9×10−8   5.8×10−8   9.5×10−8   4.9×10−8   1.3×10−7   6.1×10−8   ...  8.7×10−5
...         ...        ...        ...        ...        ...        ...        ...  ...

Table I: Word-correlation factors and the µ values of some of the words in the title with respect to the words in the body of the legitimate Web document as shown in Figure 1

            Computers   Internet   Electronics   Mortgages   Credit     Flights    ...  µ value
Compare     9.1×10−8    2.5×10−7   5.9×10−8      2.3×10−8    1.1×10−8   6.1×10−8   ...  2.9×10−7
Find        2.2×10−7    2.1×10−7   1.4×10−8      5.3×10−9    1.6×10−8   2.0×10−8   ...  1.9×10−7
Resources   6.2×10−8    4.0×10−8   5.4×10−8      4.5×10−8    1.7×10−7   2.0×10−8   ...  4.7×10−7

Table II: Word-correlation factors and the µ values of the words in the title with respect to the words in the body of the spam Web document as shown in Figure 2

    µt,B = 1 − ∏b∈B (1 − Ct,b)    (3)

where µt,B is defined as the complement of a negated algebraic product, instead of the algebraic sum, and Ct,b is as defined in Equation 2. Once the µ value of each word in T with respect to all the words in B is calculated, we can determine the degree of similarity between T and B, which is the average similarity value of each word t in T with respect to all the words in B, as

    SimTB(T, B) = (µ1,B + µ2,B + ... + µn,B) / n    (4)

where n is the number of words in T. A high (low, respectively) SimTB(T, B) value reflects a high (low, respectively) degree of similarity between (the contents of) T and B.

Example 1 Table I (Table II, respectively) shows the correlation factors among the words in the title and the body of the Web document D1 in Figure 1 (D2 in Figure 2, respectively). The degree of similarity, i.e., SimTB, of the title T1 and the body B1 of D1 is 0.88, whereas the degree of similarity between the title T2 and the body B2 of D2 is 3.2×10−7. According to the computed SimTB values, D1 (D2, respectively) is highly likely a legitimate (spam, respectively) document. □

3.2.2

Similarity threshold value

Having computed the SimTB value of a given Web document D, we must determine an appropriate word-similarity threshold value V so that if SimTB of D ≥ V, then D is considered legitimate; otherwise, D is treated as spam. An ideal similarity threshold should (i) reduce to a minimum the number of spam Web documents identified as legitimate, i.e., false negatives (FNs), and (ii) avoid treating legitimate Web documents as spam,

(a) SimT B threshold values

(b) Hidden-content threshold values

Figure 3: Numbers of False Positives (FPs) and False Negatives (FNs) computed by using different possible similarity threshold values on the Web documents in the Threshold set

i.e., false positives (FPs). To determine the correct similarity threshold, we consider (i) Web documents in a set, called the Threshold set, and (ii) a number of possible similarity threshold values, which yield different numbers of FPs and FNs. The Threshold set is a collection of 370 previously classified spam and non-spam Web documents (170 spam and 200 non-spam) randomly selected from the WEBSPAM-UK2006 dataset (http://www.yrbcn.es/Webspam/datasets/), which is a well-known, publicly available reference collection for Web spam research that consists of 77.9 million spam and non-spam Web documents. As shown in Figure 3(a), the optimal word-similarity threshold is 0.80, since at 0.80 the total number of FPs and FNs is reduced to a minimum and neither the number of FPs nor the number of FNs dominates the other3. Hence, we declare a Web document D as

    Status(D) = { Legitimate,  if SimTB(TD, BD) ≥ 0.80
                { Spam,        otherwise                    (5)

where TD (BD, respectively) denotes the (content of the) title (body, respectively) of D. Using Equation 5, we classify the Web document D1 in Figure 1 as legitimate, since SimTB(T1, B1) = 0.88 ≥ 0.80, whereas we classify the Web document D2 in Figure 2 as spam, since SimTB(T2, B2) = 3.2×10−7 < 0.80.
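Equations 3 to 5 can be sketched as follows; the toy correlation dictionary stands in for the precomputed word-correlation matrix, so the names and values below are illustrative assumptions only.

```python
# Sketch of Equations 3-5 with a toy correlation lookup (not the real matrix).
from math import prod

TOY_C = {("baby", "baby"): 1.0, ("baby", "name"): 3.7e-3}  # assumed values

def corr(t, b):
    return TOY_C.get((t, b), TOY_C.get((b, t), 0.0))

def mu(t, body):
    # Equation 3: complement of the negated algebraic product
    return 1.0 - prod(1.0 - corr(t, b) for b in body)

def sim_tb(title, body):
    # Equation 4: average mu value over the n words of the title
    return sum(mu(t, body) for t in title) / len(title)

def status(title, body, threshold=0.80):
    # Equation 5: the 0.80 word-similarity threshold from Figure 3(a)
    return "Legitimate" if sim_tb(title, body) >= threshold else "Spam"

print(status(["baby"], ["baby", "name"]))    # → Legitimate (exact title word)
print(status(["flight"], ["baby", "name"]))  # → Spam (no correlated words)
```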

3.3

Fraction of the hidden content

Even if a Web document D lacks a title, we can still determine whether D is spam by considering the fraction of hidden content in D. Ntoulas et al. [23] define the visible content of a Web document D as the length (in characters) of all non-markup words in D divided

3 We verified the correctness of the similarity threshold value, as well as the other threshold values in this paper, using another Threshold set S, which consists of 100 (38 spam and 62 non-spam) documents from WEBSPAM-UK2006; S yields the same threshold values. Thus, we are confident that the threshold values are accurately defined.


by the total size (in characters) of D, and claim that spam Web documents often contain less markup than legitimate documents. We adapt this heuristic, but instead compute the fraction of hidden content, denoted HC, i.e., the proportion of markup content, of D as

    HC(D) = (Size of markup content of D) / (Total size of D)    (6)

where the size of the markup content and the total size of D are measured in characters. Again, upon computing the HC value of a Web document D, we must apply an appropriate threshold value, denoted HC-threshold, so that if HC(D) ≥ HC-threshold, then D is considered legitimate; otherwise, D is treated as spam. To determine an appropriate HC-threshold value, we used the same Threshold set previously described and computed the number of FPs and FNs according to each potential HC-threshold value. Figure 3(b) shows that the ideal HC-threshold value is 0.75, since the total number of FPs and FNs is reduced to a minimum at 0.75. Excluding the title in the HTML document D1, a legitimate Web document (D2, a spam Web document, respectively), as shown in Figure 1 (Figure 2, respectively), the fraction of hidden content of D1 (D2, respectively) is 0.87 (0.23, respectively). Using the chosen HC-threshold value, i.e., 0.75, we correctly classify D1 and D2 as legitimate and spam, respectively.
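A minimal sketch of Equation 6 and the 0.75 HC-threshold follows; the tag-stripping regular expression is a simplification we assume for illustration, not the paper's markup parser.

```python
# Sketch of Equation 6: fraction of hidden (markup) content, in characters.
import re

def hidden_content_fraction(html):
    visible = re.sub(r"<[^>]*>", "", html)   # crude tag stripping (assumption)
    markup_size = len(html) - len(visible)   # characters that belong to markup
    return markup_size / len(html)           # Equation 6

def status_by_hc(html, threshold=0.75):
    # HC(D) >= 0.75 -> legitimate; spam pages tend to carry less markup
    return "Legitimate" if hidden_content_fraction(html) >= threshold else "Spam"

page = "<html><head><title>t</title></head><body>word</body></html>"
print(status_by_hc(page))   # → Legitimate (markup dominates this tiny page)
```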

3.4

The use of bigrams and trigrams

We have observed that whenever there is at least one word4 in the title T of a spam Web document D that also appears in the body B of D, then SimTB(T, B) is high, which causes our spam-detection approach to misclassify D as legitimate. As a result, our spam-detection method yields a higher than expected number of false negatives. In order to further enhance our spam-detection approach, we consider bigram and trigram, instead of unigram (i.e., single-word, as presented in Section 3.2.1), phrase-correlation factors of T and B in determining the content similarity between T and B. We consider bigrams and trigrams (as opposed to longer phrases), since, as claimed by [22] and verified by us, short phrases (i.e., bigrams and trigrams) increase retrieval effectiveness, whereas using phrases of 4 or more words tends to retrieve unreliable results.

3.4.1

The phrase-similarity value

In computing the phrase-correlation factors of any two n-grams (2 ≤ n ≤ 3), we apply the Odds [18] ratio, i.e., Odds(H) = p(H) / (1 − p(H)), on the normalized word-similarity factors as defined in Equation 2. Odds measures the predictive or prospective support for a hypothesis H (i.e., n-grams in our case) using prior knowledge p(H), i.e., the word-correlation factors of the n-grams, to determine the strength of a belief, which is the phrase-correlation factor in our case. We determine the phrase-correlation factor, denoted pcf, between any two n-grams (2 ≤ n ≤ 3) p1 and p2 as

    pcfp1,p2 = (∏i=1..n Cp1i,p2i) / (1 − ∏i=1..n Cp1i,p2i)    (7)

4 After stopwords are removed and the remaining words are reduced to their stems.
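Equation 7 can be sketched as below; the `corr` lookup and its toy values are assumptions standing in for the normalized factors of Equation 2, and the guard for an exact match (product equal to 1) is our own addition.

```python
# Sketch of Equation 7: Odds ratio over the product of positional
# word-correlation factors of two equal-length n-grams (n = 2 or 3).
from math import prod

def pcf(p1, p2, corr):
    p = prod(corr(w1, w2) for w1, w2 in zip(p1, p2))  # pair word i with word i
    if p >= 1.0:
        return float("inf")   # exact phrase match: odds of probability 1
    return p / (1.0 - p)      # Odds(H) = p(H) / (1 - p(H))

TOY_C = {("baby", "baby"): 0.9, ("pregnancy", "name"): 0.1}  # assumed values
corr = lambda a, b: TOY_C.get((a, b), TOY_C.get((b, a), 1e-8))
print(pcf(("baby", "pregnancy"), ("baby", "name"), corr))  # ≈ 0.0989
```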

                      Baby name    Name birth   Birth announce   Announce ready   ...  µ value
Baby pregnancy        4.1×10−5     1.2×10−14    1.3×10−11        5.9×10−10        ...  3.6×10−4
Pregnancy motherhood  7.7×10−11    2.4×10−11    2.8×10−15        3.7×10−12        ...  3.7×10−9
Motherhood discover   6.9×10−15    8.1×10−13    4.2×10−8         2.9×10−13        ...  7.4×10−9
Discover magic        6.1×10−15    1.3×10−16    3.5×10−10        3.1×10−9         ...  1.8×10−10
...                   ...          ...          ...              ...              ...  ...

Table III: The phrase-correlation factors and µ values of some of the bigrams in the title with respect to some of the bigrams in the body of the legitimate Web document in Figure 1

                               Baby name birth   Name birth announce   Birth announce ready   Announce ready baby   ...  µ value
Baby pregnancy motherhood      2.4×10−11         2.4×10−22             7.8×10−17              6.7×10−17             ...  1
Pregnancy motherhood discover  4.7×10−16         1.7×10−12             3.9×10−20              1.2×10−19             ...  1.4×10−12
Motherhood discover magic      1.4×10−23         4.7×10−17             1.8×10−15              3.6×10−16             ...  6.1×10−13
Discover magic mother          3.7×10−21         2.5×10−24             2.1×10−15              3.5×10−16             ...  8.9×10−17
...                            ...               ...                   ...                    ...                   ...  ...

Table IV: The phrase-correlation factors and µ values of some of the trigrams in the title with respect to some of the trigrams in the body of the legitimate Web document in Figure 1

where p1i and p2i are the ith (1 ≤ i ≤ n) words in p1 and p2, respectively, and Cp1i,p2i is the normalized word-similarity value as defined in Equation 2. Using the computed phrase-correlation factors, we can replace Ct,b in Equation 3 with pcfp1,p2 to determine (i) the µ value between an n-gram (2 ≤ n ≤ 3) in T and all the n-grams in B, and (ii) the degree of similarity between T and B, i.e., SimTB, using the computed µ values for n-grams and Equation 4, which overcomes the unigram problem that arises when a unigram in T appears in B. In adopting Equation 4 to compute the degree of similarity, n in the equation represents the number of bigrams (trigrams, respectively), instead of unigrams, in T. Table III (Table IV, respectively) shows some of the phrase-correlation factors between the bigrams (trigrams, respectively) in the title and body of the legitimate document in Figure 1. Example 2 Figure 4 shows a spam Web document D (http://khs.co.uk) in which the word KHS in its title T is repeated in its body B, yielding SimTB(T, B) = 0.84 by Equation 4 on word-similarity measures. Using the word-similarity threshold value, 0.80, as defined in Section 3.2.2, D is misclassified as legitimate. However, when considering the bigrams in T

Figure 4: A sample spam Web document that is misclassified as legitimate when the (unigram) word-similarity measure is applied, but is correctly classified as spam when the (bigram) phrase-similarity value is considered

(a) Bigram-similarity threshold values

(b) Trigram-similarity threshold values

Figure 5: Numbers of FPs and FNs computed by using different possible bigram- and trigram-similarity threshold values on the Web documents in the Threshold set

and B, SimTB(T, B) = 0.57, and using the threshold value (defined below), D is correctly classified as spam. □

3.4.2

The phrase-similarity threshold value

Prior to using the phrase-correlation factors, we define the bigram- (trigram-, respectively) similarity threshold value V so that for any Web document D, if SimTB(TD, BD) ≥ V, where SimTB(TD, BD) is computed by using the bigram- or trigram-correlation factors, then D is considered legitimate; otherwise, D is treated as spam. In determining an ideal phrase-similarity threshold, we use the same Threshold set (in Section 3.2.2) to compute the number of FPs and FNs according to different potential phrase-similarity threshold values and choose the value V such that the total number of FPs and FNs at V is reduced to a minimum. Figure 5(a) shows that the optimal bigram-similarity threshold value is 0.75, whereas Figure 5(b) indicates that the optimal trigram-similarity threshold value is 0.65.


4

An enhanced similarity-measure method

We have considered alternative approaches to augment the use of phrase-correlation factors in computing the SimTB value of a Web document that can further enhance the performance of our spam-detection approach, i.e., minimizing the number of misclassified Web documents. An alternative approach is to determine the similarity among n-gram phrases5 (2 ≤ n ≤ 3) in the title T with respect to the ones in the body B of a Web document D, and penalize D with a lower SimTB value if B contains phrases that are similar to only a few phrases in T, while rewarding D with a higher SimTB value if B contains phrases which are closely related to a number of phrases in T.

4.1

The enhanced phrase-similarity approach

The enhanced phrase-similarity approach assures that if each phrase in the title T is closely related to (or matches exactly with) a phrase in the body B, then the corresponding Web document D is more likely legitimate; otherwise, D is likely spam. We compute the enhanced similarity value between T and B of D by first calculating the sum of the phrase-correlation factors of each bigram (trigram, respectively) p_t in T with respect to the bigram (trigram, respectively) phrases in B, i.e.,

    spcf_{p_t,B} = \sum_{j=1}^{m} pcf_{p_t,j}                    (8)

where pcf_{p_t,j} is the phrase-correlation factor as defined in Equation 7 and m is the total number of bigrams (trigrams, respectively) in B. Once the spcf value of each bigram (trigram, respectively) in T has been calculated, we compute the enhanced degree of similarity between T and B, denoted enSimTB, as

    enSimTB(T, B) = \sum_{i=1}^{n} Min(spcf_{i,B}, 1)            (9)

where n is the total number of bigrams (trigrams, respectively) in T. In calculating the enSimTB value, we add up the minimum of 1 and the spcf value of each n-gram phrase p_{t_i} in T. We do so in order to restrict the similarity value of each p_{t_i} in T with respect to the phrases in B to 1, which is the similarity value of an exact match; otherwise, the spcf value could be given too much weight over an exact match, which could raise the enSimTB value much higher than necessary on a "few" good (or exact) matches. To avoid the length bias in T, we normalize an enSimTB value as

    EnSimTB(T, B) = enSimTB(T, B) / n                            (10)

where n is the total number of bigrams (trigrams, respectively) in T, and 0 ≤ EnSimTB(T, B) ≤ 1.

^5 In the case when no bigrams or trigrams are available in the title T, i.e., when after stopword removal and stemming only one word remains in T, the (single) word-similarity is considered instead.
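Equations 8 through 10 can be transcribed directly into code. The sketch below assumes the phrase-correlation factors pcf are supplied as a lookup table (here a plain dict keyed by phrase pairs; the paper precomputes them from the word-correlation matrix), and the correlation values in the example are hypothetical, chosen only to illustrate the capping at 1.

```python
def en_sim_tb(title_phrases, body_phrases, pcf):
    """EnSimTB(T, B) per Equations 8-10: for each title phrase, sum its
    phrase-correlation factors over all body phrases (Eq. 8), cap the sum
    at 1, the value of an exact match (Eq. 9), and normalize by the
    number of title phrases (Eq. 10)."""
    if not title_phrases:
        return 0.0
    total = 0.0
    for pt in title_phrases:
        spcf = sum(pcf.get((pt, pb), 0.0) for pb in body_phrases)  # Eq. 8
        total += min(spcf, 1.0)                                    # Eq. 9
    return total / len(title_phrases)                              # Eq. 10

# Hypothetical correlation factors, for illustration only.
pcf = {("loan advice", "cash advance"): 0.8,
       ("loan advice", "payday loan"): 0.6,
       ("advice site", "cash advance"): 0.1}
title = ["loan advice", "advice site"]
body = ["cash advance", "payday loan"]
print(en_sim_tb(title, body, pcf))  # → 0.55, i.e., (min(1.4, 1) + 0.1) / 2
```

Note how the first title phrase's summed correlation (1.4) is capped at 1 before averaging, exactly the restriction Equation 9 imposes.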


Figure 6: Determining the ideal bigram EnSimTB threshold value and a classification example using the SimTB versus EnSimTB value based on the bigram-similarity value. (a) Bigram EnSimTB threshold values; (b) a sample spam Web document misclassified by the SimTB value.

4.2 The EnSimTB threshold value

We define the appropriate threshold value for EnSimTB, which yields the cut-off value between spam and legitimate Web documents. Using the same Threshold set and different possible threshold values, we determine the number of FPs and FNs for each of the possible thresholds for EnSimTB. As shown in Figure 6(a), the bigram EnSimTB-threshold value should be 0.67, which yields the minimal sum of FPs and FNs. (Note that the trigram EnSimTB-threshold value is not computed, since, as shown in Section 5, bigrams outperform trigrams in the similarity measure and hence we only consider bigrams from here on.)

Example 3. Table V shows how closely related some of the bigrams in the title of the spam Web document D in Figure 6(b) are to the ones in its body. Using the bigram SimTB value of D, which is 0.75, D is misclassified as legitimate, since SimTB(T_D, B_D) ≥ 0.75, the bigram SimTB threshold value. However, when the EnSimTB value is considered instead, D is correctly classified as spam, since EnSimTB(T_D, B_D) = 0.5 < 0.67, the bigram EnSimTB threshold value. □
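The threshold search described above amounts to a sweep over candidate cut-offs, counting false positives (legitimate pages whose similarity score falls below the cut-off, hence flagged as spam) and false negatives (spam pages scoring at or above it). The scores, labels, and candidate values below are illustrative, not taken from the Threshold set.

```python
def best_threshold(scores, is_spam, candidates):
    """Pick the cut-off minimizing FPs + FNs, where a document whose
    similarity score falls below the cut-off is labeled spam."""
    def errors(t):
        fp = sum(1 for s, spam in zip(scores, is_spam) if not spam and s < t)
        fn = sum(1 for s, spam in zip(scores, is_spam) if spam and s >= t)
        return fp + fn
    return min(candidates, key=errors)

# Toy data: spam pages tend to score low on title/body similarity.
scores  = [0.9, 0.8, 0.7, 0.55, 0.4, 0.2]
is_spam = [False, False, False, True, True, True]
print(best_threshold(scores, is_spam, [0.3, 0.5, 0.67, 0.75]))  # → 0.67
```

On real data the sweep would run over the labeled Threshold set, and ties between candidates could be broken by either error type.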

4.3 Verifying the threshold values

To validate the correctness of the threshold values, i.e., the HC-threshold, word-similarity threshold, bigram-similarity threshold, trigram-similarity threshold, and bigram EnSimTB-threshold, used in our spam-detection approach, we conducted an empirical study using six disjoint subsets, with 2,000 Web pages each, randomly selected from WEBSPAM-UK2006. Out of the six collections, two contain 50% spam Web pages, whereas the remaining four include 20%, 30%, 60%, and 80% spam Web pages, respectively. As shown in Table VI, with the exception of the trigram-similarity threshold, the accuracy ratio generated by each of the six subsets using the HC-threshold, word-similarity threshold, bigram-similarity threshold, or bigram EnSimTB-threshold, respectively, for detecting (non-)spam Web pages remains relatively consistent. The HC-threshold, however, generated a lower accuracy ratio^6 (with respect to its maximum and minimum, as well as its accuracy-ratio range, i.e., between 52% and 65%). For this reason, we re-computed and re-evaluated the HC-threshold. The re-computed HC-threshold value, which is set to 0.60, is based on the numbers of FPs and FNs generated by different potential threshold values using a set of 1,000 Web pages from WEBSPAM-UK2006, out of which 450 are spam Web pages, matching the percentage of spam used in the Threshold Set (introduced in Section 3.2.2) for defining the different threshold values. Furthermore, adopting the same threshold-evaluation strategy presented in Section 3.2.2, we verified the correctness of the new HC-threshold value using a new threshold set with the same settings as Threshold Set S. (See Footnote 3.) Thereafter, we re-evaluated the new HC-threshold using the same six subsets of Web pages. As shown in Table VI, using the new HC-threshold value on the six subsets of Web pages, the accuracy remains consistent. In addition, the appropriateness of the established threshold values^7 is further confirmed by the high accuracy achieved in detecting spam Web pages using various large test sets. (See Section 5 for details.)

Bigrams in the Title | Cash advance | Advance payday | Credit report | Card debt  | ... | spcf value | Min(spcf, 1)
Loan advice          | 1.5×10^-9    | 2.1×10^-14     | 5.0×10^-16    | 6.5×10^-16 | ... | 2.14       | 1.00
Advice site          | 2.7×10^-16   | 2.6×10^-15     | 4.5×10^-15    | 2.9×10^-15 | ... | 7.2×10^-12 | 7.2×10^-12
Site loan            | 4.5×10^-13   | 1.5×10^-16     | 1.3×10^-15    | 2.3×10^-15 | ... | 9.7×10^-10 | 9.7×10^-10
Loan quote           | 4.4×10^-15   | 4.1×10^-13     | 2.4×10^-16    | 5.9×10^-14 | ... | 3.4×10^-14 | 3.4×10^-14
...                  | ...          | ...            | ...           | ...        | ... | ...        | ...

Table V: The bigram-similarity values and spcf values for some of the bigrams in the title T of the Web document in Figure 6(b) with respect to the bigrams (Cash advance, Advance payday, Credit report, Card debt, ...) in its body B.

4.4 The overall spam-detection process

Figure 7 shows the overall process of our spam-detection approach, which is described as follows: (i) when analyzing a Web document D, if D is detected without a title, then (ii) the HC (i.e., Hidden Content) value v of D is computed, so that if v ≥ HC-threshold (< HC-threshold, respectively), i.e., 0.60, then D is treated as legitimate (spam, respectively). Otherwise, i.e., D contains a title, (iii) the SimTB value e of D using the chosen type of n-gram (1 ≤ n ≤ 3) is computed, and if e ≥ the corresponding n-gram similarity threshold, i.e., 0.80 for unigrams, 0.75 for bigrams, and 0.65 for trigrams, then (iv) the HC value v of D is calculated. (The HC value is evaluated at this stage as an additional step to provide further evidence for classifying D as legitimate or spam.) If v ≥ HC-threshold (< HC-threshold, respectively), then D is classified as legitimate (spam, respectively). Otherwise, i.e., e < the corresponding n-gram similarity threshold, (v) the EnSimTB value E of D (on bigrams) is computed. (We calculate the EnSimTB value, as part of our spam-detection process, as an extra step for reducing the number of FPs and FNs that could be generated.) If E ≥ the bigram EnSimTB-threshold, i.e., 0.67, then D is considered legitimate; otherwise, D is categorized as spam. Note that since using bigrams considerably reduces the number of FPs and FNs compared with using unigrams or trigrams (see Section 5), we do not consider unigrams or trigrams any further in Step (v).

^6 The maximum, minimum, and average accuracy ratios generated by the HC-threshold value are comparable to the corresponding ones generated by the trigram-threshold value, which is not a reliable measure for detecting spam Web pages and thus is eventually excluded from consideration in our spam-detection process. (See the experimental results presented in Section 5 for a detailed discussion.)
^7 From now on, whenever we refer to the HC-threshold value, we mean the new HC-threshold value.

Threshold            Min(imum)  Max(imum)  Ave(rage)  Largest Difference, i.e., MAX(Ave-Min, Max-Ave)
HC                   0.52       0.65       0.61       0.09
Word-similarity      0.61       0.73       0.66       0.07
Bigram-similarity    0.66       0.81       0.73       0.08
Trigram-similarity   0.51       0.68       0.66       0.15
Bigram EnSimTB       0.62       0.78       0.72       0.10
New HC               0.74       0.77       0.75       0.02

Table VI: The minimum, maximum, and average accuracy values generated for each threshold value using the six disjoint subsets of randomly-selected Web pages from the WEBSPAM-UK2006 dataset.

Figure 7: The overall Web spam-detection process.
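The decision flow of Figure 7 can be summarized in code. In this sketch the similarity and hidden-content computations are abstracted as callables, and the threshold constants are the bigram values quoted in the text (0.60, 0.75, and 0.67); the page dictionary and lambda scores are illustrative.

```python
HC_T, SIMTB_T, ENSIMTB_T = 0.60, 0.75, 0.67  # bigram thresholds from the text

def classify(doc, hc, sim_tb, en_sim_tb):
    """Sketch of the Figure 7 pipeline for the bigram case. `hc`, `sim_tb`,
    and `en_sim_tb` are callables returning the document's hidden-content
    fraction, SimTB, and EnSimTB values, respectively."""
    if not doc.get("title"):                     # (i) no title:
        return "legitimate" if hc(doc) >= HC_T else "spam"  # (ii) HC test
    if sim_tb(doc) >= SIMTB_T:                   # (iii) title/body similar
        return "legitimate" if hc(doc) >= HC_T else "spam"  # (iv) HC test
    # (v) low SimTB: fall back on the enhanced similarity measure
    return "legitimate" if en_sim_tb(doc) >= ENSIMTB_T else "spam"

# Toy example: titled page with low SimTB but adequate EnSimTB.
page = {"title": "loan advice site", "body": "..."}
print(classify(page, hc=lambda d: 0.2, sim_tb=lambda d: 0.5,
               en_sim_tb=lambda d: 0.7))  # → legitimate
```

The lazy evaluation mirrors the text: EnSimTB is only computed when the cheaper SimTB test is inconclusive.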

5 Experimental results

In this section we discuss the two datasets (in Section 5.1) used for our empirical study and show the accuracy of using n-gram (1 ≤ n ≤ 3) phrases in our spam-detection approach (in Section 5.2), which verifies the effectiveness of our approach in detecting spam Web documents (in Section 5.3). In addition, we compare the performance of our spam-detection approach with other well-known anti-spam methods (in Section 5.4) and evaluate and verify the consistency of our approach in accurately detecting spam documents using corpora samples of Web pages with different percentages of spam (in Section 5.4). We include a case study (in Section 5.6) to demonstrate the degrees of accuracy in detecting spam Web documents at various steps of the entire spam-detection process, as shown in Figure 7.

5.1 Web document dataset

In verifying the effectiveness of our spam-detection approach in terms of accuracy, which is measured by the number of Web documents correctly classified as spam or legitimate versus the number of FPs and FNs, we used nine subsets of randomly-selected Web pages from the WEBSPAM-UK2006 dataset^8, which (as stated in Section 3.2.2) contains 77.9 million Web documents. Each of the nine subsets consists of 1,500 Web pages, and the percentage of spam in each subset varies from 10% to 90% in 10% increments. In order to further evaluate the overall performance of our spam-detection approach, in Section 5.3 we used another nine randomly-sampled subsets of 1,500 Web documents each, with different percentages of spam (in the range of 10% to 90%), from the WEBSPAM-UK2007 dataset [30]. The WEBSPAM-UK2007 dataset contains 105,896,555 Web pages downloaded from 114,529 hosts in the .UK domain in May 2007 that were previously labeled as spam and non-spam (see http://barcelona.research.yahoo.net/webspam/datasets/uk2007 for details). The measures reported in Section 5.3, i.e., the number of FPs and FNs, accuracy, and error rate, are computed by averaging the corresponding measures generated by each of the nine corresponding subsets of the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets. As stated in [7], WEBSPAM-UK2006 is appropriate and widely used for verifying the accuracy of a given spam-detection approach, since the collection (i) includes a large variety of spam and non-spam Web documents, (ii) represents a uniform random sample, (iii) consists of spam Web documents created using different spam techniques, and (iv) is freely available to be used as a benchmark for detecting spam Web documents. These properties also apply to WEBSPAM-UK2007.

5.2 Accuracy of our approach in using SimTB with(out) the HC value

We first verify (i) the effectiveness of our spam-detection approach in using n-grams (1 ≤ n ≤ 3) and determine (ii) the most accurate type of n-gram phrases for computing the SimTB value between the title and the body of a Web document, using the following measures:

    Accuracy = Correctly identified Web documents / Total number of Web documents    (11)

    Error Rate = 1 − Accuracy                                                        (12)

^8 Web pages in the different subsets of WEBSPAM-UK2006 are different from the ones used earlier for setting and verifying the appropriateness of the threshold values.
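Equations 11 and 12 in code, with accuracy derived from the FP and FN counts as described; the counts in the example are illustrative, not drawn from the evaluation.

```python
def accuracy(total, fps, fns):
    """Eq. 11: correctly identified documents = total - (FPs + FNs)."""
    return (total - fps - fns) / total

def error_rate(total, fps, fns):
    """Eq. 12: the complement of accuracy."""
    return 1.0 - accuracy(total, fps, fns)

# e.g., a 1,500-page subset with 150 FPs and 90 FNs
print(accuracy(1500, 150, 90))               # → 0.84
print(round(error_rate(1500, 150, 90), 2))   # → 0.16
```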


Figure 8: Experimental results on using n-gram (1 ≤ n ≤ 3) phrases based on the computed SimTB values of the Web documents in the subsets of the WEBSPAM-UK2006 dataset. (a) The accuracy and error rates; (b) the number of FPs and FNs.

where correctly identified Web documents is the total number of analyzed Web documents minus the FPs and FNs. As shown in Figure 8(a), using the bigram SimTB values of the Web documents in the various subsets of WEBSPAM-UK2006, our spam-document detection approach yields an (average) accuracy and error rate of 74% and 26%, respectively, which outperforms the unigram and trigram approaches. Furthermore, Figure 8(b) shows the number of FPs and FNs of the different n-grams (1 ≤ n ≤ 3) in misclassifying Web documents. According to the figure, bigrams significantly reduce the number of FPs and FNs compared with unigrams or trigrams, based on their corresponding computed SimTB values. We have observed that bigrams outperform trigrams because (closely) related 3-word phrases in the title and body of a Web document occur less often than (closely) related 2-word phrases. As a result, the degree of similarity between the title and the body of a Web document is lower using trigrams than bigrams, causing a higher number of FPs and FNs. We have further compared how well our spam-detection approach performs when considering both the bigram similarity between the words in the title T and the body B of a given Web document D and the fraction of hidden content of D (Method A), i.e., Steps (iii) and (iv) in Figure 7, as opposed to considering only the bigram-similarity measure between T and B of D (Method B), i.e., Step (iii) only, using the SimTB values. Figure 9(a) shows that the accuracy of our approach increases by 5% when applying Method A instead of Method B.

5.3 The overall accuracy of our spam-detection approach

We have conducted further comparisons using the bigram phrase-SimTB measure with(out) the EnSimTB values on the subsets of Web pages in the WEBSPAM-UK2006 dataset, i.e., Step (iii) with(out) Step (v) in Figure 7. As shown in Figure 9(b), the accuracy ratio is increased by 6%, yielding an accuracy ratio of 80%, in detecting spam Web pages using bigrams based on the SimTB and EnSimTB values, instead of using solely the SimTB values. Moreover, by considering the HC value, as well as the EnSimTB value (i.e., Steps (iv) and (v) in Figure 7), in addition to the SimTB value, we further reduce the number of FPs and FNs (as shown in Figure 10(a)) and obtain an overall accuracy ratio of 85% for the nine subsets of Web pages from the WEBSPAM-UK2006 dataset (as shown in Figure 10(b)) without significantly increasing the computational cost, since it requires only O(n) time to calculate the HC value, where n is the number of characters in a Web document D, and O(m^2) time to compute the EnSimTB value, where m is the number of bigrams in D. The accuracy of our proposed approach for detecting spam Web pages using the nine subsets of Web documents from WEBSPAM-UK2007 is 84%. As confirmed by the empirical study conducted on (samples of) two well-known datasets, i.e., WEBSPAM-UK2006 and WEBSPAM-UK2007, our detection approach achieves an average accuracy ratio of 84.5%, which verifies its effectiveness.

Besides assessing the performance of our detection approach using subsets of Web documents with different percentages of spam, we conducted yet another empirical study using a subset of 20,000 randomly-selected Web pages from WEBSPAM-UK2006, denoted WS06, and another subset of 20,000 Web pages extracted from WEBSPAM-UK2007, denoted WS07. Since the percentage of spam documents on the Web these days is between 14% and 22%, as stated in [7] and [23], both WS06 and WS07 contain an average percentage of spam Web pages, i.e., 18% (= (14% + 22%) / 2).

Figure 9: Experimental results computed on the samples of Web documents in the WEBSPAM-UK2006 dataset. (a) Accuracy using Method A (bigram SimTB + HC) and Method B (bigram SimTB only); (b) accuracy and error rates of the SimTB values with(out) the EnSimTB values.
Based on the experimental results, the average accuracy yielded for WS06 and WS07 is 83%, which is comparable with the one generated by using smaller subsets of documents and further verifies the efficiency and scalability of our spam-detection method.


Figure 10: Experimental results of applying various steps in our spam-detection approach on the (sampled) Web documents in the WEBSPAM-UK2006 dataset. (a) Computed FPs and FNs; (b) computed accuracy and error rates.

5.4 Comparing the performance of our spam-detection approach with other anti-spam methods

We further compare the performance (in terms of precision and recall) of our spam-detection approach with the other well-known anti-spam methods in [8], which consider link-based [1] and content-based [23] features, as well as the combination of both. The features described in [8], which include degree-related measures, PageRank, TrustRank [14], and the features described in [23], such as the number of words and the average word length in a document, serve as inputs to a C4.5 decision tree. Furthermore, the authors of [8] use the aggregation of spam hosts to enhance the spam-detection accuracy by (i) implementing a graph-clustering algorithm that evaluates whether the majority of the hosts in a cluster C are spam and, if so, considers all the hosts in C spam; (ii) applying the graph topology to smooth "spamicity" predictions by propagating them using random walks [31]; and (iii) using a stacked graphical learning scheme [10], which derives initial predictions for all the objects in a group of Web documents and generates extra features for each object to improve the quality of the original predictions. In comparing the existing anti-spam methods listed above with ours, we adopt the evaluation method defined in [8], which uses the following confusion matrix:

                          Prediction
                      Non-Spam    Spam
True Label  Non-Spam      a         b
            Spam          c         d

Castillo et al. [8] compute the True Positive Rate (or recall) = d / (c + d), the False Positive Rate = b / (a + b), and the F-measure = (2 × precision × recall) / (precision + recall), where precision is defined as d / (b + d). High recall and precision translate into a high F-measure, whereas low precision and recall yield a low F-measure. Furthermore, high (low, respectively) recall and low (high, respectively) precision generate a low F-measure.
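The measures from the confusion matrix can be written out directly; the counts below are toy values chosen for illustration (they are not taken from [8] or from our evaluation).

```python
def spam_metrics(a, b, c, d):
    """Recall (TPR), FPR, precision, and F-measure from the confusion
    matrix of [8]: a, b, c, d count the non-spam/non-spam, non-spam/spam,
    spam/non-spam, and spam/spam (true label / prediction) pages."""
    recall = d / (c + d)         # True Positive Rate
    fpr = b / (a + b)            # False Positive Rate
    precision = d / (b + d)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, fpr, precision, f_measure

# Toy counts: 900 true non-spam pages (850 classified correctly) and
# 100 true spam pages (80 classified correctly).
recall, fpr, precision, f = spam_metrics(a=850, b=50, c=20, d=80)
print(round(recall, 2), round(fpr, 3), round(precision, 3), round(f, 3))
# → 0.8 0.056 0.615 0.696
```

Note how the 50 misclassified legitimate pages (b) depress precision even though recall stays at 0.8, which in turn pulls down the F-measure.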

Figure 11: The False Positive Rate (FPR), True Positive Rate (TPR), and F-measure computed by applying the approaches in [8] and ours to the WEBSPAM-UK2006 dataset.

Figure 11 shows the experimental results reported in [8] for the different Web anti-spam detection methods, using the respective classifier with the highest F-measure, as well as the results generated by our approach, on the WEBSPAM-UK2006 dataset. Our spam-detection method outperforms the other anti-spam methods by at least 5% in True Positive Rate and at least 3% in terms of F-measure, which indicates that we obtain high recall and comparable precision with respect to the other approaches in detecting spam Web documents. The empirical study has verified that our spam-detection approach correctly identifies (almost) all spam Web documents and avoids misclassifying many legitimate Web documents.

To further assess the performance of our Web spam-detection approach, we compare our accuracy ratios with the ones generated by the spam-prediction method introduced in [29] using another nine^9 different subsets of documents. As opposed to our detection approach, which relies on the content and structure of Web pages to detect spam documents, the spam-prediction method in [29] relies on HTTP session information. The prediction method analyzes hosting IP addresses, as well as HTTP session headers, such as "Content-Type", "Server", "X-Powered-By", "Content-Language", or "Pragma", to train classification algorithms, such as C4.5, HyperPipes, Logistic regression, or Support Vector Machine (SVM), to identify (non-)spam Web pages [29]. To perform a compatible evaluation between our spam-detection approach and the one in [29], each of the nine subsets with the corresponding percentage of spam Web pages used for evaluation contains 1,486 Web documents from the WEBSPAM-UK2006 dataset, the same number of pages used for conducting the evaluation measures in [29].

Figure 12 shows, for each subset of Web pages, the corresponding accuracy ratios of the two spam-detection/prediction methods. As stated in [29], the classifier's performance is relatively consistent for subsets that contain between 30% and 70% spam Web pages, but varies considerably at the extremes. The accuracy ratios of our proposed approach, on the other hand, steadily increase as the percentage of spam pages in a collection increases. Furthermore, the overall accuracy of our spam-detection approach (as shown in Figure 12) is higher than the average accuracy of the approach in [29] by 3%. Since the spam-prediction method in [29] performs better than our approach for collections of Web pages with low percentages of spam, i.e., below 40%, we could consider using HTTP session information, in addition to our content- and structure-based analysis, in classifying spam Web pages, which should further enhance our spam-detection method.

^9 Once again, the percentage of spam Web pages in each subset varies from 10% to 90%.

Figure 12: Accuracy ratios achieved by our spam-detection approach and the Web spam-prediction method in [29] on different corpus samples extracted from the WEBSPAM-UK2006 dataset.

5.5 Observations

The anti-spam methods in [8] combine widely-used algorithms, such as the C4.5 decision tree and (graph) clustering, with known link-based and content-based spam-detection strategies, which are representative of the commonly-used methods for classifying (non-)spam Web documents. Compared with the performance of these anti-spam methods, which we used for verifying the higher degree of accuracy of our spam-document detection approach in Section 5.4, we draw the conclusion that ours is more effective and simpler. Our approach is effective, since on average we can correctly identify 84.5% (the accuracy ratio reported in Section 5.3) of the evaluated Web documents, and it is simple, since our detection approach only requires computing the (enhanced) similarity values among the words in the title and body of a given Web page and/or its percentage of hidden content to identify spam Web documents. The cost-sensitive decision tree [8], on the other hand, requires computing different link-based measures (as discussed in [1]) and considers many content-based features (as presented in [23]) to construct a decision tree that classifies (non-)spam Web pages, and it needs to be trained using previously labeled data (i.e., Web documents labeled as spam/legitimate) before it can be used as a classifier. Hence, the cost-sensitive decision tree requires additional pre-processing time for detection, which is a constraint. Another anti-spam method, clustering [8], not only requires a classification step, but also applies a graph-clustering algorithm that considers the hosts that contain the Web pages to be evaluated. In one case, propagation [8] is used to smooth the probabilities associated with each host, which establish the likelihood of the host being spam. In another case, the stacked graphical learning strategy [8] is applied, which iteratively yields new features that describe hosts and are later used as additional evidence to improve the accuracy of correctly detecting spam Web pages. Unfortunately, these methods require one or more additional steps to determine the hosts' influence in (i.e., provide additional information to) the decision tree classifier for identifying (non-)spam Web pages. Moreover, these additional steps do not yield a higher degree of accuracy than our spam-detection approach in detecting spam documents. Furthermore, for each of the anti-spam methods presented in [8], the authors apply bagging, a technique that combines available classifiers, i.e., decision trees. This technique requires building and training an ensemble of classifiers, which translates into yet another process to be implemented for the purpose of augmenting the accuracy in detecting spam Web pages.
It is worth mentioning that the initial step, i.e., building a classifier, in all of the anti-spam methods in [8] requires training to construct a decision tree, which ours does not. As shown in Figure 11, among all the anti-spam approaches in [8], even though the stacked graphical learning approach achieves the highest degree of accuracy, it is still at least 5% less accurate (in terms of True Positive Rate) than ours. While Castillo et al. [8] claim that the stacked graphical learning approach is scalable and can be used on large Web datasets of any size, it is clearly not as simple (in terms of implementation) as our proposed spam-detection approach, which depends solely on the content of Web documents and the availability of a pre-defined word-correlation matrix.

5.6 A case study

In designing our spam-detection approach, we rely on several methods: the percentage of hidden content, the unigram-similarity, n-gram-similarity (2 ≤ n ≤ 3), and enhanced n-gram-similarity measures. In this section we present a case study conducted to analyze the effectiveness of applying the content-similarity methods, which depend on the (words in the) content of Web documents, and the structure-based method, which relies on the markup content of Web pages, for accurately identifying spam Web pages. We constructed another nine disjoint subsets with 1,000 randomly-selected Web pages each from the WEBSPAM-UK2006 dataset^10. Again, each of the subsets contains a different percentage of spam, varying from 10% to 90% in 10% increments, to assess the persistence of our spam-detection approach. On average, among all the subsets 883 Web pages come with a title, whereas the remaining 117 do not. Of the (averaged) 117 Web pages with no title, 76 are spam and 41 are legitimate. Figure 13 depicts the (average) false positives, false negatives, and overall error rate using the different tactics in our spam-detection approach. As shown in the figure, by applying both the semantic-based and structure-based methods for detecting spam Web pages, we can reduce the number of false positives and false negatives and obtain a low (average) error rate of 15% (i.e., 85% accuracy), which verifies the complementary roles of the semantic and structural analyses in classifying spam/legitimate Web documents.

^10 The randomly-selected Web pages are different from the subsets used for establishing and verifying the different threshold values (in Sections 3 and 4) and from the subsets of Web pages used in Section 5.1.

Figure 13: (Average) false positives, false negatives, and overall error rate computed using nine subsets of randomly-selected pages from the WEBSPAM-UK2006 dataset.

6 Complexity analysis and implementation

In this section we analyze the complexity of our spam-detection algorithm, SpamDe, which includes all the steps of the overall spam-detection process shown in Figure 7, and discuss its implementation.

Algorithm: SpamDe—Detecting (non-)spam Web documents

Input: A Web document D, the word-correlation matrix M, the HC threshold, the SimTB threshold, the EnSimTB threshold, the n-gram indicator n (1 ≤ n ≤ 3)
Output: D classified as (non-)spam

1. Let S be the size (in characters) of D
2. Let P be the size of the markup content of D
3. Let HC be the percentage of hidden content in D /* HC = P/S */
4. Let V := n be the variable used for altering the n-gram indicator n, if needed
5. IF (the title T in D is missing) OR (there is no non-stop, stemmed word in T), THEN
   5.1. IF HC ≥ HC threshold, THEN Label D as Legitimate
   5.2. ELSE Label D as Spam
   ELSE
6. Let L be the number of non-stop, stemmed words in the title T of D
7. IF L < n, THEN V := L /* there is an insufficient number of words in T to perform n-gram phrase comparisons */
8. FOR each V-gram g in the title of D
   8.1. IF V > 1, THEN /* detection based on bigrams or trigrams */
        8.1.1. Compute and store the phrase-correlation factor, pcf, of g and each V-gram in the body of D using Equation 7 and M
        8.1.2. Compute the similarity value µ of g with respect to the V-grams in the body of D using Equation 3 and the pcfs computed in Step 8.1.1
   8.2. ELSE
        8.2.1. Compute the similarity value µ of g with respect to the unigrams in the body of D using Equation 3 and the word-correlation factors, cfs, in M
9. Compute the degree of similarity of the title T and body B of D using Equation 4
   9.1. IF SimTB(T, B) ≥ SimTB threshold, THEN
        9.1.1. IF HC ≥ HC threshold, THEN Label D as Legitimate
        9.1.2. ELSE Label D as Spam
        ELSE
   9.2. FOR each V-gram g in the title of D
        9.2.1. Compute the sum of the pcfs (or cfs in the unigram case) of g with respect to the V-grams in the body of D using Equation 8 /* the pcfs of g and the V-grams were computed in Step 8.1.1 */
   9.3. Compute the enhanced degree of similarity of T and B using Equation 10
   9.4. IF EnSimTB(T, B) ≥ EnSimTB threshold, THEN
        9.4.1. Label D as Legitimate
        ELSE
        9.4.2. Label D as Spam
END

The complexity analysis of SpamDe is based on the n-gram (1 ≤ n ≤ 3) detection strategies presented in Section 3.4, even though we have already verified in Section 5 that the use of bigrams outperforms the use of unigrams and trigrams. By supporting any n-grams, SpamDe can handle the detection of (non-)spam Web documents according to the user's preference on which n-grams to use. However, if the number of non-stop, stemmed words in the existing title of a given Web document D is less than the n chosen by the user, SpamDe resets the value of n to the number of non-stop, stemmed words in the title of D (in Step 7 of SpamDe). As a result, the user's choice of the n-grams used for spam detection on D could be overridden by SpamDe. This is a legitimate strategy, because n-grams cannot be used for spam detection unless the title of the document includes at least one non-stop, stemmed n-gram. Furthermore, although the EnSimTB-threshold value computed in Section 4.2 is for bigrams only, the EnSimTB-threshold value for unigrams or trigrams can be computed in the same manner, using the same (or a different) set of previously labeled (non-)spam Web documents and selecting the threshold value that yields the minimal number of false positives and false negatives, i.e., misclassified Web documents.

Steps 1 through 7 of SpamDe take constant time, since each either performs an assignment or an (in)equality comparison, whereas Steps 8 and 9 are the dominant steps of SpamDe in terms of time complexity. The dominant sub-step of Step 8 is the one that determines the µ value (the complement of the negated products) between the m different n-grams (1 ≤ n ≤ 3) in the title of a Web document with respect to the k different n-grams (1 ≤ n ≤ 3) in its body, i.e., Step 8.2.1 when unigrams are considered and Step 8.1.2 when bigrams or trigrams are considered, which is O(k).
Note that computing the phrase-correlation factors, i.e., Step 8.1.1, also requires O(k), since it involves using the corresponding word-correlation factors of the n-gram in the title being considered and each n-gram in the body of D, where n (2 ≤ n ≤ 3) is insignificant. Thus, the overall complexity of Step 8 is O(m × k), where in general k is significantly larger than m, i.e., k ≫ m, because the number of non-stop, stemmed words in the body is often much larger than in the title of D. The dominant sub-step of Step 9, which requires O(m × k), involves computing the sum of the word-correlation factors (pcfs, respectively) between the m different n-grams (1 ≤ n ≤ 3) in the title of a Web document with respect to the k different n-grams (1 ≤ n ≤ 3) in its body, i.e., Step 9.2.1. In the worst-case scenario, which occurs when both the degree of similarity, i.e., SimTB, and the enhanced degree of similarity, i.e., EnSimTB, between the n-grams (1 ≤ n ≤ 3) in the title and the body of D must be computed before SpamDe can determine whether D is spam, the complexity of SpamDe is O(2 × (m × k)) = O(m × k). SpamDe was implemented in the Perl programming language on an Intel Dual Core workstation with dual 2.66 GHz processors, 3 GB of RAM, and a 300 GB hard disk running the Windows XP operating system.
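The O(n) hidden-content computation of Step 3 amounts to a single pass over the document's characters. A minimal sketch, under the simplifying assumption that every character inside a tag counts as markup (HC = P/S as in Step 3 of SpamDe), and ignoring edge cases such as comments, scripts, and attribute values containing angle brackets:

```python
def hc_value(html):
    """Fraction of characters belonging to markup, via a single O(n) scan.
    Simplified: treats everything between '<' and '>' as markup."""
    markup = 0
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
        if in_tag:
            markup += 1
        if ch == ">":
            in_tag = False
    return markup / len(html) if html else 0.0

doc = "<html><body>hello</body></html>"
print(round(hc_value(doc), 3))  # → 0.839, i.e., 26 markup chars out of 31
```

A production version would use a real HTML parser rather than this character scan, but the linear-time bound claimed above holds either way.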

7 Conclusions and future work

We have presented a spam-document detection approach that can effectively identify spam Web documents to aid search engines in performing intelligent searches. Our anti-spam approach minimizes the user's time spent looking through documents that are deceitful and do not contain useful information. In designing our anti-spam method, we consider (i) the (enhanced) similarity measures of the n-gram (1 ≤ n ≤ 3) phrases in the title with respect to the ones in the body of a Web document D, and (ii) the fraction of hidden content of D, if necessary, to determine whether D is spam. Experimental results on two well-known Web spam-detection datasets, i.e., WEBSPAM-UK2006 and WEBSPAM-UK2007, show that our spam-detection approach classifies spam Web documents with an 84.5% accuracy on average. Moreover, our detection approach outperforms existing anti-spam approaches by at least 3% in F-measure. Furthermore, our approach is computationally inexpensive, since (i) the word-correlation factors used for computing the phrase-correlation factors are precomputed and (ii) the computational time to calculate the fraction of hidden content is insignificant. We have observed that the use of bigrams significantly enhances the performance of our spam-document detection approach. Since the bigram-correlation values employed in our approach are computed from the unigram-correlation factors, it is our belief that constructing a phrase-correlation matrix directly from the Wikipedia documents could further enhance the performance of our approach in terms of (i) minimizing misclassified spam Web documents and (ii) reducing the computational time required to determine the (En)SimTB value, i.e., the (enhanced) similarity value between the title and body, of each Web document examined.
