
Web Intelligence and Agent Systems: An International Journal 0 (2009) 1–0 IOS Press

SimPaD: A Word-Similarity Sentence-Based Plagiarism Detection Tool on Web Documents

Maria Soledad Pera and Yiu-Kai Ng*
3361 TMCB, Computer Science Department, Brigham Young University, Provo, Utah, U.S.A.
E-mail: {ng, mpera}@cs.byu.edu

Abstract. Plagiarism is a serious problem: it infringes on copyrighted documents/materials, is an unethical practice, and decreases the economic incentive received by their legal owners. Unfortunately, plagiarism is getting worse due to the increasing number of on-line publications and easy access to them on the Web, which facilitates locating and paraphrasing information. In solving this problem, we propose a novel plagiarism-detection method, called SimPaD, which (i) establishes the degree of resemblance between any two documents D1 and D2 based on their sentence-to-sentence similarity, computed using pre-defined word-correlation factors, and (ii) generates a graphical view of sentences that are similar (or the same) in D1 and D2. Experimental results verify that SimPaD is highly accurate in detecting (non-)plagiarized documents and outperforms existing plagiarism-detection approaches.

Keywords: Plagiarism, word-correlation factor, sentence similarity, word manipulation, graphical view

1. Introduction

Plagiarism, the act of using others' words or ideas as one's own, is a prolific problem, especially in the academic world. The problem is getting worse because the volume of on-line publications has grown steadily over the past decades. Common plagiarism strategies either (i) simply duplicate material from a(n) (non-)electronic source, or (ii) copy material from a given source and intentionally modify its wording or sentence structure without altering its content [17]. The latter is more difficult to identify than the former due to its complexity. In order to identify plagiarized documents, a number of new approaches have recently been proposed by (i) integrating information retrieval techniques into previous plagiarism-detection methods and (ii) analyzing potential on-line plagiarized documents on the Web, which are easier to process than hard copies.

Popular plagiarism-detection approaches (i) compute the overlap among n-grams in any two documents [23], (ii) analyze the writing, i.e., syntactical and grammatical, styles of the authors of various documents [33], (iii) identify words substituted by their synonyms and split/merged sentences [34], and (iv) detect plagiarized documents based on their fingerprints [20]. Besides using synonyms, hypernyms, and hyponyms, the majority of these approaches rely on exact word/phrase matching to find the portion of a document that is plagiarized. Relying on exact word/phrase matching for plagiarism detection, however, is insufficient and inaccurate, since it is a common practice to paraphrase words with similar ones when plagiarizing a source document.

* Corresponding author. E-mail: [email protected].




In order to handle the shortcomings of existing plagiarism-detection methods, we propose a new Similarity-based Plagiarism Detection approach, denoted SimPaD, which considers not only (similar) word substitution, addition, and deletion, but also sentence splitting and merging, based on word-similarity measures. Given any two documents D1 and D2, SimPaD conducts a sentence-to-sentence comparison to determine the degree of resemblance between D1 and D2 using the pre-computed word-correlation factors defined in [15], which detect semantically-related, as well as exact, words in different sentences. Since SimPaD is based on the word-correlation factors, determining the degree of resemblance between any two given (words/sentences in) documents is a simple and computationally efficient process. SimPaD detects plagiarized documents by identifying in a plagiarized document (i) sentences that are split/merged from sentences in a source document and (ii) sentences in which words have been substituted in, added to, or deleted from sentences in a source document while retaining similar content.

Unlike existing plagiarism-detection approaches, SimPaD is unique, since it (i) allows partial similarity matching, as opposed to strict exact matches, and (ii) uses a graphical view to display the plagiarized sentences in a plagiarized document matched with the corresponding sentences in a source document with various degrees of similarity. SimPaD is primarily designed to detect plagiarized Web documents. Documents evaluated by SimPaD, which include (i) a (non-)plagiarized document D and (ii) a collection of archived documents that includes the (potential) sources of D, are text documents available online. Experimental results generated using existing plagiarized-document corpora show that SimPaD is highly accurate in detecting (non-)plagiarized documents.

This paper is organized as follows. In Section 2, we present our similarity-based plagiarism-detection approach. In Section 3, we introduce the different plagiarism corpora used for evaluating the performance of SimPaD and assess its effectiveness in detecting plagiarism. In Section 4, we discuss a number of existing methods for detecting text plagiarism. In Section 5, we give a conclusion and directions for future work.

2. Our Plagiarism-Detection Approach

Plagiarism can be detected by establishing the "content similarity" among documents [33]. SimPaD identifies DP as a document plagiarized from a source document DS if DP contains (words in) sentences with high degrees of similarity to (words in) sentences in DS. In reality, plagiarism detection is not as simple as matching sentences with sentences, since sentences in DS may not be copied entirely into DP, i.e., "cut and paste" plagiarism; instead, they could be reordered, split, and merged, and/or have words added to, deleted from, or replaced in DP. Indeed, establishing which sentences of DS have been plagiarized is quite broad in scope. For this reason, SimPaD considers a number of plagiarism strategies on words and sentences, which are discussed in the following subsections. Note that these plagiarism strategies are not mutually exclusive; e.g., a sentence S that has been split to yield two sentences S1 and S2 may have words of S reordered in S1 and S2, with some words of S deleted from, and others added to, S1 and/or S2. SimPaD is designed to detect sentences created by these strategies in tandem, rather than independently, and the strategies complement each other in identifying plagiarized sentences/documents.

2.1. Document Representation without Stopwords and Short Sentences

Prior to analyzing potentially plagiarized documents, we first remove all the stopwords1 and reduce all the non-stopwords in a document D to their grammatical roots, i.e., stems, using Porter's Stemmer algorithm [26]. In addition, as part of the pre-processing step, short sentences (which are semantically uninformative with regard to plagiarism detection) are removed from D, due to the high probability that independent authors can create (semantically the) same short sentences rather than long, similar ones. The longer two sentences are, the less likely they are similar by chance.

Example 1 Consider the sentences in the two small documents, D1 and D2, given below.

D1: "Many people believe that lemmings are prone to frequently jumping off a cliff in mass suicide. This is not true."

D2: "One may assume that this chemical reaction is unfeasible due to the steric hindrance. This is not true."

Clearly, D1 and D2 are different in content, and neither the author of D1 nor that of D2 plagiarizes from the other. Nonetheless, the same sentence "This is not true" appears in both documents, which is accounted for by the tendency of some words and sentences to naturally appear frequently regardless of authorship. □

We exclude from documents to be evaluated by SimPaD sentences that are sufficiently short. Gildea [9] estimates that the average number of words in an English sentence varies between 15 and 20, and LaRocque [16] considers every sentence with fewer than 12 words (including stopwords) as short, whereas Liddy [19] claims that 40% of the words in a sentence are stopwords. Hence, we remove (short) sentences with fewer than 7 non-stop, stemmed words from a document and consider only the remaining semantically informative sentences during the process of plagiarism detection.

1 Stopwords are words that carry little meaning, such as articles, conjunctives, and prepositions, which can be removed from a document without significant information loss. We have considered the stopword list posted at http://students.cs.byu.edu/~rmq3/StopwordList.txt
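To make this pre-processing step concrete, the following Python sketch (our illustration, not the authors' code) removes stopwords, stems the remaining words with the Porter stemmer, and drops sentences with fewer than 7 keywords; the tiny STOPWORDS set and the regex-based sentence splitter are simplifying assumptions, since the paper uses a full stopword list.

```python
import re
from nltk.stem import PorterStemmer  # Porter's stemming algorithm [26]

# Tiny illustrative stopword list; the paper uses the full list posted at
# http://students.cs.byu.edu/~rmq3/StopwordList.txt
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
             "is", "are", "this", "that", "for", "with", "not", "it"}

stemmer = PorterStemmer()

def preprocess(document: str, min_keywords: int = 7) -> list[list[str]]:
    """Split a document into sentences, drop stopwords, stem the remaining
    words, and discard short sentences with fewer than min_keywords
    non-stop, stemmed words (7, following Section 2.1)."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    processed = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        keywords = [stemmer.stem(w) for w in words if w not in STOPWORDS]
        if len(keywords) >= min_keywords:  # keep only informative sentences
            processed.append(keywords)
    return processed
```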


2.2. Manipulation of Words

Words in a source sentence may have been reordered, substituted, deleted, or added to yield a plagiarized sentence. We consider different plagiarism strategies for manipulating words in sentences and apply a word-similarity approach that computes the similarity values of words in sentences for detecting plagiarized sentences and documents.

2.2.1. Word Reordering

It is quite common that a plagiarized sentence is created from a source sentence S by reordering the words in S. In the simplest case of word reordering, the same keywords2 of S are present and placed in a different order, along with possibly additional words, in a plagiarized sentence P.

Example 2 Consider the following source sentence S and plagiarized sentence P:

S: "Over 45% of all current high school students are involved in intramural sports of some kind."

P: "Of all the current high school students, over 45% are involved in some kind of intramural sports."

2 From now on, unless stated otherwise, (key)words refer to non-stop, stemmed words.


In both sentences the same set of keywords is used, i.e., current, high, school, students, involved, kind, intramural, and sports, which yields the same content, even though the keywords in S and P are placed in a different order. □

As shown in Example 2, the order of words does not affect the content of P and S3. Thus, SimPaD discards the order of words in comparing any (sentences in) documents and considers word substitution, addition, and deletion and sentence splitting/merging instead.

2.2.2. Word Substitution

Word substitution can be viewed as deleting a word in a source sentence S followed by adding a (similar) word to S.

Example 3 Consider the sentences S and P below.

S: "Many dairy farmers today use machines for operations from milking to culturing cheese."

P: "Today many cow farmers perform different tasks from milking to making cheese using automated devices." □

As stated in [33], word substitution is a complex problem to address in plagiarism detection, partially due to the lack of plagiarism-detection schemes that measure the degrees of similarity among words. In developing such a scheme for determining the content similarity of (words in) two sentences, we first consider how a human might compare the words in them. Consider the sentences in Example 3. A person may initially notice several identical words, such as "farmers", "today", "milking", and "cheese", in both S and P, and the presence of these words alone may not provide sufficient evidence to suggest a common origin of S and P. The person who further evaluates the content of each sentence should notice that "making cheese" ("automated devices" and "tasks", respectively) is quite similar to "culturing cheese" ("machines" and "operations", respectively). Each word in P that has a high degree of similarity to a word in S suggests common word content. A significant number of (non-)identical words with similar/same meaning in two sentences provides solid evidence that the sentences come from the same origin.

3 Word-reordering has been widely used in modern plagiarism-detection approaches. See [33] for details.



2.2.3. Word Addition/Deletion

A word deleted from (added to, respectively) a sentence without adding (deleting, respectively) another word can be considered a special case of word substitution. We note that the similarity of sentences P and S is higher when the words added to P are closely related to the words in S. Adding non-related words (in terms of similarity with the words in S) to P, however, yields a lower sentence-to-sentence similarity of P with respect to S.

2.2.4. Word-Correlation Factors

In establishing the degrees of similarity among non-identical keywords for plagiarism detection, we adopt the word-correlation factors in a pre-computed word-correlation matrix defined in [15]. The word-correlation factor between any two words i and j, denoted Sim(i, j), was pre-computed using 880,000 documents in the Wikipedia collection (downloaded from http://www.wikipedia.org/)4 based on their (i) frequency of co-occurrence and (ii) relative distance in each Wikipedia document, as defined below.

$$Sim(i, j) = \frac{\sum_{w_i \in V(i)} \sum_{w_j \in V(j)} \frac{1}{d(w_i, w_j) + 1}}{|V(i)| \times |V(j)|} \tag{1}$$

where d(wi, wj) is the distance between any two words wi and wj in any Wikipedia document D, V(i) (V(j), respectively) is the set of stem variations of i (j, respectively) in D, and |V(i)| × |V(j)| is the normalization factor.
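A minimal sketch of how a correlation factor in the spirit of Equation 1 could be computed over a single pre-processed document; in the actual matrix of [15] the factors are accumulated over 880,000 Wikipedia documents, and treating exact stem matches as the variation sets V(i) and V(j) is a simplifying assumption.

```python
def correlation_factor(stem_i: str, stem_j: str, doc_words: list[str]) -> float:
    """Single-document sketch of Equation 1: the factor grows with the
    frequency of co-occurrence of i and j and shrinks with their distance.
    Positions of exact stem matches stand in for the variation sets V(i), V(j)."""
    v_i = [pos for pos, w in enumerate(doc_words) if w == stem_i]
    v_j = [pos for pos, w in enumerate(doc_words) if w == stem_j]
    if not v_i or not v_j:
        return 0.0
    total = sum(1.0 / (abs(pi - pj) + 1) for pi in v_i for pj in v_j)
    return total / (len(v_i) * len(v_j))  # normalize by |V(i)| * |V(j)|
```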

The Wikipedia collection is an ideal and unbiased choice for establishing word similarity, since (i) the documents within the collection were written by close to 90,000 authors with different writing styles and word usage, (ii) the Wikipedia documents cover an extensive range of topics, and (iii) the words within the documents appear in a number of on-line dictionaries, such as 12dicts-4.0 (prdownloads.sourceforge.net/wordlist/12dicts-4.0.zip), Ispell (www.cs.ucla.edu/geoff/ispell.html), and BigDict (packetstormsecurity.nl/Crackers/bigdict.gz).

Compared with the word-correlation factors, WordNet (wordnet.princeton.edu/) provides synonyms, hypernyms, holonyms, antonyms, etc. for a given word. There are, however, no partial degrees of similarity, i.e., weights, assigned to any pair of words. For this reason, the word-correlation factors are more sophisticated in measuring word similarity than the word pairs in WordNet, which come with no measure of their degree of "closeness".

4 Words within the Wikipedia documents were stemmed (i.e., reduced to their root forms) and stopwords were removed.

2.2.5. Scaled Word-Correlation Factors

The correlation factor of words w1 and w2 is assigned a value between 0 and 1, such that '1' denotes an exact match and '0' denotes total dissimilarity between w1 and w2. Note that even for highly similar, non-identical words, the correlation factors in [15] are on the order of 5 × 10−4 or less. For example, the degree of similarity between "tire" and "wheel" is 3.1 × 10−6, which can be read as 0.00031% similar and 99.99% dissimilar. When the degree of similarity of two sentences is computed using the correlation factors of their words, we prefer to ascertain, on a scale of 0% to 100%, how likely the words are to share the same semantic meaning. Hence, we assign a similarity value of 1, or rather 100%, to exact-word matches and comparable values (e.g., 0.7-0.9, or 70%-90%) to highly-similar, non-identical words. We accomplish this task by scaling the word-correlation factors in [15]. Since the correlation factors of non-identical word pairs are less than 5 × 10−4, and word pairs with correlation factors below 1 × 10−7 do not carry much weight in the similarity measure, we use a logarithmic scale, i.e., ScaledSim, which assigns words w1 and w2 the similarity value V of 1.0 if they are identical, 0 if V < 1 × 10−7, and a value between 0 and 1 if 1 × 10−7 ≤ V ≤ 5 × 10−4, which is formally defined below.

$$ScaledSim(w_1, w_2) = \begin{cases} 1 & \text{if } w_1 = w_2 \\[4pt] \max\left(0,\ 1 - \dfrac{\ln\left(\frac{5 \times 10^{-4}}{Sim(w_1, w_2)}\right)}{\ln\left(\frac{5 \times 10^{-4}}{1 \times 10^{-7}}\right)}\right) & \text{otherwise} \end{cases} \tag{2}$$

where Sim(w1, w2) is the word-correlation factor of w1 and w2 as defined in the word-correlation matrix using Equation 1.

We formulate Equation 2 so that the proportions of the high/low word-correlation factors in the word-correlation matrix are adequately retained by their scaled values. Furthermore, using the natural logarithm, i.e., ln, in Equation 2, we can represent a large range of data values, i.e., the word-correlation factors, which are on the order of 10−4 or less, in a more manageable, intuitive manner, i.e., as values between 0% and 100%.
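Equation 2 translates directly into code; the sketch below is our rendering, with the 5 × 10−4 and 1 × 10−7 bounds taken from the text (note that Sim equal to the upper bound maps to 1.0 and Sim equal to the lower bound maps to 0.0).

```python
import math

UPPER = 5e-4  # largest correlation factor among non-identical word pairs
LOWER = 1e-7  # factors below this bound carry no weight

def scaled_sim(w1: str, w2: str, sim: float) -> float:
    """Equation 2: map a raw correlation factor onto a 0..1 (0%-100%) scale.
    sim is the pre-computed word-correlation factor Sim(w1, w2)."""
    if w1 == w2:
        return 1.0   # exact match
    if sim < LOWER:
        return 0.0   # effectively dissimilar
    # logarithmic rescaling: sim = UPPER -> 1.0, sim = LOWER -> 0.0
    return max(0.0, 1.0 - math.log(UPPER / sim) / math.log(UPPER / LOWER))
```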


2.2.6. N-gram Correlation Factors

As mentioned in Section 2.2.1, SimPaD does not consider the order of words in sentences. In some cases, however, disregarding the order of the words in a sentence might yield a higher degree of sentence similarity than is warranted, which could falsely classify a legitimate document as plagiarized, generating a false positive during the plagiarism-detection process. In order to reduce the number of false positives, we can, if needed, consider n-gram, phrase-correlation factors (2 ≤ n ≤ 3), which are computed by combining the correlation factors of the corresponding words in each n-gram of sentences, as defined in [25]. Since experimental results (presented in Section 3) show that the Sim values on words are adequate for detecting (non-)plagiarized documents accurately, n-gram, phrase-correlation factors are not further considered for plagiarism detection in this paper.

2.3. Sentence Similarity

SimPaD computes the degree of similarity of any two sentences P and S using

$$LimSim(P, S) = \frac{\sum_{i=1}^{m} \min\left(1, \sum_{j=1}^{n} ScaledSim(i, j)\right)}{m} \tag{3}$$

where m (n, respectively) denotes the number of keywords in a (potential) plagiarized (source, respectively) sentence P (S, respectively), i (j, respectively) is a word in P (S, respectively), and ScaledSim(i, j) is the scaled word-correlation factor of i and j as defined in Equation 2. Note that LimSim(P, S) ≠ LimSim(S, P), unless P = S.

Using the LimSim function, instead of simply adding the Sim or ScaledSim value of each word in P with respect to each word in S, we restrict the highest possible sentence-similarity value between P and S to 1, which is the value for exact matches. By imposing this constraint, we ensure that if P contains a word k that is (i) an exact match of a word in S, and (ii) similar to (some of) the other words in S, then the degree of similarity of P with respect to S cannot be disproportionately impacted by k, yielding a balanced similarity measure of P with respect to S.
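A sketch of Equation 3, assuming a two-argument scaled_sim(i, j) callable that looks up the scaled word-correlation factor of two stems (e.g., a wrapper around the pre-computed matrix):

```python
def lim_sim(p_keywords: list[str], s_keywords: list[str], scaled_sim) -> float:
    """Equation 3: similarity of (potential) plagiarized sentence P with
    respect to source sentence S. Each keyword of P contributes at most 1,
    so a word matching several words in S cannot dominate the score."""
    if not p_keywords:
        return 0.0
    total = sum(min(1.0, sum(scaled_sim(i, j) for j in s_keywords))
                for i in p_keywords)
    return total / len(p_keywords)
```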


2.3.1. Merged/Split Sentences

Besides considering word addition, deletion, and substitution in detecting plagiarism, we identify sentences in a (plagiarized) document DP created by splitting and/or merging sentences in a source document DS. Identifying these split/merged sentences in DP not only measures the document similarity of DP with respect to DS more accurately; the information is also useful to SimPaD users who are interested in knowing which sentences in DS have been split/merged to yield the corresponding sentences in DP.

Some plagiarism-detection methods, such as [34], consider sentence rearrangement, i.e., sentence merging and splitting, by setting a threshold value V so that each pair of sentences with a number of words in common higher than V is further evaluated. Relying on the proportion of common words among sentences for detecting split/merged sentences, however, is a limitation, since, as previously mentioned, words in a given source sentence S may have been replaced by other similar ones to yield a plagiarized sentence P, and hence the number of common words between S and P is lower than it would otherwise be.

We claim that a split sentence P is "subsumed" by its original sentence S if the majority of the words in P are (semantically) the same as (some of) the words in S. By adopting the threshold value of 0.93, which was established and verified using text documents in [25], SimPaD treats P as a split (subsumed) sentence from (of) S if LimSim(P, S) ≥ 0.93. The same strategy can be applied to detect merged sentences, i.e., source sentences S1, . . ., Sn (n ≥ 2) are merged to yield a plagiarized sentence P if LimSim(Si, P) ≥ 0.93, ∀ 1 ≤ i ≤ n.

2.3.2. Sentence-to-Document Similarity

Having determined the LimSim value of each sentence P in a (potential) plagiarized document DP with respect to each sentence in a source document DS, SimPaD identifies the highest degree of similarity of P with the sentences in DS, which yields the probability of P having the same content as one of the sentences in DS:

$$SenSim(P, D_S) = \begin{cases} \max_{S_j \in D_S} LimSim(P, S_j) & \text{if } \exists!\ \text{or } \nexists\ S_j \text{ such that } LimSim(S_j, P) \geq 0.93 \\[4pt] \min\left(1, \sum_{j=1}^{n} LimSim(P, S_j)\right), \text{ where } LimSim(S_j, P) \geq 0.93 & \text{otherwise} \end{cases} \tag{4}$$

where n is the number of sentences in DS that are subsumed by P. SenSim(P, DS) returns the highest LimSim value of P with respect to the sentences in DS if P is not created by merging two or more sentences in DS; otherwise, the combined similarity of the sentences in DS that are merged to yield P is assigned as the sentence-to-document value of P with respect to DS. Using the Min value in Equation 4, we impose the same restriction as in Equation 3: in combining the LimSim values of P with respect to the sentences in DS merged to create P, we limit the combined LimSim value to 1, an exact match.
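Equation 4 can be sketched on top of lim_sim; the 0.93 subsumption threshold is the one adopted from [25], and the source document is assumed to be a non-empty list of pre-processed sentences.

```python
MERGE_THRESHOLD = 0.93  # subsumption threshold adopted from [25]

def sen_sim(p, source_sentences, scaled_sim):
    """Equation 4: similarity of sentence P to a whole source document DS.
    If P subsumes two or more source sentences (a merge), their LimSim
    values are combined and capped at 1; otherwise the best-matching
    source sentence determines the score."""
    subsumed = [s for s in source_sentences
                if lim_sim(s, p, scaled_sim) >= MERGE_THRESHOLD]
    if len(subsumed) >= 2:  # P merges several source sentences
        return min(1.0, sum(lim_sim(p, s, scaled_sim) for s in subsumed))
    return max(lim_sim(p, s, scaled_sim) for s in source_sentences)
```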



2.3.3. Dotplot Views of Similar Sentences

Using the SenSim (LimSim, respectively) values computed by Equation 4 (Equation 3, respectively), we can (i) determine for each sentence in a (potential) plagiarized document DP its most highly-related sentence in a source document DS, in addition to the sentences in DP that are split/merged from sentences in DS, and (ii) graphically display these related sentences. The Dotplot view [11] was designed for visualizing patterns of string matches in different kinds of texts, e.g., news articles, programming code, etc. We adapt the Dotplot view to provide an intuitive, conceptual diagram that visually shows similar sentences in a source and a plagiarized document. We did modify, however, the Dotplot view using the scatter graph in Microsoft Office Excel, and we call the modified graph the Plagiarism View (or PlaView).

In each PlaView, the x- (y-, respectively) axis represents the sentences (numbered in chronological order of their appearance) in a plagiarized document DP (source document DS, respectively), whereas each dot, denoted "•", in PlaView represents the sentence S in DS that is the most highly similar to the sentence P in DP, as dictated by the SenSim(P, DS) value. Furthermore, PlaView graphically displays the sentences P1, . . ., Pn in DP that are the split version of a sentence S in DS, if LimSim(Pi, S) ≥ 0.93, 1 ≤ i ≤ n, and the "dots" of (P1, S), . . ., (Pn, S) are horizontally aligned in PlaView. In addition, a cross symbol, i.e., "x", in PlaView denotes a merged sentence P in DP that combines several sentences S1, . . ., Sm (m ≥ 2) in DS, such that LimSim(Si, P) ≥ 0.93, 1 ≤ i ≤ m, and the "crosses" of (Si, P) are vertically aligned. Furthermore, the larger the dot (cross, respectively) size, the higher the content similarity of the corresponding sentences, which captures the SenSim value for each sentence S in a source document DS and sentence P in a (potential) plagiarized document DP. Figure 1 shows the PlaView of the document "Student File Management under Primos" and its plagiarized version in Webis-PC, one of the datasets used in our experiments and discussed in Section 3, where v1 and v2 in Figure 1 denote the SenSim and LimSim values, respectively, of sentences S and P.
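The paper builds PlaView with the scatter graph of Microsoft Office Excel; a rough open-source analogue using matplotlib (our illustration, with hypothetical match/merge inputs) could look as follows:

```python
import matplotlib.pyplot as plt

def plot_plaview(matches, merges):
    """Rough PlaView analogue. `matches` holds (p_index, s_index, score)
    triples drawn as dots sized by similarity; `merges` holds the same kind
    of triples for merged sentences, drawn as crosses."""
    fig, ax = plt.subplots()
    for p_idx, s_idx, score in matches:
        ax.scatter(p_idx, s_idx, s=200 * score, marker="o", color="black")
    for p_idx, s_idx, score in merges:
        ax.scatter(p_idx, s_idx, s=200 * score, marker="x", color="red")
    ax.set_xlabel("Sentence number in plagiarized document")
    ax.set_ylabel("Sentence number in source document")
    plt.show()
```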

2.4. Document Similarity

Having identified the sentences in a source document DS that are most closely related to the sentences in a (potential) plagiarized document DP, we determine the overall percentage of plagiarism of DP with respect to DS as

$$Resem(D_P, D_S) = \frac{\sum_{i=1}^{|D_P|} SenSim(P_i, D_S)}{|D_P|} \tag{5}$$

where |DP| is the number of sentences in DP; note that Resem(DP, DS) ≠ Resem(DS, DP), if DP ≠ DS. Resem(DP, DS) is the degree of resemblance of DP with respect to DS. By averaging the computed SenSim values of the sentences in DP, SimPaD determines the proportion of the (segments of) sentences in DP that are related to the sentences in DS. Using Equation 5 and a threshold value defined in Section 3, SimPaD can classify (non-)plagiarized documents.
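Equation 5 and the classification rule reduce to a few lines; in the sketch below, the ResemTH value of 0.27 is the threshold established later in Section 3.3, and sen_sim is the sketch given above.

```python
RESEM_THRESHOLD = 0.27  # ResemTH, established in Section 3.3

def resem(plagiarized_sentences, source_sentences, scaled_sim) -> float:
    """Equation 5: average SenSim over all sentences of D_P."""
    scores = [sen_sim(p, source_sentences, scaled_sim)
              for p in plagiarized_sentences]
    return sum(scores) / len(scores)

def is_plagiarized(d_p, d_s, scaled_sim) -> bool:
    """Classify D_P as plagiarized from D_S when Resem reaches ResemTH."""
    return resem(d_p, d_s, scaled_sim) >= RESEM_THRESHOLD
```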

3. Experimental Results

In this section, we introduce the datasets used for conducting various empirical studies on SimPaD and present several evaluation measures for analyzing the performance of SimPaD, in terms of accuracy and F-measure, in detecting (non-)plagiarized documents.

3.1. Datasets

In assessing the performance of SimPaD, we used three plagiarism corpora. The first one, denoted Webis-PC, is the Bauhaus University Plagiarism Corpus Webis-PC-08 [36], which consists of 101 original English documents downloaded from the ACM Digital Library (http://portal.acm.org/dl.cfm). There is a plagiarized version of each original document D in Webis-PC, which was generated by (i) including exact paragraphs of D, (ii) excluding some sentences from D, and/or (iii) adding sentences with words similar to the ones in D.

The second corpus, the Meter Corpus [8], was constructed as part of the Measuring Text Reuse Project at the University of Sheffield in the U.K.



Fig. 1. PlaView of sentences in the source and plagiarized versions of the document "Student File Management under Primos", where v1 and v2 denote the SenSim and LimSim values, respectively, of the corresponding sentences

The Meter corpus consists of 265 unique stories provided by the British Press Association (PA)5, clustered into two subject areas, entertainment and law/court reporting, and collected from July 1999 to June 2000. For each of the 265 stories, Meter provides one or more (non-6)derived newspaper articles, which translates into 944 pairs of news articles published in a variety of newspapers, such as The Sun, Daily Mirror, Daily Mail, etc. Each news article pair in Meter is classified as wholly-derived (i.e., the PA story is copied/paraphrased entirely), partially-derived (i.e., PA is the major source used for writing the news article), or non-derived (i.e., PA is not the original source). In rewriting news articles based on PA stories, Gaizauskas et al. [8] observe that common rewriting strategies include (i) using the exact content of a source sentence, (ii) paraphrasing text from the source story to report the same information, and (iii) including new text, i.e., reporting PA stories in a different context. As a result, wholly- (partially-, respectively) derived articles include sentences "cut and pasted" exactly from a PA story (sentences from the PA source story in which words have been replaced, reordered, added, and/or deleted, respectively).

5 According to [8], PA is the most prestigious press agency in the U.K., which provides news to 86 different national newspapers, as well as 470 radio and television broadcasts.

6 Non-derived news articles are published articles that report (but do not plagiarize) the 265 stories provided by PA.

In evaluating the performance of SimPaD, article pairs labeled as wholly- and partially-derived are treated as plagiarized articles, the same assumption made by [32] in detecting plagiarism and using Meter for performance evaluation.

The third test dataset, denoted PAN09 [27], is the development corpus created for the First Plagiarism-Detection Competition of the 3rd PAN Workshop on Plagiarism Detection (http://www.webis.de/pan-09). The PAN09 corpus consists of 7,214 suspicious, i.e., potentially plagiarized, documents and 7,214 source documents. Documents in PAN09 are plain text files, and each text file T includes an XML file with meta-information on T, i.e., the plagiarized fragments in T along with their corresponding sources. The suspicious documents in PAN09 were generated by applying random text operations, i.e., adding, deleting, or relocating a word, as well as including a word from an external source and/or replacing a word with a synonym, antonym, hypernym, or hyponym. The PAN09 dataset is a large-scale corpus of artificial plagiarism. Since the competition is still in progress, the "competition" portion of the corpus is not annotated. As a result, we have conducted our empirical study on the "development" portion of the PAN09 dataset instead.



Furthermore, since the results on PAN09 generated by the various plagiarism-detection systems involved in the competition are not available, we cannot compare the performance of SimPaD with other approaches using PAN09, except by conducting the measures on our own. To the best of our knowledge, besides Webis-PC, Meter, and PAN09, no other benchmark datasets are available for evaluating the performance of a plagiarism-detection approach based on adding, deleting, or replacing words, and/or sentence merging/splitting. Hence, Webis-PC, Meter, and PAN09 are the most suitable choices for evaluating the performance of SimPaD.

3.2. Resemblance Values

As discussed in Section 2, in comparing any two documents SimPaD computes their Resem values. Figure 2(a) (2(b), respectively) shows the Resem value of each of the 101 plagiarized documents (944 pairs of articles, respectively) and its corresponding source in Webis-PC (Meter, respectively). As shown in Figure 2(b), partially-derived news article pairs have a lower degree of resemblance than the wholly-derived news article pairs, but a higher degree of resemblance than the non-derived pairs in Meter. Based on the Resem values shown in Figure 2, we observe that SimPaD adequately detects the proportion of content shared by documents, i.e., the percentage of plagiarism found in the (potential) plagiarized documents.

3.3. A Threshold for Plagiarism Detection

Prior to determining the accuracy of SimPaD in detecting (non-)plagiarized documents, we set an appropriate threshold value, ResemTH, for automatically labeling a document DP as a (non-)plagiarized version of a source document DS using the Resem(DP, DS) value7. In defining ResemTH, we used the ID3 implementation of the decision tree [28], since ID3 is commonly used for inductive inference based on a given training set of instances. We randomly selected 40 (60, respectively) documents from Webis-PC (Meter, respectively) and their corresponding plagiarized ((non-)plagiarized, respectively) versions, which yield 100 training instances (i.e., document pairs) for constructing the decision tree.

7 SimPaD is fully automated, since it does not require any user interaction or involvement during the process of determining whether an input document is plagiarized.

Each training instance includes a Resem value and the class value of the instance previously set (i.e., "plagiarized" for Webis-PC and "(non-)plagiarized" as defined in Meter). Using the constructed decision tree, a document DP is classified as a plagiarized version of DS if Resem(DP, DS) ≥ 0.27, which is the ResemTH value. Using ResemTH and the Resem values of the document pairs computed by SimPaD, as shown in Figure 2, we observe that SimPaD correctly classifies all the plagiarized documents in Webis-PC and the selected wholly-derived documents in Meter, and almost all the selected partially-derived documents in Meter.

3.4. Performance Evaluation

Using the established ResemTH value and the computed Resem values of the document pairs in Webis-PC, Meter, and PAN09, we evaluated the accuracy

$$Accuracy = \frac{Number\_of\_Correctly\_Classified\_Documents}{|corpus|} \tag{6}$$

of SimPaD in correctly detecting (non-)plagiarized documents, where |corpus| is the total number of document pairs in a corpus, and the error rate

$$Error\_Rate = 1 - Accuracy \tag{7}$$

for misclassification.
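As an illustration of how such a threshold can be learned, the sketch below fits a depth-1 decision tree (a stand-in for the ID3 implementation [28], assuming scikit-learn) on invented Resem values and reports the accuracy and error rate of Equations 6 and 7; the training data shown are made up for the example.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented training data for illustration: Resem values of document pairs
# and their known labels (1 = plagiarized, 0 = non-plagiarized).
resem_values = [[0.05], [0.10], [0.22], [0.31], [0.45], [0.80]]
labels = [0, 0, 0, 1, 1, 1]

# On a single feature, a depth-1 tree reduces to a threshold split; the
# paper's ID3 training on 100 Webis-PC/Meter pairs yields ResemTH = 0.27.
stump = DecisionTreeClassifier(max_depth=1, criterion="entropy")
stump.fit(resem_values, labels)
print(f"Learned threshold: {stump.tree_.threshold[0]:.2f}")

# Accuracy and error rate as in Equations 6 and 7
predictions = stump.predict(resem_values)
accuracy = sum(int(p == l) for p, l in zip(predictions, labels)) / len(labels)
print(f"Accuracy: {accuracy:.2f}, Error rate: {1 - accuracy:.2f}")
```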

As shown in Figure 3, in detecting the plagiarized documents in Webis-PC and PAN09, SimPaD yields 100% and 90% accuracy8, respectively, and classifies the (non-)plagiarized news article pairs in Meter with a 96% accuracy rate. Note that none of the wholly-derived and non-derived news article pairs in Meter were misclassified, and each of the thirty-six misclassified news article pairs (4% of the 944 classified pairs) in the partially-derived category (with a total of 438 news article pairs) yields a Resem value lower than ResemTH due to the small size of its corresponding news article, which includes only 2 to 4 sentences, of which only half are (partially) derived from the corresponding PA source article.

8 We randomly selected a subset of 204 documents in PAN09, each plagiarized from a unique source document, to compute the accuracy value achieved by SimPaD on the selected subset of documents.



(a) Degrees of similarity of documents in Webis-PC

(b) Degrees of similarity of documents in Meter

Fig. 2. Resem values of documents and their (non-)plagiarized versions computed by SimPaD

Even though SimPaD misclassified 4% of the articles in Meter as non-plagiarized, which are false negatives, SimPaD did not generate any false positives, i.e., all of the non-plagiarized articles were correctly identified. Furthermore, as shown in Figure 3, even though SimPaD achieves an average 95% accuracy across the three datasets, SimPaD achieves a lower accuracy ratio on the selected set of documents in PAN09 compared with the accuracy ratios achieved on Webis-PC or Meter.

We have observed that the 10% misclassified documents in PAN09 are not entirely plagiarized. In some cases, only a small portion of (a few sentences in) a document D is plagiarized, even though D is classified as plagiarized in PAN09.



Fig. 3. Accuracy and error rates generated by SimPaD on Webis-PC, Meter, and PAN09

Fig. 4. Accuracy and error rates generated by SimPaD and the methods in [6] and [32] on the Meter corpus

3.5. Comparing SimPaD's Performance

In order to further assess the effectiveness of SimPaD in detecting (non-)plagiarized documents, we compare its performance, in terms of accuracy, with other existing plagiarism-detection approaches whose performance evaluations are based on Meter. (None of the performance evaluations of existing plagiarism-detection methods are based on Webis-PC, which is relatively new.) The plagiarism-detection method proposed by [32] uses a binary (i.e., similar and non-similar) classifier, which is based on style features, such as frequent words in a document, and vocabulary features, i.e., tf × idf (term frequency-inverse document frequency)-weighted vectors of unigrams, in a given document to identify copyright infringement. The combined approach yields an accuracy of 70.5% in detecting (non-)plagiarized news articles out of the 88 selected pairs in Meter. In addition, Clough [6] proposes using the overlap between n-grams in any two documents to determine the proportion of shared content. Experiments conducted on wholly-derived and non-derived law/court news article pairs in Meter report an overall 74% accuracy rate [6]. As shown in Figure 4, SimPaD outperforms the two approaches (by a huge margin), which are the only ones that we could find that are based on Meter and use the accuracy measure for performance evaluation.

Besides using accuracy as an evaluation metric, we have further verified the effectiveness of SimPaD in detecting (non-)plagiarized documents using Precision, Recall, and F-measure, which are commonly-used metrics, to provide additional evidence of the efficiency of SimPaD in plagiarism detection. The F-measure combines and weights precision (a measure of fidelity) and recall (a measure of completeness) evenly, whereas accuracy9 is sensitive only to the number of errors [21].

Using the three additional metrics on the Meter corpus, we compare the performance of SimPaD with the plagiarism-detection methods in [2] and [3], denoted PD1 and PD2, respectively. PD1 is discussed in detail in Section 4, whereas PD2, like PD1, determines the overlap between (the n-grams in) a sentence S in a potential plagiarized document P and (the n-grams in) a source document, which dictates whether S should be treated as plagiarized based on a previously-set threshold value. As opposed to PD1, PD2 does not apply search-space reduction to detect the possible sources of S. In order to tailor the evaluation measures computed by SimPaD so that they are comparable with the ones in [2] and [3], we treat each sentence S in a potential plagiarized document as an individual document itself and adopt the ResemTH threshold defined in Section 3.3 to determine whether S should be considered plagiarized when compared with potential source documents. Using ResemTH, we examine and verify whether the proposed document-based threshold value can be adapted as a sentence-based threshold value. Since the performance evaluations presented in [2] and [3] consider various n-grams (1 ≤ n ≤ 5), our comparisons are based on the highest Precision, Recall, and F-measure values reported for PD1 and PD2, which were achieved using bigrams, i.e., when n = 2. As shown in Figure 5, SimPaD outperforms PD1 and PD2 by at least 10%, 5%, and 8% in terms of Precision, Recall, and F-measure, respectively.

9 Accuracy is also called precision at 1 [7], i.e., P@1, which means that when only one document is retrieved, the document is either relevant or non-relevant, and thus its accuracy value is either 1 or 0, respectively.



Fig. 5. Performance evaluation of SimPaD, PD1 in [2], and PD2 in [3] in terms of Precision, Recall, and F-measure using the Meter corpus

3.6. Retaining/Removing Short Sentences and/or Stopwords

To verify the assumption that removing stopwords and short sentences (as discussed in Section 2.1) from each document does not hurt the effectiveness of SimPaD in detecting (non-)plagiarized documents, we used a randomly-selected subset of 100 documents from the Meter corpus, denoted the StSe-dataset10, and computed the accuracy in correctly identifying the (non-)plagiarized documents using (i) SimPaD without removing short sentences, and (ii) SimPaD without removing stopwords. As shown in Figure 6, applying (i) SimPaD to documents with short sentences and stopwords removed and (ii) SimPaD to documents without removing short sentences achieved the same accuracy, i.e., 89%, in identifying the (non-)plagiarized documents in the StSe-dataset, which verifies that removing short sentences does not affect the performance of SimPaD, as expected. Furthermore, the accuracy of applying SimPaD to documents without removing stopwords is 11% lower than the accuracy of applying SimPaD to documents with stopwords and short sentences removed.

Fig. 6. Accuracy of SimPaD computed with and without stopwords and/or short sentences in the StSe-dataset documents

The lower accuracy occurs because non-plagiarized documents share a fair number of stopwords with other sources, which translates into a higher degree of resemblance than expected between two such documents; as a result, a number of non-plagiarized documents are mistakenly classified as plagiarized when stopwords are retained during the plagiarism-detection process. Based on the conducted empirical study, we conclude that removing short sentences and stopwords from (potential) plagiarized documents and their corresponding sources yields the highest accuracy in detecting plagiarism.

4. Related Work

10 The StSe-dataset consists of 50 non-plagiarized documents and 50 plagiarized documents, each randomly selected from its respective (non-)plagiarized document subset of the Meter corpus.

Many attempts have been made in the past to detect plagiarized documents. In this section, we discuss a number of recently proposed plagiarism-detection methods, along with their strengths and shortcomings.



Lukashenko et al. [20] compare two documents and determine their degree of similarity using different metrics, such as the Euclidean distance, Cosine similarity, the percentage of shared n-grams, and the resemblance among estimated language models, whereas Hoad and Zobel [12] recommend two different methods for identifying co-derivatives, i.e., documents that originate from the same source, which can detect potential plagiarized documents. The first method relies on an identity measure that determines the degree of resemblance between two documents, assuming that similar documents contain similar words. The second method is based on fingerprinting, which identifies the content of a document by hashing substrings within the document. By using fingerprints among documents, one can detect those that are highly likely derivatives of another, i.e., plagiarized copies.

Kienreich et al. [14] introduce a plagiarism-detection approach in which a set of news articles are (i) represented by their shingles, i.e., sequences of non-stop words within the news articles, and (ii) clustered according to their relative similarity values. The similarity values are calculated using a hash function over the shingles of any two news articles to compute the colliding values among them, and articles are clustered together if they share a significant number of textual terms. Hereafter, within each subset of similar news articles, another evaluation is performed to detect plagiarized documents. This evaluation involves performing a sentence-to-sentence comparison between any two news articles, A1 and A2, in which transformation operations, such as insertion, deletion, and update, are applied to a sentence in A1 to convert it into a sentence in A2. The fewer operations required, the more alike the two sentences are. Although Kienreich et al. [14] show promising experimental results, the number of undetected plagiarized documents is still significant.

Monostori et al. [23] present a plagiarism-detection system, denoted MatchDetectReveal (MDR), using suffix trees. MDR, which is capable of detecting overlap in (potential) plagiarized documents, uses a string-matching algorithm to identify suspicious documents, from which suffix trees are constructed using a modified Ukkonen's algorithm. The overlap between the constructed suffix trees of the suspicious documents is evaluated to determine which documents, if any, are plagiarized.

Niezgoda and Way [24] design an algorithm, denoted SNITCH, for automatically detecting plagiarized documents. SNITCH analyzes a set of windows, each of which is a portion of a document D containing a certain number of words (as defined by the user), and according to the number of characters in each word in a window W, SNITCH assigns an average weight to W. Hereafter, according to the weight given to each of the windows, the highest-weighted windows (each of which is treated as a query) are searched for on the Web and, based on the results, a report is generated which includes the percentage of plagiarism in D, along with the Web sites from which the original documents were extracted. Unfortunately, SNITCH is accurate only when the identified "cut-and-paste" segments are exact, i.e., the original information is copied verbatim, which is not always the case.

Zini et al. [35] focus on detecting not only "cut-and-paste" plagiarized documents, but also plagiarized documents that are modified in a subtler way by inserting, deleting, or substituting a word, a period, or a paragraph in the original document. This approach extracts the document structure at different levels (i.e., the word, period, and paragraph levels) and compares the overall similarity among the structures of any two documents using the Levenshtein distance [18].

Khmelev and Teahan [13] use the R-measure to recognize plagiarized documents. The R-measure adds the lengths of the substrings in a given document that are included in another document in a collection. By considering the normalized R-measure value, it is possible to establish the "repeatedness" of a document with respect to others, which establishes the degree of plagiarism in the corresponding documents.

Tashiro et al. [31] introduce EPCI, a tool for finding copyright-infringing texts. Given a potential plagiarized document D, EPCI extracts several sequences of words, i.e., seed texts, and generates queries that retrieve a set of Web documents W that could be the source of the content of D. Hereafter, EPCI computes the similarity between D and the documents in W. The higher the similarity value between D and any document in W, the more likely it is that infringement has occurred.

Metzler et al. [22] establish several levels of similarity among documents to identify those that are exact copies of a given document D, as well as the ones that are modified versions of D. In accomplishing the task, Metzler et al. [22] first determine the similarity between sentences within any two documents, and based on the sentence-to-sentence similarity score, the overall similarity value of the documents is determined.


The plagiarism-detection approach introduced in [10] determines the authors of different documents, rather than comparing their contents, to detect plagiarism. Gruner and Naven [10] develop a tool based on stylometry, which is a statistical approach for establishing the authorship of documents. Using stylometry, Gruner and Naven [10] recognize the similarity of the writing style between the authors of any two given documents and consider two different types of plagiarism: (i) documents that should be written by the same author but in fact are not, and (ii) documents that should be written by different authors but in fact are not.

In [1] the authors present a cross-lingual approach for detecting plagiarized documents written in different languages. The similarity between a potential plagiarized document P and the source document S (written in another language) is measured using (i) a statistical model that establishes the probability that (a fragment of) P is related to (a fragment of) S, regardless of the order in which the words appear in P and S, and (ii) a statistical bilingual dictionary created from a parallel corpus of text in English and Spanish. Although effective, this approach requires the construction of the cross-lingual corpus, which is a difficult task in itself, as mentioned in [1].

Barrón-Cedeño et al. [2] propose a Kullback-Leibler (KL) distance-based method, which drastically reduces the search space for plagiarism detection and facilitates the task of locating the source document S of a potential plagiarized one P within a large corpus C. The KL distance is used to determine the closeness between the probability distributions11 of P and S and to extract from C the subset of documents D, including S, that have the lowest KL distance with respect to P, which are the documents most likely to be the source of P. Hereafter, by (i) computing the overlap between the n-grams in each sentence Pi in P and the n-grams in each document in D and (ii) using a previously defined threshold value, the source of Pi, if it exists, is determined.

Sediyono et al. [29], who view plagiarism detection as the "problem of finding the longest common parts of strings", treat each source document S in a document collection C as a list of consecutive words.

11 The probability distribution of P (S, respectively) is composed of a set of features, such as term frequency, that characterize P (S, respectively) [2].


Given an observed, i.e., potentially plagiarized, document P, Sediyono et al. determine the common consecutive words in P and S (to establish that S is the source of sections of text in P), as well as their relative locations in both P and S (to aid users in visualizing the common portions of text in P and S), as opposed to SimPaD and PlaView, which are sentence-based.

SVDPlag is a Singular Value Decomposition (SVD)-based plagiarism-detection method [4]. Given a particular document collection C, SVDPlag (i) removes stopwords, (ii) stems the remaining words, and (iii) extracts n-grams of pre-defined length. After eliminating n-grams that belong to at most one document in C or appear in all the documents in C, SVDPlag employs a Latent Semantic Analysis (LSA) framework to infer latent semantic associations between pairs of distinct n-grams in every pair of text documents in C. Hereafter, SVDPlag computes for each pair of documents in C their level of similarity by calculating the number of n-grams the documents have in common and determines which ones should be treated as plagiarized using a pre-defined threshold value.

Leung and Chan [17] propose using a natural language processing method to facilitate the detection of plagiarized documents, not only among the ones created by "cut and paste", but also among documents in which both the text and the structure of the original sentences are altered while the content of the documents is intact. The approach in [17] considers (i) word replacement, by using WordNet and assigning weights to different semantic word relations, such as hypernyms, hyponyms, or synonyms, when performing sentence-to-sentence comparisons to establish the degree of resemblance among sentences, and (ii) syntactic (semantic, respectively) processing to analyze the syntactic structure (meaning, respectively) of the sentences. As stated in [17], their detection approach is slow and should be reserved for detecting plagiarism only when a document that is very likely the source of P is available.

MLPlag, the multilingual plagiarism-detection method presented in [5], relies on (i) a multilingual database of words and their relations, denoted the EWN thesaurus, and (ii) an analysis of the positions of words in documents to detect documents plagiarized from a document written in another language. MLPlag, like SimPaD, considers plagiarism strategies such as word replacement and repositioning. MLPlag, however, depends exclusively on the EWN thesaurus, which has yet to be fully developed [5].

Su et al. [30] propose a hybrid plagiarism-detection method based on the Levenshtein distance (LD) and a simplified version of the Smith-Waterman algorithm.



LD is a metric that measures the edit distance between any two strings, whereas the Smith-Waterman algorithm is commonly used for detecting similarities, i.e., local alignments, between biological sequences and is adapted in [30] to detect local similarity between any two documents and identify plagiarized ones. As stated by the authors, the main disadvantage of the proposed detection method is that the Smith-Waterman algorithm does not scale well [30].

5. Conclusions

We have proposed a similarity-based plagiarism-detection tool, SimPaD, which relies on pre-computed word-correlation factors for determining sentence-to-sentence similarity values that yield the degree of resemblance of any two documents, in order to detect the plagiarized one, if it exists. SimPaD is designed to detect (non-)plagiarized text documents, which are digitized and posted online, using a collection of Web documents that includes the original documents of their plagiarized versions. SimPaD, which can handle various plagiarism methods, such as the substitution, addition, and deletion of words in sentences, as well as sentence splitting and merging, provides users a visual representation of the sentences in a source document that are paraphrased in its plagiarized version.

We have conducted an empirical study which shows that SimPaD achieves high accuracy, i.e., 95.3% on average, in detecting (non-)plagiarized documents using three different benchmark datasets: Webis-PC [36], Meter [8], and PAN09 [27]. Furthermore, we have compared the performance of SimPaD with the existing plagiarism-detection approaches introduced in [6] and [32] using the Meter corpus, which shows that SimPaD outperforms these document-based plagiarism-detection approaches in accuracy by a huge margin, i.e., 22%. We have also evaluated the performance of SimPaD in identifying individual plagiarized sentences and validated its effectiveness based on Precision, Recall, and F-measure. The experimental results show that SimPaD achieves an 83% F-measure using the Meter corpus, which outperforms the alternative sentence-based plagiarism-detection approaches introduced in [2] and [3] by at least 8%.

Although highly effective, the empirical study conducted on SimPaD shows that the proposed plagiarism-detection approach could be improved to reduce the number of documents that are misclassified as (non-)plagiarized. A direction for future work is to apply natural language processing techniques to further enhance SimPaD by considering (i) the semantic meaning of (a sequence of) individual words to perform word sense disambiguation, and (ii) the (syntactic) structure of words in sentences, as opposed to relying solely on the (similarity of) individual words in sentences, which is the strategy SimPaD currently applies in detecting plagiarized documents.

References

[1] A. Barrón-Cedeño, P. Rosso, D. Pinto, and A. Juan, On Cross-Lingual Plagiarism Analysis Using a Statistical Model, in: Proceedings of the European Conference on Artificial Intelligence Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (ECAI), 2008, pp. 9–13.

[2] A. Barrón-Cedeño, P. Rosso, and J. Benedí, Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance, in: Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 2009, pp. 523–534.

[3] A. Barrón-Cedeño and P. Rosso, On Automatic Plagiarism Detection Based on n-Grams Comparison, in: Proceedings of the European Conference on IR Research on Advances in Information Retrieval (ECIR), 2009, pp. 696–700.

[4] Z. Ceska, Plagiarism Detection Based on Singular Value Decomposition, in: Proceedings of the International Conference on Advances in Natural Language Processing, 2008, pp. 108–119.

[5] Z. Ceska, M. Toman, and K. Jezek, Multilingual Plagiarism Detection, in: Proceedings of the International Conference on Artificial Intelligence: Methodology, Systems, and Applications, LNAI 5253, 2008, pp. 83–92.

[6] P. Clough, Measuring Text Reuse in a Journalistic Domain, in: Proceedings of the Computational Linguistics UK (CLUK) Colloquium, 2001, pp. 53–63.

[7] W. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice, Addison Wesley, 2010.

[8] R. Gaizauskas, J. Foster, Y. Wilks, J. Arundel, P. Clough, and S. Piao, The Meter Corpus: a Corpus for Analyzing Journalistic Text Reuse, in: Proceedings of Corpus Linguistics, 2001.

[9] D. Gildea, Loosely Tree-based Alignment for Machine Translation, in: Proceedings of the Association for Computational Linguistics (ACL), 2003, pp. 80–87.

[10] S. Gruner and S. Naven, Tool Support for Plagiarism Detection in Text Documents, in: Proceedings of the ACM Symposium on Applied Computing, 2005, pp. 776–781.

[11] J. Helfman, Dotplot Patterns: A Literal Look at Pattern Languages, Theory & Practice of Object Systems 2(1) (1996), pp. 31–41.

[12] T. Hoad and J. Zobel, Methods for Identifying Versioned and Plagiarized Documents, Journal of the American Society for Information Science and Technology (JASIST) 54(3) (2003), pp. 203–215.

[13] D. Khmelev and W. Teahan, A Repetition Based Measure for Verification of Text Collections and for Text Categorization, in: Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), 2003, pp. 104–110.

[14] W. Kienreich, M. Granitzer, V. Sabol, and W. Klieber, Plagiarism Detection in Large Sets of Press Agency News Articles, in: Proceedings of the 17th International Conference on Database and Expert Systems Applications (DEXA), 2006, pp. 181–188.

[15] J. Koberstein and Y.-K. Ng, Using Word Clusters to Detect Similar Web Documents, in: Proceedings of the First International Conference on Knowledge Science, Engineering and Management (KSEM'06), LNAI 4092, 2006, pp. 215–228.

[16] P. LaRocque, The Book on Writing: the Ultimate Guide to Writing Well, Marion Street Press, 2003.

[17] C. Leung and Y. Chan, A Natural Language Processing Approach to Automatic Plagiarism Detection, in: Proceedings of the ACM Conference on Information Technology Education (SIGITE), 2007, pp. 213–218.

[18] V. Levenshtein, Binary Codes Capable of Correcting Deletions, Insertions and Reversals, Soviet Physics Doklady 10(8) (1966), pp. 707–710.

[19] E. Liddy, How Search Engines Work, Searcher (The Magazine for Database Professionals) 9(5) (2001), pp. 38–45.

[20] R. Lukashenko, V. Graudina, and J. Grundspenkis, Computer-based Plagiarism Detection Methods and Tools: an Overview, in: Proceedings of the International Conference on Computer Systems and Technologies (CompSysTech), 2007, pp. 1–6.

[21] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

[22] D. Metzler, Y. Bernstein, W. Croft, A. Moffat, and J. Zobel, Similarity Measures for Tracking Information Flow, in: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), 2005, pp. 517–524.

[23] K. Monostori, A. Zaslavsky, and H. Schmidt, Document Overlap Detection System for Distributed Digital Libraries, in: Proceedings of the 5th ACM Conference on Digital Libraries, 2000, pp. 226–227.

[24] S. Niezgoda and T. Way, SNITCH: A Software Tool for Detecting Cut and Paste Plagiarism, in: Proceedings of the SIGCSE Technical Symposium on Computer Science Education (SIGCSE), 2006, pp. 51–55.

[25] M.S. Pera and Y.-K. Ng, Utilizing Phrase-Similarity Measures for Detecting and Clustering Informative RSS News Articles, Journal of Integrated Computer-Aided Engineering (ICAE) 15(4) (2008), pp. 331–350, IOS Press.

[26] M.F. Porter, An Algorithm for Suffix Stripping, Program 14(3) (1980), pp. 130–137.

[27] M. Potthast, A. Eiselt, B. Stein, A. Barrón-Cedeño, and P. Rosso, Plagiarism Corpus PAN-PC-09, Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia, 2009, available at www.webis.de/research/corpora.

[28] J. Quinlan, Induction of Decision Trees, Machine Learning 1(1) (1986), pp. 81–106.

[29] A. Sediyono and K. Ku-Mahamud, Algorithm of the Longest Commonly Consecutive Word for Plagiarism Detection in Text Based Document, in: Proceedings of the International Conference on Digital Information Management (ICDIM), 2008, pp. 253–259.

[30] Z. Su, B. Ahn, K. Eom, M. Kang, J. Kim, and M. Kim, Plagiarism Detection Using the Levenshtein Distance and Smith-Waterman Algorithm, in: Proceedings of the International Conference on Innovative Computing, Information, and Control (ICICIC), 2008, pp. 569–572.

[31] T. Tashiro, T. Ueda, T. Hori, Y. Hirate, and H. Yamana, EPCI: Extracting Potentially Copyright Infringement Texts from the Web, in: Proceedings of the World Wide Web Conference (WWW), 2007, pp. 1151–1152.

[32] O. Uzuner, R. Davis, and B. Katz, Using Empirical Methods for Evaluating Expression and Content Similarity, in: Proceedings of the 37th Hawaii International Conference on System Sciences (HICSS), 2004.

[33] O. Uzuner, B. Katz, and T. Nahnsen, Using Syntactic Information to Identify Plagiarism, in: Proceedings of the ACL Workshop on Educational Applications, 2005, pp. 37–44.

[34] D. White and M. Joy, Sentence-based Natural Language Plagiarism Detection, ACM Journal on Educational Resources in Computing (JERIC) 4(4) (2004), pp. 1–20.

[35] M. Zini, M. Fabbri, M. Moneglia, and A. Panunzi, Plagiarism Detection through Multilevel Text Comparison, in: Proceedings of the International Conference on Automated Solutions for Cross Media Content and Multi-Channel Distribution (AXMEDIS), 2006, pp. 181–185.

[36] S. zu Eissen, B. Stein, and M. Kulig, Plagiarism Corpus Webis-PC-08, Web Technology and Information Systems Group, Bauhaus University Weimar, 2008.