Utilizing Text Similarity Measurement for Data Compression to Detect Plagiarism in Czech

Hussein Soori, Michal Prilepok, Jan Platos, and Vaclav Snasel

Department of Computer Science, FEECS, IT4Innovations Centre of Excellence, VSB-Technical University of Ostrava, Ostrava Poruba, Czech Republic
{sen.soori,michal.prilepok,jan.platos,vaclav.snasel}@vsb.cz

Abstract. This paper applies a data compression based similarity method to plagiarism detection. The method has previously been used for plagiarism detection in Arabic and English. In this paper, we apply it to Czech text from a local multi-domain Czech corpus containing 50 source documents with non-plagiarized parts and 100 suspicious documents. The documents were generated so that each document contains from 1 to 5 paragraphs, and the suspicion rate of each document was chosen randomly between 0.2 and 0.8. The findings of the study show that the similarity measurement based on Lempel-Ziv dictionary comparison is effective for detecting the plagiarized parts of Czech text documents, with a success rate of 82.60%. Future studies may improve this result by combining the approach with more sophisticated methods.

Keywords: similarity measurement; plagiarism detection; Lempel-Ziv compression algorithm; plagiarism in Czech; data compression; plagiarism detection tools

1 Introduction

Research on finding plagiarized text emerged in response to the massive increase of written data in various fields of science and knowledge and to the need to protect the intellectual property of authors. Many text plagiarism detection methods have been developed for this purpose. Similarity detection covers a wide area of research, including spam detection, data compression and plagiarism detection; beyond that, text similarity is a powerful tool for document processing in general. Plagiarism detection concerns many types of text documents, such as students' assignments, postgraduate theses and dissertations, reports, academic papers and articles. This paper investigates the viability of a similarity measurement based on Lempel-Ziv dictionary comparison for detecting plagiarism in Czech texts.

2 Methods of Plagiarism Detection

The similarity measure approach is one of the most widely used approaches to detect plagiarism. To a certain extent, it resembles the methods used in information retrieval, in that a retrieval rank is decided by measuring the similarity to a given query. Generally speaking, similarity-based plagiarism detection techniques can be grouped into: text-based techniques that rely on cosine similarity and fingerprints, as in [1, 2]; graph-similarity techniques that rely on ontologies [3, 4]; and line matching, as in bioinformatics [5]. The winnowing algorithm [6], which selects fingerprints from hashes of k-grams, is one of the most widely used fingerprinting algorithms.

On the other hand, some machine learning techniques have demonstrated high accuracy, as in the case of k-nearest neighbors (KNN), artificial neural networks (ANN) and support vector machines (SVM). KNN is based on the relation of a member x to its n nearest neighbors; results are optimized by varying the value of n and considering how distant x is from its neighbors. Plagiarism is then decided by whether x falls among the neighboring members and by the number of such neighbors. One of the drawbacks of this technique is its slow computational processing, because it requires calculating the distances to all members. An ANN classifier models the way the brain works with mathematical models: it mimics the way a human neuron interacts with other neurons in another layer of the brain, and the two output classes here represent plagiarized and non-plagiarized texts. In addition to its efficiency in detecting plagiarism, this technique has also proven very effective in other areas such as speech synthesis. SVM is a statistical classifier that deals mainly with text; it tries to find the boundary between plagiarized and non-plagiarized text by finding the threshold between these two classes. The boundary set is called the support vectors.

Text similarity measurement can also be used for data compression. Platos et al. [9] utilized similarity measurement in their compression algorithm for small text files. Other attempts to use similarity measurement for plagiarism detection include Prilepok et al. [7], who used it to detect plagiarism in English texts, and Soori et al. [8], who used it to detect plagiarism in Arabic texts. The idea of [7] and [8] was originally inspired by Chuda et al. [10]. In this paper, we apply the same method to detect plagiarism in Czech texts.

3 Text Similarity for Plagiarism Detection by Compression

The main property of the similarity is a measurement of the distance between two texts. The ideal situation is when this distance is a metric [11]. A distance is formally defined as a function over the Cartesian product of a set S with non-negative real values [12, 13]. A metric is a distance that satisfies the following four conditions for all x, y, z:

D(x, y) ≥ 0                                      (1)

D(x, y) = 0 if and only if x = y                 (2)

D(x, y) = D(y, x)                                (3)

D(x, z) ≤ D(x, y) + D(y, z)                      (4)

Conditions 1, 2, 3 and 4 above are called the non-negativity, identity, symmetry and triangle inequality axioms, respectively. The definition is valid for any metric, e.g. the Euclidean distance. However, applying this principle to document or data similarity is rather more complicated. The idea behind this paper is inspired by the method used in Chuda et al. [10]. This method uses the Lempel-Ziv compression algorithm and applies it to text plagiarism detection by measuring similarity [16]. The main principle is the fact that compression becomes more efficient when the same sequence of data repeats. The Lempel-Ziv compression method is one of the currently used methods in data compression for various kinds of data, such as texts, images and audio [17].
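For comparison, a well-known way to turn a general-purpose compressor into a document distance is the normalized compression distance of Cilibrasi and Vitanyi [12], computed from the compressed lengths C(x), C(y) of two documents and C(xy) of their concatenation:

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}

Smaller values indicate more shared structure between the documents. The approach used in this paper does not compress the documents directly; instead, it compares the Lempel-Ziv dictionaries built from them, as described in the following sections.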

4 Creating Dictionaries out of Documents

Creating a dictionary is part of the encoding process of the Lempel-Ziv 78 method [16]. The dictionary is built from the input text, which is split into separate words. If the current word from the input is not found in the dictionary, the word is added to it. If the current word is found, the next word from the input is appended to it, creating a sequence of words. If this sequence of words is also found in the dictionary, the sequence is extended with the next word from the input in the same way. If the sequence is not found in the dictionary, it is added to the dictionary and the number of stored sequences is increased. The process is repeated until the end of the input text is reached. In our experiment, every paragraph has its own dictionary.
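A minimal Python sketch of this dictionary construction is given below. The tokenization (lowercasing, whitespace splitting) and the optional stop-word filtering are assumptions made for illustration only, since the paper does not specify these details.

def build_dictionary(text, stop_words=frozenset()):
    # Word-based LZ78-style dictionary: a set of word sequences (phrases).
    # Stop words can optionally be filtered out before building the dictionary
    # (see Section 6.3); by default nothing is removed.
    words = [w for w in text.lower().split() if w not in stop_words]
    dictionary = set()
    phrase = []                          # currently growing word sequence
    for word in words:
        phrase.append(word)
        key = " ".join(phrase)
        if key not in dictionary:
            dictionary.add(key)          # unseen sequence: store it and start over
            phrase = []
        # otherwise the sequence is known and is extended with the next word
    return dictionary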

5 Comparison of Documents

Comparison of the documents is the main task at hand. One dictionary is created for each compared file. Next, the dictionaries are compared to each other, which amounts to counting the common word sequences in the dictionaries. This number is represented by the parameter sc in the following formula, which defines the similarity measure between two documents:

SM = sc / min(c1, c2)                            (5)

where:
– sc is the count of word sequences common to both dictionaries,
– c1 and c2 are the counts of word sequences in the dictionaries of the first and the second document, respectively.

The SM value lies in the interval between 0 and 1. If SM = 1, the documents are maximally similar; if SM = 0, the documents are completely different.
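The comparison can be sketched as follows, assuming each dictionary is a set of word sequences as in the sketch from Section 4; treating an empty dictionary as zero similarity is an assumption made here for robustness.

def similarity(dict1, dict2):
    # SM from Eq. (5): shared word sequences divided by the size of the
    # smaller dictionary; the result lies in the interval [0, 1].
    if not dict1 or not dict2:
        return 0.0                       # assumed handling of empty dictionaries
    sc = len(dict1 & dict2)              # sc: count of common word sequences
    return sc / min(len(dict1), len(dict2))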

6 Experimental Setup

In our experiments, we used a small local corpus of Czech texts from the online newspaper iDNES.cz [18]. This corpus contains 4850 words. The topics were collected from various news items, including sports, politics, social daily news and science. For our experiment, we needed a collection of suspicious documents to test the suggested approach. We therefore created 50 source documents and 100 suspicious documents from the local corpus, using a small tool that we designed for this purpose, as described in the next sub-section.

6.1 Document Creator Tool

The purpose of the document creator tool is to create the documents used as testing data. The tool works as follows. We split all documents from the corpus into paragraphs. Next, we labeled each paragraph with a tag line for quick reference, indicating its location in the corpus. From this list of paragraphs we then created two separate document collections: a source document collection and a suspicious document collection. For each source document, the tool randomly selected one to five paragraphs from the list and added them to a newly created document marked as a source document. We repeated this step for all 50 source documents; the resulting source collection contains 120 distinct paragraphs. The same steps were repeated for the suspicious documents: for each suspicious document, the tool randomly selected one to five paragraphs, some taken from the source documents created earlier and some previously unused paragraphs. We repeated this step for all 100 suspicious documents. In total, the suspicious documents were built from 227 paragraphs: 109 paragraphs taken from the source documents and 118 unused paragraphs from our local corpus. For each created suspicious document, an XML description file is produced. This file contains, for each paragraph, its source in our corpus, its starting and ending byte offsets, and the source file name.
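A simplified sketch of how one suspicious document and its XML annotation could be generated is given below. The XML element and attribute names are assumptions, since the paper does not describe its annotation schema, and the mixing of source and unused paragraphs is simplified.

import random
import xml.etree.ElementTree as ET

def create_suspicious_document(source_paragraphs, unused_paragraphs, doc_name):
    # source_paragraphs / unused_paragraphs: lists of (text, corpus_location) pairs
    n = random.randint(1, 5)                        # 1 to 5 paragraphs per document
    chosen = random.sample(source_paragraphs + unused_paragraphs, n)

    root = ET.Element("document", name=doc_name)
    parts, offset = [], 0
    for text, location in chosen:
        start = offset
        end = offset + len(text.encode("utf-8"))    # byte offsets of the paragraph
        ET.SubElement(root, "paragraph", source=location,
                      start=str(start), end=str(end))
        parts.append(text)
        offset = end + 1                            # account for the newline separator
    return "\n".join(parts), ET.ElementTree(root)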

6.2 The Experiment

Comparing a whole document when only a small part of it is plagiarized is futile, because the plagiarized part may be only a small chunk hidden in the total volume of text of the new document. Hence, splitting the documents into paragraphs is essential. We chose paragraphs because we believe they have better characteristics than sentences: they carry more content words and are less affected by stop words, i.e. prepositions, conjunctions, etc. We separated paragraphs by a line separator and created a dictionary for each paragraph of the source documents, following the steps described earlier. The fragmentation of the source documents produced 120 paragraphs and their corresponding dictionaries. These dictionaries were used as reference dictionaries against which the created suspicious documents were compared. The suspicious documents were processed in the same manner and fragmented into 227 paragraphs. For each suspicious paragraph, a corresponding dictionary was created using the same algorithm, without removing stop words in this step. This dictionary was then compared against the dictionaries from the source documents. To accelerate the comparison, only a subset of the source dictionaries was chosen, because comparing one suspicious dictionary to all source dictionaries is time consuming. We chose the subset according to the size of the suspicious dictionary, with a tolerance of 20%. For instance, if the dictionary of the suspected paragraph contains 122 phrases, we chose all source dictionaries with between 98 and 146 phrases. The 20% tolerance improves the comparison speed significantly, and we believe it does not affect the overall success rate of the algorithm. Finally, for each tested paragraph, the source paragraph with the highest similarity is reported.
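A sketch of this candidate filtering and best-match selection is given below, reusing the similarity function from Section 5; the exact rounding of the size bounds is an assumption.

def candidate_dictionaries(suspicious_dict, source_dicts, tolerance=0.2):
    # Keep only source dictionaries whose phrase count lies within +/- 20%
    # of the phrase count of the suspicious dictionary.
    size = len(suspicious_dict)
    low, high = size * (1 - tolerance), size * (1 + tolerance)
    return [d for d in source_dicts if low <= len(d) <= high]

def best_match(suspicious_dict, source_dicts):
    # Return the source dictionary with the highest SM to the suspicious one,
    # or None if no candidate passes the size filter.
    candidates = candidate_dictionaries(suspicious_dict, source_dicts)
    return max(candidates, key=lambda d: similarity(suspicious_dict, d), default=None)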

6.3 Stop Words Removal

The removal of stop words increases the accuracy of text similarity detection. In our experiment, we removed stop words from the text using a list of 138 Czech stop words from Semantikoz [19]. The algorithm is modified so that, after the fragmentation of the text into paragraphs, all stop words are removed from each paragraph and the remaining words are processed by the same algorithm.
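As a usage example, the stop-word list can be loaded once and passed to the dictionary builder sketched in Section 4; the file name and the variable paragraph_text are hypothetical.

# Hypothetical file holding the 138 Czech stop words taken from Semantikoz [19]
with open("czech_stopwords.txt", encoding="utf-8") as f:
    czech_stop_words = frozenset(line.strip().lower() for line in f if line.strip())

# Build a paragraph dictionary with stop words filtered out
paragraph_dictionary = build_dictionary(paragraph_text, stop_words=czech_stop_words)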

7 Results

This paper uses the following terms: plagiarized document refers to a document for which the PlagTool found all of its plagiarized parts listed in the attached annotation XML file; partially plagiarized document refers to a document for which the PlagTool found only some of the plagiarized parts listed in the attached annotation XML file (for example, 3 out of 5 parts of the original document); non-plagiarized document refers to a document for which the PlagTool did not find any plagiarized parts listed in the annotation XML file.

Table 1. Results

                                              Success rate
Plagiarized documents               76 / 92      82.60%
Partially plagiarized documents     16 / 92      17.40%
Non-plagiarized documents            8 / 8      100.00%

In our experiments, 82.60% of the documents were detected as fully plagiarized, 17.40% as partially plagiarized, and 100% of the non-plagiarized documents were correctly identified. In the case of partially plagiarized documents, we encountered suspicious paragraphs that were matched to another document, or cases where another paragraph had a higher similarity rate than the paragraph with the same content. This case occurs when one of the paragraphs is shorter than the other.

8 The PlagTool and Visualization of Document Similarity

The PlagTool application is divided into four parts. Three parts show document visualization and the last part shows the results. PlagTool uses three visual representations to show paragraph similarity; these help the user identify the suspicious documents and locate the suspicious parts. The first representation is a line chart. The line chart shows the similarity for each suspicious paragraph in the document. The user can easily locate the plagiarized parts of the document and their number, since higher similarities indicate paragraphs with more plagiarized content; this is shown in Figure 1, where paragraphs two and three are plagiarized. The second representation is a histogram of document similarity. The histogram depicts how many paragraphs are similar and, accordingly, how many paragraphs are plagiarized, as shown in Figure 2. The last visual representation shows the similarity in the form of colored highlighted text. Five colors are used. Black shows the source document. Red indicates that a paragraph has a similarity rate greater than 0.2. Orange shows paragraphs with a lower similarity, between 0 and 0.2, where the paragraph has only a few words in common with the source document. Green indicates that the paragraph is not found in the source document and accordingly is not plagiarized. Blue indicates the location of the plagiarized paragraph in the source document. A snapshot of PlagTool with these color functions is shown in Figure 3; a sketch of this color mapping is given at the end of this section.


Fig. 1. PlagTool: line chart of document similarity

Fig. 2. PlagTool: histogram of document similarities


Fig. 3. PlagTool: colored visualization of similarity

As shown in Figures 1, 2 and 3, the left table shows all paragraphs from the selected document, their starting and ending bytes (file location) in the document, the detected highest similarity rates, the name of the document with the highest similarity rate, and its starting and ending bytes. The middle part shows the suspicious document with color highlighting, which helps the user to easily identify the plagiarized paragraphs. The part on the right side shows the content of the paragraphs with the highest detected similarity to the source documents; this helps the user compare the content of a paragraph from the suspicious document with the corresponding paragraph from the source document. The bottom part displays information about the results in three tabs. The first two tabs show various charts of the selected suspicious document, which give the user information about the overall rate of plagiarism of the selected document. The last tab shows information about the suspicious document from the testing data set: which paragraphs are plagiarized and from which document they were copied. This information is useful for validating the success rate of plagiarism detection.
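A possible sketch of the color mapping for suspicious paragraphs described above is given below; the function name and the boundary handling are assumptions, as the paper only states the ranges, and the black and blue colors apply to the source-document view rather than to this mapping.

def highlight_color(sm, found_in_source=True):
    # Map a suspicious paragraph's similarity SM to a highlight color.
    if not found_in_source:
        return "green"       # paragraph not present in any source document
    if sm > 0.2:
        return "red"         # similarity rate greater than 0.2
    return "orange"          # low similarity, between 0 and 0.2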

9 Conclusion

In this paper, we applied the similarity detection algorithm used by Prilepok et al. [7] for plagiarism detection in English text, and by Soori et al. [8] for plagiarism detection in Arabic text, to a dataset for Czech text plagiarism detection. We confirmed the ability of the approach to detect plagiarized parts of documents after the removal of stop words, as well as its viability for the Czech language. The similarity measurement based on the Lempel-Ziv compression algorithm and its dictionaries proved to be efficient in detecting the plagiarized parts of the documents. All plagiarized documents in the dataset were marked as plagiarized and, in most cases, all plagiarized parts and their original versions were identified, with a success rate of 82.6%.

Acknowledgement. This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070) and by Project SP2014/110, Parallel Processing of Big Data, of the Student Grant System, VSB-Technical University of Ostrava.

References

1. Pera, M. S., Ng, Y. K.: SimPaD: A word-similarity sentence-based plagiarism detection tool on Web documents. Web Intelligence and Agent Systems, 9(1), 27-41 (2011)
2. Gustafson, N., Pera, M. S., Ng, Y. K.: Nowhere to hide: Finding plagiarized documents based on sentence similarity. In: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, pp. 690-696. IEEE Computer Society (2008)
3. Foudeh, P., Salim, N.: A holistic approach to duplicate publication and plagiarism detection using probabilistic ontologies. In: Advanced Machine Learning Technologies and Applications, pp. 566-574. Springer Berlin Heidelberg (2012)
4. Liu, C., Chen, C., Han, J., Yu, P. S.: GPLAG: Detection of software plagiarism by program dependence graph analysis. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 872-881. ACM (2006)
5. Chen, X., Francia, B., Li, M., McKinnon, B., Seker, A.: Shared information and program plagiarism detection. IEEE Transactions on Information Theory, 50(7), 1545-1551 (2004)
6. Schleimer, S., Wilkerson, D. S., Aiken, A.: Winnowing: Local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 76-85. ACM (2003)
7. Prilepok, M., Platos, J., Snasel, V.: Similarity based on data compression. In: Advances in Soft Computing and Its Applications, pp. 267-278. Springer Berlin Heidelberg (2013)
8. Soori, H., Prilepok, M., Platos, J., Berhan, E., Snasel, V.: Text similarity based on data compression in Arabic. In: AETA 2013: Recent Advances in Electrical Engineering and Related Sciences, pp. 211-220. Springer Berlin Heidelberg (2014)
9. Platos, J., Snasel, V., El-Qawasmeh, E.: Compression of small text files. Advanced Engineering Informatics, 22(3), 410-417 (2008)
10. Chuda, D., Uhlik, M.: The plagiarism detection by compression method. In: Proceedings of the 12th International Conference on Computer Systems and Technologies, pp. 429-434. ACM (2011)
11. Tversky, A.: Features of similarity. Psychological Review, 84, 327-352 (1977)
12. Cilibrasi, R., Vitanyi, P. M.: Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523-1545 (2005)
13. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P. M.: The similarity metric. IEEE Transactions on Information Theory, 50(12), 3250-3264 (2004)


14. Kirovski, D., Landau, Z.: Randomizing the replacement attack. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), Vol. 5, pp. V-381. IEEE (2004)
15. Crnojevic, V., Senk, V., Trpovski, Z.: Lossy Lempel-Ziv algorithm for image compression. In: Telecommunications in Modern Satellite, Cable and Broadcasting Service (TELSIKS 2003), 6th International Conference on, Vol. 2, pp. 522-525. IEEE (2003)
16. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5), 530-536 (1978)
17. Khoja, S., Garside, R.: Stemming Arabic Text. Lancaster, UK: Computing Department, Lancaster University (1999)
18. iDNES.cz. http://www.idnes.cz/ (2014)
19. Semantikoz. http://www.semantikoz.com/blog/free-stop-word-lists-in-23-languages/ (2014)