Copy detection in Urdu Language Documents using N ... - IEEE Xplore

Copy detection in Urdu Language Documents using N-grams Model

Muhammad A. Khan1, Abdul Aleem1, Abdul Wahab1, M. Nasir Khan2

1 Department of Computer Science, University of Peshawar, 25000 PAKISTAN
2 GIK Institute of Engineering & Technology, Topi

1 {mak.fast, abdul_aleem13, uop.wahab}@yahoo.com, 2 mnasirkhan174@gmail.com

Abstract: In this paper we present our work on copy detection in short Urdu text passages. Given two passages, one as the source text and another as the copied text, it is determined whether the second passage is a plagiarized version of the source text. We have developed an algorithm for plagiarism detection. We have used the n-gram model for word retrieval and found tri-grams to be the best model for comparing the Urdu text passages. Based on probability and the resemblance measures calculated from the bi-gram comparison we categorize the passages on a threshold. In the algorithm the connecting words are considered in computing and matching trigrams. We have developed a software system in C# for both the algorithms. This system can be used to detect copying in students' assignments in the Urdu language.

Keywords: Copy detection, N-gram Model, Bi-gram, Urdu Language, Natural Language Processing

I. INTRODUCTION

The advent and availability of digital information has made it possible to send, share, save and use digital data. Virtually no system is available that can prohibit or limit the misuse of the available data. Other issues associated with the misuse of digital data are ownership detection, copyright issues and plagiarism detection. Plagiarism detection is of particular interest to people in academia and the publishing sector. Plagiarism means copying the thoughts and text of another author and presenting them as one's own work [1]. One way to ensure quality in academic research is through the application of plagiarism detection. Donor and sponsor agencies like the Higher Education Commission (HEC) are interested in determining the quality of research work when deciding eligibility for a grant or fund. One of the quality assurance steps is checking the work for plagiarism. Plagiarism in academics is considered academic dishonesty, and those responsible are subject to punishment by the university or the research funding organization.

The N-gram model was first used in text categorization based on the statistical information gathered from the usage of sequences of characters [4]. N-grams are consecutive overlapping characters formed from an input stream. N-grams are used as an alternative to word-based retrieval of text. In section 2 the proposed detection algorithm is explained; in section 3 the experiments on different passages and the comparison with other n-gram models are presented. In section 4 the conclusion and future work of the paper are given.

II. PROPOSED PLAGIARISM DETECTION SYSTEM

We have proposed a plagiarism detection system for Urdu text passages based on the n-gram model. We use the trigram as our model for representing the text: tokens of three consecutive words are extracted from the passages, and these trigrams are matched. The resemblance measure is then computed for text categorization. The resemblance measure R [3] is defined as

$$R = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} \qquad (1)$$

where S(A) is the set of trigrams from passage A and S(B) is the set of trigrams from passage B. The number of matched trigrams is calculated as

$$M = |S(A) \cap S(B)| \qquad (2)$$

and the total number of trigrams is computed as

$$N = |S(A) \cup S(B)| \qquad (3)$$
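As a rough illustration of equations (1) to (3), the following Python sketch builds the word-trigram sets and computes M, N and R. It is not the authors' C# implementation: the function names, the whitespace tokenisation, the handling of empty inputs, and the choice to treat R >= 0.75 as "plagiarized" (based on the 75% threshold stated in this section) are all assumptions made for the example.

```python
from typing import Set, Tuple

Trigram = Tuple[str, ...]


def word_trigrams(passage: str) -> Set[Trigram]:
    """Return S(passage): the set of distinct three-word tokens."""
    # Assumption: words are separated by whitespace; str.split() also
    # handles Unicode whitespace, so Urdu text can be passed directly.
    words = passage.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(a: str, b: str) -> Tuple[int, int, float]:
    """Return (M, N, R) following equations (1)-(3)."""
    s_a, s_b = word_trigrams(a), word_trigrams(b)
    m = len(s_a & s_b)   # M = |S(A) intersect S(B)|
    n = len(s_a | s_b)   # N = |S(A) union S(B)|
    # Assumption: R is taken as 0.0 when neither passage yields a trigram.
    r = m / n if n else 0.0
    return m, n, r


def is_plagiarized(a: str, b: str, threshold: float = 0.75) -> bool:
    """Classify a pair of passages against a resemblance threshold."""
    # Assumption: the comparison is inclusive (R >= threshold).
    return resemblance(a, b)[2] >= threshold


if __name__ == "__main__":
    # Placeholder tokens, not passages from the paper.
    source = "w1 w2 w3 w4 w5 w6"
    copied = "w1 w2 w3 w4 w5 w7"
    print(resemblance(source, copied))
    print(is_plagiarized(source, copied))
```

With these made-up placeholder passages, three of the five distinct trigrams match, so R = 0.6 and the pair falls below the 75% threshold. Using Python sets also reflects the property that S(A) and S(B) hold only distinct trigrams.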

The value of R ranges between 0 and 1. We have set a threshold of 75% resemblance as the yardstick for classifying text as plagiarized.

1) Punctuation Removal

First of all the punctuation is removed from the passage. The algorithm used for punctuation removal is as follows.

978-1-61284-941-6/11/$26.00 ©2011 IEEE 263

Listing no. 1: the pseudo code for punctuation removal from passages.

    Clean String (STR)
    1. Define legalcharacterset = "all valid urdu characters"
    2. Initialize String = "validcharacterset"
    3. Define CleanString STR = empty string.
       For index = 0 to str.length
           currentcharacter = str.charAt(index)
           If legalcharacterset.indexOf(currentcharacter) >= 0
           Then CleanString += currentcharacter;
       End of loop
    4. Return clean string
    5. Exit

2) Algorithm for Extracting and Matching Trigram

The pseudo code of the algorithm for calculating trigrams from the given passages is given in listing no. 2. In the algorithm for matching the trigrams, the search is stopped as soon as the first match of a trigram is encountered. The reason is that the sets S(A) and S(B) can contain only distinct trigrams from the passages.

Listing no. 2: the pseudo code for extracting and comparing trigrams from the given passages.

    /* Creating Trigrams from the Passage 1 and Passage 2 */
    For index = 0 to trigram1.length-2
        str1 = trigram1[index] + trigram1[index + 1] + trigram1[index + 2]
    For index = 0 to trigram2.length-2
        str2 = trigram2[index] + trigram2[index + 1] + trigram2[index + 2]

    /* Creating Tokens out of String */
    String[] str3 = str.split(" ");

    /* Token Matching */
    For i = 0 to str.length
        For j = 0 to str.length
            If (str[i] == str[j])
            Then Increment count
                 Break;

III. EXPERIMENTS

1) Experiments with Trigram Model

We have used passages from the standard Urdu text books and rephrased them ourselves (the text passages can be provided on request).

The following are two passages J1 and J2. Passage J1 [2] is the original passage (taken from Urdu text book of ih class, NWFP text book board); J2 is the rephrased version of J1. Trigrams for both the passages are calculated: table no. 1 lists the trigrams calculated for J1 and another table contains the trigrams computed for passage J2.

2) Comparison with other n-gram Models

The tri-gram model is compared with other n-gram models to assess our selection of tri-grams as the word extraction model. In table no. 4, three passages and their rephrased versions are compared for copy detection using trigram and four-gram models. Copy detection with the bi-gram model is the maximum, but the complexity of extracting and comparing bi-grams is also the maximum. The copy detection rate of the four-gram model is the smallest; it finds a very small number of matched four-grams because it compares longer word sequences. The trigram model gives acceptable average performance at an affordable cost in terms of complexity and false alarms.

IV. CONCLUSION & FUTURE WORK

In this paper we have presented a copy detection mechanism using trigrams as our word extraction model. We have used the resemblance measure R for computing the probability of the matching text. Based on a threshold the given text is categorized as plagiarized. To assess the validity of the trigram model selection we have compared it with bi-gram and four-gram models. This comparison gives further confidence in our results and in the selection of the trigram as the model. In future we will extend this study and try to compute an adaptive threshold using machine learning techniques for classifying the Urdu text.

[Table: THE LIST OF ... | Passage No.: J1 | table body (Urdu text) not recoverable]
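The kind of n-gram list tabulated per passage, and the bi-gram, tri-gram and four-gram comparison discussed in Section III, can be sketched in Python as follows. This is an illustration only, not the authors' C# system: `word_ngrams`, `resemblance_by_n`, the whitespace tokenisation and the placeholder passages are all assumptions, and the numbers it produces are toy values rather than results from the paper.

```python
from typing import Dict, List, Sequence


def word_ngrams(passage: str, n: int) -> List[str]:
    """List the distinct word n-grams of a passage in order of first appearance."""
    words = passage.split()  # assumption: whitespace-separated words
    grams: List[str] = []
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n])
        if gram not in grams:  # keep only distinct n-grams
            grams.append(gram)
    return grams


def resemblance_by_n(a: str, b: str, sizes: Sequence[int] = (2, 3, 4)) -> Dict[int, float]:
    """Compute the resemblance R = |S(A) n S(B)| / |S(A) u S(B)| for each n."""
    scores: Dict[int, float] = {}
    for n in sizes:
        s_a, s_b = set(word_ngrams(a, n)), set(word_ngrams(b, n))
        union = s_a | s_b
        # Assumption: R is 0.0 when neither passage has an n-gram of this size.
        scores[n] = len(s_a & s_b) / len(union) if union else 0.0
    return scores


if __name__ == "__main__":
    # Placeholder passages (not from the paper): one word is substituted.
    original = "w1 w2 w3 w4 w5 w6 w7 w8"
    rephrased = "w1 w2 w3 w4 x5 w6 w7 w8"
    for gram in word_ngrams(original, 3):
        print(gram)
    for n, score in resemblance_by_n(original, rephrased).items():
        print(n, round(score, 3))
```

On this toy pair a single substituted word breaks two bi-grams, three trigrams and four four-grams, so the resemblance falls as n grows. That matches the qualitative trend described in Section III, though it says nothing about the actual detection rates reported there.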