Finding Inner Copy Communities Using Social Network Analysis

Eduardo Merlo, Sebastián A. Ríos, Héctor Álvarez, Gaston L'Huillier, and Juan D. Velásquez

Department of Industrial Engineering, University of Chile
{emerlo,halvarez}@ing.uchile.cl, {srios,jvelasqu}@dii.uchile.cl, [email protected]

Abstract. Technology use is now widespread, and the internet and digital documents are considered powerful tools in both professional and personal domains. As useful as they are when used properly, they also make bad practices easy, among them copy & paste, or plagiarism. Copying and pasting from documents is a growing practice worldwide, and Chile is no exception. All levels of education, from elementary school to graduate studies, are directly affected. In response to this concern, an effort has been started in Chile to tackle the plagiarism problem among students. To this end, we apply Social Network Analysis to discover groups of people associated with each other by the similarity of their documents in a plagiarism detection context. Experiments were successfully performed on real reports written by graduate students at the University of Chile.

1 Introduction

The copy & paste syndrome is growing around the world. This unwanted behaviour occurs when students copy information into their written works or assignments instead of writing their own documents. The material is generally taken from external sources (web pages, books, images, videos, PDF documents, etc.) or from classmates' documents, without proper references to the original sources [3,8]. Today's technology allows everyone to access and manipulate huge amounts of digital content, making information and knowledge widely available [15]. It is therefore simple to gather information from these or other sources and assemble it into a new document.

In Chile, as in many other countries, the massive proliferation of schools with Internet access is allowing students to replace printed books with digital media as their information sources. Some web sites, such as Wikipedia in Spanish, are popular among students researching topics for their written work. Unfortunately, students usually do not cite their information sources or do not use them properly. This proliferation of copy & paste affects many fields, and education has been severely affected [3,15]. According to recent studies [8], plagiarism is nowadays close to 52% in schools and higher education institutions. In Chile, a total of 3200 students and 300 teachers in 16 educational institutions (high schools, universities, and professional institutes) were surveyed by researchers of the DOCODE Project¹. The results show that textual copy occurs in nearly 45% of research assignments. It is important to mention that copy & paste per se is not a fault; the problem arises when citations and references are not properly made.

In order to enhance the learning process, we propose to use Social Network Analysis (SNA) techniques [1,9,11] for the plagiarism detection problem. SNA is a good complementary tool for the representation, visualization, and analysis of multiple data sources, which makes it possible to identify patterns and interactions between social individuals [5]. We distinguish two types of copy analysis: internal and external. External copy refers to the use of a textual fragment from any document found on the Web, whereas internal copy refers to the use of textual fragments from classmates' works. The focus of this work is on internal copies. Furthermore, we define a document similarity measure in order to visualize the resulting Inner Copy Communities (ICC) (defined in section 3). Specifically, SNA is used to generate a similarity mapping among all the documents of a course, where the role of each node is identified and described in order to determine the ICC. The aim of the proposed methodology is to dissolve the set of ICCs found inside a given course.

The paper is structured as follows. Section 2 gives an overview of related work on document copy detection. Section 3 describes the proposed methodology to determine the ICCs, which is the main contribution of this work. Section 4 presents a real application and its main results. Finally, the main conclusions are detailed in section 5.

R. Setchi et al. (Eds.): KES 2010, Part II, LNAI 6277, pp. 581–590, 2010. © Springer-Verlag Berlin Heidelberg 2010

2 Related Work

Plagiarism detection for document sources can be classified into several categories [3]. From exact document copy to paraphrasing, different levels of plagiarism techniques can be used in several contexts [7,16]. Likewise, pairs of documents can be classified as unrelated, related, partly overlapped, subset, or copied [13]. According to Schleimer et al. [12], copy prevention and detection methods are combined to reduce plagiarism. While copy detection methods can only minimize it, prevention methods can reduce it or fully eliminate it. However, prevention methods need the whole of society to take part, so this solution is non-trivial. Copy or plagiarism detection methods work at different levels, from simple manual comparison to complex automatic algorithms [10]. Among these techniques, document similarity detection, writing style detection, document content similarity, content translation, multi-site source plagiarism, and multi-lingual plagiarism detection methods have been previously proposed [2,3,7,8,10,12,15].

Recent methods subdivide a document into Document Units (DU), such as words, sentences, paragraphs, or the whole document [2,3,7]. As described in PAN'09², an intrinsic and an external detection process can be applied to a given corpus [10]. Intrinsic detection refers to single-document analysis, where plagiarism must be determined using exclusively the document text [16]; external detection compares all documents with each other or with an original reference corpus [10,16]. The effectiveness of detection methods depends on the selected DU, which determines the complexity of the algorithm and hence the execution time, and on the similarity measures used to analyze the obtained results [3,7]. Trends in plagiarism detection show an evolution towards Natural Language Processing (NLP) methods, which are contributing to further plagiarism detection techniques [3,4].

¹ Fondef D08I-1015 DOcument COpy DEtection (DOCODE) project.
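To make the notion of a Document Unit more concrete, the following is a minimal Python sketch, not taken from any of the cited methods, that splits a text at three granularities. The function name `split_units`, the regular expressions, and the sample text are illustrative assumptions; real systems would need language-aware sentence and word segmentation.

```python
import re
from typing import Dict, List


def split_units(text: str) -> Dict[str, List[str]]:
    """Split a text into three candidate Document Unit granularities."""
    # Paragraphs: blocks separated by at least one blank line (assumption).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: naive split after ., ! or ? followed by whitespace (assumption).
    sentences = [s.strip() for p in paragraphs
                 for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()]
    # Words: lowercase alphanumeric runs.
    words = re.findall(r"\w+", text.lower())
    return {"paragraph": paragraphs, "sentence": sentences, "word": words}


text = "TCP splits data. IP routes it.\n\nThe receiver reorders packets."
units = split_units(text)
for name, items in units.items():
    # A finer unit means more units per document, hence more comparisons.
    print(name, len(items))
```

The finer the chosen unit, the more units each document contains and the more unit-to-unit comparisons an all-pairs scheme has to perform, which is the complexity trade-off mentioned above.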

3 Proposed Methodology

Every community is defined by its social nature, which makes each one different. Communities are commonly represented for social behaviour analysis, among other particular objectives such as enhancing interaction, diffusing and sharing knowledge, dissolving bad influences, and several other applications [5]. As described in section 1, the main goal of this work is to represent the existing copy & paste relations among students. In the following, the proposed methodology is presented and its main elements are defined.

3.1 Course Document Network (CDN)

Consider a corpus of $N$ documents, $D = \{D_1, \dots, D_N\}$. Each document $D_i$ is generated from a partition of the vocabulary set $V$, $\mathcal{P}(V)$. Then, the Copy Similarity Relation from corpus $D$ is defined as

$$\mathrm{CSR}_{ij} = \delta(D_i, D_j), \quad \forall i, j \in \{1, \dots, N\},\ i \neq j \qquad (1)$$

where $\delta : \mathcal{P}(V) \times \mathcal{P}(V) \to \mathbb{R}$ is a similarity measure for two documents, and $\mathrm{CSR}_{ii} = 0,\ \forall i \in \{1, \dots, N\}$. A Course Document Network (CDN) is defined as a one-mode simple valued network,

$$\mathrm{CDN} = (V, E) \qquad (2)$$

where each vertex represents a student's document,

$$E \subseteq V \times V \qquad (3)$$

$$V\ (\text{Vertices}) := \text{set of } D_i \text{ with } i \in D \qquad (4)$$

$$E\ (\text{Edges}) := \text{set of } \mathrm{CSR}_{i,j} \text{ measures} \qquad (5)$$

² http://www.webis.de/research/corpora
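As a rough illustration of equations (1) to (5), the sketch below builds a small valued, undirected network from a corpus and a similarity function. It is not the authors' implementation: the function `overlap` is a hypothetical stand-in for $\delta$ (the number of shared terms), documents are modelled as sets of terms, and storing only non-zero relations as edges is an assumption made for compactness.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple

Document = FrozenSet[str]   # a document modelled as a set of vocabulary terms
Edge = Tuple[int, int]      # unordered pair (i, j) with i < j


def overlap(a: Document, b: Document) -> float:
    """Hypothetical stand-in for delta: number of shared terms."""
    return float(len(a & b))


def build_cdn(corpus: List[Document],
              delta: Callable[[Document, Document], float] = overlap
              ) -> Tuple[List[int], Dict[Edge, float]]:
    """Return (vertices, edges) of a valued, undirected document network."""
    vertices = list(range(len(corpus)))
    edges: Dict[Edge, float] = {}
    for i, j in combinations(vertices, 2):   # i != j, each pair visited once
        csr = delta(corpus[i], corpus[j])
        if csr > 0:                          # assumption: drop zero-valued ties
            edges[(i, j)] = csr
    return vertices, edges


corpus = [
    frozenset({"tcp", "ip", "packet"}),
    frozenset({"tcp", "ip", "router"}),
    frozenset({"http", "browser"}),
]
vertices, edges = build_cdn(corpus)
print(vertices)  # [0, 1, 2]
print(edges)     # {(0, 1): 2.0}
```

Because the relation is symmetric, each unordered pair is evaluated once, and the diagonal is never computed, which matches $\mathrm{CSR}_{ii} = 0$.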


In this undirected graph, edges quantify the similarity among the contents of students' documents. The network is small, since the average number of students per course is low. Different CDN instances are produced from the characteristics of both vertices and edges, depending on roles and interaction [5,14].

3.2 Inner Copy Community (ICC)

We refer to the Inner Copy Community (ICC) as a non-collaborative network. Its members do not want to be publicly related, and their interaction does not promote knowledge or learning, unlike most present-day communities (e.g. Virtual Communities of Practice). In practice, no interaction among the nodes of this type of network is desired: the ideal case is an unconnected network, which means that no one is related in the Copy Community. Emerging Copy Communities must be detected to prevent the establishment of endemic structures or organizations of students that could systematically engage in this bad practice. For this, an optimal network layout can be used to reorder nodes so that each component is clearly identified. A component is defined as a set of one or more nodes, but to speak of a community we restrict this definition to at least two nodes [5].

Based on both node and relation metrics, it is possible to describe each node with a defined role inside the community, creating components that form cohesive groups. A cohesive group is defined as a component with many ties among its members. At this point, what matters is who is related and who is not, and how important each relation is in terms of centrality and betweenness [5,9]. For Copy Community roles we propose three types:

– Good Student: unconnected or isolated peripheral nodes, located far from the center of the network, with no similar document content.
– Suspicious Student: weakly connected semi-peripheral nodes, located around the center of the network, with an intermediate amount of similar document content.
– Bad Student: highly connected core nodes, located near the center of the network, with a considerable amount of similar document content.

The general hypothesis of Social Networks, according to [5], states that people who match on social characteristics will interact more often. Likewise, people who interact regularly will foster a common attitude or identity, illustrated by dense zones of people who "stick together", or whose vertices are interconnected. Cohesive groups can arise from the interaction of the last two roles: groups of Bad Students can be related to Suspicious Students, or vice versa. Furthermore, the presence of each role in a given network indicates the course's performance in terms of the originality of its documents.
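The definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the paper: communities are taken to be connected components with at least two nodes, and the role of a node is decided from its summed edge weight using a hypothetical cut-off `bad_threshold`, since the text does not give numeric boundaries between the three roles.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def communities(n: int, edges: Dict[Edge, float]) -> List[Set[int]]:
    """Connected components with at least two nodes."""
    adj: Dict[int, Set[int]] = defaultdict(set)
    for (i, j), weight in edges.items():
        if weight > 0:
            adj[i].add(j)
            adj[j].add(i)
    seen: Set[int] = set()
    result: List[Set[int]] = []
    for start in range(n):
        if start in seen or start not in adj:
            continue                      # already visited, or isolated node
        component: Set[int] = set()
        stack = [start]
        while stack:                      # depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        result.append(component)
    return result


def role(total_similarity: float, bad_threshold: float = 5.0) -> str:
    """Map a node's summed edge weight to a role (threshold is hypothetical)."""
    if total_similarity == 0:
        return "Good Student"
    if total_similarity < bad_threshold:
        return "Suspicious Student"
    return "Bad Student"


edges = {(0, 1): 10.0, (2, 3): 1.0}       # node 4 has no edges
strength: Dict[int, float] = defaultdict(float)
for (i, j), weight in edges.items():
    strength[i] += weight
    strength[j] += weight

found = communities(5, edges)
roles = [role(strength[v]) for v in range(5)]
print(found)   # [{0, 1}, {2, 3}]
print(roles)
```

In this toy network, nodes 0 and 1 share a heavy tie and fall in the Bad Student role, nodes 2 and 3 are Suspicious, and the isolated node 4 is a Good Student.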

3.3 The SNA-KDD Process

Our methodology combines SNA with the Knowledge Discovery in Databases (KDD) process. In this case, the process consists of the following steps:

– Data Selection and Preprocessing. The corpus is restricted to the documents written by the students of a given course. Students' documents are converted from their original formats to plain text so that they can be handled in the CSR evaluation step. In contrast to common text-mining applications, stop-word removal and stemming are considered here as part of the data selection and pre-processing step; however, the CSR function evaluation may not require them.
– Network Construction. Once documents are selected and pre-processed, the comparison method for the CSR evaluation must be defined. Different CSR measures can be considered to build several CDNs, allowing the analyst, reviewer, or evaluator to infer the information needed for the decision step.
– Analysis, Evaluation, and Knowledge Extraction. Depending on the number of nodes and relations of the CDNs, network visualization algorithms such as Kamada-Kawai [6] can be applied, as well as clustering techniques based on node grouping measures (degree, core, etc.) and blockmodeling, which helps to detect cohesive groups, roles, and positions.
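Two of the node grouping measures named in the last step, degree and core, can be computed without any external library. The sketch below is a plain-Python illustration and does not reproduce how Pajek computes them; `adjacency` and `core_numbers` are hypothetical helper names, and the core number is obtained by the usual peeling procedure that repeatedly removes a node of minimum remaining degree.

```python
from typing import Dict, Iterable, Set, Tuple


def adjacency(n: int, edges: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Undirected adjacency sets for nodes 0..n-1."""
    adj: Dict[int, Set[int]] = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def core_numbers(adj: Dict[int, Set[int]]) -> Dict[int, int]:
    """k-core number of every node, by peeling minimum-degree nodes."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core: Dict[int, int] = {}
    k = 0
    while remaining:
        v = min(remaining, key=lambda u: degree[u])
        k = max(k, degree[v])          # core level never decreases
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1         # v no longer counts as a neighbour
    return core


# A triangle (0, 1, 2), a pendant node 3 attached to 2, and an isolated node 4.
adj = adjacency(5, [(0, 1), (1, 2), (0, 2), (2, 3)])
degree = {v: len(nbrs) for v, nbrs in adj.items()}
core = core_numbers(adj)
print(degree)                # {0: 2, 1: 2, 2: 3, 3: 1, 4: 0}
print(sorted(core.items()))  # [(0, 2), (1, 2), (2, 2), (3, 1), (4, 0)]
```

Nodes with a high core number sit in densely tied regions of the network, which is where cohesive groups would be expected.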

4 A Real Application

Experiments were performed with Spanish-language documents from a graduate course at the University of Chile.

4.1 Data Selection and Preprocessing

The topic of the assignment used in the experiment was "how internet works": students were asked to explain how TCP/IP works when transmitting information from a sender to a receiver and vice versa. In total, 23 documents were submitted, with an average of 41 paragraphs each. The assignments were written in Spanish and were available in digital formats such as docx, doc, and pdf. The corpus was converted into ASCII-encoded text, excluding all graphics, tables, images, and any other non-text data.

4.2 Network Construction

Once documents were pre-processed, a comparison method and a proper similarity measure were determined. For this, we chose the document paragraph as the minimal copy unit (DU). As described in section 3.1, given a corpus $D = \{D_1, \dots, D_N\}$, each document $D_i$ is represented by $n$ paragraphs, $D_i = \{p_{i,1}, \dots, p_{i,n}\}$, where $p_{i,k}$ represents the $k$-th paragraph of document $i$. A simple document similarity measure (equation 6) is proposed for textual copy,

$$\mathrm{CP}_{i,j} = \sum_{k=1}^{m} \sum_{l=1}^{n} d(p_{i,k}, p_{j,l}) \qquad (6)$$

where function $d : V \times V \to \{0, 1\}$ takes the following values:

$$d(p_{i,k}, p_{j,l}) = \begin{cases} 1 & \text{if } p_{i,k} = p_{j,l} \text{ or } p_{i,k} \in p_{j,l} \text{ or } p_{j,l} \in p_{i,k} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

Then the CSR matrix for the network's construction is determined by the CP similarity measure,

$$\mathrm{CSR}_{i,j} = \mathrm{CP}_{i,j}, \quad \forall i \in \{1, \dots, n\},\ j \in \{1, \dots, m\} \qquad (8)$$
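Equations (6) to (8) translate almost directly into code. The sketch below is one possible reading rather than the implementation used in the experiments: paragraphs are plain strings, containment is interpreted as substring containment, the loops run over the actual paragraph counts of each document, and the text is assumed to be already normalised (case, whitespace) and free of empty paragraphs. The function names `cp` and `csr_matrix` and the sample corpus are illustrative.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def d(p: str, q: str) -> int:
    """Equation (7): 1 if the paragraphs are equal or one contains the other."""
    return 1 if (p == q or p in q or q in p) else 0


def cp(doc_i: List[str], doc_j: List[str]) -> int:
    """Equation (6): number of matching paragraph pairs between two documents."""
    return sum(d(p, q) for p in doc_i for q in doc_j)


def csr_matrix(corpus: List[List[str]]) -> Dict[Tuple[int, int], int]:
    """Equation (8), evaluated once per unordered pair of distinct documents."""
    return {(i, j): cp(corpus[i], corpus[j])
            for i, j in combinations(range(len(corpus)), 2)}


corpus = [
    ["tcp splits data into segments", "ip routes each packet"],
    ["ip routes each packet", "the browser renders the page"],
    ["dns resolves names"],
]
csr = csr_matrix(corpus)
print(csr)       # {(0, 1): 1, (0, 2): 0, (1, 2): 0}
print(len(csr))  # 3 document pairs for 3 documents
```

With $N$ documents there are $N(N-1)/2$ unordered pairs to evaluate, so the 23 documents of section 4.1 give 253 document pairs, each of which compares every paragraph of one document with every paragraph of the other.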

With this comparison method we make $O(n^2)$ comparisons, where $n$ is the number of documents. Other comparison techniques could also be used for the network construction [10]. We used the Pajek³ program for visualization and for the evaluation of SNA metrics. In network analysis terms, the proposed CSR helps to distinguish between isolated nodes ($\mathrm{CP}_{i,j} = 0$) and connected nodes ($\mathrm{CP}_{i,j} > 0$).

Discovering the ICC requires constructing the CDN. To do so, 946 paragraphs were extracted, among which we found 96 tuples of paragraphs that were exact matches. This first step gives a good approximation of textual similarity in the CDN, which holds the ICC we are looking for. It is only an approximation because not all similar paragraphs correspond to copy; for example, some of them may share the same web site source reference. Using these results we can find isolated nodes that show no sign of copy and dense zones of nodes that are clearly suspicious.

4.3 Analysis, Evaluation, and Knowledge Extraction

We realized that more accurate results depend on paragraph quality. This factor could be addressed if an intelligent automatic paragraph extractor were available. As this is not yet the case, a proper clean-up must be made, applying paragraph criteria to clearly identify whether similar paragraphs are in fact copies. We developed a PHP application that shows all occurrences of similar paragraphs, allowing the reviewer to check them visually and mark a paragraph as copy.

As shown in figure 1, the CDN illustrates the relationships among students' documents. Two nodes near the center hold a large amount of similarity and are extremely close to each other. From the CDN analysis, the ICC is formed by these 2 nodes, which hold 10 suspicious paragraphs in common. Based on this information, both nodes fit the Bad Students role perfectly (section 3.2). All other nodes fit the Good Students role; in these cases the similar paragraphs came from web site sources that were properly cited. None of the further analysis techniques described previously were applied, due to the small network size.

Fig. 1. CDN from the student homework corpus using the proposed CSRs

For evaluation purposes we applied two different perspectives. Only one case of copy was detected when we analyzed the ICC obtained. We proceeded to a visual inspection of the documents, which confirmed our results. Additionally, we reviewed all the other documents which, according to the experimental results, did not show any textual similarity in a copy context. Our review again agreed with the results: we did not find any evidence of textual copy in the rest of the documents. Therefore, our methodology was successfully evaluated based on these experimental results.

While constructing the network we computed some metrics for each document in order to gain insight into the results (these metrics are defined in table 1). As presented in table 2, seven documents show considerable similarity: their scores exceed the average score of 9%. Most documents are evenly connected, which is why the connectivity measure is exceeded as well. We need to consider all possible cases, so the metrics by themselves are not sufficient; score and connectivity must be combined. In this way, SC reflects both similarity and connectivity. Only 2 documents stand out in this respect, which indicates that they are both highly similar and highly connected.

³ http://vlado.fmf.uni-lj.si/pub/networks/pajek/

4.4 Academic Actions

In the second case, when the evaluator's results showed no sign of copy, there was only one assigned evaluator, who did not review all the documents at the same time. It may therefore be that the copy cases were not identified because of these timing differences. When the evaluator received the information about the copy case detected with our method, the first reaction was scepticism. Once the situation was confirmed by both sides, action had to be taken, which meant contacting the students involved. After a long conversation with both, one of them confessed to being the author of the copy. As a resolution, both were penalized with bad marks. The idea was to send a message to all students about the consequences of copying. On this basis, we think people will tend to avoid sharing their own written works because of the reputation risk involved in a copy case. Finally, all students were cautioned about the implications of future copy cases.

Table 1. Metrics definitions

| Metric | Definition | Range |
|---|---|---|
| SP | Number of similar paragraph occurrences with the rest of the documents | ≥ 0 |
| SUP | Number of paragraphs that present at least one detected similarity | ≥ 0 |
| P | Number of paragraphs | ≥ 0 |
| S | Score measure obtained from SUP/P | [0, 1] |
| C | Connectivity measure that indicates the % of documents' associations | [0, 1] |
| SC | Score-connectivity measure obtained from S × C | [0, 1] |

Table 2. Documents information for top-ten students ordered by Score (S)

| Student | SP | SUP | P | S = SUP/P | C | S × C |
|---|---|---|---|---|---|---|
| 1 | 39 | 24 | 63 | 0.38 | 0.65 | 0.25 |
| 2 | 22 | 18 | 50 | 0.36 | 0.52 | 0.19 |
| 3 | 12 | 4 | 22 | 0.18 | 0.52 | 0.09 |
| 4 | 8 | 7 | 60 | 0.12 | 0.39 | 0.05 |
| 5 | 10 | 6 | 53 | 0.11 | 0.26 | 0.03 |
| 6 | 11 | 2 | 20 | 0.10 | 0.52 | 0.05 |
| 7 | 9 | 1 | 11 | 0.09 | 0.43 | 0.04 |
| 8 | 1 | 1 | 12 | 0.08 | 0.09 | 0.01 |
| 9 | 7 | 6 | 80 | 0.08 | 0.17 | 0.01 |
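The derived columns of Table 2 follow from the definitions in Table 1, and the short sketch below recomputes them from the raw counts. The SUP, P, and C values are copied from Table 2; the helper name `metrics` and the ranking step are illustrative additions, and C is taken as given because the text does not spell out how it is computed.

```python
from typing import Dict, List, Tuple

# (student, SUP, P, C) copied from Table 2
rows: List[Tuple[int, int, int, float]] = [
    (1, 24, 63, 0.65),
    (2, 18, 50, 0.52),
    (3, 4, 22, 0.52),
    (4, 7, 60, 0.39),
    (5, 6, 53, 0.26),
    (6, 2, 20, 0.52),
    (7, 1, 11, 0.43),
    (8, 1, 12, 0.09),
    (9, 6, 80, 0.17),
]


def metrics(sup: int, p: int, c: float) -> Tuple[float, float]:
    """Return (S, SC) with S = SUP / P and SC = S * C."""
    s = sup / p
    return s, s * c


table: Dict[int, Tuple[float, float]] = {
    student: metrics(sup, p, c) for student, sup, p, c in rows
}

# Rank students by the combined score-connectivity measure.
ranking = sorted(table, key=lambda student: table[student][1], reverse=True)
print(ranking[:2])  # [1, 2]
```

The two highest SC values belong to students 1 and 2, in line with the observation in section 4.3 that only 2 documents stand out when score and connectivity are combined.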

5 Conclusions

We have shown that educational evaluators face a difficult problem with the copy and paste of documents and the way technology is being used for this purpose. It is therefore valuable to have precise information about similar works within a course in order to ease the task of identifying copy cases. In this work, we have successfully applied SNA to find an interesting hidden community, which opens the possibility of changing students' current culture when producing written works. To the best of our knowledge, SNA had not previously been used in a copy detection context. It has been used to model and analyze communities, role interaction, behaviour, and evolution through time, to name a few social applications [5]. SNA generates many insights: it allows us to diagram the course document network, find communities, and calculate metrics for the full network, its nodes, and their relations. To date there are no copy detection methods combined with SNA approaches to find communities or anything similar. However, we also realize that SNA alone does not suffice to solve the problem.

From an evaluator's point of view (school teachers, university professors, etc.), our methodology separates revision from detection. It is not necessary to read all the documents, only to check the suspicious paragraphs. This clearly reduces the time-consuming detection task while giving better and more accurate results. It also provides additional information about the content of the corpus: we can identify the documents that share the same web site sources, the same content words, and so on. We selected a sample of real documents from an ongoing course to test our SNA copy detection approach. We successfully identified one copy case, allowing an evaluator to take action regarding the students involved.

We conclude that SNA adds value in finding ICCs in terms of students' roles and behaviour. The SNA visualization provides a global representation of the network, which gives many insights for course analysis in a copy detection context. We have introduced the concept of Copy Communities in the copy detection context. As future work, this approach can be complemented with proper text mining techniques or pattern recognition algorithms in order to detect more complex plagiarism levels.

Although the present work focuses on the educational field, this application can be extended to other areas. In enterprises, it could be used to identify work groups that share documents with similar content, in order to promote interaction and align information. In the field of industrial property, it could be used in the review process of commercial patents. In research, it can be used to organize information and form structures of data with similar content. Even though this application is aimed at natural language, it could be adapted to programming code by using a different comparison method.

Acknowledgement

The authors would like to thank Felipe Aguilera for his useful comments on Social Network Analysis. We would also like to thank the continuous support of Instituto Sistemas Complejos de Ingeniería (ICM: P-05-004-F, CONICYT: FBO16; www.sistemasdeingenieria.cl); the FONDEF project (DO8I-1015) entitled "DOCODE: Document Copy Detection" (www.docode.cl); and the Web Intelligence Research Group (wi.dii.uchile.cl).

References

1. Batagelj, V., Mrvar, A.: Network analysis of texts. In: Erjavec, T., Gros, J. (eds.) Proc. of the 5th International Multi-Conference Information Society - Language Technologies, pp. 143–148 (2002)
2. Brin, S., Davis, J., García-Molina, H.: Copy detection mechanisms for digital documents. In: SIGMOD 1995: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 398–409. ACM, New York (1995)
3. Ceska, Z.: The future of copy detection techniques. In: Proceedings of the 1st Young Researchers Conference on Applied Sciences (YRCAS 2007), Pilsen, Czech Republic, pp. 5–107 (November 2007)
4. Ceska, Z.: Plagiarism detection based on singular value decomposition. In: Nordström, B., Ranta, A. (eds.) GoTAL 2008. LNCS (LNAI), vol. 5221, pp. 108–119. Springer, Heidelberg (2008)
5. de Nooy, W., Mrvar, A., Batagelj, V.: Exploratory Social Network Analysis with Pajek. Cambridge University Press, New York (2004)
6. Kamada, T., Kawai, S.: An algorithm for drawing general undirected graphs. Inf. Process. Lett. 31(1), 7–15 (1989)
7. Kang, N., Gelbukh, A.F., Han, S.-Y.: Ppchecker: Plagiarism pattern checker in document copy detection. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2006. LNCS (LNAI), vol. 4188, pp. 661–667. Springer, Heidelberg (2006)
8. Maurer, H., Kulathuramaiyer, N.: Coping with the copy-paste-syndrome. In: Bastiaens, T., Carliner, S. (eds.) Proceedings of World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education 2007, Quebec City, Canada, pp. 1071–1079. AACE (October 2007)
9. Musial, K., Kazienko, P., Bródka, P.: User position measures in social networks. In: SNA-KDD 2009: Proceedings of the 3rd Workshop on Social Network Mining and Analysis, pp. 1–9. ACM, New York (2009)
10. Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st International Competition on Plagiarism Detection. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E. (eds.) SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2009), pp. 1–9 (September 2009), CEUR-WS.org
11. Ríos, S.A., Aguilera, F., Guerrero, L.A.: Virtual communities of practice's purpose evolution analysis using a concept-based mining approach. In: Velásquez, J.D., Ríos, S.A., Howlett, R.J., Jain, L.C. (eds.) Knowledge-Based and Intelligent Information and Engineering Systems. LNCS, vol. 5712, pp. 480–489. Springer, Heidelberg (2009)
12. Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 76–85. ACM, New York (2003)
13. Seo, J., Croft, W.B.: Local text reuse detection. In: SIGIR 2008: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 571–578. ACM, New York (2008)
14. Thimbleby, H., Oladimeji, P.: Social network analysis and interactive device design analysis. In: EICS 2009: Proceedings of the 1st ACM SIGCHI Symposium on Engineering Interactive Computing Systems, pp. 91–100. ACM, New York (2009)
15. Vandehey, M.A., Diekhoff, G.M., LaBeff, E.E.: College cheating: A twenty-year follow-up and the addition of an honor code. Journal of College Student Development, 468–480 (2007)
16. Eissen, S.M.z., Stein, B.: Intrinsic plagiarism detection. In: Lalmas, M., MacFarlane, A., Rüger, S.M., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) ECIR 2006. LNCS, vol. 3936, pp. 565–569. Springer, Heidelberg (2006)