Similarity and originality in code: plagiarism and normal variation in student assignments

Samuel Mann and Zelda Frew
Department of Information Technology, Otago Polytechnic, Dunedin, New Zealand
@tekotago.ac.nz

Abstract

This paper examines the relationship between plagiarism and normal variation in student programming assignments. Reasons why code might be similar, both innocuous and suspicious, are described. Free text searching and structural metrics are used to examine a set of programming assignments. These metrics are used as the basis for analysis of the variability in the student assignments and the processes used by the students. The boundary between normal practice and plagiarism is examined by "forced plagiarism". Finally, we briefly examine student understanding of cheating and normal work processes. The investigation of similarity has provided some clarity to the ambiguous fine line of un/acceptable practice.

Keywords: plagiarism, assessment, programming, similarity.

1 Introduction

Plagiarism in student assignments is a matter of high academic interest (Dick et al. 2003). Most institutions have a policy stating that non-originality is inappropriate and outlining disciplinary procedures. There are various plagiarism detectors (eg JPlag, Prechelt et al. 2002; MOSS, Schleimer et al. 2003). There is little guidance, however, on how much similarity constitutes cheating. This is particularly troublesome for programming assignments. This paper examines reasons why code might be similar, both innocuous and suspicious. We examine a set of programming assignments using both free text searching and structural metrics. These metrics are used as the basis for analysis of the variability in the student assignments and the processes used by the students. We carry out "controlled cheating" and use the metrics to describe the artefacts. Finally, we briefly examine student understanding of cheating and normal work processes.

Copyright © 2006, Australian Computer Society, Inc. This paper appeared at the Eighth Australasian Computing Education Conference (ACE2006), Hobart, Tasmania, Australia, January 2006. Conferences in Research in Practice in Information Technology, Vol. 52. Denise Tolhurst and Samuel Mann, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

2 Student similarity and originality

Dick et al. (2003) describe many different strategies for combating plagiarism: detection, setting plagiarism-proof assignments (tailoring per student, inclusion of personal context, changing assignments etc), watermarks (Daly and Horgan 2005), and detection and response. They, like most authors, prefer to take a proactive approach based on education of students. Martin (1994) argues against general use of plagiarism detectors, instead favouring education on appropriate practices, so students can become "ambassadors through good practice". Dick et al. (2003) describe institutional methods for reducing cheating: an "orientation programme…devoted to teaching proper academic behaviour" and a "well defined and supported…policy". McDowell and Brown (nd) agree, but point out that it is not clear cut, despite written guidance and codes of practice. Academics might describe situations ranging from collaboration through to blatant copying, but are then unable to provide a statement other than "there's a fine line between these two".

In free text similarity checking, such as for an essay, 10% similarity is usually considered acceptable, to allow for references, quotes, common phrases and specific subject matter (an essay on the Battle of Hastings is likely to state that "King Harold was shot in the eye by an arrow"). For code, the situation is more complex. Prechelt et al. (2002) described a study of the effectiveness of JPlag, an online plagiarism detector for code. They describe thresholds in detecting plagiarism: with 0-5% similarity "this is clearly no plagiarism", with 100% similarity "this is clearly plagiarism", but they then raise the question of variability of student code: "but what if the similarity is 40%?". Their answer is that human checking then takes over. What is not clear is whether this 40% was a number purely for illustration, or whether there is some basis for it.

The questions for this paper, then, are: what is the 'normal' variability in student coding assignments? Is there a threshold at which the normal variation can be considered plagiarism? And can we, from this, describe acceptable academic practice to students?

2.1 Expected variability

There are many innocuous and suspicious reasons why student code might be similar within a cohort:

• Students will/should be following institutional coding conventions and style
• Students are at similar stages in their programming experience/maturity
• Students have been taught the same exemplars
• The task in a coding exercise is usually defined to explore a limited subset of functions with a limited solution set
• The solution might be tightly defined (eg in network configuration scripting)
• Partial solutions may have been worked in class
• The assignment might be an elaboration of work previously solved (in class or in a previous assignment)
• The programming language is a very limited subset of usual languages
• Students are working from the same textbook
• Stub code or integrative code might have been provided, forcing the solutions to common class, function and variable names
• Students undertook acceptable collaboration (eg in planning solutions)
• Work splitting and recombining
• Paired or Xtreme Programming
• Components used by students
• Paid external help
• Groupwork submitted individually (perhaps with superficial changes)
• Copied code (perhaps disguised: comments and strings changed or removed, variable names changed, layout changed, structure changed).

This list is not exhaustive, but it does highlight the difficulty in prescribing acceptable practice: where on the list above should a line be drawn? With code reuse actively promoted, the distinction becomes further blurred for students. Dick et al. (2003) note that the degree of acceptable behaviour depends on the purpose of the task.

2.2 Variability in student code

In order to explore normal variation in student code, we examined four corpora of student programming assignments. These were taken from two years' worth of second year object oriented programming in a vocational Information Technology degree. A non-identifying association was made between code and grades, and a coded identifier was used to allow exploration and reporting of student work practice (including reporting of the patterns associated with known plagiarism). Care has been taken here not to identify and report actual students. The code itself had identifying characteristics removed (filenames etc) before being submitted to external similarity engines.

The assignments are shown in Table 1. All assignments used Java.

Name            students   Task
Ark 04_1        30         Inheritance
Bank 04_2       23         Linked lists
Sunshine 05_1   18         Data manipulation
Ark 05_2        18         Inheritance

Table 1: Assignments

The student code was described by the following metrics:

• free text similarity (FT, using Turnitin.com on key sections of code: http://www.turnitin.com)
• lines of code (LOC)
• cyclomatic complexity (CC)
• function count (FN)
• comment lines (CN)

These latter measures were calculated using Resource Standard Metrics (http://msquaredtechnologies.com/). There are many algorithms for detecting plagiarism (Schleimer et al. 2003); the focus here, however, is on understanding normal variability, hence the use of traditional software metrics.

Table 2 can be examined both by assignment and by metric. Ark 04_1 is perhaps the simplest of the tasks but required quite a lot of essentially repeated code, as each of the required animals had to be specified. This assignment had the highest median LOC with almost the smallest standard deviation. Similarly, although the cyclomatic complexity is high, the standard deviation of the CC is low. The lecturer provided integrative code, a "NoahsArk" class that was used to interact with the animals, and an interface "Domesticated". This had the effect of standardising the code structures that interact with these files.
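Purely to make these counts concrete, the following is a minimal, heuristic sketch of the kind of size and structure measures involved (LOC, comment lines, function count and an approximate cyclomatic complexity) for a single Java file. It is an illustration only: the study used Resource Standard Metrics for these counts and Turnitin for the free text similarity, and the regular expressions below are rough assumptions rather than either tool's rules.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough, illustrative metric counter for one Java source file:
// LOC, comment lines (CN), approximate function count (FN) and
// approximate cyclomatic complexity (CC).
public class SimpleJavaMetrics {

    // Decision points that add to cyclomatic complexity (heuristic only).
    private static final Pattern DECISION =
            Pattern.compile("\\b(if|for|while|case|catch)\\b|&&|\\|\\|");

    // Crude method-header heuristic: "type name(...) ... {" on one line.
    private static final Pattern METHOD_HEADER =
            Pattern.compile("\\w+\\s+\\w+\\s*\\([^)]*\\)[^;{]*\\{");

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        int loc = 0, comments = 0, functions = 0, complexity = 1;
        boolean inBlockComment = false;

        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;

            if (inBlockComment) {                  // inside a /* ... */ block
                comments++;
                if (line.contains("*/")) inBlockComment = false;
                continue;
            }
            if (line.startsWith("//")) { comments++; continue; }
            if (line.startsWith("/*")) {
                comments++;
                inBlockComment = !line.contains("*/");
                continue;
            }

            loc++;                                               // code line
            if (METHOD_HEADER.matcher(line).find()) functions++;
            Matcher m = DECISION.matcher(line);
            while (m.find()) complexity++;
        }

        System.out.printf("LOC=%d CN=%d FN=%d CC=%d%n",
                loc, comments, functions, complexity);
    }
}
```

Applied to one submission at a time (for example, java SimpleJavaMetrics Animal.java, where the file name is hypothetical), counts of this kind give the per-assignment distributions summarised in Table 2; the free text similarity (FT) came from Turnitin and cannot be reproduced this way.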

04_1              LOC       CC       FN       CN        FT
average           350.86    86.07    72.14    56.64     59.96
stdev             98.44     27.80    25.94    47.55     31.00
max               563       145      131      193       100
min               116       18       10       0         0
stdev % of avg    28%       32%      36%      84%       52%

04_2              LOC       CC       FN       CN        FT
average           228.19    47.95    23.86    60.43     13.14
stdev             140.30    31.46    12.83    35.95     21.00
max               797       179      72       177       79
min               123       25       11       12        0
stdev % of avg    61%       66%      54%      59%       160%

05_1              LOC       CC       FN       CN        FT
average           277.06    53.50    13.50    139.33    20.17
stdev             110.27    24.68    10.76    69.07     22.62
max               797       179      72       360       79
min               123       25       10       12        0
stdev % of avg    40%       46%      80%      50%       112%

05_2              LOC       CC       FN       CN        FT
average           325.06    81.65    59.35    154.18    42.76
stdev             79.73     19.62    14.96    137.14    28.85
max               483       124      79       535       91
min               226       55       14       5         0
stdev % of avg    25%       24%      25%      89%       67%

Table 2: Summary of variability statistics for four assignments

The second assignment in 2004 (Bank 04_2) was the modelling of the behaviour of queues in a bank. Although possibly the most difficult of the assignments, students had a rehearsal of the problem in a lab exercise and were pointed to a solution ("you will need a linked list; code in the book provides this functionality"). This assignment generated the least code, although the highest percentage variability. The grades for this assignment had a bimodal distribution, as many students struggled while others excelled.

The first assignment in 2005 (Sunshine 05_1) was a string and numerical manipulation exercise (analysing Bilbo's meticulous weather records for the Shire). While neither the LOC nor the CC was unusual, the median function count was low and the range and standard deviation of the function count were very high. This appears to be the result of some students not utilising function structures that would have greatly simplified their task.

The second assignment in 2005 (Ark 05_2) had the same scenario as the Ark assignment in 2004, except that this time the integrative files (NoahsArk and Domesticated) were not provided and the required tree was not as deep. This meant that the students were freer to develop their own structures, although some methods were required ("Manfriend"). Both the LOC and the CC decreased from Ark 04_1, as did the variability of these metrics.

The comments as a percentage of LOC increased slightly between the first and second assignments in 2004, from 16% to 26%, but decreased from 50% to 47% in 2005, although the percentage of comments in 2005 was considerably higher overall. The variability of the amount of commenting is high, with all standard deviations above 50%.

The patterns of free text similarity are strange, perhaps indicating the unreliability of free text search comparisons for code. The very high standard deviations suggest non-normal distributions of similarities (Figure 1). All of the similarities were within the group, suggesting that students did not look to the internet for solutions (or, of course, that they did not find them, a deliberate strategy in this assessment design). Of particular interest is that 67% of assignments had above 10% similarity, the traditional threshold for flagging potential plagiarism in written assignments. Indeed, 30% had higher than 60% similarity. In the next sections we examine the nature of these similarities.

Figure 1: Frequency distribution of free text similarities for all submissions (counts of submissions in 10% similarity bins from 0 to 110%)

Before doing this, however, the two extremes are worth noting. At the very dissimilar end, 33% of students had less than 10% similarity and 25% had zero similarity with other code. Given the reasons promoting similar code listed in section 2.1 above, this is somewhat surprising. Who are these people who are writing such unique code? Unfortunately there is no real relationship between the textual similarity of code and final grade; the correlation between these two factors is only 31%, and those with zero similarity stood an approximately equal chance of getting As or failing grades.

The Ark 04_1 assignment contains a known plagiarism pair. One A+ student provided his code to a fellow student to "point him in the right direction". Unfortunately that student used this opportunity for blatant cheating. Leaving the structural code intact, 04_19 disguised the comments and variables (Figure 2). In doing this copying, the student managed to add five lines of code but remove two functions from the whole program.

Figure 2: Blatantly disguised without touching either layout or structural code (original at top). Note that, although misspelt, the original "Overriding" indicates a better understanding of OO concepts than the altered "change".
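To make concrete the kind of disguise shown in Figure 2 (comments and variable names changed while layout and structure are left intact), here is an illustrative sketch; the sample snippet and the renaming map are hypothetical, not taken from the actual student submissions. Because the executable lines are untouched, LOC, CC and FN, and hence the structural measures discussed below, are essentially unchanged.

```java
import java.util.Map;

// Illustrative "disguise" of the Figure 2 kind: comments are stripped
// (the cheating student reworded rather than removed them, but the
// structural effect is the same) and identifiers are renamed.
// The snippet and the rename map are hypothetical examples.
public class DisguiseSketch {

    static String disguise(String source, Map<String, String> renames) {
        // Remove line comments and block comments.
        String result = source
                .replaceAll("//.*", "")
                .replaceAll("(?s)/\\*.*?\\*/", "");

        // Rename identifiers using whole-word matches.
        for (Map.Entry<String, String> e : renames.entrySet()) {
            result = result.replaceAll("\\b" + e.getKey() + "\\b", e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String original = """
                // Overriding the feed method for domesticated animals
                public void feed(int amount) {
                    hunger = hunger - amount;
                }
                """;
        String copy = disguise(original,
                Map.of("feed", "giveFood", "amount", "qty", "hunger", "appetite"));
        System.out.println(copy);   // same structure, different surface
    }
}
```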

For each of LOC, CC and FN, a normalised value was generated for each student assignment. These values were then compared in pairwise fashion with every other submission in that assignment. This created a structural similarity measure for each pair and an overall structural dissimilarity index (SD) for each submission. Being the sum of the differences of the three measures, each of which has been normalised, the SD is a normalised difference index. An SD of zero indicates identical structure, whereas an SD of 3 indicates that each of the three measures is one standard deviation different. The copying case described above results in an SD of 0.3047.
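A minimal sketch of this SD calculation is given below, under two stated assumptions: the per-submission SD is taken as the mean of that submission's pairwise values (the paper does not spell out the roll-up), and the metric values are invented for illustration.

```java
// Sketch of the structural dissimilarity index (SD): LOC, CC and FN are
// normalised per assignment (z-scores) and the pairwise SD is the sum of
// absolute differences of the normalised values. The per-submission SD
// is assumed here to be the mean over that submission's pairs, and the
// metric values are invented.
public class StructuralDissimilarity {

    // Column-wise z-score normalisation of a [submission][metric] matrix.
    static double[][] normalise(double[][] metrics) {
        int n = metrics.length, m = metrics[0].length;
        double[][] z = new double[n][m];
        for (int j = 0; j < m; j++) {
            double mean = 0, var = 0;
            for (double[] row : metrics) mean += row[j] / n;
            for (double[] row : metrics) var += Math.pow(row[j] - mean, 2) / n;
            double sd = Math.sqrt(var);
            for (int i = 0; i < n; i++) z[i][j] = (metrics[i][j] - mean) / sd;
        }
        return z;
    }

    // Pairwise SD: sum over the three metrics of |z_a - z_b|.
    static double pairwiseSD(double[] a, double[] b) {
        double sd = 0;
        for (int j = 0; j < a.length; j++) sd += Math.abs(a[j] - b[j]);
        return sd;
    }

    public static void main(String[] args) {
        // Hypothetical per-submission metrics: {LOC, CC, FN}.
        double[][] metrics = {
                {350, 86, 72}, {360, 90, 70}, {520, 140, 120},
                {300, 75, 60}, {116, 18, 10}
        };
        double[][] z = normalise(metrics);
        int n = z.length;

        for (int i = 0; i < n; i++) {
            double total = 0;
            for (int k = 0; k < n; k++) {
                if (k == i) continue;
                double sd = pairwiseSD(z[i], z[k]);
                total += sd;
                if (i < k && sd < 0.5) {   // threshold discussed in the conclusion
                    System.out.printf("flag pair (%d, %d): SD = %.3f%n", i, k, sd);
                }
            }
            System.out.printf("submission %d: individual SD = %.2f%n",
                    i, total / (n - 1));
        }
    }
}
```

The same pairwise values can be screened against a threshold, such as the 0.5 SD suggested later in the conclusion, as the inner loop shows.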

We examine some illustrative pairs below, but first it is worth exploring the SD for each student. This individual SD highlights variability in approaches to working. Some individuals have a high SD, meaning that their work is very dissimilar to everyone else's; others have a very low SD, meaning that their work is the most similar to everyone else's. The characteristics of both of these groups of students are of interest here.

Four students had an SD in the first assignment (Ark 04_1) that was very different from everyone else's (including each other's). While the average SD for that assignment was 3.3 (2.6 if they are excluded), their SDs are in the 9-13 range. This means that their code is more than two standard deviations different from everyone else's on each of the three structural measures. Three of these students failed the course. Examination of the code of two of these shows it to be garbage. The same is true for the very different student in the second assignment (Bank 04_2, 4.22). One student who failed the syntax test administered early in the course to assess comprehension later submitted code that was extremely different in style from the rest of the class and appeared to be highly competent. This student either miraculously recovered his form through the semester and developed his own style, or had external assistance. Another student is an unusual case. He is a competent student (4.12, final mark 86%) but one who did not use the exemplars provided, instead preferring to learn by developing entirely independent solutions, and he did this successfully (in the second assignment, however, this student wrote some of the most vanilla code).

At the other end of the scale are the students who write code most similar to everyone else's. This is not to say that there is collusion or plagiarism; we exclude from this characterisation those whose work is very similar to that of other individuals. Examination of these assignments shows them to most closely follow the assignment instructions and the provided exemplars. In the second 2004 assignment (the linked list problem), the overall similarity increased (the average SD decreased to 1.2). This is surprising but, while this assignment had the lowest degree of structural red flags, it had the highest paired similarity, ie more evidence of people using textbook examples and sharing ideas or code snippets without actually copying, although see Figure 3 for evidence that the same student who stole code in the first assignment cheated again, repeating his actions with another student's code. For most students the similarity is the result of normal work: Figure 4 shows the same code as Figure 3, independently generated yet with an SD of 0.27 (against the top example in Figure 3).

Three students provided work (Ark 05_2: 9, 5 and 12) on which during marking the lecturer had commented "this work is structurally similar to x". This can be seen in the cyclomatic complexities of 47, 46 and 42 (where the standard deviation was 19). These students were known to have worked together to a flowchart stage in the development of a solution. The production of the actual code was entirely independent (Figure 5). Despite this, the SD for the pairs ranged from 0.05 to 0.4 (ie, over the three metrics combined, less than half a standard deviation of variation). The lecturer, however, was satisfied that this was acceptable practice for this assignment: the code was clearly written by each student independently. A different group of students worked together and produced structurally similar work (SD = 0.0998), but here the lecturer was not happy: these students had gone on to write the code together (Figure 6).


Figure 3: The same student who cheated in the first assignment copies from someone else; original at top (SD = 0.0712)

Figure 4: Independently derived solution, very similar to the top example in Figure 3 (SD = 0.27)



Figure 5: Segments of code from a group of students who developed the structure together but then wrote the code independently

Figure 6: Both structural and syntactic similarity

2.3 Forced plagiarism

We have described normal practice in student programming, and examined how this can be seen in terms of variability in software metrics. In order to investigate how close this 'normal practice' is to plagiarism, we invited a skilled programmer to generate forced plagiarism based on submissions from our corpora. Of particular interest was not so much whether this faked plagiarism could be detected, but how similar it would appear to the normal practice of student work.

            comments   Variables etc   layout     structure   SD with original   SD < 0.5   False flags
Ark 04_1    edit       -               -          -           0                  3          0
            edit       yes             minor      -           0.0506             1          0
            replace    yes             moderate   -           0.1051             0          0
            replace    yes             major      -           0.1077             0          1
            edit       yes             yes        yes         0.2330             4          4
Bank 04_2   edit       -               -          -           0.1106             5          2
            edit       yes             minor      -           0                  6          0
            replace    yes             moderate   -           0.2816             3          3
            replace    yes             major      -           0.7007             3          15
            edit       yes             yes        yes         1.1000             0          42

Table 3: Forced plagiarism

Table 3 shows the various treatments applied to different original submissions from Ark 04_1 and Bank 04_2. The SD with the original, unaltered submission is shown, along with the count of cases where the SD of the forced plagiarism against other submissions is less than 0.5. The final column shows the count of false positives that would be found in the normal corpora if a detection threshold were set at a level that identifies the forced plagiarism. The last row in each group is a considerable rewrite of the code by a skilled programmer, including non-trivial structural changes.

On the first assignment, changing the comments, variable names and layout was easily distinguished from normal variation, even when they were changed by significant amounts. Only when the structure of the code was changed did four cases of false positives emerge. On the second assignment, the one already identified as having high variability, the difference between forced plagiarism and normal variation was less distinct. With minor changes the plagiarism was identified, but as the modifications became major the number of false positives increased (to 42 pairs, although this includes the cooperating groups described above, and remember that this is from a total pair pool of 231).
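To illustrate how the "False flags" column can be read, the sketch below counts, for a given forced-plagiarism SD, how many innocent pairs in the normal corpus would fall under a detection threshold set at that level. The pairwise SD values here are invented for illustration; in the study they come from the full pair pool (231 pairs for Bank 04_2).

```java
import java.util.List;

// Sketch of the "False flags" column in Table 3: if the detection
// threshold were set just high enough to catch a particular piece of
// forced plagiarism, how many innocent pairs would also be flagged?
// The pairwise SD values are invented examples.
public class FalseFlagCount {

    static long falseFlags(List<Double> normalPairSDs, double forcedPairSD) {
        // Every normal pair at or below the forced pair's SD would be flagged too.
        return normalPairSDs.stream().filter(sd -> sd <= forcedPairSD).count();
    }

    public static void main(String[] args) {
        List<Double> normalPairSDs =
                List.of(0.27, 0.41, 0.62, 0.80, 1.35, 2.10, 3.30);

        System.out.println(falseFlags(normalPairSDs, 0.1077)); // surface rewrite: 0
        System.out.println(falseFlags(normalPairSDs, 1.1000)); // structural rewrite: 4 here
    }
}
```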

2.4 Student understanding

We discussed student programming practice with a class of second year students in order to examine student understanding of the "fine line" between acceptable work practices and plagiarism. These students, now studying software engineering, had all previously completed the Java class described above. We asked the class to describe plagiarism. All five groups of students repeated the traditional "unacknowledged copying" definition. The class was then asked to provide some descriptions of such plagiarism: all five groups gave the simple case of stolen or purchased work. When asked what percentage similarity should raise concerns about authenticity, they recognised that code would be more similar than essay work, and the class agreed that 30% similarity should raise suspicions. They were then prompted to discuss reasons for similarity, and returned with a list similar to that in section 2.1 of this paper. They were then asked to indicate what was acceptable practice. There was total agreement on what lay either side of the line, but the boundary became fuzzy over the nature of acceptable and unacceptable collaboration. The students, it seems, agreed with the unhelpful observation of academics reported by Dick et al. (2003): "I'll know it when I see it".

We put the students to the test and provided them with key code from the corpora, that of Ark 04_1, including both the known and the forced plagiarism. Despite being given significant hints and 30 minutes, none of the groups was able to find the plagiarism: one student threw her hands in the air, exclaiming "it's all the same!". When then asked what the threshold for flagging plagiarism should be, the groups suggested 60-90% as normal similarity. The students were very surprised to be shown the known plagiarism and marvelled at the ability of human markers to ever find any plagiarism. The students were then shown the SD matrix and examined the examples of flagged code; knowing the structural similarity, they were able to find the forced plagiarism. The students understood and agreed with the distinction between collaboration over structure and shared code.

3 Conclusion

In this paper we have examined the notion of normal variability in student code. There are good reasons why code might be similar, although the variability depends on the task assigned. For our samples of student code, the percentage of structural variability ranged from 25 to 68% of the mean. This variability increased with more advanced tasks, despite the average total LOC decreasing. We described code using an index that combines three measures of structural variability. A threshold of 0.5 SD seems appropriate as an indicator of potential issues in student code. While this would not catch the most extreme case of our forced plagiarism, such major structural modification required skill in programming and almost as much effort as writing from scratch. This 0.5 SD can be seen as a sixth of a standard deviation across each of the three structural measures. We are not arguing for the veracity (or otherwise) of our measures; rather, we wish to highlight the strong similarity of student code, and that investigation of this similarity has provided some clarity to the ambiguous fine line of un/acceptable practice.

This paper has also highlighted the importance of examining code for a lack of similarity. In writing this paper we found a case of previously unknown suspected plagiarism, which we identified as code that was competent yet extremely different in style from that of all the other students. Examining student code in this way has helped identify that, for these assignments, the "fine line" between acceptable and unacceptable practice is positioned between collaboration over structure and shared code. For different assessment tasks the position would differ, but, crucially, we believe that academics should be able to articulate quite clearly what is acceptable and what is not. Anecdotal evidence suggests that students would find discussions about the process of normal and acceptable collaboration useful, and that discussions of structural and syntactic similarity contribute usefully to them. We would suggest that such discussions replace, at least in part, the emphasis on the consequences of being caught plagiarising.

A worthwhile further study would be to formalise the anecdotal discussion of student understanding of what constitutes normal practice.

3.1 Acknowledgements

We appreciate the useful discussions and assistance of Mark Crook, Lesley Smith, Robin Day and Patricia Haden.

4 References

Culwin, F. and Lancaster, T. (2000) A review of electronic services for plagiarism detection in student submissions. Proceedings of the First Annual Conference of the Learning and Teaching Support Network for Information and Computer Sciences, 54-60.

Daly, C. and Horgan, J. (2005) Patterns of plagiarism. Proceedings of the 36th SIGCSE Technical Symposium on Computer Science Education, St. Louis, Missouri, USA, 383-387.

Dick, M., Sheard, J., Bareiss, C., Carter, J., Joyce, D., Harding, T. and Laxer, C. (2003) Addressing student cheating: definitions and solutions. SIGCSE Bulletin 35(2):172-184.

Joyce, D. (2003) Checking originality and preventing plagiarism. Proceedings of the 16th Annual Conference of the National Advisory Committee on Computing Qualifications (NACCQ), 303-306.

Joyce, D. (2004) Plagiarism: meeting new challenges. Proceedings of the 17th Annual Conference of the National Advisory Committee on Computing Qualifications (NACCQ), 504.

Martin, B. (1994) Plagiarism: a misplaced emphasis. Journal of Information Ethics 3(2):36-47.

McDowell, L. and Brown, S. (nd) Assessing students: cheating and plagiarism. http://www.english.ltsn.ac.uk/explore/resources/plagiarism/reference1.php, viewed June 2005.

Prechelt, L., Malpohl, G. and Philippsen, M. (2002) Finding plagiarisms among a set of programs with JPlag. Journal of Universal Computer Science 8(11):1016-1038.

Schleimer, S., Wilkerson, D. and Aiken, A. (2003) Winnowing: local algorithms for document fingerprinting. Proceedings of the 2003 SIGMOD Conference, San Diego, CA.