Classifying Code Clones with Configuration

Toshihiro Kamiya
Future University Hakodate
116-2 Kamedanakano-cho, Hakodate, Hokkaido, Japan 041-8655

ABSTRACT
To answer the question, “Which kind of code clone has deleterious effects on maintainability?”, this paper proposes a classification of code clones based on a configuration (such as a Makefile) that determines which set of source files is compiled and linked.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement – Restructuring, reverse engineering, and reengineering.

General Terms
Management, Measurement, Experimentation.

Keywords
Code clone, Taxonomy, Configuration, Clone Management.

1. INTRODUCTION
A code clone is a set of code fragments that are (nearly) identical. Several experiments have examined whether the existence of code clones has a deleterious effect on the maintenance of source files. Some results show a positive correlation [1,4,5] and others a negative one [2,3]. This conflict implies that some code clones have a deleterious effect and others do not. Therefore, the following question naturally arises: “Which kind of code clone has a deleterious effect on maintainability?” This study classifies code clones according to a configuration. More specifically, we classify code clones by investigating how a configuration deals with the source files in which the code fragments of each code clone appear.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IWSC2010 May 8, 2010, Cape Town, South Africa. Copyright 2010 ACM 978-1-60558-980-0/10/05 ...$10.00.

2. CODE CLONE
A code clone addressed in this paper is a clone class (or clone set), that is, a set of code fragments. A code clone is represented as follows: c = { cf1, cf2, ... }, where cf1, cf2, ... are the code fragments that are determined (by some predicate) to be identical to each other. A code fragment is a (not necessarily contiguous) part of a source file, and a source file is represented as a sequence of tokens. Therefore, a code fragment is represented as follows: cf = { (s, i1), (s, i2), ... }, where s is a source file and each of i1, i2, ... is an index of a token in source file s.
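The two set definitions above map naturally onto immutable sets. The following is a minimal sketch, assuming that source files are identified by path strings and that token indices are zero-based integers. The names `make_fragment`, `source_file_of`, and `files_of`, as well as the file names, are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A token position is a (source_file, token_index) pair,
# mirroring cf = { (s, i1), (s, i2), ... }.
TokenPos = Tuple[str, int]
CodeFragment = FrozenSet[TokenPos]    # all positions share one source file
CodeClone = FrozenSet[CodeFragment]   # c = { cf1, cf2, ... }


def make_fragment(source_file: str, token_indices: Iterable[int]) -> CodeFragment:
    """Build a code fragment; the token indices need not be contiguous."""
    return frozenset((source_file, i) for i in token_indices)


def source_file_of(fragment: CodeFragment) -> str:
    """Return the single source file that a fragment belongs to."""
    files = {s for s, _ in fragment}
    if len(files) != 1:
        raise ValueError("a code fragment must lie within exactly one source file")
    return next(iter(files))


def files_of(clone: CodeClone) -> Set[str]:
    """Return the set of source files in which the clone's fragments appear."""
    return {source_file_of(cf) for cf in clone}


# Hypothetical clone with two fragments; the second one skips token 12.
cf1 = make_fragment("a.c", range(0, 5))
cf2 = make_fragment("b.c", [10, 11, 13, 14, 15])
clone = frozenset({cf1, cf2})
```

Representing a clone by the set of files returned by `files_of` is sufficient for a classification that only asks which source files the fragments appear in.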

3. CONFIGURATION
A configuration is a series of commands that generates executable files (with compilers and linkers) from the given source files. In other words, a configuration determines which set of source files is used to generate an executable file, which options are passed to a compiler or a linker, and so on. When a developer uses the make command on Unix (or Linux), a configuration is given as a Makefile (the ./configure command is often used to generate this Makefile, that is, a configuration). When a developer uses Visual Studio on Windows, a configuration is given as a *.vcproj file. By controlling the compiling and linking of the source files, a configuration also decides which set of features is provided by the generated executable files and which source files are used to implement each of the features. For example, in the case of the Python interpreter (described later), when no database (sqlite) is installed on a developer's site, the available configuration does not compile and link the source files that are related to the database feature (module sqlite3).

3.1 Model
Relations between configurations, source files, and features are represented using the following model. Let SU be the total set of source files, and FU be the total set of features. A configuration C uses a set of source files S(C) ⊆ SU (that is, the source files that are compiled and linked in the configuration). A configuration provides a set of features F(C) ⊆ FU (that is, the features that are provided to users by the executable files generated by the configuration). Additionally, for the sake of simplicity, we assume that a source file is used to implement a single feature (whereas a feature is generally implemented by multiple source files). Under this simplification, a function SFM: SU → FU (source file to feature mapping) can be defined.
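The model can be written down directly as sets and a mapping. The sketch below is illustrative only: the file and feature names, the membership of S(C) and S(D), and the feature assigned to each file are assumptions made for the example, and the helper names `check_model` and `features_from_files` are hypothetical.

```python
# Total set of source files SU and total set of features FU (hypothetical names).
SU = {"s1", "s2", "s3", "s4", "s5"}
FU = {"f1", "f2", "f3", "f4"}

# S(C) and S(D): source files compiled and linked by each configuration (assumed).
S = {
    "C": {"s1", "s2", "s3"},
    "D": {"s3", "s4", "s5"},
}

# SFM: each source file implements exactly one feature (assumed assignment).
SFM = {"s1": "f1", "s2": "f2", "s3": "f3", "s4": "f2", "s5": "f4"}


def check_model(su, fu, s, sfm):
    """Verify the constraints stated by the model; raise ValueError otherwise."""
    for name, files in s.items():
        if not files <= su:
            raise ValueError(f"S({name}) is not a subset of SU")
    if set(sfm) != su or not set(sfm.values()) <= fu:
        raise ValueError("SFM must map every source file to a feature in FU")


def features_from_files(config):
    """Features implemented by the source files of a configuration.

    Assumption: F(C) is taken to be the image of S(C) under SFM. The paper
    does not state this relation explicitly.
    """
    return {SFM[s] for s in S[config]}


check_model(SU, FU, S, SFM)
```

Because SFM is a plain dictionary, the single-feature-per-file simplification is enforced by construction: a key cannot map to two features.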

3.2 Classification of Code Clone with Configuration
Consider two given configurations, C and D. For a given code clone c, if all code fragments of c are included in the set S(C) ⋂ S(D), then classify c as a common-file clone (a code clone whose code fragments appear in the source files that are shared between configurations C and D). In Fig. 1, this means that all code fragments of the clone are included in source file s3.

Otherwise, if all code fragments of c are included in S(C), or all of them are included in S(D), then classify c as a single-configuration clone (a code clone whose code fragments appear in the source files that belong to one configuration). E.g., in Fig. 1, when a code clone consists of a code fragment of s1 and one of s2, such a clone is a single-configuration clone.

Otherwise, if some of the code fragments of c are included in S(C) ‒ S(D) and the others are included in S(D) ‒ S(C), then classify c as a cross-configuration clone (a code clone whose code fragments appear in both some source files of C and some source files of D).

[Figure 1. Configurations, Source Files, and Features. The diagram shows configurations C and D, source files s1 to s5, features f1 to f4, and the SFM (source file to feature mapping) between source files and features.]

For cross-configuration clones, a further classification is given. Let cS be the set of source files in which any code fragment of a clone c appears. Let cF = { SFM(si) | si ∈ cS }, that is, cF is the set of features that are implemented by the source files that include some code fragment of clone c.

If | cF | = 1, then classify c as a cross-configuration and single-feature clone (CCSFC; a cross-configuration clone all of whose source files are used to implement a single feature). E.g., in Fig. 1, when a clone consists of a code fragment of s2 and one of s4, such a clone is a CCSFC.

If | cF | = | cS |, then classify c as a cross-configuration and multiple-implementation clone (CCMIC; a cross-configuration clone each of whose source files is used to implement a distinct feature). E.g., in Fig. 1, when a clone consists of a code fragment of s1 and one of s5, such a clone is a CCMIC.

When | cF | has any other value, the relations between source files and features are more complicated, and no further classification is given for these cases.
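The classification rules of this section can be sketched as a single function over the set of source files in which a clone's fragments appear. This is a minimal sketch, not the script used in the study. The function name `classify`, the returned labels, and the sets below are assumptions: S(C), S(D), and SFM are chosen only so that they agree with the Fig. 1 examples given in the text. Clones that touch a file outside both configurations are reported as "unclassified", a case the text does not address.

```python
def classify(clone_files, s_c, s_d, sfm):
    """Classify a clone by the source files (cS) in which its fragments appear."""
    files = set(clone_files)
    if not files:
        raise ValueError("a clone must appear in at least one source file")
    if files <= (s_c & s_d):
        return "common-file"
    if files <= s_c or files <= s_d:
        return "single-configuration"
    if not files <= (s_c | s_d):
        return "unclassified"  # some file belongs to neither configuration
    # Here some files lie in S(C) - S(D) and some in S(D) - S(C):
    # the clone is a cross-configuration clone.
    features = {sfm[s] for s in files}  # cF
    if len(features) == 1:
        return "CCSFC"
    if len(features) == len(files):
        return "CCMIC"
    return "cross-configuration"  # no further classification is given


# Assumed sets, consistent with the Fig. 1 examples in the text.
S_C = {"s1", "s2", "s3"}
S_D = {"s3", "s4", "s5"}
SFM = {"s1": "f1", "s2": "f2", "s3": "f3", "s4": "f2", "s5": "f4"}

for files in ({"s3"}, {"s1", "s2"}, {"s2", "s4"}, {"s1", "s5"}):
    print(sorted(files), classify(files, S_C, S_D, SFM))
```

The checks are ordered from the most specific class to the least specific one, so a clone confined to shared files is reported as a common-file clone rather than as a single-configuration clone.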

4. CLASSIFICATIONS AND MAINTENANCE TASKS
For a common-file clone, all code fragments are shared between configurations C and D, so a modification to these code fragments affects both the users of configuration C and the users of D. However, both a developer of C and a developer of D can test such a modification, so the developers can release the modified products responsibly.

For a single-configuration clone, all code fragments (and thus the modification) are included in either C or D, so a developer of that configuration can complete the modification task (modification, compilation, and testing).

For a CCSFC, all of the code fragments are used to implement a single feature; however, some of the code fragments are outside of C and the others are outside of D. This means the modification task cannot be completed in a single configuration. A developer of C can modify the source files in C (then compile and test them); however, he/she must ask a developer of D to apply the corresponding modification to the source files in D. This means communication (or interaction) between developers is needed, which can be expensive.

For a CCMIC, the issue is somewhat similar to that of a CCSFC, but not the same. A developer of C can modify the source files in C (then compile and test them) and may ask a developer of D to apply the corresponding modification to the source files in D. However, in this case, the source files in D provide a feature different from the one in C, so further consideration will plausibly be needed about whether to apply the modification in D as well. The modification may also cause a new problem in D that is difficult to foresee, or that is not caught by the test cases of C because of the difference in input domains.

5. CASE STUDY
In this case study, we used the source code of the Python 2.6.4 interpreter. The target configurations were one for Ubuntu i386 (U) and one for Windows 32-bit (W). The source files of each configuration were extracted from the Makefile and the *.vcproj files with small scripts (the count of source files in S(U) ⋂ S(W) is 160, in S(U) ‒ S(W) is 11, in S(W) ‒ S(U) is 25, and in neither is 189). Clone classes of code fragments containing 20 or more tokens were extracted with the clone detector Quartet (a token-based clone detection tool, unpublished). Each code clone was classified with a script as a common-file clone, a single-configuration clone, or a cross-configuration clone. We investigated the cross-configuration clones, all of which were classified as CCSFC (Table 1). These CCSFC clones consist of code fragments from the identical set of source files, { /Python-2.6.4/Modules/_ctypes/libffi/src/x86/ffi.c, /Python-2.6.4/Modules/_ctypes/libffi_msvc/ffi.c }, and they include some compiler-specific types, assembler code, and macros.
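The extraction scripts are not described in the paper. As a rough illustration of the kind of script involved, the sketch below lists the C source files named in a project file and counts how two file sets partition a source tree. It assumes that a `.vcproj` file lists its sources as `File` elements with a `RelativePath` attribute, as Visual Studio 2008 project files generally do. The sample XML, the file names, and the function names `source_files_from_vcproj` and `partition_counts` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical project file content (not taken from Python 2.6.4).
SAMPLE_VCPROJ = r"""
<VisualStudioProject ProjectType="Visual C++" Name="example">
  <Files>
    <Filter Name="Modules">
      <File RelativePath="..\Modules\a.c"></File>
      <File RelativePath="..\Modules\b.c"></File>
    </Filter>
    <Filter Name="Include">
      <File RelativePath="..\Include\a.h"></File>
    </Filter>
  </Files>
</VisualStudioProject>
""".strip()


def source_files_from_vcproj(xml_text, suffixes=(".c",)):
    """Return the paths of all File elements whose name ends with a given suffix."""
    root = ET.fromstring(xml_text)
    files = set()
    for node in root.iter("File"):
        # Normalize Windows path separators so paths compare across platforms.
        path = node.get("RelativePath", "").replace("\\", "/")
        if path.endswith(suffixes):
            files.add(path)
    return files


def partition_counts(s_u, s_w, all_files):
    """Count files shared by both configurations, unique to each, and in neither."""
    return {
        "shared": len(s_u & s_w),
        "only_u": len(s_u - s_w),
        "only_w": len(s_w - s_u),
        "other": len(all_files - (s_u | s_w)),
    }


s_w = source_files_from_vcproj(SAMPLE_VCPROJ)
s_u = {"../Modules/a.c", "../Modules/c.c"}           # hypothetical Makefile result
all_files = s_u | s_w | {"../Modules/unused.c"}      # hypothetical source tree
print(partition_counts(s_u, s_w, all_files))
```

In practice, paths taken from a Makefile and from a project file are relative to different directories, so a real script would need to resolve both against the root of the source tree before comparing them.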

Table 1. Classification of Clones

Classification                  Count of clone classes
Common file                     101
Single configuration            5
Cross configuration (CCSFC)     3
Cross configuration (CCMIC)     0

6. REFERENCES
[1] E. Duala-Ekoko, M. Robillard, “Tracking Code Clones in Evolving Software”, ICSE 2007, pp. 158-167 (2007).
[2] R. Geiger, et al., “Relation of Code Clones and Change Couplings”, FASE 2006 Proceedings, LNCS 3922, Springer, pp. 411-425 (2006).
[3] C. Kapser, M. Godfrey, “‘Cloning Considered Harmful’ Considered Harmful”, WCRE 2006, pp. 19-28 (2006).
[4] A. Lozano, M. Wermelinger, B. Nuseibeh, “Evaluating the harmfulness of cloning: a change based experiment”, MSR 2007, pp. 19-20 (2007).
[5] A. Monden, et al., “Software quality analysis by code clones in industrial legacy software”, METRICS 2002, pp. 87-94, Ottawa, Canada (June 4-7, 2002).