International Journal of Reliability, Quality and Safety Engineering
Vol. 16, No. 1 (2009) 73–89
© World Scientific Publishing Company

ATTRIBUTE SELECTION USING ROUGH SETS IN SOFTWARE QUALITY CLASSIFICATION

TAGHI M. KHOSHGOFTAAR∗ and LOFTON A. BULLARD†
Department of Computer Science and Engineering
Florida Atlantic University, 777 Glades Road
Boca Raton, Florida 33431, USA
∗[email protected]
†[email protected]

KEHAN GAO
Department of Mathematics and Computer Science
Eastern Connecticut State University, 83 Windham Street
Willimantic, Connecticut 06226, USA
[email protected]

Received 1 December 2008
Revised 13 January 2009

Finding techniques that reduce software development effort and produce highly reliable software is a vital goal for software developers. One method that has proven quite useful is the application of software metrics-based classification models. Such models can be constructed to identify faulty components in a software system with high accuracy. Significant research has been dedicated to improving the quality of software metrics-based classification models, and several studies have shown that their accuracy improves when irrelevant attributes are identified and eliminated from the training data set. This study presents a rough set theory approach, based on classical set theory, for identifying and eliminating irrelevant attributes from a training data set. Rough set theory is used to find small groups of attributes, determined by the relationships that exist between the objects in a data set, with discernibility comparable to that of larger sets of attributes. This allows for the development of simpler classification models that are easy for analysts to understand and explain to others. We built case-based reasoning models in order to evaluate their classification performance on the smaller subsets of attributes selected using rough set theory. The empirical studies demonstrate that by applying a rough set approach to find small subsets of attributes, we can build case-based reasoning models with an accuracy comparable to, and in some cases better than, that of a case-based reasoning model built with the complete set of attributes.

Keywords: Attribute selection; software metrics-based classification model; rough set; reducts; balancing misclassification.

∗Corresponding author.


1. Introduction

Currently, rough set theory1 is being used by researchers to classify data and to discover knowledge in data. Using the attribute values of the set of objects in the training data set, and an equivalence relation called the indiscernibility relation, the training data set is partitioned into disjoint subsets of objects called equivalence classes. Depending on the value of a binary attribute called the decision attribute, which in this study signifies the class membership of a software module, the objects in the training data set can be divided into two sets: the objects whose group membership is known with absolute certainty based on their attribute values, and the objects whose group membership cannot be determined based on their attribute values. When the set difference between these classes is not empty, the objects in this difference form a rough set. Rough sets are used as tools to identify outliers, classify objects, find inconsistent data, determine attribute weights, rank attributes, find reduced attribute sets, reduce data set size, and so forth.2–4 In this study we apply these techniques to increase the classification accuracy of a software quality metrics-based classification model by eliminating irrelevant attributes from the training data set.

For many years the software engineering community has benefited from the positive effects that software quality models have had on software development processes. In the past, they have been used to improve the fault detection process. Finding faulty components in a software system can lead to a more reliable final system and reduce development and maintenance costs. There are a number of classification models available which have been able to identify faulty software components. These include classification trees,5 fuzzy logic,6 neural networks,7 discriminant analysis,8,9 meta-learners,10 optimal set reduction,11 and case-based reasoning.12

Interest in case-based reasoning (cbr) models has grown dramatically over the past decade, partly because of the many advantages case-based reasoning provides. For instance, cbr models:

• are easy to understand and implement;
• require less maintenance effort;
• reduce the knowledge acquisition effort;
• improve problem-solving performance through reuse;
• make use of existing data, e.g., existing databases;
• improve in performance over time;
• adapt to changes in the environment;
• have high user acceptance.

cbr is a method that finds a solution to a new problem based on past experiences (old problems), which are stored in a library (database). When a new problem is presented to the system, an algorithm is used to search the library. All the old problems in the library are examined, and a set of those most similar to the new


problem is determined. Once a set of old problems has been selected, one of several algorithms can be used to determine the solution to the new problem.

The focus of our study is to improve the performance of cbr by reducing the number of irrelevant attributes. We use rough set theory to identify significant attributes. In the experiments performed in this study, we dramatically reduced the number of attributes in a training data set. Using the new, smaller attribute subset, we built a cbr classification model whose accuracy was comparable to that of the corresponding cbr classification model built with all the available attributes.

The rest of the paper is organized as follows. Section 2 presents a background of attribute selection algorithms. Rough sets are presented in Sec. 3. Section 4 presents the case-based reasoning technique. In Sec. 5, the experimental procedure, the Rough Set Exploration System (RSES), and the results of the experiments are discussed. Finally, the conclusion is presented in Sec. 6.

2. Attribute Selection Algorithms

Attribute selection algorithms can be divided into two main groups: wrappers and filters.13–15 Wrappers are algorithms that use feedback from a learning algorithm to determine which attributes to use in building a classification/prediction model. Selected subsets of attributes are evaluated by using them to train and test the learning algorithm (the learning algorithm and the classification model being built are the same). The subset with the best performance on the learning algorithm is selected as the most significant attribute subset. In filters, by contrast, no feedback is provided by the learning algorithm (the learning algorithm and the classification model being built are different algorithms); the training data is analyzed using a method that does not require a learning algorithm to determine which attributes are most relevant. In some studies the attributes selected using wrappers have performed better than those selected using filters,14 mostly because of the feedback provided by the learning algorithm in wrappers. However, filters require fewer computing resources and are computationally faster.

Attribute selection algorithms can be further categorized as either sequential or genetic.16 Sequential algorithms search through the subsets of attributes, called the attribute subset space, in a predetermined way, whereas genetic algorithms use heuristic techniques to search the attribute subset space. There are numerous algorithms that will do an exhaustive search of the attribute subset space to find an optimal attribute subset, but when the number of attributes is greater than 20, the computational demands on the system are usually too great. A selected subset cannot be called optimal if it cannot be compared with all possible subsets; however, we will be satisfied with a semi-optimal subset that can be used to build a classification model that performs comparably to a classification model built with the complete attribute set.

Two common sequential search attribute selection algorithms are the Forward Sequential Search (FSS) and the Backward Sequential Search (BSS). FSS uses a


simple heuristic approach to find a minimal set of attributes that covers a data set. Initially, the subset of attributes is empty. The most significant attribute, based on some performance criterion, is added to the model. Each of the remaining attributes is then added, one by one, in order of significance, until no improvement over the current subset of attributes is observed.14 In contrast, BSS starts with the complete set of attributes and repeatedly removes the attribute whose removal yields the maximal performance improvement.17

Most genetic algorithms are variants of the common hill-climbing algorithm, which is used to search the attribute subset space. In hill-climbing, a starting point, represented as an attribute subset, is selected at random. The attribute set is represented using a binary string.18 A new attribute subset is formed by flipping a bit in the binary string. The new attribute subset's performance is evaluated and compared to the performance of every subset considered thus far. If its performance is better, it is retained as the best attribute subset found so far; otherwise, it is discarded. This process is continued for a given number of iterations. In this case study we will use several genetic algorithms to find small attribute subsets to be used to build cbr classification models.

3. Rough Sets

Rough set theory is based on classical set theory.1 It can be used as a tool to gain information about the dependency relationship that exists between attributes (independent variables vs. the dependent variable) in a particular domain,1 and about the results of operations on the sets they form. In this study, we are interested in the partitions, called equivalence classes, that rough sets construct from groups of attributes, and in how they can be used to build cbr models to classify software modules. Equivalence classes are formed by equivalence relations, which are relations formed from similar characteristics (properties) of objects.1

A complete data set contains and represents all the knowledge about a model. However, it may contain repeated information and irrelevant attributes, and it can be quite large. Using the concept of equivalence from classical set theory, we can eliminate redundant data and insignificant attributes. A reduct is a minimal set of attributes that preserves the discrimination power and the ability to perform classifications as if the whole attribute set were being used. See Ref. 1 for more detailed information on rough set theory. Because the number of possible reduct candidates can be enormous, approximately 2^{|A|}, where |A| is the number of attributes in a given data set, heuristic techniques must be used to compute the minimal reducts of a given data set. Indeed, the problem of determining the minimal reducts of a given data set is NP-hard,19 meaning that it is at least as hard as any problem that can be solved by a nondeterministic Turing machine in polynomial time. Fortunately, there exist a number of genetic algorithms20,21 that can compute a sufficient number of reducts for a given attribute set. We will use several in our study.
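To make the two ideas above concrete, the following sketch shows a bit-flip hill climb over the attribute subset space in which a candidate subset is accepted only if it still preserves discernibility, i.e., only if it is an (approximate) reduct. This is an illustration only, assuming a discretized and consistent decision table; the study itself uses the genetic algorithms of Refs. 20 and 21 as implemented in RSES, and all function and variable names below are ours.

```python
# Illustrative sketch only: finding a short attribute subset (an approximate
# reduct) by random bit-flip hill climbing. Assumes a discretized decision
# table that is consistent on the full attribute set; names are hypothetical.
import random
from collections import defaultdict

def equivalence_classes(rows, attrs):
    """Partition object indices into the equivalence classes of the
    indiscernibility relation induced by the attribute subset `attrs`."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return classes.values()

def preserves_discernibility(rows, decision, attrs):
    """True if every equivalence class induced by `attrs` is pure with
    respect to the decision attribute (fp / nfp)."""
    return all(len({decision[i] for i in members}) == 1
               for members in equivalence_classes(rows, attrs))

def hill_climb_reduct(rows, decision, n_attrs, iterations=5000, seed=1):
    """Start from the full attribute set (a binary string of 1s), flip one
    randomly chosen bit at a time, and keep any smaller subset that still
    preserves discernibility."""
    rng = random.Random(seed)
    best = [1] * n_attrs
    for _ in range(iterations):
        cand = best[:]
        cand[rng.randrange(n_attrs)] ^= 1          # flip one bit
        attrs = [a for a, bit in enumerate(cand) if bit]
        if attrs and sum(cand) < sum(best) and \
                preserves_discernibility(rows, decision, attrs):
            best = cand
    return [a for a, bit in enumerate(best) if bit]
```

A greedy variant that, at each step, removes the attribute whose removal still preserves discernibility is essentially the Backward Sequential Search described in Sec. 2.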


4. Case-Based Reasoning (CBR) Classification Technique

cbr is a modeling technique that attempts to find a solution to a new problem based on past experiences, represented by cases (with similar attributes) in a case library. A solution algorithm uses a similarity function to measure the relationship between the new problem and each case in the case library, retrieves the relevant case(s), and determines a solution to the new problem. A cbr system therefore comprises three major components: a case library, a similarity function, and a solution algorithm.

In a cbr system, cases or program modules related to previously developed systems are stored in a case library, which is often represented by a fit data set, or training data set, used to train the model. A case is composed of a set of independent variables and a dependent variable, which in our studies is the quality-based class membership: fault-prone (fp) or not fault-prone (nfp). Prior to model construction, the dependent variable values for the modules in the case library are known. Using the cases in the library, a model is trained and then applied to a test data set or target data set, which contains information related to program modules of a similar project or a subsequent system release.

In order to retrieve cases in the library that are most similar to the current case under investigation, a similarity function is used. A similarity function measures the distance between the current case and all the cases in the case library. Modules with the smallest distances from the module under investigation are considered similar and designated as the nearest neighbors. Many similarity functions can be used, such as the city block distance, Euclidean distance, and Mahalanobis distance. In our empirical investigation, the Mahalanobis distance is used as the similarity function,22 which is given by

$$ d_{ij} = (\mathbf{c}_j - \mathbf{x}_i)'\, S^{-1}\, (\mathbf{c}_j - \mathbf{x}_i) \qquad (1) $$

where $\mathbf{x}_i$ is the vector of independent variables for the current case, and $\mathbf{c}_j$ is the $j$th case in the case library. The prime ($'$) denotes a transpose, $S$ is the variance-covariance matrix of the independent variables over the entire case library, and $S^{-1}$ is its inverse.

Subsequent to computing the distances, a classification rule is used as the solution algorithm of the cbr system. In this study, our previously proposed generalized data clustering classification rule22 is used, which is given by

$$ \mathrm{Class}(\mathbf{x}_i) = \begin{cases} fp, & \text{if } \dfrac{d_{nfp}(\mathbf{x}_i)}{d_{fp}(\mathbf{x}_i)} \ge c\\[4pt] nfp, & \text{otherwise} \end{cases} \qquad (2) $$

where $d_{fp}(\mathbf{x}_i)$ is the average distance to the $n_N$ (number of neighbors) fp nearest-neighbor cases, and $d_{nfp}(\mathbf{x}_i)$ is the average distance to the $n_N$ nfp nearest-neighbor cases. The current case $\mathbf{x}_i$ is classified by comparing the ratio $d_{nfp}(\mathbf{x}_i)/d_{fp}(\mathbf{x}_i)$ to $c$, a modeling parameter, which is determined empirically.
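A minimal sketch of the classification step defined by Eqs. (1) and (2) is given below. It assumes the case library is held as a numeric matrix with one row per module and a parallel array of fp/nfp labels; the names (fit_X, fit_y, n_neighbors) are ours, and this is an illustration rather than the implementation used in the study.

```python
# Sketch of the CBR solution algorithm of Eqs. (1)-(2). Assumed data layout:
# fit_X is an (n_cases x n_attributes) NumPy array, fit_y a NumPy array of
# "fp"/"nfp" labels; names are hypothetical.
import numpy as np

def mahalanobis_distances(x, fit_X):
    """Eq. (1): (squared) Mahalanobis distance from the current case x to
    every case in the library, using the library's covariance matrix S."""
    S_inv = np.linalg.pinv(np.cov(fit_X, rowvar=False))
    diff = fit_X - x
    return np.einsum("ij,jk,ik->i", diff, S_inv, diff)

def classify(x, fit_X, fit_y, n_neighbors, c):
    """Eq. (2): label x as fp when the mean distance to its nN nearest nfp
    neighbors is at least c times the mean distance to its nN nearest fp
    neighbors."""
    d = mahalanobis_distances(x, fit_X)
    d_fp = np.sort(d[fit_y == "fp"])[:n_neighbors].mean()
    d_nfp = np.sort(d[fit_y == "nfp"])[:n_neighbors].mean()
    return "fp" if d_nfp / d_fp >= c else "nfp"
```

Sweeping the parameters c and nN in such a rule trades Type I errors against Type II errors, as discussed next.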


In the context of a two-group classification model, two types of misclassification can occur: Type I (an nfp module classified as fp) and Type II (an fp module classified as nfp). From a software engineering practice point of view, the penalty for a Type II misclassification is often more severe than the penalty for a Type I misclassification. As discussed above, a cbr model has two parameters that can be varied during the modeling process: the number of nearest neighbors (nN) and the modeling parameter (c). For a given nN, an inverse relationship between the Type I and Type II misclassification error rates is observed when varying the value of c, i.e., as the Type I error rate increases, the Type II error rate decreases, and vice versa. The preferred balance between the two misclassification error rates depends on the project requirements. According to our discussions with the development team of the large legacy telecommunications system, the preferred balance for the case study is that the two misclassification error rates be approximately equal, with the Type II misclassification error rate being as low as possible.

5. Experiments

5.1. Rough Set Exploration System (RSES)

The Rough Set Exploration System (rses) is a set of software tools used for rough set computations in data mining.23,24 The algorithms that implement the tools are flexible and user friendly, as demonstrated by the provided graphical user interface, which allows experiments to be constructed by simple point-and-click manipulation of objects in the workplace. rses can handle very large data sets (limited only by the computer's available memory). The current state of research in classification methods originating in rough set theory is reflected in the algorithms it implements. rses provides algorithms to manage and edit the data structures that are used in user experiments and defined in the rses library, reduce data (objects and attributes),4 quantize data,25 generate templates and decomposition trees,26 and classify objects.23

In our experiments, we are interested in the discretization and reduction algorithms. The discretization algorithms find cuts, or intervals, for the attributes. This allows the initial decision table1 to be converted into one described by simple binary attributes without losing the discernibility information contained in the original decision table. Background information on the discretization algorithm used by rses is given in Ref. 25. rses implements several reduction algorithms for reducing the number of irrelevant attributes.4 These algorithms include an exhaustive algorithm and several genetic algorithms. When the number of attributes is large (greater than 20), an exhaustive search for reducts is impractical. rses uses genetic algorithms to find approximate and heuristic solutions to the attribute selection problem (i.e., to find short reducts).
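For illustration, once cuts have been found for a numeric attribute (by whatever method), applying them simply maps each value to the index of the interval it falls into. The sketch below shows only this mapping step, with purely hypothetical cut points; it does not reproduce RSES's cut-selection algorithm (Ref. 25).

```python
# Applying pre-computed cuts to a numeric attribute value; the cut-selection
# step itself (Ref. 25) is not shown. The cut points here are hypothetical.
import bisect

def discretize(value, cuts):
    """Return the interval code of `value` given sorted cut points,
    e.g. cuts = [10, 50] yields codes 0 (< 10), 1 ([10, 50)), 2 (>= 50)."""
    return bisect.bisect_right(cuts, value)

print(discretize(37, [10, 50]))   # -> 1
```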


5.2. Data description

The software metrics and fault data for this case study (denoted llts) were collected over four historical releases of a very large legacy telecommunications system written in a high-level language, using the procedural paradigm, and maintained by professional programmers in a large organization. We labelled the four system releases 1 through 4; they were the last four releases of the legacy system. The telecommunications system had over ten million lines of code and included numerous finite-state machines and interfaces to various kinds of equipment. A software module was considered to be a set of functionally related source-code files according to the system's architecture. A module was considered fault-prone (fp) if any faults were discovered during operations, and not fault-prone (nfp) otherwise. A software fault for a program module was recorded only when a problem discovered by customers resulted in changes to the module's source code.

Faults in deployed telecommunication systems are extremely expensive because of system down-time due to failures, and visits to remote sites are usually necessary to repair them. Preventing customer-discovered faults was a very high priority for the developers of this system, and thus they were interested in timely software quality predictions. Fault data, collected at the module level by the problem reporting system, consisted of faults discovered during post-unit-testing phases and were recorded before and after the product was released to customers. It was observed that over 99% of the unchanged modules had no faults. Consequently, this case study considered modules that were new or had at least one source code update since the prior release; configuration management data analysis identified the software modules that were unchanged from the prior release. Table 1 presents the distribution of the faults discovered by customers in the updated modules of the four system releases. Each release had approximately 3500 to 4000 updated software modules. The numbers of modules considered in Releases 1, 2, 3, and 4 (also referred to as Fit, Test 1, Test 2, and Test 3) were 3649, 3981, 3541, and 3978, respectively.

Table 1. LLTS fault distribution (percentage of updated modules).

Faults   Rel. 1 (Fit)   Rel. 2 (Test 1)   Rel. 3 (Test 2)   Rel. 4 (Test 3)
0        93.7           95.3              98.7              97.7
1         5.1            3.9               1.0               2.1
2         0.7            0.7               0.2               0.2
3         0.3            0.1               0.1               0.1
4         0.1            ∗
6         ∗
9         ∗

∗ One module.


According to the definitions of fp and nfp modules for this case study, the numbers of fp and nfp modules for the four releases are as follows: Releases 1 through 4 have 229, 189, 47, and 92 fp modules, and 3420, 3792, 3494, and 3886 nfp modules, respectively. Among the updated modules of the first release (the fit data set), the proportion of modules with no faults was 0.937 and the proportion with at least one fault was 0.063.

The set of available software metrics is usually determined by pragmatic considerations. A data mining approach is preferred in exploiting software metrics data,27 by which a broad set of metrics is analyzed rather than limiting data collection according to a predetermined set of research questions. Data collection for this case study involved extracting source code from the configuration management system; the available data collection tools determined which software metrics were collected. Software measurements were recorded using the emerald (Enhanced Measurement for Early Risk Assessment of Latent Defects) software metrics analysis tool, which includes software-measurement facilities and software quality models.28 Preliminary data analysis selected metrics (aggregated to the module level) that were appropriate for our modeling purposes. The software metrics considered included 24 product metrics, 14 process metrics, and 4 execution metrics. The experiments of the case study were performed on two groups of data sets:

• a group of data sets that hold all 42 software metrics, referred to as RAW 42, and
• a group of data sets that hold the 24 product metrics and the 4 execution metrics, referred to as RAW 28.

Consequently, this case study uses two groups of independent variables (42 and 28) to predict the response variable, i.e., Class. It should be noted that the sets of software metrics used for this case study may not be universally appropriate for all software systems; another project might collect (depending on availability) and use a different set of software metrics.

The software product metrics in Table 2 are based on call graphs, control flow graphs, and statement metrics. The number of procedure calls made by each module (CALUNQ and CAL2) is derived from a call graph depicting the calling relationships among procedures. A module's control flow graph consists of nodes and arcs depicting the flow of control of the program. Statement metrics are measurements of the program statements without expressing the meaning or logic of the statements. The process metrics in Table 3 may be associated with either the likelihood of inserting a fault during development, or the likelihood of discovering and fixing a fault prior to product release. The configuration management system tracked each change to the source code files, including the identity of the designer and the reason for the change, e.g., a change to fix a problem or to implement a new requirement. The problem reporting system maintained records on past problems. The execution metrics listed in Table 4 are associated with the likelihood of executing a module, i.e., operational use. The proportion of installations that had a module, USAGE, was approximated by deployment data from a prior release.


Table 2. Software product metrics.

Symbol      Description

Call Graph Metrics
CALUNQ      Number of distinct procedure calls to others.
CAL2        Number of second and following calls to others. CAL2 = CAL − CALUNQ, where CAL is the total number of calls.

Control Flow Graph Metrics
CNDNOT      Number of arcs that are not conditional arcs.
IFTH        Number of non-loop conditional arcs (i.e., if-then constructs).
LOP         Number of loop constructs.
CNDSPNSM    Total span of branches of conditional arcs. The unit of measure is arcs.
CNDSPNMX    Maximum span of branches of conditional arcs.
CTRNSTMX    Maximum control structure nesting.
KNT         Number of knots. A "knot" in a control flow graph is where arcs cross due to a violation of structured programming principles.
NDSINT      Number of internal nodes (i.e., not an entry, exit, or pending node).
NDSENT      Number of entry nodes.
NDSEXT      Number of exit nodes.
NDSPND      Number of pending nodes (i.e., dead code segments).
LGPATH      Base 2 logarithm of the number of independent paths.

Statement Metrics
FILINCUQ    Number of distinct include files.
LOC         Number of lines of code.
STMCTL      Number of control statements.
STMDEC      Number of declarative statements.
STMEXE      Number of executable statements.
VARGLBUS    Number of global variables used.
VARSPNSM    Total span of variables.
VARSPNMX    Maximum span of variables.
VARUSDUQ    Number of distinct variables used.
VARUSD2     Number of second and following uses of variables. VARUSD2 = VARUSD − VARUSDUQ, where VARUSD is the total number of variable uses.

Execution times were measured in a laboratory setting with different simulated workloads.

5.3. The experiments on data RAW 28

5.3.1. Procedure

The training data set was discretized by the algorithm implemented in rses before we generated the reducts. We used three different genetic filter algorithms, described in Ref. 21, to determine the most significant subsets of attributes to be used to build the cbr models. To determine significant attribute subsets, we chose the subsets that were generated by all three algorithms. Each algorithm had to generate at least 10 subsets, for a total of 30 attribute subsets, before there were two subsets of


Table 3. Software process metrics.

Symbol      Description
DES PR      Number of problems found by designers during development of the current release.
BETA PR     Number of problems found during beta testing of the current release.
DES FIX     Number of problems fixed that were found by designers in the prior release.
BETA FIX    Number of problems fixed that were found by beta testing in the prior release.
CUST FIX    Number of problems fixed that were found by customers in the prior release.
REQ UPD     Number of changes to the code due to new requirements.
TOT UPD     Total number of changes to the code for any reason.
REQ         Number of distinct requirements that caused changes to the module.
SRC GRO     Net increase in lines of code.
SRC MOD     Net new and changed lines of code.
UNQ DES     Number of different designers making changes.
VLO UPD     Number of updates to this module by designers who had 10 or less total updates in entire company career.
LO UPD      Number of updates to this module by designers who had between 11 and 20 total updates in entire company career.
UPD CAR     Number of updates that designers had in their company careers.

Table 4. Software execution metrics.

Symbol      Description
USAGE       Deployment percentage of the module.
RESCPU      Execution time (microseconds) of an average transaction on a system serving consumers.
BUSCPU      Execution time (microseconds) of an average transaction on a system serving businesses.
TANCPU      Execution time (microseconds) of an average transaction on a tandem system.

attributes that were common to all three. After reviewing the reducts, we observed that of the 30 reducts generated, only 23 were unique. We chose the reducts that had the most votes, where each algorithm could cast one vote for a particular reduct. Using this method there were 5 reducts with 2 or more votes; they are shown in Table 5. We do not show the other 18 reducts produced by the algorithms, because they received only one vote; moreover, the performances of the cbr models constructed from those 18 reducts were not better than the performances of the cbr models constructed from the 5 selected reducts. The first column, labelled 'ID', was created to make it easier to identify the reducts when referring to the different tables. The column labelled 'Algorithm Vote' gives the number of algorithms, out of the 3, that generated that particular reduct. The last column, labelled 'Selected Attributes', gives the reducts with 2 or more votes. As shown in Table 5, there were 2 reducts with 3 votes, {CNDSPNSM, STMEXE} and {FILLINCUQ, VARSPNSM}, and 3 reducts with 2 votes, {STMEXE, VARSPNSM}, {CTRNSTMX, LOC, STMEXE}, and {FILLINCUQ, NDSINT, VARUSD2}.
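This voting step can be expressed compactly. The sketch below, with naming of our own choosing (it does not reflect how RSES output is actually handled), counts one vote per algorithm for each distinct reduct and keeps those with at least two votes.

```python
# Tallying reduct "votes" across the three genetic algorithms (Sec. 5.3.1).
# reduct_lists holds one list of reducts per algorithm; each reduct is an
# iterable of attribute names. All names here are ours.
from collections import Counter

def tally_reducts(reduct_lists, min_votes=2):
    votes = Counter()
    for reducts in reduct_lists:
        for reduct in {frozenset(r) for r in reducts}:   # one vote per algorithm
            votes[reduct] += 1
    return [(sorted(r), v) for r, v in votes.most_common() if v >= min_votes]
```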


Table 5. Unique reducts generated for RAW 28 with ID.

ID    Algorithm vote    Selected attributes
1     3                 CNDSPNSM, STMEXE
2     3                 FILLINCUQ, VARSPNSM
3     2                 STMEXE, VARSPNSM
4     2                 CTRNSTMX, LOC, STMEXE
5     2                 FILLINCUQ, NDSINT, VARUSD2

We used the fit data set to build the cbr models for all 23 unique reducts produced by rses in the experiments (5 with more than 1 vote and 18 with 1 vote). We used the leave-one-out (n-fold) cross-validation technique to build the cbr models.29 In the cbr modeling process, two parameters were adjusted during fitting: the number of nearest neighbors, nN, and the modeling cost ratio, c. The Mahalanobis distance was used as the similarity function. We determined the preferred cbr model for each of the 23 reducts by keeping the Type I and Type II misclassification rates as balanced (equal) as possible, with the Type II misclassification rate being as low as possible. It has been shown previously that the balance of misclassification rates is a good evaluation measure of the quality of classification, especially for the software quality classification problem.30 c was varied from 0 to 1 in increments of 0.01, and nN was varied from 1 to 100 in increments of 1. In the experiments, the preferred models built from the 5 reducts with 2 or more votes performed comparably to, and in most cases better than, the other 18 models built from reducts with only 1 algorithm vote.
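The model-selection sweep just described can be summarized as follows. The `evaluate` callback stands for a cross-validated run of the cbr model for a given (nN, c) pair; it is an assumption of this sketch, not something taken from the paper.

```python
# Sketch of the preferred-model search: vary nN and c over the stated grids
# and keep the model whose Type I and Type II error rates are most nearly
# balanced, preferring a lower Type II rate on ties. `evaluate` is assumed
# to return the two cross-validated error rates for a given (nN, c) pair.
def preferred_model(evaluate):
    best, best_key = None, None
    for nN in range(1, 101):                         # nN = 1, 2, ..., 100
        for c in (i / 100.0 for i in range(101)):    # c = 0.00, 0.01, ..., 1.00
            type1, type2 = evaluate(nN, c)
            key = (abs(type1 - type2), type2)        # balance first, then low Type II
            if best_key is None or key < best_key:
                best_key, best = key, (nN, c, type1, type2)
    return best
```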

5.3.2. Results

Table 6 displays the results of the preferred cbr model for each of the 5 reducts with 2 or more votes, and for the complete 28-attribute set. The Overall misclassification rates are shown in the last column. The preferred model for each attribute subset was determined as the one achieving the best balance between the Type I and Type II misclassification rates. The preferred model among all the reduct models for this experiment was the cbr model built using the reduct {FILLINCUQ, VARSPNSM}; it occurred when the number of nearest neighbors was nN = 21 and the cost ratio was c = 0.18. The preferred model corresponds to reduct ID 2 in Table 6.

The preferred model performed better than the cbr models built with the four other reducts. It had a better quality-of-fit, as demonstrated by its lower Type I, Type II, and Overall misclassification rates on the fit data set. On the Test 1 data set, the preferred model had lower Type I and Overall misclassification rates than all of the other 4 reduct models; however, its Type II misclassification rate was only marginally higher than those of 3 of the other reduct models.


Table 6. Results of reducts selected by 2 or more algorithms for RAW 28.
(All misclassification rates in %.)

                                      Fit (cross-validation)          Test 1
ID   Votes   # of attri.   nN    c      Type I   Type II   Overall    Type I   Type II   Overall
1    3       2             48    0      31.87    32.31     31.90      32.75    30.69     32.66
2    3       2             21    0.18   28.92    28.38     28.89      26.66    26.98     26.67
3    2       2             95    0.16   32.28    32.31     31.28      29.09    25.93     28.94
4    2       3             93    0.31   34.50    34.50     34.50      30.22    26.46     30.04
5    2       3             78    0.34   31.02    31.00     31.02      27.51    24.34     27.36
*    *       28            15    0.96   27.69    27.95     27.71      23.73    31.22     24.09

                                      Test 2                          Test 3
ID   Votes   # of attri.   nN    c      Type I   Type II   Overall    Type I   Type II   Overall
1    3       2             31    0.17   32.43    36.17     32.48      27.95    32.28     28.16
2    3       2             21    0.18   29.71    29.79     29.71      30.47    31.52     30.49
3    2       2             95    0.16   32.31    25.53     32.22      33.45    27.17     33.31
4    2       3             93    0.31   36.01    27.66     35.89      38.06    20.65     37.66
5    2       3             78    0.34   31.14    29.79     31.12      32.27    21.74     32.03
*    *       28            15    0.96   27.82    27.66     27.82      36.46    17.39     36.02

On the Test 2 data set, the preferred model had significantly lower Type I and Overall misclassification rates when compared to the other 4 reduct models. Its Type II misclassification rate was two to four percent higher than those of two of the other reduct models, about six percent lower than that of one reduct model, and the same as that of the remaining reduct model. On the Test 3 data set, the preferred model had Type I and Overall misclassification rates two to seven percent lower than those of 3 of the 4 other reduct models. However, its Type II misclassification rate was four to eleven percent higher than those of 3 of the other reduct models; only one reduct model had a Type II misclassification rate higher than the preferred model. The Type I and Type II misclassification rates of the preferred model were more balanced across the three different releases, i.e., the 3 test data sets, which is a measure of stability.

The preferred model's classification accuracy was comparable to that of the 28-attribute model. On the Test 1 data set, the preferred model had a significantly lower Type II misclassification rate (by about 4%) than the 28-attribute model. Results for Test 2 show that the preferred model had Type I, Type II, and Overall misclassification rates approximately two percent higher. On the Test 3 data set, the preferred model had lower Type I and Overall misclassification rates, each by about six percent. It was also observed that the preferred model was more stable across consecutive releases than the 28-attribute model.

5.4. The experiments on data RAW 42

5.4.1. Procedure

We used the same procedure as for the previous experiments on the RAW 28 data set.

Table 7. Unique reducts generated for RAW 42 with ID.

ID    Algorithm vote    Selected attributes
1     3                 CALUNQ, FILLINCUQ, LGPATH
2     3                 UNQ DES, VARSPNMX, VARUSDUQ
3     2                 UPD CAR, CNDSPNSM
4     2                 CALUNQ, CTRNSTMX, FILLINCUQ
5     2                 FILLINCUQ, VARSPNSM
6     2                 LOC, IFTH, LOP
7     2                 DES FIX, LOC, NDSINT
8     2                 SRC GRO, VARSPNSM

We discretized the training data set and used the same three genetic filter algorithms to determine the most significant subsets of attributes to be used to build the cbr models. To determine significant attribute subsets, we chose the subsets that were generated by all three algorithms. However, this time each algorithm had to generate at least 20 subsets, for a total of 60 attribute subsets, before there were two subsets of attributes common to all three algorithms. After reviewing the reducts, we observed that of the 60 reducts generated, only 50 were unique. There were 8 reducts with 2 or more votes, and they are shown in Table 7. We used the fit data set to build the cbr models for all 50 unique reducts (8 with more than 1 vote and 42 with 1 vote). We used the leave-one-out (n-fold) cross-validation technique to build the cbr models. In the cbr modeling process, we used the same model selection strategy, i.e., to achieve balanced misclassification rates.

5.4.2. Results

Table 8 displays the results of the preferred cbr model for each of the 8 reducts with 2 or more votes, and for the complete 42-attribute set. The preferred model for this experiment was the cbr model built using the reduct {UNQ DES, VARSPNMX, VARUSDUQ}; it occurred when the number of nearest neighbors was nN = 29 and the cost ratio was c = 0.28. The preferred model corresponds to reduct ID 2 in Table 8. The preferred model had a better quality-of-fit, as demonstrated by its lower Type I, Type II, and Overall misclassification rates on the fit data set compared to the other 7 reduct models. On the Test 1 data set, the preferred model had a significantly lower Type II misclassification rate (by about 6%) and was more stable than the 42-attribute model. The results for Test 2 showed that the preferred model had lower Type I (4%), Type II (4%), and Overall (4%) misclassification rates than the 42-attribute model. On Test 3 the preferred model had lower Type I (4%) and Overall (4%) misclassification rates. The complete 42-attribute model had a significantly lower Type II misclassification rate (3%), but the preferred model had a better balance between the Type I and Type II misclassification rates.


Table 8. Results of reducts selected by 2 or more algorithms for RAW 42.
(All misclassification rates in %.)

                                      Fit (cross-validation)          Test 1
ID   Votes   # of attri.   nN    c      Type I   Type II   Overall    Type I   Type II   Overall
1    3       3             91    0.31   31.37    31.44     31.38      27.64    22.22     27.38
2    3       3             29    0.28   27.52    27.51     27.51      25.24    25.93     25.27
3    2       2             16    0.23   33.16    33.19     31.16      29.51    39.15     29.97
4    2       3             81    0.32   29.71    29.69     29.71      27.69    25.93     27.61
5    2       2             41    0.13   30.12    30.13     30.12      27.06    25.93     27.00
6    2       3             36    0.44   33.63    33.62     33.63      30.20    27.51     30.07
7    2       3             46    0.24   32.31    32.31     32.31      31.07    24.87     30.77
8    2       2             29    0.34   34.47    34.50     34.48      31.44    34.39     31.58
*    *       42            7     0.95   23.16    23.14     23.16      24.24    31.75     24.59

                                      Test 2                          Test 3
ID   Votes   # of attri.   nN    c      Type I   Type II   Overall    Type I   Type II   Overall
1    3       3             91    0.31   30.37    23.40     30.27      31.37    31.44     31.38
2    3       3             29    0.28   28.82    25.53     28.78      26.84    27.17     26.85
3    2       2             16    0.23   37.95    29.79     37.84      34.53    29.35     34.41
4    2       3             81    0.32   30.05    19.15     29.91      32.35    20.65     32.08
5    2       2             41    0.13   30.37    25.53     30.30      31.22    27.17     31.12
6    2       3             36    0.44   35.43    25.53     35.30      38.63    19.57     38.19
7    2       3             46    0.24   34.55    23.40     34.40      38.81    19.57     38.36
8    2       2             29    0.34   46.65    14.89     46.23      34.66    29.35     34.54
*    *       42            7     0.95   33.03    29.79     32.99      31.19    23.91     31.02

The results of both experiments, on the RAW 28 and RAW 42 data sets, verify that we can use rough set theory to identify subsets of software quality attributes and use those subsets to build a cbr model with accuracy comparable to, and in some cases better than, that of a cbr model built with the complete set of attributes. The preferred model was also more stable across consecutive releases. In addition, for this particular system, llts, it is observed that the statement metrics played an important role in the attribute selection process: the preferred reduct models of both experiments each include 2 statement metrics. The preferred reduct model on the RAW 42 data set also includes one process metric, which may be the reason why it outperformed the preferred reduct model of the RAW 28 data set. This suggests that the process metrics may have significant effects on the performance of the classification models and should not be ignored when software metrics are collected.

6. Conclusion

Identifying small sets of relevant attributes, called reducts, that can be used to build high-accuracy software quality metrics-based classification models can help software developers guide their efforts to reduce software development costs and produce a more reliable system. In this study, we evaluated the performance of different case-based reasoning models built from the smaller subsets of attributes selected using rough set theory. The cbr models built with the smaller


subsets of attributes sometimes had a classification accuracy comparable to, and in some cases better than, that of a cbr model built with all the available quality attributes. Finding reducts is also beneficial because the metrics collection, model calibration, model validation, and model evaluation times of future development efforts for similar systems can be significantly reduced. Some future directions of our research are to:

• use rough sets to identify noisy objects,
• use rough sets to reduce attributes in other domains,
• compare the rough set attribute selection method with other attribute selection methods,
• compare the rough set classification technique to other classification techniques, and
• use rough sets to reduce attributes in multi-class problems.

Acknowledgments

We are grateful to Professor Hoang Pham, EIC of the International Journal of Reliability, Quality, and Safety Engineering, for his comments and suggestions.

References

1. Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic (1992).
2. L. A. Bullard, T. M. Khoshgoftaar and K. Gao, An application of a rule-based model in software quality classification, in Proceedings of the 6th IEEE International Conference on Machine Learning and Applications (ICMLA'07), Cincinnati, Ohio, USA: IEEE Computer Society (December 13–15, 2007), pp. 204–210.
3. M. Salamo and E. Golobardes, Analyzing rough sets weighting methods for case-based reasoning systems, Artificial Intelligence: Latin American Magazine of Artificial Intelligence 15 (2002) 34–43.
4. D. Slezak and J. Wroblewski, Covering with reducts — a fast algorithm for rule generation, in Proceedings: RSCTC, Vol. 1424, Springer Verlag (1998), pp. 402–407.
5. T. M. Khoshgoftaar, E. B. Allen and J. Deng, Using regression trees to classify fault-prone software modules, IEEE Transactions on Reliability 51(4) (2002) 455–462.
6. C. Ebert, Classification techniques for metric-based software development, Software Quality Journal 5 (1996) 255–272.
7. Z. Xu, N. Seliya and W. Wu, An adaptive neural network with dynamic structure for software defect prediction, in Proceedings of the Twentieth International Conference on Software Engineering and Knowledge Engineering (SEKE'2008), San Francisco, CA, USA (2008), pp. 79–84.
8. J. C. Munson and T. M. Khoshgoftaar, The detection of fault-prone programs, IEEE Transactions on Software Engineering 18 (1992) 423–433.
9. T. M. Khoshgoftaar, N. Seliya and K. Gao, Assessment of a new three-group software quality classification technique: An empirical case study, Empirical Software Engineering 10(2) (2005) 183–218.
10. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in Proc. 13th International Conference on Machine Learning, Morgan Kaufmann (1996), pp. 148–156.


11. L. Briand, V. Basili and C. Hetmanski, Developing interpretable models with optimized set reduction for identifying high-risk software components, IEEE Transactions on Software Engineering 19 (1993) 1028–1044.
12. T. M. Khoshgoftaar and N. Seliya, Analogy-based practical classification rules for software quality estimation, Empirical Software Engineering Journal 8(4) (2003) 325–350.
13. M. A. Hall, Correlation-based feature selection for discrete and numeric class machine learning, in Proc. 17th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA (2000), pp. 359–366.
14. C. Kirsopp, M. J. Shepperd and J. Hart, Search heuristics, case-based reasoning and software project effort prediction, in Proceedings of the Genetic and Evolutionary Computation Conference, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2002), pp. 1367–1374.
15. G. Bontempi, Structural feature selection for wrapper methods, in Proceedings of ESANN'2005, European Symposium on Artificial Neural Networks, Bruges, Belgium (April 27–29, 2005), pp. 405–410.
16. H. B. Borges and J. C. Nievola, Attribute selection methods comparison for classification of diffuse large B-cell lymphoma, in Proceedings of World Academy of Science, Engineering and Technology, Vol. 8 (October 2005), pp. 193–197.
17. D. W. Aha and R. L. Bankert, A comparative evaluation of sequential feature selection algorithms, in Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, USA: IEEE Computer Society (January 1995).
18. D. B. Skalak, Prototype and feature selection by sampling and random mutation hill climbing algorithms, in International Conference on Machine Learning (1994), pp. 293–301.
19. G. Brassard and P. Bratley, Fundamentals of Algorithmics, Prentice-Hall (1996).
20. J. Wroblewski, Finding minimal reducts using genetic algorithms, in Proceedings of the International Workshop on Rough Sets and Soft Computing at the Second Annual Joint Conference on Information Sciences (1995), pp. 186–189.
21. J. Wroblewski, Genetic algorithms in decomposition and classification problems, in Rough Sets in Knowledge Discovery 2, L. Polkowski and A. Skowron (eds.), Physica-Verlag (1998), pp. 471–487.
22. T. M. Khoshgoftaar, B. Cukic and N. Seliya, Predicting fault-prone modules in embedded systems using analogy-based classification models, International Journal of Software Engineering and Knowledge Engineering 12(2) (2002) 201–221.
23. J. Bazan, M. Szczuka and J. Wroblewski, A new version of rough set exploration system, in Proceedings of the Third International Conference, RSCTC 2002 (2002), pp. 397–404.
24. J. G. Bazan and M. S. Szczuka, RSES and RSESlib — a collection of tools for rough set computations, in Rough Sets and Current Trends in Computing, Springer Verlag (2000), pp. 106–113.
25. S. H. Nguyen and H. S. Nguyen, Discretization methods in data mining, in Rough Sets in Knowledge Discovery 1, L. Polkowski and A. Skowron (eds.), Physica-Verlag (1998), pp. 451–482.
26. S. H. Nguyen, A. Skowron and P. Synak, Discovery of data patterns with applications to decomposition and classification problems, in Rough Sets in Knowledge Discovery 2, L. Polkowski and A. Skowron (eds.), Physica-Verlag (1998), pp. 55–97.
27. U. M. Fayyad, Data mining and knowledge discovery: Making sense out of data, IEEE Expert 11(4) (1996) 20–25.


28. J. P. Hudepohl, S. J. Aud, T. M. Khoshgoftaar, E. B. Allen and J. Mayrand, Emerald: Software metrics and models on the desktop, IEEE Software 13(5) (1996) 56–60.
29. R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in Proceedings of the International Joint Conference on Artificial Intelligence (1995), pp. 1137–1145.
30. T. M. Khoshgoftaar, X. Yuan and E. B. Allen, Balancing misclassification rates in classification tree models of software quality, Empirical Software Engineering 5(4) (2000) 313–330.

About the Authors

Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering and Data Mining and Machine Learning Laboratories. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 350 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004 and 2005, respectively, and is the Program Chair of the 20th International Conference on Software Engineering and Knowledge Engineering (2008). He has served on the technical program committees of various international conferences, symposia, and workshops. He has also served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.

Lofton A. Bullard is an Instructor in the Department of Computer Science and Engineering, Florida Atlantic University. He received his Ph.D. in Computer Science from Florida Atlantic University, Boca Raton, FL, USA, in May 2008. His research interests include software engineering, data mining and machine learning, software measurement, and software reliability and quality engineering.

Kehan Gao received the Ph.D. degree in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in 2003. She is currently an Assistant Professor in the Department of Mathematics and Computer Science at Eastern Connecticut State University. Her research interests include software engineering, software metrics, software reliability and quality engineering, computer performance modeling, computational intelligence, and data mining. She is a member of the IEEE Computer Society and the Association for Computing Machinery.