Mining Software Repositories to Guide Software Development

Ahmed E. Hassan
Software Architecture Group (SWAG)
School of Computer Science
University of Waterloo
Waterloo, Canada
{aeehassa}@plg.uwaterloo.ca

Research Area: Software understanding, Software evolution, Software visualization, Change propagation, Impact analysis, Software complexity, Fault prediction

ABSTRACT
Software repositories (such as source control repositories) contain a wealth of valuable information regarding the evolutionary history of a software project. In this research we recover such historical data and present several techniques and approaches to guide managers and developers working on large software systems. We validate our work empirically using data based on over 60 years of development history for several open source projects.

1 PROBLEM STATEMENT
As the size and complexity of software systems increase, software practitioners face many challenges in delivering high-quality software on time and within budget. Software managers continuously estimate the quality of their software and predict its reliability to ensure that no faulty software is released. Moreover, they endeavor, with varying degrees of success, to allocate their limited testing and development resources wisely to the most appropriate parts of the code. Unfortunately, in many cases such attempts are based on ad hoc techniques and rough approximations; their success depends on intuition, experience, and chance.

Similarly, software developers spend a large amount of their time trying to comprehend source code as they modify it to implement the evolving requirements of customers. They perform such complex tasks using simple text editors and tools such as grep. Such primitive tools are limited by the size of the software system and the amount of information a person can keep track of while jumping around the source tree [16].

The evolutionary history of a software project offers a great opportunity to assist these practitioners. In this work, we propose a set of techniques and tools to assist software practitioners by mining historical data about the code development process.
2 RESEARCH HYPOTHESIS
The code development process represents the patterns of modifications to the source code of a project. These modifications are made by developers to implement new features and repair faults. Whereas the current code base represents a static view of the current state of the system, the code development process provides a rich and evolving historical record of the code. It plays a central role in the production of a software system, where several facets interact heavily and affect each other (see Figure 1). Changes, complexity, lack of understanding, and concerns in one facet flow to other facets and affect them. For example, schedule pressure, requirements, and team structure affect the development process. As the complexity of the development process increases, it negatively affects the source code, whose complexity increases as well. Over time, the complexity of the source code and its design increases; in turn, the complexity of the development process increases, creating a feedback loop.

[Figure 1 facets: Problem Domain; Requirements; Development Process; Code/Design; Team (size/structure); Market/Schedule Pressure]

Figure 1: Interactions Among Facets of a Software Project

By studying patterns of modifications to the source code, we hope to achieve a better understanding of the software system and the development process in general. We claim that: Mining the evolutionary history of a software system can assist us in guiding software practitioners (managers, developers, etc.) to build better software systems.

3 DETAILS FOR PROPOSED SOLUTION
In our work, we focus on the data stored by a source control system as an example of a software repository. Such data is available for most large projects, and the cost of collecting it is minimal because it is recorded automatically whenever the source code is modified. Other possible

a high degree of complexity associated with them. These files will eventually have a large number of faults in them over time. To validate our conjecture, we compared the performance of our model in predicting faults against another model based on FI modifications, for each studied software system. We found that our model has statistically significantly better accuracy. Thus we believe that our complexity metric is a good indicator of complexity in large software systems. If monitored, it should help avoid delays and faults in a project over time. In future work, we plan to evaluate our model against other well-studied complexity metrics such as McCabe’s complexity.
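The entropy-based view of the code development process can be illustrated with a minimal sketch. It assumes that each modification in a period is represented only by the name of the file it touched, and that dividing by log2 of the number of files is an acceptable normalization; the function name `change_entropy` and both assumptions are illustrative, not the model defined in [9, 10].

```python
import math
from collections import Counter


def change_entropy(modified_files, total_files=None):
    """Shannon entropy (in bits) of how modifications spread across files.

    modified_files: one file name per modification made in the period.
    total_files: if given and > 1, the result is divided by log2(total_files)
                 so that it falls in [0, 1] (an assumed normalization).
    """
    counts = Counter(modified_files)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = -sum(p_i * log2(p_i)) over the files modified in the period.
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if total_files and total_files > 1:
        h /= math.log2(total_files)
    return h


# All modifications hit one file: the process is focused.
focused = change_entropy(["a.c"] * 8)
# Modifications spread evenly over four files: the process is scattered.
scattered = change_entropy(["a.c", "b.c", "c.c", "d.c"] * 2)
scattered_normalized = change_entropy(["a.c", "b.c", "c.c", "d.c"] * 2, total_files=4)
```

Under this reading, a period in which modifications are scattered evenly across many files scores higher than one in which they concentrate on a few files.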

repositories are the ones used by defect tracking systems and archived project communications/emails. As an initial step, we developed a set of algorithms to preprocess the source control data. Whereas most source control systems record changes to the code at the file level, we trace changes to specific source code entities, such as functions, variables, or data type definitions. We can then track details such as:
• Addition, removal, or modification of a source code entity, such as adding or removing a function.
• Changes to dependencies between the modified entities and other source code entities. For example, we can determine that a function no longer uses a specific variable or that a function now calls another function.
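The two kinds of tracked details can be sketched as a comparison of two revisions. The sketch assumes that each revision has already been parsed into a mapping from entity name to a hypothetical `Entity` record holding its source text and the names it uses; the parsing step and all names here are assumptions for illustration, not the preprocessing algorithms themselves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Hypothetical snapshot of one source code entity in one revision."""
    name: str                      # e.g. a function or variable name
    body: str                      # normalized source text of the entity
    uses: frozenset = field(default_factory=frozenset)  # entities it depends on


def diff_entities(old, new):
    """Compare two revisions, each given as {entity name: Entity}."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    common = old.keys() & new.keys()
    modified = sorted(n for n in common if old[n].body != new[n].body)
    # Dependency changes for entities present in both revisions.
    dep_changes = {}
    for n in common:
        gained = new[n].uses - old[n].uses
        lost = old[n].uses - new[n].uses
        if gained or lost:
            dep_changes[n] = {"added": sorted(gained), "removed": sorted(lost)}
    return {"added": added, "removed": removed,
            "modified": modified, "dependencies": dep_changes}


old_rev = {"main": Entity("main", "v1", frozenset({"init", "counter"})),
           "init": Entity("init", "v1")}
new_rev = {"main": Entity("main", "v2", frozenset({"init", "log"})),
           "init": Entity("init", "v1"),
           "log": Entity("log", "v1")}
report = diff_entities(old_rev, new_rev)
```

In this example the report records that `log` was added, that `main` was modified, and that `main` now uses `log` and no longer uses `counter`.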

5 PREDICTING CHANGE PROPAGATION
As developers modify software entities such as functions or variables to introduce new features or fix bugs, they must ensure that other entities in the software system are updated to be consistent with these new changes. For example, if the interface of a function changes, its callers have to be modified to reflect the new interface; otherwise the source code will neither compile nor link. This example of propagation is easy to detect, but that is not always the case. Many hard-to-find bugs are introduced by developers who did not notice dependencies between entities and failed to propagate changes correctly. The goal of change propagation is to ensure the consistency of assumptions among these interdependent entities.

Furthermore, we automatically divided modifications into three types based on the content of the detailed message attached to a modification, using a lexical technique similar to [14]: Fault Repairing modifications (FR), Feature Introduction modifications (FI), and General Maintenance modifications (GM). Using our derived data, we build mathematical models of the code change process and its effect on the quality of the source code (see Section 4). Then, we develop a framework for the change propagation process (see Section 5). In addition, we propose an approach (The Top Ten List) which highlights to managers the ten subsystems most susceptible to having a fault (see Section 6). Finally, based on our newly acquired understanding of the development process, we propose visualization techniques to assist developers as they maintain and evolve their code base (see Section 7). Throughout our research, we validate our work empirically using data derived from six large C/C++ open source projects (OpenBSD, NetBSD, FreeBSD, Postgres, KDE, and KOffice) with over 60 years of development history.
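The division of modifications into FR, FI, and GM can be sketched as a keyword lookup over the modification message. The keyword lists, the precedence of FR over FI, and the function name `classify_modification` are assumptions made for illustration; they are not the lexicon or rules of the lexical technique in [14].

```python
import re

# Illustrative keyword lists (assumptions, not the lexicon used in the study).
FR_WORDS = {"fix", "fixes", "fixed", "bug", "fault", "crash", "error"}
FI_WORDS = {"add", "adds", "added", "new", "feature", "implement", "support"}


def classify_modification(message):
    """Return 'FR', 'FI', or 'GM' for a modification message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    if words & FR_WORDS:
        return "FR"   # fault repairing takes precedence in this sketch
    if words & FI_WORDS:
        return "FI"   # feature introduction
    return "GM"       # everything else counts as general maintenance


labels = [classify_modification(m) for m in (
    "Fix crash in parser when input is empty",
    "Add support for IPv6",
    "Update copyright year",
)]
```

A real classifier would need a lexicon tuned to the vocabulary of each project's developers.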

In [12], we investigate whether there are good indicators, such as call graph relations, that could assist a developer in determining other entities to change. In particular, we ask the question: How does a change in one source code entity propagate to other entities? We present several change propagation heuristics and develop a framework to study empirically the performance of various change propagation models. Our results cast doubt on the effectiveness of code structures such as call and data dependency graphs as good indicators for change propagation. In addition, we define a historical co-change relation that records whether two source code entities have changed together in the past. We show that co-change relations can be used to develop heuristics that assist developers during the change propagation process with great success. In future work, we plan to extend our change propagation framework to permit the development of more complex heuristics for change propagation. We also plan to define a taxonomy for software change (such as interface addition and refactoring) and empirically study the evolution of software systems through our taxonomy.
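A historical co-change heuristic of the kind described above can be sketched as follows. The sketch assumes that history is available as a list of transactions, each listing the entities modified together, and it scores suggestions with precision and recall; the threshold `min_count`, the scoring choice, and all function names are assumptions for illustration, not the heuristics or the evaluation framework of [12].

```python
from collections import defaultdict
from itertools import combinations


def build_cochange(transactions):
    """Count how often each pair of entities changed in the same transaction."""
    counts = defaultdict(lambda: defaultdict(int))
    for changed in transactions:
        for a, b in combinations(sorted(set(changed)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def suggest(counts, entity, min_count=2):
    """Suggest entities that co-changed with `entity` at least min_count times."""
    partners = counts.get(entity, {})
    ranked = sorted(partners.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, n in ranked if n >= min_count]


def precision_recall(suggested, actual):
    """Score a suggestion set against the entities that actually changed."""
    suggested, actual = set(suggested), set(actual)
    hits = len(suggested & actual)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


history = [["parse", "lex", "ast"], ["parse", "lex"],
           ["parse", "emit"], ["lex", "parse"]]
cochange = build_cochange(history)
suggested = suggest(cochange, "parse")
scores = precision_recall(suggested, ["lex", "emit"])
```

Here `lex` is suggested for a change to `parse` because the two changed together three times, while `ast` and `emit` fall below the assumed threshold.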

4 THE CHAOS OF SOFTWARE DEVELOPMENT
Using sound mathematical concepts from information theory such as Shannon’s Entropy [17], we present a novel view of complexity in software and conjecture that: A chaotic code development process negatively affects its outcome, the source code. In [10], we evaluated our model using an observational study in which we correlated our model’s measurements with actual events in a project’s history. Events such as large refactorings or delays in releases were accompanied by increases in our model’s measurements. Then, in [9], we extended our model to develop a metric that associates complexity values with each file in the software system, as we sought to perform a more formal and concrete validation of our model. We conjectured that files which were modified during periods of high system complexity, as defined by our model, will have

6 THE TOP TEN LIST
To assist managers in coping with the challenges of allocating their limited resources effectively, we present an approach (The Top Ten List) which highlights to them the ten

ware systems. In an ideal world, each developer would attach a sticky note to every added or removed dependency, recording their name and the rationale behind the change, and the job of the maintainer would be much easier. In the fast-paced world of software development, with tight schedules and short time to market, this is neither possible nor practical. Thus, in addition to proposing these extended dependency graphs, we present a technique to build such graphs automatically from the recovered repository data, without any input from the developers.
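One way to picture a dependency graph annotated from repository data is as a mapping from each dependency edge to the records of the modifications that added or removed it. The `Annotation` fields, the tuple input format, and the function `build_adg` are hypothetical choices made for this sketch; they are not the ADG representation defined in [8].

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record attached to one dependency change."""
    action: str      # "added" or "removed"
    author: str      # developer recorded by the source control system
    date: str        # ISO date of the modification
    message: str     # modification message, standing in for the rationale


def build_adg(dependency_changes):
    """Group annotations by (source, target) dependency edge.

    dependency_changes: iterable of (source, target, Annotation) tuples,
    assumed to come from preprocessed source control data.
    """
    adg = defaultdict(list)
    for source, target, note in dependency_changes:
        adg[(source, target)].append(note)
    for notes in adg.values():
        notes.sort(key=lambda a: a.date)   # oldest annotation first
    return dict(adg)


changes = [
    ("Optimizer", "Parser",
     Annotation("removed", "bob", "2003-05-02", "Cache parse results instead")),
    ("Optimizer", "Parser",
     Annotation("added", "alice", "2002-11-14", "Re-parse inlined macros")),
]
adg = build_adg(changes)
edge_history = adg[("Optimizer", "Parser")]
```

A maintainer looking at the Optimizer to Parser edge could then read who introduced the call, when, and the message they left at the time.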

most susceptible subsystems to have a fault. The list is updated dynamically as the development of the system progresses. Managers can focus testing resources on the subsystems suggested by the list. In contrast to count-based techniques, which focus on predicting an absolute count of faults in a system over time, and classification-based techniques, which focus on predicting whether a subsystem is fault prone or not, we focus on predicting the subsystems that are most likely to have a fault in the near future. For example, even though a subsystem may not be fault prone and may have only a small number of predicted faults, a fault may still be discovered in it in the next few days or weeks. Conversely, even though a fault-counting technique may predict that a subsystem has a large number of faults, they may be dormant faults that are not likely to cause concern in the near future.

8 RELATED WORK
Other researchers have used source code repositories to explain and validate their ideas. For example, Eick et al. studied the concept of code decay and used the modification history to predict the incidence of faults [3, 4]. Graves et al. showed that the number of modifications to a file is a good predictor of the fault potential of the file [7]. Chen et al. presented a case study of a source code searching tool that makes use of the developers’ comments associated with each modification to the code [1]. The tool uses the comments to index the source code and so provide more accurate results when developers search for the location where specific features are implemented. Gall proposes the use of visualization techniques to show the historical logical coupling between entities in the source code [6, 5]. Cubranic et al. presented a tool that uses bug reports, news articles, and mailing list postings to suggest pertinent software development artifacts [2]. Work by Shirabad [15] uses machine learning techniques and source control data to suggest other entities to change in a software system.

In our work we show other benefits and uses of data stored in source repositories. Furthermore, we aim to develop a standard common exchange format for source control data to ease the sharing of the extracted data and to enable reuse and repeatability of results throughout the research community [13].

If we were to draw an analogy between our work and rain prediction, our prediction model focuses on predicting the areas where it is most likely to rain in the next few days. The predicted rain areas may be areas that are known to be dry (i.e., not fault prone) and may be areas that aren’t known to have large precipitation values (i.e., low predicted faults). In [11], we develop a model to measure the performance of the Top Ten List using ideas that have been extensively studied in the literature on web and file systems. We show that building the list using data derived from the source repository provides good results and value for managers. We believe that the Top Ten List approach holds a lot of promise and value for software practitioners: it provides a simple and accurate technique to assist them in maintaining large evolving software systems. In future work, we plan to focus on the development of more elaborate and mathematically sound value functions to measure the perceived value of a prediction. For example, a manager may appreciate being warned a month before a fault occurs, whereas a developer may assign more value to shorter warnings (just a day before the fault appears).
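The idea of a dynamically updated list and a way to measure its performance can be sketched by borrowing from cache replacement, in the spirit of the web and file system literature mentioned above. The ranking rule (most frequently faulty so far) and the `hit_rate` measure are assumptions chosen for illustration; they are not the heuristics or the performance model of [11].

```python
from collections import Counter


def top_n_most_frequent(events, n=10):
    """Rank subsystems by how many past events (e.g. fault repairs) hit them."""
    counts = Counter(events)
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return ranked[:n]


def hit_rate(fault_stream, n=10):
    """Replay faults in time order and count those landing in the current list."""
    seen, hits = [], 0
    for subsystem in fault_stream:
        if subsystem in top_n_most_frequent(seen, n):
            hits += 1
        seen.append(subsystem)   # the list is updated only after scoring
    return hits / len(fault_stream) if fault_stream else 0.0


# Subsystem names are placeholders; each entry is one fault, oldest first.
stream = ["net", "net", "fs", "net", "ui", "fs"]
current_list = top_n_most_frequent(stream, n=2)
rate = hit_rate(stream, n=2)
```

In this replay, three of the six faults fall in a subsystem that was on the two-entry list at the time the fault appeared.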

9 CONCLUSION
We believe that the approach and results presented in our work highlight the value of mining the evolutionary history of large projects stored in source repositories, which are rarely investigated. Our research provides novel techniques adopted from well-studied domains such as the web, file systems, and complexity theory. Moreover, we have validated the presented techniques and ideas empirically. Our validation has in some cases cast doubt on well-established software engineering beliefs, such as the effectiveness of code structures like call graphs as good indicators for change propagation.

7 ENHANCED VISUALIZATION
Dependency graphs have been proposed and used in many studies and maintenance activities to assist developers in understanding large software systems before they embark on modifying them to meet new requirements or to repair faults. Call graphs and data usage graphs are the most commonly used dependency graphs. These graphs show the present structure of the software system (e.g., in a compiler, an Optimizer function calling a Parser function). They fail to reveal details about the structure of the system that are needed to gain a better understanding. For example, traditional call graphs cannot give the rationale behind an Optimizer function calling a Parser function.

ACKNOWLEDGEMENTS
I am very grateful to my thesis supervisor, Professor Richard C. Holt, for all his assistance and fruitful discussions, and for always being willing to listen to my comments and provide excellent advice and suggestions.

In [8], we advocate a new view on dependency graphs: Annotated Dependency Graphs (ADG). ADG can help maintainers better understand the current structure of large soft-

REFERENCES
[1] A. Chen, E. Chou, J. Wong, A. Y. Yao, Q. Zhang, S. Zhang, and A. Michail. CVSSearch: Searching through source code using CVS comments. In IEEE International Conference on Software Maintenance (ICSM 2001), pages 364–374, Florence, Italy, 2001.
[2] D. Cubranic and G. C. Murphy. Hipikat: Recommending pertinent software development artifacts. In Proceedings of the 25th International Conference on Software Engineering (ICSE 2003), pages 408–419, Portland, Oregon, May 2003. ACM Press.
[3] S. G. Eick, T. L. Graves, A. F. Karr, J. Marron, and A. Mockus. Does Code Decay? Assessing the Evidence from Change Management Data. IEEE Trans on Software Engineering, 27(1):1–12, 2001.
[4] S. G. Eick, C. R. Loader, M. D. Long, S. A. Vander Wiel, and L. G. Votta Jr. Estimating software fault content before coding. In Proceedings of the 14th International Conference on Software Engineering, pages 59–65, Melbourne, Australia, May 1992.
[5] H. Gall, K. Hajek, and M. Jazayeri. Detection of logical coupling based on product release history. In IEEE International Conference on Software Maintenance (ICSM98), Bethesda, Washington D.C., Nov. 1998.
[6] H. Gall, M. Jazayeri, and J. Krajewski. CVS Release History Data for Detecting Logical Couplings. In IEEE International Workshop on Principles of Software Evolution (IWPSE03), Helsinki, Finland, Sept. 2003.
[7] T. L. Graves, A. F. Karr, J. S. Marron, and H. P. Siy. Predicting fault incidence using software change history. IEEE Trans on Software Engineering, 26(7):653–661, 2000.
[8] A. E. Hassan and R. C. Holt. ADG: Annotated Dependency Graphs. In Proceedings of VISSOFT 2003: Annual DESIGNFEST On Visualizing Software For Understanding And Analysis, Amsterdam, Netherlands, Sept. 2003.
[9] A. E. Hassan and R. C. Holt. Studying The Chaos of Code Development. In Proceedings of WCRE 2003: Working Conference on Reverse Engineering, Victoria, British Columbia, Canada, Nov. 2003.
[10] A. E. Hassan and R. C. Holt. The Chaos of Software Development. In IEEE International Workshop on Principles of Software Evolution (IWPSE03), Helsinki, Finland, Sept. 2003.
[11] A. E. Hassan and R. C. Holt. The Top Ten List: Dynamic Fault Prediction. Submitted for publication, Oct. 2003.
[12] A. E. Hassan and R. C. Holt. Predicting Change Propagation in Software Systems. Submitted to the 26th International Conference on Software Engineering (ICSE 2004), Scotland, UK, May 2004.
[13] A. E. Hassan, R. C. Holt, and A. Mockus. MSR’04: 1st International Workshop on Mining Software Repositories, 2004. Proposed Workshop for the 26th International Conference on Software Engineering (ICSE 2004).
[14] A. Mockus and L. G. Votta. Identifying reasons for software change using historic databases. In Proceedings of the International Conference on Software Maintenance (ICSM), pages 120–130, San Jose, California, Oct. 2000.
[15] J. S. Shirabad. Supporting Software Maintenance by Mining Software Update Records. PhD thesis, University of Ottawa, 2003.
[16] S. E. Sim, C. L. A. Clarke, and R. C. Holt. Archetypal Source Code Searching: A Survey of Software Developers and Maintainers. In Proceedings of International Workshop on Program Comprehension, pages 180–187, Ischia, Italy, June 1998.
[17] C. E. Shannon and W. Weaver. The mathematical theory of communication. Urbana: University of Illinois Press, 1949.
