
A Stochastic Model of Software Maintenance and Its Implications on XP Process

Sergio Focardi*, Michele Marchesi** and Giancarlo Succi§

* The Intertek Group, Paris, France
** Università di Cagliari, Cagliari, Italy
§ University of Alberta, Edmonton, Canada

Introduction

Maintenance is that phase of the software life cycle when existing software is modified to fix errors, restructure the system for greater efficiency, or ready it for future upgrades (Pressman, 1997); it accounts for well over 50% of the total cost of software development. Despite its contribution to costs, software maintenance and evolution has been the subject of a surprisingly small number of empirical studies (Kemerer and Slaughter, 1999); few models have been attempted and validated to describe the overall process and explain the effort required and the quality of the resulting system. More in-depth research seems necessary to attain a higher level of empirical validation (Ramil and Lehman, 1999). Effective models for software maintenance and evolution would serve two purposes:

1. They would serve to guide the development process in an effort to minimize maintenance costs;
2. They would help predict the maintenance effort, thus supporting better scheduling and resource allocation.

This paper proposes a model of the effects of maintenance operations on software systems based on the theory of random graphs. Our model uses a random graph to represent a software system. The graph is populated by two types of nodes: strong nodes and weak nodes. Maintenance spreads through weak nodes as a contagion process that can be modeled as percolation on a Cayley tree. The model is used to give some insight on development processes based on refactoring, including Extreme Programming (XP) (Beck, 1999).

Modeling software maintenance: a review of past approaches

The seminal work of Lehman (1980) on modeling the structure and effect of software maintenance and evolution has been followed by few studies. Two classes of models have been proposed in the literature: 1) those focusing on the process of maintaining and evolving the system and 2) those focusing on the effects of maintenance operations. On the basis of existing literature, Kemerer and Slaughter (1999) refer to the former as models for software evolution and the latter as models for software maintenance.

Models for describing the process of software development and maintenance focus on different techniques to describe the research methodology (Kellner, 1998). Following Forrester (1961), system dynamics is the study of the dynamics of complex systems composed of interacting entities. The use of system dynamics to model software maintenance and evolution was proposed by Abdel-Hamid and Madnick (1991) and was later extended into a general project management framework by Rodrigues and Williams (1996). Petri nets (Bandinelli et al., 1993) and generalized Petri nets (Kusumoto et al., 1997) were proposed to describe process evolution in general and, sometimes, with a focus on maintenance. Temporal logic was used to describe the effects of changes in software systems by Zhou et al. (1999). Succi et al. (1995) proposed a morphogenetic approach using a set of evolution operators to describe formally the evolution of a program. Gamma Analysis has also been applied to determine the different sequences of activities in the maintenance and evolution process (Kemerer and Slaughter, 1999).

Models that focus on the code are often based on the analysis of code metrics. For example, Li and Henry (1993) determine how object-oriented metrics are related to various versions of software systems, considered as a proxy of maintenance effort. Succi et al. (2000) relate software reuse and the presence of a domain library to requests for maintenance from customers in two large software systems.

This work proposes a different approach. It focuses on the dynamics of the propagation of changes using the theory of random graphs. It attempts to model quantitatively, albeit in a stochastic sense, the dynamics of software maintenance costs.

S. Focardi, M. Marchesi, G. Succi (June 2000) “A Stochastic Model Of Software Maintenance and Its Implications on XP Process.” Proceedings of the First International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2000), Cagliari, Italy

The representation of software systems with random graphs

In this work, we model the effects of changes on software systems using the theory of random graphs (Palmer, 1985; Bollobas, 1985). Graphs have often been used to represent software systems. In particular, Briand et al. (1996) provide a comprehensive representation of any software system as a collection of nodes and of various kinds of edges between nodes. A node can be any software artifact, such as a design class, a line of code, or a piece of documentation. An edge is any kind of link between two nodes, such as an inheritance relation between two classes, a definition-usage pair between two lines of code, a cross-reference between two pieces of documentation, or a trace between a part of a model and the requirement that led to that design.

Our approach assumes that different development methodologies lead to different kinds of graphs, such that each methodology can be represented with one or several random graphs. The properties of the various methodologies can then be inferred by studying the associated random graphs.

A random graph G is a graph with N nodes and Q edges. We assume that edges are not directed. There are two fundamental probabilistic structures for representing random graphs which, following Palmer (1985), we will refer to as model A and model B. In model A, Q is a random variable. Assume that each pair of nodes i,j has the same probability p of being connected and that such probabilities are independent of each other. In a graph with N nodes, there are N(N-1)/2 possible edges. Therefore, the random variable Q is distributed as a binomial random variable with mean 〈Q〉=pN(N-1)/2. In model B, the number of edges Q is fixed. The number of graphs with N nodes and Q edges is:

$$\binom{N(N-1)/2}{Q} \tag{1}$$

and all graphs have the same probability:

$$\binom{N(N-1)/2}{Q}^{-1} \tag{2}$$
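The two probabilistic structures can be illustrated with a short Python sketch. The function names (`sample_model_a`, `sample_model_b`, `count_graphs`) and the parameter values are illustrative assumptions, not part of the paper; the sketch only samples edge sets under each model and evaluates the count in expression (1).

```python
import itertools
import math
import random


def all_pairs(n):
    """Return the N(N-1)/2 unordered node pairs of an n-node graph."""
    return list(itertools.combinations(range(n), 2))


def sample_model_a(n, p, rng):
    """Model A: each possible edge is present independently with probability p."""
    return [edge for edge in all_pairs(n) if rng.random() < p]


def sample_model_b(n, q_edges, rng):
    """Model B: pick exactly q_edges edges uniformly among all possible edges."""
    return rng.sample(all_pairs(n), q_edges)


def count_graphs(n, q_edges):
    """Number of graphs with n nodes and q_edges edges, as in expression (1)."""
    return math.comb(n * (n - 1) // 2, q_edges)


rng = random.Random(0)  # arbitrary fixed seed, for reproducibility only
n, p = 50, 0.1          # illustrative values, not taken from the paper

# In model A the edge count Q is binomial with mean p*N*(N-1)/2.
sizes = [len(sample_model_a(n, p, rng)) for _ in range(200)]
mean_q = sum(sizes) / len(sizes)
expected_q = p * n * (n - 1) / 2
print("empirical mean Q:", mean_q, "binomial mean:", expected_q)

# In model B the edge count is fixed by construction.
print("model B edges:", len(sample_model_b(n, 120, rng)))
print("graphs with 4 nodes and 3 edges:", count_graphs(4, 3))
```

Under model B every edge set of size Q is equally likely, so each graph has probability `1 / count_graphs(n, q_edges)`, which is expression (2).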

For large values of N, the two models are equivalent if p=2Q/N(N–1) (Palmer, 1985); we will thus use the two models interchangeably. In both cases, the random graph G is fully represented by the number of nodes N and the probability p, that is, G=G(N,p). If a graph is connected, the number of edges Q must satisfy N–1≤Q≤N(N–1)/2. In model A, if N grows with constant probability p, the probability that a graph is connected tends to 1. In model B, when N grows keeping the ratio 2Q/N(N–1) constant, the probability that a graph is connected tends to 1. In other words, for large N, if the probability p or the ratio 2Q/N(N–1) remains constant, almost all graphs are connected. The number r of edges that leave a node is called the degree of the node. The sum of all degrees is twice the number of edges:

$$\sum_{i=1}^{N} r_i = 2Q \tag{3}$$

Software Models – Version 7 – Last modified 14/06/00 - 17.09

A graph is called regular of degree r (or r-regular) if all nodes have the same degree r. In model A, for large values of N, the average degree is given by:

$$\langle r \rangle = p(N-1) \approx pN \tag{4}$$
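Equation (4) can be obtained from the degree-sum identity (3) together with the binomial mean 〈Q〉=pN(N-1)/2 stated earlier. A short derivation, written out here as a sketch of the missing step:

```latex
\langle r \rangle
  = \frac{1}{N}\Big\langle \sum_{i=1}^{N} r_i \Big\rangle
  = \frac{2\langle Q \rangle}{N}
  = \frac{2}{N}\cdot\frac{pN(N-1)}{2}
  = p(N-1) \approx pN .
```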

When N tends to infinity, any random graph G(N,p) tends to be regular if probability p is constant (Palmer, 1985). This fact is reflected in model B if 2Q/N(N–1) remains constant for growing N. Under this assumption, in model B, the average degree grows linearly with the number of nodes of the graph. In fact, for large N:

$$\langle r \rangle \approx \frac{\sum_{i=1}^{N} r_i}{N} = \frac{2Q}{N} \propto N. \tag{5}$$
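The tendency toward regularity can be checked numerically. The sketch below is an illustration under assumed parameter values (p = 0.2 and the chosen graph sizes are not from the paper), and the helper names `degrees_model_a` and `relative_spread` are hypothetical. It samples G(N,p) for growing N and reports the standard deviation of the node degrees relative to their mean, which should shrink as N grows while the mean degree stays close to p(N-1).

```python
import random
import statistics


def degrees_model_a(n, p, rng):
    """Sample G(N, p) under model A and return the list of node degrees."""
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    return deg


def relative_spread(n, p, rng):
    """Standard deviation of the degrees divided by their mean."""
    deg = degrees_model_a(n, p, rng)
    return statistics.pstdev(deg) / statistics.mean(deg)


rng = random.Random(42)  # arbitrary fixed seed
p = 0.2                  # illustrative constant connection probability
for n in (50, 200, 800):
    deg = degrees_model_a(n, p, rng)
    mean_degree = statistics.mean(deg)
    spread = statistics.pstdev(deg) / mean_degree
    # Columns: N, empirical mean degree, p(N-1), relative spread of degrees
    print(n, round(mean_degree, 1), round(p * (n - 1), 1), round(spread, 3))
```

A shrinking relative spread is one informal reading of "tends to be regular": degrees concentrate around their mean, although they are not literally identical for any finite N.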

If probability p is not constant, the degree r has a distribution that depends on how p grows with N. We call the average degree of the graph that represents a software system the degree of the software system. The degree of such a system is a measure of complexity, as it trivially satisfies the properties of complexity measures discussed in (Briand et al., 1996). In model B, the average degree of a given graph is a constant, since Q is fixed and the average degree equals 2Q/N exactly; in model A, we approximate (in a probabilistic sense) the average degree with:

$$\frac{\sum_{i=1}^{N} r_i}{N} \tag{6}$$

Model B allows representing in a natural way the growth of the degree of a software system through its different development phases. It also allows modeling quantitatively - albeit in a probabilistic way - the growth of maintenance costs for different development phases.

The propagation of maintenance as a contagion process

To represent the effect of a change in a software system, we follow an approach similar to the one taken in epidemiology; see (Durrett, 1993; Liggett, 1985) for a discussion of this. The spread of an infectious disease is modeled as an interacting particle system. A number of individuals occupy the sites of a (two-dimensional) lattice. Each individual can be either healthy but susceptible to getting sick, sick, or vaccinated/recovered and thus immune. Healthy but susceptible individuals randomly acquire the disease at a rate dependent on the number of sick individuals surrounding their sites. Sick individuals, in turn, recover at an independent rate and become immune. The disease spreads to healthy but susceptible individuals, reaches a peak, and then decays as a growing fraction of the population develops immunity.

To apply the same metaphor to software systems, we divide the nodes into two mutually exclusive classes: “strong” nodes and “weak” nodes. Strong nodes correspond to immune individuals, weak nodes to individuals that are healthy but susceptible to contagion. Because this work does not model the “time to repair,” we do not include a population of nodes under maintenance corresponding to sick individuals. We assume only that weak nodes can undergo maintenance. After maintenance, they become strong with a given probability. Maintenance is performed instantaneously but can spread to other connected weak nodes that, in turn, become strong.

The modification of the state of a node in a software system inevitably impacts connected nodes. If the node is well engineered, the impact is negligible; the maintenance might be limited to a change in a parameter. In other cases, however, the node might require major reengineering. This reengineering process, in turn, is likely to affect neighboring weak nodes, and so on. Software maintenance can therefore be represented as a process that starts from some weak node and propagates to other weak neighboring nodes.

To formalize, let the system be composed of N nodes of which W are weak and S are strong. Let q be the ratio W/N. Given any node, the probability that another connected node is weak is q. Given any two connected weak nodes a and b, let s be the probability that a change in a will propagate to b. q and s are the statistical parameters that characterize different classes of software systems. We assume that q and s are small and independent and that the graph that represents the system under change is r-regular. As observed above, most large graphs are approximately regular.

If a modification has to be performed on a strong node, the cost is small and the change does not propagate. If, however, modification is required on a weak node, it may result in a significant cost C. Let F(x) be the probability distribution for C, that is, F(x)=P(C≤x). The node under change is connected to r other nodes since the degree of the graph is r. Given that q is the probability that a connected node is weak and s the probability that a change propagates to a connected weak node, the change propagates to any given connected node with probability qs. The overall cost of the change is a random sum of S terms, each with distribution F.

A key simplification is to assume that the effect of a modification to a weak node propagates along trees without forming loops. This assumption is reasonable. In fact, the theory of random graphs supplies the thresholds under which most components of a graph will be trees. In a graph with N nodes, the probability threshold for the appearance of loops in connected components is 1/N (Palmer, 1985). This means that in a large random graph, if the probability of connection of two nodes is less than 1/N, most components are trees (Palmer, 1985).
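The propagation mechanism described above can be explored with a small Monte Carlo sketch. This is an illustration under stated assumptions, not the paper's analysis: the function names are hypothetical, the cost distribution F is left unspecified in the text and is replaced here by an exponential distribution purely for the sake of example, and propagation is assumed to follow a tree, so a node reached from its parent has r - 1 further neighbours to try. The closed form in `expected_changed` is the standard mean total progeny of a subcritical branching process, valid only when (r-1)qs < 1.

```python
import random


def simulate_change(r, q, s, rng, mean_cost=1.0, max_nodes=100_000):
    """Simulate one maintenance event that starts at a weak node.

    Returns (number of weak nodes changed, total cost).
    Assumptions (not from the paper): each changed node costs an
    exponentially distributed amount with mean mean_cost, standing in for
    the unspecified distribution F; max_nodes caps runaway cascades.
    """
    changed, total_cost = 0, 0.0
    frontier = [r]  # the starting node has r neighbours to try
    while frontier and changed < max_nodes:
        neighbours = frontier.pop()
        changed += 1
        total_cost += rng.expovariate(1.0 / mean_cost)
        for _ in range(neighbours):
            # A neighbour is weak with probability q and, if weak,
            # receives the change with probability s.
            if rng.random() < q * s:
                frontier.append(r - 1)  # tree: one edge leads back to the parent
    return changed, total_cost


def expected_changed(r, q, s):
    """Mean number of changed nodes on a tree, valid when (r - 1) * q * s < 1."""
    m = q * s
    return 1.0 + r * m / (1.0 - (r - 1) * m)


rng = random.Random(7)    # arbitrary fixed seed
r, q, s = 10, 0.2, 0.3    # illustrative values, not taken from the paper
runs = [simulate_change(r, q, s, rng) for _ in range(20000)]
mean_changed = sum(n for n, _ in runs) / len(runs)
mean_cost = sum(c for _, c in runs) / len(runs)
print("simulated mean nodes changed:", round(mean_changed, 3))
print("branching-process mean:", round(expected_changed(r, q, s), 3))
print("simulated mean total cost:", round(mean_cost, 3))
```

With these assumptions the total cost is a random sum of independent cost terms, one per changed node, so its mean is the mean number of changed nodes times the mean cost per node.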
Under these hypotheses, the probability that two nodes are connected by a change propagation is pqs. A reasonable stochastic model of graphs requires that pqs