Computers, IEEE Transactions on - IEEE Xplore


IEEE TRANSACTIONS ON COMPUTERS, VOL. 42, NO. 10, OCTOBER 1993

The Completion Time of Programs on Processors Subject to Failure and Repair

Philip F. Chimento, Jr., Senior Member, IEEE, and K. S. Trivedi, Fellow, IEEE

Abstract-The objective of this paper is to describe a technique for computing the distribution of the completion time of a program on a server subject to failure and repair. Several realistic aspects of the system are included in the model. The server behavior is modeled by a semi-Markov process in order to accommodate nonexponential repair-time distributions. More importantly, the effect on the job completion time of the work lost due to the occurrence of a server failure is modeled. We derive a closed-form expression for the Laplace-Stieltjes transform (LST) of the time to completion distribution of programs on such systems. We then describe an effective numerical procedure for computing the completion time distribution. We show how these results apply to the analysis of different computer system structures and organizations of fault-tolerant systems. Finally, we use numerical solution methods to find the distribution of time to completion on several systems.

Index Terms-Computer performance, failure-repair models, Laplace transform inversion, multistate computer systems, preemptions, semi-Markov processes.

I. INTRODUCTION

Manuscript received July 8, 1991; revised January 21, 1992 and July 29, 1992. This work was supported in part by the National Science Foundation under Grant CCR-9108114 and by the Office of Naval Research under Grant N00014-91-3-4162. P. F. Chimento is with IBM Corporation, Research Triangle Park, NC 27709. K. S. Trivedi is with the Department of Electrical Engineering, Duke University, Durham, NC 27706. IEEE Log Number 9208565.

A system that is fault tolerant may seem very good when viewed in the light of performance measures such as steady-state availability or interval availability. However, from a user's point of view, even a highly available system may not be adequate. If faults are frequent and work is lost each time a fault occurs, then the system may not meet user requirements. In real-time systems, such as those in process control and avionics, user requirements will include deadlines on completion time. For such systems, the computation of mean job completion time provides only limited information, since the mean completion time cannot give the probability of the task not meeting the deadline. We need to compute the distributions of completion time including the effect of work loss due to a fault and the delay caused by a repair action.

In this paper, we examine the behavior of a program executing on a processor that is subject to faults of different types. We refer to the stochastic process modeling the failure/repair behavior of the processor (or server) as the "structure-state" process [17], [18] or the "server process" [19]. A common criticism of such models is the assumption of exponential distributions for both failure and repair [9]. In order to model general distributions of times to repair, we will assume that the structure-state process is semi-Markov. The work requirement of the program is also allowed to be generally distributed. We assume that the program work requirement and the structure-state process are statistically independent of one another.

Upon the occurrence of a transition in the structure-state process, an executing task will be interrupted. We consider three different types of interruptions. If no loss of prior work occurs due to the interruption, we use the prs (pre-emptive resume) interruption. The prs interruption is an appropriate model for certain classes of hardware faults that can be detected almost instantaneously by hardware. Recovery in this case typically involves either hardware or a combination of hardware and software. For example, a transient fault of the arithmetic logic unit (ALU) could cause a machine check which might be cleared by hardware instruction retry. Or a fault might cause a software error-recovery routine to put the machine back into an acceptable state and then resume the program from where it was interrupted. Such an error might be caused by a parity error on an internal bus which, when cleared, would not cause the program to lose any work.

The pri (pre-emptive repeat identical) and prd (pre-emptive repeat different) interruptions model cases where prior work is lost due to the interruption. Some classes of hardware or software errors that are the result of an error being propagated through the computation imply such a work loss. In general, it is not possible to recover a valid system state in this event. If this occurs, the only recovery possible is to repeat the program. If the program is executed with identical information, the time requirement is identical to the requirement the first time the program was executed. Then the job is repeated without resampling the service demand and the interruption type is pri. On the other hand, after a prd interruption the job is to be repeated but with a new work requirement drawn from the same distribution as the old work requirement. We claim that this model of the processor behavior is realistic in the light of certain failures and error recovery routines in modern operating systems. See, for example, [1], [2], and [9].

Note that the three features of the model described above, viz. nonexponential repair time distribution, nonexponential program work requirement distribution, and accurate modeling of lost work, make the overall model non-semi-Markovian. Fortunately, however, a transform-based recursive method that was devised earlier [12] can be used here.
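The difference between the three interruption types can be made concrete with a small Monte Carlo sketch. This is an illustration only, not the transform method of the paper: it assumes a single failure type with exponential times to failure, a deterministic repair time, and no failures during repair. The names `completion_time`, `mean_completion_time`, `fail_rate`, `repair`, and `resample` are hypothetical.

```python
import random

def completion_time(work, mode, rng, fail_rate=0.5, repair=0.2, resample=None):
    """Simulate one job; mode is 'prs', 'pri', or 'prd' (assumed simplified model)."""
    remaining = work
    clock = 0.0
    while True:
        time_to_failure = rng.expovariate(fail_rate)
        if time_to_failure >= remaining:
            return clock + remaining          # job finishes before the next failure
        clock += time_to_failure + repair     # time spent working, then repairing
        if mode == "prs":
            remaining -= time_to_failure      # work done so far is kept
        elif mode == "pri":
            remaining = work                  # restart with the identical requirement
        elif mode == "prd":
            work = resample(rng)              # restart with a resampled requirement
            remaining = work
        else:
            raise ValueError(mode)

def mean_completion_time(mode, n=20000, seed=1):
    """Average completion time over n simulated jobs (unit mean work)."""
    rng = random.Random(seed)
    draw = lambda r: r.expovariate(1.0)       # assumed work distribution for prd
    total = 0.0
    for _ in range(n):
        work = draw(rng) if mode == "prd" else 1.0
        total += completion_time(work, mode, rng, resample=draw)
    return total / n
```

Under these assumptions a prs job of size x has mean completion time x(1 + fail_rate * repair), while a pri job of the same size takes longer on average because every failure discards the accumulated work.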

0018-9340/92$03.00 © 1992 IEEE

CHIMENTO AND TRIVEDI: COMPLETION TIME OF PROGRAMS SUBJECT TO FAILURE AND REPAIR


We consider two different server models of a single processor system. First, we derive a general four-state model of failure and repair with mixed types of interruptions. Then we consider a model by Castillo and Siewiorek [1] as a special case of the general model. Since our analytical solution technique provides us with one- and two-dimensional transform expressions, we need transform inversion procedures to obtain numerical results. For the numerical inversion of one-dimensional Laplace-Stieltjes transforms we use a method of Jagerman's [11]. For the two-dimensional problem, we use Jagerman's technique coupled with numerical integration to obtain the distribution of time to completion.

Using the computed completion time distributions we can check whether a design satisfies user requirements. For example, the probability of violating a certain deadline constraint (the dynamic failure probability) or the probability that the task never completes (the omission failure probability) can be computed. It should be clear that such information cannot be obtained if the analysis only yields the average completion time. Since analytic-numeric methods such as the ones proposed here are rather quick in execution time (as compared to simulation, for example), they can be repeated to assess the effects of changes in system design or to obtain an optimal design.

In Section II, we develop transform equations for the completion time distribution assuming a semi-Markov CPU (server) model. We specialize the general model we have developed in Section III. There we obtain a completion time model proposed by Castillo and Siewiorek [1] and add some extensions. We develop the equations for the omission failure probability of each model in Section IV. In Section V, we describe the numerical techniques that we use for the inversion of the transform equations. In Section VI, we describe the control of numerical errors in our procedures. Finally, we provide numerical results for our chosen example problems in Section VII and conclude.

II. GENERAL COMPLETION TIME MODEL

In this section, we develop a CPU model to account for complex failure and repair behavior. In what follows, we use the terminology of Laprie [14] to discuss failures, errors, and faults. We view the CPU and associated peripherals and the operating system as providing a service to tasks or user programs. If we were to model parts of the operating system as tasks, then the CPU and peripherals would provide the service. The service that the combined system supplies in this sense is the scheduling of the user task and the execution of the instructions and system calls coded in the task. This service specification is typically provided in the "principles of operation" of a system. Thus, from the point of view of the executing task, an instruction retry is an example of a failure and attempted recovery in the sense that the implied service specification of the system (that is, instructions executed sequentially and with correct results) was not delivered to the program. In this example, the fault causing the error which caused the failure was likely a transient hardware condition. The error may have had a long latency time before being detected, if it were associated with logic peculiar to only a small subset of the instructions. On the other hand, the error may have had a very short latency time if it were associated with logic common to all the instructions. In any event, the effect of the error on the service provided to the task was that the work done by the instruction had to be discarded and the task delayed by the time it took to perform the recovery action.

Our model of the CPU encompasses three kinds of failures. We have a state to model each kind of failure. The first kind of failure simply delays the service provided to an executing task or program. This kind of failure is due to errors that are detected very quickly once they are activated, and so in general do not have an opportunity to propagate and create a more extensive erroneous system state. The repair time for this kind of failure is quite short and often takes place at hardware or microcode speeds.

There is a second kind of failure that may occur. The effect of this kind of failure on the system is both delay and work loss. This kind of failure is usually due to an error that has propagated or that has been active long enough before being detected so that neither the task state nor the system state immediately preceding the first erroneous state is recoverable. In this case, the task may be restarted. While the time to perform the recovery actions is not great, it is considerably longer than for the first kind of failure.

The third kind of failure that may occur in a system is that the system simply stops providing service; the system crashes. In this case, the errors causing the failure have propagated so far that no valid system state is recoverable, and either the hardware stops or the operating system crashes.
In this case, considerable time may elapse before the system is able to provide service again and before the task may execute again. In any case, any work done would be lost, and the task would have to be restarted from the beginning. Again, this kind of error involves work loss in addition to delay. Referring to Fig. 1, we see all four states of the system as described above. In our terminology, the “working” state is a prs state as is the state corresponding to a failure due to an error caused by a transient fault. The two states that cause the program to repeat are pri states since we assume that the same job is simply repeated. For now, we put aside considerations of prd states. However, we will see later that the same techniques can be applied if the repeated work after a failure is drawn from a new distribution. There are two pri states in Fig. 1 to capture some complex system behavior. Here, we can think of work requirement satisfied as “accumulated reward.” The structure of the model captures multiple failures: for example, while recovering in state 2, a failure can occur causing the process to enter state 3 or state 4. On entry to state 3 or state 4, any accumulated reward is lost. Since both states also have zero reward rate, we may be tempted to lump the two states. But then the holding time in the combined state would depend upon how the state was entered. The resulting (server) process would not then be a semi-Markov process.


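Fig. 1. Semi-Markov process for general CPU model.

In the following, we use the terminology of Pyke [21]. The semi-Markov server (structure-state) process is shown in Fig. 1. Let $Y_{n+1}$ be the state of the SMP after the $(n+1)$st transition and let $X_n$ be the time elapsed between the $n$th and $(n+1)$st transitions. Define the kernel distribution $Q_{ij}(t)$ by

$$Q_{ij}(t) = P\{Y_{n+1} = j,\; X_n \le t \mid Y_n = i\}$$

and the kernel matrix by $Q(t) = [Q_{ij}(t)]$. Let the Laplace-Stieltjes transform (LST) of $Q_{ij}(t)$ be $\tilde{Q}_{ij}(s) = \int_0^\infty e^{-st}\,dQ_{ij}(t)$.

Define the random variable $Z_{ij}$ = time to occurrence of a fault of type $j$ from $i$. Also, we define the repair distributions as $G_i(t)$ where $i$ is the index of the state in which the repair is done. We assume that the random variable $Z_{ij}$ has the exponential distribution with parameter $\lambda_{ij}$. Based on our assumptions, we can now write the expressions for the kernel distributions.

A. Kernel and Transforms for State 1

Since state 1 has only exponentially distributed transition times, the analysis is quite simple. We have

$$Q_{1j}(t) = \frac{\lambda_{1j}}{\Lambda_1}\left[1 - e^{-\Lambda_1 t}\right], \qquad j = 2, 3, 4,$$

where $\Lambda_1 = \sum_j \lambda_{1j}$. The LST's are easily derived:

$$\tilde{Q}_{1j}(s) = \frac{\lambda_{1j}}{s + \Lambda_1}.$$

B. Kernel and Transforms in States 2, 3, and 4

We let $\Lambda_2 = \sum_i \lambda_{2i} = \lambda_{23} + \lambda_{24}$ and we construct the kernel distributions.

$$Q_{21}(t) = \int_0^t \frac{d}{d\tau} G_2(\tau)\, e^{-\Lambda_2 \tau}\, d\tau, \qquad (1)$$

$$Q_{23}(t) = \int_0^t \lambda_{23} e^{-\Lambda_2 \tau}\,(1 - G_2(\tau))\, d\tau = \frac{\lambda_{23}}{\Lambda_2}\left[1 - e^{-\Lambda_2 t}\right] - \int_0^t \lambda_{23} e^{-\Lambda_2 \tau} G_2(\tau)\, d\tau,$$

$$Q_{24}(t) = \frac{\lambda_{24}}{\Lambda_2}\left[1 - e^{-\Lambda_2 t}\right] - \int_0^t \lambda_{24} e^{-\Lambda_2 \tau} G_2(\tau)\, d\tau. \qquad (2)$$

The LST's for the state 2 kernel distributions are a little more difficult. We can write the derivatives of the $Q_{2j}(t)$ as

$$\frac{d}{dt} Q_{21}(t) = \frac{d}{dt} \int_0^t \frac{d}{d\tau} G_2(\tau)\, e^{-\Lambda_2 \tau}\, d\tau,$$

$$\frac{d}{dt} Q_{23}(t) = \lambda_{23} e^{-\Lambda_2 t} - \frac{d}{dt} \int_0^t \lambda_{23} e^{-\Lambda_2 \tau} G_2(\tau)\, d\tau.$$

If we assume that a) $Q_{2j}(0) = 0$ and b) that the integrands in the above expressions are bounded and measurable in each case on $[0, \infty)$, then using Lemma 8 from Royden [22] we can use the differentiation operator to remove the integral. Thus

$$\tilde{Q}_{21}(s) = \int_0^\infty e^{-\Lambda_2 t} e^{-st}\, \frac{d}{dt} G_2(t)\, dt = \tilde{G}_2(\Lambda_2 + s),$$

$$\tilde{Q}_{23}(s) = \frac{\lambda_{23}}{s + \Lambda_2} - \lambda_{23}\, G_2^*(s + \Lambda_2),$$

where $G_2^*(s)$ denotes the Laplace transform (LT) of $G_2(t)$, that is, $G_2^*(s) = \int_0^\infty e^{-st} G_2(t)\, dt$. However, we can use the fact that $G_2^*(s) = \tilde{G}_2(s)/s$ and we can write

$$\tilde{Q}_{23}(s) = \frac{\lambda_{23}}{s + \Lambda_2} - \frac{\lambda_{23}}{s + \Lambda_2}\, \tilde{G}_2(s + \Lambda_2),$$

which becomes

$$\tilde{Q}_{23}(s) = \frac{\lambda_{23}}{s + \Lambda_2}\left[1 - \tilde{G}_2(s + \Lambda_2)\right], \qquad \tilde{Q}_{24}(s) = \frac{\lambda_{24}}{s + \Lambda_2}\left[1 - \tilde{G}_2(s + \Lambda_2)\right].$$

Having justified all the steps for the kernel distributions for state 2, we can go on to write them down for state 3, since they are analogous. Specifically

$$Q_{31}(t) = \int_0^t \frac{d}{d\tau} G_3(\tau)\, e^{-\lambda_{34} \tau}\, d\tau, \qquad Q_{34}(t) = \int_0^t \lambda_{34} e^{-\lambda_{34} \tau}\,(1 - G_3(\tau))\, d\tau,$$

$$\tilde{Q}_{31}(s) = \tilde{G}_3(\lambda_{34} + s), \qquad \tilde{Q}_{34}(s) = \frac{\lambda_{34}}{s + \lambda_{34}}\left[1 - \tilde{G}_3(s + \lambda_{34})\right].$$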
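As a quick check on the state 2 transforms, the following sketch evaluates them for a deterministic repair time, for which the repair LST is exp(-s d). The helper names `det_lst` and `state2_kernel_lst` are hypothetical, and the sketch assumes that the total failure rate out of state 2 is positive so that no division by zero occurs at s = 0.

```python
import math

def det_lst(d):
    """LST of a deterministic repair time d: G~(s) = exp(-s * d)."""
    return lambda s: math.exp(-s * d)

def state2_kernel_lst(s, lam23, lam24, g2):
    """Return (Q21~(s), Q23~(s), Q24~(s)) for repair LST g2 and failure rates lam23, lam24."""
    big_lambda2 = lam23 + lam24          # total failure rate while repairing in state 2
    g = g2(s + big_lambda2)              # repair finishes before any competing failure
    q21 = g
    q23 = lam23 / (s + big_lambda2) * (1.0 - g)
    q24 = lam24 / (s + big_lambda2) * (1.0 - g)
    return q21, q23, q24
```

At s = 0 the three values are the one-step transition probabilities out of state 2, so they should sum to one; this is a convenient sanity test when coding other repair distributions.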

Finally, since there are no competing arcs leaving state 4 of the model, we have $Q_{41}(t) = G_4(t)$ and $\tilde{Q}_{41}(s) = \tilde{G}_4(s)$. The reader should note that in our examples, we assume that the $G_i(t)$ distributions are deterministic. There is some evidence to support this assumption (see [9]).

C. Transform Equations for Completion Time

Let $T_i(x)$ be the random variable denoting the completion time of a job with a fixed work requirement $x$ given that it started execution when the server process just entered state $i$. Let $F_i(t, x) = P\{T_i(x) \le t\}$ and let $\tilde{F}_i(s, x) = \int_0^\infty e^{-st}\, dF_i(t, x)$. Then, using the procedure developed in [12], we obtain $\tilde{F}_1(s, x)$ in terms of the kernel distributions derived earlier:

$$\tilde{F}_1(s, x) = \frac{\tilde{\kappa}_{1,1,1}(s, x)}{1 - \tilde{\kappa}_{1,1,3}(s, x)\left[\tilde{Q}_{34}(s)\tilde{Q}_{41}(s) + \tilde{Q}_{31}(s)\right] - \tilde{Q}_{41}(s)\,\tilde{\kappa}_{1,1,4}(s, x)}$$

where

$$\tilde{\kappa}_{1,1,1}(s, x) = e^{-\tau(s)x}, \quad \tilde{\kappa}_{1,1,3}(s, x) = \left(\lambda_{13} + \lambda_{12}\tilde{Q}_{23}(s)\right)\frac{1 - e^{-\tau(s)x}}{\tau(s)}, \quad \tilde{\kappa}_{1,1,4}(s, x) = \left(\lambda_{14} + \lambda_{12}\tilde{Q}_{24}(s)\right)\frac{1 - e^{-\tau(s)x}}{\tau(s)},$$

and with $\tau(s) = s + \Lambda_1 - \lambda_{12}\tilde{Q}_{21}(s)$. In order to obtain numerical results for our models, we use the transform $\tilde{F}_1(s, x)$. That is, we invert the transform that is conditional on the server model starting in state 1, the up state.

Let

$$\alpha(s) = \left(\lambda_{13} + \lambda_{12}\tilde{Q}_{23}(s)\right)\left(\tilde{Q}_{31}(s) + \tilde{Q}_{34}(s)\tilde{Q}_{41}(s)\right) + \left(\lambda_{14} + \lambda_{12}\tilde{Q}_{24}(s)\right)\tilde{Q}_{41}(s).$$

By substituting we obtain the following expression in terms of the kernel distributions of the semi-Markov process:

$$\tilde{F}_1(s, x) = \frac{\tau(s)\, e^{-\tau(s)x}}{\tau(s) - \alpha(s)\left[1 - e^{-\tau(s)x}\right]}. \qquad (3)$$

III. SPECIALIZED COMPLETION TIME MODELS

In this section, we propose several different versions of a CPU model originally proposed by Castillo and Siewiorek in [1]. The method of analysis we use provides us with a way to improve upon the Castillo-Siewiorek analysis. The CPU model that they use is a three-state model with two prs states and one pri state. It is shown in Fig. 2.

Fig. 2. Semi-Markov process for Castillo-Siewiorek.

Castillo and Siewiorek develop the probability density function (pdf) of the time to completion of a program conditioned upon a random variable they denote as $T_{min}$, which is defined as the time to execute a program on a CPU dedicated to the program in the absence of CPU errors. They proceed in their analysis by further conditioning on the time from the last time the program is restarted (i.e., after the last visit to the pri state) until the program completes execution. They obtain several of the conditional distributions by analytical methods and one by direct measurement. An important point to note is that they are able to account for the effect of workload in their pdf through the use of the empirical distribution.

Our analysis can account for all the phenomena that are included in [1] but allows extensions and generalizations of the model that are more difficult to obtain using their method of analysis. Now, let us discuss these extensions in more detail.

In the construction of these models by our method, the work requirement may be generally distributed as long as we can obtain an analytically or numerically tractable expression by unconditioning on the work requirement. In principle, the same unconditioning technique may be applied to the result of the method of [1].

In the models derived here, the repair time distributions can be general. In [1], repair times are deterministic and the introduction of generally distributed repair times would cause the modeler to have to compute by hand some convolution integrals.

In principle, time-to-failure distributions in our method have no restrictions although we specialize to exponentially distributed times to failure. In [1] the exponential assumption is used several times in the arguments deriving the forms of the subsidiary conditional distributions. These arguments would have to be reworked if different distributional assumptions are made.

In [1], the heart of the method for determining the completion time is the characterization of the random variable $T_{use} \mid T_{min}$. This is the execution time of a program given the work demanded of an unloaded failure-free system ($T_{min}$) under anticipated workload conditions (also assuming failure-free operation). They proposed to determine the distribution of this random variable empirically. In our method, the effect of various system workloads upon completion time is modeled through the use of various reward rates in states with nonzero rewards. In the models derived in this paper, we have used reward rates of either 0 or 1, but the ability to use more general rewards is inherent in the method.

Finally, as we will see in the rest of this section, our method handles the three different types of interruptions and the resulting lost work using the same framework. The method described in [1] is specific to the mixed prs and pri interruptions. In fact, one often comes across various ad-hoc approaches to the completion time problem (see, for example, [5]). In summary, we believe that our method is very general and provides us with the analytical machinery to model many features of computer systems.
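The closed form (3) of Section II-C for the general four-state model is straightforward to evaluate numerically. The sketch below is one possible rendering, not the authors' code: the function name `four_state_lst`, the dictionary of rates, and the deterministic repair helper `det_lst` are hypothetical, and the state 3 kernel is written by analogy with state 2, with a single competing rate out of state 3 to state 4.

```python
import math

def det_lst(d):
    """LST of a deterministic repair time d."""
    return lambda s: math.exp(-s * d)

def four_state_lst(s, x, lam, g2, g3, g4):
    """Evaluate F1~(s, x) from (3); lam maps '12','13','14','23','24','34' to rates."""
    big1 = lam["12"] + lam["13"] + lam["14"]
    big2 = lam["23"] + lam["24"]
    q21 = g2(s + big2)
    q23 = lam["23"] / (s + big2) * (1.0 - q21)
    q24 = lam["24"] / (s + big2) * (1.0 - q21)
    q31 = g3(s + lam["34"])                              # assumed form for state 3
    q34 = lam["34"] / (s + lam["34"]) * (1.0 - q31)
    q41 = g4(s)
    tau = s + big1 - lam["12"] * q21
    alpha = ((lam["13"] + lam["12"] * q23) * (q31 + q34 * q41)
             + (lam["14"] + lam["12"] * q24) * q41)
    decay = math.exp(-tau * x)
    return tau * decay / (tau - alpha * (1.0 - decay))
```

Two properties are useful as checks: with all repairs eventually completing, the transform equals 1 at s = 0 (the distribution is proper), and with all failure rates set to zero it reduces to exp(-s x), the transform of a job that simply runs for x time units. The sketch requires the denominators s + big2 and s + lam['34'] to be nonzero.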

A. Three-State CPU Model

For this model, we assume that failures occur according to a randomly split time-homogeneous Poisson process of rate $\lambda$. With probability $p_{nf}$ the failure is nonfatal and with probability $p_f = (1 - p_{nf})$ the failure is fatal, causing a crash. The difference between failure types is that nonfatal failures do not cause a loss of work, whereas fatal errors do. The repair distributions may be of general form. Thus, we have a three-state model (shown in Fig. 2) where state 1 is a prs state representing the system up state in which useful work is done; state 2 is a prs state in which the system is recovering from a nonfatal error; state 3 is a pri state in which the system is recovering from a fatal error. We can now assign the following parameter values to the model we have already developed: $\Lambda_1 = \lambda$, $\lambda_{12} = p_{nf}\lambda$, $\lambda_{13} = p_f\lambda$, $\lambda_{23} = p_f\lambda$. These assignments give us the kernel distributions

$$Q_{12}(t) = p_{nf}\,(1 - e^{-\lambda t}), \qquad Q_{13}(t) = p_f\,(1 - e^{-\lambda t}),$$

$$Q_{21}(t) = \int_0^t \frac{d}{d\tau} G_2(\tau)\, e^{-\lambda p_f \tau}\, d\tau, \qquad Q_{23}(t) = (1 - e^{-\lambda p_f t}) - \int_0^t \lambda p_f e^{-\lambda p_f \tau} G_2(\tau)\, d\tau,$$

$$Q_{31}(t) = G_3(t).$$

So substituting these quantities back into (3) for the transform of the completion time given that we start in the up state, we have

$$\tilde{F}_1(s, x) = \frac{\tau(s)\, e^{-\tau(s)x}}{\tau(s) - \lambda\left[p_f + p_{nf}\tilde{Q}_{23}(s)\right]\tilde{G}_3(s)\left[1 - e^{-\tau(s)x}\right]} \qquad (4)$$

where the $\tau(s)$ of the transform equations is

$$\tau(s) = s + \lambda\left(1 - p_{nf}\tilde{Q}_{21}(s)\right).$$

Note that (4) is applicable even when state 3 is a prd state. This is true because pri and prd interruptions produce identical results in the case that the work requirement distribution is deterministic as it is here.

B. PRD Models

The model in this section corresponds to the system in Fig. 2 with state 3 as a prd state instead of pri. In [12] two forms of the LST of the completion time distribution are derived. First, one form is with the assumption of a general work requirement distribution. This LST is given by

$$\tilde{F}_1(s) = \frac{\tau(s)\,\tilde{B}(\tau(s))}{\tau(s) - \lambda\left[p_f + p_{nf}\tilde{Q}_{23}(s)\right]\tilde{G}_3(s)\left[1 - \tilde{B}(\tau(s))\right]} \qquad (5)$$

where $\tilde{B}(\cdot)$ is the LST of the work requirement random variable. Specializing to a fixed work requirement yields exactly the same equation as (4), as it should, since for deterministic work requirements prd and pri are identical. If we allow the work requirement to have the exponential distribution with parameter $\phi$, we obtain the following from (5):

$$\tilde{F}_1(s) = \frac{(s + \lambda p_f)\,\phi}{(s + \lambda p_f)\,\phi + \left(s + \lambda\left(1 - p_{nf}\tilde{G}_2(s + \lambda p_f)\right)\right)\left(s + \lambda p_f\left(1 - \tilde{G}_3(s)\right)\right)}. \qquad (6)$$

C. PRS Models

Once again, we take the structure-state process depicted in Fig. 2. If we assume this time that all the states in the model are prs states, then by the same procedure that we have used for building the other models, we obtain

$$\tilde{F}_1(s) = \tilde{B}(\tau(s)), \qquad (7)$$

where

$$\tau(s) = \lambda\left(1 - p_{nf}\left[\tilde{G}_2(\lambda p_f + s) + \frac{\lambda p_f}{\lambda p_f + s}\left[1 - \tilde{G}_2(\lambda p_f + s)\right]\tilde{G}_3(s)\right] - p_f\tilde{G}_3(s)\right) + s$$

and as before $\tilde{B}(\cdot)$ is the LST of the work requirement. We note that when we specialize this LST to an exponentially distributed work requirement, we obtain exactly the same equation as (6). The equivalence between prs and prd is because of the memoryless property of the exponential distribution. Thus, when a task with an exponentially distributed work requirement is interrupted, the remaining work to be completed has the same exponential distribution as the original. However, this is the same as resampling the original distribution after an interruption, which is exactly the repeat behavior of the prd model. One should note, however, that under the assumption of an exponential work requirement distribution, (6) for the prd case is not the same as (4) for the pri case. One can demonstrate this easily for oneself. See [7] and [16] for discussions of this point. This discussion suggests that the prd model with exponentially distributed work requirement is not of great practical interest. One advantage of our method is that we do not require either assumption.

IV. DEFECTIVE DISTRIBUTIONS

We now derive expressions for the Laplace-Stieltjes transforms of the defective completion time distributions in the case where we make one of the states of the server model absorbing. By applying the final value theorem of the LST, we can find the limiting value of the defective distribution, which will give us the omission failure probability. This is simply the probability that the task never completes due to the failure of the processor. We show that we can obtain the entire defective distribution and the mass at infinity.

A. Four-State Server Model

In this section, we find a sequence of defective distribution transform expressions by setting the repair distributions from various failure states to 0. From these transform expressions, we can obtain expressions for the omission failure probability, that is $P\{T(x) = \infty\}$, recalling that $T(x)$ is the time required for a task with work requirement $x$ to complete. For the first model we set the kernel distribution from state 4 equal to 0 in (3). We denote the defective completion time distribution as $F_{1,def}(\cdot)$. Thus we obtain

$$\tilde{F}_{1,def}(s, x) = \frac{\tau(s)\, e^{-\tau(s)x}}{\tau(s) - \left(\lambda_{13} + \lambda_{12}\tilde{Q}_{23}(s)\right)\tilde{Q}_{31}(s)\left[1 - e^{-\tau(s)x}\right]} \qquad (8)$$

where $\tau(s) = s + \lambda_{13} + \lambda_{14} + \lambda_{12}\left[1 - \tilde{G}_2(\lambda_{23} + \lambda_{24} + s)\right]$. As with the proper distributions developed in this paper, this expression must be unconditioned with respect to the work requirement distribution. Now for a deterministic work requirement, we can take $\lim_{s \to 0}$ to find the omission failure probability:

$$p_{of}(x) = 1 - \frac{\tau\, e^{-\tau x}}{\tau - \left(\lambda_{13} + \lambda_{12}\tilde{Q}_{23}(0)\right)\tilde{Q}_{31}(0)\left[1 - e^{-\tau x}\right]}$$

where $\tau = \lambda_{13} + \lambda_{14} + \lambda_{12}\left[1 - \tilde{G}_2(\lambda_{23} + \lambda_{24})\right]$. This function can be evaluated for fixed $x$ or it may be unconditioned with respect to a work requirement distribution.

Next, by setting $\tilde{Q}_{31}(s)$ to 0, we make state 3 an absorbing state also, while there is still leakage from states 1 and 2 into states 3 and 4. Thus we have

$$\tilde{F}_{1,def}(s, x) = e^{-\tau(s)x}.$$

By using the final value theorem again, we obtain

$$p_{of}(x) = 1 - e^{-\tau x}, \qquad (11)$$

where again $\tau = \lambda_{13} + \lambda_{14} + \lambda_{12}\left[1 - \tilde{G}_2(\lambda_{23} + \lambda_{24})\right]$.

Finally, by making state 2 an absorbing state also, we have

$$\tilde{F}_{1,def}(s, x) = e^{-(s + \lambda_{12} + \lambda_{13} + \lambda_{14})x}.$$

The omission failure probability is then equal to

$$p_{of}(x) = 1 - e^{-(\lambda_{12} + \lambda_{13} + \lambda_{14})x}.$$

Of course, each of these expressions may be unconditioned by a variety of work requirement distributions.

B. Three-State Server Model

In the three-state server models, we obtain a much simpler expression. Now state 3 becomes an absorbing state and we set the kernel distribution exiting state 3 to 0. Thus we obtain

$$\tilde{F}_{1,def}(s, x) = e^{-\tau(s)x},$$

where $\tau(s) = s + \lambda_{13} + \lambda_{12}\left(1 - \tilde{G}_2(\lambda_{23} + s)\right)$. From this general expression, we can obtain a simpler one by letting $\lambda_{12} = \lambda p_{nf}$ and $\lambda_{13} = \lambda_{23} = \lambda p_f$. Then we obtain $\tau(s) = s + \lambda\left(1 - p_{nf}\tilde{G}_2(\lambda p_f + s)\right)$. When we uncondition with respect to the work requirement distribution $B(\cdot)$ we obtain

$$\tilde{F}_{1,def}(s) = \tilde{B}\left(s + \lambda\left(1 - p_{nf}\tilde{G}_2(\lambda p_f + s)\right)\right).$$

We can use our numerical techniques to invert this expression and recover the distribution. Taking $\lim_{s \to 0}$ we obtain

$$p_{of} = 1 - \tilde{B}\left(\lambda\left(1 - p_{nf}\tilde{G}_2(\lambda p_f)\right)\right).$$

We note that given a set of model parameters, we can evaluate the omission failure probability directly from this expression.

V. NUMERICAL PROCEDURES

We have devoted considerable space to developing some structure-state models of CPU behavior. When $X$, the task execution time requirement, is a random variable, we must uncondition the transforms by integrating

$$\tilde{F}(s) = \int_0^\infty \tilde{F}(s, x)\, dB(x)$$

where $B(x)$ is the distribution of the task execution requirement $X$. Having obtained either a conditional or unconditional transform of the completion time distribution, we proceed to obtain the distributions themselves by means of numerical inversion. In the case where we are able to obtain an expression for the unconditional transform $\tilde{F}(s)$ or in the case where we are content with inverting the conditional transform $\tilde{F}(s, x)$, then we use the one-dimensional transform inversion algorithm by Jagerman [11]. It is often the case, however, that we are not able to obtain a closed-form expression for $\tilde{F}(s)$. We then have to perform the unconditioning by numerical integration. We describe the resulting numerical procedure here.

In an earlier paper, Jagerman [10] constructed a sequence $\{f_n(t)\}$ that approximates the desired function $f(t)$. The elements of the approximation sequence have the following properties:

1) If $f(t)$ is monotonic, then $f_n(t)$ is monotonic.
2) If $f(t)$ is convex, then $f_n(t)$ is convex.
3) If $f(t)$ is bounded, then $f_n(t)$ is bounded.
4) $f(0+) = f_n(0+)$ and $f(\infty) = f_n(\infty)$.

In [11], Jagerman gives a method for approximating any element of the approximation sequence. Specifically, from [11, (21)] we have

where $f^*$ is the Laplace transform (LT) of the function of interest, and $Sf$ is a functional that approximates the $n$th element of the approximation sequence for $f(t)$. Here, $n$, $T$, and $q$ are parameters of the inversion. Note that although we have LST's to invert, the relationship noted earlier, viz. $f^*(s) = \tilde{f}(s)/s$, gives us a way to compute the inversion of the LST of a function by inverting the LT. The inversions that we compute in the following sections yield CDF's directly by dividing the expression for the LST by $s$ and inverting the resulting LT.

In effect, (14) is an approximation to an approximation. However, we have found this approximation to behave very well. By "behave well" we mean that for reasonable parameter choices, the approximate distribution functions that we obtain are nonnegative, monotonic, and bounded. Even when inverting to obtain distribution functions that have jump discontinuities, we note that the approximation is well-behaved on either side of the discontinuity. With some inversion methods, on either side of a discontinuity there will be an oscillation in the numerical distribution function. We noticed no such phenomenon with Jagerman's method.
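The step of dividing the LST by s and inverting the resulting LT can be illustrated with any one-dimensional inversion routine. The sketch below deliberately swaps in the Gaver-Stehfest algorithm, which is not Jagerman's method and lacks the monotonicity guarantees described above; it is used only because it fits in a few lines. The names `stehfest_coefficients` and `invert_lst_to_cdf` are hypothetical, and the choice n = 12 is an assumption suited to double precision.

```python
import math

def stehfest_coefficients(n):
    """Gaver-Stehfest weights V_1..V_n for even n."""
    half = n // 2
    coeffs = []
    for k in range(1, n + 1):
        total = 0.0
        for j in range((k + 1) // 2, min(k, half) + 1):
            total += (j ** half * math.factorial(2 * j)
                      / (math.factorial(half - j) * math.factorial(j)
                         * math.factorial(j - 1) * math.factorial(k - j)
                         * math.factorial(2 * j - k)))
        coeffs.append((-1) ** (k + half) * total)
    return coeffs

def invert_lst_to_cdf(lst, t, n=12):
    """Approximate the CDF at t > 0 from its LST by inverting lst(s) / s."""
    scale = math.log(2.0) / t
    weights = stehfest_coefficients(n)
    # lst(s) / s is the Laplace transform of the CDF itself.
    return scale * sum(w * lst(k * scale) / (k * scale)
                       for k, w in enumerate(weights, start=1))
```

For example, an exponential work requirement with rate 1 on a failure-free server has LST 1/(1 + s), and `invert_lst_to_cdf(lambda s: 1.0 / (1.0 + s), 1.0)` should be close to 1 - exp(-1). Gaver-Stehfest is known to behave poorly near jump discontinuities, which is one reason a method with the properties listed above is preferable for the deterministic cases in this paper.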

IEEE TRANSACTIONS ON COMPUTERS, VOL. 42, NO. 10, OCTOBER 1993

1190

In [ 111, Jagerman gives bounds for the error 1 f n ( t )- f ( t ) 1. The pointwise and uniform error bounds depend on the second and first derivatives, respectively. Specifically, from [ 111 the pointwise error bound is

Note that for this procedure we can use essentially the same approach for phase-type work requirement distributions. In this way, we can analyze systems with work requirements of very general forms.

VI. CONTROL OF NUMERICAL ERROR and the uniform bound is

as n 4 m. For the examples that we where cy have run, these derivatives are quite small in the tails of the distributions and pointwise, fn(t) will be a very good appJoximation there. We discuss 1 Sf - fn(t) 1 in Section VI. F ( s ,x) denotes the LST of the completion time distribution conditioned on the work requirement being equal to z. For a fixed task requirement 5,we can apply Jagerman’s technique directly and obtain a numerical inversion. Similarly, if we had a closed-form expression for F ( s ) , the unconditional transform, we could use the numerical inversion procedure directly. With mixed pri interruptions, it is often not possible to obtain _a closed-form expression for the unconditional transform F ( s ) . We use a two-phase procedure to solve these problems numerically. There must be an inversion phase and an unconditioning phase in order to recover the unconditional distribution of the completion time. We integrate the function numerically using an adaptive Simpson’s rule as described by Lyness [15]. The algorithm and code is known as SQUANK, which stands for “Simpson Quadrature Used Adaptively, Noise Killed.” This code attempts to meet a user-specified absolute error tolerance by subdividing the interval of integration into smaller subintervals. When the error tolerance is met by a particular subinterval, the program stops subdividing and moves on to the next subinterval. If the user-requested error tolerance cannot be met within the built-in limits, the program returns the absolute error achieved for the interval of integration. The adaptive Simpson’s rule has the advantage that the accuracy of the result of the integration is known. The error estimate can be used as input to error bounds for numerical procedures that depend upon the integration results. Note that this unconditioning step also transforms the integral over an infinite interval into an integral over a finite interval when the work requirement distributions have an exponential form. 
This is one method recommended by Davis and Rabinowitz [3] for handling integrals over infinite intervals. First we use Jagerman's approximation to obtain $F(t, x)$ by inverting $\tilde{F}(s, x)$ for a fixed $t$. Then, taking the case where $B(x) = 1 - e^{-\phi x}$, we make the substitution $u = 1 - e^{-\phi x}$ and, changing the limits of integration accordingly, we have

$$F(t) = \int_0^{\infty} F(t, x)\, dB(x) = \int_0^{1} F\!\left(t, -\frac{\ln(1-u)}{\phi}\right) du. \qquad (15)$$
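The unconditioning step can be illustrated with a small sketch. The code below is not SQUANK; it is a plain recursive adaptive Simpson's rule that also returns an error estimate, applied to the substituted integral over $u$. The conditional distribution `toy_conditional_cdf` and the constants `PHI` and `MU` are hypothetical stand-ins chosen only because the exact answer is known in closed form; in the procedure described here, the conditional distribution would come from the numerical inversion instead.

```python
import math

PHI = 1.0   # assumed rate of the exponential work requirement B(x) = 1 - exp(-PHI * x)
MU = 4.0    # assumed rate of a toy exponential delay added to the work requirement


def adaptive_simpson(f, a, b, tol=1e-9, max_depth=40):
    """Return (integral estimate, accumulated error estimate) of f over [a, b]."""

    def simpson(fa, fm, fb, width):
        return width * (fa + 4.0 * fm + fb) / 6.0

    def recurse(a, fa, m, fm, b, fb, whole, tol, depth):
        lm, rm = 0.5 * (a + m), 0.5 * (m + b)
        flm, frm = f(lm), f(rm)
        left = simpson(fa, flm, fm, m - a)
        right = simpson(fm, frm, fb, b - m)
        err = (left + right - whole) / 15.0  # Richardson-style error estimate
        if depth >= max_depth or abs(err) <= tol:
            return left + right + err, abs(err)
        # Tolerance not met: split the subinterval and share the tolerance.
        l_val, l_err = recurse(a, fa, lm, flm, m, fm, left, 0.5 * tol, depth + 1)
        r_val, r_err = recurse(m, fm, rm, frm, b, fb, right, 0.5 * tol, depth + 1)
        return l_val + r_val, l_err + r_err

    if a == b:
        return 0.0, 0.0
    m = 0.5 * (a + b)
    fa, fm, fb = f(a), f(m), f(b)
    return recurse(a, fa, m, fm, b, fb, simpson(fa, fm, fb, b - a), tol, 0)


def toy_conditional_cdf(t, x):
    """Toy F(t, x): completion time is x plus an exponential delay with rate MU."""
    return 1.0 - math.exp(-MU * (t - x)) if t >= x else 0.0


def unconditional_cdf(t, phi=PHI):
    """Integrate F(t, -ln(1 - u) / phi) over u; the toy integrand is 0 once x > t."""
    upper = 1.0 - math.exp(-phi * t)
    return adaptive_simpson(
        lambda u: toy_conditional_cdf(t, -math.log(1.0 - u) / phi), 0.0, upper)


def exact_unconditional_cdf(t, phi=PHI, mu=MU):
    """Closed form for the sum of two independent exponentials (phi != mu)."""
    return 1.0 - (mu * math.exp(-phi * t) - phi * math.exp(-mu * t)) / (mu - phi)
```

Calling `unconditional_cdf(1.0)` returns both the value and the error estimate, which mirrors the point that an adaptive rule reports the accuracy it achieved.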

In [11], Jagerman gives inequalities for both the round-off error in computing $f_n(t)$ and the truncation error resulting from using a finite series to compute $f_n(t)$. Let $\epsilon_t$ be the total round-off error of the series and $T_L$ be the truncation error of the approximation. Further, let $\epsilon$ be the error in computing each term of the series and let $A$ be the upper bound of $|f_n(t)|$. We know this upper bound to be 1 since the target function is a distribution function. Jagerman's inequalities are

and

Here, $T$, $q$, and $n$ are the parameters of the inversion routine. The parameter $n$ should be chosen so that $f_n(t)$ is an acceptable approximation of $f(t)$. Once $n$ is chosen, $T$ can be chosen to give a small round-off error for the series approximation. Then $q$ can be chosen to control the truncation error. The parameter $q$ is key to the performance of the inversion since it controls the number of terms in the series approximation.

We proceeded as follows. First the program found the conditional distribution through numerical inversion, and then we numerically integrated with respect to the work requirement distribution to find the unconditional distribution. In this case, our objective was to keep the inversion accurate, since this procedure integrates the inverted function $F(t, x)$ with respect to $x$. Both the accuracy of the integration and the number of function evaluations depend upon the accuracy of the inversion.

Recently, Obi has published [20] a set of procedures for determining the optimal parameters of Jagerman's inversion method. We used Obi's method to determine the optimal $T$, given the other inversion parameters. We used $n = 50$ (the 50th element of the approximation sequence) and took the relative precision to be since that was an overestimate of the effect of accumulated error in evaluating the transform function. We chose $q = 251$. We requested an absolute error of 2 x 10-’ of the adaptive integration code, and almost without exception, SQUANK [15] was able to meet that error bound. The reader should not forget, however, that transform inversion is an ill-conditioned problem.

Simpson's rule is valid for functions that are continuous and four times differentiable. For a fixed value of $t$, however, the integrand of (15) is not continuous, since there will be a jump at the point at which $t = -\ln(1 - u)/\phi$. However, we note that for all work requirements $x > t$, the probability of accumulating enough work to complete the job is exactly 0.
Consequently, we need only integrate over the interval $[0, u]$, which makes $x \le t$. By simple manipulation of the inequality $-\ln(1 - u)/\phi \le t$, we find that the values of $u$ over which we would like to integrate are just those for which $u \le 1 - e^{-\phi t}$. We used this fact when giving the integration routine the limits of integration.
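To make the inversion phase concrete, the following sketch recovers a distribution function from its LST. It deliberately does not implement Jagerman's method or Obi's parameter selection; it swaps in the Gaver-Stehfest algorithm, a different and widely documented inversion technique, because its weights have a compact closed form. The function names and the choice of 12 terms are assumptions made for illustration. The sketch also shows the ill-conditioning mentioned above: the weights alternate in sign and grow quickly, so adding terms in double precision eventually hurts rather than helps.

```python
import math
from fractions import Fraction


def stehfest_weights(n_terms):
    """Gaver-Stehfest weights V_1..V_N (N even), built exactly and then rounded."""
    if n_terms <= 0 or n_terms % 2:
        raise ValueError("n_terms must be a positive even integer")
    half = n_terms // 2
    weights = []
    for k in range(1, n_terms + 1):
        total = Fraction(0)
        for j in range((k + 1) // 2, min(k, half) + 1):
            numerator = Fraction(j) ** half * math.factorial(2 * j)
            denominator = (math.factorial(half - j) * math.factorial(j)
                           * math.factorial(j - 1) * math.factorial(k - j)
                           * math.factorial(2 * j - k))
            total += numerator / denominator
        weights.append(float((-1) ** (k + half) * total))
    return weights


def invert_laplace(transform, t, n_terms=12):
    """Approximate f(t) from its Laplace transform evaluated on the real axis."""
    if t <= 0:
        raise ValueError("t must be positive")
    scale = math.log(2.0) / t
    weights = stehfest_weights(n_terms)
    return scale * sum(w * transform(k * scale)
                       for k, w in enumerate(weights, start=1))


def cdf_from_lst(lst, t, n_terms=12):
    """The Laplace transform of a distribution function F(t) is its LST divided by s."""
    return invert_laplace(lambda s: lst(s) / s, t, n_terms)


def exponential_lst(s, phi=1.0):
    """LST of an exponential distribution with rate phi (a simple test case)."""
    return phi / (s + phi)
```

For example, `cdf_from_lst(exponential_lst, 1.0)` should be close to $1 - e^{-1}$. A conditional transform for a fixed work requirement would be passed in the same way.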

VII. NUMERICAL RESULTS

In this section we present the results from the models derived in Section III. We compare the distributions that result from varying the structure of the models in different ways. Specifically, we show the ordering of the models when we vary the distribution of the work requirement, the mean repair time, the type of interruption, and the structure of the structure-state process itself.

There are two groups of models that we can solve using our techniques. First, we can solve models for which the program work requirement is a fixed constant. For these models we use the transform expressions that we have already developed and simply use the Laplace transform inversion method outlined above. This kind of model is useful for representing real-time systems with hard deadlines, for example. If the deadline is not a hard deadline, we can compute the expected penalty of missing the deadline by using the distribution function that we obtain.

Second, we can solve models in which the work requirement is not fixed, but rather is a random variable. An example of the use of this second type of model is in representing jobs run on a modern multitasking operating system. The stochastic nature of the work requirement may be due to the variation of input parameters or due to the job mix currently executing on the machine and the resulting patterns of contention for system resources. We solve such a model by using both numerical inversion and numerical integration.

A. Results for the Specialized Model

Now we turn our attention to the model depicted in Fig. 2. We consider four versions of this smaller model. The first model, which we will call the pure prs model, is similar to the model in Fig. 2, but all the states are prs states. The transform equations for this model are given in Section III-C. The second model, which we call the mixed prs-prd model, is again similar to the model in Fig. 2 but with state 3 replaced by a prd state. The transform equations for this model are found in Section III-B. The third model is the model in Fig. 2. The conditional transform equations are given in Section III-A.

We obtained the results using both techniques described in the previous section. For the pri model, we do not have a closed-form unconditional transform, and so we used the combined integration-inversion technique. For the pure prs model and the mixed prs-prd model, we used only transform inversion since the unconditional transforms are available in closed form.

B. Model Parameters

Table I gives the parameters for the structure-state process that underlies the completion time model. We obtained these parameters from [9] using their failure and repair model, which is based on measurements taken from an operating computer system. Our failure states correspond to the hardware recovery state and the software recovery state of the empirical model, and the workload states of the empirical model correspond to the up state in our model. For our three-state model, we used for our parameter $\lambda$ the total rate into both the hardware recovery and software recovery states of the empirical model. These rates were determined from the entrance rates into the hardware and software recovery states [9, Table 5(b), p. 482]. The total rate into both recovery states was taken to be $\lambda$. The values of $p_f$ and $p_{nf}$ were derived from splitting $\lambda$ so that we got the appropriate rates into states 2 and 3 of our model. That is, $p_{nf}$ is the fraction of the total rate (into both states) that goes to the hardware recovery state, and $p_f$ is the fraction of the total rate that goes to the software recovery state.

The recovery times in our model were taken to be deterministic. $R_2$ is the recovery time from state 2 back to state 1 and is 0.5 seconds. $R_3$ is the recovery time from state 3 and is taken to be 1 second. In a footnote in [9, p. 479] the authors note that these times are upper bounds on the recovery times for these kinds of errors, but that they are deterministic. One should note that deterministic recovery times are difficult to capture with phase-type distributions but are taken into account quite easily in our model. Also, there are four orders of magnitude between the slowest rates and the fastest rates in the model.

TABLE I
PARAMETER VALUES: THREE-STATE MODELS

Parameter                 Value     Units
$\lambda$                 2.016     failures per hour
prob. fatal failure       0.1964
prob. nonfatal failure    0.8036
$R_2$                     0.5       seconds
$R_3$                     1.0       seconds

TABLE II
PARAMETERS FOR THE FOUR DISTRIBUTIONS

Distribution                Parameters                                  Units
Deterministic               $x = 1.0$                                   hour
2-Stage Hypoexponential     $\phi_1 = 9.0$, $\phi_2 = 1.125$            per hour
Exponential                 $\phi = 1.0$                                per hour
2-Stage Hyperexponential    $\phi_1 = 1.6$, $\phi_2 = 0.2$,             per hour
                            $\alpha_1 = 0.9143$, $\alpha_2 = 0.0857$

The work requirement characterization was chosen to demonstrate the effect of different workloads on the behavior of the model. Accordingly, we show fixed (deterministic) work requirements and exponentially distributed work requirements, and we also use the hypoexponential and hyperexponential distributions. In each case we held the mean work requirement fixed (and equal to 1 hour). The parameters are shown in Table II.
For the hypoexponential and the hyperexponential distributions, we kept the ratio of the parameters of the exponential stages constant. That is, $\phi_1/\phi_2 = 8$ for both distributions. The coefficients of variation for these work requirements varied from 0 to 2. These are shown in Table III.
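As a quick consistency check on the work requirement parameters, the mean and coefficient of variation of the two-stage distributions can be computed from the standard moment formulas. The sketch below uses the rates and branching probabilities listed in Table II; the function names are hypothetical, and the hyperexponential result differs slightly from a mean of exactly 1 because the tabulated probabilities are rounded to four digits.

```python
import math


def hypoexponential_moments(rates):
    """Mean and variance of a sum of independent exponential stages."""
    mean = sum(1.0 / r for r in rates)
    variance = sum(1.0 / (r * r) for r in rates)
    return mean, variance


def hyperexponential_moments(rates, probs):
    """Mean and variance of a probabilistic mixture of exponential stages."""
    mean = sum(p / r for p, r in zip(probs, rates))
    second_moment = sum(2.0 * p / (r * r) for p, r in zip(probs, rates))
    return mean, second_moment - mean * mean


def coefficient_of_variation(mean, variance):
    return math.sqrt(variance) / mean


# Parameter values as listed in Table II (rates per hour).
hypo_mean, hypo_var = hypoexponential_moments([9.0, 1.125])
hyper_mean, hyper_var = hyperexponential_moments([1.6, 0.2], [0.9143, 0.0857])

print(hypo_mean, coefficient_of_variation(hypo_mean, hypo_var))
print(hyper_mean, coefficient_of_variation(hyper_mean, hyper_var))
```

Both distributions come out with a mean of about 1 hour, and the coefficients of variation agree with the values 0.8958 and 2.0 quoted for these two cases.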


TABLE III
COEFFICIENTS OF VARIATION

Distribution                Coefficient of variation
Deterministic               0
2-Stage Hypoexponential     0.8958
Exponential                 1
2-Stage Hyperexponential    2.0

Fig. 3. Distribution of completion time for the mixed prs and pri model. (Panel title: pri interruptions, compare work requirement distributions; horizontal axis: time in hours.)

Fig. 4. Effect of deterministic work requirement on distribution of completion time. (Panel title: deterministic work requirement distribution, compare pri and prs interruptions.)

Comparison: Fixed Model Types, Varying Distributions: Fig. 3 shows the effect of varying the work requirement distribution for the mixed prs and pri model. We kept the same mean work requirement. The coefficient of variation changed, however, as shown in Table III. The completion time distributions cross somewhat after the mean of the work requirement distributions. Looking at the completion time distributions for the exponential and hypoexponential work requirements, we note that the less variable completion times dominate before the crossover point and the more variable completion times dominate in the tails. That is, the distribution functions of the more variable completion times lie above those of the less variable completion times before the crossover point, and they lie below them after the crossover point.

From this phenomenon, the interaction of the work requirement and the completion time should be clear. When the work requirement is highly variable, as indicated by a large coefficient of variation, the probability that a work requirement will be drawn that is very far from the mean is higher. That means that both very short jobs and very long jobs become more probable than with a work requirement that is not as variable. We know that a short job is likely to complete quickly without interruption, and so the completion time can be expected to be stochastically smaller than for a less variable distribution in that case. The fact that long jobs are also more probable with a more highly variable work requirement accounts for longer completion times being more probable in the tails of the more variable distributions.

The behavior of the completion time for the hyperexponential work requirement is due to the higher variability of the distribution. In this case, with high probability the job is short and will complete with fewer interruptions. Note that the slopes of the distributions that it dominates are larger in the tail. We would expect them to cross if we were to compute the numerical distribution further out.

Fig. 5. Effect of exponential work requirement on distribution of completion time. (Panel title: exponential work requirement distribution, compare pri and prs interruptions; horizontal axis: time in hours.)

Comparison: Varying Model Types, Fixed Distributions: Figs. 4-7 comprise the next set of graphs. These graphs show the relationships of the model types to each other under different distributional assumptions. In Fig. 4, we see the ordering of the distributions when the work requirement is deterministic. Note that there are only two curves in this case. This is because, as noted above, for deterministic work requirements the pri and prd models are identical. This can easily be seen by substituting the LST for a deterministic random variable for $\tilde{B}(\cdot)$ in (5). The loss of work due to pri interrupts, as compared with the case of no loss of work (prs), is easily discernible.

In Fig. 5, the work requirement distribution is exponential. In this case again we have only two curves, but this time, because the work requirements are exponentially distributed, the prs and prd models are identical. In Fig. 6, we see the same relationship again, but here the prd and the prs models are very close indeed. However, the completion time of the pure prs model is still stochastically the smallest. One might conjecture that as the coefficient of variation approaches 1, the completion time distribution for the mixed prs-prd model approaches that for the pure prs model.

Fig. 6. Effect of hypoexponential work requirement on distribution of completion time. (Panel title: hypoexponential work requirement distribution, compare pri, prs, and prd interruptions.)

Fig. 7. Effect of hyperexponential work requirement on distribution of completion time. (Panel title: hyperexponential work requirement distribution, compare pri, prs, and prd interruptions; horizontal axis: time in hours.)

Finally, in Fig. 7, we see the relationship of the models under the hyperexponential distribution. The stochastic ordering of the models is not the same. Noticing the parameters for the hyperexponential distribution in Table II, we see that one phase is fast and one is slow. With preemptive resume, the work requirement does not change when an interrupt occurs, and no work is lost. However, with preemptive repeat (different), when an interrupt occurs, a new work requirement is drawn from the distribution. Since the parameters of the distribution favor drawing a job whose work requirement is shorter, the prd distribution is stochastically the smallest.
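The orderings discussed above can be explored informally with a small Monte Carlo sketch of a three-state server. This is not the transform-based method of the paper; it is a simulation under stated assumptions: failures arrive at rate $\lambda$ while the server is up, a nonfatal failure leads to state 2 (always prs), a fatal failure leads to state 3 whose interruption type is selectable, and repair times are deterministic. The mapping of fatal failures to state 3 and all names in the code are assumptions made for illustration; the numerical defaults are the Table I values.

```python
import random

LAMBDA = 2.016        # failures per hour (Table I)
P_FATAL = 0.1964      # probability that a failure is fatal (Table I)
R2 = 0.5 / 3600.0     # recovery time from state 2, converted to hours
R3 = 1.0 / 3600.0     # recovery time from state 3, converted to hours


def completion_time(draw_work, policy, rng, lam=LAMBDA, p_fatal=P_FATAL,
                    r2=R2, r3=R3):
    """Simulate one job; policy for state 3 is 'prs', 'pri', or 'prd'."""
    if policy not in ("prs", "pri", "prd"):
        raise ValueError("unknown policy: " + policy)
    work = draw_work(rng)
    remaining = work
    elapsed = 0.0
    while True:
        up_time = rng.expovariate(lam) if lam > 0 else float("inf")
        if up_time >= remaining:
            return elapsed + remaining       # job finishes before the next failure
        elapsed += up_time
        if rng.random() < p_fatal:           # assumed: fatal failure enters state 3
            elapsed += r3
            if policy == "prs":
                remaining -= up_time         # work is kept
            elif policy == "pri":
                remaining = work             # same requirement, restarted
            else:
                work = draw_work(rng)        # prd: a new requirement is drawn
                remaining = work
        else:                                # nonfatal failure enters state 2 (prs)
            elapsed += r2
            remaining -= up_time


def sample_times(draw_work, policy, n_jobs, seed):
    rng = random.Random(seed)
    return [completion_time(draw_work, policy, rng) for _ in range(n_jobs)]


def deterministic_work(rng):
    return 1.0                               # one hour, as in Table II


def hyperexponential_work(rng):
    rate = 1.6 if rng.random() < 0.9143 else 0.2
    return rng.expovariate(rate)
```

With `deterministic_work`, the `pri` and `prd` samples coincide for a given seed, which matches the observation that the two models are identical for deterministic work requirements. Empirical distribution functions built from `sample_times` can be compared across policies and work requirement samplers.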

VIII. CONCLUSION

In this paper, we have addressed the problem of finding completion time distributions of jobs (programs) on interruptible processors. Specifically, the problem gives rise to overall models that cannot be reduced to Markov chains or even semi-Markov processes because of the loss of work and the repetition of exactly the same amount of work as the work lost, nonexponential work requirement distributions, and nonexponential repair-time distributions.

Since we are able to obtain distribution functions, we can analyze systems that previously we could not when only expected completion time is available. Extensions of the work presented here include allowing checkpointing and concurrency within a job, and contention at the server. For a somewhat restricted version of the CPU failure and repair model, several authors have considered checkpointing [4], [6], [13], while contention at the server was considered in [2] and in [19].

First we showed how to construct semi-Markov models of an interruptible server from specified time-to-failure distributions and repair-time distributions. Then we applied the theory developed in [12] to obtain LST's of the completion time conditioned upon starting state and work requirement. We then described a numerical procedure to obtain both the conditional and the unconditional completion time distributions. Finally, we exhibited numerical results obtained from the model and we discussed their meaning.

The main contributions of this paper include the development of transform equations for the job completion time with nearly realistic models of server failure and repair behavior, and the development of numerical procedures for inverting the resulting transform equations. Specifically, the semi-Markov structure, though still requiring the assumption of independence from history across states, allows general distributions to be used for transitions. We made use of this when we specialized to deterministic repair-time distributions, a behavior that cannot easily be approximated by phase-type distributions. The fact that different repair and workload distributions can be accommodated in our model shows its strength. We can accommodate measured parameters and fitted distributions in our framework, making it applicable to real operating environments. Additionally, obtaining numerical distributions, though subject to numerical error, is an important step.
Having the completion time distribution allows us to ask questions about stochastic dominance, which in turn allows us to make design choices based on the stochastic behavior of the system.

ACKNOWLEDGMENT

The authors would like to thank the referees for their thorough and helpful reviews of our manuscript.

REFERENCES

[1] X. Castillo and D. P. Siewiorek, “A performance-reliability model for computing systems,” in Proc. FTCS-10. Silver Spring, MD: IEEE Computer Society, 1980, pp. 187-192.
[2] P. F. Chimento, “System performance in a failure prone environment,” Ph.D. dissertation, Duke Univ., Durham, NC, 1988.
[3] P. J. Davis and P. Rabinowitz, Methods of Numerical Integration (Computer Science and Applied Mathematics). New York: Academic, 1975.
[4] A. Duda, “The effects of checkpointing on program execution time,” Inform. Processing Lett., vol. 16, pp. 221-229, 1983.
[5] H. Garcia-Molina and J. Kent, “Evaluating response time in a faulty distributed computer,” IEEE Trans. Comput., vol. C-34, no. 2, pp. 101-109, Feb. 1985.
[6] R. Geist, R. Reynolds, and J. Westall, “Checkpoint interval selection in a critical task environment,” IEEE Trans. Rel., vol. 37, no. 4, pp. 395-400, Oct. 1988.
[7] E. Gelenbe and I. Mitrani, Analysis and Synthesis of Computer Systems. New York: Academic, 1980.
[8] M. Y. Hsiao, W. C. Carter, J. W. Thomas, and W. R. Stringfellow, “Reliability, availability, and serviceability of IBM computer systems: A quarter century of progress,” IBM J. Res. Develop., vol. 25, no. 5, pp. 453-465, Sept. 1981.
[9] M. C. Hsueh, R. K. Iyer, and K. S. Trivedi, “Performability modeling based on real data: A case study,” IEEE Trans. Comput., vol. 37, no. 4, pp. 478-484, Apr. 1988.
[10] D. L. Jagerman, “An inversion technique for the Laplace transform with application to approximation,” Bell Syst. Tech. J., vol. 57, no. 3, pp. 669-710, Mar. 1978.
[11] D. L. Jagerman, “An inversion technique for the Laplace transform,” Bell Syst. Tech. J., vol. 61, no. 8, pp. 1995-2002, Oct. 1982.
[12] V. G. Kulkarni, V. F. Nicola, and K. S. Trivedi, “The completion time of a job on multi-mode systems,” Advances Applied Probability, vol. 19, no. 4, pp. 932-954, Dec. 1987.
[13] V. G. Kulkarni, V. F. Nicola, and K. S. Trivedi, “Effects of checkpointing and queueing on program performance,” Stochastic Models, vol. 6, no. 4, pp. 615-648, 1990.
[14] J. C. Laprie, “Dependable computing and fault-tolerance: Concepts and terminology,” in Proc. 15th Int. Symp. Fault-Tolerant Computing, July 1985, pp. 2-7.
[15] J. N. Lyness, “Algorithm 397, SQUANK,” Collected Algorithms ACM, 1969.
[16] R. Marie and K. S. Trivedi, “A note on the effect of preemptive policies on the stability of a priority queue,” Inform. Processing Lett., vol. 24, no. 6, pp. 397-401, Apr. 1987.
[17] J. F. Meyer, “On evaluating the performability of degradable computing systems,” IEEE Trans. Comput., vol. C-29, no. 8, pp. 720-731, Aug. 1980.
[18] J. F. Meyer, “Closed-form solutions of performability,” IEEE Trans. Comput., vol. C-31, no. 7, pp. 648-657, July 1982.
[19] V. F. Nicola, V. G. Kulkarni, and K. S. Trivedi, “A queueing analysis of fault-tolerant computer systems,” IEEE Trans. Software Eng., vol. SE-13, no. 3, pp. 363-375, Mar. 1987.
[20] W. C. Obi, “Error analysis of a Laplace transform inversion procedure,” SIAM J. Numerical Analysis, vol. 27, no. 2, pp. 457-469, Apr. 1990.
[21] R. Pyke, “Markov renewal processes: Definitions and preliminary properties,” Annals Math. Statistics, vol. 32, pp. 1231-1242, 1961.
[22] H. L. Royden, Real Analysis, 2nd ed. New York: Macmillan, 1968.

Philip F. Chimento Jr. (M’81-S’85-M’8&SM’91) received the Ph.D. degree in computer science from Duke University, Durham, NC, in 1988. He works in IBM's Networking Systems line of business, where he has worked since 1978. His research interests are performance and reliability modeling applied especially to issues in high-speed networks. Dr. Chimento is also a member of the ACM and an associate member of ORSA.

Kishor S. Trivedi (M’86-SM’87-F’92) received the Ph.D. degree in computer science from the University of Illinois, Urbana-Champaign. He is a Professor of Electrical Engineering and Computer Science at Duke University, Durham, NC. His research interests are in reliability and performance evaluation. He has lectured and published extensively on these topics. He is a co-designer of the HARP, SAVE, SHARPE, and SPNP modeling packages. Dr. Trivedi is a Fellow of the Institute of Electrical and Electronics Engineers. He was an Editor of the IEEE TRANSACTIONS ON COMPUTERS from 1983 to 1987. He is the author of a text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications (Prentice-Hall).