By Glen B. Alleman Irvine California, Copyright

2 downloads 0 Views 2MB Size Report
Figure 5 – the state transition diagram for a Parallel Redundant system with repair. State { }2 .... Technology (SINTF), Trondheim Norway and the United Kingdom Atomic .... diagnostics determine the appropriate fault masking and fault recovery actions. [Euri84] ...... H t can be defined as the average number of subsystem.
FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE

By Glen B. Alleman Irvine California, Copyright © 1989 Revised and updated Niwot Colorado, Copyright © 2000

ii

FAULT–TOLERANT SYSTEM RELIABILITY IN THE PRESENCE OF IMPERFECT DIAGNOSTIC COVERAGE

Glen B. Alleman The deployment of computer systems for the control of mission critical processes has become the norm in many industrial and commercial markets. The analysis of the reliability of these systems is usually understood in terms of the Mean Time to Failure. The design and analysis of high reliability systems is now a mature science. Starting with fault–tolerant central office switches (ESS4), dual redundant and n– way redundant systems are now available for variety of application domains. The technologies of microprocessor based industrial controls and redundant central processor systems create the opportunity to build fault–tolerant computing systems on a much smaller scale than previously found in the commercial market place. The diagnostic facilities utilized in a modern Fault–Tolerant Computer System attempts to detect fault conditions present in the hardware and embedded software. Coverage is the figure of merit describing the effectiveness of the diagnostic system. This paper examines the effects of less than perfect diagnostics coverage on system reliability. The mathematical background for analyzing the coverage factor of fault–tolerant systems is presented in detail as well as specific examples of practical systems and their relative reliability measures. In a complex system, malfunction and even total nonfunction may not be detected for long periods, if ever. — John Gall

ii

TABLE OF CONTENTS

INTRODUCTION ........................................................................................................ 1 Fault Tolerant System Definitions .......................................................................... 1 Fault–Tolerant System Functions ........................................................................... 2 Overview of This Paper....................................................................................... 3 RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS .............................. 5 Deterministic Models ................................................................................................ 6 Probabilistic Models ............................................................................................. 6 Exponential and Poisson Relationships ........................................................... 7 Reliability Availability and Failure Density Functions ................................. 13 Mean Time to Failure ......................................................................................... 16 Mean Time to Repair ......................................................................................... 20 Mean Time Between Failure ............................................................................. 20 Mean Time to First Failure ............................................................................... 21 General Availability Analysis ............................................................................ 25 Instantaneous Availability .......................................................................... 27 Limiting Availability .................................................................................... 28 SYSTEM RELIABILITY ............................................................................................ 31 Series Systems ...................................................................................................... 31 Parallel Systems ................................................................................................... 33 M–of–N Systems ................................................................................................ 33 Selecting the Proper Evaluation Parameters.................................................. 34 IMPERFECT FAULT COVERAGE AND RELIABILITY ............................. 37 Redundant System with Imperfect Coverage ................................................ 37 Generalized Imperfect Coverage ..................................................................... 39 MARKOV MODELS OF FAULT–TOLERANT SYSTEMS .......................... 45 Solving the Markov Matrix ............................................................................... 48 Chapman–Kolmogorov Equations .......................................................... 48 Markov Matrix Notation ................................................................................... 51 Laplace Transform Techniques........................................................................ 53 Modeling a Duplex System..................................................................................... 55 Modeling a Triple–Redundant System ................................................................. 61 Modeling a Parallel System with Imperfect Coverage ....................................... 65 Modeling A TMR System with Imperfect Coverage ......................................... 71 Modeling A Generalized TMR System ................................................................ 
74 Laplace Transform Solution to Systems of Equations ................................ 75 Specific Solution to the Generalized System ................................................. 76 PRACTICAL EFFECTS OF PARTIAL COVERAGE ...................................... 83 Determining Coverage Factors.............................................................................. 83

i

Coverage Measurement Statistics ............................................................. 84 Coverage Factor Measurement Assumptions ........................................ 85 Coverage Measurement Sampling Method ............................................. 85 Normal Population Statistics ..................................................................... 86 Sample Size Computation .......................................................................... 86 General Confidence Intervals.................................................................... 88 Proportion Statistics .................................................................................... 88 Confidence Interval Estimate of the Proportion................................... 90 Unknown Population Proportion............................................................. 90 Clopper–Person Estimation ...................................................................... 91 Practical Sample Estimates ........................................................................ 92 Time Dependent Aspects of Fault Coverage Measurement ............... 93 Common Cause Failure Effects ............................................................................ 93 Square Root Bounding Problem ...................................................................... 95 Beta Factor Model .............................................................................................. 96 Multi–Nominal Failure Rate (Shock Model) ................................................. 96 Binomial Failure Rate Model ............................................................................ 97 Multi–Dependent Failure Fraction Model ..................................................... 97 Basic Parameter Model ...................................................................................... 98 Multiple Greeks Letter Model .......................................................................... 98 Common Load Model ....................................................................................... 99 Nonidentical Components Model ................................................................... 99 Practical Example of Common Cause Failure Analysis .............................. 99 Common Cause Software Reliability............................................................. 101 Software Reliability Concepts .................................................................. 102 Software Reliability and Fail–Safe Operations ..................................... 108 PARTIAL FAULT COVERAGE SUMMARY ................................................... 111 Effects of Coverage ............................................................................................... 112 REMAINING QUESTIONS .................................................................................. 113 Realistic Probability Distributions....................................................................... 113 Multiple Failure Distributions ........................................................................ 114 Weilbull Distribution ........................................................................................ 116 Periodic Maintenance ............................................................................................ 119 Periodic Maintenance of Repairable Systems .............................................. 119 Reliability Improvement for a TMR System................................................ 
123 CONCLUSIONS ........................................................................................................ 125

ii

LIST OF FIGURES

Number

Page

Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations. ...............................................................................................5 Figure 2 – Assumptions regarding the behavior of a random process that generated events following the Poisson probability distribution function. ........................................................................................................................8 Figure 3 – State Transition probabilities as a function of time in the Continuous– Time Markov chain that is subject to the constraints of the Chapman– Kolmogorov equation. ............................................................................................ 47 Figure 4 – Definition of the exponential order of a function............................................ 54 Figure 5 – the state transition diagram for a Parallel Redundant system with repair. State {2} represents the fault free operation mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode, the absorption state......................................................................................................... 56 Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State {2} represents the fault free (TMR) operation mode, State {1} represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State {0} represents the system failure mode, the absorbing state. ...................................................................................... 63 Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State {2} represents the fault free mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode. State {0} can be reached from State {2} through an uncovered fault, which causes the system to fail without the intermediate State {1} mode. .......................................................................................................................... 66 Figure 8 –The state transition diagram for a Triple Modular Redundant system with repair and imperfect fault coverage. State {3} represents the fault free mode, State {2} represents the single fault (Duplex) mode, State

{1} represents the two–fault (Simplex) mode, and State {0} represents

the system failure mode........................................................................................... 72 iii

Figure 9 – The state transition diagram for a Generalized Triple Modular Redundant system with repair and [perfect fault detection coverage. The system initially operates in a fault free state {0} . A fault in any

module results in the transition to state {1, K , N } . A second fault while in state {1, K , N } results in the system failure state {N + 1} . ....................... 76 Figure 10 – Sample size requirement for a specified estimate as tabulated by Clopper and Pearson. .............................................................................................. 91 Figure 11 – Common Cause Failure modes guide figures for electronic programmable system [HSE87]. These ratios of non–CCF to CCF for various system configurations. CCFs are defined as non–random faults that are designed in or experienced through environmental damage to the system. Other sources [SINT88]. [SINT89] provide different figures. ...................................................................................................................... 101 Figure 12 – Four Software Growth Model expressions. The exponential and hyperexponential growth models represent software faults that are time independent. The S–Shaped growth models represent time delayed and time inflection software fault growth rates [Mats88]. ...................................... 103 Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems. ...................... 111 Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees of coverage............................................................................................................... 112 Figure 15 – Mean Time to Failure increases for a Triple Modular Redundant system with periodic maintenance. This graph shows that maintenance intervals which are greater than one–half of the mean time to failure for one module have little effect on increasing reliability. But frequent maintenance, even low quality maintenance, improves the system reliability considerably. .......................................................................................... 124

iv

ACKNOWLEDGMENTS

The author wishes to thank Dr. Wing Toy of AT&T Napierville Laboratories, Napierville, Illinois for his consultation on the ESS4 Central Office Switch and his contributions to this work. Dr. Victor Lowe of Ford Aerospace, Newport Beach, California for his consultation on the general forms of Markov model solutions. Mr. Henk Hinssen of Exxon Corporation, Antwerp Belgium for his discussion of the effects of partial diagnostic coverage in Triple Modular Redundant Systems at the Exxon Polystyrene Plant, Antwerp, Belgium. Dr. Phil Bennet of The Centre for Software Engineering, Flixborough, England for his ideas regarding software reliability measurements in the presence of undetected faults. Mr. Daniel Lelivre of Factory Systems, Paris France for his comments and review of this paper and its applicability to safety critical systems at Total, Mobile, and NorSoLor chemical plants. Several institutions have contributed source material for this work including The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology (SINTF), Trondheim Norway and the United Kingdom Atomic Energy Authority, Systems Reliability Service, Culcheth, Warrington, England. This work is a derivative of an aborted PhD thesis in Computer Science at the University of California, Irvine. This effort started in the early 1980’s through TRW, when holding a PhD was a naïve dream, requiring much more work then I had capacity to produce.

v

PREFACE

This paper was originally written to support the design and development of the Triple Modular Redundant (TMR) computer produced by Triconex Corporation of Irvine, California. In 1987, Triconex designed and manufactured a digital TMR process control computer that was deployed in a variety of industrial environments, including: turbine controls, boiler controls, fire and gas systems, emergency shutdown systems, and general purpose fault–tolerant real–time control systems. The Tricon (a classic 1980’s product name) was based on several innovative technologies. As the manager of software development for the Triconex, I was intimately involved in the software and hardware of the Tricon. In 1987, TMR was not a completely new concept. Flight control systems and navigation computers were found in aerospace applications. The Space Shuttle used a TMR+1 computer system and was well understood by the public. What was new to the market was an affordable TMR computer that could be deployed in a rugged industrial environment. The heart of the Tricon was a hardware voting system that performed a 2–out–of–3 vote for all digital input signals presented to the control program. The contents of memory and the computed digital outputs were again voted 2–out–of–3 at the physical output devices. Once the digital command had been applied to the output device its driven state was verified and the results reported to the control program. The Tricon contained of 3 independent (but identical) 32–bit battery powered microprocessors, a 2–out–of–3 voting digital serial bus connecting the three processors, a dual redundant power system using DC–to–DC converters (state of the art for 1987), and three separate isolated serial I/O buses connecting the I/O

vi

subsystem to the three main processors. The I/O subsystem cards were themselves TMR, using onboard 8–bit processors and a quad output device to vote 2–out–of–3 the digital commands received from the control program. The Tricon executed a control program on a periodic basis. The architecture of the operating software was modeled after the programmable controllers of the day, which were programmed in a ladder logic representing mechanical relays and timers. Both digital and analog devices provided input and output to the control program. The control program accepted input states from the I/O subsystem, evaluated the decision logic and produced output commands, which were sent to the I/O subsystem. This cycle was performed every 10ms in a normally configured system. In the presence of faults, the key to the survivability of the Tricon was the combination of TMR hardware and fault diagnostic software. Diagnostic software was applied to each processor element and the digital I/O device. This diagnostic software was capable of detecting all single stuck–at faults, many multiple stuck–at faults as well as many transient faults. A fault–injection and reliability evaluation technique developed by the author and described in this paper was used to evaluate the coverage factor of the diagnostic software. Triconex no longer exists as an independent company, having been absorbed into a larger control systems vendor. The materials presented in this paper were critical to Tricon’s TÜV and SINTF [SINTF89] certification for North Sea Norwegian Sector, German (then the Federal Republic), Belgium and British Health and Safety Executive (HSE) industrial safety operations. The concept of fault–tolerant computing has become important again in the distributed computing market place. The Tandem Non–Stop processor, modern flight and navigation computers as well as telecommunications computers all vii

depend on some form of diagnostics to initiate the fault recovery process. A recent systems architectural paper mentioned TMR but without sufficient attention to the underlying details. [1] The reissuing of this paper addresses several gaps in the literature: ƒ

The foundations of fault–tolerance and fault–tolerance modeling have faded from the computer science literature. The underlying mathematics of fault– tolerant systems present a challenge for an industry focused on rapid software development and short time to market pressures.

ƒ

The understanding that unreliable and untrustworthy software systems are created by latent faults in both the hardware and software is poorly understood in this age of Object–Oriented programming and plug and play systems development.

ƒ

The Markov models presented in this paper have general applicability to distributed computer systems analysis and need to be restated. The application of these models to distributed processing systems, with symmetric multi–processor computers is a reemerging science. With the advent of high–availability computing systems, the foundations of these systems needs to be understood once again.

ƒ

The current crop of computer science practitioners have very little understanding of the complexities and subtleties of the underlying hardware and firmware that make up the diagnostic systems of modern computers, their reliability models and the mathematics of system modeling.

Glen B. Alleman Niwot Colorado 80503 April 2000

1

“Attribute Based Architectural Styles,” Mark Klein and Rick Kazman, CMU/SEI–99–TR–022, Software Engineering Institute, Carnegie Mellon University, October 1999.

viii

CHAPTER 1

INTRODUCTION

Two approaches are available for increasing the system reliability of digital computer systems: Fault avoidance (fault intolerance) and fault tolerance [Aviz75]. Fault avoidance is the result of conservative design techniques utilizing high–reliability components, system burn–in, and careful design and testing processes. The goal of fault avoidance is to reduce the possibility of a failure [Aviz84], [Rand75], [Kim86], [Ozak88]. The presence of faults however will result in system failure, negating all prior efforts to increase system reliability [Litt75], [Low72]. Fault–tolerance provides the system with the ability to withstand a system fault, maintain a safe state, and possibly continue to operate in the presence of this fault. FAULT TOLERANT SYSTEM DEFINITIONS

A set of consistent definitions is used in this paper to avoid confusion with existing definitions. These definitions are provided by the IFIP Working Group 10.4, Reliable Computing and Fault–Tolerance [Aviz84], [Aviz82], [Ande82], [Robi82], [Lapr84], [TUV86]: ƒ

A Failure occurs when the user of a system perceives that a service resource ceases to deliver the expected results.

ƒ

An Error occurs when some part of a system resource assumes an undesired state. Such a state is then contrary to the specification of the resource to the expectation (requirement) of the user.

ƒ

A Fault is detected when either a failure of the resource occurs, or an error is observed within the resource. The cause of the failure or error is said to be a fault.

1/130

FAULT–TOLERANT SYSTEM FUNCTIONS

In fault–tolerant systems, hardware and software redundancy provides information needed to negate the effects of a fault [Aviz67]. The design of fault– tolerant systems involves the selection of a coordinated failure response mechanism that follows four steps [Siew84], [Mell77], [Toy86]: ƒ

Fault Detection

ƒ

Fault Location and Identification

ƒ

Fault Containment and Isolation

ƒ

Fault Masking

During the fault detection process, diagnostics are used to gather and analyze information generated by the fault detection hardware and software. These diagnostics determine the appropriate fault masking and fault recovery actions [Euri84], [Rouq86], [Ossf80], [Gluc86], [John85], [John86], [Kirr86], [Chan70]. It is the less than perfect operation of the Fault Detection, Location, and Identification processes of the system that is examined in this paper. The reliability of the fault–tolerant system depends on the ability of the diagnostic subsystem to correctly detect and analyze faults [Kirr87], [Gall81], [Cook73], [Brue76], [Lamp82]. The measure of the correct operation of the diagnostic subsystem is called the Coverage Factor. It is assumed in most fault–tolerant product offerings that the diagnostic coverage factor is perfect, i.e. 100%. This paper addresses the question: What is the reliability of the Fault–Tolerant system in the presence of less than perfect coverage? To answer this question, some background in the mathematics of reliability theory is necessary.

2/130

Overview of This Paper

The development of a reliability model of a Triple Modular Redundant (TMR) system with imperfect diagnostic coverage is the goal of this paper. Along the way, the underlying mathematics for analyzing these models is developed. The Markov Chain method will be the primary technique used to model the failure and repair processes of the TMR system. The Laplace transform will be used to solve the differential equations representing the transition probabilities between the various states of the TMR system described by the Markov model. The models developed for a TMR system with partial coverage can be applied to actual systems. In order to make the models useful in the real–world a deeper understanding of the diagnostic coverage and fault detection is presented. The appendices provide the background for the Markov models as well as the statistical process. The mathematics of Markov Chains and the statistical processes that underlay system faults and their repair processes can be applied to a variety of other analytical problems, including system performance analysis. It is hoped the reader will gain some appreciation of the complexity and beauty of modern systems as well as the subtitles of their design and operation. If the reader is interested in skipping to the end, Chapter 7 provides a summary of the effects of partial coverage on various system configurations.

3/130

4/130

CHAPTER 2

RELIABILITY, AVAILABILITY FOR SIMPLE SYSTEMS

When presented with the reliability figures for a computer system, the user must often accept the stated value as factual and relevant, and construct a comparison matrix to determine the goodness of each product offering [Kraf81]. Difficulties often arise through the definition and interpretation of the term reliability. This chapter develops the necessary background for understanding the reliability criteria defined by the manufactures of computer equipment. Figure 1 lists the criteria for defining system reliability [Siew82], [Ande72], [Ande79], [Ande81]. Deterministic Models Survival of at least k component failures Probabilistic Models z (t )

R (t )

µ

A (t )

– Hazard (failure rate) function – Reliability function – Repair Rate – Availability function

Single Parameter Models MTTF – Mean Time to failure MTTR – Mean Time to Repair MTBF – Mean Time Between Failure c – Coverage Figure 1 – Evaluation Criteria defining System Reliability. These criteria will be used to develop a set of time dependent metrics used to evaluate various configurations.

5/130

DETERMINISTIC MODELS

The simplest reliability model is a deterministic one, in which the minimum number of component failures that can be tolerated without system failure is taken as the figure of merit for the system. Probabilistic Models

The failure rate of electronic and mechanical devices varies as a function of time. This time dependent failure rate is defined by the hazard function, z ( t ) . The hazard function is also referred to as the hazard rate or mortality rate. For electronic components on the normal–life portion of their failure curve, the failure rate is assumed to be a constant, λ , rather than a function of time. The exponential probability distribution is the most common distribution encountered in reliability models, since it describes accurately most life testing aspects for electronic equipment [Kapu77]. The probability density function (pdf), Cumulative Distribution Function (CDF), reliability function ( R ( t ) ), and hazard (failure rate) function ( z ( t ) ) of the exponential distribution are expressed by the following [Kend77]:

pdf = f ( t ) = λe −λt

(2.1)

CDF = F ( t ) = 1 − e −λt

(2.2)

Reliability = R ( t ) = e −λt

(2.3)

Hazard Function = z ( t ) = λ

(2.4)

6/130

The failure rate parameter λ describes the rate at which failures occur over time [DoD82]. In the analysis that follows, the failure rate is assumed to be constant, and measured as failures per million hours. Although a time dependent failure rate could be used for un–aged electronic components, the aging of the electronic components can remove the traditional bathtub curve failure distribution. The constant failure rate assumption is also extended to the firmware controlling the diagnostics of the system [Bish86], [Knig86], [Kell88], [Ehre78], [Eckh75], [Gmei79], [RTCA85]. Exponential and Poisson Relationships

In modeling the reliability functions associated with actual equipment, several simplifying assumptions must be made to render the resulting mathematics tractable. These assumptions do not reduce the applicability of the resulting models to real–world phenomenon. One simplifying assumption is that the random variables associated with the failure process have exponential probability distributions. The property of the exponential distribution that makes it easy to analyze is that it does not decay with time. If the lifetime of a component is exponentially distributed, after some amount of time in use, the item is assumed to be good as

new. Formally, this property states that the random variable X is memoryless, if the expression P { X > s + t X > t } = P { X > s} is valid for all s , t ≥ 0 [Cram66], [Ross83]. If the random variable X is the lifetime of some item, then the probability that the item is functional at time s + t , given that it survived to time

t, is the same as the initial probability that is was functional at time s. If the item is functional at time t, then the distribution of the remaining amount of time that it survives is the same as the original lifetime distribution. The item does not remember that it has already been in use for a time t.

7/130

This property is equivalent to the expression

P {X > s + t , X > t } P {X > t }

= P { X > s}

or P { X > s + t } = P { X > s} P { X > t } . Since the form of this expression is satisfied when the random variable X is exponentially distributed (since

e −λ( s +t ) = e −λs e −λt ), it follows that exponentially distributed random variables are memoryless. The recognition of this property is vital to the understanding of the models presented in this paper. If the underlying failure process is not memoryless, than the exponential distribution model is not valid. The exponential probability distributions and the related Poisson processes used in the reliability models are formally based on the assumptions shown in Figure 2 [Cox 62], [Thor26]. ƒ

Failures occur completely randomly and are independent of any previous failure. A single failure event does not provide any information regarding the time of the next failure event.

ƒ

The probability of a failure during any interval of time [ 0, t ] is proportional to the length of the interval, with a constant of proportionality λ . The longer one waits the more likely it is a failure will occur. Figure 2 – Assumptions regarding the behavior of a random process that generated events following the Poisson probability distribution function.

An expression describing the random processes in Figure 2 results from the

Poisson Theorem which states that the probability of an event A occurring k times in n trials is approximately [Papo65], [Pois37],

n ( n − 1)L ( n − k + 1) k n −k p q , 1 ⋅ 2L k

(2.5)

8/130

where p = P { A} is the probability of an event A occurring in a single trial and

q = 1 − p . This approximation is valid when n → ∞,

p → 0 and the product

n ⋅ p remains finite. It should be noted that a large number of different trials of

independent systems is needed for this condition to hold, rather than a large number of repeated trials on the same system. The Poisson Theorem can be simplified to the following approximation for the probability of an event occurring k times in n trials [Kend77], ⎛ n ⎞ k n −k (n p ) ⎛ n p ⎞ n! 1− ⎜k⎟ p q = ( n − k ) ! k ! n k ⎝⎜ n ⎠⎟ ⎝ ⎠ k

=

( 2π ) e − n n n + ( 2π ) ( n − k )

=

1 n

k⎞ k ⎛ ⎜1− ⎟ e ⎝ n !⎠ n p k −n p ≈ e . k!

n − k + 21

n −k

(n p )

1 2

e

,

−n +k

n

k

k!

k

e −n p ,

npk k!

(2.6)

The exponential and Poisson expressions are directly related. A detailed understanding of this relationship will aid in the development of the analysis that follows. Using the Poisson assumptions described in Figure 2, the probability of n failures prior to time t is, P {N = n T ≤ t } = Pt ( n ) .

(2.7)

9/130

From of Eq. (2.7), the probability that no failures occur ( n = 0 ) between time t and time t + ∆t is,

Pt +∆t ( 0 ) = Pt ( 0 ) [1 − λ∆t ] ,

(2.8)

where the term λ = n p describing the total number of failures is of moderate magnitude [Fell67]. The probability that n failures occur between time t and time t + ∆t is then,

Pt +∆t ( n ) = Pt ( n ) [1 − λ∆t ] + Pt ( n − 1) [ λ∆t ] , n > 0 .

(2.9)

Using Eq. (2.9) and Eq. (2.8) and allowing ∆t → 0 , a differential equation can be constructed describing the rate at which failures occur between time t and time t + ∆t , d Pt ( 0 ) = −λPt ( 0 ) , dt d Pt ( n ) = λ ⎡⎣ Pt ( n − 1) − Pt ( n ) ⎤⎦ , for n > 0, dt

(2.10)

with the initial conditions of,

Pt ( n ) = 0.

(2.11)

The unique solution to the differential equation in Eq. (2.10) is [Klie75], Pt ( n ) =

n ( λt ) e −λt

n!

, n = 0, 1, 2,K

10/130

(2.12)

which is the Poisson distribution defined in Eq. (2.6). Using Eq. (2.12) to define a function F ( t ) representing the probability that no failures have occurred as of time t gives,

F ( t ) = Pt {n = 0} = e −λt .

(2.13)

The expression in Eq. (2.13) is also the definition for the Cumulative Distribution Function, CDF, of the Poisson failure process [Fell67]. By using Eq. (2.19), the probability distribution function, pdf, of the Poisson process can be given as,

f ( t ) = λe −λt ,

(2.14)

which is the exponential probability distribution. [2] The following statement describes the relationship between the Poisson and exponential expressions [Cox65],

If the number of failures occurring over an interval of time is Poisson distributed, then the time between failures is exponentially distributed. An alternative method of relating the exponential and Poisson expressions is useful at this point. The functions defined in Eq. (2.1) and Eq. (2.2) are based on the interchangeability of the pdf and the CDF for any defined probability distribution. The Cumulative Distribution Function F ( x ) of a random variable

X is defined as a function obeying the following relationship [Papo65],

2

This development of the pdf is very informal. Making use of the forward reference to construct an expression is circular logic and would not be permitted in more formal circumstances. For the purposes of this paper, this type of behavior can be tolerated, since the purpose of this development is to get to the results rather than dwell on the analysis process. This is a fundamental difference between mathematics and engineering.

11/130

F ( x ) = P { X ≤ x} , − ∞ < x < ∞.

(2.15)

The probability density function f ( x ) of a random variable X can be derived from the CDF using the following [Dave70], f ( x) =

d F ( x ). dx

(2.16)

The CDF can be obtained from the pdf by the following, F ( x ) = P { X ≤ x} = ∫

x

−∞

f ( t )dt , − ∞ < x < ∞.

(2.17)

Using Eq. (2.16) and Eq. (2.17), the CDF and pdf expressions for an exponential distribution can be developed. If the mean time between failures (MTBF) is an

Exponentially distributed random variable, the CDF is, ⎧1 − e −λt , 0 ≤ t ≤ ∞, F (t ) = ⎨ , otherwise, ⎩0

(2.18)

The number of failures in the time interval [ 0, t ] is a Poisson distributed random variable with a probability density function of, f (t ) =

⎧λe −λ , t > 0, d F (t ) = ⎨ dt ⎩0, otherwise,

where t is a random variable denoting the time between failures.

12/130

(2.19)

Reliability Availability and Failure Density Functions

An expression for the reliability of a system can be developed using the following technique. The probability of a failure as a function of time is defined as,

P {T ≤ t } = F ( t ) , t ≥ 0,

(2.20)

where t is a random variable denoting the failure time. F ( t ) is a function defining the probability that the system will fail by time t. F ( t ) is also the Cumulative Distribution Function (CDF) of the random variable t [Papo65]. The probability that the system will perform as intended at a certain time t is defined as the Reliability function and is defined as, R ( t ) = (1 − F ( t ) ) = P {T ≥ t } .

(2.21)

If the random variable describing the time to failure t has a probability density function f ( t ) then using Eq. (2.21) the Reliability function is, ∞



t

t

R ( t ) = 1 − F ( t ) = 1 − ∫ f ( x ) dx = ∫ f ( x ) dx.

(2.22)

Assuming the time to failure random variable t has an exponential distribution its failure density defined by Eq. (2.19) is,

f ( t ) = λe −λt , t ≥ 0, λ ≥ 0.

(2.23)

The resulting reliability function is then, ∞

R ( t ) = ∫ λe −λt dt = e −λt .

(2.24)

t

13/130

A function describing the rate at which a system fails as a function of time is referred to as the Hazard function (Eq. (2.4)). Let T be a random variable representing the service life remaining for a specified system. Let F ( x ) be the distribution function of T and let f ( x ) be its probability density function. A new function z ( x ) termed the Hazard Function or the Conditional Failure Function of T is given by z ( x ) =

f (x ) . The function z ( x ) dx is the conditional 1− F (x )

probability that the item will fail between x and x + dx given it has survived a time T greater than x. For a given hazard function z ( x ) the corresponding distribution function is ⎡ x ⎤ 1 − F ( x ) = (1 − F ( x0 ) ) exp ⎢ − ∫ z ( y ) dy ⎥ were x 0 is an arbitrary value of x. In a ⎢⎣ xo ⎥⎦ continuous time reliability model the hazard function is defined as the instantaneous failure rate of the system [Kapu77], z ( t ) = lim

∆t →0

R (t ) − R (t + ∆ ) , ∆t ⋅ R ( t )

=

1 ⎡ d ⎤ − R ( t )⎥ , ⎢ R ( t ) ⎣ dt ⎦

=

f (t ) , R (t )

(2.25)

λe −λt = −λt , e = λ.

The quantity z ( t ) dt represents the probability that a system of age t will fail in the small interval of time [t , t + dt ] . The hazard function is an important

14/130

indicator of the change in the failure rate over the life of the system. For a system with an exponential failure rate, the hazard function is constant as shown in Eq. (2.25) and it is the only distribution that exhibits this property [Barl85]. Other reliability distributions will be shown in later chapters that have variable hazard rates. If a system contains no redundancy – this is, every component must function properly for the system to continue operation – and if component failures are statistically independent, the system reliability function is the product of the component reliabilities and follows an exponential probability distribution. The failure rate of such a system is the product of the failure rates of the individual components, n

n

i =1

i −1

Rsys ( t ) = ∏ Ri ( t ) =∏ e −λi t = exp ⎡⎣ − ( ∑ λ i ) t ⎤⎦ .

(2.26)

In most cases it is possible to repair or replace failed components and accurate models of system reliability will consider this. As will be shown the repair activity is not as easily modeled as the failure mechanisms. For systems that can be repaired, a new measure of reliability can be defined, The probability that the system is operational at time “t.” This new measure is the Availability and is expressed as A ( t ) . Availability A ( t ) differs from reliability R ( t ) in that any number of system failures can occur prior to time t but the system is considered available if those failures have been repaired prior to time t.

15/130

For systems that can be repaired, it is assumed that the behavior of the repaired system and the original system are identical from a failure standpoint. In general, this is not true, as perfect renewal of the system configuration is not possible. The terms Mean Time to First Failure and Mean Time to Second Failure now become relevant. Assuming a constant failure rate λ , a constant repair rate µ , and identical failure behaviors between the repaired system and the original system, the steady–state system availability can be expressed as, ASS =

µ . λ+µ

(2.27)

The expression in Eq. (2.27) is an approximation of the expression of the availability with repair requires the solution of the appropriate Markov model, which will be developed in a later chapter. Mean Time to Failure

The Mean Time to Failure (MTTF) is the expected time to the first failure in a population of identical systems, given a successful system startup at time t = 0 . The Cumulative Distribution function F ( x ) in Eq. (2.15) and the probability density function f ( x ) in Eq. (2.16) characterize the behavior of the probability distribution function of the underlying random failure process. These expressions are in a continuous integral form and require the solution of integral equations to produce a useable result. A concise parameter that describes the expected value of the random process is useful for comparison of different reliability models. This parameter is the Mean or Expected Value of the random variable denoted by E [ X ] and is defined by [Parz60], [Dave70],

16/130



E[ X ] =

∫ xf ( x ) dx.

(2.28)

−∞

The expression in Eq. (2.28) denotes the expected value of the continuous function f ( x ) . It is important to note that this definition assumes x f ( x ) is integrable in the interval ( −∞, ∞ ) . For an exponential probability density function of, f ( x ) = λe −λx , x > 0,

(2.29)

the mean or expected value of the exponential function is given by, ∞

E[X ] =



−∞



xf ( x ) dx = ∫ λe −λx dx .

(2.30)

0

The evaluation of Eq. (2.30) can be done in a straightforward manner using the Gamma function [Arfk70], which is defined as, ∞

Γ (α ) = ∫ xα −1e − x dx, α > 0,

(2.31)

0

or alternately, ∞

α−1 x ∫ x e dx = 0

Γ (α) . λα

(2.32)

Rewriting the expression in Eq. (2.30) for the expected values as,

17/130

E[ X ] =

1



ue λ∫

−u

du ,

(2.33)

0

where substituting the variables, u = λx and du = λdx,

(2.34)

results in, ∞

1 E [ X ] = ∫ ue − u du , λ0 1 Γ (2) , λ 1 = , λ =

(2.35)

which is the MTTF for a simple system. Although this expression is useful for simple systems, a general–purpose expression representing the MTTF is needed. This function can be developed in the following manner. Let X denote the lifetime of a system so that the reliability function is, R ( t ) = P {X > t } ,

(2.36)

and the derivative of the reliability function which is also given in Eq. (2.21) and Eq. (2.22) is again defined as,

d R (t ) = − f (t ) . dt

(2.37)

The expression for the expected value or MTTF using Eq. (2.28) is given by:

18/130





⎛d ⎞ E [ X ] = ∫ tf ( t ) dt = − ∫ t ⎜ R ( t ) ⎟ dt dt ⎠ 0 0 ⎝

(2.38)

Using the technique of integration by parts [Smai49], [Arfk70] is shown in Eq. (2.39), b

∫ a

b

b ⎛ d ⎞ ⎛ d ⎞ f ( x ) ⎜ g ( x ) ⎟ dx − f ( x ) g ( x ) − ∫ g ( x ) ⎜ f ( x ) ⎟ dx, a a ⎝ dx ⎠ ⎝ dx ⎠

(2.39)

to evaluate Eq. (2.38). Integrating by parts gives the expected value as, E [ X ] = −t R ( t )

∞ 0



+ ∫ R ( t ) dt .

(2.40)

0

Since R ( t ) approaches zero faster than t approaches infinity, Eq. (2.40) can be reduced to, ∞

E [ X ] = ∫ R ( t ) dt = MTTF ,

(2.41)

0

which is the expression for the Mean Time to Failure for a general system configuration. This direct relationship between MTTF and the system failure rate is one reason the constant failure rate assumption is often made when the supporting reliability data is scanty [Barl75]. Appendix G describes the analysis of the variance for this distribution. Using an exponential failure distribution implies two important behaviors for the system,

19/130

ƒ

Since a used subsystem is stochastically as good as a new subsystem, a policy of scheduled replacement of used subsystems which are know to still be functioning, does not increase the lifetime of the system.

ƒ

In estimation the mean system life and reliability, data can be collected consisting only of the number of hours of observed life and the number of observed failures; the ages of the subsystems under observation are of no concern.

Mean Time to Repair

The Mean Time to Repair (MTTR) is the expected time for the repair of a failed system or subsystem. For exponential distributions this is MTTF = MTTR =

1 and λ

1 . The steady state availability ASS defined in Eq. (2.27) can be µ

rewritten in terms of these parameters, ASS =

MTTF . MTTR + MTTF

(2.42)

Mean Time Between Failure

The Mean Time Between Failure (MTBF) is often mistakenly used in place of Mean Time to Failure (MTTF). The MTBF is the mean time between failures in a system with repair, and is derived from a combination of repair and failure processes. The simplest approximation for MTBF is: MTBF = MTTF + MTTR . In this paper, it is assumed MTTR

(2.43) MTTF so that MTTR is used in place of

MTBF. The Mean Time to Failure is considered since in fault–tolerant systems Failure occurs only when the redundancy features of the system fail to function

properly. In the presence of perfect coverage and perfect repair the system should

20/130

operate continuously. Therefore, failure of the system implies total loss of system capabilities. Mean Time to First Failure

The Mean Time to Failure is defined as the expected time of the first failure in a population of identical systems. This development depends on the assumption that the failure rate is constant Eq. (2.25), exponentially distributed Eq. (2.14), and the repair time is constant, µ . In the general case, these assumptions may not be valid and the Mean Time to Failure (MTTF) is not equivalent to the Mean Time to First Failure (MTFF).

By removing the exponential probability failure distribution restriction in Eq. (2.29) a generalized expression for the first failure time can be derived. Given a population of n subsystems each with a random variable X i , i = 1, 2, K , n and a continuous pdf of f ( x ) , the failure time for the n th

subsystem is given by summing all the failure times prior to the failure, n

Sn = X 1 + X 2 + L + X n = ∑ X i .

(2.44)

i =1

If the random variables { X 1 , X 2 , K , X n } are independent and identically distributed, all with pdf’s of f ( x ) , the random process described by these variables is referred to as an Ordinary Renewal Process [Cox62], [Ross70]. The details of the Renewal Process are shown in Appendix E.

21/130

Given the random process described by Eq. (2.44) the distribution function of Sn is provided by convolving each individual distribution function F ( t ) . The convolution of two functions is defined as [Brac65], [Papo65]: f ( x) ⊗ g ( x) ≡



∫ f ( u ) g ( x − u ) du .

(2.45)

−∞

The resulting convolution function for the n+1 subsystem failure is given by: t

F( n +1) ( t ) = ∫ F( n ) ( t − x ) F ( x ) dx.

(2.46)

0

In renewal processes, the random variables are actually functions and can be substituted in the reliability computations when: N ( t ) = n ⇔ S n ≤ t ≤ S n +1 .

(2.47)

When the conditions in Eq. (2.47) are met, the probability of n renewals in a time interval is given by, P {N ( t ) = n } = P {Sn ≤ t ≤ Sn +1} ,

= P {Sn ≤ t } − P {Sn +1 ≤ t } ,

(2.48)

= F( n ) ( t ) − F( n +1) ( t ) . The renewal function H ( t ) can be defined as the average number of subsystem failures and repairs as a function of time, and is given as, H ( t ) = E ⎡⎣ N ( t ) ⎤⎦ .

(2.49)

22/130

Using Eq. (2.48) in the evaluation of Eq. (2.49) and Eq. (2.30) as the definition of the expectation value, gives the following for the renewal function, H (t ) = = =



∑ n P {N ( t ) = n },

n =0 ∞



n =0 ∞

n F( n ) ( t ) −



∑ n F(n +1) ( t ).

(2.50)

n =0 ∞

∑ n F( n ) ( t ) − ∑ ( n − 1) F(n ) ( t ).

n =0

n =1

Simplifying Eq. (2.50) results in an expression for the renewal function of, ∞

H ( t ) = F ( t ) + ∑ F( n +1) ( t ) .

(2.51)

n =1

The term F( n +1) is the convolution of F( n ) and F which gives, t

F( n +1) ( t ) = ∫ F( n ) ( t − x ) F ( x ) dx ,

(2.52)

0

which results in the expression for the renewal function of, ∞ t

H ( t ) = F ( t ) + ∑ ∫ F( n ) ( t − x ) F ( x ) dx .

(2.53)

n =1 0

Rearranging the integral term in Eq. (2.53) gives, t ∞ ⎡ ⎤ H ( t ) = F ( t ) + ∫ ⎢ ∑ F( n ) ( t − x ) ⎥ F ( x ) dx . ⎦ 0 ⎣ n =1

23/130

(2.54)

The summation term in Eq. (2.54) is the renewal function for the n th failure, giving, t

H ( t ) = F ( t ) + ∫ H ( t − x ) F ( x ) dx .

(2.55)

0

Using Eq. (2.16), the renewal density function h ( t ) is the derivative of the distribution function, giving, h (t ) =

d H (t ) . dt

(2.56)

Using Eq. (2.50) to evaluate the derivative results in, ∞

h ( t ) = ∑ f ( n ) ( t ),

(2.57)

n =1

and using Eq. (2.54) as a substitute for the right–hand side of Eq. (2.57) results in, t

h ( t ) = f ( t ) + ∫ h ( t − x ) f ( x ) dx .

(2.58)

0

Eq. (2.58) is known as the Renewal Equation [Ross70]. To solve the renewal equation, the Laplace transform will be used. The transform of the probability density function is, ∞

L { f ( s )} = ∫ e − sx f ( x ) dx ,

(2.59)

0

and the transform of the renewal function is,

24/130



L {h ( s )} = ∫ e − sx h ( x ) dx .

(2.60)

0

Using the convolution property of the Laplace transform [Brac65], an equation for the renewal distribution can be generated, L {h ( s )} = L { f ( s )} + L {h ( s )} L { f ( s )} ,

(2.61)

and simplified to, L {h ( s )} =

L { f ( s )}

1 − L { f ( s )}

.

(2.62)

Eq. (2.62) is now the generalized expression for the failure distribution for a random process within an arbitrary probability distribution. General Availability Analysis

The steady state system availability defined in Eq. (2.42) assumes an exponential distribution for the failure rate of the system or subsystems. An important activity in the analysis of Fault–Tolerant systems is the development of a general– purpose availability expression, independent of the underlying failure distribution. In the analysis that follows, it will be assumed that when a subsystem fails it is repaired and the system restored to its functioning state. It will also be assumed that the restored system functions as if it were new, that is with the failure probability function restarted at t = 0 .

25/130

Let Ti be the duration of the ith functioning period and let Di be the system downtime because of the failure of the system while the ith repair takes place. These durations will form the basis of the renewal process. By combining the subsystem failure interval and the subsystem repair duration, a random variable sequence is constructed such that, Xi = Ti + Di ; i = 1, 2, K

(2.63)

It must be assumed that the duration of the functioning subsystems are identically distributed with a common Cumulative Distribution Function W ( t ) and a common probability density function w ( t ) and that the repair periods are also identically distributed with G ( t ) and g ( t ) . Using these assumptions the terms in Eq. (2.63) are also identically distributed such that,

{ Xi

i = 1, 2, K} ,

(2.64)

meets the definition of a Renewal process developed Eq. (2.44). Using this development an expression for the convolution of the two independent random processes is given by, L { f ( s )} = L {w ( s )} L { g ( s )} .

(2.65)

Using Eq. (2.62) gives, L {h ( s )} =

L {w ( s )} L { g ( s )}

1 − L {w ( s )} L { g ( s )}

.

26/130

(2.66)

The average number of repairs M ( t ) in the time interval ( 0, t ] has the Laplace transform: L { M ( s )} =

L {w ( s )} L { g ( s )}

s ⎣⎡1 − L {w ( s )} L { g ( s )}⎦⎤

.

(2.67)

Instantaneous Availability The steady state availability defined in Eq. (2.42) can now be replaced with the instantaneous availability A( t ) . In the absence of a repair mechanism the availability A( t ) is equivalent to the repairability, R ( t ) = 1 − A( t ) of the subsystem. The subsystem may be functioning at time t because of two mutually exclusive reasons, ƒ

The subsystem has not failed from the beginning.

ƒ

The last renewal occurred within the time period and the subsystem continued to function sine that time.

The probability associated with the second case is the convolution of the reliability function and the renewal density, giving, t

∫ R ( t − x ) h ( x ) dx ,

(2.68)

0

which results in a expression for the instantaneous availability of, t

A ( t ) = R ( t ) + ∫ R ( t − x ) h ( x ) dx . 0

Taking the Laplace transform of both sides of Eq. (2.69) gives, 27/130

(2.69)

L { A( s )} = L { R ( s )} + L { R ( s )} L{h ( s )} , = L { R ( s )} ⎡⎣1 + L {h ( s )}⎤⎦ ,

(2.70)

⎡ L {w ( s )} L{ g ( s )} ⎤ = L { R ( s )} ⎢1 + ⎥. ⎣⎢ 1 − L {w ( s )} L{ g ( s )} ⎦⎥

Since the reliability of the system is given as R ( t ) = 1 − W ( t ) , 1 L { A( s )} = − L {W ( s )} , s 1 L {w ( s )} 1 − L {w ( s )} = − = . s s s

(2.71)

Substituting gives, L { A( s )} =

1 − L {w ( s )}

s ⎡⎣1 − L {w ( s )} L { g ( s )}⎤⎦

.

(2.72)

Given the failure–rate distribution and the repair–time distribution, Eq. (2.72) can be used to compute the instantaneous availability as a function of time. Limiting Availability An important question to ask is – what is the availability of the system after some long period of time? The limiting availability A( t ) as t → ∞ is defined as A or simply the Availability. To derive an expression for the limiting availability the Final Value Theorem of Laplace transform can be used [Doet61], [Widd46], [ Brac65], [Ogat70], [Gupt66]. This theorem states that the steady state behavior of f ( t ) is the same as the

28/130

behavior of sF ( s ) in the neighborhood of s = 0 . Thus it is possible to obtain the value of f ( t ) as t → ∞ . Let, t

F ( t ) = ∫ f ( x ) dx + F ( 0 − ),

(2.73)

0

then using a table of Laplace transforms [Doet61], [Brac65], ∞

s L { F ( s )} − F ( 0 − ) = L {h ( s )} = ∫ e − st f ( t ) dt ,

(2.74)

0

and by letting s → 0, ∞

lim s L { H ( s )} = ∫ f ( t ) dt + F ( 0 − ) , s →0

0

⎡t ⎤ = lim ⎢ ∫ f ( x ) dx ⎥ + F ( 0 − ) , s →∞ ⎣0 ⎦ = lim F ( t ) .

(2.75)

t →∞

The Limiting availability is then given as, A = lim A( t ) = lim s L { A( s )} . t →∞

(2.76)

s →0

For small values of s the following approximations can be made [Apos74], e − st ≅ 1 − st ,

(2.77)

giving,

29/130



L {w ( s )} = ∫ e − st w ( t ) dt , 0





0

0

= ∫ w ( t ) dt − s ∫ tw ( t ) dt ,

(2.78)

2 1− . λ where MTTF =

1 and, λ

L { g ( s )} = 1 −

and where MTTR =

2 , µ

(2.79)

1 giving the limiting availability as, µ

s⎞ ⎡ ⎤ ⎛ 1 1− ⎜1− ⎟ ⎢ ⎥ MTTF λ ⎝ ⎠ . A = lim ⎢ ⎥= λ = s →0 ⎢ 1 − ⎛ 1 − s ⎞⎛ 1 − s ⎞ ⎥ 1 + 1 MTTF + MTTR ⎟ ⎢⎣ ⎜⎝ λ ⎟⎜ ⎠⎝ λ ⎠ ⎥⎦ λ µ

(2.80)

Eq. (2.80) is an important result in the analysis of system reliability, because it shows that the limiting availability depends only on the Mean Time to Failure and the Mean Time to Repair and not in the underlying distributions of the failure and repair times.

30/130

CHAPTER 3

SYSTEM RELIABILITY

This chapter provides the basis for the computation of the overall system reliability given a redundant architecture with partial fault detection coverage. Redundant systems can be modeled under variety operational assumptions. Of most interest in this paper are dual and triple redundant systems that contain repair facilities. Series Systems

Creating a reliable system often involves a series or parallel combination of independent systems or subsystems. If Ri ( t ) is the reliability of module i and all the modules are statistically independent, then the overall system reliability of modules connected in series is, Rseries ( t ) = ∏ Ri ( t ) .

(3.1)

For a series redundant system the failure probability Fseries is given by, n

Fseries ( t ) = 1 − Rseries ( t ) = 1 − ∏ Ri ( t ), i =1

(3.2)

n

= 1 − ∏ (1 − Fi ( t ) ) . i =1

Expanding Eq. (3.1) will illustrate an aspect of the exponential distribution. For a system of n subsystems connected in series the reliability of the system is given by

31/130

Eq. (3.1). If a general purpose hazard function is used for the failure rate [Shoo68] defined by, hi ( t ) = λ i + c i t k ,

(3.3)

where λ i , c i , and k are constants, then the reliability function for the individual subsystem is given by, ⎡ t k +1 ⎤ Ri ( t ) = exp− ⎢λ i t + ci ⎥, k +1 ⎦ ⎣

(3.4)

and the reliability functions for the system is given by, ⎡ n t k +1 n ⎤ Rseries ( t ) = exp− ⎢t ∑ λ i + ∑ci ⎥. k + 1 i =1 ⎦ ⎣ i =1

(3.5)

Defining two new terms for the summation of the failure rate and a new term for n

n

the time constant adjustment gives, λ = ∑ λ i , c = ∑ c i , and T = λ∗t results ∗



i =1

i =1

in the series reliability expression of, ⎡ ⎤ ⎛ c ∗ ⎞ 1 ⎛⎜ T k +1 ⎞⎟ ⎥ Rseries ( t ) = exp− ⎢T + ⎜ ∗ ⎟ . k ⎢ ⎝ λ ⎠ k + 1 ⎜⎝ ( λ ∗ ) ⎟⎠ ⎥ ⎣ ⎦

As the number of subsystems grows large ( λ∗ → ∞ ) , the term

(3.6)

c∗ is ( k + 1) λ∗

bounded and the expression for the system reliability becomes, lim Rseries ( t ) = e −T = e −λ t . ∗

(3.7)

n →∞

32/130

Eq. (3.7) defines the failure distribution of the system as the number of subsystems grows without bound. This implies that a large complex system will tend to follow exponential distribution failure models regardless of the internal organization of the subsystems. Parallel Systems

In a parallel redundant configuration, the system fails only if all modules fail. The probability of a system failure in a parallel system given by, n

Fparallel ( t ) = 1 − ∏ Fi ( t ).

(3.8)

i =1

The system reliability for a parallel system is given by, n

Rparallel ( t ) = 1 − Fparallel ( t ) = 1 − ∏ Fi ( t ) , i =1

n

= 1 − ∏ (1 − Ri ( t ) ) .

(3.9)

i =1

M–of–N Systems

An M–of–N system is a generalized form the parallel system. Instead of requiring only one of the N modules of the system to remain functional, M modules are required. The system of interest in this paper is a Triple Modular Redundant (TMR) configuration in which two of the three modules must function for the system to operate properly [Lyons 62], [Kuehn 69]. [3] For a given module reliability of Rm the TMR reliability is given by,

3

In practical TMR systems, a simplex mode is allowed, which usually places the system in a shutdown mode, allowing the controlled process to be safely stopped.

33/130

⎛ 3⎞ Rtm r = Rm3 + ⎜ ⎟ Rm2 (1 − Rm ) . ⎝2⎠

(3.10)

In Eq. (3.10) all working states are enumerated. The Rm3 term represents that ⎛ 3⎞ state in which all three modules are functional. The ⎜ ⎟ Rm2 (1 − Rm ) term ⎝2⎠ represents the three states in which any one module has failed and the two states in which a module is functional. Selecting the Proper Evaluation Parameters

In comparing different redundant system configurations, it is desirable to summarize their reliability by a single parameter. The reliability may be an arbitrary complex function of time. The selection of the wrong summary parameter could lead to incorrect conclusions, as will be shown below. Consider a simplex system, with a reliability function of, Rsim plex ( t ) = e −λt ,

(3.11)

and using Eq. (2.41) to derive the Mean Time to Failure results in, MTTFsim plex =

1 . λ

(3.12)

For a TMR system with an exponential reliability function, 3 2 ⎛3⎞ Rtm r ( t ) = ( e −λt ) + ⎜ ⎟ ( e −λt ) (1 − e −λt ) , ⎝2⎠ = 3e −2 λt − 2e −3λt ,

34/130

(3.13)

and using Eq. (2.40) results in a Mean Time to Failure of, MTTFtm r =

3 2 − . 2λ 3λ

(3.14)

Comparing the simplex and TMR reliability expressions gives, MTTFtm r =

5 1 ≤ = MTTFsim plex . 6λ λ

(3.15)

By using the MTTF figure of merit, the TMR system can be shown to be less reliable than the Simplex system. The above equations do not include the facility for module repair. Once the TMR system has exhausted its redundancy, there is more hardware to fail then the remaining modules of the non–redundant system. This effect lowers the total system reliability. With online repair, the MTTF figure of merit for the TMR system becomes an important measure of the overall system reliability. These results illustrate why simplistic assumptions and calculations may result in erroneous information.

35/130

36/130

CHAPTER 4

IMPERFECT FAULT COVERAGE AND RELIABILITY

Reliability models of systems with dynamic redundancy usually depend on perfect fault detection [Arno73], [Stif80]. The ability of the system to detect faults that occur can be classified as [Geis84], ƒ

Covered – faults that are detected. The probability that a fault belongs to this class is given by c.

ƒ

Uncovered – faults that are not detected. The probability that a fault belongs to this class is given by (1 − c ) .

The underlying diagnostic firmware and hardware may not provide perfect coverage for many reasons, primarily due to the complexity of the system under diagnosis [Rous79], [Cona72], [Wood79], [Soma86]. Because of this built–in complexity, an exhaustively tested set of diagnostics may not be possible. Another factor affecting the diagnostic coverage is the presence of intermittent faults [Dahb82], [Mall78]. The detection and analysis of these intermittent or permanent faults is further complicated by the presence of transient faults which behave as real faults but are only present in the system for a short time [Glas82], [Sosn86]. Modeling a fault–tolerant system in the presence of imperfect fault coverage becomes an important aspect in predicting the overall system reliability. Redundant System with Imperfect Coverage

Before developing the Markov method of analyzing Fault–Tolerant systems, a conditional probability method will be used to derive the MTTF and MTBF for a redundant system with imperfect fault detection [Bour69]. Assume that the failure rate for each subsystem of the redundant system is described by an independent 37/130

random variable λ . Let X denote the lifetime of a system with two modules, one active and the other in standby mode. Assume that the module in the standby mode does not experience a fault during the mission time interval. [4] Let Y be a random variable where, Y = 0 if a fault is not covered, and Y = 1 if a fault is covered, then, P { y = 0} = (1 − c ) and P { y = 1} = c . To compute the MTTF of this system, the conditional expectation value of the system lifetime X given the fault coverage state Y is must be derived. If an uncovered fault occurs the MTTF of the system is the MTTF of the initially active module, P { X Y = 0} =

1 . λ

(4.1)

If a covered fault occurs the MTTF of the system is the sum of the MTTF of the active module and the MTTF of the inactive module, P { X Y = 1} =

2 . λ

(4.2)

The total expectation value of the system lifetime is then given by, E[ X ] =

( 1 − c ) 2c λ

+

λ

=

(1 + c ) λ

= MTTF .

(4.3)

The computation of the system reliability depends on the combination of the two independent exponential distribution functions when a covered fault occurs,

4

This is an invalid assumption in a practical sense, but it greatly simplifies this example.

38/130

f ( x = t y = 1) = λ 2te −λt ,

(4.4)

and when an uncovered fault occurs f ( x = t y = 0 ) = λe −λt .

(4.5)

The joint exponential distribution function for both conditions is given by, f (t , y ) = f ( X = t y ) ⋅ P {y } , f ( t , y ) = λ (1 − c ) e −λt ; t > 0, y = 0, f ( t , y ) = λ c te 2

−λt

(4.6)

; t > 0, y = 1.

and the marginal density function of X is computed by summing over the joint density function, f ( t ) = λ 2 c te −λt + λ (1 − c ) e −λt .

(4.7)

The system reliability as a function of the coverage is then given by integrating the joint density function in Eq. (4.7) to give, t

R ( t ) = 1 − ∫ f ( x ) dx = 1, 0

t

= 1 − ∫ λ 2 c te −λt + λ (1 − c ) e −λt dt , 0



= 1 − ∫ λ c te 2

−λt

t

+ λ (1 − c ) e

−λt

(4.8)

dt ,

= (1 + c λt ) e −λt . Generalized Imperfect Coverage

In the previous example, the system consisted of two modules, one in the active state and one in the standby state. The conditional probability that a fault will go

39/130

undetected (uncovered) was computed using the conditional probability that the system will survive for a specified period. Cox [Cox55] analyzed the general case of a stage–type conditional probability distribution. The principle on which the method of stages is based is the memoryless property of the exponential distribution of Eq. (2.1) [Klie75]. The lack of memory is defined by the fact that the distribution of the time remaining for an exponentially distributed random variable is independent of the current age of the random variable, that is the variable is memoryless. Appendix D develops further the memoryless property of random variables with exponential distributions. In the generalized model, it is assumed that individual modules are always in one of two states – working or failed. It is also assumed that the modules are statistically independent and module repair can take place while the remainder of the system continues to function. In the general case of N active and S standby modules, the lifetime of the system is defined by a stage–type distribution. An active module has an exponential failure distribution with a constant failure rate λ . Assume that the modules in the standby state can fail at a rate µ (presuming 0 ≤ µ ≤ λ ). Let Xi (1 ≤ i ≤ N ) be a random variable denoting the lifetime of the active modules and let Y j (1 ≤ j ≤ S ) be a random variable denoting the lifetime of the standby modules. The system lifetime L is then, L ( m N , S ) = min ( X1 , X2 , K , XN ; Y1 , Y2 ,K , YS ) + L ( m N , S − 1) , = W ( N , S ) + L ( m N , S − 1) .

(4.9)

where W ( N , S ) is the time to first failure among the N + S modules. After the removal of the failed module, the system has N active modules and S − 1

40/130

standby modules. As a result N + S − 1 modules have not aged by the memoryless exponential assumption and therefore the system lifetime is, S

L ( m N , S ) = L ( m N , 0 ) + ∑W ( N , i ).

(4.10)

i =1

Here L ( m N , S ) = L ( m N , 0 ) is the lifetime of the m–out–N system and is therefore a k th order statistic with k = N − m + 1 [Kend77]. The distribution of L ( m N , 0 ) is an ( N − m + 1) – phase Hypoexponential distribution with

parameters N λ , ( N − 1) λ , K , m λ . The distribution for the time to first failure W ( N , i ) has an exponential distribution with the parameter N λ + i µ . Using Theorem D.1 in Appendix D, the distribution L ( m N , S ) has a

( N + S − m + 1) –stage Hypoexponential distribution [Koba78], [Cox55], [Ash70] with parameters N λ + Sµ, N λ + ( S + 1) µ, K , N λ + µ, N λ , ( N − 1) λ ,K , m λ . Let R⎡m N , S ⎤ ( t ) denote the reliability of such a system, then the reliability ⎣



function is defined as, S

N

i =1

i =m

R⎡m N , S ⎤ ( t ) = ∑ a j e −( N λ+ j µ ) + ∑ b i e −i λt , ⎣



(4.11)

where, S

ai = ∏ j =1 j ≠i

N λ + jµ N jλ , ∏ j µ − i µ j =m j λ − N λ − i µ

and,

41/130

(4.12)

N N λ + jµ jλ . ∏ j =1 ( N − i ) λ + j µ j = m j λ − i λ S

bi = ∏

(4.13)

j ≠i

Defining the constant K = λ µ gives a new expression for the active and standby terms in the reliability equation Eq. (4.11) of,

( NK + S )L ( NK + 1) ai = ⋅ ( NK + i )( S − i )L (1)( −1)L (1 − i ) ⎛

N ( N − 1)L m i ⎞ ⎛i ⎞⎛ i ⎜ + N − m ⎟L ⎜ + 1⎟ ⎜ ⎝K ⎠ ⎝K ⎠⎝ K

= ( −1)

i −1

( −1)

N − m +1

( NK + S ) ! S ! i N − m +1 ⋅ ( −1) L ( NK + i )( NK ) ! S ! ( S − 1) ! i !

⎛i⎞ N ( N − 1) ! ⎜ ⎟ ! ( N − m ) ! ⎝k⎠ , L⋅ i i ⎡⎛ ⎞ ⎤ ( m − 1) ! ⎢⎜ ⎟ M − m ⎥ ! ( N − m ) ! K ⎣⎝ K ⎠ ⎦ ⎛ NK + s ⎞ ⎛ S ⎞ ⎛ N − 1 ⎞ ⎜ S ⎟⎜ i ⎟⎜ m −1 ⎟ N − m +1 ⎝ ⎠⎝ ⎠⎝ ⎠ . = ( −1) ⎛i ⎞ i ⎞⎜ + N − m ⎟ ⎛ ⎜1+ ⎟ K ⎝ NK ⎠ ⎜ N − m ⎟ ⎝ ⎠

A similar expression can be developed for,

42/130

⎞ ⎟ ⎠

,

(4.14)

bi =

( NK + S )L ( NK + 1) N Lm ⋅ , ⎡⎣( N − i ) K + S ⎤⎦ L ⎡⎣( N − i ) K + 1⎤⎦ i ⎡⎣( N − 1)L (1)( −1)L ( m − i ) ⎤⎦

( NK + S ) ! ⎡⎣( N − 1) K ⎤⎦ ! N ! ( −1) = , ( NK ) ! ⎡⎣( N − i ) K + S ⎤⎦ ! i ( m − 1) ! ( N − i ) ! ( i − m ) ! ( NK + S ) ! S ! ( ( N − i ) K ) ! N ! i ! m i −m = ( −1) , S ! ( NK ) ! ⎡⎣( N − i ) K + S ⎤⎦ ! i ! ( i − m ) ! m ! i −m

= ( −1)

i −m

(4.15)

⎛ NK + S ⎞ ⎛ N ⎞ ⎛ i ⎞ ⎜ S ⎟⎜ i ⎟⎜ m ⎟ ⎝ ⎠⎝ ⎠⎝ ⎠ . N i )K + S ⎞ − ⎛ ( i ⎜ ⎟ m⎝ S ⎠

An expectation value of the reliability function derived from a general stage–type distribution can be found using the Laplace transform [Cox 55]. The Laplace transform of a stage–type random variable X is,

µj , j =1 s + µ j i

r

L X ( s ) = γ 1 + ∑ β 1 β 2 L β i γ i +1 ∏ i =1

(4.16)

where γ i + β i = 1 for 1 ≤ i ≤ r and γ r +1 = 1 . Defining the Laplace transform of the system described in Eq. (4.9) gives, i

S

L X ( s ) = ∑ c i −1 (1 − c ) ∏ i =1

j =1

N λ + ( S − j + 1) µ L s + N λ + ( S − j + 1) µ

N λ + j µ N − M +1 ( N − j + 1) λ L+ c ∏ . ∏ j =1 s + ( N λ + j µ ) j =1 s + ( N − j + 1) λ 2

S

(4.17)

By inverting the transformation in Eq. (4.17) an expression for the MTTF with imperfect coverage can be given as,

43/130

S

E [ X ] = ∑ c i −1 (1 − c ) i =1

N ⎧S 1 1 1 ⎪⎫ 2 ⎪ + + c ⎨∑ ⎬. ∑ ∑ j = S − i +1 N λ + j µ ⎩⎪ j =1 N λ + j µ j = M j λ ⎭⎪ S

(4.18)

The details of the above development are described in more detail in [Ing76], [Chan72], [King69], [Saat65], [Math70], [Triv82]. In the example described above, the system does not provide for repair. When repairable systems are analyzed in this manner, the number of stages becomes infinite. To deal with the infinite number of conditional probabilities a different technique must be employed. The Markov Chain is just such a technique, capable of dealing with a system configuration of many modules, each with repairability. An additional caution should be noted. The assumption of statistical independence is questionable in the case of stage–type failure distributions. In addition, the fixed probability distribution associated with each failure in the stage–type should be removed in the detailed analysis [Rams76].

44/130

CHAPTER 5

MARKOV MODELS OF FAULT–TOLERANT SYSTEMS

A generalized modeling technique is required to deal with an arbitrary number of modules, failure events, and repair events in the analysis of Fault–Tolerant systems [Boss82]. Several techniques are available, including Petri Nets [Duga84], [Duga85], Fault Tree Analysis [Fuss76], Failure Mode and Effects Analysis [Mil1629], [Jame74], Event Tree Analysis [Gree82], and Hazard and Operability Studies [Lee80], [Robi78], [Smit85]. When system components are not independent, a state based analysis technique is needed which includes redundancy and repair [Biro86], [Guid86]. A Continuous Parameter Markov Chain is a method used to analyze systems that have state transitions that include repair processes [Hoel72], [Kend50], [Kend53]. A Markov Process is a stochastic process whose dynamic behavior is such that the probability distributions for its future behavior depend only on the present state and not how the process arrived in that state [Mark07], [Fell67], [Issa76], [Chun76], [Kulk84]. To illustrate the principles of a Markov process, consider a system S described in Figure 3, which is changing over time in such a way that its state at any instant in time v can be described in terms of a finite dimensional vector X ( t ) , [Triv74], [Triv75a], [Triv75]. Assume that the state of the system at any time t , for t > v can be described by a predetermined function of the starting state v and the

ending state t:

45/130

X ( t ) = G ⎡⎣ X ( v ) , t ⎤⎦ .

(5.1)

Given a set of reasonable starting conditions and the continuity of the function G a differential equation for X ( t ) describing the rate at which transitions between each state of the system takes place can be derived by expanding both sides of Eq. (5.1) in powers of t to give, dx = H ⎡⎣ X ( t ) ⎤⎦ . dt

(5.2)

Finite–dimensional deterministic systems described by the set of state vectors are equivalent to systems described by sets of ordinary differential equations [Bell60], [Brau67], [Beiz78], [Brue80]. This property will serve as the basis for analysis of fault–tolerant systems that include repair. It will be assumed that the system described by the set of differential equation in Eq. (5.2) can exist in only one of the finite number of states [Keme60], [Koba78]. The transition from state i to state j in this system takes place with some random probability defined by, p ij ( v , t ) = P { X ( t ) = j X ( v ) = i } , t ≥ v ; i , j ∈ S.

(5.3)

Eq. (5.3) is the conditional pdf of the system of state transitions and satisfies the relation,

∑ p ( v , t ) = 1;

∀i ∈S

j

0 ≤ v ≤ t.

(5.4)

The unconditional pdf of the state transition vector X ( t ) is given by,

46/130

p j ( t ) = P {X ( t ) = j} ,

j = 1, 2, 3,K

(5.5)

with,

∑ p ( t ) = 1,

∀j ∈S

j

∀t > 0,

(5.6)

since the process at any time t must be in a unique state. An Absorbing Markov Process is one in which transitions have the following properties [Gave73], ƒ

There is at least one absorbing state,

ƒ

From every state, it is possible to get to the absorbing state.

L

i

j

v i

v

t

L

k

u

L

j

t

Figure 3 – State Transition probabilities as a function of time in the Continuous–Time Markov chain that is subject to the constraints of the Chapman–Kolmogorov equation.

The fundamental assumption of the Markov model is that the probability of a given state transition depends only on the current state of the system and not on any previous state. For continuous–time Markov processes, that is, those 47/130

described by ordinary differential equations, the length of time already spent in the current state does not influence either the probability distribution of the next state or the probability distribution of the remaining time in the same state before the next transition. The Markov model fits with the standard assumption of the reliability models developed so far in this paper, that the failure rates are constant, leading to an exponentially distributed state transition time for failures and a Poisson distribution for the occurrence of these failures. Solving the Markov Matrix

In order to describe a continuous–time Markov process using transition matrices, it is necessary to specify the entire family of stochastic matrices, { P ( t )} . Only those matrices that meet certain conditions are useful in finding the solution to the final absorption state rate of the system described by the Markov Chain [Cour77]. Initial value problems involving systems of equations may be solved using the Laplace transform. The advantage of this technique over traditional methods (Elimination, Eigenvalue solutions, and Fundamental Matrix [Pipe63], [Cour43]) is that satisfaction of initial values is automatically provided. No special techniques are needed to find particular solutions of the fundamental matrix, such as repeated eigenvalues [Lome88]. Chapman–Kolmogorov Equations A set of differential equations describing the transitions between each state can be derived if the following conditions are met by the transitions probability matrix [Bhar60], [Parz62], [Howa71]. These equations are the Chapman–Kolmogorov Equations and are defined as the transition probabilities of the Markov chain that satisfy Eq. (5.7) for all i and j, using Figure 3 as an example,

48/130

p ij ( v , t ) = ∑ p ik ( v , u ) ⋅ p kj ( u , t ) .

(5.7)

k

A simplified notation for the matrix elements defined in Eq. (5.7) can be created where the elements of each matrix are given by,

H (v , t ) = Η (v , u ) H (u , t ) , v ≤ u ≤ t

(5.8)

and where,

H ( t , t ) = I,

(5.9)

is the identity matrix. The Forward Chapman–Kolmogorov Equation is now defined as, ∂ H ( v , t ) = H ( s, t ) Q (t ) , v ≤ t , ∂t

(5.10)

where the new matrix Q ( t ) is defined as,

P (t ) − I , ∆t → 0 ∆t

Q ( t ) = lim

(5.11)

∆t = t − v .

(5.12)

with,

The matrix Q ( t ) is now defined as the transition rate matrix [Papo65a]. The elements of Q ( t ) are q ij ( t ) and are defined by,

49/130

q ii ( t ) = lim

∆t → 0

p ii ( t , t + ∆t ) − 1 , ∆t

(5.13)

and q ij ( t ) = lim

p ij ( t , t + ∆t ) − 1

∆t → 0

∆t

, i ≠ j.

(5.14)

If the system at time t is in state i, then the probability that a transition occurs to any state other than state i during the time interval t + ∆t is given by, − q ii ( t ) ∆t + o ( ∆t ) ,

(5.15)

where o ( h ) is any function of h that approaches zero faster than h, that is o (h ) = 0. Eq. (5.13) is the rate at which the process departs state i when the h →0 h

lim

starting in state i. Similarly, given that the system is in state i at time t, the conditional probability that it will make a transition from state i to state j in the time interval [t , t + ∆t ] is given by, q ij ( t ) ∆t + o ( t ) .

(5.16)

Eq. (5.14) is the rate at which the process moves from state i to state j given that the system is in state i, since,

∑ p ( v , t ) = 1,

(5.17)

ij

then Eq. (5.13) and Eq. (5.14) implies,

50/130

∑ q ( t ) = 0, ij

∀i ∈ S.

(5.18)

Using these developments, the Backward Chapman–Kolmogorov equation is given by, ∂ H ( v , t ) = −Q ( v ) H ( v , t ) , v ≤ t . ∂v

(5.19)

The forward equation may be expressed in terms of its elements, ∂ p ij ( v , t ) = q jj ( t ) p ij ( v , t ) + ∑ q kj ( t ) p ik ( v , t ). ∂t k≠ j

(5.20)

The initial state i at the initial time v affects the solution of this set of differential equations only through the following conditions, ⎧ 1, i = j pij ( v , v ) = ⎨ ⎩0, i ≠ j

(5.21)

The backward matrix equation may be expressed in terms of its elements, ∂ pij ( v , t ) = −q jj ( t ) pij ( v , t ) − ∑ q ik ( t ) pkj ( v , t ), ∂t k≠ j

(5.22)

with the initial conditions, ⎧ 1, i = j pij ( t , t ) = ⎨ ⎩0, i ≠ j

(5.23)

Markov Matrix Notation

The expressions developed in the previous section can be represented by a transition probability matrix [Papo62] of the form, 51/130

⎡ pmn L L L ⎢ M O ⎢ O P = ⎡⎣ pij ⎤⎦ = ⎢ M ⎢ p11 ⎢ M ⎢⎣ p0 n p01

pm 0 ⎤ M ⎥ ⎥ M ⎥. ⎥ p10 ⎥ p00 ⎥⎦

The entries in this matrix satisfy two properties; 0 ≤ pij ≤ 1 and

∑p

ij

= 1 which

j

is a restatement of Eq. (5.17). The Transition Probability Matrix can also be represented by a directed graph [Maye72], [Deo74]. A node labeled i in the directed graph represents state i of the Markov Chain and a branch labeled pij from node i to node j implies that the conditional probability P { X n = j X n −1 = j } = pij is met by the Markov Process represented by the

directed graph. The transition probabilities represent a set of differential equations describing the rate at which the transitions take place between each node in the directed graph. The differential equations are then represented by a matrix structure of, ⎡d ⎤ ⎢ dt Pn ⎥ ⎢ ⎥ ⎡ pmn L L ⎢ M ⎥ ⎢ M O ⎢d ⎥=⎢ O ⎢ P1 ⎥ ⎢ p1n ⎢ dt ⎥ ⎢ p K K ⎢ d ⎥ ⎣ 0n ⎢⎣ dt P0 ⎥⎦

pm 0 ⎤ ⎡ Pn ⎤ M ⎥⎢ M ⎥ ⎥ ⎢ ⎥. p10 ⎥ ⎢ P1 ⎥ ⎥⎢ ⎥ p00 ⎦ ⎣ P0 ⎦

The solution to this set of linear homogeneous differential equations can be derived by elimination using the Laplace transform method.

52/130

Laplace Transform Techniques

Given a set of differential equations in Eq. (5.20) and Eq. (5.22), the Laplace transform can be used to generate solutions to these equations [Lome88]. One advantage of using the Laplace transform method is it ability to handle initial conditions automatically, without having first to find a general solution and then having to evaluate the integration constants. The Laplace transform is defined as, ∞

F ( s ) = ∫ e − st f ( t ) dt = L { f ( t )}

(5.24)

0

The differential equation solution method depends on the following operational property of the Laplace transform [Krey72]. The Laplace transform of the derivative of a function is, ∞

L { f ′ ( t )} = ∫ e 0

− st

b ⎤ b − st ⎡ − st f ′ ( t ) dt = lim ⎢e f ( t ) ⎥ + s ∫ e f ( t ) dt . b →∞ 0⎦ 0 ⎣

(5.25)

In the limit, the integral appearing on the right–hand side of Eq. (5.25) is L { f ( t )} , so that the first term in Eq. (5.25) can be evaluated in the following

manner [McLac39], lim e − sb f ( b ) − e 0 f ( 0 ) .

(5.26)

b →∞

Using the property of absolute values and limits [Arfk70], Eq. (5.26) can be rewritten as, lim e − sb f ( b ) ≤ lim e − sb f ( b ) .

b →∞

(5.27)

b →∞

53/130

The term f ( b ) is of the order e ab as b → ∞ . For b > T using the definition for exponential order, Eq. (5.27) can be reevaluated to the following, lim e − sb f ( b ) ≤ lim e − sb Me αb = lim Me −( s −α )b .

b →∞

b →∞

b →∞

(5.28)

The function f ( b ) is said to be of exponential order as b → ∞ if there exists a constant α such that: e −α b f ( b ) , is bounded for all t greater than some T. If this statement is true, there also exists a constant M, such that f ( b ) < Me α t , t > T . Figure 4 – Definition of the exponential order of a function.

If s > α , then s − α > 0, giving, lim Me −( s −α )b = 0,

(5.29)

b →∞

so that in the limit, lim e − sb f ( b ) = 0,

(5.30)

b →∞

giving the final form of the Laplace transform of a differential equation as, L { f ′ ( t )} = s L { f ( t )} − f ( 0 ) .

(5.31)

The notation for the Laplace transform for the differential equation for the rate of arrival at the transition state i is then given by, L {Pi ( t )} ⇒ Pi ( s ) .

(5.32)

54/130

From this point on, this Laplace transform notation will be used in the solution of the Markov transition matrix differential equations. Using the expression R ( t ) = 1 − F ( t ) = P {T ≥ t } to define the system reliability, where F ( t ) is the probability distribution function of the time to failure, a new random variable, Y, can be defined which represents the expected time to system failure. A notation can be defined such that f Y ( t ) = −

dR ( t ) dP0 ( t ) = is the failure density of the dt dt

random variable Y. The Laplace transform of this failure density is denoted by L { f Y ( t )} ⇒ LY ( s ) = f Y ( s ) = sP0 ( s ) . In this paper P0 ( s ) represents the

absorbing state of the Markov model. By using the Laplace transform notation in the solution of differential equations, the inverse transform can be used to generate the failure density function for the random variable Y. Using Eq. (2.38) the derivative of the failure density function can be integrated to produce the ∞ ⎛ d ⎞ Mean Time to Failure MTTF = E [Y ] = − ∫ t ⎜ R ( t ) ⎟ . The inversion of the 0 ⎝ dt ⎠

Laplace transform may be straightforward in some cases and more complex in other cases. MODELING A DUPLEX SYSTEM

Duplex systems or Parallel Redundant systems have been utilized in electronic central office switching systems and other high–reliability systems for the past 35 years [Toy78]. Parallel redundant systems depend on fault detection and recovery for their proper operation. In most dual redundant architectures both system are monitored continuously, providing fault detection in the primary subsystem as well as the standby subsystem. This section describes the detailed development of the Markov model for a parallel redundant system with perfect diagnostic coverage. The failure rate of 55/130

both subsystems are assumed to be a constant λ and the repair rate a constant

µ . The system is considered failed when both subsystems have failed. The number of properly functioning subsystems is described in the state space S ⇒ {2,1, 0} , where {0} is the failure state of the system. The state diagram for

the system is shown in Figure 5.

2λ 2

1

λ

0

µ Figure 5 – the state transition diagram for a Parallel Redundant system with repair. State {2} represents the fault free operation mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode, the absorption state.

The initial state of the system is {2} and the initial conditions for the transition equations are, P2 ( 0 ) = 1,

P1 ( 0 ) = P0 ( 0 ) = 0.

(5.33)

Using the initial conditions, the system of differential equations derived from the transition matrix,

56/130

⎡ dP2 ( t ) ⎤ µ 0 ⎤ ⎡ P2 ( t ) ⎤ ⎢ dt ⎥ ⎡ −2λ ⎥ ⎢ ⎥⎢ ⎢ ⎥ ⎥ ⎥⎢ ⎢ dP1 ( t ) ⎥ ⎢ ⎢ ⎥ = ⎢ 2λ − ( λ + µ ) λ ⎥ ⎢ P1 ( t ) ⎥ , ⎥ ⎥⎢ ⎢ dt ⎥ ⎢ ⎢ ⎥ ⎢ ⎥ ⎢ dP0 ( t ) ⎥ ⎢ ⎢ ⎥ 2λ 0 ⎦ ⎣ P0 ( t ) ⎥⎦ ⎢ ⎥ ⎣ 0 ⎢⎣ dt ⎥⎦

are given by, dP2 ( t ) = −2λP2 ( t ) + µP1 ( t ) , dt dP1 ( t ) = 2λP2 ( t ) − ( λ + µ ) P1 ( t ) , dt dP0 ( t ) = λP1 ( t ) . dt

(5.34)

Using the Laplace transform solution technique described in the previous section and in detail in [Doet61], [Widd46], [Lome88], [Rea78], and [Lath65] gives the following set of equations in Laplace form,

sP2 ( s ) − 1 = −2λ P2 ( s ) + µ P1 ( s ) , sP1 ( s ) = 2λ P2 ( s ) − ( λ + µ ) P1 ( s ) ,

(5.35)

sP0 ( s ) = λ P1 ( s ) . Solving Eq. (5.35)(a) for the final failed state {2} gives, sP2 ( s ) + 2λP2 ( s ) = µP1 ( s ) + 1,

( s + 2λ ) P2 ( s ) = µP1 ( s ) + 1, µP ( s ) + 1 , P2 ( s ) = 1 ( s + 2λ )

(5.36)

57/130

and solving for Eq. (5.36)(b) for state {2} gives, sP1 ( s ) = 2λP2 ( s ) − ( λ + µ ) P1 ( s ) , sP1 ( s ) + ( λ + µ ) P1 ( s ) = 2λP2 ( s ) , P2 ( s ) =

(5.37)

( s + λ + µ ) P1 ( s ) 2λ

.

Equating Eq. (5.36) and Eq. (5.37) a solution representing state {1} can be derived, giving,

( s + λ + µ ) P1 ( s ) µ P1 ( s ) + 1 = . 2λ ( s + 2λ )

Multiplying each side by

1

P1 ( s )

gives,

( s + λ + µ )( s + 2λ ) = 2λµ +

(5.38)

(s + λ + µ ) 2λ

2λ . P1 ( s )

µ+ =

1

P1 ( s ) , which results in, ( s + 2λ )

(5.39)

Solving Eq. (5.39) for state {1} gives, P1 ( s ) =

2λ . ( s + λ + µ )( s + 2λ ) − 2λµ

(5.40)

Expanding and simplifying Eq. (5.40) gives, P1 ( s ) =

2λ . s + 3λ s + 2λ 2 + s µ

(5.41)

2

58/130

Substituting Eq. (5.41) into Eq. (5.35)(c) gives the solution to the final absorbing state {0} as, sP0 ( s ) = λ P1 ( s ) , ⎡ ⎤ 2λ sP0 ( s ) = λ ⎢ 2 , 2 ⎥ ⎣ s + s ( 3λ + µ ) + s λ ⎦ 2λ 2 P0 ( s ) = . s ⎡⎣s 2 + s ( 3λ + µ ) + s λ 2 ⎤⎦

(5.42)

After producing the inverse Laplace transform of Eq. (5.42)(c), the probability that no subsystems are operating at time, t > 0 is the result. Let the random variable Y be the time to failure of the system and P0 ( t ) be the probability that the system has failed at or before time t. The reliability of the system is then defined by, R ( t ) = 1 − P0 ( t ) .

(5.43)

Using Eq. (2.37), the failure density function for the random variable Y is given by, f Y (t ) = −

dR dP0 ( t ) , = dt dt

(5.44)

and using Eq. (5.31), its Laplace transform is given by, LY ( s ) = f Y ( s ) = sP0 ( s ) − P0 ( 0 − ) =

2λ 2 . s 2 + ( 3λ + µ ) s + 2λ 2

Inverting Eq. (5.45) gives the failure density of Y as,

59/130

(5.45)

2λ 2 fY (t ) = e −α 2t − e −α1t ) , ( α1 − α 2

(5.46)

where,

α1 , α 2 =

( 3λ + µ ) ± λ 2 + 6λµ + µ 2 2

.

(5.47)

Using Eq. (2.28), the MTTF of the Parallel Redundant system with repair is given by, ∞

E [Y ] = ∫ y f Y ( y ) dy = 0

∞ ⎡∞ ⎤ 2λ 2 = ⎢ ∫ y e −α2 y dy − ∫ y e −α1 y dy ⎥ , α1 − α 2 ⎣ 0 0 ⎦ 2λ 2 ⎡ 1 1 ⎤ = ⎢ 2 − 2 ⎥, α1 − α 2 ⎣ α 2 α 1 ⎦

=

2λ 2 ( α1 − α 2 ) , = α12 α 22 = =

2λ 2 ( 3λ + µ )

( 2λ )

2 2

(5.48)

,

3 µ + 2. 2 λ 2λ

The MTTF of a two element Parallel Redundant system without repair ( µ = 0 ) would have been equal to the first term in Eq. (5.48)(c). The effect of adding a repair facility to the system increases the mean life of the system by, MTTF as a result of Repair =

µ , 2λ 2

60/130

(5.49)

or a factor of, µ

2λ 2 = µ , 3 3λ 2λ

(5.50)

over a system without repair facilities. MODELING A TRIPLE–REDUNDANT SYSTEM

A Triple Modular Redundant (TMR) system continues to operate correctly as long as two of the three subsystems are functioning properly. A second subsystem failure causes the system to fail. This model is referred to as 3–2–0. A second architecture (shown in Figure 7) is possible in which the system will continue to operate in the presence of two (2) subsystem failures. This system operates in simplex mode 3–2–1–0. The 3–2–0 model without coverage will be developed in this section. Figure 6 describes a TMR system with a constant failure rate λ and a constant repair rate µ . The repair activity takes place with a constant response time whenever a subsystem fails, giving a Markov transition matrix of, ⎡ dP2 ( t ) ⎤ µ 0 ⎤ ⎡ P2 ( t ) ⎤ ⎢ dt ⎥ ⎡ −3λ ⎥ ⎢ ⎥⎢ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ dP t ( ) ⎢ ⎢ ⎥ ⎢ 1 ⎥ = 3λ − ( 2λ + µ ) λ P ( t ) ⎥ . ⎥⎢ 1 ⎥ ⎢ dt ⎥ ⎢ ⎥ ⎢ ⎥⎢ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ dP ( t ) ⎥ 2λ 0 ⎥⎦ ⎢⎣ P0 ( t ) ⎥⎦ ⎢ 0 ⎥ ⎣⎢ 0 ⎣⎢ dt ⎦⎥

(5.51)

The set of differential equations derived from the transition matrix is given by,

61/130

dP2 ( t ) = −3λP2 ( t ) + µP1 ( t ) , dt dP1 ( t ) = 3λP2 ( t ) − ( 2λ + µ ) P1 ( t ) , dt dP0 ( t ) = 2λP1 ( t ) . dt

(5.52)

Rewriting the differential equations in the Laplace transform format gives,

sP2 ( s ) − 1 = −3λP2 ( s ) + µP1 ( s ) , sP1 ( s ) = 3λP2 ( s ) − ( 2λ + µ ) P1 ( s ) ,

(5.53)

sP0 ( s ) = 2λP1 ( s ) . Using Eq. (5.53)(a) and Eq. (5.53)(b) to solve for state {2} gives, sP2 ( s ) + 3λP2 ( s ) = µP1 ( s ) + 1,

( s + 3λ ) P2 ( s ) = µP1 ( s ) + 1, µP ( s ) + 1 . P2 ( s ) = 1 ( s + 3λ )

(5.54)

62/130

3λ 2

1



0

µ Figure 6 – The transition diagram for a Triple Modular Redundant system with repair. State {2} represents the fault free (TMR) operation mode, State {1} represents a single fault (Duplex) operation mode with a return path to the fault free mode, and State {0} represents the system failure mode, the absorbing state.

Using Eq. (5.54)(a) and Eq. (5.54)(b) again to solve for state {2} gives, sP1 ( s ) = 3λP2 ( s ) − ( 2λ + µ ) P1 ( s ) , sP1 ( s ) + ( 2λ + µ ) P1 ( s ) = 3λP2 ( s ) , P2 ( s ) =

( s + 2λ + µ ) 3λ

(5.55) P1 ( s ) .

Equating (5.54) and Eq. (5.55) and solving for state {1} gives,

( s + 2λ + µ ) 3λ

P1 ( s ) =

µP1 ( s ) + 1 , ( s + 3λ )

3λ . P1 ( s ) = ( s + 2λ + µ )( s + 3λ ) − 3λµ Simplifying Eq. (5.56)(b) gives,

63/130

(5.56)

P1 ( s ) =

3λ . s + 5λs + 6λ 2 + µs

(5.57)

2

Substituting the solution for state {1} , Eq. (5.57), into Eq. (5.54)(c) gives the solution for the final absorbing state {0} , ⎡ ⎤ 3λ sP0 ( s ) = 2λP1 ( s ) = 2λ ⎢ 2 ⎥, 2 ⎣ s + 5λs + 6λ + µs ⎦ 6λ 2 P0 ( s ) = s ( s 2 + 5λs + 6λ 2 + µs ) .

(5.58)

Expanding and factoring the denominator of Eq. (5.58)(b) gives the differential equation for the absorption state as,

P0 ( s ) =

6λ 2

( (

s s + 21 5λ + µ − λ 2 + 10λµ + µ 2

)) ( s + (5λ + µ + 1 2

λ 2 + 10λµ + µ 2

))

(5.59) Expanding the partial fractions of Eq. (5.59) and taking the inverse Laplace transform, results in the following reliability function,

R (t ) = L−

5λ + µ + λ 2 + 10λµ + µ 2 2 λ + 10λµ + µ 2

2

5λ + µ − λ 2 + 10λµ + µ 2 2 λ + 10λµ + µ 2

2

e

e

(

− 21 5 λ+µ− λ 2 +10 λµ+µ 2

(

− 21 5 λ+µ+ λ 2 +10 λµ+µ 2

)L

).

Integrating Eq. (5.60) using Eq. (2.24) produces the MTTF of,

64/130

(5.60)

MTTF = L−

5λ + µ + λ 2 + 10λµ + µ 2

( 5λ + µ ) λ 2 + 10λµ + µ 2 − λ 2 − 10λµ − µ 2 5λ + µ − λ 2 + 10λµ + µ 2

( 5λ + µ ) λ 2 + 10λµ + µ 2 + λ 2 + 10λµ + µ 2

L (5.61)

.

Simplifying Eq. (5.61) gives the MTTF for a TMR system with repair as, MTTF =

5λ + µ . 6λ 2

(5.62)

Rearranging Eq. (5.62) and isolating the repair term from the failure term gives, MTTF =

5 µ + 2. 6λ 6λ

(5.63)

MODELING A PARALLEL SYSTEM WITH IMPERFECT COVERAGE

A more realistic model of a Parallel Redundant System assumes that not all faults are recoverable and that the coverage factor c denotes the conditional probability that the system detects the fault and survives. The state diagram for this system is shown in Figure 7

65/130

2λc

2

1

λ

0

µ 2λ ( 1 − c )

Figure 7 – The transition diagram for a Parallel Redundant system with repair and imperfect fault coverage. State {2} represents the fault free mode, State {1} represents a single fault with a return path to the fault free mode by a repair operation, and State {0} represents the system failure mode. State {0} can be reached from State {2} through an uncovered fault, which causes the system to fail without the intermediate State {1} mode.

The transition matrix for Figure 7 is, ⎡ dP2 ( t ) ⎤ 0 ⎤ ⎡ P2 ( t )⎤ µ ⎢ dt ⎥ ⎡ −2λc + 2λ (1 − c ) ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ dP t ( ) ⎢ ⎥ ⎢ ⎥ ⎢ 1 ⎥= − ( λ + µ ) λ ⎥ ⎢ P1 ( t ) ⎥ , 2λc ⎢ ⎢ dt ⎥ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎢ ⎥ ⎢ ⎥ ⎢ dP ( t ) ⎥ 0 ⎢ ⎥ ⎢ ⎥ λ − λ 2 1 c 2 0 P t ⎢ ⎥ ⎣ ( ) ⎦ ⎣ 0 ( )⎦ ⎣⎢ dt ⎦⎥ With an initial state of {2} producing a set of starting conditions,

66/130

(5.64)

P2 ( 0 ) = 1, P1 ( 0 ) = P0 ( 0 ) = 0 , the system of equations describing the state transitions are, dP2 ( t ) = −2λc P2 ( t ) − 2λ (1 − c ) P2 ( t ) + µP1 ( t ) , dt dP1 ( t ) = 2λc P2 ( t ) − ( λ + µ ) P1 ( t ) , dt dP0 ( t ) = 2λ (1 − c ) P2 ( t ) + λP1 ( t ) . dt

(5.65)

Using the Laplace transform method, the above equations are reduced to, sP2 ( s ) − 1 = −2λP2 ( s ) + µP1 ( s ) , sP1 ( s ) = 2λc P2 ( s ) − ( λ + µ ) P1 ( s ) ,

(5.66)

sP0 ( s ) = 2λ (1 − c ) P2 ( s ) + λP1 ( s ) .

Using Eq. (5.66)(a) and solving for state {2} gives, sP2 ( s ) − 2λP2 ( s ) = µP1 ( s ) ,

( s − 2λ ) P2 ( s ) = µP1 ( s ) , µP ( s ) + 1 . P2 ( s ) = 1 ( s + 2λ )

(5.67)

Using Eq. (5.66)(b) to solve for state {2} gives, sP1 ( s ) = 2λc P2 ( s ) − ( λ + µ ) P1 ( s ) ,

( s + λ + µ ) P1 ( s ) = 2λc P2 ( s ) , ( s + λ + µ ) P1 ( s ) P2 ( s ) = . 2 λc

67/130

(5.68)

Equating Eq. (5.67)(c) and Eq. (5.68)(c) and solving for state {1} gives, µP1 ( s ) + 1 ( s + λ + µ ) P1 ( s ) = . 2λ c ( s + 2λ )

(5.69)

Simplifying Eq. (5.69) and solving for state {1} gives, 2λµc P1 ( s ) + 2λc = ( s + 2λ )( s + λ + µ ) P1 ( s ) , P1 ( s ) =

2λc . ( s + 2λ )( s + λ + µ ) − 2λµc

(5.70)

Using Eq. (5.66)(a) and solving for state {1} gives, sP2 ( s ) − 2λP2 ( s ) = µP1 ( s ) ,

( s − 2λ ) P2 ( s ) = µP1 ( s ) , ( s − 2λ ) P1 ( s ) − 1 P1 ( s ) = .

(5.71)

µ

Using Eq. (5.66)(b) and solving for state {1} gives, sP1 ( s ) = 2λc P2 ( s ) − ( λ + µ ) P1 ( s ) ,

( s + λ + µ ) P1 ( s ) = 2λc P2 ( s ) , P1 ( s ) =

(5.72)

2 λc P (s). ( s + λ + µ) 2

Equating Eq. (5.71) and Eq. (5.72) and solving for state {2} gives,

68/130

( s + 2λ ) P2 ( s ) − 1 µ

=

2 λc P (s), ( s + λ + µ) 2

( s + λ + µ) . P2 ( s ) = ( s + 2λ )( s + λ + µ ) − 2λµc

(5.73)

Substituting Eq. (5.70) and Eq. (5.73) into Eq. (5.66)(c) and solving for state {0} gives,

sP0 ( s ) = 2λ (1 − c ) P2 ( s ) + λP1 ( s ) , s +λ+µ 2λ c ⎪⎧ ⎪⎫ ⎪⎧ ⎪⎫ = 2λ (1 − c ) ⎨ ⎬+λ⎨ ⎬, ⎩⎪ ( s + 2λ )( s + λ + µ ) − 2λµc ⎭⎪ ⎩⎪ ( s + 2λ )( s + λ + µ ) − 2λµc ⎭⎪ =

2λ + 2λ 2 + 2λµ − 2λsc − 2λ 2 c − 2λµc + 2λ 2 c , ( s + 2λ )( s + λ + µ ) − 2λµc

=

2λ ( s + λ + µ − sc − µc ) . ( s + 2λ )( s + λ + µ ) − 2λµc

(5.74) Simplifying Eq. (5.74) for state {0} gives, P0 ( s ) =

s + λ + µ − c ( s + µ) 2λ ⋅ . s ( s + 2λ )( s + λ + µ ) − 2λµc

(5.75)

If X is the random variable describing the time to the system failure, then, FX ( t ) = P0 ( t ) ,

(5.76)

and,

69/130

f X ( s ) = L X = sP0 ( s ) =

2λ ⎡⎣( s + λ + µ ) − c ( s + µ ) ⎤⎦

( s + 2λ )( s + λ + µ ) − 2λµc

.

(5.77)

Inverting Eq. (5.77) is difficult in this case and an alternative method of finding the mean time to system failure will be used. If X denotes the time to failure of a system, then from the knowledge of the Laplace transform L X ( s ) the MTTF can be obtained using the Moment Generating Property of the Laplace transform. Let X be a random variable possessing a Laplace transforms L X ( s ) . Then the kth ( k = 1, 2, K , n ) moment of X is given by, E ⎡⎣ X k ⎤⎦ = ( −1)

n

d n LX ( s ) . s=0 ds n

(5.78)

In this case Eq. (5.77) can be rewritten as, 2λ U , V

(5.79)

U = s + λ + µ − c ( s + µ) ,

(5.80)

V = ( s + 2λ )( s + λ + µ ) − 2λµc .

(5.81)

LX ( s ) =

where,

and,

70/130

The evaluation of the derivative term in Eq. (5.79) can be performed using the quotient rule [Midd46], [Smai49],

d ⎡U ⎤ = ds ⎢⎣V ⎥⎦

V⋅

d d U −U ⋅ V ds ds . 2 V

(5.82)

This method is useful when it is difficult to obtain the density function f X ( t ) and the reliability function RX ( t ) . Using the moment generating property, Eq. (5.78) is evaluated as, E[X ] = − =

2λ ⎡⎣U ( 2s + 3λ + µ ) − V (1 − c ) ⎦⎤ dL X , = s =0 ds s = 0 V2

2λ ⎡⎣( λ + µ − µc )( 3λ + µ ) − ( 2λ ( λ + µ ) − 2λµc ) (1 − c ) ⎤⎦ ⎡⎣ 2λ ( λ + µ ) − 2λµc ⎤⎦

2

(5.83) ,

which gives the expression for the MTTF of a parallel redundant system with imperfect coverage as, E[ X ] =

λ (1 + 2c ) + µ . 2λ ⎡⎣λ + µ (1 − c ) ⎤⎦

(5.84)

MODELING A TMR SYSTEM WITH IMPERFECT COVERAGE

In some Triple Modular Redundant systems a simplex mode of operation is allowed, providing the facility to have two successive subsystem failures before total system failure is experienced [Toy87]. Figure 8 describes the states of the TMR system operating in the 3–2–1–0 mode.

71/130

2 λc



3

2

µ

1

λ

0

µ 2λ ( 1 − c )

Figure 8 –The state transition diagram for a Triple Modular Redundant system with repair and imperfect fault coverage. State {3} represents the fault free mode, State {2} represents the single fault (Duplex) mode, State {1} represents the two–fault (Simplex) mode, and State {0} represents the system failure mode.

With an initial state of {3} , the starting conditions are, P3 ( 0 ) = 1, P2 ( 0 ) = P1 ( 0 ) = P0 ( 0 ) = 0. The transition matrix is,

72/130

(5.85)

⎡ dP3 ( t ) ⎤ µ 0 ⎢ ⎥ ⎡ −3λ dt ⎢ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢ dP2 ( t ) ⎥ ⎢ −3λ − µ −2λc − 2λ (1 − c ) µ ⎢ dt ⎥ ⎢ ⎢ ⎥=⎢ ⎢ dP ( t ) ⎥ ⎢ 2 λc − (λ + µ) ⎢ 1 ⎥ ⎢ 0 ⎢ dt ⎥ ⎢ ⎢ ⎥ ⎢ ⎢ dP0 ( t ) ⎥ ⎢ 2λ ( 1 − c ) λ ⎢⎢ ⎥ ⎢⎣ 0 ⎣ dt ⎦⎥

0 ⎤ ⎡ P3 ( t ) ⎤ ⎥⎢ ⎥ ⎥⎢ ⎥ ⎥⎢ ⎥ 0 ⎥ ⎢ P2 ( t ) ⎥ ⎥⎢ ⎥. ⎥⎢ ⎥ ⎥ ⎢ 0 P1 ( t ) ⎥ ⎥⎢ ⎥ ⎥⎢ ⎥ ⎥⎢ ⎥ 0 ⎥⎦ ⎢⎣ P0 ( t ) ⎥⎦

(5.86)

producing a system of differential equations, dP3 dt dP2 dt dP1 dt dP0 dt

= −3λP3 ( t ) + µP2 ( t ) , = ( 3λ − µ ) P3 ( t ) − 2λc P2 ( t ) − 2λ (1 − c ) P2 ( t ) + µP1 ( t ) , = 2λc P2 ( t ) − ( λ + µ ) P1 ( t ) ,

(5.87)

= 2λ (1 − c ) P2 ( t ) + λP1 ( t ) .

The Laplace transform of these equations produces, sP3 ( s ) − 1 = −3λP3 ( s ) + µP2 ( s ) , sP2 ( s ) = ( 3λ − µ ) P3 ( s ) − 2λc P2 ( s ) − 2λ (1 − c ) P2 ( s ) + µP1 ( s ) , sP1 ( s ) = 2λc P2 ( s ) − ( λ + µ ) P1 ( s ) ,

(5.88)

sP0 ( s ) = 2λ (1 − c ) P2 ( s ) + λP1 ( s ) .

Solving Eq. (5.88) for the absorption state and inverting the Laplace transform gives the mean time to failure for a TMR system with imperfect converge as,

73/130

MTTF = L

⎡µ 2 λ 2 ( 7 + 2c 2 − 8c ) + µλ 3 (17 − 13c ) + µ 3λ (1 − c ) + 5λ 4 − µλ 3 ( 6 − 12c − 6c 2 ) + 6λ 4 c ⎤ ⎦. =⎣ 3 6λ ⎣⎡λ + µ (1 − c ) ⎦⎤

(5.89) MODELING A GENERALIZED TMR SYSTEM

In practice, the state transitions for a Triple Modular Redundant computing system take place in a controlled manner. The physical structure of the underlying hardware and software create a transition network similar to that describe in Figure 8. In normal operation, the system functions as a TMR machine. Failure of a subsystem transforms the system into an intermediate mode in which failure of a second subsystem results in the final absorption state. The states

{1}K{N } represent the possible remaining subsystems after the failure of one of the TMR subsystems. A closed form solution to this general TMR model using perfect fault coverage would be useful [Dona84], [Dona85], [Iyer84], [Meye82]. Using Figure 8 as the basis for the transition model, this section develops an expression for the MTTF of a generalized TMR system with the following behavior: ƒ

Each subsystem is itself a TMR system.

ƒ

The subsystems are connected to form a fault–tolerant system.

ƒ

The system starts in state {0} .

ƒ

A fault in any subsystem, results in the transition to state {1}K{N } .

ƒ

A fault in the remaining subsystems result in the absorption state {N + 1} .

74/130

Laplace Transform Solution to Systems of Equations

The scalar differential equation of the form

d p ( t ) = ap ( t ) has solutions of the dt

form c e at , where c is a constant [Lome86], [Brau70], [Brau67]. Given a system of differential equations of the form, d Pi ( t ) = A Pi ( t ) , dt

(5.90)

the generation of the fundamental matrix A leads to a unique solution. Taking the Laplace transform of both sides of Eq. (5.90) and using the initial conditions vector P ( 0 ) gives, sP ( s ) − P ( 0 ) = A P ( s )

(5.91)

Rearranging Eq. (5.91) results in,

( sI − A ) P ( s ) = P ( 0 )

(5.92)

The system of equations described in Eq. (5.92) is a linear nonhomogeneous system of n algebraic equation in n unknowns. If s is not equal to an eigenvalue of A, e.g. det ( sI − A ) ≠ 0 , Eq. (5.92) can be solved for P ( s ) in terms of P ( 0 ) and

s, using Cramers Rule and / or Gauss Reduction (see Appendix B). ⎡ s − a11 Since det ( sI − A ) , where det ( sI − A ) = ⎢ ⎣ −an 1

− a1m ⎤ , is a polynomial of s − a n m ⎥⎦

degree n, P ( s ) is a vector whose components are rational functions of s and

75/130

linear in P1 ( 0 ) , P2 ( 0 ) ,K , Pn ( 0 ) , providing a unique solution to the system described in Eq. (5.92).

0

µ1

µN µ2

λ1

1

λ2

2

L L L

λN

N

Θ2 Θ1

N+ 1

ΘN

Figure 9 – The state transition diagram for a Generalized Triple Modular Redundant system with repair and [perfect fault detection coverage. The system initially operates in a fault free state {0} . A fault in any module results in the transition to state {1, K , N } . A second fault while in state {1, K , N } results in the system failure state {N + 1} .

Specific Solution to the Generalized System

The solution set of differential equations describing the reliability of the system in Figure 9 can be developed using the Gauss Reduction method by transforming these set of differential equations into a set of algebraic equations using the 76/130

Laplace transform [Lehm62], [Bell65], [Jame67]. Appendix F provides additional details on this solution method. The matrix described in Eq. (5.93) represents the state transition rates between each node of Figure 9,

⎡ N ⎢ −∑ λ i ⎢ i =1 ⎢ µ ⎢ µ A=⎢ ⎢ µ ⎢ ⎢ M ⎢ µ ⎢ ⎣⎢ 0

λ1

λ2

L

λN

−µ − Θ1 0 0 M 0 0

0 −µ − Θ2 0 O L L

L L −µ − Θ3 O 0 0

0 0 M O −µ − Θ N 0

⎤ 0 ⎥ ⎥ Θ1 ⎥ Θ2 ⎥ ⎥, Θ3 ⎥ ⎥ M ⎥ ΘN ⎥ ⎥ s ⎦⎥

(5.93)

where the fundamental matrix in the Laplace Transform notation of A, is given by, sI − A = A ,

(5.94)

resulting in, N ⎡ + s ⎢ ∑ λi ⎢ i =1 ⎢ −µ ⎢ −µ A=⎢ ⎢ −µ ⎢ M ⎢ ⎢ −µ ⎢ ⎣⎢ 0

−λ1

−λ 2

s + µ + Θ1 0 0 s + µ + Θ2 0 0 M O L 0 L 0

L

−λ N

L L s + µ + Θ3 O 0 0

0 0 M O s + µ + ΘN 0

⎤ 0 ⎥ ⎥ −Θ1 ⎥ −Θ 2 ⎥ ⎥, −Θ3 ⎥ ⎥ M ⎥ −Θ N ⎥ ⎥ s ⎦⎥ (5.95)

using the initial conditions for the system of equations,

77/130

P ( 0 ) = P ( s ) A,

(5.96)

and the normalizing term, N

α0 = s + ∑ λi ,

(5.97)

i =1

αi = s + µ + Θi ; i = 1, 2,K , N , the system of equations can be written in the transposed form as, ⎡ 1⎤ ⎡ α0 ⎢ 0 ⎥ ⎢ −λ ⎢ ⎥ ⎢ 1 ⎢ 0 ⎥ ⎢ −λ 2 ⎢ ⎥ ⎢ ⎢ 0 ⎥ = ⎢ −λ 3 ⎢M ⎥ ⎢ ⎢ ⎥ ⎢ ⎢ 0 ⎥ ⎢ −λ N ⎢0 ⎥ ⎢ 0 ⎣ ⎦ ⎣

−µ

−µ

−µ

α1

L

0

L

0

L

O

L

L −Θ1

−Θ 2 L −Θ N

α1

0 ⎤ ⎡ P0 ( s ) ⎤ 0 ⎥ ⎢⎢ P1 ( s ) ⎥⎥ ⎥ 0 ⎥ ⎢ P2 ( s ) ⎥ ⎥ ⎥⎢ 0 ⎥ ⎢ P3 ( s ) ⎥ , M⎥ ⎢ M ⎥ ⎥ ⎥⎢ 0 ⎥ ⎢ PN ( s ) ⎥ s ⎥⎦ ⎢⎢⎣ PN +1 ( s ) ⎥⎥⎦

(5.98)

which can be rewritten in an augmented form as, ⎡1 α0 ⎢ 0 −λ 1 ⎢ ⎢ 0 −λ 2 ⎢ ⎢ 0 −λ 3 ⎢M M ⎢ ⎢ 0 −λ N ⎢0 0 ⎣

−µ α1 0 M 0 −Θ1

−µ 0 α2 0 M 0 −Θ1

−µ 0 0 α3 0 −Θ1

L −µ L 0 L 0 L 0 O M L αN L −Θ1

0⎤ 0⎥ ⎥ 0⎥ ⎥ 0⎥ . M⎥ ⎥ 0⎥ s ⎥⎦

(5.99)

Using the Gauss Reduction method to multiply row i + 1 by, µ α i +1 and add it to row 1 to give the new form of the matrix as,

78/130

⎡ 1 α 0 + µ∑ λ i α i ⎢ −λ1 ⎢0 ⎢0 −λ 2 ⎢ −λ 3 ⎢0 ⎢M M ⎢ −λ N ⎢0 ⎢0 0 ⎢⎣

0

0

0

0

α1

L

0

0

0

α2

L

0

0

0

0

α3

L L

0

M

M

O

M

0 −Θ1

0 −Θ1

0 L αN −Θ1 L −Θ1

0⎤ ⎥ 0⎥ 0⎥ ⎥ 0⎥ . M⎥ ⎥ 0⎥ s ⎥⎥⎦

(5.100)

The augmented matrix is now in lower triangular form and can be solved for PN +1 ( s ) . Defining the term,

B0 = α 0 − µ

∑λ α

i

,

(5.101)

gives the following sets of equations from Eq. (5.100), 1 = B0 P0 ( s ) , 0 = −λ1 P0 ( s ) + α1 P1 ( s ) , 0 = −λ 2 P0 ( s ) + 0 + α 3 P2 ( s ) ,

(5.102)

0 = −λ 3 P0 ( s ) + L + α 3 P3 ( s ) ,

0 = −λ N P0 ( s ) + L + α N PN ( s ) ,

0 = −Θ1 P1 ( s ) − Θ 2 P2 ( s ) − L − Θ N PN ( s ) + sPN +1 ( s ) .

Solving for each differential equation gives,

79/130

P0 ( s ) = B0−1 , P1 ( s ) =

λ1 P0 ( s ) λ = 1 , α1 α1B0

P2 ( s ) = PN ( s ) =

λ 2 P0 ( s ) λ = 2 , α2 α 2 B0

λ1 P0 ( s ) λ = 1 , α1 α1B0

PN +1 ( s ) =

(5.103)

1 N 1 N Θi λ i 1 P s Θ = = ( ) ∑ ∑ i i s i =1 s i =1 α i B0 sB0

sPN +1 ( s ) =

1 B0

N

∑ i =1

Θi λ i 1 = B0 αi

N

∑Θ λ α i =1

i

i

−1 i

N

∑ i =1

Θi λ i , αi

.

Solving the differential equations set for the absorption state {N + 1} using the moment generating function describes in Eq. (5.78) gives, ∞

d = − ∫ p ( t ) dt = − MTBF , ⎡ sPN +1 ( s ) ⎤⎦ s=0 ds ⎣ 0

(5.104)

as a general expression. Defining the intermediate terms, B0 = α 0 − µ ∑ λ i α i−1 ,

B0 = s + λT − µ∑ λ i α i−1 ,

(5.105)

B0 = s + λT − µ∑ λ i ( s + µ + Θi ) , −1

and the derivative term, d d −1 B0 = 1 − µ ∑ λ i ⎡( s + µ + Θi ) ⎤ , ⎦ ds ds ⎣ and

80/130

(5.106)

d −2 ∑λ B0 = 1 − µ∑ λ i − ( s + µ + Θi ) = 1 + µ 2 i . αi ds

The term u = 1 B0 , is defined so that the derivative can be defined resulting in

(5.107)

d 1 u =− 2 ds B0

⎡ λi ⎤ ⎢1 + µ∑ 2 ⎥ αi ⎦ ⎣

d d −1 1 −1 α1−1 ) = ( s + µ + Θi ) = = 2. ( 2 ds ds ( s + µ + Θi ) α i

Expanding Eq. (5.106) and Eq. (5.107) to,



d d ⎡1 ⎡⎣ sPN +1 ( s ) ⎤⎦ = − ⎢ ds ds ⎣ B0



⎡d ⎡ 1 ⎤ d 1 ⎡⎣ sPN +1 ( s ) ⎤⎦ = − ⎢ ⎢ ⎥ ∑ Θi λ i α i−1 + ds B0 ⎣ ds ⎣ B0 ⎦

∑Θ λ α i

i

−1 i

⎤ ⎥, ⎦

(5.108)

and,

∑Θ λ i

i

⎤ d ⎡α i−1 ⎦⎤ ⎥ , (5.109) ⎣ ds ⎦

evaluating at s = 0 gives an expression for the MTTF of the generalized TMR system,



d =L ⎡ sPN +1 ( s ) ⎤⎦ s =0 ds ⎣

⎡ ⎤⎡ λi Θi λ i ⎤ ⎡ λi ⎤ ⎡ Θi λ i ⎤ + λ − µ ⎢1 + µ∑ ⎥ ⎢ ∑ µ + Θ ⎥ ∑ ( µ + Θ )2 ⎥ (5.110) ⎥ ⎢ T 2 ⎢∑ ( µ + Θi ) ⎥⎦ ⎣ µ + Θi ⎦ ⎣ i ⎦⎢ ⎢⎣ ⎥⎦ i ⎣ . 2 ⎡ λi ⎤ ⎢λT − µ∑ µ + Θ ⎥ i ⎦ ⎣ Simplifying Eq. (5.110) gives the closed form solution for the generalized TMR equation of,

81/130

λi µ + Θi MTBF = N i =1 . λ i Θi ∑ i =1 µ + Θi N

1+ ∑

(5.111)

The expression shown in Eq. (5.111) represents the generalized form of a Triple Modular Redundant system with individual failure rates and a constant repair rate for each failure event. Although this model appears generic, is can be utilized to represent actual TMR system with a high degree of accuracy [Sint89], [Whit82], [UKAE88].

82/130

CHAPTER 6

PRACTICAL EFFECTS OF PARTIAL COVERAGE

The analysis presented in the previous section utilizes a coverage factor c in evaluating the Mean Time to Failure of various fault–tolerant system configurations. In practice, the coverage factor of the underlying diagnostics firmware must be determined through analysis and verification. The important aspects of the diagnostics subsystem are [Kraf81], ƒ

The physical determination of the coverage factor through some means.

ƒ

The effects of systematic failures in the diagnostic system on the reliability of the system.

DETERMINING COVERAGE FACTORS

Throughout this paper, the term coverage has been referred to as a scalar value describing the probability that the system successfully recovers from a specific type of fault. What has not been described is the method by which the coverage factor is determined for a particular system or subsystem. Several difficulties arise when the coverage factor is measured: ƒ

For a particular subsystem, exhaustive testing of each fault condition may not be possible. A typical printed circuit board may have 20,000 device connection points, each of which can exhibit several different fault states. Defining the fault condition and the proper recovery state for each point can severely restrict the test activities.

ƒ

Reconfirmation of the coverage factor for the next generation of hardware or firmware could require exhaustive retesting. Once the original coverage factors has been computed or measured, some method for reconfirming the coverage is necessary, without expending the same effort.

83/130

What is needed is, A method by which the coverage factor can be measured and then reconfirmed for each iteration of the system design with the minimum of effort while maintaining a high degree of confidence that the measured factor represents the actual coverage. One method used to address this question is Physical Fault Injection. This technique places physical faults in the system under test and observes the resulting error or failure response [Crou82], [Lala83], [Damm88], [Schu86], [Gunn87]. Coverage Measurement Statistics The behavior of a system in the presence of faults can be determined through a suitably selected set of proportion sample tests. Each test induces a fault in the system and the resulting behavior is observed. From these sample tests an inference can be made regarding the total population of faults that can occur in the system. The method used in the following sections is based on Statistical Inference theory [Coch77], [Lars81], [ Hoel62], [Yama67], [Hoel72], [Bern88]. Three aspects of the sampling process and statistical inference are applicable here: ƒ

As the sample size increases, the estimate of the parameter of interest generally gets closer to the true value, with complete correspondence reached when the sample size equals the entire population. This aspect is referred to as Consistency.

ƒ

Whatever the sample size, the sample should be representative of the underlying population. This aspect is referred to as Bias.

ƒ

For certain statistical distributions, the arithmetic mean is considered to be more stable from sample to sample than any other measure of the central tendency of the sampled population. For any size sample the sample mean will tend to be closer to the population mean than any other unbiased estimator. The precision of the sample mean is referred to as the Efficiency of the estimator.

84/130

Coverage Factor Measurement Assumptions There are several assumptions used in the development of the coverage factor calculations: ƒ

Each fault that can occur in the system will result in an error or failure of the system. This assumption states that there are no hidden faults or circuit elements that when faulty do not result in an error or failure.

ƒ

The goal of the diagnostics software is to detect and properly recover from 100% of the faults. The determination of the proper recovery behavior is determined by the functional specification of the system under evaluation.

ƒ

The fault locations can be enumerated in some way. The population of possible faults can then serve as a sample population for the sampling method described below. Coverage Measurement Sampling Method In a practical implementation what is needed is a sampling method by which a small number of fault locations can be tested to determine, within a desired confidence interval, that the remaining (unselected) fault locations will exhibit similar behavior. The determination of the proper number of samples from the total population of available faults is based on the technique of estimating the population proportion. The supporting theory for the sampling method used in this method relies in the following assumptions: ƒ

If a random variable X is normally distributed with a mean of µ and a standard deviation of σ , and a random sample size n is drawn, then the sample mean X will be normally distributed with a mean µ and a standard deviation σ .

ƒ

A population proportion is considered a special case of the mean, where the random variable X takes on only value of 0 or 1 (successful or unsuccessful). The sample proportion p is an unbiased estimator of the population proportion P.

85/130

Normal Population Statistics The probability density function for a normally distributed random variable is, 2 ⎡ ⎛ X − µX ⎞ ⎤ 1 1 f (x ) = exp ⎢ − 2 ⎜ ⎟ ⎥. 2π ⎢⎣ ⎝ σ X ⎠ ⎥⎦

(6.1)

Eq. (6.1) is difficult to work with since the mean and standard deviation values must always be adjusted to the distribution. A standardizing technique can be used to Normalize the distribution using the expression, Z=

X − µX

σX

.

(6.2)

For a small sample population, the confidence interval estimate for the population proportion is given by [Bend71], ⎛ ⎜ pS − Z ⎝

⎛ pS q S ⎞ ⎟ ≤ p ≤ ⎜ pS + Z n ⎠ ⎝

pS q S n

⎞ ⎟. ⎠

(6.3)

Sample Size Computation The confidence interval estimate of the true proportion p obtained from Eq. (6.3) is,

pS ± Z

pS q S . n

(6.4)

To determine the sample size that meets the confidence interval estimate, Eq. (6.3) can be rearranged to give,

86/130

Z

pq = pS − p . n

(6.5)

In this case the sampling error e is the difference between the sample estimate pS and the population parameter p given by, e =Z

pq . n

(6.6)

Solving Eq. (6.6) for the sample size gives, n=

Z 2 pq . e2

(6.7)

For a finite population size using sampling without replacement the error on estimating the sample proportion is, e =Z

pq N − n ⋅ . n N −1

(6.8)

Without considering the finite population error correction, the number of samples necessary for a desired error is, Z 2 pq n0 = 2 , e

(6.9)

where n0 is the sample size without considering the finite population correction factor. Applying the finite population correction factor, the number of samples is given by,

87/130

n=

n0 N . n0 + ( N − 1)

(6.10)

General Confidence Intervals What is now needed is a technique which produces a sample size which results in the desired confidence estimate for the coverage factor. In practice the selection of the sample size depends on three unknowns. Any normal random variable X is converted to a standardized normal random variable Z with a mean of 0 and a standard deviation of 1. The probability density function for Z is then given by, f (Z ) =

1 exp ⎡⎣ − 1 2 Z 2 ⎤⎦ . 2π

(6.11)

If the samples are draw from a non–normal distribution Eq. (6.11) may not be valid. An important concept is used at this point – The Central Limit Theorem [Bend71], [Hoel62], [Papo65]. As the sample size (the number of observations in each sample) grows “large enough,” the sampling distribution of the mean can be approximated by the normal distribution. This is true regardless of the shape of the distribution of the individual values in the population. Proportion Statistics In the case of the fault coverage measurement statistics, the variable of interest X indicates success or failure of the diagnostics. What is of interest is the proportion of faults that are covered by the diagnostic software rather than the actual number of covered faults. The proportion of the sample values that exhibits successful results are given by,


    p_S = \frac{X}{n},    (6.12)

with a sample proportion mean of,

    \mu_{p_S} = p, \qquad q = 1 - p,    (6.13)

and a sample proportion standard deviation of,

    \sigma_{p_S} = \sqrt{\frac{pq}{n}}.    (6.14)

Rewriting Eq. (6.2) in terms of the proportion statistics results in a normalization expression of,

    Z = \frac{p_S - p}{\sqrt{pq/n}}.    (6.15)

The Central Limit Theorem and the Standard Error of the Mean are based on the premise that the samples selected are chosen with replacement. However, the fault location samples are drawn without replacement from a finite population of size N. When the sample size is not small compared to the population, that is n/N > 0.05, a finite population correction factor is applied to the standard error of the mean. Eq. (6.16) shows this correction factor [Coch77], [Bern88],

    \sigma_{p_S} = \sqrt{\frac{pq}{n}} \sqrt{\frac{N-n}{N-1}}.    (6.16)


Confidence Interval Estimate of the Proportion

Using the Central Limit Theorem, or knowledge about the population distribution, an estimate can be made regarding the percentage of sample means that fall within a certain distance from the population mean [Coch77], [Sned80].

Unknown Population Proportion

The selection of the three unknowns described above is often difficult. Once the desired level of confidence is chosen and the appropriate Z computed from the normal distribution, the sampling error e indicates the amount of error that is acceptable. The third quantity, the true proportion of success p, is actually the population parameter the measurement is attempting to quantify. The question is, How can a value for the true proportion be stated when it is the sample proportion that is being measured? There are two alternatives. The first is to use past information regarding the population proportion to make an educated estimate of the true proportion p. If past information is not available, an estimate can be chosen which never underestimates the sample size needed. Referring to Eq. (6.16), the quantity p·q appears in the numerator. The value of p that makes p·q the largest is p = 0.5. With no prior knowledge or estimate of the true proportion p, using p = 0.5 will result in the most conservative determination of the necessary sample size. If the actual sample proportion is very different from 0.5, the width of the real confidence interval may be substantially narrower than the estimate obtained using this method.

Clopper–Pearson Estimation

When sampling values of the proportion from an unknown (non–normal) distribution, some constraints must be placed on the sample size so that the expressions in the preceding section are valid. For small sample sizes, a lower limit is set for a finite population and a specified confidence interval. Details of these constraints are given in [Yama67] pp. 89–95. The sample size limits were tabulated by C. J. Clopper and E. S. Pearson in [Clop34], [Kend61]. If the proportion estimate is to have a 95% confidence with a 5% error (these figures are general guidelines described in [Yama67]), Figure 10 is used. The normal approximation for the sample proportion and the absolute sample size n follow the guidelines in Figure 10 [Lawr68].

Sample Proportion p     Sample Size must be n >
0.4                     50
0.3 or 0.7              80
0.2 or 0.8              200
0.1 or 0.9              600
0.05 or 0.95            1400

Figure 10 – Sample size requirement for a specified estimate as tabulated by Clopper and Pearson.


Practical Sample Estimates

Assume an electronic circuit assembly containing 6000 fault points, consisting of 375 packages with 8 outputs, each output capable of being in one of three states [Myer64], [Barr73], [Boss70], [Seth77], [McCl86], [TUV86]:

• Operating properly
• Stuck HIGH
• Stuck LOW

The question is, How many random samples must be taken on the circuit assembly to determine with 95% confidence (Z = 1.96) and 5% error (e = 0.05) that the coverage factor for the circuit is 0.95? Using Eq. (6.9) and the worst case estimator for the population proportion p = 0.5 gives the number of samples from an infinite population of,

    n_0 = \frac{Z^2 p q}{e^2} = \frac{(1.96)^2 (0.5)(0.5)}{(0.05)^2} \approx 384 \text{ samples},    (6.17)

and, from Eq. (6.10), a corrected sample count from a finite population of 6000 without replacement of,

    n = \frac{n_0 N}{n_0 + (N-1)} = \frac{(384)(6000)}{(384) + (6000-1)} \approx 361 \text{ samples}.    (6.18)
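The following short Python sketch reproduces this calculation; the function name is illustrative only, and the sample sizes are rounded up to whole samples, which is why the printed values sit one above the values in Eq. (6.17) and Eq. (6.18).

```python
import math

def sample_size(z, e, p=0.5, N=None):
    """Sample size for estimating a population proportion.

    z : standard normal value for the desired confidence (1.96 for 95%)
    e : acceptable sampling error
    p : assumed true proportion (0.5 is the most conservative choice)
    N : finite population size; None means an infinite population
    """
    q = 1.0 - p
    n0 = (z ** 2) * p * q / (e ** 2)               # Eq. (6.9)
    if N is None:
        return math.ceil(n0)
    return math.ceil(n0 * N / (n0 + (N - 1)))       # Eq. (6.10)

# Worked example: 6000 fault points, 95% confidence, 5% error
print(sample_size(1.96, 0.05))           # ~385 (infinite population)
print(sample_size(1.96, 0.05, N=6000))   # ~362 (finite population)
```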

A fault injection procedure using roughly 360 fault locations is a feasible task during the engineering validation portion of a product development cycle. Using the sampling method, an incremental engineering design can be revalidated with relative ease. Such testing procedures have been performed on industrial control


electronics as a mechanism for verifying each incremental release of hardware and software.

Time Dependent Aspects of Fault Coverage Measurement

In the previous section, it was assumed that each injected fault immediately resulted in an observable error or system failure. In many instances this is not the case; some time may pass before the effects of the injected fault are observed. Two time intervals require further analysis [Aviz86], [Bour69]:

• The fault dormancy is the time interval between the occurrence of a fault and its activation as an error.
• The error detection latency is the time interval between an error and its detection by the diagnostic subsystem.

In addition to single faults causing observable errors, near coincident faults are of particular interest in multiprocessor and / or real–time control systems [McGo83]. An expression for the coverage in the presence of time–delayed realization will be developed using the following notation [Boss82], [Shin86], [Arla85]. Let the following variables be defined:

• T_P – a random variable denoting the time at which the system is checked for an observable error or failure as a result of a physical fault injection.
• n – the number of physical fault injections performed.
• N(t) – the total number of times the system is checked in the interval [0, t].

COMMON CAUSE FAILURE EFFECTS

Systematic failures differ from random failures in that the protection mechanism for random failures may not function properly in the presence of a systematic failure [Bour81], [Wats79], [Hene81]. Systematic failures occur as a result of design, operation, or environmentally induced errors which affect two or more channels of a redundant hardware system simultaneously, possibly resulting in a total system failure. Software systematic failures may also occur, but are more difficult to quantify [Cost78], [Hals79], [Shoo73], [Iyer85], [Leve83], [Rama79]. A reliability analysis metric, such as those described in Figure 1, is needed to estimate the effects of systematic failures in a redundant system. To date no clear method for deriving these metrics has been developed, although practical experience has shown that the effects of systematic failures are present but are not quantified. In the literature, systematic failures are referred to as Common Cause Failures [HSE87], [Heik79]; that is the term used in this paper. There are many conflicting issues, as well as conflicting definitions, in the area of Common Cause Failure analysis. The confusion is due to different interpretations of the term failure versus unavailable functionality. There are several techniques for modeling the Common Cause Failure process and defining the detailed terms and expressions, including:

• Square Root Bounding [Harr86]
• Beta–Factor [Flem74], [Hump87], [Evan84], [Wall85]
• Multinomial Failure Rate [Apos87]
• Basic Parameter [Flem85], [Apos87]
• Multiple Dependent Failure Function [Heis84]
• Common Load [Mank77]
• Non Identical Components [Vese77]
• Multiple Greek Letter Models [Apos87], [Flem85]

In the following section, an overview of the Common Cause Failure modeling techniques will be presented. Throughout the discussion, several terms will be used to describe the various failure rates. The term λ is the total failure rate of one specific component, regardless of the effects of the failure. The term λ_i is the rate of independent component failures. The failure of an independent component in a system does not lead to the failure of the system if,

• The diagnostic coverage factor for that component is equal to 1.
• The specific component is not a single point of failure in the architecture of the system.

The term λ_d is the rate of dependent failures that possibly affect other components. The failure of a dependent component may lead to the failure of the system, if the failure of the specific component leads to the failure of other components (i.e. components dependent on the specific component). The term λ_S(k) is the rate of system failures in which exactly k components fail, where k = 1, 2, …, n is termed the failure multiplicity. In a Triple Modular Redundant system, λ_S(k) with k = 1 is the rate of failure of any single subsystem or component of the triad. The total failure rate is then given by λ = λ_i + λ_d. This notation differs from the notation elsewhere in this paper: here the subscript i indicates independent failures and the subscript d indicates dependent failures, while in previous sections the subscript i indicates the i th component of a system or subsystem. The following sections describe several methods for evaluating Common Cause Failures.

Square Root Bounding Problem

A Common Cause Failure analysis method introduced in the WASH–1400 Report [WASH75] provides a technique for measuring the upper bound P_{upper} and lower bound P_{lower} of the Common Cause Failure probabilities. These bounds are denoted by P_{upper} = q (complete dependence) and P_{lower} = q^n (complete independence). The probability of a simultaneous failure of all components is given by the geometric mean of P_{upper} and P_{lower}, which implies the square root of their product, i.e.,

    \sqrt{P_{lower} \cdot P_{upper}} = q^{(n+1)/2}.

Beta Factor Model

The Beta–Factor model is most commonly used to account for the occurrences of dependent failures [Flem74] because of its simplicity. Given that a component has failed, the failure will, with probability β, be a dependent failure, and all components of the redundant system will fail. With probability 1 − β the failure is independent, and only the specific component fails. The rate of independent component failures is given by λ_i = (1 − β)λ and the rate of dependent component failures by λ_d = βλ. It also follows that the system failure rates are λ_S(1) = n(1 − β)λ for a single component, λ_S(k) = 0 for k = 2, 3, …, n − 1, and λ_S(n) = βλ for all components failing together.

Multinomial Failure Rate (Shock Model)

The Multinomial Failure Rate or Shock Model [Apos87] technique accounts for the possibility that any number of components may fail simultaneously. The fundamental assumption of the Shock Model is that two causes of failure exist. The variable µ describes the failure rate for one specific component that experiences an independent failure. The variable ν describes the rate of External System Shocks, which may cause one or more components to fail simultaneously. Given that a shock of rate ν occurs, the multiplicity of failure distribution is given by f_k, the probability that exactly k components among the total of n components fail due to the external shock. Using the failure distribution f_k requires that the condition,

    \sum_{k=0}^{n} f_k = 1,

be satisfied. The system failure rate of multiplicity k is given by,

    λ_S(1) = nµ + νf_1, \quad k = 1, \qquad λ_S(k) = νf_k, \quad k = 2, 3, …, n.

Binomial Failure Rate Model

The Binomial Failure Rate Model can be defined in a manner similar to the Multinomial model [Atwo86], with the following specific assumptions:

• At the occurrence of the shock, each component has the same probability p of failure.
• Given that a shock has occurred, all components fail independently.

With these assumptions, the number of components affected by the shock fails according to the binomial distribution with parameters n and p. The multiplicity failure distribution is then,

    f_k = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{[Vese77]}.

The mean number of components that fail is n·p, and the component failure rate as a result of the shock is ν·p. The total failure rate of a single component is given by λ = µ + νp.
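To make the bookkeeping concrete, the following Python sketch evaluates the Shock Model system failure rates λ_S(k) under the binomial multiplicity distribution just described; the numeric values of n, µ, ν, and p are illustrative assumptions, not field data.

```python
from math import comb

def binomial_shock_rates(n, mu, nu, p):
    """System failure rates lambda_S(k) for the Binomial Failure Rate Model.

    n  : number of redundant components
    mu : independent failure rate of one component
    nu : rate of external shocks
    p  : probability a given component fails when a shock occurs
    """
    rates = {}
    for k in range(1, n + 1):
        f_k = comb(n, k) * p**k * (1 - p)**(n - k)  # binomial multiplicity
        rates[k] = nu * f_k
    rates[1] += n * mu  # single failures also occur independently of shocks
    return rates

# Illustrative values: a triad with mu = 1e-4/h, shocks at 1e-5/h, p = 0.2
for k, lam in binomial_shock_rates(3, 1e-4, 1e-5, 0.2).items():
    print(f"lambda_S({k}) = {lam:.3e} per hour")
```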

Multi–Dependent Failure Fraction Model

The previous models can be expanded to provide information about the general behavior of the failures of identical components. This model [Heis84] can be considered a special case of the Multinomial Failure Rate Model with f_0 = 0 and µ = 0. Thus ν denotes the total rate of system failures and λ_S(k) = νf_k, k = 1, 2, …, n. In this failure model the sum of the multiplicity terms f_k (k ≥ 1) equals 1, with only n parameters required for the estimation.

Basic Parameter Model

This model could be formulated as a Shock Model [Flem85], [Apos87] with µ = 0. However, the following parameter is chosen for the model: λ_k, the failure rate of a specific group of k components. The meaning of this parameter, and its relation to the parameters of the Multinomial Failure Rate Model, is made clear when a Triple Modular Redundant system is considered. The rate of failure of one specific component in the system is λ_1 = νf_1. The failure rate of two specific components of the triplicated system is given by λ_2 = νf_2 / 3; this is the double simultaneous failure rate of the triplicated system. The simultaneous failure rate of all three components is then given by λ_3 = νf_3. The total failure rate of one specific component is then given by λ = λ_1 + 2λ_2 + λ_3.

Multiple Greek Letter Model

The Multiple Greek Letter Model [Flem85], [Apos87] represents another form of the Basic Parameter Model. Using a triplicated system as an example, λ is the total failure rate of one specific component. The conditional probability that a component failure will be shared by at least one additional component is given by β. The conditional probability that a failure known to be shared by at least one additional component is in fact shared by two or more additional components is given by γ. Using these three parameters, a group of expressions can be formed which describe the failure rates of one, two, or three components. The failure rate of one specific component is given by λ(1 − β) = λ_1; this single component is one of the triplicate set of components in the system. The failure rate of two components is given by λβ(1 − γ) = 2λ_2; the two components that fail are members of the set of three components of the triplicate system. The failure rate of three components is given by λβγ = λ_3.

Common Load Model

The Common Load Model [Mank77] describes a stochastic resistance to failure for each independent component. These components are then exposed to a stochastic stress, and the probability that exactly k components fail due to some common stress is calculated. A generalization of this model has been developed which is similar to the Shock Model [Harr86].

Nonidentical Components Model

In each of the above models, homogeneous components have been assumed in order to simplify the number of parameters. Nonidentical component models have been developed [Vese77] in which diverse components that operate in parallel are considered.

Practical Example of Common Cause Failure Analysis

The determination of the systematic failure rate in various system models is based on empirical estimation from field measurements [Wats79]. Field data is expressed in terms of a Common Cause Failure rate for each set of parallel subsystems. In Figure 8, a systematic failure mode would result in a transition from state {3} to state {0}. In the presence of the Common Cause Failure event this transition would take place with a probability of p = 1 (since the failure mode is not random). The Beta–Factor model is the most straightforward technique (if its limitations are understood [SINT88]). Using the Beta Factor, the Common Cause Failure rate λ_ccf is related to the independent subsystem failure rate λ_i by,

    β = \frac{λ_{ccf}}{λ_{ccf} + λ_i}.    (6.19)

The ratio of the Common Cause Failure rate to the independent failure rate is,

    \frac{λ_{ccf}}{λ_i} = \frac{β}{1 - β},    (6.20)

giving,

    λ_{ccf} = \frac{β \, λ_i}{1 - β}.    (6.21)
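As a quick illustration, a minimal Python sketch of Eq. (6.21) follows; the β values and the independent failure rate are placeholders chosen for the example, not measured data.

```python
def ccf_rate(lambda_i, beta):
    """Common Cause Failure rate from the Beta-Factor model, Eq. (6.21)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return beta * lambda_i / (1.0 - beta)

# Illustrative: independent subsystem failure rate 1e-4 per hour
lam_i = 1e-4
for beta in (0.01, 0.10, 0.30):
    print(f"beta = {beta:4.2f}: lambda_ccf = {ccf_rate(lam_i, beta):.2e} per hour")
```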

The total decrease in the system reliability is adjusted by the Beta Factor. Some examples of these factors are shown in Figure 11 [HSE87].


                     Minimum   Typical   Maximum
Identical Systems       3        10        30
Diverse Systems        0.3        1         3

Figure 11 – Common Cause Failure mode guide figures for electronic programmable systems [HSE87]. These are the ratios of non–CCF failures to CCF failures for various system configurations. CCFs are defined as non–random faults that are designed in or experienced through environmental damage to the system. Other sources [SINT88], [SINT89] provide different figures.

Common Cause Software Reliability

Throughout this paper the reliability discussion has centered on the underlying hardware and subsystems that make up the electronic system. The reliability figures for hardware components are based on the statistical failure model developed in Chapter 2, in which the failure rate is based on an underlying physical failure mechanism. When discussing software failures, statistical models are also used. However, the basis for these models is not founded in the underlying physical failures of semiconductor or mechanical components, but rather on less tangible metrics, including [Musa89a], [Harr89], [Litt73], [Musa75], [Goel79], [Leve87], [Leve89]:

• Field measurements of existing software systems and their observed failure rates.
• Calculations based on theoretical models of software systems and their execution.


Because of this less tangible failure process, traditional failure prediction methods themselves sometimes fail to produce proper results [Lipo79], [Boeh76]. Further understanding of software reliability models will be necessary.

Software Reliability Concepts

The reliability of software can be defined by the conditional survival probability of two random variables S_k and X_k which describe the behavior of the software system, such that,

    R(x \mid t) = P\{ X_k > x \mid S_{k-1} = t \},

where X_k is the time interval between the (k−1)th and k th software failures and S_k is the k th software failure occurrence time. Software reliability then corresponds to the probability that a failure does not occur in the time interval (t, t + x) [Yama83]. Through a testing procedure, software faults are detected and removed. Assuming no new software faults are introduced as a result of the repair process, the software reliability increases with time. The test–repair cycle is well understood and is usually explained by a Software Reliability Growth Model [Goel79], [OHBA84], [Yama83]. The growth curve H(t) relates the number of detected software faults to the time span of the program testing. By definition H(∞) is the cumulative number of faults to be eventually detected, and the number of residual faults at time t can be estimated by H(∞) − H(t). Figure 12 describes the various software reliability growth models.


Exponential Growth Models, with parameters N, φ
  N – initial number of faults in the system
  φ – fault detection rate

    H(t) = N\left[ 1 - e^{-φt} \right]

    H(t) = \sum_{i=1}^{n} N_i \left[ 1 - e^{-φ_i t} \right]

S–Shaped Growth Models, with parameters N, φ, r (> 0)
  N – initial number of faults in the system
  φ – fault detection rate
  r – ratio of total faults to detectable faults

    H(t) = N\left[ 1 - (1 + φt) e^{-φt} \right]

    H(t) = N \, \frac{1 - e^{-φt}}{1 + \frac{1-r}{r} e^{-φt}}

Figure 12 – Four Software Growth Model expressions. The exponential and hyperexponential growth models represent software faults that are time independent. The S–Shaped growth models represent time delayed and time inflection software fault growth rates [Mats88].
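A small Python sketch of these growth curves follows; it simply evaluates the expressions in Figure 12, and the parameter values N, φ, and r are illustrative assumptions rather than values fitted to any data set.

```python
import math

def exponential(t, N, phi):
    """Simple exponential growth model."""
    return N * (1.0 - math.exp(-phi * t))

def hyperexponential(t, groups):
    """groups: list of (N_i, phi_i) pairs."""
    return sum(N_i * (1.0 - math.exp(-phi_i * t)) for N_i, phi_i in groups)

def delayed_s_shaped(t, N, phi):
    return N * (1.0 - (1.0 + phi * t) * math.exp(-phi * t))

def inflection_s_shaped(t, N, phi, r):
    return N * (1.0 - math.exp(-phi * t)) / (1.0 + ((1.0 - r) / r) * math.exp(-phi * t))

# Illustrative: 100 initial faults, detection rate 0.05 per test hour
for t in (10, 50, 100, 500):
    print(t, round(exponential(t, 100, 0.05), 1),
             round(delayed_s_shaped(t, 100, 0.05), 1),
             round(inflection_s_shaped(t, 100, 0.05, 0.5), 1))
```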

In contrast to hardware failure models, software reliability cannot take advantage of the aging aspects of the product. Once a software error is detected, diagnosed, and repaired, it is corrected forever. The result of this action is an improvement in the reliability of the system, as shown in Figure 12. However, during the lifetime of the system, errors continue to appear at some rate and a simple model of the underlying reliability is needed. (The subject of software reliability is complex and the following description is a simple approach for the purpose of this paper.) Let N* be the number of distinct input types to the system, and let N be the number of input types that result in software failure. N* is assumed to be large relative to N, which is assumed to be unknown. Once an error is detected and

corrected, the total number of possible errors, N, is reduced by one. If inputs to the software system occur according to a Poisson process with intensity λ, then given N* and N, the probability of no failure in the interval [0, t) is expressed by,

    \sum_{i=0}^{\infty} \frac{e^{-\lambda t} (\lambda t)^i}{i!} \left( \frac{N^* - N}{N^*} \right)^i = \exp\left( -\frac{N}{N^*} \lambda t \right).

This is the Shock Model for failures [Barl85]. Let T_i denote the time between the (i−1)th and i th failures, i ≤ N. Given N* and N, the survival function of T_i is given by,

    \bar{F}_i\left( t \mid N, N^*, \lambda \right) = \exp\left( \frac{-\lambda (N - i + 1) t}{N^*} \right).

This expression and its refinements [Jeli72], [Musa75], [Schi78] provide appropriate approximations for a static reliability model. The software utilized in a high reliability system most often runs continuously, rather than on demand. This continuous operation provides the observer with an unchanging set of functions, from which inferences can be made regarding reliability. Inputs to the software system can be considered a finite set of random variables. Specification of these inputs and the relative frequency of their occurrence is termed the Operational Profile [Musa89]. Modeling the failure rate of a steady state software system can be performed using the function f(t) = λt, where λ is the failure rate constant. This function models a system where both the software and the operational profile remain fixed in their behaviors. Changes to the software functionality are usually made only when the system is not in use. Modeling the failure rate for a software system under development and being maintained (i.e. the software is being changed periodically) can be done


with the function,

    f(t) = f_0 \left[ 1 - \exp\left( -\frac{λ_0}{f_0} t \right) \right].

During the system test phase, software faults that are found can be repaired, so the failure rate should decrease as the testing proceeds. In this model the Operational Profile remains unchanged, and faults from the total fault population f_0 are corrected immediately whenever a failure occurs. If the faults are not corrected immediately, repeated occurrences of the same fault are discarded and the model becomes,

    f(t) = λ_0 \left[ 1 - \exp\left( -\frac{λ_0}{f_0} t \right) \right].

This model is referred to as the Basic Execution Model [Musa87]. The Basic Execution Model provides a reasonable representation of the software failure modes. The advantage of this model is that the parameters λ_0, the initial software failure rate, and f_0, the total number of failure possibilities, can be related to other parameters in the software failure analysis. A second model for software failures can be constructed if it is assumed that some faults are more likely than others to cause a system failure. If a second assumption is made, that the improvement in the failure rate as a result of software fault repair decreases exponentially as the repairs are made, the cumulative failure function becomes,

    f(t) = \frac{1}{ρ} \ln\left( λ_0 ρ t + 1 \right).

This function models a system in which the Operational Profile remains unchanged and corrections are made to the software when a failure occurs. For this model the failure intensity is given as,

    f'(t) = \frac{λ_0}{λ_0 ρ t + 1},

and the model is referred to as the Logarithmic Poisson Execution Model [Musa89]. The parameter λ_0 has the same meaning as it does in the Basic Execution Model. The parameter ρ characterizes the exponential rate at which the failure rate is reduced as a function of time [Chri88].
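The following Python sketch evaluates the Basic Execution and Logarithmic Poisson cumulative failure expressions side by side; λ_0, f_0, and ρ are assumed values chosen only to exhibit the shapes of the two curves.

```python
import math

def basic_execution(t, lam0, f0):
    """Cumulative failures under the Basic Execution Model."""
    return f0 * (1.0 - math.exp(-lam0 * t / f0))

def log_poisson(t, lam0, rho):
    """Cumulative failures under the Logarithmic Poisson Execution Model."""
    return (1.0 / rho) * math.log(lam0 * rho * t + 1.0)

lam0, f0, rho = 2.0, 100.0, 0.05   # assumed: 2 failures/h initially, 100 faults
for t in (1, 10, 100, 1000):
    print(f"t={t:5d}  basic={basic_execution(t, lam0, f0):7.1f}"
          f"  log-Poisson={log_poisson(t, lam0, rho):7.1f}")
```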


In each of the models described above, it is assumed that each software fault results in one observable error. Suppose a software system initially contains an unknown number m of faults. Fault number i will cause an error in accordance with a Poisson process having an unknown failure rate λ_i. Then the number of errors due to fault i that occur in any s units of operating time is Poisson distributed with mean sλ_i. Assume the processes causing the faults i = 1, 2, …, m are independent, and assume the software system is executed for time t with all the resulting errors observed. At the end of this time an analysis will be performed to determine which faults caused which errors. These faults will then be removed and the error rate of the repaired software system determined. Let,

    f_i(t) = \begin{cases} 1, & \text{if fault } i \text{ has not caused an error by time } t, \\ 0, & \text{otherwise}; \end{cases}

then the quantity to be estimated is given by,

    F(t) = \sum_i λ_i f_i(t),    (6.22)

which is the error rate of the final software package. It should be noted that,

    E\left[ F(t) \right] = \sum_i λ_i E\left[ f_i(t) \right] = \sum_i λ_i e^{-λ_i t}.

Let M_1(t) be the number of faults that caused exactly 1 error, M_2(t) the number of faults that caused exactly 2 errors, etc., with \sum_j j M_j(t) being the total number of errors that resulted. The expected number of single errors arising from single faults is given by E[M_1(t)].

A set of indicator variables can be defined such that,

    I_i(t) = \begin{cases} 1, & \text{fault } i \text{ causes exactly 1 error}, \\ 0, & \text{otherwise}. \end{cases}

The number of faults causing exactly one error is given by M_1(t) = \sum_i I_i(t), resulting in the expression for the expected value,

    E\left[ M_1(t) \right] = \sum_i E\left[ I_i(t) \right] = \sum_i λ_i t e^{-λ_i t}.    (6.23)

From Eq. (6.22) and Eq. (6.23) an interesting result can be obtained,

    E\left[ F(t) - \frac{M_1(t)}{t} \right] = 0.    (6.24)

Eq. (6.24) suggests that M_1(t)/t could be a good estimate of F(t). To examine this, the variance of the difference is,

    E\left[ \left( F(t) - \frac{M_1(t)}{t} \right)^2 \right] = \operatorname{var}\left( F(t) - \frac{M_1(t)}{t} \right)
    = \operatorname{var}\left( F(t) \right) - \frac{2}{t} \operatorname{cov}\left( F(t), M_1(t) \right) + \frac{1}{t^2} \operatorname{var}\left( M_1(t) \right).

Expanding the variance terms gives,

    \operatorname{var}\left( F(t) \right) = \sum_i λ_i^2 \operatorname{var}\left( f_i(t) \right) = \sum_i λ_i^2 e^{-λ_i t}\left( 1 - e^{-λ_i t} \right),

    \operatorname{var}\left( M_1(t) \right) = \sum_i \operatorname{var}\left( I_i(t) \right) = \sum_i λ_i t e^{-λ_i t}\left( 1 - λ_i t e^{-λ_i t} \right),

    \operatorname{cov}\left( F(t), M_1(t) \right) = \operatorname{cov}\left( \sum_i λ_i f_i(t), \sum_j I_j(t) \right)
    = \sum_i \sum_j \operatorname{cov}\left( λ_i f_i(t), I_j(t) \right)
    = \sum_i λ_i \operatorname{cov}\left( f_i(t), I_i(t) \right)
    = -\sum_i λ_i^2 t e^{-2λ_i t}.

The two random variables f_i(t) and I_j(t) are independent when i ≠ j, since they refer to different Poisson processes, while f_i(t)·I_i(t) = 0 (a fault that has caused no error cannot have caused exactly one). This results in an expression for the mean squared difference between the error rate of the entire software package and the estimate from single faults,

    E\left[ \left( F(t) - \frac{M_1(t)}{t} \right)^2 \right] = \sum_i λ_i^2 e^{-λ_i t} + \frac{1}{t} \sum_i λ_i e^{-λ_i t} = \frac{E\left[ M_1(t) + 2 M_2(t) \right]}{t^2}.
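A quick Monte Carlo check of Eq. (6.24) in Python follows, under the stated assumptions of independent Poisson error processes; the per-fault error rates are chosen arbitrarily for the example. It compares the single-error estimator M_1(t)/t with the true residual error rate F(t).

```python
import math
import random

def simulate(rates, t, trials=20000, seed=1):
    """Compare E[F(t)] with E[M1(t)/t] for independent Poisson error processes."""
    rng = random.Random(seed)
    f_sum = m1_sum = 0.0
    for _ in range(trials):
        F = 0.0   # total rate of faults that produced no error by time t
        M1 = 0    # number of faults that produced exactly one error by time t
        for lam in rates:
            # draw k ~ Poisson(lam * t) by inversion of the CDF
            u, k = rng.random(), 0
            p = c = math.exp(-lam * t)
            while u > c:
                k += 1
                p *= lam * t / k
                c += p
            if k == 0:
                F += lam
            elif k == 1:
                M1 += 1
        f_sum += F
        m1_sum += M1 / t
    return f_sum / trials, m1_sum / trials

rates = [0.5, 0.2, 0.1, 0.05]   # assumed per-fault error rates
print(simulate(rates, t=2.0))   # the two averages should nearly agree
```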

Software Reliability and Fail–Safe Operations

Fault–Tolerant computer systems are often applied to safety applications, where statistical reliability figures provide little comfort to the user [Brow80]. In this environment there is a tradeoff between reliability and safety: reliability does not assure safety [NSCC80]. Fail–Safe functionality becomes the design goal, rather than Fault–Tolerance [Leve86], with specific techniques being applied to software to assure the Fail–Safe operation of the system [TÜV86]. Although this paper deals with the reliability and availability aspects of fault–tolerant systems, it can serve as the basis for the analysis of Fail–Safe operations as well [Leve83].


Unlike hardware, software does not have a wear out failure mode, even though specific failure modes do exist [Thom81]. In Fault–Tolerant systems these failure modes include:

• Failure to perform an expected command.
• Failure to diagnose a latent fault (fault coverage).
• Generation of an erroneous, premature, or delayed command.

In the context of Fault–Tolerant system failures, faults have several definitions [SINT89]:

• Dangerous Revealed – failures that will be revealed by the effects of the fault.
• Dangerous Unrevealed – failures that will remain unrevealed and will be found only through proof testing of the system.
• Safe Revealed – failures that are safe and will cause the system under control to assume a safe, but inoperative, state.
• Safe Unrevealed – failures that are safe but result in continued operation of the system under control. They will be revealed through the proof testing of the system.


CHAPTER 7

PARTIAL FAULT COVERAGE SUMMARY

The previous sections have developed the background and details of several reliability measures for redundant systems. Figure 13 compares the relative improvements in system MTTF for various redundant configurations. For comparison purposes, the failure rate λ is assumed to have a value of 10⁻⁴ failures per hour, one failure every 10⁴ hours. The repair time for each system with repair is assumed to be 8 hours, giving µ = 0.125. The MTTF for a simplex system is then approximately 1 year. In the parallel redundant configuration, the MTTF is dominated by the second term, which represents the added redundancy of the second subsystem.

System                   MTTF                                  MTTF in years
Simplex                  1/λ                                   1
Parallel Redundant       3/(2λ) + µ/(2λ²)                      625
TMR in the 3–2 mode      5/(6λ) + µ/(6λ²)                      208
TMR in the 3–2–1 mode    11/(6λ) + 2µ/(3λ²) + µ²/(6λ³)         260,400

Figure 13 – MTTF of Simplex, Parallel Redundant, and TMR Systems.
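These closed forms are easy to tabulate. The short Python sketch below evaluates the Figure 13 expressions for the stated λ and µ; note that a "year" is taken here as the simplex MTTF of 10⁴ hours, matching the normalization used in the table.

```python
lam, mu = 1e-4, 0.125      # failures/hour, repairs/hour (8 h repair time)
YEAR = 1e4                 # hours; the normalization used in Figure 13

systems = {
    "Simplex":   1 / lam,
    "Parallel":  3 / (2 * lam) + mu / (2 * lam**2),
    "TMR 3-2":   5 / (6 * lam) + mu / (6 * lam**2),
    "TMR 3-2-1": 11 / (6 * lam) + 2 * mu / (3 * lam**2) + mu**2 / (6 * lam**3),
}
for name, mttf in systems.items():
    print(f"{name:10s} MTTF = {mttf:.3e} h = {mttf / YEAR:,.0f} years")
```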

In the TMR configuration, the system continues to operate as long as two of the three subsystems are functional. This configuration is similar to the Parallel Redundant configuration, except that there is one additional subsystem. As a result, the MTTF is reduced by a factor of 3 when compared to a perfect coverage Parallel Redundant system. If the TMR system is allowed to operate in the Simplex mode (the 3–2–1 mode), the MTTF increases to 260,400 years, roughly a factor of 1,250 over the 3–2 mode.

EFFECTS OF COVERAGE

When coverage is added to the reliability models, a significant change occurs in the relative reliability figures for various configurations. Figure 14 describes the effect of coverage on the MTTF of the various configurations.

                        MTTF, years
System         c = 1      c = 0.99    c = 0.95    c = 0
Dual 2–1       625        50          10          0.5
TMR 3–2        208
TMR 3–2–1      260,400    20,833      4,167       208

Figure 14 – MTTF of Parallel Redundant and TMR Systems with varying degrees of coverage.
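The dual row of Figure 14 can be reproduced from a three–state Markov model (both units up, one up, system failed) in which an uncovered fault, at rate 2λ(1 − c), leads directly to system failure; this state structure is assumed here in the spirit of the imperfect–coverage chains developed in Chapter 5, and the 10⁴ hour "year" matches the table's normalization. A Python sketch:

```python
def duplex_mttf(lam, mu, c):
    """MTTF of a duplex system with coverage c and repair rate mu.

    State 2 (both up): covered fault 2*lam*c -> state 1,
                       uncovered fault 2*lam*(1-c) -> failure.
    State 1 (one up):  repair mu -> state 2, fault lam -> failure.
    T2, T1 are mean times to failure from each state:
      T2 = 1/(2 lam) + c * T1
      T1 = 1/(lam + mu) + (mu / (lam + mu)) * T2
    """
    denom = 1.0 - c * mu / (lam + mu)
    return (1.0 / (2.0 * lam) + c / (lam + mu)) / denom

lam, mu, YEAR = 1e-4, 0.125, 1e4
for c in (1.0, 0.99, 0.95, 0.0):
    print(f"c = {c:4.2f}: MTTF ~ {duplex_mttf(lam, mu, c) / YEAR:8.1f} years")
```

Running this gives roughly 626, 46, 10, and 0.5 years, in reasonable agreement with the Dual 2–1 row of the table.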


CHAPTER 8

REMAINING QUESTIONS

Although Figure 13 describes the effect of coverage on redundant architectures with exponential distributions for the failure rate of the underlying hardware, the figures stated are a simplified version of commercial products, with the assumptions stated in Figure 2. Several major topics of discussion remain open:

• If the distributions of an actual system are not exponential and / or are not independent, what effect does this have on the reliability calculations presented in this paper?

• Given an actual redundant configuration, what dependencies are present for a specific hardware or software structure? Does the common cause failure model affect the reliability in a significant way?

• Can the assumption that all failure rates are constant be verified for actual hardware and software systems in commercial products?

• Can repairable systems be modeled using similar techniques if the repair time is extremely long (> 6 months) or repairs are carried out at regular intervals, rather than on demand? This question arises when dealing with unmanned systems.

REALISTIC PROBABILITY DISTRIBUTIONS

Throughout this paper an assumption has been made regarding the probability distributions of the failure rates and repair rates. Although these assumptions may be appropriate in describing physical events, the validation of their appropriateness has not been stated formally. Several motivations are present for assuming Exponential and Poisson Failure distributions:


• There is evidence that electronic components follow an exponential failure curve. Although the goodness of fit of this curve is acceptable in many instances, alternative distributions provide more accurate models [Mels78], [Snyd75].

• The use of the exponential distribution provides known Laplace transforms, which are utilized in the Markov models of system reliability. Alternative distributions, both periodic and cyclostationary, do not possess Laplace transforms and therefore are intractable when forming differential equations [Cast80].

• A constant hazard rate (Eq. (2.25)) provides an easy means of defining the underlying failure mechanism of the system under analysis. In practice the hazard rate is usually a function of time, use, and application [Bell78].

• Faults present in the system which result in an error or a failure are assumed to be permanent rather than transient. This assumption allows exponential distributions to be used in the Markov model. In practice transient faults may represent the majority of system upset conditions [Stif80]. In an attempt to analyze transient fault conditions, the Weibull distribution provides a better estimator of the underlying statistical process. This distribution has no known Laplace transform and therefore is intractable to analysis with the Markov process [McCo79].

Multiple Failure Distributions

Before examining the Weibull distribution, the effects of ignoring the assumptions made in Figure 2 must be understood. The question to be answered is, What is the effect on the system reliability if the failure distribution is different for each subsystem; if the individual faults are not independent; and if the repair / replacement strategy is also not constant? Suppose there are a large number n of failure types, with individual failure times Y_1, …, Y_n. That is, Y_i denotes the failure time that would be observed if all failures except the i th failure were suppressed. The actual failure time, denoted by X_n when there are n failure types, is then given as,

    X_n = \min\left( Y_1, …, Y_n \right).    (8.1)


Next assume that {Y_i} are mutually independent and identically distributed random variables with a Cumulative Distribution Function L(y). Then since,

    X_n ≥ x \iff Y_i ≥ x, \quad i = 1, 2, …, n,    (8.2)

it follows that,

    P\{ X_n ≥ x \} = \prod_{i=1}^{n} P\{ Y_i ≥ x \} = \left( 1 - L(x) \right)^n.    (8.3)

Suppose now that L(x) ≈ ax^α for α > 0 as x → 0. Then for sufficiently large n only small x need be considered in Eq. (8.3), and,

    P\{ X_n ≥ x \} ≈ e^{-n L(x)} ≈ e^{-n a x^α}.    (8.4)

Let X_n^* = X_n / k_n, where k_n is a normalizing constant chosen so that X_n^* has a limiting distribution as n → ∞. Then,

    P\{ X_n^* ≥ x \} = P\{ X_n ≥ k_n x \} ≈ e^{-n a k_n^α x^α}.    (8.5)

Taking k_n = (na)^{-1/α} gives,

    P\left[ X_n^* ≥ x \right] ≈ e^{-x^α}.    (8.6)


The standardized variable (na)^{1/α} X_n has a limiting Weibull distribution. The index α of the Weibull distribution is determined by the local behavior near x = 0 of the underlying Cumulative Distribution Function L(x). If α = 1, so that L(x) is locally rectangular near x = 0, the limiting distribution function is exponential.

Weibull Distribution

The Weibull distribution is best described in its Cumulative Distribution form [Kapu77]. The Cumulative Distribution of a random variable t distributed as the three parameter Weibull is given as,

    F(t) = 1 - \exp\left[ -\left( \frac{λt - δ}{1 - δ} \right)^α \right], \quad \text{for } t ≥ δ,    (8.7)

where α > 0, δ > 0, λ > 0, and α is the shape parameter or Weibull slope [Kimb60], [Whit69]. As α increases, the mean of the distribution approaches the characteristic life λ; with α = 1 the Weibull distribution becomes the exponential distribution. The location parameter δ defines the minimum life of the system. The scale of the distribution is controlled by λ. The three parameter expression in Eq. (8.7) can be reduced to a two parameter form by assuming the minimum life parameter δ is zero [Naka75], [Kapu77]. The probability density function (pdf), Cumulative Distribution Function (CDF), Reliability, and Hazard Function of the two parameter Weibull distribution are expressed in the following,


    \text{pdf:} \quad f(t) = αλ (λt)^{α-1} e^{-(λt)^α},    (8.8)

    \text{CDF:} \quad F(t) = 1 - e^{-(λt)^α},    (8.9)

    \text{Reliability:} \quad R(t) = e^{-(λt)^α},    (8.10)

    \text{Hazard Function:} \quad z(t) = αλ (λt)^{α-1}.    (8.11)

The values of Eq. (8.8) through Eq. (8.11) are functions of time only through the product λt of the scale factor and time. The failure rate is directly influenced by the shape parameter in the following ways [Thom69], [Meno63]:

• If α < 1 the failure rate is decreasing with time.
• If α = 1 the failure rate is constant with time and the Weibull distribution is identical to the exponential distribution.
• If α > 1 the failure rate is increasing with time.

The Weibull distribution most closely matches the failure distribution of transient faults [McCo79] and therefore is useful in the analysis of Fault–Tolerant systems which are capable of handling transient as well as permanent faults. The primary difficulty with the Weibull distribution is that no known Laplace transform exists, making it intractable to generate closed form differential equations for the transition probabilities in a Markov model. Other properties of the Weibull distribution are useful for the analysis of reliable systems. The k th moment of the Weibull distribution is given by [Kend77],



    \mu'_k = E\left[ t^k \right] = \int_0^\infty t^k \, αλ (λt)^{α-1} e^{-(λt)^α} \, dt.    (8.12)

Using the transformation u = (λt)^α and du = αλ (λt)^{α-1} dt gives a new form of the moment generating function,

    \mu'_k = \left( \frac{1}{λ} \right)^k \int_0^\infty u^{k/α} e^{-u} \, du.    (8.13)

Eq. (8.13) is the recognizable form of the Gamma function [Arfk70] and results in,

    \mu'_k = \left( \frac{1}{λ} \right)^k Γ\left( 1 + \frac{k}{α} \right).    (8.14)

The mean of the Weibull distribution is given by,

    \mu = \frac{1}{λ} Γ\left( 1 + \frac{1}{α} \right),    (8.15)

and the variance is given by,

    σ^2 = \left( \frac{1}{λ} \right)^2 \left[ Γ\left( 1 + \frac{2}{α} \right) - Γ^2\left( 1 + \frac{1}{α} \right) \right].    (8.16)

With only the mean and the variance (standard deviation) of a sample data set available, the Weibull failure rate can be determined to be increasing, constant, or decreasing. The relationship between the mean and the standard deviation is a useful indicator, as sketched in the code below:

• µ < σ, the failure rate is decreasing,
• µ = σ, the failure rate is constant,
• µ > σ, the failure rate is increasing.
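A small Python sketch of this indicator follows; it computes µ and σ from assumed Weibull parameters via Eq. (8.15) and Eq. (8.16) and classifies the hazard trend accordingly.

```python
from math import gamma

def weibull_mean_std(lam, alpha):
    """Mean and standard deviation of the two-parameter Weibull,
    from Eq. (8.15) and Eq. (8.16)."""
    mean = (1.0 / lam) * gamma(1.0 + 1.0 / alpha)
    var = (1.0 / lam) ** 2 * (gamma(1.0 + 2.0 / alpha) - gamma(1.0 + 1.0 / alpha) ** 2)
    return mean, var ** 0.5

def hazard_trend(mean, std):
    if abs(mean - std) < 1e-9 * mean:
        return "constant (exponential)"
    return "decreasing" if mean < std else "increasing"

for alpha in (0.5, 1.0, 2.0):    # assumed shape parameters
    m, s = weibull_mean_std(1e-4, alpha)
    print(f"alpha = {alpha}: mu = {m:.3e}, sigma = {s:.3e}, rate {hazard_trend(m, s)}")
```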

PERIODIC MAINTENANCE

In the previous sections the reliability models assumed a fixed repair rate µ (a fixed Mean Time to Repair of 1/µ) for each module, which resulted in the system availability measures given in Eq. (2.42) and Eq. (2.80). The assumption in these expressions is that a maintenance request will be generated as a result of a failure, and that the length of time between the failure event and the repair of the module is a fixed interval. In a practical application, responding to a module repair request within the repair time may not be possible [Ingl77]. The inability to respond in the minimum time may be caused by:

• The remoteness of the installation,
• System operation rules,
• Lack of spare parts.

In many instances a periodic maintenance interval is substituted for the on demand maintenance technique assumed in the previous sections.

Periodic Maintenance of Repairable Systems

The overall system reliability of a general purpose configuration given in Eq. (2.26) can be restated as the reliability of an unmaintained system by [Ng77],

    R_0(t) = \sum_i a_i e^{-λ_i t},    (8.17)


where a_i is a normalizing constant. In this model the following notation will be used:

• τ – the time interval from the latest periodic maintenance action,
• β – the time interval between two adjacent maintenance actions,
• iβ – the time of the i th occurrence of a maintenance action,
• R̄ – the reliability of a maintained system,
• R̄* – the reliability of an unmaintained system.

The reliability of a system after receiving the i th maintenance action is R_i(t). The reliability of an unmaintained system is given by R̄*_i = R_0(iβ), and the reliability of a maintained system is given by R̄_i = R_i(iβ) = R_{i−1}(iβ). A convenient representation of the reliability of each system is developed through the generating function of the probability sequence of each maintenance action [Ahlf66], [Fell71], [Jury64]. The generating function for the sequence of survival probabilities of an unmaintained system is given by,

    G^*(s) = \sum_{i=1}^{\infty} \bar{R}^*_i s^i,

and the generating function for the sequence of a maintained system is given by,

    G(s) = \sum_{i=1}^{\infty} \bar{R}_i s^i.

By maintaining a Fault–Tolerant system in a periodic manner, an improvement in the overall reliability can be expected. The probability that a maintenance action does not result in an improvement in the reliability is given by θ_1, while the probability that the maintenance action


brings the system to a condition identical to the reliability at the initial system start, t = 0, is given by θ_2. After β time units the system is inspected, and if module failures have occurred, repair of the system takes place. With probability θ_1 all failures, if any, are ignored or insufficiently repaired. With probability θ_2 all failures are removed from the system and the operation of the system is restored to full perfection. Since all failures are removed from the system, partial restoration of a failed system is not considered [Naka79], [Oda81], [Naka81], so the probabilities satisfy θ_1 = 1 − θ_2. From the definition of θ_1 and θ_2 it can be shown that after i maintenance actions the system reliability is given by R_{i−1}(t) with probability θ_1 and R_{i−1}(iβ)·R_0(t − iβ) with probability θ_2. Defining a term for the time from the latest maintenance action as τ = t − iβ, the system reliability can be expressed as,

    R_i(τ + iβ) = θ_1 R_{i−1}(τ + iβ) + θ_2 R_0(τ) R_{i−1}(iβ),    (8.18)

with i > 0 and 0 ≤ τ ≤ β. Restating the system reliability as a function of the maintenance interval gives,

    R_0(τ) = \sum_j a_j e^{-λ_j τ},    (8.19)

and,

    R_1(τ + β) = \sum_j a_j e^{-λ_j τ} \left[ θ_1 e^{-λ_j β} + θ_2 \bar{R}_1 \right].    (8.20)

A general expression can be developed by using the symmetry of Eq. (8.18), resulting in,

    R_i(τ + iβ) = \sum_j a_j e^{-λ_j τ} \left[ \left( θ_1 e^{-λ_j β} \right)^i + θ_2 \sum_{k=0}^{i-1} \left( θ_1 e^{-λ_j β} \right)^k \bar{R}_{i-k} \right].    (8.21)

To obtain R̄_i from Eq. (8.21), the reliability time base is set to the time between adjacent maintenance actions, t = β, and the sequence of summation changed to give,

    \bar{R}_{i+1} = θ_1^i \bar{R}^*_{i+1} + θ_2 \sum_{k=0}^{i-1} θ_1^k \bar{R}^*_{k+1} \bar{R}_{i-k}.    (8.22)

The probability of survival until the first maintenance action is the same whether the system is maintained or not, that is,

    \bar{R}_1 = \bar{R}^*_1.    (8.23)

The R̄_i may be determined recursively from Eq. (8.22) and Eq. (8.23). The R̄_i may also be obtained from the generating function defined by,

    G(s) = \frac{G^*(θ_1 s)}{θ_1 - θ_2 G^*(θ_1 s)},    (8.24)

and,

    G^*(s) = \sum_j a_j \frac{s \, e^{-λ_j β}}{1 - s \, e^{-λ_j β}},    (8.25)


where Eq. (8.24) and Eq. (8.25) are obtained from Eq. (8.22) and Eq. (8.19) [Helv80]. The determination of R̄_i provides the final expression for the reliability from Eq. (8.21),

    R_i(t) = \sum_j a_j e^{-λ_j t} \, θ_1^i \left[ 1 + θ_2 \sum_{k=1}^{i} \left( θ_1 e^{-λ_j β} \right)^{-k} \bar{R}_k \right].    (8.26)

The Mean Time to Failure, using Eq. (2.41), is,

    MTTF = \sum_{i=0}^{\infty} \int_0^β R_i(τ + iβ) \, dτ,    (8.27)

    MTTF = \left[ 1 - θ_2 \frac{G^*(θ_1)}{θ_1} \right]^{-1} \sum_j \frac{a_j}{λ_j} \cdot \frac{1 - e^{-λ_j β}}{1 - θ_1 e^{-λ_j β}}.    (8.28)
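The closed form of Eq. (8.28) is straightforward to evaluate numerically. The Python sketch below does so for an unmaintained TMR survivor function, assuming Eq. (3.13) has the standard TMR form R_0(t) = 3e^{−2λt} − 2e^{−3λt}; the λ, β, and θ_2 values are assumed for illustration only.

```python
import math

def mttf_periodic(terms, beta, theta2):
    """MTTF of a periodically maintained system, Eq. (8.28).

    terms  : list of (a_j, lam_j) pairs defining R0(t) = sum a_j exp(-lam_j t)
    beta   : maintenance interval
    theta2 : probability a maintenance action fully renews the system
    """
    theta1 = 1.0 - theta2
    # G*(theta1) from Eq. (8.25) evaluated at s = theta1
    g_star = sum(a * theta1 * math.exp(-lam * beta)
                 / (1.0 - theta1 * math.exp(-lam * beta)) for a, lam in terms)
    prefactor = 1.0 / (1.0 - theta2 * g_star / theta1)
    tail = sum((a / lam) * (1.0 - math.exp(-lam * beta))
               / (1.0 - theta1 * math.exp(-lam * beta)) for a, lam in terms)
    return prefactor * tail

lam = 1e-4                                    # module failure rate (assumed)
tmr = [(3.0, 2.0 * lam), (-2.0, 3.0 * lam)]   # unmaintained TMR survivor function
print(f"unmaintained TMR MTTF = {5.0 / (6.0 * lam):.3e} h")
for beta in (100.0, 1000.0, 10000.0):         # maintenance interval, hours
    print(f"beta = {beta:7.0f} h: MTTF = {mttf_periodic(tmr, beta, 0.9):.3e} h")
```

Consistent with Figure 15, short maintenance intervals improve the MTTF by more than an order of magnitude, while intervals comparable to the module MTTF give little benefit.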

Reliability Improvement for a TMR System

The analysis of periodically maintained TMR systems has been presented in several other papers [Maka81], [Helv80], [Math70]. Although the details of these papers are relevant to the discussion presented here, the mathematics is beyond the scope of this presentation and is the subject of a future paper. A simplifying assumption will be made regarding the periodic maintenance of the TMR system: only the relative increase in reliability is considered in this paper. For a TMR system with the unmaintained reliability expression given in Eq. (3.13), Figure 15 shows the increase in system reliability as a result of periodic maintenance [Helv80].


[Figure 15 is a log–log plot of normalized Mean Time to Failure against normalized maintenance interval, with one curve per maintenance quality level.]

Figure 15 – Mean Time to Failure increase for a Triple Modular Redundant system with periodic maintenance. The graph shows that maintenance intervals greater than one–half of the mean time to failure of one module have little effect on increasing reliability, but frequent maintenance, even low quality maintenance, improves the system reliability considerably.

CHAPTER 9

CONCLUSIONS

A broad range of topics has been presented in support of the original question, What is the reliability of a system in the presence of less than perfect diagnostic coverage? The following conclusions can be made regarding this question:

• Analytical expressions have been developed which describe the time dependent behavior of Fault–Tolerant systems. These expressions are based on a continuous–time Markov Chain, which assumes instantaneous transition times of an underlying random process. The statistical distribution that drives the random process is assumed to be exponential in nature.

• A coverage factor was introduced in the Markov Chain model that results in a closed form solution for the system Mean Time To Failure for various fault–tolerant configurations. Several assumptions have been made regarding the underlying failure process as well as the repair process.

• In the presence of imperfect coverage, a scheme has been described which determines the actual coverage present in the system.

In addition to the above summary conclusions, several questions regarding the validity of the original assumptions have been presented. The alternative considerations present a different set of analytical problems that in turn require simplifying assumptions to produce tractable results. These questions include:

• A modeling technique capable of dealing with the Weibull distribution.

• A modeling technique capable of dealing with finite transition times between fault detection and fault repair. This technique would provide realistic models for systems with actual repair processes, rather than instantaneous repair processes.

• Analytical expressions for a general configuration of models, including repair cycles. ∴


Appendix A

MARKOV CHAINS

Modeling of Partial Diagnostic Coverage in Fault–Tolerant systems with a Markov Matrix technique requires an underlying assumption regarding the failure probability distributions for the electronic components of the system. In addition, the properties of the probability generating functions for the failure processes are assumed to be understood in detail. In the body of this work, some assumptions were made regarding Markov Chains without detailed development. This appendix provides the definitions and theorems for the properties of the transition matrix theory utilized in Chapter 5 [Chun67], [Parz60], [Parz62], [Midd46], [Pipe63], [Keme60], [Cox62], [Cox65], [Bhar60], [Bell60], [Hoel72].

Definition A.1 A Stochastic Process is defined as any collection of random variables X(t), t ∈ T, defined on a common probability space, where T is a subset of (−∞, ∞) and is referred to as the time parameter set. The random process is called a continuous parameter process if T is an interval having positive length, and a discrete parameter process if T is a subset of the integers. If the random variables X(t) take values from a fixed set S, then S is called the State Space of the process∴

Definition A.2 A Stochastic Process {X_t}, t ∈ ℝ⁺, is said to satisfy the Markov property if for all times t_0 < t_1 < ⋯ < t_n < t and for all n it is true that,


    P\{ X_t = j \mid X_{t_0} = i_0, X_{t_1} = i_1, …, X_{t_n} = i_n \} = P\{ X_t = j \mid X_{t_n} = i_n \} ∴    (A.1)

Definition A.3 The Markov Chain {X_t}, t ∈ ℝ⁺, is said to be stationary if for every i and j in the probability transition matrix S, P\{ X_{t+h} = j \mid X_t = i \} is independent of t∴

The following conditions must be satisfied by the probability transition matrices used in Chapter 5:

    p_{ij}(t) ≥ 0, \quad ∀ i, j ∈ S \text{ and } t ≥ 0,    (A.2)

    \sum_{j ∈ S} p_{ij}(t) = 1, \quad ∀ i ∈ S \text{ and } t ≥ 0,    (A.3)

    p_{ij}(t + s) = \sum_{k ∈ S} p_{ik}(t) p_{kj}(s), \quad ∀ i, j ∈ S \text{ and } t, s ≥ 0,    (A.4)

    \lim_{t → 0^+} p_{ij}(t) = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i ≠ j. \end{cases}    (A.5)
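For a chain generated by a rate matrix Q, the transition matrix is P(t) = e^{Qt}, and conditions (A.2) through (A.5) can be verified numerically. A short Python sketch using scipy follows; the three–state generator is an arbitrary example for illustration, not one of the system models of Chapter 5.

```python
import numpy as np
from scipy.linalg import expm

# Arbitrary 3-state generator: rows sum to zero, off-diagonals nonnegative
Q = np.array([[-2.0,  1.5,  0.5],
              [ 1.0, -1.0,  0.0],
              [ 0.0,  0.0,  0.0]])   # state 2 absorbing

def P(t):
    return expm(Q * t)               # transition probability matrix p_ij(t)

t, s = 0.3, 0.7
assert np.all(P(t) >= -1e-12)                        # (A.2) nonnegativity
assert np.allclose(P(t).sum(axis=1), 1.0)            # (A.3) rows sum to 1
assert np.allclose(P(t + s), P(t) @ P(s))            # (A.4) Chapman-Kolmogorov
assert np.allclose(P(1e-9), np.eye(3), atol=1e-6)    # (A.5) continuity at zero
print("conditions (A.2) through (A.5) hold for this generator")
```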

Theorem A.1 Let X_t be a continuous–time Markov chain with a transition probability function p_{ij}(t). If i and j are fixed states in S, then p_{ij}(t) is a uniformly continuous function of time t.

Proof of Theorem A.1 Let ε > 0 be given. In order to show that p_{ij}(t) is uniformly continuous, it must be shown that there exists a δ > 0 such that \left| p_{ij}(t + h) - p_{ij}(t) \right| < ε for all t


whenever 0 < h < δ. Using the Chapman–Kolmogorov equations defined in Eq. (5.20) and Eq. (5.22) results in,

    \left| p_{ij}(t+h) - p_{ij}(t) \right| = \left| \sum_k p_{ik}(h) p_{kj}(t) - p_{ij}(t) \right|
    ≤ \sum_{k ≠ i} p_{ik}(h) p_{kj}(t) + p_{ij}(t) \left| p_{ii}(h) - 1 \right|.    (A.6)

Since p_{kj}(t) ≤ 1 for all k, j, and t, it follows that,

    \left| p_{ij}(t+h) - p_{ij}(t) \right| ≤ \sum_{k ≠ i} p_{ik}(h) + \left| p_{ii}(h) - 1 \right|.    (A.7)

But since \sum_{k ∈ S} p_{ik}(h) = 1 results in \sum_{k ≠ i} p_{ik}(h) = 1 - p_{ii}(h), the following relationship holds,

    \left| p_{ij}(t+h) - p_{ij}(t) \right| ≤ 2\left( 1 - p_{ii}(h) \right),    (A.8)

and by Eq. (A.5) it is given that \lim_{h → 0^+} \left( 1 - p_{ii}(h) \right) = 0. Therefore, given ε > 0, a δ > 0 can be found so that 1 - p_{ii}(h) < ε/2 whenever 0 < h < δ, which establishes the uniform continuity∴
In a continuous–time Markov chain there is no fixed time step t* > 0 such that p_{ij}(t*) yields the probability transition of the next state from the current state. Since no positive time unit can be used in considering where the system state can go next, instantaneous state changes must be considered in the analysis of the system. This requirement leads to the analysis of the derivative of the transition probability function p_{ij}(t) for i ∈ S, j ∈ S. Since the derivative is defined as the limit of the difference quotient, the determination of limits is necessary. In order to make this determination the following Lemma will be used, which involves certain properties of the supremum of a set of real numbers.

Lemma A.1 Let f(t) be a real–valued function defined on the open interval (0, ∞). Assume that \lim_{t → 0^+} f(t) = 0 and assume that f is subadditive, that is,

    f(s + t) ≤ f(s) + f(t) \quad \text{for all } s, t ∈ ℝ^+.

Then \lim_{t → 0^+} \frac{f(t)}{t} exists and in fact equals \sup_{t ∈ ℝ^+} \frac{f(t)}{t}∴

Theorem A.2 Let X_t be a continuous–time Markov chain with a transition probability function p_{ij}(t). In the case where i = j, the resulting probability function of t has a right–hand derivative at zero, in the sense that the limit,

    \lim_{t → 0^+} \frac{p_{ii}(t) - 1}{t} = q_{ii},

exists. This limit will always be nonpositive, and in some cases the limit will be −∞.

Proof of Theorem A.2 The transition probability condition for the Markov chain described in Eq. (A.4) states that for all s, t ∈ [0, ∞), p_{ii}(t + s) = \sum_k p_{ik}(t) p_{ki}(s), so when the sum is reduced to the case k = i it follows that for all s, t ∈ [0, ∞), p_{ii}(t + s) ≥ p_{ii}(t) p_{ii}(s). Consider the function f(t) = -\log p_{ii}(t). It can be shown that f(t) is subadditive and that \lim_{t → 0^+} f(t) = 0. It then follows from Lemma A.1 that \lim_{t → 0^+} \frac{-\log p_{ii}(t)}{t} exists. The difference quotient in question is \frac{p_{ii}(t) - 1}{t}, so the following fact from calculus is used to reduce this term. For small values of x > 0, the Taylor series expansion of -\log(1 - x) about zero is,

    -\log(1 - x) = x + \frac{x^2}{2} + \frac{x^3}{3} + \frac{x^4}{4} + ⋯ = x + R(x).

The dominant term in this series is the first term, and it can be shown that \lim_{x → 0^+} \frac{R(x)}{x} = 0. Using the above developments, an expression for the right–hand derivative is given as,

    -\log p_{ii}(t) = -\log\left[ 1 - \left( 1 - p_{ii}(t) \right) \right] = 1 - p_{ii}(t) + R\left( 1 - p_{ii}(t) \right).    (A.9)

It follows that,

    \frac{\log p_{ii}(t)}{t} = \frac{p_{ii}(t) - 1}{t} - \frac{R(1 - p_{ii}(t))}{t} = \frac{p_{ii}(t) - 1}{t} \left[ 1 + \frac{R(1 - p_{ii}(t))}{1 - p_{ii}(t)} \right],    (A.10)

resulting in,

    \frac{p_{ii}(t) - 1}{t} = \frac{\log p_{ii}(t)}{t} \left[ 1 + \frac{R(1 - p_{ii}(t))}{1 - p_{ii}(t)} \right]^{-1}.    (A.11)

(A.11)

Taking the limits of both sides of Eq. (A.11) as t → 0 + and noting that both terms on the right–hand side have limits at t → 0 + , results in confirming the  p (t ) − 1 existence of lim+  ii  = q ii ∴ t →0 t   Theorem A.3 Let X t be a continuous–time Markov process with a probability transition function pij ( t ) . In the case where i ≠ j the resulting probability transition function of t has a right–hand derivative at zero in the sense that  pij ( t )  lim+   = q ij exists and is finite∴ t →0  t  Proof of Theorem A.3 In proving this theorem X t will be considered only at certain discrete times, even though this is a continuous Markov process. This is done by examining the X t process at times h , 2h , 3h , K where h is a small positive number. With this discrete version of X t it can be noted the pij ( mh ) = pij( m ) ( h ) . In this case pij ( mh ) will be considered as an m–step transition of a discrete–time Markov process with a probability transition matrix of pij ( h ) . In the same way forbidden probabilities, j pii( m ) ( h ) are defined as,

Appendix A – 6

m −1

pij ( mh ) ≥ ∑ j pii( k ) ( h ) pij ( h ) p jj ( mh − kh − h ) .

(A.12)

k =0

The right–hand side of Eq. (A.12) is smaller since not all possible probability transitions are considered. By using Eq. (A.6) is follows that lim+ j pii( k ) ( h ) = 1 . h →0

This is illustrated by writing, k

pii(k ) ( h ) = j pii(k ) ( h ) + ∑ f ij(l ) ( h ) p ji ( k − 1) h  ,

(A.13)

l =0

where f ij( l ) ( h ) is the probability of the first visit to state j from state i in l steps. Now, lim p ji ( k − 1) h  = 0,

(A.14)

lim+ j pii(k ) ( h ) = lim+ pii(k ) ( h ) = 1.

(A.15)

h →0+

so,

h →0

h →0

Let ε > 0 be given and choose t 0 > 0 such that p jj ( t ) > 1 − ε for all t ∈ ( 0, t 0 ) and j pii( k ) ( h ) > 1 − ε for all kh ∈ ( 0, t 0 ) . Then using Eq. (A.15) it follows that, pij ( mh ) ≥ (1 − ε )

2

m −1

∑ p ( h ) = (1 − ε ) k =0

ij

2

mpij ( h ) ,

(A.16)

whenever mh < t 0 . Dividing Eq. (A.16) by mh gives, pij ( mh ) mh



(1 − ε )2 pij ( h ) h

whenever mh < t 0 .

Appendix A – 7

(A.17)

Let the transition rate in the limit be defined by, L = lim inf

pij ( t ) t

t →0+

.

(A.18)

The limit L is finite since, if m is chosen so that pij ( h ) h



pij ( mh ) mh



t0 ≤ mh ≤ t 0 results in, 2

1 1 2 ≤ ≤ . 2 2 (1 − ε ) mh (1 − ε ) t 0 (1 − ε )2

Using the definition of lim inf, choose t 1
0 is arbitrary, so lim sup  ≤ L . Therefore the limit of as  t h  h →0+  t → 0 + exists and is finite. This limit is denoted by qij ∴

Appendix A – 9

Appendix A – 10

Appendix B

SOLUTIONS TO LINEAR SYSTEMS The generalized TMR system model described in Chapter 5 depends on the solution of a set of differential equations using the Gauss Reduction method. This appendix outlines the details of this solution method [Brau70], [Cour43], [Jame67], [Rals65], [Hild56]. A set of m simultaneous linear equations in n unknowns, x 1 , K , x n , has the form, a11x 1 + L a1n x n a m 1x 1 + L a mn x n

= b1 M = bm

(B.1)

where the a ij , bi are the known constants. Rewriting the matrix terms as A = a ij , x = [ x 1 , K , x n ] , b = [b1 , K , bn ] , then using a standard matrix notation, Eq. (B.1) becomes, Ax = b . In the case of a square matrix, where m = n the rank of the matrix is given by r ( A ) = n so that A ≠ 0 , the unique solution obtained by x = A −1b can be written as, xi =

1 A

n

∑A b

. j =1

ij

j

, i = 1, K , n .

Appendix B – 1

(B.2)

From the definition of the determinate, it is observed that

∑A b ji

j

is the

j

determinate of the matrix formed from A by replacing column i with b j , giving, b1 1 M xi = A bn

a12 L a1n a1 L a1,n −1 b1 1 M ;K ; x n = M M . A a n 2 L a nn bn L a n ,n −1 bn

(B.3)

Eq. (B.2) is called Cramer’s Rule for finding the solution to a system of n equations in n unknowns. An alternative method, which provides more efficiency in numerical computation, is called Gauss Reduction. The Gauss Reduction method assumes that a11 ≠ 0 such that first equation in Eq. (B.1) is divided by a11 and the resulting equation is used to eliminate x 1 in equations 2, K , n . This method gives a new set of equations with augmented terms of,

aˆ22 x 2 + L + aˆ2 n x n

= bˆ1 = bˆ

aˆn 2 x 2 + L + aˆnn x n

= bˆn

x 1 + aˆ12 x 2 + L +aˆ1n x n

where the new terms aˆ1 j = aˆij = a ij −

a1 j a11

ai 1 ,

a1 j a11

,

M

2

j = 2, K , n ;

b1 j j = 2, K , n , i = 2, K , n and bˆ1 j = , i = 2, K , n and b11

a bˆi = bi − i 1 b1 , i = 2,K , n result from the elimination terms. a11

Appendix B – 2

(B.4)

The second equation in the set of Eq. (B.4) is then divided by $\hat{a}_{22}$ (this method assumes that the equations and variables are renumbered such that $\hat{a}_{22} \neq 0$), and the result is used to eliminate $x_2$ in equations $3, \ldots, n$. In a finite number of steps a set of equations can be formed (provided $|A| \neq 0$) such that

$$ \begin{bmatrix} 1 & h_{12} & \cdots & h_{1n} \\ & 1 & \cdots & h_{2n} \\ & & \ddots & \vdots \\ & & & 1 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix}. \tag{B.5} $$

Then $x_n = g_n$, and by back substitution $x_{n-1} = g_{n-1} - h_{n-1,n}\, g_n$, etc.; a code sketch of this elimination and back substitution follows below.

A general case can be considered where $m$ is different from $n$. The condition under which there is at least one solution to Eq. (B.1) can be determined. A new matrix can be introduced with dimension $m \times (n+1)$, giving $A^b = (A, b)$, with $A^b$ defined as the augmented matrix for the system. This matrix is formed by annexing to the matrix $A$ the vector $b$, which becomes column $n+1$. It should be noted that $r(A^b) \ge r(A)$, since every minor in $A$ also appears in $A^b$. Also, if $r(A^b) > r(A)$, then there does not exist a solution to Eq. (B.1). This follows since if $r(A^b) = k > r(A)$, every set of $k$ linearly independent columns from $A^b$ must contain $b$. Hence $b$ cannot be written as a linear combination of columns in $A$, and therefore there does not exist an $x_j$ such that $\sum_j x_j a_j = b$. On the other hand, if $r(A^b) = k = r(A)$, then there exists a set of $k$ columns from $A$ such that every column in $A^b$, and in particular $b$, can be written as a linear combination of these $k$ columns. Therefore there exists one solution to Eq. (B.1), and in fact a solution with no more than $k$ of the variables different from zero.
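A minimal sketch of the elimination and back substitution described by Eqs. (B.4) and (B.5) follows; the matrix and right–hand side are illustrative, and no pivoting is performed beyond the nonzero–pivot assumption made in the text.

```python
# A minimal sketch of Gauss Reduction per Eqs. (B.4)-(B.5): forward
# elimination to a unit upper-triangular system, then back
# substitution. No pivoting is performed, so (as the text assumes)
# every pivot encountered must be nonzero. Example data are arbitrary.
import numpy as np

def gauss_reduction(A, b):
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = len(b)
    for k in range(n):                   # forward elimination
        pivot = A[k, k]                  # assumed nonzero
        A[k, k:] /= pivot
        b[k] /= pivot
        for i in range(k + 1, n):        # eliminate x_k below row k
            factor = A[i, k]
            A[i, k:] -= factor * A[k, k:]
            b[i] -= factor * b[k]
    x = np.zeros(n)                      # back substitution, Eq. (B.5)
    for i in range(n - 1, -1, -1):
        x[i] = b[i] - A[i, i + 1:] @ x[i + 1:]
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
print(gauss_reduction(A, b))
print(np.linalg.solve(np.array(A), np.array(b)))   # cross-check
```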


Theorem B.1
Consider the initial value problem

$$ \frac{d}{dt} y = A y + g(t), \qquad y(0) = 0, \tag{B.6} $$

where $A$ is a constant matrix in $F_{nn}$ and $g(t)$ is continuous and of exponential growth at infinity. Then the solution $\varphi(t)$ of Eq. (B.6) exists and is unique for $0 \le t \le \infty$, and its derivative $\varphi'(t)$ is of exponential growth at infinity ∴

Proof of Theorem B.1
The initial value problem in Eq. (B.6) is equivalent to the problem of finding a continuous function $\varphi(t)$ such that

$$ \varphi(t) = y_0 + \int_{t_0}^{t} \big[ A \varphi(s) + g(s) \big]\, ds, \tag{B.7} $$

for every $t$ in the interval $I$. By hypothesis, there exist constants $M > 0$ and $c$ and a time $T$ such that

$$ \| g(t) \| \le M e^{ct}, \tag{B.8} $$

for $t \ge T$. Assume $c > 0$, since increasing $c$ increases the right–hand side of Eq. (B.8) and does not affect the relation of the inequality. Eq. (B.7) can be written as

$$ \varphi(t) = y_0 + \int_{0}^{T} A \varphi(s)\, ds + \int_{0}^{T} g(s)\, ds + \int_{T}^{t} A \varphi(s)\, ds + \int_{T}^{t} g(s)\, ds. \tag{B.9} $$

It can be shown that Eq. (B.7) is bounded on the interval $0 \le t \le T$. Using this fact, and the continuity of $g(t)$, there exists a constant $K$ such that

$$ \left\| y_0 + \int_{0}^{T} A \varphi(s)\, ds + \int_{0}^{T} g(s)\, ds \right\| \le K. \tag{B.10} $$

Taking norms in Eq. (B.9), using Eq. (B.8), the inequality $\| A v \| \le \| A \|\, \| v \|$ (true for $A \in F_{nn}$ and any vector $v$), the property $\left\| \int_a^b f(t)\, dt \right\| \le \int_a^b \| f(t) \|\, dt$ for every $b > a$ and every continuous vector function $f$ on the interval $a \le t \le b$, and the triangle inequality for vector norms, the following relationship is obtained:

$$ \| \varphi(t) \| \le K + \int_{T}^{t} \| A \|\, \| \varphi(s) \|\, ds + \int_{T}^{t} M e^{cs}\, ds \le K + \int_{T}^{t} \| A \|\, \| \varphi(s) \|\, ds + \frac{M}{c} \left( e^{ct} - e^{cT} \right) \le K + \frac{M}{c} e^{ct} + \int_{T}^{t} \| A \|\, \| \varphi(s) \|\, ds. \tag{B.11} $$

Multiplying Eq. (B.11) by $e^{-ct}$, the following expression is obtained:

$$ \| \varphi(t) \|\, e^{-ct} \le \left( K e^{-ct} + \frac{M}{c} \right) + \int_{T}^{t} \| A \|\, \| \varphi(s) \|\, e^{-ct}\, ds \le \left( K e^{-ct} + \frac{M}{c} \right) + \int_{T}^{t} \| A \|\, \| \varphi(s) \|\, e^{-cs}\, ds, \tag{B.12} $$

since $e^{-ct} \le e^{-cs}$ for $t \ge s$. Since $c > 0$, there exists a constant $L$ such that $K e^{-ct} + \frac{M}{c} \le L$ for all $t \ge T$. The inequality described in Eq. (B.12) now becomes

$$ \| \varphi(t) \|\, e^{-ct} \le L + \| A \| \int_{T}^{t} \| \varphi(s) \|\, e^{-cs}\, ds. $$

Now let $K > 0$ and $a \ge 0$ be constants. Suppose that $r(t)$ is a continuous nonnegative function for $t \ge t_0$ which satisfies the inequality $r(t) \le a + K \int_{t_0}^{t} r(\tau)\, d\tau$ on some interval $I$. Then $r(t) \le a\, e^{K(t - t_0)}$ for $t \ge t_0$ and $t$ in the interval $I$ (Gronwall's inequality). Using this development with $r(t) = \| \varphi(t) \|\, e^{-ct}$ gives $\| \varphi(t) \|\, e^{-ct} \le L e^{\| A \| (t - T)}$, or $\| \varphi(t) \| \le L e^{-\| A \| T} e^{(\| A \| + c) t}$ for $t \ge T$. Thus $\varphi(t)$ has exponential growth at infinity. Since $\varphi'(t) = A \varphi(t) + g(t)$, the following inequality holds:

$$ \| \varphi'(t) \| \le \| A \|\, \| \varphi(t) \| + \| g(t) \| \le \| A \|\, L e^{-\| A \| T} e^{(\| A \| + c) t} + M e^{ct} \le \left( \| A \|\, L e^{-\| A \| T} + M \right) e^{(\| A \| + c) t}, \tag{B.13} $$

and thus $\varphi'(t)$ also has exponential growth at infinity ∴
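The existence result of Theorem B.1 can be illustrated numerically. The following sketch integrates Eq. (B.6) with scipy's general–purpose solver for an assumed $2 \times 2$ matrix $A$ and an exponential–growth forcing term $g(t)$; both are hypothetical choices for illustration only.

```python
# A numerical sketch of Theorem B.1: integrate dy/dt = Ay + g(t),
# y(0) = 0, for an assumed constant matrix A and a forcing term g(t)
# of exponential growth. Both choices are hypothetical illustrations.
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[-2.0, 1.0],
              [0.0, -1.0]])

def g(t):
    return np.array([np.exp(0.1 * t), 1.0])    # ||g(t)|| <= M e^{0.1 t}

sol = solve_ivp(lambda t, y: A @ y + g(t),
                t_span=(0.0, 10.0), y0=[0.0, 0.0])
print(sol.y[:, -1])       # phi(10): finite, as the theorem guarantees
```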

Appendix C

PROBABILITY GENERATING FUNCTIONS

Throughout the body of this paper a probability density function is used to represent a random variable associated with the failure rate of a system. In many instances, these random variables assume only nonnegative integer values representing the occurrence of a random event or the duration between the occurrences of random events [Dave70]. The series of random variables generated by the underlying random process can be represented by a sequence of numbers with several interesting properties.

Definition C.1
Let $a_0, a_1, \ldots, a_n$ be a sequence of real numbers representing a random process. If $A(s) = a_0 + a_1 s + a_2 s^2 + \cdots = \sum_j a_j s^j$ converges in some open interval $(-s_0 < s < s_0)$, then $A(s)$ is defined as the Generating Function of the sequence $\{a_j\}$ [Fell67] ∴

Unlike the Laplace transform notation, the variable $s$ has no significance in this instance. If the sequence $\{a_j\}$ is bounded, then a comparison with the geometric series shows that $A(s)$ converges at least for $|s| < 1$.

Let $X$ be a random variable assuming the values $0, 1, 2, \ldots$. Let $P\{X = j\} = p_j$ and $P\{X > j\} = q_j$, so that $q_k = p_{k+1} + p_{k+2} + \cdots$. The Generating Functions of the sequences $\{p_j\}$ and $\{q_k\}$ are defined as

$$ \begin{aligned} P(s) &= p_0 + p_1 s + p_2 s^2 + \cdots = \sum_j p_j s^j, \\ Q(s) &= q_0 + q_1 s + q_2 s^2 + \cdots = \sum_j q_j s^j. \end{aligned} \tag{C.1} $$

Since $P(1) = 1$, the series for $P(s)$ converges absolutely at least for $-1 \le s \le 1$. The coefficients of $Q(s)$ are less than unity, and so the series for $Q(s)$ converges at least in the open interval $(-1 < s < 1)$.

Theorem C.1
For the interval $(-1 < s < 1)$, $Q(s) = \dfrac{1 - P(s)}{1 - s}$ ∴

Proof of Theorem C.1
The coefficient of $s^n$ in $(1 - s) \cdot Q(s)$ is equal to $q_n - q_{n-1} = -p_n$ when $n \ge 1$, and is equal to $q_0 = p_1 + p_2 + \cdots = 1 - p_0$ when $n = 0$. Therefore $(1 - s)\, Q(s) = 1 - P(s)$ ∴

Restating the Generating Function for the random variable $X$ gives

$$ G(s) = \sum_j P\{X = j\}\, s^j = E\!\left[ s^X \right]. \tag{C.2} $$
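As a numerical illustration of Eq. (C.2), the sketch below evaluates the truncated series $\sum_j p_j s^j$ for a Poisson random variable (an assumed example) and compares it with the closed form $G(s) = e^{-\lambda(1-s)}$ that is standard for the Poisson distribution.

```python
# A numerical check of Eq. (C.2) for an assumed Poisson random
# variable with mean lam: the truncated series sum of p_j s^j is
# compared with the standard Poisson closed form exp(-lam(1 - s)).
import math

lam, s = 2.0, 0.5
G_series = sum(math.exp(-lam) * lam**j / math.factorial(j) * s**j
               for j in range(60))
G_closed = math.exp(-lam * (1.0 - s))
print(G_series, G_closed)      # both approximately 0.36788
```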

If $s$ is allowed to be any complex value, $s = x + iy$, an important generalization can take place [Fell66], [Arfk70]. By defining a new variable $s \Leftrightarrow e^{i\xi}$, where $i = \sqrt{-1}$, and substituting this variable in the expression in Eq. (C.2), a generalization of the Generating Function can be given as

$$ G(s) = \sum_j P\{X = j\}\, e^{isj} = \sum_j P\{X = j\} \left( e^{is} \right)^j. \tag{C.3} $$

The summation on the right–hand side of Eq. (C.3) is a power series in $e^{is}$. Given $G(s)$ as a function of $s$, the expansion of a power series in $s$ gives the probability $P\{X = j\}$ directly as the coefficient of $s^j$. This relationship is also the definition of the z–Transform of a complex function [Papo65]. Defining $z = e^{t_0 s}$, expanding the definition to $z = e^{t_0 (\sigma + i\omega)} = e^{t_0 \sigma} \left[ \cos t_0 \omega + i \sin t_0 \omega \right]$, and substituting this definition into the discrete representation of the Laplace Transform defined by [Iser77],

$$ \mathcal{L}\{x(t)\} = x(s) = \sum_{k=0}^{\infty} x(k t_0)\, e^{-k t_0 s}, \tag{C.4} $$

results in

$$ Z\{x(k t_0)\} = \sum_{k=0}^{\infty} x(k t_0)\, z^{-k} = x(0) + x(t_0)\, z^{-1} + x(2 t_0)\, z^{-2} + \cdots. \tag{C.5} $$

This infinite series converges if $x(k t_0)$ is restricted to finite values and if $|z| > 1$. Since $\sigma$ in the term $s = \sigma + i\omega$ can be chosen appropriately, convergence is possible for many functions of the form $x(k t_0)$ ∴


Appendix D

POISSON PROCESSES

Throughout this paper the Poisson process is referred to with little or no formal development. In searching the literature, it is curious that the fundamental axioms of the Poisson process are rarely developed. Although several relationships are developed in this paper between the Poisson process and the Exponential process, the underlying source of a random process exhibiting Poisson behavior has not been described in sufficient detail. This appendix attempts to correct this situation.

Numerous physical and organic processes exhibit behavior that is meaningfully modeled by the Poisson process. One of the first observations of the Poisson process was that it:

Properly represented the number of army soldiers that died as a result of being kicked by a horse. [Fry28]

It has been shown that the sum of a large number of independent stationary renewal processes, each with an arbitrary distribution of renewal times, will tend to a Poisson process [Palm43], [Khin60].

The exponential function is perhaps the most important function in the analysis of physical systems. This is because the exponential function has the almost magical property that its derivative and its integral yield the function itself; for example, $\frac{d}{dt} e^t = \int e^t\, dt = e^t$. Many of the phenomena observed in nature can be described by exponential functions [Lath65].

Although the Poisson process arises frequently as a model for random events, it also serves as a building block for other useful stochastic processes, most importantly the Markov process.

Definition D.1
Consider a finite time interval $[0, T)$ in which the probability distribution function describing the number of events occurring in the interval is sought. This function can be found by dividing the period $T$ into $m$ subintervals, each of length $h = T/m$. Let $\lambda$ represent the average event occurrence rate within the interval of interest. For any subinterval, the probability that one (1) event occurs is $\lambda h + O(h)$, that two or more events occur is $O(h)$, and therefore that no events occur is $1 - \lambda h + O(h)$, where the function $O(h)$ represents any quantity that approaches zero faster than $h$ as $h \to 0$; that is, $O(h)/h \to 0$ as $h \to 0$. The rate at which events occur is said to be Poisson when the following property holds:

The event occurrence rate in any subinterval is statistically independent of the event occurrence rate in any other subinterval not overlapping this subinterval.

This is a formal restatement of the properties given in Figure 3. This property can best be described if an event is considered a success of a Bernoulli trial [Bern31]; the event pattern observed over the period $T = mh$ can then be regarded as the result of a sequence of $m$ Bernoulli trials. The probability that exactly $i$ events occur in the $m$ subintervals can then be approximated by using the binomial distribution $\binom{m}{i} \left[ \lambda h + O(h) \right]^i \left[ 1 - \lambda h + O(h) \right]^{m-i}$. By taking the limits $h \to 0$ and $m \to \infty$ while keeping $mh = T$ constant, the number of events in the interval $T$, given by $N(T)$, has the probability distribution function

$$ P\{N(T) = i\} = \frac{(\lambda T)^i}{i!} \cdot \lim_{m \to \infty} \frac{m!}{m^i (m - i)!} \cdot \lim_{m \to \infty} \left( 1 - \frac{\lambda T}{m} \right)^{m - i} = \frac{(\lambda T)^i}{i!}\, e^{-\lambda T}, \tag{D.1} $$

which is the Poisson distribution of Eq. (2.6).

Another important property of the Poisson distribution is the distribution of intervals between events. Let $X$ be an arbitrarily chosen interval from the time origin to the first event. The distribution for the random variable $X$ can be developed by noting that no arrivals occur in the interval $(0, X)$ if and only if $X > x$; that is, $P\{X > x\} = P\{N(x) = 0\}$, where $N(x)$ represents the number of events occurring in $x$ time units. Using Eq. (D.1) results in $P\{N(x) = 0\} = e^{-\lambda x}$. The probability distribution function of the random variable $X$ is then given by $F_X(x) = 1 - e^{-\lambda x}$, and using Eq. (2.15) the probability density function is given as $f_X(x) = F_X'(x) = \lambda e^{-\lambda x}$ ∴

The Poisson process can be characterized in several ways:

§ The pure birth process with constant birth rates. A birth–death process is a continuous parameter Markov chain $\{N(t), t \ge 0\}$ with the state space $\{0, 1, 2, \ldots\}$ and homogeneous transition probabilities. The conditional probability that the population will increase by one (1) may depend on the population size $n$ and is denoted by $\lambda_n$. The conditional probability that the population may decrease by one (1) may depend on the population size $n$ and is denoted by $\mu_n$.

§ The renewal counting process with exponentially distributed inter–arrival times. If $f(t)$, $g(t)$, and $h(t)$ are functions defined for $t \ge 0$ satisfying the relation $g(t) = h(t) + \int_0^t g(t - s) f(s)\, ds$, $t \ge 0$, where $f(t)$ and $h(t)$ are known functions and $g(t)$ is an unknown function to be determined as the solution to the integral equation, then this integral equation is referred to as the Renewal Equation. A renewal equation for the mean value function of a continuous renewal process can be stated as $m(t) = E[N(t)] = \sum_{n=0}^{\infty} n\, p_{N(t)}(n)$.

§ An integer–valued process with stationary independent increments which has unit jumps.

Definition D.2
The counting process $\{N(t), t \ge 0\}$ is said to be a Poisson process with rate $\lambda$, $\lambda > 0$, if:

§ $N(0) = 0$.

§ The process has independent increments.

§ The number of events in any interval of length $t$ is Poisson distributed with a mean of $\lambda t$. That is, for all $s, t \ge 0$,

$$ P\{N(t + s) - N(s) = n\} = e^{-\lambda t} \cdot \frac{(\lambda t)^n}{n!} \;\; ∴ $$

Definition D.3
In Chapter 2, the memoryless property of the exponential distribution was described. This property is important to many areas of reliability analysis. In the theory of recurrent events (Appendix E) and the theory of Markov Chains, the underlying probability distributions are memoryless.

Given a random variable $X$ such that $X > t$, suppose the random variable $X$ represents the lifetime of a component. If the component has been observed for some time $t$, then the observer would be interested in the probability distribution of the remaining lifetime of the component, $Y = X - t$, the residual life of the component. Let the conditional probability of $Y \le y$, given that $X > t$, be denoted by the function $F_t(y)$. Thus for $y \ge 0$,

$$ \begin{aligned} F_t(y) &= P\{Y \le y \mid X > t\} \\ &= P\{X - t \le y \mid X > t\} \\ &= P\{X \le y + t \mid X > t\} \\ &= \frac{P\{X \le y + t \text{ and } X > t\}}{P\{X > t\}}. \end{aligned} \tag{D.2} $$

By the definition of conditional probability [Kend77],

$$ F_t(y) = \frac{P\{t < X \le y + t\}}{P\{X > t\}}. \tag{D.3} $$

Now consider the exponential distribution function $f(x) = \lambda e^{-\lambda x}$. If this function represents both the previous life of a component and the remaining life (residual life) of the component, then the ratio of the two distributions should not be a function of time. Thus

$$ \frac{\displaystyle\int_t^{y+t} f_{\text{residual lifetime}}(x)\, dx}{\displaystyle\int_t^{\infty} f(x)\, dx} = \frac{\displaystyle\int_t^{y+t} \lambda e^{-\lambda x}\, dx}{\displaystyle\int_t^{\infty} \lambda e^{-\lambda x}\, dx} = \frac{e^{-\lambda t} \left( 1 - e^{-\lambda y} \right)}{e^{-\lambda t}} = 1 - e^{-\lambda y}. \tag{D.4} $$

Thus the two exponential distributions are independent of the time $t$. Stated more clearly, a component whose failure process is described by an exponential distribution does not age, i.e., it is as good as new, or it forgets how long it has been operating, and thus its eventual breakdown is the result of a sudden failure, not of gradual deterioration.

If the arrival time of a component failure is exponentially distributed, then the memoryless property implies that the time of arrival of the next failure is independent of how long it has been since the last failure. This can be shown by assuming $X$ is a nonnegative continuous random variable with the Markov property; then the distribution of $X$ must be exponential.

Using Eq. (D.3), $\frac{P\{t < X \le y + t\}}{P\{X > t\}} = P\{X \le y\} = P\{0 < X \le y\}$. The Cumulative Distribution Function (CDF) then satisfies $F_X(y + t) - F_X(t) = \left[ 1 - F_X(t) \right] \left[ F_X(y) - F_X(0) \right]$. Since $F_X(0) = 0$, the CDF can be rearranged to give

$$ \frac{F_X(y + t) - F_X(y)}{t} = \frac{F_X(t) \left[ 1 - F_X(y) \right]}{t}. $$

Taking the limit as $t \to 0$ gives $\lim_{t \to 0} \frac{F_X(y + t) - F_X(y)}{t} = F_X'(y)$, which is an expression for the derivative of $F_X$. Using Eq. (2.21) for the definition of the reliability function, i.e., $R_X(y) = 1 - F_X(y)$, the derivative of the reliability is given by

$$ \frac{d}{dy} R_X(y) = \left[ \frac{d}{dy} R_X(0) \right] R_X(y). \tag{D.5} $$

The solution to the differential equation Eq. (D.5) is given by $R_X(y) = K \exp\!\left[ \frac{d}{dy} R_X(0)\, y \right]$, where $K$ is a constant of integration and $\frac{d}{dy} R_X(0) = -\frac{d}{dy} F_X(0) = -f_X(0)$, which is the pdf of the random variable $X$ evaluated at 0. Noting that $R_X(0) = 1$ and that $f_X(0) = \lambda$ results in $R_X(y) = e^{-\lambda y}$ and thus $F_X(y) = 1 - e^{-\lambda y}$, $y > 0$, which by inspection is an exponential distribution [Bhat72].

When working with exponential distributions it is at times desirable to treat them as joint probability distributions rather than as conditional probability distributions. The following two definitions provide the basis for this process.

Definition D.4
Given two random variables $X$ and $Y$ and a function $g(x, y)$ of the real variables $x$ and $y$, suppose $g(x, y)$ satisfies certain general conditions, namely: (1) the set $I_Y$ of all real numbers $x$ such that $g(x) \le y$ is a countable union or intersection of intervals for any $y$ (only then is $\{Y \le y\}$ an event; if $g(x)$ has this property it is called a Baire function); and (2) the set of outcomes $\zeta$ such that $g[X(\zeta)] = \pm\infty$ is an event with zero probability. Then the function $Z = g(x, y)$ is a random variable and its value $Z(\zeta)$ is given by $Z(\zeta) = g[X(\zeta), Y(\zeta)]$ ∴

Definition D.5
If the random variables $X$ and $Y$ are independent, then the density of their sum $Z = X + Y$ equals the convolution of their respective densities,

$$ f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx. $$

It should be noted that if $X$ and $Y$ take only positive values, i.e., if $f_X(x) = 0$ for $x < 0$ and $f_Y(y) = 0$ for $y < 0$, then the convolution takes the form

$$ f_Z(z) = \int_0^{z} f_X(z - y)\, f_Y(y)\, dy $$

for $z > 0$. In the case of the subtraction of two independent distributions, $Z = X - Y$, the convolution is given by $f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(-y)\, dy$ ∴

−∞

In the analysis of Fault–Tolerant systems, situations frequently arise in which a series of exponential distributions are arranged in sequence. The resulting density is known as an r–stage Hypoexponential distribution. Definition D.6 In the previous section of this appendix, the Poisson distributions have been treated as separate and independent processes. In practical systems a group of Poisson processes can be combined or separated in a manner modeling the physical organization of the underlying systems. The superposition of Poisson processes can be described by considering m independent source that generate random events. Assume search source follows a Poisson distribution with rate λ k , k = 1, 2,K , m . If these sources are combined into a single source, then this new source is a Poisson process with a rate that is the sum of all components, λ = λ1 + λ 2 + L + λ m . This additively process can be shown by using the probability generating function. Consider an interval of length T. the number of events from the k th Poisson source in this interval is Poisson distributed with the parameter λ kT , and using Eq. (C.2) its probability

Appendix D – 8

generating function is given as G k ( z ) = e −λk T (1−z ) . The total number of events from all sources has the probability generating function m

m

k =1

k =1

G ( z ) = ∏ G k ( z ) = e −λT (1−z ) , with a parameter λ = ∑ λk , where the product form is due to the statistical independence of the m generating sources. The total number of events in a merged stream is Poisson–distributed with mean λT . The second situation is which Poisson processes are combined is one in which a source of random variables is decomposed into m separate Poisson processes. If the originating source has a generating rate of λ the probability of selecting the k th decomposed process is rk . Let N (T ) denote as before the number of events occurring in T time units and let N k (T ) denote the number of events in the k th decomposed process. The conditional joint distribution of N k ( T ) , k = 1, 2, K , m is given as N ( T ) = n and is a multinomial distribution of r independent Poisson random processes resulting in the expression, P {N 1 ( T ) = n1 , N 2 ( T ) = n 2 , K , N m ( T ) = n m N ( T ) = n}L L=

n! ⋅ r1n1 r2n2 L rmnm n 1 ! n 2 !L n m !

.

By multiplying the probability of the random variable n, which is the Poisson distribution with parameter λT , an expression for the joint probability of m decomposed Poisson distributions is given by,

Appendix D – 9

$$ P\{n_1, n_2, \ldots, n_m\} = \frac{n!}{n_1!\, n_2! \cdots n_m!} \cdot r_1^{n_1} r_2^{n_2} \cdots r_m^{n_m} \cdot \frac{(\lambda T)^n}{n!}\, e^{-\lambda T} = \prod_{k=1}^{m} \frac{(r_k \lambda T)^{n_k}}{n_k!}\, e^{-r_k \lambda T}. \tag{D.6} $$

Since the joint probability factors into $m$ Poisson distributions, the random variables $n_1, n_2, \ldots, n_m$ are statistically independent for an arbitrarily chosen interval $T$.
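Both operations can be checked by a short simulation. The sketch below merges two assumed Poisson streams and thins a third with an assumed routing probability $r_1$; the merged rate approaches $\lambda_1 + \lambda_2$ and the thinned rate approaches $r_1 \lambda$, as Eq. (D.6) predicts. All rates are hypothetical.

```python
# A simulation sketch of Definition D.6 with assumed rates: merging
# two Poisson streams yields rate lam1 + lam2, and Bernoulli thinning
# of a stream with probability r1 yields rate r1*lam, per Eq. (D.6).
import numpy as np

rng = np.random.default_rng(2)
lam1, lam2, T = 0.3, 0.7, 1000.0
n1 = rng.poisson(lam1 * T)            # events from source 1
n2 = rng.poisson(lam2 * T)            # events from source 2
print((n1 + n2) / T)                  # approximately lam1 + lam2 = 1.0

lam, r1 = 1.0, 0.25                   # decompose a rate-1 stream
n = rng.poisson(lam * T)
kept = rng.random(n) < r1             # route each event with prob r1
print(kept.sum() / T)                 # approximately r1 * lam = 0.25
```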

Suppose there are $r$ identical exponential distributions and an overall process that experiences these distributions in sequence. The resulting density (the Erlang distribution) is given as

$$ f(t) = \frac{\lambda^r t^{r-1} e^{-\lambda t}}{(r - 1)!}, \quad t > 0,\; \lambda > 0,\; r = 1, 2, \ldots \tag{D.7} $$

The corresponding distribution function is given by

$$ F(t) = 1 - \sum_{k=0}^{r-1} \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}, \quad t \ge 0,\; \lambda > 0,\; r = 1, 2, \ldots \tag{D.8} $$

A single stage is equivalent to an exponential distribution, since $r = 1$ results in

$$ \frac{\lambda^r t^{r-1} e^{-\lambda t}}{(r - 1)!} \;\xrightarrow{\;r = 1\;}\; \lambda e^{-\lambda t} \;\; ∴ \tag{D.9} $$

Theorem D.1
Let $Z = \sum_{i=1}^{r} X_i$, where $X_1, X_2, \ldots, X_r$ are mutually independent and $X_i$ is an exponentially distributed random variable with parameter $\lambda_i$, $\lambda_i \neq \lambda_j$ for $i \neq j$. Then the density of $Z$, which is an r–stage hypoexponentially distributed random variable, is given as

$$ f_Z(z) = \sum_{i=1}^{r} a_i \lambda_i e^{-\lambda_i z}, \quad z > 0, \tag{D.10} $$

where

$$ a_i = \prod_{\substack{j=1 \\ j \neq i}}^{r} \frac{\lambda_j}{\lambda_j - \lambda_i}, \quad 1 \le i \le r \;\; ∴ \tag{D.11} $$
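The coefficients of Eq. (D.11) can be evaluated directly. The sketch below, with assumed distinct rates, compares the analytic density of Eq. (D.10) with a Monte Carlo histogram of the simulated sum of independent exponentials.

```python
# A sketch of Theorem D.1 with assumed distinct rates: the analytic
# hypoexponential density built from the coefficients of Eq. (D.11)
# is compared against a Monte Carlo histogram of the simulated sum.
import numpy as np

rates = [1.0, 2.0, 5.0]                     # lambda_i, all distinct

def hypoexp_pdf(z, lam):
    # a_i = prod_{j != i} lambda_j / (lambda_j - lambda_i), Eq. (D.11)
    return sum(np.prod([l / (l - li) for l in lam if l != li])
               * li * np.exp(-li * z) for li in lam)

rng = np.random.default_rng(3)
samples = sum(rng.exponential(1.0 / l, size=200_000) for l in rates)
hist, edges = np.histogram(samples, bins=200, range=(0.0, 10.0),
                           density=True)
z = 1.0
print(hypoexp_pdf(z, rates), hist[np.searchsorted(edges, z) - 1])
```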


Appendix E

RENEWAL THEORY

In Chapter 3, the Mean Time to First Failure expression was developed using Renewal Theory [Cox62], [Ross70], [Smit58], [Fell66]. This appendix develops further the foundations of Renewal Theory, also referred to as the Theory of Recurrent Events.

Renewal Theory started as the study of the probability associated with the failure and replacement of components. Suppose there is a population of components, each component characterized by a non–negative random variable $X$ called its failure–time. The failure–time is the age of the component at which a clearly defined event occurs, namely its failure. Since the random variable $X$ is non–negative there are two cases to consider: (i) there is a positive constant $h$ such that the only possible values of $X$ are $\{0, h, 2h, \ldots\}$; (ii) the random variable has a continuous distribution over the range $(0, \infty)$, its distribution being determined by a probability density function.

In the analysis of Fault–Tolerant systems which are capable of repair, the fault–failure–repair sequence can be modeled as a recurring sequence of random variables. A pattern of random variables $\mathcal{R}$ qualifies as a renewal process if, after each occurrence of an event in $\mathcal{R}$, the random process generating the events starts again, with the restriction that the sequence of events following an occurrence in $\mathcal{R}$ forms a replica of the entire process. The durations of waiting times between occurrences in $\mathcal{R}$ are mutually independent random variables and have the same probability distribution.

In the simplest example, $\mathcal{R}$ represents an event in which a success occurs in a sequence of Bernoulli trials. The waiting time to the occurrence of the first event has a geometric distribution. The process is defined so that when the first success occurs the trials start again, and the number of trials between the $r$th and the $(r+1)$th success has the same geometric distribution. The waiting time up to the $r$th success is then the sum of $r$ independent variables. This sequence of events also holds when $\mathcal{R}$ represents a success followed by a failure. In the analysis of reliability systems, this model can be used to represent a sequence of proper operations followed by a fault event, which results in a repair, followed by another series of proper operations.

Consider a sequence of repeated trials with possible outcomes $E_j$ ($j = 1, 2, \ldots$). Since these outcomes need not be independent, they can be used to analyze Markov processes. Suppose it is possible to continue the trials indefinitely, the probabilities $P\{E_{j_1}, E_{j_2}, \ldots, E_{j_n}\}$ being defined consistently for all finite sequences. Let $\mathcal{R}$ be an attribute of these finite sequences.

Definition E.1
The attribute $\mathcal{R}$ defines a recurrent event if: (a) in order that $\mathcal{R}$ occurs at both the $n$th and the $(n+m)$th locations, it must occur at the $m$th location of the continuation $(E_{j_{n+1}}, E_{j_{n+2}}, \ldots, E_{j_{n+m}})$; and (b) if $\mathcal{R}$ occurs at the $n$th location, then identically

$$ P\{E_{j_1}, \ldots, E_{j_{n+m}}\} = P\{E_{j_1}, \ldots, E_{j_n}\} \cdot P\{E_{j_{n+1}}, \ldots, E_{j_{n+m}}\} \;\; ∴ $$

Using this definition, the term "$\mathcal{R}$ occurs in the sequence $(E_{j_1}, E_{j_2}, \ldots, E_{j_n})$ for the first time at the $n$th place" is well defined. With each recurrent event $\mathcal{R}$ there are associated two sequences of numbers for $n = 1, 2, \ldots$, as follows:

$$ u_n = P\{\mathcal{R} \text{ occurs at the } n\text{th trial}\}, \qquad f_n = P\{\mathcal{R} \text{ occurs for the first time at the } n\text{th trial}\}. $$

For convenience $f_0 = 0$ and $u_0 = 1$, which results in the generating functions

$$ F(s) = \sum_{k=1}^{\infty} f_k s^k \quad \text{and} \quad U(s) = \sum_{k=0}^{\infty} u_k s^k $$

being defined. The sequence $\{u_k\}$ is not a probability distribution; in fact $\sum u_k$ may be infinite. The events described by $\mathcal{R}$ occurring for the first time at the $n$th trial are mutually exclusive, therefore

$$ f = \sum_{n=1}^{\infty} f_n \le 1. $$

The term $1 - f$ can be interpreted as the probability that $\mathcal{R}$ does not occur in an indefinitely prolonged sequence of trials. If $f = 1$, a random variable $T$ can be defined by $P\{T = n\} = f_n$; if $f < 1$, then $T$ is an improper, or defective, random variable, which with probability $1 - f$ does not assume a numerical value.

{ }

trial. The sequence f n( r ) is the probability distribution of the sum T ( r ) = T1 + T2 + L + Tr of independent random variables T1 , K , Tr , each having

Appendix E – 3

{ }

the distribution P {T = n} = f n . For a fixed r the sequence f n( r ) has the generating function F r ( s ) ∴ Proof of Theorem E.1 It follows that



∑ f ( ) = F (1) = f n =1

r

r

r

n



It is preferable to consider the number n of trials as a fixed value and determine the number N n of occurrences of R in the first n trials as a random variable. In the analysis of system failures and repairs the behavior of the distribution of N n fro large n in important. Let T ( r ) represent the number of trials up to and including the r th occurrence of R . The probability distributions of T ( r ) and N n are related by the identity,

{

}

P { N n ≥ r } = P T (r ) ≤ n ∴

(E.1)

Theorem E.2
If the recurrent event $\mathcal{R}$ is persistent and its recurrence times have finite mean $\mu$ and variance $\sigma^2$, then both the number $T^{(r)}$ of trials up to the $r$th occurrence of $\mathcal{R}$ and the number $N_n$ of occurrences of $\mathcal{R}$ in the first $n$ trials are asymptotically normally distributed ∴

Proof of Theorem E.2
Assume the case where $\mathcal{R}$ is persistent and the distribution $\{f_n\}$ of its recurrence times has a finite mean $\mu$ and variance $\sigma^2$. Since $T^{(r)}$ is the sum of $r$ independent variables, the central limit theorem asserts, for each fixed $x$ as $r \to \infty$,

$$ P\left\{ \frac{T^{(r)} - r\mu}{\sigma \sqrt{r}} < x \right\} \to N(x), \tag{E.2} $$

where $N(x)$ is the normal distribution function. Now let $n \to \infty$ and $r \to \infty$ in such a way that

$$ \frac{n - r\mu}{\sigma \sqrt{r}} \to x; \tag{E.3} $$

then Eq. (E.1) and Eq. (E.2) together lead to $P\{N_n \ge r\} \to N(x)$. This relation can be seen in a familiar form if the reduced variable $N_n^* = (\mu N_n - n) \big/ \sqrt{\sigma^2 n / \mu}$ is introduced. The inequality $N_n \ge r$ is identical with

$$ N_n^* \ge \frac{r\mu - n}{\sigma \sqrt{r}} \cdot \sqrt{\frac{r\mu}{n}} = -x \sqrt{\frac{r\mu}{n}}. \tag{E.4} $$

By dividing Eq. (E.3) by $r$ it is seen that $n/r \to \mu$, and the right side of Eq. (E.4) tends to $-x$. Since $N(-x) = 1 - N(x)$, it follows that $P\{N_n^* \ge -x\} \to N(x)$, or $P\{N_n^* < -x\} \to 1 - N(x)$ ∴

Unfortunately many of the recurrence times occurring in various stochastic processes and in statistical applications have infinite expectations. In such cases the normal approximation is replaced by more general limit theorems of an entirely different character [Fell49], and the chance fluctuations exhibit unexpected features.

One expects intuitively that $E[N_n]$ should increase linearly with $n$, because on the average $\mathcal{R}$ must occur twice as often in twice as many trials. However, this is not true in general [Fell49].

The previous discussion is important to understand when analyzing the response of a Fault–Tolerant system in the presence of a request for action, with a finite probability that the system will not respond properly. This issue is best illustrated in an example. Suppose a Fault–Tolerant system is operating normally, that is, all faults that can occur are detected properly: the coverage factor is 1.0. When an external event generates a demand on the system, it will respond properly with a probability of 1.0. If however the coverage factor is less than 1.0, the probability that the system will respond properly is now less than 1.0. In Chapter 7 there are summary tables that describe the Mean Time Between Failures for the various configurations. In order to understand the consequences of this situation on these tables, the probability of a demand must be included in the model. A paradox will develop when this factor is included.

Let $A_k$ represent the $k$th uncovered fault in the system, which is assumed to occur at time $t_k$. Assume that the intervals $t_{k+1} - t_k$ are independent and identically distributed random variables with the distribution function given as $F(x) \triangleq P\{t_{k+1} - t_k \le x\}$. The common pdf for these intervals is defined as $f(x) \triangleq \frac{d}{dx} F(x)$.

Now choose some random point in time, $\tau$, for a demand on the system to occur. Let $X$ denote the interarrival time for the undetected fault, and let $Y$ denote the time to the next occurrence of the undetected fault after the arrival of a system demand. The sequence of arrival events $\{t_k\}$ forms a renewal process. In this case $\{t_k\}$ forms the sequence of instants when the system experiences an undetected fault. The random variable $X$ is defined as the lifetime of the system under consideration, the random variable $Y$ is defined as the residual lifetime of the component at time $t$, and $X_0 = X - Y$ is defined as the age of the system at time $t$.

A paradox will be discovered if the pdf for the lifetime and the residual lifetime are compared. It is assumed the renewal process has been operating for an arbitrarily long time and limiting distributions are observed. The result will be that the random variable $X$ will not be distributed according to $F(x) \triangleq P\{t_{k+1} - t_k \le x\}$. This means that the interval during which the demand on the system occurs is not a typical interval: a longer interval is more likely to be selected, and on the average the selected interval can be twice as long as a typical interval. Let the residual life have a distribution $F_Y(x) \triangleq P\{Y \le x\}$ with a density function of $f_Y(x) = \frac{d}{dx} F_Y(x)$. Let the selected lifetime $X$ have a distribution $F_X(x) \triangleq P\{X \le x\}$ with a density function of $f_X(x) = \frac{d}{dx} F_X(x)$.

Figure E.1 – Residual life Y and the process life X for a system demand event $\tau$ occurring between the system fault events $A_k$ at $t_k$ and $A_{k+1}$ at $t_{k+1}$; $X_0$ is the age at $\tau$.

The derivation of the residual life $f_Y(x)$ is performed by observing that the event $\{Y \le y\}$ can occur if and only if $\tau < t_k \le \tau + y < t_{k+1}$ for some $k$. The distribution is given by

$$ F_{Y|t}(y) \triangleq P\{Y \le y \mid t\} = \sum_{k=1}^{\infty} \int_{\tau}^{\tau + y} \left[ 1 - F(\tau + y - x) \right]\, dP\{t_k \le x\}. \tag{E.5} $$

Observing that $t_k \le x$ if and only if $\alpha(x)$, the number of arrivals in the interval $(0, x)$, is at least $k$, that is, $P\{t_k \le x\} = P\{\alpha(x) \ge k\}$, the cumulative distribution function is given by

$$ \sum_{k=1}^{\infty} P\{t_k \le x\} = \sum_{k=1}^{\infty} P\{\alpha(x) \ge k\} = E[\alpha(x)]. $$

For large $x$, the mean–value expression for the CDF is $x/m_1$. Letting $F_Y(y) = \lim_{t \to \infty} F_{Y|t}(y)$, there is a corresponding pdf of

$$ f_Y(y) = \frac{1 - F(y)}{m_1}. \tag{E.6} $$

An intuitive explanation of the above derivation can be given for the lifetime density which takes advantage of the physical properties of this situation. It is observed that long intervals between renewal points occupy larger segments of the time axis than do shorter intervals, and therefore it is more likely that the random point $\tau$ will fall in a long interval. The probability that an interval of length $x$ is chosen should therefore be proportional to the length $x$ as well as to the relative occurrence of such intervals, which is given by $f(x)\, dx$. Thus for the selected interval

$$ f_X(x)\, dx = K x f(x)\, dx, \tag{E.7} $$

where the left–hand side is $P[x < X \le x + dx]$ and the right–hand side expresses the linear weighting with respect to interval length and includes a constant $K$, which must be evaluated so as to properly normalize this density. Integrating both sides of Eq. (E.7) it is found that $K = 1/m_1$, where $m_1 \triangleq E[t_k - t_{k-1}]$ is the common average time between renewals (between failures and repairs of the system). The density associated with the selected intervals is given in terms of the density of a typical interval by

$$ f_X(x) = \frac{x f(x)}{m_1}. \tag{E.8} $$

The results described in Eq. (E.8) and Eq. (E.6) can be expressed in terms of their Laplace transforms, $f_X(x) \Leftrightarrow F_X^*(s)$ and $f_Y(x) \Leftrightarrow F_Y^*(s)$. Using the Laplace transform pairs $\delta(t) \Leftrightarrow 1$ and $\int_{-\infty}^{t} f(\tau)\, d\tau \Leftrightarrow F^*(s)/s$ to transform Eq. (E.6) gives

$$ F_Y^*(s) = \frac{1 - F^*(s)}{s\, m_1}. \tag{E.9} $$

The moments of the residual life can now be found in terms of the lifetimes themselves. Denote the $n$th moment of the lifetime by $m_n$ and the $n$th moment of the residual life by $r_n$, that is,

$$ m_n \triangleq E\left[ (t_k - t_{k-1})^n \right], \tag{E.10} $$

$$ r_n \triangleq E\left[ Y^n \right]. \tag{E.11} $$

Using the moment–generating property of the transform, $F^{*(n)}(0) = (-1)^n E[X^n]$, Eq. (E.9) can be differentiated to obtain the moments of the residual life. As $s \to 0$, indeterminate forms are produced which may be evaluated using L'Hospital's rule to give the moments of the residual life as

$$ r_n = \frac{m_{n+1}}{(n + 1)\, m_1}. $$

If $f(x)$ and $g(x)$ are continuous in an interval including $x = a$, and if the derivatives $f'(x)$ and $g'(x) \neq 0$ exist in this interval (except possibly at $x = a$), and if $f(a) = 0$ and $g(a) = 0$, then $\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}$, provided the latter limit exists.

Figure E.2 – L'Hospital's Rule used to evaluate the indeterminate form 0/0 of two continuous functions over an interval.

This expression is used to evaluate $r_1$, the mean residual life, which is found to be $r_1 = \dfrac{m_2}{2 m_1}$. The mean residual life may also be expressed in terms of the lifetime variance (the second central moment, denoted by $\sigma^2 \triangleq m_2 - m_1^2$) to give

$$ r_1 = \frac{m_1}{2} + \frac{\sigma^2}{2 m_1}. $$

This last form shows that the intuitive answer to the paradox, $m_1/2$ (one half the mean interarrival time), is correct only if the variance is zero (regularly spaced arrivals). For Poisson arrivals, however, $m_1 = 1/\lambda$ and $\sigma^2 = 1/\lambda^2$, giving $r_1 = 1/\lambda = m_1$, which confirms the earlier solution to the residual life paradox. Since the distributions representing the lifetime, $F_X(x) \triangleq P\{X \le x\}$, and the residual life, $F_Y(x) \triangleq P\{Y \le x\}$, are independent, the distribution of the system age can be obtained through the convolution of the two distributions ∴
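The paradox lends itself to a short simulation. The sketch below assumes uniformly distributed lifetimes (chosen so that the variance is not zero), drops random inspection instants $\tau$ onto a long renewal path, and recovers the mean residual life $m_2/(2 m_1)$ rather than the naive $m_1/2$.

```python
# A simulation sketch of the residual-life paradox: lifetimes are
# assumed uniform on (0, 2), so m1 = 1 and m2 = 4/3. Random demand
# instants tau are dropped onto a long renewal path; the mean residual
# life approaches m2/(2 m1) = 2/3, not the naive m1/2 = 1/2.
import numpy as np

rng = np.random.default_rng(4)
lifetimes = rng.uniform(0.0, 2.0, size=2_000_000)
arrivals = np.cumsum(lifetimes)              # renewal instants t_k
taus = rng.uniform(100.0, arrivals[-1] - 100.0, size=100_000)
nxt = np.searchsorted(arrivals, taus)        # first t_k after each tau
print(np.mean(arrivals[nxt] - taus))         # approximately 0.6667
```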


Appendix F

LAPLACE TRANSFORM GENERALIZED SOLUTION METHODS

The generalized solutions to the reliability problems described in Chapter 5 are based on an approach to computing the Markov reliability matrix using eigenvalues [Wilk65], [Wibe71], [Shoo68], [Buza70]. These methods can be adapted to the solution of systems that have been transformed to the Laplace transform form. The following assumptions are made regarding the generalized solution:

§ The underlying system can be represented by a set of $n$ states. These states represent different conditions the system $S$ may assume and are contained in the transition state space matrix $A$ consisting of elements $p_{ij}$.

§ The transition from one state to another is a Markov process [Hoel72].

§ The time in each state is exponentially distributed.

§ The probability of more than one state change during a time increment $\Delta t$ is of the order $O(\Delta t)$.

The key to the solution of the generalized reliability system is to partition the resulting Markov matrix into an appropriate set of states. Two subsets of states can exist: non–absorbing and absorbing. For the absorbing states the following condition is true: $p_{nj} = 0$ for $j = 1, 2, \ldots, n-1$. What is of interest is the probability of a state change from $S_i$ to $S_j$ during the time $\Delta t$. Although the discrete time equation is useful, the continuous time form will be used in this appendix. By utilizing this form, the set of differential equations reduces to a set of algebraic equations (derived from the Laplace transform) which can be solved using the techniques described below. The continuous time state–space equations corresponding to the probability transition rates are given by

$$ \left[ \frac{d}{dt} P_1(t), \frac{d}{dt} P_2(t), \ldots, \frac{d}{dt} P_n(t) \right] = \left[ P_1(t), P_2(t), \ldots, P_n(t) \right] \cdot \begin{bmatrix} 1 - \sum_{j=2}^{n} p_{1j} & p_{12} & \cdots & p_{1n} \\ p_{21} & 1 - \sum_{\substack{j=1 \\ j \neq 2}}^{n} p_{2j} & \cdots & p_{2n} \\ \vdots & & \ddots & \vdots \\ p_{n1} & \cdots & & 1 - \sum_{j=1}^{n-1} p_{nj} \end{bmatrix}, \tag{F.1} $$

where $[P_1(t), P_2(t), \ldots, P_n(t)]$ is the row vector of probabilities of the system being in each state as a function of time. The corresponding matrix differential equation has the simplified form

$$ \frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot A, \tag{F.2} $$

where $\vec{P}(t)$ is now the general row–vector probability of the system being in a certain state. The initial condition for this differential equation system is $\vec{P}(0) = P_0$.

Definition F.1
A recurrent Markov Chain is a chain with a transition matrix $A$ having the following properties:

§ There is at least one recurrent state.

§ From every state it is possible to reach a recurrent state.

The nonrecurrent states are referred to as transient states. The canonical form of the transition matrix can be written with the recurrent states listed first and the transient states listed last. Suppose there are a total of $r$ transient states and $s$ absorbing states. Then the canonical form for the transition matrix $A$ is [Gave73]

$$ A = \begin{bmatrix} I & O \\ R & Q \end{bmatrix}, $$

where $I$ is the identity block on the recurrent states ($p_{ij} = 1$ iff $i = j$, $p_{ij} = 0$ otherwise), $O$ is a matrix of zeros, $R$ contains the transition probabilities from the transient states into the recurrent states, and $Q$ contains the transition probabilities among the transient states.

Definition F.2
Each element of the canonical matrix is defined as follows:

§ The matrix $Q$ with elements $p_{ij}$ is not stochastic, since $\sum_{j \in S} p_{ij} \le 1$, $\forall i \in S$; it is defined as sub–stochastic if $p_{ij} \ge 0$ for all $i$ and $j$ and if $\sum_j p_{ij} \le 1$ for all $i$.

§ The matrix $Q$ by itself does not have the usual probabilistic interpretation. Since the sum of some of the rows is less than one, the entries in $Q$ represent all the transitions among the transient states of the system.

§ The matrix $R$ corresponds to the transitions from the transient states to the recurrent states. The $R$ matrix is not square and in general does not have an interpretation as a transition matrix from a state space into itself.

§ The matrix $I$ denotes the stochastic matrix corresponding to transitions within the set of recurrent states.

Definition F.3
The fundamental matrix of an absorbing Markov Chain is the matrix

$$ N = (I - Q)^{-1} = I + Q + Q^2 + \cdots + Q^k + \cdots \;\; ∴ $$
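As a numerical illustration of Definition F.3, the sketch below assumes a hypothetical $2 \times 2$ transient block $Q$ and computes $N = (I - Q)^{-1}$; the entries of $N$ are the expected numbers of visits to each transient state, and the row sums are the expected numbers of steps to absorption.

```python
# A sketch of Definition F.3 for a hypothetical absorbing chain: Q is
# the assumed transient-to-transient block; N = (I - Q)^{-1} gives the
# expected number of visits to each transient state, and its row sums
# give the expected number of steps until absorption.
import numpy as np

Q = np.array([[0.90, 0.05],
              [0.10, 0.80]])
N = np.linalg.inv(np.eye(2) - Q)
print(N)                      # fundamental matrix
print(N.sum(axis=1))          # expected steps to absorption per state
```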

Definition F.4
The Laplace transform can be used for constructing a fundamental matrix for the system described by $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot A$, where $A$ is an arbitrary $n \times n$ constant matrix. If $f(t)$ is a vector function with $n$ components defined on $0 \le t < \infty$, with $f \in \Lambda$, where $\Lambda$ is the class of complex functions, then there exists the Laplace transform $\mathcal{L}\{f\} = \int_0^{\infty} e^{-st} f(t)\, dt$. Using Theorem B.1 (in the scalar sense) and applying it to the vector case, it follows that if $\varphi$ is a solution to $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot A$ with $\vec{P}(0)$ as the initial conditions vector, then the solution vector $\varphi \in \Lambda$ for any vector $\vec{P}(0)$ ∴

This can be shown in the following manner. Let the Laplace transform of the solution be given by $\vec{P}(s) = \mathcal{L}(\varphi)$. Taking the Laplace transform of both sides of $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot A$ and using the initial conditions $\vec{P}(0)$ gives the system of equations

$$ s \vec{P}(s) - \vec{P}(0) = \vec{P}(s)\, A. \tag{F.3} $$

Rearranging Eq. (F.3) (this is possible at this point since Eq. (F.3) is treated as an algebraic equation rather than a differential equation) results in

$$ \vec{P}(s) \left( s I - A \right) = \vec{P}(0). \tag{F.4} $$

The system in Eq. (F.4) is a linear nonhomogeneous system of $n$ algebraic equations in the $n$ unknowns $(P_1(s), P_2(s), \ldots, P_n(s))$. If $s$ is not equal to an eigenvalue of $A$, that is, $\det(sI - A) \neq 0$, then Eq. (F.4) can be solved uniquely for $\vec{P}(s)$ in terms of $\vec{P}(0) = (\eta_1, \eta_2, \ldots, \eta_n)$ and $s$ by Cramer's Rule (Eq. (B.2)). Since $\det(sI - A)$ is a polynomial of degree $n$, it is clear that $\vec{P}(s)$ is a vector whose components are rational functions of $s$ and linear in $(\eta_1, \eta_2, \ldots, \eta_n)$. Each component of $\vec{P}(s)$ can be expanded in partial fractions (the denominators will be integral powers of $(s - \lambda_j)$, where $\lambda_j$ is an eigenvalue of $A$). $\vec{P}(s)$ can then be inverted to find the solution $\varphi(t)$ corresponding to any initial vector $\vec{P}(0)$.

Letting $\vec{P}(0)$ take on the successive values

$$ P_1(0) = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \; P_2(0) = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \; \ldots, \; P_n(0) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}, $$

the solutions $\varphi_1, \varphi_2, \ldots, \varphi_n$ used as column vectors of the matrix $\Phi$ generate a fundamental matrix of $\frac{d}{dt} \vec{P}(t) = \vec{P}(t) \cdot A$ such that $\Phi(0) = I$.


Appendix G

EXPONENTIAL DISTRIBUTION VARIANCE

In the analysis of the Mean Time to Failure models in Chapter 2, hard data may not be available for calculating a traditional mean for the random variable representing the failure rate $\lambda$. What is of interest is the effect of the variance on the failure rate of the underlying system. If the failure rate is $\lambda$, then for a total cumulative observation time $T$ the number of failures $n$ has a Poisson distribution. The restatement of Eq. (2.6) in conditional form is

$$ P\{n \mid \lambda\} = \frac{(\lambda T)^n}{n!}\, e^{-\lambda T}. \tag{G.1} $$

The parameter of interest is $\lambda$, and the prior estimate, $h(\lambda)$, of this parameter will be described by the Gamma distribution, which is the natural conjugate of the Poisson distribution [Kapu77]. The Gamma prior pdf for $\lambda$ is given by

$$ h(\lambda) = \frac{\rho^{\delta} \lambda^{\delta - 1} e^{-\rho \lambda}}{\Gamma(\delta)}, \quad \lambda \ge 0,\; \delta \ge 0,\; \rho \ge 0. \tag{G.2} $$

Definition G.1
The failure density function for a Gamma distribution is given by

$$ f(t) = \frac{\lambda^{\eta}}{\Gamma(\eta)}\, t^{\eta - 1} e^{-\lambda t}, \quad t \ge 0,\; \eta > 0,\; \lambda > 0, $$

where $\eta$ is the shape parameter and $\lambda$ is the scale parameter. The failure distribution function is then given by

$$ F(t) = \int_0^t \frac{\lambda^{\eta}}{\Gamma(\eta)}\, \tau^{\eta - 1} e^{-\lambda \tau}\, d\tau. $$

If $\eta$ is an integer, it can be shown by successive integration by parts that

$$ F(t) = \sum_{k=\eta}^{\infty} \frac{(\lambda t)^k e^{-\lambda t}}{k!}. $$

Then the reliability function can be given as

$$ R(t) = 1 - F(t) = \sum_{k=0}^{\eta - 1} \frac{(\lambda t)^k e^{-\lambda t}}{k!}, $$

and the hazard function as

$$ z(t) = \frac{f(t)}{R(t)} = \frac{\dfrac{\lambda^{\eta}}{\Gamma(\eta)}\, t^{\eta - 1} e^{-\lambda t}}{\displaystyle\sum_{k=0}^{\eta - 1} \frac{(\lambda t)^k e^{-\lambda t}}{k!}} \;\; ∴ $$

The posterior pdf for $\lambda$, given $n$ failures over the time interval $T$, can be found using Bayes' theorem. That is, the solution to

$$ k(\lambda \mid n) = \frac{h(\lambda)\, P\{n \mid \lambda\}}{f_2(n)} \tag{G.3} $$

must be found for the parameters $\lambda$ and $n$. The joint pdf for $n$ and $\lambda$ is found from $f(n, \lambda) = h(\lambda)\, P\{n \mid \lambda\}$, which in the instance of Eq. (G.3) becomes

$$ f(n, \lambda) = \frac{\rho^{\delta} T^n}{\Gamma(\delta)\, \Gamma(n + 1)}\, \lambda^{\delta + n - 1} e^{-\lambda (\rho + T)}, \quad \lambda \ge 0. \tag{G.4} $$

The marginal pdf for $n$ is

$$ f_2(n) = \frac{\rho^{\delta} T^n}{\Gamma(\delta)\, \Gamma(n + 1)} \int_0^{\infty} \lambda^{\delta + n - 1} e^{-\lambda (\rho + T)}\, d\lambda. \tag{G.5} $$

Solving the integral in Eq. (G.5) by letting $u = \lambda (\rho + T)$ results in

$$ f_2(n) = \frac{\rho^{\delta} T^n}{\Gamma(\delta)\, \Gamma(n + 1)\, (\rho + T)^{\delta + n}} \int_0^{\infty} u^{\delta + n - 1} e^{-u}\, du. \tag{G.6} $$

The term under the integral in Eq. (G.6) is the Gamma function, resulting in

$$ f_2(n) = \frac{\rho^{\delta} T^n\, \Gamma(\delta + n)}{\Gamma(\delta)\, \Gamma(n + 1)\, (\rho + T)^{\delta + n}}, \quad n = 0, 1, 2, \ldots \tag{G.7} $$

Substituting Eq. (G.4) and Eq. (G.7) into Eq. (G.3) to obtain the posterior distribution for $\lambda$ gives

$$ k(\lambda \mid n) = \frac{(\rho + T)^{\delta + n}}{\Gamma(\delta + n)}\, \lambda^{\delta + n - 1} e^{-\lambda (\rho + T)}, \quad \lambda \ge 0. \tag{G.8} $$

The expression in Eq. (G.8) is recognized as a Gamma pdf with parameters $(\rho + T)$ and $(\delta + n)$. The Bayesian point estimator for $\lambda$ is the mean of the Gamma posterior pdf given in Eq. (G.8),

$$ \hat{\lambda}_b = \frac{\delta + n}{\rho + T}. $$

( ρ + T ) and ( δ + n ) . The Bayesian point estimator for λ is the mean of the δ+n . Gamma posterior pdf given in Eq. (G.8) and is, λˆ b = ρ+T The upper and lower confidence limits for λ can be obtained from the posterior distribution given in Eq. (G.8). Both λ upper and λ lower , respectively define a 100 (1 − α ) % one sided upper and lower Bayesian confidence limit for λ . Since the posterior pdf in Eq. (G.8) is Gamma, a simple transformation defined by, z = 2λ ( ρ + T ) ,

(G.9)

Appenix G – 3

can be applied to the pdf in Eq. (G.8) to produce a random variable z which is Chi–squared distributed with 2 ( ρ + T ) degrees of freedom. Using tabled Chi– squared values results in a set of confidence limits for the variable λ . Definition G.2 The exponential probability distribution and the Chi–Squared distribution are related in the following manner. Consider an interval of time t to the first failure where f ( t ) = λe −λt , t ≥ 0 . The random variable y = 2λt is Chi–square distributed with two degrees of freedom. This can be shown by defining dh −1 ( y ) 2 −1 dy = 2λdt and t = y . Using the relationship g ( y ) dy = f ( h ( y ) ) λ dy 1 results in g ( y ) dy = e −( y 2 )dy , y ≥ 0 , which us a Chi–squared distribution with 2 two degrees of freedom∴ Considering the transform in Eq. (G.9), the confidence limit can be given as,

{

}

P X 12−α ,2( δ+ n ) ≤ 2λ ( ρ + t ) = 1 − α.

(G.10)

Rearranging the inequality gives

$$ \frac{\chi^2_{1-\alpha,\, 2(\delta + n)}}{2(\rho + T)} \le \lambda \le \frac{\chi^2_{\alpha,\, 2(\delta + n)}}{2(\rho + T)} \tag{G.11} $$

as the confidence bounds on the failure rate random variable $\lambda$.
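The complete update of Eqs. (G.8) through (G.11) fits in a few lines. The sketch below assumes illustrative prior parameters $(\delta, \rho)$ and observed data $(n, T)$, and evaluates the posterior mean and the one–sided chi–squared limits with scipy; all numbers are hypothetical.

```python
# A sketch of the Gamma-Poisson update of Eqs. (G.8)-(G.11). The prior
# parameters (delta, rho) and the observed data (n failures in T unit
# hours) are illustrative assumptions only.
from scipy.stats import chi2

delta, rho = 2.0, 1000.0       # assumed Gamma prior on lambda
n, T = 3, 5000.0               # assumed observations
alpha = 0.10

lam_hat = (delta + n) / (rho + T)      # posterior mean of Eq. (G.8)
lower = chi2.ppf(alpha, 2 * (delta + n)) / (2 * (rho + T))
upper = chi2.ppf(1.0 - alpha, 2 * (delta + n)) / (2 * (rho + T))
print(lam_hat, lower, upper)   # one-sided 90% lower and upper limits
```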

Bibliography

[Ande79] Anderson, T. and Randell, B., Computing Systems Reliability, Cambridge University Press, 1979.
[Ande81] Anderson, T. and Lee, P. A., Fault Tolerant Principles and Practice, Prentice Hall, 1981.
[Ande82] Anderson, T. and Lee, P. A., “Fault Tolerant Terminology Proposals,” Proceedings of the 12th Annual International Symposium on Fault Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 29–33.
[Apos74] Apostolakis, G., “Mathematical Methods of Probabilistic Safety Analysis,” Technical Report, School of Engineering and Applied Science, University of California, Los Angeles, 1974.
[Apso87] Apostolakis, G. and Moieni, P., “The Foundation of Models of Dependence in Probabilistic Safety Assessment,” Reliability Engineering, 18(3), 1987, pp. 177–195.
[Arfk70] Arfken, G., Mathematical Methods for Physicists, Academic Press, 1970.
[Arla85] Arlat, J. and Laprie, J.–C., “On the Dependability Evaluation of High Safety Systems,” Proceedings of the 15th International Symposium on Fault–Tolerant Computing, June 1985, pp. 318–322.
[Arla85] Arlat, J., Croizet, Y., and Laprie, J.–C., “Fault Injection for Dependability Validation of Fault–Tolerant Computing Systems,” Proceedings of the 15th International Symposium on Fault–Tolerant Computing, June 1985, pp. 348–355.
[Arno73] Arnold, T. F., “The Concept of Coverage and Its Effects on the Reliability Model of a Repairable System,” IEEE Transactions on Computers, Volume 22, March 1973, pp. 251–254.
[Arse80] Arsenault, J. E. and Roberts, J. A., Reliability and Maintainability of Electronic Systems, Computer Science Press, Rockville, MD, 1980.
[Ash70] Ash, R. B., Basic Probability Theory, John Wiley & Sons, 1970.
[Atwo86] Atwood, C. L., “The Binomial Failure Rate Common Cause Model,” Technometrics, Volume 28, 1986, pp. 139–148.
[Aviz67] Avizienis, A., “The Design of Fault Tolerant Computers,” AFIPS Conference Proceedings, 1967, 31:733–743.

[Aviz82] Avizienis, A., “The Four–Universe Information System Model for the Study of Fault–Tolerance,” Proceedings of the 12th Annual International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 6–13.
[Aviz75] Avizienis, A., “Fault–Tolerance and Fault–Intolerance: Complementary Approaches to Reliable Computing,” Proceedings of the International Conference on Reliable Software, in SigPlan Notices, 10(6), June 1975, pp. 458–464.
[Abiz84] Avizienis, A. and Kelly, P. J., “Fault Tolerance by Design Diversity: Concepts and Experiments,” Computer, 17(8), August 1984, pp. 67–80.
[Aviz86] Avizienis, A. and Laprie, J.–C., “Dependable Computing: From Concepts to Design Diversity,” Proceedings of the IEEE, May 1986, pp. 629–638.
[Barl75] Barlow, R. E. and Proschan, F., Statistical Theory of Reliability and Life Testing: Probability Models, Holt, Rinehart and Winston, 1975.
[Barl85] Barlow, R. E. and Singpurwalla, N. D., “Assessing the Reliability of Computer Software and Computer Networks: An Opportunity for Partnership with Computer Scientists,” The American Statistician, 39(2), May 1985, pp. 88–94.
[Barr73] Barrett, L. S., “Reliability Design and Application Considerations for Classical and Current Redundancy Schemes,” Lockheed Missiles and Space Company, Inc., Sunnyvale, California, September 30, 1973.
[Bell65] Bellman, R. and Kalaba, R., Quasilinearization and Nonlinear Boundary–Value Problems, American Elsevier Publishing, 1965.
[Bell87] Bellis, H., “Comparing Analytical Reliability Models to Hard and Transient Failure Data,” Master’s Thesis, Carnegie–Mellon University, Department of Electrical Engineering, 1987.
[Bend71] Bendat, J. S. and Piersol, A. G., Random Data: Analysis and Measurement Procedures, Wiley–Interscience, 1971.
[Bern13] Bernoulli, J., Ars Conjectandi, the main work of James Bernoulli (1654–1705), published in 1713.
[Bern88] Berenson, M. L., Levine, D. M., and Rindskopf, D., Applied Statistics, Prentice Hall, 1988.

[Bhar88] Bharucha–Reid, A. T., Elements of the Theory of Markov Processes and Their Applications, McGraw–Hill, 1988.
[Bhat72] Bhat, U. N., Elements of Applied Stochastic Processes, John Wiley & Sons, 1972.
[Biez87] Beizer, B., Micro–Analysis of Computer System Performance, Van Nostrand Reinhold, 1987.
[Biro86] Birolini, A., “On the Use of Stochastic Processes in Modeling Reliability Problems,” 5th International Conference on Reliability and Maintainability, Biarritz, France, October 1986.
[Bish86] Bishop, P. G., “PODS – A Project on Diverse Software,” IEEE Transactions on Software Engineering, SE–12(9), 1986, pp. 929–940.
[Brac65] Bracewell, R., The Fourier Transform and its Applications, McGraw–Hill, 1965, pp. 25–50.
[Bray67] Brauer, F. and Nohel, J. A., Ordinary Differential Equations, W. A. Benjamin, 1967.
[Brau83] Brauer, F., Nohel, J. A., and Schneider, H., Linear Mathematics, W. A. Benjamin, 1983.
[Breu76] Breuer, M. A. and Friedman, A. D., Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
[Brue80] Bruell, S. C. and Balbo, G., Computational Algorithms for Closed Queuing Networks, Elsevier North Holland, 1980.
[Brow80] Browning, R. L., The Loss Rate Concept in Safety Engineering, Marcel Dekker, 1980.
[Boeh76] Boehm, B. W., “Software Engineering,” TRW Software Series TRW–SS–76–08, TRW Systems Engineering and Integration Division, One Space Park, Redondo Beach, CA, 1976.
[Borr87] Borrelli, R. and Coleman, C., Differential Equations: A Modeling Approach, Prentice Hall, 1987.
[Boss70] Bossen, D. C., Ostapko, D. L., and Patel, A. M., “Optimum Test Patterns for Parity Networks,” Proceedings AFIPS Fall 1970 Joint Computer Conference, Volume 37, Houston, Texas, November 1970, pp. 63–68.
[Boss82] Bossen, D. C. and Hsiao, M. Y., “Model for Transient and Permanent Error–Detection and Fault–Isolation Coverage,” IBM Journal of Research and Development, 26(1), 1982, pp. 67–77.

[Bour69] Bouricius, W. G., “Reliability Modeling Techniques for Self–Repairing Computer Systems,” Proceedings of the 24th National Conference of the ACM, August 1969, pp. 295–309.
[Born81] Bourne, A. J., Edwards, G. T., Hunns, D. M., Poulter, D. R., and Watson, I. A., “Defense Against Common Cause Mode Failures in Redundancy Systems,” Safety and Reliability Directorate, SRD–R196, 1981.
[Buza70] Buzacott, J. A., “Markov Approach to Finding Failure Times of Repairable Systems,” IEEE Transactions on Reliability, Volume R–19, November 1970, pp. 128–134.
[Cast80] Castillo, X., “Workload, Performance, and Reliability of Digital Computing Systems,” Carnegie–Mellon University Technical Report, Computer Science Department, 1980.
[Chan70] Chang, H. Y., Manning, E. G., and Metze, G., Fault Diagnosis of Digital Systems, Wiley–Interscience, 1970.
[Chan72] Chandy, K. M., “Analysis and Solutions of General Queuing Networks,” Proceedings of the Sixth Annual Princeton Conference on Information Sciences and Systems, Princeton University, March 1972.
[Chri88] Christenson, D. A., “Using Software–Reliability Models to Predict Field Failure Rates in Electronic Switching Systems,” Proceedings of the National Security Industrial Association Annual Joint Conference on Software Quality and Reliability, National Security Industrial Association, Washington, D.C., 1988.
[Chun76] Chung, K. L., Markov Chains with Stationary Transition Probabilities, 2nd Edition, Springer–Verlag, 1976.
[Clop34] Clopper, C. J. and Pearson, E. S., “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial,” Biometrika, 26, 1934, pp. 404–413.
[Coch77] Cochran, W. G., Sampling Techniques, 3rd Edition, John Wiley & Sons, 1977.
[Cohe65] Cohen, A. C., “Estimation in Mixtures of Poisson and Mixtures of Exponential Distributions,” NASA Technical Memo NASA TMX–53245, April 1965.
[Cona72] Conant, R. C., “Detecting Subsystems of a Complex System,” IEEE Transactions on Systems, Man and Cybernetics, Volume SMC–2, Number 4, September 1972, pp. 550–553.
[Cook73] Cook, R. W., Sisson, W. H., Storey, T. F., and Toy, W. N., “Design of a Self–Checking Microprogram Control,” IEEE Transactions on Computers, Volume C–22, March 1973, pp. 255–262.

[Cost78] Costes, A. C., Landrault, and Laprie, J. C., “Reliability and Availability Models for Maintained Systems Featuring Hardware Failures and Design Faults,” IEEE Transactions on Computers, Volume C–27, June 1978, pp. 548– 560. [Cour43] Courant, R. and Hilbert, D., Methods of Mathematical Physics, Interscience Publishers, 1943. [Cour77] Courtois, P. J., Decomposability: Queuing and Computer System Applications, Academic Press, 1977. [Cox62] Cox, D. R. Renewal Theory, Methuen & Company, 1962. [Cox65] Cox, D. R. and Miller, H. D., The Theory of Stochastic Processes, Methuen, 1965. [Cram66] Cramer, H. and Leadbetter, M., Stationary and Related Stochastic Processes, John Wiley & Sons, 1966. [Crou82] Crouzet, Y. and Decounty, B., “Measurements of Fault Detection Mechanisms Efficiency: Results,” Proceedings of the 12th Annual International Symposium on Fault Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 373–376. [Dahl82] Dahlbura, A. T. and Masson, G. M., “A New Diagnosis Theory as the Basis of Intermittent–Fault/Transient–Upset Tolerant System Design,” Proceedings of the 12th Annual International Symposium on Fault Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 353–356. [Damm82] Damm, A., “Experimental Evaluation of Error–Detection and Self– Checking Coverage of Components of a Distributed Real–Time System,” Doctorate Thesis, Technical University of Vienna, October, 1982. [Dave70] Davenport, W. B. Jr., Probability and Random Processes, McGraw–Hill, 1970. [Deo74] Deo, N., Graph Theory with Applications to Engineering and Computer Science, Prentice Hall, 1974. [DoD82] Department of Defense, “Reliability Prediction of Electronic Equipment,” Military Handbook, MIL–HDBK–217D, 15 January 1982. [Doet61] Doetsch, G., Guide to the Laplace Transform, Van Nostrand, 1961. [Dona84] Donatiello, L. and Iyer, B. R., “Analysis of a Composite Performance Reliability Measure for Fault–Tolerant Systems,” IBM Research Report RC– 10325, Yorktown Heights, New York, 1984.

[Dona84] Donatiello, L. and Iyer, B. R., "Closed–Form Solution for System Availability Distribution," IBM Research Report RC–11169, Yorktown Heights, New York, 1984.
[Drak87] Drake, H. D. and Wolting, D. E., "Reliability Theory Applied to Software Testing," Hewlett–Packard Journal, April 1987, pp. 35–39.
[Dung84] Dugan, J. B., Trivedi, K. S., Geist, R. M., and Nicola, V. F., "Extended Stochastic Petri Nets: Analysis and Applications," Performance '84, Paris, North Holland, December 1984.
[Dung85] Dugan, J. B., Bobbio, A., Ciardo, G., and Trivedi, K. S., "The Design of a Unified Package for the Solution of Stochastic Petri Net Models," Proceedings of the International Workshop on Timed Petri Nets, Torino, Italy, July 1985.
[Eckh85] Eckhardt, D. E. and Lee, L. D., "A Theoretical Basis for the Analysis of Multi–Version Software Subject to Coincident Errors," IEEE Transactions on Software Engineering, Volume SE–11, Number 12, December 1985, pp. 1511–1517.
[Ehre78] Ehrenberger, W. and Plogert, K., "Statistical Verification of Reactor Protection Software," Proceedings of the International Symposium on Nuclear Plant Control, Cannes, France, April 1978.
[Euri84] Euriger, M. and Reichert, W., "The AS220 EHF Fault–Tolerant and Fail–Safe Automation System with Two–Out–of–Three Redundancy," Siemens Power Engineering, November/December 1984, pp. 323–327.
[Evan84] Evans, M. G. K., Parry, G. W., and Wreathall, J., "On the Treatment of Common–Cause Failures in Systems Analysis," Reliability Engineering, 9(2), 1984, pp. 107–115.
[Fell49] Feller, W., "Fluctuation Theory of Recurrent Events," Transactions of the American Mathematical Society, Volume 67, 1949, pp. 98–119.
[Fell67] Feller, W., An Introduction to Probability Theory and Its Applications, Volume I, 3rd Edition, John Wiley & Sons, 1967.
[Fell71] Feller, W., An Introduction to Probability Theory and Its Applications, Volume II, 2nd Edition, John Wiley & Sons, 1971.
[Flem74] Fleming, K. N., "A Reliability Model for Common Mode Failures in Redundant Safety Systems," General Atomic Report, GA–13284, 1974.
[Felm85] Fleming, K. N., Mosleh, A., and Deremer, R. K., "A Systematic Procedure for the Incorporation of Common Cause Events into Risk and Reliability Models," Nuclear Engineering and Design, 93, 1985, pp. 245–279.

[Fry28] Fry, T. C., Probability and Its Engineering Uses, Van Nostrand, 1928.
[Fuss76] Fussell, J., "Fault Tree Analysis – Concepts and Techniques," in Generic Techniques in Reliability Assessment, Henley, E. and Lynn, E., editors, Noordhoff Publishing Company, Leyden, Holland, 1976.
[Gall76] Gallaher, L. E. and Toy, W. N., "Fault–Tolerant Design of the 3B20 Processor," NCC–81 Proceedings, Chicago, May 1976.
[Gave73] Gaver, D. P. and Thompson, G. L., Programming and Probability Models in Operations Research, Brooks/Cole Publishing, 1973, pp. 417–426.
[Geis84] Geist, R., Trivedi, K., Dugan, J. B., and Smotherman, M., "Modeling Imperfect Coverage in Fault–Tolerant Systems," Proceedings of the IEEE 14th Fault–Tolerant Computing Symposium, June 1984, pp. 77–82.
[Glas82] Glaser, R. E. and Masson, G. M., "The Containment Set Approach to Upset Handling in Microprocessor Control Design," Proceedings of the 12th Annual International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 215–222.
[Gkuc83] Gluch, D. P. and Paul, M. J., "Fault Tolerance in Distributed Digital Fly–by–Wire Flight Control Systems," Digest of the 13th Annual International Symposium on Fault–Tolerant Computing, Milan, Italy, June 1983, pp. 121–126.
[Goel79] Goel, A. L. and Okumoto, K., "Time–Dependent Error Detection Rate Model for Software Reliability and Other Performance Measures," IEEE Transactions on Reliability, Volume R–28, August 1979, pp. 206–211.
[Gree82] Green, A. E. (editor), High Risk Safety Technology, Wiley–Interscience, 1982.
[Guid86] Guidal, C., "Computerized FMECA: An Overview of the FMEGEN Program," 5th International Conference on Reliability and Maintainability, Biarritz, France, October 1986.
[Gunn87] Gunneflo, J. B., Johnson, J., Karlsson, S., Lowb, S., and Torin, J., "A Fault Injection System for the Study of Transient Fault Effects on Computer Systems," Technical Report Number 47, Chalmers University, 1987.
[Gupt66] Gupta, S. C., Transform and State Variable Methods in Linear Systems, John Wiley & Sons, 1966.
[Hals79] "Commemorative Issue in Honor of Dr. Maurice H. Halstead," Special Edition of IEEE Transactions on Software Engineering, Volume SE–5, Number 2, March 1979.
[Harr89] Harrington, P., "Applying Customer–Oriented Metrics," IEEE Software, November 1989, pp. 71–74.

[Heik79] Heikkila, M., "A Model for Common Mode Failures," Proceedings of the Second National Reliability Conference, 1979.
[Heis84] Heising, C. D. and Guey, C. H., "A Comparison of Methods for Calculating System Unavailability Due to Common Cause Failures: The Beta Factor and Multiple Dependent Failure Fraction Methods," Reliability Engineering, 8(2), 1984, pp. 106–116.
[Heiv80] Helvik, B. E., "Periodic Maintenance, on the Effect of Imperfectness," Digest of Papers of the 10th International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, 1980, pp. 204–206.
[Hene81] Henley, E. J. and Kumamoto, H., Reliability Engineering and Risk Assessment, Prentice Hall, 1981.
[Hild56] Hildebrand, F. B., Introduction to Numerical Analysis, McGraw–Hill, 1956.
[Hoel62] Hoel, P. G., Introduction to Mathematical Statistics, John Wiley & Sons, 1962.
[Hoel72] Hoel, P. G., Port, S. C., and Stone, C. J., Introduction to Stochastic Processes, Houghton Mifflin, 1972.
[Howa71] Howard, R. A., Dynamic Probabilistic Systems: Volume I (Markov Models) and Volume II (Semi–Markov and Decision Processes), John Wiley & Sons, 1971.
[HSE87] Health and Safety Executive, Programmable Electronic Systems in Safety Related Applications, Rimington, J. D., Director General, Her Majesty's Stationery Office (HMSO), London, 1987.
[Hump87] Humphreys, R. A., "Assigning a Numerical Value to the Beta Factor Common Cause Evaluation," Proceedings of Reliability '87, Paper 2C, 1987.
[Ingl77] Ingle, A. D. and Siewiorek, D. P., "Reliability Models for Multiprocessor Systems With and Without Maintenance," Proceedings of the Fault–Tolerant Computing Symposium, 1977, pp. 3.
[Iser76] Isermann, R., Digital Control Systems, Springer–Verlag, 1977.
[Issa76] Isaacson, D. L. and Madsen, R. W., Markov Chains: Theory and Applications, John Wiley & Sons, 1976.
[Iyer80] Iyer, R. K., "A Study of the Effect of Uncertainty in Failure Rate Estimation of System Reliability," Digest of Papers of the 10th International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, 1980, pp. 219–224.
[Iyer84] Iyer, B. R., Donatiello, L., and Heidelberger, P., "Analysis of Performability of Stochastic Models of Fault Tolerant Systems," IBM Research Report RC 10719, Yorktown Heights, New York, 1984.

[Iyer85] Iyer, B. R. and Velardi, P., "Hardware–Related Software Errors," IEEE Transactions on Software Engineering, Volume SE–11, February 1985.
[Jame67] James, M. L., Smith, G. M., and Wolford, J. C., Applied Numerical Methods for Digital Computation with Fortran, International Textbook Company, 1967.
[Jame74] James, L. E., Sheffield, T. S., and Plein, K. M., Study of Reliability Prediction Techniques for Conceptual Phases of Development, Final Report, Rome Air Development Center, RADC–TR–74–235, October 1974.
[Jeli72] Jelinski, Z. and Moranda, P. B., "Software Reliability Research," Statistical Computer Performance Evaluation, edited by W. Freiberger, Academic Press, 1972, pp. 465–484.
[John85] Johnson, B. W. and Aylor, J. H., "Reliability and Safety Analysis of a Fault–Tolerant Controller," IEEE Transactions on Reliability, Volume R–35, Number 4, October 1985, pp. 355–362.
[John86] Johnson, B. W. and Julich, P. M., "Fault–Tolerant Computer Systems for the A129 Helicopter," IEEE Transactions on Aerospace and Electronic Systems, Volume AES–21, Number 2, March 1985, pp. 220–229.
[Jury64] Jury, E. I., Theory and Application of the z–Transform Method, John Wiley & Sons, 1964.
[Kapur77] Kapur, K. C. and Lamberson, L. R., Reliability in Engineering Design, John Wiley & Sons, 1977.
[Kell88] Kelly, J. P., Eckhardt, D. E., et al., "A Large Scale Second Generation Experiment in Multi–Version Software: Description and Early Results," Proceedings of the 18th International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, June 1988, pp. 9–14.
[Keme60] Kemeny, J. G. and Snell, J. L., Finite Markov Chains, D. Van Nostrand, 1960.
[Kend50] Kendall, D. G., "Some Problems in the Theory of Queues," Journal of the Royal Statistical Society, Series B, 13, 1950, pp. 151–185.
[Kend53] Kendall, D. G., "Stochastic Processes Occurring in the Theory of Queues and Their Analysis by the Method of the Imbedded Markov Chain," Annals of Mathematical Statistics, 24, 1953, pp. 338–354.
[Kend61] Kendall, M. G. and Stuart, A., The Advanced Theory of Statistics, Volume 2, Inference and Relationships, Hafner Publishing, 1961.
[Kend77] Kendall, M. G. and Stuart, A., The Advanced Theory of Statistics, Volume 1, Distribution Theory, 4th Edition, MacMillan Publishing Company, 1977.

[Khin60] Khinchin, A. J., Mathematical Methods in the Theory of Queueing, Griffin, London, 1960.
[Kimb60] Kimball, B. F., "On the Choice of Plotting Positions on Probability Paper," Journal of the American Statistical Association, Volume 55, September 1960.
[Kim86] Kim, K. H., "A Scheme for Coordinated Execution of Independently Designed Recoverable Distributed Processes," Proceedings of the 16th International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, 1986, pp. 130–135.
[King69] Kingman, J. F. C., "Markov Population Processes," Journal of Applied Probability, 6, 1969, pp. 1–18.
[Kirr86] Kirrmann, H., "Fault–Tolerant Issues in the Design of a Highly Available High–Speed Controller for HVDC Transmission," Proceedings of the 16th International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, 1986, pp. 184–189.
[Klie75] Kleinrock, L., Queueing Systems, Volume 1: Theory, John Wiley & Sons, 1975.
[Knig86] Knight, J. C. and Leveson, N. G., "An Experimental Evaluation of the Assumption of Independence in Multiversion Programming," IEEE Transactions on Software Engineering, Volume SE–12, Number 1, January 1986, pp. 96–109.
[Koba78] Kobayashi, H., Modeling and Analysis: An Introduction to System Performance Evaluation Methodology, Addison Wesley, 1978.
[Kraf81] Kraft, G. D. and Toy, W. N., Microprogrammed Control and Reliable Design of Small Computers, Prentice Hall, 1981.
[Krey72] Kreyszig, E., Advanced Engineering Mathematics, John Wiley & Sons, 1972.
[Kueh69] Kuehn, R. E., "Computer Redundancy: Design, Performance, and Future," IEEE Transactions on Reliability, Volume R–18, Number 1, February 1969, pp. 3–11.
[Kulk84] Kulkarni, V. G., Nicola, V. F., and Trivedi, K. S., "A Unified Model for Performance and Reliability of Fault–Tolerant/Multi–Modal Systems," CS–1984–12, Department of Computer Science, Duke University.
[Lala83] Lala, J. H., "Fault Detection, Isolation, and Reconfiguration in FTMP: Methods and Experimental Results," Proceedings of the AIAA/IEEE Avionics Systems Conference, November 1983, pp. 21.3.1–21.3.9.

[Lamp82] Lamport, L., Shostak, R., and Pease, M., "The Byzantine Generals Problem," ACM Transactions on Programming Languages and Systems, Volume 4, Number 3, July 1982, pp. 382–401.
[Lapr84] Laprie, J., "Dependable Computing and Fault–Tolerance: Concepts and Terminology," Proposal to the IFIP WG 10.4 Summer 1984 Meeting, Kissimmee, Florida, June 16–19, 1984.
[Lars81] Larsen, R. L. and Marx, M. L., Introduction to Mathematical Statistics and Its Applications, Prentice Hall, 1981.
[Lath65] Lathi, B. P., Signals and Communication, John Wiley & Sons, 1965.
[Lawr68] Lawrence, W. A., "Proposed Methods for Evaluation of Current Trouble Location Manuals – Case 36279–133," Internal Memo, Bell Telephone Laboratories, December 9, 1968.
[Lee80] Lees, F. P., Loss Prevention in the Process Industries, Butterworth, 1980.
[Lehm62] Lehman, R. S., "Dynamic Programming and Gaussian Elimination," Journal of Mathematical Analysis and Applications, Volume 5, 1962, pp. 1–16.
[Leve83] Leveson, N. G. and Harvey, P. R., "Analyzing Software Safety," IEEE Transactions on Software Engineering, Volume SE–9, Number 5, September 1983, pp. 569–579.
[Leve86] Leveson, N. G., "Software Safety: Why, What, and How," ACM Computing Surveys, June 1986, pp. 125–163.
[Leve86a] Levendel, Y., "Quality and Reliability Estimation for Large Software Projects Using a Time–Dependent Model," Proceedings of COMPSAC 87, Tokyo, Japan, October 1987, pp. 340–346.
[Leve89] Levendel, Y., "Quality and Reliability Prediction: A Time–Dependent Model with Controllable Testing Coverage and Repair Intensity," Proceedings of the 4th Israel Conference on Computer Systems and Software Engineering, Tel–Aviv, June 1989.
[Lill69] Lilliefors, H. W., "On the Kolmogorov–Smirnov Test for the Exponential Distribution with Mean Unknown," Journal of the American Statistical Association, 64, 1969, pp. 387–389.
[Lipo79] Lipow, M., "Prediction of Software Failures," TRW–SS–79–05, TRW Systems and Integration Division, One Space Park, Redondo Beach, California, 1979.
[Litt73] Littlewood, B. and Verrall, J. L., "A Bayesian Reliability Growth Model for Computer Software," Applied Statistics, Volume 22, 1973, pp. 332–346.

[Litt75] Littlewood, B., "A Reliability Model for Markov Structured Software," Proceedings of the International Conference on Reliable Software, SIGPLAN Notices, Volume 10, Number 6, June 1975, pp. 204–207.
[Lome88] Lomen, D. and Mark, J., Differential Equations, Prentice Hall, 1988.
[Low72] Low, T. A. W. and Noltingk, B. E., "Quantitative Aspects of Reliability in Process–Control," Proceedings of the I.E.E., I.E.E. Reviews, Volume 119, Number 8R, August 1972.
[Luen79] Luenberger, D., Introduction to Dynamic Systems, John Wiley & Sons, 1979.
[Lyon62] Lyons, R. E. and Vanderkulk, W., "The Use of Triple–Modular Redundancy to Improve Computer Reliability," IBM Journal of Research and Development, Volume 6, April 1962, pp. 200–209.
[Mall78] Mallela, S. and Masson, G. M., "Diagnosable Systems for Intermittent Faults," IEEE Transactions on Computers, Volume C–27, June 1978, pp. 306–366.
[Maka81] Makam, S. V. and Avizienis, A., "Modeling and Analysis of Periodically Renewed Closed Fault–Tolerant Systems," Proceedings of the Eleventh Annual International Symposium on Fault–Tolerant Computing, June 1981, pp. 134–141.
[Mank77] Mankamo, T., "Common Load Method, A Tool for Common Cause Failure Analysis," Sähkötekniikan laboratorio, Technical Research Centre of Finland, Tiedonanto 31, 1977.
[Mann71] Mann, N. R., Fertig, K. W., and Scheuer, E. M., Confidence and Tolerance Bounds and a New Goodness–of–Fit Test for the Two–Parameter Weibull or Extreme Value Distribution (with Tables for Censored Sample Sizes), Aerospace Research Laboratories, Wright–Patterson Air Force Base, Ohio, ARL 71–0077, Contract No. F33(615)–70–C–1216, May 1971.
[Mark07] Markov, A. A., "Extensions of the Limit Theorems of Probability Theory to a Sum of Variables Connected in a Chain," The Notes of the Imperial Academy of Sciences of St. Petersburg, VIII Series, Physico–Mathematical College, Volume XXII, No. 9, December 5, 1907.
[Math70] Mathur, F. P. and Avizienis, A., "Reliability Analysis and Architecture of a Hybrid Redundant Digital System: Generalized Triple Modular Redundancy with Self–Repair," AFIPS Conference Proceedings, Spring Joint Computer Conference, Volume 36, 1970, pp. 375–383.
[Mats88] Matsumoto, K., Inoue, K., Kikuno, T., and Torii, K., "Experimental Evaluation of Software Reliability Growth Models," Proceedings of the 18th International Symposium on Fault–Tolerant Computing, June 1988, pp. 148–153.
[Maye72] Mayeda, W., Graph Theory, Wiley–Interscience, 1972.

[McMl86] McCluskey, E. J., Logic Design Principles with Emphasis on Testable Semi–Custom Circuits, Prentice Hall, 1986.
[McCo79] McConnell, S. R., Siewiorek, D. P., and Tsao, M. M., "The Measurement and Analysis of Transient Errors in Digital Computer Systems," Digest of the Ninth International Fault–Tolerant Computing Symposium, IEEE Computer Society Press, 1979, pp. 67–70.
[McGo83] McGough, J., "Effects of Near–Coincident Faults in Multiprocessor Systems," Proceedings of the AIAA/IEEE Digital Avionics Systems Conference, November 1983, pp. 16.6.1–16.6.7.
[McGo85] McGough, J., Smotherman, M., and Trivedi, K. S., "The Conservativeness of Reliability Estimates Based on Instantaneous Coverage," IEEE Transactions on Computers, Volume C–34, July 1985, pp. 602–609.
[Mels78] Melsa, J. L. and Cohen, D. L., Decision and Estimation Theory, McGraw–Hill, 1978.
[Meno63] Menon, M. V., "Estimation of the Shape and Scale Parameters of the Weibull Distribution," Technometrics, Volume 5, Number 2, May 1963.
[Midd46] Middlemiss, R. R., Differential and Integral Calculus, McGraw–Hill, 1946.
[MIL1629] MIL–STD–1629A, Procedures for Performing a Failure Mode and Effects Analysis.
[Mell77] Melliar–Smith, P. M. and Randell, B., "Software Reliability: The Role of Programmed Exception Handling," Proceedings of the Conference on Language Design for Reliable Software, SIGPLAN Notices, Volume 12(3), March 1977, pp. 95–100.
[McLac39] McLachlan, N. W., Complex Variable and Operational Calculus, Cambridge University Press, 1939.
[Meye82] Meyer, J. F., "Closed Form Solutions of Performability," IEEE Transactions on Computers, July 1982, pp. 648–657.
[Morg78] Morganti, M., Coppadoro, G., and Ceru, S., "UDET 7116 – Common Control for PCM Telephone Exchange: Diagnostic Software Design and Availability Evaluation," Digest of the Eighth International Fault–Tolerant Computing Symposium, IEEE Computer Society, Toulouse, France, 1978, pp. 16–23.
[Musa75] Musa, J. D., "A Theory of Software Reliability and Its Applications," IEEE Transactions on Software Engineering, Volume SE–1, September 1975, pp. 312–327.
[Musa87] Musa, J. D., Iannino, A., and Okumoto, K., Software Reliability: Measurement, Prediction, Application, McGraw–Hill, 1987.

[Musa89] Musa, J., "Faults, Failures, and a Metrics Revolution," IEEE Software, March 1989, pp. 85–91.
[Myer64] Myers, R., Wong, K., and Gordy, H., Reliability Engineering for Electronic Systems, John Wiley & Sons, 1964.
[Naka75] Nakagawa, T. and Osaki, S., "The Discrete Weibull Distribution," IEEE Transactions on Reliability, Volume R–24, Number 5, December 1975, pp. 300–301.
[Naka79] Nakagawa, T., "Optimum Policies when Preventive Maintenance is Imperfect," IEEE Transactions on Reliability, Volume R–28, Number 4, October 1979.
[Naka81] Nakagawa, T., Yasui, K., and Osaki, S., "Optimum Maintenance Policies for a Computer System with Restart," Proceedings of the Eleventh Annual International Symposium on Fault–Tolerant Computing, June 1981, pp. 148–150.
[Ng76] Ng, Y–W., "Reliability Modeling and Analysis for Fault Tolerant Computers," Ph.D. Dissertation, Computer Science Department, University of California, Los Angeles, 1976.
[Ng77] Ng, Y–W. and Avizienis, A., "A Reliability Model for Gracefully Degrading and Repairable Fault–Tolerant Systems," Proceedings of the Fault–Tolerant Computing Symposium, 1977, pp. 22.
[NSCC80] NSCC/PATE Guidebooks: Volume IIA – Nuclear Safety Cross–Check Analysis and Technical Evaluation Process, SED–80204–1, Logicon, San Pedro, California, 1980.
[Oda81] Oda, Y., Tohma, Y., and Furuya, K., "Reliability and Performance Evaluation of Self–Reconfigurable Systems with Periodic Maintenance," Proceedings of the Eleventh Annual International Symposium on Fault–Tolerant Computing, June 1981, pp. 142–147.
[Ogat70] Ogata, K., Modern Control Engineering, Prentice Hall, 1970.
[Ohba84] Ohba, M., "Software Reliability Analysis Models," IBM Journal of Research and Development, Volume 28, Number 4, July 1984, pp. 259–265.
[Ossf80] Ossfeldt, B. and Jonsson, I., "Recovery and Diagnostics in the Central Control of the AXE Switching System," IEEE Transactions on Computers, June 1980, pp. 482–491.
[Ozak88] Ozaki, B. M., Fernandez, E. B., and Gudes, E., "Software Fault Tolerance in Architectures with Hierarchical Protection Levels," IEEE Micro, August 1988, pp. 30–43.

[Palm43] Palm, C., "Intensitätsschwankungen im Fernsprechverkehr," Ericsson Technics, Volume 44, 1943, pp. 1–89.
[Papo65] Papoulis, A., Probability, Random Variables, and Stochastic Processes, McGraw–Hill, 1965.
[Papo65a] Papoulis, A., "Markoff and Wide–Sense Markoff Sequences," Proceedings of the IEEE, October 1965.
[Parz60] Parzen, E., Modern Probability Theory and Its Applications, John Wiley & Sons, 1960.
[Parz62] Parzen, E., Stochastic Processes, Holden Day, 1962.
[Pipe63] Pipes, L. A., Matrix Methods for Engineering, Prentice Hall, 1963.
[Pois37] Poisson, Siméon D., Recherches sur la Probabilité des Jugements en Matière Criminelle et en Matière Civile, Précédées des Règles Générales du Calcul des Probabilités, 1837.
[Rals65] Ralston, A., A First Course in Numerical Analysis, McGraw–Hill, 1965.
[Rama79] Ramamoorthy, C. V., Bastani, F. B., Favaro, J. M., Mok, Y. R., Nam, C. W., and Suzuki, K., "A Systematic Approach to the Development and Validation of Critical Software for Nuclear Power Plants," Proceedings of the 4th International Conference on Software Engineering, 1979, pp. 231–240.
[Rams79] Ramshaw, L. H., "Formalizing the Analysis of Algorithms," Stanford University Computer Science Department Technical Report, STAN–CS–79–741, 1979.
[Rand75] Randell, B., "System Structures for Software Fault Tolerance," IEEE Transactions on Software Engineering, Volume SE–1, Number 3, June 1975, pp. 221–232.
[Rea78] Research and Education Association, The Differential Equations Problem Solver, Research and Education Association, 1978.
[Robi82] Robison, A. S., "A User Oriented Perspective of Fault–Tolerant System Models and Terminologies," Proceedings of the 12th Annual International Symposium on Fault–Tolerant Computing, IEEE Computer Society Press, June 1982, pp. 22–28.
[Ross83] Ross, S., Stochastic Processes, John Wiley & Sons, 1983.
[Ross70] Ross, S., Applied Probability Models with Optimization Applications, Holden–Day, 1970.
[Rous79] Rouse, W. B. and Rouse, S. H., "Measures of Complexity of Fault Diagnosis Tasks," IEEE Transactions on Systems, Man, and Cybernetics, Volume SMC–9, Number 11, November 1979, pp. 720–727.

[Rouq86] Rouquet, J. C. and Traverse, P., "Safe and Reliable Computing on Board the Airbus and ATR Aircraft," Proceedings of the Fifth International Workshop on Safety of Computer Control Systems, SAFECOMP '86, Sarlat, France, 1986, pp. 93–97.
[RTCA85] RTCA, Radio Technical Commission for Aeronautics, "Software Considerations in Airborne Systems and Equipment Certification," Technical Report DO–178A, Washington D. C., March 1985. RTCA Secretariat, One McPherson Square, 1425 K Street, N. W., Suite 500, Washington D. C., 20005.
[Saat65] Saaty, T. L., "Stochastic Network Flow: Advances in Networks of Queues," Proceedings of the Symposium on Congestion Theory, University of North Carolina Press, 1965, pp. 86–107.
[Schi78] Schick, G. J. and Wolverton, R. W., "An Analysis of Competing Software Reliability Models," IEEE Transactions on Software Engineering, Volume SE–4, Number 2, March 1978, pp. 104–120.
[Schu86] Schuette, M. A., Shen, J. P., Siewiorek, D. P., and Zhu, Y. X., "Experimental Evaluation of Two Concurrent Error Detection Schemes," Proceedings of the 16th International Symposium on Fault–Tolerant Computing, July 1986, pp. 138–143.
[Sega88] Segall, Z., et al., "FIAT – Fault Injection Based Automated Testing Environment," Proceedings of the 18th International Symposium on Fault–Tolerant Computing, June 1988, pp. 102–107.
[Seth77] Seth, S. C. and Kodandapani, K. L., "Diagnosis of Faults in Linear Tree Networks," IEEE Transactions on Computers, Volume C–26, Number 1, January 1977, pp. 29–33.
[Shin86] Shin, K. G. and Lee, Y. H., "Measurement and Application of Fault Latency," IEEE Transactions on Computers, Volume C–35, April 1986, pp. 370–375.
[Shoo68] Shooman, M. L., Probabilistic Reliability: An Engineering Approach, McGraw–Hill, 1968.
[Shoo73] Shooman, M. L., "Operational Testing and Software Reliability Estimation During Program Development," Record, IEEE Symposium on Computer Software Reliability, 1973, pp. 51–57.
[Siew82] Siewiorek, D. P. and Swarz, R. S., The Theory and Practice of Reliable System Design, Digital Press, 1982.
[Siew84] Siewiorek, D. P., "Architecture of Fault–Tolerant Computers," Computer, Volume 17, Number 8, August 1984, pp. 9–17.

[SINT88] SINTEF, "Reliability Evaluation of Safety System Configurations," The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology, Report Number STF75 F88002, 1988.
[SINT89] SINTEF, "Reliability Assessment of TRICON for Boiler Management Applications," The Foundation for Scientific and Industrial Research at the Norwegian Institute of Technology, Report Number DRAFT–890914, 1989.
[Smai49] Smail, L. L., Calculus, Appleton–Century–Crofts, 1949.
[Smit58] Smith, W. L., "Renewal Theory and Its Ramifications," Journal of the Royal Statistical Society, Series B, 20, 1958, pp. 243–302.
[Smit85] Smith, D. J., Reliability and Maintainability in Perspective, MacMillan, 1985.
[Sned80] Snedecor, G. W. and Cochran, W. G., Statistical Methods, 7th Edition, Iowa State University Press, 1980.
[Snyd75] Snyder, D. L., Random Point Processes, John Wiley & Sons, 1975.
[Soma86] Somani, A. and Agarwal, A. V., "On Complexity of Diagnosability and Diagnosis Problems in System–Level Diagnosis," Proceedings of the 16th Annual International Symposium on Fault–Tolerant Computing Systems, IEEE Press, July 1986, pp. 232–237.
[Sosn86] Sosnowski, J., "Evaluation of Transient Hazards in Microprocessor Controllers," Proceedings of the 16th Annual International Symposium on Fault–Tolerant Computing, IEEE Press, July 1986, pp. 364–369.
[Stif80] Stiffler, J. J., "Robust Detection of Intermittent Faults," Proceedings of the 10th Annual International Symposium on Fault–Tolerant Computing, IEEE Press, October 1980, pp. 216–218.
[Thay78] Thayer, T. A., Lipow, M., and Nelson, E. C., Software Reliability – A Study of Large Project Reality, TRW Series on Software Technology, Volume 2, Elsevier North Holland, 1978.
[Thom69] Thoman, D. R., Bain, L. J., and Antle, C. E., "Inferences on the Parameters of the Weibull Distribution," Technometrics, Volume 11, Number 4, November 1969.
[Thom81] Thomas, J. C. and Leveson, N. G., "Applying Existing Safety Design Techniques to Software Safety," Technical Report Number 180, University of California, Irvine, 1981.
[Thor26] Thorndike, F., "The Applications of Poisson's Probability Summation," The Bell System Technical Journal, Volume 5, 1926, pp. 604–624.
[Toy78] Toy, W. N., "Fault–Tolerant Design of Local ESS Processors," Proceedings of the IEEE, October 1978, pp. 1126–1145.

[Toy86] Toy, W. N. and Zee, B., "Fault Tolerant Computing," in Computer Hardware/Software Architecture, Prentice Hall, 1986, pp. 337–392.
[Toy87] Toy, W. N., "Dual Versus Triplication Reliability Estimations," AT&T Technical Journal, Volume 66, Issue 6, November/December 1987, pp. 15–20.
[Triv74] Trivedi, A. K., "Reliability Estimation and Prediction," Reading Report, Department of EE/EP, Polytechnic Institute of New York, June 1975.
[Triv75] Trivedi, A. K., "Computer Software Reliability: Many State Markov Techniques," Ph.D. Dissertation, Department of Electrical Engineering, Polytechnic Institute of New York, June 1975.
[Triv75a] Trivedi, A. K. and Shooman, M. L., "A Many–State Markov Model for the Estimation and Prediction of Computer Software Performance Parameters," Proceedings of the International Conference on Reliable Software, SIGPLAN Notices, Volume 10, Number 6, June 1975, pp. 208–215.
[Triv82] Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice Hall, 1982.
[TÜV86] TÜV Study Group on Computer Safety, Microcomputers in Safety Technique, an Aid to Orientation for Developer and Manufacturer, Technischer Überwachungs–Verein Bayern e. V., Verlag TÜV Rheinland, 1986.
[UKAE88] United Kingdom Atomic Energy Authority, "A Reliability Analysis of the Relay Logic for a Burner Control and Safety System in a Boiler Installation," Safety and Reliability Directorate Report SRS/ASG/31610/2, 1988.
[Vses77] Vesely, W. E., "Estimating Common Cause Failure Probabilities in Reliability and Risk Analysis: Marshall–Olkin Specializations," in Nuclear Systems Reliability Engineering and Risk Management, Fussel, J. B. and Burdick, G. R., editors, Society for Industrial and Applied Mathematics, Philadelphia, 1977, pp. 314–341.
[Wall85] Waller, R. A., "A Brief Survey and Comparison of Common Cause Failure Analysis," NUREG/CR–4314, Los Alamos National Laboratory, 1985.
[WASH75] WASH–1400, Reactor Safety Study: Appendix IV: Common Mode Failures – Bounding Techniques and Special Techniques, NUREG/75/014, U.S. Nuclear Regulatory Commission, Washington D. C., 1975.
[Wats79] Watson, I. A. and Edwards, G. T., "A Study of Common Cause Mode Failures," United Kingdom Atomic Energy Authority (UKAEA), Report SRD–R146, Safety and Reliability Directorate, 1979.

[Whit69] White, J. S., "The Moments of Log–Weibull Order Statistics," Technometrics, Volume 11, Number 2, May 1969.
[Whit82] Whittingham, R. B., "Air: Ammonia Ratio on a Nitric Acid Plant, Associate Member's Overview of the SRS Study," United Kingdom Atomic Energy Authority, Systems Reliability Service, SRS/GR/54, 1982.
[Wibe71] Wiberg, D. M., State Space and Linear Systems, Schaum's Outline Series, McGraw–Hill, 1971.
[Widd46] Widder, D. V., The Laplace Transform, Princeton University Press, 1946.
[Wilk65] Wilkinson, J. H., The Algebraic Eigenvalue Problem, Oxford University Press, 1965.
[Wood79] Woodfield, S. N., "An Experiment on Unit Increase in Problem Complexity," IEEE Transactions on Software Engineering, Volume SE–5, Number 2, 1979, pp. 76–79.
[Yama67] Yamane, T., Elementary Sampling Theory, Prentice Hall, 1967.
[Yama85] Yamada, S., Ohba, M., and Osaki, S., "S–Shaped Reliability Growth Modeling: Models and Applications," IEEE Transactions on Reliability, Volume R–32, December 1983, pp. 475–478.
