Some remarks on Bayesian Reliability: combining information by Alyson Wilson & Kassandra Fronczyk, by Tim Davis, University of Warwick1, UK

The paper by Wilson & Fronczyk (henceforth WF) raises some important quality engineering concepts that I would like to discuss2: firstly, what is the definition of reliability? And secondly, how should we deal with system reliability?

The definition of reliability

The probabilistic definition of reliability, e.g. “reliability is the probability that the entity3 performs its function under a given set of operating conditions for a given time”, is common in the statistical literature (see e.g. Meeker & Escobar, 1998, p2), and although not explicitly mentioned in the paper it is, by implication, the definition used by WF. In practice, depending on context, this definition can be difficult to measure and to operationalize (in the sense of using it to plan engineering development work to improve reliability), because operating conditions (which from now on we will call noise factors) are not fixed and can vary greatly (over time and space) in the field of operation, and because often many entities in the field may be lost to follow-up (for commercial products, when the warranty period expires, for example), making an assessment of the reliability in the field difficult if not impossible.

The probabilistic definition essentially answers the question “how often will the chosen design fail?”, which is an estimation problem, and usually requires holding the current design configuration fixed and repeating many trials so that this probability can be measured. An alternative definition of reliability, which might be called information based as opposed to probability based, is “reliability is failure mode avoidance” (this definition is due to Clausing, 2004)4, where the “reliability-related quantity of interest” (to use a phrase in WF) shifts from the rate (or frequency) at which failures occur in the field to the information contained in the potential failure modes themselves and the required counter measures. This definition of reliability answers the question “how do I choose the design that fails the least?”, which is a selection problem that generally requires doing just enough testing on alternative product configurations to point the direction of improvement, so that at the end of the development phase the best design (within the applicable constraints) can be chosen.

1 e-mail for correspondence [email protected]
2 This discussion article is based on the remarks that I gave at the conference regarding the presentation that was given by the first named author, which included some topics that are not in the published paper. Therefore, readers who were not at the conference should treat this discussion article as a general narrative on reliability.
3 The entity may be a component or a complete system.
4 An earlier but less accessible paper which introduced the concept of Failure Mode Avoidance is Clausing (1998); readers are invited to contact the author for access to this paper.

My remarks from this point forward will be from the perspective of reliability as failure mode avoidance, whereby the emphasis will be on selection rather than estimation.

System reliability

Since all counter measures are ultimately taken on components, we argue that system reliability is best tackled from the perspective of how the component functions when it is embedded in the system and interacting with other interfacing components. In an important article on system reliability, Clausing & Frey (2005) define two types of failure mode: 1) lack of robustness, and 2) mistakes.

Lack of robustness is the sensitivity of the primary function of the component to noise factors. Robustness as an engineering concept became well known in the Western economies (particularly the automotive sector) in the early 1980’s due to exposure to the work of Genichi Taguchi – see e.g. Taguchi (1986). The Ford Motor Company developed a taxonomy for these noise factors to help engineers identify them in particular problems; see Davis (2006) and Clausing and Fey (2004, p151). In this taxonomy the noise factors are classified as follows:

𝑧1: variations in part characteristics (e.g. geometry/mass) due to production rate;
𝑧2: changes in part characteristics through usage, i.e. wear out;
𝑧3: variations in the external environment caused by climate/weather;
𝑧4: variations in the way the customer (or end user) uses the product;
𝑧5: variations in the internal environment, i.e. interactions caused by component-to-component interfaces within the system (e.g. vibrations, heat, electromagnetic interference etc.);

or collectively “multiple operating conditions that depend on mission”5. We denote the complete set of noises as 𝒁 = {𝑧1, …, 𝑧5}. Note – 𝑧1 and 𝑧2 are often referred to as inner noises (sometimes capacity noises), while 𝑧3, 𝑧4, & 𝑧5 are called outer noises (sometimes demand noises). It is the last of these (𝑧5) that enables system reliability to be tackled from the perspective of the component, and provides an alternative to the Fault Tree representation (Fig 5 in WF) for tackling system reliability.

Mistakes are the occurrence of failures the counter measures for which are already known to the design team, but because of an oversight were not implemented on the current design (most probabilistic approaches to reliability ignore making inference on the occurrence of mistakes). Avoiding mistakes is primarily a matter of organizational vigilance.

5 This phrase was used in the conference presentation by Alyson Wilson, but does not appear in the journal paper.


An example of system reliability from the perspective of the component

This example relates to a failure mode on a push-push6 headlamp switch sticking in the on position during activation in a cold ambient7. The headlamp switch (which, after the switch is pushed, relies on a spring to re-position the switch for the next push) is mounted in an aperture in the Instrument Panel which, in a very cold winter ambient of around -20° Celsius8, shrinks, thus reducing the size of the aperture and so causing the switch housing to be squeezed – the resulting force on the switch housing results in the switch mechanism sticking (because the spring is now not strong enough to overcome the resulting increase in friction due to the deformation of the switch) and so it does not return to the required position for the next push. If the switch is activated prior to a short journey (so as not to allow the interior of the car to warm up enough to return the Instrument Panel aperture to the required dimension), then at the end of the journey it is not possible to turn off the headlamps9, which could result in a flat battery if the vehicle needed to be left for a while before the next journey.

It may be tempting to think initially that this failure mode can only be detected during engineering development by testing the entire vehicle in a cold room facility, but by taking a systems view and cascading the failure mode from the vehicle (the entire system) to sub-systems (the Instrument Panel) to component (the switch), we can easily detect this failure mode by determining how much force is required to deform the switch housing causing the switch to stick (and so no cold room facility is required) – see Figure 1 for a schematic of this cascade using the Systems “V” model.

Figure 1: A Systems “V” representation (e.g. see https://en.wikipedia.org/wiki/V-Model) of the headlamp switch: note how the noise factor is temperature (𝑧3) at the vehicle level but cascades down to the component level as force (𝑧5) – (see the graphics on the left hand side of the “V”)

Figure 2 illustrates further detail of the failure mode from the point of view of the component (i.e. the switch) rather than the total system (the vehicle).

6 Push for on – then push again for off.
7 See also Davis (2006) for a discussion of this example.
8 This condition arises if an automobile is left overnight outside a garage – a common practice in Europe.
9 This real-life example occurred on a product introduced to the market prior to the widespread introduction of automatically activated headlamps.

Figure 2: Schematic of the transfer function between deformation of the switch housing and the force generated by the switch aperture being squeezed due to shrinkage caused by cold temperature. Because of conservation laws this transfer function will be linear. The solid black line illustrates the design that was initially released into the field, with a negative “distance from the failure mode”, while the dotted line represents the design after the chosen counter measure is implemented, where the distance from the failure mode is now positive (see text). The critical force, denoted 𝑓𝑐, at which failure occurs will be much further to the right on the abscissa for the design represented by the dotted line. Note that a cold room facility is not required to evaluate design options using this criterion, but rather a simple device that can exert a measured force on the switch housing.

Two component-based counter measures are immediately apparent from Figure 2: either a) move the dashed line in Figure 2 up the ordinate to increase the area of the functional zone, which means selecting a stronger spring to overcome the increased friction (this was rejected due to the adverse effect on the tactile feel of the switch), or b) reduce the gradient of the line describing the relationship between force and deformation, i.e. make the switch housing stiffer so that the deformation caused by the force of the aperture shrinking is reduced (this counter measure was implemented by modifying the tool to add strengthening ribs to the moulding, making the switch mechanism immune (i.e. robust) to the effect of cold ambient). Note that both of these counter measures have one thing in common – they increase the “distance” from the failure mode (see Figure 2), or put another way, increase the number of points in the set {𝒁|𝐹 < 𝑓𝑐}.

Once the robustness counter measure, along with an appropriate detection control (to use the language of FMEA – see e.g. Bertsche (2008)), has been established, future occurrences of this failure mode on subsequent products can be regarded as a mistake – repeating the mistake can be avoided if the counter measure and detection control are embedded within an engineering standard that must be followed on future programs – this ensures reliability growth, as emphasised by Alyson Wilson in the conference presentation. Hence mistakes can be thought of as the absorbing state in reliability.

So what at first sight seems like a “system failure mode” can essentially be re-cast when the failure mode is viewed from the point of view of the component that is malfunctioning. The general principle is to try to detect and excite the failure mode on the smallest entity possible (i.e. a component) by understanding how the effects of the relevant noise factors affect the component while it is embedded in the system; often when we refer to a system failure mode, we mean a failure mode that is initially detected only when the entire system is tested.
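To make the geometry of Figure 2 concrete, the two counter measures can be sketched numerically. All the values below (stiffness, recoverable deformation, field force) are invented for illustration – they are not measurements from the actual switch:

```python
# Minimal sketch of Figure 2, assuming a linear transfer function
# (deformation = force / stiffness): the switch sticks once the
# deformation exceeds the largest deformation the return spring
# can overcome.  All numbers are hypothetical.
def critical_force(stiffness_n_per_mm, max_recoverable_deformation_mm):
    """Critical force f_c at which the switch sticks."""
    return stiffness_n_per_mm * max_recoverable_deformation_mm

f_field = 12.0  # assumed worst-case squeeze force from the cold aperture (N)

# Initial design (solid line in Figure 2): f_c is below the field force,
# so the distance from the failure mode (f_c - f_field) is negative.
f_c_initial = critical_force(stiffness_n_per_mm=20.0, max_recoverable_deformation_mm=0.5)
print(f_c_initial - f_field)   # -2.0 N: failure mode present

# Counter measure (b): strengthening ribs double the housing stiffness,
# halving the gradient of the transfer function; the distance from the
# failure mode is now positive without changing the spring.
f_c_ribbed = critical_force(stiffness_n_per_mm=40.0, max_recoverable_deformation_mm=0.5)
print(f_c_ribbed - f_field)    # +8.0 N: failure mode avoided
```

Counter measure (a) would instead increase `max_recoverable_deformation_mm` via a stronger spring; either change moves 𝑓𝑐 to the right of the worst-case field force.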


This example is very similar in principle (in the sense that the whole system fails due to a failure of a simple component) to that of the Space Shuttle Challenger in 1986 (see Feynman, 1986)10. The missile example (SDB-II) of WF could be seen from this point of view if the reliability quantity of interest could be extended beyond simple pass/fail data to measure a characteristic of the failure based on knowledge of the physics (it is not possible to measure the distance from the failure mode with pass/fail data). For the viscosity example11 in WF, the distance from the failure mode could be defined as the distance in time between a given percentile of the posterior distribution (see Figure 2 in WF) and the time regarded as the minimum for an acceptable product.
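As a sketch of this idea for the viscosity case, suppose we had posterior samples of the time to viscosity breakdown; the lognormal parameters and acceptance limit below are invented for illustration and are not taken from WF:

```python
import random

# Hypothetical posterior samples of time to viscosity breakdown (hours),
# drawn here from an assumed lognormal posterior predictive distribution.
random.seed(1)
posterior_times = [random.lognormvariate(mu=5.0, sigma=0.3) for _ in range(10_000)]

def percentile(samples, q):
    """Empirical q-th percentile (0 < q < 100) of a list of samples."""
    s = sorted(samples)
    return s[int(q / 100 * (len(s) - 1))]

t_min = 100.0                        # assumed minimum acceptable time
p5 = percentile(posterior_times, 5)  # a conservative (5th) posterior percentile
distance = p5 - t_min                # "distance from the failure mode", in time
print(f"5th percentile: {p5:.1f} h, distance from failure mode: {distance:.1f} h")
```

A negative distance would indicate that the design is still too close to the failure mode at the chosen percentile, pointing to the need for a further counter measure rather than merely a refined estimate.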

In his Space Shuttle report, Feynman (1986) was very critical of managing the engineering development via predicted failure rates of the total system (i.e. the shuttle) in the field of operation, and devoted his attention to the Solid Rocket Booster (SRB) “O”-ring seal, how it interacted with the geometry of the joint in the SRB (i.e. noise factor 𝑧5), and what counter measures were required to avoid the failure.

This prediction difficulty can be illustrated with the sticking switch example: it would make sense to choose the life variable as the force on the switch housing, 𝐹 say, so that failure occurs when 𝐹 > 𝑓𝑐 (see Figure 2). The probabilistic definition of reliability would require us to calculate 1 − Pr(𝐹 > 𝑓𝑐). In order to do this, we would require knowledge of the transfer function 𝐹 = 𝑓(𝒁) (some knowledge of this transfer function is of course required to construct Figure 2), as well as the joint probability density function of 𝒁, e.g.:

- part variations across the population of vehicles, e.g. spring strength (𝑧1)
- the age of each vehicle in the population (𝑧2)
- the geographical location of each vehicle in the population (𝑧3)
- the temperature across the geographical region (𝑧3)
- the type of journey being undertaken (short journey in the dark vs. other) (𝑧4)
- the thermal expansion characteristics of the Instrument Panel material surrounding the switch housing (𝑧5)
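To see how much this calculation demands, a Monte Carlo sketch of 1 − Pr(𝐹 > 𝑓𝑐) is shown below; the transfer function and every distribution in it are invented assumptions for illustration only, which is precisely the point – without such assumed knowledge the probability cannot be computed at all:

```python
import random

# Hedged Monte Carlo sketch: estimating 1 - Pr(F > f_c) requires BOTH
# an assumed transfer function F = f(Z) and an assumed joint density
# for the noises Z.  Everything below is hypothetical, not field data.
random.seed(0)
F_C = 10.0  # assumed critical force at which the switch sticks (N)

def sample_force():
    temp_c = random.gauss(-5.0, 10.0)            # z3: climate over region/season
    stiffness = random.gauss(20.0, 2.0)          # z1: part-to-part variation
    shrink_mm = max(0.0, -temp_c * 0.03)         # assumed aperture shrinkage vs cold
    return stiffness * shrink_mm                 # z5: force on the housing

n = 100_000
failures = sum(sample_force() > F_C for _ in range(n))
print(f"estimated reliability: {1 - failures / n:.3f}")
```

Every extra noise factor (vehicle age, journey type, material properties) multiplies the modelling burden, whereas the counter measures above only needed the sign of the distance from the failure mode.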

But we don’t need to know this joint density function to fix the problem (that is, maximize the distance from the failure mode by increasing the number of points in the set {𝒁|𝐹 < 𝑓𝑐}) and make the failure mode go away. A key method to facilitate this approach is Failure Mode & Effects Analysis (FMEA), which we now discuss briefly.

Failure Mode & Effects Analysis (FMEA)

The primary aim of the FMEA is to turn hindsight into foresight (see Saxena et al (2015)), so that failure modes can be detected and counter measures developed prior to the tooling phase of the engineering product development process, and not after the fact when the failure mode has escaped into the field. (The text by Ulrich & Eppinger (2004) provides a good introductory overview of the engineering product development process.)

10 Feynman was able to easily detect and excite the failure mode (lack of compressibility of the “O”-ring material in very cold conditions) by immersing a small amount of clamped material in iced water, which didn’t spring back when the clamp was removed (see https://www.youtube.com/watch?v=6Rwcbsn19c0).
11 The choice of the lognormal distribution for this example, given the failure mode (breakdown in viscosity), is curious because the shape of the hazard function allows for the failure rate to decrease with 𝑡.

A primary focus is therefore to try to identify potential failure modes (e.g. through brainstorming and other team-based activities) and their likely causes – some approaches, e.g. Ford Motor Company (2001), provide a taxonomy for failure modes to aid this identification process. The next step is to develop detection controls that will be rich in the noise factors (𝒁) and so have a high probability12 of exciting the failure mode should the current level of design still require further counter measures; these detection controls (tests, in the language of WF) need to be executed as early as possible in the product development process to allow maximum time for counter measure development and implementation. Figure 3 shows a schematic of the information flow (from failure mode to counter measure) in an FMEA document.

Figure 3: Schematic of the FMEA document illustrating the transition from failure mode to counter measure via a detection control, using the sticking headlamp switch as an example. Note that an effective detection control which can excite the failure mode is necessary to evaluate a proposed counter measure, to prevent the failure mode escaping into the field. The column headed “S” provides for a rating S ∈ {1, …, 10} related to the severity of the effect of the failure mode, 10 being the most severe; the column headed “O” provides for a rating O ∈ {1, …, 10} related to the likely occurrence of the cause of the failure mode in the field, 10 being highly likely; and the column headed “D” provides for a rating D ∈ {1, …, 10} related to the probability12 that the detection control is able to excite the failure mode, 1 being highly likely. Note that once a counter measure has been implemented the detection control becomes a verification test.
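One FMEA row of the kind Figure 3 describes can be sketched as a small data structure; the field names and example ratings below are assumptions for illustration, not the Ford Motor Company (2001) document format:

```python
from dataclasses import dataclass

# Hypothetical sketch of a single FMEA row with the S, O, D ratings
# described above; ratings are validated to lie in {1,...,10}.
@dataclass
class FmeaRow:
    failure_mode: str
    effect: str
    severity: int       # S: 10 = most severe effect
    cause: str
    occurrence: int     # O: 10 = cause highly likely in the field
    detection_control: str
    detection: int      # D: 1 = control highly likely to excite the failure mode

    def __post_init__(self):
        for name in ("severity", "occurrence", "detection"):
            if getattr(self, name) not in range(1, 11):
                raise ValueError(f"{name} must be an integer rating in 1..10")

# Example row for the sticking headlamp switch (ratings invented).
row = FmeaRow(
    failure_mode="switch sticks in the on position",
    effect="headlamps cannot be switched off; flat battery",
    severity=7,
    cause="IP aperture shrinkage squeezes switch housing in cold ambient",
    occurrence=4,
    detection_control="apply measured force to switch housing; check return",
    detection=2,
)
print(row.severity, row.occurrence, row.detection)
```

The low D rating here records the judgement that the simple force-application rig is very likely to excite the failure mode, consistent with footnote 12.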

Wilson & Fronczyk refer to two errors that can be made in reliability improvement, these being: 1) we could decide that the test is failed when the system reliability is actually higher than required (or specified), and 2) we could decide that the test is passed when the system reliability is actually lower than required (or specified). In the Failure Mode Avoidance paradigm, these errors translate to: 1) we could include more counter measures in the final design than necessary (i.e. over-engineering), or 2) we could decide that a counter measure is effective when it isn’t – but we could still consider these errors as the posterior producer’s risk. The FMEA is an important method to combine data and information across tests (Section 4.1 of WF) to ensure the development of reliable systems.

Final remarks

The presentation at the 2016 Hunter conference by Alyson Wilson emphasised some important questions, particularly around the definition of reliability and how to treat system reliability. Failure Mode Avoidance provides another way to look at these problems – I would like to emphasise that it is no more correct than the approach suggested in Wilson & Fronczyk, it’s just another way to look, although perhaps the phrase “Bayesian Reliability” is a misnomer given that the definition of reliability in Wilson & Fronczyk is a frequency-based probability!

12 This probability, which represents a degree of belief and so is therefore Bayesian in nature, is reflected on the FMEA as a rating between 1 & 10 – low ratings indicate a high probability that the detection control can excite the failure mode, while high ratings indicate a low probability, and hence a risk that the failure mode will not be detected prior to the tooling phase and will thus escape into the field. The actual values of these probabilities will depend on the context and the consequence of an escaped failure mode; the use of Bayesian methods in the FMEA would benefit from further research.

References:

Bertsche, B. (2008) Reliability in Automotive and Mechanical Engineering, Springer, Berlin.
Clausing, D. (1998) “Reliability”, in Conference on Robust Design Methods – proceedings, Ford Motor Company, pp74-94.
Clausing, D. (2004) “Operating Window – an engineering measure of robustness”, Technometrics 46(1), pp25-29.
Clausing, D. & Fey, V. (2004) Effective Innovation – the Development of Winning Technologies, ASME Press, ISBN 0 7918 0203 5.
Clausing, D. & Frey, D.D. (2005) “Improving system reliability by failure mode avoidance including four concept design strategies”, Systems Engineering 8(3), pp245-261.
Davis, T.P. (2006) “Science, engineering, and statistics”, Applied Stochastic Models in Business and Industry 22(5-6), pp401-430.
Feynman, R.P. (1986) “Personal observations on the reliability of the Shuttle”, Appendix F, Report of the Presidential Commission on the Space Shuttle Challenger Accident, William P. Rogers (chair), US Government, Washington DC.
Ford Motor Company (2001) Failure Modes and Effects Analysis Handbook, Version 4.2, Ford Motor Company.
Meeker, W.Q. & Escobar, L.A. (1998) Statistical Methods for Reliability Data, John Wiley.
Saxena, A., Davis, T.P. & Jones, J.A. (2015) “A failure mode avoidance approach to reliability”, Annual Reliability and Maintainability Symposium (RAMS) 2015, pp1-6, DOI 10.1109/RAMS.2015.7105062.
Taguchi, G. (1986) Introduction to Quality Engineering – Designing Quality into Products and Processes, Asian Productivity Organization, ISBN 92 833 1084 5.
Ulrich, K.T. & Eppinger, S.D. (2004) Product Design and Development, 3rd edition, McGraw-Hill/Irwin, ISBN 0 07 247146 8.
