
IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 2, JUNE 2002

Software-Implemented Fault-Tolerance and Separate Recovery Strategies Enhance Maintainability

Geert Deconinck, Senior Member, IEEE, Vincenzo De Florio, and Oliver Botti

Abstract: This paper describes a novel approach to software-implemented fault tolerance for distributed applications. The approach can be used to enhance the flexibility and maintainability of the target applications in a cost-effective way. It relies on a framework approach that comprises: 1) a library of fault tolerance functions; 2) a middleware application coordinating these functions; 3) a language for the expression of nonfunctional services, including configuration, error recovery, and fault injection. The framework approach increases the availability and reliability of the application at a justifiable cost, helped by the reusability of the components in different target systems. It further increases maintainability because the functional behavior is separated from the recovery strategies that are executed when an error is detected: modifications to functional and nonfunctional behavior are, to some extent, independent, and hence less complex to deal with. The resulting tool matches current industrial requirements for embedded distributed systems, which call for adaptable and reusable software components. The paper reports on the integration of this approach in the automation system of a substation for electricity distribution. This case study shows in particular how the configuration-and-recovery language ARIEL allows adaptation to changes in the environment. The framework approach is also useful for distributed automation systems that are interconnected via a nondedicated network.

Index Terms: Dependable computing, framework approach, recovery strategies, software-implemented fault tolerance, software maintainability.

ACRONYMS

(The singular and plural of an acronym are always spelled the same.)

ARIEL   a high-level configuration-and-recovery language
FI      END_IF
FA      framework approach
FT      fault tolerance
HV      high voltage
LV      low voltage
MV      medium voltage
OS      operating system
PLC     programmable logic controller
PSAS    primary substation automation system

Manuscript received June 3, 2001; revised October 9, 2001. This paper is a revised and extended version of a 2001-RAMS paper. This work was supported in part by Projects ESPRIT-28620 (TIRAN) and IST-2000-25434 (DepAuDE), and by the Fund for Scientific Research - Flanders (Belgium, F.W.O.) through the Postdoctoral Fellowship for Geert Deconinck. G. Deconinck and V. De Florio are with the Department of Electrical Engineering (ESAT), K.U. Leuven, B-3001 Leuven, Belgium (e-mail: {Geert.Deconinck; Vincenzo.DeFlorio}@esat.kuleuven.ac.be). O. Botti is with the Networks and Plants Automation Group, CESI S.p.A., I-20134 Milan, Italy (e-mail: [email protected]). Publisher Item Identifier S 0018-9529(02)05995-X. 0018-9529/02$17.00 © 2002 IEEE.

I. INTRODUCTION

COSTS are an important driving force for distributed embedded systems, also when fault tolerance (FT) is concerned. This calls for open, flexible, and configurable solutions that can answer in a cost-effective way to a variety of dependability requirements. Such solutions can be based on pre-built and reusable modules, adaptable for a wide range of applications and reusable in different environments. In this context, this paper presents a framework approach (FA) that provides FT capabilities to embedded systems by exploiting the systems’ distributed hardware and by separating the functional behavior from the recovery strategy (viz., the set of actions to be executed when an error is detected). This conceptual FA consists of three entities [1]–[5]:

1) A library of basic FT tools: This library provides basic elements for error detection, localization, containment, recovery, and fault masking. The tools are software-based implementations of well-known FT mechanisms, grouped in a library of adaptable, parametric functions. These basic tools can be used on their own, or as cooperating entities attached to the control backbone (see entity #2). Examples include watchdogs, voting units, support for acceptance tests, and replicated memory.

2) A control backbone: This backbone is a distributed application that extracts information about the application’s topology, its progress, and its status; it maintains this information in a replicated database and it coordinates the FT actions at run-time via the interpretation of user-defined recovery strategies. The backbone functions as a sort of middleware on top of the underlying OS. It is hierarchically structured to maintain a consistent system view and contains self-testing and self-healing mechanisms.

3) ARIEL: This language is used to configure the basic FT tools from the library, and to express the recovery strategies. The application developer specifies these configurations and recovery actions via ARIEL.
For configuration purposes, ARIEL can set parameters and properties of the basic tools. For expressing recovery strategies (i.e., the localization, containment, and recovery actions to be executed when an error is detected), ARIEL allows building queries on the database of the backbone and attaching actions to these queries. These actions make it possible, e.g., to start, terminate, isolate, or inform an entity. Such an entity can be a node, a single task, a group of tasks, etc. As such, it is possible, e.g., to start a standby task, to reset a node or link, and/or to generate synchronization signals for reconfiguration.

Following this FA, increasing the dependability of an application implies the configuration and integration of basic FT tools from the library into the application, and writing the recovery strategy in ARIEL. This script is translated into a compact code that is executed at run-time by the backbone when an error is detected. ARIEL matches well to a number of coarse-grained local and distributed FT mechanisms: the different ARIEL templates support standby sparing, recovery blocks, N-modular redundancy, etc.

The innovative aspects of this approach do not come from implementing the library of well-known FT tools, but rather from the combination with the backbone that executes user-defined recovery actions when an error is detected. Such local, distributed, or system-wide recovery strategies described in ARIEL separate the nonfunctional aspects of application recovery from those concerning the functional behavior that the application should have in the absence of faults. Furthermore, separating these aspects allows the modification of the recovery strategy with only a limited number of changes in the application code, and vice versa, i.e., the application functionality can be changed without adjusting the recovery strategy. This results in better maintainability of the application (assuming a reliable interface and an orthogonal division of application functionality from FT strategies).

The FA is further guided by using semi-formal techniques to support requirement specification, and by modeling for predictive evaluation, together with intensive testing and evaluation on pilot applications.
The assessment considers real-time, dependability and cost requirements (these aspects are not discussed in this paper).

The target system is a distributed embedded automation system, which is assumed to obey the “timed asynchronous distributed system model” [6]; this realistic model supposes that: 1) all services (communication, computation) are limited in time, and 2) the nodes have access to a local clock with a bounded drift rate. A message-passing library is required, offering asynchronous, nonblocking multicast primitives (third party tools are permitted to provide this). This FA has been partially implemented as middleware on top of the following OS: Windows CE, VxWorks, Virtuoso [7], and TEX [8].

As a pilot application, this FA was deployed on the embedded automation system in an electrical substation, where the dependability requirements have been traditionally fulfilled by dedicated hardware-based FT solutions. Today, the evolution toward a new generation of automated substations demands a reduction of development and maintenance costs, and requires the use of lower cost hardware and software components from the market. This trend has a direct impact on selecting the target platforms for automation, where industrial computers with commercial real-time OS are pushed as alternatives to previously adopted dedicated boards or customized PLC. The migration away from dedicated hardware-based FT solutions imposes the adoption of new FT strategies to cope with dependability requirements, especially if the target platform does not offer FT capabilities off-the-shelf. The use of a distributed architecture is then the key issue for this migration, because this provides redundancy based on conventional hardware modules. The lack of built-in FT capabilities at each node can be compensated in this case by software-implemented FT strategies that fully exploit the redundancy.

This paper does not present a formal analysis of the FA. Section II focuses on ARIEL as a configuration-and-recovery language. Section III elaborates the experience with a primary substation automation system as a pilot application. Section IV explores the usefulness when the target applications are interconnected via a shared nondedicated network.

II. THE FA AND ARIEL

Component #1 (basic tools) and component #2 (backbone) of the FA are described in detail in [1]–[5]. This paper focuses on component #3, ARIEL. There are two major tasks in ARIEL:
• configuration of the instantiation of the FA (Section II-A),
• description of the recovery strategies (Section II-B).

A. ARIEL as Configuration Support Tool

Within ARIEL, the developer describes the configuration of the parametric basic tools and their integration in the application. A translator processes these ARIEL descriptions and issues header files defining configured objects and symbolic constants. These header files are to be compiled with the application. Three examples of tools configured in ARIEL follow in Sections II-A-1–II-A-3. Other ARIEL templates have been created to handle recovery blocks, N-modular redundancy, exception handling, and other well-known FT techniques [5].
1) Example 1: A software-implemented watchdog task (task10) can be configured in ARIEL by indicating the heartbeat period (100 ms), the task to be guarded (task14) and the task to be informed when an exception occurs (task18):

    WATCHDOG task10
      WATCHES task14
      HEARTBEAT 100 ms
      ON ERROR WARN task18
    END WATCHDOG

2) Example 2: ARIEL can be used to implement transparent task replication, and to indicate how voting is to be handled. The voting algorithm and the metric for comparison of the objects can be selected. Within ARIEL, one can include a timeout for a slow or missing voting party, and choose to:
• continue as soon as two of the three inputs are received, or
• wait until all three inputs are received or until the timeout has elapsed (which is the default option).

    REPLICATED task10 IS task101, task102, task103
      MULTICAST IS ATOMIC
      METHOD IS MODULAR REDUNDANCY
        VOTING ALGORITHM IS MAJORITY
        METRIC “int_cmp”
        TIMEOUT 1000 ms
      END METHOD
      ON SUCCESS task20
      ON ERROR task30
    END REPLICATED

3) Example 3: For retry-blocks, the input and the state of the calling task are transparently recorded in a recovery cache before the retry block is entered. These are restored when the acceptance test fails in order to re-execute the task based on the same input.

    RETRY task10
      TIMEOUT 100 ms
      ACCEPTANCE TEST task20
      RETRIES 3
      ON ERROR task30
    END RETRY

B. ARIEL as Recovery Language

As a recovery language, ARIEL forms an ancillary application layer that describes the recovery strategies to be executed when an error is detected. These strategies are specified at development time, as described in the example in Section II-B-1. Basically, the language allows querying the database of the backbone for the state of entities of the application, and attaching run-time actions to be executed on these entities if the condition is fulfilled, for example:

    IF -FAULTY task1 THEN RESTART node1

Such an entity can be a single task, a node, or a group of tasks. As such, one can query the database in the backbone to check whether an entity has been found in error, is running, has been restarted/rebooted, etc., and then perform recovery actions on it. The actions allow starting, terminating, isolating, or sending information to an entity. It is also possible to start an alternative task, to reset a node or link, to generate synchronization signals in order to trigger a reconfiguration, etc.

The database is filled by the backbone, which receives notifications from the library of basic tools (e.g., when an error is detected), or from the application itself (when a predefined checkpoint is reached in the application code); furthermore, the libraries for communication and task-management have been instrumented in such a way as to forward their return value to the database of the backbone. All this information can be queried in the ARIEL scripts; hence the recovery actions can depend on the results of these queries.

The power of ARIEL is its ability to describe local, distributed, or system-wide recovery strategies. The backbone takes care of passing the necessary information to other nodes in the system, and of initiating the recovery actions on the different nodes. Before run-time, a translator processes these ARIEL scripts, and produces binary recovery code to be compiled with the application. At run-time, the backbone executes these strategies devoted to error processing. These strategies are switched in either asynchronously (when an error is detected within the system by one of the basic error detection tools), or synchronously (when the user signals particular dynamic conditions like, for instance, a failed assertion, or when the control flow runs into user-defined breakpoints).

1) Example: As an example consider the script:

    IF [ -FAULTY task1 ]
    THEN
      STOP task1
      START task4
      WARN task2, task3
    FI

This script is a part of a three-and-a-spare system (a triple-modular redundant task with a standby component that can take over when one of the three replicas fails). Three such scripts describe the entire three-and-a-spare system. Several meta-characters allow the writing of more powerful scripts:
• Meta-character “*” refers to any entity in the system.
• Meta-character “$” can be used to refer, in a section, to an entity mentioned in the query; e.g., “STOP $2” means that the entity, which fulfills the condition in part 2 of the query, needs to be stopped.
• Meta-character “@” refers to the entity fulfilling the query.
• Meta-character “ ” refers to all those entities that do not fulfill the query.

Fig. 1 uses these meta-characters to describe the three-and-a-spare recovery strategy as a single if–then section, in which group1 is defined as the set of tasks {task1, task2, task3}.

Fig. 1. Recovery script for a three-and-a-spare configuration.
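The three-and-a-spare rule above can be mimicked outside ARIEL to make its effect concrete. The following Python sketch is only an illustration under stated assumptions: the `Backbone` class, its state names, and the `three_and_a_spare` function are hypothetical stand-ins for the backbone database and the recovery interpreter, and are not part of the FA or of the ARIEL implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical entity states; the real backbone database is richer.
RUNNING, FAULTY, STOPPED, STANDBY = "RUNNING", "FAULTY", "STOPPED", "STANDBY"


@dataclass
class Backbone:
    """Toy stand-in for the backbone: entity states plus a log of actions."""
    state: Dict[str, str]
    actions: List[Tuple[str, str]] = field(default_factory=list)

    def is_faulty(self, entity: str) -> bool:
        return self.state.get(entity) == FAULTY

    def stop(self, entity: str) -> None:
        self.state[entity] = STOPPED
        self.actions.append(("STOP", entity))

    def start(self, entity: str) -> None:
        self.state[entity] = RUNNING
        self.actions.append(("START", entity))

    def warn(self, entity: str) -> None:
        self.actions.append(("WARN", entity))


def three_and_a_spare(bb: Backbone, group: List[str], spare: str) -> None:
    """For each faulty task in the group: stop it, start the spare if it is
    still on standby, and warn the remaining group members."""
    for task in group:
        if not bb.is_faulty(task):
            continue
        bb.stop(task)
        if bb.state.get(spare) == STANDBY:
            bb.start(spare)
        for other in group:
            if other != task:
                bb.warn(other)


bb = Backbone({"task1": FAULTY, "task2": RUNNING,
               "task3": RUNNING, "task4": STANDBY})
three_and_a_spare(bb, ["task1", "task2", "task3"], spare="task4")
print(bb.actions)
```

The application tasks themselves do not appear in this function; only entity names and states do, which mirrors the separation of functional code from recovery strategy described in the text.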
Furthermore, the recovery strategies can be replaced whenever needed. For example, Fig. 2 shows that a different ARIEL script can allow the application to behave as a three-and-a-spare system or as a gracefully degrading set of tasks. In the latter case, the voting function has to be transparently modified from “2-out-of-3 majority voting” to “2-out-of-2 duplication with comparison.”

Fig. 2. Changing recovery strategies from three-and-a-spare to graceful-degradation.

Using ARIEL and the FA lets the developer address the (nonfunctional) aspects of application recovery (written in ARIEL) separately from those pertaining to the (functional) behavior that the application should have in the absence of faults (written in C or other programming languages). This divide-and-conquer approach allows the developer to tackle either of these two fronts more easily and effectively, e.g., to modify the recovery strategy with only a limited impact on the application, and vice versa. This results in better maintainability of the application, and in more flexibility.

ARIEL scripts can also be used to validate the FT behavior of an application by injecting software faults on the basis of a (partial) view of the global state of the system, as stored in the database of the backbone. This allows a developer to set up a fault injection system such as the one in [9].

ARIEL on its own does not provide a complete FT solution. The ARIEL recovery scripts have to be triggered by the error-detection mechanisms from the library of basic tools, or from the application, or from the platform. This implies that the coverage of the FT strategy driven by the ARIEL scripts cannot be higher than the coverage of the error-detection tools which trigger their execution. It is the task of the developer to:
• provide the ARIEL configuration parameters and recovery scripts that are appropriate for a given application on a given platform;
• assess whether the application-specific timing constraints are met under the worst-case execution times of the recovery strategies.

III. PILOT APPLICATION: A PRIMARY SUBSTATION AUTOMATION SYSTEM

This FA has been integrated in a PSAS: the embedded hardware and software in a substation that controls electricity distribution.
The PSAS requires protection, control, monitoring, and supervision capabilities, and it is representative of many applications with strict dependability requirements in the energy field [10], [11]. In an ongoing renewal of the infrastructure, the company decided to replace the dedicated hardware-based FT solutions in the PSAS by commercial distributed platforms (industrial computers and dedicated processing boards) running real-time OS such as VxWorks. This decision was motivated by the need for more functionality in the PSAS, where development of new dedicated (hardware-based) solutions was considered too expensive and not flexible enough in the deregulated electricity market. The required dependability level has to be reached by using the hardware redundancy in the distributed platform, combined with software-implemented FT solutions at middleware level. The need for adaptability to new situations and maintainability of the software was addressed using ARIEL.

Although software-based FT might have less coverage than hardware-based solutions, this was not considered prohibitive, because the physical (electrical, nonprogrammable) protection in the plant continues to act as a safeguard for noncovered faults. Besides, high-quality software engineering and on-site testing remain important in order not to introduce software design faults that could hamper the mission-critical functionality.

Therefore, based on requirements of operational divisions working with these automation applications, the company developed a systematic approach to dependability, involving the whole life-cycle of the application and including fault prevention, fault removal, fault forecasting, and FT. FT plays a central role, and is obtained through a layered system organization separating the application from the platform (hardware, OS, middleware) that provides FT. Formal and semi-formal languages are applied (favored by the availability of in-house developed tools [12], [13], [27], [28]) to the specification, design, and code generation of the automation system.
Qualitative and quantitative analyses of functional, real-time, and dependability aspects are supported by these tools.

A. The Pilot Application Description

The energy-distribution network provides the connection between the HV lines coming from generator plants and the HV substations on one hand, and the customers on the other. It is a meshed network and handles MV and LV in the range 12–15–20 kV. Its nodes of interconnection are primary substations (connected to HV lines and transforming and distributing energy to secondary substations and to MV customers) and secondary substations that transform and distribute energy to LV customers. A substation can be controlled locally and/or remotely.

Fig. 3 shows that a primary substation (gray lines) consists of switches, insulators, bus bars, transformers, capacitors and other electrical components. The PSAS architecture (black lines) consists of a controller (LCL, Local Control Level) and a number of Peripheral Units (PU) distributed on the plant. (The inherent electrical redundancy of the substation is not shown.) Each PU is associated with a component of the plant and provides for this component data collection, diagnostic information, control, primary and secondary protection levels, and, possibly, an interface to a local operator. The LCL provides functionality for the entire substation for control, monitoring, PU supervision and interfacing to local and remote operators. In the renewal program, this LCL (based on dedicated boards or customized PLC) is being replaced by a set of industrial computers (running the real-time OS VxWorks) and several co-processing-based dedicated boards running the micro-kernel TEX [8]. It is powered from the electricity network and has backup batteries to withstand power failures.

Fig. 3. Electrical schema (gray lines) of a primary substation, and the architecture (black lines) of its automation system.

The automation system is hierarchically organized. The various tasks (protection, control, monitoring) require different degrees of dependability depending, e.g., on their functionality, the possible impact of faults, and the cost of confinement. The major source of faults in the system is electromagnetic interference (EMI) [10] caused by the process itself (opening and closing of HV connections) in spite of the attention paid to design for electromagnetic compatibility. This results in errors in the communication, computation, and memory subsystems.

B. FA-Based Modules

As an example of the FA in Section II, a software module has been designed that implements stabilizing memory [14], a mechanism combining physical with temporal redundancy (and with several protocols) to recover from transient faults affecting memory or computation, and to prevent incorrect output to the field. Compared with a previous solution relying exclusively on dedicated hardware boards, this software implementation of the stable memory module supports better maintainability. For example, it is possible to set parameters such as the size of the stabilizing memory, the number of physically redundant copies, and the number of temporally redundant copies. The developer can also modify the allocation of the physically distributed copies to the available resources. ARIEL sets all these configuration parameters as well as the recovery actions to be taken when an error is detected. The additional flexibility offered by these recovery strategies allows, e.g., for reconfiguration by re-allocating the distributed copies to nonfailed components when a fault occurs. The recovery strategies are not hard-coded in the application code, but are specified in ARIEL as an ancillary application layer, and executed by the backbone that interacts with the modules. Because the interface to the dedicated board and to the software module is identical, the complexity for the developer is equivalent in both implementations.

The FA-based implementation of the stabilizing memory module meets the real-time requirements of the application, which are in the range from 50 ms to 500 ms. When the module was deployed in an operational PSAS for testing, subject to the electromagnetic interference from the opening and closing of breakers and switches, no incorrect output was identified during the test period, while the log files indicated that the stabilizing memory module functioned correctly and masked the introduced errors.
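To illustrate how physical and temporal redundancy can be combined, the sketch below votes over redundant copies in each cycle and releases a value to the output only after the voted result has been identical for a configurable number of consecutive cycles. It is a simplified illustration written for this text, not the protocol of [14]; the class name `StabilizingCell` and its parameters `copies` and `rounds` are assumptions.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


class StabilizingCell:
    """Toy stabilizing-memory cell (hypothetical, for illustration only).

    copies: number of physically redundant copies voted on in each cycle.
    rounds: number of consecutive cycles a voted value must persist
            before it is released as the output.
    """

    def __init__(self, copies: int = 3, rounds: int = 2,
                 initial: Optional[Hashable] = None) -> None:
        self.copies = copies
        self.rounds = rounds
        self.output = initial
        self._candidate: Optional[Hashable] = None
        self._streak = 0

    def cycle(self, replicas: Sequence[Hashable]) -> Optional[Hashable]:
        """Process one cycle of replica values and return the stable output."""
        if len(replicas) != self.copies:
            raise ValueError("unexpected number of replicas")
        value, votes = Counter(replicas).most_common(1)[0]
        if votes * 2 <= self.copies:
            # No strict majority: discard the cycle, keep the old output.
            self._candidate, self._streak = None, 0
            return self.output
        if value == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = value, 1
        if self._streak >= self.rounds:
            self.output = value
        return self.output


cell = StabilizingCell(copies=3, rounds=2, initial=0)
outputs = [
    cell.cycle([1, 1, 1]),  # new value seen once: output unchanged
    cell.cycle([1, 9, 1]),  # one corrupted copy is outvoted; value now stable
    cell.cycle([7, 7, 7]),  # a one-cycle transient does not reach the output
    cell.cycle([1, 1, 1]),
]
print(outputs)
```

Raising `rounds` trades latency for stronger filtering of transients, which is the kind of parameter the text describes as configurable through ARIEL.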
A second module implemented according to the FA is the redundant-watchdog module. It consists of several (e.g., 3) software-implemented watchdog tasks that are distributed and supervise a given application process. ARIEL allows experimenting with various configurations of this module, according to the application requirements:
• The OR-strategy triggers the alarm as soon as any of the watchdogs expires. This tolerates the case in which up to two watchdogs have crashed, or are faulty, or are unreachable. This reduces the probability that a missing heartbeat message goes undetected; hence it can be regarded as a safety-first strategy. At the same time, the probability of false alarms is increased.
• The AND-strategy requires that all the watchdogs reach consensus before triggering the system alarm. It decreases the probability of false alarms but at the same time decreases the error detection coverage of the watchdog task. It can be regarded as an availability-first strategy.
• The 2-out-of-3 strategy requires that a majority of watchdogs expire before the system alarm is executed. Intuitively, this corresponds to a tradeoff between the AND and OR strategies.
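The three strategies differ only in how the individual watchdog verdicts are combined. The following Python sketch makes that explicit; the function names and the `STRATEGIES` table are illustrative assumptions and do not reflect how the redundant-watchdog module is actually implemented or configured in ARIEL.

```python
from typing import Callable, Dict, Sequence

# A strategy maps the per-watchdog "expired" flags to an alarm decision.
Strategy = Callable[[Sequence[bool]], bool]


def or_strategy(expired: Sequence[bool]) -> bool:
    """Alarm as soon as any watchdog has expired (safety first)."""
    return any(expired)


def and_strategy(expired: Sequence[bool]) -> bool:
    """Alarm only if every watchdog has expired (availability first)."""
    return all(expired)


def k_out_of_n(k: int) -> Strategy:
    """Alarm when at least k watchdogs have expired."""
    def strategy(expired: Sequence[bool]) -> bool:
        return sum(expired) >= k
    return strategy


STRATEGIES: Dict[str, Strategy] = {
    "OR": or_strategy,
    "AND": and_strategy,
    "2-out-of-3": k_out_of_n(2),
}


def evaluate(expired: Sequence[bool]) -> Dict[str, bool]:
    """Return the alarm decision of every strategy for one observation."""
    return {name: decide(expired) for name, decide in STRATEGIES.items()}


one_expired = evaluate([True, False, False])
two_expired = evaluate([True, True, False])
all_expired = evaluate([True, True, True])
print(one_expired, two_expired, all_expired, sep="\n")
```

An observation such as `[True, False, False]` is consistent both with a missed heartbeat seen by the only surviving watchdog and with a false alarm from a single faulty watchdog, which is why the choice among the strategies is a safety versus availability decision.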

C. The FA and PSAS

Applying this FA to single modules (stabilizing memory and redundant watchdog) as well as to the entire PSAS application confirmed the improvement in the maintainability of the application. In the traditional methodology with dedicated hardware solutions, every change in the environment (larger system, additional functionality) resulted in a different implementation of the dedicated solution, or in adaptations of the existing LCL controller hardware. By using ARIEL and the FA, the configuration of the FA elements and the recovery strategies themselves can be adapted without major modifications to the application code. Analogously, if the functional aspects of the application need to be modified, this does not necessarily interfere with the FT strategies. For the company, integrating the FA in the PSAS met the primary objectives of increased flexibility and maintainability, while the application continued to fulfill its requirements in terms of functionality, timing, and dependability.

IV. EXTENSION OF THE FA TO INTERCONNECTED DISTRIBUTED SYSTEMS

Distributed embedded automation systems are becoming more and more interconnected with each other via nondedicated networks. For example, all PSAS that are spread over a country or region are interconnected to allow load balancing and orchestrated reactions to a partial breakdown of the electricity distribution or to local overloads. In this context, one can take advantage of modeling these systems at 2 levels (see Fig. 4 [15], [16]):
• Intra-site level: this corresponds to the distributed embedded application whose nodes are connected via a local area network, or via dedicated point-to-point connections. This intra-site network is used only by this application, and the application has complete control over it. This network also provides real-time support to the application.
• Inter-site level: this interconnects the local systems via a nondedicated network (e.g., the Internet) that is not under control of the application, and that is shared with other applications. This inter-site network is mainly used for nonreal-time communication; however, current and future industrial demands impose quality-of-service or (soft) real-time requirements on this inter-site communication.

Fig. 4. Architecture of the target (distributed) system. Application A runs on 3 sites (X, Y, Z) as a distributed real-time application. The interconnection among the various parts of application A takes place via a nonreal-time network (Internet-like) that is also used by other applications B, C.

This inter-site network allows introducing new cost-saving features into the applications, such as:
• Remote diagnosis of local sites via a public network (for instance by step-by-step execution of a process guided via visual feedback over the inter-site network).
• Remote maintenance of embedded systems (software updates or upgrades of system modules without shutting down the entire local distributed system and while still providing partial services).
• Remote control of embedded systems over nondedicated inter-site connections, if a certain quality-of-service can be guaranteed by the inter-site communication system.

These interconnected distributed embedded automation applications are not only subject to classical (physical) hardware faults affecting parts of the intra-site system or the inter-site connection system, leading to unavailability of computing nodes, or of parts of the network. They can also suffer from malicious faults (intrusions) affecting the inter-site connections, which can cause network unavailability (e.g., denial-of-service attacks) or which endanger the integrity of the data, etc. The presence of other applications that use the same inter-site network results in a dynamic environment, leading to bandwidth reduction or nondeterminism.

We conjecture that ARIEL and the FA provide a powerful way to deal with these interconnected distributed applications, thanks to the separation of the functional aspects of the application from the recovery strategies that are to be executed when an anomaly is detected. This eases maintainability by adapting FT strategies depending on the environment with only minor modifications to the application source code. Specifically for the dynamic environment, one can modify the recovery strategies dynamically, by providing various recovery scripts that correspond to various situations in the environment. (For example, one would want to increase the security of inter-site communication in case of attacks, at the cost of a higher performance overhead.)

V. RELATED WORK AND FUTURE STEPS

ARIEL and FA borrow several ideas from existing research and implementations.
For example, the suitability of libraries of software-implemented FT solutions to improve the dependability of distributed applications is shown in [17]–[20]. The middleware approach toward FT gained much support recently [21]–[23]. The concept of de-coupling the functional application aspects from the nonfunctional ones concerning FT, is also present in the meta-object approach, where a call to a method in

164

IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 2, JUNE 2002

an object-oriented application is trapped in order to implement transparently some FT function [24], [25]. The presented FA combines the advantages of software-implemented FT, via a library of functions, with the decoupling of the meta-object approach (but without requiring object orientation) by specifying recovery actions as a sort of ancillary application layer. In addition, ARIEL allows for a concise description of distributed actions, especially for expressing recovery strategies.

The FA is three-tiered, comprising a user library of basic FT mechanisms, a control backbone, and a high-level configuration-and-recovery language. It integrates software-implemented mechanisms into distributed embedded systems, exploiting the available hardware redundancy. Software-implemented FT might need to be complemented by other approaches or techniques on lower levels (hardware or OS), e.g., to meet hard real-time requirements [26], and/or by application-specific mechanisms.

Currently, our research concentrates on collecting dependability data from the deployed configuration and on the challenges posed by the inter-site connection, as well as on a detailed quantitative and qualitative comparison with existing approaches.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and anonymous referees for their useful comments.

REFERENCES

[1] G. Deconinck, V. De Florio, and O. Botti, “Separating recovery strategies from application functionality: Experiences with a framework approach,” in Proc. 2001 Ann. Reliability & Maintainability Symp., pp. 246–251.
[2] G. Deconinck, V. De Florio, R. Lauwereins, and R. Belmans, “A software library, a control backbone and user-specified recovery strategies to enhance the dependability of embedded systems,” in Proc. 25th EUROMICRO Conf. (EuroMicro’1999), Workshop on Dependable Computing Syst., vol. II, 1999, pp. 98–104.
[3] V. De Florio, G. Deconinck, and R. Lauwereins, “Software tool combining fault masking with user-defined recovery strategies,” IEE Proc. Software, Special Issue on Dependable Computing Syst., vol. 145, no. 6, pp. 203–211, Dec. 1998.
[4] G. Deconinck, V. De Florio, T. Varvarigou, and E. Verentziotis, “The EFTOS approach to dependability in embedded supercomputing,” IEEE Trans. Reliability, vol. 51, pp. 76–90, Mar. 2002.
[5] V. De Florio, “A fault tolerance linguistic structure for distributed applications,” Ph.D. dissertation, Katholieke Universiteit Leuven, Belgium, Oct. 2000.
[6] F. Cristian and C. Fetzer, “The timed asynchronous distributed system model,” IEEE Trans. Parallel Distributed Syst., vol. 10, pp. 642–657, June 1999.
[7] (1999) Virtuoso 4.0 User Manual. Eonic Systems, Aarschot, Belgium. [Online]. Available: http://www.eonic.com
[8] (1997) TEX User Manual. TXT Ingegneria Informatica, Milan, Italy. [Online]. Available: http://www.txt.it
[9] M. Cukier, R. Chandra, D. Henke, et al., “Fault injection based on a partial view of the global state of a distributed system,” in Proc. 18th Symp. Reliable Distributed Systems (SRDS-18), 1999, pp. 168–177.
[10] R. Gargiuli, P. G. Mirandola, et al., “ENEL approach to computer supervisory remote control of electric power distribution network,” in Proc. 6th IEE Int’l Conf. Electricity Distribution (CIRED’1981), 1981, pp. 187–192.
[11] R. Meda, A. Bertani, P. Colombo, et al., “Il sistema di protezione e controllo della cabina primaria,” (in Italian), ENEL Internal Draft, Feb. 1999.
[12] F. Maestri, R. Meda, and G. L. Redaelli, “Un ambiente di sviluppo di funzioni applicative strutturate per sistemi di automazione di impianti ENEL,” ANIPLA (Associazione Nazionale Italiana Per l’Automazione), Milan, Italy, technical report, June 1997.
[13] A. Moro, “Traduttore delle reti ASFA,” Master’s degree thesis (in Italian), Politecnico di Milano, Italy, 1998.
[14] G. Deconinck, O. Botti, F. Cassinari, et al., “Stable memory in substation automation: A case study,” in Digest of Papers of 28th Ann. Int. Symp. Fault-Tolerant Computing (FTCS-28), IEEE Comp. Soc. Press, June 1998, pp. 452–457.
[15] G. Deconinck and R. Lauwereins, “Dependability for distributed embedded automation systems in dynamic environments,” in Proc. 3rd Information Survivability Workshop (ISW2000), IEEE Comp. Soc. Press, Oct. 2000, pp. 47–50.
[16] G. Deconinck, V. De Florio, G. Dondossola, et al., “Distributed embedded automation systems: dynamic environments and dependability,” in Supplement of Int. Conf. Dependable Systems and Networks (DSN2001, Special Track: European Dependability Initiative), July 2001, pp. D16–D19.
[17] Y. Huang and C. M. R. Kintala, “Software fault tolerance in the application layer,” in Software Fault Tolerance, M. Lyu, Ed. New York: Wiley, 1995.
[18] M. R. Lyu, Ed., Handbook of Software Reliability Engineering. New York: McGraw-Hill, 1995.
[19] B. Randell, J.-C. Laprie, H. Kopetz, and B. Littlewood, Eds., ESPRIT Basic Research Series: Predictably Dependable Computing Systems. Springer-Verlag, 1995.
[20] Y. M. Wang, Y. Huang, K. P. Vo, et al., “Checkpointing and its applications,” in Proc. 25th Int. Symp. Fault-Tolerant Computing (FTCS-25), 1995.
[21] Z. T. Kalbarczyk, R. K. Iyer, S. Bagchi, and K. Whisnant, “Chameleon: A software infrastructure for adaptive fault tolerance,” IEEE Trans. Parallel Distributed Syst., vol. 10, pp. 560–579, June 1999.
[22] K. H. Kim, “ROAFTS: A middleware architecture for real-time object-oriented adaptive fault tolerance support,” in Proc. HASE 1998 (IEEE CS 1998 High-Assurance Systems Engineering Symp.), 1998, pp. 50–57.
[23] M. Cukier, J. Ren, C. Sabnis, et al., “AQUA: An adaptive architecture that provides dependable distributed objects,” in Proc. 17th Symp. Reliable and Distributed Syst. (SRDS-17), 1998, pp. 245–253.
[24] J. C. Fabre, V. Nicomette, T. Pérennou, et al., “Implementing fault-tolerant applications using reflective object-oriented programming,” in Proc. 25th Int. Symp. Fault-Tolerant Computing (FTCS-25), 1995, pp. 489–498.
[25] G. Kiczales, J. des Rivières, and D. G. Bobrow, The Art of the Metaobject Protocol. The MIT Press, 1991.
[26] D. Powell, J. Arlat, L. Beus-Dukic, et al., “GUARDS: A generic upgradeable architecture for real-time dependable systems,” IEEE Trans. Parallel Distributed Syst., vol. 10, pp. 580–597, June 1999.
[27] F. Maestri, R. Meda, and G. L. Redaelli, “La progettazione del software di controllo per l’automazione di impianti,” (in Italian), Milan, Italy, June 1997.
[28] F. Maestri, R. Meda, and G. L. Redaelli, “La progettazione del software di controllo per l’automazione di impianti,” (in Italian), Automazione e Strumentazione, Milan, Italy, Dec. 1997.

Geert Deconinck (SM’00) received the M.Sc. degree (1991) in electrical engineering and the Ph.D. degree (1996) in applied sciences from the Katholieke Universiteit Leuven (K.U. Leuven), Belgium. He is a Postdoctoral Fellow of the Fund for Scientific Research—Flanders (Belgium). He is working at the ESAT-Department of Electrical Engineering, K.U. Leuven, where he is also a visiting professor. His research interests include the design, analysis, and assessment of software-based fault-tolerance solutions to meet real-time, dependability, and cost constraints for distributed embedded systems. In this field, he has authored and coauthored about 50 publications in international journals and conference proceedings. In 1995–1997, he received a grant from the Flemish Institute for the Promotion of Scientific–Technological Research in Industry (IWT). Dr. Deconinck is a Certified Reliability Engineer (ASQ), a member of the Royal Flemish Engineering Society, and a member of the IEEE Reliability Society and the IEEE Computer Society.


Vincenzo De Florio received the Laurea degree from the University of Bari, Italy, and the Ph.D. degree (2000) in applied sciences from the K.U. Leuven, Belgium. He was a researcher and tutor with the School for Advanced Studies in Industrial and Applied Mathematics in Tecnopolis Novus Ortus Science Park (Italy) until 1996. He then joined the research on fault tolerance at the Department of Electrical Engineering (ESAT) of the K.U. Leuven, where he holds a postdoctoral research position. His research interests include software fault-tolerance algorithms and methods for parallel and distributed applications. He is the author or coauthor of more than 25 papers published in international journals or conference proceedings.


Oliver Botti graduated in computer science from the University of Milan in 1991. He joined ENEL R&D in 1992 as a researcher working in internal and cooperative projects in software engineering, addressing most of the steps of the system life-cycle, from formal specification to design, development, and evaluation. He has been Project Manager of several ESPRIT projects concerning HPCN, performance evaluation, and fault tolerance issues. At CESI he is responsible for two large R&D projects addressing: 1) development of techniques and tools for fault prevention and removal, covering the whole life-cycle of automation systems, and 2) development of novel techniques and tools for embedding fault-tolerance capabilities in dependable automation applications. He is the author or coauthor of more than 30 papers published in international journals or conference proceedings.