Design of embedded control systems

Matjaž Colnarič
University of Maribor, Faculty of Electrical Engineering and Computer Science
Smetanova 17, SI-2000 Maribor, Slovenia
[email protected]

Abstract—The paper presents some design guidelines, supported by a case study of the design of a fault-tolerant computer control system. The system was devised and implemented as a laboratory platform for the 5th framework project IFATIS. Instead of relying on redundancy, the project proposes to deal with faults by dynamic reconfiguration of the resources that remain viable after an incident. The goal of the design presented in this paper is to provide a consistent, sophisticated distributed control system that is tolerant to faults. First, a general outline of the IFATIS architecture is given. Based on it, a hardware platform has been designed and implemented, together with the system software, application support and design software. The idea is to deal with faults on the lowest level and resolve them there if possible, in order to prevent the higher-level mechanisms from having to employ more costly measures.

Keywords—Hard real-time systems, embedded systems, fault tolerance, high-integrity requirements, safety-related systems, peripheral interfaces.

1 Introduction

Computer control systems are increasingly being used in safety-critical applications whose integrity depends substantially on them. Any failure could have severe consequences: it may result in massive material loss or endanger human safety. Dependability, although an inherent requirement in the design of control systems, is usually insufficiently supported. Techniques intended to improve dependability are predominantly based on testing, and the quality achieved depends mainly on the experience and intuition of the designers and testers. This approach, however, is inadequate; such measures need to be built into the design methodologies already in the early phases. The usual measure to enhance dependability is to make the system tolerant to faults.


As the state of the art, this is predominantly achieved by redundant resources. That, however, has some disadvantages, the most obvious one being the cost. In the project IFATIS [2], funded by the EU since 2002 within the 5th framework programme, the robustness of control systems is achieved by dynamic reconfiguration in the case of faults. To provide for that, faults must be identified as early as possible and must be isolated to prevent their propagation. These are also the main goals of the architecture proposed in this paper. We concentrate on the low-level solutions in the platform design: the idea is to deal with faults on the lowest levels and resolve them there if possible, in order to prevent the higher-level mechanisms from having to employ more costly measures. First, in Section 2 the general outline of a control cell as proposed by the IFATIS project is given, providing the basic architecture for the platform. In Section 3 the hardware aspect is elaborated in more detail, with special emphasis on the fault tolerance of the peripheral units. Software design issues follow in Section 4, elaborated in less detail.

2 System implementation of the FTC/FTM cell

In this section, the global architecture of the fault-tolerant control cell (FTC), as proposed in [4], is elaborated. Particular attention is devoted to input and output data consistency: based on it, the integrity of the function can be evaluated. Further, in the refinement of the model we already considered the specific implementation of the hardware platform. In [4] it was proposed that the FTC module consist of control functions, a function monitor (FM), a local reconfiguration and mode controller (LRMC) and a resource needs catalogue. Figure 1 gives a more elaborate scheme of the FTC, in which the process data transfer path has been further refined. The process data pass through the input data mapping block, pre-evaluation, the control functions (which represent the actual application programs), post-evaluation, and the output data mapping block.


Fig.1: Fault tolerant control (FTC) block model

The function monitor (FM) checks for abnormalities that may emerge in the pre- or post-evaluation blocks or in the control functions themselves, based on anticipated behaviour or other prescribed rules. If any are detected, they are reported to the local reconfiguration and mode controller (LRMC), which then decides on a possible reconfiguration of the blocks in the process data path. If it determines that no alternative configuration is possible with the remaining parts, the incident is reported to the higher-level group/global resource reconfiguration manager (GRRM), which assigns the tasks to other FTCs. Details may be found in [4]. Most of the blocks shown represent software modules in the form of OS kernel system tasks running in a multitasking environment. Some, however, can be physically implemented as independent HW units in the form of FPGAs, e.g. the input and output data mapping units. In the following, the blocks of the process data path are further elaborated, since they impose the requirements for the HW and SW platform design and implementation; the low-level fault tolerance measures are also employed there.
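To make the data path concrete, the following C sketch chains the five blocks of Figure 1 on a single processing unit. All types, function names and stub bodies (process_sample, ftc_status_t, the dummy control law) are hypothetical illustrations, not the IFATIS code itself, in which some of these blocks live in FPGA firmware rather than software.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical status codes reported towards the LRMC on failure. */
typedef enum { FTC_OK, FTC_INPUT_FAULT, FTC_OUTPUT_FAULT } ftc_status_t;
typedef struct { float value; bool valid; } sample_t;

/* Stub blocks of the process data path; in the platform the mapping
 * blocks may be realised in FPGA firmware instead of software. */
static sample_t input_mapping(void)    { return (sample_t){ 1.0f, true }; }
static bool pre_evaluate(sample_t *s)  { return s->valid; }
static float control_function(float u) { return 0.5f * u; }  /* dummy law */
static bool post_evaluate(float y)     { return y > -10.0f && y < 10.0f; }
static void output_mapping(float y)    { printf("out: %f\n", y); }

static ftc_status_t process_sample(void)
{
    sample_t s = input_mapping();
    if (!pre_evaluate(&s))
        return FTC_INPUT_FAULT;   /* reported to the LRMC */

    float y = control_function(s.value);
    if (!post_evaluate(y))
        return FTC_OUTPUT_FAULT;  /* LRMC may reconfigure the path */

    output_mapping(y);
    return FTC_OK;
}

int main(void) { return process_sample() == FTC_OK ? 0 : 1; }
```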

2.1 Input data mapping

The function of the input data mapping module is to accept the HW signals, communication or system-internal data coming from outside the module, and convert them into a unified format suitable for further processing. Apart from that, within the data mapping unit it is possible to choose different peripherals to provide the same input information in the case of faults detected by the FM, based on a request from the LRMC. Because of the inherent simplicity of its functions, this module could alternatively be implemented in HW using field programmable gate arrays or similar programmable devices. The advantage is that such an implementation is faster, more robust and easily verifiable.
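As an illustration, a minimal C sketch of such a mapping unit follows; the register layout, the Q15 target format and all names are assumptions made for the example, since the actual unit is realised in FPGA firmware.

```c
#include <stdint.h>

/* Hypothetical register map of a redundant input channel pair.
 * The LRMC writes 'active' to switch sources after the FM reports
 * a fault on the currently selected sensor. */
typedef struct {
    volatile uint16_t raw[2];  /* readings of two redundant sensors  */
    volatile int      active;  /* source currently selected (0 or 1) */
} in_channel_t;

/* Convert the selected raw reading into a unified fixed-point format
 * expected by the pre-evaluation block (here: 10-bit ADC -> Q15). */
static int16_t input_map(const in_channel_t *ch)
{
    uint16_t raw = ch->raw[ch->active];
    return (int16_t)((raw & 0x03FF) << 5);
}
```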

2.2 Data pre-evaluation

Once the input data are converted into a suitable form, they are pre-evaluated. The purpose of this module is to check and assure the high integrity of the data. The advantage of physically separating the evaluation from the control functions is that the latter do not need to be aware of which of the redundant inputs the data are coming from, or whether they have been obtained from an observer in case of problems. This can be done by SW functions or, in simpler cases, even by HW means (see above). The latter solution is again faster, more robust and inherently parallel with the operation of other functions, but cannot handle complex evaluation functions. The evaluation of data utilises techniques which are described in detail in Section 3.4. If the evaluation fails, two actions are possible: either to derive the inconsistent data from other information (e.g. by observers), or to report the problem to the LRMC. In the former case, the control function does not have to sense any input data problems. However, this solution will most likely provide lower-quality information (e.g. extrapolation of a series of analogue values read) and can only be tolerated for a short time; if the problem persists, it should be reported to the LRMC. In the latter case, the LRMC can possibly handle the problem within the input data mapping by selecting an alternative or redundant sensor, or it can reconfigure the control functions to tolerate the missing data. If this is not possible, the report goes further to the GRRM.
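A minimal sketch of this fallback policy, assuming a simple linear extrapolation as the observer and a hypothetical reporting hook, could look as follows.

```c
#include <stdbool.h>

#define MAX_FALLBACK_CYCLES 5   /* extrapolation tolerated only briefly */

static float last_good;         /* previous consistent reading          */
static float last_slope;        /* its rate of change per period        */
static int   fallback_count;

/* Returns the value to feed the control function: the input itself if
 * it passed the consistency checks, otherwise an extrapolated value;
 * a persistent fault is escalated through the supplied hook. */
static float pre_evaluate(float in, bool consistent,
                          void (*report_to_lrmc)(void))
{
    if (consistent) {
        last_slope = in - last_good;
        last_good  = in;
        fallback_count = 0;
        return in;
    }
    /* Derive the missing value from past behaviour (a crude observer). */
    float derived = last_good + last_slope;
    last_good = derived;
    if (++fallback_count > MAX_FALLBACK_CYCLES)
        report_to_lrmc();       /* persistent fault: ask for reconfiguration */
    return derived;
}
```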

2.3 Control functions

It is expected that the control functions will be implemented in SW, although it is also possible to employ parallel HW-based automata as a last resort for extremely safety-critical applications (SIL4, [1]) to bring the system into a safe state. The control function is selected by the LRMC. During execution, it may itself check its own internal states or employ some other consistency checking, such as functionality checking, or use the watchdog technique to identify deadlocks or other software errors. If a problem is detected, it is reported back to the LRMC.

2.4 Post-evaluation

The results of the control functions can again be wrong due to different causes, such as unidentified wrong data, bad algorithms, or software errors. In order to prevent the propagation of these faults, the results of the control functions are post-evaluated. As a method, consistency checking can be used: if the values or dynamics of the outputs can be anticipated, they may be checked. Also, if the control functions are redundant and diversely implemented (N-version programming), the post-evaluation module can do pair checking, or even serve as a voter if there is triple modular redundancy. In the case of an error, the LRMC can select another control function, different inputs, or similar. In the worst case, it reports the unsolved problem to the GRRM.
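The pair checking and voting just described could be sketched in C as follows; the tolerance EPS and the averaging of matching replicas are illustrative choices, not the project's actual policy.

```c
#include <math.h>
#include <stdbool.h>

#define EPS 1e-3f  /* tolerated deviation between diverse replicas */

/* Pair checking: flag a fault if two diversely implemented control
 * functions disagree beyond the tolerance. */
static bool pair_check(float a, float b) { return fabsf(a - b) <= EPS; }

/* Majority voter for triple modular redundancy: return a value
 * confirmed by at least two replicas; *ok is false if all disagree. */
static float tmr_vote(float a, float b, float c, bool *ok)
{
    *ok = true;
    if (fabsf(a - b) <= EPS) return 0.5f * (a + b);
    if (fabsf(a - c) <= EPS) return 0.5f * (a + c);
    if (fabsf(b - c) <= EPS) return 0.5f * (b + c);
    *ok = false;               /* no majority: report to the LRMC */
    return a;
}
```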

2.5 Output data mapping

Similarly to the input side, the output data mapping converts the outgoing data into the form needed for the physical transfer to the peripherals. Again, some consistency checking can be done, such as loop-back testing: the receiving peripheral unit is requested to re-submit the received data to the sender for safety checking. In the case of problems, the LRMC (or even the output data mapping unit itself) may select different data output channels.
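A sketch of the loop-back idea, with the physical channel reduced to a stub variable, is given below; the transport functions are placeholders for the actual TTCAN exchange between the output mapping unit and the peripheral.

```c
#include <stdbool.h>
#include <stdint.h>

static uint16_t wire;  /* stands in for the physical channel */

static void send_to_peripheral(uint16_t v)      { wire = v; }
static uint16_t read_back_from_peripheral(void) { return wire; }

/* Loop-back test: the peripheral re-submits what it received, and the
 * sender compares it with the original before the value is applied. */
static bool output_with_loopback(uint16_t value)
{
    send_to_peripheral(value);
    return read_back_from_peripheral() == value;  /* false -> LRMC */
}
```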

3 Hardware architecture concepts and implementation of the IFATIS control system

The logical application model as given in [4] is mapped onto a hardware implementation designed in the form of an essentially symmetrical distributed multiprocessor architecture. An example is shown in Figure 2. In the prototype, the GRRMs are implemented on a PC; however, they could equally run on the platform itself. An FTC function can be mapped onto one or more processing units, sensors and actuators (FTC1), or several FTCs can share a processing unit (FTC2 and FTC3), which is particularly convenient in the case of tightly coupled partial processes. An FTC can also share a processing unit with a resource monitor, which in that case has immediate access to its internal states.

Fig.2: Distributed multiprocessor architecture

To provide for the necessary system integrity, the mapping is not static but can be reconfigured in the case of faults. Control functions, as well as monitor functions and the other blocks of the FTC functions and resource monitors, are executed on resources that have been proven faultless. The reconfiguration is managed by higher-level configuration managers based on the information from the function and resource monitors. Alternative or redundant resources are employed, if available, or the system's performance is degraded gracefully in order to survive the situation. On the higher level of the hardware architecture, the dedicated group and global resource and reconfiguration managers (GRRMs) control the overall structure and configuration of the control system. Over the communication network they receive state information from the processing units and resource monitors and issue the commands for their reconfiguration (D, N and M from [4]). In the hardware design, special care was taken in selecting proper processing and peripheral devices. The processor modules are implemented on the basis of the Texas Instruments TMS320 family of signal processors and additional custom-built communication modules with XILINX FPGAs. There is also a dedicated peripheral module supporting the specific fault-tolerance measures in peripheral data transfer. Although this was not done in the laboratory prototype for simplicity reasons, to prevent common-mode failures the redundant resources should be designed diversely, offering also different levels of quality of service. Generally, the simpler systems should provide higher integrity. For ultimate safety, simple automata (even reaching SIL4) could also be available as a last resort to provide for fail-safe behaviour in the case when the system encounters a non-recoverable failure. In order to provide better flexibility, the system software on the processing resources should support multitasking: e.g., if an FTC function is mapped onto a single processing unit, the latter should execute the control and monitoring functions in parallel. For simpler systems, even more FTC functions and/or resource monitors may be mapped jointly onto the same physical unit. In the following sections, the implementation of the communication system, the processor and the peripheral units is described in detail.

3.1 Communication system

As shown in Figure 2, there are two logical levels of interconnection: the lower level (solid line), connecting the FTC/FTM functions running on the microprocessors or microcontrollers, the universal peripheral interfaces, individual sensors, actuators and resource monitors; and the higher level (dotted line), which manages them. The two levels of communication are physically implemented on the same field bus system, using the TTCAN technology. It connects different and possibly diversely designed processing resources and process peripherals (such as sensors and actuators). The GRRM running on the PC is connected via a parallel link to one processor, which then distributes the messages to and from the other processor boards (see Section 4 on system software). Based on the widely accepted CAN bus, TTCAN [5] was selected for the implementation of the data transfer for several reasons. First, unlike the basic CAN protocol, it provides deterministic message delivery times. Further, the built-in protocol checking can detect certain failures of communication partners or of the media itself. If a failure on a communication channel is detected, a reconfiguration is invoked. To allow for that, a dual TTCAN bus is implemented. The two TTCAN channels serve for load sharing in normal operation. In the case of a failure of one channel, the system may be reconfigured in such a way as to reroute all communication onto the remaining channel; it may be necessary to reduce the number of messages in order to comply with the throughput of the single channel. A further measure to deal with a potential central point of failure is built into the protocol itself. In the TTCAN protocol, one node becomes the time master; its role is to periodically submit reference messages that trigger the communication cycles on which the operation of the whole system is synchronised. The failure of such a time master would present a central point of failure. To deal with that, the TTCAN protocol provides a mechanism of competition for becoming the time master, which is automatically invoked if the current one fails.
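The channel-selection policy described above might be sketched as follows; the ID-based load-sharing rule and the shedding of non-urgent traffic are assumptions made for illustration, not the platform's actual scheduling.

```c
#include <stdbool.h>
#include <stdint.h>

enum { CH_A = 0, CH_B = 1 };

typedef struct {
    bool alive[2];   /* set by the drivers from protocol error detection */
} bus_state_t;

/* Select the channel for a message: share the load over both intact
 * channels; on a channel failure, reroute onto the surviving one and
 * shed low-priority traffic to respect the reduced throughput. */
static int select_channel(const bus_state_t *bus, uint32_t msg_id,
                          bool urgent)
{
    if (bus->alive[CH_A] && bus->alive[CH_B])
        return (msg_id & 1) ? CH_B : CH_A;  /* load sharing          */
    if (!urgent)
        return -1;                          /* drop non-urgent traffic */
    return bus->alive[CH_A] ? CH_A : CH_B;  /* single-channel fallback */
}
```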

3.2 Processing unit

The processing unit of the IFATIS laboratory prototype platform is based on the Texas Instruments C6711 DSK boards [3] with the TMS320C6711 DSP. In order to provide the necessary low-level functionality (communication, enhanced fault tolerance, modularity, increased performance), a daughter board has been developed; its block scheme is given in Figure 3. Technically, it is a piggy-back unit plugged in and attached with screws to the motherboard. The function of the FPGA chip is to provide the interface between the processor board and the peripheral interfaces (mainly the TTCAN communication channels). Certain fault detection and isolation functions have been implemented in the firmware, and some reconfiguration has been enabled. Two TTCAN interfaces implemented with the BOSCH TTCAN EC rev. 2.0 [5] allow for fault-tolerant communication. In normal operating mode, when both channels are intact, the communication load is shared between them. In the presence of a fault on one channel, the communication requirements are limited to the urgent messages and all data transfer proceeds on the single channel. There is also data memory in the form of a 512 kbit EEPROM attached to the XILINX. It is designed to contain various data to be used, e.g., for the automatic (re-)initialisation of the communication interfaces and for other functions performed directly within the firmware. By the latter, the main processor is relieved of these overhead tasks. Since they are not implemented by programming means, they can be more rigorously verified with the usual hardware design and verification means.


Fig.3: Schematic of the processor unit

The daughter board also provides an interface to the outside world over JTAG. The processing units have their JTAG ports connected in a daisy chain. During the development and debugging phases, a PC can load the configuration data into the configuration flash PROM, or it can perform some monitoring functions.

3.3 Peripheral unit

As can be seen from Figure 2, sensors and actuators can be accessed in two ways: either over the universal peripheral unit connecting various sensors and actuators, or individually, provided the device has a TTCAN interface. Detaching sensors and actuators from the processing units may not be reasonable in the case of peripherals tightly coupled with processing units, e.g. for specific resource monitors (such as RM2 in Figure 2); in this case, they may be connected to the processing unit itself. Access to the external data via the peripheral unit, however, is the preferable choice, since it allows much more flexibility, which can be used for system reconfiguration. The unit may further include certain intelligence to process the raw data in order to minimise the communication and reduce the load on the processing units. The most important feature, however, is the provision for fault tolerance.

The implementation of the peripheral unit itself is given in Figure 4. It consists of two parts, the peripheral interfaces and the communication module. In the figure, the actual peripheral interface is on the left. It provides digital and analogue inputs and outputs. Further, there is a simple Microchip PIC microcontroller that may be used as a further source of input/output ports; it may also provide considerable processing facilities for data transformation and/or evaluation, which can be very useful for enhancing the tolerance to faults. To the right of the peripheral interface is the communication module. It is the same module as the one used for the same purpose in the processing units. Its function is to support the dual TTCANs and the physical layers of the middleware. It consists of the communication interfaces, additional data storage and an FPGA chip (XILINX Spartan II) in which different functionalities can be implemented in the firmware.

Fig.4: Schematic of the peripheral unit

The peripheral interface module and the communication module interact over serial peripheral interface busses; each peripheral interface is directly connected to the FPGA, providing for individual consistency checks of the data. Apart from that, there are some additional lines from the FPGA which act directly as digital inputs and outputs.

3.4 Support for enhanced fault tolerance

In the sequel, it is explained how well-known techniques for enhancing fault tolerance in embedded systems [6] have been mapped onto the IFATIS architecture.

Functionality checking: this involves measures to check the functionality of the hardware, usually memories, the processor or communication resources. The peripheral unit has two possibilities to check the functionality of the peripheral interfaces: it can be done either by the firmware in the FPGA or via the microcontroller which is part of the unit. Tests are devised for viability checks, e.g. to see whether any changes have been recorded on the inputs in the last period. Self-checking peripheral interfaces can be polled for exceptional events.

Consistency checking: using knowledge about the reasonable behaviour of signals or data, their validity may be checked. Data read from the peripheral interfaces are first evaluated in the FPGA by the firmware. The obvious plausibility checks are done with range checking. Apart from that, the data can be checked for reasonable dynamics: too fast changes in an input variable indicate probable wrong readings. More complex algorithms may be devised and executed on the microcontroller. Since this is done outside the process control loop, it does not introduce any delay or jitter on the process outputs. These checks rely on the experience and knowledge of the application designer about the reasonable behaviour of the environment.
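The range and dynamics checks could, for instance, take the following form; the signal_model_t structure and its fields are hypothetical, standing in for the application designer's knowledge about the signal.

```c
#include <math.h>
#include <stdbool.h>

/* Application-supplied knowledge about a signal's reasonable behaviour. */
typedef struct {
    float min, max;       /* admissible range                   */
    float max_step;       /* largest credible change per period */
    float prev;           /* last accepted value                */
} signal_model_t;

/* Plausibility check as could be done in the FPGA firmware or on the
 * PIC: range check first, then rate-of-change ("dynamics") check. */
static bool consistent(signal_model_t *m, float x)
{
    if (x < m->min || x > m->max)
        return false;                       /* out of range     */
    if (fabsf(x - m->prev) > m->max_step)
        return false;                       /* implausibly fast */
    m->prev = x;
    return true;
}
```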

Checking pairs: in the case of redundant resources, it is possible to check whether different instances of the partial systems behave similarly. An error can be detected if a pair of peripheral interfaces behave essentially differently. To eliminate common-mode failures, the interfaces in the pair must be designed diversely. Obviously, careful consideration is necessary to decide which source is more trustworthy in the case of a difference. For this reason, it is useful that one device is less sophisticated, and thus more robust: the more precise result is taken as long as both match approximately; otherwise, the less precise one is considered more reliable. A combination with consistency checking may also be useful in determining the more credible resource.

Information redundancy: if feasible, it is reasonable to introduce certain redundancy into the data or signals in order to allow for fault detection, such as checksums, parity information, etc. This is a particularly valuable measure in data transfer. In the architecture with the dual TTCAN busses, it can be employed in several ways: either the same information can be transmitted in the same time frame on both busses (with the receiver employing the pair-checking technique to detect faults), or redundant information (e.g. parity) can be added to the data messages. Checksums are employed to verify the integrity of data tables in volatile memory.
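As an example of information redundancy, a simple additive checksum over a memory table is sketched below; the concrete scheme used on the platform is not prescribed here, so this is an illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Additive checksum as could protect configuration tables in volatile
 * memory: the stored byte is chosen so the whole table sums to zero. */
static uint8_t table_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    while (len--)
        sum += *data++;
    return (uint8_t)(~sum + 1);   /* two's complement of the sum */
}

/* A table is intact if summing all bytes, checksum included, yields 0. */
static int table_intact(const uint8_t *table, size_t len_incl_checksum)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len_incl_checksum; i++)
        sum += table[i];
    return sum == 0;
}
```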

Loop-back testing: in order to detect signal or data transmission faults, the transmitted values can be sent back to the source and verified. Data of crucial importance for safety can be re-transmitted in a message from the receiving processor unit to the peripheral device and compared with the original input data at the very source. This can be done by the firmware at the peripheral interface (not implemented or shown in the block scheme).

Watchdog timers: to check the viability of the system, it is a good idea to check whether the system responds to a periodical test (usually dummy) signal; if there is no response within a predefined interval, a timer detects a fault. For this purpose, the reference messages may be employed. Firstly, to verify the viability of the bus, the presence of the reference messages is monitored by the TTCAN drivers themselves. Further, other actions may be supervised for their liveness, from the behaviour of the environment over the peripheral interfaces to the processing units.

Bus monitoring: the operation of a computer system can often be monitored by observing the behaviour on its system bus to detect hardware failures. Monitoring within the TTCAN protocol is done by the TTCAN drivers. Further, monitoring of the system bus is done by the processors; however, in this way the errors can only be detected, not handled.
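The reference-message watchdog could be reduced to the following sketch, assuming the receive interrupt records a tick counter; the timeout constant and names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define REF_TIMEOUT_TICKS 3   /* allow a few missed basic cycles */

static volatile uint32_t last_ref_tick;  /* updated in the RX interrupt
                                            on each reference message  */

/* Periodic viability check: the bus (or its time master) is declared
 * faulty if no reference message arrived within the timeout window. */
static bool bus_alive(uint32_t now_tick)
{
    return (now_tick - last_ref_tick) <= REF_TIMEOUT_TICKS;
}
```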

4 Software architecture concepts

To support the execution of the application program on the platform, dedicated system software has been developed. The system software resides on each processing unit and operates in cooperation with the hardware and firmware described above. Besides the system software, the application software runs on the platform; this comprises the object code of all the FTC/FTM, RM, LRMC and GRRM modules. For the time being, this code is generated by the Matlab/Simulink tool. The system software can be divided into several layers: the system support layer, the communication layer, the distributed shared memory layer and the application support layer. On one side, the system support layer communicates with the hardware and firmware; on the other, the application support layer interacts with the application software. The software architecture of the IFATIS platform is shown in Figure 5.


Fig.5: System software architecture

4.1 System support layer

The system support layer provides the basic functionality of the system. It communicates directly with the hardware and, through sensors and actuators, with the environment. On the other hand, it is also responsible for smooth application execution. It consists of a micro-kernel, device drivers and a set of initialisation routines.

The micro-kernel performs the basic operations of the system: task scheduling, memory allocation and process synchronisation. For the IFATIS applications, simple memory allocation and task scheduling routines suffice. For the time being, the actual application code is executed periodically as a single task. There is support for multitasking (e.g. to implement parallel execution of the high-level resource monitor routines, the global resource manager and the reconfiguration module). For inter-task synchronisation, the micro-kernel includes static semaphore objects. The micro-kernel is implemented with the low-level BIOS routines provided by the DSP experimental board.

The device drivers are pre-programmed on the I/O nodes and are executed by firmware (i.e. as discrete logic circuits in FPGAs) and/or by a simple microcontroller.

The initialisation routines are employed at the start-up of the system. They perform the initialisation of the hardware and of the basic internal data structures, prepare the communication infrastructure, etc. As part of the initialisation, a set of self-diagnostic routines is also executed to detect possible faults in the hardware. The initialisation of the system is based on the configuration tables provided by the application development tool. These tables include general information about the node (e.g. its ID, type and priority), information for the integrated peripherals, the principal attributes of the communication infrastructure, etc.
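A sketch of the periodic single-task execution with a static semaphore follows; the primitives (sem_wait, sem_post, wait_for_period) are assumed to be supplied by the micro-kernel and the underlying BIOS routines, and their names are hypothetical.

```c
/* Hypothetical micro-kernel primitives, loosely modelled on the static
 * semaphores and periodic scheduling described above. */
typedef struct { volatile int count; } sem_t;

void sem_wait(sem_t *s);     /* block until count > 0, then decrement */
void sem_post(sem_t *s);     /* increment and wake a waiting task     */
void wait_for_period(void);  /* sleep until the next control period   */

static sem_t state_lock = { 1 };

/* The control application as a single periodic task; a monitor task
 * can take the same semaphore to inspect internal states safely. */
static void control_task(void)
{
    for (;;) {
        wait_for_period();
        sem_wait(&state_lock);
        /* ... read inputs, run the control function, write outputs ... */
        sem_post(&state_lock);
    }
}
```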

4.2 Communication layer

The communication routines perform the communication between modules, sensors, actuators and other components of the system. The TTCAN communication is based on timetables provided in advance by the development tool and utilises the distributed shared memory model described below. A timetable consists of a set of slots used to transmit individual messages. A message represents a piece of information that must be transferred between different nodes, sensors or actuators. The communication routines write to and read from those slots. A small amount of the communication bandwidth is set aside for debugging messages.

The distributed shared memory model (Figure 6) was introduced to isolate the control application design from the issues of the hardware implementation. Two or more processes can use a cell to communicate with each other regardless of their current location (i.e. on which processing nodes the processes run). The same mechanism is used for the communication between the control application and sensors or actuators.

Fig.6: Model of the distributed replicated shared memory

Data written to a shared memory cell are distributed through the system transparently and periodically, asynchronously to the application. If the communicating processes reside on the same processing unit, such a cell is represented simply by physical memory; otherwise, the contents are transmitted among the memory cells by means of the communication subsystem. In the latter case, the memory cells are mapped onto blocks of TTCAN communication messages. This is done by an external tool, which also optimises the mapping in order to guarantee the dispatch times and to maximise the number of free slots. For fault detection, data transformation, etc., each data cell can be associated with a validation routine. The validation routines are generated off-line and are entirely independent of the design of the control application. The system software supports the use of 32 shared memory cells with five basic data types (byte, word, integer, float and double). This functionality can be expanded to a larger number of memory cells and to custom data types with a maximum size of eight bytes. The communication layer manages the replication of the memory cells in the background as a separate task.
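The cell interface might look roughly as follows; the function names and the local-copy semantics are assumptions for the sketch, with the background replication left to the communication task.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_CELLS 32   /* as supported by the system software */
#define CELL_SIZE 8    /* maximum data size in bytes          */

typedef bool (*validate_fn)(const void *data);

typedef struct {
    unsigned char data[CELL_SIZE];
    validate_fn   validate;       /* per-cell validation; may be NULL */
} cell_t;

static cell_t cells[NUM_CELLS];

/* Local write; the communication layer replicates the cell to other
 * nodes periodically and asynchronously, in the background. */
static bool cell_write(int id, const void *src, unsigned len)
{
    if (id < 0 || id >= NUM_CELLS || len > CELL_SIZE)
        return false;
    if (cells[id].validate && !cells[id].validate(src))
        return false;             /* reject inconsistent data */
    memcpy(cells[id].data, src, len);
    return true;
}

static void cell_read(int id, void *dst, unsigned len)
{
    memcpy(dst, cells[id].data, len);
}
```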

4.3 Application support layer

The application support layer is responsible for the execution of the control application. It consists of the system interface routines, the debugging routines, and a loader.

The system interface routines are used by the application to interact with the system software. They translate requests from the application into a form acceptable to the other parts of the system software. All functionality of the system software can be used by the application if needed; however, it is expected that these routines will mainly be used by the global resource and reconfiguration manager and possibly by the resource monitors. The interface to the system routines is provided by means of C header files. The actual code resides in an object library and is linked with the application code.

The IFATIS hardware platform also provides an experimental test bed for different higher-level approaches to dealing with faults in control systems. For this reason, appropriate system support for testing and diagnostics has also been implemented. It can be used for monitoring and setting the shared memory cells and some global variables of the control application. The changes of this information can be monitored over some period of time; the data can then be downloaded and analysed off-line, etc. The debugging routines communicate with the development platform through the JTAG port by means of shared memory. For this, a simple handshaking protocol is used; the development platform deals with a single command at a time.

To load the application code from the development system onto the IFATIS processing nodes, a simple loader was implemented. The compiler creates object files in the common object file format (COFF). The loader is used in two ways, depending on where the object code is located. It can be used to load the application code from the EEPROM; in this case, the loader is part of the application code. Alternatively, it can be used to load the application from the development platform; in this case, the loader is part of the development tool and utilises the debugging routines described above.
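The single-command handshake could be sketched as a shared memory mailbox like the one below; all field names and the polling scheme are hypothetical, illustrating only the one-command-at-a-time property.

```c
#include <stdint.h>

/* Hypothetical mailbox in shared memory, polled by the target and
 * written by the development platform through the JTAG port. */
typedef struct {
    volatile uint32_t cmd;      /* written by the development platform */
    volatile uint32_t arg;
    volatile uint32_t result;   /* written by the target               */
    volatile uint32_t ready;    /* 1: command pending, 0: consumed     */
} dbg_mailbox_t;

static dbg_mailbox_t mailbox;

/* Called periodically on the target: serve at most one command. */
static void debug_poll(uint32_t (*execute)(uint32_t cmd, uint32_t arg))
{
    if (mailbox.ready) {
        mailbox.result = execute(mailbox.cmd, mailbox.arg);
        mailbox.ready  = 0;     /* acknowledge: host may send the next */
    }
}
```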

5 Conclusions

For economical reasons, computer-controlled systems are taking over the control of more and more mission- and safety-critical systems, thus replacing discrete automata, whose operation and safety properties are much easier to verify and even certify. To make them more reliable and credible, it is necessary to make them tolerant to the anticipated and the un-anticipated potential faults. Since (at least in less critical cases) their cost also plays an important role, in the IFATIS project an attempt was made to exploit the resources remaining intact in the case of a failure and to dynamically reconfigure the control system. Apart from the sophisticated upper layers, the lower-layer platform dealt with in this paper also needs to exhibit a reasonable amount of tolerance to faults: detecting faults and blocking their propagation may prevent them from emerging on the higher levels, thus sparing the more costly measures needed to handle them there.

In the paper, the platform design and implementation were presented. It was shown how the existing principles could be implemented to support the higher-level IFATIS architecture. Some of the measures are at this moment only theoretically devised and have still to be implemented; this mainly relates to certain fault tolerance features that are now done in software but would better be performed in the firmware. Further, a challenging part of the project will be to explore how to describe the reasonable properties of the peripheral data, and how to design and describe the corresponding measures so that they can be introduced in the early stages of the application development.

Acknowledgement

The paper emerged from the IFATIS project. I would like to express sincere gratitude to all members of the project who contributed to the design of the architecture described in the second section. In particular, Prof. Uwe Maier authored the original idea of the FTC module architecture, and Dr. Domen Verber contributed essential solutions to its implementation. Finally, I would like to thank the colleagues from the Real-Time Systems Laboratory of the Faculty of Electrical Engineering and Computer Science in Maribor for their work on the implementation of the prototype.

References

[1] IEC International Electrotechnical Commission, WG9, Geneva. Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 1: General requirements. SC 65A (System Aspects), December 1998.

[2] IFATIS. Intelligent Fault Tolerant Control in Integrated Systems. IST-2001-32122; http://ifatis.uni-duisburg.de/, 2002-2004.

[3] Texas Instruments. DSK C6711 DSP kit. http://focus.ti.com/docs/tool/toolfolder.jhtml?PartNumber=TMDS320006711.

[4] Uwe Maier and Matjaž Colnarič. Some basic ideas for intelligent fault tolerant control systems design. In Proceedings of the 15th IFAC World Congress, Barcelona, Spain, July 2002.

[5] Robert Bosch Company. TTCAN User's Manual; http://www.can.bosch.com/docu/Users_Manual_TTCAN.pdf, 2001.

[6] Neil Storey. Safety-Critical Computer Systems. Addison-Wesley Longman, 1996.
