
Model-based development of fault tolerant systems of systems

Zoe Andrews, Richard Payne and Alexander Romanovsky
School of Computing Science, Newcastle University, NE1 7RU, UK
E-mail: {zoe.andrews, richard.payne, alexander.romanovsky}@ncl.ac.uk

Abstract—This paper puts forward a new method for model-based development of fault tolerant systems of systems. The method covers early architectural design, formal modelling and verification. The focus is on supporting modelling techniques that ensure systematic and structured reasoning about faults, error detection, and fault and error recovery. The method combines semi-formal modelling in SysML with formal modelling and verification conducted in CSP. The work is part of the EC COMPASS Integrated Project on Comprehensive Modelling for Advanced Systems of Systems¹.

André Didier and Alexandre Mota
Universidade Federal de Pernambuco, Centro de Informática
Av. Jornalista Anibal Fernandes, s/n, Cidade Universitária, Zip 50.740-560
E-mail: {alrd, acm}@cin.ufpe.br
http://www.cin.ufpe.br

¹ http://www.compass-research.eu/

978-1-4673-3108-1/13/$31.00 ©2013 IEEE

I. INTRODUCTION

Reasoning about system models at different steps of system development and explicitly linking these steps into processes is at the core of model-based development. In this paper we discuss models that help developers to construct complex systems of systems. We focus on architectural design, followed by formal modelling and verification.

Systems of systems (SoSs) are systems built, at least in part, from pre-existing constituent systems that may not have been designed for integration. Their characteristics include autonomy, dynamic connectivity, distribution, operational and managerial independence of constituents, evolution over time, and emergent behaviour [1]. Examples can be found in the transport, health care, space, defence and power distribution domains.

The independence of the constituent systems makes it extremely difficult to ensure SoS dependability. The situation is complicated by the fact that any SoS must have features for tolerating faults. This calls for developing novel techniques that support systematic and rigorous design and verification of fault-tolerant SoSs.

II. THE METHOD PROPOSED

The development method [2] we are proposing starts with elicitation of the requirements and their capture in a semi-formal structured notation (for example, in SysML). This step includes identification of the dependability and fault tolerance requirements.

The early architectural design is conducted using SysML. We provide the SoS developers with guidelines showing how normal and abnormal behaviour and state can be modelled. The focus is on a clear separation of the normal and abnormal activities, and on explicit modelling of all fault tolerance steps (error detection, error and fault handling) for every error and all possible combinations of errors. This semi-formal reasoning about SoSs and their fault tolerance is crucial for ensuring clear traceability to the requirements. It makes it easier for the various developers to agree on a common understanding of the architecture.

We also recognise the importance of formal modelling and verification while developing critical SoSs. To this end, at the next step of our method we support mapping of the SysML models into a formal notation (CSP in our case) that can be fully verified. We define the rules of the mapping. The correctness of the resulting CSP model is demonstrated by applying available proving and model-checking techniques. Of particular importance to our method is the verification of fault tolerance properties of the SoS. The steps are described in detail in the following three sections.

III. FAULT TOLERANCE MODELLING IN SYSML

SysML [3] is a rich language with many different ways to model the same behaviour. The SysML diagrams provide views onto the underlying model describing the architectural structure, behaviour and requirements of the model. Rather than prescribing a subset of diagrams that are suitable for fault modelling, we opt to describe a set of views that together will build up a picture of the expected as well as erroneous behaviour and recovery procedures present in an SoS. These views fall into two categories:
• Nominal. The structure and behaviour of the SoS under the assumption that no faults are present.
• Erroneous behaviour and recovery.
Describes possible faults, errors and failures, and the behaviour of the SoS in the presence of such dependability threats. Also includes the structure and behaviour of the SoS with recovery procedures included to deal with the identified faults and errors.

These views are detailed and illustrated in [4]. Here we provide just a summary. The nominal structure and behaviour of an SoS is modelled using structural (Ontology, Composition, Connection) views and behavioural (Scenarios and Processes) views. These employ SysML block definition, internal block, sequence and activity diagrams.

Several views are proposed for modelling SoS erroneous behaviour and recovery procedures (see Table I). These include structural views (Fault/Error/Failure Definition, Fault Propagation, Fault Tolerance Structure, Fault Tolerance Connections) and behavioural views (Erroneous/Recovery Scenario, Erroneous/Recovery Processes, Fault Activation and Recovery). Some of these show the structural relationships between faults, errors, failures and components, whilst others show how faults and errors impact on the behaviour of the SoS. For example, the Process view defines the sequences of actions to be taken in the SoS using a combination of block definition diagrams and activity diagrams, and the Fault Activation view extends the nominal Process activity diagrams to model the low level behaviour of errors by identifying when faults may be activated, what happens after activation and where in the process the error may be detected.

Most of the erroneous behaviour and recovery views (with the exceptions of the Fault/Error/Failure Definition and Erroneous/Recovery Processes views) are intended to focus on one fault at a time, although we realise that for interacting faults it may be useful to model several faults in a single view (we plan to investigate this further in our future work).

IV. SYSML TO CSP

The main objective of linking SysML and CSP is to obtain a formal representation of the semi-formal graphical view of SysML. Several works report translations from SysML to Petri Nets (PNs) [5], [6] because OMG's SysML document, which refers to the OMG UML Infrastructure [7] and Superstructure [8], uses a Petri-Nets-like semantics. However, in this work we adopt CSP as the formal language to reason about SysML models.
There are two reasons to use CSP instead of PNs: (i) CSP is founded on a powerful process refinement relation with tool support (such as FDR [9]); (ii) we will need to translate SysML to CML as a goal of the COMPASS project (see Section VIII). CML is a formal combination of VDM [10], CSP [11], and the refinement calculus of Morgan [12]. It builds on the maturity of Circus [13]. Compared to CSP and PNs, it is more suitable for modelling and verifying data structures and operations.

In our strategy, the SysML model contains the fault information in the structural views and its usage in the behavioural views (Scenarios and Processes). Sequence diagrams model faulty scenarios (Scenarios views) and activity diagrams (Processes views) model the processes that describe integrated SoS behaviour. Therefore, it is only necessary to translate the behavioural views to formally validate the dependability properties. For instance, it is possible to check all possible paths that lead to the occurrence of a fault, and to verify fault tolerance properties as described in Section V.

In this work we translate the activity diagrams. After translating each activity diagram into CSP we integrate them to obtain the SoS behavioural model as a whole. The integration relies on the synchronization of the send signal and accept event actions. When an element of type send signal is reached, it synchronizes with all accept event actions that have the same operation name.

To obtain CSP from SysML we create a bridge using PNs (Figure 1 (1), detailed in Section IV-A) following the work reported in [5]. A mapping from SysML to PNs tends to be easier to prove correct, as the activity diagrams are stated in terms of tokens and transitions (Petri-Nets-like [8]). Then we translate from PNs to CSP (Figure 1 (2), detailed in Section IV-B). As PN is a formal language, this translation is straightforward.

Figure 1: Strategy to translate SysML to CSP.

A. From SysML to PNs

We use the works reported in [5], [6] as a starting point for our mapping from SysML to CSP, but we make some changes to obtain an intermediate representation of the diagrams. The interruptible regions of the activity diagrams [8] are the basis of our strategy to model fault behaviour (see Sections III and VI). They are not translated directly into basic PNs because their semantics require the synchronous execution of different transitions when interruption events occur. We employ reset edges and interface places of ZSN (zero-safe nets [14]) as shown in [6].

This intermediary mapping is based on the SysML elements and their input edges. For each element, we create a triple of type $ElementNames \times ID \times \mathbb{P}(ID \times nat)$, where $ElementNames$ is an enumerated set of diagram element type names (INITIAL, JOIN, FORK, ACTION, MERGE etc.), $ID$ is a node identifier of an activity diagram and $\mathbb{P}(ID \times nat)$ is a set of pairs of identifier and multiplicity that are connected to input edges. For example, if a diagram has an initial node, an action node and an end node, then we have:

$$
ActivityDiagramToPetriNet(\{ (INITIAL, 1, \{\}),\ (ACTION, 2, \{(1, 1)\}),\ (FINAL, 3, \{(2, 1)\}) \})
$$
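The triple-based intermediate representation above can be sketched in Python. This is only an illustrative sketch, not the authors' CSP encoding: the names `ElementName`, `Element`, `dangling_inputs` and `EXAMPLE` are assumptions introduced here, and the only check performed is that every input edge refers to a declared node.

```python
from enum import Enum
from typing import FrozenSet, List, NamedTuple, Tuple


class ElementName(Enum):
    """Diagram element type names (hypothetical subset of ElementNames)."""
    INITIAL = "INITIAL"
    ACTION = "ACTION"
    FORK = "FORK"
    JOIN = "JOIN"
    MERGE = "MERGE"
    FINAL = "FINAL"


class Element(NamedTuple):
    """One triple: element type, node id, set of (source id, multiplicity)."""
    kind: ElementName
    node_id: int
    inputs: FrozenSet[Tuple[int, int]]


def dangling_inputs(diagram: List[Element]) -> List[Tuple[int, int]]:
    """Return (node id, source id) pairs whose source id is not declared."""
    declared = {element.node_id for element in diagram}
    return [
        (element.node_id, source)
        for element in diagram
        for source, _multiplicity in sorted(element.inputs)
        if source not in declared
    ]


# The three-node example from the text: initial -> action -> final.
EXAMPLE = [
    Element(ElementName.INITIAL, 1, frozenset()),
    Element(ElementName.ACTION, 2, frozenset({(1, 1)})),
    Element(ElementName.FINAL, 3, frozenset({(2, 1)})),
]
```

Calling `dangling_inputs(EXAMPLE)` yields an empty list, since every input edge of the example points at a declared node. Keeping the input edges on the target element, as the triples in the text do, means a diagram can be checked one element at a time.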

Table I: SysML views of erroneous/recovery structure and behaviour

Structural Views
Name | Description
Fault/Error/Failure Definition | Define faults, errors and failures of the SoS using BDDs. Faults, errors or failures may be generalised into abstract categories.
Fault Propagation | Identifies propagation of faults through errors to failures and their relationships to constituent systems using IBDs.
Fault Tolerance Structure | Extends nominal Composition View with additional components required to tolerate a given fault.
Fault Tolerance Connections | Extends nominal Connection View with the additional components identified in the Fault Tolerance Structure View and the interfaces and connectors to tolerate a given fault.

Behavioural Views
Name | Description
Erroneous/Recovery Scenarios | Models behaviour in the presence of errors (with and without recovery) as scenarios in SDs. Shows erroneous behaviour propagation and recovery procedure triggers.
Erroneous/Recovery Processes | Extends nominal Process BDDs to include behaviour resulting from faults. Further processes are added to model the recovery procedures.
Fault Activation | Extends nominal Process ADs to model the low level behaviour of errors. Identifies when faults may be activated, what happens after activation and where in the process the error may be detected.
Recovery | Further extends nominal Process ADs to show the behaviour of the recovery procedures once an error has been detected.

B. From PNs to CSP

A PN is stated in terms of nodes, transfers and edges. The CSP process that represents a PN behaviour has the following signature:

$$
PETRI\_NET((Nodes, Transfers, Edges))
$$

where $Nodes$ is a set of pairs of type $ID \times Tokens$, $Transfers$ is a set of $ID$s and $Edges$ is a set of triples $ID \times ID \times Multiplicity$ connecting nodes and transfers. The set $ID$ contains the identifiers of each activity diagram element. The process body of a PN is:

$$
MAIN(state) = \mathop{\Box}_{id \in enabledTransfers(state)} transfer!id \rightarrow \Big( \mathop{|||}_{(s,d,m) \in inputs(id)} edge!s!d!m \rightarrow label!d \rightarrow SKIP \Big) ; \Big( \mathop{|||}_{(s,d,m) \in outputs(id)} edge!s!d!m \rightarrow SKIP \Big) ; MAIN(updateState(id, state))
$$

where $enabledTransfers$ is a function that returns the set of $ID$s of enabled transfer elements, and the functions $inputs$ and $outputs$ return the set of a transfer's connected elements. The $updateState$ function changes the $state$ parameter to reflect the transfer execution, and the channel $label$ is used to give meaning to the transfer execution (see Section IV-C).

To express the reset edges we just augment the original set of multiplicity values with a reset abstraction:

$$
Multiplicity^{R} = Multiplicity \cup \{Reset\}
$$

The auxiliary functions $enabledTransfers$, $inputs$, $outputs$ and $updateState$ support the augmented multiplicity set.

C. PN transfer semantics in CSP

The semantics of the execution of a PN transfer is an event in CSP; thus, if a SysML node $A$ is executed, then the corresponding $transfer$ and $label$ events are executed. The two events are kept separate so that the selection of the $ID$ is distinguished from the actual semantics of the transfer, which should only be visible to the external environment after all inputs have been executed. To name each execution we use a renaming relation, where for each transfer we create a unique event. For example, if a SysML diagram
has a node with name $A$ and it is the second node on the diagram, then we have the mapping $label.ID.2 \mapsto activity.A$, where $ID.2$ is generated for $A$'s node. If the SysML element is a decision node, then each guard has a corresponding event (as actions do): $label.ID.x \mapsto guard.guardName$, where $guardName$ is the name of the corresponding guard.

The semantics of a SysML diagram is a process that allows any possible execution (trace) of the diagram, where each event represents an execution of a node or guard. For our simple example shown in Section IV-A, the largest trace is $\langle label.ID.1.1, label.ID.2.1, label.ID.3.1 \rangle$, where only $label.ID.2.1$ should be mapped to an activity and the other $transfer$ events should be hidden. If the action node name is $A$, then the trace is $\langle activity.A \rangle$.

V. FORMAL VERIFICATION

After translating the SysML diagrams into CSP and integrating them (Section IV), the SoS behaviour can be verified. There are two properties of interest: absence of faults and fault tolerance. We show the theoretical foundation to verify them.

Reachability verification allows us to test the absence of the occurrence of fault events in an SoS $P$. If we want to check the absence of paths that lead to faults in a set $E$, we use the refinement

$$
CHAOS_{\Sigma \setminus E} \sqsubseteq_T P ,
$$

where $\Sigma$ is the set of all events and $CHAOS_A$ is defined as:

$$
CHAOS_A = STOP \sqcap (?ev : A \rightarrow CHAOS_A) .
$$

If the refinement holds, that is, no path to a fault is found as a counterexample, we can be sure that none of the faults in $E$ occur in the system.

Fault tolerance is defined in CSP in terms of lazy abstraction [11], [15]. Lazy abstraction is defined as follows:

$$
L_E(P) = (P \parallel_E CHAOS_E) \setminus E ,
$$

where $P$ is a process of interest and $E$ is a set of events to hide.

Lazy abstraction can be compared to, but is slightly different from, CSP hiding. For example, given a process $P_1 = a \rightarrow b \rightarrow P_1$, $P_1 \setminus \{b\}$ becomes $a \rightarrow (P_1 \setminus \{b\})$, whereas $L_{\{b\}}(P_1)$ can simply stop after communicating $a$. It is a deadlock, which means that the set of failures of $L_{\{b\}}(P_1)$ contains the refusal of the alphabet after the trace $\langle a \rangle$: $failures(L_{\{b\}}(P_1)) \supseteq \{(\langle a \rangle, \{a, b\})\}$. In the traces model hiding is equal to lazy abstraction, because refusals are not considered when comparing processes:

$$
P \setminus E \equiv_T L_E(P) ,
$$

but in the failures model:

$$
L_E(P) \sqsubseteq_F P \setminus E .
$$

If we want to check if a process $P$ is fault tolerant, $FT(P)$, we employ the fault tolerance property [11], [15] stated as:

$$
FT(P) \Leftrightarrow P \parallel_E STOP \sqsubseteq_F L_E(P) .
$$

This means that when a fault event (from $E$) occurs (in the left-hand side of $\sqsubseteq_F$), $P$ on the right-hand side of $\sqsubseteq_F$ simply behaves as if it had not occurred. The left-hand side process $P \parallel_E STOP$ can be seen as $P$ with no faults. If the refinement holds, we say that $P$ is always able to recover from any fault in the set $E$ and thus it is said to be fault tolerant over $E$.

The property has a finer-grained version, $FT_L(.)$, which considers a limit on fault tolerance. It is used when we know that the system is not fully tolerant to faults, but we want to check whether the system can tolerate a certain behaviour over the faults, for example, a specified number of occurrences of a fault. This version is stated as:

$$
FT_L(P) \Leftrightarrow P \parallel_E STOP \sqsubseteq_F L_E(P \parallel_E L) ,
$$

where $L$ communicates only events in $E$. A simple example is an active ($Act$) / backup ($Bkp$) system, where if the active system behaviour ($M$) breaks (with fault $f$), then the backup starts. The CSP code is as follows, where $\triangle_f$ denotes interruption by the fault event $f$:

$$
\begin{aligned}
Act &= M \triangle_f Bkp & Bkp &= M \triangle_f STOP \\
M &= doWork \rightarrow M & E &= \{f\} \\
L &= f \rightarrow STOP
\end{aligned}
$$

The fault-tolerance properties show that $Act$ is not fault tolerant with respect to $f$:

$$
Act \parallel_E STOP \not\sqsubseteq_F L_E(Act) ,
$$

but $Act$ tolerates $f$ once:

$$
Act \parallel_E STOP \sqsubseteq_F L_E(Act \parallel_E L) .
$$

Expanding this definition to our fault modelling (Section III), we have one additional concept: recovery. The translation strategy (Section IV) considers the faults and recoveries as CSP events (as they are defined on the SysML diagrams). The fault-tolerance property considers that the recovery is implicit, not written as events. In spite of these differences the fault-tolerance property can still be checked with a few changes. Our fault-tolerance property $FT_{L,R}(.)$ is written as:

$$
FT_{L,R}(P) \Leftrightarrow (P \setminus R) \parallel_E STOP \sqsubseteq_F L_E\big((P \setminus R) \parallel_E L_R\big) ,
$$

where $R$ is the set of recovery events related to the faults in $E$, and $L_R$ allows only fault events that are recoverable. This analysis checks if the designed recovery mechanisms make the system fault tolerant.

VI. EVALUATION

The case study used to evaluate our work is supplied by the Italian company Insiel, which operates in the Friuli Venezia Giulia region of northern Italy. The SoS of interest provides an emergency response to Targets identified by the public. The constituent systems are operationally and managerially independent (provided and developed by external organisations). They are the Phone System, Call Centre, Radio System and Emergency Response Unit (ERU). Each constituent may be geographically distributed and may evolve. The entities in the environment with which the SoS interacts include Callers and Targets. The purpose of this simplified Insiel SoS is to meet one high-level requirement: for every call received, send an ERU with the correct equipment to the correct target.

For our study, we identify several faults which may occur in the simplified Insiel SoS [2]. These are related to failures of the constituent systems leading to error states in the SoS, which are discovered by different constituents. This paper focuses on one fault: Complete Failure of the Radio System. This SoS fault represents the situation where the Radio System is unavailable and does not pass any messages between the Call Centre and ERUs. This can result in the failure of the SoS. The use of an alternative communication system is proposed to recover from this fault.

The remainder of this section follows the method proposed in Section II: Section VI-A describes the SysML modelling and Section VI-B the CSP modelling and verification.

A. SysML Modelling

1) Nominal: We begin by modelling the nominal structure and behaviour of the SoS. In this section, we highlight a small selection of nominal diagrams: the complete nominal model is omitted from this paper due to space constraints, but can be found in [4].

The connections between the constituent systems of the SoS, defined in the Composition View, are shown in the Connections View in Figure 2. The interfaces identified in the Connections View are described in more detail in a separate diagram, not shown here, with operation signatures.

Given the structural description of the model, the behaviour is defined using Scenario and Process views. In this paper, we concentrate on one process identified in [2], the Initiate Rescue Process, described in Figure 3. This process, defined using a SysML activity diagram, dictates the flow of behaviour between constituents of the SoS when initiating a rescue. The Call Centre must allocate an ERU to the rescue, which may depend upon the availability of ERUs and, if required, diverting ERUs from their current task. Messages are sent to the ERUs via the Radio System to initiate the rescue. The complete model considers a selection of other processes, including the Service Rescue Process enacted when the ERU receives the request to service a rescue, as shown in Figure 3.

Figure 2: Connections View for the Insiel SoS

Figure 3: Processes View for the Initiate Rescue Process

2) Erroneous Behaviour and Recovery: Erroneous behaviour and recovery SysML views (Table I) show how the structure and behaviour of the system change in order to tolerate the Complete Failure of the Radio System. This section focuses on a selection of these views (a more complete model is presented in [4]).

The Fault Propagation View given in Figure 4 extends the Connections View to show the causal chain from fault to error to failure². In this case the SoS fault Complete Failure of the Radio System leads to the SoS error state Radio System Unavailable. This means that the Call Centre cannot communicate with the ERUs to give them details of the Target, thus leading to the SoS failure Target Not Attended by ERU. The diagram also shows in which constituents of the SoS the fault originates, which constituents can detect the error state and which external interfaces the SoS failure affects.

Figure 4: Fault Propagation View for the Radio System failure

² Connections between constituents are shown instead of ports and interfaces for a cleaner presentation.

The failure of the Radio System has an impact on the behaviour of the Initiate Rescue Process as shown in the Fault Activation View (Figure 5). This SoS fault is activated if the Call Centre attempts to send the details of a rescue to an ERU when the Radio System has failed. This scenario is modelled by an interruptible region with an interrupt event Fault 1 activation. Once the fault has been activated, the erroneous behaviour of the Radio System dropping the message will occur. The error state (that the message has been dropped) may be detected, as modelled by the event Error 1 detection in a second interruptible region. If the error is detected, a recovery process (Figure 6) is carried out before continuing to service the rescue.

Figure 5: Fault Activation View for the Radio System failure (the node labels 1, 2, 3 and 14 are referred to in Section VI-B)

The Recovery View given in Figure 6 shows the steps taken to tolerate the failure of the Radio System. In essence the SoS fault is logged and an alternative communication system, the
Mobile Phone System, is used to send the rescue details to the ERU. This involves a reconfiguration of the SoS architecture. Note that the Mobile Phone System may also fail and further recovery procedures may be required to handle that failure.

Figure 6: Recovery View for the Radio System failure

B. CSP Modelling and Verification

Using the intermediary representation as PNs shown in Section IV, the initial mapping is straightforward. The initial node (1 on Figure 5) is mapped as $INITIAL\_NODE(1)$, and the start rescue accept event (2) is mapped as an action with one input from node 1 and multiplicity 1: $ACTION\_NODE(2, \{(1, 1)\})$. The merge node (3) is mapped as $MERGE\_NODE(3, \{(2, 1), (14, 1)\})$, and so on. As explained in Section IV, the intermediary representation of SysML and PNs is written in CSP, so it is not necessary to use any tool other than FDR. An initial trace of the process generated by the mapping is:

$$
\langle accept.StartRescue,\ activity.FindIdleERUs,\ guard.IdleERU,\ activity.AllocateIdleERU \rangle
$$

From the diagrams shown in Figures 5 and 6, the variables of the fault-tolerance property (Section V) are:

$$
\begin{aligned}
P &= [\text{parallel composition of the diagrams and their alphabets}] \\
E &= \{accept.Fault1Activation,\ activity.DropMessage,\ send.TargetNotAttended\} \\
R &= \{accept.Error1Detection,\ send.StartRecovery1,\ accept.EndRecovery1,\ activity.LogFault1,\ activity.ResendRescueInfoToERU\} \\
L_R &= (accept.Fault1Activation \rightarrow L_R) \mathbin{\Box} (activity.DropMessage \rightarrow L_R)
\end{aligned}
$$

As expected, this system passes our FT property (FDR screen, Figure 7 (1)). If we try different configurations, however, we obtain different and interesting results in which the system is not FT. The first configuration is when there is no limit ($L_R = RUN_E$, Figure 7 (2); the tick with a bullet means the negation of the check, that is, counterexamples were found). The counterexample shows that the right-hand side may fail to service a rescue after the activation of fault 1. A second configuration is when the recovery diagram (Figure 6) is not yet implemented, so there is no recovery mechanism, meaning that the start and end recovery events (Figure 5) do not synchronize with any other diagram (Figure 7 (3)). This is equivalent to saying that even if error 1 is detected, the process message and receive message would not happen anyway, because the mobile phone system is not defined.

Another thing to note is that the left-hand side of the FT property, $(P \setminus R) \parallel_E STOP$, should be equal to the system without faults, shown in Figure 3. The last check (Figure 7 (4)) finds all traces that lead to a $TargetNotAttended$ event (reachability).
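The reachability check used above asks whether any trace of the system contains an event from $E$. The following Python sketch illustrates that idea on an explicit labelled transition system. It is not FDR and not the authors' model: the dictionary encoding, the function names `find_fault_trace` and `block_events`, and the state names of the small active/backup system (loosely following the $Act$/$Bkp$ example of Section V) are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical encoding: state -> list of (event, target state).
LTS = Dict[str, List[Tuple[str, str]]]


def find_fault_trace(lts: LTS, start: str,
                     faults: Set[str]) -> Optional[Tuple[str, ...]]:
    """Breadth-first search for a shortest trace that ends in a fault event.

    Returns None when no fault event is reachable from the start state.
    """
    queue = deque([(start, ())])
    seen = {start}
    while queue:
        state, trace = queue.popleft()
        for event, target in lts.get(state, []):
            if event in faults:
                return trace + (event,)
            if target not in seen:
                seen.add(target)
                queue.append((target, trace + (event,)))
    return None


def block_events(lts: LTS, blocked: Set[str]) -> LTS:
    """Remove all transitions labelled with a blocked event.

    This mimics, at the level of transitions, putting the process in
    parallel with STOP on the blocked events.
    """
    return {
        state: [(event, target) for event, target in transitions
                if event not in blocked]
        for state, transitions in lts.items()
    }


# Active/backup system: fault f switches to the backup, a second f stops it.
ACT_BKP: LTS = {
    "Act": [("doWork", "Act"), ("f", "Bkp")],
    "Bkp": [("doWork", "Bkp"), ("f", "Stop")],
    "Stop": [],
}
```

For instance, `find_fault_trace(ACT_BKP, "Act", {"f"})` returns a counterexample trace consisting of the single fault event, whereas the same search on `block_events(ACT_BKP, {"f"})` finds nothing. A search of this kind only covers the traces-based reachability property; the failures-based fault-tolerance refinements of Section V would need refusal information that this sketch does not model.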

Figure 7: FDR fault-tolerance checks (panels 1 to 4)

C. Evaluation Summary

From the evaluation performed in this section using the Insiel SoS case study, we have shown that our approach allows the specification and analysis of faults in an SoS. We identify some avenues for additional effort to improve the approach: the creation of an FT SysML profile, the automation of the translation to CSP, extending the approach to CML to reduce deadlock issues, and potential optimisations of the translation from SysML to CSP. We expand upon these in the next section.

VII. DISCUSSION

A. Towards a fault tolerance SysML profile

Our initial experience in the SysML modelling of fault tolerance led us to start developing an FT profile for SysML that allows us to capture the FT-specific reusable modelling elements. This is ongoing work and our initial ideas will need to be evaluated using several examples of SoSs with different types of faults and recovery.

We use stereotypes and tags as the standard way of profiling in SysML [3]. In particular, we are introducing a
stereotype for each FT view identified in Table I as well as for key elements within these views that are related to fault tolerance, such as faults, errors, failures and recovery processes. The stereotypes are indicated by guillemets in the SysML diagrams of the case study (Figures 2–6). For example, the “Complete Failure of the Radio System” block in Figure 4 is stereotyped as a (SoS-level) Fault. Tags are enclosed in braces, for example the faultsOfInterest tag that appears in the title of each FT view. These stereotypes document the model, assist in understanding the system design, and also help in mapping and checking FT properties. For example, the sets $E$ and $R$, and the process $L_R$ (Sections V and VI) can be directly translated from stereotyped SysML elements.

In addition to defining stereotypes and tags we take a similar approach to Holt [16], using SysML BDDs to define a metamodel that indicates which SysML elements and stereotypes should (or may) be found in each of the stereotyped views.

B. Automatic SysML to CSP mapping

Mapping SysML to CSP is not as direct as one might expect. In particular, the OMG Superstructure just mentions tokens. This points towards a solution based on basic PNs, but it is not feasible to find such a mapping for all SysML elements this way. This is the main reason we have chosen to use Zero-Safe Nets (ZSN) and why we used a high-level representation of SysML elements in CSP ($ElementNames \times ID \times \mathbb{P}(ID \times nat)$). The intermediary representation in PNs can be replaced by a direct representation (Section VII-D) without changing this high-level representation.

The automatic mapping can be executed using QVT³, which has a high-level programming language based on UML metadata elements. For example, it has a stereotypedBy(string) function which receives a stereotype name and checks whether an element has the given stereotype. There is ongoing research into an automatic translation from SysML to CML as part of the COMPASS project.

³ http://www.omg.org/spec/QVT/

C. Deadlocks

Although we have mapped the most usual SysML elements to CSP, our mapping strategy still has some limitations. One of them is related to avoiding deadlocks. The CSP representation would have a (real) deadlock only if the original SysML model does. In our evaluation, one deadlock can be found if there is no external send signal action to the “Start Rescue” accept event or if there is no data handling. Note that in Figure 5 the process awaits a “Start Rescue” event, but also provides a “Start Rescue” signal. The difference is that the send signal starts a new instance for a diverted ERU, thus with a different ERU Id. To solve this deadlock one can add another diagram that enables the accept event “Start Rescue” (another diagram with a send signal) or implement data handling for the operations. In the COMPASS project, as CML offers a
better support for data handling, we left that as future work, and we just implemented the diagram that receives the calls and activates the “Start Rescue” accept event.

D. Optimization considerations

Although the mappings from SysML to PNs and to CSP can be easily proved, the operational representation of the resulting CSP process as an LTS⁴ can have more states than those necessary to verify fault tolerance properties. Creating a direct mapping from SysML to CSP, as shown in Figure 1 (3), should improve performance by reducing the number of states needed to represent SysML diagrams in CSP, and can be proved equivalent using CSP's process refinement (Figure 1 (4)).

A different approach could also be used in the first step of translating SysML to PNs. It is possible to create an optimal PN, that is, one with the minimum number of places and transfers, by modelling the mapping rules as mathematical expressions that can be solved by CPLEX⁵ or Z3⁶.

⁴ Labelled Transition System, the operational semantics of CSP.
⁵ http://www-01.ibm.com/software/integration/optimization/cplex-optimizer
⁶ http://research.microsoft.com/en-us/um/redmond/projects/z3

VIII. CONCLUSION

In this paper we proposed a development method for modelling functional and certain non-functional requirements of complex SoSs using SysML, focusing on the architectural design. Particularly for the non-functional requirements, we created a SysML profile assisting the developers in describing SoS dependability and fault tolerance aspects. Furthermore, we proposed an approach to formal SoS analysis by translating SysML into CSP and using the model checker FDR.

We believe the proposed approach is valuable for developing complex SoSs in several ways. Firstly, it helps in capturing requirements and early architectural design of SoSs using a specifically designed SysML profile, where dependability and fault tolerance aspects receive focused attention. It is particularly useful for modelling and verification of SoS reconfiguration as part of ensuring tolerance to failures of the constituent systems and infrastructures involved. Such reconfiguration is demonstrated by the addition of the Mobile Phone System in response to the Complete Failure of the Radio System in the case study (Section VI), and will be investigated more thoroughly in future work. Secondly, the method supports formal SoS modelling by a translation from the SysML models into models in the process algebra CSP. Finally, it helps in verifying these requirements with the aid of a formal tool (FDR), by checking that the desired dependability and fault tolerance aspects hold.

As future work, we plan to evaluate our ideas using several large-scale SoSs with different types of faults and recovery strategies. We also intend to create a library of fault-tolerance patterns in SysML and automate their application. This would result in an even greater degree of development automation and knowledge reuse. Concerning the translation, we plan to fully prove its soundness and
completeness. Finally, as error scenarios can be modelled with sequence diagrams (Section IV) and their translation to CSP allows formal validation of the integrated SoS behaviour extracted from the activity diagrams, we intend to explore this further in our future work.

Acknowledgements

The authors' work is supported by the EU FP7 Project COMPASS (No. 287829), and the EPSRC Platform Grant on Trustworthy Ambient Systems. The authors are grateful to Enrico Fracasso of Insiel for his input in developing the case study and to John Fitzgerald, Jon Holt, Simon Perry and the anonymous reviewers for their feedback on earlier versions of the paper.

REFERENCES

[1] M. W. Maier, “Architecting Principles for Systems-of-Systems,” Systems Engineering, vol. 1, no. 4, pp. 267–284, 1998.
[2] Z. Andrews, J. Fitzgerald, R. Payne, and A. Romanovsky, “Fault Modelling for Systems of Systems,” in Proceedings of the 11th International Symposium on Autonomous Decentralised Systems (ISADS 2013), March 2013.
[3] Object Management Group (OMG). (2012, June) Systems Modelling Language (SysML) 1.3. website. OMG. Version 1.3. [Online]. Available: http://www.omg.org/spec/SysML/1.3
[4] Z. Andrews, J. Fitzgerald, R. Payne, and A. Romanovsky, “Fault Modelling for Systems of Systems,” School of Computing Science, Newcastle University, Tech. Rep. CS-TR-1346, September 2012, available online at http://www.compass-research.eu/Project/Publications/CS-TR-1346.pdf.
[5] E. Andrade, P. Maciel, G. Callou, and B. Nogueira, “A methodology for mapping SysML Activity Diagram to Time Petri Net for requirement validation of embedded real-time systems with energy constraints,” in Digital Society, 2009. ICDS ’09. Third International Conference on, Feb. 2009, pp. 266–271.
[6] S. Boufenara, F. Belala, and K. Barkaoui, “Mapping UML 2.0 Activities to Zero-Safe Nets,” JSEA, vol. 3, no. 5, pp. 426–435, 2010.
[7] Object Management Group (OMG). (2011, August) OMG UML Infrastructure 2.4.1. website. OMG. Version 2.4.1. [Online]. Available: http://www.omg.org/spec/UML/2.4.1
[8] Object Management Group (OMG). (2011, August) OMG UML Superstructure 2.4.1. website. OMG. Version 2.4.1. [Online]. Available: http://www.omg.org/spec/UML/2.4.1
[9] FSEL, FDR2 User Manual, version 2.91, Formal Systems (Europe) Ltd, 2010. [Online]. Available: http://fsel.com/fdr2_manual.html
[10] D. Bjørner and C. B. Jones, Eds., The Vienna Development Method: The Meta-Language, ser. Lecture Notes in Computer Science, vol. 61. Springer, 1978.
[11] A. W. Roscoe, The Theory and Practice of Concurrency. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1997.
[12] C. Morgan, Programming from specifications. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1990. [Online]. Available: http://www.cs.ox.ac.uk/publications/books/PfS/
[13] J. Woodcock and A. Cavalcanti, “The semantics of Circus,” in Proceedings of the 2nd International Conference of B and Z Users on Formal Specification and Development in Z and B, ser. ZB ’02. London, UK: Springer-Verlag, 2002, pp. 184–203. [Online]. Available: http://dl.acm.org/citation.cfm?id=647285.723106
[14] R. Bruni and U. Montanari, “Zero-Safe Nets, or Transition Synchronization Made Simple,” Electronic Notes in Theoretical Computer Science, vol. 7, no. 0, pp. 55–74, 1997.
[15] A. W. Roscoe, Understanding Concurrent Systems, ser. Texts in Computer Science. Springer, 2010. [Online]. Available: https://www.cs.ox.ac.uk/ucs/
[16] J. Holt, A Pragmatic Guide to Business Process Modelling, 2nd ed. Swinton, UK: British Computer Society, 2009.