IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 4, NO. 2, MAY 2008


Diagnosis and Consulting for Control Network Performance Engineering of CSMA-Based Networks Joern Ploennigs, Member, IEEE, Mario Neugebauer, Member, IEEE, and Klaus Kabitzsch, Member, IEEE

Abstract—Network performance engineering can verify the design and dimensioning of large-scale control networks like CSMA-based building automation networks. It combines performance analysis with diagnosis methods to evaluate the network utilization and to detect design errors before installation, and can thereby save the expense of overdimensioning and redesign. This paper develops a diagnosis model based on fault trees that is able to use the huge amount of performance analysis results to identify design errors and analyze their interdependencies. This enables not only fast tracing of fault causes and the derivation of solutions; it can also visualize the fault interdependencies to the user and help him understand his design. Additional consulting tools implement best-practice strategies to support the user in parameterization.

Index Terms—Building management networks, carrier sense multiaccess, diagnosis, network performance engineering, quality of service.

I. INTRODUCTION

MEDIUM access schemes of the CSMA type are becoming more and more important for various control networks. Their drawback is that transmission time bounds cannot be guaranteed, and potentially long transmission times and message losses can cause malfunction and unstable control cycles [1]. Hence, in general only data with weak real-time requirements should be transmitted on lightly loaded CSMA networks [2]. These problems can be reduced with the usage of high-bandwidth networks and switches [3], usually requiring higher investments. However, the investments in network design and construction will decrease, while the network size and the interaction of the devices will increase. Hence, it will become more and more important to predict the network performance already in the design phase, to avoid malfunction as well as expensive overdimensioning.

These problems already take effect in the domain of building automation, where CSMA-based networks with several tens of thousands of elements are common to automate lighting, heating, ventilation, and air-conditioning. The automated processes have moderate real-time requirements of about 100 ms

Manuscript received June 29, 2007; revised December 11, 2007. Paper no. TII-07-06-0094. J. Ploennigs and K. Kabitzsch are with the Department for Computer Science, Dresden University of Technology, D-01062 Dresden, Germany (e-mail: [email protected]; [email protected]). M. Neugebauer is with the ubigrate, D-01187 Dresden, Germany (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TII.2008.919708

and can tolerate small disturbances. Therefore, all common protocols like LON, KNX, and partially BACnet rely on CSMA arbitration [4], due to its good overall performance and easy handling. To design these networks efficiently and economically, off-the-shelf components and software tools are used that permit a function-block-oriented design and automated commissioning. To detect design failures in these network designs, it is common to use performance evaluation methods [5]–[7]. However, these methods need to be automated and seamlessly integrated into the design process for efficient use in practice. This design paradigm is called (control) network performance engineering [8] and is commonly used for wide area networks [9]. The NetPlan tool, which is extended in this article, implements this paradigm for the common building automation network LON [10], [11]. The suite combines an automated modeling approach with fast performance analysis methods and permits efficient use by neither requiring special knowledge nor interrupting the design process significantly [12]. The automated modeling approach [13] creates a system and traffic model from the uniform LNS design database [14] used by most LON design tools. An analytical queuing analysis [5] allows these models to be analyzed within seconds in their mean characteristics. A simulative approach completes this analysis and creates detailed probability distributions if necessary. In a short time, an abundance of performance results is computed, which can only be evaluated by the tool itself, as it is able to use the complex model information to identify problems and their causes and to solve them.

A similar objective of fault diagnosis is pursued by network fault management during operation. It monitors the network performance as a part of the network management. If a failure is noticed, the fault management tries to identify the cause and corrects or avoids the problem.
Artificial intelligence methods are used to train expert systems that accelerate fault identification and correction [15], [16]. Model-based reasoning is one method used [17], [18]. However, in network design no reference system for training is available, but the performance analysis results can be used. Fahmy et al. [19], [20] and Stefano et al. [21], for example, developed expert systems to assist the user in network design. Nevertheless, they used only simple performance measures to advise on general design decisions and did not look in detail for design faults. Therefore, a more systematic approach is needed to model the complex interdependencies of faults and performance measures.

Fig. 1 summarizes the proposed automated process of fault analysis during design. First, faults need to be identified from the performance results. Second, their causes and correlations need to be analyzed. Finally, the user has to be consulted to find a solution. This paper is organized in the same order.



Fig. 1. Process of automated fault analysis during design.

II. IDENTIFICATION OF FAULTS

The identification of faults is more difficult for computers than for humans. Humans are able to interpret ambiguous situations and to use fuzzy terms. Computers need predefined rules, e.g., to answer the question of when a channel is considered overloaded. Basically, a channel cannot cope with the messages to be transmitted once it is utilized at 100%. A network designer would state from experience that in CSMA networks the problems already start to arise at a utilization of 30%, and more than 60% utilization is hard to handle anyway [5], [6]. Although this behavior is typical for CSMA networks, it clearly also depends on the system design and configuration.

To define computer-rateable rules, the transmission requirements of the processes are expressed as quality of service (QoS) requirements. ISO 13236 [22] defines a classification for QoS-requirements. Thomesse [23] and Soucek et al. [1] rate the following QoS-characteristics and specializations as the most important for fieldbus systems:
• TIME—Delay of a connection: the end-to-end delay for the information transmission between two connected datapoints, from the sending to the receiving application. Jitter of a connection: the variance of the delay.
• CAPACITY—Utilization ρ of a channel: the ratio of the used to the maximum available capacity. Interpreting the channel as a station, it is the ratio of the departure rate to the load-dependent mean service rate [5].
• INTEGRITY—Error probability of a connection: the probability that the transmitted information does not reach its receiver even with multiple repeats.
For these QoS-characteristics, the QoS-parameters classify the required limits. Following Thomesse [23], common parameters for any QoS-characteristic are:
• upper threshold;
• upper warn level;
• operating target;
• lower warn level;
• lower threshold.
The meaning of these thresholds needs to be specified further.
In the case of best-effort service, a violation of these thresholds has no consequences. In contrast, hard QoS assures that no sample violates the threshold. However, CSMA-type networks cannot guarantee this for more than one sending device. Hence, soft QoS is more likely to be used, which permits a maximum percentage of cases to violate the threshold.

Let us assume that for each defined QoS-requirement the cumulative distribution function (CDF) of the corresponding performance measure1 was computed by performance analysis, typically by simulation, as analytical methods may only estimate the main characteristics of the CDF.

1It is assumed that the error probability p is a binary value, which is either 0 if the transmission was successful or 1 otherwise. Hence, its thresholds can only be set to 1.

Fig. 2. Application layer model of the example functions.

The probability P_v of a QoS-violation for a soft QoS-requirement is then defined as the share of the CDF F that exceeds the threshold L, hence

P_v = 1 - F(L) for an upper threshold, P_v = F(L) for a lower threshold.    (1)

To identify faults in the design, a QoS-fault is defined as the case that P_v is larger than the tolerance ε for an upper or lower threshold of a QoS-requirement. A QoS-warning holds the same condition for an upper or lower warn level. The tolerance for an operating target is defined as 50%, i.e., it lies within the middle of the distribution (median), which does not necessarily equal the mean value.

The definition of these QoS-requirements for large networks with hundreds of connections can be very time-consuming. In [24], an approach is introduced that distributes these requirements in a network design based on device profiles. The user has to define the QoS-requirements only once for a device profile, and they are then applied automatically to all designs implementing the device. The assignment further applies to all connected devices and connections, assuming that processes with QoS-requirements also require connected processes with at least the same requirement.

The following example illustrates the usage of QoS-requirements and the diagnosis approach introduced below. Fig. 2 shows three processes of a single room control. E1) The fire alarm has to sound within its required delay in case of detected smoke, with less than its permitted failure probability. E2) A closed light control cycle requires a transmission delay smaller than 200 ms in 99% of all cases, otherwise the control cycle becomes unstable. E3) And last, the same light can be switched by a human, who demands with 95% reliability a reaction within 500 ms before he gets irritated. To create a realistic traffic situation, 15 of these example rooms are connected to a single channel. The simulation results of this scenario for a LON TP/FT-10 channel are presented in Table I and will be discussed in the course of this paper. Using the assignment rule from [24], the requirements of the light control apply automatically to its connections, but not to the other connections.
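As a sketch of how such a soft-QoS check could be automated (all function names and the Erlang-like traffic parameters below are illustrative, not part of the NetPlan tool), the violation probability of (1) can be estimated from simulated delay samples and compared against the tolerance:

```python
import random

def qos_violation_probability(samples, upper_threshold):
    """Empirical share of samples exceeding an upper threshold, cf. Eq. (1)."""
    return sum(1 for x in samples if x > upper_threshold) / len(samples)

def classify(p_violation, tolerance):
    """A QoS-fault occurs when the violation probability exceeds the tolerance."""
    return "QoS-fault" if p_violation > tolerance else "fulfilled"

# Example E2: delay threshold 200 ms, tolerance 1% (99% of cases must comply).
rng = random.Random(1)
# Erlang(k=4) delays: sum of four exponential stages (hypothetical parameters).
delays_ms = [sum(rng.expovariate(1 / 12.0) for _ in range(4)) for _ in range(10_000)]
p_v = qos_violation_probability(delays_ms, upper_threshold=200.0)
print(p_v, classify(p_v, tolerance=0.01))
```

The same check applies unchanged to a warn level; only the threshold value differs.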
In the same way, the requirements defined for the fire alarm apply to its connection, and the light switch likewise implies the requirements of its connection. Different message service types (MSTs), typical for LON, are defined in the network design for the connections (Table I). One connection uses a repeated MST that sends multiple copies of the


TABLE I
PERFORMANCE ANALYSIS RESULTS FOR A CHANNEL WITH 15 INSTANCES OF THE EXAMPLE IN FIG. 2
(Columns: arrival rate λ [Msg/s]; bitrate per message [bit/Msg]; used channel bitrate [bit/s]—see (14); error probability p; utilization ρ [%]; delay time [ms]; upper threshold [ms]; tolerance ε [%]; QoS-fault probability P_v [%]; QoS-weight G—see (9).)
Fig. 3. Histogram of the delay time for a transmission. (a) Acknowledged MST. (b) Repeated MST.
information message with a period defined by the repeat-timer. Another connection uses an acknowledged MST that repeats only if no acknowledgment message arrives from the receiver within a transmission-timer. An unacknowledged MST transmits only one message. One connection uses an authenticated MST, a unique MST of LON, that authenticates a sender by exchanging cryptographic keys in four messages [10].

Fig. 3 shows the delay time distributions of two connections resulting from the simulation. The underlying Erlang distribution is typical for delay times. Its reappearance every 50 ms in Fig. 3(b) and every 100 ms in Fig. 3(a) originates from the successful arrival of repeats when the previous messages got lost. The repeated MST is faster on average than the acknowledged MST, because its repeat-timer of 50 ms is smaller than the transmission-timer of the acknowledged MST with 100 ms. The repeated connection can therefore fulfill the delay QoS-requirement of 200 ms with a maximum simulated delay of 194 ms. The connection with acknowledged MST exceeds this threshold with 2.5% of its distribution, and the QoS-requirement fails due to 2.5% > 1%. However, the user is only informed about failed QoS-requirements and can then identify the causes with the tools introduced in the next section.

III. CAUSE ANALYSIS


A. Introduction

If a requirement has not been met or a situation is unsatisfying for the user, it needs to be traced back to its causes to resolve the problem. However, a failed QoS-requirement does not have a single cause, but rather many ambiguous causes, and it is correlated with other faults. For instance, the failed delay-time QoS-requirement above has many reasons: First, the transmission-timer of 100 ms is too high, and the QoS-fault could probably be eliminated if it were reduced or a repeated MST used. Second, 16% error probability for a single transmission is a high value, and many repeats are required to finally transmit the value. The error probability is also the reason for the failure of the QoS-requirements of the other connections, because a delay-time QoS-requirement assumes the information to be transmitted successfully, and the error probability of both connections already exceeds the tolerance. This error probability results from collisions during arbitration, when several devices access the channel at the same time. The collisions are caused by the high channel utilization of 66% to 100% with a mean of 86%, which itself has even more complex reasons that need to be analyzed.

However, more than 1500 performance measures are computed after 2 min of simulation for this example with a total number of 90 devices and 60 connections. Due to this huge amount of results, any user needs help to reconstruct the causes of QoS-faults; otherwise he is unable to comprehend a suggested solution or to develop his own strategy. Fault trees are an appropriate method to handle situations with complex fault dependencies, as they hierarchically decompose faults into causing faults and can be extended with probabilities to rate the causes. Fault trees are commonly used for dependability analysis and are not uncommon in diagnosis [25]–[27] either. They serve three purposes in this paper: First, a fault tree representation can be used by any user to understand the correlation of faults and get helpful hints for problem solving. Second, a probability-extended fault tree can be applied by the user to identify relevant causes. Last, a software implementation can prioritize the causes in the same way to consult the user about efficient ways to solve problems.

The fault trees used in Figs. 4 and 5 do not contain fault events or states in their pure form, but rate performance measurements with fuzzy terms, like a "high throughput." These terms are only used to visualize the proportionality of the causes in the figures. The implementation references the measurements like "throughput" and rates them with probabilities. These cause probabilities are computed from the performance analysis results. However, it is not possible to calculate exact probability values due to the complex interdependencies of the performance measures. Instead, the relations known from analytical models [5] are connected by simple and/or-operators as done in dependability analysis. Then, the same equations used to compute probabilities in dependability analysis are applied to compute the cause probabilities. Finally, these are proportional to the share each cause takes in the failure and permit the user and the tool to easily compare causes. This paper demonstrates the development and weighting of a fault tree using the example of an overloaded channel, which applies to failed utilization requirements and is also a major cause of time and integrity failures; the fault trees for those are of similar size and are therefore discussed in [8].

B. Utilization Requirements (Overload)

An upper threshold utilization requirement demands that the utilization of a channel must not exceed its limit very


often during runtime. If the probability of time slots violating this requirement is larger than the tolerance, a QoS-fault is identified and the channel is overloaded. For example, a capacity QoS-requirement may define that the channel must not be utilized over 80% in more than 30% of its runtime. During the simulation of the example, the channel actually exceeded this limit in more than 73% of the time. Hence, the QoS-requirement fails, as 73% > 30%. In this case, the fault tree in Fig. 4 is instantiated and extended with cause probabilities as explained next. The fault tree is specific to LON, but similar fault trees can be constructed for all CSMA networks.

Fig. 4. Fault tree of the overload (see Fig. 5 for completion).

The utilization ρ of a channel rises proportionally to the ratio of the mean used bitrate b and the maximum bitrate (bandwidth) b_max of a channel, as follows:

ρ = b / b_max.    (2)

Hence, the cause for a high utilization is the proportion between both measures, requiring the used bitrate to be high relative to a low bandwidth. Accordingly, the fault tree in Fig. 4 combines both with an and-operator. However, the measures cannot be separated in their influence, and no cause probability can be computed for either one of them being the exclusive fault cause. To solve this problem, one of two causes in the fault trees that are logically combined by an and-operator will always be defined as an undeveloped event, which means that its cause probability is not computable. It is further assumed that the second cause only occurs together with the first one; then the cause probabilities of the first cause and the upper fault are equal.2 Hence, the cause of a high utilization is a high used bitrate, with equal probability.

The used bitrate of a channel accumulates from the bitrates of the messages passing the channel. The performance analysis assumes that these messages are created by the connections between two datapoints on different devices [12]. Each passing connection i adds its mean bitrate b_i to the channel mean bitrate:

b = Σ_i b_i.    (3)

Equations (2) and (3) explain why the overload fault of a channel can be traced back to the connections using it. However, in the next step it is necessary to compute the cause probabilities to weight the tree. Therefore, the common equations used for dependability analysis of fault trees are applied. Each connection bitrate can be the single cause of the channel bitrate and does not necessarily require another connection to add its bitrate. Thus, the connections are combined in the fault tree by an or-operator. To simplify the computation, it is assumed that their cause probabilities are statistically disjoint.3 Then, the cause probability of each connection has to fulfill

Σ_i P[b_i] = P[b].    (4)

This equation from dependability analysis and (3) from model knowledge need to be linked. Therefore, it is assumed that the relation between the channel and connection mean bitrates equals their relation in fault probability. Then, the cause probability of a connection can be computed via

P[b_i] = (b_i / b) · P[b].    (5)

Drilling down the tree, the number of transmitted messages depends on the MST of the connection. A repeated MST, for example, creates multiple copies of the information message, while an acknowledged MST repeats only if no acknowledgment arrives from the receiver. These messages improve the integrity of the connection, but in the interest of low channel utilization, only the single successful message transporting the information is relevant; all other messages, like acknowledgments or repeats, are overhead. From the results of the performance analysis, the mean used bitrate b_inf of information messages and b_prot of protocol messages can be computed for each connection with mean used bitrate b_i. The probabilities of these causes are also assumed to be statistically disjoint. Similar to (5) results

P[b_inf] = (b_inf / b_i) · P[b_i],  P[b_prot] = (b_prot / b_i) · P[b_i].    (6)

2In common fault tree analysis [26], the probability P[C] of the upper event C is computed from the probabilities of the causing events A and B. For an and-operator: P[C] = P[A ∧ B] = P[A] · P[B | A]. If B only occurs together with A, this simplifies with P[B | A] = 1 to P[C] = P[A].
3For an or-operator: P[C] = P[A ∨ B] = P[A] + P[B] − P[A ∧ B]. If A and B are statistically disjoint, then P[A ∧ B] = 0 and P[C] = P[A] + P[B].
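The proportional weighting of (4) and (5) can be sketched in a few lines (connection names and bitrates below are invented for illustration, not the example's actual values):

```python
def connection_cause_probabilities(p_channel_fault, connection_bitrates):
    """Distribute a channel's fault probability over its connections.

    Each connection's cause probability is proportional to its share of the
    channel's mean used bitrate, cf. Eq. (5); since the or-combined causes
    are assumed statistically disjoint, the shares sum to the channel fault
    probability, cf. Eq. (4).
    """
    total = sum(connection_bitrates.values())
    return {name: p_channel_fault * bitrate / total
            for name, bitrate in connection_bitrates.items()}

# Hypothetical mean bitrates [bit/s] of three connections on a loaded channel.
bitrates = {"fire_alarm": 120.0, "light_control": 640.0, "light_switch": 40.0}
causes = connection_cause_probabilities(0.73, bitrates)
for name, p in sorted(causes.items(), key=lambda kv: -kv[1]):
    print(name, round(p, 4))
```

The same proportional split applies one level down for the information/protocol decomposition of (6).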


Fig. 6. Part of the overload fault tree for the example.

Fig. 5. Fault tree for the protocol overhead created by different message service types detailing Fig. 4.

The bitrate of each connection is influenced by the rate of messages sent and by their sizes. The mean used bitrate of information messages depends, for example, on the mean departure rate of successfully transmitted messages, which is computed from the error probability p and the mean arrival rate λ of the messages to be sent as (1 − p) · λ. With the message size s_inf of the information message, the mean used bitrate is

b_inf = (1 − p) · λ · s_inf,    (7)

leaving a bitrate of overhead messages b_prot = b_i − b_inf. The size of overhead messages is defined by the protocol, and the size of information messages depends on the datapoint variable type (information size) and the MST configuration of the connection. Both are known from the network design, but the information size is usually not changeable and is therefore defined as an undeveloped event in the fault tree, to simplify the and-relation of arrival rate and message size. The arrival rate of overhead messages depends on the MST configuration and the network behavior, for example, on the ratio of the message transmission delay to the transmission-timer, which creates repeats if an acknowledgment is late. Depending on the defined MST, the specific path in Fig. 5 is evaluated further. The arrival rate of information messages depends on the application parameters of the devices. For instance, if the device is a sensor, the sampling parameters determine the arrival rate. If the device processes incoming messages, the number of created messages depends on the number of incoming ones. These relations are modeled in the traffic model used for performance analysis [28] and can therefore be computed in their probabilities analogously to (2)–(6).

The fault trees in Figs. 4 and 5 include further uncommented undeveloped events. These undeveloped events can cause overload, but their cause probabilities cannot be estimated from the performance results. Thus, they are ignored during analysis and only complete the fault tree. The transfer events "large delay times" and "collisions" in Fig. 5 refer to the fault trees for integrity and time QoS-requirements documented in [8].

Fig. 6 compares the cause probabilities of the connections for the example, which are computed by (5) and the bitrates from

Table I. One connection type has the highest cause probability with 2.9%, and the remaining connections are negligible in comparison. This computation can be completed for all causes in the fault trees that are marked with a probability.

C. Fault Prioritization With Safety Levels

Bottlenecks like channels or routers can easily cause many failed QoS-requirements in the connections passing them. To help the user find the important faults in this amount of failures, the faults are prioritized. The priority depends on the safety requirement and the performance of a QoS-requirement. The QoS-requirements of the fire alarm E1 are, for example, always more important than the requirements of the light switch E3. To express these differences, each QoS-requirement gets a safety level assigned. A system with three simple safety levels is used:
• safety requirements bear a risk for humans if they fail (example E1);
• process requirements bear a risk for technical equipment but not for humans (E2);
• comfort requirements bear no risk for equipment or humans, but a failure reduces comfort (E3).
This system can be extended if necessary. The safety levels are used to sort QoS-requirements with the following rules. First, failed QoS-requirements are more important than fulfilled ones. Second, a higher safety level always outranks a lower one, i.e., safety requirements take priority over process and comfort requirements. Third, tightly met requirements take priority over requirements with a higher tolerance.

Two measures are used to implement these rules. The QoS-quality q of a requirement specifies how well a requirement is fulfilled. It ranges from −1 to 1 and is negative if the requirement failed and positive otherwise, as follows:

q = (ε − P_v) / ε  if P_v ≤ ε;  q = (ε − P_v) / (1 − ε)  otherwise.    (8)
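The quality measure and the three sorting rules can be sketched as follows (this assumes a piecewise-linear quality: 1 at no violations, 0 at the tolerance, −1 at certain failure, which is one plausible reading of (8); the requirement data and function names are illustrative):

```python
SAFETY, PROCESS, COMFORT = 3, 2, 1  # higher value = higher safety level

def qos_quality(p_violation, tolerance):
    """QoS-quality in [-1, 1]: positive if fulfilled, negative if failed."""
    if p_violation <= tolerance:
        return (tolerance - p_violation) / tolerance
    return (tolerance - p_violation) / (1.0 - tolerance)

def priority_key(level, q):
    """Sort ascending: failed before fulfilled, higher safety level first,
    then tightly met (small quality) before loosely met."""
    return (q >= 0.0, -level, q)

reqs = [  # (name, safety level, quality) for the three example requirements
    ("E1 fire alarm",   SAFETY,  qos_quality(0.0005, 0.001)),
    ("E2 light control", PROCESS, qos_quality(0.025, 0.01)),
    ("E3 light switch", COMFORT, qos_quality(0.02, 0.05)),
]
for name, level, q in sorted(reqs, key=lambda r: priority_key(r[1], r[2])):
    print(name, round(q, 3))
```

Here the failed process requirement E2 sorts first, ahead of the fulfilled safety requirement E1 and the comfort requirement E3.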

The QoS-weight G implements the hierarchy of the rules stated above by adding constants to the QoS-quality so that the order of the rules is preserved in (9), with the safety level, the QoS-quality q ∈ [−1, 1], and the signum function sgn(q). The QoS-weight is likewise negative if the QoS-requirement is not met. The lower a QoS-weight, the worse the requirement is met and the higher


Fig. 7. Screen shot of the NetPlan diagnosis tool showing a fault tree.

is its priority. Comparing the values for the example in Table I reveals that the connection of E1 has the lowest QoS-weight and therefore the highest need for action. The next section explains how these problems can be resolved.

IV. SOLUTION CONSULTING

The fault trees are useful for examining the most probable causes of a QoS-fault. Each element of the tree implies a solution for the fault; these are marked in the fault trees as boxes with arrows. The solutions are also known to the NetPlan tool in the form of text elements, which are combined to create messages that suggest solutions to the user. They are sorted by their cause probabilities and presented to the user depending on his position in the tree. Fig. 7 shows a screenshot of the NetPlan diagnosis tool.

However, many of the solutions in the fault tree are not very practicable, like changing the message size or the device type. Instead, a handful of best-practice solutions can be applied to solve many QoS problems. If a network contains only a few QoS-faults that are independent of each other, then they can usually be solved by selecting an appropriate MST. A message prioritization is also worth considering. If the number of QoS-faults is high, then usually an overloaded channel or router is the cause, as in this example, and its load needs to be reduced. Therefore, a network-wide adaptation of the MSTs can be a first step. If this is not sufficient, the traffic created by the devices needs to be reduced. Fig. 8 summarizes this workflow, and the following subsections introduce the specific consulting tools.

The parameterization of these solutions is complex and should be selected wisely to prevent cross-effects in the network that would cause new faults. Special MSTs and prioritization, for example, will always decrease the performance of other messages, and any change of application parameters can affect the process quality, which only the user can judge.
In fact, many of these solutions are complex multi-objective optimization problems with multiple Pareto-optimal solutions. Hence, it is necessary for the user to weight his objectives, i.e., is it acceptable to decrease the performance of one requirement to improve another, more important one? To simplify this decision, two strategies are used. First, multiple solutions are always rated and recommended to the user. Second, the QoS-weight and the safety levels of the last sections are used to gracefully distribute any degradation to elements that are

Fig. 8. Workflow to solve a QoS-fault.

least affected by it. Therefore, the following rules are applied.
• Safety requirements have to be met. Process and communication parameters of these connections are only permitted to be changed in their own interest.
• Process requirements are recommended to be met. Process and communication parameters of these connections should only be changed in their own interest.
• Comfort requirements should be met, if possible. Parameters can be adapted in the interest of other requirements.

A. Message Service Type Consulting

The message service type can strongly influence the performance of a message and of other messages. An unacknowledged MST, for example, is transmitted only once and creates minimum load. However, if the message gets lost due to an overloaded message queue, a collision, or a bit error, then the contained information is lost. An acknowledged MST avoids this by repeating messages if it does not get an acknowledgment from the receiver in time and therewith has a higher integrity at the cost of more load (compare Table I). With a repeated MST, the information is sent multiple times without waiting for any acknowledgment. It is faster than an acknowledged MST but usually creates more load in LON (compare Table I). However, if the message uses group addressing and has more than four receivers, then the load created by a repeated MST with three repeats is usually lower than for an acknowledged MST. Table II collects these generalized rules for selecting the appropriate MST, but an analysis of the effects is always advised.
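The qualitative selection rules above (minimal load, integrity, group addressing) could be encoded as a simple heuristic; the thresholds and function names below are illustrative and do not reproduce Table II's exact contents:

```python
def recommend_mst(needs_integrity, needs_low_delay, receivers=1, group_addressing=False):
    """Heuristic MST choice following the generalized rules in the text."""
    if not needs_integrity:
        return "unacknowledged"   # single message, minimum load
    if group_addressing and receivers > 4:
        return "repeated"         # repeats cheaper than many acknowledgments
    if needs_low_delay:
        return "repeated"         # no waiting for acknowledgments
    return "acknowledged"         # integrity at the cost of extra load

print(recommend_mst(needs_integrity=False, needs_low_delay=False))
print(recommend_mst(needs_integrity=True, needs_low_delay=False,
                    receivers=6, group_addressing=True))
```

As the text stresses, such a rule of thumb only preselects a candidate; the effects still need to be verified against the QoS-requirements.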


The message service type consulting supports the user in selecting an MST by automatically parameterizing different MSTs, computing their effects on the load, time delay, and integrity, and comparing them to the assigned QoS-requirements. The effects of the different MSTs are not computed in separate runs of the performance analysis, but estimated from existing results.

This automatic parameterization of MSTs and estimation of effects is demonstrated for the error probability with an integrity and a delay QoS-requirement. Let $p_e$ be the error probability for the transmission of a single message containing the information. The final error probability of an (un)acknowledged or repeated message is then

$p_e^{r+1} \le \min(p_{req,I},\, p_{req,D})$ (10)

if $r$ is the maximum number of retries. This probability has to be lower than the required error probability $p_{req,I}$ of the integrity requirement and lower than the defined error probability $p_{req,D}$ of the delay QoS-requirement, which requires a message to reach its target. The left part of the inequation (10) is used to estimate the error probability for acknowledged, repeated, and unacknowledged MSTs, and with (8) the QoS-quality of each MST is computed for the integrity and the delay requirement. The results are presented in a table to the user, who can then choose an appropriate MST.

The computation has to estimate the parameters of an MST, including the maximum number of repeats and any transmission- and repeat-timers. In [11] it is recommended to set the transmission-timer to permit 99% of all packets to arrive in this time. However, this reduces the quality of all timing QoS-requirements, as the transmission times will be large for multiple retries. Instead, the parameters are set optimally depending on the given requirements. The number of repeats for the repeated MST is computed with a safety factor $c$ by rearranging (10) into

$r = \left\lceil c \left( \frac{\ln \min(p_{req,I},\, p_{req,D})}{\ln p_e} - 1 \right) \right\rceil.$ (11)

The repeat- or transmission-timers are adjusted to permit the last repeated message to reach the target in the required time $d_{req}$. Let $\bar{d}_1$ be the transmission delay of a single message and $t_{rpt}$ be a transmission- or repeat-timer. The mean delay of the first message is $\bar{d}_1$. If this message fails with the probability $p_e$, it has to be recovered by the next repeat, which itself has a success probability $1-p_e$. As the repeat is sent after the timer expires, it has a mean delay $t_{rpt}+\bar{d}_1$. Continuing this sequence for all repeats allows estimating the communication mean delay

$\bar{d} = \sum_{i=0}^{r} p_e^{i}\,(1-p_e)\,(\bar{d}_1 + i\, t_{rpt}).$ (12)

The repeat-timer now needs to permit a mean delay $\bar{d}$ that is significantly smaller than $d_{req}$. Rearranging the inequation $\bar{d} \le d_{req}$ with $\bar{d}$ from (12) allows estimating the repeat- and transmission-timer

$t_{rpt} \le \frac{d_{req} - (1 - p_e^{r+1})\,\bar{d}_1}{(1-p_e)\sum_{i=1}^{r} i\, p_e^{i}}.$ (13)

TABLE II RECOMMENDED MSTS DEPENDING ON THE QOS-CHARACTERISTIC
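The retry and timer parameterization described above can be sketched numerically. The following is a hedged reconstruction: the names `p_e`, `p_req`, `d1`, and `t_rpt`, and the closed-form expressions, are this sketch's assumptions based on the description, not the paper's verbatim formulas.

```python
import math

def min_retries(p_e: float, p_req: float, safety: float = 1.0) -> int:
    """Smallest retry count r with p_e**(r+1) <= p_req, optionally
    scaled by a safety factor (cf. the rearrangement of the error
    probability inequation)."""
    if p_e <= 0.0:
        return 0
    r = math.log(p_req) / math.log(p_e) - 1.0
    return max(0, math.ceil(safety * r))

def mean_delay(d1: float, t_rpt: float, p_e: float, r: int) -> float:
    """Mean communication delay: attempt i (0 = original message)
    succeeds with probability p_e**i * (1 - p_e) and arrives after
    i timer expirations plus one message transmission."""
    return sum((p_e ** i) * (1.0 - p_e) * (d1 + i * t_rpt)
               for i in range(r + 1))

def max_timer(d1: float, d_req: float, p_e: float, r: int) -> float:
    """Largest repeat-timer keeping the mean delay within d_req,
    obtained by rearranging the mean-delay sum for t_rpt."""
    weight = sum((p_e ** i) * (1.0 - p_e) * i for i in range(1, r + 1))
    success = 1.0 - p_e ** (r + 1)
    return (d_req - success * d1) / weight

# Example: 8% single-message loss, integrity requirement 1e-4,
# 20 ms message delay, 100 ms required mean delay
r = min_retries(0.08, 1e-4)          # -> 3 retries
t = max_timer(20.0, 100.0, 0.08, r)  # largest admissible timer
```

With these example numbers, three retries give a final error probability of 0.08^4, approximately 4.1e-5, which satisfies the assumed integrity requirement.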

The MST consulting solves some problems in the example. After consultation, the user decided to change to an acknowledged MST with nine retries, which creates less load and has a higher integrity than the authenticated MST. An acknowledged MST with three repeats is enough for connection . However, neither connection nor can be relieved by an acknowledged or a repeated MST, as both MSTs create too much traffic.

B. Prioritization Consulting

Prioritization is another possibility to increase the quality of connections, as prioritized messages have separate queues and access slots in the MAC. However, these slots are reserved in front of all messages, which increases the utilization. Further, prioritized messages can block non-prioritized messages, and if one priority is assigned multiple times, the error probability increases dramatically. These strong influences on the network behavior make it necessary to evaluate the performance of any prioritization configuration. This would require a high computation effort, but it is possible to preselect the most promising configurations depending on the QoS-weight of the messages.

The priority is defined in LON by two parameters: first, a flag marks a connection as prior. Second, each device on the route of this message needs an assigned priority slot to transmit the respective message. For each device or router, the QoS-weight of each passing connection is accumulated. The priority slots are then assigned in ascending order of the accumulated QoS-weight of the devices in a channel, starting with the lowest weight. Connections with a QoS-fault are then set prior if they pass at least one device with an assigned slot. In the end, the configuration is evaluated via a new performance evaluation. This is done for a growing number of priority slots to find the best configuration with the highest accumulated QoS-weight.

Prioritization consulting cannot solve the problem in the example.
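The slot-assignment preselection described above can be sketched as a minimal Python fragment. The connection record fields (`route`, `qos_weight`, `faulty`) are illustrative assumptions, not names from the paper or the LON tooling.

```python
from collections import defaultdict

def assign_priority_slots(connections, num_slots):
    """Accumulate the QoS-weight of all connections passing each
    device, give the available priority slots to the devices in
    ascending order of accumulated weight (lowest weight first),
    and mark faulty connections as prior if they pass at least one
    device holding a slot."""
    weight = defaultdict(float)
    for conn in connections:
        for device in conn['route']:
            weight[device] += conn['qos_weight']

    # devices ranked by ascending accumulated QoS-weight
    ranked = sorted(weight, key=lambda d: weight[d])
    slotted = set(ranked[:num_slots])

    prior = [conn for conn in connections
             if conn['faulty'] and slotted.intersection(conn['route'])]
    return slotted, prior
```

The configuration returned for each growing `num_slots` would then be fed into a new performance evaluation, as described above.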
The simulated TP/FT-10 channel permits only four priority slots, which would improve the quality of only a few connections and degrade the channel performance even more. The prioritization consulting works better if more channels with diverse devices are used.

C. Load Reduction With MSTs and Priorities

An overloaded channel or router causes multiple QoS-faults. The example shows that assigning special MSTs and priorities does not help, but only increases the gap in the QoS-quality of the connections. Instead, it is worth considering changing to unacknowledged MSTs and removing priorities to reduce the load and therewith increase the overall quality. This strategy should not be underestimated, as many users permanently use acknowledged or repeated MSTs in the false hope of increasing the integrity of the connections.

The MST and priority reduction tries to solve this situation by using special MSTs and priorities only if necessary. First, the MSTs of all connections are set to unacknowledged MSTs and all priorities are removed. Only authenticated MSTs are left
unchanged, due to their special function. In the next step, the MSTs and priorities are assigned to the most demanding connections with negative QoS-weights, using the algorithms from Sections IV-A and IV-B. The performance of this configuration is evaluated, and it is checked whether the QoS-faults could be solved or new faults emerged. This is repeated until all QoS-faults have vanished or the remaining QoS-faults cannot be removed with this algorithm. The algorithm usually converges within a few runs of the performance analysis.

For instance, the mean channel utilization of the example will drop to 31% if the MSTs of all connections are changed to unacknowledged. Despite this acceptable channel utilization, the 8% error probability of the connections is too high to satisfy their QoS-requirements. Selecting adequate MSTs results in the configuration of Section IV-A, which could not solve all faults. Hence, the QoS-faults in the example can only be removed if the number of messages is reduced.
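The reduction procedure above can be outlined as follows. Here `evaluate` and `upgrade` are placeholders for the performance analysis and the Section IV-A/IV-B consulting, which are not reproduced, and the record field names are illustrative assumptions.

```python
def reduce_msts_and_priorities(connections, evaluate, upgrade, max_rounds=10):
    """Use special MSTs and priorities only where necessary.

    connections: list of dicts with 'mst', 'priority', 'qos_weight'.
    evaluate: runs the performance analysis and returns the set of
              indices of connections with QoS-faults.
    upgrade:  assigns a special MST or priority to one connection
              (standing in for the Section IV-A / IV-B consulting).
    """
    # Step 1: lightest configuration; authenticated MSTs keep their
    # special function and stay unchanged
    for conn in connections:
        if conn['mst'] != 'authenticated':
            conn['mst'] = 'unacknowledged'
        conn['priority'] = False

    previous = None
    for _ in range(max_rounds):
        faults = set(evaluate(connections))
        if not faults or faults == previous:
            break  # all faults solved, or no further progress
        # Step 2: the most demanding faulty connections (most
        # negative QoS-weight) are upgraded first
        demanding = sorted((i for i in faults
                            if connections[i]['qos_weight'] < 0),
                           key=lambda i: connections[i]['qos_weight'])
        for idx in demanding:
            upgrade(connections[idx])
        previous = faults
    return connections
```

The loop mirrors the convergence behavior noted above: it stops as soon as a performance evaluation reports no faults or no further progress is made.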

D. Load Reduction by Adaptation of Sampling Parameters

If the previous algorithms could not solve the overload problem, then the number of messages created by the device applications needs to be reduced. The intuitive approach is to start with the connections with the highest impact on the load. Table I lists the mean used channel bitrate $\bar{b}_c$ for each connection. It also contains the mean used bitrate per information message $\bar{b}_I$, which expresses the mean bitrate created by each transmitted information message and is computed from the connection bitrate and the arrival rate $\lambda_I$ of the information messages via

$\bar{b}_I = \bar{b}_c / \lambda_I.$ (14)

Connection , for example, causes in mean 900 bit for each information message. Reducing its arrival rate would have a high impact on the channel load. However, it is safety critical and should not be changed, due to the rules defined in Section IV. Connection , on the other hand, uses the highest channel bitrate , but it is not possible to reduce the arrival rate of , as its source device only processes the incoming messages from . This situation is characteristic: often the arrival rate of the most active devices cannot be changed.

The adaptation of sampling parameters aims to avoid this problem and identifies the devices that originally caused the traffic. For this purpose, a simple device traffic model from previous works [12], [28] is used. It assumes that the outgoing messages of an output datapoint are either created by processing messages from input datapoints or are created independently. This process model allows tracing the cause of a message over multiple processing devices back to the original source datapoint, which usually is a sensor. During this tracing back, the bitrate that each datapoint creates with its connections is accumulated. It is used to rank the datapoints by their influence on the network load. All datapoints on which safety-critical QoS-requirements depend are removed from this ranking. Also, the QoS-weight is taken into account to rank datapoints with free resources in their QoS-requirements higher. The resulting ranking allows the user to easily select the most influencing, safest (in terms of the safety levels), and most promising (with free resources) datapoints for reducing the arrival rate. These are usually output datapoints of sensors; their arrival rate can be easily reduced by adapting the sampling parameters. This has to be done by the user, as only he knows the tolerance of the processes.

In the simple example of Fig. 1, the datapoint of device is responsible for the bitrate of the connections and creates about 931 bit/s (472 bit/Msg). Hence, it is the wisest choice to reduce the arrival rate of this device by adapting its sampling parameters. In contrast, connection has only comfort requirements, but such a small influence on the channel bitrate that it is not worth changing it.
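The datapoint ranking described above can be sketched as follows. The record fields (`bitrate`, `rate`, `safety_critical`, `qos_slack`) and the per-message division are illustrative assumptions based on the description of (14), not the paper's data model.

```python
def rank_datapoints(datapoints):
    """Rank source datapoints by the channel bitrate they cause.

    datapoints: list of dicts with
      'bitrate'         - accumulated bitrate (bit/s) traced back to
                          this source datapoint,
      'rate'            - information message rate (msg/s),
      'safety_critical' - True if safety-critical QoS depends on it,
      'qos_slack'       - free resources in its QoS-requirements.
    Returns candidates sorted by influence, then by slack.
    """
    candidates = [dp for dp in datapoints if not dp['safety_critical']]
    for dp in candidates:
        # bit per transmitted information message
        dp['bits_per_msg'] = dp['bitrate'] / dp['rate']
    return sorted(candidates,
                  key=lambda dp: (dp['bitrate'], dp['qos_slack']),
                  reverse=True)

# Example resembling Fig. 1: a sensor creating 931 bit/s
sensors = [
    {'bitrate': 931.0, 'rate': 931.0 / 472.0,
     'safety_critical': False, 'qos_slack': 0.4},
    {'bitrate': 120.0, 'rate': 0.5,
     'safety_critical': False, 'qos_slack': 0.9},
    {'bitrate': 900.0, 'rate': 1.0,
     'safety_critical': True, 'qos_slack': 0.0},
]
ranked = rank_datapoints(sensors)
# ranked[0] is the 931 bit/s sensor, ~472 bit per information message
```

Safety-critical datapoints are filtered out before ranking, matching the rule that their arrival rates must not be touched.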

V. CONCLUSION

The introduced diagnosis approach identifies design faults by evaluating performance analysis results against QoS-requirements. To support the user in solving detected problems, two methods are proposed. First, fault trees permit the user to easily comprehend the causes and correlation of faults and to get a weighted survey of problem-solving strategies. Second, specialized consulting tools estimate the impact of common strategies on the performance and suggest the best parameters. The MST and priority consulting focuses on improving the quality of single connections, while the overall load reduction is addressed by the MST and priority reduction and by the adaptation of sampling parameters. The introduced algorithms avoid the multi-objective optimization problem by classifying and weighting the QoS-requirements with safety levels. The performance of the approach is improved by reusing results.

The QoS-faults of the connections in the introduced example can finally be solved with a combination of the introduced methods. Therefore, the MST configuration is changed as in Section IV-A and the load is reduced according to Section IV-D by adapting the interarrival time of source from 500 ms to 600 ms. However, the channel is still utilized at 80% in mean, letting its QoS-requirement fail. This problem can finally be solved by placing a router to split the channel in two, which drops the utilization to 43%. A consulting tool to support the user in router placement is under development, using the connectivity information from the traffic model. Also, consultation in the parameterization of networked control cycles is considered, as the performance measurements, namely the delay times, jitter, and error probabilities, permit detailed simulations of control cycle performance.

To apply this approach, the user needs to specify his QoS-requirements. He is supported in this task by device models and automatic distribution of requirements [24].
The quality of the diagnosis depends primarily on the quality of the performance results and models. The automated model generation from design databases [13] provides such high-quality models. The automated diagnosis and solution consulting complete these approaches to create a control network performance engineering [8], which supports the user to easily identify design errors, bottlenecks, as well as oversized segments and, therefore, save the expenses of overdimensioning and redesign.

REFERENCES

[1] S. Soucek and T. Sauter, "Quality of service concerns in IP-based control systems," IEEE Trans. Ind. Electron., vol. 51, no. 6, pp. 1249–1258, Dec. 2004.
[2] J. Vinyes, E. Vazquez, and T. Miguel, "Throughput analysis of p-CSMA based LonTalk protocols for building management systems," in Proc. MELECON—8th Mediterranean Electrotechnical Conf., May 13–16, 1996, vol. 3, pp. 1741–1744.

[3] J. Jasperneite, P. Neumann, M. Theis, and K. Watson, "Deterministic real-time communication with switched Ethernet," in Proc. WFCS—4th IEEE Int. Workshop Factory Communication Systems, Vasteras, Sweden, Aug. 27–30, 2002, pp. 11–18.
[4] W. Kastner, G. Neugschwandtner, S. Soucek, and H. M. Newman, "Communication systems for building automation and control," Proc. IEEE, vol. 93, no. 6, pp. 1178–1203, Jun. 2005.
[5] P. Buchholz and J. Plönnigs, "Analytical analysis of access-schemes of the CSMA-type," in Proc. WFCS—5th IEEE Int. Workshop Factory Communication Systems, Vienna, Austria, 2004, pp. 127–136.
[6] M. Miskowicz, M. Sapor, M. Zych, and W. Latawiec, "Performance analysis of predictive p-persistent CSMA protocol for control networks," in Proc. WFCS—4th IEEE Int. Workshop Factory Communication Systems, Vasteras, Sweden, 2002, pp. 249–256.
[7] F. L. Lian, J. R. Moyne, and D. M. Tilbury, "Performance evaluation of control networks: Ethernet, ControlNet, and DeviceNet," IEEE Control Syst. Mag., vol. 21, no. 1, pp. 66–83, Feb. 2001.
[8] J. Ploennigs, Control Network Performance Engineering, ser. Informationstechnik. Dresden, Germany: Vogt Verlag, 2007.
[9] R. G. Cole and R. Ramaswamy, Wide Area Data Network Performance Engineering. Norwood, MA: Artech House, 1999.
[10] D. Dietrich, D. Loy, and H.-J. Schweinzer, Open Control Networks. Boston, MA: Kluwer, 2001.
[11] EN 14908—Open Data Communication in Building Automation, Controls and Building Management, 2005.
[12] J. Ploennigs, P. Buchholz, M. Neugebauer, and K. Kabitzsch, "Automated modeling and analysis of CSMA type access-schemes for building automation networks," IEEE Trans. Ind. Informat., vol. 2, no. 2, pp. 103–111, May 2006.
[13] J. Ploennigs, M. Neugebauer, and K. Kabitzsch, "Automated model generation for performance engineering of building automation networks," Int. J. Softw. Tools Technol. Transfer (STTT), vol. 8, no. 6, pp. 607–620, Nov. 2006.
[14] LNS Network Operating System, Echelon Corp., 2006. [Online]. Available: http://www.echelon.com/lns
[15] I. Katzela and M. Schwartz, "Schemes for fault identification in communication networks," IEEE/ACM Trans. Netw., vol. 3, no. 6, pp. 753–764, 1995.
[16] S. Keshav and R. Sharma, "Achieving quality of service through network performance management," in Proc. NOSSDAV—8th Int. Workshop Network and Operating Systems Support for Digital Audio and Video, Cambridge, U.K., Jul. 1998.
[17] W. Hamscher, L. Console, and J. de Kleer, Eds., Readings in Model-Based Diagnosis. San Francisco, CA: Morgan Kaufmann, 1992.
[18] R. Isermann and P. Ballé, "Trends in the application of model-based fault detection and diagnosis of technical processes," Control Eng. Pract., vol. 5, no. 5, pp. 709–719, 1997.
[19] H. I. Fahmy and C. Douligeris, "An integrated AI approach for automating networks design, modeling and simulation," in Proc. 2nd IEEE Symp. Computers and Communications, Los Alamitos, CA, 1997, pp. 339–343.
[20] H. Fahmy, G. Develekos, and C. Douligeris, "Application of neural networks and machine learning in network design," IEEE J. Sel. Areas Commun., vol. 15, no. 2, pp. 226–237, Feb. 1997.
[21] A. D. Stefano, L. L. Bello, and O. Mirabella, "Some issues concerning fieldbus design by means of an expert system," Comput. Ind., vol. 32, no. 3, pp. 305–318, Mar. 1997.
[22] ISO/IEC 13236—Quality of Service Framework, 1995.
[23] J.-P. Thomesse, "Fieldbuses and quality of service," in Proc. Controlo—5th Portuguese Conf. Automatic Control, Aveiro, Portugal, 2002, pp. 10–14.
[24] J. Ploennigs, M. Neugebauer, and K. Kabitzsch, "Fault analysis of control networks designs," in Proc. ETFA—10th IEEE Int. Conf. Emerging Technologies and Factory Automation, Catania, Italy, Sep. 19–20, 2005, vol. 2, pp. 477–484.

[25] R. Isermann, "Model based fault detection and diagnosis methods," in Proc. American Control Conf., Seattle, WA, Jun. 21–23, 1995, vol. 3, pp. 1605–1609.
[26] C. J. Price, Computer-Based Diagnostic Systems. Heidelberg, Germany: Springer, 1999.
[27] T. Assaf and J. B. Dugan, "Diagnostic expert systems from dynamic fault trees," in Proc. Annu. Symp. Reliability and Maintainability, 2004, pp. 444–450.
[28] J. Plönnigs, M. Neugebauer, and K. Kabitzsch, "A traffic model for networked devices in the building automation," in Proc. WFCS—5th IEEE Int. Workshop Factory Communication Systems, Vienna, Austria, 2004, pp. 137–145.

Joern Ploennigs (M'06) received the Dipl.-Ing. degree in electrical engineering for automation and control and the Ph.D. degree (with a thesis on control network performance engineering) from the Dresden University of Technology, Dresden, Germany, in 2001 and 2007, respectively. Since 2002, he has been a Research Assistant at the Chair for Technical Information Systems, working in the areas of network performance engineering, fault analysis, and component-based design for building automation networks and wireless sensor networks.

Mario Neugebauer (M'07) received the diploma and doctoral degrees in electrical engineering and computer science from the TU Dresden, Dresden, Germany, in 2002 and 2007, respectively. From 2002 to 2006, he was with the Chair for Technical Information Systems, working on performance evaluation of control networks, wireless sensor networks, and embedded systems. From 2004 to 2007, he worked for the SAP Research CEC Dresden, focusing on product lifecycle management and manufacturing. In 2008, he cofounded the start-up ubigrate, which focuses on smart device integration in manufacturing and various other application domains.

Klaus Kabitzsch (M'05) received the Diploma and Ph.D. degrees in electrical engineering and communications technology from the Ilmenau University of Technology, Ilmenau, Germany, in 1982. He became a Professor and Head of the Department of Technical Computer Sciences, Dresden University of Technology, Dresden, Germany, in 1993. His current projects focus on wireless networks and their application in the automation domain, component-based software design, performance engineering, and design for networked building automation. He is a member or chair of various national and international organizations and founded the fieldbus competence center in Dresden in 1995 and the SAP ubiquitous computing laboratory there in 2004.