USING NEURAL NETWORK IN DISTRIBUTED MANAGEMENT TO IDENTIFY CONTROL AND MANAGEMENT PLANE POISON MESSAGES*

Xiaojiang Du, Mark A. Shayman
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD

Ronald A. Skoog
Telcordia Technologies, Red Bank, NJ

ABSTRACT

Poison message failure propagation is a mechanism that has been responsible for large-scale failures in both telecommunications and IP networks: some or all of the network elements have a software or protocol 'bug' that is activated on receipt of a certain network control or management message (the poison message). This activated 'bug' will cause the node to fail with some probability. If the network control or management is such that this message is persistently passed among the network nodes, and if the node failure probability is sufficiently high, large-scale instability can result. Our previous research has focused on a centralized network management paradigm. In centralized management, one of the effective tools for dealing with the poison message failure is the neural network approach. However, a centralized scheme cannot be applied if the network is partitioned into several subnetworks by node failures. In this paper, we consider distributed management for the poison message problem. In particular, we use the neural network approach in a distributed way to identify the poison message.

1. INTRODUCTION

Large networks relying on real-time processing can be driven into unstable modes of operation (e.g., routing system failures, routing flaps, congestion and deadlock scenarios, system crash chain reactions, etc.). In the past, unintentional system faults have led to frame relay networks, SS7 signaling networks, and PSTNs going into unstable modes that have caused major service disruptions. A serious concern is that a malicious party could induce similar instabilities. The vulnerability of a network to instabilities may be due to unrecognized design flaws or hidden software bugs. Since these details are not known in advance, effective control mechanisms tailored to the specifics of the vulnerability are virtually impossible to achieve, and new defects are constantly introduced. However, it is our contention that there are a limited number of "generic propagation mechanisms" that enable these network instabilities to occur. By enumerating these propagation mechanisms and designing network management and control mechanisms to mitigate them, it would be possible to stabilize networks against malicious attack even when the details of the network vulnerability being exploited are unknown.

There are several failure propagation mechanisms that can cause network instability. Five generic propagation mechanisms have been identified thus far [4]:

- System failure propagation via routing updates
- System failure propagation from management or control plane 'poison message'
- System failure propagation from data plane 'invalid message'
- Congestion propagation from congestion back pressure
- Deadlocks created from overload and timeouts

In this paper we focus on one of these mechanisms, which we refer to as the 'poison message' failure propagation mechanism.

1.1 THE POISON MESSAGE FAILURE PROPAGATION PROBLEM

We have described the generic mechanism of the poison message failure in our previous work [1, 2, 3]. For the reader's convenience, it is stated here again: a trigger event causes a particular network control or management message (the poison message) to be sent to other network elements. Some or all of the network elements have a software or protocol 'bug' that is activated on receipt of the poison message. This activated 'bug' will cause the node to fail with some probability. If the network control or management is such that this message is persistently passed among the network nodes, and if the node failure probability is sufficiently large, large-scale instability can result.

Previous efforts to address major network outages have focused on areas such as software reliability, disaster

*Research supported by DARPA under contract N66001-00-C-8037.

(C) 2003 IEEE

prevention and recovery, network topological design, network engineering, and congestion control. In relation to fault propagation, the main idea that has previously been considered is software diversity [5]. However, whenever network providers have studied software diversity, they have determined that achieving it would be too costly and unmanageable. Furthermore, software diversity does not address flaws in a standardized protocol.

We have designed a fault management framework that can effectively identify the poison message type and block the propagation of the poison message until the network is stabilized. The fault management framework is a centralized management paradigm, and it includes passive diagnosis and active diagnosis. There are three main approaches in passive diagnosis:

- Finite State Machine (FSM) approach. It compares the message sequence from a failed node with the corresponding protocol's FSM model to detect a possible fault.
- Correlating Message approach. This is an event correlation technique. It analyzes messages from multiple failed nodes to deduce the possible fault.
- Neural Network approach. Different message types have different failure propagation patterns. This approach uses a neural network to exploit the node failure pattern and hence identify the poison message type.

Passive diagnosis generates a probability distribution vector over the candidate poison message types, and this distribution is used in active diagnosis. In active diagnosis, filters are dynamically configured to block suspect message types, which we refer to as message filtering. Message filtering must take into account tradeoffs involving the time to complete diagnosis, the degradation of network performance due to poison message propagation, and the cost to network performance of disabling each of those message types. Each decision on filter configuration leads to further observations, which may call for changing the configuration of the filters. Message filtering is formulated as a sequential decision problem [2].

1.2 EXPLOIT NODE FAILURE PATTERN

Different message types have different node failure propagation patterns. One way to exploit the node failure pattern is to use a neural network classifier. The neural network is trained via simulation. A simulation testbed can be set up for a communication network, with the same topology and protocol configuration as the real network. Then, for each message type used in the network, the poison message failure is simulated, and the simulation is run with the probability of node failure taking on different values. After the neural network is trained, it is applied by using the node failure sequence as input; a pattern match score is the output. Our previous work has demonstrated that the neural network approach is very effective in identifying the poison message. In most cases, the neural network outputs a good probability distribution over the candidate message types, i.e., it assigns the maximum probability to the true poison message type. In addition, the neural network can be combined with the sequential decision problem to further identify the poison message.
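The training-data generation described above can be sketched in a simplified form. The ring topology, the propagation rule, and the failure probability below are illustrative assumptions, not the paper's OPNET configuration; the point is that each simulated run yields a sequence of node-status vectors from which training inputs are drawn.

```python
import random

# Hypothetical sketch: a poison message is flooded over a ring of nodes,
# and each node receiving it fails with probability p_fail.

def simulate_failure(n_nodes, p_fail, steps, seed=0):
    rng = random.Random(seed)
    failed = [0] * n_nodes          # 0 = normal, 1 = failed
    poisoned = {0}                  # trigger event at node 0
    history = []
    for _ in range(steps):
        history.append(list(failed))
        nxt = set()
        for i in poisoned:
            if not failed[i] and rng.random() < p_fail:
                failed[i] = 1       # bug activated -> node fails
            # pass the poison message to ring neighbours
            nxt.update({(i - 1) % n_nodes, (i + 1) % n_nodes})
        poisoned = nxt
    return history                  # node-status vectors S_0 .. S_{steps-1}

history = simulate_failure(n_nodes=10, p_fail=0.5, steps=5)
# consecutive pairs (S_{k-1}, S_k) form one training input each
samples = [history[k - 1] + history[k] for k in range(1, len(history))]
```

Running the simulation for several values of `p_fail` and for each message type would produce the labeled failure patterns used to train the classifier.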

2. DISTRIBUTED NEURAL NETWORK APPROACH

Since the mid 1990s, network management has steadily evolved from a centralized paradigm, where all the management processing takes place in a single management station, to distributed paradigms, where management is distributed over a potentially large number of nodes [11]. A distributed management system has higher reliability and efficiency, as well as lower communication overhead, than a centralized one. Besides, there are other motivations to consider distributed management for the poison message problem. When the poison message failure happens, many nodes will fail and the network could be partitioned into several subnetworks. The central manager will then not have all the global information (e.g., node status, messages exchanged by some failed nodes) and will not be able to perform centralized management. So it is very important to consider distributed management schemes for the poison message failure.

We used the neural network approach to deal with the poison message problem in the centralized management paradigm [1], and our simulations show that this approach works well for small networks. However, the neural network approach might not work well for a large network: as in [1], the number of inputs of the neural network is twice the number of network nodes, so when the network is very large, the neural network becomes very large as well. The weights of a large neural network may not converge to a good parameter matrix and hence may not provide good output for the node failure pattern recognition. This is another reason to consider distributed management for the poison message failure.


We can apply the neural network approach in a distributed way. The communication network is divided into several subnetworks, with a manager station in each subnetwork. Manager stations can communicate with each other if there is no disconnection. A neural network is trained for each subnetwork based on the data from that subnetwork. When the poison message failure happens, each neural network outputs a distribution score over the message types based on the node status in the corresponding subnetwork. Since each neural network only applies to a subnetwork, the size of these neural networks is not very large, and there should be no concern about neural network convergence.
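When previously disconnected subnetworks are later reconnected, the manager stations can pool their filtering histories to narrow down the poison message type: any type that was blocked in some subnetwork while failures still propagated there cannot be the poison message. A minimal sketch, assuming five message types (the type numbers are illustrative):

```python
# Illustrative rule-out logic for reconnected manager stations.
ALL_TYPES = {1, 2, 3, 4, 5}

def remaining_suspects(*filter_histories):
    """Each argument is the set of message types one subnetwork filtered
    while failure propagation nevertheless continued there."""
    ruled_out = set().union(*filter_histories)
    return ALL_TYPES - ruled_out

# subnetwork 1 blocked type 1, subnetwork 2 blocked type 2,
# yet failures continued in both subnetworks
suspects = remaining_suspects({1}, {2})   # -> {3, 4, 5}
```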

If there is no network disconnection, the outputs from all the neural networks are combined to give a better diagnosis. If network disconnection happens, each subnetwork makes a decision based on the output from its own neural network and filters messages by itself. Later, when the network is reconnected, useful past information can be exchanged among the manager stations. For example, suppose subnetwork 1 is disconnected from subnetwork 2 by node failures. During the disconnection, subnetwork 1 filtered message type 1 and subnetwork 2 filtered message type 2, but the failure propagation continued in both subnetworks. When subnetworks 1 and 2 are reconnected, the two manager stations can exchange their past filtering history and rule out message types 1 and 2.

3. SIMULATIONS AND TEST RESULTS

We have constructed an OPNET simulation testbed for a relatively large network where BGP, LDP and OSPF can carry poison messages. There are fifty routers in the testbed. The large network is divided into five subnetworks: subnetwork Center with 14 nodes, subnetwork Atlanta with 9 nodes, subnetwork Dallas with 10 nodes, subnetwork DC with 8 nodes, and subnetwork Portland with 9 nodes. The topology of the testbed is given in Figure 1.

Figure 1. The OPNET Testbed

We constructed five neural networks, one for each subnetwork. The neural networks are trained with the simulation data from the corresponding subnetwork. Then each neural network is tested with new data.

3.1 VOTING ALGORITHM FOR MULTIPLE NEURAL NETWORKS

Each neural network outputs a distribution score vector for the corresponding subnetwork. When the network is not disconnected, the outputs should be combined to provide better results. An important reason to combine the outputs is that, in our previous neural network tests, different neural networks fail for different input data (with some overlaps, of course). Since each neural network has a high percentage of correctly identifying the poison message, combining several neural network outputs increases the success rate. One way is to let all subnetworks vote for the poison message: each subnetwork votes for the message type that has the maximum score in its output, and the message type voted for by the most subnetworks wins.

For example, assume each neural network can identify the poison message with a success rate of 60%. There are five message types in total, and there are five neural networks. We can then calculate the probability of correct identification by the voting algorithm for the five neural networks. We make two assumptions for the computation in this example. (1) The events {correct identification by neural network i} are independent, where i = 1, ..., 5. (2) Given that a neural network makes an incorrect identification, the four possible incorrect identifications are equally likely, and the identifications by different neural networks are independent.

There are three cases in which the voting algorithm correctly identifies the poison message.

1. Three or more of the five neural networks correctly identify the poison message. Then the voting algorithm finds the poison message. This probability is

P1 = 1 - C(5,1)(0.4^4)(0.6) - C(5,2)(0.4^3)(0.6^2) - 0.4^5 = 0.683

2. Two neural networks correctly identify the poison message, and the other three neural networks output three different message types. This probability is

P2 = C(5,2)(0.6^2)(0.4^3) x 3/8 = 0.0864

where 3/8 is the probability that the three incorrect networks pick three different message types from the remaining four (here we use assumption (2) above). The 9/16 factor below is obtained similarly.

3. Two neural networks correctly identify the poison message, two other neural networks identify the same (incorrect) message type, and the fifth neural network identifies a third message type. In this case, if the voting algorithm breaks the tie by choosing each of the two leading message types with probability 1/2, then it succeeds in half of these cases. This probability is

P3 = 1/2 x C(5,2)(0.6^2)(0.4^3) x 9/16 = 0.0648

So the probability of the voting algorithm correctly identifying the poison message is P = P1 + P2 + P3 = 0.834. This probability is much larger than the probability of each individual neural network correctly identifying the poison message, and this is another reason to use neural networks in a distributed way.
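The voting rule and the probability calculation above can be checked numerically. The sketch below implements plurality voting with uniform random tie-breaking under the stated assumptions (five independent classifiers, each correct with probability 0.6, wrong answers uniform over the other four types), and compares a Monte Carlo estimate with the exact three-case computation.

```python
import random
from collections import Counter
from math import comb

rng = random.Random(1)
TYPES = 5

def vote(ballots):
    """Plurality vote with uniform random tie-breaking."""
    counts = Counter(ballots)
    top = max(counts.values())
    return rng.choice([t for t, c in counts.items() if c == top])

def trial(true_type=0, p_correct=0.6):
    # each of the five classifiers answers correctly w.p. 0.6,
    # otherwise picks one of the other four types uniformly
    ballots = [true_type if rng.random() < p_correct
               else rng.choice([t for t in range(TYPES) if t != true_type])
               for _ in range(5)]
    return vote(ballots) == true_type

n = 200_000
mc_estimate = sum(trial() for _ in range(n)) / n

# Exact computation mirroring cases 1-3 in the text:
p1 = sum(comb(5, k) * 0.6**k * 0.4**(5 - k) for k in range(3, 6))  # >= 3 correct
p2 = comb(5, 2) * 0.6**2 * 0.4**3 * (3 / 8)        # 2 correct, 3 wrong all differ
p3 = 0.5 * comb(5, 2) * 0.6**2 * 0.4**3 * (9 / 16) # 2 correct, tie broken right
exact = p1 + p2 + p3                               # approximately 0.834
```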

3.2 NEURAL NETWORK APPROACH

We use the Neural Network Toolbox in MATLAB to design, implement, and simulate the neural networks. We implemented two kinds of neural networks in our simulation: feedforward backpropagation and radial basis neural networks. The test results for both kinds of neural networks are similar, so here we only give the results for the feedforward backpropagation neural networks.

Training and Learning Functions. Training and learning functions are mathematical procedures used to automatically adjust the network's weights and biases. The training function dictates a global algorithm that affects all the weights and biases of a given network. The learning function can be applied to individual weights and biases within a network. The training function we used is trainb, which performs batch training with weight and bias learning rules.

The Neural Network Structure. The neural networks have three layers. The structure is shown in Figure 2.

1) Input layer. We use a vector to denote the node status in the communication network:

S_k = [s_k^1, s_k^2, ..., s_k^M],

where k is the discrete time step, M is the number of nodes, and s_k^i = 0 or 1 (0 means the node is normal, and 1 means the node has failed). The input of the neural network is the pair of node status vectors at two consecutive time steps, S_{k-1}, S_k. So the number of inputs is twice the number of nodes in each subnetwork.

2) Hidden layer. In the middle of the three layers is the hidden layer. There is a transfer function in the hidden layer.

3) Output layer with five outputs. The output is the normalized distribution score vector of the poison message. The five outputs represent the five message types in the OPNET testbed. Each output is the distribution score of the corresponding message type being the poison message.

Figure 2. The Structure of Neural Networks

3.3 TEST RESULTS

After training the neural networks, we use new data to test them. In our simulation we use 20 sets of data to test the five neural networks. Part of the test results is given in Table 1.
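The three-layer structure described in Section 3.2 can be sketched in plain Python. The paper uses MATLAB's Neural Network Toolbox; the hidden-layer size, the tanh transfer function, and the random untrained weights below are illustrative assumptions, shown only to make the input/output dimensions concrete.

```python
import math
import random

M = 14                                    # nodes in one subnetwork (Center)
N_IN, N_HIDDEN, N_OUT = 2 * M, 20, 5      # inputs are two status vectors

rng = random.Random(0)
W1 = [[rng.gauss(0, 0.1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
W2 = [[rng.gauss(0, 0.1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]

def forward(s_prev, s_curr):
    """Status vectors S_{k-1}, S_k (0/1 entries) -> five raw type scores."""
    x = s_prev + s_curr                   # concatenation: 2M inputs
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

scores = forward([0] * M, [1] * M)        # e.g., all nodes failed at step k
```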

In Table 1, columns 2 through 4 correspond to three tests, where each column gives the result of one test; the poison messages are type 1, type 2 and type 3, respectively. Rows 2 through 6 correspond to the outputs of the five neural networks, and each row gives the output of one neural network. The raw outputs of the neural networks include both positive and negative numbers. The final normalized distribution score of the poison message is a linear transformation and normalization of the neural network outputs, i.e., y = a(x + b), where x is the output, y is the normalized distribution score, and a and b are parameters. Since during neural network training the output for a non-poison message is set to zero, we set the transformation of the smallest (negative) number in the output vector to be zero, i.e., we set b = -x_0, where x_0 is the smallest number. Then a can be determined from the condition that the scores sum to one. The data in Table 1 are the normalized distribution scores after this transformation. The bold numbers are the distribution scores of the poison message, and the italic underlined numbers mean that the neural network assigned the maximum distribution score to a wrong message type. Thus, poison message 1 is correctly diagnosed by neural networks 1, 2 and 4 but misdiagnosed by 3 and 5. Poison message 2 is correctly diagnosed by neural networks 4 and 5 but misdiagnosed by 1, 2 and 3. And poison message 3 is correctly diagnosed by all neural networks except 2. The last row is the result of the voting algorithm. In the case of poison message 1, the voting algorithm still correctly identifies the poison message even though two neural networks failed.

Table 1. Neural Network Test Results
[Rows: Neural Network 1-5 and Voting; columns: tests with poison message Types 1-3; entries: normalized distribution scores over the five message types.]

We have tested twenty sets of data for all the neural networks. The results are summarized in Table 2 ("NN" means neural network). The correct percentage of each neural network ranges from 55% to 75%. The neural network corresponding to the Center subnetwork has the highest correct percentage. This is probably because the Center subnetwork has more nodes than the other subnetworks, so there is more node failure pattern information to exploit. It may also be due to the location of the subnetwork, at the center of the network. We can also see that the voting algorithm gives a high correct percentage of 85%. So by combining the outputs from several neural networks, we can get much better results. The last row of Table 2 is the test result of a large neural network; the details are explained in the next subsection.

Table 2. Summary of Neural Network Test and Voting Algorithm Results
[For each poison message type: the number of correct and incorrect identifications by each NN, the voting algorithm, and the large neural network.]

The Large Neural Network. We also trained a single large neural network for the whole network. There are fifty nodes in the large communication network, so the input of the large neural network is a vector with 100 elements. When we train the large neural network, the training goal cannot be met. One of the large neural network training curves is given in Figure 3: the training error stays well above the error goal. As one would expect, the test results are not very good: twenty data sets were tested, and only eight outputs were correct.

Figure 3. Training Curve for the Large Neural Network
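The linear score normalization y = a(x + b) used for Table 1 in Section 3.3 can be sketched as follows: shift the raw outputs so the smallest becomes zero (b = -min(x)), then scale so the scores sum to one. The sample raw scores are illustrative.

```python
def normalize(raw_scores):
    """y = a(x + b) with b = -min(x) and a chosen so that sum(y) = 1.
    Assumes the raw scores are not all identical."""
    b = -min(raw_scores)
    shifted = [x + b for x in raw_scores]   # smallest entry becomes 0
    a = 1.0 / sum(shifted)
    return [a * x for x in shifted]

scores = normalize([0.8, -0.3, 0.1, -0.1, 0.5])
```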

4. SUMMARY

We have discussed a particular failure propagation mechanism, the poison message failure, and we have provided a framework to identify the responsible message type. In this paper, we focused on designing a distributed management scheme to deal with the poison message failure, especially when the network is partitioned into several subnetworks by failed nodes. We proposed a distributed neural network approach, in which a large network is divided into several subnetworks and a neural network is trained for each subnetwork. Further, we suggested combining the outputs from multiple neural networks by a voting algorithm. We implemented a large OPNET testbed where BGP, LDP and OSPF can carry poison messages. Both computation and simulation show that the distributed neural network approach can provide very good results in identifying the poison message. The distributed neural network approach is also very useful when a single large neural network does not work well.

REFERENCES

[1] X. Du, M.A. Shayman and R.A. Skoog, "Using Neural Networks to Identify Control and Management Plane Poison Messages", The Eighth IFIP/IEEE International Symposium on Integrated Network Management (IM 2003), Colorado Springs, Colorado, March 2003.
[2] X. Du, M.A. Shayman and R.A. Skoog, "Markov Decision Based Filtering to Prevent Network Instability from Control Plane Poison Messages", Conference on Information Sciences and Systems (CISS 2003), Baltimore, Maryland, March 2003.
[3] X. Du, M.A. Shayman and R.A. Skoog, "Preventing Network Instability Caused by Control Plane Poison Messages", IEEE MILCOM 2002, Anaheim, CA, Oct. 2002.
[4] R.A. Skoog et al., "Network management and control mechanisms to prevent maliciously induced network instability", Network Operations and Management Symposium, Florence, Italy, April 2002.
[5] D.J. Houck, K.S. Meier-Hellstern, and R.A. Skoog, "Failure and congestion propagation through signaling controls", in Proc. 14th Intl. Teletraffic Congress, Amsterdam: Elsevier, pp. 367-376, 1994.
[6] A. Bouloutas, S. Calo, and A. Finkel, "Alarm correlation and fault identification in communication networks", IEEE Transactions on Communications, Vol. 42, Issue 2, pp. 523-533, Feb.-Apr. 1994.
[7] A. Bouloutas, G.W. Hart and M. Schwartz, "Simple finite-state fault detection for communication networks", IEEE Transactions on Communications, Vol. 40, Mar. 1992.
[8] J-F. Huard and A.A. Lazar, "Fault isolation based on decision-theoretic troubleshooting", Technical Report 44296-08, Center for Telecommunications Research, Columbia University, New York, NY, 1996.
[9] A. Bouloutas et al., "Fault identification using a finite state machine model with unreliable partially observed data sequences", IEEE Transactions on Communications, Vol. 41, Issue 7, pp. 1074-1083, July 1993.
[10] I. Katzela and M. Schwartz, "Schemes for Fault Identification in Communication Networks", IEEE/ACM Transactions on Networking, Vol. 3, Issue 6, pp. 753-764, Dec. 1995.
[11] J.P. Martin-Flatin, S. Znaty, and J.P. Hubaux, "A Survey of Distributed Network and Systems Management Paradigms", Technical Report SSC/1998/024, EPFL, Lausanne, Switzerland, August 1998.
