Packet Loss Reduction During Rerouting - IEEE Xplore


IEEE COMMUNICATIONS LETTERS, VOL. 15, NO. 10, OCTOBER 2011

Packet Loss Reduction During Rerouting

Wouter Tavernier, Dimitri Papadimitriou, Didier Colle, Mario Pickavet, and Piet Demeester, Member, IEEE

Abstract—When a network failure is detected by an IP router, its routing table is updated for all affected routing entries. The packet loss resulting from this event depends on the network traffic associated with the updated routing table entries. In this paper, we model this process, define packet loss minimization heuristics relying on network traffic prediction models, and show by simulation that the resulting packet loss can be reduced compared to the default routing table update process.

Index Terms—Routing, rerouting, traffic statistics, neural networks, ARMA, packet loss reduction.

I. INTRODUCTION

MANY techniques have recently been developed to help networks reduce their recovery time after various network failures. Fast failure detection techniques such as loss-of-signal detection or Bidirectional Forwarding Detection (BFD) allow detection times on the order of milliseconds. These can be combined with recovery techniques such as IP Fast ReRoute (FRR) (actually a protection technique) to minimize the duration of the traffic forwarding disruption caused by link or node failure(s) until both the routing and forwarding table entries of each node reconverge on the new topology. Common to all these techniques is their focus on recovery time and on the availability of a (loop-free) backup path, either before the failure (pre-provisioned protection) or after the failure (dynamic rerouting).

No matter how fast a failure is detected, or how fast a router recomputes its routing table entries and updates its forwarding entries to reroute all affected traffic flows towards alternate recovery paths, during this period all traffic for the forwarding table entries affected by the failure is simply lost. None of the above techniques accounts for the fact that the amount of traffic directed to the IP address prefix (i.e., a set of one or more destination addresses) corresponding to each forwarding table entry is not homogeneously distributed over these entries. Given that the low-level update of routing entries at the IP level can still take more than 1 second in IP backbone networks [1], in this letter we propose a routing table update scheme that reorders the updates with respect to the locally monitored (and projected) network traffic. The idea behind the proposed scheme is to update the routing table entries corresponding to high-bitrate traffic before those corresponding to low-bitrate traffic, taking into account the short-term dynamics of the affected network traffic. As developed in [2], this strategy results in a packet loss decrease under the assumption of a static but simplified traffic model. In this paper, we improve on that study by using real traffic traces and advanced real-time traffic models.

Manuscript received March 21, 2011. The associate editor coordinating the review of this letter and approving it for publication was G. Lazarou. W. Tavernier, D. Colle, M. Pickavet, and P. Demeester are with the research group Internet Based Communication Networks and Services (IBCN), Department of Information Technology, Ghent University, Belgium (e-mail: {wouter.tavernier, didier.colle, mario.pickavet, piet.demeester}@intec.ugent.be). D. Papadimitriou is with Alcatel-Lucent Bell, Antwerp, Belgium (e-mail: [email protected]). This work is supported by the European Commission-funded FP7-FIRE ECODE project. Digital Object Identifier 10.1109/LCOMM.2011.080811.110615

II. THE TRAFFIC-INFORMED REROUTING PROCESS

In an IP backbone network operating a Link-State (LS) Interior Gateway Protocol (IGP) such as Open Shortest Path First (OSPF), a router detecting a link failure originates an LS Update (LSU) message. This message, containing LS Advertisement(s) (LSA) describing the topological link state change(s), is reliably disseminated (flooded) in the network. At every router receiving the LSU, the following 3-step process executes (Figure 1):

1) Re-computation of the shortest path tree (SPT) (1) using the topological information stored in the updated LS DataBase (LSDB), taking about 30 to 50 μs per IGP destination prefix [1].
2) Update of the central Routing Information Base (RIB) and central Forwarding Information Base (FIB) based on the shortest path tree computation (2a and 2b), each update taking 50 to 100 μs per prefix [1].
3) Distribution of the central FIB towards the local FIBs (LFIBs) on the line cards (3). This is typically performed in (pseudo-)parallel with step 2: the update and distribution processes both use the router's central CPU in interleaved time slots, swapping between them. The swapping time is determined by the process quantum, which can be configured between 10 and 800 ms.

Fig. 1. The process of an IP router updating its RIB, FIB, and LFIBs.

1089-7798/11$25.00 © 2011 IEEE


As such, the rerouting process results in a series of update-distribution batches, each consisting of a first quantum dedicated to the update of a fixed set of routing table entries in the central RIB/FIB, followed by a second quantum in which the same set of entries is distributed towards the LFIBs. Given this batch structure, all traffic flows directed to a given destination address are recovered when the following conditions are met: i) the IGP routing table entry for the prefix including that destination address is updated and stored in the central RIB and FIB, and ii) the batch with number $n_i$ that comprises this updated entry is distributed towards the LFIBs located on the line cards. Thus, assuming a fixed batch size of $x_u$ routing table entries and a given order of entries, the recovery time $t_r$ of a traffic flow $f_i$ is characterized by the following formula:

$$t_{r,f_i} = n_i x_u (t_u + t_d) + (2 n_i - 1) t_s$$

This formula indicates that the routing entry corresponding to flow $f_i$ is recovered once all entries contained in the batch comprising that entry and in all preceding batches ($n_i$ batches in total) have been updated and distributed. Updating and distributing an individual routing entry takes $t_u$ and $t_d$ time units, respectively, with a swapping time $t_s$ between the update and the distribution of the corresponding batch.

Packet loss occurs for traffic flows as long as their corresponding routing table entries are not updated and distributed to the line cards. The packet loss resulting from the recovery of flow $f_i$ relates to its bitrate, $b_{f_i}$, through the following formula:

$$loss(f_i) = \int_{0}^{t_{r,f_i}} b_{f_i} \, dt$$
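The two formulas above can be illustrated with a minimal Python sketch. It assumes a constant bitrate over the recovery interval, so that the integral reduces to a product; the function names and all numeric values (batch size, per-entry times, swapping time, bitrate) are hypothetical and chosen only for illustration, not taken from the measurements in this letter.

```python
import math


def batch_number(position, x_u):
    """1-based batch index n_i of the entry at 1-based `position` in the update order."""
    return math.ceil(position / x_u)


def recovery_time(n_i, x_u, t_u, t_d, t_s):
    """Recovery time t_r = n_i * x_u * (t_u + t_d) + (2 * n_i - 1) * t_s, in seconds."""
    return n_i * x_u * (t_u + t_d) + (2 * n_i - 1) * t_s


def constant_rate_loss(bitrate, t_r):
    """Loss for a flow with constant bitrate: the integral of b over [0, t_r]."""
    return bitrate * t_r


# Hypothetical parameters (assumptions for illustration only).
X_U = 100      # entries per batch
T_U = 100e-6   # update time per entry, seconds
T_D = 100e-6   # distribution time per entry, seconds
T_S = 0.01     # swapping time, seconds

n_i = batch_number(250, X_U)                    # entry 250 falls in batch 3
t_r = recovery_time(n_i, X_U, T_U, T_D, T_S)    # about 0.11 s
loss_bits = constant_rate_loss(1e6, t_r)        # 1 Mbit/s flow: about 110,000 bits
print(n_i, round(t_r, 6), round(loss_bits))
```

Under these assumed values, moving the same flow from position 250 to position 1 would place it in batch 1 and reduce its recovery time to roughly 0.03 s, which is the effect the reordering scheme aims to exploit for high-bitrate flows.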

A. Network Traffic Dynamics

The idea behind the proposed solution is to enable traffic monitoring on routers using online statistical counters (similar to [3]) to measure the aggregated traffic volume per destination address prefix over the last time frame (for example, 200 ms). Given these statistics, the update and distribution of the RIB/FIB entries (IGP prefixes) could then be reordered such that entries carrying more traffic are updated and distributed before those carrying less traffic. However, this would only reduce packet loss if the measured traffic volumes persist during the update-distribution process.

To verify this assumption, we take a set of PCAP traces obtained in December 2009 from the Japanese WIDE backbone, as made available by the MAWI project [4], and process them at three time interval levels (100 ms, 500 ms and 1000 ms) and at three spatial levels (/8, /16 and /24 subnetworks) with respect to their traffic volume per interval per address prefix. Next, we measure the temporal persistence of active flows. A traffic flow is considered active when a non-zero volume is associated with it, and persistent if it remains active during the next time interval. The analysis showed that for /16 and /24 subnets, for all interval sizes, the percentage of flows that remain active during two consecutive intervals is only about 50 percent on average. Figure 2 shows the number of active and persistent flows when using /24 subnets and 1000 ms intervals.

Fig. 2. The number of active and persistent flows using /24 subnets and 1000 ms time bins.

These results indicate that a more accurate traffic model (modeling traffic volumes per destination address prefix) is needed to achieve packet loss decreases by reordering RIB/FIB entries during the routing table update. We evaluate two alternative traffic models: i) an Auto-Regressive Moving Average (ARMA) time-series model, and ii) a Feed-Forward Neural Network (FFNN) regression model. The ARMA model is a time series prediction model that has been successfully applied in many cases [5]. An ARMA($p$, $q$), or Box-Jenkins, model predicts a time series as follows:

$$Y_t = c + \sum_{i=1}^{p} \gamma_i Y_{t-i} + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}.$$

In this formula, two techniques are combined: i) autoregression, which reflects the fact that a prediction is based on the signal itself (using $p$ previous values weighted by the $\gamma_i$ values), as reflected by the second term, and ii) moving averages, reflected by the white noise series $\epsilon_t$, which is put through a linear non-recursive filter determined by the coefficients $\theta_i$ (weighted average).

The second model, the FFNN [6], is a regression technique involving a set of interconnected, non-linear computational elements (neurons) operating in parallel. These neurons are interconnected in layers via weighted connections. The neurons in the input layer are fed with input data, each computing a weighted sum of all inputs and, after applying an activation function, transferring the output to the elements in the next layer. The output of each neuron in the output layer is compared to the desired output. The difference between the desired output and the one obtained constitutes an error that is fed back to the network so that the weights are adjusted in such a way that the error is minimized (e.g., training using the Levenberg-Marquardt algorithm, as performed for this letter [6]). Once the FFNN has been trained on a given training set, it can be presented with new information. In the present case, an FFNN is evaluated for one-step-ahead prediction of the traffic volume for a given destination (as in [7]), based on a set of statistics (sum, mean and most frequent value) of the following IP packet header fields: the mean (inter-)arrival time of aggregated packets, the sum/mean of packet sizes, the most frequent IP and TCP protocol field values, and the total number of packets.
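As an illustration of the ARMA formula, the following Python sketch computes a one-step-ahead forecast from given coefficients. It is a minimal sketch, not the MATLAB-based modeling used for this letter: the coefficients are assumed to be already fitted, the unknown future noise term is replaced by its mean (zero), and past noise terms are approximated by previous forecast errors. All names and numeric values are hypothetical.

```python
def arma_predict(history, residuals, c, gamma, theta):
    """One-step-ahead ARMA(p, q) forecast.

    history[-1] is Y_{t-1} and residuals[-1] is eps_{t-1}.
    The future noise term eps_t is replaced by its mean, zero.
    """
    p, q = len(gamma), len(theta)
    ar_part = sum(gamma[i] * history[-1 - i] for i in range(p))
    ma_part = sum(theta[i] * residuals[-1 - i] for i in range(q))
    return c + ar_part + ma_part


def rolling_forecasts(series, c, gamma, theta):
    """Forecast series[p:], using forecast errors as estimates of the noise terms."""
    p, q = len(gamma), len(theta)
    residuals = [0.0] * q  # assumption: zero noise before the first forecast
    forecasts = []
    for t in range(p, len(series)):
        prediction = arma_predict(series[:t], residuals, c, gamma, theta)
        forecasts.append(prediction)
        residuals.append(series[t] - prediction)
    return forecasts


# Hypothetical ARMA(2, 1) coefficients and per-interval traffic volumes.
volumes = [8.0, 4.0, 6.0]
print(rolling_forecasts(volumes, c=1.0, gamma=[0.5, 0.25], theta=[0.5]))
```

In the context of the proposed scheme, such a forecast would be computed per destination address prefix, and the predicted volumes would then drive the ordering of the RIB/FIB entries.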


Fig. 3. Average packet loss vs. default IP router with 100 ms bin size.

B. Packet Loss Reduction

Assuming a sufficiently representative traffic model, a heuristic can be used to reduce packet loss during the routing table update by: i) reordering the RIB/FIB entries, and ii) dynamically changing the size of the update-distribution batch (dynamic change of the process quantum). Let $f_1, \ldots, f_n$ denote the set of traffic flows affected by the failure. Assuming that the RIB/FIB entries for the address prefixes corresponding to the flows prior to $f_i$ have already been updated and distributed in an ordered manner (from $f_1$ to $f_{i-1}$), let us define $b_{current} = (f_i, \ldots, f_{i+s})$ as the set of flows corresponding to entries in the current batch that still need to be updated. At this point, the entries remaining to be updated can be sorted in decreasing order of their estimated traffic volumes for the next time interval, with two possible alternatives:

1) Extension: extend the current batch with the RIB/FIB entry associated with the next flow $f_{i+s+1}$. This generates an additional packet loss for the RIB/FIB entries that are part of the current batch, induced by the additional time required by adding one more flow to the batch.
2) Splitting: terminate the current batch and put the RIB/FIB entry associated with the next flow into a new update-distribution batch. Termination generates an additional packet loss for the RIB/FIB entries that still need to be updated and distributed, induced by the additional swapping time before a new batch can be started.

At each step, the additional packet loss resulting from an extension is compared with that resulting from a split, and the alternative with the smaller additional loss is applied, so that packet loss is minimized.

III. NUMERICAL RESULTS THROUGH SIMULATION

The proposed schemes for short-term network traffic modeling and packet loss reduction are evaluated in a custom-made C++/Python environment using MATLAB software and toolboxes for network traffic modeling, with a set of MAWI network traces of December 2009 as input.
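The extension/splitting heuristic of Section II-B can be sketched in Python as follows. This is a simplified greedy interpretation under stated assumptions, not the evaluated implementation: the cost of an extension is taken as $(t_u + t_d)$ times the summed predicted rate of the current batch, and the cost of a split as two swapping times ($2 t_s$, following the $(2n_i - 1)t_s$ term of the recovery time formula) times the summed predicted rate of all entries still to be processed. The function name, the prefix labels, and the numeric values are hypothetical.

```python
def plan_batches(predicted_rates, t_u, t_d, t_s):
    """Greedily group prefixes into update-distribution batches.

    predicted_rates maps a prefix to its predicted traffic rate.
    Returns a list of batches, each a list of prefixes in update order.
    """
    # Entries carrying more (predicted) traffic are processed first.
    ordered = sorted(predicted_rates.items(), key=lambda item: item[1], reverse=True)
    if not ordered:
        return []

    batches = [[ordered[0][0]]]
    current_rate = ordered[0][1]                       # summed rate of the current batch
    remaining_rate = sum(rate for _, rate in ordered[1:])

    for prefix, rate in ordered[1:]:
        # Extension delays every entry already in the current batch.
        extension_cost = (t_u + t_d) * current_rate
        # Splitting delays every entry not yet updated (assumed 2 * t_s per new batch).
        split_cost = 2 * t_s * remaining_rate
        if extension_cost <= split_cost:
            batches[-1].append(prefix)
            current_rate += rate
        else:
            batches.append([prefix])
            current_rate = rate
        remaining_rate -= rate
    return batches


# Hypothetical predicted rates and timing parameters (arbitrary units).
rates = {"10.1.0.0/16": 100, "10.2.0.0/16": 10, "10.3.0.0/16": 1}
print(plan_batches(rates, t_u=1, t_d=1, t_s=100))
```

With a large swapping time, as in the usage above, the two heaviest prefixes share a batch and only the lightest one is split off; with a small swapping time the sketch tends to give each heavy prefix its own batch, since delaying it by further updates costs more than the extra swap.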
Figure 3 depicts the average packet loss reduction, in percent, obtained for the indicated network traces using time intervals of 100 ms for several routing table update schemes:

1) default rup: the default router update process using fixed update batches (process quantum) allowing 100 entries.
2) persist sorted fixed: the routing update process assuming persistent network traffic during the update, sorting the resulting RIB/FIB entries in decreasing traffic volume order with a fixed batch size allowing 100 entries.
3) persist sorted optimized: the alternative assuming persistent network traffic, using the heuristic of Section II-B.
4) arma sorted optimized: the solution using an ARMA(10, 10) model for prediction and the referred heuristic for update batch sizing.
5) ffnnsum sorted optimized: the solution using the FFNN model for prediction and the referred heuristic for update batch sizing.

This figure illustrates that more aggregation (using coarser subnetworks) decreases the packet loss reduction potential. The resulting packet loss reduction ranges between 10 percent (for /8 subnetworks) and 20 percent (for /24 subnetworks) using 100 ms time intervals. As indicated by earlier studies [5], [8], higher traffic aggregation (either larger time intervals or coarser subnetworks) improves predictability. However, more aggregation also leads to smaller differences in traffic volume between aggregated flows. Hence, the packet loss reduction potential shrinks, because reordering the associated RIB/FIB entries then has less effect on the packet loss resulting from the routing table updates. The usage of the FFNN-based models can be justified for low traffic aggregation levels.

IV. CONCLUSION

This paper investigated the potential of designing a traffic-informed routing table update process. The latter, enhanced with heuristics for packet loss minimization, proved able to produce packet loss decreases between 10 percent at high aggregation levels and 20 percent at low aggregation levels.

REFERENCES

[1] P. Francois, C. Filsfils, J. Evans, and O. Bonaventure, “Achieving subsecond IGP convergence in large IP networks,” SIGCOMM CCR, vol. 35, no. 3, pp. 33–44, July 2005.
[2] W. Tavernier, D. Papadimitriou, D. Colle, M. Pickavet, and P. Demeester, “Optimizing the IP router update process with traffic-driven updates,” in DRCN 2009.
[3] C. Estan, K. Keys, D. Moore, and G. Varghese, “Building a better netflow,” SIGCOMM CCR, vol. 34, no. 4, pp. 245–256, 2004.
[4] K. Cho, K. Mitsuya, and A. Kato, “Traffic data repository at the WIDE project,” in Proc. Annual Conference on USENIX Annual Technical Conference, 2000, pp. 51–51.
[5] A. Sang and S. Q. Li, “A predictability analysis of network traffic,” in INFOCOM 2000, vol. 1, pp. 342–351.
[6] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edition. Springer, 2007.
[7] A. Khotanzad and N. Sadek, “Multi-scale high-speed network traffic prediction using combination of neural networks,” in Proc. International Joint Conference on Neural Networks 2003, vol. 2, pp. 1071–1075.
[8] Y. Qiao, J. Skicewicz, and P. Dinda, “An empirical study of the multiscale predictability of network traffic,” in Proc. 13th IEEE International Symposium on High Performance Distributed Computing 2004, pp. 66–76.