A Family of Mechanisms for Congestion Control in Wormhole Networks

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. Y, MONTH 200X


E. Baydal, P. López, and J. Duato
DISCA, Universidad Politécnica de Valencia, Valencia, SPAIN. E-mail: [email protected]

Abstract— Multiprocessor interconnection networks may become congested under high traffic loads, which prevents them from reaching the expected performance. Unfortunately, many of the congestion control mechanisms proposed in the literature either lack robustness, being unable to work properly with different traffic patterns or message lengths, or detect congestion using global information, which wastes some network bandwidth. This paper presents a family of mechanisms to avoid network congestion in wormhole networks. All of them need only local information and apply message throttling when required. The proposed mechanisms use different strategies to detect network congestion and also apply different corrective actions. The mechanisms are evaluated and compared for different network loads and topologies; they noticeably improve network performance under high loads without penalizing network behavior at low and medium traffic rates, where no congestion control is required.

Keywords: Wormhole switching, congestion control, message throttling.

I. MOTIVATION

Massively parallel computers provide the performance that most scientific and commercial applications require. Their interconnection networks offer the low latency and high bandwidth needed by different kinds of traffic. Usually, wormhole switching with virtual channels and adaptive routing is used. However, multiprocessor interconnection networks may suffer from severe congestion problems under high traffic loads, which prevents them from reaching the expected performance. The problem can be stated as follows (see Fig. 1). At low and medium network loads, the accepted traffic rate equals the injection rate. But if traffic increases and reaches (or surpasses) a certain level (the saturation point), accepted traffic drops and message latency increases considerably. This problem appears with both deadlock avoidance and deadlock recovery strategies [9]. Performance degradation arises because, under high network traffic, several packets compete for the same resources (physical or virtual channels, escape channels, or recovery mechanisms); as only one packet can use each of them at a time, the remaining packets stall in the network, thus blocking other packets, and so on. When this situation becomes generalized, the network is saturated and performance degrades.

[Figure: two plots showing latency since generation (cycles) and accepted traffic (flits/node/cycle) versus injected traffic (flits/node/cycle), for the uniform and complement traffic patterns.]

Fig. 1. Performance degradation in an 8-ary 3-cube. Fully adaptive routing algorithm, deadlock recovery, 3 virtual channels. Uniform and complement traffic patterns. 16-flit messages.

The interconnection network may become a bottleneck within the next few years. High clock frequencies and techniques that increase the effective instruction issue rate of processors (for instance, see [6]) push higher

demands on the interconnects. Indeed, the cost-effectiveness of interconnection networks is improved by attaching more than one processor to each network router [23], which again increases the bandwidth demand. On the other hand, several studies of interconnection network behavior under the traffic generated by real applications [10], [23], [26] show that network traffic is bursty and that peak traffic may saturate the network. Finally, there is a growing interest in applying power-saving techniques everywhere. When these strategies are used in interconnection networks, solving congestion becomes mandatory. These techniques are mainly based on reducing the network bandwidth (for instance, by switching off some links [28]) when traffic is not intense. The problem is that an injection rate acceptable in the original network may be completely beyond saturation in the network with reduced bandwidth, thus driving it into performance degradation. Although several mechanisms for multiprocessors have already been proposed, they have important drawbacks, which will be explained in Section III. In this paper, we propose and compare


a family of mechanisms to avoid network congestion that tries to overcome these drawbacks. The rest of the paper is organized as follows. Section II gives some background on congestion control mechanisms. Section III presents the main desirable features of a good congestion control mechanism. Section IV describes our proposals for congestion control. Performance evaluation results of the mechanisms for different message destination distributions and topologies are presented in Section V. Finally, some conclusions are drawn.

II. RELATED WORK

From our point of view, there are two strategies to address the congestion problem: congestion prevention and congestion recovery. Congestion prevention techniques require some kind of authorization from the network before injecting a packet. The two best-known techniques are based, respectively, on opening connections, reserving the necessary resources, and regulating the message injection rate in accordance with the reserved bandwidth; and on limiting the number of messages sent without receiving an acknowledgment. The former is frequent in data transmission with quality-of-service guarantees, and the latter is typical of communication networks. These techniques are not usually applied to multiprocessor interconnection networks. Such networks often do not use acknowledgments; therefore, they cannot compute packet round-trip delays or take into account the number of packet retransmissions. On the other hand, introducing these elements would increase traffic and add complexity to the network. Congestion recovery techniques are based on monitoring the network and triggering some actions when congestion is detected. In this case, the solution has three steps: first, congestion has to be detected; second, congestion has to be notified to the network nodes; and third, some actions have to be applied to solve it.
Regarding how network congestion is detected, some strategies are based on measuring the waiting time of blocked messages [12], [16], [27], while others use the number of busy resources in the network. Unfortunately, waiting time depends on message length. In the second group of strategies, the busy resources may be channels [2], [7], [19], [21], [24] or buffers [13], [30]. Obviously, as traffic increases, more network resources will be occupied, and probably for a longer time, because messages advance more slowly. Information about busy resources may be local to the node or global, gathered from all the network nodes. While global information may allow more accurate knowledge of the network status, it also incurs some overhead: in order to periodically update the network status, control messages have to be sent across the network, thus wasting some amount of bandwidth. Mechanisms that use only local information are more limited, but they may obtain a better cost/performance tradeoff: as nodes do not need to exchange status information, more bandwidth remains available for data messages. Concerning congestion notification, we can classify the mechanisms according to which node is notified: only the local node [7], [19], [21], [24], the local node and its neighbors, the sender nodes of the packets related to the


congestion detection [16], or all the network nodes [27], [30]. Notifying only the local node has the disadvantage that the actions applied to avoid network congestion may punish only the nodes that detect the congestion, regardless of whether they are the ones mainly responsible for the problem. Other solutions may seem more profitable, but they require transmitting extra information through the network, either by signaling [27], which complicates network implementation, or by sending extra flits (as additional packets [30] or as padding in the sent packets [16]), which may worsen an already bad situation. Moreover, applying restrictions to all the network nodes is very fair but may excessively penalize network throughput. Finally, the actions triggered to avoid network congestion can be divided into three groups. The first, and most studied, is message throttling; we will return to it later. The second group uses non-minimal routing to balance the network load along alternative paths [7]. The third group [27] adds additional buffers (“congestion buffers”) to slow down messages and reduce network traffic. Message throttling has been the most frequently used method. If the network is becoming congested, it seems reasonable that nodes stop, or at least slow down, their message injection. This is a classical solution to congestion control in both low-speed [11], [14] and high-speed [15] packet-switched networks. Several strategies have also been developed for multiprocessor and cluster-of-workstations interconnection networks. Some proposals completely stop message injection when congestion is detected [18], [24], [27], [30]. Another possibility is to reduce the bandwidth available to inject new messages: if several injection channels are available, they may be progressively disabled. Message transmissions may also be delayed during increasing intervals [16].
Message throttling can be applied for a predefined time [16], [27], or until traffic falls enough [2], [19], [21], [30].

III. FEATURES OF A GOOD CONGESTION CONTROL MECHANISM

In this section, we describe the desirable features of an efficient congestion control mechanism. From our point of view, such a mechanism should have three main properties: it should be robust, it should not penalize network behavior at low and medium loads, and it should not generate new problems. First, the mechanism should work properly under different conditions: different message destination distributions, message sizes, and network topologies. As Fig. 1 shows, the saturation point depends on the distribution of message destinations. Because of that, a mechanism suitable for one distribution may not work properly with another. The same problems may appear when the network topology changes. However, many of the previously proposed mechanisms have been analyzed with only one network size [2], [30] and only for the uniform distribution of message destinations [7], [12], [13], [16], [24], [27]. Other mechanisms do not achieve good results for every individual traffic pattern considered [19],


[29]. Also, the performance of some mechanisms strongly depends on message size [16], [27]. Second, the added mechanism should not penalize the network when it is not saturated; notice that this situation is the most frequent one [25]. Thus, when network traffic is not close to or beyond the saturation point, the mechanism should not restrict or delay the injection of messages into the network. However, some of the previous proposals increase message latency before the saturation point [7], [29]. In addition, strategies based on non-minimal routing [7], [12] may also increase message latency. Finally, the new mechanism should not generate new problems in the network. Proposals that need to send extra information across the network increase network traffic and may worsen the congestion [16]. Moreover, when congestion detection relies on global information, the mechanisms do not scale well, since the amount of network status information increases with the network size [30].

IV. THE PROPOSED MECHANISMS

In this section, we present a family of mechanisms to prevent network congestion in wormhole networks. They mainly differ in their complexity and in their accuracy in detecting network congestion. All of them detect congestion using only local information, notify only the node that detects the congestion, and solve the congestion situation with message throttling. Stopped messages are queued until network traffic decreases. To ease the implementation, messages are injected into the network in the same order they were generated. Finally, notice that although the hardware required to implement these mechanisms may add some delay to injected messages, it does not reduce the clock frequency because it is not in the critical path.

A. U-channels

This approach estimates network traffic locally at each network node by checking its number of free virtual channels. However, only those channels that are useful for forwarding the message towards its destination are considered in the count. Injection of new messages is allowed only if this number exceeds a threshold. The idea behind this method is that even if some network areas are congested, it does not matter as long as the message is not going to use them. Conversely, if the channels that may forward the message towards its destination are congested, the mechanism should prevent message injection regardless of whether the other channels are free. Hence, although the mechanism notifies congestion only to the local node, it applies message throttling only when injection could contribute to network congestion. We will refer to this mechanism as U-channels (Useful channels). Notice, however, that the threshold value strongly depends on the message destination distribution.
To make the mechanism independent of the message destination distribution, we instead use the number of useful free virtual output channels relative to the total number of virtual channels that may be used to forward the message towards its destination.


[Figure: hardware block diagram with the pending message queue (source, destination, reserve/release), the routing function, the virtual channels status register, a counter for the useful channels (U), a counter for the free useful virtual channels (F), an adder computing U × threshold, and a comparator asserting the injection permitted signal.]

Fig. 2. Hardware required for the U-channels mechanism.

Therefore, a message that may use more virtual channels to reach its destination needs more of them to be free in order to be injected. The threshold value still has to be tuned empirically but, as we will see in Section V, a single threshold can be found that works well with all message destination distributions. To illustrate the method, consider a bidirectional k-ary 3-cube network with 3 virtual channels per physical channel. Assume that the optimal threshold is 0.3125 and that a message needs to cross two network dimensions. This message may choose among 6 virtual output channels. Let F be the number of them that are free. If F/6 is greater than 0.3125, the message can be injected. In this case, at least two (0.3125 × 6 = 1.875) virtual channels must be free to allow message injection. Fig. 2 shows the implementation of the mechanism. The routing function must be applied before injecting the newly generated message into the network. This requires replicating some hardware, but if a simple routing algorithm is used (see Section V-A), this should not be a problem. The routing function returns the useful virtual output channels, U, for instance as a bit-vector. A circuit then counts the number of “1” bits in this vector. On the other hand, by combining the bit-vector with each virtual channel status (free or busy), the number of free useful virtual output channels, F, is obtained. This can be easily done by a bitwise AND followed by a circuit similar to the one that computes U. Next, the quotient between the free and the total useful virtual output channels, Q = F/U, should be obtained. If Q is greater than the threshold, the message can be injected. Otherwise, the message is queued. In Fig. 2, the comparison between F/U and the threshold has been converted into a comparison between F and U multiplied by the threshold. This product is easily performed if the threshold is chosen as a sum of powers of two.
For instance, if the threshold is 0.3125 (as assumed in Fig. 2), it can be decomposed into 2^-2 + 2^-4. So, its product by



U is as easy as adding U right shifted two positions and U right shifted four positions.
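As a concrete illustration, the whole U-channels test can be sketched in software (a minimal sketch: the function name and the bit-vector representation are our own, and the truncating shifts slightly underestimate U × 0.3125, exactly as the shift-and-add circuit of Fig. 2 would):

```python
def injection_permitted(useful, status):
    """U-channels check: allow injection when the number of free useful
    virtual output channels F exceeds U * 0.3125, the product being
    computed as (U >> 2) + (U >> 4), i.e., U * (2**-2 + 2**-4).
    `useful`: bit-vector of channels that forward the message towards
    its destination; `status`: bit-vector with 1 = channel free."""
    u = bin(useful).count("1")           # U: useful virtual output channels
    f = bin(useful & status).count("1")  # F: useful channels that are also free
    return f > (u >> 2) + (u >> 4)       # shift-based threshold comparison
```

With the 6-channel example from the text, the check requires at least two free useful channels: `injection_permitted(0b111111, 0b000011)` allows injection (F = 2), while `injection_permitted(0b111111, 0b000001)` does not (F = 1).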

B. ALO

[Figure: percentage of routing operations versus injected traffic (flits/node/cycle) for conditions a), b), and a) OR b).]

Fig. 3. Percentage of routings with: a) all useful physical output channels having at least one free virtual channel; b) at least one useful physical channel completely free.

[Figure: logic diagram with the pending message queue (source, destination), the routing function returning the useful physical output channels, the virtual channels status register (1: free, 0: busy), and gates A through G producing the injection permitted signal.]

Fig. 4. Hardware required for the ALO mechanism.

The second proposed mechanism is also based on measuring network traffic through the number of useful free virtual output channels. However, it does not require adjusting any threshold. In particular, it relies on the assumption that the routing algorithm tries to minimize virtual channel multiplexing in order to avoid its negative effects [5]. Many adaptive routing algorithms work this way [8], [22]. Thus, busy virtual channels tend to be uniformly distributed among all the physical channels of the router. In other words, the average number of busy virtual channels tends to be the same for all the physical channels of the router. As network load increases, this number also increases. Intuitively, when the last virtual channel of some physical channel starts to be used, network traffic is becoming high. To support this intuition, we have analyzed the percentage of routing operations for which all the physical channels that are useful to forward the message towards its destination have at least one free virtual channel. Fig. 3 shows the results. As can be seen, the condition is satisfied in almost all cases at low injection rates. However, as traffic increases, the number of routings with at least one free virtual channel in all the feasible output channels decreases. Therefore, some correlation exists between completely used physical channels at a node and the network traffic in the node's area. However, in some cases (not many), a physical channel may become completely free while others at the same node still have all their virtual channels occupied. In this situation, traffic in the node's area is not really saturated, but the mechanism described above would prevent injection of messages, increasing message latency. Fig. 3 also shows the percentage of routing occurrences that satisfy this second condition alone and combined with the first one. As we can see, the second condition alone is a worse indicator of congestion.

However, both rules combined (the first one OR the second one) slightly improve congestion detection. Thus, the mechanism allows injection if, after applying the routing function, at least one virtual channel of every useful physical channel is free, or at least one useful physical channel has all its virtual channels free. For this reason, we will refer to this mechanism as ALO (At Least One). Notice that, contrary to the previous approach, there is no threshold to adjust in this mechanism, which noticeably simplifies the implementation. Fig. 4 shows the implementation of the mechanism. As in U-channels, the routing function must be applied before injecting the newly generated message into the network. The routing function returns the useful physical output channels. This implementation assumes that all the virtual channels of a physical channel can be used in the same way by a message, which is the case for True Fully Adaptive Routing (see Section V-A), and that there are 3 virtual channels per physical channel. In parallel, two logical operations are performed on the virtual channels status register. First, the C gate detects whether there is at least one free virtual channel in each physical channel. Simultaneously, the D gate detects whether all the virtual channels of a physical channel are free. Next, this information is combined with the result of the routing function (B and E gates) in order to consider only the useful channels. Finally, the A and F gates apply the first and second rules, respectively, over all the useful physical channels, and the G gate allows injection if either rule is satisfied. As we can see, the hardware required to implement the mechanism is very simple: only some logic gates are needed. As the mechanism does not use any threshold, there is no need for registers or comparators.

C. INC: Injection and Network Congestion

The previous mechanisms only use the information directly available at the node. Although simple, this might lead to a lack of accuracy, as congestion in remote areas may not be quickly detected. On the other hand, message throttling is applied at the local node, completely stopping message injection once congestion is detected, but the local node may not be the main

[Figure: per-virtual-channel logic with a transmitted-flits counter, a comparison against the threshold fc, an “injecting?” input, and the resulting injection and network flags.]

Fig. 5. Hardware required for the injection and network flags for the INC mechanism.

responsible for the network congestion. In this section, we propose a mechanism that tries to overcome these drawbacks. The mechanism checks message advance speed in order to detect network congestion. In particular, it measures the number of transmitted flits during a certain time interval tm. Each virtual channel has an associated counter that is incremented each time a flit is transmitted. At the end of each interval, the channel is considered congested if it is busy and the number of transmitted flits is lower than a threshold fc (flit counter). When the header of a message reaches a congested area, it will either block or advance at low speed, and so will the remaining flits of the message. By tracking this speed, we try to detect congestion in network areas remote from the current node, while using only local information. Notice that we cannot use a single counter per physical channel with a threshold fc' = fc × number of virtual channels: if the traffic in one of the virtual channels is blocked but the traffic in the other virtual channels of the same physical channel keeps flowing, congestion would not be detected. Once congestion is detected, the applied policies differ depending on whether the node is currently injecting messages towards the congested area. In particular, each node manages two different flags: the injection flag and the network flag. The injection flag is set when any of the virtual channels that the node is using to inject messages into the network is congested. Conversely, the network flag is set when congestion is detected in any virtual channel that is being used by messages injected by other nodes. The mechanism will be referred to as INC (Injection and Network Congestion detection). Fig. 5 shows the logic required at each virtual channel to implement the congestion flags. When either the injection or the network flag is set, the local node applies message injection restrictions.
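A software sketch of the per-interval flag update just described might look as follows (illustrative only: the class, field names, and function shape are our own assumptions, not the paper's hardware implementation):

```python
class VirtualChannel:
    """Minimal per-virtual-channel state for the INC congestion check."""
    def __init__(self, busy=False, injected_locally=False):
        self.busy = busy                          # channel carries a message
        self.injected_locally = injected_locally  # message injected by this node
        self.flits_sent = 0                       # transmitted-flits counter

def update_congestion_flags(channels, fc):
    """Run at the end of each tm-cycle interval: a busy channel that moved
    fewer than fc flits is congested. Returns (injection_flag, network_flag)."""
    injection_flag = network_flag = False
    for ch in channels:
        if ch.busy and ch.flits_sent < fc:
            if ch.injected_locally:
                injection_flag = True   # this node feeds a congested channel
            else:
                network_flag = True     # congestion caused by transit traffic
        ch.flits_sent = 0               # restart the counter for the next interval
    return injection_flag, network_flag
```

Note that the check is kept per virtual channel, matching the observation above that a per-physical-channel counter would hide a single blocked virtual channel.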
These actions are more restrictive when the injection flag is set than when only the network flag is. The setting of the injection flag means that the node is actively contributing to the network congestion. In contrast, if only the network flag is set, some preventive measures still have to be applied because, although the current node is not directly generating the problem, congestion spreads quickly through the network. In both cases, if congestion is repeatedly detected in later intervals, the corrective actions become more restrictive. Thus, the mechanism works in a progressive way. Moreover, in the INC mechanism, when congestion is detected, instead of completely stopping message injection, the


injection bandwidth is progressively decreased by dividing the number of enabled injection channels by some factor r1 (integer division). If this quotient reaches zero, injection is completely stopped for an interval fii (forbidden injection interval). This interval starts when the last message that the node is currently injecting has been completely injected into the network. After fii cycles, one injection channel is enabled regardless of the detected network traffic, and the first pending message, if any, is injected. As network traffic is estimated periodically, injection restrictions will either get harder, increasing fii, or be relaxed. The mechanism uses limited forbidden injection intervals for two reasons: first, this avoids starvation and, second, it allows other nodes to also detect congestion and apply message injection limitations. On the other hand, if only network congestion is detected, the enabled injection channels are divided by a lower factor r2, with r2 < r1. Forbidden injection intervals are also used if only network congestion is repeatedly detected, but they increase more slowly. The optimal values of r1 and r2 depend on the network radix k. Notice that when k increases, so does the average distance traversed by the messages, which worsens network congestion; therefore, injection restrictions have to be harder. After some tests, we have found that r1 = k/4 and r2 = k/8 work well. Notice that, with these values, a topology with a low radix k, such as an 8-ary 3-cube, will apply r1 = 8/4 = 2 and r2 = 8/8 = 1, so it only reduces the injection bandwidth when injection congestion is detected. On the contrary, a topology more prone to congestion (because of its reduced routing flexibility and greater average distance), such as a 32-ary 2-cube, will use r1 = 32/4 = 8 and r2 = 32/8 = 4. Hence, if 4 injection channels are used, it will disable all of them as soon as injection congestion is detected.
The forbidden injection interval is initialized to fiimin. Then, it is incremented by some value every time congestion is detected while injection is already forbidden. Again, the increment value depends on the network radix k. We have chosen it higher for injection congestion than for network congestion, since in the first case the node is contributing directly to the network congestion. After several tests, and looking for values that are powers of two to ease the implementation, we have found that the heuristics incinj = (fiimin × k)/16 and incnet = (fiimin × k)/32 work well (both with integer division). Notice, though, that as the mechanism tries to self-adjust the injection bandwidth, the r1, r2, incinj, and incnet values are not critical. Finally, the injection bandwidth has to be recovered when congestion is no longer detected. However, the injection restrictions are also relaxed gradually. After an interval without detecting any congestion, the injection limitation is smoothed: first, the injection bandwidth is increased by reducing the forbidden interval (by incnet) and later, when it reaches the minimum value fiimin, by increasing the number of injection channels (one channel at a time). Fig. 6 and Table I summarize the behavior of the mechanism.
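The adjustment rules described above can be summarized in code (a sketch following the paper's formulas; the function shape, argument names, and the guard that keeps the divisor at least 1 for very small k are our own):

```python
def inc_adjust(inj_ch, fii, k, fii_min, max_inj_ch,
               injection_congested, network_congested):
    """One per-interval INC adjustment: divide the enabled injection
    channels (inj_ch) under congestion, grow the forbidden injection
    interval (fii) once inj_ch is zero, and relax gradually otherwise.
    All divisions are integer, as in the paper."""
    inc_inj = (fii_min * k) // 16       # fii increment, injection congestion
    inc_net = (fii_min * k) // 32       # fii increment, network congestion
    if injection_congested:
        if inj_ch > 0:
            inj_ch //= max(k // 4, 1)   # r1 = k/4: harsher reduction
        else:
            fii += inc_inj              # already stopped: lengthen interval
    elif network_congested:
        if inj_ch > 0:
            inj_ch //= max(k // 8, 1)   # r2 = k/8 (a no-op when r2 = 1)
        else:
            fii += inc_net
    elif fii > fii_min:
        fii -= inc_net                  # no congestion: shrink fii first
    elif inj_ch < max_inj_ch:
        inj_ch += 1                     # then re-enable one channel at a time
    return inj_ch, fii
```

For the 32-ary 2-cube example from the text (k = 32, four injection channels), a single detection of injection congestion already yields inj_ch = 4 // 8 = 0, disabling all injection channels at once.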


TABLE I
OPERATION OF THE INC MECHANISM (k = network radix; examples for k = 32)

Applied policy  | Injection congestion                                            | Network congestion
injch reduction | injch = injch/r1 = injch×4/k (e.g., injch/8)                    | injch = injch/r2 = injch×8/k (e.g., injch/4)
fii increment   | fii = fii + incinj = fii + (fiimin×k)/16 (e.g., fii + fiimin×2) | fii = fii + incnet = fii + (fiimin×k)/32 (e.g., fii + fiimin)
fii decrement   | fii = fii − incnet = fii − (fiimin×k)/32 (e.g., fii − fiimin)   | fii = fii − incnet = fii − (fiimin×k)/32 (e.g., fii − fiimin)

[Figure: flowchart of the INC mechanism. Every tm cycles, the node checks for injection congestion (reducing the enabled injection channels injCh by r1) and for network congestion (reducing injCh by r2), where AllInjCh is the maximum number of injection channels. If injCh reaches 0, fii is increased by inc_inj or inc_net, the node waits until the last message is injected, waits fii cycles, and then injects a pending message. When no congestion is detected, fii is decreased by inc_net and, once fii reaches fii_min, injCh is increased by one, up to AllInjCh.]

Fig. 6. Operation of the INC mechanism.

V. EVALUATION

In this section, we evaluate by simulation the behavior of the proposed congestion control mechanisms. The evaluation methodology is based on the one proposed in [3]. The most important performance measures are latency since generation (the time required to deliver a message, including the time spent at the source queue; network latency, in contrast, only considers the time spent traversing the network) and throughput (the maximum traffic accepted by the network). Accepted traffic is the flit reception rate. Latency is measured in clock cycles and traffic in flits per node per cycle. Unfortunately, message latency at saturation grows with time. Hence, we will also measure network performance by tracking the time required to deliver a given number of messages injected into the network. Moreover, as in these simulations every network node has to deliver the same number of messages, these results will reveal any node unfairness. Nodes suffering starvation, because the congestion control mechanism does not allow them to inject their messages at all until network load is low, will cause latency peaks in the results where the traffic load is expected to be low.


In [2] and, mainly, in [4], an in-depth analysis of the starvation problem is carried out, also evaluating the number of messages injected by each network node under different network loads. The results showed that the mechanisms proposed in this paper do not have starvation problems.

A. Network model

The simulator models the network at the flit level. Each node has a router, a crossbar switch, and several physical channels. Routing time and the transmission times across the crossbar and across a physical channel are all assumed to be one clock cycle. In addition, each node has four injection/ejection channels; we showed the advantages of having several injection/ejection channels in [3]. Concerning deadlock handling, we have considered both deadlock avoidance and recovery. In the former case, fully adaptive routing with escape channels [8] is used. In the latter case, we use software-based deadlock recovery [22] and a True Fully Adaptive Routing (TFAR) algorithm [22], [25], which allows the use of any virtual channel of those physical channels that forward a message closer to its destination. Deadlocks are detected with the mechanism proposed in [20], with a threshold equal to 32 cycles. In both cases, the routing algorithms can use 3 or 4 virtual channels per physical channel. The mechanisms that estimate the network traffic by using the number of free virtual output channels (U-channels and ALO) have been evaluated only with deadlock recovery. They are not well suited for deadlock avoidance, since not all the virtual channels are used in the same way there, which these mechanisms require. On the other hand, the mechanism that uses message advance speed as a congestion indicator (INC) has been evaluated with both deadlock avoidance and recovery. We have evaluated the performance of the proposed congestion control mechanisms on different bidirectional k-ary n-cubes.
In particular, we have used the following network sizes: 256 nodes (n=2, k=16), 512 nodes (n=3, k=8), 1024 nodes (n=2, k=32), and 4096 nodes (n=3, k=16).

B. Network load

Each node generates messages independently, according to an exponential distribution. Destinations are chosen according to the Uniform, Uniform with locality l (where destinations are randomly selected inside a sub-cube of side l, with l=2 and l=4), Butterfly, Complement, Bit-reversal, and Perfect-shuffle traffic patterns. The uniform distribution is the one most frequently used in the analysis of interconnection networks; the other patterns reflect the permutations that are usually performed in parallel numerical algorithms [17]. For message length, 16-flit, 64-flit and 128-flit messages are considered.

We have performed two different kinds of experiments, using constant and dynamic loads. Experiments with constant loads are the usual evaluation tool for interconnection networks. In this case, the message generation rate is constant and the same for all network nodes. We analyze the full traffic range, from low load up to saturation. Simulations finish after receiving 500,000 messages with the smallest networks (256 and 512 nodes) and 1,200,000 with the largest ones (1024

and 4096 nodes), but only the last 300,000 messages (1,000,000 for the large networks) are considered when calculating average latencies.

We will also evaluate the mechanisms' behavior when the network traffic changes dynamically. In particular, we use a bursty load that alternates low and high loads. Let Ls be the message generation rate just before the network reaches saturation. We have chosen a generation rate for the low load of Llow = 0.5×Ls. In order to analyze the behavior of the mechanisms at different saturation levels, we have tested two values for the high load: Lhigh = 1.2×Ls and Lhigh = 1.8×Ls. These loads are repeated twice (Lhigh Llow Lhigh Llow). For a given generation rate, each node generates the same number of messages. When a node has generated all the messages for one of the injection rates, it starts generating traffic with the next load level in the sequence. Finally, when it has generated all the messages for the complete sequence, it stops generating messages altogether. The simulation finishes when all the generated messages arrive at their destinations. We show results with Mhigh = 1,000 messages generated per node at the Lhigh injection rate. The number of messages generated at the Llow rate defines the elapsed time between two traffic bursts, and it allows us to analyze how the mechanism recovers the network after a congestion period. We have used two different message counts for the load below saturation (Llow): Mlow = 555 and Mlow = 1,111 messages.
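The dynamic-load sequence just described can be sketched as follows; the helper name is hypothetical, but the rates, factors and message counts come from the text:

```python
def bursty_schedule(L_s, high_factor, m_high=1000, m_low=555):
    """Per-node generation sequence Lhigh Llow Lhigh Llow.

    Each (rate, count) pair gives a message generation rate and the
    number of messages to generate at that rate before moving on to
    the next pair. L_s is the generation rate just before saturation;
    high_factor is 1.2 or 1.8 in the experiments.
    """
    L_low = 0.5 * L_s            # low load, as chosen in the text
    L_high = high_factor * L_s   # high load: 1.2x or 1.8x the saturation rate
    return [(L_high, m_high), (L_low, m_low),
            (L_high, m_high), (L_low, m_low)]
```

A node walks this list in order and stops injecting once the last pair is exhausted; the simulation ends when every generated message has been delivered.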

C. The Self-Tuned congestion control mechanism

For comparison purposes, in addition to the mechanisms proposed in this paper, we will also evaluate the behavior of Self-Tuned [30]. In this mechanism, nodes detect network congestion by using global information about the number of full buffers in the network. If this number surpasses a threshold, nodes apply message throttling; when it drops below the threshold, message injection is resumed. In theory, the mechanism automatically determines the optimal threshold values from network throughput measurements, in order to adapt to variations in traffic patterns. In practice, there are many parameters (such as the steps for incrementing and decrementing the threshold) that must be tuned to obtain the best performance.

The use of global information requires broadcasting data among all the network nodes. One way of transmitting this control information is to use a sideband network [30], which is costly in terms of hardware and complexity. Indeed, as the mechanisms proposed in this paper do not need to exchange control messages, a fair comparison should count the bandwidth provided by the sideband network as additional available bandwidth in the main interconnection network. However, in the results that we present we do not consider this fact. If this additional bandwidth were considered, the differences between Self-Tuned and the mechanisms proposed in this paper, not only in throughput but also in latency, would be greater than the ones shown.
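In outline, the throttling decision each node makes from the globally broadcast buffer count is as follows. This is a hedged sketch, not the published implementation: the threshold-adaptation logic of Self-Tuned, which tunes the threshold from throughput measurements, is deliberately omitted, and the class and names are hypothetical:

```python
class GlobalThrottle:
    """Simplified Self-Tuned-style controller: a node receives, via the
    sideband network, the number of full buffers in the whole network
    and stops injecting while that count exceeds a (here fixed)
    threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.throttled = False

    def update(self, full_buffers):
        # Throttle injection iff the global full-buffer count is above
        # the threshold; resume injection as soon as it drops below.
        self.throttled = full_buffers > self.threshold
        return not self.throttled  # True => this node may inject
```

The hard part in practice, as noted above, is choosing and adapting `threshold` and its increment/decrement steps, which this sketch leaves out.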

Fig. 7. Percentage of free useful output virtual channels versus traffic for different traffic patterns (Uniform, Perfect-shuffle, Bit-reversal, Complement). 16-ary 3-cube. TFAR, deadlock recovery. 3 virtual channels. 16-flit messages.

D. Performance comparison

In this section, we analyze the behavior of the mechanisms proposed in this paper, comparing them with Self-Tuned. In all cases, results without any congestion control mechanism (No-Lim) are also shown. For the sake of brevity, we show only a subset of the results. The most detailed results are shown for the largest network (4096 nodes, k=16, n=3), 3 virtual channels per physical channel, and the uniform distribution of message destinations, but we also include some results for other network sizes and traffic patterns. The complete evaluation for all the tested networks, traffic patterns and numbers of virtual channels can be found in [4]. Also in [4], the mechanisms proposed in this paper are compared not only with Self-Tuned but also with other mechanisms.

First of all, the U-channels and INC mechanisms have to be tuned. Therefore, we have evaluated their behavior with different message destination distributions, message sizes and network topologies.

1) Tuning the U-channels mechanism: First, we will show that the U-channels threshold does not depend on the traffic pattern. As Fig. 7 shows, although the throughput reached with different message destination distributions is not the same, the percentage of free useful virtual channels at the saturation point is very similar (around 0.3) for a given topology. This behavior is analogous for all the analyzed topologies.

We have fine-tuned this threshold by testing several threshold values for each topology. All of them can be easily expressed as a sum of powers of two, as required by the implementation shown in Fig. 2. Fig. 8 shows the average message latency versus traffic for different threshold values, for the uniform and perfect-shuffle traffic patterns on a 16-ary 3-cube. From the latency and throughput point of view, the highest threshold values apply more injection limitation than necessary. As a consequence, message latency increases because messages are kept waiting at the source nodes. On the other hand, the lowest threshold value allows a more relaxed injection policy and tends to saturate the network. For the 4096-node network, a good choice is 0.375, although a given topology may have more than one threshold that works properly.
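The threshold test itself is simple. As a sketch (names hypothetical), a threshold that is a sum of powers of two, such as 0.375 = 1/4 + 1/8 = 3/8, reduces to an integer comparison that hardware can implement with shifts and adds, without ever computing the fraction:

```python
def injection_allowed(free_useful, total_useful,
                      threshold_num=3, threshold_den=8):
    """U-channels-style injection check: allow injection while the
    fraction of free useful output virtual channels exceeds the
    threshold (default 3/8 = 0.375, the value chosen for the
    4096-node network).

    The fraction is never computed explicitly:
        free/total > num/den  <=>  den*free > num*total
    and with den a power of two the left side is just a shift.
    """
    return threshold_den * free_useful > threshold_num * total_useful
```

For instance, with 8 useful output virtual channels, 4 free channels permit injection (4/8 > 3/8), while 3 free channels trigger throttling.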


Fig. 8. Average message latency (since generation, in cycles) vs. accepted traffic (flits/node/cycle) for different threshold values (No-lim, 0.25, 0.3125, 0.375, 0.5) for the U-channels mechanism, under the uniform and perfect-shuffle traffic patterns. 16-ary 3-cube. TFAR, deadlock recovery. 3 virtual channels. 16-flit messages.

TABLE II
BEST THRESHOLD VALUES FOR DIFFERENT TOPOLOGIES AND NUMBER OF VIRTUAL CHANNELS PER PHYSICAL CHANNEL FOR THE U-CHANNELS MECHANISM.

Nodes       Threshold (3 VCs)   Threshold (4 VCs)
256, 4096   0.375               0.3125
1024        0.625               0.5
512         0.3125              0.25
In general, we have noticed that the best threshold depends mainly on the network topology (in particular, on the number of nodes per dimension, k) and on the number of virtual channels per physical channel, but it does not significantly depend on either the message destination distribution or the message size. For the same radix k and number of dimensions n, the optimal threshold may decrease when the number of virtual channels per physical channel increases, and it may increase for high k values. The explanation is simple. Increasing the number of virtual channels per physical channel improves adaptivity and decreases network contention; hence, a more relaxed injection policy, with a lower threshold value, can be applied. On the contrary, increasing k also increases the average traversed distance and reduces the maximum achievable network throughput [1]; therefore, a higher threshold value is required to avoid congestion. Table II shows the optimal thresholds for the topologies and numbers of virtual channels analyzed.

2) Tuning the INC mechanism: The sampling interval tm, the flit counter threshold fc and the forbidden injection interval fii have to be tuned. However, none of these parameters is critical, as the INC mechanism dynamically adjusts the injection bandwidth. Nevertheless, we will give some hints for selecting proper values for these parameters.

First, the sampling interval tm should be long enough to obtain meaningful measurements but also short enough to allow a quick reaction to changes in the traffic. We have used a sampling interval of tm = 20 cycles, which is the time required to transmit enough flits to fill up the buffers associated with a virtual channel. In our case, 4 flits per virtual channel can be stored (one buffer associated with the output virtual channel and another one at the input side

of the next node, and each of them can hold 2 flits). So, the interval tm has to be at least as long as the time required to transmit 5 flits. However, notice that when all the virtual channels of a physical channel are busy, the maximum advance rate is the inverse of the number of virtual channels per physical channel. Hence, if there are 4 virtual channels per physical channel, 20 cycles may be required to transmit 5 flits (5 flits × 4 virtual channels). Remember that the link time is one cycle. The possible values of the flit counter threshold (fc) are related to the value chosen for tm and the number of virtual channels per physical channel. As we have stated, for tm = 20 cycles and 4 virtual channels per physical channel, if the number of transmitted flits per virtual channel is