Packet Scheduling Across Networks of Switches

Kevin Ross (UCSC School of Engineering, [email protected])
Nicholas Bambos (Stanford University, [email protected])

Abstract. Recent developments in computer and communication networks require scheduling decisions to be made under increasingly complex system dynamics. We model and analyze the problem of packet transmissions through an arbitrary network of buffered queues, and provide a framework for describing routing and migration. This paper introduces an intuitive geometric description of stability for these networks and describes some simple algorithms which lead to maximal throughput. We show how coordination over sequential timeslots by algorithms such as those based on a round robin can provide considerable advantages over a randomized scheme.

1 Introduction

We consider the scheduling of service over generalized switch networks. In this paper we develop methodology to analyze networks of queues where service resources must be distributed over a network, and each queue may forward processed requests to another queue. Besides theoretical interest, this work has immediate applied impact in the design of multi-stage/multi-fabric switches (due to the limited scalability of switching cores) as well as in controlling interconnection networks.

Consider a general network of queues, with arbitrary interrelations between the queues. Packets, jobs or requests enter some queue in the network and remain there until they are served. Upon completion, packets are either forwarded to other queues or they depart the network. This model is a significant generalization of that presented in [8], where no forwarding or feedback is allowed and packets served in any queue immediately depart the network.

Several important results have been shown [6, 7, 4] on the stability of switches which can be modeled as interacting queues competing for service. For networks of switches, the potential for localized switching algorithms to lead to instability was shown in [2]. An early overview of queueing network theory is given in [9], and some recent work has included the analysis of greedy algorithms in [1] and an adversarial fluid model approach in [5].

This paper proceeds as follows. In section 2, we describe in detail the model under consideration. In section 3 we discuss system stability and throughput, and in section 4 we introduce throughput-maximizing algorithms with examples of their performance. Conclusions are outlined in section 5. Due to space limitations we have restricted the content to model formulation and simple algorithms.

2 The Network Model and its Dynamics

In this section we develop the network model, using a sequence of definitions explained via carefully chosen examples and figures.

We consider a processing system which is a network comprised of Q first-in-first-out (FIFO) queues of infinite buffer capacity, indexed by q ∈ Q = {1, 2, ..., Q}. Time is slotted and indexed by t ∈ {0, 1, 2, 3, ...}. Packets (jobs/tasks) may arrive at each queue in each time slot. Upon receiving service and departing from that queue, they may be routed to another queue, and then another, visiting several queues before eventually exiting the network. We use the term cell to denote a unit of packet backlog in each queue. For simplicity, we assume that each packet can be 'broken' arbitrarily into cells or segments of cells, and in each time slot a number of cells can be processed at each queue and then forwarded to another queue (or out of the network).

Vectors are used to encode the network backlog state, arrivals, and service in each time slot. Specifically, X(t) = (X_1(t), X_2(t), ..., X_q(t), ..., X_Q(t)) is the backlog state, where X_q(t) is the integer number of cells in queue q ∈ Q at time t. The vector of external arrivals to the network is A(t) = (A_1(t), A_2(t), ..., A_q(t), ..., A_Q(t)), where A_q(t) is the number of cells arriving to queue q at time t from outside the network (as opposed to being forwarded from other queues). The following is assumed for each q ∈ Q:

    lim_{t→∞} ( Σ_{s=0}^{t} A_q(s) ) / t = ρ_q ∈ [0, ∞)    (1)
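The long-run average in (1) can be checked empirically on a simulated arrival trace. The sketch below is only illustrative (the function name and the particular periodic trace are our own assumptions, not from the paper):

```python
def empirical_load(arrivals):
    """Running average (sum_{s=0}^{t} A_q(s)) / t over a finite arrival trace."""
    return sum(arrivals) / len(arrivals)

# Deterministic periodic trace with long-run rate rho_q = 0.5 cells per slot.
trace = [1 if t % 2 == 0 else 0 for t in range(10_000)]
rho_q = empirical_load(trace)  # 0.5
```

Note that (1) places no statistical assumptions on the trace; any sequence whose running average converges, deterministic or random, is admissible.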

that is, the long-term average external arrival load to each queue is well-defined, nonnegative and finite. There is at least one queue with strictly positive external load ρ_q > 0, while several queues may have zero external load ρ_q' = 0. The long-term average load vector is ρ = (ρ_1, ρ_2, ..., ρ_q, ..., ρ_Q). We do not assume any particular statistics generating the traffic traces, allowing for very general traffic loads.

At each time slot, the network may be set to one transfer mode, represented by a matrix T^m and a corresponding vector S^m chosen from the set of available modes m ∈ {1, 2, ..., M}. Each T^m is a Q × (Q + 1) matrix of transfer rates under mode m, representing all of the cell transfers in that mode. In particular, for q ≠ Q + 1, T^m_pq is the number of cells sent from queue p to queue q in one timeslot when configuration mode m is used (with T^m_pp = 0 for all p). For q = Q + 1, T^m_pq is the number of cells served in queue p and then departing the system immediately under m.

For example, if Q = 3 and the matrix T* = [0 2 0 3; 0 0 0 0; 0 0 0 1] is used (rows separated by semicolons), then two packets are forwarded from queue 1 to queue 2, three packets are served in queue 1 and then exit, and one cell exits from queue 3.

Corresponding to each matrix T^m are three service vectors S^m, S^m+ and S^m-. These vectors reflect the total change in queue lengths for each queue in the system when mode m is selected. In particular, S^m+_q = Σ_{p=1}^{Q+1} T^m_qp is the total number of departures from queue q under mode m, S^m-_q = Σ_{p=1}^{Q} T^m_pq is the total number of arrivals to queue q generated by mode m, and S^m = S^m+ - S^m- is the vector of total change in workload (service) to the system under mode m. According to our example T* above we have S*+ = (5, 0, 1), S*- = (0, 2, 0), S* = (5, -2, 1).

At each timeslot, a mode m is selected from the available modes. If S^m+_q > X_q for some q, then more cells are scheduled to be served in queue q than are actually waiting. In this case the matrix T^m and vectors S^m must be adjusted to correspond to actual transitions. This is done through a careful notational change, differentiating between the selected mode at time t, labeled m(t), and the actual transition and service levels T(t) and S(t) (which are based on T^m(t) and S^m(t) respectively). The updating of T(t) may follow some rule reflecting the priorities of waiting cells, and maintains the property that the total workload forwarded under m is at most the number of cells waiting.

Assumption 1. At timeslot t, for a workload vector X(t) and a selected service mode m(t), the matrix T(t) of actual workload transfer at time t is found by some function T(t) = f(X(t), m(t)) which satisfies T(t) ≤ T^m(t). Corresponding actual service vectors are S+_q(t) = Σ_p T_qp(t) ≤ X_q(t) and S-_q(t) = Σ_p T_pq(t) for each q ∈ Q.

One example of such a function would be T_pq(t) = ( X_p(t) / S^m(t)+_p ) T^m(t)_pq, which sends cells in proportion to the scheduled transition matrix. Another example would be to reduce T_pq(t) for each q in order of priority. Using our example matrix T* from earlier, if the workload vector is X(t) = (3, 5, 8), then five cells are scheduled to depart from queue 1 but only 3 are waiting. The function f may choose an alternative transfer matrix T(t) = [0 1 0 2; 0 0 0 0; 0 0 0 1] ≤ T^m = [0 2 0 3; 0 0 0 0; 0 0 0 1].

Having carefully defined the terms in the workload evolution, the vectors representing workload and workload change follow the simple evolution equation:

    X(t + 1) = X(t) - S(t) + A(t)    (2)
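The bookkeeping above can be sketched in code: compute S^m+, S^m- and S^m from a transfer matrix, clip the matrix against the backlog as in Assumption 1, and apply the evolution equation (2). This is a minimal sketch, not the paper's implementation; the proportional rounding rule in `clip_mode` is one possible choice of f, and all function names are ours:

```python
import math

Q = 3  # number of queues; column index Q marks departures from the system

def service_vectors(T):
    """S+ = row sums, S- = column sums over the Q queue columns, S = S+ - S-."""
    s_plus = [sum(row) for row in T]
    s_minus = [sum(T[p][q] for p in range(Q)) for q in range(Q)]
    return s_plus, s_minus, [sp - sm for sp, sm in zip(s_plus, s_minus)]

def clip_mode(T, x):
    """One possible f(X, m): scale row p by min(1, X_p / S+_p), rounding down,
    so scheduled departures from each queue never exceed its backlog."""
    s_plus, _, _ = service_vectors(T)
    out = []
    for p, row in enumerate(T):
        scale = 1.0 if s_plus[p] == 0 else min(1.0, x[p] / s_plus[p])
        out.append([math.floor(v * scale) for v in row])
    return out

def step(x, T_actual, arrivals):
    """One slot of the evolution X(t+1) = X(t) - S(t) + A(t)."""
    _, _, s = service_vectors(T_actual)
    return [xq - sq + aq for xq, sq, aq in zip(x, s, arrivals)]

T_star = [[0, 2, 0, 3],
          [0, 0, 0, 0],
          [0, 0, 0, 1]]

s_plus, s_minus, s = service_vectors(T_star)  # (5,0,1), (0,2,0), (5,-2,1)
T_t = clip_mode(T_star, [3, 5, 8])            # row 1 scaled: only 3 cells wait
x_next = step([3, 5, 8], T_t, [0, 0, 0])
```

The clipped matrix here differs from the particular T(t) chosen in the text, illustrating that f is not unique: any T(t) ≤ T^m(t) with row sums bounded by the backlog satisfies Assumption 1.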

Fig. 1 shows various network topology features, and the way that this model would describe each case. A general network topology would include multiple queues entangled via various tandem and feedback cell routing paths. By extension of (2), in the long term

    X(t + 1) = X(0) + Σ_{s=0}^{t} A(s) - Σ_{s=0}^{t} S(s)    (3)
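Equation (3) is simply (2) telescoped over slots, which a short simulation can verify. The sketch below uses arbitrary made-up traffic; the variable names are ours:

```python
import random

random.seed(1)
Q = 3
x0 = [4, 2, 7]          # X(0), the initial backlog
x = list(x0)
A_hist, S_hist = [], []
for _ in range(50):
    a = [random.randint(0, 2) for _ in range(Q)]
    # any service not exceeding the current backlog keeps X nonnegative
    s = [random.randint(0, xq) for xq in x]
    x = [xq - sq + aq for xq, sq, aq in zip(x, s, a)]   # equation (2)
    A_hist.append(a)
    S_hist.append(s)

# Equation (3): X(t+1) = X(0) + sum_s A(s) - sum_s S(s)
recon = [x0[q] + sum(a[q] for a in A_hist) - sum(s[q] for s in S_hist)
         for q in range(Q)]
assert recon == x
```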

where X(0) is the vector of initial backlog levels. The objective of this analysis is to develop algorithms for these systems which select m(t) in each timeslot in a way that ensures all cells are served and no backlog queue grows uncontrollably.

For simplicity, all queues are considered to be store-and-forward: current cell arrivals are registered at the end of the slot, while cell service and departures occur during the slot. Therefore, no cell may both arrive and depart in the same slot. Moreover, we assume per-flow (or per-class) queueing, in the sense that if packets/cells are differentiated by class/flow they are queued up in separate (logical) queues in the system. Such class/flow differentiation may reflect distinct paths/routes of nodes that various packets/cells need to follow through the network, or diverse service requirements they might have at the nodes.

Fig. 1. Service modes under various queuing structures (matrix rows separated by semicolons).
(a) Parallel queues. This is the simple case of a parallel queue network topology with no cell routing interaction between queues. For example, the possible service transfer matrix T = [0 0 0 2; 0 0 0 0; 0 0 0 1] would serve two cells from queue 1 and one cell from queue 3 when applied in a slot.
(b) Tandem queues. The transfer matrix T = [0 1 0; 0 0 0] would correspond to one cell being served in queue 1 and forwarded to queue 2.
(c) Queues with feedback. Cells served in one queue may be routed back to an upstream queue even if they have previously been processed there. On return to the upstream queue, the cell is either routed to the exact same queue or stored in a separate virtual tandem logical queue. Separate queues must be utilized when cells need to be distinguished according to the number of times they have already been processed there.
(d) Routing or splitting. There are two main scenarios covered by the model. In the first one, the mode selects which downstream queue to send each cell to. The configurations T1 = [0 1 0 0; 0 0 0 0; 0 0 0 0] and T2 = [0 0 1 0; 0 0 0 0; 0 0 0 0] represent forwarding a cell from queue 1 to either queue 2 or queue 3 respectively. In the other scenario, queue 1 produces/spawns several cells and forwards to both queue 2 and queue 3. For example, T3 = [0 1 1 0; 0 0 0 0; 0 0 0 0] would correspond to two cells served in queue 1, one sent to queue 2 and the other to queue 3 (similar to cell multicasting).
(e) Merging. In this network topology, cells may be forwarded from different queues to the same queue. For example, the configuration T = [0 0 1 0; 0 0 1 0; 0 0 0 0] would allow both queues 1 and 2 to forward to queue 3 simultaneously.
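The split and merge matrices in Fig. 1(d) and 1(e) can be checked with the same service-vector arithmetic as before. The matrix values below are copied from the caption; the helper function is our own sketch:

```python
def service_vectors(T, Q):
    """S+ = row sums, S- = column sums over the Q queue columns, S = S+ - S-."""
    s_plus = [sum(row) for row in T]
    s_minus = [sum(T[p][q] for p in range(Q)) for q in range(Q)]
    return s_plus, s_minus, [a - b for a, b in zip(s_plus, s_minus)]

T3 = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]       # Fig. 1(d): split
T_merge = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]]  # Fig. 1(e): merge

_, _, s_split = service_vectors(T3, 3)        # queue 1 drains, 2 and 3 grow
_, _, s_merge = service_vectors(T_merge, 3)   # queues 1, 2 drain, 3 grows
```

Negative components of S^m mark downstream queues whose backlog grows under the mode, which is exactly why localized, per-queue scheduling can go wrong in networks [2].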

3 Stability and Throughput

The vector backlog framework described here leads to an intuitive geometric understanding of stability. We say that an arrival rate is stable if there exists a sequence of configurations to match the arrival rate, and an algorithm is throughput-maximizing if it finds such a sequence for any stable arrival rate.

We utilize the concept of rate stability in our throughput analysis of the system. In particular, we seek algorithms which ensure that the long-term cell departure rate from each queue is equal to the long-term arrival rate. Such algorithms must satisfy

    lim_{t→∞} ( Σ_{s=0}^{t} S_q(s) ) / t = lim_{t→∞} ( Σ_{s=0}^{t} A_q(s) ) / t = ρ_q    (4)

for each q ∈ Q; that is, there is cell flow conservation through the system.

In section 2 we described the transfer matrix T^m (or T(t)) and the service vector S^m (or S(t)). For any set of modes available there is a finite set of possible vectors S(t) which could be realized. We call this set S. Note that the set {S^m}_{m=1}^{M} is itself a subset of S.

Definition 1. The stability region R of the switching system described is the set of all load vectors ρ for which rate stability in (4) is maintained under at least one feasible scheduling algorithm. The stability region can be expressed [3, 7] as

    R = { ρ ∈ R^Q_+ : ρ ≤ Σ_{S ∈ S} φ_S S, for some φ_S ≥ 0 with Σ_{S ∈ S} φ_S = 1 }    (5)
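Membership in the stability region (5) can be tested numerically. The sketch below scans a discretized simplex of mixing weights, which is only adequate for tiny mode sets; the mode vectors are made-up examples, and a real implementation would solve a linear program instead:

```python
from itertools import product

def in_stability_region(rho, modes, steps=50):
    """Check whether rho <= sum_S phi_S * S componentwise for some phi on the
    probability simplex, by brute-force search over a coarse simplex grid."""
    n = len(modes)
    grid = [k / steps for k in range(steps + 1)]
    for phi in product(grid, repeat=n - 1):
        last = 1.0 - sum(phi)          # remaining weight for the last mode
        if last < -1e-9:
            continue                    # not a valid point on the simplex
        weights = list(phi) + [last]
        mix = [sum(w * S[q] for w, S in zip(weights, modes))
               for q in range(len(rho))]
        if all(r <= m + 1e-9 for r, m in zip(rho, mix)):
            return True
    return False

# Two hypothetical modes for Q = 2 parallel queues: serve one queue at rate 2.
modes = [(2, 0), (0, 2)]
```

For these modes, time-sharing equally gives the mixed service vector (1, 1), so a load like (0.9, 0.9) lies inside R while (1.5, 1.0) does not: no simplex weighting can cover both components at once.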