Dynamic Time-Slot Allocation for QoS Enabled Networks on Chip

T. Marescaux (1), B. Brické (1,2), P. Debacker (1,2), V. Nollet (1), H. Corporaal (3)

(1) IMEC V.Z.W., Kapeldreef 75, 3001 Leuven, Belgium
(2) Katholieke Universiteit Leuven, ESAT, Leuven, Belgium
(3) Technical University Eindhoven (TU/e), Eindhoven, The Netherlands

[email protected]

Abstract

MP-SoCs are expected to require complex communication architectures such as NoCs. This paper presents, to our knowledge, the first algorithm to dynamically perform routing and allocation of guaranteed communication resources on NoCs that provide QoS with TDMA techniques. We test the efficiency of our algorithm by allocating the communication channels required for an application composed of a 3D pipeline and an MPEG-2 decoder/encoder video chain on a 16-node MP-SoC. Dynamism in the communication is created by the 3D application. On a StrongARM processor clocked at 200 MHz, allocating one time-slot takes about 1000 cycles per hop in the connection. We show that central time-slot allocation algorithms are practical for small-scale MP-SoC systems. Indeed, our algorithm can compute the allocation of 40 connections for a complex scene of the 3D pipeline in 450 to 900 µs, depending on the slot table size.

1

Introduction

In order to meet the ever-increasing design complexity, future sub-100nm platforms [1, 2] will consist of a mixture of heterogeneous computing, memory and I/O resources. These platforms, called Multi-Processor Systems-on-Chip (MP-SoC), are expected to use flexible and scalable switched communication architectures such as a Network-on-Chip (NoC) [2, 5, 8]. We are targeting the mapping of multimedia applications such as video encoding/decoding and 3D rendering to these advanced compute architectures. Multimedia applications typically place quite stringent real-time requirements, in terms of computation and communication, on the platform architecture. In a multi-processor system the communication architecture is a critical part of the system, and providing guarantees in terms of bandwidth and latency is an ever-increasing need to ensure predictable and reliable operation.

We believe that NoCs are the most cost-efficient underlying communication architecture of future multi-processor embedded platforms and can ease embedded application design. NoCs can provide different classes of communication, more generally called Quality of Service (QoS), with real-time guarantees application designers can rely on. The class of NoCs we are considering offers hard guarantees in terms of bandwidth and/or latency coupled to best-effort traffic. The process of allocating guaranteed communication resources requires finding an optimal or nearly-optimal route through the network from source to destination while ensuring a contention-free time-slot scheduling. It is a complex allocation problem of spatial and temporal resources that needs to be solved at run-time to establish new guaranteed connections.

We have studied several allocation algorithms, running on a central StrongARM processor clocked at 200 MHz, and found that our extended version of the IDA* algorithm, well known in the artificial-intelligence domain, provides the best results. To establish the 40 guaranteed connections of a 3D pipeline to render a complex scene, our algorithm requires between 450 and 900 µs, depending on the slot table size. The memory footprint of our algorithm is under 2.5 KB, which makes it practical to implement on an embedded processor such as the StrongARM, which has a cache size of about 16 KB.

The rest of the paper is organized as follows: Section 2 discusses how TDMA hard guarantees can be provided on top of packet-switched NoCs. Section 3 introduces augmented reality, our driver application, and its mapping to a 4x4 MP-SoC. Section 4 details our run-time time-slot allocation algorithm, and performance and memory footprint results are presented in Section 5. Finally, Section 6 concludes. Related work [2, 6, 7, 8] is discussed throughout the paper.

0-7803-9347-3/05/$20.00 ©2005 IEEE.

2

TDMA on Packet-Switched NoCs

This section explains how TDMA channels can be implemented on top of packet-switched NoCs to provide flexible, yet guaranteed communication channels.

2.1

Circuit and Packet Switching Techniques

There are two main classes of switching techniques for NoCs: circuit and packet switching. The word switching refers to the fact that the routers composing a NoC contain a switch that creates input → output connections to transmit data. The difference between circuit and packet switching lies in the manner in which the switch inside the router is controlled. When the information to control the switch is embedded with a (relatively short) burst of data, we speak of packet switching. Circuit switching occurs when the information to control the switch is sent beforehand and the connection is maintained, so that any consecutive bursts of data follow the same circuit. In the case of circuit switching, routers are often no more than simple switches, and most of the control and protocols to establish and close circuits happens outside of the router. On the contrary, in the case of packet switching, routers understand simple protocols that allow them to determine the switch configuration based on the routing information embedded in the packet.

In circuit switching a circuit, composed of all input → output configurations of the switches along the path, is effectively established for the duration of the communication. The advantage of this approach is that there is no protocol overhead once all the switches on the path have been configured, but if no data is sent during a period of time the unused bandwidth is wasted. In packet switching the configuration of the switches has to be transmitted with every packet, thus generating overhead, but data is only sent when needed, thus avoiding waste of bandwidth. Typically circuit switching is used for a limited number of connections that have a long lifetime, whereas packet switching is better adapted to more flexible communications and only uses bandwidth when there is data to transmit.

2.2

Reservations in Space or in Time

Networks on chip are shared communication resources. Providing hard-guaranteed QoS requires exclusive reservations of resources on the communication channels. To provide QoS, reservations happen either in space (typically buffer space is reserved in the routers for a particular guaranteed communication) or in time (communication channels are multiplexed in time and time-slices are allocated to a particular guaranteed communication). The latter technique is called Time Division Multiple Access (TDMA). It is of course possible to mix these techniques to provide differentiated levels of QoS.

Circuit-switched NoCs typically use TDMA to share the bandwidth of a physical link between several circuits. In this case the circuits are only physically connected during a time-slice (also called a time slot), so routers contain time-slot tables that specify the input → output configuration of the switch at every time slot Ti, with i ∈ {0, ..., n − 1} on a system with a wrap-around time of n time-slots. Such a time-sliced circuit is called a virtual circuit. On circuit-switched NoCs the hard guarantees on the communication are inherent to the circuit-switched property that effectively reserves communication resources along the path. The TDMA approach is used to allow sharing of physical links between several virtual circuits.

On the contrary, packet-switched NoCs do not need TDMA to allow the sharing of physical links, as packets from several communications can be interleaved. However, as reservations are not required, no hard guarantees are provided by default. QoS can nevertheless be offered, provided reservations can be made: in space, by reserving buffer space on the routers along the path, or in time, by using TDMA and ensuring that packets belonging to a certain communication are only injected into the network at their reserved time. For packet-switched networks, no time-slot tables are required in the routers, but only at the network interfaces that control the injection of the packets [3, 4, 7].

The NoCs we are considering in this paper are packet-switched. Hard guarantees are provided the TDMA way [6, 7], because it is more area-efficient and allows a better control of the granularity of the reservations. The justification is out of the scope of this paper, but one can intuitively understand that providing n guaranteed channels over a physical link between two routers would require reserving n buffers at every hop, whereas by multiplexing these channels in time the buffers can be reused over time.
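The per-router time-slot tables described in this section can be modelled as a small lookup structure. The Python sketch below is our own illustration (names and the conflict rule are assumptions, not the paper's implementation): it reserves an output port for a given (time-slot, input-port) pair and refuses conflicting reservations within the same slot.

```python
class SlotTable:
    """Per-router TDMA table: (time_slot, input_port) -> output_port."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.table = {}

    def reserve(self, slot, in_port, out_port):
        # A reservation succeeds only if, in that time-slot, neither this
        # input port nor this output port is already switched.
        key = (slot % self.num_slots, in_port)
        taken_outputs = {o for (s, _), o in self.table.items()
                         if s == slot % self.num_slots}
        if key in self.table or out_port in taken_outputs:
            return False          # contention: reservation refused
        self.table[key] = out_port
        return True

router = SlotTable(num_slots=8)
assert router.reserve(0, "north", "east")
assert not router.reserve(0, "west", "east")   # output busy in slot 0
assert router.reserve(1, "west", "east")       # free again in the next slot
```

Because the table wraps around after `num_slots` slots, a reservation in slot 0 recurs every rotation, which is exactly the periodic bandwidth guarantee TDMA provides.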

3

Augmented Reality on an MP-SoC

Augmented Reality (AR) is a technique where computer-generated 3D images are mixed with real-life video images and interact with the real scene as if they were part of it. Augmented reality finds applications in avionics, medicine and of course in games such as ARQuake [10] (Figure 1). These applications are typically very compute-intensive and require heterogeneous compute resources for video and 3D processing, for which MP-SoCs are suitable architectures.

Figure 1: Screenshot of Augmented Reality Quake, a 3D application on top of live video [10].

The application we consider in this paper, conceptually similar to AR-Quake, is composed of two main parts: a 3D pipeline and a video decoding/encoding chain composed of a camera (CAM), MPEG-2 encoder (ENC) and decoder (DEC) blocks, a display (FB) and memory blocks (Figure 2). The 3D image from 3D Out is multiplexed with the video frames from CAM before going to the frame buffer (FB) for display and to the encoding block to be sent over a wireless link to other players. The decoding part of the video pipeline is used to display picture-in-picture of the encoded streams of other players. The processing nodes in the 3D pipeline are 2 Vertex Editor (VE) nodes, which contain 2 vertex processors each, and 4 Pixel Editor (PE) nodes, which each contain 4 pixel processors (Figure 2). The constraints for the 3D part of our application are a resolution of 640 by 480 pixels at 25 frames per second with a texture size of 128 by 64 pixels. For this example we consider that the 3D scene covers 30% of the total image size, so that only 30% of the image has to be rendered by the 3D pipeline, the rest being live video. The mapping of the various processors and memory modules onto a 4 × 4 mesh NoC architecture is shown in Figure 3.

Figure 2: Mapping of 3D pipeline and video chain.

Figure 3: Floorplanning on a 4 × 4 mesh NoC.

We consider that the communication of the video chain does not vary much over time, so that the time-slot allocation can be computed off-line for this part of the AR application. The required bandwidth used to reserve communication channels for the video chain is indicated in Figure 2. The dynamic part of the communication comes from the 3D part of the application, where we use 3 scenarios to describe the complexity of the scene to render (Table 1). These scenarios are labeled simple, average and complex and differ by the number of triangles, vertices and objects in the scene. These scenarios influence the amount of bandwidth to reserve between the various blocks (Table 1). This required bandwidth has been estimated by analyzing the behavior of the 3D pipeline under the conditions generated by our 3 scenarios. The bandwidth of the input and output of the PEs is invariant over the 3 scenarios because it only depends on the scene coverage, fixed to 30% in our example.

Switching from one scenario to another requires adaptation of the bandwidth reservation by performing time-slot allocation in a time negligible with respect to (non-predictable) scene complexity changes. We expect scenario changes to occur in the worst case every 100 ms, which is compatible with scene complexity changes that occur, for instance, when a player enters a new room with many more objects to render.

                 Simple   Average   Complex
  #Triangles        500     15000     50000
  #Vertices         250      7500     25000
  #Objects           10       300      1000
  VE in            1.53     45.78    152.59
  Raster vtx        1.1      32.9    109.67
  Raster z rd     35.16     35.16     35.16
  Raster z wr     10.55     10.55     10.55
  Raster tri       1.14     34.33    114.44
  PE in          333.98    333.98    333.98
  Text ram rd       7.5       220       750
  Text ram wr     0.003     0.082     0.275
  3D Out          35.16     35.16     35.16

Table 1: Number of elements of the 3D scene in the 3 scenarios and required bandwidth (Mbits/s).

4

Time-Slot Allocation Algorithm

This section explains how the topology and time-slots of TDMA NoCs can be modeled with a unified time-space graph. Performing routing and time-slot allocation amounts to finding a path from source to destination on such a graph. We discuss the efficiency of algorithms that perform this graph traversal and propose our extended version of IDA* as a good trade-off between speed and optimality.

4.1

Graph Representation of TDMA NoC

It is possible to represent the network in the form of a graph, where nodes represent routers and edges represent the physical links between the routers (Figure 4(a)); this corresponds to the spatial (i.e. topology) description of the network. In order to also represent time information, such as time-slots, the graph can be extended so that a router R is represented by an ensemble of nodes R|S| = {Ri, i ∈ S}, where S represents the ensemble of time-slots and |S| the number of time-slots in one time-slot rotation. Edges represent transitions in space (going from one router to another) and in time (traversing one hop takes one time-slot). For instance, Figure 4(b) represents the 2x2 mesh topology of Figure 4(a) with |S| = 3 time-slots. The edges have been chosen so that only pipelined schedulings of time-slots are authorized [7]. We call such a graph a time-space graph of the network. The scheduling is said to be pipelined if, for any hop along the reserved path, the allocation of time-slot Tn (n ∈ S) implies that time-slot T(n+1) mod |S| is allocated for the following hop.

Figure 4: Graph representation of a 2x2 mesh network (a) and graph representation of the (pipelined) virtual connections over the same 2x2 mesh network with 3 time-slots, |S| = 3 (b).

The representation of the time-space graph of the network is maintained centrally on the processor performing time-slot allocation. Central configuration of the network is feasible for the relatively small-scale networks we are considering (a few tens of nodes). Time-slot allocation and path finding between any two nodes is an optimal-path problem over the time-space graph of the network.
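The time-space graph can be generated mechanically from the mesh topology and |S|. The following Python sketch is our own illustration (not the paper's data structure): it builds the pipelined adjacency of Figure 4(b), where a hop entered at slot t arrives at slot (t + 1) mod |S|.

```python
from itertools import product

def mesh_neighbours(x, y, w, h):
    # 4-connected neighbours of router (x, y) inside a w x h mesh
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < w and 0 <= y + dy < h:
            yield (x + dx, y + dy)

def time_space_graph(w, h, num_slots):
    # Node ((x, y), t): router (x, y) considered at time-slot t.
    # Pipelined edges only: slot t at one hop connects to slot
    # (t + 1) mod |S| at the next hop.
    graph = {}
    for x, y, t in product(range(w), range(h), range(num_slots)):
        graph[((x, y), t)] = [
            (n, (t + 1) % num_slots) for n in mesh_neighbours(x, y, w, h)
        ]
    return graph

g = time_space_graph(2, 2, 3)   # the 2x2 mesh of Figure 4 with |S| = 3
print(len(g))                    # 12 time-space nodes (4 routers x 3 slots)
```

A path in this graph simultaneously fixes the route and a contention-free pipelined slot schedule, which is why routing and time-slot allocation reduce to a single path search.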

4.2

Hill-Climbing and IDA∗

We have tested three different graph-traversal algorithms to perform run-time time-slot allocation: Hill-Climbing (HC) (Listing 1, Figure 5), Maximum-Misroute (MM) and Iterative Deepening A* (IDA*) [9].

Listing 1: Hill-Climbing

  LIST ← SRC
  WHILE (LIST not empty AND DST not reached) {
    - remove first path of the LIST
    - create new paths by expanding the removed path
      to the children nodes
    - remove paths with loops
    - sort new paths by minimizing cost (misroutes
      and occupancy) and insert in LIST
  }
  path_found ← DST reached
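Listing 1 can be transcribed into runnable form. The Python sketch below is our own transcription (not the paper's C implementation): it keeps the LIST of partial paths sorted in a heap and uses a unit cost per hop, omitting the misroute and occupancy terms of the paper's cost function for brevity.

```python
import heapq

def hill_climbing(neighbours, src, dst):
    # LIST of partial paths, kept sorted by cost.
    frontier = [(0, [src])]
    while frontier:
        cost, path = heapq.heappop(frontier)   # remove first path of LIST
        if path[-1] == dst:
            return path                        # DST reached
        for child in neighbours(path[-1]):     # expand to children nodes
            if child not in path:              # remove paths with loops
                heapq.heappush(frontier, (cost + 1, path + [child]))
    return None                                # LIST empty: no path found

# 3x3 mesh as in Figure 5, with SRC = (0, 0) and DST = (2, 2)
def mesh3x3(node):
    x, y = node
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return [(x + dx, y + dy) for dx, dy in steps
            if 0 <= x + dx < 3 and 0 <= y + dy < 3]

path = hill_climbing(mesh3x3, (0, 0), (2, 2))
print(path[0], path[-1], len(path))   # (0, 0) (2, 2) 5
```

With a full cost function that penalizes misroutes and slot occupancy, the same loop prefers cheap, minimally-misrouted paths first, which is the behaviour the paper builds on.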

Figure 5: Illustration of Hill-Climbing on a 3x3 mesh network with SRC = S and DST = G.

HC is a depth-first (DF) traversal algorithm and thus has the drawback that it does not always find the shortest path in one iteration when a number of misroutes are required due to unavailable time-slot resources along the shortest path (Figure 6(a)). A solution to this issue would be to use a breadth-first (BF) traversal algorithm instead, but this approach has the major drawback of drastically increasing the memory usage. IDA* combines the advantages of both approaches (HC and BF). HC is used to find a path because of its loop-detection and low-memory-footprint properties with respect to BF. To also guarantee that HC yields the shortest path, we initially allow 0 misroutes and incrementally allow more misroutes if no path is found. This principle is called Iterative Deepening and is illustrated in Figure 6(b), where the search space is limited to 0 misroutes.

Figure 6: Path found with HC (a) and with IDA* (b).

Having an iterative search over the graph may at first sight seem inefficient; however, this is not the case. Assume the target is reached after the whole graph has been searched with m misroutes. Let b be the average degree of a node in the graph and d the Manhattan distance between source and destination nodes. Our heuristic assumes x-y routing on mesh networks, therefore every misroute increases the path length by 2 hops. In iterative deepening the computation done for iterations (0 ... m − 1) can be written as:

  T_exec = Σ_{i=0}^{m−1} b^(d+2i) = b^d · (b^(2m) − 1) / (b^2 − 1) = O(b^(d+2m−2))

The worst-case computing time of IDA* over iterations (0 ... m − 1) is thus a factor b^2 lower than the m-misroute traversal of the whole graph, which is executed in O(b^(d+2m)).
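The geometric-series identity behind this bound is easy to check numerically; a quick sketch with arbitrary values of b, d and m:

```python
# Verify T_exec = sum_{i=0}^{m-1} b^(d+2i) = b^d * (b^(2m) - 1) / (b^2 - 1)
for b, d, m in [(2, 3, 4), (3, 4, 5), (4, 2, 3)]:
    lhs = sum(b ** (d + 2 * i) for i in range(m))
    rhs = b ** d * (b ** (2 * m) - 1) // (b ** 2 - 1)
    assert lhs == rhs
# The dominant term is b^(d + 2(m-1)), hence the O(b^(d+2m-2)) bound.
```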

4.3

IDA∗ Extensions for Greedy Allocation

The amount of bandwidth allocated is proportional to the number of time-slots allocated to a particular reserved channel. If all allocated time-slots are consecutive, the allocation is said to be greedy. To perform time-slot allocation of channels that require more bandwidth than is available in a single time-slot, it is possible to run several iterations of IDA*. However, for a greedy allocation this is inefficient. We have extended the IDA* algorithm to open paths composed of multiple time-slots using a greedy heuristic:

Listing 2: IDA* Extended

  allowed_misroutes ← 0
  UNTIL (connection_established OR
         allowed_misroutes > max_misroutes) DO {
    connection_established ← tryMisroutes()
    allowed_misroutes++
  }
  path_found ← connection_established

Listing 3: tryMisroutes

  timeslots_over ← timeslots_required
  UNTIL (timeslots_over ≤ 0 OR path_failed) DO {
    path_failed ← performHillClimbing()
    IF (NOT path_failed)
      path_failed ← greedyAllocateSlots()
  }
  IF (path_failed) {
    removePathsFromConnection()
    connection_established ← false
  } ELSE
    connection_established ← true

Whenever HC has found a path we attempt to reserve consecutive time-slots until all required bandwidth is provided. If HC has not found a feasible schedule, or if not enough time-slots can be allocated consecutively to those found by HC, the path is discarded and another iteration is attempted with one extra allowed misroute.

5

Results and Discussion

The extended IDA* algorithm (Section 4.2) has been implemented in C and optimized to run on a StrongARM SA-110 clocked at 200 MHz. We have measured the time required to create a new connection and to modify an existing connection for each of the three scenarios. The central time-slot table is initialized with the slots reserved for the video chain. The modification of existing connections has been timed both for an increase of the allocated bandwidth from the "simple" to the "complex" scenario, and for a reduction of bandwidth from "complex" to "simple". Table 2 shows the number of time-slots that must be reserved for the various tests. The measurements have been performed on an SA-110 ISS and on a Compaq iPAQ H3600, with an SA-1110 processor (similar to the older SA-110). Using an ISS besides the real architecture not only allowed us to simulate the cache hit ratio and perform code optimizations to maximize it, but also allowed us to remove the operating-system overhead from the measurements on the real platform.

  #time-slots in |S|     5    8    16    32
  create simple         35   40    49    72
  create average        35   40    53    84
  create complex        39   48    71   123
  simple2complex         4    8    22    51
  complex2simple        -4   -8   -22   -51

Table 2: Required time-slots for the creation of the 3 scenarios and for the switching between the "simple" and "complex" scenarios.

5.1

Memory Footprint

The memory footprint of the time-slot allocation application is dominated by the time-slot tables. The central time-slot tables indicate which router output port Ro is reserved when a flit comes from the router input port Ri in time-slot Tn. Table 3 shows the memory usage for |S| = {5, 8, 16, 32} time-slots. The "timeslottable" variable contains a central copy of the time-slot tables. The "connInfo" variable contains information on all established connections and is required to modify or delete an existing connection. It follows from this table that the algorithm's memory requirements are low enough to run on an embedded processor such as the StrongARM SA-110 (16 KB of cache).

  #time-slots in |S|     5     8    16    32
  timeslottable        320   512  1024  2048
  connInfo             256   256   256   256
  other                228   228   228   228
  total                804   996  1508  2532

Table 3: Memory footprint for different |S| in bytes.
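The "timeslottable" row of Table 3 is consistent with one word per router per time-slot. The sketch below checks this, assuming 4 bytes per entry on the 4x4 NoC (our assumption; the paper does not state the entry size):

```python
ROUTERS = 16          # 4x4 mesh
BYTES_PER_ENTRY = 4   # assumed: one 32-bit word encodes the switch
                      # configuration of one router in one time-slot
sizes = {S: ROUTERS * S * BYTES_PER_ENTRY for S in (5, 8, 16, 32)}
# Matches the "timeslottable" row of Table 3
assert sizes == {5: 320, 8: 512, 16: 1024, 32: 2048}
```

Under this assumption, the central table grows linearly with |S|, which explains why the total footprint stays under 2.5 KB even for a 32-slot table.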

5.2

Path Reservation Performance

Figure 7 shows the timing measured on a StrongARM SA-110 ISS for different total numbers of time-slots (|S| = {5, 8, 16, 32}). It shows the creation of the simple, average and complex scenarios as well as the expansion from simple to complex and the reduction from complex to simple.

Figure 7: Timing in µs for the various creation or modification scenarios for |S| = {5, 8, 16, 32} time-slots.

From Table 2 and Figure 7 we can derive typical compute-times required for the allocation of one time-slot under different conditions (Table 4). The most expensive in compute-time is naturally the opening of new connections, which is about 20% more than simply adding a time-slot to an existing connection and about 70% more than removing one slot. The computing time per slot per hop is about 5 µs (1000 processor cycles), which is very reasonable compared to the expected 10 to 100 ms time to switch scenario. It grows linearly with the total number of time-slots |S|.

  Action                    Compute-time
  open new conn.            4.3 to 6.4 µs
  increase existing conn.   3.2 to 5.4 µs
  decrease existing conn.   1.1 to 2.1 µs

Table 4: Allocation time per time-slot per hop.

6

Conclusion

This paper discusses algorithms to perform dynamic time-slot allocation on networks-on-chip that provide hard-guaranteed QoS with TDMA techniques. These algorithms are executed on a central StrongARM processor clocked at 200 MHz and are usable for embedded multi-processor systems with either circuit-switched or packet-switched NoCs. The efficiency of the algorithms presented is tested on an augmented reality application composed of a 3D pipeline and an MPEG-2 decoder/encoder video chain that is mapped to a 4x4 MP-SoC. The most efficient algorithm to perform both routing and time-slot allocation is an extended version of IDA*; it has a memory footprint under 2.5 KB for a slot table size of 32 and is thus well adapted for use on an embedded processor. Typically, per time-slot, the creation of one hop on the path takes about 1000 processor cycles (5 µs) when opening a new connection and about 850 cycles when adding bandwidth to an already existing connection. Allocation time grows linearly with the slot table size.

References

[1] H. De Man, "On Nanoscale Integration and Gigascale Complexity in the Post .Com World", Proc. DATE 2002, Paris, France, March 2002.

[2] S. Kumar, A. Jantsch, M. Millberg, J. Öberg, J. Soininen, M. Forsell, K. Tiensyrjä, and A. Hemani, "A Network on Chip Architecture and Design Methodology", Proc. IEEE Computer Society Annual Symposium on VLSI, April 2002.

[3] T. Marescaux, J-Y. Mignolet, A. Bartic, W. Moffat, D. Verkest, S. Vernalde, R. Lauwereins, "Networks on Chip as Hardware Components of an OS for Reconfigurable Systems", Proc. Field Programmable Logic and Applications, Lisbon, 2003.

[4] V. Nollet, T. Marescaux, D. Verkest, J-Y. Mignolet, S. Vernalde, "Operating System Controlled Network-on-Chip", Proc. Design Automation Conference, San Diego, June 2004, pp. 256-259.

[5] T.A. Bartic, J-Y. Mignolet, V. Nollet, T. Marescaux, D. Verkest, S. Vernalde, R. Lauwereins, "Highly Scalable Network on Chip for Reconfigurable Systems", Systems on Chip Conference, Tampere, 2003.

[6] A. Radulescu et al., "An Efficient On-Chip Network Interface Offering Guaranteed Services, Shared-Memory Abstraction, and Flexible Network Programming", IEEE Transactions on CAD of Integrated Circuits and Systems, January 2005.

[7] O. P. Gangwal et al., "Building Predictable Systems on Chip: An Analysis of Guaranteed Communication in the AEthereal Network on Chip", in Dynamic and Robust Streaming In and Between Connected Consumer-Electronics Devices, Kluwer, 2005.

[8] A. Jantsch and H. Tenhunen, "Will Networks on Chip Close the Productivity Gap?", in Networks on Chip, Kluwer, 2003.

[9] S. J. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach", Prentice Hall, 1995.

[10] W. Piekarski and B. Thomas, "ARQuake: The Outdoor Augmented Reality Gaming System", Communications of the ACM, Vol. 45, No. 1, pp. 36-38, January 2002.