Fast Evaluation of Protocol Processor Architectures for IPv6 Routing

Johan Lilius, Dragos Truscan, Seppo Virtanen
Turku Centre for Computer Science, Embedded Systems Laboratory
Lemminkäisenkatu 14A, 20520 Turku, Finland

{Johan.Lilius, Dragos.Truscan, Seppo.Virtanen}@abo.fi

Abstract

In this paper we present a design case study in configuring our protocol processor architecture to meet the performance requirements of IPv6 routing at gigabit speeds. Our methodology makes it possible to perform fast, reliable analyses of the problem on a high level and to find its key bottlenecks and design constraints. Based on the analyses we suggest architectural configurations for the target application. The best configurations can then be further analyzed in more detailed system-level simulations and physical estimations.

1. Introduction

In recent years, addressing the conflicting requirements on network applications has become an important challenge for hardware designers. On the one hand there is a need for faster time-to-market for the product, a goal that has traditionally been achieved by using general purpose processors, and on the other hand one would like to work with ASICs to obtain an optimal solution in terms of performance. One proposed solution to this problem has been the adoption of network or protocol processors, special programmable processors that are tailored towards network and protocol processing. Such a processor is an attempt to harness the processing speed of ASICs and the programmability of general purpose processors for optimal protocol processing speed.

In our protocol processor design framework TACO (Tools for Application-specific hardware/software COdesign) [11] we have developed tools and methods for helping the designer in specifying, simulating, evaluating and synthesizing a certain type of TTA (Transport Triggered Architecture, [2, 9]) based protocol processors, TACO processors (introduced in [13]). A TTA processor is formed of FUs (functional units) that communicate via an interconnection network of data buses, controlled by an interconnection network controller unit. The FUs connect to the buses through modules called sockets. Each functional unit in a TACO processor performs a specific protocol processing task, and each FU has been designed to be able to complete the execution of its function in one clock cycle. TTAs are in essence one-instruction processors, as instructions only specify data moves between functional units. Thus, the instruction word of any TTA processor consists mostly of source and destination addresses. The maximum number of instructions (i.e. data transports) that can be carried out in one clock cycle is equal to the number of data buses in the interconnection network.

In this paper we present a case study in which we apply the TACO design methodology [6, 12] to configure TACO protocol processors for efficient IPv6 routing. We have previously described smaller case studies from the system level through logic synthesis in e.g. [6, 13]. Implementing an IPv6 router that uses the Routing Information Protocol (RIPng) is a considerably more complex application of our methodology. The contribution of this paper is to show that within a short time frame we can efficiently and reliably analyze the problem, its key bottlenecks and constraints, and provide solutions to them at abstraction levels on and above the system level. All this is done prior to proceeding to detailed system level simulations and physical estimations.

In our methodology, a processor configuration is obtained by deciding what functionality should be implemented in hardware (as FUs) and what in software (as data move operations between the FUs). This process involves identifying the operations to be performed by the router. In order to reach a good balance between the router’s performance and its physical characteristics (like power and area use), we explore different architectural configurations by varying the number of FUs of each required type, and varying the number of buses in the interconnection network. These different configurations are simulated and their physical characteristics are estimated. In the end we select for synthesis a configuration that is able to perform the target application within the given power and area constraints.
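To make the transport-triggered idea concrete, the following is a minimal Python sketch of a move-only machine. It is not the TACO simulator: the `Adder` unit, the port names `operand`, `trigger` and `result`, and the `run` helper are all hypothetical and chosen only for illustration. The sketch shows the two properties stated above: a program consists solely of data moves, and the number of moves per cycle is bounded by the number of buses.

```python
class Adder:
    """Toy functional unit (hypothetical): writing the trigger port starts the operation."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value
        elif port == "trigger":
            # The unit is assumed to finish within the same clock cycle.
            self.result = self.operand + value
        else:
            raise KeyError(port)


def run(program, units, n_buses):
    """Execute a program given as a list of cycles; each cycle is a list of moves.

    A move is (source, destination). A source is either an integer immediate or
    a (unit_name, "result") pair; a destination is a (unit_name, port) pair.
    """
    for moves in program:
        if len(moves) > n_buses:
            raise ValueError("more moves in one cycle than available buses")
        # Read every source before any write takes effect.
        values = [src if isinstance(src, int) else units[src[0]].result
                  for src, _ in moves]
        writes = [(dst, value) for value, (_, dst) in zip(values, moves)]
        # Operand writes go first so that a trigger in the same cycle sees them.
        writes.sort(key=lambda write: write[0][1] == "trigger")
        for (unit, port), value in writes:
            units[unit].write(port, value)


units = {"add1": Adder(), "add2": Adder()}
program = [
    [(3, ("add1", "operand")), (4, ("add1", "trigger"))],
    [(("add1", "result"), ("add2", "operand")), (10, ("add2", "trigger"))],
]
run(program, units, n_buses=2)
print(units["add2"].result)  # (3 + 4) + 10 = 17
```

With `n_buses=1` the same program is rejected, because each cycle contains two moves; it would have to be rewritten as four single-move cycles.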

processor. The second part evaluates and simulates different design configurations and implements the chosen one in hardware. The UML design-flow of the problem consists of incremental refinement steps of the problem specification until the level of detail allows gathering necessary architectural requirements. We combine the functional specification with domain-based knowledge in order to provide component reuse and fast identification of resources. Since we address the TTA architecture, our goal is to find the basic operations the processor should be able to perform, and to implement them using dedicated FUs of TACO. More information can be found in [6]. System-Level design flow. When the recommended hardware module types have been determined, we start evaluating different hardware configurations, i.e. architecture instances. Architecture instances are constructed by varying the number of modules of the same type in the processor as well as varying the internal data transport capacity of the instances. The evaluation of the architecture instances consists of two parts: system level simulations [11, 15] in SystemC and system level physical characteristics estimation [8] in Matlab. The application code needs to be tuned for each instance separately. The simulations yield functional correctness information as well as the total cycle count of the application running on the particular architecture instance. The physical estimation yields estimates of required area and power as well as of the clock frequency of the final product. By co-analyzing the results from SystemC and Matlab the designer is able to determine at the system level whether the architecture instance is suitable for the target protocol processing application. After an architecture instance that fulfills the design constraints has been found, it can be synthesized and its characteristics can be verified using the TACO VHDL synthesis model.
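The co-analysis step can be illustrated with a short Python sketch. It assumes, as a simplification that the paper does not spell out, that the required clock frequency is the simulated cycle count per datagram multiplied by the datagram arrival rate. The function names and all numbers below are hypothetical and are not taken from the TACO tools.

```python
def required_clock_hz(cycles_per_datagram, line_rate_bps, datagram_bits):
    """Minimum clock frequency needed to keep up with the line rate (assumed model)."""
    datagrams_per_second = line_rate_bps / datagram_bits
    return cycles_per_datagram * datagrams_per_second


def is_suitable(cycles_per_datagram, line_rate_bps, datagram_bits, achievable_clock_hz):
    """Compare the simulated requirement with the estimated achievable clock."""
    required = required_clock_hz(cycles_per_datagram, line_rate_bps, datagram_bits)
    return required <= achievable_clock_hz


# Hypothetical instance: 120 cycles per datagram, 10 Gbps, 1500-byte datagrams.
required = required_clock_hz(cycles_per_datagram=120, line_rate_bps=10e9, datagram_bits=12_000)
print(f"{required / 1e6:.0f} MHz")  # about 100 MHz
print(is_suitable(120, 10e9, 12_000, achievable_clock_hz=1e9))
```

In this model the cycle count comes from the system level simulation and the achievable clock from the physical estimation, which is why the two results have to be analyzed together.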

1.1. Related work

With a trend towards increasing system complexities, abstract specification based modelling flows are constantly gaining in popularity. The idea is to start with a very abstract system description and then to incrementally refine this specification to contain more and more architectural details. In [1] an OCAPI [10] description of system behavior is used as a starting point in a design methodology for an Ethernet packet decoder. In [5] a Y-chart based design methodology and its use for design space exploration is presented. In the Y-chart methodology the performance of a selected architecture is analyzed for a given set of applications. As a result of this analysis the designer receives performance data, based on which design decisions about the architecture can be made. The process is repeated iteratively until a satisfactory architecture for the target application is found.

There are similarities between the flows outlined above and the TACO flow, as one might expect: in TACO we work in a specific problem domain, namely protocol processing. In the TACO flow the hardware architecture is also represented as a specification and simulation model written in a high level language, but only prior to synthesis. And, lastly, from the high level simulations we obtain performance data such as clock cycle requirements and module utilization (in addition to the verification of correct processor functionality). However, our approach is very much library-based and allows extensive component re-use for both simulation and synthesis. Since we do not develop all modules from scratch every time, the actual hardware design times in our flow can be quite short.

As a major difference to the mentioned design flows, we develop TACO processor models in three different development environments at the same time: we have a model for system-level simulations written in SystemC, a model for estimating physical parameters (e.g. processor area and power consumption) at the system level written in Matlab, and a model for synthesizing architectures written in VHDL. The models are highly parameterizable, and hence top-level description files for a given architecture can be automatically generated for all three models using a single hardware design tool [14].

3. The IPv6 Router Architecture

An IPv6 router should be able to receive IPv6 datagrams from the connected networks, to check that their addressing and header fields are valid, to interrogate the routing table for the interface(s) they should be forwarded on, and to send the datagrams on the appropriate interface. Additionally a router should build and maintain a routing table that contains information about network topology. The router builds up the routing table by listening for specific datagrams broadcast by the adjacent routers, in order to find out information about the topology of the network. At regular intervals, the routing table information is broadcast to the adjacent routers to inform them about changes in topology. The IPv6 protocol is described in e.g. [3] and [7]. Routers have to handle two types of Internet traffic, one that updates the routing tables and one that has to be for-

2. Design Flow and Methodology

There are two distinct parts in designing applications for TACO processors. The first part takes care of specifying and analyzing the problem requirements using UML, in order to decide the functionality that has to be implemented by the

the entire datagram in the main memory because in IPv6 the IP header can be accompanied by a variable number of extension headers that also have to be taken into consideration. Once saved in the memory, the datagrams are processed one at a time, the header is updated and then the entire datagram is saved in the output buffer of the corresponding line card. One important design feature of a TTA processor is the modularity of its architecture. Each FU computes independently of the interconnection network and other units. So the performance of the processor is reflected by the number of transports on the buses and implicitly by the time in which each operand becomes available in the output registers of the functional units. In order to decrease the waiting time, FUs have to complete their computation in as few cycles as possible. In this sense a balance should exist between the amount of complexity and the response time of functional units. The Preprocessing Unit (iPPU) scans the input buffers for new datagrams. If a datagram is pending it is stored in the main memory. A pointer to the memory address where the datagram was stored is saved in a queue, along with the interface identifier of the input buffer. The iPPU is connected to the buses in the processor’s interconnection network through one trigger and two result registers. It also provides a 1-bit signal connected to the Interconnection Network Controller (see figure 2) to notify it of new en-

[Figure 1 shows a TACO processor, a switching fabric, and Ethernet line cards #1 to #4.]

Figure 1. Generic router.

warded on adjacent networks. The forwarding process has to search the routing table for a specific network prefix with the longest prefix length possible. Since a routing table can consist of thousands of entries, finding the matching prefix can require long computational time. The current bandwidth demands of internet networks put a high pressure on the routing table look-up speed. To meet these demands, the router implementations need to use fast searching algorithms and dedicated hardware in order to improve the forwarding throughput.

Our router is composed of a TACO processor and a number of Ethernet line cards corresponding to each connected network interface of the router. We are only interested in the design and performance of the TACO processor for implementing routing and forwarding tasks. The line cards can be chosen from the available products on the market (Intel IFX18103, Cisco GigE 12000, etc.). The interface between the cards and processor is dependent on the products used. Each network card contains a set of independent input and output registers that can be read and written by the processor. The line cards deal with implementing the Ethernet protocol and its specific tasks, provide fully assembled decapsulated IPv6 datagrams to the processor, take care of Ethernet fragmentation and encapsulation of outgoing datagrams, and also resolve ARP/RARP requests.

The TACO processor is in charge of deciding how the forwarded datagrams are to be routed between the line cards and takes care of building and maintaining its routing table. It scans the input ports of the line cards for pending datagrams, which are transferred into the main memory of the processor. Usually, the existing router implementations split the Internet datagrams into header and payload, and only the header is stored in the main memory for further processing. The payload is only analyzed for datagrams addressed to the router. In our design, we choose to transfer

[Figure 2 block labels: Program Memory, Network Controller, Data Memory, Memory Management Unit, Matcher, Comparator, Counter, Checksum, Shifter, Masker, Routing Table Unit, Local Info Unit, iPPU, oPPU, Registers.]

Figure 2. TACO architecture.

[Figure 3 content: the expression a = (b * 2 + c)/4 is expanded into register-level operations (R1 = b, R2 = 2, R3 = c, R4 = 4; R5 = R1 * R2, R6 = R5 + R3, R7 = R6 / R4; a = R7) and into TACO operations (Mov (b, R1), Mov (2, R2), Mov (c, R3), Mov (4, R4); Mul2 (R1, R2, R5), Add (R5, R3, R6), Div2 (R6, R4, R7); Mov (R7, a)). Panel labels: Domain Operation List, Non-optimized TACO, TTA-optimized code.]

Figure 3. TACO Code Optimization Process

Some general compiler optimizations, such as sinking and loop unrolling, can also be performed on TACO assembly code. Code optimization for TACO processors in fact reduces to the well-known bus scheduling and register allocation problems. We have to schedule move instructions on the buses and to allocate registers to the operands of the instructions. The scheduling and allocation policies have been widely discussed in the literature and we are not suggesting any new methods. A compiler can do the necessary allocation and scheduling, along with some final optimizations.
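As an illustration of the bus scheduling problem, the Python sketch below applies a simple greedy, in-order list scheduler to the operations of the Figure 3 example. This is one textbook policy, not the scheduler used in the TACO flow. It assumes that every operation occupies exactly one bus for one cycle and that each register is written only once, so only read-after-write dependencies are tracked; the `schedule` function and the tuple encoding of operations are hypothetical.

```python
def schedule(ops, n_buses):
    """Assign each operation to the earliest cycle that respects data
    dependencies and the bus limit. Returns the list of assigned cycles."""
    ready = {}   # value name -> cycle in which it is written
    load = {}    # cycle -> number of buses already in use
    cycles = []
    for sources, dest in ops:
        # An operation may start one cycle after its latest producer.
        cycle = max((ready[s] + 1 for s in sources if s in ready), default=0)
        while load.get(cycle, 0) >= n_buses:
            cycle += 1
        load[cycle] = load.get(cycle, 0) + 1
        ready[dest] = cycle
        cycles.append(cycle)
    return cycles


# Operations of the Figure 3 example as (sources, destination).
OPS = [
    (("b",), "R1"), (("2",), "R2"), (("c",), "R3"), (("4",), "R4"),
    (("R1", "R2"), "R5"),   # Mul2
    (("R5", "R3"), "R6"),   # Add
    (("R6", "R4"), "R7"),   # Div2
    (("R7",), "a"),
]

for buses in (1, 3):
    assigned = schedule(OPS, buses)
    print(buses, "bus(es):", max(assigned) + 1, "cycles")
```

Under these assumptions the example needs 8 cycles on one bus and 5 cycles on three buses; the remaining length is dictated by the dependency chain from the multiplication to the final move, which additional buses cannot shorten.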

tries pending in the queue. The PostProcessing Unit (oPPU) manages the output traffic of the router. The unit contains an internal queue in which pointers to memory addresses of the datagrams to be sent are stored along with the output interface identifier. The oPPU interrogates its internal queue and for each entry it moves the corresponding datagram from the data memory to the specified output buffer.

The Counter Unit performs arithmetical operations (increment, decrement, addition, subtraction) and counting (upwards or downwards from a start value to a stop value). When the stop value has been reached, a result signal directly connected to the Network Controller is enabled. For comparing operands with a given value a Comparer Unit has been designed. The result of a comparison is signaled to the Network Controller via a result signal. The Matcher and the Masker are bitstring manipulation FUs that process only parts of their input operands according to a given mask. The Matcher reports its result to the Interconnection Network Controller by means of a result bit signal directly connected between them. The Masker sets the bits of a register according to a given mask and a given value. In addition to logical shifting, a Shifter can also be used for arithmetical multiplication by 2.
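The bit-level behavior of the Matcher, Masker and Shifter can be sketched in a few lines of Python. The exact semantics of the TACO units are not given in the text, so the functions below are one plausible reading of the description; the 32-bit word width and all function names are assumptions.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit data path


def matcher(value, reference, mask):
    """Compare only the bit positions selected by the mask."""
    return (value & mask) == (reference & mask)


def masker(register, value, mask):
    """Overwrite the masked bit positions of the register with bits from value."""
    return ((register & ~mask) | (value & mask)) & WORD_MASK


def shift_left(value, amount=1):
    """Logical left shift; shifting by one position multiplies by 2."""
    return (value << amount) & WORD_MASK


# The upper 16 bits agree, so the masked comparison succeeds.
print(matcher(0x20010DB8, 0x2001FFFF, 0xFFFF0000))
print(hex(masker(0xAAAA5555, 0x0000FFFF, 0x000000FF)))
print(shift_left(21))
```

A masked comparison of this kind is the basic step in checking an address against a network prefix, where the mask selects the prefix bits.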

4. Results

Previously we specified and implemented a set of resources (FUs and buses) TACO processors should use for processing Internet datagrams. These resources can be used as a test bench for specifying the final configuration of the TACO processor used in the router. To be able to do this, we still have to decide on several implementation options of the routing table. The routing table implementation is the most important aspect of a router’s performance, so we decided to create a dedicated functional unit for it. Different routing table implementations have been suggested in the literature [4] depending on the target application. For instance, in routers that have a relatively small number of entries in the routing table, a fast memory can be used. For larger routing tables the cost of a fast memory chip would be too high, so software-based algorithms are needed. Among proposed hardware solutions, Content Addressable Memories (CAM) seem very tempting because of the fast match time (tens of nanoseconds) they provide. On the other hand the price of this type of memory is very high. Also, most of

From the programmer’s point of view, programming TACO processors is a matter of moving data from output to input registers. Using registers for FUs allows optimization techniques such as moving operands from an output register to an input register without additional temporary storage (bypassing), using the same output register or general purpose register for multiple data transports (operand sharing), and easy removal of registers that are no longer in use. All these techniques help in reducing code size by reducing the number of transports on buses.
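Bypassing can be illustrated with a small Python sketch that rewrites a list of moves. It is a simplified illustration under stated assumptions, not the TACO optimizer: a temporary register is removed only if it is written once and read once afterwards, and the producing unit's output register is assumed to still hold the value when the consumer reads it. The port names are hypothetical.

```python
def bypass(moves, temporaries):
    """Remove temporaries that only forward a value from one move to the next."""
    result = list(moves)
    for temp in temporaries:
        writers = [i for i, (_, dst) in enumerate(result) if dst == temp]
        readers = [i for i, (src, _) in enumerate(result) if src == temp]
        if len(writers) == 1 and len(readers) == 1 and writers[0] < readers[0]:
            source = result[writers[0]][0]
            # Let the consumer read the producer's output register directly.
            result[readers[0]] = (source, result[readers[0]][1])
            del result[writers[0]]
    return result


moves = [
    ("shifter.result", "r5"),
    ("r5", "counter.operand"),
    ("c", "counter.trigger"),
    ("counter.result", "r6"),
    ("r6", "a"),
]
optimized = bypass(moves, temporaries=["r5", "r6"])
print(len(moves), "->", len(optimized), "moves")
for move in optimized:
    print(move)
```

Here five transports shrink to three, which is the kind of code size reduction the paragraph above refers to.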

Routing table implementation   Architecture configuration   Required speed   Bus util. [%]   Area [mm2]   Avg. power [W]
Sequential access              1BUS/1FU                     6 GHz            100             NA           NA
Sequential access              3BUS/1FU                     2 GHz            100             NA           NA
Sequential access              3bus/3CNT, 3CMP, 3M          1 GHz            97              5.66         28.1
Balanced tree                  1BUS/1FU                     1.2 GHz          100             NA           NA
Balanced tree                  3BUS/1FU                     600 MHz          100             1.35         2.33
Balanced tree                  3bus/3CNT, 3CMP, 3M          250 MHz          98              2.08         2.80
Content Addressable Memory     1BUS/1FU                     118 MHz          100             1.20         0.58
Content Addressable Memory     3BUS/1FU                     40 MHz           100             1.35         0.19
Content Addressable Memory     3bus/3CNT, 3CMP, 3M          35 MHz           99              2.08         0.39

Table 1. Estimated minimum clock frequencies, processor areas and average power consumption for different processor architectures. NA (not available) indicates an architecture that was not estimated due to its high clock frequency requirement. The CAM estimates do not include the area and power used by the CAM chip.
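Because the CAM rows of Table 1 exclude the CAM chip itself, a quick calculation helps when comparing them with the balanced tree rows. The Python sketch below adds the 1.5 to 2 W range that the text quotes for a commercial CAM operated at 133 MHz to the TACO estimates. Treating that range as applicable at the lower TACO clock speeds is an assumption made here for illustration only.

```python
# Average TACO power from Table 1 [W].
TACO_POWER_CAM = {"1BUS/1FU": 0.58, "3BUS/1FU": 0.19, "3bus/3CNT, 3CMP, 3M": 0.39}
TACO_POWER_TREE = {"3BUS/1FU": 2.33, "3bus/3CNT, 3CMP, 3M": 2.80}

# CAM chip power range quoted in the text for operation at 133 MHz [W].
CAM_CHIP_POWER = (1.5, 2.0)


def total_power_range(taco_power, cam_power=CAM_CHIP_POWER):
    """Return (low, high) total power for a TACO processor plus CAM chip."""
    low, high = cam_power
    return round(taco_power + low, 2), round(taco_power + high, 2)


for config, power in TACO_POWER_CAM.items():
    print(config, total_power_range(power))
```

For the 3BUS/1FU configuration this gives roughly 1.7 to 2.2 W in total, which is close to the 2.33 W estimated for the balanced tree with the same configuration.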

frequency of 2 GHz, which also is beyond the capabilities of our current implementation technology. By performing data and control analysis we tried to optimize the program code. Based on this analysis we configured the router to use 3 matchers, 3 counters and 3 comparers. The increase in performance brought the required clock frequency into the vicinity of 1 GHz, meaning that it might be possible to implement this configuration in the 0.18 µm technology. However, as seen in table 1, at this clock speed the average power consumed by the architecture is not acceptable. The high power consumption follows from the fact that larger gate sizes had to be used in order to reach the 1 GHz clock speed. This naturally also had an effect on the total processor area estimation.

In a sequential organization of the routing table entries the search time has linear complexity. In order to get a faster search time we implemented a balanced tree structure, which offers logarithmic search time. However, the insertion and deletion operations become much more complex. Still, this does not influence the throughput of the router significantly. Statistics show that when the topology of the network stabilizes, routing table updates appear once every 2 minutes, which does not require much computational effort. By performing simulations and estimations, we can see that the gain in performance is evident. The two faster architectures qualify as possible solutions, as seen in table 1; the slowest one would require even more power and area than the last case in sequential routing, and was thus not estimated.

Finally we evaluated a hardware-based solution for the routing table. We used a 136-bit wide content addressable memory (CAM) and a commercially available SRAM chip. By combining these two circuits we calculated that the routing table searching time would be 40 ns. This is a major

the existing solutions provide support for the IPv4 protocol and very few for IPv6. The difference between the two is the size of the address to be matched in the routing tables (128 bits for IPv6 and only 32 bits for IPv4).

We simulate and estimate different TACO processor configurations to evaluate their performance with respect to IPv6 routing throughput, silicon area and power consumption. Our goal is to choose the configuration (and implicitly the routing table implementation) that is the best fit for the performance requirements and constraints of the router. Our system-level estimation model has been verified in previous work [12] to give quite precise results when compared to post-synthesis results.

As the first case we implemented the routing table using a cache memory in which the entries are organized sequentially. As the second case we simulated a balanced tree structure. For both cases we tested different TACO architecture configurations that we obtained by varying the number of buses and the number of functional units used. Each of these configurations has to be able to achieve the 10 Gbps Ethernet throughput with a maximum size of 100 entries in the routing table. Based on these constraints we calculated the minimum clock frequencies for the TACO processor configurations. This is done by taking into account the number of clock cycles the datagram forwarding process takes to complete in each case.

In the case of sequential routing table organization we calculated that with one bus and one functional unit of each type the clock speed of the processor should be 6 GHz (see results in table 1). This exceeds the capabilities of the 0.18 µm standard cell library that we currently use. We estimate that the upper limit for TACO clock frequencies using this technology is near 1 GHz. By configuring the TACO processor to use three buses we observed a required clock

References

boost in router performance at the expense of a high implementation cost. By using the mentioned industrial IP blocks, we transformed the TACO processor into a system-on-chip design solution. As we can see in table 1, the speed requirements for the TACO processors drop dramatically, especially when using 3 buses and one functional unit of each needed type. Multiplying the number of functional units no longer seems to offer a considerable increase in routing table access performance; instead it actually causes the power and area requirements to increase.

It is also important to realize that the power and area required by the Content Addressable Memory (CAM) chip are not included in the estimates in table 1. As an example, the Micron Harmony 1Mb CAM consumes an average power of 1.5 to 2 Watts when operated at 133 MHz. Therefore, the total power consumed when using a CAM to handle routing table searches is approximately the same as when using only a TACO processor for it. On the other hand, in the CAM case the total footprint area required by the two circuits is of course larger than the area required by just a TACO processor.
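For reference, the lookup that all three routing table implementations have to perform, finding the matching entry with the longest prefix, can be written as a sequential scan in Python. The sketch uses the standard `ipaddress` module; the table contents and interface numbers are hypothetical, and the code illustrates only the sequential-access case, not the TACO Routing Table Unit.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Scan every entry and return the interface of the longest matching prefix.

    The scan is linear in the number of entries, as in the sequential-access
    case; returns None when no entry matches.
    """
    address = ipaddress.ip_address(destination)
    best = None
    for network, interface in table:
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, interface)
    return best[1] if best else None


# Hypothetical routing table: (prefix, output interface identifier).
TABLE = [
    (ipaddress.ip_network("2001:db8::/32"), 1),
    (ipaddress.ip_network("2001:db8:1::/48"), 2),
    (ipaddress.ip_network("::/0"), 3),
]

print(longest_prefix_match(TABLE, "2001:db8:1::42"))  # the /48 entry wins over the /32
print(longest_prefix_match(TABLE, "2001:db8:2::1"))
print(longest_prefix_match(TABLE, "fe80::1"))
```

A balanced tree or a CAM returns the same answer for the same table; what changes is the number of comparisons, and therefore the number of clock cycles, spent per datagram.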

[1] M. Attia and I. Verbauwhede. Programmable Gigabit Ethernet packet processor design methodology. In Proceedings of the European Conference on Circuit Theory and Design (ECCTD’01), pages III:177–180, Espoo, Finland, August 2001.
[2] H. Corporaal. Microprocessor Architectures - from VLIW to TTA. John Wiley and Sons Ltd., Chichester, West Sussex, England, 1998.
[3] S. Deering and R. Hinden. Internet protocol, version 6 (IPv6) specification. RFC 2460, 1998.
[4] S. Keshav and R. Sharma. Issues and trends in router design. IEEE Communications Magazine, pages 144–151, May 1998.
[5] B. Kienhuis, E. F. Deprettere, P. van der Wolf, and K. Vissers. A Methodology to Design Programmable Embedded Systems, volume 2268 of LNCS, pages 18–37. Springer-Verlag, 2001.
[6] J. Lilius and D. Truscan. UML-driven TTA-based protocol processor design. In Proceedings of Forum for Design Languages ’02 (FDL’02), Marseille, France, 2002.
[7] M. A. Miller. Implementing IPv6. M&T Books, 2000.
[8] T. Nurmi, S. Virtanen, J. Isoaho, and H. Tenhunen. Physical modeling and system level performance characterization of a protocol processor architecture. In Proceedings of the 18th IEEE NORCHIP Conference, pages 294–301, Turku, Finland, November 2000.
[9] D. Tabak and G. J. Lipovski. MOVE architecture in digital controllers. IEEE Transactions on Computers, 29(2):180–190, February 1980.
[10] S. Vernalde, P. Schaumont, and I. Bolsens. An Object Oriented Programming Approach for Hardware Design. In IEEE Computer Society Workshop on VLSI’99, Orlando, USA, 1999.
[11] S. Virtanen and J. Lilius. The TACO protocol processor simulation environment. In Proceedings of the 9th International Symposium on Hardware/Software Codesign, 2001.
[12] S. Virtanen, J. Lilius, T. Nurmi, and T. Westerlund. TACO: Rapid design space exploration for protocol processors. In the Ninth IEEE/DATC Electronic Design Processes Workshop Notes, Monterey, CA, USA, April 2002.
[13] S. Virtanen, J. Lilius, and T. Westerlund. A processor architecture for the TACO protocol processor development framework. In Proceedings of the 18th IEEE NORCHIP Conference, pages 204–211, Turku, Finland, November 2000.
[14] S. Virtanen, T. Lundström, and J. Lilius. A processor design tool for the TACO framework. In Proceedings of 2002 IEEE Norchip Conference, November 2002.
[15] S. Virtanen, D. Truscan, and J. Lilius. SystemC based object oriented system design. In Proceedings of the Fourth International Forum on Design Languages (FDL’01), 2001.

5. Conclusions

The increased complexity of network applications constantly produces more requirements for device design efforts in the areas of optimization, testing and validation. Dealing with the requirements and constraints at the system level allows us to address more complex designs efficiently. Simulation and estimation in the early phases of the design decreases the development time of the products by allowing early verification and performance evaluation. In addition to shortening the development time, this also significantly reduces the cost of the final product.

In this paper we experimented with our design methodology to quickly evaluate different architectural alternatives for an IPv6 routing protocol processor. By simulating and estimating different architectural configurations at the system level we obtained a fast turn-around time for finding well-suited configurations to match the target application and its constraints.

Our future work includes the full implementation and synthesis of an IPv6 router. At the same time we would like to develop a tool that automates the design space exploration phase and that, based on some heuristics, suggests good solutions with respect to performance requirements and physical constraints.

Acknowledgements

The authors wish to thank M.Sc. Tero Nurmi, University of Turku, for providing physical estimates for the processor architectures discussed in this paper.