Tracker-node model for energy consumption in reconfigurable processing systems

Grzegorz Chmaj, Henry Selvaraj, Laxmi Gewali
Howard R. Hughes College of Engineering, University of Nevada Las Vegas, USA
{Grzegorz.Chmaj, Henry.Selvaraj, Laxmi.Gewali}@unlv.edu

Abstract. In this paper, we present an energy dissipation model for reconfigurable systems in which FPGAs have the property of online reprogramming. The proposed system contains regular nodes and one control node. Each regular node contains both a CPU, capable of software processing, and an FPGA unit which, after being programmed with a bitstream, serves as the hardware processing part. The nodes are connected in a given structure, and the connections form the transport layer. The system is capable of processing tasks in a distributed manner; communication, control and processing are all taken into consideration in the energy equations. The model has also been used in the algorithms that form the complete system used for experimentation.

Keywords: reconfigurable processing, FPGA, distributed computing

1 Introduction

Reconfigurable Systems (RS) are a current trend in distributed processing. They are built of nodes that allow partial hardware reconfiguration. This gives them more flexibility than non-reconfigurable structures in meeting processing needs, such as higher efficiency and lower power consumption. Compared to Application-Specific Integrated Circuits (ASICs), RS offer short reconfiguration time, the ability to reconfigure multiple times, and a low price per reconfigurable unit. Nodes in an RS are connected with each other using an interconnection structure (IS) of a given topology, which also impacts the energy used to compute a given task. The contribution of this paper is preliminary modeling of a reconfigurable system with a tracker-node architecture, which is the base for further research. We show the complete model of energy dissipation, two algorithms based on the presented model, and experimentation results.

2 Literature overview

Distributed processing structures are used to lower the financial cost of processing-intensive tasks. Industry applications widely use grids, formed mostly by groups of institutions that later share the grid's resources [3]. Grids, however, require financial investment; this led to the foundation of private distributed computation networks, such as SETI [8], where the power of personal computers is used. Processing power is currently a valuable asset, so investments in dedicated computational structures are worth considering [10]. The most efficient approach is to use ASICs [9]; however, they are designed to suit specific tasks, which makes them hard to use for other types of tasks. FPGAs can be repeatedly reprogrammed, making the system adjustable to changing needs [1], [2]. This makes reconfigurable systems an interesting topic of research, gaining the attention of many research institutions. They are considered both as on-chip systems, called RSoC [4], and as large-scale structures [5]. A wide spectrum of reconfigurable system applications is a subject of research, e.g. image processing [6] and unmanned aerial vehicles [7].

3 Tracker-node structure

We propose a reconfigurable system structure that uses a tracker-node approach for its operation. The tracker is a special node that exercises control over the other nodes. The general system scheme is shown in Fig. 1:

Fig. 1. General system diagram (each node: CPU, FPGA, control and memory units; tracker m: active set av, activity register Av, assignments qbv, task blocks (1, 2, …, B) and results (1, 2, …, B); messages exchanged over the IS: bitstream_request, block_request, result_ready, block_reject, block_offer)

Nodes and the tracker are connected using the IS, which can be defined to reflect any topology.

3.1 System general

V nodes are present in the system (including the tracker):

  v, w = 1, 2, …, V .  (1)

The processing (input) task is divided into B blocks of the same size:

  b = 1, 2, …, B .  (2)

Processing of block b yields the result r (same id):

  r = 1, 2, …, B .  (3)

The operating timespan is divided into T indivisible slots:

  t = 1, 2, …, T .  (4)

Block b is computed at node v at time t (index):

  xbvt = 1 (0 otherwise) .  (5)

The tracker is the special node in the system, denoted as m .  (6)
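As a hedged illustration (not part of the original model's tooling), the index sets (1)-(5) map directly onto nested lists; the dimensions and the choice of tracker index below are our own examples:

```python
# Sketch of the model's index sets (1)-(5); V, B, T values are arbitrary examples.
V, B, T = 4, 6, 10           # nodes (incl. tracker), blocks, time slots

# x[b][v][t] = 1 iff block b is computed at node v during slot t (eq. 5).
x = [[[0] * T for _ in range(V)] for _ in range(B)]

m = 0                        # tracker node index (eq. 6); using 0 is our choice
x[2][1][5] = 1               # example: block 2 runs on node 1 in slot 5

# each block is processed exactly once somewhere (cf. constraint 18 below)
assert sum(x[2][v][t] for v in range(V) for t in range(T)) == 1
```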

The reconfigurable system is used to process a given task, which is split into chunks (blocks) for the purpose of processing; this also means that the task must be divisible (2). For the sake of simplicity, this paper considers tasks divided into uniform blocks (i.e. blocks of the same size). The division occurs at the special node called the tracker, which also performs the role of coordinator. Blocks are then sent to nodes for processing. Once a block is processed, the result of the computation has the form of a result block r (3). The system contains V nodes with identical parameters. Each node contains a processing unit capable of performing software functions, and a reprogrammable FPGA unit with the capability of online reconfiguration. Online reconfiguration allows the FPGA to be programmed with the received bitstream during normal system operation. The programming is performed by the control unit, which is a part of the node. The system operates in real time; for a better description of the system and algorithms, and for a precise description of its properties, we consider the timespan divided into time slots (4). This makes it possible to express the exact moment of each event.

3.2 Node

Energy used on node v for computation by software:

  sv = const .  (7)

Energy used on node v for computation by hardware:

  hv = const .  (8)

Node v has the bitstream for hardware computing (index):

  gv = 1 (0 otherwise) .  (9)

Node v has limited computation capability:

  pv = const .  (10)

Each node can decide to fetch the bitstream from the tracker node. Once the bitstream is fetched, the node gains the ability to perform hardware computation (9). Both hardware and software computation consume a given amount of energy (7), (8). In this paper we assume that hv < sv; thus the benefit of using hardware computing is a smaller amount of energy emitted.

3.3 IS Properties

Energy emitted by sending block b from the tracker to node v:

  kmv = const .  (11)

Energy emitted by sending result r from node v to the tracker:

  kmv = const .  (12)

Block b is sent from node w to node v at time t (index):

  ybwvt = 1 (0 otherwise) .  (13)

The bitstream is sent from the tracker to node v at time t (index):

  zwvt = 1 (0 otherwise), w = m .  (14)

The cost of sending the bitstream from the tracker to node v:

  ev = const .  (15)

Time required for fetching the bitstream from the tracker to node v:

  fv = const .  (16)

Time required for transfer of a block/result between node v and the tracker:

  jv = const .  (17)

The IS operation also consumes electrical energy. In this paper, we assume that the energy used for sending a block from the tracker to a node equals the energy required for sending the result back from that node to the tracker (11), (12). The moment of transfer of a block or a bitstream is determined by (13) and (14). Similar relations are formulated for bitstreams (14), (15). (16) and (17) express the time required to transfer blocks, results and bitstreams between a given node and the tracker.

3.4 Constraints

A given block b is processed at only one node:

  Σv Σt xbvt = 1   b = 1, 2, …, B .  (18)

Each node v has limited computational capability:

  Σb Σt xbvt ≤ pv   v = 1, 2, …, V .  (19)

Each node w has limited upload capability:

  Σb Σv ybwvt ≤ uw   w = 1, 2, …, V   t = 1, 2, …, T .  (20)

Each node v has limited download capability:

  Σb Σw ybwvt ≤ dv   v = 1, 2, …, V   t = 1, 2, …, T .  (21)

All results have to be sent back to the tracker:

  Σw Σt yrwvt ≥ 1   r = 1, 2, …, B,  v = m .  (22)

Computation of block b at node v can finish only after b is done fetching:

  Σt=1..q ybwvt + Σt=q+1..T xbvt = 2   w = m,  q ≥ jv .  (23)
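A minimal sketch of how constraints (18)-(21) might be checked on candidate x and y tensors; the function name, data layout and limit vectors are our own illustration, not the authors' simulator:

```python
def check_constraints(x, y, p, u, d):
    """x[b][v][t] and y[b][w][v][t] are 0/1 decision tensors;
    p, u, d are per-node computation, upload and download limits.
    Returns True iff constraints (18)-(21) all hold."""
    B, V, T = len(x), len(x[0]), len(x[0][0])
    # (18): each block is processed at exactly one (node, slot) pair
    ok18 = all(sum(x[b][v][t] for v in range(V) for t in range(T)) == 1
               for b in range(B))
    # (19): per-node total computation limit p_v
    ok19 = all(sum(x[b][v][t] for b in range(B) for t in range(T)) <= p[v]
               for v in range(V))
    # (20): per-slot upload limit u_w of the sending node w
    ok20 = all(sum(y[b][w][v][t] for b in range(B) for v in range(V)) <= u[w]
               for w in range(V) for t in range(T))
    # (21): per-slot download limit d_v of the receiving node v
    ok21 = all(sum(y[b][w][v][t] for b in range(B) for w in range(V)) <= d[v]
               for v in range(V) for t in range(T))
    return ok18 and ok19 and ok20 and ok21
```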

Constraints define the assumptions of the system. Each node has limited computational capability (19), (10), which determines the amount of data it can process in a given timespan. Regarding a node's communication capabilities, upload (20) and download (21) speeds are defined. Processing is considered finished when all B result blocks are collected at the tracker node (22). For this paper, we assume that each block is assigned for processing to only one node (18), and that the computation of block b can start only when b is fully downloaded; no processing of a partially fetched block is allowed (23). Constraint (23) also ensures that a block fetched by node v will be processed by this node. The goal of the system is to process the task, divided into B blocks, and collect the results at the tracker node. This yields the following energy emission components:

Fetching bitstreams:

  Σv Σt zmvt gv ev .  (24)

Fetching blocks:

  Σb Σv Σt xbv ybmvt kmv .  (25)

Performing the computations:

  Σb Σv Σt (gv hv + (1–gv) sv) xbvt .  (26)

Returning results:

  Σr Σv Σt xrv yrvmt kmv .  (27)

The overall energy consumption is the sum of these components:

  E = Σv Σt zmvt gv ev + Σb Σv Σt xbv ybmvt kmv + Σr Σv Σt xrv yrvmt kmv + Σb Σv Σt (gv hv + (1–gv) sv) xbvt .

According to the node algorithm, nodes can decide to fetch the bitstream or perform the computation in software. If a node decides to fetch the bitstream (9), this is done at an energy cost ev (15). This energy characterizes the network relation between the tracker and the fetching node, and can differ for each node. If a node starts fetching the bitstream (14) during time slot t, we assume that this process ends in time slot t+fv (according to (14) and (16)). To perform a computation, a node fetches a block from the tracker, incurring the cost (25). This transfer starts in time slot t and ends in time slot t+jv (17). To be able to perform block processing, a node has to finish fetching the block; there is no possibility of processing a partially fetched block (23). (26) describes the energy emitted during the computation: an already fetched block b is processed either in software, (1–gv)svxbvt, or in hardware, gvhvxbvt. The variable gv ensures that exactly one of these two costs is incurred. After block computation, when the result r is produced, it is sent back to the tracker.

3.5 Communications

The IS is a vital part of the processing system. We propose a communication layer based on a message-exchange protocol. The following messages are used in our system: bitstream_request, block_request, result_ready, block_reject, block_offer. The IS structure is defined using the values kmv (11), (12), determining the cost of block and result transfer, and the bitstream sending cost ev (15). The energy for a node may be interpreted as the distance from the node to the tracker node. This way kmv and ev can describe the physical structure; the three structures considered in this paper are shown in Fig. 2.
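The sum E above can be sketched as a direct translation into code. The tensor names follow the paper's symbols, but the data layout (nested lists indexed [b][v][t] and [b][w][v][t]) and the derivation of the assignment indicator xbv from x are our assumptions:

```python
def total_energy(z, g, ev, x, y, k, s, h, m):
    """E = bitstream fetches (24) + block fetches (25)
         + result returns (27) + computation (26).
    z[v][t], g[v], ev[v], x[b][v][t], y[b][w][v][t], k[v] (= kmv),
    s[v], h[v]; m is the tracker index."""
    V, B, T = len(g), len(x), len(x[0][0])
    # xbv = 1 iff block b is assigned to node v (x summed over time)
    xbv = [[any(x[b][v][t] for t in range(T)) for v in range(V)]
           for b in range(B)]
    e_bits = sum(z[v][t] * g[v] * ev[v] for v in range(V) for t in range(T))
    e_fetch = sum(xbv[b][v] * y[b][m][v][t] * k[v]
                  for b in range(B) for v in range(V) for t in range(T))
    e_ret = sum(xbv[r][v] * y[r][v][m][t] * k[v]
                for r in range(B) for v in range(V) for t in range(T))
    e_comp = sum((g[v] * h[v] + (1 - g[v]) * s[v]) * x[b][v][t]
                 for b in range(B) for v in range(V) for t in range(T))
    return e_bits + e_fetch + e_ret + e_comp
```

For a one-block toy scenario (tracker m=0, one worker that fetches the bitstream, the block, computes in hardware and returns the result) the four terms add up exactly as in the formula above.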
A mesh is a regular interconnection network in which the nodes form a matrix (Fig. 2a). A torus is created from a mesh: the connections do not end at the boundary nodes, but wrap around to the node on the opposite side (Fig. 2b). The third considered IS is unstructured, where the lengths of inter-node connections do not follow any specific rule (Fig. 2c). The IS structure also determines the timing relations in the system. In this paper, we simplify them by using two variables: fv, determining the time required for fetching the bitstream to a node, and jv, determining the block/result transfer time (we consider the links to be symmetric). Constraints (20) and (21) are also related to timing, as they determine the transfer speeds: each node has limited download and upload capability.
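As noted above, kmv and ev can be read as distances between a node and the tracker. A hedged sketch of how such distances could be derived for the mesh and torus topologies (grid coordinates and the Manhattan hop-count metric are our assumptions, not a definition from the paper):

```python
def mesh_distance(a, b, rows, cols):
    """Manhattan hop count between grid positions a=(r1,c1), b=(r2,c2)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def torus_distance(a, b, rows, cols):
    """Same metric, but each axis may wrap around the boundary (Fig. 2b)."""
    dr, dc = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)

# corner-to-corner on a 5x5 grid: the torus wrap-around shortens the path
tracker, node = (0, 0), (4, 4)
assert mesh_distance(tracker, node, 5, 5) == 8
assert torus_distance(tracker, node, 5, 5) == 2
```

This illustrates why mesh and torus can yield different transfer energies for the same node placement, while an unstructured IS would simply assign arbitrary kmv values per link.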

Fig. 2. Examples of IS structures: a) mesh, b) torus, c) unstructured

The tracker node tracks node activity using the active nodes set (28) and the action register (29). The former contains all nodes known to the tracker (not all nodes are known to the tracker when the system starts operating); the latter indicates the status of a node, although this information may become outdated.

Node v is active (index):

  av = 1 (0 otherwise) .  (28)

Node action:

  Av = {idle, fetch, send, processing} .  (29)

Block b is assigned to node v (index):

  qbv = 1 (0 otherwise) .  (30)

Block b is processed (index):

  rb = 1 (0 otherwise) .  (31)

The operation of the system elements is described by algorithms. All regular nodes operate under the same algorithm; the tracker node has its own specific algorithm. The algorithms use a request-respond architecture. Nodes decide whether to fetch the bitstream or directly start requesting blocks. To fetch a block, a node sends a request to the tracker, which responds with the block and expects the result in return. The tracker algorithm serves requests from nodes, keeps track of results (it can also prompt a node that it considers idle) and combines the final result.
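The request-respond exchange can be sketched as follows. The class, method names and in-memory message dispatch are our own illustration of the protocol described above (bitstream_request, block_request, result_ready), not the authors' implementation:

```python
class Tracker:
    """Minimal sketch of the tracker's request-respond loop."""

    def __init__(self, num_blocks, bitstream):
        self.pending = list(range(num_blocks))   # blocks not yet assigned
        self.results = {}                        # b -> result block
        self.bitstream = bitstream
        self.active = set()                      # active set a_v (28)

    def handle(self, node_id, msg, payload=None):
        self.active.add(node_id)                 # node becomes known/active
        if msg == "bitstream_request":
            return self.bitstream
        if msg == "block_request":
            # hand out the next unassigned block (q_bv, eq. 30), None when done
            return self.pending.pop(0) if self.pending else None
        if msg == "result_ready":
            b, result = payload
            self.results[b] = result             # r_b = 1 (eq. 31)
            return "ok"

    def done(self, num_blocks):
        # processing finishes once all B results are collected (cf. (22))
        return len(self.results) == num_blocks
```

A node would loop: optionally send bitstream_request, then repeatedly send block_request, process the returned block, and answer with result_ready until the tracker returns None.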

4 Experimentation results

The system described above was implemented as a software simulator. It takes several parameters that characterize an experiment: the IS topology, the energy required for transferring data and bitstreams among the nodes, etc. The first experiment shows the impact of using hardware processing compared to software processing. The IS structure used in this case is unstructured (Fig. 2c). The IS contains 50 nodes; the average energy consumption per block processed is 14.2 mW for software processing and 5.2 mW for hardware processing. Energy consumption for data transport in the IS ranges between 3 and 49 mW per uniform data block. The bitstream fetch consumes an average of 10.2 mW. Figure 3 shows the same IS structure in three cases: all nodes perform hardware processing (allHw), all nodes perform software processing (allSw), and mixed (mixHS), where only some nodes (24 of them) process the task using hardware and the rest use the software approach. These three cases were run for various task sizes T. For T=200, the differences between allHw, allSw and mixHS are small: allHw required 15.3% less energy than allSw, and mixHS 13.5% less. For small task sizes, the energy used to fetch bitstreams negatively influences the overall cost, as only a small number of blocks is processed in hardware. For T=500, the energy saving increases (compared to allSw: 18.8% less for allHw and 15.6% less for mixHS). This trend continues for T=1500 and T=5000 (19.8% and 20.3% allSw to allHw, and 16.2% and 16.5% allSw to mixHS, respectively).

Fig. 3. Three processing cases

The relation between hv and sv is defined for the whole system as a ratio R. Fig. 4 presents the experiment for three R values: R=6.26, R=28.4 and R=64.1. The mixHS case was used. The experiments show that small task sizes require roughly the same amount of energy for all three values; the advantage of a proper R ratio becomes more visible as the processing load increases. The average energy consumption per block (including transfer, computation, result return and a share of the bitstream processing) was 58.4 mW. The energy saved by using R=6.26 instead of R=64.1 would allow the computation of an additional 3154 data blocks using the same system resources. For T=1500, the energy saving compared to R=6.26 was 92 W and 184 W for R=28.4 and R=64.1, respectively. The experiments also show that the relation between energy saving and the R ratio is not linear: other aspects, such as the energy required for the bitstream fetch, communication time and the mutual relations between nodes, also impact the final amount of energy dissipated. For too small R values, the system becomes inefficient for hardware processing; finding the proper R ratio for a wide range of input conditions is part of our current and future work.

Fig. 4. The relation between the R ratio and energy dissipation for various task sizes

The IS structure impacts the operational energy, specifically the part responsible for data transfer. Our experiments show that the mesh and torus cases resulted in very similar energy dissipation (less than 5% difference). The unstructured IS demonstrated its advantage for larger tasks (13% less energy emitted for task sizes T=500 and larger).

Fig. 5. Energy dissipation for three structures.

5 Conclusion

Reconfigurable systems provide many possibilities to act as flexible structures, which in turn saves overall energy. The model presented in this paper is being extended in our current research to handle more complex processing tasks. The experimentation results using the two presented algorithms show that hardware processing can lower the energy used for computation, but that the system configuration is not straightforward in all cases. Our further research concentrates on using many bitstreams, multiple task types and nodes with multiple reconfiguration units.

References

1. Gokhale, M.B., Graham, P.S.: Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer (2005)
2. Hauck, S., DeHon, A.: Reconfigurable Computing: The Theory and Practice of FPGA-Based Computing. Morgan Kaufmann / Elsevier (2008)
3. Mahajan, S., Shah, S.: Distributed Computing. Oxford University Press (2010)
4. Samara, S., Schomaker, G.: Self-adaptive OS Service Model in Relaxed Resource Distributed Reconfigurable System on Chip (RSoC). Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns, pp. 1--8 (2009)
5. Nadeem, M.F., Ostadzadeh, S.A., Nadeem, M., Wong, S., Bertels, K.: A Simulation Framework for Reconfigurable Processors in Large-scale Distributed Systems. 2011 International Conference on Parallel Processing Workshops, pp. 352--360 (2011)
6. Salvador, R., Otero, A., Mora, J., De la Torre, E., Riesgo, T., Sekanina, L.: Self-reconfigurable Evolvable Hardware System for Adaptive Image Processing. IEEE Transactions on Computers (2013)
7. Jasiunas, M., Kearney, D., Bowyer, R.: Connectivity, Resource Integration, and High Performance Reconfigurable Computing for Autonomous UAVs. IEEE Aerospace Conference, pp. 1--8 (2005)
8. Korpela, E., Werthimer, D., Anderson, D., Cobb, J., Lebofsky, M.: SETI@home: Massively Distributed Computing for SETI. Computing in Science & Engineering 3(1), pp. 78--83 (2001)
9. Ballinger, N.: ASIC Technical Training: The Challenges and Opportunities. Proceedings of the Second Annual IEEE ASIC Seminar and Exhibit, pp. P14-4/1--4 (1989)
10. Zydek, D., Selvaraj, H., Gewali, L.: Synthesis of Processor Allocator for Torus-Based Chip MultiProcessors. Proceedings of the 7th International Conference on Information Technology: New Generations (ITNG 2010), IEEE Computer Society Press, pp. 13--18 (2010). doi: 10.1109/ITNG.2010.145