Cellular Neural Network Parallelization Rules

© 2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Cellular Neural Network Parallelization Rules∗

Thomas Weishäupl and Erich Schikuta
Department of Computer Science and Business Informatics
University of Vienna
Rathausstraße 19/9, A-1010 Vienna, Austria
{thomas.weishaeupl,erich.schikuta}@univie.ac.at

ABSTRACT: We present “Rules of Thumb” for efficient and simple parallelization of Cellular Neural Networks (CNNs). The rules result from the application and optimization of the simple but effective structural data parallel approach, which is based on the SPMD model. The process of parallelizing the algorithm employs HPF to generate an MPI-based program.

1. Introduction

In the literature many approaches for the parallelization of neural networks can be found. These were done on various parallel or distributed hardware architectures, such as workstation clusters, large-grain supercomputers [3] or specialized neural network hardware systems. A good survey can be found in [5], which presents a categorization of the various levels of parallel neural network simulation. Generally, two forms of software techniques for the parallelization of neural network systems are distinguished, control parallel and data parallel simulation, which can also be described as decentralized and centralized (regular) control flow. Besides the mapping of logical neural network elements to processing nodes, as in topological data parallelism, it is also feasible to map data structures representing neural network information containers (such as weight matrices, error value structures, input vectors, etc.) onto processing elements according to a data parallel scheme. We call this approach Structural Data Parallel (SDP) simulation.

For the underlying work we used the data-parallel, decomposition-based SDP approach for the parallel simulation of the CNN. This approach was developed to make parallel neural network simulation as simple as sequential simulation. It allows users who are inexperienced in high-performance computing to easily increase the performance of neural network simulation through parallelization. The approach is based on the Single-Program-Multiple-Data (SPMD) programming model and utilizes the highly specialized programming capabilities of high-performance languages for the parallelization process.

A CNN represents a simple mathematical model of a massively parallel computing architecture defined in discrete N-dimensional spaces. Basically, a CNN can be seen as an array of identical dynamical systems, which are locally connected cells. The underlying network paradigm of the presented parallel CNN simulation approach is a discrete-time version of Chua and Yang's cellular neural network model [1]. This is a first-order system with linear instantaneous connections. It can be described by the following equations [2]:

∗ This research is supported by the Austrian Science Fund as part of the Aurora Project (SFBF1104).


$$ x_j(n+1) = \sum_{k \in N_r(j)} A_{jk}\, y_k(n) + \sum_{k \in N_r(j)} B_{jk}\, u_k(n) + I_j \qquad (1) $$

$$ y_j(n) = f[x_j(n)] \qquad (2) $$

where x, y, u, I represent the cell state, output, input, and bias values, respectively. A and B denote the neighborhood feedback template and the input template. The output function is defined by f(x), which can be the sign function

$$ f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad (3) $$

or the piece-wise linear output function

$$ y_j(n) = 0.5 \left( \left| x_j(n) + 1 \right| - \left| x_j(n) - 1 \right| \right). \qquad (4) $$
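As a concrete illustration of the discrete-time update in Equations (1), (2) and (4), the following is a minimal sequential sketch in Fortran (the language family targeted by the HPF parallelization used later). The array names, the r = 1 (3x3) neighborhood, the image size and the zero initialisation are illustrative assumptions, not taken from the authors' implementation.

    ! Minimal sequential sketch of the discrete-time CNN update,
    ! Equations (1)-(2), with the piece-wise linear output (4).
    ! Names, sizes and the r = 1 neighbourhood are assumptions.
    program cnn_step
      implicit none
      integer, parameter :: m = 1200, n = 1600        ! image size (rows, columns)
      real :: x(m,n), y(m,n), u(m,n)                  ! state, output, input arrays
      real :: a(-1:1,-1:1), b(-1:1,-1:1), bias        ! feedback/input templates, bias I
      integer :: i, j, k, l
      y = 0.0; u = 0.0; a = 0.0; b = 0.0; bias = 0.0  ! placeholder initialisation
      do j = 1, n            ! outer loop over columns: unit-stride inner loop in Fortran
        do i = 1, m
          x(i,j) = bias
          do l = -1, 1       ! sum over the r = 1 neighbourhood N_r(j)
            do k = -1, 1
              if (i+k >= 1 .and. i+k <= m .and. j+l >= 1 .and. j+l <= n) then
                x(i,j) = x(i,j) + a(k,l)*y(i+k,j+l) + b(k,l)*u(i+k,j+l)
              end if
            end do
          end do
        end do
      end do
      y = 0.5 * (abs(x + 1.0) - abs(x - 1.0))         ! Equation (4)
    end program cnn_step

One full sweep of these loops corresponds to one discrete time step n; in the parallel version the same loop nest runs on each processor over its local block of the arrays.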

This problem proved well suited for parallelization on classical supercomputers (see [4]). In [6] we presented the parallelization results of CNNs on cluster architectures. Based on our experience in this work and further evaluation runs of CNN algorithms, we developed “Rules of Thumb” for the efficient and straightforward parallelization of cellular neural networks (CNNs) running on cluster architectures, by evaluating an HPF-generated MPI-based CNN program on two cluster architectures. These rules are explained and justified by a detailed performance analysis in the following section.

2. Five “Rules of Thumb” for CNN parallelization

From our point of view a CNN is an N-dimensional regular array of elements. We focus on three correlated arrays: the input array (u in Formula 1) and two output arrays, one for the cell states at time n (result x) and one for the cell states at time n-1 (array y). Every array element represents one node of a cellular neural network. Following the SPMD approach, these three arrays are the structures in focus for the parallelization.

We identified the following areas for analysis when applying the SPMD approach: the workload distribution, the latency of communication (handshake occurrence), local cache misses, the communication throughput (message size, buffering and communication cache size), and the locality of data (no computation flow interruptions by communication). To deal with these problems we have to adapt our CNN program in the following areas. Firstly, we need a balanced data distribution by deploying equally sized data partitions on the processors. Secondly, correlated elements of the different arrays need to be aligned together. Thirdly, array elements have to be computed according to their alignment in memory (column-wise or row-wise), depending on the programming environment used. Fourthly, the schema of data distribution (see Figure 1) has to be chosen along the computational work-flow. Finally, we can consider an overlap along the array partition border elements, and the adjustment of the rectangular array (image rotation).

By the definition of these influencing factors we get a clear approach to find and justify our “Rules of Thumb”. Based on these factors we performed the analysis in two steps. A research and a production cluster were the basis for our work; both are Beowulf-type cluster architectures. The research cluster is the gescher system (http://gescher.vcpc.univie.ac.at/), consisting of sixteen compute nodes, each having one Pentium 4 3.06GHz and 2GB of dual-channel DDR-RAM (FSB 266 MHz).



Figure 1: Distributions for CNNs, e.g. 8 proc.: tile (A), row-wise (B), column-wise (C)

The compute nodes are connected by a Gigabit Ethernet interconnect. As production cluster the schroedinger2 system (http://www.univie.ac.at/nic) was used. It is the cluster for numerically intensive applications of the University of Vienna and is ranked in the TOP500. This cluster consists of 192 nodes, each having one Pentium 4 2.53GHz and 1GB DDR-RAM, connected by Gigabit Ethernet. The CNN algorithm was coded using the PGI pghpf compiler. Our CNN algorithm uses image data as input of size 1600x1200 pixels and 3200x2400 pixels.

2.1 Quasi-Optimal CNN Algorithm

We were able to assume a quasi-optimal algorithm for the parallel execution of CNNs concerning the following factors. They build the basis for the deviation tests in the second step of the analysis.

Regarding the deployment factor, the optimal algorithm divides every array into equally sized partitions (blocks). By this deployment we get a balanced processing workload and minimized communication costs (less total latency). By using HPF directives our distribution schemas are uniform; thus we automatically reach the quasi-optimal configuration for the deployment factor.

Rule 1: Distribute data uniformly. Every partition (block) of the data arrays should have the same size on every processor. This results in a well-balanced workload.

Rule 2: Align related data. It is important for CNNs that all elements of the matrices x, y and u with the same index are mapped onto the same processor (i.e. are aligned). This minimizes the communication costs, because the template matrices of CNNs imply communication with the direct neighborhood cells only. (An HPF sketch of such a mapping is given below.)

Considering the loop factor, we face two possible ways of array computation: column-wise and row-wise. In our case we expect the best results with column-wise computation; the reason is fewer cache misses due to the alignment of the arrays in memory.

For the distribution factor we evaluate the three distribution schemas (see Figure 1). We expect the best results with the column-wise distribution because of the computational work-flow and the optimized message-buffer usage, so the communication throughput is optimized.

For the overlap factor we measure the differences between three cases: overlap-size zero, overlap-size one and overlap-size two. Along the array block borders, the compiler localizes a portion of a computation prior to the execution that would otherwise require communication. We expect the best result with overlap-size one, because the communication of the CNN algorithm exceeds the array borders by exactly one element. The size of the template matrices A and B in Formula 1 determines the overlap-size.

The adjustment factor influences the length of the communication borders of the array partitions. Since we use a column-wise distribution schema, horizontal adjustment ensures a minimal communication border.
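To make Rules 1 and 2 (and the column-wise distribution schema preferred above) concrete, the following fragment sketches the corresponding HPF mapping. It is a minimal illustration under assumed array names, sizes and processor count, not the authors' original code; the directives used (PROCESSORS, DISTRIBUTE, ALIGN) are standard HPF.

    ! Hedged HPF sketch of the quasi-optimal data mapping: equally sized
    ! column-wise BLOCK partitions (Rule 1, schema C in Figure 1) and
    ! element-wise alignment of the three CNN arrays (Rule 2).
    ! Array names, sizes and the processor count are assumptions.
    program cnn_mapping
      implicit none
      integer, parameter :: m = 1200, n = 1600
      real :: x(m,n), y(m,n), u(m,n)
    !HPF$ PROCESSORS p(8)
    !HPF$ DISTRIBUTE x(*, BLOCK) ONTO p   ! column-wise blocks of equal size
    !HPF$ ALIGN y(i,j) WITH x(i,j)        ! co-locate related elements of y and x
    !HPF$ ALIGN u(i,j) WITH x(i,j)        ! ... and of u and x
      x = 0.0; y = 0.0; u = 0.0           ! placeholder initialisation
    end program cnn_mapping

With this mapping, the only non-local accesses of the CNN update are the neighbourhood reads across the partition borders, which the compiler turns into MPI communication.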



num. of      schroedinger2                       gescher
processors   loop row  dist. row  dist. tile    loop row  dist. row  dist. tile
 1             5.21      3.22       2.65          6.55      2.51       2.83
 2             4.64      1.96       2.71          5.08      2.86       2.32
 4             1.78      2.02       2.80          2.12      2.95       2.70
 8             1.33      2.11       2.48          1.29      2.76       1.96
16             1.20      1.80       1.70          1.15      2.74       2.53

Table 1: Factors for small images (1600x1200 pixels) on schroedinger2 and gescher

num. of      schroedinger2                       gescher
processors   loop row  dist. row  dist. tile    loop row  dist. row  dist. tile
 1             7.10      4.20       2.64          7.69      5.61       2.76
 2             5.98      5.39       2.63          6.53      6.26       2.87
 4             4.93      5.45       2.71          5.11      6.10       2.87
 8             1.88      5.22       3.00          2.31      5.67       3.05
16             1.41      9.54       5.17          1.38     10.53       6.95

Table 2: Factors for large images (3200x2400 pixels) on schroedinger2 and gescher

Based on the above intuitive analytical considerations we assume that the optimal CNN program algorithm shows the following characteristics, which we justify in the subsection below:

• equally sized array partitions and adequate alignment of the arrays by HPF
• column-wise loop execution
• column-wise distribution schema
• overlap size one
• horizontal adjustment

2.2 Absolute Execution Time Comparison

The quasi-optimal CNN program with the above stated characteristics builds the basis for the following heuristic analysis. We see four sensitivity factors for the evaluation of the program. Therefore we changed each of the four factors separately and measured the influence on the execution times. The sensitivity factors are, respectively, the loop execution mode (column-wise, row-wise), the distribution schema (column-wise, row-wise, tile), the overlap size (one, zero, two), and the array adjustment (horizontal, vertical); the quasi-optimal setting is listed first in each case. We give the deviations of the single experiments from the quasi-optimal case as multiples of the absolute execution times of the quasi-optimal program.

Table 1 shows the results for the changed loop execution and changed distribution schema for “small” images on both clusters, and Table 2 for “large” images. Based on these results it can be stated that the loop factor and the distribution factor are the defining factors for the absolute execution times.

Rule 3: Compute arrays along the memory alignment. The computational work-flow for computing two-dimensional arrays is defined by the programming language implementing the CNN algorithm. Thus, cache misses are avoided by adapting the array computation accordingly. The memory alignment of multidimensional arrays is specific to the programming language: Fortran uses column-major order; C, C++ and Java use row-major order. (A loop-order sketch illustrating this rule is given below.)
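As an illustration of Rule 3, the following sketch contrasts the two loop orders for the output function of Equation (4). It is an assumption-level example, not the authors' code: in Fortran the leftmost index is contiguous in memory, so the column-wise version runs with unit stride, while the row-wise version strides through memory and provokes the cache misses visible in Tables 1 and 2.

    ! Sketch for Rule 3: column-wise vs. row-wise computation in Fortran
    ! (column-major storage). Names and sizes are illustrative.
    program loop_order
      implicit none
      integer, parameter :: m = 1200, n = 1600
      real :: x(m,n), y(m,n)
      integer :: i, j
      x = 0.0
      ! Quasi-optimal: outer loop over columns j, unit-stride inner loop over i.
      do j = 1, n
        do i = 1, m
          y(i,j) = 0.5 * (abs(x(i,j) + 1.0) - abs(x(i,j) - 1.0))
        end do
      end do
      ! Deviation tested in Tables 1/2 ("loop row"): loops swapped, stride-m access.
      do i = 1, m
        do j = 1, n
          y(i,j) = 0.5 * (abs(x(i,j) + 1.0) - abs(x(i,j) - 1.0))
        end do
      end do
    end program loop_order

In a C, C++ or Java implementation the recommendation is reversed: the row-wise loop order is the unit-stride one there.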



num. of      schroedinger2                          gescher
processors   overlap 0  overlap 2  adjust. vert.   overlap 0  overlap 2  adjust. vert.
 1             1.00       1.00       0.99            1.01       1.00       0.91
 2             1.00       1.00       1.01            0.97       1.00       0.88
 4             0.99       1.02       1.00            0.98       0.99       0.88
 8             0.94       0.95       0.96            0.88       0.99       1.01
16             0.89       0.99       0.98            1.11       0.97       0.87

Table 3: Factors for small images (1600x1200 pixels) on schroedinger2 and gescher

num. of      schroedinger2                          gescher
processors   overlap 0  overlap 2  adjust. vert.   overlap 0  overlap 2  adjust. vert.
 1             0.99       1.00       0.99            1.00       1.00       0.98
 2             0.99       0.99       1.02            1.00       1.00       1.01
 4             0.98       0.98       0.99            1.00       1.00       1.04
 8             1.01       0.98       1.01            1.01       1.00       1.05
16             1.00       1.04       1.06            1.02       0.99       1.12

Table 4: Factors for large images (3200x2400 pixels) on schroedinger2 and gescher

Rule 4: Choose a distribution schema supporting the computational work-flow. In line with the programming language architecture (see Rule 3), it is also important to distribute the data along the computational work-flow (the execution order of the loop structure). In our case, a better communication throughput results from the column-wise distribution.

Table 3 shows the results for varying overlap-size and rotational adjustment of “small” images on both clusters; Table 4 presents similar results for “large” images. The presented data show that the overlap factor and the adjustment factor have only minor influence on the absolute execution time of the CNN program.

Rule 5: “Forget” overlapping and adjustment (almost). Overlapping and the adjustment of the arrays have only minor influence on the execution time of the CNN algorithm. Depending on the size of the array, the effect can be positive or negative. The overlap-size affects the locality of data elements: along the array block borders, the compiler localizes a portion of a computation prior to execution that would otherwise require communication during computation. The adjustment (landscape or portrait) of an array changes the number of communication border elements. For small arrays, more border elements produce a better throughput by reducing the influence of latency on the communication time. Larger arrays are processed faster by minimizing the communication, because of the smaller amount of communicated data.
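For completeness, the overlap of border elements can also be requested explicitly in HPF-2 style compilers such as pghpf via a SHADOW directive. The following fragment is a hedged sketch under the same naming assumptions as above; the directive spelling and the default shadow handling may differ between compilers, and this is not taken from the authors' code.

    ! Hedged sketch for the overlap factor (Rule 5): a shadow (overlap) region
    ! of width one along the distributed dimension of the column-wise BLOCK
    ! partitions, matching the r = 1 reach of the templates A and B in Formula 1.
    ! SHADOW is an HPF-2 style directive supported by pghpf; its exact behaviour
    ! here is an assumption.
    program cnn_overlap
      implicit none
      integer, parameter :: m = 1200, n = 1600
      real :: x(m,n), y(m,n)
    !HPF$ DISTRIBUTE x(*, BLOCK)
    !HPF$ ALIGN y(i,j) WITH x(i,j)
    !HPF$ SHADOW x(0,1)    ! no shadow in the local dimension, one column of overlap
    !HPF$ SHADOW y(0,1)
      x = 0.0; y = 0.0
    end program cnn_overlap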

3. Conclusion

In this paper we presented five “Rules of Thumb” for the straightforward parallelization of cellular neural network simulation. Summing up, these five rules are:

• Distribute data uniformly,
• Align related data,
• Compute arrays along the memory alignment,
• Choose a distribution schema supporting the computational work-flow, and
• “Forget” overlapping and adjustment (almost).

A figurative justification of our statement is given by Figure 2, which shows the execution times for parallel CNN processing of a 1600x1200 image for two programs: the quasi-optimal case adhering to the stated “Rules of Thumb” and the worst case ignoring all of them. The rules found are equally applicable to the parallelization of other neural network paradigms which are based on a regular problem scheme.

Figure 2: Justification for the five “Rules of Thumb”

References

[1] L. O. Chua and L. Yang. Cellular neural networks: Theory. IEEE Transactions on Circuits and Systems, 35(10):1257–1272, October 1988.

[2] V. Cimagalli and M. Balsi. Cellular Neural Networks: A Review. In 6th Italian Workshop on Parallel Architectures and Neural Networks, Vietri sul Mare, May 1993. World Scientific.

[3] Sanchez, S. Barro, and C. Regueiro. Artificial Neural Networks Implementation on Vectorial Supercomputers. In IEEE Int. Conf. on Neural Networks, pages 3938–3943, Orlando, 1994.

[4] E. Schikuta. Data Parallel Software Simulation of Cellular Neural Networks. In CNNA'96 - 1996 Fourth IEEE International Workshop on Cellular Neural Networks and their Applications, pages 267–271, 1996.

[5] N. B. Serbedzija. Simulating Artificial Neural Networks on Parallel Architectures. IEEE Computer, 29(3):56–63, 1996.

[6] T. Weishäupl and E. Schikuta. Parallelization of cellular neural networks for image processing on cluster architectures. In International Conference on Parallel Processing Workshops, pages 191–196. IEEE, Oct. 2003.
