Adaptive Resonance Theory Fuzzy Networks Parallel Computation Using CUDA

M. Martínez-Zarzuela, F.J. Díaz Pernas, A. Tejero de Pablos, M. Antón Rodríguez, J.F. Díez Higuera, D. Boto Giralda, and D. González Ortega

Higher School of Telecommunications Engineering, University of Valladolid (Spain)
[email protected]
http://gti.tel.uva.es

Abstract. Programming of Graphics Processing Units (GPUs) has evolved in such a way that they can be used to address and speed up the computation of algorithms that fit data-parallel models. In this paper the parallelization of a Fuzzy ART algorithm is described and a detailed explanation of its implementation under CUDA is given. Experimental results show the algorithm runs up to 52 times faster on the GPU than on the CPU for testing and 18 times faster for training, under specific conditions.

1 Introduction

In recent years, GPU (Graphics Processing Unit) programming has become a natural solution for speeding up the execution of many kinds of algorithms [1]. In the beginnings of GPGPU (General-Purpose computation on the GPU), algorithms had to be translated into graphics terms and programs were written using graphics APIs, such as OpenGL or DirectX, and shading languages, such as Cg or HLSL [2]. Modern GPUs include new hardware resources, and new frameworks like CUDA (Compute Unified Device Architecture) [3] or OpenCL (Open Computing Language) can be used for writing general purpose applications.

Adaptive Resonance Theory (ART) is a theory of how the brain processes information. Fuzzy ART is an unsupervised artificial Neural Network (NN) capable of incremental learning, offering a great level of malleability [4]. It has been widely used in applications across medical sciences, economics and finance, engineering and computer science, and pattern recognition and classification. However, execution of Fuzzy ART on CPUs is very time demanding, so it is difficult to apply it to time-dependent applications, such as computer vision.

Several researchers have obtained great results in terms of performance with their GPU-based artificial NN implementations, speeding up execution dozens of times. Tze-Yui Ho et al. [5] explain the advantages and disadvantages of using a GPU-based simulation of a cellular neural network (CNN) instead of other methods such as VLSI circuits.

The results show that the GPU-based CNN simulator can run 8-17 times faster than a CPU-based CNN simulator, improving performance at a low cost. Honghoon Jang et al. [6] designed a heterogeneous implementation for pattern recognition and image processing, using CUDA for execution on the GPU and OpenMP for execution on a multi-core CPU. The computational times for a text detection algorithm were about 15 times faster than a CPU-only implementation and about 4 times faster than an implementation running only on the GPU. In the area of Fuzzy ART, Martínez-Zarzuela et al. [7] presented the first GPU implementation of the algorithm, written in OpenGL, which speeds up the NN testing process.

This article describes how to adapt the original algorithm for data-parallel programming and how to optimize its execution on CUDA-enabled platforms. Experimental results show that both NN training and testing are faster on the GPU than on the CPU, and a comparison with the aforementioned OpenGL implementation [7] is included. Section 2 gives a quick overview of the Fuzzy ART algorithm. In Section 3, CUDA and its main features are introduced. Section 4 focuses on the reformulation of the algorithm for parallel execution and its implementation using CUDA. Section 5 shows the time improvements achieved with respect to the CPU and OpenGL implementations. Finally, Section 6 draws the main conclusions and outlines further research.

2 Fuzzy ART Networks

Fuzzy ART is a self-organizing neural network capable of incremental learning [4]. This kind of NN can be used to cluster a stream of P input patterns into N different categories, generated during an unsupervised training phase. In Fuzzy ART systems, the first layer of neurons F1 receives the input pattern and each neuron in the upper layer F2 represents a specific category from those that emerged during the self-organizing training phase. The F1 activity vector is denoted by $\mathbf{I}^p = (I_1, \dots, I_M)$, where each component $I_i$ lies within the $[0, 1]$ interval. F2 neuron synapses are weighted by long-term memory (LTM) traces denoted by $\mathbf{w}_j = (w_{j1}, \dots, w_{jM})$. The activity of these neurons is computed as $T_j(\mathbf{I}) = |\mathbf{I} \wedge \mathbf{w}_j| / (\alpha + |\mathbf{w}_j|)$.

(a) Sequential training

(b) Sequential testing

Fig. 1. Training and testing Fuzzy ART NN pseudocodes

The category choice is indexed by $J$, where $T_J = \max(T_j : j = 1 \dots N)$. The system enters resonance only if the match function meets the vigilance criterion $|\mathbf{I} \wedge \mathbf{w}_J| / |\mathbf{I}| \ge \rho$. When this occurs and learning is enabled ($\beta > 0$), the associated $\mathbf{w}_J$ is updated as $\mathbf{w}_J^{new} = \beta(\mathbf{I} \wedge \mathbf{w}_J^{old}) + (1 - \beta)\mathbf{w}_J^{old}$. Otherwise, node $J$ is inhibited and the node in F2 with the next highest activity is selected. If no neuron $j$ is found to meet the vigilance criterion, a new neuron is committed in F2. Figures 1(a) and 1(b) describe the neural network computation on a sequential processor such as the CPU.
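To make the formulas above concrete, the following host-side sketch performs one sequential training step in the spirit of figure 1(a). It is a minimal illustration, not the authors' code: the function name, the row-major layout of the weight matrix and the assumption that `w` has room for one extra row are all ours.

```cuda
#include <stdlib.h>

/* Sequential (CPU) sketch of one Fuzzy ART training step for a single
 * pattern I of length M. w holds the N committed weight vectors row by
 * row (N x M floats) and is assumed to have room for one extra row.   */
static int train_step(const float *I, float *w, int *N, int M,
                      float alpha, float beta, float rho)
{
    char *inhibited = (char *)calloc(*N, sizeof(char));
    for (;;) {
        /* 1. Category choice: J = argmax_j T_j over non-inhibited nodes,
         *    with T_j = |I ^ w_j| / (alpha + |w_j|), ^ = component-wise min. */
        int J = -1; float TJ = -1.0f, matchJ = 0.0f;
        for (int j = 0; j < *N; ++j) {
            if (inhibited[j]) continue;
            float and_norm = 0.0f, w_norm = 0.0f, I_norm = 0.0f;
            for (int i = 0; i < M; ++i) {
                float wi = w[j * M + i];
                and_norm += (I[i] < wi) ? I[i] : wi;
                w_norm   += wi;
                I_norm   += I[i];
            }
            float T = and_norm / (alpha + w_norm);
            if (T > TJ) { TJ = T; J = j; matchJ = and_norm / I_norm; }
        }
        /* 2. No suitable node left: commit a new category with w_new = I. */
        if (J < 0) {
            for (int i = 0; i < M; ++i) w[(*N) * M + i] = I[i];
            (*N)++; free(inhibited); return *N - 1;
        }
        /* 3. Vigilance test: resonance if |I ^ w_J| / |I| >= rho. */
        if (matchJ >= rho) {
            for (int i = 0; i < M; ++i) {
                float wi = w[J * M + i];
                float m  = (I[i] < wi) ? I[i] : wi;
                w[J * M + i] = beta * m + (1.0f - beta) * wi;  /* learning rule */
            }
            free(inhibited); return J;
        }
        inhibited[J] = 1;   /* 4. Mismatch reset: inhibit J and search again. */
    }
}
```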

3 Programmable Graphics Hardware Using CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing framework created by NVIDIA for GPUs from the G8X series onwards. C for CUDA adds some extensions to standard ANSI C99, giving access to the native instruction set and memory of the parallel computational elements in CUDA-enabled GPUs. Accelerating an algorithm using CUDA involves translating it into data-parallel sequences of operations and mapping the computations onto the underlying resources to obtain maximum performance.

GPUs have a parallel many-core multiprocessor architecture. Each core is capable of running a series of operations (a kernel) over several data elements simultaneously, delivering the computation to thousands of GPU threads. Threads are grouped into blocks; each block is executed on a single multiprocessor, which allows data sharing between its threads through shared memory. The computation is delivered as a number of blocks called a grid. When calling a GPU kernel from the CPU, the dimensions of the grid and the block have to be specified. The design must exploit the hardware resources optimally, which requires taking into account aspects such as the use of shared memory, coalescence and occupancy.

Shared memory is one of the most useful resources available in modern GPUs. Access to global memory suffers from a bottleneck effect, which can dramatically slow down an implementation. Shared memory is an intermediate cache between global memory and the stream processors, as fast as registers: threads belonging to the same block can store intermediate results of a kernel without spending a huge amount of time reading from or writing to GPU global memory. The amount of shared memory is quite limited and has to be shared among all the threads of each block. A coalesced read or write performed by various threads simultaneously can speed up the process, although it has to be designed specifically for each implementation. Designing the kernel so that more than one block executes per multiprocessor can also help hide latency in global memory loads: execution of threads belonging to different blocks can overlap, but only if there are enough hardware resources for two or more blocks on the multiprocessor.

A synchronization point for the threads belonging to the same block may be needed in specific parts of the code. In reduction operations, for example, threads collaborate to compute a global operation over the elements of a vector, such as the maximum or the sum of every component.

For parallelization, different threads compute a small part of the operation in several stages, and the algorithm converges to a result stored in the first position of the vector (figure 5(a)). Synchronization points between stages guarantee the consistency of the data.
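As an illustration of these synchronization points, the kernel sketch below sums the components of one vector per block, halving the active stride at every stage. Names, the zero padding and the power-of-two block size are assumptions on our part; the paper's own kernels are not shown.

```cuda
// Sketch of a block-wide sum reduction with a synchronization point
// between stages. blockDim.x is assumed to be a power of two and the
// input padded with zeros up to that size.
__global__ void block_sum(const float *in, float *out)
{
    extern __shared__ float s[];
    unsigned int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];   // coalesced load
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];            // pairwise partial sums
        __syncthreads();                          // stage boundary
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];                   // result in position 0
}
```

With the 128-thread blocks used later in Section 5, a launch would look like `block_sum<<<numBlocks, 128, 128 * sizeof(float)>>>(d_in, d_out)`.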

4 Parallel Computation of Fuzzy ART Using CUDA

In this section, we describe how to adapt the Fuzzy ART algorithm for data-parallel computation, together with the main issues to bear in mind when implementing the code in CUDA.

Figure 2(a) contains the parallelized Fuzzy ART version for training. Every input vector $\mathbf{I}^p$ has to be analyzed sequentially, since the learning process can modify the weights associated with previously committed categories. However, it is possible to compute the match criterion and the activity of every output neuron simultaneously; the winning neuron is the most fired one that also meets the match criterion. Figure 2(b) describes the parallelization of the testing phase. In this case every input vector $\mathbf{I}^p$ can be processed in parallel and the activities of the nodes in F2 are computed sequentially.

The main advantage of using a graphics card is the possibility of executing the same operation on various vectors simultaneously. Figure 3 depicts the data organization in CUDA. Single-dimension blocks and a single-dimension grid are used in order to simplify vector accesses. A grid of blocks is generated to deal simultaneously with the neural network weights during training and with the input patterns during testing, as described in the pseudocodes of figures 2(a) and 2(b). Several vectors are packed into each block. Padding the block and grid data with zeros might be necessary so that coalesced readings and coherent reductions can be done during processing. The dimensions of the grid are computed dynamically during training, so that more blocks of threads are generated as the neural network commits more categories (a host-side sketch of this policy is given below).

The training phase in figure 2(a) is computed in three different kernels, labelled κ1, κ2 and κ3. This division is made because each kernel works with vectors of a different dimension, which means that for optimal performance each kernel must be called with its own block and grid configuration.
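The following host-side sketch illustrates one possible form of the grid-sizing and zero-padding policy just described. The packing factor VECS_PER_BLOCK, the helper names and the cudaMemset-based padding are illustrative assumptions, not the authors' implementation.

```cuda
#include <cuda_runtime.h>

/* Assumed packing: 4 vectors of length M = 32 fill a 128-thread block. */
#define VECS_PER_BLOCK 4

static dim3 grid_for(int numCategories)
{
    /* Single-dimension grid, as in Fig. 3: one block per group of vectors. */
    int blocks = (numCategories + VECS_PER_BLOCK - 1) / VECS_PER_BLOCK;
    return dim3(blocks, 1, 1);
}

static void pad_with_zero_vectors(float *d_w, int numCategories, int M)
{
    /* Fill the trailing block with zero vectors so every thread reads
     * valid data and the reductions stay coherent.                     */
    int padded = ((numCategories + VECS_PER_BLOCK - 1) / VECS_PER_BLOCK)
                 * VECS_PER_BLOCK;
    if (padded > numCategories)
        cudaMemset(d_w + numCategories * M, 0,
                   (size_t)(padded - numCategories) * M * sizeof(float));
}
```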

(a) Parallel training

(b) Parallel testing

Fig. 2. Training and testing parallelized Fuzzy ART NN pseudocodes

Fig. 3. Grid and kernel data organization for training and testing phases

(a) Training

(b) Testing

Fig. 4. Vector handling for parallel training and testing operations inside a block

The real obstacle, however, is that the only way to synchronize all the threads after the operation performed in the second kernel is to call another kernel; thus the algorithm cannot be implemented in a single function call. The testing phase in figure 2(b), on the other hand, can be computed in a single call to kernel κ4. The activities vector is declared as a float2 array, each element storing the activity $T_j$ together with its associated index $j$.

In order to compute the norm of a vector, reductions are needed in κ1 and κ4. During training, finding $T_J$ involves another reduction in κ2 to find the neuron in F2 with the largest activity. If $T_J = 0$, another neuron is committed in F2, and the dimensions of the grid may change if another block is needed to store its associated weights.

Because of this parallelization, the number of vectors has to be adapted to fit the number of blocks defined, so that the last block can work with all of its threads. This happens with the weight vectors during training and with the input vectors during testing, and the solution is as simple as adding vectors filled with zeros until the last block is full (figure 3). Since several vectors can be stored in a single block, a specialized kind of reduction has to be used in order to compute the norm of each one independently. This kind of parallel reduction is described in figure 5(b); in this way, various reductions can be calculated simultaneously (figure 4). A sketch of such a segmented reduction is given after figure 5.

One remarkable technique which helps speed up reductions is unrolling kernel loops [8]. When applied to reduction operations, the size of the block in each iteration of the algorithm can be adjusted to the number of threads needed.

(a) Reduction over a single vector  (b) Reduction over various vectors
Fig. 5. Vector handling for reduction operations inside a block
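A sketch of the segmented, per-block reduction of figure 5(b) follows. The constants VECS and LEN, the buffer names and the assumption that their product equals the block size are ours; the paper does not give the exact layout.

```cuda
// Sketch of a per-block reduction over several packed vectors (Fig. 5(b)).
// Each block holds VECS vectors of LEN components (LEN a power of two,
// VECS * LEN == blockDim.x); every vector is reduced to its own slot.
#define LEN  32
#define VECS 4    // 4 * 32 = 128-thread blocks, as used in Section 5

__global__ void multi_norm(const float *in, float *norms)
{
    __shared__ float s[VECS * LEN];
    unsigned int tid  = threadIdx.x;
    unsigned int vec  = tid / LEN;     // which packed vector this thread serves
    unsigned int lane = tid % LEN;     // position inside that vector
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned int stride = LEN / 2; stride > 0; stride >>= 1) {
        if (lane < stride)
            s[tid] += s[tid + stride]; // reduce inside each segment only
        __syncthreads();
    }
    if (lane == 0)                     // one norm (e.g. |I ^ w_j|) per vector
        norms[blockIdx.x * VECS + vec] = s[vec * LEN];
}
```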

There are a couple of restrictions when using the loop-unrolling technique. First, reductions have to be made on vectors whose size is a power of two, because of the continuous halving in the process; otherwise the result is not coherent. Second, reductions have to use all the threads of the block, so the size of the vector to reduce has to be equal to the number of threads. Both restrictions are easily satisfied in our program by padding the vector with zeros, as this does not alter the result of the operations.
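For reference, a commonly cited form of the unrolling optimisation [8] on G8X-class hardware drops the barrier once only one warp remains active and unrolls the last stages by hand. The sketch below is that generic pattern under our assumptions (block size a power of two, at least 64 threads, volatile shared memory for the warp-synchronous part); it is not claimed to be the paper's exact kernel.

```cuda
// Generic loop-unrolled tail of a reduction on G8X-era GPUs: once 32 or
// fewer threads are active, __syncthreads() is dropped and the loop
// unrolled. volatile prevents the compiler from caching shared values.
__device__ void warp_reduce(volatile float *s, unsigned int tid)
{
    s[tid] += s[tid + 32];
    s[tid] += s[tid + 16];
    s[tid] += s[tid +  8];
    s[tid] += s[tid +  4];
    s[tid] += s[tid +  2];
    s[tid] += s[tid +  1];
}

__global__ void block_sum_unrolled(const float *in, float *out)
{
    extern __shared__ float s[];
    unsigned int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid < 32) warp_reduce(s, tid);  // unrolled tail, no further barriers
    if (tid == 0) out[blockIdx.x] = s[0];
}
```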

5 Experimental Results

This section includes a performance comparison among the Fuzzy ART implementation using CUDA, the GPU implementation using OpenGL [7] and a CPU C++ implementation. Timings were taken on a dual-core 3.2 GHz Pentium 4 with 1 GB of RAM and different graphics cards belonging to the G8X series from NVIDIA, and include the time needed for copying GPU results back to the CPU.

A synthetic benchmark comprising several sets of P patterns was generated. In each set, the dimension of the input vectors $\mathbf{I}^p$ and the number of expected categories N vary. In order to guarantee that the number of categories that could be committed was not overly influenced by the length and number of input patterns, a multivariate normal distribution was used for pattern generation, so that $P = N \times M$, where M is the number of patterns expected to fall into each of the N different categories [7]. Tests have been done for $\mathbf{I}^p$ dimensions of 4, 8, 16 and 32 (a hypothetical generator of this kind is sketched below).

The block size has been chosen to obtain the highest occupancy on the GPU; higher occupancy means fewer idle threads while executing the program. Given the number of registers and the amount of shared memory needed by the kernels in the program, a block size of 128 is optimal in most cases, and has been chosen for all the tests presented in this section.

The semi-logarithmic plot in figure 6 compares the time needed for training the network on the CPU and on the GPU for $P = N \times M$ patterns of length 32. Parallelization of the algorithm becomes useful when the number of committed categories, and hence the number of weight vectors, is very large; otherwise a large number of threads remain idle most of the time. The CPU is faster for fewer than N = 1500 categories. When N is larger, the relative speed-up achieved on the GPU grows rapidly; during the tests, a maximum speed-up of 18 was obtained for N = 10000.

Testing, on the other hand, is always much faster on the GPU than on the CPU, because a huge number of input vectors can always be processed in parallel.
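The sketch below shows one hypothetical way to build such a benchmark: M patterns drawn around each of N random centres ($P = N \times M$ in total), with components clamped to [0, 1]. The Box-Muller step, the fixed spread SIGMA and all names are our assumptions, not the authors' generator.

```cuda
#include <math.h>
#include <stdlib.h>

#define SIGMA 0.02f                      /* assumed per-cluster spread */

static float gauss(void)                 /* standard normal via Box-Muller */
{
    float u1 = (rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
    float u2 = (rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
    return sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
}

/* Fill I with N * M patterns of dimension dim (dim <= 64 assumed). */
static void make_patterns(float *I, int N, int M, int dim)
{
    for (int n = 0; n < N; ++n) {
        float centre[64];
        for (int d = 0; d < dim; ++d)
            centre[d] = rand() / (float)RAND_MAX;      /* cluster centre */
        for (int m = 0; m < M; ++m)
            for (int d = 0; d < dim; ++d) {
                float v = centre[d] + SIGMA * gauss(); /* sample around it */
                I[((n * M) + m) * dim + d] = fminf(1.0f, fmaxf(0.0f, v));
            }
    }
}
```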

Fig. 6. Training for Ip of size 32

(a) Testing on different platforms for Ip of size 32

(b) Testing on 8800GT

Fig. 7. Performance during testing

Figure 7(a) shows the testing performance on different platforms for patterns of length 32. It is worth mentioning that a naïve CUDA implementation, which does not use the unrolling technique, does not manage to beat the OpenGL implementation on a graphics card equipped with an 8800GT GPU. The speed-up achieved by the best GPU implementation with respect to the CPU varies between 37 and 52. Figure 7(b) includes a performance comparison depending on the dimension of the input patterns.

6 Conclusions and Future Work

In this paper, the parallelization of a Fuzzy ART NN algorithm for training and testing was presented. An implementation of the algorithm on CUDA was detailed and tested against OpenGL [7] and CPU implementations. While a first naïve CUDA implementation was not as fast as the OpenGL version for testing, the final version, which uses the loop-unrolling technique, considerably speeds up the CUDA implementation.

A peak speed-up of 57x was achieved for testing against the sequential version of the algorithm running on the CPU. For training, when the number of categories created in the NN is large enough to take advantage of all the GPU hardware resources, the relative speed-up between GPU and CPU grows rapidly.

Acknowledgments. This work has been partially supported by the Spanish Ministry of Education and Science under project TIN2007-67236 and by the University Rey Juan Carlos in collaboration with the Community of Madrid under project URJC-CM-2007CET-1724.

References
1. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J.: A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26 (2007)
2. Harris, M.: Mapping computational concepts to GPUs. In: Pharr, M. (ed.) GPU Gems 2, pp. 493–508. Addison-Wesley, Reading (2005)
3. NVIDIA: CUDA Zone: programming resources, http://www.nvidia.com/object/cuda_home.html (last visited January 2009)
4. Carpenter, G.A., Grossberg, S., Rosen, D.B.: Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks 4(6), 759–771 (1991)
5. Ho, T.Y., Park, A., Jung, K.: Parallelization of cellular neural networks on GPU. Pattern Recognition 41(8), 2684–2692 (2008)
6. Jang, H., Park, A., Jung, K.: Neural network implementation using CUDA and OpenMP. In: DICTA 2008: Proceedings of the 2008 Digital Image Computing: Techniques and Applications, Washington, DC, USA, pp. 155–161. IEEE Computer Society Press, Los Alamitos (2008)
7. Martínez-Zarzuela, M., Díaz Pernas, F.J., Díez Higuera, J.F., Antón-Rodríguez, M.: Fuzzy ART neural network parallel computing on the GPU. In: Hernández, F.S., Prieto, A., Cabestany, J., Graña, M. (eds.) IWANN 2007. LNCS, vol. 4507, pp. 463–470. Springer, Heidelberg (2007)
8. Harris, M.: Parallel prefix sum (scan) with CUDA. In: Nguyen, H. (ed.) GPU Gems 3, pp. 851–876. Addison-Wesley Professional, Reading (2007)