GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy

Dmitri Yudanov, Muhammad Shaaban, Roy Melton, Leon Reznik

Department of Computer Engineering, Rochester Institute of Technology, United States
WCCI 2010, IJCNN, July 23, 2010

Agenda

- Motivation
- Neural network models
- Simulation systems of neural networks
- Parker-Sochacki numerical integration method
- CUDA GPU architecture
- Implementation: software architecture, computation phases
- Verification
- Results
- Conclusion and future work
- Q&A

Motivation

Other works have an accuracy and verification problem (Nageswaran et al., 2009; Fidjeland et al., 2009; Tiesel and Maida, 2009; full citations are listed under "Other Works" at the end of the deck).

Goals of this work:
- Provide scalable accuracy
- Perform direct verification

Based on: R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-133, Aug. 2009.

Neuron Models: IF, HH, IZ

- IF (integrate-and-fire): simple, but has a poor spiking response
- HH (Hodgkin-Huxley): has a rich response, but is complex
- IZ (Izhikevich): simple and has a rich response, but is phenomenological
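For reference, the IZ model named above is the two-equation system from the Izhikevich (2003) paper cited in the bibliography:

```latex
% Izhikevich (2003) spiking neuron model
v' = 0.04v^2 + 5v + 140 - u + I, \qquad u' = a(bv - u)
% with the after-spike reset:
\text{if } v \ge 30\ \mathrm{mV}: \quad v \leftarrow c, \quad u \leftarrow u + d
```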

System Modeling: Synchronous Systems

- Aligned events → good for parallel computing
- Time quantization error introduced by dt
- Smaller dt → more precise, but computationally hungry
- May result in missing events → STDP-unfriendly

Order of computation per second of simulated time (see the cost sketch below): N is the network size, F the average firing rate of a neuron, p the average number of target neurons per spike source (R. Brette, et al.).
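The cost expression itself did not survive extraction. Following the cited Brette et al. review, the clock-driven cost per second of simulated time is plausibly of the form below, where c_U and c_P are per-operation update and propagation costs (my notation, not the slide's):

```latex
% Clock-driven (synchronous): every neuron is updated every dt,
% and every spike is propagated to its p targets.
C_{\mathrm{sync}} \approx c_U \frac{N}{dt} + c_P F N p
```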

System Modeling: Asynchronous Systems

- Small computation order
- Events are unique in time → no quantization error → more accurate, STDP-friendly
- Events are processed sequentially
- More computation per unit of time
- Spike predictor-corrector → excessive re-computation
- Assumes an analytical solution

Order of computation per second of simulated time (see the cost sketch below): N is the network size, F the average firing rate of a neuron, p the average number of target neurons per spike source (R. Brette, et al.).
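Again the formula image is missing; per Brette et al., the event-driven cost has no 1/dt term because work is done only per event (same c_U, c_P notation as above):

```latex
% Event-driven (asynchronous): update and propagation both occur
% per event, so the cost scales with the event rate F*N*p alone.
C_{\mathrm{async}} \approx (c_U + c_P) F N p
```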

System Modeling: Hybrid Systems

- Refreshes every dt → more structured than event-driven → good for parallel computing
- Events are unique in time → no quantization error → more accurate, STDP-friendly
- Doesn't require an analytical solution
- Events are processed sequentially
- Largest possible dt is limited by the minimum delay and the highest possible transient

Order of computation per second of simulated time (see the cost sketch below): N is the network size, F the average firing rate of a neuron, p the average number of target neurons per spike source (R. Brette, et al.).
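For the hybrid scheme, the lost formula plausibly combines the two previous ones: a clock-driven update sweep plus event-accurate propagation. This is an assumption consistent with the bullets above, not a verbatim reconstruction:

```latex
% Hybrid: clock-driven update sweep every dt, plus propagation of
% events carrying exact (unquantized) spike times.
C_{\mathrm{hybrid}} \approx c_U \frac{N}{dt} + c_P F N p
```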

Choice of Numerical Integration Method

- Motivation: need to solve an initial value problem (IVP)
- Euler: compute the next y based on the tangent at the current y
- Modified Euler: predict with Euler, correct with the average slope
- Runge-Kutta 4th order: evaluate four slopes and average them
- Bulirsch-Stoer: modified midpoint method with error-tolerance checks via rational-function extrapolation; adaptive order; generally better suited to smooth functions
- Parker-Sochacki: express the IVP as a power series; adaptive order
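As a concrete contrast between the first and third methods, here is a minimal sketch (not the paper's code) of one Euler step and one RK4 step for the IZ membrane equation, holding u and I fixed over the step for brevity; a full implementation would advance u as well:

```cuda
// f: the Izhikevich membrane equation used as an example right-hand side.
__host__ __device__ float f(float v, float u, float I) {
    return 0.04f * v * v + 5.0f * v + 140.0f - u + I;
}

__host__ __device__ float euler_step(float v, float u, float I, float dt) {
    return v + dt * f(v, u, I);              // follow the tangent at the current point
}

__host__ __device__ float rk4_step(float v, float u, float I, float dt) {
    float k1 = f(v, u, I);                   // slope at the start
    float k2 = f(v + 0.5f * dt * k1, u, I);  // slope at the midpoint (using k1)
    float k3 = f(v + 0.5f * dt * k2, u, I);  // slope at the midpoint (using k2)
    float k4 = f(v + dt * k3, u, I);         // slope at the end
    return v + dt * (k1 + 2.0f * k2 + 2.0f * k3 + k4) / 6.0f;  // weighted average
}
```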

Parker-Sochacki Method

A typical IVP:

$y'(t) = f(y), \quad y(t_0) = y_0$

Assume the solution function can be represented with a power series, $y(t) = \sum_{k=0}^{\infty} a_k (t - t_0)^k$. Therefore, by Maclaurin series properties, its derivative is $y'(t) = \sum_{k=0}^{\infty} (k+1)\, a_{k+1} (t - t_0)^k$.

Parker-Sochacki Method

If $f(y)$ is linear, shift $y$ to eliminate the constant term. As a result, the equation becomes $y' = \alpha y$, and matching series coefficients gives the recurrence

$a_{k+1} = \frac{\alpha\, a_k}{k+1}$

With finite order N:

$y(t) \approx \sum_{k=0}^{N} a_k (t - t_0)^k$

LLP (loop-level parallelism): the truncated sum can be evaluated with a parallel reduction.
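A minimal CUDA sketch of that reduction, assuming the coefficients a[k] from the linear recurrence a[k+1] = alpha*a[k]/(k+1) are already in memory; N_ORDER and the kernel shape are illustrative choices, not the paper's:

```cuda
#define N_ORDER 32  // truncation order; must be a power of two here

// Each thread computes one series term a[k]*t^k, then the terms are
// summed with a tree-structured parallel reduction in shared memory.
__global__ void ps_linear_eval(const float* a, float t, float* y_out) {
    __shared__ float term[N_ORDER];
    int k = threadIdx.x;

    term[k] = a[k] * powf(t, (float)k);  // one term per thread
    __syncthreads();

    for (int stride = N_ORDER / 2; stride > 0; stride >>= 1) {
        if (k < stride) term[k] += term[k + stride];
        __syncthreads();
    }
    if (k == 0) *y_out = term[0];        // y(t) truncated at order N_ORDER-1
}

// Launch: ps_linear_eval<<<1, N_ORDER>>>(d_a, dt, d_y);
```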

Parker-Sochacki Method

If $f(y)$ is quadratic, shift $y$ to eliminate the constant term (shift by a root of the quadratic). As a result, the equation becomes

$y' = \alpha y^2 + \beta y$

The quadratic term can be converted with series multiplication (the Cauchy product):

$(y^2)_k = \sum_{j=0}^{k} a_j\, a_{k-j}$

Parker-Sochacki Method

and the equation becomes the coefficient recurrence:

$a_{k+1} = \frac{1}{k+1} \left( \alpha \sum_{j=0}^{k} a_j\, a_{k-j} + \beta\, a_k \right)$

With finite order N:

- Loop-carried circular dependence: each new coefficient depends on all previous ones
- Only partial parallelism is possible (the inner Cauchy sum)
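A small host-side sketch of the resulting coefficient loop (illustrative names alpha, beta), which makes the dependence structure explicit: the outer loop is inherently sequential, while only the inner Cauchy sum can be parallelized, e.g. with a per-warp reduction:

```cuda
// Computes PS coefficients a[1..N] for y' = alpha*y^2 + beta*y,
// given a[0] = y(t0). The a[] buffer must hold N+1 floats.
void ps_quadratic_coeffs(float* a, float alpha, float beta, int N) {
    for (int k = 0; k < N; ++k) {        // sequential: a[k+1] needs a[0..k]
        float cauchy = 0.0f;
        for (int j = 0; j <= k; ++j)     // parallelizable inner sum
            cauchy += a[j] * a[k - j];
        a[k + 1] = (alpha * cauchy + beta * a[k]) / (float)(k + 1);
    }
}
```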

Parker-Sochacki Method

- The local Lipschitz constant determines the number of iterations needed to achieve a given error tolerance
- Power series representation → adaptive order → error tolerance control
- Limitation: the Cauchy product reduces parallelism

CUDA: SW

- Kernel: separate code, division of the task
- Thread → Block (1D, 2D, 3D) → Grid (1D, 2D)
- Computation is divided based on thread and block IDs (see the example below)
- Granularity: down to bit level (after a warp broadcast access)
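A minimal, self-contained example of the ID-based task division just described (a standard SAXPY kernel, not from the paper):

```cuda
// Each thread derives a unique global index from its block and thread
// IDs and handles exactly one array element.
__global__ void saxpy(int n, float alpha, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        y[i] = alpha * x[i] + y[i];
}

// Launch: one thread per element, rounded up to whole blocks.
// saxpy<<<(n + 255) / 256, 256>>>(n, alpha, d_x, d_y);
```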

CUDA: HW

Scheduling:
- Scheduling: parallel and sequential
- Scalability → requirement for blocks to be independent

Warp:
- Warp = 32 threads
- Warp divergence
- Warp-level synchronization

Active blocks and threads:
- Active threads / SM: maximum 1024
- Goal: full occupancy = 1024 threads
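A small illustration of choosing a launch configuration with these limits in mind; the block and grid sizes are examples only, and real occupancy also depends on register and shared-memory usage:

```cuda
// Launch-configuration choice for a GT200-class GPU
// (maximum 1024 active threads per SM).
void launch_example() {
    const int SM_COUNT = 24;     // GTX 260
    dim3 block(256);             // 256 threads per block
    dim3 grid(SM_COUNT * 4);     // 4 resident blocks/SM -> 1024 threads/SM
    // my_kernel<<<grid, block>>>(...);  // hypothetical kernel
}
```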

Software Architecture

Update Phase (Stewart and Bair)

- Adaptive order p according to the required error tolerance
- Can be processed in parallel for each neuron (see the sketch below)
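A hedged sketch of what a one-thread-per-neuron update kernel can look like; this is not the paper's kernel. It grows the PS series for the IZ membrane equation until the newest term drops below the tolerance, freezing u and I over the step for brevity; a full implementation advances u as a coupled series and handles the spike reset:

```cuda
// One thread per neuron: adaptive-order Parker-Sochacki step for
// v' = 0.04*v^2 + 5*v + (140 - u + I), with u and I frozen over dt.
__global__ void update_phase(float* v, const float* u, const float* I,
                             float dt, float tol, int n_neurons) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_neurons) return;

    const int MAX_ORDER = 32;
    float a[MAX_ORDER + 1];              // series coefficients for v
    a[0] = v[i];
    float q = 140.0f - u[i] + I[i];      // constant part of v'

    float sum = a[0], tpow = 1.0f;
    for (int k = 0; k < MAX_ORDER; ++k) {
        float cauchy = 0.0f;             // (v^2)_k by series multiplication
        for (int j = 0; j <= k; ++j) cauchy += a[j] * a[k - j];
        a[k + 1] = (0.04f * cauchy + 5.0f * a[k] + (k == 0 ? q : 0.0f))
                   / (float)(k + 1);
        tpow *= dt;                      // dt^(k+1)
        float term = a[k + 1] * tpow;
        sum += term;
        if (fabsf(term) < tol) break;    // adaptive order: stop early
    }
    v[i] = sum;                          // advance v by one dt step
}
```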

Propagation Phase

- Translate spikes to synaptic events: global communication is required
- Encoded spikes are written to global memory: a bit mask plus time values
- A propagation block reads and filters all spikes, decodes them, fetches synaptic data, and distributes events into time slots (sketched below)
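A hedged sketch of the encoding described above: each firing neuron sets one bit in a global bit mask and records its exact spike time, and a propagation block later scans the mask to recover firing neuron IDs. Buffer names are illustrative, not the paper's:

```cuda
// Encode: set the neuron's bit in the shared mask, store its spike time.
__device__ void encode_spike(unsigned* mask, float* t_spike,
                             int neuron, float t) {
    atomicOr(&mask[neuron / 32], 1u << (neuron % 32));
    t_spike[neuron] = t;                 // exact (unquantized) spike time
}

// Decode: scan the mask words cooperatively; each set bit is one spike.
__device__ void decode_spikes(const unsigned* mask, int n_words,
                              int* out_ids, int* out_count) {
    for (int w = threadIdx.x; w < n_words; w += blockDim.x) {
        unsigned bits = mask[w];
        while (bits) {
            int b = __ffs((int)bits) - 1;    // index of lowest set bit
            bits &= bits - 1;                // clear it
            int slot = atomicAdd(out_count, 1);
            out_ids[slot] = w * 32 + b;      // decoded firing neuron ID
        }
    }
}
```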

Sorting Phase (Satish et al.)
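An illustrative stand-in for this phase: the paper builds on the Satish et al. radix sort (available through the cited CUDPP library); Thrust's sort_by_key, shown here, is a convenient equivalent for sketching the idea of ordering synaptic events by delivery time so each time slot can be processed contiguously:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Sort synaptic events by delivery time, carrying target IDs along.
void sort_events(thrust::device_vector<float>& delivery_time,
                 thrust::device_vector<int>& target_neuron) {
    thrust::sort_by_key(delivery_time.begin(), delivery_time.end(),
                        target_neuron.begin());
}
```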

Software Architecture

Results: Verification Input Conditions

- Random parameter allocation
- Random connectivity
- Zero PS error tolerance

GPU Device: GTX 260
- 24 streaming multiprocessors (SMs)
- Shared memory: 16 KB / SM
- Global memory: 938 MB
- Clock rate: 1.3 GHz

CPU Device: AMD Opteron 285
- Dual core
- L2 cache: 1 MB / core
- RAM: 4 GB
- Clock rate: 2.6 GHz

Output:
- Membrane potential traces
- Passed the test for equality (GPU vs. CPU traces)

Results: Simulation Time vs. Network Size

[Figure: simulation time (sec., 0-250) vs. network size (2,000-8,000 neurons); two families of curves (GPU and CPU) at 2%, 4%, 8%, and 16% connectivity]

Conditions: 80% excitatory / 20% inhibitory synapses, zero tolerance, 10 sec of simulation, initially excited by a 0-200 pA current.

Results: the GPU simulation is 8-9 times faster than the CPU; real-time performance for 2-4%-connected networks of 2048-4096 neurons.

Major limiting factors: shared memory size and the number of SMs.


Results: Simulation Time vs. Event Throughput

[Figure: simulation time (sec., 10-410) vs. mean event throughput (0-10, in 1000 x events/(sec. x neuron)); two families of curves (GPU and CPU) at 2%, 4%, 8%, and 16% connectivity]

Conditions: increasing the excitatory/inhibitory ratio from 0.80/0.20 to 0.98/0.02, a network of 4096 neurons, zero tolerance, 10 sec of simulation, initially excited by a 0-200 pA current.

Results: the GPU simulation is 6-9 times faster, at up to 10,000 events per second per neuron; real-time performance for 0-2%-connected networks of 2048-4096 neurons.

Major limiting factors: shared memory size and the number of SMs.


Results: Comparison with Other Works

Metric                  | This Work                | Other Works
------------------------|--------------------------|------------
Increase in speed       | 6-9x, RT                 | 10-35x, RT
Network size            | 2K-8K                    | 16K-200K
Connectivity per neuron | 100-1.3K                 | 100-1K
Accuracy                | Full single-precision FP | Undefined
Verification            | Direct                   | Indirect

Reasons for the differences: the first three metrics differ due to the GPU device, the complexity of computation, the numerical integration methods, the simulation type, and the time scale; accuracy and verification differ due to the numerical integration method.

Conclusion

- Implemented a highly accurate PS-based hybrid simulation of spiking neural networks with IZ neurons on the GPU
- Directly verified the implementation

Future Work

- Add an accurate STDP implementation
- Characterize accuracy in relation to signal processing, network size, network speed, and learning
- Provide an example of an application
- Port to OpenCL
- Further optimization

Q&A

Essential Bibliography

- R. Brette, et al., "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 23, no. 3, pp. 349-398, 2007.
- R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-133, Aug. 2009.
- G. E. Parker and J. S. Sochacki, "Implementing the Picard iteration," Neural, Parallel & Scientific Computations, vol. 4, pp. 97-112, 1996.
- E. M. Izhikevich, "Simple model of spiking neurons," IEEE Transactions on Neural Networks, vol. 14, pp. 1569-1572, 2003.
- N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," 2009, pp. 1-10.
- CUDA Data Parallel Primitives Library (CUDPP), Apr. 2010. Accessed online 04/30/2010: http://code.google.com/p/cudpp/
- NVIDIA CUDA Programming Guide 2.3, 2008. Accessed online 04/30/2010: http://developer.nvidia.com

Other works J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul. 2009. A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," Application-Specific Systems, Architectures and Processors, IEEE International Conference on, vol. 0, pp. 137-144, 2009. J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," in , 2009, pp. 754--759.