Jun 14, 2013. Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki.
EVIP
Technical Application Field
Scientific Computing
Applied Numerics
Variational Modeling
EVIP Parallel Processing
Rank efficient operators
Elasticity-modeled Image Registration
NOMIR – Part I
Motivation
Eldad Haber & Jan Modersitzki
Image Registration

Given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R.

Panels: template T, reference R, transformed template T[y]
Applications
HNSP: Sectioning --> sliced --> flattened --> stained --> mounted ... --> digitized; large-scale digital images, up to 10,000 x 20,000 pixels
HNSP: Microscopy
HNSP: Deformed Images

sections 3,799 and 3,800 out of about 5,000

Panels: human sec. 3799, human sec. 3800; affine linear and elastic registrations

|T_orig - R| = 100%, |T_linear - R| = 72%, |T_elastic - R| = 50%
HNSP: Results
3D elastic registration of a part of the visual cortex; 2 hemispheres; 100 sections of 512 x 512 pixels
Registration in Medical Imaging
• Comparing/merging/integrating images from different times, devices, perspectives, objects, e.g.,
  • pre-/post-surgery CT images/MRI
  • panorama imaging
  • atlas/patient mapping
• Template matching, e.g., catheter in blood vessel
• Atlas mapping, e.g., find 2D view in 3D data
• Serial sectioning, e.g., HNSP
• Registration is not restricted to medical applications
Variational Modelling
Interpolation: continuous models for reference and template, built from discrete data
Transformation
Eulerian versus Lagrangian View

Euler: T[y](x) = T(y(x))
Lagrange: (p, T(p)) ↦ (x(p), T(p))

Euler: easy, but x ∈ y⁻¹(Σ)?
Lagrange: option for constraints
Distance measures: Sum of Squared Differences (SSD)
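On a discrete grid, the SSD distance D[T[y], R] = ½ ∫ (T[y](x) − R(x))² dx reduces to a half-weighted sum of squared intensity differences. A minimal numpy sketch (function and array names are illustrative, not FAIR's API):

```python
import numpy as np

def ssd(T, R, h=1.0):
    """Sum of Squared Differences: 0.5 * h * sum((T - R)^2).

    T, R : discretized (transformed) template and reference images.
    h    : cell volume of the discretization (midpoint quadrature weight).
    """
    diff = T - R
    return 0.5 * h * np.sum(diff * diff)

R = np.array([[0.0, 1.0], [2.0, 3.0]])
T = np.array([[0.0, 1.0], [2.0, 5.0]])
print(ssd(T, R))  # 0.5 * 2^2 = 2.0
```

A perfect registration drives this value toward zero (up to noise and modality differences).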
Regularization
• Registration is ill-posed; regularization is needed
• Implicit vs. explicit regularization
• Parametric regularization / regularized parametric registration
• Non-parametric regularization
Elastic Regularizer: elastic potential of u
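The slide only names the elastic potential; a standard form from the elastic registration literature (an assumption here, not transcribed from the slide) is

S[u] = ∫_Ω  μ/4 · Σ_{j,k=1}^d (∂_{x_j} u_k + ∂_{x_k} u_j)² + λ/2 · (div u)²  dx,

where u is the displacement, d the spatial dimension, and μ, λ are the Lamé constants modeling the tissue's elastic behavior.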
Numerical optimization
Euler-Lagrange equations (ELE) to PDE: balance of forces
• outer forces drive the registration
• inner forces model the tissue properties
Discretized Regularizer

Discretized Cost function J

Minimization of J: necessary condition for a minimizer; solve ∇J(y) = 0
Remarks on B
• Need to solve linear systems with B
• B is HUGE
• very sparse
• has a lot of structure

[spy plots of B: nz = 3296 and nz = 6319]
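As a toy stand-in for B (the 1D discrete Laplacian, not the elastic operator itself), the same flavor is easy to see: nonzeros grow linearly while the number of entries grows quadratically, which is exactly why sparse, structure-exploiting solvers pay off.

```python
import numpy as np

def laplacian_1d(n):
    """Dense 1D discrete Laplacian stencil (-1, 2, -1) on n points."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

for n in (10, 100, 1000):
    A = laplacian_1d(n)
    # nonzeros grow like 3n - 2, entries like n^2
    print(n, np.count_nonzero(A), A.size)
```

In practice one would of course store B in a sparse format rather than densely.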
Performance Optimization
Outline
• Fundamentals
• Architecture and Little’s Law
• Yesterday’s Constraints: ILP/DLP
• Today’s Constraints: MLP
• Summary
Little’s Law

Basic Throughput Quantities
• Latency: every operation requires time to execute (i.e., instruction, memory, or network latency).
• Bandwidth: # of (parallel) operations completed per cycle (i.e., #FPUs, DRAM, network, etc.).
• Concurrency: total # of operations in flight.
Little’s Law
• Little’s Law relates these three:
  • Concurrency = Latency * Bandwidth
  - or -
  • Effective Throughput = Expressed Concurrency / Latency
• This concurrency must be filled with parallel operations.
• Can’t exceed peak throughput with superfluous concurrency (each channel has a maximum throughput).
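A quick back-of-the-envelope sketch of these relations (the numbers are illustrative, not from the slides):

```python
def required_concurrency(latency, bandwidth):
    """Little's Law: concurrency = latency * bandwidth."""
    return latency * bandwidth

def effective_throughput(concurrency, latency, peak_bandwidth):
    """Expressed concurrency / latency, capped at the channel's peak."""
    return min(concurrency / latency, peak_bandwidth)

# e.g., a pipelined unit with 4-cycle latency issuing 2 ops/cycle
print(required_concurrency(4, 2))      # 8 independent ops must be in flight
# only 4 ops in flight -> half of peak throughput
print(effective_throughput(4, 4, 2))   # 1.0
# superfluous concurrency cannot exceed peak
print(effective_throughput(100, 4, 2)) # 2 (capped)
```

The same arithmetic applies at every level: instruction pipelines, DRAM, and networks only differ in their latency and bandwidth numbers.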
Basic Traffic Quantities
• Traffic often includes:
  • # floating-point operations (FLOPs)
  • # bytes from registers, cache, DRAM, network
Performance Optimization: Contending Forces
• Improve throughput (GFLOP/s, GB/s, etc.): restructure to satisfy Little’s Law
• Reduce volume of data (FLOPs, GBs, etc.): implementation & algorithmic optimization
• Contending forces: device efficiency vs. usage/traffic
Architects, Mathematicians, Programmers
• Architects: invent paradigms to improve (peak) throughput and facilitate(?) Little’s Law.
• Mathematicians: invent new algorithms to improve performance by reducing (bottleneck) traffic.
• Programmers: restructure algorithms and implementations to exploit these new features.
Performance Optimization
• Often boils down to several key challenges:
  • Management of data/task locality
  • Management of data dependencies
  • Management of communication
  • Management of variable and dynamic parallelism
Yesterday’s Constraint: Instruction Latency & Parallelism
Single-issue, non-pipelined
• Consider a single-issue, non-pipelined processor
• Little’s Law:
  – Bandwidth = issue width = 1
  – Latency = 1
  – Concurrency = 1
• Very easy to get good performance, even if all instructions are dependent
Pipelined
• By pipelining, we can increase the processor frequency.
• However, the pipeline must be kept full to achieve good performance.
• Little’s Law:
  – Bandwidth = issue width = 1
  – Latency = 3
  – Concurrency = 3
• Performance may drop to 1/3 of peak
Pipelined
• There may be inherent and untapped parallelism in the code
• Compilers/programmers must find parallelism, and unroll/reorder the code to keep the pipeline full
Out-of-order
• Alternatively, the hardware can try to find instruction-level parallelism (ILP)
• Instructions are:
  – queued up (reservation stations)
  – executed out-of-order
  – reordered (reorder buffer)
  – committed in-order
• Useful when parallelism or latency cannot be determined at compile time.
Superscalar
• Increase throughput by executing multiple instructions in parallel
• Usually separate pipelines for different instruction types:
  – FP, integer, memory
• Significantly complicates out-of-order execution
SIMD
• Many codes perform the same operations on different pieces of data (data-level parallelism, DLP)
• SIMD: Single Instruction, Multiple Data
• Register sizes are increased: instead of each register holding one 64-bit FP number, each register holds 2 or 4 FP numbers
• Much more efficient solution than superscalar on data-parallel codes
Multithreaded
• Superscalars fail when there is no ILP or DLP
• However, there are many codes with thread-level parallelism (TLP)
• Consider architectures that are virtualised to appear as N cores
  – In reality, there is one core maintaining multiple contexts and dynamically switching between them
• There are 3 main types of multithreaded architectures:
  – Coarse-grained multithreading (CGMT)
  – Fine-grained multithreading (FGMT), aka vertical multithreading
  – Simultaneous multithreading (SMT)
Coarse-grained Multithreading
• Maintain multiple contexts
• On a long-latency instruction:
  – dispatch the instruction
  – switch to a ready thread
  – hide latency with multiple ready threads
  – eventually switch back to the original thread
Fine-grained Multithreading
• Maintain multiple contexts
• On every cycle, choose a ready thread
• May now satisfy Little’s Law through multithreading:
  #threads ~ latency * bandwidth
Simultaneous Multithreading
• Maintain multiple contexts
• On every cycle, choose as many ready instructions from the thread pool as possible
• Can be applied to both in-order and out-of-order architectures
Today’s Constraint: The Memory Wall
Abstract Machine Model

Core executes: z = 0; i++; z += x[i]*y[i];
Register File
Cache
DRAM holds: float z; int i; float y[N]; float x[N];
Cubic B-spline basis:

b(x) = { (x+2)³,                   -2 ≤ x < -1,
       { -x³ - 2(x+1)³ + 6(x+1),   -1 ≤ x < 0,
       { x³ + 2(x-1)³ - 6(x-1),     0 ≤ x < 1,
       { (2-x)³,                    1 ≤ x < 2,
       { 0,                         else.

T(x) = T^spline(x) = Σ_{j=1}^m c_j b_j(x)
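The cubic B-spline basis used for T^spline transcribes directly into code; a minimal sketch with sanity checks (note the unnormalized scaling used here, b(0) = 4, so the integer shifts of b sum to 6 rather than 1):

```python
def b(x):
    """Unnormalized cubic B-spline basis (6x the classical B-spline)."""
    if -2 <= x < -1:
        return (x + 2) ** 3
    if -1 <= x < 0:
        return -x**3 - 2 * (x + 1) ** 3 + 6 * (x + 1)
    if 0 <= x < 1:
        return x**3 + 2 * (x - 1) ** 3 - 6 * (x - 1)
    if 1 <= x < 2:
        return (2 - x) ** 3
    return 0.0

# continuity at the knots, and partition of unity up to the factor 6
print(b(0))                                    # 4
print(b(1.0), b(-1.0))                         # 1.0 1.0
print(sum(b(0.3 - j) for j in range(-2, 3)))   # ~ 6.0 for any x
```

The local support (b vanishes outside [-2, 2)) is what makes the interpolant a four-coefficient sum per point.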
Outline
• FAIR
• FAIR on CUDA
• Improvements
• Summary
B-Spline Interpolation [Sigg, C. and Hadwiger, M.]

T^spline(x) = c_{p-1} b(ξ+1) + c_p b(ξ) + c_{p+1} b(ξ-1) + c_{p+2} b(ξ-2)   (6)

T^linear(x) := dataT(p) · (1-ξ) + dataT(p+1) · ξ   (7)

(a + b) · T^linear(x) := dataT(p) · a + dataT(p+1) · b   (8)

T^spline(x) = g0(ξ) · c^linear_{p+h0} + g1(ξ) · c^linear_{p+h1}   (9)

where
g0(ξ) = b(ξ+1) + b(ξ),   g1(ξ) = b(ξ-1) + b(ξ-2)   (10)
h0 = b(ξ)/g0(ξ) - 1,   h1 = b(ξ-2)/g1(ξ) + 1   (11)
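The point of equations (6)-(11) is that one four-tap spline evaluation can be assembled from two hardware-style linear lookups. A small sketch verifying the algebra (names are illustrative; on the GPU the `linear` lookup is performed by the texture unit's linear filter):

```python
import math

def b(x):
    """Unnormalized cubic B-spline basis."""
    if -2 <= x < -1: return (x + 2) ** 3
    if -1 <= x < 0:  return -x**3 - 2 * (x + 1) ** 3 + 6 * (x + 1)
    if 0 <= x < 1:   return x**3 + 2 * (x - 1) ** 3 - 6 * (x - 1)
    if 1 <= x < 2:   return (2 - x) ** 3
    return 0.0

def linear(c, t):
    """Hardware-style linear lookup between c[floor(t)] and c[floor(t)+1]."""
    p = math.floor(t)
    xi = t - p
    return c[p] * (1 - xi) + c[p + 1] * xi

def spline_4tap(c, p, xi):
    """Direct four-coefficient evaluation, equation (6)."""
    return (c[p-1] * b(xi + 1) + c[p] * b(xi)
            + c[p+1] * b(xi - 1) + c[p+2] * b(xi - 2))

def spline_2tap(c, p, xi):
    """Two weighted linear lookups, equations (9)-(11)."""
    g0 = b(xi + 1) + b(xi)
    g1 = b(xi - 1) + b(xi - 2)
    h0 = b(xi) / g0 - 1       # fractional offsets into the linear lookup
    h1 = b(xi - 2) / g1 + 1
    return g0 * linear(c, p + h0) + g1 * linear(c, p + h1)

c = [0.0, 1.0, 4.0, 2.0, 3.0, 1.0]   # illustrative spline coefficients
print(abs(spline_4tap(c, 2, 0.4) - spline_2tap(c, 2, 0.4)) < 1e-12)  # True
```

Halving the number of memory fetches this way is what makes the texture-filtering trick attractive on GPUs.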
Bandwidth Results: Interpolation

(a) splineInter2D (NN):
Grid Size | Measured bandwidth | Worst case | Best case
64x32     | 1.44  | 2.39   | 0.5
128x64    | 2.45  | 7.07   | 1.49
256x128   | 4     | 18.58  | 3.91
512x256   | 9.14  | 33.43  | 7.04

(b) splineInter2D (bilinear):
Grid Size | Measured bandwidth | Worst case | Best case
64x32     | 1.44  | 3.24   | 0.68
128x64    | 4.15  | 12.71  | 2.67
256x128   | 10.66 | 37.17  | 7.83
512x256   | 26.76 | 113.2  | 23.83
Runtime Results: Interpolation

Runtime comparison (FAIR vs. texture-based CUDA):
Grid Size | linearInter2D (FAIR) (ms) | splineInter2D (FAIR) (ms) | splineInter2D (NN texture) (ms) | splineInter2D (bilinear texture) (ms)
64x32     | 23.717  | 28.856  | 0.065 | 0.048
128x64    | 67.898  | 78.599  | 0.088 | 0.049
256x128   | 216.525 | 229.961 | 0.134 | 0.067
512x256   | 556.287 | 575.266 | 0.298 | 0.088
Results: Interpolation
(a) Derivative test, Inter2D (MATLAB)   (b) Derivative test, Inter2D (CUDA MEX)
Rigid transformation

An affine linear transformation allows for translation, rotation, shearing, and individual scaling. The components of an affine linear transformation are

y1 = w1 x1 + w2 x2 + w3,   (12)
y2 = w4 x1 + w5 x2 + w6.   (13)

In matrix form,

Q(x) = [ x1  x2  1   0   0   0
         0   0   0   x1  x2  1 ],   (14)

y = Q(x) w.   (15)

Rigid transformation: a special affine linear transformation that allows only rotation and translation:

y1 = cos(w1) x1 - sin(w1) x2 + w2,   (16)
y2 = sin(w1) x1 + cos(w1) x2 + w3.   (17)

Although this function is non-linear in w, it can still be written as y(x) = Q(x) f(w) with the non-linear parameterization

f(w) = [cos w1; -sin w1; w2; sin w1; cos w1; w3].
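A small numpy check that the factorization y(x) = Q(x) f(w) reproduces the componentwise rigid transform (variable names are illustrative):

```python
import numpy as np

def Q(x1, x2):
    """2x6 matrix collecting the coordinates, so that y = Q(x) w."""
    return np.array([[x1, x2, 1, 0, 0, 0],
                     [0, 0, 0, x1, x2, 1]], dtype=float)

def f_rigid(w1, w2, w3):
    """Non-linear parameterization of a rigid (rotation + translation) map."""
    c, s = np.cos(w1), np.sin(w1)
    return np.array([c, -s, w2, s, c, w3])

# compare against the componentwise formulas
x1, x2 = 1.5, -0.5
w1, w2, w3 = 0.3, 2.0, -1.0
y = Q(x1, x2) @ f_rigid(w1, w2, w3)
y_direct = np.array([np.cos(w1) * x1 - np.sin(w1) * x2 + w2,
                     np.sin(w1) * x1 + np.cos(w1) * x2 + w3])
print(np.allclose(y, y_direct))  # True
```

Keeping Q(x) fixed and pushing all non-linearity into f(w) is what lets the same precomputed Q serve both the affine and the rigid model.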
Results

Grid Size | rigid2D (non-persistent) | rigid2D (persistent) | % time saved using persistent memory
64x32     | 0.2181 | 0.2139 | 2
128x64    | 0.2369 | 0.2243 | 5
256x128   | 0.2289 | 0.2233 | 2
512x256   | 0.2247 | 0.2142 | 5
512x512   | 0.2320 | 0.2200 | 5
1024x512  | 0.2427 | 0.2135 | 12
1024x1024 | 0.2683 | 0.2329 | 13
2048x1024 | 0.2874 | 0.2379 | 17
CUDA MEX Registration cycle

Grid Size | PIR SSD RIGID (MATLAB) | PIR SSD RIGID (CUDA MEX)
128x64    | 14.96 s  | 14.13 s
256x128   | 45 s     | 33 s
512x256   | 201.85 s | 92 s
FAIR Improvements
• Use of Kronecker products
• The explicit storage of the large coordinate grids could be avoided
• Combination of functional modules
• The stringent requirement for lexicographical ordering
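The Kronecker idea in one line: a regular 2D grid is the Kronecker combination of two 1D grids, so storing two short vectors can replace the full lexicographically ordered coordinate arrays. A numpy sketch (grid values are illustrative):

```python
import numpy as np

# cell-centered 1D grids; these two short vectors describe the full 2D grid
x1 = np.linspace(0.5, 3.5, 4)   # 4 cells in the x1-direction
x2 = np.linspace(0.5, 1.5, 2)   # 2 cells in the x2-direction

# explicit (memory-hungry) lexicographically ordered coordinate arrays
X1 = np.kron(np.ones(len(x2)), x1)   # x1 repeated for every x2 value
X2 = np.kron(x2, np.ones(len(x1)))   # x2 stretched over each x1 block
print(X1)  # [0.5 1.5 2.5 3.5 0.5 1.5 2.5 3.5]
print(X2)  # [0.5 0.5 0.5 0.5 1.5 1.5 1.5 1.5]
```

For an m x n grid the explicit arrays cost O(mn) memory, while the Kronecker factors cost only O(m + n); the same structure lets grid operators be applied factor-by-factor.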
CUDA MEX Improvements
(a) CUDA Driver Objects   (b) Improved framework
Summary
1. Successful integration of MATLAB and CUDA.
2. Porting of the FAIR toolbox onto the GPU.
3. Fast implementation of spline interpolation within the CUDA MEX framework.
4. Analysis of accuracy results for texture usage for interpolant derivatives.
5. GPU acceleration of the fixed-level image registration scheme for large discretizations.
6. Implementation of persistent memory on GPUs.
Rank efficient operators

HSS: Hierarchically Semi-Separable Representation
• Generic HSS structure
• Symmetric HSS matrix: for siblings i & j
• Introducing zeros
• Partial factorisation of diagonal blocks
• Compression
• Merge
• Update
• Root node: compute the full Cholesky factorization
• Cholesky-based solver
• HSS vs. classical
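The defining feature exploited above is that the off-diagonal blocks of an HSS matrix have low numerical rank. A toy numpy sketch compressing one such block with a truncated SVD (this illustrates only the compression step, not a full HSS Cholesky):

```python
import numpy as np

# toy off-diagonal block that is exactly rank 1
u = np.arange(1.0, 5.0)
A12 = np.outer(u, u)               # 4x4 block

U, s, Vt = np.linalg.svd(A12)
rank = int(np.sum(s > 1e-10 * s[0]))
print(rank)                        # 1

# keep only the dominant term: storage drops from 16 to 8 numbers
A12_compressed = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.allclose(A12, A12_compressed))  # True
```

An HSS solver applies this idea recursively over a tree of blocks, which is what turns dense O(n³) factorizations into near-linear-cost ones for suitable operators.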
Summary
• Continual struggle between computer architects, mathematicians, and computer scientists.
• Quick solution: satisfy Little’s Law.
• Optimize: data/task locality, data dependencies, communication, variable and dynamic parallelism.
• Parallel hardware is here to stay; parallelism & scalability are crucial for success.
• Presents many important research challenges.