Latency

EVIP

Technical Application Field

• Scientific Computing
• Applied Numerics
• Variational Modeling

EVIP Parallel Processing

• Rank efficient operators
• Elasticity modeled Image Registration

NOMIR – Part I

Motivation

Eldad Haber & Jan Modersitzki

Image Registration: given a reference image R and a template image T, find a reasonable transformation y such that the transformed image T[y] is similar to R.

[Figure: template T; reference R; transformed template T[y]]

Applications

HNSP: Sectioning. The pipeline: sliced → flattened → stained → mounted → digitized, yielding large-scale digital images of up to 10,000 × 20,000 pixels. Courtesy: Oliver Schmitt; Eldad Haber & Jan Modersitzki

HNSP: Microscopy

Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki

HNSP: Deformed Images

Sections 3,799 and 3,800 out of about 5,000.

[Figure: human sections sec:3799 and sec:3800, registered with affine linear and elastic transformations]

Courtesy: Oliver Schmitt; Eldad Haber & Jan Modersitzki. Residual distances: |T_orig - R| = 100%, |T_linear - R| = 72%, |T_elastic - R| = 50%.

HNSP: Results

3D elastic registration of a part of the visual cortex: 2 hemispheres; 100 sections of 512 × 512 pixels.

Courtesy: Oliver Schmitt, Eldad Haber & Jan Modersitzki

Registration in Medical Imaging

• Comparing/merging/integrating images from different times, devices, perspectives, or objects, e.g.:
  • pre-/post-surgery CT images/MRI
  • panorama imaging
  • atlas/patient mapping
• Template matching, e.g., catheter in blood vessel
• Atlas mapping, e.g., find 2D view in 3D data
• Serial sectioning, e.g., HNSP
• Registration is not restricted to medical applications.

Variational Modelling

Interpolation: build continuous models for reference and template from the discrete data.

Transformation


Eulerian versus Lagrangian View

[Figure: domain Σ; a particle p0 tracked to x(p0) (Lagrangian) versus a fixed location x0 probed through y(x0) (Eulerian)]

Euler: T[y](x) = T(y(x)) — easy, but is x ∈ y⁻¹(Σ)?
Lagrange: (p, T(p)) ↦ (x(p), T(p)) — option for constraints.

Distance Measures: Sum of Squared Differences (SSD)
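The SSD formula itself did not survive extraction; as a sketch in the usual variational notation (standard SSD, matching FAIR's conventions), it reads

$$ \mathcal{D}^{\mathrm{SSD}}[y] = \frac{1}{2} \int_{\Omega} \big( T(y(x)) - R(x) \big)^2 \, dx. $$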

Regularization

Needed because the registration problem is ill-posed.

Regularization: implicit vs. explicit

• Parametric regularization (regularized parametric registration)
• Non-parametric regularization

Elastic Regularizer: the elastic potential of the displacement u.
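As a sketch, the elastic potential in its standard form (Lamé constants μ and λ; this matches the elastic registration literature the slides follow):

$$ \mathcal{S}^{\mathrm{elas}}[u] = \int_{\Omega} \frac{\mu}{4} \sum_{j,k=1}^{d} \big( \partial_{x_j} u_k + \partial_{x_k} u_j \big)^2 + \frac{\lambda}{2} \, (\nabla \cdot u)^2 \, dx. $$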

Numerical optimization

Euler–Lagrange equations (ELE) to PDE: a balance of forces.

Outer forces drive the registration; inner forces reflect the tissue properties.
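The resulting Euler–Lagrange equations are the Navier–Lamé equations (standard form; the SSD force term is our completion):

$$ \mu \, \Delta u + (\lambda + \mu) \, \nabla (\nabla \cdot u) = -f(x, u(x)), \qquad f(x, u) = \big( T(x + u) - R(x) \big) \, \nabla T(x + u), $$

with the outer force f on the right balancing the inner elastic forces on the left.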

Discretized Regularizer

Discretize the derivative operators of the regularizer and collect them into a (sparse) matrix B.

Discretized Cost Function

Minimization of J: a necessary condition for a minimizer is ∇J(y) = 0; solve the resulting system.

Remarks on B: we need to solve linear systems involving B.
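A sketch of the discretized problem in FAIR-style notation (grid spacings and the reference-grid shift are omitted for brevity):

$$ J(y) = D\big(T(y), R\big) + \frac{\alpha}{2} \, \| B y \|^2, \qquad \nabla J(y) = dT(y)^{\top} \big( T(y) - R \big) + \alpha \, B^{\top} B \, y = 0, $$

so every optimization step must solve a large sparse linear system built from B.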

[Figure: spy plots of the system matrix, nz = 3296 and nz = 6319]

The matrix is:
• HUGE
• very sparse
• has a lot of structure
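As a one-dimensional illustration (our example; the slides' matrix comes from the discretized elastic regularizer), a first-difference block has the banded form

$$ D_h = \frac{1}{h} \begin{pmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}, $$

and B stacks such blocks for every partial derivative, which is the block structure visible in the spy plots.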

Performance Optimization

Outline

• Fundamentals
• Architecture and Little's Law
• Yesterday's Constraints - ILP/DLP
• Today's Constraints - MLP
• Summary

Little’s Law

Basic Throughput Quantities

• Latency: every operation requires time to execute (i.e., instruction, memory, or network latency).
• Bandwidth: the number of (parallel) operations completed per cycle (i.e., #FPUs, DRAM, network, etc.).
• Concurrency: the total number of operations in flight.

Little’s Law

• Little’s Law relates these three quantities:
  • Concurrency = Latency * Bandwidth
  - or -
  • Effective Throughput = Expressed Concurrency / Latency
• This concurrency must be filled with parallel operations.
• You can’t exceed peak throughput with superfluous concurrency (each channel has a maximum throughput).
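As a worked example (illustrative numbers, not from the slides): a memory system with 100 ns latency and 50 GB/s of bandwidth needs 100 ns × 50 GB/s = 5000 bytes in flight, roughly 78 independent 64-byte cache lines, before it can run at peak.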

Basic Traffic Quantities

• Traffic often includes:
  • #Floating-point operations (FLOPs)
  • #Bytes moved (registers, cache, DRAM, network)

Performance Optimization: Contending Forces

• Improve throughput (GFlop/s, GB/s, etc.)
• Reduce the volume of data (Flops, GB, etc.)

The contending forces: device efficiency versus usage/traffic.

• Restructure to satisfy Little's Law
• Implementation & algorithmic optimization

Architects, Mathematicians, Programmers

• Architects: invent paradigms to improve (peak) throughput and facilitate(?) Little's Law.
• Mathematicians: invent new algorithms to improve performance by reducing (bottleneck) traffic.
• Programmers: restructure algorithms and implementations to exploit these new features.

Performance Optimization

• Often boils down to several key challenges:
  • Management of data/task locality
  • Management of data dependencies
  • Management of communication
  • Management of variable and dynamic parallelism

Yesterday’s Constraint: Instruction Latency & Parallelism

Single-issue, non-pipelined

• Consider a single-issue, non-pipelined processor.
• Little's Law:
  - Bandwidth = issue width = 1
  - Latency = 1
  - Concurrency = 1
• Very easy to get good performance, even if all instructions are dependent.

[Figure: issue width; future instructions → in flight → completed]

Pipelined

• By pipelining, we can increase the processor frequency.
• However, the pipeline must be kept full to achieve good performance.
• Little's Law:
  - Bandwidth = issue width = 1
  - Latency = 3
  - Concurrency = 3
• Performance may drop to 1/3 of peak.

[Figure: issue width; future instructions → in flight → completed]

Pipelined (continued)

• There may be inherent and untapped parallelism in the code.
• Compilers/programmers must find parallelism, and unroll/reorder the code to keep the pipeline full (see the sketch below).

[Figure: issue width; future instructions → in flight → completed]
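A minimal sketch in C of that unroll/reorder transformation, assuming a latency-3 floating-point add (our example, not from the slides):

  /* Dependent chain: each add must wait for the previous one,
   * so a latency-3 pipeline runs at 1/3 of peak. */
  float sum1(const float *x, int n) {
      float s = 0.0f;
      for (int i = 0; i < n; i++) s += x[i];
      return s;
  }

  /* Three independent accumulators expose enough parallelism
   * to keep the latency-3 pipeline full. */
  float sum3(const float *x, int n) {
      float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
      int i = 0;
      for (; i + 3 <= n; i += 3) { s0 += x[i]; s1 += x[i+1]; s2 += x[i+2]; }
      for (; i < n; i++) s0 += x[i];   /* remainder */
      return s0 + s1 + s2;
  }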

Out-of-order

• Alternately, the hardware can try to find instruction-level parallelism (ILP).
• Instructions are:
  - queued up (reservation stations)
  - executed out-of-order
  - reordered (reorder buffer)
  - committed in-order
• Useful when parallelism or latency cannot be determined at compile time.

[Figure: issue width; future instructions → reservation stations → out-of-order execution → reorder buffer → completed]

Superscalar

• Increase throughput by executing multiple instructions in parallel.
• Usually separate pipelines for different instruction types: FP, integer, memory.
• Significantly complicates out-of-order execution.

[Figure: issue width; future instructions → reservation stations → out-of-order execution → reorder buffer → completed]

SIMD

• Many codes perform the same operations on different pieces of data (data-level parallelism, DLP).
• SIMD: Single Instruction, Multiple Data.
• Register sizes are increased: instead of each register holding one 64-bit FP number, each register holds 2 or 4 FP numbers.
• A much more efficient solution than superscalar on data-parallel codes, as in the sketch below.
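A minimal sketch of the idea, assuming an x86 host with AVX (the intrinsics are the standard ones from immintrin.h; the function names are ours):

  #include <immintrin.h>

  /* Scalar: one addition per instruction. */
  void add_scalar(const double *a, const double *b, double *c, int n) {
      for (int i = 0; i < n; i++)
          c[i] = a[i] + b[i];
  }

  /* SIMD: each 256-bit AVX register holds 4 doubles,
   * so one vector add completes 4 additions. */
  void add_simd(const double *a, const double *b, double *c, int n) {
      int i = 0;
      for (; i + 4 <= n; i += 4) {
          __m256d va = _mm256_loadu_pd(a + i);
          __m256d vb = _mm256_loadu_pd(b + i);
          _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
      }
      for (; i < n; i++)   /* scalar remainder */
          c[i] = a[i] + b[i];
  }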

Multithreaded

• Superscalars fail when there is no ILP or DLP.
• However, there are many codes with thread-level parallelism (TLP).
• Consider architectures that are virtualised to appear as N cores; in reality, there is one core maintaining multiple contexts and dynamically switching between them.
• There are 3 main types of multithreaded architectures:
  - Coarse-grained multithreading (CGMT)
  - Fine-grained multithreading (FGMT), aka vertical multithreading
  - Simultaneous multithreading (SMT)

Coarse-grained Multithreading

• Maintain multiple contexts.
• On a long-latency instruction:
  - dispatch the instruction
  - switch to a ready thread
  - hide latency with multiple ready threads
  - eventually switch back to the original thread

[Figure: ready instructions → in flight → completed]

Fine-grained Multithreading

• Maintain multiple contexts.
• On every cycle, choose a ready thread.
• May now satisfy Little's Law through multithreading:
  threads ~ latency * bandwidth

[Figure: ready instructions → in flight → completed]

Simultaneous Multithreading

• Maintain multiple contexts.
• On every cycle, choose as many ready instructions from the thread pool as possible.
• Can be applied to both in-order and out-of-order architectures.

[Figure: ready instructions → in flight → completed]

Today’s Constraint: The Memory Wall

Abstract Machine Model

[Figure: Core ← Register File ← Cache ← DRAM]

The core executes
  z = 0;
  i++;
  z += x[i]*y[i];
while DRAM holds
  float z; int i;
  float y[N];
  float x[N];

The interpolant is built from the cubic B-spline basis

$$ b(x) = \begin{cases} (x+2)^3, & -2 \le x < -1, \\ -x^3 - 2(x+1)^3 + 6(x+1), & -1 \le x < 0, \\ x^3 + 2(x-1)^3 - 6(x-1), & 0 \le x < 1, \\ (2-x)^3, & 1 \le x < 2, \\ 0, & \text{else}, \end{cases} \qquad (4) $$

$$ T(x) = T^{\mathrm{spline}}(x) = \sum_{j=1}^{m} c_j \, b_j(x). \qquad (5) $$
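A direct transcription of b(x) into code (a sketch; the function name b3 is ours):

  /* Cubic B-spline basis b(x) in FAIR's (unnormalized) form, eq. (4). */
  __host__ __device__ float b3(float x) {
      if (x < -2.0f || x >= 2.0f) return 0.0f;
      if (x < -1.0f) { float t = x + 2.0f; return t*t*t; }
      if (x <  0.0f) { float t = x + 1.0f; return -x*x*x - 2.0f*t*t*t + 6.0f*t; }
      if (x <  1.0f) { float t = x - 1.0f; return  x*x*x + 2.0f*t*t*t - 6.0f*t; }
      float t = 2.0f - x;
      return t*t*t;
  }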

Outline

• FAIR
• FAIR on CUDA
• Improvements
• Summary

B-Spline Interpolation [Sigg, C. and Hadwiger, M.]

Write x = p + ξ with integer part p and fraction ξ. Then

T^spline(x) = c_{p-1} b(ξ+1) + c_p b(ξ) + c_{p+1} b(ξ-1) + c_{p+2} b(ξ-2),   (6)

while the hardware's linear interpolation provides

T^linear(x) := dataT(p) · (1 - ξ) + dataT(p+1) · ξ,   (7)
(a + b) · T^linear(x) := dataT(p) · a + dataT(p+1) · b.   (8)

The four-tap spline sum can therefore be folded into two linearly filtered fetches of the coefficients:

T^spline(x) = g_0(ξ) · c^linear_{p+h_0} + g_1(ξ) · c^linear_{p+h_1},   (9)

where

g_0(ξ) = b(ξ+1) + b(ξ),   g_1(ξ) = b(ξ-1) + b(ξ-2),   (10)
h_0 = b(ξ)/g_0(ξ) - 1,   h_1 = b(ξ-2)/g_1(ξ) + 1.   (11)
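A minimal CUDA sketch of equations (9)-(11), assuming the coefficients c_j live in a 1D cudaTextureObject_t with linear filtering and unnormalized coordinates (the function name is ours; b3 is the basis function above, assumed compiled as a __device__ function):

  __device__ float splineInter1D(cudaTextureObject_t coeffs, float x) {
      float p  = floorf(x);
      float xi = x - p;
      /* four basis weights, eq. (6) */
      float w0 = b3(xi + 1.0f), w1 = b3(xi);
      float w2 = b3(xi - 1.0f), w3 = b3(xi - 2.0f);
      /* grouped weights and fetch offsets, eqs. (10)-(11) */
      float g0 = w0 + w1, g1 = w2 + w3;
      float h0 = w1 / g0 - 1.0f;
      float h1 = w3 / g1 + 1.0f;
      /* two hardware-filtered fetches replace four reads, eq. (9);
       * the +0.5f accounts for texel centers sitting at integer + 0.5 */
      return g0 * tex1D<float>(coeffs, p + h0 + 0.5f)
           + g1 * tex1D<float>(coeffs, p + h1 + 0.5f);
  }

This is the motivation for the bilinear-texture variant in the tables below: half the fetches, with the blending done by the texture unit.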


Bandwidth Results: Interpolation

Measured bandwidth of splineInter2D against worst-case and best-case bounds, for the nearest-neighbour (NN) and bilinear texture variants:

                splineInter2D (NN)          splineInter2D (bilinear)
  Grid Size   Measured  Worst   Best      Measured  Worst    Best
  64 x 32     1.44      2.39    0.5       1.44      3.24     0.68
  128 x 64    2.45      7.07    1.49      4.15      12.71    2.67
  256 x 128   4         18.58   3.91      10.66     37.17    7.83
  512 x 256   9.14      33.43   7.04      26.76     113.2    23.83


Runtime Results: Interpolation

Runtime comparison (ms):

  Grid Size   linearInter2D  splineInter2D  splineInter2D  splineInter2D
              (FAIR)         (FAIR)         (NN texture)   (bilinear texture)
  64 x 32     23.717         28.856         0.065          0.048
  128 x 64    67.898         78.599         0.088          0.049
  256 x 128   216.525        229.961        0.134          0.067
  512 x 256   556.287        575.266        0.298          0.088


Results: Interpolation

[Figures: (a) derivative test, Inter2D (MATLAB); (b) derivative test, Inter2D (CUDA MEX)]


Rigid Transformation

An affine linear transformation allows for translation, rotation, shearing, and individual scaling. Its components are

y_1 = w_1 x_1 + w_2 x_2 + w_3,   (12)
y_2 = w_4 x_1 + w_5 x_2 + w_6.   (13)

In matrix form, with

$$ Q(x) = \begin{pmatrix} x_1 & x_2 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & 1 \end{pmatrix}, \qquad (14) $$

y = Q(x) w.   (15)

Rigid transformation: a special affine linear transformation that allows only rotation and translation,

y_1 = cos(w_1) x_1 - sin(w_1) x_2 + w_2,   (16)
y_2 = sin(w_1) x_1 + cos(w_1) x_2 + w_3.   (17)

Although this function is non-linear in w, it fits the same framework: y(x) = Q(x) f(w) with f(w) = [cos w_1; -sin w_1; w_2; sin w_1; cos w_1; w_3].
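A minimal CUDA sketch of evaluating (16)-(17) on a grid (the kernel name and signature are ours, not FAIR's):

  __global__ void rigid2D(const float *x1, const float *x2,
                          float *y1, float *y2, int n,
                          float w1, float w2, float w3) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= n) return;
      float c = cosf(w1), s = sinf(w1);
      y1[i] = c * x1[i] - s * x2[i] + w2;   /* eq. (16) */
      y2[i] = s * x1[i] + c * x2[i] + w3;   /* eq. (17) */
  }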


Results

Runtime of rigid2D with and without persistent memory:

  Grid Size     rigid2D           rigid2D        % time saved using
                (non-persistent)  (persistent)   persistent memory
  64 x 32       0.2181            0.2139         2
  128 x 64      0.2369            0.2243         5
  256 x 128     0.2289            0.2233         2
  512 x 256     0.2247            0.2142         5
  512 x 512     0.2320            0.2200         5
  1024 x 512    0.2427            0.2135         12
  1024 x 1024   0.2683            0.2329         13
  2048 x 1024   0.2874            0.2379         17


CUDA MEX Registration cycle

  Grid Size   PIR SSD RIGID (MATLAB)   PIR SSD RIGID (CUDA MEX)
  128 x 64    14.96 s                  14.13 s
  256 x 128   45 s                     33 s
  512 x 256   201.85 s                 92 s


FAIR Improvements

• Use of Kronecker products: the explicit storage of the large coordinate grids could be avoided.
• Combination of functional modules: addresses the stringent requirement for lexicographical ordering.


CUDA MEX Improvements

[Figures: (a), (b) CUDA driver objects; (c) improved framework]


Summary

1. Successful integration of MATLAB and CUDA.
2. Porting of the FAIR toolbox onto the GPU.
3. Fast implementation of spline interpolation within the CUDA MEX framework.
4. Analysis of accuracy results for texture usage for interpolant derivatives.
5. GPU acceleration of the fixed-level image registration scheme for large discretizations.
6. Implementation of persistent memory on GPUs.

Rank efficient operators

HSS: Hierarchically Semi-Separable Representation

• Generic HSS structure
• Symmetric HSS matrix: for siblings i & j
• Introducing zeros
• Partial factorisation of diagonal blocks
• Compression
• Merge
• Update
• Root node: compute the full Cholesky factorization
• Cholesky-based solver
• HSS vs. classical

Summary

• A continual struggle: computer architects, mathematicians, and computer scientists.
• The quick solution: satisfy Little's Law.
• Optimize: data/task locality, data dependencies, communication, variable and dynamic parallelism.
• Parallel hardware is here to stay; parallelism & scalability are crucial for success.
• This presents many important research challenges.