Massively Parallel Signal Processing for Wireless Communication Systems
Michael Wu, Guohui Wang, Joseph R. Cavallaro

Department of ECE, Rice University

Wireless Communication Systems

[System block diagram: Source → Channel Encoder → MIMO Modulator → RF TX → Channel → RF RX → MIMO Detector → Channel Decoder → Sink. Information bits enter at the source, the transmitted signal crosses the channel, and the received signal is processed to recover the information bits at the sink.]

Heavy computations on the RX side:
• MIMO Detector: decouples the transmit streams to provide estimates of the transmitted bits.
• Channel Decoder: corrects errors using redundant data.
• Used in many standards.

Wireless Communication Systems (cont.)

Related research work:
• MIMO detector: [SAMOS 2010, IEEE TVT 2012]
• Turbo decoder: [CASES 2010, VLSID 2012]
• LDPC decoder: [Asilomar 2008, TPDS 2010, JSPS 2010, SASP 2011]
• SDR systems: [IEEE Comm. Mag. 2010, ISRSP 2011]

Our previous work:
• MIMO detector: [SiPS 2009, Asilomar 2009, JSPS 2011, SiPS 2012]
• Turbo decoder: [SiPS 2010, JSPS 2011]
• LDPC decoder: [SASP 2011, Asilomar 2011]
• NB-LDPC decoder: [Asilomar 2012]

Massively Parallel Implementations

Massively parallel implementations of:
• MIMO Detector
• LDPC Decoder

Approach: tailored algorithms to improve efficiency.

Results:
• High throughput (faster than existing work).
• Very flexible; a good platform for SDR systems.

Outline

MIMO soft detection algorithm on GPU:
• Introduction to MIMO detection
• Kernel mapping
• Optimization techniques
• Experiment results

Multi-standard LDPC decoder on GPU:
• Introduction to the LDPC algorithm
• Kernel mapping
• Optimization techniques
• Experiment results

Modulation

• Encode data in the amplitude and phase of a sinusoid.
• Higher modulation order → more data per symbol.

[Constellation figure: a bit stream …0111011 is mapped to constellation points, e.g. one bit per symbol (0, 1) versus two bits per symbol (00, 01, 10, 11).]

MIMO System Model

[Figure: three bit streams (…0111011, …0101011, …0101111) are mapped to symbols $x_0, x_1, x_2$, transmitted over channel $\mathbf{H}$, and observed as $y_0, y_1, y_2$.]

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} n_0 \\ n_1 \\ n_2 \end{bmatrix}, \qquad \mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$$

• Spatial multiplexing: increase throughput by transmitting multiple streams.
• Receiver: the transmit streams interfere with each other.

MIMO-OFDM

• Break a wideband signal into many independent subcarriers.
• Perform MIMO detection independently many times, one per subcarrier.
• Many subcarriers in many wireless standards, e.g. an LTE 20 MHz subframe: 14 × 1200 subcarriers.

MIMO Detection yyˆ00 yyˆ   11   yyˆ22   ˆ yy

 

hr00 00 h0  10  20  h0

h r01 01 h r11 11 h021

rh0202xx00 nn00 xx nn  rh 12 12 11  11 rh2222    xx22     nn22   Rx Hx  n

H 𝑑𝟐𝟐 = 𝑦22 − 𝑟22 (−1) 𝑥2 𝟐

x2

+

-

+

x1

+

x0

-

+

𝟐 𝑑𝟏 = 𝑦1 − 𝑟12 (−1) 𝑥2 − 𝑟− 𝑟 𝑥 (1) 11 11 1

+ -

+

+ -

𝟐

+

𝟐

𝟐(−1) 𝑑𝟎 = 𝑦0 − 𝑟02 𝑥(−1) − 𝑟 (1) − 𝑟 − 𝑟 𝑥 − 𝑟 𝑥 2 01 01 1 00 000

𝟐

-

Search Space for 3x3 BPSK 

Probability of a path, 𝑥, is inversely prop. to d y, 𝑥 = y − 𝑅𝑥 =



Probabilities of all paths are used to generate bit probability values. 

10

𝒊 𝒅𝒊

4x4 64QAM644 = 16,777,216 paths

3/20/2013
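To make the metric concrete, here is a minimal CUDA-style sketch of the cumulative path distance for an upper-triangular R. The fixed size NTX, the real-valued arrays (as after real-value decomposition), and the name path_distance are illustrative assumptions, not the authors' code.

    #define NTX 4

    __device__ float path_distance(const float R[NTX][NTX],
                                   const float yhat[NTX],
                                   const float x[NTX])
    {
        float d = 0.0f;
        // Traverse from the last antenna (tree root) down to the first.
        for (int i = NTX - 1; i >= 0; --i) {
            float e = yhat[i];
            for (int j = i; j < NTX; ++j)
                e -= R[i][j] * x[j];   // interference from already-fixed symbols
            d += e * e;                // per-level increment d_i
        }
        return d;
    }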

SSFE Detector

[Figure: SSFE search tree for 2×2 QPSK after real-value decomposition; the four levels correspond to the real and imaginary parts of the 1st and 2nd antennas' symbols.]

Selective Spanning with Fast Enumeration (SSFE):
• Real-value decomposition.
• Data-parallel, deterministic search:
  – 1st level: enumerate all modulation points.
  – Subsequent levels: depth-first search; pick the best outgoing node.
• Generates M likely paths, which are used to generate bit probability values.

SSFE Detector: Node Expansion

[Figure: expanding one node at the last tree level.]

Inputs:
• Partial path P: [x2 = 1, x1 = 1]
• Channel gains r: [r00, r01, r02]
• Received signal: ŷ0

Find the best node x0, i.e. the one that minimizes the cumulative distance:
• Zero-forcing solution: $\tilde{x}_0 = \frac{1}{r_{00}}\left(\hat{y}_0 - r_{02}x_2 - r_{01}x_1\right)$
• Pick the constellation point closest to the zero-forcing solution (Schnorr-Euchner enumeration); a sketch follows below.
• 64-QAM example: $\tilde{x}_0 = 3.9$ → pick 3 from the real-axis points {−7, −5, −3, −1, 1, 3, 5, 7}.
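A minimal sketch of the node expansion above, assuming a real-valued 64-QAM axis {−7, …, 7}; the helpers se_nearest_point and expand_x0 and the rounding trick are illustrative, not the authors' kernel code.

    // Pick the real-axis 64-QAM point {-7,-5,...,7} closest to the ZF value.
    __device__ float se_nearest_point(float zf)
    {
        float p = 2.0f * rintf((zf - 1.0f) * 0.5f) + 1.0f;  // nearest odd integer
        return fmaxf(-7.0f, fminf(7.0f, p));                // clip to the edges
    }

    // Node expansion for the last level, given the partial path [x2, x1]
    // and the top row r = [r00, r01, r02] of R.
    __device__ float expand_x0(float y0, const float r[3], float x2, float x1)
    {
        float zf = (y0 - r[2] * x2 - r[1] * x1) / r[0];  // zero-forcing estimate
        return se_nearest_point(zf);                     // e.g. zf = 3.9 -> 3
    }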

GPU Implementation

The search algorithm maps well onto the GPU:
• Data parallel with no sharing: each search path is independent.
• Efficient node expansion: complexity doesn't depend on the modulation order.
• Modest storage requirement.
• M threads per detection.

GPU Implementation of SSFE

• One thread block handles one subcarrier.
• Spawns one thread per modulation point (M threads).
• Inner and outer loops completely unrolled.
• Path stored in registers.
• Demodulator + soft estimate computation not shown.

Per-thread pseudocode:

    // enumerate a modulation point for the 1st antenna
    path[0] = mod_table[tid % 8];
    path[1] = mod_table[tid / 8];
    dist += calc_dist(y, r, path[0]);
    dist += calc_dist(y, r, path[1]);
    // depth-first search
    for i = 2:ntx
        // compute partial Euclidean distance
        ped = 0;
        for j = 0:i
            ped += calc_dist(y, r, path[j]);
        // find the best outgoing path
        path[i] = SE_expand(dist, r);
        dist = update_dist(dist, ped);

Results:
• 4×4 16-QAM: 940 Mbps
• 4×4 64-QAM: 480 Mbps

N-Way MIMO Detector

• Duplicate the search block depending on the FER requirement.
• Larger candidate list: N·M candidates.
• Add a permute block, which enforces a detection order.
• Example: N = 2, two search blocks.

[Block diagram: (y, H) feeds N parallel branches; each branch applies a Permute block (e.g. detection orders x0→x1→x2 and x1→x2→x0), an RVD-QR block, and an SSFE search block, and the candidate lists are combined.]

References:
• Q. Qi, C. Chakrabarti, "Parallel high throughput soft-output sphere decoder," SiPS 2010.
• M. Wu, C. Dick, J. R. Cavallaro, "Improving MIMO sphere detection through antenna detection order scheduling," SDR Forum 2011.

GPU Implementation of the N-Way MIMO Detector

• Duplicate threads to improve the accuracy of the detection algorithm: parallelism grows from M to N·M.
• Divide a thread block into N subsections.
• Each subsection consists of M threads and operates on a different channel permutation, performing SSFE detection independently (see the mapping sketch below).
• Generates an N·M-size candidate list.
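A minimal CUDA sketch of the thread-to-subsection mapping described above; the permutation table, the fixed N and M, and the kernel name nway_map are illustrative assumptions rather than the authors' implementation.

    #define NTX 4
    #define M   64   // SSFE paths per search block (64-QAM)
    #define N   2    // number of parallel search blocks

    // Example detection orders (column permutations of H), chosen arbitrarily.
    __constant__ int perm[N][NTX] = { {0, 1, 2, 3},
                                      {1, 2, 3, 0} };

    __global__ void nway_map(/* per-subcarrier R, yhat, outputs omitted */)
    {
        int sub  = threadIdx.x / M;      // which of the N search blocks
        int path = threadIdx.x % M;      // which of the M SSFE paths
        const int* order = perm[sub];    // detection order for this subsection
        // ... run SSFE with the columns of H reordered by `order`,
        // contributing one of the N*M candidates to the soft-output stage.
        (void)path; (void)order;
    }
    // Launch: one thread block of N*M threads per subcarrier, e.g.
    // nway_map<<<num_subcarriers, N * M>>>();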

N-Way MIMO Detector FER Performance

• Soft-output detectors + rate-1/2 WiMAX LDPC code, Rayleigh fading channel.
• 1 outer iteration + 20 inner iterations with early termination.
• Compared to a soft-output K-best detector and an exhaustive (MAP) detector.

[FER curves for 4×4 16-QAM and 4×4 64-QAM; lower is better.]

N-Way MIMO Detector Throughput

[Bar chart of throughput (Mbps) for N = 1-4:]
• 16-QAM: 945 (N=1), 476 (N=2), 316 (N=3), 239 (N=4)
• 64-QAM: 480 (N=1), 225 (N=2), 151 (N=3), 111 (N=4)

• GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• 8192 subcarriers, kernel time only

N-Way MIMO Detector Throughput vs. Workload

[Plots of throughput (Mbps) versus the number of subcarriers (0-8192) for 16-QAM and 64-QAM, N = 1-4.]

• GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• Kernel time only

Performance Comparison

Throughput (Mbps) versus number of subcarriers:

Number of subcarriers | 16-QAM FPFSD* | 16-QAM ours (N=4) | 64-QAM FPFSD* | 64-QAM ours (N=4)
150*7                 | 72.41         | 232.27            | 18.6          | 67.55
300*7                 | 82.56         | 246.60            | 18.59         | 69.12
600*7                 | 90.44         | 256.83            | 17.9          | 69.53
900*7                 | 91.39         | 256.22            | 17.4          | 70.05
1200*7                | 92.31         | 256.88            | 17.2          | 69.97

FPFSD:
• N=4: 4 parallel detectors with different permutations.
• Differences: (a) operates in the complex domain; (b) one kernel for the search + one kernel for soft-output generation.
• Fermi, 448 cores @ 1150 MHz, 320-bit GDDR5 @ 3 Gbps.

* Sandra Roger et al., "Fully Parallel GPU Implementation of a Fixed-Complexity Soft-Output MIMO Detector," IEEE TVT 2012.

N-Way MIMO Detector (recap)

[The N-way block diagram is repeated from the earlier N-Way MIMO Detector slide: each branch applies Permute → RVD-QR → SSFE on (y, H), producing N·M candidates in total.]

N-Way QR Decomposition

• Divide a thread block into N subsections.
• Each subsection of N threads operates on a different channel permutation.
• Performs modified Gram-Schmidt QR on the extended matrix [H | y] (see the sketch below).
• Generates R and ŷ.

QR decomposition time for 8192 symbols (4×4):

N=1      | N=2      | N=3      | N=4
0.201 ms | 0.350 ms | 0.551 ms | 0.675 ms

290.3 Mbps @ 64-QAM
• GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• Kernel time only
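A minimal single-thread sketch of modified Gram-Schmidt QR on the extended matrix [H | y]: projecting the appended y column onto each q_k yields ŷ = Qᵀy directly, so Q never has to be stored. The fixed dimension, real-valued arithmetic, and function name are illustrative assumptions.

    #define NDIM 4   // illustrative; a 4x4 complex channel after RVD would be 8x8

    __device__ void mgs_qr_ext(float A[NDIM][NDIM + 1],   // in: [H | y], overwritten
                               float R[NDIM][NDIM + 1])   // out: [R | yhat] (upper part)
    {
        for (int k = 0; k < NDIM; ++k) {
            float norm = 0.0f;
            for (int i = 0; i < NDIM; ++i) norm += A[i][k] * A[i][k];
            norm = sqrtf(norm);
            R[k][k] = norm;
            for (int i = 0; i < NDIM; ++i) A[i][k] /= norm;   // q_k
            for (int j = k + 1; j <= NDIM; ++j) {             // includes the y column
                float dot = 0.0f;
                for (int i = 0; i < NDIM; ++i) dot += A[i][k] * A[i][j];
                R[k][j] = dot;            // r_kj; when j == NDIM this is yhat_k
                for (int i = 0; i < NDIM; ++i) A[i][j] -= dot * A[i][k];
            }
        }
    }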

Complete Design

• Complete design: QR + MIMO detection.
• Also includes the PCIe transfer time.

[Bar chart of throughput (Mbps) for N = 1-4:]
• 16-QAM: 196 (N=1), 173 (N=2), 130 (N=3), 124 (N=4)
• 64-QAM: 89 (N=1), 85 (N=2), 79 (N=3), 64 (N=4)

• GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
• 8192 subcarriers

Complete Design (cont.)

[Stacked bar charts of runtime (ms) for 4×4 16-QAM and 4×4 64-QAM, N = 1-4, broken down into Detection, QR, and Transport (PCIe) time.]

• Transfer time doesn't depend on N (# of parallel searches) or M (modulation size).
• Transfer time can be hidden.
• QR time depends only on N.
• Detection time depends on N and M.

Outline (cont.)

[System block diagram repeated: Source → Channel Encoder → MIMO Modulator → RF TX → Channel → RF RX → MIMO Detector → Channel Decoder → Sink.]

Multi-standard LDPC decoder on GPU:
• Introduction to the LDPC algorithm
• Kernel mapping
• Optimization techniques
• Experiment results

Channel Coding

• Linear block codes.
• Encoding: x · G = c
• A K-bit message x is encoded into an N-bit codeword c (K < N).

[Figure: LDPC decoding on the Tanner graph — check nodes (CN) and variable nodes (VN) exchange messages; variable node processing produces LLRs L; hard decision: x_n = 1 if L > 0, else x_n = 0, giving the decoded bits.]

Early Termination

• Early termination (ET): avoid unnecessary computations when the codeword converges.
• Widely used in low-power decoding architectures.
• ET for the LDPC decoder: use the parity check equation H · cᵀ = 0.
• Use massive threads to perform the parity check.

Why GPU for LDPC Decoding?

• Highly parallel algorithm (no dependency among the computations for different rows or columns) ↔ SIMT parallel architecture.
• High-complexity iterative algorithm ↔ enough workload to fully occupy the GPU's computing resources.
• Clear algorithm structure ↔ tasks partition naturally into kernel functions.

Partitioning the LDPC Decoding Task

• Initialization (host code).
• Decoding iterations (computation kernels):
  – Kernel 1: check node processing
  – Kernel 2: variable node processing
  – Kernel 3: early termination (ET) check (go back if the ET condition is not met)
• Hard decision (host code).

CUDA Kernel 1: Check Node Processing

• One thread block processes one row of sub-matrices.
• Each thread block contains 81 threads; each thread processes one row of the H matrix (one check node).
• For the 802.11n (1944, 972) LDPC code: 12 thread blocks × 81 threads/block = 972 threads.

A check-node update sketch follows below.
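A minimal sketch of one check-node update in min-sum form; the slides do not fix the exact update rule or data layout, so the min-sum approximation, the function name, and the flat message arrays are assumptions. It also shows the "loosely coupled" idea from a later slide: q_mn is recovered as L_n − r_mn instead of being stored.

    // One check node m of degree `deg`; L[] holds the LLRs of its variable
    // nodes, r_old[] the messages m sent last iteration, r_new[] the output.
    __device__ void check_node_update(const float* L, const float* r_old,
                                      float* r_new, int deg)
    {
        float min1 = 1e30f, min2 = 1e30f;   // smallest and 2nd-smallest |q|
        int   min_idx = -1;
        int   sign_prod = 1;
        for (int n = 0; n < deg; ++n) {
            float q = L[n] - r_old[n];      // recover q_mn on the fly
            float a = fabsf(q);
            if (q < 0.0f) sign_prod = -sign_prod;
            if (a < min1)      { min2 = min1; min1 = a; min_idx = n; }
            else if (a < min2) { min2 = a; }
        }
        for (int n = 0; n < deg; ++n) {
            float q   = L[n] - r_old[n];
            float mag = (n == min_idx) ? min2 : min1;   // exclude own magnitude
            float s   = (q < 0.0f) ? -1.0f : 1.0f;
            r_new[n]  = (float)sign_prod * s * mag;     // exclude own sign
        }
    }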

CUDA Kernel 2: Variable Node Processing

• One thread block processes one column of sub-matrices.
• 1944 threads run concurrently, one per variable node.
• Each thread updates its variable node messages and the probability (LLR) value memory.

A variable-node update sketch follows below.
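A minimal sketch of the corresponding variable-node update, assuming the decoder accumulates the channel LLR with all incoming check-node messages; the name and signature are illustrative.

    // One variable node with `deg` connected check nodes; r[] holds the
    // incoming check-node messages for this column.
    __device__ float variable_node_update(float channel_llr,
                                          const float* r, int deg)
    {
        float L = channel_llr;
        for (int m = 0; m < deg; ++m)
            L += r[m];      // sum messages from all connected check nodes
        return L;           // updated LLR written back to probability memory
    }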

CUDA Kernel 3: Parallel Early Termination

• M threads each compute one element of b = H · ĉᵀ (one parity bit per check equation).
• Barrier synchronization.
• Check the equation: the codeword has converged if b[0] = b[1] = … = b[M-1] = 0.

A parity-check kernel sketch follows below.
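A minimal sketch of the parallel parity check, assuming a generic sparse row representation of H rather than the compressed layout from the later slides; the host reads fail_count after the kernel and terminates decoding early when it is zero.

    // Thread m XORs the hard-decided bits that participate in parity check m.
    __global__ void parity_check(const int* row_ptr, const int* col_idx,
                                 const unsigned char* c_hat,
                                 int num_checks, int* fail_count)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= num_checks) return;
        unsigned char b = 0;
        for (int k = row_ptr[m]; k < row_ptr[m + 1]; ++k)
            b ^= c_hat[col_idx[k]];        // row m of H * c^T (mod 2)
        if (b) atomicAdd(fail_count, 1);   // this parity check is unsatisfied
    }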

Decoding Algorithm Optimization

Loosely coupled algorithm:
• Don't store q_mn in memory; before computing r_mn, recover q_mn first.
• Good for the CUDA implementation: reduces device memory storage and the number of memory operations.
• Before: store r, store q. After: store r, compute q.

Forward-backward check node update:
• For one row with ω_r non-zero elements, we would need to traverse the row ω_r times.
• Use the forward-backward algorithm instead (see the sketch below).
• Reduces the number of operations from M·ω_r·(ω_r − 2) to M·(3ω_r − 2).
• For example, with M = 2000 and ω_r = 7, this removes roughly 50% of the operations.
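A minimal sketch of the forward-backward traversal; the min-sum combine rule is used here only for concreteness (the slides state the operation count but not the exact combine), and the buffer bound WR_MAX is an assumption.

    __device__ __forceinline__ float cn_combine(float a, float b)
    {
        float s = ((a < 0.0f) != (b < 0.0f)) ? -1.0f : 1.0f;
        return s * fminf(fabsf(a), fabsf(b));
    }

    #define WR_MAX 32   // assumed upper bound on the row weight

    // q[0..wr-1]: incoming messages of one row; r[0..wr-1]: outgoing messages.
    // Assumes wr >= 2.
    __device__ void check_node_fb(const float* q, float* r, int wr)
    {
        float fwd[WR_MAX], bwd[WR_MAX];
        fwd[0] = q[0];
        bwd[wr - 1] = q[wr - 1];
        for (int i = 1; i < wr; ++i)          // forward pass
            fwd[i] = cn_combine(fwd[i - 1], q[i]);
        for (int i = wr - 2; i >= 0; --i)     // backward pass
            bwd[i] = cn_combine(bwd[i + 1], q[i]);
        // Each output excludes its own input: combine prefix and suffix.
        r[0] = bwd[1];
        r[wr - 1] = fwd[wr - 2];
        for (int i = 1; i < wr - 1; ++i)
            r[i] = cn_combine(fwd[i - 1], bwd[i + 1]);
    }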

Optimization: Multi-Codeword Decoding

• Utilize the 2-D thread block structure (see the sketch below).
• Reduce divergent branches.
• Take advantage of constant memory.
• Good flexibility and scalability.
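A minimal sketch of the 2-D thread-block mapping for multi-codeword decoding, assuming the x dimension indexes the check node within a row of sub-matrices and the y dimension indexes the codeword; the sizes and kernel name are illustrative.

    #define Z   81   // sub-matrix (expansion factor) size for 802.11n
    #define NCW 2    // codewords decoded together (illustrative)

    __global__ void cnp_multi(/* compressed H and per-codeword messages omitted */)
    {
        int node = blockIdx.x * Z + threadIdx.x;   // check node index
        int cw   = threadIdx.y;                    // codeword index
        // ... apply the same check-node update to codeword `cw`, so all NCW
        // codewords share a single pass over the (constant-memory) H structure.
        (void)node; (void)cw;
    }
    // Launch: one block per row of sub-matrices, Z x NCW threads per block, e.g.
    // cnp_multi<<<12, dim3(Z, NCW)>>>();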

Optimization – Efficient Storage

Memory optimization:
• Constant memory: increases throughput by 8% (see the sketch below).
• Compact representation of H, with horizontally and vertically compressed versions (H_kernel1 and H_kernel2) for the two kernels:

    struct h_element {
        byte x;
        byte y;
        byte shift_value;
        byte valid;
    };

[Figure: the compressed H_kernel1 and H_kernel2 matrices, storing only the cyclic-shift values I_k of the non-zero sub-matrices.]
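A sketch of how the compact H tables might be placed in constant memory, which the slides credit with the 8% throughput gain; the typedef for byte, the array names and sizes, and the field interpretations in the comments are assumptions.

    typedef unsigned char byte;

    struct h_element {
        byte x;            // block-row index of this non-zero sub-matrix (assumed)
        byte y;            // block-column index (assumed)
        byte shift_value;  // cyclic shift of the Z x Z identity sub-matrix (assumed)
        byte valid;        // 0 marks a padding entry (assumed)
    };

    // Row-compressed table for the check-node kernel and column-compressed
    // table for the variable-node kernel (sizes are illustrative).
    __constant__ h_element h_kernel1[12 * 8];
    __constant__ h_element h_kernel2[24 * 12];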

Optimization – Memory Coalescing

Coalescing device memory access:
• Compact format for R_mn and ∆_mn (the check node messages).
• Writing the compressed R_mn and ∆_mn matrices column-wise → coalesced memory accesses (20% throughput improvement).

[Figure: one column of R_mn and ∆_mn.]

Experimental Results: LDPC Decoding Throughput

Code type              | # of iterations | Decoding time (ms) | Decoding throughput (Mbps)
802.11n WiFi, N=1944   | 5               | 26.0               | 74.7
                       | 10              | 49.8               | 39
                       | 15              | 71.5               | 27.2
802.16m WiMAX, N=2304  | 5               | 24.0               | 96.1
                       | 10              | 43.0               | 52.31
                       | 15              | 64.0               | 36

* Host PC: Intel i5-750 quad-core CPU @ 2.67 GHz, 8 GB DDR3 memory
* GTX 470 (Fermi) GPU

Experimental Results: Early Termination Throughput

Adaptive ET scheme:
• Low SNR: ET off.
• High SNR: ET on.
• Increases throughput at high SNR.

Throughput vs. Workload

• WiMAX code, 2304 bits, rate-1/2.
• At first, throughput increases almost linearly as the workload increases.
• After a certain point, throughput stops increasing because the threads occupy all the SMs in the GPU.

G. Wang et al., "GPGPU Accelerated Scalable Parallel Decoding of LDPC Codes," Asilomar Conference 2011.

Comparison with Recently Published Work

Work                              | Code length              | Normalized throughput
Park et al. [Journal on WCN 2011] | 18000 bits               | 2.809 Mbps
Yau et al. [ICACT 2011]           | 1/2 CMMB code, 9126 bits | 2.046 Mbps
Zhao et al. [ICA3PP 2011]         | 4058-bit QC-LDPC         | 1.067 Mbps
Abburi [VLSID 2011]               | 2034-bit WiMAX           | 40 Mbps
Kennedy [Journal on WCN 2012]     | 2034-bit WiMAX           | 32.9 Mbps
Kang [ICC 2012]                   | 2048 bits, R=0.89        | 24.09 Mbps
Our work (GTX 470)*               | 2304-bit WiMAX           | 52.31 Mbps (10 iterations)

* G. Wang et al., "A Massively Parallel Implementation of QC-LDPC Decoder on GPU," IEEE SASP 2011.

Beyond Binary LDPC Codes – GF(q) Nonbinary LDPC

[Figure: inside one work group, q threads perform the forward and backward computations over the GF(q) message vectors F0(0..q−1), F1(0..q−1), F2(0..q−1), F3(0..q−1), with barrier synchronization through local memory; N work groups run in parallel.]

G. Wang et al., "Parallel Nonbinary LDPC Decoding on GPU," Asilomar Conference 2012.

Conclusion

• Massively parallel implementations of a MIMO detector and an LDPC decoder on GPU.
• Tailor your algorithm: tweak the algorithm to improve efficiency.
• Results: high throughput, faster than existing work; very flexible, a good platform for SDR systems.

Future work:
• Improving performance on Kepler.
• GPU-accelerated SDR systems.

Links:
• Guohui Wang: www.GuohuiWang.com
• Michael Wu: http://www.ruf.rice.edu/~mbw2/

Acknowledgement

• Research supported by the US National Science Foundation under grants CNS-1265332, ECCS-1232274, EECS-0925942, and CNS-0923479.
• Equipment donations generously provided by NVIDIA.