Massively Parallel Signal Processing for Wireless Communication Systems
Michael Wu, Guohui Wang, Joseph R. Cavallaro
Department of ECE, Rice University
March 20, 2013
Wireless Communication Systems

[Block diagram: information bits flow from Source → Channel Encoder → MIMO Modulator → RF TX, over the Channel as the transmitted signal, then as the received signal through RF RX → MIMO Detector → Channel Decoder → Sink]

Heavy computations on the RX side:
- MIMO Detector: decouples the streams to provide estimates of the transmitted bits
- Channel Decoder: corrects errors using redundant data
Used in many standards.
Related Research Work

Related work:
- MIMO detector: [SAMOS 2010, IEEE-TVT 2012]
- Turbo decoder: [CASES 2010, VLSID 2012]
- LDPC decoder: [Asilomar 2008, TPDS 2010, JSPS 2010, SASP 2011]
- SDR systems: [IEEE Comm. Mag 2010, ISRSP 2011]

Our previous work:
- MIMO detector: [SiPS 2009, Asilomar 2009, JSPS 2011, SiPS 2012]
- Turbo decoder: [SiPS 2010, JSPS 2011]
- LDPC decoder: [SASP 2011, Asilomar 2011]
- NB-LDPC decoder: [Asilomar 2012]
Massively Parallel Implementations

Massively parallel implementations of:
- MIMO Detector
- LDPC Decoder
Tailored algorithms to improve efficiency.
Results:
- High throughput (faster than existing work)
- Very flexible; can be a good platform for SDR systems
Outline

MIMO soft detection algorithm on GPU
- Introduction to MIMO detection
- Kernel mapping
- Optimization techniques
- Experiment results
Multi-standard LDPC decoder on GPU
- Introduction to the LDPC algorithm
- Kernel mapping
- Optimization techniques
- Experiment results
Modulation

- Encode data in the amplitude and phase of a sinusoid
- Higher modulation order → more data per symbol
[Figure: bit stream …0111011 mapped onto constellations; BPSK points labeled 0/1, QPSK points labeled 00/01/10/11]
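As a concrete illustration of "more data per symbol" (our own sketch, not from the slides; the mapping below is a standard Gray-coded 16-QAM, with 4 bits per complex symbol):

```python
# Illustrative Gray-coded 16-QAM mapper: 2 bits select the in-phase level,
# 2 bits the quadrature level.
GRAY_PAM4 = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}  # Gray-coded 4-PAM levels

def qam16_modulate(bits):
    """Map a flat bit list (length divisible by 4) to complex 16-QAM symbols."""
    symbols = []
    for k in range(0, len(bits), 4):
        i = GRAY_PAM4[(bits[k], bits[k + 1])]      # in-phase (real) level
        q = GRAY_PAM4[(bits[k + 2], bits[k + 3])]  # quadrature (imag) level
        symbols.append(complex(i, q))
    return symbols

print(qam16_modulate([0, 1, 1, 1, 1, 0, 0, 0]))  # [(-1+1j), (3-3j)]
```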
MIMO System Model

[Figure: three bit streams (…0111011, …0101011, …0101111) modulated onto x0, x1, x2, transmitted over channel H, received as y0, y1, y2]

[y0]   [h00 h01 h02] [x0]   [n0]
[y1] = [h10 h11 h12] [x1] + [n1]      y = Hx + n
[y2]   [h20 h21 h22] [x2]   [n2]

- Spatial multiplexing: increases throughput by transmitting multiple streams
- Receiver: the transmit streams interfere with each other
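The model above can be exercised in a few lines of numpy (illustrative sketch; the antenna count, noise level, and Rayleigh-style H are our example choices):

```python
# Minimal numpy sketch of the 3x3 MIMO model y = Hx + n.
import numpy as np

rng = np.random.default_rng(0)
nt = 3                                   # transmit/receive antennas
x = rng.choice([-1.0, 1.0], size=nt)     # BPSK symbols, one per antenna
H = (rng.normal(size=(nt, nt)) + 1j * rng.normal(size=(nt, nt))) / np.sqrt(2)
n = 0.1 * (rng.normal(size=nt) + 1j * rng.normal(size=nt))
y = H @ x + n                            # each received y_i mixes all streams
print(y.shape)
```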
MIMO-OFDM

- Break a wideband signal into many independent subcarriers
- Perform MIMO detection independently many times, one per subcarrier
- Many subcarriers in many wireless standards; e.g., an LTE 20 MHz subframe has 14 × 1200 subcarriers
MIMO Detection

After QR decomposition (ŷ = Qᴴy):

[ŷ0]   [r00 r01 r02] [x0]   [n0]
[ŷ1] = [ 0  r11 r12] [x1] + [n1]      ŷ = Rx + n
[ŷ2]   [ 0   0  r22] [x2]   [n2]

Per-level distances in the search tree (3x3 BPSK, xᵢ ∈ {−1, +1}):
d2 = |ŷ2 − r22·x2|²
d1 = |ŷ1 − r12·x2 − r11·x1|²
d0 = |ŷ0 − r02·x2 − r01·x1 − r00·x0|²

- The probability of a path x is inversely proportional to d(y, x) = ‖ŷ − Rx‖² = Σᵢ dᵢ
- The probabilities of all paths are used to generate bit probability values
[Figure: search space (tree) for 3x3 BPSK]
- 4x4 64QAM: 64⁴ = 16,777,216 paths
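The per-level distances can be checked numerically: for an upper-triangular R they sum exactly to the total path distance (the example values below are ours):

```python
# Per-level distances d_i for an upper-triangular R (3x3 BPSK), checking that
# their sum equals ||yhat - R x||^2.
import numpy as np

R = np.array([[1.0, 0.4, 0.2],
              [0.0, 0.9, 0.3],
              [0.0, 0.0, 0.8]])
x = np.array([1.0, -1.0, 1.0])
yhat = R @ x + np.array([0.05, -0.02, 0.01])   # add a small noise vector

d2 = (yhat[2] - R[2, 2] * x[2]) ** 2
d1 = (yhat[1] - R[1, 2] * x[2] - R[1, 1] * x[1]) ** 2
d0 = (yhat[0] - R[0, 2] * x[2] - R[0, 1] * x[1] - R[0, 0] * x[0]) ** 2
total = np.linalg.norm(yhat - R @ x) ** 2
assert np.isclose(d0 + d1 + d2, total)          # the d_i partition the distance
```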
SSFE Detector

Selective Spanning with Fast Enumeration (SSFE)*
- Real-valued decomposition (RVD)
- Data-parallel, deterministic search
- 1st level: enumerate all modulation points; subsequent levels: depth-first search, picking the best outgoing node
- Generates M likely paths, which are used to generate bit probability values
[Figure: SSFE search tree for 2x2 QPSK after RVD; the levels correspond to 1st antenna real, 1st antenna imag, 2nd antenna real, 2nd antenna imag]
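A scalar Python sketch of the SSFE search (our own mock-up of the algorithm described above, not the GPU kernel; `points` is the per-level real constellation after RVD):

```python
# SSFE sketch: span every point at the last level, then greedily extend each
# path one level at a time, keeping only the best child.
import numpy as np

def ssfe(yhat, R, points):
    n = R.shape[0]
    paths = []
    for p in points:                       # 1st level: span every point
        path, dist = [p], (yhat[n - 1] - R[n - 1, n - 1] * p) ** 2
        for i in range(n - 2, -1, -1):     # subsequent levels: best child only
            interference = sum(R[i, n - 1 - k] * path[k] for k in range(len(path)))
            best = min(points, key=lambda c: (yhat[i] - interference - R[i, i] * c) ** 2)
            dist += (yhat[i] - interference - R[i, i] * best) ** 2
            path.append(best)
        paths.append((dist, path[::-1]))   # path reordered as x_0 .. x_{n-1}
    return paths                           # M candidate paths

R = np.array([[1.0, 0.3], [0.0, 0.9]])
cands = ssfe(R @ np.array([1.0, -1.0]), R, [-1.0, 1.0])
print(min(cands)[1])  # noise-free input: the best path recovers [1.0, -1.0]
```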
SSFE Detector: Node Expansion

Inputs:
- Partial path P: [x2 = 1, x1 = 1]
- Channel gains r: [r00 r01 r02]
- Received signal: ŷ0

Find the best node x0 that minimizes the cumulative distance:
- Zero-forcing solution: x̂0 = (1/r00)(ŷ0 − r02·x2 − r01·x1)
- Schnorr–Euchner enumeration: pick the constellation point closest to the zero-forcing solution
- 64QAM example: for x̂0 = 3.9, the closest point on the real grid {−7, −5, −3, −1, 1, 3, 5, 7} is 3
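The node-expansion step can be sketched as follows (the helper name and the r values are our own; the 3.9 → 3 example matches the slide):

```python
# Node expansion: zero-forcing estimate for x0 given the partial path, then
# slice to the nearest 64-QAM real level (Schnorr-Euchner's first candidate).
PAM8 = [-7, -5, -3, -1, 1, 3, 5, 7]      # real-axis levels of 64-QAM after RVD

def expand_node(y0, r, x2, x1):
    x0_zf = (y0 - r[2] * x2 - r[1] * x1) / r[0]     # zero-forcing solution
    return min(PAM8, key=lambda p: abs(p - x0_zf))  # closest constellation point

# reproduce the slide's example: a zero-forcing estimate of 3.9 slices to 3
r = [1.0, 0.5, 0.25]                     # [r00, r01, r02], example values
y0 = 3.9 * r[0] + 0.5 * 1 + 0.25 * 1     # chosen so x0_zf == 3.9 for x2 = x1 = 1
print(expand_node(y0, r, x2=1, x1=1))    # 3
```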
GPU Implementation

The search algorithm maps well onto the GPU:
- Data parallel, with no sharing: each search path is independent
- Efficient node expansion: complexity doesn't depend on the modulation order
- Modest storage requirement: M threads per detection
GPU Implementation of SSFE

// one thread block handles one subcarrier;
// spawn 1 thread per modulation point (M threads)
// enumerate a modulation point for the 1st antenna
path[0] = mod_table[tid % 8];
path[1] = mod_table[tid / 8];
dist  = calc_dist(y, r, path[0]);
dist += calc_dist(y, r, path[1]);
// depth-first search
for i = 2:ntx
    // compute the partial Euclidean distance
    ped = 0;
    for j = 0:i
        ped += calc_dist(y, r, path[j]);
    // find the best outgoing path
    path[i] = SE_expand(dist, r);
    dist = update_dist(dist, ped);

- Completely unrolled inner and outer loops
- Path stored in registers
- Demodulator + soft estimate computation not shown

Results:
- 4x4 16QAM: 940 Mbps
- 4x4 64QAM: 480 Mbps
N-Way MIMO Detector

- Duplicate the search block depending on the FER requirement
- Larger candidate lists: N·M candidates
- Add a permute block, which enforces a detection order
[Figure: example with N = 2 — the input (y, H) feeds two parallel Permute → RVD-QR → SSFE chains with detection orders (x0, x1, x2) and (x1, x2, x0)]

Q. Qi, C. Chakrabarti, "Parallel high throughput soft-output sphere decoder," SiPS 2010
M. Wu, C. Dick, J. R. Cavallaro, "Improving MIMO sphere detection through antenna detection order scheduling," SDR Forum 2011
GPU Implementation of the N-Way MIMO Detector

- Duplicate threads to improve the accuracy of the detection algorithm: parallelism goes from M to N·M
- Divide a thread block into N subsections
- Each subsection consists of M threads and operates on a different channel permutation
- Each subsection performs SSFE detection independently
- Together they generate a candidate list of size N·M
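A scalar mock-up of the N-way idea (our own code, not the CUDA kernel; `detect` stands in for any per-permutation SSFE search):

```python
# N-way detection sketch: run the same search on N column permutations of H
# and merge the candidate lists, undoing each permutation on the way out.
import numpy as np

def n_way_detect(y, H, detect, perms):
    candidates = []
    for perm in perms:                    # one subsection per permutation
        Hp = H[:, perm]                   # enforce a different detection order
        for dist, x in detect(y, Hp):     # x is ordered like Hp's columns
            x_orig = np.empty(len(perm))
            x_orig[perm] = x              # undo the permutation
            candidates.append((dist, x_orig))
    return candidates                     # N*M candidates in total

# usage with a stand-in detector that returns a single dummy path
H = np.eye(3)
dummy = lambda y, Hp: [(0.0, np.array([1.0, 2.0, 3.0]))]
out = n_way_detect(np.zeros(3), H, dummy, [[0, 1, 2], [2, 0, 1]])
print(out[1][1])  # permutation (2, 0, 1) undone: [2. 3. 1.]
```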
N-Way MIMO FER Performance

- Soft-output detectors + rate-1/2 WiMAX LDPC code, Rayleigh fading channel
- 1 outer iteration + 20 inner iterations with early termination
- Compared to soft-output K-best and exhaustive (MAP) detectors
[Figure: FER curves for 4x4 16QAM and 4x4 64QAM; closer to the MAP curve is better]
N-Way MIMO Detector Throughput

Throughput (Mbps):
           N=1   N=2   N=3   N=4
16QAM      945   476   316   239
64QAM      480   225   151   111

- GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
- 8192 subcarriers, kernel time only
N-Way MIMO Detector Throughput vs. Workload

[Figure: two line charts (16QAM and 64QAM) of throughput (Mbps) vs. number of subcarriers (0–8192), one curve per N = 1…4]
- GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
- Kernel time only
Performance Comparison

Throughput (Mbps):
Number of      16QAM               64QAM
subcarriers    FPFSD*   Ours N=4   FPFSD*   Ours N=4
150*7          72.41    232.27     18.6     67.55
300*7          82.56    246.60     18.59    69.12
600*7          90.44    256.83     17.9     69.53
900*7          91.39    256.22     17.4     70.05
1200*7         92.31    256.88     17.2     69.97

- Ours, N=4: four parallel detectors with different permutations
- FPFSD differences: a) operates in the complex domain; b) one kernel for the search + one kernel for soft-output generation
- FPFSD platform: Fermi, 448 cores @ 1150 MHz, 320-bit GDDR5 @ 3 Gbps
* Sandra Roger et al., "Fully Parallel GPU Implementation of a Fixed-Complexity Soft-Output MIMO Detector," IEEE TVT 2012
N-Way QR Decomposition

- Divide a thread block into N subsections
- Each subsection operates on a different channel permutation
- Performs modified Gram-Schmidt QR on the extended matrix [H | y]
- Generates R and ŷ

QR decomposition time for 8192 symbols (4x4):
N=1: 0.201 ms   N=2: 0.350 ms   N=3: 0.551 ms   N=4: 0.675 ms
(290.3 Mbps @ 64QAM)
- GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
- Kernel time only
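A numpy sketch of MGS QR on the extended matrix [H | y] (our own scalar version, not the CUDA kernel): appending y as an extra column makes the triangularization emit ŷ = Qᴴy for free.

```python
# Modified Gram-Schmidt QR on [H | y]: after triangularizing the first m
# columns, the last column holds the rotated receive vector yhat = Q^H y.
import numpy as np

def mgs_qr_extended(H, y):
    A = np.column_stack([H, y]).astype(complex)
    m, n = H.shape
    R = np.zeros((m, n + 1), dtype=complex)
    Q = np.zeros((m, m), dtype=complex)
    for i in range(m):
        R[i, i] = np.linalg.norm(A[:, i])
        Q[:, i] = A[:, i] / R[i, i]
        for j in range(i + 1, n + 1):          # includes the appended y column
            R[i, j] = Q[:, i].conj() @ A[:, j]
            A[:, j] = A[:, j] - R[i, j] * Q[:, i]
    return R[:, :n], R[:, n]                   # upper-triangular R and yhat

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 4))
x = np.array([1.0, -1.0, 1.0, 1.0])
R, yhat = mgs_qr_extended(H, H @ x)
assert np.allclose(yhat, R @ x)                # noise-free: yhat == R x
```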
Complete Design

- Complete design: QR + MIMO detection
- Also includes PCIe transfer time
[Figure: bar chart of complete-design throughput (Mbps) for N = 1…4; 16QAM bars: 196, 173, 130, 124; 64QAM bars: 89, 85, 79, 64]
- GK104, 1536 cores @ 915 MHz, 256-bit GDDR5 @ 6 Gbps
- 8192 subcarriers
Complete Design (Runtime Breakdown)

[Figure: stacked bar charts of runtime (ms, 0–3) for N = 1…4, broken into Detection, QR, and Transport; one chart for 4x4 16QAM, one for 4x4 64QAM]
- Transfer time doesn't depend on N (# of parallel searches) or M (modulation size)
- Transfer time can be hidden
- QR time depends only on N
- Detection time depends on N and M
Outline

[Block diagram: Source → Channel Encoder → MIMO Modulator → RF TX → Channel → RF RX → MIMO Detector → Channel Decoder → Sink; the channel decoder is the focus of this part]

Multi-standard LDPC decoder on GPU
- Introduction to the LDPC algorithm
- Kernel mapping
- Optimization techniques
- Experiment results
Channel Coding

- Linear block codes
- Encoding: x · G = c — a K-bit message x is encoded into an N-bit codeword c (K < N)
[Figure: Tanner-graph fragment for variable node processing (check node CN3 connected to variable nodes VN5, VN6, VN7); hard decision on the soft value L: if L > 0 then x_n = 1, else x_n = 0, yielding the decoded bits]
Early Termination

Early termination (ET):
- Avoid unnecessary computations once the codeword converges
- Widely used in low-power decoding architectures
ET for an LDPC decoder:
- Use the parity check equation H · cᵀ = 0
- Use massive threads to perform the parity check
[Figure: H · cᵀ = 0]
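The ET condition is easy to sketch with numpy (the toy H and codeword below are ours, not from any standard):

```python
# Early-termination check H . c^T = 0 over GF(2).
import numpy as np

def parity_check_passes(H, c):
    """True when every parity equation is satisfied (mod-2 arithmetic)."""
    return not np.any((H @ c) % 2)       # b = H c^T mod 2; all-zero b => converged

H = np.array([[1, 1, 0, 1, 0, 0],        # toy parity-check matrix: each row
              [0, 1, 1, 0, 1, 0],        # is one parity equation
              [1, 0, 1, 0, 0, 1]])
assert parity_check_passes(H, np.array([1, 1, 0, 0, 1, 1]))       # valid codeword
assert not parity_check_passes(H, np.array([1, 0, 0, 0, 1, 1]))   # one bit flipped
```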
Why GPU for LDPC Decoding?

LDPC decoding → GPU:
- Highly parallel algorithm (no dependency among row or column computations) → SIMT parallel architecture
- High-complexity iterative algorithm → enough workload to fully occupy the GPU's computing resources
- Clear algorithm structure → tasks partition naturally into kernel functions
Partition the LDPC Decoding Task

- Initialization (host code)
- Decoding iterations (computation kernels):
  - Kernel 1: check node processing
  - Kernel 2: variable node processing
  - Kernel 3: early termination (ET) check (go back to Kernel 1 if the ET condition is not met)
- Make hard decision (host code)
CUDA Kernel 1: Check Node Processing

- One thread block processes one row of sub-matrices
- Each thread block contains 81 threads; each thread processes one row of the H matrix (one check node)
- 12 thread blocks × 81 threads/TB = 972 threads*
* For the 802.11n (1944, 972) LDPC code
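The check-node computation each thread performs is not spelled out on this slide; as a hedged sketch, a common choice is the min-sum update, which each thread would apply to the q messages on its row:

```python
# Min-sum check-node update: for each edge, the outgoing message r_mn is the
# product of the *other* input signs times the minimum of the *other* input
# magnitudes. Only the two smallest magnitudes are ever needed.
import numpy as np

def check_node_min_sum(q):
    q = np.asarray(q, dtype=float)
    signs = np.where(q < 0, -1.0, 1.0)
    mags = np.abs(q)
    idx = np.argsort(mags)
    min1, min2 = mags[idx[0]], mags[idx[1]]         # two smallest magnitudes
    others_min = np.where(np.arange(len(q)) == idx[0], min2, min1)
    return np.prod(signs) * signs * others_min      # total sign / own sign == * own sign

r = check_node_min_sum([2.0, -0.5, 1.0])
assert np.allclose(r, [-0.5, 1.0, -0.5])            # each output excludes itself
```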
CUDA Kernel 2: Variable Node Processing

- One thread block processes one column of sub-matrices
- 1944 threads run concurrently
- Each thread updates one variable node message and the probability-value memory
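A scalar sketch of the variable-node update each thread performs (our own formulation of the standard rule; the slides' exact message layout may differ): the soft value L is the channel LLR plus all incoming check messages, and each outgoing q excludes its own incoming r.

```python
# Variable-node update: L (a-posteriori value, used for the hard decision)
# and the extrinsic messages q_mn sent back to the check nodes.
import numpy as np

def variable_node_update(channel_llr, r_incoming):
    L = channel_llr + np.sum(r_incoming)   # soft value for this bit
    q_outgoing = L - r_incoming            # extrinsic: leave out own message
    return L, q_outgoing

L, q = variable_node_update(0.4, np.array([1.0, -0.3, 0.2]))
assert np.isclose(L, 1.3)
assert np.allclose(q, [0.3, 1.6, 1.1])
```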
CUDA Kernel 3: Parallel Early Termination

[Figure: M threads compute b = H · cᵀ in parallel; after a barrier synchronization, M threads check whether b[0], b[1], …, b[M−1] are all zero]
Decoding Algorithm Optimization

Loosely coupled algorithm:
- Don't store q_mn in memory; before computing r_mn, recover q_mn first
- Good for the CUDA implementation: reduces device memory storage and the number of memory operations
- Before: store r, store q → After: store r, compute q

Forward-backward check node update:
- For one row with ω_r non-zero elements, we would traverse the row ω_r times
- The forward-backward algorithm reduces the number of operations: M·ω_r·(ω_r − 2) → M·(3ω_r − 2)
- For example, with M = 2000 and ω_r = 7, this removes ~50% of the operations
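The forward-backward traversal can be sketched for min-sum, where the row combine operation is a minimum (our own scalar version): two prefix passes plus one combine pass replace the per-edge row scans.

```python
# Forward-backward leave-one-out minimum over one check-node row: three linear
# passes (forward prefix mins, backward suffix mins, combine) instead of one
# full row scan per edge.
import numpy as np

def leave_one_out_min(mags):
    w = len(mags)
    fwd = np.empty(w); bwd = np.empty(w)
    fwd[0], bwd[w - 1] = mags[0], mags[w - 1]
    for i in range(1, w):                      # forward pass
        fwd[i] = min(fwd[i - 1], mags[i])
    for i in range(w - 2, -1, -1):             # backward pass
        bwd[i] = min(bwd[i + 1], mags[i])
    out = np.empty(w)                          # combine pass
    out[0], out[w - 1] = bwd[1], fwd[w - 2]
    for i in range(1, w - 1):
        out[i] = min(fwd[i - 1], bwd[i + 1])   # min over everything but i
    return out

mags = np.array([2.0, 0.5, 1.0, 3.0])
assert np.allclose(leave_one_out_min(mags), [0.5, 1.0, 0.5, 0.5])
```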
Optimization: Multi-Codeword Decoding

- Utilize the 2-D thread block structure
- Reduce divergent branches
- Take advantage of constant memory
- Good flexibility and scalability
Optimization: Efficient Storage

Memory optimization:
- Constant memory: increases throughput by 8%
- Compact representation of each non-zero sub-matrix:

struct h_element {
    byte x;
    byte y;
    byte shift_value;
    byte valid;
};

[Figure: the H matrix of cyclic-shift values I_k compressed horizontally into H_kernel1 and vertically into H_kernel2]
Optimization: Memory Coalescing

- Coalesce device memory accesses
- Compact format for R_mn and ∆_mn (check node messages)
- Writing the compressed R_mn and ∆_mn matrices column-wise → coalesced memory access (20% throughput improvement)
[Figure: one column of R_mn and ∆_mn laid out contiguously in memory]
Experimental Results: LDPC Decoding Throughput

Code type               # of iterations   Decoding time (ms)   Throughput (Mbps)
802.11n WiFi, N=1944    5                 26.0                 74.7
                        10                49.8                 39
                        15                71.5                 27.2
802.16m WiMAX, N=2304   5                 24.0                 96.1
                        10                43.0                 52.31
                        15                64.0                 36

* Host PC: Intel i5-750 quad-core CPU @ 2.67 GHz, 8 GB DDR3 memory
* GTX 470 Fermi GPU
Experiment Results: Early Termination Throughput

Adaptive ET scheme:
- Low SNR: ET off
- High SNR: ET on
- Increases throughput at high SNR
Throughput vs. Workload

- WiMAX code: 2304 bits, rate 1/2
- At first, throughput increases almost linearly as the workload increases
- After a certain point, throughput stops increasing because the threads occupy all of the GPU's SMs
G. Wang et al., "GPGPU Accelerated Scalable Parallel Decoding of LDPC Codes," Asilomar Conference 2011.
Comparison with Recently Published Work

Work                                 Code length                     Normalized throughput
Park et al. [Journal on WCN 2011]    18000 bits                      2.809 Mbps
Yau et al. [ICACT 2011]              rate-1/2 CMMB code, 9126 bits   2.046 Mbps
Zhao et al. [ICA3PP 2011]            4058-bit QC-LDPC                1.067 Mbps
Abburi [VLSID 2011]                  2304-bit WiMAX                  40 Mbps
Kennedy [Journal on WCN 2012]        2304-bit WiMAX                  32.9 Mbps
Kang [ICC 2012]                      2048 bits, R=0.89               24.09 Mbps
Our work (results on GTX 470)*       2304-bit WiMAX                  52.31 Mbps (10 iterations)

* G. Wang et al., "A Massively Parallel Implementation of QC-LDPC Decoder on GPU," IEEE SASP 2011.
Beyond Binary LDPC Codes: GF(q) Nonbinary LDPC

[Figure: inside one work group, q threads run the forward computation and q threads run the backward computation over the message vectors F_0(0…q−1) … F_3(0…q−1), with a barrier synchronization through local memory; N work groups run in parallel]
G. Wang et al., "Parallel Nonbinary LDPC Decoding on GPU," Asilomar Conference 2012.
Conclusion

- Massively parallel implementations of a MIMO detector and an LDPC decoder on GPU
- Tailor your algorithm: tweak the algorithm to improve efficiency
- Results: high throughput, faster than existing work; very flexible, a good platform for SDR systems

Future work:
- Improving performance on Kepler
- GPU-accelerated SDR systems

Links:
- Guohui Wang: www.GuohuiWang.com
- Michael Wu: http://www.ruf.rice.edu/~mbw2/
Acknowledgement

- Research supported by the US National Science Foundation under grants CNS-1265332, ECCS-1232274, EECS-0925942, and CNS-0923479.
- Equipment donations generously provided by NVIDIA.