IEEE ICC 2013 - Wireless Communications Symposium

An Efficient Multi-Rate LDPC-CC Decoder With Layered Decoding Algorithm

Yun Chen, Changsheng Zhou, Yuebin Huang, and Xiaoyang Zeng
ASIC and System State Key Laboratory, Fudan University, Shanghai, China, 201203
Email: [email protected], [email protected]

Abstract—An efficient multi-rate Low-Density Parity-Check Convolutional Code (LDPC-CC) decoder is presented in this paper. We introduce the layered decoding algorithm into LDPC-CC decoding. Simulation results show that our method achieves better performance than the original belief propagation algorithm with fewer processors. In addition, a new ASIC architecture is proposed that adopts the proposed algorithm and supports all code rates (1/2, 2/3, 3/4, 4/5) of the LDPC-CC code in IEEE 1901. Based on the SMIC 130 nm CMOS process, our decoder attains a maximum throughput of 333.3 Mb/s at 200 MHz. The core area is 3.55 mm² with 10 processors. The average power consumption is 262 mW at code rate 4/5 and 200 MHz. The VLSI results show that our decoder is both memory efficient and area efficient.

Index Terms—LDPC-CC, Layered Decoding, IEEE 1901.

978-1-4673-3122-7/13/$31.00 ©2013 IEEE

I. INTRODUCTION

Low-Density Parity-Check (LDPC) codes, introduced by Gallager [1], are among the most important Forward Error Correcting (FEC) codes. They are widely used in digital broadcast and communication systems because their performance can approach the Shannon limit. Most LDPC codes in use so far, however, are LDPC Block Codes (LDPC-BC), such as those in DVB-S2, IEEE 802.11n and IEEE 802.16e, and much research has been done on them. Recently, LDPC Convolutional Codes (LDPC-CC), introduced by Felstrom and Zigangirov [2], have drawn a lot of attention. LDPC-CC is the convolutional counterpart of LDPC-BC and has some additional advantages:

1) The code length is very flexible. The parity-check matrix of an LDPC-CC is periodic and can be infinite, so the length of one frame can be set to any number according to the length of the message.
2) Its performance is comparable with that of LDPC-BC; [2] shows that LDPC-CC codes outperform block codes of the same length.
3) Encoding is simple. Because it is a convolutional code, the encoder can be built from delay registers and XOR logic.
4) The starting state of the encoding and decoding processes is known, so the first bits of a frame, which usually carry important information, receive stronger error protection.

Due to these advantages, LDPC-CC is more suitable for applications such as packet and Ethernet communication.

Similar to LDPC-BC, an LDPC-CC is defined by an infinite parity-check matrix and can be represented by a Tanner graph [3]. The decoding algorithm of LDPC-CC is also similar to that of LDPC-BC. Paper [2] first introduced a Belief Propagation (BP) algorithm that is similar to Two-Phase Message-Passing (TPMP) [1] with Min-Sum (MS) [4] in LDPC-BC. But the BP algorithm has low convergence speed and high memory consumption. Then, instead of activating the variable nodes that are about to leave an operating processor as in BP, paper [5] introduced an On-demand Variable node Activation (OVA) scheduling technique in which a variable node is activated whenever it is requested. OVA converges quickly, but much memory is still needed.

In this paper, we introduce the layered decoding algorithm [6] into LDPC-CC decoding. We also adopt the Normalized Min-Sum (NMS) algorithm [7] in the check node operations. To verify our architecture, a memory-based decoder that supports all code rates (1/2, 2/3, 3/4, 4/5) of the LDPC-CC code in IEEE 1901 [8] is designed and fabricated. Results show that our decoder converges quickly while consuming less memory and hardware.

The rest of the paper is organized as follows. Section II briefly introduces LDPC-CC and the main parameters of the LDPC-CC in IEEE 1901. Section III introduces the layered decoding algorithm into LDPC-CC decoding and then shows the simulation results. Section IV presents the VLSI implementation of the decoder. The VLSI results follow in Section V. Section VI concludes the paper.

II. INTRODUCTION OF LDPC-CC IN IEEE 1901

In LDPC-CC, parity bits are generated by parity-check operations, as in LDPC-BC. Unlike LDPC-BC, however, this generation uses only previous systematic bits and parity bits. An LDPC-CC is defined by a parity-check matrix $H$ with memory $M$ that is periodically time-varying and infinite. The transposed form $H^T$ is usually used. A rate $R = b/c$ LDPC-CC can be defined as follows:

\[
H^T = \begin{bmatrix}
H_0^{(0)} & H_1^{(1)} & \cdots & H_t^{(M)} & & \\
 & H_1^{(0)} & \cdots & H_t^{(M-1)} & H_{t+1}^{(M)} & \\
 & & \ddots & \ddots & \ddots & \ddots
\end{bmatrix} \tag{1}
\]

where $H_t^{(m)}$, $m = 0, 1, \cdots, M$, $t = 0, 1, \cdots$, are $c \times (c - b)$ periodically time-varying submatrices. $H_t^{(0)}$ must have full rank, and $H_t^{(m)} = H_{t+T_p}^{(m)}$ must hold for all $t$ and $m$, where $T_p$ is the period of the code. Figure 1 shows an example of $H^T$ for a rate 1/2 LDPC-CC. Every row of $H^T$ represents a variable node and every column represents a check node. Most elements of $H^T$ are zero; every nonzero element defines a connection between the corresponding variable node and check node. The number of ones in each column, starting from the $M \times (c - b)$th column, is called the column weight ($W_c$), and the number of ones in each row is called the row weight ($W_r$). Another parameter of LDPC-CC, the Constraint Length (CL), is similar to the code length in LDPC-BC. Usually the CL of code rate $b/c$ is $(1 + M)c$. The main parameters ($W_r$, $W_c$, $M$, $T_p$ and CL) of the code in Fig. 1 are (3, 6, 3, 3, 8). Table I lists the main parameters in IEEE 1901.

Fig. 1. An example of $H^T$ in rate 1/2 LDPC-CC

TABLE I
MAIN PARAMETERS OF LDPC-CC IN IEEE 1901

Rate | Wr | Wc | M   | Tp | CL
1/2  | 3  | 6  | 215 | 3  | 432
2/3  | 3  | 9  | 226 | 3  | 681
3/4  | 3  | 12 | 226 | 3  | 908
4/5  | 3  | 15 | 226 | 3  | 1135

As a convolutional code, an LDPC-CC can also be defined by check polynomials. Take the rate 1/2 LDPC-CC in Fig. 1 as an example. All the bits can be divided into two kinds:

1) The systematic bits, represented by $v_0, v_2, v_4, \cdots, v_{2k}$ ($k = 1, 2, 3, \cdots$) and expressed as $X_0$.
2) The parity bits, represented by $v_1, v_3, v_5, \cdots, v_{2k+1}$ ($k = 1, 2, 3, \cdots$) and expressed as $P$.

The parity-check polynomials of the rate 1/2 LDPC-CC in Fig. 1 are then listed in formula (2), where $D$ is the delay operator and $C_n$ denotes check node $n$.

\[
\begin{aligned}
C_{3k}:   &\quad (D^3 + D^2 + 1)X_0(D) + (D^2 + D^1 + 1)P(D) = 0 \\
C_{3k+1}: &\quad (D^2 + D^1 + 1)X_0(D) + (D^3 + D^1 + 1)P(D) = 0 \\
C_{3k+2}: &\quad (D^3 + D^1 + 1)X_0(D) + (D^3 + D^2 + 1)P(D) = 0
\end{aligned} \tag{2}
\]

The LDPC-CC in IEEE 1901 is also defined by check polynomials and is very similar to the LDPC-CC in Fig. 1. In every check polynomial, each kind of bit has exactly three delay factors, one of which is 0. The differences are:

1) The systematic bits may be divided into several kinds. For example, there are $X_0$, $X_1$, $X_2$, $X_3$ and $P$ at code rate 4/5. $X_n$ are the systematic bits represented by $V_{5k+n}$ and $P$ are the parity bits represented by $V_{5k+4}$.
2) The maximum delay factor is 226.

III. DECODING ALGORITHM

Before describing the algorithm, we introduce some nomenclature, which is similar to that of LDPC-BC, and the corresponding notation:

$I_v$ is the channel intrinsic Log-Likelihood Ratio (LLR) of variable node $v$.
$L_v$ is the posterior message of variable node $v$.
$Z_{c \to v}$ is the extrinsic message from check node $c$ to variable node $v$.
$L_{v \to c}$ is the prior message from variable node $v$ to check node $c$.
$\alpha$ is the normalization factor.
$N(c)$ is the set of variable nodes connected to check node $c$.
"\" denotes exclusion.
$\mathrm{sgn}(x)$ is the sign of $x$.
$x_v$ is the hard-decision value of variable node $v$.

Paper [2] first introduced the BP algorithm to decode LDPC-CC. But the BP algorithm has low convergence speed and high memory consumption: one intrinsic channel message and $W_r$ prior messages need to be stored for every variable node. Then, instead of activating the variable nodes that are about to leave an operating processor as in BP, paper [5] introduced an On-demand Variable node Activation (OVA) scheduling technique in which a variable node is activated whenever it is requested. It converges faster, but much memory is still needed. The decoder in [9] adopted the OVA scheduling technique. Although the intrinsic message is concealed in [9], three prior messages are still needed for every variable node.

To reduce memory and area consumption while achieving fast convergence, we introduce the layered decoding algorithm into LDPC-CC decoding. As in LDPC-BC, every check node can be seen as one layer in LDPC-CC. The posterior messages of the variable nodes connected to the operating check node, together with the corresponding extrinsic messages, are read out first. Then the prior and extrinsic messages are updated. The updated messages can be used immediately to update the related variable nodes. Based on NMS, our method can be described as follows:

Step 1: Initialization.

\[ L_v = I_v, \quad Z_{c \to v} = 0 \tag{3} \]

Step 2: Check Node Updating (CNU) in one layer.

\[ L_{v \to c} = L_v^{old} - Z_{c \to v}^{old} \tag{4} \]

\[ Z_{c \to v}^{new} = \alpha \times \prod_{n \in N(c) \setminus v} \mathrm{sgn}(L_{n \to c}) \times \min_{n \in N(c) \setminus v} |L_{n \to c}| \tag{5} \]

Step 3: Variable Node Updating (VNU) in one layer.

\[ L_v^{new} = L_{v \to c} + Z_{c \to v}^{new} \tag{6} \]

Step 4: The messages are passed through all processors, each of which repeats Steps 2 and 3.

Step 5: Hard decision.

\[ x_v = \begin{cases} 1, & L_v < 0 \\ 0, & L_v \ge 0 \end{cases} \tag{7} \]

Take $H^T$ in Fig. 1 as an example. When operating on layer $C_3$, the posterior messages $L_{v_n}^{old}$ ($n$ = 0, 2, 3, 5, 6, 7; $n$ has the same meaning throughout this paragraph) and the extrinsic messages $Z_{c_3 \to v_n}^{old}$ are read out first. Then $L_{v_n \to c_3}$ and $Z_{c_3 \to v_n}^{new}$ are updated by formulas (4) and (5). $Z_{c_3 \to v_n}^{new}$ are stored in memory until the next processor handles $C_3$. $L_{v_n \to c_3}$ and $Z_{c_3 \to v_n}^{new}$ are used immediately to update $L_{v_n}^{new}$ by formula (6). When handling $C_4$, the processor can use the updated posterior messages $L_{v_m}$ ($m$ = 3, 6, 7) immediately.

With our method, only posterior and extrinsic messages need to be stored. The number of prior messages used by each processor is very small, so they can be held in registers. The decoder stores only one posterior message for every variable node. Based on NMS, the extrinsic messages of each check node can be reduced to the first and second minimum absolute values, the position index of the first minimum absolute value, the signs of all extrinsic messages and the product of all the signs [13]. In this way the number of memory bits is reduced greatly, especially when the parameter $M$ of the LDPC-CC code is very large.

The simulation results for all code rates in IEEE 1901 are shown in Fig. 2. All simulations are carried out on an Additive White Gaussian Noise (AWGN) channel with Binary Phase Shift Keying (BPSK) modulation. The normalization factor is 0.75 in both floating-point and fixed-point modes. In fixed-point mode, the posterior and prior messages are quantized to 8 bits (a sign bit, a 4-bit integer part and a 3-bit fraction), while the extrinsic message is quantized to 6 bits. The performance of the OVA scheduling technique is almost the same as that of the layered decoding algorithm, and the two curves cannot be distinguished when drawn on the same figure, so OVA is not shown. From the Bit Error Ratio (BER) curves, we can see that the proposed layered decoding algorithm is more efficient: its performance with only half as many processors is better than that of the original BP algorithm.

Fig. 2. BER curves of all code rates (1/2, 2/3, 3/4, 4/5) of the LDPC-CC code in IEEE 1901

IV. VLSI IMPLEMENTATION

Paper [2] first introduced the concepts of processors and First-In First-Out (FIFO) shift registers in pipelined LDPC-CC decoding. The decoders in [10], [11] use these concepts directly: they are register-based, and many registers are used to implement the FIFOs. This method is simple and can achieve high throughput, but it consumes more hardware and power, because large numbers of registers are an expensive form of storage. The decoder in [12] introduced a memory-based architecture, which reduces the cost in both hardware and power. In this paper, the memory-based method is adopted. The proposed architecture is shown in Figure 3. The decoder contains ten processors and nine extrinsic memory blocks.
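The difference between a register-based and a memory-based FIFO can be illustrated with a small behavioural model. The sketch below is only an assumption about how such delay lines are commonly modelled, not the structure used in [10]-[12] or in this decoder; the class names and the activity counters are hypothetical. Both models delay their input by the same number of clocks, but the shift-register version rewrites every stored word on each clock, while the memory version performs one read and one write.

```python
class ShiftRegisterDelay:
    """Register-based delay line: every stored word moves on every clock."""

    def __init__(self, depth):
        self.regs = [0] * depth
        self.register_writes = 0  # crude activity proxy (hypothetical metric)

    def clock(self, word):
        out = self.regs[-1]                  # oldest word leaves the chain
        self.regs = [word] + self.regs[:-1]  # all other words shift by one place
        self.register_writes += len(self.regs)
        return out


class MemoryDelay:
    """Memory-based delay line: one read and one write per clock."""

    def __init__(self, depth):
        self.ram = [0] * depth
        self.addr = 0
        self.memory_accesses = 0  # crude activity proxy (hypothetical metric)

    def clock(self, word):
        out = self.ram[self.addr]            # read the oldest word
        self.ram[self.addr] = word           # overwrite it with the newest word
        self.addr = (self.addr + 1) % len(self.ram)
        self.memory_accesses += 2
        return out


def run(delay, words):
    """Feed a sequence of words through a delay line and collect the outputs."""
    return [delay.clock(w) for w in words]
```

For a delay of depth d, the register model touches d words per clock and the memory model touches two, which is one way to picture why a large memory parameter M favours the memory-based approach.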


Fig. 3. The architecture of the proposed LDPC-CC decoder (blocks labelled in the figure: Decoder, Proc, Central Controller, Control signals, Posterior Msg Memory Block, Subtracter Block, NMS Block, Adder Block, Extrinsic Msg Memory Block)

All processors are identical, and the detail of processor0 is also shown in Fig. 3. The input posterior messages and extrinsic messages of processor0 are fed with the intrinsic messages of the channel and with all zeros, respectively, which is the initial step described by formula (3). The output posterior messages and extrinsic messages of one processor are passed to the next processor either directly or through the extrinsic memory block. Every processor has five parts: 1) a central controller that generates all the control signals; 2) a subtracter block that updates the prior messages from the old posterior messages and extrinsic messages as described by formula (4); 3) an NMS block that updates the extrinsic messages from the prior messages as described by formula (5); 4) an adder block that updates the posterior messages from the updated prior messages and extrinsic messages as described by formula (6); 5) a posterior message memory block that stores the posterior messages.

Different kinds of code bits are stored in different memories in this decoder. Because every kind of code bit has three delay factors in the check polynomial, every memory must complete three read and three write operations when one check node is processed. Dual-port memories are chosen, and three clock cycles are needed to handle these operations. To support all the code rates in IEEE 1901, the processor must meet the maximum hardware demand among the different code rates. As a tradeoff between throughput and area, the decoder in this paper handles N messages together at every operation when decoding code rate (N − 1)/N. The maximum N of the LDPC-CC in IEEE 1901 is five. As a result, there are five dual-port memories, five subtracters and five adders in the corresponding blocks of every processor. Some of the hardware is disabled to reduce power consumption when decoding rates 1/2, 2/3 and 3/4. The throughput of the decoder is fmax × N/3, and its maximum value is 333.3 Mb/s.
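The data path of one processor described above (subtracter, NMS and adder blocks, formulas (4)-(6)) can be sketched in software as follows. This is a minimal floating-point sketch under stated assumptions, not the fabricated design: the function names, the dictionary-based message storage and the choice sgn(0) = +1 are assumptions, and the LLR values in the example are made up. The NMS step uses the compressed form mentioned in Section III (first and second minimum, position of the first minimum, product of signs).

```python
ALPHA = 0.75  # normalization factor reported for the simulations


def process_layer(posterior, extrinsic, neighbors, alpha=ALPHA):
    """Apply formulas (4)-(6) to one check node (one layer), in place.

    posterior : dict, variable index -> posterior LLR L_v
    extrinsic : dict, variable index -> stored Z_{c->v} of this check node
    neighbors : list of variable indices in N(c), at least two entries
    """
    # Subtracter block, formula (4): prior = old posterior - old extrinsic.
    prior = {v: posterior[v] - extrinsic[v] for v in neighbors}

    # NMS block, formula (5), in compressed form: two smallest magnitudes,
    # position of the smallest one, and the product of all signs.
    ordered = sorted(neighbors, key=lambda v: abs(prior[v]))
    first_min_pos = ordered[0]
    min1 = abs(prior[ordered[0]])
    min2 = abs(prior[ordered[1]])
    sign = {v: -1 if prior[v] < 0 else 1 for v in neighbors}  # sgn(0) = +1 is an assumption
    sign_product = 1
    for v in neighbors:
        sign_product *= sign[v]

    for v in neighbors:
        # Excluding v from the minimum: use min2 only at the position of min1.
        magnitude = min2 if v == first_min_pos else min1
        # Excluding v from the sign product: multiply by v's own sign (+/-1).
        extrinsic[v] = alpha * sign_product * sign[v] * magnitude
        # Adder block, formula (6): new posterior = prior + new extrinsic.
        posterior[v] = prior[v] + extrinsic[v]


def hard_decision(posterior):
    """Formula (7): the bit is 1 when the posterior LLR is negative."""
    return {v: 1 if llr < 0 else 0 for v, llr in posterior.items()}


if __name__ == "__main__":
    # Layer C3 of Fig. 1 touches v0, v2, v3, v5, v6, v7 (illustrative LLRs).
    L = {0: 2.0, 2: -1.0, 3: 4.0, 5: 3.0, 6: -0.5, 7: 1.5}
    Z = {v: 0.0 for v in L}  # formula (3): extrinsic messages start at zero
    process_layer(L, Z, [0, 2, 3, 5, 6, 7])
    print(L, hard_decision(L))
```

Because the posterior values are overwritten in place, a following layer that shares variable nodes with this one sees the updated values immediately, which is the property the layered schedule relies on.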
A five-stage pipeline is adopted in this decoder: 1) reading the corresponding messages from the memories; 2) updating the prior messages by formula (4); 3) updating the extrinsic messages by formula (5); 4) updating the posterior messages by formula (6); 5) writing the corresponding messages to the memories. Usually the width of the memories (the number of entries) would be M + 1. It is extended to M + 1 + 1 to avoid memory conflicts, because the corresponding messages are delayed when pipelining is adopted. The depth of the posterior message memory (the number of bits per entry) is 8 because the posterior message is quantized to 8 bits. The first and the second minimum absolute values of the extrinsic messages are each quantized to 5 bits. The position index of the first minimum absolute value takes 4 bits, and the signs of all extrinsic messages per check node take 15 bits because the maximum Wc is 15. The product of all the signs takes one bit. So the depth of the extrinsic message memory is 5 + 5 + 4 + 15 + 1 = 30. The total number of memory bits needed is 228 × 8 × 5 × 10 + 228 × 30 × 9 = 152,760 bits.

V. RESULTS

The proposed decoder is fabricated in SMIC 0.13 µm, 1.2 V, 8-metal-layer CMOS technology. Fig. 4 shows its die photograph. The decoder attains a maximum throughput of 333.3 Mb/s at 200 MHz (the system requirement in IEEE 1901 is 220 Mb/s). The core area is 1.70 × 2.09 mm² with 10 processors. The average power consumption is 262 mW at code rate 4/5 and 200 MHz. Table II compares the proposed decoder with other LDPC-CC decoders. The proposed decoder has the best memory efficiency among them, which indicates that the layered decoding algorithm does reduce the memory needed. The proposed decoder also has the smallest area per processor while having the largest constraint length, so it is area efficient. Its power consumption is comparable with the state of the art.

VI. CONCLUSION

In this paper, we present an efficient multi-rate LDPC-CC decoder. We also introduce the layered decoding algorithm into LDPC-CC decoding. The simulation results show that our method achieves better performance than the original BP algorithm with only half as many processors. To verify our method, a new architecture is proposed and a decoder that can


TABLE II
COMPARISON WITH OTHER LDPC-CC DECODERS

Item | Proposed | [9] | [11] | [14] | [15]
Code Rate | 1/2, 2/3, 3/4, 4/5 | 1/2, 2/3, 3/4, 4/5, 5/6 (1) | 1/2 | 1/2 | 1/2
Constraint length | 1135 | 984 | 258 | 258 | 960
Processor | 10 | 5 | 10 | 3 | 1
Input quantization | 8 | 6 | 6 | 8 | 6
Memory bits (k) | 152.76 | 52.5 (1) | 67 | - | 23.04
Memory Efficiency (bit/constraint length/proc.) | 13.46 | 21.34 | 25.97 | - | 24
Frequency (MHz) | 200 | 175 | 198 | 600 | 250
Core Area (mm²) | 3.55 | 2.24 | 9.9 | 1.5 | 0.924
Area Efficiency (mm²/proc., scaled to 130 nm) | 0.355 | 0.896 | 0.495 | 1.0 | 1.848
Throughput (Gbps) | 0.333 | 2.37 | 0.175 | 0.6 | 2.0
Power (mW) | 262 | 284.8 | 1300 | 368.7 | 64 (2)
Energy Efficiency (pJ/bit/proc.) | 78.6 | 24 | 760 | 204.8 | (2)
Technology (nm, V) | 130, 1.2, Measured | 90, 1.2, Measured | 180, 1.8, Measured | 90, 1.0, Measured | 90, Synthesis

Note: (1) In [9], rates 2/3, 3/4, 4/5 and 5/6 are generated by discarding several systematic bits and/or parity bits from the rate 1/2 outputs, which is also called puncturing. In addition, only 50% of the messages are stored in memory in [9], so the total should be 105k bits.
(2) For [15], the source lists a single value (64) across the Power and Energy Efficiency rows.
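The derived entries in the Proposed column of Table II can be cross-checked from the figures given in Sections IV and V. The following is a small arithmetic sketch; the function names and default arguments are assumptions introduced for illustration, while the numbers (228 entries, 8-bit posterior words, five memories per processor, ten processors, 30-bit extrinsic words, nine extrinsic memory blocks, 200 MHz, 262 mW, 3.55 mm², constraint length 1135) are taken from the text.

```python
def throughput_mbps(f_max_mhz, n):
    """Throughput f_max x N / 3: N messages per operation, three clocks per check node."""
    return f_max_mhz * n / 3


def memory_bits(entries=228, posterior_bits=8, memories_per_proc=5,
                processors=10, extrinsic_bits=30, extrinsic_blocks=9):
    """Total storage: posterior memories in every processor plus extrinsic blocks."""
    posterior = entries * posterior_bits * memories_per_proc * processors
    extrinsic = entries * extrinsic_bits * extrinsic_blocks
    return posterior + extrinsic


def memory_efficiency(bits, constraint_length, processors):
    """Memory bits per unit of constraint length per processor."""
    return bits / constraint_length / processors


def energy_efficiency_pj(power_mw, rate_mbps, processors):
    """Energy per bit per processor; mW / (Mb/s) is nJ/bit, times 1000 gives pJ/bit."""
    return power_mw / rate_mbps * 1000 / processors


if __name__ == "__main__":
    rate = throughput_mbps(200, 5)            # rate 4/5, so N = 5
    bits = memory_bits()
    print("throughput (Mb/s):", round(rate, 1))
    print("memory bits:", bits)
    print("memory efficiency:", round(memory_efficiency(bits, 1135, 10), 2))
    print("area per processor (mm^2):", round(3.55 / 10, 3))
    print("energy efficiency (pJ/bit/proc.):", round(energy_efficiency_pj(262, rate, 10), 1))
```

With these inputs the rounded results agree with the table entries 0.333 Gbps, 152.76k bits, 13.46, 0.355 and 78.6.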

support all the code rates (1/2, 2/3, 3/4, 4/5) of the LDPC-CC code in IEEE 1901 is fabricated and tested. The results show that our decoder is memory and area efficient, which supports the efficiency of the proposed method.

Fig. 4. Die photograph of the proposed LDPC decoder

ACKNOWLEDGMENT

This work was supported in part by the Key Projects of the 11th Five-Year Plan of China under Grant No. 2009ZX01031-002-003-2, the 863 Project under Grant No. SQ2008AA01ZX1480432 and the New Wireless Communication 2011 Key Project under Grant No. 2011ZX03003-00303. The project was also supported by the "chengguang" Foundation of the Education Commission of Shanghai, China, under Grant No. 11SG07, and by the State Key Laboratory of ASIC & System under Grant No. 11ZD0005.

REFERENCES

[1] R. G. Gallager, "Low density parity check codes," IEEE Trans. on Information Theory, vol. 8, no. 1, pp. 21-28, Jan. 1962.
[2] A. J. Felstrom and K. S. Zigangirov, "Time-varying periodic convolutional codes with low-density parity-check matrix," IEEE Trans. on Information Theory, vol. 45, no. 6, pp. 2181-2191, Sep. 1999.
[3] R. M. Tanner, "A recursive approach to low complexity codes," IEEE Trans. on Information Theory, vol. IT-27, no. 5, pp. 533-547, Sep. 1981.
[4] N. Wiberg, "Codes and decoding on general graphs," Ph.D. dissertation, Dept. Elect. Eng., Univ. Linkoping, Linkoping, Sweden, 1996.
[5] A. Pusane, M. Lentmaier, et al., "Reduced complexity decoding strategies for LDPC convolutional codes," in IEEE International Symposium on Information Theory 2004, ISIT 2004, p. 490, 2004.
[6] M. M. Mansour and N. R. Shanbhag, "High-throughput LDPC decoders," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 11, no. 6, pp. 976-996, Dec. 2003.
[7] J. Chen and M. C. Fossorier, "Decoding low-density parity-check codes with normalized APP-based algorithm," IEEE Global Telecommunications Conference 2001, GLOBECOM 2001, vol. 2, pp. 1026-1030, Nov. 2001.
[8] IEEE P1901/D4.01, "IEEE Draft Standard for Broadband over Power Line Networks: Medium Access Control and Physical Layer Specifications," Dec. 2010.
[9] Chih-Lung Chen, Yu-Hsiang Lin, Hsie-Chia Chang and Chen-Yi Lee, "A 2.37Gb/s 284.8mW rate-compatible (491,3,6) LDPC-CC decoder," IEEE Symposium on VLSI Circuits 2011, VLSIC 2011, pp. 134-135, June 2011.
[10] R. Swamy, S. Bates and T. Brandon, "Architectures for ASIC implementations of low-density parity-check convolutional encoders and decoders," IEEE International Symposium on Circuits and Systems 2005, ISCAS 2005, vol. 5, pp. 4513-4516, May 2005.
[11] R. Swamy, S. Bates, et al., "Design and Test of a 175-Mb/s, Rate-1/2 (128,3,6) Low-Density Parity-Check Convolutional Code Encoder and Decoder," IEEE Journal of Solid-State Circuits, vol. 42, no. 10, pp. 2245-2256, Oct. 2007.
[12] S. Bates, L. Gunthorpe, et al., "Decoders for low-density parity-check convolutional codes with large memory," IEEE International Symposium on Circuits and Systems 2006, ISCAS 2006, Kos, Greece, May 2006.
[13] K. Gunnam, Gwan Choi, Weihuang Wang, Yeary, "Multi-Rate Layered Decoder Architecture for Block LDPC Codes of the IEEE 802.11n Wireless Standard," IEEE International Symposium on Circuits and Systems 2007, ISCAS 2007, pp. 1645-1648, May 2007.
[14] Tyler L. Brandon, John C. Koob, et al., "A Compact 1.1-Gb/s Encoder and a Memory-Based 600-Mb/s Decoder for LDPC Convolutional Codes," IEEE Trans. on Circuits and Systems I, vol. 56, no. 5, pp. 1017-1029, Mar. 2009.
[15] Zhengang Chen, Tyler L. Brandon, et al., "Jointly Designed Architecture-Aware LDPC Convolutional Codes and High-Throughput Parallel Encoders/Decoders," IEEE Trans. on Circuits and Systems I, vol. 57, no. 4, pp. 836-849, Dec. 2009.
