
IEEE Transactions on Consumer Electronics, Vol. 44, No. 1, FEBRUARY 1998


EFFICIENT ALGORITHM AND ARCHITECTURE FOR POST-PROCESSOR IN HDTV

Jae-Wook Lee, Jeong-Woo Park, Myung-Hoon Yang, Sungho Kang and Yoonsik Choe

Department of Electrical Engineering, Yonsei University, Seoul, Korea

Abstract

To display high quality images on the monitor screen in HDTV, the processed image data must be converted into a form appropriate for real-time display. This paper presents an efficient algorithm and architecture for the post-processor, which has four functions. The first function is to remove the blocking effect in HDTV images, and the second is to convert the scan format into one appropriate for the display. The third is to convert the YUV format image signals into RGB format signals. The final function is the γ correction for better quality images. To reduce the size of the memory, the memory is partitioned into many memory banks. Also, to improve the operation speed, a pipelined parallel architecture and a memory scheduling technique are adopted. Therefore this architecture is very fast and uses small memory banks, which makes it possible to realize a real-time signal processor.

1 Introduction

Due to our desire for a clear and wide image on TV, HDTV (High Definition Television) has been invented. However, there are many problems in realizing a commercial model that is inexpensive but still offers high quality. Early HDTV was designed to receive broadcasting signals via broadcasting-satellite channels, which have a wide frequency bandwidth. Because HDTV carries much more image signal data than normal TV, a wide frequency bandwidth is necessary. For commercial purposes, however, broadcasting via a narrow frequency bandwidth is necessary. In digital HDTV, image signals are compressed by DCT (Discrete Cosine Transform) and then broadcast via existing TV broadcasting channels. The broadcast signals are received and then restored by IDCT (Inverse Discrete Cosine Transform). In this process of compression and restoration, data losses are unavoidable, so there exist a few errors in the restored images.

In the DCT process, 8 x 8 image blocks are used. If we compare a restored 8 x 8 image block with the original one, it is difficult to find the differences. However, if we gather many restored 8 x 8 image blocks and compose a large image, we can see the discontinuity between the blocks. These phenomena are called the "blocking effect" and the "staircase effect", and they result in image-quality degradation. The blocking effect can be removed by low-pass filtering, but low-pass filtering blurs the edges of the image. Therefore, using an improved algorithm[1], the direction of an edge is classified. According to the direction of the edge, only the boundary pixels of the image blocks are filtered, with various filter coefficients, to avoid edge blurring[2].

After this operation, a scan conversion process is required to guarantee compatibility between the various image formats used in different image systems. For example, a movie film runs at 24 (frames/sec), but a general TV signal, NTSC (National Television Systems Committee), runs at 30 (frames/sec). Therefore, in order to reuse the image data in different image systems, appropriate conversion processes are required. In the case of a conversion from a low resolution image into a high resolution image, a conversion process which generates additional data from the original data is required, because the high resolution image requires more data than the low resolution one. All of these processes are called scan conversion.

In digital image signal processing, separating luminance signals from chrominance signals is efficient for data compression and signal processing. For this reason the YUV signal format is used in HDTV image processing, but the RGB signal format is required to display images on the monitor. As a result, a YUV/RGB signal conversion block is necessary in HDTV[3].

In video devices like cameras and monitors, γ represents a numerical parameter that describes the nonlinearity of the intensity reproduction.
In other words, the output voltage of the camera is not proportional to the intensity of an input image, and the intensity of the light reproduced at the screen of a CRT monitor is not proportional to its input voltage. Therefore, the function of γ correction can be thought of as the compensation for this nonlinearity so as to achieve correct reproduction of intensity[4].

In this paper, a memory-based pipelined parallel architecture[5,6,7] is presented which is used in blocking effect removal and scan conversion. Because there is too much data to process in HDTV, a pipelined parallel architecture is necessary to implement a real-time processing architecture for the post-processor. To reduce the memory size and to realize easy and fast access to memory, the memory scheduling technique is used with many memory banks[8,9]. In the process of calculation, many multiplications and divisions are required. However, these operations are time-critical[10], so many multiplications and divisions are replaced with shifters to realize real-time processing. To avoid complicated calculations in the γ correction, the operation results are calculated in advance, and the results are produced by combinational logic. With this architecture, the post-processor was simulated in the C language, and an ASIC (Application Specific Integrated Circuit) chip was realized with VHDL (VHSIC Hardware Description Language)[11,12,13].

Contributed Paper

Manuscript received October 8, 1997

0098 3063/98 $10.00

1998 IEEE

2 Algorithm

2.1 Blocking Effect Removal

In this paper, an improved blocking effect removal algorithm[1] is used. This algorithm determines the direction of the edge in an 8 x 8 image block and then, based on the determined direction, chooses the low-pass filter coefficients so as not to blur the edge of each 8 x 8 image block. Because the blocking effect takes place at the boundary of each image block, only the 28 boundary pixels of each 8 x 8 image block are filtered. The suggested algorithm is as follows:

1. Let two counters K and L be set to zero. Let $x_{i,j}$ denote the $(i,j)$th pixel in an 8 x 8 image block, and let $x_{avg}$ denote the average value of two values.

2. Let $T$ denote the preselected threshold. For all horizontally adjacent pixels, with $\Delta x_{i,j}$ the difference between the two adjacent pixel values:

If $\Delta x_{i,j} / x_{avg} > T$ then increase the K counter by 1.

If $\Delta x_{i,j} / x_{avg} < -T$ then decrease the K counter by 1.

3. For all vertically adjacent pixels, where $i = 1, \ldots, 8$ and $j = 1, \ldots, 7$:

If $\Delta x_{i,j} / x_{avg} > T$ then increase the L counter by 1.

If $\Delta x_{i,j} / x_{avg} < -T$ then decrease the L counter by 1.

4. Let $m$ denote the minimum length of an edge segment to be a meaningful edge. Using the two counted values K and L, classify the direction of the edge.

(a) Monotone: if $|K| \le m$ and $|L| \le m$

(b) 0° edge: if $|K| \le m$ and $|L| > m$

(c) 45° edge: if $|K| > m$, $|L| > m$ and $\mathrm{sgn}(K) = \mathrm{sgn}(L)$

(d) 90° edge: if $|K| > m$ and $|L| \le m$

(e) 135° edge: if $|K| > m$, $|L| > m$ and $\mathrm{sgn}(K) = -\mathrm{sgn}(L)$

By experiments, the values of $m$ and $T$ were determined to be $m = 6$ and $T = 0.2$. Using the results of the above algorithm, the filter coefficients are determined and then used for that image block. For a monotone block, a 2D low-pass filter with the following coefficients is used.

h(0,0) = 0.547
h(0,1) = h(0,-1)
h(1,0) = h(-1,0) = 0.227

For a block which has a distinct edge direction, a 1D low-pass filter which is parallel to the direction of the edge is used. The filter coefficients are shown below.

h(0) = h(4) = 0.036, h(1) = h(3) = 0.282, h(2) = 0.363
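The counting and classification steps of Section 2.1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the block is assumed to be a list of 8 rows of 8 non-negative values, and the assignment of the single-counter cases to the 0° and 90° labels (as well as the use of "not greater than m" for the small-counter case) is an assumption, since the extracted text does not preserve those conditions reliably.

```python
def _vote(a, b, threshold):
    """Return +1, -1 or 0 for one pair of adjacent pixels a -> b."""
    avg = (a + b) / 2
    if avg == 0:                      # both pixels are zero: no change
        return 0
    ratio = (b - a) / avg
    if ratio > threshold:
        return 1
    if ratio < -threshold:
        return -1
    return 0


def classify_block(block, threshold=0.2, m=6):
    """Classify an 8x8 block as 'monotone', '0', '45', '90' or '135'."""
    k = l = 0
    for i in range(8):
        for j in range(7):
            # K: horizontally adjacent pixels inside row i
            k += _vote(block[i][j], block[i][j + 1], threshold)
            # L: vertically adjacent pixels inside column i
            l += _vote(block[j][i], block[j + 1][i], threshold)
    big_k, big_l = abs(k) > m, abs(l) > m
    if not big_k and not big_l:
        return "monotone"
    if big_k and big_l:
        # equal signs -> 45 degrees, opposite signs -> 135 degrees
        return "45" if (k > 0) == (l > 0) else "135"
    # Assumed mapping: only K large -> 90 degrees, only L large -> 0 degrees
    return "90" if big_k else "0"
```

With T = 0.2 and m = 6, a block whose left half is dark and right half is bright yields K = 8 and L = 0, so only the K counter exceeds m.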

2.2 Scan Conversion

In order to convert the display format between non-interlaced and interlaced, or to convert the image size, scan conversion is required. In scan conversion, there are two methods: spatial conversion and temporal conversion. In the proposed algorithm, only spatial conversion is considered, for two reasons. First, temporal conversion requires a large memory for storing input images. Second, the quality of a spatially converted image is better than that of a temporally converted image[14].

The simple way to interpolate an image is to average two values independent of image patterns. However, this algorithm results in low-quality images. In this paper, a modified new algorithm is used to improve the quality of the interpolated image. To prevent the image quality degradation which occurs in scan conversion, a visual weight is considered, and the weight is given based on the spatial correlation between neighboring pixels. Here, the weight is given according to Weber's law, which reflects the fact that human eyes are more sensitive to luminance changes at low luminance[15,16].

Figure 1: The Pixel Group used in Interpolation

The simplest algorithm for the interpolation is to reuse the same data several times, or to calculate the average value of two neighboring data. But this algorithm has some problems. Generating the average value of only two neighboring data may break the edge direction of the image, since this method does not consider the image pattern. Therefore, in order not to break the edge direction of the image, this paper presents a new algorithm that considers the six-neighboring pixel group shown in Figure 1. The newly interpolated data $y$ is generated by considering its six neighboring pixels. The pixel $y$ is interpolated by calculating the average of the pixel pair whose visual weight is the smallest among the three pixel pairs $\{x_1, x_6\}$, $\{x_2, x_5\}$, and $\{x_3, x_4\}$. The value of the visual weight $\beta_{i,j}$ can be calculated by the following equation,

where $(i,j) \in \{(1,6), (2,5), (3,4)\}$.

The smallest $\beta$ means that the visual difference of the two pixels is the smallest, and the direction of these two pixels represents the edge direction of the selected image pixel group. If the same $\beta$ values are calculated among the three cases, the sum $x_i + x_j$ is calculated, and the $\beta$ which has the largest sum is selected according to Weber's law. As shown in the following equation, the interpolated data is only the average value of two pixels, but this value conserves the information of the edge direction very well.

$$y = \frac{x_i + x_j}{2}$$

where the $(x_i, x_j)$ pair has the smallest $\beta_{i,j}$.
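The pair selection described above can be sketched as follows. The visual weight formula is not recoverable from this copy of the text, so the sketch assumes that the weight of a pair is simply the absolute difference of its two pixels, and that ties are broken in favor of the pair with the larger sum; both are assumptions chosen to match the subtract, add, compare and select blocks described later for the interpolation module. The function name is hypothetical.

```python
def interpolate_pixel(x1, x2, x3, x4, x5, x6):
    """Interpolate y from its six neighbours (x1..x3 above, x4..x6 below).

    Assumption: the visual weight of a pair is |a - b|; on a tie the
    pair with the larger sum (higher luminance) is preferred.
    """
    pairs = [(x1, x6), (x2, x5), (x3, x4)]
    a, b = min(pairs, key=lambda p: (abs(p[0] - p[1]), -(p[0] + p[1])))
    # Averaging is a 1-bit right shift of the sum, as in the hardware
    return (a + b) >> 1
```

For example, `interpolate_pixel(10, 200, 50, 52, 90, 240)` picks the pair (50, 52), whose difference is smallest, and returns 51.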

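$$\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1 & 1.576 & 0 \\ 1 & -0.477 & -0.227 \\ 1 & 0 & 1.826 \end{bmatrix} \begin{bmatrix} Y \\ U \\ V \end{bmatrix}$$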

2.4 γ Correction

It is very difficult to reproduce an image which is exactly identical to the original image. The reason is that the transfer function between the camera and the monitor has nonlinear characteristics. That is, when the intensity of light in a CRT is $I$, the input voltage is $v$, and the exponent is $\gamma$, $I$ and $v$ have the following relation:

$$I = I_0 + c\,v^{\gamma}$$

In the above equation, $I_0$ is the intensity of light in the case of black, and $c$ is the constant gain coefficient.
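Because the relation above is a power law, its compensation can be precomputed for every input code, which is in the spirit of the precalculated results mentioned in the Introduction. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: signals are normalized 8-bit codes, $I_0 = 0$, $c = 1$, and the compensation is the inverse power $v^{1/\gamma}$. The function name and the example value γ = 2.2 are hypothetical.

```python
def build_gamma_lut(gamma, bits=8):
    """Precompute code -> corrected code for the inverse power law.

    Assumes I0 = 0 and c = 1, with codes normalized to the range [0, 1].
    """
    top = (1 << bits) - 1
    return [round(top * (code / top) ** (1.0 / gamma))
            for code in range(top + 1)]


lut = build_gamma_lut(2.2)   # 256-entry table, one entry per input code
```

Applying the display nonlinearity to a table entry approximately recovers the original code, which is the purpose of the correction.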



Because the γ value of the camera and that of a CRT are different, correct color reproduction of the image on a CRT is extremely difficult. In order to solve this problem, the γ value should be transmitted with the image signal. In the receiver, the correct reproduction of the original color is possible by means of the following equation.

3 Architecture

Figure 2 shows the total architecture of the post-processor in HDTV. In Figure 2, the YUV format image signals are the input signals, and the RGB format image signals are the processed results for the display. The control signals are Mode, Vertical Sync, and Frame Sync. The Mode signal is used to represent the type of input signals; it is explained in detail in Table 1. The Vertical Sync and Frame Sync signals are used to synchronize the display images. First, the blocking effect of the input Y signal is removed and the input U and V signals are upsampled. Second, according to the Mode signal, the processed image signals are deinterlaced or format-converted. If neither the deinterlacing operation nor the format-conversion operation is required, the input signals are bypassed to the next stage. Third, the YUV format signals are converted into RGB format signals. Last, the signals are sent to the γ correction block. In the following subsections, each of the blocks is explained in detail.

Table 1: The type of input signals according to the Mode signal

Mode   Format   Image Size     Scan Method
00     NTSC     640 x 480      Nonprogressive
01     HDTV     1280 x 720     Progressive
10     HDTV     1920 x 1080    Progressive
11     HDTV     1920 x 1080    Nonprogressive

3.1 Blocking Effect Removal

3.1.1 Memory Scheduling and Partitioning

In this architecture, six memory modules are used, and each memory module is composed of ten memory banks. Nine memory banks are used to store input image signals, and one large memory bank is used to store filtered image signals. Each module is used for processing 8 lines of pixels in HDTV, because the edge direction is decided on the basis of an 8 x 8 image block. In Figure 3, a small block means a pixel in an HDTV screen, and the number in each block represents the memory bank number for storing its image signal. A shaded 3 x 3 block represents the memory banks which are used for 2D filtering to produce a filtered result for the heavily shaded pixel. Wherever the filter mask moves, nine image signals can be accessed simultaneously by using the nine memory banks. This solves the problem of the memory access bottleneck.

Figure 3: The memory bank assignments in a HDTV image for blocking effect removal

Figure 2: The architecture of the post-processor in HDTV
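The property that any position of the 3 x 3 mask touches nine different banks can be checked with a small sketch. The actual bank numbering of Figure 3 is not recoverable here, so the modulo-3 assignment below is an assumption that merely has the same property; the function names are hypothetical.

```python
def bank_index(row, col):
    """Assumed assignment: banks repeat with period 3 in both directions."""
    return 3 * (row % 3) + (col % 3)


def window_banks(row, col):
    """Set of banks touched by the 3x3 mask centred on (row, col)."""
    return {bank_index(row + dr, col + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)}


# Every mask position inside an 8-line strip of 1920 pixels hits 9 banks
conflict_free = all(len(window_banks(r, c)) == 9
                    for r in range(8) for c in range(1920))
```

Since three consecutive rows and three consecutive columns always cover all residues modulo 3, the nine reads never collide on a bank.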

Figure 4 shows the memory scheduling[8]. The states of the memory modules are shown at each time step. In this architecture, 54 1920-byte memory banks and 6 6720-byte memory banks are required. That is, a total of 54 x 1920 + 6 x 6720 ≈ 141 (Kbytes) of memory is used.

Figure 4: A memory scheduling for blocking effect removal

where,
Op1: writes input data to memory banks
Op2: decides the edge direction of a block
Op3: filters the horizontal border of a block
Op4: filters the vertical border of a block
Op5: reads output data from memory banks
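The memory total quoted for the blocking effect removal block can be reproduced with a few lines of arithmetic. The function name and parameter names are hypothetical; the bank counts and sizes are the ones stated in the text, and the conversion assumes 1 Kbyte = 1024 bytes.

```python
def total_bank_memory(small_banks=54, small_size=1920,
                      large_banks=6, large_size=6720):
    """Total bytes used by the input banks and the filtered-output banks."""
    return small_banks * small_size + large_banks * large_size


total_bytes = total_bank_memory()      # 103680 + 40320 bytes
total_kbytes = total_bytes / 1024      # about 140.6, quoted as 141 Kbytes
```

The 54 small banks correspond to six modules with nine input banks each. The 6720-byte size of the large bank also equals 28 boundary pixels times the 240 blocks in an 8-line strip of 1920 pixels, which is an observation made here rather than a statement from the paper.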

3.1.2 Architecture

Figure 5 shows the architecture for removing the blocking effect. First, input image signals are stored into memory banks by a memory mapper for signal input. Then computations are performed by an edge detector to produce counter control signals. The K and L values are counted, and the edge direction of an 8 x 8 image block is determined by a direction detector. The right and left boundary pixels of an 8 x 8 image block are filtered by a horizontal filter, while the upper and lower boundary pixels are filtered by a vertical filter. Finally, the filtered images are sent to an output port by a memory mapper for signal output. In the following, the sub-blocks are explained in detail.

Figure 5: The architecture for blocking effect removal

An edge detector is used to perform the following computations.

$$x_{avg} = \frac{x_{i+1,j} + x_{i,j}}{2}$$

$$\frac{\Delta x_{i,j}}{x_{avg}} = \frac{x_{i+1,j} - x_{i,j}}{x_{avg}} > T$$

To reduce the calculation time, the equation is modified as follows.

$$\frac{2 \times (x_{i+1,j} - x_{i,j})}{x_{i+1,j} + x_{i,j}} > T$$

To optimize the processing time, $0.25\ (= 1/4)$ is selected as $T$ instead of 0.2. The final equation to calculate is presented below.

$$2^3 \times (x_{i+1,j} - x_{i,j}) > (x_{i+1,j} + x_{i,j})$$

The multiplication by $2^3$ means a 3-bit left shift operation, so shift operations can be used instead of multiplications and divisions.

In an edge detector, there are two identical blocks. One produces the L value control signals, and the other produces the K value control signals. A direction detector is used for counting the K and L values, and then for comparing the results based on step 4 of the algorithm described in section 2. After the edge direction of an image block is decided, the results are encoded. The produced encodings are shown in Table 2.

Table 2: The encodings of the edge directions (columns: Encodings, Edge direction)
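The equivalence between the ratio test with T = 0.25 and the shift-based comparison can be checked exhaustively for 8-bit pixel values. The sketch below is an illustration with hypothetical function names; it assumes non-negative integer pixels whose sum is positive, so that multiplying both sides by the sum keeps the direction of the inequality.

```python
def exceeds_threshold_ratio(a, b, threshold=0.25):
    """Reference form: (b - a) / x_avg > T with x_avg = (a + b) / 2."""
    return (b - a) / ((a + b) / 2) > threshold


def exceeds_threshold_shift(a, b):
    """Hardware-friendly form for T = 1/4: a 3-bit left shift and a compare."""
    return ((b - a) << 3) > (a + b)


# Exhaustive comparison over all 8-bit pixel pairs with a non-zero sum
forms_agree = all(
    exceeds_threshold_ratio(a, b) == exceeds_threshold_shift(a, b)
    for a in range(256) for b in range(256) if a + b > 0
)
```

The case for decreasing the counter follows in the same way by testing `((b - a) << 3) < -(a + b)`.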

In the proposed architecture, the filters are the most time-critical blocks. In 2-D 3 x 3 low-pass filtering, 9 image pixels are used at the same time. Since the blocking effect takes place at the boundary of each image block, only the 28 boundary pixels of each 8 x 8 image block are filtered. To realize real-time processing, the calculation processes must be simplified. As shown in Figure 6, floating-point filter coefficients are replaced by integer coefficients. The integer coefficients are formed by multiplying the floating-point filter coefficients by 1024 ($= 2^{10}$). After the calculations, the desired results can be obtained by a 10-bit right shift. To replace the multiplications with shifts, the produced integer coefficients are replaced by the closest $2^n$ ($n = 1, 2, 3, \ldots$) values.
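The replacement of a coefficient by a power of two can be sketched as follows. The masks actually used in Figure 6 are not recoverable from this copy, so the shift amounts computed here only illustrate the rounding rule stated above; the function names are hypothetical, and ties are resolved toward the smaller power as an arbitrary choice.

```python
def nearest_power_of_two_shift(coeff, frac_bits=10):
    """Return n such that 2**n is the power of two (n >= 1) closest to
    coeff * 2**frac_bits."""
    scaled = round(coeff * (1 << frac_bits))
    n = max(1, scaled.bit_length() - 1)        # floor(log2(scaled)), at least 1
    lower, upper = 1 << n, 1 << (n + 1)
    return n if scaled - lower <= upper - scaled else n + 1


def apply_coefficient(pixel, shift, frac_bits=10):
    """Multiply by 2**shift / 2**frac_bits using shifts only."""
    return (pixel << shift) >> frac_bits


shifts = [nearest_power_of_two_shift(c) for c in (0.547, 0.227)]
```

Here 0.547 x 1024 = 560 is closest to 512, and 0.227 x 1024 = 232 is closest to 256, so the two multiplications become left shifts by 9 and 8 bits followed by the common 10-bit right shift.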

Figure 6: Filter Masks. (a) 2-D filter mask (b) Four 1-D filter masks

Figure 7 shows the architecture of a filter. Nine image inputs are used for filtering. In the filters, an amount-selector determines the number of bits to shift according to the edge direction. Input image signals are shifted under the control of the amount-selector outputs. Since the shifting amounts are fixed to several values, the shifting operations can be realized by extracting the required bits from the intermediate signal values and then wiring them into the expected positions of the result values. These processes are used instead of floating-point multipliers to reduce the calculation time.

Figure 7: The architecture of a filter

The nine shifted input signals are added by a Wallace tree[18]. This reduces the number of addition steps and realizes fast addition of many values. After the reduction, the two remaining values are added by a carry-select adder, which is fast for the addition of two values. The result is then shifted 10 bits to the right to compensate for the filter coefficients, which were shifted 10 bits to the left.

Figure 8 shows the architecture of a memory mapper. In the architecture for blocking effect removal, there are seven memory mappers. They differ from each other only in the direction of signals and the algorithms of the counter operations. In the mappers, all the I/O ports are driven by tri-state buffers, and all the tri-state buffers are controlled by a decoder block. The decoder block enables the corresponding tri-state buffers for the memory bank enable ports, the address ports, and the data ports. According to the CLK signal, the counters decide the addresses and the memory banks for signal processing in each time step.

Figure 8: Memory Mappers
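The Wallace-tree summation used in the filter can be modeled in software with carry-save (3:2) compression: three operands are reduced to a sum word and a carry word without propagating carries, and this is repeated until two operands remain. The sketch below is a behavioural illustration for non-negative integers with hypothetical function names; the final `sum` stands in for the carry-select adder, and the grouping order is an assumption rather than the wiring of the actual circuit.

```python
def carry_save(a, b, c):
    """3:2 compressor: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def wallace_sum(values):
    """Reduce operands to two with carry-save levels, then add them.

    Returns (total, number_of_reduction_levels).
    """
    terms, levels = list(values), 0
    while len(terms) > 2:
        reduced = []
        for k in range(0, len(terms) - 2, 3):      # full groups of three
            reduced.extend(carry_save(*terms[k:k + 3]))
        reduced.extend(terms[len(terms) - len(terms) % 3:])  # leftovers
        terms, levels = reduced, levels + 1
    return sum(terms), levels                      # final two-operand add
```

For nine operands the operand count goes 9, 6, 4, 3, 2, so four carry-save levels precede the single carry-propagating addition.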

3.2 Scan Conversion

3.2.1 Memory Scheduling and Partitioning

To explain the pipelined parallel processing and memory partitioning, scan conversion from 640 x 480 nonprogressive NTSC signals to 1280 x 720 progressive HDTV signals will be considered. Figure 9 shows the assignments of memory banks on the HDTV screen for deinterlacing. Memory is partitioned into 36 banks such that the OP1, OP2, and OP3 operations in Figure 10 can be executed simultaneously. This concurrent execution of several operations makes real-time deinterlacing possible, and the memory partitioning reduces the required memory size. In Figure 9, the squares represent the memory banks that store image data and pixels, the numbers in the squares are the memory bank numbers, and the shaded banks are used to store the pixels generated by the interpolation, while the other, bright squares are used to store original signals.

Figure 9: The memory bank assignments in a HDTV image for deinterlacing

Figure 10 shows the pipeline scheduling for deinterlacing. In each step, all operations are executed simultaneously, because no two operations need to access the same memory bank at the same time. OP2 can start its execution only after all the data from the first two lines are read. Therefore, a wait operation is necessary in the step preceding the first execution of OP2. Of course, in subsequent steps the wait operation is not necessary, and all operations can be executed independently.

Figure 10: The memory scheduling for deinterlacing

where,
Op1: writes input data to memory banks
Op2: generates data from vertical interpolation
Op3: reads output data from memory banks

Figure 11 shows the assignments of memory banks on the HDTV screen for the format conversion from 640 x 480 size to 1280 x 720 size. Memory is partitioned into 60 banks. Of course, the assignments of memory banks must be changed according to the input data type, since the conversion ratios in the vertical and horizontal directions are different. Considering pixels 0 to 9, first, the data in pixels 3, 4, 5 are generated by the vertical interpolation of OP2. For example, pixel 4 is vertically interpolated by considering pixels 0, 1, 2, 6, 7, 8. The data in pixel 9 are generated by the horizontal interpolations of OP3 and OP4. The pixel 9 located in the second line is horizontally interpolated by OP3 by considering pixels 0, 1, 3, 4, 6, 7. Pixel 9 in the next line is filled by OP4 by considering the eight pixels 6, 7, 10, 11, 13, 14, 16, 17, together with pixel 19 in the next line. Therefore, OP4 generates two data simultaneously by considering eight pixels.

Figure 11: The memory bank assignments in a HDTV image for format conversion

Figure 12 shows the pipeline scheduling for the format conversion. OP2 is executed only once while the other operations are executed twice, as a result of the assignments of memory banks. Therefore, in each even-numbered step, OP2 is a null operation. In addition, a wait operation is required in the step preceding the first OP4, to wait until OP3 has been executed twice.

Figure 12: A memory scheduling for format conversion

3.2.2 Architecture

This section proposes an efficient architecture to implement the new algorithm. The scan conversion block consists of two functional blocks, DM (Deinterlacing Module) and FCM (Format Conversion Module). DM makes progressive signals out of nonprogressive signals. FCM transforms the general TV screen size (generally 640 x 480) into the HDTV screen size (1920 x 1080 or 1280 x 720). These two functional blocks generate new data by interpolating the neighboring pixels. In this architecture, Y, U, and V are 8 bit input signals, which represent color information. The Control signal decides the characteristics of the input signals: whether they are HDTV signals or NTSC signals, and progressive signals or nonprogressive signals. If the input signal is a progressive HDTV signal, the scan conversion block carries out no operation and sends out the input signal through the bypass path.

Figure 13: The architecture for scan conversion block

The structure of DM and FCM in the scan conversion block is shown in Figure 13. The Memory Address Generator is a functional block which generates the addresses of memory cells. The number of IMs (Interpolation Modules) differs with each operation. In the deinterlacing operation, since the number of original image data equals the number of newly generated data, only one IM is required. In the format conversion, since the number of newly generated data is greater than that of the original data, several IMs are necessary for the simultaneous interpolation of more than one pixel. Memory Banks are used to store the input signals used in the interpolation and to store the newly calculated data. After the interpolation operations, all the data stored in these memory banks are sent out in the sequence of the image display. To access several memory cells simultaneously in the pipelined parallel processing and to reduce the total memory size, the memory is partitioned into a number of memory banks[19,20]. The Control Unit controls all the other operation units to prevent them from colliding with one another in the parallel processing.

Figure 14: The architecture for interpolation blocks

The structure of IM is shown in Figure 14. SAM (Subtraction/Abstraction Module) represents the block which generates the absolute value of the difference between two input signals. The MUX chooses the appropriate sum of an input pair among the three sums according to the result of the comparisons. A 1 bit right shifter generates the average value of the two selected input signals, as a value shifted right by 1 bit is equal to the value divided by 2.

3.3 YUV/RGB Conversion

To realize the YUV/RGB conversion, seven multiplications and four additions are required. During this calculation process, floating-point operations would be required because the matrix coefficients are floating-point numbers. In this architecture, however, integer numbers are used instead of floating-point numbers to reduce the calculation time. First, the matrix coefficients are modified by multiplying them by 1024 ($= 2^{10}$), as shown in the following.

$$\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} = \begin{bmatrix} 1024 & 1614 & 0 \\ 1024 & -488 & -232 \\ 1024 & 0 & 1870 \end{bmatrix} \begin{bmatrix} Y \\ U \\ V \end{bmatrix}$$

Figure 15: The architecture of a YUV/RGB conversion

3.4 γ Correction

Since the equation mentioned in section 2.4 includes divisions and exponential calculations, it is

Table 3: The synthesis results of the post-processor

Block                     Size (cells)   Critical path delay   RAM
Gamma Correction          1,213          9.51 (ns)             none
Total (Post-processor)    139,982        20.28 (ns)            201 (KB)

In the Scan Conversion Block, the number of cells was 53,368, the critical path delay was 19.86 (ns), and the sum of the used RAM sizes was 60 (Kbytes). In the YUV/RGB Conversion Block, the number of cells was 4,123, and the critical path delay was 18.28 (ns). In the Gamma Correction Block, the number of cells was 1,213, and the critical path delay was 9.51 (ns). As a result, the total number of cells was 139,982, which is the sum of the numbers of cells used by the four functional blocks. The critical path delay of the post-processor was 20.28 (ns), which is the largest value among the four critical path delays. Since the post-processor is implemented using the pipeline strategy, the largest path delay among the functional blocks is the critical path delay of the post-processor. The sum of used RAM sizes was 201(K