Realisation of Radar Signal Processor on Graphic Processor Unit (GPU) and Performance Comparison with GPP based Platform

Bibhabasu Mondal, Peter Joseph Basil Morris, Saikat Roy Chowdhury
Electronics and Radar Development Establishment,

Defence Research & Development Organisation, Bangalore, India.

Abstract— This paper presents a methodology for realizing Radar Signal Processing on Graphic Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) environment, leveraging the inherent parallelism offered by radar signal processing algorithms. The realization of typical radar functionalities, viz. MTI, Doppler filtering, CFAR, centroiding and monopulse, on a GPU platform is described. The paper brings out the total setup used for the realization and the end-to-end performance evaluations achieved. It also recounts the different steps involved in realizing such an application on a GPU based platform and points out the issues that must be considered for an optimized realization. The performance obtained was compared with the same application running on a multiprocessor based platform. In conclusion, the work clearly brings out the fact that an optimized parallel implementation leads to a drastic reduction in computation time with respect to an implementation of the same on a typical multi-core processor architecture.

Keywords— Parallel Processing, CUDA, Radar Signal Processing

I. INTRODUCTION

Radar Signal Processing represents a complex task involving a plethora of advanced signal processing techniques and intense computational effort. Though the former has matured, the latter still presents a profound challenge: the computational load of modern radar signal processors is tremendous. Moreover, most applications demand real-time processing of radar data under ever-present space constraints. The gamut of radar signal processor hardware ranges from general purpose hardware (PCs, workstations or mainframes) and application specific hardware such as multi-core processors to reconfigurable computing platforms such as field programmable gate arrays (FPGAs). Modern radar signal processing is a data parallel operation that benefits from parallel processing architectures, which has brought traditional high performance computing platforms, such as clusters of vector or scalar processors and multi-core processors, into the limelight. Multi-core processors were recently established as the most popular CPU architecture for Radar Signal Processing, and since then a steady transition to many-core systems has been observed. The most promising of these high performance architectures are the Graphic Processing Units (GPUs), which are capable of leveraging hardware multithreading and Single Instruction Multiple Data (SIMD) or Single Instruction Multiple Thread (SIMT) execution schemes, leading to impressive levels of performance on data parallel applications. Since the SIMD execution model operates on multiple data, data parallelism is the key concept in leveraging the power of the GPU. GPUs are inexpensive and are evolving towards a general purpose parallel architecture, and implementations of general purpose applications on GPU platforms have increased in recent years. However, maximum performance from a GPU calls for creative algorithm design. With the introduction of NVidia's Compute Unified Device Architecture (CUDA) framework, a C-language development environment for NVidia GPUs, general purpose GPU computation has become simpler, faster and more programmer friendly. This work presents the implementation of a Radar Signal Processing Chain on an NVidia GPU, focusing on the real-world benefits of GPUs for Radar Signal Processing.

II. OVERVIEW OF RADAR SIGNAL PROCESSING CHAIN

The radar signal processing chain selected for implementation is shown in Fig. 1. For ease of implementation, single channel data (namely the SUM channel) is selected for processing. The cluster of high rate samples from one pulse may be viewed as being stored in a single row and layer of a structure called the radar data cube, as shown in Fig. 2; the cluster of samples taken from the next pulse forms the next row of the structure. For the current signal processing chain, since a single channel is of interest, the data cube degenerates into a 2-D data matrix called the Range Doppler matrix, whose dimensions are the number of range cells (or range bins) and the number of pulses, as shown in Fig. 3. The various signal processing functions are well studied and are explained in [1].
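As a concrete illustration, the single-channel data might be laid out as in the sketch below (the type name and the row-major convention are our assumptions, not the paper's):

```cuda
/* Degenerate single-channel radar data cube: one row per pulse, one column
 * per range bin, with the samples of one pulse contiguous in memory.      */
#include <cuComplex.h>

typedef struct {
    int numPulses;           /* slow-time dimension (rows)    */
    int numRangeBins;        /* fast-time dimension (columns) */
    cuFloatComplex *data;    /* numPulses * numRangeBins samples, row-major */
} RangeDopplerMatrix;

/* Complex sample for pulse p and range bin r:            */
/*   cuFloatComplex s = m.data[p * m.numRangeBins + r];   */
```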

Figure 1 : Radar Signal Processing Chain

Figure 2 : Radar Data Cube

Figure 3 : Radar Data Matrix

III. GPU PROGRAMMING AND CUDA OVERVIEW

CUDA is based on the Single Instruction Multiple Threads (SIMT) programming model, which exhibits close similarities to Single Instruction Multiple Data (SIMD): a single code instruction is executed by different threads on multiple data. In the CUDA model [4], a kernel is the central data parallel function that is applied to chunks of data. Each kernel is executed in parallel by several threads, which are arranged in blocks that may contain up to 512 threads in 1-D, 2-D or 3-D layouts. Blocks themselves are arranged in 1-D, 2-D or 3-D within one grid that occupies the GPU. Threads and blocks are identified by their unique thread-id (threadIdx) and block-id (blockIdx), through which each thread calculates which chunk of data it is to operate on. Shared memory is used for intra-block communication; no inter-block communication or specific schedule-ordering mechanism for blocks or threads is provided, which allows each thread block to run on any multiprocessor.

The host (i.e., CPU) and the device (i.e., GPU) maintain their own address spaces, called the host memory and the device memory, and communicate by copying data between the two. Transferring data from the host memory to the device memory is therefore a time consuming process: the GPU remains stalled until the entire batch of data has been transferred from the host memory and is available in the device memory. Moreover, by default threads operate on data stored in the GPU global memory, which features large latency. Prudent kernel design and implementation are therefore called for in order to minimize the number of global memory reads, by using the alternate memories and the concurrent data transfers available. To achieve maximum efficiency of the CUDA computation, the following guidelines were adopted during the algorithm design and implementation stages (a minimal kernel sketch following these conventions is given below):

• Threads were identified corresponding to the algorithms that were computationally intensive (e.g., data conversions, FFT, complex transpose, etc.)
• Kernels were written to execute these threads, with each kernel code executing the actions of a single thread only
• Prudent memory usage and data transfers were adopted to reduce the inherent memory latency due to host-device memory transfers
• Thread blocks were assigned dynamically to the parallel processing units of the GPU, enabling efficient utilization of available resources
• Computational density was leveraged with fewer memory accesses

Figure 4 : CUDA Programming Model
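As an illustration of this indexing scheme, the sketch below shows a hypothetical elementwise kernel (the name, signature and use of cuComplex.h are our assumptions, not the paper's) in which each thread derives a unique global index from its block and thread ids and converts one short (I,Q) sample pair to complex float, anticipating the first stage of the chain described in Section IV:

```cuda
#include <cuda_runtime.h>
#include <cuComplex.h>

/* Each thread computes one output element: the global index is derived
 * from blockIdx and threadIdx exactly as described in the text.        */
__global__ void shortToComplexFloat(const short *i_in, const short *q_in,
                                    cuFloatComplex *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  /* unique global index */
    if (idx < n)  /* guard: the grid may contain more threads than samples   */
        out[idx] = make_cuFloatComplex((float)i_in[idx], (float)q_in[idx]);
}

/* Host-side launch: 256 threads per block, grid rounded up to cover n: */
/*   shortToComplexFloat<<<(n + 255) / 256, 256>>>(d_i, d_q, d_out, n); */
```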

IV. IMPLEMENTATION CONSIDERATIONS

A. Radar Signal Processing: Pulse-Range Dependencies

The dimension along which the radar data is processed is of prime concern in obtaining optimal performance on any hardware architecture. Radar data being inherently two dimensional (range gate and pulse dimensions), the data has to be rearranged along the required processing dimension for each specific algorithm. The current implementation fixes the block size at 256 threads for better optimization. The mapping of CUDA threads to the Range Doppler matrix is shown in Fig. 5. The kernel executes computations along the range dimension for CFAR and along the pulse dimension for Doppler processing; each thread executes computations for a single range gate or pulse, as the case may be. To avoid the memory latency due to host-device memory transfers, a single transfer to the device memory is performed at the start, the rest of the computations are carried out on the device, and only the final results or detection reports are transferred back to the host memory. The typical processing schemes for the chain are as follows:

Short to Float Conversion and Complex Data Storage: Radar signal processing operations are performed on the float data type, but since the radar input data is of short type, a conversion from short to float becomes necessary. This block performs the short to float conversion and stores the float data in complex format, which is required for further optimized processing.

Moving Target Indicator (MTI): The processing is performed along the pulse dimension. For a given range bin, the complex sample of the previous pulse is subtracted from that of the current pulse, and the process is repeated for each range cell. To leverage the SIMT architecture of the GPU, each thread performs MTI processing on one range bin across all the pulses; hence the total number of threads required equals the number of range bins. Since the number of threads in a block is fixed at 256 (for better optimization), the number of independent blocks required is (number of range cells)/256, which is assigned dynamically. The corresponding thread mapping is shown in Fig. 6. The storage of radar data in complex format ensures a further reduction in computation time: since the threads operate on the complex (I,Q) data rather than independently on the I and Q data, the number of threads required for the computation is halved, providing better utilization of resources and reduced computation time (a minimal kernel along these lines is sketched below).

No. of threads = 256
No. of blocks = number of range cells / 256

Figure 5 : Mapping of CUDA Threads to Range Doppler Matrix
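A kernel realizing the MTI mapping above might look as follows (the kernel name and buffers are hypothetical; cuCsubf() from cuComplex.h performs the complex subtraction):

```cuda
#include <cuComplex.h>

/* One thread per range bin: each thread walks the pulse dimension of its
 * range bin and subtracts the previous pulse's complex sample (a single
 * delay-line canceller). The output has numPulses - 1 pulses.            */
__global__ void mtiKernel(const cuFloatComplex *in, cuFloatComplex *out,
                          int numPulses, int numRangeBins)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;  /* range bin for this thread */
    if (r >= numRangeBins) return;

    for (int p = 1; p < numPulses; ++p) {
        cuFloatComplex cur  = in[p * numRangeBins + r];
        cuFloatComplex prev = in[(p - 1) * numRangeBins + r];
        out[(p - 1) * numRangeBins + r] = cuCsubf(cur, prev);  /* pulse difference */
    }
}

/* Launch with 256-thread blocks and ceil(numRangeBins/256) blocks, as in the text. */
```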

Pulse Doppler Processing: Doppler processing involves explicit spectral analysis of the pulse-dimension data for each range bin; the processing dimension for this stage is along the pulse. The process involves computing independent Fast Fourier Transforms (FFTs) on each column of the MTI-processed Range Doppler matrix, with the thread mapping shown in Fig. 6. The FFT operation is accomplished in batches using CUFFT, NVidia's CUDA Fast Fourier Transform library [3]. Zero padding is applied, using customized CUDA kernels, to batches whose sizes are not powers of two. The result of Doppler processing is a matrix with fast time (number of range bins) and Doppler frequency as its dimensions.

Figure 6 : Thread Mapping for Doppler Processing
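A possible realization of this batched column-wise FFT is sketched below. The function name and data layout (row-major, pulse-major) are our assumptions; the sketch uses the advanced-layout cufftPlanMany() API of later CUFFT releases, whereas with the CUFFT 1.1 cited in [3] one would typically transpose the matrix and use a batched cufftPlan1d() instead.

```cuda
#include <cufft.h>

/* Batched FFT along the pulse dimension: one numPulses-point transform per
 * range bin. With row-major storage, the samples of one column are spaced
 * numRangeBins apart (stride) and consecutive columns start 1 apart (dist). */
void dopplerFFT(cufftComplex *d_mti, cufftComplex *d_doppler,
                int numPulses, int numRangeBins)
{
    cufftHandle plan;
    int n[1] = { numPulses };                 /* FFT length = number of pulses */

    cufftPlanMany(&plan, 1, n,
                  n, numRangeBins, 1,         /* input:  stride, distance */
                  n, numRangeBins, 1,         /* output: same layout      */
                  CUFFT_C2C, numRangeBins);   /* one batch per range bin  */

    cufftExecC2C(plan, d_mti, d_doppler, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```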

CFAR (Constant False Alarm Rate) Detection: Also referred to as adaptive threshold detection, CFAR is used to provide predictable detection and false alarm behavior in realistic interference scenarios. Cell averaging 2-D CFAR is the algorithm employed, with processing along both the range bin and pulse dimensions. The implementation assumes a 2-D mask with the same dimensions as the Range Doppler matrix, containing a window of predefined size that includes a specified number of reference cells on the left and right together with some guard cells. A running sum is performed across these cells, estimating the background information corresponding to the cell under test. The CUDA CFAR kernel computes the running sum for each of the cells under test along the range dimension in parallel. The running sum is implemented by convolving the Range Doppler matrix with the mask matrix, accomplished by multiplying the FFTs of the Range Doppler matrix and the mask matrix and then performing the inverse FFT, using the CUFFT library functions. A CUDA kernel then compares the background information matrix with the Range Doppler data matrix, with the comparison performed in parallel. The output of the comparison is a matrix of zeros and ones, the ones indicating the presence of a target. The use of the CUFFT library and the fully parallel comparison add to the efficiency of the CFAR implementation.
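The FFT-multiply-inverse-FFT pipeline described above might look like the following sketch (kernel and function names are hypothetical; in practice the mask FFT would be computed once and reused):

```cuda
#include <cufft.h>
#include <cuComplex.h>

/* Pointwise product of two spectra, with the 1/N scaling that CUFFT's
 * unnormalized inverse transform requires.                              */
__global__ void pointwiseMulScaled(const cufftComplex *a, const cufftComplex *b,
                                   cufftComplex *c, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex p = cuCmulf(a[i], b[i]);
        c[i] = make_cuFloatComplex(scale * p.x, scale * p.y);
    }
}

/* Background estimate = IFFT( FFT(data) .* FFT(mask) ), i.e. the circular
 * convolution of the range Doppler matrix with the CFAR mask.            */
void cfarBackground(cufftComplex *d_data, cufftComplex *d_mask,
                    cufftComplex *d_bg, int rows, int cols)
{
    cufftHandle plan;
    int n = rows * cols;

    cufftPlan2d(&plan, rows, cols, CUFFT_C2C);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  /* FFT of data (in place) */
    cufftExecC2C(plan, d_mask, d_mask, CUFFT_FORWARD);  /* FFT of mask (in place) */

    pointwiseMulScaled<<<(n + 255) / 256, 256>>>(d_data, d_mask, d_bg, n, 1.0f / n);

    cufftExecC2C(plan, d_bg, d_bg, CUFFT_INVERSE);      /* background estimate */
    cufftDestroy(plan);
}
```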

Magnituding: This process involves finding the magnitudes of the complex (I,Q) outputs of Doppler processing and is accomplished using custom CUDA kernels.

Monopulse: The CFAR detection output matrix is processed along the pulse dimension. Monopulse requires that, for each range cell, a maximum filter seek operation be performed, yielding the maximum filter value as well as its index. If it represents a detection, the same operation is repeated on the corresponding range cells of the azimuth and elevation channels, and the MTI and FFT operations are performed on the pulses for those specific range cells.
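The maximum filter seek might be realized as in the sketch below (names are hypothetical; one thread scans the Doppler filters of one range cell, matching the one-thread-per-range-gate mapping used elsewhere in the chain):

```cuda
/* For each range cell, find the maximum Doppler filter output and its index. */
__global__ void maxFilterSeek(const float *mag, int numFilters, int numRangeBins,
                              float *peakVal, int *peakIdx)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   /* range cell for this thread */
    if (r >= numRangeBins) return;

    float best = mag[r];                             /* filter 0 of this range cell */
    int bestK = 0;
    for (int k = 1; k < numFilters; ++k) {
        float v = mag[k * numRangeBins + r];
        if (v > best) { best = v; bestK = k; }
    }
    peakVal[r] = best;                               /* maximum filter value */
    peakIdx[r] = bestK;                              /* ... and its index    */
}
```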

B. Optimizations

Optimization forms the most critical and essential phase in designing parallel procedures so as to obtain close to hardware-limited performance. The absence of a linear dependence between performance and block size makes it difficult to determine the optimum execution configuration of a GPU based application. The main parameters considered for optimization are the block size (B), i.e., the number of threads; the number of range bins (R); and the number of blocks forming a grid (G). Thus G = R / B; if G is not an integer, it takes the value of the next higher integer, which makes the number of threads executing a kernel exceed the number of range bins processed. The optimization strategies employed include:

a) Memory Optimizations: Since global memory access requires more clock cycles, data memory management on a GPU device is an important factor in achieving optimal performance. Since data transfer between the host and the device using cudaMemcpy() consumes more time than the execution of the GPU kernel itself, it is imperative to minimize such transfers. The present implementation ensures minimal data transfer between the host and the device, with much of the data made to reside in the device memory. As brought out in Section III, GPU processing is stalled until the entire data set has been transferred from CPU memory to GPU memory. Applications like radar involve a continuous flow of incoming data, where waiting for data transfers is not feasible. The overhead of the data transfer time can be overcome by using the overlap mechanism (asynchronous data transfer and execution) of the GPU: with asynchronous transfers using cudaMemcpyAsync(), it is possible to overlap host computation with data transfers and with device computations. The final memory optimization employed in the current implementation is "concurrent copy and execute", whereby kernel executions are overlapped with data transfers between host and device: the data is broken into chunks and transferred in multiple stages or streams, and multiple kernels are launched to operate on each chunk of data as it arrives.
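The chunked, multi-stream pattern might look like the sketch below (NCHUNKS, the kernel processChunk and the even division of n are our assumptions; the host buffer must be allocated with cudaMallocHost() for the copies to be truly asynchronous):

```cuda
#include <cuda_runtime.h>
#include <cuComplex.h>

__global__ void processChunk(cuFloatComplex *d, int n) { /* placeholder kernel */ }

#define NCHUNKS 4

void pipelined(const cuFloatComplex *h_in, cuFloatComplex *d_in, int n)
{
    cudaStream_t s[NCHUNKS];
    int chunk = n / NCHUNKS;                 /* assume n divisible by NCHUNKS */

    for (int i = 0; i < NCHUNKS; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < NCHUNKS; ++i) {
        int off = i * chunk;
        /* The copy of chunk i overlaps with the kernel working on chunk i-1. */
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(cuFloatComplex),
                        cudaMemcpyHostToDevice, s[i]);
        processChunk<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d_in + off, chunk);
    }

    for (int i = 0; i < NCHUNKS; ++i) {
        cudaStreamSynchronize(s[i]);         /* wait for copy + kernel in stream i */
        cudaStreamDestroy(s[i]);
    }
}
```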

b) Execution Configuration Optimizations: An important aspect that must be taken into account once a CUDA code is ready to run is the multiprocessor occupancy. Threads are scheduled for parallel execution in scheduling units called "warps". The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU; an occupancy of 100 percent indicates that the application fully exploits the multiprocessor resources. Optimization involves choosing the thread and block sizes based on shared memory and register requirements. In the current implementation, the first execution configuration parameter, the number of blocks per grid (the grid size), is chosen so as to keep the entire GPU busy. It is assigned dynamically using the expression G = R/B, where R is the number of range cells and B is the block size (the number of threads per block). The block size is fixed at 256 to ensure that multiple concurrent blocks reside on a multiprocessor: with a maximum of 768 threads per multiprocessor, a selection of 256 threads ensures 100 percent occupancy with three resident active blocks, leading to improved performance.
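Expressed in code, the configuration rule reads as follows (a sketch; the helper name is ours):

```cuda
/* Grid size rounded up so that all R range cells are covered by B-thread
 * blocks: G = ceil(R / B), the "next higher integer" rule from the text. */
static inline int gridSize(int R, int B)
{
    return (R + B - 1) / B;
}
/* E.g., R = 530 range cells, B = 256 threads -> G = 3 blocks. With 768
 * threads per multiprocessor, B = 256 allows 3 resident blocks:
 * 3 * 256 = 768, i.e. 100 percent occupancy.                             */
```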

V. RESULTS

The entire set of algorithms was tested on an NVidia Quadro FX 1800 in a workstation equipped with an Intel Core2Duo E8400 at 3.00 GHz and 4 GB RAM (Table I). The CUDA kernels were tested on two sets of Range Doppler matrix data, 530 x 258 and 1024 x 258, and the results were compared with the same algorithms implemented on a standard vectorized PowerPC core, as tabulated in Tables II and III.

VI. CONCLUSION

In this work, we have demonstrated GPU computing and the CUDA programming environment by bringing forward the step-by-step implementation of a Radar Signal Processing Chain application. The Radar Signal Processing Chain was specifically chosen for the computationally intensive challenges and time constraints it poses in exploiting the benefits of the GPU. We also provided results obtained for the application running on an inexpensive off-the-shelf GPU. As clearly brought out in this work, an optimized parallel implementation has led to a drastic reduction in computation time with respect to an implementation of the same on a typical multi-core processor architecture. The performance achievements obtained clearly demonstrate the immense potential of GPUs for developing high performance signal processing applications.

REFERENCES

[1] M. A. Richards, Fundamentals of Radar Signal Processing, 6th ed. New Delhi, India: Tata McGraw-Hill, 2005.
[2] M. Skolnik, Radar Handbook, 2nd ed. Boston, MA: McGraw-Hill Publishing, 1990.
[3] NVIDIA, CUDA CUFFT Library 1.1.
[4] NVIDIA, CUDA Programming Guide.

BIO DATA OF AUTHOR(S)

Bibhabasu Mondal received his B.Tech in Electronics & Communication Engineering from West Bengal University of Technology and joined the Electronics and Radar Development Establishment (LRDE) in 2007. He is presently working as Scientist 'C' in the field of Radar Signal Processing.

Peter Joseph received his B.Tech in Electronics & Communication Engineering from the University of Kerala in 2008. He joined the Electronics and Radar Development Establishment (LRDE) in 2009 and is working as a Scientist in the field of Radar Signal Processing for Airborne Radar Systems.

Saikat Roy Chowdhury received his B.Tech from NIT Silchar in 2008. He joined LRDE as a Scientist in 2008 and worked from 2008 to 2011 in the field of Synthetic Aperture Radar Signal Processing.