International Journal of Applied Research and Studies (iJARS) ISSN: 2278-9480 Volume 3, Issue 5 (May - 2014) www.ijars.in Research Article

Design, Simulation, Implementation, and Performance Analysis of a Fixed-Point 8-Point FFT Core for Real-Time Application in Verilog HDL

Authors: 1 Bikash Poudel, 2 Manish Bhattrai*, 3 Sandesh Ghimire

Address for correspondence: 1 Asst. Lecturer, Institute of Engineering, Thapathali Campus; 2 Assistant R&D Engineer, Powertech Nepal; 3 Engineer, Nepal Electricity Authority

Abstract - The Fast Fourier Transform (FFT), an efficient and ubiquitous tool for computing the Discrete Fourier Transform (DFT), is popular for transforming a signal from the time domain to the frequency domain. Because the FFT algorithm requires fewer computations than direct evaluation of the DFT, it has been widely used in speech recognition (now found in many applications and products), telecommunication, signal processing, multimedia communication, etc. Designing and implementing floating-point (FP) FFT algorithms on FPGAs is an active research area and remains a challenging task. This paper proposes a new architecture for an FFT core that computes a radix-2 8-point FFT using fixed-point arithmetic in only eight clock cycles. The key feature of the design is that it aims to maintain good performance with the smallest possible footprint. The design is done in Verilog-HDL using the Xilinx ISE 13.2 tool. The processor core has been simulated with the Xilinx ISIM simulator for functional verification, and its FPGA-based implementation has been successfully verified on a Spartan-3E Starter Kit. This paper also includes a brief analysis of the performance of the FFT Core and of the FPGA resources consumed by the designed core. The objective of this work is an area- and time-efficient architecture that can be used as part of a voice processing system.

Keywords: DFT, FFT, FPGA, Xilinx ISE 13.2, Verilog-HDL

*Corresponding Author Email-Id: [email protected]

Manuscript Id: iJARS/820

Introduction

Audio signal processing is a well-developed field that is heavily used these days in telecommunication, multimedia applications, speech recognition for voice-operated systems, etc. In signal processing one usually prefers to work in the frequency domain because of the many advantages it offers, which brings forward the Discrete Fourier Transform (DFT), which converts a signal from the discrete time domain to the discrete frequency domain. The arithmetic complexity of the DFT becomes a significant factor in the overall computational cost of a design. Cooley and Tukey [1] developed the well-known radix-2 FFT algorithm to reduce the computational load of the DFT. Depending on how a set of N inputs is divided into two sets of N/2 numbers, there are two types of radix-2 FFT (Cooley-Tukey) algorithm: decimation-in-time FFT (DIT-FFT) and decimation-in-frequency FFT (DIF-FFT). We have implemented the decimation-in-frequency FFT algorithm. To understand decimation in frequency, we start by writing the definition of the DFT and splitting the sum into the first and second halves of the input:

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} = \sum_{n=0}^{N/2-1} \left( x[n] + (-1)^k\, x[n+N/2] \right) W_N^{nk}, \qquad k = 0, 1, \dots, N-1 \tag{1}$$

where $W_N = e^{-j2\pi/N}$ is the twiddle factor and $W_N^{kN/2} = (-1)^k$.

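For even k, i.e. k = 2m, $(-1)^k = 1$, so

$$X[2m] = \sum_{n=0}^{N/2-1} \left( x[n] + x[n+N/2] \right) W_N^{2mn} \tag{2}$$

For odd k, i.e. k = 2m+1, $(-1)^k = -1$, so

$$X[2m+1] = \sum_{n=0}^{N/2-1} \left( x[n] - x[n+N/2] \right) W_N^{n}\, W_N^{2mn} \tag{3}$$

Using the symbols $g[n] = x[n] + x[n+N/2]$ and $h[n] = x[n] - x[n+N/2]$, and noting that $W_N^{2mn} = W_{N/2}^{mn}$, equations (2) and (3) can be written as

$$X[2m] = \sum_{n=0}^{N/2-1} g[n]\, W_{N/2}^{mn} \tag{4}$$

$$X[2m+1] = \sum_{n=0}^{N/2-1} \left( h[n]\, W_N^{n} \right) W_{N/2}^{mn} \tag{5}$$

for m = 0, 1, ..., N/2 - 1. Thus, we started by dividing the N inputs into two halves. The twiddle factors $W_N^{nk}$ applied to the first and second halves differ only by the factor $W_N^{kN/2} = (-1)^k$. Hence, if k is even the two halves are simply added, and if k is odd the second half is subtracted from the first and the difference is multiplied by the twiddle factor $W_N^{n}$.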


Thus we have regrouped the inputs into two sets with some modifications; the number of distinct twiddle factors is now halved and we have arrived at N/2-point DFTs. Equation (4) can be viewed as the N/2-point DFT of the sequence $x[n] + x[n+N/2]$, and equation (5) can be viewed as the N/2-point DFT of the sequence $\left( x[n] - x[n+N/2] \right) W_N^{n}$. Thus, an N-point DFT can be computed by evaluating two N/2-point DFTs.
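The decimation step described above can be illustrated with a short floating-point sketch in Python. It is not the paper's hardware design; the function names `dft` and `fft_dif` are hypothetical, and the code only assumes the standard DFT definition with $W_N = e^{-j2\pi/N}$. The sums of the two input halves feed the even-indexed outputs and the twiddled differences feed the odd-indexed outputs, and each half is then transformed recursively.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of the DFT definition, used as a reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]

def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT (len(x) must be a power of two)."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    half = n_pts // 2
    # Sum of the two halves: its N/2-point DFT gives the even-indexed outputs.
    g = [x[n] + x[n + half] for n in range(half)]
    # Difference of the halves times W_N^n: its N/2-point DFT gives the odd-indexed outputs.
    h = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_pts) for n in range(half)]
    out = [0j] * n_pts
    out[0::2] = fft_dif(g)   # X[2m]
    out[1::2] = fft_dif(h)   # X[2m+1]
    return out

samples = [100, 200, 300, 400, 500, 0, 0, 0]
for k, value in enumerate(fft_dif(samples)):
    print(k, value)
```

Comparing `fft_dif(samples)` with `dft(samples)` for a few inputs is a simple way to confirm that the split reproduces the direct definition.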

Figure 1: Illustration of N-point FFT using two N/2-point FFT

This process can be continued, and each N/2-point DFT can be computed from two N/4-point DFTs. For N = 8, figure 1 corresponds to the decimation of an 8-point DFT into two 4-point DFTs. Further decimating the 4-point DFTs into 2-point DFTs, we reach the butterfly structure shown in figure 5, which has three stages. For an N-point DFT, the number of stages is log2(N). Thus, this kind of decimation reduces the computational complexity from O(N^2) to O(N log2(N)), since the computation in each stage is of order N.

Proposed Methodology

A. Functional Block Diagram of the Radix-2 8-point FFT Computer

The proposed system, whose functional diagram is shown in figure 2, divides the computation of the FFT into three stages: Input Stage, Compute Stage, and Output Stage. In the Input Stage, eight samples are read from the Analog-to-Digital Converter (ADC) and stored in the 8 × 64-bit Input Buffer, which takes eight clock cycles. The Compute Stage computes the FFT of the eight input samples and generates eight frequency samples in eight clock cycles. The Compute Stage has three blocks: the x-Buffer, which holds the eight input samples from the Input Buffer; the FFT Core, which computes the FFT; and a temporary buffer, which holds the intermediate results. Finally, the Output Stage presents the output at the output ports.

Figure 2: Proposed architecture of the radix-2 8-point FFT.

B. Implementation method of Butterfly Network

The FFT is computed by implementing the butterfly network in a novel and efficient way. Whereas a direct implementation of the butterfly arrangement requires twelve subtracters, twelve adders, and twelve multipliers, the FFT core presented in this paper uses four adders, four subtracters, and four multipliers in order to conserve resources without sacrificing the performance of the network. Had all twelve adders, multipliers, and subtracters been used, the whole design would be a single combinational circuit and the output frequency samples would be generated as soon as a new input is available, i.e. in one clock cycle. However, since only four each of the adders, subtracters, and multipliers are used, the output samples are presented at the output port only at the end of eight clock cycles, because the butterfly network is evaluated under the control of the FSM shown in figure 4, which takes 8 clock cycles to complete.

The architecture is a three-stage pipeline, so there are three independent and concurrent stages: Input Stage, Compute Stage, and Output Stage, as shown in figure 5. The Input Stage takes eight clock cycles regardless of the architecture, since it always takes eight clock cycles to fetch eight input samples from the ADC. Thus, in order to complete the FFT before the next set of input samples is available from the ADC, the Compute Stage has at most seven clock cycles to compute the FFT and the Output Stage has one clock cycle to place the output samples on the output ports, without adding extra cycles to the overall instruction cycle of the core. For this, the Compute Stage has been mathematically divided into three sub-stages, each requiring four adders (A1, A2, A3, and A4), four subtracters (S1, S2, S3, and S4), and four multipliers (M1, M2, M3, and M4), as shown in figure 5. For each of the three sub-stages of the Compute Stage, i.e. Compute Stage I, Compute Stage II, and Compute Stage III, the four adders, four subtracters, and four multipliers are reused through the computational architecture shown in figure 7, with the help of 4-to-1 multiplexers, one input line of which is left unused.
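The resource counts quoted above (twelve units of each kind for a direct implementation, four when they are shared) can be checked by enumerating the butterflies of an 8-point decimation-in-frequency network. The following Python sketch assumes the standard in-place DIF index pattern; the function name `dif_butterfly_pairs` is hypothetical and does not come from the paper.

```python
def dif_butterfly_pairs(n_pts=8):
    """Return, per sub-stage, the (upper, lower) buffer index pair of each butterfly."""
    stages = []
    half = n_pts // 2
    while half >= 1:
        pairs = [(base + n, base + n + half)
                 for base in range(0, n_pts, 2 * half)
                 for n in range(half)]
        stages.append(pairs)
        half //= 2
    return stages

schedule = dif_butterfly_pairs(8)
for number, pairs in enumerate(schedule, start=1):
    print(f"Compute Stage {number}: {pairs}")

# One adder, one subtracter, and one multiplier per butterfly.
total_butterflies = sum(len(pairs) for pairs in schedule)   # direct implementation
shared_units = max(len(pairs) for pairs in schedule)        # units needed when reused per stage
print(total_butterflies, shared_units)
```

Under this assumption each sub-stage contains four butterflies, so four shared units are busy in every sub-stage and none of them sits idle.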

Figure 3: Illustration of how an adder is reused in the various sub-stages of the Compute Stage using a 4-to-1 multiplexer

The crux of the reuse of the adders, subtracters, and multipliers is that the Compute Stage has been divided into three sub-stages, as shown in figure 5, where each sub-stage requires four each of the adders, subtracters, and multipliers. First, in Compute Stage I, the input samples in the x-Buffer are added or subtracted and multiplied as per the butterfly network. The four adders, subtracters, and multipliers generate intermediate results in the x-Buffer, which are used as the inputs to Compute Stage II. Next, in Compute Stage II, the same four adders, subtracters, and multipliers are reused to compute the intermediate results that serve as inputs to Compute Stage III, as shown in figure 5. Finally, in Compute Stage III, the same four adders, subtracters, and multipliers are used again to generate the final output frequency samples.

The reuse of the four adders, four subtracters, and four multipliers can be illustrated with the help of figure 3. The same set of components is reused by placing multiplexers on the input lines of each component to select different inputs in different sub-stages. Suppose the core is in the COMPUTE STAGE I state of the FSM shown in figure 4, which corresponds to Compute Stage I of figure 5. The adder A1 has to add input samples x[0] and x[4], as dictated by the butterfly network of figure 5. This is done by sending 2'b00 from the controller FSM to the multiplexer connected to the second port of Complex Adder A1, which makes the multiplexer feed x[4] to the adder, as shown in figure 3. The resulting sum evaluated by Complex Adder A1 is stored in x[0] of the x-Buffer, which previously contained the first sample from the ADC, in the WRITE BACK I state.

In the COMPUTE STAGE II state of the FSM, which corresponds to Compute Stage II of figure 5, the same Complex Adder A1 has to add the intermediate samples x[0] and x[2]; the controller FSM sends the select value 2'b01 to choose x[2] as the second input of the Complex Adder. The sum is written back to the x-Buffer in the WRITEBACK STAGE II state. Finally, the same Complex Adder A1 has to sum the intermediate samples x[0] and x[1] in the COMPUTE STAGE III state of the FSM in figure 4, which corresponds to Compute Stage III of figure 5, thus generating the output frequency sample X[0]. The first input to the Complex Adder is again x[0], but the second input is x[1], which is selected by the multiplexer when the controller FSM sends 2'b10 on the select line. The output sample X[0] is presented at the output port in WRITEBACK STAGE III.

Another important point about this design is that the same x-Buffer, which initially holds the input samples from the ADC, is also used to hold the intermediate results of the sub-stages of the Compute Stage, as shown in figure 5. The complete architecture of the core is shown in figure 7, which shows how the multiplexers are incorporated at the input ports of the adders, subtracters, and multipliers to select a different set of inputs in each sub-stage of the Compute Stage.
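The reuse scheme described above can be modelled in a few lines of Python. This is a floating-point behavioural sketch under stated assumptions, not the paper's Verilog: the function names are hypothetical, the select codes 0b00, 0b01, and 0b10 are taken to pick the partner offsets 4, 2, and 1 (matching the x[4], x[2], x[1] operands given for adder A1), and the twiddle factors follow the standard DIF pattern. Results are written back into the same buffer after each sub-stage, so the outputs end up in bit-reversed order in this model; the ordering used at the core's output ports is not stated in the text.

```python
import cmath

def fft8_time_multiplexed(samples):
    """Three compute/write-back passes over one x-buffer, four butterflies per pass."""
    x = [complex(s) for s in samples]            # x-buffer, updated in place
    for select in (0b00, 0b01, 0b10):            # assumed mux select for sub-stages I, II, III
        half = 4 >> select                       # partner offset: 4, 2, 1
        pairs = [(base + n, base + n + half)
                 for base in range(0, 8, 2 * half)
                 for n in range(half)]           # one pair per shared butterfly unit
        results = []
        for top, bottom in pairs:
            twiddle = cmath.exp(-2j * cmath.pi * (top % half) / (2 * half))
            total = x[top] + x[bottom]                   # adder
            product = (x[top] - x[bottom]) * twiddle     # subtracter followed by multiplier
            results.append((top, total, bottom, product))
        for top, total, bottom, product in results:      # write-back into the same buffer
            x[top], x[bottom] = total, product
    return x                                     # x[i] holds X[bit_reverse3(i)]

def bit_reverse3(i):
    """Reverse the three index bits of i (0..7)."""
    return ((i & 1) << 2) | (i & 2) | ((i >> 2) & 1)

spectrum = fft8_time_multiplexed([100, 200, 300, 400, 500, 0, 0, 0])
natural_order = [spectrum[bit_reverse3(k)] for k in range(8)]
for k, value in enumerate(natural_order):
    print(k, value)
```

Collecting all four results before writing back mirrors the separate compute and write-back states of the FSM and keeps a butterfly from reading a value that another butterfly has already overwritten.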

Figure 4: FSM that dictates how the FFT Core of figure 7 operates.



Figure 5: Segregation of the Butterfly Network into three stages: Input, Compute, and Output Stage.

The input samples from the ADC are in 14-bit 2's complement form, and the twiddle factors are represented as 10-bit 2's complement fixed-point numbers. Since a twiddle factor is a complex number, two 10-bit fixed-point numbers represent its real and imaginary parts respectively. So, during twiddle-factor multiplication, at most 24 bits of storage are needed for each of the real and imaginary parts of a sample (a 14-bit value times a 10-bit value gives a result of at most 24 bits). However, our actual implementation uses 32 bits for each of the real and imaginary parts of every sample, so that a future modification of the core can use up to a 16-bit representation of the twiddle factors (10 bits in this particular design) for a more accurate fixed-point representation of floating-point numbers. All of the adders, subtracters,


COMPUTE STAGE III and the results are finally sent to the output port from the Temp-Buffer in OUTPUT STAGE. Thus, from the FSM shown in figure 4, in which each state takes one clock cycle, it follows directly that the core takes eight clock cycles to compute the FFT and present the output frequency samples at the output ports.

D. Verilog Design of the FFT Core in Xilinx ISE

The Verilog [2] module named fft8point, whose block diagram is shown in figure 8, computes the radix-2 8-point FFT; the HDL code snippet for its port declarations is shown in figure 9. The port signal names and the description of each signal are given in table 1. The eight input samples are taken into the core through the input ports Px0 to Px7, each of which is 14 bits wide. The core then evaluates the FFT of the eight time-domain samples, generating eight frequency samples. These frequency samples are presented at the eight output ports X0 to X7, each of which is 64 bits wide, where the upper 32 bits are the real part and the lower 32 bits are the imaginary part of the output frequency sample.
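The 64-bit output layout described above can be illustrated with a small Python sketch that packs and unpacks such a word. The helper names are hypothetical, and the sketch assumes that both 32-bit halves are 2's complement integers, which the text states for the input samples but not explicitly for the outputs.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret the low 32 bits of value as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def pack_output(real, imag):
    """Build a 64-bit word: real part in bits 63..32, imaginary part in bits 31..0."""
    return ((real & MASK32) << 32) | (imag & MASK32)

def unpack_output(word):
    """Split a 64-bit output word back into signed (real, imag) integers."""
    return to_signed32(word >> 32), to_signed32(word)

word = pack_output(-542, -725)
print(hex(word))
print(unpack_output(word))
```

A testbench or host program reading X0 to X7 could use the same split to turn each port value back into a complex frequency sample.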

Figure 8: Verilog Module of the FFT Core




Figure 9: Verilog Code Snippet of the Core

Table 1: List of ports of the designed core with their function

S.N. | Port Name | Function | Width (in bits) | Direction
1. | Px0 – Px7 | Takes eight input samples from the Input Buffer | 14-bit each | Input
2. | X0 – X7 | Present output samples | 64-bit each with 32-bit real part and 32-bit imaginary part | Output
3. | Clk | Global Clock | 1-bit | Input
4. | inputValid | Asserts that the input samples at the input port are valid | 1-bit | Input
5. | Reset | Global Reset signal for the core | 1-bit | Input


Result, Discussion and Summary

A. Result and Discussion:

The functional verification of the FFT Core has been done using the ISIM simulator of Xilinx ISE. A snapshot of the simulation result is shown in figure 10. The input sequence fed to the FFT Core and the corresponding output samples, along with a comparison against the actual output, are shown in table 2. The execution period is 0.16 µs (= 8/50 MHz); that is, the 8-point FFT is computed in only 8 clock cycles.

Figure 10: Simulation waveform showing the input samples, output samples, and control signals.

The waveform shown in figure 10 illustrates that the designed core runs smoothly and produces correct output. The exact result is in general not an integer, but because a fixed-point representation is used in place of floating-point numbers, the obtained result is an integer approximation of the actual result.
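The effect of the fixed-point representation can be seen in a minimal Python sketch of one twiddle-factor product. The paper gives only the bit widths (14-bit samples, 10-bit twiddle factors, products of at most 24 bits); the scaling of 8 fractional bits, the round-to-nearest step, and the function names used here are assumptions made for illustration.

```python
import math

TWIDDLE_BITS = 10
FRACTION_BITS = 8          # assumed scaling: 1.0 is stored as 2**8 = 256

def quantize_twiddle(value):
    """Round a real coefficient in [-1, 1] to a 10-bit two's complement integer."""
    q = round(value * (1 << FRACTION_BITS))
    assert -(1 << (TWIDDLE_BITS - 1)) <= q < (1 << (TWIDDLE_BITS - 1))
    return q

def fixed_multiply(sample, twiddle_q):
    """Multiply an integer sample by a quantized twiddle and rescale to an integer."""
    product = sample * twiddle_q                     # 14-bit x 10-bit fits in 24 bits
    assert -(1 << 23) <= product < (1 << 23)
    return (product + (1 << (FRACTION_BITS - 1))) >> FRACTION_BITS   # assumed rounding

w = quantize_twiddle(math.cos(math.pi / 4))   # real part of the twiddle factor W_8^1
approx = fixed_multiply(200, w)
exact = 200 * math.cos(math.pi / 4)
print(w, approx, round(exact, 2))
```

The integer result differs from the exact product only in its fractional part, which is the kind of small deviation between computed and reference outputs discussed in this section.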


The errors listed in table 2 stay below 0.2% for every output sample. Thus, the FFT Core is able to calculate the 8-point FFT with good precision.

Table 2: Comparison between the actual output from MATLAB calculation and computed output from the designed core

S.N. | Input | MATLAB Output | Observed Output | Error in magnitude (%)
1 | 100 + 0i | 1500 + 0i | 1500 + 0i | 0
2 | 200 + 0i | 300 + 200i | 300 + 200i | 0
3 | 300 + 0i | -541.4 - 724.3i | -542 - 725i | 0.102
4 | 400 + 0i | -258.6 - 124.3i | -259 - 124i | 0.174
5 | 500 + 0i | 300 - 0i | 300 - 0i | 0
6 | 0 + 0i | 300 - 200i | 300 - 200i | 0
7 | 0 + 0i | -258.6 + 124.3i | -258 + 125i | 0.102
8 | 0 + 0i | -541.4 + 724.3i | -541 + 724i | 0.174
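The paper does not state how the error column of table 2 is computed. One plausible reading, sketched below in Python, takes the magnitude of the complex difference between the observed and MATLAB outputs as a percentage of the MATLAB magnitude. The function name `relative_error_percent` is hypothetical, and the formula is an assumption. It reproduces the values listed for rows 3 and 4.

```python
def relative_error_percent(reference, observed):
    """Magnitude of the complex difference as a percentage of the reference magnitude."""
    return abs(observed - reference) / abs(reference) * 100.0

# (MATLAB output, observed output) taken from rows 3 and 4 of table 2.
rows = {
    3: (complex(-541.4, -724.3), complex(-542, -725)),
    4: (complex(-258.6, -124.3), complex(-259, -124)),
}
for number, (matlab_value, core_value) in rows.items():
    print(number, round(relative_error_percent(matlab_value, core_value), 3))
```

Applied to rows 7 and 8, the same formula gives figures that differ from the listed ones, so the table's error column may have been obtained in another way for those rows.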

B. Design Summary

The design summary, a report generated by Xilinx ISE [3], allows designers to view information such as the targeted device, device utilization, and design goal. The FFT Core has been implemented on the Spartan-3E Starter Board. The RTL schematic of the FFT processor, a basic logical representation of the circuit in terms of logic primitives that is generated once the design passes simulation and synthesis, is shown in figure 12. The design summary generated by Xilinx ISE is shown in figure 11.




Figure 11: Design Summary generated by Xilinx ISE

Figure 12: RTL Schematic of the Designed Core


Summary and Conclusion

This paper presents an 8-point FFT processor with a new architecture that aims to combine good performance with optimized resource consumption. The whole design is implemented in Verilog-HDL using Xilinx ISE 13.2, and functional verification is done with the ISIM simulator. The design delivers good results in terms of both the physical resources and the throughput required for real-time applications such as audio processing. Along with these performance results come other considerations, such as ease of implementation, cost, and performance, which need to be evaluated to select the best approach for the given system requirements. The design has a very simple port interface, so it can easily be incorporated into any other system that requires FFT computation. The design produces a maximum error of 0.2% in the result due to the fixed-point representation of floating-point values, which is well within a good tolerance limit. Another important note is that this core can be extended to compute higher-point FFTs with little modification. Further, this core can be a very useful tool for analyzing the frequency samples of any type of discrete-time signal in real time.

References:

[1] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing, 2nd ed., Tenth Impression, Pearson Education, 2012, pp. 655–681.
[2] J. Bhasker, A Verilog HDL Primer, 3rd ed., Star Galaxy Publishing, 2005.
[3] Xilinx, ISE In-Depth Tutorial (UG695, v 12.1), April 2009.
[4] Young-jin Moon and Young-il Kim, "A Mixed-Radix 4-2 Butterfly with Simple Bit Reversing for Ordering the Output Sequences," ICACT 2006, vol. 4, pp. 1772–1774, February 2006.
[5] A. Sreir.sr, C. Ka-a-Terki, H. Mshrez, and S. Negus, "A Flexible High Performance Serial Radix-2 FFT Butterfly Arithmetic Unit," IEEE Transl. J. Magn. Japan, vol. 2, pp. 26–29, August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].
[6] Xilinx LogiCORE FFT Processor guide.pdf.
[7] Chung-Ping Hung, Sau-Gee Chen, and Kun-Lung Chen, "Design of an Efficient Variable-length FFT Processor," ISCAS 2004, vol. 4, pp. 833–836.