
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 38, NO. 2, FEBRUARY 2003

High-Performance and Power-Efficient CMOS Comparators

Chung-Hsun Huang, Student Member, IEEE, and Jinn-Shyan Wang, Member, IEEE

Abstract—Several design techniques for high-performance and power-efficient CMOS comparators are proposed. First, the comparator is based on the priority-encoding (PE) algorithm, so the dynamic circuit technique developed specifically for the priority encoder can be applied. Second, the PE function and the subsequent logic functions are merged and efficiently realized in multiple output domino logic (MODL), which shortens the logic depth. The MODL CMOS circuit is also compact and power efficient because few transistors are needed. Third, the multilevel look-ahead technique is used to shorten the path of priority-token propagation. Finally, the circuit is realized with a latch-based two-stage pipelined structure, and the comparison function is partitioned into two parts, with each part executed in one half of the clock cycle in a delay-balanced manner. Post-layout simulation results show that a 64-b comparator designed with the proposed techniques in a 3-V 0.6-μm CMOS technology is 16% faster, 50% smaller, and 79% more power efficient than the all-n-transistor comparator, which is the fastest among the conventional comparators. Measurement results of the test chip conform to the simulation results and prove the feasibility of the proposed techniques.

Index Terms—CMOS dynamic circuit, comparator, priority encoding, multilevel lookahead, multiple output domino logic (MODL).

I. INTRODUCTION

THE COMPARATOR is a very basic and useful arithmetic component of digital systems. There are several approaches to designing CMOS comparators, each with different operating speed, power consumption, and circuit complexity. One can implement the comparator by flattening the logic function directly [1]. This approach is only suitable for comparators with short inputs. For comparators with longer inputs, circuit complexity increases drastically, and the operating speed is degraded accordingly. Another way to design the comparator is to employ a parallel adder [2]. In this approach, the adder becomes the major factor limiting the operating speed. However, a very high-speed adder often requires thousands of transistors [11]–[13]. Recently, Wang et al. [3] proposed constructing the comparator in a tree structure with the all-n-transistor (ANT) dynamic CMOS logic in order to improve the operating speed. The ANT logic is derived from the all-n-logic (ANL) [4]. Both ANT and ANL logic circuits can only be implemented with heavy pipelining. In [3], a 64-b comparator is designed as a six-pipeline circuit, and each comparison operation through these six pipelines is finished in three clock cycles. Although such a heavily pipelined design achieves high throughput, it may not be suitable for all applications. For example, some popular microprocessors such as the ARM microprocessor [5] often need to execute a comparison instruction within a single clock cycle. Moreover, the latches used to form the pipelines increase the circuit complexity and power consumption of the ANT comparator.

In this paper, we propose several design techniques for high-performance and power-efficient CMOS comparators. The proposed techniques span from the microarchitecture to the logic and circuit design levels. In the microarchitecture design, the priority-encoding algorithm is adopted to efficiently implement each comparison operation in one clock cycle. The critical path is effectively shortened using the multilevel look-ahead technique that we proposed in [6] for the priority encoder. Furthermore, for long comparators, a two-stage pipelined architecture is used to partition and balance the logic functions into each half of the clock cycle. In the logic design, the priority-encoding function and some logic functions are merged in one complex CMOS gate called the magnitude decision module. Such a design not only improves the operating speed but also makes the circuit more compact and power efficient. In the circuit design, the dynamic technique with serially connected structure is applied to produce high performance with low switching activity. Also, a technique similar to the multiple output domino logic (MODL) [7] is applied to the magnitude decision module so that the circuit complexity is reduced further.

The rest of this paper is organized as follows. Section II describes the design principles of the new priority-encoding-based comparator. Basic design techniques used to design new comparators are described in Section III, while the microarchitecture improvement together with modified circuits for long comparators is described in Section IV. Performance evaluation and experimental results are given in Section V, and the conclusion is given in Section VI.

Manuscript received March 26, 2002; revised September 6, 2002. This work was supported by the National Science Council of Taiwan under Research Grant NSC 90-2215-E-194-019 and Grant NSC 91-2215-E-194-007. The authors are with the Institute of Electrical Engineering, National Chung-Cheng University, Chia-Yi, 621 Taiwan, R.O.C. (e-mail: [email protected]). Digital Object Identifier 10.1109/JSSC.2002.807409

0018-9200/03$17.00 © 2003 IEEE

II. DESIGN PRINCIPLES OF THE PRIORITY-ENCODING-BASED COMPARATOR

Let the comparator take two input operands of equal width, with bits counted upward from bit 0. One binary output denotes that the first operand is larger than the second, and a corresponding output denotes that the second is larger than the first. Another binary signal, EQUAL, indicates that the two operands are equal. A 4-b numerical example, shown in Fig. 1, demonstrates the design concept of the proposed comparator. Assume the two operands are 4'b1011 and 4'b1000, respectively. By inspection, the two "larger" outputs should be 1 and 0, respectively, and EQUAL should be 0. The magnitude comparison is divided into four steps, and the number in each shaded oval in Fig. 1 is the sequence number of the step.

The first step determines whether each pair of corresponding bits of the two operands is equal, using XOR gates. If the operands are equal, all the output bits of the XOR gates will be 0. Otherwise, there is at least one “1” bit in the result. In this numerical example, the result is 4'b0011, reflecting that the operands are not equal.

There are two operations in the second step. The first operation NORs the result of the first step, which is 4'b0011, to generate the output signal EQUAL. The second operation actually determines which input is larger. Observe that in the result of the first step (4'b0011), the most significant “1” bit, which will be called the most significant unequal bit (MSUB) hereafter, is at bit 1. Meanwhile, the bit at the MSUB of the first operand is “1,” while the bit at the MSUB of the second operand is “0.” The MSUB immediately shows which operand is larger. In order to find the MSUB quickly, we employ the priority encoder proposed in [6] and [8] (details of the circuit are described in a later section). For this numerical case, the priority encoder takes the output of the first step (4'b0011) and generates 4'b0010. There is only one “1” bit in the output, which is exactly at the MSUB.

In the third step, AND operations are used to find out from which operand the “1” bit at the MSUB comes. Each operand is ANDed bitwise with the output of the priority encoder, which gives 4'b0010 and 4'b0000, respectively. The nonzero value of the first result immediately shows that the first operand is larger than the second. Finally, in the fourth step, the two "larger" outputs are generated by ORing the bits of the two AND results, respectively. As expected, they are 1 and 0 in this example.

Fig. 1. Numerical example of 4-b priority-encoding-based comparison.

Fig. 2. Conceptual block diagram of the priority-encoding-based 4-b comparator.

Fig. 3. (a) Block diagram of a 4-b comparator. (b) Schematic diagram of a 4-b MDM.

The above operations can be realized by the circuit shown in Fig. 2. The priority encoder implements the following equations [6].

(1)
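The four steps described above can be summarized in a minimal behavioral sketch. It is written in Python purely for illustration: the function name `pe_compare` and all variable names are assumptions rather than signal names from the paper, and the sketch models only the logic function, not the dynamic CMOS circuit or the equations of [6].

```python
def pe_compare(a: int, b: int, width: int = 4):
    """Behavioral model of the four-step priority-encoding comparison.

    Names are illustrative assumptions, not the paper's signal names.
    Returns (a_larger, b_larger, equal) as 0/1 integers.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Step 1: bitwise XOR marks every position where the operands differ.
    diff = a ^ b

    # Step 2, first operation: EQUAL is the NOR of all XOR outputs.
    equal = int(diff == 0)

    # Step 2, second operation: priority encoding keeps only the most
    # significant '1' of diff, i.e. a one-hot mask at the MSUB.
    msub = (1 << (diff.bit_length() - 1)) if diff else 0

    # Step 3: AND each operand with the one-hot MSUB mask.
    a_at_msub = a & msub
    b_at_msub = b & msub

    # Step 4: OR-reduce each masked word to a single decision bit.
    a_larger = int(a_at_msub != 0)
    b_larger = int(b_at_msub != 0)
    return a_larger, b_larger, equal
```

For the operands of Fig. 1, `pe_compare(0b1011, 0b1000)` returns `(1, 0, 0)`: the XOR result is 0011, the one-hot mask is 0010, and only the first operand has a "1" at that position.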


Fig. 4. Schematic diagram of a 4-b priority-encoding-based CMOS comparator.

Note that although the above design concept is similar to that described in [1], the implementation details are quite different. We will elaborate these details shortly.

III. BASIC DESIGN TECHNIQUES

When implementing the comparator in CMOS technology, we found that the priority encoder and those AND gates in Fig. 2 can be merged into a functional block, called the magnitude decision module (MDM). With the MDM, the block diagram of a 4-b comparator is revised as shown in Fig. 3(a). The circuit for generating EQUAL will not be shown hereafter because it is not in the critical path. The MDM implements the functions listed below, and it is designed as the circuit shown in Fig. 3(b).

(2)

The circuit in Fig. 3(b) is derived from the priority encoder we proposed in [6]. We also adapt the MODL style [7] to reduce circuit complexity and increase operating speed. The circuit in Fig. 3(b) operates as follows. When the clock signal clk goes low, the circuit enters the precharging phase and and the output nodes are precharged to 0. When clk goes high, the circuit enters the evaluation phase. For , the priority descends from to . For example, if , then the signal will be used to turn off the discharging paths for and . Therefore, outputs and will be kept at “0.” The values of and depend on and . For example, if , nodes and will be evaluated as logic 0 and 1, while nodes and will be kept at logic 1 and 0, respectively. On the other hand, if , neither node nor node have discharging path because transistor is turned off. Then, outputs and stay in the precharged state. At the same time, relinquishes the control and the rest of the circuit functions as if there are only three inputs, , , and .

The schematic of the 4-b comparator [Fig. 3(a)] is shown in Fig. 4. The circuit follows the domino logic style [9] and, hence, the necessary inversion function is moved to the input terminal and implemented via static CMOS circuits. On the other hand, the OR function is implemented by a dynamic NOR gate plus a NOT gate, and placed after the dynamic MDM circuit.

Although we can derive an MDM with more than four inputs in the same way as (2), the circuit becomes too complicated to achieve high speed. Thus, instead, we employ the concept of multilevel lookahead proposed for the priority encoder [6] to design comparators with more than four input bits. The concept of multilevel lookahead is illustrated with the aid of the block diagram of a 16-b comparator in Fig. 5(a), and the schematic diagram of the modified 4-b comparator macro PEBCLA4b is shown in Fig. 5(b). In addition to the input/output (I/O) signals shown in Fig. 4, the new 4-b comparator macro needs an extra input look-ahead signal and an extra output look-ahead signal . As illustrated in Fig. 5(a), the in the th macro is connected to the in the th macro, except that the in the least significant macro should be tied to directly. The following equations describe the functions of Fig. 5(b).

(3)

As described in [6], and in (3) realize the first-level look-ahead mechanism because all these functions are flattened without iteration and finished with one gate delay. On the other hand, the circuits enclosed in the gray areas of Fig. 5(b) realize the second-level look-ahead mechanism because the signal is generated only with a domino-gate delay. The look-ahead signals are used to connect different macros to shorten the critical path.

Fig. 5. (a) Block diagram of a 16-b comparator. (b) Schematic diagram of the macro PEBCLA4b.

IV. LONG PRIORITY-ENCODING-BASED COMPARATORS
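Before turning to longer word lengths, the chaining of 4-b macros through look-ahead signals, as in the 16-b block diagram of Fig. 5(a), can be illustrated with a minimal behavioral sketch in Python. Everything named here is an assumption made for illustration: the functions `macro4` and `compare16`, the signal names, and in particular the polarity of the look-ahead token (1 means that no higher-priority macro has found an unequal bit). The loop passes the token from macro to macro for readability, whereas the circuit generates the look-ahead signals with flattened logic; the sketch therefore shows the function, not the delay behavior.

```python
def macro4(a4: int, b4: int, enable_in: int):
    """Hypothetical 4-b comparator macro with look-ahead in and out.

    enable_in = 1 means no higher-priority macro has seen an unequal bit.
    Returns (a_larger, b_larger, enable_out).
    """
    diff = (a4 ^ b4) & 0xF
    # One-hot mask at the most significant unequal bit of this macro.
    msub = (1 << (diff.bit_length() - 1)) if diff else 0
    # The macro may only assert a decision while it still holds the token.
    a_larger = int(bool(enable_in) and (a4 & msub) != 0)
    b_larger = int(bool(enable_in) and (b4 & msub) != 0)
    # The token is passed on only if all four bit pairs are equal.
    enable_out = int(bool(enable_in) and diff == 0)
    return a_larger, b_larger, enable_out


def compare16(a: int, b: int):
    """16-b comparison built from four chained 4-b macros."""
    enable = 1  # the highest-priority macro is always enabled
    a_any = 0
    b_any = 0
    for shift in (12, 8, 4, 0):  # most significant macro first
        a4 = (a >> shift) & 0xF
        b4 = (b >> shift) & 0xF
        a_l, b_l, enable = macro4(a4, b4, enable)
        # OR the per-macro decisions; at most one macro asserts an output.
        a_any |= a_l
        b_any |= b_l
    return a_any, b_any
```

In this model at most one macro ever asserts an output, so the final OR is equivalent to selecting the decision of the macro that contains the most significant unequal bit.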

When the size of the comparator grows larger, the third- and even the fourth-level look-ahead circuit structures, which are similar to that used in the priority encoder [6], can be used to shorten the critical path further. However, not only does the structure of a single gate become more complex, but the propagation delay also grows linearly with the number of cascaded macros. Therefore, for a longer comparator, we propose a two-stage pipelined structure to enhance the performance with little increase in circuit complexity.

The previous design approach needs a precharge phase and an evaluation phase to finish one comparison operation. Thus, the precharging time is wasted from the viewpoint of logic operation. Furthermore, the duty cycle of a system clock is usually set to 50% even though the required precharging time is typically shorter than the evaluation time. Taking these factors into consideration, we partition the logic functions of the comparator into each half of the clock cycle to form a two-stage pipeline. Such a design not only makes each pipeline shorter but also fully utilizes the clock cycle if the circuit is implemented in dynamic CMOS logic. When the first pipeline stage enters the evaluation phase, the second pipeline stage enters the precharge phase. After the first pipeline stage turns to precharge and latches the results, the second pipeline stage begins to evaluate. Although the new architecture needs more transistors for pipeline latches, it can effectively shorten the clock cycle to improve the operating speed. Furthermore, because the circuit is implemented with dynamic CMOS circuits, the comparator can still finish each comparison in one clock cycle.

Let us take the 64-b comparator as an example. The block diagram of the new design is shown in Fig. 6. The 64 input bits are partitioned into eight small groups, each having eight input bits. In the first pipeline stage, eight comparators process the eight groups of inputs, respectively, producing eight pairs of outputs. After latching, these outputs are sent to the second stage, which is another 8-b comparator, to perform the remaining operations. The 8-b macro cell PEB8b shown in Fig. 7 implements the following equations.

(4)

Fig. 6. Block diagram of a high-performance 64-b comparator.

The circuit structure is derived from that of Fig. 5(b) and is described as follows.

1) Two 4-b comparators are used to construct the 8-b macro.

2) The least significant macro of PEB8b uses an AND gate to generate the lookahead signal for the second macro of PEB8b. The second macro does not need to generate a look-ahead output because there is no connection between different 8-b macros.

3) Those transistors controlled by the input look-ahead signal in the original 4-b macro are removed from the least significant macro of PEB8b because PEB8b does not need this signal.

4) The MDM output signals in the two 4-b macros are combined together by two eight-input dynamic NOR gates, respectively, and the results are latched by two N-C²MOS latches.

The detailed operations of the 64-b comparator are described briefly as follows.

1) The first and second pipeline stages of the 64-b comparator utilize the same 8-b macro PEB8b. However, the macros in the first pipeline stage accept the clock signal, but the macro in the second pipeline stage accepts the complementary clock signal. Therefore, when the clock goes high, the macro cells in the first pipeline stage enter the evaluation phase and the macro cell in the second pipeline stage enters the precharge phase.

2) When the clock goes low, the macro cells in the first pipeline stage enter the precharge phase and the evaluated results are latched in the N-C²MOS latches. These outputs are also fed into the corresponding inputs of the macro in the second pipeline stage for obtaining the final comparison result.

3) Both stages have the same critical path, i.e., the 8-b comparator. Because the critical paths of both stages are shortened and balanced, the operation speed of the comparator is improved significantly.

Fig. 7. Schematic diagram of the 8-b macro cell PEB8b.

V. PERFORMANCE EVALUATION AND EXPERIMENTAL RESULTS

In order to verify the proposed techniques, a two-stage pipelined 64-b comparator is realized. To minimize the layout effort and layout area, all N-type transistors in the pull-down network have the same transistor width instead of a ratioed design. We also enlarge the width of these transistors up to 5 μm to reduce the pull-down delay. For example, the channel width of the transistors – in Fig. 7 are all 5 μm. The design is implemented based on a 3-V 0.6-μm CMOS technology [10], which is the same as that used in the ANT comparator [3]. The 64-b comparator based on Wang et al.’s approach is also resimulated with the transistor sizes reported in [3]. For comparison purposes, the performance comparison is based on the results of post-layout simulations running at a 3-V supply voltage.

The layouts of both designs are shown in Fig. 8, and the complexity information is listed in Table I. We found that the transistor count of the new design is less than that required in the conventional design, while the layout area of the new design is only about half that of the conventional design. This is mainly because the transistor size used in the new design is typically much smaller than that used in the previous design.

TABLE I COMPLEXITY COMPARISON OF TWO 64-b COMPARATORS

TABLE II POST-LAYOUT SIMULATION RESULTS OF TWO DIFFERENT 64-b COMPARATORS

Fig. 8. Layouts of (a) Wang et al.’s comparator [3]. (b) Proposed comparator.

Before reporting the timing information, timing characterization methods for both designs will be described. For the new

design, both stages have the same critical path, i.e., the 8-b comparator. Then, we only need to characterize the critical path of the 8-b comparator macro, which is the summation of the delay of the static XOR gate and the evaluation delay of the dynamic gate. Note that the output of the static XOR gate must be stable before the dynamic gate enters the evaluation phase. This means that the XOR-gate delay can be viewed as the setup time of the dynamic circuit. The minimal cycle time will be twice this critical-path delay, and the maximal operating frequency will be the reciprocal of the minimal cycle time. Analysis shows that we can apply the pattern ( , ) to trigger the longest signal propagation path. The timing chart of the 8-b macro PEB8b is illustrated in Fig. 9.

As mentioned above, the new comparator finishes each comparison in just one clock cycle, while the conventional 64-b comparator takes three clock cycles to finish the task. Similar to the new design, all stages in the conventional design also have the same critical path, but each pipeline is a 2-b comparator in this case. Then, we only need to characterize the critical path of the 2-b comparator macro. For a fair comparison, we define the equivalent total delay time for each operation to be six times of , and the equivalent maximal operating frequency is defined to be . The pattern ( ) is applied to trigger the longest signal propagation path.

Post-layout simulation results are summarized in Table II. Power consumption listed in Table II is evaluated at the maximum clock frequency. It shows that the proposed comparator is 16% faster and consumes 79% less power as compared with Wang et al.’s comparator [3]. For the new design, it is possible to trade the layout area and the power consumption for more speed advantages.

The proposed 64-b comparator has been fabricated for performance verification. Fig. 10 shows the test chip architecture used to measure the delay time of the dynamic circuit. This measurement method is commonly used in measuring delay time of dynamic circuits [13], [14]. The input clock signal goes through the clock buffer first, and then proceeds in two paths. One goes through the comparator core and the output buffer, and reaches the output pad. The other one only goes through the output buffer to reach the output pad. Obviously, the only difference between these two paths is the comparator core. Therefore, we can measure the time between the clock output signal Clk and the comparator output and get the delay time. The photograph of the test chip is shown in Fig. 11(a) and measured waveforms with 160- and 50-MHz clocks are shown


Fig. 9. Timing chart of the critical path of PEB8b.

Fig. 10. Test chip architecture.

in Fig. 11(b) and (c), respectively. Measured chip features and post-layout simulation results are summarized in Table III. The measured waveforms indicate that the delay time of the dynamic gate in the 8-b macro is 2.2 ns no matter which clock rate is used, which completely matches the simulation result. We cannot measure the XOR-gate delay directly on the chip because it is a setup time in nature. However, according to the above measurement result, we have confidence that the experimental result is very close to the simulation result. The maximal operating frequency is measured at around 180 MHz (not shown), which again agrees with the simulation. The measured power consumption is also very close to the simulated result.

VI. CONCLUSION

Design techniques for high-performance and power-efficient CMOS comparators are proposed. The design is based on the priority-encoding algorithm and utilizes the dynamic CMOS circuit technique to result in a compact comparator with high performance. In implementation, the priority-encoding function and the subsequent AND function are merged as an MDM, which is realized in the MODL. Such a design not only improves the operating speed due to the reduced logic depth, but also makes the circuit compact and power efficient because fewer transistors are used. To efficiently shorten the critical path that lies in the MDM, the multilevel look-ahead technique is adopted. To enhance the operating speed further, the circuit is realized with a latch-based two-stage pipelined structure, and the logic functions are partitioned into two parts, with each part executed in half of the clock cycle in a delay-balanced manner. Post-layout simulation results show that a 64-b comparator designed with the proposed techniques in a 3-V 0.6-μm CMOS technology is 16% faster, 50% smaller, and 79% more power efficient as compared with the fastest conventional design. Measurement results of the test chip conform to the simulation results and prove the feasibility of the proposed techniques.
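The two-stage partition summarized above (eight 8-b comparisons in the first half of the clock cycle, followed by one 8-b comparison of the latched results in the second half) can be sketched behaviorally as follows. This is a minimal Python illustration under stated assumptions: the names `compare8` and `compare64_two_stage` are hypothetical, the latches and clock phases appear only as comments, and no timing is modeled.

```python
def compare8(a8: int, b8: int):
    """Hypothetical 8-b priority-encoding comparison (logic function only).

    Returns (a_larger, b_larger) as 0/1 integers.
    """
    diff = (a8 ^ b8) & 0xFF
    # One-hot mask at the most significant unequal bit.
    msub = (1 << (diff.bit_length() - 1)) if diff else 0
    return int((a8 & msub) != 0), int((b8 & msub) != 0)


def compare64_two_stage(a: int, b: int):
    """64-b comparison partitioned into two stages of 8-b comparisons."""
    # Stage 1 (first half cycle): eight 8-b comparators work in parallel,
    # one per group of eight input bits.
    a_word = 0
    b_word = 0
    for group in range(8):
        a8 = (a >> (8 * group)) & 0xFF
        b8 = (b >> (8 * group)) & 0xFF
        a_l, b_l = compare8(a8, b8)
        # The eight output pairs form two new 8-b words (latched in hardware).
        a_word |= a_l << group
        b_word |= b_l << group

    # Stage 2 (second half cycle): one more 8-b comparison of the two
    # latched words selects the decision of the most significant group
    # in which the operands differ.
    return compare8(a_word, b_word)
```

The second stage works because the two stage-1 output words differ exactly in the groups where the operands differ, so their most significant unequal bit identifies the deciding group.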

Fig. 11. (a) Photograph of the fabricated chip. (b) Measured waveforms with 160-MHz clock. (c) Measured waveforms with 50-MHz clock.

TABLE III EXPERIMENTAL AND POST-LAYOUT SIMULATION RESULTS

ACKNOWLEDGMENT

The authors would like to thank Prof. C. Yeh of the Department of Electrical Engineering, National Chung Cheng University, for improving the language and presentation of this paper. They would also like to thank the Chip Implementation Center, Taiwan, for supporting the fabrication of the test chip.

REFERENCES

[1] M. M. Mano, Digital Design. Englewood Cliffs, NJ: Prentice-Hall, 1991, ch. 5.
[2] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 1993, ch. 8.
[3] C.-C. Wang, C.-F. Wu, and K.-C. Tsai, “1-GHz 64-b high-speed comparator using ANT dynamic logic with two-phase clocking,” Proc. Inst. Elect. Eng. Comput. Digital Techn., vol. 145, no. 6, pp. 433–436, Nov. 1998.
[4] R. X. Gu and M. I. Elmasry, “All-N-Logic high-speed true-single-phase dynamic CMOS logic,” IEEE J. Solid-State Circuits, vol. 31, pp. 221–229, Feb. 1996.
[5] S. Furber, ARM System Architecture. Reading, MA: Addison-Wesley, 1997.
[6] J.-S. Wang and C.-H. Huang, “High-speed and low-power CMOS priority encoders,” IEEE J. Solid-State Circuits, vol. 35, pp. 1511–1514, Oct. 2000.
[7] I. S. Hwang and A. L. Fisher, “Ultrafast compact 32-b CMOS adders in multiple-output domino logic,” IEEE J. Solid-State Circuits, vol. 24, pp. 358–369, Apr. 1989.
[8] J.-S. Wang and C.-S. Huang, “A high-speed single-phase-clocked CMOS priority encoder,” in Proc. IEEE Int. Symp. Circuit and Systems, vol. 5, May 2000, pp. 537–540.
[9] R. W. Krambeck, C. M. Lee, and H.-F. S. Law, “High-speed compact circuits with CMOS,” IEEE J. Solid-State Circuits, vol. SC-17, pp. 614–619, June 1982.
[10] “0.6-μm CMOS ASIC process digests,” Taiwan Semiconductor Manufacturing Corp., Hsinchu, Taiwan, R.O.C., 1996.
[11] J. Park, H. C. Ngo, J. A. Silberman, and S. H. Dhong, “470 ps 64-b parallel binary adder [for CPU chip],” in Symp. VLSI Circuits Dig. Tech. Papers, 2000, pp. 192–193.
[12] S. Naffziger, “A sub-nanosecond 0.5-μm 64-b adder design,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, 1996, pp. 362–363.
[13] R. Woo, S.-J. Lee, and H.-J. Yoo, “A 670-ps 64-b dynamic low-power adder design,” in Proc. IEEE Int. Symp. Circuit and Systems, vol. 1, May 2000, pp. 28–31.
[14] G. Yee and C. Sechen, “Clock-delayed domino for dynamic circuit design,” IEEE Trans. VLSI Syst., vol. 8, pp. 425–430, Aug. 2000.

Chung-Hsun Huang (S’00) was born in Taiwan, R.O.C., in 1977. He received the B.S. and M.S. degrees in electrical engineering from National Chung-Cheng University, Chia-Yi, Taiwan, in 1999 and 2000, respectively. He is currently working toward the Ph.D. degree at the Institute of Electrical Engineering, National Chung-Cheng University. His research interests include high-speed and low-power digital integrated circuits, microprocessor design, SOC design methodology, and high-speed analog-to-digital converter design.

Jinn-Shyan Wang (S’85–M’88) was born in Taiwan, R.O.C., in 1959. He received the B.S. degree in electrical engineering from the National Cheng-Kung University, Tainan, Taiwan, in 1982 and the M.S. and Ph.D. degrees from the Institute of Electronics, National Chiao-Tung University, Hsinchu, Taiwan, in 1984 and 1988, respectively. He was with Industrial Technology Research Institute (ITRI) from 1988 to 1995, engaged in ASIC circuit and system design, and became the Manager of the Department of VLSI Design. He joined the Department of Electrical Engineering, National Chung-Cheng University, Chia-Yi, Taiwan, in 1995, where he is currently a full Professor. His research interests are in low-power and high-speed digital integrated circuits and systems, analog integrated circuits, IP and SOC design, and CMOS image sensors. He has published over 20 journal papers and 40 conference papers and holds over 20 patents on VLSI circuits and architectures.