
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41, NO. 5, MAY 2006

An AND-Type Match-Line Scheme for High-Performance Energy-Efficient Content Addressable Memories

Hung-Yu Li, Chia-Cheng Chen, Jinn-Shyan Wang, Member, IEEE, and Chingwei Yeh

Abstract—High search speed and low energy per search are two major design goals of content-addressable memories (CAMs). In this paper, an AND-type match-line scheme is proposed to realize a high-performance energy-efficient CAM. The realized 256 × 128-b CAM macro, based on a 0.18-µm 1.8-V CMOS process, achieves a 2.1-ns search time. When both the stored and search data are generated from an on-chip 4 × 32-b LFSR with the same seed, the measured energy is 2.33 fJ/bit/search.

Index Terms—Content-addressable memory (CAM), low power, match-line scheme.

Fig. 1. (a) BiCAM cell and (b) TCAM cell used in the AND-type match-line scheme.

I. INTRODUCTION

Content-addressable memory (CAM) is widely used as the lookup table in applications such as search engines [1], Internet routers [2], [3], data compression [4], and image processing [5]. A CAM must be pre-stored with an array of data before executing the search operation. When performing a search operation, a new search word is sent into the memory array and is compared simultaneously with all entries of the entire memory array. Depending on the search and stored data, one or more matching results will indicate which pre-stored data completely match the input datum. Because the data comparison in each search operation is processed in parallel, power consumption is always an important concern when designing CAM circuitry. With the continuing shrinkage of the feature size in each generation of the CMOS process, modern applications using CAM demand higher and higher memory capacity, which in turn requires greater memory depth and width. In the face of this demand, improving the search speed is quickly becoming a major challenge in CAM circuit design.

Many works have been devoted to the design of the match-line scheme of CAM to increase the search speed or to reduce the power consumption. The most conventional CAM [6] adopted the classical NOR-logic match line for high search speed, but with the penalty of high power consumption. The design in [7] took advantage of a reduced switching activity

Manuscript received June 30, 2005; revised December 23, 2005. This work was supported by the National Science Council under Research Grant NSC932220-E-194-001 and by the Ministry of Economic Affairs of Taiwan, R.O.C., under 94-EC-17-A-01-S1-040. H.-Y. Li and C.-C. Chen were with the Department of Electrical Engineering, National Chung Cheng University, Chia-Yi 621, Taiwan, R.O.C. They are now with Faraday Technology Corporation, Hsinchu 300, Taiwan, R.O.C. J.-S. Wang and C. Yeh are with the Department of Electrical Engineering, National Chung Cheng University, Chia-Yi 621, Taiwan, R.O.C. (e-mail: [email protected]) Digital Object Identifier 10.1109/JSSC.2006.872719

from the NAND-type match line to reduce power consumption. However, the price for this is a much degraded search speed because of the native NAND-type logic structure. This speed degradation in turn limits the bit width of each memory entry, which contradicts the requirement of some modern applications such as the lookup table for the IPv6 router, which require a long bit width. The design in [8] tried to solve this problem of bit-width limitation by using the NORA [9] NAND-type match line. However, it did not solve the low-speed problem, and even made it worse because of the utilization of P-type domino gates. The design in [10] went back to the traditional NOR-type match line and employed the concept of suppressing the voltage swing of the match line to reduce the power consumption, and the sense amplifier was adopted for sensing the small voltage swing in order to improve the search speed. The timing control of the “enable” signal of the sense amplifier should be precise enough for the performance. However, the timing control is both critical and difficult considering the PVT variations. The designs in [11] and [12] also used the NOR-type match-line scheme, as well as a more sophisticated closed-loop sensing circuitry for further reducing the voltage swing of the match line so as to reduce the power consumption and improve the search speed. The bias voltage of the sense amplifier in this circuit must be carefully or even adaptively controlled to allow the circuit to work at all the operating corners. The pipelined version of the design in [12] was proposed in [13] for improving the throughput rate. However, the overhead of area and power consumption coming from the flip-flops and the clock driver for pipelining makes this design both hardware and energy inefficient. 
Recently, a hybrid-type multi-bank CAM architecture [14] was proposed to utilize the high-speed benefit of the NOR-type scheme for bank selection, and to take advantage of the low-power benefit of the NAND-type scheme for each CAM macro block.
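As a behavioral illustration of the search operation described at the start of this section, the following Python sketch compares one search word against every stored entry and returns the indices of the complete matches. It is a software model only, not the circuit: the function name `cam_search` and the representation of words as integers are assumptions made for this example, and the sequential loop stands in for the comparison that the hardware performs on all entries simultaneously.

```python
from typing import List, Sequence


def cam_search(stored_words: Sequence[int], search_word: int) -> List[int]:
    """Return the indices of all stored words that completely match the search word."""
    matches = []
    for index, word in enumerate(stored_words):
        # A word matches only if no bit differs, i.e. the XOR is zero.
        if (word ^ search_word) == 0:
            matches.append(index)
    return matches


if __name__ == "__main__":
    table = [0b1010, 0b0110, 0b1010]
    print(cam_search(table, 0b1010))  # entries 0 and 2 hold the search word
```

More than one index can be returned, which corresponds to the case of several pre-stored entries matching the same input datum.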

0018-9200/$20.00 © 2006 IEEE

LI et al.: AND-TYPE MATCH-LINE SCHEME FOR HIGH-PERFORMANCE ENERGY-EFFICIENT CAMS


Fig. 2. (a) Floorplan, (b) block diagram of the 11-stage match line, and (c) circuit showing the relationship among the CAM cell, the pseudo-footless gate, and the match line.

In this paper, an AND-type match-line scheme [15] is proposed for realizing a high-performance, energy-efficient CAM macro. The remainder of this paper is laid out as follows. The proposed AND-type match-line scheme and its operating principle are described in Section II. The reasons why the proposed match-line scheme contributes to high performance and low power are provided in Section III. Several design considerations of the proposed match-line scheme are described in Section IV. The test chip implementation and measurement results are illustrated in Section V. Finally, conclusions are drawn in Section VI.

II. THE AND-TYPE MATCH-LINE SCHEME

The proposed AND-type match-line scheme can be applied in either the binary CAM (BiCAM) or the ternary CAM (TCAM). The adopted BiCAM and TCAM cells are shown in Fig. 1(a) and Fig. 1(b), respectively. The 9T BiCAM cell is the same as that used in [7], and the proposed 13T TCAM cell is derived from the TCAM cell used in [11]. The word line (WL) controls the read and write operations, and is kept low in the search operation. The search bit lines are separated from the read/write bit lines to reduce the power consumption of the search operation. In both cells, the transistor in the shadow is also the fan-in transistor of the AND-type match-line circuit, which will be explained later. If the TCAM cell needs to perform the “don’t care” operation, both storage nodes should be written as “0” to pull up the gate voltage of the shadowed transistor.

In the following, the design of a BiCAM macro with 256 entries and 128 bits per entry is taken as the example to explain the proposed design techniques. The floorplan of the designed 256 × 128-b BiCAM macro is shown in Fig. 2(a). The cell array is partitioned into two half-planes in order to shorten the critical path of the match line. Therefore, the bit width of each half-plane is 64. The 64-b AND-type match line is composed of 11 pseudo-footless AND gates (to be described later) with the block diagram shown in Fig. 2(b). Each pseudo-footless AND gate is composed of a pseudo-footless dynamic NAND gate and a static inverter. The circuit in Fig. 2(c) illustrates the relationship among the CAM cell, the pseudo-footless gate, and the match line. The outputs of the left and right match lines are connected to a two-input AND gate to generate the final match output.

The basic element in the match-line circuit is the proposed pseudo-footless clock-and-data precharged dynamic (PF-CDPD) gate. The operation and the characteristics of the PF-CDPD gate can be understood by describing the evolution


Fig. 3. Evolution from a domino gate, a CDPD gate, to the PF-CDPD gate.

from the conventional domino gate [16] and the clock-and-data precharged dynamic (CDPD) gate [17] to the PF-CDPD gate, as shown in Fig. 3. The shaded nMOS and pMOS devices in the domino gate are triggered by a global clock signal. Because the clock signal is sent to all the domino gates, a buffer is needed to increase the driving capability of the clock signal. When evolving from the domino gate to the CDPD gate, the global clock signal is only connected to the first CDPD gate of a match line, while all other CDPD gates of the same match line are triggered by the outputs of their preceding gates. Note that the function performed by these two gates is not altered. However, because the external clock signal need not drive a large load, the size of the clock buffer (not shown) can be greatly reduced.

The PF-CDPD gate evolves a step further from the CDPD gate. The main difference between the CDPD and the PF-CDPD gates is that the clock- or data-triggered nMOS transistors are placed at different locations. Therefore, the CDPD and PF-CDPD gates still perform the same function, but the timing control style, the performance, and the power consumption are different. The timing control and the operating principle of the CAM macro adopting the AND-type PF-CDPD match circuit are explained below, while the reason why the PF-CDPD logic leads to high performance and low power is described in the next section. Furthermore, the design consideration for overcoming the charge-sharing problem of the PF-CDPD match line is discussed in Section IV.

The circuit along the critical path of the designed 256 × 128-b BiCAM macro is shown in Fig. 4(a). The operating waveforms are illustrated in Fig. 4(b), which also include the external clock signal [not shown in Fig. 4(a)] and the internal clock signal derived from it for the match circuit. Each search operation is divided into two phases: data setup and data matching. The dynamic match circuit operates accordingly in two phases as well, i.e., the precharge phase and the evaluation phase.

When the external clock goes high, the internal clock goes low. Now the match circuit enters the precharge phase, and the outputs of every PF-CDPD NAND gate and the local match lines [see Fig. 4(a)] are pulled high and low, respectively, by the clock-and-data precharging mechanism. At the same time, the input search data are sent in and are passed all the way to the inputs of the match circuit through the search bit lines. If the input bit matches the stored bit, then the PF-CDPD gate will get a high input. If all the inputs of a PF-CDPD gate get a “high”, then the source node of the clocked nMOS will be pulled toward the ground level in this phase, and the pull-down path will remain conductive in the next (evaluation) phase. On the other hand, the pull-down path will be cut off if at least one input gets a “low”.

When the external clock goes low, the internal clock goes high. At that point the match circuit enters the evaluation phase, and the search bit lines are kept quiet in this phase. All the match lines are evaluated at the same time, and the pseudo-footless gates in one match line are evaluated in domino fashion.

III. PERFORMANCE ANALYSIS

This section describes how the PF-CDPD logic contributes to high performance and low power consumption. The worst-speed evaluation happens when the input data fully match the stored data. In that case, the evaluation signal travels along the longest path, and the output of each PF-CDPD AND gate of a match line is pulled high in domino fashion. The status of the match line just before the evaluation phase of this case is illustrated in Fig. 5(a). In that situation, all nMOS transistors in the pull-down networks receive a “high” during the precharge phase, and their drain nodes are being pulled toward the ground level. Therefore, the pull-down network of a particular PF-CDPD AND gate can be electrically replaced with a small resistance when the clock signal for evaluation arrives at the gate. The closer the PF-CDPD AND gate is to the final match output, the later it will be evaluated. The later it is evaluated, the closer its drain node voltage is to the ground level at the time of evaluation. We call this phenomenon a pseudo ground effect, and a stronger pseudo ground effect is represented by a smaller resistance. The PF-CDPD match line now behaves much like a series of inverters with each inverter standing on top of a small resistance, as shown in Fig. 5(b), where the resistance becomes smaller toward the final match output, and therefore the search time can be greatly reduced.
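The domino-style evaluation of one match line can be sketched behaviorally as follows. This is a minimal logic-level model, not a circuit simulation: it assumes one 64-b half-plane split into 11 stages with fan-ins of 4, 6, 6, ..., 6 (the exact split is an assumption here), and the names `HALF_PLANE_FAN_INS`, `evaluate_match_line`, and `entry_matches` are hypothetical. A stage is evaluated only if every preceding stage has matched, which mirrors each gate being clocked by the output of its predecessor.

```python
from typing import Sequence, Tuple

# Assumed split of one 64-b half-plane into 11 stages: 4 + 10 * 6 = 64 bits.
HALF_PLANE_FAN_INS = (4,) + (6,) * 10


def evaluate_match_line(stored: Sequence[int], search: Sequence[int],
                        fan_ins: Sequence[int] = HALF_PLANE_FAN_INS) -> Tuple[bool, int]:
    """Return (match, stages_fired) for one match line.

    Stages are evaluated left to right. A stage fires only if all of its
    bits match and every earlier stage has fired, as in a domino chain.
    """
    if len(stored) != len(search) or len(stored) != sum(fan_ins):
        raise ValueError("bit widths must agree with the stage fan-ins")
    fired = 0
    start = 0
    for fan_in in fan_ins:
        end = start + fan_in
        if list(stored[start:end]) != list(search[start:end]):
            break  # this stage stays quiet, and so do all later stages
        fired += 1
        start = end
    return fired == len(fan_ins), fired


def entry_matches(stored: Sequence[int], search: Sequence[int]) -> bool:
    """AND the results of the left and right 64-b half-planes of a 128-b entry."""
    left, _ = evaluate_match_line(stored[:64], search[:64])
    right, _ = evaluate_match_line(stored[64:], search[64:])
    return left and right
```

The second return value counts how many stages fire before the first mismatching stage, which is the quantity that determines how much of the line switches in a given search.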
No matter whether a BiCAM or a TCAM is realized with this match-line scheme, the search speed will be nearly the same because of the same critical path with a similar strength of the pseudo ground effect.

The PF-CDPD logic also leads to low power consumption for the following reasons.

1) In the precharge phase, only a small parasitic capacitance at the output node of each dynamic NAND gate is charged. Therefore, if the dynamic gate changes its output state in the evaluation phase, only a small quantity of charge will be pulled to ground, and the power consumption will be small.

2) The logic function implemented in each PF-CDPD gate is AND. It is well known that a multiple-fan-in AND gate has a low switching activity. Consequently, the average power consumption of a PF-CDPD AND gate is much lower than that of a NOR gate.

3) The evaluation of the match line [shown in Fig. 4(a)] starts from the leftmost PF-CDPD gate (the first gate). If the first four input bits match completely with the first four stored bits, the output of the first gate will go high after evaluation. The second leftmost PF-CDPD gate (the second gate) cannot begin to evaluate until the output of the first gate goes high. This is because the clock signal of the second gate is exactly the output signal of the first gate. All the following gates are connected in a similar way, so the evaluation of the entire match line proceeds consecutively from the leftmost gate to the rightmost gate like a domino. If the output of the first gate is kept low, reflecting a mismatch, all the other gates will be kept quiet in the evaluation phase. As such, the switching activity of the later stages depends on the evaluation result of the preceding stages. This effect greatly reduces the average switching activity of the match line.

4) For some applications, the data can be arranged such that the mismatch mostly happens in the leftmost bits of Fig. 4(a), so that the average switching activity and the power consumption of the match line, in a statistical sense, can be reduced even further.

5) As mentioned before, the search bit lines are kept quiet in the evaluation phase. Therefore, the search bit lines can be realized as static circuits with no concerns about data racing or DC current. Compared to the dynamic counterpart, the static realization of the search circuit saves switching power.

Fig. 4. (a) Circuit along the critical-path and (b) operating waveforms.

Fig. 5. (a) Worst-speed evaluation case and (b) pseudo ground effect in this case.

The critical path circuit and the operating waveforms of a CAM macro adopting the PF-CDPD match line have been shown in Fig. 4. If the CAM is realized with the domino or CDPD circuit, the match-line circuit is composed of domino or CDPD gates, and the critical path circuit and the timing waveforms remain similar. As described before, the input search data are sent into the array in the data setup phase. Therefore, for evaluating the search speed, only the match line in the data matching phase has to be simulated. We have implemented a 64-b match line with domino, CDPD, and PF-CDPD circuits, respectively, as shown in Fig. 6(a). We assume that all the match lines are constructed with 11 cascaded stages of dynamic AND gates. The reasons for constructing such a match line are explained in the next section. Post-layout HSPICE simulations were performed to obtain the search delay and the power consumption of each match line, and the results are shown in Fig. 6(b) and Table I. The search delay is measured from the output of the clock buffer to the output node of the match line, and does not include the delay of the clock buffer. Power consumption is evaluated at a clock frequency of 100 MHz, assuming the input data completely match the stored data. Simulation data show that by adopting the PF-CDPD circuit, the search delay, the power consumption of the clock buffer, and the power consumption of the match line are all reduced significantly. Compared to the CDPD match line, the power-delay product of the PF-CDPD match line is reduced by about 66%.

IV. DESIGN CONSIDERATIONS

The circuit structure of the content-addressable memory based on the PF-CDPD match line is very simple, and no sophisticated, PVT-variation-sensitive sensing circuitry is required to accelerate the search operation. In addition, the PF-CDPD match line permits full utilization of the inherent speed of gates just like the domino logic circuit.

Two design considerations remain. The first is the charge sharing effect in the PF-CDPD gate, and the second is the relationship between the performance achieved and the total number of stages of PF-CDPD AND gates in a match line. As described earlier, the proposed match circuit evaluates like a string of inverters due to the pseudo ground effect. To accelerate the search, it seems that the number of inverters should be reduced as much as possible. However, as the logic depth gets shorter, the pull-down stack becomes higher. The pseudo ground effect then becomes weaker because of a higher series resistance, and the delay time of each stage becomes longer accordingly. Charge sharing is also a concern when the pull-down stack becomes higher. As a result, for a given bit width, there is an optimal tradeoff between the number of cascaded stages and the number of stacked nMOS transistors in each stage.

The traditional solution for overcoming the charge sharing problem is to add a weak feedback pMOS to each dynamic gate [16]. In addition to using the feedback pMOS, we added design steps in order to alleviate the charge sharing problem at any PVT corner, as well as to obtain better performance. The design procedure is described as follows, taking the 0.18-µm 1.8-V 256 × 128-b BiCAM macro as a design example.

1) The first step is to determine the maximal number of stacked nMOS transistors in the PF-CDPD NAND gate with the constraint that no charge sharing (CS) error is found at the typical operating condition (typical process, 1.8 V, and 25 °C). The circuit model for the simulation consists of two cascaded PF-CDPD stages with a variable number of inputs, as shown in Fig. 7(a). To trigger the worst CS effect, a two-cycle simulation is performed. The pull-down network of the first gate is completely discharged in the first cycle, and is cut off simply by turning off the nMOS closest to the ground in the second cycle. All the lower inputs of the second dynamic gate are fixed at the high level so that the output of the first stage is always passed to the final output. The circuit is considered to have a charge sharing problem if a wrong output voltage of the first stage, caused by the charge sharing effect, not only propagates to the second stage but also becomes a logic error at the second stage. In order to obtain a small gate delay, the nMOS transistors in the pull-down network should be sized as large as possible, while at the same time taking the cell layout into consideration and not increasing the cell area. In this work, according to the 0.18-µm CAM cell layout, all the nMOS transistors can be sized to two times the minimal transistor size. From simulation results, we can always find a suitable transistor sizing to overcome the charge sharing problem for a PF-CDPD gate with six or fewer inputs. However, the 7-input PF-CDPD gate will malfunction due to the charge sharing effect, no matter what sizing is chosen. Fig. 7(b) shows the simulation result of a 7-input PF-CDPD gate. The output of the first dynamic gate is precharged to the supply voltage, and then falls below half the supply voltage in the evaluation phase due to the charge sharing effect. The output of the first stage is consequently pulled above half the supply voltage, causing a logic error at the final output.

Fig. 6. (a) Three kinds of 64-b match line, and (b) simulated waveforms.

TABLE I PERFORMANCE COMPARISONS BETWEEN DIFFERENT STYLES OF MATCH LINE

Fig. 7. (a) Simulation model and (b) simulation results of observing the charge sharing effect.

2) After determining the maximal stacking number, the second step is to fix this sizing so that the charge sharing problem is overcome at all process corners. The simulation model remains the circuit shown in Fig. 7(a), with the fan-in number fixed at six and the supply voltage fixed at 1.8 V. The channel length of the sized transistor is varied to be equal to or larger than the minimal channel length, and the channel width is fixed at 0.42 µm, which is the minimal transistor width in this technology. Fig. 8(a)–(c) shows the simulation waveforms of two designs (two channel lengths) for three corner cases, TT (typical NMOS, typical PMOS), SS (slow NMOS, slow PMOS), and SF (slow NMOS, fast PMOS), respectively. These simulation results indicate that the longer the channel length is, the more severe the charge sharing effect. If the channel length is 1.0 µm, the output voltage of the first stage is pulled up to 0.67 V and 0.82 V for the TT [Fig. 8(a)] and the SS [Fig. 8(b)] cases, respectively. At these two corners, the output voltage of the second stage is always kept at 0 V. This means that if the channel length is 1.0 µm, the circuit is still safe at these two corners. However, for the SF case [Fig. 8(c)], the output voltage of the first stage is pulled up as high as 1.49 V, and the second PF-CDPD gate no longer functions correctly. Fig. 8(d) summarizes the simulation results, where the cases in which the circuit fails due to the charge sharing problem are circled together. Therefore, the channel length should never be larger than 1.0 µm. However, there are two other concerns that lead the real design to choose an even lower value for the channel length. First, the shorter the channel length, the smaller the glitch at the first-stage output and the smaller the transient power wasted due to the glitch. Second, considering the PVT variations together, the design should maintain a suitable margin. For this purpose, the channel length is chosen to be 0.7 µm at most, as indicated in Fig. 8(d), so that the maximal glitch voltage due to the charge sharing effect is kept below 0.4 V no matter the operating corner.

Fig. 8. Simulation waveforms at (a) TT, (b) SS, and (c) SF corners. (d) Simulation results.

Fig. 9. Simulation results of the delay time.


Fig. 10. Simulation results of a 6-input PF-CDPD gate.

3) The size of this transistor not only determines whether the charge sharing problem will happen, but also affects the operating speed of the circuit. The simulated delay time of a 6-input PF-CDPD gate with different channel lengths at different operating corners is shown in Fig. 9. For a small delay time, the channel length can be chosen to be between 0.4 µm and 0.7 µm.

4) After considering the impact of sizing on both the charge sharing effect and the delay time, the channel length and the channel width are chosen as 0.5 µm and 0.24 µm, respectively. Finally, we check whether the 6-input PF-CDPD gate functions correctly at the worst operating conditions, with the temperature set at 110 °C, the process corner set to SF, and the supply voltage varied from 1.8 V to 1.5 V. Simulation results in Fig. 10 indicate that the 6-input PF-CDPD circuit always works without the charge sharing problem.

5) The final design step is to decide the number of cascaded stages and the number of stacked nMOS transistors in each stage, with the constraint that the maximal fan-in is six. There are various ways to construct a match line for a given bit width, i.e., a different number of stages with a different number of stacked transistors in each stage. We tried four different combinations of the number of stages and the number of stacked transistors for every commonly used bit width, as shown in Table II. For example, if the bit width is 64, method 1 constructs the match line from twenty stages of a 3-input gate plus one stage of a 4-input gate, and method 4 uses ten stages of a 6-input gate plus one stage of a 4-input gate. The gate delays of the 3-input, 4-input, 5-input, and 6-input PF-CDPD AND gates are found to be 152.20 ps, 169.80 ps, 184.30 ps, and 197.70 ps, respectively. The simulation results of the search time for the different cases shown in Table II are summarized in Fig. 11. It was found that, regardless of the data width, the search time can be reduced if more of the stages use a larger number of stacked transistors. Therefore, for the designed 256 × 128-b CAM macro, method 4 achieves the minimal search time.
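The stage-count tradeoff in step 5 can be checked with simple arithmetic. The sketch below adds up the per-gate delays quoted above for two of the 64-b constructions. It is a first-order estimate that ignores the position-dependent pseudo ground effect, so it is not expected to reproduce the simulated values in Fig. 11. The names `GATE_DELAY_PS` and `estimate_search_time_ps` are hypothetical.

```python
# Per-gate delays in picoseconds, keyed by fan-in, as quoted in step 5.
GATE_DELAY_PS = {3: 152.20, 4: 169.80, 5: 184.30, 6: 197.70}


def estimate_search_time_ps(stage_fan_ins):
    """Sum the quoted gate delays over all stages (first-order estimate only)."""
    return sum(GATE_DELAY_PS[fan_in] for fan_in in stage_fan_ins)


# Two of the 64-b constructions described in the text.
METHOD_1_64B = [3] * 20 + [4]  # twenty 3-input stages plus one 4-input stage
METHOD_4_64B = [6] * 10 + [4]  # ten 6-input stages plus one 4-input stage

if __name__ == "__main__":
    for name, stages in (("method 1", METHOD_1_64B), ("method 4", METHOD_4_64B)):
        print(name, sum(stages), "bits,", round(estimate_search_time_ps(stages), 1), "ps")
```

Under this simple model, method 1 sums to about 3213.8 ps and method 4 to about 2146.8 ps, which agrees with the trend reported above.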

V. EXPERIMENTAL RESULTS

The block diagram and the photograph of the 256 × 128-b BiCAM test chip are shown in Fig. 12(a) and (b), respectively. The clock signal of the test chip is generated by a VCO, which can provide a frequency range of 200 to 600 MHz. A divide-by-two circuit is used to obtain the clock signal with a 50% duty cycle. The input (pre-stored and searching) data are generated by four 32-b linear feedback shift registers (LFSRs).
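A software sketch of such a pattern generator is given below. The feedback polynomial is not stated here, so the tap positions (32, 22, 2, 1), a commonly tabulated choice for 32-b LFSRs, are an assumption, as are the function names and the way the four 32-b states are concatenated into one 128-b word. The point illustrated is only that identical seeds produce identical stored and search sequences, while different seeds produce different ones.

```python
MASK32 = 0xFFFFFFFF
ASSUMED_TAPS = (32, 22, 2, 1)  # assumption: not taken from the paper


def lfsr32_step(state: int, taps=ASSUMED_TAPS) -> int:
    """Advance a 32-b Fibonacci LFSR by one shift."""
    feedback = 0
    for tap in taps:
        feedback ^= (state >> (tap - 1)) & 1  # tap n reads bit n-1
    return ((state << 1) | feedback) & MASK32


def lfsr_words(seeds, count):
    """Generate `count` 128-b words by concatenating four 32-b LFSR states."""
    states = list(seeds)
    words = []
    for _ in range(count):
        states = [lfsr32_step(s) for s in states]
        word = 0
        for s in states:
            word = (word << 32) | s
        words.append(word)
    return words


if __name__ == "__main__":
    stored = lfsr_words((1, 2, 3, 4), 256)
    search = lfsr_words((1, 2, 3, 4), 256)
    print(stored == search)  # same seeds, same sequence
```

With the same seeds, every search word equals the word stored at the same step, which corresponds to the same-seed measurement condition.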

TABLE II DIFFERENT CONSTRUCTION METHODS FOR VARIOUS BIT WIDTHS

Fig. 11. Simulation results of the search time for different cases shown in Table II.

We can assign the same or different seeds for generating the same or different random data patterns for both pre-store and search operations, and the average power consumption can then be measured. In order to easily measure the search time, output flip-flops are used to capture the matching results. The operation waveforms of the test chip are illustrated in Fig. 12(c). The search time can be calculated as $T_{search} = T_{cycle}/2 - T_{setup} - T_{d}$, where $T_{cycle}$ is the measured clock cycle time, $T_{setup}$ is the simulated set-up time of the flip-flop (5.2 ps), and $T_{d}$ is the simulated delay time between the clocking edge of the match circuit and that of the output flip-flop (370 ps). The test chip is fabricated in a 1.8-V 0.18-µm mixed-signal CMOS technology [20]. Only eight match results are pulled to the output pads, and the core area of the test chip is m . The measurement waveforms at 1.8 V are shown in


Fig. 13. (a) Measured waveforms, and (b) shmoo plot.

Fig. 12. (a) Block diagram, (b) photograph, and (c) timing diagram of the test chip.

Fig. 13(a). The minimal cycle time is measured to be 4.95 ns, and the search time can then be calculated as 4.95 ns / 2 − 5.2 ps − 370 ps ≈ 2.1 ns. Fig. 13(b) shows the shmoo plot. The chip functions correctly even when the supply voltage is reduced to 1.4 V, which implies that the PF-CDPD circuit has a large tolerance for supply voltage variation. The feature summary of the test chip is given in Table III. The proposed BiCAM chip achieves a 2.1-ns search time with an energy of 2.33 fJ/bit/search. Total power consumption is 18.16 mW at the highest operating frequency with the test patterns generated from the LFSRs, and the power consumption breakdown

from clock buffers, match lines, and search lines is about 7.9%, 9.6%, and 82.5%, respectively. We obtained eight samples of the test chip from an educational program. All chips function correctly. The last two rows of Table III show the standard deviation of the search time and the energy metric, respectively. Measurement results show that the standard deviation of the search time is 0.05 ns, and the standard deviation of the energy metric is 0.05 fJ/bit/search. Performance summaries of the proposed CAM macro and several other recently published CAM macros are shown in Table IV. The proposed design has a very competitive search speed and energy index.

The performance of 128-b BiCAM, 144-b BiCAM, and 144-b TCAM match lines is also evaluated through HSPICE simulations, and the evaluation results are listed in Table V. When realizing a 144-b word, the match line in each half-plane was designed to consist of twelve stages of six-input dynamic AND gates. Power consumption is evaluated for two cases. In the first case, the input search data match the stored data. In this case, the power consumption reaches its maximum because all the dynamic gates in a match line are switched. In the second case, the calculated average power consumption is reported, taking the switching probability into consideration. For example, the power consumption of the 256 × 144-b TCAM match line is calculated as

$$P_{avg} = \sum_{i} \alpha_i P_i$$

where $\alpha_i$ and $P_i$ stand for the switching probability and the power consumption of the $i$th gate, respectively. As compared to the 128-b BiCAM match line, the search delay and the power consumption of the 144-b TCAM match line increase by 12% and 7%, respectively.

TABLE III FEATURE SUMMARY OF THE TEST CHIP

TABLE IV PERFORMANCE SUMMARIES OF DIFFERENT CAM MACROS

TABLE V PERFORMANCE SUMMARIES OF DIFFERENT PF-CDPD MATCH LINES

VI. CONCLUSION

We proposed a new AND-type match-line scheme in order to obtain high-performance and energy-efficient CAM macros. The match line was constructed with the newly proposed PF-CDPD logic. The implemented 0.18-µm 1.8-V 256 × 128-b BiCAM macro achieved a 2.1-ns search time with an energy of 2.33 fJ/bit/search.

ACKNOWLEDGMENT

The authors thank the Chip Implementation Center for supporting chip fabrication, and the National Science Council and the Ministry of Economic Affairs of Taiwan for funding the research.

REFERENCES

[1] J. P. Wade and C. G. Sodini, “A ternary content addressable search engine,” IEEE J. Solid-State Circuits, vol. 24, no. 8, pp. 1003–1013, Aug. 1989.
[2] R. Sangireddy and A. K. Somani, “High-speed IP routing with binary decision diagrams based hardware address lookup engine,” IEEE J. Sel. Areas Commun., vol. 21, no. 5, pp. 513–521, May 2003.
[3] T. Hayashi and T. Miyazaki, “High-speed table lookup engine for IPv6 longest prefix match,” in Proc. IEEE Global Telecommunications Conf. GLOBECOM’99, 1999, vol. 2, pp. 1586–1571.
[4] K.-J. Lin and C.-W. Wu, “A low-power CAM design for LZ data compression,” IEEE Trans. Comput., vol. 49, no. 10, pp. 1139–1145, Oct. 2000.
[5] T. Ikenaga and T. Ogura, “A fully parallel 1-Mb CAM LSI for real-time pixel-parallel image processing,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1999, pp. 264–265.
[6] H. Kadota et al., “An 8-kbit content-addressable and reentrant memory,” IEEE J. Solid-State Circuits, vol. SC-20, no. 5, pp. 951–957, Oct. 1985.
[7] F. Shafai, K. J. Schultz, G. F. R. Gibson, A. G. Bluschke, and D. E. Somppi, “Fully parallel 30-MHz, 2.5-Mb CAM,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1690–1696, Nov. 1998.
[8] Y. L. Hsiao, D. H. Wang, and C. W. Jen, “Power modeling and low-power design of content addressable memories,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), 2001, vol. 4, pp. 926–929.
[9] N. F. Goncalves and H. De Man, “NORA: a racefree dynamic CMOS technique for pipelined logic structures,” IEEE J. Solid-State Circuits, vol. 18, no. 3, pp. 261–266, Jun. 1983.
[10] H. Miyatake, M. Tanaka, and Y. Mori, “A design for high-speed low-power CMOS fully parallel content-addressable memory macros,” IEEE J. Solid-State Circuits, vol. 36, no. 7, pp. 956–968, Jun. 2001.
[11] I. Arsovski and A. Sheikholeslami, “A current-saving match-line sensing scheme for content-addressable memories,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, 2003, pp. 304–305.
[12] ——, “A mismatch-dependent power allocation technique for match-line sensing in content-addressable memories,” IEEE J. Solid-State Circuits, vol. 38, no. 11, pp. 1958–1966, Nov. 2003.
[13] K. Pagiamtzis et al., “A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1512–1519, Sep. 2004.
[14] S. Choi et al., “A 0.7 fJ/bit/search, 2.2 ns search time, hybrid type TCAM architecture,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, 2004, pp. 498–507.
[15] J.-S. Wang, H.-Y. Li, C.-C. Chen, and C. Yeh, “An AND type match-line scheme for energy efficient content addressable memories,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, 2005, pp. 464–465.


[16] R. H. Krambeck, C. M. Lee, and H.-F. S. Law, “High-speed compact circuits with CMOS,” IEEE J. Solid-State Circuits, vol. 17, no. 6, pp. 614–619, Jun. 1982.
[17] J.-R. Yuan, C. Svensson, and P. Larsson, “New domino logic precharged by clock and data,” Electron. Lett., vol. 29, no. 25, pp. 2188–2189, Dec. 1993.
[18] J.-S. Wang, C.-R. Chang, and C. Yeh, “Analysis and design of high-speed and low-power CMOS PLAs,” IEEE J. Solid-State Circuits, vol. 36, no. 8, pp. 1250–1262, Aug. 2001.
[19] C.-R. Chang, J.-S. Wang, and C.-H. Yang, “Low-power and high-speed ROM modules for ASIC applications,” IEEE J. Solid-State Circuits, vol. 36, no. 10, pp. 1516–1523, Oct. 2001.
[20] Taiwan Semiconductor Manufacturing Co., Ltd., TSMC 0.18-μm mixed signal 1P6M+ MIM salicide 1.8 V/3.3 V process documents, T-018-MM-TM-002.

Hung-Yu Li was born in Taiwan, R.O.C., in 1974. He received the B.S. degree in electrical engineering from Tatung University, Taipei, Taiwan, in 1996 and the Ph.D. degree in electrical engineering from National Chung-Cheng University, Taiwan, in 2005. Since then, he has been with Faraday Technology Corporation, Hsinchu Science Park, Taiwan, where he is currently a Senior Engineer. His research interests include high-speed and low-power memory designs.

Chia-Cheng Chen was born in Taiwan, R.O.C., in 1980. He received the B.S. and M.S. degrees in electrical engineering from National Chung-Cheng University, Chiayi, Taiwan, in 2004. He is currently an engineer at Faraday Technology Corporation, Hsinchu, Taiwan, where he is working on development and design of memories.


Jinn-Shyan Wang (S’85–M’88) was born in Taiwan, R.O.C., in 1959. He received the B.S. degree in electrical engineering from the National Cheng-Kung University, Tainan, Taiwan, in 1982 and the M.S. and Ph.D. degrees from the Institute of Electronics, National Chiao-Tung University, Hsinchu, Taiwan, in 1984 and 1988, respectively. He was with Industrial Technology Research Institute (ITRI) from 1988 to 1995, engaged in ASIC circuit and system design, and became the Manager of the Department of VLSI Design. He joined the Department of Electrical Engineering, National Chung-Cheng University, Chia-Yi, Taiwan, in 1995, where he is currently a full Professor. His research interests are in low-power and high-speed digital integrated circuits and systems, analog integrated circuits, IP and SOC design, and CMOS image sensors. He has published over 20 journal papers and 40 conference papers and holds over 20 patents on VLSI circuits and architectures.

Chingwei Yeh received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, R.O.C., in 1986, and the Ph.D. degree in electrical and computer engineering from the University of California at San Diego in 1992. Since then, he has been with the Electrical Engineering Department, National Chung-Cheng University, Taiwan, as a faculty member. His research interests include digital VLSI design and CAD.