Digital Circuit Methodologies for Low Power and Robust Nanoscale Integration

by Sherif Amin Tawfik

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Electrical and Computer Engineering)

at the UNIVERSITY OF WISCONSIN-MADISON 2009


Acknowledgments

I would like to express my gratitude to the people who helped me throughout my graduate studies at the University of Wisconsin-Madison. I am deeply grateful to my academic advisor, Professor Volkan Kursun, for his hard work, his dedication, and his encouragement. I am also grateful to Professors Michael Schulte, Azadeh Davoodi, Zhenqiang Ma, and Oguzhan Alagoz for serving on my proposal and defense committees. Special thanks go to my colleagues Ranjith Kumar and Zhiyu Liu for the interesting discussions we had during the Ph.D. program. I would also like to thank all the professors in the Electrical and Computer Engineering and Computer Sciences departments with whom I have taken courses. Finally, I would like to thank my family and friends in Egypt, USA, Canada, Austria, and Germany for their love and support.


Abstract

The integration density and the operating speed of integrated circuits are enhanced with technology scaling. The increased number and the higher operating speed of scaled transistors lead to broader functionality and enhanced performance in an integrated circuit. These advantages associated with technology scaling, however, come at the cost of elevated power consumption and enhanced sensitivity to parameter variations. Developing low power and variation tolerant integrated circuit techniques has become a primary necessity for the semiconductor industry.

Multiple supply voltage circuit techniques exploit the delay differences among the different signal propagation paths within an integrated circuit. The supply voltages of the gates on the non-critical delay paths are selectively lowered while a higher supply voltage is maintained on the speed critical paths in order to satisfy a target clock frequency. Specialized voltage interface circuits are required in order to transfer signals among these circuits operating at different voltage levels. Voltage level converters impose additional power consumption and propagation delay overheads in a multiple supply voltage system. New low-power and high-speed multiple threshold voltage interface circuits are proposed to enhance the efficiency of the multiple supply voltage systems-on-chip.

The clock distribution network consumes a significant portion of the power, area, and metal resources of an integrated circuit. The enhancement of clock frequency and the growth of die size cause the power consumption of the clock distribution subsystem to increase significantly. Furthermore, the process and environment parameter variations increase with each new technology generation. Coping with parameter variations is particularly challenging in the design of the clock distribution networks since the clock signal needs to be distributed to the entire integrated circuit with controlled skew. Novel clock tree design methodologies are presented in this dissertation for simultaneously suppressing the temperature-fluctuation-induced skew and the power consumption of clock distribution networks.

The amount of embedded memory in modern microprocessors and systems-on-chip is increased in each new technology generation to meet the performance requirements. The reduced supply and threshold voltages and the scaled device dimensions lead to a degradation in the data stability of the memory banks with technology scaling. The increasing leakage energy consumption of on-chip caches is another growing concern since the majority of transistors are employed for embedded memory in modern microprocessors. New circuit techniques are proposed for simultaneously enhancing the data stability and reducing the leakage power in deeply scaled nanometer memory banks.

Scaling of the standard single-gate bulk MOSFETs faces great challenges in the nanometer regime due to the severe short-channel effects that cause an exponential increase in the leakage currents and enhanced sensitivity to process variations. Multi-gate MOSFET technologies mitigate these limitations by providing a stronger control over a thin silicon body with multiple electrically coupled gates. FinFET is the most attractive choice among the multi-gate transistor architectures because of the self-alignment of the two gates and the similarity of the fabrication steps to the existing standard CMOS technology. FinFET technology development guidelines for higher performance, lower power consumption, and weakened sensitivity to parameter variations are proposed. New low power and robust FinFET static memory circuits, sequential circuits, and domino circuits are proposed in this dissertation.


Contents

Acknowledgments …… ii
Abstract …… iii
List of Tables …… x
List of Figures …… xii

1 Introduction …… 1
1.1 Scaling Trends …… 2
1.2 Outline of the Dissertation …… 10

2 Sources of Power Consumptions in Digital Circuits …… 14
2.1 Dynamic Switching Power Consumption …… 14
2.2 Short Circuit Power Consumption …… 19
2.3 Leakage Power Consumption …… 25
2.3.1 Sub-threshold Leakage …… 26
2.3.2 Gate Leakage Current …… 34
2.3.3 Reverse Biased Junction Leakage …… 38
2.4 Static DC Power Consumption …… 41

3 Low Power and High Speed Multi Threshold Voltage Interface Circuits …… 43
3.1 Level Converters …… 44
3.1.1 Feedback-Based Level Converters …… 44
3.1.2 Multi-Vth Level Converters …… 47
3.2 Speed and Power Consumption Characteristics …… 50
3.2.1 Comparison at the Nominal Process Corner …… 51
3.2.2 Characterization Under Supply Voltage and Process Parameter Variations …… 54
3.2.3 Multi-Vth CMOS Technology …… 58
3.3 Chapter Summary …… 60

4 Dual Power Supplies and Dual Clock Frequencies for Lower Clock Power and Suppressed Temperature-Gradient Induced Clock Skew …… 61
4.1 Previous Works …… 62
4.2 Supply Voltage Optimization in Clock Distribution Networks …… 63
4.2.1 Dual-VDD/Single-Frequency Clocking Methodology …… 66
4.2.2 Dual-VDD/Dual-Frequency Clocking Methodology …… 69
4.3 Clock Skew Characterization under Process Variations …… 76
4.4 Chapter Summary …… 80

5 Clock Distribution Networks with Gradual Signal Transition Time Relaxation for Reduced Power Consumption …… 81
5.1 Buffered Wire Segment …… 82
5.2 Buffer Insertion and Sizing Algorithm …… 86
5.3 Experimental Results …… 91
5.3.1 Comparisons at the Nominal Process Corner with Uniform Die Temperature …… 92
5.3.2 Impact of Process and Temperature Variations on Clock Skew …… 93
5.4 Branch and Bound Formulation …… 98
5.5 Chapter Summary …… 100

6 Dynamic Wordline Voltage Swing for Low Leakage and Stable Static Memory Banks …… 101
6.1 The Proposed 6T SRAM Circuit Technique …… 103
6.2 Simulation Results and Area Comparison …… 104
6.2.1 Data Stability …… 105
6.2.2 Leakage Power Consumption …… 106
6.2.3 Area Comparison …… 106
6.2.4 Active Mode Power and Access Speed …… 108
6.2.5 Process Variations …… 110
6.3 Chapter Summary …… 112

7 Low Power and Robust 7T Dual-Vt SRAM Circuit …… 113
7.1 The Proposed 7T Dual-Vt SRAM Cell …… 114
7.2 Simulation Results and Circuit Layouts …… 116
7.2.1 Data Stability …… 116
7.2.2 Leakage Power Consumption …… 117
7.2.3 Area Comparison …… 118
7.2.4 Active Mode Power and Access Speed …… 120
7.2.5 Process Variations …… 122
7.3 Chapter Summary …… 125

8 Multi-Gate FinFET Technology …… 126
8.1 Emerging Multi-Gate Technology …… 126
8.2 FinFET Technology Development Guidelines …… 129
8.2.1 DC Characteristics …… 130
8.2.2 Process, Supply Voltage, and Temperature Variations …… 138
8.3 Threshold Voltage Tuning Techniques …… 144
8.3.1 Independent-Gate FinFET Technology …… 144
8.3.2 Work-Function Engineering …… 146
8.4 Chapter Summary …… 148

9 Multi-Vth FinFET Sequential Circuits with Independent-Gate Bias and Work-Function Engineering for Reduced Power Consumption …… 149
9.1 FinFET Technology …… 149
9.2 FinFET Latches …… 151
9.2.1 Single-Vth Tied-Gate FinFET Latches …… 151
9.2.2 New Brute-Force Multi-Vth FinFET Latches …… 153
9.2.3 Comparison of the FinFET Latches …… 155
9.3 FinFET Flip-Flops …… 157
9.3.1 Brute-Force FinFET Flip-Flops …… 157
9.3.2 Comparison of the FinFET Flip-Flops …… 159
9.4 Chapter Summary …… 161

10 FinFET Domino Logic with Independent Gate Keepers …… 163
10.1 FinFET Device …… 164
10.2 Domino Logic Circuits …… 165
10.2.1 Standard Tied-Gate FinFET Domino Logic Circuits …… 166
10.2.2 FinFET Domino with Variable-Threshold-Voltage Keeper …… 168
10.3 Simulation Results …… 169
10.4 Chapter Summary …… 173

11 Low Power and Robust Independent-Gate FinFET SRAM Cells …… 174
11.1 FinFET SRAM Cells …… 175
11.1.1 Standard Low-Vt Tied-Gate FinFET SRAM Cells …… 175
11.1.2 Independent-Gate FinFET SRAM Cells …… 176
11.2 Simulation Results …… 178
11.2.1 Read Stability …… 178
11.2.2 Leakage Power Consumption …… 179
11.2.3 Cell Read Current …… 180
11.2.4 SRAM Cell Area …… 181
11.2.5 Process Parameter Variations …… 184
11.3 Chapter Summary …… 185

12 Work-Function Engineering for Reduced Power and Higher Integration Density: An Alternative to Sizing for Stability in FinFET Memory Circuits …… 186
12.1 Work-Function Engineered SRAM Cells …… 187
12.2 Comparisons …… 191
12.3 Chapter Summary …… 195

13 Future Works …… 196
13.1 Robust and High Speed SRAM Cell Using Local Strain Technologies …… 196
13.2 Dual-Threshold Voltage FinFET Technology Based on Gate-Drain/Source Underlap Engineering …… 201
13.3 High Data Stability and Low Leakage Power FinFET Memory Circuit Based on Gate-Drain/Source Underlap Engineering …… 202
13.4 Robust and High Speed Seven Transistors FinFET SRAM Cell Based on Gate-Drain/Source Underlap Engineering …… 203

Bibliography …… 205
Appendix A: Publications …… 219

List of Tables

3.1 Total transistor width (w), average propagation delay (d), and average power consumption (p) of the level converters …… 53

3.2 Optimum threshold voltages with the proposed level converters …… 53

3.3 Normalized total transistor width (w), average propagation delay (d), and average power consumption (p) of the level converters …… 54

4.1 Temperature fluctuations induced clock skew and power consumption of the standard single-VDD/single-frequency CDN …… 66

4.2 Temperature fluctuations induced clock skew and power consumption of the dual-VDD/single-frequency CDN …… 69

4.3 Temperature fluctuations induced delay variation and power consumption of the level-converter/frequency-doubler circuits …… 75

4.4 Temperature fluctuations induced clock skew and power consumption of the proposed dual-VDD/dual-frequency CDN …… 75

4.5 Normalized power consumption and temperature fluctuations induced clock skew of the standard and the proposed dual-VDD clocking methodologies …… 76

4.6 Mean clock skew of the different clocking techniques with different temperature profiles …… 79

4.7 Standard deviation of the clock skew of the different clocking techniques with different temperature profiles …… 79

5.1 Pseudo code of the buffer insertion and sizing algorithm for an h-tree clock distribution network (BIST) …… 90

5.2 Experimental results with the proposed algorithm …… 92

5.3 Clock skew of the different clock distribution networks with non-uniform temperature profiles …… 96

6.1 Write margin of the SRAM cells …… 109

7.1 Write margins of the SRAM cells …… 121

8.1 FinFET Technology Parameters …… 128

8.2 Physical Models used in Medici simulation [70] …… 130

10.1 The Independent-Gate Keeper Optimum Bias Conditions for Achieving Minimum Delay and Power Consumption with No Degradation in NML …… 171

12.1 The Gate Material Compositions to Achieve Different Workfunctions. Data Extracted from [71] …… 188

12.2 Comparison of the Standard Single-low-Vt and the Proposed Work-Function Engineered Multi-Vt SRAM Cells …… 193

List of Figures

1.1 Scaling of the feature size with the lead Intel microprocessors. …… 3

1.2 Evolution of the number of transistors of the lead Intel microprocessors. …… 3

1.3 Evolution of the number of transistors of the IBM microprocessors. …… 4

1.4 Clock frequency scaling with the lead Intel microprocessors. …… 4

1.5 Clock frequency scaling with the lead IBM microprocessors. …… 5

1.6 Die photos of two recent multi-core processors. (a) Quad-core Opteron from AMD [84]. (b) Nine-core synergistic processor from IBM [85]. Both processors are implemented in a 65nm SOI technology. …… 6

1.7 Die photo of an experimental 80-core Intel processor operating at 4GHz in a 65nm CMOS technology [89]. …… 6

1.8 Maximum power consumption of the lead Intel microprocessors. …… 7

1.9 Breakdown of the power consumption in Itanium processors. (a) Itanium 2 microprocessor in a 180nm CMOS technology in 2002 [88]. (b) Two-core Itanium processor implemented in a 90nm CMOS technology in 2006 [14]. …… 7

1.10 Power density trends of the lead Intel microprocessors. …… 8

1.11 Leakage power trends of the lead Intel microprocessors [12]. …… 9

1.12 Comparison of the primary leakage current components in three different CMOS technology nodes. VDD is 0.7V at 25nm, 0.9V at 50nm, and 1.2V at 90nm [13]. …… 9

2.1 A generic CMOS gate. The current Ip charges the load capacitance CL to VDD when the pull-up network is activated. The current In discharges the load capacitance CL to GND when the pull-down network is activated. …… 15

2.2 Reduced voltage swing static CMOS gate. The output voltage swing is between Vtn and VDD-|Vtp|. Vtn: NMOS threshold voltage. Vtp: PMOS threshold voltage. …… 18

2.3 Short-circuit current (Isc) for a static CMOS inverter. Isc is equal to the source-to-drain current of the PMOS transistor during the input low to high transition. Alternatively, Isc is equal to the drain-to-source current of the NMOS transistor during the input high to low transition. …… 20

2.4 Test circuit for evaluating the short-circuit current. The short-circuit power consumed in inv1 is measured for different sizes of inv1, inv2, inv3, and different supply voltages. …… 21

2.5 Variations of the transition time at Node1 and the short-circuit power consumed in inv1 with the size of inv3. inv1 and inv2 are minimum sized. VDD = 1V. …… 22

2.6 Effect of coupling capacitance on the short-circuit power. During the input low to high transition the voltage on the output node increases momentarily beyond VDD leading to a negative Ip. Similarly, during the input high to low transition the voltage on the output node decreases momentarily below 0V leading to a negative In. …… 23

2.7 Variations of the input and output transition times and the short-circuit power consumed in inv1 with the size of inv2. The normalized sizes of inv1 and inv3 are 1 and 8, respectively. VDD = 1V. …… 24

2.8 Variations of the short-circuit power consumed in inv1 and the ratio of the short-circuit power to the total power with the supply voltage (VDD). …… 25

2.9 Illustration of the sub-threshold and the gate tunneling currents in a CMOS inverter for the different input states. The gate tunneling currents are shown with the horizontal arrows. Ioff-n (Ioff-p) is the sub-threshold current conducted by the turned off NMOS (PMOS). …… 25

2.10 IDS-VGS characteristics of a minimum sized NMOS in 65nm CMOS technology at room temperature. VDS = 1V. Vtn: NMOS threshold voltage. Ioff-n: NMOS drain current when VGS = 0V. (a) Linear scale. (b) Logarithmic scale. …… 26

2.11 (a) A cross section of an NMOS transistor. (b) Band diagram along the vertical direction “X”. (c) Band diagram along the horizontal direction “Y” near the Si-SiO2 interface. Ψ(x): the substrate potential. Ψf: the Fermi potential in the bulk of the substrate. Ec: bottom of the conduction band. Ev: top of the valence band. Ei: intrinsic Fermi level. Vbi: built-in potential. …… 27

2.12 Equivalent circuit relating the surface potential to the gate voltage. …… 28

2.13 Off-current versus temperature. VDS = 1V. Vtn = -Vtp = 220mV. Wn = Wp = 65nm. …… 31

2.14 Off-current versus threshold voltage. VDS = 1V. T = 27°C. Wn = Wp = 65nm. …… 32

2.15 Off-current versus VDS. T = 27°C. Vtn = -Vtp = 220mV. Wn = Wp = 65nm. …… 33

2.16 Band diagram of an NMOS transistor at the channel surface for different VDS [109]. Ec: bottom of the conduction band. …… 33

2.17 Band diagram of an NMOS showing the gate tunneling current. (a) Fowler-Nordheim tunneling mechanism. (b) Direct tunneling mechanism. …… 35

2.18 Medici-predicted gate leakage current versus gate bias and oxide thickness for an NMOS transistor. …… 36

2.19 Band diagram. (a) NMOS transistor. (b) PMOS transistor. NMOS gate leakage is dominated by inversion layer electrons tunneling from the conduction band of the p-silicon substrate to conduction band of the n+ polysilicon gate. PMOS gate leakage is dominated by inversion layer holes tunneling from the valence band of the n-silicon substrate to valence band of the p+ polysilicon gate. Inversion layer holes (electrons) are represented with a plus (minus) sign. …… 36

2.20 Different biasing condition for a stack of two transistors (NMOS or PMOS). (a) Bias1: both transistors are turned on. (b) Bias2: the transistor closer to the power rail is turned off and the second transistor in the stack is turned on. (c) Bias3: the transistor closer to the power rail is turned on and the second transistor in the stack is turned off. (d) Bias4: both transistors are turned off. Gate currents are indicated with the dashed arrows. …… 37

2.21 Total gate leakage current of a stack of two transistors versus the biasing condition in a 65nm CMOS technology. The gate leakage is maximized when both transistors are turned on (Bias1). The gate leakage in PMOS transistors is one to two orders of magnitude lower as compared to NMOS transistors. …… 38

2.22 Junction reverse-bias leakage current. (a) An NMOS structure showing the PN junction between the drain/source and the substrate. (b) The band diagram along the “x” axis with small built-in potential. (c) The band diagram along the “x” axis with a larger built-in potential caused by heavily doping both sides of the junction. Efn: quasi Fermi potential for electrons. Efp: quasi Fermi potential for holes. Vbi: PN junction built-in potential. …… 40

2.23 Circuit topologies that consume static DC current. (a) CMOS circuits with different power supplies. (b) Pass transistor logic circuits. (c) Pseudo-NMOS logic circuits. (d) Current mode logic circuits. …… 42

3.1 The standard level converter (LC1) presented in [15]. VDDL is the lower supply voltage. VDDH is the higher supply voltage. …… 46

3.2 The level converter (LC2) presented in [17]. …… 47

3.3 The first proposed level converter (PC1). Thick line in the channel area indicates a high-Vth device. …… 48

3.4 The second proposed level converter (PC2). Thick line in the channel area indicates a high-Vth device. (a) Circuit configuration for VDDL and VDDH that satisfy both (3.1) and (3.3). (b) Circuit configuration for the supply voltages that do not satisfy either (3.1) or (3.3). …… 49

3.5 The simulation setup for characterizing the level converters. Power is measured for the entire test circuit including the driver and the load inverters. Delay is measured from the input of the driver (ID) inverter to Node2. …… 52

3.6 Statistical delay and power distributions of PC2 and LC2. (a) Propagation delay. (b) Power consumption. The level converters (LC2 and PC2) are optimized for minimum propagation delay at VDDL = 1.2V. SD: standard deviation. …… 55

3.7 Statistical delay and power distributions of PC1 and LC2. (a) Propagation delay. (b) Power consumption. The level converters (LC2 and PC1) are optimized for minimum power consumption at VDDL = 1V. SD: standard deviation. …… 56

3.8 Statistical delay and power distributions of PC1 and LC2. (a) Propagation delay. (b) Power consumption. The level converters (LC2 and PC1) are optimized for minimum power consumption at VDDL = 0.5V. SD: standard deviation. …… 57

3.9 Variations of the propagation delay and the power consumption of PC2 with the threshold voltage of M2 (Vtho-M2) and M3 (Vtho-M3) at VDDL = 1.2V. For each Vth, PC2 is reoptimized (resized) to minimize the propagation delay. …… 58

3.10 Variations of the propagation delay and the power consumption of PC1 with the threshold voltage of M2 (Vtho-M2) at VDDL = 1V. For each Vth, PC1 is reoptimized (resized) to minimize the power consumption. …… 59

3.11 Variations of the propagation delay and the power consumption of PC1 with the threshold voltage of M2 (Vtho-M2) at VDDL = 0.5V. For each Vth, PC1 is reoptimized (resized) to minimize the power consumption. …… 59

4.1 Two level buffered H-tree clock distribution network. …… 64

4.2 Four different on-chip temperature profiles considered in this chapter. …… 65

4.3 Temperature fluctuations induced clock skew (for the rising edge) versus the supply voltage. VDD-nominal = 1.8V. …… 67

4.4 Temperature fluctuations induced clock skew (for the falling edge) versus the supply voltage. VDD-nominal = 1.8V. …… 67

4.5 Average power consumption of the clock distribution network versus the supply voltage for four different temperature profiles. …… 68

4.6 The proposed dual-VDD/dual-frequency clock distribution technique. LCFD: hybrid level-converter/frequency-doubler. VDDL: low optimum supply voltage that minimizes the clock skew. VDDH: nominal higher supply voltage required by the clocked elements for high speed. …… 70

4.7 A standard single-Vth cascaded two-stage level-converter/frequency-doubler circuit. …… 71

4.8 Proposed hybrid level-converter/frequency-doubler. The dual-Vth gate(s) and all the downstream gates after the dual-Vth gate(s) operate with VDDH. …… 71

4.9 Voltage waveforms of the frequency doubler (T = 25°C). …… 72

4.10 The delay element. …… 73

4.11 LCFD3. The shaded NAND gates have high-Vth transistors. The shaded inverters have high-Vth NMOS transistors. …… 74

4.12 Statistical distributions of the clock skew under process parameter variations and a uniform temperature at T = 125°C. …… 77

4.13 Statistical distributions of the clock skew under process parameter variations and a non-uniform die temperature (assuming temperature profile 1 shown in Fig. 4.2). …… 77

4.14 Statistical distributions of the clock skew under process parameter variations and a non-uniform die temperature (assuming temperature profile 2 shown in Fig. 4.2). …… 77

4.15 Statistical distributions of the clock skew under process parameter variations and a non-uniform die temperature (assuming temperature profile 3 shown in Fig. 4.2). …… 78

4.16 Statistical distributions of the clock skew under process parameter variations and a non-uniform die temperature (assuming temperature profile 4 shown in Fig. 4.2). …… 78

5.1 A wire segment of length L. The inverters are sized S times a unit inverter. …… 83

5.2 The signal transition time at Node1 for different buffer sizes and wire lengths. The input transition time is 100 ps. …… 84

5.3 The average power consumption for different buffer sizes and wire lengths. The input transition time is 100 ps. …… 84

5.4 The signal transition time at Node1 for different buffer sizes and input signal transition times. The wire length is 1mm. …… 85

5.5 The average power consumption for different buffer sizes and input signal transition times. The wire length is 1mm. …… 86

5.6 Two cascaded buffered wires. …… 86

5.7 An H-tree clock distribution network. The index of each level is indicated with the circled numbers. The squares represent the fixed loads at the leaves. …… 87

5.8 The proposed buffer insertion and sizing algorithm for a single wire (BISW). …… 89

5.9 Effect of input transition time on the temperature fluctuations induced delay variation. …… 94

5.10 Four different on-chip temperature profiles considered in this section. …… 95

5.11 Clock skew for the nominal process corner and different non-uniform die temperature profiles. …… 96

5.12 Clock skew distribution under process parameter variations (uniform die temperature). …… 97

5.13 Clock skew distribution of a 4-level H-tree under process parameter variations and non-uniform die temperature. …… 98

5.14 Illustration of the branch and bound algorithm. …… 99

6.1 A standard 6T SRAM cell in a 65nm CMOS technology. The size of each transistor is given as W/L. W: transistor width (nm). L: transistor channel length (nm). The β is typically in the range of 2 to 3 for data stability. …… 102

6.2 The schematic of the proposed variable voltage swing wordline driver. …… 103

6.3 Read static noise margin of the standard full-voltage-swing and the proposed dynamic wordline voltage swing SRAM circuits. …… 105

6.4 Leakage power consumption of the standard full-voltage-swing and the proposed dynamic wordline voltage swing SRAM circuits. …… 106

6.5 The layouts of the SRAM cells. (a) β = 1. (b) β = 2. (c) β = 3. The area of each cell is determined by a dashed rectangle. …… 107

6.6 Comparison of the access delay of the standard and the proposed SRAM circuits. …… 109

6.7 Comparison of the read and write power consumption of the standard and the proposed SRAM circuits. …… 110

6.8 Statistical leakage power distributions of the standard and the proposed SRAM circuits. …… 111

6.9 Statistical SNM distributions of the standard and the proposed SRAM circuits. …… 111

7.1 A standard 6T SRAM cell in a 65nm CMOS technology. The size of each transistor is given as W/L. W: transistor width in nanometer. L: transistor channel length in nanometer. For data stability: β ≥ 2. …… 113

7.2 The schematic of the proposed 7T dual-Vt SRAM circuit in a 65nm CMOS technology. The size of each transistor is given as W/L. W: transistor width in nanometer. L: transistor channel length in nanometer. Thick line in the channel area indicates a high-Vt transistor.………………………………………............................ 115 7.3 Read static noise margins of the SRAM cells………………………………............... 117 7.4 Average leakage power consumption of the SRAM cells…………………………… 118 7.5 The layout of a 6T SRAM cell with β = 2. Area = 0.62 µm2………………………… 118 7.6 The layout of a 6T SRAM cell with β = 3. Area = 0.688 µm2……………………...... 119 7.7 The layout of a 6T SRAM cell with β = 4. Area = 0.754 µm2……………………...... 119 7.8 The layout of an 8T SRAM cell. Area = 0.82 µm2………………………………....... 119 7.9 The layout of the proposed dual-Vt 7T SRAM cell. Area = 0.72 µm2……………...... 119 7.10 Comparison of the access delays of the memory arrays with different SRAM cells………….……………………………………………………………………….. 121 7.11 Comparison of the read and write power consumptions of the memory arrays with different SRAM cells……………………………………………………………….... 122 7.12 Statistical leakage power distributions of the SRAM cells.………………………...... 124 7.13 Statistical SNM distributions of the SRAM cells…………………………………..... 124 8.1 Different implementations of a multi-gate field-effect transistor [67]……………...... 127 8.2 FinFET 3D view. (a) Single fin FET with the fin dimensions indicated. (b) Two-fins FET …………………………………………………………………………………... 127 8.3 The variation of the threshold voltage with the channel length for a FinFET and a standard single-gate bulk MOSFET. Results are obtained by MEDICI simulations [70]…………………………………………………………………………………… 128 8.4 Drain-induced-barrier-lowering (DIBL) of a FinFET and a standard single-gate bulk MOSFET. DIBL is measured as the degradation in |Vth| when the drain voltage is increased from 0.05V to 0.8V (VDD). Results are obtained by MEDICI simulations [70]…………………………………………………………………………………… 128


8.5 The on-current produced by single-fin (minimum sized) transistors for different fin and gate oxide thicknesses. T = 27°C. VDD = 0.8V. (a) N-type FinFETs. (b) P-type FinFETs. The gate work-function is adjusted to maintain a constant threshold voltage for each device profile with a different tox and tsi combination.
8.6 Number of carriers per fin height in the channel region versus the fin thickness. Tox = 1.6nm. (a) N-type FinFETs. (b) P-type FinFETs. Dashed lines: gate work-function is the same for all the devices. Solid lines: gate work-function is increased (reduced) with a thicker fin to maintain a constant threshold voltage for N-type (P-type) FinFETs.
8.7 The off-current and the gate tunneling current of an N-type FinFET for different fin and gate oxide thicknesses. T = 27°C. VDD = 0.8V. The off-current is the drain current with VGS = 0V and VDS = VDD. The gate tunneling current is measured when VGD = VGS = VDD.
8.8 The off-current and the gate tunneling current of a P-type FinFET for different fin and gate oxide thicknesses. T = 27°C. VDD = 0.8V. The off-current is the drain current with VGS = 0V and VDS = VDD. The gate tunneling current is measured when VGD = VGS = VDD.
8.9 Ratio of the on-current to the total leakage currents for different fin and gate oxide thicknesses of an N-Type FinFET. (a) T = 27°C. (b) T = 110°C.
8.10 Ratio of the on-current to the total leakage currents for different fin and gate oxide thicknesses of a P-Type FinFET. (a) T = 27°C. (b) T = 110°C.
8.11 Variation of the sub-threshold slope of an N-type FinFET with the different fin and gate oxide thicknesses. (a) T = 27°C. (b) T = 110°C.
8.12 Variation of the drain-induced-barrier-lowering of an N-type FinFET with the different fin and gate oxide thicknesses. (a) T = 27°C. (b) T = 110°C.
8.13 On-current percent variation due to process parameter variations. T = 27°C. (a) N-type FinFETs. (b) P-type FinFETs.
8.14 Ratio of the maximum off-current to the minimum off-current under process variations with different device profiles. T = 27°C. (a) N-type FinFET. (b) P-type FinFET.


8.15 Test circuit for characterizing the impact of supply voltage and temperature variations on the inverter propagation delay.
8.16 Propagation delay versus the supply voltage for different fin thickness. Gate oxide thickness is equal to 1.6nm. T = 27°C.
8.17 Percentage temperature-induced propagation delay variation for different supply voltages and different fin thickness. The temperature is varied from 27°C to 110°C.
8.18 FinFET architectures. (a) Tied-gate FinFET. (b) Independent-gate FinFET.
8.19 Drain current characteristics of FinFETs. (a) N-FinFET. (b) P-FinFET. |VDS| = VDD = 0.8V. T = 110°C.
8.20 Work-function tuning with Molybdenum gate material. The work-function is tuned with a 5 x 10-15 cm-2 Nitrogen dose and different implantation energy. Data extracted from [71].
8.21 Work-function tuning with full silicidation of doped polysilicon gate material. The work-function is tuned based on the doping level of the polysilicon gate prior to the silicidation step [72].
9.1 FinFET architectures. (a) Tied-gate. (b) Independent-gate.
9.2 Modes of operation of independent-gate FinFETs. (a) Dual-gate mode. (b) Single-gate mode.
9.3 Brute-force latch implementations with single-Vth tied-gate FinFETs. (a) LATCH-TG1. (b) LATCH-TG2.
9.4 Layouts of the standard single-Vth tied-gate FinFET circuits. (a) LATCH-TG1: 0.64 µm². (b) LATCH-TG2: 0.63 µm².
9.5 Proposed multi-Vth brute-force latches. (a) LATCH-IG. (b) LATCH-WF. (c) LATCH-WF-IG. Thick lines indicate the high-Vth FinFETs based on work-function engineering. Work-function of a low-Vth N-Type (P-Type) FinFET is 4.5 eV (4.9 eV). Work-function of a high-Vth N-Type (P-Type) FinFET is 4.7 eV (4.7 eV).
9.6 Layouts of the new brute-force FinFET latches. (a) LATCH-IG and LATCH-WF-IG: 0.506 µm². (b) LATCH-WF: 0.506 µm².
9.7 Total active-mode power consumption of the FinFET latches.


9.8 Clock power of the FinFET latches.
9.9 Leakage power (averaged for four different input-output combinations in the standby mode) of the FinFET latches. Clock is gated low.
9.10 Average propagation delay of the FinFET latches.
9.11 Setup time of the FinFET latches.
9.12 Static noise margin of the FinFET latches.
9.13 Five brute-force FinFET flip-flops. (a) FF-TG1. (b) FF-TG2. (c) FF-IG. (d) FF-WF. (e) FF-WF-IG. Thick lines indicate high-Vth FinFETs based on work-function engineering. Work-function of a low-Vth N-Type (P-Type) FinFET is 4.5 eV (4.9 eV). Work-function of a high-Vth N-Type (P-Type) FinFET is 4.7 eV (4.7 eV).
9.14 Total active-mode power consumption of the FinFET flip-flops.
9.15 Clock power of the FinFET flip-flops.
9.16 Leakage power (averaged for the four input-output combinations in the standby mode) of the FinFET flip-flops. Clock is gated low.
9.17 Setup time of the FinFET flip-flops.
9.18 Average propagation delay of the FinFET flip-flops.
10.1 Tied-gate FinFET architectures. (a) Single fin transistor. (b) Two fins transistor.
10.2 Different gate-bias options with single-fin and multi-fin independent-gate FinFETs.
10.3 A standard footless domino circuit with tied-gate FinFETs.
10.4 Evaluation delay, power consumption, and NML of a standard 16-input domino OR gate in a 32nm tied-gate FinFET technology. KPR: ratio of keeper size to the size of one of the pull-down network transistors. Frequency = 4GHz. T = 110°C.
10.5 Schematic of the proposed variable threshold voltage keeper independent-gate FinFET domino logic circuit technique.
10.6 Gate bias options of a three-fin independent-gate keeper FinFET. G1: number of independent keeper gates driven by NAND1. G2: number of independent keeper gates driven by the output.


10.7 Delay, power, and NML characteristics of a 16-input domino OR gate with KPR = 0.75. G2: number of independent keeper gates driven by the output. G2 = 6 corresponds to the standard tied-gate FinFET domino circuit.
10.8 Comparison of the evaluation delay of the standard tied-gate and the proposed variable threshold voltage keeper independent-gate FinFET techniques for different domino circuits and various keeper sizes. For each comparison case, the two techniques provide identical noise margin.
10.9 Comparison of the power consumption of the standard tied-gate and the proposed variable threshold voltage keeper independent-gate FinFET techniques for different domino circuits and various keeper sizes. For each comparison case, the two techniques provide identical noise margin.
11.1 Three tied-gate FinFET SRAM cells. TG1: all six transistors are sized minimum. TG2: the pull-down transistors in the cross-coupled inverters have two fins. TG3: the pull-down transistors in the cross-coupled inverters have three fins.
11.2 The IG-FinFET SRAM cells. (a) IG1. (b) IG2. (c) IG3.
11.3 The read SNMs of the FinFET SRAM cells. T = 70°C. IG3 provides the highest SNM.
11.4 The leakage power consumptions of the FinFET SRAM cells at 70°C. IG2 consumes the lowest leakage power.
11.5 The peak read currents of the FinFET SRAM cells. T = 70°C.
11.6 Layout of the TG1 FinFET SRAM cell. Layout area = 0.18 µm².
11.7 Layout of the TG2 FinFET SRAM cell. Layout area = 0.21 µm².
11.8 Layout of the TG3 FinFET SRAM cell. Layout area = 0.24 µm².
11.9 Layout of the IG1 FinFET SRAM cell. Layout area = 0.22 µm².
11.10 Layout of the IG2 FinFET SRAM cell. Layout area = 0.18 µm².
11.11 Layout of the IG3 FinFET SRAM cell. Layout area = 0.22 µm².
11.12 Normalized area of the FinFET SRAM cells. TG1 and IG2 occupy the smallest layout area.
11.13 Statistical leakage power distributions of the FinFET SRAM cells.
11.14 Statistical SNM distributions of the FinFET SRAM cells.


12.1 Variation of the read static noise margin with the work-functions of the access and the pull-up devices. The work-function of the pull-down devices is fixed at 4.6eV. T = 70°C. All the devices are minimum sized.
12.2 Variation of the static power with the work-function of the access, the pull-up, and the pull-down devices. T = 70°C. Minimum sized SRAM cell.
12.3 Variation of the read delay with the work-function of the access devices. The read delay is insensitive to the work-function of the pull-up devices in the range of (4.5eV-5eV) and the work-function of the pull-down devices in the range of (4.5eV-4.6eV). T = 70°C. Minimum sized SRAM cell. Read delay is measured for a column of 256 SRAM cells.
12.4 Variation of the write delay with the work-function of the access and the pull-up devices. T = 70°C. All the transistors are sized minimum.
12.5 Schematic and layout of a minimum sized 6T SRAM cell (SRAM1, SRAM_LP, and SRAM_HS) in a 32nm FinFET technology. The size of each transistor is given as (number of fins × fin height) / channel length.
12.6 Schematic and layout of a larger memory cell (SRAM3) with enhanced data stability as compared to SRAM1. The size of each transistor is given as (number of fins × fin height) / channel length.
12.7 The normalized leakage power consumption of the standard single-low-Vt and the proposed work-function engineered multi-Vt SRAM cells. T = 70°C.
12.8 The normalized delay and power consumption of a memory column with the different SRAM cells (the standard single-low-Vt and the proposed work-function engineered multi-Vt SRAM cells). T = 70°C.
12.9 Statistical static power distributions of the memory circuits.
12.10 Statistical SNM distributions of the SRAM cells.
13.1 Process flow of a local strain technology based on epitaxial growth of SiGe and SiC in the source and drain region of a PMOS and NMOS, respectively [115].
13.2 Local strain technology using capping layer [115].
13.3 Biaxial and uniaxial Strain induced threshold-voltage shifts versus the amount of strain.


13.4 Six transistors SRAM circuits with enhanced speed using local strain. The (*) indicates transistors with strained channel.
13.5 Six transistors SRAM circuit with enhanced read stability and read speed using local strain will be investigated. The (*) indicates transistors with strained channel.
13.6 Stress induced by the TiN as a function of the TiN thickness and the deposition method. ALD: atomic layer deposition. CVD: chemical vapor deposition. PVD: physical vapor deposition. Higher tensile stress is applied with thinner TiN layer [123].
13.7 FinFET six transistors SRAM circuit with local strain applied to the pull-down and the access transistors for enhanced read speed, write speed, and write margin. The strain applied to the pull-down transistors is higher as compared to the access transistors for enhanced read stability. The (*) indicates transistors with strained channel. The (**) indicates higher strain than (*).
13.8 (a) FinFET 3D architecture. (b) Cross sectional top view of a FinFET with gate-drain/source overlaps. (c) Cross sectional top view of a FinFET with gate-drain/source underlaps.
13.9 On-current and off-current of FinFETs versus gate-drain/source overlap. Underlaps are represented with negative overlaps. 32nm gate length. 1.6nm gate oxide thickness. Undoped body. Gate work-function of N-type (P-type) FinFET is 4.5eV (4.9eV). T = 110°C.
13.10 FinFET SRAM based on a dual gate-drain/source overlap technology will be investigated. The pull-down transistors are FinFETs with gate-drain/source overlaps. The pull-up and the access transistors are FinFETs with gate-drain/source underlaps. Thick lines indicate transistors with gate-drain/source underlaps.
13.11 Robust and high speed 7T FinFET SRAM circuit based on gate-drain/source overlap engineering will be investigated. A single-ended sense amplifier based on gate-drain/source overlap engineering will also be investigated.


Chapter 1 Introduction

The first monolithic integrated circuit (IC) was invented at Fairchild Semiconductor in 1959 [1], [2], [57]-[59]. The integration of an entire electrical circuit on a single piece of silicon significantly lowers the cost and enhances the reliability as compared to the circuits with discrete components. The impressive growth of the semiconductor industry driven by the advancements of the integrated circuit technology and the market dynamics was predicted by Gordon Moore in 1965 [1]-[6], [57]-[59]. A new process technology with significantly higher integration density and enhanced speed has been introduced by the semiconductor industry every two to three years since the early 1970s [1], [7], [57]-[59]. The size of the transistors is reduced with technology scaling, thereby increasing the integration density and the operating speed of the circuits [1], [7], [57]-[59]. The increased number and the higher operating speed of transistors lead to broader functionality and enhanced performance in an integrated circuit. These advantages associated with technology scaling, however, come at a cost of elevated power consumption and enhanced sensitivity to parameter variations [1], [7]-[11]. Developing low power and variation tolerant integrated circuit techniques has become a primary necessity for the semiconductor industry.

The power consumption of high performance integrated circuits has increased significantly with technology scaling despite the scaling of the supply voltage. The higher power consumption shortens the battery lifetime of portable devices. Furthermore, the power density of ICs increases steadily with technology scaling resulting in elevated die temperature, more expensive cooling system, higher packaging cost, increased leakage power, and degraded reliability.

Numerous microarchitectures and circuit techniques were introduced for achieving higher performance, lower power consumption, and enhanced robustness to parameter variations with the scaling of the integrated circuit technologies. The primary trends of the integrated circuit technologies are presented in Section 1.1. Several new low-power, high-performance, and variation-tolerant circuit techniques targeting the nanoscale CMOS technologies are presented in this dissertation. The outline of the dissertation is given in Section 1.2.


1.1. Scaling Trends

Three different approaches to CMOS technology scaling can be identified in the literature: constant voltage scaling, constant field scaling, and generalized scaling. With the constant voltage scaling, the device dimensions are scaled down and the channel doping concentration is increased by a factor of “f”. The supply voltage is maintained constant from one technology generation to the next. The electric fields within a scaled MOS transistor are increased thereby enhancing the carrier velocity and the circuit speed with the constant voltage scaling approach. The technology scaling process was based on the constant voltage scaling approach until the 0.8µm technology node [7]. Due to the device reliability concerns and the velocity saturation phenomenon, however, the constant voltage scaling approach is not applicable to the submicrometer technologies. The carrier velocity saturates beyond a specific critical lateral electric field. This critical electric field is easily reached by the submicrometer MOSFETs provided that the supply voltage is not scaled together with the channel length.

The doping concentrations are increased while the device dimensions and the power supply voltage are scaled down by a factor of “f” with the alternative constant field scaling approach [7]-[8]. The electric fields within a scaled device are maintained similar as compared to the previous technology generations primarily by scaling the supply voltage with the other technology parameters. The device reliability is not degraded with the constant field scaling criteria while achieving enhanced speed and higher integration density. The challenges with the constant field scaling approach are the non-scaled device parameters such as the built-in potential (determined by the band gap energy) and the subthreshold slope (determined primarily by the temperature) [8]. Scaling the supply voltage degrades the transistor currents. The threshold voltages are scaled to compensate for this speed degradation at a lower supply voltage. Scaling of the threshold voltages for enhanced device speed, however, causes exponentially higher subthreshold leakage currents. The threshold voltage of an MOS transistor therefore cannot be lowered with the same scaling factor as the other device parameters in the deep submicrometer CMOS technologies.

A selective scaling approach is utilized in the recent nanoscale CMOS technologies. The channel length is scaled by a factor of approximately 0.7 in each new technology generation [8]. The device parameters are individually optimized (typically with different scaling factors) in order to achieve a compromise between the speed, the power consumption, and the reliability goals. Furthermore, several novel device fabrication techniques such as strained silicon, channel doping engineering, and high-k gate dielectric have been introduced for simultaneous enhancement in speed and reliability while reducing the power consumption with the current CMOS technologies.

The trends in the evolution of microprocessor technologies are reviewed in this section [1], [57]-[59]. The feature sizes of the transistors and the wires have been continually scaled, enhancing the integration density in each new process technology generation. The minimum critical feature size (channel length) of the transistors in the lead Intel microprocessors has decreased from 10µm in 1971 to 45nm in 2008 as illustrated in Fig. 1.1. The total number of transistors in the lead Intel and IBM microprocessors has increased by more than a thousand times in less than two decades as shown in Figs. 1.2 and 1.3, respectively.
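The scaling rules and the feature size trend described in this section can be checked with a short calculation. The sketch below is illustrative only: the function names and dictionary keys are hypothetical, the multipliers are the idealized first-order textbook relations for constant field scaling (no velocity saturation, no leakage), and the 0.7 shrink per generation is the approximate value quoted in the text.

```python
import math

def constant_field_scaling(f):
    """First-order effect of constant field scaling by a factor f > 1.

    Returns multipliers relative to the previous generation. These are
    idealized textbook relations, not values taken from the dissertation.
    """
    return {
        "dimensions": 1 / f,          # W, L, and tox all shrink by f
        "supply_voltage": 1 / f,      # VDD shrinks with the dimensions
        "doping": f,                  # doping concentration increases by f
        "electric_field": 1.0,        # voltage / length stays constant
        "gate_capacitance": 1 / f,    # C ~ W * L / tox
        "gate_delay": 1 / f,          # t ~ C * V / I, with I ~ 1/f
        "power_per_gate": 1 / f**2,   # P ~ V * I
        "power_density": 1.0,         # P / area, with area ~ 1/f^2
    }

def generations(l_start, l_end, shrink=0.7):
    """Number of shrink steps needed to go from l_start to l_end."""
    return math.log(l_start / l_end) / math.log(1 / shrink)

# Feature sizes quoted in the text: 10 um in 1971, 45 nm in 2008.
n = generations(10e-6, 45e-9)
years_per_generation = (2008 - 1971) / n
print(f"{n:.1f} generations, {years_per_generation:.1f} years each")
```

Under these assumptions the move from 10µm to 45nm corresponds to about 15 generations over 37 years, or roughly 2.4 years per generation, which is consistent with the two to three year cadence mentioned in the introduction.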

[Plot: feature size (µm) versus year, 1971 to 2008, for the Intel microprocessors from the 4004 to the Core2 Duo (Wolfdale).]

Fig. 1.1. Scaling of the feature size with the lead Intel microprocessors.

[Plot: number of transistors (million) versus year, 1990 to 2006.]

Fig. 1.2. Evolution of the number of transistors of the lead Intel microprocessors.

[Plot: number of transistors (million) versus year, 1990 to 2007, for the IBM microprocessors from the POWER1 to the POWER6.]

Fig. 1.3. Evolution of the number of transistors of the IBM microprocessors.

In addition to the device dimensions, technology scaling also reduces the propagation delays. The operating frequency of the ICs has significantly increased with technology scaling as shown in Figs. 1.4 and 1.5 for the Intel and IBM microprocessors, respectively. Recent microprocessors from Intel (Penryn) and IBM (POWER6) feature clock frequencies in excess of 3.7GHz [83] and 5GHz [82], respectively.

[Plot: clock frequency (MHz) versus year, 1971 to 2008, for the Intel microprocessors from the 4004 to the Core2 Duo (Wolfdale).]

Fig. 1.4. Clock frequency scaling with the lead Intel microprocessors.

[Plot: clock frequency (MHz) versus year, 1990 to 2007, for the IBM microprocessors from the POWER1 to the POWER6.]

Fig. 1.5. Clock frequency scaling with the lead IBM microprocessors.

The operating frequency can be used as a measure of the amount of work accomplished per unit time. The performance of an integrated system can be enhanced by increasing the clock frequency. The increase in the total number of transistors and the operating frequency of an IC, however, also causes a significant surge in the power consumption [1]. The maximum power consumption of the Intel microprocessors has been increasing over the past 30 years as shown in Fig. 1.8. The primary sources of power consumption in two generations of Itanium microprocessors are depicted in Fig. 1.9 [14], [88]. The dynamic switching power consumed in the logic core accounts for approximately 50% of the total power consumption for both microprocessors. The clock distribution network and the leakage currents account for the other half of the total power consumption in a recent 90nm Itanium microprocessor as shown in Fig. 1.9b.

The enhancement of the clock frequency has recently slowed down as shown in Fig. 1.4 due to the limitations imposed by the higher power consumption. The multi-core technology has emerged as an alternative approach to enhance the performance without increasing the clock frequency in the high-end microprocessors market. Die photos of the recent microprocessors from AMD and IBM featuring four processor cores and nine processor cores, respectively, are shown in Fig. 1.6. An experimental 80-core processor implemented in 65nm CMOS technology from Intel is shown in Fig. 1.7.
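The trade-off between clock frequency and power noted above can be illustrated with the standard first-order expression for dynamic switching power, P = α·C·VDD²·f. The sketch below rests on stated assumptions: the function name, the activity factor, the switched capacitance, and the voltage and frequency pairs are hypothetical values chosen for the example, and the comparison assumes a workload that parallelizes perfectly across two cores.

```python
def dynamic_power(alpha, c_switched, vdd, freq):
    """First-order dynamic switching power: alpha * C * VDD^2 * f (watts)."""
    return alpha * c_switched * vdd**2 * freq

# Hypothetical numbers for illustration only.
ALPHA = 0.1          # activity factor
C_CORE = 20e-9       # switched capacitance of one core (farads)

# Option A: one core at 4 GHz, assumed to need a 1.2 V supply.
single = dynamic_power(ALPHA, C_CORE, vdd=1.2, freq=4e9)

# Option B: two cores at 2 GHz each, assumed to run from a 0.9 V supply.
# Same nominal throughput if the workload parallelizes perfectly.
dual = 2 * dynamic_power(ALPHA, C_CORE, vdd=0.9, freq=2e9)

print(f"single core: {single:.2f} W, dual core: {dual:.2f} W")
print(f"dual / single = {dual / single:.2f}")
```

With these assumed values the two slower cores deliver the same nominal throughput at roughly 56% of the dynamic power, because the lower frequency permits a lower supply voltage and the power depends quadratically on that voltage.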


Fig. 1.6. Die photos of two recent multi-core processors. (a) Quad-core Opteron from AMD [84]. (b) Nine-core synergistic processor from IBM [85]. Both processors are implemented in a 65nm SOI technology.

Fig. 1.7. Die photo of an experimental 80-core Intel processor operating at 4GHz in a 65nm CMOS technology [89].

[Plot: maximum power (W) versus year, 1970 to 2006, for the Intel microprocessors from the 4004 to the Itanium2 and the Montecito Dual-Core, with the NMOS to CMOS transition marked.]

Fig. 1.8. Maximum power consumption of the lead Intel microprocessors.

[Pie charts. (a) Switching logic 51%, clock system 33%, IO 5%, global repeaters 3%, package 3%, contention logic 3%, leakage 2%. (b) Logic switching 50%, leakage 25%, final clock buffers 20%, clock distribution 5%.]

Fig. 1.9. Breakdown of the power consumption in Itanium processors. (a) Itanium 2 microprocessor in a 180nm CMOS technology in 2002 [88]. (b) Two-core Itanium processor implemented in a 90nm CMOS technology in 2006 [14].

The power densities of the high performance ICs have also increased significantly over the past 20 years. The power densities of the lead Intel microprocessors are shown in Fig. 1.10. The increased power consumption and power density lead to higher operating temperature. Furthermore, the diversity of circuitry in an IC coupled with the higher power consumption results in significant on-chip temperature variations. The multi-core microprocessor architectures offer a more uniform distribution of the work load, thereby leading to more balanced power consumption with a lower power density across a system-on-chip. A multi-core integrated circuit therefore experiences more uniform heat dissipation and lower temperature gradients across the die as compared to a conventional single-core higher frequency processor designed for similar performance.

[Plot: power density (W/cm²) versus year, 1970 to 2010, for the Intel microprocessors from the 4004 to the Penryn.]

Fig. 1.10. Power density trends of the lead Intel microprocessors.

The scaling of the threshold voltage and the device dimensions leads to a significant increase in the leakage power. The leakage power trends are depicted in Figs. 1.11 and 1.12. Approximately 40% of the total active mode power is consumed by leakage currents in a 3 GHz Pentium 4 microprocessor in a 130nm CMOS technology, as shown in Fig. 1.11 [12]. Similarly, leakage power accounts for 42% of the total active mode power consumption of the POWER6 microprocessors in a 65nm CMOS SOI technology [82]. The trends of the different components of the leakage power are illustrated in Fig. 1.12 for three different CMOS technologies [13]. All components of the leakage power are increased with technology scaling. Novel circuit techniques that are aimed at reducing the leakage power consumption are therefore highly desirable.
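The sensitivity of the subthreshold leakage to the threshold voltage can be sketched with the usual first-order model, in which the off-current changes by one decade for every S millivolts of threshold voltage, where S is the subthreshold slope. The function name, the default slope of 90 mV per decade, and the example threshold voltages are assumptions chosen for illustration, not values from the cited measurements.

```python
def leakage_ratio(vth_old, vth_new, slope_mv_per_decade=90.0):
    """Ratio I_off(new) / I_off(old) when the threshold voltage changes.

    First-order model: I_off is proportional to 10 ** (-Vth / S).
    Voltages are in volts; the slope S is in mV per decade.
    """
    delta_mv = (vth_old - vth_new) * 1000.0
    return 10.0 ** (delta_mv / slope_mv_per_decade)

# Lowering Vth from 0.40 V to 0.31 V (one slope width) costs about a decade.
print(leakage_ratio(0.40, 0.31))
```

Because S is set primarily by the temperature and does not scale with the technology, every reduction of the threshold voltage by one slope width multiplies the subthreshold leakage by about ten in this model.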

[Bar chart: leakage power percentage (0 to 40%) for the Intel microprocessors from the Pentium (66 MHz) to the Pentium 4 (3 GHz).]

Fig. 1.11. Leakage power trends of the lead Intel microprocessors [12].

[Plot: leakage components (A/µm), gate tunneling, subthreshold leakage, and band to band tunneling, for Leff = 90nm, 50nm, and 25nm.]

Fig. 1.12. Comparison of the primary leakage current components in three different CMOS technology nodes. VDD is 0.7V at 25nm, 0.9V at 50nm, and 1.2V at 90nm [13].

Technology scaling leads to more significant parameter variations [9]-[11]. Parameter variations can be caused by process and environmental fluctuations. Process related imperfections are either random or systematic with a predictable spatial distribution. Alternatively, environment variations occur dynamically during the operation of the ICs due to the time varying supply voltage and die temperature fluctuations. Supply voltage fluctuations stem from the increase in the wire parasitic impedances, the increased current demand, and the higher switching frequency of the integrated circuits. The increase in the current demand is primarily due to the increase in power consumption and the aggressive scaling of the supply voltage in successive technology generations. The increasing wire parasitic impedances and the higher circuit switching currents cause resistive and inductive voltage drops along the power lines. The development of new low-power and variation-tolerant integrated circuit techniques is therefore critical for the future of the semiconductor industry.
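The resistive and inductive voltage drops mentioned above follow the first-order relation ΔV = I·R + L·dI/dt. The sketch below evaluates it for a hypothetical power line; the function name and all of the component values are assumptions chosen only to show the orders of magnitude involved.

```python
def supply_drop(current, resistance, inductance, di_dt):
    """Voltage drop along a power line: I*R (resistive) + L*dI/dt (inductive)."""
    return current * resistance + inductance * di_dt

# Hypothetical example: 10 A drawn through 2 mOhm and 10 pH of parasitics,
# with the current ramping by 5 A in 1 ns.
drop = supply_drop(current=10.0, resistance=2e-3,
                   inductance=10e-12, di_dt=5.0 / 1e-9)
print(f"supply drop = {drop * 1000:.0f} mV")
```

In this example the inductive term (50 mV) exceeds the resistive term (20 mV). The same absolute drop becomes a larger fraction of the rail as the supply voltage is scaled down, which is one reason the supply noise grows in importance with technology scaling.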

1.2. Outline of the Dissertation

Sources of power consumption in digital circuits are identified and modeled in Chapter 2. Several new circuit techniques for the design of low power, high performance, and variation tolerant integrated circuits are presented in this thesis.

An effective method for reducing the power consumption is scaling the supply voltage. All components of power consumption are simultaneously reduced with the scaling of the supply voltage in a CMOS circuit. Lowering the supply voltage, however, also degrades the circuit speed. The multiple supply voltage circuit technique exploits the delay differences among the different signal propagation paths within an IC. The supply voltages of the gates on the non-critical delay paths are selectively lowered while a higher supply voltage is maintained on the speed critical paths in order to satisfy the target clock frequency.

When a low voltage swing signal drives a CMOS gate connected to a higher supply voltage, static DC power is consumed as the transistors in the pull-up and the pull-down networks are simultaneously turned on. Furthermore, the output voltage swing of the receiver degrades, thereby leading to a static DC current in the fanout gates of the receiver. In order to transfer signals among these circuits operating at different voltage levels, specialized voltage interface circuits are required. Level converters impose additional power consumption and propagation delay overhead in a multiple supply voltage system. High-speed and low-power voltage level conversion is critical for effective power reduction with minimum effect on speed in multi-VDD systems. New multiple threshold voltage interface circuits are proposed in Chapter 3. When the level conversion circuits are individually optimized for minimum power consumption, the proposed level converters offer significant power savings of up to 70% as compared to the previously published circuits.
Alternatively, when the circuits are individually optimized for minimum propagation delay, the speed is enhanced by up to 78% with the proposed voltage interface circuits as compared to the conventional level converters in a 0.18µm TSMC CMOS technology.

The clock distribution network consumes a significant portion of the power, area, and metal resources of an integrated circuit (IC). The enhancement of clock frequency and the growth of die size cause the power consumption of the clock distribution subsystem to increase significantly with each new technology generation. Furthermore, the process and environment parameter variations become more pronounced with technology scaling. Coping with parameter variations is particularly challenging in the design of the clock distribution networks since the clock signal needs to be distributed to the entire IC with controlled skew. New supply voltage optimization methodologies are presented in Chapter 4 for simultaneously suppressing the temperature fluctuations induced clock skew and the power consumption of clock distribution networks. The clock signal is globally distributed at a lower optimum supply voltage for minimizing the temperature fluctuations induced clock skew. Level converters restore the full swing clock signal at the leaves of the clock distribution network in order to maintain the system performance. A new dual threshold voltage low-power level converter that is robust against temperature fluctuations is presented. The temperature fluctuations induced clock skew and the power consumption are reduced by up to 76% and 50.6%, respectively, as compared to a standard clock distribution network operating at the nominal supply voltage.

Distributing the clock signal at a lower supply voltage requires increasing the sizes of the clock buffers, thereby degrading the power savings achievable with the proposed dual-VDD/single-frequency clocking methodology. Another methodology based on dual-supply/dual-frequency clock distribution is presented for enhanced power savings in Chapter 4.
Buffer resizing is avoided with this new technique by distributing the clock signal at the lower optimum supply voltage with half the target frequency. Novel frequency doublers with level conversion capability are utilized to restore the full swing clock signal with the target clock frequency at the leaves of the clock distribution network. A new hybrid dual threshold voltage frequency doubler/level converter with lower skew and reduced power consumption characteristics is presented. The temperature fluctuations induced clock skew and the power consumption are reduced by up to 80% and 76%, respectively, with this methodology as compared to a standard clock distribution network operating at the nominal supply voltage.

Another important topic in the design of clock distribution networks is the buffer insertion and sizing. Maintaining sharp clock signal transition times is critical for the high speed and robust operation of the clocked elements. A shorter clock transition time, however, typically requires an increased number of larger clock buffers, thereby leading to higher power consumption in a clock tree. A new buffer insertion and sizing methodology is presented for an H-tree clock distribution network in Chapter 5. The objective of the new methodology is to minimize the total power consumption while maintaining sharp clock edges at the leaves of the clock distribution network. Non-uniform buffer insertion and gradual relaxation of the transition time constraints from the leaves to the root of the clock distribution network are proposed for significant power savings with the new methodology. The total power consumption is reduced by up to 30% as compared to the standard approach based on uniform buffer insertion to maintain similar transition times at all the nodes of a clock tree.

The amount of embedded memory in modern microprocessors and systems-on-chips is increased to meet the performance requirements in each new technology generation. The reduced supply and threshold voltages and the scaled device dimensions lead to a degradation in the data stability of the memory banks with technology scaling. The increasing leakage energy consumption of on-chip caches is another growing concern since the majority of transistors are employed for the embedded memory in modern microprocessors. A new SRAM circuit technique based on dynamically adjusting the wordline voltage swing is proposed in Chapter 6 for reducing the leakage power consumption and enhancing the data stability in static memory banks.
The wordline voltage swing is reduced in order to suppress the voltage disturbance at the data storage nodes during a read operation with the proposed technique. The stability of a minimum sized standard six transistors (6T) SRAM cell is thereby significantly enhanced. Alternatively the wordline signal has a full voltage swing in order to achieve write-ability with a high write margin during a write operation. The static noise margin is enhanced by up to 122% with the proposed circuit technique as compared to the conventional full-voltage-swing 6T SRAM circuits with minimum sized transistors. Furthermore, the leakage power consumption is reduced by 51% with the proposed technique as compared to the conventional full-voltage-swing circuits sized for


comparable data stability in a 65nm CMOS technology. A new seven transistor dual threshold voltage SRAM cell is proposed in Chapter 7 for simultaneously reducing the active and standby mode power consumption while enhancing the data stability and the read speed with a small area overhead as compared to the standard six transistor SRAM circuits. The proposed circuit provides two separate data access mechanisms for the read and write operations. During a read operation, the storage nodes are isolated from the bitlines, thereby enhancing the read stability by up to 87% as compared to the conventional 6T SRAM cells. The cross-coupled inverters of the proposed SRAM cell are not on the critical delay path, thereby allowing the utilization of high threshold-voltage minimum sized transistors for significantly reducing the leakage power consumption by up to 66% without degrading the circuit speed as compared to the conventional 6T SRAM circuits. Furthermore, the read speed is enhanced with the proposed 7T SRAM circuit due to smaller resistance of the read-delay-path. The write power is also reduced with the 7T SRAM circuit due to the utilization of a low activity factor single bitline for the transfer of new data into the cell. Enhancing the system performance while maintaining the power consumption under control will become increasingly challenging in the future nanoscale CMOS technology generations. The emerging multi-gate MOSFET technology offers distinct advantages for simultaneously enhancing the speed and suppressing the sub-threshold and the gate dielectric leakage currents as compared to the conventional single-gate MOSFETs. The semiconductor industry is expected to transition to the multi-gate MOSFET technology in 2010 (International Technology Roadmap for Semiconductors [87]). The FinFET technology is introduced in Chapter 8. 
FinFET technology development guidelines for higher performance, lower power consumption, and resilience to parameter variations are presented. Independent-gate and work-function engineering technologies are also presented in Chapter 8. New compact FinFET based sequential circuits for reducing the power consumption without sacrificing speed are presented in Chapter 9. A new independent-gate FinFET domino circuit technique for enhancing the speed and reducing the power consumption without sacrificing noise immunity is presented in Chapter 10. Novel FinFET SRAM circuits with low leakage power, small area, and high data stability are presented in Chapters 11 and 12. Finally future research plans are provided in Chapter 13.


Chapter 2 Sources of Power Consumptions in Digital Circuits

The power consumption of high performance integrated circuits has increased significantly with technology scaling. Higher power consumption shortens the battery lifetime of portable devices. Furthermore, the increased power consumption poses a limitation on continued technology scaling due to the associated higher power density. In this chapter, the sources of power consumption in digital circuits are identified and modeled. The power consumption mechanisms in digital circuits can be classified into dynamic switching power, short-circuit power, leakage power, and static DC power. Dynamic switching power consumption and short-circuit power consumption occur during signal switching. The dynamic switching power consumption and the short-circuit power consumption are presented in Sections 2.1 and 2.2, respectively. The leakage and the static DC power consumptions occur regardless of the switching activities of the signals. The leakage and the static DC power consumptions are presented in Sections 2.3 and 2.4, respectively.

2.1. Dynamic Switching Power Consumption

Dynamic switching power is the dominant component of power consumption in high performance integrated circuits [14]. The dynamic power consumption is modeled for static CMOS circuits in this Section [103]. A generic static CMOS gate is shown in Fig. 2.1. A static CMOS gate is composed of a pull-up network composed of PMOS transistors and a pull-down network composed of NMOS transistors. When the pull-up network is turned on with a specific input signal combination, a low resistance path connects the high supply voltage VDD to the output node, thereby charging the output node to VDD. Alternatively, when the pull-down network is turned on with a specific input signal combination, a low resistance path connects the output node to ground, thereby discharging the output node to 0V. The pull-up and pull-down networks operate in a complementary fashion. At steady state, only one of the pull-up and the pull-down networks is turned on based on the state of the input signals. When the pull-up (pull-down) network is turned off, the resistance of the pull-up (pull-down) network is increased by several orders of magnitude.


Fig. 2.1. A CMOS gate. The current Ip charges the load capacitance CL to VDD when the pull-up network is activated. The current In discharges the load capacitance CL to GND when the pull-down network is activated.

At steady state only one of the pull-up or the pull-down network is turned off. Hence there is a high resistance path between VDD and ground, thereby only leakage currents are drawn from the power supply resulting in leakage power consumption. Leakage power consumption is discussed in Section 2.3. Alternatively, when the state of the output changes from a low voltage to a high voltage a transitory current, significantly higher as compared to the leakage current, is drawn from the power supply resulting in dynamic switching power consumption. The dynamic switching power is modeled in this section for static CMOS circuits. The output voltage swing is generally assumed to be between VLow ≥ 0V and VHigh ≤ VDD.

The output low to high transition is considered first. The output node is initially discharged to VLow through the pull-down network. The input signals switch to a combination at which the pull-down network is turned off and the pull-up network is turned on. The current Ip flows from VDD to the output node, charging the output node from VLow to VHigh through the low-resistance pull-up path as depicted in Fig. 2.1. The energy drawn from the supply voltage during this transition is computed by integrating the instantaneous power (P(t) = VDD Ip(t)) as follows:

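$$E_{VDD} = \int_{0}^{\infty} P(t)\,dt = V_{DD} \int_{0}^{\infty} I_p(t)\,dt = V_{DD} C_L \int_{0}^{\infty} \frac{dV_{out}}{dt}\,dt, \tag{2.1}$$

$$E_{VDD} = V_{DD} C_L \int_{V_{Low}}^{V_{High}} dV_{out} = V_{DD} C_L \left(V_{High} - V_{Low}\right), \tag{2.2}$$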

where EVDD and CL are the total energy drawn from the power supply and the output load capacitance respectively. Part of the energy drawn from the supply is stored on the load capacitance. The remaining part of the energy drawn from the power supply is dissipated as heat in the resistance of the pull-up network. The energy stored in a capacitor is computed by integrating the instantaneous power delivered to the capacitor (Pc). The instantaneous power delivered to the capacitor is given by

$$P_c(t) = I_p(t)\,V_{out}(t) = C_L \frac{dV_{out}(t)}{dt} V_{out}(t). \tag{2.3}$$

The energy stored in the capacitor (EC) after charging the output from VLow to VHigh is computed by integrating Pc(t) as follows

$$E_C = \int_{0}^{\infty} P_c(t)\,dt = \int_{0}^{\infty} C_L V_{out}(t) \frac{dV_{out}}{dt}\,dt = \int_{V_{Low}}^{V_{High}} C_L V_{out}\,dV_{out} = \frac{1}{2} C_L \left(V_{High}^2 - V_{Low}^2\right). \tag{2.4}$$

The energy dissipated in the pull-up network during the output low to high transition is equal to the difference between the energy drawn from the power supply during this transition and the energy stored in the capacitor at the end of the output low to high transition. From (2.2) and (2.4), the dynamic energy dissipated during the output low to high transition is

$$E_{0 \to 1} = C_L V_{DD} \left(V_{High} - V_{Low}\right) - \frac{1}{2} C_L \left(V_{High}^2 - V_{Low}^2\right). \tag{2.5}$$

The output high to low transition is considered next. The output node is initially charged to VHigh through the pull-up network. The input signals switch to a combination at which the pull-up network is turned off and the pull-down network is turned on. The current In flows from the output node to the ground, discharging the output node from VHigh to VLow through the low-resistance pull-down path as depicted in Fig. 2.1. The charge stored on the load capacitance is therefore drawn to ground and the stored energy in the capacitor is dissipated in the resistance of the pull-down network. The dynamic energy dissipated during the high to low transition is therefore equal to the energy stored in the capacitor prior to the output high to low transition.

$$E_{1 \to 0} = \frac{1}{2} C_L \left(V_{High}^2 - V_{Low}^2\right). \tag{2.6}$$

The total dynamic switching energy dissipated to charge and discharge the load capacitance is therefore equal to the energy drawn from the power supply during the output low-to-high transition. The dynamic energy consumed in charging and discharging the output capacitance is

$$E_{dynamic} = E_{0 \to 1} + E_{1 \to 0} = C_L V_{DD} \left(V_{High} - V_{Low}\right) = C_L V_{DD} V_{Swing}, \tag{2.7}$$

where VSwing is the voltage swing of the output node and is equal to VHigh - VLow. By defining the activity factor “α01” to denote the probability of the output signal transitioning from “0” to “1” per clock cycle (T), the average dynamic power consumed in the charging and discharging of the load capacitance can be formulated as

$$P_{dynamic} = \frac{\alpha_{01} E_{dynamic}}{T} = \frac{\alpha_{01} C_L V_{DD} V_{Swing}}{T} = \alpha_{01} C_L V_{DD} V_{Swing} f, \tag{2.8}$$

where f is the clock frequency. Different low power circuit techniques are proposed in literature for reducing the dynamic switching power consumption by reducing one or more of the terms that appear in the right hand side of equation (2.8). Reducing the activity factor (α01) can be achieved by avoiding circuit families that are characterized with high activity factor such as domino circuits. The activity factor can also be reduced with clock gating techniques [45]. The load capacitance (CL) can be reduced by reducing the size of the transistors on the non-critical delay paths. The load capacitance in long buses can be reduced by reducing the coupling capacitance between adjacent lines using bus coding [106]. Reducing the voltage swing (VSwing) can be achieved by modifying the circuit topology as shown in Fig. 2.2. The voltage swing is reduced from VDD to VDD–Vtn–|Vtp| with this technique where Vtn and Vtp are the NMOS and the PMOS threshold voltages, respectively. Low-power techniques based on reduced voltage swing are applied in domino circuits [100], clock distribution networks [30], [32], and in long buses [104].
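The energy bookkeeping of equations (2.2) through (2.8) can be checked numerically. The following Python sketch is illustrative only: the function names and the sample values (10 fF load, 1 V supply, 0.22 V threshold voltages, activity factor of 0.1, 1 GHz clock) are assumptions chosen for the example and are not measurements from this work.

```python
def supply_energy(c_load, v_dd, v_high, v_low):
    """Energy drawn from VDD during an output low-to-high transition, eq. (2.2)."""
    return c_load * v_dd * (v_high - v_low)


def stored_energy(c_load, v_high, v_low):
    """Energy stored on the load capacitance after charging, eq. (2.4)."""
    return 0.5 * c_load * (v_high ** 2 - v_low ** 2)


def dynamic_power(alpha_01, c_load, v_dd, v_swing, f_clk):
    """Average dynamic switching power, eq. (2.8)."""
    return alpha_01 * c_load * v_dd * v_swing * f_clk


# Hypothetical sample values (assumptions, not data from the text).
C_L = 10e-15      # load capacitance in farads
VDD = 1.0         # supply voltage in volts
V_TN = 0.22       # NMOS threshold voltage in volts
V_TP_ABS = 0.22   # magnitude of the PMOS threshold voltage in volts
ALPHA_01 = 0.1    # activity factor
F_CLK = 1e9       # clock frequency in hertz

# Full-swing output: VLow = 0 V and VHigh = VDD.
e_supply = supply_energy(C_L, VDD, VDD, 0.0)
e_01 = e_supply - stored_energy(C_L, VDD, 0.0)  # eq. (2.5)
e_10 = stored_energy(C_L, VDD, 0.0)             # eq. (2.6)

# Eq. (2.7): the two dissipated energies add up to the energy drawn from VDD.
print(e_01 + e_10, e_supply)

p_full = dynamic_power(ALPHA_01, C_L, VDD, VDD, F_CLK)
# Reduced-swing gate: the swing shrinks to VDD - Vtn - |Vtp| while VDD is unchanged.
p_reduced = dynamic_power(ALPHA_01, C_L, VDD, VDD - V_TN - V_TP_ABS, F_CLK)
print(p_full, p_reduced)
```

With these sample values the reduced swing lowers the dynamic power linearly, in proportion to VSwing/VDD, because the supply voltage term in (2.8) is left unchanged.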

Fig. 2.2. Reduced voltage swing static CMOS gate. The output voltage swing is between Vtn and VDD-|Vtp|. Vtn: NMOS threshold voltage. Vtp: PMOS threshold voltage.

Supply voltage scaling is the most effective technique for reducing the dynamic switching power since both the supply voltage (VDD) and the voltage swing (VSwing) are reduced with the supply voltage scaling resulting in quadratic reduction in dynamic power. Circuit speed is, however, degraded with supply voltage scaling. With the dynamic voltage scaling technique [1] the supply voltage is adjusted based on the current computation workload. At a low computation workload, the required throughput is low. The clock frequency is therefore reduced allowing the supply voltage to be scaled, thereby providing cubic reduction in the dynamic power consumption. Alternatively, at high computation workload, the clock frequency and the supply voltage are increased in order to achieve the required high throughput.

Supply voltage scaling is also employed with the multiple supply voltage technique. The multiple supply voltage circuit technique exploits the delay differences among the different signal propagation paths within an integrated circuit [1], [18]. The supply voltages of the gates on the non-critical delay paths are selectively lowered while a higher supply voltage is maintained on the speed critical paths in order to satisfy the target clock frequency in a multiple supply voltage circuit. Dual supply voltages are utilized in [33] and [38] (proposed in Chapter 4) for designing low power clock distribution networks with suppressed temperature fluctuations induced clock skew.
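The quadratic and cubic reductions mentioned for supply voltage scaling and dynamic voltage scaling follow directly from the form of equation (2.8). The short Python sketch below illustrates this under a first-order assumption that is not stated in the text: the clock frequency is taken to scale in proportion to the supply voltage. The function name and the scale factors are hypothetical.

```python
def scaled_dynamic_power(p_nominal, voltage_scale, frequency_scale):
    """Scale P = alpha * CL * VDD * VSwing * f, eq. (2.8).

    VDD and VSwing are both multiplied by voltage_scale, and the clock
    frequency is multiplied by frequency_scale. The activity factor and
    the load capacitance are assumed to stay constant.
    """
    return p_nominal * voltage_scale ** 2 * frequency_scale


P_NOMINAL = 1.0  # normalized dynamic power at the nominal operating point

# Supply voltage scaled to 80% at a fixed clock frequency: quadratic reduction.
p_voltage_only = scaled_dynamic_power(P_NOMINAL, 0.8, 1.0)

# Dynamic voltage scaling with the frequency assumed to follow the voltage:
# cubic reduction.
p_dvs = scaled_dynamic_power(P_NOMINAL, 0.8, 0.8)

print(p_voltage_only, p_dvs)
```

In a real circuit the relationship between the achievable clock frequency and the supply voltage is not exactly linear, so the cubic figure should be read as an approximation.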

2.2. Short Circuit Power Consumption

At steady state, the path between VDD and GND in a static CMOS circuit is characterized with high resistance since one of the pull-up and the pull-down networks is turned off. When the inputs of a static CMOS gate transition to a different combination that results in changing the state of the output node, the pull-up and the pull-down networks are simultaneously turned on due to the non-zero transition time of the input signals. A small resistance path is therefore created between VDD and GND during the transitory period of the input signals resulting in the conduction of an electric current from VDD to GND. This current is called short-circuit current (Isc). The power consumed by the conduction of the short-circuit current is called short-circuit power (Psc).

The short-circuit current measurement is illustrated in Fig. 2.3 for a static CMOS inverter. The NMOS transistor is turned off when the input signal is below the NMOS threshold voltage Vtn. Alternatively, the PMOS transistor is turned off when the input signal is above VDD-|Vtp| where Vtp is the threshold voltage of the PMOS transistor. When the input signal is between Vtn and VDD-|Vtp| during the input low to high or the input high to low transition, both the NMOS and the PMOS transistors are therefore simultaneously turned on. A short-circuit current flows from VDD to GND through the PMOS and the NMOS transistors. The PMOS source-to-drain current, the NMOS drain-to-source current, and the current flowing to the load capacitance are denoted by Ip, In, and Iload, respectively, as shown in Fig. 2.3.

During the input low to high transition (output high to low transition), the NMOS transistor discharges the load capacitance. Hence, In is equal to the sum of the short-circuit current and the load discharge current (-Iload). Alternatively, Ip is equal to the short-circuit current. During the input high to low transition (output low to high transition), the PMOS transistor charges the output capacitance. Hence, Ip is equal to the sum of the short-circuit current and the load charge current (Iload). Alternatively, In is equal to the short-circuit current. The short-circuit current is therefore equal to Ip and In during output high to low and output low to high transitions, respectively.
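The window in which both transistors conduct can be estimated with a simple model. The Python sketch below assumes an ideal linear input ramp from 0V to VDD, which is a simplification introduced here for illustration, and uses sample threshold voltages of 0.22V. The function name is hypothetical.

```python
def conduction_window_fraction(v_dd, v_tn, v_tp_abs):
    """Fraction of a linear 0-to-VDD input ramp with Vtn < Vin < VDD - |Vtp|.

    Both the NMOS and the PMOS transistors of the inverter are turned on
    inside this window. Sub-threshold conduction is ignored.
    """
    window = v_dd - v_tn - v_tp_abs
    return max(window, 0.0) / v_dd


# Sample threshold voltages (assumed values for the example).
V_TN = 0.22
V_TP_ABS = 0.22

for v_dd in (1.2, 1.0, 0.8):
    fraction = conduction_window_fraction(v_dd, V_TN, V_TP_ABS)
    print(f"VDD = {v_dd:.1f} V: both transistors on for {fraction:.1%} of the ramp")
```

Under this model the window shrinks as VDD approaches Vtn + |Vtp| and vanishes below that voltage.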

Fig. 2.3. Short-circuit current (Isc) for a static CMOS inverter. Isc is equal to the source-to-drain current of the PMOS transistor during the input low to high transition. Alternatively, Isc is equal to the drain-to-source current of the NMOS transistor during the input high to low transition.

The energy drawn from the power supply (Eshort-circuit) due to the short circuit current is calculated by integrating the instantaneous power over the transition time of the input signal

$$E_{short-circuit} = V_{DD} \int I_{sc}(t)\,dt = V_{DD} \left( \int_{\text{output high-to-low transition}} I_p(t)\,dt + \int_{\text{output low-to-high transition}} I_n(t)\,dt \right). \tag{2.9}$$

The average short-circuit power is

$$P_{short-circuit} = V_{DD} \left( \alpha_{10} \int_{\text{output high-to-low transition}} I_p(t)\,dt + \alpha_{01} \int_{\text{output low-to-high transition}} I_n(t)\,dt \right) f, \tag{2.10}$$


where α01, α10, and f are the probability of the output low-to-high transition per clock cycle, the probability of the output high-to-low transition per clock cycle, and the clock frequency, respectively. The short-circuit current depends on the supply voltage, the threshold voltages, the input transition time, the transistor sizes, and the load capacitance [1], [103]. The test circuit shown in Fig. 2.4 is used to study the effect of the input transition time, the output capacitance, and the supply voltage on the short-circuit power in a 65nm CMOS technology (Vtn = -Vtp = 0.22V). A 1GHz signal with 50ps transition time is applied to the input of the test circuit. The first two inverters are minimum sized. The first two inverters are utilized in order to produce a realistic signal waveform on Node1. The short circuit power and the total power consumed by inv1 are measured for different sizes of inv2 and inv3. The sizes of the inverters inv1, inv2, and inv3 are normalized relative to the minimum inverter size and denoted with S1, S2, and S3, respectively. The short-circuit power consumed in inv1 is computed using (2.10) with α01 = α10 = 1. The total power consumed in inv1 is computed by integrating the current drawn from the power supply through inv1 over one cycle followed by multiplication with VDD and the input signal frequency.
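As an illustration of how (2.10) can be evaluated from sampled current waveforms, the following Python sketch integrates the currents with the trapezoidal rule. It is a minimal sketch and not the measurement flow used in this work: the function names are hypothetical, and the triangular current pulses (10 µA peak, 100 ps wide) are made-up sample data rather than simulation results.

```python
def trapezoid(times, values):
    """Integrate sampled values over time with the trapezoidal rule."""
    total = 0.0
    for k in range(1, len(times)):
        total += 0.5 * (values[k] + values[k - 1]) * (times[k] - times[k - 1])
    return total


def short_circuit_power(v_dd, f_clk, t_hl, i_p_hl, t_lh, i_n_lh,
                        alpha_10=1.0, alpha_01=1.0):
    """Average short-circuit power following eq. (2.10).

    t_hl, i_p_hl: time and PMOS current samples over an output high-to-low transition.
    t_lh, i_n_lh: time and NMOS current samples over an output low-to-high transition.
    """
    charge_hl = trapezoid(t_hl, i_p_hl)  # charge conducted by the PMOS transistor
    charge_lh = trapezoid(t_lh, i_n_lh)  # charge conducted by the NMOS transistor
    return v_dd * (alpha_10 * charge_hl + alpha_01 * charge_lh) * f_clk


# Hypothetical triangular current pulses (assumed sample data).
t_samples = [0.0, 50e-12, 100e-12]  # seconds
i_samples = [0.0, 10e-6, 0.0]       # amperes

p_sc = short_circuit_power(1.0, 1e9, t_samples, i_samples, t_samples, i_samples)
print(p_sc)  # power in watts
```

Negative current samples subtract from the integrals, so a waveform with a sufficiently large negative excursion yields a negative result with this definition.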

Fig. 2.4. Test circuit for evaluating the short-circuit current. The short-circuit power consumed in inv1 is measured for different sizes of inv1, inv2, inv3, and different supply voltages.

The variations of the transition time at Node1 and the short-circuit power consumed in inv1 with the size of inv3 (S3) are shown in Fig. 2.5. VDD is set to 1V. S1 and S2 are fixed at 1 while S3 is varied from 1 to 8. The transition time is defined as the duration for a signal to change from 10% to 90% (or 90% to 10%) of the full voltage swing. Increasing S3 leads to a longer transition


time of the signal at Node1 due to the increased capacitance at Node1. The increased transition time at Node1 leads to a higher short-circuit power consumed in inv1 due to the increased period in which the pull-up and the pull-down networks of inv1 are simultaneously turned on. The transition time at Node1 and the short-circuit power consumed in inv1 increase approximately linearly with the size of inv3 as shown in Fig. 2.5. The dynamic switching power is however independent of the input transition time as presented in Section 2.1. The ratio of the short-circuit power to the total power therefore increases with a longer input transition time. The ratio of the short-circuit power to the total power is increased from 7% to 20% when S3 is increased from 6 to 8 (input transition time is increased from 143ps to 181ps).

Fig. 2.5. Variations of the transition time at Node1 and the short-circuit power consumed in inv1 with the size of inv3. inv1 and inv2 are minimum sized. VDD = 1V. [Axes: transition time at Node1 (ps) and short-circuit power of inv1 (µW) versus normalized size of inv3 (S3).]

Note that the short-circuit power consumed in inv1 is negative for a short input transition time (S3 smaller than 5.2 in Fig. 2.5) due to the coupling capacitance between the input and the output of inv1. When Node1 is at 0V Node2 is at VDD. When Node1 transitions to VDD the voltage on Node2 exceeds VDD momentarily due to the coupling capacitance between Node1 and Node2, thereby causing a negative Ip current as shown in Fig. 2.6. Similarly, when Node1 transitions to GND the voltage on Node2 is reduced below 0V momentarily due to the coupling capacitance between Node1 and Node2, thereby causing a negative In current as shown in Fig. 2.6. The


negative Ip and In currents reduce the short circuit power and could lead to negative short-circuit power.

Fig. 2.6. Effect of coupling capacitance on the short-circuit power. During the input low to high transition the voltage on the output node increases momentarily beyond VDD leading to a negative Ip. Similarly, during the input high to low transition the voltage on the output node decreases momentarily below 0V leading to a negative In. [Axes: input (Node1) and output (Node2) voltages (V), and Ip and In currents (µA), versus time (ns).]

The variation of the short-circuit power consumed in inv1 with the size of the load inverter (inv2) is shown in Fig. 2.7. S1 and S3 are fixed at 1 and 8, respectively. S2 is varied from 1 to 8. Since S1 and S3 are fixed the transition time of the input signal (Node1) is approximately constant as shown in Fig. 2.7. Alternatively, the load capacitance seen by inv1 is increased with the size of inv2 (S2), leading to a longer transition time at the output of inv1 as shown in Fig. 2.7. The short-circuit power is reduced with a longer output transition which is explained as follows. During the input low to high transition, if the output transition time is long, the voltage on the output node is maintained close to VDD during the input transition. Hence the source to drain voltage of the PMOS transistor is maintained close to 0V. The short-circuit current produced by the PMOS transistor is therefore suppressed. Similarly, during the input high to low transition, if the output transition time is long, the voltage on the output node is maintained close to 0V during the input transition. Hence the drain to source voltage of the NMOS transistor is maintained close to 0V. The short-circuit current produced by the NMOS transistor is therefore reduced. The short-circuit power consumed in inv1 is therefore reduced with a longer output transition time as shown in Fig. 2.7. The short-circuit power is reduced eight times when S2 is increased from 1 to 8 (output


transition time is increased from 60ps to 184ps). From Figs. 2.5 and 2.7, the short-circuit power is reduced with a shorter input transition time and a longer output transition time. The input transition time of a CMOS circuit is the output transition time of the driver circuit and the output transition time of a CMOS circuit is the input transition time of the fanout circuits. Reducing the total short-circuit power in a given path is therefore achieved when the input transition time and the output transition time are comparable for each circuit along the path.

Fig. 2.7. Variations of the input and output transition times and the short-circuit power consumed in inv1 with the size of inv2. The normalized sizes of inv1 and inv3 are 1 and 8, respectively. VDD = 1V. [Axes: transition time (ps) at the input (Node1) and the output (Node2), and short-circuit power of inv1 (µW), versus normalized size of inv2 (S2).]

The variation of the short-circuit power with the supply voltage (VDD) is shown in Fig. 2.8. S1, S2, and S3 are fixed at 5, 1, and 5, respectively. With this choice of inverter sizes, the transition times at Node1 and Node2 are 200ps and 43ps, respectively, resulting in a significant short-circuit power at high VDD. The supply voltage is varied from 0.8V to 1.2V. Both the short-circuit power and the dynamic switching power are significantly reduced with supply voltage scaling as shown in Fig. 2.8. The dynamic switching power is reduced quadratically with supply voltage scaling as explained in Section 2.1. Alternatively, the short-circuit power is reduced approximately cubically with the supply voltage scaling since the amplitude and duration of the short-circuit current are proportional to VDD. The ratio of the short-circuit power to the total power is reduced from 0.59 to 0.07 when the supply voltage is scaled from 1.2V to 0.8V. The contribution of the short-circuit power to the total power is therefore expected to be reduced with technology scaling since the supply voltage is typically scaled faster than the threshold voltages.

Fig. 2.8. Variations of the short-circuit power consumed in inv1 and the ratio of the short-circuit power to the total power with the supply voltage (VDD). [Axes: ratio of short-circuit power to total power and short-circuit power of inv1 (µW) versus VDD (V).]

2.3. Leakage Power Consumption

The resistance of a cutoff MOSFET is not infinite. A cutoff MOSFET therefore continues to conduct a current between the drain and source. This current is called sub-threshold leakage current. The sub-threshold leakage current is significantly increased with technology scaling due to the threshold voltage scaling. Furthermore, the scaling of the gate oxide thickness leads to a significant increase in the gate tunneling current. The increased doping levels with technology scaling also lead to increased reverse bias junction leakage currents. The sub-threshold and the gate tunneling currents are bias dependent as illustrated in Fig. 2.9 for a CMOS inverter.

Fig. 2.9. Illustration of the sub-threshold and the gate tunneling currents in a CMOS inverter for the different input states. The gate tunneling currents are shown with the horizontal arrows. Ioff-n (Ioff-p) is the sub-threshold current conducted by the turned off NMOS (PMOS).

The power consumed due to the leakage current (Ileak) is

$$P_{leakage} = V_{DD} I_{leak}. \tag{2.11}$$


The sub-threshold leakage current, the gate tunneling current and the reverse biased junction leakage current are characterized in Sections 2.3.1, 2.3.2, and 2.3.3, respectively.

2.3.1. Sub-threshold Leakage

Ideally an MOS transistor conducts zero current when the gate-to-source voltage is less than the threshold voltage (sub-threshold regime) as shown in Fig. 2.10a. A closer examination of the IDS-VGS curve with a logarithmic scale shows that the drain current is not zero in the sub-threshold regime. The sub-threshold drain current, however, drops exponentially with the reduction in the gate-to-source voltage as shown in Fig. 2.10b. The sub-threshold drain current is caused primarily by the diffusion of the minority carriers in the channel region [1]. This sub-threshold leakage current depends on the bias voltages of the transistor, the threshold voltage, the device dimensions, the doping profile of the channel, the source and the drain, and the junction temperature [1]. A derivation of the drain current in the sub-threshold regime and the parameters that affect the sub-threshold leakage current are presented in this section.

Fig. 2.10. IDS-VGS characteristics of a minimum sized NMOS in 65nm CMOS technology at room temperature. VDS = 1V. Vtn: NMOS threshold voltage. Ioff-n: NMOS drain current when VGS = 0V. (a) Linear scale. (b) Logarithmic scale. [Axes: IDS (µA) versus VGS (V).]

The drain current in the sub-threshold regime is derived for an NMOS transistor assuming uniform channel doping [112]. The device profile of an NMOS is shown in Fig. 2.11a with the source and substrate grounded. A voltage source VGS < Vtn is applied to the gate terminal. The drain is biased with a voltage source equal to VDS. The band diagram along the “X” direction is shown in Fig. 2.11b. The band diagram along the “Y” direction near the Si-SiO2 interface is shown in Fig. 2.11c.

Fig. 2.11. (a) A cross section of an NMOS transistor. (b) Band diagram along the vertical direction “X”. (c) Band diagram along the horizontal direction “Y” near the Si-SiO2 interface. ψ(x) is the substrate potential. ψf is the Fermi potential in the bulk of the substrate. Ec is the bottom of the conduction band. Ev is the top of the valence band. Ei is the intrinsic Fermi level. Vbi is the built-in potential.

The Fermi potential at the bulk of the substrate (Fig. 2.11b) is given by

$$\psi_f = V_T \ln\left(\frac{N_A}{n_i}\right), \tag{2.12}$$

$$V_T = \frac{kT}{q}, \tag{2.13}$$

where VT, NA, ni, k, T, and q are the thermal voltage, the channel doping concentration, the Silicon intrinsic carrier concentration, the Boltzmann constant, the absolute temperature, and the unit charge, respectively. The potential of the p-silicon substrate as a function of “x” is

$$\psi(x) = E_c(\infty) - E_c(x), \tag{2.14}$$

where Ec is the energy level of the conduction band edge. The potential of the channel at the Si-SiO2 interface (surface potential) is given by ψs = ψ(0). The surface potential can be approximated in terms of the gate-to-source voltage as follows (illustrated in Fig. 2.12):

$$\psi_s = \frac{V_{GS}}{n}, \tag{2.15}$$

$$n = 1 + \frac{C_d}{C_{ox}}, \tag{2.16}$$

$$C_d = \frac{\varepsilon_{si}}{t_d}, \tag{2.17}$$

$$C_{ox} = \frac{\varepsilon_{ox}}{t_{ox}}, \tag{2.18}$$

where n, Cd, Cox, εsi, td, εox, and tox are the sub-threshold swing coefficient, the depletion capacitance, the gate-oxide capacitance, the dielectric permittivity of silicon, the depletion layer thickness, the dielectric permittivity of silicon dioxide, and the gate-oxide thickness, respectively. Note that equation 2.17 applies only when the channel is depleted.
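Equations (2.15) through (2.18) describe a capacitive voltage divider between the gate-oxide and the depletion capacitances. The Python sketch below evaluates them for sample dimensions. The depletion layer thickness (12 nm), the gate-oxide thickness (1.2 nm), and the function names are assumptions made for the example and do not describe the 65nm technology used in this work. The relative permittivities of silicon (11.7) and silicon dioxide (3.9) are standard textbook values.

```python
EPS_0 = 8.8541878128e-12  # vacuum permittivity in F/m
EPS_SI = 11.7 * EPS_0     # permittivity of silicon
EPS_OX = 3.9 * EPS_0      # permittivity of silicon dioxide


def swing_coefficient(t_d, t_ox):
    """Sub-threshold swing coefficient n from eqs. (2.16) to (2.18)."""
    c_d = EPS_SI / t_d        # depletion capacitance per unit area, eq. (2.17)
    c_ox = EPS_OX / t_ox      # gate-oxide capacitance per unit area, eq. (2.18)
    return 1.0 + c_d / c_ox   # eq. (2.16)


def surface_potential(v_gs, n):
    """Surface potential approximation, eq. (2.15)."""
    return v_gs / n


# Hypothetical thicknesses (assumed values for the example).
T_D = 12e-9    # depletion layer thickness in meters
T_OX = 1.2e-9  # gate-oxide thickness in meters

n = swing_coefficient(T_D, T_OX)
psi_s = surface_potential(0.13, n)
print(n, psi_s)
```

A thinner gate oxide or a wider depletion layer moves n toward 1, so a larger share of the gate voltage appears as surface potential.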

Fig. 2.12. Equivalent circuit relating the surface potential to the gate voltage.

The drain current in the sub-threshold regime is conducted by the thermal diffusion of the electrons from the source to the drain through the channel area. The excess electron concentrations in the channel at the source and the drain ends are given by

$$\Delta n(0) = \frac{n_i^2}{N_A} \exp\left(\frac{\psi_s}{V_T}\right) = \frac{n_i^2}{N_A} \exp\left(\frac{V_{GS}}{nV_T}\right), \tag{2.19}$$

$$\Delta n(L_{eff}) = \frac{n_i^2}{N_A} \exp\left(\frac{\psi_s - V_{DS}}{V_T}\right) = \frac{n_i^2}{N_A} \exp\left(\frac{V_{GS}}{nV_T}\right) \exp\left(\frac{-V_{DS}}{V_T}\right), \tag{2.20}$$

respectively. The diffusion current density (JDS) is proportional to the gradient of the electron concentration. JDS is given by:

$$J_{DS} = qD_n \frac{\Delta n(0) - \Delta n(L_{eff})}{L_{eff}} = q \frac{D_n n_i^2}{L_{eff} N_A} \exp\left(\frac{V_{GS}}{nV_T}\right) \left(1 - \exp\left(\frac{-V_{DS}}{V_T}\right)\right), \tag{2.21}$$

where Dn and Leff are the electron diffusion coefficient and the effective channel length, respectively. The electron diffusion coefficient is related to the electron mobility (µn) by (Einstein relation):

$$D_n = \mu_n V_T. \tag{2.22}$$

The drain-to-source current is the diffusion current density multiplied by the transistor width and the thickness of the electron layer at the Si-SiO2 interface. The electron layer thickness is inversely proportional to the electric field at the surface of the channel [112]. The drain-to-source current is given by

$$I_{DS} = J_{DS} W_n \left(\frac{V_T}{E_s}\right), \tag{2.23}$$

where Wn, VT, and Es are the NMOS transistor width, the thermal voltage, and the electric field at the surface of the channel. The electrical field at the surface of the channel is equal to the depletion charge divided by the dielectric permittivity of silicon (Gauss’s law). Es is

Es =

qN Atd

ε si

=

qN A qN A = Cd ( n − 1) Cox ,

(2.24)

30

where q, NA, td, εsi, Cd, Cox, and n are the unit charge, the channel doping concentration, the depletion thickness, the dielectric permittivity of silicon, the depletion capacitance (equation 2.17), the gate-oxide capacitance (equation 2.18), and the sub-threshold swing coefficient (equation 2.16), respectively. By substituting equations 2.21, 2.22, and 2.24 in equation 2.23, the sub-threshold drain current can be expressed as:

\[ I_{DS} = \frac{W_n}{L_{eff}}\,\mu_n C_{ox} V_T^2 (n-1)\left(\frac{n_i}{N_A}\right)^2 \exp\left(\frac{V_{GS}}{nV_T}\right)\left(1 - \exp\left(\frac{-V_{DS}}{V_T}\right)\right). \tag{2.25} \]
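The substitution leading to equation 2.25 can be written out step by step. The following is only a sketch of the intermediate algebra, using the symbols already defined in equations 2.21 to 2.24 and introducing nothing new:

```latex
\begin{align*}
I_{DS} &= J_{DS}\, W_n\, \frac{V_T}{E_s} \\
       &= \left[ q\,\frac{\mu_n V_T\, n_i^2}{L_{eff} N_A}
          \exp\left(\frac{V_{GS}}{nV_T}\right)
          \left(1 - \exp\left(\frac{-V_{DS}}{V_T}\right)\right) \right]
          W_n V_T\, \frac{(n-1)\,C_{ox}}{q N_A} \\
       &= \frac{W_n}{L_{eff}}\, \mu_n C_{ox} V_T^2 (n-1)
          \left(\frac{n_i}{N_A}\right)^2
          \exp\left(\frac{V_{GS}}{nV_T}\right)
          \left(1 - \exp\left(\frac{-V_{DS}}{V_T}\right)\right)
\end{align*}
```

The unit charge q cancels between the diffusion current density and the surface field, and the two factors of 1/NA combine with ni squared into the (ni/NA) squared term.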

From equation 2.12,

\[ \left(\frac{n_i}{N_A}\right)^2 = \exp\left(\frac{-2\psi_f}{V_T}\right). \tag{2.26} \]

By definition, the threshold voltage (Vtn) is the gate-to-source voltage at which the surface potential is equal to twice the Fermi potential. Hence, from equations 2.26 and 2.15,

\[ \left(\frac{n_i}{N_A}\right)^2 = \exp\left(\frac{-2\psi_f}{V_T}\right) = \left.\exp\left(\frac{-\psi_s}{V_T}\right)\right|_{V_{GS}=V_{tn}} = \exp\left(\frac{-V_{tn}}{nV_T}\right). \tag{2.27} \]

By substituting equation 2.27 in equation 2.25, the final form of the drain current of an NMOS transistor in the sub-threshold regime is given by [107]:

\[ I_{DS} = \frac{W_n}{L_{eff}}\,\mu_n C_{ox} V_T^2 (n-1)\exp\left(\frac{V_{GS} - V_{tn}}{nV_T}\right)\left(1 - \exp\left(\frac{-V_{DS}}{V_T}\right)\right). \tag{2.28} \]

The drain current in the sub-threshold regime for a PMOS is similarly given by

\[ I_{SD} = \frac{W_p}{L_{eff}}\,\mu_p C_{ox} V_T^2 (n-1)\exp\left(\frac{V_{SG} + V_{tp}}{nV_T}\right)\left(1 - \exp\left(\frac{-V_{SD}}{V_T}\right)\right), \tag{2.29} \]

where Wp, µp, and Vtp are the PMOS transistor width, the hole mobility, and the PMOS threshold voltage, respectively.

The sub-threshold slope (SS) is a commonly used parameter for characterizing the sub-threshold current. The sub-threshold slope is defined as the change in the gate voltage that produces a tenfold change in the drain current (Fig. 2.10b). Based on equation 2.28, the sub-threshold slope is

\[ SS = nV_T \ln(10) = 2.3\,nV_T. \tag{2.30} \]
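Equation 2.30 is easy to evaluate numerically. The short Python sketch below is illustrative only: the function names and the swing coefficients in the loop are assumptions made for the example and are not taken from the thesis.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23       # Boltzmann constant (J/K)
ELEMENTARY_CHARGE_C = 1.602176634e-19  # unit charge q (C)


def thermal_voltage(temp_kelvin):
    """Thermal voltage V_T = kT/q in volts."""
    return BOLTZMANN_J_PER_K * temp_kelvin / ELEMENTARY_CHARGE_C


def subthreshold_slope(n, temp_kelvin):
    """Sub-threshold slope SS = n * V_T * ln(10) in volts per decade (equation 2.30)."""
    return n * thermal_voltage(temp_kelvin) * math.log(10.0)


# Illustrative swing coefficients (assumed values, not thesis data) at 27 degrees C.
for n in (1.0, 1.3, 1.5):
    ss_mv = 1e3 * subthreshold_slope(n, 300.15)
    print(f"n = {n:.1f}: SS = {ss_mv:.1f} mV/decade")
```

With n = 1 the slope is close to 60 mV per decade at room temperature. Because n is greater than one in a real device (n - 1 = Cd/Cox from equation 2.24), the slope of an actual transistor is larger than this limit.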

The off-current is the drain current when the gate-to-source voltage is equal to zero. From equations 2.28 and 2.29, the off-current of an NMOS transistor (Ioff-n) and a PMOS transistor (Ioff-p) are given by:

\[ I_{off\text{-}n} = \frac{W_n}{L_{eff}}\,\mu_n C_{ox} V_T^2 (n-1)\exp\left(\frac{-V_{tn}}{nV_T}\right)\left(1 - \exp\left(\frac{-V_{DS}}{V_T}\right)\right), \tag{2.31} \]

\[ I_{off\text{-}p} = \frac{W_p}{L_{eff}}\,\mu_p C_{ox} V_T^2 (n-1)\exp\left(\frac{V_{tp}}{nV_T}\right)\left(1 - \exp\left(\frac{-V_{SD}}{V_T}\right)\right). \tag{2.32} \]

As given by equations 2.31 and 2.32, the off-current varies exponentially with the threshold voltage, the drain-to-source voltage, and the temperature (VT is proportional to the absolute temperature). In contrast, the off-current varies linearly with the transistor width. The variation of the off-current with temperature for NMOS and PMOS transistors in a 65nm technology is shown in Fig. 2.13. The off-current is increased by 5.7X and 12.9X for an NMOS transistor and a PMOS transistor, respectively, when the temperature is increased from -40°C to 120°C.
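The dependencies stated by equations 2.31 and 2.32 can be explored with a few lines of Python. This is a minimal sketch under stated assumptions: the function name and every device parameter in `DEVICE` are hypothetical values chosen only to exercise the formula, and the temperature dependence of the mobility and the threshold voltage is ignored.

```python
import math

K_OVER_Q = 8.617333262e-5  # Boltzmann constant divided by the unit charge (V/K)


def off_current(width, length, mobility, cox, n, vth_abs, vds_abs, temp_kelvin):
    """Off-current magnitude from equations 2.31/2.32 (VGS = 0).

    vth_abs and vds_abs are magnitudes, so one function covers NMOS and PMOS.
    Units: width and length in m, mobility in m^2/(V*s), cox in F/m^2, result in A.
    """
    vt = K_OVER_Q * temp_kelvin  # thermal voltage V_T
    prefactor = (width / length) * mobility * cox * vt ** 2 * (n - 1.0)
    return prefactor * math.exp(-vth_abs / (n * vt)) * (1.0 - math.exp(-vds_abs / vt))


# Hypothetical device parameters (assumptions, not values from the thesis).
DEVICE = dict(width=65e-9, length=65e-9, mobility=0.03, cox=0.02, n=1.5,
              vds_abs=1.0, temp_kelvin=300.15)

low_vth = off_current(vth_abs=0.15, **DEVICE)
high_vth = off_current(vth_abs=0.30, **DEVICE)
print(f"Ioff(150 mV) / Ioff(300 mV) = {low_vth / high_vth:.1f}")
```

In this simplified model the ratio depends only on n and VT, and doubling `width` doubles the result. The factors reported in the text for the 65nm technology come from the author's data, and this expression is not expected to reproduce them.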

Fig. 2.13. Off-current versus temperature. VDS = 1V. Vtn = -Vtp = 220mV. Wn = Wp = 65nm. [Plot: off-current (nA) versus temperature (°C) for Ioff-n and Ioff-p.]


The variation of the off-current with the absolute value of the threshold voltage is depicted in Fig. 2.14. The off-current increases exponentially as the threshold voltage is reduced. The off-current is increased by 9X and 55X for an NMOS and a PMOS transistor, respectively, when the threshold voltage is reduced from 300mV to 150mV.

Fig. 2.14. Off-current versus threshold voltage. VDS = 1V. T = 27°C. Wn = Wp = 65nm. [Plot: off-current (nA) versus |Vth| (mV) for Ioff-n and Ioff-p.]

Based on equations 2.31 and 2.32, the off-current increases with |VDS| until |VDS| reaches about four times the thermal voltage (about 0.1V at room temperature), where the factor 1 - exp(-|VDS|/VT) is already within 2% of unity. The off-current saturates with further increase in |VDS|. The variation of the off-current with |VDS| is shown in Fig. 2.15 for NMOS and PMOS transistors in a 65nm CMOS technology at room temperature. Contrary to the predictions of equations 2.31 and 2.32, the off-currents continue to increase with further increase in |VDS| beyond 0.1V. This is due to the drain-induced barrier lowering (DIBL) phenomenon that occurs in short-channel transistors. DIBL is explained with the band diagram shown in Fig. 2.16. The conduction band edge is shown in Fig. 2.16 from the source to the drain near the surface of the channel of an NMOS transistor [109]. The increase in the drain-to-source voltage in short-channel transistors not only causes a reduction in the electron concentration at the drain end of the channel but also causes a lowering of the potential barrier at the source end. The lowering of the potential barrier near the source end is equivalent to reducing the threshold voltage. The off-current is exponentially dependent on the threshold voltage as explained above.

Fig. 2.15. Off-current versus VDS. T = 27°C. Vtn = -Vtp = 220mV. Wn = Wp = 65nm. [Plot: off-current (nA) versus |VDS| (V) for Ioff-n and Ioff-p.]
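The saturation that equations 2.31 and 2.32 predict for the drain bias can be checked numerically. The sketch below evaluates only the drain-bias factor of those equations; the function name and the rounded room-temperature thermal voltage are assumptions made for the example.

```python
import math


def drain_bias_factor(vds_abs, vt):
    """Factor 1 - exp(-|VDS|/V_T) that multiplies the off-current in equations 2.31/2.32."""
    return 1.0 - math.exp(-vds_abs / vt)


VT_ROOM = 0.0259  # approximate thermal voltage at room temperature (V)

for multiple in (1, 2, 4, 8):
    factor = drain_bias_factor(multiple * VT_ROOM, VT_ROOM)
    print(f"|VDS| = {multiple} * V_T: factor = {factor:.4f}")
```

At four thermal voltages the factor is already about 0.98, so the equations alone leave almost no room for growth at higher |VDS|. Any further increase of the kind seen in Fig. 2.15 must therefore come from an effect outside these equations, which the text attributes to DIBL.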

Fig. 2.16. Band diagram of an NMOS transistor at the channel surface for different VDS [109]. Ec: bottom of the conduction band. [Diagram: conduction band edge EC from the source through the channel to the drain for increasing VDS.]

The sub-threshold leakage current increases with technology scaling due to the scaled threshold voltages, the increased number of transistors, and the increased die temperature as explained in Chapter 1. The sub-threshold leakage current can be reduced by reducing the transistor width, increasing the threshold voltage, reducing the drain-to-source voltage, and/or reducing the temperature. Reducing the transistor width or the drain-to-source voltage, or increasing the threshold voltage, however, leads to a reduction in the on-current and the circuit speed. Alternatively, reducing the die temperature with enhanced cooling techniques [108] leads to a significant reduction in the sub-threshold leakage current and typically enhances the on-current. The cost of this approach can, however, be prohibitive.

Several circuit techniques are proposed in the literature for reducing the sub-threshold leakage power consumption. With the multi-threshold CMOS (MTCMOS) power gating circuit technique [108]-[110], high threshold voltage transistors are used to switch off the power supply from the rest of the circuits during sleep mode, thereby significantly reducing the sub-threshold leakage current. This design approach comes with an area cost and a small increase in the circuit delay. Another popular technique for reducing sub-threshold leakage is the multi-threshold voltage circuit technique [61], [108], [109], [111]. With this technique, a number of transistor profiles with different threshold voltages are provided by the process technology. The high threshold voltage transistors are utilized in the non-critical paths to reduce the leakage power. The low threshold voltage transistors are utilized in the critical paths to maintain a high performance. Note that in CMOS circuits the drain-to-source voltage of the turned off transistors is equal to the supply voltage (Fig. 2.9). The multi-VDD design methodology discussed in Section 2.1 for reducing the dynamic switching power is therefore also effective in reducing the sub-threshold leakage power.

2.3.2. Gate Leakage Current

The gate oxide leakage current conduction mechanisms are presented in this section. The gate oxide leakage current occurs due to the quantum mechanical tunneling phenomenon. The probability of an electron tunneling through a potential barrier increases exponentially with the reduction of the barrier thickness and the barrier height. In older CMOS technologies (oxide thickness > 2nm) the gate leakage current was orders of magnitude lower than the sub-threshold leakage current. The scaling of the gate oxide thickness results in a significant increase in the gate oxide leakage current. With the current CMOS technologies the oxide thickness is on the order of 1nm to 2nm. The gate leakage current therefore cannot be ignored.

There are two mechanisms for gate tunneling current as shown in Fig. 2.17 [107]. The first tunneling mechanism is called Fowler-Nordheim tunneling, which occurs in a relatively thicker oxide under a high electric field. When the electric field in the gate oxide is sufficiently high (~8MV/cm), large band bending occurs in the substrate and the gate oxide. An inversion layer is created at the interface between the p-silicon substrate and the gate oxide. The oxide thickness seen by the electrons in the inversion layer is reduced at such a high electric field as shown in Fig. 2.17a, thereby leading to a significant increase in the gate tunneling current. The second gate tunneling mechanism is called direct tunneling. Direct tunneling occurs with a thin gate oxide at small electric fields in the gate oxide. Electrons in the inversion layer and the source and the drain regions tunnel through the entire gate oxide thickness with this direct tunneling mechanism as shown in Fig. 2.17b. Direct tunneling is the dominant gate leakage mechanism in today’s CMOS technologies.

Fig. 2.17. Band diagram of an NMOS showing the gate tunneling current. (a) Fowler-Nordheim tunneling mechanism. (b) Direct tunneling mechanism. [Diagrams: n+ polysilicon gate, SiO2 of thickness tox, and p-silicon substrate with Ec, Ef, Ev, and qVG marked.]

The gate leakage current increases exponentially with gate oxide scaling as shown in Fig. 2.18 for a bulk NMOS transistor with a 32nm channel length. Data in Fig. 2.18 are produced with Medici simulations [70] using the direct tunneling model. The gate tunneling current is also significantly smaller for a PMOS as compared to an NMOS due to the larger potential barrier for holes as compared to electrons at the silicon-oxide interface as shown in Fig. 2.19. The potential barrier from the silicon conduction band (at the silicon-oxide interface) to the oxide conduction band is 3.1eV. In contrast, the potential barrier from the silicon valence band (at the silicon-oxide interface) to the oxide valence band is 4.5eV [1], [107].

Fig. 2.18. Medici-predicted gate leakage current versus gate bias and oxide thickness for an NMOS transistor. [Plot: IG (A/µm, logarithmic scale) versus VG (V) for tox = 1.0nm, 1.5nm, 2.0nm, and 2.5nm.]

Fig. 2.19. Band diagram. (a) NMOS transistor. (b) PMOS transistor. NMOS gate leakage is dominated by inversion layer electrons tunneling from the conduction band of the p-silicon substrate to the conduction band of the n+ polysilicon gate. PMOS gate leakage is dominated by inversion layer holes tunneling from the valence band of the n-silicon substrate to the valence band of the p+ polysilicon gate. Inversion layer holes (electrons) are represented with a plus (minus) sign. [Diagrams: barrier heights of 3.1eV for electrons and 4.5eV for holes.]

The gate leakage is bias dependent. The total gate leakage current is characterized for a stack of two NMOS transistors and a stack of two PMOS transistors as shown in Fig. 2.20. Four bias conditions are examined. With the first bias condition the two transistors in the stack are turned on. With the remaining three bias conditions one of the transistors or the two transistors in the stack are turned off as indicated in Fig. 2.20. Note that when the two transistors in the stack are turned off the voltage on the shared node is close to the power rail voltage. This is due to the body effect. The transistor away from the power rail experiences a higher threshold voltage due to the body effect. The drain current decreases exponentially with the increased threshold voltage when the transistor is operating in the sub-threshold regime as explained in Section 2.3.1. The resistance of the transistor that is away from the power rail is therefore significantly higher as compared to the transistor closer to the power rail.

Fig. 2.20. Different biasing conditions for a stack of two transistors (NMOS or PMOS). (a) Bias1: both transistors are turned on. (b) Bias2: the transistor closer to the power rail is turned off and the second transistor in the stack is turned on. (c) Bias3: the transistor closer to the power rail is turned on and the second transistor in the stack is turned off. (d) Bias4: both transistors are turned off. Gate currents are indicated with the dashed arrows. [Schematics: NMOS and PMOS stacks with annotated gate, power rail, and shared-node voltages.]

The gate leakage for the transistor stacks versus the bias condition is shown in Fig. 2.21 for a 65nm CMOS technology with minimum sized transistors operating with VDD = 1V. As shown in Fig. 2.21, the gate leakage current is highest when the two transistors are turned on due to the tunneling of carriers from both the inversion layer and the source and drain regions for both transistors in the stack. In contrast, the lowest gate leakage current occurs when the transistor closer to the power rail is turned off and the transistor away from the power rail is turned on due to the small voltage drops across the transistor terminals as indicated in Fig. 2.20. The gate leakage current is one to two orders of magnitude lower for a PMOS transistor as compared to an NMOS transistor as shown in Fig. 2.21. The relatively lower gate leakage for a PMOS transistor as compared to an NMOS transistor is exploited in [91] to reduce the total leakage current in domino circuits in the sleep mode. Due to the significant gate tunneling currents with oxide thickness scaling, gate dielectric materials with a higher dielectric constant than SiO2 are employed in recent technologies to enhance the strong inversion current without increasing the gate tunneling current [113].

Fig. 2.21. Total gate leakage current of a stack of two transistors versus the biasing condition in a 65nm CMOS technology. The gate leakage is maximized when both transistors are turned on (Bias1). The gate leakage in PMOS transistors is one to two orders of magnitude lower as compared to NMOS transistors. [Plot: Igate (nA, logarithmic scale) for NMOS and PMOS stacks under Bias1 to Bias4.]

2.3.3. Reverse Biased Junction Leakage

In bulk CMOS technology, PN junctions exist in the MOSFET structure as shown in Fig. 2.22a for the case of an NMOS transistor. Similar PN junctions exist in a PMOS structure. For normal operation these PN junctions are reverse biased by connecting the p-type substrate of an NMOS (n-type well of a PMOS) to the lowest (highest) potential in the circuit. A reverse biased PN junction conducts a leakage current due to the diffusion of the minority carriers. The reverse bias junction leakage current density is [107]

\[ J_{reverse\text{-}bias} = qn_i^2\left(\frac{D_n}{L_n N_A} + \frac{D_p}{L_p N_D}\right)\left(\exp\left(\frac{-V_R}{V_T}\right) - 1\right), \tag{2.33} \]

\[ L_n = \sqrt{\tau_n D_n}, \tag{2.34} \]

\[ L_p = \sqrt{\tau_p D_p}, \tag{2.35} \]

\[ V_{bi} = V_T \ln\left(\frac{N_D N_A}{n_i^2}\right), \tag{2.36} \]

where q, ni, Dn, Dp, Ln, Lp, τn, τp, NA, ND, VT, and VR are the unit charge, the intrinsic carrier concentration, the electron diffusion coefficient, the hole diffusion coefficient, the electron diffusion length, the hole diffusion length, the electron lifetime, the hole lifetime, the acceptor doping concentration in the p-region of the junction, the donor doping concentration in the n-region of the junction, the thermal voltage, and the reverse bias voltage applied across the PN junction, respectively. In Fig. 2.22b, VR is equal to the drain voltage.

Another type of reverse biased junction leakage current occurs when the electric field in the PN junction is high (~10^6 V/cm). At such a high electric field a significantly larger reverse junction leakage current is conducted across the PN junction due to the tunneling of electrons through the thin depletion layer from the valence band of the p-type region to the conduction band of the n-type region as illustrated in Fig. 2.22c. This reverse junction leakage current is called band-to-band tunneling (BTBT) current. The BTBT current density is given by [107]

\[ J_{BTBT} = \frac{\sqrt{2m^*}\,q^3 E_m V_R}{4\pi^3\hbar^2 E_g^{1/2}}\exp\left(-\frac{4\sqrt{2m^*}\,E_g^{3/2}}{3qE_m\hbar}\right), \tag{2.37} \]

\[ E_m = \sqrt{\frac{2qN_A N_D\,(V_R + V_{bi})}{\varepsilon_{si}\,(N_A + N_D)}}, \tag{2.38} \]

where m*, q, Em, VR, ħ, and Eg are the electron effective mass, the electron charge, the peak electric field in the PN junction, the reverse bias applied to the junction, the reduced Planck constant, and the energy band gap, respectively. As given by equation 2.37, the BTBT current increases exponentially with the peak electric field in the junction. The peak electric field is determined by the built-in potential, the applied reverse bias, and the doping profile as given by equation 2.38. With technology scaling, the built-in potential is increased due to the increased doping level as given by equation 2.36. This results in a significant increase in the BTBT leakage current [13].
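The effect of the doping level on the built-in potential and on the peak junction field can be illustrated by evaluating equations 2.36 and 2.38 directly. The sketch below is not from the thesis: the function names, the doping values, the intrinsic concentration (roughly 1e10 per cubic centimeter for silicon near room temperature), and the relative permittivity of 11.7 are assumptions used only for the example.

```python
import math

Q = 1.602176634e-19                 # unit charge (C)
K_OVER_Q = 8.617333262e-5           # Boltzmann constant over unit charge (V/K)
EPS_SI = 11.7 * 8.8541878128e-12    # assumed permittivity of silicon (F/m)


def built_in_potential(n_a, n_d, n_i, temp_kelvin):
    """Built-in potential V_bi = V_T * ln(N_D * N_A / n_i^2) (equation 2.36)."""
    vt = K_OVER_Q * temp_kelvin
    return vt * math.log(n_a * n_d / n_i ** 2)


def peak_field(n_a, n_d, v_r, v_bi):
    """Peak junction field E_m in V/m (equation 2.38); dopings in m^-3."""
    return math.sqrt(2.0 * Q * n_a * n_d * (v_r + v_bi) / (EPS_SI * (n_a + n_d)))


N_I = 1e16   # assumed intrinsic concentration of silicon (m^-3)
N_D = 1e26   # hypothetical n+ drain doping (m^-3)

for n_a in (1e23, 1e24, 1e25):  # hypothetical substrate dopings (m^-3)
    v_bi = built_in_potential(n_a, N_D, N_I, 300.0)
    e_m = peak_field(n_a, N_D, 1.0, v_bi)
    print(f"N_A = {n_a:.0e} m^-3: V_bi = {v_bi:.3f} V, E_m = {e_m / 1e8:.2f} MV/cm")
```

Both quantities grow with the substrate doping, which is the trend the text relies on: a higher peak field in equation 2.37 means a larger BTBT current.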

Fig. 2.22. Junction reverse-bias leakage current. (a) An NMOS structure showing the PN junction between the drain/source and the substrate. (b) The band diagram along the “x” axis with a small built-in potential. (c) The band diagram along the “x” axis with a larger built-in potential caused by heavily doping both sides of the junction. Efn is the quasi Fermi potential for electrons. Efp is the quasi Fermi potential for holes. Vbi: PN junction built-in potential.

Several device design techniques are proposed in the literature for reducing the leakage currents in MOSFETs. Two-dimensional channel doping (halo doping) by ion implantation at an angle is proposed in [114] to reduce the short-channel effects and the drain-induced barrier lowering, thereby significantly reducing the sub-threshold leakage current. The high carrier concentration of the halo doping, however, leads to increased BTBT leakage current. The sub-threshold leakage current can be reduced by reducing the gate oxide thickness. The scaled gate oxide thickness leads to increased gate oxide capacitance (equation 2.18), reduced sub-threshold swing coefficient (equation 2.16), and hence reduced sub-threshold leakage current. The scaled gate oxide thickness, however, leads to increased gate leakage current. Replacing the gate oxide with a high-K dielectric material without scaling the gate oxide thickness leads to reduced sub-threshold and gate tunneling leakage currents [113]. The junction and BTBT leakage currents can be completely eliminated using silicon on insulator (SOI) technology.

With the emerging FinFET technology, all components of the leakage current are simultaneously reduced as compared to the single-gate MOSFETs. The gate control over the channel potential is enhanced using multiple electrically coupled gates and a thin silicon film. The short-channel effects are therefore suppressed and the sub-threshold swing coefficient is reduced. The sub-threshold leakage current is therefore significantly lowered. The enhanced control of the gate over the channel permits the use of a thicker gate oxide, thereby significantly reducing the gate tunneling currents. With the SOI FinFET technology, FinFETs are integrated on a buried oxide layer, thereby eliminating the reverse biased junction leakage. With the bulk FinFET technology, FinFETs are integrated on a lightly doped silicon substrate, thereby suppressing the reverse bias junction leakage. The FinFET device architecture and technology development guidelines are presented in Chapter 8.

2.4. Static DC Power Consumption

No static DC power is consumed in CMOS circuits if all the nodes swing fully between 0V and VDD. In some special CMOS circuits the voltage swing of some nodes in the circuit is reduced as shown in Fig. 2.23a. With the circuit in Fig. 2.23a, two supply voltages, VDD1 and VDD2, are utilized such that VDD1 < VDD2. When the signal “A” transitions to 0V, Node1 transitions to VDD1. M4 is turned on and M3 is not fully turned off since VDD1 < VDD2. Static DC current therefore flows from VDD2 to ground. This static DC current can be eliminated using higher threshold voltage transistors as discussed in Chapter 3.

Some circuit topologies such as the pass transistor logic and pseudo-NMOS logic consume static DC current as shown in Figs. 2.23b and 2.23c, respectively. With the pass transistor logic circuits, the voltage swing of intermediate signals is reduced by a threshold voltage drop, thereby causing static DC current in the receiver circuit. With the pseudo-NMOS logic circuits, the pull-up PMOS transistor is always turned on. Static DC current flows in the circuit as indicated in Fig. 2.23c when the pull-down network is turned on. The circuit topology shown in Fig. 2.23d is called current mode logic. This circuit topology is biased with a constant DC current. The bias current is steered between the two branches of the circuit based on the input signal state. This circuit family consumes only static DC current.

Fig. 2.23. Circuit topologies that consume static DC current. (a) CMOS circuits with different power supplies. (b) Pass transistor logic circuits. (c) Pseudo-NMOS logic circuits. (d) Current mode logic circuits.


Chapter 3 Low Power and High Speed Multi Threshold Voltage Interface Circuits

An effective method for reducing the power consumption in integrated circuits is scaling the supply voltage. All components of the power consumption are simultaneously reduced with the scaling of the supply voltage in a CMOS circuit. Lowering the supply voltage, however, also degrades the circuit speed. The multiple supply voltage (multi-VDD) circuit technique exploits the delay differences among the different signal propagation paths within an IC [1], [18]. The supply voltages of the gates on the non-critical delay paths are selectively lowered while a higher supply voltage is maintained on the speed critical paths in order to satisfy the target clock frequency in a multi-VDD circuit. Similarly, in systems-on-chip (SoCs), different circuits operating at different supply voltages exist [24].

When a low voltage swing signal drives a CMOS gate connected to a higher supply voltage, static DC power is consumed as the transistors in the pull-up and the pull-down networks are simultaneously turned on [1]. Furthermore, the output voltage swing of the receiver degrades, thereby leading to a static DC current in the fanout gates of the receiver. In order to transfer signals among these circuits operating at different voltage levels, specialized voltage interface circuits are required. Level converters impose additional power consumption and propagation delay overhead in a multi-VDD system. High-speed and low-power voltage level conversion is critical for effective power reduction with minimum effect on speed in a multi-VDD integrated circuit.

Several factors such as the path propagation delay statistics, the power and delay overheads of the level converters, and the availability and efficiency of the different power supplies determine the choice of the supply voltages in a multi-VDD system [18]-[23]. The number and voltages of the multiple power supplies therefore vary with the type of the IC and the target set of applications. In this chapter a wide range of supply voltages is considered in order to address the speed, power, and area trade-offs in the design of voltage level conversion circuits.

The previously published level converters rely on some form of feedback circuitry for controlling the operation of the pull-up network transistors in order to avoid static DC current within the level converter. These conventional circuits, however, suffer from a significant amount of short-circuit current and degraded speed characteristics due to the typically slow response of the feedback circuitry. Furthermore, transistor resizing with a significant increase in the device widths is required to achieve functionality with a very low voltage transmitter, thereby further increasing the power consumption and the propagation delay with these feedback-based level converters.

In this chapter, two novel level converters based on a multi-threshold voltage (multi-Vth) CMOS technology are presented [25], [36]. Unlike the conventional level conversion techniques based on feedback, the proposed level converters eliminate the static DC current using multi-Vth devices. The new level converters are compared with two previously published feedback-based level converters for different supply voltages. The effectiveness of the proposed circuits for reducing power consumption, propagation delay, and area is evaluated at scaled supply voltages down to the sub-threshold regime.

The chapter is organized as follows. The operation of the proposed level converters is described in Section 3.1. The power consumption and the propagation delay characteristics of the level converters at the nominal process corner and under parameter variations are presented in Section 3.2. Finally, some conclusions are provided in Section 3.3.

3.1. Level Converters

In this section various level conversion techniques are described. The issues related to the standard feedback-based level converters are discussed in Section 3.1.1. Two new level converters based on a multi-Vth CMOS technology are presented in Section 3.1.2.

3.1.1. Feedback-Based Level Converters

The conventional feedback-based level converters are discussed in this section. When a low swing signal directly drives a gate that is connected to a higher supply voltage, the pull-up network of the receiver cannot be fully turned off. A receiver driven by a low voltage swing signal therefore produces static DC current. In order to suppress this DC current, specialized voltage interface circuits are employed between a low voltage driver and a full voltage swing receiver [15], [17]-[23]. In the standard feedback-based voltage interface circuits, the pull-up network transistors are not directly driven by the low voltage swing signal provided by the driver. The operation of the pull-up network transistors is controlled by an internal feedback mechanism isolated from the low voltage swing input signal, thereby avoiding the formation of static DC current paths within the circuit.

These traditional level converters, however, suffer from high short-circuit power and long propagation delay due to the typically slow response of the internal feedback circuitry that controls the operation of the pull-up transistors. Furthermore, the pull-down network transistors in these circuits are driven by low voltage swing signals, unlike the pull-up network transistors that receive higher gate overdrive voltages from the full-voltage swing feedback paths. The widths of the transistors that are directly driven by the low-swing signals need to be significantly increased in order to balance the strength of the pull-up and the pull-down networks, particularly at very low input voltages. This causes further degradation in the speed and the power efficiency of the conventional feedback-based level converters when utilized with very low input voltages.

The standard feedback-based level converter (LC1) [15] is shown in Fig. 3.1. M1 and M2 experience a low gate overdrive voltage (VDDL - Vth) during the operation of the circuit. M1 and M2 need to be sized larger to produce more current as compared to M3 and M4, respectively, for functionality. The circuit operates as follows. When the input is at 0V, M2 is turned off. Node1 is charged to VDDL. M1 is turned on. Node3 is discharged to 0V, turning M4 on. Node2 is charged to VDDH, turning M3 off. The output is pulled down to 0V. When the input transitions to VDDL, M2 is turned on. Node1 is discharged, turning M1 off. Node2 is discharged, turning M3 on. Node3 is charged up to VDDH, turning M4 off. The output transitions to VDDH. A feedback loop, isolated from the input, controls the operation of M3 and M4 during both transitions of the output.

LC1 consumes significant short-circuit and dynamic switching power due to the transitory contention between the pull-up and the pull-down networks and the large size of the NMOS transistors (M1 and M2). The sizes of M1 and M2 need to be further increased to maintain functionality with the lower values of VDDL (in order to compensate for the gate overdrive degradation at a lower VDDL). The load seen by the previous stage (driver circuit) is therefore increased, thereby further degrading the speed and increasing the power consumption. Tapered buffers are required to drive M1 and M2 at very low voltages. These tapered buffers further increase the power consumption and the area overhead of LC1.
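The two steady states of LC1 described above can be summarized as a small lookup. The following Python sketch only encodes the sequence given in the text; it is not a circuit simulation, the function name and dictionary layout are assumptions, and the discharged nodes are taken to settle at 0V.

```python
def lc1_steady_state(input_is_high, vddl, vddh):
    """Steady-state node voltages and transistor states of LC1, as described in the text.

    input_is_high=False means the input is at 0V; True means the input is at VDDL.
    """
    if input_is_high:
        # Input at VDDL: M2 on, Node1 and Node2 discharged, M3 on, Node3 at VDDH, M4 off.
        return {"Node1": 0.0, "Node2": 0.0, "Node3": vddh, "Output": vddh,
                "on": {"M2", "M3"}, "off": {"M1", "M4"}}
    # Input at 0V: M2 off, Node1 at VDDL, M1 on, Node3 at 0V, M4 on, Node2 at VDDH, M3 off.
    return {"Node1": vddl, "Node2": vddh, "Node3": 0.0, "Output": 0.0,
            "on": {"M1", "M4"}, "off": {"M2", "M3"}}


# Hypothetical supply voltages, for illustration only.
for level in (False, True):
    print(level, lc1_steady_state(level, vddl=0.6, vddh=1.2))
```

In both states exactly one of the cross-coupled pull-up transistors (M3 or M4) is on, so no static path exists once the feedback has settled. The contention described above occurs only during the transition between these two states.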

Fig. 3.1. The standard level converter (LC1) presented in [15]. VDDL is the lower supply voltage. VDDH is the higher supply voltage. [Schematic: transistors M1 to M4, inverters I1 and I2, and internal nodes Node1 to Node3.]

Another level converter (LC2) is presented in [17] for enhanced speed as compared to LC1. LC2 is shown in Fig. 3.2. M6 maintains the voltage of Node3 between VDDL and VDDL + Vthn in order to enhance the current produced by M1. The capacitor (C = 8fF) stabilizes the voltage of Node3 against the noise induced by the nearby switching events. The circuit operates as follows. When the input is at 0V, Node1 is discharged through M1. M3 is turned on. M2 is turned off. Node2 is charged to VDDH, turning M4 off. The output is discharged to 0V. When the input transitions to VDDL, M2 is turned on. Node1 is initially charged to a voltage between VDDL - Vthn and VDDL through M1. M3 is not completely cut off (weakly active). M2 is sized to be stronger than M3 for the circuit to function properly. Node2 is discharged, turning M4 on. Node1 is charged all the way up to VDDH, thereby eventually turning M3 off. The output transitions to VDDH.

When the input switches from 0V to VDDL there is a direct current path from VDDH to GND through the M2-M3 path. This direct current path exists until Node1 is charged to VDDH through M4 and M5. Similarly, when the input switches from VDDL to 0V, there is a direct current path from VDDH to GND through the M5-M4-M1 path. This direct current path exists until Node2 is pulled up to VDDH and M4 is turned off. LC2 therefore consumes significant short-circuit power, similar to LC1, during both low-to-high and high-to-low transitions of the output. Furthermore, when VDDL is reduced, a significant increase in the size of M2 is required for maintaining functionality. The load seen by the driver circuit therefore increases at lower VDDL. Tapered buffers are required for driving LC2 at very low voltages. These tapered input drivers further increase the power consumption and the area of LC2.

Fig. 3.2. The level converter (LC2) presented in [17]. [Schematic: transistors M1 to M7, capacitor C, inverter I1, and internal nodes Node1 to Node3.]

3.1.2. Multi-Vth Level Converters

Two new multi-Vth level converters are described in this section. Unlike the previously published level converters that rely on feedback, the proposed level converters employ a multi-Vth CMOS technology in order to eliminate the static DC current. The high threshold voltage pull-up network transistors in the new level converters are directly driven by the low-voltage swing signals without producing a static DC current problem.

The first proposed level converter (PC1) is shown in Fig. 3.3. PC1 is composed of two cascaded inverters with multi-Vth transistors. The threshold voltage of M2 (Vth-M2) is lower (higher |Vth|) for avoiding static DC current in the first inverter when the input is at VDDL. |Vth-M2| is required to be higher than VDDH - VDDL for eliminating the static DC current. PC1 operates as follows. When the input is at 0V, M2 is turned on. M1 is cut off. Node1 is pulled up to VDDH. The output is discharged to 0V. When the input transitions to VDDL, M1 is turned on. M2 is turned off since VGS,M2 > Vth,M2. Node1 is discharged to 0V. The output is charged to VDDH.

PC1 has fewer transistors as compared to LC1 and LC2. Furthermore, the elimination of the slow feedback circuitry reduces the short-circuit power of PC1 as compared to LC1 and LC2. For the lower values of VDDL, the threshold voltage of M2 needs to be lower (higher |Vth|) in order to suppress the static DC current. No increase in the size of M1 is required for achieving functionality at lower input voltages with the proposed multi-Vth circuit (unlike LC1 and LC2). Therefore, particularly for the very low values of VDDL, PC1 consumes lower power, occupies significantly smaller area, and imposes a much smaller load capacitance on the input driver as compared to LC1 and LC2.


Fig. 3.3. The first proposed level converter (PC1). Thick line in the channel area indicates a high-Vth device.

The circuit configurations of the second proposed level converter (PC2) are shown in Fig. 3.4 for operation at different supply voltages. |Vth-M2| is required to be higher than VDDH - VDDL for eliminating the static DC current when the input is low (Node1 is at VDDL). M1 needs to be cut off after a "1" is successfully propagated to the output (the input is at VDDL and the output is at VDDH) in order to avoid the formation of a static DC current path between VDDH and VDDL through M1. The peripheral circuitry composed of M3, M4, and C, shown in Fig. 3.4a, is employed to maintain the Node2 voltage in the range of

$$V_{DDL} < V_{Node2} < V_{DDL} + V_{th\text{-}M1}, \tag{3.1}$$

in order to enhance the speed of charge transfer through M1 while avoiding the formation of a static DC current path within the level converter. M3 maintains the voltage of Node2 at

$$V_{Node2} = V_{DDH} - V_{th\text{-}M3}, \tag{3.2}$$

provided that

$$V_{DDH} - V_{th\text{-}M3} < V_{DDL} + V_{th\text{-}M4}. \tag{3.3}$$

If (3.3) is satisfied, M4 is maintained cut off under normal operating conditions with no external noise coupling onto Node2. The purpose of M4 is to provide a discharge path for Node2 if the voltage on Node2 temporarily exceeds VDDL + Vth-M4 due to nearby switching events and crosstalk. The capacitor (C = 6fF) stabilizes the voltage of Node2 against the noise induced by the nearby switching events. The value of the capacitor is determined by circuit simulation such that the voltage of Node2 does not vary by more than 10% due to the coupling noise generated from within the level converter by the switching input signal. The capacitor is implemented by a MOSFET.

If, however, (3.3) is not satisfied for the very low values of VDDL, a DC current path exists between VDDH and VDDL through M3 and M4 with the circuit configuration shown in Fig. 3.4a. In order to avoid a static DC current path within the level converter, M3, M4, and the capacitor C are eliminated and Node2 is directly connected to VDDL for the voltages that do not satisfy (3.3), with the circuit configuration illustrated in Fig. 3.4b. Similarly, if (3.1) is not satisfied for certain values of VDDL and VDDH, Node2 is directly connected to VDDL, eliminating the need for M3, M4, and C as shown in Fig. 3.4b.


Fig. 3.4. The second proposed level converter (PC2). Thick line in the channel area indicates a high-Vth device. (a) Circuit configuration for VDDL and VDDH that satisfy both (3.1) and (3.3). (b) Circuit configuration for the supply voltages that do not satisfy either (3.1) or (3.3).
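The choice between the two PC2 configurations can be illustrated with a short calculation. The following Python sketch evaluates (3.1) to (3.3) for a given pair of supply voltages. The function names and the nominal threshold voltage of 0.47V used in the example are assumptions made for illustration only; they are not taken from the design flow of this chapter.

```python
def pc2_configuration(vddh, vddl, vth_m1, vth_m3, vth_m4):
    """Return "a" if both (3.1) and (3.3) hold, otherwise "b" (Node2 tied to VDDL)."""
    v_node2 = vddh - vth_m3                       # (3.2): level maintained by M3
    cond_31 = vddl < v_node2 < vddl + vth_m1      # (3.1): M1 is cut off after a "1" propagates
    cond_33 = v_node2 < vddl + vth_m4             # (3.3): M4 stays off in normal operation
    return "a" if (cond_31 and cond_33) else "b"


def vth_m3_window(vddh, vddl, vth_m1, vth_m4):
    """Open interval of Vth-M3 values that satisfies (3.1) and (3.3) simultaneously."""
    upper = vddh - vddl                           # from VDDL < VDDH - Vth-M3
    lower = vddh - vddl - min(vth_m1, vth_m4)     # from the two upper bounds on VNode2
    return lower, upper


def min_abs_vth_m2(vddh, vddl):
    """|Vth-M2| must exceed VDDH - VDDL to suppress the static DC current."""
    return vddh - vddl


VTH_NOMINAL = 0.47  # assumed nominal NMOS threshold voltage (illustrative)

for vddl in (1.2, 1.0, 0.5):
    lo, hi = vth_m3_window(1.8, vddl, VTH_NOMINAL, VTH_NOMINAL)
    print(f"VDDL = {vddl}V: {lo:.2f}V < Vth-M3 < {hi:.2f}V, "
          f"|Vth-M2| > {min_abs_vth_m2(1.8, vddl):.2f}V")
```

With the assumed 0.47V threshold for M1 and M4, a hypothetical Vth-M3 of 0.27V keeps Node2 inside the window at VDDL = 1.2V (configuration a), whereas at VDDL = 0.5V a nominal-Vth M3 violates both conditions and configuration b is selected.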

PC2 operates as follows. When the input is at 0V, Node1 is pulled high to VDDL, turning M2 off (note that M2 has a high |Vth|). The output node is discharged to 0V through the pass transistor M1. When the input transitions to VDDL, the output node is initially charged to VDDH - Vthn-M1 - Vthn-M3 and VDDL - Vthn-M1 through M1 with the circuit configurations shown in Figs. 3.4a and 3.4b, respectively. M2 is turned on after the high-to-low propagation delay of the inverter (I1). The output is pulled high all the way up to VDDH through M2. M1 is turned off, isolating the two power supplies. Both M1 and M2 assist the output low-to-high transition, thereby eliminating the contention current and enhancing the low-to-high propagation speed. The small transistor count and the elimination of the feedback-based control mechanism for the pull-up network reduce the power consumption of the proposed level converter as compared to LC1 and LC2. Furthermore, the speed of PC2 is enhanced due to the shorter input-to-output signal propagation path (composed of only one pass transistor) and the elimination of the contention current during the output low-to-high transition.

3.2. Speed and Power Consumption Characteristics

The two new level converters are compared with the previously published standard feedback-based level converters for average power consumption and propagation delay in this section. The available slacks in the propagation delay paths, the power consumption and delay overheads of the level converters, the availability of high efficiency power supplies, and the availability of a multi-Vth CMOS technology with adequate threshold voltages are the important factors that determine the optimum supply voltages in a multi-VDD system [15], [17]-[23]. A wide range of lower supply voltages is considered in this chapter since the factors that determine the desirable and feasible optimum supply voltages vary with the available technology and the target application. The simulations are carried out for the following values of VDDL: 0.5V, 1V, and 1.2V. The standard nominal supply voltage (VDDH) is 1.8V in this 0.18µm CMOS technology. All the transistors of LC1 and LC2 have nominal-Vth. A comparison between the level converters at the nominal process corner is presented in Section 3.2.1. The characteristics of the level converters under supply voltage and process parameter variations are given in Section 3.2.2. The superiority of the proposed multi-Vth circuits for high-speed and low-power voltage level conversion is confirmed for a wide range of available threshold voltages in Section 3.2.3.

3.2.1. Comparison at the Nominal Process Corner

The level converters are characterized at the nominal process corner in this section. LC1 is redesigned for proper functionality at VDDL = 0.5V. M1 and M2 are driven by low-swing signals while M3 and M4 are biased with full-swing signals (see Fig. 3.1). At very low VDDL, the currents conducted by M1 and M2 are significantly reduced. M1 and M2 are resized for producing higher current as compared to M3 and M4. Tapered inverters are employed in order to drive M1 and M2 after the resizing. The sizing of these tapered inverters is included in the optimization process. Similarly, LC2 is redesigned for proper functionality at VDDL = 0.5V. The size of M2 is increased significantly for functionality with LC2 at VDDL = 0.5V (see Fig. 3.2). An inverter that is large enough for driving M2 is used. The resizing of the new inverter is included in the optimization process. I1 is removed to maintain the output polarity. The second configuration of PC2 shown in Fig. 3.4b is used at VDDL = 0.5V since both (3.1) and (3.3) are violated. Two cascaded inverters are added at the output of PC2 before the load.

The simulation setup is depicted in Fig. 3.5. The sizes of the driver (ID) and the load (IL) inverters are 4X the size of a minimum size inverter (minimum sized inverter: Wn = Wmin and Wp = 2.5Wmin). The temperature is 125 °C. The activity factor of the input signal is 0.1 (a typical value for the logic core of an IC [17]). The propagation delay is measured from the input of ID to Node2 in order to include the loading effect of the level converter on the driver circuit when the circuit is optimized for minimum propagation delay. Reducing the sizes of the transistors in the level converter decreases the dynamic switching power consumption by lowering the switched capacitance. The level converter output rise and fall times are, however, increased with the reduced size of the transistors, thereby increasing the short-circuit power consumption of the load (IL). The average power consumption is measured for the whole circuit (including the power consumed by the driver ID and the load IL) in order to evaluate the trade-off between the dynamic switching power consumption of the level converter and the short-circuit power consumption of the load.

The circuits are optimized with two different design criteria for each value of VDDL. Minimizing the average power consumption and minimizing the average propagation delay are the goals of the first and the second sets of optimizations, respectively. The design and optimization of the circuits are carried out using the HSPICE built-in optimizer in a 0.18µm TSMC CMOS technology. The optimization results are listed in Table 3.1. The optimum threshold voltages of M2 and M3 are listed in Table 3.2 for the proposed circuits at different input voltages with different optimization goals. As described in Section 3.1.2, the threshold voltage of M2 (|Vth-M2|) is required to be higher than 0.6V, 0.8V, and 1.3V for VDDL = 1.2V, 1V, and 0.5V, respectively, for both PC1 and PC2. Similarly, from (3.1), (3.2), and (3.3), the ranges of Vth-M3 for VDDL = 1.2V and VDDL = 1V are 0.13V < Vth-M3 < 0.6V and 0.33V < Vth-M3 < 0.8V, respectively.

Fig. 3.5. The simulation setup for characterizing the level converters. Power is measured for the entire test circuit including the driver and the load inverters. Delay is measured from the input of the driver inverter (ID) to Node2.

When the circuits are individually optimized for minimum power consumption, PC1 and PC2 consume lower power as compared to LC1 and LC2 for all values of VDDL as listed in Table 3.1. Alternatively, when the circuits are optimized for minimum average propagation delay, PC1 and PC2 are faster as compared to LC1 and LC2 for all values of VDDL. From this point on, the proposed circuits are compared only with LC2 since LC2 is faster and consumes lower power as compared to LC1. The normalized total transistor width, average propagation delay, and power consumption of LC2, PC1, and PC2 are listed in Table 3.3.

When the circuits are optimized for minimum power consumption, the power consumption of PC1 is 11% (3%), 13% (10%), and 58% (25%) lower as compared to LC2 (PC2) for VDDL = 1.2V, 1V, and 0.5V, respectively. When the circuits are optimized for minimum propagation delay, the propagation delay of PC2 is 41% (25%) and 22% (7%) lower as compared to LC2 (PC1) for VDDL = 1.2V and 1V, respectively. The propagation delay of PC1 is 70% (40%) lower as compared to LC2 (PC2) at VDDL = 0.5V. The total transistor width of PC1 is 54% to 96% (61% to 94%) smaller as compared to LC1 (LC2) for the various design objectives and VDDL values considered in this study.
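The percentages quoted above follow directly from the entries of Table 3.1. As a cross-check, the short Python sketch below recomputes them; the dictionary layout and the helper name `reduction_pct` are illustrative choices, while the numbers are the tabulated power (optimum power design) and delay (optimum delay design) values.

```python
# Average power (µW) of the optimum power designs, from Table 3.1.
POWER_UW = {
    1.2: {"LC2": 4.44, "PC1": 3.93, "PC2": 4.06},
    1.0: {"LC2": 4.37, "PC1": 3.79, "PC2": 4.21},
    0.5: {"LC2": 8.54, "PC1": 3.57, "PC2": 4.74},
}

# Average propagation delay (ps) of the optimum delay designs, from Table 3.1.
DELAY_PS = {
    1.2: {"LC2": 176, "PC1": 137, "PC2": 103},
    1.0: {"LC2": 197, "PC1": 165, "PC2": 153},
    0.5: {"LC2": 3413, "PC1": 1030, "PC2": 1720},
}


def reduction_pct(new, reference):
    """Percentage by which `new` is lower than `reference`."""
    return 100.0 * (1.0 - new / reference)


for vddl, row in POWER_UW.items():
    print(f"VDDL = {vddl}V: PC1 power is "
          f"{reduction_pct(row['PC1'], row['LC2']):.0f}% lower than LC2 and "
          f"{reduction_pct(row['PC1'], row['PC2']):.0f}% lower than PC2")

for vddl in (1.2, 1.0):
    row = DELAY_PS[vddl]
    print(f"VDDL = {vddl}V: PC2 delay is "
          f"{reduction_pct(row['PC2'], row['LC2']):.0f}% lower than LC2 and "
          f"{reduction_pct(row['PC2'], row['PC1']):.0f}% lower than PC1")
```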

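TABLE 3.1. TOTAL TRANSISTOR WIDTH (W), AVERAGE PROPAGATION DELAY (D), AND AVERAGE POWER CONSUMPTION (P) OF THE LEVEL CONVERTERS

| VDDL | Circuit | Optimum Power Design: W (nm) | D (ps) | P (µW) | Optimum Delay Design: W (nm) | D (ps) | P (µW) |
|------|---------|------|------|-------|-------|------|-------|
| 1.2V | LC1 | 2320 | 234 | 5.17 | 7250 | 203 | 7.63 |
| 1.2V | LC2 | 6594 | 193 | 4.44 | 8914 | 176 | 5.4 |
| 1.2V | PC1 | 1060 | 167 | 3.93 | 2300 | 137 | 4.53 |
| 1.2V | PC2 | 5073 | 131 | 4.06 | 6053 | 103 | 4.68 |
| 1V | LC1 | 2840 | 271 | 5.18 | 7700 | 257 | 6.74 |
| 1V | LC2 | 6684 | 221 | 4.37 | 7884 | 197 | 4.82 |
| 1V | PC1 | 1054 | 199 | 3.79 | 3110 | 165 | 4.64 |
| 1V | PC2 | 5123 | 177 | 4.21 | 5423 | 153 | 4.92 |
| 0.5V | LC1 | 43940 | 4697 | 11.58 | 40300 | 4529 | 12.29 |
| 0.5V | LC2 | 28354 | 3692 | 8.54 | 22654 | 3413 | 8.93 |
| 0.5V | PC1 | 1560 | 1033 | 3.57 | 1760 | 1030 | 3.69 |
| 0.5V | PC2 | 1720 | 1720 | 4.74 | 1480 | 1720 | 4.74 |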

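TABLE 3.2. OPTIMUM THRESHOLD VOLTAGES WITH THE PROPOSED LEVEL CONVERTERS

| VDDL | Circuit | Optimum Power Design: Vth-M2 (V) | Vth-M3 (V) | Optimum Delay Design: Vth-M2 (V) | Vth-M3 (V) |
|------|---------|-------|------|-------|------|
| 1.2V | PC1 | -0.84 | N/A | -0.76 | N/A |
| 1.2V | PC2 | -0.86 | 0.27 | -0.76 | 0.27 |
| 1V | PC1 | -1.00 | N/A | -0.96 | N/A |
| 1V | PC2 | -0.96 | 0.37 | -0.94 | 0.37 |
| 0.5V | PC1 | -1.50 | N/A | -1.46 | N/A |
| 0.5V | PC2 | -1.44 | N/A | -1.44 | N/A |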

TABLE 3.3. NORMALIZED TOTAL TRANSISTOR WIDTH (W), AVERAGE PROPAGATION DELAY (D), AND AVERAGE POWER CONSUMPTION (P) OF THE LEVEL CONVERTERS

| VDDL | Circuit | Optimum Power Design: W | D | P | Optimum Delay Design: W | D | P |
|------|---------|------|------|------|------|------|------|
| 1.2V | LC2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 1.2V | PC1 | 0.16 | 0.87 | 0.89 | 0.26 | 0.78 | 0.84 |
| 1.2V | PC2 | 0.77 | 0.68 | 0.91 | 0.68 | 0.59 | 0.87 |
| 1V | LC2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 1V | PC1 | 0.16 | 0.90 | 0.87 | 0.39 | 0.84 | 0.96 |
| 1V | PC2 | 0.77 | 0.80 | 0.96 | 0.69 | 0.78 | 1.02 |
| 0.5V | LC2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 0.5V | PC1 | 0.06 | 0.28 | 0.42 | 0.08 | 0.30 | 0.41 |
| 0.5V | PC2 | 0.06 | 0.47 | 0.56 | 0.07 | 0.50 | 0.53 |

3.2.2. Characterization Under Supply Voltage and Process Parameter Variations

The robustness of the level converters is evaluated under process and supply voltage variations in this section. The channel length, the gate oxide thickness, the channel doping, and the supply voltages are assumed to have independent Gaussian distributions. Each parameter is assumed to have a three sigma (3σ) variation of 10%. Monte Carlo simulations with 1500 samples are run to produce the statistical distributions of the propagation delay and the power consumption. The Monte Carlo simulation results are shown in Figs. 3.6, 3.7, and 3.8.

In the first phase of the analysis, LC2 and PC2, initially optimized for minimum propagation delay at VDDL = 1.2V at the nominal process corner and supply voltages, are characterized under supply voltage and process parameter variations. The mean of the propagation delay and the power consumption of PC2 are reduced by 40% and 14%, respectively, as compared to LC2 as shown in Fig. 3.6. The power consumption distributions of LC2 and PC2 intersect at 5.1µW. 85.4% of the statistical samples consume more than 5.1µW with LC2. Alternatively, with the proposed circuit PC2, 78.5% of the statistical samples consume less than 5.1µW, as illustrated in Fig. 3.6b.
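The sampling step of such a Monte Carlo analysis can be sketched as follows. This is only an illustration of drawing independent Gaussian parameter values with a 3σ spread of 10%; the parameter names and nominal values in `NOMINAL` are hypothetical placeholders, and the circuit simulation that would consume each sample is not shown.

```python
import random
import statistics


def sample_parameters(nominal, rel_3sigma=0.10, rng=random):
    """Draw one independent Gaussian sample per parameter (3-sigma = rel_3sigma * nominal)."""
    sigma_rel = rel_3sigma / 3.0
    return {name: rng.gauss(value, abs(value) * sigma_rel)
            for name, value in nominal.items()}


# Hypothetical nominal values, for illustration only.
NOMINAL = {
    "channel_length_nm": 180.0,
    "oxide_thickness_nm": 4.0,
    "channel_doping_rel": 1.0,
    "vddh_V": 1.8,
    "vddl_V": 1.2,
}

rng = random.Random(0)   # fixed seed so that the run is repeatable
samples = [sample_parameters(NOMINAL, rng=rng) for _ in range(1500)]

vddh = [s["vddh_V"] for s in samples]
print(f"VDDH mean = {statistics.mean(vddh):.3f}V, SD = {statistics.stdev(vddh):.3f}V")
```

With a 10% three-sigma variation, the standard deviation of each sampled parameter is one thirtieth of its nominal value (0.06V for a 1.8V supply).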

Fig. 3.6. Statistical delay and power distributions of PC2 and LC2 (number of samples versus delay in ps and power consumption in µW). (a) Propagation delay. LC2: mean = 176ps, SD = 6.23ps. PC2: mean = 106ps, SD = 6.3ps (40% reduction of the mean). (b) Power consumption. LC2: mean = 5.41µW, SD = 0.4µW. PC2: mean = 4.67µW, SD = 0.5µW. The level converters (LC2 and PC2) are optimized for minimum propagation delay at VDDL = 1.2V. SD: standard deviation.

In the second phase of the analysis, LC2 and PC1, initially optimized for minimum power consumption at VDDL = 1V at the nominal process corner and supply voltages, are characterized under process parameter and supply voltage fluctuations. The mean of the propagation delay and the power consumption of PC1 are reduced by 10% and 13%, respectively, as compared to LC2 as shown in Fig. 3.7. The propagation delay distributions of LC2 and PC1 intersect at 210ps. With LC2, the propagation delay of 90% of the statistical samples is longer than 210ps. Alternatively, with PC1, the propagation delay of 83% of the statistical samples is shorter than 210ps as shown in Fig. 3.7a. The power consumption distributions of LC2 and PC1 intersect at 4.07µW. With LC2, 80% of the statistical samples consume more than 4.07µW. Alternatively, with PC1, 81% of the statistical samples consume less than 4.07µW, as illustrated in Fig. 3.7b.

Fig. 3.7. Statistical delay and power distributions of PC1 and LC2 (number of samples versus delay in ps and power consumption in µW). (a) Propagation delay. PC1: mean = 200.7ps, SD = 10.2ps. LC2: mean = 222ps, SD = 9.6ps. (b) Power consumption. PC1: mean = 3.82µW, SD = 0.3µW. LC2: mean = 4.4µW, SD = 0.35µW. The level converters (LC2 and PC1) are optimized for minimum power consumption at VDDL = 1V. SD: standard deviation.

Finally, LC2 and PC1, initially optimized for minimum power consumption at VDDL = 0.5V at the nominal process corner and supply voltages, are characterized under process parameter and supply voltage fluctuations. The mean (standard deviation) of the propagation delay and the power consumption of PC1 are 71% (78%) and 59% (74%) lower, respectively, as compared to LC2 as shown in Fig. 3.8.

Fig. 3.8. Statistical delay and power distributions of PC1 and LC2 (number of samples versus delay in ns and power consumption in µW). (a) Propagation delay. PC1: mean = 1.07ns, SD = 0.13ns. LC2: mean = 3.66ns, SD = 0.58ns. (b) Power consumption. PC1: mean = 3.59µW, SD = 0.38µW. LC2: mean = 8.73µW, SD = 1.46µW. The level converters (LC2 and PC1) are optimized for minimum power consumption at VDDL = 0.5V. SD: standard deviation.

3.2.3. Multi-Vth CMOS Technology

In a multi-Vth CMOS technology, the available threshold voltages are limited to a few discrete values. The speed and power consumption characteristics of the proposed level converters are optimized over a wide range of threshold voltages in this section in order to assess the effectiveness of the proposed circuits with different CMOS technologies. The variations of the power consumption and the propagation delay with the threshold voltages are plotted in Figs. 3.9, 3.10, and 3.11 for different VDDL. In Figs. 3.9, 3.10, and 3.11, the lower limit of the PMOS threshold voltage is the nominal threshold voltage minus the difference between VDDH and VDDL. The upper limit of the PMOS threshold voltage is determined as either the nominal threshold voltage, the value at which the optimized characteristic of the proposed circuit starts to degrade as compared to LC2, or the value at which the circuit fails to function due to the reduced voltage swing of the output signal. As shown in Figs. 3.9, 3.10, and 3.11, the proposed circuits maintain higher speed and lower power consumption characteristics as compared to LC1 and LC2 for a wide range of the available threshold voltages.

Fig. 3.9. Variations of the propagation delay (ps) and the power consumption (µW) of PC2 with the threshold voltage of M2 (Vth0-M2, swept from -1.06V to -0.46V) and M3 (Vth0-M3 = 0.27V and 0.47V) at VDDL = 1.2V. The delay and power of LC2 are shown for reference. For each Vth, PC2 is reoptimized (resized) to minimize the propagation delay.

Fig. 3.10. Variations of the propagation delay (ps) and the power consumption (µW) of PC1 with the threshold voltage of M2 (Vth0-M2, swept from -1.26V to -0.66V) at VDDL = 1V. The delay and power of LC2 are shown for reference. For each Vth, PC1 is reoptimized (resized) to minimize the power consumption.

Fig. 3.11. Variations of the propagation delay (ns) and the power consumption (µW) of PC1 with the threshold voltage of M2 (Vth0-M2, swept from -1.76V to -1.16V) at VDDL = 0.5V. The delay and power of LC2 are shown for reference. For each Vth, PC1 is reoptimized (resized) to minimize the power consumption.


The power and speed overheads of the level converters limit the amount of feasible voltage scaling in a multi-VDD system. The power consumption and propagation delay overheads are significantly reduced with the proposed level converters as compared to the previously published standard feedback-based circuits. The new multi-Vth level converters therefore allow further supply voltage scaling beyond the low voltages that would be permitted in a multi-VDD system based on the standard feedback-based level converters. Furthermore, the threshold voltages are scaled less aggressively as compared to the supply voltages with technology scaling. The implementation of the proposed feedback-free circuit techniques therefore becomes more feasible as the gap between the supply and the threshold voltages tends to become narrower with technology scaling.

3.3. Chapter Summary

In this chapter, two novel level converters based on a multi-Vth CMOS technology are proposed. Unlike the standard level converters based on feedback, the new circuits employ multi-Vth transistors in order to suppress the DC current paths in CMOS gates driven by low-swing input signals. The proposed level converters are compared with the previously published circuits for different values of the lower supply voltages in a multi-VDD system. The proposed level converters offer significant power savings of up to 70% as compared to the previously published circuits when the circuits are individually optimized for minimum power consumption in a 0.18µm TSMC CMOS technology. Alternatively, the speed is enhanced by up to 78% with the proposed circuits when the circuits are individually optimized for minimum propagation delay. The proposed circuits maintain higher speed and lower power consumption characteristics as compared to the conventional feedback-based level converters for a wide range of available threshold voltages with different multi-Vth CMOS technologies.

Chapter 4
Dual Power Supplies and Dual Clock Frequencies for Lower Clock Power and Suppressed Temperature-Gradient Induced Clock Skew

The clock distribution network (CDN) consumes a significant portion of the power, area, and metal resources of an integrated circuit (IC). Technology scaling coupled with the increase in die size and clock frequency causes the process and environment parameter variations to be more pronounced [1], [9]-[11]. Coping with parameter variations is particularly challenging in the design of the clock distribution networks since the clock signal needs to be distributed to the entire IC with controlled skew. Clock skew degrades the performance of an IC by reducing the time available for computation in each clock cycle. Time-varying temperature gradients occur due to the imbalanced utilization and diversity of circuitry across an IC. The effect of on-chip temperature gradients on clock skew is characterized in this chapter. Supply voltage optimization is demonstrated to be an effective method for minimizing the clock skew induced by temperature fluctuations in a clock distribution network. The clock signal is distributed at a scaled optimum supply voltage with the proposed scheme. The sizes of the buffers are increased to maintain the target clock frequency while satisfying the transition time constraint (transition time [...]

[...] Target, it is concluded that sizing alone is not sufficient to reach the target output transition time and the current iteration is exited. If Tr-out is below Target, a binary search for S is performed such that Tr-out is within 1% of Target. For the binary search to be valid, Tr-out must be a monotonically decreasing function of S. This condition is satisfied by the proper selection of Smax as presented in Section 5.1. S, y, and the average power consumption are recorded at the end of the binary search. If y is larger than 2*x, the value of y is decreased by x and the next iteration is performed. If y is not larger than 2*x, all the iterations have been performed. The values of y and S that provide the minimum power consumption are reported as the optimum solution with respect to the generated solutions.

For buffer insertion and sizing in an H-tree clock distribution network (BIST), the BISW algorithm is repeatedly employed with a bottom-up tree traversal starting from level 1. The symmetry of the H-tree clock distribution network is exploited to reduce the complexity of the algorithm. The branches in each level are identical. Therefore, it is sufficient to solve the buffering problem for a single branch in each level. The BIST algorithm is listed in Table 5.1. Non-equal signal transition times at the internal nodes are achieved by progressively increasing the transition time requirement from the target value at the leaves to a maximum transition time at the output of the last level (closest to the root). The maximum acceptable transition time is determined by the signal integrity requirements (the clock signal becomes triangular in shape for transition times larger than approximately 30% of the clock period assuming a 50% duty cycle). The algorithm invokes the BISW algorithm with the parameters of the branch at level 1. The optimum solution for that level is recorded. The load for the next level is twice the load represented by the buffer size of level 1 since the tree is binary. The algorithm then invokes BISW for a branch in level 2 with the new load. The process is repeated until all the levels are visited. The final design is then simulated and the total power consumption is reported.

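Flowchart steps of Fig. 5.8:
1. Start. Set y = L and S = Smin.
2. Simulate and check the results. While Tr-out > Target and S ≤ Smax, set S = 2*S and simulate again.
3. If Tr-out > Target, go to step 6. Otherwise set SL = S/2 and SU = S.
4. Set S = (SL + SU)/2, then simulate and check the results. If Tr-out > Target, set SL = S and repeat step 4. If Tr-out < 0.99*Target, set SU = S and repeat step 4.
5. Record S, y, and the power consumption.
6. If y > 2*x, set y = y - x and S = Smin and return to step 2. Otherwise report the S and y that provide the minimum power consumption and stop.

Fig. 5.8. The proposed buffer insertion and sizing algorithm for a single wire (BISW).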
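The search loop of Fig. 5.8 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the SPICE run is replaced by a hypothetical callback `simulate(s, y)` that returns the output transition time and the power, and the closed-form `toy_simulate` model with its coefficients is invented purely so that the example runs.

```python
def bisw(length, x, s_min, s_max, target, simulate, tol=0.01, max_steps=60):
    """Single-wire search as read from Fig. 5.8; returns (S, y, power) or None.

    simulate(s, y) is a hypothetical stand-in for a SPICE run and must return
    (tr_out, power) for buffer size s and buffer spacing y.
    """
    solutions = []
    y = length                      # first iteration: spacing equal to the wire length
    while True:
        s = s_min
        tr_out, power = simulate(s, y)
        # Coarse phase: double S until the target is met or S exceeds Smax.
        while tr_out > target and s <= s_max:
            s *= 2
            tr_out, power = simulate(s, y)
        if tr_out <= target:
            # Fine phase: bisect until Tr-out is within tol (1%) below the target.
            # Relies on Tr-out decreasing monotonically with S.
            s_lo, s_hi = s / 2, s
            for _ in range(max_steps):
                if (1 - tol) * target <= tr_out <= target:
                    break
                s = (s_lo + s_hi) / 2
                tr_out, power = simulate(s, y)
                if tr_out > target:
                    s_lo = s
                else:
                    s_hi = s
            solutions.append((power, s, y))
        # Otherwise sizing alone cannot meet the target at this spacing.
        if y > 2 * x:
            y -= x                  # try a smaller spacing in the next iteration
        else:
            break
    if not solutions:
        return None
    power, s, y = min(solutions)    # keep the lowest-power recorded solution
    return s, y, power


def toy_simulate(s, y, length=5.0):
    """Invented monotonic model: transition time (ps) and power (arbitrary units)."""
    tr_out = 40.0 + 400.0 * y / s + 4.0 * y ** 2
    power = (length / y) * s * 0.1 + 1.0
    return tr_out, power


best = bisw(length=5.0, x=0.5, s_min=1.0, s_max=32.0, target=100.0,
            simulate=toy_simulate)
print(best)
```

As in the flowchart, the doubling phase may overshoot Smax by one step before the feasibility check; a production flow would presumably clamp S to the sizes that are physically available.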

TABLE 5.1. PSEUDO CODE OF THE BUFFER INSERTION AND SIZING ALGORITHM FOR AN H-TREE CLOCK DISTRIBUTION NETWORK (BIST)

Algorithm BIST

Inputs
• H-tree network with m levels
• Signal transition time at the root (Tr-m)
• Signal transition times at the leaves (Tr-0)
• Maximum signal transition time (Tmax)
• Size of the load inverters at the leaves (S0)
• Minimum spacing between the buffers (x)
• Minimum buffer size (Smin)
• Maximum buffer size (Smax)

Step 1
Compute the transition time at the inputs of each level using the formula:
Tr-i = Tr-0 + i * (Tmax - Tr-0) / (m - 1), i = 1, 2, …, m-1

Step 2
• Select a wire at level 1
• Set the size of the load inverter to S0
• Call BISW
• Record the optimum buffer size S1 and spacing y1
• For i = 2 → m
  • Select a wire at level i
  • Size the load inverter twice the buffer size of the level (i-1)
  • Call BISW
  • Record the optimum buffer size Si and spacing yi
  End For
• Simulate the entire clock distribution tree and measure the power consumption

Outputs
(S1, y1), (S2, y2), …, (Sm, ym), power consumption of the entire clock distribution network
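Step 1 of Table 5.1 is a linear interpolation between the leaf transition time and the maximum acceptable transition time. The Python sketch below evaluates it; the function name is illustrative, and the 300ps maximum is an assumed value (30% of a 1ns clock period) chosen only to make the example concrete.

```python
def transition_schedule(tr_leaf, tr_max, m):
    """Transition time targets Tr-i for i = 1 .. m-1 (Step 1 of Table 5.1)."""
    if m < 2:
        raise ValueError("the schedule needs at least two levels")
    step = (tr_max - tr_leaf) / (m - 1)
    return [tr_leaf + i * step for i in range(1, m)]


# 4-level H-tree, 100ps at the leaves, assumed 300ps maximum transition time.
schedule = transition_schedule(tr_leaf=100.0, tr_max=300.0, m=4)
print([round(t, 1) for t in schedule])
```

The targets grow by the same increment at every level and reach the maximum transition time at i = m-1, which is the progressive relaxation from the leaves toward the root described in the text.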

The complexity of the proposed algorithm is estimated by the number of times SPICE is invoked as a function of the number of clocked elements. If the number of clocked elements is n, then the number of levels in the H-tree is

$$m = \left\lceil \log_2 (n) \right\rceil. \tag{5.1}$$

The number of times SPICE is invoked for each call to BISW is

$$C_{BISW} = \left\lfloor \frac{L}{x} \right\rfloor \cdot \left\lceil \log_2 \left( \frac{S_{max}}{S_{min}} \right) \right\rceil \cdot k, \tag{5.2}$$

where L is the length of the wire, x is the minimum spacing between the buffers, Smin is the minimum buffer size, Smax is the maximum buffer size, and k is the number of times SPICE is invoked in the binary search. In order to find an upper boundary for k, an extra exit condition for the binary search is needed when the difference between the upper (SU) and the lower (SL) boundaries used in the binary search (see Fig. 5.8) is below a given threshold Sd. At the beginning of the binary search, SL is one half of SU. The worst case occurs when SU = Smax and SL = Smax/2. In this case the upper boundary for k is

$$k = \left\lfloor \log_2 \left( \frac{S_{max}}{2 S_d} \right) \right\rfloor. \tag{5.3}$$

By combining (5.1), (5.2), and (5.3), the upper boundary for the complexity of the BIST algorithm is

$$C_{BIST} = \left\lceil \log_2 (n) \right\rceil \cdot \left\lfloor \frac{L_{max}}{x} \right\rfloor \cdot \left\lceil \log_2 \left( \frac{S_{max}}{S_{min}} \right) \right\rceil \cdot \left\lfloor \log_2 \left( \frac{S_{max}}{2 S_d} \right) \right\rfloor, \tag{5.4}$$

where Lmax is the length of the longest branch in the clock distribution network. The complexity of the proposed algorithm is therefore a logarithmic function of the number of clocked elements.
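To give a feel for the bound in (5.4), the following Python sketch evaluates it for an illustrative configuration. The function name and the threshold Sd = 0.25 are assumptions; the remaining numbers (Lmax = 5mm, x = 0.25mm, Smin = 1, Smax = 32) are chosen to resemble the test circuits of Section 5.3 and are not a reported result.

```python
import math


def bist_spice_calls(n, l_max, x, s_min, s_max, s_d):
    """Upper bound on the number of SPICE invocations according to (5.4)."""
    levels = math.ceil(math.log2(n))                # (5.1): number of H-tree levels
    spacings = math.floor(l_max / x)                # candidate buffer spacings per wire
    doublings = math.ceil(math.log2(s_max / s_min)) # coarse doubling steps
    k = math.floor(math.log2(s_max / (2 * s_d)))    # (5.3): binary search steps
    return levels * spacings * doublings * k


for n in (16, 32, 64):
    calls = bist_spice_calls(n, l_max=5.0, x=0.25, s_min=1.0, s_max=32.0, s_d=0.25)
    print(f"n = {n}: at most {calls} SPICE runs")
```

Quadrupling the number of clocked elements from 16 to 64 raises the bound only from 2400 to 3600 runs in this example, which illustrates the logarithmic dependence on n.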

5.3. Experimental Results

The experimental results are given in this section. Comparisons are provided between the proposed algorithm based on non-uniform buffer spacing and progressively relaxed transition times from the leaves to the root of a clock tree versus the standard approach of uniform buffer spacing and maintaining equal transition times at all the nodes. Comparisons at the nominal process corner with uniform die temperature are given in Section 5.3.1. The clock distribution networks are characterized under temperature and process parameter variations in Section 5.3.2.

5.3.1. Comparisons at the Nominal Process Corner with Uniform Die Temperature

Three test circuits are employed. The first test circuit is a 4-level H-tree network as shown in Fig. 5.7. The second and the third test circuits are 5-level and 6-level H-tree networks, respectively. Each circuit spans a 20mm x 20mm die. The circuits are designed in a 180nm CMOS technology. The clock frequency is 1GHz. A uniform die temperature of 125 °C is assumed in this section. The input transition time at the root is 50ps. The target transition time at the leaves is 100ps. The size of the load inverters at the leaves is 2X the minimum size inverter. The largest inverter is 32X the minimum size inverter. For the minimum size inverter, the width of the NMOS (PMOS) transistor is 220nm (550nm). The minimum buffer spacing is taken as one tenth of the shortest branch. For the first circuit, the length (width) of the branches in levels 1, 2, 3, and 4 is 2.5mm (0.5µm), 2.5mm (0.5µm), 5mm (1µm), and 5mm (1µm), respectively. The parasitic resistance and capacitance are extracted for each wire segment. Each wire segment is modeled as a Π2 network.

As listed in Table 5.2, the proposed algorithm based on non-uniform buffer spacing and progressive transition time relaxation is effective for reducing the total power consumption. With the proposed techniques, the total power consumption is reduced by up to 30% due to the utilization of smaller and fewer buffers as compared to a conventional network with uniform buffer insertion and equal transition time constraints.

TABLE 5.2. EXPERIMENTAL RESULTS WITH THE PROPOSED ALGORITHM

| Number of Levels in H-tree | Design approach | Average Power Consumption (mW) | Normalized Average Power Consumption |
|---|---|---|---|
| 4 | Uniform buffer insertion / Equal transition times | 15.30 | 1.00 |
| 4 | Non-uniform buffer insertion / Gradually relaxed transition times | 10.66 | 0.70 |
| 5 | Uniform buffer insertion / Equal transition times | 21.89 | 1.00 |
| 5 | Non-uniform buffer insertion / Gradually relaxed transition times | 15.24 | 0.70 |
| 6 | Uniform buffer insertion / Equal transition times | 29.00 | 1.00 |
| 6 | Non-uniform buffer insertion / Gradually relaxed transition times | 20.89 | 0.72 |

93

5.3.2. Impact of Process and Temperature Variations on Clock Skew The impact of temperature and process parameter fluctuations on the clock skew is evaluated in this section for the clock distribution networks designed with equal transition times and uniform buffer insertion approach and the clock distribution networks designed with the proposed gradually relaxed transition times and non-uniform buffer insertion techniques. The impact of temperature and process parameter fluctuations is considered both independently and concurrently. The effect of temperature fluctuations on circuit delay is presented for both CMOS gates and wires. The propagation delay of a CMOS gate driving a load capacitance CL considering the non-zero transition times of the inputs is [46]

⎛ 1 V −V ⎞ CV t p = ⎜ − DD th ⎟ Tr −in + L DD , ⎜ 2 VDD (1 + α ) ⎟ 2 I D0 ⎝ ⎠

(5.5)

where Vth is the transistor threshold voltage, VDD is the supply voltage, α is the velocity saturation coefficient ( 1 ≤ α ≤ 2 ), Tr-in is the input transition time, and ID0 is the drain current at VGS = VDS = VDD. ID0 is a function of the carrier mobility µ and the gate overdrive voltage (VGS −Vth). The primary parameters in the delay equation that depend on temperature are the carrier mobility µ and the threshold voltage Vth. Both µ and |Vth| decrease as the temperature increases. The degradation of mobility tends to increase the delay. Alternatively, the decrease in the absolute value of the threshold voltage tends to lower the delay. The combined effect of the mobility and the threshold voltage variations on delay depends on other parameters such as the supply voltage and the input signal slope. Variations in gate overdrive are smaller as compared to carrier mobility variations when the temperature fluctuates in circuits operating at the nominal supply voltage (VDD-nominal = 1.8V) in a 180nm CMOS technology. The transistor saturation current and the circuit speed, therefore, degrade with the increased temperature. The first term in (5.5) is proportional to the input signal transition time. This term is also a function of the gate overdrive (VDD − Vth). As the input transition time increases, the significance of this term on the propagation delay is enhanced. Hence the temperature fluctuations induced variation of the gate overdrive becomes more effective on the delay variations at a higher input transition time. The increased significance of the gate overdrive on the delay leads to enhanced


counterbalancing of the mobility degradation. Therefore, the temperature fluctuations induced delay variations are suppressed with a higher input transition time.

The effect of the input transition time on the delay variations with temperature fluctuations is evaluated with the circuit shown in Fig. 5.9. The supply voltage is swept from 0.6V to 1.8V. The average propagation delay at 25 °C and 125 °C and the percent variation in the average propagation delay with temperature are measured for each value of the supply voltage. The input signal transition time is assumed to be 50ps and 160ps in the first and second set of simulations, respectively. The temperature fluctuations induced delay variation is reduced by 33% with a 160ps input transition time as compared to a 50ps input transition time at the nominal supply voltage (VDD = 1.8V) as shown in Fig. 5.9.
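The trend described above can be illustrated with a minimal numerical sketch of (5.5). The sketch assumes that ID0 is proportional to µ(T)(VDD − Vth(T))^α, a power-law mobility model (µ proportional to T^−1.5), and a linear threshold voltage drift with temperature. These models and all numerical values other than VDD (ALPHA, V_TH_25C, K_VTH, C_L, I_D0_25C) are illustrative assumptions and are not taken from this work, so the printed numbers are not expected to match Fig. 5.9.

```python
# Sketch of Eq. (5.5) with assumed temperature models (illustrative values only).
VDD = 1.8          # nominal supply voltage (V), from the text
ALPHA = 1.3        # velocity saturation coefficient (assumed, 1 <= alpha <= 2)
V_TH_25C = 0.45    # threshold voltage at 25 C (V), assumed
K_VTH = 1e-3       # threshold voltage drop per degree C (V/C), assumed
C_L = 20e-15       # load capacitance (F), assumed
I_D0_25C = 300e-6  # drain current at VGS = VDS = VDD and 25 C (A), assumed


def threshold(temp_c):
    """Assumed linear decrease of |Vth| with temperature."""
    return V_TH_25C - K_VTH * (temp_c - 25.0)


def drain_current(temp_c):
    """Assumed I_D0 ~ mobility(T) * (VDD - Vth(T))**alpha, scaled to the 25 C value."""
    mobility_ratio = ((temp_c + 273.15) / 298.15) ** -1.5
    overdrive_ratio = (VDD - threshold(temp_c)) / (VDD - V_TH_25C)
    return I_D0_25C * mobility_ratio * overdrive_ratio ** ALPHA


def delay(temp_c, t_r_in):
    """Propagation delay t_p from Eq. (5.5)."""
    v_th = threshold(temp_c)
    slope_term = (0.5 - (VDD - v_th) / (VDD * (1.0 + ALPHA))) * t_r_in
    load_term = C_L * VDD / (2.0 * drain_current(temp_c))
    return slope_term + load_term


def delay_variation_percent(t_r_in):
    """Percent change in delay when the temperature rises from 25 C to 125 C."""
    cold, hot = delay(25.0, t_r_in), delay(125.0, t_r_in)
    return 100.0 * (hot - cold) / cold


for t_r_in in (50e-12, 160e-12):
    print(f"Tr-in = {t_r_in * 1e12:.0f} ps: {delay_variation_percent(t_r_in):+.1f} %")
```

Under these assumptions the first term of (5.5) shrinks with temperature while the second term grows, so a longer input transition time gives the shrinking term more weight and lowers the net percentage variation.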

[Figure 5.9: plot of percentage delay variation (%) versus VDD (V), swept from 0.6V to 1.8V, for Tr-in = 50ps and Tr-in = 160ps, with a 1.33X annotation. Inset: test circuit with input transition time Tr-in and load capacitance CL.]

Fig. 5.9. Effect of input transition time on the temperature fluctuations induced delay variation.

In clock distribution networks, long wires with significant resistance are employed. The effect of temperature on wire resistance and delay is therefore more pronounced in clock trees. Furthermore, the widths of the wires are scaled in order to enhance the integration density with each new technology generation. The increasing resistance of the wires plays an important role in determining the performance of scaled CMOS integrated circuits in the nanometer regime. Resistance of a metal wire increases approximately linearly with the temperature according to the formula [47]

\[ R(T) = R_0 \left( 1 + k\,(T - T_0) \right), \tag{5.6} \]


where R0 is the resistance at temperature T0, T is the temperature of the wire, and k is the temperature coefficient. The signal propagation delay across a wire, therefore, increases at a higher temperature.

Each clock distribution network is characterized for clock skew caused by a non-uniform die temperature profile. The temperature profile of an IC varies over time. Four on-chip temperature profiles are considered in this section for characterizing the clock skew as shown in Fig. 5.10. Both vertical and horizontal temperature gradients are considered with a different location for the hottest spot in each temperature profile. The temperature range of each profile is 25 °C (room temperature) to 125 °C. 125 °C is reported as a typical hot spot temperature for the state of the art microprocessors fabricated in a 180nm CMOS technology [9]. For each temperature profile the maximum clock skew between the leaves of the clock distribution networks is measured as listed in Table 5.3 and as depicted in Fig. 5.11. The temperature fluctuations induced clock skew is reduced by up to 22% with the proposed gradually relaxed transition time and non-uniform buffer insertion technique as compared to the uniform buffer insertion and equal transition times approach.
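As a rough illustration of how (5.6) translates into skew between two otherwise identical branches at different temperatures, the sketch below evaluates the wire resistance and a first-order wire delay. The default temperature coefficient (approximately that of copper), the 0.38·R·C distributed-wire delay approximation, and the wire values R0 = 100 Ω and C = 200 fF are assumptions made for this example only; the wire material and dimensions are not specified here.

```python
def wire_resistance(r0, temp_c, k=0.0039, t0_c=25.0):
    """Eq. (5.6): R(T) = R0 * (1 + k * (T - T0)).

    k defaults to roughly the temperature coefficient of copper (per degree C);
    this default is an assumption, the metal is not specified in the text.
    """
    return r0 * (1.0 + k * (temp_c - t0_c))


def wire_delay(r0, c_wire, temp_c):
    """Delay of a distributed RC wire, using the common 0.38*R*C approximation."""
    return 0.38 * wire_resistance(r0, temp_c) * c_wire


# Two identical branches (hypothetical R0 = 100 ohm, C = 200 fF) at the two
# extremes of the temperature range considered in this section.
cool = wire_delay(100.0, 200e-15, 25.0)
hot = wire_delay(100.0, 200e-15, 125.0)
print(f"cool: {cool * 1e12:.2f} ps, hot: {hot * 1e12:.2f} ps, "
      f"skew contribution: {(hot - cool) * 1e12:.2f} ps")
```

The sketch covers only the wire contribution; the buffer delays along each branch also shift with temperature as described by (5.5).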

Fig. 5.10. Four different on-chip temperature profiles considered in this section.


[Figure 5.11: bar chart of clock skew (ps) versus temperature profile (profile1 to profile4) for the four-level, five-level, and six-level H-trees. Legend: uniform buffer insertion/equal transition times; non-uniform buffer insertion/unequal transition times.]

Fig. 5.11. Clock skew for the nominal process corner and different non-uniform die temperature profiles.

TABLE 5.3. CLOCK SKEW OF THE DIFFERENT CLOCK DISTRIBUTION NETWORKS WITH NON-UNIFORM TEMPERATURE PROFILES

| H-tree | Temperature Profile | Clock Skew (ps): Uniform buffer insertion, Equal transition times | Clock Skew (ps): Non-uniform buffer insertion, Gradually relaxed transition times | % Reduction |
|---|---|---|---|---|
| Four Levels | Profile1 | 37.0 | 29.6 | 20.0 |
| Four Levels | Profile2 | 51.2 | 39.7 | 22.4 |
| Four Levels | Profile3 | 38.0 | 29.6 | 22.1 |
| Four Levels | Profile4 | 51.3 | 40.0 | 22.0 |
| Five Levels | Profile1 | 24.4 | 23.7 | 2.9 |
| Five Levels | Profile2 | 37.0 | 33.5 | 9.4 |
| Five Levels | Profile3 | 27.6 | 25.3 | 8.3 |
| Five Levels | Profile4 | 37.2 | 33.8 | 9.1 |
| Six Levels | Profile1 | 33.0 | 30.6 | 7.2 |
| Six Levels | Profile2 | 44.5 | 39.7 | 10.8 |
| Six Levels | Profile3 | 33.0 | 29.2 | 11.5 |
| Six Levels | Profile4 | 44.8 | 39.9 | 10.9 |


The impact of process parameter variations on the clock skew with a uniform die temperature is evaluated next. The gate oxide thickness, the channel length, the channel doping, the transistor width, and the wire widths are assumed to have independent Gaussian distributions. The 3σ variation of the gate oxide thickness is assumed to be 5%. The 3σ variation of each of the remaining parameters is assumed to be 10%. Monte Carlo simulations with 250 samples are run to evaluate the clock skew of the clock distribution networks. The skew distributions of different H-tree networks are shown in Fig. 5.12. The mean clock skew is increased by up to 36% with the clock distribution networks designed with the proposed gradual relaxation of transition time and unequal buffer insertion approach as compared to the clock distribution networks designed with the standard equal transition time and uniform buffer insertion approach. This increase in clock skew is due to the increased dependence of the gate delays on the threshold voltage with a higher input transition time.

The impact of both process parameter variations and non-uniform die temperature is considered next. The Monte Carlo simulations are rerun in this case with the non-uniform die temperature profiles shown in Fig. 5.10. The clock skew distributions for a four-level H-tree clock distribution network are shown in Fig. 5.13. The mean clock skew of the clock network designed with the proposed gradual relaxation of transition time and non-uniform buffer insertion approach is reduced by up to 14.4% as compared to a clock distribution network designed with equal transition time and uniform buffer insertion as shown in Fig. 5.13.

[Figure 5.12: three histograms (four-level, five-level, and six-level H-trees) of number of samples versus clock skew (ps), with skew bins from 9.4ps to 47.9ps. Legend: uniform buffer insertion/equal transition times; non-uniform buffer insertion/gradually relaxed transition times.]

Fig. 5.12. Clock skew distribution under process parameter variations (uniform die temperature).
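The sampling step of such a Monte Carlo analysis can be sketched as follows. The 3σ values and the sample count are those quoted in this section; everything else is an assumption for illustration. In particular, `toy_skew` is a hypothetical stand-in for the circuit simulation and is not a physical skew model, and the sketch draws one scale factor per parameter per sample, whereas a real flow would typically perturb individual devices and wires.

```python
import random
import statistics

# Relative 3-sigma variations quoted in the text.
THREE_SIGMA = {
    "gate_oxide_thickness": 0.05,
    "channel_length": 0.10,
    "channel_doping": 0.10,
    "transistor_width": 0.10,
    "wire_width": 0.10,
}
NUM_SAMPLES = 250


def draw_sample(rng):
    """Return one set of independent Gaussian scale factors (nominal = 1.0)."""
    return {name: rng.gauss(1.0, three_sigma / 3.0)
            for name, three_sigma in THREE_SIGMA.items()}


def monte_carlo(evaluate_skew, seed=0):
    """Evaluate a user-supplied skew function on NUM_SAMPLES random samples."""
    rng = random.Random(seed)
    skews = [evaluate_skew(draw_sample(rng)) for _ in range(NUM_SAMPLES)]
    return statistics.mean(skews), statistics.stdev(skews)


def toy_skew(sample):
    """Placeholder for a circuit simulation; NOT a physical model."""
    return 20.0 * sample["channel_length"] / sample["wire_width"]


mean_skew, sigma_skew = monte_carlo(toy_skew)
print(f"mean = {mean_skew:.1f}, sigma = {sigma_skew:.1f} (arbitrary units)")
```

Dividing each 3σ figure by three gives the standard deviation passed to the Gaussian generator; a fixed seed keeps the run repeatable.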


[Figure 5.13: four histograms (Profile1 to Profile4) of number of samples versus clock skew (ps). Legend: uniform buffer insertion/equal transition times; non-uniform buffer insertion/gradually relaxed transition times.]

Fig. 5.13. Clock skew distribution of a 4-level H-tree under process parameter variations and nonuniform die temperature.

5.4. Branch and Bound Formulation

In this section a branch and bound formulation of the buffer insertion and sizing problem is presented. The branch and bound formulation allows an exhaustive search to be performed on a reduced solution space. The solution space is reduced using pruning techniques. Two pruning techniques are presented in this section. Unlike the heuristic algorithm proposed in Section 5.2, the configuration obtained using the branch and bound formulation is guaranteed to be the optimum. The runtime of the branch and bound algorithm, however, grows exponentially with the number of levels in the H-tree. The branch and bound algorithm is therefore practical only for small sized problems.

The BIST heuristic algorithm presented in Section 5.2 is a greedy algorithm. The local optimum solution for the current level is selected regardless of the effect on the remaining levels.


The size of the load inverter for the next higher level is twice the buffer size of the current level (binary tree). If a sub-optimal solution with a smaller buffer size is selected for the current level, the power consumption of the higher level will be reduced due to the smaller load buffers. A better global solution can be obtained if the reduction in the power consumption of the higher level is greater than the increase in the power consumption of the current level. To reach a global optimum, a single solution cannot be selected at each level but rather all the solutions must be selected for the current level and propagated to the next level. This creates a tree of solutions and is called branching.

The branch and bound algorithm is illustrated in Fig. 5.14. At each level every possible buffer spacing (y) is considered and the buffer sizing is performed using a binary search approach like the one employed in BISW. The tree of solutions shown in Fig. 5.14 is traversed depth first. At the last level the buffer sizes are determined for all the levels and the total power consumption for the clock distribution network is determined. The global optimum solution is determined at the end of the tree traversal by visiting all the nodes in the tree of solutions and keeping a record of the minimum total power consumption.

[Figure 5.14: tree of solutions. The nodes at Level 1, Level 2, Level 3, and so on branch into the candidate buffer spacings of that level (y11, y12, ..., y1p; y21, y22, ..., y2p; y31, y32, ..., y3p).]

Fig. 5.14. Illustration of the branch and bound algorithm.

Two pruning techniques are applied in order to avoid visiting all the nodes in the solution tree. With the first pruning technique, if a solution for the current level gives higher power consumption at a larger buffer size, this solution is not propagated to the next level to avoid incurring higher power consumption in both the current and the higher levels. With the second pruning technique, a branch in each level is initially and individually optimized with a minimum load represented by an inverter sized 2X the minimum inverter. The resulting power consumption, which represents the minimum power consumption for each level, is recorded. The minimum power consumption for the whole clock distribution network is initialized to infinity. During the solution tree traversal, if at a node the sum of the power consumption for the current level, the previous levels, and the minimum power consumption of the subsequent levels is higher than the best solution obtained so far, then this new solution is rejected and not propagated to the following levels. To implement the second pruning technique a record of the minimum total power consumption is kept.

The minimum power consumption achievable, determined by the branch and bound algorithm for a 4-level H-tree clock distribution network, is 10 mW. The power consumption with the proposed BIST heuristic algorithm is within 6.6% of the absolute minimum achievable power consumption determined by the branch and bound formulation.
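The depth-first traversal and the two pruning rules described above can be sketched on a toy problem as follows. This is a minimal sketch and not the implementation used in this work: the cost model `level_power`, the wire lengths, the candidate spacings, and the load values are all hypothetical, and a closed-form expression replaces the simulation-based binary search for the buffer size. Both pruning rules are valid here only because the toy power and buffer size never decrease when the load increases.

```python
import itertools
import math

# Hypothetical toy model: levels are indexed from the leaves (0) to the root.
WIRE_LENGTHS = [1.0, 2.0, 4.0, 8.0]   # branch wire length at each level (made up)
SPACINGS = [0.5, 1.0, 2.0, 4.0]       # candidate buffer spacings y (made up)
MIN_LOAD = 2.0                        # inverter sized 2X the minimum inverter
LEAF_LOAD = 4.0                       # load at the leaves (made up, >= MIN_LOAD)


def level_power(level, spacing, load):
    """Stand-in for simulation-based sizing: return (power, buffer_size)."""
    buffers_per_branch = math.ceil(WIRE_LENGTHS[level] / spacing)
    size = max(1.0, 0.8 * spacing + 0.3 * load)   # grows with spacing and load
    branches = 2 ** (len(WIRE_LENGTHS) - 1 - level)
    return branches * buffers_per_branch * size, size


def non_dominated(options):
    """First pruning rule: drop options with higher power at a larger buffer size."""
    return [o for o in options
            if not any(p[0] <= o[0] and p[1] <= o[1] and p[:2] != o[:2]
                       for p in options)]


def branch_and_bound():
    n = len(WIRE_LENGTHS)
    # Lower bound per level: its minimum power when driving the minimum load.
    floor = [min(level_power(i, y, MIN_LOAD)[0] for y in SPACINGS) for i in range(n)]
    best = {"power": math.inf, "spacings": None}

    def visit(level, load, power_so_far, chosen):
        options = [level_power(level, y, load) + (y,) for y in SPACINGS]
        for power, size, y in non_dominated(options):
            total = power_so_far + power
            # Second pruning rule: bound by the best complete solution so far.
            if total + sum(floor[level + 1:]) >= best["power"]:
                continue
            if level == n - 1:
                best["power"], best["spacings"] = total, chosen + [y]
            else:
                # Binary tree: the next level drives two buffers of this size.
                visit(level + 1, 2.0 * size, total, chosen + [y])

    visit(0, LEAF_LOAD, 0.0, [])
    return best["power"], best["spacings"]


def exhaustive():
    """Reference: minimum power over every combination of spacings."""
    best = math.inf
    for combo in itertools.product(SPACINGS, repeat=len(WIRE_LENGTHS)):
        load, total = LEAF_LOAD, 0.0
        for level, y in enumerate(combo):
            power, size = level_power(level, y, load)
            total += power
            load = 2.0 * size
        best = min(best, total)
    return best


print(branch_and_bound(), exhaustive())
```

Comparing the result against the exhaustive reference is a simple way to check that the pruning rules do not discard the optimum for a given cost model.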

5.5. Chapter Summary

In this chapter a new heuristic algorithm is proposed for buffer sizing and insertion for minimizing the total power consumption while satisfying the transition time constraints at the leaves of an H-tree clock distribution network. The algorithm employs non-uniform buffer insertion and progressive relaxation of the transition time requirements from the leaves to the root of a clock tree in order to reduce the total power consumption. Up to 30% reduction in the power consumption is achieved without increasing the clock skew by applying these two techniques simultaneously to a clock distribution network. The algorithm is based on time-domain SPICE simulation instead of relying on simple and inaccurate circuit models. The proposed algorithm exploits the symmetry in the H-tree clock distribution network in order to reduce the runtime complexity. The runtime complexity of the algorithm grows logarithmically with the number of clocked elements.


Chapter 6 Dynamic Wordline Voltage Swing for Low Leakage and Stable Static Memory Banks

The amount of embedded SRAM in modern microprocessors and systems-on-chips (SoCs) increases to meet the performance requirements in each new technology generation [50]. Lower voltages and smaller devices cause a significant degradation in SRAM cell data stability with the scaling of CMOS technology. In addition to the data stability issues, SRAM arrays are also an important source of leakage due to the enormous number of transistors in the memory banks. The development of an SRAM cell that can provide higher data stability and lower leakage power is therefore highly desirable.

A conventional 6T SRAM cell in a 65 nm CMOS technology is shown in Fig. 6.1. SRAM data stability is characterized by the hold stability during a read operation. In a conventional 6T SRAM cell, the data storage nodes are directly accessed through the pass transistors connected to the bitlines. The storage nodes are disturbed due to the voltage division between the cross-coupled inverters and the access transistors during a read operation. The data is most vulnerable to external noise during this intrinsic disturbance produced by the direct data-read-access mechanism of a standard 6T SRAM circuit (destructive read) [54].

There are strict constraints on the sizing of transistors to be able to maintain the data stability and functionality of a standard 6T SRAM cell. In order to maintain the read stability, N1 and N2 (Fig. 6.1) must be stronger as compared to the access transistors N3 and N4. Alternatively, for write ability, N3 and N4 must be stronger as compared to P1 and P2. These requirements are traditionally satisfied by careful transistor sizing, as illustrated in Fig. 6.1. The stability of a 6T SRAM cell is characterized by the ratio (β) of the size of the pull-down transistors to the access transistors. Higher β leads to enhanced data stability at the expense of increased leakage power and larger cell area [54]. β is typically in the range of 2 to 3 for achieving sufficient data stability in a noisy on-chip environment [53].

In addition to the data stability issues, the increasing leakage energy consumption of on-chip caches is another growing concern. In modern high performance integrated circuits (ICs), more than 40% of the total active mode energy is consumed due to leakage currents [12], [51]. As more transistors are crammed onto ICs with each new technology generation, leakage energy will soon dominate the total active mode energy consumption. Furthermore, leakage is the only source of energy consumption in an idle circuit. SRAM cells are important sources of leakage since the majority of transistors are utilized for on-chip memory caches in today’s high performance ICs. The development of new low leakage and robust memory circuit techniques is, therefore, highly desirable.

[Figure 6.1: schematic of the 6T SRAM cell with bitlines BL and BLB, wordline WL, and storage nodes Node1 and Node2. P1, P2, N3, and N4 are sized 65 / 65; N1 and N2 are sized β*65 / 65.]

Fig. 6.1. A standard 6T SRAM cell in a 65nm CMOS technology. The size of each transistor is given as W/L. W: transistor width (nm). L: transistor channel length (nm). The β is typically in the range of 2 to 3 for data stability.

A new 6T SRAM circuit technique is proposed in this chapter for reducing the leakage power consumption and enhancing the data stability [60]. With the proposed circuit technique, the voltage swing of the wordlines is dynamically adjusted during read and write operations. For a read operation, the voltage swing of the wordlines is reduced in order to suppress the intrinsic data disturbance induced by the direct-storage-node-access mechanism of the standard 6T SRAM cell architecture. The data stability is thereby enhanced without the need to increase the size of the pull-down transistors (β = 1) with the proposed technique. Alternatively, during a write operation, the wordlines are driven with a full-voltage swing signal in order to enhance the strength of the bitline access transistors. Write-ability is thereby achieved with a high write margin. A new wordline driver is introduced in this chapter to achieve the dynamic adjustment of the wordline voltage swing with the proposed circuit technique.


The chapter is organized as follows. The new SRAM circuit technique and the wordline driver are presented in Section 6.1. The new and the standard SRAM circuits are compared at the nominal design corner and under process variations in Section 6.2. Finally, conclusions are offered in Section 6.3.

6.1. The Proposed 6T SRAM Circuit Technique

The new 6T SRAM circuit technique is presented in this section. A new wordline driver is described for dynamically adjusting the voltage swing of the wordline signal during the read and write operations. The circuit schematic of the proposed wordline driver is shown in Fig. 6.2. The wordline driver is formed by cascaded inverters. Two extra transistors (P3 and P4) are added in the last stage inverter as shown in Fig. 6.2. The wordline driver has two modes of operation: the reduced-voltage-swing mode and the full-voltage-swing mode. The operation mode is determined by the “Read” signal. During a read operation the “Read” signal is connected to VDD. P3 is turned off. A threshold voltage drop (|Vtp|) is observed across P4. The voltage swing of the driver output is thereby reduced by |Vtp|. After WLin transitions to VDD, WL rises only up to VDD - |Vtp| for achieving data stability during a read operation. Alternatively, during a write operation, the “Read” signal is connected to GND. P3 is turned on. After WLin is asserted, WL rises all the way up to VDD for achieving write ability with the proposed technique.

[Figure 6.2: schematic of the wordline driver. Labeled devices and signals: P3, P4, P5, N5, Read, WLin, WL, and VDD.]

Fig. 6.2. The schematic of the proposed variable voltage swing wordline driver.


The operation of a minimum sized 6T SRAM cell (β = 1) with the proposed wordline driver (Fig. 6.2) is described next. Prior to a read operation, the bitlines are pre-charged to VDD. The “Read” signal is maintained at VDD. WLin transitions to VDD to start the read operation. The wordline driver operates in the reduced-voltage-swing mode. The WL transitions to VDD - |Vtp|, thereby weakly activating the bitline access transistors of the addressed memory cell. Provided that Node1 (in the SRAM cell) stores “0”, BL is discharged through N3 and N1. Alternatively, provided that Node2 (in the SRAM cell) stores “0”, BLB is discharged through N4 and N2. After a 200mV differential voltage is developed between the bitlines, WLin transitions to VGND. The sense amplifier is enabled to detect the bitline differential voltage and produce a full voltage swing output. The access transistors are weakened (VG = VDD - |Vtp|) during the read operation due to the reduced voltage swing of the WL signal. The voltage disturbance at the data storage nodes during a read operation is suppressed, thereby enhancing the data stability without increasing the size of the pull-down transistors with the proposed technique (N1 and N2 are minimum sized, β = 1).

Both bitlines are periodically pre-charged to VDD. Prior to a write operation, one of the bitlines is selectively discharged to VGND depending on the data to be written into the SRAM cell. In order to start the write operation, the WLin transitions to VDD. During a write operation the “Read” signal is maintained at 0V. The wordline driver therefore operates in the full-voltage-swing mode. WL transitions to VDD. Data is forced into the SRAM cell through the access transistors. The access transistors are strongly turned on (VG = VDD) during the write operation due to the full voltage swing WL signal. The write-ability is thereby achieved with a high write margin with the proposed dynamic wordline voltage swing memory technique.
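The two operating modes can be summarized with an idealized steady-state behavioral sketch. The supply and threshold voltages are the nominal values used for the simulations in Section 6.2 (VDD = 1V, Vtn = |Vtp| = 0.22V); the function names are hypothetical, and the model deliberately ignores body effect, subthreshold conduction, and transient behavior.

```python
VDD = 1.0   # supply voltage (V), nominal value from Section 6.2
VTP = 0.22  # |Vtp| (V), nominal value from Section 6.2


def wordline_voltage(wl_in_high, read):
    """Idealized steady-state WL level of the variable-swing driver."""
    if not wl_in_high:
        return 0.0            # wordline deasserted
    if read:
        return VDD - VTP      # "Read" = VDD: P3 off, |Vtp| drop across P4
    return VDD                # "Read" = GND: P3 on, full swing


def access_gate_overdrive(wl_in_high, read, vtn=0.22):
    """Gate overdrive of an access transistor with its source at 0 V."""
    return max(0.0, wordline_voltage(wl_in_high, read) - vtn)


for mode, read in (("read", True), ("write", False)):
    print(f"{mode}: WL = {wordline_voltage(True, read):.2f} V, "
          f"overdrive = {access_gate_overdrive(True, read):.2f} V")
```

The lower overdrive in the read mode is what weakens the access transistors relative to the pull-down transistors without changing any transistor size.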

6.2. Simulation Results and Area Comparison

The read stability, leakage power consumption, read and write delays, write margin, read and write power consumption, and layout area of the standard full-voltage-swing 6T SRAM cells (with β = 1, 2, and 3) and the proposed dynamic wordline voltage swing 6T SRAM cell (β = 1) integrated with the new wordline driver are compared in this section. 256-bit × 128-bit memory arrays are designed to operate at a clock frequency of 1 GHz. The SRAM circuits are simulated in a 65nm CMOS technology (Vtn = |Vtp| = 0.22V and VDD = 1V) for the nominal design corner and under process parameter variations. Data are measured at 70°C. The standard full-voltage-swing SRAM circuits with β = 1, β = 2, and β = 3 are denoted with ST1, ST2, and ST3, respectively.

6.2.1. Data Stability

Static noise margin (SNM) is the metric used in this section to characterize the read stability of the SRAM cells. The SNM is the minimum DC noise voltage necessary to flip the state of an SRAM cell [52]. The read SNM of the SRAM cells is depicted in Fig. 6.3. When Node1 of the 6T SRAM cell is at VDD, Node2 rises to a higher steady state voltage due to the voltage division between the access transistor and the pull-down transistor in the inverter during a read operation. Data stored in the 6T SRAM cell is highly vulnerable to noise due to the raised Node2 voltage during a read operation. With the proposed technique the access transistors are intentionally weakened to reduce the voltage disturbance at the data storage nodes during a read operation. The data stability is thereby enhanced with the dynamic wordline voltage swing technique. As illustrated in Fig. 6.3, the read SNM of the proposed SRAM circuit is enhanced by 122%, 37%, and 19% as compared to ST1, ST2, and ST3, respectively.

[Figure 6.3: bar chart of SNM (mV), on a 0 to 300 mV axis, for ST1, ST2, ST3, and the proposed circuit.]

Fig. 6.3. Read static noise margin of the standard full-voltage-swing and the proposed dynamic wordline voltage swing SRAM circuits.


6.2.2. Leakage Power Consumption

The leakage power consumption of the SRAM circuits is shown in Fig. 6.4. The leakage power of an SRAM cell is determined by the total effective transistor width that produces the leakage current. All the transistors with the new dynamic wordline voltage swing technique are minimum sized, thereby producing the lowest leakage current. Transistor sizing for enhanced data stability comes at a cost of significant additional leakage power with the standard full-voltage-swing circuits ST2 and ST3, as illustrated in Fig. 6.4. The leakage power consumed by the proposed SRAM circuit is 34% and 51% lower as compared to ST2 and ST3, respectively.

[Figure 6.4: bar chart of leakage power (nW), on a 0 to 60 nW axis, for ST1, ST2, ST3, and the proposed circuit.]

Fig. 6.4. Leakage power consumption of the standard full-voltage-swing and the proposed dynamic wordline voltage swing SRAM circuits.

6.2.3. Area Comparison

The thin cell layouts [55] of the 6T SRAM cells are shown in Fig. 6.5. Another side-effect of relying on transistor sizing to achieve data stability is a significant increase in the memory area as shown in Fig. 6.5. The area of the 6T SRAM cell with β = 3 is 10.5% larger than the area of the cell with the proposed dynamic wordline voltage swing technique with minimum size transistors. The 6T SRAM cells with β = 1 and β = 2 have the smallest area. Note that the cell area does not increase when β is increased from 1 to 2 since the size of a contact is 130nm x 130nm as shown in Fig. 6.5. The transistors P3 and P4 in the proposed wordline driver (see Fig. 6.2) can be shared between the wordline drivers of the different rows in an SRAM array. Hence, the area overhead due to these transistors is small.

[Figure 6.5: thin cell layouts in three panels, (a), (b), and (c), each with the VDD, VGND, BL, BLB, and WL connections labeled.]

Fig. 6.5. The layouts of the SRAM cells. (a) β = 1. (b) β = 2. (c) β = 3. The area of each cell is determined by a dashed rectangle.


6.2.4. Active Mode Power and Access Speed

The junction and oxide capacitances of the access transistors attached to the bitlines are extracted for each SRAM cell. The lengths of the bitlines and the wordlines are estimated based on the cell layout dimensions. Π-type RC networks that represent the bitline and the wordline parasitics are attached to each SRAM circuit. The simulation results for the access delay and the access power consumption are shown in Figs. 6.6 and 6.7, respectively. The read delay is measured from the time the WLin signal rises to VDD/2 (from 0V) until a 200mV voltage difference is developed between the bitlines. The write delay is measured from the time the WLin signal rises to VDD/2 (from 0V) until the storage node is discharged to VDD/2 (from an initial voltage of VDD). The read and write power consumptions include the power consumed to pre-charge the bitlines, the power consumption of the wordline drivers, and the power consumed by the SRAM cells. The read delay is increased by 35% due to the weaker bitline access transistors during a read operation with the proposed technique as compared to ST1. Alternatively, the write delays of the proposed SRAM circuit and ST1 are similar. The write delay of the proposed SRAM circuit is reduced by 11.5% as compared to ST3 due to the shorter wordlines and the smaller internal node parasitic capacitances.

Writing to an SRAM cell is achieved by discharging one of the bitlines to ground. Successful writing to an SRAM cell can also be achieved with a voltage higher than 0V on the discharged bitline (incomplete/partial bitline discharge). The write margin is the maximum voltage of the discharged bitline that achieves a successful transfer of a “0” into the 6T SRAM cell [56]. The content of an SRAM cell with a higher write margin is easier to modify. The write margins of the SRAM circuits are listed in Table 6.1. The write margin of the proposed SRAM cell is 32% and 64% higher as compared to ST2 and ST3, respectively.
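The read delay definition given in this section (from WLin crossing VDD/2 until a 200mV bitline difference develops) can be applied to sampled waveforms as sketched below. The helper names and the waveform samples are hypothetical and purely illustrative; they are not simulation data from this work. The write delay can be extracted in the same way by locating the time at which the storage node voltage crosses VDD/2.

```python
def crossing_time(times, values, level):
    """Return the first time `values` crosses `level`, by linear interpolation."""
    for (t0, v0), (t1, v1) in zip(zip(times, values), zip(times[1:], values[1:])):
        if (v0 - level) * (v1 - level) <= 0 and v0 != v1:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the requested level")


def read_delay(times, wl_in, bl, blb, vdd=1.0, v_diff=0.2):
    """Time from WLin crossing VDD/2 until |BL - BLB| reaches v_diff."""
    start = crossing_time(times, wl_in, vdd / 2.0)
    diff = [abs(a - b) for a, b in zip(bl, blb)]
    return crossing_time(times, diff, v_diff) - start


# Synthetic, made-up waveforms (times in ps, voltages in V).
t = [0, 100, 200, 300, 400, 500]
wl_in = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
bl = [1.0, 1.0, 0.875, 0.75, 0.625, 0.5]
blb = [1.0] * 6
print(f"read delay = {read_delay(t, wl_in, bl, blb):.1f} ps")
```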


[Figure 6.6: bar chart of read delay and write delay (ps), on a 0 to 800 ps axis, for ST1, ST2, ST3, and the proposed circuit.]

Fig. 6.6. Comparison of the access delay of the standard and the proposed SRAM circuits.

TABLE 6.1. WRITE MARGIN OF THE SRAM CELLS

| SRAM Cell | Write Margin (mV) |
|---|---|
| ST1 | 410 |
| ST2 | 310 |
| ST3 | 250 |
| Proposed | 410 |
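The relative write margin improvements quoted in the text follow directly from Table 6.1, as the short check below shows. The function name is hypothetical; the numbers are the table entries.

```python
# Write margins from Table 6.1 (mV).
WRITE_MARGIN_MV = {"ST1": 410, "ST2": 310, "ST3": 250, "Proposed": 410}


def improvement_percent(reference):
    """Relative write margin gain of the proposed cell over a reference cell."""
    proposed = WRITE_MARGIN_MV["Proposed"]
    return 100.0 * (proposed - WRITE_MARGIN_MV[reference]) / WRITE_MARGIN_MV[reference]


for cell in ("ST1", "ST2", "ST3"):
    print(f"{cell}: {improvement_percent(cell):.0f} %")
```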

The proposed SRAM circuit consumes the lowest power during a read operation due to the reduced voltage swing of the WL signal and the smaller cell parasitic capacitances. The read power is reduced by up to 7.5% with the proposed SRAM circuit as compared to the standard SRAM circuits as shown in Fig. 6.7. Due to the weaker pull-up network of the last stage in the proposed wordline driver as compared to a standard inverter (see Fig. 6.2), the sizes of the pull-up network transistors are increased with the proposed wordline driver. The write power is therefore increased by 7.6% with the proposed SRAM circuit as compared to ST1. The write power of the proposed SRAM circuit is, however, reduced by 7% as compared to ST3 due to the shorter wordlines and the smaller internal node parasitic capacitances.


[Figure 6.7: bar chart of read power and write power consumption (uW), on a 0 to 80 uW axis, for ST1, ST2, ST3, and the proposed circuit.]

Fig. 6.7. Comparison of the read and write power consumption of the standard and the proposed SRAM circuits.

6.2.5. Process Variations

In this section, the read stability and the leakage power variations of the SRAM circuits due to process fluctuations in the gate length, the width, and the threshold voltage of the transistors are evaluated. The channel length, the transistor width, and the threshold voltage are assumed to have Gaussian distributions. Each parameter is assumed to have a three sigma (3σ) variation of 10% [54]. Monte Carlo simulations with 10000 samples are run to evaluate the read stability and the leakage power distributions. The distributions of the leakage power and the SNM of the memory circuits are depicted in Figs. 6.8 and 6.9, respectively. The proposed dynamic wordline voltage swing SRAM circuit significantly reduces the leakage power and enhances the data stability as compared to the standard full-voltage-swing SRAM circuits under process fluctuations. With the proposed SRAM circuit the mean (standard deviation) of the leakage power is reduced by 33% (28%) and 49.6% (46.4%) as compared to ST2 and ST3, respectively. Furthermore, with the proposed SRAM circuit, the mean of the SNM is enhanced by 124%, 38%, and 19% as compared to ST1, ST2, and ST3, respectively.

The leakage power distributions of the proposed SRAM circuit and ST2 intersect at 30.8nW. With the dynamic wordline voltage swing circuit technique, 71.5% of the statistical samples consume less than 30.8nW of leakage power. Alternatively, with ST2, 95.7% of the statistical samples consume more than 30.8nW of leakage power as illustrated in Fig. 6.8. The leakage power distributions of the proposed SRAM cell and ST3 intersect at 38nW. With the dynamic wordline voltage swing circuit technique, 89.7% of the statistical samples consume less than 38nW of leakage power. Alternatively, with ST3, 99.4% of the statistical samples consume more than 38nW of leakage power.

[Figure 6.8: histograms of number of samples versus leakage power (nW), on a 0 to 140 nW axis, for ST1/Proposed, ST2, and ST3. Annotations: 30.8nW with 71.5% and 95.7%; 38nW with 89.7% and 99.4%.]

Fig. 6.8. Statistical leakage power distributions of the standard and the proposed SRAM circuits.

[Figure 6.9: histograms of number of samples versus static noise margin (mV), on a 0 to 300 mV axis, for ST1, ST2, ST3, and the proposed circuit. Annotations: 147mV, 194mV, 85.9%, 87.5%, 99.85%, and 99.89%.]

Fig. 6.9. Statistical SNM distributions of the standard and the proposed SRAM circuits.


6.3. Chapter Summary

A new memory circuit technique with a variable voltage swing wordline driver is proposed in this chapter for achieving minimum sized 6T SRAM cells with reduced leakage power consumption and enhanced data stability. The new wordline driver provides either a reduced-voltage-swing or a full-voltage-swing cell access control signal depending on the mode of operation. During a read operation, the wordline driver operates in the reduced-voltage-swing mode in order to weaken the bitline access transistors. The voltage disturbance at the storage nodes is reduced, thereby enhancing the data stability without the need to increase the size of the transistors in the cross-coupled inverters. Alternatively, during a write operation, the wordline is driven by a full-voltage swing signal in order to enhance the strength of the bitline access transistors. Write-ability is thereby achieved with a high write margin. With the proposed dynamic wordline voltage swing SRAM circuit technique the static noise margin is enhanced by 122% as compared to the standard full-voltage-swing SRAM circuit with the same transistor sizes. Furthermore, the leakage power is reduced by up to 51% and the write margin is enhanced by up to 64% with the proposed technique as compared to the standard full-voltage-swing SRAM circuits sized for data stability. The advantages of the dynamic wordline voltage swing SRAM circuit technique are also verified under process parameter variations.


Chapter 7 Low Power and Robust 7T Dual-Vt SRAM Circuit

The robustness of an SRAM cell is characterized by the hold stability during a read operation. In a conventional 6T SRAM cell, the data storage nodes are directly accessed through the pass transistors connected to the bitlines. The storage nodes are disturbed due to the voltage division between the cross-coupled inverters and the access transistors during a read operation. The data is most vulnerable to external noise during a read operation due to this intrinsic disturbance produced by the direct data-read-access mechanism of a standard 6T SRAM circuit (destructive read) as described in Chapter 6. The design of a 6T SRAM cell is typically characterized by the ratio (β) of the size of the pull-down transistors to the access transistors [53]. In order to maintain the read stability, N1 and N2 (Fig. 7.1) must be stronger as compared to the access transistors N3 and N4. Alternatively, for write ability, N3 and N4 must be stronger as compared to P1 and P2. These requirements are satisfied with careful transistor sizing, as illustrated in Fig. 7.1.

[Figure 7.1: schematic of the standard 6T SRAM cell. Labeled nets: VDD, BL, BLB, WL, Node1, Node2. Labeled sizes (W/L): N1 and N2, β*65/65; P1, P2, N3, and N4, 65/65.]

Fig. 7.1. A standard 6T SRAM cell in a 65nm CMOS technology. The size of each transistor is given as W/L. W: transistor width in nanometer. L: transistor channel length in nanometer. For data stability: β ≥ 2.


The increasing leakage energy consumption of the embedded memory circuits is a growing concern. In modern high performance microprocessors, more than 40% of the total active mode energy is consumed due to leakage currents [51]. Memory arrays are an important source of leakage since the majority of transistors are utilized for on-chip caches in today’s high performance microprocessors. Furthermore, the speed requirement of memory caches particularly the first level is also increasing with technology scaling. The development of a high read speed low leakage SRAM cell with enhanced data stability is therefore highly desirable. A new seven transistor (7T) dual-threshold-voltage SRAM cell [61] is proposed in this chapter for simultaneously reducing the active and standby mode power consumption while enhancing the data stability and the circuit speed. With the proposed SRAM cell, the data storage nodes are isolated from the bitlines during a read operation. The data stability is thereby significantly enhanced by up to 87% as compared to a standard 6T SRAM cell. Furthermore, minimum sized high-threshold voltage transistors are employed in the cross-coupled inverters in order to significantly reduce the leakage power by up to 66% as compared to the standard 6T SRAM circuits. New data is written to the proposed SRAM cell using a single pass transistor and a single bitline in order to reduce the circuit area and the power consumption. Successful transfer of both “0” and “1” to the proposed memory cell is achieved by the asymmetrical dual threshold voltage (dual-Vt) design of the cross-coupled inverters. The chapter is organized as follows. The new 7T SRAM circuit is presented in Section 7.1. The new 7T SRAM cell, a previously published 8T SRAM cell, and the standard 6T SRAM cells are compared for area, data stability, access delay, and power consumption under process parameter variations in Section 7.2. Finally, conclusions are offered in Section 7.3.

7.1 The Proposed 7T Dual-Vt SRAM Cell The new 7T dual-Vt SRAM cell is presented in this section. The circuit schematic of the proposed SRAM cell with transistors sized for a 65nm CMOS technology is shown in Fig. 7.2. The cross-coupled inverters formed by the transistors N1, P1, N2, and P2 store a single bit of information. The write bitline WBL and the pass transistor N3 are used for transferring new data into the cell. Alternatively, the read bitline RBL and the transistor stack formed by N4 and N5 are


used for reading data from the cell. Two separate control signals R and W are used for controlling the read and the write operations, respectively, with the proposed circuit as shown in Fig. 7.2. Prior to a read operation, the RBL is pre-charged to VDD. To start the read operation, the read signal R transitions to VDD while the write signal W is maintained at VGND. If a “1” is stored at Node1, RBL is discharged through the transistor stack formed by N4 and N5. Alternatively, if a “0” is stored at Node1 RBL is maintained at VDD. The storage nodes (Node1 and Node2) are completely isolated from the bitlines during a read operation. The data stability is thereby significantly enhanced as compared to the standard 6T SRAM cells. The RBL is conditionally discharged through the N4-N5 stack during a read operation. The transistors of the cross-coupled inverters are not on the read-delay-path. The transistor sizing of the dual-Vt cross-coupled inverters therefore does not affect the read speed of the proposed SRAM cell.

[Figure 7.2: schematic of the proposed 7T dual-Vt SRAM cell. Labeled nets: VDD, WBL, RBL, W, R, Node1, Node2. Labeled sizes (W/L): N1, N2, N3, P1, and P2, 65/65; N4 and N5, 130/65.]

Fig. 7.2. The schematic of the proposed 7T dual-Vt SRAM circuit in a 65nm CMOS technology. The size of each transistor is given as W/L. W: transistor width in nanometer. L: transistor channel length in nanometer. Thick line in the channel area indicates a high-Vt transistor.

Prior to a write operation the WBL is charged (discharged) to VDD (VGND) to get ready to force a “1” (“0”) onto Node1. To start the write operation, the write signal W transitions to VDD while the read signal R is maintained at VGND. The data is forced onto Node1 through the low-Vt bitline access transistor N3. The following design constraints are imposed in order to achieve


write ability with the proposed SRAM cell. To be able to write a “0” onto Node1, the pass transistor N3 must be stronger as compared to the pull-up transistor P1. Alternatively, to be able to write a “1” onto Node1, the pass transistor N3 must be stronger as compared to N1. Furthermore, since N3 transfers a degraded “1” (due to the Vt drop across the N-type access transistor), the inverter formed by N2 and P2 is required to have a low switching threshold voltage (low-skew inverter) that assists the transfer of a full “1” onto Node1. These design requirements are achieved by employing dual-Vt transistors within the cross-coupled inverters (high-Vt transistors N1, P1, and P2 and a low-Vt transistor N2), as shown in Fig. 7.2. The use of minimum sized high-Vt transistors reduces the leakage power without degrading the read speed with the proposed technique.
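The read and write sequences described above can be summarized with a small logic-level sketch. It is only an illustration under simplifying assumptions: voltages are reduced to 0/1, the class and method names are hypothetical, and the sizing and threshold-voltage constraints that make a write succeed electrically are not modeled.

```python
from dataclasses import dataclass


@dataclass
class SevenTCell:
    """Logic-level sketch of the 7T cell (illustrative names, not from the source)."""

    node1: int = 0  # bit stored at Node1; Node2 holds the complement

    @property
    def node2(self) -> int:
        return 1 - self.node1

    def write(self, wbl: int, w: int = 1, r: int = 0) -> None:
        # W = 1 with R = 0 turns on N3, which forces the WBL value onto Node1.
        if w == 1 and r == 0:
            self.node1 = wbl

    def read(self, r: int = 1, w: int = 0) -> int:
        """Return the RBL level after a read; RBL is precharged to 1 (VDD)."""
        rbl = 1
        # RBL discharges only when both stacked transistors conduct:
        # N4 is gated by R and N5 is gated by Node1.
        if r == 1 and w == 0 and self.node1 == 1:
            rbl = 0
        return rbl


cell = SevenTCell()
cell.write(wbl=1)
print(cell.read(), cell.node1)  # RBL level and the (undisturbed) stored bit
```

Note that the RBL level is the complement of the stored bit, and that the read path never drives Node1 or Node2, which mirrors the isolation argument in the text.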

7.2. Simulation Results and Circuit Layouts The read stability, the leakage power consumption, the read and write delays, the write margin, the read and write power consumptions, and the layout area of the standard 6T SRAM cells (with β = 2, 3, and 4), a previously published 8T SRAM cell [53], and the proposed 7T SRAM cell are compared in this section. 256-bit × 128-bit memory arrays are designed to operate at a clock frequency of 1 GHz with the standard 6T, the 8T, and the proposed 7T SRAM cells in a 65nm CMOS technology (Vtn = |Vtp| = 0.22V, Vtn-high = |Vtp-high| = 0.42V, and VDD = 1V). Data are measured at 70°C.

7.2.1. Data Stability Static noise margin (SNM) is the metric used in this chapter to characterize the read stability of the SRAM cells. The SNM is the minimum DC noise voltage necessary to flip the state of an SRAM cell [52]. The read SNMs of the SRAM cells are compared in Fig. 7.3. When Node1 of the 7T SRAM cell is at VDD, Node2 is maintained strictly at 0V due to the complete decoupling of the data storage nodes from the bitlines during a read operation. Alternatively, when Node1 of a standard 6T SRAM cell is at VDD, Node2 rises to a higher steady state voltage due to the voltage division between the access transistor and the pull-down transistor in the inverter during a read operation. Data stored in the 6T SRAM cell is most vulnerable to noise during a read access due to the already disturbed (raised) Node2 voltage. Alternatively, with the proposed technique, decoupling the data from the bit lines significantly enhances the cell


stability. As illustrated in Fig. 7.3, the read SNM of the 7T SRAM cell is 87% and 52% higher as compared to the 6T SRAM cells with β = 2 and β = 4, respectively. The SNM of the 7T SRAM cell is slightly lower as compared to the 8T SRAM cell due to the asymmetrical design of the dual-Vt cross-coupled inverters with the proposed technique.

[Figure 7.3: bar chart of the read SNM (mV, axis from 0 to 400) for the 6T (β = 2), 6T (β = 3), 6T (β = 4), 8T, and 7T SRAM cells.]

Fig. 7.3. Read static noise margins of the SRAM cells.

7.2.2. Leakage Power Consumption The transistors of the cross-coupled inverters are sized minimum and these transistors have high threshold voltage except for N2 (see Fig. 7.2) in the 7T SRAM cell. The leakage power consumption is therefore significantly suppressed with the proposed technique. Unlike the 6T SRAM cell, the use of dual-Vt transistors in the cross-coupled inverters of the 7T SRAM cell doesn’t degrade the read speed since the high-Vt transistors are not on the read delay path with the proposed technique. The transistor stack formed by N4 and N5 in the 7T SRAM cell has two different states in the standby mode depending on the stored data. If Node1 stores “0” both N4 and N5 are cut-off, thereby reducing the leakage current due to the stack effect [1]. Alternatively, if Node1 stores “1” N5 is turned on and N4 is cut-off. The drain-to-source voltage of N4 is however smaller than VDD due to the threshold voltage drop across N5. The leakage current produced by N4 is therefore reduced as compared to a circuit in which N4 and N5 are interchanged (e.g. see the 8T SRAM cell presented in [53]).


The proposed 7T SRAM cell lowers the leakage current as compared to the standard 6T SRAM cells and the 8T SRAM cell. The leakage power consumption of the 7T SRAM cell is 43%, 66%, and 46% lower as compared to a 6T SRAM cell with β = 2, a 6T SRAM cell with β = 4, and the 8T SRAM cell, respectively, as illustrated in Fig. 7.4.

[Figure 7.4: bar chart of the leakage power consumption (nW, axis from 0 to 70) for the 6T (β = 2), 6T (β = 3), 6T (β = 4), 8T, and 7T SRAM cells.]

Fig. 7.4. Average leakage power consumption of the SRAM cells.

7.2.3 Area Comparison

The thin cell layouts [55] of the 6T, the 8T, and the proposed 7T SRAM cells are shown in Figs. 7.5-7.9. The area of the proposed 7T SRAM cell is 15.8% larger than the 6T SRAM cell with β = 2. The area of the 7T SRAM cell is 4% and 12% smaller as compared to the 6T SRAM cell (with β = 4) and the 8T SRAM cell, respectively.

[Figures 7.5-7.9: cell layout images. Labeled nets: VDD, VGND, BL, BLB, and WL for the 6T cells; VDD, VGND, WBL, WBLB, RBL, W, and R for the 8T cell; VDD, VGND, WBL, RBL, W, and R for the 7T cell.]

Fig. 7.5. The layout of a 6T SRAM cell with β = 2. Area = 0.62 µm².

Fig. 7.6. The layout of a 6T SRAM cell with β = 3. Area = 0.688 µm².

Fig. 7.7. The layout of a 6T SRAM cell with β = 4. Area = 0.754 µm².

Fig. 7.8. The layout of an 8T SRAM cell. Area = 0.82 µm².

Fig. 7.9. The layout of the proposed dual-Vt 7T SRAM cell. Area = 0.72 µm².
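The relative area figures quoted in this section can be checked against the cell areas given in the captions of Figs. 7.5-7.9. The sketch below is a plain arithmetic check; the dictionary keys and the function name are hypothetical. Because the caption areas are rounded, the result for the 6T cell with β = 2 comes out near 16% rather than exactly the quoted 15.8%.

```python
# Cell areas in square micrometers, copied from the captions of Figs. 7.5-7.9.
AREAS_UM2 = {
    "6T_beta2": 0.62,
    "6T_beta3": 0.688,
    "6T_beta4": 0.754,
    "8T": 0.82,
    "7T": 0.72,
}


def area_change_percent(cell: str, reference: str) -> float:
    """Percent change in area of `cell` relative to `reference` (positive = larger)."""
    ref = AREAS_UM2[reference]
    return 100.0 * (AREAS_UM2[cell] - ref) / ref


for ref_cell in ("6T_beta2", "6T_beta4", "8T"):
    print(f"7T vs {ref_cell}: {area_change_percent('7T', ref_cell):+.1f}%")
```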


7.2.4. Active Mode Power and Access Speed

The junction and oxide capacitances of the access transistors attached to the bitlines are extracted for each SRAM cell. The lengths of the bitlines and the wordlines are estimated based on the cell layout dimensions. Π-type RC networks that represent the bitline and the wordline parasitics are attached to each memory array. The simulation results for the access delay and the power consumption of the memory arrays with different SRAM cells are shown in Figs. 7.10 and 7.11, respectively. The read delay is measured as the time period from the 50% point of the R signal low-to-high transition until the voltage of RBL drops by 200mV (when a “1” is stored at Node1). The write delay is measured as the time period from the 50% point of the low-to-high transition of the signal that controls the write driver until the storage node is discharged to VDD/2 (from an initial voltage of VDD). The read and write power consumptions include the power consumed to precharge the bitlines, the power consumption of the wordline drivers, and the power consumed by the SRAM cells. As illustrated in Fig. 7.10, the 7T SRAM cell is not only more robust but also faster as compared to the 6T SRAM cells during a read operation. The read critical delay path is composed of two series transistors (N4 and N5), each sized at twice the minimum transistor width, in the 7T SRAM cell (see Fig. 7.2). Alternatively, with the 6T SRAM cells, the read critical path is composed of two series transistors (N1 and N3 or N2 and N4) with the access transistors (N3 and N4) sized minimum as shown in Fig. 7.1. Note that the bitline access transistors are typically sized minimum for achieving read data stability with the conventional 6T SRAM cells. The read speed is enhanced by up to 17% with the proposed technique due to the lower resistance of the read access delay path as compared to the standard 6T SRAM cells.
The write speed, however, is degraded by 8% to 16% with the 7T SRAM cell as compared to the 6T SRAM cells due to the utilization of only a single bitline for writing into the cells with the proposed technique. The write margin for the SRAM cells is listed in Table 7.1. Writing to the 6T and 8T SRAM cells is achieved by discharging one of the bitlines to ground. Writing to an SRAM cell is, however, possible with a voltage higher than 0V on the discharged bitline (performing a write operation with an incomplete/partially discharged bitline). The write margin is the maximum incomplete bitline discharge voltage for which the successful transfer of new


data into the 6T and 8T SRAM cells is achieved [56]. The content of an SRAM cell with a higher write margin is easier to modify.

[Figure 7.10: bar chart of the read delay and the write delay (ps, axis from 200 to 350) for the 6T (β = 2), 6T (β = 3), 6T (β = 4), 8T, and 7T SRAM cells.]

Fig. 7.10. Comparison of the access delays of the memory arrays with different SRAM cells.

For the 7T SRAM cells, two different write margins exist. The definition and measurement of the write margin when writing a “0” is similar to the 6T and the 8T SRAM cells. Alternatively, when writing a “1” into the 7T SRAM cell, the write margin is the difference between VDD and the minimum bitline voltage required to achieve a successful transfer of a “1” into the cell. The write margin of the 7T SRAM cell while transferring a “0” is identical to the 6T SRAM cell with β = 4, as listed in Table 7.1.

TABLE 7.1. WRITE MARGINS OF THE SRAM CELLS

SRAM Cell           Write Margin (mV)
6T (β = 2)          340
6T (β = 3)          280
6T (β = 4)          240
8T                  420
7T writing “0”      240
7T writing “1”      420


As shown in Fig. 7.11, the write power is significantly reduced with the proposed 7T SRAM cell as compared to the 6T and 8T SRAM cells. This reduction in the write power is due to the utilization of a single bitline for writing into the 7T SRAM cells in a memory column. For the 6T and 8T SRAM cells both bitlines in each memory column are periodically precharged to VDD. After the bitline precharge is complete and once a write decision is made, one of the precharged bitlines is selectively discharged to VGND to perform a write operation. In a memory array with 6T or 8T SRAM cells, therefore, one of the bitlines needs to be fully charged and discharged every write cycle, regardless of whether a “0” or a “1” is transferred to the cell. Alternatively, in case of writing a “1” to a memory column with the 7T SRAM cells, the WBL does not need to be discharged (it is maintained at the precharge voltage VDD). The bitline dynamic switching power consumption is thereby significantly reduced with the 7T SRAM cells. The write power is reduced by up to 35% with the proposed 7T SRAM cell as compared to the 6T and the 8T SRAM cells.

[Figure 7.11: bar chart of the read power and the write power (access power consumption in µW, axis from 0 to 70) for the 6T (β = 2), 6T (β = 3), 6T (β = 4), 8T, and 7T SRAM cells.]

Fig. 7.11. Comparison of the read and write power consumptions of the memory arrays with different SRAM cells.
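The bitline switching argument can be illustrated with a first-order estimate. The sketch below is an assumption-laden simplification rather than the simulation setup used in this chapter: it counts only the bitline charge/discharge energy in units of C_BL × VDD², assumes the same bitline capacitance for all cells, and ignores the wordline drivers and the cell-internal power that the reported write power includes. The function name and the data probability parameter are hypothetical.

```python
def avg_bitline_write_energy(p_zero: float, single_bitline: bool) -> float:
    """Average bitline switching energy per write, in units of C_BL * VDD**2.

    p_zero: probability that the written bit is "0" (assumed data statistic).
    single_bitline: True for the 7T cell (one write bitline), False for 6T/8T.
    """
    if not 0.0 <= p_zero <= 1.0:
        raise ValueError("p_zero must be between 0 and 1")
    if single_bitline:
        # The WBL is discharged and recharged only when a "0" is written.
        return p_zero
    # One of the two precharged bitlines is discharged on every write.
    return 1.0


print(avg_bitline_write_energy(0.5, single_bitline=True))
print(avg_bitline_write_energy(0.5, single_bitline=False))
```

Under these assumptions the bitline component scales with the fraction of “0” writes for the single-bitline cell. The normalized values are not directly comparable to the 35% total write power reduction reported above, which also covers the other power components.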

7.2.5. Process Variations In this section, the read stability and the leakage power variations of the SRAM cells due to process fluctuations in the channel length, the width, and the threshold voltage of the


transistors are evaluated. The channel length, the transistor width, and the threshold voltage are assumed to have normal Gaussian statistical distributions. Each parameter is assumed to have a three sigma (3σ) variation of 10% [54]. Monte Carlo simulations with 10000 samples are run to evaluate the read stability and the leakage power statistical distributions. The distributions of the leakage power and the SNM of the SRAM cells are shown in Figs. 7.12 and 7.13, respectively. As illustrated in Figs. 7.12 and 7.13, the statistical samples with the proposed 7T SRAM technique consume significantly lower leakage power and provide enhanced data stability as compared to the 6T SRAM cells. With the proposed 7T SRAM cell, the mean (standard deviation) of the statistical leakage power distribution is reduced by 42% (51%) and 65% (71.5%) as compared to the 6T SRAM cells with β = 2 and β = 4, respectively. Furthermore, the mean of the statistical SNM distribution is enhanced by 90% and 54% with the proposed 7T SRAM cell as compared to the 6T SRAM cells with β = 2 and β = 4, respectively. The leakage power distributions of the 7T SRAM cell and the 6T SRAM cell with β = 2 intersect at 30.2nW, as shown in Fig. 7.12. 86.2% of the statistical samples with the proposed technique consume less than 30.2nW of leakage power. Alternatively, 96.8% of the 6T SRAM cell samples consume more than 30.2nW of leakage power. Similarly, the leakage power distributions of the 7T SRAM cell and the 6T SRAM cell with β = 4 intersect at 44nW. With the 7T SRAM cell 98.5% of the statistical samples consume less than 44nW of leakage power. Alternatively, 99.97% of the 6T SRAM cell samples consume more than 44nW of leakage power as illustrated in Fig. 7.12.
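The statistical setup and the quoted improvements can be sketched in a few lines. This is not the circuit-level Monte Carlo flow used here: it only shows how a parameter with a 3σ variation of 10% might be sampled from a Gaussian distribution, and it recomputes the quoted percentages from the means and standard deviations annotated in Figs. 7.12 and 7.13. The function names, the seed, and the 65nm nominal value used for the sampling demonstration are illustrative assumptions.

```python
import random
import statistics


def sample_parameter(nominal: float, rng: random.Random,
                     three_sigma_fraction: float = 0.10) -> float:
    """Draw one Gaussian sample whose 3-sigma spread is a fraction of nominal."""
    sigma = nominal * three_sigma_fraction / 3.0
    return rng.gauss(nominal, sigma)


def percent_reduction(reference: float, value: float) -> float:
    return 100.0 * (reference - value) / reference


def percent_increase(reference: float, value: float) -> float:
    return 100.0 * (value - reference) / reference


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lengths_nm = [sample_parameter(65.0, rng) for _ in range(10000)]
print(statistics.mean(lengths_nm), statistics.stdev(lengths_nm))

# Leakage power statistics (nW) and SNM means (mV) as annotated in the figures.
print(percent_reduction(43.0, 25.0), percent_reduction(12.6, 6.17))   # vs 6T, beta = 2
print(percent_reduction(71.5, 25.0), percent_reduction(21.7, 6.17))   # vs 6T, beta = 4
print(percent_increase(154.0, 292.0), percent_increase(190.0, 292.0))  # SNM means
```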


[Figure 7.12: histograms of the number of samples versus the leakage power consumption (nW, axis from 0 to 200). Annotated statistics: 6T (β = 2), mean = 43 nW, SD = 12.6 nW; 6T (β = 4), mean = 71.5 nW, SD = 21.7 nW; 7T, mean = 25 nW, SD = 6.17 nW. Annotated intersections: 30.2nW (86.2% and 96.8%) and 44nW (98.5% and 99.97%).]

Fig. 7.12. Statistical leakage power distributions of the SRAM cells.

[Figure 7.13: histograms of the number of samples versus the SNM (mV, axis from 80 to 355). Annotated statistics: 6T (β = 2), mean = 154 mV, SD = 16 mV; 6T (β = 4), mean = 190 mV, SD = 13.4 mV; 7T, mean = 292 mV, SD = 17.5 mV. Annotated enhancements of the 7T mean: 89.7% higher and 53.7% higher.]

Fig. 7.13. Statistical SNM distributions of the SRAM cells.


7.3 Chapter Summary A new 7T dual-Vt SRAM cell is proposed in this chapter for simultaneously reducing the active and standby mode power consumption while enhancing the data stability and the read speed. The proposed circuit provides two separate data access mechanisms for the read and write operations. During a read operation, the storage nodes are isolated from the bitlines, thereby enhancing the read SNM by up to 87% as compared to the conventional 6T SRAM cells. The cross-coupled inverters of the 7T SRAM cell are not on the critical delay path, thereby allowing the utilization of high threshold-voltage minimum sized transistors for significantly reducing the leakage power consumption by up to 66% without degrading the circuit speed as compared to the conventional 6T SRAM circuits. Furthermore, the read speed is enhanced with the proposed 7T SRAM circuit due to significantly smaller resistance of the read-delay-path. The write power is also reduced with the 7T SRAM circuit due to the utilization of a single bitline for the transfer of new data into the cell. The effectiveness of the 7T SRAM cell for providing significant data stability enhancement and leakage power reduction is also verified with the statistical data produced under process parameter variations.


Chapter 8 Multi-Gate FinFET Technology In this chapter the emerging multi-gate FinFET technology is presented. The advantages of the multi-gate transistors as compared to the conventional single-gate MOSFETs are presented in Section 8.1. FinFET technology development guidelines for higher on-current, suppressed leakage currents, and weaker sensitivity to parameter variations are provided in Section 8.2. Techniques for threshold voltage tuning using independent-gate bias and work-function engineering are discussed in Section 8.3. Finally, conclusions are offered in Section 8.4.

8.1. Emerging Multi-Gate Technology

The channel length of the conventional single-gate MOSFETs has been scaled from 10µm to 45nm over the past 40 years. Parameter variations and power consumption are currently the primary barriers against further device scaling as discussed in Chapter 1. The emerging multi-gate transistors offer distinct advantages for simultaneously enhancing the on-current and suppressing the leakage currents as compared to the standard single-gate MOSFETs [62]-[67]. The multiple electrically coupled gates and the thin silicon body suppress the short-channel effects, thereby lowering the sub-threshold leakage current in a multi-gate MOSFET. The suppressed short-channel effects and the enhanced gate control over the channel (lower sub-threshold swing) permit the use of a thicker gate oxide in a multi-gate MOSFET as compared to a conventional single-gate transistor. The gate oxide leakage current is thereby significantly reduced despite the larger gate area of a multi-gate transistor. The thin body of a multi-gate device is typically undoped or lightly doped. Therefore the carrier mobility is enhanced and the device variations due to the doping fluctuations are reduced in a multi-gate MOSFET as compared to a single-gate bulk transistor. Several architectures for the implementation of multi-gate transistors [67] are shown in Figs. 8.1 and 8.2. The most attractive multi-gate device is the FinFET due to the self-alignment of the two gates and the relative compatibility of the FinFETs with the existing standard CMOS fabrication process.


[Figure 8.1: three multi-gate transistor structures (planar DG-FET, FinFET, and vertical DG-FET), each labeled with the source, the drain, the front gate, the back gate, and the direction of the current.]

Fig. 8.1. Different implementations of a multi-gate field-effect transistor [67].



[Figure 8.2: 3D views of (a) a single-fin FinFET and (b) a two-fin FinFET, labeled with the gate, the source, the drain, Hfin, tsi, and L.]

Fig. 8.2. FinFET 3D view. (a) Single fin FET with the fin dimensions indicated. (b) Two-fins FET.

The variations of the threshold voltage (Vth) and the drain-induced barrier-lowering (DIBL) with the channel length for a double-gate FinFET and a single-gate bulk MOSFET are depicted in Figs. 8.3 and 8.4, respectively. The parameters of the FinFET device are listed in Table 8.1. The threshold voltage is the gate-to-source voltage at which the drain current per fin height is 10^-5 A/µm for |VDS| = VDD. DIBL is measured as the degradation in |Vth| when the drain voltage is increased from 0.05V to VDD. The short-channel effect (Vth-roll-off) is significantly suppressed with the double-gate FinFET technology, as illustrated in Fig. 8.3. The dependence of the threshold voltage on the channel length is much weaker for a double-gate FinFET as compared to a single-gate bulk MOSFET. Furthermore, the DIBL observed with a double-gate FinFET is significantly smaller as compared to a single-gate MOSFET, as shown in Fig. 8.4.

[Figure 8.3: Vth (V, axis from 0.0 to 0.6) versus channel length (nm, from 12 to 52) for a single-gate MOSFET and a double-gate FinFET.]

Fig. 8.3. The variation of the threshold voltage with the channel length for a FinFET and a standard single-gate bulk MOSFET. Results are obtained by MEDICI simulations [70].

[Figure 8.4: DIBL (V, axis from 0.00 to 0.20) versus channel length (nm, from 12 to 52) for a single-gate MOSFET and a double-gate FinFET.]

Fig. 8.4. Drain-induced-barrier-lowering (DIBL) of a FinFET and a standard single-gate bulk MOSFET. DIBL is measured as the degradation in |Vth| when the drain voltage is increased from 0.05V to 0.8V (VDD). Results are obtained by MEDICI simulations [70].

TABLE 8.1. FinFET TECHNOLOGY PARAMETERS

Parameter                         Value
Channel length (L)                32 nm
Gate-Drain (Source) overlap       3.2 nm
Fin Height (Hfin)                 32 nm
Fin thickness (tsi)               8 nm
Oxide thickness (tox)             1.6 nm
Channel doping                    10^15 cm^-3
Source / Drain doping             2 x 10^20 cm^-3
Work function (N-type FinFET)     4.5eV
Work function (P-type FinFET)     4.9eV
Supply voltage (VDD)              0.8V


The width of a FinFET is quantized with the number of fins due to the vertical gate structure. The fin height determines the minimum transistor width (Wmin). For a tied-gate FinFET Wmin is

$$W_{min} = 2 \times H_{fin} + t_{si}, \qquad (8.1)$$

where Hfin is the height of the fin and tsi is the thickness of the silicon body as shown in Fig. 8.2a. Hfin is the dominant component of the transistor width since tsi is typically much smaller than Hfin in order to effectively suppress the short-channel effects. Since Hfin is fixed in a FinFET technology, multiple parallel fins are utilized to increase the width of a FinFET as shown in Fig. 8.2b. The total physical transistor width (Wtotal) of a tied-gate FinFET with n parallel fins is

$$W_{total} = n \times W_{min} = n \times (2 \times H_{fin} + t_{si}). \qquad (8.2)$$

The width quantization of FinFETs introduces interesting design challenges in CMOS circuits whose performance characteristics are highly sensitive to transistor sizing.
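A short numerical sketch of Equations (8.1) and (8.2) may help to make the width quantization concrete. The default dimensions are the Hfin and tsi values listed in Table 8.1; the function names and the 200nm target width are illustrative assumptions.

```python
import math


def finfet_width_nm(n_fins: int, h_fin_nm: float = 32.0, t_si_nm: float = 8.0) -> float:
    """Total width of a tied-gate FinFET with n parallel fins, per Eq. (8.2)."""
    if n_fins < 1:
        raise ValueError("a FinFET needs at least one fin")
    return n_fins * (2.0 * h_fin_nm + t_si_nm)


def fins_for_target_width(target_nm: float, h_fin_nm: float = 32.0,
                          t_si_nm: float = 8.0) -> int:
    """Smallest number of fins whose total width meets or exceeds the target."""
    w_min = 2.0 * h_fin_nm + t_si_nm  # Eq. (8.1)
    return max(1, math.ceil(target_nm / w_min))


n = fins_for_target_width(200.0)
print(n, finfet_width_nm(n))  # fin count and the resulting quantized width in nm
```

With these dimensions the width can only take multiples of 72nm, so a 200nm target is rounded up to three fins (216nm). This rounding is the sizing constraint referred to in the text.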

8.2. FinFET Technology Development Guidelines The fin thickness and the gate oxide thickness are the most critical device parameters due to the strong impact of these dimensions on the efficiency of the FinFET architecture in suppressing the short-channel effects and parameter variation sensitivity. The impact of the fin thickness and the gate oxide thickness on the electrical characteristics of N-type and P-type symmetric double-gate FinFETs is explored in this section. FinFET design guidelines for higher on-current, suppressed leakage currents, and stronger resilience to parameter variations are provided. N-type and P-type FinFETs with different fin thickness and gate oxide thickness are designed and characterized for DC characteristics, propagation delay, and parameter variations sensitivity using Taurus-Medici [70]. The gate length, the gate-drain/source overlaps, and the fin height are fixed at 32nm, 3.2nm, and 32nm, respectively, for all the devices considered in this Section. The fin thickness is varied from 8nm to 20nm. For each fin thickness, the gate oxide thickness is varied from 1.2nm to 2nm. Twenty distinct N-type and P-type device configurations with different combinations of fin thickness and gate oxide thickness are considered. For each device configuration, the gate work-function is adjusted such that the threshold voltage is equal to 300mV and -400mV for the N-Type and the P-type transistors, respectively, when operating with


a 0.8V supply voltage (VDD = 0.8V). The physical models incorporated in all the simulations performed in this section are listed in Table 8.2. The DC characteristics of the different FinFET device configurations are presented in Section 8.2.1. The effect of the process parameter variations, the supply voltage variations, and the temperature fluctuations on the characteristics of FinFET circuits are provided in Section 8.2.2.

TABLE 8.2. PHYSICAL MODELS USED IN MEDICI SIMULATION [70]

Physical Model    Description
QM.PHILI          Accounts for quantum mechanical effects in the inversion layer.
CCSMOB            Accounts for carrier-carrier scattering, lattice scattering, ionized impurity scattering, and temperature.
SRFMOB2           Accounts for surface scattering with surface roughness, phonon scattering, and charged impurity scattering considered.
FLDMOB            Accounts for the dependency of the mobility on the parallel electric field.

8.2.1. DC Characteristics

In this section, the different FinFET profiles are characterized for the on-current, the off-current, the gate tunneling current, the sub-threshold slope, and the drain-induced-barrier-lowering. The on-currents produced by the N-type and the P-type FinFETs are shown in Figs. 8.5a and 8.5b, respectively. The on-current is the drain current when |VGS| = |VDS| = VDD. The on-current increases with the reduction in the gate oxide thickness as shown in Fig. 8.5 due to the enhanced gate capacitance. Increasing the fin thickness leads to a higher on-current due to the increased number of carriers as illustrated in Fig. 8.6 (dashed curve with a flat gate work-function) and due to the reduced series resistance of the drain and source extension regions. Furthermore, the threshold voltage is reduced with a thicker fin due to the reduction in quantum-mechanical confinement and due to the enhanced short-channel effects. Alternatively, when the gate work-function is increased for an N-type FinFET with a thicker fin in order to maintain the same threshold voltage for each device profile, a higher gate work-function combined with a thicker fin leads to a reduction in the on-current when the fin thickness is increased from 12nm to 20nm as shown in Fig. 8.5a. Similarly, when the gate work-function is reduced for a P-type FinFET with a thicker fin in order to maintain the same threshold voltage for each device profile,


a lower gate work-function combined with a thicker fin leads to a reduction in the on-current when the fin thickness is increased from 8nm to 20nm as shown in Fig. 8.5b.

[Figure 8.5: on-current Ion (mA/µm) versus tox (nm, from 1.2 to 2.0) for tsi = 8nm, 12nm, 16nm, and 20nm. Panel (a) axis from 0.4 to 1.4; panel (b) axis from 0.2 to 0.8.]

Fig. 8.5. The on-current produced by single-fin (minimum sized) transistors for different fin and gate oxide thicknesses. T = 27°C. VDD = 0.8V. (a) N-type FinFETs. (b) P-type FinFETs. The gate work-function is adjusted to maintain a constant threshold voltage for each device profile with a different tox and tsi combination.

[Figure 8.6: number of carriers per Hfin (µm^-1, left axis) and gate work-function (eV, right axis) versus tsi (nm, from 8 to 20), panels (a) and (b).]

Fig. 8.6. Number of carriers per fin height in the channel region versus the fin thickness. tox = 1.6nm. (a) N-type FinFETs. (b) P-type FinFETs. Dashed lines: gate work-function is the same for all the devices. Solid lines: gate work-function is increased (reduced) with a thicker fin to maintain a constant threshold voltage for N-type (P-type) FinFETs.


The leakage currents (off-current and gate tunneling current) of the N-type and the P-type FinFETs are depicted in Figs. 8.7 and 8.8, respectively, for the different FinFET profiles. The off-current is significantly increased with a thicker fin and a thicker gate oxide due to the increased sub-threshold swing caused by the weaker gate control over the channel area. Alternatively, the gate tunneling current is exponentially reduced as the gate oxide thickness is increased. The gate tunneling current of an N-type FinFET is reduced with a thicker fin as shown in Fig. 8.7 due to the utilization of a higher gate work-function to compensate for the degradation of the threshold voltage when the fin thickness is increased. Alternatively, the gate tunneling current of a P-type FinFET is increased with a thicker fin as shown in Fig. 8.8 due to the utilization of a lower gate work-function to compensate for the degradation of the threshold voltage when the fin thickness is increased. The gate tunneling current is reduced with a higher gate work-function [92]. The gate tunneling leakage is significantly lower for a P-type transistor as compared to an N-type transistor due to the higher tunneling barrier for holes as compared to electrons at the Si-SiO2 interface [91].

[Figure 8.7: off-current and gate current (nA/µm, logarithmic axis from 10^-4 to 10^2) versus tox (nm, from 1.2 to 2) for tsi = 8nm, 12nm, 16nm, and 20nm.]

Fig. 8.7. The off-current and the gate tunneling current of an N-type FinFET for different fin and gate oxide thicknesses. T = 27°C. VDD = 0.8V. The off-current is the drain current with VGS = 0V and VDS = VDD. The gate tunneling current is measured when VGD = VGS = VDD.


[Figure 8.8: off-current and gate current (nA/µm, logarithmic axis from 10^-8 to 10^2) versus tox (nm, from 1.2 to 2) for tsi = 8nm, 12nm, 16nm, and 20nm.]

Fig. 8.8. The off-current and the gate tunneling current of a P-type FinFET for different fin and gate oxide thicknesses. T = 27°C. VDD = 0.8V. The off-current is the drain current with VGS = 0V and VDS = VDD. The gate tunneling current is measured when VGD = VGS = VDD.

Enhancing the on-current and reducing the leakage currents cannot be simultaneously achieved due to the contradictory requirements on the gate oxide thickness and the fin thickness as shown in Figs. 8.5, 8.7, and 8.8. One approach for designing a FinFET for high on-current and reduced leakage currents is to select the fin thickness and the gate oxide thickness that maximize the ratio of the on-current to the sum of the off-current and gate tunneling currents as shown in Figs. 8.9 and 8.10. With this approach the optimum device configuration of an N-type FinFET operating at room temperature would be achieved with a fin thickness of 8nm and a gate oxide thickness of 1.6nm, as shown in Fig. 8.9a. Alternatively, the optimum configuration of an N-type FinFET operating at T = 110°C is achieved when the fin and the gate oxide thicknesses are 8nm and 1.4nm, respectively, as shown in Fig. 8.9b. The ratio of the on-current to leakage currents is maximized for a P-type FinFET when the fin and the gate oxide thicknesses are minimized, as shown in Fig. 8.10. Note that there are no local maxima in Fig. 8.10 due to the significantly lower

135

gate tunneling current as compared to the sub-threshold leakage currents of a P-type FinFET as shown in Fig. 8.8.
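The selection rule described above can be sketched as a small search over a characterization table. This is only a minimal illustration and not the simulation flow used in this work: the function names `figure_of_merit` and `best_profile` are hypothetical, and every current value in the example table is a made-up placeholder rather than data read from Figs. 8.5 to 8.10.

```python
from typing import Dict, Tuple

# Key: (tsi_nm, tox_nm). Value: (Ion, Ioff, Igate), all in the same unit.
Profile = Tuple[float, float]
Currents = Tuple[float, float, float]


def figure_of_merit(ion: float, ioff: float, igate: float) -> float:
    """Ratio of the on-current to the total leakage current."""
    return ion / (ioff + igate)


def best_profile(table: Dict[Profile, Currents]) -> Profile:
    """Return the (tsi, tox) pair that maximizes Ion / (Ioff + Igate)."""
    return max(table, key=lambda profile: figure_of_merit(*table[profile]))


# Hypothetical placeholder currents, chosen only to exercise the code.
# Thin oxide: gate tunneling dominates. Thick oxide: off-current dominates.
example = {
    (8.0, 1.2): (1.00e-3, 1.0e-9, 9.0e-9),
    (8.0, 1.6): (0.90e-3, 2.0e-9, 1.0e-9),
    (8.0, 2.0): (0.80e-3, 4.0e-9, 1.0e-10),
    (16.0, 1.6): (0.95e-3, 4.0e-8, 5.0e-10),
}

print(best_profile(example))
```

With placeholder numbers of this shape, the maximum falls at an intermediate oxide thickness because the gate tunneling current limits the ratio for a thin oxide while the off-current limits it for a thick oxide. When the gate tunneling current is negligible at every oxide thickness, the same search simply returns the thinnest profile.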

[Figure: Ion / (Ioff + Igate) versus tox (nm) for tsi = 8nm, 12nm, 16nm, and 20nm; panels (a) and (b).]

Fig. 8.9. Ratio of the on-current to the total leakage currents for different fin and gate oxide thicknesses of an N-type FinFET. (a) T = 27°C. (b) T = 110°C.

[Figure: Ion / (Ioff + Igate) versus tox (nm) for tsi = 8nm, 12nm, 16nm, and 20nm; panels (a) and (b).]

Fig. 8.10. Ratio of the on-current to the total leakage currents for different fin and gate oxide thicknesses of a P-type FinFET. (a) T = 27°C. (b) T = 110°C.

The sub-threshold slope and the drain-induced-barrier-lowering of the N-type FinFETs are shown in Figs. 8.11 and 8.12, respectively. The sub-threshold slope is measured as the change in the gate-to-source voltage that leads to a tenfold reduction in the drain current when the device is operating in the sub-threshold regime [1]. The drain-induced-barrier-lowering (DIBL) is measured as the degradation in the threshold voltage when the drain-to-source voltage is increased from 0.05V to VDD. The sub-threshold slope and the DIBL are increased with a thicker gate oxide and a thicker fin due to the reduced gate control over the channel area. A sub-threshold slope lower than 100mV/decade (at room temperature) can be achieved with a fin thinner than half of the gate length (tsi < 16nm), as shown in Fig. 8.11a.

[Figure: sub-threshold slope (mV/decade) versus tox (nm) for tsi = 8nm, 12nm, 16nm, and 20nm; panels (a) and (b).]

Fig. 8.11. Variation of the sub-threshold slope of an N-type FinFET with the different fin and gate oxide thicknesses. (a) T = 27°C. (b) T = 110°C.
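The two figures of merit defined in this section can be extracted from simulated I-V data with a few lines of code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the sub-threshold slope is estimated from two bias points assumed to lie in the sub-threshold regime, and the numbers in the usage lines are illustrative placeholders, not data from Figs. 8.11 and 8.12.

```python
import math


def subthreshold_slope_mv_per_decade(vgs1: float, id1: float,
                                     vgs2: float, id2: float) -> float:
    """Change in VGS (in mV) per tenfold change in drain current.

    Both bias points are assumed to lie in the sub-threshold regime.
    """
    decades = math.log10(id2 / id1)
    return 1000.0 * (vgs2 - vgs1) / decades


def dibl_mv(vth_at_low_vds: float, vth_at_vdd: float) -> float:
    """Threshold voltage degradation (in mV) when VDS rises from 0.05V to VDD."""
    return 1000.0 * (vth_at_low_vds - vth_at_vdd)


# Hypothetical placeholder values.
print(subthreshold_slope_mv_per_decade(0.10, 1e-9, 0.20, 1e-8))
print(dibl_mv(vth_at_low_vds=0.30, vth_at_vdd=0.25))
```

A two-point estimate is sensitive to the choice of bias points; fitting a line to log10(ID) over several sub-threshold points is a more robust variant of the same idea.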

[Figure: DIBL (mV) versus tox (nm) for tsi = 8nm, 12nm, 16nm, and 20nm; panels (a) and (b).]

Fig. 8.12. Variation of the drain-induced-barrier-lowering of an N-type FinFET with the different fin and gate oxide thicknesses. (a) T = 27°C. (b) T = 110°C.

8.2.2. Process, Supply Voltage, and Temperature Variations

In this section, the impact of process parameter variations, supply voltage variations, and temperature fluctuations on the characteristics of FinFET circuits is presented. The impact of process parameter variations on the on-current and the leakage currents is examined by varying the fin thickness and the gate oxide thickness of a FinFET. Four process corners are examined in which the fin thickness and the gate oxide thickness are varied by ±1nm and ±1Å, respectively, for each of the twenty device configurations considered in this section. The on-current, the off-current, and the gate tunneling current are determined for each of the four process corners as well as the nominal process corner. The percent variation of the on-current is

$$I_{Variation} = \frac{I_{Max} - I_{Min}}{I_{Nominal}} \times 100\%, \qquad (8.3)$$

where $I_{Max}$, $I_{Min}$, and $I_{Nominal}$ are the maximum on-current, the minimum on-current, and the on-current at the nominal process corner, respectively. In this section, the off-current variation is the ratio of the maximum off-current to the minimum off-current. Similarly, the gate tunneling current variation is the ratio of the maximum gate leakage current to the minimum gate leakage current.

The on-current fluctuations of the N-type and the P-type FinFETs are shown in Figs. 8.13a and 8.13b, respectively. The process induced on-current variation is significant when the fin thickness is 8nm since the relative fin thickness variation is highest for this device configuration. The on-current variation is reduced by approximately a factor of three when the fin thickness is increased from 8nm to 16nm, as shown in Fig. 8.13. Increasing the fin thickness to 20nm, however, enhances the on-current fluctuations due to the weaker gate control over the channel area. The weaker gate control over the channel enhances the sensitivity of the transistor current to the fluctuations of the gate oxide thickness, as shown in Fig. 8.13.

The off-current variations of the N-type and the P-type FinFETs are shown in Figs. 8.14a and 8.14b, respectively. The off-current variation is reduced with a thinner fin and a thinner gate oxide, as shown in Fig. 8.14, due to the suppressed short-channel effects. The gate tunneling current variation is approximately constant (a maximum-to-minimum ratio of ≈ 9) due to the exponential dependence of the gate tunneling current on the gate oxide thickness and the weak dependence on the fin thickness, as shown in Figs. 8.7 and 8.8. There exists a trade-off between the fluctuations of the on-current and the variations of the off-current, as shown in Figs. 8.13 and 8.14. For achieving an enhanced resilience to the process parameter variations, the fin thickness should be between one fourth and one half of the gate length.

[Figure: Ion variation (%) versus tox (nm) for tsi = 8nm, 12nm, 16nm, and 20nm; panels (a) and (b).]

Fig. 8.13. On-current percent variation due to process parameter variations. T = 27°C. (a) N-type FinFETs. (b) P-type FinFETs.
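The variation metrics of this section, Eq. (8.3) for the on-current and the maximum-to-minimum ratio for the leakage currents, can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes that the maximum and the minimum are taken over the four corners together with the nominal corner, which the text does not state explicitly, and the current values in the usage lines are placeholders.

```python
from itertools import product
from typing import List, Sequence, Tuple

ANGSTROM_NM = 0.1  # 1 angstrom expressed in nanometers


def process_corners(tsi_nm: float, tox_nm: float) -> List[Tuple[float, float]]:
    """Four corners: fin thickness +/-1nm combined with oxide thickness +/-1 angstrom."""
    return [(tsi_nm + d_tsi, tox_nm + d_tox)
            for d_tsi, d_tox in product((-1.0, 1.0), (-ANGSTROM_NM, ANGSTROM_NM))]


def on_current_variation_percent(currents: Sequence[float], nominal: float) -> float:
    """Eq. (8.3): (I_Max - I_Min) / I_Nominal * 100%."""
    return (max(currents) - min(currents)) / nominal * 100.0


def leakage_variation_ratio(currents: Sequence[float]) -> float:
    """Off-current or gate-current variation: maximum divided by minimum."""
    return max(currents) / min(currents)


# Hypothetical on-currents at the four corners plus the nominal corner (assumed set).
ion = [0.90, 0.95, 1.00, 1.05, 1.10]
print(process_corners(8.0, 1.6))
print(on_current_variation_percent(ion, nominal=1.00))
```

Note that the two metrics are not directly comparable: the on-current metric is a percentage of the nominal value, while the leakage metrics are dimensionless ratios that suit quantities varying over an order of magnitude.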

[Figure: Ioff variation (ratio) versus tox (nm) for tsi = 8nm, 12nm, 16nm, and 20nm; panels (a) and (b).]

Fig. 8.14. Ratio of the maximum off-current to the minimum off-current under process variations with different device profiles. T = 27°C. (a) N-type FinFET. (b) P-type FinFET.

The impact of supply voltage and temperature variations on a FinFET circuit is characterized using the inverter chain test circuit shown in Fig. 8.15. The first inverter is used for signal shaping. The third inverter represents the load. The propagation delay of the second inverter is measured as the average of the low-to-high and high-to-low propagation delays. The first two inverters are minimum sized with a single fin for the N-type FinFET and two fins for the P-type FinFET. The load inverter is sized four times the minimum sized inverter.

Fig. 8.15. Test circuit for characterizing the impact of supply voltage and temperature variations on the inverter propagation delay.

The propagation delay is shown in Fig. 8.16 for different fin thicknesses and different supply voltages. The gate oxide thickness is fixed at 1.6nm. The fin thickness and the supply voltage are varied from 8nm to 20nm and from 0.6V to 1.2V, respectively. The propagation delay is relatively insensitive to the supply voltage fluctuations at a higher supply voltage due to the enhanced gate overdrive voltage of the transistors. The propagation delay is proportional to the ratio of the supply voltage to the gate overdrive, where the gate overdrive is the difference between the supply voltage and the threshold voltage. The gate overdrive voltage is reduced at a lower supply voltage. The modulation of the gate overdrive voltage is therefore more significant for the voltage fluctuations around a lower nominal supply voltage. The propagation delay variations are thereby enhanced at a lower supply voltage, as shown in Fig. 8.16. The propagation delay fluctuations at a lower supply voltage are more significant for thicker fins due to the reduced coupling between the two gates of a FinFET with a thicker fin and a smaller supply voltage. When the circuit is designed with transistors that have 8nm fin thickness, the propagation delay is increased by 2.14X when the supply voltage is reduced from 1.2V to 0.6V. Alternatively, when the circuit is designed with transistors that have 20nm fin thickness, the propagation delay is increased by 14X when the supply voltage is reduced from 1.2V to 0.6V. For more aggressive supply voltage scaling, therefore, the fin should be thinner than half of the channel length.
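The proportionality stated above, delay ∝ VDD / (VDD − Vth), already explains why the delay is more sensitive to supply fluctuations at a lower nominal supply voltage. The sketch below evaluates that first-order relation only. The function names, the threshold voltage of 0.3V, and the 50mV supply droop are illustrative assumptions, and the model is not expected to reproduce the simulated 2.14X and 14X figures, which also reflect the fin-thickness dependent effects discussed in the text.

```python
def average_propagation_delay(t_plh: float, t_phl: float) -> float:
    """Propagation delay as the average of the low-to-high and high-to-low delays."""
    return 0.5 * (t_plh + t_phl)


def relative_delay(vdd: float, vth: float) -> float:
    """First-order delay model: supply voltage divided by gate overdrive."""
    if vdd <= vth:
        raise ValueError("the model requires VDD above the threshold voltage")
    return vdd / (vdd - vth)


def delay_increase_percent(vdd: float, vth: float, droop: float = 0.05) -> float:
    """Percent delay increase when the supply droops by `droop` volts."""
    return (relative_delay(vdd - droop, vth) / relative_delay(vdd, vth) - 1.0) * 100.0


VTH_V = 0.3  # hypothetical threshold voltage, for illustration only

for vdd in (1.2, 1.0, 0.8, 0.6):
    print(f"VDD = {vdd:.1f}V: {delay_increase_percent(vdd, VTH_V):.2f}% slower for a 50mV droop")
```

In this simplified model the same 50mV droop costs several times more delay at 0.6V than at 1.2V, purely because the droop is a larger fraction of the gate overdrive.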

[Figure: propagation delay (ps) versus VDD (V) for tsi = 8nm, 12nm, 16nm, and 20nm.]

Fig. 8.16. Propagation delay versus the supply voltage for different fin thicknesses. Gate oxide thickness is equal to 1.6nm. T = 27°C.

The impact of temperature variations on the propagation delay is characterized by measuring the propagation delay at a low temperature (T = 27°C) and a high temperature (T = 110°C). The percent change in the propagation delay due to the temperature fluctuations is shown in Fig. 8.17 for different supply voltages and fin thicknesses. The gate oxide thickness is fixed at 1.6nm. The fluctuations of the delay are caused by the variations of the mobility and the threshold voltage with the temperature. The mobility degradation tends to increase the propagation delay by reducing the on-current at a higher temperature. Alternatively, the reduction in the threshold voltage tends to reduce the propagation delay by increasing the gate overdrive voltage at a higher temperature. At a lower supply voltage the gate overdrive variations are more significant as compared to the mobility variations, thereby reducing the propagation delay as the temperature is increased, as shown in Fig. 8.17 (negative delay variations). Alternatively, at a high supply voltage the gate overdrive variations are reduced. Mobility variations are small due to the undoped body of the FinFETs. The gate overdrive variations are counterbalanced by the mobility fluctuations as the temperature is varied. At a higher supply voltage, therefore, the temperature variations cause smaller fluctuations in the propagation delay, as shown in Fig. 8.17. The threshold voltage variations are enhanced with a thicker fin due to the stronger short-channel effects. The propagation delay dependence on temperature becomes more significant at a reduced supply voltage and a thicker fin, as shown in Fig. 8.17. For a stronger tolerance to temperature fluctuations, therefore, the fin should be thinner than half of the channel length.

[Figure: delay variation (%) versus VDD (V) for tsi = 8nm, 12nm, 16nm, and 20nm.]

Fig. 8.17. Percentage temperature-induced propagation delay variation for different supply voltages and different fin thicknesses. The temperature is varied from 27°C to 110°C.

8.3. Threshold Voltage Tuning Techniques

In this section two techniques for threshold voltage tuning of FinFETs are presented. The first technique is based on independent-gate bias, a unique feature of the FinFET technology. The independent-gate bias technique is discussed in Section 8.3.1. The second threshold voltage tuning approach, based on gate work-function engineering, is presented in Section 8.3.2. Note that channel doping is avoided as a method of threshold voltage tuning in FinFETs due to the significant threshold voltage variations that result from the channel doping [93].

8.3.1. Independent-Gate FinFET Technology

The two vertical gates of a single-fin FET can be separated by an oxide on top of the silicon fin, thereby forming an independent-gate FinFET. The 3D architectures of tied-gate and independent-gate FinFETs are shown in Fig. 8.18. Both tied-gate and independent-gate FinFETs have been successfully fabricated [68], [79]-[81]. A fabrication process is described in [68] for implementing tied-gate and independent-gate FinFETs on the same die. In [66] and [69], independent-gate FinFETs are utilized to reduce the number of transistors required for implementing specific logic functions as compared to the circuits with tied-gate FinFETs. In addition to the area savings, significant speed enhancement is reported due to the reduced parasitic capacitance and the lower transistor stack heights with the independent-gate FinFET circuits as compared to the circuits with tied-gate FinFETs. The power consumption is also reduced due to the lower parasitic capacitance of the simplified circuit topologies with the independent-gate FinFETs.

[Figure: 3D device structures showing the source, the drain, the gate (front gate and back gate for the independent-gate device), the insulator, the fin thickness tsi, the fin height Hfin, and the gate length L.]

Fig. 8.18. FinFET architectures. (a) Tied-gate FinFET. (b) Independent-gate FinFET.

An independent-gate FinFET operates in the dual-gate mode when both gates are biased to induce channel inversion. Alternatively, an independent-gate N-FinFET (P-FinFET) operates in the single-gate mode when one of the gates is deactivated by connection to GND (VDD). Disabling one of the gates in the single-gate mode increases the absolute value of the threshold voltage as compared to the dual-gate mode. It is therefore possible to modulate the threshold voltage of a FinFET by independently biasing the two gates. The current produced by an N-type (P-type) FinFET at 110°C is 2.55X (2.77X) higher in the dual-gate mode as compared to the single-gate mode, as shown in Fig. 8.19a (Fig. 8.19b). The modulation of the threshold voltage by independently biasing the gates of a FinFET is attractive for developing low-power and robust circuit techniques with multi-threshold-voltage (multi-Vth) transistors. The development of new low-power and robust multi-Vth integrated circuit techniques with independent-gate FinFETs is presented in Chapters 9, 10, and 11.

[Figure: drain current (A/µm, logarithmic scale) versus VGS (V) in the single-gate and the dual-gate modes. Annotations: 2.55X current ratio with Vth = 0.22V and Vth = 0.38V in panel (a); 2.77X current ratio with Vth = -0.27V and Vth = -0.46V in panel (b).]

Fig. 8.19. Drain current characteristics of FinFETs. (a) N-FinFET. (b) P-FinFET. |VDS| = VDD = 0.8V. T = 110°C.
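The two operating modes of an independent-gate FinFET can be summarized as a small lookup, which is convenient when a circuit description needs to pick a high-|Vth| or low-|Vth| device by bias alone. The sketch below is illustrative: the names are hypothetical, the threshold voltages are the annotations of Fig. 8.19, and their assignment to the two modes is inferred from the statement that disabling one gate increases the absolute value of the threshold voltage.

```python
from enum import Enum


class GateMode(Enum):
    DUAL = "dual-gate"
    SINGLE = "single-gate"


# Threshold voltages (V) annotated in Fig. 8.19 (T = 110C, |VDS| = 0.8V).
# The mapping to modes is an inference: the larger |Vth| is the single-gate mode.
VTH_V = {
    ("N", GateMode.DUAL): 0.22,
    ("N", GateMode.SINGLE): 0.38,
    ("P", GateMode.DUAL): -0.27,
    ("P", GateMode.SINGLE): -0.46,
}


def disabled_gate_bias(device: str, vdd: float) -> float:
    """Bias of the deactivated gate in the single-gate mode: GND for N, VDD for P."""
    return 0.0 if device == "N" else vdd


def vth_magnitude_increase(device: str) -> float:
    """Increase of |Vth| (V) when switching from the dual-gate to the single-gate mode."""
    return abs(VTH_V[(device, GateMode.SINGLE)]) - abs(VTH_V[(device, GateMode.DUAL)])


for dev in ("N", "P"):
    print(dev, round(vth_magnitude_increase(dev), 2), disabled_gate_bias(dev, vdd=0.8))
```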

8.3.2. Work-Function Engineering

The threshold voltage of a conventional single-gate transistor is typically tuned by controlling the channel doping concentration. Due to the continued miniaturization of the transistor sizes with technology scaling, however, the control of the channel doping concentration with sufficient precision has become infeasible. The reduced number of dopant atoms within smaller device volumes results in significant threshold voltage variations due to the discrete and random locations of the dopant atoms. Several researchers explore work-function engineering as an alternative technique for threshold voltage tuning in deeply scaled nanometer CMOS technologies [71]-[74]. The gate work-function affects the threshold voltage directly. A higher gate work-function increases the threshold voltage. Molybdenum is used as the gate material in [71] and [73]. The work-function of pure Molybdenum is 5eV. Implanting Molybdenum with nitrogen decreases the work-function depending on the implantation dose and energy, as shown in Fig. 8.20. Total Nickel silicidation of a doped polysilicon gate is shown to result in a metallic alloy with a tunable work-function that depends on the doping type and the doping level of the polysilicon prior to the silicidation step, as shown in Fig. 8.21 [72], [74]. The use of work-function engineering for developing novel multi-Vth low-power FinFET circuit techniques is explored in Chapters 9 and 12.

[Figure: gate work-function (eV) versus implantation energy (keV), including the unimplanted case.]

Fig. 8.20. Work-function tuning with Molybdenum gate material. The work-function is tuned with a 5 × 10¹⁵ cm⁻² Nitrogen dose and different implantation energies. Data extracted from [71].

[Figure: work-function (eV) of NiSi and TiSi gates versus gate implant dose (cm⁻²), ranging from higher N doping (Phosphorus) to higher P doping (Boron).]

Fig. 8.21. Work-function tuning with full silicidation of doped polysilicon gate material. The work-function is tuned based on the doping level of the polysilicon gate prior to the silicidation step [72].

8.4. Chapter Summary

In this chapter the multi-gate MOSFET technologies are presented as a potential replacement for the conventional single-gate MOSFET technology. Among the different multi-gate MOSFET architectures the FinFET is the most attractive due to the compatibility of the FinFET fabrication with existing technology and due to the self-alignment of the gates. The advantages of the FinFET architecture in suppressing the short-channel effects and the leakage currents as compared to the single-gate MOSFETs are described and verified using Taurus-Medici simulations. The fin thickness and the gate oxide thickness are the most critical dimensions in the design of FinFETs due to the strong impact of these dimensions on the efficiency of the FinFET architecture in suppressing the short-channel effects and enhancing the on-current. An extensive study on the impact of the fin thickness and the gate oxide thickness on the on-current, off-current, gate leakage current, sub-threshold slope, drain-induced-barrier-lowering, and the resilience to process, supply voltage, and temperature variations is presented. Finally, threshold voltage tuning techniques using independent-gate bias and work-function engineering are described. These threshold voltage tuning approaches are attractive for developing low power circuits as presented in the following chapters.


Chapter 9

Multi-Vth FinFET Sequential Circuits with Independent-Gate Bias and Work-Function Engineering for Reduced Power Consumption

Static latches and flip-flops are extensively used in synchronous integrated circuits (ICs). The main module in static latches and flip-flops is the bistable circuit formed by a pair of cross-coupled inverters. Data is written to a latch either by brute force using a stronger input circuitry as compared to the feedback inverter or by breaking the feedback loop using a clock-controlled switch (a transmission gate or a tri-state inverter). The approach based on data forcing reduces the clock load, the power consumption, and the circuit area by lowering the number of clocked transistors. Power consumed by the clock subsystem is a significant portion (e.g. reported as 33% in [40]) of the total IC power. Brute-force sequential circuits with reduced clock load and simpler circuitry are therefore widely used in the state-of-the-art integrated circuits [40], [75].

In this chapter, new FinFET latches and flip-flops that operate based on data forcing are presented. Independent-gate bias and work-function engineering are explored to achieve multi threshold voltage (multi-Vth) compact FinFET sequential circuits with reduced power consumption as compared to the standard single threshold voltage (single-Vth) tied-gate FinFET circuits [94], [95]. The sequential circuits are characterized for power consumption, delay, and noise immunity in a 32nm FinFET technology.

The chapter is organized as follows. The FinFET operation is presented in Section 9.1. The new multi-Vth FinFET static brute-force latches based on independent-gate bias and work-function engineering are described and characterized in Section 9.2. The multi-Vth master-slave flip-flops based on the new FinFET latches are described and evaluated in Section 9.3. Finally, conclusions are provided in Section 9.4.

9.1. FinFET Technology

In this section the device architectures for the tied-gate and the independent-gate FinFETs are presented. The effect of different gate bias conditions on the I-V characteristics of independent-gate FinFETs is described. The technology parameters of the FinFETs considered in this chapter are listed in Table 8.1. The architectures of tied-gate and independent-gate FinFETs are illustrated in Fig. 9.1. As presented in Chapter 8, an independent-gate FinFET has two modes of operation: single-gate mode and dual-gate mode. In the dual-gate mode both gates are biased to induce channel inversion, as illustrated in Fig. 9.2a. Alternatively, in the single-gate mode, one of the gates is deactivated, as illustrated in Fig. 9.2b. Disabling one of the gates in the single-gate mode increases the absolute value of the threshold voltage as compared to the dual-gate mode. It is therefore possible to modulate the threshold voltage of a FinFET by independent-gate bias. The current produced by an N-FinFET (P-FinFET) at 110°C is 2.55X (2.77X) higher in the dual-gate mode as compared to the single-gate mode with the 32nm FinFET technology considered in this chapter.

[Figure: 3D device structures showing the source, the drain, the gate (G1 and G2 for the independent-gate device), the insulator, the fin thickness tsi, the fin height Hfin, and the gate length L.]

Fig. 9.1. FinFET architectures. (a) Tied-gate. (b) Independent-gate.

[Figure: NMOS and PMOS transistor symbols with gate (G), source (S), and drain (D) terminals for each mode; in panel (b) the deactivated gate of the PMOS device is tied to VDD.]

Fig. 9.2. Modes of operation of independent-gate FinFETs. (a) Dual-gate mode. (b) Single-gate mode.


The threshold voltage of a FinFET can also be tuned by adjusting the work-function of the gate material [73], [74]. The gate work-function directly affects the threshold voltage. A higher gate work-function increases the threshold voltage of a FinFET. In [73] Molybdenum is used as the gate material. The work-function of the unimplanted Molybdenum is 5eV. Implanting Molybdenum with nitrogen decreases the work-function depending on the implantation dose and energy. Alternatively, in [74], total Nickel silicidation of a doped polysilicon gate is shown to result in a metallic alloy with a tunable work-function that depends on the doping type and the doping level of the polysilicon prior to the silicidation step. Independent-gate bias and work-function engineering are explored in this chapter to achieve compact multi-Vth FinFET sequential circuits with reduced power consumption as compared to the standard single-Vth tied-gate FinFET circuits. With the new sequential circuits, the total power consumption, the clock power, and the leakage power are reduced by up to 55%, 29%, and 53%, respectively, while maintaining similar speed and noise immunity characteristics as compared to the standard single-Vth tied-gate circuits in a 32nm FinFET technology.

9.2. FinFET Latches

Static FinFET latches that operate with brute force in the transparent mode are presented in this section. Two standard implementations of a brute-force latch with single-Vth tied-gate FinFETs are described in Section 9.2.1. The new multi-Vth FinFET latches based on independent-gate bias and work-function engineering are described in Section 9.2.2. The latches are characterized for power consumption, propagation delay, setup time, and noise immunity in Section 9.2.3.

9.2.1. Single-Vth Tied-Gate FinFET Latches

The standard implementations of a brute-force latch in a single-Vth tied-gate FinFET technology are shown in Fig. 9.3. The feedback inverter (I2) must be weaker than the input stage composed of the driver inverter (I1) and the transmission gate (T1) in order to be able to change the stored bit when the latch is transparent (clock signal is high). With the first single-Vth tied-gate implementation (LATCH-TG1), minimum sized low-Vth FinFETs are employed in the feedback inverter. The feedback inverter is further weakened by enhancing the resistances of the pull-up and the pull-down networks by employing transistor stacks, as shown in Fig. 9.3a. Minimum sized input drivers (I1 and T1) are able to overpower the feedback inverter (I2), thereby providing functionality with the first tied-gate FinFET latch.

With the second single-Vth tied-gate implementation (LATCH-TG2), minimum sized low-Vth tied-gate FinFETs are employed in the feedback inverter (I2), as shown in Fig. 9.3b. The size of a FinFET is quantized due to the constant fin height in a FinFET technology. The functionality with this implementation can therefore be achieved only by utilizing input drivers (I1 and T1) sized at least twice the feedback inverter. This technique based on transistor sizing with a standard minimum sized single-Vth tied-gate feedback inverter and a significantly larger input stage leads to higher input and clock load capacitances. The power consumption of LATCH-TG2 is therefore higher as compared to LATCH-TG1. Furthermore, the areas of LATCH-TG1 and LATCH-TG2 are similar although LATCH-TG1 has two extra transistors. The layouts of LATCH-TG1 and LATCH-TG2 are shown in Fig. 9.4.

[Figure: latch schematic with the driver inverter I1, the transmission gate T1, the feedback inverter I2, the output inverter I3, and the clock inverter I4, together with the transistor-level feedback inverters (single-fin transistors M1 to M4 in panel (a), M1 and M2 in panel (b)).]

Fig. 9.3. Brute-force latch implementations with single-Vth tied-gate FinFETs. (a) LATCH-TG1. (b) LATCH-TG2.

[Figure: layouts with the VDD, GND, D, CLK, Node1, and Q terminals labeled.]

Fig. 9.4. Layouts of the standard single-Vth tied-gate FinFET circuits. (a) LATCH-TG1: 0.64 µm². (b) LATCH-TG2: 0.63 µm².

9.2.2. New Brute-Force Multi-Vth FinFET Latches

Three new compact multi-Vth FinFET latches based on independent-gate bias and work-function engineering are presented in this section [94], [95]. The first multi-Vth latch (LATCH-IG), shown in Fig. 9.5a, is based on an independent-gate FinFET technology [95]. With LATCH-IG, the driver inverter (I1) and the transmission gate (T1) are minimum sized. The minimum sized transistors in the feedback path are further weakened by operating in the single-gate mode (one gate of M2 is connected to VDD and one gate of M1 is connected to GND). The driver inverter (I1) and the transmission gate (T1) therefore produce more current as compared to the feedback inverter (I2) without the need for over-sizing the input stage. The contention at Node1 and the capacitance at the output node are simultaneously reduced. The power consumption and area are thereby reduced with LATCH-IG while maintaining similar speed as compared to the standard single-Vth tied-gate FinFET circuits [95].

The second compact multi-Vth latch (LATCH-WF) is shown in Fig. 9.5b. LATCH-WF is composed of work-function engineered multi-Vth tied-gate FinFETs. The driver inverter (I1) and the transmission gate (T1) are minimum sized. The transistors of I1, T1, and I3 have low-Vth. Alternatively, the transistors of the feedback inverter are minimum sized high-Vth FinFETs. The threshold voltages of the transistors are tuned with work-function engineering [73], [74]. The gate work-functions of the low-Vth and the high-Vth N-type (P-type) FinFETs are 4.5eV (4.9eV) and 4.7eV (4.7eV), respectively. The contention between the input stage and the feedback inverter is reduced by employing minimum sized high-Vth transistors in the feedback inverter. The power consumption and area are therefore reduced as compared to the standard single-Vth tied-gate FinFET circuits.

The third compact multi-Vth latch (LATCH-WF-IG) is shown in Fig. 9.5c. LATCH-WF-IG is based on a work-function engineered independent-gate FinFET technology. The driver inverter (I1) and the transmission gate (T1) are low-Vth minimum sized gates. Alternatively, the feedback inverter is composed of minimum sized work-function engineered high-Vth FinFETs operating in the single-gate mode. The contention at Node1 is further reduced as compared to LATCH-IG and LATCH-WF. The power consumption is lower and the speed is enhanced with LATCH-WF-IG as compared to LATCH-IG, LATCH-WF, and the single-Vth tied-gate FinFET latches.

The layouts of LATCH-IG, LATCH-WF, and LATCH-WF-IG are shown in Fig. 9.6. The layout areas of the new multi-Vth FinFET latches are 21% (20%) smaller due to the fewer (smaller) transistors as compared to LATCH-TG1 (LATCH-TG2).

[Figure: latch schematic with the driver inverter I1, the transmission gate T1, the feedback inverter I2, the output inverter I3, and the clock inverter I4, together with the three transistor-level feedback inverters built from single-fin transistors M1 and M2.]

Fig. 9.5. Proposed multi-Vth brute-force latches. (a) LATCH-IG. (b) LATCH-WF. (c) LATCH-WF-IG. Thick lines indicate the high-Vth FinFETs based on work-function engineering. Work-function of a low-Vth N-type (P-type) FinFET is 4.5eV (4.9eV). Work-function of a high-Vth N-type (P-type) FinFET is 4.7eV (4.7eV).
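The direction of the threshold voltage change produced by the work-function values listed above can be checked with a first-order estimate. The sketch below assumes the textbook relation in which a change of the gate work-function shifts the threshold voltage by the same amount (1eV of work-function per 1V of Vth) and ignores all second-order effects. It is an assumption made for illustration, not the device simulation used in this work, and the function names are hypothetical.

```python
# Gate work-functions (eV) of the low-Vth and high-Vth devices, as listed in the text.
LOW_VTH_WF_EV = {"N": 4.5, "P": 4.9}
HIGH_VTH_WF_EV = {"N": 4.7, "P": 4.7}


def vth_shift_v(device: str) -> float:
    """First-order signed Vth shift (V) from the low-Vth to the high-Vth gate.

    Assumes dVth = dPhi_M / q, i.e. 1eV of work-function moves Vth by 1V.
    """
    return HIGH_VTH_WF_EV[device] - LOW_VTH_WF_EV[device]


def vth_magnitude_increase_v(device: str) -> float:
    """Increase of |Vth| (V). Vth is positive for N-type and negative for P-type."""
    shift = vth_shift_v(device)
    return shift if device == "N" else -shift


for dev in ("N", "P"):
    print(dev, round(vth_shift_v(dev), 2), round(vth_magnitude_increase_v(dev), 2))
```

Under this assumption both device types gain about 0.2V of threshold voltage magnitude: the N-type gate moves to a higher work-function, while the P-type gate moves to a lower one, making its negative threshold voltage more negative.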

[Figure: layouts with the VDD, GND, D, CLK, Node1, and Q terminals labeled.]

Fig. 9.6. Layouts of the new brute-force FinFET latches. (a) LATCH-IG and LATCH-WF-IG: 0.506 µm². (b) LATCH-WF: 0.506 µm².
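The quoted area savings follow directly from the layout areas given in the captions of Figs. 9.4 and 9.6. The short check below reproduces the arithmetic; the function name is a hypothetical helper introduced only for this illustration.

```python
def area_reduction_percent(new_area_um2: float, reference_area_um2: float) -> float:
    """Percent area saving of a new layout relative to a reference layout."""
    return (1.0 - new_area_um2 / reference_area_um2) * 100.0


NEW_LATCH_AREA = 0.506   # LATCH-IG, LATCH-WF, LATCH-WF-IG (Fig. 9.6)
LATCH_TG1_AREA = 0.64    # Fig. 9.4a
LATCH_TG2_AREA = 0.63    # Fig. 9.4b

print(round(area_reduction_percent(NEW_LATCH_AREA, LATCH_TG1_AREA)))
print(round(area_reduction_percent(NEW_LATCH_AREA, LATCH_TG2_AREA)))
```

The two results round to 21% and 20%, matching the savings stated for LATCH-TG1 and LATCH-TG2, respectively.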

9.2.3. Comparison of the FinFET Latches

A quantitative comparison of the FinFET latches is provided in this section. Each circuit drives a capacitive load of 0.2fF. The temperature is 110°C. The clock power is measured when the clock is the only switching signal with the input and the output nodes steady at 0V. The setup time is the time duration between the latest input transition and the negative edge of the clock signal (the latches evaluated in this chapter are positive latches, transparent when the clock is high) for which the data-to-Q propagation delay (TDQ) is increased by 1% relative to the minimum data-to-Q delay (TDQ-min). The latches are characterized for power consumption, propagation delay, setup time, and static noise margin, as shown in Figs. 9.7 to 9.12.

LATCH-WF-IG consumes the lowest total power due to the utilization of minimum sized gates, the lower contention at Node1, and the reduction of the output node parasitic capacitance (50.3% lower power as compared to LATCH-TG2). LATCH-TG2 consumes the highest clock power due to the larger clocked transistors (29% higher clock power as compared to the proposed multi-Vth FinFET latches). LATCH-WF and LATCH-WF-IG consume the lowest leakage power (47% lower as compared to LATCH-TG2) due to the utilization of minimum sized gates and multi-Vth transistors. The propagation delay and the setup time are minimized with LATCH-WF-IG due to the reduced output node parasitic capacitance and the weaker contention at Node1 as compared to the other latches (49% shorter delay and 71% shorter setup time as compared to LATCH-TG1). LATCH-WF provides the highest static noise margin (18% higher as compared to LATCH-TG1) due to the more symmetric voltage transfer characteristics of the cross-coupled inverters as compared to the other latches.
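The setup time definition used in this section (the data-to-clock interval at which TDQ has degraded by 1% relative to TDQ-min) can be extracted from a sweep of simulated delays as sketched below. This is a minimal illustration, not the characterization script used in this work: the function name, the linear interpolation between sweep points, and the sweep values are all assumptions.

```python
from typing import List, Tuple


def setup_time(sweep: List[Tuple[float, float]], degradation: float = 0.01) -> float:
    """Estimate the setup time from (t_dc, t_dq) pairs.

    t_dc is the interval from the data transition to the falling clock edge,
    t_dq is the resulting data-to-Q delay. The setup time is the t_dc at which
    t_dq first reaches (1 + degradation) * min(t_dq), found by linear interpolation.
    """
    points = sorted(sweep, reverse=True)  # largest data-to-clock interval first
    tdq_min = min(t_dq for _, t_dq in points)
    limit = (1.0 + degradation) * tdq_min
    if points[0][1] >= limit:
        raise ValueError("sweep does not start in the flat region of the delay curve")
    for prev, cur in zip(points, points[1:]):
        if cur[1] >= limit:
            fraction = (limit - prev[1]) / (cur[1] - prev[1])
            return prev[0] + fraction * (cur[0] - prev[0])
    raise ValueError("the delay never degrades by the requested amount")


# Hypothetical sweep in picoseconds: data arrives progressively closer to the clock edge.
example_sweep = [(30.0, 10.0), (20.0, 10.0), (15.0, 10.05), (10.0, 10.15), (5.0, 11.0)]
print(setup_time(example_sweep))
```

A finer sweep near the crossing point reduces the interpolation error; the 1% criterion itself is the one stated in the text.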

[Figures 9.7 to 9.12: bar charts comparing Latch-TG1, Latch-TG2, Latch-IG, Latch-WF, and Latch-WF-IG.]

Fig. 9.7. Total active-mode power consumption (µW) of the FinFET latches.

Fig. 9.8. Clock power (µW) of the FinFET latches.

Fig. 9.9. Leakage power (nW, averaged for four different input-output combinations in the standby mode) of the FinFET latches. Clock is gated low.

Fig. 9.10. Average propagation delay (ps) of the FinFET latches.

Fig. 9.11. Setup time (ps) of the FinFET latches.

Fig. 9.12. Static noise margin (mV) of the FinFET latches.

9.3. FinFET Flip-Flops

In this section five master-slave FinFET flip-flops are presented. The flip-flops are based on the brute-force latches presented in Section 9.2. The operation of the FinFET flip-flops is described in Section 9.3.1. The FinFET flip-flops are characterized for power consumption, clock-to-output delay, and setup time in Section 9.3.2.

9.3.1. Brute-Force FinFET Flip-Flops

The five FinFET flip-flops evaluated in this section are shown in Fig. 9.13. The flip-flops are based on the brute-force topology. To be able to transfer new data to the master stage when the clock signal is high, I1 and T1 must be significantly stronger than I2. Similarly, to be able to change the state of the slave stage when the clock is low, I3 and T2 must be significantly stronger than I5.

[Schematic labels: CLK, D, Q, Node1 to Node4, I1 to I6, T1, and T2; panels (a) to (e) show the feedback inverter transistors M1 to M4, each with 1 fin.]

Fig. 9.13. Five brute-force FinFET flip-flops. (a) FF-TG1. (b) FF-TG2. (c) FF-IG. (d) FF-WF. (e) FF-WF-IG. Thick lines indicate high-Vth FinFETs based on work-function engineering. The work-function of a low-Vth N-type (P-type) FinFET is 4.5 eV (4.9 eV). The work-function of a high-Vth N-type (P-type) FinFET is 4.7 eV (4.7 eV).

With the first single-Vth circuit (FF-TG1) shown in Fig. 9.13a, the feedback inverters are composed of minimum sized tied-gate low-Vth FinFETs. The feedback inverters are weakened using transistor stacks in the pull-up and the pull-down networks. With FF-TG1, the minimum sized input drivers are thereby able to overpower the feedback inverters. However, the contentions at Node1 and Node3 are still significant, resulting in a relatively higher data transfer power and a longer propagation delay as compared to the multi-Vth flip-flops presented in this section.

With the second single-Vth circuit (FF-TG2) shown in Fig. 9.13b, the low-Vth tied-gate transistors of the feedback inverters (I2 and I5) are sized minimum. The input drivers (I1-T1 and I3-T2) are sized twice the feedback inverters in order to overpower the feedback inverters to achieve functionality. Note that the transistor sizes are quantized in a FinFET technology. This approach based on transistor sizing to achieve brute-force functionality results in larger area, increased clock load, and higher power consumption as compared to the multi-Vth flip-flops described next.

The first multi-Vth circuit (FF-IG) shown in Fig. 9.13c is based on an independent-gate FinFET technology. The transistors of the feedback inverters (I2 and I5) are sized minimum and further weakened by operation in the single-gate mode (high-Vth). The minimum sized input drivers (I1-T1 and I3-T2) are therefore able to overpower the feedback inverters. The contentions at Node1 and Node3 are significantly reduced since the on-currents of the FinFETs operating in the single-gate mode are significantly lower than those of the transistors operating in the dual-gate mode, as described in Section 9.1. Furthermore, the parasitic capacitances at Node2 and Node4 are also reduced with FF-IG due to the disabled gates of the feedback transistors. The power consumption is therefore reduced while maintaining similar speed as compared to the single-Vth tied-gate FinFET circuits.

The second multi-Vth circuit (FF-WF) shown in Fig. 9.13d is based on a work-function engineered multi-threshold-voltage tied-gate FinFET technology. The transistors of the feedback inverters are sized minimum and further weakened by increasing the threshold voltages. The threshold voltages of the FinFETs are tuned with work-function engineering. The minimum sized input drivers (I1-T1 and I3-T2) are able to overpower the high-Vth feedback inverters due to the reduced contention at Node1 and Node3. The power consumed by data transfer, clocking, and leakage is therefore reduced with FF-WF as compared to the standard single-Vth tied-gate FinFET circuits.

The third multi-Vth circuit (FF-WF-IG) is shown in Fig. 9.13e. FF-WF-IG is based on a work-function engineered multi-Vth independent-gate FinFET technology. The minimum sized high-Vth FinFETs of the feedback inverters are further weakened by operation in the single-gate mode. The contentions at Node1 and Node3 are therefore further reduced as compared to FF-IG and FF-WF. Furthermore, the parasitic capacitances at Node2 and Node4 are reduced as compared to the tied-gate FinFET circuits. The power consumption is therefore reduced while enhancing the speed as compared to FF-IG, FF-WF, and the standard single-Vth tied-gate FinFET circuits.

9.3.2. Comparison of the FinFET Flip-Flops The FinFET flip-flops are characterized in this section for power consumption, setup time, and clock-to-output delay, as shown in Figs. 9.14 to 9.18. Each circuit drives a capacitive load of 0.4 fF. The temperature is 110°C. The total power consumption includes the power consumed during data transfer and the power consumed by the clock driver. The clock power is measured when the input and the output are idle (the clock is the only switching signal). The setup time of the flip-flop is the time interval between the input transition and the active clock edge (the flip-flops evaluated in this chapter are negative-edge triggered) for which the data-to-output delay is minimized. FF-WF-IG consumes the lowest total power due to the utilization of minimum sized gates, the reduced contentions at Node1 and Node3, and the reduced parasitic capacitances at Node2 and Node4 (55% lower power as compared to FF-TG2). FF-TG2 consumes the highest clock power due to the larger clocked transistors (40.6% higher as compared to the proposed multi-Vth circuits). FF-WF and FF-WF-IG consume the lowest leakage power (53% lower as compared to FF-TG2) due to the utilization of minimum sized and multi-Vth transistors. The setup time and the clock-to-output delay are minimized with FF-WF-IG due to the reduced contentions at Node1 and Node3 and the smaller internal node parasitic capacitances (57% shorter setup time and 40% shorter delay as compared to FF-TG1).

Fig. 9.14. Total active-mode power consumption of the FinFET flip-flops. [Bar chart of power consumption (µW) for FF-TG1, FF-TG2, FF-IG, FF-WF, and FF-WF-IG.]

Fig. 9.15. Clock power of the FinFET flip-flops. [Bar chart of clock power (µW) for FF-TG1, FF-TG2, FF-IG, FF-WF, and FF-WF-IG.]

Fig. 9.16. Leakage power (averaged for the four input-output combinations in the standby mode) of the FinFET flip-flops. Clock is gated low. [Bar chart of average leakage power (nW) for the five flip-flops.]

Fig. 9.17. Setup time of the FinFET flip-flops. [Bar chart of setup time (ps) for the five flip-flops.]

Fig. 9.18. Average propagation delay of the FinFET flip-flops. [Bar chart of average clock-to-output delay (ps) for the five flip-flops.]

9.4. Chapter Summary Multi-Vth FinFET latches and flip-flops are presented in this chapter. The latches considered in this chapter operate with brute-force in the transparent mode. For this type of latch to function correctly, the input drivers must be designed to be significantly stronger than the feedback path. By selectively utilizing multi-Vth FinFETs with independent-gate-bias and work-function engineering, the contention between the input circuitry and the feedback path of a latch is significantly reduced. New data can therefore be transferred to a transparent latch without the need for over-sizing the input drivers. With these new techniques, the smaller sizes of the transistors in the input circuitry lead to a reduction in the switched capacitance and the clock load, thereby reducing the power consumption as compared to the circuits with standard single-Vth tied-gate FinFETs. Furthermore, the area is also reduced with the proposed circuits due to the smaller and fewer transistors. The sequential circuits are characterized in a 32nm FinFET technology. With the proposed multi-Vth sequential circuits, the total active mode power consumption, the clock power, and the average leakage power are reduced by up to 55%, 29%, and 53%, respectively, as compared to the circuits with standard single-Vth tied-gate FinFETs. Furthermore, the area of the new multi-Vth circuits is reduced by up to 21% as compared to the circuits with single-Vth tied-gate FinFETs.
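As a side calculation on the clock power figures, a circuit that consumes a fraction x more power than a reference implies that the reference consumes x/(1 + x) less. Applying this to the 40.6% figure reported for FF-TG2 in Section 9.3.2 gives a value that is numerically consistent with the 29% clock power reduction quoted above. This is only a consistency check on the reported numbers, not a derivation stated in the text.

```latex
\[
  1 - \frac{1}{1 + 0.406} \;=\; \frac{0.406}{1.406} \;\approx\; 0.289
\]
```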


Chapter 10 FinFET Domino Logic with Independent Gate Keepers Domino logic circuit techniques are extensively applied in high-performance microprocessors due to the superior speed and area characteristics of dynamic CMOS circuits as compared to static CMOS circuits. Higher-speed operation of domino logic circuits, however, also implies lower noise margins as compared to static gates. As on-chip noise becomes more severe with technology scaling and increasing operating frequencies, error free operation of domino logic circuits has become a major challenge [76], [77]. In a standard domino logic gate, a feedback keeper is employed to maintain the state of the dynamic node against coupling noise, charge sharing, and sub-threshold leakage current. The keeper transistor is typically sized significantly smaller than the pull-down network transistors in order to minimize the delay and power penalty caused by the keeper contention current. A small keeper, however, cannot provide the necessary noise immunity for reliable operation in an increasingly noisy and noise sensitive on-chip environment in the scaled CMOS technologies [76]-[78]. There is, therefore, a tradeoff between reliability and high-speed/energy-efficient operation in domino logic circuits. New dynamic circuit techniques which can suppress the keeper contention current while maintaining a high noise immunity are, therefore, highly desirable.

In [76], a variable-threshold-voltage keeper is proposed for enhancing the evaluation speed and lowering the power consumption of domino logic circuits in a standard single-gate bulk CMOS technology. The threshold voltage of a keeper transistor is dynamically adjusted by body-bias with this technique. The high capacitance of an n-well, however, may prohibit the alteration of the body voltage of the keeper transistor every clock cycle. Furthermore, dual supply voltages are required for providing the necessary body-bias voltages with the technique described in [76].

An alternative conditional keeper technique employing two keeper transistors is presented in [78] to reduce the contention current in a domino circuit. With this technique, one of the keeper transistors operates unconditionally, similar to a standard domino circuit. The other keeper transistor is conditionally turned on if the dynamic node is not discharged during the evaluation phase. The use of multiple keepers increases the circuit area. Furthermore, the unconditional keeper, when implemented even with a minimum sized tied-gate FinFET, produces significant contention current. The power reduction and the speed enhancement provided with this conditional keeper technique are therefore limited, particularly in a standard tied-gate FinFET technology.

In this Chapter, a new variable threshold voltage keeper technique based on an independent-gate FinFET technology is presented for the simultaneous enhancement of the evaluation speed and the reduction of the power consumption without sacrificing the noise immunity in FinFET domino logic circuits [96]. A single tunable-strength multi-fin keeper transistor is utilized with the proposed technique, unlike the technique presented in [78] which requires multiple keeper transistors. Furthermore, the threshold voltage of the keeper transistor is adjusted with a simple independent-gate-bias mechanism utilizing only the standard voltage references (VGND and VDD) that are readily available in the system, unlike the technique presented in [76] which requires multiple power supplies and periodic switching of the bias voltage of the high capacitance n-wells. In this Chapter, all of the independent-gate bias options of a multi-fin keeper are identified and explored in order to maximize the power and delay savings while maintaining identical noise immunity characteristics as compared to the standard tied-gate FinFET domino circuits. It is shown that the evaluation delay and the power consumption can be simultaneously reduced with no degradation in noise margins with the proposed variable threshold voltage multi-gate keeper circuit technique as compared to the standard tied-gate FinFET domino circuits.

The Chapter is organized as follows. The FinFET devices are described in Section 10.1. The standard tied-gate and the proposed independent-gate FinFET domino logic circuits are presented in Section 10.2. The proposed asymmetric independent-gate FinFET domino circuit technique is compared to the standard tied-gate FinFET circuit technique in Section 10.3. Finally, conclusions are offered in Section 10.4.

10.1. FinFET Device The architectures of the FinFET devices are presented in this section. The different modes of operation of an independent-gate FinFET are illustrated. The technology parameters of the devices considered in this chapter are listed in Table 8.1. The tied-gate and the independent-gate implementations of FinFETs are illustrated in Figs. 10.1 and 10.2, respectively.

Fig. 10.1. Tied-gate FinFET architectures. (a) Single-fin transistor. (b) Two-fin transistor. [Device drawings labeled Source, Drain, and Gate.]

[Device drawings for Fig. 10.2, labeled Source, Drain, G, G1, and G2.]

Fig. 10.2. Different gate-bias options with single-fin and multi-fin independent-gate FinFETs.

The gates of a FinFET can be separated by an insulator, thereby forming an independent-gate FinFET as shown in Fig. 10.2. An independent-gate FinFET (IG-FinFET) provides two different active modes of operation with significantly different current characteristics determined by the independent bias conditions of the two gates, as explained in Chapter 8. In the dual-gate-mode, the two gates of an IG-FinFET are biased with the same signal to control the formation of a channel. Alternatively, in the single-gate-mode, one gate is biased with the input signal to induce channel inversion while the other gate is disabled (disabled gate: biased with VGND in an N-type FinFET and with VDD in a P-type FinFET). The two gates are strongly coupled in the dual-gate-mode, thereby lowering the threshold voltage |Vth| as compared to the single-gate-mode. The unique Vth modulation aspect of IG-FinFETs through selective gate bias is exploited in this Chapter to simultaneously enhance the speed and reduce the power consumption without sacrificing the noise immunity as compared to the standard domino circuits with tied-gate FinFETs.

10.2. Domino Logic Circuits Performance critical paths in high-performance integrated circuits are often implemented with domino logic circuits [76]-[78], [100]. Although domino logic circuits are preferable in high-speed applications, the reliability of domino circuits is seriously degraded with technology scaling. The operating principles of domino logic circuits are reviewed in this section. The noise immunity versus the speed and power tradeoffs in standard domino logic circuits are discussed in Section 10.2.1. The new variable threshold voltage independent-gate FinFET keeper technique for simultaneously enhancing the evaluation speed and lowering the power consumption in FinFET domino logic circuits is described in Section 10.2.2.

10.2.1. Standard Tied-Gate FinFET Domino Logic Circuits A standard footless FinFET domino gate is shown in Fig. 10.3. Domino circuits behave in the following manner. When the clock signal is low, the domino logic circuit is in the pre-charge phase. During the precharge phase, the dynamic node is charged to VDD by the precharge transistor. The output transitions low, turning on the keeper transistor. When the clock transitions high, the circuit enters the evaluation phase. In this phase, the circuit evaluates and the dynamic node is discharged to ground depending on the inputs. If the circuit does not evaluate in the evaluation phase, the high state of the dynamic node is preserved against coupling noise, charge sharing, and sub-threshold leakage current by the keeper transistor until the beginning of the subsequent pre-charge phase.

[Schematic labels: VDD, CLK, precharge transistor, standard keeper, dynamic node, inputs, pull-down network, output.]

Fig. 10.3. A standard footless domino circuit with tied-gate FinFETs.

The effect of the keeper transistor on noise immunity, evaluation delay, and power consumption of a FinFET domino logic circuit is evaluated next. The low noise margin (NML) is the noise immunity metric used in this chapter. The NML is

\[ \mathrm{NML} = V_{IL} - V_{OL}, \tag{10.1} \]

where VIL is the voltage amplitude of the DC noise signal applied to the inputs (from the beginning to the end of the evaluation phase) that produces a signal with the same amplitude at the output of a domino logic circuit [100]. VOL is the output low voltage. Simulation results for a 16-input footless domino OR gate are shown in Fig. 10.4 for various keeper sizes in a 32nm tied-gate FinFET technology. The worst-case evaluation speed is observed when a single input signal transitions to VDD while the other gate inputs are maintained at VGND. The NML is measured for a worst-case noise scenario assuming all the inputs of the OR gate are simultaneously excited by the same noise signal. As shown in Fig. 10.4, the NML is enhanced by 70% when KPR (ratio of keeper size to the size of one of the pull-down network transistors) is increased from 0.25 to 1.5. The penalty for this noise immunity enhancement with keeper sizing, however, is the degradation of the evaluation speed and the increase in the power consumption due to the higher contention current [76] produced by a larger keeper transistor. The evaluation delay and the power consumption are increased by 3.3X and 3X, respectively, when KPR is increased from 0.25 to 1.5. There is, therefore, a tradeoff between high noise immunity and high-speed/low-power operation of domino logic gates. New dynamic circuit techniques which can suppress the keeper contention current while maintaining a high noise immunity are, therefore, highly desirable.
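The definition of VIL used with (10.1) is a unity-noise-gain condition: the input noise amplitude at which the output disturbance has the same amplitude. A minimal post-processing sketch is given below, assuming a sweep of input noise amplitudes and the corresponding simulated output amplitudes is available. The function names and the linear interpolation between sweep points are illustrative assumptions, not the procedure used in the original simulations.

```python
def unity_gain_noise_amplitude(v_in, v_out):
    """Return the input noise amplitude at which the output amplitude equals it.

    v_in:  input noise amplitudes, in increasing order (V).
    v_out: output amplitude observed for each input amplitude (V).
    The crossing is located where (v_out - v_in) changes from negative to
    non-negative, and refined by linear interpolation.
    """
    for i in range(1, len(v_in)):
        d_prev = v_out[i - 1] - v_in[i - 1]
        d_curr = v_out[i] - v_in[i]
        if d_prev < 0 <= d_curr:
            frac = -d_prev / (d_curr - d_prev)
            return v_in[i - 1] + frac * (v_in[i] - v_in[i - 1])
    raise ValueError("no unity-gain crossing inside the swept range")


def low_noise_margin(v_in, v_out, v_ol):
    """NML = VIL - VOL, following Eq. (10.1)."""
    return unity_gain_noise_amplitude(v_in, v_out) - v_ol
```

The trivial point at zero noise (zero in, zero out) is skipped on purpose, since only a crossing from below counts as the unity-gain point.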

Fig. 10.4. Evaluation delay, power consumption, and NML of a standard 16-input domino OR gate in a 32nm tied-gate FinFET technology. KPR: ratio of keeper size to the size of one of the pull-down network transistors. Frequency = 4 GHz. T = 110°C. [Plot of delay (ps), power (µW), and NML (mV) versus KPR = 0.25, 0.5, 1, and 1.5.]

10.2.2. FinFET Domino with Variable-Threshold-Voltage Keeper A variable threshold voltage keeper circuit technique based on an independent-gate FinFET technology is presented in this section for simultaneous delay and power reduction without sacrificing noise immunity in domino logic circuits. The schematic of the proposed technique is shown in Fig. 10.5 [96].

[Schematic labels: VDD, CLK, delay element, NAND1, keeper control signal N1, independent-gate FinFET keeper, dynamic node, inputs, pull-down network, output.]

Fig. 10.5. Schematic of the proposed variable threshold voltage keeper independent-gate FinFET domino logic circuit technique.

The operation of the proposed domino logic circuit is as follows. In the pre-charge phase, the clock signal is low. The pull-down network is cut-off. The dynamic and the output nodes are charged and discharged to VDD and VGND, respectively. The keeper control signal (N1) transitions to VDD, disabling one of the gates of the keeper transistor. The other gate of the keeper is activated by the discharged output node. The keeper operates in the single-gate-mode with a high threshold voltage (high-Vth).

The evaluation phase begins when the clock signal transitions to high. The precharge device is cut-off. Provided that the pull-down network is activated by asserting the inputs, the dynamic node is discharged to VGND. Since the keeper threshold voltage is increased with single-gate-bias, the contention current produced by the keeper is lower. The evaluation speed is enhanced and the short-circuit power consumption is reduced due to the lower keeper contention current. After the output node is charged to VDD, the keeper is fully cut-off.

Alternatively, provided that the pull-down network is not activated for certain input vectors, the dynamic node is maintained at VDD in the evaluation phase. After some delay, the keeper control signal (N1) transitions to VGND. Both gates of the keeper are fully activated for strongly maintaining the high voltage state of the dynamic node. The keeper transistor operates in the dual-gate-mode with a low threshold voltage (low-Vth), thereby providing similar noise immunity as compared to a standard tied-gate FinFET domino circuit with the same size keeper transistor for the rest of the evaluation phase. The delay element used with the proposed technique is designed such that the combined delay of the delay element and the NAND gate (NAND1) is equal to the evaluation delay of the domino gate.
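The keeper control sequence described above can be summarized with a small logic-level model. The inputs of NAND1 are not spelled out in the text; this sketch assumes they are a delayed copy of the clock and the dynamic node, which reproduces the described behaviour of N1. All names are illustrative, and analog effects (contention, partial discharge, timing margins) are ignored.

```python
def nand(a: bool, b: bool) -> bool:
    return not (a and b)


def keeper_mode(delayed_clk: bool, dynamic: bool) -> str:
    """Logic-level view of the P-type independent-gate keeper.

    delayed_clk: clock after the delay element (assumed NAND1 input).
    dynamic:     logic state of the dynamic node (assumed NAND1 input).
    A P-type gate is active when it is driven low.
    """
    output = not dynamic              # output inverter
    n1 = nand(delayed_clk, dynamic)   # keeper control signal N1
    active_gates = int(not output) + int(not n1)
    return {
        0: "cut-off",
        1: "single-gate (high-Vth)",
        2: "dual-gate (low-Vth)",
    }[active_gates]


if __name__ == "__main__":
    print("pre-charge:           ", keeper_mode(False, True))
    print("early evaluation:     ", keeper_mode(False, True))
    print("node discharged:      ", keeper_mode(True, False))
    print("node held, after delay:", keeper_mode(True, True))
```

In this model the keeper is weak exactly while the pull-down network may be discharging the dynamic node, and it becomes strong only once the delayed clock confirms that the node was not discharged.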

10.3. Simulation Results The standard tied-gate FinFET domino technique and the proposed independent-gate FinFET domino circuit technique with asymmetric gate bias are compared in this section for the evaluation delay, the power consumption, and the noise immunity characteristics in a 32nm FinFET technology using Taurus-Medici [70]. The clock frequency is 4 GHz and the temperature is 110°C for all the simulations. The test circuits evaluated in this section include a 2-input AND gate, a 4-input OR gate, a 16-input OR gate, an 8-bit multiplexer, and a 32-bit multiplexer. The worst-case evaluation delay and the power consumption are measured when one of the parallel transistors (or transistor stacks) of the pull-down network is activated while the other inputs are connected to VGND. The NML is measured for a worst-case noise scenario with a noise signal coupling to all of the inputs. The different gate bias options of the multi-fin keeper transistors with the proposed independent-gate biased FinFET domino circuits are explored in this section. A 16-input domino OR gate with KPR = 0.75 (a keeper with three fins) is used as an example for illustrating the different bias options of an independent-gate transistor. Each fin is controlled by two independent gates. Seven configurations are examined in which the output inverter drives a different number of gates of the multi-fin keeper transistor (G2 = {0..6}) with the technique illustrated in Fig. 10.5. The other gates (G1 = {6..0}) of the keeper transistor are driven by NAND1. The seven gate bias options for a 3-fin keeper are illustrated in Fig. 10.6.
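The counting behind the seven configurations can be sketched as follows: each fin contributes two independent gates, and every gate is driven either by NAND1 (G1) or by the output (G2). The mapping from KPR to the number of keeper fins (one fin per 0.25 of KPR) is inferred from the three-fin example above and from Table 10.1; the function names are illustrative.

```python
def keeper_fins(kpr: float, kpr_per_fin: float = 0.25) -> int:
    """Number of keeper fins for a given KPR (assumed: 0.25 KPR per fin)."""
    return round(kpr / kpr_per_fin)


def gate_bias_options(fins: int):
    """All (G1, G2) splits of the 2 * fins independent keeper gates."""
    total_gates = 2 * fins
    return [(total_gates - g2, g2) for g2 in range(total_gates + 1)]


if __name__ == "__main__":
    fins = keeper_fins(0.75)
    for g1, g2 in gate_bias_options(fins):
        print(f"G1 = {g1}, G2 = {g2}")
```

A keeper with n fins therefore has 2n + 1 bias options, ranging from all gates driven by NAND1 to all gates driven by the output.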

[Diagram of the seven gate bias options of a three-fin keeper, from G1 = 6, G2 = 0 to G1 = 0, G2 = 6. The option G1 = 5, G2 = 1 is annotated: lowest delay and power with no degradation in NML.]

Fig. 10.6. Gate bias options of a three-fin independent-gate keeper FinFET. G1: number of independent keeper gates driven by NAND1. G2: number of independent keeper gates driven by the output.

The simulation results for these different gate bias options are shown in Fig. 10.7. The lowest power and delay are achieved with the first bias option where all the gates of the keeper transistor are driven by NAND1 (G1 = 6 and G2 = 0). The NML, however, is the lowest for this bias option since the keeper is completely turned off at the beginning of the evaluation phase. For G1 = 6 and G2 = 0 the proposed technique becomes essentially an extension of the technique presented in [77] to the FinFET domino logic circuits. Alternatively, when the output drives one gate of the keeper transistor (G1 = 5 and G2 = 1), the NML is the same as a standard tied-gate domino circuit with the same keeper size while the power consumption and the evaluation delay are reduced by 20% and 22%, respectively. Further increasing G2 does not enhance the noise immunity. However, as G2 is increased beyond one, the keeper contention current is increased due to the lower threshold voltage of the keeper transistor. For G2 greater than one, therefore, the power consumption savings and the speed are reduced with no additional benefit in noise immunity, as shown in Fig. 10.7. The optimum keeper gate bias scheme is identified for each test circuit with the proposed independent-gate FinFET technique to minimize the evaluation delay and the power consumption while maintaining identical noise immunity as compared to the standard tied-gate FinFET circuits. The optimum bias conditions for the various circuits with different KPRs are listed in Table 10.1.
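The selection rule described above (minimum delay and power among the bias options that do not degrade the NML relative to the tied-gate reference) can be written as a small filter-and-minimize step. This is a sketch under stated assumptions: the record type, the field names, the tolerance parameter, and the tie-breaking order (delay first, then power) are illustrative and are not taken from the original flow.

```python
from typing import NamedTuple, Sequence


class BiasResult(NamedTuple):
    g1: int           # keeper gates driven by NAND1
    g2: int           # keeper gates driven by the output
    delay_ps: float   # evaluation delay
    power_uw: float   # power consumption
    nml_mv: float     # low noise margin


def optimum_bias(results: Sequence[BiasResult], tol_mv: float = 0.0) -> BiasResult:
    """Pick the fastest, then lowest-power, option with no NML degradation.

    The reference NML is taken from the entry with G1 = 0, which corresponds
    to the standard tied-gate keeper (all gates driven by the output).
    """
    references = [r for r in results if r.g1 == 0]
    if not references:
        raise ValueError("results must include the tied-gate reference (G1 = 0)")
    nml_ref = references[0].nml_mv
    admissible = [r for r in results if r.nml_mv >= nml_ref - tol_mv]
    return min(admissible, key=lambda r: (r.delay_ps, r.power_uw))
```

The function is meant to be fed with simulated values for each (G1, G2) option of one circuit and keeper size; a small positive `tol_mv` allows for numerical noise in the measured margins.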

The simulation results with the optimum keeper gate bias conditions for different types of domino circuits employing various keeper sizes are shown in Figs. 10.8 and 10.9 for the evaluation delay and the power consumption, respectively. The evaluation delay is significantly reduced by up to 49% (16-input OR with KPR = 1.5) with the proposed dynamic threshold voltage keeper technique while maintaining identical NML as compared to the standard domino circuits, as shown in Fig. 10.8. Similarly, the power consumption is significantly reduced by up to 46% (16-input OR with KPR = 1.5) with the proposed variable threshold voltage keeper technique as compared to the standard domino circuits, as shown in Fig. 10.9.

Fig. 10.7. Delay, power, and NML characteristics of a 16-input domino OR gate with KPR = 0.75. G2: number of independent keeper gates driven by the output. G2 = 6 corresponds to the standard tied-gate FinFET domino circuit. [Plot of delay (ps), power (µW), and NML (mV) versus G2 = 0 to 6.]

TABLE 10.1. THE INDEPENDENT-GATE KEEPER OPTIMUM BIAS CONDITIONS FOR ACHIEVING MINIMUM DELAY AND POWER CONSUMPTION WITH NO DEGRADATION IN NML

KPR    Number of     2-input AND   4-input OR    16-input OR   8-bit Multiplexer   32-bit Multiplexer
       keeper fins   G1    G2      G1    G2      G1    G2      G1    G2            G1    G2
0.25   1             1     1       1     1       1     1       1     1             1     1
0.5    2             1     3       2     2       3     1       4     0             4     0
1      4             1     7       2     6       5     3       6     2             8     0
1.5    6             3     9       3     9       5     7       4     8             7     5

* G1: number of independent keeper gates driven by NAND1. G2: number of independent keeper gates driven by the output.

Fig. 10.8. Comparison of the evaluation delay of the standard tied-gate and the proposed variable threshold voltage keeper independent-gate FinFET techniques for different domino circuits and various keeper sizes. For each comparison case, the two techniques provide identical noise margin. [Bar chart of delay (ps) for the standard tied-gate and the proposed independent-gate circuits at KPR = 0.25, 0.5, 1, and 1.5 for the 2-input AND, 4-input OR, 16-input OR, 8-bit multiplexer, and 32-bit multiplexer. Annotated reductions in the chart: -49%, -33%, -32%.]

Fig. 10.9. Comparison of the power consumption of the standard tied-gate and the proposed variable threshold voltage keeper independent-gate FinFET techniques for different domino circuits and various keeper sizes. For each comparison case, the two techniques provide identical noise margin. [Bar chart of power consumption (µW) for the standard tied-gate and the proposed independent-gate circuits at KPR = 0.25, 0.5, 1, and 1.5 for the 2-input AND, 4-input OR, 16-input OR, 8-bit multiplexer, and 32-bit multiplexer. Annotated reductions in the chart: -28%, -46%, -33%.]

10.4. Chapter Summary A new high-speed and low-power domino logic circuit technique based on an independent-gate FinFET technology is presented in this chapter. The proposed technique dynamically changes the threshold voltage of the keeper transistor with a specific delay after the beginning of each operational phase (evaluation and pre-charge) of a domino circuit by independently biasing the multiple gates of the keeper transistor. The keeper contention current is reduced by increasing the keeper threshold voltage, which is achieved by operating the keeper in the single-gate-mode at the beginning of the evaluation phase. Similarly, a degradation in noise immunity is avoided by dynamically and conditionally reducing the keeper threshold voltage after a delay greater than the worst case evaluation delay of a domino logic circuit, provided that the dynamic node is not discharged in the evaluation phase. The new circuit technique is characterized with different logic gates for various keeper sizes and multi-fin keeper gate bias options in this chapter. With the proposed technique, the evaluation delay and the power consumption are simultaneously reduced by up to 49% and 46%, respectively, without sacrificing the noise immunity as compared to the standard tied-gate FinFET domino logic circuits in a 32nm FinFET technology.


Chapter 11 Low Power and Robust Independent-Gate FinFET SRAM Cells Lower voltages and smaller devices cause a significant degradation in SRAM cell data stability with the scaling of CMOS technology. Maintaining the data stability of SRAM cells is expected to become increasingly challenging as the device dimensions are scaled to the sub-45nm regime. In addition to the data stability issues, SRAM arrays are also an important source of leakage due to the enormous number of transistors in the memory caches. The development of a robust SRAM cell that can provide enhanced memory integration density and lower leakage power with the emerging FinFET technologies is highly desirable. The data stability of a conventional six transistor (6T) SRAM cell is characterized by the static noise margin (SNM) during a read operation [52]. The data is most vulnerable to external noise during a read operation due to the intrinsic disturbance caused by the direct data-read-access mechanism of a standard 6T SRAM cell. A minimum size SRAM cell is highly desirable for maximizing the memory integration density. The noise margins of a minimum size standard SRAM cell are, however, dangerously low. The SNM is typically enhanced by increasing the size of the pull-down devices of the cross-coupled inverters in a 6T SRAM cell. This standard approach based on transistor sizing to achieve enhanced cell stability, however, causes a significant increase in the cell area and a higher leakage power consumption. Employing multiple threshold voltage (multi-Vt) transistors by independent-gate bias is an alternative approach for enhancing the data stability of a minimum sized 6T SRAM cell. Three independent-gate FinFET SRAM techniques are presented in [50], [98], [99], and [101] for enhanced data stability and reduced leakage power as compared to the standard tied-gate FinFET SRAM cells. 
In this Chapter the standard tied-gate and the proposed independent-gate-biased FinFET SRAM cells are characterized and compared for data stability, leakage power, read current, and the cell area. The Chapter is organized as follows. The standard low-Vt tied-gate FinFET SRAM cells and the multi-Vt FinFET SRAM cells based on independent-gate-bias are described in Section 11.1. Data stability, leakage power, read current, and cell area characteristics of the FinFET SRAM cells are compared in Section 11.2. Finally, conclusions are offered in Section 11.3.


11.1. FinFET SRAM Cells The design considerations for the reliable operation of the 6T FinFET SRAM circuits are provided in this section. The standard low-Vt tied-gate FinFET SRAM cells are presented in Section 11.1.1. The independent-gate FinFET SRAM circuits are described in Section 11.1.2.

11.1.1. Standard Low-Vt Tied-Gate FinFET SRAM Cells The data stability of a memory circuit is most vulnerable to external noise during a read operation due to the intrinsic disturbance produced by the direct data-read-access mechanism of the standard 6T SRAM cells. In order to maintain the read stability, the pull-down transistors within the cross-coupled inverters must be stronger than the bitline access transistors. Alternatively, for write ability, the bitline access transistors must be stronger than the pull-up transistors within the cross-coupled inverters. Three tied-gate FinFET SRAM cells (TG1, TG2, and TG3) with different pull-down to access transistor ratios are presented in this section for achieving different levels of data stability, as shown in Fig. 11.1. All of the six transistors in TG1 are sized minimum (one fin). A minimum sized SRAM cell is highly desirable for maximizing the memory integration density. For enhanced noise immunity and read stability, however, the pull-down transistors of the cross-coupled inverters of TG2 and TG3 are sized twice and three times the minimum size, respectively. This enhancement in stability through transistor sizing, unfortunately, comes at a cost of significantly higher leakage power consumption and larger cell area.

[Schematic labels: VDD, bitlines BL and BLB, wordline WL, transistors P1, P2, and N1 to N4, Node1, Node2; the pull-down transistors have 1, 2, or 3 fins and all other transistors have 1 fin.]

Fig. 11.1. Three tied-gate FinFET SRAM cells. TG1: all six transistors are sized minimum. TG2: the pull-down transistors in the cross-coupled inverters have two fins. TG3: the pull-down transistors in the cross-coupled inverters have three fins.

11.1.2. Independent-Gate FinFET SRAM Cells

The two vertical gates of a FinFET can be separated, thereby forming an independent-gate FinFET. An independent-gate FinFET (IG-FinFET) provides two different active modes of operation with significantly different current characteristics determined by the bias conditions of the two independent gates. In the dual-gate-mode, the two gates are biased with the same signal to control the formation of a conducting channel. Alternatively, in the single-gate-mode, one gate is biased with the input signal to induce channel inversion while the other gate is disabled (disabled gate: biased with VGND in an N-type FinFET and with VDD in a P-type FinFET). The two gates are strongly coupled in the dual-gate-mode, thereby lowering the threshold voltage (Vt) as compared to the single-gate-mode. The maximum drain current produced by an N-type (P-type) FinFET operating in the dual-gate-mode is 2.6 (2.77) times higher as compared to the single-gate-mode [95], [98]. The switched gate capacitance of a FinFET is also halved in the single-gate-mode due to the disabled back gate. The unique Vt modulation aspect of IG-FinFETs through selective gate bias is exploited in [50], [98], [99], and [101] to enhance the SRAM data stability and the integration density while lowering the static and dynamic power consumption with minimum sized transistors.

Three IG-FinFET 6T SRAM cells are presented in this section. All of the transistors in the independent-gate FinFET SRAM cells have a single fin (minimum width), as shown in Fig. 11.2. With the first independent-gate FinFET SRAM cell (IG1) shown in Fig. 11.2a, the pull-down transistors in the cross-coupled inverters are tied-gate FinFETs. Alternatively, the access transistors and the pull-up transistors in the cross-coupled inverters are independent-gate FinFETs operating in the single-gate-mode. The access transistors act as weak high-Vt devices.
The disturbance caused by the direct-data-access mechanism during read operations is thereby suppressed without the need for increasing the sizes of the transistors within the cross-coupled inverters. The data stability is enhanced as compared to a tied-gate FinFET SRAM cell with the same transistor sizing. With the second independent-gate FinFET SRAM cell (IG2) shown in Fig. 11.2b, the pull-down transistors in the cross-coupled inverters and the bitline access transistors are tied-gate FinFETs. Alternatively, the pull-up transistors within the cross-coupled inverters are independent-gate FinFETs operating in the single-gate-mode. The bitline access transistors are weak P-type FinFETs. The disturbance at the data storage nodes is therefore suppressed during a read operation. Furthermore, during a read operation, one of the P-type access transistors enhances the pull-up strength on the side that stores a “1”. The data stability of IG2 is thereby further enhanced as compared to a tied-gate FinFET SRAM cell with the same transistor sizes. The bitline access transistors operating in the dual-gate-mode are stronger as compared to the pull-up transistors operating in the single-gate-mode. The write-ability is thereby achieved with minimum sized transistors with IG2.

Fig. 11.2. The IG-FinFET SRAM cells. (a) IG1. (b) IG2. (c) IG3.

With the third independent-gate FinFET SRAM cell (IG3) shown in Fig. 11.2c, the transistors in the cross-coupled inverters are tied-gate FinFETs. Alternatively, the access transistors are independent-gate FinFETs. The unique Vt modulation aspect of IG-FinFETs through selective gate bias is exploited with IG3 by dynamically tuning the read and write strength of the access transistors. IG3 provides two separate data access mechanisms for the read and write operations. One gate of each access transistor is controlled by a read/write signal (RW). The second gate of each access transistor is controlled by a separate write signal (W), as shown in Fig. 11.2c. Both the RW and W signals are maintained low in an un-accessed SRAM cell. During a read operation, the RW signal transitions high while W is maintained low. The access transistors N3 and N4 (with one gate disabled) therefore act as high-Vt devices that conduct significantly less current than the tied-gate pull-down transistors (N1 and N2) during a read operation. The intrinsic data disturbance that occurs due to the direct-data-read-access mechanism of the 6T SRAM cell topology is suppressed with IG3, thereby enhancing the read stability as compared to the standard tied-gate FinFET SRAM circuits. Alternatively, during a write operation, both RW and W transition high. The two access transistors N3 and N4 act as low-Vt devices conducting significantly higher current. The write-ability is thereby achieved with minimum sized transistors with IG3.
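
The sizing argument above can be put into rough numbers. The following is a minimal sketch, assuming (this is not stated in the text) that drive strength scales linearly with the number of fins; the only figure taken from the text is the 2.6x dual-gate to single-gate on-current ratio of an N-type FinFET. The names `n_strength`, `CELLS`, and `read_ratio` are hypothetical. IG2 is left out because its access transistors are P-type and cannot be compared with the N-type pull-down devices this way.

```python
# Assumption (not from the text): on-current scales linearly with fin count.
# The 2.6x factor is the dual-gate/single-gate on-current ratio quoted for
# N-type FinFETs.
DUAL_TO_SINGLE_N = 2.6


def n_strength(fins, dual_gate=True):
    """Relative on-current of an N-type FinFET (1 fin, single-gate = 1.0)."""
    return fins * (DUAL_TO_SINGLE_N if dual_gate else 1.0)


# cell name -> (pull-down fins, access transistor in dual-gate-mode on a read?)
CELLS = {
    "TG1": (1, True),
    "TG2": (2, True),
    "TG3": (3, True),
    "IG1": (1, False),  # access transistors permanently in single-gate-mode
    "IG3": (1, False),  # W is held low during a read
}


def read_ratio(cell):
    """Pull-down to access strength ratio during a read operation."""
    pull_down_fins, access_dual = CELLS[cell]
    return n_strength(pull_down_fins) / n_strength(1, access_dual)


for name in CELLS:
    print(f"{name}: pull-down/access = {read_ratio(name):.2f}")
```

Under these assumptions the ratio is 1 for TG1, 2 for TG2, 3 for TG3, and 2.6 for IG1 and IG3 during a read. This is only a first-order figure: it ignores the pull-up transistors, which also differ between the cells, so it should not be expected to reproduce the simulated SNM ordering.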

11.2. Simulation Results

The read stability, the leakage power, the cell read current, and the cell area of the three tied-gate FinFET SRAM cells (TG1, TG2, and TG3) and the three independent-gate FinFET SRAM cells (IG1, IG2, and IG3) are compared in this section for a 32nm FinFET technology. The physical technology parameters used for the MEDICI [70] simulations are listed in Table 8.1. VDD is 0.8 V.

11.2.1. Read Stability

Static noise margin (SNM) is the metric used in this chapter to characterize the read stability of the SRAM cells. The SNM is the minimum DC noise voltage necessary to flip the state of an SRAM cell. The read SNMs of the FinFET SRAM cells are shown in Fig. 11.3.

Fig. 11.3. The read SNMs of the FinFET SRAM cells. T = 70°C. IG3 provides the highest SNM.

The disturbance induced at the storage nodes during a read operation is significantly suppressed by weakening the access transistors with the IG1, IG2, and IG3 cells as compared to the minimum sized tied-gate SRAM cell (TG1). The read SNM of the IG1, IG2, and IG3 cells is enhanced by 50%, 60%, and 92%, respectively, as compared to the TG1 cell. Alternatively, with the TG2 and TG3 cells, the disturbance during a read operation is reduced by increasing the size of the pull-down devices in the cross-coupled inverters. The read SNM of the TG2 and TG3 cells is enhanced by 65% and 73%, respectively, as compared to the TG1 cell. IG3 provides the highest data stability among the FinFET SRAM circuits evaluated in this chapter.

11.2.2. Leakage Power Consumption

The leakage power consumption of the SRAM cells at 70°C is shown in Fig. 11.4. The leakage power of an SRAM cell is determined by the total effective transistor width and the threshold voltages of the transistors that produce the leakage current. Transistor sizing for enhanced data stability comes at a cost of significant additional leakage power with the TG2 and TG3 cells, as illustrated in Fig. 11.4. With the TG1, IG1, IG2, and IG3 cells, all the transistors are sized minimum. Furthermore, with the IG2 SRAM cell, the leakage current of the PMOS access transistors is lower as compared to the NMOS access transistors employed with the other SRAM cells. Note that a minimum sized P-type FinFET produces significantly smaller sub-threshold and gate-oxide leakage currents as compared to a minimum sized N-type FinFET. IG2 therefore consumes the lowest leakage power among the memory circuits considered in this chapter. IG2 consumes 48% and 61% lower leakage power as compared to the TG2 and TG3 cells, respectively. IG2 also consumes 21% lower leakage power as compared to the TG1, IG1, and IG3 SRAM cells.

Fig. 11.4. The leakage power consumptions of the FinFET SRAM cells at 70°C. IG2 consumes the lowest leakage power.

11.2.3. Cell Read Current

The peak read current of the FinFET SRAM cells is shown in Fig. 11.5. The cell read current is measured as the maximum current drawn from the bitline during a read operation. The cell read current of TG2 and TG3 is 14.6% and 18% higher, respectively, as compared to TG1 due to the larger pull-down transistors. The cell read current is reduced by 56.4% with IG1 and IG3 as compared to TG1 due to the weaker access transistors operating in the single-gate-mode. The cell read current is reduced by 36% with IG2 as compared to TG1 due to the weaker P-type access transistors. The read current of the IG2 cell is 46% higher as compared to the IG1 and the IG3 cells since a single-fin PMOS access transistor operating in the dual-gate-mode produces higher on-current as compared to a single-fin NMOS transistor operating in the single-gate-mode.

Fig. 11.5. The peak read currents of the FinFET SRAM cells. T = 70°C.
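
The relative read currents quoted in this section can be cross-checked against each other. The short sketch below uses only the reductions relative to TG1 given in the text (56.4% for IG1 and IG3, 36% for IG2); the helper names `relative_current` and `percent_higher` are hypothetical.

```python
# Read-current reductions relative to TG1, as quoted in the text.
REDUCTION_VS_TG1 = {"IG1": 0.564, "IG2": 0.36, "IG3": 0.564}


def relative_current(cell):
    """Read current of a cell as a fraction of the TG1 read current."""
    return 1.0 - REDUCTION_VS_TG1[cell]


def percent_higher(cell_a, cell_b):
    """How much higher (in %) the read current of cell_a is than cell_b."""
    return 100.0 * (relative_current(cell_a) / relative_current(cell_b) - 1.0)


print(f"IG2 vs IG1: {percent_higher('IG2', 'IG1'):.1f}% higher")
```

The result lands between 46% and 47%, in line with the roughly 46% quoted for IG2 relative to IG1 and IG3; the small difference is presumably due to rounding of the quoted percentages.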

11.2.4. SRAM Cell Area

The thin-cell layouts of the FinFET SRAM cells are shown in Figs. 11.6-11.11. The fin pitch is assumed to be 6 times the fin thickness in the layouts. TG1 and IG2 have the smallest area since all six transistors are minimum sized with only one fin and fewer contacts are needed. TG3 has the largest area since the pull-down transistors in the cross-coupled inverters have three fins. TG3 is 34% larger than the smallest cell. IG1 and IG3 are 23% larger as compared to the smallest cells due to the extra internal contacts.

Fig. 11.6. Layout of the TG1 FinFET SRAM cell. Layout area = 0.18 µm².

Fig. 11.7. Layout of the TG2 FinFET SRAM cell. Layout area = 0.21 µm².

Fig. 11.8. Layout of the TG3 FinFET SRAM cell. Layout area = 0.24 µm².

Fig. 11.9. Layout of the IG1 FinFET SRAM cell. Layout area = 0.22 µm².

Fig. 11.10. Layout of the IG2 FinFET SRAM cell. Layout area = 0.18 µm².

Fig. 11.11. Layout of the IG3 FinFET SRAM cell. Layout area = 0.22 µm².

Fig. 11.12. Normalized area of the FinFET SRAM cells. TG1 and IG2 occupy the smallest layout area.
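
The normalization behind Fig. 11.12 can be sketched from the layout areas given in the captions of Figs. 11.6-11.11. The dictionary and function names below are hypothetical; the areas are the rounded caption values.

```python
# Layout areas in square micrometres, taken from the captions of
# Figs. 11.6-11.11 (values are rounded to two decimals there).
AREA_UM2 = {
    "TG1": 0.18, "TG2": 0.21, "TG3": 0.24,
    "IG1": 0.22, "IG2": 0.18, "IG3": 0.22,
}


def normalized_areas(areas):
    """Divide every cell area by the smallest cell area."""
    smallest = min(areas.values())
    return {cell: area / smallest for cell, area in areas.items()}


for cell, ratio in normalized_areas(AREA_UM2).items():
    print(f"{cell}: {ratio:.2f}")
```

Because the caption areas are rounded, the ratios computed this way (about 1.33 for TG3 and about 1.22 for IG1 and IG3) come out slightly below the 34% and 23% quoted in the text, which were presumably derived from unrounded layout areas.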

11.2.5. Process Parameter Variations

The effect of process variations on the tied-gate and the independent-gate FinFET SRAM cells is evaluated in this section. A total of 1500 Monte Carlo simulations are run with Taurus-Medici using a Perl script. The channel length, the fin height, the fin thickness, and the gate oxide thickness are assumed to have independent Gaussian distributions with 3σ variations of 10%. The statistical distributions of the leakage power and the SNM of the SRAM cells are shown in Figs. 11.13 and 11.14, respectively. With the proposed independent-gate SRAM cells, the mean and the standard deviation (SD) of the leakage power are reduced by up to 61% and 65%, respectively, as compared to standard tied-gate FinFET SRAM cells. Furthermore, with the proposed independent-gate SRAM cells, the mean SNM is enhanced by up to 80% as compared to the TG1 SRAM cell.

The leakage power distributions of IG2 and TG1 intersect at 10.92nW. With the IG2 SRAM circuit, 79% of the statistical samples consume less than 10.92nW of leakage power. Alternatively, with the TG1 SRAM cell, 82% of the statistical samples consume more than 10.92nW of leakage power, as illustrated in Fig. 11.13. The leakage power distributions of IG2 and TG2 intersect at 13.65nW. With the IG2 circuit, 99% of the statistical samples consume less than 13.65nW of leakage power. Alternatively, with TG2, 99% of the statistical samples consume more than 13.65nW of leakage power. All the statistical samples of IG2 consume lower leakage power as compared to the statistical samples of TG3.

Fig. 11.13. Statistical leakage power distributions of the FinFET SRAM cells.

Fig. 11.14. Statistical SNM distributions of the FinFET SRAM cells.
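
The sampling step of such a Monte Carlo analysis can be sketched as follows. This is a minimal illustration, not the Perl script used with Taurus-Medici: the nominal values below are placeholders (the actual technology parameters are those of Table 8.1, which is not reproduced here), and the function names are hypothetical. Only the sample count, the four varied parameters, the independent Gaussian distributions, and the 10% 3σ spread come from the text.

```python
import random

# PLACEHOLDER nominal values in nm, for illustration only. The real
# technology parameters are listed in Table 8.1.
NOMINAL_NM = {
    "channel_length": 32.0,
    "fin_height": 32.0,
    "fin_thickness": 8.0,
    "gate_oxide_thickness": 1.6,
}
THREE_SIGMA_FRACTION = 0.10  # 3-sigma variation of 10% (from the text)
NUM_SAMPLES = 1500           # number of Monte Carlo runs (from the text)


def draw_sample(rng):
    """Draw one set of independent, Gaussian-distributed parameters."""
    sample = {}
    for name, nominal in NOMINAL_NM.items():
        sigma = THREE_SIGMA_FRACTION * nominal / 3.0
        sample[name] = rng.gauss(nominal, sigma)
    return sample


def draw_samples(count=NUM_SAMPLES, seed=0):
    """Return `count` parameter sets from a seeded random generator."""
    rng = random.Random(seed)
    return [draw_sample(rng) for _ in range(count)]


samples = draw_samples()
print(len(samples), samples[0])
```

Each sampled parameter set would then be passed to the device/circuit simulator to obtain one leakage power and one SNM value per cell.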

11.3. Chapter Summary

Independent-gate FinFET SRAM cells and the standard low-threshold-voltage tied-gate FinFET SRAM cells are presented and characterized in this Chapter for a 32nm FinFET technology. The access transistors are weakened by threshold voltage tuning with the independent-gate-bias. The data stability of the independent-gate FinFET SRAM circuits is thereby significantly enhanced as compared to a minimum sized low-threshold-voltage tied-gate FinFET SRAM cell. Among the SRAM cells presented in this chapter, the highest data stability is provided by the independent-gate FinFET SRAM cell that is capable of dynamically tuning the access transistor threshold voltages during the read and write operations. Up to 92% enhancement in SNM is observed as compared to the minimum sized standard tied-gate 6T SRAM cell. Alternatively, the independent-gate FinFET SRAM cell with P-type access transistors consumes the lowest leakage power and occupies the smallest area among the SRAM cells evaluated in this Chapter. The leakage power and the cell area are reduced by up to 61% and 25.5%, respectively, as compared to a standard low-threshold-voltage tied-gate FinFET SRAM cell sized for comparable read stability. The low-Vt tied-gate FinFET SRAM cell sized for sufficient data stability provides the lowest read delay among the cells considered in this Chapter. The advantages of the independent-gate FinFET SRAM circuits are verified under process parameter variations.

Chapter 12 Work-Function Engineering for Reduced Power and Higher Integration Density: An Alternative to Sizing for Stability in FinFET Memory Circuits

The amount of embedded memory in modern microprocessors and systems-on-chip (SoC) increases to meet the performance requirements in each new technology generation. Lower voltages and smaller devices cause a significant degradation in SRAM cell data stability with the scaling of CMOS technology. Maintaining the data stability of memory circuits is expected to become increasingly challenging as the device dimensions are scaled to the sub-45nm regime with the emerging FinFET technologies. In addition to the data stability issues, memory circuits are also important sources of leakage power consumption due to the enormous number of transistors in the embedded memory banks. The development of a new SRAM cell that can provide enhanced data stability, higher integration density, and lower leakage power with the emerging FinFET technologies is highly desirable.

The data stability of a 6T SRAM cell is characterized by the static noise margin (SNM) during a read operation [52]. The data is most vulnerable to external noise during a read operation due to the intrinsic disturbance caused by the direct data-read-access mechanism of a standard 6T SRAM cell. A minimum size SRAM cell is highly desirable for maximizing the memory integration density. The noise margins of a minimum size standard SRAM cell are, however, dangerously low. The SNM is typically enhanced by increasing the size of the pull-down devices of the cross-coupled inverters in a 6T SRAM cell. This standard approach based on transistor sizing to achieve enhanced cell stability, however, causes a significant increase in the cell area and higher leakage power consumption. An alternative approach is to selectively use high-threshold-voltage (high-Vt) devices for simultaneously enhancing the SNM and lowering the leakage power consumption of a multi-Vt memory cell.
The use of work-function engineering to control the threshold voltage of the FinFETs is explored in this Chapter for achieving minimum sized multi-Vt 6T SRAM cells with sufficient data stability and lower leakage power consumption characteristics [97].

The Chapter is organized as follows. A design methodology for optimizing the work-functions of a minimum sized 6T SRAM cell to achieve enhanced data stability and lower leakage power consumption is presented in Section 12.1. Data stability, power consumption, propagation delay, and layout area characteristics of the standard SRAM cells with single-low-Vt FinFETs and the proposed work-function engineered multi-Vt SRAM cells are compared in Section 12.2. Finally, conclusions are offered in Section 12.3.

12.1. Work-Function Engineered SRAM Cells

The threshold voltage is typically tuned by adjusting the channel doping concentration in a conventional single-gate bulk MOSFET. Alternatively, in a FinFET technology, the threshold voltage is typically tuned by adjusting the work-function of the gate material [71], [73], [74]. The gate work-function affects the threshold voltage directly. A higher gate work-function increases the threshold voltage of a FinFET. In [71] and [73], molybdenum is used as the gate material. The work-function of the unimplanted molybdenum is 5eV. Implanting molybdenum with nitrogen decreases the work-function depending on the implantation dose and energy, as listed in Table 12.1. Alternatively, in [74], total nickel silicidation of a doped polysilicon gate is shown to result in a metallic alloy with a tunable work-function that depends on the doping type and the doping level of the polysilicon prior to the silicidation step.

Minimum sized multi-Vt 6T SRAM cells are characterized in this section with different gate work-functions for the pull-down transistors (WF-pulldown), the pull-up transistors (WF-pullup), and the access transistors (WF-access). Mixed-mode device/circuit simulation with TAURUS-MEDICI [70] is used to characterize the circuits in a 32nm FinFET technology. The device and circuit parameters of the low-Vt transistors are listed in Table 8.1. The gate work-functions of the access, the pull-up, and the pull-down devices are varied in the ranges of [4.5eV to 5eV], [4.5eV to 5eV], and [4.5eV to 4.6eV], respectively. The gate materials that can be used to achieve the work-functions in the range of [4.5eV to 5eV] are listed in Table 12.1 (data extracted from [71]). The room temperature nominal low threshold voltages (pre-work-function engineering) of 0.23V and -0.28V are achieved with the gate work-functions of 4.5eV and 4.9eV for the N-type and the P-type transistors, respectively, in this 32nm FinFET technology.

TABLE 12.1. THE GATE MATERIAL COMPOSITIONS TO ACHIEVE DIFFERENT WORK-FUNCTIONS. DATA EXTRACTED FROM [71]

Gate Work-Function (eV)    Gate Material
4.5                        Mo implanted with nitrogen. Dose = 5×10¹⁵ cm⁻². Implant energy = 26 keV
4.6                        Mo implanted with nitrogen. Dose = 5×10¹⁵ cm⁻². Implant energy = 22 keV
4.7                        Mo implanted with nitrogen. Dose = 5×10¹⁵ cm⁻². Implant energy = 17 keV
4.8                        Mo implanted with nitrogen. Dose = 5×10¹⁵ cm⁻². Implant energy = 12 keV
4.9                        Mo implanted with nitrogen. Dose = 5×10¹⁵ cm⁻². Implant energy = 6 keV
5.0                        Pure Mo
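
As a rough guide to how the work-function range of Table 12.1 maps onto threshold voltages, the sketch below applies a first-order estimate in which the threshold voltage shifts one-for-one with the gate work-function. This one-for-one relation is a textbook flat-band assumption introduced here for illustration, not a result of the MEDICI simulations, and it ignores short-channel and quantum-confinement effects. The reference points (0.23V at 4.5eV for N-type, -0.28V at 4.9eV for P-type) are the nominal values quoted above; the function name `estimate_vt` is hypothetical.

```python
# Nominal low-Vt reference points quoted in the text (room temperature).
NOMINAL = {
    "n": {"work_function_ev": 4.5, "vt_v": 0.23},
    "p": {"work_function_ev": 4.9, "vt_v": -0.28},
}


def estimate_vt(device, work_function_ev):
    """First-order Vt estimate in volts.

    ASSUMPTION: Vt shifts by 1 V per 1 eV change in gate work-function
    (flat-band shift only). This is an illustration, not simulation data.
    """
    ref = NOMINAL[device]
    return ref["vt_v"] + (work_function_ev - ref["work_function_ev"])


for wf in (4.5, 4.6, 4.7, 4.8, 4.9, 5.0):
    print(f"WF = {wf:.1f} eV: "
          f"Vt_n ~ {estimate_vt('n', wf):+.2f} V, "
          f"Vt_p ~ {estimate_vt('p', wf):+.2f} V")
```

Under this assumption a higher work-function raises the N-type threshold voltage and lowers the magnitude of the P-type threshold voltage, which matches the direction of the trends described in this section.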

The variations of the SNM, the leakage power, the read delay, and the write delay with the work-functions of the minimum sized (single fin) devices of an SRAM cell are shown in Figs. 12.1, 12.2, 12.3, and 12.4, respectively. The SNM is enhanced with higher-Vt access transistors (higher WF-access) as shown in Fig. 12.1, due to the reduced disturbance during a read-access. The SNM is also enhanced with lower-|Vt| pull-up transistors (higher WF-pullup) due to the more symmetric voltage transfer characteristics of the two cross-coupled inverters. The leakage power is reduced with higher-|Vt| devices as shown in Fig. 12.2. However, the read delay increases with higher-Vt access transistors (higher WF-access) due to the reduced current that discharges the bit-lines as shown in Fig. 12.3. Alternatively, the read delay is weakly dependent on the work-function of the pull-up transistors. Similarly, the write delay increases with higher-Vt access transistors (higher WF-access). The write speed is also degraded with lower-|Vt| pull-up transistors (higher WF-pullup) as shown in Fig. 12.4 due to the stronger contention current produced by the pull-up transistors while an access transistor attempts to force a “0” into the cell. The write delay also increases with lower-Vt pull-down transistors (not shown in Fig. 12.4) due to the increased contention between the pull-down transistor and the access transistor while forcing a “1” onto the cell node that initially stores a “0”.

Fig. 12.1. Variation of the read static noise margin with the work-functions of the access and the pull-up devices. The work-function of the pull-down devices is fixed at 4.6eV. T = 70°C. All the devices are minimum sized.

Fig. 12.2. Variation of the static power with the work-function of the access, the pull-up, and the pull-down devices. T = 70°C. Minimum sized SRAM cell.

Fig. 12.3. Variation of the read delay with the work-function of the access devices. The read delay is insensitive to the work-function of the pull-up devices in the range of (4.5eV-5eV) and the work-function of the pull-down devices in the range of (4.5eV-4.6eV). T = 70°C. Minimum sized SRAM cell. Read delay is measured for a column of 256 SRAM cells.

Fig. 12.4. Variation of the write delay with the work-function of the access and the pull-up devices. T = 70°C. All the transistors are sized minimum.

The following procedure is used for selecting the optimum work-functions for a low-leakage-power SRAM cell. The functional SRAM cells that satisfy a specific SNM criterion (e.g. SNM > 190mV) are selected to form an initial pool of sufficiently robust designs. The SRAM cell that consumes the lowest leakage power is identified in this initial pool. A secondary list is formed by the SRAM cells that are within 5% of the minimum leakage power. The low-leakage SRAM cell that provides the fastest read speed is then selected from this second list and denoted as SRAM_LP.

Alternatively, the following procedure is applied for selecting the optimum work-functions for a high-read-speed SRAM cell. The functional SRAM cells that satisfy the SNM criterion are first selected for robust operation. The SRAM cell that provides the minimum read delay is identified in this initial list. The SRAM cells that are within 5% of the minimum read delay are selected to form a secondary list of robust and high-speed designs. The SRAM cell that consumes the lowest leakage power among the high-speed circuits in the second list is identified and denoted as SRAM_HS.
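
The two selection procedures described above share the same three steps and differ only in which metric is primary. A minimal sketch follows, assuming each candidate work-function combination has already been simulated; the `Candidate` class, the function names, and the example numbers are all hypothetical and made up for illustration. The 190mV SNM criterion and the 5% tolerance are the values given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    snm_mv: float
    leakage_nw: float
    read_delay_ps: float


def select(cells, primary, secondary, snm_min_mv=190.0, tolerance=0.05):
    """Three-step selection.

    1. Keep the cells that satisfy the SNM criterion.
    2. Keep the cells within `tolerance` of the best `primary` metric.
    3. Return the cell with the lowest `secondary` metric.
    """
    pool = [c for c in cells if c.snm_mv > snm_min_mv]
    if not pool:
        raise ValueError("no candidate satisfies the SNM criterion")
    best = min(getattr(c, primary) for c in pool)
    shortlist = [c for c in pool
                 if getattr(c, primary) <= best * (1.0 + tolerance)]
    return min(shortlist, key=lambda c: getattr(c, secondary))


def select_lp(cells, **kwargs):
    """Low-leakage cell: leakage first, then read delay."""
    return select(cells, "leakage_nw", "read_delay_ps", **kwargs)


def select_hs(cells, **kwargs):
    """High-read-speed cell: read delay first, then leakage."""
    return select(cells, "read_delay_ps", "leakage_nw", **kwargs)


# Made-up candidates, for illustration only.
cells = [
    Candidate("A", 200.0, 0.40, 45.0),
    Candidate("B", 195.0, 0.41, 42.0),
    Candidate("C", 198.0, 2.00, 31.0),
    Candidate("D", 150.0, 0.10, 20.0),  # fails the SNM criterion
    Candidate("E", 192.0, 2.50, 30.5),
]
print(select_lp(cells).name, select_hs(cells).name)
```

With these made-up numbers, candidate D is rejected despite having the lowest leakage and delay because it fails the SNM criterion, which is the point of applying the stability filter first.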

12.2. Comparisons

The read stability, the leakage power, the cell area, the active mode power, and the access delays of the standard SRAM cells with single-low-Vt FinFETs (SRAM1 and SRAM3 shown in Figs. 12.5 and 12.6, respectively) and the work-function engineered multi-Vt SRAM cells (SRAM_LP and SRAM_HS) are compared in this section for different process corners. The circuit schematic and layout of SRAM_LP and SRAM_HS are the same as those of SRAM1. SRAM_LP, SRAM_HS, and SRAM1 differ only in the gate work-functions, which determine the threshold voltages of the transistors.

Fig. 12.5. Schematic and layout of a minimum sized 6T SRAM cell (SRAM1, SRAM_LP, and SRAM_HS) in a 32nm FinFET technology. The size of each transistor is given as (number of fins × fin height) / gate length.

Fig. 12.6. Schematic and layout of a larger memory cell (SRAM3) with enhanced data stability as compared to SRAM1. The size of each transistor is given as (number of fins × fin height) / gate length.

The circuit characteristics of the standard single-low-Vt and the proposed work-function engineered multi-Vt SRAM cells are listed in Table 12.2. The normalized leakage power of the four SRAM cells is shown in Fig. 12.7. The leakage power is reduced by 98% (92%) with SRAM_LP (SRAM_HS) as compared to SRAM3 with similar SNM due to the higher-Vt transistors. The normalized read delay, write delay, read power, and write power of the standard and the proposed SRAM cells are shown in Fig. 12.8. Delay and power are measured for a column of 256 SRAM cells. The write delay is reduced by 25% (11.5%) with SRAM_LP (SRAM_HS) as compared to SRAM3 due to the reduced internal node parasitic capacitances. The read delay is increased by 154% (84%) with SRAM_LP (SRAM_HS) due to the higher-Vt access and pull-down transistors as compared to SRAM3. The layout area of SRAM_LP and SRAM_HS with minimum sized (single-fin) transistors is reduced by 25% as compared to the area of SRAM3. Note that SRAM3 is composed of low-Vt transistors that need to be sized significantly larger for sufficient data stability. The reduced width of the cell layout of SRAM_HS and SRAM_LP with smaller transistors also results in shorter wordlines and a faster wordline decoder as compared to SRAM3. Taking into consideration the decoder delay would further reduce the read delay penalty with the work-function engineered SRAM cells as compared to SRAM3.

TABLE 12.2. COMPARISON OF THE STANDARD SINGLE-LOW-Vt AND THE PROPOSED WORK-FUNCTION ENGINEERED MULTI-Vt SRAM CELLS

                     SRAM1    SRAM3    SRAM_LP  SRAM_HS
WF-pulldown (eV)     4.5      4.5      4.6      4.6
WF-access (eV)       4.5      4.5      4.8      4.7
WF-pullup (eV)       4.9      4.9      4.7      4.9
Read Delay (ps)      19.8     17.2     43.6     31.7
Read Power (µW)      2.18     2.15     2.14     2.12
Write Delay (ps)     4.36     6.53     4.9      5.78
Write Power (µW)     3.48     3.79     3.28     3.35
Static Power (nW)    12.39    25.37    0.39     2.05
SNM (mV)             124      207      204      196
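
The percentages quoted in this section follow directly from Table 12.2. The sketch below recomputes them relative to SRAM3; the dictionary layout and the function name `percent_change` are hypothetical, while the numbers are copied from the table.

```python
# Values copied from Table 12.2.
TABLE = {
    "SRAM3": {"read_delay_ps": 17.2, "write_delay_ps": 6.53,
              "static_power_nw": 25.37},
    "SRAM_LP": {"read_delay_ps": 43.6, "write_delay_ps": 4.90,
                "static_power_nw": 0.39},
    "SRAM_HS": {"read_delay_ps": 31.7, "write_delay_ps": 5.78,
                "static_power_nw": 2.05},
}


def percent_change(cell, metric, reference="SRAM3"):
    """Signed change of `metric` relative to the reference cell, in %."""
    ref = TABLE[reference][metric]
    return 100.0 * (TABLE[cell][metric] - ref) / ref


for cell in ("SRAM_LP", "SRAM_HS"):
    for metric in ("static_power_nw", "write_delay_ps", "read_delay_ps"):
        print(f"{cell} {metric}: {percent_change(cell, metric):+.1f}%")

ratio = TABLE["SRAM3"]["static_power_nw"] / TABLE["SRAM_LP"]["static_power_nw"]
print(f"SRAM3/SRAM_LP static power ratio: {ratio:.0f}X")
```

The static power ratio between SRAM3 and SRAM_LP computed this way is about 65, which is the figure used in the chapter summary.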

Fig. 12.7. The normalized leakage power consumption of the standard single-low-Vt and the proposed work-function engineered multi-Vt SRAM cells. T = 70°C. Static power: SRAM1 = 12.39 nW, SRAM3 = 25.37 nW, SRAM_LP = 0.39 nW, SRAM_HS = 2.05 nW.

Fig. 12.8. The normalized delay and power consumption of a memory column with the different SRAM cells (the standard single-low-Vt and the proposed work-function engineered multi-Vt SRAM cells). T = 70°C.

The impact of process variations on the characteristics of the memory circuits is evaluated with 2500 samples produced by Monte Carlo simulations. 10% 3σ variations are assumed for the channel length, the fin height, the fin thickness, and the gate oxide thickness. The 3σ variation of the gate work-function is assumed to be 50mV [73]. The results of the Monte Carlo analysis are depicted in Figs. 12.9 and 12.10. The mean and the standard deviation of the leakage power are significantly reduced with the proposed work-function engineered multi-Vt SRAM cells without degrading the data stability as compared to the standard single-low-Vt memory cells, SRAM1 and SRAM3. The mean and the standard deviation (SD) of the SNM are similar for SRAM3, SRAM_LP, and SRAM_HS.

Fig. 12.9. Statistical static power distributions of the memory circuits. Static power (nW): SRAM1 mean = 14, SD = 4.66; SRAM3 mean = 28.3, SD = 11.3; SRAM_LP mean = 0.45, SD = 0.23; SRAM_HS mean = 2.36, SD = 1.07.

Fig. 12.10. Statistical SNM distributions of the SRAM cells. SNM (mV): SRAM1 mean = 124, SD = 9.8; SRAM3 mean = 207, SD = 7.3; SRAM_LP mean = 204, SD = 8.9; SRAM_HS mean = 195.6, SD = 8.38.

12.3. Chapter Summary

In this Chapter, work-function engineering is explored for implementing minimum sized multi-Vt 6T FinFET SRAM cells with sufficient data stability and lower leakage power consumption characteristics. A design methodology is proposed for optimizing the work-functions of the transistors in a FinFET memory circuit. The optimization goals are either lower leakage power consumption or higher read speed. With the proposed multi-Vt design methodology based on gate work-function engineering, the leakage power consumption is reduced by up to 65X as compared to a standard single-low-Vt SRAM circuit sized for similar data stability in a 32nm FinFET technology.

Chapter 13 Future Research Plans

Future research plans are presented in this chapter. The use of local strain technologies [115]-[118] enhances the on-current of MOSFETs without sacrificing leakage currents. The development of high-speed and robust SRAM circuits using the local strain technologies is an important research topic. The local strain technologies are described in Section 13.1, where robust and high-speed static memory circuits using the local strain technologies are also proposed. The use of gate-drain/source underlap in FinFETs [119]-[120] has an advantage in relaxing the fin thickness requirement while maintaining high inversion current and good control over the short-channel effects. Extending the FinFET technology development guidelines presented in Chapter 8 while considering the gate-drain/source underlap is an important research topic. The development of dual-threshold-voltage FinFET technology based on the gate-drain/source underlap engineering is presented in Section 13.2. The development of low power and robust FinFET SRAM circuits using gate-drain/source underlap engineering is proposed in Sections 13.3 and 13.4.

13.1. Robust and High Speed SRAM Cell Using Local Strain Technologies

In this section, local strained silicon technologies are presented. Robust and high-speed six-transistor SRAM circuits based on local strain technologies are proposed. Local strain technologies are based on process-induced stress techniques [113], [115]-[118]. Two approaches have been developed to achieve local stress [115], [116]. With the first approach, uniaxial compressive (tensile) stress on the silicon channel is achieved by growing a local epitaxial film of SiGe (SiC) in the source and drain regions of a PMOS (an NMOS) transistor. The tensile and the compressive stresses significantly enhance the mobility of electrons and holes, respectively, thereby leading to higher on-current and enhanced circuit performance. The process flow of this local strain technology is shown in Fig. 13.1. The source and drain are etched, then SiGe and SiC are epitaxially grown in the source and drain regions of the PMOS and NMOS transistors, respectively. The amount of Ge and C controls the amount of stress induced by the source and drain regions on the channel.

Fig. 13.1. Process flow of a local strain technology based on epitaxial growth of SiGe and SiC in the source and drain regions of a PMOS and an NMOS transistor, respectively [115]. Steps shown: Si recess etch, SiGe epitaxial growth.

With the second local strain approach, a tensile and a compressive capping layer are deposited on the gates of NMOS and PMOS transistors, respectively [115], [116]. The process flow consists of a uniform deposition of a post-silicidation tensile Si3N4 layer over the entire wafer, followed by patterning and etching the film off the PMOS transistors. A compressive SiN layer is then deposited and etched from the NMOS transistors. The final structures of the strained NMOS and PMOS transistors are shown in Fig. 13.2 [115]. A hybrid strain technology is presented in [117]-[118], in which epitaxially grown SiGe source and drain regions are used to achieve compressive strain in PMOS transistors, while a tensile capping layer is used to achieve tensile strain in NMOS transistors.

Fig. 13.2. Local strain technology using capping layers [115]. A tensile nitride layer caps the NMOS gate and a compressive nitride layer caps the PMOS gate.

The local uniaxial strain technologies cause a smaller threshold voltage shift as compared to the global biaxial strain technology, as illustrated in Fig. 13.3 [121]. The strain-induced threshold voltage reduction is undesirable since it must be counterbalanced with excess channel doping. The excess doping increases the scattering rate and reduces the gain in carrier mobility achieved with strain [122].

Fig. 13.3. Biaxial and uniaxial strain-induced threshold voltage shifts versus the amount of strain [121]. Axes: threshold voltage shift (mV) versus strain (0 to 0.02).

Local strain is employed in [124] to enhance the speed of six-transistor SRAM circuits by applying strain to the NMOS transistors only, as shown in Fig. 13.4. This approach, however, comes at the cost of a degraded static noise margin.

Fig. 13.4. Six-transistor SRAM circuit with enhanced speed using local strain. The (*) indicates transistors with a strained channel.

An alternative SRAM circuit with strain applied only to the pull-down transistors is proposed in this section. The proposed circuit is shown in Fig. 13.5. By applying strain to the pull-down transistors only, the voltage disturbances introduced on the storage nodes during read operations are suppressed without increasing the size of the pull-down transistors. The read static noise margin is therefore expected to be enhanced as compared to an SRAM circuit without strain and with the same transistor sizes. The read speed of the proposed circuit is also expected to be enhanced as compared to the unstrained SRAM circuit due to the stronger pull-down transistors. Investigating the electrical characteristics of the proposed strained SRAM circuit and its sensitivity to process parameter variations will be an interesting research topic.
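The link between a stronger pull-down transistor and a smaller read disturbance can be illustrated with a minimal sketch. It assumes long-channel square-law NMOS devices with equal threshold voltages, no body effect, a bitline held at VDD, and a storage node that initially holds '0'; the function names, the supply and threshold voltages, and the mobility gains are all hypothetical and are not taken from the dissertation. The strain is modeled only as a multiplier on the pull-down to access strength ratio.

```python
import math


def read_disturb_voltage(vdd, vth, strength_ratio):
    """Voltage rise on the storage node holding '0' during a read.

    Assumes long-channel square-law NMOS devices with equal threshold
    voltages and no body effect (illustrative assumptions only).
    strength_ratio = k_pulldown / k_access, where k = mu * Cox * W / L.
    """
    if strength_ratio <= 0:
        raise ValueError("strength_ratio must be positive")
    v_ov = vdd - vth  # overdrive voltage VDD - Vth
    # Access transistor (saturation): 0.5 * k_ax * (v_ov - v)**2
    # Pull-down transistor (linear):  k_pd * (v_ov * v - 0.5 * v**2)
    # Equating the two currents gives a quadratic with this root in [0, v_ov]:
    return v_ov * (1.0 - math.sqrt(strength_ratio / (1.0 + strength_ratio)))


def current_mismatch(vdd, vth, strength_ratio, v_node):
    """Pull-down minus access current (normalized to k_access) at v_node."""
    v_ov = vdd - vth
    i_access = 0.5 * (v_ov - v_node) ** 2
    i_pulldown = strength_ratio * (v_ov * v_node - 0.5 * v_node ** 2)
    return i_pulldown - i_access


if __name__ == "__main__":
    VDD, VTH = 1.0, 0.3          # hypothetical supply and threshold voltages
    BASE_RATIO = 1.5             # hypothetical unstrained pull-down/access ratio
    for mobility_gain in (1.0, 1.2, 1.4):  # hypothetical strain-induced gains
        v = read_disturb_voltage(VDD, VTH, BASE_RATIO * mobility_gain)
        print(f"mobility gain {mobility_gain:.1f}: node disturbance {v:.3f} V")
```

In this simplified model the disturbance voltage falls monotonically as the strength ratio grows, which is the same effect that is conventionally obtained by widening the pull-down transistors. A quantitative assessment would require device-level simulation.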

Fig. 13.5. A six-transistor SRAM circuit with enhanced read stability and read speed using local strain will be investigated. The (*) indicates transistors with a strained channel.

In [123], a TiN metal gate electrode of varying thickness is utilized to apply strain in FinFETs. The TiN electrode applies a tensile stress on the silicon fin that is a function of the TiN thickness and the deposition method, as shown in Fig. 13.6. The applied tensile stress enhances the electron mobility and the on-current of NMOS transistors. The proposed strained SRAM circuit shown in Fig. 13.5 can be directly extended to FinFET technology using the strained silicon technique based on the metal electrode presented in [123]. Alternatively, the TiN electrode can be deposited on both the pull-down and the access transistors with different thicknesses in order to achieve higher strain in the pull-down transistors as compared to the access transistors. With this circuit, shown in Fig. 13.7, the read speed, the read static noise margin, and the write margin are expected to be simultaneously enhanced as compared to the same SRAM circuit without strain. Investigating the quantitative enhancement in speed and read stability of the proposed FinFET SRAM circuit as well as its sensitivity to process parameter variations will be an important research topic.

Fig. 13.6. Stress induced by the TiN film as a function of the TiN thickness and the deposition method. ALD: atomic layer deposition. CVD: chemical vapor deposition. PVD: physical vapor deposition. A higher tensile stress is applied with a thinner TiN layer [123]. Axes: TiN film stress (a.u.) versus TiN thickness (10 nm to 40 nm).

Fig. 13.7. FinFET six-transistor SRAM circuit with local strain applied to the pull-down and the access transistors for enhanced access speed and write margin. The local strain applied to the pull-down transistors is higher as compared to the access transistors for enhanced read stability. The (*) indicates transistors with a strained channel. The (**) indicates higher strain than (*).


13.2. Dual-Threshold Voltage FinFET Technology Based on Gate-Drain/Source Underlap Engineering

Designing FinFETs with gate-drain/source underlap rather than gate-drain/source overlap has several advantages. Higher read stability in SRAM circuits is achieved due to the reduced drain-induced barrier lowering [119]-[120]. Furthermore, the number of processing steps is reduced by skipping the implantation of the source and drain extensions [120]. FinFETs with gate-drain/source overlaps and gate-drain/source underlaps are illustrated in Fig. 13.8. The on-current and the off-current versus the gate-drain/source overlap are shown in Fig. 13.9. Extending the technology development guidelines presented in Chapter 8 to FinFETs with gate-drain/source underlaps will be an important research topic.

Fig. 13.8. (a) FinFET 3D architecture (labels: gate, source, drain, tsi, Hfin, L). (b) Cross-sectional top view of a FinFET with gate-drain/source overlaps. (c) Cross-sectional top view of a FinFET with gate-drain/source underlaps. Extending the FinFET technology development guidelines to include the gate-drain/source underlaps will be an important research topic.

Fig. 13.9. On-current and off-current of FinFETs versus gate-drain/source overlap. Underlaps are represented with negative overlaps. 32 nm gate length. 1.6 nm gate oxide thickness. Undoped body. The gate work-function of the N-type (P-type) FinFET is 4.5 eV (4.9 eV). T = 110°C. Axes: on-current (mA/µm) and off-current (nA/µm) versus gate-drain/source overlap (nm, from -18 to 6), for N-type and P-type FinFETs.

13.3. High Data Stability and Low Leakage Power FinFET Memory Circuit Based on Gate-Drain/Source Underlap Engineering

In this section, a low leakage dual threshold voltage FinFET SRAM circuit with enhanced data stability is proposed. The proposed SRAM circuit is based on the gate-drain/source underlap engineering presented in Section 13.2. FinFETs with gate-drain/source overlaps and gate-drain/source underlaps are used together as low-Vth and high-Vth transistors, respectively. FinFETs with gate-drain/source overlaps and gate-drain/source underlaps can be co-integrated on the same die by skipping the extension doping for the high-Vth devices only. The proposed SRAM circuit is shown in Fig. 13.10. The pull-down transistors are low-Vth FinFETs with gate-drain/source overlaps. The pull-up transistors and the access transistors, in contrast, are high-Vth FinFETs with gate-drain/source underlaps. The read stability is expected to be enhanced since the pull-down transistors are stronger than the access transistors. Sizing for enhanced stability is therefore not needed. The integration density is expected to be enhanced and the leakage power is expected to be reduced. Quantitative analysis of the proposed SRAM circuit is an interesting research topic.
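The leakage benefit of assigning high-Vth devices to the pull-up and access positions can be estimated to first order with the standard exponential subthreshold model, in which the off-current drops by one decade for every subthreshold swing S of threshold voltage increase. The sketch below is only an illustration of that relation; the function name and the numerical values are hypothetical, and gate leakage and drain-induced barrier lowering are ignored.

```python
def subthreshold_leakage_ratio(delta_vth_mv, swing_mv_per_decade):
    """Leakage of a high-Vth device relative to a low-Vth device.

    Uses the standard exponential subthreshold model
    I_off ~ 10 ** (-Vth / S), so the ratio depends only on the
    threshold voltage difference and the subthreshold swing S.
    """
    if swing_mv_per_decade <= 0:
        raise ValueError("subthreshold swing must be positive")
    return 10.0 ** (-delta_vth_mv / swing_mv_per_decade)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the dissertation.
    for delta_vth in (50.0, 100.0, 150.0):
        ratio = subthreshold_leakage_ratio(delta_vth, swing_mv_per_decade=80.0)
        print(f"Vth increase {delta_vth:5.1f} mV -> leakage x{ratio:.3f}")
```

Because the dependence is exponential, even a modest threshold voltage difference between the overlap and underlap devices could change the cell leakage noticeably, which is why a quantitative device-level study is needed.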

Fig. 13.10. A FinFET SRAM circuit based on a dual gate-drain/source overlap technology will be investigated. The pull-down transistors are FinFETs with gate-drain/source overlaps. The pull-up and the access transistors are FinFETs with gate-drain/source underlaps. Thick lines indicate transistors with gate-drain/source underlaps.

13.4. Robust and High Speed Seven Transistors FinFET SRAM Cell Based on Gate-Drain/Source Underlap Engineering

Maintaining the data stability of SRAM cells is expected to become increasingly challenging with continued technology scaling. Extending the proposed seven-transistor SRAM circuit (Chapter 7) to a dual threshold voltage FinFET technology based on gate-drain/source underlap engineering will be investigated. A single-ended sense amplifier that is based on the non-linear gate capacitance of a MOSFET is proposed in [125]. The gain of this single-ended sense amplifier is enhanced by reducing the gate-drain/source overlap and fringe capacitances. Extending the single-ended sense amplifier to a dual threshold voltage FinFET technology based on gate-drain/source underlap engineering will also be investigated.

Fig. 13.11. A robust and high speed 7T FinFET SRAM circuit based on gate-drain/source overlap engineering will be investigated. A single-ended sense amplifier based on gate-drain/source overlap engineering will also be investigated. Blocks shown: dual-Vth FinFET 7T SRAM circuit and single-ended sense amplifier.


Bibliography

[1] V. Kursun and E. G. Friedman, Multi-Voltage CMOS Circuit Design, John Wiley & Sons Ltd., 2006, ISBN # 0-470-01023-1.

[2] G. E. Moore, “The Role of Fairchild in Silicon Technology in the Early Days of Silicon Valley,” Proceedings of the IEEE, Vol. 86, Issue 1, pp. 53-62, January 1998.

[3] G. E. Moore, “Cramming More Components onto Integrated Circuits,” Electronics, Vol. 38, No. 8, pp. 114-117, April 1965.

[4] G. E. Moore, “Progress in Digital Integrated Electronics,” Proceedings of the IEEE International Electron Device Meeting, pp. 11-13, December 1975.

[5] G. E. Moore, “Cramming More Components onto Integrated Circuits,” Proceedings of the IEEE, Vol. 86, No. 1, pp. 82-85, January 1999.

[6] G. E. Moore, “No Exponential is Forever: But ‘Forever’ Can be Delayed!,” Proceedings of the IEEE International Solid-State Conference, Vol. 1, pp. 20-23, February 2003.

[7] S. Borkar, “Design Challenges of Technology Scaling,” IEEE Micro, Vol. 19, Issue 4, pp. 23-29, July-August 1999.

[8] D. J. Frank et al., “Device Scaling Limits of Si MOSFETs and Their Application Dependencies,” Proceedings of the IEEE, Vol. 89, No. 3, pp. 259-287, March 2001.

[9] S. Borkar et al., “Parameter Variations and Impact on Circuits and Microarchitecture,” Proceedings of the IEEE Design Automation Conference, pp. 338-342, June 2003.

[10] T. Karnik, V. De, S. Borkar, “Statistical Design for Variation Tolerance: Key to Continued Moore's Law,” Proceedings of the IEEE International Conference on Integrated Circuit Design and Technology, pp.175–176, May 2004.


[11] P. Friedberg, et al., “Modeling Within-Die Spatial Correlation Effects for Process-Design Co-Optimization,” Proceedings of the IEEE International Symposium on Quality Electronic Design, pp. 516-521, March 2005.

[12] A. Grove, “Changing Vectors of Moore’s Law,” International Electron Devices Meeting, December 2002.

[13] S. Mukhopadhyay, A. Raychowdhury, and K. Roy, “Accurate Estimation of Total Leakage in Nanometer-Scale Bulk CMOS Circuits Based on Device Geometry and Doping Profile,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 24, No. 3, pp. 363–381, March 2005.

[14] S. Naffziger, B. Stackhouse, and T. Grutkowski, “The Implementation of a 2-Core Multi-Threaded Itanium Family Processor,” Proceedings of the IEEE International Solid-State Circuits Conference, Vol. 1, pp. 182-592, February 2005.

[15] K. Usami et al., “Automated Low-Power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor,” IEEE Journal of Solid-State Circuits, Vol. 33, No. 3, pp. 463-471, March 1998.

[16] Y. Taur and T. H. Ning, Fundamentals of Modern VLSI Devices, New York: Cambridge University Press, 1998.

[17] S. H. Kulkarni and D. Sylvester, “High Performance Level Conversion for Dual VDD Design,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 9, pp. 926-936, September 2004.

[18] A. Srivastava and D. Sylvester, “Minimizing Total Power by Simultaneous Vdd/Vth Assignment,” Proceedings of the IEEE Design Automation Conference, pp. 400-403, January 2003.

[19] S. H. Kulkarni, A. N. Srivastava, and D. Sylvester, “A New Algorithm for Improved VDD Assignment in Low Power Dual VDD Systems,” Proceedings of the IEEE International Symposium on Low Power Electronics and Design, pp. 200-205, August 2004.

[20] F. Ishihara, F. Sheikh, and B. Nikolić, “Level Conversion for Dual-Supply Systems,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 2, pp. 185-195, February 2004.

[21] V. Kursun, R. M. Secareanu, and E. G. Friedman, “CMOS Voltage Interface Circuit for Low Power Systems,” Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 3, pp. 667-670, May 2002.

[22] M. Takahashi et al., “A 60-mW MPEG4 Video Codec Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme,” IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, pp. 1772-1780, November 1998.

[23] M. Hamada et al., “A Top-Down Low Power Design Technique Using Clustered Voltage Scaling with Variable Supply-Voltage Scheme,” Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 495-498, May 1998.

[24] D. E. Lackey et al., “Managing Power and Performance for System-on-Chip Designs Using Voltage Islands,” Proceedings of the IEEE/ACM International Conference on Computer Aided Design, pp. 195-202, November 2002.

[25] S. A. Tawfik and V. Kursun, “Low Power and High Speed Multi Threshold Voltage Interface Circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2009 (accepted, in press).

[26] D. Velenis, M. C. Papaefthymiou, and E. G. Friedman, “Reduced Delay Uncertainty in High Performance Clock Distribution Networks,” Proceedings of the Design, Automation and Test in Europe Conference, pp. 68–73, March 2003.

[27] A. Kapoor, N. Jayakumar, and S. P. Khatri, “A Novel Clock Distribution and Dynamic Deskewing Methodology,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 626–631, November 2004.

[28] G. Geannopoulos and X. Dai, “An Adaptive Digital Deskewing Circuit for Clock Distribution Networks,” Proceedings of the IEEE International Solid-State Circuits Conference, pp. 400–401, February 1998.

[29] C. E. Dike, N. A. Kurd, P. Patra, and J. Barkatullah, “A Design for Digital, Dynamic Clock Deskew,” Proceedings of the IEEE International Symposium on VLSI Circuits, pp. 21–24, June 2003.

[30] J. Pangjun and S. S. Sapatnekar, “Low-Power Clock Distribution Using Multiple Voltages and Reduced Swings,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 10, No. 3, pp. 309–318, June 2002.

[31] P. Mahoney, E. Fetzer, B. Doyle, and S. Naffziger, “Clock Distribution on a Dual-Core, Multi-Threaded Itanium-Family Processor,” Proceedings of the IEEE International Solid-State Circuits Conference, pp. 292–599, February 2005.

[32] F. H. A. Asgari and M. Sachdev, “A Low-Power Reduced Swing Global Clocking Methodology,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 5, pp. 538–545, May 2004.

[33] S. A. Tawfik and V. Kursun, “Dual-VDD Clock Distribution for Low Power and Minimum Temperature Fluctuations Induced Skew,” Proceedings of the IEEE International Symposium on Quality Electronic Design, pp. 73-78, March 2007.

[34] R. Kumar and V. Kursun, “Voltage Optimization for Simultaneous Energy Efficiency and Temperature Variation Resilience in CMOS Circuits,” Microelectronics Journal, Volume 38, Issues 4-5, pp. 583-594, April/May 2007.

[35] Y. Lee et al., “Clock Multiplier Using Digital CMOS Standard Cells for High-Speed Digital Communication Systems,” IEE Electronics Letters, Vol. 35, No. 24, pp. 2073-2074, November 1999.

[36] S. A. Tawfik and V. Kursun, “Multi-Vth Level Conversion Circuits for Multi-VDD Systems,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1397-1400, May 2007.

[37] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Addison Wesley, 2005.

[38] S. A. Tawfik and V. Kursun, “Dual Supply Voltages and Dual Clock Frequencies for Lower Clock Power and Suppressed Temperature-Gradient Induced Clock Skew,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2009 (accepted, in press).

[39] P. E. Gronowski et al., “High-Performance Microprocessor Design,” IEEE Journal of Solid-State Circuits, Vol. 33, No. 5, pp. 676-686, May 1998.

[40] S. Naffziger et al., "The Implementation of the Itanium2 Microprocessor," IEEE Journal of Solid-State Circuits, Vol. 37, No. 11, pp. 1448-1460, November 2002.

[41] G. E. Téllez and M. Sarrafzadeh, “Minimal Buffer Insertion in Clock Trees with Skew and Slew Rate Constraints,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 16, No. 4, pp. 333-342, April 1997.

[42] M. Pan, C. Chong-Nuen Chu, and J. Morris Chang, “Transition Time Bounded Low-power Clock Tree Construction,” Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 3, pp. 2445-2448, May 2005.

[43] N. Ahmed, M. H. Tehranipour, D. Zhou, and M. Nourani, “Frequency Driven Repeater Insertion for Deep Submicron,” Proceedings of the IEEE International Symposium on Circuits and Systems, Vol. 5, pp. V-181 – V-184, May 2004.

[44] J. G. Xi and W. W. -M. Dai, “Buffer Insertion and Sizing under Process Variations for Low Power Clock Distribution,” Proceedings of the ACM/IEEE Conference on Design Automation, pp. 491–496, June 1995.

[45] J. Oh and M. Pedram, “Gated Clock Routing for Low-Power Microprocessor Design,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 20, No. 6, pp. 715–722, June 2001.

[46] T. Sakurai and A. R. Newton, “Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas,” IEEE Journal of Solid-State Circuits, Vol. 25, No. 2, pp. 584–594, April 1990.

[47] A. H. Ajami, K. Banerjee, and M. Pedram, “Modeling and Analysis of Non-Uniform Substrate Temperature Effects in High Performance VLSI,” IEEE Transactions on Computer Aided Design, Vol. 24, No. 6, pp. 849–861, June 2005.

[48] S. A. Tawfik and V. Kursun, “Buffer Insertion and Sizing in Clock Distribution Networks with Gradual Transition Time Relaxation for Reduced Power Consumption,” Proceedings of the IEEE International Conference on Electronics, Circuits and Systems, pp. 845-848, 2007.

[49] S. A. Tawfik and V. Kursun, “Clock Distribution Networks with Gradual Signal Transition Time Relaxation for Reduced Power Consumption,” Journal of Circuits, Systems, and Computers, Vol. 17, No. 6, pp. 1173–1191, December 2008.

[50] V. Kursun, S. A. Tawfik, and Z. Liu, “Leakage-Aware Design of Nanometer SoC,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 3231-3234, May 2007.

[51] G. Sery et al., “Life is CMOS: Why Chase Life After?,” Proceedings of the IEEE Design Automation Conference, pp. 78-83, June 2002.

[52] E. Seevinck, F. J. List, and J. Lohstroh, “Static-Noise Margin Analysis of MOS SRAM Cells,” IEEE Journal of Solid-State Circuits, Vol. 22, No. 5, pp. 748-754, October 1987.

[53] L. Chang et al., “Stable SRAM Cell Design for the 32 nm Node and Beyond,” Proceedings of the IEEE Symposium on VLSI Technology, pp. 128-129, June 2005.

[54] Z. Liu and V. Kursun, “High Read Stability and Low Leakage Cache Memory Cell,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 2774-2777, May 2007.

[55] T. Devoivre et al., “Validated 90nm CMOS Technology Platform with Low-K Copper Interconnects for Advanced System-on-Chip (SOC),” Proceedings of the IEEE International Workshop on Memory Technology, Design and Testing, pp. 157-162, July 2002.

[56] K. Zhang et al., “A 3-GHz 70-Mb SRAM in 65-nm CMOS Technology with Integrated Column-Based Dynamic Power Supply,” IEEE Journal of Solid-State Circuits, Vol. 41, No. 1, pp. 146-151, January 2006.

[57] V. Kursun, Supply and Threshold Voltage Scaling Techniques in CMOS Circuits, Ph.D. Thesis, University of Rochester, 2004.

[58] Z. Liu, Multi-Voltage Nanoscale CMOS Circuit Techniques, Ph.D. Thesis, University of Wisconsin-Madison, 2008.

[59] R. Kumar, Temperature Adaptive and Variation Tolerant CMOS Circuits, Ph.D. Thesis, University of Wisconsin-Madison, 2008.

[60] S. A. Tawfik and V. Kursun, “Dynamic Wordline Voltage Swing for Low Leakage and Stable Static Memory Banks,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1894-1897, May 2008.

[61] S. A. Tawfik and V. Kursun, “Low Power and Robust 7T Dual-Vt SRAM Circuit,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1452-1455, May 2008.

[62] K. Kim et al., “Leakage Power Analysis of 25-nm Double-Gate CMOS Devices and Circuits,” IEEE Transactions on Electron Devices, Vol. 52, No. 5, pp. 980-986, May 2005.

[63] S. Tang et al., “FinFET – A Quasi-planar Double-Gate MOSFET,” Proceedings of the IEEE International Solid-State Circuit Conference, pp. 118-119, February 2001.


[64] H. Shang et al., “Investigation of FinFET Devices for 32nm Technologies and Beyond,” Proceedings of the IEEE Symposium on VLSI Technology, pp. 54-55, June 2006.

[65] S. Mitra et al., “Low Voltage/Low Power Sub 50nm Double Gate SOI Ratioed Logic,” Proceedings of the IEEE International SOI Conference, pp.177-178, September 2003.

[66] M-H. Chiang et al., “High-Density Reduced-Stack Logic Circuit Techniques Using Independent-Gate Controlled Double-Gate Devices,” IEEE Transactions on Electron Devices, Vol. 53, No. 9, pp. 2370-2377, September 2006.

[67] E. Nowak et al., “Turning Silicon on Its Edge,” IEEE Circuits & Devices Magazine, Vol. 20, No. 1, pp. 20-31, January/February 2004.

[68] Y. Liu et al., “Cointegration of High-Performance Tied-Gate Three-Terminal FinFETs and Variable Threshold-Voltage Independent-Gate Four-Terminal FinFETs with Asymmetric Gate-Oxide Thicknesses,” IEEE Electron Device Letters, Vol. 28, No. 6, pp. 517-519, June 2007.

[69] M-H. Chiang et al., “Novel High-Density Low-Power Logic Circuit Techniques Using DG Devices,” IEEE Transactions on Electron Devices, Vol. 52, No. 10, pp. 2339–2342, October 2005.

[70] Medici Device Simulator, Synopsys, Inc., February 2003.

[71] R. Lin et al., “An Adjustable Work Function Technology Using Mo Gate for CMOS Devices,” IEEE Electron Device Letters, Vol. 23, No. 1, pp. 49-51, January 2002.

[72] P. Xuan and J. Bokor, “Investigation of NiSi and TiSi as CMOS Gate Materials,” IEEE Electron Device Letters, Vol. 24, No. 10, pp. 634-636, October 2003.

[73] P. Ranade et al., “Work Function Engineering of Molybdenum Gate Electrodes by Nitrogen Implantation,” Electrochemical and Solid-State Letters, Vol. 4, Issue 11, pp. G85-G87, November 2001.

[74] J. Kedzierski et al., “Metal-gate FinFET and Fully-Depleted SOI Devices Using Total Gate Silicidation,” Proceedings of the IEEE Electron Devices Meeting, pp. 247–250, December 2002.

[75] H. Ando et al., “A 1.3-GHz Fifth-Generation SPARC64 Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. 38, No. 11, pp. 1896-1905, November 2003.

[76] V. Kursun and E. G. Friedman, “Domino Logic With Variable Threshold Voltage Keeper,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, No. 6, pp. 1080-1093, December 2003.

[77] M. W. Allam, M. H. Anis, and M. I. Elmasry, “High-Speed Dynamic Logic Styles for Scaled-Down CMOS and MTCMOS Technologies,” Proceedings of the IEEE International Symposium on Low-Power Electronics Design, pp. 155–160, July 2000.

[78] A. Alvandpour et al., “A Sub-130-nm Conditional Keeper Technique,” IEEE Journal of Solid-State Circuits, Vol. 37, No. 5, pp. 633-638, May 2002.

[79] Y. X. Liu et al., “4-Terminal FinFETs with High Threshold Voltage Controllability,” Proceedings of the IEEE Device Research Conference, Vol. 1, pp. 207–208, June 2004.

[80] Y. X. Liu et al., “Advanced FinFET Technology: TiN Metal-Gate CMOS and 3T/4T Device Integration,” Proceedings of the IEEE International SOI Conference, pp. 219-220, September 2005.

[81] D. M. Fried, E. J. Nowak, J. Kedzierski, J. S. Duster, and K. T. Kornegay, “A Fin-type Independent-Double-Gate NFET,” Proceedings of the IEEE Device Research Conference, pp. 45-46, June 2003.

[82] J. Friedrich et al., “Design of the Power6 Microprocessor,” Proceedings of the IEEE Solid-State Circuits Conference, pp. 96-97, February 2007.

[83] V. George et al., “Penryn: 45-nm Next Generation Intel Core 2 Processor,” Proceedings of the IEEE Solid-State Circuits Conference, pp. 14-17, November 2007.

[84] J. Dorsey et al., “An Integrated Quad-Core Opteron Processor,” Proceedings of the IEEE International Solid-State Circuits Conference, pp. 102-103, February 2007.

[85] B. Flachs et al., “Microarchitecture and Implementation of the Synergistic Processor in 65-nm and 90-nm SOI,” IBM Journal of Research and Development, Vol. 51, No. 5, pp. 529-543, September 2007.

[86] S.-Y. Kim et al., “Temperature Dependence of Substrate and Drain–Currents in Bulk FinFETs,” IEEE Transactions on Electron Devices, Vol. 54, No. 5, pp. 1259-1264, May 2007.

[87] The International Technology Roadmap for Semiconductors, 2007. http://www.itrs.net/

[88] S. Naffziger et al., “The Implementation of the Itanium 2 Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. 37, No. 11, pp. 1448-1460, November 2002.

[89] S. Vangal et al., “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” Proceedings of the IEEE International Solid-State Circuits Conference, pp. 98-99, February 2007.

[90] B. S. Doyle et al., “High Performance Fully-Depleted Tri-Gate CMOS Transistors,” IEEE Electron Device Letters, Vol. 24, No. 4, pp. 263-265, April 2003.

[91] Z. Liu and V. Kursun, “Leakage Biased PMOS Sleep Switch Dynamic Circuits,” IEEE Transactions on Circuits and Systems II, Vol. 53, No. 10, pp. 1093 – 1097, October 2006.

[92] Y. T. Hou, M. F. Li, T. Low, and D. L. Kwong, “Impact of Metal Gate Work Function on Gate Leakage of MOSFETs,” Proceedings of the IEEE International Symposium on Semiconductor Device Research, pp. 154-155, December 2003.

[93] S. Xiong and J. Bokor, “Sensitivity of Double-Gate and FinFET Devices to Process Variations,” IEEE Transactions on Electron Devices, Vol. 50, No. 11, pp. 2255-2261, November 2003.

[94] S. A. Tawfik and V. Kursun, “Multi-Vth FinFET Sequential Circuits with Independent-Gate Bias and Work-Function Engineering for Reduced Power Consumption,” Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, December 2008 (invited paper).

[95] S. A. Tawfik and V. Kursun, “Low-Power and Compact Sequential Circuits with Independent-Gate FinFETs,” IEEE Transactions on Electron Devices, Vol. 55, No. 1, pp. 60-70, January 2008.


[96] S. A. Tawfik and V. Kursun, “Asymmetric Dual-Gate Multi-Fin Keeper Bias Options and Optimization for Low Power and Robust FinFET Domino Logic,” Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, December 2008.

[97] S. A. Tawfik and V. Kursun, “Work-Function Engineering for Reduced Power and Higher Integration Density: An Alternative to Sizing for Stability in FinFET Memory Circuits,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 788-791, May 2008.

[98] S. A. Tawfik and V. Kursun, “Compact FinFET Memory Circuits with P-Type Data Access Transistors for Low Leakage and Robust Operation,” Proceedings of the IEEE/ACM International Symposium on Quality Electronic Design, pp. 855-860, March 2008.

[99] S. A. Tawfik, Z. Liu, and V. Kursun, “Independent-Gate and Tied-Gate FinFET SRAM Circuits: Design Guidelines for Reduced Area and Enhanced Stability,” Proceedings of the IEEE International Conference on Microelectronics, pp.171-174, 2007.

[100] Z. Liu and V. Kursun, “Robust Dynamic Node Low Voltage Swing Domino Logic with Multiple Threshold Voltages,” Proceedings of the IEEE/ACM International Symposium on Quality Electronic Design, pp. 31-36, March 2006.

[101] B. Giraud et al., “A Comparative Study of 6T and 4T SRAM Cells in Double-Gate CMOS with Statistical Variation,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 3022-3025, May 2007.

[102] D. Velenis, R. Sundaresha, and E. G. Friedman, “Buffer Sizing for Delay Uncertainty Induced by Process Variations,” Proceedings of the IEEE International Conference on Electronics, Circuits and Systems, pp. 415-418, December 2004.

[103] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective second edition, Prentice Hall, 2003.


[104] H. Zhang, V. George, and J. M. Rabaey, “Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 8, No. 3, pp. 264-272, June 2000.

[105] S. A. Tawfik and V. Kursun, “Dual Signal Frequencies and Voltage Levels for Low Power and Temperature-Gradient Tolerant Clock Distribution,” Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, pp.62-67, August 2007.

[106] P. P. Sotiriadis and A. Chandrakasan, “Low Power Bus Coding Techniques Considering Inter-wire Capacitances,” Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 507-510, 2000.

[107] Y. Taur and T. H. Ning, Fundamentals of Modern VLSI Devices, Cambridge University Press, Cambridge, UK, 1998.

[108] S. Borkar, “Low Power Design Challenges for the Decade,” Proceedings of the IEEE/ACM International Design Automation Conference, pp. 293-296, June 2001.

[109] J. Kao, S. Narendra, and A. Chandrakasan, “Subthreshold Leakage Modeling and Reduction Techniques,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 141-148, 2002.

[110] Z. Liu and V. Kursun, “New MTCMOS Flip-Flops with Simple Control Circuitry and Low Leakage Data Retention Capability,” Proceedings of the IEEE International Conference on Electronics, Circuits, and Systems, December 2007.

[111] L. Wei et al., “Design and Optimization of Dual Threshold Circuits for Low Voltage Low Power Applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 1, pp. 16-24, March 1999.

[112] M. Lundstrom, “ECE 612 Lecture 12: Subthreshold Conduction,” 2006. https://www.nanohub.org/resources/1823/

[113] K. Mistry et al., “A 45nm Logic Technology with High-k + Metal Gate Transistors, Strained Silicon, 9 Cu Interconnect Layers, 193nm Dry Patterning, and 100% Pb-free Packaging,” Proceedings of the IEEE International Electron Devices Meeting, pp. 247-250, December 2007.

[114] Y. Taur, C. H. Wann, and D. J. Frank, “25 nm CMOS Design Considerations,” Proceedings of the IEEE Electron Devices Meeting, pp. 789-792, December 1998.

[115] N. Mohta and S. E. Thompson, “Mobility Enhancement,” IEEE Circuits and Devices Magazine, Vol. 21, Issue 5, pp. 18-23, September/October 2005.

[116] E. Ungersboeck, V. Sverdlov, H. Kosina, and S. Selberherr, "Strain Engineering for CMOS Devices," Proceedings of the IEEE International Conference on Solid-State and Integrated Circuit Technology, pp. 124-127, October 2006.

[117] S. Thompson et al., “A 90 nm Logic Technology Featuring 50 nm Strained Silicon Channel Transistors, 7 Layers of Cu Interconnects, Low k ILD, and 1 µm² SRAM Cell,” Proceedings of the IEEE International Electron Devices Meeting, pp. 61-64, 2002.

[118] S. E. Thompson et al., “A 90-nm Logic Technology Featuring Strained-Silicon,” IEEE Transactions on Electron Devices, Vol. 51, No. 11, pp. 1790-1797, November 2004.

[119] S.-H. Kim and J. G. Fossum, “Design Optimization and Performance Projections of Double-Gate FinFETs with Gate–Source/Drain Underlap for SRAM Application,” IEEE Transactions on Electron Devices, Vol. 54, No. 8, pp. 1934-1942, 2007.

[120] J.-W. Yang et al., “Enhanced Performance and SRAM Stability in FinFET with Reduced Process Steps for Source/Drain Doping,” Proceedings of the IEEE Symposium on VLSI Technology, Systems, and Applications, pp. 20-21, April 2008.

[121] J.-S. Lim, S. E. Thompson, and J. G. Fossum, “Comparison of Threshold-Voltage Shifts for Uniaxial and Biaxial Tensile-Stressed n-MOSFETs,” IEEE Electron Device Letters, Vol. 25, No. 11, pp. 731-733, November 2004.

[122] J.-S. Goo et al., “Scalability of Strained-Si nMOSFETs Down to 25 nm Gate Length,” IEEE Electron Device Letters, Vol. 24, No. 5, pp. 351-353, May 2003.

[123] C. Y. Kang et al., “A Novel Electrode-Induced Strain Engineering for High Performance SOI FinFET Utilizing Si (110) Channel for Both N and P MOSFETs,” Proceedings of the IEEE International Electron Devices Meeting, pp. 1-4, December 2006.

[124] R. Kuchipudi and H. Mahmoodi, “Strain Silicon Optimization for Memory and Logic in Nano-Scale CMOS,” Proceedings of the IEEE International Symposium on Quality Electronic Design, pp. 27-32, March 2007.

[125] W. K. Luk and R. H. Dennard, “Gated-Diode Amplifiers,” IEEE Transactions on Circuits and Systems-II: Express Briefs, Vol. 52, No. 5, pp. 266-270, May 2005.

Appendix A: Publications

Journal Papers:

[1] S. A. Tawfik and V. Kursun, “Dual Supply Voltages and Dual Clock Frequencies for Lower Clock Power and Suppressed Temperature-Gradient Induced Clock Skew,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2009 (accepted, in press).

[2] S. A. Tawfik and V. Kursun, “FinFET Domino Logic with Independent Gate Keepers,” Microelectronics Journal, 2009 (accepted, in press).

[3] S. A. Tawfik and V. Kursun, “Clock Distribution Networks with Gradual Signal Transition Time Relaxation for Reduced Power Consumption,” Journal of Circuits, Systems, and Computers, Vol. 17, No. 6, pp. 1173-1191, December 2008.

[4] S. A. Tawfik and V. Kursun, “Low Power and High Speed Multi Threshold Voltage Interface Circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2009 (accepted, in press).

[5] S. A. Tawfik and V. Kursun, “Low-Power and Compact Sequential Circuits with Independent-Gate FinFETs,” IEEE Transactions on Electron Devices, Vol. 55, No. 1, pp. 60-70, January 2008.

Conference Papers:

[6] S. A. Tawfik and V. Kursun, “Portfolio of FinFET Memories: Innovative Techniques for an Emerging Technology,” Proceedings of the IEEE International SoC Design Conference, November 2008.

[7] S. A. Tawfik and V. Kursun, “Stability Enhancement Techniques for Nanoscale SRAM Circuits: A Comparison,” Proceedings of the IEEE International SoC Design Conference, November 2008.

[8] S. A. Tawfik and V. Kursun, “Multi-Vth FinFET Sequential Circuits with Independent-Gate Bias and Work-Function Engineering for Reduced Power Consumption,” Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, December 2008 (INVITED PAPER).

[9] S. A. Tawfik and V. Kursun, “Asymmetric Dual-Gate Multi-Fin Keeper Bias Options and Optimization for Low Power and Robust FinFET Domino Logic,” Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems, December 2008.

[10] S. A. Tawfik and V. Kursun, “Low Power and Robust 7T Dual-Vt SRAM Circuit,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1452-1455, May 2008.

[11] S. A. Tawfik and V. Kursun, “Dynamic Wordline Voltage Swing for Low Leakage and Stable Static Memory Banks,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1894-1897, May 2008.

[12] S. A. Tawfik and V. Kursun, “Work-Function Engineering for Reduced Power and Higher Integration Density: An Alternative to Sizing for Stability in FinFET Memory Circuits,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 788-791, May 2008.

[13] S. A. Tawfik and V. Kursun, “Compact FinFET Memory Circuits with P-Type Data Access Transistors for Low Leakage and Robust Operation,” Proceedings of the IEEE/ACM International Symposium on Quality Electronic Design, pp. 855-860, March 2008.

[14] S. A. Tawfik and V. Kursun, “Characterization of New Static Independent-Gate-Biased FinFET Latches and Flip-Flops under Process Variations,” Proceedings of the IEEE/ACM International Symposium on Quality Electronic Design, pp. 311-316, March 2008.

[15] Z. Liu, S. A. Tawfik, and V. Kursun, “Statistical Data Stability and Leakage Evaluation of FinFET SRAM Cells with Dynamic Threshold Voltage Tuning under Process Parameter Fluctuations,” Proceedings of the IEEE/ACM International Symposium on Quality Electronic Design, pp. 305-310, March 2008.

[16] S. A. Tawfik, Z. Liu, and V. Kursun, “Independent-Gate and Tied-Gate FinFET SRAM Circuits: Design Guidelines for Reduced Area and Enhanced Stability,” Proceedings of the IEEE International Conference on Microelectronics, pp. 171-174, 2007.

[17] S. A. Tawfik and V. Kursun, “High Speed FinFET Domino Logic Circuits with Independent Gate-Biased Double-Gate Keepers Providing Dynamically Adjusted Immunity to Noise,” Proceedings of the IEEE International Conference on Microelectronics, pp. 175-178, 2007.

[18] S. A. Tawfik and V. Kursun, “Buffer Insertion and Sizing in Clock Distribution Networks with Gradual Transition Time Relaxation for Reduced Power Consumption,” Proceedings of the IEEE International Conference on Electronics, Circuits and Systems, pp. 845-848, 2007.

[19] S. A. Tawfik and V. Kursun, “Low Power and Stable FinFET SRAM with Static Independent Gate Bias for Enhanced Integration Density,” Proceedings of the IEEE International Conference on Electronics, Circuits and Systems, pp. 443-446, 2007.

[20] S. A. Tawfik and V. Kursun, “Low-Power High-Performance FinFET Sequential Circuits,” Proceedings of the IEEE International Systems on Chip (SOC) Conference, pp. 145-148, September 2007.

[21] Z. Liu, S. A. Tawfik, and V. Kursun, “An Independent-Gate FinFET SRAM Cell for High Data Stability and Enhanced Integration Density,” Proceedings of the IEEE International Systems on Chip (SOC) Conference, pp. 63-66, September 2007.

[22] S. A. Tawfik and V. Kursun, “Dual Signal Frequencies and Voltage Levels for Low Power and Temperature-Gradient Tolerant Clock Distribution,” Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, pp. 62-67, August 2007.

[23] V. Kursun, S. A. Tawfik, and Z. Liu, “Leakage-Aware Design of Nanometer SoC,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 3231-3234, May 2007 (INVITED PAPER).

[24] S. A. Tawfik and V. Kursun, “Low-Power Low-Voltage Hot-Spot Tolerant Clocking with Suppressed Skew,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 645-648, May 2007.

[25] S. A. Tawfik and V. Kursun, “Multi-Vth Level Conversion Circuits for Multi-VDD Systems,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1397-1400, May 2007.

[26] S. A. Tawfik and V. Kursun, “Dual-VDD Clock Distribution for Low Power and Minimum Temperature Fluctuations Induced Skew,” Proceedings of the IEEE/ACM International Symposium on Quality Electronic Design, pp. 73-78, March 2007.