Exploiting Instruction Redundancy for Transient Fault Tolerance

Toshinori Sato
Department of Artificial Intelligence
Kyushu Institute of Technology
[email protected]

Abstract

This paper presents an approach for integrating fault-tolerance techniques into microprocessors by utilizing instruction redundancy as well as time redundancy. Smaller transistors, higher clock frequencies, and lower power supply voltages reduce the reliability of microprocessors. In addition, microprocessors are used in systems that require high dependability, such as e-commerce businesses. Given these trends, reliability is expected to become as important as performance and cost for future microprocessors. To meet this demand, we previously proposed and evaluated a fault-tolerance mechanism that is based on instruction reissue and utilizes time redundancy, and found that it incurs severe performance loss. To mitigate the loss, this paper proposes to exploit instruction redundancy: using a reuse table, previously executed computations are reused to check for the occurrence of transient faults. Detailed simulations show that the performance loss caused by introducing fault tolerance into 4-way and 8-way superscalar processors is 12.5% and 20.8%, respectively.

1 Introduction

As transistors shrink in size with succeeding technology generations, future microprocessors will become susceptible to transient faults [19], which constitute the vast majority of hardware failures [28]. Transient faults are random events that occur when various noise sources cause an incorrect result. For example, cosmic ray particles can alter the state of latches and dynamic logic, resulting in logic errors. At the same time, the use of portable and mobile computer platforms such as laptops and cellular phones is spreading widely. For example, it is expected that new phones will be used for financial services and other e-commerce businesses in the near future. For these applications, reliability and dependability are very important. In light of this, reliability is expected to become as important as performance and cost for future microprocessors.

Current fault-tolerance techniques utilized in commercial systems such as the IBM S/390 G5 [28] and the HP/Compaq NonStop [4] are based on redundancy. For example, error checking is implemented by duplicating chips and comparing their outputs. These techniques require twice or more hardware overhead. In addition, duplicate-and-compare is adequate only for error detection. Another example is parity or error-correcting code (ECC). Contemporary microprocessors use parity for caches. However, they leave control, arithmetic, and logical functions unchecked, since it is difficult and time-consuming to check these components [5, 28]. Hence, a low-cost and simple fault-tolerance technique is necessary for future microprocessors.

Recently, we have investigated the use of the instruction reissue technique as a hardware mechanism to detect and recover from transient faults [20]. Originally, the instruction reissue mechanism was proposed for incorrect data speculation recovery. We modified and applied this mechanism to fault tolerance. Since future microprocessors will utilize data speculation and include the instruction reissue mechanism [6, 7], this fault-tolerance mechanism incurs the least hardware overhead. In addition, it is simple because detection of transient faults and their recovery are done simultaneously by means of dynamic instruction scheduling. However, we unfortunately found significant performance loss. In this paper, we investigate the use of instruction redundancy to mitigate this loss.

The rest of this paper is organized as follows: Section 2 surveys related work. Section 3 explains how the instruction reissue mechanism is applicable to fault tolerance, and explores the design space to enhance the original proposal. Section 4 describes our evaluation methodology. Section 5 discusses our simulation results. Finally, Section 6 presents our conclusions.

2 Related Work

IBM S/390 G5 [28] is a commercial fault-tolerant microprocessor. However, it relies on traditional fault-tolerance techniques and thus is not adequate for mobile applications. The HaL SPARC64 [23] and Intel Itanium [14] processors have fault detection mechanisms for caches and memories but not for functional units. In Fujitsu SPARC64 V [5], all chip-internal data paths are covered by parity.

DIVA [1] and SSD [8] are fault-tolerant microprocessors based on space redundancy. A checker processor attached behind the main processor pipeline is used for dynamically verifying committed instructions. Any hardware faults are corrected using the recovery mechanism for incorrect branch predictions. Hence, DIVA and SSD are hardware-based mechanisms and thus transparent. Sohi et al. [25] and Ray et al. [15] duplicate every instruction and detect faults by comparing the two instructions that originate from a single instruction. Sohi et al. focus on fault detection and do not consider the recovery process; in the present paper, we will consider a transparent fault recovery mechanism. Ray et al. utilize the recovery mechanism for incorrect branch predictions to recover from faults. A possible problem in their mechanism is that the two executions belonging to a single instruction might be affected by the same transient fault, in which case the fault would not be detected.

A similar idea of using time redundancy for detecting transient faults is utilized in the AR-SMT [18], SRT [12, 16], and SRTR [31] processors. They are all based on simultaneous multithreading (SMT) processors [29]. In contrast, our approach is based on single-threaded superscalar processors. An SMT processor can execute multiple threads simultaneously and thus is attractive for fault detection, since two redundant copies of a single thread can be executed on the SMT and a fault detected by comparing the two results. While the AR-SMT and SRT processors exploit these characteristics, they only detect transient faults and cannot recover from them transparently; recovery must be supported by the OS. Only the SRTR processor has the ability to recover from faults. In addition, generating two redundant threads also requires OS support. Hence, transient fault detection based on these SMT-based processors is not transparent.

Our approach is most similar to O3RS (Out of Order Reliable Superscalar) proposed by Mendelson et al. [9]. It also exploits the instruction reissue mechanism for fault detection. However, O3RS does not utilize instruction redundancy. REESE [27] also executes every instruction twice by injecting completed instructions into its instruction scheduling window, but it relies on a dedicated FIFO queue rather than the instruction reissue mechanism.

Instruction reuse [10, 24] is a technique which eliminates redundant computations by utilizing instruction redundancy. Instruction redundancy is a characteristic of programs: dynamic instances of a static instruction are executed with the same operand values many times. Thus, if the previous computation is kept in a table, the next occurrence of the same computation can be eliminated by looking up the table. Sodani et al. and Molina et al. proposed the reuse buffer and the redundant computation buffer, respectively, for utilizing instruction redundancy and thus for boosting processor performance. In contrast, this paper investigates instruction redundancy not for performance gain but for reliability.

Figure 1: Protected portion

3 Fault Tolerance Mechanism

This section describes the transient fault tolerance mechanism based on the instruction reissue technique [11, 17, 22]. While the described mechanism is applicable to any dynamic instruction scheduler, we explain it using our proposed instruction reissue mechanism [22], which is based on the register update unit (RUU) [26].

3.1 Fault detection via instruction reissue

The proposed mechanism detects transient faults occurring in arithmetic and logical functions. Thus, the protected portion is depicted as the shaded box in Figure 1. We focus on these functions because they are unchecked in modern microprocessors [28], except for a few products [5]. The instruction cache, register files, and the RUU should be protected using parity or ECC, which is common in modern microprocessors.

In order to detect transient faults, we propose to duplicate committed instructions and compare the two results of a single instruction [20]. The comparator is assumed to be fault-free; this is possible by using large strong cells [23] or triple modular redundancy. We use instruction reissue to execute each committed instruction twice. When an instruction becomes ready for commitment, it is reissued in the RUU and dispatched to a functional unit again. The execution outcome generated the first time is held in the RUU and is compared with its equivalent when it is generated the second time. If they do not match, a transient fault has been detected. Thus, different from the SMT approach, our proposal requires no additional program counters.

There is an advantage to reissuing only committed instructions. When an instruction is ready for commitment, its dependences, both control and data, have been resolved. Therefore, it can be dispatched unconditionally whenever an appropriate functional unit is free. This utilizes the dispatch bandwidth efficiently. Another role of the reissue policy is to ensure that the first and second executions are not affected by the same transient event: completed instructions stay in the instruction window for a while before they are reissued, which works just like the delay queue in the AR-SMT processor [18]. On the other hand, a possible disadvantage of the policy is increased pressure on the effective instruction window capacity. We evaluated the influence on processor performance in [20] and found significant performance loss. Unfortunately, the loss cannot easily be reduced by increasing the instruction window size [20].

Figure 2: Protected portion for revised mechanism
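The compare-on-commit idea above can be sketched in a few lines of Python. This is an illustrative model only, not the simulator's implementation: the class and function names are ours, and `execute` stands in for a functional unit.

```python
# Illustrative sketch of commit-time reissue-and-compare fault detection.
# All names are hypothetical; the real mechanism lives inside the RUU of a
# cycle-accurate simulator.

class RUUEntry:
    def __init__(self, op, operands):
        self.op = op
        self.operands = operands
        self.first_outcome = None   # result held from the first execution
        self.reissued = False

def execute(op, operands):
    # Stand-in for a functional unit.
    if op == "add":
        return operands[0] + operands[1]
    raise NotImplementedError(op)

def try_commit(entry):
    """Return True when the entry may retire.

    The first time an entry becomes ready to commit, it is reissued instead;
    the second outcome is compared against the stored first outcome, and a
    mismatch signals a transient fault.
    """
    outcome = execute(entry.op, entry.operands)
    if not entry.reissued:
        entry.first_outcome = outcome
        entry.reissued = True        # stay in the RUU, dispatch once more
        return False
    if outcome != entry.first_outcome:
        raise RuntimeError("transient fault detected")
    return True
```

In this sketch an entry always takes two calls to retire, which mirrors the performance cost the paper measures: every committed instruction occupies issue bandwidth and RUU capacity twice.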

3.2 Overhead reduction

From our previous studies, we have found that the fault-tolerance technique based on the instruction reissue mechanism causes considerable performance degradation. Hence, we investigate ways to mitigate the overhead caused by the introduction of the fault-tolerance mechanism. First, we focus on load instructions [21]. Then, we propose to exploit instruction redundancy as well as time redundancy.

3.2.1 Redundant memory access removal

Load instructions diminish the performance of the proposed fault-tolerant processor for the following reason. The execution of a load consists of an address calculation and a memory access, so its execution latency is longer than that of any other instruction, even though reissued loads will almost always hit in the L1 cache. This puts pressure on the RUU capacity. Based on these observations, we investigate a way to eliminate this possible bottleneck.

We remove the redundant memory accesses performed by reissued load instructions. The original mechanism executes every load instruction twice: both the address calculation and the memory access are performed twice. Since the data cache can be protected by parity or ECC, it is not necessary to detect faults in the data cache using instruction reissue. We therefore reissue only the address calculation and perform the memory access once. Thus, the protected portion is revised as depicted in Figure 2, while the data cache should be protected by parity or ECC, which is common in modern microprocessors. One possible demerit of this revision is that faults on the data cache busses cannot be detected. We evaluated the removal of redundant memory accesses in [21] and found that there is still considerable performance overhead.

Figure 3: Reuse table

3.2.2 Exploiting instruction redundancy

Instruction reuse [10, 24] eliminates redundant computations by utilizing instruction redundancy, a characteristic of programs: dynamic instances of a static instruction are executed with the same operand values many times. Thus, if the previous computation is kept in a table, the next occurrence of the same computation can be eliminated by looking up the table.

Figure 3 depicts the reuse table studied in this paper. It resembles a cache: each entry has an opcode field, two operand fields, and an outcome field. It is a direct-mapped table indexed by a function of the two operands. In this paper, we use the simple function operand1 xor (not operand2). The reuse table keeps register-writing instructions and store instructions; for load and store instructions, the address calculation is the object of instruction reuse. In the original scheme, when an instruction is executed, its computation is registered in the reuse buffer regardless of whether it is speculative or not; thereafter, when an instruction's operands become available, the reuse buffer is consulted.

Because we use the reuse buffer at the execution stage, no performance gain is obtained for instructions with short latency. Instead, we exploit instruction redundancy for fault tolerance. Thus, the access time of the cache-like structure is not a problem; it is only required to be less than the time to execute every instruction twice redundantly. In order to use the reuse buffer for fault tolerance, its registration policy must be changed: a computation is registered in the reuse buffer only when its instruction is committed. While registration at the commit stage delays re-usability, it is necessary because the reuse buffer should keep only verified execution results, and our tolerance mechanism checks for errors at the commit stage. An advantage of utilizing the reuse buffer for fault tolerance is that, being a memory, it can be protected by parity or ECC, so transient faults cannot affect it.

When an instruction is executed, the reuse buffer is consulted in parallel. If the previous execution result is obtained from the reuse buffer, the instruction need not be reissued to check for transient faults; instead, its execution result from the functional unit is compared with the one from the reuse buffer. Its latency is thereby significantly reduced, and so is the overhead of fault tolerance. While the reuse buffer itself is protected by parity or ECC, transient faults may still occur in its logic and busses; therefore the instruction is executed in the functional unit in parallel with consulting the reuse buffer. Note that branch instructions are not covered by the reuse buffer. While every branch is still executed twice, we expect branch prediction to help tolerate the latency.
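A minimal sketch of the direct-mapped reuse table just described, including the paper's index function operand1 xor (not operand2). The entry layout and the 64-bit operand width are our assumptions; field widths in a real design would be fixed by the ISA.

```python
# Sketch of the direct-mapped reuse table: each entry holds an opcode, two
# operands, and a verified outcome. The 64-bit operand width (MASK64) is an
# illustrative assumption.

MASK64 = (1 << 64) - 1

class ReuseTable:
    def __init__(self, n_entries):
        assert n_entries & (n_entries - 1) == 0, "entry count must be a power of two"
        self.entries = [None] * n_entries
        self.mask = n_entries - 1

    def _index(self, op1, op2):
        # The paper's index function: operand1 xor (not operand2).
        return (op1 ^ (~op2 & MASK64)) & self.mask

    def register(self, opcode, op1, op2, outcome):
        # Called at commit, so only verified results are stored.
        self.entries[self._index(op1, op2)] = (opcode, op1, op2, outcome)

    def lookup(self, opcode, op1, op2):
        # Return the stored outcome on a full tag match, else None.
        e = self.entries[self._index(op1, op2)]
        if e is not None and e[:3] == (opcode, op1, op2):
            return e[3]
        return None
```

On a hit, the table's outcome replaces the second (reissued) execution: the functional unit's result is compared directly against `lookup(...)`, so the instruction retires without occupying a functional unit twice.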

3.3 Transparent fault recovery

Even when fault detection is successful, a system with no recovery scheme will usually hang, bringing the application down. Recovery can be handled by either the OS or hardware. In this section, we explain two transparent hardware-based recovery schemes. One uses the existing speculation recovery mechanism for mispredicted branches [1, 15, 20], and the other is based on the instruction reissue mechanism for incorrect data speculation [20]. In this paper, we assume the failure is transient, and thus instruction retry succeeds. That is, when a fault is detected by the instruction reissue mechanism described in the previous section, re-executing the faulting instruction is sufficient. If we utilize the recovery mechanism for mispredicted branches, then when a transient fault is detected the microprocessor flushes its pipeline and restarts at the corresponding instruction. All instructions following the faulting instruction are squashed, and thus the penalty is very large. However, the resulting performance degradation is expected to be small, since the frequency of transient faults is low [15].

On the other hand, the fault recovery mechanism utilizing instruction reissue exploits its ability to squash selectively. Just as the mechanism reissues only misspeculated instructions when a data dependence misprediction occurs, we reissue only the instructions dependent upon a faulting instruction; that is, a fault is treated as a misspeculation. Since we already know that the instruction reissue mechanism suffers smaller penalties from mispredicted data speculation than the instruction squashing mechanism [30], performance degradation will be further reduced.

4 Evaluation Environment
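The two recovery policies above differ only in which instructions are redone. A small sketch makes the contrast concrete; the window/dependence representation is our own illustrative assumption, not the simulator's.

```python
# Contrast of the two recovery options: full squash (branch-recovery style)
# versus selective reissue of dependents (instruction-reissue style).
# Each window entry records the indices of earlier instructions it depends on.

def squash_recovery(window, fault_idx):
    # Pipeline flush: the faulting instruction and everything after it is redone.
    return set(range(fault_idx, len(window)))

def selective_recovery(window, fault_idx):
    # Reissue only the faulting instruction and its transitive dependents.
    to_reissue = {fault_idx}
    for i in range(fault_idx + 1, len(window)):
        if window[i]["deps"] & to_reissue:
            to_reissue.add(i)
    return to_reissue
```

With a window of four instructions where instruction 1 depends on 0 and instruction 3 depends on 1, a fault on instruction 0 forces all four to be redone under full squash, but only instructions 0, 1, and 3 under selective reissue; instruction 2, being independent, retires untouched.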

In this section, we describe our evaluation environment by explaining a processor model and benchmark programs.

4.1 Processor model

We implemented a timing simulator using the SimpleScalar/Alpha tool set (ver.3.0a) [3]. We evaluate 4- and 8-way out-of-order execution superscalar processors. Dynamic instruction scheduling is based on the RUU, which has 64 entries. Each functional unit can execute any operation. The execution latency is 1 cycle except in the cases of multiplication (4 cycles) and division (12 cycles). A non-blocking, 128KB, 32B-block, 2-way set-associative L1 data cache is used for data supply. The numbers of cache ports are 2 and 4 in the 4- and 8-way superscalar processor models, respectively. The data cache has a load latency of 1 cycle after the data address is calculated and a miss latency of 6 cycles. It is backed by an 8MB, 64B-block, direct-mapped L2 cache with a miss latency of 18 cycles for the first word plus 2 cycles for each additional word. No memory operation that follows a store whose data address is unknown can be executed. A 128KB, 32B-block, 2-way set-associative L1 instruction cache is used for instruction supply and shares the L2 cache with the L1 data cache. For control prediction, a 1K-entry 4-way set-associative branch target buffer, a 4K-entry gshare-type 2-level adaptive branch predictor, and an 8-entry return address stack are used. The branch predictor is speculatively updated at the instruction decode stage.
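For reference, the model parameters above can be collected in one place. The dictionary structure and key names below are ours; SimpleScalar itself takes these as command-line options.

```python
# Processor model parameters transcribed from the text above.
# Structure and key names are illustrative, not SimpleScalar's flag names.
CONFIG = {
    "issue_width": (4, 8),                        # the two evaluated models
    "ruu_entries": 64,
    "exec_latency": {"default": 1, "mult": 4, "div": 12},
    "l1_dcache": {"size_kb": 128, "block_b": 32, "assoc": 2,
                  "ports": {"4-way": 2, "8-way": 4},
                  "hit_latency": 1, "miss_latency": 6},
    "l2_cache": {"size_mb": 8, "block_b": 64, "assoc": 1,
                 "miss_latency_first_word": 18, "per_extra_word": 2},
    "l1_icache": {"size_kb": 128, "block_b": 32, "assoc": 2},
    "btb": {"entries": 1024, "assoc": 4},
    "branch_predictor": {"type": "gshare", "entries": 4096},
    "ras_entries": 8,
}
```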

4.2 Benchmark programs

The SPEC2000 benchmark suite is used for this study. Table 1 lists the benchmarks and their input sets. We use the object files provided by the University of Michigan. For each program except 252.eon, 1 billion instructions are skipped before actual simulation begins. Each program is executed to completion or for 100 million instructions. Table 1 also shows the number of instructions executed and the execution cycles. We count only committed instructions.

Figure 4: %Increase in execution cycles; (i) 4-way superscalar, (ii) 8-way superscalar (bars: baseline, no_mem, reuse)

Table 1: Benchmark programs

program  input set          #inst.s  #cycles (4-way)  #cycles (8-way)
gzip     input.compressed   100.0M   43.0M            31.1M
vpr      net.in arch.in     100.0M   48.2M            37.2M
gcc      cccp.i             100.0M   54.5M            41.4M
crafty   crafty.in          100.0M   43.4M            29.2M
parser   test.in            100.0M   49.0M            36.6M
eon      chair              94.0M    39.2M            23.8M
vortex   lendian.raw        100.0M   41.2M            26.5M
bzip2    input.random       100.0M   34.8M            18.7M

5 Simulation Results

This section presents our simulation results. First, we show how significantly the performance loss is mitigated. Next, we evaluate the sensitivity of performance to the number of reuse table entries. Last, our approach is compared with a space-redundant technique based on a chip multiprocessor (CMP) [13].

5.1 Processor performance

We use execution cycles to evaluate processor performance, and the figures present the percent increase in cycles. In this paper, we do not assume that any transient faults occur; we only evaluate the performance penalty caused by introducing the proposed mechanism. Therefore, the impact of the fault recovery mechanism on performance is not evaluated.

For each group of three bars for every program, the left bar, labeled baseline in Figure 4, shows the performance degradation due to the duplication of all committed instructions in the cases of the 4- and 8-way superscalar processors. As can easily be seen, performance of the 4-way superscalar is diminished significantly, although the loss is smaller than in the case where the program is executed twice and the results are compared. Performance is reduced by an average of 73.0%. On the other hand, the performance degradation of the 8-way superscalar is more moderate: an average of 47.9%. The major reason is the increasing latency of instructions until commitment [21]. Each reissued instruction stays in the instruction window longer than in the processor without fault tolerance. The baseline columns in Tables 2 and 3 summarize the number of clock cycles each instruction stays in the RUU. The commitment latency is very large, especially for the 4-way superscalar. This increases the execution cycles of the processor models utilizing the baseline fault-tolerance mechanism. The goal of this paper is to mitigate this performance loss. Next, we show the effect of removing redundant memory accesses; after that, we evaluate the use of the reuse table.

Table 2: Commitment latency (cycles): 4-way

prog     baseline  no_mem  reuse:1K  reuse:2K  reuse:4K  reuse:1M  ideal
gzip     31.0      26.5    21.1      20.8      20.8      20.8      14.9
vpr      21.4      18.4    15.0      14.9      14.8      14.6      10.0
gcc      21.5      17.8    13.5      13.5      13.5      13.3      9.54
crafty   27.9      23.8    17.9      17.7      17.6      17.0      13.2
parser   25.6      21.8    16.0      15.9      15.9      15.7      12.2
eon      34.7      29.2    21.5      21.4      21.3      21.0      16.4
vortex   35.2      29.5    21.3      21.2      21.2      20.8      16.9
bzip2    43.8      37.8    28.0      27.8      27.7      27.3      21.5

Table 3: Commitment latency (cycles): 8-way

prog     baseline  no_mem  reuse:1K  reuse:2K  reuse:4K  reuse:1M  ideal
gzip     18.2      16.5    14.9      14.8      14.8      14.7      11.8
vpr      12.9      11.6    10.9      10.8      10.7      10.7      8.28
gcc      12.3      10.6    9.56      9.53      9.51      9.45      7.13
crafty   15.2      13.3    12.0      12.0      11.9      11.6      8.50
parser   14.6      12.8    11.6      11.6      11.5      11.5      8.96
eon      19.2      16.6    15.0      15.0      14.9      14.8      9.95
vortex   18.9      16.1    14.2      14.1      14.1      14.0      10.0
bzip2    22.0      19.0    17.0      16.9      16.8      16.6      11.0

Figure 4 also shows how memory accesses affect performance. For each group of three bars, the middle bar, labeled no_mem, indicates the increase in execution cycles over the processor without fault tolerance when redundant memory accesses are eliminated. It is 51.1% and 33.0% for the 4- and 8-way models, respectively. In other words, the elimination reduces the overhead by 30.0% and 31.1% for the 4- and 8-way models, respectively. This means that the long execution latency is one of the main sources of the overhead; in other words, the pressure on the RUU capacity should be mitigated. The no_mem columns in Tables 2 and 3 show the commitment latencies when redundant memory accesses are removed. We can see that the latencies are also reduced.

In the next evaluation, we use a 1M-entry direct-mapped reuse buffer. Such a table would be too large to realize, but it allows us to evaluate the use of instruction redundancy while initially ignoring the issue of contention. For each group of three bars for every program, the right bar, labeled reuse in Figure 4, shows the increase in execution cycles for the 4- and 8-way superscalar processors. As can easily be seen, the increase for the 4-way superscalar becomes significantly small: an average of 12.5%. The increase for the 8-way superscalar is a little larger: an average of 20.8%. In other words, exploiting instruction redundancy mitigates the performance loss by 82.9% and 56.6% for the 4- and 8-way models, respectively. The reuse:1M columns in Tables 2 and 3 show the commitment latencies when the 1M-entry reuse table is used. This confirms that the decrease in commitment latency improves processor performance.

A reason why the use of instruction redundancy contributes more to the 4-way superscalar than to the 8-way one is that the decrease in commitment latency affects processor performance more strongly in the 4-way case, because the critical paths are longer in the 4-way superscalar than in the 8-way one due to the availability of functional units.

Figure 5: %Reusable instruction ratio; (i) 4-way superscalar, (ii) 8-way superscalar (bars: 1K, 2K, 4K, 1M entries, divided into reuse_hit and reuse_miss)

Next, we compare our proposed fault-tolerance mechanism with other mechanisms based on time redundancy: the 2-way redundant superscalar [15], SRT [12], and SRTR [31]. The SRT and SRTR processors are SMT processors based on an 8-way superscalar. The 2-way redundant superscalar is also an 8-way superscalar but has only 4 integer ALUs. Thus, the hardware cost of SRT and SRTR is approximately equal to that of our 8-way processor model, and that of the 2-way redundant superscalar lies between those of our 4- and 8-way processors. It should be noted that the SRT processor only detects transient faults and cannot recover processor state. The articles report that both the 2-way redundant superscalar and SRT suffer about 30% performance degradation [12, 15], and that SRTR performs slightly worse than SRT [31]. In summary, the previous fault-tolerance mechanisms based on time redundancy suffer approximately 30% performance loss. In contrast, our proposed mechanism causes 12.5% and 20.8% loss for the 4- and 8-way superscalar processors, respectively.
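The relative overhead reductions quoted above (30.0% and 31.1% for no_mem, 82.9% and 56.6% for reuse) follow directly from the raw overhead percentages. A small sketch makes the arithmetic explicit; the function name is ours.

```python
# How the "overhead reduced by X%" figures in the text are derived from the
# raw %-increase-in-execution-cycles numbers.
def relative_reduction(before_pct, after_pct):
    """Percent of the original overhead removed by an optimization."""
    return 100.0 * (before_pct - after_pct) / before_pct

# 4-way: baseline 73.0% -> no_mem 51.1% -> reuse 12.5%
# 8-way: baseline 47.9% -> no_mem 33.0% -> reuse 20.8%
```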

5.2 Impact of reuse table size

Next, we evaluate how the number of reuse table entries affects processor performance. Figure 5 shows the percentage of reusable instructions. For each group of four bars, the bars from left to right indicate the reusable ratios for reuse tables with 1K, 2K, 4K, and 1M entries, respectively. Each bar is divided into two parts: the lower part indicates the percentage of instructions that find their associated entry in the reuse table (i.e., reuse table hits), and the upper part indicates the percentage of instructions that cannot (i.e., reuse table misses). The sum of the two parts is not equal to 100%; the remainder is the percentage of instructions, such as branches, that do not refer to the reuse table at all. It can easily be seen that approximately 80% of instructions refer to the reuse table. We cannot find any considerable difference in re-usability between the 4- and 8-way processors, and the hit ratio (i.e., reuse table hits / (hits + misses)) increases from approximately 30% for the 1K-entry table to 40% for the 1M-entry table. This implies that the same computations occur frequently within short periods of time.

Figure 6: %Increase in execution cycles; (i) 4-way superscalar, (ii) 8-way superscalar (bars: 1K, 2K, 4K, 1M entries)

Figure 6 shows the impact of the reuse table size on performance. For each group of four bars, the bars from left to right indicate the increase in execution cycles over the processor without fault tolerance for reuse tables with 1K, 2K, 4K, and 1M entries, respectively. The difference between the 1K- and 1M-entry reuse tables is small, and the 4-way and 8-way processors with the 1K-entry reuse table perform only 15.4% and 22.7% worse, respectively, than those without fault tolerance. The reuse columns in Tables 2 and 3 show the commitment latencies for the 1K-, 2K-, 4K-, and 1M-entry reuse buffers; the latency changes little as the table size is increased. The ideal columns in Tables 2 and 3 show the commitment latencies of the processor without fault tolerance. In most cases, the difference between the reuse and ideal cases is only a few cycles.
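The hit ratio used above counts only instructions that actually refer to the table (branches are excluded entirely). A one-line helper, with illustrative counts:

```python
# Hit ratio as defined in the text: reuse table hits over all table references.
def hit_ratio(hits, misses):
    return 100.0 * hits / (hits + misses)
```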

5.3 Comparison with chip multiprocessor

Next, we compare our approach with a space-redundant technique based on the CMP [2, 13]. The CMP is a complexity-effective solution for large-scale future microprocessors. In this evaluation, we use a CMP consisting of two smaller superscalar processors. They each execute an identical program, and the two outcomes of each single instruction are compared. We compare the 4-way superscalar with a CMP consisting of two 2-way superscalars, and the 8-way superscalar with one consisting of two 4-way superscalars. The hardware resources of the smaller superscalars are half those of the baseline model, so the total hardware budgets of the time and space redundancy models are approximately the same. Note that only the space-redundant technique can detect errors in the caches, register files, and the RUU; in the time-redundant technique, however, these can be protected using parity or ECC.

Figure 7 presents the results. Execution cycles are used for the comparison. For each group of two bars, the left is for the proposed model utilizing time redundancy and instruction redundancy, and the right is for the CMP utilizing space redundancy. The capacity of the reuse table is 1K entries. Note that the penalties caused by synchronizing the two component processors in the CMP are not considered; hence, the simulation results are favorable to the CMP. It can easily be observed that the proposed time redundancy technique suffers less performance degradation than the space redundancy technique. Thus, our approach is an efficient mechanism for fault tolerance.

6 Concluding Remarks

The investigation of fault-tolerance techniques for microprocessors is driven by two issues. One regards deep submicron fabrication technologies: future semiconductor technologies could become more susceptible to alpha particles and other cosmic radiation. The other is the increasing popularity of mobile platforms: cellular telephones are currently used for applications that are critical to our financial security, such as mobile banking, mobile trading, and making airline ticket reservations. Such applications demand that computer systems work correctly. In light of this, we proposed a fault-tolerance mechanism for future microprocessors. It is based on the instruction reissue technique for incorrect data speculation recovery and utilizes time redundancy. Our previous evaluation showed that the proposed mechanism caused considerable performance loss. In this paper, we proposed to exploit instruction redundancy to mitigate this loss: only instructions that cannot utilize instruction redundancy are reissued to detect transient faults. We evaluated our proposal using a timing simulator and found that the performance loss can be significantly reduced by the instruction reuse mechanism.

Acknowledgments

This work was supported in part by a grant from the Okawa Foundation for Information and Telecommunications (No.01-13) and in part by the Grant-in-Aid for Scientific Research (B) from the Japan Society for the Promotion of Science (No.13558030).

Figure 7: %Increase in execution cycles for the two redundancy models; (i) 4-way superscalar, (ii) 8-way superscalar (bars: time+insn, space)

References

[1] T.M.Austin, "DIVA: a reliable substrate for deep submicron microarchitecture design," 32nd Int. Symp. on Microarchitecture, 1999.
[2] D.C.Bossen et al., "Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology," IBM Jour. Res. & Develop., vol.46, no.1, 2002.
[3] D.Burger and T.M.Austin, "The SimpleScalar tool set, version 2.0," ACM SIGARCH Computer Architecture News, vol.25, no.3, 1997.
[4] Compaq Computer Corp., "Data integrity for Compaq NonStop Himalaya servers," White paper, 1999.
[5] Fujitsu Ltd., "SPARC64 V microprocessor provides foundation for PRIMEPOWER performance and reliability leadership," White paper, 2002.
[6] G.Hinton et al., "The microarchitecture of the Pentium 4 processor," Intel Technology Journal, issue Q1, 2001.
[7] R.E.Kessler, E.J.McLellan, and D.A.Webb, "The Alpha 21264 microprocessor architecture," Int. Conf. on Computer Design, 1998.
[8] S.Kim and A.K.Somani, "SSD: an affordable fault tolerant architecture for superscalar processors," 8th Int. Symp. on Dependable Computing, 2001.
[9] A.Mendelson and N.Suri, "Designing high-performance & reliable superscalar architectures - the out of order reliable superscalar (O3RS) approach," 1st Int. Conf. on Dependable Systems and Networks, 2000.
[10] C.Molina and A.Gonzalez, "Dynamic removal of redundant computations," 13th Int. Conf. on Supercomputing, 1999.
[11] E.Morancho, J.M.Llaberia, and A.Olive, "Recovery mechanism for latency misprediction," 10th Int. Conf. on Parallel Architectures and Compilation Techniques, 2001.
[12] S.S.Mukherjee et al., "Detailed design and implementation of redundant multithreading alternatives," 29th Int. Symp. on Computer Architecture, 2002.
[13] K.Olukotun et al., "The case for a single-chip multiprocessor," 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[14] N.Quach, "High availability and reliability in the Itanium processor," IEEE Micro, vol.20, no.5, 2000.
[15] J.Ray, J.C.Hoe, and B.Falsafi, "Dual use of superscalar datapath for transient-fault detection and recovery," 34th Int. Symp. on Microarchitecture, 2001.
[16] S.K.Reinhardt and S.S.Mukherjee, "Transient fault detection via simultaneous multithreading," 27th Int. Symp. on Computer Architecture, 2000.
[17] E.Rotenberg et al., "Trace processors," 30th Int. Symp. on Microarchitecture, 1997.
[18] E.Rotenberg, "AR-SMT: a microarchitectural approach to fault tolerance in microprocessors," 29th Fault-Tolerant Computing Symposium, 1999.
[19] P.I.Rubinfeld, "Managing problems at high speed," IEEE Computer, vol.31, no.1, 1998.
[20] T.Sato and I.Arita, "Tolerating transient faults through an instruction reissue mechanism," Int. Conf. on Parallel and Distributed Computing Systems, 2001.
[21] T.Sato and I.Arita, "In search of efficient reliable processor design," 30th Int. Conf. on Parallel Processing, 2001.
[22] T.Sato, "Evaluating the impact of reissued instructions on data speculative processor performance," Microprocessors and Microsystems, vol.25, issue 9, 2002.
[23] N.Saxena et al., "Error detection and handling in a superscalar, speculative out-of-order execution processor system," 25th Fault-Tolerant Computing Symposium, 1995.
[24] A.Sodani and G.S.Sohi, "Dynamic instruction reuse," 24th Int. Symp. on Computer Architecture, 1997.
[25] G.S.Sohi et al., "A study of time-redundant fault tolerance techniques for high-performance pipelined computers," 19th Fault-Tolerant Computing Symposium, 1989.
[26] G.S.Sohi, "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers," IEEE Trans. Comput., vol.39, no.3, 1990.
[27] A.K.Somani and J.Nickel, "REESE: a method of soft error detection in microprocessors," 2nd Int. Conf. on Dependable Systems and Networks, 2001.
[28] L.Spainhower and T.A.Gregg, "IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective," IBM Jour. Res. & Develop., vol.43, no.5/6, 1999.
[29] D.M.Tullsen, S.J.Eggers, and H.M.Levy, "Simultaneous multithreading: maximizing on-chip parallelism," 22nd Int. Symp. on Computer Architecture, 1995.
[30] G.Tyson and T.M.Austin, "Improving the accuracy and performance of memory communication through renaming," 30th Int. Symp. on Microarchitecture, 1997.
[31] T.N.Vijaykumar, I.Pomeranz, and K.K.Cheng, "Transient fault recovery using simultaneous multithreading," 29th Int. Symp. on Computer Architecture, 2002.