SPITFIRE: Synchronous and Asynchronous Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

Dilip Krishnaswamy*, Elizabeth Rudnick**, Janak H. Patel**, Prithviraj Banerjee***

* Intel Corporation, Logic Test Technology, 1900 Prairie City Road, M/S FM5 108, Folsom, CA 95630. [email protected]

** Center for Reliable and High Performance Computing, University of Illinois, 1308 W. Main St., Urbana, IL 61801. [email protected]

*** Center for Parallel and Distributed Computing, Northwestern University, 2145 Sheridan Rd., Evanston, IL 60208. [email protected]

Corresponding Author:

Prithviraj Banerjee
Walter P. Murphy Chaired Professor, Electrical and Computer Engineering
Director, Center for Parallel and Distributed Computing
Northwestern University, Room L497 Technological Institute
2145 Sheridan Road, Evanston, IL 60208
Phone: (847) 491-4118
FAX: (847) 467-4144
EMAIL: [email protected]

This research was completed at the University of Illinois at Urbana-Champaign. The research was supported in part by the Semiconductor Research Corporation under Contract SRC 96-DP-109 and the Advanced Research Projects Agency under contract DAA-H04-94-G-0273 administered by the Army Research Office. Preliminary versions of this paper have been presented at the VLSI Test Symposium Conference, April 1997, and the Parallel and Distributed Simulation Workshop, June 1997. The experimental results in this paper are all new and a new algorithm is also presented.


SPITFIRE: Synchronous and Asynchronous Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

Abstract

In this paper, we investigate new parallel algorithms for sequential circuit fault simulation using overlapping test set partitions. We propose six parallel algorithms for scalable parallel test set partitioned fault simulation (SPITFIRE). The test set partitioning inherent in the algorithms overcomes the good circuit logic simulation bottleneck that exists in traditional fault partitioned approaches to parallel fault simulation. Since the test sequence is partitioned, both the good and faulty circuit simulation costs are scaled down. First, single-stage and two-stage synchronous parallel algorithms are presented. Next, these algorithms are improved using an asynchronous approach where detected faults are broadcast to all processors, thereby reducing the load on each processor. A monotonically decreasing schedule for the threshold for communicating detected faults in the asynchronous algorithms is shown to perform very well. To correct any pessimism that may exist with single-stage and two-stage algorithms, a pipelined synchronous algorithm and a hybrid asynchronous pipelined algorithm are presented. It is shown that, in the hybrid asynchronous pipelined parallel algorithm, there is no pessimism in the fault coverage. Furthermore, the combination of a stage of asynchronous communication followed by pipelining results in very minimal redundant work and is therefore very scalable and nearly optimal. A theoretical analysis comparing the various algorithms is also given to provide an insight into these algorithms. All implementations were done using the MPI communication library and are therefore portable to many parallel platforms. All algorithms provide excellent performance in terms of speedup and quality on both shared-memory and distributed-memory parallel platforms.


List of Tables

1   Benchmark Circuits Used
2   Logic and Fault Simulation Results for ATPG Test Sets
3   Logic and Fault Simulation Results for Random Test Sets
4   Fault Simulation Results with Varying Overlap for Random Test Sets
5   Fault Simulation Results with Varying Overlap for ATPG Test Sets
6   Logic Simulation Results on 16 Processors for Random Test Sets
7   Logic Simulation Results on 16 Processors for ATPG Test Sets
8   Execution Times (secs) with SPITFIRE5 on 16 Processors of the IBM SP2 for Varying MinFaultLimit and Random Test Sets
9   Execution Times (secs) with SPITFIRE5 on 16 Processors of the IBM SP2 for Varying MinFaultLimit and ATPG Test Sets
10  Faults Detected with Random Test Sets on 4 Processors of the IBM SP2
11  Faults Detected with ATPG Test Sets on 4 Processors of the IBM SP2
12  Faults Detected with Random Test Sets on 16 Processors of the IBM SP2
13  Faults Detected with ATPG Test Sets on 16 Processors of the IBM SP2
14  Speedups with Random Test Sets on 16 Processors of the IBM SP2
15  Speedups with ATPG Test Sets on 16 Processors of the IBM SP2
16  Faults Detected with Random Test Sets on 6 Processors of the SUN E3000
17  Faults Detected with ATPG Test Sets on 6 Processors of the SUN E3000
18  Speedups with Random Test Sets on 6 Processors of the SUN Ultra Enterprise 3000 shared memory multiprocessor
19  Speedups with ATPG Test Sets on 6 Processors of the SUN Ultra Enterprise 3000 shared memory multiprocessor

List of Figures

1   Test Sequence Partitioning
2   Test Set Partitioning and Fault Partitioning
3   SPITFIRE0 Algorithm
4   Partitioning in SPITFIRE1
5   SPITFIRE1 Algorithm
6   Partitioning in SPITFIRE2
7   Multistage Synchronous Pipelined Algorithm Execution (After First Stage)
8   SPITFIRE4 Algorithm (Stage 2)
9   Algorithm SPITFIRE6 on 4 Processors

SPITFIRE: Synchronous and Asynchronous Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

1 Introduction

Fault simulation is an important step in the electronic design process and is used to identify faults in a circuit that cause erroneous responses at the outputs of the circuit for a given test set. The objective of a fault simulation algorithm is to find the fraction of total faults in a sequential circuit that is detected by a given set of input vectors (also referred to as the fault coverage). In a typical fault simulator, the good circuit (fault-free circuit) and the faulty circuits are simulated for each test vector. If the output responses of a faulty circuit differ from those of the good circuit, then the corresponding fault is detected, and the fault can be dropped from the fault list, speeding up simulation of subsequent test vectors.

Fault simulation may be performed in different environments. It may be used in a deterministic test generation environment, where it is used to reduce the number of faults that must be explicitly targeted by the test generator. It may also be used in a fault-grading environment, where functional, random, or ATPG-generated test vectors may be used and where the fault coverage obtained by the entire sequence is measured. Thus, in the latter environment, fault simulation is used to measure the effectiveness of the entire sequence of test vectors. In a test generation environment, fault simulation typically takes a negligible portion of the execution time in comparison with the time taken to generate a test for a fault. Most fault simulation algorithms are of O(n^2) complexity, where n is the number of lines in the circuit. Studies have shown that there is little hope of finding a linear-time fault simulation algorithm [2]. Parallel implementations can potentially provide significant speedups while retaining good quality results.

In this paper, we investigate new parallel algorithms for sequential circuit fault simulation using overlapping test set partitions. The parallel algorithms for fault simulation proposed in this paper aim at speeding up fault simulation in a fault-grading environment, where the entire test sequence is available and the fault coverage of this sequence needs to be computed. We propose six parallel algorithms for scalable parallel test set partitioned fault simulation (SPITFIRE). The test set partitioning inherent in the algorithms overcomes the good circuit logic simulation bottleneck that exists in traditional fault partitioned approaches to parallel fault simulation. Since the test sequence is partitioned, both the good and faulty circuit simulation costs are scaled down. First, single-stage and two-stage synchronous parallel algorithms are presented. Next, these algorithms are improved using an asynchronous approach in which detected faults are broadcast to all processors, thereby reducing the load on each processor. The single-stage and two-stage synchronous and asynchronous algorithms show a small degree of pessimism in a few cases with respect to the fault coverage, as compared with a uniprocessor run. To correct this pessimism, if any, a pipelined synchronous algorithm and a hybrid asynchronous pipelined algorithm are presented for the synchronous and the asynchronous algorithms, respectively.
In the hybrid asynchronous pipelined parallel algorithm, the combination of a stage of asynchronous communication followed by pipelining results in very minimal redundant work and is therefore very scalable and near-optimal. A theoretical analysis comparing the various algorithms is also given to provide insight into these algorithms. In the asynchronous algorithms, each processor communicates detected faults to other processors asynchronously if a threshold for the number of faults detected is exceeded on that processor. The results show that the lowest execution time is provided by a monotonically decaying schedule for the threshold which follows the profile of the number of undetected faults in the fault simulation algorithm.

All parallel implementations were done in C++ using the MPI communication library [5] and are therefore portable to many parallel platforms. Results are shown on a 6-processor SUN Ultra Enterprise 3000 shared memory multiprocessor and on a 16-processor IBM SP2 distributed memory multicomputer. All algorithms provide excellent performance in terms of speedup and quality on both shared-memory and distributed-memory parallel platforms.

The paper is organized as follows. In the next section, we present existing serial algorithms for fault simulation and related work in parallel fault simulation. Section 3 presents the test set partitioning approach used in this paper, together with experimental results obtained in test set partitioning experiments. This is followed by a discussion of the proposed SPITFIRE series of parallel algorithms for scalable parallel test set partitioned fault simulation in Section 4. An analysis of the proposed parallel algorithms is presented in Section 5. Experimental results obtained with the proposed parallel fault simulation algorithms are discussed in Section 6. Section 7 concludes the paper.
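As a rough illustration of the decaying-threshold idea, the following is a minimal sketch under stated assumptions; the function name and parameters are illustrative and are not the paper's actual implementation (the floor parameter is named after the MinFaultLimit quantity that appears in the experimental tables):

    #include <algorithm>

    // Hypothetical schedule for the asynchronous broadcast threshold: scale an
    // initial threshold by the fraction of faults still undetected, so the
    // threshold decays monotonically as fault coverage rises, and clamp it at
    // a floor.
    int next_threshold(long long undetected, long long total_faults,
                       int initial_threshold, int min_fault_limit) {
        int t = static_cast<int>(initial_threshold * undetected / total_faults);
        return std::max(t, min_fault_limit);
    }

A processor would buffer locally detected faults and broadcast the buffer once its size exceeds the current threshold.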

2 Related Work

In this section we present existing serial and parallel algorithms for fault simulation.

2.1 Review of Serial Algorithms for Fault Simulation

The goal of fault simulation is to determine all faults in a circuit that can be detected by a given test sequence of vectors. The fault simulator determines whether the results of simulation with the fault-free circuit and the faulty circuit would differ at one or more output nodes for each of the faults in the list of faults. In a simple implementation of the fault simulation algorithm, one would perform logic simulation on the fault-free circuit, followed by logic simulation with each of the faulty circuits, and then compare the values at the output nodes. If there is at least one differing output, then the fault is detected; otherwise, the fault is not detected. However, this is a very straightforward and expensive approach, and there have been many efforts to develop new algorithms. These serial algorithms have been developed with the aim of speeding up the execution of fault simulation on a uniprocessor. Most of these algorithms target faults using the single stuck-at fault model for synchronous sequential circuits, and no reset state is assumed. Execution is usually performed in an event-driven manner using three-valued logic (0, 1, X (unknown)).

In concurrent fault simulation [6, 7, 8, 9], for every vector in the test sequence, the good circuit is simulated, and all elements in each faulty circuit that are different from the corresponding elements in the good circuit are simulated. This requires that lists of active faulty circuits be stored; hence, it can be expensive in terms of memory. However, fewer events need to be processed in general, so such an approach can run faster. In deductive fault simulation [10], the good circuit is simulated, set-intersection operations are performed on the set of active faulty circuits for each node, and a list of active faulty circuits is propagated. Storage requirements can be high with this approach too. However, the computational costs involved are reduced, since only events corresponding to active faults are propagated. In differential fault simulation [11], for each input vector, the good circuit is simulated, and for every faulty circuit, only the differences between a faulty circuit and the previous faulty circuit processed for the same vector are simulated. This is a very memory-efficient algorithm, but the dependency between faulty machines makes the process of fault dropping (where faults are removed from the fault list from future consideration once the fault has been detected) very difficult. In bit-parallel fault simulation [12], all bits in a processor word are used during simulation. Therefore, on a k-bit processor, the good circuit and k - 1 faulty machines are simulated in each operation. The use of all bits in the processor word could possibly lead to a k-fold improvement, though in practice, the improvement observed is smaller due to more events being processed. Traditionally, in bit-parallel fault simulation, faults are grouped statically. Therefore, after a fault is detected, the bit space corresponding to the fault is wasted.

The PROOFS fault simulator [4] combines features of bit-parallel, differential, and concurrent fault simulation algorithms to achieve a performance better than any of the individual methods, without incurring large memory requirements. Faults are dynamically grouped into words, and bit-parallel fault simulation is performed. The dynamic fault grouping avoids the waste of bit space inherent in a static fault grouping approach to bit-parallel fault simulation. For each test vector, the good circuit is first simulated using an event-driven logic simulator, and then only differences between the good and faulty circuits are simulated. Once a fault is detected, it is dropped from the fault list and is not simulated again. The state values at the flip-flops of undetected faulty circuits which differ from the corresponding values for the good circuit are stored. This makes the algorithm very fast and memory efficient. In [13], an algorithm similar to PROOFS is used. However, representative stem faults are found for faults in fan-out-free regions. These representative stem faults are dynamically grouped, and only those faults whose effects did not propagate to flip-flops in the previous time frame are simulated in a bit-parallel fashion. Faults in a fan-out-free region which are active at the stem are identified using single fault propagation. The fault injection procedure was also modified in [13], with static ordering of faults by fan-out-free regions and dynamic ordering of faults to place potentially detected faults in separate fault groups. These enhancements to the original PROOFS algorithm result in the HOPE fault simulator [13] running twice as fast as the PROOFS algorithm.

In the parallel-pattern single-fault propagation (PPSFP) algorithm [14] for combinational circuits, the circuit is rank ordered and simulated using compiled simulation techniques for N patterns of the good circuit, followed by N patterns for each undetected fault. This algorithm has been extended to synchronous sequential circuits in the fault simulators PARIS [15] and PSF [16]. In both algorithms, for each group of 32 test vectors, the good circuit is simulated. Subsequently, every undetected fault is simulated individually for all 32 vectors. Since the circuit is sequential, many iterations may be required before circuit values stabilize, and the computation required is minimized using heuristics. In PARIS, 32 consecutive vectors from the test sequence are grouped before simulation is performed. In PSF, on the other hand, the test sequence is partitioned into 32 equal subsequences. All subsequences are simulated in parallel, with the nth vector of each subsequence being grouped into one 32-bit word, where n is an index which ranges from 1 to the number of vectors in every subsequence. With this parallel-pattern approach to serial fault simulation in PARIS and PSF, only one fault is simulated at a time. The number of iterations required for the circuit state to stabilize can vary between the two algorithms. The Zambezi algorithm [17] improves upon these algorithms by using a parallel-pattern parallel-fault propagation technique, where groups of independent faults are determined. These faults are then simulated in parallel using a bit-parallel approach for a group of 32 test vectors. This algorithm has shown significant improvement in performance compared to the HOPE and PARIS algorithms [17].

In evaluating our proposed approach to parallelizing fault simulation, we used a fault simulator which uses a modified version of the PROOFS algorithm. However, any of the fault simulation algorithms described could have been used. The new fault simulator is based on an algorithm similar to that of PROOFS, but several features have been added. New features include the ability to unroll frames in order to save storage and retrieval costs at the flip-flops and the ability to keep track of states that have been visited. Thus, the new fault simulator allows sequences of vectors to be evaluated without intermediate storage and retrieval of flip-flop state information. Furthermore, instead of the two-pass approach (fault-free circuit simulation followed by a 32-fault parallel simulation), a single pass using the fault-free circuit and 31 faulty circuits in parallel is performed to save simulation time. This simulator is about two times faster than the original PROOFS algorithm.
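To make the bit-parallel idea concrete, here is a minimal two-valued sketch; the simulators above actually use three-valued logic and event-driven evaluation, and the names below are illustrative assumptions rather than any simulator's API:

    #include <cstdint>

    // Word-level evaluation: bit 0 of every signal word carries the
    // good-circuit value; bits 1..31 carry 31 faulty machines. One bitwise
    // operation then evaluates a gate for all 32 machines at once.
    inline uint32_t eval_and(uint32_t a, uint32_t b) { return a & b; }
    inline uint32_t eval_or(uint32_t a, uint32_t b)  { return a | b; }

    // Inject a stuck-at fault on a line for the machines selected by
    // fault_mask (bit 0 must be clear so the good circuit stays fault-free).
    inline uint32_t inject_sa1(uint32_t line, uint32_t mask) { return line | mask; }
    inline uint32_t inject_sa0(uint32_t line, uint32_t mask) { return line & ~mask; }

    // A fault is detected when its machine's bit differs from the good-circuit
    // bit at a primary output; the returned word has one bit set per detection.
    inline uint32_t detected_at_output(uint32_t out) {
        uint32_t good = (out & 1u) ? ~0u : 0u;  // replicate the good-circuit bit
        return (out ^ good) & ~1u;              // differences, excluding bit 0
    }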

2.2 Review of Parallel Fault Simulation

Due to the long execution times for large circuits, several algorithms have been proposed for parallelizing sequential circuit fault simulation [18].

2.2.1 Circuit partitioning

In circuit-partitioned parallel fault simulation, the circuit is partitioned into subcircuits [19, 20, 21, 22]. These subcircuits are assigned to different processors, and the simulations for the assigned portions of the circuit are done on the respective processors. However, if a signal value at a node assigned to a given processor is required at a different processor, then this information needs to be communicated. This creates a large communication overhead in parallel execution. Another problem with such an approach is that, when statically partitioning the circuit, it is difficult to predict the activity in each of the subcircuits so as to optimally balance the load across the processors. Dynamic repartitioning of the circuit is a possibility, but it is also very communication-intensive. In general, circuit partitioning is a must when memory scalability is required for simulation of large circuits. However, speedups are very limited due to the large communication overheads inherent in this approach.

In [19], a parallel algorithm was proposed which is suitable for distributed-memory multicomputers. The circuit is partitioned by a depth-first method, and execution proceeds in an event-driven manner. Information is exchanged between processors when logic values on the signal connections between two subcircuits change. This information consists of the good value of a node and the faulty values represented as a bit vector, which is as long as the number of faults in the circuit. No implementations of this algorithm were reported. In [20], the circuit was partitioned into zero-delay fan-in cones of latches. This approach results in a cycle-based logic simulation algorithm where processors compute a complete zero-delay cycle independently and need to synchronize only on clock boundaries. A list of fault effects for each gate was maintained. If a fault effect was at a primary output, then the fault was detected. Otherwise, the fault effects were communicated to any processors that required them. This algorithm was implemented on the Intel iPSC/2 hypercube with random test sets, and limited speedups were observed. In [21], the circuit was levelized, and the gates at a given level were distributed equally across the processors. The circuit was simulated level by level, with all processors participating in the simulation of gates at every level. The input vectors used were vectors from a deterministic test generator, and good speedups were obtained with this approach on a shared memory multiprocessor. In [22], a pipelined, circuit-partitioned approach was suggested for sequential circuits, where the circuit was levelized and each processor was assigned all the gates at a particular level. Execution was performed in a pipelined fashion using input vectors in sequence, with all the gates at a given level forming a stage of the pipeline. Due to the presence of feedback paths through memory elements, the speedups obtained from this algorithm were limited.

2.2.2 Algorithmic partitioning

One can use algorithmic partitioning by partitioning the fault simulation algorithm into many functions. This approach has limited parallelism, since the functional parallelism available in the algorithm is usually small. Algorithmic partitioning was proposed for concurrent fault simulation in [23, 24]. In [23], a multipass concurrent fault simulator was implemented using the MARS hardware accelerator, with the fault simulation procedure partitioned functionally into 12 pipelined stages. In [24], a pipelined algorithm was developed, and specific functions were assigned to each processor. Limited speedups were observed using a software emulation of a message-passing multicomputer.

2.2.3 Fault partitioning

Fault partitioning is a relatively common approach to parallel fault simulation, in which the faults are partitioned across the processors. Each processor performs good circuit simulation for every input vector and faulty circuit simulation for all the faults allocated to it. Since logic simulation is replicated on all processors, it is a serial overhead, and it can limit the parallelism available. Static fault partitioning was performed in [25, 17], and limited speedups were observed. Since static partitioning can create a load imbalance across the processors, one can dynamically redistribute the faults across the processors. In [25], faults were migrated dynamically to idle processors, and good and faulty circuit simulation was repeated starting from the first input vector in the test sequence. However, the improvement in speedup from using such an approach compared to static partitioning was quite insignificant. Another possibility is to communicate state information corresponding to these faults to avoid replicated work. However, this can create a large communication overhead, and it may provide very limited speedups.

2.2.4 Test sequence partitioning

Test sequence partitioning was first proposed in [26] for combinational circuit parallel fault simulation. In this approach, each processor is assigned the entire circuit and fault list, along with a subset of the input vector set. Since only combinational circuits are considered in this approach, the simulation of each vector in the test sequence can proceed independently. In the algorithm, each processor is assigned a group of input vectors, and fault simulation is performed on this group of vectors. When a processor completes fault simulation on the currently assigned group of vectors, the next available group of vectors is assigned to it. This procedure is iterated until all vectors are exhausted. Excellent speedups in the range of 9.3 to 9.8 on 10 processors were reported using a network of SUN/4 workstations. However, such an approach is not directly useful in the context of sequential circuit fault simulation, where the order in which vectors are simulated is important. This is due to the fact that in sequential circuit fault simulation, one needs an input vector and the state of the circuit prior to the simulation of that vector to correctly perform fault simulation.

One observation that can be made about the fault partitioning experiments described earlier is that larger speedups are obtained for circuits having lower fault coverages [25, 17]. These results highlight the fact that the potential speedup drops as the number of faults simulated drops, since the good circuit evaluation takes up a larger fraction of the computation time. For example, if good circuit logic simulation takes about 20% of the total fault simulation time on a single processor, then by Amdahl's law, one cannot expect a speedup of more than 5 on any number of processors. The good circuit evaluation is not parallelized in the fault partitioning approach; therefore, speedups are limited. Parallelization of good circuit logic simulation, or simply logic simulation, is known to be a difficult problem. Most implementations have not shown an appreciable speedup. Parallelizing logic simulation based on partitioning the circuit has been suggested but has not been successful due to the high level of communication required between parallel processors.

Test set partitioning for sequential circuits was proposed in [27] to overcome the logic simulation bottleneck in sequential circuit fault simulation. In this approach, overlapping test set partitions were used with test sets generated randomly. Fault simulation proceeds in two stages. In the first stage, the fault list is partitioned among the processors, and each processor performs fault simulation using the fault list and test vectors in its partition. In the second stage, the undetected fault lists from the first stage are combined, and each processor simulates all faults in this list using test vectors in its partition. Experiments performed on a uniprocessor as a proof of concept have indicated that this approach is effective and is potentially scalable to a large number of processors.

Test set partitioning is also used in the parallel fault simulator Zamlog [28]. In this approach, both the fault list and the test set are partitioned across the processors. Logic simulation is performed only once, and the results are stored. Faulty circuit simulation is performed on each processor for the faults allocated to the processor, using the vectors from the test set partition assigned to the processor. The fault simulation results are combined in a tree-like fashion, and the faulty circuit simulation is repeated on each processor for the undetected faults assigned to the processor. Since the logic simulation results have been stored, this simulation is not repeated. However, Zamlog assumes that independent test sequences are provided which form the partition, which can be a drawback with respect to test sequences generated for sequential circuits. If only one test sequence is given, Zamlog does not partition it. If, for example, only 4 independent sequences are given, it cannot use more than 4 processors. Also, in Zamlog, good circuit simulation is performed only once, and the simulation results are stored for future use. This may not be a feasible approach for large circuits, which may require large test sets and which have a large number of primary outputs. Random vectors were used for fault simulation, and no comparisons between the fault coverage obtained with serial and parallel executions of the algorithm were reported for this approach.

This paper proposes the SPITFIRE algorithms for sequential circuit parallel fault simulation [29, 30]. These algorithms will be presented in detail in Section 4. The proposed algorithms use the overlapping test set partitioning strategy suggested in [27], which will be described in detail in Section 3. The good circuit simulation results are not stored, to avoid excessive use of memory. No assumption is made regarding the independence of test sequences in the SPITFIRE algorithms. Hence, these algorithms are scalable to any number of processors and can be used for sequential circuit fault simulation. It is shown that test set partitioning for sequential circuits can result in pessimistic results in terms of fault coverage. This pessimism is avoided using a pipelined approach which will be described in Section 4. In general, the test set partitioning strategy provides a more scalable implementation than fault partitioning, since both the cost of good circuit logic simulation and the cost of simulating faulty circuits are distributed over the processors.
This concludes our discussion on related work in the areas of serial and parallel algorithms for fault simulation.

3 Parallel Fault Simulation Using Test Set Partitioning

In this section we first motivate the need for the test set partitioning approach used in our work for sequential circuit fault simulation. The tradeoffs involved in using a test set partitioning strategy with overlapping partitions of the test set are presented. We then propose the SPITFIRE series of parallel algorithms for test set partitioned fault simulation.

3.1 Test Set and Fault Partitioning

We now describe the different approaches to partitioning the test set and fault list in a parallel processing context. Let the test set, or more precisely the test sequence, be denoted by T, the fault list by F, and the number of processors by p.

3.1.1 Test sequence partitioning

Parallel fault simulation through test sequence partitioning is illustrated in Figure 1 and in Figure 2(a).

[Figure 1: Test Sequence Partitioning. Example: a test sequence of 5n vectors on 5 processors. Processor Pi is assigned the ith block of n vectors, and each segment after the first is preceded by a small overlap of vectors taken from the end of the previous segment.]

Let us partition the test sequence T into p partitions: {T1, T2, ..., Tp}. The test sequence partition Ti and the fault list F are allocated to processor i. Each processor performs the good and faulty circuit simulations for the subsequence in its partition only, starting from an all-unknown (X) state. Of course, the state would not really be unknown if we did not partition the vectors. Since the unknown state is a superset of the known state, the simulation will be correct but may have more X values at the outputs than the serial simulation. This is considered pessimistic simulation in the sense that the parallel implementation produces an X at some outputs which in fact are known 0 or 1. From a pure logic simulation perspective, this pessimism may or may not be acceptable. However, in the context of fault simulation, the effect of the unknown values is that a few faults which are detected in the serial simulation are not detected in the parallel simulation. Rather than accept this small degree of pessimism, the test set partitioning algorithm tries to correct it as much as possible. To compute the starting state for each test segment, a few vectors are prepended to the segment from the preceding segment. This process creates an overlap of vectors between successive segments, as shown in Figure 1. Our hypothesis is that a few vectors can act as initializing vectors to bring the machine to a state very close to the correct state, if not exactly the same state. Even if the computed state is not close to the actual state, it still has far fewer unknown values than exist when starting from an all-unknown state. Results in [27] showed that this approach indeed reduces the pessimism in the number of fault detections. The number of initializing vectors required depends on the circuit and how easy it is to initialize. If the overlap is larger than necessary, redundant computations will be performed in adjacent processors, and efficiency will be lost. However, if the overlap is too small, some faults that are detected by the test set in a serial run may not be identified as detected in a parallel run. Hence, the fault coverage reported may be overly pessimistic.
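A minimal sketch of how the overlapping segment bounds of Figure 1 might be computed; the uniform split and all names here are illustrative assumptions, not the paper's implementation:

    #include <algorithm>

    // Overlapping test-segment bounds for processor i of p over an N-vector
    // sequence. The `overlap` vectors prepended from the preceding segment
    // serve only to initialize the circuit state.
    struct Segment { int start; int end; };  // half-open interval [start, end)

    Segment segment_for(int i, int p, int N, int overlap) {
        int base  = i * N / p;                     // nominal start of segment i
        int end   = (i + 1) * N / p;               // nominal end of segment i
        int start = std::max(0, base - overlap);   // prepend initializing vectors
        return Segment{start, end};
    }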

[Figure 2: Test Set Partitioning and Fault Partitioning. (a) Test sequence partitioning: the test segments T1-T5 are distributed across processors P1-P5, each of which carries the full fault list F. (b) Fault partitioning: the fault partitions F1-F5 are distributed across processors P1-P5, each of which applies the entire test set T.]

3.1.2 Fault partitioning

Fault partitioning for parallel fault simulation is illustrated in Figure 2(b). The fault list F is partitioned into p partitions: {F1, F2, ..., Fp}. The figure shows that the fault partition Fi and the entire test set T are allocated to processor i. In this approach, each processor uses the entire test set T to target the fault partition Fi that it owns. This partitioning suffers from the problem that each processor has to perform the complete good circuit simulation for the entire test set T. This is a huge sequential bottleneck for any parallel implementation which employs such a partitioning strategy. Also, there is a potential problem of load imbalance across processors, depending on which faults are allocated to which processors. In the worst case, if all the hard-to-detect faults are allocated to a single processor, these faults may not be detected for a long time. Hence, the total execution time will depend on how long the processor with the highest load may take to complete its task. It is possible to dynamically balance the load by employing a strategy where faults are migrated from busy processors to idle processors. However, one would either have to perform resimulation from the beginning [25] or migrate the circuit state information for the faulty circuits associated with these faults to an idle processor. For large circuits, migrating circuit state information is prohibitively expensive. In addition, for large test sets, performing resimulation is very expensive, too. Hence, in general, a purely fault partitioning approach has limited scalability and does not provide good performance.
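The sequential bottleneck just described can be quantified with Amdahl's law. With serial fraction s (here, the replicated good circuit logic simulation), the speedup on p processors is bounded as follows, which gives the limit of 5 quoted in Section 2.2 for s = 0.2:

    S(p) = \frac{1}{s + \frac{1-s}{p}} < \frac{1}{s},
    \qquad s = 0.2 \;\Longrightarrow\; S(p) < \frac{1}{0.2} = 5 .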

3.2 Experimental Results with Test Sequence Partitioning

We will now present some experimental results to motivate the need for using a test sequence partitioning strategy. The circuits used in the study are listed in Table 1. Circuits s526, s1423, s5378 and s35932 are taken from the ISCAS89 benchmark suite [33]. Circuits s3271, s3330, s3384, s4863 and s6669 were chosen from the ISCAS93 benchmark suite. The mult16 circuit is a 16-bit two's complement multiplier; div16 is a 16-bit divider; am2910 is a 12-bit microprogram sequencer; pcont2 is an 8-bit parallel controller used in DSP applications; and piir8o is an 8-point infinite impulse response filter for DSP applications. "Faults" refers to the number of single stuck-at faults left after fault collapsing, and "Gates" refers to the number of logic gates in the circuit. "PIs" stands for the number of primary inputs, "POs" for the number of primary outputs, and "FFs" refers to the number of flip-flops in the circuit.

Table 1: Benchmark Circuits Used

Circuit    Gates   Faults  PIs  POs   FFs
am2910       931    2391    20   16    87
div16        856    2141    33   34    50
mult16       648    1708    18   33    55
pcont2      4249  11,300     9    8    24
piir8o      8830  19,920     9    8    56
s1423        657    1515    17    5    74
s3271       1573    3270    26   14   116
s3330       1789    2870    40   73   132
s3384       1702    3380    43   26   183
s35932    16,065  39,094    35  320  1728
s382         158     399     3    6    21
s400         164     426     3    6    21
s444         181     474     3    6    21
s4863       2342    4764    49   16   104
s526         193     555     3    6    21
s5378       2779    4603    35   49   179
s6669       3123    6684    83   55   239

A simple experiment was performed to compare the cost of good circuit logic simulation to the overall cost of fault simulation (simulation of the good circuit and all faulty circuits). This experiment was performed on circuits for which a reasonably large test set was available from an ATPG (Automatic Test Pattern Generator). The test sets were generated using the STRATEGATE [31] tool developed at the University of Illinois. This test generator provides the best fault coverage for the benchmark circuits published in the literature, and it uses the current state of the circuit to generate the next sequence of test vectors to add to the test set; i.e., the circuit state is not reset during the test generation process except at the very beginning, when the circuit starts with an all-unknown state. Hence, the test sets obtained from this test generator are ideal for studying the impact of test set partitioning on fault coverage accuracy since they do not contain repeated initialization sequences; they are likely to result in pessimistic simulation and, hence, possibly in pessimistic results in terms of fault coverage.

The results from the experiment comparing the good circuit logic simulation cost to the overall fault simulation cost are shown in Table 2. For each circuit the number of vectors simulated is shown, followed by the number of faults detected, the good circuit simulation time (LogicSim Time) in seconds, the time for both good and faulty circuit simulation (FaultSim Time) in seconds, the percentage of time required for good circuit simulation, and the number of vectors simulated before the good circuit was initialized (Init Vector). The results show that the good circuit logic simulation time varied from 3 percent to 20.8 percent of the overall fault simulation time for the range of circuits considered. The average percentage for the above 12 circuits was 13.1 percent. It can also be seen that for circuits with larger test set sizes, the logic simulation cost tends to be higher. This is due to the fact that towards the end of fault simulation, there are very few faults left, and good circuit simulation contributes a larger portion of the execution time. Another observation that we can make is that the percentages tend to be lower for larger circuits, for which a reasonable number of faults are left undetected. For example, for s35932, the largest circuit being considered, the logic simulation cost is only 3 percent of the overall fault simulation cost. However, the test set size is also small, so it is not clear whether the low percentage was merely due to the fact that the test set size was not large enough. One can expect to obtain higher gains from test set partitioned fault simulation for large test sets with good fault coverage, so that the good circuit logic simulation time will be a larger fraction of the overall execution time. This is because, towards the end of serial fault simulation, there are very few faults left, and more time is spent in logic simulation with larger test set sizes.

Table 2: Logic and Fault Simulation Results for ATPG Test Sets

Circuit   Vectors   Faults    LogicSim  FaultSim  LogicSim    Init
                    Detected  Time (s)  Time (s)  Percentage  Vector
am2910      3841      2198       3         23       13          27
div16       4434      1814       3         19       15.7         2
mult16      3281      1665       2         16       12.5         5
pcont2      2017      6837       4         72        5.5         2
piir8o      1577    15,072      10        103        9.7         2
s1423       7363      1414       7         55       12.7         9
s35932      2926    35,100      44       1420        3           2
s382        2492       364       0.6        3.8     15.7         2
s400        3417       384       1.0        4.8     20.8         2
s444        3129       424       0.7        4.3     16.2         2
s526        3376       454       1.0        5.5     18.2         3
s5378     14,721      3639      43        301       14.2        15

We therefore performed another experiment which uses the same test set size for all circuits. The test set size was chosen to be 10,000 vectors, and these vectors were generated at random. However, since these are random input vectors, one can expect the fault coverage to be much lower than for the above case. Since the fault coverage is lower, one can also expect that more time may be spent in fault simulation, because there may be a large number of faults left undetected at all times, and more effort would be spent in simulating the undetected faults as compared to the logic simulation cost. However, if the fault coverage reaches a high level early, and there are only a few faults left, then one can expect logic simulation to take the bulk of the execution time. We observe both situations here while simulating random vectors for the circuits considered. The results are shown in Table 3. It was observed that the logic simulation cost varied from 3.8 percent to 69.0 percent, with an average value of 19.24 percent over 17 circuits. The lower value of 3.8 percent was observed with the circuits s400 and s526, where many faults were left undetected. Similarly, for the largest circuit, s35932, the logic simulation cost was only 4.1 percent, since quite a few faults were left undetected. In these and some other circuits, a large portion of the execution time was spent in targeting these faults. The highest value of 69.0 percent was observed with the circuit s3271, where only a few faults were left undetected, and hence logic simulation took up the bulk of the execution time. Also, if we observe the "Init Vector" column closely, we can see that the random test sequence seems to take longer to fully initialize the good circuit as compared to the ATPG test sets studied above. The most striking difference in the number of initialization vectors required can be observed for the circuits s5378 and am2910, where the ATPG took 15 and 27 vectors to fully initialize the good circuit, respectively, and the random test sequence took 361 and 203 vectors to fully initialize the circuit, respectively. It should be noted that even though the good circuit may be fully initialized, a faulty circuit may take more or fewer vectors to get fully initialized, since the finite state machines corresponding to the good and faulty circuits are different. Hence, full initialization of the good circuit does not in any way imply that any faulty circuit is fully initialized at the same stage.

Table 3: Logic and Fault Simulation Results for Random Test Sets

Circuit   Faults    LogicSim  FaultSim  LogicSim    Init
          Detected  Time (s)  Time (s)  Percentage  Vector
am2910      2115      13         71       18.3       203
div16       1640      10         50       20.0         3
mult16      1468       5         68        7.3        10
pcont2      6829      27        345        7.8         6
piir8o    15,004      46        636        7.2         6
s1423        802       9         97        9.2         8
s382          53       1         24        4.1         2
s3271       3243      29         42       69.0        16
s3330       2103      21        120       17.5        17
s3384       3069      29        111       26.1         9
s35932    32,933     234       5671        4.1         2
s400          56       1         26        3.8         2
s444          60       1.1       28        3.9         2
s526          52       1.2       32        3.7         3
s4863       4485      33         54       61.1         5
s5378       3031      24        254        9.4       361
s6669       6675      42         77       54.5         9

However, since the finite state machines can be expected to be reasonably similar, one could expect that the good and faulty circuits would get fully initialized around the same time for most faults. An exception to this would be certain faults on reset lines of flip-flops, for which the circuit is never fully initialized.

From the above results one can conclude that logic simulation of the good circuit can be expected to take a significant fraction of the fault simulation time in general. With larger test set sizes, one can expect logic simulation to take a larger fraction of the total execution time. However, if there are many faults that are left undetected for most of the execution time, then logic simulation can be expected to constitute a smaller portion of the overall execution time, and a larger fraction of the time is spent in simulating these undetected faults. Therefore, when we attempt to parallelize fault simulation, we must take into account both factors, viz., the cost of fault simulating undetectable faults and the cost of logic simulation. The best approach would attempt to distribute the load across the processors such that both contributing factors would be parallelized. The SPITFIRE algorithms were developed to target these issues and hence provide very good performance. The basic approach used is to partition the test set across processors. When we do that, the cost of good circuit logic simulation is effectively distributed approximately equally across the processors. (This assumes that each test set partition generates the same amount of overall activity. On the average, for large test sets, this can be considered to be true.) Now, consider the case of an undetectable fault. In a serial simulation, one would have to simulate all the vectors without detecting this fault. By partitioning the test set, we distribute the cost of simulating this fault approximately equally across the processors. (This statement makes the same assumption as the previous one.) Effectively, we have distributed the cost of good circuit logic simulation and the cost of simulating undetected faults in a load balanced fashion; hence, we can expect to get good speedups. The above is a somewhat simplistic view of the issues involved, and we will get into more details in the following sections.

One issue with using test set partitioning is that of determining the value of the overlap of test vectors between successive partitions. An experiment was performed for 3 different values of the overlap with random and ATPG test sets, namely 4, 20 and 100. The results for a single processor and for a test set partitioned approach (using the SPITFIRE1 algorithm, which will be described in detail in the following section) on 16 processors of the IBM SP2 distributed memory multicomputer are shown in Table 4. The random test sets used were of size 10,000, as before. It can be seen from the table that with random test sets, the same number of faults is usually detected irrespective of the value of the overlap used. Also, as we would expect, the execution time increases as the overlap value is increased, since extra work is being performed in simulating the overlapping vectors. However, in circuit s1423, only 790 faults were detected with an overlap value of 4 on 16 processors, while 802 faults were detected in all other cases. Also, for the circuit s3330, 2101 faults were detected when an overlap of 4 or an overlap of 20 was used, whereas in the single processor case and in the case of an overlap of 100 on 16 processors, 2103 faults were detected. This shows that one can expect some pessimism in the fault coverage using the test set partitioning approach. However, the random test sets are not typical of a test set obtained from an ATPG.

Table 4: Fault Simulation Results with Varying Overlap for Random Test Sets

          Faults Detected                      Execution Time (secs)
          1 Proc   16 Procs (Overlap)          1 Proc   16 Procs (Overlap)
Circuit              4       20      100                   4      20     100
am2910      2115    2115    2115    2115         71       14      14      14
div16       1640    1640    1640    1640         50        9.4     9.3    10.3
mult16      1468    1468    1468    1468         68       12      12.5    14
pcont2      6829    6829    6829    6829        345       58      58      67
piir8o    15,004  15,004  15,004  15,004        636      108     112     131
s1423        802     790     802     802         97       16      17      19
s3271       3243    3243    3243    3243         42       10      10      11
s3330       2103    2101    2101    2103        120       19      20      23
s3384       3069    3069    3069    3069        111       20      20      23
s35932    32,933  32,933  32,933  32,933       5671      896     893    1016
s382          53      53      53      53         24        4.6     4.2     4.3
s400          56      56      56      56         25        3.8     4.3     4.3
s444          60      60      60      60         28        4.3     4.4     5.1
s4863       4485    4485    4485    4485         54       12      14      15
s526          52      52      52      52         32        5       5       6
s5378       3031    3031    3031    3031        255       38      40      46
s6669       6675    6675    6675    6675         77       19      20      24

Table 5 shows the results of the same experiment performed with ATPG test sets. One can see that there is some pessimism in the number of faults detected for quite a few circuits. However, the pessimism seems to decrease with increase in the value of the overlap and with increase in the test set size. For example, for the circuit with the largest test set size, s5378, no pessimism is observed with a test set partitioned approach. Since the partitioning is being done across 16 processors, the size of the partitions on each of the processors is quite small. It is expected that with larger circuits with large test set sizes, the pessimism would decrease or become negligible.

Table 5: Fault Simulation Results with Varying Overlap for ATPG Test Sets

                   Faults Detected                      Execution Time (secs)
Circuit   Vectors  1 Proc   16 Procs (Overlap)          1 Proc  16 Procs (Overlap)
                              4       20      100                  4      20     100
am2910      3841     2198    2197    2198    2198         23      6.3     6.5     8
div16       4434     1814    1781    1781    1781         20      5       5       6
mult16      3281     1665    1665    1665    1665         16      3       3.4     4.3
pcont2      2017     6837    6836    6837    6837         73     51      54      82
piir8o      1577   15,072  15,031  15,072  15,072        104     25      30      43
s1423       7363     1414    1411    1414    1414         55     12      13      14
s35932      2926   35,100  35,100  35,100  35,100       1420    493     561     550
s382        2492      364     350     351     361          3.4    1.2     1.4     1.6
s400        3417      384     370     370     379          4.8    1.9     2       2
s444        3129      424     419     422     423          4.3    1.3     1.4     1.6
s526        3376      454     435     441     452          5.5    1.5     1.7     1.7
s5378     14,721     3639    3639    3639    3639        301     46      48      52

In general, it can be observed that an overlap of 20 provides reasonably good results for most cases. There is nothing really special about the value 20. One could possibly use a large value of the overlap for large test set sizes, and a small value of the overlap when the test set size is small. As long as the overlap chosen is small compared to the partition size, the overhead due to additional simulation in each partition corresponding to the overlap vectors will be negligible. For our study using test set partitioning algorithms, we will use a value of 20.

An additional experiment was performed to study the effects of test set partitioning on good circuit logic simulation. This study was performed on 16 processors of the IBM SP2, a distributed memory multicomputer. The test set was partitioned across the 16 processors, and a value of 20 was chosen for the overlap. Each of the 16 partitions was simulated on a different processor, and the index of the vector at which the circuit was fully initialized in each processor was obtained. The maximum over these vectors across the processors is shown in Tables 6 and 7 for simulations with random test sets and ATPG test sets, respectively. The number of fully initialized partitions is also indicated. Surprisingly enough, initialization was a bigger problem with ATPG test sets than with random test sets. It can be observed that for 8 out of 12 circuits studied using ATPG test sets, the circuit was not fully initialized in at least one of the partitions. It can be seen that for random test sets, all partitions seem to get initialized quite early. The only exceptions are in the circuit s5378, where the circuit was not initialized in 7 partitions, and in circuits s3330 and am2910, where 322 and 547 vectors were required, respectively, before the circuits were fully initialized in all partitions. It will be seen in the following section that some of the circuits in which all the partitions did not get fully initialized, and circuits that took a long time to get initialized in all partitions, demonstrate pessimism in the fault coverage. This pessimism is avoided by a pipelined approach that we will describe. We will now proceed to the next section for a detailed discussion of the SPITFIRE series of parallel algorithms.

Table 6: Logic Simulation Results on 16 Processors for Random Test Sets

Circuit   Partition  Maximum      Initialized
          Size       Init Vector  Partitions
am2910      645        547            16
div16       645         10            16
mult16      645         33            16
pcont2      645          6            16
piir8o      645          6            16
s1423       645         13            16
s3271       645         25            16
s3330       645        322            16
s3384       645         13            16
s35932      645          7            16
s382        645          6            16
s400        645          6            16
s444        645          5            16
s4863       645         17            16
s526        645          6            16
s5378       645        645+            9
s6669       645         12            16

4 SPITFIRE: Scalable Parallel Test Set Partitioned Fault Simulation

In this section, we will describe six new scalable parallel algorithms for sequential circuit fault simulation using overlapping test set partitions. The first two algorithms, SPITFIRE1 and SPITFIRE2, are parallel two-stage synchronous approaches [29, 27]. The third algorithm, SPITFIRE3, is a synchronous pipelined algorithm, which extends the algorithm SPITFIRE1 to eliminate any pessimism that may occur in using a test set partitioned approach [29]. The fourth and fifth algorithms, SPITFIRE4 and SPITFIRE5, are parallel two-stage and single-stage asynchronous approaches [30]. The final algorithm is SPITFIRE6, a hybrid asynchronous pipelined approach which uses a combination of the ideas in the algorithms SPITFIRE3 and SPITFIRE5. All algorithms are compared with each other, with a static fault partitioned approach, which we will call FPAR, and with a simple test set partitioned approach, which we will call SPITFIRE0. It is shown that the algorithm SPITFIRE6 is the best algorithm, and it provides the best speedups without compromising on the fault coverage. We begin the discussion with the algorithm SPITFIRE0, which is presented as a base of reference for the various parallel test set partitioning algorithms to follow.

4.1 SPITFIRE0: A simple test set partitioned parallel fault simulation algorithm

In SPITFIRE0, the test set is partitioned across the processors as described in the previous section. The entire fault list is allocated to each processor. Thus, each processor targets the entire list of faults using a subset of the test vectors. Each processor proceeds independently and drops the faults that it can detect. The results are merged in the end. The algorithm is outlined in Figure 3. Let the test set be denoted by T and the fault list by F.
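A minimal MPI sketch of this structure (the paper's implementations are in C++ with MPI, but the code below is an illustration, not the actual implementation). It assumes a hypothetical fault_simulate() that returns a detection bitmap, one byte per collapsed fault, for the local test segment; the merge in steps 3 and 4 of Figure 3 below then maps naturally onto a bitwise-OR reduction:

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Placeholder: a real implementation would run PROOFS-style simulation of
    // segment T_rank against the full fault list F; here every fault is
    // simply reported undetected (1024 collapsed faults assumed).
    std::vector<uint8_t> fault_simulate(int rank, int nprocs) {
        (void)rank; (void)nprocs;
        return std::vector<uint8_t>(1024, 0);
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Step 2 of Figure 3: fault simulation on the local test segment.
        std::vector<uint8_t> detected = fault_simulate(rank, nprocs);

        // Steps 3 and 4: a bitwise-OR reduction onto rank 0 (processor P1)
        // implements the union D = D1 u D2 u ... u Dp of the detected sets.
        std::vector<uint8_t> merged(detected.size());
        MPI_Reduce(detected.data(), merged.data(),
                   static_cast<int>(detected.size()),
                   MPI_UINT8_T, MPI_BOR, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }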

Algorithm SPITFIRE0:

1. Partition the test set T among the p processors: {T1, T2, ..., Tp}.
2. Perform fault simulation in each processor Pi, applying Ti to F. Let the list of detected faults in processor Pi after fault simulation be Di.
3. Each processor Pi sends the detected fault list Di to processor P1.
4. Processor P1 combines the detected fault lists from the other processors by computing D = D1 ∪ D2 ∪ ... ∪ Dp. The result after parallel fault simulation is D, the list of detected faults, and it is now available in processor P1.

Figure 3: SPITFIRE0 Algorithm.

Table 7: Logic Simulation Results on 16 Processors for ATPG Test Sets

Circuit   Partition  Maximum      Initialized
          Size       Init Vector  Partitions
am2910      261        261+          12
div16       298        191           16
mult16      226         82           16
pcont2      147        147+           5
piir8o      119         50           16
s1423       481        209           16
s35932      203          8           16
s382        176        176+          10
s400        234        234+          10
s444        216        216+          13
s526        231        231+          12
s5378       941        941+          15

4.2 SPITFIRE1: A synchronous two-stage algorithm

The simple algorithm described above is somewhat inefficient in that many faults are very testable and are detected by most if not all of the test segments. Simulating these faults on all processors is a waste of time. Therefore, one can filter out these easy-to-detect faults in an initial stage in which both the fault set and the test set are partitioned among the processors. This results in the two-stage algorithm proposed in [27, 29]. In the first stage, each processor targets a subset of the faults using a subset of the test vectors, as illustrated in Figure 4. A large fraction of the detected faults are identified in this initial stage, and only the remaining faults have to be simulated by all processors in the second stage. The overall algorithm is outlined in Figure 5.

Note that $G_i = \bigcup_{j=1, j \neq i}^{p} U_j$ is an equivalent expression for $G_i$. The rationale behind this step is that, if test segment $T_i$ could not detect a fault in $F_i$ in the first stage of fault simulation, there is no need to repeat this computation in the second stage. A second stage is necessary because every test vector must eventually target every fault that has not already been detected on some other processor. Thus, the initial fault partitioning phase is used to reduce the redundant work that may arise in detecting easy-to-detect faults. One does have to perform two stages of good circuit simulation with the test segment on each processor; however, the first stage eliminates a great deal of redundant work that would otherwise have been performed, so the two-stage approach is preferred.

The test set partitioning approach for parallel fault simulation is subject to inaccuracies in the reported fault coverages only when the circuit cannot be initialized quickly from an unknown state at the beginning of each test segment. This problem can be avoided if the test set is partitioned such that each segment starts with an initialization sequence. The definite redundant computation in the above approach is the overlap of test segments for good circuit simulation. However, if the overlap is small compared to the size of the test segment assigned to a processor, this redundant computation is negligible. Another source of redundant computation arises in the second stage, when each processor has to target the entire list of faults that remains (excluding the faults that were left undetected in that processor). In this situation, when one of the processors detects a fault, it drops the fault from its own fault list, but the other processors may continue targeting the fault until they detect it or complete the simulation (i.e., until the second stage of fault simulation ends).
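For illustration, with bitmap fault lists the second-stage target list $G_i = F - (C \cup F_i)$ of step 8 in Figure 5 reduces to word-wise logic operations. The following C sketch is ours, under that assumed representation, and is not taken from the paper's code.

/* Sketch: step 8 of SPITFIRE1 (Figure 5) as bitmap arithmetic.
 * 'all' marks every fault in F, 'c' the globally detected faults C
 * (after the broadcast of step 7), and 'fi' this rank's first-stage
 * partition F_i.  G_i = F - (C u F_i). */
void second_stage_targets(const unsigned char *all, const unsigned char *c,
                          const unsigned char *fi, unsigned char *gi,
                          int nbytes)
{
    for (int k = 0; k < nbytes; k++)
        gi[k] = all[k] & ~(c[k] | fi[k]);
}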

[Figure 4: Partitioning in SPITFIRE1. Stage 1: each processor $P_i$ applies its test segment $T_i$ to its fault partition $F_i$, leaving undetected faults $U_i$. Stage 2: each processor applies $T_i$ to the undetected faults of all other partitions.]

4.3 SPITFIRE2: A hybrid approach

We now present a new algorithm which attempts to reduce the size of the partitions used in SPITFIRE1. Let us partition the entire test set $T$ into $(p+1)$ partitions $\{T_1, T_2, \ldots, T_p, T_{p+1}\}$, where $p$ is the number of processors. If $N$ is the size of $T$, then the size of each partition is now $\frac{N}{p+1}$. The partitioning for the two stages of fault simulation in the SPITFIRE2 algorithm is illustrated in Figure 6. As can be seen from the figure, processor $i$ uses $T_1$ and $F_i$ in the first stage of fault simulation. Since all faults are targeted in the first stage using the input vectors in $T_1$, there is no need to resimulate these vectors in the second stage. Let $G = \bigcup_{j=1}^{p} U_j$ be the set of undetected faults left at the end of Stage 1. Then, in the second stage, processor $i$ uses the test set $T_{i+1}$ and fault list $G$. Thus, in the first stage, processors target different sets of faults, and in the second stage, processors target the same list of undetected faults that was available at the end of the first stage. Since we are targeting only easy-to-detect faults in the first stage of fault simulation, most of these faults will be detected by $T_1$. Hence, it matters little whether we use $T_i$ in processor $i$ in the first stage, as in the SPITFIRE1 algorithm, or $T_1$, as proposed in this algorithm. The advantage of this algorithm is that the number of vectors simulated in each stage is reduced by a factor of $\frac{p}{p+1}$ compared to SPITFIRE1. A small additional advantage is that the faulty circuit states available for the undetected faults in the set $U_1$ can be used for simulation with the test set $T_2$ in the second stage of fault simulation in processor 1. It is possible, though, that one may not drop as many faults after the first stage of fault simulation as in SPITFIRE1. A possible disadvantage of this algorithm is that a pipelined approach such as the algorithm SPITFIRE3, which will be presented shortly, cannot be directly applied, since the first processor is missing state information for the faulty machines corresponding to the faulty circuits assigned to other processors during the first stage of fault simulation. One would have to store this information, which is available at the end of the first stage, and communicate it to the first processor in order to use a pipelined approach. This could be expensive.
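A minimal sketch of the SPITFIRE2 segment bookkeeping follows, assuming segments of roughly $N/(p+1)$ vectors, each extended backwards by $q$ overlap vectors for initialization; the struct and the index conventions are illustrative assumptions, not the paper's code.

/* Sketch: boundaries of segment j (j = 0..p) for SPITFIRE2.
 * Segment 0 is T_1, used by all ranks in Stage 1; segment i is
 * T_{i+1}, used by rank i in Stage 2.  [start, end) convention. */
typedef struct { int start; int end; } Segment;

Segment spitfire2_segment(int j, int N, int p, int q)
{
    int size = N / (p + 1);     /* last segment absorbs any remainder */
    Segment s;
    s.start = j * size - (j > 0 ? q : 0);   /* prepend q init vectors */
    if (s.start < 0) s.start = 0;
    s.end = (j == p) ? N : (j + 1) * size;
    return s;
}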

Algorithm: SPITFIRE1
1. Partition test set $T$ among $p$ processors: $\{T_1, T_2, \ldots, T_p\}$.
2. Partition fault list $F$ among $p$ processors: $\{F_1, F_2, \ldots, F_p\}$.
3. Each processor $P_i$ performs the first stage of fault simulation by applying $T_i$ to $F_i$.
4. Let the list of detected faults and undetected faults in processor $P_i$ after fault simulation be $C_i$ and $U_i$, respectively.
5. Each processor $P_i$ sends the detected fault list $C_i$ to processor $P_1$.
6. Processor $P_1$ combines the detected fault lists from other processors by computing $C = \bigcup_{i=1}^{p} C_i$.
7. Processor $P_1$ now broadcasts the total detected fault list $C$ to all other processors.
8. Each processor $P_i$ finds the list of faults it needs to target in the second stage: $G_i = F - (C \cup F_i)$.
9. The circuit state is reset in each processor.
10. Each processor $P_i$ performs the second stage of fault simulation by applying the test segment $T_i$ to fault list $G_i$.
11. Each processor $P_i$ sends the detected fault list $D_i$ to processor $P_1$.
12. Processor $P_1$ combines the detected fault lists from other processors by computing $D = \bigcup_{i=1}^{p} D_i$.
13. The result after parallel fault simulation is the list of detected faults $C \cup D$, and it is now available in processor $P_1$.

Figure 5: SPITFIRE1 Algorithm.

4.4 SPITFIRE3: A multistage pipelined synchronous algorithm

SPITFIRE3 avoids the small degree of pessimism that may be present in the test set partitioning algorithms presented thus far. The algorithm is illustrated in Figure 7. Initially, the execution is similar to that of SPITFIRE1, and the first stage of fault simulation is identical. Synchronization points are introduced in the second stage, at which processors exchange information about detected faults. This may reduce the amount of work a processor has to do subsequently, since it no longer needs to target faults that have been detected by other processors. However, the synchronization points introduce barriers which may slow down execution if the load is somewhat imbalanced across processors; there is thus some degree of tradeoff involved in using synchronization points in the second stage. At the end of the second stage, when processor $i$ has finished executing the test vectors in test set $T_i$, all processors exchange information and drop all faults detected by other processors.

[Figure 6: Partitioning in SPITFIRE2. Stage 1: every processor $P_i$ applies $T_1$ to its fault partition $F_i$, leaving undetected faults $U_i$. Stage 2: processor $P_i$ applies $T_{i+1}$ to the combined undetected fault list.]


[Figure 7: Multistage Synchronous Pipelined Algorithm Execution (After First Stage). The diagram shows busy and idle processors between synchronization points; $q$ = vector overlap for the circuits considered, $N$ = number of test vectors in the test set, $p$ = number of processors ($p = 4$ in the figure).]

At this point, processor $i$ resumes execution and starts working on the vectors in partition $T_{i+1}$ without resetting its state. Hence, if processor $i$ stopped execution at vector $\frac{iN}{p}$, it resumes execution at vector $\frac{iN}{p} + 1$. Note that this means processor $p$ is now idle, since it has simulated the last vector in the test set $T$, while all other processors are busy. Once again, processors exchange information at synchronization points. In addition, information is stored regarding the number of faults that were detected at the end of that stage. If more faults have been detected at the next synchronization point, execution continues. If, however, there is no change in the number of detected faults between two synchronization points, execution stops, and the final fault coverage is the coverage available at that synchronization point. This approach detected the faults that had been found in a uniprocessor run but were left undetected in a parallel run under the previous two approaches. Obviously, we pay a price in continuing execution with synchronization to identify a few more detected faults; if one is willing to tolerate the pessimism that may exist, this approach may not be essential. Note that in this pipelined approach, at the end of the second stage of fault simulation, processor $p$ becomes idle, and then after every $\frac{N}{p}$ vectors have been simulated, processors $(p-1), (p-2), \ldots, 3, 2,$ and $1$ become idle in order. In the worst case, processor 1 may have to perform the good circuit logic simulation on the entire test set $T$, if more faults continue to be detected. Processors continue to work on fewer faults as more faults are detected. The caveat is that if the synchronization points are too close to each other, some faults may still be missed. In the implementation, synchronization points were introduced at regular intervals of $\frac{N}{4p}$ vectors, and it turned out that the faults that had not been detected were indeed detected before the second synchronization point after the end of the second stage. Thus, the extra overhead of continuing execution was not very high, and the fault coverage obtained was identical to that of a uniprocessor run. An exact way of ensuring that no faults are missed is discussed in the section on the SPITFIRE6 algorithm; that approach was not implemented, as its communication costs were expected to be much higher, as detailed in the discussion of SPITFIRE6.
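The stopping rule just described (halt when no new faults appear between two successive synchronization points) can be realized with one collective call per synchronization point. A minimal MPI sketch, with names of our own choosing:

/* Sketch of the SPITFIRE3 stopping rule: at each synchronization
 * point the ranks exchange how many faults they newly detected since
 * the previous point; if the global total is zero, everyone stops. */
#include <mpi.h>

int pipeline_should_stop(int newly_detected)
{
    int global_new = 0;
    /* The collective also acts as the synchronization barrier itself */
    MPI_Allreduce(&newly_detected, &global_new, 1, MPI_INT, MPI_SUM,
                  MPI_COMM_WORLD);
    return global_new == 0;   /* no change between two sync points */
}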


4.5 SPITFIRE4: A two-stage asynchronous algorithm

Consider the second stage of fault simulation in the SPITFIRE1 algorithm. All processors work on almost the same list of undetected faults that was available at the end of the first stage (each one excluding only the faults it could not detect in Stage 1). It is therefore advantageous for each processor to periodically communicate to all other processors a list of any faults that it detects. Thus, each processor asynchronously sends a list of newly detected faults to all other processors, provided that it has detected at least MinFaultLimit new faults. Each processor periodically probes for messages from other processors and drops any faults received through these messages. This reduces the load on a processor that has not yet detected those faults. Thus, by allowing each processor to asynchronously communicate detected faults to all other processors, we dynamically reduce the load on each processor. Observe that in the first stage of the SPITFIRE1 algorithm, all processors work on different sets of faults; there is no need to communicate detected faults during Stage 1, since doing so would not affect the workload on any processor. It therefore makes sense to communicate all detected faults only at the end of Stage 1. The asynchronous algorithm used for fault simulation in Stage 2 by any processor $P_i$ is outlined in Figure 8. The routine CheckForAnyMessages() is a nonblocking probe which returns 1 only if a message is pending to be received.

Algorithm: SPITFIRE4 (Stage 2)
Set NumNewFaultsDetected = 0;
For each vector k in the test set T_i
    FaultSimulate vector k;
    If (NumNewFaultsDetected > MinFaultLimit) then
        Send the list of newly detected faults to all processors
            using a buffered asynchronous send;
        Set NumNewFaultsDetected = 0;
    End If
    While (CheckForAnyMessages())
        Receive new message using a blocking receive;
        Drop newly received faults (if not dropped earlier);
    End While
End For

Figure 8: SPITFIRE4 Algorithm (Stage 2)
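In MPI terms, the buffered asynchronous send and the nonblocking probe of Figure 8 correspond to MPI_Bsend and MPI_Iprobe. The sketch below is a plausible rendering, not the paper's code; FAULT_TAG, the fault-ID message format, and the 4096-entry receive buffer are assumptions, and MPI_Bsend requires a buffer attached elsewhere via MPI_Buffer_attach.

/* Sketch of the asynchronous exchange in Figure 8. */
#include <mpi.h>

#define FAULT_TAG 42   /* assumed message tag for detected-fault lists */

void send_new_faults(const int *ids, int n, int rank, int nprocs)
{
    for (int dst = 0; dst < nprocs; dst++)
        if (dst != rank)   /* buffered send: does not stall the simulator */
            MPI_Bsend(ids, n, MPI_INT, dst, FAULT_TAG, MPI_COMM_WORLD);
}

void drain_fault_messages(void (*drop_fault)(int))
{
    int flag;
    MPI_Status st;
    for (;;) {
        /* CheckForAnyMessages(): nonblocking probe */
        MPI_Iprobe(MPI_ANY_SOURCE, FAULT_TAG, MPI_COMM_WORLD, &flag, &st);
        if (!flag) break;
        int n, buf[4096];                 /* assumes <= 4096 IDs/message */
        MPI_Get_count(&st, MPI_INT, &n);
        MPI_Recv(buf, n, MPI_INT, st.MPI_SOURCE, FAULT_TAG,
                 MPI_COMM_WORLD, &st);
        for (int k = 0; k < n; k++)
            drop_fault(buf[k]);           /* drop if not dropped earlier */
    }
}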

4.6 SPITFIRE5: A single-stage asynchronous algorithm

The asynchronous communication strategy of the SPITFIRE4 algorithm can also be applied to the SPITFIRE0 algorithm. In SPITFIRE0, all processors start with the same list of undetected faults, namely the entire fault list $F$. Each processor drops only the faults that it detects itself, so every processor continues to work on a large set of undetected faults. Once again, it makes sense for each processor to communicate detected faults periodically to the other processors, provided that it has detected at least MinFaultLimit new faults. The same asynchronous communication approach discussed in the previous section is used, but it is applied to the single stage of fault simulation that this algorithm performs.

There is a tradeoff between the SPITFIRE4 and SPITFIRE5 algorithms. In SPITFIRE4, a completely communication-independent phase in Stage 1 is followed by an asynchronous, communication-intensive phase. SPITFIRE5, by contrast, has only one stage of fault simulation, which means that the good circuit simulation with test set $T_i$ on processor $P_i$ is performed only once. Thus, although SPITFIRE5 may communicate continuously, performing only one stage of fault simulation may yield substantial savings. This is indeed observed to be the case.

The value of MinFaultLimit used in asynchronous communication can be circuit dependent, and it also depends on the parallel platform used. For a very small circuit with mostly easy-to-detect faults, MinFaultLimit should not be set too small, as this may result in too many messages being communicated. On the other hand, if the circuit is reasonably large, or if faults are hard to detect, the granularity of computation between two successive communication steps will be large, and a small value of MinFaultLimit may make sense. Similarly, communicating often may be more expensive on a distributed-memory platform, whereas this factor matters less on a shared-memory machine. It is therefore important to keep the computation-to-communication ratio high; depending on the parallel platform, one must settle on a suitable frequency at which faults are communicated between processors. One may also use the number of test vectors simulated, say MinVectorLimit, as a control parameter to regulate the frequency of communication. This can be useful towards the end of fault simulation, when faults are detected very slowly. One can also use both parameters, MinFaultLimit and MinVectorLimit, simultaneously and communicate faults when either threshold is exceeded. As long as the granularity of the computation is large compared to the communication costs involved, an asynchronous approach can be expected to perform well. If communication costs were zero, one would ideally communicate faults to other processors as soon as they are detected; reducing the frequency of communication forces more redundant computation.

Another possible strategy is to use a monotonically decreasing schedule for the MinFaultLimit threshold, tied directly to the number of faults being detected. Typically, fault detection follows an exponentially decaying profile in which fewer additional faults are detected in each step of fault simulation. Since many faults are detected early, keeping MinFaultLimit very low (for example, 1) would incur a high communication overhead between processors; hence it is advisable to keep MinFaultLimit high initially. As fault simulation progresses, fewer faults are detected, and these can be communicated immediately without much overhead, so a lower value of MinFaultLimit can be afforded. The ideal strategy, then, is to start with a large value of MinFaultLimit and reduce it slowly as more faults are detected.
It should be noted that if detected faults are not communicated, redundant work is done on processors that have not yet detected those faults. Hence, one should not start with a very high value of MinFaultLimit, since this would defeat the purpose of the asynchronous communication strategy. In our approach, we start with an initial value MinFaultLimit = StartLimit. As faults are dropped, the ratio of undetected faults $U$ to the total number of faults $F$ is computed, and a new value of MinFaultLimit is obtained by multiplying StartLimit by this fraction:

$$\text{MinFaultLimit} = \left\lceil \text{StartLimit} \times \frac{U}{F} \right\rceil$$

where $\lceil x \rceil$ denotes the smallest integer greater than or equal to $x$. Thus, we have a monotonically decreasing schedule which is expected to decay exponentially.

An experiment was performed on 16 processors of the IBM SP2 to test the different strategies for choosing the MinFaultLimit threshold with both random and ATPG test sets. Table 8 shows the results obtained with the SPITFIRE5 algorithm for random test sets; the lowest execution time for each circuit is shown in bold. Constant values of 10, 5, and 2 were tried, along with monotonically decreasing schedules (as described above) with starting values of 15, 10, 5, and 3. Among the constant values, 5 worked best; among the decreasing schedules, a starting value of 5 worked best. Of all the schedules tried, the monotonically decreasing schedule with a starting value of 5 outperformed the others for random test sets. Table 9 shows the results obtained with the SPITFIRE5 algorithm for ATPG test sets. Constant values of 10, 5, and 2 were tried, along with monotonically decreasing schedules with starting values of 15, 10, and 5. Among the constant values, 5 again worked best; among the decreasing schedules, a starting value of 10 worked best. Of all the schedules tried, the monotonically decreasing schedule with a starting value of 10 outperformed the others for ATPG test sets.
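The schedule above, together with the optional MinVectorLimit trigger discussed earlier, amounts to a few lines of arithmetic. A minimal C sketch (identifier names are ours):

/* Sketch: MinFaultLimit = ceil(StartLimit * U / F), plus the
 * dual-trigger communication test using MinVectorLimit. */
#include <math.h>

int min_fault_limit(int start_limit, long undetected, long total_faults)
{
    return (int)ceil((double)start_limit * (double)undetected
                     / (double)total_faults);
}

int should_communicate(int new_faults, int limit,
                       int vectors_since_send, int min_vector_limit)
{
    /* send when either control parameter is exceeded */
    return new_faults > limit || vectors_since_send >= min_vector_limit;
}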

Table 8: Execution Times (secs) with SPITFIRE5 on 16 Processors of the IBM SP2 for Varying MinFaultLimit and Random Test Sets

                 MinFaultLimit
          Constant Value       Monotonically Decreasing, Starting Value
Circuit   10     5      2      15     10     5      3
am2910    8.4    8.2    8.3    8.8    8.3    8.1    8.3
div16     6.7    6.5    6.5    6.6    6.2    6.3    6.4
mult16    8.0    7.4    7.5    8.1    7.6    7.4    7.4
pcont2    44.8   43.9   44.2   45.2   44.5   43.6   44.1
piir8o    69.3   69.5   70.1   71.4   69.0   68.3   69.4
s1423     10.1   9.9    10.4   10.7   10.2   10.1   10.5
s3271     6.1    5.6    5.8    6.4    5.6    5.7    5.9
s3330     13.1   13.0   13.4   14.1   13.0   12.7   13.1
s3384     13.1   12.8   13.2   13.8   13.0   12.9   13.4
s35932    654    644    649    661    642    637    648
s382      3.2    3.0    2.9    3.5    3.2    3.0    2.9
s400      3.4    3.3    3.3    3.8    3.3    3.1    3.2
s444      3.7    3.4    3.3    3.7    3.4    3.3    3.3
s4863     7.0    7.2    7.4    7.1    7.1    7.0    7.2
s526      4.4    4.0    3.7    4.3    4.0    3.8    3.7
s5378     32.6   30.9   31.2   32.2   31.8   31.1   31.3
s6669     10.5   10.4   11.1   10.8   10.5   10.3   10.7

It turns out that a higher value of StartLimit is required with ATPG test sets than with random test sets. There are a couple of likely reasons for this behavior. First, with ATPG test sets, faults tend to be detected faster initially than with random test sets, so it makes sense to communicate less frequently in the beginning. Second, towards the end of fault simulation, very few faults are left with ATPG test sets, whereas with random test sets many faults remain undetected. Hence, if a higher


Table 9: Execution Times (secs) with SPITFIRE5 on 16 Processors of the IBM SP2 for Varying MinFaultLimit and ATPG Test Sets

                 MinFaultLimit
          Constant Value       Monotonically Decreasing, Starting Value
Circuit   10     5      2      15     10     5
am2910    3.4    3.6    3.4    3.5    3.2    3.4
div16     3.6    3.4    4.2    3.9    3.1    3.5
mult16    2.2    2.4    2.8    2.9    2.3    2.8
pcont2    13.9   13.4   13.2   13.8   12.5   12.9
piir8o    18.7   17.3   17.9   19.5   17.4   17.6
s1423     7.7    7.3    7.6    8.2    7.1    7.4
s35932    205    211    213    214    201    209
s382      0.53   0.52   0.53   0.56   0.49   0.52
s400      0.69   0.68   0.71   0.78   0.65   0.69
s444      0.63   0.62   0.65   0.71   0.62   0.65
s526      0.84   0.81   0.85   0.98   0.83   0.84
s5378     33.1   32.9   32.6   35.3   33.0   34.2

value of StartLimit is used, then a higher value of MinFaultLimit persists throughout the execution of the algorithm, even towards the end of fault simulation. This results in more redundant work and hence higher execution times. With random test sets, it is therefore best to use a somewhat lower value of StartLimit than one would use with an ATPG test set. In general, the results show that a monotonically decreasing schedule for the MinFaultLimit threshold yields lower execution times than a fixed threshold.

4.7 SPITFIRE6: A hybrid asynchronous multistage pipelined algorithm

From all the above discussion, it follows naturally that the best overall strategy for parallel fault simulation is to combine a single-stage asynchronous approach with a multistage pipelined approach on any parallel platform. This approach is illustrated in Figure 9. As the figure shows, only one stage of good circuit simulation is required in the first stage, since asynchronous communication can be used to eliminate redundant work; this removes the need for the two stages of good circuit simulation required by the SPITFIRE1 algorithm. However, since some pessimism with respect to the fault coverage may remain, a corrective stage using a pipelined approach is needed. The pipelined approach is similar to the one used in the SPITFIRE3 algorithm.

Let us assume that $\frac{N}{kp}$ vectors are simulated between any two synchronization points in each stage of fault simulation, i.e., that there are $k$ synchronization points per stage. To eliminate all pessimism, one needs to collect the states of the good circuit and of all the faulty circuits corresponding to undetected faults at checkpoints spaced $\frac{N}{kp}$ vectors apart. These checkpoints would have to be taken during the asynchronous phase and during the synchronous pipelined phase. In the asynchronous phase, each processor stores the required states without performing any communication; in the pipelined phase, the checkpoints are identical to the synchronization points. Consider two processors $P_i$ and $P_{i+1}$ and two checkpoints $m$ and $n$, where $m$ occurs before $n$ and the two checkpoints are spaced $\frac{N}{p}$ test vectors apart. Let $S_{n,i}$ be the set consisting of the states of the good circuit and of the faulty circuits corresponding to the undetected faults at checkpoint $n$ in processor $P_i$, and let $S_{m,i+1}$ be the corresponding set of states at checkpoint $m$ in processor $P_{i+1}$.

[Figure 9: Algorithm SPITFIRE6 on 4 Processors. Busy processors in the asynchronous communication phase are followed by a synchronous pipelined phase; a path with an arrow denotes a possible execution trace that ensures correct simulation, assuming circuit states match across synchronization boundaries. $q$ = vector overlap for the circuits considered, $N$ = number of test vectors in the test set, $p$ = number of processors ($p = 4$ in the figure).]

Let $u_{n,i}$ be the last vector that $P_i$ simulated before checkpoint $n$, and let $v_{m,i+1}$ be the first vector that $P_{i+1}$ simulated after checkpoint $m$. Now, if $u_{n,i}$ and $v_{m,i+1}$ are successive vectors in the test set $T$ and if $S_{n,i} = S_{m,i+1}$ for all $i$, then we can safely stop execution at synchronization point $n$. Hence, if the states of the good and undetected faulty circuits match, we can conclude that there is no pessimism in the final result. In a practical implementation, it may be easier to use a heuristic than the rigorous approach described above. One possible heuristic is to match signatures corresponding to the circuit states across synchronization points. Another heuristic, which was used in the SPITFIRE3 algorithm, is to ensure that no new faults are detected between two successive synchronization points. For purposes of implementation, we used the same heuristic as in the SPITFIRE3 algorithm to terminate execution. A monotonically decreasing schedule, as suggested for the SPITFIRE5 algorithm, was also used in the asynchronous communication phase.
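As one concrete possibility for the signature heuristic mentioned above, each processor could fold its good-circuit and undetected-faulty-circuit state words at a checkpoint into a single hash and compare it across the boundary. The following C sketch, including the FNV-style hash, is an assumption of ours rather than the paper's implementation.

/* Sketch: fold the state words recorded at a checkpoint into one
 * 64-bit signature; execution may stop at checkpoint n if, for every
 * i, the signature of S_{n,i} on P_i equals that of S_{m,i+1}
 * received from P_{i+1}. */
#include <stdint.h>
#include <stddef.h>

uint64_t state_signature(const uint32_t *state_words, size_t n)
{
    uint64_t h = 1469598103934665603ULL;   /* FNV offset basis */
    for (size_t k = 0; k < n; k++) {
        h ^= state_words[k];
        h *= 1099511628211ULL;             /* FNV prime */
    }
    return h;
}

int states_match(uint64_t sig_n_i, uint64_t sig_m_next)
{
    return sig_n_i == sig_m_next;
}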

5 Analysis of Algorithms

A theoretical analysis of the various algorithms is now presented. We first provide an analysis of serial fault simulation and then extend it to the various test set partitioning approaches and to a fault partitioning approach.

5.1 Analysis of sequential fault simulation

We first provide an analysis for a uniprocessor and then proceed to an analysis for a multiprocessor implementation.

Let us assume that there are $N$ test vectors in the test set $\{T_1, T_2, \ldots, T_N\}$. Usually in fault simulation, many faults are detected early, and the remaining faults are detected more slowly. Let us assume that the fraction of faults detected by vector $k$ in the test set is given by $\alpha e^{-\beta(k-1)}$, i.e., the fraction of faults detected at each step falls exponentially. (Traditionally, one assumes that the fraction of faults left undetected after vector $k$ has been simulated is given by $\frac{1}{\mu} e^{-\beta k}$ [21] [26]. Hence, the fraction detected by vector $k$ is given by $\left(\frac{1}{\mu} e^{-\beta(k-1)} - \frac{1}{\mu} e^{-\beta k}\right) = \frac{1}{\mu}\left(1 - e^{-\beta}\right) e^{-\beta(k-1)}$, which is of the form $\alpha e^{-\beta(k-1)}$.) The fraction of faults detected after $n-1$ vectors have been simulated is then

$$\sum_{k=1}^{n-1} \alpha e^{-\beta(k-1)} = \alpha \left( \frac{1 - e^{-\beta(n-1)}}{1 - e^{-\beta}} \right) = r\left(1 - e^{-\beta(n-1)}\right), \quad \text{where } r = \frac{\alpha}{1 - e^{-\beta}}.$$

Hence, the number of undetected faults remaining, $U(n)$, when vector $n$ is about to be simulated is given by $U(n) = F\left(1 - r\left(1 - e^{-\beta(n-1)}\right)\right)$, where $F$ is the total number of faults in the circuit. Let us assume that $\tau$ is the unit cost of execution in seconds per gate evaluation. Assume that, for each vector, a fraction $\delta$ of the total number of gates $G$ in the circuit is simulated for each fault, and that a fraction $\gamma$ of the $G$ gates is simulated during the good circuit logic simulation. (Usually