Finite-State Machine Embeddings for Non-Concurrent ... - CiteSeerX

Finite-State Machine Embeddings for Non-Concurrent Error Detection and Identi£cation Christoforos N. Hadjicostis* Dept. of Electrical and Computer Eng. and Coordinated Science Lab. University of Illinois Urbana, IL 61801-2307, USA [email protected]

Abstract— In digital sequential systems that operate over several time steps, a state-transition fault at any time step during the operation of the system corrupts its state in a way that can render its future functionality useless. In this paper, we develop a methodology for systematically constructing redundant £nitestate machines so that an external checker can capture transient state-transition faults via checks that are performed in a nonconcurrent manner (e.g., periodically). More speci£cally, the proposed approach allows the checker to detect and identify errors due to past state-transition faults based on an analysis of the current, possibly corrupted FSM state. As a result, the checker in such designs can operate at a slower speed than the rest of the system which relaxes the stringent requirements on its reliability.

complexity, higher clock speeds, lower power consumption requirements and smaller transistor sizes. In the schemes proposed in this paper the error-detection/identi£cation mechanism checks for faults periodically instead of every time step, each time detecting and identifying errors due to transient state-transition faults that may have occurred since the last check. As a result, the stringent reliability requirements on the error-detection/identi£cation mechanism are relaxed. The development in this paper is a nonlinear extension of the nonconcurrent error detection and identi£cation schemes that were presented in [5] for DT LTI dynamic systems and in [6] for LFSMs.

I. INTRODUCTION

II. BACKGROUND

In an effort to utilize redundancy in ef£cient ways, researchers have extensively used coding and embedding techniques to achieve fault tolerance. Apart from the numerous examples in communication and data storage applications [1], these approaches have also been successful in protecting computational systems against transient or permanent hardware faults. Examples of such approaches include arithmetic codes [2], algorithm-based fault tolerance schemes [3], and embeddings for protecting discrete-time (DT) dynamic systems, including linear time-invariant (LTI) dynamic systems, linear £nite-state machines (LFSMs), and £nite-state machines (FSMs). These embedding techniques are commonly based on concurrent checks (i.e., checks that are performed at the end of each time step), and detect and identify errors in the system by analyzing violations on enforced encodings. Reference [4] provides an overview of these techniques for both combinational and dynamic systems. In this paper, we extend these embedding techniques to a framework that allows non-concurrent error detection and identi£cation in arbitrary FSMs. Our interest in FSMs stems from our desire to protect complicated computations in modern digital systems whose vulnerability to so-called “soft” (transient) faults has become higher due to increased

In this section we describe the non-concurrent error detection and identi£cation schemes that were developed for LFSMs in [6]. An LFSM S with state evolution

This material is based upon work supported by the National Science Foundation under NSF Career Award 0092696 and NSF ITR Award 0218939. Any opinions, £ndings, and conclusions or recommendations expressed in this publication are those of the author and do not necessarily re¤ect the views of NSF. *Address for correspondence: Coordinated Science Lab. and Dept. of Electrical and Computer Eng., University of Illinois at Urbana-Champaign, 148 CSL, 1308 West Main Street, Urbana, IL 61801-2307, USA.

q[t + 1] = Aq[t] ⊕ Bx[t]

(1)

(where q[t] is the state vector, x[t] is the input vector and operations are performed in GF(q), the £nite (Galois) £eld of order q), is protected in [7], [6] by constructing a larger LFSM H with state evolution

ξ [t + 1] = A ξ [t] ⊕ Bx[t] .

(2)

The initial state ξ [0] and matrices A , B are chosen so that, under fault-free conditions, the state ξ [t] at time step t ≥ 0 provides complete information about q[t], the state of the original LFSM S , and vice-versa. More speci£cally, in [7], [6] these decoding and encoding constraints are restricted to be linear, i.e., it is assumed that there exist matrices L and G [with entries in GF(q)] such that, when ξ [0] = Gq[0] and under fault-free conditions, q[t] = Lξ [t] ,

ξ [t] = Gq[t] for all t ≥ 0. Non-concurrent error detection and identi£cation for LFSMs is analyzed in [6] where it is assumed that the LFSM begins operation at time step 0 and that the £rst nonconcurrent check is performed at the end of time step L − 1 (beginning of time step L). If a total of D transient faults occur during time steps L − k1 − 1, L − k2 − 1, ..., L − kD − 1,

originally corrupting state variables i1 , i2 , ..., iD by initial additive errors v1 , v2 , ..., vD respectively, then the syndrome1 at the end of time step L − 1 can be calculated based on the erroneous state of the redundant LFSM and provides enough information for the detection and identi£cation of all errors. Based on these observations, the work in [6] presented explicit constructions of redundant LFSMs that can be used to provide fault tolerance to the LFSM S in Eq. (1). III. NON-CONCURRENT ERROR DETECTION AND IDENTIFICATION IN FSMS A. Notation and Preliminaries Consider the FSM S with state set Q = {q1 , q2 , ..., qN } and input set X = {x1 , x2 , ..., xU }. Its state q[t] and input x[t] at time step t specify its next state q[t + 1] via the next-state function q[t + 1] = δ (q[t], x[t]) . (3) We assume that function δ is de£ned for all pairs in Q × X and we are interested in detecting, identifying and correcting state-transition faults, i.e., faults that result in the corruption of the state of FSM S at a particular time step. We focus on transient faults, i.e., faults that do not appear on a consistent basis but only manifest themselves with a certain probability (transient faults are usually caused by glitches due to noise, electromagnetic interference, or environmental factors [8]). Due to the memory of the system, a transient fault that corrupts the state of the FSM at a particular time step can in¤uence its future behavior, potentially rendering its functionality useless. Therefore, transient state-transition faults are actually harder to deal with than transient faults that corrupt the output of the FSM. In general, the implementation of the FSM S in Eq. (3) as a sequential system will rely on encoding each of the N states into a binary vector with b bits, where b is an integer satisfying 2b ≥ N (so that there are enough binary strings to represent all N states). In this paper we focus on the case where the state encodings are of minimal length (i.e., 2b−1 ≤ N ≤ 2b or b = ⌈log2 N⌉), but our discussion can be easily generalized to arbitrary b ≥ ⌈log2 N⌉. The states of the FSM S can be captured by a set of N b-dimensional binary vectors Qs = {qs(1) , qs(2) , ..., q(N) s } that correspond to states {q1 , q2 , ..., qN } respectively. These vectors can certainly be viewed as b-dimensional vectors with elements in GF(2) (i.e., with elements “0” or “1”) or as b-dimensional vectors with elements in some £nite £eld GF(q) with q > 2 [where, of course, elements “0” or “1” are mapped to a pair of elements in 1 The syndrome at the end of time step L − 1 is de£ned as s[L] ≡ Hξ [L], f where ξ f [L] is the possibly erroneous state of the redundant LFSM and H is an appropriate parity check matrix that has full row-rank and satis£es HG = 0.

GF(q)]. One can also view the vectors in Qs as lower dimensional vectors with elements in a higher dimensional £eld. For example, the 8-dimensional binary vector q (·) s = £ ¤T 1 0 1 1 0 0 0 1 , can be treated as the 4£ ¤T 10 11 00 01 = in GF(4), dimensional vector q(·) s £ ¤T (·) in or as the 2-dimensional vector qs = 1011 0001 GF(24 ). For the purposes of this paper, we will be content to think of the vectors in the set Qs as d-dimensional vectors with entries in GF(q) where d ≤ ⌈log2 N⌉ and q ≥ 2. The breakdown of the vector dimensions and the size of the £nite £eld is not particularly important for our analysis in this paper but is a factor in the complexity of the redundant implementation and the error-detection/identi£cation mechanism. For notational convenience, we de£ne the following state transition mappings: for each input xu ∈ XU , let δxu : Qs 7→ Qs be the mapping that de£nes the state transition functionality under input xu , i.e., δxu (q(s j) ) = q(l) s iff ql = δ (q j , xu ). With this notation q[t + 1] = δx[t] (q[t]) , (4) where x[t] ∈ XU is the input at time step t, and q[t], q[t + 1] ∈ Qs are the machine states at time steps t and t + 1. Given a sequence of inputs x[τ1 ], x[τ1 + 1], x[τ1 + 2], ..., x[τ2 ], we also de£ne

δ x[τ2 ] (q(s j) ) = δx[τ ] (δx[τ x[τ1 ]

2

2 −1]

...(δx[τ

1 +1]

(δx[τ ] (q(s j) )))...) (5) 1

(if τ2 < τ1 , then δ x[τ2 ] (·) is de£ned to be the identity mapping x[τ1 ] in Qs ). B. Embeddings of FSMs To protect against transient state-transition faults, we will construct a larger FSM H with an η -dimensional state vector ξ [t] with entries in GF(q), where η = d + m for some m > 0. Using the notation introduced in the previous section, the state evolution of H can be described by

ξ [t + 1] = ∆x[t] (ξ [t]) ,

(6)

where x[t] ∈ XU is the input at time step t, and ξ [t], ξ [t + 1] ∈ Qh are the machine states at time steps t and t + 1 [Qh is the η -dimensional vector space with entries in GF(q)]. The initial state ξ [0] and all transition mappings ∆xu (·), xu ∈ X, will be chosen so that under fault-free conditions the state ξ [t] of FSM H for t ≥ 0 provides complete information about q[t], the state of the original FSM S , and vice-versa. More speci£cally, we will restrict ourselves to decoding and encoding mappings that are linear in GF(q), i.e., we will require that there exist a d × η matrix L and an η × d matrix G [with entries in GF(q)] such that, when ξ [0] = Gq[0] and under fault-free conditions, q[t] = Lξ [t] , ξ [t] = Gq[t]

(7) (8)

begins operation with ξ [0] = Gq[0], and that the £rst nonconcurrent parity check is performed at the end of time step L − 1 (i.e., at the beginning of time step L). Assume for now that a single transient state-transition fault takes place in the interval [0, L − 1]. More speci£cally, the fault takes place during the execution of time step t = L − k − 1, 0 ≤ k ≤ L − 1, originally affecting the ith state variable by an initial (additive) error value v so that the erroneous state vector at time step t = L − k is given by

for all t ≥ 0. The second constraint implies that error detection can be performed concurrently by verifying whether the parity check s[t] ≡ Hξ [t] is nonzero (the m × η parity check matrix H has full row-rank and satis£es HG = 0). Concurrent error correction can be achieved by associating each valid state in FSM H (of the form Gq[·]) with a unique subset of invalid states that get corrected to that particular valid state. The FSM H de£ned in Eq. (6) that satis£es the constraints in Eqs. (7) and (8) if properly initialized will be called a redundant implementation for FSM S in Eq. (4). The class of redundant FSMs that will be used throughout the remainder of this paper to construct our non-concurrent error detection and identi£cation schemes is the one described in the theorem below whose proof is omitted due to space limitations. Theorem 3.1: Let the decoding, encoding and parity check matrices of a redundant implementation H for FSM S be given by ¸ · ¤ £ £ ¤ Id , H = −C Im . L = Id 0 , G = C

ξ f [L − k] = ξ [L − k] ⊕ vei , where ξ [L − k] is the state that FSM H would be in under fault-free conditions and ei is an η -dimensional column vector with a unique nonzero entry with value “1” at its ith position. Note that a (concurrent) parity check at the end of time step L − k − 1 would result in the syndrome s[L − k] ≡ Hξ f [L − k] = vH(:, i) , where H(:, i) denotes the ith column of matrix H (i.e., concurrent error-detecting/identifying capabilities depend solely on matrix H). Assuming no other fault occurs in the interval [L − k, L − 1], we can use Eq. (5) to £nd the erroneous state of FSM H at the end of time step L − 1 (beginning of time step L):

FSM H is a redundant implementation for FSM S [i.e., the state of FSM H satis£es the decoding and encoding constraints in Eqs. (7) and (8)] if it is initialized in state ξ [0] = Gq[0] and its next-state transition mapping under each input xu ∈ XU satis£es ∆xu (ξ [t]) =

"

δxu (q[t]) ⊖ A12x Cq[t] ⊕ A12x ξr [t] u u Cδxu (q[t]) ⊖ (CA12x C ⊕ A22x C)q[t] ⊕ (CA12x ⊕ A22x )ξr [t] u u u u (9)

·

¸ q[t] where ξ [t] ≡ = Gq[t] under fault-free conditions ξr [t] [here q[t] is a d-dimensional vector (in Qs ) and ξr [t] is an m-dimensional vector, both with entries in GF(q)]. Note that the coupling matrices A12x , xu ∈ XU , and the u redundant dynamics matrices A22x , xu ∈ XU , are free for us u to choose (and will be discussed later on). Also note that the in¤uence of the coupling and the redundant dynamics on the overall FSM state is restricted to be linear. C. Transient State-Transition Faults and Error Propagation We are interested in designing the redundant FSM H so that we can protect against transient state-transition faults. We assume that a transient fault during the calculation of the next state at time step t causes an erroneous value at exactly one of the state variables in the next state vector ξ [t + 1] (in VLSI implementations of digital sequential systems, this will be true if, for example, FSM H is implemented so that the next value of each state variable is calculated by separate hardware [4]); this assumption is not critical and can easily be relaxed. To analyze non-concurrent error detection and identi£cation, we assume (without loss of generality) that the FSM

#

ξ f [L] = ∆x[L−1] (ξ [L − k]) = ∆x[L−1] (ξ [L − k] ⊕ vei ) , (10) x[L−k] f x[L−k] where ξ [L − k] is the error-free state at time step L − k. For an arbitrary redundant implementation, the above expression cannot be further simpli£ed, making it hard to evaluate the non-concurrent syndrome s[L] ≡ Hξ f [L] at the end of time step L − 1. For the class of FSMs that are given in Theorem 3.1, however, things can be simpli£ed in the way indicated in the following theorem. Theorem 3.2: Let the redundant implementation H for FSM S be constructed as in Theorem 3.1. A single error due to a transient state-transition fault that occurred during the execution of time step L−k −1 and corrupted the ith state variable by an additive value v results in the non-concurrent syndrome Ã ! s[L] ≡ Hξ f [L] = v

L−1

∏

τ =L−k

A22

x[τ ]

Hei .

More generally, an error that occurred during the execution of time step L − k − 1 and corrupted the state of the system by an additive error vector e (so that ξ f [L − k] = ξ [L − k] ⊕ e) results in the non-concurrent syndrome ! Ã s[L] ≡ Hξ f [L] =

L−1

∏

τ =L−k

A22

x[τ ]

He .

Proof: Due to the fault during time step L − k − 1, the erroneous state at time step L − k is given by

ξ f [L − k] = ξ [L − k] ⊕ vei ,

·

¸ q[L − k] is the error-free state at time Cq[L − k] step L − k. For ease of notation, we will also write ξ f [L − k] as · ¸ ξ f s [L − k] , ξ f [L − k] = ξ f r [L − k] where ξ [L − k] =

where in the last step we used the fact that He = Hξ f [L − k + 1] = s[L − k + 1] = vA22 Hei . x[L−k]

Continuing in this way, we can prove by induction that the non-concurrent parity check at the end of time step L − 1 is given by ! Ã

where ξ f s [L − k] represents the top d elements in the erroneous state vector and ξ f r [L − k] denotes the bottom m elements in the state vector. At time step L − k + 1, the error propagates to the next-state ξ f [L − k + 1] according to Eq. (9) as ∆x[L−k] (ξ f [L − k]) = ∆x[L−k] 

µ·

δx[L−k] (ξ f s [L − k]) ⊖ A12

ξ f s [L − k] ξ f r [L − k]

x[L−k]

¸¶

=

Cξ f s [L − k] ⊕ A12

x[L−k]

ξ f r [L − k]

   =  Cδ (ξ [L − k]) ⊖ (CA12 C ⊕ A22 C)ξ f s [L − k]⊕  x[L−k] f s x[L−k] x[L−k] ⊕(CA12 ⊕ A22 )ξ f r [L − k] x[L−k]

x[L−k]

When we evaluate the parity check at the end of time step L − k we get s[L − k + 1] ≡ Hξ f [L − k + 1] £ ¤ −C Im ξ f [L − k + 1] =

ξ f r [L − k] Cξ f s [L − k] ⊕ A22 x[L−k] · ¸ £ ¤ ξ f s [L − k] −C Im = A22 ξ f r [L − k] x[L−k] £ ¤ −C Im (vei ) = A22 = −A22

x[L−k]

x[L−k]

= vA22

x[L−k]

Hei ,

where in the next to last step we used the fact that ξ f [L−k] = ξ [L − k] ⊕ vei and Hξ [·] = 0. Clearly, if the erroneous state at time step L − k − 1 was given by ξ f [L − k] = ξ [L − k] ⊕ e , where ξ [L − k] is the error-free state at time step L − k, the parity check at the end of time step L − k would be given by s[L − k + 1] = A22 He. x[L−k]

The above calculations can be repeated to £nd s[L − k + 2]. We can write

ξ f [L − k + 1] = ξ [L − k + 1] ⊕ e (we can always £nd an appropriate e) and repeat the calculations above to obtain s[L − k + 2] = Hξ f [L − k + 2] = A22 He x[L−k+1]

= vA22

x[L−k+1]

A22

s[L] ≡ Hξ f [L] = v

x[L−k]

Hei ,

      

L−1

∏

τ =L−k

A22

Hei .

x[τ ]

At this point the proof of the theorem is complete. ¤ We now generalize the above results to the case of multiple faults within the interval [0, L − 1]. We assume that a total of D transient state-transition faults occur during the executions of time steps L − k1 − 1, L − k2 − 1, ..., L − kD − 1, originally corrupting state variables i1 , i2 , ..., iD by initial additive errors v1 , v2 , ..., vD , respectively. Without loss of generality, we assume that 0 ≤ kD ≤ kD−1 ≤ ... ≤ k2 ≤ k1 ≤ L − 1. The following theorem and corollary is a generalization of Theorem 3.2. Theorem 3.3: Let the redundant implementation H for FSM S be constructed as in Theorem 3.1. Assume that a total of D transient faults take place in the interval [0, L − 1]. More speci£cally, transient faults occur during the executions of time steps 0 ≤ L − k1 − 1 ≤ L − k2 − 1 ≤ ... ≤ L − kD − 1 ≤ L − 1, originally corrupting state variables i1 , i2 , ..., iD by initial additive errors v1 , v2 , ..., vD . Then, the non-concurrent syndrome s[L] is given by      D  L−1 . s[L] ≡ Hξ f [L] = ∑ v j  ∏ A22  Hei j x[τ ]  j=1 τ =L−k j

Proof: Due to the fault during time step L − k1 − 1, the erroneous state at time step L − k1 is given by

ξ f [L − k1 ] = ξ [L − k1 ] ⊕ v1 ei , 1

·

¸

q[L − k1 ] is the error-free state at Cq[L − k1 ] time step L − k1 . Again, for ease of notation, we will also write ξ f [L − k1 ] as

where ξ [L − k1 ] =

ξ f [L − k1 ] =

·

ξ f s [L − k1 ] ξ f r [L − k1 ]

¸

,

where ξ f s [L − k1 ] represents the top d elements in the erroneous state vector and ξ f r [L − k1 ] denotes the bottom m elements in the state vector. The state at time step L − k2 will be

ξ f [L − k2 ] = ξ [L − k2 ] ⊕ e ⊕ v2 ei , 2

where e is chosen so that ξ [L − k2 ] ⊕ e captures the erroneous state in which the redundant implementation H would be in at time step L − k2 due to the presence of the state-transition fault during the execution of time step L − k1 − 1 (but in

the absence of the fault during the execution of time step L − k2 − 1). From Theorem 3.2, e satis£es ÃL−k −1 ! 2

∏

He = v1

A22

Hei ;

x[τ ]

τ =L−k1

1

therefore, the parity check at time step L − k2 satis£es Hξ f [L − k2 ] = H(ξ [t] ⊕ e ⊕ v2 ei ) 2 ! ÃL−k −1 2

∏

= v1

τ =L−k1

A22

x[τ ]

Hei ⊕ v2 Hei . 1

2

The same approach can be used to evaluate the parity check at the time step immediately following the third statetransition fault. Speci£cally, the state at time step L − k 3 can be written as

ξ f [L − k3 ] = ξ [L − k3 ] ⊕ e′ ⊕ v3 ei , 3

where is chosen so that ξ [L − k3 captures the erroneous state in which the redundant implementation H would be in at time step L−k3 due to the presence of state-transition faults during the executions of time steps L − k1 − 1 and L − k2 − 1 (but in the absence of the state-transition fault during the execution of time step L − k3 − 1). Since e′ is caused by an additive error of e ⊕ v2 ei (at time step L − k2 ), 2 we can invoke Theorem 3.2 to show that e′ satis£es e′

′

He

=

] ⊕ e′

Ã

L−k3 −1

Ã

L−k3 −1

∏

τ =L−k2

=

∏

τ =L−k2

=

v1

Ã

A22 A22

x[τ ]

x[τ ]

!

τ =L−k1

2

!Ã Ã v1

L−k3 −1

∏

H(e ⊕ v2 ei )

A22

x[τ ]

!

L−k2 −1

∏

τ =L−k1

Hei ⊕ v2 1

A22

Ã

x[τ ]

!

Hei ⊕ v2 Hei

L−k3 −1

∏

τ =L−k2

A22

2

1

x[τ ]

!

!

Hei . 2

Thus, the parity check at the end of time step L − k3 − 1 will be s[L − k3 ]

=

Hξ f [L − k3 ]

=

H(ξ [L − k3 ] ⊕ e′ ⊕ v3 ei )

=

He′ ⊕ v3 Hei 3 Ã

3

L−k3 −1

=

v1

∏

τ =L−k1

A22

x[τ ]

!

Hei ⊕ v2 1

Ã

L−k3 −1

∏

τ =L−k2

A22

x[τ ]

!

D. Multiple Error Detection and Identi£cation In this section we focus on redundant FSMs that are constructed so that each FSM input is associated with the same redundant dynamics as given by a matrix A22 (to be appropriately chosen in the sequel). Clearly, if D transient state-transition faults occur during the execution of time steps 0 ≤ L − k1 − 1 ≤ L − k2 − 1 ≤ ... ≤ L − kD − 1 ≤ L − 1, originally corrupting state variables i1 , i2 , ..., iD by initial additive errors v1 , v2 , ..., vD , then the syndrome s[L] ≡ Hξ f [L] at the end of time step L − 1 (beginning of time step L) will be s[L] ≡ Hξ f [L] =

D

∑ v j A22j Hei j , k

j=1

i.e., the syndrome will be a linear combination of D columns of the m × Lη syndrome matrix £ ¤ L−1 H H A22 H A222 H · · · A22 S = . (11)

In order to be able to detect D or less errors in the interval [0, L − 1] based on the syndrome s[L], we need all linear combinations of any subset of D columns of S to be nonzero; to be able to uniquely identify the originally affected variables (i1 , i2 , ..., iD ), the values by which they were initially corrupted (v1 , v2 , ..., vD ) and the time steps during which the errors took place (L − k1 − 1, L − k2 − 1, ..., L − kD − 1), we need all linear combinations of any subset of D columns of S to be different from a linear combination of any other subset of D columns of S. Notice that for the above class of redundant implementations (i.e., the class in which each FSM input is associated with the same redundant dynamics matrix A22 ), the m × Lη syndrome matrix S in Eq. (11) is the same as the one we had in [6] for non-concurrent error detection and identi£cation for LFSMs. Therefore, we can obtain redundant implementations with the desired properties by appropriately adopting the designs there. The following theorem summarizes this observation. The proof is omitted. Theorem 3.4: Let the decoding, encoding and parity check matrices of a redundant implementation H for FSM S be given by ¸ · £ ¤ £ ¤ Id , H = −C Im . L = Id 0 , G = C

Hei ⊕ Assume that the next-state transition mapping under each 2

⊕v3 Hei .

input xu ∈ XU satis£es

3

∆xu (ξ [t]) =

Continuing in this fashion, we can easily prove the theorem by induction. Note that when multiple transient faults occur at the same time step (i.e., when ki = ki ) the corresponding j j+1 additive error at that time step is ei ⊕ ei . After making j j+1 this observation, we can use Theorem 3.2 to calculate the propagation of errors in exactly the same way as above. ¤

"

δxu (q[t]) ⊖ A12x Cq[t] ⊕ A12x ξr [t] u u Cδxu (q[t]) ⊖ (CA12x C ⊕ A22 C)q[t] ⊕ (CA12x ⊕ A22 )ξr [t] u

·

¸

u

#

q[t] [here q[t] is a d-dimensional vector (in ξr [t] Qs ) and ξr [t] is an m-dimensional vector, both with entries in GF(q)]. Any D or less errors due to transient state-transition faults in the interval [0, L − 1] can be detected and identi£ed for all ξ [t] =

by a parity check at the end of time step L − 1 (beginning of time step L) if the following conditions are satis£ed: • The constants x, x1 , x2 , ..., xη in the speci£ed £nite £eld GF(q) are chosen so that 1) xi 6= x j for 1 ≤ i < j ≤ η ; ′ 2) xk xi 6= xk x j for integers k, k′ such that 0 ≤ k < k′ ≤ L − 1. • The number of additional state variables is m = 2D. • The 2D × 2D matrix A22 is of the form −1

A22 = M DM ,

•

where D = diag(1, x, x2 , x3 , ..., x2D−1 ) M = V(xd+1 , xd+2 , ..., xη ). The 2D × d matrix C is chosen so that

and

C = −M−1 V(x1 , x2 , ..., xd ) . As in the construction in [6], the order q of the £nite £eld needs to be chosen to be large enough so that the £rst requirement can be satis£ed. When the order of GF(q) is small, one can still construct redundant FSMs that are amenable to non-concurrent error detection and identi£cation by using a greater number of additional state variables (this is shown in [9] in the context of BCH convolutional codes). If we modify the choice of matrices in Theorem 3.4, we can obtain redundant implementations that allow the use of the Peterson-Gorenstein-Ziegler (PGZ) algorithm to ef£ciently determine the D′ errors based on the syndrome s[L] without resorting to an exhaustive search. IV. CONCLUSIONS The real challenge in extending non-concurrent error detection and identi£cation techniques to FSMs lies in our ability to add redundancy and to effectively model transient state-transition faults in a way that allows us to capture the propagation of errors in the system. The schemes proposed in this paper are successful in systematically tracking errors, including errors in the added (redundant) variables. In particular, the BCH-based encoded FSMs enable the use of well-known error detection and identi£cation procedures. There are several parameters that need to be appropriately chosen before these techniques can be applied towards the construction of reliable sequential systems (i.e., window length L, additional state variables m, £nite £eld size q, number of errors D that can be detected/identi£ed). There are also a number of interesting theoretical questions: (i) How can the ¤exibility in the coupling between the redundant state variables and the original state variables [given by the matrices A12x in Eq. (9)] be exploited to our advantage? u (ii) How can we design systems that are optimal in terms of minimizing the redundant amount of hardware instead of minimizing the number of additional state variables? (iii) What other codes can be adopted to this setting and what is the corresponding choice of redundant dynamics? (iv) If one is only interested in k1 (the earliest time step at which

a fault took place, as would be the case in rollback-based recovery), what type of codes and redundant implementations are appropriate? (v) What other connections can be made between algebraic systems theory and coding theory [10], [11] in terms of designing fault-tolerant FSMs? It would also be interesting to investigate these techniques in conjunction with methodologies for mapping algebraic equations to hardware. V. REFERENCES [1] R. E. Blahut, Algebraic Codes for Data Transmission. Cambridge, UK: Cambridge University Press, 2002. [2] T. R. N. Rao and E. Fujiwara, Error-Control Coding for Computer Systems. Englewood Cliffs, New Jersey: Prentice-Hall, 1989. [3] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Transactions on Computers, vol. 33, pp. 518–528, June 1984. [4] C. N. Hadjicostis, Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems. Boston, Massachusetts: Kluwer Academic Publishers, 2002. [5] C. N. Hadjicostis, “Non-concurrent error detection and correction in discrete-time LTI dynamic systems,” IEEE Transactions on Circuits and Systems, vol. 50, pp. 45– 55, January 2003. [6] C. N. Hadjicostis, “Non-concurrent error detection and correction in fault-tolerant linear £nite state machines,” IEEE Transactions on Automatic Control, November 2003. To appear. [7] C. N. Hadjicostis and G. C. Verghese, “Encoded dynamics for fault tolerance in linear £nite-state machines,” IEEE Transactions on Automatic Control, vol. 47, pp. 189–192, January 2002. [8] D. Siewiorek and R. Swarz, Reliable Computer Systems: Design and Evaluation. Natick, Massachusetts: A.K. Peters, 1998. [9] J. Rosenthal and F. V. York, “BCH convolutional codes,” IEEE Transactions on Information Theory, vol. 45, pp. 1833–1844, September 1999. [10] B. Marcus and J. Rosenthal, eds., Codes, Systems, and Graphical Models, vol. 123 of IMA Volumes in Mathematics and its Applications. New York: Springer, 2001. [11] E. V. York, J. Rosenthal, and J. M. Schumacher, “On the relationship between algebraic systems theory and coding theory: representations of codes,” in Proceedings of CDC 1995, the 34th Conference on Decision and Control, vol. 3, (New Orleans, Louisiana), pp. 3271– 3276, 1995.