Linear Encoding Scheme for Weighted Finite Automata

Mathieu Giraud and Dominique Lavenier

IRISA / CNRS / Université de Rennes 1, 35042 Rennes Cedex, France
{mgiraud, lavenier}@irisa.fr

Abstract. In this paper, we show that the linear encoding scheme efficiently implements weighted finite automata (WFA). WFA with t transitions can be hardwired with O(t) cells. They solve pattern matching problems in a pipelined way, parsing one character every clock cycle. With the massive parallelism of reconfigurable processors like FPGAs, a significant speed-up is obtained against software solutions.

1 Introduction

Weighted finite automata (WFA) are finite-state machines with weights on transitions. They have been widely used in image compression [1] and in speech recognition [2]. In biology, searching genomic banks for patterns with error counting, or with arbitrary matrices of substitution scores, can be done using WFA. These applications involve sequential scans of large databases (today tens of gigabytes of data) whose size is increasing faster than CPU power.

Whereas efficient simulation of a non-deterministic finite automaton (NFA) can be achieved by first determinizing it (at the cost of a potentially exponential number of states), direct simulation of WFA is needed as they are not all determinizable [3]. Mark G. Eramian proposed in 2002 an algorithm in O(nt) time, where t is the number of transitions and n the length of the parsed sequence [4].

One can use dedicated hardware to accelerate parsing. Reetinder Sidhu and Viktor K. Prasanna proposed in 2001 an FPGA architecture to implement NFA [5]. This paper aims to extend their idea to WFA: we prove that WFA can be hardwired using a linear encoding scheme, providing a significant acceleration over software methods. In such a hardware implementation, space concerns become prominent and we need to ensure the WFA fits into FPGA devices. Thus, an estimation of the surface area will be conducted.

The rest of the paper is organized as follows. Section 2 provides background definitions about pattern matching and WFA. In Section 3, we show how to generalize the one-hot encoding scheme for NFA to the linear encoding scheme for WFA. Section 4 presents some experimental results comparing our method against software techniques.

M. Domaratzki et al. (Eds.): CIAA 2004, LNCS 3317, pp. 146–155, 2005. © Springer-Verlag Berlin Heidelberg 2005

2 Preliminaries

2.1 Continuous Pattern Matching

Let Σ be a finite alphabet. Elements of Σ are called characters. A word is a finite sequence of characters w = w_1 w_2 … w_n ∈ Σ*. A language L is a subset of Σ*. Given a word w and a language L, the problem of continuous pattern matching is to find all subwords v of w such that v ∈ L. Because this problem can have O(n²) solutions (like a* in the word aⁿ), we restrict it to finding only the positions in the initial word at which a matching subword ends, that is, to determining the set

Pos(L, w) = { j ∈ [1; n] | ∃ i ∈ [1; j], w_i w_{i+1} … w_j ∈ L }.

When L is a singleton or a finite dictionary, some indexing techniques can handle continuous pattern matching. Here those techniques do not apply since L will be defined by a weighted finite automaton.

2.2 Weighted Finite Automaton

Weighted finite automata (WFA) are finite-state machines describing languages of higher complexity than NFA [2, 4]. Let (K, ⊕, ⊗) be a semiring, where 0̄ and 1̄ are the identity elements for ⊕ and ⊗. A weighted finite automaton (WFA) is a 5-tuple A = (Q, Σ, δ, I, F), where Q is a finite set of states, Σ a finite alphabet, δ : Q × Σ × Q → K the transition table, and I ⊂ Q and F ⊂ Q the sets of initial and final states. The WFA A gives to every word w = w_1 w_2 … w_n a weight W(w) defined by

\[
W(w) = \bigoplus_{\substack{q_0 \in I,\ q_n \in F \\ q_1, \ldots, q_{n-1} \in Q}}
\delta(q_0, w_1, q_1) \otimes \delta(q_1, w_2, q_2) \otimes \cdots \otimes \delta(q_{n-1}, w_n, q_n).
\]

This weight is the ⊕-sum (i.e. the sum according to ⊕) of all the weights on paths labeled by w from an initial state to a final state. Let us now define a recognizing set J ⊂ K. We say that the word w is recognized by A when W(w) ∈ J. If for every state q_1 and for every character α there exists at most one state q_2 with δ(q_1, α, q_2) ≠ 0̄, the WFA is said to be deterministic.

The nondeterministic finite automata (NFA) are only a particular case of WFA over the boolean semiring ({T, F}, ∨, ∧) with the recognizing set J = {T}. In this case, a word w is recognized by A when there exists a path labeled by w from an initial state to a final state. Other semirings are used, like (R+, +, ×) (probabilistic), (R ∪ {−∞}, ⊕log, +) (logarithm) and (R ∪ {−∞}, max, +) (Viterbi's approximation).

For practical use, we consider only a finite subset of the semiring (Z ∪ {−∞}, max, +). In that case and with the recognizing set J = {x ∈ Z | x > 0}, the WFA A2 represented in Fig. 1 recognizes the subset of L1 containing strictly more occurrences of b than of c. This language L2 is not regular.

In the following, we want to solve on large databases the continuous pattern matching problem in which the language L is described by a WFA. The set Pos(L, w) will have the form { j ∈ [1; n] | ∃ i ∈ [1; j], W(w_i w_{i+1} … w_j) ∈ J }. The next section presents a hardware representation of WFA.
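As an illustration of these definitions, the following Python sketch computes W(w) over (Z ∪ {−∞}, max, +) and derives Pos(L, w) by brute force. The transition table is our reading of the WFA A2 of Fig. 1 (weight 0 for a, +1 for b, −1 for c) and should be taken as an assumption; the names A2_DELTA, weight and positions are hypothetical, and the test x > 0 follows the threshold drawn in the figure.

```python
NEG_INF = float("-inf")  # the zero element of (Z ∪ {-inf}, max, +)

# Assumed transcription of A2 (Fig. 1): (state, character) -> [(next state, weight)]
A2_DELTA = {
    (1, "a"): [(2, 0)],
    (2, "b"): [(2, 1), (4, 1)],
    (2, "c"): [(2, -1), (3, -1)],
    (3, "a"): [(5, 0)],
    (4, "c"): [(5, -1)],
}
A2_INITIAL, A2_FINAL = {1}, {5}


def weight(word, delta=A2_DELTA, initial=A2_INITIAL, final=A2_FINAL):
    """Return W(word): the max, over accepting paths, of the summed weights."""
    best = {q: 0 for q in initial}  # best[q]: best weight of a path reaching q
    for ch in word:
        nxt = {}
        for q, w in best.items():
            for dst, dw in delta.get((q, ch), []):
                nxt[dst] = max(nxt.get(dst, NEG_INF), w + dw)
        best = nxt
    return max((w for q, w in best.items() if q in final), default=NEG_INF)


def positions(word, in_J=lambda x: x > 0):
    """Brute-force Pos(L, w): 1-based end positions j of recognized subwords."""
    n = len(word)
    return [j for j in range(1, n + 1)
            if any(in_J(weight(word[i - 1:j])) for i in range(1, j + 1))]


print(weight("abbca"))       # 1: two b, one c
print(weight("abc"))         # 0: in L1 but not in L2
print(positions("abcabbc"))  # [7]
```

This brute-force version re-parses every subword and is only meant to make the definition concrete; it is not the method proposed in the paper.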

[Figure 1: two automata with states 1 to 5. Left: transitions labeled a, b and c. Right: the same transitions labeled (a, 0), (b, 1) and (c, −1), followed by the test "> 0?".]

Fig. 1. On the left side, the NFA A1 recognizing the regular language L1 = a (b|c)* (ca | bc). On the right side, the WFA A2 over the semiring (Z ∪ {−∞}, max, +). It recognizes the non-regular language L2 = {w ∈ L1 | |w|_b > |w|_c}

3 Linear Encoding Scheme for WFA

This section gives an overview of encoding schemes for finite-state machines, then describes the linear encoding scheme for WFA and its properties.

3.1 Encoding Schemes for Finite-State Machines

There are two major schemes to encode a finite-state machine with |Q| states in hardware, according to the representation of its states [6]:
– the logarithmic scheme uses a bit vector of size ⌈log2 |Q|⌉ in binary encoding (natural, Gray, or any encoding tailored to a particular application). For |Q| = 5, one can have the values {000, 001, 011, 101, 111};
– the linear scheme (or one-hot scheme) uses a bit vector of size |Q| where only one bit is set to 1, like in the set {00001, 00010, 00100, 01000, 10000}.

Those schemes lead to different hardware implementations. The size of the logarithmic scheme depends mainly on the logic part and can be reduced with a good numbering scheme. This approach is usual for a conventional serial machine, but is limited to deterministic automata. On the other hand, several states can be active at the same time in the linear encoding scheme, implementing an NFA in a multi-hot fashion. Sidhu and Prasanna showed that this representation is very effective for scanning for a regular expression with an FPGA [5].

As the linear encoding scheme needs as many operators as the number of transitions, one could think that it is limited to automata with few transitions. However, it has been shown that, for common automata, the linear scheme is less power-consuming and even smaller than the logarithmic scheme [6].

3.2 Linear Encoding Scheme for WFA

Sidhu and Prasanna build an NFA from a regular expression describing it [5]. We present here a linear encoding scheme for WFA by giving another point of view: we directly map a given WFA into hardware.


Let A = (Q, Σ, δ, I, F) be a WFA over a semiring K, as defined in Section 2.2. We denote by k and p the number of bits needed to represent respectively the characters of the alphabet Σ and the weights in K. Typical values are k = 8 for ASCII text or k = 5 for amino acid patterns, and bit widths for the weight range from p = 1 to p = 16.

Principle. The hardware implementation can be viewed as a shift register in which a p-bit weight is moving. For each state q, there is a p-bit register. We call e_j^q its value at clock cycle j.
– Each set of transitions from a state q′ to a state q is materialized by an evaluator (left part of Fig. 2). It receives k bits (the current character w_j) and generates the weight δ(q′, w_j, q). In the general case, this evaluator is a k → p function (k binary inputs, p binary outputs).
– This weight is aggregated with the weight at the previous state, giving the value s_j^{q′,q} = e_{j−1}^{q′} ⊗ δ(q′, w_j, q).
– Each state is a register driven by the ⊕-sum of all the values s_j^{q′,q} at the outputs of its incoming transitions (right part of Fig. 2). At the following clock cycle, this ⊕-sum e_j^q is given as input to other transitions.

The initialization phase of the automaton, not shown here, consists in setting all states to 0̄ except the initial states, which are set to 1̄. Those initial states always receive an additional incoming transition whose weight is kept at 1̄. The surface area needed by the WFA is here O(2^k p t), where t is the number of pairs (q′, q) having a non-void transition δ(q′, α, q) for some α.
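The principle above can be mimicked in software, one function call per clock cycle. The Python sketch below is only a behavioral model over (Z ∪ {−∞}, max, +), not the VHDL design; the table EVALUATORS is an assumed transcription of the WFA A2 of Fig. 1, with one evaluator per pair (q′, q), and all names are hypothetical.

```python
NEG_INF = float("-inf")  # zero element; the one element is 0

# Hypothetical encoding: one evaluator (character -> weight) per pair (q', q).
# The pairs below are an assumed transcription of the WFA A2 of Fig. 1.
EVALUATORS = {
    (1, 2): {"a": 0},
    (2, 2): {"b": 1, "c": -1},
    (2, 3): {"c": -1},
    (2, 4): {"b": 1},
    (3, 5): {"a": 0},
    (4, 5): {"c": -1},
}
STATES, INITIAL, FINAL = {1, 2, 3, 4, 5}, {1}, {5}


def clock_cycle(e_prev, ch):
    """Compute all registers e_j from e_{j-1} and the current character."""
    # Initial states receive the extra incoming transition of weight 0.
    e_next = {q: (0 if q in INITIAL else NEG_INF) for q in STATES}
    for (src, dst), evaluator in EVALUATORS.items():
        delta = evaluator.get(ch, NEG_INF)   # evaluator: character -> weight
        s = e_prev[src] + delta              # the product is an addition
        e_next[dst] = max(e_next[dst], s)    # the sum is a maximum
    return e_next


def run(word):
    """Return the outputs E_j (max over final states), one per cycle."""
    e = {q: (0 if q in INITIAL else NEG_INF) for q in STATES}  # initialization
    outputs = []
    for ch in word:
        e = clock_cycle(e, ch)
        outputs.append(max(e[q] for q in FINAL))
    return outputs


print(run("abbc"))  # [-inf, -inf, -inf, 1]
```

In hardware all the evaluators work in parallel within one cycle; the loop over EVALUATORS only serializes that work, which is where the O(t) cost per character of a software simulation comes from.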

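[Figure 2: left, the evaluator of a transition from q′ to q, which receives w_j and adds δ(q′, w_j, q) to e_{j−1}^{q′} to produce s_j^{q′,q}; right, the register of state q, driven by the max of the incoming values s_j^{q′,q} (and of 1̄ if q ∈ I), which outputs e_j^q.]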

Fig. 2. Principle of linear encoding scheme for WFA over a finite subset of (Z ∪ {−∞}, max, +). Here the identity elements are 0̄ = −∞ and 1̄ = 0. The p bits representing the weight are a compound of p − 1 bits representing a two's complement integer, and 1 bit representing −∞ (for the initialization, nonexistent transitions, and overflows).

As we consider only a finite subset of the semiring, one must ensure that overflows are correctly handled. The overflow at −∞ can be neglected, as it represents a weight which is very unlikely to contribute to a final maximum. The overflow at +∞ is detected at the output and gives a hit in the recognition. If there are cycles in the automaton, a reset of the whole automaton must follow the overflow at +∞.
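One possible reading of this overflow policy is sketched below in Python. It is an assumption on our part, not the circuit of the paper: a weight is modeled as None for −∞ or as an integer fitting in p − 1 bits of two's complement, and the hypothetical function sat_add returns the sum together with a flag raised on overflow at +∞.

```python
def sat_add(a, b, p):
    """Add two p-bit weights; None stands for the -inf bit.

    Returns (value, overflow_hit). The value range is the one of a
    two's complement integer on p - 1 bits.
    """
    lo = -(1 << (p - 2))       # smallest representable integer
    hi = (1 << (p - 2)) - 1    # largest representable integer
    if a is None or b is None:
        return None, False     # -inf is absorbing for the addition
    s = a + b
    if s < lo:
        return None, False     # overflow at -inf: neglected, mapped to -inf
    if s > hi:
        return hi, True        # overflow at +inf: reported as a hit
    return s, False


print(sat_add(10, -3, 8))    # (7, False)
print(sat_add(60, 5, 8))     # (63, True)
print(sat_add(-60, -10, 8))  # (None, False)
```

With p = 8 the representable integers range from −64 to 63. A caller that sees the flag raised would report a match and, if the automaton has cycles, reset all registers.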

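[Figure 3: circuit for A2 with states 1 to 5. Each transition (labeled a, b or c) compares the current character w_j and feeds an adder (+); states with several incoming transitions use a max operator; the output E_j of the final state 5 goes through the test "> 0?".]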

Fig. 3. Linear encoding scheme for the WFA A2

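Values of States. The previous descriptions can be summarized as:

\[
\begin{cases}
e_0^q = \begin{cases} \bar{1} & \text{if } q \in I, \\ \bar{0} & \text{if } q \notin I, \end{cases} \\[1ex]
s_j^{q',q} = e_{j-1}^{q'} \otimes \delta(q', w_j, q), \\[1ex]
e_j^q = \begin{cases} \bar{1} \oplus \Bigl(\bigoplus_{q' \in Q} s_j^{q',q}\Bigr) & \text{if } q \in I, \\ \bigoplus_{q' \in Q} s_j^{q',q} & \text{if } q \notin I. \end{cases}
\end{cases}
\]

With this equation set, the following lemma holds:

Lemma. If q is a state and j an integer, one has

\[
e_j^q = \bigoplus_{i=0}^{j} \; \bigoplus_{\substack{q_i \in I,\ q_j = q \\ q_{i+1}, \ldots, q_{j-1} \in Q}} \; \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}).
\]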

Corollary. The ⊕-sum of all the weights at the final states is E_j = ⊕_{q∈F} e_j^q = ⊕_{i=0}^{j} W(w_i w_{i+1} … w_j).

The proof of the lemma, which relies on the right-distributivity of ⊗, is given in Appendix A. The corollary says that the final value E_j is the ⊕-sum of the weights of all the words w_i … w_j. Thus, if one could deduce from E_j whether there is an i such that W(w_i w_{i+1} … w_j) is in J, one would know whether a word w_i … w_j has been recognized by checking whether E_j is in J. For this, we now say that J is a good recognizing set if it has the two following properties:
– ∀a ∈ J, ∀b ∈ K, a ⊕ b ∈ J,
– ∀a ∈ K, ∀b ∈ K, a ⊕ b ∈ J ⟹ a ∈ J or b ∈ J.

With this definition, a direct consequence of the above corollary is:

Theorem. If J is a good recognizing set, then E_j ∈ J ⟺ j ∈ Pos(L, w).

Therefore, if the hypothesis of the theorem holds, the continuous pattern matching problem is solved by parsing one character on every clock cycle and by observing the value at the final states. In fact, the clock cycle time is in O(p log d_max), where d_max is the maximum incoming degree of the states, but this is not a limitation for usual WFA.
– In the case of the NFA (boolean semiring ({T, F}, ∨, ∧)), only one bit is needed to represent the weight: we fall back on the one-hot scheme. The subset J = {T} is a good recognizing set. Each evaluator k → 1 is reduced to a comparator (w_j ∈ A) for some subset A ⊂ Σ, the ⊗ is an AND gate and the ⊕ an OR gate.
– In the semiring (Z, max, +), the only good recognizing sets are those of the form J = {x ∈ Z | x ≥ x0} for some x0. Those sets fit perfectly in the applications of WFA where the weight is a score compared to a threshold to know if a sequence was recognized.
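The two instances above differ only in the semiring and in the recognizing set, which the following Python sketch makes explicit. It is a software model under assumptions: the table EDGES is our transcription of the automata of Fig. 1, the function names are hypothetical, and the threshold x0 = 1 encodes the test "> 0" on integers.

```python
import operator

# Assumed transcription of Fig. 1: (q', q) -> characters labeling the transition.
EDGES = {(1, 2): "a", (2, 2): "bc", (2, 3): "c",
         (2, 4): "b", (3, 5): "a", (4, 5): "c"}
STATES, INITIAL, FINAL = {1, 2, 3, 4, 5}, {1}, {5}
SCORE = {"a": 0, "b": 1, "c": -1}  # weights of A2


def scan(word, zero, one, plus, times, delta, in_J):
    """Return the 1-based positions j such that E_j is in J."""
    e = {q: (one if q in INITIAL else zero) for q in STATES}
    hits = []
    for j, ch in enumerate(word, start=1):
        nxt = {q: (one if q in INITIAL else zero) for q in STATES}
        for (src, dst), chars in EDGES.items():
            nxt[dst] = plus(nxt[dst], times(e[src], delta(chars, ch)))
        e = nxt
        E_j = zero
        for q in FINAL:
            E_j = plus(E_j, e[q])
        if in_J(E_j):
            hits.append(j)
    return hits


def scan_nfa(word):
    """Boolean semiring: comparator evaluators, AND as product, OR as sum."""
    return scan(word, False, True, operator.or_, operator.and_,
                lambda chars, ch: ch in chars, lambda x: x)


def scan_wfa(word, x0=1):
    """(Z ∪ {-inf}, max, +) with the threshold set J = {x | x >= x0}."""
    neg_inf = float("-inf")
    return scan(word, neg_inf, 0, max, operator.add,
                lambda chars, ch: SCORE[ch] if ch in chars else neg_inf,
                lambda x: x >= x0)


print(scan_nfa("abcabbc"))  # [3, 4, 7]
print(scan_wfa("abcabbc"))  # [7]
```

On this word, the NFA reports the ends of abc, abca and abbc, whereas the WFA keeps only abbc, the one subword with strictly more b than c.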

4 Performance Evaluation

This section is about the performance of a real implementation of the linear encoding scheme described in Section 3.2. Here the WFA are over a finite subset of (Z, max, +). We begin by describing the context of use. As we use a low-cost FPGA chip and as the main constraint is about size, we need to know precisely the surface area taken by the WFA; this is done in Section 4.2. In Section 4.3, we compare the speed achieved against software techniques.

4.1 Context of Use

FPGAs. Field Programmable Gate Arrays (FPGAs) are reconfigurable chips composed of a matrix of interconnected logic cells [7]. The logic inside each cell, as well as the interconnections, can be configured in a few milliseconds, which yields a custom chip. The cost of such solutions is orders of magnitude below the cost of full-custom ASIC (Application Specific Integrated Circuit) chips.

Prototype Board. Our prototype board, which is part of the R-disk system [8], is devoted to filtering large genomic databases on the fly. The board contains a hard disk and a low-cost FPGA which directly filters data from the disk. The total cost of the components is less than $200. The FPGA is the Spartan-II from Xilinx. It contains 1176 configurable logic blocks (CLBs), each one having 4 look-up tables (LUTs) of 16 bits. The LUTs can realize any 4 → 1 boolean function. Almost two thirds of the FPGA is devoted to the filter, that is, a little more than 3000 LUTs. It operates at a clock frequency of 40 MHz.

4.2 Implementing the Linear Encoding Scheme on FPGAs

FPGA devices are well suited for the linear encoding scheme because of the high number of available registers and the local propagation of data without global control. Furthermore, the computation of transition weights fits perfectly into LUTs with 4 inputs.

Automaton   Transitions logic   Weight operators               Total, by     Maximum number
type                                                           transition    of transitions
NFA         5 → 1: 2 LUTs       AND / OR (1 bit): ≤ 1 LUT      ≤ 3 LUTs      ≥ 1000
WFA, Z      5 → p: 2p LUTs      max / + (p bits): ≤ 3p LUTs    ≤ 5p LUTs     ≥ 600/p

Fig. 4. Upper bound for the number of LUTs when k = 5. The last column shows the maximum number of transitions for a Spartan-II FPGA with 3000 LUTs

The regularity of the architecture makes programming relatively easy. Our implementation, written in OCaml, translates abstract descriptions of WFA into their representation in the hardware description language VHDL. One of the main issues with WFA is that their topology may change for each query. Design techniques with JBits [9] would allow a fast compilation of arbitrary WFA shapes, but they would need a custom place-and-route algorithm. The current, slower solution is to perform a full compilation from VHDL for each query, the overhead due to compilation (4-5 minutes) being small compared to the performance gain when scanning large databases.

For the scanning of protein databases (alphabet with 5 bits), an automaton with q states and t transitions with a weight of p bits takes a surface area of 3pt + 2p(t − q) LUTs before compiler optimizations. The total area taken is less than 5pt LUTs. Thus WFA with 75 transitions and an 8-bit weight can be encoded.

To verify this bound, real FPGA experiments were done using the standard Xilinx framework. We ran our method on two bench sets. The first one consists of random WFA, and results show that the real limit is beyond 75 transitions (left part of Fig. 5). The other bench set is the PROSITE protein pattern bank [10], which contains about 1300 patterns that we translate into WFA to allow substitution errors. More than 98% of the PROSITE bank can be mapped onto the FPGA.
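The area estimate can be checked with a few lines of arithmetic. The Python sketch below simply evaluates the formulas of the text for k = 5; the function names and the example value q = 40 are hypothetical.

```python
def lut_area(p, t, q):
    """Estimate from the text: 3pt + 2p(t - q) LUTs before optimizations."""
    return 3 * p * t + 2 * p * (t - q)


def lut_upper_bound(p, t):
    """Upper bound from the text: 5pt LUTs."""
    return 5 * p * t


def max_transitions(available_luts, p):
    """Largest t such that the 5pt bound fits in the available LUTs."""
    return available_luts // (5 * p)


print(lut_area(p=8, t=75, q=40))  # 2360
print(lut_upper_bound(8, 75))     # 3000
print(max_transitions(3000, 8))   # 75
```

Since q is at least 1, the estimate 3pt + 2p(t − q) always stays below the 5pt bound, which is why 75 transitions is a guaranteed limit rather than the real one.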

[Figure 5: left panel, LUTs (0 to 3500) against Transitions (0 to 120), with the reference curves 5pt, 3pt, 2pt and the 3000 LUTs limit; right panel, Processing Bandwidth in MB/s (0 to 25) against Transitions (0 to 120), for "PC: agrep (4 errors)", "PC: WFA simulation" and "One prototype board (R-disk)".]

Fig. 5. Experimental results for the linear encoding scheme. The left part shows the LUT count for different WFA sizes. The right part compares the bandwidth processed by one prototype board with an FPGA against software solutions on a PC

4.3 Performance Comparison

Sidhu and Prasanna [5] showed that their FPGA realization is more effective than software such as agrep if the data is large enough. Their conclusions remain valid for WFA, which perform even more operations (additions, maximums). We compared our approach with some software techniques using WFA. The low-cost Spartan-II is compared against a Pentium IV 2 GHz with 728 MB RAM. This comparison is fair since the Spartan-II was released in 2000 and the Pentium IV 2 GHz in 2001. Results are shown in the right part of Fig. 5.

The comparison with agrep [11] is for reference only, as this software only parses for regular expressions or for weighted expressions with a fixed score (with at most 4 substitution errors). When patterns are small and with no errors, data can be parsed through agrep at the disk rate. But those rates go down with errors and with larger patterns.

More interesting is the comparison against a software simulation of WFA, as in the algorithm described by Eramian in [4], which parses data in O(nt) time. Data rates go from 10 MB/s for small WFA down to less than 1 MB/s for WFA with more than 30 states. In contrast, our WFA implementation on the FPGA parses data at a constant bandwidth (currently 15 MB/s), as long as the WFA fits into the available surface area of the FPGA. This bandwidth amounts to less than one amino acid (5 bits) per cycle of the 40 MHz FPGA clock, so the FPGA, which parses a character on every clock cycle, keeps up with the data flow.

Experiments were done on real data (80-transition WFA, 34 GB canine DNA database). The scan takes more than 20 hours on a 2 GHz Pentium. On a single prototype R-disk board, it takes less than 45 minutes (5 minutes for compiling and 40 minutes for parsing).
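These figures can be cross-checked with the short Python computation below. It only rearranges numbers quoted in the text and assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the variable names are hypothetical.

```python
CLOCK_HZ = 40e6        # FPGA clock frequency
BITS_PER_CHAR = 5      # one amino acid
DISK_RATE_MB_S = 15    # bandwidth delivered to the FPGA
DATABASE_GB = 34       # size of the scanned database

# Peak rate if one 5-bit character were consumed on every clock cycle.
peak_mb_s = CLOCK_HZ * BITS_PER_CHAR / 8 / 1e6

# Fraction of a character actually arriving per clock cycle.
chars_per_cycle = DISK_RATE_MB_S / peak_mb_s

# Time to stream the whole database at the disk rate.
scan_minutes = DATABASE_GB * 1e9 / (DISK_RATE_MB_S * 1e6) / 60

# Software rate implied by a 20-hour scan, and speed-up over 45 minutes.
software_mb_s = DATABASE_GB * 1e9 / (20 * 3600) / 1e6
speedup = 20 * 60 / 45

print(peak_mb_s)                # 25.0
print(round(scan_minutes, 1))   # 37.8
print(round(software_mb_s, 2))  # 0.47
print(round(speedup, 1))        # 26.7
```

The 37.8 minutes agree with the reported 40 minutes of parsing, and the implied software rate below 0.5 MB/s is in line with the "less than 1 MB/s" measured for large WFA.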

5 Conclusion

Weighted finite automata can be effectively hardwired on FPGAs with the linear encoding scheme. That encoding is well suited for standard FPGA devices and provides a significant speed-up over software implementations. To our knowledge, this is the first hardware realization of WFA.

The main current limitation of the linear encoding scheme is the size requirement of the targeted WFA. Currently, we can implement WFA with an 8-bit weight and more than 75 transitions. This limit is already pushed back by the next generation of FPGAs: in 2004, Xilinx sells the low-cost Spartan-3 FPGAs with more than 18,000 CLBs, that is, 15 times larger than the chip we use in our prototype board. The transition limit rises accordingly. If a higher number of transitions is available, one could distribute them among several automata, especially when one needs to parse nucleic banks for protein patterns through six reading frames.

More generally, the speed-up obtained by such a spatial implementation [12] against software techniques will continue to increase, as it is easier to exploit more resources in a reconfigurable device than in a sequential CPU.


References

1. Culik II, K., Kari, J.: Image Compression Using Weighted Finite Automata. In: Mathematical Foundations of Computer Science (MFCS 93). Volume 711 of Lecture Notes in Computer Science. (1993) 392–402
2. Mohri, M., Pereira, F., Riley, M.: Weighted Automata in Text and Speech Processing. In Kornai, A., ed.: Extended Finite State Models of Language (ECAI 96). (1996) 46–50
3. Buchsbaum, A.L., Giancarlo, R., Westbrook, J.R.: On the Determinization of Weighted Finite Automata. SIAM Journal on Computing 30 (2001) 1502–1531
4. Eramian, M.G.: Efficient Simulation of Nondeterministic Weighted Finite Automata. In: Fourth Workshop on Descriptional Complexity of Formal Systems (DCFS 02). (2002)
5. Sidhu, R., Prasanna, V.K.: Fast Regular Expression Matching using FPGAs. In: IEEE Symposium on Field Programmable Custom Computing Machines (FCCM 01). (2001)
6. Dunoyer, J., Pétrot, F., Jacomme, L.: Stratégies de codage des automates pour des applications basse consommation : expérimentation et interprétation. In: Journées d'étude Faible Tension et Faible Consommation (FTFC 97). (1997)
7. Sanchez, E.: Field Programmable Gate Array (FPGA) Circuits. Lecture Notes in Computer Science (1996) 1–18
8. Lavenier, D., Guyetant, S., Derrien, S., Rubini, S.: A reconfigurable parallel disk system for filtering genomic banks. In: Proc. Int. Conf. ERSA'03. (2003)
9. Guccione, S., Levi, D., Sundararajan, P.: JBits: A Java-based Interface for Reconfigurable Computing. In: 2nd Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD). (1999)
10. Bucher, P., Bairoch, A.: A Generalized Profile Syntax for Biomolecular Sequences Motifs and its Function in Automatic Sequence Interpretation. In: Intelligent Systems for Molecular Biology (ISMB 94). (1994) 53–61
11. Wu, S., Manber, U.: Fast Text Searching Allowing Errors. Communications of the ACM 35 (1992) 83–91
12. DeHon, A.: Very Large Scale Spatial Computing. In: Third International Conference on Unconventional Models of Computation (UMC 02). (2002) 27–37

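Appendix A

Proof of Lemma. Here we prove by induction on j the following property:

\[
e_j^q = \bigoplus_{i=0}^{j} \; \bigoplus_{\substack{q_i \in I,\ q_j = q \\ q_{i+1}, \ldots, q_{j-1} \in Q}} \; \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}).
\]

At cycle j = 0, the property reads e_0^q = ⊕_{q_0 ∈ I, q_0 = q} 1̄, that is, e_0^q equals 1̄ if q ∈ I and 0̄ if q ∉ I: the property is true.

Assume that the property holds up to cycle j − 1, with j ≥ 1. Let q be a non-initial state. We compute the value e_j^q of the state q at cycle j.

\[
\begin{aligned}
e_j^q &= \bigoplus_{q' \in Q} s_j^{q',q} \\
&= \bigoplus_{q' \in Q} e_{j-1}^{q'} \otimes \delta(q', w_j, q) \\
&= \bigoplus_{q' \in Q} \Biggl( \bigoplus_{i=0}^{j-1} \bigoplus_{\substack{q_i \in I,\ q_{j-1} = q' \\ q_{i+1}, \ldots, q_{j-2} \in Q}} \bigotimes_{t=i}^{j-2} \delta(q_t, w_{t+1}, q_{t+1}) \Biggr) \otimes \delta(q', w_j, q)
&& \text{(hypothesis of induction)} \\
&= \bigoplus_{q' \in Q} \bigoplus_{i=0}^{j-1} \bigoplus_{\substack{q_i \in I,\ q_{j-1} = q' \\ q_{i+1}, \ldots, q_{j-2} \in Q}} \Biggl( \bigotimes_{t=i}^{j-2} \delta(q_t, w_{t+1}, q_{t+1}) \otimes \delta(q', w_j, q) \Biggr)
&& \text{(right-distributivity of } \otimes \text{)} \\
&= \bigoplus_{q' \in Q} \bigoplus_{i=0}^{j-1} \bigoplus_{\substack{q_i \in I,\ q_{j-1} = q',\ q_j = q \\ q_{i+1}, \ldots, q_{j-1} \in Q}} \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}) \\
&= \bigoplus_{q' \in Q} \bigoplus_{i=0}^{j} \bigoplus_{\substack{q_i \in I,\ q_{j-1} = q',\ q_j = q \\ q_{i+1}, \ldots, q_{j-1} \in Q}} \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1})
&& \text{(because } q \text{ is not initial)} \\
&= \bigoplus_{i=0}^{j} \bigoplus_{\substack{q_i \in I,\ q_j = q \\ q_{i+1}, \ldots, q_{j-1} \in Q}} \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}).
\end{aligned}
\]

Thus the property is true at cycle j. If q is initial, the same result is obtained by a similar computation, adding a 1̄ to each term. By induction, the property is true for every cycle j ≥ 0.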

Abstract. In this paper, we show that the linear encoding scheme eﬃciently implements weighted ﬁnite automata (WFA). WFA with t transitions can be hardwired with O(t) cells. They solve pattern matching problems in a pipelined way, parsing one character every clock cycle. With the massive parallelism of reconﬁgurable processors like FPGAs, a signiﬁcant speed-up is obtained against software solutions.

1

Introduction

Weighted ﬁnite automata (WFA) are ﬁnite-state machines with weights on transitions. They have been widely used in image compression [1] or in speech recognition [2]. In Biology, searching genomic banks for patterns with error counting, or with arbitrary matrices of substitution scores, can be made using WFA. These applications involve sequential scans of large databases (today tens of gigabytes of data) whose size is increasing faster than CPU power. Whereas eﬃcient simulation of a non-deterministic ﬁnite automaton (NFA) can be achieved by ﬁrst determinizing it (although leading to a potential exponential number of states), direct simulation of WFA is needed as they are not all determinizable [3]. Mark G. Eramian proposed in 2002 an algorithm in O(nt) time, where t is the number of transitions and n the length of the parsed sequence [4]. One can use dedicated hardware to accelerate parsing. Reetinder Sidhu and Viktor K. Prasanna proposed in 2001 an FPGA architecture to implement NFA [5]. This paper aims to extend their idea to WFA: we prove that WFA can be hardwired using a linear encoding scheme, providing a signiﬁcant acceleration over software methods. In such a material implementation, space concerns become prominent and we need to ensure the WFA ﬁts into FPGA devices. Thus, an estimation of the surface area will be conducted. The rest of the paper is organized as follows. Section 2 provides background deﬁnitions about pattern matching and WFA. In Section 3, we show how to generalize the one-hot encoding scheme for NFA to the linear encoding scheme for WFA. Section 4 presents some experimental results comparing our method against software techniques. M. Domaratzki et al. (Eds.): CIAA 2004, LNCS 3317, pp. 146–155, 2005. c Springer-Verlag Berlin Heidelberg 2005

Linear Encoding Scheme for Weighted Finite Automata

2 2.1

147

Preliminaries Continuous Pattern Matching

Let Σ be a ﬁnite alphabet. Elements of Σ are called characters. A word is a ﬁnite sequence of characters w = w1 w2 . . . wn ∈ Σ ∗ . A language L is a subset of Σ ∗ . Given a word w and a language L, the problem of continuous pattern matching is to ﬁnd all subwords v of w such that v ∈ L. Because this problem can have O n2 solutions (like a∗ in the word an ), we restrict it to ﬁnd only all positions in the initial word which are terminating matching subwords, that is determining the set Pos(L, w) = { j ∈ [1; n] | ∃ i ∈ [1; j], wi wi+1 . . . wj ∈ L }. When L is a singleton or a ﬁnite dictionary, some indexing techniques can handle the continuous pattern matching. Here those techniques do not apply since L will be deﬁned by a weighted ﬁnite automata. 2.2

Weighted Finite Automaton

Weighted ﬁnite automata (WFA) are ﬁnite-state machines describing languages of higher complexity than NFA [2, 4]. Let (K, ⊕, ⊗) be a semiring, where ¯0 and ¯1 are the identity elements for ⊕ and ⊗. A weighted ﬁnite automaton (WFA) is a 5-uple A = (Q, Σ, δ, I, F ), where Q is a ﬁnite set of states, Σ a ﬁnite alphabet, δ : Q × Σ × Q → K the transition table, I ⊂ Q and F ⊂ Q the initial and ﬁnal states set. The WFA A gives to every word w = w1 w2 . . . wn a weight W (w) deﬁned by q ,...,q

∈Q

W (w) = ⊕q10 ∈I, qn−1 n ∈F

δ(q0 , w1 , q1 ) ⊗ δ(q1 , w2 , q2 ) ⊗ . . . ⊗ δ(qn−1 , wn , qn ).

This weight is the ⊕-sum (i.e. the sum according to ⊕) of all the weights on paths from an initial state to a ﬁnal state labeled by w. Let us now deﬁne a recognizing set J ⊂ K. We say that the word w is recognized by A when W (w) ∈ J. If for every state q1 and for every character α, there exists at most one state 0, the WFA is said to be deterministic. q2 with δ(q1 , α, q2 ) = ¯ The nondeterministic ﬁnite automata (NFA) are only a particular case of WFA over the boolean semiring ({T, F }, ∨, ∧) with the recognizing set J = {T }. In this case, a word w is recognized by A when there exists a path from a initial state to a ﬁnal state labeled by w. Other semirings are used like (R+ , +, ×) (probabilistic), (R∪{−∞}, ⊕log , +) (logarithm) and (R ∪ {−∞}, max, +) (Viterbi’s approximation). For practical use, we consider only a ﬁnite subset of the semiring (Z ∪ {−∞}, max, +). In that case and with the recognizing set J = {x ∈ Z | x ≥ 0}, the WFA A2 represented in Fig. 1 recognizes the subset of L1 containing strictly more occurrences of b than c. This language L2 is not regular. In the following, we want to solve on large databases the continuous pattern matching problem in which the language L is described by a WFA. The set Pos(L, w) will have the form { j ∈ [1; n] | ∃ i ∈ [1; j], W (wi wi+1 . . . wj ) ∈ J }. The next section presents an hardware representation of WFA.

148

M. Giraud and D. Lavenier 3 c

1

a

3 (c

a

2

5

1

(a 0)

;1)

(a 0)

2

5 (b 1)

b

4

c

b c

(b 1) (c

;1)

4

(c

;1)

> 0?

Fig. 1. On the left side, the NFA A1 recognizing the regular language L1 = a (b|c)∗ (ca | bc). On the right side, the WFA A2 over the semiring (Z ∪ {−∞}, max, +). It recognizes the non-regular language L2 = {w ∈ L1 | |w|b > |w|c }

3

Linear Encoding Scheme for WFA

This section gives an overview of encoding schemes for ﬁnite-state machines, then describes the linear encoding scheme for WFA and its properties. 3.1

Encoding Schemes for Finite-State Machines

There are two major schemes to encode a ﬁnite state machine with |Q| states in hardware, according to the representation of its states [6]: – the logarithmic scheme uses a bit vector of size log2 |Q| in binary encoding (natural, Gray, or any encoding tailored to a particular application). For |Q| = 5, one can have the values {000, 001, 011, 101, 111}; – the linear scheme (or one-hot scheme) uses a bit vector of size |Q| where only one bit is set to 1, like in the set {00001, 00010, 00100, 01000, 10000}. Those schemes lead to diﬀerent hardware implementations. The size of the logarithmic scheme merely depends on the logic part and can be reduced with a good numbering scheme. This approach is usual for a conventional serial machine, but is limited to deterministic automata. On the other hand, there can be several states active at the same time in the linear encoding scheme, implementing a NFA in a multi – hot fashion. Sidhu and Prasanna showed that this representation is very eﬀective to scan for a regular expression with an FPGA [5]. As the linear encoding scheme needs as many operators as the number of transitions, one could think that it is limited to implement automata with few transitions. However, it has be shown that, for common automata, the linear scheme is less power-consuming and even smaller than the logarithmic scheme [6]. 3.2

Linear Encoding Scheme for WFA

Sidhu and Prasanna build an NFA from a regular expression describing it [5]. We present here a linear encoding scheme for WFA by giving another point of view: we directly map a given WFA into hardware.

Linear Encoding Scheme for Weighted Finite Automata

149

Let A = (Q, Σ, δ, I, F ) a WFA over a semiring K, as deﬁned in section 2.2. We denote by k and p the number of bits needed to represent respectively the alphabet Σ and the weights in K. Typical values are k = 8 for an ASCII text or k = 5 for amino acid patterns, and bit widths for the weight ranging from p = 1 to p = 16. Principle. The hardware implementation can be viewed as a shift register, in which a weight with p bits is moving. For each state q, there will be a p-bit register. We call eqj its value at the clock cycle j. – Each transition set from a state q to a state q is materialized with an evaluator (left part of Fig. 2). It receives k bits (current character wj ) and generates the weight δ(q , wj , q). In the general case, this evaluator will be a k → p function (k binary inputs, p binary outputs). – This weight is aggregated with the weight at the previous state, giving the value sjq ,q = eqj−1 ⊗ δ(q , wj , q). – Each state is a register driven by the ⊕-sum of all the values at the outputs of its incoming transitions sjq ,q (right part of Fig. 2). At the following clock cycle, this ⊕-sum eqj will be given as input for other transitions. The initialization phase of the automaton, not showed here, consists in setting all states to ¯ 0 except the initial states which are set to ¯1. Those initial states ¯ always receive an additional incoming transition whose k weight is kept to 1. The surface area needed by the WFA is here O 2 pt , where t is the number of pairs (q , q) having a non-void transition δ(q , α, q) for some α. q

0

ej ;1

+

(q wj q ) 0

q

sj

0

q q

0

sj 2

q

0

q1 q

sj

0

q3 q

evaluator

sj wj

0

(if q

q

max

ej

q

ej ;1

2 I)

Fig. 2. Principle of linear encoding scheme for WFA over a ﬁnite subset of (Z ∪ {−∞}, max, +). Here the identity elements are ¯ 0 = −∞ and ¯ 1 = 0. The p bits representing the weight are a compound of p − 1 bits representing a two’s complement integer, and 1 bit representing −∞ (for the initialization, inexistent transitions, and overﬂows).As we consider only a ﬁnite subset of the semiring, one must ensure that the overﬂows are correctly handled. The overﬂow at −∞ can be neglected, as it represents a weight which is very unlikely to participate to a ﬁnal maximum. The overﬂow at +∞ is detected at the output and gives a hit in the recognition. If there are cycles in the automaton, a reset of the whole automaton must follow the overﬂow at +∞


[Figure 3: circuit diagram with states 0 to 5, evaluators on the characters a, b and c of $w_j$, + and max operators, and an output $E_j$ tested against 0.]

Fig. 3. Linear encoding scheme for the WFA A2

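Values of States. The previous descriptions can be summarized as:

$$
\begin{cases}
e_0^q = \begin{cases} \bar{1} & \text{if } q \in I, \\ \bar{0} & \text{if } q \notin I, \end{cases} \\[4pt]
s_j^{q',q} = e_{j-1}^{q'} \otimes \delta(q', w_j, q), \\[4pt]
e_j^q = \begin{cases} \bar{1} \oplus \left( \bigoplus_{q' \in Q} s_j^{q',q} \right) & \text{if } q \in I, \\ \bigoplus_{q' \in Q} s_j^{q',q} & \text{if } q \notin I. \end{cases}
\end{cases}
$$

With this equation set, the following lemma holds:

Lemma. If q is a state and j an integer, one has

$$
e_j^q = \bigoplus_{i=0}^{j} \ \bigoplus_{q_i \in I,\ q_j = q}^{q_{i+1}, \ldots, q_{j-1} \in Q} \ \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}).
$$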
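The equation set above can be simulated cycle by cycle in software. The sketch below assumes the (max, +) semiring, with Python's `float("-inf")` playing the role of $\bar{0}$ and 0 the role of $\bar{1}$. The dictionary-based transition table `DELTA`, the function names and the example automaton are illustrative assumptions, not the authors' implementation.

```python
NEG_INF = float("-inf")  # identity of max (the semiring zero); 0 is the semiring one


def step(e_prev, ch, states, initial, delta):
    """One clock cycle: compute every register e_j^q from the e_{j-1}^{q'}."""
    e_next = {}
    for q in states:
        # An initial state always receives the extra transition of weight one.
        acc = 0 if q in initial else NEG_INF
        for q_src in states:
            w = delta.get((q_src, ch, q), NEG_INF)   # missing transition = zero
            acc = max(acc, e_prev[q_src] + w)        # s_j^{q',q}, then the max-sum
        e_next[q] = acc
    return e_next


def run(word, states, initial, final, delta):
    """Return the list E_1, ..., E_n of max-sums over the final states."""
    e = {q: (0 if q in initial else NEG_INF) for q in states}   # cycle j = 0
    outputs = []
    for ch in word:
        e = step(e, ch, states, initial, delta)
        outputs.append(max(e[q] for q in final))
    return outputs


# Hypothetical 3-state automaton scoring the pattern "ab":
# +2 for a matching character, -1 for a substitution.
STATES, INITIAL, FINAL = [0, 1, 2], {0}, {2}
DELTA = {(0, "a", 1): 2, (0, "b", 1): -1, (1, "b", 2): 2, (1, "a", 2): -1}
SCORES = run("aab", STATES, INITIAL, FINAL, DELTA)
```

On the text "aab", the final-state output is −∞ after the first character, 1 after "aa" (one match and one substitution) and 4 after "ab" has been read in full. Unlike the hardware, this loop spends time proportional to the number of state pairs on every character.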

Corollary. The ⊕-sum of all the weights at the final states is $E_j = \bigoplus_{q \in F} e_j^q = \bigoplus_{i=0}^{j} W(w_i w_{i+1} \ldots w_j)$.

The proof of the lemma, which relies on the right-distributivity of ⊗, is given in Appendix A. The corollary says that the final value $E_j$ is the ⊕-sum of the weights of all the words $w_i \ldots w_j$. Thus, if one could deduce from $E_j$ whether there is an i such that $W(w_i w_{i+1} \ldots w_j)$ is in J, one would know whether a word $w_i \ldots w_j$ has been recognized simply by checking whether $E_j$ is in J. To this end, we say that J is a good recognizing set if it has the two following properties:

– ∀a ∈ J, ∀b ∈ K, a ⊕ b ∈ J,
– ∀a ∈ K, ∀b ∈ K, a ⊕ b ∈ J ⟹ a ∈ J or b ∈ J.

With this definition, a direct consequence of the above corollary is:

Theorem. If J is a good recognizing set, then $E_j \in J \iff j \in \mathrm{Pos}(L, w)$.

Therefore, if the hypothesis of the theorem holds, the continuous pattern matching problem is solved by parsing one character on every clock cycle and


by observing the value at the final states. In fact, the clock cycle time is in $O(p \log d_{\max})$, where $d_{\max}$ is the maximum incoming degree of the states, but this is not a limitation for usual WFA.

– In the case of NFA (boolean semiring ({T, F}, ∨, ∧)), only one bit is needed to represent the weight: we fall back on the one-hot scheme. The subset J = {T} is a good recognizing set. Each k → 1 evaluator is reduced to a comparator ($w_i \in A$) for some subset A ⊂ Σ, the ⊗ is an AND gate and the ⊕ an OR gate.
– In the semiring (Z, max, +), the only good recognizing sets are those of the form $J = \{x \in Z,\ x \geq x_0\}$ for some $x_0$. Those sets fit perfectly the applications of WFA where the weight is a score compared to a threshold to decide whether a sequence was recognized.
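The two properties defining a good recognizing set can be checked exhaustively on a small finite carrier. The sketch below does so for max over a hypothetical range of integers and for the boolean OR; the helper name `is_good_recognizing_set` and the chosen ranges are assumptions made for illustration.

```python
from itertools import product


def is_good_recognizing_set(J, K, oplus):
    """Check both properties of a good recognizing set J within the finite carrier K."""
    J = set(J)
    # Property 1: the oplus-sum of an element of J with any element of K stays in J.
    absorbing = all(oplus(a, b) in J for a in J for b in K)
    # Property 2: an oplus-sum lies in J only if one of its operands does.
    prime = all((a in J or b in J)
                for a, b in product(K, repeat=2) if oplus(a, b) in J)
    return absorbing and prime


K = range(-5, 6)                               # hypothetical finite subset of Z
THRESHOLD_SET = [x for x in K if x >= 2]       # J = {x >= x0} with x0 = 2
INTERVAL_SET = [x for x in K if 0 <= x <= 2]   # bounded above: not of that form
```

Under these assumptions, the threshold set passes both checks, whereas the bounded interval fails the first property (for instance, max(1, 5) = 5 falls outside it). The same function accepts J = {True} for the boolean semiring when called with `lambda a, b: a or b`.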

4 Performance Evaluation

This section evaluates the performance of a real implementation of the linear encoding scheme described in Section 3.2. Here the WFA are over a finite subset of (Z, max, +). We begin by describing the context of use. As we use a low-cost FPGA chip and as the main constraint is size, we need to know precisely the surface area taken by the WFA; this is done in Section 4.2. In Section 4.3, we compare the speed achieved against software techniques.

4.1 Context of Use

FPGAs. Field Programmable Gate Arrays (FPGAs) are reconfigurable chips composed of a matrix of interconnected logic cells [7]. The logic inside each cell, as well as the interconnections, can be configured in a few milliseconds, yielding a custom chip. The cost of such solutions is orders of magnitude below that of full-custom ASIC (Application Specific Integrated Circuit) chips.

Prototype Board. Our prototype board, which is part of the R-disk system [8], is devoted to filtering large genomic databases on the fly. The board contains a hard disk and a low-cost FPGA that directly filters data from the disk. The total cost of the components is less than $200. The FPGA is a Spartan-II from Xilinx. It contains 1176 configurable logic blocks (CLB), each one having 4 look-up tables (LUT) of 16 bits. The LUTs can realize any 4 → 1 boolean function. Almost two thirds of the FPGA is devoted to the filter, that is, a little more than 3000 LUTs. It operates at a clock frequency of 40 MHz.

4.2 Implementing the Linear Encoding Scheme on FPGAs

FPGA devices are well suited for the linear encoding scheme because of the high number of available registers and the local propagation of data without global control. Furthermore, the computation of transition weights ﬁts perfectly into LUTs with 4 inputs.
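To make the fit with 4-input LUTs concrete, the sketch below models a LUT as a 16-bit truth table and builds a 5 → 1 comparator from two such tables, selected by the fifth input bit (a Shannon decomposition). The 5-bit character codes, the helper names and the selection step outside the LUTs are assumptions for illustration only; they do not describe the actual mapping produced by the synthesis tools.

```python
def lut4(table, x):
    """Evaluate a 4-input LUT: `table` is a 16-bit integer, `x` a 4-bit input."""
    return (table >> (x & 0xF)) & 1


def make_evaluator5(accepted):
    """Build a 5 -> 1 comparator (w in A) as two 16-bit LUT tables."""
    low = sum(1 << c for c in accepted if c < 16)            # fifth bit = 0
    high = sum(1 << (c - 16) for c in accepted if c >= 16)   # fifth bit = 1
    return low, high


def evaluate5(tables, w):
    """Select one of the two LUT outputs with the fifth bit of the character w."""
    low, high = tables
    return lut4(high, w) if (w >> 4) & 1 else lut4(low, w)


TABLES = make_evaluator5({3, 17, 20})   # hypothetical 5-bit codes of the subset A
```

Each table holds exactly the 16 bits of one LUT, so a 5 → 1 evaluator built this way uses two LUTs; a 5 → p evaluator would repeat the construction once per output bit.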

Automaton type | Transitions logic | Weight operators | Total, by transition | Maximum number of transitions
NFA | 5 → 1, 2 LUTs | AND / OR (1 bit), ≤ 1 LUT | ≤ 3 LUTs | ≥ 1000
WFA, Z | 5 → p, 2p LUTs | max / + (p bits), ≤ 3p LUTs | ≤ 5p LUTs | ≥ 600/p

Fig. 4. Upper bound for the number of LUTs when k = 5. The last column shows the maximum number of transitions for a Spartan-II FPGA with 3000 LUTs

The regularity of the architecture makes it relatively easy to program. Our implementation, written in OCaml, translates abstract WFA descriptions into their representation in the hardware description language VHDL. One of the main issues with WFA is that their topology may change for each query. Design techniques with JBits [9] would allow a fast compilation of arbitrary WFA shapes, but they would need a custom place-and-route algorithm. The current, slower solution is to perform a full compilation from VHDL for each query; the compilation overhead (4-5 minutes) is small compared to the performance gain when scanning large databases.

For the scanning of protein databases (5-bit alphabet), an automaton with q states and t transitions with a p-bit weight takes a surface area of 3pt + 2p(t − q) LUTs before compiler optimizations. The total area taken is thus less than 5pt LUTs, so that WFA with 75 transitions and an 8-bit weight can be encoded within the 3000 available LUTs.

To verify this bound, real FPGA experiments were done using the standard Xilinx framework. We ran our method on two benchmark sets. The first one consists of random WFA, and results show that the real limit is beyond 75 transitions (left part of Fig. 5). The other benchmark set is the PROSITE protein pattern bank [10], which contains about 1300 patterns that we translate into WFA to allow substitution errors. More than 98% of the PROSITE bank can be implemented in the FPGA.
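The area estimate can be turned into a small calculator. The sketch below simply evaluates the formulas quoted in the text for a 5-bit alphabet; the function names and the default budget of 3000 LUTs (the share of the Spartan-II devoted to the filter) are the only assumptions.

```python
def lut_area(p, t, q):
    """LUT count before compiler optimizations: 3pt + 2p(t - q)."""
    return 3 * p * t + 2 * p * (t - q)


def lut_upper_bound(p, t):
    """Upper bound 5pt on the LUT count."""
    return 5 * p * t


def max_transitions(p, lut_budget=3000):
    """Largest number of transitions whose 5pt bound fits within the LUT budget."""
    return lut_budget // (5 * p)
```

For an 8-bit weight, `max_transitions(8)` gives 75, and a hypothetical automaton with 75 transitions and 40 states needs `lut_area(8, 75, 40)` = 2360 LUTs, below the bound of 3000.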

[Figure 5. Left panel: LUT count (0 to 3500) versus number of transitions (0 to 120), with reference curves for 3000 LUTs, 5pt, 3pt and 2pt. Right panel: processing bandwidth in MB/s (0 to 25) versus number of transitions (0 to 120), for agrep on a PC (4 errors), WFA simulation on a PC, and one prototype board (R-disk).]

Fig. 5. Experimental results for the linear encoding scheme. The left part shows the LUT count for diﬀerent WFA sizes. The right part compares the bandwidth processed by one prototype board with an FPGA against software solutions on a PC

4.3 Performance Comparison

Sidhu and Prasanna [5] showed that their FPGA realization is more effective than software such as agrep when the data is large enough. Their conclusions carry over to WFA, whose simulation requires even more operations (additions, maximums). We compared our approach with software techniques using WFA. The low-cost Spartan-II is compared against a 2 GHz Pentium IV with 728 MB of RAM. This comparison is fair since the Spartan-II was released in 2000 and the 2 GHz Pentium IV in 2001. Results are shown in the right part of Fig. 5.

The comparison with agrep [11] is for reference only, as this software only parses regular expressions or weighted expressions with a fixed score (with at most 4 substitution errors). When patterns are small and error-free, data can be parsed through agrep at the disk rate, but the throughput drops with errors and with larger patterns.

More interesting is the comparison against a software simulation of WFA, such as the algorithm described by Eramian [4], which parses data in O(nt) time. Data rates go from 10 MB/s for small WFA down to less than 1 MB/s for WFA with more than 30 states. By contrast, our WFA implementation on the FPGA parses data at a constant bandwidth (currently 15 MB/s), as long as the WFA fits into the available surface area of the FPGA. This bandwidth amounts to less than one amino acid (5 bits) per cycle of the 40 MHz FPGA clock, which the design sustains since it parses one character on every clock cycle.

Experiments were done on real data (80-transition WFA, 34 GB canine DNA database). The scan takes more than 20 hours on a 2 GHz Pentium. On a single prototype R-disk board, it takes less than 45 minutes (5 minutes for compiling and 40 minutes for parsing).
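The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and one byte per stored character; neither assumption is stated in the text, and the variable names are illustrative.

```python
CLOCK_HZ = 40e6     # FPGA clock frequency
BANDWIDTH = 15e6    # bytes per second; one byte per character is an assumption
DATABASE = 34e9     # 34 GB, taken as decimal gigabytes (assumption)

# Characters consumed per clock cycle at the measured bandwidth.
chars_per_cycle = BANDWIDTH / CLOCK_HZ

# Time to stream the whole database at that bandwidth, in minutes.
scan_minutes = DATABASE / BANDWIDTH / 60

# "More than 20 hours" against "less than 45 minutes" gives a lower bound.
speedup_lower_bound = (20 * 60) / 45
```

Under these assumptions the board consumes 0.375 character per cycle, the scan takes about 38 minutes (consistent with the reported 40 minutes of parsing), and the overall speed-up, compilation included, is at least about 26.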

5 Conclusion

Weighted finite automata can be effectively hardwired on FPGAs with the linear encoding scheme. This encoding is well suited to standard FPGA devices and provides a significant speed-up over software implementations. To our knowledge, this is the first hardware realization of WFA.

The main current limitation of the linear encoding scheme is the size of the targeted WFA. Currently, we can implement WFA with an 8-bit weight and more than 75 transitions. This limit is already being pushed back by the next generation of FPGAs: in 2004, Xilinx sells low-cost Spartan-3 FPGAs with more than 18,000 CLB, that is, 15 times larger than the chip we use in our prototype board. The transition limit rises accordingly. If a higher number of transitions is available, one could distribute them among several automata, especially when one needs to parse nucleic banks for protein patterns through six reading frames. More generally, the speed-up obtained by such a spatial implementation [12] against software techniques will continue to increase, as it is easier to exploit more resources in a reconfigurable device than in a sequential CPU.


References

1. Culik II, K., Kari, J.: Image Compression Using Weighted Finite Automata. In: Mathematical Foundations of Computer Science (MFCS 93). Volume 711 of Lecture Notes in Computer Science. (1993) 392–402
2. Mohri, M., Pereira, F., Riley, M.: Weighted Automata in Text and Speech Processing. In Kornai, A., ed.: Extended Finite State Models of Language (ECAI 96). (1996) 46–50
3. Buchsbaum, A.L., Giancarlo, R., Westbrook, J.R.: On the Determinization of Weighted Finite Automata. SIAM Journal on Computing 30 (2001) 1502–1531
4. Eramian, M.G.: Efficient Simulation of Nondeterministic Weighted Finite Automata. In: Fourth Workshop on Descriptional Complexity of Formal Systems (DCFS 02). (2002)
5. Sidhu, R., Prasanna, V.K.: Fast Regular Expression Matching using FPGAs. In: IEEE Symposium on Field Programmable Custom Computing Machines (FCCM 01). (2001)
6. Dunoyer, J., Pétrot, F., Jacomme, L.: Stratégies de codage des automates pour des applications basse consommation : expérimentation et interprétation. In: Journées d'étude Faible Tension et Faible Consommation (FTFC 97). (1997)
7. Sanchez, E.: Field Programmable Gate Array (FPGA) Circuits. Lecture Notes in Computer Science (1996) 1–18
8. Lavenier, D., Guyetant, S., Derrien, S., Rubini, S.: A reconfigurable parallel disk system for filtering genomic banks. In: Proc. Int. Conf. ERSA'03. (2003)
9. Guccione, S., Levi, D., Sundararajan, P.: JBits: A Java-based Interface for Reconfigurable Computing. In: 2nd Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD). (1999)
10. Bucher, P., Bairoch, A.: A Generalized Profile Syntax for Biomolecular Sequence Motifs and its Function in Automatic Sequence Interpretation. In: Intelligent Systems for Molecular Biology (ISMB 94). (1994) 53–61
11. Wu, S., Manber, U.: Fast Text Searching Allowing Errors. Communications of the ACM 35 (1992) 83–91
12. DeHon, A.: Very Large Scale Spatial Computing. In: Third International Conference on Unconventional Models of Computation (UMC 02). (2002) 27–37

Appendix A

Proof of Lemma. Here we prove by induction on j the following property:

$$
e_j^q = \bigoplus_{i=0}^{j} \ \bigoplus_{q_i \in I,\ q_j = q}^{q_{i+1}, \ldots, q_{j-1} \in Q} \ \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}).
$$

At the cycle j = 0, the property is $e_0^q = \bigoplus_{q_0 \in I,\ q_0 = q} \bar{1}$, that is, $e_0^q$ equals $\bar{1}$ if q ∈ I and $\bar{0}$ if q ∉ I: the property is true. Assume that the property is true up to the cycle j − 1, with j ≥ 1. Let q be a non-initial state. We compute the value $e_j^q$ of the state q at the cycle j.


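$$
\begin{aligned}
e_j^q &= \bigoplus_{q' \in Q} s_j^{q',q} \\
&= \bigoplus_{q' \in Q} e_{j-1}^{q'} \otimes \delta(q', w_j, q) \\
&= \bigoplus_{q' \in Q} \left[ \bigoplus_{i=0}^{j-1} \ \bigoplus_{q_i \in I,\ q_{j-1} = q'}^{q_{i+1}, \ldots, q_{j-2} \in Q} \ \bigotimes_{t=i}^{j-2} \delta(q_t, w_{t+1}, q_{t+1}) \right] \otimes \delta(q', w_j, q) && \text{(hypothesis of induction)} \\
&= \bigoplus_{q' \in Q} \ \bigoplus_{i=0}^{j-1} \ \bigoplus_{q_i \in I,\ q_{j-1} = q'}^{q_{i+1}, \ldots, q_{j-2} \in Q} \left[ \bigotimes_{t=i}^{j-2} \delta(q_t, w_{t+1}, q_{t+1}) \right] \otimes \delta(q', w_j, q) && \text{(right-distributivity of } \otimes \text{)} \\
&= \bigoplus_{q' \in Q} \ \bigoplus_{i=0}^{j-1} \ \bigoplus_{q_i \in I,\ q_{j-1} = q',\ q_j = q}^{q_{i+1}, \ldots, q_{j-1} \in Q} \ \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}) \\
&= \bigoplus_{q' \in Q} \ \bigoplus_{i=0}^{j} \ \bigoplus_{q_i \in I,\ q_{j-1} = q',\ q_j = q}^{q_{i+1}, \ldots, q_{j-1} \in Q} \ \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1}) && \text{(because } q \text{ is not initial)} \\
&= \bigoplus_{i=0}^{j} \ \bigoplus_{q_i \in I,\ q_j = q}^{q_{i+1}, \ldots, q_{j-1} \in Q} \ \bigotimes_{t=i}^{j-1} \delta(q_t, w_{t+1}, q_{t+1})
\end{aligned}
$$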

Thus the property is true at the cycle j. If q is initial, the same result is obtained by a similar computation, adding a $\bar{1}$ to each term. By induction, the property is true for every cycle j ≥ 0.
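As a sanity check of the lemma, the sketch below compares the register values produced by the recurrence with a brute-force ⊕-sum over all paths, on small random automata over (max, +). The random automata, the function names and the seeds are assumptions; the check is a numerical illustration, not a substitute for the proof.

```python
import random
from itertools import product

NEG_INF = float("-inf")   # semiring zero of (max, +); the semiring one is 0


def registers(word, states, initial, delta):
    """Recurrence on e_j^q: returns the registers after reading `word`."""
    e = {q: (0 if q in initial else NEG_INF) for q in states}
    for ch in word:
        e = {q: max([0 if q in initial else NEG_INF] +
                    [e[s] + delta.get((s, ch, q), NEG_INF) for s in states])
             for q in states}
    return e


def brute_force(word, q, states, initial, delta):
    """Right-hand side of the lemma: max over start cycles i and over paths."""
    j, best = len(word), NEG_INF
    for i in range(j + 1):
        for path in product(states, repeat=j - i + 1):   # q_i, ..., q_j
            if path[0] not in initial or path[-1] != q:
                continue
            # word[i + t] is w_{i+t+1} in the 1-indexed notation of the text.
            weight = sum(delta.get((path[t], word[i + t], path[t + 1]), NEG_INF)
                         for t in range(j - i))
            best = max(best, weight)
    return best


def lemma_holds(seed=0, n_states=3, length=4):
    """Draw a random automaton and word, then compare both sides for every state."""
    rng = random.Random(seed)
    states, alphabet = list(range(n_states)), "ab"
    initial = {0}
    delta = {(s, c, d): rng.randint(-3, 3)
             for s in states for c in alphabet for d in states
             if rng.random() < 0.6}
    word = [rng.choice(alphabet) for _ in range(length)]
    e = registers(word, states, initial, delta)
    return all(e[q] == brute_force(word, q, states, initial, delta) for q in states)
```

Calling `lemma_holds(seed=s)` for a range of seeds exercises both initial and non-initial states, including the empty path (i = j) that contributes $\bar{1}$ at initial states.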