Determinizing Alternating Tree Automata, and Models

Jean Goubault-Larrecq

LSV/CNRS UMR 8643 & INRIA Futurs projet SECSI & ENS Cachan
61, av. du président-Wilson, 94235 Cachan Cedex, France

Abstract

While alternating or non-deterministic tree automata are syntax (e.g., clause sets), complete deterministic automata are best seen as semantics (Σ-algebras). In this sense, the powerset construction and the filtration construction are finite model constructions. We show that they are both instances of a more general construction. This leads to a determinization procedure that determinizes and minimizes partially at the same time.

Key words: Tree automata, determinization, powerset construction, structure, model, clause, first-order logic, simulation

1 Introduction

It has been observed in the past that complete deterministic tree automata are just Σ-algebras, that is, first-order models [3]. We explain this gently in Section 2. Profiting from model-theoretic intuitions, we define a general determinization scheme based on so-called naming functions and inclusion oracles (Section 3). As an application, we design an efficient determinization procedure for non-deterministic and alternating tree automata, which does some minimizing on the fly (Section 4).

Partially supported by the RNTL project Prouvé, the ACI SI Rossignol, and the ACI jeunes chercheurs “Sécurité informatique, proto. crypto. et détection d’intrusions”.
Fax: +33-1 47 40 75 21. Email address: [email protected] (Jean Goubault-Larrecq). URL: www.lsv.ens-cachan.fr/˜goubault/ (Jean Goubault-Larrecq).

Preprint submitted to Elsevier Science, 13 January 2006

2 Tree Automata

We fix a first-order signature Σ. Terms are denoted s, t, u, v, . . . , predicate symbols P, Q, . . . , variables X, Y, Z, . . . Let P be the set of predicate symbols. Horn clauses are of the form H ⇐ B where the head H is either an atom or ⊥, and the body B is a finite set A1, . . . , An of atoms.

For each satisfiable set S of Horn clauses, and each predicate symbol P, let LP(S) be the set of ground terms t such that P(t) is in the least Herbrand model of S. LP(S) is the language recognized at state P. When S consists only of tree automaton clauses P(f(X1, . . . , Xn)) ⇐ P1(X1), . . . , Pn(Xn) (X1, . . . , Xn pairwise distinct), this coincides with the usual definition of the set of terms recognized at P; such clauses are just tree automaton transitions from P1, . . . , Pn to P. Accordingly, we shall call below a set of tree automaton clauses a (tree) automaton.

This connection between tree automata and Horn clauses was pioneered by Frühwirth et al. [2]; there, LP(S) is called the success set for P. This connection was then used in a number of papers: see the comprehensive textbook [1], in particular Section 7.6 on tree automata as sets of Horn clauses. Tree automaton clauses can be generalized right away (see [1, Chapter 7] for alternating tree automata):

Definition 1 (Alternating tree automata) Call a block any finite set of atoms of the form P1(X), . . . , Pm(X) (with the same X, and m ≥ 0); it is non-empty iff m ≥ 1. We abbreviate blocks B(X) to make the variable X explicit. We shall also write B for the set {P1, . . . , Pm}. Alternating automaton clauses are of the form

P(f(X1, . . . , Xk)) ⇐ B1(X1), . . . , Bk(Xk)

(1)

where B1(X1), . . . , Bk(Xk) are possibly empty blocks, and the variables X1, . . . , Xk are pairwise distinct. They are non-deterministic if and only if each block Bi(Xi) contains exactly one predicate. An alternating tree automaton is any set S of alternating automaton clauses. It is non-deterministic if and only if all its clauses are non-deterministic. Non-deterministic automata will just be called automata.

The usual definitions of automata include a set of final states. It is more natural for now to leave the set of final states (predicates) out of the definition in our setting. Deterministic automata are traditionally defined as restrictions of non-deterministic automata. We shall argue that this is artificial, but let us follow the tradition for now:

Definition 2 A non-deterministic automaton S is deterministic (resp. complete) if and only if, for every k-ary function f and all predicate symbols P1, . . . , Pk, there is at most (resp. at least) one predicate symbol P such that S contains

P(f(X1, . . . , Xk)) ⇐ P1(X1), . . . , Pk(Xk)

In other words, a deterministic automaton S defines f as a partial function from P^k to P, where P is the set of predicate symbols. A complete deterministic automaton defines f as a total function. Now a tuple I = (|I|, (If)f∈Σ) with |I| a non-empty set (here |I| = P) and If : |I|^k → |I| for each k-ary f ∈ Σ is exactly a first-order Σ-algebra (a.k.a. a first-order structure, or model). In other words, complete deterministic automata are just Σ-algebras. Up to minor details, this is the definition given by Kozen [3, Definition C.5, p. 111]; Kozen considers only finite Σ-algebras, and says “deterministic” instead of “complete deterministic”.

Given a Σ-algebra I = (|I|, (If)f∈Σ), define as usual the semantics I⟦t⟧ρ of terms t in an environment ρ mapping variables to values in |I| by: I⟦X⟧ρ = ρ(X) and I⟦f(t1, . . . , tn)⟧ρ = If(I⟦t1⟧ρ, . . . , I⟦tn⟧ρ). When t is ground, ρ is irrelevant, and we write I⟦t⟧ for I⟦t⟧ρ. Given a complete deterministic automaton S, S⟦t⟧ is the unique state at which t is recognized in S. That is, for every state P, S⟦t⟧ = P if and only if t ∈ LP(S). (If this looks strange, recall that S is both a clause set and a Σ-algebra.)
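To make the Σ-algebra reading concrete, here is a small Python sketch, not from the paper: the term encoding, the `evaluate` name, and the even/odd example automaton are ours. It evaluates a ground term in a complete deterministic automaton given as a transition table, i.e. computes S⟦t⟧:

```python
from typing import Dict, Tuple

# Ground terms as nested tuples: ("f", t1, ..., tk); a constant is ("a",).
Term = tuple

# A complete deterministic automaton: a total table mapping
# (f, (P1, ..., Pk)) to the unique head P of a clause
# P(f(X1,...,Xk)) <= P1(X1), ..., Pk(Xk).
Table = Dict[Tuple[str, Tuple[str, ...]], str]

def evaluate(delta: Table, t: Term) -> str:
    """Compute S[[t]], the unique state at which the ground term t is recognized."""
    f, args = t[0], t[1:]
    return delta[(f, tuple(evaluate(delta, u) for u in args))]

# Example: two states tracking the parity of the number of s's.
delta = {
    ("z", ()): "Even",
    ("s", ("Even",)): "Odd",
    ("s", ("Odd",)): "Even",
}
print(evaluate(delta, ("s", ("s", ("z",)))))  # prints Even
```

Totality of the table is exactly completeness; that it is a function (one output state per input tuple) is exactly determinism.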

3 Determinization

It is always possible to convert any non-deterministic, and even any alternating tree automaton S, with set F of final states, into an equivalent deterministic tree automaton S′, F′. Here “equivalent” means that the terms recognized at some state in F by S are exactly the terms recognized at some state in F′ by S′. This can be achieved in deterministic exponential time [1, Theorem 54, Section 7.4]. Writing the above slightly more formally, we arrive at the traditional definition of determinization, for alternating tree automata: Given an alternating tree automaton S and a set of final states F, a determinization of S, F is a deterministic automaton S′ on a new set of states Nme and a subset F′ ⊆ Nme such that ⋃P′∈F′ LP′(S′) = ⋃P∈F LP(S). In terms of Σ-algebras, this translates to:

Definition 3 A complete determinization of S, F is a Σ-algebra I and a set F′ ⊆ |I| of values such that for every ground term t, I⟦t⟧ ∈ F′ if and only if t ∈ ⋃P∈F LP(S).

A first complete determinization in the sense above is the Herbrand Σ-algebra IH: |IH| is the set of all ground terms, and IH f(t1, . . . , tn) is the term f(t1, . . . , tn). According to Definition 3, S′ = IH with F′ = ⋃P∈F LP(S) is a complete determinization of S, F. (We have not restricted our models to be finite yet!)

Not unexpectedly, the next complete determinization, IP, is given by the powerset construction. This one is finite, as will be all subsequent constructions. Its states are the sets of predicates in P, and its transitions are given as one should expect. Formally, the carrier |IP| is P(P), and for each k-ary function f, for every k-tuple of values E1, . . . , Ek ∈ P(P),

IP f(E1, . . . , Ek) = {P ∈ P | ∃(P(f(X1, . . . , Xk)) ⇐ B1(X1), . . . , Bk(Xk)) ∈ S · B1 ⊆ E1 ∧ . . . ∧ Bk ⊆ Ek}   (2)

Recall that both blocks Bi and values Ei are sets of predicate symbols. An easy induction on t shows that for every ground term t, t ∈ LP(S) if and only if P ∈ IP⟦t⟧. So IP, FP′ = {v ∈ P(P) | v ∩ F ≠ ∅} is a complete determinization of S, F.

A tighter determinization, in the sense that it produces deterministic automata with fewer states, is the filtration construction Ifil. Let |Ifil| be the set of all languages LE(S) = ⋂P∈E LP(S), E ∈ P(P) (note that there are only finitely many, at most as many as there are subsets of P), and

Ifil f(L1, . . . , Lk) = LE(S) where E = {P ∈ P | ∃(P(f(X1, . . . , Xk)) ⇐ B1(X1), . . . , Bk(Xk)) ∈ S · L1 ⊆ LB1(S) ∧ . . . ∧ Lk ⊆ LBk(S)}

Here, t ∈ LP(S) if and only if Ifil⟦t⟧ ⊆ LP(S). So Ifil, Ffil′ = ⋃P∈F {LE(S) | E ∈ P(P), LE(S) ⊆ LP(S)} is a complete determinization of S, F. Ifil is isomorphic to the quotient of IP by the congruence ≡S defined by E ≡S E′ if and only if LE(S) = LE′(S). This is similar to, but not quite the same as, the Myhill-Nerode congruence; see Kozen [3, Theorem D.3, p. 117].

The above two constructions can be generalized once we realize that they really rely only on two devices:

• A naming function N mapping each E ∈ P(P) to some representative N(E) in some set Nme of state names. In the powerset construction, each E is its own name, and N(E) = E. In the filtration construction, the name N(E) of E is the language LE(S), or, equivalently, the equivalence class of E modulo ≡S.

• An inclusion oracle O : Nme × Nme → {true, false}. That O(N(E), N(E′)) returns true is seen as a sufficient condition for LE(S) to be included in LE′(S). In the filtration construction, O(L, L′) returns true if and only if L ⊆ L′, without any further ado. In the powerset construction, O(E, E′) returns true if and only if E ⊇ E′ (note that this indeed implies LE(S) ⊆ LE′(S)).
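As an illustration, here is a hedged Python sketch of the powerset transition function (2) and the reachable-state exploration it induces; the clause encoding `(P, f, blocks)` and all function names are ours, not the paper's:

```python
from itertools import product

# A clause P(f(X1,...,Xk)) <= B1(X1),...,Bk(Xk) is encoded as
# (P, f, (B1, ..., Bk)), each block Bi being a frozenset of predicates.
clauses = [
    ("P", "a", ()),                   # P(a) <=
    ("Q", "f", (frozenset({"P"}),)),  # Q(f(X)) <= P(X)
]

def powerset_step(clauses, f, args):
    """Equation (2): I_P f(E1,...,Ek) collects the heads P of clauses
    for f whose blocks Bi are contained in the argument sets Ei."""
    return frozenset(
        P for (P, g, blocks) in clauses
        if g == f and len(blocks) == len(args)
        and all(B <= E for B, E in zip(blocks, args))
    )

def determinize_powerset(clauses, arities):
    """Build only the reachable subset-states; arities maps each f to its k."""
    states, table = set(), {}
    changed = True
    while changed:
        changed = False
        for f, k in arities.items():
            for args in product(sorted(states, key=sorted), repeat=k):
                if (f, args) in table:
                    continue
                E = powerset_step(clauses, f, args)
                table[(f, args)] = E
                states.add(E)
                changed = True
    return states, table

states, table = determinize_powerset(clauses, {"a": 0, "f": 1})
```

Exploring only reachable subset-states, as `determinize_powerset` does, is the incremental construction alluded to at the end of Section 4.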

The precise conditions that we require on inclusion oracles are given below. Equation (3) is our requirement on language inclusion above. Equation (4) forces the oracle to answer true at least in some (obvious) situations.

Definition 4 An inclusion oracle relative to a naming function N : P(P) → Nme is any function O : Nme × Nme → {true, false} such that:

if O(N(E), N(E′)) = true then LE(S) ⊆ LE′(S)   (3)
if E ⊇ E′ then O(N(E), N(E′)) = true   (4)

Then, the O, N-construction IO,N is given by: |IO,N| = Nme, and

IO,N f(n1, . . . , nk) = N(E), where E = {P ∈ P | ∃(P(f(X1, . . . , Xk)) ⇐ B1(X1), . . . , Bk(Xk)) ∈ S · O(n1, N(B1)) = true ∧ . . . ∧ O(nk, N(Bk)) = true}   (5)

One may also observe that conditions (3) and (4) express that O(N(E), N(E′)) = true is a condition that is intermediate in strength between the powerset condition E ⊇ E′ and the filtration condition LE(S) ⊆ LE′(S). We require no other condition on O; in particular we do not require O to be transitive. However, (4) implies that O is reflexive on the range of N, i.e., O(N(E), N(E)) = true. In turn, this and (3) imply that if N(E) = N(E′) then LE(S) = LE′(S), so E ≡S E′. I.e., the equivalence relation defined by N, whereby E and E′ are equivalent if and only if N(E) = N(E′), must be finer than ≡S. We now show that the O, N-construction always yields a complete determinization.

Proposition 5 Assume that (3) and (4) hold. For every ground term t and every E′ ∈ P(P), t ∈ LE′(S) if and only if O(IO,N⟦t⟧, N(E′)) = true. Let FO,N′ = ⋃P∈F {n ∈ Nme | O(n, N({P})) = true}. Then IO,N, FO,N′ is a complete determinization of S, F.

PROOF. Only if: by structural induction on t = f(t1, . . . , tk). Let E′ = {P1, . . . , Pm}. If t ∈ LE′(S), then for every j, 1 ≤ j ≤ m, there is a clause Pj(f(X1, . . . , Xk)) ⇐ Bj1(X1), . . . , Bjk(Xk) in S such that ti ∈ LBji(S) for every i, 1 ≤ i ≤ k. By induction hypothesis: (a) O(IO,N⟦ti⟧, N(Bji)) = true for every i. By (5), IO,N⟦t⟧ = IO,N f(IO,N⟦t1⟧, . . . , IO,N⟦tk⟧) = N(E), where E contains every Pj, 1 ≤ j ≤ m, by (a). So E ⊇ E′, therefore by (4) O(IO,N⟦t⟧, N(E′)) = true.

If: by structural induction on t = f(t1, . . . , tk). If O(IO,N⟦t⟧, N(E′)) = true, then by (3) LE(S) ⊆ LE′(S), where E is the set of all predicates Pj, 1 ≤ j ≤ m,

such that there are clauses Pj(f(X1, . . . , Xk)) ⇐ Bj1(X1), . . . , Bjk(Xk) in S such that O(IO,N⟦ti⟧, N(Bji)) = true for every i, 1 ≤ i ≤ k. By induction hypothesis ti ∈ LBji(S), so by using the clause above t ∈ LPj(S), 1 ≤ j ≤ m. So t ∈ LE(S). Since LE(S) ⊆ LE′(S), t ∈ LE′(S).

Let us show the second part of the Proposition. For every ground term t, IO,N⟦t⟧ ∈ FO,N′ if and only if O(IO,N⟦t⟧, N({P})) = true for some P ∈ F, if and only if t ∈ LP(S) for some P ∈ F by the first part of the Proposition. This completes the proof. □
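The generic construction (5) is the powerset computation with the subset tests replaced by oracle calls. A sketch (parameter names ours), of which the powerset construction is the instance where N is the identity and O is reverse inclusion:

```python
def on_step(clauses, N, O, f, names):
    """Equation (5): I_{O,N} f(n1,...,nk) = N(E), where E collects the
    heads of clauses for f whose blocks pass the oracle test."""
    E = frozenset(
        P for (P, g, blocks) in clauses
        if g == f and len(blocks) == len(names)
        and all(O(n, N(B)) for n, B in zip(names, blocks))
    )
    return N(E)

# Powerset instance: every set is its own name, the oracle is reverse inclusion.
identity = lambda E: E
superset = lambda n, m: n >= m

clauses = [("P", "a", ()), ("Q", "f", (frozenset({"P"}),))]
print(on_step(clauses, identity, superset, "f", (frozenset({"P"}),)))
# prints frozenset({'Q'})
```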

4 Application

The filtration construction yields a complete deterministic automaton, but its oracle LE(S) ⊆ LE′(S) is costly: language inclusion is DEXPTIME-complete, even when the language on the left is the universal language and the language on the right is given by a non-deterministic automaton [1, Theorem 14, Section 1.7], or when the language on the right is the empty language [1, Theorem 55, Section 7.5]. Practically, deciding LE(S) ⊆ LE′(S) requires one to determinize S, which is our initial goal. On the other hand, the powerset oracle E ⊇ E′ is efficiently decided, but may yield large determinizations. Standard practice consists in relying on the powerset construction, then minimizing, either completely, using Hopcroft's algorithm [1, Algorithm MIN, Section 1.5], or partially, by merging bisimilar states [3].

An application of Proposition 5 is that we can minimize partially while determinizing, at almost no cost. To this end, we use an O, N pair based on efficient sufficient criteria for language non-emptiness and inclusion, which generalize bisimilarity (which is a criterion for language equivalence only). First, let NE(S) be the smallest subset of P such that, for every clause of the form (1) in S, if B1 ⊆ NE(S) and . . . and Bk ⊆ NE(S), then P ∈ NE(S). Clearly, if LP(S) ≠ ∅, then P ∈ NE(S). In fact, if S is a non-deterministic automaton, this yields a decision procedure for non-emptiness: if P ∈ NE(S) then LP(S) ≠ ∅. (This is not so for alternating automata, for which non-emptiness is DEXPTIME-complete [1, Theorem 55, Section 7.5].) NE(S) can be computed in polynomial time by a marking algorithm.

We now turn to language inclusion. Given a binary relation R on a set A, its Smyth extension R♯ is defined by U R♯ V if and only if for every v ∈ V, there is a u ∈ U such that u R v. (This is the extension to general binary relations of the well-known Smyth quasi-ordering.)
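Both devices admit direct implementations; the following Python sketch (the clause encoding `(P, f, blocks)` and all names are ours) computes NE(S) by the marking algorithm, as a least fixpoint, and tests the Smyth extension of a relation given as a set of pairs:

```python
def non_empty_states(clauses):
    """Least fixpoint: mark P as soon as some clause for P has all its
    blocks already marked. Polynomially many passes suffice."""
    ne = set()
    changed = True
    while changed:
        changed = False
        for (P, f, blocks) in clauses:
            if P not in ne and all(B <= ne for B in blocks):
                ne.add(P)
                changed = True
    return ne

def smyth(R, U, V):
    """U R# V: every v in V is R-dominated by some u in U."""
    return all(any((u, v) in R for u in U) for v in V)

# Example: R has no productive clause, so neither R nor Q gets marked.
clauses = [
    ("P", "a", ()),
    ("Q", "f", (frozenset({"P", "R"}),)),  # Q needs R, which stays unmarked
    ("R", "g", (frozenset({"R"}),)),
]
print(non_empty_states(clauses))  # prints {'P'}
```

Note that `smyth(R, U, set())` is vacuously true, matching the definition: the universal quantification over V is empty.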
Definition 6 Given two alternating automata S and S′, a simulation of S by S′ is a binary relation ≼ on P such that, for every P ∈ NE(S), and for any clause

P(f(X1, . . . , Xk)) ⇐ B1(X1), . . . , Bk(Xk)

(6)

then for every P′ ∈ P such that P ≼ P′, there is a clause

P′(f(X1, . . . , Xk)) ⇐ B1′(X1), . . . , Bk′(Xk)

(7)

in S′ such that Bi ≼♯ Bi′ for every i, 1 ≤ i ≤ k.

There is always a largest simulation, which is computable in polynomial time, by a largest fixpoint computation on the set of pairs (P, P′) of predicates. The next two results are probably folklore, at least for non-deterministic automata.

Lemma 7 For any two simulations ≼ of S by S′ and ≼′ of S′ by S″, (≼ ; ≼′), defined by P (≼ ; ≼′) P″ if and only if P ≼ P′ and P′ ≼′ P″ for some P′ ∈ P, is a simulation of S by S″.

PROOF. First, we claim that: (∗) if P ≼ P′, where ≼ is a simulation of S by S′, and P ∈ NE(S), then P′ ∈ NE(S′). This is by structural induction on a proof that P ∈ NE(S). Since P ∈ NE(S) there must be a clause (6) with B1 ⊆ NE(S), . . . , Bk ⊆ NE(S). By definition of a simulation, and since P ∈ NE(S), there must be a clause (7) such that Bi ≼♯ Bi′ for every i, 1 ≤ i ≤ k. For every Q′ ∈ Bi′, there is Q ∈ Bi such that Q ≼ Q′. By induction hypothesis, since Q ∈ Bi ⊆ NE(S), Q′ ∈ NE(S′). So Bi′ ⊆ NE(S′) for every i, 1 ≤ i ≤ k. Whence P′ ∈ NE(S′).

Let ≼ and ≼′ be as in the statement of the Lemma. Let P (≼ ; ≼′) P″, say P ≼ P′ ≼′ P″. If P ∉ NE(S), then we are done, so assume P ∈ NE(S). For every clause (6) in S there is a clause (7) in S′ with Bi ≼♯ Bi′ for every i, 1 ≤ i ≤ k. By (∗), P′ ∈ NE(S′), so there is a clause P″(f(X1, . . . , Xk)) ⇐ B1″(X1), . . . , Bk″(Xk) in S″ such that Bi′ ≼′♯ Bi″ for every i, 1 ≤ i ≤ k. It follows that Bi (≼ ; ≼′)♯ Bi″ for every i, showing that (≼ ; ≼′) is a simulation of S by S″. □

Proposition 8 Let ≼ be the largest simulation of S by itself. Then ≼ is a quasi-ordering. If E ⊇ E′ then E ≼♯ E′. If E ≼♯ E′ then LE(S) ⊆ LE′(S).

PROOF. First, ≼ is reflexive, because the equality relation is clearly a simulation of S by itself. To show that ≼ is transitive, we realize that (≼ ; ≼) is a simulation of S by itself, by Lemma 7, so by maximality (≼ ; ≼) ⊆ ≼: so ≼ is transitive. That E ⊇ E′ implies E ≼♯ E′ is by the definition of ♯ and the fact that ≼ is reflexive.

The last claim is shown by proving that whenever ≼ is a simulation of S by itself, then for every ground term t ∈ LE(S), whenever E ≼♯ E′ then t ∈ LE′(S). This is proved by structural induction on t = f(t1, . . . , tk). Let E′ = {P1′, . . . , Pm′}. Since E ≼♯ E′, for every j, 1 ≤ j ≤ m, there is Pj ∈ E such that Pj ≼ Pj′. Since t ∈ LE(S), t ∈ LPj(S) for every j, so there is a clause


Pj(f(X1, . . . , Xk)) ⇐ Bj1(X1), . . . , Bjk(Xk) in S such that ti ∈ LBji(S) for every i, 1 ≤ i ≤ k. Since t ∈ LPj(S), LPj(S) ≠ ∅, so Pj ∈ NE(S), and because Pj ≼ Pj′, by definition there must be a clause Pj′(f(X1, . . . , Xk)) ⇐ Bj1′(X1), . . . , Bjk′(Xk) such that Bji ≼♯ Bji′ for every i, 1 ≤ i ≤ k. By induction hypothesis, since ti ∈ LBji(S), ti ∈ LBji′(S). So, using the clause above, t ∈ LPj′(S). As j is arbitrary between 1 and m, t ∈ LE′(S). □
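The largest-fixpoint computation of the largest simulation of S by itself, mentioned above, can be sketched as follows (self-contained, all names ours): start from all pairs and repeatedly delete pairs (P, P′) that violate Definition 6, until nothing changes:

```python
def non_empty_states(clauses):
    # Marking algorithm for NE(S), repeated here so the sketch is standalone.
    ne, changed = set(), True
    while changed:
        changed = False
        for (P, f, blocks) in clauses:
            if P not in ne and all(B <= ne for B in blocks):
                ne.add(P)
                changed = True
    return ne

def smyth(R, U, V):
    # U R# V: every v in V has some u in U with (u, v) in R.
    return all(any((u, v) in R for u in U) for v in V)

def largest_simulation(clauses, preds):
    """Greatest fixpoint on pairs of predicates (Definition 6)."""
    ne = non_empty_states(clauses)
    sim = {(P, Q) for P in preds for Q in preds}
    changed = True
    while changed:
        changed = False
        for (P, Q) in sorted(sim):
            if (P, Q) not in sim:
                continue
            if P not in ne:
                continue  # Definition 6 only constrains P in NE(S)
            for (P1, f, blocks) in clauses:
                if P1 != P:
                    continue
                # Q must answer with a clause on the same f whose blocks
                # dominate Smyth-wise in the current relation.
                if not any(
                    Q1 == Q and g == f and len(bl2) == len(blocks)
                    and all(smyth(sim, B, B2) for B, B2 in zip(blocks, bl2))
                    for (Q1, g, bl2) in clauses
                ):
                    sim.discard((P, Q))
                    changed = True
                    break
    return sim

clauses = [("P", "a", ()), ("Q", "a", ()), ("R", "f", (frozenset({"P"}),))]
sim = largest_simulation(clauses, {"P", "Q", "R"})
# Here P and Q simulate each other; R only simulates itself.
```

Since pairs are only ever removed and the removal condition is monotone, the chaotic iteration converges to the greatest fixpoint, in polynomially many sweeps.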

Let ≈ be the equivalence relation (≼ ∩ ≽), where ≽ is the converse of ≼. (This is generally coarser than, hence an improvement over, bisimilarity.) By Proposition 8, if P ≈ P′ then LP(S) = LP′(S). Pick a unique representative P↓ ∈ [P] inside the equivalence class [P] of P: P↓ ≈ P, and if P ≈ P′ then P↓ = P′↓. Note also that if P ∉ NE(S) (in particular LP(S) = ∅), then P ≼ P′ for any P′, by definition. It follows that all predicates outside NE(S) are equivalent for ≈.

Given E ∈ P(P), let N(E) be {P↓} if there is a P ∈ E \ NE(S), else let N(E) be the set of all P↓ where P ranges over the ≼-minimal elements of E. (An element P is ≼-minimal in E if and only if any P′ ∈ E such that P′ ≼ P is such that P′ ≈ P.) Let Nme be the range of N. Let O(E, E′) = true if and only if E ≼♯ E′.

Lemma 9 O is an inclusion oracle relative to the naming function N.

PROOF. First we claim that LN(E)(S) = LE(S). If there is a P ∈ E \ NE(S), then LP(S) = ∅, so LE(S) = ∅ since LE(S) ⊆ LP(S) = ∅, and since LP↓(S) = LP(S), it follows that LE(S) = L{P↓}(S) = ∅. Otherwise, since E is finite, for any Q ∈ E there is a ≼-minimal P ∈ E such that P ≼ Q, in particular such that LP(S) ⊆ LQ(S). So LE(S) is the intersection of all LP(S) where P ranges over the ≼-minimal elements of E. Since LP(S) = LP↓(S), the claim follows.

Now, if O(N(E), N(E′)) = true, that is, N(E) ≼♯ N(E′), then by Proposition 8, LN(E)(S) ⊆ LN(E′)(S). By the claim above, LE(S) ⊆ LE′(S). (3) follows. If E ⊇ E′, then for any ≼-minimal element P′ ∈ E′, P′ is also in E, so there is a ≼-minimal element P ∈ E such that P ≼ P′. Since P↓ ≈ P and P′↓ ≈ P′, P↓ ≼ P′↓. This means that N(E) ≼♯ N(E′), i.e., O(N(E), N(E′)) = true, whence (4). □

Using the O, N-construction with the above O and N yields a refinement of the powerset construction, where some equivalent sets E are equated using the naming function N, therefore minimizing the automaton partially on the fly.
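Given the largest simulation as a set `sim` of pairs (computable as sketched earlier, but taken here as an input so the sketch stands alone; all names ours), this N and O read:

```python
def make_naming(sim, preds, ne):
    """Build N and O from a simulation relation given as a set of pairs.
    preds is the set of all predicates, ne the set NE(S)."""
    def equiv(P, Q):                       # P ~ Q: mutual simulation
        return (P, Q) in sim and (Q, P) in sim
    def rep(P):                            # representative P-down: least of the class
        return min(Q for Q in preds if equiv(P, Q))
    def minimal(P, E):                     # P is minimal in E for the simulation
        return all(equiv(Q, P) for Q in E if (Q, P) in sim)
    def N(E):
        dead = sorted(P for P in E if P not in ne)
        if dead:                           # some member has empty language
            return frozenset({rep(dead[0])})
        return frozenset(rep(P) for P in E if minimal(P, E))
    def O(n1, n2):                         # Smyth test: n1 sim# n2
        return all(any((u, v) in sim for u in n1) for v in n2)
    return N, O

# Example relation: P and Q mutually similar, R only similar to itself.
sim = {("P", "P"), ("P", "Q"), ("Q", "P"), ("Q", "Q"), ("R", "R")}
N, O = make_naming(sim, {"P", "Q", "R"}, {"P", "Q", "R"})
print(sorted(N(frozenset({"P", "Q"}))))  # prints ['P']
```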
As it has been presented until now, this produces a complete deterministic automaton, which is almost always huge: letting m be the cardinality of Nme, the size of a model I with |I| = Nme is the sum over all k-ary functions f of m^k times the size of a table entry If(n1, . . . , nk) = n. Listing only the entries where n is different from N(∅) (the catch-all state) produces a deterministic automaton S′ that is in general much shorter. Moreover, using Proposition 5, t ∈ ⋃P∈F LP(S) if and only if t ∈ Ln(S′) for some n ∈ Nme with O(n, N({P})) = true, i.e., for some n = N(E) with N(E) ≼♯ N({P}), i.e., such that there is a Q ∈ N(E) with Q ≼ P. Since N(E) = ∅ if and only if E = ∅, t ∈ ⋃P∈F LP(S) if and only if t is recognized at one of the final states in FO,N′, none of which is the catch-all state N(∅). So S′, FO,N′ is a (usually incomplete, hence shorter) determinization of S, F. Finally, we can naturally construct this automaton incrementally, building names N(E) as the need emerges. This produces only reachable state names.

All this was implemented in the pldet tool, one of the tools of the h1 tool suite (see http://www.lsv.ens-cachan.fr/software/). In practice, we have observed that the obtained automata have at most a few more states than the minimal automaton (the filtration construction without the catch-all state), and are therefore competitive. Running the Hopcroft construction as a second step is usually not even necessary to get automata of reasonable sizes.

5 A Final Note: Universal Determinizations

The determinizations I, F′ of S, F that we have built until now are given by Σ-algebras I that satisfy a stronger property: for every set F of final states we can find a set F′ of values of |I| such that I, F′ is a complete determinization of S, F. The point is that I is independent of F. This leads to the following definition.

Definition 10 A universal determinization of an alternating automaton S is a Σ-algebra I, and a mapping I∗ : P → P(|I|) such that, for any ground term t, I⟦t⟧ ∈ I∗(P) if and only if t ∈ LP(S).

Indeed, F′ will now be ⋃P∈F I∗(P). Conversely, if as above we may find F′ given F, whatever F, then I∗(P) must be the F′ corresponding to F = {P}. The Herbrand construction is a universal determinization: I∗(P) is just LP(S). The O, N-construction is also a universal determinization: by Proposition 5, I∗(P) = {n ∈ Nme | O(n, N({P})) = true} fits the bill. So the powerset construction, with I∗(P) = {E ∈ P(P) | P ∈ E}, and the filtration construction, with I∗(P) = {LE(S) | E ∈ P(P), LE(S) ⊆ LP(S)}, are universal determinizations, too.

Universal determinizations are just first-order interpretations I∗ = (|I|, (If)f∈Σ, (I∗(P))P∈P), with the added constraint that I∗ |= P(t) if and only if t ∈ LP(S) for every ground term t. In other words, they are first-order models of the theory of a given set of regular languages LP(S), where P ranges over the set P.

References

[1] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. www.grappa.univ-lille3.fr/tata, 1997. Version of Sep. 6, 2005.
[2] T. Frühwirth, E. Shapiro, M. Y. Vardi, and E. Yardeni. Logic programs as types for logic programs. In Proc. 6th Symp. Logic in Computer Science, pages 300–309. IEEE Computer Society Press, 1991.
[3] D. C. Kozen. Automata and Computability. Undergraduate Texts in Computer Science. Springer, 1997.

10

UMR 8643 & INRIA Futurs projet SECSI & ENS Cachan 61, av. du président-Wilson, 94235 Cachan Cedex, France

Abstract While alternating or non-deterministic tree automata are syntax (e.g., clause sets), complete deterministic automata are best seen as semantics (Σ-algebras). In this sense, the powerset construction and the filtration construction are finite model constructions. We show that they are both instances of a more general construction. This leads to a determinization procedure that determinizes and minimizes partially at the same time. Key words: Tree automata, determinization, powerset construction, structure, model, clause, first-order logic, simulation

1 Introduction It has been observed in the past that complete deterministic tree automata are just Σ-algebras, that is, first-order models [3]. We explain this gently in Section 2. Profiting from model-theoretic intuitions, we define a general determinization scheme based on so-called naming functions and inclusion oracles (Section 3). As an application, we design an efficient determinization procedure for non-deterministic and alternating tree automata, which does some minimizing on the fly (Section 4). ? Partially supported by the RNTL project Prouvé, the ACI SI Rossignol, and the ACI jeunes chercheurs “Sécurité informatique, proto. crypto. et détection d’intrusions”. ∗ Fax: +33-1 47 40 75 21. Email address: [email protected] (Jean Goubault-Larrecq). URL: www.lsv.ens-cachan.fr/˜goubault/ (Jean Goubault-Larrecq).

Preprint submitted to Elsevier Science

13 January 2006

2 Tree Automata We fix a first-order signature Σ. Terms are denoted s, t, u, v, . . . , predicate symbols P , Q, . . . , variables X, Y , Z, . . . Let P be the set of predicate symbols. Horn clauses are of the form H ⇐ B where the head H is either an atom or ⊥, and the body B is a finite set A1 , . . . , An of atoms. For each satisfiable set S of Horn clauses, and each predicate symbol P , let L P (S) be the set of ground terms t such that P (t) is in the least Herbrand model of S. LP (S) is the language recognized at state P . When S consists only of tree automaton clauses P (f (X1 , . . . , Xn )) ⇐ P1 (X1 ), . . . , Pn (Xn ) (X1 , . . . , Xn pairwise distinct), this coincides with the usual definition of the set of terms recognized at P ; such clauses are just tree automaton transitions from P1 , . . . , Pn to P . Accordingly, we shall call below a set of tree automaton clauses a (tree) automaton. This connection between tree automata and Horn clauses was pioneered by Frühwirth et al. [2]; there, LP (S) is called the success set for P . This connection was then used in a number of papers: see the comprehensive textbook [1], in particular Section 7.6 on tree automata as sets of Horn clauses. Tree automata clauses can be generalized right away (see [1, Chapter 7] for alternating tree automata): Definition 1 (Alternating tree automata) Call -block any finite set of atoms of the form P1 (X), . . . , Pm (X) (with the same X, and m ≥ 0); it is non-empty iff m ≥ 1. We abbreviate -blocks B(X) to make the variable X explicit. We shall also write B for the set {P1 , . . . , Pm }. Alternating automaton clauses are of the form P (f (X1 , . . . , Xk )) ⇐ B1 (X1 ), . . . , Bk (Xk )

(1)

where B1 (X1 ), . . . , Bk (Xk ) are possibly empty -blocks, and the variables X1 , . . . , Xk are pairwise distinct. They are non-deterministic if and only if each block Bi (Xi ) contains exactly one predicate. An alternating tree automaton is any set S of alternating automaton clauses. It is non-deterministic if and only if all clauses are non-deterministic. Non-deterministic automata will just be called automata. The usual definitions of automata include a set of final states. It is more natural for now to leave the set of final states (predicates) out of the definition in our setting. Deterministic automata are traditionally defined as restrictions of non-deterministic automata. We shall argue that this is artificial, but let us follow the tradition for now: Definition 2 A non-deterministic automaton S is deterministic (resp. complete) if and only if, for every k-ary function f , for every predicate symbols P 1 , . . . , Pk , there is at most (resp. at least) one predicate symbol P such that S contains 2

P (f (X1 , . . . , Xk )) ⇐ P1 (X1 ), . . . , Pk (Xk ) In other words, a deterministic automaton S defines f as a partial function from P k to P, where P is the set of predicate symbols. A complete deterministic automaton defines f as a total function. It turns out that a tuple I = (|I|, (If )f ∈Σ ) with |I| a non-empty set (here |I| = P) and If : |I|k → |I| for each k-ary f ∈ Σ, is a first-order Σ-algebra (a.k.a., first-order structures, a.k.a., models). In other words, complete deterministic automata are just Σ-algebras. Up to minor details, this is the definition given by Kozen [3, Definition C.5, p.111]; Kozen considers only finite Σ-algebras, and says “deterministic” instead of “complete deterministic”. Given a Σ-algebra I = (|I|, (If )f ∈Σ ), define as usual the semantics I JtK ρ of terms t in an environment ρ mapping variables to values in |I| by: I JXK ρ = ρ(X) and I Jf (t1 , . . . , tn )K ρ = If (I Jt1 K ρ, . . . , I Jtn K ρ). When t is ground, ρ is irrelevant, and we write I JtK for I JtK ρ. Given a complete deterministic automaton S, S JtK is the unique state at which t is recognized in S. That is, S JtK = P if and only if t ∈ LP (S) for every state P . (If this looks strange, recall that S is both a clause set and a Σ-algebra.)

3 Determinization It is always possible to convert any non-deterministic, and even any alternating tree automaton S, with set F of final states, into an equivalent deterministic tree automaton S 0 , F 0 . Here “equivalent” means that the terms recognized at some state in F by S are exactly the terms recognized at some state in F 0 by S 0 . This can be achieved in deterministic exponential time [1, Theorem 54, Section 7.4]. Writing the above slightly more formally, we arrive at the traditional definition of determinization—for alternating tree automata: Given an alternating tree automaton S and a set of final states F , a determinization of S, F is a deterministic automaton S 0 on a new set of states N me and a subset F 0 ⊆ N me such that S S 0 P 0 ∈F 0 LP 0 (S ) = P ∈F LP (S). In terms of Σ-algebras, this translates to: Definition 3 A complete determinization of S, F is a Σ-algebra I and a set F 0 ⊆ |I| of values such that for every ground term t, I JtK ∈ F 0 if and only if t ∈ S P ∈F LP (S). A first complete determinization in the sense above is the Herbrand Σ-algebra IH : |IH | is the set of all ground terms, and If (t1 , . . . , tn ) is the term f (t1 , . . . , tn ). S According to Definition 3, S 0 = IH with F 0 = P ∈F LP (S) is a complete determinization of S, F . (We have not restricted our models to be finite yet!) 3

Not unexpectedly, the next complete determinization, IP , is given by the powerset construction. This one is finite, as will be all subsequent constructions. Its states are the sets of predicates in P, and its transitions are given as one should expect. Formally, the carrier |IP | is P(P), and for each k-ary function f , for every k-tuple of values E1 , . . . , Ek ∈ P(P), IP f (E1 , . . . , Ek ) = {P ∈ P (2) | ∃(P (f (X1 , . . . , Xk )) ⇐ B1 (X1 ), . . . , Bk (Xk )) ∈ S · B1 ⊆ E 1 ∧ . . . ∧ B k ⊆ E k } Recall that both -blocks Bi and values Ei are sets of predicate symbols. An easy induction on t shows that for every ground term t, t ∈ LP (S) if and only if P ∈ IP JtK. So IP , FP0 = {v ∈ P(P) | v ∩ F 6= ∅} is a complete determinization of S, F . A tighter determinization, in the sense that it produces deterministic automata with less states, is the filtration construction Ifil . Let |Ifil | be the set of all languages T LE (S) = P ∈E LP (S), E ∈ P(P), (note there are only finitely many, as most as many as there are subsets of P) and

Ifil f (L1 , . . . , Lk ) = LE (S) where E = {P ∈ P | ∃(P (f (X1 , . . . , Xk )) ⇐ B1 (X1 ), . . . , Bk (Xk )) ∈ S · L1 ⊆ LB1 (S) ∧ . . . ∧ Lk ⊆ LBk (S)} S

Here, t ∈ LP (S) if and only if Ifil JtK ⊆ LP (S). So Ifil , Ffil0 = P ∈F {LE (S)|E ∈ P(P), LE (S) ⊆ LP (S)} is a complete determinization of S, F . Ifil is isomorphic to the quotient of IH by congruence ≡S defined by E ≡S E 0 if and only if LE (S) = LE 0 (S). This is similar to, but not quite the same as the Myhill-Nerode congruence, see Kozen [3, Theorem D.3, p.117]. The above two constructions can be generalized once we realize that they really rely only on two devices: • A naming function N mapping each E ∈ P(P) to some representative N (E) in some set N me of state names. In the power construction case, each E is its own name, and N (E) = E. In the filtration construction, the name N (E) of E is the language LE (S)—or, equivalently, the equivalence class of E modulo ≡S . • An inclusion oracle O : N me×N me → {true, false}. That O(N (E), N (E 0 )) returns true is seen as a sufficient condition for LE (S) to be included in LE 0 (S). In the filtration construction, O(L, L0 ) returns true if and only if L ⊆ L0 , without any further ado. In the powerset construction, O(E, E 0 ) returns true if and only if E ⊇ E 0 —note that this indeed implies LE (S) ⊆ LE 0 (S). 4

The precise conditions that we require of inclusion oracles are given below. Condition (3) is our requirement on language inclusion above. Condition (4) forces the oracle to answer true at least in some (obvious) situations.

Definition 4 An inclusion oracle relative to a naming function N : P(P) → Nme is any function O : Nme × Nme → {true, false} such that:

if O(N(E), N(E′)) = true then L_E(S) ⊆ L_{E′}(S)   (3)
if E ⊇ E′ then O(N(E), N(E′)) = true   (4)

Then the O,N-construction I_O,N is given by |I_O,N| = Nme and

I_O,N f(n_1, ..., n_k) = N(E), where E = {P ∈ P | ∃(P(f(X_1, ..., X_k)) ⇐ B_1(X_1), ..., B_k(X_k)) ∈ S · O(n_1, N(B_1)) = true ∧ ... ∧ O(n_k, N(B_k)) = true}   (5)

One may also observe that conditions (3) and (4) express that O(N(E), N(E′)) = true is a condition intermediate in strength between the powerset condition E ⊇ E′ and the filtration condition L_E(S) ⊆ L_{E′}(S). We require no other condition on O; in particular, we do not require O to be transitive. However, (4) implies that O is reflexive on the range of N, i.e., O(N(E), N(E)) = true. In turn, this and (3) imply that if N(E) = N(E′) then L_E(S) = L_{E′}(S), so E ≡_S E′. That is, the equivalence relation defined by N, whereby E and E′ are equivalent if and only if N(E) = N(E′), must be finer than ≡_S. We now show that the O,N-construction always yields a complete determinization.

Proposition 5 Assume that (3) and (4) hold. For every ground term t and every E′ ∈ P(P), t ∈ L_{E′}(S) if and only if O(I_O,N⟦t⟧, N(E′)) = true. Let F′_O,N = ⋃_{P ∈ F} {n ∈ Nme | O(n, N({P})) = true}. Then I_O,N, F′_O,N is a complete determinization of S, F.

PROOF. Only if: by structural induction on t = f(t_1, ..., t_k). Let E′ = {P_1, ..., P_m}. If t ∈ L_{E′}(S), then for every j, 1 ≤ j ≤ m, there is a clause P_j(f(X_1, ..., X_k)) ⇐ B_j1(X_1), ..., B_jk(X_k) in S such that t_i ∈ L_{B_ji}(S) for every i, 1 ≤ i ≤ k. By induction hypothesis: (a) O(I_O,N⟦t_i⟧, N(B_ji)) = true for every i. By (5), I_O,N⟦t⟧ = I_O,N f(I_O,N⟦t_1⟧, ..., I_O,N⟦t_k⟧) = N(E), where E contains every P_j by (a), 1 ≤ j ≤ m. So E ⊇ E′, therefore by (4) O(I_O,N⟦t⟧, N(E′)) = true.

If: by structural induction on t = f(t_1, ..., t_k). If O(I_O,N⟦t⟧, N(E′)) = true, then by (3) L_E(S) ⊆ L_{E′}(S), where E is the set of all predicates P_j, 1 ≤ j ≤ m,

such that there are clauses P_j(f(X_1, ..., X_k)) ⇐ B_j1(X_1), ..., B_jk(X_k) in S with O(I_O,N⟦t_i⟧, N(B_ji)) = true for every i, 1 ≤ i ≤ k. By induction hypothesis t_i ∈ L_{B_ji}(S), so by using the clause above t ∈ L_{P_j}(S), 1 ≤ j ≤ m. So t ∈ L_E(S). Since L_E(S) ⊆ L_{E′}(S), t ∈ L_{E′}(S).

Let us show the second part of the Proposition. For every ground term t, I_O,N⟦t⟧ ∈ F′_O,N if and only if O(I_O,N⟦t⟧, N({P})) = true for some P ∈ F, if and only if t ∈ L_P(S) for some P ∈ F by the first part of the Proposition. This completes the proof. □
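Equation (5) is just as easy to compute, parameterized by N and O. A minimal sketch under a hypothetical encoding of clauses as tuples (head, f, [B_1, ..., B_k]) with frozenset blocks; instantiating N with the identity and O with ⊇ recovers the powerset construction on an invented Even/Odd automaton.

```python
def on_transition(clauses, N, O, f, names):
    """Equation (5): from state names n1,...,nk, collect every head P of a
    clause P(f(X1,...,Xk)) <= B1(X1),...,Bk(Xk) such that O(ni, N(Bi))
    holds for every i, then return the name N(E) of the resulting set E."""
    E = frozenset(
        head
        for head, g, blocks in clauses
        if g == f
        and len(blocks) == len(names)
        and all(O(n, N(b)) for n, b in zip(names, blocks))
    )
    return N(E)

def on_evaluate(clauses, N, O, term):
    """Interpret a ground term (f, (subterms...)) in I_{O,N}."""
    f, args = term
    return on_transition(clauses, N, O, f,
                         tuple(on_evaluate(clauses, N, O, a) for a in args))

# Hypothetical example automaton (Even/Odd over z and s):
CLAUSES = [
    ("Even", "z", []),                    # Even(z) <=
    ("Even", "s", [frozenset({"Odd"})]),  # Even(s(X)) <= Odd(X)
    ("Odd",  "s", [frozenset({"Even"})]), # Odd(s(X))  <= Even(X)
]

# Powerset instance: N is the identity, O(E, E') tests E >= E'.
N_id = lambda E: frozenset(E)
O_superset = lambda n, m: n >= m
```

Any other N, O pair satisfying (3) and (4) can be plugged in without touching the evaluation code; this is the point of the generic construction.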

4 Application

The filtration construction yields a complete deterministic automaton, but its oracle L_E(S) ⊆ L_{E′}(S) is costly: language inclusion is DEXPTIME-complete, even when the language on the left is the universal language and the language on the right is given by a non-deterministic automaton [1, Theorem 14, Section 1.7], or when the language on the right is the empty language [1, Theorem 55, Section 7.5]. In practice, deciding L_E(S) ⊆ L_{E′}(S) requires one to determinize S, which is our initial goal. On the other hand, the powerset oracle E ⊇ E′ is efficiently decided, but may yield large determinizations. Standard practice consists in relying on the powerset construction, then minimizing, either completely, using Hopcroft's algorithm [1, Algorithm MIN, Section 1.5], or partially, by merging bisimilar states [3]. An application of Proposition 5 is that we can minimize partially while determinizing, at almost no cost. To this end, we use an O, N pair based on efficient sufficient criteria for language non-emptiness and inclusion, which generalize bisimilarity (which is a criterion for language equivalence only).

First, let NE(S) be the smallest subset of P such that, for every clause of the form (1) in S, if B_1 ⊆ NE(S) and ... and B_k ⊆ NE(S), then P ∈ NE(S). Clearly, if L_P(S) ≠ ∅, then P ∈ NE(S). In fact, if S is a non-deterministic automaton, this yields a decision procedure for non-emptiness: if P ∈ NE(S) then L_P(S) ≠ ∅. (This is not so for alternating automata, for which non-emptiness is DEXPTIME-complete [1, Theorem 55, Section 7.5].) NE(S) can be computed in polynomial time by a marking algorithm.

We now turn to language inclusion. Given a binary relation R on a set A, its Smyth extension R♯ is defined by U R♯ V if and only if for every v ∈ V, there is a u ∈ U such that u R v. (This is the extension to general binary relations of the well-known Smyth quasi-ordering.)
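The marking algorithm for NE(S) and the Smyth extension test can be sketched as follows. The encoding (clauses as (head, f, [B_1, ..., B_k]) tuples with frozenset blocks) and the Dead predicate, an invented example of a predicate outside NE(S), are hypothetical illustrations.

```python
def nonempty(clauses):
    """NE(S): the least set of predicates closed under the rule
    'if some clause P(f(...)) <= B1(X1),...,Bk(Xk) has every Bi
    contained in the set, add P', computed by iterated marking."""
    ne = set()
    changed = True
    while changed:
        changed = False
        for head, _f, blocks in clauses:
            if head not in ne and all(b <= ne for b in blocks):
                ne.add(head)
                changed = True
    return ne

def smyth(rel, U, V):
    """Smyth extension: U R# V iff every v in V has some u in U with u R v."""
    return all(any((u, v) in rel for u in U) for v in V)

# Hypothetical example: Dead recognizes nothing, Even and Odd do.
CLAUSES = [
    ("Even", "z", []),
    ("Even", "s", [frozenset({"Odd"})]),
    ("Odd",  "s", [frozenset({"Even"})]),
    ("Dead", "s", [frozenset({"Dead"})]),
]
```

The marking loop runs at most |P| passes over the clauses, which illustrates the polynomial bound claimed above.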
Definition 6 Given two alternating automata S and S′, a simulation of S by S′ is a binary relation ⪯ on P such that, for every P ∈ NE(S) and every clause

P(f(X_1, ..., X_k)) ⇐ B_1(X_1), ..., B_k(X_k)   (6)

in S, for every P′ ∈ P such that P ⪯ P′, there is a clause

P′(f(X_1, ..., X_k)) ⇐ B′_1(X_1), ..., B′_k(X_k)   (7)

in S′ such that B_i ⪯♯ B′_i for every i, 1 ≤ i ≤ k.

There is always a largest simulation, and it is computable in polynomial time, by a greatest fixpoint computation on the set of pairs (P, P′) of predicates. The next two results are probably folklore, at least for non-deterministic automata.

Lemma 7 For any two simulations ⪯ of S by S′ and ⪯′ of S′ by S″, the composition (⪯;⪯′), defined by P (⪯;⪯′) P″ if and only if P ⪯ P′ and P′ ⪯′ P″ for some P′ ∈ P, is a simulation of S by S″.

PROOF. First, we claim that: (∗) if P ⪯ P′, where ⪯ is a simulation of S by S′, and P ∈ NE(S), then P′ ∈ NE(S′). This is shown by structural induction on a proof that P ∈ NE(S). Since P ∈ NE(S), there must be a clause (6) with B_1 ⊆ NE(S), ..., B_k ⊆ NE(S). By definition of a simulation, and since P ∈ NE(S), there must be a clause (7) such that B_i ⪯♯ B′_i for every i, 1 ≤ i ≤ k. For every Q′ ∈ B′_i, there is a Q ∈ B_i such that Q ⪯ Q′. By induction hypothesis, since Q ∈ B_i ⊆ NE(S), Q′ ∈ NE(S′). So B′_i ⊆ NE(S′) for every i, 1 ≤ i ≤ k. Whence P′ ∈ NE(S′).

Let ⪯ and ⪯′ be as in the statement of the Lemma. Let P (⪯;⪯′) P″, say P ⪯ P′ ⪯′ P″. If P ∉ NE(S), then we are done, so assume P ∈ NE(S). For every clause (6) in S there is a clause (7) in S′ with B_i ⪯♯ B′_i for every i, 1 ≤ i ≤ k. By (∗), P′ ∈ NE(S′), so there is a clause P″(f(X_1, ..., X_k)) ⇐ B″_1(X_1), ..., B″_k(X_k) in S″ such that B′_i ⪯′♯ B″_i for every i, 1 ≤ i ≤ k. It follows that B_i (⪯;⪯′)♯ B″_i for every i, showing that (⪯;⪯′) is a simulation of S by S″. □

Proposition 8 Let ⪯ be the largest simulation of S by itself. Then ⪯ is a quasi-ordering. If E ⊇ E′ then E ⪯♯ E′. If E ⪯♯ E′ then L_E(S) ⊆ L_{E′}(S).

PROOF. First, ⪯ is reflexive, because the equality relation is clearly a simulation of S by itself. To show that ⪯ is transitive, note that (⪯;⪯) is a simulation of S by itself, by Lemma 7, so by maximality (⪯;⪯) ⊆ ⪯: hence ⪯ is transitive. That E ⊇ E′ implies E ⪯♯ E′ follows from the definition of ♯ and the fact that ⪯ is reflexive.

The last claim is shown by proving that whenever ⪯ is a simulation of S by itself, then for every ground term t ∈ L_E(S), whenever E ⪯♯ E′ then t ∈ L_{E′}(S). This is proved by structural induction on t = f(t_1, ..., t_k). Let E′ = {P′_1, ..., P′_m}. Since E ⪯♯ E′, for every j, 1 ≤ j ≤ m, there is a P_j ∈ E such that P_j ⪯ P′_j. Since t ∈ L_E(S), t ∈ L_{P_j}(S) for every j, so there is a clause


P_j(f(X_1, ..., X_k)) ⇐ B_j1(X_1), ..., B_jk(X_k) in S such that t_i ∈ L_{B_ji}(S) for every i, 1 ≤ i ≤ k. Since t ∈ L_{P_j}(S), L_{P_j}(S) ≠ ∅, so P_j ∈ NE(S), and because P_j ⪯ P′_j, by definition there must be a clause P′_j(f(X_1, ..., X_k)) ⇐ B′_j1(X_1), ..., B′_jk(X_k) such that B_ji ⪯♯ B′_ji for every i, 1 ≤ i ≤ k. By induction hypothesis, since t_i ∈ L_{B_ji}(S), t_i ∈ L_{B′_ji}(S). So, using the clause above, t ∈ L_{P′_j}(S). As j is arbitrary between 1 and m, t ∈ L_{E′}(S). □
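The greatest fixpoint computation of the largest simulation can be sketched as follows. The encoding of clauses as (head, f, [B_1, ..., B_k]) tuples and the Nat/Even/Odd automaton are invented for illustration, and the helpers for NE(S) and the Smyth extension are inlined so the fragment runs on its own; a production implementation would use a more refined schedule than this naive re-check loop.

```python
def nonempty(clauses):
    """NE(S), by iterated marking."""
    ne, changed = set(), True
    while changed:
        changed = False
        for head, _f, blocks in clauses:
            if head not in ne and all(b <= ne for b in blocks):
                ne.add(head)
                changed = True
    return ne

def smyth(rel, U, V):
    """U R# V iff every v in V has some u in U with u R v."""
    return all(any((u, v) in rel for u in U) for v in V)

def largest_simulation(clauses):
    """Greatest fixpoint: start with all pairs (P, P') and delete a pair as
    soon as some clause for P (with P in NE(S)) has no matching clause for
    P' whose blocks dominate it in the Smyth extension of the current relation."""
    preds = ({h for h, _f, _bs in clauses}
             | {p for _h, _f, bs in clauses for b in bs for p in b})
    ne = nonempty(clauses)
    sim = {(p, q) for p in preds for q in preds}

    def ok(p, q):
        if p not in ne:
            return True  # predicates outside NE(S) are simulated by anything
        return all(
            any(h2 == q and f2 == f and len(bs2) == len(bs)
                and all(smyth(sim, b, b2) for b, b2 in zip(bs, bs2))
                for h2, f2, bs2 in clauses)
            for h, f, bs in clauses if h == p)

    changed = True
    while changed:
        changed = False
        for pair in sorted(sim):
            if not ok(*pair):
                sim.discard(pair)
                changed = True
    return sim

# Hypothetical automaton: L_Even and L_Odd are both included in L_Nat.
CLAUSES = [
    ("Nat",  "z", []),
    ("Nat",  "s", [frozenset({"Nat"})]),
    ("Even", "z", []),
    ("Even", "s", [frozenset({"Odd"})]),
    ("Odd",  "s", [frozenset({"Even"})]),
]
```

On this example, Even ⪯ Nat and Odd ⪯ Nat survive the fixpoint while Nat ⪯ Even does not, matching the language inclusions.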

Let ≈ be the equivalence relation (⪯ ∩ ⪰), where ⪰ is the converse of ⪯. (This is in general coarser than, hence an improvement over, bisimilarity.) By Proposition 8, if P ≈ P′ then L_P(S) = L_{P′}(S). Pick a unique representative P↓ ∈ [P] inside the equivalence class [P] of P: P↓ ≈ P, and if P ≈ P′ then P↓ = P′↓. Note also that if P ∉ NE(S) (in particular L_P(S) = ∅), then P ⪯ P′ for any P′, by definition. It follows that all predicates outside NE(S) are equivalent for ≈.

Given E ∈ P(P), let N(E) be {P↓} if there is a P ∈ E \ NE(S); otherwise, let N(E) be the set of all P↓ where P ranges over the ⪯-minimal elements of E. (An element P is ⪯-minimal in E if and only if any P′ ∈ E such that P′ ⪯ P satisfies P′ ≈ P.) Let Nme be the range of N. Let O(E, E′) = true if and only if E ⪯♯ E′.

Lemma 9 O is an inclusion oracle relative to the naming function N.

PROOF. First, we claim that L_{N(E)}(S) = L_E(S). If there is a P ∈ E \ NE(S), then L_P(S) = ∅, so L_E(S) = ∅ since L_E(S) ⊆ L_P(S) = ∅, and since L_{P↓}(S) = L_P(S), it follows that L_E(S) = L_{{P↓}}(S) = ∅. Otherwise, since E is finite, for any Q ∈ E there is a ⪯-minimal P ∈ E such that P ⪯ Q, in particular such that L_P(S) ⊆ L_Q(S). So L_E(S) is the intersection of all L_P(S) where P ranges over the ⪯-minimal elements of E. Since L_P(S) = L_{P↓}(S), the claim follows.

Now, if O(N(E), N(E′)) = true, that is, N(E) ⪯♯ N(E′), then by Proposition 8, L_{N(E)}(S) ⊆ L_{N(E′)}(S). By the claim above, L_E(S) ⊆ L_{E′}(S). (3) follows.

If E ⊇ E′, then for any ⪯-minimal element P′ ∈ E′, P′ ∈ E, so there is a ⪯-minimal element P ∈ E such that P ⪯ P′. Since P↓ ≈ P and P′↓ ≈ P′, P↓ ⪯ P′↓. (And if E contains some P ∉ NE(S), then N(E) = {P↓} with P↓ ⪯ Q for every Q, by transitivity.) This means that N(E) ⪯♯ N(E′), i.e., O(N(E), N(E′)) = true, whence (4). □

Using the O,N-construction with the above O and N yields a refinement of the powerset construction, where some equivalent sets E are equated by the naming function N, therefore minimizing the automaton partially on the fly.
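The naming function N and oracle O of Lemma 9 can be sketched on top of the largest simulation. This uses the same hypothetical clause encoding as before, with all helpers inlined so the fragment is self-contained; the choice of the lexicographically least predicate as the representative P↓ is an arbitrary assumption.

```python
def nonempty(clauses):
    """NE(S), by iterated marking."""
    ne, ch = set(), True
    while ch:
        ch = False
        for h, _f, bs in clauses:
            if h not in ne and all(b <= ne for b in bs):
                ne.add(h)
                ch = True
    return ne

def smyth(rel, U, V):
    """U R# V iff every v in V has some u in U with u R v."""
    return all(any((u, v) in rel for u in U) for v in V)

def largest_simulation(clauses):
    """Greatest fixpoint over pairs of predicates, as sketched earlier."""
    preds = ({h for h, _f, _bs in clauses}
             | {p for _h, _f, bs in clauses for b in bs for p in b})
    ne = nonempty(clauses)
    sim = {(p, q) for p in preds for q in preds}
    def ok(p, q):
        if p not in ne:
            return True
        return all(any(h2 == q and f2 == f and len(bs2) == len(bs)
                       and all(smyth(sim, b, b2) for b, b2 in zip(bs, bs2))
                       for h2, f2, bs2 in clauses)
                   for h, f, bs in clauses if h == p)
    ch = True
    while ch:
        ch = False
        for pair in sorted(sim):
            if not ok(*pair):
                sim.discard(pair)
                ch = True
    return sim

def make_naming_and_oracle(clauses):
    """N(E) keeps representatives of the simulation-minimal elements of E
    (or of one dead predicate, if any); O tests the Smyth extension of
    the largest simulation."""
    sim = largest_simulation(clauses)
    ne = nonempty(clauses)
    preds = ({h for h, _f, _bs in clauses}
             | {p for _h, _f, bs in clauses for b in bs for p in b})
    # Canonical representative P_down of each ~-equivalence class
    # (lexicographically least: an arbitrary but consistent choice).
    rep = {p: min(q for q in preds if (p, q) in sim and (q, p) in sim)
           for p in preds}
    def N(E):
        dead = sorted(p for p in E if p not in ne)
        if dead:
            return frozenset({rep[dead[0]]})
        minimal = [p for p in E
                   if all((q, p) not in sim or (p, q) in sim for q in E)]
        return frozenset(rep[p] for p in minimal)
    def O(n1, n2):
        return smyth(sim, n1, n2)
    return N, O

# Hypothetical Nat/Even/Odd automaton, as in the earlier sketches.
CLAUSES = [
    ("Nat",  "z", []),
    ("Nat",  "s", [frozenset({"Nat"})]),
    ("Even", "z", []),
    ("Even", "s", [frozenset({"Odd"})]),
    ("Odd",  "s", [frozenset({"Even"})]),
]
```

On this example, N({Nat, Even}) = {Even}: since Even ⪯ Nat but not conversely, Nat is not ⪯-minimal and is dropped from the name.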
As presented so far, this produces a complete deterministic automaton, which is almost always huge: letting m be the cardinality of Nme, the size of a model I

with |I| = Nme is the sum, over all k-ary function symbols f, of m^k times the size of a table entry If(n_1, ..., n_k) = n. Listing only the entries where n is different from N(∅) (the catch-all state) produces a deterministic automaton S′ that is in general much shorter. Moreover, using Proposition 5, t ∈ ⋃_{P ∈ F} L_P(S) if and only if t ∈ L_n(S′) for some n ∈ Nme with O(n, N({P})) = true, i.e., for some n = N(E) with N(E) ⪯♯ N({P}), i.e., such that there is a Q ∈ N(E) with Q ⪯ P. Since N(E) = ∅ if and only if E = ∅, t ∈ ⋃_{P ∈ F} L_P(S) if and only if t is recognized at one of the final states in F′_O,N, none of which is the catch-all state N(∅). So S′, F′_O,N is a (usually incomplete, hence smaller) determinization of S, F. Finally, we can naturally construct this automaton incrementally, building names N(E) as the need arises. This produces only reachable state names.

All this was implemented in the pldet tool, one of the tools of the h1 tool suite (see http://www.lsv.ens-cachan.fr/software/). In practice, we have observed that the automata obtained had at most only a few more states than the minimal automaton (the filtration construction without the catch-all state), and were therefore competitive. Running Hopcroft's construction as a second step is usually not even necessary to get automata of reasonable size.
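The incremental, reachable-names-only construction can be sketched as a worklist loop. This is only an illustration under a hypothetical clause encoding (tuples (head, f, [B_1, ..., B_k]) with frozenset blocks), not the pldet implementation; it is instantiated below with the powerset pair (identity N, ⊇ as O) on an invented Even/Odd automaton.

```python
from itertools import product

def reachable_determinization(clauses, N, O):
    """Build only the reachable part of I_{O,N}: repeatedly apply every
    function symbol to tuples of already-discovered state names, recording
    each transition, until no new name appears."""
    arities = {}
    for _h, f, bs in clauses:
        arities[f] = len(bs)
    states, delta = set(), {}
    changed = True
    while changed:
        changed = False
        for f, k in arities.items():
            for args in product(list(states), repeat=k):
                E = frozenset(
                    h for h, g, bs in clauses
                    if g == f and len(bs) == k
                    and all(O(n, N(b)) for n, b in zip(args, bs)))
                n = N(E)
                delta[(f, args)] = n
                if n not in states:
                    states.add(n)
                    changed = True
    return states, delta

CLAUSES = [
    ("Even", "z", []),                    # Even(z) <=
    ("Even", "s", [frozenset({"Odd"})]),  # Even(s(X)) <= Odd(X)
    ("Odd",  "s", [frozenset({"Even"})]), # Odd(s(X))  <= Even(X)
]

# Powerset instance of the O,N-construction.
states, delta = reachable_determinization(
    CLAUSES, lambda E: frozenset(E), lambda n, m: n >= m)
```

Entries of delta mapping to the catch-all name N(∅) could additionally be filtered out to obtain the incomplete automaton S′ described above.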

5 A Final Note: Universal Determinizations

The determinizations I, F′ of S, F that we have built so far are given by Σ-algebras I that satisfy a stronger property: for every set F of final states, we can find a set F′ of values in |I| such that I, F′ is a complete determinization of S, F. The point is that I is independent of F. This leads to the following definition.

Definition 10 A universal determinization of an alternating automaton S is a Σ-algebra I, together with a mapping I∗ : P → P(|I|), such that, for any ground term t, I⟦t⟧ ∈ I∗(P) if and only if t ∈ L_P(S).

Indeed, F′ will now be ⋃_{P ∈ F} I∗(P). Conversely, if as above we may find F′ given F, whatever F, then I∗(P) must be the F′ corresponding to F = {P}. The Herbrand construction is a universal determinization: I∗(P) is just L_P(S). The O,N-construction is also a universal determinization: by Proposition 5, I∗(P) = {n ∈ Nme | O(n, N({P})) = true} fits the bill. So the powerset construction, with I∗(P) = {E ∈ P(P) | P ∈ E}, and the filtration construction, with I∗(P) = {L_E(S) | E ∈ P(P), L_E(S) ⊆ L_P(S)}, are universal determinizations, too.

Universal determinizations are just first-order interpretations I∗ = (|I|, (If)_{f ∈ Σ}, (I∗(P))_{P ∈ P}), with the added constraint that I∗ |= P(t) if and only if t ∈ L_P(S) for every ground term t. In other words, they are first-order models of the theory of a given set of regular languages L_P(S), where P ranges over the set P.

References

[1] H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. www.grappa.univ-lille3.fr/tata, 1997. Version of Sep. 6, 2005.

[2] T. Frühwirth, E. Shapiro, M. Y. Vardi, and E. Yardeni. Logic programs as types for logic programs. In Proc. 6th Symp. Logic in Computer Science, pages 300–309. IEEE Computer Society Press, 1991.

[3] D. C. Kozen. Automata and Computability. Undergraduate Texts in Computer Science. Springer, 1997.
