Normalized Expressions and Finite Automata

J.-M. Champarnaud, F. Ouardi, D. Ziadi
LITIS, University of Rouen, 76000 Saint-Etienne du Rouvray, France
{Jean-Marc.Champarnaud,Faissal.Ouardi,Djelloul.Ziadi}@univ-rouen.fr

Abstract

There exist two well-known quotients of the position automaton of a regular expression. The first one, called the equation automaton, was first introduced by Mirkin from the notion of prebase and was later redefined by Antimirov from the notion of partial derivative. The second one, due to Ilie and Yu and called the follow automaton, can be obtained by eliminating ε-transitions in an ε-NFA that is always smaller than the classical ε-NFAs (Thompson, Sippu and Soisalon–Soininen). Ilie and Yu discuss the difficulty of comparing theoretically the size of the follow automaton with the size of the equation automaton, and conclude that experimental studies are very likely necessary. In this paper we settle the theoretical question: we first define a set of regular expressions, called normalized expressions, such that every regular expression can be normalized in linear time, and we then prove that the equation automaton of a normalized expression is always smaller than its follow automaton.

1 Introduction

For the last fifty years, the conversion of regular expressions into automata has attracted the interest of computer scientists. In 1960-61, Glushkov [7] and McNaughton and Yamada [10] defined the position automaton of a regular expression. In 1964, Mirkin [11] introduced the notion of prebase, described a construction of the equation automaton and proved that this automaton is smaller in number of states than the position automaton.

Since the nineties, researchers have been interested in designing efficient algorithms for computing these automata. There exist three well-known algorithms for computing the position automaton. The first one, due to Brüggemann-Klein [2], makes use of the notion of star normal form of a regular expression. The second one, due to Chang and Paige [6], is based on a lazy computation technique. The third one was designed by Ziadi, Ponty and Champarnaud [14, 12] and is built on the so-called ZPC-structure. The complexity of these three algorithms is quadratic w.r.t. the size of the regular expression. Concerning the equation automaton, there are two main algorithms. Antimirov's algorithm [1] is based on the computation of the set of partial derivatives of the expression, whereas the algorithm designed by Champarnaud and Ziadi [3] computes the equation automaton as the quotient of the c-continuation automaton by an equivalence relation that will be denoted by ≡_e.

Recently, Ilie and Yu [9] introduced a new automaton, called the follow automaton, and showed that it is a quotient of the position automaton by an equivalence relation denoted by ≡_f. It has been shown in [4] that the follow automaton can be efficiently computed from the ZPC-structure of the expression. Ilie and Yu have produced families of expressions for which the follow automaton has a smaller number of states than the equation automaton, and other families for which the contrary holds. They finally conjecture that the only way to decide which automaton is better is probably by testing them in real-life applications.

In this paper, we settle this challenging question. Indeed, we show that any regular expression can be reduced to a normalized form in linear time, and that the equation automaton of a normalized expression has a smaller number of states than its follow automaton.

The paper is organized as follows. Section 2 presents terminology and recalls the definition of the three above-mentioned automata, as well as the computation of the c-continuations of a regular expression, which turns out to be a useful tool for proving our results. In Section 3, we first define the set of normalized expressions and then show that, for a normalized expression, the equivalence relation ≡_e is coarser than the equivalence relation ≡_f.

2 Preliminaries

In this section we recall some basic definitions and properties about regular expressions and finite automata. For more details, we refer to [8, 13].

Regular expressions and automata

Let Σ be a non-empty finite set of symbols, called an alphabet. The set of all the words over Σ is denoted by Σ∗. The empty word is denoted by ε. A language over Σ is a subset of Σ∗. Regular expressions over an alphabet Σ and the regular languages that they denote are inductively defined as follows:
(1) 0 is a regular expression denoting the language L(0) = ∅.
(2) 1 is a regular expression denoting the language L(1) = {ε}.
(3) a, for all a ∈ Σ, is a regular expression denoting the language L(a) = {a}.
Let F (resp. G) be a regular expression denoting the language L(F) (resp. L(G)). Then we have:
(4) (F + G) is a regular expression denoting the language L(F + G) = L(F) ∪ L(G).
(5) (F · G) is a regular expression denoting the language L(F · G) = L(F)L(G).
(6) (F∗) is a regular expression denoting the language L(F∗) = (L(F))∗.

We write op(E) = "+" (resp. "·", "∗") if E = F + G (resp. E = F · G, E = F∗). The following identities are classically used: 0 + E = E = E + 0, 1 · E = E = E · 1, 0 · E = 0 = E · 0. We write F ≡ G if the regular expressions F and G are identical (following Mirkin [11], F and G "graphically coincide").

Let E be a regular expression over an alphabet Σ. We call alphabetic width of E, denoted by ||E||, the number of occurrences of symbols of Σ in E, whereas we call size of E, denoted by |E|, the number of nodes of the syntax tree of E. The alphabetic width of the expression (a + b)∗ · a · b · a + 1 is equal to 5; its size is equal to 12.

In order to specify their position in the expression, symbols are subscripted following the order of reading. For example, starting from E = (a + b)∗aba + 1, one obtains the linearized version Ē = (a_1 + b_2)∗ a_3 b_4 a_5 + 1 of E. The set of positions of an expression E is denoted by Pos_E. For the previous example, we have Pos_E = {a_1, b_2, a_3, b_4, a_5}. If F is a subexpression of E, we denote by Pos_E(F) the subset of positions of E that are symbols of F. We denote by h the mapping that sends each position in Pos_E to the symbol of Σ that appears at this position in E. For E = (a + b)∗aba + 1, we have h(a_1) = h(a_3) = h(a_5) = a and h(b_2) = h(b_4) = b. We use the following definition: λ(E) = if ε ∈ L(E) then 1 else 0.

A finite automaton is a 5-tuple A = (Q, Σ, δ, q_0, F), where Q is the set of states, Σ is the alphabet, q_0 ∈ Q is the initial state, F ⊆ Q is the set of final states, and δ : Q × Σ → 2^Q is the transition mapping. The language recognized by A is denoted by L(A). The size of a finite automaton A is the number of its states plus the number of its transitions.

Position automaton

In order to construct a non-deterministic finite automaton recognizing L(E), Glushkov [7] introduced the following position sets: First(E) (resp. Last(E)) is the set of initial (resp. final) positions of words of L(Ē) and, for all x ∈ Pos_E, Follow(E, x) is the set of positions that immediately follow the position x in some word of L(Ē).
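As a concrete illustration of the measures defined in this section (alphabetic width, size and λ), here is a minimal Python sketch. The class names (`Zero`, `One`, `Sym`, `Plus`, `Dot`, `Star`) and the function names are assumptions made for this illustration only; they do not come from the paper. The product is nested to the left, which is one possible reading of (a + b)∗ · a · b · a.

```python
from dataclasses import dataclass


# Hypothetical syntax-tree encoding of regular expressions (not from the paper).
@dataclass(frozen=True)
class Zero:
    pass


@dataclass(frozen=True)
class One:
    pass


@dataclass(frozen=True)
class Sym:
    name: str


@dataclass(frozen=True)
class Plus:
    left: object
    right: object


@dataclass(frozen=True)
class Dot:
    left: object
    right: object


@dataclass(frozen=True)
class Star:
    arg: object


def width(e):
    """Alphabetic width ||E||: number of occurrences of symbols."""
    if isinstance(e, Sym):
        return 1
    if isinstance(e, (Plus, Dot)):
        return width(e.left) + width(e.right)
    if isinstance(e, Star):
        return width(e.arg)
    return 0  # Zero and One contain no symbol


def size(e):
    """Size |E|: number of nodes of the syntax tree."""
    if isinstance(e, (Plus, Dot)):
        return 1 + size(e.left) + size(e.right)
    if isinstance(e, Star):
        return 1 + size(e.arg)
    return 1  # leaves: Zero, One, Sym


def nullable(e):
    """lambda(E): 1 if the empty word belongs to L(E), else 0."""
    if isinstance(e, (One, Star)):
        return 1
    if isinstance(e, Plus):
        return max(nullable(e.left), nullable(e.right))
    if isinstance(e, Dot):
        return nullable(e.left) * nullable(e.right)
    return 0  # Zero and Sym


a, b = Sym("a"), Sym("b")
# E = (a + b)* . a . b . a + 1
E = Plus(Dot(Dot(Dot(Star(Plus(a, b)), a), b), a), One())
print(width(E), size(E), nullable(E))
```

On this expression the sketch reproduces the values stated in the text: alphabetic width 5 and size 12.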
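Formally, First is defined by induction according to the following rules:

\[
\begin{aligned}
\mathrm{First}(0) &= \mathrm{First}(1) = \emptyset, \\
\mathrm{First}(x) &= \{x\}, \\
\mathrm{First}(F + G) &= \mathrm{First}(F) \cup \mathrm{First}(G), \\
\mathrm{First}(F \cdot G) &= \text{if } \lambda(F) = 0 \text{ then } \mathrm{First}(F) \text{ else } \mathrm{First}(F) \cup \mathrm{First}(G), \\
\mathrm{First}(F^{*}) &= \mathrm{First}(F).
\end{aligned}
\]

The rules for Last(E) are obtained by substituting Last for First and replacing the last but one rule by

\[
\mathrm{Last}(F \cdot G) = \text{if } \lambda(G) = 0 \text{ then } \mathrm{Last}(G) \text{ else } \mathrm{Last}(F) \cup \mathrm{Last}(G).
\]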


For all x ∈ Pos_E, the set Follow(E, x) can be inductively computed as follows:

\[
\begin{aligned}
\mathrm{Follow}(0, x) &= \mathrm{Follow}(1, x) = \emptyset, \\
\mathrm{Follow}(F + G, x) &=
\begin{cases}
\mathrm{Follow}(F, x) & \text{if } x \in \mathrm{Pos}_E(F), \\
\mathrm{Follow}(G, x) & \text{otherwise,}
\end{cases} \\
\mathrm{Follow}(F \cdot G, x) &=
\begin{cases}
\mathrm{Follow}(F, x) & \text{if } x \in \mathrm{Pos}_E(F) \setminus \mathrm{Last}(F), \\
\mathrm{Follow}(F, x) \cup \mathrm{First}(G) & \text{if } x \in \mathrm{Last}(F), \\
\mathrm{Follow}(G, x) & \text{if } x \in \mathrm{Pos}_E(G),
\end{cases} \\
\mathrm{Follow}(F^{*}, x) &=
\begin{cases}
\mathrm{Follow}(F, x) & \text{if } x \in \mathrm{Pos}_E(F) \setminus \mathrm{Last}(F), \\
\mathrm{Follow}(F, x) \cup \mathrm{First}(F) & \text{if } x \in \mathrm{Last}(F).
\end{cases}
\end{aligned}
\]

We add a specific position, denoted by x_0 (where x_0 ∉ Pos_E), to the set Pos_E and we set Pos_0(E) = Pos_E ∪ {x_0}; the set Last_0(E) is equal to Last(E) if λ(E) = 0 and to Last(E) ∪ {x_0} otherwise; the set Follow_0(E, x) is equal to Follow(E, x) if x ∈ Pos_E and to First(E) if x = x_0. The position automaton of E [7, 10], denoted by P_E, whose states are the elements of Pos_0(E) and which recognizes L(E), is derived from the above-mentioned position sets as follows.

Definition 1. The position automaton of a regular expression E is the automaton P_E = (Pos_0(E), Σ, δ, {x_0}, Last_0(E)) such that: δ(x, a) = {y | y ∈ Follow_0(E, x) and h(y) = a}, ∀x ∈ Pos_0(E), ∀a ∈ Σ.

Example 1. Consider the regular expression E = (a∗b∗)∗ + (a + b)∗. Its linearized version is Ē = (a_1∗ b_2∗)∗ + (a_3 + b_4)∗. The associated position automaton is given in Figure 1.

[Figure: state diagram of the position automaton, with states x_0, a_1, b_2, a_3, b_4 and transitions labelled a and b.]

First(E) = {a_1, b_2, a_3, b_4}
Last(E) = {a_1, b_2, a_3, b_4}
Follow(E, a_1) = {a_1, b_2}
Follow(E, b_2) = {a_1, b_2}
Follow(E, a_3) = {a_3, b_4}
Follow(E, b_4) = {a_3, b_4}

Figure 1: the position automaton associated with E = (a∗b∗)∗ + (a + b)∗.
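The inductive rules for First, Last and Follow can be sketched directly in Python. The nested-tuple encoding of expressions and all function names below are assumptions made for this illustration; positions are written as strings such as "a1". The sketch is applied to the linearized expression of Example 1.

```python
# Expressions are nested tuples (an encoding chosen for this sketch only):
# ("0",), ("1",), ("sym", "a1"), ("+", F, G), (".", F, G), ("*", F)

def nullable(e):
    """lambda(E): True iff the empty word belongs to L(E)."""
    op = e[0]
    if op in ("1", "*"):
        return True
    if op == "+":
        return nullable(e[1]) or nullable(e[2])
    if op == ".":
        return nullable(e[1]) and nullable(e[2])
    return False  # "0" or a single position


def positions(e):
    """Set of positions occurring in e."""
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(positions(s) for s in e[1:]))


def first(e):
    op = e[0]
    if op == "sym":
        return {e[1]}
    if op == "+":
        return first(e[1]) | first(e[2])
    if op == ".":
        return (first(e[1]) | first(e[2])) if nullable(e[1]) else first(e[1])
    if op == "*":
        return first(e[1])
    return set()  # "0" and "1"


def last(e):
    op = e[0]
    if op == "sym":
        return {e[1]}
    if op == "+":
        return last(e[1]) | last(e[2])
    if op == ".":
        return (last(e[1]) | last(e[2])) if nullable(e[2]) else last(e[2])
    if op == "*":
        return last(e[1])
    return set()  # "0" and "1"


def follow(e, x):
    op = e[0]
    if op == "+":
        return follow(e[1], x) if x in positions(e[1]) else follow(e[2], x)
    if op == ".":
        if x in positions(e[2]):
            return follow(e[2], x)
        f = follow(e[1], x)
        return (f | first(e[2])) if x in last(e[1]) else f
    if op == "*":
        f = follow(e[1], x)
        return (f | first(e[1])) if x in last(e[1]) else f
    return set()  # "0", "1" and single positions


a1, b2, a3, b4 = (("sym", s) for s in ("a1", "b2", "a3", "b4"))
# Linearized version of E = (a*b*)* + (a + b)*
E = ("+", ("*", (".", ("*", a1), ("*", b2))), ("*", ("+", a3, b4)))

for x in sorted(positions(E)):
    print(x, sorted(follow(E, x)))
```

The computed sets agree with those listed next to Figure 1. The sketch recomputes `positions` and `last` at every call, so it is meant for reading rather than for efficiency.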

Follow automaton

Recently, Ilie and Yu [9] introduced the follow automaton F_E of a regular expression E. This automaton is smaller than the position automaton and, on average, faster to compute. There exist several algorithms that compute the follow automaton F_E in quadratic time w.r.t. the size of the expression E [9, 4]. Let us set

x ≡_f y ⇔ (Follow_0(E, x) = Follow_0(E, y) and (x ∈ Last_0(E) ⇔ y ∈ Last_0(E))).

As shown in [9], we have F_E ≃ P_E/≡_f. Hence the follow automaton of a regular expression is a quotient of its position automaton.
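The relation ≡_f groups positions by the pair (Follow_0 set, membership in Last_0). The following Python sketch computes the classes from the position sets of Example 1; the dictionary layout and the name `follow_classes` are assumptions made for this illustration. The entry for x_0 is derived from the definitions above: Follow_0(E, x_0) = First(E), and x_0 ∈ Last_0(E) because λ(E) = 1 for this expression.

```python
from collections import defaultdict

X0 = "x0"

# Follow_0 sets for E = (a*b*)* + (a + b)*, taken from Example 1.
follow0 = {
    X0: {"a1", "b2", "a3", "b4"},  # Follow_0(E, x0) = First(E)
    "a1": {"a1", "b2"},
    "b2": {"a1", "b2"},
    "a3": {"a3", "b4"},
    "b4": {"a3", "b4"},
}
# Last_0(E) = Last(E) ∪ {x0}, since lambda(E) = 1 here.
last0 = {"a1", "b2", "a3", "b4", X0}


def follow_classes(follow0, last0):
    """Group positions that have the same Follow_0 set and the same finality."""
    groups = defaultdict(set)
    for x, fol in follow0.items():
        groups[(frozenset(fol), x in last0)].add(x)
    return sorted(sorted(c) for c in groups.values())


classes = follow_classes(follow0, last0)
print(classes)
```

Each resulting class is one state of the quotient P_E/≡_f.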


Example 2. For the regular expression E = (a∗b∗)∗ + (a + b)∗, the associated follow automaton is shown in Figure 2; it is constructed from the position automaton by computing the equivalence relation ≡_f.

Classes of ≡_f: {x_0}, {a_1, b_2}, {a_3, b_4}.

[Figure: state diagram with states {x_0}, {a_1, b_2}, {a_3, b_4} and transitions labelled a, b.]

Figure 2: the follow automaton associated with E = (a∗b∗)∗ + (a + b)∗.

C-continuation automaton

This automaton has been introduced by Champarnaud and Ziadi [5] in order to efficiently compute the equation automaton that we will present later. In this section, we recall the notions of c-derivative, c-continuation and c-continuation automaton.

Definition 2 (c-derivative w.r.t. a symbol). The c-derivative of a regular expression E w.r.t. a symbol a, written d_a(E), is defined by

\[
\begin{aligned}
d_a(0) &= d_a(1) = 0, \\
d_a(x) &= \text{if } a = x \text{ then } 1 \text{ else } 0, \\
d_a(F + G) &= \text{if } d_a(F) \neq 0 \text{ then } d_a(F) \text{ else } d_a(G), \\
d_a(F \cdot G) &= \text{if } d_a(F) \neq 0 \text{ then } d_a(F) \cdot G \text{ else } \lambda(F) \cdot d_a(G), \\
d_a(F^{*}) &= d_a(F) \cdot F^{*}.
\end{aligned}
\]

The extension to a word u = u_1 ... u_n follows the equations: d_ε(E) = E and d_{u_1...u_n}(E) = d_{u_2...u_n}(d_{u_1}(E)).

Theorem 1. [5] If E is linear, for every symbol a and every word u, the c-derivative d_{ua}(E) of E w.r.t. the word ua is either 0 or unique.

Theorem 1 allows us to define the c-continuation c_a(E) of a in a linear expression E, that is, the unique value of the non-null c-derivatives d_{ua}(E).

Proposition 1. [5] For every symbol a of a linear expression E, the c-continuation c_a(E) is such that:

\[
\begin{aligned}
c_a(a) &= 1, \\
c_a(F + G) &= \text{if } c_a(F) \neq 0 \text{ then } c_a(F) \text{ else } c_a(G), \\
c_a(F \cdot G) &= \text{if } c_a(F) \neq 0 \text{ then } c_a(F) \cdot G \text{ else } \lambda(F) \cdot c_a(G), \\
c_a(F^{*}) &= c_a(F) \cdot F^{*}.
\end{aligned}
\]

Corollary 1. [5] For every symbol a of a linear expression E, the c-continuation c_a(E) is either 1 or a subexpression of E or a product of subexpressions.
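To make Definition 2 and the notion of c-continuation concrete, here is a minimal Python sketch for linear expressions. It computes c-derivatives exactly as in Definition 2, simplifying products with the identities 0 · E = 0 = E · 0 and 1 · E = E = E · 1, and obtains each c-continuation as the non-null c-derivative found by exploring words (it is unique by Theorem 1). The tuple encoding and the names `cderiv` and `continuations` are assumptions made for this illustration, and the search strategy is a simple worklist, not the authors' algorithm.

```python
# Expressions are nested tuples (an encoding chosen for this sketch only):
# ("0",), ("1",), ("sym", "a1"), ("+", F, G), (".", F, G), ("*", F)
ZERO, ONE = ("0",), ("1",)


def nullable(e):
    op = e[0]
    if op in ("1", "*"):
        return True
    if op == "+":
        return nullable(e[1]) or nullable(e[2])
    if op == ".":
        return nullable(e[1]) and nullable(e[2])
    return False


def positions(e):
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(positions(s) for s in e[1:]))


def dot(f, g):
    """Product F.G reduced by 0.E = 0 = E.0 and 1.E = E = E.1."""
    if f == ZERO or g == ZERO:
        return ZERO
    if f == ONE:
        return g
    if g == ONE:
        return f
    return (".", f, g)


def cderiv(e, a):
    """c-derivative d_a(e) of a linear expression e (Definition 2)."""
    op = e[0]
    if op == "sym":
        return ONE if e[1] == a else ZERO
    if op == "+":
        d = cderiv(e[1], a)
        return d if d != ZERO else cderiv(e[2], a)
    if op == ".":
        d = cderiv(e[1], a)
        if d != ZERO:
            return dot(d, e[2])
        # lambda(F) . d_a(G)
        return cderiv(e[2], a) if nullable(e[1]) else ZERO
    if op == "*":
        return dot(cderiv(e[1], a), e)
    return ZERO  # "0" and "1"


def continuations(e):
    """Map each position x to the non-null c-derivative d_{ux}(e)."""
    pos = sorted(positions(e))
    cont, todo = {}, [e]
    while todo:
        cur = todo.pop()
        for x in pos:
            d = cderiv(cur, x)
            if d != ZERO and x not in cont:
                cont[x] = d
                todo.append(d)
    return cont


a1, b2, a3, b4 = (("sym", s) for s in ("a1", "b2", "a3", "b4"))
S12 = ("*", ("+", a1, b2))  # (a1 + b2)*
S34 = ("*", ("+", a3, b4))  # (a3 + b4)*
E = ("+", S12, S34)         # (a1 + b2)* + (a3 + b4)*
cont = continuations(E)
```

For the linear expression (a_1 + b_2)∗ + (a_3 + b_4)∗, the continuations of a_1 and b_2 are both (a_1 + b_2)∗, and those of a_3 and b_4 are both (a_3 + b_4)∗.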

More precisely, for a linear regular expression E, we have c_a(E) = H_1 ··· H_n, where H_i is a subexpression of E, for all 1 ≤ i ≤ n. This decomposition is fundamental for stating the main theorem of this paper.

We now consider a regular expression E over Σ. Let Ē be the linearized version of E over Pos_E and h be the mapping from Pos_E onto Σ. We assume that x_0 is a symbol not in Pos_E. Let c_{x_0}(Ē) = d_ε(Ē) = Ē. By convention, c_x(E) is equal to c_x(Ē).

Definition 3 (c-continuation automaton). The c-continuation automaton of E, C_E = (Q, Σ, i, T, δ), is defined by:
• Q = {(x, c_x(E)) | x ∈ Pos_0(E)},
• i = (x_0, c_{x_0}(E)),
• T = {(x, c_x(E)) | λ(c_x(E)) = 1},
• δ((x, c_x(E)), a) = {(y, c_y(E)) | h(y) = a and d_y(c_x(E)) ≡ c_y(E)}, ∀x ∈ Pos_0(E) and ∀a ∈ Σ.

Example 3. Consider the regular expression E = (a + b)∗ + (a + b)∗. Its linearized version is Ē = (a_1 + b_2)∗ + (a_3 + b_4)∗. We show in Figure 3 the c-continuation automaton associated with E.

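[Figure: state diagram of the c-continuation automaton, with states c_{x_0}(E), c_{a_1}(E), c_{b_2}(E), c_{a_3}(E), c_{b_4}(E) and transitions labelled a and b.]

c_{x_0}(E) = (a_1 + b_2)∗ + (a_3 + b_4)∗
c_{a_1}(E) = (a_1 + b_2)∗
c_{b_2}(E) = (a_1 + b_2)∗
c_{a_3}(E) = (a_3 + b_4)∗
c_{b_4}(E) = (a_3 + b_4)∗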

Figure 3: the c-continuation automaton associated with E = (a + b)∗ + (a + b)∗.

The position and c-continuation automata are isomorphic. The relation that exists between the First, Last and Follow sets and the c-continuations is highlighted by the following proposition, which will be useful in the sequel.

Proposition 2. [5] The following equalities hold:
1. First(E) = {y ∈ Pos_E | c_y(E) ≠ 0};
2. Last(E) = {y ∈ Pos_E | λ(c_y(E)) = 1};
3. Follow(E, x) = {y ∈ Pos_E | d_y(c_x(E)) ≠ 0}.

Equation automaton

The equation automaton E_E of a regular expression E has been defined by Mirkin [11] and by Antimirov [1]. Champarnaud and Ziadi have proved that the equation automaton is a quotient of the c-continuation automaton [5]. Let us consider the equivalence relation ≡_e defined by

(x, c_x(E)) ≡_e (y, c_y(E)) ⇔ h(c_x(E)) ≡ h(c_y(E)).
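The relation ≡_e can be illustrated by applying h to each c-continuation and grouping the states whose images coincide. In the Python sketch below, the c-continuations of Example 3 are entered as strings, h is implemented by erasing the position subscripts, and string equality stands in for the syntactic identity ≡; this is an assumption of the sketch and only works when the expressions are written in one fixed textual form. The names `h` and `equation_classes` are chosen for this illustration.

```python
import re
from collections import defaultdict

# c-continuations of E = (a1 + b2)* + (a3 + b4)*, as listed with Figure 3.
continuations = {
    "x0": "(a1+b2)*+(a3+b4)*",
    "a1": "(a1+b2)*",
    "b2": "(a1+b2)*",
    "a3": "(a3+b4)*",
    "b4": "(a3+b4)*",
}


def h(expr):
    """Erase position subscripts, e.g. 'a1' becomes 'a'."""
    return re.sub(r"([a-z])\d+", r"\1", expr)


def equation_classes(continuations):
    """Group positions x by the image h(c_x(E))."""
    groups = defaultdict(list)
    for x, c in continuations.items():
        groups[h(c)].append(x)
    return {image: sorted(xs) for image, xs in groups.items()}


classes = equation_classes(continuations)
for image, xs in classes.items():
    print(image, xs)
```

For this expression the grouping yields two classes, one for the initial state and one containing the four positions.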

Theorem 2. [5] The equation automaton is a quotient of the c-continuation automaton. We have: E_E ≃ C_E/≡_e.

Example 4. Consider the regular expression E = (a + b)∗ + (a + b)∗ of Example 3. One has h(c_{a_1}(E)) = h(c_{b_2}(E)) = h(c_{a_3}(E)) = h(c_{b_4}(E)) = (a + b)∗ and the equation automaton associated with E is shown in Figure 4.

Classes of ≡_e: {(a + b)∗ + (a + b)∗}, {(a + b)∗}.

[Figure: state diagram with states E and (a + b)∗ and transitions labelled a, b.]

Figure 4: the equation automaton associated with E = (a + b)∗ + (a + b)∗.

Example 5. Let us recall that the size of an automaton is the sum of the number of states and of the number of transitions.

1. Consider the regular expression E_1^n = (a_1∗ + a_2∗ + ··· + a_n∗)∗. One has: |P_{E_1^n}| = |E_{E_1^n}| = n² + n + 1 and |F_{E_1^n}| = 2n + 2.

[Figure: state diagram with states x_0 and {a_1, ···, a_n} and transitions labelled a_1, ···, a_n.]
Figure 5: the follow automaton F_{E_1^n}.

[Figure: state diagram with states E_1^n, a_1∗ E_1^n, a_2∗ E_1^n, ..., a_n∗ E_1^n and transitions labelled a_1, a_2, ..., a_n.]
Figure 6: the equation automaton E_{E_1^n}.

2. Consider the regular expression E_2^n = a_1 (a_2∗ + 1) a_3 ··· a_{2n−1} (a_{2n}∗ + 1) a_{2n+1}. One has: |P_{E_2^n}| = |E_{E_2^n}| = 6n + 3 and |F_{E_2^n}| = 4n + 2.

[Figure: state diagram with states x_0, {a_1, a_2}, {a_3, a_4}, ..., {a_{2n−1}, a_{2n}}, {a_{2n+1}} and transitions labelled a_1, a_2, a_3, a_4, ..., a_{2n}, a_{2n+1}.]
Figure 7: the follow automaton F_{E_2^n}.

[Figure: state diagram with transitions labelled a_1, a_2, a_3, ..., a_{2n}, a_{2n+1}.]
Figure 8: the equation automaton E_{E_2^n}.

3 Follow automaton versus equation automaton

In this section, we show that the equation automaton of a normalized regular expression never has more states or more transitions than the associated follow automaton.

3.1 Normalized expressions

Definition 4. A regular expression E is said to be normalized if the following conditions hold:
1. The expression E is reduced according to:
 - the identities: 0 + E = E = E + 0, 1 · E = E = E · 1, 0 · E = 0 = E · 0,
 - the rule: for all subexpressions H of E, H = F + 1 ⇒ λ(F) = 0.
2. The operation "·" is left associative, i.e. H = F · G ⇒ op(G) ≠ "·".
3. The expression E is in star normal form, i.e. for every subexpression F∗ of E, it holds: ∀x ∈ Last(F), Follow(F, x) ∩ First(F) = ∅.

Example 6. Here are some expressions and their normalized forms:
• (a∗ + 1)′ = a∗.
• ((a∗b∗)∗)′ = (a + b)∗.

It is shown in [4] that, given a regular expression E, it is possible to construct an equivalent normalized regular expression in O(|E|) time and space.

Proposition 3. Let E be a regular expression and E′ its normalized form. One has F_E ≡ F_{E′}.

Proof. It is easy to show that Pos_E = Pos_{E′}. Let E• be the star normal form of E. It is shown in [2] that P_E ≡ P_{E•}. Since reducing E• according to Definition 4.1 does not modify the positions and the transitions in the position automaton P_{E•}, we have: Pos_{E•′} = Pos_{E•} = Pos_{E′} and P_{E•} = P_{E′}. Finally, by applying the equivalence relation ≡_f, we get F_E ≡ F_{E′}.
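Condition 3 of Definition 4 can be checked mechanically from the First, Last and Follow sets. The Python sketch below tests the star normal form condition on the linearized versions of the two expressions of the second item of Example 6. The tuple encoding and the names `star_bodies` and `is_star_normal` are assumptions made for this illustration; the sketch only checks the condition and does not perform the linear-time normalization of [4].

```python
# Expressions are nested tuples (an encoding chosen for this sketch only):
# ("0",), ("1",), ("sym", "a1"), ("+", F, G), (".", F, G), ("*", F)

def nullable(e):
    op = e[0]
    if op in ("1", "*"):
        return True
    if op == "+":
        return nullable(e[1]) or nullable(e[2])
    if op == ".":
        return nullable(e[1]) and nullable(e[2])
    return False


def positions(e):
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(positions(s) for s in e[1:]))


def first(e):
    op = e[0]
    if op == "sym":
        return {e[1]}
    if op == "+":
        return first(e[1]) | first(e[2])
    if op == ".":
        return (first(e[1]) | first(e[2])) if nullable(e[1]) else first(e[1])
    return first(e[1]) if op == "*" else set()


def last(e):
    op = e[0]
    if op == "sym":
        return {e[1]}
    if op == "+":
        return last(e[1]) | last(e[2])
    if op == ".":
        return (last(e[1]) | last(e[2])) if nullable(e[2]) else last(e[2])
    return last(e[1]) if op == "*" else set()


def follow(e, x):
    op = e[0]
    if op == "+":
        return follow(e[1], x) if x in positions(e[1]) else follow(e[2], x)
    if op == ".":
        if x in positions(e[2]):
            return follow(e[2], x)
        f = follow(e[1], x)
        return (f | first(e[2])) if x in last(e[1]) else f
    if op == "*":
        f = follow(e[1], x)
        return (f | first(e[1])) if x in last(e[1]) else f
    return set()


def star_bodies(e):
    """Yield F for every subexpression F* of e."""
    if e[0] == "*":
        yield e[1]
    for sub in e[1:]:
        if isinstance(sub, tuple):
            yield from star_bodies(sub)


def is_star_normal(e):
    """For every F* in e and every x in Last(F): Follow(F, x) ∩ First(F) = ∅."""
    return all(
        not (follow(f, x) & first(f))
        for f in star_bodies(e)
        for x in last(f)
    )


a1, b2 = ("sym", "a1"), ("sym", "b2")
not_snf = ("*", (".", ("*", a1), ("*", b2)))  # (a1* b2*)*
snf = ("*", ("+", a1, b2))                    # (a1 + b2)*
print(is_star_normal(not_snf), is_star_normal(snf))
```

The first expression fails the test because, for F = a_1∗ b_2∗, the position a_1 is in Last(F) and Follow(F, a_1) already contains positions of First(F); the second expression passes.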

Proposition 4. Let x be a position of a regular expression E. Then one has

\[
\bigl( x \in \mathrm{First}(E) \ \text{and} \ \lambda(c_x(E)) = 1 \bigr) \Rightarrow \lambda(E) = 1.
\]

Proof. The proof is by induction on the alphabetic width of E. Let us first consider the case ||E|| = 1. Since E is normalized, its form is either a + 1 or a∗, and it is then easy to prove that the proposition is true. We now suppose that the proposition is satisfied for normalized expressions F and G, and we prove that it is satisfied for the normalized expressions F + G, F · G and F∗.

Case E = F + G: without loss of generality we suppose that x ∈ Pos_E(F). In this case c_x(E) = c_x(F). Then λ(c_x(E)) = λ(c_x(F)) = 1. Since x ∈ First(E), one has x ∈ First(F). By the inductive hypothesis, λ(F) = 1. Consequently λ(E) = 1.

Case E = F · G: if x ∈ Pos_E(F), then x ∈ First(F) and c_x(F · G) = c_x(F) · G. This implies that λ(c_x(E)) = λ(c_x(F) · G) = 1. Thus λ(c_x(F)) = 1 and λ(G) = 1. Applying the inductive hypothesis to F, we get λ(F) = 1. Thus λ(E) = 1. If x ∈ Pos_E(G), then x ∈ First(G) and, since x ∈ First(E), λ(F) = 1. Since c_x(F · G) = c_x(G) and λ(c_x(F · G)) = 1, it follows that λ(c_x(G)) = 1. By the inductive hypothesis we have λ(G) = 1. Consequently λ(E) = 1.

Case E = F∗: it is obvious.

Theorem 3. Let x and y be two positions of a normalized regular expression E. Then the following conditions are equivalent:
i) c_x(E) ≡ c_y(E),
ii) ∀a ∈ Pos_E, d_a(c_x(E)) ≡ d_a(c_y(E)) and λ(c_x(E)) = λ(c_y(E)),
iii) Follow(E, x) = Follow(E, y) and [x ∈ Last(E) ⇔ y ∈ Last(E)].

Proof. i) ⇒ ii) is obvious. ii) ⇔ iii) is a straightforward consequence of Proposition 4. ii) ⇒ i): we proceed by contradiction. We suppose that ∀a ∈ Pos_E, d_a(c_x(E)) ≡ d_a(c_y(E)) and λ(c_x(E)) = λ(c_y(E)), and also that c_x(E) ≢ c_y(E). Since x and y are two different positions of E, there exists a subexpression E_x ⋄ E_y of E with ⋄ ∈ {+, ·} such that x ∈ Pos_E(E_x) and y ∈ Pos_E(E_y).

Let us suppose that ⋄ = "+". Since c_x(E) ≢ c_y(E), one has c_x(E) = A_1 ··· A_α · C_1 ··· C_m and c_y(E) = B_1 ··· B_β · C_1 ··· C_m with α + β ≥ 1, Pos_E(A_i) ⊆ Pos_E(E_x) and Pos_E(B_j) ⊆ Pos_E(E_y) for all 1 ≤ i ≤ α and 1 ≤ j ≤ β.

In the case where C_1 ··· C_m ≠ 0, we have three sub-cases: (a) α = 0, β ≥ 1, (b) α ≥ 1, β = 0 and (c) α ≥ 1, β ≥ 1. We limit ourselves to sub-case (a); the proofs for (b) and (c) are similar. In sub-case (a), c_x(E) and c_y(E) have the form c_x(E) = C_1 ··· C_m and c_y(E) = B_1 ··· B_β · C_1 ··· C_m. Let b ∈ First(B_1). By Proposition 2, we have d_b(c_y(E)) ≠ 0. By ii), this implies that d_b(c_x(E)) ≠ 0. Consequently we have b ∈ First(C_1 ··· C_m). Since b ∈ Pos_E(E_y), there exists k, 1 ≤ k ≤ m, such that C_k = F∗ and λ(C_1 ··· C_k) = 1 with b ∈ First(F). It follows that

First(B_1) ∩ First(F) ≠ ∅.    (1)

On the other hand, since c_y(E) = B_1 ··· B_β · C_1 ··· C_m, there exists a subexpression A_y of E such that y ∈ Pos_E(A_y) and A_y contains no occurrence of "·" nor of "∗". So we necessarily have that either A_y · B_1 or A_y∗ = B_1 is a subexpression of E. Since x ∈ First(E_x) and λ(C_1 ··· C_k) = 1, we have x ∈ First(F), which implies by Proposition 2 that d_x(c_x(E)) ≠ 0. By ii) one has d_x(c_y(E)) ≠ 0. Since x ∉ Pos_E(E_y), we have λ(B_1 ··· B_β C_1 ··· C_k) = 1. Hence y ∈ Last(F). Thus

Last(A_y) ∩ Last(F) ≠ ∅.    (2)

Finally, for the subexpression F∗ = C_k of E, there exists y ∈ Last(F) such that b ∈ First(B_1) ⊆ First(F) ∩ Follow(F, y). This contradicts the fact that E is in star normal form.

In the case where C_1 ··· C_m = 0, c_x(E) and c_y(E) have the form c_x(E) = A_1 ··· A_α and c_y(E) = B_1 ··· B_β with α ≥ 1 and β ≥ 1. Let b ∈ First(B_1). By Proposition 2 we have d_b(c_y(E)) ≠ 0. By ii) one has d_b(c_x(E)) ≠ 0. It holds: b ∉ Pos_E(E_x) ⇒ b ∉ Pos_E(A_i), for all 1 ≤ i ≤ α. Hence a contradiction.

Let us suppose that ⋄ = "·". Then op(E_y) ≠ "·" and c_x(E) and c_y(E) have the form

\[
c_x(E) = A_1 \cdots A_\alpha \cdot E_y \cdot C_1 \cdots C_m, \qquad
c_y(E) =
\begin{cases}
B_1 \cdots B_\beta \cdot E_y \cdot C_1 \cdots C_m & \text{where } E_y = F^{*}, \\
\text{or} \\
B_1 \cdots B_\beta \cdot C_1 \cdots C_m & \text{where } E_y = F + G.
\end{cases}
\]

Let us discuss the case where C_1 ··· C_m = 0. We have

\[
c_x(E) = A_1 \cdots A_\alpha \cdot E_y, \qquad
c_y(E) =
\begin{cases}
B_1 \cdots B_\beta \cdot E_y & \text{where } E_y = F^{*}, \\
\text{or} \\
B_1 \cdots B_\beta & \text{where } E_y = F + G.
\end{cases}
\]

If α ≠ 0, let a ∈ First(A_1). By Proposition 2 we have d_a(c_x(E)) ≠ 0. By ii) it holds that d_a(c_y(E)) ≠ 0. One has a ∉ Pos_E(E_y). Hence a contradiction. If α = 0, then c_x(E) = E_y and

\[
c_y(E) =
\begin{cases}
B_1 \cdots B_\beta \cdot E_y & \text{where } E_y = F^{*}, \\
\text{or} \\
B_1 \cdots B_\beta & \text{where } E_y = F + G.
\end{cases}
\]

Case c_y(E) = B_1 ··· B_β · E_y. The proof is similar to that of sub-case (a) of the case where ⋄ = "+". There exists b ∈ First(B_1) ⊆ Follow(F, y) ∩ First(F) with F∗ = E_y and y ∈ Last(F). This contradicts the fact that E is in star normal form.

Case c_y(E) = B_1 ··· B_β. We proceed in a similar way and we deduce that there exists k, 1 ≤ k ≤ β, such that B_k = F∗ and b ∈ First(B_1) ⊆ Follow(F, y) ∩ First(F) and y ∈ Last(F), which contradicts the fact that E is in star normal form.

Let us discuss the case where C_1 ··· C_m ≠ 0. There are two cases. The first one is when E_y = H∗. Let c_x(E) = A_1 ··· A_α · E_y · C_1 ··· C_m and c_y(E) = B_1 ··· B_β · E_y · C_1 ··· C_m with α + β ≥ 1. We consider two sub-cases: (d) α = 0 and (e) α ≥ 1. The proof for (d) is similar to that of the case where C_1 ··· C_m = 0. In sub-case (e), by a similar reasoning, there exists k, 1 ≤ k ≤ α, such that C_k = F∗ and a ∈ First(A_1) ⊆ Follow(F, x) ∩ First(F) and x ∈ Last(F), which contradicts the fact that E is in star normal form.

The second one is when E_y = F + G. Let us suppose that y ∈ Pos_E(F). We have two sub-cases:

Case G ≠ 1: in a similar way, there exists k, 1 ≤ k ≤ α, such that C_k = F∗, t ∈ First(E_y) ⊆ Follow(F, z) ∩ First(F) and z ∈ Last(F) ∩ Last(E_x). This contradicts the fact that E is in star normal form.

Case G = 1: let a ∈ First(A_1). By Proposition 2 we have d_a(c_x(E)) ≠ 0. By ii) one has d_a(c_y(E)) ≠ 0. Thus, there exists k, 1 ≤ k ≤ m, such that λ(B_1 ··· B_β C_1 ··· C_k) = 1 and C_k = H∗ with a ∈ First(H). Hence y ∈ First(H). Since λ(B_1 ··· B_β C_1 ··· C_k) = 1, we have λ(B_1 ··· B_β) = 1. One has c_y(F) = B_1 ··· B_β and λ(c_y(F)) = λ(B_1 ··· B_β) = 1, which implies that y ∈ First(F). By Proposition 4 one has λ(F) = 1, which contradicts Definition 4.1 for the subexpression E_y = F + 1 of E.

As a corollary of Theorem 2, the equation automaton is a quotient of the position automaton by the equivalence relation

x ≡_e y ⇔ h(c_x(E)) ≡ h(c_y(E)).

By Theorem 3, for a normalized expression it holds that x ≡_f y ⇔ c_x(E) ≡ c_y(E). We conclude that x ≡_f y implies x ≡_e y. Hence the theorem:

Theorem 4. For a normalized regular expression, the size of the equation automaton is smaller than the size of the follow automaton.

Example 7.
1. Consider the regular expression E = (a∗b∗)∗ + (a + b)∗. Its follow automaton is shown in Figure 2 and its normalized form is E′ = (a + b)∗ + (a + b)∗. The equation automata respectively associated with E and E′ are given in the following figures:

[Figure: state diagram with states E, (a∗b∗)∗ and (a + b)∗ and transitions labelled a, b.]
Figure 9: the equation automaton associated with E = (a∗b∗)∗ + (a + b)∗.

[Figure: state diagram with states E′ and (a + b)∗ and transitions labelled a, b.]
Figure 10: the equation automaton associated with E′ = (a + b)∗ + (a + b)∗.

2. Consider the regular expression E_1^n = (a_1∗ + a_2∗ + ··· + a_n∗)∗. Its normalized form is (E_1^n)′ = (a_1 + a_2 + ··· + a_n)∗ and the equation automaton associated with (E_1^n)′, given in the following figure, has only one state. One has |F_{(E_1^n)′}| = 2n + 2 and |E_{(E_1^n)′}| = n + 1.

[Figure: state diagram with the single state (a_1 + a_2 + ··· + a_n)∗ and a loop labelled a_1, ···, a_n.]
Figure 11: the equation automaton E_{(E_1^n)′}.

3. Consider the regular expression E_2^n = a_1 (a_2∗ + 1) a_3 ··· a_{2n−1} (a_{2n}∗ + 1) a_{2n+1}. Its normalized form is (E_2^n)′ = a_1 a_2∗ a_3 ··· a_{2n−1} a_{2n}∗ a_{2n+1} and the equation automaton associated with (E_2^n)′ is given in the following figure. One has: |F_{(E_2^n)′}| = |E_{(E_2^n)′}| = 4n + 2.

[Figure: state diagram with states (E_2^n)′, a_2∗ ··· a_{2n+1}, a_4∗ ··· a_{2n+1}, ..., a_{2n}∗ a_{2n+1}, ε and transitions labelled a_1, a_2, a_3, a_4, ..., a_{2n}, a_{2n+1}.]
Figure 12: the equation automaton E_{(E_2^n)′}.

4 Conclusion

We have settled the conjecture stated by Ilie and Yu by exhibiting a family of regular expressions, namely the normalized expressions, such that, on the one hand, any regular expression can be turned into an equivalent normalized one in linear time and space w.r.t. the size of the expression and, on the other hand, the size of the equation automaton of a normalized expression is always smaller than the size of its follow automaton. The question that we are now considering is how to construct the equation automaton directly from the position automaton.

References

[1] Antimirov V., Partial derivatives of regular expressions and finite automaton constructions, Theoret. Comput. Sci., 155, 291–319, 1996.
[2] Brüggemann-Klein A., Regular Expressions into Finite Automata, Theoret. Comput. Sci., 120, 197–213, 1993.
[3] Champarnaud J.-M. and Ziadi D., From C-Continuations to New Quadratic Algorithms for Automaton Synthesis, Intern. Journ. of Alg. and Comp., 11–6 (2001), 707–735.
[4] Champarnaud J.-M., Nicart F. and Ziadi D., From the ZPC-structure of a regular expression to its follow automaton, Intern. Journ. of Alg. and Comp., 16–1 (2006), 17–34.
[5] Champarnaud J.-M. and Ziadi D., Canonical derivatives, partial derivatives and finite automaton constructions, Theoret. Comput. Sci., 289, 137–163, 2002.
[6] Chang C.-H. and Paige R., From Regular Expressions to DFA's Using Compressed NFA's, Theoret. Comput. Sci., 178, 1–36, 1997.
[7] Glushkov V.M., The Abstract Theory of Automata, Russian Math. Surveys, 16, 1–53, 1961.
[8] Hopcroft J.E. and Ullman J.D., Introduction to Automata Theory, Languages and Computation, Addison-Wesley, Reading, MA, 1979.
[9] Ilie L. and Yu S., Follow automata, Information and Computation, 186, 140–162, 2003.
[10] McNaughton R. and Yamada H., Regular Expressions and State Graphs For Automata, IEEE Trans. on Electronic Computers, 9-1, 39–47, 1960.
[11] Mirkin B.G., An algorithm for constructing a base in a language of regular expressions, Engineering Cybernetics, 5, 110–116, 1966.
[12] Ponty J.-L., Ziadi D. and Champarnaud J.-M., A new Quadratic Algorithm to Convert a Regular Expression into an Automaton, in: D. Raymond, D. Wood, eds., Proc. of WIA'96, Lecture Notes in Computer Science 1260, Springer-Verlag, 109–110, 1997.
[13] Yu S., Regular languages, in: G. Rozenberg, A. Salomaa, Handbook of Formal Languages, Vol. I, Springer-Verlag, Berlin, 41–110, 1997.
[14] Ziadi D., Ponty J.-L. and Champarnaud J.-M., Passage d'une expression rationnelle à un automate fini non-déterministe, Journées Montoises (1995), Bull. Belg. Math. Soc. 4, 177–203, 1997.

13

Abstract There exist two well-known quotients of the position automaton of a regular expression. The first one, called the equation automaton, has first been introduced by Mirkin from the notion of prebase and has been redefined by Antimirov from the notion of partial derivative. The second one, due to Ilie and Yu and called the follow automaton, can be obtained by eliminating ε-transitions in an ε-NFA that is always smaller than the classical ε-NFAs (Thompson, Sippu and Soisalon–Soininen). Ilie and Yu discuss of the difficulty to succeed in a theoretical comparison between the size of the follow automaton and the size of the equation automaton and conclude that it is very likely necessary to realize experimental studies. In this paper we solve the theoretical question, by first defining a set of regular expressions, called normalized expressions and such that every regular expression can be normalized in linear time, and proving then that the equation automaton of a normalized expression is always smaller than its follow automaton.

1

Introduction

For the last fifty years, the conversion of regular expressions into automata has raised the interest of computer scientists. In 1960-61, Glushkov [7] and McNaughton and Yamada [10] have defined the position automaton of a regular expression. In 1964, Mirkin [11] has introduced the notion of prebase, described a construction of the equation automaton and proved that this automaton is smaller in number of states than the position automaton. Since the nineties, researchers have been interested by designing efficient algorithms for the computation of these automata. First of all, there exist three well-known algorithms for computing the position automaton. The first one, due to Br¨ uggemann-Klein [2] makes use of the notion of star normal form of a regular expression. The second one is due to Chang and Paige [6] and it is based on a lazy computation technique. The third one has been designed by Ziadi, Ponty and Champarnaud [14, 12] and it is built on the so-called ZPC-structure. The complexity of these three algorithms is quadratic w.r.t. the size of the regular expression. Concerning the equation automaton, there are two main algorithms. Antimirov’s algorithm [1] is based on the computation of the set of partial derivatives of the

1

expression, whereas the algorithm designed by Champarnaud and Ziadi [3] computes the equation automaton as the quotient of the c-continuation automaton by an equivalence relation that will be denoted by ≡e . Recently, Ilie and Yu [9] have introduced a new automaton called the follow automaton and shown that it is a quotient of the position automaton by an equivalence relation denoted by ≡f . It has been shown in [4] that the follow automaton can be efficiently computed from the ZPC-structure of the expression. Ilie and Yu have produced families of expressions for which the follow automaton has a smaller number of states than the equation automaton and other families for which it is the contrary. They finally conjecture that the only way to decide which automaton is better is probably by testing them in real-life applications. In this paper, we solve this challenging question. Indeed, we show that any regular expression can be reduced in a normalized form in linear time, and that the equation automaton of a normalized expression has a smaller number of states than its follow automaton. The paper is organized as follows. Section 2 presents terminology, recalls the definition of the three above-mentioned automata as well as the computation of the c-continuations of a regular expression, which turns out to be a useful tool for proving our results. In Section 3, we first define the set of normalized expressions and then we show that for a normalized expression the equivalence relation ≡e is coarser than the equivalence relation ≡f .

2

Preliminaries

In this section we recall some basic definitions and properties about regular expressions and finite automata. For more details, we refer to [8, 13]. Regular expressions and automata Let Σ be a non-empty finite set of symbols, called an alphabet. The set of all the words over Σ is denoted by Σ∗ . The empty word is denoted by ε. A language over Σ is a subset of Σ∗ . Regular expressions over an alphabet Σ and regular languages that they denote are inductively defined as follows: (1) 0 is a regular expression denoting the language L(0) = ∅. (2) 1 is a regular expression denoting the language L(1) = {ε}. (3) a, for all a ∈ Σ, is a regular expression denoting the language L(a) = {a}. Let F (resp. G) be a regular expression denoting the language L(F ) (resp. L(G)). Then we have: (4) (F +G) is a regular expression denoting the language L(F +G) = L(F )∪L(G). (5) (F · G) is a regular expression denoting the language L(F · G) = L(F )L(G). (6) (F ∗ ) is a regular expression denoting the language L(F ∗ ) = (L(F ))∗ . We write op(E) =“+” (resp. “·”, “∗”) if E = F + G (resp. E = F G, E = F ∗ ). The following identities are classically used: 0 + E = E = E + 0, 1 · E = E = E · 1, 2

0 · E = 0 = E · 0. We write F ≡ G if the regular expressions F and G are identical (following Mirkin [11], F and G “graphically coincide”).

Let E be a regular expression over an alphabet Σ. We call alphabetic width of E, denoted by ||E||, the number of occurrences of symbols of Σ in E, whereas we call size of E, denoted by |E|, the number of nodes of the syntax tree of E. The alphabetic width of the expression (a + b)∗ · a · b · a + 1 is equal to 5; its size is equal to 12. In order to specify their position in the expression, symbols are subscripted following the order of reading. For example, starting from E = (a + b)∗ a b a + 1, one obtains the linearized version Ē = (a1 + b2)∗ a3 b4 a5 + 1 of E. The set of positions of an expression E is denoted by Pos_E. For the previous example, we have Pos_E = {a1, b2, a3, b4, a5}. If F is a subexpression of E, we denote by Pos_E(F) the subset of positions of E that are symbols of F. We denote by h the mapping that sends each position in Pos_E to the symbol of Σ that appears at this position in E. For E = (a + b)∗ a b a + 1, we have h(a1) = h(a3) = h(a5) = a and h(b2) = h(b4) = b. We use the following definition: λ(E) = if ε ∈ L(E) then 1 else 0.

A finite automaton is a 5-tuple A = (Q, Σ, δ, q0, F), where Q is the set of states, Σ is the alphabet, q0 ∈ Q is the initial state, F ⊆ Q is the set of final states, and δ : Q × Σ → 2^Q is the transition mapping. The language recognized by A is denoted by L(A). The size of a finite automaton A is the number of its states plus the number of its transitions.

Position automaton. In order to construct a non-deterministic finite automaton recognizing L(E), Glushkov [7] has introduced the following position sets: First(E) (resp. Last(E)) is the set of initial (resp. final) positions of words of L(E) and, for all x ∈ Pos_E, Follow(E, x) is the set of positions that immediately follow the position x in some word of L(E).
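The measures ||E|| and |E|, the function λ and the linearization recalled above can be illustrated by a minimal Python sketch. The classes `Sym` and `Op` and all function names are hypothetical choices made for this illustration; they are not taken from the cited works.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    """Leaf: an alphabet symbol, or one of the constants "0" and "1"."""
    name: str

@dataclass(frozen=True)
class Op:
    """Inner node: op is "+", "." or "*" (a star node has one argument)."""
    op: str
    args: tuple

CONSTANTS = ("0", "1")

def width(e):
    """Alphabetic width ||E||: number of occurrences of alphabet symbols."""
    if isinstance(e, Sym):
        return 0 if e.name in CONSTANTS else 1
    return sum(width(a) for a in e.args)

def size(e):
    """Size |E|: number of nodes of the syntax tree."""
    if isinstance(e, Sym):
        return 1
    return 1 + sum(size(a) for a in e.args)

def nullable(e):
    """lambda(E): 1 if the empty word belongs to L(E), 0 otherwise."""
    if isinstance(e, Sym):
        return 1 if e.name == "1" else 0
    if e.op == "*":
        return 1
    values = [nullable(a) for a in e.args]
    return max(values) if e.op == "+" else min(values)

def linearize(e, counter=None):
    """Subscript every symbol occurrence with its rank in reading order."""
    counter = [0] if counter is None else counter
    if isinstance(e, Sym):
        if e.name in CONSTANTS:
            return e
        counter[0] += 1
        return Sym(f"{e.name}{counter[0]}")
    return Op(e.op, tuple(linearize(a, counter) for a in e.args))

def positions(e):
    """Positions of a linearized expression, in reading order."""
    if isinstance(e, Sym):
        return [] if e.name in CONSTANTS else [e.name]
    return [p for a in e.args for p in positions(a)]

# E = (a + b)* . a . b . a + 1, the expression used in the text.
a, b = Sym("a"), Sym("b")
star = Op("*", (Op("+", (a, b)),))
E = Op("+", (Op(".", (Op(".", (Op(".", (star, a)), b)), a)), Sym("1")))

print(width(E), size(E), nullable(E))   # 5 12 1
print(positions(linearize(E)))          # ['a1', 'b2', 'a3', 'b4', 'a5']
```

The product is encoded as a binary, left-nested node, which is one possible reading of the syntax tree that yields the 12 nodes stated above.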
Formally, First is defined by induction according to the following rules:
First(0) = First(1) = ∅,
First(x) = {x},
First(F + G) = First(F) ∪ First(G),
First(F · G) = if λ(F) = 0 then First(F) else First(F) ∪ First(G),
First(F∗) = First(F).
Rules for Last(E) are obtained by substituting Last for First and replacing the last but one rule by
Last(F · G) = if λ(G) = 0 then Last(G) else Last(F) ∪ Last(G).


For all x ∈ Pos_E, the set Follow(E, x) can be inductively computed as follows:
Follow(0, x) = Follow(1, x) = ∅,
Follow(F + G, x) = Follow(F, x) if x ∈ Pos_E(F), and Follow(G, x) otherwise,
Follow(F · G, x) = Follow(F, x) if x ∈ Pos_E(F) \ Last(F); Follow(F, x) ∪ First(G) if x ∈ Last(F); Follow(G, x) if x ∈ Pos_E(G),
Follow(F∗, x) = Follow(F, x) if x ∈ Pos_E(F) \ Last(F); Follow(F, x) ∪ First(F) if x ∈ Last(F).

We add a specific position, denoted by x0 (where x0 ∉ Pos_E), to the set Pos_E and we set Pos0(E) = Pos_E ∪ {x0}; the set Last0(E) is equal to Last(E) if λ(E) = 0 and to Last(E) ∪ {x0} otherwise; the set Follow0(E, x) is equal to Follow(E, x) if x ∈ Pos_E and to First(E) if x = x0. The position automaton of E [7, 10], denoted by P_E, whose states are Pos0(E) and that recognizes L(E), is derived from the above-mentioned position sets as follows.

Definition 1. The position automaton of a regular expression E is the automaton P_E = (Pos0(E), Σ, δ, {x0}, Last0(E)) such that: δ(x, a) = {y | y ∈ Follow0(E, x) and h(y) = a}, ∀x ∈ Pos0(E), ∀a ∈ Σ.

Example 1. Consider the regular expression E = (a∗ b∗)∗ + (a + b)∗. Its linearized version is Ē = (a1∗ b2∗)∗ + (a3 + b4)∗. The associated position automaton is given in Figure 1.

First(E) = {a1, b2, a3, b4}, Last(E) = {a1, b2, a3, b4},
Follow(E, a1) = Follow(E, b2) = {a1, b2}, Follow(E, a3) = Follow(E, b4) = {a3, b4}.

Figure 1: the position automaton associated with E = (a∗ b∗)∗ + (a + b)∗.
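The inductive rules for First, Last and Follow translate almost literally into code. The following Python sketch is only an illustration under stated assumptions: expressions are encoded as nested tuples, positions are strings such as "a1" whose first character is the underlying symbol (so h(y) is `y[0]`), and all function names are hypothetical.

```python
def nullable(e):
    """lambda(E) as a boolean."""
    tag = e[0]
    if tag in ("1", "*"):
        return True
    if tag in ("0", "sym"):
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # tag == "."

def pos(e):
    """Set of positions occurring in the linear expression e."""
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(pos(s) for s in e[1:]))

def first(e):
    tag = e[0]
    if tag in ("0", "1"):
        return set()
    if tag == "sym":
        return {e[1]}
    if tag == "+":
        return first(e[1]) | first(e[2])
    if tag == ".":
        return (first(e[1]) | first(e[2])) if nullable(e[1]) else first(e[1])
    return first(e[1])                            # star

def last(e):
    tag = e[0]
    if tag in ("0", "1"):
        return set()
    if tag == "sym":
        return {e[1]}
    if tag == "+":
        return last(e[1]) | last(e[2])
    if tag == ".":
        return (last(e[1]) | last(e[2])) if nullable(e[2]) else last(e[2])
    return last(e[1])                             # star

def follow(e, x):
    tag = e[0]
    if tag in ("0", "1", "sym"):
        return set()
    if tag == "+":
        return follow(e[1], x) if x in pos(e[1]) else follow(e[2], x)
    if tag == ".":
        f, g = e[1], e[2]
        if x in pos(g):
            return follow(g, x)
        return (follow(f, x) | first(g)) if x in last(f) else follow(f, x)
    f = e[1]                                      # star
    return (follow(f, x) | first(f)) if x in last(f) else follow(f, x)

def position_automaton(e):
    """Return (Follow0, Last0, delta) with delta[(x, a)] a set of positions."""
    follow0 = {"x0": first(e)}
    follow0.update({x: follow(e, x) for x in pos(e)})
    finals = last(e) | ({"x0"} if nullable(e) else set())
    delta = {}
    for x, ys in follow0.items():
        for y in ys:
            delta.setdefault((x, y[0]), set()).add(y)   # h(y) = y[0]
    return follow0, finals, delta

# Linearized expression of Example 1: (a1* b2*)* + (a3 + b4)*
a1, b2, a3, b4 = (("sym", n) for n in ("a1", "b2", "a3", "b4"))
E = ("+", ("*", (".", ("*", a1), ("*", b2))), ("*", ("+", a3, b4)))
follow0, finals, delta = position_automaton(E)
print(sorted(follow(E, "a1")), sorted(follow(E, "a3")))
```

On this input the computed sets agree with the values of First, Last and Follow listed in Example 1. The sketch recomputes the position sets at every recursive call and makes no attempt to reach the quadratic bounds of the algorithms cited in the introduction.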

Follow automaton. Recently, Ilie and Yu [9] introduced the follow automaton F_E of a regular expression E. This automaton is smaller than the position automaton and, on average, faster to compute. There exist several algorithms that compute the follow automaton F_E in quadratic time w.r.t. the size of the expression E [9, 4]. Let us set
x ≡f y ⇔ (Follow0(E, x) = Follow0(E, y) and (x ∈ Last0(E) ⇔ y ∈ Last0(E))).
As shown in [9], we have F_E ≃ P_E/≡f. Hence the follow automaton of a regular expression is a quotient of its position automaton.
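Since ≡f only compares the pairs (Follow0(E, x), x ∈ Last0(E)), its classes can be obtained by grouping positions on that key. The Python sketch below is a hedged illustration: the input data are the position sets of E = (a∗ b∗)∗ + (a + b)∗ given in Example 1, the function names are hypothetical, and h(y) is assumed to be the first character of the position name.

```python
from collections import defaultdict

def follow_classes(follow0, last0):
    """Classes of the relation =f: same Follow0 set and same finality."""
    groups = defaultdict(set)
    for x, ys in follow0.items():
        groups[(frozenset(ys), x in last0)].add(x)
    return sorted(sorted(c) for c in groups.values())

def quotient_transitions(follow0, classes):
    """Transitions (source class, symbol, target class) of the quotient."""
    cls_of = {x: tuple(c) for c in classes for x in c}
    return {(cls_of[x], y[0], cls_of[y])          # h(y) = y[0]
            for x, ys in follow0.items() for y in ys}

# Follow0 and Last0 for E = (a1* b2*)* + (a3 + b4)* (values from Example 1;
# x0 is final because lambda(E) = 1).
follow0 = {
    "x0": {"a1", "b2", "a3", "b4"},
    "a1": {"a1", "b2"}, "b2": {"a1", "b2"},
    "a3": {"a3", "b4"}, "b4": {"a3", "b4"},
}
last0 = {"x0", "a1", "b2", "a3", "b4"}

classes = follow_classes(follow0, last0)
print(classes)   # [['a1', 'b2'], ['a3', 'b4'], ['x0']]
print(len(quotient_transitions(follow0, classes)))
```

Grouping on a hashable key is a direct way to compute the classes; it is not the construction used in [9] or [4].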


Example 2. For the regular expression E = (a∗ b∗)∗ + (a + b)∗, the associated follow automaton is shown in Figure 2; it is constructed from the position automaton by computing the equivalence relation ≡f, whose classes are {x0}, {a1, b2} and {a3, b4}.

Figure 2: the follow automaton associated with E = (a∗ b∗)∗ + (a + b)∗.

C-continuation automaton. This automaton has been introduced by Champarnaud and Ziadi [5] in order to efficiently compute the equation automaton that we will present later. In this section, we recall the notions of c-derivative, c-continuation and c-continuation automaton.

Definition 2 (c-derivative w.r.t. a symbol). The c-derivative of a regular expression E w.r.t. a symbol a, written d_a(E), is defined by
d_a(0) = d_a(1) = 0,
d_a(x) = if a = x then 1 else 0,
d_a(F + G) = if d_a(F) ≠ 0 then d_a(F) else d_a(G),
d_a(F · G) = if d_a(F) ≠ 0 then d_a(F) · G else λ(F) · d_a(G),
d_a(F∗) = d_a(F) · F∗.
The extension to a word u = u1 ... un follows the equations: d_ε(E) = E and d_{u1...un}(E) = d_{u2...un}(d_{u1}(E)).

Theorem 1. [5] If E is linear, for every symbol a and every word u, the c-derivative d_ua(E) of E w.r.t. the word ua is either 0 or unique.

Theorem 1 allows us to define the c-continuation c_a(E) of a in a linear expression E, that is, the unique value of the non-null c-derivatives d_ua(E).

Proposition 1. [5] For every symbol a of a linear expression E, the c-continuation c_a(E) is such that:
c_a(a) = 1,
c_a(F + G) = if c_a(F) ≠ 0 then c_a(F) else c_a(G),
c_a(F · G) = if c_a(F) ≠ 0 then c_a(F) · G else λ(F) · c_a(G),
c_a(F∗) = c_a(F) · F∗.

Corollary 1. [5] For every symbol a of a linear expression E, the c-continuation c_a(E) is either 1 or a subexpression of E or a product of subexpressions.
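Definition 2 and Theorem 1 can be illustrated by a small Python sketch. It computes c-derivatives on a tuple encoding of linear expressions and recovers each c-continuation as the unique non-null value of d_ua(E) by exploring the derivatives reachable from E. The encoding, the function names and the brute-force exploration are assumptions made for this illustration; the identities 0 · E = 0 = E · 0 and 1 · E = E = E · 1 are applied on the fly, and this is not the efficient computation of [5].

```python
ZERO, ONE = ("0",), ("1",)

def cat(f, g):
    """Product, simplified with 0.E = E.0 = 0 and 1.E = E.1 = E."""
    if f == ZERO or g == ZERO:
        return ZERO
    if f == ONE:
        return g
    if g == ONE:
        return f
    return (".", f, g)

def nullable(e):
    tag = e[0]
    if tag in ("1", "*"):
        return True
    if tag in ("0", "sym"):
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # tag == "."

def d(a, e):
    """c-derivative of e w.r.t. the symbol a (Definition 2)."""
    tag = e[0]
    if tag in ("0", "1"):
        return ZERO
    if tag == "sym":
        return ONE if e[1] == a else ZERO
    if tag == "+":
        left = d(a, e[1])
        return left if left != ZERO else d(a, e[2])
    if tag == ".":
        left = d(a, e[1])
        if left != ZERO:
            return cat(left, e[2])
        # lambda(F) . d_a(G): the result is 0 when F is not nullable
        return d(a, e[2]) if nullable(e[1]) else ZERO
    return cat(d(a, e[1]), e)                     # d_a(F*) = d_a(F) . F*

def continuations(e, positions):
    """Map each position x to the unique non-null value of d_ux(e)."""
    cont, todo, seen = {}, [e], {e}
    while todo:
        cur = todo.pop()
        for x in positions:
            nxt = d(x, cur)
            if nxt == ZERO:
                continue
            # Theorem 1: every non-null c-derivative w.r.t. ux is the same
            assert cont.setdefault(x, nxt) == nxt, "non-unique c-derivative"
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return cont

# Linear expression (a1 + b2)* + (a3 + b4)*
a1, b2, a3, b4 = (("sym", n) for n in ("a1", "b2", "a3", "b4"))
L, R = ("*", ("+", a1, b2)), ("*", ("+", a3, b4))
E = ("+", L, R)
cont = continuations(E, ["a1", "b2", "a3", "b4"])
print(cont["a1"] == L, cont["a3"] == R)   # True True
```

On this expression the sketch yields (a1 + b2)∗ for the positions a1 and b2, and (a3 + b4)∗ for the positions a3 and b4.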

More precisely, for a linear regular expression E, we have c_a(E) = H_1 ··· H_n, where H_i is a subexpression of E for all 1 ≤ i ≤ n. This decomposition is fundamental for stating the main theorem of this paper.

We now consider a regular expression E over Σ. Let Ē be the linearized version of E over Pos_E and h be the mapping from Pos_E onto Σ. We assume that x0 is a symbol not in Pos_E. Let c_x0(Ē) = d_ε(Ē) = Ē. By convention, c_x(E) is equal to c_x(Ē).

Definition 3 (c-continuation automaton). The c-continuation automaton of E, C_E = (Q, Σ, i, T, δ), is defined by:
• Q = {(x, c_x(E)) | x ∈ Pos0(E)},
• i = (x0, c_x0(E)),
• T = {(x, c_x(E)) | λ(c_x(E)) = 1},
• δ((x, c_x(E)), a) = {(y, c_y(E)) | h(y) = a and d_y(c_x(E)) ≡ c_y(E)}, ∀x ∈ Pos0(E) and ∀a ∈ Σ.

Example 3. Consider the regular expression E = (a + b)∗ + (a + b)∗. Its linearized version is Ē = (a1 + b2)∗ + (a3 + b4)∗. We show in Figure 3 the c-continuation automaton associated with E. The c-continuations are:
c_x0(E) = (a1 + b2)∗ + (a3 + b4)∗,
c_a1(E) = c_b2(E) = (a1 + b2)∗,
c_a3(E) = c_b4(E) = (a3 + b4)∗.

Figure 3: the c-continuation automaton associated with E = (a + b)∗ + (a + b)∗.

The position and c-continuation automata are isomorphic. The relation between the First, Last and Follow sets and the c-continuations is made explicit by the following proposition, which will be useful in the sequel.

Proposition 2. [5] The following equalities hold:
1. First(E) = {y ∈ Pos_E | c_y(E) ≠ 0};
2. Last(E) = {y ∈ Pos_E | λ(c_y(E)) = 1};
3. Follow(E, x) = {y ∈ Pos_E | d_y(c_x(E)) ≠ 0}.

Equation automaton. The equation automaton E_E of a regular expression E has been defined by Mirkin [11] and by Antimirov [1]. Champarnaud and Ziadi have proved that the equation automaton is a quotient of the c-continuation automaton [5]. Let us consider the equivalence relation ≡e defined by
(x, c_x(E)) ≡e (y, c_y(E)) ⇔ h(c_x(E)) ≡ h(c_y(E)).
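The relation ≡e only compares delinearized c-continuations, so its classes can be computed by grouping the states on h(c_x(E)). The Python sketch below starts from the c-continuations listed in Example 3. As an assumption of this illustration, expressions are written as strings in a fixed textual form, so that string equality stands for graphical coincidence; the function names are hypothetical.

```python
import re
from collections import defaultdict

def h(expr):
    """Delinearize: drop the index attached to each symbol occurrence."""
    return re.sub(r"([a-z])\d+", r"\1", expr)

def equation_classes(cont):
    """Group the positions x by the delinearized continuation h(c_x(E))."""
    groups = defaultdict(set)
    for x, c in cont.items():
        groups[h(c)].add(x)
    return dict(groups)

# c-continuations of E = (a + b)* + (a + b)*, as listed in Example 3
cont = {
    "x0": "(a1+b2)*+(a3+b4)*",
    "a1": "(a1+b2)*", "b2": "(a1+b2)*",
    "a3": "(a3+b4)*", "b4": "(a3+b4)*",
}
classes = equation_classes(cont)
for expr, members in classes.items():
    print(expr, sorted(members))
```

The five states of the c-continuation automaton collapse into two classes, one for (a + b)∗ + (a + b)∗ and one for (a + b)∗.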

Theorem 2. [5] The equation automaton is a quotient of the c-continuation automaton. We have: E_E ≃ C_E/≡e.

Example 4. Consider the regular expression E = (a + b)∗ + (a + b)∗ of Example 3. One has h(c_a1(E)) = h(c_b2(E)) = h(c_a3(E)) = h(c_b4(E)) = (a + b)∗, and the equation automaton associated with E is shown in Figure 4. The classes of ≡e are {(a + b)∗ + (a + b)∗} and {(a + b)∗}.

Figure 4: the equation automaton associated with E = (a + b)∗ + (a + b)∗.

Example 5. Let us recall that the size of an automaton is the sum of the number of states and of the number of transitions.

1. Consider the regular expression E_1^n = (a_1∗ + a_2∗ + ··· + a_n∗)∗. One has: |P_{E_1^n}| = |E_{E_1^n}| = n² + n + 1 and |F_{E_1^n}| = 2n + 2.

Figure 5: the follow automaton F_{E_1^n} (states x0 and {a_1, ..., a_n}; transitions labelled a_1, ..., a_n).

Figure 6: the equation automaton E_{E_1^n} (states E_1^n, a_1∗ E_1^n, a_2∗ E_1^n, ..., a_n∗ E_1^n).

2. Consider the regular expression E_2^n = a_1 (a_2∗ + 1) a_3 ··· a_{2n−1} (a_{2n}∗ + 1) a_{2n+1}. One has: |P_{E_2^n}| = |E_{E_2^n}| = 6n + 3 and |F_{E_2^n}| = 4n + 2.

Figure 7: the follow automaton F_{E_2^n} (states x0, {a_1, a_2}, {a_3, a_4}, ..., {a_{2n−1}, a_{2n}}, {a_{2n+1}}).

Figure 8: the equation automaton E_{E_2^n}.

3

Follow automaton versus equation automaton

In this section, we show that the equation automaton of a normalized regular expression never has more states or more transitions than the associated follow automaton.

3.1

Normalized expressions

Definition 4. A regular expression E is said to be normalized if the following conditions hold:
1. The expression E is reduced according to:
- the identities: 0 + E = E = E + 0, 1 · E = E = E · 1, 0 · E = 0 = E · 0,
- the rule: for all subexpressions H of E, H = F + 1 ⇒ λ(F) = 0.
2. The operation “·” is left associative, i.e. H = F · G ⇒ op(G) ≠ “·”.
3. The expression E is in star normal form, i.e. for every subexpression F∗ of E, it holds: ∀x ∈ Last(F), Follow(F, x) ∩ First(F) = ∅.

Example 6. Here are some expressions and their normalized forms:
• (a∗ + 1)′ = a∗.
• ((a∗ b∗)∗)′ = (a + b)∗.

It is shown in [4] that, given a regular expression E, it is possible to construct an equivalent normalized regular expression in O(|E|) time and space.

Proposition 3. Let E be a regular expression and E′ its normalized form. One has F_E ≡ F_E′.

Proof. It is easy to show that Pos_E = Pos_E′. Let E• be the star normal form of E. It is shown in [2] that P_E ≡ P_E•. Since reducing E• according to Definition 4.1 does not modify the positions and the transitions in the position automaton P_E•, we have: Pos_E•′ = Pos_E• = Pos_E′ and P_E• = P_E′. Finally, by applying the equivalence relation ≡f, we get F_E ≡ F_E′.
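Conditions 1 and 2 of Definition 4 are purely syntactic and can be checked by a single traversal of the syntax tree. The following Python sketch is a hedged illustration on a tuple encoding of expressions; the function names are hypothetical, and condition 3 (star normal form) is deliberately left out, so a positive answer does not mean that the expression is normalized.

```python
ZERO, ONE = ("0",), ("1",)

def nullable(e):
    tag = e[0]
    if tag in ("1", "*"):
        return True
    if tag in ("0", "sym"):
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # tag == "."

def subexpressions(e):
    """Yield e and all its subexpressions."""
    yield e
    for s in e[1:]:
        if isinstance(s, tuple):                  # skip symbol names
            yield from subexpressions(s)

def is_reduced_and_left_assoc(e):
    """Check conditions 1 and 2 of Definition 4 (not the star normal form)."""
    for sub in subexpressions(e):
        tag = sub[0]
        if tag == "+":
            f, g = sub[1], sub[2]
            if ZERO in (f, g):                    # 0 + E = E = E + 0
                return False
            if g == ONE and nullable(f):          # H = F + 1 => lambda(F) = 0
                return False
        elif tag == ".":
            f, g = sub[1], sub[2]
            if ZERO in (f, g) or ONE in (f, g):   # 0.E = 0 = E.0, 1.E = E = E.1
                return False
            if g[0] == ".":                       # op(G) must not be "."
                return False
    return True

a, b, c = ("sym", "a"), ("sym", "b"), ("sym", "c")
print(is_reduced_and_left_assoc(("+", ("*", a), ONE)))     # False: a* + 1
print(is_reduced_and_left_assoc(("*", a)))                 # True:  a*
print(is_reduced_and_left_assoc((".", a, (".", b, c))))    # False: a.(b.c)
print(is_reduced_and_left_assoc((".", (".", a, b), c)))    # True:  (a.b).c
```

The first two calls correspond to the first item of Example 6: a∗ + 1 violates the rule on F + 1 because λ(a∗) = 1, whereas a∗ passes both checks.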

Proposition 4. Let x be a position of a regular expression E. Then one has: (x ∈ First(E) and λ(c_x(E)) = 1) ⇒ λ(E) = 1.


Proof. The proof is by induction on the alphabetic width of E. Let us first consider the case ||E|| = 1. Since E is normalized, its form is either a + 1 or a∗, and it is then easy to prove that the proposition is true. We now suppose that the proposition is satisfied for normalized expressions F and G and we prove that it is satisfied for the normalized expressions F + G, F · G and F∗.

Case E = F + G: without loss of generality we suppose that x ∈ Pos_E(F). In this case c_x(E) = c_x(F). Then λ(c_x(E)) = λ(c_x(F)) = 1. Since x ∈ First(E), one has x ∈ First(F). By the inductive hypothesis λ(F) = 1. Consequently λ(E) = 1.

Case E = F · G: if x ∈ Pos_E(F), then x ∈ First(F) and c_x(F · G) = c_x(F) · G. It implies that λ(c_x(E)) = λ(c_x(F) · G) = 1. Thus λ(c_x(F)) = 1 and λ(G) = 1. Applying the inductive hypothesis to F, we get λ(F) = 1. Thus λ(E) = 1. If x ∈ Pos_E(G), then x ∈ First(G) and, since x ∈ First(E), λ(F) = 1. Since c_x(F · G) = c_x(G) and λ(c_x(F · G)) = 1, it follows that λ(c_x(G)) = 1. By the inductive hypothesis we have λ(G) = 1. Consequently λ(E) = 1.

Case E = F∗: it is obvious.

Theorem 3. Let x and y be two positions of a normalized regular expression E. Then the following conditions are equivalent:
i) c_x(E) ≡ c_y(E),
ii) ∀a ∈ Pos_E, d_a(c_x(E)) ≡ d_a(c_y(E)) and λ(c_x(E)) = λ(c_y(E)),
iii) Follow(E, x) = Follow(E, y) and [x ∈ Last(E) ⇔ y ∈ Last(E)].

Proof. i) ⇒ ii) is obvious. ii) ⇔ iii) is a straightforward consequence of Proposition 4. ii) ⇒ i): we proceed by contradiction. We suppose that ∀a ∈ Pos_E, d_a(c_x(E)) ≡ d_a(c_y(E)) and λ(c_x(E)) = λ(c_y(E)), and also that c_x(E) ≢ c_y(E). Since x and y are two different positions of E, there exists a subexpression E_x ⋄ E_y of E with ⋄ ∈ {+, ·} such that x ∈ Pos_E(E_x) and y ∈ Pos_E(E_y).

Let us suppose that ⋄ = “+”. Since c_x(E) ≢ c_y(E), one has c_x(E) = A_1 ··· A_α · C_1 ··· C_m and c_y(E) = B_1 ··· B_β · C_1 ··· C_m with α + β ≥ 1, Pos_E(A_i) ⊆ Pos_E(E_x) and Pos_E(B_j) ⊆ Pos_E(E_y) for all 1 ≤ i ≤ α and 1 ≤ j ≤ β.

In the case where C_1 ··· C_m ≠ 0, we have three sub-cases: (a) α = 0, β ≥ 1, (b) α ≥ 1, β = 0 and (c) α ≥ 1, β ≥ 1. We limit ourselves to the sub-case (a); the proofs for (b) and (c) are similar. In the sub-case (a), c_x(E) and c_y(E) have the form c_x(E) = C_1 ··· C_m and c_y(E) = B_1 ··· B_β · C_1 ··· C_m. Let b ∈ First(B_1). By Proposition 2, we have d_b(c_y(E)) ≠ 0. By ii) it implies that d_b(c_x(E)) ≠ 0. Consequently we have b ∈ First(C_1 ··· C_m). Since we have b ∈ Pos_E(E_y), there exists k, 1 ≤ k ≤ m, such that C_k = F∗ and λ(C_1 ··· C_k) = 1 with b ∈ First(F). It follows that
First(B_1) ∩ First(F) ≠ ∅    (1)

On the other hand, since c_y(E) = B_1 ··· B_β · C_1 ··· C_m, there exists a subexpression A_y of E such that y ∈ Pos_E(A_y) and A_y contains no occurrence of “·” nor of “∗”. So we necessarily have that either A_y · B_1 or A_y∗ = B_1 is a subexpression of E. Since x ∈ First(E_x) and λ(C_1 ··· C_k) = 1, we have x ∈ First(F), which implies by Proposition 2 that d_x(c_x(E)) ≠ 0. By ii) one has d_x(c_y(E)) ≠ 0. Since x ∉ Pos_E(E_y), we have λ(B_1 ··· B_β C_1 ··· C_k) = 1. Hence y ∈ Last(F). Thus
Last(A_y) ∩ Last(F) ≠ ∅    (2)

Finally, for the subexpression F∗ = C_k of E, there exists y ∈ Last(F) such that b ∈ First(B_1) ⊆ First(F) ∩ Follow(F, y). This contradicts the fact that E is in star normal form.

In the case where C_1 ··· C_m = 0, c_x(E) and c_y(E) have the form c_x(E) = A_1 ··· A_α and c_y(E) = B_1 ··· B_β with α ≥ 1 and β ≥ 1. Let b ∈ First(B_1). By Proposition 2 we have d_b(c_y(E)) ≠ 0. By ii) one has d_b(c_x(E)) ≠ 0. It holds: b ∉ Pos_E(E_x) ⇒ b ∉ Pos_E(A_i), for all 1 ≤ i ≤ α. Hence a contradiction.

Let us suppose that ⋄ = “·”. Then op(E_y) ≠ “·” and c_x(E) and c_y(E) have the form
c_x(E) = A_1 ··· A_α · E_y · C_1 ··· C_m,
c_y(E) = B_1 ··· B_β · E_y · C_1 ··· C_m where E_y = F∗, or c_y(E) = B_1 ··· B_β · C_1 ··· C_m where E_y = F + G.

Let us discuss the case where C_1 ··· C_m = 0. We have
c_x(E) = A_1 ··· A_α · E_y,
c_y(E) = B_1 ··· B_β · E_y where E_y = F∗, or c_y(E) = B_1 ··· B_β where E_y = F + G.

If α ≠ 0, let a ∈ First(A_1). By Proposition 2 we have d_a(c_x(E)) ≠ 0. By ii) it holds d_a(c_y(E)) ≠ 0. One has a ∉ Pos_E(E_y). Hence a contradiction. If α = 0, then c_x(E) = E_y and c_y(E) = B_1 ··· B_β · E_y where E_y = F∗, or c_y(E) = B_1 ··· B_β where E_y = F + G.

Case c_y(E) = B_1 ··· B_β · E_y. The proof is similar to that of the sub-case (a) of the case where ⋄ = “+”. There exists b ∈ First(B_1) ⊆ Follow(F, y) ∩ First(F) with F∗ = E_y and y ∈ Last(F). This contradicts the fact that E is in star normal form.

Case c_y(E) = B_1 ··· B_β. We proceed in a similar way and we deduce that there exists k, 1 ≤ k ≤ β, such that B_k = F∗ and b ∈ First(B_1) ⊆ Follow(F, y) ∩ First(F) and y ∈ Last(F). This contradicts the fact that E is in star normal form.

Let us discuss the case where C_1 ··· C_m ≠ 0. There are two cases. The first one is when E_y = H∗. Let c_x(E) = A_1 ··· A_α · E_y · C_1 ··· C_m and c_y(E) = B_1 ··· B_β · E_y · C_1 ··· C_m with α + β ≥ 1. We consider two sub-cases: (d) α = 0 and (e) α ≥ 1. The proof for (d) is similar to that of the case where C_1 ··· C_m = 0. In the sub-case (e), by a similar reasoning, there exists k, 1 ≤ k ≤ α, such that C_k = F∗ and

a ∈ First(A_1) ⊆ Follow(F, x) ∩ First(F) and x ∈ Last(F). This contradicts the fact that E is in star normal form. The second one is when E_y = F + G. Let us suppose that y ∈ Pos_E(F). We have two sub-cases:

Case G ≠ 1: in a similar way, there exists k, 1 ≤ k ≤ α, such that C_k = F∗, t ∈ First(E_y) ⊆ Follow(F, z) ∩ First(F) and z ∈ Last(F) ∩ Last(E_x). This contradicts the fact that E is in star normal form.

Case G = 1: let a ∈ First(A_1). By Proposition 2 we have d_a(c_x(E)) ≠ 0. By ii) one has d_a(c_y(E)) ≠ 0. Thus, there exists k, 1 ≤ k ≤ m, such that λ(B_1 ··· B_β C_1 ··· C_k) = 1 and C_k = H∗ with a ∈ First(H). Hence y ∈ First(H). Since λ(B_1 ··· B_β C_1 ··· C_k) = 1, we have λ(B_1 ··· B_β) = 1. One has c_y(F) = B_1 ··· B_β and λ(c_y(F)) = λ(B_1 ··· B_β) = 1, which implies that y ∈ First(F). By Proposition 4 one has λ(F) = 1, which contradicts Definition 4.1 for the subexpression E_y = F + 1 of E.

As a corollary of Theorem 2, the equation automaton is a quotient of the position automaton by the equivalence relation
x ≡e y ⇔ h(c_x(E)) ≡ h(c_y(E)).
By Theorem 3, for a normalized expression it holds that x ≡f y ⇔ c_x(E) ≡ c_y(E). Then we conclude that x ≡f y implies x ≡e y. Hence the theorem:

Theorem 4. For a normalized regular expression, the size of the equation automaton is smaller than or equal to the size of the follow automaton.

Example 7.

1. Consider the regular expression E = (a∗ b∗)∗ + (a + b)∗. Its follow automaton is shown in Figure 2 and its normalized form is E′ = (a + b)∗ + (a + b)∗. The equation automata respectively associated with E and E′ are given in Figures 9 and 10.

Figure 9: the equation automaton associated with E = (a∗ b∗)∗ + (a + b)∗.

Figure 10: the equation automaton associated with E′ = (a + b)∗ + (a + b)∗.

2. Consider the regular expression E_1^n = (a_1∗ + a_2∗ + ··· + a_n∗)∗. Its normalized form is (E_1^n)′ = (a_1 + a_2 + ··· + a_n)∗ and the equation automaton associated with (E_1^n)′, given in the following figure, has only one state. One has |F_{(E_1^n)′}| = 2n + 2 and |E_{(E_1^n)′}| = n + 1.

Figure 11: the equation automaton E_{(E_1^n)′} (single state (a_1 + a_2 + ··· + a_n)∗; transitions labelled a_1, ..., a_n).

3. Consider the regular expression E_2^n = a_1 (a_2∗ + 1) a_3 ··· a_{2n−1} (a_{2n}∗ + 1) a_{2n+1}. Its normalized form is (E_2^n)′ = a_1 a_2∗ a_3 ··· a_{2n−1} a_{2n}∗ a_{2n+1} and the equation automaton associated with (E_2^n)′ is given in the following figure. One has: |F_{(E_2^n)′}| = |E_{(E_2^n)′}| = 4n + 2.

Figure 12: the equation automaton E_{(E_2^n)′} (states (E_2^n)′, a_2∗ ··· a_{2n+1}, a_4∗ ··· a_{2n+1}, ..., a_{2n}∗ a_{2n+1}, ε).

4

Conclusion

We have answered the question raised by Ilie and Yu by exhibiting a family of regular expressions, namely the normalized expressions, such that, on the one hand, any regular expression can be turned into an equivalent normalized one in linear time and space w.r.t. the size of the expression and, on the other hand, the size of the equation automaton of a normalized expression is never larger than the size of its follow automaton. The question that we are now considering is how to construct the equation automaton directly from the position automaton.

References

[1] Antimirov V., Partial derivatives of regular expressions and finite automaton constructions, Theoret. Comput. Sci., 155, 291–319, 1996.
[2] Brüggemann-Klein A., Regular Expressions into Finite Automata, Theoret. Comput. Sci., 120, 197–213, 1993.
[3] Champarnaud J.-M. and Ziadi D., From C-Continuations to New Quadratic Algorithms for Automaton Synthesis, Intern. Journ. of Alg. and Comp., 11–6(2001), 707–735.
[4] Champarnaud J.-M., Nicart F. and Ziadi D., From the ZPC-structure of a regular expression to its follow automaton, Intern. Journ. of Alg. and Comp., 16–1(2006), 17–34.
[5] Champarnaud J.-M. and Ziadi D., Canonical derivatives, partial derivatives and finite automaton constructions, Theoret. Comput. Sci., 289, 137–163, 2002.
[6] Chang C.-H. and Paige R., From Regular Expressions to DFA’s Using Compressed NFA’s, Theoret. Comput. Sci., 178, 1–36, 1997.
[7] Glushkov V.M., The Abstract Theory of Automata, Russian Math. Surveys, 16, 1–53, 1961.
[8] Hopcroft J.E. and Ullman J.D., Introduction to Automata Theory, Languages and Computation, Addison-Wesley, Reading, MA, 1979.
[9] Ilie L. and Yu S., Follow automata, Information and Computation, 186, 140–162, 2003.
[10] McNaughton R. and Yamada H., Regular Expressions and State Graphs For Automata, IEEE Trans. on Electronic Computers, 9-1, 39–47, 1960.
[11] Mirkin B. G., An algorithm for constructing a base in a language of regular expressions, Engineering Cybernetics, 5:110–116, 1966.
[12] Ponty J.-L., Ziadi D. and Champarnaud J.-M., A new Quadratic Algorithm to Convert a Regular Expression into an Automaton, in: D. Raymond, D. Wood, eds., Proc. of WIA’96, Lecture Notes in Computer Science 1260, Springer-Verlag, 109–110, 1997.
[13] Yu S., Regular languages, in: G. Rozenberg, A. Salomaa, Handbook of Formal Languages, Vol. I, Springer-Verlag, Berlin, 41–110, 1997.
[14] Ziadi D., Ponty J.-L. and Champarnaud J.-M., Passage d’une expression rationnelle à un automate fini non-déterministe, Journées Montoises (1995), Bull. Belg. Math. Soc. 4, 177–203, 1997.