Extracting Grammar from Programs: Brute Force Approach

Matej Črepinšek, Marjan Mernik, and Viljem Žumer
University of Maribor, Faculty of Electrical Engineering and Computer Science,
Smetanova 17, 2000 Maribor, Slovenia
{matej.crepinsek, marjan.mernik, zumer}@uni-mb.si

Abstract. Extracting grammar from programs attracts researchers from several fields such as software engineering, pattern recognition, computational linguistics and natural language acquisition. So far, only regular grammar induction has been successful, while practical context-free grammar induction is still elusive. We discuss the search space of context-free grammar induction and propose the Brute Force approach to the problem, which, along with its various enhancements, can be further used in collaboration with the Evolutionary approach, as described in the accompanying paper.

Keywords. Grammar induction, grammar inference, learning from positive and negative examples, exhaustive search

1 Introduction

A few years ago, some interesting questions were posted on the Usenet group comp.compilers:

– (1.1) “I am looking for an algorithm that will generate context-free grammar from given set of strings. For example, given a set L = {aaabbbbb, aab} one of the grammar is G → AB, A → aA|a, B → b|bB” [1]
– (1.2) “I’m working on a project for which I need information about some reverse engineering method that would help me extract the grammar from a set of programs (written in any language). A sufficient grammar will be the one which is able to parse all the programs ...” [2]

Those questions triggered some interesting responses:

– (2.1) “Unfortunately, there are infinitely many context-free grammars for any given set of strings (Consider for example adding A → C, C → D, ..., Y → Z, Z → A to the above grammar. You can obviously add as many pointless rules as you want this way, and the string set doesn’t change). What’s worse, there are several reasonable questions that can be asked about context-free grammars that can be shown to be undecidable. You can’t write down an algorithm which will always tell you whether two CFGs are equivalent, and under one idealization of the grammar learning process [8] you can even show that a learning algorithm is impossible.”
– (2.2) “There are formal theories that address this. However, their results are far from encouraging. The essential problem is that given a finite set of programs, there is a trivial regular expression which recognizes exactly those set of programs and no others. Namely, regexp: program_1 | ... | program_n. At a pragmatic level, if you just have a set of programs in some programming language, say ‘PLQ’, that you don’t know the grammar for, it is probably possible to construct a compiler for it by assuming that it is syntactically similar to all other programming languages with only a few wrinkles and thus taking some other language translator and tweaking it until the set of programs in question goes through.”
– (2.3) “Within machine learning there is a subfield called Grammatical Inference. They have demonstrated a few practical successes mostly at the level of recognizing regular languages or subsets thereof.”
– (2.4) “The hard part isn’t the approach. It’s the Machine Bottlenecks: its Lack of Parallelism and its Lack of Memory, which humans have in abundance. The approach for inducting the rules A → BC is going to take up a huge amount of memory because of all the table-keeping over the combinations.”


– (2.5) “There is a way to deal with this issue. Let us assume for the moment that the program is compiled by a compiler. Then the grammar knowledge that you need resides in that compiler. What you do is write a parser that parses the part of the compiler containing the grammar knowledge. If you are lucky this is easy and you recover the BNF in a snippet. If you are dealing with a language like IBM COBOL or PL/I for which it is not possible to obtain the source code of the grammar there is another option. You can extract the grammar from the manual.”

The responses point out several good research directions. The aim of this paper is to investigate these research directions in greater depth and propose a new approach to solving this problem, hopefully stimulating other researchers in the programming language community to investigate and contribute to this area. The structure of this paper is as follows. The regular grammar and context-free grammar (CFG) inference problem is briefly described in section 2. The search space in context-free grammar induction is discussed in section 3, and the grammar inference tool GIE-BF is briefly presented in section 4, followed by the conclusion in section 5.

2 Grammatical inference

We first have to consider the practicality of the problem. Does it exist in everyday projects, or is it an artificially engineered problem? Indeed, the grammar induction problem is viable since it is being investigated in, and finds application in, several research domains such as software engineering, pattern recognition, computational biology, computational linguistics, speech recognition and natural language acquisition. For example, software engineers usually want to recover grammar from legacy systems in order to automatically generate various software analysis and modification tools. Usually in this case the grammar can be semi-automatically recovered from compilers and language manuals [12] (see also item 2.5 above). In application areas outside software engineering, grammars are mainly used as an efficient representation of artifacts that are inherently structural and/or recursive (e.g. neural networks, structured data and patterns). Here compilers and manuals do not exist and semi-automatic grammar recovery as suggested in [12] is not possible. The grammar needs to be extracted solely from artifacts represented as sentences/programs written in some unknown language.

The grammar induction problem also raises interesting theoretical questions. Grammatical inference (grammar induction or language learning), a subfield of machine learning, is the process of learning a grammar from training data. One of the most important theoretical results in this field is Gold’s theorem [8], which states that it is impossible to identify any of the four classes of languages in the Chomsky hierarchy in the limit using only positive samples. Using both negative and positive samples, the Chomsky hierarchy languages can be identified in the limit. Intuitively, Gold’s theorem can be explained by recognizing the fact that the final generalization of the positive samples would be an automaton that accepts all strings. Using positive samples alone leaves it uncertain when the generalization steps should be stopped. This implies the need for some restrictions or background knowledge on the generalization process. Despite these dispiriting results, research has continued in this area. Learning algorithms have been developed that exploit knowledge of negative samples, structural information, or restrict grammars to some subclasses (e.g. even linear grammars, k-bounded grammars, structurally reversible languages, terminal distinguishable context-free languages, etc.) [15] where identification in the limit is possible from positive samples only.

2.1 Regular grammar inference

So far, grammar inference has been mainly successful in inferring regular languages. Researchers have developed various algorithms which can learn regular languages from positive and negative samples. Let S+ and S− denote the sets of positive and negative samples of a finite state automaton A = (Q, δ, Σ, q0, F) [10]. Automaton A is consistent with a sample S = S+ ∪ S− if it accepts all positive samples and rejects all negative samples. A set S+ is structurally complete with respect to automaton A if S+ covers each transition of A and uses every final state of A as an accepting state [6]. Hence, positive samples need to be structurally complete in order for the inference process to be successful; it is not possible to infer a transition in an automaton if the positive samples do not evidence it. Clearly, the search space of all finite state automata is infinite. We can narrow this search space by using the positive samples S+ as explained below. A number of algorithms (e.g. RPNI [18]) first construct the maximal canonical automaton MCA(S+) or prefix tree acceptor PTA(S+) [10] from the positive samples, and then generalize the automaton by a state merging process. By merging states, an automaton is obtained that accepts a bigger set of strings and generalizes according to the increasing number of positive samples presented. In figure 1 the MCA(S+) and PTA(S+) for the set S+ = {aaabbbbb, aab} are shown.

Fig. 1. MCA(S+) and PTA(S+) for the set S+ = {aaabbbbb, aab}
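To make this construction concrete, here is a minimal sketch (our own illustration in Python, not the cited RPNI implementation) that builds PTA(S+) for the positive samples and performs a single state merge of the kind used in the generalization step mentioned above:

```python
def build_pta(positives):
    """Prefix tree acceptor PTA(S+): one state per distinct prefix of the positive samples."""
    states, delta, accepting = {""}, {}, set()      # states are named by the prefix they spell
    for sample in positives:
        prefix = ""
        for symbol in sample:
            nxt = prefix + symbol
            delta[(prefix, symbol)] = nxt
            states.add(nxt)
            prefix = nxt
        accepting.add(prefix)
    return states, delta, accepting

def merge(states, delta, accepting, p, q):
    """Merge states p and q (q is renamed to p), yielding a quotient automaton.
    The result may be nondeterministic; real algorithms keep merging to repair that."""
    rename = lambda s: p if s == q else s
    new_delta = {}
    for (s, a), t in delta.items():
        new_delta.setdefault((rename(s), a), set()).add(rename(t))
    return {rename(s) for s in states}, new_delta, {rename(s) for s in accepting}

states, delta, accepting = build_pta(["aaabbbbb", "aab"])
print(len(states))          # 10 states, matching q0..q9 of PTA(S+) in Figure 1
print(sorted(accepting))    # ['aaabbbbb', 'aab']
```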

By merging states we are in effect looking for a partition π of the set of states Q; B(q, π) denotes the block of π that contains the state q. The quotient automaton Aπ = (Qπ, δπ, Σ, B(q0, π), Fπ) is then defined as follows: Qπ = {B(q, π) | q ∈ Q} is the set of states, Fπ = {B(q, π) | q ∈ F} is the set of accepting states, and δπ : Qπ × Σ → 2^Qπ is the transition function such that ∀B(qi, π), B(qj, π) ∈ Qπ, ∀a ∈ Σ, B(qj, π) ∈ δπ(B(qi, π), a) iff qi, qj ∈ Q and qj = δ(qi, a). The set of all quotient automata Aπ forms a lattice Lat(MCA(S+)) ordered by the grammar cover relation ⪯: Aπj ⪯ Aπi iff L(Aπj) ⊆ L(Aπi). It is shown in [6] that the lattice Lat(MCA(S+)) is guaranteed to contain the consistent automaton A. The lattice Lat(MCA({aab})) is shown in figure 2. Some of the languages generated by the quotient automata are: L(Aπ0) = {aab}, L(Aπ1) = {ab, aab, aaab, ...}, L(Aπ11) = {ab, abb, aab, aabb, aaab, ...}, and L(Aπ14) = {ε, ab, ba, bb, aab, abab, ...}. Many automata in the lattice can be consistent with the positive and negative samples. For example, consider MCA(S+), which, though consistent with the positive and negative samples provided, does not generalize the positive samples. Our aim is to find an automaton A composed of a minimal number of states and consistent with the positive and negative samples provided. Gold [7] proved this problem to be NP-hard. The search space of regular grammar inference is the lattice of automata built from MCA(S+) and also depends on the total number of states (m) in MCA(S+) [19]. The following equation enumerates the search space (see also Table 1):

    E_m = Σ_{j=0}^{m-1} C(m-1, j) · E_j,   where E_0 = 1

and C(m-1, j) denotes the binomial coefficient.

Table 1 shows that explicitly building the lattice Lat(MCA(S+)) is not practical even when the number of states involved is small. To overcome this problem, heuristic (e.g. as described in [22]) or incremental methods (e.g. the version space algorithm [19]) can be used.
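As a quick sanity check, the values of E_m in Table 1 (the Bell numbers) can be computed directly from the recurrence above; a short sketch:

```python
from math import comb

def lattice_sizes(max_m):
    """E_m = sum_{j=0}^{m-1} C(m-1, j) * E_j with E_0 = 1: the number of
    partitions (quotient automata) of the m states of MCA(S+)."""
    E = [1]                                              # E_0 = 1
    for m in range(1, max_m + 1):
        E.append(sum(comb(m - 1, j) * E[j] for j in range(m)))
    return E[1:]                                         # E_1 .. E_max_m

print(lattice_sizes(14))
# [1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597, 27644437, 190899322]
```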

2.2 Context-free grammar inference

Learning context-free grammars G = (V, T, P, S) [10] is more difficult than learning regular grammars (compare Table 1 and Table 2). Using representative positive samples (that is, positive samples which exercise every production rule in the grammar) along with negative samples did not result in the same level of success as with regular grammar inference.


Fig. 2. Lattice Lat(MCA({aab}))

Table 1. As the number of states increases, the search space grows exponentially

  m | E_m          m | E_m
  1 | 1            8 | 4140
  2 | 2            9 | 21147
  3 | 5           10 | 115975
  4 | 15          11 | 678570
  5 | 52          12 | 4213597
  6 | 203         13 | 27644437
  7 | 877         14 | 190899322

Hence, some researchers resorted to using additional knowledge to assist in the induction process. Sakakibara [20] used a set of skeleton derivation trees (unlabelled derivation trees), where the input to the learning process consists of completely structured sentences, that is, sentences with parentheses inserted to indicate the shape of the derivation tree of the grammar. It was shown in [20] that in this case learning CFG's was possible from positive samples only. Since the grammatical structure (topology) was known, the problem of learning CFG's could be reduced to the problem of labeling non-terminals (similar to the partitioning problem of nonterminals). Recently an enhancement to this algorithm was proposed [21], where learning CFG's was possible from partially structured sentences (some pairs of left and right parentheses missing). For example, a completely structured sample for the language L = {a^m b^m c^n | m, n ≥ 1} is ((a(ab))b)(c(cc)), while some partially structured samples are (a(ab)b)(c(cc)), (a(ab)b)(ccc), and (aabb)(ccc).


Table 2. Search space with labeling of nonterminals

  n | Number of full binary trees (Catalan numbers) | Labelings of nonterminals (n^n) | Search space
  1 | 1       | 1                      | 1
  2 | 2       | 4                      | 8
  3 | 5       | 27                     | 135
  4 | 14      | 256                    | 3584
  5 | 42      | 3125                   | 131250
  6 | 132     | 46656                  | 6158592
  7 | 429     | 823543                 | 3.53299947 E8
  8 | 1430    | 1.6777216 E7           | 2.399141888 E10
  9 | 4862    | 3.87420489 E8          | 1.883638417518 E12
 10 | 16796   | 1.0 E10                | 1.6796 E14
 11 | 58786   | 2.85311670611 E11      | 1.6772331868538246 E16
 12 | 208012  | 8.916100448256 E12     | 1.85465588644262707 E18
 13 | 742900  | 3.02875106592253 E14   | 2.2500591668738474 E20
 14 | 2674440 | 1.1112006825558016 E16 | 2.9718395534545382 E22

However, in the area of programming languages it is impractical to assume that completely or partially structured samples exist. Despite the fact that many researchers have looked into the problem of CFG induction [13] [16] [14] [17], there has been no convincing solution to the problem so far. In all the work cited so far, experiments were performed on theoretical sample languages such as L = {ww | w ∈ {a, b}+}, L = {w = w^R | w ∈ {a, b}{a, b}+} (palindromes), L = {w ∈ {a, b}+ | #a(w) = #b(w)}, where #x(w) is the number of x's in w, and L = {ab^i cb^i a | i ≥ 1}, instead of on real or even toy programming languages. Therefore, learning CFG's is still a real challenge in grammatical inference [9].
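To make the notion of structured samples above concrete, the following small sketch (our own illustration, assuming samples are given as plain bracketed strings of single-character terminals) turns such a sample into a skeleton derivation tree, i.e. a derivation tree whose internal nodes carry no nonterminal labels; inference over structured samples then reduces to labeling these nodes:

```python
def skeleton(sample: str):
    """Parse a parenthesised sample into a nested-list skeleton tree:
    terminals stay as characters, every '(...)' becomes an unlabelled internal node."""
    pos = 0

    def parse_seq(stop=None):
        nonlocal pos
        children = []
        while pos < len(sample) and sample[pos] != stop:
            if sample[pos] == "(":
                pos += 1                       # consume '('
                children.append(parse_seq(")"))
                pos += 1                       # consume ')'
            else:
                children.append(sample[pos])   # a terminal symbol
                pos += 1
        return children

    return parse_seq()

print(skeleton("((a(ab))b)(c(cc))"))
# [[['a', ['a', 'b']], 'b'], ['c', ['c', 'c']]]
```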

3 Brute Force Approach

In the area of programming languages, context-free grammars are of special importance since almost all programming languages employ CFG's in their design. Recent approaches to CFG induction are not able to infer context-free grammars for general-purpose programming languages. As mentioned earlier (response 2.1) there are infinitely many CFG's that recognize a given set of programs, since we can always add useless productions to the grammar. On the other hand, there are infinitely many grammars that do not recognize a given set of programs. Therefore, the grammar search space is infinite. In practice, we can impose limitations on the number of productions, the number of non-terminals, and on the size of the right hand side (RHS) of the productions, thereby reducing the infinite search space to a finite one. The problem can be restated as follows: how many derivation trees can be constructed from a given program? Since we are not interested in grammars with useless productions, we can make use of the theory of n-ary trees. A lot is known about binary trees (a binary tree is either empty or has a root node with a left binary tree and a right binary tree), so it is advantageous to limit ourselves to these trees exclusively. The Chomsky Normal Form (CNF) for representing grammars enables us to do exactly that, since all productions of grammars in CNF are of the form

    A → BC   or   A → a

In CNF, derivation trees are represented as full binary trees. A full binary tree is a binary tree with the additional property that every node must have exactly two “children” if it is an internal node (A → BC), or zero children if it is a leaf node (A → a). An extended binary tree [11] is a transformation of any binary tree into a full binary tree. This transformation consists of replacing every null subtree of the original tree with “special nodes”.


The nodes from the original tree become internal nodes, while the “special nodes” become the external nodes (or leaves). The following equation holds for any non-empty full binary tree:

    l = n + 1

The number of external nodes (l) in a non-empty full binary tree is one more than the number of internal nodes (n). In our case, internal nodes represent non-terminals and external nodes represent terminals in the derivation tree. More precisely, external nodes in CNF represent productions of the type A → a. Given a program, the number of external nodes (l) is therefore known, and consequently also the number of internal nodes (n). The number of all possible full binary trees with n internal nodes is given by the n-th Catalan number (C_n) [4]:

    C_n = (2n)! / ((n + 1)! · n!)

For example, there are 14 different full binary trees when l = 5 (n = 4) (Figure 3, Table 2).

Fig. 3. All full binary trees for l = 5 (n = 4)
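The counts in the first column of Table 2 follow directly from the Catalan formula above; a small sketch for checking them:

```python
from math import factorial

def catalan(n: int) -> int:
    """Number of full binary trees with n internal nodes: C_n = (2n)! / ((n+1)! * n!)."""
    return factorial(2 * n) // (factorial(n + 1) * factorial(n))

# First column of Table 2; catalan(4) == 14 matches the l = 5 example in Figure 3.
print([catalan(n) for n in range(1, 15)])
# [1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440]
```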

Even for small n, the number of all possible full binary trees increases exponentially (Table 2). For full binary trees in CNF to be valid derivation trees, the interior nodes need to be labeled with non-terminals. If n is the number of interior nodes, then there are n^n different ways to label the interior nodes. For example, for the input “a = 2” the number of external nodes is 3 and the number of internal nodes is 2. Using extended binary trees, the external nodes represent the following simple productions: NT1 → id, NT2 → =, and NT3 → num. The total number of different labelings of nonterminals in a full binary tree with 2 internal nodes is 4. With 2 different full binary trees, we can construct a total of 8 derivation trees (Figure 4), where every tree represents one grammar. For example, tree (a) represents the grammar NT4 → NT4 NT3, NT4 → NT1 NT2. From figure 4 we can easily see that trees (a) and (d) represent the same grammar; the only difference is in the naming of nonterminals. A closer look at Figure 4 identifies only 2 different labelings per full binary tree, resulting in a total of 4 different grammars: (1) = (a, d), (2) = (b, c), (3) = (e, h), and (4) = (f, g) (Figure 5). When n = 3, instead of 27 (3^3) different labelings of nonterminals only 5 are distinct, while the others can be obtained by consistent renaming of nonterminals. In fact, we are interested only in distinct labelings of nonterminals; their number is given by the Bell numbers [3]. The search space for the brute force approach is given in Table 3.
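Distinct labelings correspond to set partitions of the internal nodes: two labelings that differ only by a consistent renaming of nonterminals induce the same partition, which is why the Bell numbers appear in Table 3. A small sketch (our own illustration) that enumerates them as restricted growth strings:

```python
def distinct_labelings(n: int):
    """Yield one representative labeling per set partition of n internal nodes,
    encoded as a restricted growth string: label[i] <= 1 + max(label[:i])."""
    def extend(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for label in range(used + 1):     # reuse an existing label or introduce a new one
            yield from extend(prefix + [label], max(used, label + 1))
    yield from extend([], 0)

print(list(distinct_labelings(2)))        # [(0, 0), (0, 1)]: the 2 labelings per tree shape in Figure 5
print(len(list(distinct_labelings(3))))   # 5 instead of 3^3 = 27, the Bell number B_3
```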

4 GIE-BF: Grammar Inference Engine

Based on observations in the previous section, we implemented a grammar inference tool called GIE-BF (Figure 6). GIE-BF employs the brute force approach for CFG learning. From a given set of positive and negative samples, a valid derivation tree is systematically searched for.


Fig. 4. All possible labelings of nonterminals for l = 3 (n = 2)

Fig. 5. All distinct labelings of nonterminals when l = 3 (n = 2)

Table 3. Search space with distinct labeling of nonterminals

  n | Number of full binary trees (Catalan numbers) | Distinct labelings of nonterminals (Bell numbers) | Search space        | Search space before (Table 2)
  1 | 1       | 1         | 1                   | 1
  2 | 2       | 2         | 4                   | 8
  3 | 5       | 5         | 25                  | 135
  4 | 14      | 15        | 210                 | 3584
  5 | 42      | 52        | 2184                | 131250
  6 | 132     | 203       | 26796               | 6158592
  7 | 429     | 877       | 376233              | 3.53299947 E8
  8 | 1430    | 4140      | 5920200             | 2.399141888 E10
  9 | 4862    | 21147     | 1.02816714 E8       | 1.883638417518 E12
 10 | 16796   | 115975    | 1.9479161 E9        | 1.6796 E14
 11 | 58786   | 678570    | 3.989041602 E10     | 1.6772331868538246 E16
 12 | 208012  | 4213597   | 8.76478739164 E11   | 1.85465588644262707 E18
 13 | 742900  | 27644437  | 2.05370522473 E13   | 2.2500591668738474 E20
 14 | 2674440 | 190899322 | 5.1054878272968 E14 | 2.9718395534545382 E22

The search is stopped either when the derivation tree found produces a context-free grammar in CNF that accepts all positive samples and rejects all negative samples, or when the search space is exhausted. The induced CFG is then used to generate samples (programs), the analysis of which can indicate whether more negative samples are needed in the starter set; in the event that more negative samples are needed, we deduce that the inferred CFG is over-generalized.
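The consistency test at the heart of this search, namely accept every positive sample and reject every negative one, can be performed with standard CYK parsing since the candidate grammars are in CNF. A minimal sketch (our own simplification, not the actual GIE-BF code; the tokenization and the negative sample are illustrative, and the grammar is the one inferred for “a = 2” discussed below):

```python
# A CNF grammar as (lhs, rhs) pairs: (A, (B, C)) for A -> B C and (A, (a,)) for A -> a.
GRAMMAR = {
    ("NT4", ("NT5", "NT3")), ("NT5", ("NT1", "NT2")),
    ("NT1", ("id",)), ("NT2", ("=",)), ("NT3", ("num",)),
}

def accepts(grammar, start, tokens):
    """CYK membership test: does the CNF grammar derive `tokens` from `start`?"""
    n = len(tokens)
    # table[i][j] holds the nonterminals deriving tokens[i .. i+j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {lhs for lhs, rhs in grammar if rhs == (tok,)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for lhs, rhs in grammar:
                    if (len(rhs) == 2 and rhs[0] in table[i][split - 1]
                            and rhs[1] in table[i + split][length - split - 1]):
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

def consistent(grammar, start, positives, negatives):
    return (all(accepts(grammar, start, p) for p in positives)
            and not any(accepts(grammar, start, q) for q in negatives))

# "a = 2" tokenised as (id, =, num) must be accepted; the negative sample (id, =) rejected.
print(consistent(GRAMMAR, "NT4", [("id", "=", "num")], [("id", "=")]))   # True
```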

Fig. 6. GIE-BF tool

We are aware of the fact that the search space (Table 3) is still prohibitive for practical examples where programs consist of thousands of lines of code, where each line consists of several terminal symbols. However, we can treat the currently implemented brute-force algorithm as a good base to further restrict the search space by the use of heuristics. Another option is to adapt the grammar induced by the brute force approach (using small samples) so that it also recognizes larger samples. For example, for the input “a = 2” the following grammar was obtained by the brute force approach: NT4 → NT5 NT3, NT5 → NT1 NT2, NT1 → id, NT2 → =, and NT3 → num. Let us assume a more complicated positive sample like “a = 12 + 4”. The previous grammar can then be transformed into a more complex grammar so that both positive samples are recognized. The main idea is to append some additional productions to the already existing productions (Figure 7). An example of such a transformation might produce the following grammar: NT4 → NT5 NT3, NT5 → NT1 NT2, NT1 → id, NT2 → =, NT3 → num, NT4 → NT4 NT6, NT6 → NT0 NT3, and NT0 → +.

We also intend to use some well known grammar patterns from existing programming languages (see response 2.2 from section 1). A repository of the most often used language constructs can be built, from which supplementary knowledge can be used to guide the search process. Additionally, the existing brute-force algorithm can be applied to sub-languages, where the sub-programs are small enough so that the brute force approach is feasible. The final CFG can then be obtained by merging the induced sub-grammars. We are also concurrently working on the CFG inference project using an evolutionary approach [5], where the just described application of the brute force approach is exploited on frequent sequences. Frequent sequences are sequences of terminal symbols that most frequently appear in the positive samples.



Fig. 7. Derivation tree adaptation

Frequent sequences are a good approximation for sub-languages which are not known in advance. Using GIE-BF we were able to find CFG's for small sub-languages (see Table 4). This approach is more successful when combined with other approaches (e.g. the evolutionary approach), where we were able to induce more practical grammars [5].

Language          | Inferred grammar                                | An example of positive sample
Robot             | NT4 → NT5 NT3; NT5 → NT2 NT1; NT5 → NT5 NT1;    | begin down down right up end
                  | NT5 → NT5 NT6; NT6 → NT1 NT1; NT1 → #command;  |
                  | NT2 → #begin; NT3 → #end                       |
Simple assignment | NT5 → NT6 NT1; NT6 → NT5 NT6; NT6 → NT5 NT3;    | a := 22
                  | NT6 → NT4 NT2; NT1 → #int; NT2 → #=;           | i := 10 + 2 + 3
                  | NT3 → #+; NT4 → #id                            |
Nesting blocks    | NT3 → NT1 NT2; NT3 → NT1 NT4; NT4 → NT3 NT4;    | begin begin end begin end
                  | NT4 → NT5 NT2; NT5 → NT1 NT2; NT5 → NT6 NT2;   | begin begin end end end
                  | NT6 → NT1 NT5; NT1 → #begin; NT2 → #end        |

Table 4. Inferred grammars for some small languages.

5 Conclusions

We have discussed the problem of extracting CFG's from programs with the aim of stimulating other researchers from the programming language community to contribute to this area, and to make significant advances in this interesting and challenging area of research. First, the search space of CFG induction was investigated, followed by a description of the brute-force approach to CFG induction. We observed that the search space of CFG's is too large to work with even for toy programming languages, concluding that it would be infeasible to use the brute force approach alone for CFG induction.


We have also used the brute-force approach in combination with the evolutionary approach, where it was found to be useful on frequent sequences. Several other directions of possible future work are also briefly outlined.

6 Acknowledgements

We would like to thank Barrett Bryant and Faizan Javed for useful comments and fruitful discussions.

References

1. Usenet news group comp.compilers, reference 96-02-306. http://compilers.iecc.com/comparch/article/9902-306.
2. Usenet news group comp.compilers, reference 99-02-025. http://compilers.iecc.com/comparch/article/99-02-025.
3. E. Bell. Exponential numbers. Amer. Math. Monthly, 41:411–419, 1934.
4. J. M. Borwein and D. H. Bailey. Mathematics by Experiment: Plausible Reasoning in the 21st Century. A. K. Peters, Ltd., Wellesley, MA, USA, 2004.
5. M. Črepinšek, M. Mernik, F. Javed, B. Bryant, and A. Sprague. Extracting grammar from programs: Evolutionary approach. Submitted to ACM SIGPLAN Notices, 2004.
6. P. Dupont, L. Miclet, and E. Vidal. What is the search space of the regular inference? In Grammatical Inference and Applications, Second International Colloquium, ICGI-94, Alicante, Spain, September 21–23, 1994, Proceedings, volume 862 of Lecture Notes in Computer Science, pages 25–37, 1994.
7. E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37:302–320, 1978.
8. E. M. Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
9. C. de la Higuera. Current trends in grammatical inference. In F. J. Ferri et al., editors, Advances in Pattern Recognition, Joint IAPR International Workshops SSPR+SPR'2000, volume 1876 of LNCS, pages 28–31. Springer, 2000.
10. J. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 1979.
11. D. E. Knuth. The Art of Computer Programming, Vol. 3 — Sorting and Searching. Addison-Wesley, Reading, MA, 1973.
12. R. Lämmel and C. Verhoef. Semi-automatic grammar recovery. Software—Practice & Experience, 31(15):1395–1438, Dec. 2001.
13. P. Langley and S. Stromsten. Learning context-free grammars with a simplicity bias. In Machine Learning: ECML 2000, 11th European Conference on Machine Learning, Barcelona, Catalonia, Spain, May 31–June 2, 2000, Proceedings, volume 1810 of Lecture Notes in Artificial Intelligence, pages 220–228. Springer, 2000.
14. J. A. Laxminarayana and G. Nagaraja. Inference of a subclass of context free grammars using positive samples. In ECML/PKDD 2003 Workshop on Learning Context-Free Grammars, 2003.
15. L. Lee. Learning of context-free languages: a survey of the literature. Technical Report TR-12-96, Center for Research in Computing Technology, Harvard University, Cambridge, MA, 1996.
16. K. Nakamura and T. Ishiwata. Synthesizing context free grammars from sample strings based on inductive CYK algorithm. In Grammatical Inference: Algorithms and Applications, 5th International Colloquium, ICGI 2000, Lisbon, Portugal, September 11–13, 2000, Proceedings, volume 1891 of Lecture Notes in Artificial Intelligence, pages 186–195. Springer, 2000.
17. T. Oates, T. Armstrong, J. Harris, and M. Nejman. Leveraging lexical semantics to infer context-free grammars. In ECML/PKDD 2003 Workshop on Learning Context-Free Grammars, 2003.
18. J. Oncina and P. Garcia. Inferring regular languages in polynomial update time. In N. P. de la Blanca, A. Sanfeliu, and E. Vidal, editors, Pattern Recognition and Image Analysis, volume 1 of Series in Machine Perception and Artificial Intelligence, pages 49–61. World Scientific, Singapore, 1992.
19. R. Parekh and V. Honavar. An incremental interactive algorithm for grammar inference. In L. Miclet and C. de la Higuera, editors, Proceedings of the Third International Colloquium on Grammatical Inference (ICGI-96): Learning Syntax from Sentences, volume 1147 of LNAI, pages 238–249. Springer, Sept. 25–27 1996.
20. Y. Sakakibara. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97(1):23–60, Mar. 1992.
21. Y. Sakakibara and H. Muramatsu. Learning context-free grammars from partially structured examples. In Grammatical Inference: Algorithms and Applications, 5th International Colloquium, ICGI 2000, Lisbon, Portugal, September 11–13, 2000, Proceedings, volume 1891 of Lecture Notes in Artificial Intelligence, pages 229–240. Springer, 2000.
22. S. Savchenko. Regular expression mining. Dr. Dobb's Journal, 29:46–48, 2004.
