GenInc: An Incremental Context-Free Grammar Learning Algorithm for Domain-Specific Language Development

Faizan Javed¹, Marjan Mernik², Barrett R. Bryant¹ and Alan Sprague¹

¹ Department of Computer & Information Sciences, University of Alabama at Birmingham, 1300 University Boulevard, Birmingham, AL 35294-1170, USA. {javedf, bryant, sprague}@cis.uab.edu

² Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, 2000 Maribor, Slovenia. [email protected]

Abstract - While grammar inference (or grammar induction) has found extensive application in the areas of robotics, computational biology, speech and pattern recognition, its application to problems in the programming language and software engineering domains has been limited. We have found a new application area for grammar inference which aims to make domain-specific language development easier for domain experts not well versed in programming language design, and which also applies to the construction of renovation tools for legacy software systems. Continuing our previous efforts to infer context-free grammars (CFGs) for domain-specific languages, which centered on a genetic-programming based CFG inference system, we discuss improvements made to GenInc, an incremental learning algorithm for inferring CFGs with a core focus on facilitating domain-specific language development. We elaborate on the enhancements made to GenInc in the form of new operators, and conclude by discussing the results of applying GenInc to domain-specific languages.

Keywords: Grammar Inference, Domain-Specific Languages, Incremental Learning

1 Introduction

In the realm of software engineering, context-free grammars are widely used for defining the syntactic component of programming languages. Domain-Specific Languages (DSLs) are languages specifically designed for use in a particular domain. The following apt definition for DSLs is given in [1]: "A DSL is a programming language or executable specification language that offers, through appropriate notations and abstractions, expressive power focused on, and usually restricted to, a particular problem domain." DSLs are usually declarative and smaller in size than general-purpose programming languages (GPLs). An open problem in DSL development is the need to reduce the time required to learn language development tools by incorporating support for the description-by-example paradigm for language specifications such as syntax [2]. Using grammar induction, language specifications can be generated for DSLs, facilitating programming language development for domain experts not well versed in programming language design.

Grammatical inference (GI) is the process of learning a grammar from positive and/or negative sentence examples. Gold's theorem [3] states that it is impossible to identify any of the four classes of languages in the Chomsky hierarchy in the limit using only positive samples, unless a statistical distribution over the positive samples is used or some extra information is provided to the learner in the form of an ordered presentation of samples. GI would enable a domain expert to create a DSL by supplying sentences from the desired DSL to a grammar induction system, which would then create a parser for the DSL represented by those samples, thus expediting programming language development. Such a technique would find application in cases where legacy DSLs have been running for many years and their specifications no longer exist to assist with the evolution of their implementations (e.g., as was needed to solve the Y2K problem). A survey of programming language usage in commercial and research environments has shown that more than 500 general-purpose and proprietary programming languages are in use today [4]. Y2K-like problems notwithstanding, many commercial installations use in-house DSLs, and a variety of situations can arise (e.g., a software company going bankrupt) in which source implementations either need to be recovered or translated to a different language dialect. It is entirely possible that the application domain may require a language far more complex than GPLs.

Our research focus is on recovering specifications of imperative, explicitly Turing-complete (i.e., languages which contain features that can be used as control structures, and provide for conditionals, loops and recursion), context-free DSLs. Examples of such DSLs include PostScript, JavaScript, SQL and HTML, amongst others. To this end we discuss GenInc, an incremental learning algorithm which infers DSL grammars from DSL samples. GenInc is based on the identification-in-the-limit learning model introduced in [3], rather than the active learning model [5], which assumes the existence of an oracle to which the learner can pose membership and grammar equivalence queries. A case for "programming by examples" is made in [6], where the aim is to program computers the same way children learn: from examples. This is accomplished by making the computer "compute in the limit", a concept similar to identification in the limit; the computer is given examples (or samples) from which it outputs a program that correctly classifies them, without the programmer supplying a detailed algorithm for doing so.

We first discuss related work in the areas of GI and incremental learning. We then provide background information on GenInc, discuss the specific enhancements made to the algorithm, and show its application to some DSLs. This is followed by conclusions and future work.

Our previous work in this area focused on using a genetic-programming (GP) based system called GenParse to infer CFGs for DSLs. The system was augmented with basic data-mining techniques, such as frequent sequences [15], to increase its inference capability. More details and discussion of GenParse are available in [16] and [17]. In the area of Domain-Specific Modeling, we have developed MARS [18], a semi-automatic inference technique which, using GI algorithms, addresses the problem of metamodel drift, which occurs when instance models in a repository are orphaned from their defining metamodel. MARS is able to recover metamodels which correctly define the mined instance models.

2 Related Work

Grammar inference research has primarily focused on learning regular and context-free grammars. A context-free grammar (CFG) is a 4-tuple G = (V, T, P, S) where V is a finite set of non-terminals (NTs), T is a finite set of terminals, P is a finite set of productions (or rules) and S is the starting NT. Rules are of the form NT → (V ∪ T)*. While successful algorithms like L* [7], Evidence Driven State Merging (EDSM) [8] and Regular Positive Negative Inference (RPNI) [9] exist for inferring regular grammars, inferring CFGs has proved to be more challenging. Because many CFG decision problems are undecidable (e.g., given two CFGs generating languages L1 and L2, is L1 ∩ L2 = ∅? Is L1 = L2?), heuristic inference is the preferred approach to learning CFGs. Genetic algorithms have been used as such a heuristic in [10]. SubdueGL [11] uses the Minimum Description Length principle to infer context-free graph grammars; in GenInc, textual CFGs are the target inference entity.

An incremental inductive CYK algorithm which learns simple CFGs from positive and negative examples is discussed in [12]; GenInc, by comparison, uses positive examples only. In [13], an incremental learning algorithm is introduced which infers minimum-state finite-state languages from lexicographically ordered examples; it differs from GenInc in that it learns regular languages and uses both positive and negative examples. The authors of [14] propose an algorithm for avoiding order effects in incremental learning using an inductive logic programming (ILP) incremental learning system enhanced with a backtracking strategy. Like GenInc (discussed in the next section), they split their incremental learning concept into three increasingly complex cases; unlike GenInc, they use the ILP framework and make use of positive and negative examples. The work in [4] focuses on using either compiler sources or language reference manuals to obtain a grammar, from which renovation tools are constructed. In the absence of reference manuals or compiler sources (a more serious scenario, where GI techniques like GenInc are applicable), this method cannot be used.
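To make the 4-tuple definition concrete, a CFG can be written down directly as a small data structure. The following Python sketch is our own illustration (the class and field names are not from the paper); it encodes the inferred ROBOT grammar that appears later in Table 5:

    from dataclasses import dataclass

    # A production rule: a non-terminal rewrites to a sequence of
    # terminals and non-terminals, e.g. NT1 -> begin NT2 end.
    @dataclass(frozen=True)
    class Rule:
        lhs: str                 # a non-terminal in V
        rhs: tuple[str, ...]     # a string over (V U T)*; () encodes epsilon

    @dataclass
    class CFG:
        nonterminals: set[str]   # V
        terminals: set[str]      # T
        rules: list[Rule]        # P
        start: str               # S

    # The inferred ROBOT grammar from Table 5, written as data:
    robot = CFG(
        nonterminals={"NT1", "NT2"},
        terminals={"begin", "command", "end"},
        rules=[
            Rule("NT1", ("begin", "NT2", "end")),
            Rule("NT2", ("command", "NT2")),
            Rule("NT2", ()),     # NT2 -> epsilon
        ],
        start="NT1",
    )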

3 GenInc

GenInc was developed to address the limitations of GenParse. In GenParse (which makes use of both positive and negative examples), when a positive sample isn't accepted (or a negative sample is accepted) by the current candidate solution CFG, the induction algorithm takes the violating sample into account and iterates through the induction process to generate a new CFG which accepts the positive sample (or rejects the negative sample). An improvement to this inference process is to use only the current CFG as the knowledge structure together with the current training sample (instead of reprocessing all the previous samples), adding only the minimum number of new production rules needed for the violating CFG to accept (or reject) the sample. This allows the learning process to be carried out in an incremental and more efficient way. Another limitation was that although GenParse was able to infer small DSLs, it did not succeed when the original grammar was larger (i.e., more than 20 productions) or exhibited highly recursive structures several levels deep. The reason is that such grammars result in a more expansive combinatorial search space, which the current implementation cannot tackle efficiently and conclusively. Over such complex search spaces, GPs have a tendency to converge towards local optima rather than the global optimum. It has been shown that genetic-programming systems enhanced with local search outperform pure genetic programs (in terms of result efficacy and efficiency) in many problem domains [19]. However, they haven't been applied to the problem of CFG inference, which provides another motivation for our work.

GenInc follows the incremental learning model as defined in [20]: it analyzes one training sample at a time, doesn't reprocess any previous samples when inferring a new hypothesis (CFG), and maintains only one candidate CFG in memory. GenInc was introduced in [17], and makes use of positive samples only, because in practice negative samples are often not available. Instead, supplemental knowledge is provided to GenInc by using ordered samples: smaller samples are presented before more complex ones. This allows an incremental learning algorithm to encode the sources of simple variance in the samples before attempting to capture long-range dependencies [21].

Before discussing the algorithm, we give a brief overview of its theoretical foundations. GenInc takes a set of positive samples (or programs) Pos = {S1, S2, …, SN} as input, and uses the following auxiliary functions:

- Position(Tokeni, Si): returns the position of Tokeni in sample Si.
- Prefix(Si, k): returns the first k tokens of sample Si.
- Suffix(Si, k): returns the last k tokens of sample Si.
- Length(Si): returns the number of tokens in sample Si.
- Diff(Si, j, k): returns k tokens of sample Si starting from position j.

An ordering on the samples is indicated by the relation "simpler than": Si is simpler than Si+1 when Length(Si+1) − Length(Si) > 0. This ordering gives rise to three increasingly complex cases of incremental learning. In case 1, each sample is a prefix of its successor (Prefix(Si+1, Length(Si)) = Si), so the increment appears at the end of the sample. In case 2, all tokens of Si also appear in Si+1, but the increment may occur in the middle of the sample.
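The auxiliary functions are straightforward to express over token lists. The sketch below is our own illustration, assuming samples are lists of token strings and positions are 1-based as in the definitions above:

    def position(token, sample):
        """Position of the first occurrence of token in sample (1-based)."""
        return sample.index(token) + 1

    def prefix(sample, k):
        """First k tokens of sample."""
        return sample[:k]

    def suffix(sample, k):
        """Last k tokens of sample."""
        return sample[-k:] if k > 0 else []

    def length(sample):
        """Number of tokens in sample."""
        return len(sample)

    def diff(sample, j, k):
        """k tokens of sample starting from position j (1-based)."""
        return sample[j - 1 : j - 1 + k]

    # "Simpler than": S_i precedes S_{i+1} when it is strictly shorter.
    def simpler_than(s_i, s_next):
        return length(s_next) - length(s_i) > 0

    # Example: the first two DESK samples from Table 4, tokenized.
    s1 = "print id where id = int".split()
    s2 = "print id + int where id = int".split()
    assert simpler_than(s1, s2)
    # The case-2 increment sits in the middle of s2:
    assert diff(s2, 3, 2) == ["+", "int"]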

In case 3, the most general case, it is not required that all tokens in Si also appear in Si+1, and the increments can be in the middle or at the end. In this paper we focus on case 2 of incremental learning and describe the modifications made to GenInc to handle this case; specifically, we discuss the improvements and new operators that were created for this more complex case of learning. Due to space constraints, we refer the reader to [17] for examples of the other cases. We conjecture that as GenInc progressively becomes capable of handling more complex learning cases, the advancements made will contribute toward successfully tackling the most general learning case.

The GenInc algorithm is detailed in Table 1, and the algorithms for the RHS_Subset and IncGeneralizer operators are shown in Table 2 (illustrative sketches of the two operators follow Tables 3 and 4, respectively). GenInc is fully incremental; that is, learning can start from the first sample itself using an empty hypothesis (grammar), and no initial grammar needs to be available beforehand. The algorithm is iterative, and tests the current grammar on each new sample. If the sample fails to parse, a new grammar construction process takes place. It is important to note that only the current grammar and the new sample are used to create the new grammar. If the increment exists as the right-hand side (RHS) of a grammar rule, that rule is modified to handle recursion. If not, the RHS_Subset operator is applied: it analyzes all the rules in the grammar to check whether the current increment is a subset of any RHS and, if so, modifies the grammar accordingly. If no such RHS exists, the increment is added to the grammar as a new rule. Table 3 illustrates an example using simple arithmetic expressions, where the increment in the second sample was captured as a subset by GenInc under NT3, and the resulting grammar G2 successfully parses all the samples (terminals are in lower case).

The new grammar is then tested on all the previous samples up to and including the new sample. If parsing fails on any previous sample, we apply the IncGeneralizer operator to the grammar. It is possible that the new grammar is ambiguous or exhibits a shift/reduce conflict in the case of LR(1) parsers [23]; LR(1) parsers, which are widely used for creating parsers for most programming languages, cannot parse ambiguous grammars. The example in Table 4 illustrates the IncGeneralizer operator using the DESK DSL [24].

Table 1. The GenInc Algorithm

1. Create the initial grammar G1 from the first sample S1.
2. While more samples exist, do:
   a. Generate an LR(1) parser for the current grammar Gi and attempt to parse the next sample. If the incumbent grammar successfully parses the sample, skip steps b-e.
   b. If Gi fails to parse a sample Sj, reconstruct Gi by making use of the increment in tokens from Sj-1 to Sj and the following sub-steps (c, d and e).
   c. If the increment already exists as the right-hand side (RHS) of a rule in the grammar with non-terminal Nx as left-hand side (LHS), then append Nx to that rule and add the rule Nx → ε to the grammar. In some cases, this will introduce recursion into the grammar.
   d. Else, if the increment does not exist as the RHS of a rule, perform the RHS_Subset operator on Gi. If the procedure succeeds, try appending the new non-terminal to existing rules as described in step e.
   e. Else, add the rule Nx → increment to the grammar. Append a new non-terminal Nx to the first rule at the appropriate place (succeeding new NTs are added in series). Create an LR(1) parser from the grammar and then try to parse all the samples up to and including the violating sample. If the parsing fails, perform the IncGeneralizer operator on Gi to check for ambiguity.
3. If all samples are parsed successfully, output the grammar. Else, indicate failure and output Gi together with the sample which can't be parsed.

Table 2. The RHS_Subset and IncGeneralizer Operators

RHS_Subset (takes as arguments the current grammar G and the increment I):
1. For all rules R in G do:
   a. If R.RHS ≠ ε and R.size > I.size, then for j = 1 to R.size do:
      i. If I.size < (R.size − j), extract an increment E from R of size I.size starting from position j.
      ii. If E is similar to I, replace the segment corresponding to E in the current rule by a new NT, NTx.
2. If a match for I was found in G, add a new non-terminal NTy at the appropriate place in rule 1. Also add the rules NTx → I, NTy → NTx, NTy → ε to G.

IncGeneralizer (takes as argument the current grammar G):
1. For all rules R in G do:
   a. For j = 1 to R.size do:
      i. If R.RHS ≠ ε, search through the entire RHS for two consecutive non-terminals NTa and NTb. If two such NTs are found and both reduce to ε, break.
   b. If consecutive NTs exist, break.
2. If NTa and NTb exist, iterate over the RHSs of all the rules in G and replace the pair NTa NTb with a new non-terminal NTx. Also remove the ε rules of NTa and NTb.
3. Add the rules NTx → ε, NTx → NTa NTb, NTx → NTb NTa, NTx → NTa to the grammar.

In case 2 of incremental learning, since the increments are in the middle of the samples, new NTs are inserted consecutively in the middle of the first rule. Because such increments are added as new rules which also have an epsilon (ε) rule, this can result in two consecutive NTs reducing to epsilon and causing a parser conflict. In Table 4, using G2 and S3, GenInc initially inferred the following grammar (step (e) in Table 1):

NT1 → print id NT2 NT3 where id = int
NT2 → + int | ε
NT3 → + id | ε

Here NT2 and NT3 both reduce to epsilon, causing a parser conflict in LR(1) parsers. The IncGeneralizer operator modifies the inferred grammar to handle this.

Table 3. The RHS_Subset Operator

i   Si          Gi
1   2*4+5       NT1 → int * int + int
2   2+6*4+5     NT1 → int NT2 * int NT3
                NT2 → NT3 | ε
                NT3 → + int
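To make the mechanics behind Table 3 concrete, the following Python fragment approximates the RHS_Subset operator over the CFG representation sketched in Section 2. It is our own rendering: the fresh_nt helper, the inc_pos parameter, and the rule-1 insertion heuristic are simplifying assumptions, not the paper's implementation.

    def rhs_subset(grammar, increment, inc_pos):
        """Sketch of the RHS_Subset operator from Table 2.

        increment: the new token sequence, e.g. ("+", "int")
        inc_pos:   0-based position in rule 1 where the increment occurred
                   in the violating sample (a simplification of the paper's
                   "appropriate place")
        Returns True iff a matching RHS segment was found and rewritten.
        """
        inc = tuple(increment)
        for i, rule in enumerate(grammar.rules):
            rhs = rule.rhs
            if rhs == () or len(rhs) <= len(inc):
                continue
            for j in range(len(rhs) - len(inc) + 1):
                if rhs[j:j + len(inc)] != inc:
                    continue
                # Replace the matched segment by a fresh non-terminal NTx.
                ntx = fresh_nt(grammar)
                grammar.nonterminals.add(ntx)
                grammar.rules[i] = Rule(rule.lhs,
                                        rhs[:j] + (ntx,) + rhs[j + len(inc):])
                # Insert a fresh NTy into rule 1 where the increment occurred.
                nty = fresh_nt(grammar)
                grammar.nonterminals.add(nty)
                first = grammar.rules[0]
                grammar.rules[0] = Rule(first.lhs,
                                        first.rhs[:inc_pos] + (nty,) + first.rhs[inc_pos:])
                # Add NTx -> increment, NTy -> NTx, NTy -> epsilon.
                grammar.rules += [Rule(ntx, inc), Rule(nty, (ntx,)), Rule(nty, ())]
                return True
        return False

    def fresh_nt(grammar):
        """Hypothetical helper: next unused NTk name (naming may differ
        from the paper's examples)."""
        k = len(grammar.nonterminals) + 1
        while f"NT{k}" in grammar.nonterminals:
            k += 1
        return f"NT{k}"

On the Table 3 example (rule NT1 → int * int + int, with the increment + int occurring after the first int), this fragment yields a grammar of the same shape as G2, modulo non-terminal naming.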

Table 4. The IncGeneralizer Operator on the DESK DSL

i   Si                              Gi
1   print b where b = 10            NT1 → print id where id = int
2   print b + 1 where b = 10        NT1 → print id NT2 where id = int
                                    NT2 → + int | ε
3   print b + 1 + b where b = 10    NT1 → print id NT2NT3 where id = int
                                    NT2 → + int
                                    NT3 → + id
                                    NT2NT3 → NT2 | NT2 NT3 | NT3 NT2 | ε
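The rewriting performed by IncGeneralizer can be approximated in the same style. Again, this is our own sketch of Table 2's pseudocode, with a simplified nullable test (direct ε-rules only, which suffices for the rules GenInc creates):

    def nullable(grammar, nt):
        """True if nt has a direct epsilon rule."""
        return any(r.lhs == nt and r.rhs == () for r in grammar.rules)

    def inc_generalizer(grammar):
        """Sketch of the IncGeneralizer operator from Table 2.

        Finds two consecutive nullable non-terminals NTa NTb on some RHS,
        merges them into a fresh NTx, and adds the generalized rules
        NTx -> epsilon | NTa NTb | NTb NTa | NTa.
        """
        for rule in grammar.rules:
            rhs = rule.rhs
            for j in range(len(rhs) - 1):
                a, b = rhs[j], rhs[j + 1]
                if (a in grammar.nonterminals and b in grammar.nonterminals
                        and nullable(grammar, a) and nullable(grammar, b)):
                    ntx = a + b          # e.g. "NT2NT3", as in Table 4
                    new_rules = []
                    for r in grammar.rules:
                        # Drop the epsilon rules of NTa and NTb.
                        if r.lhs in (a, b) and r.rhs == ():
                            continue
                        # Replace the pair NTa NTb on every RHS with NTx.
                        s = list(r.rhs)
                        for k in range(len(s) - 1):
                            if s[k] == a and s[k + 1] == b:
                                s[k:k + 2] = [ntx]
                                break
                        new_rules.append(Rule(r.lhs, tuple(s)))
                    new_rules += [Rule(ntx, ()), Rule(ntx, (a, b)),
                                  Rule(ntx, (b, a)), Rule(ntx, (a,))]
                    grammar.rules = new_rules
                    grammar.nonterminals.add(ntx)
                    return True
        return False

Applied to the conflicting grammar inferred from S3 above, this yields exactly the generalized NT2NT3 rules shown in row 3 of Table 4.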

4 Experimental Results

We ran GenInc on DSLs designed to test the properties exhibited by samples created for case 2 of incremental learning. The DSLs used were Robot (a simple language for moving robots) [25]; MRL (Model Representation Language) [18], a DSL used to represent graphical software metamodels in a textual format; and a variant of the DESK calculator DSL [24]. Our goal is to infer a grammar which accepts the same language as the original grammar. Table 5 shows the inferred and original grammars for the various DSLs. The Robot DSL case was the simplest of the three, while the experimental run on the DESK DSL was more involved and highlighted the need for the IncGeneralizer operator.

Although GenInc was able to infer suitable grammars for all the DSLs, it is also desirable to measure the complexity of the inferred grammars: significantly complex grammars can result in difficulties with software comprehension and maintenance. We measure the complexity of the grammars using the SYNQ tool for grammar metrics [26]. In particular, we use the following grammar size metrics:

- TERM & VAR: simple size metrics which count the number of terminals and non-terminals in a grammar, respectively.
- McCabe Cyclomatic Complexity (MCC): measures the number of alternatives for grammar rules. A high MCC value can foreshadow parser conflicts.
- Average RHS Size (AVS): measures the average size of a rule's RHS. A high AVS value may impede parser performance.
- Halstead Effort (HAL): estimates the effort required to understand a grammar.

Grammar metrics for the DSLs are detailed in Table 6. While the metric values for the ROBOT and MRL grammars (inferred and original) are similar, there is a marked increase in the AVS and HAL values of the corresponding DESK grammars. Because of the way GenInc learns grammars, the RHSs of the inferred grammars are usually larger, resulting in a higher AVS value. It can also be seen that the inferred DESK grammar has a larger number of distinct alternatives for some of its rules (e.g., non-terminal NT2NT3).

While we have achieved some success with GenInc, a limitation of the current effort is that it cannot infer grammars when samples exhibit the properties of case 3, that is, when not all the tokens of sample S1 are contained in S2. This reflects the situation when alternative rules of a grammar are used in generating a sample. We are currently working on enabling GenInc to learn samples representative of case 3.
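For intuition, the simpler size metrics can be computed directly from a grammar. The sketch below is our own, not the SYNQ implementation; the particular readings of MCC and AVS are assumptions on our part that happen to reproduce the ROBOT column of Table 6 (HAL is omitted):

    from collections import defaultdict

    def grammar_metrics(grammar):
        """Compute TERM, VAR, MCC and AVS for a CFG.

        We read MCC as the number of alternation choices in the grammar
        (rules per non-terminal beyond the first) and AVS as the summed
        RHS length per non-terminal, averaged over non-terminals; exact
        SYNQ conventions may differ.
        """
        term = len(grammar.terminals)
        var = len(grammar.nonterminals)
        count = defaultdict(int)   # alternatives per non-terminal
        size = defaultdict(int)    # summed RHS symbols per non-terminal
        for r in grammar.rules:
            count[r.lhs] += 1
            size[r.lhs] += len(r.rhs)   # an epsilon RHS contributes 0
        mcc = sum(c - 1 for c in count.values())
        avs = sum(size.values()) / var
        return term, var, mcc, avs

    # For the inferred ROBOT grammar defined earlier:
    #   grammar_metrics(robot) == (3, 2, 1, 2.5), matching Table 6.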

Table 5. Original and Inferred Grammars

ROBOT
  Inferred:
    NT1 → begin NT2 end
    NT2 → command NT2 | ε
  Original:
    ROBOT → begin MOVES end
    MOVES → MOVE MOVES | ε
    MOVE → command

MRL
  Inferred:
    NT1 → beginmodel NT2 NT3 NT4 endmodel
    NT2 → model NT2 | ε
    NT3 → atom NT3 | ε
    NT4 → conname : atom → atom NT4 | ε
  Original:
    MODEL → beginmodel MBODY endmodel
    MBODY → MODELS ATOMS CONNECTIONS
    MODELS → model MODELS | ε
    ATOMS → atom ATOMS | ε
    CONNECTIONS → conname : atom → atom CONNECTIONS | ε

DESK
  Inferred:
    NT1 → print id NT2NT3 where id = int
    NT2NT3 → NT2 | NT2 NT3 | NT3 NT2 | ε
    NT2 → + int
    NT3 → + id
  Original:
    DESK → print EXPR CONST
    EXPR → EXPR + INTID | INTID
    INTID → int | id
    CONST → where id = int

Table 6. Grammar Metrics Comparison of the DSLs

Metric   ROBOT Orig.   ROBOT Inf.   MRL Orig.   MRL Inf.   DESK Orig.   DESK Inf.
TERM     3             3            8           8          6            6
VAR      3             2            5           4          4            4
MCC      1             1            3           3          2            3
AVS      2             2.5          3           3.75       3.25         4
HAL      225.853       174.355      705.81      713.057    530.4        816

5 Conclusions and Future Work

Learning CFGs finds application in many diverse domains, but its application to programming language development and the construction of software renovation tools has been limited. DSL development for domain experts not well versed in the nuances of programming language development is a well-documented open problem. In this paper, we described improvements made to GenInc, an incremental learning algorithm for DSL development. We focused on case 2 of our learning problem and discussed the results of experimental runs on various DSLs. We also used grammar metrics to compare the original and inferred grammars.

Our future work involves using the insights gained by working on cases 1 and 2 to develop an all-inclusive learning algorithm which can infer grammars for DSLs from the more challenging case of learning (case 3). One future direction involves integrating GenInc as a local search component of GenParse. To the best of our knowledge, GP systems enhanced with domain-specific local search have not been applied to the problem of CFG inference; continuing work on GenInc will provide a local search operator for our GenParse system tailored to the DSL inference domain. We also plan to apply the algorithm to more practical DSLs and study the scalability of the approach. Because of the utility of CFGs in domains outside programming languages, the results of this work may find potential application in many diverse areas such as pattern recognition and bioinformatics.

6 Acknowledgments

We would like to thank James Power and Brian Malloy for providing us with their SYNQ tool for grammar metrics calculations.

7 References

[1] A. van Deursen, P. Klint, and J. Visser. "Domain-Specific Languages: An Annotated Bibliography". ACM SIGPLAN Notices 35(6):26-36, 2000.
[2] M. Mernik, J. Heering, and A. M. Sloane. "When and How to Develop Domain-Specific Languages". ACM Computing Surveys 37(4):316-344, 2005.
[3] E. M. Gold. "Language Identification in the Limit". Information and Control 10(5):447-474, 1967.
[4] R. Lämmel and C. Verhoef. "Cracking the 500 Language Problem". IEEE Software 18(6):78-88, 2001.
[5] D. Angluin. "A Note on the Number of Queries Needed to Identify Regular Languages". Information and Control 51(1):76-87, 1981.
[6] P. Kugel. "It's Time to Think Outside the Computational Box". Communications of the ACM 48(11):32-37, 2005.
[7] D. Angluin. "Learning Regular Sets from Queries and Counterexamples". Information and Computation 75(2):87-106, 1987.
[8] K. J. Lang, B. A. Pearlmutter, and R. A. Price. "Results of the Abbadingo One DFA Learning Competition and a New Evidence-Driven State Merging Algorithm". In Proceedings of the Fourth International Colloquium on Grammatical Inference, Ames, IA, Springer-Verlag LNCS 1433, 1998.
[9] J. Oncina and P. García. "Inferring Regular Languages in Polynomial Update Time". In Pattern Recognition and Image Analysis, volume 1 of Series in Machine Perception and Artificial Intelligence, 49-61, World Scientific, Singapore, 1992.
[10] Y. Sakakibara and H. Muramatsu. "Learning Context-Free Grammars from Partially Structured Examples". In Proceedings of the Fifth International Colloquium on Grammatical Inference and Applications, 229-240, Lisbon, Portugal, 2000.
[11] I. Jonyer, L. B. Holder, and D. J. Cook. "MDL-Based Context-Free Graph Grammar Induction and Applications". International Journal of Artificial Intelligence Tools 13(1):65-79, 2003.
[12] K. Nakamura and M. Matsumoto. "Incremental Learning of Context-Free Grammars Based on Bottom-up Parsing and Search". Pattern Recognition 38(9):1384-1392, 2005.
[13] S. Porat and J. A. Feldman. "Learning Automata from Ordered Examples". Machine Learning 7:109-138, 1991.
[14] N. Di Mauro, F. Esposito, S. Ferilli, and T. M. A. Basile. "Avoiding Order Effects in Incremental Learning". In AI*IA 2005: Advances in Artificial Intelligence, 21-23, Milan, Italy, 2005.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2001.
[16] M. Črepinšek, M. Mernik, B. R. Bryant, F. Javed, and A. Sprague. "Inferring Context-Free Grammars for Domain-Specific Languages". In Proceedings of the Fifth Workshop on Language Descriptions, Tools and Applications (LDTA 2005), 64-81, 2005.
[17] M. Črepinšek, M. Mernik, B. R. Bryant, F. Javed, and A. Sprague. "Context-Free Grammar Inference for Domain-Specific Languages". Technical Report UABCIS-TR-2006-0301-1, Dept. of Computer & Information Sciences, UAB, http://www.cis.uab.edu/softcom/GenParse/SCP05.ps, 2006.
[18] F. Javed, M. Mernik, J. Gray, and B. R. Bryant. "MARS: A Metamodel Recovery System Using Grammar Inference". Technical Report UABCIS-TR-2006-0724-1, Dept. of Computer & Information Sciences, UAB, http://www.cis.uab.edu/softcom/GenParse/mars.htm, 2006.
[19] P. Merz. "Memetic Algorithms for Combinatorial Optimization Problems: Fitness Landscapes and Effective Search Strategies". Ph.D. dissertation, Department of Electrical Engineering and Computer Science, University of Siegen, 2000.
[20] P. Langley. "Order Effects in Incremental Learning". In Learning in Humans and Machines: Towards an Interdisciplinary Learning Science, Elsevier, 1995.
[21] J. L. Elman. "Learning and Development in Neural Networks: The Importance of Starting Small". Cognition 48:71-99, 1993.
[22] F. Javed, M. Mernik, A. Sprague, and B. R. Bryant. "Incrementally Inferring Context-Free Grammars for Domain-Specific Languages". In Proceedings of the Eighteenth International Conference on Software Engineering and Knowledge Engineering (SEKE'06), San Francisco, CA, 2006.
[23] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. "Compilers: Principles, Techniques, and Tools". Addison-Wesley, 2007.
[24] J. Paakki. "Attribute Grammar Paradigms - A High-Level Methodology in Language Implementation". ACM Computing Surveys 27(2):196-255, 1995.
[25] M. Mernik and V. Žumer. "Incremental Programming Language Development". Computer Languages, Systems and Structures 31:1-16, 2005.
[26] J. F. Power and B. A. Malloy. "A Metrics Suite for Grammar-Based Software". Journal of Software Maintenance and Evolution: Research and Practice 16(6):405-426, 2005.