Tree Adjoining Grammars, Language Bias, and Genetic Programming

Nguyen Xuan Hoai, R.I. McKay, and H.A. Abbass
School of Computer Science, Australian Defence Force Academy, ACT 2600, Australia
[email protected]; rim, [email protected]

Abstract. In this paper, we introduce a new grammar guided genetic programming system called tree-adjoining grammar guided genetic programming (TAG3P+), where tree-adjoining grammars (TAGs) are used as a means to set the language bias for genetic programming. We show that the capability of TAGs to handle context-sensitive information and categories can be used to set language biases that cannot be specified in conventional grammar guided genetic programming. Moreover, we bias the genetic operators to preserve the language bias during the evolutionary process. The results pave the way towards a better understanding of the importance of bias in genetic programming.

1 Introduction

The use of bias has been a key subject in inductive learning for many years [13]. Theoretical arguments for its necessity were presented in [20, 23]. Bias is the set of factors that influence the selection of a particular hypothesis, causing some hypotheses to be preferred over others [18]. Three main forms of bias can be distinguished [21]: selection bias, language bias, and search bias. In selection bias, the criteria for selecting a hypothesis create a preference ordering over the hypothesis space. A language bias is a set of language-based restrictions on the representation of the hypothesis space. A search bias is the control mechanism for reaching one hypothesis from another in the hypothesis space.

A bias can be either exclusive or preferential. An exclusive bias eliminates certain hypotheses during the learning process, whereas a preferential bias weights each hypothesis according to some criteria. An inductive bias is said to be correct when it allows the learning system to select the correct target concept(s), whereas an incorrect bias prevents the learning system from doing so. An inductive bias is strong when it focuses the search on a relatively small portion of the hypothesis space, and weak when it focuses the search on a large portion. A declarative bias is one specified explicitly in a language designed for the purpose; if the bias is simply encoded implicitly in the search mechanism, it is said to be procedural. An inductive bias is static if it does not change during the learning process; otherwise it is dynamic.

A genetic programming (GP) system [1, 11] can be seen as an inductive learning system. In a GP system, fitness-based selection, the bias towards programs that perform well on the problem, is a selection bias. The language bias is implemented through the choice of the function and terminal sets, and the search bias is implemented by the genetic operators (mainly crossover and mutation). The language bias of a traditional GP system [11] is fixed, while GP with automatically defined functions (ADFs) [12] has a dynamic language bias.

Whigham [21, 22] introduced grammar guided genetic programming (GGGP), where context-free grammars (CFGs) are used to declaratively set the language bias. He also proposed genetic operators to implement the search bias and overcome the closure requirement, and showed that GGGP generalizes GP [22, page 129]. However, GGGP cannot handle context-sensitive language biases, and preserving a preferential language bias during the evolutionary process is difficult. Geyer-Schulz [5] independently proposed another GGGP system; his use of a complicated counting procedure on the derivation tree set of the CFG resulted in a better initialization scheme. The similar work of Gruau [4], which imposed no restriction on the size or depth of the chromosome (a derivation tree), quickly resulted in code bloat during the evolutionary process. Wong and Leung [24] used logic grammars, which allow the handling of context-sensitive information and the encoding of problem-dependent knowledge. Ryan and O'Neill [15] proposed grammatical evolution (GE), where a genotype-to-phenotype mapping is employed.

Hoai et al. [6, 7] introduced tree adjunct grammar guided GP (TAG3P), but the approach had some limitations. It used a restricted form of derivation, which has not been shown to subsume the context-free languages (though [6] gives a range of context-sensitive problems covered by this form). Moreover, the method often generates large numbers of elementary trees, especially when the number of terminals is large. In computational linguistics, this issue led to a shift from tree adjunct grammars (adjunction only) to tree adjoining grammars (adjunction and substitution) [9]. Furthermore, that work did not discuss bias.

The objective of this paper is to propose tree adjoining grammars (TAGs) for evolving computer programs and to study the effect of bias. In Section 2, the concepts of tree adjoining grammars and their advantages over context-free grammars are presented. The components of TAG3P+, and an example using bias on the 6-multiplexer problem, are discussed in Sections 3 and 4. Section 5 draws conclusions.

2 Tree Adjoining Grammars

Joshi and his colleagues [10] proposed tree-adjunct grammars, the original form of tree adjoining grammars (TAGs), in which adjunction was the only tree-rewriting operation. Later, the substitution operation was added and the new formalism became known as TAG. Although the addition of substitution did not change the strong and weak generative power of tree adjunct grammars (their tree and string sets), it made the formalism more compact, requiring fewer elementary trees [9]. TAGs have gradually replaced tree adjunct grammars in the field of computational linguistics.

TAGs are tree-rewriting systems, defined in [9] as a 5-tuple (T, V, I, A, S), where T is a finite set of terminal symbols; V is a finite set of non-terminal symbols (T ∩ V = ∅); S ∈ V is a distinguished symbol called the start symbol; and E = I ∪ A is a set of elementary trees (initial and auxiliary, respectively). In an elementary tree, all interior nodes are labeled by non-terminal symbols, while the nodes on the frontier are labeled by either terminal or non-terminal symbols. The frontier of an auxiliary tree must contain a distinguished node, the foot node, labeled by the same non-terminal as the root. The convention in [9] of marking the foot node with an asterisk (*) is followed here. With the exception of the foot node, all non-terminal symbols on the frontier of an elementary tree are marked with ↓ (i.e. for substitution). Initial and auxiliary trees are denoted α and β respectively. A tree whose root is labeled by X is called an X-type tree.

A derivation tree in TAG [17, 19] is a tree structure which encodes the history of the derivation (substitutions and adjunctions) producing the derived tree. Each node is labeled by an elementary tree: the root by an α tree, and the other nodes by either an α or a β tree. Links from a node to its offspring are marked with the addresses of the adjunctions or substitutions.

Adjunction builds a new (derived) tree γ from an A-type auxiliary tree β and a tree α (elementary or derived) with an interior node labeled A. The sub-tree α1 rooted at that node is disconnected from α, and β is attached to α in its place. Finally, α1 is attached back to the foot node of β (which, by definition, also has label A) to produce γ. In substitution, a non-terminal node on the frontier of an elementary tree is replaced by an initial tree whose root has the same label.

The set of languages generated by TAGs is a superset of the context-free languages, and is properly included in the indexed languages [9]. Lexicalized TAGs (LTAGs) [16] require each elementary tree to have at least one terminal node. [16] presents an algorithm which, for any context-free grammar G, produces an LTAG Glex that generates the same language and tree set as G (Glex is said to strongly lexicalize G). The derivation trees of G are the derived trees of Glex.

TAGs have a number of advantages for GP. The separation between derivation and derived trees provides a natural genotype-to-phenotype map. Derivation trees in TAGs are more fine-grained structures than those of CFGs; they are compact (each node is an elementary tree, equivalent to a number of nodes in a CFG derivation tree) and closer to a semantic representation [2]. Finally, in growing a derivation tree from the root, one can stop at any time and still have a valid derivation tree and a valid derived tree. We call this property "feasibility". Feasibility helps TAG3P+ (described in Section 3) to control the exact size of its chromosomes, and also to implement a wide range of genetic operators, a number of them bio-inspired, which cannot be implemented in either GP or GGGP.
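To make the two rewriting operations concrete, the following minimal Python sketch implements adjunction and substitution on node-labeled trees. The Node class, the encoding of addresses as paths of child indices, and the helper names are our own illustration, not the authors' implementation.

```python
# Minimal sketch of TAG adjunction and substitution on node-labeled trees.
# All names here are illustrative; they are not from the paper.

class Node:
    def __init__(self, label, children=None, is_foot=False, needs_subst=False):
        self.label = label              # terminal or non-terminal symbol
        self.children = children or []  # empty list => frontier node
        self.is_foot = is_foot          # foot node of an auxiliary tree (X*)
        self.needs_subst = needs_subst  # frontier non-terminal marked with a down-arrow

def find_foot(tree):
    """Return the foot node of an auxiliary tree (None if absent)."""
    if tree.is_foot:
        return tree
    for child in tree.children:
        foot = find_foot(child)
        if foot is not None:
            return foot
    return None

def adjoin(alpha, address, beta):
    """Adjunction: the sub-tree alpha1 at `address` (a path of child indices)
    is disconnected, beta is attached in its place, and alpha1 re-attaches at
    beta's foot node, which by definition carries the same label."""
    parent, idx, node = None, None, alpha
    for i in address:
        parent, idx, node = node, i, node.children[i]
    assert node.children and node.label == beta.label  # interior node, labels match
    foot = find_foot(beta)
    foot.children, foot.is_foot = node.children, False  # alpha1 hangs off the foot
    if parent is None:
        return beta                 # adjunction at the root of alpha
    parent.children[idx] = beta
    return alpha

def substitute(tree, address, alpha):
    """Substitution: replace a substitution-marked frontier non-terminal with
    an initial tree whose root bears the same label."""
    node = tree
    for i in address[:-1]:
        node = node.children[i]
    slot = node.children[address[-1]]
    assert slot.needs_subst and slot.label == alpha.label
    node.children[address[-1]] = alpha
    return tree
```

Feasibility follows directly from this picture: after every adjoin or substitute call the structure is still a well-formed tree, so growth can halt at any step.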

3 Tree Adjoining Grammar Guided Genetic Programming

In this section, we propose a new GGGP system, tree adjoining grammar guided GP (called TAG3P+ to distinguish it from TAG3P [6, 7]). To relate this work to previous systems in the context-free domain, we frame the discussion in terms of a context-free grammar G and a corresponding LTAG, Glex. However, G is strictly unnecessary and, for context-sensitive problems, would not exist. TAG3P+ evolves the derivation trees in Glex (the genotype) instead of the derived trees (the derivation trees in G, the phenotype). It thus creates a genotype-phenotype map, which is potentially many-to-one. As in canonical GP [11], TAG3P+ has the following five main components.

Program representation: the derivation trees in the LTAG Glex. As usual, programs are restricted to a bounded size. Each node also contains a list of lexemes for substitution within the elementary tree at that node (Figure 1). The main operation is adjunction.

Fig. 1. Example of an individual (derivation tree) in TAG3P+.

Parameters: minimum size of genomes (MIN_SIZE), maximum size of genomes (MAX_SIZE), size of population (POP_SIZE), maximum number of generations (MAX_GEN), and probabilities for the genetic operators.

Initialization procedure: each individual is generated by randomly growing a derivation tree in Glex to a size randomly chosen between MIN_SIZE and MAX_SIZE (unlike most GP systems, which use depth bounds). Thanks to the TAG feasibility property, the procedure always generates valid individuals of the exact size; a sketch is given below. Another initialization scheme (corresponding to ramped half-and-half initialization) generates a portion of the derivation trees, usually 50%, randomly but with a full shape.

Fitness evaluation: an individual (a derivation tree in Glex) is first mapped to a derived tree (a derivation tree in G) through the sequence of adjunctions and substitutions encoded in its genotype. The expression defined by the derived tree is then semantically evaluated as in GGGP.

Genetic operators: sub-tree crossover and sub-tree mutation. In sub-tree crossover, two individuals are selected based on their fitness, and a point is randomly chosen in each of the two derivation trees, subject to the constraint that each sub-tree can be adjoined into the other parent tree. If such points can be found, the two sub-trees are exchanged; otherwise the two individuals are discarded. This process is repeated until a valid crossover point is found or a bound is exceeded. In sub-tree mutation, a point in the derivation tree is chosen at random, and the sub-tree rooted at that point is replaced by a newly generated sub-derivation tree.

Two further constraints may optionally be imposed on the operators. Adjunction context preservation requires the adjunction addresses of the replacing sub-trees to be the same as the adjunction addresses of the replaced sub-trees. Due to the feasibility property, it is also simple to implement fair-size operators: the randomly generated sub-tree in mutation, or the sub-tree from the other parent in crossover, must have a size within a pre-specified tolerance of the size of the original sub-tree.
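The runnable sketch below illustrates the size-controlled initialization described above. The toy grammar encoding (elementary-tree names mapped to the labels of their adjoinable addresses) is a deliberate simplification of ours; real TAG3P+ also tracks substitution lexemes and allows at most one adjunction per address.

```python
import random

# Size-controlled initialization exploiting feasibility: every intermediate
# derivation tree is valid, so we can grow node by node and stop at exactly
# the sampled target size (a size bound, not a depth bound).

GRAMMAR = {
    "alpha1":   ["B"],       # initial tree with one adjoinable B node
    "beta_and": ["B", "B"],  # auxiliary trees, adjoinable at B nodes
    "beta_or":  ["B", "B"],
    "beta_not": ["B"],
}
BETAS_BY_LABEL = {"B": ["beta_and", "beta_or", "beta_not"]}

def grow_individual(min_size, max_size, rng=None):
    rng = rng or random.Random()
    target = rng.randint(min_size, max_size)     # exact target size
    root = {"tree": "alpha1", "children": []}
    nodes = [root]
    while len(nodes) < target:
        parent = rng.choice(nodes)
        label = rng.choice(GRAMMAR[parent["tree"]])   # pick an address
        beta = rng.choice(BETAS_BY_LABEL[label])      # pick a matching beta
        child = {"tree": beta, "children": []}
        parent["children"].append(child)              # record the adjunction
        nodes.append(child)
    return root   # valid derivation tree with exactly `target` nodes

individual = grow_individual(2, 40)
```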

4 An Example of TAG3P+ with Language Bias

In this section, we show how TAG3P+ can be used to preserve a context-sensitive language bias during the evolutionary process. Because of limited space, we restrict our discussion to preferential language bias, but the argument extends to exclusive bias. Unless we state otherwise, the word "bias" will refer to "preferential bias".

We use the 6-multiplexer problem, a standard Boolean function problem in GP [11]. A 6-multiplexer uses two address lines to output one of four data lines; the task is to learn this function from its 64 possible fitness cases. Following [11], the function and terminal sets are {IF, AND, OR, NOT} and {a0, a1, d0, d1, d2, d3}, respectively. The corresponding CFG [22, page 51] is G = {N = {B}, T = {a0, a1, d0, d1, d2, d3, and, or, not, if}, P, {B}}, where the rule set P is defined as follows:

(lexical rules)   B → a0 | a1 | d0 | d1 | d2 | d3
(structure rules) B → B and B | B or B | not B | if B B B

Applying the algorithm in [16], we obtain the LTAG Glex that strongly lexicalizes G: Glex = {N = {B, TL}, T = {a0, a1, d0, d1, d2, d3, and, or, not, if}, I, A}, where I ∪ A is depicted in Figure 2. TL is a lexicon (or category) that can be substituted by one lexeme in {a0, a1, d0, d1, d2, d3}.

Fig. 2. Elementary trees for Glex.
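As a concrete illustration of the problem, the sketch below evaluates a candidate Boolean expression against all 64 fitness cases (raw fitness = hits, standardized fitness = misses). The address convention (a0 as the high-order select bit) and the hand-written perfect candidate are our assumptions; in TAG3P+ the candidate would be the expression read off a derived tree.

```python
from itertools import product

def multiplexer6(a0, a1, d0, d1, d2, d3):
    """Ground truth: the two address bits select one of four data lines."""
    return [d0, d1, d2, d3][2 * a0 + a1]

def fitness(candidate):
    hits = sum(candidate(*case) == multiplexer6(*case)
               for case in product([0, 1], repeat=6))   # all 64 cases
    return hits, 64 - hits   # (raw fitness, standardized fitness)

# A perfect individual, i.e. the derived expression
# (if a0 (if a1 d3 d2) (if a1 d1 d0)):
perfect = lambda a0, a1, d0, d1, d2, d3: ((d3 if a1 else d2) if a0
                                          else (d1 if a1 else d0))
assert fitness(perfect) == (64, 0)
```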

In [22], the bias was implemented by attaching a selection merit to each rule in the rule set of G, which acts as a probability for choosing which rule to rewrite; in effect, the grammar G becomes a stochastic CFG. In TAG3P+, the bias is implemented using adjunction and substitution probabilities. The former (the structure bias) is the probability of choosing a β tree to adjoin into a given tree. The latter (the lexicon bias) is the likelihood of choosing a lexeme to substitute into a lexicon (category) positioned in an elementary tree.

We first examine the performance of TAG3P+ using the pair of grammars G and Glex, without any initial bias; consequently, the probability distributions for adjunctions and substitutions are uniform. The parameter setting is as follows: terminal operands a0, a1, d0, d1, d2, d3; terminal operators and, or, if, not; the 64 fitness cases of the 6-multiplexer; raw fitness is the total number of hits; standardized fitness is the number of misses (= 64 - number of hits); tournament selection of size 3; sub-tree crossover (unfair and adjunction-context-preserving) and sub-tree mutation (fair with a tolerance of 5, and adjunction-context-preserving); ramped initialization; MIN_SIZE of 2; MAX_SIZE of 40; POP_SIZE of 500; MAX_GEN of 50; crossover probability 0.9; mutation probability 0.1. Of 50 independent runs (UB), 23 (46%) succeeded in finding perfect solutions. The proportions of success in GP [11, page 195] and GGGP [22, page 54] with similar settings were 28% and 34%, respectively.

In the next three experiments, we show how to improve the performance of TAG3P+ with a lexical bias. Each elementary tree in Figure 2 has only one lexicon (category) for substitution, so there is no obvious advantage for TAG over CFG in implementing lexical bias. Once we allow different lexicons, however, the lexicons used for substitution can be made context-sensitive. For the 6-multiplexer, we know that there are two main categories of terminals: data lines and address lines. Can this information provide a useful bias to the learner? We define an LTAG Glex2 equivalent to Glex; Glex2 is the same as Glex but has three additional lexicons T1, T2 and T3, each of which can be substituted by any lexeme in {a0, a1, d0, d1, d2, d3}. The elementary trees for Glex2 are given in Figure 3.

Fig. 3. Elementary trees for Glex2.

A lexicon bias is implemented by biasing (increasing the substitution probability of) T1 towards the address lines (a0, a1) and T2 towards the data lines (d0-d3). Note that GGGP cannot represent this bias, because CFGs cannot handle context-sensitive information (here, the location-dependent probabilities); TAGs can represent these probabilities and handle the context-sensitive information and categories. For example, consider the elementary tree β1 in Figure 3 (also Figure 4, left). Constructing a structure similar to β1 in the CFG G requires three separate rewriting steps. In Figure 4, we use B---B to stress that there are two different and independent Bs in the rewriting steps of the CFG G. Since these two Bs use the same rewriting probability distribution for the symbol B (as in stochastic context-free grammars), it is impossible for the two Bs to have different, location-dependent probabilities.

Moreover, not all context-free grammars are lexicalized, and it is not always possible to strongly lexicalize a CFG by another CFG [16], so lexical bias cannot be imposed in GGGP using CFGs. In addition, since CFGs are not lexicalized, biasing the selection towards particular rewriting rules can have unexpected side effects. For instance, in [22, page 55] Whigham set a bias towards the if structure by setting the probability of the rule B → if B B B four times higher than that of the other rules. While this makes the IF structure appear more frequently, the non-lexicalization of the rule has the side effect of generating bushier derivation trees (not the aim of the bias).

Fig. 4. An elementary tree in Glex2 (left) and the rewriting steps in the corresponding CFG G.
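The contrast can be made concrete in a few lines: a stochastic CFG owns a single rewriting distribution per non-terminal, so every B in the rewriting steps of Figure 4 draws from the same distribution, while an LTAG elementary tree can attach a separate distribution to each substitution slot. All weights below are illustrative values of ours, not the paper's.

```python
import random

LEXEMES = ["a0", "a1", "d0", "d1", "d2", "d3"]

# Stochastic CFG: a single distribution for the non-terminal B, used at
# every occurrence of B.
CFG_WEIGHTS = {"B": [1, 1, 1, 1, 1, 1]}

def cfg_draw(rng):
    return rng.choices(LEXEMES, weights=CFG_WEIGHTS["B"])[0]

# LTAG: one elementary tree, two slots, two different distributions.
IF_TREE_SLOTS = {
    "T1": [4, 4, 1, 1, 1, 1],  # condition slot, biased toward addresses
    "T2": [1, 1, 2, 2, 2, 2],  # branch slot, biased toward data lines
}

def ltag_draw(slot, rng):
    return rng.choices(LEXEMES, weights=IF_TREE_SLOTS[slot])[0]

rng = random.Random(0)
print(cfg_draw(rng), ltag_draw("T1", rng), ltag_draw("T2", rng))
```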

The lexical bias may be useful in applications where we know that solutions must contain particular types of lexicons. The question then becomes how to bias these lexicons towards the appropriate lexemes. For the 6-multiplexer problem, we experimented with TAG3P+ using Glex2 and three different strengths of lexical bias. In separate experiments, we biased T1 to be substituted by the address lines a0, a1 and T2 to be substituted by the data lines d0-d3, at 2, 4, and 8 times the likelihood of the other lexemes. T3 was equally likely to be substituted by the address lines or the data lines (Tables 1-3).

Table 1. Substitution likelihood for lexical bias with strength 2 (B2).

Lexicon/lexemes    a0    a1    d0    d1    d2    d3
T1                  4     4     1     1     1     1
T2                  1     1     1     1     1     1
T3                  2     2     1     1     1     1

Table 2. Substitution likelihood for lexical bias with strength 4 (B4).

Lexicon/lexemes    a0    a1    d0    d1    d2    d3
T1                  8     8     1     1     1     1
T2                  1     1     2     2     2     2
T3                  2     2     1     1     1     1

Table 3. Substitution likelihood for lexical bias with strength 8 (B8).

Lexicon/lexemes    a0    a1    d0    d1    d2    d3
T1                 16    16     1     1     1     1
T2                  1     1     4     4     4     4
T3                  2     2     1     1     1     1
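As a sketch (using our own encoding of Table 3), the weights induce the following substitution probabilities; note how the address group under T1 receives total weight 32 against 4 for the data group, i.e. is 8 times as likely, matching the bias strength.

```python
LEXEMES = ["a0", "a1", "d0", "d1", "d2", "d3"]
B8 = {
    "T1": [16, 16, 1, 1, 1, 1],  # biased toward the address lines
    "T2": [1, 1, 4, 4, 4, 4],    # biased toward the data lines
    "T3": [2, 2, 1, 1, 1, 1],    # address and data groups equally likely
}

def substitution_probs(weights):
    """Normalize the relative weights of one lexicon into probabilities."""
    total = sum(weights)
    return {lex: w / total for lex, w in zip(LEXEMES, weights)}

print(substitution_probs(B8["T1"]))   # P(a0) = P(a1) = 16/36
```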

Each of the three experiments was run 50 times using TAG3P+. The proportions of success were 62%, 72%, and 74% for B2, B4, and B8 respectively; all were statistically significantly different from the unbiased TAG3P+ runs (α = 0.05). Further increases in the bias did not improve the performance.

Although the lexical bias can alter the likelihood of particular lexemes being substituted into particular lexicons within an elementary tree, it is not guaranteed to be maintained after adjunctions of elementary trees, because substitution happens only at the elementary tree level. To prevent the bias from being destroyed by adjunction, a structure bias (adjunction bias) is needed. It operates on the interaction between elementary trees, biasing the likelihood of one tree adjoining into another. In the 6-multiplexer problem above, if β3 in Figure 3 is adjoined into an address B (which is connected to T2), the biased lexeme for T2 (presumably a data line) may be placed in a wrong location in β3 (presumably an address line). To prevent this, the likelihood of selecting such adjunctions when growing derivation trees in Glex2 was set at 10% of the likelihood of selecting other adjunctions. To fully implement this initial structure bias, two copies of each of the trees β4 to β8 were created, with T3 renamed to T1 in the first copy and to T2 in the second.

Another 50 independent runs were conducted using the modified version of Glex2 with a lexical bias of strength 8 plus the initial structure bias described above (SLB). The success rate was 78%; compared to B8, the improvement was not statistically significant. In trying to understand why the impact of the lexical and structure biases was less than expected, we found that crossover occasionally destroyed the bias.

In the final experiment (FB), we therefore used a biased crossover (a search bias) to preserve the effect of the lexical and structure biases, as sketched after Figure 5. Whenever two points are chosen for crossing over, we calculate the joint probability of the adjunctions at the two corresponding addresses. If this probability decreases after crossing over (i.e. we would be moving from more likely to less likely adjunctions), the crossover is accepted with a low probability of 10%. We conducted 50 more independent runs of TAG3P+, using the lexical bias of strength 8 plus the initial structure bias plus the bias-preserving crossover. The proportion of success was 88%, a statistically significant improvement over B8 (α = 0.05). The cumulative frequency of success for all six experiments is given in Figure 5.

Fig. 5. The cumulative frequencies of success in the six experiments.
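A possible reconstruction of the bias-preserving crossover test follows; the ADJ_PROB table, the (parent, address, child) link tuples, and the default weight are all hypothetical stand-ins of ours, since the paper does not give its data structures.

```python
import random

ADJ_PROB = {
    # structure-bias probability of adjoining `child` at a labeled
    # address of `parent` (illustrative values)
    ("beta_if", "cond", "beta_and"):  0.30,
    ("beta_if", "cond", "beta_not"):  0.03,  # a discouraged adjunction
    ("beta_and", "left", "beta_not"): 0.20,
    ("beta_and", "left", "beta_and"): 0.20,
}

def joint_probability(links):
    """Joint probability of the adjunctions at the crossover addresses."""
    p = 1.0
    for link in links:
        p *= ADJ_PROB.get(link, 0.10)  # default weight for unlisted links
    return p

def accept_crossover(links_before, links_after, rng=random):
    """Accept freely unless the swap moves toward less likely adjunctions,
    in which case accept with low probability (10%)."""
    if joint_probability(links_after) >= joint_probability(links_before):
        return True
    return rng.random() < 0.10

before = [("beta_if", "cond", "beta_and"), ("beta_and", "left", "beta_not")]
after  = [("beta_if", "cond", "beta_not"), ("beta_and", "left", "beta_and")]
print(accept_crossover(before, after))   # usually False: bias would weaken
```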

As with lexical bias, GGGP using CFGs cannot implement a structure bias, because a structure bias must consider the inter-relations between different elementary trees in the LTAG: each such relation spans several rewriting steps in the corresponding CFG, and it is impossible for CFGs to represent this context-sensitive information. The lexical and structure biases used here are declarative biases; the crossover bias is procedural.

5 Conclusions and Future Work

We have proposed a new GGGP system, TAG3P+, using tree-adjoining grammars. Through experiments on a standard GP problem, we showed that TAGs permit context-sensitive language biases at the lexicon and structure levels that cannot be specified in CFG-based GGGP. Although we considered only preferential bias in this paper, the same arguments hold for exclusive biases, viewed as the limiting case in which some probabilities are set to zero.

In this paper, we assumed the correctness of the biases provided by the user. We are currently studying mechanisms to help TAG3P+ shift biases automatically. We are also investigating a more general bias-preserving crossover using probabilistic models of adjunctions. Lastly, we are investigating bio-inspired operators within TAG3P+, such as transposition, translocation and replication, which can be implemented thanks to the feasibility property of TAGs.

References

1. Banzhaf W., Nordin P., Keller R.E., and Francone F.D.: Genetic Programming: An Introduction. Morgan Kaufmann (1998).
2. Candito M.H. and Kahane S.: Can the TAG Derivation Tree Represent a Semantic Graph? An Answer in the Light of Meaning-Text Theory. In: Proceedings of TAG+4, Philadelphia (1999) 25-28.
3. Cohen W.W.: Grammatically Biased Learning: Learning Logic Programs Using an Explicit Antecedent Description Language. Technical Report, AT&T Bell Laboratories, Murray Hill, NJ (1993).
4. Gruau F.: On Using Syntactic Constraints with Genetic Programming. In: Advances in Genetic Programming II, The MIT Press (1996) 377-394.
5. Geyer-Schulz A.: Fuzzy Rule-Based Expert Systems and Genetic Machine Learning. Physica-Verlag, Germany (1995).
6. Hoai N.X.: Solving the Symbolic Regression Problem with Tree Adjunct Grammar Guided Genetic Programming: The Preliminary Result. In: Proceedings of the 5th Australasia-Japan Workshop on Evolutionary and Intelligent Systems (2001) 52-61.
7. Hoai N.X., McKay R.I., and Essam D.: Solving the Symbolic Regression Problem with Tree Adjunct Grammar Guided Genetic Programming. Australian Journal of Intelligent Information Processing Systems 7(3) (2002) 114-121.
8. Hoai N.X., Shan Y., and McKay R.I.: Is Ambiguity Useful or Problematic for Genetic Programming? A Case Study. To appear in: Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning (SEAL'02) (2002).
9. Joshi A.K. and Schabes Y.: Tree Adjoining Grammars. In: Rozenberg G. and Salomaa A. (eds): Handbook of Formal Languages. Springer-Verlag (1997) 69-123.
10. Joshi A.K., Levy L.S., and Takahashi M.: Tree Adjunct Grammars. Journal of Computer and System Sciences 10(1) (1975) 136-163.
11. Koza J.: Genetic Programming. The MIT Press (1992).
12. Koza J.: Genetic Programming II. The MIT Press (1994).
13. Mitchell T.M.: Machine Learning. McGraw-Hill (1997).
14. Mitchell T.M., Utgoff P., and Banerji R.: Learning by Experimentation: Acquiring and Refining Problem-Solving Heuristics. In: Machine Learning: An Artificial Intelligence Approach. Springer-Verlag (1984) 163-190.
15. O'Neill M. and Ryan C.: Grammatical Evolution. IEEE Transactions on Evolutionary Computation 4(4) (2000) 349-357.
16. Schabes Y.: Mathematical and Computational Aspects of Lexicalized Grammars. Ph.D. Thesis, University of Pennsylvania, USA (1990).
17. Shanker V.: A Study of Tree Adjoining Grammars. Ph.D. Thesis, University of Pennsylvania, USA (1987).
18. Utgoff P.: Machine Learning of Inductive Bias. Kluwer Academic Publishers (1986).
19. Weir D.J.: Characterizing Mildly Context-Sensitive Grammar Formalisms. Ph.D. Thesis, University of Pennsylvania, USA (1988).
20. Valiant L.: A Theory of the Learnable. Communications of the ACM 27(11) (1984) 1134-1142.
21. Whigham P.A.: Search Bias, Language Bias and Genetic Programming. In: Genetic Programming 1996, The MIT Press, USA (1996) 230-237.
22. Whigham P.A.: Grammatical Bias for Evolutionary Learning. Ph.D. Thesis, University of New South Wales, Australia (1996).
23. Wolpert D. and Macready W.: No Free Lunch Theorems for Search. Technical Report SFI-TR-95-02-010, Santa Fe Institute, Santa Fe, NM (1995).
24. Wong M.L. and Leung K.S.: Evolutionary Program Induction Directed by Logic Grammars. Evolutionary Computation 5 (1997) 143-180.