Learning Finite State Transducers: Evolution Versus Heuristic State Merging

Simon M. Lucas and T. Jeff Reynolds
Department of Computer Science, University of Essex, Colchester CO4 3SQ, UK
[email protected]

Abstract

Finite State Transducers (FSTs) are Finite State Machines (FSMs) that map strings in a source domain into strings in a target domain. While there are many reports in the literature of evolving FSMs, there has been much less work on evolving FSTs. In particular, the fitness functions required for evolving FSTs are generally different from those used for FSMs. In this paper three string-distance based fitness functions are evaluated, in order of increasing computational complexity: string equality, Hamming distance and edit distance. The fitness-distance correlation (FDC) and evolutionary performance of each fitness function are analysed when used within a random mutation hill-climber (RMHC). Edit distance has the strongest FDC and also provides the best evolutionary performance, in that it is more likely to find the target FST within a given number of fitness function evaluations. Edit distance is also the most expensive to compute, but in most cases this extra computation is more than justified by its performance. The RMHC was compared with the best known heuristic method for learning FSTs, the Onward Sub-sequential Transducer Inference Algorithm (OSTIA). On noise-free data the RMHC performs best on problems with sparse training sets and small target machines. The RMHC and OSTIA offer similar performance for large target machines and denser data sets. When noise-corrupted data is used for training, the RMHC still performs well, while OSTIA performs poorly given even small amounts of noise. The RMHC is also shown to outperform a genetic algorithm. Hence, for certain classes of FST induction problem, the RMHC presented in this paper offers the best performance of any known algorithm.

Key Words: Finite state transducer, random mutation hill climber, string translation, string distance, state merging.

1 Introduction

The evolution of Finite State Machines (FSMs) in general has been much studied. Finite State Transducers (FSTs), by contrast, have received very little attention from the evolutionary computing community. The distinction between an FSM and an FST lies in how the behaviour of the machine is interpreted, rather than the inherent nature of the machine. In general the behaviour of an FSM is judged by the net effect of a sequence of actions (outputs) on its containing environment, where the ordering of the actions may or may not be important. Furthermore, the FSMs being evolved are typically situated in an environment such that their actions modify future inputs from the environment. An FST, however, translates strings from an input language to strings from an output language, with no feedback from the output. The exact order in which a string of output symbols is generated is of fundamental importance.

Learning FSTs is of interest because string transduction problems arise naturally in many application areas. Projecting from the raw input domain to another more convenient domain is required to render tractable many computer-based manipulations of data. Often the required mappings are complex and non-linear. Human intuition can be used to invent them, but there is also scope for machine learning methods to discover new and better projections. Current application areas are mostly in human language manipulation, for example: restricted-domain machine translation (Oncina et al [1], Alshawi et al [2]), text-to-speech pronunciation modelling (Gildea and Jurafsky [3]) and the parsing of web pages (Hsu and Dung [4]).

Our interest in FSTs originally arose from work on pattern recognition. Here we can see the potential for FSTs to provide novel transformations from raw sensor data to features useful in classification tasks. In particular we have investigated chain code transducers, which can be used to boost performance in hand-written character recognition as shown by Lucas [5]. The main focus of this paper, however, is the challenge of learning FSTs from samples of training string pairs, which is an extremely difficult machine learning problem.

Work on evolving FSMs dates back to the 1960s, when Fogel et al [6] evolved FSMs for predicting symbol sequences. Follow-up work included applications to game playing [7] and to system identification [8]. More recent work includes that of Chellapilla and Czarnecki [9], who evolved modular FSMs for the “food trail” problem of Jefferson et al [10], and Sanchez et al [11], who evolved FSMs to solve exploration and maze problems. Spears and Gordon [12] evolved FSMs for a resource protection game and their work is further discussed in Section 7. The theme of exploring modular methods for FSM construction has also been investigated by Inagaki [13], who evolved trees of deterministic finite automata for predicting symbol sequences. Hybrid systems are also possible; for example, Benson [14] evolved an augmented type of FSM where each node had an associated GP-evolved program.

Deterministic Finite Automata (DFA) are a class of FSM that produce no output symbols, but instead have accepting and rejecting states which are used to decide whether a string belongs to a particular regular language. Learning DFA from samples of labelled data is a problem that has also been extensively studied. It has been shown to be a hard task by a number of criteria (see Pitt and Warmuth [15] and Kearns and Valiant [16]), and is a good benchmark for evaluating machine learning algorithms. More discussion can be found in Dupont et al [17] and Oliveira and Silva [18]. The Evidence Driven State Merging (EDSM) algorithm due to Price was shown to be a very successful algorithm for learning DFA in the Abbadingo One competition [19], and has since been refined by Cicchello and Kremer [20], and also by Lang [21]. There have been many attempts to learn DFA, or their associated regular languages.
Evolutionary approaches have been investigated by Dupont [22], Luke et al [23], and the authors [24]. Recurrent neural networks have been applied using appropriate variations of the back-propagation training algorithm by Giles et al [25] and Watrous and Kuhn [26]. Recurrent neural networks have also been evolved, see Angeline et al [27]. There has also been interest in evolving context-free grammars (Wyard, [28], Lucas [29]), or their equivalent pushdown automata (Lankhorst [30]). Though there is not as much literature on learning FSTs, there is a close connection between
methods for learning DFA and FSTs. Efficient algorithms have been developed for inferring a certain class of FST known as sub-sequential transducers. One in particular is the Onward Sub-sequential Transducer Inference Algorithm (OSTIA) [1] of Oncina et al. OSTIA is a state-merging algorithm which is an extension of earlier work on state-merging algorithms for inferring DFA, such as the early work of Trakhtenbrot and Barzdin [31]. The data-driven version of OSTIA [32] is related to the EDSM algorithm in that it uses a similar heuristic to decide the best state merge at any point in time.

The first author recently showed that a random mutation hill-climber (RMHC) was able to learn small FSTs from a training sample of string pairs [33], and showed how fitness-distance correlation measures could be used to predict the performance of string-distance fitness functions. The original work studied string equality and Hamming distance. The current paper significantly extends that work with the development of a superior fitness function based on string-edit distance, and many new experimental results, which are summarised in the following paragraphs.

The first results section (Section 5) evaluates our evolutionary approach on a task taken from the domain of character recognition. Here the performance of the RMHC with the three distance measures is analysed in detail, and compared with OSTIA. The comparison studies the effects of varying both training set size and training string length. The second results section (Section 6) describes a full-scale comparison of the RMHC with OSTIA. This comparison was made on as equal a basis as possible, determining not only how well both methods find good approximations to target FSTs but also how fast they find them. In order to do this, large numbers of random target FSTs were generated in order to study the effects of varying the size of the target FST and the size of the training set. This section also includes results on varying the maximum number of states allowed in the evolved FST. Section 7 compares the performance of the RMHC with a genetic algorithm, and finds the RMHC to be superior for this problem. Section 8 compares the performance of evolutionary methods with OSTIA when presented with noisy training data, and shows evolution to be far more robust, a similar result to that recently observed by the authors for DFA induction [34]. Section 9 discusses possible future work and Section 10 concludes.


2 Finite State Transducers

An FST transforms strings from an input language LI to an output language LO, where I is the alphabet of input symbols and O is the alphabet of output symbols. An FST is usually defined as a five-tuple T = (Q, Σ, q0, F, δ) (e.g. Jurafsky and Martin [35]). In this paper a slightly modified definition is used that omits F, the set of final (or halting) states. This omission is possible because there is no need to stop transforming the input string before the end. This is not a restriction, since any halting state can be emulated by one where all transitions from that state loop back to itself while producing no output. Therefore, in this paper we denote an FST as a four-tuple T = (Q, Σ, q0, δ) defined as follows:

• Q is the set of all states, labelled 0 to NQ − 1.

• Σ is a finite set of symbol pairs i : o, where i ∈ I and o ∈ (O ∪ {ε}), where ε is the null symbol.

• q0 is the start state.

• δ(q, i : o) → q′ is the state transition function defining, for each state of the FST, the next state for a given input/output pair.

Note that such an FST is deterministic on the set of inputs, since the transition function defines one unique next state. Note also that each input symbol gives rise to either one output symbol or a null. This implies that the output strings are less than or equal to the input strings in length. This is not a problematic restriction for the test problems used in this paper. However, it is worth noting that OSTIA does not have this restriction and allows more than one symbol to be output in any transition. The fact that OSTIA is searching a space of more general machines does theoretically put it at a disadvantage on the restricted problems we study here. To make a truly fair comparison in this sense, it would be necessary either to implement a restricted version of OSTIA that always has a single output symbol on each transition, or to implement an extended version of our EA where the representation allows for strings of output symbols on each transition. These would be interesting experiments to conduct. The modifications needed to each algorithm could entail a significant amount of work, however, and this has not yet been investigated.
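The four-tuple definition maps directly onto two lookup tables. The following sketch is our illustration (not code from the paper): it represents δ with a next-state table and an output table, with −1 standing for the null symbol ε.

```java
// Sketch of the four-tuple FST T = (Q, Sigma, q0, delta) described above.
// States are integers 0..NQ-1; input/output symbols are small integers.
// An output entry of -1 stands for the null symbol (epsilon).
class SimpleFst {
    final int[][] next;  // next[q][i] = next state for input i in state q
    final int[][] out;   // out[q][i]  = output symbol, or -1 for epsilon

    SimpleFst(int[][] next, int[][] out) {
        this.next = next;
        this.out = out;
    }

    // Deterministic transduction: read each input symbol once, emit the
    // associated output symbol unless it is epsilon. The machine never
    // halts early, matching the omission of final states above.
    String transduce(int[] input) {
        StringBuilder sb = new StringBuilder();
        int q = 0;  // q0 is the start state
        for (int i : input) {
            if (out[q][i] >= 0) sb.append(out[q][i]);
            q = next[q][i];
        }
        return sb.toString();
    }
}
```

For example, a one-state FST over binary symbols with next = {{0, 0}} and out = {{1, 0}} simply inverts each bit, so input 011 transduces to "100".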


3 Evolutionary Algorithm

The main evolutionary algorithm (EA) used is a random mutation hill-climber (RMHC). Random mutation hill-climbers, also known as (1 + 1) Evolutionary Strategies when in real-valued search spaces [36], are the simplest form of evolutionary algorithm, but often perform competitively with more complex EAs [37]. Choosing to use a random hill-climber greatly simplifies the design of the EA, as it obviates the need to experiment with population size and selection methods. It also avoids the need to define a meaningful crossover operator. The genotype encodes an FST, which is a special type of graph. Standard crossover operators defined for string or vector based genotypes suffer the problem of competing conventions when used on strings that encode graphs [38], and tend to be destructive. While there are certainly cases where crossover significantly outperforms mutation, a good example being the hurdle fitness function [39], there is no strong evidence yet that this is the case for FSMs. This subject is revisited later in Section 7, when the RMHC is compared with a genetic algorithm.

In related work on DFA, the authors began with a simple random hill-climber [24], but later found even better performance with a multi-start random hill-climber [34]. This paper uses the simple random hill-climber, but it should be noted that performance could probably be improved by adopting a multi-start policy.

The RMHC starts with a random individual representing an FST and then hill-climbs (or wanders neutral plateaus) until it finds an individual that is perfectly fit on the training set or a maximum number of evaluations is exceeded. At each generation a randomly mutated copy of the current individual is created, and this replaces that individual if its fitness is greater than or equal to the current individual's fitness. Note the or equal clause in the replacement condition, which is important to allow the algorithm to freely wander large neutral plateaus in the fitness landscape.
Such plateaus are common in this domain.
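In outline, the RMHC loop described above can be sketched as follows. This is an illustrative sketch, not the authors' code; the mutate and fitness operators are supplied by the caller, and fitness is assumed to be normalised to at most 1.0 (a perfect training-set score).

```java
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Generic random mutation hill-climber sketch.
class Rmhc {
    static <T> T run(T start, UnaryOperator<T> mutate,
                     Function<T, Double> fitness, int maxEvals) {
        T current = start;
        double currentFit = fitness.apply(current);
        for (int evals = 1; evals < maxEvals && currentFit < 1.0; evals++) {
            T child = mutate.apply(current);
            double childFit = fitness.apply(child);
            // Accept on greater-than-OR-EQUAL fitness, so the search can
            // drift across neutral plateaus in the landscape.
            if (childFit >= currentFit) {
                current = child;
                currentFit = childFit;
            }
        }
        return current;
    }
}
```

In the FST setting, T would be the two-table genotype of Section 3.1 and fitness one of the string-distance measures of Section 3.3.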

3.1 The FST Genotype

An individual FST with NQ states and NI input symbols can be represented by two tables of size NQ × NI with integer entries. These are a transition table, to indicate the next state and an output
table to indicate the current output symbol. Both tables are indexed on the current state and the current input symbol. The nominal number of states is held fixed for each run so that the two tables representing the current FST are of fixed size. Each entry in the initial transition table is a random integer in the range 0 . . . NQ −1. Each entry in the initial output table is a random integer in the range 0 . . . NO −1. The number of possible genotypes is given by:

S = NQ^(NQ×NI) × NO^(NQ×NI)    (1)

This quickly grows very large, but note that there are a factorial number of isomorphically equivalent FSTs for the FST encoded by a particular genotype. This is because the number of ways of labelling a distinct FST grows at least factorially with the number of states. Here it is useful to think of a phenotype space of canonical FSTs, where each point in the space corresponds to a distinct string transduction function. Although the RMHC searches the genotype space in a uniform way, the search in the phenotype (FST) space is biased in the sense that the underlying FSTs are sampled in a non-uniform way. For example, an FST in our fixed-size representation where all n states are reachable will be one of at least n! identically behaving isomorphic FSTs, but an FST with fewer than n reachable states will have many more genotypes with identical phenotypes, since the transitions and outputs from unreachable states can take any values without affecting the behaviour of the FST.

Consider a 10-state FST with binary inputs and outputs. If all states are reachable, then it will have at least 10! ≈ 3.6 × 10^6 representations in the genotype. An FST in the same genotype space with only 5 reachable states will, however, have at least

(10!/(10 − 5)!) × 10^(5×2) × 2^(5×2) ≈ 3.1 × 10^17

equivalent machines (by substituting the appropriate numbers into Equation 1). Further investigation and exploitation of this phenomenon would be very interesting, as done by Igel and Stagge [40] for neural networks.
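A genotype in this representation is simply the two randomly initialised integer tables, and Equation 1 can be evaluated directly with big-integer arithmetic. The sketch below is our illustration; the class and method names are not from the paper.

```java
import java.math.BigInteger;
import java.util.Random;

// Fixed-size FST genotype: an NQ x NI next-state table and an NQ x NI
// output table, each entry initialised uniformly at random.
class FstGenotype {
    final int[][] next;  // entries in 0..NQ-1
    final int[][] out;   // entries in 0..NO-1

    FstGenotype(int nq, int ni, int no, Random rng) {
        next = new int[nq][ni];
        out = new int[nq][ni];
        for (int q = 0; q < nq; q++) {
            for (int i = 0; i < ni; i++) {
                next[q][i] = rng.nextInt(nq);
                out[q][i] = rng.nextInt(no);
            }
        }
    }

    // Equation 1: S = NQ^(NQ*NI) * NO^(NQ*NI), the size of the genotype
    // space (too large for a long in the 10-state binary example).
    static BigInteger countGenotypes(int nq, int ni, int no) {
        return BigInteger.valueOf(nq).pow(nq * ni)
                .multiply(BigInteger.valueOf(no).pow(nq * ni));
    }
}
```

For the trivial case NQ = 2, NI = 1, NO = 2, Equation 1 gives 2^2 × 2^2 = 16 genotypes.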


3.2 Mutation Operator

The mutation operator works as follows: a decision is made with equal probability to either mutate the transition table or the output table. A random location is selected in the chosen table, and the entry there is modified. This ensures that mutation causes at least one change. An iteration is then performed over all the table entries apart from the entry just modified, changing each entry with a probability of 2/(NQ × NI ). This probability was chosen based on some empirical investigation. When an entry is modified, a symbol is chosen from a uniform distribution of all possible symbols not including the current value. In this way, a single call to the mutation operator is most likely to produce one or two changes to the FST tables, but can also produce more (following a geometric distribution).
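The operator described above can be sketched as follows. This is our illustration (names are hypothetical), assuming the two-table genotype of Section 3.1 and NQ ≥ 2, NO ≥ 2 so that a different value always exists.

```java
import java.util.Random;

// Mutation sketch: guarantee one change in a randomly chosen table, then
// flip each remaining entry of both tables with probability 2/(NQ*NI).
class FstMutator {
    static void mutate(int[][] next, int[][] out, int nq, int ni, int no,
                       Random rng) {
        boolean pickNext = rng.nextBoolean();           // table that gets the
        int mq = rng.nextInt(nq), mi = rng.nextInt(ni); // guaranteed change
        if (pickNext) next[mq][mi] = differentValue(next[mq][mi], nq, rng);
        else out[mq][mi] = differentValue(out[mq][mi], no, rng);

        // Every other entry mutates independently with probability
        // 2/(NQ*NI), giving one or two changes per call on average.
        double p = 2.0 / (nq * ni);
        for (int q = 0; q < nq; q++)
            for (int i = 0; i < ni; i++) {
                boolean guarded = q == mq && i == mi;
                if (!(guarded && pickNext) && rng.nextDouble() < p)
                    next[q][i] = differentValue(next[q][i], nq, rng);
                if (!(guarded && !pickNext) && rng.nextDouble() < p)
                    out[q][i] = differentValue(out[q][i], no, rng);
            }
    }

    // Uniform choice over all values in 0..numValues-1 except `current`.
    static int differentValue(int current, int numValues, Random rng) {
        int v = rng.nextInt(numValues - 1);
        return v >= current ? v + 1 : v;
    }
}
```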

3.3 Fitness Functions

The choice of fitness function plays a critical part in the success of an evolutionary algorithm. Fitness functions based on three different string distance measures were studied. Each distance measure is applied to the target output string t, and the evolved transducer output string e. The distance measures are based on strict equality (dstrict ), Hamming distance (dHam ), and edit distance (dedit ). Each distance function is normalised to have a value in the range from 1.0 for two strings with nothing in common to 0.0 for two identical strings. Each distance function is mapped to a fitness function by subtracting it from 1.0. The overall fitness on a dataset is then the average over all the fitness scores for the individual (t, e) string pairs. Strict distance is defined by:

dstrict(t, e) = 0 if t = e; 1 if t ≠ e    (2)

Normalised Hamming distance is defined by:

dHam(t, e) = ( Σ i=1..Max(|t|,|e|) ∆(ti, ei) ) / Max(|t|, |e|)    (3)

where ∆ is zero if the symbols are equal and one otherwise. If the strings are of unequal length,
then the shorter string is effectively padded with non-matching characters. In other words, the symbol-by-symbol matching is performed with the strings left-justified. The maximum length of the two strings is denoted by Max(|t|, |e|), where |s| is the length of string s. In the implementation we stop the loop at the end of the shorter string and add on the difference in length of the two strings. Normalised edit distance is defined by:

dedit(t, e) = Edits(t, e) / Max(|t|, |e|)    (4)

i.e. the ratio of the number of edits between the two strings over the maximum number of edits possible given the string lengths. The edit distance (also known as Levenshtein Distance [41]) uses dynamic programming to compute the minimum number of insertions, deletions and substitutions needed to edit one string into the other. The maximum number of edits is equal to the length of the longer string. There is a sense in which the normalised edit distance is the most natural choice of distance measure for this problem. In general, an FST operates by transforming strings in the source domain into strings in the target domain, but in doing so may read or produce null symbols, thereby creating insertion or deletion errors, and make incorrect outputs, producing substitution errors. Table 1 illustrates the differences between these distance measures. The table shows the string distance between the string 0123456789 and the strings shown in each column head. Note that each string differs by a single deletion from the longer string, and since the length of the longer string is 10, this gives rise to a normalised edit distance of 0.1 in each case. The strict distance sees each string as being completely different, while the Hamming distance significantly punishes a deletion towards the start (column 1), but is more forgiving of a deletion towards the end (column 2). Therefore, the Hamming distance has a positional bias, whereas the normalised edit distance does not.
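The three normalised distances can be sketched in code as follows (an illustrative implementation, not the paper's: the Hamming version uses the shorter-loop-plus-length-difference trick described above, and the edit version is the standard dynamic program). The assertions reproduce the Table 1 values.

```java
// The three normalised string distances of Section 3.3. Each returns a
// value in [0,1]: 0 for identical strings, 1 for nothing in common.
class StringDistances {
    static double strict(String t, String e) {
        return t.equals(e) ? 0.0 : 1.0;
    }

    // Left-justified symbol comparison; the shorter string is effectively
    // padded with non-matching characters (Equation 3).
    static double hamming(String t, String e) {
        int min = Math.min(t.length(), e.length());
        int max = Math.max(t.length(), e.length());
        if (max == 0) return 0.0;
        int diff = max - min;  // padded positions always mismatch
        for (int i = 0; i < min; i++)
            if (t.charAt(i) != e.charAt(i)) diff++;
        return diff / (double) max;
    }

    // Levenshtein distance by dynamic programming, normalised by the
    // length of the longer string (Equation 4).
    static double edit(String t, String e) {
        int n = t.length(), m = e.length();
        if (Math.max(n, m) == 0) return 0.0;
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                int sub = t.charAt(i - 1) == e.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return d[n][m] / (double) Math.max(n, m);
    }
}
```

For example, hamming("0123456789", "023456789") gives 0.9 while edit gives 0.1, matching Table 1.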

3.4 Computational Complexity

Given two strings of length m and n respectively, computing the strict distance has a worst-case time complexity of O(n) when n = m; otherwise it is constant time (assuming that string length can be computed in constant time, which depends on the implementation of the string data structure). Computing the Hamming distance has a complexity of O(Min(m, n)), while the edit distance is O(mn). So, while we might expect the edit distance to provide a smoother (and perhaps easier to search) fitness landscape, it reduces the number of fitness evaluations that can be made within a given amount of CPU time.

f          s = 023456789    s = 012345679
strict         1.0              1.0
Hamming        0.9              0.2
edit           0.1              0.1

Table 1: String distance between 0123456789 (i.e. f(s, 0123456789)) and the strings in each column head, calculated with the chosen distance measures.

4 Onward Sub-sequential Transducer Inference Algorithm (OSTIA)

As indicated in the introduction, the problem of finding an FST based on training data is an extension of the problem of inferring a DFA. A DFA can be inferred by constructing a prefix tree equivalent to the training data and progressively merging states [31]. In general many valid merges will be possible at any step. A valid merge preserves the consistency of the DFA with the training data and normally produces a new DFA which is a generalisation of the previous one. The new DFA is said to be “correct” if it is consistent with the target automaton. However, without exhaustive training data, merges amount to guesses about the behaviour of the target automaton on unseen data. These guesses can turn out to be wrong. Successful algorithms such as EDSM [19] focus on minimising the risk of making wrong guesses by employing heuristic evidence to choose the merges which are most likely to lead to the target automaton. The heuristic used by EDSM is to count the training data in support of each merge, and choose the merge with the greatest support.

The OSTIA algorithm is related to this work on DFA, except that it infers FSTs. OSTIA is packaged with a pre-processor that allows the transduction of sequences of input words (a word here is defined as a sequence of non-whitespace symbols delimited by whitespace) into
sequences of output words, rather than strings of input symbols into strings of output symbols. This is achieved by building a look-up table for input and output words that maps each unique word into a unique symbol. To preserve consistency with our previous description of FSTs, we shall describe OSTIA as we have used it, i.e. in inferring FSTs where the words are single symbols and sequences can be referred to unambiguously as strings. Unlike the FSTs we evolve, OSTIA allows more than one output symbol to occur on each transition arc. This is so it can easily cope with transductions where the output may be longer than the input, a situation which occurs frequently in machine translation.

OSTIA starts by building a prefix tree from the training data input strings and places the complete output strings at the corresponding leaves of this tree. It then moves as large substrings of the output strings forward in the tree (i.e. towards the root) as it can without producing any contradiction. This initial FST then undergoes progressive state merging until no more merges are possible and a final, most general, FST is produced. As with EDSM, the merges are chosen on the basis of heuristic evidence that they are correct. There is the additional complication that parts of output strings may have to be pushed back towards their initial positions to allow merges. The important point is that OSTIA progresses towards its best guess of the target FST by a series of heuristic decisions to which it commits at each step. Its underlying prior assumption is that the smallest FST it can find is the FST which is most likely to be correct.

The RMHC approach is obviously very different. The algorithm randomly hill-climbs towards perfect performance on the training set, but there is no guarantee that it will achieve this. Another difference is that OSTIA assumes that there is no noise in the data, whilst the RMHC does not have this restriction.
In Section 8 it is shown that the RMHC is indeed more noise-resilient than OSTIA, a factor that could prove crucial in real-world applications. An apparent weakness of the RMHC is that it is necessary to choose a maximum size for the target FST, even though this does not imply that a smaller FST cannot be evolved. However, it could be argued that deciding a maximum size for the FST is actually a good thing. It is a regularisation of the learning process that inhibits the evolved FST from over-fitting the training data. Where prior knowledge exists of the likely number of states needed to solve a problem, this information can be used directly. However, as the results
show in Section 6.1, the RMHC is not over-sensitive to the choice of the maximum number of states.

The RMHC is implemented in Java and run on a standard desktop PC. For convenience, the C++ implementation of OSTIA kindly provided by Jose Oncina was translated into Java to perform our comparison experiments. We used the data-driven variant of the OSTIA algorithm detailed in [32].

5 Results: Learning a 4-8 Chain Code FST

The original motivation for this work came from chain code based recognition of binary images. Freeman chain codes [42] represent a 2-d image as a sequence of symbols where each symbol represents a move around the contour of an image. Some example 4-direction chain codes are shown in Figure 1. Each chain code begins with the (x, y) location of the start of the chain (with the origin in the top left of the grid), which is followed by the sequence of movement vectors encoded as symbols {0, 1, 2, 3} (see Figure 1 (c)). These binary images were converted to chain codes by tracing around the connected edges of each group of pixels. In this way, a closed loop leads to two chain codes, one for the inside and one for the outside of the loop. Small differences in the image can lead to significantly different chain codes, as shown here, though they share long common subsequences. For a detailed description of the procedure see [5].

Any string classification method can be used to build a chain coded image recognizer, see for example Bunke and Buhler [43]. An efficient method introduced by Lucas and Amiri [5] is the scanning n-tuple classifier. This operates by building statistical language models for each class of pattern, similar to conventional n-gram models, except that an ensemble of models is used with each model using different displacements between its sample points.

The classification accuracy with this scheme depends on the choice of input coding. For example, 8-direction codes can significantly outperform 4-direction codes. Furthermore, in other applications of chain coded image recognition, rotation invariance can be important as shown by Mollineda et al [44]. This can be coarsely achieved by difference coding the chain code. This then makes the code invariant to rotations of 90 degrees in the case of the 4-direction code, for example. A raw 4-direction code can be transformed into an 8-direction code, or difference coded, using an FST.
Of course, FSTs are capable of doing much more than this, but become complex to design by hand. Furthermore, there is no clear specification of what the output strings should be, other than the requirement that the transformed strings should lead to a better classification accuracy. Therefore, evolving FSTs for this task would appear to be well worth attempting. However, the evaluation of the fitness function is time-consuming when applied to large image recognition datasets.

Figure 1: Mapping an image to a chain code. Image (a) produces two strings: (3, 1, 11111222322233330303001010) and (4, 1, 233221211010100333); image (b) maps to the single chain code (3, 1, 01123233221211010100301122232223333030300101). The movement vectors are shown in (c), and labelled on (a) for the first seven symbols.

Experiments were conducted to evolve a previously invented FST in order to gauge the difficulty of the problem. The target FST maps 4-direction codes to 8-direction chain codes. This works by looking at the overall direction moved by each pair of 4-codes, representing it as one of 8 compass points. This is illustrated in Figure 2. The black-on-white text labels the 4-direction codes. Following any path of length 2 from the centre of the diagram will arrive at one of the 8-direction codes, written in reverse print (white on grey). For example, either the sequence 23 or the sequence 32 maps to the 8-direction code of 5.

Figure 2: An illustration of how 4-direction vector codes are mapped into 8-direction vector codes.

A hand-coded five-state FST that performs this task is shown in Figure 3. Each node (circle) in the diagram is labelled with its state number. State zero, the start state, is indicated with two concentric circles. Each arc is labelled with symbols i : o, where i is the input symbol to be read when moving along that arc from the source state to the destination state, and o is the output symbol produced in doing so. The FST can also be represented in tabular form, and this is given in Table 2. Recall that the mutation operator works by modifying entries in tables such as these.

Figure 3: A hand-coded FST for transforming 4-direction codes to 8-direction codes.

A number of experiments were performed to see if the target FST could be evolved from samples of input/output string pairs. For each experiment a training set was created with 50 random input strings of length 10, and a test set with 50 random input strings of length 12. These were then passed through the target FST to produce an output string for each input string. This data is denoted as the hard dataset. Table 3 shows a sample of five such pairs. We expect it to be relatively difficult to learn an FST from input/output pairs where the strings are long. To test this, easy datasets were also used. The easy sets were created in the same way as the hard sets, except that a complete set
q     next state (i = 0 1 2 3)     output (i = 0 1 2 3)
0     1 2 3 4                      ε ε ε ε
1     0 0 0 0                      0 1 ε 7
2     0 0 0 0                      1 2 3 ε
3     0 0 0 0                      ε 3 4 5
4     0 0 0 0                      7 ε 5 6

Table 2: Tabular representation of the 4-to-8 chain code FST, giving for each state q and input symbol i the next state and the output symbol (ε denotes the null output). The random mutation hill-climber works directly in this representation space.
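The tabular FST can be exercised directly in code. The sketch below is our illustration (not the paper's implementation): it hard-codes the Table 2 entries, with −1 standing for ε, and replays them over 4-direction input strings.

```java
// Table 2 as data: next-state and output tables for the 4-to-8 chain
// code FST, with -1 standing for the null symbol (epsilon).
class ChainCodeFst {
    static final int[][] NEXT = {
        {1, 2, 3, 4},   // state 0: remember the first symbol of the pair
        {0, 0, 0, 0},   // states 1-4: emit the 8-code and return to 0
        {0, 0, 0, 0},
        {0, 0, 0, 0},
        {0, 0, 0, 0},
    };
    static final int[][] OUT = {
        {-1, -1, -1, -1},  // state 0 outputs epsilon
        { 0,  1, -1,  7},  // first symbol of the pair was 0
        { 1,  2,  3, -1},  // first symbol was 1
        {-1,  3,  4,  5},  // first symbol was 2 (so 23 -> 5)
        { 7, -1,  5,  6},  // first symbol was 3 (so 32 -> 5)
    };

    static String transduce(String input) {
        StringBuilder sb = new StringBuilder();
        int q = 0;
        for (char c : input.toCharArray()) {
            int i = c - '0';
            if (OUT[q][i] >= 0) sb.append(OUT[q][i]);
            q = NEXT[q][i];
        }
        return sb.toString();
    }
}
```

Running this over the sample inputs of Table 3 reproduces the listed outputs, e.g. 2331313123 transduces to 55.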
Input          Output
2331313123     55
1133203031     267
0020313210     051
1212233300     33560
0102112130     1237

Table 3: A sample of five input/output string pairs for the 4-8 chain code FST.

of input/output pairs was appended to each of the easy training sets, for input strings of length 1 and 2. Since the input alphabet had 4 symbols, this added a total of twenty string pairs to each training set. The maximum size of the FST was set to 10 states. Each experiment measured the accuracy with which the best evolved FST could reproduce the training set outputs, and also recorded its performance on the test set. At no stage were the test sets used during evolution. In these experiments, a perfect score on the training set led to a perfect score on the test set, while an imperfect score on the training set led to a much worse score on the test set. This demonstrates that given
the training data and the specified maximum number of states, there is very little chance of finding two machines that fit the training data yet behave differently on other data. Note that with sparse training sets, it may be possible to find differently behaving machines that are still consistent with the training data, and OSTIA finds two such machines, as explained in Section 6.2. In practice, this was not found to be a problem with the RMHC.

Experiments were repeated 500 times in order to get statistically meaningful results. Table 4 shows the results of the various fitness functions on the hard and easy datasets. In this table, and all similar tables in the paper, each table entry shows the mean, the standard error (i.e. the standard deviation of the mean) and the number of times that a perfect test set score was achieved. The statistical significance of the difference between the mean performance of two methods on a particular problem can be judged by the number of standard errors between the means. Using a two-sided unpaired t-test, a difference of three standard errors indicates that the means are different at a significance level of more than 96%.

On the easy datasets the strict function significantly outperformed the Hamming function, but this is reversed on the hard datasets. The edit function performed best on both the easy and hard datasets. Note, however, that the superior performance of the edit function comes at a significant cost. On the hard datasets we found an FST with a perfect test set score zero times with the strict function, 4 times with the Hamming function and 64 times with the edit function.

Table 5 indicates the average number of fitness evaluations made per second by the RMHC. These timings are based on a Java implementation running on a Pentium 4 clocked at 2.4 GHz. In a given amount of CPU time, it is possible to make around three times as many evaluations of the strict or Hamming functions as of the edit function.
Note that these measurements include the entire cost of running the RMHC, which involves running the FST on all the input strings to produce the set of output strings. Although we expect the strict function to be faster than the Hamming function, the difference between the two is not very significant when measured in the context of the RMHC. Whether the extra cost of the edit function is worthwhile depends on how likely each method is to find an acceptable solution within a given number of evaluations. Note that here the strings are of length 10 or less - the difference would be greater for longer strings. This suggests the possibility of
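The three fitness functions can be sketched as follows. This is not the paper's code (the original implementation was in Java), and the exact normalisation to [0, 1] is an assumption; each function compares an FST's output string against the target output string.

```python
# Sketch of the three string-distance fitness functions, normalised to
# [0, 1] (normalisation details are an assumption, not the paper's code).

def strict_fitness(got, want):
    # 1 for an exact match, 0 otherwise.
    return 1.0 if got == want else 0.0

def hamming_fitness(got, want):
    # Fraction of aligned positions that agree, relative to the longer string.
    n = max(len(got), len(want))
    if n == 0:
        return 1.0
    matches = sum(1 for a, b in zip(got, want) if a == b)
    return matches / n

def edit_fitness(got, want):
    # 1 - (Levenshtein distance / length of the longer string).
    n, m = len(got), len(want)
    if max(n, m) == 0:
        return 1.0
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (got[i - 1] != want[j - 1]))
        prev = cur
    return 1.0 - prev[m] / max(n, m)

def fitness(pairs, f):
    # Mean per-string fitness over a set of (produced, target) pairs.
    return sum(f(g, w) for g, w in pairs) / len(pairs)
```

The quadratic dynamic program in `edit_fitness` is the source of the roughly threefold cost difference reported in Table 5, since the other two functions are linear in the string length.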

16

f          easy                  hard
strict     0.65 (0.01)  ; 148    0.11 (0.003) ; 0
Hamming    0.48 (0.01)  ; 70     0.24 (0.005) ; 4
edit       0.73 (0.01)  ; 184    0.45 (0.013) ; 64

Table 4: Mean (standard error) over 500 runs of the best individual evolved in 10,000 fitness evaluations using a random hill-climber. Also shown is the number of times that a perfect test set score was achieved.

f          evals / s
strict     42,000
Hamming    36,000
edit       13,000

Table 5: Average evaluations per second for each fitness function when run within the RMHC.

This suggests the possibility of using a more efficient version of the string edit computation. One straightforward approach is to compute the edit distance only for close-matching strings. This can be done by evaluating the string-match matrix only within a fixed band around the leading diagonal, and by terminating the computation when a threshold distance is exceeded.

Figures 4 through 9 plot 50 runs of the RMHC on both the hard and the easy datasets for the strict, Hamming and edit fitness functions. We plot the training set fitness as measured by the Hamming function for each of the plots, rather than the fitness function used for evolution. This avoids biasing the plots in favour of the more forgiving fitness functions since, for example, an FST evaluated by the edit measure would always appear at least as fit as the same FST evaluated by the strict measure. A consequence of this is that the plotted fitness of the current solution sometimes gets worse when the measure used for evolution differs from the measure used for the plot. These figures provide interesting insight into the evolutionary behaviour of the random hill climber under each of the fitness functions. Note that on the hard datasets the edit function is, on many runs, no better than the Hamming function, but on other runs it appears to break through some barrier of mediocrity and then rapidly achieve perfect fitness.
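The banded speed-up suggested above can be sketched as follows (our own illustration, not the paper's code): the dynamic programming matrix is evaluated only within a fixed band about the leading diagonal, and the computation is abandoned once every entry in the current row exceeds a threshold. The function returns the exact distance when it is at most `threshold`, and `threshold + 1` otherwise.

```python
# Sketch of a banded, early-terminating edit distance. Exact for
# distances <= threshold; distances above the threshold are capped
# at threshold + 1.

def banded_edit_distance(a, b, threshold):
    n, m = len(a), len(b)
    if abs(n - m) > threshold:      # band cannot contain a complete path
        return threshold + 1
    INF = threshold + 1
    prev = [j if j <= threshold else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        lo, hi = max(1, i - threshold), min(m, i + threshold)
        cur = [INF] * (m + 1)
        cur[0] = i if i <= threshold else INF
        for j in range(lo, hi + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        if min(cur[lo:hi + 1] + [cur[0]]) > threshold:
            return threshold + 1    # early termination: row exceeds threshold
        prev = cur
    return prev[m] if prev[m] <= threshold else threshold + 1
```

The band reduces the cost from O(nm) to O(n x threshold), which matters most for the longer strings mentioned above.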

17

[Plot: best-yet Hamming fitness (0.00-1.00) against fitness evaluations (0-10,000).]

Figure 4: 50 Runs of the random hill climber using the strict fitness function on the easy 4-8 datasets.

Table 6 shows an FST evolved using the Hamming function on the hard dataset. This achieved a perfect score on both the training and the test sets. On inspection it can be seen that although this FST has 10 states, 5 of them are unreachable from the start state, and when the unreachable states are pruned, we get an FST that is isomorphic to the target FST.

5.1 Fitness Distance Correlation Analysis

For each of the fitness functions, the Fitness Distance Correlation (FDC) was estimated by taking 100 random walks from the target 5-state automaton, represented in the same matrix form and mutated with the single-change operator (each time, a table entry is randomly chosen and then randomly modified). Each random walk consisted of 60 random mutations. After each mutation, the Hamming distance between the current and the target FST (obtained by comparing the output and transition matrices of the two FSTs) was measured, together with the fitness of the current FST. The fitness was measured on a hard dataset with 50 input/output string pairs where the input length was 10. FDC analysis was studied by Jones [45], and offers a useful but by no means foolproof means of predicting problem difficulty for an evolutionary algorithm. A comprehensive review of problem difficulty measures is given by Naudts and Kallel [46]. The closer the FDC is to -1, the easier the problem should be for an EA. The FDC figures for strict, Hamming and edit were -0.57, -0.78 and -0.80 respectively. Hence, in this case the ordering of the FDC measures corresponds to the ordering of evolutionary performance on the hard datasets.
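The FDC estimate described above can be sketched as follows (function names are ours): random walks move away from the target by single-change mutations, recording the genotype Hamming distance and the fitness at each step, and the FDC is the Pearson correlation over the collected samples.

```python
# Sketch of the FDC estimation procedure (names and details are ours).
import random

def mutate(tables, n_states, n_in, n_out):
    # Single-change mutation: pick one table entry and randomise it.
    trans, out = tables
    t = [row[:] for row in trans], [row[:] for row in out]
    q, a = random.randrange(n_states), random.randrange(n_in)
    if random.random() < 0.5:
        t[0][q][a] = random.randrange(n_states)
    else:
        t[1][q][a] = random.randrange(n_out)
    return t

def genotype_distance(x, y):
    # Hamming distance between the flattened transition/output matrices.
    fx = [v for tbl in x for row in tbl for v in row]
    fy = [v for tbl in y for row in tbl for v in row]
    return sum(a != b for a, b in zip(fx, fy))

def fdc(points):
    # Pearson correlation between (distance, fitness) samples.
    ds, fs = zip(*points)
    n = len(points)
    md, mf = sum(ds) / n, sum(fs) / n
    cov = sum((d - md) * (f - mf) for d, f in points)
    vd = sum((d - md) ** 2 for d in ds) ** 0.5
    vf = sum((f - mf) ** 2 for f in fs) ** 0.5
    return cov / (vd * vf)
```

A value near -1 means fitness rises steadily as the walk approaches the target, the regime in which a hill climber is expected to do well.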

[Plot: best-yet Hamming fitness (0.00-1.00) against fitness evaluations (0-10,000).]

Figure 5: 50 Runs of the random hill climber using the Hamming fitness function on the easy 4-8 datasets.

[Plot: best-yet Hamming fitness (0.00-1.00) against fitness evaluations (0-10,000).]

Figure 6: 50 Runs of the random hill climber using the edit fitness function on the easy 4-8 datasets.


[Plot: best-yet Hamming fitness (0.00-1.00) against fitness evaluations (0-10,000).]

Figure 7: 50 Runs of the random hill climber using the strict fitness function on the hard 4-8 datasets.

[Plot: best-yet Hamming fitness (0.00-1.00) against fitness evaluations (0-10,000).]

Figure 8: 50 Runs of the random hill climber using the Hamming fitness function on the hard 4-8 datasets.


[Plot: best-yet Hamming fitness (0.00-1.00) against fitness evaluations (0-10,000).]

Figure 9: 50 Runs of the random hill climber using the edit fitness function on the hard 4-8 datasets.

              next state            output
   q       i=0  i=1  i=2  i=3    i=0  i=1  i=2  i=3
   0        2    3    6    4      ε    ε    ε    ε
   1        0    0    0    0      2    6    2    1
   2        0    0    0    0      0    1    ε    7
   3        0    0    0    0      1    2    3    ε
   4        0    0    0    0      7    ε    5    6
   5        0    0    0    0      2    3    7    ε
   6        0    0    0    0      ε    3    4    5
   7        0    0    0    4      ε    ε    ε    6
   8        0    8    0    0      1    5    ε    ε
   9        0    0    0    0      ε    6    1    1

Table 6: An evolved (nominal) 10-state FST that attained a perfect score on the 4-8 chain code problem (on the training and the test data). Rows are states q, columns are input symbols i; ε denotes the empty output string.
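A sketch of how an FST stored in this tabular form transduces an input string (the start state is 0, and the empty output ε is represented as `''` here). The toy two-state machine below is our own illustration, not the evolved machine of Table 6.

```python
# Sketch of tabular FST transduction: at each step emit the output-table
# entry for (state, symbol), then follow the transition-table entry.

def transduce(trans, out, s):
    q, result = 0, []            # state 0 is the start state
    for sym in s:
        result.append(out[q][sym])   # '' (i.e. ε) contributes nothing
        q = trans[q][sym]
    return ''.join(result)

# Toy 2-state machine over a binary input alphabet: state 0 emits ε and
# moves to state 1; state 1 emits a doubled symbol and returns to state 0.
trans = [[1, 1], [0, 0]]
out = [['', ''], ['00', '11']]
```

Running `transduce(trans, out, [0, 1, 0, 1])` walks the machine through states 0, 1, 0, 1, 0, emitting output only from state 1. Delayed (ε) outputs like this are what make the FST learning problem harder than DFA learning: a symbol read now may only be accounted for in the output several transitions later.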


[Scatter plot: fitness (0.0-1.0) against distance (0-40).]

Figure 10: Fitness distance scatter plot for strict distance. FDC = -0.57.

In Figure 11 and Figure 12 it appears that there is a significant difference between the two measures close to the optimum, which is masked by their similar behaviour at distances of 20 or more. To check this, we computed the FDC figures for the scatter plots of these figures when limited to a maximum distance of 10 from the optimum. When limited in this way, we get FDC figures of -0.84, -0.86 and -0.91 for the strict, Hamming and edit fitness functions respectively. Since much of the time taken to converge to the optimum is often spent getting the last few details correct, this may give a significant advantage to the edit distance. As mentioned previously, however, this should be balanced against its greater computational cost.

5.2 RMHC versus OSTIA on 4-8 Chain Code Problem

The performance of the RMHC was compared with that of OSTIA on the 4-8 chain code target FST, varying both the training set size and the length of the input strings. For each pair of values, the experiments were run 50 times using a different random seed each time. Each experiment generated the appropriate number of training examples of the appropriate input length, together with 100 test examples. In each case the target FST was used to produce the output sequences for both the training and test sets.

[Scatter plot: fitness (0.0-1.0) against distance (0-40).]

Figure 11: Fitness distance scatter plot for Hamming distance. FDC = -0.78.

[Scatter plot: fitness (0.00-1.00) against distance (0-40).]

Figure 12: Fitness distance scatter plot for string edit distance. FDC = -0.80.


nTrain  |s|   RMHC(strict)         RMHC(Ham)            RMHC(edit)           OSTIA
50      5     0.35 (0.017) ; 0     0.35 (0.019) ; 0     0.52 (0.022) ; 0     0.29 (0.008) ; 0
100     5     0.47 (0.023) ; 0     0.41 (0.014) ; 0     0.72 (0.032) ; 4     0.33 (0.007) ; 0
150     5     0.53 (0.029) ; 0     0.48 (0.024) ; 1     0.85 (0.025) ; 9     0.33 (0.005) ; 0
200     5     0.55 (0.029) ; 0     0.54 (0.029) ; 3     0.86 (0.027) ; 20    0.34 (0.005) ; 0
250     5     0.60 (0.033) ; 1     0.60 (0.027) ; 1     0.94 (0.017) ; 26    0.36 (0.005) ; 0
50      15    0.20 (0.002) ; 0     0.30 (0.005) ; 0     0.56 (0.030) ; 7     0.21 (0.002) ; 0
100     15    0.19 (0.002) ; 0     0.37 (0.013) ; 0     0.75 (0.030) ; 15    0.21 (0.001) ; 0
150     15    0.19 (0.002) ; 0     0.46 (0.023) ; 2     0.86 (0.026) ; 27    0.21 (0.002) ; 0
200     15    0.20 (0.002) ; 0     0.46 (0.018) ; 0     0.84 (0.024) ; 24    0.21 (0.001) ; 0
250     15    0.19 (0.002) ; 0     0.50 (0.022) ; 2     0.89 (0.023) ; 32    0.21 (0.001) ; 0
50      25    0.22 (0.002) ; 0     0.30 (0.005) ; 0     0.58 (0.027) ; 6     0.23 (0.001) ; 0
100     25    0.22 (0.002) ; 0     0.33 (0.007) ; 0     0.74 (0.027) ; 16    0.23 (0.001) ; 0
150     25    0.21 (0.001) ; 0     0.37 (0.008) ; 0     0.83 (0.026) ; 26    0.23 (0.001) ; 0
200     25    0.21 (0.001) ; 0     0.41 (0.016) ; 0     0.79 (0.028) ; 22    0.23 (0.001) ; 0
250     25    0.21 (0.002) ; 0     0.45 (0.016) ; 1     0.78 (0.027) ; 19    0.23 (0.001) ; 0

Table 7: Learning the 4-8 target FST, given various training set sizes and input string lengths (boldface indicates best result for each experiment).

OSTIA was run until completion, while each RMHC variant ran for 10,000 fitness evaluations. As before, the RMHC uses its performance on the training set as its fitness function. The Hamming fitness function was used to evaluate the test set output of the final evolved FSTs produced by all methods. Table 7 shows the results for all methods obtained from the 50 experimental runs for each row; the nominal size for the RMHC is set at 10 states. To help visualise these runs, we plotted the test set performance (with error bars at three standard errors from the mean) versus the training set size for input strings of length 5 (Figure 13) and of length 25 (Figure 14). The overall conclusion is that RMHC(edit) is the most effective method for learning this particular FST: it is better on average and often delivers a perfect FST on its best runs.


[Plot: mean test-set fitness (0.00-1.00) for the EA and OSTIA against number of training string-pairs (50-250).]

Figure 13: Mean fitness for OSTIA versus RMHC(edit) for input strings of length 5 on 4-8 target FST.

[Plot: mean test-set fitness (0.00-1.00) for the EA and OSTIA against number of training string-pairs (50-250).]

Figure 14: Mean fitness for OSTIA versus RMHC(edit) for input strings of length 25 on 4-8 target FST.


Method         CPU Time (s): 50, 5    CPU Time (s): 250, 25
RMHC(strict)   0.62                   3
RMHC(Ham)      0.66                   3
RMHC(edit)     0.84                   24
OSTIA          0.78                   310

Table 8: Average CPU time per experimental run for OSTIA and the RMHC variants for small (50, 5) and large (250, 25) training sets.

Unsurprisingly, it can be seen from this data that all RMHC variants and OSTIA perform better when more training data is available. We also learn that all methods are better able to infer FSTs from short training sequences. This might be considered counter-intuitive, because a longer sequence should provide more information about the target machine. A plausible explanation of why this occurs is simplest for OSTIA: a training set of long sequences gives rise to a larger initial FST, and there are more possibilities for error in reducing this initial FST to the small target. For the RMHC, the most likely explanation is related to credit assignment: with short input sequences it is simpler to associate particular inputs with particular outputs, while longer sequences lead to more ambiguity, and hence to more local maxima in the search space.

It could be argued that OSTIA can be run many times, with many different heuristics, and this would indubitably give it a better chance of finding the target transducer. Equally, our RMHC could potentially find a better solution using multiple restarts and/or more total fitness evaluations. Comparing the two algorithms fairly then depends on allocating each a similar amount of run-time. To judge this, we measured average run-times at the extreme ranges of these experiments, shown in Table 8. On the small datasets OSTIA is slightly faster than RMHC(edit), but slower than the other RMHC variants. On the large datasets OSTIA is much slower than any of the RMHC variants: it suffers more because it constructs large initial FSTs with many possible merge choices at each step. The time taken by the RMHC for a fixed number of generations is predictable, but the time taken for an OSTIA run can grow dramatically with an increase in the length of the input strings or the number of training strings.


[Plot: distributions h(x) of depth and of number of reachable states (nStates), for x from 0 to 50.]

Figure 15: Distribution of depth and number of reachable states in our randomly generated FSTs of nominal size 50.

6 Results: Learning Random FSTs

The experiments with the 4-8 chain code FST suggest that the RMHC is an effective tool for learning that example FST. However, to build confidence in the generality of the method it is necessary to extend the comparison over many example FST targets. To do this, the performance of the RMHC and OSTIA was compared on randomly generated target FSTs. The comparison was parameterised by the number of states, nStates, in the target FST and the size of the training set used, nTrain. We set the maximum size of the FST evolved by the RMHC to nStates. The number of input and output symbols was fixed at 2, i.e. binary strings were used as both input and output. Target FSTs were generated by randomly filling both the transition and output tables with values chosen with equal probability from all possible values. This is similar to the random DFA generation method used by Lang et al. [19]. To give an impression of the nature of the FSTs generated by this process, Figure 15 shows the distribution of depth and number of reachable states over 1,000 randomly constructed FSTs with a nominal 50 states. Depth is defined as the number of transitions required to reach the furthest state from the start state via the shortest path.

For each randomly generated target FST, a random training set was created of the specified size with input strings of length 10, and a random test set of size 100, using input strings of length 12 to prevent any overlap with the training set. For each target FST we scored the FST produced by the RMHC and the FST produced by OSTIA using the fitness measure based on Hamming distance (described in Section 3.3). Table 9 summarises the results obtained for the number of training examples ranging from 50 to 250 and the target nominal number of states varying from 5 to 95. Each row is a summary of 50 repeated experiments for both the RMHC (with the three fitness functions) and OSTIA; new target FSTs and training/test sets were generated for each experiment.

The table shows that the mean performance of RMHC(edit) is better than that of OSTIA for most combinations of the number of states and the number of training strings. OSTIA gives comparable results when the data is not too sparse, which in this case is when we have 250 training strings. Practical applications often suffer from data sparsity, so this is a favourable result for the RMHC. In addition to observing the mean fitness for each method, it is also interesting to note the number of times a perfect test set score is achieved. For example, given 150 training strings and 5-state targets, the means are the same for RMHC(edit) and OSTIA, yet RMHC(edit) gave a perfect test set score on 32 occasions versus only 16 for OSTIA. Regarding the performance of the various RMHC fitness functions, the results show that normalised edit performs best, as it did for the 4-8 chain-code problem.

The following two graphs help to visualise the results. Figure 16 holds the number of training examples constant at 50 and plots the relative mean performance of OSTIA and RMHC(edit) against the number of states in the target FST. Error bars are shown at plus and minus 3 times the standard error calculated for each point. It can be seen that the RMHC maintains a significant margin over OSTIA, though both methods perform worse for larger targets. Figure 17 repeats this plot for 250 training strings. Here the difference between the two algorithms is much smaller, and on occasion OSTIA does better than the RMHC. This is partially explained by the fact that OSTIA would guarantee perfect performance given an exhaustive training set. However, it should also be noted that an exhaustive training set would include all short strings up to a given length, which is not the case here.
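The random target generation and the depth/reachable-state statistics of Figure 15 can be sketched as follows (function names are ours). Depth is the longest of the shortest paths from the start state, which a breadth-first search computes directly.

```python
# Sketch of random FST generation and the Figure 15 statistics.
import random
from collections import deque

def random_fst(n_states, n_in, n_out):
    # Fill both tables uniformly at random, as described in the text.
    trans = [[random.randrange(n_states) for _ in range(n_in)]
             for _ in range(n_states)]
    out = [[random.randrange(n_out) for _ in range(n_in)]
           for _ in range(n_states)]
    return trans, out

def depth_and_reachable(trans):
    # BFS from the start state 0: depth = longest shortest path,
    # reachable = number of states discovered.
    dist = {0: 0}
    q = deque([0])
    while q:
        s = q.popleft()
        for nxt in trans[s]:
            if nxt not in dist:
                dist[nxt] = dist[s] + 1
                q.append(nxt)
    return max(dist.values()), len(dist)
```

Repeating `depth_and_reachable(random_fst(50, 2, 2)[0])` 1,000 times yields the two histograms of Figure 15; note that many nominal states are typically unreachable.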


nTrain  nStates  RMHC(strict)         RMHC(Ham)            RMHC(edit)           OSTIA
50      5        0.92 (0.018) ; 30    0.92 (0.015) ; 27    0.97 (0.008) ; 38    0.86 (0.018) ; 13
50      15       0.71 (0.016) ; 1     0.75 (0.014) ; 2     0.76 (0.014) ; 1     0.70 (0.017) ; 0
50      25       0.67 (0.010) ; 0     0.71 (0.009) ; 0     0.72 (0.010) ; 0     0.64 (0.009) ; 0
50      35       0.64 (0.008) ; 0     0.68 (0.009) ; 0     0.68 (0.009) ; 0     0.63 (0.009) ; 0
50      45       0.65 (0.006) ; 0     0.69 (0.005) ; 0     0.69 (0.005) ; 0     0.62 (0.007) ; 0
50      55       0.64 (0.005) ; 0     0.68 (0.005) ; 0     0.69 (0.004) ; 0     0.61 (0.008) ; 0
50      65       0.65 (0.006) ; 0     0.68 (0.005) ; 0     0.69 (0.003) ; 0     0.61 (0.006) ; 0
50      75       0.64 (0.006) ; 0     0.68 (0.006) ; 0     0.69 (0.005) ; 0     0.60 (0.006) ; 0
50      85       0.63 (0.006) ; 0     0.68 (0.006) ; 0     0.69 (0.005) ; 0     0.60 (0.007) ; 0
50      95       0.65 (0.005) ; 0     0.68 (0.005) ; 0     0.69 (0.004) ; 0     0.61 (0.004) ; 0
150     5        0.92 (0.016) ; 31    0.93 (0.012) ; 26    0.94 (0.012) ; 32    0.94 (0.012) ; 16
150     15       0.75 (0.017) ; 0     0.82 (0.014) ; 1     0.87 (0.014) ; 1     0.80 (0.019) ; 1
150     25       0.68 (0.011) ; 0     0.76 (0.010) ; 0     0.77 (0.009) ; 0     0.73 (0.015) ; 0
150     35       0.68 (0.008) ; 0     0.74 (0.009) ; 0     0.76 (0.008) ; 0     0.71 (0.012) ; 0
150     45       0.67 (0.011) ; 1     0.74 (0.008) ; 1     0.76 (0.008) ; 1     0.71 (0.011) ; 0
150     55       0.66 (0.007) ; 0     0.73 (0.006) ; 0     0.74 (0.006) ; 0     0.69 (0.010) ; 0
150     65       0.66 (0.006) ; 0     0.72 (0.005) ; 0     0.73 (0.006) ; 0     0.69 (0.011) ; 0
150     75       0.66 (0.008) ; 0     0.72 (0.006) ; 0     0.73 (0.005) ; 0     0.69 (0.009) ; 0
150     85       0.66 (0.006) ; 0     0.72 (0.005) ; 0     0.73 (0.005) ; 0     0.67 (0.007) ; 0
150     95       0.64 (0.006) ; 0     0.71 (0.005) ; 0     0.72 (0.004) ; 0     0.66 (0.008) ; 0
250     5        0.92 (0.016) ; 27    0.94 (0.012) ; 31    0.97 (0.008) ; 35    0.95 (0.010) ; 22
250     15       0.80 (0.019) ; 3     0.86 (0.011) ; 3     0.89 (0.009) ; 4     0.90 (0.014) ; 3
250     25       0.72 (0.011) ; 0     0.78 (0.010) ; 0     0.81 (0.010) ; 0     0.76 (0.014) ; 1
250     35       0.67 (0.009) ; 0     0.74 (0.008) ; 0     0.76 (0.007) ; 0     0.76 (0.014) ; 0
250     45       0.66 (0.008) ; 0     0.74 (0.008) ; 0     0.76 (0.007) ; 0     0.73 (0.011) ; 0
250     55       0.67 (0.006) ; 0     0.74 (0.007) ; 0     0.75 (0.006) ; 0     0.72 (0.008) ; 0
250     65       0.66 (0.007) ; 0     0.74 (0.006) ; 0     0.75 (0.005) ; 0     0.73 (0.008) ; 0
250     75       0.67 (0.007) ; 0     0.75 (0.005) ; 0     0.72 (0.008) ; 0     0.72 (0.006) ; 0
250     85       0.67 (0.005) ; 0     0.73 (0.007) ; 0     0.74 (0.005) ; 0     0.70 (0.007) ; 0
250     95       0.67 (0.005) ; 0     0.74 (0.006) ; 0     0.75 (0.005) ; 0     0.71 (0.007) ; 0

Table 9: Finding Random FSTs (boldface shows best result for each experiment).

[Plot: mean fitness (0.50-1.00) for the EA and Ostia against number of target states (0-100).]

Figure 16: Mean fitness for OSTIA versus RMHC(edit) with 50 training examples from a random target.

[Plot: mean fitness (0.50-1.00) for the EA and Ostia against number of target states (0-100).]

Figure 17: Mean fitness for OSTIA versus RMHC(edit) with 250 training examples from a random target.


6.1 Effects of FST Maximum Size

The results so far show that for small FSTs with limited training data, the RMHC clearly outperforms OSTIA. However, the RMHC is also given more information than OSTIA, namely an estimate of the number of states in the target machine. To understand how precise this estimate has to be, the following experiment was performed. Randomly created machines with exactly 5 reachable states were selected as targets, by repeatedly creating random machines with 6 nominal states until a machine with 5 reachable states was generated. The nominal (maximum) number of states in the evolved machines was varied in steps of 5, from 5 states up to 50 states. The RMHC(edit) algorithm was used to attempt to learn each target. The results are plotted in Figure 18; each point shows the mean over 100 runs of the RMHC, with error bars at one standard error from the mean. The plot shows that the best average performance is obtained when the maximum number of states is 10, i.e. twice the number of states in the target machine (with a perfect test set score being achieved 43 times out of 100). Performance then tails off gradually: even when the maximum number of states is 50, the RMHC still found a perfect test set score 8 times out of 100. The upper bound on the number of states does, as expected, influence the performance of the RMHC. However, the algorithm is not over-sensitive to this, and performance falls off slowly as the maximum number of states is increased. Also plotted on the same graph is the performance of OSTIA on this task. OSTIA does not have a maximum-states parameter, so it was run on 100 randomly generated targets as a one-off experiment; those results were then plotted across the graph to allow easy visual comparison with the RMHC.

The nature of the evolved machines was also investigated. As before, over-fitting the training data was not a problem: a perfect training set score always led to a perfect test set score. It was observed, however, that when the maximum number of states was much larger than the number of target states, the evolved solution could be behaviourally identical to the target while having many more states. For example, evolution would sometimes find machines with 10 or more states that behaved identically to the 5-state target. We also found occasions where, given a maximum of 20 states for example, evolution would produce a perfect-scoring machine with only 5 reachable states.
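The pruning of unreachable states used in the isomorphism check above can be sketched as follows (our own illustration): discover the states reachable from the start state, then renumber them in discovery order.

```python
# Sketch of pruning unreachable states from an evolved FST and
# renumbering the survivors so the start state remains state 0.

def prune(trans, out):
    reach, order = {0}, [0]
    for s in order:                      # breadth-first discovery
        for nxt in trans[s]:
            if nxt not in reach:
                reach.add(nxt)
                order.append(nxt)
    remap = {old: new for new, old in enumerate(order)}
    new_trans = [[remap[trans[s][a]] for a in range(len(trans[s]))]
                 for s in order]
    new_out = [out[s][:] for s in order]
    return new_trans, new_out
```

Applying this to a nominal 10-state machine like that of Table 6 yields the reachable core, which can then be compared state-by-state against the target for isomorphism.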


[Plot: test-set accuracy (0.00-1.00) for the EA and OSTIA against maximum number of states (0-50).]

Figure 18: Average test-set accuracy of evolved machine plotted against maximum machine size. OSTIA performance also plotted for the purpose of comparison.

This meant that evolution was able to prune an initial randomly generated machine down from around 16 states to 5 states.

6.2 Analysing Failure Modes for the RMHC and for OSTIA

The RMHC does not always find the target FST, and there are two possible failure modes. First, the RMHC may not achieve perfect performance on the training set within the maximum number of allowed fitness evaluations. Second, the RMHC may obtain perfect performance on the training set but then perform less than perfectly on the test set. This latter failure mode is possible in principle, but was not observed to be a problem: in practice, with the given experimental setup, failure was always due to an inability to learn the training data. In contrast, OSTIA is always correct on the training data but, as described previously, may commit to a path which cannot lead to the target FST. To see how this arises in practice, we created some very simple target FSTs and looked for a failure of OSTIA to find the target. The setup was 50 training strings, each of length 10 symbols over an alphabet of 3 symbols. The FSTs were tested using 1,000 test strings to make the test for success or failure very clear. FSTs with 1 and 2 states were always successfully inferred by OSTIA, but a 3-state FST was found which illustrates OSTIA's failure mode. This target FST is shown in Figure 19. Many runs of OSTIA actually find the target, but several stop with many more states than 3. A rare occurrence of OSTIA stopping with just 4 states was chosen in order to analyse the failure mode as simply as possible. This 4-state stopping point is shown in Figure 20. OSTIA has stopped because it cannot merge state 3 into state 2 to reach the target: the merge cannot occur because the two states have contradictory outputs for inputs of both 0 and 2. This implies that the 4-state FST is consistent with the training set but contradicts the target. For example, the target would output 1200 for the input string 2201, whilst the 4-state machine outputs 1201. It is possible for OSTIA to produce this 4-state machine because the string 2201 does not occur anywhere in the training set, either alone or as a substring.

To better understand the behaviour of the RMHC, a plot was constructed showing the variation in the fitness vector as overall fitness improves. Fitness can be seen as a vector with an element for each string in the training set. We were interested to observe whether the RMHC improved by incrementally transducing more strings correctly, or whether some improvements in fitness occurred by correctly transducing several new strings at the expense of incorrectly transducing a smaller number of strings which were previously handled correctly. Figure 21 shows that the latter is often true. The evolution of the fitness vector is shown proceeding down the page: the performance on each training string is shown in each column (the lighter the rectangle, the better the transduction), and each row shows a snapshot at a particular generation. The left-hand column of figures shows each generation at which an improvement in fitness occurred, together with the fitness. A randomly constructed FST at the top proceeds, via a sequence of mutations, to eventually transduce all training strings correctly.
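The fitness-vector bookkeeping behind Figure 21 can be sketched as follows (our own illustration): per-string fitness values are recorded as a row whenever the overall fitness improves, making visible the cases where an improvement sacrifices strings that were previously transduced correctly.

```python
# Sketch of fitness-vector tracing: one snapshot row per improvement.

def fitness_vector(outputs, targets, per_string_fitness):
    # One fitness value per training string.
    return [per_string_fitness(g, w) for g, w in zip(outputs, targets)]

def record_improvement(trace, best, vector):
    # Append a snapshot (a row of Figure 21) only when overall
    # fitness improves; return the updated best fitness.
    total = sum(vector) / len(vector)
    if total > best:
        trace.append(vector)
        return total
    return best
```

Comparing consecutive rows of `trace` shows whether each gain was purely incremental or traded away previously correct strings.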

7 RMHC versus GA

In this section we explore the use of a population-based EA for learning random FSTs. There are two main potential advantages that can be gained from using a population-based algorithm to tackle FST induction. Firstly, the use of a population may provide a better selection criterion than can be achieved with an RMHC, since an individual now competes with a population of solutions rather than just a single other solution. Secondly, the idea that evolution of FSTs could proceed by combining good building blocks from


[State diagram: three states labelled 0, 1 and 2, with transitions labelled input/output over the alphabet {0, 1, 2}.]

Figure 19: Target 3-state FST


[State diagram: four states labelled 0, 1, 2 and 3, with transitions labelled input/output over the alphabet {0, 1, 2}.]

Figure 20: Incorrect 4-state FST



Figure 21: Example evolutionary trace of the fitness vector. Each row corresponds to a snapshot of fitness at a particular generation, and each column shows how well a particular string is classified.


multiple parents is an attractive one. However, due to the problem of competing conventions, this seems unlikely under a standard uniform crossover operator, i.e. one that simply copies each element in the child from the corresponding element in a randomly chosen parent (out of the two parents selected for breeding). Such an operator has no respect for the fact that different parents may represent the same or similar machines in different but isomorphic ways. Nonetheless, standard uniform crossover was used in our experiments.

Many experiments were conducted to compare the performance of the RMHC and the GA. In all cases, the total number of fitness evaluations remained constant at 10,000. The genetic algorithm used rank-based selection, where the first p parents were used to breed a population of size n, which was run for 10,000/n generations. The algorithm performed much better with elitism switched on, which meant that the single best individual was always kept. Each new individual was derived by using crossover on two parents, or mutation on a single parent; the decision to use crossover or mutation was controlled by the crossover probability pc. In the case of a mutation, the same mutation operator was used as before (see Section 3.2). In the course of these experiments, with various combinations of n, p, and pc, the GA never significantly outperformed the RMHC, and the RMHC usually significantly outperformed the GA. We included configurations where pc was set to zero, thereby using a population-based mutation-only algorithm; this was also unable to outperform the RMHC. Some experiments were also made using fitness-proportional selection, but this always performed worse than rank-based selection.

Rather than report results for all the different experimental setups, results are shown in Table 10 for a single problem instance; these are similar in nature to the results given other experimental settings. The results in Table 10 were based on the GA set with pc = 0.8, p = 10, n = 100. From this it can be concluded that the RMHC is a good choice of evolutionary algorithm for learning FSTs. This finding is further supported by recent work by the authors on using a multi-start RMHC to learn DFA [34], which showed that for certain classes of problem the RMHC outperformed all other algorithms under test on some public test problems. These included results on the GECCO 2004 noisy DFA learning problem (http://cswww.essex.ac.uk/staff/sml/gecco/NoisyDFA.html), which included a population-based EA, and problem instances
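One generation of the population-based setup described above can be sketched as follows (our own illustration; parameter names p, n and pc follow the text, other details are assumptions): rank-based truncation selection, elitism of one, and uniform crossover on the flattened genome.

```python
# Sketch of one GA generation: rank-based selection of the top p parents,
# elitism of the single best individual, and uniform crossover / mutation
# controlled by the crossover probability pc.
import random

def uniform_crossover(a, b):
    # Copy each gene from a randomly chosen parent.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def ga_step(pop, fitness, p, pc, mutate):
    ranked = sorted(pop, key=fitness, reverse=True)
    parents, elite = ranked[:p], ranked[0]
    children = [elite]                   # elitism: always keep the best
    while len(children) < len(pop):
        if random.random() < pc:
            child = uniform_crossover(*random.sample(parents, 2))
        else:
            child = mutate(random.choice(parents))
        children.append(child)
    return children
```

Because uniform crossover ignores state-renumbering isomorphisms, two parents encoding the same machine under different conventions can produce a child resembling neither, which is the competing-conventions problem referred to above.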


Algorithm    Mean    s.e.     nSuccess
RMHC         0.93    0.007    30
GA           0.88    0.008    8

Table 10: Finding Random FSTs: GA versus RMHC.

from the Gowachin DFA server (http://www.irisa.fr/Gowachin/), which included comparisons against high-performance heuristic state merging algorithms. The only example in the literature of a finite state machine learning problem for which mutation alone was insufficient is the protection game of Spears and Gordon [12]. Investigating this phenomenon would be interesting future work. A possible explanation is that the fitness function in the protection game is very noisy, and that the use of a population can help to alleviate the effects of noise. Note that all the fitness functions used in this paper are noise-free in the sense that an evaluation of a candidate solution always produces the same fitness value, even when the training data is noise-corrupted.

8 Results on Noisy Training Data

In this section the effects of noise in the training data are investigated. We corrupted p% of the training string pairs by flipping a single symbol in either the input string or the output string (input or output chosen randomly with equal probability). Note that this can lead to contradictory training data, whereby the same input sequence can be listed as mapping to different output sequences. We varied p from 0% to 100%. A problem setup was chosen where the noise-free case was reasonably easy to learn for all algorithms under test: the number of input and output symbols was set to 2 (i.e. binary strings), the maximum training string input length was set to 10, and the nominal number of states in the target FST was set to 5. The results of this experiment are shown in Table 11. The main result is that both evolutionary algorithms (RMHC and GA) are remarkably resilient to even high levels of noise, and the RMHC still outperforms the GA in a similar manner irrespective of the noise level.
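The corruption procedure can be sketched as follows (our own illustration): for p% of the pairs, one randomly chosen symbol is flipped in either the input or the output string. Since the experiment uses a binary alphabet, flipping means exchanging 0 and 1; empty output strings are guarded against, which is an assumption about the paper's handling of that edge case.

```python
# Sketch of the noise-corruption procedure over a binary alphabet:
# with probability p/100, flip exactly one symbol of one pair, in the
# input or the output string with equal probability.
import random

def corrupt(pairs, p):
    noisy = []
    for inp, out in pairs:
        inp, out = list(inp), list(out)
        if random.random() < p / 100.0:
            # Fall back to the input side if the output is empty (assumption).
            side = inp if (not out or random.random() < 0.5) else out
            k = random.randrange(len(side))
            side[k] = 1 - side[k]        # binary flip: 0 <-> 1
        noisy.append((inp, out))
    return noisy
```

Because every corrupted pair differs from a clean one in exactly one symbol, the noise level p directly bounds the fraction of training strings that any machine, including the target, can fail on.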


p      RMHC                GA                  OSTIA
0      0.82 (0.03) ; 71    0.60 (0.04) ; 42    0.63 (0.04) ; 33
10     0.82 (0.03) ; 66    0.58 (0.04) ; 32    0.27 (0.03) ; 1
20     0.82 (0.03) ; 64    0.60 (0.04) ; 38    0.16 (0.02) ; 2
50     0.84 (0.03) ; 73    0.62 (0.04) ; 43    0.06 (0.01) ; 2
100    0.77 (0.03) ; 57    0.58 (0.04) ; 37    0.06 (0.01) ; 2

Table 11: The effects of noise on learning Random FSTs.

OSTIA, on the other hand, suffers badly even when only 10% of the string pairs are corrupted, and continues to deteriorate given higher noise levels. Note that with this experimental setup, and given our procedure for randomly generating FSTs, there is a small but significant chance of producing a trivial target that maps any input string to the empty string ε. In all the noise-corrupted cases where OSTIA learned the target, it was this trivial null-producing target; the training data for such a target cannot be affected by our substitution-error corruption process. From this it can be concluded that evolutionary methods are far superior to OSTIA when presented with noise-corrupted training data. This result is similar to the findings of our comparison of evolutionary methods versus heuristic state merging algorithms for DFA learning [34], but the differences are more extreme in the FST case.

9 Future Work

There are some promising avenues for future work, both in improving the evolutionary approach and in improving OSTIA. Both are inspired by analogous methods for learning DFA. It should be possible to improve the RMHC by using a variation of the smart state labelling scheme for evolving DFA developed by the authors in [24, 34]. That approach evolves only the transition table of a DFA, and then uses a simple deterministic procedure to optimally assign state labels given a candidate DFA and the training set. It showed a dramatic improvement compared with evolving the transition table and the state labels together. The analogue for learning FSTs would be to evolve only the transition table, and then optimally assign the output table for each FST, also using the training set. However, this is significantly more complex for an FST than for a DFA, because the relationship between symbols in the input string and symbols in the output string is not direct, owing to the presence of null symbols in the output. Nevertheless, for a given allocation of null symbols, it is possible to perform an optimal assignment of output table symbols. Initial experiments along these lines have shown improved results over those reported here, and will be the subject of a future paper. The OSTIA algorithm could be improved by adding a search over possible state merges. Lang [21] found that EDSM could be greatly improved with this type of search, so it should be possible to improve OSTIA in this way also. This would, however, inevitably reduce the speed of the algorithm.

10 Conclusions

This paper presented a comparison of a simple evolutionary algorithm, a random mutation hill-climber (RMHC), with the best known heuristic method, data-driven OSTIA, for learning FSTs from samples of input/output string pairs. The major conclusion is that the RMHC is better than OSTIA at finding the kind of FSTs studied in this paper. This has been shown on an example FST from an OCR application domain, and on large numbers of randomly generated FSTs. The RMHC finds better transducers more often, and is the faster algorithm when more training data is used. In particular, the RMHC works better than OSTIA when the training sets are sparse. This could be very important, as training data sparseness is a problem for many real applications. It should be pointed out that OSTIA was developed for a more general task, and is capable of finding FSTs of a larger class, which can output more than one symbol per input symbol. It is also likely that OSTIA would fare better in a comparison where the number of input and output symbols is much larger than it is for the tasks we selected. OSTIA can find an FST from data without any guide as to the desired size of the FST, which, as discussed above, can be seen as a strength or a weakness. While the RMHC requires some guidance regarding the size of the target FST, it was shown to be fairly insensitive to this, and works well even when the size estimate is too large by as much as a factor of five. The most dramatic difference in performance between OSTIA and the RMHC is observed when noise is present in the training data. Even small amounts of noise cause OSTIA to perform very poorly, while the RMHC continues to work well even when every string pair in the training set has been corrupted with a symbol substitution error. This shows an impressive degree of robustness, and has significant implications for real-world applications, where noise-free training data can rarely be guaranteed.

In the course of our investigations we have also discovered a few other facts. We find that the learning task is made much easier when the training data includes short strings; perhaps this is not a great surprise. We compared three string distance functions for use with the RMHC. An unexpected result was that the strict (string equality) fitness function outperformed the Hamming distance fitness function on the easy chain code datasets. A possible reason is that the Hamming function can mislead the hill-climber towards false optima, whereas the strict function creates plateaus that take longer to escape from; when escape does come, however, it is less likely to be to a false optimum. On the hard chain code datasets, the plateaus dominate, and the strict version never properly escapes them. A similar phenomenon can be observed on the random target FST datasets. Overall, the edit distance fitness function has proven far superior to the strict and Hamming functions on the datasets we studied, even when we take its greater computational cost into account.

Experiments were also performed comparing a GA with the RMHC. Despite a good deal of effort spent tuning the GA, it was unable to match the performance of the RMHC. For this class of problem, the RMHC gives the best performance of any known algorithm, and the fact that such a simple evolutionary algorithm can perform so well on such a challenging machine learning problem is a significant result.
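The three string distance fitness functions compared above can be sketched as follows. The paper's exact normalisation is not restated here, so treat the scaling to [0, 1] as our own assumption; the point is the relative cost and behaviour of the three distance computations:

```python
def strict_fitness(a, b):
    """String equality: 1 if the produced string matches the target exactly."""
    return 1.0 if a == b else 0.0

def hamming_fitness(a, b):
    """Position-wise similarity; symbols beyond the shorter string count as errors."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b), 1)

def edit_fitness(a, b):
    """Similarity based on Levenshtein edit distance [41]: insertions,
    deletions and substitutions all cost 1."""
    prev = list(range(len(b) + 1))           # standard DP over prefixes
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b), 1)
```

The edit computation is O(|a||b|) per pair, versus O(min(|a|,|b|)) for Hamming and O(1) expected for equality, which is the cost ordering discussed in the paper.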
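The simplicity claimed above is easy to illustrate: the hill-climber loop can be sketched generically, with the paper's FST encoding, mutation operator and fitness function abstracted into parameters. This is an illustration of the search loop, not the authors' exact implementation:

```python
import random

def rmhc(init, mutate, fitness, max_evals, rng=random):
    """Random mutation hill-climber: repeatedly mutate the current
    candidate and keep the mutant whenever it is at least as fit.
    Accepting equal-fitness mutants lets the search drift along plateaus."""
    current = init()
    current_f = fitness(current)
    for _ in range(max_evals):
        candidate = mutate(current, rng)
        f = fitness(candidate)
        if f >= current_f:       # >= : neutral moves are accepted
            current, current_f = candidate, f
    return current, current_f

# Toy usage: maximise the number of 1s in a 20-bit string.
def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

best, best_f = rmhc(lambda: [0] * 20, flip_one, sum, 2000)
```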

Acknowledgements

We are very grateful to Jose Oncina of the University of Alicante for supplying us with the OSTIA source code. We also thank the anonymous reviewers, and members of the Natural and Evolutionary Computation group at the University of Essex, UK, for their comments and suggestions.


References

[1] J. Oncina, P. Garcia, and E. Vidal, “Learning subsequential transducers for pattern recognition interpretation tasks”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 448–458, 1993.
[2] H. Alshawi, S. Bangalore, and S. Douglas, “Learning dependency translation models as collections of finite state head transducers”, Computational Linguistics, vol. 26, no. 1, pp. 45–60, 2000.
[3] D. Gildea and D. Jurafsky, “Automatic induction of finite state transducers for simple phonological rules”, in Proceedings of the 33rd Conference of the Association for Computational Linguistics, 1995, pp. 9–15.
[4] C.N. Hsu and M.T. Dung, “Generating finite-state transducers for semi-structured data extraction from the web”, Information Systems, vol. 23, no. 8, pp. 521–538, 1998.
[5] S.M. Lucas and A. Amiri, “Statistical syntactic methods for high performance OCR”, IEE Proceedings on Vision, Image and Signal Processing, vol. 143, pp. 23–30, 1996.
[6] L.J. Fogel, A.J. Owens, and M.J. Walsh, “Artificial intelligence through a simulation of evolution”, in Biophysics and Cybernetic Systems: Proceedings of the 2nd Cybernetic Sciences Symposium, M. Maxfield, A. Callahan, and L.J. Fogel, Eds., pp. 131–155. Spartan Books, Washington DC, 1965.
[7] G.H. Burgin, “On playing two-person zero-sum games against non-minimax players”, IEEE Transactions on Systems Science and Cybernetics, vol. 5, pp. 369–370, 1969.
[8] G.H. Burgin, “Systems identification by quasi-linearization and by evolutionary programming”, Journal of Cybernetics, vol. 3, pp. 56–75, 1973.
[9] K. Chellapilla and D. Czarnecki, “A preliminary investigation into evolving modular finite state machines”, in Proceedings of Congress on Evolutionary Computation, pp. 1349–1356, 1999.


[10] D. Jefferson, R. Collins, C. Cooper, M. Dyer, M. Flowers, R. Korf, C. Taylor, and A. Wang, “Evolution as a theme in artificial life: The genesys/tracker system”, Proceedings of Artificial Life II, 1991.
[11] E. Sanchez, A. Pérez-Uribe, and B. Mesot, “Solving partially observable problems by evolution and learning of finite state machines”, in Evolvable Systems: From Biology to Hardware, Proceedings of the Fourth International Conference (ICES-2001), Y. Liu, K. Tanaka, M. Iwata, T. Higuchi, and M. Yasunaga, Eds., Tokyo, Japan, 2001, vol. 2210 of LNCS, pp. 267–278, Springer Verlag.
[12] W.M. Spears and D.F. Gordon-Spears, “Evolution of strategies for resource protection problems”, in Advances in Evolutionary Computing: Theory and Applications, pp. 367–392. Springer-Verlag New York, Inc., New York, NY, USA, 2003.
[13] Y. Inagaki, “On synchronized evolution of the network of automata”, IEEE Transactions on Evolutionary Computation, vol. 6, pp. 147–158, 2002.
[14] K. Benson, “Evolving automatic target detection algorithms that logically combine decision spaces”, Proceedings of the British Machine Vision Conference, pp. 685–694, 2000.
[15] L. Pitt and M. Warmuth, “The minimum consistent DFA problem cannot be approximated within any polynomial”, Journal of the ACM, vol. 40, no. 1, pp. 95–142, 1993.
[16] M. Kearns and L.G. Valiant, “Cryptographic limitations on learning Boolean formulae and finite automata”, in Proceedings of ACM Symposium on Theory of Computation (STOC-89), 1989, pp. 433–444.
[17] P. Dupont, L. Miclet, and E. Vidal, “What is the search space of the regular inference?”, in Grammatical Inference and Applications (ICGI-94), R.C. Carrasco and J. Oncina, Eds., pp. 25–37. Springer, Berlin, Heidelberg, 1994.
[18] A.L. Oliveira and J.P. Marques Silva, “Efficient search techniques for the inference of minimum size finite automata”, in String Processing and Information Retrieval, 1998, pp. 81–89.


[19] K.J. Lang, B.A. Pearlmutter, and R.A. Price, “Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm”, Proceedings of the International Colloquium on Grammatical Inference (Lecture Notes in Computer Science), vol. 1433, pp. 1–12, 1998.
[20] O. Cicchello and S.C. Kremer, “Beyond EDSM”, Proceedings of the International Colloquium on Grammatical Inference (Lecture Notes in Computer Science), vol. 2484, pp. 37–48, 2002.
[21] K. Lang, “Evidence-driven state merging with search”, NECI Technical Report TR98-139, 1998, citeseer.nj.nec.com/lang98evidence.html.
[22] P. Dupont, “Regular grammatical inference from positive and negative samples by genetic search: The GIG method”, in Grammatical Inference and Applications (ICGI-94), R.C. Carrasco and J. Oncina, Eds., pp. 236–245. Springer, Berlin, Heidelberg, 1994.
[23] S. Luke, S. Hamahashi, and H. Kitano, ““Genetic” Programming”, in GECCO-99: Proceedings of the Genetic and Evolutionary Computation Conference, W. Banzhaf et al., Eds., 1999, pp. 1098–1105, Morgan Kaufmann.
[24] S.M. Lucas and T.J. Reynolds, “Learning DFA: Evolution versus Evidence Driven State Merging”, in Proceedings of Congress on Evolutionary Computation, pp. 351–358, 2003.
[25] C.L. Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen, “Higher order recurrent neural networks and grammatical inference”, in Advances in Neural Information Processing Systems 2, D.S. Touretzky, Ed., pp. 380–387. Morgan Kaufmann, San Mateo, CA, 1990.
[26] R.L. Watrous and G.M. Kuhn, “Induction of finite-state automata using second-order recurrent networks”, in Advances in Neural Information Processing Systems 4, J.E. Moody, S.J. Hanson, and R.P. Lippmann, Eds., pp. 309–316. Morgan Kaufmann, San Mateo, CA, 1992.
[27] P.J. Angeline, G.M. Saunders, and J.P. Pollack, “An evolutionary algorithm that constructs recurrent neural networks”, IEEE Transactions on Neural Networks, vol. 5, no. 1, pp. 54–65, January 1994.


[28] P. Wyard, “Context-free grammar induction using genetic algorithms”, in Proceedings of the Fourth International Conference on Genetic Algorithms, R.K. Belew and L.B. Booker, Eds., pp. 514–518. Morgan Kaufmann, San Mateo, CA, 1991.
[29] S.M. Lucas, “Structuring chromosomes for context-free grammar evolution”, in Proceedings of IEEE International Conference on Evolutionary Computation, pp. 130–135. IEEE, Orlando, 1994.
[30] M. Lankhorst, “A genetic algorithm for induction of non-deterministic pushdown automata”, University of Groningen, Computer Science Report CS-R 9502, 1995, http://www.ub.rug.nl/eldoc/dis/science/m.m.lankhorst/c4.pdf.
[31] B.A. Trakhtenbrot and Y.M. Barzdin, Finite Automata, North-Holland, Amsterdam, 1973.
[32] J. Oncina, “The data driven approach applied to the OSTIA algorithm”, Proceedings of the Fourth International Colloquium on Grammatical Inference (Lecture Notes in Computer Science), vol. 1433, pp. 50–56, 1998.
[33] S.M. Lucas, “Evolving finite state transducers: Some initial explorations”, in Proceedings of the 6th European Conference on Genetic Programming, 2003, pp. 130–141, Springer Verlag.
[34] S.M. Lucas and T.J. Reynolds, “Learning deterministic finite automata with a smart state labelling evolutionary algorithm”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1063–1074, 2005.
[35] D. Jurafsky and J.H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall, 2000.
[36] H.G. Beyer, “Toward a theory of evolution strategies: The (mu, lambda)-theory”, Evolutionary Computation, vol. 2, no. 4, pp. 381–407, 1994.
[37] M. Mitchell, J. Holland, and S. Forrest, “When will a genetic algorithm outperform hill climbing?”, in Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, Eds., pp. 51–58. Morgan Kaufmann, San Mateo, CA, 1994.


[38] J.D. Schaffer, D. Whitley, and L.J. Eshelman, “Combinations of genetic algorithms and neural networks: a survey of the state of the art”, in COGANN-92, Int. Workshop on Combinations of Genetic Algorithms and Neural Networks, D. Whitley and J.D. Schaffer, Eds., 1992, pp. 1–37, IEEE Computer Society.
[39] A. Prugel-Bennett, “Symmetry breaking in population-based optimization”, IEEE Transactions on Evolutionary Computation, vol. 8, pp. 63–79, 2004.
[40] C. Igel and P. Stagge, “Effects of phenotypic redundancy in structure optimization”, IEEE Transactions on Evolutionary Computation, vol. 6, pp. 74–85, Feb. 2002.
[41] V.I. Levenshtein, “Binary codes capable of correcting insertions, deletions and reversals”, Cybernetics and Control Theory, vol. 10, pp. 707–710, 1966.
[42] H. Freeman, “Computer processing of line-drawing images”, Computing Surveys, vol. 6, pp. 57–97, 1974.
[43] H. Bunke and U. Buhler, “Applications of approximate string matching to 2d shape recognition”, Pattern Recognition, vol. 26, pp. 1797–1812, 1993.
[44] R.A. Mollineda, E. Vidal, and F. Casacuberta, “A windowed weighted approach for approximate cyclic string matching”, International Conference on Pattern Recognition, pp. 188–191, 2002.
[45] T. Jones, Evolutionary Algorithms, Fitness Landscapes and Search, PhD thesis, The University of New Mexico, 1995.
[46] B. Naudts and L. Kallel, “A comparison of predictive measures of problem difficulty in evolutionary algorithms”, IEEE Transactions on Evolutionary Computation, vol. 4, pp. 1–15, 2000.
