2006 IEEE Congress on Evolutionary Computation, Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada, July 16-21, 2006

An Incremental-Evolutionary Approach for Learning Deterministic Finite Automata

Jonatan Gómez
Department of Computer and Systems Engineering, Universidad Nacional de Colombia
e-mail: [email protected]

Abstract— This work proposes an approach for learning Deterministic Finite Automata (DFA) that combines incremental learning and evolutionary algorithms. First, the training set is sorted by sequence length (from the shortest sequence to the longest one). Then, the training set is divided into a suitable number of groups (M). Next, a DFA population is evolved using a block of the training set (initially the first group). This process is repeated M times, taking the previously evolved DFA population as the initial population and adding the next group of sequences to the previously used block. Finally, an evolutionary algorithm tunes the previously evolved DFA population using the full training set and the remaining running time. Experiments show that our approach performs well regardless of the level of noise present in the training set.

I. INTRODUCTION

The problem of grammatical inference has been extensively studied by the machine learning community [1], [2], [3]. Solving this problem implies building a model that is able to discriminate (classify) whether a string (sequence) belongs to an underlying language, based on a set of previously classified sample strings [3]. Some grammatical inference methods build a deterministic finite automaton as the model [4]. Deterministic finite automata (DFA) define a class of finite state machines with accepting and rejecting states, where the states are used to decide whether a string belongs to a language. The grammatical inference problem is usually called the DFA learning problem when the target models are DFA [4]. Since DFA can only recognize regular languages, using DFA as models is only appropriate when the underlying languages are regular. There are many approaches to the DFA learning problem: using neural networks [5], constructing a prefix tree equivalent to the training data and progressively merging tree nodes (states) [6], and/or evolving DFA directly [7], [8], [4]. Cichello and Kremer [1] present a complete survey of the DFA learning problem. In particular, Lucas and Reynolds [8], [4] used a multi-start random hill climber for evolving only the transition matrix, while the state labels are assigned by a voting mechanism.

The work presented in this paper tunes the DFA learning approach developed by Gomez for the GECCO 2004 Learning DFA competition [9]. The proposed approach uses elements of incremental learning within an evolutionary algorithm. First, the training set is sorted by sequence length (from the shortest sequence to the longest one). Then, the training set is divided

into a suitable number of groups (M). Next, a DFA population is evolved using a block of the training set (initially the first group). The Hybrid Adaptive Evolutionary Algorithm (HAEA) proposed by Gomez [10] is used to evolve the DFA population. This process is repeated M times, taking the previously evolved DFA population as the initial population and adding the next group of sequences to the previously used block. Finally, HAEA tunes the previously evolved DFA population using the full training set during the remaining running time. Experiments show that our approach performs well on noisy training sets.

This document is divided into six sections. Section 2 introduces the concept of DFA, section 3 outlines the Hybrid Adaptive Evolutionary Algorithm (HAEA) proposed by Gomez [10], section 4 describes our approach, section 5 analyzes some of the experimental results, and section 6 draws conclusions and outlines future work.

II. DETERMINISTIC FINITE AUTOMATA

A Deterministic Finite Automaton is defined as a 5-tuple (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a set of input symbols, δ : Q × Σ → Q is the state transition function, q0 ∈ Q is the start state, and F ⊆ Q is the set of accepting states (Q \ F is the set of rejecting states). A deterministic finite automaton is called complete if for every input symbol and every state there is a transition to some state, i.e., if δ : Q × Σ → Q is total.
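For concreteness, the following minimal Python sketch instantiates this definition over the binary alphabet Σ = {0, 1}; the dictionary layout and the names DFA and accepts are illustrative, not part of the paper.

# A complete DFA over Sigma = {'0', '1'} that accepts strings containing
# an odd number of '1' symbols.
DFA = {
    "Q": {0, 1},                          # finite set of states
    "Sigma": {"0", "1"},                  # input symbols
    "delta": {(0, "0"): 0, (0, "1"): 1,   # total transition function
              (1, "0"): 1, (1, "1"): 0},
    "q0": 0,                              # start state
    "F": {1},                             # accepting states; Q \ F rejects
}

def accepts(dfa, string):
    # Run the automaton; the final state decides membership.
    state = dfa["q0"]
    for symbol in string:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["F"]

assert accepts(DFA, "100") and not accepts(DFA, "101")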


III. HYBRID ADAPTIVE EVOLUTIONARY ALGORITHM

Algorithm 1 presents the Hybrid Adaptive Evolutionary Algorithm (HAEA) proposed by Gomez [10]. This algorithm is a mixture of ideas borrowed from Evolution Strategies (ES) and from both kinds of Parameter Adaptation (PA) techniques¹: decentralized control adaptation and central control adaptation [11], [12], [13], [14], [15]. In central control adaptation techniques, genetic operator rates (such as the mutation rate, crossover rate, etc.) are adapted according to a global learning rule that takes into account the operator productivities through the generations (iterations) [16]. Generally, only one operator is applied per generation, and it is selected based on its productivity. The productivity of an operator is measured in terms of the good individuals produced by the operator. A good individual is one that improves the fitness measure of the current population. If an operator generates a higher number of good individuals than the other operators, its probability is rewarded. Two well-known centralized learning rule mechanisms are the adaptive mechanisms of Davis [11] and Julstrom [12].

¹ Parameter Adaptation techniques try to eliminate the parameter setting process by adapting the parameters during the algorithm's execution.

Algorithm 1 Hybrid Adaptive Evolutionary Algorithm (HAEA)

HAEA( P0, terminationCondition )
1.  P0 : initial population
2.  t = 0
3.  while( terminationCondition( t, Pt ) is false ) do
4.    Pt+1 = {}
5.    for each ind ∈ Pt do
6.      rates = extract_rates( ind )
7.      δ = random(0,1)  : learning rate
8.      oper = OP_SELECT( operators, rates )
9.      parents = PARENT_SELECTION( Pt, ind )
10.     offspring = apply( oper, parents )
11.     child = BEST( offspring, ind )
12.     if( fitness( child ) > fitness( ind ) ) then
13.       rates[oper] = (1.0 + δ) * rates[oper]  : reward
14.     else
15.       rates[oper] = (1.0 - δ) * rates[oper]  : punishment
16.     normalize_rates( rates )
17.     set_rates( child, rates )
18.     Pt+1 = Pt+1 ∪ {child}
19.   t = t + 1

In decentralized control strategies, genetic operator rates are encoded in the individual and are subject to the evolutionary process [16], [17]. Accordingly, genetic operator rates can be encoded as an array of real values in the semi-open interval [0.0, 1.0), with the constraint that the sum of these values must be equal to one [13]. Since the operator rates are encoded as real numbers, special genetic operators (meta-operators) are applied to adapt or evolve them.

A. Selection Mechanism

In HAEA, each individual is evolved independently from the other individuals of the population, as in evolution strategies [18]. In each generation, every individual selects only one operator from the set of possible operators (line 8). The operator is selected according to the operator rates encoded in the individual. When a non-unary operator is applied, additional parents are chosen according to some selection strategy (the individual being evolved is itself considered a parent), see line 9. Notice that HAEA does not generate a parent population from which the next generation is completely produced. Among the offspring produced by the genetic operator, only one individual is chosen as the child (line 11) and takes the place of its parent in the next population (line 18). In order to preserve good individuals through evolution, HAEA compares the parent individual against the offspring generated by the operator. The BEST function selects one individual (parent or offspring) according to some criterion (line 11). In this paper, we select the individual with the highest fitness. Therefore, an individual is preserved if it is better than all the individuals that can be generated from it by applying the set of genetic operators.

B. Encoding of Genetic Operator Rates

The genetic operator rates are encoded in the individual in the same way as in decentralized control adaptation techniques, see Figure 1. These probabilities are initialized (in the initPopulation method) with values following a uniform distribution U[0, 1]. A roulette selection scheme is used to select the operator to be applied (line 8).

[Fig. 1. Encoding of the operator probabilities in the chromosome:
  SOLUTION        OPER1  ...  OPERn
  100101011..01   0.3    ...  0.1 ]
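As a minimal sketch of this encoding step, the rate initialization and the roulette selection of line 8 can be written as follows in Python; the names init_rates and op_select are illustrative, and the normalization is assumed from the sum-to-one constraint mentioned above.

import random

def init_rates(num_operators):
    # Draw one rate per operator from U[0,1] and normalize so they sum to one.
    raw = [random.random() for _ in range(num_operators)]
    total = sum(raw)
    return [r / total for r in raw]

def op_select(rates):
    # Roulette selection (line 8): operator i is chosen with probability rates[i].
    r, acc = random.random(), 0.0
    for i, rate in enumerate(rates):
        acc += rate
        if r < acc:
            return i
    return len(rates) - 1  # guard against floating-point round-off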

C. Adapting the Probabilities

The performance of the produced child is compared against its parent's performance in order to determine the productivity of the operator (lines 12-15). The operator is rewarded if the child is better than the parent and punished if it is worse. The magnitude of the reward/punishment is defined by a learning rate that is randomly generated (line 7). Finally, the operator rates are recomputed, normalized, and assigned to the individual that will be copied to the next population (lines 16-17). The learning rate is generated at random instead of being set to a specific value for two main reasons. First, there is no clear indication of the appropriate value that should be given to the learning rate; it can depend on the problem being solved. Second, experiments encoding the learning rate into the chromosome [19] show that the behavior of the learning rate can be simulated using a random variable with uniform distribution.

D. Properties

Contrary to other adaptation techniques, HAEA does not try to determine or maintain an optimal rate for each genetic operator. Instead, HAEA tries to determine the appropriate operator rate at each moment according to the conditions of each individual. If the optimal solution is reached by an individual in some generation, then the rates of that individual will converge to the same value in subsequent generations. This happens because no genetic operator is able to improve the optimal solution, so the applied operator is punished while, through normalization, the others are relatively rewarded. HAEA uses the same amount of extra information as a decentralized adaptive control: it requires a matrix of n * M doubles, where n is the number of different genetic operators and M is the population size. Thus, the space complexity of HAEA is linear with respect to the number of operators (the population size is considered a constant). Also, the time spent calculating and normalizing the operator rates is linear with respect to the number of operators (lines 8 and 12-16). HAEA does not require special operators or additional parameter settings. Well-known genetic operators can be used without any modification. Different encoding schemes can be used: binary, real, trees, programs, etc. Finally, the average population fitness grows monotonically, since an individual is only ever replaced by an individual with equal or higher fitness.
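Putting subsections A through C together, one HAEA generation step for a single individual can be sketched in Python as follows. It reuses op_select from the previous sketch, restricts itself to unary operators for brevity, and all names are illustrative rather than the paper's code.

import random

def haea_step(ind, rates, operators, fitness):
    # One HAEA generation for one individual (Algorithm 1, lines 6-18).
    # `operators` is a list of unary functions: genome -> list of offspring.
    delta = random.random()                       # line 7: random learning rate
    i = op_select(rates)                          # line 8: roulette selection
    offspring = operators[i](ind)                 # line 10 (unary case)
    child = max(offspring + [ind], key=fitness)   # line 11: BEST keeps the fittest
    if fitness(child) > fitness(ind):             # lines 12-15: adapt the rate
        rates[i] *= (1.0 + delta)                 # reward
    else:
        rates[i] *= (1.0 - delta)                 # punishment
    total = sum(rates)                            # line 16: normalize
    rates = [r / total for r in rates]
    return child, rates                           # lines 17-18: child joins P(t+1)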


IV. PROPOSED APPROACH

We use some ideas from incremental learning inside an evolutionary algorithm in order to allow the evolutionary process to find a DFA population that is consistent with the short strings first. The proposed approach is shown in Algorithm 2. First, the training set is sorted by string length, from the shortest string to the longest one, see line 1 of the DFA_INC_LEARNING algorithm (Algorithm 2). If n is the size of the training set, then length(x1) ≤ length(x2) ≤ .. ≤ length(xn) after the sorting process. Then, the training set is divided into a given number of groups (M). Next, a DFA population is evolved using a block of the training set (initially the first group). The Hybrid Adaptive Evolutionary Algorithm (HAEA) proposed by Gomez [10] is used to evolve the DFA population. HAEA evolves the given DFA population until one of two conditions is satisfied: the maximum amount of time defined for that block of the training set has been used (line 2)², or the expected fitness for that block has been reached by the best individual in the population (line 10)³. This process is repeated M times (lines 6-13), taking the previously evolved DFA population as the initial population and adding the next group of strings to the previously used block (lines 7 and 8). Finally, HAEA tunes the previously evolved DFA population using the full training set and the remaining running time (line 14). The best evolved DFA is then returned (line 15).

² The amount of time given to each block is a 1/M-th of the incremental running time.
³ The fitness of the best individual is expected to increase linearly, group by group, until it reaches the maximum allowed fitness.

Algorithm 2 Incremental DFA Learning algorithm

DFA_INC_LEARNING( x, M, λ, E, R, IR )
  x : training set
  M : number of groups
  λ : population size
  E : level of noise in the data set
  R : running time
  IR : percentage of the running time for incremental learning
  n : number of samples in the training set
1.  SORT( x )  : sort the data set by string length
2.  time = (R * IR/100) / M
3.  P = INITPOPULATION( λ )
4.  maxF = 1.0 - E  : maximum fitness
5.  currentF = maxF
6.  for i = 1 to M do  : incremental learning
7.    k = ⌈i * n / M⌉
8.    yi = { x1, x2, .., xk }
9.    evalPopulation( P, yi )
10.   minF = currentF + (maxF - currentF) / (M - i + 1)
11.   P = HAEA( P, INC_END_CONDITION(minF, time) )
12.   currentF = best_fitness( P )
13. endfor
14. P = HAEA( P, INC_END_CONDITION(maxF, R - M*time) )
15. return best_individual( P )

INC_END_CONDITION(min, time)( t, P )
  min : minimum fitness to be reached by HAEA
  time : maximum time HAEA can spend
1. best_f = best_fitness( P )
2. return usesTime(time) ∨ best_f ≥ min

A. Encoding

Although the proposed approach can be used with different DFA encoding mechanisms, we present only two encodings: a simple plain encoding and the smart labeling encoding proposed by Lucas and Reynolds [8]. A deterministic finite automaton of n states can be encoded using two arrays: a first array of 2 * n integers, each of them in the interval [0, n - 1], representing the topology of the automaton, and a second array of n bits representing the state labels, see Figure 2. Here, a triple <x, s, y> represents the edge starting at state x, ending at state y and reading the transition symbol s. Since the starting state and the transition symbol can be inferred from the position in the array, the only element of the triple that needs to be stored is the ending state. Lucas and Reynolds reduce the dimension of the search space by a factor of 2^n with their smart labeling encoding [8]. In this approach, the label of each state is determined by a majority vote of the training strings falling into that state; in this way, the label array is not required.

B. Genetic Operators

An integer mutation operator is used for evolving the DFA: it takes one integer value of the transition matrix and replaces it with another integer value randomly selected from the interval [0, n - 1]. The probability of modifying one of the integer values of the transition matrix is set by the user. In order to test the HAEA mechanism, three copies of this mutation operator are used, each of them with a different probability of modifying a transition matrix value (0.1, 0.3, and 1/(2n)). For the plain encoding, the integer mutation operator is extended in such a way that it can flip each of the bits in the label array with the same probability with which it can change an integer in the edge array. A sketch of the encoding and of this operator is given below.
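The following Python sketch illustrates the plain encoding (edge array plus label array), the majority-vote smart labeling, and the integer mutation operator over the binary alphabet. The function names, the random initializer, and the tie-breaking rule in the vote are illustrative assumptions, not the paper's code.

import random

def random_dfa(n):
    # Plain encoding: edges[s][a] is the ending state of the edge <s, a, .>;
    # labels[s] is True if state s is accepting.
    edges = [[random.randrange(n) for _ in range(2)] for _ in range(n)]
    labels = [random.random() < 0.5 for _ in range(n)]
    return edges, labels

def smart_labels(edges, samples):
    # Smart state labeling [8]: each state takes the majority label of the
    # training strings that end in it (ties broken as accepting here).
    votes = [0] * len(edges)
    for string, accepted in samples:   # string is a sequence of symbols 0/1
        state = 0
        for symbol in string:
            state = edges[state][symbol]
        votes[state] += 1 if accepted else -1
    return [v >= 0 for v in votes]

def integer_mutation(dfa, rate):
    # Replace each transition entry (and, for the plain encoding, flip each
    # label bit) with probability `rate`, i.e., 0.1, 0.3 or 1/(2n).
    edges, labels = dfa
    n = len(edges)
    new_edges = [[random.randrange(n) if random.random() < rate else e
                  for e in row] for row in edges]
    new_labels = [(not b) if random.random() < rate else b for b in labels]
    return new_edges, new_labels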


C. Fitness Function

The fitness value of an individual is simply the number of strings correctly classified divided by the total number of strings in the current string block, see equation (1). Clearly, this is a non-stationary fitness function, and HAEA is able to deal with it.

f(A, yi) = CorrectlyClassified / ⌈i * n / M⌉    (1)
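A compact Python sketch of this fitness function and of the incremental driver of Algorithm 2 follows. It reuses random_dfa from the encoding sketch above; haea stands for a hypothetical routine that runs Algorithm 1 on a string block until the given fitness target or time budget is hit, and the DFA size of 16 states is an arbitrary assumption.

import math

def classify(dfa, string):
    edges, labels = dfa
    state = 0
    for symbol in string:
        state = edges[state][symbol]
    return labels[state]

def fitness(dfa, block):
    # Equation (1): fraction of the current string block classified correctly.
    correct = sum(1 for s, y in block if classify(dfa, s) == y)
    return correct / len(block)

def dfa_inc_learning(x, M, lam, E, R, IR, haea, n_states=16):
    # Sketch of Algorithm 2: x is the training set of (string, label) pairs.
    x = sorted(x, key=lambda sy: len(sy[0]))         # line 1: shortest first
    n = len(x)
    time = (R * IR / 100.0) / M                      # line 2: per-block budget
    P = [random_dfa(n_states) for _ in range(lam)]   # line 3
    max_f = 1.0 - E                                  # line 4
    current_f = max_f                                # line 5, as printed
    for i in range(1, M + 1):                        # lines 6-13
        k = math.ceil(i * n / M)                     # line 7
        y_i = x[:k]                                  # line 8
        min_f = current_f + (max_f - current_f) / (M - i + 1)   # line 10
        P = haea(P, y_i, min_f, time)                # line 11
        current_f = max(fitness(d, y_i) for d in P)  # line 12
    P = haea(P, x, max_f, R - M * time)              # line 14: final tuning
    return max(P, key=lambda d: fitness(d, x))       # line 15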

V. EXPERIMENTAL ANALYSIS

A. Experimental Setup

We used the training sets of the GECCO 2004 noisy DFA contest in order to compare the performance of our approach against approaches previously reported in the literature. For these data sets, the process of generating the DFA and the training/testing sets is similar to the one used for the Abbadingo and Gowachin contests [4], but some of the string labels are flipped with probability 0.1, i.e., the level of noise is around 10%. There are three different DFA sizes (10, 20 and 30 states) and ten different training/testing files for each of them. Several experiments were performed in order to determine the sensitivity of the proposed approach to its parameters. In particular, we used a 2 × 2 × 4 × 6 factorial experimental design varying the encoding (smart labeling or plain), the sorting of strings (with and without sorting), the population size (1, 5, 10, 20), and the number of groups (1, 5, 10, 20, 100, 3000). Each of these configurations was run for 1 minute on a Pentium IV 2.2 GHz computer on the 30-state DFA problem. Finally, the best configuration was used for solving all three problems (10, 20, and 30 DFA states).

1068

[Fig. 2. Encoding of an automaton of n states:
  Edge array:  <0,0,x0,0> <0,1,x0,1> <1,0,x1,0> <1,1,x1,1> ... <n-1,0,xn-1,0> <n-1,1,xn-1,1>
  Label array: label0 label1 ... labeln-1 ]

TABLE II
CLASSIFICATION PERFORMANCE REACHED BY THE PROPOSED APPROACH USING SMART LABELING AND PLAIN ENCODING.

encoding   min          max          median       mean
plain      67.7 (2.11)  85.7 (5.06)  73.6 (4.03)  75.1
smart      75.8 (3.68)  98.5 (0.63)  84.6 (4.28)  85.3

TABLE III
CLASSIFICATION PERFORMANCE REACHED BY THE PROPOSED APPROACH WITH AND WITHOUT THE PROCESS OF STRING SORTING.

            min          max          median       mean
sorting     69.0 (3.06)  98.5 (0.63)  82.6 (3.74)  82.4
no sorting  67.7 (2.11)  94.7 (3.49)  77.4 (3.56)  78.7

B. Results Summary

Table I presents the set of results obtained with the 2 × 2 × 4 × 6 factorial experimental design. A value of avg (err) indicates an experiment with an average classification rate of avg and a standard error of err. For each experiment (configuration), these values are obtained from the ten (10) results on the ten (10) training/test pairs of data sets.

C. Results using Smart Labeling and Plain Encoding

Table II summarizes the classification performance reached by the proposed approach using the two encodings. As expected, smart labeling is better than plain encoding, since smart labeling reduces the search space drastically (by a factor of 2^n) with respect to the plain encoding. Notice that smart labeling increases the classification performance by at least 10% with respect to the plain encoding. Figure 3 compares the fitness evolution of the proposed approach using the plain and smart state labeling encodings.

D. Results with and without Sorting of Strings

Table III summarizes the classification performance reached by the proposed approach with and without the string sorting process. Clearly, the sorting process allows the evolutionary algorithm to perform better, since it is evolving an automaton in an incremental manner: first, an automaton consistent with the short strings, then one consistent with longer strings, and finally one consistent with all the strings. Notice that the performance increases by at least 3.5% on average (the median value is consistent with this analysis too). Figure 4 compares the fitness evolution of the proposed approach with and without the string sorting process.

E. Results using Different Numbers of Groups

TABLE IV
CLASSIFICATION PERFORMANCE REACHED BY THE PROPOSED APPROACH USING DIFFERENT NUMBERS OF GROUPS.

groups  min          max          median       mean
3000    71.7 (2.83)  98.5 (0.63)  83.8 (3.14)  83.4
100     69.9 (2.65)  94.7 (3.49)  82.2 (3.35)  82.2
20      69.8 (2.51)  93.4 (4.20)  78.9 (4.19)  80.3
10      68.2 (2.84)  88.1 (4.53)  77.0 (3.81)  78.3
5       67.7 (2.11)  87.3 (4.35)  79.5 (4.07)  78.5
1       69.5 (2.64)  83.8 (4.82)  75.8 (3.68)  76.5

Table IV summarizes the classification performance reached by the proposed approach when the training string set is divided into 1, 5, 10, 20, 100 and 3000 groups (3000, 600, 300, 150, 30, and 1 string per group, respectively). Clearly, when the training set is divided into more groups, the evolutionary process can find good solutions in an incremental way, i.e., it is able to evolve automata that generalize from a small number of samples. The performance reached when using 3000 groups (one string per group) is higher (by at least 6.9%) than that of the same approach when using the whole data set as a single group (the median value is consistent with this analysis too). Moreover, the performance improvement is in general gradual: from 100 groups to 3000 groups it increases by at least 1.2%, from 20 to 100 groups by 1.9%, from 10 to 20 groups by 2.0%, and from 1 group to 5 groups by 2%. The only step that does not follow this tendency is from 5 groups to 10. Figure 5 compares the fitness evolution of the proposed approach using different numbers of strings per group.


TABLE I
CLASSIFICATION PERFORMANCE REACHED BY THE PROPOSED APPROACH FOR EACH OF THE EXPERIMENTS (CONFIGURATIONS TESTED) USING A FACTORIAL EXPERIMENT DESIGN. ROWS: POPULATION SIZE; COLUMNS: NUMBER OF GROUPS (M).

Smart labeling, with sorting:
pop \ M  1            5            10           20           100          3000
1        81.3 (5.97)  87.3 (4.35)  86.7 (5.71)  91.4 (4.94)  93.9 (3.60)  98.5 (0.63)
5        83.8 (4.82)  83.1 (5.18)  88.1 (4.53)  87.9 (3.85)  92.4 (3.78)  93.6 (3.17)
10       79.4 (4.18)  87.1 (4.79)  82.6 (3.98)  86.9 (4.08)  87.6 (3.94)  93.6 (2.52)
20       75.8 (3.68)  79.5 (4.07)  78.0 (3.76)  78.9 (4.19)  82.6 (3.74)  88.4 (3.90)

Smart labeling, without sorting:
pop \ M  1            5            10           20           100          3000
1        81.3 (5.97)  82.9 (6.20)  83.1 (6.07)  92.4 (4.20)  94.7 (3.49)  85.5 (4.54)
5        83.8 (4.82)  86.9 (4.69)  85.7 (5.86)  85.2 (5.05)  85.5 (4.70)  83.0 (5.02)
10       79.4 (4.18)  81.8 (4.42)  83.6 (4.37)  84.1 (3.99)  84.6 (4.28)  87.2 (4.29)
20       75.8 (3.68)  80.4 (3.99)  76.2 (4.11)  77.4 (3.56)  82.2 (3.35)  83.8 (3.14)

Plain encoding, with sorting:
pop \ M  1            5            10           20           100          3000
1        75.6 (5.50)  81.5 (4.46)  82.7 (5.16)  85.7 (5.06)  82.3 (4.93)  84.8 (5.38)
5        76.2 (3.39)  73.5 (3.99)  77.0 (3.81)  80.0 (4.45)  81.8 (4.06)  85.2 (3.52)
10       69.5 (2.64)  69.8 (3.45)  72.3 (3.89)  73.2 (3.43)  77.6 (3.71)  82.4 (3.12)
20       70.7 (2.92)  69.0 (2.53)  69.0 (3.06)  69.8 (2.51)  72.2 (2.54)  77.5 (3.31)

Plain encoding, without sorting:
pop \ M  1            5            10           20           100          3000
1        75.6 (5.50)  78.8 (5.47)  75.1 (5.07)  76.6 (5.31)  79.2 (4.82)  74.3 (4.43)
5        76.2 (3.39)  74.8 (3.81)  74.3 (4.43)  73.6 (4.03)  77.0 (3.97)  73.0 (3.81)
10       69.5 (2.64)  72.4 (3.46)  70.8 (3.75)  71.2 (3.48)  72.4 (2.85)  72.1 (2.89)
20       70.7 (2.92)  67.7 (2.11)  68.2 (2.84)  70.3 (2.80)  69.9 (2.64)  71.7 (2.83)

[Fig. 3. Fitness evolution using plain and smart state labeling. (a) Maximum fitness evolution of both plain and smart, (b) fitness evolution using plain encoding, (c) fitness evolution for smart state labeling encoding. Axes: fitness vs. used time (%).]

[Fig. 4. Fitness evolution with and without sorting of the training strings. (a) Maximum fitness evolution both with and without sorting, (b) fitness evolution using sorting, (c) fitness evolution without sorting. Axes: fitness vs. used time (%).]


[Fig. 5. Fitness evolution using different group sizes. (a) Maximum fitness evolution using different group sizes, (b) 3000 groups - 1 sample per group, (c) 100 groups, (d) 10 groups, (e) 5 groups, (f) the full data set - 1 group. Axes: fitness vs. used time (%).]

TABLE V
CLASSIFICATION PERFORMANCE REACHED BY THE PROPOSED APPROACH USING DIFFERENT POPULATION SIZES.

size  min          max          median       mean
1     74.3 (4.43)  98.5 (0.63)  82.9 (6.20)  84.3
5     73.0 (3.81)  93.6 (3.17)  83.0 (5.02)  81.9
10    69.5 (2.64)  93.6 (2.52)  79.4 (4.18)  79.2
20    67.7 (2.11)  83.8 (3.14)  72.2 (2.54)  75.4

F. Results using Different Population Sizes

Table V summarizes the classification performance reached by the proposed approach with population sizes of 1, 5, 10, and 20 individuals. Clearly, smaller populations perform better on average than larger populations. However, the evolutionary process can get trapped in a local optimum when the population is small: the worst performance is achieved when using just one individual. Also, if the population is large, the evolutionary process cannot exploit the good candidate solutions it is evolving: the second worst performance is achieved with a population of 20 individuals. Figure 6 compares the fitness evolution of the proposed approach using different population sizes.

G. Results using the Optimal Setting

According to the results obtained in the previous sections, the best configuration for the proposed approach is presented in Table VI.

TABLE VI
OPTIMAL CONFIGURATION FOR THE PROPOSED APPROACH.

OPTION           VALUE
Sorting          true
Group size       1
Population size  1
Encoding         smart

Using this configuration, the proposed approach (called IH-DFA) reaches the performance shown in Table VII. The values for the other approaches were taken from [4]. Clearly, IH-DFA dramatically improves the performance of our preliminary version (Gómez) in both running time and accuracy; it now generates a better automaton in almost every run. Figure 7 compares the fitness evolution of the proposed approach on the different DFA sizes.



[Fig. 6. Fitness evolution using different population sizes. (a) Maximum fitness evolution using different population sizes, (b) 1 individual, (c) 5 individuals, (d) 10 individuals, (e) 20 individuals. Axes: fitness vs. used time (%).]

TABLE VII
COMPARATIVE PERFORMANCE OF THE PROPOSED APPROACH (IH-DFA) ON THE 10, 20, AND 30 STATE GECCO 2004 NOISY DFA CONTEST PROBLEMS.

states  method  min   max   mean  s.e.   success  running time
10      Gómez   97.0  100   99.0  0.2    6        10 minutes
10      Blue*   74.0  100   89.0  3.0    1        10 minutes
10      Smart   96.0  100   99.4  0.004  9        6 minutes
10      IH-DFA  97.2  100   99.5  0.28   8        1 minute
20      Gómez   83.0  99.8  98.0  1.6    5        10 minutes
20      Blue*   82.0  98.0  90.0  1.5    0        10 minutes
20      Smart   98.0  100   99.7  0.002  9        6 minutes
20      IH-DFA  95.1  100   99.4  0.49   8        1 minute
30      Gómez   60.0  99.7  90.0  4.6    6        10 minutes
30      Blue*   70.0  96.0  82.0  2.3    0        10 minutes
30      Smart   99.0  100   99.6  0.001  10       6 minutes
30      IH-DFA  93.5  99.9  98.5  0.63   6        1 minute

[Fig. 7. Fitness evolution for the GECCO 2004 noisy DFA contest. (a) 10 states problem, (b) 20 states problem, (c) 30 states problem. Axes: fitness vs. used time (%).]


VI. CONCLUSIONS AND FUTURE WORK

Several conclusions can be drawn from the results obtained:
• The smart state labeling improves the performance of the evolutionary algorithm. This can be attributed to the fact that the search space is smaller using smart labeling (n^2n) than using the plain encoding (n^2n * 2^n). Additionally, thanks to the voting mechanism of smart state labeling, the fitness of any DFA is guaranteed to be at least 0.5, something that is not always true when using the plain encoding.
• The sorting of the training strings helps the evolutionary process to evolve good automata faster than without it. This can be attributed to the fact that the evolutionary process first tries to find an automaton that is consistent with the short strings; the automaton is then evolved so as to preserve the previous automaton structure while discriminating the additional (longer) strings.
• The incremental process has a great impact on the performance of the evolutionary process. When the incremental process was employed, the evolved automaton was significantly better than the automaton evolved without it. Moreover, the performance was higher when the group size was smaller. This means that it is a good idea to train the automaton with a few short strings first and to gradually show it a few longer strings while keeping the short ones. In particular, it is advisable to use each single string as its own group.
• The performance of the proposed approach is strongly affected by the size of the population being evolved. When the population is large, the evolutionary process spends a lot of time exploring without exploiting good DFA. If the population is small, the evolutionary process can get trapped in local optima. However, HAEA is able to deal with these situations if good genetic operators (for example, the integer mutation with different gene mutation rates) are applied.
• The proposed approach drastically improves the running time of our previous work while at the same time improving the accuracy of the evolved automaton.

Our future research will concentrate on defining a notion of distance between DFA that allows us to use crossover operators with deterministic crowding. This combination will allow us to maintain diversity and, hopefully, to reduce the running time.

REFERENCES

[1] O. Cichello and S. C. Kremer, "Inducing grammars from sparse data sets: A survey of algorithms and results," Journal of Machine Learning Research, vol. 4, pp. 603-632, 2003.
[2] P. Dupont, "Incremental regular inference," in Proceedings of the Third ICGI-96, pp. 222-237, 1996.
[3] J. Bongard and H. Lipson, "Active coevolutionary learning of deterministic finite automata," Journal of Machine Learning Research, vol. 6, pp. 1651-1678, 2005.
[4] S. M. Lucas and T. J. Reynolds, "Learning deterministic finite automata with a smart state labelling evolutionary algorithm," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1063-1074.
[5] C. Giles, G. Sun, H. Chen, Y. Lee, and D. Chen, "High order recurrent neural networks and grammatical inference," in Advances in Neural Information Processing
[11] L. Davis, "Adapting operator probabilities in genetic algorithms," in Third International Conference on Genetic Algorithms and their Applications, pp. 61-69, 1989.
[12] B. Julstrom, "What have you done for me lately? Adapting operator probabilities in a steady-state genetic algorithm," in Sixth International Conference on Genetic Algorithms, pp. 81-87, 1995.
[13] A. Tuson and P. Ross, "Adapting operator settings in genetic algorithms," Evolutionary Computation, 1998.
[14] F. Lobo, The parameter-less genetic algorithm: rational and automated parameter selection for simplified genetic algorithm operation. PhD thesis, Nova University of Lisboa, 2000.
[15] W. Spears, "Adapting crossover in evolutionary algorithms," in Evolutionary Programming Conference, 1995.
[16] A. E. Eiben, R. Hinterding, and Z. Michalewicz, "Parameter control in evolutionary algorithms," IEEE Transactions on Evolutionary Computation, vol. 3(2), pp. 124-141, 1999.
[17] M. Srinivas and L. M. Patnaik, "Adaptive probabilities of crossover and mutation in genetic algorithms," IEEE Transactions on Systems, Man and Cybernetics, vol. 24(4), pp. 656-667, 1994.
[18] T. Back, Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, 1996.
[19] J. Gomez and D. Dasgupta, "Using competitive operators and a local selection scheme in genetic search," in Late-breaking papers, GECCO 2002, 2002.