
Connection Science Vol. 00, No. 00, Month-Month 200x, 1–17

Towards Unbiased Benchmarking of Evolutionary and Hybrid Algorithms for Real-valued Optimisation

CARA MACNISH∗
The University of Western Australia, Perth WA 6009, Australia

(Received 00 Month 200x; In final form 00 Month 200x)

The success of evolutionary algorithms and their hybrids on many difficult real-valued optimisation problems has led to an explosion in the number of algorithms and variants proposed. In order for the field to advance it is necessary to carry out effective comparative evaluations of these algorithms, and thereby better identify and understand those properties that optimise performance. This paper discusses the difficulties of providing benchmarking of evolutionary and allied algorithms that is both meaningful and logistically viable. To be meaningful the benchmarking test must give a fair comparison that is free, as far as possible, from biases that favour one style of algorithm over another. To be logistically viable it must overcome the need for pairwise comparison between all the proposed algorithms. To address the first problem, the paper describes a suite of test problems, generated as self-similar or fractal landscapes, designed to overcome the biases inherent in many of the traditional benchmarking functions. For the second, we describe a server that uses web services to allow researchers to "plug in" their algorithms, running on their local machines, to a central benchmarking repository.

Keywords: real-valued optimisation; evolutionary algorithms; hybrid algorithms; benchmarking; fractal landscapes; web services

1 Introduction

The success of randomized, population-based optimisation algorithms, such as genetic algorithms (GAs), evolutionary algorithms (EAs) and swarm-based algorithms (e.g. PSOs), along with the steady increase in accessible computing power, has allowed previously impractical search and optimisation problems to be tackled across many application domains. These algorithms have proven particularly effective in difficult real-valued optimisation problems in which the solution space is "poorly behaved" in the sense of traditional mathematical optimisation routines. Examples include hypersurfaces that are non-differentiable or discontinuous, where gradient information does not exist, or where the gradient information that is accessible is of little help due to the high modality of the problem.

In many cases more traditional search techniques, including local searches such as gradient descent or hill-climbing algorithms, can be effectively utilised as part of the search for a global optimum. This recognition has led to the proposal of many hybrid or "memetic" algorithms that seek to gain the best of both worlds. Finally, researchers in newer areas of evolutionary and allied algorithms should not be too quick to dismiss (stochastic and non-stochastic versions of) the traditional algorithms themselves without taking into account the relative computational costs of the population-based approaches.

The explosion in the variety and sheer volume of optimisation algorithms proposed, while making for an exciting field, also makes it difficult to carry out effective comparative evaluation of the algorithms and thereby advance the field. Fair and practical means of comparing the many algorithms proposed are necessary both to identify the best approaches and to better understand which properties of the algorithms are most successful, so that improved algorithms can be devised.

In this paper we discuss the difficulties of providing benchmarking of evolutionary and allied algorithms that is both meaningful and logistically viable. To be meaningful the benchmarking test must give a fair comparison that is free, as far as possible, from biases that favour one style of algorithm over another. To be logistically viable it must overcome the need for pairwise comparison between all proposed solutions. To address the first problem, we describe a suite of test problems, generated as self-similar or fractal landscapes, designed to minimise the biases inherent in many of the traditional benchmarking functions.

∗ Email: [email protected]


Figure 1. Publications in Differential Evolution cited by ISI Web of Knowledge showing approximate doubling every two years, taken from Suganthan (2006).

For the second, we describe a server that uses web services to allow researchers to "plug in" their algorithms, running on their local machines, to a central benchmarking repository.

In Section 2 we discuss some of the difficulties that have inhibited the broad adoption of benchmarking techniques for evolutionary and related algorithms, and begin to motivate our solution. Section 3 describes in more detail the problem of biases in benchmarking problems, and explains why fractal landscapes are chosen to minimise these biases. Section 4 discusses our algorithm for generating the benchmarking problem suite and the main design decisions that enabled its implementation. Section 5 describes the server architecture that is used to overcome the practical issues of widely available benchmarking, and Section 6 concludes the paper. This paper extends ideas presented in MacNish (2006). The benchmarking system is available on-line at http://ai.csse.uwa.edu.au/cara/huygens/ (MacNish, 2007).

2 Difficulties in Universal Benchmarking

There are a number of issues that inhibit the development and adoption of benchmarking facilities, both in terms of practical logistics and the problem domains chosen. In this section we give an overview of these issues, and in the following section we focus on problem domains in greater detail.

2.1 Growth in the Volume of Proposed Algorithms

The intense interest in evolutionary and hybrid methods for real-valued optimisation (and their commercial potential) has seen a rapid increase in the number of algorithms and variants proposed. Figure 1, for example, shows the rate of increase in publications in just one subfield, differential evolution, taken from one citation index by Suganthan (2006), showing an approximate doubling every two years. There are now over 1000 papers submitted annually to the two largest evolutionary computation conferences alone. This makes it increasingly difficult for researchers to keep abreast of how the various algorithms and variants compare, and for practitioners to select the best approaches. Even if authors were able to readily obtain or implement other authors’ algorithms, the number of pairwise comparisons needed increases combinatorially with the number of algorithms proposed. For this reason it is necessary to have centralised benchmarking facilities. Our approach to this problem is to develop a benchmarking server through which authors can compare their algorithms. The server maintains statistics on algorithms’ performance in “league tables”. This allows an author to compare the performance of their algorithm against a large number of their peers at once. While this paper focusses primarily on problem domains, the server is briefly described in Section 5.
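To put the pairwise-comparison burden in concrete terms (a simple count, not a figure from the paper), comparing every pair among $k$ proposed algorithms requires

\[ \binom{k}{2} = \frac{k(k-1)}{2} \]

separate comparisons, so 100 algorithms would already call for 4950 pairwise studies, whereas a centralised server requires each algorithm to be benchmarked only once against the common suite.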



Figure 2. Schaffer’s F6 function shown with a) 2 independent variables and b) 1 independent variable (cross section).

It should be noted that other benchmarking problems, problem generators and repositories have been made available through the web. Examples include Spears' Repository of Test Functions (Spears, 2007), Spears and Potter's Repository of Test Problem Generators (Spears & Potter, 2007) and the Evolutionary Computation Benchmark Repository (EvoCoBR, 2007). Each of the available sites has different facilities, and one of the challenges is to combine the best aspects into a more mature and widely used repository of both benchmarking functions and performance results.

2.2 Programming Languages and Architectures

Researchers develop implementations for testing their algorithms in programming languages and development environments they are familiar with. Problem sets to test the algorithms tend to be developed in the same language, and often as part of the same code. Where other researchers have used different languages or coding structures, it often requires many hours of programming and testing, and in many cases rewriting in a new language, in order to compare algorithms on the same problems. This is made more difficult by the fact that researchers often do not have the resources to prepare and document their code for easy public consumption. Similar problems exist for different hardware platforms and associated operating systems. This is particularly true when executables are written for proprietary systems.

Our approach to this problem uses the Simple Object Access Protocol (SOAP) as an interlingua between the user's code and the benchmarking suite. This allows the user to "plug in" solutions developed in the language and architecture of their choice. This approach is outlined in more detail in Section 5.

2.3 Choice of Problem Domains

Optimisation algorithms are developed and themselves optimised for a huge range of problem domains. Many of these are not well known or understood, and in some cases they are proprietary and cannot be released. In order for benchmarking problems to be widely accepted they must be readily accessible and easy to become familiar with. Some de facto standards have emerged through common use. Well-known examples are those by De Jong (1975) and Schaffer et al. (1989). These are typically elementary functions that are designed to be in some way challenging to the solver. A typical example is Schaffer's F6 function, shown in Figure 2. Some other commonly used functions are listed in Table 1 and shown in Figure 3.

While a focus on these problems has raised many important issues, they are arguably poorly representative of naturally occurring optimisation problems. F6, for example, is difficult because the closer a candidate solution gets to the global minimum, the bigger the hill that must be climbed or "jumped" to move from one local minimum to the next. However, we are not aware of any naturally occurring phenomena that clearly have this property.


Table 1. Common benchmarking functions. For more details see for example Digalakis & Margaritis (2002); Salomon (1996).

Sphere: $f(\mathbf{x}) = \sum_{i=1}^{n} x_i^2$

Rosenbrock's function (2D): $f(\mathbf{x}) = 100(x_1^2 - x_2)^2 + (1 - x_1)^2$

Rastrigin's function: $f(\mathbf{x}) = \sum_{i=1}^{n} \left( x_i^2 - 10\cos(2\pi x_i) + 10 \right)$

Schwefel's function: $f(\mathbf{x}) = -\sum_{i=1}^{n} x_i \sin\left(\sqrt{|x_i|}\right)$

Griewangk's function: $f(\mathbf{x}) = \sum_{i=1}^{n} \frac{x_i^2}{4000} - \prod_{i=1}^{n} \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1$

Ackley's function: $f(\mathbf{x}) = -20\exp\left(-0.2\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} x_i^2}\right) - \exp\left(\tfrac{1}{n}\sum_{i=1}^{n}\cos(2\pi x_i)\right) + 20 + e$

Schaffer's F6: $f(x,y) = 0.5 - \dfrac{\left(\sin\sqrt{x^2+y^2}\right)^2 - 0.5}{\left(1 + 0.001(x^2+y^2)\right)^2}$
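To make the definitions concrete, here is a minimal Python sketch of two of the functions from Table 1 (the sphere and Rastrigin's function), written directly from the formulas above; it is an illustration only and not part of the benchmarking suite described later.

```python
import numpy as np

def sphere(x):
    """De Jong's "sphere": the sum of squared coordinates (minimum 0 at the origin)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))

def rastrigin(x):
    """Rastrigin's function as given in Table 1 (minimum 0 at the origin)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x) + 10.0))

if __name__ == "__main__":
    print(sphere([1.0, -2.0, 0.5]))   # 5.25
    print(rastrigin([0.0, 0.0]))      # 0.0
```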

In this paper we discuss the use of randomised self-similar recursive functions, or "fractal landscapes", as our problem domain. This domain is appealingly familiar and intuitive while raising many difficulties for solvers that generalise to other domains, and is discussed further in Sections 3 and 4.

2.4 A Word on the No Free Lunch Theorem

The No Free Lunch (NFL) Theorem (Igel & Toussaint, 2004; Wolpert & Macready, 1997) states, roughly speaking, that no algorithm performs better than any other when averaged over all fitness functions. More specifically, for any algorithms a and b, however well a performs on a function $f_a$, b will perform equally well on some function $f_b$. It has been shown further that the NFL theorem applies to any subset of functions if and only if that subset is closed under permutation.

The NFL theorem therefore implies that algorithms cannot be benchmarked to find a best general algorithm averaged over all fitness functions. If an algorithm performs well on a set of benchmarking problems, it cannot be claimed that it is better than any other algorithm other than on that set. (Of course this empirical result does not even prove it is better on other unseen problems in that class.)

The NFL theorem, however, takes no account of structure in naturally occurring classes of problems. The theorem assumes a uniform probability distribution over fitness functions, and closure under permutation. It appears that naturally occurring classes of problems are unlikely to be closed under permutation. Igel & Toussaint (2004) show that the fraction of non-empty subsets that are closed under permutation rapidly approaches zero as the cardinality of the search space increases, and that constraints on steepness and number of local minima lead to subsets that are not closed under permutation. (Similar results are provided for non-uniform probability distributions.) Thus for naturally occurring classes of structured problems, one might expect (as is intuitive) that some algorithms generally perform better than others.

In the following section we discuss some of the biases inherent in many traditional benchmarking functions and how they may give an overly optimistic view of performance for some algorithms. Our benchmarking series is designed to avoid these biases as far as possible. Nevertheless, one should bear in mind that performance on any given class of benchmarking problems gives no guarantee of performance on different classes of problems.

3 Biases in Benchmark Functions and the Case for Fractal Landscapes

Many research papers report comparative results of new algorithms or variants on mathematical functions without giving consideration to whether there may be aspects of those functions, or of the way the populations are initialised with respect to them, that affect performance. We refer to these aspects as biases. The landscapes proposed in this paper are designed in an attempt to minimise these biases as far as possible. Before discussing how such landscapes can be efficiently generated or evaluated, it is important to discuss the motivation for choosing the landscapes as benchmarking functions.

We begin with an intuitive discussion of the landscapes, and two of the key trade-offs that occur, in one guise or another, in the design and parameterisation of all population-based and hybrid algorithms respectively: exploration versus exploitation, and population-based search versus local search.



Figure 3. Commonly used benchmark functions: a) sphere, b) Rosenbrock’s function, c) Rastrigin’s function, d) Schwefel’s function, e) and f) Griewangk’s function, large scale and detail, g) and h) Ackley’s function, large scale and detail. Images from the Evolutionary Computation Benchmark Repository EvoCoBR (2007).


Figure 4. A sequence of landscapes generated by our algorithm. These landscapes are 20_101 to 20_106 as described in Section 4.

The relationship between these choices and the problem domain is critical to algorithm performance and is the motivation for the present research. We then attempt to enumerate more specifically some of the biases in traditional functions that we wish to avoid. To illustrate this we will make use of the range of commonly used benchmarking functions from Table 1 and Figures 2 and 3. For contrast, we have shown a short sequence of the proposed landscape functions in Figure 4. In the following we will limit our terminology to minimisation problems, but of course the same applies to maximisation.



Figure 5. a) Idealised representation of a population of candidate solutions solving a bowl function. b) The same bowl function as part of a larger function.

3.1 Why Landscapes?

The problem domain that we have chosen is (minimisation in) automatically generated pseudo-random self-similar surfaces, which resemble natural landscapes. As well as being familiar and intuitively appealing, suitably detailed landscapes exhibit many of the properties that raise difficulties in search and optimisation algorithms. For example, landscapes have a very large (theoretically infinite) number of local minima (they are sometimes referred to as highly multimodal). However these are achieved naturally and at random, unlike elementary functions such as F6. Secondly, landscapes are more complex than functions such as F6 in that detail does not diminish with scale. Natural landscapes are considered to exhibit (or at least approximate) the fractal property of statistical self-similarity. As one zooms in on, say, a mountain range, rather than reaching a scale at which the landscape is smooth, successively more detail reveals itself. This is particularly important in regard to the following two issues that arise in population-based and hybrid algorithms.

3.1.1 Exploration versus Exploitation. One of the most difficult issues in designing optimisation algorithms is managing the balance between exploration, for example "jumping" or "spreading" out of local valleys to search for better regions of the solution space, and exploitation, or convergence of the population towards the local minimum. In many artificial problem domains, knowledge of the scale of the fitness function allows one to inadvertently "cheat oneself" in tuning the parameters of the algorithm.

As a simple illustration consider the "bowl" function shown in Figure 5(a). (Such a function can be found as F1, the "sphere", in De Jong's test suite (De Jong, 1975) and is shown in Figure 3(a).) Any reasonable algorithm can solve this problem quickly and easily, using either local search (e.g. hill-climbing) or population convergence. This is because prior information is (implicitly) provided that the minimum lies within the given bounds, and therefore parameters can be chosen with the sole implication of determining the rate of convergence. If we assume, however, that the bowl could be a small perturbation in a much larger picture, as illustrated in Figure 5(b), the implications of the parameter choice change entirely. Sufficient random or exploratory action is required to jump out of any given valley. This in turn has an impact on convergence. If the available computing resource is, for example, a number of evaluations, we are faced with a very difficult problem of how best to use them.

Examples where the scale or bounds of solutions are not known occur in other practical domains. Let's say, for example, we are using an evolutionary algorithm to find the optimal set of weights for a neural network. Wherever we seed our population, we cannot be certain that the best solution lies within, say, any given hyper-sphere containing the initial population.

3.1.2 Population-based Search versus Local Search. One approach to dealing with the exploration/exploitation problem has been to propose hybrid (or memetic) algorithms combining population-based methods with local search.


Figure 6. Representation of a fitness function with minima at one scale.

Again, when the scale at which the detail occurs is known or bounded, these algorithms can perform above their general potential. To continue the bowl example, consider, say, an egg box (or muffin tray), as illustrated in Figure 6. It would be relatively easy to parameterise a hybrid algorithm so that the population operators search "across" the egg cups (primarily exploration), while local search is used to rapidly find minima within the cups (primarily exploitation). This may yield excellent results for this problem, but perform extremely poorly when the number of minima ("cups") within each cup (and in turn within those cups) is increased: suddenly the scales at which the population-based and local search are operating are inappropriate. While this is a simplified example, note that most of the functions in Figures 2 and 3 have this general form.

3.2 Biases in Benchmark Functions

We now attempt to enumerate the biases referred to above in more detail.

3.2.1 Initialisation Bias (Central Bias). In population-based optimisation algorithms, the initial population is typically located using a random distribution. Initialisation bias occurs when the distribution of the initial population has a predictably beneficial (or detrimental) effect on performance. Expanding on the example discussed above, if a population is initialised on the sphere function in Figure 3(a) according to a uniform probability distribution over the bounds shown, it is likely there will be some individuals roughly either side of the minimum. Any algorithm that converges through some "averaging" mechanism (such as crossover in a suitable EA encoding) or "attraction" mechanism (such as acceleration towards personal bests in a PSO) may achieve the minimum much faster than if it were initialised, say, in one quadrant. Note that the same large-scale structure is found in all functions in Figure 3 with the exception of Rosenbrock's and Schwefel's functions.

A series of fractal landscapes for benchmarking can be used to help mitigate initialisation bias in three ways. First, the location of the global minimum is, at least theoretically, unknown (in fact theoretically there is no minimum to the fractal, although in the finite implementation there is a minimum determined by the resolution of the computer; see Section 4). Secondly, using a series of landscapes with differently located minima in each helps to mitigate the effect. For example, if all algorithms were forced to initialise in a predetermined quadrant, it would contain the minimum in some cases and not others. Thirdly, taking this idea a step further, algorithms can be forced to initialise at a scale unknown to the algorithm. Because of the self-similarity property, the algorithm has no way of knowing whether the minimum is contained within the initialisation bounds or not.

3.2.2 Axial and Directional Bias. Many mathematical functions used for benchmarking exhibit some alignment in their structure, in particular in the valleys containing local minima. For example, all functions in Figure 3 other than the first two show a bias in the axial directions. This will have an impact on the efficacy of any algorithm that has a similar bias in its operators. For example, it has been shown by Czarn et al. (2007), Salomon (1996), and others that traditional bitstring-encoded genetic algorithms exhibit improved performance on functions with axial biases.


The random fractal landscapes exhibit no systematic directional bias.

3.2.3 Decomposability. Closely related to the issue of axial bias is the property of decomposability, also referred to as linear separability. Decomposable functions are those that can be expressed as a sum of functions of the individual parameters. That is:

\[ f(\mathbf{x}) = \sum_{i=1}^{n} f_i(x_i). \qquad (1) \]

It has been argued (Czarn et al., 2007; Salomon, 1996), for example, that these functions are considerably easier for GAs to solve than non-decomposable (non-linearly-separable) functions, such as the same functions rotated. Salomon (1996) claims a complexity of O(n) for decomposable functions, and generalises this result to functions that satisfy the condition:

\[ \frac{\partial f(\mathbf{x})}{\partial x_i} = g(x_i)\, h(\mathbf{x}) \qquad (2) \]

for some functions g and h. The fractal landscapes are not decomposable in this sense.
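To make the distinction concrete, the following sketch (a minimal Python illustration, not code from the paper) contrasts the decomposable Rastrigin function of Table 1 with a rotated variant of the kind Salomon (1996) considers; rotating the coordinate system leaves the landscape and its minima intact but in general destroys the sum-of-one-dimensional-terms structure of equation (1).

```python
import numpy as np

def rastrigin(x):
    # Decomposable: a sum of one-dimensional terms f_i(x_i), as in Table 1.
    x = np.asarray(x, dtype=float)
    return float(np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x) + 10.0))

def rotated(f, theta=np.pi / 6.0):
    # The rotated function x -> f(Rx) is generally no longer a sum of
    # per-coordinate terms, i.e. it is no longer decomposable.
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return lambda x: f(R @ np.asarray(x, dtype=float))

if __name__ == "__main__":
    g = rotated(rastrigin)
    print(rastrigin([0.0, 0.0]), g([0.0, 0.0]))  # both 0.0 at the (shared) minimum
```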

3.2.4 Rotational Invariance. Some mathematical functions, such as Schaffer's F6 function shown in Figure 2, exhibit rotational symmetry. Again one can imagine, for example, that it is easy for a population to spread in a circular valley with a horizontal floor around the global minimum, and then make a jump to inner rings using attractive or averaging mechanisms as discussed under central biases above. The random fractal landscapes do not exhibit rotational symmetry.

3.2.5 Regularity. As can be seen in Figure 3, many elementary benchmark functions have local minima spread in regular patterns. Again this can impact performance. It has been shown in Czarn et al. (2007), for example, that the performance of a bitstring GA can depend on the relationship between the binary encoding and the distance between the local minima in a problem instance. Roughly speaking, some algorithms can use regularly spaced minima as "staging posts" on the way to a global minimum. The random fractal landscapes avoid this regularity.

3.2.6 Scale Bias. In Subsection 3.1 we outlined the lack of challenge for hybrid algorithms that can come from all local minima occurring at roughly the same scale. This is a significant issue for the comparative evaluation of evolutionary and allied algorithms, traditional algorithms, and hybrids of the two, and is the primary motivation for this work. Indeed we see multi-scale optimisation as the next big challenge in single-objective optimisation after multimodal optimisation. Note that all of the functions illustrated in Figures 2 and 3 exhibit the characteristic of reduced detail at smaller scales. This is illustrated in Figure 7(a) for Schaffer's F6. By contrast the landscape in Figure 7(b) shows no dilution of detail at smaller scale. Since we seek algorithms that do well in general rather than on more specific problems, we wish to use problems in which the level of detail is maintained as scale decreases (and indeed as scale increases, if the algorithm is initialised at an arbitrary scale). Algorithms that do well on these problems should generalise easily to problems with more limited scale. However the reverse does not hold.


Figure 7. a) Schaffer's F6 at one tenth and one hundredth of the scale in Figure 2, showing diminishing detail. b) A cross section of a landscape showing blown-up detail at one tenth scale. (Further magnifications are shown in Figure 9.)

4 Generating the Fractal Landscapes

Having motivated the use of the randomised self-similar landscapes, we now turn to the issues of defining and computing (evaluating points on) the landscapes.

4.1 Midpoint Displacement Algorithms

The most widely used algorithms for random fractal landscape generation are midpoint displacement algorithms. These are iterative algorithms that successively randomly perturb a shape at finer and finer scales. In the 2-dimensional case the midpoint of each line segment is found and perturbed up or down by a random amount, creating twice as many line segments for the next iteration. Similarly, in the 3-dimensional case the midpoint of a square is perturbed, creating four new "squares".

This approach has some major drawbacks from our point of view. From a conceptual point of view it bears little relationship to any natural geological process. This becomes more significant when we consider the "ageing" of the landscape. In our algorithm the landscape naturally ages as the algorithm is run. The algorithm can be terminated at any desired age to give surfaces with different characteristics, each at the full resolution available on the computer. With the midpoint displacement algorithm, however, the only full-resolution surface available is the one where the algorithm is run to its practical limit. Stopping the algorithm earlier means reduced resolution, or calculating midpoints by linear interpolation.

A more important pragmatic problem, however, is that of storage. The number of midpoints that must be stored increases exponentially with the depth of iteration. In the 2-dimensional case the number of midpoints increases with $2^n$, while in the 3-dimensional case it increases with $2^{2n}$. This may not be a significant problem if landscapes are being generated for human viewing, such as in a film, since the resolution required for the human eye is quite low.
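As a concrete illustration of the storage blow-up discussed above, here is a minimal Python sketch of midpoint displacement for the 2-dimensional (cross-section) case; it is illustrative only and not part of the benchmarking suite. The number of stored heights grows as $2^n + 1$ with the number of iterations n.

```python
import random

def midpoint_displacement(iterations, roughness=0.5, seed=0):
    """Cross-section midpoint displacement: start with a flat segment and
    repeatedly perturb the midpoint of every segment by a random amount
    whose magnitude shrinks at each finer scale."""
    rng = random.Random(seed)
    heights = [0.0, 0.0]             # endpoints of the initial segment
    amplitude = 1.0
    for _ in range(iterations):
        refined = []
        for left, right in zip(heights, heights[1:]):
            mid = (left + right) / 2.0 + rng.uniform(-amplitude, amplitude)
            refined.extend([left, mid])
        refined.append(heights[-1])
        heights = refined            # storage doubles every iteration: 2**n + 1 points
        amplitude *= roughness
    return heights

if __name__ == "__main__":
    profile = midpoint_displacement(10)
    print(len(profile))              # 1025 stored heights after 10 iterations
```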


Figure 8. Surfaces 5_2, 10_2 and 20_2 showing successive maturation of the surface by meteor impact. Each landscape is twice as old as its predecessor.

However, if we wish to generate fractal landscapes with detail to the limit of the resolution that can be achieved on a standard computer (which we take to be 64-bit IEEE floating point), the files describing the landscapes would be far too large to store. Therefore this approach cannot be used.

4.2 Our Approach: The Crater Algorithm

Our approach to generating landscapes is based on a (very) simplified model of the natural process of meteors impacting a planetary surface. We assume each surface is initially smooth, and each meteor that impacts it superimposes a crater the size of the bottom half of the meteor. (In the current work we assume that the meteors are also round, but there are plans to extend the algorithm to allow a choice of profiles.) The meteor impacts must be randomly distributed, subject to the constraint that the probability of an impact is related to the meteor size in such a way as to generate self-similarity (see below). The surface "ages" or matures as it receives more and more impacts, as illustrated in Figure 8.

The first surface in Figure 8 is relatively young, and the larger of the individual craters can still be clearly seen. The next surface is twice the age of the first, and the compound effect of the meteors can be seen. The final surface shown is twice the age of the second, and shows even to the naked eye quite a rough surface. In the following subsections we discuss the main considerations needed to design and implement this algorithm.
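As a minimal illustration of the superposition step (the paper specifies only that a round meteor leaves a crater the size of its bottom half, so the hemispherical profile below is an assumption, and the function names are hypothetical), the depth contribution of a single meteor at a probed point might look like this:

```python
import math

def crater_depth(px, py, cx, cy, r):
    """Depth removed at probe point (px, py) by a round meteor of radius r
    centred at (cx, cy), assuming the crater is the meteor's lower hemisphere
    superimposed on the surface."""
    d2 = (px - cx) ** 2 + (py - cy) ** 2
    if d2 >= r * r:
        return 0.0                   # probe point lies outside the crater rim
    return math.sqrt(r * r - d2)     # depth of the hemisphere at this offset

def surface_height(px, py, meteors):
    """Height of an initially flat surface after superimposing all craters."""
    return -sum(crater_depth(px, py, cx, cy, r) for cx, cy, r in meteors)
```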


Table 2. Increasing crater numbers with reducing scale, where the average number of craters per order-of-magnitude square, n, determines the age of the landscape.

Scale (or resolution) | Crater size range | Size of square with mean n craters | Mean number of craters per square | Mean total number of craters at this scale
1      | 0.1 – 1        | 1 × 1            | n | n
0.1    | 0.01 – 0.1     | 0.1 × 0.1        | n | 100n
0.01   | 0.001 – 0.01   | 0.01 × 0.01      | n | 10000n
10^-3  | 10^-4 – 10^-3  | 10^-3 × 10^-3    | n | 10^6 n
...    | ...            | ...              | ... | ...

4.2.1 A Suite of Landscapes. Using a single landscape for benchmarking would reward optimisers that are overly specific (in choice of parameters) and may generalise poorly to other landscapes (and other domains). In other words, it would encourage users to overtrain their algorithms. Our approach is to generate a large suite of landscapes. A sequence of training landscapes is provided on which researchers can hone their algorithms, but the benchmarking itself is carried out across a range of unseen landscapes. As we will see, we never actually store the points in a landscape (and in fact it would be impossible to do so) but rather devise a repeatable way of evaluating any requested landscape at any given co-ordinate point at the time it is requested. In order to achieve this, each point in a landscape must be uniquely determined from a unique identifier for that landscape. We use as the identifier a sequence number, which in turn is used to generate a unique random seed. As discussed below, this approach gives us access to over 2 billion different landscapes (for any given age).

4.2.2 Randomness and Seeds. To obtain the diversity required, each landscape must have a random component, but to ensure repeatability, every point must be deterministically tied to the descriptor for that landscape. In practical terms that means each point must be tied to a unique pseudorandom sequence. To ensure platform and language independence, rather than relying on a particular language's pseudorandom generator, we use a portable multiplicative congruential pseudorandom generator (Press et al., 1992). The generator has a modulus of $2^{31} - 1$, providing over 2 billion independent landscapes. We wish to tie the landscape to a sequential integer index or "seed" for the landscape. Unfortunately sequential seeds produce related random deviates in multiplicative congruential algorithms. We therefore begin by hashing the index: we convert it to a hex string and reverse the string so that the least significant digits have greatest effect, then convert back to an integer to form the true seed for the pseudorandom sequence. This results in very different pseudorandom sequences, and hence very different landscapes, from small changes in index value, as illustrated in Figure 4.
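A minimal Python sketch of this seeding scheme follows. The hex-reverse hashing is taken from the description above; the generator shown is the "minimal standard" Park–Miller multiplicative congruential generator with modulus $2^{31}-1$ (one of the portable generators given in Press et al., 1992), and the particular multiplier used by the Huygens suite is an assumption here.

```python
MODULUS = 2**31 - 1          # 2147483647
MULTIPLIER = 16807           # Park-Miller "minimal standard" multiplier (assumed)

def hash_index(index):
    """Hash a sequential landscape index into a seed by reversing its hex
    representation, so small index changes give very different seeds."""
    reversed_hex = format(index, "x")[::-1]
    return int(reversed_hex, 16) % MODULUS or 1   # seed must be non-zero

def lcg(seed):
    """Multiplicative congruential generator yielding deviates in (0, 1)."""
    state = seed
    while True:
        state = (MULTIPLIER * state) % MODULUS
        yield state / MODULUS

if __name__ == "__main__":
    for index in (101, 102, 103):
        stream = lcg(hash_index(index))
        print(index, [round(next(stream), 4) for _ in range(3)])
```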

4.2.3 Crater Density and Scale. While the x, y positions of craters can be chosen, using the random deviates, from a uniform probability distribution, the number of meteors of each size must be chosen so as to preserve statistical self-similarity. One way to put this is that for any order-of-magnitude square on the surface, we would expect, on average, the same number of craters of that order of magnitude centred within it. This average number of craters per order of magnitude is dependent on the age of the landscape. This is illustrated in Table 2. Each landscape is uniquely determined by its age (mean number of craters per order-of-magnitude square) and index (seed). We will therefore use the notation n_i to specify the landscape with age n and index i. Thus, for example, landscape 20_103 in Figure 4 is generated from index 103 with 20 craters per order-of-magnitude square.

4.3 Implementation

While there would appear to be a number of ways to implement the crater algorithm, these are severely constrained by the resolution we wish to achieve. We describe below the issues that arise in a number of these, motivating the approach taken.


4.3.1 Storing the Landscape. Perhaps the most obvious approach is to generate landscapes by storing the depth of the surface (initially zero) and simulating the impact of the meteors by deducting the meteor dimension from the depth. This would allow (some subset of) the landscapes to be generated and stored in advance, and therefore "probes" or evaluations of the surface to be returned very quickly. The obvious problem with this approach, as with the midpoint displacement algorithms, is that the storage space increases with resolution. Given that we wish to represent detail to the limit of the 64-bit IEEE floating point standard, the arrays of points would be many orders of magnitude too large to store.

4.3.2 Storing the Meteors (or Craters). An alternative approach is to store the meteors, that is, their centres and radii, and regenerate the landscape as needed. The benchmarking process only requires evaluation at specific points (x, y co-ordinates), so it is not necessary to generate an entire landscape (except for viewing purposes). Each time we wish to evaluate a co-ordinate point, we would cycle through all the meteors and determine the effect each one has on that point. This approach has problems with both storage and running (evaluation) time. It can be seen from Table 2 that the total number of meteors increases with the inverse square of the scale. Thus for a resolution of $10^{12}$ we would need to store more than $10^{24}$ meteors, and cycle through these each time a point is evaluated. Clearly this is impractical in both time and space.
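To make this estimate explicit, the per-scale counts in Table 2 can be summed (a short derivation consistent with Table 2, where n is the landscape's age): down to squares of side $10^{-m}$ the expected total number of meteors is

\[ N(m) \;=\; \sum_{k=0}^{m} n\,10^{2k} \;=\; n\,\frac{10^{2(m+1)} - 1}{99} \;\approx\; n \cdot 10^{2m}, \]

so resolving detail to a scale of $10^{-12}$ implies of the order of $10^{24}$ meteors even for modest ages n.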

4.3.3 Regenerating the Meteors. As mentioned earlier, we would like each landscape to be generated from a single random seed (to a given age). This requires that we are able to regenerate the meteors from the seed in a deterministic sequence. It is therefore possible to overcome the storage problems referred to above: we simply record the seed and regenerate the meteors on each evaluation. The seed and generation algorithm can be regarded as an implicit encoding of the landscape. While this overcomes the storage problem, it clearly exacerbates the execution time problem, and again is entirely impractical.

4.3.4 A Recursive Approach. While the number of meteors increases rapidly with decreasing scale, only a small proportion can impact a particular location being evaluated. As mentioned earlier, the average number of meteors at a given scale centred in a square of that scale is constant. A given location can only be impacted by the meteors centred in the square in which it lies and its eight neighbours. This number of meteors increases only linearly with the exponent of the resolution. If we can find a way of generating only these craters then we can evaluate any point in a practical amount of time.

This is the approach taken in the benchmarking software. The trick is to associate a recursively (and deterministically) generated new seed with each square at each scale. That is, the seed associated with a square at scale $10^{-(m+1)}$ is determined by its parent seed at scale $10^{-m}$, along with its relative position within that parent square. Once the recursive algorithm has generated the seed for a square at a given scale, the meteors at that scale whose centres fall in that square can be generated from the pseudorandom sequence. This is done using an exponential probability distribution, so that within each square small meteors are appropriately more abundant than larger ones. The algorithm then determines which square at the next scale down the point lies in, and calls itself recursively on this square and its eight neighbours. In practice this can be continued until at least a scale of $10^{-12}$, at which point the resolution of 64-bit floating point numbers becomes insufficient. Note that at the boundary of the unit square landscape we consider the neighbouring squares to be those at the opposite edge, thus causing the landscape to wrap in both x and y directions.
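The following Python sketch shows the shape of such a recursive evaluation. It follows the structure described above (a deterministic seed for each square at each scale derived from its parent square's seed, craters generated for the containing square and its eight neighbours, wrapping at the unit-square boundary), but the seed-mixing function, the crater-size distribution, the use of exactly n craters per square, and the hemispherical crater profile are simplifying assumptions for illustration; this is not the suite's actual code.

```python
import math
import random

AGE = 20          # mean number of craters per order-of-magnitude square ("age" n)
MAX_DEPTH = 12    # evaluate down to squares of side 10**-MAX_DEPTH
MODULUS = 2**31 - 1

def child_seed(parent_seed, col_in_parent, row_in_parent):
    # Hypothetical mixing function: the scheme only requires that a child square's
    # seed be determined by its parent's seed and its position within the parent.
    return (parent_seed * 1_000_003 + col_in_parent * 10 + row_in_parent + 1) % MODULUS or 1

def square_seed(landscape_seed, depth, col, row):
    """Seed of square (col, row) on the 10**depth by 10**depth grid, derived
    recursively from the seed of the parent square that contains it."""
    if depth == 0:
        return landscape_seed
    parent = square_seed(landscape_seed, depth - 1, col // 10, row // 10)
    return child_seed(parent, col % 10, row % 10)

def evaluate(px, py, landscape_seed):
    """Surface height at (px, py) in the unit square: the (negative) accumulated
    crater depth over all scales from 10**0 down to 10**-MAX_DEPTH."""
    total = 0.0
    for depth in range(MAX_DEPTH + 1):
        cells = 10 ** depth
        side = 1.0 / cells
        col = min(int(px * cells), cells - 1)
        row = min(int(py * cells), cells - 1)
        # the containing square and its eight neighbours, wrapped at the boundary
        neighbours = {((col + dc) % cells, (row + dr) % cells)
                      for dc in (-1, 0, 1) for dr in (-1, 0, 1)}
        for ncol, nrow in neighbours:
            rng = random.Random(square_seed(landscape_seed, depth, ncol, nrow))
            for _ in range(AGE):
                # crater centre uniform within the square; radius spans one order of
                # magnitude, with smaller craters more abundant (log-uniform here)
                cx = (ncol + rng.random()) * side
                cy = (nrow + rng.random()) * side
                r = side * 10.0 ** (-rng.random())
                # wrapped (toroidal) displacement from the probe point to the centre
                dx = px - cx
                dx -= round(dx)
                dy = py - cy
                dy -= round(dy)
                d2 = dx * dx + dy * dy
                if d2 < r * r:
                    total -= math.sqrt(r * r - d2)   # hemispherical crater profile
    return total

if __name__ == "__main__":
    # Probes of the same landscape are repeatable; a different seed gives a
    # different landscape.
    print(evaluate(0.3, 0.7, landscape_seed=12345))
    print(evaluate(0.3, 0.7, landscape_seed=12345))
```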

4.4 Results

The landscapes generated by the crater algorithm can all be evaluated to a scale of $10^{-12}$. No storage is required, as evaluation is done recursively from a single seed. Evaluation time is linear in (the absolute value of) the scale exponent, so the full resolution of the computer is available while execution time remains fast (approximately 2.5 ms on a 2.5 GHz Power Mac G5).


Figure 9. A one-dimensional slice of landscape 20_101 shown at magnifications of $10^0$ to $10^5$. The detail at the minimum of each plot is shown in the subsequent plot.

Thus the technical goals have been met. We believe the landscapes to be challenging, intuitive, and visually convincing, though we leave the latter judgement to the reader, whom we invite to view a larger range of sample landscapes at the website (MacNish, 2007).

Finally, to illustrate that the fractal self-similarity property is satisfied, we have included in Figure 9 six cross sections of landscape 20_101 from Figure 4 at successively smaller scales. Both the x and z (depth) scales are reduced by a factor of 10 in each successive picture. Notice that there is no systematic change in the amount of detail (number of local minima, height of peaks, etc.) as one progresses through the sequence. Indeed the images could be shuffled into any order without looking out of place.

4.5 What's in a Name?

The benchmarking suite in this paper has been named the Huygens Suite after the Huygens Probe, which recently made a successful landing on Saturn's moon Titan. (The probe is in turn named after Christiaan Huygens, the astronomer who discovered Titan.) The analogy with the crater-based landscapes developed in this paper is obvious. However, the choice of Huygens (rather than, say, Titan) is more specifically motivated.

Firstly, when developing a benchmarking methodology, the computing resource to be optimised must be chosen. In evolutionary algorithms a wide variety of measures have been used. Examples range from the number of epochs or iterations taken to reach the global minimum (which in our case is unknown) to within a given accuracy, the best value achieved in a given number of evaluations, and average population fitness, through to environment-dependent measures such as CPU time. In order to allow comparison of a very broad range of algorithms, including hybrid and traditional search algorithms, we have steered clear of paradigm-specific concepts such as epochs. Similarly, to support multiple computing environments we have steered clear of environment-dependent measures. We have taken the view that in most practical applications (as with the Huygens Probe) the expensive operation is evaluating the fitness at a given location. We therefore allow each algorithm a fixed number of evaluations, or probes, for each landscape to produce its best solution. (Further, in the non-training case, the algorithm does not know which landscape it is solving at any given time, so probe information cannot be accrued across multiple attempts.)


Figure 10. Architecture of the Huygens Benchmark Server

The second analogy is that the probes are sent to the surface from a remote location and the data sent back, as described in the following section.

5 The Huygens Server

Now that we have addressed the content of the benchmarking suite, we briefly address the question of making it readily accessible to all users. The benchmark server is designed to satisfy the following principles:

(i) Access to the benchmarking software and results should be freely available through the web.
(ii) The user should be able to develop their algorithm in the programming language of their choice, and run it on their platform of choice.
(iii) The user should be free of any need to understand (or even see) the benchmarking code.
(iv) The user should be able to initiate the benchmarking process. (For example, they should not have to submit an algorithm and rely on another human at the "other end" to run it.)

The primary design decisions that allow these principles to be satisfied are the separation of the user and benchmarking (server) environments, and communication via the language-independent SOAP protocol (W3C XML Protocol Working Group, 2006) for method invocation. The user's algorithm runs in his or her own environment, and "plugs into" the SOAP client. This structure is illustrated in Figure 10.
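As an illustration of what "plugging in" might look like on the user's side, the sketch below drives a hypothetical probe-based SOAP interface from Python using the zeep library. The WSDL address and the operation names (startRun, probe) and their parameters are invented for illustration; the actual Huygens interface is documented on the server (MacNish, 2007).

```python
import random
import zeep  # third-party SOAP client: pip install zeep

# Hypothetical WSDL location and operations; consult the Huygens server
# documentation for the real interface.
client = zeep.Client("http://ai.csse.uwa.edu.au/cara/huygens/service?wsdl")

def random_search(probes=1000):
    """A deliberately naive optimiser: spend the probe budget on uniformly
    random points in the unit square and keep the best depth seen."""
    token = client.service.startRun(algorithm="random-search", user="demo")
    best = float("inf")
    for _ in range(probes):
        x, y = random.random(), random.random()
        depth = client.service.probe(token, x, y)   # one fitness evaluation
        best = min(best, depth)
    return best

if __name__ == "__main__":
    print("best depth found:", random_search())
```

The key design point is that the optimiser itself stays entirely in the user's own environment and language; only the fitness evaluations cross the SOAP boundary.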

5.1 Comparing Algorithms

For the purposes of training or fine-tuning algorithms, the server allows unlimited access to a series of training surfaces. For benchmarking, however, some measure of computing resource consumed is required for comparison. As discussed in the previous section, to level the playing field between population-based, traditional and hybrid algorithms, we use a fixed number of evaluations, which the algorithm may use in any way it sees fit. We also need to make a decision on how to score algorithms. Because of the fractal nature of the landscapes, the differences in the raw minima achieved will decrease exponentially as improvements are made at smaller scales. Therefore raw minima (with more and more decimal places) do not make a particularly good indicator for human consumption. Instead we wish to "stretch" those differences as the scale decreases. We achieve this by mapping them onto an exponential function.
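Purely as an illustration of the kind of mapping meant here (the server's actual scoring formula is not reproduced in this paper), a score of the form

\[ s(m) \;=\; -\log_{10}\!\left(m - m^{*}\right), \qquad \text{equivalently} \qquad m - m^{*} = 10^{-s}, \]

where m is the raw minimum returned and $m^{*}$ is some reference value, has the desired stretching property: each additional order of magnitude of improvement in the raw minimum adds a constant amount to the score.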


6 Conclusion and Future Directions

While arguments such as the No Free Lunch Theorem mean that it will never be possible to identify the "universal best" general-purpose optimisation algorithm, it is nevertheless vital to develop effective and widely accessible means of comparing algorithms. This paper presents one attempt at such a system. We have argued that the problem domain chosen is intuitively simple, naturally appealing, and challengingly complex, while avoiding many of the biases inherent in traditionally used functions. We have also shown how a range of technical issues can be overcome to provide a series of efficient, high-resolution fitness functions. Finally, we have presented an architecture for accessing the system that overcomes problems of language and environment incompatibilities. The system allows the user to "plug in" their algorithm and initiate automated benchmarking and subsequent scoring.

In the future we plan to generalise the algorithm in a number of ways. The server was used for a static optimisation competition at the 2006 Congress on Evolutionary Computation. For the 2007 Congress we plan to hold a dynamic optimisation competition in which algorithms seek to track the minimum as the landscape ages. We also plan to generalise the algorithm to allow a choice of impact shape, and to develop higher-dimensional versions. These parameters will allow users to choose the characteristics and level of difficulty they are interested in. Finally, we plan to open-source the algorithm for the test series in a number of languages to make it easier for users to test their algorithms. We hope that this system will provide a valuable resource to the evolutionary algorithms research community.

References

A. Czarn, et al., "Statistical Exploratory Analysis of Genetic Algorithms: The Detrimentality of Crossover", 2007, under review.
K. A. De Jong, An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Ph.D. thesis, University of Michigan, Ann Arbor, 1975.
J. Digalakis & K. Margaritis, "An Experimental Study of Benchmarking Functions for Genetic Algorithms", International Journal of Computer Mathematics, 79, pp. 403–416, 2002.
EvoCoBR, "Evolutionary Computation Benchmark Repository", http://www.cs.bham.ac.uk/research/projects/ecb/, 2007.
C. Igel & M. Toussaint, "A No-Free-Lunch Theorem for Non-Uniform Distributions of Target Functions", Journal of Mathematical Modelling and Algorithms, 3, pp. 313–322, 2004.
C. MacNish, "Benchmarking Evolutionary and Hybrid Algorithms Using Randomized Self-similar Landscapes", in T.-D. Wang, X. Li, S.-H. Chen, X. Wang, H. A. Abbass, H. Iba, G. Chen, & X. Yao (eds.), Simulated Evolution and Learning, 6th International Conference, SEAL 2006, Hefei, China, October 15-18, 2006, Proceedings, vol. 4247 of Lecture Notes in Computer Science, pp. 361–368, Springer, 2006.
C. MacNish, "Huygens Benchmarking Suite", http://ai.csse.uwa.edu.au/cara/huygens, 2007.
W. H. Press, et al., Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, 1992.
R. Salomon, "Re-evaluating Genetic Algorithm Performance under Coordinate Rotation of Benchmark Functions", BioSystems, 39, pp. 263–278, 1996.
J. D. Schaffer, et al., "A Study of Control Parameters Affecting Online Performance of Genetic Algorithms for Function Optimization", in J. D. Schaffer (ed.), Proceedings of the 3rd International Conference on Genetic Algorithms, George Mason University, pp. 51–60, Morgan Kaufmann, 1989.
W. M. Spears, "Genetic Algorithms (Evolutionary Algorithms): Repository of Test Functions", http://www.cs.uwyo.edu/~wspears/functs.html, 2007.
W. M. Spears & M. A. Potter, "Genetic Algorithms (Evolutionary Algorithms): Repository of Test Problem Generators", http://www.cs.uwyo.edu/~wspears/generators.html, 2007.
P. N. Suganthan, "Recent Advances in Real Parameter Optimization", tutorial presented at the 6th Int. Conf. on Simulated Evolution and Learning (SEAL 2006), Hefei, China, 2006.


W3C XML Protocol Working Group, "SOAP Version 1.2", http://www.w3.org/TR/soap12-part0/, 2006.
D. H. Wolpert & W. G. Macready, "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation, 1, pp. 67–82, 1997.