
AN EVOLUTIONARY COMPUTATION APPROACH FOR DEVELOPING AND OPTIMIZING DISCRETE-TIME FORECASTING AND CLASSIFICATION MODELS

by Gregory A. Dorais

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in the University of Michigan 1997

Doctoral Committee:
Professor Keki B. Irani, Chairman
Associate Professor Edmund H. Durfee
Professor John H. Holland
Associate Professor John E. Laird
Associate Professor Robert K. Lindsay

ABSTRACT

AN EVOLUTIONARY COMPUTATION APPROACH FOR DEVELOPING AND OPTIMIZING DISCRETE-TIME FORECASTING AND CLASSIFICATION MODELS

by Gregory A. Dorais

Chair: Keki B. Irani

CAFECS (Classification and Forecasting Evolutionary Computation System), a hybrid of genetic algorithms and Evolution Strategies, is introduced. Its purpose is to develop and optimize discrete-time forecasting and classification models, particularly in highly nonlinear domains. Models are defined as sets of functions and are evaluated by a user-specified model criterion function that defines the extent to which one model is “better” than another. Model development is accomplished by searching a well-defined model domain by “evolving” a set of model domain subsets called colonies.

Principles derived from evolutionary theory and the system’s distributed architecture are used to increase the system’s performance. The role of natural selection in evolution is examined with an emphasis on the importance of cooperation. CAFECS permits a limited level of communication between colonies including the crossbreeding of models. The effects on system performance of different inter-colony communication strategies are examined. The gains obtained may be explained in part by Sewall Wright’s shifting balance theory and the principle of sociogenesis.

Unlike other machine learning approaches, CAFECS is a data mining approach designed to work in conjunction with the user and apply user knowledge instead of exclusively deriving the models from relationships induced from data. However, this user knowledge is not required. The user can restrict the model domain or focus the search in areas which appear more promising and may improve system performance. Also, at any point in the development process, the user can enter new models that the user desires to test or has reason to believe may be useful. These models are then used by CAFECS in an attempt to create better models.

Among the tests performed, it was shown how CAFECS can be used to create hybrid models and how system performance was improved by evolving separate model colonies and permitting a limited exchange of models between the colonies. Also, it was shown that CAFECS can be used to discover attributes that enabled a decision tree induction algorithm to create a decision tree for classifying the terrain shown in a satellite image that is more accurate and smaller than normally obtainable.

© Gregory A. Dorais All Rights Reserved

1997

To my mother and to the memory of my father


ACKNOWLEDGEMENTS

If I have seen further it is by standing on the shoulders of Giants. ––Isaac Newton

Looking back and considering all the people that touched my life in a way that helped bring this work to fruition, I am overwhelmed and recognize that I cannot possibly express my gratitude adequately. Nevertheless, I will try.

I begin by acknowledging the efforts of the members of my thesis committee. Their expertise and support were invaluable. It was an honor and a privilege to work with them over the years. When I consider each of their contributions, I am reminded of the quote by Newton cited above. The time and energy required to guide a doctoral student through a successful defense are considerable. For these sacrifices, I am truly grateful.

First and foremost, my thanks go to my thesis advisor and the committee chair, Keki Irani. His guidance and support were steadfast throughout my entire time at the University of Michigan. From the very start, he saw to it that I was intellectually challenged by various research projects, yet he always allowed me the freedom to explore my own areas of interest. Just as a coach transforms an undisciplined raw recruit into a competitive athlete, Keki’s continual insistence on scientific rigor honed a mental discipline in me that I had not even realized I was sorely lacking. I remain in awe of his insight and willingness to meticulously and tirelessly read more drafts of various portions of this and other of my works than I care to remember. It was by his efforts and recommendation that I received the funding that made this research possible.

Another person whose efforts were absolutely essential in making this work possible is John Holland. I had the good fortune of meeting John when I took his class, “Complex Adaptive Systems,” as a first-year graduate student. At the beginning of the class, I was unprepared for the profound effect it would have on my life. As I studied the concepts he presented, my understanding of how the world works underwent a paradigm shift. Just as understanding the motion of the planets becomes much simpler when the Earth is viewed as circling the sun instead of vice versa, so too my understanding of complex adaptive systems, ubiquitous in nature, became much clearer with the concepts he presented. Oliver Wendell Holmes’ quote, “Man’s mind stretched to a new idea never goes back to its original dimension,” certainly seems appropriate in this case. Many of John’s concepts laid the foundation upon which this work was based and guided its development.

I did not have the pleasure of taking a course from any of the other members of the committee, but I expect I would have enjoyed them immensely. The influence of John Laird’s work on cognitive systems and Edmund Durfee’s work on multiple agents on this work may not be readily apparent, but it runs throughout. It was in considering how to make a cognitive system, like SOAR, interact with the complex real world that I began to formulate ideas that led to a system that evolves models. I also enjoyed my discussions with Robert Lindsay, whose interest in biological complex adaptive systems, among others, I share. I hope our fortuitous meetings at a variety of conferences and seminars continue.


Other faculty at Michigan that I especially wish to thank for courses that were fascinating and provocative are K. P. Unnikrishnan for his Neural Computation course, Daniel Koditschek for his Robotics course, and Steve Kaplan for his Neural Models course.

I also thank Erann Gat and Rajiv Desai for a challenging internship at the NASA Jet Propulsion Laboratory as well as Robert Stephens for an interesting project at General Motors Research. At both of these institutions, the work was thought-provoking and it was very rewarding to apply AI technologies to solve real-world problems. In addition, I gratefully acknowledge my discussions with Richard Lacoss of the MIT Lincoln Laboratory concerning applying the approach described in this dissertation to Landsat images and for pointing me toward the Landsat image data used in experiment 8. I also thank Marijke Augusteijn of the University of Colorado for making the data available.

My fellow students and the U of M Advanced Technology Lab staff were immensely helpful in my research. In particular, I thank David Kortenkamp, Clint Bidlack, and Arun Hampapur for introducing me to the wonderful world of autonomous robots. Also, the friendship and intellectual stimulation provided by my office mates Matt Hankinson and Neal Rothleder were greatly appreciated. I will miss them.

For the financial support that helped make my education at U of M possible, I thank IPoCSE (Industrial Partners of Computer Science and Engineering), and in particular, John Joyce of General Motors Research. John is the type of manager it is a pleasure to go that “extra mile” for. I also am grateful for the DeVlieg Fellowship in Manufacturing. The support of higher education from industry is invaluable and is greatly appreciated.


Finally, I thank my family who have given so much for so long. My mother and father instilled in me the value of education and a perpetual desire to learn. Even though resources were scarce, e.g., the infamous “powdered milk,” sacrificing education was never an option to them. Even while my father worked to support our family, he continued his Ph.D. at Michigan until his failing health would no longer permit it. His dedication inspired me; he is sorely missed. The continued companionship of my sister Mary, and brothers Michael and fellow U of M alum Christopher is also appreciated. I also thank my children Timothy, Julie, Stephanie, and Jonathan. The joy they bring to me is immeasurable. Most of all, I thank my wife Kyunghee whose devotion and support are unfailing and the Creator whose universe we are just beginning to understand.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER
1 Introduction
    1.1 System Modeling
    1.2 Statement of Research Goals
    1.3 Formal Definition of the Problem
        1.3.1 Observation Data Set
        1.3.2 Node
        1.3.3 Model
        1.3.4 Model Criterion Function
        1.3.5 Formal Problem
2 Background Information and Related Research
    2.1 System Modeling Related Research
    2.2 Evolutionary Theory
    2.3 Evolutionary Computation
        2.3.1 Representation
        2.3.2 Fitness
        2.3.3 Reproduction
        2.3.4 Removal
        2.3.5 Summary of Evolutionary Computation issues
    2.4 Evolutionary Computation Algorithms
        2.4.1 Genetic Algorithms
        2.4.2 Evolution Strategies
        2.4.3 Evolutionary Programming
        2.4.4 Hybrid EC Approaches for Forecasting
        2.4.5 Summary of Evolutionary Computation Algorithms
3 Description of the CAFECS model selection algorithm
    3.1 Parametric Regression
    3.2 Symbolic Regression
    3.3 Inter-colony search coordination
    3.4 CAFECS discrete ES algorithm
    3.5 CAFECS Algorithm
    3.6 Description of CAFECS Algorithm Parameters
4 Experimental Results and Discussion
    4.1 Moving-average forecasting models
        4.1.1 Experiment 1: Developing a simple deterministic moving-average model
        4.1.2 Discussion of Experiment 1
        4.1.3 Experiment 2: Developing a moving-average model using S&P 500 data
        4.1.4 Discussion of Experiment 2
    4.2 A Comparison of Communication Strategies
        4.2.1 Experiment 3: 2-dimension sine-wave models
        4.2.2 Experiment 4: 4-dimension sine-wave models
        4.2.3 Results of Experiment 3 and 4
        4.2.4 Discussion of Experiments 3 and 4
    4.3 Model Combination and Optimization
        4.3.1 Experiment 5: Temperature forecasting with models using day number
        4.3.2 Experiment 6: Temperature Forecasting with moving-average models
        4.3.3 Experiment 7: Temperature Forecasting with combined models
        4.3.4 Results of Experiments 5, 6, and 7
        4.3.5 Discussion of Experiments 5, 6, and 7
    4.4 Constructing Attribute Sets for Decision Trees
        4.4.1 Experiment 8: Elk habitat classification
        4.4.2 Experiment 8 Results
        4.4.3 Discussion of Experiment 8
    4.5 Multiple Non-equivalent Global Optima
        4.5.1 Experiment 9: Matrix Pattern Recognition
        4.5.2 Experiment 9 Results
        4.5.3 Discussion of Experiment 9
5 Limitations of and Guidelines for the CAFECS approach
    5.1 Limitations of the CAFECS approach
    5.2 Informal Guidelines for when not to use the CAFECS approach
6 Summary and Future Work
REFERENCES


LIST OF FIGURES

Figure 1, “Types of models by reliability”
Figure 2, “A generic classification or forecasting model”
Figure 3, “An Overview of the CAFECS Model Evaluation System”
Figure 4, “An example of an observation data set window”
Figure 5, “A Model Node”
Figure 6, “An example of a 24 node model”
Figure 7, “An Example of a Model Criterion Function”
Figure 8, “An example of a genotype”
Figure 9, “A simplified model of evolution”
Figure 10, “An example of a fitness landscape”
Figure 11, “Differences of ES Crossover Operator Types”
Figure 12, “The Gaussian curve used to determine the ES mutation amount”
Figure 13, “Overview of three levels of model evolution”
Figure 14, “An example of a 14 node model”
Figure 15, “An example of model crossover”
Figure 16, “16 colony one-way binary tree topology”
Figure 17, “A 16 colony toroid topology”
Figure 18, “CAFECS discrete ES algorithm”
Figure 19, “Definition of Gauss indexed step for calculating parameter step sizes”
Figure 20, “Model criterion value normalization algorithm”
Figure 21, “The input and desired forecast time-series for test 1”
Figure 22, “The distribution of Experiment 1, Run 1 results with window size = 15”
Figure 23, “The distribution of Experiment 1, Run 2 results with window size = 25”
Figure 24, “Test 2 input time-series: the S&P 500 weekly ave. from 1/80 - 7/94”
Figure 25, “Experiment 2 Results: the node 1 input link vector of the best model selected”
Figure 26, “A graphical depiction of function g defined in section 4.2.1”
Figure 27, “A 10X magnification of portion of figure 26 that contains the global maximum point”
Figure 28, “Experiment 3: A comparison of communications strategies on the 2-dimension function g”
Figure 29, “Experiment 4: A comparison of communications strategies on the 4-dimension function g'”
Figure 30, “30,000 days of daily high temperatures at a Detroit weather station”
Figure 31, “The first 5000 days of daily high temperatures from figure 30”
Figure 32, “A simplified version of the combined temperature forecasting model”
Figure 33, “An example of a decision tree induced by C4.5 from data in table 13”
Figure 34, “Gray-scale groundcover classification map of the Landsat image”
Figure 35, “Rules and Test Error of Models by Generation for trees pruned 0%, 3%, and 5%”
Figure 36, “Elk model result summary: Error vs. Rule count”
Figure 37, “A 4x4 matrix used to find a model that can predict additional elements”
Figure 38, “8x8 matrix created by models 1 & 2”
Figure 39, “8x8 matrix created by model 3”
Figure 40, “8x8 matrix created by model 4”
Figure 41, “Classification errors using checkerboard data observation set”
Figure 42, “A graph of two model criterion functions, each with one global and two local optima”


LIST OF TABLES

Table 1, “A sample list of model operator types”
Table 2, “Summary of evolutionary computation issues”
Table 3, “CAFECS algorithm parameter descriptions”
Table 4, “Experiment 1 specifications”
Table 5, “Experiment 2 specifications”
Table 6, “Experiment 3 specifications”
Table 7, “Communication Strategy Detail”
Table 8, “Experiment 4 specifications”
Table 9, “Experiment 5 specifications”
Table 10, “Experiment 6 specifications”
Table 11, “Experiment 7 specifications”
Table 12, “Summary of results of four temperature forecasting models”
Table 13, “Sample states of a system with control actions”
Table 14, “Elk Image Frequency Bands”
Table 15, “Elk Image Pixel Classifications of Groundcover”
Table 16, “Experiment 8 specifications”
Table 17, “Experiment 9 specifications”


Chapter 1
Introduction

How many times it thundered before Franklin took the hint! How many apples fell on Newton’s head before he took the hint! Nature is always hinting at us. It hints over and over again. And suddenly we take the hint. ––Robert Frost

This dissertation introduces a hybrid of genetic algorithms [Holland 75, 92] and evolution strategies [Bäck & Schwefel 93] for developing and optimizing discrete-time forecasting and classification models, particularly in highly non-linear domains. We discuss genetic algorithms and evolution strategies in section 2.4.1 and section 2.4.2, respectively. We present a system, called CAFECS (Classification and Forecasting Evolutionary Computation System), that incorporates this approach. The CAFECS approach can be used to continually search a well-defined model domain for a “better” model, while making the best model found at any point in the search process available for use at that time.

We define a model as a set of functions. A user-specified model criterion function defines the extent to which one model is “better” than another. CAFECS accomplishes model development by “evolving” a set of model domain subsets called colonies. CAFECS evolves colonies by repeatedly creating new models from the set of existing colonies by copying and combining the better performing models with modifications and then replacing the poorer performing models in the colonies with these newly created models. This is similar to the process used to explain how nature evolves organisms, which is discussed in section 2.2. Either CAFECS creates the initial set of models randomly or the user specifies it.
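The copy, combine, modify, and replace cycle described above can be illustrated with a minimal sketch. It is not the CAFECS algorithm itself (that is described in chapter 3). The function name `evolve_colony`, the representation of a model as a list of floats, the uniform crossover, the Gaussian modification, and the keep-the-better-half replacement rule are all assumptions chosen for illustration.

```python
import random


def evolve_colony(colony, criterion, generations=200, rng=None):
    """Evolve one colony toward higher criterion values (illustrative only).

    Assumptions: a model is a list of floats, `criterion` maps a model to a
    score where higher is better, and the better half of the colony survives
    each generation while the poorer half is replaced.
    """
    if rng is None:
        rng = random.Random(0)
    colony = [list(m) for m in colony]  # work on copies of the caller's models
    for _ in range(generations):
        colony.sort(key=criterion, reverse=True)
        keep = max(1, len(colony) // 2)
        parents = colony[:keep]  # the better performing models survive
        children = []
        for _ in range(len(colony) - keep):
            a, b = rng.choice(parents), rng.choice(parents)
            child = [rng.choice(pair) for pair in zip(a, b)]  # combine two parents
            child = [x + rng.gauss(0.0, 0.1) for x in child]  # modify the copy
            children.append(child)
        colony = parents + children  # new models replace the poorer ones
    return max(colony, key=criterion)
```

Because the better half is carried over unchanged, the best model available at any generation is never worse than the best initial model, which mirrors the idea of making the best model found so far available at any point in the search.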

There are three significant difficulties in selecting an optimal model. The first difficulty is that the cardinality of the model set can become enormous even for simple models, making an exhaustive search infeasible. The second difficulty is that the model criterion function is often highly non-linear in the model attribute space, making linear and hill-climbing algorithms poor at selecting an optimal model. The third difficulty is that an accurate model criterion function that can be evaluated in an acceptable period of time for the user is often not available. CAFECS was developed to address the first two difficulties. The limitations presented by the third difficulty are discussed in Chapter 5.

We incorporate principles derived from the concept of natural selection with cooperation into CAFECS to help address the first two difficulties. In order to explain this, we examine the role of natural selection in evolution with an emphasis on the importance of cooperation. We use multiple communicating colonies in CAFECS in order to facilitate a more efficient search of the solution space and to reduce the total CPU time required [Koza & Andre 95]. The CAFECS approach supports a variety of communication strategies between the colonies including the crossbreeding of models. We examine the gains obtained using inter-colony communication strategies with different communication topologies in section 4.2. The gains may be explained in part by Sewall Wright’s shifting balance theory [Wright 32] and the principle of sociogenesis [Novak 82], which are discussed in section 2.2, and also by results obtained in distributed problem solving, e.g., [Erman & Lesser 75][Durfee et al. 87][MacIntosh et al. 91].
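As a rough illustration of a limited exchange of models between colonies, the sketch below performs one migration step on a one-way ring. The ring topology, the function name `migrate_ring`, and the rule that a copy of each colony's best model replaces the worst model in the next colony are assumptions made for this example. They are not one of the communication strategies compared in section 4.2.

```python
def migrate_ring(colonies, criterion):
    """One migration step on a one-way ring of colonies (illustrative only).

    Assumptions: each colony is a list of models, a model is a list of
    numbers, and `criterion` returns a score where higher is better.
    """
    # Pick every colony's best model first so the exchange is simultaneous.
    bests = [max(colony, key=criterion) for colony in colonies]
    for i, best in enumerate(bests):
        target = colonies[(i + 1) % len(colonies)]
        worst = min(range(len(target)), key=lambda j: criterion(target[j]))
        target[worst] = list(best)  # a copy of the migrant replaces the worst model
    return colonies
```

Calling this step only every few generations keeps the colonies mostly isolated, so each can explore a different region of the model domain while still sharing its best finds.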

The system’s performance is increased further by its distributed architecture that supports multiple communicating processes, each process evolving one or more colonies. These processes can run simultaneously and be distributed over a network of processors to decrease the clock time required to select a satisfactory model.

Unlike other machine learning approaches, CAFECS is a “data mining” approach designed to work in conjunction with the user and apply user knowledge instead of exclusively deriving the models from relationships induced from data. However, this user knowledge is not required. The user can restrict the model domain or focus the search in areas which appear more promising and may improve system performance. Also, at any point in the development process, the user can enter new models that the user desires to test or has reason to believe may be useful. These models are then used by CAFECS in an attempt to create better models.

In order to explore a wide variety of high-dimensional, non-linear model solution spaces, we combine in CAFECS techniques and algorithms derived from distributed genetic algorithms, genetic programming, and evolution strategies. CAFECS uses genetic algorithms to select the optimal model inputs and structure. Also, CAFECS uses genetic algorithms to optimize model parameters that have relatively few possible values or whose possible values have no meaningful sequence apropos of the model. CAFECS optimizes the remaining model parameters with an evolution strategies approach. The models that CAFECS creates form directed acyclic graphs consisting of a variety of functions. In section 1.3, we present a formal mathematical model of the system that circumscribes the types of models the system can optimize. With respect to the directed acyclic graphs of functions, the system is similar to genetic programming [Koza 92] which creates parse trees of programs. However, the method in which these trees are created by genetic programming is somewhat different from the CAFECS approach. How CAFECS forms models is explained in section 3.5.
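The division of labor between the two techniques can be sketched as a single mutation operator that treats the two kinds of parameters differently. This is a simplified illustration under stated assumptions, not the CAFECS operator: the `spec` format, the parameter names, the resampling rate, and the fixed Gaussian step size `sigma` are all hypothetical.

```python
import random


def mutate(params, spec, rng, rate=0.2):
    """Return a mutated copy of `params` (illustrative only).

    `spec` is an assumed format mapping each parameter name to either
    ("choice", options) for unordered values, mutated GA-style by resampling,
    or ("real", low, high, sigma) for ordered values, mutated ES-style by a
    Gaussian step that is clamped to [low, high].
    """
    child = dict(params)
    for name, kind in spec.items():
        if kind[0] == "choice":
            if rng.random() < rate:
                child[name] = rng.choice(kind[1])
        else:
            _, low, high, sigma = kind
            stepped = child[name] + rng.gauss(0.0, sigma)
            child[name] = min(high, max(low, stepped))
    return child


spec = {
    "operator": ("choice", ["add", "mul", "max"]),  # no meaningful ordering
    "weight": ("real", -1.0, 1.0, 0.1),             # ordered, continuous
}
child = mutate({"operator": "add", "weight": 0.5}, spec, random.Random(0))
```

Small Gaussian steps exploit the fact that nearby values of an ordered parameter tend to behave similarly, whereas an unordered parameter such as an operator type offers no such neighborhood, so resampling is the natural choice there.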

This dissertation is segmented into six chapters: Introduction, Background Information and Related Research, Description of the CAFECS model selection algorithm, Experimental Results and Discussion, Limitations of and Guidelines for the CAFECS approach, and Summary and Future Work. In the Introduction, we discuss system modeling, define the problem, and outline the CAFECS system used to address this problem. In the Background Information and Related Research chapter, we focus on the key aspects of forecasting models, evolutionary theory, and evolutionary computation that are useful in understanding the system this dissertation presents. In addition, we summarize research accomplished in these areas that relates to this work. In the Description of the CAFECS model selection algorithm chapter, we provide a detailed description of the system.

The following Experimental Results and Discussion chapter describes tests performed with the system. In this chapter, we present and analyze the corresponding results. We include among the test results two experiments in which CAFECS was able to optimize one of two highly non-linear functions in each of the 100 algorithm runs performed. In addition, we show how the system performance was improved by evolving separate colonies and permitting a limited exchange of models between the colonies. In another set of experiments, we show that CAFECS can create a set of attributes that enables a decision tree induction algorithm to create a decision tree for classifying the terrain shown in the individual pixels of a Landsat satellite image that is more accurate as well as smaller than normally obtainable.

In the Limitations of and Guidelines for the CAFECS approach chapter, we discuss problems that CAFECS cannot solve or for which it is not appropriate. This dissertation concludes with the Summary and Future Work chapter.

1.1 System Modeling

System modeling is the process of creating an abstract representation, normally simplified, of a physical or mathematical system. This representation is called a model. In the next section, we will define model apropos of this dissertation more specifically. Until then, we will use this general definition of model. In this dissertation, we are concerned specifically with discrete-time forecasting models and classification models.

Typically, models are created by induction from observations of the system being modeled. The goal is to construct a model, within a set of constraints, that enables users to make accurate and reliable predictions regarding the system being modeled and how the system would behave under different conditions. In addition, good models also allow us to gain insight and understanding of the system being modeled. In many cases, we can examine the effects of changing the system being modeled by simply changing the model. Common model constraints include accuracy, reliability, simplicity, as well as computational time and space. By accuracy, we refer to the magnitude of the difference between the forecast or classification determined by a model and the desired forecast or classification. By reliability, we refer to the probability of a model calculating a forecast or classification within a given accuracy.
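The two definitions above can be made concrete with a small sketch. It assumes numeric forecasts, measures accuracy as the absolute difference between each forecast and its desired value, and estimates reliability as the fraction of forecasts whose error falls within a chosen tolerance. The function names and the sample numbers are illustrative only.

```python
def absolute_errors(predicted, desired):
    """Accuracy per case: magnitude of the gap between forecast and target."""
    return [abs(p - d) for p, d in zip(predicted, desired)]


def reliability(predicted, desired, tolerance):
    """Estimated probability that a forecast lands within `tolerance`."""
    errors = absolute_errors(predicted, desired)
    return sum(e <= tolerance for e in errors) / len(errors)


predicted = [10.0, 12.0, 15.0, 20.0]
desired = [11.0, 12.0, 18.0, 20.5]
print(absolute_errors(predicted, desired))   # [1.0, 0.0, 3.0, 0.5]
print(reliability(predicted, desired, 1.0))  # 0.75
```

In this made-up example three of the four forecasts fall within one unit of the desired value, so the estimated reliability at that accuracy is 0.75. Tightening the tolerance to 0.5 would lower it to 0.5, which shows why reliability is always stated for a given accuracy.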


Models lie along a spectrum of reliability for a given accuracy ranging from the deterministic to the random (i.e., entirely nondeterministic) with the bulk of the models lying somewhere in between as illustrated in figure 1.

[Figure: a vertical reliability axis running from High (Deterministic) to Low (Random), with Laws of Physics, Marketing Models, and Lottery Prediction Models marked along it in that order.]

Figure 1. Types of models by reliability

For our purposes, randomness is an absence of any detectable pattern. It is more a measure of ignorance than a provable concept. Richard Hamming defines randomness as “a negative property; it is the absence of any pattern” adding “randomness can never be proved, only the lack of it be shown” [Hamming 91]. System modeling can be viewed as the search for patterns in order to minimize randomness and, hence, maximize reliability.

A typical classification or forecasting model consists of a set of inputs, a set of constant control parameters, and a structure, normally provided by a set of equations, which transform the inputs into an output called the classification or forecast. This concept is illustrated in Figure 2.


Figure 2. A generic classification or forecasting model. [Diagram: constant parameters and input variables feed a classification or forecasting model, which outputs the classification or forecast.]

As the input variables change, either to characterize a different case or time, so does the classification or forecast.

A well-tested strategy for model development is described by the scientific method summarized in the following four steps:

1. Collect observations from the physical system.
2. Generate a set of hypothetical models that are consistent with the observations.
3. Test the hypothetical models.
4. Select the most fit model based on its accuracy, reliability, and simplicity.

Model construction can be viewed as a search process. This search consists of finding the pertinent inputs and their relationships to each other in order to determine the desired output. Unfortunately, in many cases the search space is either enormous or too small to contain a useful model with a high probability. Consider constructing a model which simply requires selecting the best subset of 50 possible inputs. There are 2^50 such subsets. An exhaustive test of each input subset at a rate of one million tests per second would take nearly 36 years. Raising the number of possible inputs to 75 would increase the time to more than one billion years. In addition to selecting inputs, parameters and functions must be selected as well. Furthermore, some systems change with time, requiring models of these systems to change as well in order to maximize their accuracy. Because the number of possible models is usually enormous, exhaustive searches are frequently not possible. Instead, we rely on mathematical techniques, heuristic algorithms, and our own intelligence to reduce the size of the search space. The machine learning approach being presented in this dissertation uses all three of these strategies in an attempt to arrive at the best solution possible.
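The subset-counting argument can be checked with a few lines of Python. The sketch below assumes a year of 365.25 days and the rate of one million tests per second stated in the text; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: Julian year


def exhaustive_search_years(num_inputs, tests_per_second=1_000_000):
    """Years needed to test every subset of num_inputs candidate inputs."""
    subsets = 2 ** num_inputs  # each input is either included or excluded
    return subsets / tests_per_second / SECONDS_PER_YEAR


years_50 = exhaustive_search_years(50)
years_75 = exhaustive_search_years(75)
```

With 50 inputs the result is a little under 36 years, and with 75 inputs it exceeds one billion years, which matches the figures quoted in the text.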

1.2 Statement of Research Goals

‘If a cat can kill a rat in a minute, how long would it be killing 60,000 rats?’ Ah, how long, indeed! My private opinion is that the rats would kill the cat.
––Lewis Carroll

In the above quote, Lewis Carroll offers relevant counsel to those of us who would attempt to induce models or use models so created. Humanity has a good grasp on how to construct linear models. However, it is often the case that the physical systems we attempt to model turn out to be highly non-linear in regions of interest. If we attempt to model such a system with a linear model, we can arrive at erroneous conclusions as Lewis Carroll so colorfully points out. Unfortunately, Maslow’s maxim, “if the only tool you have is a hammer, then you start treating everything as if it were a nail,” often holds true.

The goal of this research is to develop a heuristic algorithm based on a hybrid of evolutionary computation algorithms, i.e., an alternative to our assortment of model-building “hammers.” This algorithm is used for selecting a mathematical model, i.e., a set of functions, from a user-defined finite set of classification or discrete-time forecasting models, such that the model meets user-specified criteria. These concepts are formally defined in section 1.3. By selecting a model from the model set, we are in effect selecting the model inputs, building the model structure from a set of atomic functions specified by the user, and modifying the model parameters to optimize the model.

The primary weakness of the CAFECS algorithm is that it is a stochastic heuristic approach to developing models. Because of this, we cannot claim that the heuristic we present will find an optimal model in a model set regardless of the time permitted. Moreover, in many non-trivial cases the algorithm can run for a considerable length of time before model improvements become infrequent. For this reason, CAFECS is not appropriate for problems in which the desired solution can be found deterministically in a reasonable amount of time. This and other limitations are discussed further in Chapter 5. However, for many problems, the model space is too large and non-linear to lend itself to such deterministic approaches. For such problems, we achieved better results in practice using the CAFECS heuristic than would be expected with a generate-and-test algorithm. These results are discussed in Chapter 4.

The contributions of this research are as follows:

1. We develop a novel formal structure for defining a wide variety of sets of mathematical models, which can be used for classification and discrete-time forecasting, so that each model can be specified as a vector set. An analogy to this is that DNA can be used to specify a wide variety of biological organisms. With this formal structure, the user can constrain the model domain, even though it contains a wide variety of models, so that the size of the domain is smaller than domains defined using conventional evolutionary computation approaches. This can reduce the search time. Moreover, by making the formal structure hierarchical, the CAFECS algorithm can focus on evolving certain sections of a model defined with this structure, decreasing the search time further. We present the formal structure in section 1.3.

2. We develop a heuristic algorithm, which we incorporate in a system called CAFECS, that is a hybrid of genetic algorithms and evolution strategies, each of which has its own biases, for selecting an optimal model from a user-specified model set. For some problems, one obtains better results if a genetic algorithm evolves some model parameters and evolution strategies evolve the remaining parameters. However, for non-linear problems, it is not effective to evolve the model parameters separately. CAFECS subsumes the capabilities of both algorithms by simultaneously applying each algorithm at a different level so that the benefits of both algorithms can be gained. One can view the CAFECS algorithm as evolving models at three levels and we present it in Chapter 3. We discuss genetic algorithms, evolution strategies, and evolutionary theory in general as it relates to these algorithms in Chapter 2. We present and discuss test results using CAFECS in Chapter 4.

3. We compare 10 communication strategies with different communication topologies and demonstrate that by allowing a limited amount of one-way communication between separate populations of models being simultaneously evolved, the performance of the algorithm can be improved by nearly an order of magnitude in some cases. Although using multiple communicating colonies or agents for search is not new, the comparison of the different communication topologies apropos of evolutionary computation is a minor contribution. This is explained in Chapter 3 and an example is presented in section 4.3.

4. By using the algorithm to construct attributes for a decision tree, we demonstrate that the algorithm can be used to decrease the size of a decision tree while increasing its accuracy beyond what can be obtained by optimally pruning the tree. We present these test results in section 4.4.

1.3 Formal Definition of the Problem

Keep it simple: as simple as possible, but no simpler.
––Albert Einstein

In loose terms, the problem we address is how to infer a model for forecasting or classification from a data set of observations such that the model criterion value of the selected model, which is determined by a user-specified model criterion function, meets the user-specified solution criterion. The problem domain is defined to be the set of possible observation data sets. An observation data set contains model input values along with the corresponding desired model output values and is discussed further in section 1.3.1. Throughout this dissertation, any reference to a function is implicitly understood to refer to a function that can be mapped one-to-one to a total computable n-ary function f : N^n → N, where N is the set of natural numbers, unless explicitly stated otherwise.

We roughly define a model as a set of functions, called nodes, that maps an input vector of attribute-values from the observation data set to a model output vector. This output vector contains one or more components that represent, or can be used to calculate, a classification or forecast for an observation. The user specifies a set of models, restricted as described later in this section, called the model domain. The cardinality of this set can be extremely large for non-trivial problems, making an exhaustive search impractical. A model and the model domain are formally defined in section 1.3.3.

A model evaluation system, called CAFECS, is used to select a model in the model domain that meets the user-specified search criteria and is illustrated in Figure 3.

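Figure 3. An Overview of the CAFECS Model Evaluation System. [Diagram: the Model Selection Algorithm selects a model M(θ) from the Model Parameter Domain, DM; the Model Criterion Function, V, uses the Observation Data Set, Z, to assign the Model Criterion, which is returned to the Model Selection Algorithm.]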

The four components of the system are the observation data set Z, the models to be evaluated as defined by the model parameter domain DM, the model selection algorithm, and the model criterion function V. From the observation data set Z, we can form an indexed set of model input vectors, each denoted Z(k) where k is the index, and an indexed set of output attribute values, each denoted y(k), where y(k) is the desired forecast or classification given Z(k). The observation data set and model input vector are described in detail in section 1.3.1. A model is a set of functions, parameterized by the model parameter vector θ, that takes Z(k) as inputs and outputs a vector ŷ(k) that can be used to perform a forecast or classification as specified by the user. The model criterion function, V, is user-specified and assigns a model criterion value to a model M(θ) selected by the model selection algorithm. The model criterion value of the model is then used by the model selection algorithm to direct its search procedure. The model selection algorithm iteratively selects models in an attempt to minimize the model criterion function V. Typically, the model criterion value represents the error of the model. However, it may also represent, in part, the size of the model or a structure constructed using the model such as a decision tree, the score of a deterministic game, or any other function of a model and the observation data set that the user desires to minimize. The model criterion function is discussed further in section 1.3.4.
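The interaction between a model selection algorithm and a model criterion function can be sketched as a simple loop. The Python below is a plain generate-and-test baseline, not the CAFECS algorithm (which is defined in Chapter 3); the function name, the toy parameter domain, and the toy criterion are all hypothetical and serve only to show how criterion values direct the choice of the retained model.

```python
import random


def select_model(parameter_domain, criterion, iterations, seed=0):
    """Generate-and-test baseline: sample parameters and keep the one with the lowest criterion value."""
    rng = random.Random(seed)
    best_theta, best_value = None, None
    for _ in range(iterations):
        theta = rng.choice(parameter_domain)   # propose a candidate model
        value = criterion(theta)               # model criterion value of the candidate
        if best_value is None or value < best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value


# Hypothetical one-parameter "models" and a criterion that measures distance from 42.
domain = list(range(100))
criterion = lambda theta: abs(theta - 42)

theta, value = select_model(domain, criterion, iterations=50)
```

A heuristic such as CAFECS replaces the blind sampling step with a guided one, while the criterion function plays the same role.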

We will now provide a precise definition of the CAFECS model evaluation system, excluding the model selection algorithm which is defined in Chapter 3. We begin with the observation data set followed by a description of a model node and then proceed to the model itself. We then present the model criterion function, and conclude with the formal description of the problem.

1.3.1 Observation Data Set

We define an observation data set to be a set of indexed sets of attribute values where an attribute generally corresponds to some characteristic in the environment such as voltage or color. The set of possible values of an attribute is restricted to a finite set. An observation is the set of attribute values for a given index. Typically, the index refers to time or space when forecasting and an instance when classifying. One attribute in the data set is designated as the output attribute. For each observation, the output attribute value represents the forecast or class of that observation. When classifying, the order of the observation in the observation data set is usually not relevant. In such cases, a model will only use attribute values from a single observation to calculate an output value. However, when the order of the observations is pertinent, as is frequently the case when performing forecasting, it is often useful for a model to use attribute values from several observations to calculate an output value. We call the ordered set of attribute values from the observation data set that a model may use as inputs to calculate an output the observation data set window. The observation data set window width is the number of observations with elements in the observation data set window. Thus, the observation data set window for a classification model typically has a window width of 1, whereas an observation data set window for a forecasting model may have a much larger window width. Figure 4 illustrates this concept.

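Figure 4. An example of an observation data set window. [Diagram: rows of attribute values for attribute numbers 1, 2, 3, …, CA, indexed from k = 0 to k = K-1; a large gray box marks the observation data set window Z(k) with window width = 8, and a small black box marks the desired output attribute value y(k).]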

In this figure, each long row of white boxes represents an indexed set of attribute values for a specific attribute. The observation data set window consists of all the small white boxes contained in the large gray box. The maximum number of attribute values from a single attribute in figure 4 is 8, which is the window width. The number of attribute values in the observation data set window need not be the same for each attribute. Note that the small black box, called the desired output attribute value y(k), is not part of the observation data set window. The attribute values represented by the white boxes in the observation data set window are the elements of a model input vector, Z(k). The model input vector is used by a model to calculate as closely as possible the desired output attribute value represented by the small black box. By sliding the large gray box and the small black box across the indexed sets of attribute values, as indicated by the arrow above the large gray box, we can generate a series of model input vectors with their corresponding desired output attribute values.

We formally define the observation data set and its various elements as follows:

ui = {ui(0), ui(1), …, ui(K - 1)}, an indexed set of input attribute values representing an attribute i, where K = |ui| and is constant for any i. We will refer to the index of this set as index k, 0 ≤ k < K, for convenience although a temporal interpretation is not required.

ui^n = {ui(0), ui(1), …, ui(n - 1)} ⊆ ui, where 0 ≤ n < K. This set will be used later to segment the data into training and test sets.

CA = the total number of attributes, including the output attribute, represented in the observation data set.

y = {y(0), y(1), …, y(K - 1)}, the indexed set of output attribute values.

y^n = {y(0), y(1), …, y(n - 1)} ⊆ y, where 0 ≤ n < K.


wi = the user-specified “window-width” of attribute i, 1 ≤ i ≤ CA.

wy = the user-specified “window-width” of the output attribute y.

wmx = MAX({wy} ∪ {wi : 1 ≤ i ≤ CA}), the maximum “window-width” of all attributes.

Z(k) = (u1(k), u1(k-1), …, u1(k-w1+1), u2(k), u2(k-1), …, u2(k-w2+1), …, up(k), up(k-1), …, up(k-wp+1), y(k-1), y(k-2), …, y(k-wy+1)), where p = CA - 1, and is called the model input vector at index k. Its components are the attribute values of the observation data set that can be input into a model at any index k. The model input vector is also called the observation data set window when used in context with the observation data set. Note that y(k) is not an element of Z(k); that is because Z(k) is used by the model to predict y(k).

Z = {Z(0), Z(1), …, Z(K-wmx+1)} and represents the entire usable observation data set provided by the user.

ZT = {Z(0), Z(1), …, Z(T-1)}, 0 ≤ T ≤ |Z|, and represents the portion of the observation data set to be used for training, i.e., selecting a model. The remainder of the data set is used for testing, i.e., calculating the model criterion value of a model using a model criterion function that can be different from the model criterion function used for training. The test data has no effect on model selection.
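As an illustration of these definitions, the following Python sketch assembles model input vectors Z(k) with their desired outputs y(k) and splits them into training and test portions. The function names and sample data are hypothetical. The sketch also assumes that windows start at the first index for which every look-back is valid (k = wmx - 1) and that the resulting vectors are then numbered from 0; the text does not spell out this re-indexing.

```python
def model_input_vector(inputs, y, widths, w_y, k):
    """Build Z(k) from input attribute series, output series y, and window widths."""
    z = []
    for series, w in zip(inputs, widths):
        z.extend(series[k - d] for d in range(w))   # u_i(k), u_i(k-1), ..., u_i(k-w_i+1)
    z.extend(y[k - d] for d in range(1, w_y))       # y(k-1), ..., y(k-w_y+1); y(k) is excluded
    return tuple(z)


def build_windows(inputs, y, widths, w_y):
    """Slide the window across the data, pairing each Z(k) with its desired output y(k)."""
    w_mx = max([w_y] + list(widths))
    first_k = w_mx - 1  # assumption: first index with no negative look-back
    return [(model_input_vector(inputs, y, widths, w_y, k), y[k])
            for k in range(first_k, len(y))]


# Hypothetical data: two input attributes and one output attribute, K = 5.
u1 = [1, 2, 3, 4, 5]
u2 = [10, 20, 30, 40, 50]
y = [0, 1, 0, 1, 1]

pairs = build_windows([u1, u2], y, widths=[2, 1], w_y=3)

T = 2                                   # size of the training portion
training, testing = pairs[:T], pairs[T:]
```

With these values the first pair is ((3, 2, 30, 1, 0), 0): two values of the first attribute, one of the second, two previous outputs, and the desired output y(2).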


1.3.2 Node

As previously stated, a model is a set of functions called nodes. We designate the model order to be CZ, the cardinality of Z(k). We designate the cardinality of a model, i.e., the number of nodes it contains, to be CN. A node N(θN) is a function, parameterized by the set θN, called the node definition set, that takes inputs from the data set of observations along with the outputs of other nodes and outputs a natural number, i.e., N(θN) : N^CX → N where CX = CZ + CN. We will start with an informal explanation of a node and conclude with a more formal definition. Figure 5 illustrates a model node and how its components relate.

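Figure 5. A Model Node. [Diagram: for node number i, the Input Link Selector, parameterized by the Input Link Vector θL and the Default Input Argument Vector θD, draws on the Model Input Vector Z(k) and the Node Output Vector ŷ(k); its outputs pass through the Input Sequencer, parameterized by the Input Sequence Vector θS, to the Operator Function, parameterized by the Operator Type O and the Node Parameter Vector θP, which produces the node output ŷ^i(k).]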

Each node inputs components of the model input vector Z(k) as well as components of the model output vector ŷ(k) and outputs a natural number such that the output of node i is the i’th component of ŷ(k), denoted ŷ^i(k). Later in this section, it is shown that each node is defined such that its output does not form a cycle with any other node by ignoring certain components of ŷ(k).

The three sections of a node are the Input Link Selector, the Input Sequencer, and the Operator Function. In addition, a node is parameterized by its Default Input Argument Vector θD, Input Link Vector θL, Input Sequence Vector θS, Node Parameter Vector θP, Operator Type O, as well as its node number i. The model node, shown in figure 5, will be explained from right to left.

The Operator Function is the key section of a node. It takes CX input arguments and outputs the node output value. It is parameterized by the Operator Type O and the node parameter vector θP. The Operator Type specifies the algorithm used by the Operator Function to map the input vector to the output value, i.e., how the output is calculated for a given set of inputs. For example, a commonly used Operator Type is called SUM. When the Operator Type is SUM, the Operator Function sums its CX input arguments along with the first component of θP and outputs the total. Thus, if all its input arguments are 0 and the first component of θP is 3, the node output is 3. Another Operator Type may be a Neural Network where components of θP are used as weights. Table 1 contains a selection of node operators typically used. Node operators are classified as either simple or complex. Simple operators have few or no parameters whereas complex operators have many parameters. It is common for a complex operator to associate at least one parameter with each operator input, e.g., use parameters as link weights.
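The SUM example can be sketched in Python as a lookup from Operator Type to a function of the input arguments and θP. The dispatch table, the function names, and the WEIGHTED_SUM entry (one weight per input, as the text suggests for complex operators) are assumptions for illustration; only the behavior of SUM is taken from the description above.

```python
def op_sum(args, theta_p):
    """SUM: the total of the input arguments plus the first component of theta_p."""
    return sum(args) + theta_p[0]


def op_weighted_sum(args, theta_p):
    """Hypothetical complex operator: theta_p supplies one link weight per input."""
    return sum(w * a for w, a in zip(theta_p, args))


OPERATORS = {"SUM": op_sum, "WEIGHTED_SUM": op_weighted_sum}


def operator_function(operator_type, args, theta_p):
    """Apply the operator named by operator_type to the sequenced input arguments."""
    return OPERATORS[operator_type](args, theta_p)


sum_output = operator_function("SUM", [0, 0, 0, 0, 0], [3])
weighted_output = operator_function("WEIGHTED_SUM", [1, 2, 3], [2, 0, 1])
```

The first call reproduces the case in the text: all inputs are 0, the first component of θP is 3, and the node output is 3.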


Table 1. A sample list of model operator types (multiple input, one output)

Simple Operators: NAND, Mean, Median, Mode, Max, Min, Sum, Inverted Sum, Multiply, Inverted Product, Divide, Remainder, Absolute Value, Square, Square Root

Complex Operators: Neural Network, Sigmoid, Weighted Sum, Weighted Mean, Finite State Machine, Hyper-cube, Hyper-sphere, OR/NOR Selector, Lookup Table, Decision Tree

The inputs to the Operator Function are the outputs of the Input Sequencer. The purpose of the Input Sequencer is to change the sequence of the input arguments for the Operator Function. Consider the function f(a,b,c) = y and an input vector 〈3, 5, 7〉. The Input Sequencer, parameterized by θS, specifies which component of the input vector is mapped to which argument of the function. Continuing the example, if θS = 〈3, 2, 1〉, then the first component of the input vector, 3, will be mapped to the third function argument c, the second component of the input vector, 5, will be mapped to the second argument b, and the third component of the input vector, 7, will be mapped to the first argument a, resulting in f(7,5,3) = y. However, if θS = 〈2, 1, 3〉, then the result would be f(5,3,7). For some functions, the argument sequence is irrelevant, e.g., a function that takes the product of its arguments, in which case f(c,b,a) = f(b,a,c). However, for many functions, the argument sequence is important, e.g., f(a,b,c) = a/b + c.
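The mapping performed by the Input Sequencer can be sketched in Python. The sketch assumes the convention described in the example, namely that component j of the incoming vector is routed to argument number θS^j (counted from 1); the function names are hypothetical, and the vectors (3, 5, 7), (3, 2, 1), and (2, 1, 3) are the ones implied by the example.

```python
def input_sequencer(link_outputs, theta_s):
    """Route component j of link_outputs to argument position theta_s[j] (1-based)."""
    args = [None] * len(link_outputs)
    for j, value in enumerate(link_outputs):
        args[theta_s[j] - 1] = value
    return args


def f(a, b, c):
    """An order-sensitive function from the text: a/b + c."""
    return a / b + c


reversed_args = input_sequencer((3, 5, 7), (3, 2, 1))   # arguments for f(7, 5, 3)
swapped_args = input_sequencer((3, 5, 7), (2, 1, 3))    # arguments for f(5, 3, 7)

# The two sequences give different results for an order-sensitive function.
results_differ = f(*reversed_args) != f(*swapped_args)
```

Because θS is a permutation of 1 to CX, every argument position is filled exactly once.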

The inputs to the Input Sequencer are the outputs of the Input Link Selector, which is parameterized by the Input Link Vector, θL, and the node number i. The analogy that the Input Link Selector is a set of demultiplexers may be helpful. The purpose of the Input Link Selector is to select CX components from the Model Input Vector Z(k), the Model Output Vector ŷ(k), and the Default Input Argument Vector θD to be output to the Input Sequencer. As mentioned previously, Z(k) contains CZ components and ŷ(k) has CN components where CX = CZ + CN. The Default Input Argument Vector θD and the Input Link Vector θL each contain CX components as well. We use the notation that vj denotes the j’th component of vector v. We designate the Input Link Selector output vector as L. Algorithmically, the Input Link Selector operates as follows:

    for j = 1 to CZ do:
        if θLj = 1 then Lj = Zj(k) else Lj = θDj
    end do
    for j = (CZ + 1) to CX do:
        if θLj = 1 and (j - CZ) < i then Lj = ŷ^(j – CZ)(k) else Lj = θDj
    end do

where i is the node number.

This algorithm can be roughly interpreted as: select the first CZ components from either the Model Input Vector or the Default Input Argument Vector depending on the Input Link Vector. Then, select the remaining components from either the Model Output Vector or the Default Input Argument Vector, again depending on the Input Link Vector but also on the node number i. The Default Input Argument Vector θD is determined by the Operator Type. For example, for the Operator Type SUM, each component of the Default Input Argument Vector is equal to 0. Hence, if the Input Link Selector selects all CX components from the Default Input Argument Vector, all the inputs to the Operator Function will be zeroes, resulting in a node output of 0 when the first component of θP is also 0.

In order to demonstrate how the Input Link Selector operates, consider the following example of a model node. Let Z(k) = , ŷ(k) = , θD = , θL = , and i = 2, where i indicates the node number 2 and the dimension of the vector ŷ(k) indicates the model contains 3 nodes. We will now calculate the output vector L of the Input Link Selector. Since θL1 = 1, then L1 = Z1(k) = 3. However, θL2 = 0, therefore L2 = θD2 = 0. Since CZ = 2, the next component of L is selected either from the model output vector or the Default Input Argument Vector. Because θL3 = 1, then L3 = ŷ^(3 – 2)(k) = 7. The remaining components of L are selected from θD because the condition j - CZ < i is not true for j = 4 or j = 5 given that i = 2 and CZ = 2. Hence L4 = θD4 = 0 and L5 = θD5 = 0. Thus, the Input Link Selector output vector L = 〈3, 0, 7, 0, 0〉 in this case.
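The Input Link Selector algorithm can be written as a short Python function and run on the example. Only the values stated in the walk-through are taken from the text (Z1(k) = 3, the first node output 7, θL1 = 1, θL2 = 0, θL3 = 1, θD2 = θD4 = θD5 = 0, i = 2, CZ = 2, three nodes); every other vector component below is a hypothetical filler chosen for illustration, as is the function name.

```python
def input_link_selector(z, y_hat, theta_d, theta_l, i):
    """Select C_X values from Z(k), the node outputs, or the defaults, for node number i."""
    c_z, c_n = len(z), len(y_hat)
    selected = []
    for j in range(1, c_z + c_n + 1):          # j is 1-based, as in the text
        linked = theta_l[j - 1] == 1
        if j <= c_z and linked:
            selected.append(z[j - 1])                  # model input component j
        elif j > c_z and linked and (j - c_z) < i:
            selected.append(y_hat[j - c_z - 1])        # output of a lower-numbered node
        else:
            selected.append(theta_d[j - 1])            # default input argument
    return selected


z = (3, 9)                   # second component is a hypothetical filler
y_hat = (7, 4, 6)            # second and third components are hypothetical fillers
theta_d = (0, 0, 0, 0, 0)    # SUM defaults; first and third components assumed to be 0
theta_l = (1, 0, 1, 1, 1)    # fourth and fifth components are hypothetical fillers

L = input_link_selector(z, y_hat, theta_d, theta_l, i=2)
```

Even with links 4 and 5 switched on, node 2 cannot read the outputs of nodes 2 and 3, so the defaults are used and L is (3, 0, 7, 0, 0), as in the example.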

The requirement that ŷ^j(k) not be selected as an input for node i unless j < i is important because it orders the nodes. The result is that the lower numbered nodes tend to be used to pre-process the input data and the higher numbered nodes tend to be used to post-process the data. This permits a more directed model search. However, if this result is not desired, it would be simple to extend the model definition so that an operator that reorders the sequence of the nodes could be used by the search algorithm. One such operator simply swaps the node numbers of two nodes, e.g., the node number of node 3 is changed to 7 and the node number of node 7 is changed to 3.

The second reason why ordering the nodes is useful is that it prevents the formation of cycles where the output of a node has an effect on one or more of its own inputs. Not selecting the output of node j, ŷ^j(k), as an input of node i when j ≥ i is a sufficient condition to prevent the input of a node from being dependent on its output, i.e., the formation of cycles. Consider a directed graph of n vertices, numbered 1 to n, where each vertex in the graph represents a model node and each edge from an initial vertex to a terminal vertex represents that the output of the node represented by the initial vertex was selected as an input of the node represented by the terminal vertex. Clearly, the output of a model node does not affect its input if the vertex that represents it in the directed graph is not part of a cycle. If the directed graph is acyclic, then the model nodes represented by the graph are free from cycles as well. The following proposition and proof show that if it is true that for any edge from a vertex j to a vertex i in a directed graph, j < i, then the graph is acyclic.

Theorem: A directed graph G of vertices V(G) = {1, 2, …, n} and edges E(G) = {〈j, i〉 : i, j ∈ V(G)} is acyclic if for all 〈j, i〉 ∈ E(G), j < i.

Let P(n) be the proposition that a directed graph G of vertices V(G) = {1, 2, …, n} and edges E(G) = {〈j, i〉 : i, j ∈ V(G)} is acyclic if for all 〈j, i〉 ∈ E(G), j < i. We will prove P(n) is true for all integers n > 0.

Proof. (by induction on n):

Basis Step P(1): A directed graph G of vertices V(G) = {1} and edges E(G) = {〈i, j〉 : i < j and i, j ∈ V(G)} is acyclic.

P(1) is clearly true since there cannot be an edge because 1 < 1 is not true.

Inductive Step

Inductive Hypothesis: P(n): A directed graph G of vertices V(G) = {1, 2, …, n} and edges E(G) = {〈i, j〉 : i < j and i, j ∈ V(G)} is acyclic.

Assume that P(n) is true and show that P(n+1) is true. If we have an acyclic graph of vertices V(G) = {1, 2, …, n} and add the vertex (n + 1) to V(G), then the only way that a cycle could have been created is if an edge was added from vertex (n + 1) to another vertex in the graph, i.e., 〈n + 1, i〉; but no such edges can exist from vertex (n + 1) because there is no i ∈ V(G) such that (n + 1) < i. Hence, no cycles could have been created. Since the graph on n vertices is acyclic by P(n) and no cycles could be created by adding vertex (n + 1), the graph on (n + 1) vertices is acyclic, i.e., P(n+1) is true. From the basis and inductive steps, we conclude P(n) is true for all n > 0. ■
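The statement just proved can also be checked empirically on small graphs. The Python sketch below uses a standard depth-first search with three vertex states to detect cycles; it is an independent check written for illustration, not part of CAFECS, and the function name and the sample graph are hypothetical. Edges are written as (initial, terminal) pairs.

```python
def has_cycle(num_vertices, edges):
    """Return True if the directed graph on vertices 1..num_vertices contains a cycle."""
    adjacency = {v: [] for v in range(1, num_vertices + 1)}
    for initial, terminal in edges:
        adjacency[initial].append(terminal)

    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited, on the current path, finished
    color = {v: WHITE for v in adjacency}

    def visit(v):
        color[v] = GRAY
        for w in adjacency[v]:
            if color[w] == GRAY:        # back edge: w is on the current path
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in adjacency)


n = 6
# Every edge allowed by the ordering rule: initial vertex number < terminal vertex number.
ordered_edges = [(a, b) for a in range(1, n + 1) for b in range(a + 1, n + 1)]

ordered_is_acyclic = not has_cycle(n, ordered_edges)
violation_creates_cycle = has_cycle(n, ordered_edges + [(5, 2)])
```

Even the densest graph that respects the ordering rule is acyclic, whereas a single edge that violates it, such as (5, 2), closes a cycle with the existing edge (2, 5).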

The remainder of this section is devoted to a more rigorous definition of a node.


Formally, the node Ni(θN), 1 ≤ i ≤ CN, is parameterized by its node definition set, θN, where:

θN = {θO, θL, θS} ∈ DNi

DNi = the node definition set domain for node number i, 1 ≤ i ≤ CN. DNi is finite and is defined by the user-specified domains of DOi, DLi, DSi, and DPi. Thus, the node definition set domain for each node in a model may be different.

θO = {O, θP, θD}, and is called the node operator set.

O = the operator type of a node, O ∈ DOi.

DOi = the finite non-empty domain of operator types for node number i, 1 ≤ i ≤ CN, and is the set of general function classes specified by the user such as sum, product, neural network, etc. Descriptions of sample elements of DOi are listed in table 1.

θP = the node parameter vector, θP ∈ DPi. The dimension of this vector is a function of DOi. The elements of this vector determine the specific node function within the general function class specified by O.

DPi,j = the finite user-specified domain of node parameter vectors for node number i, 1 ≤ i ≤ CN and operator Oj. DPi,j ⊆ N^n where n is a function of O and DPi is a finite set that can be represented on a computer.

θD = the node default argument vector. It provides an alternative to using either an element of the model input vector or a node output value as a node input argument as shown in the definition of x(k). θD ∈ N^CX and is a function of O. θD is restricted to a finite set that can be represented on a computer.

θL = the node input link vector, θL ∈ DLi. Each component in it, θLj, is equal to either 1 or 0, and is associated with components Zj(k), ŷ^(j – CZ)(k), and θDj as described below in the definition of x(k), the node input vector at index k. The purpose of θLj is to determine when the node default component, θDj, is to be used as the j’th node input, i.e., when θLj = 0, and when Zj(k) or ŷ^(j – CZ)(k) is to be used instead, i.e., when θLj = 1. The variable Zj(k) represents the j’th component of the model input vector described in section 1.3.1. The variable ŷ^(j – CZ)(k) represents the output of node (j - CZ), where j > CZ, at index k.

DLi = the finite user-specified node input link vector domain, DLi ⊆ {0, 1}^CX, for node number i, 1 ≤ i ≤ CN.

θS = the node input sequence vector, θS ∈ DSi. It specifies an ordering of the node input vector x(k) as shown in the definition of x(k).

DSi = the finite user-specified node input sequence vector domain, DSi ⊆ 〈1, 2, 3, …, CX〉^CX where if a and b are two different components of an element of DSi then a ≠ b, for node number i, 1 ≤ i ≤ CN. That is, each element of DSi is a vector with a dimension of CX and is a permutation of the set of integers from 1 to CX.


N*i = the finite non-empty node domain of node i, {N(θN): θN ∈ DNi}, for node number i, 1 ≤ i ≤ CN. It is the set of candidate nodes for a specific node location in a model.

The output of node i at index k is defined as ŷ^i(k) = Ni(xi(k); θN), where xi(k) is the node input vector for node i at index k, 1 ≤ i ≤ CN. We use the notation f(α; β) to be equivalent to f(β)(α) where f(β) is a function parameterized by β that takes α as its argument. The dimension of xi(k) is CX = CZ + CN. Each component of xi(k) is defined as:

\[
x_i^{(\theta_S)_i^{\,j}}(k) =
\begin{cases}
Z^{j}(k); & (\theta_L)_i^{\,j} = 1 \text{ and } j \le C_Z \\
\hat{y}^{\,j - C_Z}(k); & (\theta_L)_i^{\,j} = 1 \text{ and } i > (j - C_Z) \text{ and } j > C_Z \\
(\theta_D)_i^{\,j}; & \text{otherwise}
\end{cases}
\qquad \text{for } j = 1 \text{ to } C_X
\]

where we use the notation v_i^j to refer to the j’th component of vector v of node i. Thus, the left-hand side refers to the θSj’th component of the node input vector x of node i at index k. This node input vector component is equal to either the j’th model input vector component, Zj(k), which is an observation; an output of node (j - CZ), ŷ^(j – CZ)(k); or the j’th default node argument vector component of node i, θD_i^j; depending on the j’th node input link vector component of node i, θL_i^j, along with the value of i and j. Note that i > (j - CZ) is required for ŷ^(j – CZ)(k) to be a component of the input vector of node i. This restriction prevents the output of a node, and the output of any node it is an input for, from being one of its own inputs. The result is that a graph of any set of nodes so defined, where the directed links represent the output of one node going to the input of another node, is a directed acyclic graph, as was previously proved. Sets of nodes which remove this constraint have been briefly examined with interesting results, but such work is beyond the scope of this dissertation and is left to future research.
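Putting the node definition together, the following Python sketch evaluates a small set of nodes in node-number order using the component rule for the node input vector. The dictionary layout, the function names, the SUM operator, and the two sample nodes are assumptions made for illustration; the selection and sequencing rule follows the definition of xi(k).

```python
def op_sum(x, theta_p):
    """SUM operator: total of the inputs plus the first component of theta_p."""
    return sum(x) + theta_p[0]


def evaluate_model(nodes, z):
    """Compute the output vector by evaluating nodes 1..C_N in order."""
    c_z, c_n = len(z), len(nodes)
    c_x = c_z + c_n
    y_hat = [0] * c_n   # entry i-1 is read only by nodes numbered above i
    for i, node in enumerate(nodes, start=1):
        x = [None] * c_x
        for j in range(1, c_x + 1):
            linked = node["theta_l"][j - 1] == 1
            if linked and j <= c_z:
                value = z[j - 1]                      # model input component
            elif linked and j > c_z and i > (j - c_z):
                value = y_hat[j - c_z - 1]            # output of a lower-numbered node
            else:
                value = node["theta_d"][j - 1]        # default argument
            x[node["theta_s"][j - 1] - 1] = value     # place at position theta_s^j
        y_hat[i - 1] = node["op"](x, node["theta_p"])
    return y_hat


# Hypothetical two-node model over a two-component input vector (C_X = 4).
nodes = [
    {"op": op_sum, "theta_p": (1,), "theta_d": (0, 0, 0, 0),
     "theta_l": (1, 1, 0, 0), "theta_s": (1, 2, 3, 4)},
    {"op": op_sum, "theta_p": (0,), "theta_d": (0, 0, 0, 0),
     "theta_l": (1, 0, 1, 1), "theta_s": (1, 2, 3, 4)},
]

outputs = evaluate_model(nodes, z=(3, 4))
```

Node 1 outputs 3 + 4 + 1 = 8. Node 2 reads the first input and the output of node 1, but its link to its own output is ignored, so it outputs 3 + 8 = 11.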

1.3.3 Model

As previously stated, a model, M(θ) = {N1(θ1), N2(θ2), …, NCN(θCN)}, is an indexed set of nodes, whose cardinality is CN, i.e., CN is defined as |M(θ)|. A model can be viewed as a directed acyclic graph of nodes, an example of which is shown in Figure 6. θ is a CN-tuple, called a model parameter tuple, and is an index covering an index set DM, i.e., θ ∈ DM. Each element of θ is a node definition set, θN, and is defined in section 1.3.2. The domain of model parameter tuples, called the model parameter domain, is the finite set DM = {θ : θi ∈ DNi, for 1 ≤ i ≤ CN}. We let CDNi denote the cardinality of the finite non-empty node domain DNi, i.e., CDNi is defined as |DNi|. Defined as such,

$$
|DM| = \prod_{i=1}^{CN} CDN_i .
$$

The model parameter domain, DM, determines the set of candidate models, M*, which is called the model domain. We define the model domain as M* = {M(θ): θ ∈ DM}. M* is the set of candidate models from which we would like to select an optimal model apropos of the observation data set and the model criterion function, which will now be discussed in further detail.
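To make the size of the model parameter domain concrete, the following Python sketch computes |DM| as the product of the node domain cardinalities and enumerates the parameter tuples of a toy domain. The node domains shown are hypothetical placeholders, not node definition sets from this work.

```python
from itertools import product
from math import prod


def model_domain_size(node_domains):
    """|DM|: the product of the cardinalities CDN_i of the node domains DN_i."""
    return prod(len(domain) for domain in node_domains)


def enumerate_parameter_tuples(node_domains):
    """Yield every model parameter tuple theta with theta_i drawn from DN_i."""
    return product(*node_domains)


# A toy domain with CN = 3 nodes (hypothetical node definitions).
toy_domains = [["add", "mul"], ["lag1", "lag2", "lag3"], ["const"]]
toy_size = model_domain_size(toy_domains)
toy_tuples = list(enumerate_parameter_tuples(toy_domains))

# A 24-node model with 1000 candidate definitions per node.
large_size = model_domain_size([range(1000)] * 24)
```

Even with a modest 1000 definitions per node, a 24 node model yields 10^72 parameter tuples, which is why exhaustive enumeration is only practical for toy domains.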


[Figure 6. An example of a 24 node model. The observation data set window Z(k), with indices k, k-1, k-2, k-3, ..., k-w and attributes 0 to CA-1, supplies the inputs to Nodes #1 through #24, which produce the model output vector ŷ(k).]

1.3.4 Model Criterion Function

We begin by defining the “ideal” model criterion function $\tilde{V} : \widetilde{DM} \to N$ of a problem, such that if the solution set $S_{\tilde{V}}$ defined below is not empty, then the problem is said to be solvable. We define $\widetilde{DM}$ = {θ : θ is a model parameter vector of a computable model}, where any DM ⊆ $\widetilde{DM}$, and let N represent the set of natural numbers. An ideal model criterion function $\tilde{V}$ can be viewed as providing a measure of a model’s validity. $\tilde{V}$ is defined such that for any model parameter vectors θi, θj ∈ $\widetilde{DM}$, if $\tilde{V}(\theta_i) < \tilde{V}(\theta_j)$ then model M(θi) is “better” than model M(θj) as determined by the user. The solution set of the ideal model criterion function is defined by the user to be either $S_{\tilde{V}} = \{\theta \in \widetilde{DM} : \tilde{V}(\theta) \le \tilde{\varepsilon}\}$, where $\tilde{\varepsilon}$ is a user-defined error tolerance, or the set of global optima $S_{\tilde{V}} = \{\theta \in \widetilde{DM} : \theta \text{ minimizes } \tilde{V}(\theta)\}$. However, because of difficulties to be discussed in Chapter 5, the function $\tilde{V}$ is frequently not used for model selection. Instead, a user-specified model criterion function $V : DM \times N^{|Z(k)|} \to N$ is used such that V(θ, ZT) is the model criterion value of model M(θ) given the ZT subset of the observation data set. The model criterion function V is used by the model selection algorithm in an attempt to select a model parameter vector θ ∈ $S_{\tilde{V}}$.

The model criterion function V, which is a function of the training data, has a significant effect on the model search process. Unfortunately, V is often defined with the undesirable characteristic that it is not ideal and can mislead the search process. Because of this, a model may perfectly forecast or classify the training data, but do very poorly on additional data not used during the model selection process, e.g., the test data Z - ZT. Moreover, as a practical matter, a consideration in selecting a model criterion function is how quickly it can be evaluated because the function is evaluated frequently. The user may specify a model criterion function that can be rapidly evaluated, but desire a more comprehensive function to test a selected model that can be evaluated less frequently. Such a function may provide an entirely different measure for the user than the model criterion function does. For example, the model criterion function may measure an object’s size. However, the user may really want to minimize the object’s weight, but such a calculation may be more time consuming and may even involve a physical experiment.

In order to address these problems, CAFECS also uses another user-defined function to evaluate a model M(θ) called the model criterion test function, $\bar{V} : DM \times N^{|Z(k)|} \to N$, such that $\bar{V}$(θ, Z - ZT) is the model test criterion value of model M(θ) given the Z - ZT subset of the observation data set used for model testing. It is important to note that the function $\bar{V}$ is not used by the model selection algorithm and has no effect on the model ultimately selected by the CAFECS system. Its purpose is to provide feedback to the user in order to better assess the probability P(θ ∈ $S_{\tilde{V}}$) for a selected model parameter vector θ. Because $\bar{V}$ is independent of the model selection algorithm, it can be useful in monitoring the progress of the model selection process.

Frequently, the model criterion function V and the model criterion test function $\bar{V}$ are defined such that V(θ, Z') = $\bar{V}$(θ, Z'), where Z' ⊆ Z. However, this is not always the case. For example, V(θ, ZT) can be equal to the size of a decision tree constructed with attributes defined by M(θ) and the training set ZT, while $\bar{V}$(θ, Z - ZT) can be equal to the accuracy of the decision tree in classifying the test data. An example of this is discussed in section 4.4.

Allowing the user to define the model criterion function enables the CAFECS system to be applied to a wide variety of domains. The model criterion function can be complex. For example, V can be defined by a deterministic simulation parameterized by θ and ZT. Currently, we do require that such a simulation define a function, but an interesting extension of this research would be to relax this requirement to permit stochastic simulation. The default model criterion function V calculates the mean-squared error of the training data for a model; it is illustrated in Figure 7 and defined below.

[Figure 7. An Example of a Model Criterion Function. The model M(θ) takes the model parameter vector θ and the input vector Z(k) from the observation data set ZT and produces the model output vector ŷ(k). The model error function ξ compares ŷ(k) with the desired output y(k) to give the model error vector E(k). The model mean error function computes (1/T) Σ E(k) for k = 0 to T-1, and the minimum component of the result is the model criterion V(θ, ZT).]

For example, we can define a model error function, $\xi : N^2 \to N$, that calculates the error of a model node i at index k as:

$$
\xi(y(k), \hat{y}_i(k;\theta_i)) = (y(k) - \hat{y}_i(k;\theta_i))^2
$$

when forecasting, or

$$
\xi(y(k), \hat{y}_i(k;\theta_i)) =
\begin{cases}
0; & y(k) = \hat{y}_i(k;\theta_i) \\
1; & \text{otherwise}
\end{cases}
$$

when classifying.

We then define the model error vector of a model M(θ) given Z(k) to be E(k) where:

$$
E_i(k) = \xi(y(k), \hat{y}_i(k;\theta_i)), \quad 0 \le i < CN.
$$

In order to obtain the mean error of the model nodes for the subset $Z^T$ of the observation data set used for model selection, i.e., training, we define the model mean error vector, $ME^T$, to be:

$$
ME^T = \frac{1}{T} \sum_{k=0}^{T-1} E(k), \quad 0 < T < K.
$$

Up until this point, we have considered the output of each model node. However, when selecting a model M(θ), we designate one node as the model output node, MN, where the output of the model is the output of the model node MN. We define MN to be the index of the minimum component of the model mean error vector, $ME^T$, where 1 ≤ MN ≤ CN. If more than one component of $ME^T$ is the minimum value, MN is arbitrarily defined to be the minimum index of these equal components. We can now simply define the model criterion function to be

$$
V(\theta, Z^T) = ME^T_{MN}.
$$
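The default criterion can be sketched in a few lines of Python. This is a minimal illustration, not the CAFECS implementation: the function names and the nested-list layout of the node outputs are assumptions, node indices are 0-based here, and the mean is computed in floating point rather than over the natural numbers.

```python
def squared_error(y, y_hat):
    """Model error function when forecasting."""
    return (y - y_hat) ** 2


def classification_error(y, y_hat):
    """Model error function when classifying: 0 if correct, 1 otherwise."""
    return 0 if y == y_hat else 1


def default_model_criterion(desired, node_outputs, error_fn=squared_error):
    """Return (V, MN) for one model over the training indices k = 0..T-1.

    desired         -- desired outputs y(k), length T
    node_outputs    -- node_outputs[k][i] is the output of node i at index k
    The layout and 0-based node index are assumptions of this sketch.
    """
    T = len(desired)
    CN = len(node_outputs[0])
    # Model mean error vector: one mean error per node.
    mean_error = [
        sum(error_fn(desired[k], node_outputs[k][i]) for k in range(T)) / T
        for i in range(CN)
    ]
    # min() returns the first minimal index, so ties go to the lowest node index.
    MN = min(range(CN), key=mean_error.__getitem__)
    return mean_error[MN], MN


# Two nodes, three training indices: node 1 has the smaller mean squared error.
V, MN = default_model_criterion([1, 2, 3], [[1, 0], [2, 2], [5, 3]])
```

Passing `classification_error` as `error_fn` gives the misclassification rate of the best node instead of its mean squared error.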


1.3.5 Formal Problem

The problem we address is how to use the information in the observation data set, Z, and the model criterion function V to select a model parameter tuple θ ∈ $S_{\tilde{V}}$ ∩ DM. However, it is often difficult to determine whether θ ∈ $S_{\tilde{V}}$ for a given θ; these difficulties are discussed in Chapter 5. As a consequence, after a solution $\hat{\theta}$ ∈ DM has been selected, we measure the validity of the selected model M($\hat{\theta}$) using $\bar{V}$($\hat{\theta}$, Z - ZT), the model criterion test function applied to the test observation data set. The reason for this is that we would like to select a model M($\hat{\theta}$) that characterizes patterns inherent in the system being modeled rather than patterns that only exist in the training data, in order to increase the probability that $\hat{\theta}$ ∈ $S_{\tilde{V}}$ ∩ DM is true.

Defined as such, the problem can be viewed as a search of the model domain DM, given the set of node definition sets DN, the model cardinality CN, the model criterion function V, and the observation data training set ZT, such that the ideal model criterion function $\tilde{V}$ is minimized. The three greatest difficulties of this problem are:

1. For most non-trivial problems, we expect the cardinality of the model domain DM to be very large, making an exhaustive search for an optimal model parameter vector $\hat{\theta}$ that minimizes V infeasible. Even though the cardinality of DM can be reduced by the user, it is often unclear how to accomplish this so that some or all of the solutions, i.e., elements of $S_{\tilde{V}}$, are not removed from DM.


2. The model criterion function V can be highly non-linear. This either prevents or renders ineffective locating an optimal model using standard mathematical techniques or heuristic algorithms such as hill-climbing.

3. An optimal model parameter vector $\hat{\theta}$ that minimizes both V and $\bar{V}$ may not be an element of the solution set $S_{\tilde{V}}$ of the ideal model criterion function $\tilde{V}$, which is not given.

These difficulties are discussed in greater detail in Chapter 5.


Chapter 2
Background Information and Related Research

The ideal condition would be, I admit, that men should be right by instinct; but since we are all likely to go astray, the reasonable thing is to learn from those who can teach.
––Sophocles, Antigone, Ode II, 442 B.C.

The background information and related research cover three areas: System Modeling, Evolutionary Theory, and Evolutionary Computation. System modeling is a broad area of which this thesis examines only a small subset. Evolutionary Theory provides the foundation on which Evolutionary Computation is built. The main areas of Evolutionary Computation research are genetic algorithms (including genetic programming and classifier systems), evolution strategies, and evolutionary programming.

2.1 System Modeling Related Research

A general introduction to system modeling was presented in section 1.1. Throughout history, humans have constructed models to better understand various systems. A general discussion of system modeling can be found in [van Gigch 91]. In particular, we are interested in models that are functions. Such functions can be defined using a number of methods, e.g., mathematical equations, computer simulations, and data sets detailing observations of a physical system. Inducing mathematical models from observations of a system is known as system identification. A brief introduction to model building and system identification can be found in [Fasol & Jorgl 80]. A survey on system identification is presented in [Astrom & Eykhoff 71]. Other works on system identification include [Ljung 87] and [Soderstrom & Stoica 89]. With the advent of computers, developing models using simulations has become practical in many cases. A discussion of the use of computer simulations for model building is found in [Kheir 88].

For many systems, the observations used to induce a model of the system are temporally related and form a time-series. In many cases, excellent models of time-series can be constructed by means of regression analysis. In the area of using regression to build linear models of discrete time-series, [Box & Jenkins 76] is essential reading. Readers interested in the Box-Jenkins approach will find [Jenkins 79] and [Pankratz 83][91] helpful. There are many fine works on the general subject of regression. I direct the interested reader to two on linear regression, [Seber 77] and [Montgomery & Peck 92], and two on non-linear regression, [Seber & Wild 89] and [Tong 90]. More recent techniques include using temporal neural networks [Lapedes & Farber 87][Sastry et al. 94] and Multivariate Adaptive Regression Splines [Friedman 91].

Unfortunately, in spite of these powerful techniques, modeling many complex systems remains elusive. [Weigend & Gershenfeld 93] is one of many sources that show the strengths and weaknesses of some of the more advanced techniques in the above-mentioned areas on complex time-series.

The heuristic algorithms being used by the CAFECS system to develop forecasting and classification models fall under the umbrella of evolutionary computation. However, before the concepts of evolutionary computation are presented, a brief overview of evolutionary theory is provided that focuses on the key aspects we are interested in.


2.2 Evolutionary Theory

And many lines of organisms must have perished then, and been unable to reproduce their kind. For whatever you see feeding on the vital air, either craft, strength, or finally mobility has been protecting and preserving that race from its earliest times.
––Lucretius cited in [Haldane 32, p.112]

The theory of evolution provided the inspiration for the development of evolutionary computation, and we examine it in that light rather than entering the debate over its adequacy. The theory itself proves to be sufficiently useful in its own right. By examining how natural selection and cooperation function in nature to evolve organisms, we can better understand how they may be employed to improve the process of evolving models.

The generally accepted centerpiece of evolutionary theory is Darwin’s landmark work, “On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life” [Darwin 1859], although he was neither the first nor certainly the last to contribute to this theory. Darwin claimed natural selection is the main force driving evolution, although he was convinced there were others [Darwin 1859], one of which he later identified as sexual selection, delineated in [Darwin 71]. Briefly, natural selection is the theory that more fit individuals are more likely to survive long enough to reproduce than less fit individuals. Natural selection depends on hereditary theory such that the offspring inherit the characteristics that made their parents more fit. Sexual selection marks the struggle of more fit individuals to reproduce with greater quality and quantity than less fit individuals by competing for mates as opposed to struggling for survival. Thus, a population is continually driven by natural selection and sexual selection, inter alia, to become more fit. It should be noted that natural selection is often used in a more general sense than that used by Darwin, one which subsumes sexual selection, e.g., Fisher’s fundamental theorem of natural selection [Fisher 30].

One of the greatest weaknesses of Darwin’s work was that it failed to incorporate the notion of genetics to explain the hereditary variation of character traits. The tragedy lies in the fact that Gregor Mendel was a contemporary of Darwin, but there is no indication that Darwin was aware of Mendel’s ground-breaking work in genetics done in a small monastery in Brno, in the present-day Czech Republic [Stebbins 71, p. 11]. Mendel’s principles of heredity were published in 1866 and later translated into English in 1909 [Mendel 1866] as cited by [Wright 69]. The combination of Mendelian genetics and Darwin’s evolutionary theory is called the Neo-Darwinian synthesis. This theory is first seen in [Fisher 30], [Wright 31], [Ford 31], and [Haldane 32]. Since then it has been continually updated by incorporating other concepts such as molecular genetics and has become known as the Neo-Darwinian paradigm.

Before proceeding further with this analysis of evolution, it is helpful to define a few genetic terms and concepts that will be used. Note that these definitions and descriptions are often simplified for the sake of brevity. However, they should suffice for this work. Readers desiring more detailed information are directed to the large body of work on genetics such as [Suzuki, et al. 89].

A genotype is the genetic code of an individual. A genotype is composed of one or more chromosomes. The set of genotypes of a population is called the gene pool.

A chromosome is a strand of DNA. A chromosome can be segmented into sections called genes. A gene is meant to identify a section of a chromosome responsible for a characteristic or trait in the individual created with that chromosome, e.g., eye color. The position of a gene in a chromosome is called its locus.

The genetic code for a specific gene may vary within the gene pool. Each different DNA sequence within a gene is called an allele or allelomorph of that gene. In a species, each gene may have a large number of alleles. Thus, an allele can be viewed as a gene’s value. These concepts are illustrated in figure 8.

[Figure 8. An example of a genotype. Chromosomes are segmented into genes; an allele is the specific code sequence of a gene (a gene’s value), e.g., gttacg.]

The set of characteristics of an individual that are determined by its genotype is called the individual’s phenotype. Hence, the genotype is what is reproduced and the phenotype, inter alia, determines the fitness of the individual.

The process of converting a genotype into a mature, adult individual is called ontogenesis. In some instances ontogenesis, also referred to as ontogenetic development, is used in a broader sense and is defined as the change an individual organism undergoes during its life from its conception until it dies.


In some cases, a gene may be responsible for one and only one characteristic. However, for many genes this is not the case. Polygeny defines the case when more than one gene affects a characteristic of an individual. Pleiotropy defines the case when one gene affects more than one characteristic in an individual. The renowned evolutionary biologist Theodosius Dobzhansky [70] notes pleiotropy and polygeny are common in organic life.

Figure 9 illustrates the model of evolution developed thus far.

[Figure 9. A simplified model of evolution. Within the environment, a genotype undergoes ontogenesis to become an individual, natural selection produces a mated pair, and reproduction yields a new genotype.]

It is evident in this figure that evolution is a cyclical process. Starting from the left: the genotype, contained in a zygote (fertilized egg), acts as the input to the process of ontogenesis, which creates the corresponding individual. The environment also has an effect on the ontogenetic process. The individual then undergoes the process of natural selection, which stochastically selects the most fit individuals for reproduction. (In reality, ontogenesis and natural selection occur simultaneously but are treated as occurring sequentially for clarity.) The process of natural selection is determined by the environment. Sexual selection also occurs as part of this natural selection process, producing mated pairs. Species that reproduce asexually can be included in the above model by bypassing the need to select a mate. The mated pair reproduce copies of their genotypes that are combined to create a new genotype. This new genotype then repeats the process. This process continues until no more new genotypes are created, at which time the species becomes extinct when that generation perishes. For the sake of simplicity, the model does not illustrate many other aspects of evolution, such as the fact that an individual may reproduce more than once.

It is important to note that populations evolve, not individuals. Individuals adapt, but do not pass on these adaptations genetically. Evolution can be measured as the change in the gene frequency of each gene in the gene pool over a period of time. The gene frequency of a gene is defined as the set of allele ratios for that gene. The allele ratio of an allele of a gene is the ratio of the number of times that allele of that gene is in the gene pool to the number of times that gene is in the gene pool. Evolution need not occur at a slow, constant rate, as Eldredge & Gould point out in their theory of punctuated equilibrium [Eldredge & Gould 72][Gould & Eldredge 93]. According to this theory, evolution may occur in spurts: long periods of little change are interrupted by short periods in which very significant change can occur.
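The definition of gene frequency can be made concrete with a small Python sketch. The representation is an assumption made for illustration: each genotype is a string with one character per gene, so the locus is simply a string index, and the two gene pools are invented examples.

```python
from collections import Counter


def gene_frequency(gene_pool, locus):
    """Return the allele ratios of the gene at `locus` as a dict.

    Each allele ratio is (occurrences of that allele at the locus) divided by
    (occurrences of the gene in the pool), i.e., the number of genotypes here.
    """
    counts = Counter(genotype[locus] for genotype in gene_pool)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Two hypothetical generations of a four-member gene pool with two genes.
generation_1 = ["AB", "Ab", "aB", "AB"]
generation_2 = ["AB", "AB", "AB", "aB"]

before = gene_frequency(generation_1, locus=1)
after = gene_frequency(generation_2, locus=1)
```

Comparing `before` and `after` shows the kind of change the text describes: at the second locus, allele `b` has disappeared from the pool between the two generations.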

To better understand how evolution works, the concept of the fitness landscape, introduced in a classic paper by renowned evolutionary biologist Sewall Wright [32], is helpful. A fitness landscape is a visualization of the mapping of a population of genotypes to their fitness where each genotype is adjacent to, i.e., a neighbor of, another genotype if and only if the two genotypes differ by only one gene. Normally, such a landscape would require a dimension for each gene and one dimension for the genotype fitness. Thus, a 100-gene genotype requires a 100-dimensional space to map each member, with each axis specifying the possible alleles of one gene. In addition, one more dimension is required to specify the fitness of each member. This space is increased further if the sequences of the genes in the genotypes are considered. In order to make visualization of such a space possible, Wright compressed the dimensions required to define the genotypes to two, even though this is clearly inadequate, and then added a third dimension, height, for the genotype fitness. The result is a fitness landscape of a gene pool that can be drawn topographically as shown in figure 10.

[Figure 10. An example of a fitness landscape, drawn topographically with a key ranging from low fitness to high fitness.]
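The adjacency relation that defines a fitness landscape can be sketched as follows. This is an illustrative Python fragment under assumed conventions: a genotype is a tuple with one allele per gene, and `alleles_per_gene` lists the possible alleles at each locus; neither name comes from the original work.

```python
def are_neighbors(g1, g2):
    """Two genotypes are adjacent if and only if they differ in exactly one gene."""
    return len(g1) == len(g2) and sum(a != b for a, b in zip(g1, g2)) == 1


def neighbors(genotype, alleles_per_gene):
    """Yield every genotype that differs from `genotype` in exactly one gene."""
    for locus, alleles in enumerate(alleles_per_gene):
        for allele in alleles:
            if allele != genotype[locus]:
                yield genotype[:locus] + (allele,) + genotype[locus + 1:]


# A 100-gene genotype with two alleles per gene has one neighbor per gene.
origin = (0,) * 100
origin_neighbors = list(neighbors(origin, [(0, 1)] * 100))
```

Each of the 100 neighbors lies along a different axis, which is why the landscape needs one dimension per gene before the fitness dimension is added.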


Wright [32] postulates that the process of evolution can be viewed as the process of a species moving from the lower peaks to the higher peaks in a fitness landscape. Thus, a species whose members are all mapped around one peak would most likely evolve to the top of that peak given enough time. However, in such a space there would be many peaks interspersed with vast valleys where genotypes do not produce viable individuals. A species could become trapped on a single peak. Imagine the peaks in the fitness landscape as a group of islands separated by water, where the sea level represents the minimal fitness level for survival. This process is further complicated by the fact that continual changes in the environment result in a continual shifting of this fitness landscape. Certain regions grow higher while others grow lower. In keeping with the island analogy, this may cause new islands to rise out of the sea and other islands to submerge. Species confined to an island that submerges and that cannot migrate off go extinct. Note that migrating long distances on the fitness landscape is difficult because it requires a number of genetic changes in a genotype to occur simultaneously within the same generation. However, the higher the genotype of an individual is on the landscape, the greater the probability that the individual will survive to reproduce. Evolution is envisioned as a never-ending search for higher ground on this dynamic landscape.

Wright [32] notes that increasing natural selection and decreasing mutation cause the successful genotypes to be concentrated around the higher peaks they already cover. However, this results in a high concentration of individuals with nearly the same genotype. The subsequent close inbreeding of the species reduces its ability to adapt to change and may lead to extinction as the fitness landscape changes. Decreasing selection and increasing mutation have the opposite effect of allowing the successful genotypes to spread to the lower regions surrounding the peaks. This has the adverse effect of decreasing the percentage of highly fit individuals and lowers the fitness of the species. However, it increases the probability of finding another fitness peak. This may be critical to the species’ survival should selection pressures increase, i.e., their islands begin to sink. It is important to note that the mutation rate must be considered as a function of the population size since mutation is applied against each individual. A high mutation rate with a low population results in a random walk across the fitness landscape and may quickly result in extinction.

One other possibility is considered by Wright [32], that of a species subdividing into a number of sub-populations with rare crossbreeding between sub-populations. These sub-populations wander separately across the fitness landscape. When a sub-population encounters a much higher fitness region, its numbers expand. This increases the probability that crossbreeding will occur with other sub-populations, moving them toward this higher region as well. Returning to the island analogy, if a species is located on a number of islands and one or more begin to sink, by means of crossbreeding the species can rapidly migrate off the sinking island without relying on random mutation. In effect, the species does not genetically “put all its eggs in one basket.” Wright [32] concludes that the “subdivision of a species into local races provides the most effective mechanism for trial and error in the field of gene combinations.” This concept has become known as the “shifting balance theory.” The earliest published record of this theory is [Wright 29]; it is presented in detail in [Wright 31], where he concludes:

“in a large population, divided and subdivided into partially isolated local races of small size, there is a continually shifting differentiation among the latter (intensified by local differences in selection but occurring under uniform and static conditions) which inevitably brings about an indefinitely continuing, irreversible, adaptive, and much more rapid evolution of the species.” [Wright 31]

These sub-populations of individuals with nearly homogeneous genotypes have since become known as demes, a term introduced in [Gilmour & Gregor 39] as cited by [Wright 69].
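The idea of partially isolated sub-populations with rare crossbreeding can be illustrated with a toy Python sketch. It is not Wright’s model and not the CAFECS algorithm: the bit-string genotypes, truncation selection, one-point crossover, ring-shaped migration, and every parameter value are assumptions chosen only to keep the example short.

```python
import random


def evolve_demes(fitness, n_demes=4, deme_size=10, n_genes=8,
                 generations=50, migrate_every=10, mutation_rate=0.05, seed=0):
    """Evolve several demes separately, exchanging one migrant only rarely."""
    rng = random.Random(seed)
    demes = [
        [tuple(rng.randint(0, 1) for _ in range(n_genes)) for _ in range(deme_size)]
        for _ in range(n_demes)
    ]
    for generation in range(1, generations + 1):
        for d, deme in enumerate(demes):
            ranked = sorted(deme, key=fitness, reverse=True)
            parents = ranked[: deme_size // 2]   # truncation selection
            children = [ranked[0]]               # keep the best member unchanged
            while len(children) < deme_size:
                a, b = rng.choice(parents), rng.choice(parents)
                cut = rng.randrange(1, n_genes)  # one-point crossover
                child = a[:cut] + b[cut:]
                # flip each gene with probability mutation_rate
                child = tuple(g ^ 1 if rng.random() < mutation_rate else g
                              for g in child)
                children.append(child)
            demes[d] = children
        if generation % migrate_every == 0:
            # Rare crossbreeding: each deme receives the best member of its
            # neighbor in a ring, replacing its last member.
            bests = [max(deme, key=fitness) for deme in demes]
            for d in range(n_demes):
                demes[d][-1] = bests[(d - 1) % n_demes]
    return demes


# Fitness is the number of 1-genes (a deliberately simple, single-peak landscape).
final_demes = evolve_demes(fitness=sum, seed=1)
```

Raising `migrate_every` isolates the demes more strongly, while setting it to 1 makes them behave almost like a single population; the interesting regime suggested by the text lies between these extremes.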

There are two significant deficiencies of the fitness landscape model: it fails to account for both the ontogenetic and sociogenetic development of individuals. Without recognizing these deficiencies, fallacious conclusions regarding evolution can be drawn. These deficiencies become evident by considering a fitness landscape in which the environment is held constant and in which the genotypes only reproduce exact replicas of themselves. The model would indicate that in such a scenario the fitness of the individual genotypes would remain constant, but we will see that this is not always the case.

The first deficiency of the model is that it fails to account for the ontogenetic development of an individual defined by its genotype. The result is that species with advanced central nervous systems that enable adaptive behaviors are not adequately incorporated into the fitness landscape. An individual with an adaptive nervous system can discover behaviors, e.g., the use of fire, that increase the fitness of the individual and may be passed to its progeny without necessitating that the genotype change. Failing to take this capability into account may lead one to conclude that the adaptive ability of a species is increased by decreasing the time between generations. For many species, this is certainly the case. Insect populations with short generation periods have been shown to rapidly adapt to changes in the environment. However, humans have taken another route to increase our fitness and adaptability. We have sacrificed short generation periods in favor of relatively long periods so that behaviors and technologies developed may be satisfactorily passed to the next generation.

The treatment of evolution would not be complete without mentioning its ontogenetic effect on the brain. The brain is a population of neurons, so is it possible that the rules of evolution create complex structures in the brain as well? If so, it strengthens the idea that these rules can be used to develop other complex systems as well. Many have considered the forces of natural selection acting on the population of neurons that comprise the brain. Donald Hebb [49] postulated the formation of groups of neurons called cell-assemblies in response to environmental factors. He speculated that these cell-assemblies acted as the base unit of thought and formed complex hierarchies. In Design for a Brain, Ashby [52] proposed that neurons formed highly dynamic groups in order for the brain to achieve a desirable dynamic state he defined as ultrastability. Ashby [52] states:

The work ([Ashby, 52]) also in a sense develops a theory of the ‘natural selection’ of behavior patterns (in the brain).... Just as the gene-pattern, in its encounters with the environment, tends towards ever better adaptation of the inherited form and function, so does a system of step- and part-functions tend towards ever better adaptation of learned behavior.... Evolution has thus had to cope, phylogenetically, with all the difficulties of integration that beset the individual ontogenetically. The tendency to ‘chaos’ (due to random change) ... thus occurs in the species as well as in the individual. In the species the power of natural selection has shown itself stronger than the tendency to chaos. Natural selection is effective in proportion to the number of times that the selection occurs: in a single generation it is negligible, over the ages irresistible. And if the unrepeated action of ultrastability seems feeble, might it not become equally irresistible if the nervous system was subjected to its action on an equally great number of occasions? How often does it act in the life of, say, the average human being? I suggest that in those reactions where interaction is important and extensive, the total duration of the learning process is often of ‘geological’ duration when compared with the duration of a single incident, in that the total number of incidents contributing to the final co-ordination is very large.

However, arguably the most complete treatment of evolution on the ontogenetic development of the brain is a trilogy by Edelman [87][88][89], the cornerstone of which is Neural Darwinism: The Theory of Neuronal Group Selection [Edelman 87]. Although still highly controversial, Edelman’s Theory of Neuronal Group Selection is a detailed explanation of how evolutionary forces play a key role in the development of the brain. These works share a common thread: they view the brain as an interactive population of computation mechanisms and discard the traditional algorithmic brain paradigm.

The second deficiency of the fitness landscape model is that it does not account for the sociogenetic development that certain genotypes are capable of. Sociogenetic development, also called sociogenesis, is the characteristic of closely homogeneous members of a species to form groups, also referred to as colonies or societies. The individuals may have simple nervous systems, as is the case with certain ants, bees, and the polyps that make up a Portuguese man-of-war. Individuals may also have more advanced nervous systems, as is the case with humans or wolves. It appears that individuals form groups because it increases the fitness of the individuals. Thus, the genotype and environment of the individuals need not change, but their fitness can increase over time. Failure to take this property into account may lead one to derive the false conclusion that only the “fittest” individuals will survive. It also raises the question of how multi-cellular organisms could ever survive, let alone evolve. Why wouldn’t the stronger cells simply attack the weaker cells? The answer lies in the granularity of what an “individual” is. A group of individuals can also be considered an individual, just as we consider animals and plants to be individuals as well as colonies of cells. Thus, individuals can and often do increase their fitness by forming groups. An individual wolf is normally no match for an elk, but it is another matter when an elk faces a pack of wolves.


The importance of sociogenesis is critical to the understanding of evolution. The early theories regarding evolution ignored or minimized the role of sociogenesis although Haldane [32] did state “For in so far as it makes for the survival of one’s descendants and near relations, altruistic behavior is a kind of Darwinian fitness, and may be expected to spread as the result of natural selection.” Even today, the theory of evolution is often presented without mention of it. Sociogenesis is the key component in evolutionary theory that provides explanations of how a species can move from competition to cooperation, from independence to interdependence, and ultimately from self-interest to altruism; for these characteristics are found in various species throughout nature. An evolutionary theory that incorporates sociogenesis would indicate that a society of individuals working together to increase their common fitness can be more fit than a genetically identical group of the same number of individuals where each individual only attempts to maximize its own fitness and treats other members within the same group as competitors.

The key to sociogenesis is that it addresses the granularity question of what is an individual. This granularity issue is not new. Leibniz viewed living organisms as a plenum in which smaller organisms lived out their lives, a view certainly consistent with modern cell theory [Wiener 61]. William Wheeler presented the concept of an ant colony as an individual organism in 1911 [Wheeler 11]. According to this view, the individual ants act like cells in a body (although connected much more loosely). Most ants sacrifice the ability to reproduce and in many cases readily sacrifice their very lives for the benefit of the colony. However, it is important to note that this “altruistic” behavior of ants is restricted to their own species. B. Hölldobler and E. O. Wilson [90], renowned for their work with ants, point out:


… hundreds of cases on interspecific symbioses among ant species that have come to light encompass almost every conceivable mode of commensalism and parasitism. True cooperation, however, is rare or nonexistent. No verified examples are yet known of mutualism, in which two species cooperate to the benefit of both. All of the relationships carefully analyzed to date are unilateral, with one species profiting and the other species either remaining unaffected or, in the great majority of cases, suffer from the attentions of its partner.

Although not conclusive, this supports the idea that the benefits of intraspecific cooperation are far different from those of interspecific cooperation and have genetic roots.

A detailed theory of sociogenesis, as it relates to evolutionary theory, that addresses this individual granularity issue and that takes the notion of intraspecific cooperation further, has been delineated by evolutionary biologist Vladimir Novak [82]. Novak calls this specific theory the “principle of sociogenesis” which he defines as

...the phylogenetic association of individuals of the same species (usually the offspring of the same maternal individual) in colonies and the progressive integration of such colonies to a higher grade individual. [Novak 82]

Novak partitions all organisms into five grades, the lowest being monomolecular organisms, then unicellular, simple multicellular, compound multicellular, and the highest being societies of compound multicellular organisms. In addition, he claims all higher grade species have passed through each of the lower grades during their phylogenesis. A species moves up each grade by undergoing the following process of five overlapping phases (with the exception of monomolecular species, which did not require phases 1 and 4):

1. Non-separation: Parent and children remain connected forming a colony.


2. Differentiation: Children develop differently; a division of labor is observed.

3. Formation of internal environment: An internal economy develops within the colony to ensure survivability of members as opposed to external non-members. This mitigates the force of natural selection acting on individual members.

4. Evolution of correlations: Members of the colony respond to external environment in a coordinated manner by forming mechanisms.

5. Integration to higher grade individual: Members of the colony integrate to the extent that the colony itself is considered an individual (at the next higher grade level).

What appears to be happening is this: in many cases, the survivability of an individual is increased by forming a group of similar individuals. So natural selection encourages sociogenesis. However, the purpose of the group is to mitigate the effects of natural selection on individuals within the group; that is why the individuals joined. As natural selection decreases within the group, variation of the members appears because each member need not compete so hard for survival. As the variation intensifies, members find niches that they specialize in. Once members have differentiated themselves, an internal economy can form because we now have members who specialize in providing a good or service and who are also in need of goods or services due to their specialization. This causes the group to integrate further. Members continue to specialize and rapidly reach the point where most members cannot survive outside the group. This specialization continues until the group becomes so highly integrated that it is considered to be an individual. The process then repeats at a higher level. Note that for high grade organisms, the only way known to reproduce is to create a separate single cell that is capable of duplicating this phylogenetic evolution of the species, which occurred over millions of years, ontogenetically in a matter of a couple of years at the most.

Novak [82] provides a substantial measure of biological evidence to support this theory, but it certainly has not been proven and may contain substantial flaws. However, it is attractive to explore for three reasons. First, it permits a growth of complexity in an organism that can be orders of magnitude greater than a gradual approach. Second, it provides a clue that the cooperation of individuals is not only useful for complex organisms but essential. Third, it addresses the chicken-and-egg problem of complex organisms. How does a complex organism reproduce? Must an organism contain a factory to duplicate itself? The principle of sociogenesis requires no such factory. No matter how complex the organism is, it can start as a single cell that duplicates itself. Each of us began our existence as a single parent cell that rapidly reproduced.

At first glance, Darwin’s struggle-for-survival theory alone paints a bleak picture of an environment devoid of cooperation: a dog-eat-dog world. However, we observe the contrary, that cooperation is ubiquitous. The point of this discussion is that development in biological systems would be severely curtailed if cooperation were not permitted. It stands to reason that computational models developed in an attempt to harness the power of evolution will be curtailed as well if cooperation is not permitted.

51

In summary, the central role natural selection plays in evolving complexity was presented. In addition, a key piece that Darwin’s theory lacked, the role of genetics and how it affects heredity, was examined. The fitness landscape model was delineated to aid our understanding of the evolutionary process as an optimization process. It also helped clarify the notion that segmenting a species into sub-populations provides it a strategic advantage by permitting the simultaneous exploration of multiple areas of the landscape. The fitness landscape model also demonstrated the need to address two additional aspects of evolution. The first is the role ontogenetic development plays in evolution. Of particular interest is how an individual with a brain that learns can increase its fitness and that of its progeny and reduce the need to modify their genotypes. The second is the essential role cooperation plays in evolution. Not only can cooperation increase the fitness of individuals without altering their genotypes, but it also facilitates the formation of complex colonies that can be considered organisms in their own right.

Pondering the role evolution has played in creating the plethora of complex organisms that inhabit this planet, including the vast computational abilities of the human brain, has left more than a few people nonplussed at times, overwhelmed by awe. Charles Darwin notes this awe of the complexity of organisms’ forms and their interdependence as he concludes his seminal work, The Origin of Species, as follows:

It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, with birds singing on the bushes, with various insects flitting about, and with worms crawling through the damp earth, and to reflect that these elaborately constructed forms, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us. These laws, taken in the largest sense, being Growth with Reproduction; Inheritance which is almost implied by reproduction; Variability from the indirect and direct action of the conditions of life and from use and disuse: a Ratio of Increase so high as to lead to a Struggle for Life, and as a consequence to Natural Selection, entailing Divergence of Character and the Extinction of less-improved forms. Thus, from the war of nature, from famine and death, the most exalted object which we are capable of conceiving, namely, the production of the higher animals, directly follows. There is grandeur in this view of life, with its several powers, having been originally breathed by the Creator into a few forms or into one; and that, whilst this planet has gone cycling on according to the fixed law of gravity, from so simple a beginning endless forms most beautiful and most wonderful have been, and are being evolved [Darwin 1859].

A few have also considered how these forces can be harnessed to solve computational problems and better understand complex adaptive systems in general. This led to the formation of a discipline of computer science now known as evolutionary computation.

2.3 Evolutionary Computation

To this purpose the philosophers say that Nature does nothing in vain, and more is in vain when less will serve; for Nature is pleased with simplicity, and affects not the pomp of superfluous causes.
-- I. Newton, Mathematical Principles of Natural Philosophy

Evolutionary computation (EC) is the area of computer science that encompasses research in stochastic optimization algorithms based on rules derived from evolutionary theory. The above quote by Newton seems to stand in stark contrast to the tremendous complexities found in nature of which the human body is but one example; and yet for many, the statement rings true. Chaos theory has demonstrated that a wide variety of extremely complex yet deterministic systems can be derived from very simple rules. The Neo-Darwinian paradigm describes a simple system yet results in the diversity of life inhabiting this planet and perhaps throughout the universe as well. Evolutionary computation is an attempt to utilize these simple rules of nature for the purpose of solving problems not amenable to traditional approaches as well as to better understand not only nature, but complex adaptive systems in general.


Although evolutionary computation has only recently begun to experience rapid growth, its roots go back to the early days of computers. Alan Turing considered the parallels between evolution and machine learning “obvious,” and in 1950 he outlined a machine system where intelligence is evolved on a computer rather than being programmed [Turing 50]. In this system, the “structure of the child machine” was equated with hereditary material that was changed by mutation. Natural selection was to be determined by the user who would use his intelligence to evaluate the current system and attempt to expedite the evolutionary process whenever possible. This system was to evolve a single “mind” from infancy to adulthood. The proposed system relied strictly on directed mutation and was never built.

In its current state, EC is composed of three main lines of research, each developed independently during the 1960s: genetic algorithms (GAs), evolution strategies (ES), and evolutionary programming (EP). The first two of these, GAs and ES, will be combined into a single system in this research and are described in detail in section 2.4. How EP differs from these other two approaches will also be described briefly.

EC approaches to search and optimization have proven themselves to be robust in a number of applications where the dimensionality of the solution space is large and the solution gradient is non-monotonic, making the problem unsuitable for other techniques such as regression or calculus-based approaches. Since the environment in which organic creatures compete can be viewed as a stochastic process, it seems reasonable to use an evolutionary computation approach to attempt to develop and optimize models of other stochastic processes.


For the purposes of EC, evolution is an optimization process of self-reproducing data objects. These data objects are referred to as genotypes, chromosomes, or simply individuals in the literature. Each data object represents a candidate solution or part of a candidate solution to a problem. Simple solutions can be represented by a single data object while complex solutions often require multiple data objects to represent a single solution. In this thesis, a genotype will refer to the set of data objects used to represent a single solution. A chromosome is defined as a single data object and may only represent a portion of a solution. A gene is defined to be a field in a chromosome and an allele is an acceptable value for a gene.

An EC algorithm describes how a population of genotypes continually reproduces and determines which genotypes are removed from the population each generation. The goal is that as the population evolves, genotypes in the population will tend to represent better solutions than their parents. The essence of EC algorithms can be stated concisely in the following five steps:

1. A limited population of candidate solutions is generated.

2. The population is reproduced with deviations.

3. The fitness of each member of the population is determined.

4. The least desirable members are removed from the population.

5. The process returns to step 2 and repeats.

Different EC theories and technologies vary these steps but the basic theme remains the same.
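The five steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the CAFECS algorithm: the function names (`evolve`, `mismatches`, `random_bits`, `flip_one`), the bit-string problem, the population size, and the rule of keeping the best of parents plus offspring are all assumptions chosen for brevity. Lower fitness values are treated as better.

```python
import random

def evolve(fitness, random_candidate, mutate, pop_size=20, generations=50, seed=0):
    """Generic EC loop (illustrative only); lower fitness is better."""
    rng = random.Random(seed)
    # Step 1: generate a limited population of candidate solutions.
    population = sorted((random_candidate(rng) for _ in range(pop_size)), key=fitness)
    for _ in range(generations):
        # Step 2: reproduce the population with deviations.
        offspring = [mutate(parent, rng) for parent in population]
        # Step 3: determine the fitness of each member.
        ranked = sorted(population + offspring, key=fitness)
        # Step 4: remove the least desirable members.
        population = ranked[:pop_size]
        # Step 5: return to step 2 and repeat.
    return population[0]

# Hypothetical toy problem: match an 8-bit target string.
TARGET = [1, 0, 1, 1, 0, 0, 1, 0]

def mismatches(bits):
    return sum(b != t for b, t in zip(bits, TARGET))

def random_bits(rng):
    return [rng.randint(0, 1) for _ in TARGET]

def flip_one(bits, rng):
    child = list(bits)
    child[rng.randrange(len(child))] ^= 1  # flip one randomly chosen gene
    return child

best = evolve(mismatches, random_bits, flip_one)
print(best, mismatches(best))
```

Because parents compete with their offspring for survival in this sketch, the best fitness in the population can never get worse from one generation to the next. Other EC variants replace the parents outright.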


The four main issues in EC algorithms are how candidate solutions are represented as genotypes, how the fitness of a genotype is evaluated, how reproduction is done, and how genotypes are selected for removal from the population.

2.3.1 Representation

As discussed previously, a genotype is a representation of a candidate solution. Two important aspects of the representation are how it restricts the solution space and how it facilitates the search of this space. The size of the solution space, i.e., the number of possible genotypes, is calculated according to the following equation:

$$ s = \prod_{i=1}^{m} \prod_{j=1}^{n} a_{i,j} $$

where:
s = the size of the solution space.
m = the number of chromosomes per genotype.
n = the number of genes per chromosome.
a_{i,j} = the number of alleles per gene j in chromosome i.

Thus, if the representation of a genotype is a single chromosome with 3 genes, each having 2 alleles, the solution space contains 2^3 = 8 possible solutions. If all the desirable solutions lie outside the solution space, then the representation is inadequate. As made apparent by the above equation, the dimensionality of the solution space is determined by the total number of genes in all chromosomes and the size of each dimension by the number of alleles per gene. If too few genes and alleles are used, an acceptable solution may not be in the restricted solution space. Increasing the total number of genes or alleles per gene increases the solution space size, which can increase the probability that the solution space contains a desirable solution. However, if too many genes and alleles are used, the solution space may become too large to explore effectively, which unnecessarily increases the time needed to find a solution.
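As a small worked example of the solution space size equation, the following Python sketch multiplies the allele counts of every gene in every chromosome. The function name `solution_space_size` and the nested-list layout are assumptions made for illustration.

```python
from math import prod

def solution_space_size(allele_counts):
    """allele_counts[i][j] is the number of alleles of gene j in chromosome i."""
    return prod(prod(chromosome) for chromosome in allele_counts)

# One chromosome with 3 genes of 2 alleles each: 2 * 2 * 2 possible genotypes.
print(solution_space_size([[2, 2, 2]]))        # 8
# Adding a fourth binary gene doubles the space.
print(solution_space_size([[2, 2, 2, 2]]))     # 16
# Two chromosomes: the sizes multiply across chromosomes as well.
print(solution_space_size([[2, 2, 2], [10]]))  # 80
```

The sketch accepts chromosomes with different numbers of genes, which is slightly more general than the equation as stated.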

One goal of defining a representation is to restrict the search space as much as possible without eliminating the best solutions or solutions that would lead to discovering such solutions. This generally reduces the time necessary to find a solution. The other goal is to create a representation that can be used efficiently by an EC algorithm. Each type of EC algorithm tends to perform better with certain types of representations than it does with others due to the reproduction operators used.

The four primary factors that affect a representation of a model are the number of chromosomes in the genotype, the number of genes per chromosome, the position of each gene relative to the other genes in a chromosome, and the number, order, and probability distribution of alleles per gene. Each of these factors is briefly discussed in the remainder of section 2.3.1.

The number of genes in a chromosome is an important representation factor. As previously mentioned, it is desirable to minimize the total number of genes without eliminating the best solutions from the solution space. This can be done by either reducing the number of genes in a chromosome or reducing the number of chromosomes. In many cases, candidate solutions do not need all of the genes in a chromosome. However, many EC algorithms require that all the chromosomes of the same type have the same number of genes. This causes the size of the solution space to be increased even though these genes will not yield useful solutions and may in fact be disruptive. One technique currently being researched is to permit a variable number of genes in the chromosome. It attempts to prevent the solution space from becoming larger than necessary. However, the extra computational cost of the added bookkeeping often eliminates the advantage of this approach.

An alternative to varying the number of genes per chromosome is to vary the number of chromosomes per genotype. This is also an active area of research. The primary disadvantage of this approach is determining the fitness of the individual chromosomes because each only represents a partial solution. One novel approach to solve this problem is called the “bucket brigade algorithm”; the name is a metaphor for a chromosome passing fitness back to the other chromosomes that enabled it to achieve its fitness, like buckets of water being passed along a line of people, cf. [Holland 92][Riolo 88]. An advantage of using multiple chromosomes is that related genes can be placed in the same chromosome and unrelated genes placed in different chromosomes. This has the effect of focusing the search space due to the reproduction operators of the EC algorithms used, which are discussed in section 2.4. A gene can be assigned to its chromosome by the user. Some EC algorithms also permit a gene to migrate between chromosomes as the population evolves.

Like the total number of genes, the number of alleles per gene has an enormous effect on the size of the solution space. Doubling the number of alleles in one gene doubles the size of the entire solution space. Just as is the case with the number of genes, the number of alleles per gene should be reduced when possible as long as such action does not remove the desired solutions from the solution space. The number of permitted alleles per gene can range from two in binary number genes to infinity in real number genes. A gene with only one allele is normally treated as a constant in the model being represented and is not part of the genotype that evolves.

In many cases, there is a trade-off between the number of alleles per gene and the number of genes per chromosome. For example, one could choose to represent a set of 1024 integers as a single gene with 1024 alleles, or recognize that 1024 = 2^10 and represent it as a binary number with 10 genes, each having two alleles, 0 and 1. Even though the solution space size is the same in both cases, depending on the EC algorithm used, the efficiency can be severely impacted by the choice made. For example, GAs tend to perform better than other approaches when each gene has only two alleles. However, ES and EP algorithms tend to perform better when each gene has a large number of alleles that have a useful order, as is the case with real numbers. This is due to the reproduction operators used by each approach and is discussed more specifically when the EC algorithms are presented.
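The two representations of the 1024 integers can be compared with a short Python sketch. The helper names `to_binary_genes` and `from_binary_genes` and the most-significant-bit-first ordering are assumptions for illustration. The example also shows one reason the choice matters to the reproduction operators: two integers that are neighbors in the single-gene representation can differ in every gene of the binary representation.

```python
def to_binary_genes(value, n_genes=10):
    """Encode an integer in [0, 2**n_genes) as 0/1 genes, most significant first."""
    if not 0 <= value < 2 ** n_genes:
        raise ValueError("value out of range for this number of genes")
    return [(value >> shift) & 1 for shift in range(n_genes - 1, -1, -1)]

def from_binary_genes(genes):
    """Decode a list of 0/1 genes back into the integer it represents."""
    value = 0
    for bit in genes:
        value = (value << 1) | bit
    return value

a = to_binary_genes(511)   # [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b = to_binary_genes(512)   # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
hamming = sum(x != y for x, y in zip(a, b))
print(hamming)             # 10: adjacent integers, yet every gene differs
```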

The order of alleles, or lack thereof, for a gene with more than two alleles is also significant. If the alleles can be ordered usefully, then a hill-climbing approach can be used to focus the search of the solution space as is done in ES and EP algorithms. Consider a gene with 10 alleles: the digits 0 to 9. The natural order of these alleles is [...]

    [...] min_v
    // normalize model criterion values before calculating weights
    for each model θ in model colony
        // V is the model criterion function.
        // Z^T is the observation data training set
        // SELECT_TEMP is user-specified, typically 1, to curve weights
        // power(a,b) = a^b
        v[θ] = power(V(θ,Z^T), SELECT_TEMP)   // calculate modified V
        // limit V range
        if(v[θ] < MIN_V) then v[θ] = MIN_V
        if(v[θ] > MAX_V) then v[θ] = MAX_V
        if(v[θ] < min_v)   // find minimum (best) V
            min_v = v[θ]
        if(v[θ] > max_v)   // find maximum (worst) V
            max_v = v[θ]
    endfor
    // calculate model selection weights
    // (model with maximum weight is most likely to be selected)
    for each model θ in model colony
        // 1 [...]

[...] 1 x(t – 2) + x(t – 5) + x(t – 14) otherwise

where:
y(t) = the desired forecast at time t
x(t) = the input time-series value of x at time t


The purpose of the experiment is to discover the model that exactly calculates the value of any element of the desired forecast time-series y given elements of the random input time-series x with indices less than the index of the y element. The time-series x and y are shown in figure 21. Note that the data set includes 100 forecasts, time t = 50 to 149. The data prior to t = 50 was available for input use (in the input sliding time-series window).

[Figure 21: plot of time-series value (0 to 10) against time (0 to 150) for the input and forecast series.]

Figure 21. The input and desired forecast time-series for test 1

The test was performed using 21 colonies without inter-colony communication. Each colony consisted of 50 model parameter vectors, and each model consisted of two nodes. The node operator type of the first node in each model was fixed to calculate the mean of the node inputs, i.e., a subset of the data observation window Z(t). The second node in each model had two fixed inputs. The first input of each of these nodes was the value x(t-1) from the data observation window. The second input of each node was the output from the first node. The node operator type of the second node was fixed to divide its first input by its second input and output the quotient. Defined as such, the second node in each model was constant and did not evolve. Only the input link vector of the first node evolved. A uniform crossover recombination operator was used. By limiting the nodes as described, we limit each model to a simple moving-average forecasting model as follows:

$$ f_{M(\theta)}(Z(t)) = \begin{cases} 1 & \dfrac{\sum_{i=0}^{i<|Z'(t)|} Z'_i(t)}{x(t-1) + |Z'(t)|} \le 1 \\ 0 & \text{otherwise} \end{cases} $$

where Z'(t) ⊆ Z(t) as defined by M(θ).

We restrict D_M such that there is a one-to-one correspondence between D_M and the power set of Z(t). Table 4 summarizes the experiment specifications.


| Specification Type | Description | Setting |
|---|---|---|
| Model Domain | No. of nodes per model | 2 |
| | Node 0: No. of evolvable bits in Input Link Vector: Run 1 | 15 |
| | Node 0: No. of evolvable bits in Input Link Vector: Run 2 | 25 |
| | Node 0: Number of Operator types | 1 |
| | Node 0: Operator type 0 | Mean |
| | Node 0: No. of evolvable ES parameters | 0 |
| | Node 1: No. of evolvable bits in Input Link Vector | 0 |
| | Node 1: Number of Operator types | 1 |
| | Node 1: Operator type 0 | Div |
| | Node 1: No. of evolvable ES parameters | 0 |
| Observation Data Set | Number of observations | 150 |
| | No. of time-series | 1 |
| | Output attribute is an input attribute | Yes |
| | Data Observation Window size: Run 1 | 15 |
| | Data Observation Window size: Run 2 | 25 |
| Model Criterion Function | $V(\theta, Z^T) = \sum_{t=50}^{t<150} \lvert y(t) - f_{M(\theta)}(Z(t)) \rvert$ | |
| CAFECS Algorithm Parameters | GA rate | 1 |
| | GA Parent Selection Algorithm | RWS |
| | GA Crossover type | Uniform |
| | GA Population replacement percent (per generation) | 80% |
| | GA Mutation rate (per evolvable bit) | 0.001 |
| | Unique models mode | On |
| | Initial population of model parameter vectors | Random |
| | No. of Model parameter vectors per colony | 50 |
| | No. of Colonies | 21 |
| | Communication Strategy | Isolated |
| | Termination criterion: Run 1 | Gen. ≥ 20 |
| | Termination criterion: Run 2 | Gen. ≥ 200 |

Table 4. Experiment 1 specifications
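The model criterion function in Table 4 can be sketched in Python as a count of incorrect forecasts over t = 50 to 149. The sketch assumes that each term of the sum is an absolute difference and that forecasts take the values 0 or 1; the names `model_criterion` and `forecast`, and the toy series used to exercise it, are hypothetical.

```python
def model_criterion(y, forecast, t_start=50, t_stop=150):
    """Sum of |y(t) - forecast(t)| for t_start <= t < t_stop.

    With 0/1 forecasts this is the number of incorrect forecasts,
    so a value of 0 means every forecast is correct.
    """
    return sum(abs(y[t] - forecast(t)) for t in range(t_start, t_stop))

# Made-up desired forecast series, for illustration only.
y = [t % 2 for t in range(150)]

print(model_criterion(y, lambda t: y[t]))  # 0: a model that reproduces y exactly
print(model_criterion(y, lambda t: 0))     # 50: always forecasting 0 misses every 1
```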

The initial population of model parameter vectors for each colony consisted of 50 randomly selected elements of the model domain, that is, the selected inputs for the first node in each model were randomly selected.
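A minimal Python sketch of the random initial population, together with the uniform crossover and per-bit mutation operators named in Table 4, is given below. It assumes that an input link vector is stored as a list of 0/1 values; the function names are hypothetical, and parent selection, population replacement, and the unique models mode are not shown.

```python
import random

def random_link_vector(n_bits, rng):
    """A randomly selected element of the model domain, as a list of 0/1 link bits."""
    return [rng.randint(0, 1) for _ in range(n_bits)]

def uniform_crossover(parent_a, parent_b, rng):
    """Copy each child bit from one parent or the other with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in bits]

rng = random.Random(1)
colony = [random_link_vector(15, rng) for _ in range(50)]  # 50 vectors of 15 bits
child = mutate(uniform_crossover(colony[0], colony[1], rng), 0.001, rng)
print(child)
```

At a mutation rate of 0.001 per bit, a 15-bit vector is changed by mutation in only about 1.5% of offspring, so in this configuration most of the variation comes from crossover.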


The experiment consisted of two separate runs. In the first run, the observation data set window Z(t) was limited to 15 elements of the time-series, i.e., the window width was 15. Thus, all possible models in this test can be represented by a 15-bit string that specifies the input link vector of the first node in a model, i.e., each bit controls whether an element in the time-series window is to be included in the moving-average model. Also, the desired model θ̂ ∈ D_M, where f_M(θ̂)(Z(t)), 50 ≤ t ≤ 149, defines a time-series equal to the desired forecast time-series y, and so V(θ̂, Z^T) = 0. Note that |D_M| = |power set of Z(t)| = 2^15 = 32,768, so an exhaustive search of the model domain would not be a time-consuming task. The experiment run was terminated after 20 generations.

In the second run, the window was increased to 25 elements so that |D_M| = 2^25 = 33,554,432. In this case, the run was terminated after 200 generations.
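The correspondence between a bit string and a subset of the observation window can be sketched as follows. The ordering of the window elements, the function names, and the treatment of an empty selection as an error are assumptions made for illustration; the source does not specify them here.

```python
def selected_inputs(window, link_bits):
    """Return the window elements whose link bit is 1."""
    return [x for x, bit in zip(window, link_bits) if bit]

def node0_mean(window, link_bits):
    """Output of the first node: the mean of the selected window elements."""
    chosen = selected_inputs(window, link_bits)
    if not chosen:
        raise ValueError("no inputs selected")  # assumption: empty subsets are rejected
    return sum(chosen) / len(chosen)

window = list(range(1, 16))          # 15 made-up window values
link_bits = [0] * 15
for index in (1, 4, 13):             # select three of the fifteen elements
    link_bits[index] = 1

print(selected_inputs(window, link_bits))  # [2, 5, 14]
print(node0_mean(window, link_bits))       # 7.0
print(2 ** 15, 2 ** 25)                    # 32768 33554432
```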

4.1.1.1 Experiment 1, Run 1 Results

The distribution of the results of the 21 colonies of run 1 (window size = 15) is shown in figure 22.


[Figure 22: histogram of colony count (0 to 6) against generations required (2 to 20, and >20). Total colony count = 21.]

Figure 22. The distribution of Experiment 1, Run 1 results with window size = 15

The x-axis is the exact number of generations a colony evolved before it contained the desired model parameter vector θ̂, which defines a model that predicts all 100 forecasts correctly. The y-axis is the number of colonies that selected θ̂ for the first time for a given generation count. Each of the 21 colonies is represented by one and only one of the bars in figure 22. Thus, two colonies selected θ̂ in 6 generations, and 6 of the 21 total colonies did not select θ̂ in fewer than 20 generations.

The average number of generations required by the first three colonies to select θ̂ was 6 2/3 generations. We also note that within a few generations all colonies had selected models that correctly predicted over 90% of the forecasts. However, no models were discovered by any colony that correctly predicted more than 96% but less than 100% of the forecasts.


4.1.1.2 Experiment 1, Run 2 Results

As previously stated, Run 2 was identical to Run 1 except that the window size was increased from 15 to 25, and the maximum number of generations was increased from 20 to 200. The distribution of the results of the 21 colonies of Run 2 is shown in figure 23.

[Figure 23: histogram of colony count (0 to 5, and >5) against generations required (20 to 200, and >200). Total colony count = 21.]

Figure 23. The distribution of Experiment 1, Run 2 results with window size = 25

In this run, only 3 of the 21 colonies selected θ̂ in fewer than 200 generations. The average number of generations required by these three colonies to select θ̂ was 129 generations. However, all colonies developed models within a few generations that correctly predicted more than 90% of the desired forecasts. Also, as was the case in the previous run, no models were discovered by any colony that correctly predicted more than 96% but less than 100% of the forecasts.


4.1.2 Discussion of Experiment 1

In Experiment 1, Run 1, 15 of the 21 colonies independently selected the desired forecasting model defined by θ̂ within 20 generations. The reason so few generations were required can be explained in part by the relatively small cardinality of the model domain. Since the model domain was constrained to only 2^15 = 32,768 model parameter vectors and each colony contained 50 model parameter vectors, the probability of a colony containing θ̂ using a generate-and-test method is 50/32,768 ≈ 0.0015 per generation. Assuming candidates are not retested, about half of the model domain must be examined on average, so the expected number of generations per colony to select θ̂ is slightly over 325. However, this experiment demonstrated that fewer generations were required, which translates to better performance. It appeared that some hill-climbing was possible in attempting to find θ̂, but this was hampered once a 96% accuracy level was reached. This can be explained in part by a large number of local optima that were selected with a 95-96% accuracy rate. At this point, no incremental improvement in accuracy was found until θ̂ was selected.

The effect of selecting a large number of local optima was magnified in the second run. Most of the colonies rapidly selected models with a prediction accuracy rate of between 95% and 96%, but the selection algorithm had difficulty using them to select the optimal model parameter vector θ̂. In this run, the probability of randomly selecting θ̂ decreased by a factor of over 1000 compared to the previous run, to 50/2^25 ≈ 0.0000015 per generation. The expected number of generations per colony to select θ̂ using a generate-and-test method increased to over 335,000. However, CAFECS was able to select θ̂ in 3 of the 21 colonies in fewer than 200 generations.
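The generate-and-test figures quoted for the two runs can be reproduced with a short calculation. The sketch below assumes that each generation tests 50 previously untested model parameter vectors, so that on average about half of the model domain is examined before θ̂ is found; under that assumption the results agree with the values of slightly over 325 and over 335,000 generations. The function name and the without-retesting assumption are not from the source.

```python
def generate_and_test(domain_size, colony_size):
    """Per-generation hit probability and approximate expected generations.

    Assumes candidates are never retested, so the target is expected to be
    found after examining about (domain_size + 1) / 2 candidates.
    """
    p_per_generation = colony_size / domain_size
    expected_generations = (domain_size + 1) / (2 * colony_size)
    return p_per_generation, expected_generations

p15, g15 = generate_and_test(2 ** 15, 50)
p25, g25 = generate_and_test(2 ** 25, 50)
print(p15, g15)      # about 0.0015 and about 328 generations
print(p25, g25)      # about 0.0000015 and about 335,544 generations
print(p15 / p25)     # 1024.0, the factor by which the probability decreased
```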

This experiment demonstrated that CAFECS can develop models of moving-average functions when presented with an input time-series and a desired forecast time-series, provided such a model is defined by a model parameter vector that is an element of its model domain. It is also of note that increasing the input window size from 15 to 25 increased the average number of generations required to select θ̂ in the three colonies with the best performance by a factor of just under 20. Recall that in run 1 the average number of generations required by the first three colonies to select θ̂ was 6 2/3 generations, but in run 2 the average for the first three colonies increased to 129 generations.

4.1.3 Experiment 2: Developing a moving-average model using S&P 500 data

For this experiment, the input attribute values of the observation data set are from a single time-series x consisting of the set of weekly averages of the daily closing values of the S&P 500 index from January 1980 to July 8, 1994. This index is more precisely known as the Standard and Poor’s Composite index of 500 stocks. It is a stock value-weighted index, which fluctuates throughout each day as stock prices change, and represents the change in the aggregate market value of 500 stocks relative to a base period from 1941-43.

The desired forecast time-series y was created using the S&P 500 index time-series x and a simple moving-average function f. More precisely, y = f(x) such that for 120 ≤ t ≤ 757:

$$ y(t) = \begin{cases} 1 & x(t+1) > x(t) \\ 0 & \text{otherwise} \end{cases} $$

where:
y(t) = the desired forecast at time t
x(t) = the input time-series value at time t

For each week, the output attribute value, i.e., the desired forecast, is 1 if the stock average will rise in the next week; otherwise it is 0. The purpose of the experiment is to develop a simple moving-average model that will best predict the output given the input. As in experiment 1, the model inputs are a subset of the x time-series with indices less than the index of the element of the desired forecast time-series y being predicted. However, unlike the first experiment, a model that defines function f is not helpful because x(t+1) is not available as a model input. Moreover, the optimal model is unknown. The model domain is restricted to simple moving-average models, and the only parameters evolved are the node link parameters of the first node. However, the observation data window width, |Z(t)|, has been increased to 120. The effect is that the number of possible input combinations, i.e., |power set of Z(t)|, is 2^120. Hence, it is not feasible to exhaustively search the model domain to determine the optimal model.
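The construction of the desired forecast series can be sketched in Python as follows. The function name `desired_forecast` and the sample values are hypothetical; the sample values are made up and are not S&P 500 data.

```python
def desired_forecast(x, t_start, t_stop):
    """Return {t: y(t)} with y(t) = 1 if x[t + 1] > x[t], else 0, for t_start <= t < t_stop."""
    return {t: int(x[t + 1] > x[t]) for t in range(t_start, t_stop)}

weekly = [100.0, 101.5, 101.5, 99.0, 102.0]   # made-up weekly averages
y = desired_forecast(weekly, 0, 4)
print(y)           # {0: 1, 1: 0, 2: 0, 3: 1}

# Size of the model domain for a 120-element window (about 1.3e36),
# which is why exhaustive search is not feasible.
print(2 ** 120)
```

Note that a week in which the average is unchanged is labeled 0, since the definition requires a strict rise.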

The model criterion function used is similar to the model criterion function described in experiment 1 except for the change in the observation data window width, so the first forecast is at index t = 120. The S&P 500 input time-series is graphically depicted in figure 24.

Figure 24. Test 2 input time-series: the S&P 500 weekly average from 1/80 to 7/94 (x-axis: week number starting from 1/1/80; y-axis: S&P 500 weekly average)

Unlike the input time-series of experiment 1, which was randomly generated to minimize detectable patterns, this input time-series may have many detectable patterns that a forecasting model could use to make more accurate forecasts.

The algorithm parameters were identical to those used in Experiment 1, Run 2 with the following exceptions. As previously mentioned, the observation data window width was increased to 120. This was done so that patterns of up to 120 weeks in length could be detected. It also increased the cardinality of the model domain. To help compensate for this increase, the number of models per colony was increased from 50 to 100 to permit a greater variety of models within a colony. Also, the number of colonies was decreased from 21 to 5 because of the increase in colony size and to offset the increase in computation time. The algorithm parameters are summarized in table 5.

Specification Type: Model Domain
    No. of nodes per model: 2
    Node 0:
        No. of evolvable bits in Input Link Vector: 120
        Number of Operator types: 1
        Operator type 0: Mean
        No. of evolvable ES parameters: 0
    Node 1:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type 0: Div
        No. of evolvable ES parameters: 0

Specification Type: Observation Data Set
    Number of observations: 758
    No. of time-series: 1
    Output attribute is an input attribute: Yes
    Data Observation Window size: 120

Specification Type: Model Criterion Function
    $V(\theta, Z^T) = \sum_{t=120}^{t<758} \left| y(t) - f_{M(\theta)}(Z(t)) \right|$

Specification Type: CAFECS Algorithm Parameters
    GA rate: 1
    GA Parent Selection Algorithm: RWS
    GA Crossover type: Uniform
    GA Population replacement percent (per generation): 80%
    GA Mutation rate (per evolvable bit): 0.001
    Unique models mode: On
    Initial population of model parameter vectors: Random
    No. of Model parameter vectors per colony: 100
    No. of Colonies: 5
    Communication Strategy: Isolated
    Termination criterion: Gen. ≥ 200

Table 5. Experiment 2 specifications

4.1.3.1 Experiment 2 Results

Unlike the first experiment, after 200 generations no colony contained a model that came close to correctly predicting 100% of the forecasts. The best forecasting model selected correctly predicted 366 out of 639 possible forecasts, which is 57.3% correct. CAFECS discovered this forecasting model after 130 generations but was unable to improve upon it in the remaining generations. Since the only difference between any two model parameter vectors in the model domain is the input link vector of node 1, we can specify the best model selected by this vector. Recall that the input link vector specifies which values are input from the data observation window and other nodes. In this case, the only inputs are from the data observation window containing x(t-1) … x(t-120). The input link vector of the first node θL1 of this model is shown in figure 25.

Part 1: input link vector values for indices 1 to 60 (index 1 is leftmost)

111110111111111101010000000001110000011010101000000000000000

Part 2: input link vector values for indices 61 to 120 (index 120 is rightmost)

000000000000000000000000000100000000000011111100101010011111

Figure 25. Experiment 2 Results: the node 1 input link vector of the best model selected

In this figure, the 120 elements of input link vector θL1 are bifurcated into Part 1 and Part 2. Part 1 lists the first 60 elements of θL1 and the remaining 60 elements are listed in Part 2. If θL1(i) = 1, then x(t-i) is a node input value otherwise x(t-i) is ignored. Thus, the leftmost element in Part 1 having index 1 corresponds to the input value one week prior to the value being forecasted and the right-most element in part 2 having index 120 corresponds to the input value 120 weeks prior to the value being forecasted.

Of note is that nearly all of the 16 most recent values in the data observation window are used, i.e., nearly all of θL1(1) … θL1(16) are equal to 1. Moreover, nearly all of θL1(17) … θL1(100) are equal to 0. That is, nearly all the data observation window elements 17 to 100 weeks prior to the time of the forecast are ignored. However, several of the data observation window elements more than 100 weeks prior to the time of the forecast are selected.
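The pattern described above can be checked directly from the two bit strings in figure 25. The sketch below is illustrative only; the helper names are hypothetical, and the lag ranges (1 to 16, 17 to 100, 101 to 120) are the ones used in the discussion.

```python
# Bit strings copied from figure 25 (index 1 is the leftmost character).
PART_1 = "111110111111111101010000000001110000011010101000000000000000"
PART_2 = "000000000000000000000000000100000000000011111100101010011111"
THETA_L1 = PART_1 + PART_2


def selected_lags(bits: str) -> list:
    """Return the lags i (in weeks) for which x(t - i) is a model input."""
    return [i for i, bit in enumerate(bits, start=1) if bit == "1"]


def fraction_selected(bits: str, first: int, last: int) -> float:
    """Fraction of lags in the inclusive range first..last that are selected."""
    chosen = bits[first - 1:last]
    return chosen.count("1") / len(chosen)


recent = fraction_selected(THETA_L1, 1, 16)     # most recent 16 weeks
middle = fraction_selected(THETA_L1, 17, 100)   # 17 to 100 weeks back
distant = fraction_selected(THETA_L1, 101, 120) # more than 100 weeks back
```

Printing `recent`, `middle`, and `distant` shows a high selection rate at both ends of the window and a low rate in between.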

To check this phenomenon, a number of additional tests were run with similar results both in prediction percentage and in the pattern of the inputs selected, i.e., the node 1 input link vectors were similar. Additional runs of this experiment with the same model domain regularly found models with prediction percentages greater than 57%. Tests were also run in which the observation data window width was decreased to 75 elements so that no input values more than 75 weeks prior to the forecast time could be selected. However, the prediction percentages of the models decreased relative to the models using 120-element observation data windows. Tests were also done with the initial population of models specified by the user instead of randomly generated. This user-specified population included various popular moving-average models such as the 40-week moving-average model. However, these models consistently performed poorly and were quickly replaced by models similar to the one shown in figure 25.

4.1.4 Discussion of Experiment 2

In the second experiment, a much more difficult task is presented. Unlike the first experiment, after 200 generations no colony contained a model that came close to correctly predicting 100% of the forecasts. This can be explained in part by the model domain being limited to simple moving-average models. Other model domains would permit CAFECS to select better models. Moreover, other approaches, such as neural networks, can rapidly achieve much better results. However, the purpose of this test is not to develop the best S&P 500 forecasting model, but rather to examine how the best model is selected from within the designated model domain composed of simple moving-average models. To this end, CAFECS regularly found models with a prediction percentage slightly greater than 57%. However, even when the models were permitted to evolve for 10,000 generations, models with prediction percentages greater than 58% were not found. Of course, this does not indicate that none exist. It may be that there exists a model defined by an element in the model domain that has a prediction percentage significantly higher than 58%. Unfortunately, compared to the last experiment, the cardinality of the model domain MD increased to an astronomical size of $2^{120}$ (which is greater than the square of the estimated age of the universe in seconds). Because of the cardinality of MD and the non-linearity of the model criterion function V, no method is known for determining, within a reasonable period of time, which element of MD minimizes V.
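The size comparison in the parenthetical remark can be verified with a short calculation. The age of the universe used here (about 13.8 billion years) is an assumed present-day estimate, not a figure taken from this work.

```python
# Assumed estimate of the age of the universe (not a value from the text).
AGE_OF_UNIVERSE_YEARS = 13.8e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

age_seconds = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR
model_domain_size = 2 ** 120

# Ratio of |MD| to the squared age of the universe in seconds.
ratio = model_domain_size / age_seconds ** 2
```

Under this assumption the ratio is greater than 1, so the cardinality of the model domain does exceed the squared age in seconds.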

One interesting result of this experiment, pointed out previously, is the tendency of CAFECS to develop forecasting models that use inputs less than 17 weeks or more than 100 weeks prior to the time of the forecast, but very few inputs in between. Even when the initial population contained forecasting models frequently used by forecasters, such as the 40-week moving-average model, they were quickly discarded in favor of models with inputs similar to the ones selected by the model shown in figure 25. It should not come as a surprise that a cyclical pattern is found in time-series of corporate stocks. However, for a biennial (two-year) pattern to have greater significance than annual and quarterly patterns is curious, especially for this time-series (shown in figure 24), which shows no visible signs of a biennial or any other cyclical pattern.


This leads to the final point regarding this experiment. The CAFECS approach to model development permits a user to explore the data and models interactively. Interesting patterns in both the data and models can be discovered as demonstrated by examining the node 1 input vector shown in figure 25.

4.2 A Comparison of Communication Strategies

This section is motivated in part by Sewall Wright’s Shifting Balance Theory, which was previously discussed in section 2.2. Recall that this theory suggests that evolution is most effective when sub-populations of a species form with only a limited exchange of genetic material. In CAFECS, this is simulated by using multiple colonies with various communication strategies as discussed in section 3.3.

In this section, we examine what effect 10 different communication strategy topologies have on the selection of models. We present the results of two experiments, numbered 3 and 4, that use these different communication strategies. Each experiment consists of 5 runs using each communication strategy for a total of 50 runs. In each experiment, we attempt to minimize a different multi-dimensional sine function. In both experiments, the model domains are composed of model parameter vectors that define constant functions. Because of this, the observation data set $Z^T$ is empty and $f_{M(\theta)}(t, Z^T)$ is constant. This section concludes with a discussion of both experiments.


4.2.1 Experiment 3: 2-dimension sine-wave models

In this experiment, CAFECS attempts to select a model parameter vector that defines a 2-tuple (x, y) such that V is minimized, i.e., the value of the function g(x,y) is maximized. This function is defined as:

$$
\begin{aligned}
g(x,y) ={} & \sin(0.2\pi x) \cdot \sin(0.1\pi(x + 12.8)) + \sin(0.3\pi(x + 2)) \cdot \sin(0.8\pi(x + 1.75)) \\
& + \sin(0.4\pi(y + 1.5)) \cdot \sin\!\left(\frac{\pi(y + 3)}{3.5715}\right) + \sin\!\left(\frac{\pi(y + 1.7502)}{1.1365}\right) \cdot \sin(0.6\pi(y + 0.7)) \\
& - 0.0001x^2 - 0.00021y^2 + 7
\end{aligned}
$$

This function was arbitrarily developed and is of interest because it exhibits a large number of local maxima that differ from the global maximum by a small amount. Note that no attempt is made to maximize the function g symbolically. A portion of the function containing the global maximum is plotted in figure 26. In figure 27, a portion of the function plot containing the global maximum is enlarged 10 times so that the surface can be seen more easily. The function g(x,y) depicted in these figures is maximized when x ≈ -6.3247 and y ≈ 16.9559.
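For readers who want to evaluate g numerically, the definition above can be transcribed as follows. This is a sketch based on the equation as reconstructed here; the function and constant names are illustrative, and the model criterion value is simply the negation of g.

```python
import math


def g(x: float, y: float) -> float:
    """Two-dimensional test function of experiment 3 (to be maximized)."""
    pi = math.pi
    return (
        math.sin(0.2 * pi * x) * math.sin(0.1 * pi * (x + 12.8))
        + math.sin(0.3 * pi * (x + 2)) * math.sin(0.8 * pi * (x + 1.75))
        + math.sin(0.4 * pi * (y + 1.5)) * math.sin(pi * (y + 3) / 3.5715)
        + math.sin(pi * (y + 1.7502) / 1.1365) * math.sin(0.6 * pi * (y + 0.7))
        - 0.0001 * x ** 2
        - 0.00021 * y ** 2
        + 7
    )


def model_criterion(x: float, y: float) -> float:
    """V = -g, so minimizing V maximizes g."""
    return -g(x, y)


# Approximate maximizer reported in the text.
X_STAR, Y_STAR = -6.3247, 16.9559
g_star = g(X_STAR, Y_STAR)
```

Each of the four sine products is bounded by 1 in magnitude, so g can never exceed 11; the quadratic terms gently penalize points far from the origin.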

Figure 26. A graphical depiction of function g defined in section 4.2.1

Figure 27. A 10X magnification of the portion of figure 26 that contains the global maximum point

Although each model in the model domain consists of 19 nodes, only node 0 and node 1 contain evolvable parameters. We denote the evolvable parameter of node 0 as N0P and the evolvable parameter of node 1 as N1P. N0P and N1P correspond to the x and y arguments of g respectively, as will be shown. The remaining nodes are used to define the function g. The result is that any element θ of the model domain defines a model function $f_{M(\theta)}$ such that $f_{M(\theta)}(t, Z^T)$ = g(N0P, N1P). Additional specifications of the experiment are detailed in table 6.

Specification Type: Model Domain
    No. of nodes per model: 19
    Node 0, 1:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type 0: Sum
        No. of evolvable ES parameters: 1
    Node 3 - 9:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Sine
        No. of evolvable ES parameters: 0
    Node 10, 11:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Square
        No. of evolvable ES parameters: 0
    Node 12 - 17:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Multiply
        No. of evolvable ES parameters: 0
    Node 18:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Sum
        No. of evolvable ES parameters: 0

Specification Type: Observation Data Set
    Number of observations: 0
    No. of time-series: 0
    Output attribute is an input attribute: No
    Data Observation Window size: 0

Specification Type: Model Criterion Function
    $V(\theta, Z^T) = -f_{M(\theta)}(t, Z^T)$

Specification Type: CAFECS Algorithm Parameters
    GA rate: 0
    ES Mutation rate (per evolvable ES child parameter): 0.25
    Unique models mode: Off
    Initial population of model parameter vectors: Random
    No. of Model parameter vectors per colony: 20
    No. of Colonies: 9
    Communication Strategy (see table 7):
        Run 1 - 5: #1
        Run 6 - 10: #2
        Run 11 - 15: #3
        Run 16 - 20: #4
        Run 21 - 25: #5
        Run 26 - 30: #6
        Run 31 - 35: #7
        Run 36 - 40: #8
        Run 41 - 45: #9
        Run 46 - 50: #10
    Termination criterion: Gen. ≥ 1000

Table 6. Experiment 3 specifications

Table 7 describes each of the 10 communication strategies used in this experiment. These communication strategies differ only in their network topologies. Colony network topologies were discussed in section 3.3. For each of these communication strategies, the information sent by each colony is the model parameter vector with the lowest model criterion value in the colony. For each model parameter vector received by a colony, the vector replaces the model parameter vector with the highest model criterion value in the colony. Each of these communication strategies is applied once per generation. These characteristics of the communication strategies were arbitrarily selected and kept constant so that the effect of different network topologies could be examined. A more thorough examination of various communication strategies is left to future work.
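The send-best, replace-worst rule described above can be sketched as a single communication step over a list of directed links. This is a minimal illustration, not the CAFECS implementation: the names are hypothetical, colonies are indexed from 0 here, and choosing all outgoing models before any replacement takes place is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Model = Tuple[float, ...]
Link = Tuple[int, int]  # (sending colony index, receiving colony index)


def communicate(
    colonies: List[List[Model]],
    criterion: Callable[[Model], float],
    links: Sequence[Link],
) -> None:
    """Apply one communication step in place.

    Each sender contributes a copy of its model with the lowest criterion
    value; in the receiver it replaces the model with the highest value.
    """
    # Assumption: every sender's best model is chosen before any replacement,
    # so a model received in this step is not forwarded in the same step.
    best = {src: min(colonies[src], key=criterion) for src, _ in links}
    for src, dst in links:
        receiver = colonies[dst]
        worst = max(range(len(receiver)), key=lambda i: criterion(receiver[i]))
        receiver[worst] = best[src]


# Toy example with one-parameter models and criterion |x|.
example = [[(5.0,), (3.0,)], [(1.0,), (4.0,)]]
communicate(example, lambda m: abs(m[0]), [(1, 0)])
```

After the call, colony 0 has lost its worst model (5.0,) and gained a copy of colony 1's best model (1.0,); colony 1 is unchanged because the link is one-way.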

No. 1, Complete Communication: All members of all colonies are treated as if they belonged to the same colony.

No. 2, Complete Isolation: Each colony is treated essentially as an entirely separate run.

No. 3, Nearest Neighbor: Each colony sends a copy of a member to one of its four neighboring colonies (toroid structure, see figure 17). In this and the following communication strategies, the member copied and sent has the lowest model criterion value for that colony, and it replaces a member with the greatest model criterion value in the colony it is sent to.

No. 4, Random: Each colony sends a copy of a member to a randomly selected colony other than itself.

No. 5, One-way Two-level Static Hierarchy: Each colony except colony 1 sends a copy of a member to colony 1.

No. 6, Two-way Two-level Static Hierarchy: Each colony except colony 1 sends a copy of a member to colony 1. Colony 1 sends a copy of a member to every other colony.

No. 7, One-way Three-level Static Hierarchy: Colonies 5 and 6 each send a copy of a member to colony 2. Colony 7 sends a copy of a member to colony 3. Colonies 8 and 9 each send a copy of a member to colony 4. Colonies 2, 3, and 4 each send a copy of a member to colony 1.

No. 8, Two-way Three-level Static Hierarchy: Colonies 5 and 6 each send a copy of a member to colony 2 and receive from colony 2 a copy of its member. Colony 7 sends a copy of a member to colony 3 and receives from colony 3 a copy of its member. Colonies 8 and 9 each send a copy of a member to colony 4 and receive from colony 4 a copy of its member. Colonies 2, 3, and 4 each send a copy of a member to colony 1 and receive from colony 1 a copy of its member.

No. 9, One-way Two-level Dynamic Hierarchy: Each colony except colony n sends a copy of a member to colony n, where colony n is the colony with a member that has the lowest model criterion value of all colonies.

No. 10, Two-way Two-level Dynamic Hierarchy: Each colony except colony n sends a copy of a member to colony n, where colony n is the colony with a member that has the lowest model criterion value of all colonies. Colony n sends a copy of a member to every other colony.

Table 7. Communication Strategy Detail
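The hierarchical topologies in table 7 (strategies 5 through 10) can be written down as lists of directed (sender, receiver) links. The sketch below uses the 1-based colony numbers of the table; the function names and the dictionary of per-colony best criterion values are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Link = Tuple[int, int]  # (sending colony number, receiving colony number)


def two_level_static(n_colonies: int = 9, two_way: bool = False) -> List[Link]:
    """Strategies 5 and 6: every colony reports to colony 1."""
    links = [(c, 1) for c in range(2, n_colonies + 1)]
    if two_way:
        links += [(1, c) for c in range(2, n_colonies + 1)]
    return links


def three_level_static(two_way: bool = False) -> List[Link]:
    """Strategies 7 and 8: leaves report to colonies 2-4, which report to 1."""
    parent = {5: 2, 6: 2, 7: 3, 8: 4, 9: 4, 2: 1, 3: 1, 4: 1}
    links = list(parent.items())
    if two_way:
        links += [(p, c) for c, p in parent.items()]
    return links


def two_level_dynamic(best_value: Dict[int, float], two_way: bool = False) -> List[Link]:
    """Strategies 9 and 10: the hub is the colony holding the lowest
    model criterion value in the current generation."""
    hub = min(best_value, key=best_value.get)
    links = [(c, hub) for c in best_value if c != hub]
    if two_way:
        links += [(hub, c) for c in best_value if c != hub]
    return links
```

For example, `two_level_dynamic({1: 0.5, 2: -3.0, 3: 1.0})` makes colony 2 the hub, since it currently holds the lowest criterion value.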


4.2.2 Experiment 4: 4-dimension sine-wave models

In this experiment, CAFECS attempts to select a model parameter vector that defines a 4-tuple (a, b, c, d) such that V is minimized, i.e., the value of the function g'(a, b, c, d) is maximized, where g' is defined as:

$$
\begin{aligned}
g'(a,b,c,d) ={} & \sin(0.2\pi a) \cdot \sin(0.1\pi(a + 12.8)) + \sin(0.3\pi(a + 2)) \cdot \sin(0.8\pi(a + 1.75)) \\
& + \sin(0.4\pi(b + 1.5)) \cdot \sin\!\left(\frac{\pi(b + 3)}{3.5715}\right) + \sin\!\left(\frac{\pi(b + 1.7502)}{1.1365}\right) \cdot \sin(0.6\pi(b + 0.7)) \\
& + \sin\!\left(\frac{\pi c}{2.99}\right) \cdot \sin(0.1\pi(c + 12)) + \sin\!\left(\frac{\pi c}{348.15}\right) \cdot \sin\!\left(\frac{\pi(c - 21.75)}{17.35}\right) \\
& + 3.15 \sin\!\left(\frac{\pi d}{18.05}\right) \cdot \sin\!\left(\frac{\pi(d - 3.44)}{1.2365}\right) + \sin\!\left(\frac{\pi d}{158.6}\right) \cdot \sin\!\left(\frac{\pi(d + 5.91)}{3.15}\right) \\
& - 0.0001a^2 - 0.00021b^2 - 0.00032c^2 - 0.00074d^2 + 7
\end{aligned}
$$

Like function g used in experiment 3, g' has a large number of local optima near the value of the global optimum. Since the function has 4 arguments, it is difficult to express it graphically. However, a plot keeping any two of its arguments constant appears roughly similar to figure 26 for most values of the constant arguments.

Similar to experiment 3, each model in the model domain consists of 37 nodes, although only nodes 0, 1, 2, and 3 contain evolvable parameters, denoted N0P, N1P, N2P, and N3P respectively. The remaining nodes are used to represent the function g' in each model. As in experiment 3, the result is that any element θ of the model domain defines a model function $f_{M(\theta)}$ such that $f_{M(\theta)}(t, Z^T)$ = g'(N0P, N1P, N2P, N3P). Additional specifications of the experiment are detailed in table 8.

Specification Type: Model Domain
    No. of nodes per model: 37
    Node 0 - 3:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type 0: Sum
        No. of evolvable ES parameters: 1
    Node 4 - 19:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Sine
        No. of evolvable ES parameters: 0
    Node 20 - 23:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Square
        No. of evolvable ES parameters: 0
    Node 24 - 35:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Multiply
        No. of evolvable ES parameters: 0
    Node 36:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type: Sum
        No. of evolvable ES parameters: 0

Specification Type: Observation Data Set
    Number of observations: 0
    No. of time-series: 0
    Output attribute is an input attribute: No
    Data Observation Window size: 0

Specification Type: Model Criterion Function
    $V(\theta, Z^T) = -f_{M(\theta)}(t, Z^T)$

Specification Type: CAFECS Algorithm Parameters
    GA rate: 0
    ES Mutation rate (per evolvable ES child parameter): 0.25
    Unique models mode: Off
    Initial population of model parameter vectors: Random
    No. of Model parameter vectors per colony: 20
    No. of Colonies: 9
    Communication Strategy (see table 7):
        Run 1 - 5: #1
        Run 6 - 10: #2
        Run 11 - 15: #3
        Run 16 - 20: #4
        Run 21 - 25: #5
        Run 26 - 30: #6
        Run 31 - 35: #7
        Run 36 - 40: #8
        Run 41 - 45: #9
        Run 46 - 50: #10
    Termination criterion: Gen. ≥ 5000

Table 8. Experiment 4 specifications

4.2.3 Results of Experiments 3 and 4

In all 50 runs of both experiments, the model parameter vector that minimized V, i.e., maximized g or g' in experiment 3 or 4 respectively, was selected. The results of these two experiments are summarized in figure 28 and in figure 29.


Figure 28. Experiment 3: A comparison of communications strategies on the 2-dimension function g

Figure 29. Experiment 4: A comparison of communications strategies on the 4-dimension function g'

Each vertical line in both plots represents the results of the 5 runs using the communication strategy identified on the x-axis. The top of the line signifies the maximum number of generations required, the bottom of the line signifies the minimum number of generations required, and the horizontal tick marks the average number of generations required to select the desired model parameter vector in the 5 runs.

Although both plots appear similar in some respects, note that the scale of the y-axis differs. In both experiments, the greatest average number of generations and variance in generations occurred when communication strategy 1 was used, i.e., all colonies were treated as if they were a single population.

Note that $|MD_3|^2 = |MD_4|$, where MD3 is the model domain for experiment 3 and MD4 is the model domain for experiment 4. This can be compared to the average number of generations required for each communication strategy, which increased from experiment 3 to experiment 4 by a factor ranging from approximately 3 for communication strategy 10 to approximately 10 for communication strategy 2. We also point out that the average number of generations required was reduced by up to 79% in experiment 3 and 85% in experiment 4 by using a communication strategy other than complete communication.

4.2.4 Discussion of Experiments 3 and 4

The results of these two experiments demonstrate that CAFECS is capable of minimizing certain highly non-linear functions without becoming “stuck” at local optima. This suggests that, for other model domains and highly non-linear model criterion functions, CAFECS would also be able to locate an optimal or near-optimal solution within a reasonable time. However, these experiments also demonstrate that the number of generations required to find an optimal solution increases with the cardinality of the model domain.


One observation we make in comparing these communication strategies is that, in each experiment, each one-way communication strategy (5, 7, and 9) resulted in a lower average number of generations than the similar two-way communication strategy (6, 8, and 10 respectively). Note that communication strategy 4 is also essentially one-way. These results suggest that two-way communication strategies may be too disruptive and should be avoided. They also raise the question: is there an evolutionary advantage to only permitting a “one-way” exchange of genetic material between biological colonies?

We note that only 10 communication strategies were examined in these experiments. In experiment 4, the average number of generations was reduced by 85% when communication strategy 9 was used instead of 1. However, as previously discussed, an infinite number of communication strategies are possible. These experiments suggest that other communication strategies exist which would reduce the average generations required further. Moreover, the CAFECS algorithm could be used to develop communication strategies by, for example, evolving the communication matrix which specifies the colony network topology. Additional research in this area appears promising, but is beyond the scope of this work and is left for future research.

4.3 Model Combination and Optimization

In this section, three experiments are presented that were conducted to examine the model selection process when CAFECS is used to combine models. For many problems, there exist known models that can be used to perform the forecast or classification desired. These models may differ for a variety of reasons including different assumptions made by the model developer regarding the system being modeled as well as different data available concerning that system. One common difference between such models is that they use different attributes which are measured to characterize the system. This may be due to simplifying assumptions made about the system or the difficulty in obtaining all the data desired. Data differences include different sets of measurements of the system at a given time. For user-defined models that can be represented by model parameter vectors, CAFECS can be used to evaluate and possibly improve these models apropos of the user-specified model criterion function. Moreover, different user-defined models can be combined using CAFECS. The result can be that the model criterion value of the combined model is less than the model criterion value of either model it was constructed from.

For example, consider model A that uses the value of attribute 1 at time t to make a forecast at time t+1. Now consider model B that uses the value of attribute 2 at time t to also make a forecast at time t+1. When the user only knows the value of attribute 1, he uses model A. When the user only knows the value of attribute 2, he uses model B. The question arises about what to do when the user knows both attribute 1 and attribute 2. One solution is to choose either model A or B based on past performance and ignore one of the attributes. However, in some cases it is possible to combine models A and B, perhaps modifying their structure and parameters in the process, such that a model C is created that uses both attribute 1 and attribute 2 and generates forecasts that are more accurate than those made by either model A or B.

To examine the process of model optimization and combination using CAFECS, the following three experiments are presented. In experiment 5, a forecasting model is selected using one input attribute set. In experiment 6, another forecasting model is selected that forecasts the same output attribute as the model in experiment 5, but uses a disjoint set of input attributes. In experiment 7, the models selected in experiments 5 and 6 are combined and evolved further.

In the remainder of this section, experiments 5, 6, and 7 will each be defined. This is followed by a summary of the results of all three experiments, after which the results are discussed.

Figure 30. 30,000 days of daily high temperatures at a Detroit weather station (plot title: Daily High Temperature at Detroit Great Lakes Station; x-axis: time in days from 1/1/1897; y-axis: temperature in degrees F)

Experiments 5, 6, and 7 all use the Detroit daily high temperature data set depicted in figure 30. In order to examine the cyclical nature of the data more closely, the plot in figure 31 depicts the first 5000 data points plotted in figure 30.

This data [NGDC 94] was collected from 1897 to 1983 at the Detroit Great Lakes Weather Station and consists of two time-series: the day number from 1/1/1897 and the daily high temperature in degrees Fahrenheit. The goal is to create a model that can forecast the daily high temperature for day t with minimal error given either the date at time t, the daily high temperatures for days prior to day t, or both. Note that our purpose is to examine the model optimization and combination process rather than to develop a usable weather prediction model.

Figure 31. The first 5000 days of daily high temperatures from figure 30 (x-axis: time in days from 1/1/1897; y-axis: temperature in degrees F)

4.3.1 Experiment 5: Temperature forecasting with models using day number

In this experiment, we use CAFECS to select a forecasting model that only uses the day number to make a forecast, i.e., $f_{M(\theta)}(Z(t))$ = daily high temperature on day t, where t is the number of days from 1/1/1897. By examining the data, we define a model domain that contains nodes with sine operators to aid in modeling cycles in the data. Knowing that the data is earth weather, a reasonable assumption would be that one cyclical pattern in the data would have a period of approximately 365.25 days. However, we will ignore this fact for the purposes of this experiment so as to examine whether such a cycle is discovered. Instead, we will create two sets of nodes with sine operator types in each model. The sine operator type takes three arguments: amplitude, period, and phase. In the first set of nodes, each node has a period of less than 500 days. In the second set, each node has a period of between 500 and 5000 days. This reserves some nodes to model lower frequency patterns in the data while other nodes can model higher frequency patterns. In addition, since there may be some linear long-term warming or cooling trend in the data, we can reserve nodes to model the linear component as well.

As a result of these considerations, we defined the model domain such that each element defines a function of the following form:

$$
f(d) = A_1 \sin(A_2(d + A_3)) + A_4 \sin(A_5(d + A_6)) + A_7 d + A_8
$$

where d is the day number, f(d) is the forecast in degrees, and A1 through A8 are selected using evolution strategies. The model domain is restricted to define functions of the above form by defining the nodes so that none of the node operator types or node links are evolvable. Further details regarding the model domain and other specifications of the experiment are detailed in table 9. Note that only 8 coefficients are required to define function f, but each model defined by the model domain contains 16 nodes with a total of 20 evolvable parameters. This was done because increasing the number of nodes in a model increases the probability that a subset of the nodes of a model defines the function desired.
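To make the functional form concrete, the sketch below evaluates f(d) for one set of coefficients. The coefficient values are purely hypothetical (they are not results from the experiment), and expressing A2 as 2π divided by a period in days is an assumption about how the period argument of the sine operator maps onto the formula.

```python
import math
from typing import Sequence


def f(d: float, A: Sequence[float]) -> float:
    """Evaluate f(d) = A1 sin(A2 (d + A3)) + A4 sin(A5 (d + A6)) + A7 d + A8."""
    A1, A2, A3, A4, A5, A6, A7, A8 = A
    return (
        A1 * math.sin(A2 * (d + A3))
        + A4 * math.sin(A5 * (d + A6))
        + A7 * d
        + A8
    )


# Hypothetical coefficients: one annual cycle of amplitude 25 degrees around
# a mean of 58 degrees, with no second cycle and no linear trend.
ANNUAL = 2 * math.pi / 365.25
EXAMPLE_A = (25.0, ANNUAL, 0.0, 0.0, 1.0, 0.0, 0.0, 58.0)

peak = f(365.25 / 4, EXAMPLE_A)  # a quarter period after d = 0
```

With these illustrative values the forecast oscillates between 33 and 83 degrees and repeats every 365.25 days; the evolution strategy's task is to find the eight coefficients that best fit the observed temperatures.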

Specification Type: Model Domain
    No. of nodes per model: 16
    Node 0:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type 0: Sum
        No. of evolvable ES parameters: 1
        No. of evolvable GA parameters: 0
    Node 1:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type 0: Multiply
        No. of evolvable ES parameters: 1
        No. of evolvable GA parameters: 0
    Node 2, 4, 8, 9, 12, 13:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type 0: Sine
        No. of evolvable ES parameters: 3
        No. of evolvable GA parameters: 0
    Node 3, 5, 6, 7, 10, 11, 14, 15:
        No. of evolvable bits in Input Link Vector: 0
        Number of Operator types: 1
        Operator type 0: Sum
        No. of evolvable ES parameters: 0
        No. of evolvable GA parameters: 0

Specification Type: Observation Data Set
    Number of training observations: 29585
    Number of test observations: 2556
    No. of time-series: 2
    Output attribute is an input attribute: Yes
    Data Observation Window size: 1

Specification Type: Model Criterion Function
    $V(\theta, Z^T) = \dfrac{\sum_{t=0}^{t<2956} \left( y(k) - f_{M(\theta)}(Z(k)) \right)^2}{2956}$, where k = 10t + 30

Specification Type: CAFECS Algorithm Parameters
    GA rate: 0
    ES Mutation rate (per evolvable ES child parameter): 0.001
    Unique models mode: Off
    Initial population of model parameter vectors: Random
    No. of Model parameter vectors per colony: 10
    No. of Colonies: 1
    Communication Strategy: Isolated
    Termination criterion: Gen. ≥ 400

Table 9. Experiment 5 specifications
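The model criterion function in table 9 is a mean squared error evaluated on every tenth observation, starting at index 30. A minimal sketch follows; the function name, argument names, and the synthetic data are assumptions for illustration only.

```python
from typing import Callable, Sequence


def criterion_value(
    y: Sequence[float],
    forecast: Callable[[int], float],
    n_terms: int,
    stride: int = 10,
    offset: int = 30,
) -> float:
    """Mean squared forecast error over indices k = stride * t + offset,
    for t = 0, 1, ..., n_terms - 1."""
    total = 0.0
    for t in range(n_terms):
        k = stride * t + offset
        total += (y[k] - forecast(k)) ** 2
    return total / n_terms


# Synthetic observations (NOT the Detroit data), sampled at k = 30, 40, ..., 70.
toy_y = [float(k) for k in range(100)]
perfect = criterion_value(toy_y, lambda k: toy_y[k], n_terms=5)
off_by_two = criterion_value(toy_y, lambda k: toy_y[k] + 2.0, n_terms=5)
```

For experiment 5 the corresponding call would use `n_terms=2956` with the default stride and offset, so the largest index visited is 10 · 2955 + 30 = 29580, which lies within the 29585 training observations.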

4.3.2 Experiment 6: Temperature forecasting with moving-average models

In experiment 5, a forecasting model could use the day number of the day to be forecasted as an input argument but not the temperatures of any prior days. In this experiment, a forecasting model may use a combination of the daily high temperatures of the 15 days prior to the forecasted day as input arguments but may not use the day number. Similar to experiments 1 and 2, each element of the model domain defines a model that calculates a moving average. Each model consists of a single node with a Mean operator type and 15 evolvable input links, each selecting a different day’s temperature as an input argument. Additional specifications of the experiment are detailed in table 10.

Specification Type: Model Domain
  No. of nodes per model: 1
  Node 0:
    No. of evolvable bits in Input Link Vector: 15
    Number of Operator types: 1
    Operator type 0: Mean
    No. of evolvable ES parameters: 0
    No. of evolvable GA parameters: 0

Specification Type: Observation Data Set
  Number of training observations: 29585
  Number of test observations: 2556
  No. of time-series: 2
  Output attribute is an input attribute: Yes
  Data Observation Window size: 15

Specification Type: Model Criterion Function

$$V(\theta, Z^T) = \frac{1}{999} \sum_{t=0}^{t<999} \bigl( y(k) - f_{M(\theta)}(Z(k)) \bigr)^2, \quad \text{where } k = 10t + 15$$

Specification Type: CAFECS Algorithm Parameters
  GA rate: 1
  GA Parent Selection Algorithm: RWS
  GA Crossover type: Uniform
  GA Population replacement percent (per generation): 100%
  GA Mutation rate (per evolvable bit): 0.02
  Unique models mode: Off
  Initial population of model parameter vectors: Random
  No. of Model parameter vectors per colony: 100
  No. of Colonies: 1
  Communication Strategy: Isolated
  Termination criterion: Gen. ≥ 1000

Table 10. Experiment 6 specifications
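The model criterion function of table 10 is a mean squared forecast error evaluated at every tenth observation (k = 10t + 15). The sketch below shows that computation in Python. The function name `model_criterion`, its parameter names, and the toy series and "persistence" forecaster are hypothetical and serve only to make the example runnable.

```python
def model_criterion(y, forecast, n_samples, stride=10, offset=15):
    """Mean squared forecast error over observations k = stride*t + offset.

    y:        sequence of observed values indexed by day
    forecast: callable mapping a day index k to the model's forecast
    """
    total = 0.0
    for t in range(n_samples):
        k = stride * t + offset
        error = y[k] - forecast(k)
        total += error * error
    return total / n_samples

# Hypothetical series and a forecaster that repeats the previous day's value.
y = [float(d % 7) for d in range(60)]

def persistence(k):
    return y[k - 1]

v = model_criterion(y, persistence, n_samples=4)
```

A lower value indicates a better model under this criterion. A forecaster that reproduces the observations exactly would score 0.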

4.3.3 Experiment 7: Temperature Forecasting with combined models

In this experiment, we specify a model domain whose elements define models that combine models similar to those used in experiments 5 and 6. The two model types are connected using a node with a Weighted Mean operator type. A simplified depiction of the models defined by the model domain specified for this experiment is illustrated in figure 32.

[Figure 32: block diagram. Data inputs Day(t) and Temp(t-1) through Temp(t-5) feed nodes labeled Sum, Sine (period ≤ 500), Sine (period > 500), Mean, Weighted Mean, and Weighted Sum, which produce the forecasts.]

Figure 32. A simplified version of the combined temperature forecasting model

The specifications for this experiment are detailed in table 11. Note that for this experiment, the initial population of parameter vectors is user-specified rather than randomly selected from the model domain as was done in experiments 5 and 6. The initial values of the evolvable node links and node parameters were set to the values of the models in experiments 5 and 6 with the lowest model criterion values. Since models with low model criterion values selected during experiment 6 ignored input arguments of daily high temperature with indexes from t-6 to t-15, models in this experiment were restricted from using these values as input arguments.

Specification Type: Model Domain
  No. of nodes per model: 28
  Node 0:
    No. of evolvable bits in Input Link Vector: 0
    Number of Operator types: 1
    Operator type 0: Sum
    No. of evolvable ES parameters: 1
    No. of evolvable GA parameters: 0
  Node 1:
    No. of evolvable bits in Input Link Vector: 0
    Number of Operator types: 1
    Operator type 0: Multiply
    No. of evolvable ES parameters: 1
    No. of evolvable GA parameters: 0
  Node 2, 4, 8, 9, 12, 13:
    No. of evolvable bits in Input Link Vector: 0
    Number of Operator types: 1
    Operator type 0: Sine
    No. of evolvable ES parameters: 3
    No. of evolvable GA parameters: 0
  Node 3, 5, 6, 7, 10, 11, 14 - 18, 22 - 24:
    No. of evolvable bits in Input Link Vector: 0
    Number of Operator types: 1
    Operator type 0: Sum
    No. of evolvable ES parameters: 0
    No. of evolvable GA parameters: 0
  Node 19 - 21:
    No. of evolvable bits in Input Link Vector: 5
    Number of Operator types: 1
    Operator type 0: Mean
    No. of evolvable ES parameters: 0
    No. of evolvable GA parameters: 0
  Node 25 - 27:
    No. of evolvable bits in Input Link Vector: 0
    Number of Operator types: 1
    Operator type 0: Wgt. Mean
    No. of evolvable ES parameters: 2
    No. of evolvable GA parameters: 0

Specification Type: Observation Data Set
  Number of training observations: 29585
  Number of test observations: 2556
  No. of time-series: 2
  Output attribute is an input attribute: Yes
  Data Observation Window size: 6

Specification Type: Model Criterion Function

$$V(\theta, Z^T) = \frac{1}{2958} \sum_{t=0}^{t<2958} \bigl( y(k) - f_{M(\theta)}(Z(k)) \bigr)^2, \quad \text{where } k = 10t + 5$$

Specification Type: CAFECS Algorithm Parameters
  GA rate: 0.5
  GA Parent Selection Algorithm: RWS
  GA Crossover type: Uniform
  GA Population replacement percent (per generation): 50%
  GA Mutation rate (per evolvable bit): 0.002
  ES Mutation rate (per evolvable ES child parameter): 0.001
  Unique models mode: Off
  Initial population of model parameter vectors: Custom
  No. of Model parameter vectors per colony: 10
  No. of Colonies: 1
  Communication Strategy: Isolated
  Termination criterion: Gen. ≥ 700

Table 11. Experiment 7 specifications

4.3.4 Results of Experiments 5, 6, and 7

The results of experiments 5, 6, and 7 are summarized in table 12.

Model Description                           Mean Absolute Error (°F)   Mean Squared Error (°F²)
Constant Forecast Model (mean = 57.2)       18.50                      455.1
Experiment 5: Model with only day input     7.33                       87.03
Experiment 6: Moving-average Model          5.81                       59.76
Experiment 7: Combined Model                5.44                       49.98

Table 12. Summary of results of four temperature forecasting models

The first line of table 12 presents the results of a Constant Forecast Model. This model was created manually for comparison purposes. The mean temperature of the cases in the training set, from day 0 to day 29585, is 57.2. When no input arguments are known, the Constant Forecast Model minimizes the error of the training set by forecasting a constant value of 57.2. Each of the remaining three lines of the table presents the results of the model with the lowest model criterion value of the models selected in its respective experiment. Each model was tested on 2526 days of data following the training data. The mean absolute error used in the table is defined to be

$$\frac{1}{2526} \sum_{t=29615}^{t<29615+2526} \lvert y(t) - f(t) \rvert$$

The mean squared error (MSE) used in the table is defined to be

$$\frac{1}{2526} \sum_{t=29615}^{t<29615+2526} \bigl( y(t) - f(t) \bigr)^2$$
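The two error measures used in table 12 can be computed as in the following Python sketch. The function names and the four sample observations are hypothetical. The dissertation's test window of 2526 days is replaced here by a short list so that the arithmetic can be checked by hand.

```python
def mean_absolute_error(observed, forecast):
    """Average of |y(t) - f(t)| over the test window."""
    if len(observed) != len(forecast) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(y - f) for y, f in zip(observed, forecast)) / len(observed)

def mean_squared_error(observed, forecast):
    """Average of (y(t) - f(t))^2 over the test window."""
    if len(observed) != len(forecast) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((y - f) ** 2 for y, f in zip(observed, forecast)) / len(observed)

# Hypothetical observations and forecasts (errors: -2, 0, 3, -1).
observed = [50.0, 54.0, 61.0, 59.0]
forecast = [52.0, 54.0, 58.0, 60.0]
mae = mean_absolute_error(observed, forecast)
mse = mean_squared_error(observed, forecast)
```

Because the MSE squares each error, a few large misses raise it much more than they raise the mean absolute error, which is why the two columns of table 12 can rank models by different margins.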

4.3.5 Discussion of Experiments 5, 6, and 7

Of note is that the 49.98 MSE of the combined model is lower than both the 59.76 MSE of the moving-average model and the 87.03 MSE of the model using only the day number as an input argument. Comparing the evolvable node links and parameters of these three models showed that the node parameters of some nodes with the sine operator types differed between the models with the lowest model criterion values in experiments 5 and 7. Recall that in generation 0 of experiment 7, the model nodes that corresponded to the model nodes of the model with the lowest model criterion value in experiment 5 were identical to them.

In an additional test identical to experiment 7 except that the parameters of the nodes with the sine operator types are held constant, little reduction in the MSE of the best model selected was obtained compared to experiment 6. Instead, the node parameters of the node with the weighted mean operator type changed so that the output value of the nodes with the sine operator types had almost no effect on the model forecasts of the best model selected. This made the selected model almost identical to the moving-average model selected in experiment 6.
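The collapse described above can be illustrated with a small Python sketch of a weighted-mean node that combines a seasonal (sine) component with a moving average. Everything here is an assumption made for illustration: the one-sine `seasonal` function, its amplitude, period, and phase, and the two-weight form of `weighted_mean` are not taken from the evolved CAFECS models. Only the mean of 57.2 comes from the text.

```python
import math

def seasonal(day, amplitude=25.0, period=365.25, phase=0.0, offset=57.2):
    """Hypothetical one-sine seasonal model of the day number."""
    return offset + amplitude * math.sin(2 * math.pi * day / period + phase)

def weighted_mean(values, weights):
    """Weighted mean of node outputs; the weights must not sum to zero."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

def combined_forecast(day, recent_temps, w_seasonal, w_average):
    """Combine the seasonal component with the mean of recent temperatures."""
    moving_average = sum(recent_temps) / len(recent_temps)
    return weighted_mean([seasonal(day), moving_average],
                         [w_seasonal, w_average])

recent = [60.0, 62.0, 61.0, 59.0, 63.0]  # hypothetical Temp(t-1)..Temp(t-5)
balanced = combined_forecast(100, recent, 1.0, 1.0)
collapsed = combined_forecast(100, recent, 0.0, 1.0)
```

When the weight on the seasonal component is driven to zero, the combined forecast equals the moving average alone, which mirrors the behavior observed when the sine node parameters were held constant.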

This result is interesting because when people construct models, as when they engineer machines, sub-components are often designed and tested separately. When the sub-components are assembled together, generally little modification of the sub-components is done. This result suggests that the overall model, or machine, may be improved if more modification of the sub-components is possible. An analogy can be made with machine design. A design goal may be to minimize weight while maintaining structural integrity. The weight of the individual sub-components can be minimized such that structural integrity is lost by any further weight reduction. However, once the sub-components are assembled, it may be possible to reduce the weight of various sub-components further without loss of structural integrity. A similar analysis appears to hold with the temperature forecasting models developed. When the models that only used the day number as an input argument were initially developed in experiment 5, they could not benefit from the past temperature data. Once another part of the model was able to use this data to modify forecasts that were previously predicted poorly, the parameters of the nodes in the model with sine operator types could change, resulting in a drop in the MSE of the overall model.

4.4 Constructing Attribute Sets for Decision Trees

Decision trees are structures widely used for classifying data and modeling. In a simple decision tree, each non-leaf node represents a test of a single attribute where the test outcome specifies which branch to follow. Each leaf node specifies a single classification. A case is classified by beginning at the root node, performing the test at each non-leaf node, and following the branch selected by the test result until a leaf node is reached. This node specifies the class of the case. We define a rule to be the conjunction of the node tests on the path of non-leaf nodes from the root node to a leaf node that specifies a class. Thus, one rule corresponds to each leaf in a decision tree.

Quinlan [86] published an algorithm that constructed a decision tree by induction on a set of cases. Later he developed the program C4.5 which executes a more sophisticated version of this algorithm [Quinlan 93]. We use a variation of C4.5 to perform all the induction of decision trees performed in this section.

Consider the following example of a control system represented by a decision tree. The voltage and temperature define the state of the system. The classification specifies the control action to be taken. Nine different states with the corresponding control actions are detailed in table 13.

State I.D.   Voltage   Temperature   Control Action
1            17        57            Raise voltage
2            12        123           Accept
3            5         265           High temp. warning
4            32        37            Raise voltage
5            25        149           Accept
6            43        357           Lower voltage
7            58        24            Low temp. warning
8            63        167           Accept
9            71        214           Lower voltage

Table 13. Sample states of a system with control actions

The data in this table was used by C4.5 to induce the decision tree shown in figure 33.

[Figure 33: decision tree. The root tests Temperature (≤ 214 or > 214). Further tests are Temperature (≤ 57 or > 57), Voltage (≤ 43 or > 43), and Voltage (≤ 17 or > 17). The leaves are Accept, Raise Voltage, Low Temp. Warning, High Temp. Warning, and Lower Voltage.]

Figure 33. An example of a decision tree induced by C4.5 from data in table 13

An example of a rule extracted from this decision tree is: if temperature ≤ 214 and temperature ≤ 57 and voltage > 43, then perform the low temp. warning command.
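Classification by a decision tree and the extraction of one rule per leaf can be sketched in Python as follows. The tree literal is a reading of figure 33 and is an assumption of this example, as are the tuple encoding and the function names `classify` and `rules`. The original figure may arrange the branches differently.

```python
# Internal node: (attribute, threshold, subtree_if_le, subtree_if_gt).
# Leaf: the class name as a string.
TREE = ("temperature", 214,
        ("temperature", 57,
         ("voltage", 43, "Raise voltage", "Low temp. warning"),
         "Accept"),
        ("voltage", 17, "High temp. warning", "Lower voltage"))

def classify(tree, case):
    """Follow the test outcomes from the root until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, threshold, le_branch, gt_branch = tree
        tree = le_branch if case[attribute] <= threshold else gt_branch
    return tree

def rules(tree, path=()):
    """Return one (conjunction of tests, class) pair per leaf."""
    if isinstance(tree, str):
        return [(path, tree)]
    attribute, threshold, le_branch, gt_branch = tree
    return (rules(le_branch, path + (f"{attribute} <= {threshold}",))
            + rules(gt_branch, path + (f"{attribute} > {threshold}",)))

# State 7 of table 13: voltage 58, temperature 24.
label = classify(TREE, {"voltage": 58, "temperature": 24})
all_rules = rules(TREE)
```

Here `all_rules` holds one entry per leaf, and the entry for the low temperature warning is the conjunction quoted in the text.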

The process of removing nodes from a decision tree is called pruning. Generally, the purpose of pruning is to create a simpler, more understandable decision tree even though the tree will no longer correctly classify all the cases used to train the tree. However, if some of the training cases are inaccurate, i.e., noisy data were used to create a decision tree, pruning may also result in a more accurate decision tree. We modified the C4.5 algorithm to include the OPT algorithm developed by Bohanec and Bratko [94] so that the pruned trees are always optimal. More specifically, the OPT algorithm always prunes a tree so that the maximum number of leaves is removed for a given number of misclassified training cases. We use the term optimally pruned to indicate that the pruning was done so that this requirement was met.

One shortcoming of C4.5 is that it does not perform tests on combinations of attributes. For many data sets, the speed gained by ignoring attribute combinations is worthwhile. However, in some instances it is advantageous to spend considerable time searching for attributes that enable a decision tree induction algorithm, such as C4.5, to construct decision trees that are smaller or that classify the data more accurately. The search for such attributes is often non-intuitive and time-consuming. In such instances, CAFECS can be helpful.

In this section, we will use CAFECS to build models that construct attribute sets that allow C4.5 to construct decision trees that are smaller and classify the test data with less error.

4.4.1 Experiment 8: Elk habitat classification

For the following experiment, we use a Landsat Thematic Mapper (TM) satellite image of a mountainous region in central Colorado along with a ground truth map [Augusteijn 96]. This database was originally used to study the habitat for elk [Huber & Casler 90] and later used to test a neural network classifier [Augusteijn, et al., 95]. The image consists of a 1784 by 1907 matrix of pixels and is registered in seven frequency bands. The first three bands are in the visible spectrum and the remaining bands are in the near-infrared spectrum. We ignore band number 6 because it has a lower resolution than the other bands. The six bands that are used are summarized in table 14.

Band No.   Band Name
1          Blue
2          Green
3          Red
4          Infrared 1
5          Infrared 2
7          Infrared 3

Table 14. Elk Image Frequency Bands

The band names provide a rough description of the band frequencies. So, for each pixel we have 6 intensity values, each corresponding to a different frequency band. In addition, we have a classification of each pixel provided by the ground truth map. Of the 3,402,088 pixels in the image (1784 x 1907 = 3,402,088), 6 pixels were assigned erroneous classes and not used in the experiment. Each of the remaining pixels was assigned one of the 14 classes itemized in table 15.

Class Id.   Class Description          Pixel Count   Pixel % of total image
1           Ponderosa Pine             581,424       17%
2           Douglas Fir                355,145       10%
3           Spruce/Fir                 181,036       5%
4           Mixed Conifer              272,282       8%
5           Limber/Bristlecone Pine    144,334       4%
6           Aspen/Conifer Mix          208,152       6%
7           Non-vegetated              170,196       5%
8           Aspen                      277,778       8%
9           Water                      16,667        0.5%
10          Wet Meadow                 97,502        3%
11          Riparian Deciduous Shrub   127,464       4%
12          Mesic Grassland            267,495       8%
13          Dry Meadow                 675,048       20%
14          Alpine                     27,556        0.8%

Key for figure 34: gray-scale intensity, from black (class 1) to white (class 14).

Table 15. Elk Image Pixel Classifications of Groundcover

The classification of a pixel indicates the primary groundcover of the location that the pixel is an image of, an area approximately 28.5 by 28.5 meters [Huber & Casler 90]. The pixels were manually classified, and we expect that a small percentage of the pixels have been misclassified, although we do not know what the actual error rate is.

The area of the image covered by the different classes greatly varies. Although there are small homogeneous regions of the various classes throughout the images, a large portion of the image is heterogeneous, i.e., pixels of different classes are adjacent, as shown in figure 34 of a low-resolution map of the image where each gray-scale intensity represents a different class from black representing class #1 to white representing class #14.

Figure 34. Gray-scale groundcover classification map of the Landsat image. Image key in table 15.

The goal of this experiment is to construct a decision tree using C4.5 that accurately defines a class for any pixel given a set of attributes for that pixel that are functions of the pixel’s 6 intensity values.

A training set of 2949 pixels was randomly selected such that the set contains at least 200 pixels from each classification. This set of pixels was used by CAFECS to construct attributes that are used by C4.5 to construct decision trees. A test set of 28610 pixels was also randomly selected such that it is disjoint from the training set and contains at least 2000 pixels from each classification. The test set is used only to allow the user to evaluate the accuracy of the decision tree constructed. The test set is not used by CAFECS or C4.5 to affect how the decision trees are constructed or which trees are to be used. The large test set was used to increase the confidence that the accuracy of the decision tree on the test set is close to its actual accuracy on the entire portion of the image not used for training.
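One way to draw disjoint training and test sets with a per-class minimum is sketched below in Python. The text only states the constraints (set sizes, minimum pixels per class, disjointness), so the procedure itself is an assumption: the function `stratified_split` first reserves the per-class minimums and then fills both sets from the remaining cases at random. All names and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_total, train_min, test_total, test_min, seed=0):
    """Return disjoint index lists (train, test) with per-class minimums."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test, leftovers = [], [], []
    for members in by_class.values():
        if len(members) < train_min + test_min:
            raise ValueError("a class has too few cases for the minimums")
        rng.shuffle(members)
        train.extend(members[:train_min])
        test.extend(members[train_min:train_min + test_min])
        leftovers.extend(members[train_min + test_min:])

    # Fill the remaining quota of each set from the unreserved cases.
    extra_train = train_total - len(train)
    extra_test = test_total - len(test)
    if extra_train < 0 or extra_test < 0 or extra_train + extra_test > len(leftovers):
        raise ValueError("requested set sizes are not attainable")
    rng.shuffle(leftovers)
    train.extend(leftovers[:extra_train])
    test.extend(leftovers[extra_train:extra_train + extra_test])
    return train, test

# Toy example: 90 cases in 3 classes instead of 3.4 million pixels in 14.
labels = [i % 3 for i in range(90)]
train, test = stratified_split(labels, train_total=12, train_min=3,
                               test_total=30, test_min=8)
```

For the experiment described above, the corresponding call would use a training total of 2949 with a minimum of 200 per class and a test total of 28610 with a minimum of 2000 per class.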

For this experiment, CAFECS uses a population of three model colonies, each containing 40 models. The communication strategy used was Static 1-way 2-level described as follows. Every generation, colony #2 and colony #3 each sends a copy of a model parameter vector with the lowest model criterion value in the colony to colony #1. Each model consists of 30 nodes and each node generates a different attribute that can be used by C4.5. So, the inputs of a model are the 6 intensity values of a pixel, and the model’s outputs are 30 attribute values. Additional specifications of the experiment are detailed in table 16. Note that each node permitted 14 operator types using a total of 116 evolvable ES parameters in addition to the 35 evolvable bits in the Input Link Vector.
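The Static 1-way 2-level communication step described above can be sketched in Python as follows. The data layout (a list of colonies, each a list of parameter vectors), the function name `migrate_static_one_way`, and the choice to append the received copies to colony #1 are assumptions of this sketch. How CAFECS incorporates a received model into the receiving colony is not specified in this section.

```python
def migrate_static_one_way(colonies, criterion):
    """Colonies #2 and #3 each send a copy of their best model to colony #1.

    colonies:  list of colonies; colonies[0] is colony #1
    criterion: callable returning the model criterion value (lower is better)
    """
    receiver = colonies[0]
    for sender in colonies[1:]:
        best = min(sender, key=criterion)
        receiver.append(list(best))  # a copy: the sender keeps its model
    return colonies

# Hypothetical one-parameter models scored by the square of the parameter.
colonies = [[[5.0]], [[3.0], [1.0]], [[2.0], [4.0]]]
result = migrate_static_one_way(colonies, lambda vector: vector[0] ** 2)
```

Communication flows in one direction only: colony #1 never sends models back, so colonies #2 and #3 continue to evolve in isolation from each other.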

Three pairs of runs were performed. The two runs in each pair differed only by the random seed used by CAFECS to generate pseudo-random variables. The pairs differed from one another by the percentage the trees were pruned. In runs 1 and 2, the trees were not pruned, so no training data was misclassified. In runs 3 and 4, the trees were optimally pruned so that the tree size could not be reduced further without increasing the misclassified training data percentage to more than 3%. In runs 5 and 6, the pruning percentage was increased to 5%.

Specification Type: Model Domain
  No. of nodes per model: 30
  Node 0 - 29:
    No. of evolvable bits in Input Link Vector: 35
    Number of Operator types: 14
    Operator type 0: Mean; No. of evolvable ES parameters: 0
    Operator type 1: Inverse Sum; No. of evolvable ES parameters: 0
    Operator type 2: Sigmoid; No. of evolvable ES parameters: 36
    Operator type 3: NAND; No. of evolvable ES parameters: 0
    Operator type 4: Multiply; No. of evolvable ES parameters: 1
    Operator type 5: Square; No. of evolvable ES parameters: 0
    Operator type 6: Step; No. of evolvable ES parameters: 2
    Operator type 7: OR; No. of evolvable ES parameters: 0
    Operator type 8: Sum; No. of evolvable ES parameters: 1
    Operator type 9: Sine; No. of evolvable ES parameters: 3
    Operator type 10: Wgt. Sum; No. of evolvable ES parameters: 36
    Operator type 11: Mode; No. of evolvable ES parameters: 0
    Operator type 12: OrNorSelect; No. of evolvable ES parameters: 36
    Operator type 13: Wgt. Mean; No. of evolvable ES parameters: 1

Specification Type: Observation Data Set
  Number of training observations: 2949
  Number of test observations: 28610
  No. of time-series: 7
  Output attribute is an input attribute: No
  Data Observation Window size: 6

Specification Type: Model Criterion Function

$$V(\theta, Z^T) = \frac{\bigl( 100 \cdot L(\theta, Z^T) \bigr)^2}{2949}$$

  where $L(\theta, Z^T)$ is the number of non-leaf nodes in the optimally pruned tree constructed with the cases $f_{M(\theta)}(Z(0)) \ldots f_{M(\theta)}(Z(2948))$

Specification Type: CAFECS Algorithm Parameters
  GA rate: 1
  GA Parent Selection Algorithm: SUS
  GA Crossover type: Uniform
  GA Crossover rate: 0.9
  GA Population replacement percent (per generation): 100%
  GA Mutation rate (per evolvable bit): 0.006
  ES Mutation rate (per ES child parameter): 0.00003
  Unique models mode: Off
  Initial population of model parameter vectors: Random
  No. of Model parameter vectors per colony: 40
  No. of Colonies: 3
  Communication Strategy: Static 1-way 2-level
  Run 1: Prune % 0, Random Seed 1
  Run 2: Prune % 0, Random Seed 2
  Run 3: Prune % 3, Random Seed 1
  Run 4: Prune % 3, Random Seed 2
  Run 5: Prune % 5, Random Seed 1
  Run 6: Prune % 5, Random Seed 2
  Termination criterion: Gen. ≥ 100

Table 16. Experiment 8 specifications

4.4.2 Experiment 8 Results

When we construct an error-free decision tree by C4.5 using the same training set of 2949 cases used by CAFECS and only permit the 6 pixel intensity values to be used as attributes, the optimally-pruned tree has 361 rules and an error rate of 16.4% when tested on the same test set containing 28,610 cases. By using CAFECS, we reduce the number of rules it takes to construct an error-free tree (note that this is not pruning; simpler trees are created originally) and at the same time increase the accuracy of the decision tree, as indicated by the reduction of the error on the test set. The results of the three pairs of runs that differ by pruning percent are shown in figure 35.

[Figure 35: two panels plotted against Generations Evolved (0 to 100): No. of Rules (0 to 350) and Test Error % (8 to 17), each with series Cafecs:0%, Cafecs:3%, and Cafecs:5%.]

Figure 35. Rules and Test Error of Models by Generation for trees pruned 0%, 3%, and 5%. The observation data set had 14 classes, 2949 training cases, and 28610 test cases; cf. C4.5 using 6 attributes, with results of 361 rules and 16.4% errors on test cases.

By ignoring the generation of each point used in the two plots of this figure, the same data points can be displayed in a single plot as shown in figure 36. In addition, figure 36 also contains 11 data points, each denoted by *, obtained by creating optimally-pruned trees using the original 6 attributes. Each of these points represents a tree pruned to a different integer percentage between 0% (the right-most point) and 10% (the left-most point).

[Figure 36: scatter plot of Test Error % (8 to 18) against No. of Rules (0 to 400), with series Cafecs:0%, Cafecs:3%, Cafecs:5%, and C4.5:0%−10%.]

Figure 36. Elk model result summary: Error vs. Rule count

This figure shows a comparison of the reduction of error on the test set obtained by optimally pruning trees created by C4.5 to the test error reduction obtained by using CAFECS to find “better” attributes. We define better attributes to be attributes that enable tree induction algorithms, like C4.5, to construct trees that perform more accurate classifications or are smaller without decreasing classification accuracy. Note that the C4.5 data points exhibit a slight reduction in error and then the error rapidly increases as the number of rules decreases. This is typical of pruning a decision tree. In contrast, observe the CAFECS data points, where the error is reduced as the number of rules decreases. From this figure, we can observe that using the original six attributes, the lowest test error percentage was approximately 16%, which was obtained with 3% pruning creating a tree with 229 rules. The average number of rules per class is about 16.4. Using CAFECS, the lowest test error percentage was approximately 8%, which was also obtained with 3% pruning creating a tree with 73 rules. The average number of rules per class is about 5.2. The smallest tree created using CAFECS was pruned 5% and had 42 rules with a test error of approximately 9%. The average number of rules per class is 3.
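The rules-per-class figures quoted above follow from dividing each rule count by the 14 groundcover classes. A short Python check, with dictionary keys chosen here only as labels:

```python
N_CLASSES = 14  # groundcover classes in table 15

rule_counts = {
    "C4.5, original attributes, 3% pruning": 229,
    "CAFECS, 3% pruning": 73,
    "CAFECS, 5% pruning": 42,
}

# Average number of rules per class, rounded to one decimal place.
rules_per_class = {name: round(count / N_CLASSES, 1)
                   for name, count in rule_counts.items()}
```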

4.4.3 Discussion of Experiment 8

This experiment demonstrates how CAFECS can select attributes in order to construct models that perform classifications more accurately. In addition, the smaller trees usually are easier to understand. Moreover, the counter-intuitive result that the test error percentage declines as the number of rules declines is also of interest.

As is the case with pruning, reducing the number of rules increases, on average, the size of the region of the attribute space that a rule maps to a class. However, it accomplishes this by altering the dimensions of the attribute space rather than by misclassifying training cases in the attribute space, which is what pruning does. Because we limit the size of the model space, CAFECS does not create attributes that result in significantly smaller trees with greater test error. Such attributes exist but are usually quite complex because they, in effect, “memorize” the training cases. Ultimately, however, if the model space is large enough, we would find attributes that create a decision tree with as few as 14 rules that perfectly classify the training data, but we would expect such a tree to perform poorly on the test data. Thus, transforming the attribute space can suffer the same consequences as over-pruning.

4.5 Multiple Non-equivalent Global Optima

In this section, we perform an experiment which allows us to examine a limitation of CAFECS. For some problems, the model domain may define two or more models with model criterion values equal to the global optimum of the model criterion function. CAFECS treats this set of models as equivalent even though tests can demonstrate that they are not, i.e., one model is “better” than the others as determined by the user. The result is that CAFECS does not discriminate between such a set of models and may select the model that is less desirable for the user. A theoretical examination of this and other CAFECS limitations is presented in Chapter 5.

Consider the following example where the user-specified model criterion function does not accurately evaluate a model because infinite time would be required to do so. We show that the model criterion function leads to multiple global optima of models that do not perform similarly on the test data.

We are given the elements specified by the first four rows and columns of a matrix, and then asked to create a model that can predict the elements of the matrix specified by a row or column greater than 4. This matrix is illustrated in figure 37.

          Column
Row   1  2  3  4  5  …
1     X  O  X  O  ?  …
2     O  X  O  X  ?  …
3     X  O  X  O  ?  …
4     O  X  O  X  ?  …
5     ?  ?  ?  ?  ?  …
…     …  …  …  …  …  …

Figure 37. A 4x4 matrix used to find a model that can predict additional elements

We are given a model domain in which each model contains a large number of nodes and the domain of node operators is large as well, offering a wide selection of possible models. Moreover, the model criterion function V is defined as

$$V(\theta) = \sum_{r=1}^{4} \sum_{c=1}^{4} \begin{cases} 1 & f_{M(\theta)}(r, c) \neq C_{r,c} \\ 0 & \text{otherwise} \end{cases}$$

where $f_{M(\theta)}: N \times N \to \{O, X\}$ and $C_{r,c}$ represents the element in the r’th row and c’th column of the matrix. So V(θ) is the number of errors model M(θ) makes when attempting to predict the elements of the matrix C whose row and column are both elements of the set {1, 2, 3, 4}. Thus the maximum model criterion value is 16 and the minimum, i.e., optimal, value is 0. The problem is that a large number of models may have a model criterion value of 0 but not be desirable because they predict the elements specified by a row or column greater than 4 very poorly. Consider the following four models. The first three models were created manually by observing patterns in the data. The fourth was created using the decision tree induction algorithm C4.5.

Model 1: if the sum of the row and column is even, predict X otherwise predict O.

Model 2: if the row and column are both odd or are both even predict X otherwise predict O.

Model 3: if row = column or row = column + 2 or row = column - 2, predict X; otherwise predict O.

Model 4:
if (row ≤ 3) then
    if (column ≤ 3) then
        if (row ≤ 2) then
            if (row ≤ 1) then
                if (column > 2) then predict X
                else if (column ≤ 1) then predict X
                else predict O
            else
                if (column > 2) then predict O
                else if (column ≤ 1) then predict O
                else predict X
        else
            if (column > 2) then predict X
            else if (column ≤ 1) then predict X
            else predict O
    else
        if (row > 2) then predict O
        else if (row ≤ 1) then predict O
        else predict X
else
    if (column > 3) then predict X
    else if (column > 2) then predict O
    else if (column ≤ 2) then
        if (column ≤ 1) then predict O
        else predict X

The model criterion value of each of the above models is a global optimum of 0. Although we only listed four models that predict the 16 elements of the matrix without error, we could have easily specified many more such models. It turns out that the first two models are equivalent in that they both make the same prediction for any given row and column. We could easily give many more examples of equivalent models. However, although models 3 and 4 make the same predictions as models 1 and 2 when the row and column numbers are both elements of {1,2,3,4}, they make different predictions when either the row or column is greater than 4. This is illustrated by figures 38, 39, and 40.

          Column
Row   1  2  3  4  5  6  7  8
1     X  O  X  O  X  O  X  O
2     O  X  O  X  O  X  O  X
3     X  O  X  O  X  O  X  O
4     O  X  O  X  O  X  O  X
5     X  O  X  O  X  O  X  O
6     O  X  O  X  O  X  O  X
7     X  O  X  O  X  O  X  O
8     O  X  O  X  O  X  O  X

Figure 38. 8x8 matrix created by models 1 & 2

          Column
Row   1  2  3  4  5  6  7  8
1     X  O  X  O  O  O  O  O
2     O  X  O  X  O  O  O  O
3     X  O  X  O  X  O  O  O
4     O  X  O  X  O  X  O  O
5     O  O  X  O  X  O  X  O
6     O  O  O  X  O  X  O  X
7     O  O  O  O  X  O  X  O
8     O  O  O  O  O  X  O  X

Figure 39. 8x8 matrix created by model 3

          Column
Row   1  2  3  4  5  6  7  8
1     X  O  X  O  O  O  O  O
2     O  X  O  X  X  X  X  X
3     X  O  X  O  O  O  O  O
4     O  X  O  X  X  X  X  X
5     O  X  O  X  X  X  X  X
6     O  X  O  X  X  X  X  X
7     O  X  O  X  X  X  X  X
8     O  X  O  X  X  X  X  X

Figure 40. 8x8 matrix created by model 4

Each of these three figures illustrates an 8x8 matrix created by one of the four models just presented. The matrix depicted in figure 38 was created by both model 1 and model 2. The matrix depicted in figure 39 was created by model 3, and the matrix depicted in figure 40 was created by model 4. Note that the portion of the matrix used for training, i.e., the elements with both a row and column number less than 5, is identical in all three figures.
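The behavior of the four models can be checked with a short Python sketch. The function names and the `errors` helper are hypothetical. Model 3 is written as the band |row − column| ∈ {0, 2} shown in figure 39, and model 4 is written in a condensed form chosen to reproduce the matrix in figure 40 rather than as the full nested tree. Both are assumptions of this example.

```python
X, O = "X", "O"

def target(row, col):
    """Checkerboard data observation set of figure 38."""
    return X if (row + col) % 2 == 0 else O

def model1(row, col):
    return X if (row + col) % 2 == 0 else O

def model2(row, col):
    return X if row % 2 == col % 2 else O

def model3(row, col):
    return X if abs(row - col) in (0, 2) else O

def model4(row, col):
    # Condensed equivalent of the induced tree, matching figure 40.
    if row <= 3:
        if col <= 3:
            return X if (row + col) % 2 == 0 else O
        return X if row == 2 else O
    if col > 3:
        return X
    return X if col == 2 else O

def errors(model, cells):
    """Number of cells on which the model disagrees with the target."""
    return sum(model(r, c) != target(r, c) for r, c in cells)

TRAIN = [(r, c) for r in range(1, 5) for c in range(1, 5)]
TEST = [(r, c) for r in range(1, 9) for c in range(1, 9) if r > 4 or c > 4]

train_errors = [errors(m, TRAIN) for m in (model1, model2, model3, model4)]
test_errors = [errors(m, TEST) for m in (model1, model2, model3, model4)]
```

All four models score 0 on the 16 training cells, so a criterion based on those cells alone cannot separate them, while their error counts on the 48 remaining cells differ.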


The problem is that since each of these four models has a model criterion value equal to 0, the model selection algorithm shows no preference of one model over another although the predictions they make can be substantially different. In order to change this, the model criterion function, model domain, or data observation set must be modified. To demonstrate the effect on CAFECS, we present the following experiment.

4.5.1 Experiment 9: Matrix Pattern Recognition

We created a data observation set of 64 cases illustrated by figure 38. The 16 training cases used are illustrated by figure 37 and the remaining 48 cases are test cases. The input attributes are the row and column numbers. The prediction, i.e., the desired model output, is either class X or O. Three model colonies, each containing 40 models, were randomly generated. The inter-colony communication strategy used was Static 1-way 2-level and is described as follows. Every generation, colony #2 and colony #3 each sent a model parameter vector with the lowest model criterion value in the colony to colony #1. Each model consisted of 30 nodes and each node generated a value that was considered class X if the node value was > 0; otherwise it was considered class O. The model criterion value of a model is simply the number of misclassified training cases. Additional specifications of the experiment are detailed in table 17.

Specification Type: Model Domain
  No. of nodes per model: 30
  Node 0 - 29:
    No. of evolvable bits in Input Link Vector: 31
    Number of Operator types: 14
    Operator type 0: Mean; No. of evolvable ES parameters: 0
    Operator type 1: Inverse Sum; No. of evolvable ES parameters: 0
    Operator type 2: Sigmoid; No. of evolvable ES parameters: 32
    Operator type 3: NAND; No. of evolvable ES parameters: 0
    Operator type 4: Multiply; No. of evolvable ES parameters: 1
    Operator type 5: Square; No. of evolvable ES parameters: 0
    Operator type 6: Step; No. of evolvable ES parameters: 2
    Operator type 7: OR; No. of evolvable ES parameters: 0
    Operator type 8: Sum; No. of evolvable ES parameters: 1
    Operator type 9: Sine; No. of evolvable ES parameters: 3
    Operator type 10: Wgt. Sum; No. of evolvable ES parameters: 32
    Operator type 11: Mode; No. of evolvable ES parameters: 0
    Operator type 12: OrNorSelect; No. of evolvable ES parameters: 32
    Operator type 13: Wgt. Mean; No. of evolvable ES parameters: 1

Specification Type: Observation Data Set
  Number of training observations: 16
  Number of test observations: 48
  No. of time-series: 3
  Output attribute is an input attribute: No
  Data Observation Window size: 2

Specification Type: Model Criterion Function

$$V(\theta) = \sum_{r=1}^{4} \sum_{c=1}^{4} \begin{cases} 1 & f_{M(\theta)}(r, c) \neq C_{r,c} \\ 0 & \text{otherwise} \end{cases}$$

Specification Type: CAFECS Algorithm Parameters
  GA rate: 1
  GA Parent Selection Algorithm: SUS
  GA Crossover type: Uniform
  GA Crossover rate: 0.9
  GA Population replacement percent (per generation): 100%
  GA Mutation rate (per evolvable bit): 0.006
  ES Mutation rate (per ES child parameter): 0.00003
  Unique models mode: Off
  Initial population of model parameter vectors: Random
  No. of Model parameter vectors per colony: 40
  No. of Colonies: 3
  Communication Strategy: Static 1-way 2-level
  Termination criterion: Gen. ≥ 100

Table 17. Experiment 9 specifications

4.5.2 Experiment 9 Results
The model criterion test function value was the number of misclassified test cases. This function is used to help determine whether the model criterion function accurately evaluates the models defined in the model domain. Figure 41 shows the lowest value of each function obtained over 100 generations. Note that the greatest possible model criterion value is 16 because 16 training cases were used, but the greatest possible model criterion test function value is 48. For both the model criterion function and the model criterion test function, the optimal value was 0.

[Figure 41: plot of classification errors (vertical axis, 0 to 25) against generations evolved (horizontal axis, 0 to 100), with one curve for training and one for testing.]

Figure 41. Classification errors using checkerboard data observation set

4.5.3 Discussion of Experiment 9
In this experiment, CAFECS selected a model that correctly classified each case in the training set in the first generation and continued to do so for the remaining generations. However, the test errors for these models varied between 0 and 23 errors. Since the model criterion of each model was 0, there was no way for CAFECS to prefer the models that generated fewer test errors.
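The effect described here can be reproduced with a small sketch. It is not the CAFECS implementation: it assumes an 8 by 8 checkerboard whose training cases are rows 1 to 4 and columns 1 to 4, assumes that class X falls on cells where the row and column numbers sum to an even value, and uses two hand-written hypothetical models (`model_a`, `model_b`) in place of evolved ones. Both models reach a model criterion value of 0 on the training cases, yet they differ widely on the test cases.

```python
# Hypothetical stand-in for the Experiment 9 data: an 8x8 checkerboard whose
# top-left 4x4 block is assumed to be the training set.
ROWS = COLS = 8


def true_class(r, c):
    """Assumed labeling: class 'X' where r + c is even, otherwise 'O'."""
    return "X" if (r + c) % 2 == 0 else "O"


ALL_CASES = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)]
TRAIN = [(r, c) for (r, c) in ALL_CASES if r <= 4 and c <= 4]
TEST = [case for case in ALL_CASES if case not in TRAIN]


def errors(model, cases):
    """Count misclassified cases; with cases=TRAIN this is the model criterion."""
    return sum(1 for (r, c) in cases if model(r, c) != true_class(r, c))


def model_a(r, c):
    """Applies the parity rule to the whole board."""
    return true_class(r, c)


def model_b(r, c):
    """Matches the training block but answers 'O' everywhere else."""
    return true_class(r, c) if (r <= 4 and c <= 4) else "O"


for name, model in (("A", model_a), ("B", model_b)):
    print(name, "training errors:", errors(model, TRAIN),
          "test errors:", errors(model, TEST))
```

Because the criterion only sees the training cases, it assigns both models the same value, so a search guided by it has no basis for preferring the first model over the second.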

If a model criterion function does have more than one global optimum, it is often possible to decrease the number of global optima by increasing the number of attributes of each case or the number of training cases in the data observation set. For example, consider the experiment just presented, which used the 16 training cases depicted in figure 37 to calculate the model criterion value of each model. If the 64 cases depicted in figure 38 were used for training instead and the model criterion function was changed accordingly, then the models that created the matrices shown in figure 39 and figure 40 would no longer have model criterion values that were global optima. The same would be true for all the models representing points in figure 41 that have a classification error of 0 on the training set but an error greater than 0 on the test set.

For some problems the user may not be able to provide a model criterion function with fewer global optima. Instead, the user desires a set of models such that the model criterion value of each element is the global minimum. Unfortunately, the CAFECS algorithm may tend to select a particular element of this set but ignore the others. Note that this is not always the case as demonstrated in figure 41.

One method to address the problem of multiple global optima is to separate the model domain into multiple disjoint subsets. The goal is to reduce the number of global optima in each disjoint subset. Then, the model domain subsets can be searched separately in parallel. This is easily accomplished in CAFECS by creating disjoint node definition set domains. For example, a simple way to separate a model domain into two disjoint sets is to select a single parameter or link from one node. If a parameter was selected, then in one set that parameter must be greater than or equal to 0 and in the other set less than 0. If a link was selected, then in one set the link is always connected and in the other set the link is never connected. If multiple global optima are located in a single subset, then that subset can be repeatedly reseparated until no more than one global optimum (of the original model domain) is found in the subset.
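The two splitting rules can be sketched as predicates over a toy model representation. The `ToyModel` class, its two fields, and the `partition` helper are illustrative assumptions and do not reflect how CAFECS encodes node definition set domains; the sketch only shows that each rule yields two disjoint subsets that together cover the domain, and that a subset can be separated again.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass(frozen=True)
class ToyModel:
    weight: float   # stands in for one evolvable node parameter
    link_on: bool   # stands in for one evolvable input link


Predicate = Callable[[ToyModel], bool]

# Each pair of predicates covers the whole domain and never overlaps.
SPLIT_ON_PARAMETER: Sequence[Predicate] = (
    lambda m: m.weight >= 0.0,
    lambda m: m.weight < 0.0,
)
SPLIT_ON_LINK: Sequence[Predicate] = (
    lambda m: m.link_on,
    lambda m: not m.link_on,
)


def partition(models: Iterable[ToyModel],
              split: Sequence[Predicate]) -> List[List[ToyModel]]:
    """Return one sublist per predicate, in predicate order."""
    models = list(models)
    return [[m for m in models if keep(m)] for keep in split]


sample = [ToyModel(w, link) for w in (-1.5, 0.0, 2.0) for link in (False, True)]
non_negative, negative = partition(sample, SPLIT_ON_PARAMETER)

# Re-separate one subset, as suggested for subsets that still hold several optima.
connected, disconnected = partition(non_negative, SPLIT_ON_LINK)
print(len(negative), len(connected), len(disconnected))
```

Each resulting subset could then be handed to a separate search, which is the parallel arrangement the paragraph describes.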

The difficulty with this method is that the cardinality of these model domain subsets is generally enormous, and searching them can be time-consuming due to the stochastic nature of the search algorithm. Moreover, if we reconsider the degenerate case where the model criterion value of each model is a global optimum, i.e., the model criterion function does not direct the model search, then the minimum number of subsets required such that each subset contains no more than one global optimum is equal to the cardinality of the model domain.

The problems this experiment raises along with other limitations of CAFECS are discussed in the next chapter.


Chapter 5
Limitations of and Guidelines for the CAFECS approach

There is no silver bullet.
-- Frederick Brooks

The major limitations of the CAFECS approach are discussed in this chapter. It begins with the conditions under which CAFECS will not be able to select a model that meets the user’s solution criteria. The chapter concludes with informal guidelines to aid the reader in determining the types of problems for which CAFECS would not be particularly useful relative to other problem-solving techniques.

5.1 Limitations of the CAFECS approach
As presented in section 1.3.4, the model criterion function $V$ and the training observation data set $Z_T$ are user-defined and are used by CAFECS in an attempt to select a model parameter vector $\theta \in S_V$, the solution set. Since $Z_T$ remains constant once $V$ is defined, for simplicity we denote the model criterion value of model $M(\theta)$ as $V(\theta)$ rather than $V(\theta, Z_T)$, where $\theta \in D_M$. $D_M$ is a user-defined set of model parameter vectors over which $V$ is defined. As previously discussed, by allowing the user to define $D_M$, the user can restrict the search space in an attempt to increase the probability that an element of $S_V$ is selected. $S_V$ is defined by the user, depending on the problem, as either $\{\theta \in D_M : V(\theta) \leq \varepsilon\}$, where $\varepsilon$ is user-defined, or the set of global optima $\{\theta \in D_M : \theta \text{ minimizes } V(\theta)\}$. This is done in the hope that if $\theta \in S_V$, then $\theta$ may also be an element of the ideal solution set $S_{\tilde{V}}$. Like $S_V$, $S_{\tilde{V}}$ is defined by the user as either $\{\theta \in \tilde{D}_M : \tilde{V}(\theta) \leq \tilde{\varepsilon}\}$, where $\tilde{\varepsilon}$ is user-defined, or the set of global optima of the ideal model criterion function $\{\theta \in \tilde{D}_M : \theta \text{ minimizes } \tilde{V}(\theta)\}$, where $\tilde{D}_M = \{\theta : \theta \text{ is a model parameter vector of a computable model}\}$. Recall that an ideal model criterion function $\tilde{V} : \tilde{D}_M \to \mathbb{N}$ is defined to be a function such that for any model parameter vectors $\theta_i, \theta_j \in \tilde{D}_M$, if $\tilde{V}(\theta_i) < \tilde{V}(\theta_j)$, then model $M(\theta_i)$ is “better” than model $M(\theta_j)$ as determined by the user. In order to achieve the goal of selecting $\theta \in S_{\tilde{V}}$, the user attempts to define $V$ and $D_M$ such that $S_V \subseteq S_{\tilde{V}}$, or at least so that $S_V$ and $S_{\tilde{V}}$ are not disjoint. The obvious question arises: why attempt to select a $\theta \in S_V$ when what is wanted is $\theta \in S_{\tilde{V}}$? Unfortunately, for most practical problems the process of selecting a $\theta \in S_{\tilde{V}}$ is subject to one or more of the following limitations. Note that these limitations apply to CAFECS as well as to other evolutionary computation algorithms in general.
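The two ways of defining a solution set can be made concrete on a tiny finite domain. Everything in the sketch below is an assumption chosen for illustration: the integer domain, the actual criterion `V`, and the ideal criterion `V_ideal` are hypothetical, and the same finite domain is used for both criteria. It shows the situation the text hopes for at a minimum, in which the two solution sets are not disjoint, while also showing that the actual solution set need not be a subset of the ideal one.

```python
def threshold_solution_set(domain, criterion, eps):
    """All parameter vectors whose criterion value is at most eps."""
    return {theta for theta in domain if criterion(theta) <= eps}


def optimal_solution_set(domain, criterion):
    """All parameter vectors that attain the minimum criterion value."""
    values = {theta: criterion(theta) for theta in domain}
    best = min(values.values())
    return {theta for theta, value in values.items() if value == best}


# Hypothetical finite model parameter domain, shared by both criteria here.
D_M = range(-3, 4)


def V(theta):
    """Hypothetical actual criterion: cannot tell -2 and 2 apart."""
    return abs(theta * theta - 4)


def V_ideal(theta):
    """Hypothetical ideal criterion: only 2 is best."""
    return abs(theta - 2)


S_V = optimal_solution_set(D_M, V)
S_V_ideal = optimal_solution_set(D_M, V_ideal)
print(sorted(S_V), sorted(S_V_ideal), sorted(S_V & S_V_ideal))
```

Selecting from the actual solution set here may return either element, and only one of them is also ideal, which is the risk the limitations below elaborate.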

1. The solution set $S_{\tilde{V}}$ may be empty.
For certain problems, the ideal model criterion function $\tilde{V}$ and error tolerance $\tilde{\varepsilon}$ may be defined such that $S_{\tilde{V}} = \emptyset$. That is, given all possible computable models, infinite information and time to evaluate each model, and infinite time to search for elements of $S_{\tilde{V}}$ in $\tilde{D}_M$ = {all model parameter vectors}, no elements may be found. Usually, the reason why $S_{\tilde{V}} = \emptyset$ can be attributed to the value of $\tilde{\varepsilon}$ being too low. For example, if $\tilde{\varepsilon} = 0$ and $\tilde{V}$ is a strictly positive function, then $S_{\tilde{V}} = \emptyset$. However, in certain cases the reason why $S_{\tilde{V}} = \emptyset$ points to $\tilde{D}_M$ as being the limiting factor.

Consider the following example of an ideal model criterion function that has a solution set $S_{\tilde{V}} = \{\theta \in \tilde{D}_M : \tilde{V}(\theta) \leq \tilde{\varepsilon}\} = \emptyset$, given an error tolerance $\tilde{\varepsilon} = 0$, because the function to be modeled is not computable. We would like to select a model that takes two natural numbers as inputs and outputs either a 0 or a 1. The model interprets the first input as coding a Turing machine and the second input as the input for the Turing machine. The model outputs 1 if the Turing machine halts for the given input; otherwise it outputs 0. That is, we would like to model the following function:

$$\tau(Z, X) = \begin{cases} 1 & \text{Turing machine defined by } Z \text{ halts for input defined by } X \\ 0 & \text{otherwise} \end{cases}$$

where Z and X are natural numbers.

We would attempt to accomplish this by defining the ideal model criterion function $\tilde{V}(\theta) = d(f_{M(\theta)}, \tau)$ and selecting a $\theta \in \tilde{D}_M$ such that $\tilde{V}(\theta) = 0$, where $f_{M(\theta)} : \mathbb{N} \times \mathbb{N} \to \{0, 1\}$ and $d$ is some general distance function. However, we know that no $\theta$ exists for which $\tilde{V}(\theta) = 0$. If such a $\theta$ did exist, then $f_{M(\theta)}(Z, X) = \tau(Z, X)$ would be true for any Z and X. If that were true, then the halting problem would be solvable, and Turing [36] proved that this is not possible.

2. The actual model criterion function may not be ideal because unknown information is required.
We defined an ideal model criterion function $\tilde{V}$ as a function such that for any model parameter vectors $\theta_1 \in \tilde{D}_M$ and $\theta_2 \in \tilde{D}_M$, if $\tilde{V}(\theta_1) < \tilde{V}(\theta_2)$ then model $M(\theta_1)$ is “better” than model $M(\theta_2)$ as determined by the user. However, in many cases the user is unable to determine if model $M(\theta_1)$ is better than model $M(\theta_2)$ at the time the model is to be selected because pertinent information is unknown.

Information necessary to define an ideal model criterion function $\tilde{V}$ may remain unknown for many reasons. In some cases, it may be possible to obtain the information, but a decision was made not to. For example, one may need to build a machine and physically test it in order to accurately determine the ideal model criterion value $\tilde{V}(\theta)$ of a model $M(\theta)$. In this case, the expense or effort required to obtain the information may be excessive. In other cases, the information necessary to define the ideal model criterion function is not possible to obtain.

In the following example, the ideal model criterion function cannot be defined because information that is not possible to obtain is required. We define the function to be modeled as:

$$F(t+1) = \begin{cases} 1 & \text{it rains tomorrow} \\ 0 & \text{it does not rain tomorrow} \end{cases}$$

A model $M(\theta)$, $\theta \in \tilde{D}_M$, forecasts rain such that if the function $f_{M(\theta)}(t+1) = 1$, then the model forecasts rain for tomorrow; otherwise $f_{M(\theta)}(t+1) = 0$. The ideal model criterion function $\tilde{V}$ is defined as:

$$\tilde{V}(\theta) = \begin{cases} 1 & f_{M(\theta)}(t+1) \neq F(t+1) \\ 0 & f_{M(\theta)}(t+1) = F(t+1) \end{cases}$$

Now consider two models $M(\theta_1)$ and $M(\theta_2)$, $\theta_1 \in \tilde{D}_M$ and $\theta_2 \in \tilde{D}_M$, such that $f_{M(\theta_1)}(t+1) \neq f_{M(\theta_2)}(t+1)$. Because $F(t+1)$ is currently unknown, it is unknown whether $\tilde{V}(\theta_1) > \tilde{V}(\theta_2)$ or $\tilde{V}(\theta_1) < \tilde{V}(\theta_2)$, although one and only one of the two relations is true.

In general practice, when all the information required to define an ideal model criterion function $\tilde{V}$ is not available, known information that may be helpful is used to define a model criterion function $V$ that approximates $\tilde{V}$. Continuing the rain forecasting example, consider a model criterion function $V$ defined with known information where:

$$V(\theta) = d\big(\langle f_{M(\theta)}(t-C), f_{M(\theta)}(t-C+1), \ldots, f_{M(\theta)}(t) \rangle, \langle F(t-C), F(t-C+1), \ldots, F(t) \rangle\big)$$

for some distance function $d$ and user-specified natural number $C$. However, it remains undetermined if $\tilde{V}(\theta) = 0$ even when $V(\theta) = 0$ for some $\theta \in \tilde{D}_M$.
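A minimal sketch of such a windowed criterion follows, assuming the distance function is the Hamming distance and using an invented rain record; the names `criterion`, `observed`, `model_a`, and `model_b` are hypothetical. Two forecasting models agree with every known observation in the window, so both receive a criterion value of 0, yet they issue different forecasts for day t + 1.

```python
def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def criterion(forecast, observed, t, C):
    """Distance between forecasts and observations over days t - C through t."""
    window = range(t - C, t + 1)
    return hamming([forecast(i) for i in window],
                   [observed[i] for i in window])


# Invented record: observed[i] is 1 if it rained on day i, else 0.
observed = [0, 1, 1, 0, 1, 1, 0, 1]
t, C = 7, 5


def model_a(i):
    """Hypothetical model: dry every third day, rain otherwise."""
    return 0 if i % 3 == 0 else 1


def model_b(i):
    """Hypothetical model: replays the record, then always forecasts dry."""
    return observed[i] if i < len(observed) else 0


print(criterion(model_a, observed, t, C), criterion(model_b, observed, t, C))
print(model_a(t + 1), model_b(t + 1))
```

Only one of the two forecasts for day t + 1 can turn out to be right, but the criterion built from known information gives no way to tell which.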

As a result, the actual model criterion function $V$ generally does not accurately determine if one model is better than another when unknown information is required and hence is not ideal. That is, the general effect of not using essential information is that the user creates a model criterion function $V$ with domain $D_M = \tilde{D}_M$ using available information to approximate $\tilde{V}$, but

$$\exists \theta_i, \theta_j \big( (\theta_i \in \tilde{D}_M) \wedge (\theta_j \in \tilde{D}_M) \wedge (\tilde{V}(\theta_i) < \tilde{V}(\theta_j)) \wedge (V(\theta_i) \geq V(\theta_j)) \big)$$

is often true. In such a case, the model criterion function $V$ cannot determine whether a model parameter vector $\theta \in S_{\tilde{V}}$, if the set $\{\theta \in \tilde{D}_M - S_{\tilde{V}} : V(\theta) \leq \varepsilon\}$ is not empty. Another reason why this set may not be empty is the lack of time rather than the lack of information; this is discussed in the following limitation.

3. The model criterion function $V$ may not be ideal because infinite time is required to determine the value of the ideal model criterion function $\tilde{V}$ for a model parameter vector $\theta \in \tilde{D}_M$.
Even if all relevant information is known, it may be the case that an infinite amount of time is required to determine if one model is better than another. In that case, it would obviously take an infinite amount of time to compute an ideal model criterion function $\tilde{V}$.

Consider the following example in which the ideal model criterion function $\tilde{V}$ is not usable in some cases because infinite time is required to compute the ideal model criterion value $\tilde{V}(\theta)$ of a model $M(\theta)$. We would like to model a bit string that has an infinite length, $B = B_1 B_2 B_3 \ldots$, where $B_i \in \{0, 1\}$, $1 \leq i < \infty$. The ideal model criterion function is defined as

$$\tilde{V}(\theta) = \sum_{i=1}^{\infty} \left| f_{M(\theta)}(i) - B_i \right|,$$

where $f_{M(\theta)} : \mathbb{N} \to \{0, 1\}$ and $\theta \in \tilde{D}_M$. We note that in special cases an infinite bit string may have a minimum description length [Rissanen 78] that is finite. However, generally the minimum description length of a bit string is infinite. In such cases, $\tilde{V}(\theta)$, $\theta \in \tilde{D}_M$, requires an infinite amount of time to compute. Moreover, even when the time required to compute $\tilde{V}(\theta)$ is not infinite, the time required may not be acceptable to the user.

As in limitation 2, where required information was not available, the result is that the model criterion function $V$ is generally not ideal, i.e.,

$$\exists \theta_i, \theta_j \big( (\theta_i \in \tilde{D}_M) \wedge (\theta_j \in \tilde{D}_M) \wedge (\tilde{V}(\theta_i) < \tilde{V}(\theta_j)) \wedge (V(\theta_i) \geq V(\theta_j)) \big)$$

is true. This can lead to the selection of non-optimal models when the set $\{\theta \in \tilde{D}_M - S_{\tilde{V}} : V(\theta) \leq \varepsilon\}$ is not empty, where $S_{\tilde{V}} = \{\theta \in \tilde{D}_M : \tilde{V}(\theta) \leq \tilde{\varepsilon}\}$. Alternately, let $S_{\tilde{V}}$ be defined as the set of global optima $\{\theta \in \tilde{D}_M : \theta \text{ minimizes } \tilde{V}(\theta)\}$ and $S_V$ be the set of global optima $\{\theta \in D_M : \theta \text{ minimizes } V(\theta)\}$. Unfortunately, due to the difficulties in defining $\tilde{V}$ previously described, $V$ is often defined with the undesirable result that

$$\exists \theta_i, \theta_j \big( (\theta_i \in S_V) \wedge (\theta_j \in S_V) \wedge (\tilde{V}(\theta_i) \neq \tilde{V}(\theta_j)) \big)$$

is true. That is, $V$ has global optima that are not equivalent with respect to $\tilde{V}$. In such cases, $V$ is ineffective at discriminating between model parameter vectors and can lead to the selection of a non-optimal model $M(\theta)$, $\theta \notin S_{\tilde{V}}$, even though $\theta \in S_V$.

In order to illustrate the problem of multiple non-equivalent global optima, we return to the example of the bit string $B$ that has an infinite length. We showed that the ideal model criterion function

$$\tilde{V}(\theta) = \sum_{i=1}^{\infty} \left| f_{M(\theta)}(i) - B_i \right|,$$

for any $\theta \in \tilde{D}_M$, was generally not usable because infinite time was required to compute it for an arbitrary $B$. Instead, we define the actual model criterion function to be

$$V(\theta) = \sum_{i=1}^{c} \left| f_{M(\theta)}(i) - B_i \right|,$$

for any $\theta \in D_M = \tilde{D}_M$, where $c$ is some positive integer. Defined as such, $V(\theta)$ can be computed in finite time for any $B$. The maximum value of $V(\theta)$, for any bit string $B$ and $\theta \in \tilde{D}_M$, is $c$ and the minimum value is 0.

The problem with $V(\theta)$, $\theta \in \tilde{D}_M$, is that it does not evaluate how well $M(\theta)$ models the bits $B_i$ located after bit $B_c$. In an extreme case, we can have a model $M(\theta)$ that only determines the first $c$ bits of $B$ correctly, i.e.,

$$f_{M(\theta)}(i) = \begin{cases} B_i & i \leq c \\ B_i +_2 1 & i > c \end{cases}$$

for $1 \leq i < \infty$, where $a +_2 b$ is modulo 2 addition. In this case, $V(\theta) = 0$ even though $\tilde{V}(\theta) = \infty$. Thus, an algorithm that relies on $V$ to select an optimal model $M(\theta)$ can fail in an extreme manner.
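The extreme case can be checked numerically with a short sketch. The bit string below is a hypothetical choice (a 1 at every third position), and the ideal criterion is only approximated by longer and longer finite prefixes, since the infinite sum itself cannot be evaluated. The model reproduces the first c bits and flips every later bit, so its error over the first c bits is 0 while its error over a prefix of length n grows as n - c.

```python
def B(i):
    """Hypothetical infinite bit string: 1 at every third position, else 0."""
    return 1 if i % 3 == 0 else 0


def make_extreme_model(c):
    """Model that is right for i <= c and wrong (bit flipped) for i > c."""
    def f(i):
        return B(i) if i <= c else (B(i) + 1) % 2  # modulo 2 addition of 1
    return f


def prefix_criterion(f, n):
    """Sum of |f(i) - B_i| for i = 1..n; with n = c this is V(theta)."""
    return sum(abs(f(i) - B(i)) for i in range(1, n + 1))


c = 10
f = make_extreme_model(c)
print("V(theta) over the first c bits:", prefix_criterion(f, c))
for n in (100, 1000, 10000):
    print("errors over the first", n, "bits:", prefix_criterion(f, n))
```

The prefix error increases without bound as n grows, which is the finite-sample view of the ideal criterion value being infinite.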

4. If the model parameter domain $D_M$ from which all candidate solutions are selected is restricted such that it is a proper subset of $\tilde{D}_M$, then it may be the case that the solution set $S_{\tilde{V}}$ and $D_M$ are disjoint. In such a case, a model parameter vector that is an element of the solution set cannot be selected.
For various reasons, e.g., to direct the model search to certain classes of models, the model parameter domain $D_M$ may be restricted to a proper subset of the domain of the ideal model criterion function $\tilde{V}$, i.e., $D_M \subset \tilde{D}_M$. However, a model parameter vector that is an element of the ideal model parameter solution set $S_{\tilde{V}} = \{\theta \in \tilde{D}_M : \tilde{V}(\theta) \leq \tilde{\varepsilon}\}$, where $\tilde{\varepsilon}$ is user-defined, may not be an element of the solution set of the restricted model parameter domain, $S'_{\tilde{V}}$, which is defined as $S_{\tilde{V}} \cap D_M$. In some cases, $S'_{\tilde{V}} = \emptyset$ even though $S_{\tilde{V}} \neq \emptyset$, i.e., the model parameter domain $D_M$ has been restricted such that $D_M$ contains no elements of the ideal model parameter solution set $S_{\tilde{V}}$, so none can be selected.

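For example, consider the following case in which $S'_{\tilde{V}} = \emptyset$. We are given a finite data observation set $Z \subset \{(x_1, x_2, F(x_1, x_2)) : x_1 \in \mathbb{N}, x_2 \in \mathbb{N}\}$. Let $F : \mathbb{N}^2 \to \{0, 1\}$ where

$$F(x_1, x_2) = \begin{cases} 1 & (x_1 + a)^2 + (x_2 + b)^2 = r \\ 0 & \text{otherwise} \end{cases}$$

and the constants $a, b, r \in \mathbb{N}$. Thus, F acts as an indicator of whether or not a point lies on a circle, which under the equation as written has center $(-a, -b)$ and radius $\sqrt{r}$. However, this definition of F is unknown to the user. Instead, the user is given Z and asked to find an element of $S_{\tilde{V}} = \{\theta \in \tilde{D}_M : \tilde{V}(\theta) \leq \tilde{\varepsilon}\}$, where $\tilde{\varepsilon} = 0$. The ideal model criterion function is defined as

$$\tilde{V}(\theta) = \sum_{x_1=0}^{\infty} \sum_{x_2=0}^{\infty} \left| f_{M(\theta)}(x_1, x_2) - F(x_1, x_2) \right|,$$

where $f_{M(\theta)} : \mathbb{N}^2 \to \mathbb{N}$, and we assume sufficient time to compute $\tilde{V}(\theta)$ for any $\theta \in \tilde{D}_M$. However, if we happen to restrict the model parameter domain to $D_M \subset \{\theta \in \tilde{D}_M : f_{M(\theta)} : \mathbb{N}^2 \to \mathbb{N} \text{ is a linear function}\}$, then $S'_{\tilde{V}} = \emptyset$. In this example, $D_M$ and $S_{\tilde{V}}$ are clearly disjoint. However, it is often not clear if $S'_{\tilde{V}} \neq \emptyset$ given $\tilde{V}$, $\tilde{\varepsilon}$, and $D_M$, even when $S_{\tilde{V}} = \{\theta \in \tilde{D}_M : \tilde{V}(\theta) \leq \tilde{\varepsilon}\}$ is known to be not empty.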


5. The probability of selecting an element of $S_{\tilde{V}}$ for a given error tolerance $\tilde{\varepsilon}$, within a user-specified period of time, may be unacceptable to the user.
The model search algorithm is stochastic because deterministic algorithms for selecting an element of the solution set $S_{\tilde{V}}$ are usually not available or exceed the user-specified search time. Reasons for this can include the non-linearity of the ideal model criterion function $\tilde{V}$, the size $|D_M|$ of the model parameter domain, and the computation time of the algorithm to search the model parameter domain $D_M$. Because $\tilde{V}$ is usually highly non-linear, it may not be amenable to being minimized by analytical means. The size $|D_M|$ generally is so large that it precludes performing an exhaustive search. Instead, heuristic algorithms, such as CAFECS, are employed in order to minimize such functions or approximations of them. However, with a heuristic algorithm, the probability of selecting an element of $D_M$ with a model criterion value $V(\theta)$ less than a user-specified value $\varepsilon$ within a specified period of time is often unacceptably low to the user. More specifically, for CAFECS we define this probability as:

$$P_{\tilde{V}}(V, D_M, \tilde{\varepsilon}, g) = P\left[ S_{\tilde{V}} \cap (MP_0 \cup MP_1 \cup \ldots \cup MP_g) \neq \emptyset \right]$$

where: MPi = the model parameter vector population at generation i, where MPi ⊆ DM, 0≤i