Incremental Methods for Bayesian Network Structure Learning

by Josep Roure i Alcobé

Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy 2004

Advisors: Dr. Ramon Sangüesa and Dr. Ulises Cortés

Ph.D. Program on Artificial Intelligence
Departament de Llenguatges i Sistemes d'Informació

Universitat Politècnica de Catalunya

A la memòria de mon pare, que m'inculcà que els llibres són una font immensa de coneixement. A la meva mare, que em revelà que els llibres són una font de plaer infinit.

To the memory of my father, who instilled in me that books are an immense source of knowledge. To my mother, who revealed to me that books are a source of infinite pleasure.

Agraïments i Acknowledgments

The first words will be for my parents, sister and brother... I want the first words of thanks to go to my mother and father, for all the dedication they put into my education. They had to work very hard to get me interested in reading and to motivate me in my studies! A big hug to my sister and my brother, with whom I have shared so many games and quarrels.

I would like to thank my advisor Ramon Sangüesa, who introduced me to the field of Bayesian networks and pointed out to me that there was, and still is, a lot of work to do in learning such structures. He always believed I could do a work like this (I was not that sure). I also want to thank Ulises Cortés, who introduced me to the field of Artificial Intelligence and guided my first steps in doing research. He also gave me the chance to stay in Bath and London, which was a great personal experience. I give special thanks to Luis Talavera, a good friend of mine, who encouraged me to continue my work during all these years and always demanded the best of me (and of everybody, himself included). I hope he will be able to find peace and finish his thesis; I am sure he will. I also thank all my colleagues at the Escola Universitària Politècnica de Mataró, who have suffered my complaints about the hard life of a poor PhD student.

And now for my friends outside work... Thank you very much for standing by me during all this time, especially these last two or three years in which you kept calling me (even though I had no mobile phone), tirelessly, never giving up in the face of my negative answers: "I'm not going out, I have to work!". In short, thank you for reminding me that there is life beyond the thesis. Without you I would have gone quite mad.

A very special memory for Andreu Pitarch i Ribas, a person who knew how to whistle in the face of difficulties and who was an example of integrity and courage for all of us. He was the first person I ever saw enjoying a blank sheet of paper as he slowly filled it with mathematical symbols. He passed the research bug on to me.


Resum

Incremental learning was originally motivated as the human capability of gradually incorporating knowledge drawn from experience, a capability that an artificial agent would desirably have as well. Nowadays, however, there are further practical reasons for the growing interest in incremental algorithms. Companies in many different sectors store mountains of data every day. Batch algorithms are not able to process and incorporate into a knowledge base this large quantity of daily data within a reasonable amount of computing time and memory space. In this kind of environment, incremental learning becomes particularly relevant, since such algorithms can revise existing models of the data without starting from scratch and without reprocessing all the data every time.

We present two different heuristics for transforming a hill-climbing search algorithm into an incremental one. We believe that the heuristic called Traversal Operators in Correct Order (TOCO) is the most novel and original contribution of this work. This heuristic states that, given a knowledge structure and the search path formed by the operators in decreasing order of the quality they contribute, the structure will only be revised if the order of the operators changes when the new data are taken into account. It also stipulates that, when the structure must be revised, the revision starts from the first ill-ordered operator. The TOCO heuristic thus detects when the structure must be revised and, if so, from which point onwards. The second heuristic, which we call Reduced Search Space (RSS), uses the knowledge acquired in previous learning steps to avoid exploring those parts of the search space that led to models of very low quality.

We justify the correctness of the heuristics by introducing the concept of continuous quality functions. We say that a quality function is continuous over the space of datasets when, given a knowledge structure and two arbitrarily similar datasets (with a small Kullback-Leibler distance between them), the function returns very similar quality values when the structure is measured against the two datasets. Using this definition, we justify the TOCO heuristic by observing that the order of the operators of a search path will not change if the new dataset is similar to the old one. The RSS heuristic can be justified in the same way.

Our heuristics need to store sufficient statistics in order to avoid making several passes over the dataset. We have identified in the literature the so-called AD-trees, a structure that allows sparse sufficient statistics to be stored efficiently. This structure lets us keep in memory the statistics needed to perform an exhaustive search in the space of Bayesian networks. We propose two new heuristics that allow us to drop those statistics that with high probability will not be useful in later searches. We call these heuristics Wait Before Dropping (WBD) and Dropping Subsets of Variables (DSV). The DSV heuristic removes from memory those statistics that correspond to weakly correlated variables, while the WBD heuristic delays the dropping process until there are enough data to have sufficient information about the correlations. Since the dropped statistics correspond to weakly correlated variables, it is very unlikely that an arc will exist between these variables, and therefore the statistics will not be used.

We implement our heuristics in several well-known Bayesian network learning algorithms from the literature (CL, K2, B and HCMC), transforming them into incremental algorithms. We show empirically that the resulting incremental algorithms learn structures of very similar quality to those obtained by the batch algorithms while using less time, and that they are also robust to different orderings of the data. We also compare our incremental algorithms with other proposals from the literature. Finally, we apply our heuristics to the incremental learning of naive and augmented Bayesian classifiers.

Abstract

The incremental learning approach was first motivated by the human capability of incorporating knowledge from new experiences, a capability worth programming into artificial agents. Nowadays, however, there exist other practical (i.e. industrial) reasons which increase the interest in incremental algorithms. Companies from a very wide range of activities store huge amounts of data every day. One-shot algorithms are not easily able to process and incorporate into a knowledge base this great amount of continuously incoming instances in a reasonable amount of time and memory space. We believe that, in this environment, incremental learning becomes particularly relevant, since this sort of algorithm is able to revise already existing models of data without beginning from scratch and without re-processing past data.

We present two different and general heuristics for converting batch hill-climbing searchers into incremental ones. We believe that the heuristic that we call Traversal Operators in Correct Order (TOCO) is the most novel and original contribution. This heuristic states that, given a learned knowledge structure and the learning path used to obtain it, where the traversal operators are ordered by decreasing contribution to quality, the structure will be revised only when the order of the traversal operators changes in the light of new data, and that the structure will then be rebuilt from the first unordered operator of the path. So, the benefit of the TOCO heuristic is twofold: first, the model will only be revised when it is invalidated by new data, and second, in the case that it must be revised, the learning algorithm will not begin from scratch. The second heuristic of our work, which we call the Reduced Search Space (RSS) heuristic, uses the knowledge gathered from previous learning steps and states that structures that had very low quality in past learning steps will still have low quality with respect to the new dataset in the current learning step.

We formally justify the correctness of these two heuristics. In order to do so, we first introduce the concept of continuous quality functions. Roughly speaking, we say that a quality function is continuous over the space of datasets when, given a knowledge structure and two arbitrarily similar datasets (i.e. with a short Kullback-Leibler distance between them), the function returns very similar quality values for the structure measured with respect to both datasets. From this definition we justify the TOCO heuristic by noting that if the order of the traversal operators of a given learning path changes when new data instances are added to the dataset, it means that the new dataset is significantly different from the former one and thus it is worth revising the structure. Similarly, we justify the RSS heuristic by noting that if the order of the traversal operators of a given learning path does not change, it means that the new and old datasets are not significantly different, and thus structures that used to be of very low quality will keep low quality values.

Our heuristics need to store the sufficient statistics in order to avoid scanning datasets multiple times. We identify from the literature AD-trees as an approach to efficiently store sparse sufficient statistics in memory. This structure will allow us to store in memory the sufficient statistics necessary to potentially perform a search over the entire space of Bayesian networks without going through already seen data. We also propose two additional heuristics that are coupled to the field of Bayesian networks. Both heuristics are concerned with avoiding the storage of sufficient statistics that are unlikely to be useful for learning future structures. We call these heuristics Wait Before Dropping (WBD) and Dropping Subsets of Variables (DSV). The DSV heuristic drops from memory those sufficient statistics that correspond to weakly correlated variables, while the WBD heuristic delays the application of the DSV heuristic until the algorithm has gathered enough information about the correlation among variables (i.e. until the number of gathered data instances is large enough). The sufficient statistics dropped from memory are unlikely to be useful, since they correspond to subsets of uncorrelated variables among which it is unlikely that an arc exists. We will also propose an algorithm to gather such useless sufficient statistics and another to drop them from the AD-trees.

We use our proposed heuristics to transform four different and well-known batch algorithms (CL, K2, B and HCMC) into algorithms that incrementally learn Bayesian network structures. We empirically see that the obtained incremental algorithms produce structures of almost the same quality as their corresponding batch algorithms and that they are robust to different data orders. We also compare the results obtained with our heuristics to the results obtained with other incremental approaches found in the literature. Finally, we use our approach to incrementally learn Tree Augmented Naive Bayes classifiers.

Contents

1 Introduction
  1.1 Machine Learning
    1.1.1 The environment
    1.1.2 Representation
    1.1.3 The learning component
  1.2 Bayesian Networks
    1.2.1 Definition of Bayesian networks
    1.2.2 Independence assumptions: D-separation
    1.2.3 Characterization of Independence Models
    1.2.4 Graphical Model Inclusion
    1.2.5 Neighborhoods and inclusion boundary
  1.3 Learning Bayesian Networks
    1.3.1 The search space of Bayesian networks
    1.3.2 Performance tasks and measures
    1.3.3 Representing experiences: datasets
    1.3.4 Learning strategies

2 Bayesian Network Learning. State of the art
  2.1 The Minimum Description Length Approach
    2.1.1 Lam and Bacchus approach
    2.1.2 Friedman and Goldszmidt approach
    2.1.3 Score Equivalence
  2.2 The Bayesian Inference Approach
    2.2.1 Cooper and Herskovits proposal
    2.2.2 Score Equivalence
  2.3 Sufficient Statistics
  2.4 Batch Bayesian Network Structure Learning
    2.4.1 Algorithm CL
    2.4.2 Algorithm K2
    2.4.3 Algorithm B
    2.4.4 Algorithm HCMC
  2.5 Summary

3 Incremental Learning
  3.1 Incremental algorithms: purpose and definition
  3.2 Problems with local maxima and the ordering effects
  3.3 Incremental learning of structured models. A comparison
  3.4 Incremental Bayesian Network Structure Learning
    3.4.1 Buntine's proposal
    3.4.2 Lam and Bacchus' proposal
    3.4.3 Friedman and Goldszmidt's proposal
    3.4.4 Comments to the incremental proposals

4 A New Approach to Incremental Learning of BNs
  4.1 Heuristics for incremental learning
    4.1.1 Batch hill-climbing search
    4.1.2 HCS' search path properties
    4.1.3 Traversal Operators in Correct Order (TOCO)
    4.1.4 Reduced Search Space (RSS)
    4.1.5 Incremental hill-climbing search
    4.1.6 Continuity of quality measures for Bayesian networks
  4.2 Incremental Bayesian Network Structure Learning
    4.2.1 Incremental CL
    4.2.2 Incremental K2
    4.2.3 Incremental B
    4.2.4 Incremental HCMC
  4.3 Reduced and Cached Sufficient Statistics
    4.3.1 Reduced sufficient statistics
    4.3.2 Cached sufficient statistics: AD-trees
    4.3.3 Reduced and incremental AD-trees
  4.4 Experiments
    4.4.1 Comparing the batch against the incremental approaches
    4.4.2 Learning curves
    4.4.3 Behavior of the TOCO heuristic
    4.4.4 TOCO heuristic against Two-way operator approach
    4.4.5 Reduced Sufficient Statistics

5 Incremental Probabilistic Classifiers
  5.1 Introduction
  5.2 Bayesian Network Classifiers
    5.2.1 Naive Bayes
    5.2.2 Tree Augmented Naive Bayes
  5.3 Experiments
    5.3.1 Computational time gain
    5.3.2 Quality of the recovered structures
    5.3.3 Accuracy curves
    5.3.4 Comparing the Naive and TAN Bayesian classifiers

6 Conclusions and future research
  6.1 Summary of original contributions
  6.2 Future Work

List of Figures

1.1 Langley's framework for machine learning
1.2 The serial, converging and diverging connections
1.3 A Bayesian network example
1.4 Covered arc
3.1 Comparison of greedy batch and incremental learning paths
4.1 A continuous quality function
4.2 TOCO heuristic
4.3 Incremental CL: TOCO heuristic
4.4 Sample AD-tree
4.5 Growth of an AD-tree for an increasing number of levels and attributes
4.6 Growth of an AD-tree for an increasing number of data records
4.7 Learning curves of iCL algorithm
4.8 Learning curves of iK2 algorithm
4.9 Learning curves of iB algorithm
4.10 Learning curves of iHCMC algorithm
4.11 Behavior of algorithms iCL and iK2
4.12 Behavior of algorithm iB
4.13 Behavior of algorithm iHCMC
4.14 Behavior of algorithms with different nTOCO
5.1 Naive Bayes Classifier
5.2 Tree Augmented Naive Bayes Classifier
5.3 Quality of recovered structures
5.4 Accuracy curves of the incremental and batch approaches to TAN
5.5 Comparing Naive and TAN accuracy curves

List of Tables

1.1 The number of DAGs for small numbers of nodes
2.1 A database of cases over two binary variables
4.1 A simple dataset
4.2 Description of used data sets
4.3 Result with CL. nTOCO=1
4.4 Result with K2. nTOCO=0
4.5 Result with B. nTOCO=2
4.6 Result with HCMC. nTOCO=5
4.7 K2. Calls to Score. Alarm. Similar. nTOCO=0
4.8 Result with FG. nTOCO=3
4.9 Result with HCMC. nTOCO=5
4.10 DSV: Result with K2. nRSS=9
4.11 DSV: Result with B. nTOCO=2 nRSS=11
4.12 DSV: Result with HCMC. nTOCO=5 nRSS=11
4.13 Pruned AD-trees' number of nodes
5.1 CPU clock ticks and operations spent in learning
5.2 Quality, $\sum_{i=1}^{n} I(X_{m_i}; X_{m_{j(i)}} \mid C)$, of final network structures

List of Algorithms

1 CL
2 K2
3 B
4 RCAR
5 HCMC
6 Obtain Initial Neighborhood
7 Best DAG
8 Update A for addition
9 Update D for addition
10 Update Ra and Rd for addition
11 Update A for deletion
12 Update D for deletion
13 Update Ra and Rd for deletion
14 Buntine's
15 MarkChildren
16 FG
17 Hill-Climbing Search
18 Incremental Hill-Climbing Search
19 iCL
20 iK2
21 iB
22 iHCMC
23 Join Candidate Lists
24 Prune (S, ADN)

Preface

Learning Bayesian network structures received a lot of attention during the 1990s. During the last few years we have witnessed great progress in the knowledge of the Bayesian network structure search space, and in the quality functions used to measure the degree of closeness of learned models to data. Most of the work on learning network structures has focused on one-shot (batch) algorithms, that is, algorithms that assume that all the training instances that will ever be available for learning a network are already at hand. Hence, batch algorithms learn the Bayesian network structure at once from the training data and do not provide any mechanism to revise the structure when new data instances become available.

Companies nowadays collect a great amount of data every day. For example, bank consumers perform thousands of operations every day; electronic virtual shops receive thousands of visits from potential customers, some of whom finally buy a product; big stores collect, from loyalty cards, the products each individual customer buys; phone companies collect data from thousands of calls; and so on. With this in mind, we strongly believe that it is worth having incremental algorithms able to refine network structures in the light of new data instances using a reasonable amount of computing time and memory space.

A simple and obvious way of taking into account new data is to drop the current structure and to learn, using old and new training instances, a new network structure from scratch. This method has two main drawbacks. Firstly, it needs to store in memory all the data instances, which may grow ad infinitum; secondly, it spends a great amount of computational time, as it begins from scratch each time it learns a new structure. For these reasons, we need algorithms that are able to cope with new data without using unreasonable memory space or demanding unreasonable computing time. Hence, the main goal of the present work is to obtain incremental algorithms for learning Bayesian network structures while keeping time and memory demands low. We also want the incremental algorithms to yield network structures of similar quality to those that would be obtained by the corresponding batch algorithms.

Goals and contributions

The main goal of this thesis has been the development of incremental algorithms that continuously incorporate training instances as they arrive. As we will show, our incremental algorithms have the following desirable properties:

• They require small constant time per data instance.
• They use a fixed amount of main memory, irrespective of the total number of records they have seen.

• They build a Bayesian network structure using one single scan of the data.
• They produce Bayesian networks that are nearly identical to the ones that would be obtained by the corresponding batch algorithm.
• The model they produce is independent of the order in which the data instances are presented.

Note that an incremental algorithm with the above properties keeps the memory and the time required to process each instance low without losing relevant information. Hence, our incremental algorithms will yield networks acceptably close to the ones that would be obtained by repeatedly running batch algorithms. The main contributions of our work can be summarized as follows:

• We identify the main difficulties in incrementally learning Bayesian network structures and, in general, any sort of complex knowledge structure that aims at relating attributes.
• We propose a heuristic to transform any batch hill-climbing algorithm into an incremental one. Our heuristic is able to detect when the revising process should be triggered and which part of the search path should be revised.
• We propose a heuristic to reduce the search space explored when revising the current structure, based on the previous learning steps. In this way we greatly reduce the learning time, since the search focuses on the relevant area of the search space.
• We study the behavior of our heuristics on a wide range of significant and well-known batch learning algorithms for Bayesian networks.
• We use a data structure (the AD-tree, [92]) to efficiently store the sufficient statistics needed to learn Bayesian network structures, by means of which several drawbacks of previous algorithms are overcome.
• We use our incremental version of the algorithm for learning tree-shaped Bayesian networks to learn Tree Augmented Naive Bayes classifiers, and we empirically show that it saves computing time while still obtaining classifiers with accuracy similar to that of a batch algorithm.

Structure

This work is organized in six chapters.

Chapter 1 First, we present a broad introduction to Machine Learning and state a general framework in which to situate the key ideas of our proposal. Second, we revise the field of Bayesian networks and situate Bayesian network learning algorithms within the framework.

Chapter 2 In this chapter we revise the state of the art on learning Bayesian network structures. We revise the quality measures used in learning structures and the most representative batch algorithms for learning network structures, namely Chow & Liu [22], K2 [23], B [8] and HCMC [13].


Chapter 3 In this chapter we motivate the need for incremental learning algorithms and give precise definitions of incremental algorithms. We revise the main drawbacks of the incremental algorithms found in the literature and identify the difficulties in learning Bayesian networks incrementally. At the end of the chapter we revise the literature on incremental learning of Bayesian networks.

Chapter 4 This is the core of our contribution. We propose two new general heuristics in order to convert a batch hill-climbing searcher into an incremental one. We also propose a third heuristic in order to reduce the sufficient statistics that need to be stored in memory and to reduce the search space. We transform the four batch algorithms revised in Chapter 2 into incremental ones and perform several experiments. The experiments show that our incremental algorithms save a significant amount of time while yielding structures close to those obtained with the corresponding batch algorithms. In this chapter we also present AD-trees [92] as a data structure able to store the sufficient statistics needed to incrementally learn Bayesian network structures, and we introduce some modifications in order to save additional memory space in incremental environments.

Chapter 5 Here we use our incremental approach in order to obtain an incremental Tree Augmented Naive Bayes classifier. We also show that its accuracy is almost the same as that of a batch classifier.

Chapter 6 In this chapter we provide a summary of the work presented in this thesis and some conclusions, and we suggest lines for further research.


Chapter 1

Introduction

During the eighties, Bayesian networks [99, 94] were developed as models for coping with uncertainty. Bayesian networks are graphical representations of relations between variables in a domain. They are also called belief networks, Bayesian belief networks or causal probabilistic networks, and they use probability theory in order to reason with uncertainty. The main advantages of this sort of graphical representation can be summarized in three points. Firstly, they directly express the fundamental qualitative relationship of direct causation: the arcs between variables signify the existence of direct causal influences, and the strengths of these influences are quantified by conditional probabilities. Secondly, there are mathematical methods to estimate the state of certain variables given some evidence, namely the state of other variables. Thirdly, there are methods to explain to the user how the system came to its conclusions, as well as methods to analyze how sensitive the conclusion is to small changes in the evidence (sensitivity analysis) [14].

During the nineties and until now, much of the research in the Bayesian network field has focused on methods for automatically learning these graphical representations. This was mainly motivated by the success of Bayesian networks in industrial applications such as medical diagnosis, computer vision, robotics and others. Companies store huge databases from which a great deal of knowledge about their business processes could be extracted and used. It was clearly infeasible to analyze such huge databases by hand, and hence there appeared the need for methods for learning Bayesian networks, and other knowledge structures, from databases. Learning is also of interest for several fields of Artificial Intelligence.

The organization of the rest of this chapter is as follows. In Section 1.1 we introduce the framework of Machine Learning that we will use in later chapters in order to motivate and situate our own research. In Section 1.2 we introduce Bayesian networks and some basic concepts, and in Section 1.3 we situate the field of Bayesian network learning within the machine learning framework.

1.1 Machine Learning

The aim of Machine Learning is to obtain algorithms that improve their performance through experience. The field of Machine Learning emerged from the larger field of Artificial Intelligence at the very start of that discipline. The first motivation of the field was to imitate and understand the learning processes of humans. Later, with the appearance of knowledge-intensive systems like expert systems, Machine Learning was also studied from the viewpoint


of automatic knowledge elicitation [113]. It was well known that knowledge acquisition from human experts is difficult for several reasons. Sometimes communication between domain experts and knowledge engineers is difficult, as they come from different knowledge areas and therefore speak different jargons. Sometimes there are few, if any, experts on a given domain and it is almost impossible to reach them. Also, when domains are complex or ill-structured, it is simply impossible for experts to give a precise model [5]. More recently, large databases became increasingly abundant in many areas like science, engineering and business, so new applications and motivations appeared in the field of Machine Learning. At that moment, Machine Learning seemed to split into different fields like Pattern Recognition and Data Mining. The primary goal of the first is to classify patterns found in datasets, while the primary goal of the second is to find mathematical models of datasets. Some authors argue that Machine Learning copes with the heuristic and symbolic methods associated with Artificial Intelligence, while the other two fields are concerned with the numerical methods associated with Statistical Learning. Other authors argue that the three above-mentioned fields are in fact different views of the same science, the one that produces algorithms that improve with experience.

We will now define Machine Learning. We begin with two broad and general definitions.

Definition 1.1 (Machine Learning (i)) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

This definition, stated by Tom M. Mitchell [90], includes any computer system that improves its performance at tasks through experience. Another, more general, definition is given by Pat Langley [74].

Definition 1.2 (Machine Learning (ii)) Learning is the improvement of performance in some environment through the acquisition of knowledge resulting from experience in that environment.

As we can see, both definitions are quite similar, as they require that an algorithm improve its performance in some sense. Even so, the second one is more general, since learning occurs within an environment which can determine not only the class of tasks T and performance measures P but also the way experience E is presented to the algorithm and the nature of the learning algorithm itself. Furthermore, Langley's definition gives a framework which states that a learning system always finds itself in some environment, about which it attempts to learn. The system has a knowledge base from which it can extract knowledge and in which it can store newly gathered knowledge. The system's aim is to improve some performance measure that is evaluated over the knowledge base. Langley's framework is presented in Figure 1.1, taken from [74]. In the following sections we will analyze each of the elements of the framework proposed by Langley.

1.1.1 The environment

As we said above, learning always occurs in an environment about which the system must learn. The environment can set some conditions or properties that may affect the way the learning algorithm must work.

Figure 1.1: Langley's framework for machine learning (components: Environment, Learning, Knowledge, Performance)

Performance tasks and measures

As stated by both Mitchell's and Langley's definitions, the learning system must improve at some given tasks. As examples of tasks we enumerate some of the most typical found in the literature: classification, prediction and problem solving. Given one task, there may be different measures of performance. For example, in classification one usually tries to improve the accuracy of the system, but one could also be interested in having a simpler classifier at the cost of some accuracy. So the performance measure could be simplicity rather than accuracy.

Supervised and unsupervised learning

In some environments the system receives feedback, from datasets or human experts, about the appropriateness of its performance, while in others the learning system does not receive any feedback. An example of the former is a classification system, where the learning process receives, for each training data item, the class it belongs to. This sort of system builds a description of each given class in order to accurately predict the class label of new items. On the other hand, clustering is a typical example of unsupervised learning, where systems do not know the class labels of the training data and therefore must guess the class each instance belongs to before building the class descriptions.

Online and off-line learning

Another feature the environment may set is the way in which training instances are presented to the learning system. Off-line learning happens when all training items are given at the same time. On the contrary, online learning occurs when training data items are presented one at a time. There are intermediate situations where data items are presented in short chunks. The first studies of online systems were justified by the nature of human learning: a system was supposed to imitate the learning process of humans, and people experience events in sequence. More recently, online systems have been found useful in more and more real-world environments, since companies gather new data every day and these data should be used in order to improve the knowledge base. We will study online learning systems in depth in later chapters.


Regularity of the environment

The environment, as an external factor, can also affect the difficulty of learning. One can identify four factors:

• The complexity of the target knowledge to be learned obviously affects the complexity of the learning process.
• The number of irrelevant features may also affect the performance of the learned knowledge structure. It is widely reported in the literature that the accuracy of classification systems decreases in the presence of irrelevant features.
• The amount of noise present in the databases directly affects the quality of the learned structures. In supervised systems there may exist two sorts of noise, namely class noise, where the learner receives incorrect feedback, and attribute noise, where the learner receives an incorrect description of data instances.
• Some environments evolve with time. In such environments, concepts must be changed when they are no longer valid. This phenomenon is known in the literature as concept drift.

1.1.2 Representation

Both the experience (the input to learning) and the knowledge structure (the output of learning) may also affect the learning process.

Representing experience

In most of the literature, experiences are described by means of a collection of attributes. Usually all experiences are described with the same attributes, so each experience is taken as a list of values for the attributes in the collection. In this way, a database is formed by a set of experience descriptions called instances. The most used attributes are of one of the following kinds:

• Boolean attributes stand for the presence or absence of a given feature.
• Nominal attributes are similar to boolean attributes but allow more than one possible value for each feature. The values of nominal attributes are not ordered.
• Numeric attributes can take real, integer or ordinal values. The values of numeric attributes are ordered.

Sometimes, before learning, datasets are transformed into other structures which are easier for the learning system to access. For example, some statistical learning processes do not need the whole dataset but a summary usually called sufficient statistics. Sometimes sufficient statistics are sparse contingency tables; in the literature, structures are used to avoid storing their zero cells [92].
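To make this concrete, the following minimal sketch (in Python; the function name and data layout are our own, purely illustrative choices, not code from this thesis) gathers sparse pairwise contingency counts from a list of instances. Only the value combinations that actually occur are stored, which is the idea behind the sparse structures cited above [92].

```python
from collections import Counter
from itertools import combinations

def pairwise_sufficient_statistics(instances, n_attributes):
    """Sparse pairwise contingency tables: counts N(Xi=a, Xj=b).

    Only value combinations that actually occur in the data are stored,
    so the zero cells of the tables never take up memory."""
    pairs = list(combinations(range(n_attributes), 2))
    counts = {pair: Counter() for pair in pairs}
    for row in instances:
        for i, j in pairs:
            counts[(i, j)][(row[i], row[j])] += 1
    return counts

# Four instances over three nominal attributes.
data = [(0, 1, 1), (0, 1, 0), (1, 0, 1), (0, 1, 1)]
stats = pairwise_sufficient_statistics(data, 3)
print(stats[(0, 1)])  # Counter({(0, 1): 3, (1, 0): 1})
```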


Knowledge structures

A learning system must represent the knowledge it learns, and there are many different knowledge structures. They range from the simplest ones, which just list the instances seen of a given concept (extensional structures), to more sophisticated ones, which extract a description of the concept (intensional structures). The former may be useful in a very reduced number of cases, as they do not allow one to predict whether a new instance belongs to a given concept or not. Intensional structures can be simple classifiers where a flat description for each concept or class is stored. More complex and sophisticated structures exist, such as hierarchies of concepts [105, 35] or inference networks [99, 94]. Each of these intensional structures must be accompanied by an interrogation process in order to consult the structure with a new, previously unseen and possibly incomplete instance.

1.1.3 The learning component

We focus now on the learning process itself.

Learning as search

One of the first authors who described learning as a search process was Tom M. Mitchell [91]. According to Mitchell, learning can be seen as a search through a knowledge space where each state of the space represents a knowledge structure. In order to perform the search, one needs a set of operators to transform those states or knowledge structures, a quality function to measure the quality of the states with respect to the dataset, and finally a procedure to apply the operators and perform the search.

The states of a search space can usually be partially ordered according to their generality. A lattice is defined where at one extreme sit the most general knowledge structures, while at the opposite end the most specific ones appear; structures in the middle have an intermediate degree of generality. This organization can be used to guide the search, for example by beginning from the most general structures and going to the most specific, or vice versa. Other search procedures begin at both extremes of the lattice, giving as a result the structure at which the two search paths meet. In order to traverse the search space, a set of transformational operators is defined. These operators transform one knowledge structure into another, allowing the learner to walk through the space.

In addition, a learning system requires some strategy in order to perform the search. The strategy must define four basic points: where to start the search from, how to organize the search, how to evaluate alternative states, and when to terminate the search. Usually search spaces are so huge that it is not feasible to perform an exhaustive search. The most used search algorithm in machine learning is heuristic hill-climbing, which is seen to work well in many domains. This sort of method is used because it is cheap in terms of memory space and computing time. The most basic hill-climbing strategy keeps one single knowledge structure in memory and follows one single search path that always moves from the current structure to the best neighboring one, until the set of transformational operators cannot improve the quality of the current structure anymore. Thus, this sort of search strategy cannot perform any kind of backtracking. In spite of that, backtracking can be simulated by introducing transformational operators that allow the searcher to go back to an already visited structure. Hill-climbing searchers have the well-known drawback of halting at local maxima.
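The basic strategy just described can be sketched in a few lines of Python (the names and the toy search space are our own illustrative assumptions, not code from this thesis):

```python
def hill_climbing(initial, neighbors, quality):
    """Basic hill-climbing search: keep a single current state and
    always move to the best neighbor, until no transformational
    operator improves the quality; may halt at a local maximum."""
    current, current_q = initial, quality(initial)
    while True:
        best, best_q = None, current_q
        for candidate in neighbors(current):
            q = quality(candidate)
            if q > best_q:
                best, best_q = candidate, q
        if best is None:                 # no improving operator left
            return current
        current, current_q = best, best_q

# Toy search space: integers with operators x-1 and x+1,
# quality peaking at x = 3.
print(hill_climbing(0, lambda x: (x - 1, x + 1), lambda x: -(x - 3) ** 2))  # 3
```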


The role of bias

As we have already said, search spaces are usually huge, and for this reason hill-climbing searchers are used to explore them. Biases and background knowledge are introduced in order to help the searcher find solutions of high quality. This may help in two ways. First, bias may narrow the search space by discarding solutions which may be bad, letting the search process rapidly focus on a good description. Second, bias may be used to guide the search towards that part of the knowledge space where the best domain descriptions are, and in this way solutions of higher quality can be obtained. We should give a definition for bias. Following Tom M. Mitchell [89] and Gordon and Desjardins [48] we may say:

Definition 1.3 (Bias) Bias is any basis for choosing one domain model over another, other than strict consistency with the instances.

It is not clear from this definition whether background knowledge should be considered a kind of bias or whether, on the contrary, it is a different concept. Recall that background knowledge is the knowledge which experts may have about the domain being studied and which can be used by the learning algorithm in order to find a good knowledge structure. Gordon and Desjardins [48] state that background knowledge has the supportive role of providing information to select a bias; hence, it cannot be considered to be a bias per se. Other authors, see [74], consider background knowledge a kind of bias.

Incremental and non-incremental learning

When we previously characterized the environment, we made a distinction between off-line learning, where all training instances are presented together, and online learning, where they are given one at a time. A similar distinction can be made for learning algorithms, which can either handle all database instances at once in a non-incremental manner or one instance at a time in an incremental fashion. More precisely, non-incremental algorithms, given the whole training dataset, output a domain model after processing the data, possibly multiple times. This sort of algorithm stops learning when it has processed the dataset and assumes that it has reached a good domain model which will not be revised. On the other hand, incremental algorithms never assume that they have reached a final learning stage. They keep improving their domain model by processing new data items as they become available, without re-processing previously seen ones. In this way, during the whole learning process there is a domain model available, although incomplete, that can be used for whatever task it is intended for.

Naturally, incremental algorithms are best suited for online environments and non-incremental algorithms for off-line ones. In spite of that, one could adapt a non-incremental algorithm to work in an online environment by storing new instances together with the old ones and re-running the learning process on the whole set. One could also use an incremental algorithm in an off-line environment by running the algorithm for each instance of the database. Both sorts of algorithms have their advantages. On the one hand, non-incremental learners can collect statistics about all training instances and thus perform a more informed search than an incremental learner could do. On the other, in an off-line environment incremental learners use less memory space and computing time than non-incremental ones while obtaining knowledge structures of similar quality.


We will further discuss and study incremental algorithms in Chapter 3 and Chapter 4. In the rest of this work we will call non-incremental algorithms batch algorithms.

1.2 Bayesian Networks

In the following sections we will present the basic concepts about Bayesian networks. They are intensional knowledge structures based on probability theory, and they are provided with a large collection of mathematically and probabilistically sound and computationally efficient methods for interrogating them. For further reading on the general theory of Bayesian networks, the reader is referred to Pearl's book (1988) [99], a classical book which introduces Bayesian network theory. Another classical and interesting book was written by Neapolitan (1990) [94]. A very nice book which covers probabilistic expert systems was written by Castillo (1997) [14], and a more recent book which covers Bayesian network inference in depth was written by Jensen (1998) [58]. A short and, in our opinion, very nice introduction by Charniak (1991) can be found in [15]. Also, Jordan (1998) [59] edited a collection of introductory surveys and papers discussing recent advances, which also covers the wider field of learning graphical structures. A very recent book on learning Bayesian networks was written by Neapolitan (2003) [95].

1.2.1 Definition of Bayesian networks

The representation of a domain U under probability theory can be defined by a set of random variables U = {X_1, ..., X_n}, each of which has a domain of possible values. The key concept in probability theory is the joint probability distribution, which specifies a probability for each possible combination of values of all the random variables. Given this distribution, one can compute any desired posterior probability given any combination of evidence. Unfortunately, an explicit description of the joint distribution requires a number of parameters that is exponential in the number of variables n. Probabilistic networks derive their power from the ability to represent conditional independence among variables, which allows them to take advantage of the locality of causal influences.

A Bayesian network is an annotated directed acyclic graph that encodes a joint probability distribution over a set of random variables U. Formally, a Bayesian network for U is a pair S = (B_S, B_P):

• The first component, B_S, is a directed acyclic graph (DAG) whose vertices correspond to the random variables X_1, ..., X_n and whose edges represent directed dependencies between variables. Let us give a more detailed explanation. We say that two sets of variables X and Y are independent given Z if P(X | Y, Z) = P(X | Z) whenever P(Y, Z) > 0. Recall the chain rule of probability,

    P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})

If for each variable X_i the set Pa_i ⊆ {X_1, ..., X_{i-1}} renders X_i and {X_1, ..., X_{i-1}} \ Pa_i independent, that is, P(X_i | X_1, ..., X_{i-1}) = P(X_i | Pa_i), then one can rewrite the chain rule as

    P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa_i)    (1.1)

A Bayesian network structure B_S encodes the assertions of conditional independence in Equation 1.1. Namely, B_S is a DAG where each variable in U corresponds to a node in B_S, and the parents of the node corresponding to X_i are the nodes corresponding to the variables in Pa_i.

• The second component, B_P, represents the set of parameters that quantifies the network. It contains a parameter θ_ijk = P(X_i = x_i^k | Pa_i = pa_i^j) for each possible state x_i^k of X_i and each configuration pa_i^j of Pa_i.
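As a small illustration of Equation 1.1, the following Python sketch evaluates the joint probability of a full assignment from a parent map and a table of parameters θ_ijk; the encoding of S = (B_S, B_P) as two dictionaries is our own illustrative assumption, not the representation used in this thesis.

```python
def joint_probability(parents, theta, assignment):
    """Evaluate Equation 1.1: P(X1,...,Xn) = prod_i P(Xi | Pa_i)."""
    p = 1.0
    for x, pa in parents.items():
        pa_config = tuple(assignment[q] for q in pa)
        p *= theta[(x, assignment[x], pa_config)]  # theta_ijk
    return p

# Two binary variables with the single arc A -> B.
parents = {"A": (), "B": ("A",)}
theta = {
    ("A", 0, ()): 0.6, ("A", 1, ()): 0.4,
    ("B", 0, (0,)): 0.9, ("B", 1, (0,)): 0.1,
    ("B", 0, (1,)): 0.2, ("B", 1, (1,)): 0.8,
}
print(joint_probability(parents, theta, {"A": 1, "B": 1}))  # 0.4 * 0.8 = 0.32
```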

1.2.2 Independence assumptions: D-separation

As we have said above, Bayesian networks represent direct causation between variables. The power of Bayesian networks is that they have built-in independence assumptions which need not be explicitly specified. The independence assumptions can be read from a network structure by means of the d-separation criterion. In order to understand d-separation we need to keep in mind the three basic connection structures between variables:

1. Serial connection: Consider the situation in Figure 1.2(a). A has an influence on B, which in turn has an influence on C. Obviously, evidence on A will influence the certainty of B, which then influences the certainty of C. Similarly, evidence on C will influence the certainty of A through B. On the other hand, if the state of B is known, then the channel is blocked, and A and C become independent. We say that A and C are d-separated given B.

2. Diverging connection: In the situation in Figure 1.2(b), influence can pass between all the children of A unless the state of A is known. We say that B, C, ..., X are d-separated given A.

3. Converging connection: In this situation, Figure 1.2(c), if nothing is known about A except what may be inferred from knowledge of its parents B, C, ..., X, then the parents are independent; that is, evidence on one of them has no influence on the certainty of the others. Now, if any other kind of evidence influences the certainty of A, then the parents become dependent due to the principle of explaining away. The evidence may be direct evidence on A, or it may be evidence from a child.

The three cases above cover all the ways in which evidence can be transmitted through a variable, and following these rules it is possible to decide, for any pair of variables in a Bayesian network, whether they are dependent given the evidence entered into the network.

Definition 1.4 (D-separation) Two variables A and B in a Bayesian network are d-separated if for all paths between A and B there is an intermediate variable V such that either the connection is serial or diverging and the state of V is known, or the connection is converging and neither V nor any of V's descendants have received evidence.

Figure 1.2: The serial, converging and diverging connections

Figure 1.3: A Bayesian network example. Part (a) is the network structure over Smoker, CoalMiner, LungCancer, Emphysema, PositiveXRay and Dyspnea; part (b) is the probability table θ_emph for Emphysema:

    (Smoker, CoalMiner)    s,c    s,¬c    ¬s,c    ¬s,¬c
    E                      .9     .3      .5      .1
    ¬E                     .1     .7      .5      .9

If A and B are not d-separated, we call them d-connected. An equivalent definition of d-separation was given by Pearl [99].

In Figure 1.3 we have an example of a Bayesian network. On the left side (a), we have the network structure B_S representing the independences among the variables of the domain. On the right side (b), we have the probability table θ_emph for the variable Emphysema. Note that in this network, if we have some evidence on Dyspnea, then Smoker and CoalMiner are d-connected; otherwise they are d-separated. If we know the state of LungCancer, then Dyspnea and PositiveXRay are d-separated. Also, Smoker and Dyspnea are d-separated when the states of LungCancer and Emphysema are known.
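As an illustration of Definition 1.4, the sketch below decides d-separation by searching for an active trail using the three connection rules above. The traversal is the standard two-direction "reachable" procedure; both the implementation and the arc set for Figure 1.3, which we infer from the d-separation statements in the text, are our own assumptions.

```python
def d_separated(parents, x, y, evidence):
    """True iff x and y are d-separated given `evidence` in the DAG
    described by `parents` (node -> tuple of parents)."""
    children = {n: [] for n in parents}
    for n, pa in parents.items():
        for p in pa:
            children[p].append(n)
    # Evidence nodes and their ancestors: a converging connection is
    # unblocked when the middle node or one of its descendants is observed.
    anc, stack = set(), list(evidence)
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents[n])
    # 'up' = reached from a child, 'down' = reached from a parent.
    visited, frontier = set(), [(x, "up")]
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y:
            return False                       # active trail reaches y
        if direction == "up" and node not in evidence:
            frontier += [(p, "up") for p in parents[node]]
            frontier += [(c, "down") for c in children[node]]
        elif direction == "down":
            if node not in evidence:           # serial connection open
                frontier += [(c, "down") for c in children[node]]
            if node in anc:                    # converging connection open
                frontier += [(p, "up") for p in parents[node]]
    return True

# Arcs as implied by the d-separation statements about Figure 1.3.
net = {"Smoker": (), "CoalMiner": (), "LungCancer": ("Smoker",),
       "Emphysema": ("Smoker", "CoalMiner"),
       "PositiveXRay": ("LungCancer",),
       "Dyspnea": ("LungCancer", "Emphysema")}
print(d_separated(net, "Smoker", "CoalMiner", set()))                # True
print(d_separated(net, "Smoker", "CoalMiner", {"Dyspnea"}))          # False
print(d_separated(net, "Dyspnea", "PositiveXRay", {"LungCancer"}))   # True
```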

1.2.3 Characterization of Independence Models

As we said in Section 1.2.2, Bayesian networks have built-in independence assumptions which are not explicitly specified and which can be recovered by means of d-separation. In this section we characterize network structures according to the relationship between the dependency model in the domain and the dependency model induced by the structure.

Definition 1.5 (Independence Statement) Let U be a set of variables. An independence statement is a statement of the form I(X, Z, Y) where X, Y and Z are disjoint subsets of U. A statement I(X, Z, Y) should be read as "X is independent of Y given Z".

Definition 1.6 (Independence Model) Let U be a set of variables. An independence model is a set M = {A_i | A_i = I(X, Z, Y) and X, Z, Y ⊆ U}


Note that, through conditional independence, every probability distribution P(U) induces an independence model M.

Definition 1.7 (Independence model induced by a Bayesian network) The independence model M induced by a Bayesian network structure B_S defined on the set of variables U is the set of independence statements I(X, Z, Y) resulting from applying the d-separation criterion on B_S.

Definition 1.8 Let U be a set of variables, let M_I be an independence model over U, and let B_S be the structure of a Bayesian network.
• B_S is a dependency map, or D-map, of M_I if I(X, Z, Y) ∈ M_I implies that X and Y are d-separated given Z in B_S
• B_S is an independence map, or I-map, of M_I if the fact that X and Y are d-separated given Z in B_S implies I(X, Z, Y) ∈ M_I
• B_S is a perfect map, or P-map, of M_I if B_S is both an I-map and a D-map of M_I

When the dependency model induced by a B_S on a set of variables U is a perfect map of M_I, we say that both models are isomorphic. We will denote by M(B_S) the independence model defined by the network structure (i.e. DAG) B_S. An important property is that the minimal I-map corresponding to a model M is not unique. As Pearl [99] showed, the same dependency model can be graphically represented by several network structures that are minimal I-maps. Verma and Pearl proved that any Bayesian network is a minimal I-map of the domain [125].

Definition 1.9 (Equivalent Bayesian Networks) Two Bayesian networks S_1 = (B_S1, B_P1) and S_2 = (B_S2, B_P2) built on the same set of variables U are said to be equivalent, noted S_1 ≡ S_2, if M(B_S1) = M(B_S2).

This definition states that two Bayesian networks are equivalent if their structures induce the same independence model. The equivalence of Bayesian networks can easily be checked using the following theorem due to Verma and Pearl [125].

Theorem 1.1 (Equivalent Bayesian Networks) Two Bayesian networks S_1 = (B_S1, B_P1) and S_2 = (B_S2, B_P2) built on the same set of variables U are equivalent if and only if for all u, v, w ∈ U:
• u → v, v ← w belong to B_S1 with u and w non-adjacent in B_S1 if and only if u → v, v ← w belong to B_S2 with u and w non-adjacent in B_S2
• u − w belongs to B_S1 if and only if u − w belongs to B_S2 (i.e. both structures have the same adjacencies, regardless of orientation)

Chickering [17] derived a characterization of equivalent network structures based on local transformations. This characterization is very useful as it leads to a simple method for proving invariant properties over equivalent structures. In order to present Chickering's characterization, we first define the notion of covered edge, then we show necessary and sufficient conditions so that an edge reversal leads to an equivalent network, and finally a theorem which states that there always exists a sequence of edge reversals between any pair of equivalent networks.
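Theorem 1.1 yields an immediate equivalence test: compare the adjacencies of the two structures and their converging connections u → v ← w with non-adjacent endpoints (v-structures). A minimal Python sketch, again using a dictionary-based graph representation of our own:

def skeleton(parents):
    """The set of adjacencies, ignoring edge directions."""
    return {frozenset((u, v)) for v in parents for u in parents[v]}

def v_structures(parents):
    """Converging connections u -> v <- w whose parents u, w are non-adjacent."""
    adj = skeleton(parents)
    vs = set()
    for v, ps in parents.items():
        for u in ps:
            for w in ps:
                if u < w and frozenset((u, w)) not in adj:
                    vs.add((u, v, w))
    return vs

def equivalent(parents1, parents2):
    """Theorem 1.1: same skeleton and same v-structures."""
    return (skeleton(parents1) == skeleton(parents2)
            and v_structures(parents1) == v_structures(parents2))

g1 = {'a': set(), 'b': {'a', 'c'}, 'c': set()}   # a -> b <- c (a v-structure)
g2 = {'a': set(), 'b': {'a'}, 'c': {'b'}}        # a -> b -> c
g3 = {'a': {'b'}, 'b': set(), 'c': {'b'}}        # a <- b -> c
print(equivalent(g1, g2))  # False: g1 has the v-structure (a, b, c)
print(equivalent(g2, g3))  # True: same skeleton, no v-structures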


Definition 1.10 (Covered Edge) An edge Y → X ∈ B_S is covered if Pa_X = Pa_Y ∪ {Y}. That is, Y → X is covered if Y and X have identical parents, with the exception that Y is not a parent of itself. See Figure 1.4.

Figure 1.4: A covered arc Y → X (figure not reproduced).

Lemma 1.1 Let S_1 = (B_S1, B_P1) be any Bayesian network containing the edge Y → X, and let S_2 = (B_S2, B_P2) be a Bayesian network with the structure B_S2 identical to B_S1 except that the edge between Y and X in B_S2 is oriented as X → Y. Then S_1 and S_2 are equivalent if and only if Y → X is a covered edge in B_S1.

Theorem 1.2 Let S_1 = (B_S1, B_P1) and S_2 = (B_S2, B_P2) be any pair of Bayesian networks such that S_1 ≡ S_2. There exists a sequence of distinct edge reversals in B_S1 with the following properties:
1. Each edge reversed in B_S1 is a covered edge
2. After each reversal, B_S1 is a DAG and B_S1 ≡ B_S2
3. After all reversals, B_S1 = B_S2

Note here that this relationship of equivalence organizes the space of Bayesian networks into classes of equivalent networks. One may think that there are far fewer equivalence classes than Bayesian networks. This is not the case, as Gillispie and Perlman [47] showed: they empirically found that the average ratio of Bayesian networks per equivalence class seems to converge to an asymptotic value smaller than 3.7.

1.2.4 Graphical Model Inclusion

It is possible to define a partial order among Bayesian networks by means of the independence models they induce.

Definition 1.11 (Graphical Model Inclusion) We say that a Bayesian network S_2 = (B_S2, B_P2) includes S_1 = (B_S1, B_P1), or that S_1 precedes S_2 in the inclusion order, if and only if M(B_S1) ⊆ M(B_S2).

For example, the Bayesian network defined by the complete DAG B_Sc induces no independence statements at all. On the contrary, the Bayesian network with no arcs B_S∅ induces all possible independence statements among the variables. Thus, we say that B_Sc precedes all Bayesian networks and that B_S∅ is preceded by all Bayesian networks. With the following theorem, Chickering [18] gave a graphical characterization of the inclusion order for two arbitrary Bayesian networks.


Theorem 1.3 (Graphical Model Inclusion characterization) Let S = (B_S, B_P) and S′ = (B_S′, B_P′) be two Bayesian networks such that M(B_S) ⊆ M(B_S′). Let r be the number of edges in B_S that have opposite orientation in B_S′, and let m be the number of edges in B_S that do not exist in either orientation in B_S′. There exists a sequence of at most r + 2m distinct edge reversals and additions in B_S′ with the following properties:
1. Each edge reversed is a covered edge
2. After each reversal and addition, B_S′ is a DAG and M(B_S) ⊆ M(B_S′)
3. After all reversals and additions, B_S = B_S′

1.2.5 Neighborhoods and inclusion boundary

Learning algorithms use traversal operators in order to perform a search over the space of DAGs. The most common traversal operators consist in adding, removing or reversing one single edge of the structure. Using transformations on DAGs, the concept of neighborhood of a given DAG can be defined:
• OA (Only Additions): All DAGs with one arc more whose addition does not introduce a directed cycle.
• NR (No Reversals): All DAGs with one arc less, plus all DAGs with one arc more whose addition does not introduce a directed cycle.
• AR (All Reversals): The NR neighborhood plus all DAGs with one arc reversed whose reversal does not introduce a directed cycle.
• CR (Covered Reversals): The NR neighborhood plus all DAGs with one covered arc reversed.
• NCR (Non-Covered Reversals): The NR neighborhood plus all DAGs with one non-covered arc reversed whose reversal does not introduce a directed cycle.

The following neighborhoods are defined using the equivalence class C of a given Bayesian network S = (B_S, B_P):
• ENR (Equivalence class No Reversals): All DAGs with one arc less or one arc more than any DAG B_Si where S_i = (B_Si, B_Pi) ∈ C
• ENCR (Equivalence class Non-Covered Reversals): All DAGs with one arc more, one arc less, or one non-covered arc reversed with respect to any DAG B_Si where S_i = (B_Si, B_Pi) ∈ C

Note here that these last two neighborhoods may require the modification of more than one adjacency. It is worth noting also that there is no efficient way of enumerating the Bayesian networks of a given equivalence class [47]. These neighborhoods will be noted N_OA(B_S), N_NR(B_S), N_AR(B_S), N_CR(B_S), N_NCR(B_S), N_ENR(B_S) and N_ENCR(B_S) respectively. Some learning algorithms, like the K2 algorithm proposed by Cooper and Herskovits (1992) [23] and the B algorithm proposed by Buntine (1991) [8], use the OA neighborhood to perform the search in the space of DAGs.


Friedman and Goldszmidt (1997) [38] used the AR neighborhood in their learning algorithm. These algorithms will be explained at some length in Section 2.4 and Section 3.4. By means of graphical model inclusion, the concept of inclusion boundary can also be defined. Intuitively, the inclusion boundary of a given Bayesian network S = (B_S, B_P) consists of those Bayesian networks S_i = (B_Si, B_Pi) that induce a set of independence statements which immediately follows or precedes the one induced by S.

Definition 1.12 (Inclusion Boundary) Let S_H = (B_SH, B_PH) and S_L = (B_SL, B_PL) be two Bayesian networks determined by the graphs B_SH and B_SL. Let M(B_SH) ≺ M(B_SL) denote that M(B_SH) ⊂ M(B_SL) and that for no graph B_SK, M(B_SH) ⊂ M(B_SK) ⊂ M(B_SL). The inclusion boundary of the Bayesian network S = (B_S, B_P), noted IB(B_S), is

IB(B_S) = {B_SH : M(B_SH) ≺ M(B_S)} ∪ {B_SL : M(B_S) ≺ M(B_SL)}

Kočka et al. [64] used the inclusion boundary to establish a necessary condition that a traversal operator must satisfy in order to avoid local maxima.

Definition 1.13 (Inclusion Boundary Condition) A traversal operator satisfies the inclusion boundary condition if for every Bayesian network S = (B_S, B_P) the traversal operator can create a neighborhood N(B_S) such that N(B_S) ⊇ IB(B_S).

In a later paper, Castelo and Kočka [13] proposed a theorem (see Theorem 2.7) stating that the inclusion boundary condition is sufficient under some assumptions. This theorem is presented in Section 2.4.4. Castelo and Kočka [13] also stated the following relationships among neighborhoods and the inclusion boundary.

Theorem 1.4 Let S = (B_S, B_P) be a Bayesian network. The following statements hold:
1. For all B_S: N_NR(B_S) ⊆ N_ENR(B_S) ⊆ IB(B_S)
2. For all B_S: IB(B_S) = N_ENR(B_S) ⊆ N_ENCR(B_S)

From this theorem, several important limitations of well-known Bayesian network learning algorithms can be derived, as we will see in Section 2.4.4.
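As an illustration of how such neighborhoods are generated in practice, the following sketch enumerates the OA neighborhood: every arc addition is tried and kept whenever it does not close a directed cycle. The parents-dictionary representation is only an assumption for the example.

def has_path(parents, src, dst):
    """True iff there is a directed path from src to dst."""
    children = {u: {w for w in parents if u in parents[w]} for u in parents}
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for c in children[u] - seen:
            seen.add(c)
            stack.append(c)
    return False

def oa_neighborhood(parents):
    """All DAGs obtained by adding one arc that does not create a cycle."""
    neighbors = []
    for x in parents:
        for y in parents:
            if x == y or x in parents[y]:
                continue
            # adding x -> y creates a cycle iff y already reaches x
            if not has_path(parents, y, x):
                new = {v: set(ps) for v, ps in parents.items()}
                new[y].add(x)
                neighbors.append(new)
    return neighbors

g = {'a': set(), 'b': {'a'}, 'c': set()}
print(len(oa_neighborhood(g)))  # 4: every addition except b -> a (a cycle)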

1.3 Learning Bayesian Networks

In this section we give a brief and general introduction to Bayesian network learning in order to set it within the Machine Learning framework. Bayesian network learning algorithms aim at finding the network, or a reduced set of networks, that best encodes the joint probability distribution implicit in the data. The learning problem of Bayesian networks can be stated as follows [114]:

Definition 1.14 (Bayesian network learning) Given a dataset, infer the topology of the belief network that may have generated the dataset, together with the corresponding uncertainty distribution.

Table 1.1: The number of DAGs for small numbers of nodes.

   n   G(n)
   0   1
   1   1
   2   3
   3   25
   4   543
   5   29,281
   6   3,781,503
   7   1,138,779,265
   8   783,702,329,343
   9   1,213,442,454,842,881
  10   4,175,098,976,430,589,143

This definition is given from the data mining approach, where the Bayesian network generated by the learning algorithm is seen as a model of the dataset. In spite of that, we can still identify most of the parts of a learning algorithm. We have a dataset as a collection of experiences, and Bayesian networks as knowledge structures that form a space within which the algorithm will perform search. Still, we need a way to measure how likely it is that a given network generated the database. In the following sections we will study the Bayesian network search space, datasets, quality measures, and search strategies.

1.3.1 The search space of Bayesian networks

In Definition 1.14, two phases are distinguished in learning Bayesian networks. During the first phase, the structure or topology of the network, an acyclic directed graph (DAG), is learned. The second phase consists in learning the parameters of the network structure. In our work we will concentrate on the phase of learning Bayesian network structures. It is widely accepted in the literature that although having accurate parameters is important, they are completely useless if the structure is of bad quality. Druzdel et al. [31] reported that the graphical structure of a network is its most important part, as it reflects the independence and relevance relationships between the variables concerned. They also reported that uncertainty analysis of a large real-life probabilistic network for liver and biliary disease has provided evidence that probabilistic networks can be highly insensitive to inaccuracies in the numbers in their quantitative part. The reader interested in parameter learning can find extensive discussions in [118, 9, 79].

Let us now analyze the space of Bayesian network structures, that is, the space of acyclic directed graphs (DAGs) in which we will have to perform search during the learning process. The number of different DAGs over n nodes is given by Robinson's formula [108]:

$$ G(n) = \begin{cases} 1 & \text{if } n = 0 \\ \sum_{i=1}^{n} (-1)^{i+1} \binom{n}{i}\, 2^{i(n-i)}\, G(n-i) & \text{if } n > 0 \end{cases} $$

Table 1.1 shows G(n) for some small values of n. As the number of different DAGs grows more than exponentially in the number of nodes n, it is evident that it is not feasible, from a computational viewpoint, to exhaustively explore the entire space of DAGs.


Indeed, Chickering et al. showed that finding the Bayesian network structure with highest quality is NP-hard [19]. Let us now give an overview of the quality measures used in learning and study how their properties may affect the search strategy of Bayesian network learning algorithms.
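Robinson's recurrence is easy to evaluate with memoization; the following short sketch reproduces the values of Table 1.1.

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def G(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * G(n - i)
               for i in range(1, n + 1))

print([G(n) for n in range(6)])  # [1, 1, 3, 25, 543, 29281]
print(G(10))                     # 4175098976430589143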

1.3.2 Performance tasks and measures

Algorithms for learning Bayesian networks usually do not directly consider the final performance tasks. Bayesian networks are seen as a model of the dataset; more precisely, they are an approximation of the probability distribution that underlies the dataset. For this reason, most of the learning algorithms in the literature look for the networks that best approximate the distribution of the dataset. However, Bayesian networks are mainly used to predict the state of a variable given the state of others, and this prediction capability is very useful for a wide range of applications that need to work under uncertainty. There are few works [49] that, together with the distribution of the data, consider the performance on a range of the most typical queries that the networks must answer.

In general, one can distinguish two wide groups of learning algorithms, differentiated by their quality measures, which in turn affect the learning component. The first group is based on the application of conditional independence tests between variables and the construction of the structure of the network based on the results of such tests. Some examples of this approach are given in Spirtes et al. [119], Geiger et al. [45] and De Campos and Huete [24]. The second group consists of methods based on goodness-of-fit measures between the probability distribution of a tentative structure and the true joint distribution implied by the data.

Although good results have been obtained with both approaches, they both suffer from well-known drawbacks. The main drawback of methods based on conditional independence information is that they need a reliable source of independence statements. Independence statements can be derived from data using statistical tests. However, when there is a weak dependence between two variables, or when the number of variables involved in a test grows, a large database is required in order to obtain reliable estimates. Together with the restriction that the independence statements represented by the network structure be exactly those in the domain, these methods are in general impractical for small databases with discrete variables [6].

In recent years, the literature on methods based on goodness-of-fit measures has been growing fast. The main drawback of this sort of method is that it has to explore the space of DAGs which, as we have seen, grows more than exponentially with the number of variables. We will focus our work on this last category of methods, and in the following chapter we will review the most representative ones.

There are three different approaches to quality measures, or goodness-of-fit measures: the information criterion approach [22, 6], the Bayesian approach [23, 51] and the Minimum Description Length (MDL) approach [71, 37]. Despite the fact that these three approaches to quality measures are based on very different philosophies, their behaviors are very similar. In fact, the information criterion approach and the Minimum Description Length approach yield the same algebraic expression in a certain case [6]. Furthermore, Bouckaert [6] demonstrated that the asymptotic behavior of these approaches for databases of infinite size is the same, and that they will yield approximately the same results for databases where all configurations of parent sets occur at least once. These three approaches have the property of being decomposable, that is, the overall quality of a network can be expressed as the sum of the qualities of all given child-parent configurations. This is possible due to the property of


factorization over the probability distribution, or sum property, which is inherent to Bayesian networks:

$$ Quality(\text{Network} \mid \text{Dataset}) = \sum_{X_i} quality(X_i \mid \mathbf{Pa}_i, \text{Dataset}) \qquad (1.2) $$

where Pa_i is the set of parents of variable X_i. This property is computationally important in learning: to evaluate the effect that the addition (or removal) of an arc will have on the global network score, we only need to recompute the local scores of the network families affected, as the sketch below illustrates.
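The following sketch illustrates the sum property with a toy log-likelihood family score; any decomposable quality measure would behave the same way. The score difference produced by adding an arc equals the difference between the two local scores of the single family changed. The helper names and the tiny dataset are only illustrative.

from collections import Counter
from math import log

def local_score(data, child, parents):
    """Log-likelihood of one family; any decomposable score would do here."""
    joint = Counter((row[child], tuple(row[p] for p in parents)) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    return sum(n * log(n / marg[pa]) for (x, pa), n in joint.items())

def total_score(data, structure):
    """Equation 1.2: the network score is the sum of the family scores."""
    return sum(local_score(data, x, ps) for x, ps in structure.items())

data = [{'a': 0, 'b': 0}, {'a': 0, 'b': 0}, {'a': 1, 'b': 1}, {'a': 1, 'b': 0}]
s0 = {'a': [], 'b': []}
s1 = {'a': [], 'b': ['a']}  # add the arc a -> b

# The global score difference only involves the family of b, the one changed.
delta = local_score(data, 'b', ['a']) - local_score(data, 'b', [])
print(abs((total_score(data, s1) - total_score(data, s0)) - delta) < 1e-12)  # True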

1.3.3 Representing experiences: datasets

Experiences are represented by means of raw datasets, that is, sets of instances described by means of attributes. Bayesian networks can cope with both nominal and numeric (real and integer) attributes. As Bayesian networks do not model relations among data instances (they model relations among variables), one does not need to work with (or store) the whole information contained in datasets: it is sufficient to use just the information needed to approximate the underlying probability distribution. This information is usually called sufficient statistics, and it mainly consists of contingency tables for nominal attributes, and of the means and standard deviations of Gaussian distributions for numeric attributes. Usually, learning algorithms work directly with datasets and repeatedly inspect all the items in order to calculate the quality of each family of variables (a child and its parents) considered during the learning process. Most implementations of learning algorithms do not use sufficient statistics because, unfortunately, they require a great amount of memory space. We will see that there are data structures, like AD-trees [92], that allow storing sufficient statistics in a compact way, and that there may be environmental and computational reasons that could dissuade us from using datasets directly.

In real-world domains several difficulties may occur in datasets. For this reason, quality measures and learning algorithms have been extended to cope with incomplete or missing data [20, 37], or to work with datasets that have hidden variables whose values never appear in the data [21, 34]. There are also methods that tackle the problem of learning in domains with irrelevant variables [75, 116].

1.3.4 Learning strategies

We have seen that the number of Bayesian networks grows more than exponentially in the number of variables, and that although the sum property of the goodness-of-fit measures provides a quick way to evaluate structures, an exhaustive exploration of the Bayesian network space is not possible. For this reason, most of the learning algorithms in the literature follow greedy search strategies like hill climbing or beam search. Hill climbing is a search method provided with traversal operators used to transform a given network in order to obtain the set of its neighbors. The search begins with an initial structure (for example, a network with no arcs), applies all the traversal operator instantiations, compares the resulting network structures using a quality evaluation function, selects the best model, and iterates until no more progress can be made. The beam search strategy can be seen as a search with several search streams, each one being a hill-climbing search by itself. The main advantage of these methods is their low time and memory requirements, since there are never more than a few search states (or network structures) in memory, and thus only a few search paths (or streams) to be explored.
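Stripped of the Bayesian-network specifics, the hill-climbing loop just described is a few lines; the following generic sketch (with hypothetical neighbors and score callbacks) is the skeleton that the algorithms of Chapter 2 instantiate.

def hill_climb(initial, neighbors, score):
    """Greedy search: move to the best neighbor until no neighbor improves."""
    current, current_score = initial, score(initial)
    while True:
        candidates = [(score(n), n) for n in neighbors(current)]
        if not candidates:
            return current
        best_score, best = max(candidates, key=lambda t: t[0])
        if best_score <= current_score:
            return current  # local maximum reached
        current, current_score = best, best_score

# Toy usage: climb the integers toward the maximum of a concave function.
print(hill_climb(0, lambda x: [x - 1, x + 1], lambda x: -(x - 7) ** 2))  # 7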


However, greedy strategies also suffer from well-known drawbacks, such as their tendency to halt at local maxima and their dependence on the step size, that is, the number of traversal operators that can be applied in one single learning step.

In the literature there are several variations on these basic search strategies. For example, there are works where a user or a human expert gives bias or background knowledge to the learning algorithm. Sometimes the expert provides the whole structure, and hence the algorithm only needs to learn the parameters [118, 9, 79]. Other times the expert is only able to give a partial structure, which is used as the initial stage of the search in the space of all possible structures. Another extended sort of bias is providing an order among the variables stating that a variable can only be the parent of variables that follow it in the order, thus reducing the search space [23].

There is also a special sort of Bayesian network aimed at classifying data instances, called the augmented naive Bayesian classifier [41, 16]. To learn this sort of classifier, the learning algorithm is fed with training data that include a category variable. Augmented naive Bayesian classifier learners are given a fixed part of the structure, and they have to learn the rest of it. The fixed part is given as a sort of prior knowledge where the category variable is the parent of all the others. The remaining structure is learned with ordinary search strategies guided by quality measures that take into account the category that instances belong to. We will explain augmented naive Bayesian classifiers in detail in Chapter 5.

There exist other algorithms that do not follow greedy search strategies. For example, Bouckaert [6] studied search strategies like tabu search, simulated annealing and rejection-free annealing. Evolutionary computing, which performs a more intensive search, is another family of algorithms that has been used to learn Bayesian network structures. For example, there are works that use genetic algorithms (GA) [76, 78, 124] and others that use estimation of distribution algorithms (EDA) [107]. These intensive search strategies usually produce better results than greedy ones, but they are suitable only in environments without serious computing-time restrictions.


Chapter 2

Bayesian Network Learning. State of the art

In this chapter we present the state of the art of learning Bayesian network structures based on goodness-of-fit measures and on hill-climbing search strategies. In Section 2.1 we present the approach to quality measures based on the minimum description length, while in Section 2.2 we present the measures based on Bayesian inference. In Section 2.3 we introduce the concept of sufficient statistics and some notation that we will use in the rest of this work. Finally, in Section 2.4 we present four batch learning algorithms that explore the Bayesian network space in different manners. We present these four algorithms in order of increasing search intensity: the first explores the space of Bayesian networks least exhaustively and most blindly, while the last performs the most intensive and informed search. The first algorithm, Chow and Liu's [22], restricts the search to tree structures. Second, we show the K2 algorithm [23], which by means of an order among variables restricts the search space, exploring only one graph per equivalence class. The third, the B algorithm [8], is an uninformed greedy search among all directed acyclic graphs. Finally, we present an algorithm proposed by Castelo and Kočka [65, 11, 13] that also performs a greedy search but uses some properties of Bayesian networks in order to guide the search.

2.1 The Minimum Description Length Approach

In this section we briefly introduce Rissanen's Minimum Description Length (MDL) principle and two specific approaches to Bayesian network learning due to Lam and Bacchus [71] and Friedman and Goldszmidt [37]. For further reading on the Minimum Description Length principle see [97, 98]. The MDL principle is based on the idea that the best model of a database is the model that minimizes the sum of the lengths of the encodings of
1. The model
2. The data given the model
In order to apply the MDL principle we first need to encode the Bayesian network, as the model, and the raw data given the network. Afterwards, we measure in bits the length of both encodings.


1. Encoding the Network

A Bayesian network S = (B_S, B_P) is formed by the structure B_S (i.e. a DAG) and a list of the conditional probabilities B_P associated to each node. So, in order to encode the network we need to encode its structure and the set of conditional probabilities. Suppose there are n nodes in the dataset. For a node X_i with |Pa_i| different parents, we need |Pa_i| log2(n) bits to list its parents. Thus, we need the following number of bits in order to encode the structure of a network:

$$ \sum_{i=1}^{n} |\mathbf{Pa}_i| \log_2(n) $$

The encoding length of the conditional probabilities for each node X_i is the product of the number of bits required to store the numerical value of each conditional probability and the total number of conditional probabilities that are required. Thus, we need the following number of bits in order to encode the conditional probabilities of a network:

$$ \sum_{i=1}^{n} d\,(r_i - 1)\, q_i $$

where r_i is the number of states of the node X_i, q_i is the number of configurations of its parents, and d is the number of bits required to store a numerical value. Note that, since the states of a node are mutually exclusive and exhaustive, the conditional probabilities over all values sum to 1, and therefore we only need (r_i − 1) q_i numbers (instead of r_i q_i) to fully specify the conditional probabilities. Note also that, for a given dataset D, n and d are constant. Finally, the encoding length of a Bayesian network is the sum of the two values previously stated:

$$ \sum_{i=1}^{n} \left[\, |\mathbf{Pa}_i| \log_2(n) + d\,(r_i - 1)\, q_i \,\right] \qquad (2.1) $$

2. Encoding the Data Given the Model

Here we want to encode the dataset D of m cases given the model S = (B_S, B_P). Since we are only interested in comparing the lengths of encoding the data given different Bayesian networks, we do not actually need the most efficient code. We will use character codes, which are intuitive and not very time-consuming. A character code assigns to each configuration a unique binary string, and the dataset D is encoded by concatenating the m binary strings of its cases. It is well known that we can minimize the length of the final binary string by giving the shortest codes to the cases with the highest frequency. We minimize the coding length by applying the Huffman algorithm. To encode an alphabet A, construct the code backward, starting from the tail of the codewords:
1. Take the two least probable symbols in the alphabet. These two symbols will be given the longest codewords, which will have equal length and differ only in the last digit.
2. Combine these two symbols into a new single symbol, calculate the probability of the new symbol, and repeat.


Since each step reduces the size of the alphabet by one, this algorithm will have assigned strings to all the symbols after |A| − 1 steps. Huffman's algorithm requires as input the frequency of occurrence of each configuration appearing in the database. Suppose that each configuration c_u in the database has probability p_u; then Huffman's algorithm assigns to configuration c_u a codeword of length approximately −log2(p_u). If we have m cases in the database, with m large, then the length of the string encoding the database is approximately

$$ -m \sum_{u} p_u \log_2(p_u) \qquad (2.2) $$

where we are summing over all possible configurations. Evidently, we do not have these probabilities p_u, since the Bayesian network is a guess (or model) of such probabilities. The Bayesian network, as a model, assigns a probability q_u to each configuration. Of course, in general q_u is not equal to p_u, although the aim is for q_u to be close to p_u: the closer q_u is to p_u, the more accurate the Bayesian network is. We will use the probabilities q_u given by the Bayesian network to compute the Huffman code of dataset D. Hence, each configuration c_u is assigned a codeword of length approximately −log2(q_u), and the length of the string encoding the database is approximately

$$ -m \sum_{u} p_u \log_2(q_u) \qquad (2.3) $$
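A small numerical sketch of Equations 2.2 and 2.3: encoding with the model probabilities q_u can only lengthen the code with respect to the true probabilities p_u, as the theorem below states. The distributions here are invented for illustration.

from math import log2

def encoding_length(m, p, q):
    """Equation 2.3: approximate bits to encode m cases drawn from p
    using codewords derived from the model probabilities q."""
    return -m * sum(p[u] * log2(q[u]) for u in p)

p = {'00': 0.4, '01': 0.1, '10': 0.2, '11': 0.3}      # true distribution
q = {'00': 0.25, '01': 0.25, '10': 0.25, '11': 0.25}  # a poor model

print(encoding_length(1000, p, p))  # about 1846.4 bits: the entropy bound
print(encoding_length(1000, p, q))  # 2000.0 bits: strictly longer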

Now we can use Gibbs' theorem in order to compare Equations (2.2) and (2.3):

Theorem 2.1 Let p_u and q_u, where u = 1, ..., t, be non-negative real numbers that sum to 1. Then

$$ -m \sum_{u=1}^{t} p_u \log_2(p_u) \;\le\; -m \sum_{u=1}^{t} p_u \log_2(q_u) $$

with equality holding iff p_u = q_u for all u. In the summation we take 0 log2(0) to be 0.

From this theorem it follows that the encoding using the estimated probabilities q_u is longer than the encoding using the true probabilities p_u; in other words, the true probabilities achieve the minimal encoding length. The MDL principle states that we must choose the network that minimizes the encoding length of the dataset, and we have seen that this length depends on the accuracy of the network. We can use Equation 2.3 in order to evaluate the encoding length of the dataset given the network. However, this measure has two problems. First, we do not know the values of p_u; and second, Equation 2.3 requires a summation over all the configurations, and the number of configurations is exponential in the number of variables.

The first problem can sometimes be overcome by the law of large numbers: the configuration c_u with probability p_u is expected to appear m · p_u times in a database of m cases. Hence, we can use the frequency of c_u in the database as an estimator of p_u. The second problem is often more difficult. Equation 2.3 involves a summation over all the atomic events, and the number of atomic events is exponential in the number of variables. This also points out that we will not be able to use the law of large numbers so easily in order to estimate p_u: with an exponential number of atomic events u, some of the probabilities p_u will be so small that our database will not be able to offer reliable estimates. The database


might be large enough to estimate low-order marginals which are the union of many different atomic events. In order to overcome these problems, Lam and Bacchus [71] and Friedman and Goldszmidt [37] used different approaches yielding similar results.

2.1.1 Lam and Bacchus approach

Lam and Bacchus [71] used Gibbs' theorem to relate the encoding length of the data to the Kullback-Leibler divergence, or cross-entropy, D_KL(P||Q) (Equation 2.4). Then they showed that the cross-entropy can be substituted by the sum of the mutual information of each variable and its parent set (Equation 2.5). In this way they overcame the two problems mentioned above.

Definition 2.1 (Kullback-Leibler divergence or cross-entropy) The Kullback-Leibler divergence, or cross-entropy, is a measure of closeness between two different distributions P and Q defined over the same event space:

$$ D_{KL}(P \| Q) = \sum_{x_i} P(x_i) \log \frac{P(x_i)}{Q(x_i)} \qquad (2.4) $$

The cross-entropy satisfies D_KL(P||Q) ≥ 0 (Gibbs' inequality), with equality only if P = Q. Note that in general D_KL(P||Q) ≠ D_KL(Q||P), so D_KL is not a distance. From Equation 2.3, the Kullback-Leibler divergence (Equation 2.4) and Gibbs' theorem we have the following theorem:

Theorem 2.2 (Lam and Bacchus [71]) The encoding length of the data is a monotonically increasing function of the Kullback-Leibler divergence between the distribution defined by the model and the true distribution.

This theorem shows that instead of using the data encoding length (Equation 2.3) we can use the Kullback-Leibler divergence to evaluate candidate networks. Although the Kullback-Leibler divergence also involves a summation over an exponential number of configurations, a computationally feasible approach to evaluating this measure can be developed. Theorem 2.2 is in fact an extension of the work of Chow and Liu [22], who developed a method to obtain tree-shaped Bayesian networks which minimize the cross-entropy. We will now evaluate the cross-entropy. We first introduce a theorem due to Chow and Liu which relates the cross-entropy between the distribution defined by a tree-shaped network and the true distribution to the sum, over all variables, of the mutual information between a variable and its parent.

Theorem 2.3 (Chow and Liu) If the mutual information between two nodes X_i and X_j is defined as

$$ I(X_i; X_j) = \sum_{x_i, x_j} P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i)\, P(x_j)} $$

where we are summing over all possible values of X_i and X_j, then, by assigning to every arc between two nodes X_i and X_j a weight equal to I(X_i; X_j), the cross-entropy D_KL(P||Q) over all tree-structured distributions Q is minimized when the structure representing Q(X) is a maximum-weight spanning tree.


Using this theorem, they developed an algorithm for learning the best tree-shaped Bayesian network by constructing the maximum spanning tree. Lam and Bacchus [71] extended Theorem 2.3 to general Bayesian networks by assigning to each node X_i the mutual information between the node and its parent set Pa_i:

$$ I(X_i; \mathbf{Pa}_i) = \sum_{x_i, \mathbf{pa}_i} P(x_i, \mathbf{pa}_i) \log \frac{P(x_i, \mathbf{pa}_i)}{P(x_i)\, P(\mathbf{pa}_i)} \qquad (2.5) $$

The following theorem holds.

Theorem 2.4 (Lam and Bacchus) The cross-entropy D_KL(P||Q) is a monotonically decreasing function of

$$ \sum_{i=1,\, \mathbf{Pa}_i \neq \emptyset}^{n} I(X_i; \mathbf{Pa}_i) $$

Hence, it will be minimized if and only if the sum is maximized.

Note that Chow and Liu's theorem, Theorem 2.3, is in fact a special case of Lam and Bacchus' theorem, Theorem 2.4, where the parent set Pa_i of a node X_i consists of exactly one node (except for the root node). Note also that the problem of the exponential number of atomic events is overcome by estimating lower-order marginals, namely over the variables X_i and their parent sets Pa_i.

2.1.2 Friedman and Goldszmidt approach

Friedman and Goldszmidt [37] directly developed Equation 2.3 in order to obtain an expression for the encoding length of the database given a network structure, L(D, B_S):

$$ L(D, B_S) = -m \sum_{u} p_u \log q_u = -m \sum_{u} p_u \log \prod_{i} P(x_i \mid \mathbf{pa}_i) = -\sum_{i} \sum_{x_i, \mathbf{pa}_i} N(x_i, \mathbf{pa}_i) \log P(x_i \mid \mathbf{pa}_i) \qquad (2.6) $$

where N(x_i, pa_i) counts the number of cases in the database D that take the values x_i and pa_i. Note that Friedman and Goldszmidt approximated the probability p_u by the factorized probability q_u given by the network, in order to overcome the problem of not having the values of p_u. They, like Lam and Bacchus before them, used the law of large numbers to approximate P(x_i | pa_i) using the counts N(·) from the data. With these two approximations the encoding length can be rewritten as

$$ L(D, B_S) = m \sum_{i} H(X_i \mid \mathbf{Pa}_i) \qquad (2.7) $$

where H(X|Y) = −Σ_{x,y} P(x, y) log P(x|y) is the conditional entropy of X given Y. We want to note that Herskovits and Cooper (1990) used the conditional entropy (Equation 2.7) as a scoring measure in their Kutató algorithm [53].
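Equations 2.6 and 2.7 can be evaluated directly from counts. The sketch below estimates P(x_i | pa_i) by the empirical conditional frequencies and measures, in bits, how many fewer bits the arc a → b buys on a toy dataset; all names and data are illustrative.

from collections import Counter
from math import log2

def data_encoding_length(data, parents):
    """Equations 2.6-2.7: L(D, B_S) with P(x_i | pa_i) estimated by
    the empirical counts N(x_i, pa_i) / N(pa_i), measured in bits."""
    total = 0.0
    for x, ps in parents.items():
        joint = Counter((row[x], tuple(row[p] for p in ps)) for row in data)
        marg = Counter(tuple(row[p] for p in ps) for row in data)
        total -= sum(n * log2(n / marg[pa]) for (v, pa), n in joint.items())
    return total

data = [{'a': 0, 'b': 0}, {'a': 0, 'b': 0}, {'a': 1, 'b': 1}, {'a': 1, 'b': 0}]
print(data_encoding_length(data, {'a': [], 'b': []}))     # about 7.245 bits
print(data_encoding_length(data, {'a': [], 'b': ['a']}))  # 6.0 bits: shorter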

2.1.3 Score Equivalence

We show in this section that the MDL quality measure gives the same value to any pair of equivalent structures B_S1 and B_S2. From Lemma 1.1 and Theorem 1.2, it suffices to see that two equivalent networks that only differ in a single arc reversal are given the same score. Recall from Lemma 1.1 that the reversed arc must be covered. We will first focus on the influence of such an arc reversal on the length of encoding the data given the structure, and afterwards on its influence on the length of encoding the structure.

Lemma 2.1 Let U be a set of variables, and let D be a database over U. Let B_S1 and B_S2 be two network structures over U. Furthermore, let X_a and X_b be two nodes in B_S1 and B_S2 where Pa_a = R and Pa_b = R ∪ {X_a} in B_S1, and Pa_a = R ∪ {X_b} and Pa_b = R in B_S2, and where the parent sets of the rest of the variables in U are the same in both structures. Then

D_KL(D||B_S1) = D_KL(D||B_S2)

Now observe that the condition of Lemma 2.1 is the same as the condition that B_S1 and B_S2 are equivalent. So the lemma states that a single arc reversal on a network structure B_S1 has no influence on the Kullback-Leibler distance between the distribution modeled by the network structure and the dataset, as long as the structure B_S2 obtained is equivalent. We also want to note that this lemma holds not only for the approximations to the distance used by Lam and Bacchus and by Friedman and Goldszmidt, but is a general result, since the terms that do not appear in the two approximations do not depend on the network structure. The following lemma gives a similar result for the encoding lengths of the two network structures B_S1 and B_S2.

Lemma 2.2 Let U, D, B_S1 and B_S2 be as in Lemma 2.1. Then K_1 = K_2, where K_1 and K_2 are the encoding lengths of the structures B_S1 and B_S2 respectively.

Lemmas 2.1 and 2.2 show that two equivalent structures that differ in one single arc are given the same quality by the MDL scoring function. We can extend this result to any pair of equivalent network structures.

Theorem 2.5 For all domains U and any pair of equivalent network structures B_S and B_S′, the MDL metric gives the same value: MDL(B_S) = MDL(B_S′).

2.2 The Bayesian Inference Approach

In this framework, we are interested in the most probable hypothesis (a Bayesian network in our case) from some space H given the observed data D plus any initial knowledge about the prior probabilities of the various hypotheses in H. Bayes’ theorem provides a direct method


for calculating such probabilities. More precisely, Bayes' theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself. We shall write P(h) to denote the initial probability that hypothesis h holds, before we have observed any training data. P(h) is often called the prior probability of h and may reflect any background knowledge we have about the chance that h is a correct hypothesis. Next, we will write P(D|h) to denote the probability of observing data D given some world in which hypothesis h holds. In Machine Learning we are interested in the posterior probability of h, that is, the probability P(h|D) that h holds given the training data D. Bayes' theorem provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h):

$$ P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} \qquad (2.8) $$

In Machine Learning, the learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D. Usually it is computationally unfeasible to find the most probable hypothesis, so heuristic search strategies are used. Note that in order to compare the posterior probabilities of two candidate hypotheses, we do not actually need to calculate the term P(D) of Bayes' theorem (see Equation 2.8); this holds only when the posterior probabilities are calculated with respect to the same dataset D. In some cases we will assume that every hypothesis in H is equally probable a priori, that is, P(h_i) = P(h_j) for all h_i and h_j in H; in this case we can also drop the term P(h) from the equation. The reader may find a deep and understandable introduction to the Bayesian approach to learning from data in Heckerman's tutorials [51, 52].

2.2.1 Cooper and Herskovits proposal

Cooper and Herskovits [23] designed a Bayesian method for constructing a probabilistic network from data. The algorithm searches for a probabilistic network structure with high posterior probability given a database of cases, and outputs the structure and its probability. They developed an expression for the posterior p(B_S|D) of the structure of a network S = (B_S, B_P), where B_S denotes the network structure and B_P denotes the conditional probability assignments associated with the structure. Since they wanted to compare networks, they computed the ratio

$$ \frac{p(B_{S1} \mid D)}{p(B_{S2} \mid D)} = \frac{p(B_{S1}, D)/p(D)}{p(B_{S2}, D)/p(D)} = \frac{p(B_{S1}, D)}{p(B_{S2}, D)} \qquad (2.9) $$

Thus, they used p(B_S, D) as a scoring function for measuring the quality of network structures. To do so, they introduced the following four assumptions:
1. The database variables take discrete values.
2. The cases of the database occur independently given a belief network S.

3. There are no cases that have variables with missing values.
4. Before observing D, we are indifferent regarding which numerical probabilities to assign to the belief network structure B_S.

Using these four assumptions, Cooper and Herskovits [23] obtained the following expression for P(B_S, D):

$$ P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \qquad (2.10) $$

Here the left term P(B_S) stands for the prior probability of the network structure, N_ijk stands for the number of cases in which variable X_i takes its k-th value and its parents take their j-th configuration, and N_ij = Σ_k N_ijk. When no prior information is available, the prior is chosen to be a uniform distribution and consequently can be dropped from the equation. The remaining terms of Equation 2.10 represent how well the network structure B_S fits the database D. This Bayesian scoring metric is usually called Bayesian Dirichlet, or BD for short.

Table 2.1: A database of cases over two binary variables.

  a  b
  0  0
  0  0
  0  0
  0  1
  1  0
  1  0
  1  1
  1  1

2.2.2 Score Equivalence

The Bayesian measure derived by Cooper and Herskovits is not score equivalent when a uniform prior distribution over the network structures is assumed. This can be shown with the following example, taken from Bouckaert [6]. Consider the database in Table 2.1 over two binary variables a and b, and the network structures B_S1 being a → b and B_S2 being a ← b. Both structures represent the same set of independences, that is, they are equivalent. Yet

$$ P(B_{S1}, D) = P(B_{S1}) \frac{(2-1)!}{(8+2-1)!} 4!\,4! \cdot \frac{(2-1)!}{(4+2-1)!} 3!\,1! \cdot \frac{(2-1)!}{(4+2-1)!} 2!\,2! = P(B_{S1}) \frac{24}{25} \frac{1}{9!} $$

and

$$ P(B_{S2}, D) = P(B_{S2}) \frac{(2-1)!}{(8+2-1)!} 5!\,3! \cdot \frac{(2-1)!}{(5+2-1)!} 3!\,2! \cdot \frac{(2-1)!}{(3+2-1)!} 1!\,2! = P(B_{S2}) \frac{1}{9!} $$

So, if we assume both structures to be equiprobable, we have that P(B_S1, D) = (24/25) P(B_S2, D).
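The example above can be checked mechanically. The sketch below implements Equation 2.10 (with the structure prior left out, i.e. a uniform P(B_S)) and reproduces the 24/25 ratio on the data of Table 2.1; the function and variable names are our own.

from math import factorial
from collections import Counter

def bd_score(data, parents, states):
    """Equation 2.10 up to the structure prior P(B_S)."""
    score = 1.0
    for x, ps in parents.items():
        r = len(states[x])
        joint = Counter((tuple(row[p] for p in ps), row[x]) for row in data)
        marg = Counter(tuple(row[p] for p in ps) for row in data)
        # unobserved parent configurations contribute a factor of 1,
        # so iterating over the observed ones suffices
        for pa, n_ij in marg.items():
            score *= factorial(r - 1) / factorial(n_ij + r - 1)
            for v in states[x]:
                score *= factorial(joint[(pa, v)])
    return score

# Table 2.1: eight cases over two binary variables a and b.
data = [{'a': 0, 'b': 0}, {'a': 0, 'b': 0}, {'a': 0, 'b': 0}, {'a': 0, 'b': 1},
        {'a': 1, 'b': 0}, {'a': 1, 'b': 0}, {'a': 1, 'b': 1}, {'a': 1, 'b': 1}]
states = {'a': [0, 1], 'b': [0, 1]}
s1 = bd_score(data, {'a': [], 'b': ['a']}, states)  # a -> b
s2 = bd_score(data, {'b': [], 'a': ['b']}, states)  # a <- b
print(s1 / s2)  # 0.96 = 24/25: equivalent structures, different scores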


In order to obtain a score-equivalent Bayesian metric, we present the following, more general, expression of the Bayesian metric, which can be derived from the same four assumptions of Cooper and Herskovits, as Heckerman et al. showed [52]:

$$ P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N_{ij} + N'_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + N'_{ijk})}{\Gamma(N'_{ijk})} \qquad (2.11) $$

where N_ijk and N_ij are defined as before, N'_ij = Σ_{k=1}^{r_i} N'_ijk, and Γ(·) is the Gamma function. Note that when N'_ijk = 1, Equation 2.11 equals Equation 2.10.

Heckerman et al. [52] demonstrated that when the Dirichlet exponents N'_ijk are constrained by the relation

$$ N'_{ijk} = N' \cdot P(x_i^k, \mathbf{pa}_i^j \mid B_S) \qquad (2.12) $$

where N' is the user's equivalent sample size for the domain, then the Bayesian metric is score equivalent.

Lemma 2.3 Let U be a set of variables, and let D be a database over U. Let B_S1 and B_S2 be two network structures over U. Furthermore, let X_a and X_b be two nodes in B_S1 and B_S2 where Pa_a = R and Pa_b = R ∪ {X_a} in B_S1, and Pa_a = R ∪ {X_b} and Pa_b = R in B_S2, and where the parent sets of the rest of the variables in U are the same in both structures. Then

P(B_S1, D) = P(B_S2, D)

Lemma 2.3 shows that two equivalent structures that differ in one single arc are given the same quality by the Bayesian Dirichlet scoring function (BDe) when the Dirichlet exponents are constrained as stated in Equation 2.12. We can extend this result to any pair of equivalent network structures, in the same way we did for the MDL measure.

Theorem 2.6 For all domains U and any pair of equivalent network structures B_S and B_S′, the BDe metric gives the same value: BDe(B_S) = BDe(B_S′).

Buntine [8] also developed a Bayesian metric which is score equivalent and which is in fact a special case of the constraint on the Dirichlet exponents N'_ijk in Equation 2.12: it computes P(x_i^k, pa_i^j | B_S) from a uniform joint distribution.

2.3 Sufficient Statistics

In order to calculate quality measures we only need some information extracted from the data rather than the whole dataset. We introduce here the concept of sufficient statistics and some notation that we will use in the rest of this work. Roughly speaking, a sufficient statistic contains all the information needed to make inference on a parameter θ. So, if we gather and store sufficient statistics from a dataset, we can throw the data away without losing any information related to θ. We can formally define the concept as follows.

Definition 2.2 (Sufficient Statistic) Let X denote a vector of random variables whose distribution depends on a parameter θ. A vector-valued function T of X is said to be sufficient if the conditional distribution of X, given T = t, is independent of θ.


Now we study the sufficient statistics that algorithms need in order to learn a Bayesian network from data D, and introduce some notation. Let N_X^D(x) be the number of instances in D where X = x. Let N̂_X^D be the vector of the numbers N_X^D(x) for all values of X (from now on, we omit the superscript and the subscript of N_X^D(x) whenever they are clear from the context). We call the vector N̂_X the sufficient statistics of the variables X. So, given the decomposability of the score assigned to a structure (see Equation 1.2), the sufficient statistics that algorithms need are N̂_{X,Pa_X} for all X ∈ X and all possible parent sets. We denote by SUFF(B_S) the sufficient statistics needed in order to learn a network structure B_S.

The cardinality of the sufficient statistics when the network structure has no undirected cycles, i.e. a tree structure T, is

$$ |SUFF(T)| = \binom{n}{2} \cdot val(X)^2 $$

where val(X) stands for the number of values that a variable X can take. For simplicity we assume that all variables take the same number of values. Note that |SUFF(T)| is quadratic in the number of variables and in the number of values that variables can take. For a general network, i.e. a DAG, the cardinality of the sufficient statistics is

$$ |SUFF(B_S)| = n \times \sum_{i=0}^{p} \binom{n}{i} \cdot val(X)^i $$

where p is the maximum number of parents a variable can have. Note that when the number of parents is unbounded, the cardinality of the sufficient statistics is rather large; for example, when p = n, |SUFF(B_S)| >> n · 2^n · val(X). Fortunately, for many applications the number of parents is small.

We also want to note here that if the sufficient statistics of a dataset D are stored, then when new data D′ become available we do not need to go through the former dataset D again in order to recover the sufficient statistics SUFF_{D∪D′}(T) of the whole dataset. It is possible to calculate the sufficient statistics N̂^{D′}_{X_i,X_j} of the new data instances for all i, j : 0 ≤ i < j ≤ n, and to obtain N̂^{D∪D′}_X as N̂^D_X ⊕ N̂^{D′}_X, where ⊕ stands for the component-wise addition of vectors.
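This additivity of counts is what the incremental algorithms of this work exploit. A minimal sketch for the pairwise statistics of tree structures, with Counter objects standing in for the count vectors N̂ (this flat representation is an assumption for the example, not the AD-tree structure discussed elsewhere):

from collections import Counter
from itertools import combinations

def pairwise_statistics(data, variables):
    """Contingency counts N(X_i, X_j) for every pair of variables:
    the sufficient statistics needed for tree structures."""
    return {(xi, xj): Counter((row[xi], row[xj]) for row in data)
            for xi, xj in combinations(variables, 2)}

def merge(old, new):
    """The (+) operation: add the counts of the new batch to the stored
    ones, without revisiting the former dataset."""
    return {pair: old[pair] + new[pair] for pair in old}

variables = ['a', 'b']
d_old = [{'a': 0, 'b': 0}, {'a': 1, 'b': 1}]
d_new = [{'a': 1, 'b': 0}]
merged = merge(pairwise_statistics(d_old, variables),
               pairwise_statistics(d_new, variables))
print(merged[('a', 'b')] ==
      pairwise_statistics(d_old + d_new, variables)[('a', 'b')])  # True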

2.4 Batch Bayesian Network Structure Learning

In this section we review four algorithms for learning the structure of Bayesian networks. We first review the algorithm proposed by Chow and Liu (1968) [22] (CL from now on). They are considered to have developed the first method for constructing network structures (i.e. trees), at a time when Bayesian networks were still to be defined. The algorithm yields a tree structure which maximally approximates the database distribution. Second, we review the K2 algorithm proposed by Cooper and Herskovits (1992) [23]. This algorithm yields a general network structure (i.e. a DAG). The user must provide the algorithm with an order among the variables so as to reduce the search space of DAGs: the K2 algorithm searches through the space of network structures in which the parents of each variable are found among the variables that precede it in the order. In this way, the search is restricted to exploring one single network of each equivalence class visited. Third, we review the B algorithm proposed by Buntine (1991) [8]. This algorithm also yields a general network structure, and its search space is the space of all DAGs. It performs a greedy blind search, adding a single arc in each step.


From Theorem 1.4 one realizes that neither algorithm K2 nor algorithm B uses traversal operators in a way that fulfills the inclusion boundary condition (Definition 1.13). Thus, all these algorithms may get stuck in local maxima. Castelo and Kočka [13] proposed an algorithm that partially follows the inclusion boundary condition by means of an efficient implementation of the neighborhood N_ENCR. Their algorithm also performs a greedy search, but it is able to add, remove or reverse more than one arc in each step. In this way the algorithm follows a learning path of Bayesian networks which are in graphical model inclusion, trying to avoid local maxima. Although it may still get stuck in local maxima, it usually performs better than algorithms K2 and B. In Chapter 4 we will extend these four batch algorithms with some heuristics in order to obtain incremental versions of them.

2.4.1 Algorithm CL

In this section we study the algorithm proposed by Chow and Liu [22]. They designed an algorithm to estimate the underlying n-dimensional discrete probability distribution from a set of samples. The algorithm yields as an estimation the product of n − 1 second-order distributions that optimally approximates the probability distribution. This product can also be formulated as a distribution of n − 1 first-order dependence relationships among the n variables, forming a tree dependence structure. This means that they restrict the search space to trees, and thus the algorithm is able to recover one of the trees (there may be more) that best approximates the probability distribution. The algorithm uses the mutual information as a closeness measure between P(X) and P_τ(X), where P(X) is the probability distribution obtained from the set of samples and P_τ(X) is the tree dependence distribution. It is an optimization algorithm that gives the tree distribution closest to the distribution of the samples.

Let us introduce some notation in order to explain Chow and Liu's measure and algorithm. Let i be an integer such that 1 ≤ i ≤ n. Let j(i) be a mapping with 0 ≤ j(i) < i. Let T = (X, E) be a dependence tree, where X(T) = {X_i | i = 1, 2, ..., n} is the set of nodes, E(T) = {(X_i, X_j(i)) | i = 1, 2, ..., n} is the set of branches, and X_0 is the null node. If we now assign a cost I(X_i; X_j) to every dependence tree branch, the maximum-cost dependence tree is defined as the tree T such that for all T′ in T_n,

$$ \sum_{i=1}^{n} I(X_i; X_{j(i)}) \;\ge\; \sum_{i=1}^{n} I(X_i; X_{j'(i)}) $$

Chow and Liu applied some transformations to the Kullback-Leibler divergence D_KL(P||P_τ), as we saw in Theorem 2.3, obtaining that minimizing the closeness measure D_KL(P||P_τ) is equivalent to maximizing the term Σ_{i=1}^n I(X_i; X_j(i)). This result allowed Chow and Liu to use Kruskal's algorithm for the construction of trees of maximum total cost, where I(X_i; X_j(i)) represents the cost of the branch from node X_i to node X_j(i). Recall that Kruskal's algorithm obtains a maximum spanning tree from a given undirected graph. In our case the given graph is a complete graph, that is, a graph with a branch between every pair of variables; thus, if the complete graph has n variables, it has (n^2 − n)/2 branches. Kruskal's algorithm works as follows: a tree is formed by starting with a tree without branches and adding a branch between the two nodes with the highest mutual information. Next, a branch is added which has the maximal associated mutual information and does not introduce a cycle in the graph. This process is repeated until no more branches can be added. The worst-case running time of this algorithm is in Θ(m log m), where m is the number of branches of the given graph [2]. Therefore, in our case, the worst-case running time is in Θ((n^2 − n)/2 · log((n^2 − n)/2)).


Another well-known algorithm for obtaining a maximum spanning tree from an undirected graph is Prim's [2]. This algorithm works as follows: it begins with a tree with no branches and marks a variable chosen at random. Then it seeks an unmarked variable whose mutual information with one of the marked variables is maximal, marks that variable, and adds the corresponding branch to the tree. This process is repeated until all variables are marked. The worst-case running time of this algorithm is in Θ(n^2), where n is the number of variables of the given graph [2]. From the worst-case running times of both algorithms, it seems better to use Prim's algorithm rather than Kruskal's, as we work with a complete undirected graph. See Algorithm 1 for a version of Prim's algorithm.

Algorithm 1 CL
Require: a database D on the variables X = {X_1, ..., X_n}
Ensure: T is a dependence tree structure
  Calculate SUFF_D(T)
  T = (V, E), the empty tree where V(T) = ∅ and E(T) = ∅
  Calculate the cost I(X_i; X_j) for every pair
  Select the maximum-cost pair (X_i, X_j)
  V(T) = {X_i, X_j}; E(T) = {(X_i, X_j)}
  repeat
    B(T) = {(X_i, X_j) | ((X_i, X_k) ∈ E(T) ∨ (X_k, X_i) ∈ E(T)) ∧ X_j ∉ V(T)}
    Select the maximum-cost pair (X_i, X_j) from B(T)
    V(T) = V(T) ∪ {X_j}
    E(T) = E(T) ∪ {(X_i, X_j)}
  until V(T) = X
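A compact Python sketch of the CL algorithm, combining the empirical mutual information with Prim's construction above; the dataset and the dictionary-based representation are illustrative, and ties between equal-cost branches are broken arbitrarily.

from collections import Counter
from math import log

def mutual_information(data, x, y):
    """Empirical I(X; Y) in nats; only comparisons matter here."""
    m = len(data)
    jo = Counter((r[x], r[y]) for r in data)
    px, py = Counter(r[x] for r in data), Counter(r[y] for r in data)
    return sum(n / m * log(n * m / (px[a] * py[b])) for (a, b), n in jo.items())

def chow_liu(data, variables):
    """Prim's algorithm on the complete graph weighted by mutual
    information; returns the branches of a maximum-weight spanning tree."""
    cost = {(x, y): mutual_information(data, x, y)
            for i, x in enumerate(variables) for y in variables[i + 1:]}
    def w(x, y):
        return cost.get((x, y), cost.get((y, x)))
    marked = {variables[0]}  # Prim: start from an arbitrary variable
    branches = []
    while len(marked) < len(variables):
        x, y = max(((u, v) for u in marked for v in variables if v not in marked),
                   key=lambda e: w(*e))
        marked.add(y)
        branches.append((x, y))
    return branches

# b copies a, c is independent noise: the tree links a-b and attaches c.
data = [{'a': 0, 'b': 0, 'c': 0}, {'a': 0, 'b': 0, 'c': 1},
        {'a': 1, 'b': 1, 'c': 0}, {'a': 1, 'b': 1, 'c': 1}]
print(chow_liu(data, ['a', 'b', 'c']))  # e.g. [('a', 'b'), ('a', 'c')]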

2.4.2 Algorithm K2

In this section we study the algorithm proposed by Cooper and Herskovits [23]. They designed a hill-climbing heuristic search in order to obtain a general Bayesian network structure (i.e. a DAG). The greedy search (see Algorithm 2) begins with the arc-less network and then, for each node, incrementally adds the parent whose addition most increases the probability of the resulting structure. When the addition of no single parent can increase the probability, it stops adding parents to the node. Besides that, the algorithm requires a total order among the variables, which is used for reducing the search space of DAGs: the algorithm does not consider those DAGs where a node precedes one of its parents in the order. In this way the algorithm cannot reach all equivalence classes, and it reaches at most one single DAG from each of the equivalence classes explored.

More formally, the search space and the search heuristic can be stated as follows. Let X be a set of variables, let B_S be the set of all possible network structures over X, and let Pred be an ordering on X. The neighborhood of a network structure B_S = (X, E) is the set of all network structures that can be obtained from B_S by adding a single arc X_j → X_i such that X_j precedes X_i in Pred, that is,

N_K2(B_S) = {(X, E′) | E′ = E ∪ {(X_j, X_i)} ∧ (X_j, X_i) ∉ E ∧ X_j precedes X_i in Pred}   (2.13)

Thus, the algorithm only uses the operator of adding a legal arc in order to move within the


search space. The algorithm begins with the arc-less network structure and moves on to the network structure with the highest score in its neighborhood. It continues moving until it reaches a local maximum, that is, until all the structures in the neighborhood score less than the current one. Note that N_K2(B_S) ⊆ N_OA(B_S) ⊆ N_NR(B_S) ⊆ IB(B_S); thus K2 does not follow the inclusion boundary condition (Definition 1.13).

Actually, K2 does not use the whole neighborhood N_K2(B_S) as defined in Equation 2.13. Instead, K2 iterates over each variable in X, adding in each step all the edges that increase the score of the network; the edges are introduced in decreasing score order. Note also that, for a given precedence order Pred, K2 yields the same final network as if it used N_K2 as defined in Equation 2.13, independently of the order in which it iterates over the variables. The algorithm uses a scoring function which follows from the Bayesian Dirichlet function, Equation 2.10:

$$ g(i, \mathbf{Pa}_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \qquad (2.14) $$

where N_ijk and Pa_i are as before.

Algorithm 2 K2
Require: a database D on the variables X = {X_1, ..., X_n}, an order Pred among the variables, and a maximum number of parents per node u
Ensure: B_S is a DAG structure with high a posteriori probability given the database D
  Calculate SUFF_D(T)
  for i = 1, ..., n do
    Pa_i = ∅
    P_old = g(X_i, Pa_i)
    OkToProceed = true
    while OkToProceed and |Pa_i| < u do
      let z ∈ Pred(X_i) − Pa_i be the node that maximizes g(X_i, Pa_i ∪ {z})
      P_new = g(X_i, Pa_i ∪ {z})
      if P_new > P_old then
        P_old = P_new
        Pa_i = Pa_i ∪ {z}
      else
        OkToProceed = false
      end if
    end while
  end for
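A sketch of K2 under the same conventions, using the logarithm of Equation 2.14 (computed with lgamma so that large factorials cause no overflow); the names and the toy data are illustrative.

from collections import Counter
from math import lgamma

def log_g(data, x, pa, states):
    """log of Equation 2.14: lgamma(n + 1) = log(n!)."""
    r = len(states[x])
    joint = Counter((tuple(row[p] for p in pa), row[x]) for row in data)
    marg = Counter(tuple(row[p] for p in pa) for row in data)
    score = 0.0
    for pa_cfg, n_ij in marg.items():
        score += lgamma(r) - lgamma(n_ij + r)
        for v in states[x]:
            score += lgamma(joint[(pa_cfg, v)] + 1)
    return score

def k2(data, order, states, max_parents):
    """K2: for each variable, greedily add the predecessor that most
    increases g until no addition helps or max_parents is reached."""
    structure = {}
    for i, x in enumerate(order):
        pa, p_old = [], log_g(data, x, [], states)
        candidates = list(order[:i])  # only predecessors in the order
        while candidates and len(pa) < max_parents:
            z = max(candidates, key=lambda c: log_g(data, x, pa + [c], states))
            p_new = log_g(data, x, pa + [z], states)
            if p_new > p_old:
                p_old = p_new
                pa.append(z)
                candidates.remove(z)
            else:
                break
        structure[x] = pa
    return structure

data = [{'a': 0, 'b': 0}] * 3 + [{'a': 1, 'b': 1}] * 3
states = {'a': [0, 1], 'b': [0, 1]}
print(k2(data, ['a', 'b'], states, max_parents=1))  # {'a': [], 'b': ['a']}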

2.4.3 Algorithm B

In this section we study the algorithm B which Bouckaert [6] erroneously ascribed to Buntine [8]. Here, we follow Bouckaert's [6] nice explanation of algorithm B. Algorithm B is a hill-climbing heuristic search that obtains a general Bayesian network structure (i.e. a DAG) and does not need any order on the variables.


Like algorithm K2, algorithm B is a greedy search, see Algorithm 3, that begins with the arc-less network and then, for each node, incrementally adds the parent whose addition most increases the score of the resulting structure and does not introduce a cycle into the structure. When no addition of a single parent can increase the score, it stops adding parents to the node. The search space of this algorithm is the whole space of DAGs. In spite of this, algorithm B may not be able to reach all DAGs, because it may get stuck in a local maximum that prevents it from moving on to a network with a higher score. In order to escape from local maxima, the algorithm would require adding, deleting or reversing more than one edge, as we will see in the next section.

Formally, the search space and the search heuristic can be stated as follows. Let X be a set of variables and let BS be the set of all possible network structures over X. The neighborhood of a network structure BS = (X, E) is the set of all network structures that can be obtained from BS by adding a single arc Xi → Xj such that the new network structure is a directed acyclic graph, that is,

N_B(B_S) = \{(X, E') \mid E' = E \cup \{(X_i, X_j)\} \wedge (X_i, X_j) \notin E \wedge (X, E') \text{ is a DAG}\}

Algorithm B only uses the operator of adding a legal arc in order to move within the search space, and since N_B(BS) = N_OA(BS) ⊆ N_NR(BS) ⊆ IB(BS), it does not follow the inclusion boundary condition (Definition 1.13). The algorithm begins with the arc-less network structure and moves on to the network structure with the highest score in its neighborhood. It continues moving until it reaches a local maximum, that is, until all the structures in the neighborhood score less than the current one.

Let us take a closer look at the generation mechanism. Let BS be the network structure at some moment during the execution of algorithm B. To find a network structure from the neighborhood of BS with the highest quality, the generation mechanism needs to determine the arc whose addition to BS gives the highest increase in quality and does not introduce a cycle. Now, let BSij denote the network structure obtained by adding to BS an arc Xj → Xi that does not introduce a cycle. From the sum property of quality measures, we have that the difference in quality between BSij and BS equals

Q(B_{S_{ij}}, D) - Q(B_S, D) = m(X_i, \mathrm{Pa}_i \cup \{X_j\}) - m(X_i, \mathrm{Pa}_i)

where Pa_i is the parent set of Xi in BS. These values are stored in the array A: we have A[i, j] = m(Xi, Pa_i ∪ {Xj}) − m(Xi, Pa_i) if the addition of Xj → Xi does not introduce a cycle, and A[i, j] = −∞ otherwise. Note that if an arc Xm → Xk is added, only the values A[k, j], j = 1, . . . , n need to be recalculated. In the pseudocode, Algorithm 3, A_i denotes the set of indices of the ancestors of Xi and D_i denotes the set of indices of the descendants of Xi, including i itself.

Algorithm B first initializes the array A with the increase of score due to adding each edge. Then it adds the edge that most increases the score of the network, that is, the one with the highest value in array A. Each time an edge is added to the network, all the elements of array A that represent an edge that would introduce a cycle are marked. The process is repeated until the score cannot be improved.

2.4.4 Algorithm HCMC

In this section we study the algorithm proposed in Castelo [11] and in Castelo and Kočka [13]. We will call this algorithm HCMC. Algorithm HCMC, like algorithm B, performs a hill-climbing heuristic search over the space of DAGs. The main difference is that HCMC accounts for the inclusion order among Bayesian networks.


Algorithm 3 B
Require: a database D on X = {X1, · · · , Xn} variables
Ensure: BS is a DAG structure with high a posteriori probability given the database D
  Calculate SUFF_D
  for i = 1, . . . , n do
    Pa_i = ∅
  end for
  for i = 1, . . . , n, j = 1, . . . , n do
    if i ≠ j then
      A[i, j] = m(Xi, {Xj}) − m(Xi, ∅)
    else
      A[i, j] = −∞ {Obstruct Xi → Xi}
    end if
  end for
  repeat
    select i, j that maximize A[i, j]
    if A[i, j] > 0 then
      Pa_i = Pa_i ∪ {Xj}
      for a ∈ A_i, b ∈ D_i do
        A[a, b] = −∞ {Obstruct introduction of cycles}
      end for
      for k = 1, . . . , n do
        if A[i, k] > −∞ then
          A[i, k] = m(Xi, Pa_i ∪ {Xk}) − m(Xi, Pa_i)
        end if
      end for
    end if
  until A[i, j] ≤ 0 or A[i, j] = −∞ for all i, j
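The array-based bookkeeping of algorithm B can be sketched in Python as follows. The decomposable node score m is assumed to be given, and cycle detection is done with a plain reachability test instead of the ancestor and descendant index sets of the pseudocode; this is a sketch of the mechanism, not a faithful reimplementation.

import math

def algorithm_b(n, m):
    # parents[i] is the current parent set of X_i
    parents = {i: set() for i in range(n)}
    # A[i][j]: score increase of adding the arc X_j -> X_i
    A = [[m(i, {j}) - m(i, set()) if i != j else -math.inf
          for j in range(n)] for i in range(n)]

    def creates_cycle(i, j):
        # adding X_j -> X_i creates a cycle iff j is reachable from i
        stack, seen = [i], set()
        while stack:
            k = stack.pop()
            if k == j:
                return True
            if k not in seen:
                seen.add(k)
                stack.extend(c for c in range(n) if k in parents[c])
        return False

    while True:
        i, j = max(((a, b) for a in range(n) for b in range(n)),
                   key=lambda e: A[e[0]][e[1]])
        if A[i][j] <= 0:
            return parents              # no addition improves the score
        parents[i].add(j)
        A[i][j] = -math.inf             # the arc X_j -> X_i is now present
        for a in range(n):              # obstruct cycle-creating additions
            for b in range(n):
                if A[a][b] > -math.inf and creates_cycle(a, b):
                    A[a][b] = -math.inf
        for k in range(n):              # only row i needs rescoring
            if A[i][k] > -math.inf:
                A[i][k] = m(i, parents[i] | {k}) - m(i, parents[i])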


Castelo and Kočka [13] demonstrated that if the traversal operators, e.g. addition, removal or reversal of arcs, fulfill the inclusion boundary condition, Definition 1.13, then the search strategy can avoid local maxima (see Theorem 1.4). They showed that traversal operators that change (add, delete or reverse) a single arc do not follow the inclusion boundary condition (see Theorem 1.4). In the recent literature there are other approaches that try to escape from local maxima by changing more than one arc at a time; see for example [128, 25]. Castelo and Kočka [13] demonstrated that under some conditions the inclusion boundary condition is sufficient to avoid local maxima, as stated in the following theorem.

Theorem 2.7 Let D∞ be a dataset of unbounded length sampled from a probability distribution P which is a perfect map of some DAG structure BS∗ that determines a Bayesian network S∗ = (BS∗, PS∗). Let sc(BS; D∞) be a locally consistent, and score equivalent, scoring metric. Let BS be any given DAG and N(BS) its neighborhood created by a traversal operator that satisfies the inclusion boundary condition, i.e. N(BS) ⊇ IB(BS). There exists at least one DAG B'S ∈ N(BS) such that sc(B'S; D∞) > sc(BS; D∞), unless S = S∗.

Chickering [18] introduced the concept of a locally consistent scoring metric.

Definition 2.3 (Locally consistent scoring metric) A scoring metric is locally consistent if it:

1. increases as the result of adding any edge that eliminates an independence restriction that does not hold in P; and
2. decreases as the result of adding any edge that does not eliminate an independence restriction that does not hold in P.

Chickering [18] also proves that a Bayesian scoring metric is locally consistent when the dataset D is an independent and identically distributed sample from P, there exists a DAG which is a perfect map of P, and the number of records in D is unbounded. Theorem 2.7 states that there is always some learning path that permits traversing the search space towards some Bayesian network that is in inclusion with the true Bayesian network. It also states that the scores of the networks found along the learning path are strictly increasing.

Remember from Theorem 1.4 that the ENR and ENCR neighborhoods fulfill the inclusion boundary condition. Namely, they showed that the ENR neighborhood coincides with the inclusion boundary, and that the ENCR neighborhood contains all DAGs in ENR plus others that are not in the inclusion boundary. In order to calculate the ENR and ENCR neighborhoods of a given network BS belonging to an equivalence class C, it is necessary to enumerate all members of C. The effort to enumerate the members of an equivalence class is prohibitive, since there is no cheap graphical characterization of them. For this reason, they use a simulation of both neighborhoods by randomly reversing some number of covered arcs. They bank on the assumption that the average ratio of DAGs per equivalence class is bounded by 3.7 [47]. They also noted that not all members of an equivalence class are strictly required to reach the whole inclusion boundary. The HCMC algorithm simulates, by means of a random process, the ENR or ENCR neighborhoods. Castelo et al. showed that using a simulation of the ENCR neighborhood may improve the performance of an algorithm that uses a simulation of an ENR neighborhood.

In order to simulate the neighborhoods, they introduce the repeated covered arc reversal algorithm (RCAR).


RCAR makes it possible to reach any member of the equivalence class with a certain probability. The RCAR algorithm, see Algorithm 4, takes a positive integer r as a parameter and iterates a random number of times between 0 and r. At each iteration, it picks a covered arc at random and reverses it. The bounded ratio of DAGs per equivalence class suggests that a small number between 4 and 10 should be sufficiently large to reach any member of the equivalence class with some probability.

Algorithm 4 RCAR
Require: BS a DAG, r an integer number
Ensure: BS a DAG with covered arcs reversed
  rr = rnd(0, r)
  for i = 0, . . . , rr − 1 do
    let ce be the set of covered arcs of BS
    let e be an arc of ce chosen at random
    reverse e in BS
  end for

Using the RCAR algorithm, Castelo and Kočka [13] simulate the ENR and the ENCR neighborhoods, obtaining two new neighborhoods for Bayesian networks that they defined as:

• RCARNR (RCAR+NR): perform the RCAR algorithm and then create a NR neighborhood, denoted N_RCARNR(G)
• RCARR (RCAR+NCR): perform the RCAR algorithm and then create a NCR neighborhood, denoted N_RCARR(G)

Looking at HCMC, Algorithm 5, one can see that it begins with the network with no arcs. Then, at each iteration it calculates the RCARNR or the RCARR neighborhood of the current network and keeps the network with the highest score. The algorithm stops when none of the networks in the neighborhood improves the score of the current one.

In our implementation of the HCMC algorithm we used four arrays to keep the scores of the networks in the neighborhood of the current one: namely, array A, where we keep the score obtained by adding a new arc; array D, where we keep the score obtained by deleting an existing arc; and finally arrays Ra and Rd, where we keep the score obtained by reversing an arc. The score of reversing an arc is calculated by adding the corresponding entries of Rd and Ra, as the first keeps the score obtained by deleting an arc and the second keeps the score of adding an arc. As in algorithm B, we use the sum property of scoring functions and keep in the above-mentioned arrays the difference between the score of the current network and that of the network obtained by adding, deleting or reversing an arc.

Algorithm 6, Obtain Initial Neighborhood, initializes arrays A, D, Ra and Rd. Note that elements of the arrays that represent forbidden operations, for example adding an arc that would introduce a cycle, are marked with the value −∞, which prevents the algorithm from considering the operation again. Algorithm 7, Best DAG, finds the element with the highest value in the arrays, performs the corresponding operation and finally calls the appropriate functions to update the arrays.
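Returning to the covered-arc machinery, the sketch below shows RCAR in Python under the usual characterization that an arc x → y is covered when Pa(y) = Pa(x) ∪ {x}; the DAG representation as a dict from node to parent set is an assumption of the sketch.

import random

def covered_arcs(parents):
    # arcs x -> y with Pa(y) = Pa(x) u {x}; reversing such an arc
    # keeps the DAG within the same equivalence class
    return [(x, y) for y, ps in parents.items() for x in ps
            if ps == parents[x] | {x}]

def rcar(parents, r):
    # reverse a randomly chosen covered arc a random number of
    # times between 0 and r
    for _ in range(random.randint(0, r)):
        ce = covered_arcs(parents)
        if not ce:
            break
        x, y = random.choice(ce)
        parents[y].remove(x)    # delete x -> y
        parents[x].add(y)       # add y -> x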


Algorithm 5 HCMC
Require: a database D on X = {X1, · · · , Xn} variables; ncr indicates whether to use the NCR or the NR neighborhood
Ensure: BS is a DAG structure with high a posteriori probability given the database D
  Calculate Suff_D()
  localMaximum = false
  let BS be a DAG with no arcs
  Obtain Initial Neighborhood
  while not localMaximum do
    RCAR(BS, r)
    let B'S be the Best DAG of N_{ncr?NCR:NR}(BS)
    localMaximum = (score(B'S) ≤ score(BS))
    if not localMaximum then
      BS = B'S
      trials = 0
    else if trials < MAXTRIALS then
      RCAR(BS, r)
      localMaximum = false
      trials = trials + 1
    end if
  end while

Algorithm 6 Obtain Initial Neighborhood
Require: a database D on X = {X1, · · · , Xn} variables
Ensure: matrices A, D, Ra and Rd represent the scores of B'S ∈ N_{ncr?NCR:NR}(BS) where BS has no arcs {Note that N_NR(BS) = N_NCR(BS)}
  for i = 1, . . . , n, j = 1, . . . , n do
    if i ≠ j then
      A[i, j] = m(Xi, {Xj}) − m(Xi, ∅)
      D[i, j] = −∞ {Obstruct deleting Xi ← Xj}
      Ra[i, j] = Rd[i, j] = −∞ {Obstruct reversing Xi ← Xj}
    else
      A[i, j] = −∞ {Obstruct adding Xi ← Xj}
    end if
  end for


Algorithm 7 Best DAG
Require: a database D on X = {X1, · · · , Xn} variables; ncr indicates whether to use the NCR or the NR neighborhood
Ensure: BS is the best DAG of N_{ncr?NCR:NR}(BS)
  let m = max_{i,j}(A[i, j], D[i, j], Ra[j, i] + Rd[i, j])
  if m > 0 then
    if m = A[i, j] then
      BS = BS ∪ {Xi ← Xj} {Adds an arc}
      Update A for addition of Xi ← Xj
      Update D for addition of Xi ← Xj
      if ncr then
        Update Ra and Rd for addition of Xi ← Xj
      end if
    else if m = D[i, j] then
      BS = BS \ {Xi ← Xj} {Deletes an arc}
      Update A for deletion of Xi ← Xj
      Update D for deletion of Xi ← Xj
      if ncr then
        Update Ra and Rd for deletion of Xi ← Xj
      end if
    else if m = Ra[j, i] + Rd[i, j] then
      BS = BS \ {Xi ← Xj} {Deletes an arc}
      Update A for deletion of Xi ← Xj
      Update D for deletion of Xi ← Xj
      Update Ra and Rd for deletion of Xi ← Xj
      BS = BS ∪ {Xj ← Xi} {Adds an arc}
      Update A for addition of Xj ← Xi
      Update D for addition of Xj ← Xi
      Update Ra and Rd for addition of Xj ← Xi
    end if
  end if


The rest of the algorithms, namely Algorithm 8 (Update A for addition), Algorithm 9 (Update D for addition), Algorithm 10 (Update Ra and Rd for addition), Algorithm 11 (Update A for deletion), Algorithm 12 (Update D for deletion) and Algorithm 13 (Update Ra and Rd for deletion), update the arrays after the chosen operation is applied to the current network. In this way, the scores of the networks which belong to the neighborhood of the newly obtained one are calculated.

Algorithm 8 Update A for addition
Require: a database D on X = {X1, · · · , Xn} variables, BS a DAG, and Xi ← Xj the arc added
Ensure: matrix A represents the score of B'S ∈ N_{ncr?NCR:NR}(BS)
  for a ∈ A_i, b ∈ D_i do
    A[a, b] = −∞ {Obstruct introduction of cycles}
  end for
  for k = 1, . . . , n do
    if A[i, k] > −∞ then
      A[i, k] = m(Xi, Pa_i ∪ {Xk}) − m(Xi, Pa_i)
    end if
  end for

Algorithm 9 Update D for addition
Require: a database D on X = {X1, · · · , Xn} variables, BS a DAG, and Xi ← Xj the arc added
Ensure: matrix D represents the score of B'S ∈ N_{ncr?NCR:NR}(BS)
  for a ∈ Pa_i do
    D[i, a] = m(Xi, Pa_i) − m(Xi, Pa_i \ {Xa})
  end for

Algorithm 10 Update Ra and Rd for addition
Require: a database D on X = {X1, · · · , Xn} variables, BS a DAG, and Xi ← Xj the arc added
Ensure: matrices Ra and Rd represent the score of B'S ∈ N_{ncr?NCR:NR}(BS)
  for a ∈ Pa_i do
    if notCovered(Xi ← Xa) ∧ Xa ← Xi does not IntroduceCycle in BS \ {Xi ← Xa} then
      Rd[i, a] = m(Xi, Pa_i) − m(Xi, Pa_i \ {Xa})
      if a = j then
        Ra[j, i] = m(Xj, Pa_j ∪ {Xi}) − m(Xj, Pa_j)
      end if
    end if
  end for


Algorithm 11 Update A for deletion
Require: a database D on X = {X1, · · · , Xn} variables, BS a DAG, and Xi ← Xj the arc deleted
Ensure: matrix A represents the score of B'S ∈ N_{ncr?NCR:NR}(BS)
  {Consider variable Xi}
  A[i, j] = 0 {To consider Xj again as a parent of Xi}
  for k ∈ Pa_i do
    if A[i, k] > −∞ then
      A[i, k] = m(Xi, Pa_i ∪ {Xk}) − m(Xi, Pa_i)
    end if
  end for
  {Allow arcs formerly forbidden due to Xi ← Xj}
  for a ∈ A_i, d ∈ D_i do
    if D_a ∩ A_d = ∅ then
      A[a, d] = m(Xa, Pa_a ∪ {Xd}) − m(Xa, Pa_a)
    end if
  end for

Algorithm 12 Update D for deletion
Require: a database D on X = {X1, · · · , Xn} variables, BS a DAG, and Xi ← Xj the arc deleted
Ensure: matrix D represents the score of B'S ∈ N_{ncr?NCR:NR}(BS)
  D[i, j] = 0 {Xi ← Xj cannot be deleted again}
  for a ∈ Pa_i do
    D[i, a] = m(Xi, Pa_i) − m(Xi, Pa_i \ {Xa})
  end for

Algorithm 13 Update Ra and Rd for deletion
Require: a database D on X = {X1, · · · , Xn} variables, BS a DAG, and Xi ← Xj the arc deleted
Ensure: matrices Ra and Rd represent the score of B'S ∈ N_{ncr?NCR:NR}(BS)
  for a ∈ Pa_i do
    if notCovered(Xi ← Xa) ∧ Xa ← Xi does not IntroduceCycle in BS \ {Xi ← Xa} then
      Rd[i, a] = m(Xi, Pa_i) − m(Xi, Pa_i \ {Xa})
    end if
  end for
  for s ∈ So_i do
    if notCovered(Xi ← Xs) ∧ Xs ← Xi does not IntroduceCycle in BS \ {Xi ← Xs} then
      Ra[i, s] = m(Xi, Pa_i ∪ {Xs}) − m(Xi, Pa_i)
    end if
  end for

2.5 Summary

In this chapter we have presented, to some extent, the state of the art of Bayesian network learning.


Chapter 3

Incremental Learning

The idea of incremental learning arose from the observation that a large part of human learning can be viewed as a gradual process of concept formation, or as the human ability to incorporate knowledge from new experiences into already learned concept structures [35]. The incremental learning approach was first motivated as a human capability worth incorporating into artificial agents. However, nowadays there exist other practical (i.e. industrial) reasons which increase the interest in incremental algorithms. Every day, firms and companies store millions of new records: for example, banks store millions of transaction records, internet search engines store millions of searches, and so on. Batch algorithms are not easily able to process and incorporate into a knowledge base this great amount of continuously incoming instances in a reasonable amount of time and memory space.

Incremental learning algorithms have been thoroughly studied within the machine learning community. During the second half of the eighties several systems were proposed in the field of clustering, Fisher's COBWEB [35] being one of the most cited ones even nowadays. See the work of Gennari et al. [46] for a survey of the incremental clustering algorithms of those years. A bit later, Anderson et al. [1] proposed a new system in the field of incremental Bayesian clustering. In the field of decision trees, Utgoff [122, 123] extended Quinlan's ID3 [106] to an incremental system. More recently, the field of Data Mining and Knowledge Discovery in Databases, which is concerned with very large databases available as streams that do not fit in main memory, has spawned new interest in incremental methods. See for example the work of Provost et al. [104], Domingos et al. [28] and Bradley et al. [7], mainly related to clustering.

In the Bayesian network community, incremental learning algorithms have received less attention. There exist, as far as we know, the works of Buntine (1991) [8], Lam and Bacchus (1994) [72] and Friedman et al. (1997) [38]. These three works will be explained and discussed at some length in Section 3.4. More recently, Hulten et al. (2002) [54] have developed a general method for mining large databases which can also be used as an incremental method for mining data streams.

3.1 Incremental algorithms: purpose and definition

There are real-world environments with very strong constraints and requirements that may affect the learning process. We could summarize such constraints and requirements under three main categories:


1. Resource limitation: in real-world applications there may be computing time and memory space limitations. Additionally, when databases are so huge that they must be stored in secondary memory, multiple inspection of such an amount of data is unfeasible. Similarly, it may be unreasonable to keep several alternative knowledge bases in memory.

2. Any-time availability: sometimes, given the nature of the real-world application, an intelligent agent needs to use a domain model in order to carry out its performance task even if the whole dataset is not available. Incremental methods can deal with such situations because they keep a domain model during the whole learning process.

3. Changing worlds: when intelligent agents must survive in a changing world they should be able to make their model of the world evolve. Incremental algorithms are a natural solution to cope with such situations because they are able to incorporate into the model new samples from the changing world. In such environments, learning algorithms should also be provided with some mechanism to forget old experiences (i.e. data instances) that do not represent the current state of the world.

Incremental algorithms are a response to these requirements, as we can see in the definitions found in the literature. Perhaps the most widely accepted definition of the main properties of an incremental algorithm was stated by Langley [73].

Definition 3.1 (Incremental algorithm) A learner L is incremental if L inputs one training experience at a time, does not reprocess any previous experiences, and retains only one knowledge structure in memory.

This definition is rather strong, because it imposes three heavy constraints on an algorithm for it to be considered incremental. The first two constraints require learning algorithms to be able to use their knowledge at any time during the learning process. In particular, the second one rules out those systems that process new data together with old data in order to come up with a new model; the important idea of this constraint is to keep the time required to process each data instance reasonably low and constant over the whole dataset. The third constraint prevents learning algorithms from making unreasonable memory demands. However, these three strong constraints can be relaxed to different degrees: incremental algorithms can be allowed to process a chunk of k data items at a time, to process at most k previous instances after encountering a new training instance, or to keep k alternative knowledge bases in memory. In this way we can relax each of these three constraints to a degree that can be specified with a parameter.

More recently, Domingos and Hulten [27, 29] stated desirable properties of incremental algorithms so that they can cope with data streams which may grow ad infinitum. Their proposal, in fact, is an extension of Definition 3.1.

Definition 3.2 An incremental algorithm should meet the following constraints:

• It must require small constant time per record.
• It must be able to build a model using at most one scan of the data.
• It must use only a fixed amount of main memory, irrespective of the total number of records it has seen.


• It must make a usable model available at any point in time, as opposed to only when it is done processing the data.
• It should produce a model that is equivalent (or nearly identical) to the one that would be obtained by the corresponding batch algorithm.

This second definition is also concerned with the time and memory space that incremental algorithms spend when they process new data instances. It also makes explicit that the learned model must be available at any time, which actually is a consequence of keeping low the time required to process an incoming data instance. Definition 3.2 explicitly requires the models learned by incremental algorithms to be of similar quality to the ones that batch algorithms would obtain. As we will see in the next section, incremental algorithms spend less computing time and memory space, but may produce models of lower quality than batch approaches. Definition 3.2 additionally detaches the amount of memory spent from the length of the data stream. This is a new constraint that rules out extensional descriptions of knowledge models, that is, models that are described by means of data instances (for example, classifiers described with a list of the instances that belong to each class).

The main reason for the importance of incremental learning is that we may gather new data every day, and it would be interesting to revise the current model in the light of this new data without spending an unreasonable amount of time and memory. This is desirable even though it may happen that the data gathered in one single day were enough to obtain a model of very high quality, in which case there would be no need for incremental algorithms. We believe that this situation is very unlikely to happen, for two reasons. Firstly, we may not be sure that the data already available are really representative of the whole domain: the data may not be a fair sample of the underlying probability distribution, and thus the distribution cannot be accurately estimated from the currently available dataset. Secondly, in order to obtain complex models, where lots of variables are related to each other, large datasets are needed. It is widely reported in the statistical pattern recognition literature (see Jain et al. [57]) that the performance of a classifier depends on the interrelationship between sample size, number of features, and classifier complexity. It has often been observed in practice that adding variables to a classifier may actually degrade its performance if the number of data instances used to learn the classifier is small relative to the number of variables. This is known as the peaking phenomenon, which is a consequence of the curse of dimensionality [57], usually stated as follows: in order to estimate a joint probability, the number of required data instances grows exponentially with the number of variables. This is due to the fact that the number of parameters required to estimate a joint probability distribution grows exponentially with the number of variables, i.e. the number of counters of a contingency table. This is also illustrated by T. Hastie et al. [50], who state that the sampling density is proportional to N^{1/p}, where p is the number of variables and N is the sample size. Thus, if N_1 = 100 represents a dense sample for a single-input problem, then N_{10} = 100^{10} is the sample size required for the same sampling density with 10 inputs. Thus, in high dimensions, all feasible training samples sparsely populate the input space.
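Spelled out, keeping the sampling density constant when moving from one input variable to p input variables requires

N_p^{1/p} = N_1 \quad \Longrightarrow \quad N_p = N_1^{\,p},

which for N_1 = 100 and p = 10 gives exactly the N_{10} = 100^{10} = 10^{20} instances quoted above.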
We also want to note that some authors label as "incremental" algorithms that do not learn samples one by one, but learn variables incrementally as they become available [73]. If we see a database as a matrix where rows are samples and columns are variables describing the samples, an algorithm could incrementally learn variables (columns) instead of samples (rows). In this way, an incremental algorithm grows a domain model by incorporating variables into it as they become available.

We shall see that some of the learning algorithms presented in the Bayesian network literature as batch methods could actually be considered incremental, since they learn variables in an incremental way (e.g. Herskovits and Cooper's K2 algorithm [23]). In our work we will not consider this sort of algorithm as incremental, since we are concerned with updating the current structure as new data become available.

Figure 3.1: Comparison of greedy batch and incremental learning paths. [Three panels (i), (ii) and (iii) show a score surface f, an initial point I, local maxima A and B, and the batch (b) and incremental (i) learning paths.]

3.2 Problems with local maxima and the ordering effects

It is well known that greedy search algorithms, like hill climbing, can get stuck at local maxima, and that this behavior is accentuated in incremental greedy approaches. In this section we report two drawbacks of incremental searchers. The first one appears when one compares the results obtained with incremental algorithms against those of batch algorithms. The second one appears when comparing the results yielded by an incremental algorithm when it is fed with the same data instances but in different orders.

Greedy strategies can be described [74] as a search over an n-dimensional space where a function f is defined. This function determines the shape of an n-dimensional surface, and the search strategy aims at reaching the point with the highest f score. In off-line environments, the function f is static and thus the shape of the surface is constant. On the contrary, in online environments each new instance changes the form of f and modifies the contours of the surface. When a batch algorithm works in an online environment and new data are available, it drops the current model and performs a new search from scratch, building a new learning path that follows the new contours of f. This new path is more informed, as more data are available, and thus it may lead to a different local maximum of higher quality. In contrast, incremental algorithms perform the search beginning from the former model, that is, they continue the former path hoping that with the new information they will be able to reach the same, or a very similar, model as the batch approach, while saving computing time as they do not begin from scratch.

Figure 3.1 illustrates this behavior. The graphic on the left, (i), shows a surface f and a learning path followed by both a batch and an incremental algorithm, b and i, that go from an initial point I to a local maximum A. The other two graphics show two different situations that may happen when new data are available and the surface f is consequently modified. The graphic in the middle, (ii), shows a situation in which both algorithms get to the same local maximum B: the batch algorithm begins at point I and follows the same learning path as before until it reaches the local maximum A, from where it continues climbing up to the higher local maximum B; the incremental algorithm, on the other hand, begins at the former local maximum A and follows the same path as the batch algorithm up to the local maximum B. The graphic on the right, (iii), shows a situation in which the incremental algorithm is not able to escape from the former local maximum A, but in which the batch algorithm is able to find a new path from the initial point I up to a local maximum B with a higher score.

Another consequence of changing the contours of the score function f is that incremental algorithms are sensitive to the order in which data instances are presented to the algorithm [36, 46, 73, 81]. Given two sample orders, O1 and O2, of a database D, an incremental algorithm may output different domain models when fed with order O1 or with order O2. Ordering effects are due to the nature of the incremental processing of data combined with the tendency of hill-climbing methods to get stuck at local maxima. This sort of algorithm may output a very skewed model when the first observed samples give a biased view of the domain, even in cases when the later samples give a correct view. Stating this problem in another way, it may happen that the first data guide the learning process to a local maximum surrounded by deep valleys; therefore, when new data are available, it is very difficult for the hill-climbing strategy to climb out of those deep valleys. When dissimilar instances are consecutively presented, results are much better than when similar instances are presented successively [36]. This may occur because, in the former case, initial observations are sampled from different parts of the description space, leading initial structures to approximate the actual probability distribution of the dataset, while in the latter, rather skewed structures may be built at the beginning, biasing the rest of the learning process.

There are several strategies that aim at mitigating the ordering effects, and they vary depending on the moment of the learning process at which they are applied [112, 120]. Strategies may be applied before, during and after the learning process. Strategies used before learning try to select instances that will produce a good initial knowledge model; seed selection methods are a typical example in the field of incremental clustering [88, 101]. Strategies used during the learning process try to escape from local maxima. The most widely used strategy introduces traversal operators that make it possible to reverse the search path, obtaining the effect of search backtracking; classical examples are the merge and split operators of the COBWEB clustering system [35], or the reverse and remove operators in Bayesian network learning [51]. Other strategies applied during the learning process delay the incorporation of instances that do not seem to fit well into the current knowledge model; examples of these methods are the Not-Yet strategy [112, 120] and the UNIMEM system [80, 81]. Finally, strategies used after the learning process try to improve the already learned structure; a good example are iterative optimization algorithms like K-Means [56].

To finish our discussion of ordering effects, we would like to remark that most of the methods proposed to overcome the problems of incremental algorithms relax in some way one of the three hard constraints of Langley's and Domingos' definitions of incremental algorithms. That is, they allow algorithms to input more than one instance at a time, allow limited reprocessing of data, and also allow keeping a few alternative domain models in memory.

3.3 Incremental learning of structured models. A comparison.

In this section we study incremental algorithms for learning structured models, like classifiers and clusterers, in order to see the similarities and differences with respect to learning Bayesian network structures. We focus our attention on the difficulties that arise when learning Bayesian network structures in an incremental fashion.


During the second half of the eighties and the nineties, the incremental learning of classifiers and clusterers, both flat and hierarchical, experienced notable growth. In all these approaches, new incoming data involved only a part of a classifier (i.e. a class or a cluster) or of a decision tree (i.e. a subtree). For example, a Naive Bayes classifier [32] could easily be adapted by just updating the parameters of the class of the new instance. In flat cluster models, algorithms like K-means [56] and UNIMEM [80, 81] incrementally formed the clusters, and when a new instance arrived and was added to a cluster, the center of the cluster was updated. In hierarchical clustering, algorithms like COBWEB [35] learned trees in an incremental fashion by using two-way traversal operators that make it possible to reconsider past decisions in the light of new data: when a new instance was incorporated into the tree, it could cause a branch of the tree to be split or joined to another branch; that is, it affected only the subtree that clustered the new instance. In decision trees, algorithms like ID3 [106] and C4.5 [105] learned trees in a batch fashion; incremental versions of these algorithms (see ID4 [115], ID5R [122], ITI [123] or SITI [60, 61]) updated the part of the tree that classified the new instance and changed those decision nodes that no longer held the best score in the light of new data.

On the contrary, in incremental learning of Bayesian network structures a new incoming data item affects the entire structure. This is due to the different nature of the knowledge represented in clustering/classification and in Bayesian networks: the former is concerned with the relations among data items (in clusters and classes) while the latter is concerned with the relations among variables. As a result, learning Bayesian network structures incrementally is more difficult, in terms of computing time, because the whole structure may need to be updated in the light of a single new data item. Consequently, it is of central importance to find a way to decide whether it is worth triggering the updating process and, when it is triggered, to focus the process on as small a part of the structure as possible.

Another difficulty arises from the constraint that imposes a single scan of the dataset. To achieve this goal, we need to store the information from the dataset that is necessary to evaluate the models of the search space (i.e. the sufficient statistics). In incremental clustering (both flat and hierarchical) and decision trees, data items are usually stored with the structure: in order to avoid scanning the whole dataset, items are stored in the part of the model structure that best fits them, for example in the closest cluster (closeness measured by means of a distance) or in the leaf nodes reached after traversing the cluster hierarchy or the decision tree. Hence, when new data arrive, in order to revise the affected sub-structure, algorithms have all the necessary information at hand and they do not need to scan data stored in other parts of the structure. This solution is not useful in our work for two reasons: firstly, because we cannot assign a data instance to a sub-network, since a single instance is related to the whole network structure; and secondly, because we want to work with potentially infinite data streams, and the memory required would grow as new instances were incorporated into the structure. Hence, since we do not keep data in memory, we need to keep enough information (i.e. the sufficient statistics) to calculate the score functions for the alternative network structures.
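As an illustration of the last point, the following Python sketch accumulates joint counts in a single pass over the data, so that any counter N_ijk needed by a scoring function can later be obtained by marginalization. The class and its representation are our own illustration; note that the table of joint counts itself grows with the number of distinct instantiations seen, so in practice more compact sparse structures are used.

from collections import Counter

class SufficientStatistics:
    def __init__(self):
        self.counts = Counter()     # joint counts of complete instances

    def update(self, instance):
        # instance: a tuple of discrete values, one per variable
        self.counts[tuple(instance)] += 1

    def count(self, assignment):
        # number of seen instances matching a partial assignment,
        # given as a dict {variable index: value}
        return sum(c for inst, c in self.counts.items()
                   if all(inst[i] == v for i, v in assignment.items()))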

3.4 Incremental Bayesian Network Structure Learning

In this section we consider the main contributions on incremental or on-line learning algorithms for Bayesian networks developed so far. The aim of all these algorithms is to modify or evolve an already known structure when new data are available.


However, in the field of learning Bayesian networks there is neither a widely accepted definition of what is considered to be an incremental algorithm, nor of the aim of such algorithms. Friedman and Goldszmidt are the only authors, as far as we know, to give a precise definition of on-line or incremental algorithms in the Bayesian network learning field [38]:

Definition: A Bayesian network learning procedure is incremental if at each iteration l it receives a new data instance u_l and then produces the next hypothesis S_{l+1}. This estimate is then used to perform the required task (i.e. prediction, diagnosis, classification, etc.) on the next instance u_{l+1}, which in turn is used to update the network, and so on. The procedure might generate a new model after some number k of instances are collected.

If we compare this definition with Langley's and Domingos' (see Section 3.1), we can see that it is the same from the viewpoint of the process' behavior. Both definitions require the learning algorithm to process training instances as they become available and to have a domain model ready for performance tasks at each iteration. However, Friedman and Goldszmidt's definition does not impose any restriction on the way the learning algorithm processes data. Thus, an algorithm that reprocessed the whole dataset at each iteration, or that maintained lots of alternative Bayesian networks in memory, would fit this definition.

As far as we know, there are only three proposals of algorithms that revise the network structure of a Bayesian network: namely, Buntine's (1991) [8], Lam and Bacchus' (1994) [72] and Friedman and Goldszmidt's (1997) [38]. The algorithm proposed by Buntine [8] yields a set of alternative and reasonable networks given the dataset, and is able to revise this set of Bayesian networks in the light of new data. Lam and Bacchus [72] proposed an algorithm which is able to revise parts (i.e. a subgraph) of the already learned Bayesian network when new data about a subset of the variables become available. Finally, Friedman and Goldszmidt [38] proposed an algorithm that explores a frontier of possible alternative networks: when new data are presented to the algorithm, the frontier is changed and the best network is selected as a result.

All these algorithms share the same idea. They obtain a network with the data they have seen so far and, when new data are available, the learning algorithm is triggered. Then, the algorithm performs a search in the neighborhood of the current network. We present these three algorithms in chronological order. It is difficult to compare the results obtained by the three proposals, for two reasons: firstly, only Friedman and Goldszmidt present experimental results in their article; and secondly, we think it cannot be said that the aim of each of them is exactly the same.

3.4.1 Buntine's proposal

Buntine [8] proposed an incremental algorithm for learning Bayesian networks. The algorithm, given a dataset and a total ordering of the variables, comes up with different alternative Bayesian networks that are reasonable in terms of the scoring function. We call reasonable those alternative Bayesian networks whose scores are within a factor E of the best found so far, where E is a parameter of the algorithm. Note that, in some way, the parameter E also specifies the number of alternative networks the algorithm can yield. The capability of the algorithm to return a list of alternative Bayesian networks contrasts with the algorithms we have seen so far, which return just a single network.


Buntine first proposed a batch algorithm that uses the Bayesian approach for its scoring functions. Afterwards, he proposed some guidelines for converting it into an incremental or on-line algorithm. The batch algorithm performs a beam search, keeping those network structures whose posteriors are within a given factor C of the best found so far. It can be seen as a generalization of K2, since when the factor C is set to 1 the algorithm actually performs the same greedy search proposed by Cooper and Herskovits [23]. Furthermore, the incremental version of the algorithm can also be seen as another generalization. Here, we follow Buntine [8] in stating first the batch version of the algorithm and afterwards the incremental one.

The batch version of the algorithm

The algorithm needs a total ordering of the variables as prior knowledge from the experts. The variables coming first in the ordering are supposed to influence the others. We also need a compact structure in order to store the alternative Bayesian networks in memory. For each variable Xi we keep a set of reasonable alternative parent sets Πi according to some criterion of reasonableness. For the variable Xi, the alternative parent sets Πi = {Pai1, . . . , Paim} will be a collection of subsets of {Y : Y ≺ Xi}. We also have to store the network parameters θijk for each possible parent set. The space of alternative networks is then given by the Cartesian product across the sets of parent sets for each variable, ⊗_{i=1}^{n} Πi. Buntine calls this structure, both the set of parent sets and the network parameters, a combined Bayesian network.

In order to access all alternative parent sets Pai ∈ Πi efficiently, they are stored in a lattice structure where subset and superset parent sets are linked together in a web, denoted the parent lattice for Xi. Since the full set of lattices is of potentially exponential size in the number of variables n, only those parent sets with significant posterior probabilities are stored and linked. The parent lattice Πi for the node Xi is defined as follows. The root node is the empty set and the leaves are the sets Pai which have no supersets contained in Πi. For example, the lattice Πi = {{a}, {a, b}, {a, c}, {a, d}} has the root {a} and the leaves {a, b}, {a, c} and {a, d}. The number of leaves can be reduced by adding the parent sets {a, b, c}, {a, c, d} and {a, c, d, e}, resulting in a lattice with the single leaf {a, c, d, e}.

The algorithm has three parameters 1 > C > D > E which, in some way, specify the sort of search that will be performed. When the parameters are close to 1, the search becomes greedy, since the number of alternative networks is reduced; on the contrary, when the parameters are close to 0, the search is a beam search. The closer to zero the parameters are, the more beams the algorithm considers during the search process. According to the parameters, the algorithm classifies the parent sets as alive, asleep or dead. The parent sets which finally take part in the combined network are those whose posteriors are within a factor C of the best found so far. The alive parent sets represent the set of reasonable alternatives having posteriors within a factor D, and they are the beams searched by the algorithm. Dead parent sets exist in the lattice as dead-end markers in the search space: they have been explored and forever determined to be unreasonable alternatives, and are not to be further explored. Asleep parent sets are similar, but are only considered unreasonable for now and may be made alive later on. Furthermore, nodes can be either open or closed, depending on whether they require further expansion during the search or not.


Algorithm 14 Buntine's
Require: a database D on X = {X1, · · · , Xn} variables, an order ≺ among variables, parameters C, D and E
Ensure: a combined Bayesian network
  Calculate Suff_D
  for i = 1, . . . , n do
    Best-posterior = P(Xi | Pa_i = ∅, D, ≺)
    Open-list = {∅} // list of parent sets to be further expanded
    Alive-list = ∅ // list of parent sets that are considered alive
    repeat
      Take the top parent set Pa_i from the Open-list
      if P(Xi | Pa_i, D, ≺) < E · Best-posterior then
        Mark Pa_i as dead
      else
        if P(Xi | Pa_i, D, ≺) > D · Best-posterior then
          Generate all children CH_{Pa_i} and calculate their posteriors
          Call MarkChildren(CH_{Pa_i})
        else
          Ignore Pa_i // it is an asleep parent set
        end if
      end if
    until Open-list = ∅
  end for

Algorithm 15 MarkChildren
Require: a database D on X = {X1, · · · , Xn} variables, parameters C, D and E, a set of children CH_{Pa_i}
Ensure: marked children
  for each ch ∈ CH_{Pa_i} do
    if P(Xi | ch, D, ≺) > Best-posterior then
      Best-posterior = P(Xi | ch, D, ≺)
      Modify the Alive-list to reflect the new maximum
    else if P(Xi | ch, D, ≺) < E · Best-posterior then
      Mark ch as dead
    else if P(Xi | ch, D, ≺) > D · Best-posterior then
      Add ch to Open-list
    else if P(Xi | ch, D, ≺) > C · Best-posterior then
      Mark ch as alive and add it to Alive-list
    else
      Mark ch as asleep
    end if
  end for
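The classification of parent sets by the factors 1 > C > D > E can be summarized with the small Python sketch below. It follows one consistent reading of the prose description above (within C of the best: part of the combined network; within D: alive; within E: asleep; below E: dead), and is our own illustration rather than Buntine's code.

def status(p, best, C, D, E):
    # p: posterior of a candidate parent set; best: best posterior so far
    if p >= C * best:
        return 'combined'   # takes part in the combined network
    if p >= D * best:
        return 'alive'      # reasonable alternative, a beam to search
    if p >= E * best:
        return 'asleep'     # unreasonable for now, may wake up later
    return 'dead'           # unreasonable forever, a dead-end marker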


We want to remark here that when the parameters C, D and E are set to 1, this algorithm reduces to the K2 algorithm. Note that in such a situation only one network structure is kept in the open-list and all the other lists are empty. Thus, we can think of this algorithm as a generalization of the K2 algorithm. However, recall that Buntine's algorithm yields a combined Bayesian network rather than a single network.

The incremental version of the algorithm

Buntine proposes an incremental algorithm as an extension of the one seen above. He considers the situation where new data instances are available and have to be processed by the algorithm in order to update the combined network. Buntine describes two different situations depending on the time available to update the combined network: one where the algorithm can spend very little time updating, and another where more time is available. In the first case, a rapid update of the combined network is required and there is not enough time to update the parent structure; thus, the algorithm only updates the posterior probabilities of the parent lattices. In the second case, given additional time, both structure and posterior updates are performed.

In order to update the posteriors of the combined network, we need to store the posterior probabilities and the counters N_ijk for each alternative parent set, so that they can be updated when new information is available. Buntine's algorithm stores the counters only for those parent sets marked as alive, since it discards exploring parent sets marked as dead. When a batch of new examples and additional time is available, a process reproducing the results of the algorithm can be run incrementally. For each variable Xi of the already discovered combined Bayesian network the following must be done:

1. Update the posterior probabilities of all alive sets of the lattice.
2. Calculate the new Best-posterior.
3. Expand nodes from the Open-list and continue with the search.

It may happen that some parent sets oscillate on and off the Alive-list and Open-list, because the posterior ordering of the parent sets oscillates as the training examples are taken into account. This effect can easily be prevented by making a differential on C and D between placing a node on a list and taking a node off. In fact, the incremental part of the algorithm explores the neighborhood of the current combined Bayesian network: it tests whether adding a variable to the parent set improves the score of the network.

3.4.2 Lam and Bacchus' proposal

Lam and Bacchus [72] proposed an extension of their batch algorithm [71] so that it could revise the Bayesian network structure incrementally as new data become available. In fact, their proposal is a general one, as it is not coupled to their batch algorithm. The refinement they propose needs a Bayesian network to be updated and new data instances. The refinement is done under the implicit assumption that the existing network is already a fairly accurate model of the database.


Under this assumption, the new refined network structure should be similar to the existing one. Their approach is based on the MDL measure. They show that if we improve the description length DL of a subgraph by changing its topology, we improve the description length of the complete graph, provided that no cycles are introduced.

Theorem 3.1 (Lam and Bacchus [71]) Let BSp = (Np, Ap) and B'Sp = (Np, A'p) be respectively two subgraphs of BS = (N, A) and B'S = (N, A'), where N is the set of nodes and A is the set of arcs, and where Np ⊆ N, Ap ⊆ A and A'p ⊆ A'. The following holds:

DL(B'_{Sp}) < DL(B_{Sp}) \;\Rightarrow\; DL(B'_S) < DL(B_S)

Using this theorem they developed an algorithm that improves the Bayesian network by improving parts of it. The algorithm first learns a new partial structure from the new data and the existing network, using an extension of the minimum description length (MDL) measure, and then locally modifies the old global structure using the newly discovered partial structure. New data are presented as a table of examples where possibly only a subset of the variables present in the Bayesian network is available.

Learning the partial structure

Recall from Section 2.1 that the MDL principle states that the best model of a database is the model that minimizes the sum of the length of encoding the model and the length of encoding the database given the model. For the refinement problem, the source data consist of two components, the new data and the existing network structure. Thus, we must find a partial network BSp that minimizes the sum of the lengths of the encodings of:

1. The partial network BSp
2. The new data given the network BSp
3. The existing network given the network BSp

Note that the sum of the last two items corresponds to the description length of the source data given the model. We are assuming that these two items are independent of each other given BSp, and thus they can be evaluated separately. In order to calculate the encoding lengths of the first two items we resort to the equations used in Section 2.1.1. To calculate the length of the encoding of the third item we need to compute the description of the complete existing network BS given the partial one BSp. To recover BS from BSp we only need to describe the differences between BS and BSp:

• a listing of reversed arcs, that is, those arcs in BSp that are also in BS but with opposite direction
• the additional arcs of BS, that is, those arcs in BS that are not in BSp
• the missing arcs of BS, that is, those arcs in BSp that are missing in BS

A simple way to encode an arc is to describe the source node and the destination node. If we have n nodes we need log n bits to identify one.


Therefore we need 2 log n bits to describe an arc. Let r, a and m be respectively the number of reversed, additional and missing arcs in BS with respect to BSp. The description length of BS given BSp is then

(r + a + m) \cdot 2 \log n \qquad (3.1)

This description length can be computed at each node independently of the others. Remark that each arc can be uniquely assigned to its destination node. For a node Xi in BS, let ri, ai and mi be its number of reversed, additional and missing arcs given BSp. If the structure BS is defined over the set X of nodes and the partial structure BSp is defined over the set Xp ⊆ X of nodes, Equation 3.1 can easily be rewritten as

\sum_{X_i \in X} (r_i + a_i + m_i) \cdot 2 \log n \qquad (3.2)

If Xq = X \setminus Xp, then the sum in the previous equation can be expressed as

\sum_{X_i \in X_p} (r_i + a_i + m_i) \cdot 2 \log n + \sum_{X_i \in X_q} (r_i + a_i + m_i) \cdot 2 \log n \qquad (3.3)
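The structural term of this encoding is straightforward to compute. The sketch below assumes each structure is given as a dict mapping a node to its parent set, and uses log2 so that the result is in bits; the counting of reversed, additional and missing arcs follows the listing given above.

import math

def structure_dl(bs, bsp, n):
    # description length of bs given bsp: (r + a + m) * 2 log2 n
    arcs = {(p, x) for x, ps in bs.items() for p in ps}
    arcs_p = {(p, x) for x, ps in bsp.items() for p in ps}
    # reversed: arcs of bsp that appear in bs with opposite direction
    r = sum(1 for (p, x) in arcs_p if (x, p) in arcs)
    # additional: arcs of bs absent from bsp in either direction
    a = sum(1 for (p, x) in arcs
            if (p, x) not in arcs_p and (x, p) not in arcs_p)
    # missing: arcs of bsp absent from bs in either direction
    m = sum(1 for (p, x) in arcs_p
            if (p, x) not in arcs and (x, p) not in arcs)
    return (r + a + m) * 2 * math.log2(n)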

The second sum in Equation 3.3 specifies the description lengths of the nodes in Xq, which are not present in BSp. Thus, the corresponding ri's and mi's are zero and the ai's are not affected by the BSp structure. As we are using the measure to compare different partial structures BSp, the second part of the sum in Equation 3.3 is constant over them. Therefore, we only need to compute the first part. Finally, in order to learn the local structure we can use the batch algorithm proposed by Lam and Bacchus [71], or any other algorithm, with the following scoring function for each node of the partial structure:

DL_i = |\mathrm{Pa}_i| \log n + \sum_{X_j \in \mathrm{Pa}_i} I(X_i; X_j) + (r_i + m_i + a_i) \cdot 2 \log n \qquad (3.4)

where the first term corresponds to the encoding length of the partial structure, the second corresponds to the encoding length of the new data given the partial structure, and the last corresponds to the encoding length of the old structure given the new one. With the third term of the equation, Lam and Bacchus avoided using the sufficient statistics of the old data: in some way, they incorporated the information of the old dataset by biasing the search towards networks similar to the old one.

Modifying the old global structure

Once we have a new partial structure we need to modify the existing one. Suppose the existing network structure is BS, and the learned partial structure is BSp. The objective of the refinement process is to obtain a refined structure of lower total description length with the aid of the existing structure BS and the partial structure BSp. Say we have a node Xi; its parent set in BSp is Pai(BSp), and its description length in BSp is DLi (Equation 3.4). In the existing network BS, however, Xi will in general have a different set of parents Pai(BS) and a different description length. If Pai(BS) ⊄ Xp, then these two description lengths are incomparable.


In this case Xi has a parent in BS that does not appear in the new data; hence the new data cannot tell us anything about the effect of that parent on Xi's description length. We identify all of the nodes Xi whose parents in BS are also in BSp and call these the set of marked nodes.

Suppose that for a certain marked node Xi we decide to substitute the parents of Xi in BS with the parents of Xi in BSp, obtaining a new structure BS1. Usually the total description length of BS1 can be calculated simply by adding to the total description length of the old structure BS the difference between the local description lengths of Xi in BS and BSp. The new total description length of BS1 can be evaluated in this way if the substitution of the parents of Xi in BS does not affect the local description lengths of any other node in BS. In fact, the only situation where this condition fails is when the parents of Xi in BS contain a reversed arc (as compared to BSp). Under this circumstance, we need to consider the node Xr associated with this reversed arc. If Xr is also a marked node, we need to re-evaluate its local description length, since it will be affected by the substitution of Xi's parents. Recursively, we must detect any other marked nodes that are, in turn, affected by the change in Xr's description length. It can easily be observed that all these affected nodes must be connected. As a result, we can identify a marked subgraph unit that contains only marked nodes and which can be considered as a unit when the replacement is performed. So, we need to identify all marked subgraph units in BSp.

Thus, the refinement problem is now reduced to choosing appropriate subgraphs for which we should perform parent substitution in order to achieve a refined structure of lowest total description length. Note that a useful property of the subgraphs is that the change in description length of each subgraph is independent of all other subgraphs. Although each subgraph substitution yields an independent reduction in description length, these substitutions cannot be performed independently, as cycles may arise. Lam and Bacchus use best-first search to find the set of subgraph units that yields the best reduction in description length without generating any cycles. To assist the search task, they construct a list S = {S1, S2, . . . , St} by ranking all subgraphs in order of the benefit gained if parent substitution were performed using that subgraph. They iteratively try substituting the subgraphs into the general structure until no more substitutions can be performed or there are no more computing resources available.

3.4.3 Friedman and Goldszmidt's proposal

Friedman and Goldszmidt [38] proposed three different approaches to sequentially (incrementally) learn Bayesian networks. They claim that effective sequential update of structure involves a tradeoff between the quality of the learned network and the amount of information that is maintained about past observations. The three approaches they proposed manage this tradeoff differently. Two of them lie at the extremes of the spectrum, while the third allows for a flexible manipulation of the tradeoff. On one extreme we have the naive approach, which stores all previously seen data and repeatedly invokes a batch learning procedure after each new example is recorded. This approach can use all of the information provided so far, and thus is essentially optimal in terms of the quality of the networks it can induce. It requires, however, vast amounts of memory to store the entire corpus of data and vast amounts of time to perform the search from scratch every time new data arrive. On the other extreme we have the second approach, maximum a posteriori probability (MAP), which avoids the overhead of storing all of the previously seen data instances by summarizing them using the model seen so far. This approach is similar to that of Lam and Bacchus (see Section 3.4.2) in the sense that both use one single network structure as a summary of past data, which makes them space efficient. They argue that by using the current model as a summary of past data, we strongly bias the learning procedure towards this model. As a result, after some number of iterations, this approach locks itself into a particular model and stops adapting to new data. The third approach, which they call incremental, provides a middle ground between the extremes defined by the naive and MAP approaches. Moreover, it allows flexible choices in the tradeoff between space and quality of the induced network. The incremental approach interleaves steps in a search process, to find "good" models, with the incorporation of new data. This approach focuses its resources on keeping track of just enough information to make the next decision in the search process. The basic strategy is to maintain a set of network candidates that they call the frontier of the search process. As each new data example arrives, the procedure updates the information stored in memory, and invokes the search process to check whether one of the networks in the frontier is deemed more suitable than the current model or not. We want to stress that this approach has some similarities with that of Buntine (see Section 3.4.1) with the open, alive and asleep lists, in the sense that both maintain a set of candidate networks.

Sequential update of Bayesian networks: MAP

The naive approach to sequential update consists in storing all the observed data, and then repeatedly invoking a batch learning process. It needs to store either all of the instances that have been observed, or a count of the number of times each distinct instantiation of all the variables in the dataset has been observed. This representation grows linearly with the number of examples observed, and will become infeasible when the network is expected to perform for long periods of time. The MAP approach is motivated by Bayesian learning methodology. Recall that in Bayesian analysis we start with a prior probability over possible hypotheses (models and their quantifications), and compute the posterior given our observations. In principle, we can then treat this posterior as our prior for the next iteration in the sequential process. Thus, we maintain our belief state about the possible hypotheses after observing D^{l-1}. Upon receiving the l-th data example, we compute the posterior as our current belief state. This methodology has the attractive property that, under some reasonable assumptions, the belief state at time l is the same as the posterior after seeing D^l from our initial prior belief state. If we attempt to use priors in order to represent (and update) the posterior, we need to store a complete network. Unfortunately, this is equivalent to storing the counts for all possible assignments to X. Since we cannot maintain the exact posterior, we can resort to the following approximation. At each step, we find (or approximate) the maximum a posteriori probability (MAP) network candidate, that is, the candidate that is considered most probable given the data so far. We then approximate the posterior in the next iteration by using the MAP network as the prior network. In other words, this procedure uses the network S_l as a summary of the first l observations. This procedure is space efficient since we only need to store the new instances that have been observed since we last performed the update of the MAP.
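A minimal sketch of this MAP loop follows, assuming two hypothetical helpers that are not part of Friedman and Goldszmidt's work: `learn_map_network(data, prior)`, a batch learner that returns the MAP network given a prior network, and `as_prior(network)`, which packages a learned network as the prior for the next round.

    def map_sequential_update(stream, initial_prior, batch_size=100):
        """Sequential MAP update: the current network summarises all past
        data, so only the instances seen since the last update are kept."""
        prior, buffer = initial_prior, []
        for instance in stream:
            buffer.append(instance)
            if len(buffer) == batch_size:
                network = learn_map_network(data=buffer, prior=prior)
                prior = as_prior(network)  # past data now live only in the prior
                buffer = []                # old instances can be discarded
        return prior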
Unfortunately, by using the MAP model as the prior for the next iteration of learning, we are losing information and are strongly biasing the learning process toward the MAP model itself. This phenomenon becomes more pronounced as the equivalent sample size assigned to the prior grows. In order to overcome the problems found in the former approaches, Friedman and Goldszmidt [38] proposed a new algorithm they called incremental. Unlike the naive approach, it does not keep all past data examples, and unlike the MAP approach, it does not rely on a single network to represent the prior information. The basic component of this algorithm is a module that maintains a set ST of sufficient statistics records. These records allow the update procedure to select among a set of possible networks for the update. Before explaining the approach in detail we introduce some necessary notation. Let Suff(S) denote the set of sufficient statistics for S, that is, Suff(S) = {N_{X_i, Pa_i(S)} : 1 ≤ i ≤ n}. Similarly, given a set ST of sufficient statistics records, let Nets(ST) be the set of network structures that can be evaluated using the records in ST, that is, Nets(ST) = {S : Suff(S) ⊆ ST}. Suppose that we are deliberating about the choice between two structures S and S'. As we know from the previous sections, in order to use the MDL and Bayesian based measures to evaluate S and S', we need to maintain both the sets Suff(S) and Suff(S'). Now suppose that S and S' differ only in one arc from X_m to X_i. Then there is a large overlap between Suff(S) and Suff(S'); namely, Suff(S) ∪ Suff(S') = Suff(S) ∪ {N_{X_i, Pa_i(S')}}. Thus, we can easily keep track of both these structures by maintaining a slightly larger set of statistics. To see how this generalizes to larger sets that cover a considerable subset of the search space, recall that the greedy hill-climbing search procedure works by comparing its current candidate S to all its neighbors. These neighbors are the networks that are one change away (i.e. arc addition, deletion or reversal) from S. Extending the argument above, we see that we can evaluate the set of neighbors of S by maintaining a bounded set of sufficient statistics. Note that if ST consists of all the sufficient statistics for S and its neighbors, Nets(ST) contains additional networks, including many networks that add several arcs in distinct families of S. Also note that if X ⊂ Y, then N_X can be recovered from N_Y. Thus, Nets(ST) also contains many networks that are simpler than S. Generalizing this discussion, this approach applies to any search procedure that can define a search frontier. This frontier consists of all the networks it compares in the next iteration. We use F to denote this set of networks. The choice of F determines which sufficient statistics are maintained in memory. That is, we set ST to contain all the sufficient statistics needed to evaluate the networks in F. After a new instance is received (or, in general, after some number of new instances are received), the procedure uses the sufficient statistics in ST to evaluate and select the best scoring network in the frontier F. Once this choice is made, it invokes the search procedure to determine the next frontier, and updates ST accordingly. This process may start recording new information and may also remove some sufficient statistics from memory. The main loop of the incremental procedure can now be described as shown in Algorithm 16. This procedure focuses its resources on keeping track of just enough information to make the next decision in the search process. Every k steps, the procedure makes this decision.
After each such decision is made, the procedure reallocates its resources in preparation for the next iteration. This reallocation may involve removing some sufficient statistics from ST and adding new ones. When we instantiate this procedure with greedy hill-climbing, the frontier consists of all the neighbors of S_l. A beam search, on the other hand, can maintain j candidates and set the frontier to be all the neighbors of all j candidates. Other search procedures might explore only some of the neighbors of S_l and thus would have smaller search frontiers.


Algorithm 16 FG
  Let S be the initial network
  Let F be the initial search frontier for S
  Let ST = Suff(S) ∪ ∪_{S' ∈ F} Suff(S')
  Forever
    Read data instance u_l
    Update each record in ST using u_l
    if l mod k = 0 then
      Let S = arg max_{S' ∈ Nets(ST)} Quality(S' | ST)
      Update the frontier F (using a search procedure)
      Set ST to Suff(S) ∪ ∪_{S' ∈ F} Suff(S')
    endif
    Compute optimal parameters θ for S from ST
    Output(S, θ)
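The bookkeeping of Algorithm 16 can be paraphrased in Python as below. Every name here (`frontier`, `stats_needed`, `nets`, `score`, `update_counts`) is a hypothetical helper standing in for the corresponding step of the algorithm; this is a sketch of the resource reallocation, not Friedman and Goldszmidt's implementation.

    def incremental_fg(stream, s0, k, frontier, stats_needed, nets, score,
                       update_counts):
        """Frontier-based incremental learning: keep only the sufficient
        statistics needed to score the current model and its frontier."""
        current = s0
        suff = {key: {} for key in stats_needed(current, frontier(current))}
        for l, instance in enumerate(stream, start=1):
            update_counts(suff, instance)          # every record sees every case
            if l % k == 0:
                # best network evaluable from the stored statistics
                current = max(nets(suff), key=lambda s: score(s, suff))
                needed = stats_needed(current, frontier(current))
                for key in list(suff):             # reallocate resources:
                    if key not in needed:
                        del suff[key]              # drop out-of-frontier records
                for key in needed:
                    suff.setdefault(key, {})       # start recording new ones
                yield current

Note that statistics added after an update start counting from that point on, so networks end up being scored on different amounts of data; making those scores comparable is exactly the problem addressed next.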

Sequential scoring functions

Friedman and Goldszmidt's approach collects different sufficient statistics at different moments of the learning process. Thus, they need to compare Bayesian networks with respect to different datasets. The underlying problem is a general model selection problem, where we have to compare two models M1 and M2 such that model M1 is evaluated with respect to the training set D1, while model M2 is evaluated with respect to the training set D2. Of course, for this problem to be meaningful, we assume that D1 and D2 are both sampled from the same underlying distribution. The MDL and Bayesian scores are inappropriate for this problem in their current form. The MDL score measures the number of bits required to encode the training data if we assume that the underlying distribution has the form specified by the model. However, if D2 is much smaller than D1, then the description of D2 will usually be shorter than that of D1 regardless of how good the model M2 is. The same problem occurs with the Bayesian score. This score evaluates the probability of the dataset if we assume that the underlying distribution has the form specified by the model. Again, if D2 is much smaller than D1, then the probability associated with it will usually be larger, since the probability of a dataset is a product of the probabilities of each instance given the previous ones. Since each such term is usually smaller than 1, the probability decreases for longer sequences. In order to overcome this problem Friedman and Goldszmidt proposed an averaged MDL measure,

    S*_MDL(G|D) = S_MDL(G|D) / N,

where N is the number of instances in the dataset. This score measures the average encoding length per instance. To see that this averaged measure is theoretically sound they propose the following lemma:

Lemma 3.1 Let G1 and G2 be two network structures that are evaluated with respect to datasets D1 and D2 of sizes N1 and N2 respectively, that are sampled identically from an underlying distribution P. If D_KL(G1, P) < D_KL(G2, P), then as both N1 and N2 go to infinity, S*_MDL(G1|D1) < S*_MDL(G2|D2) with probability 1.

They also claim that a similar result can be obtained for the BDe measure.
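In code the fix is a one-line normalisation. Here `mdl_score` is a hypothetical total-MDL function, and the comparison is only meaningful under the lemma's assumption that both datasets come from the same distribution.

    def averaged_mdl(model, data, mdl_score):
        """Bits per instance rather than total bits, so that models scored
        on datasets of different sizes can be compared."""
        return mdl_score(model, data) / len(data)

    # Raw MDL would favour m2 merely because d2 (say, 500 cases) is much
    # smaller than d1 (say, 10000 cases); the averaged score removes that bias:
    # best, _ = min([(m1, d1), (m2, d2)],
    #               key=lambda md: averaged_mdl(md[0], md[1], mdl_score))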

3.4.4 Comments on the incremental proposals

Here we comment on the three incremental learning algorithms discussed so far in the field of Bayesian network learning. We compare the three reviewed incremental algorithms against the definitions of Langley and Domingos (Definition 3.1 and Definition 3.2). We first want to note that the three incremental algorithms follow the definition of incremental learning in the sense that all of them are able to modify the network structure when new data are available. In spite of this, we think that Buntine's (iBun from now on) and Friedman and Goldszmidt's (iFG) proposals are quite similar to each other, while Lam and Bacchus's (iLB) takes another approach to incremental learning. Both the iBun and iFG algorithms use the new data items to update the sufficient statistics and thereby the posterior probabilities. With these updated posteriors, the algorithms perform additional search over the space of alternative Bayesian networks. The iLB algorithm uses new data in another way. The algorithm first learns a new graph, possibly a subgraph of the old one, considering the new data and the old structure, and then updates the old structure using the new one. In their article they speak of refinement rather than of incremental learning, even though they do not give any precise definition of refinement. The fact that the algorithm learns a new graph from the new data instances forces it to wait until a large amount of new data is available. Thus, we would not say that the iLB algorithm is incremental in the sense proposed in this work since, unlike the other incremental algorithms, it cannot decide to carry out refinement with one or a few data instances. It also seems to us that the way their algorithm modifies the network structure once it has obtained the partial network is rather arbitrary. It only adds or reverses an arc when doing so does not introduce any cycle into the whole network structure, even if the new data clearly show that there is a reversed arc. Since reversing such an arc would also mean changing the part of the network structure about which we have no further data, they decided not to add or reverse the arc. Both the iBun and iFG approaches explore different possible degrees of incrementality with respect to the time and memory restrictions introduced by Langley. Friedman and Goldszmidt's naive approach uses a batch algorithm in the online environment, processing old data instances together with new ones; their MAP approach keeps one single structure in memory, which is used together with the new data instances to learn the new structure; and finally their incremental approach keeps in memory a reduced set of significant Bayesian networks together with their associated sufficient statistics. The same sort of exploration was also done by Buntine. He proposed a beam search which kept several domain models in memory using the alive and asleep lists. Buntine provided his algorithm (iBun) with three parameters to control the amount of memory spent keeping these domain models and thus the time spent performing the search. He also kept the sufficient statistics associated with the structures in the alive and asleep lists. From our viewpoint, both the iBun and iFG proposals follow the same underlying idea. They keep in memory a set of promising networks that are evaluated in the light of new data instances. When new data are presented and the incremental revision is triggered, a search among the structures in the set is performed.
Both proposals need to maintain the set of promising structures for two reasons: firstly, they assume that all sufficient statistics from the dataset cannot be stored because there are too many of them; and secondly, they want to reduce the space of structures to be searched. Buntine says that this sort of search works well because the posterior probabilities of alternative structures tend to vary exponentially as structures change, and high-posterior structures tend to clump together.


The iBun and iFG proposals differ in the definition of the set of promising structures and in the way this set is explored. On the one hand, the iBun algorithm searches among those structures that can be obtained by adding one or more arcs to the ones in the alive list; that is, it performs a sort of backtracking by keeping several promising structures in a list. On the other hand, the iFG algorithm searches among those structures that can be obtained by adding, deleting or reversing an arc from the current one; that is, it performs a sort of backtracking by having two-way operators, namely operators to add new arcs, and operators to reverse and remove arcs. Yet another difference is that the iBun algorithm will no longer consider those structures that were previously deemed non-promising (the dead ones), while the iFG algorithm may consider such structures (the ones that were out of the frontier). This fact introduces the problem of comparing models evaluated with respect to different datasets. As a conclusion, we can say that both the iBun and iFG proposals follow the definitions (Definition 3.1 and Definition 3.2) of incremental algorithms even though they relax their constraints. On the contrary, the iLB proposal does not follow the spirit of the definition of incremental learning used in this work. We also want to comment here on the proposal of Hulten et al. [54] for learning Bayesian networks from large databases. Their proposal, a system called VFBN, learns structures using the minimum number of data items that guarantees a certain quality of the structures. They claim that their system can also be used to incrementally learn Bayesian network structures. They perform the search of the parents for each variable independently, just taking care not to introduce cycles, and this allows them to store a great amount of sufficient statistics in secondary memory (i.e. disk); in this way they do not need any sort of frontier to restrict the search.

Chapter 4

A New Approach to Incremental Learning of Bayesian Network Structures

In this chapter we present a novel approach to learning, in an incremental fashion, the structure of Bayesian networks. We tackle two main problems: firstly, the need to detect when it is necessary to update the current network structure and to restrict the search to the most promising structures; secondly, the need to calculate and store the sufficient statistics required for learning Bayesian networks while, at the same time, avoiding spending time and memory on those sufficient statistics which will probably not be used during the learning process. The main objective of our incremental proposal is to save computing time while yielding network structures similar to those obtained with batch algorithms. We concentrate all our attention on the incremental learning of Bayesian network structures and do not study the incremental learning of parameters. Although having accurate parameters is important, they are completely useless if the structure is of bad quality [31]. There are some papers in the Bayesian network literature that deal with incremental learning of parameters. The most important work may be the one by Spiegelhalter and Lauritzen [118], who proposed three different models: discretization of parameters, Dirichlet distributions, and Gaussian distributions. Their work was later used in the aHUGIN system [96]. Another relevant work is that of Díez [26], who assumes that the parameter distribution is given by a product of Gaussian distributions. We organize this chapter as follows. In Section 4.1, we propose two different heuristics to extend the original batch hill-climbing algorithms so that they are able to update the current learned Bayesian network structure in an incremental fashion. The first one establishes whether it is worth updating the current structure in the light of new data, and if it is, it guesses which part of the structure is still valid and which part should be revised. The second heuristic excludes from the search space those structures which were found to have a low posterior (score) in the previous search step and, consequently, would probably not be chosen in the current one. Both heuristics need not only the model learned in the previous step but also additional information. The first heuristic needs the former learning (search) path to check whether it would be equivalent to consider both new and old data, while the second heuristic needs the set of the most probable parents for each variable. In Section 4.2, we apply our approach to the algorithms CL, K2, B and HCMC discussed in Chapter 2.


We believe that these algorithms are among the most representative in the Bayesian network literature and also cover a wide range of hill-climbing search algorithms. CL explores the space of trees with Kruskal's algorithm, which yields the optimal tree. K2 explores the space of DAGs, restricted by means of an order among the variables, while algorithm B does not restrict the search space. Both K2 and B use the addition of an arc as search operator in a one-lookahead manner. Algorithm HCMC is the most general one, since it explores the space of DAGs using the addition, reversal, and deletion of an arc as search operators, and it can also apply more than one traversal operator at each step (i.e. multiple-lookahead). In Section 4.3, we use AD-tree structures, introduced by Moore and Lee (1998) [92], to calculate and store the sufficient statistics. We adapt this sort of structure to work in an incremental environment and we propose a new heuristic to avoid storing sufficient statistics that account for Bayesian networks of low quality. Finally, in Section 4.4 we will see empirically that our incremental approach considerably reduces the time spent in learning network structures when new data are available while, at the same time, the structures obtained are almost the same as those learned with the batch version of the algorithm. We will also see that the networks obtained with our incremental algorithms do not depend on the order in which the data instances are presented.

4.1 Heuristics for incremental learning

Our heuristics are based on the assumption that all data in the stream are sampled from the same probability distribution; that is, our system does not need to handle concept drift, since the underlying probability distribution of the domain does not change over time. This assumption lets us expect that the more data are available, the better the information we have about the underlying probability distribution, and hence the better the learned structure will be, since the peaking phenomenon is reduced. We claim that these two heuristics are general and hence can be applied to any batch algorithm that uses greedy hill-climbing search, in order to incrementally learn models from data, as long as the domain in which the search is performed fulfills, in some way, the conditions introduced in Section 4.1.2.

4.1.1 Batch hill-climbing search

Before introducing the conditions and the heuristics we revise the hill-climbing search algorithm, HCS from now on, in order to introduce the notation that we will use in the following sections. The idea of HCS is to generate a model in a step-by-step fashion by making the maximum possible improvement in an objective quality function at each step. In order to use an HCS algorithm for a given problem we need to define the following elements:

• An objective function S(M, D) to measure the quality of a given model M with respect to a dataset D.

• A set of traversing operators OP = {op_1, ..., op_k} that, given an argument A and a model M, obtain a new model M' = op_i(M, A). For example, in the field of Bayesian networks an operator op_1 could be "add arc" and an argument A could be the arc that is added.

• A domain D of legal models. For example, we may restrict the search to the space of trees or to the space of DAGs.


Notice that a traversing operator op_i will be used only if it produces a model within the domain D. This fact leads us to define the concept of neighborhood by means of a traversing operator set.

Definition 4.1 (Neighborhood) The neighborhood N(M) of a given model M is the set of all the alternative models that belong to the domain D and that can be built by using a single operator,

    N(M) = { M' | M' = op_i(M, A) ∧ M' ∈ D }

In a similar way, we can also define the set of legal traversing operator and argument pairs (op_i, A_i) with which a neighborhood N is obtained.

Definition 4.2 (Set of building operators) Let OpA_N(M) be the set of operator and argument pairs that, if applied to a given model M, obtain a new model M' that belongs to the domain D,

    OpA_N(M) = { (op_i, A_i) | op_i(M, A_i) ∈ D }

Now we are able to describe HCS formally, see Algorithm 17. The algorithm begins with an initial model M_0 (i.e. the empty model) and iteratively constructs a sequence of models M_i, i = 0, ..., n, where each model M_i is the one with the highest score among the models in the neighborhood of M_{i-1},

    M_i = arg max_{M ∈ N(M_{i-1})} S(M, D).

HCS stops when no further improvement can be achieved, that is, when the current model has the highest score of the ones in its neighborhood. Note that we defined the neighborhood N of a model M by applying a single operator. This sort of HCS is usually called one-lookahead; that is, the algorithm only considers models that are one step away from the current one. Such algorithms easily get stuck in local maxima since they are not able to escape from them with one single step. In order to escape from local maxima, neighborhoods are sometimes defined using more than one operator; see for example the HCMC algorithm in Section 2.4.4.

Algorithm 17 Hill-Climbing Search
Require: a domain of models D, an initial model M_0, a set of operators OP = {op_1, ..., op_k}, a database D, a scoring function S(M, D)
Ensure: M is a model of high quality within the domain D
  i = 0
  repeat
    oldScore = S(M_i, D)
    i = i + 1
    M_i = op_i^{k_i}(M_{i-1}, A_i)  where  (op_i^{k_i}, A_i) = arg max_{(op, A) ∈ OpA_N(M_{i-1})} S(op(M_{i-1}, A), D)
  until oldScore ≥ S(M_i, D)
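A compact Python rendering of Algorithm 17 is sketched below; it also records the search path, because the TOCO heuristic of Section 4.1.3 will need it. Here `legal_moves(model)` plays the role of OpA_N(M) and returns (operator, argument) pairs, where each operator is a function op(model, arg), and `score` is S(M, D); all three names are assumptions of this sketch, not part of the algorithms above.

    def hill_climbing(model, legal_moves, score, data):
        """Greedy one-lookahead hill climbing; returns the final model and
        the learning path O_op = [(op_1, A_1), ..., (op_n, A_n)]."""
        path = []
        best = score(model, data)
        while True:
            candidates = list(legal_moves(model))
            if not candidates:
                return model, path
            op, arg = max(candidates,
                          key=lambda oa: score(oa[0](model, oa[1]), data))
            new_model = op(model, arg)
            new_score = score(new_model, data)
            if new_score <= best:       # no neighbour improves: local maximum
                return model, path
            model, best = new_model, new_score
            path.append((op, arg))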

Usually, one is interested in the final model M yielded by HCS and does not care about the intermediate ones. However, we are interested in studying the whole search path (i.e. the sequence of intermediate models) because we will use it in our incremental approach. Now, let us formally define the concept of search path, which we will also call learning path, that HCS follows in order to reach the final model M.


Definition 4.3 (Search path) Let M_0 be an initial model, and let M be the final model obtained by a hill-climbing search algorithm as

    M = op_n^{k_n}( op_{n-1}^{k_{n-1}}( ... op_2^{k_2}( op_1^{k_1}(M_0, A_1), A_2 ) ..., A_{n-1} ), A_n )

where each operator and argument pair (op_i^{k_i}, A_i) yields the model with the highest score in the neighborhood,

    (op_i^{k_i}, A_i) = arg max_{(op, A) ∈ OpA_N(M_{i-1})} S(op(M_{i-1}, A), D).

The learning path is the sequence of operator and argument pairs O_op = {(op_1^{k_1}, A_1), (op_2^{k_2}, A_2), ..., (op_n^{k_n}, A_n)} used to build M, or equivalently, the sequence M_M = {M_0, M_1, ..., M_n} of intermediate models obtained with the sequence of operator and argument pairs, where M_i = op_i^{k_i}(M_{i-1}, A_i). Notice that the models in the search path are in increasing order of quality score, S(M_0, D) < S(M_1, D) < ... < S(M_n, D), and that the final model M_n is a local maximum of the domain D of legal models.

4.1.2 HCS' search path properties

In this section we study the search path of an HCS algorithm in order to introduce, in the following sections, the heuristics that will allow us to extend a batch HCS algorithm to an incremental one. We first introduce the concept of continuous quality functions in order to state how the quality of the models changes when it is measured with respect to a dataset with additional data. This will allow us to introduce two properties of HCS search paths, stated as lemmas, that we will use in our heuristics for incremental learning.

Definition 4.4 (Continuous quality functions) Let DS be the set of all possible datasets D over a set of variables X, let P(D) be the probability distribution of D, let M ∈ D be a model over X, and let S : D, DS → R be a quality function. We say that S is a continuous quality function at D if, for all M ∈ D and given any ε > 0, there exists δ(ε) > 0 such that if D_KL(P(D)||P(D')) < δ(ε) and D' ∈ DS, then |S(M, D) − S(M, D')| < ε.


Figure 4.1: A continuous quality function


This definition is illustrated in Figure 4.1. Recall that D_KL is the Kullback-Leibler divergence (Equation 2.4), which measures the distance between two probability distributions. Note that this definition of continuous quality functions is very similar to the definition of continuous functions over the real numbers, f : R → R; see for example [3]. In Definition 4.4, we replaced the distance |x − x'| between two points of R by the distance D_KL(P(D)||P(D')) between the probability distributions of two datasets. The definition of continuity at a given dataset, Definition 4.4, can be extended to continuity on a subset of datasets.

Definition 4.5 Let DS be the set of all possible datasets D over a set of variables X, and let A ⊆ DS. We say that S : D, DS → R is continuous in A if it is continuous at each dataset D ∈ A.

Notice that continuous quality functions are usually preferred, since models are required not to be scored very differently when they are evaluated with respect to similar datasets. For example, we do not want quality functions to be sensitive to the presence of a few noisy instances in a dataset. Furthermore, the following lemma states that the difference between the qualities of two models will not change very much when it is measured with respect to similar datasets. Note that this is a direct consequence of measuring by means of continuous quality functions.

Lemma 4.1 Let D and D' be two datasets in DS over a set of variables X, let P(D) and P(D') be the probability distributions of D and D' respectively, and let M_i and M_j ∈ D be two models over X. If S : D, DS → R is a continuous quality function, then for all ∆ > 0 there exists δ(∆) such that if D_KL(P(D)||P(D')) < δ(∆) and S(M_i, D) − S(M_j, D) = d, then

    d − ∆ ≤ S(M_i, D') − S(M_j, D') ≤ d + ∆

Proof: We prove this lemma by contradiction. Assume that there exists ∆ > 0 such that for all δ(∆) with D_KL(P(D)||P(D')) < δ(∆) and S(M_i, D) − S(M_j, D) = d, we have |S(M_i, D') − S(M_j, D') − d| > ∆. If this holds, then

    |(S(M_i, D') − S(M_i, D)) + (S(M_j, D) − S(M_j, D'))| > ∆

and, by the triangle inequality,

    |S(M_i, D') − S(M_i, D)| + |S(M_j, D) − S(M_j, D')| > ∆

so that |S(M_i, D') − S(M_i, D)| > ε_i or |S(M_j, D) − S(M_j, D')| > ε_j for any ε_i, ε_j with ε_i + ε_j = ∆, which contradicts the definition of continuous quality functions. 2

We generalize Lemma 4.1 in the following theorem by stating that continuous quality functions rank models in a similar way when they are measured with respect to slightly different datasets. Note again that this is a desirable property in a domain, since we do not want models to be ranked very differently in the presence of a few noisy data instances.

Theorem 4.1 (Model ranking) Let DS be the set of all possible datasets D over a set of variables X, let D and D' ∈ DS be two datasets, let P(D) and P(D') be the probability distributions of D and D' respectively, let M ∈ D be a model over X, and let S : D, DS → R be a quality function with which an order among all models is defined,

    S(M_1, D) ≤ S(M_2, D) ≤ ... ≤ S(M_m, D)

where the quality is measured with respect to the dataset D. If the quality function S is continuous in DS, then for all integers k > 0 there exists δ(k) > 0 such that if D_KL(P(D)||P(D')) < δ(k), then the quality function defines a new order among the models using the dataset D',

    S(M_σ(1), D') ≤ S(M_σ(2), D') ≤ ... ≤ S(M_σ(m), D')

where σ(i) is a mapping with 1 ≤ σ(i) ≤ m such that if i = σ(j) then |i − j| < k, for all i, 1 ≤ i ≤ m.

Proof: For all i, 1 ≤ i < m, let d_i = S(M_i, D) − S(M_{i+1}, D), so that S(M_i, D) − S(M_{i+k}, D) = d_i + ... + d_{i+k−1}. Since S : D, DS → R is a continuous function (Definition 4.4), we know that for all ε there exists δ(ε) such that if D_KL(P(D)||P(D')) < δ(ε) and D' ∈ DS, then |S(M_i, D) − S(M_i, D')| < ε. Thus, it suffices to take

    ε < min{ |d_1 + ... + d_k|, |d_2 + ... + d_{k+1}|, ..., |d_{m−k+1} + ... + d_m| } / 2

to ensure that each model M_i stays within distance k of its place in the former order. 2

In this work we are actually interested in incremental learning, and thus in comparing the quality of models when new data are available. So we are concerned with the change in the quality of models when it is measured with respect to the current dataset D ∪ D'. Observe that this dataset is built by adding the newly available data D' to the former data D, both having the same underlying probability distribution. Furthermore, note that the greater |D|/|D'| is, the smaller D_KL(P(D)||P(D ∪ D')) is (a short derivation making this remark precise is given at the end of this section). In other words, when the data already seen tend to infinity, or the number of newly available data instances is very small (i.e. a single data instance), the quality scores of models will be modified only very slightly.

Now we introduce two properties of HCS search paths that directly follow from Theorem 4.1. The first lemma observes that the order of the models in a search path does not change very much in the light of new data coming from the same underlying probability distribution.

Lemma 4.2 Let D be a domain of models M over a set of variables X, and let S : D, DS → R be a continuous quality function. Now, let {M_0, M_1, ..., M_n} be a search path that results from an HCS algorithm such that S(M_0, D) < S(M_1, D) < ... < S(M_n, D). Then, given a new dataset D', a new order among the models of the search path is obtained,

    S(M_σ(0), D ∪ D') < S(M_σ(1), D ∪ D') < ... < S(M_σ(n), D ∪ D')

where σ(i) is a mapping with 1 ≤ σ(i) ≤ n such that if i = σ(j) then |i − j| < k, for all i, 1 ≤ i ≤ n, and k tends to zero as D_KL(P(D)||P(D ∪ D')) decreases.

Proof: Lemma 4.2 follows directly from Theorem 4.1, since the search path {M_0, M_1, ..., M_n} is a subset of the models in D. 2

The second lemma is concerned with the operator and argument pairs used to build the neighborhood of a model at a given step of the search path. We can order the pairs in OpA_N(M_i) by means of the quality score of the structure obtained with them. The lemma observes that this order will not change very much in the light of new data.

Lemma 4.3 Let D be a domain of models M over a set of variables X, and let S : D, DS → R be a continuous quality function. Now, let {op_1, ..., op_k} be the set of traversing operators used by an HCS algorithm, and let

    M_i = op_i^{k_i}( op_{i-1}^{k_{i-1}}( ... op_2^{k_2}( op_1^{k_1}(M_0, A_1), A_2 ) ..., A_{i-1} ), A_i )

be the model built by the algorithm at the i-th step of the search, where the operator and argument pairs in OpA_N(M_i) are ordered by means of the score of the resulting model,

    S(op_1^{k_1}(M_i, A_1), D) < S(op_2^{k_2}(M_i, A_2), D) < ... < S(op_n^{k_n}(M_i, A_n), D).

Then, given a new dataset D', a new order among the models in the neighborhood is obtained,

    S(op_σ(1)^{k_σ(1)}(M_i, A_σ(1)), D ∪ D') < S(op_σ(2)^{k_σ(2)}(M_i, A_σ(2)), D ∪ D') < ... < S(op_σ(n)^{k_σ(n)}(M_i, A_σ(n)), D ∪ D')

where σ(i) is a mapping with 1 ≤ σ(i) ≤ n such that if i = σ(j) then |i − j| < k, for all i, 1 ≤ i ≤ n, and k tends to zero as D_KL(P(D)||P(D ∪ D')) decreases.

Proof: Lemma 4.3 follows directly from Theorem 4.1, since the neighborhood of a given model M_i, built with the set of operators {op_1, ..., op_k}, is a subset of the models in D. 2

From these last two lemmas we now derive two heuristics for learning in an incremental fashion. The first heuristic assumes that it is only worth updating an already learned model when the new data alter the order of the models that are in its learning path. The second heuristic applies when the learned model is updated, and it restricts the search to those structures that had high quality scores before the new data instances were available.
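As promised above, the remark that D_KL(P(D)||P(D ∪ D')) shrinks as |D|/|D'| grows can be made precise for empirical distributions. The following derivation is our own addition; it assumes that P(D) and P(D ∪ D') denote the empirical distributions of the datasets, and it uses the convexity of the Kullback-Leibler divergence in its second argument. The empirical distribution of the pooled dataset is the mixture

    P(D ∪ D') = (1 − λ) P(D) + λ P(D'),   with   λ = |D'| / (|D| + |D'|),

so convexity gives

    D_KL( P(D) || P(D ∪ D') ) ≤ (1 − λ) D_KL( P(D) || P(D) ) + λ D_KL( P(D) || P(D') ) = λ D_KL( P(D) || P(D') ),

since D_KL( P(D) || P(D) ) = 0. Hence the divergence is bounded by λ times a constant and vanishes as |D|/|D'| → ∞, i.e. as λ → 0.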

4.1.3 Traversal Operators in Correct Order

We call the first heuristic Traversal Operators in Correct Order, TOCO from now on. The TOCO heuristic is able to determine when the current network structure should be updated by analyzing the search path of the HCS algorithm in the previous learning step. In our incremental approach, we keep the order in which operator and argument pairs were applied, that is, the search path of the former learning step. If, when new data instances are available, the order no longer holds, we can conclude two facts. First, using Lemma 4.2, we know that D_KL(P(D)||P(D ∪ D')) is bigger than a certain positive number ∆, and thus that the newly available instances provide new information that is not taken into account in the current model. Second, we know for sure that the HCS algorithm (see Algorithm 17) would follow a different search path through the space of models and possibly obtain a different one. On the contrary, when the order still holds, we conclude that D_KL(P(D)||P(D ∪ D')) is lower than a certain positive number ∆, and we assume that the HCS algorithm would follow the same path again and obtain the same model. Thus, in the former case we trigger the HCS algorithm, while in the latter we do not revise the structure. In order to check whether the pairs are in correct order, the TOCO heuristic, rather than using all the operator and argument pairs in OpA_N(M), uses only those pairs that also belong to the search path, that is, that were used to build the structure. When the order does not hold, we assume that the model built by means of the correctly ordered traversing operators is still correct and thus we use it as the initial model for the HCS algorithm. So, the benefit of the TOCO heuristic is twofold. First, the model will only be revised when it is invalidated by new data, and second, in the case that it must be revised, the learning algorithm will not begin from scratch. We now state the TOCO heuristic in a more formal way.

Definition 4.6 (TOCO heuristic) Let D be a dataset, M a model, S(M, D) a scoring metric, HCS a hill-climbing searcher, and {op_1, ..., op_k} a set of traversing operators which, given an argument A and a model M, return a new model M'. Also, let OpA_N(M) be the set of operator and argument pairs with which the neighborhood N of a model M is obtained. Let

    M = op_n^{k_n}( op_{n-1}^{k_{n-1}}( ... op_2^{k_2}( op_1^{k_1}(M_0, A_1), A_2 ) ..., A_{n-1} ), A_n )

be a model learned by HCS, where M is built with an ordered list of operator and argument pairs, that is, the learning path O_op = {(op_1^{k_1}, A_1), ..., (op_n^{k_n}, A_n)} such that

    ∀i ∈ [1, n] : (op_i^{k_i}, A_i) = arg max_{(op, A) ∈ OpA_N(M_{i-1})} S(op(M_{i-1}, A), D)

Let D' be a set of new data instances. The TOCO heuristic states that HCS learns a new model M' corresponding to {D ∪ D'} as

    M' = op_{n'}^{k'_{n'}}( op_{n'-1}^{k'_{n'-1}}( ... op_2^{k'_2}( op_1^{k'_1}(M_ini, A'_1), A'_2 ) ..., A'_{n'-1} ), A'_{n'} )

where the traversing operators and arguments are obtained as before,

    ∀i ∈ [1, n'] : (op_i^{k'_i}, A'_i) = arg max_{(op, A) ∈ OpA_N(M_{i-1})} S(op(M_{i-1}, A), {D ∪ D'})

and where the initial model M_ini is

    M_ini = op_j^{k_j}( ... op_2^{k_2}( op_1^{k_1}(M_0, A_1), A_2 ) ..., A_j )

where (op_j^{k_j}, A_j) is the last operator and argument pair which is in correct order, that is, the last pair that produces the model with the highest score among the pairs in the search path O_op,

    (op_j^{k_j}, A_j) = arg max_{(op, A) ∈ OpA_N(M_{j-1}) ∩ O_op} S(op(M_{j-1}, A), {D ∪ D'})

and

    (op_{j+1}^{k_{j+1}}, A_{j+1}) ≠ arg max_{(op, A) ∈ OpA_N(M_j) ∩ O_op} S(op(M_j, A), {D ∪ D'})

where OpA_N(M_i) ∩ O_op stands for the set of all operator and argument pairs that belong both to the neighborhood of M_i and to the learning path O_op of the former learning step.

Note that there are two extreme cases. On the one hand, when j = 0, the first operator and argument pair is not in correct order and thus the new revised structure will be learned from scratch. On the other hand, when j = n, all pairs were in correct order and thus the current structure does not need to be updated. In Figure 4.2 we illustrate the TOCO heuristic. At the top of the figure we can see the search path of the previous learning step, where the quality of the models is calculated with the data D. The HCS algorithm began from the empty model M_0 and, using at each step the operator and argument pair that yielded the model with the highest quality, the search algorithm obtained the local maximum M_n. At the bottom of the figure we can see lists of the alternative operator and argument pairs that the TOCO heuristic considers in order to check whether the revising process should be fired in the light of the new data D ∪ D'. Each list stores the pairs that were used to build the neighborhood of the learned model at a given step of the search path and that were used in the search path of the former learning step. The pair with the highest quality measured with the dataset D ∪ D' is at the top of each list. So, if the pair that was used in the former learning step is still at the top, the TOCO heuristic considers that this step of the search path is still correct. On the contrary, if the pair at the top is a different one, the heuristic fires the updating process and the HCS algorithm continues the search from that point of the search path. In Figure 4.2 the i-th search step is checked as correct while the j-th step is not. Hence, HCS is fired with M_{j-1} as the initial model.
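The check can be sketched as follows, reusing the conventions of the hill-climbing sketch in Section 4.1.1: `path` is the stored learning path O_op, `legal_moves` builds OpA_N(M), and `score` is evaluated on D ∪ D'. The names, the (op, arg) representation, and the assumption that such pairs are hashable are all conventions of this sketch, not the literal implementation.

    def toco_check(path, model0, legal_moves, score, new_data):
        """Return (j, M_ini): j is the index of the first pair of the former
        search path that is no longer the best-scoring pair among the path
        pairs legal in the current neighbourhood; M_ini is the model rebuilt
        with the first j (still correctly ordered) pairs.  j == len(path)
        means the order still holds and no revision is needed."""
        model = model0
        on_path = set(path)                      # O_op
        for j, (op, arg) in enumerate(path):
            in_both = [oa for oa in legal_moves(model) if oa in on_path]
            if not in_both:
                return j, model                  # resume HCS from this model
            best = max(in_both,
                       key=lambda oa: score(oa[0](model, oa[1]), new_data))
            if best != (op, arg):
                return j, model                  # first out-of-order pair
            model = op(model, arg)
        return len(path), model                  # current structure still valid

When j < len(path), hill climbing is simply restarted with M_ini as its initial model, so the part of the old search path that the new data did not invalidate is never recomputed.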

for any ε > 0 we can find a δ(ε) > 0 such that if D_KL(P(D)||P(D')) < δ(ε) then |D_KL(B_S, D) − D_KL(B_S, D')| < ε.


    |D_KL(D, B_S) − D_KL(D', B_S)| =

    | Σ P(D) log( P(D) / P(B_S) ) − Σ P(D') log( P(D') / P(B_S) ) | =

    | Σ P(D) log( P(D') / P(B_S) ) + Σ P(D) log( P(D) / P(D') ) − Σ P(D') log( P(D') / P(B_S) ) | =

    | Σ (P(D) − P(D')) log( P(D') / P(B_S) ) + Σ P(D) log( P(D) / P(D') ) |