A Parallel Learning Algorithm for Bayesian Inference Networks

Wai Lam

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong. [email protected]

Alberto Maria Segre

Department of Management Sciences The University of Iowa Iowa City, Iowa 52242 U.S.A. [email protected]

Abstract

We present a new parallel algorithm for learning Bayesian inference networks from data. Our learning algorithm exploits both properties of the MDL-based score metric, and a distributed, asynchronous, adaptive search technique called nagging. Nagging is intrinsically fault tolerant, has dynamic load balancing features, and scales well. We demonstrate the viability, effectiveness, and scalability of our approach empirically with several experiments using on the order of 20 machines. More specifically, we show that our distributed algorithm can provide optimal solutions for larger problems as well as good solutions for Bayesian networks of up to 150 variables.

Keywords: Machine Learning, Bayesian Networks, Minimum Description Length Principle, Distributed Systems

Support for this research was provided by the Office of Naval Research through grant N0014-94-1-1178, and by the Advanced Research Projects Agency through Rome Laboratory Contract Number F30602-93-C-0018 via Odyssey Research Associates, Incorporated.

1 Introduction

Bayesian inference networks (also known as Bayesian networks) have become a popular knowledge representation scheme for probabilistic knowledge. One of the reasons for their success lies in the fact that they offer both a sound theoretical framework and a conceptually simple interpretation for representing and manipulating knowledge in a probabilistic framework. To wit, there are a number of practical applications which already make use of this powerful and flexible knowledge representation scheme:

Image Understanding: A system developed at the Naval Research Laboratory employs Bayesian networks to perform ship classification from raw sensor images [18].

 Forecasting: The ARCO1 system is able to both reason about and forecast the crude oil market [1].

 Information Retrieval: A Bayesian network is used to retrieve documents relevant to a particular information need from a huge collection of information stored in electronic media [9].

 Intelligent Decision Making: The Vista system, developed at NASA Mission Control Center, interprets live telemetry data and assesses the operation of the space shuttle's propulsion systems [13].

 Process monitoring: General Electric's GEMS expert system monitors power generation equipment performance [17].

Medical Diagnosis: The PATHFINDER system [10] performs diagnosis of lymph node pathology for over 60 diseases; it has recently been transferred to a commercial system called INTELLIPATH, which is in use by several hundred medical and clinical sites.


Other Applications: Other application areas include, for example, software maintenance [5], natural language understanding [6], and troubleshooting [11].

Despite these successful system deployments, systems designers who intend to use Bayesian networks, like designers of knowledge-based systems in general, encounter the knowledge engineering bottleneck; that is, constructing a network manually is both time consuming and prone to error. Clearly, any mechanism that can help automate this task would be beneficial. One technique to cope with this problem is to learn the network model from data pertinent to the domain [12, 8, 4, 19, 24, 28]. Unfortunately, since this problem is believed to be NP-complete [7], learning larger models requires exponentially increasing computational resources.

In this paper, we present a new distributed solution to the Bayesian network learning problem that exploits idle or under-utilized workstations in order to extend the size of the largest problem which can be solved in reasonable time. Our approach is based on a machine learning formalism known as the Minimum Description Length (MDL) principle. We present a simple serial search algorithm based on an MDL score metric, and show how some properties of this score metric can be used to effectively reduce the search space for our serial algorithm. We then show how the serial solution can be effectively parallelized using an asynchronous distributed search technique called nagging. Finally, we present an empirical study to support our claims of improved learning performance and increased computational efficiency, in which Bayesian networks of up to 150 nodes are found using up to 20 networked workstations. Moreover, we demonstrate the superiority of our approach over both the simple serial algorithm and a more naive parallel partitioning scheme.

Section 2 presents some background on the Bayesian network learning problem. We describe a serial algorithm for this problem in Section 3. Section 4 introduces two different parallelizations of the serial algorithm of Section 3. Section 5 presents the experimental results obtained, and Section 6 contains some concluding remarks.

[Figure 1: a five-node DAG over the variables A (metastatic cancer), B (brain tumor), C (level of serum calcium), D (coma condition), and E (headache).]

Figure 1: A Bayesian Network Structure in a Domain of Brain Cancer

2 Background

A Bayesian network consists of vertices (or nodes) and directed arcs connected together to form a directed acyclic graph, or DAG. Each vertex represents a variable that ranges over a discrete set of domain-specific values. Each arc represents a probabilistic dependency between the variables represented by its source and destination nodes. Apart from the network structure itself, each node contains additional information in the form of a conditional probability distribution.

More formally, consider the n domain variables X = {X1, X2, ..., Xn}. Let the parent set Π_Xi of a node Xi be the set of source nodes with arcs having destination Xi. For each node Xi there is a conditional probability distribution P(Xi | Π_Xi); where Xi has no parents, there is a prior probability distribution P(Xi).

An example should help make this clear. Figure 1 shows a simple five-variable Bayesian network structure representing the brain cancer domain. Let variable D denote "coma condition", which can take on values "deep", "momentary", and "no" (abbreviated by d, m, and n respectively); variable B denote "brain tumor", which can take on values "yes" and "no" (abbreviated by y and n respectively); and variable C denote "level of serum calcium", which can take on values "high", "low", and "normal" (abbreviated by h, l, and n respectively). Given that B and C are the only parents of D, the conditional probability parameters associated with variable D ("coma condition") are shown in Figure 2.

P(D = d | B = y, C = h) = 0.7     P(D = m | B = y, C = h) = 0.2
P(D = d | B = y, C = l) = 0.75    P(D = m | B = y, C = l) = 0.05
P(D = d | B = y, C = n) = 0.1     P(D = m | B = y, C = n) = 0.8
P(D = d | B = n, C = h) = 0.15    P(D = m | B = n, C = h) = 0.8
P(D = d | B = n, C = l) = 0.7     P(D = m | B = n, C = l) = 0.1
P(D = d | B = n, C = n) = 0.2     P(D = m | B = n, C = n) = 0.1

Figure 2: Probability Parameters for Node D ("coma condition")

The Bayesian network learning problem is to automatically construct such a network model, that is, determine both the structure and the associated conditional probability parameters, from data. The learning problem can also be viewed as a kind of unsupervised learning where the targets to be learned are Bayesian networks. The input data consists of cases in a format similar to a table of records in the relational database model. Each case is a fully-instantiated set of domain variables corresponding to some observed real-world circumstance in the domain of interest.

In addition to this raw data, we are also given a total ordering on the variables. The ordering constitutes a constraint on parent relations in the output network that corresponds to prior knowledge about causality in the domain of application: if Xj appears before Xk in the ordering, then Xk cannot be an ancestor of Xj in the structure (this kind of prior knowledge is commonly found in applications such as diagnosis, planning, and speech recognition). In our brain cancer example, Table 1 is an example of a data set for this domain, and a possible variable ordering might be A, C, B,

case    level of        coma        brain    metastatic    headache
number  serum calcium   condition   tumor    cancer
  1         high          deep       yes        yes        painful
  2         normal        conscious  yes        yes        painful
  3         low           deep       yes        yes        slight
  4         low           momentary  no         no         no
  5         normal        conscious  yes        no         painful
  6         high          momentary  no         yes        slight
 ...

Table 1: An Example of a Data Set for the Domain of Brain Cancer

D, E (the variable ordering reflects a priori domain-specific causal knowledge, such as "brain tumors cause headaches" as opposed to "headaches cause brain tumors").

Most solutions to the network learning problem divide the problem into two separate phases: first, a structure for the output DAG is determined, and, second, the conditional probabilities associated with the arcs of the DAG are computed. Since these probabilities are easily estimated using standard statistical methods once the structure is determined, in this paper we focus on the problem of constructing a network structure based on the input data and ordering constraints.
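To make the input format just described concrete, here is a minimal sketch (the variable names follow the brain cancer example; the case values and the helper name are ours) of cases, the total ordering, and the constraint that ordering imposes on candidate parent sets:

```python
# Each case fully instantiates every domain variable, like a row in a
# relational table; the total ordering restricts who may parent whom.
cases = [
    {"A": "yes", "C": "high", "B": "yes", "D": "deep",      "E": "painful"},
    {"A": "no",  "C": "low",  "B": "no",  "D": "momentary", "E": "no"},
]
ordering = ["A", "C", "B", "D", "E"]  # Xj before Xk: Xk cannot be an ancestor of Xj

def candidate_parents(var, ordering):
    """Only variables preceding `var` in the ordering may appear in its parent set."""
    return ordering[:ordering.index(var)]
```

For example, `candidate_parents("D", ordering)` yields `["A", "C", "B"]`, so the parent set of D is drawn from the power set of {A, C, B}.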

3 A Simple Serial Solution

One family of techniques for learning Bayesian network structures is based on a scoring approach [8, 12, 16, 27]. This approach is characterized by devising a score metric for a candidate network structure, and searching the space of network structures for the best-scoring structure. Since there is an exponential number of candidate network structures for a given number of variables [20], finding an optimal solution is infeasible even for problems of moderate size. Thus scoring systems generally resort to greedy or heuristic search methods that find reasonable, but suboptimal, solutions quickly. As for the score metrics, examples such as the BD and K2 metrics in [12, 8] are the relative posterior probability of a network structure, while the score metrics used in [14, 27] are cost functions representing the description (message) length of a network structure. Alternatively, other approaches to this learning problem are characterized by employing conditional independence relations, exemplified by the SGS and PC algorithms [23, 22] as well as the techniques developed by Pearl et al. [29, 19]. In this section, we briefly describe a simple, score-based solution to the Bayesian network structure learning problem that is based on the results of our previous work [16].

3.1 The Minimum Description Length Metric

Our approach makes use of the Minimum Description Length (MDL) principle to balance simplicity against accuracy. More precisely, we use a score metric for network structures that is a function representing the total description length, Ltotal(B), of a candidate network structure B. In related work, we have designed a scheme for efficiently computing this description length [15] of a candidate network structure B by decomposing it by individual variable Xi. With some overloading of the notation Ltotal, we can say:

    Ltotal(B) = Σ_{Xi ∈ X} Ltotal(Xi, Π_Xi)

where Π_Xi is the parent set of Xi in the structure B. (We only state the expressions for this metric in this paper; interested readers should refer to [15] for derivations and a more detailed presentation.)

Each variable's contribution to the total description length is in fact composed of two components:

    Ltotal(Xi, Π_Xi) = Lnetwork(Xi, Π_Xi) + Ldata(Xi, Π_Xi)

where Lnetwork and Ldata are called the network description length and the data description length, respectively. The network description length is defined more precisely as:

    Lnetwork(Xi, Π_Xi) = |Π_Xi| log2(n) + d (si − 1) ∏_{Xj ∈ Π_Xi} sj

where n is the number of variables, d represents the number of bits required to store a numerical value, and si represents the number of possible instantiations for a variable Xi with parent set Π_Xi. A fundamental property of the network description length is that the higher the topological complexity of the network, the greater its network description length. Since it is widely recognized that conducting inference on highly-connected networks is likely to be intractable [7], such networks are not very useful in practice. Thus, by minimizing this length function, we favor more useful networks, that is, those having simpler topology.

Apart from the simplicity issue, we also need to consider the accuracy issue, or how well a network structure represents the data. The accuracy issue is captured by the data description length:

    Ldata(Xi, Π_Xi) = Σ M(Xi, Π_Xi) log2( M(Π_Xi) / M(Xi, Π_Xi) )

where the summation is taken over all possible instantiations of the variable and its parents, and M(·) is the number of cases that match a particular instantiation in the database (we define M(Π_Xi) to be 1 if Π_Xi = ∅, and the log term to be 0 if M(Xi, Π_Xi) = 0).

(Other researchers have proposed alternative metrics for the network description length that differ from the one we have just described [27, 3]. We note that the parallelization scheme we propose in subsequent sections of this paper is also applicable to systems based on many of these alternative metrics.)

A fundamental property of the data description length is that the more accurate the network structure,

the smaller its data description length. Thus, by minimizing this length function, we favor accurate networks.

It can be shown that simplicity and accuracy cannot in general be achieved simultaneously: given a particular network structure, we can increase its accuracy if we increase its topological complexity appropriately. As a result, we are faced with a tradeoff: we wish to learn an accurate network, and at the same time, the structure should be as simple as possible. The score metric Ltotal, based on the MDL measure, offers a principled means to perform this tradeoff by considering the sum of the network description length Lnetwork and the data description length Ldata (note that both Lnetwork and Ldata are non-negative). Within this framework, a shorter length Ltotal corresponds to a better network.

Ideally, we would like to find a network structure which has the lowest Ltotal; we call such a network an optimal network. In situations where an optimal solution cannot be obtained due to limited computing resources, we wish to find a network with Ltotal as low as possible.
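Both length terms can be computed directly from counts over the data. The following is a sketch, not the authors' implementation: it assumes cases are dictionaries of discrete values, takes d as a fixed bit width, and, for an empty parent set, counts M(Π) as the total number of cases (all function names are ours):

```python
import math
from collections import Counter

def l_network(var, parents, arities, n_vars, d=32):
    """Network description length: |Pi| * log2(n) + d * (s_i - 1) * prod of parent arities."""
    prod = 1
    for p in parents:
        prod *= arities[p]
    return len(parents) * math.log2(n_vars) + d * (arities[var] - 1) * prod

def l_data(var, parents, cases):
    """Data description length: sum over observed instantiations (x, pi) of
    M(x, pi) * log2(M(pi) / M(x, pi)); unobserved instantiations contribute 0."""
    m_joint = Counter((c[var], tuple(c[p] for p in parents)) for c in cases)
    m_parent = Counter(tuple(c[p] for p in parents) for c in cases)
    return sum(m * math.log2(m_parent[pi] / m) for (x, pi), m in m_joint.items())

def l_total(var, parents, cases, arities, n_vars):
    """MDL score of one candidate parent set: simplicity term plus accuracy term."""
    return l_network(var, parents, arities, n_vars) + l_data(var, parents, cases)
```

The tradeoff is visible in the two terms: adding a parent multiplies the parameter count inside `l_network` while it can only shrink `l_data`.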

3.2 Properties of the MDL Metric

The MDL metric just described displays some interesting and important properties upon which our serial algorithm is based.

Property 1. Lnetwork(Xi, Π_Xi) is a strictly increasing function as the parent set Π_Xi expands. In other words, Lnetwork(Xi, Π1) < Lnetwork(Xi, Π2) if Π1 ⊂ Π2.

This property can be easily verified from the expression for Lnetwork(Xi, Π_Xi).

Property 2. Ldata(Xi, Π_Xi) is a monotonically decreasing function as the parent set Π_Xi expands. In other words, Ldata(Xi, Π1) ≥ Ldata(Xi, Π2) if Π1 ⊂ Π2.

The proof can be found in the Appendix.

These two properties imply that if we introduce more variables into a particular parent set, the network description length (i.e., Lnetwork) will increase, whereas the data description

length (i.e., Ldata) will decrease. Unfortunately, no immediate conclusion can be drawn with respect to Ltotal, the score metric used for comparing the merits of different candidate parent sets. If the amount of increase in Lnetwork is less than the amount of decrease in Ldata, then Ltotal will decrease and the new parent set is more desirable. On the other hand, if the amount of increase in Lnetwork is greater than the amount of decrease in Ldata, then Ltotal will increase and the new parent set is less desirable. Suzuki [28] derived a property that characterizes this tradeoff:

Property 3a [Suzuki]. If Lnetwork(Xi, Π1) − Lnetwork(Xi, Π1 \ {Xk}) ≥ Ldata(Xi, Π1 \ {Xk}), then Ltotal(Xi, Π2) > Ltotal(Xi, Π1) for all Π1 ⊆ Π2.

Note that, given our definition of Ltotal above, the expression in the premise of Property 3a can be rewritten to read:

    Lnetwork(Xi, Π1) ≥ Ltotal(Xi, Π1 \ {Xk})

Suzuki uses this property within a branch-and-bound framework to make (local) decisions that preclude searching alternative parent sets for a given node. We refine this property and provide a better, global bound as follows:

Property 3b. If Lnetwork(Xi, Π1) ≥ L, then Ltotal(Xi, Π2) > L for all Π1 ⊆ Π2, where L is an upper bound for Ltotal.

Property 3b follows in a straightforward way from Property 1 and the fact that both Lnetwork and Ldata are non-negative. Nevertheless, it plays an important role in the development of our search-pruning strategy: it provides a salient criterion for effectively pruning a portion of the search space without sacrificing optimality.

Recall that we want to minimize Ltotal. According to this property, if the network description length of the current parent set exceeds the upper bound L (the best MDL score found so far), we can conclude that the metric of a strictly larger parent set will be greater than the upper bound. Hence, there is no need to consider the larger parent set, since it can never lead to a better solution. Note that, unlike with Suzuki's method, L need not be computed on a subset of parent set Π1, but rather on any parent set. Thus L gives a better, more informative, bound than Suzuki's Ltotal(Xi, Π1 \ {Xk}) (see Property 3a).

3.3 A Serial Search Algorithm

We now describe a systematic serial search algorithm that produces the optimal Bayesian network structure with respect to the MDL score metric. We can guarantee that the algorithm produces the optimal solution because it is systematic; that is, it enumerates and scores all possible solutions, returning the solution with the lowest score, while never examining a solution more than once in the process. Later we show that this algorithm can also be used to obtain good solutions under resource-bounded conditions.

Given an input ordering of the domain variables (X1, X2, ..., Xn), the structure learning problem can be reduced to searching for a good "parent set" Π_Xi for each variable Xi (where 2 ≤ i ≤ n), since the parent sets together completely determine the network structure. Thus the task can be broken down into n − 1 independent subtasks, where each subtask searches for the desired parent set for a given variable from those variables preceding it in the input ordering. In other words, the parent set Π_Xi must be a member of the power set of {X1, ..., X(i−1)}.

Each subtask is itself a search problem on some implicitly defined search space, and optimal solutions from each subtask can then be composed to produce an optimal global solution. We impose a design requirement on the structure of this search space: in order to guarantee systematicity, and thus an optimal solution, we must ensure that all possible combinations of variables are explored.

Consider the search space for variable Xi. All variables in the set {X1, ..., X(i−1)} are potential members of the parent set. First these variables are ranked according to Ltotal(Xi, {Xj}), where 1 ≤ j ≤ i − 1 (this heuristic corresponds to the value of using each variable as the lone

L*: a global variable containing the current minimum Ltotal
Π*: a global variable containing the parent set corresponding to L*
first parameter (Π): the current parent set
second parameter (V): the list of variables to be considered

procedure search(Π, V)
  1. if V is empty then return; otherwise let Y be the first variable in V
  2. calculate Ltotal(Xi, Π ∪ {Y})
  3. if Ltotal(Xi, Π ∪ {Y}) < L* then
         L* = Ltotal(Xi, Π ∪ {Y})
         Π* = Π ∪ {Y}
  4. call search(Π, V \ {Y})
  5. if Lnetwork(Xi, Π ∪ {Y}) < L* then call search(Π ∪ {Y}, V \ {Y})

Figure 3: A Serial Search Algorithm for Variable Xi

member of the parent set). Suppose the result of the ranking is (Xr1, Xr2, ..., Xr(i−1)) (to facilitate discussion, we will use (Y1, Y2, ..., Y(i−1)) to denote such a ranking, e.g., Xr1 ≡ Y1 and so on). This ranking provides an initial estimate of the merit of each variable. It also permits the algorithm to consider those variables at the top of the ranking first.

Now we carry out a depth-first search (see Figure 3), with each state in the implicitly defined search space representing a candidate parent set, and the root state representing an empty parent set. Each state is labeled with the current parent set, denoted by Π, and an ordered list of variables yet to be considered, denoted by V. The first variable Y in V is extracted and

added into Π to form a new parent set Π ∪ {Y}. The MDL score metric of this new parent set is evaluated, and the solution is updated if the score is better than the current solution. Finally, two children of the current state are generated and explored recursively. Both children have V \ {Y} as the set of variables to be considered, but the first child keeps the original parent set Π whereas the second child contains Π ∪ {Y} as the parent set. Note that this second child need only be explored if Lnetwork(Xi, Π ∪ {Y}) < L*; otherwise the search of the subtree rooted at this child node can be safely pruned without sacrificing optimality (see Property 3b and step 5 in Figure 3). The better the bound L*, the more of the search space can be pruned: thus, if we had instead used the weaker Property 3a, which imposes an additional constraint on the comparability of alternate parent sets, we would always search at least as many nodes and, quite probably, many more.

Note that it is always possible to construct a resource-bounded version of the algorithm by imposing a secondary termination criterion, at the expense of optimality. Since the current "best" solution, Π*, is kept during the search, this solution can be returned at any time should the algorithm be terminated. Thus a resource bound might be imposed as a limit on computing time or on the number of MDL score metric computations.
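The depth-first procedure of Figure 3 translates almost directly into code. The following is a sketch, not the authors' implementation: it assumes the caller supplies the two scoring functions, and the function names and functional interface are ours:

```python
def search_parent_set(candidates, l_total, l_network):
    """Branch-and-bound search for the parent set of one variable X_i.

    `candidates` is the (ranked) list of legal parents of X_i;
    `l_total(pi)` and `l_network(pi)` score a candidate parent set pi.
    Returns (best_score, best_parent_set), mirroring Figure 3's L* and Pi*.
    """
    best = [l_total(frozenset()), frozenset()]   # L*, Pi*: start from the empty set

    def search(pi, v):
        if not v:
            return
        y, rest = v[0], v[1:]
        extended = pi | {y}
        score = l_total(extended)
        if score < best[0]:                      # step 3: record a new incumbent
            best[0], best[1] = score, extended
        search(pi, rest)                         # step 4: child without y
        if l_network(extended) < best[0]:        # step 5: Property 3b pruning
            search(extended, rest)               # child with y

    search(frozenset(), list(candidates))
    return best[0], best[1]
```

Because `l_network` only grows as the parent set expands, the step-5 test safely discards every superset of `extended` whenever the network term alone already exceeds the incumbent bound.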

4 Two Parallel Algorithms

Clearly, the exhaustive serial search procedure just described explores 2^n nodes in the worst case (i.e., when absolutely no pruning takes place), and therefore requires time exponential in the number of variables. Thus only small problems can be solved optimally, while resource-bounded suboptimal solutions will have to do for larger problems. From a practical perspective, we would like to push this horizon back, obtaining optimal solutions for ever larger problems. Clearly, increasing the amount of pruning that occurs is one way of extending this horizon; another is to explore parallel solutions to the problem, and, in particular, parallel solutions which can scale to the very large numbers of processors typically available in a

networked computing environment.

Making any algorithm operate in a distributed computing environment raises a number of issues. The first issue is load balancing, which is concerned with how to schedule and allocate subtasks among computers of varying intrinsic speed, configuration, and dynamic load. A good distributed solution should be able to dynamically adjust the subtask allocation to reflect the status of the underlying computing facilities. The second issue is that of fault tolerance. In a distributed environment, it is not uncommon to lose communication links or to have some system fail outright; a strategy to handle such problems gracefully during the learning process is essential. Finally, we need to coordinate tasks running on separate processors. At the very least, some scheme is needed to evaluate the intermediate solution found in each task; these intermediate results are then combined to obtain a correct and desirable solution.

4.1 A Simple Parallel Solution

In this section, we describe an obvious, albeit somewhat naive, distributed solution we call the simple partition approach. The simple partition approach involves dividing the search space into partitions of roughly equal expected size and allocating each to a different workstation. More specifically, a breadth-first search is conducted on a single designated processor, where the search space is identical to the one described for the serial algorithm of Section 3. As the breadth-first search progresses, it accumulates unexplored search states. When the number of unexplored search states equals the number of processors available, the breadth-first search phase halts and each unexplored search state is allocated to a different processor. Each processor then searches its corresponding portion of the search space using the same serial algorithm as before. When a processor exhausts its search space, or the elapsed CPU time exceeds a predetermined resource limit, the processor reports its best solution to the designated processor, which compares the solutions contributed by the processors and selects the best one.
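The frontier-splitting step can be sketched as follows (a simplified, single-process illustration; `root_children` and the state encoding are ours, and in the actual system each returned state would be handed to a separate workstation):

```python
from collections import deque

def partition_search_space(root_children, num_processors):
    """Expand breadth-first until at least `num_processors` unexplored states
    accumulate; each surviving frontier state becomes one processor's partition.

    `root_children(state)` returns the successor states of `state`.
    """
    frontier = deque([()])                # root state: the empty partial solution
    while len(frontier) < num_processors:
        state = frontier.popleft()
        children = root_children(state)
        if not children:                  # a leaf has no unexplored descendants
            if not frontier:
                break                     # space exhausted before filling quota
            continue
        frontier.extend(children)
    return list(frontier)
```

On a binary search tree this splits the space into subtree roots; the critique that follows (unequal subtree sizes, no fault tolerance, no bound sharing) applies exactly to these fixed partitions.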

This simple partition approach suffers from a number of shortcomings. First, it simply "races" different processors on different portions of the search space, and there is little reason to believe that these portions will take equivalent time; with pruning, we cannot know a priori exactly how large these search spaces will be, and some processors might be intrinsically faster than others, or their loads may vary significantly. This is a load balancing problem, where processor utilization on some processors may turn out to be quite low. Second, the simple partitioning approach is not fault tolerant; if a processor or communication link should fail, a portion of the search space will remain unexplored unless that space is reallocated to another processor (of course, keeping track of which portions are assigned to which processor, and which processors are still active, requires some additional computational overhead). Even if the failure probability for a single processor or link is low, scaling to larger numbers of processors and links is problematic. Finally, there is no provision for information sharing; that is, if a processor should find a better intermediate bound, it cannot share it with other processors, even though the new bound might significantly reduce their work. Furthermore, adding a mechanism by which this information is broadcast to all processors would entail additional overhead and increased communication costs, which may result in lower overall performance.

4.2 A Second Parallel Solution

Based on our previous work on nagging, a parallel asynchronous search-pruning technique for first-order logic theorem proving [25, 21], we now propose a distributed strategy that addresses the shortcomings of the simple partitioning scheme just described.

Nagging employs two types of processes: a master process, which attempts to solve a problem via a sequential search procedure, and one or more nagging processes which, operating asynchronously, attempt to prune the master's current search branch. An idle nagging process obtains a portion of the master's search space and explores a specially prepared, transformed version of this space in parallel, subject to at least a portion of the solution constraints to which the master has already committed (see Figure 4).

[Figure 4: a master process and a nagger, each maintaining the current bound L*; at a nag point, the nagger receives a transformed version of the master's unexplored search space.]

Figure 4: Nagging

The problem transformation function used to prepare the nagger's search space must meet two desirable criteria. First, it must ensure that if the nagger's space contains no solution better than the current "best solution" known to the master (corresponding to L*), then the corresponding master's space can also contain no better solution. Second, the nagger's space should, with high probability, be smaller than the corresponding master's space. If the nagger exhausts its search space without finding a better solution, then no better solution can exist in that portion of the master's search space: the nagger then interrupts the master and forces the master to backtrack to a point where a better solution becomes feasible once more. Thus a nagger helps the master's search along by determining whether a portion of the master's search space is infeasible. Readers interested in a more complete treatment of nagging and its formal properties are referred to [26].

Nagging has a number of desirable properties. For example, it is intrinsically fault tolerant, since losing a nagger due to communication or hardware problems will not compromise the master's solution. In addition, unlike most partitioning approaches, nagging is characterized by infrequent and brief communication between processors. Thus nagging is particularly appropriate for more loosely-coupled networks of workstations.


4.2.1 Problem Transformation Functions

Critical to the success of nagging is the availability of suitable problem transformation functions that allow a nagger to extract useful information while exhausting significantly smaller search spaces than the master. Without suitable problem transformation functions, nagging is similar to the simple partitioning approach of Section 4.1, in that it essentially "races" multiple processors on portions of the search space. Fortunately, good problem transformation functions are not all that difficult to construct. Recall that the goal of a good problem transformation function is twofold: first, it must maintain the solution character of the original search space, and second, it should produce a transformed space of smaller size than the original space.

One transformation we have implemented is to rank the elements Y of V according to the metric Ltotal(Xi, Π ∪ {Y}). The rationale is that variables with a favorable metric (computed with respect to the current parent set) should be considered first. We found this transformation to be especially useful for computation in a time resource-bounded setting; under such resource-bounded conditions, we wish to find a good solution as early as possible. This transformation is applied at two places. The first is at the generation of a nagger's search space: when a nagger obtains an unexplored search state from the master, it applies this transformation to the search state before starting its search. The second is at selected search states in the master process: more precisely, if the size of the set V is larger than a threshold (currently set to 5), this transformation is applied to alter the specification of the search state.

Besides the reordering transformation just mentioned, we might introduce an element of nondeterminacy by randomly permuting each nagger's copy of V, so that variables are considered in a different order than in the master's search. Such permutations are likely to display considerably different (and possibly more effective) pruning behavior, thus letting the nagger search the space, on average, more quickly. Note that this kind of transformation function meets the second criterion for good transformation functions only in a probabilistic sense: there is no guarantee that the space will be smaller, but there is a chance that it will be. The cost/benefit tradeoff of using such a transformation function naturally depends on the distribution of expected search times for the transformed problems. Other problem transformation functions are also feasible; a discussion of how more sophisticated transformation functions can dramatically improve overall performance can be found in [26].
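Both transformations operate on a (Π, V) search state. A minimal sketch (helper names are ours, and `score(pi, y)` stands in for Ltotal(Xi, Π ∪ {Y})):

```python
import random

def rerank_transform(state, score):
    """Reorder the unexplored variables so the most promising come first.
    `state` is a pair (pi, v); lower `score(pi, y)` means more promising."""
    pi, v = state
    return pi, sorted(v, key=lambda y: score(pi, y))

def permute_transform(state, rng=random):
    """Randomly permute the nagger's copy of V; the transformed space may,
    probabilistically, prune better than under the master's ordering."""
    pi, v = state
    v = list(v)
    rng.shuffle(v)
    return pi, v
```

Note that neither transformation adds or removes candidate parent sets, so the transformed space contains a better solution exactly when the original one does, satisfying the first criterion above by construction.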

4.2.2 Recursive Nagging

A second feature becomes obvious once we note that each nagger executes precisely the same search algorithm as the master, albeit perhaps with some transformation applied to the space searched. If enough processors are available, a nagger can itself be nagged, recursively, by additional processors. The use of recursive naggers to speed a nagger's search will, of course, ultimately benefit the master process, since it will in turn be nagged more efficiently. In addition, this scheme has the added advantage of reducing the (admittedly relatively small) overhead incurred by the master in servicing too many naggers' requests.

4.2.3 Solution Sharing

Finally, we can exploit information gained by a nagger when it finds an intermediate solution that scores better than the current bound L. Recall that this bound plays an important role in the efficiency of the search: a good (i.e., lower) value allows us to prune a large portion of the search tree so that more states can be searched within a given amount of time. Yet in the current protocol, the value of L is passed from the master to the naggers rather than from a nagger to its master. Of course, for some problem transformation functions, the value of L obtained by the nagging process may not apply directly to the master's original search space. But for other

problem transformation functions, allowing a nagger to pass new values for L back to its master will improve overall performance. We communicate new bounds efficiently in the following manner. Any process finding a better bound reports the new bound to its own master (problem transformation function permitting) and to any subordinate naggers. Upon receipt of a better bound, each process updates its own bound and reports the new bound to its own master and any subordinate naggers, save, of course, for the original sender. This additional communication entails some overhead, whose cost must be weighed against the additional search efficiency obtained.
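The bound-propagation rule just described can be sketched as follows. This is an illustrative sketch only: the class name and the in-memory method calls stand in for whatever message-passing layer the real system uses, and the hierarchy shape is assumed:

```python
INF = float('inf')

class Process:
    """A process in the nagging hierarchy, holding its current bound L."""
    def __init__(self, name):
        self.name = name
        self.bound = INF      # current best description-length bound L
        self.master = None    # the process this one nags for (None at the root)
        self.naggers = []     # subordinate (possibly recursive) naggers

    def receive_bound(self, new_bound, sender=None):
        """Adopt a better bound and forward it to the master and subordinate
        naggers, excluding the process it arrived from.  The improvement
        check stops the flood, so cycles cannot loop forever."""
        if new_bound >= self.bound:
            return                      # not an improvement; ignore it
        self.bound = new_bound          # tighten the local pruning bound
        for peer in [self.master] + self.naggers:
            if peer is not None and peer is not sender:
                peer.receive_bound(new_bound, sender=self)
```

For example, if one nagger discovers a solution of length 42, invoking its own `receive_bound(42)` propagates the bound up to the master and from there across to the sibling naggers, while a later, worse bound is silently discarded.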

5 An Empirical Evaluation

The nagging protocol just described retains many of the desirable properties of nagging. More precisely, the protocol is still fault tolerant and, in addition, it always returns the same solution as the serial algorithm if both are allowed to terminate. Other properties that are difficult to prove formally can be more easily confirmed by experimental means. In this section, we present some experimental results designed to support specific claims about the performance of our algorithm.³

In particular, there are four specific performance and solution quality claims that we would like to support via experimental analysis. The first claim is that, when both algorithms are allowed to terminate, the distributed algorithm will produce the optimal solution more quickly than the serial algorithm does. Clearly, if exploiting multiple workstations does not make the work go faster, you should stick to the serial algorithm (indeed, you would more profitably exploit a different kind of parallelism by solving additional problems on the extra machines). This first claim is, of course, somewhat uninteresting in practice, given that real problems are likely to be large enough to preclude finding optimal solutions. Thus the next claims focus on comparisons of solution quality in a resource-bounded environment. In particular, we would like to show that our distributed algorithm produces a solution that is at least as good, and very likely better, than that provided by the serial algorithm when operating under identical resource constraints. We would also like to show that our distributed algorithm scales well; that is, as additional processors are made available, the quality of the solution will also improve. Finally, we would like to show that nagging is a suitable form of parallelism for this problem: to support this last claim, we compare the performance of our algorithm with that of the simple parallel partitioning scheme of Section 4.1.

We use several different data sets for our experiments. Each data set was generated from a known target Bayesian network structure (of course, the target structures were used only for generating the data sets and were not provided to the learning algorithm in any form). Both real-world and randomly-generated target network structures were used to generate experimental data sets. The first data set is based on the ALARM Bayesian inference network used to model real-world anesthesia problems in an operating room environment [2]. This network consists of 37 variables and 46 arcs. The 10,000-case data set associated with this network is commonly used as a benchmark; the input variable ordering used in our experiments is identical to that used previously by other researchers [8, 28]. The other data sets used were based on randomly-generated Bayesian network structures with 30, 70, 100, and 150 boolean variables and having vertices of average degree between 2 and 4.

³ All of the experiments reported here were conducted using a dedicated 16MB 90MHz Pentium CPU running Linux as the master processor. Nagging processes were allocated among a set of similar 90MHz Pentiums, HPPA, SGI, and 66MHz i486 machines. These machines are physically distributed among several campus buildings and are linked via 10baseT ethernet connections. Note that, unlike the master processor, the nagging processors are not dedicated, and were therefore subject to varying loads during the course of the experiments.
Each data set consists of 2,600 cases generated by conducting a Monte Carlo simulation (an unbiased generator) on the corresponding network structure. The input variable orderings used were chosen arbitrarily in conformance with the randomly-generated network structures.
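Monte Carlo (forward) sampling of this kind visits the variables in an order consistent with the arcs and samples each one from its conditional probability table given the already-sampled parents. The sketch below uses a small invented three-variable network; the structure and CPT numbers are illustrative and are not taken from the paper's data sets:

```python
import random

# Hypothetical network A -> B, {A, B} -> C over boolean variables.
# Each entry maps a variable to (parents, P(var = 1 | parent assignment)).
NETWORK = {
    "A": ((), {(): 0.3}),
    "B": (("A",), {(0,): 0.2, (1,): 0.7}),
    "C": (("A", "B"), {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}),
}

def sample_case(rng):
    """Sample one complete case; dict insertion order is topological here,
    so every parent is assigned before its children are sampled."""
    case = {}
    for var, (parents, cpt) in NETWORK.items():
        parent_vals = tuple(case[p] for p in parents)
        case[var] = 1 if rng.random() < cpt[parent_vals] else 0
    return case

def generate_data_set(n_cases, seed=0):
    """An unbiased generator: n_cases independent samples from the network."""
    rng = random.Random(seed)
    return [sample_case(rng) for _ in range(n_cases)]
```

Because every case is drawn independently from the joint distribution defined by the network, observed frequencies in a data set of a few thousand cases closely track the network's probabilities.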

When the algorithms being tested were allowed to terminate, performance was measured simply in terms of elapsed CPU time on the master processor. Where resource bounds were imposed, solution quality was measured in terms of two metrics: the description length Ltotal and a structural difference metric. The first metric is easily justified: since the objective of our MDL-based approach is to minimize description length, we would hope that the distributed algorithm returns a network structure with Ltotal as low as or lower than that of the serial algorithm operating under the same resource bounds. The structural difference metric, commonly used in Bayesian network learning problems [12], attempts to quantify the solution's structural differences from the original target network. This metric is defined as M + A, where M is the number of arcs in the learned structure but not in the target structure and A is the number of arcs in the target structure but not in the learned structure. Of course, this implies that the original target network that was used to generate the cases is in some sense the "correct" solution, even though the data set may not contain enough information to recover some of the edges in the target network. Thus these metrics should serve only to make meaningful comparisons between solutions, rather than as absolute measures of solution quality.⁴
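The structural difference metric is straightforward to compute from the two arc sets. A minimal sketch (our own, following the definitions of M and A given in the text above):

```python
def structural_difference(learned_arcs, target_arcs):
    """M + A, where M counts arcs in the learned structure absent from the
    target, and A counts arcs in the target absent from the learned
    structure.  Arcs are directed (parent, child) pairs."""
    learned, target = set(learned_arcs), set(target_arcs)
    m = len(learned - target)   # spurious (extra) arcs
    a = len(target - learned)   # missing arcs
    return m + a
```

Under this definition, a solution described as having "7 missing and 2 extra arcs" has a structural difference of 9.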

5.1 Experiment 1

The purpose of this experiment is to evaluate the effectiveness of our distributed solution in a non-resource-bounded environment. We use the first 500 and 1,000 cases from the ALARM data set to infer a network structure using one processor (i.e., the serial algorithm), three processors (master and two nagging processes), and five processors (master and four nagging processes). Each trial was allowed to run to completion, and the elapsed CPU time for the master process was recorded. In every case, the (same) optimal network was found. Compared

⁴ We also considered applying standard machine learning evaluation metrics such as sensitivity and selectivity on the number of arcs in the learned structure. Unfortunately, these metrics are inappropriate for (typically) sparse Bayesian networks, as simply returning a network with no arcs at all will often provide extremely high sensitivity and selectivity values.


Size of data set   Number of processors   Master process CPU usage
500                1                      56 minutes
                   3                      40 minutes
                   5                      25 minutes
1000               1                      183 minutes
                   3                      63 minutes
                   5                      60 minutes

Figure 5: Time Required to Enumerate the Search Space

with the original ALARM network, the optimal network has 7 missing and no extra arcs for the 500-case data set, and 5 missing and 1 extra arc for the 1,000-case data set.⁵ The results are shown in Figure 5. While meaningful performance comparisons with other systems are difficult due to differences in hardware and the like, we note that our solutions are of equal or better quality (in terms of structural difference from the original ALARM network) than many existing solutions. For instance, the quality of the solution found is better than that obtained by K2, a well-known algorithm for learning Bayesian networks [8] (for the same 500 cases, K2 obtains a solution with 7 missing and 2 extra arcs). A previous system that provides solutions similar in quality to ours for the 500-case and 1,000-case data sets is described in [28].

⁵ Note that the notion of an optimal solution is only defined with respect to a given data set. In fact, the original network may well not be the optimal network. Thus it is not unreasonable that the 500-case solution differs from the 1,000-case solution, or that both of these solutions differ from the original ALARM network, which was used to generate all 10,000 cases. It is important to note, however, that each of the three tested versions of the algorithm (serial, three, and five processors) obtains identical solutions on identical data.


Size of data set   Number of processors   Master process CPU usage
500                3                      12.1 seconds
                   5                      10.8 seconds
                   9                      7.2 seconds
1000               3                      17.1 seconds
                   5                      15.5 seconds
                   9                      14.0 seconds

Figure 6: Time Required to Obtain Optimal Solution

5.2 Experiment 2

In the second experiment, we repeat the first experiment using similar processor configurations (master and two, four, or eight nagging processes) under increasingly stringent resource bounds. The objective is to determine, at least to a first approximation, how "quickly" each configuration is able to discover the optimal solution with respect to the given data set. The results are shown in Figure 6. As expected, increasing the number of nagging processors reduces the time required to discover the optimal solution. Moreover, the solutions were generally found very quickly.

5.3 Experiment 3

All of the above data sets were used in this third, resource-bounded, experiment. Specifically, for each data set, we apply the nagging algorithm using a variety of processor configurations, including one processor (i.e., the serial algorithm) and multiple processors (master and up to 16 nagging processes). In each trial, the same resource bound was imposed on master processor computation time, allowing us to make meaningful comparisons of solution quality across processor configurations.

The results are shown in the following figures. Figure 7 illustrates the total description length of the learned structures for trials using randomly-generated structures. Figure 8 illustrates the structural difference of the learned structures for the same trials. Figure 9 summarizes trials using 30-variable and 70-variable randomly-generated structures. Figure 10 summarizes trials using 100-variable and 150-variable randomly-generated structures. Figure 11 summarizes trials using the 2,000-case and 10,000-case ALARM data sets. In general, learning performance, as measured by both metrics, improves as the number of processors increases. For instance, the total description length of a learned structure decreases (i.e., gets better) monotonically as we increase the number of processors. In most cases, the structural difference also decreases as we increase the number of processors, although the improvement is more apparent for networks of high topological complexity. In particular, the results for the network with 150 variables and vertices of degree 4 are very encouraging. Our distributed approach discovers almost 95% of the structure in 80 CPU minutes using roughly a dozen processors; the serial solution, on the other hand, discovers only 39% of the structure. These results clearly support the claim that our distributed approach obtains higher solution quality than the serial algorithm. Note that a learned structure with a slightly higher structural difference value may sometimes have a lower (i.e., better) total description length; this implies it is actually a better structure with respect to the data set under the MDL framework. The results also support our claim that the distributed approach outperforms the simple parallel partitioning scheme. For instance, the total description length of a structure learned using our distributed approach is always better than that of the structure learned by the simple parallel partitioning scheme. Similar results can be observed for the structural difference metric.


[Figure 7 comprises eight plots of total description length versus number of processors, each comparing the nagging approach against simple partitioning, for the randomly-generated networks: 30 variables (average degree 2 and 4), 70 variables (degree 3 and 4), 100 variables (degree 3 and 4), and 150 variables (degree 3 and 4).]

Figure 7: Total Description Length of the Learned Networks for Randomly-Generated Networks


[Figure 8 comprises eight plots of structural difference versus number of processors, each comparing the nagging approach against simple partitioning, for the same randomly-generated networks: 30 variables (average degree 2 and 4), 70 variables (degree 3 and 4), 100 variables (degree 3 and 4), and 150 variables (degree 3 and 4).]

Figure 8: Structural Difference of the Learned Networks for Randomly-Generated Networks


 V   D  B            P     sim-M  sim-A  sim-S  sim-T    nag-M  nag-A  nag-S  nag-T
 30  2  10 seconds   1       8      0      8    59644      8      0      8    59644
                     2-13    2      0      2    58776      2      0      2    58776
 30  4  30 seconds   1      29      1     30    77323     29      1     30    77323
                     2      16      1     17    76040     16      1     17    76040
                     3-6     6      1      7    74717      6      1      7    74717
                     8-13    4      1      5    74687      2      1      3    74295
 70  3  3.5 minutes  1      46      4     50   157609     46      4     50   157609
                     2      21      4     25   153377     21      4     25   153377
                     4,5     9      4     13   151532      8      4     12   151267
                     6       9      4     13   151532      8      6     14   151249
                     8-13    9      7     16   151523      8      7     15   151238
 70  4  4 minutes    1      80      4     84   156147     80      4     84   156147
                     2      46      4     50   153058     46      4     50   153058
                     3-5    29      5     34   151288     19      4     23   150255
                     6      13      4     17   149926      6      4     10   148507
                     8-13   11      6     17   149497      6      5     11   148477

Figure 9: Summary of results for randomly-generated structures with 30 and 70 variables under resource-bounded conditions. We let V denote the number of variables, D the average degree of vertices, B the master processor CPU time bound, and P the number of processors used. sim-M and nag-M denote the number of arcs in the original structure but not in the learned structure for the simple partition and nagging approaches, respectively; sim-A and nag-A denote the number of arcs in the learned structure but not in the original structure; sim-S and nag-S denote the structural difference; and sim-T and nag-T denote the total description length.


 V    D  B             P      sim-M  sim-A  sim-S  sim-T    nag-M  nag-A  nag-S  nag-T
 100  3  9.8 minutes   1       74     5     79    224419     74     5     79    224419
                       2       34     5     39    218384     34     5     39    218384
                       4-6     19     5     24    216527     14     5     19    216439
                       8-17    17     5     22    216498     12     5     17    216340
 100  4  10.5 minutes  1      114     6    120    225772    114     6    120    225772
                       2       61     6     67    218879     61     6     67    218879
                       4       33     6     39    216072     26     6     32    215304
                       6-8     20     8     28    214364     10     9     19    213510
                       12-17   18     8     26    214043     10     8     18    213490
 150  3  73 minutes    1      118    15    133    347477    118    15    133    347477
                       2       52    18     70    338636     52    18     70    338636
                       4       24    18     42    335386     15    20     35    334277
                       6-8     19    18     37    334722     15    20     35    334277
                       10-17   18    19     37    334683     13    19     32    334053
 150  4  80 minutes    1      190     7    197    347784    190     7    197    347784
                       2      106    10    116    339084    106    10    116    339084
                       4       54    10     64    334379     44    10     54    332833
                       6       36    10     46    331364     10    10     20    328905
                       8-10    33    12     45    330893      9     8     17    328886
                       13-17   31    10     41    330572      9     8     17    328886

Figure 10: Summary of results for randomly-generated structures with 100 and 150 variables under resource-bounded conditions. We let V denote the number of variables, D the average degree of vertices, B the master processor CPU time bound, and P the number of processors used. sim-M and nag-M denote the number of arcs in the original structure but not in the learned structure for the simple partition and nagging approaches, respectively; sim-A and nag-A denote the number of arcs in the learned structure but not in the original structure; sim-S and nag-S denote the structural difference; and sim-T and nag-T denote the total description length.


 N      B            P      sim-M  sim-A  sim-S  sim-T    nag-M  nag-A  nag-S  nag-T
 2000   20 seconds   1       20     1     21     39965     20     1     21     39965
                     2        6     1      7     34018      6     1      7     34018
                     5        5     1      6     33679      5     1      6     33679
                     6-13     3     1      4     33474      2     1      3     33423
 10000  2.8 minutes  1       22     1     23    198389     22     1     23    198389
                     2       17     1     18    198101     17     1     18    198101
                     3       14     4     18    176076      5     1      6    160590
                     5        8     2     10    164853      3     0      3    160158
                     7        6     1      7    161337      2     1      3    160157
                     10-13    5     1      6    161109      1     1      2    159314

Figure 11: Summary of results for the ALARM data set under resource-bounded conditions. We let N denote the number of cases, B the master processor CPU time bound, and P the number of processors used. sim-M and nag-M denote the number of arcs in the original structure but not in the learned structure for the simple partition and nagging approaches, respectively; sim-A and nag-A denote the number of arcs in the learned structure but not in the original structure; sim-S and nag-S denote the structural difference; and sim-T and nag-T denote the total description length.


6 Conclusion

We have presented a new distributed algorithm for the Bayesian inference network learning problem. Our algorithm relies on a scoring scheme based on an information-theoretic notion, the Minimum Description Length principle; the distributed version of our algorithm exploits a search pruning strategy based on some well-defined formal properties of this score metric. Our algorithm is intrinsically fault tolerant and scales quite well. It is therefore well suited to loosely-coupled networks of workstations. We also demonstrate the superiority of our distributed solution over a simpler parallel partitioning approach. We have also presented empirical results demonstrating the viability and effectiveness of our approach using a number of different data sets. Since some of our experiments employ standard data sets from the literature, we are able to draw some conclusions about our algorithm's performance with respect to previous work. More precisely, we are able to provide optimal solutions for large problems that were beyond the reach of many other systems, and good solutions for still larger problems of up to 150 boolean variables. These results provide empirical evidence supporting our claims of good scalability to large numbers of loosely-coupled processors.


Appendix

Consider

$$\sum_{X_i,\Pi_i^1,\Phi} M(X_i,\Pi_i^1,\Phi)\,\log_2\frac{M(X_i,\Pi_i^1)}{M(\Pi_i^1)} \;-\; \sum_{X_i,\Pi_i^1,\Phi} M(X_i,\Pi_i^1,\Phi)\,\log_2\frac{M(X_i,\Pi_i^1,\Phi)}{M(\Pi_i^1,\Phi)}$$

$$=\; \sum_{X_i,\Pi_i^1,\Phi} M(X_i,\Pi_i^1,\Phi)\,\log_2\frac{M(X_i,\Pi_i^1)\,M(\Pi_i^1,\Phi)}{M(\Pi_i^1)\,M(X_i,\Pi_i^1,\Phi)}$$

$$\le\; \sum_{X_i,\Pi_i^1,\Phi} M(X_i,\Pi_i^1,\Phi)\left(\frac{M(X_i,\Pi_i^1)\,M(\Pi_i^1,\Phi)}{M(\Pi_i^1)\,M(X_i,\Pi_i^1,\Phi)}-1\right)\log_2(e)$$

(since $\log_2(K) \le (K-1)\log_2(e)$ for $K \ge 0$, where $e$ is the base of the natural logarithms)

$$=\; \left(\sum_{X_i,\Pi_i^1}\frac{M(X_i,\Pi_i^1)}{M(\Pi_i^1)}\sum_{\Phi} M(\Pi_i^1,\Phi) \;-\; \sum_{X_i,\Pi_i^1,\Phi} M(X_i,\Pi_i^1,\Phi)\right)\log_2(e)$$

$$=\; 0,$$

since $\sum_{\Phi} M(\Pi_i^1,\Phi) = M(\Pi_i^1)$ and $\sum_{\Phi} M(X_i,\Pi_i^1,\Phi) = M(X_i,\Pi_i^1)$.

References

[1] B. Abramson. ARCO1: An application of belief networks to the oil market. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 1-8, 1991.

[2] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the 2nd European Conference on Artificial Intelligence in Medicine, pages 247-256, 1989.

[3] R. Bouckaert. Properties of Bayesian belief network learning algorithms. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 102-109, 1994.

[4] W. Buntine. Theory refinement on Bayesian networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 52-60, 1991.

[5] L. Burnell and E. Horvitz. Structure and chance: Melding logic and probability for software debugging. Communications of the ACM, 38(3):31-41, 1995.

[6] E. Charniak and R. Goldman. A probabilistic model of plan recognition. In Proceedings of the AAAI National Conference, pages 160-165, 1991.

[7] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42:393-405, 1990.

[8] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.

[9] R. Fung and B. Del Favero. Applying Bayesian networks to information retrieval. Communications of the ACM, 38(3):42-48, 1995.

[10] D. Heckerman, J. Breese, and B. Nathwani. Toward normative expert systems I: The PATHFINDER project. Methods of Information in Medicine, 31:90-105, 1992.

[11] D. Heckerman, J. Breese, and K. Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49-56, 1995.

[12] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243, 1995.

[13] E. Horvitz and M. Barry. Display of information for time-critical decision making. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 296-305, 1995.

[14] W. Lam. Bayesian network refinement via machine learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear.

[15] W. Lam and F. Bacchus. Using causal information and local measure to learn Bayesian networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 243-250, 1993.

[16] W. Lam and F. Bacchus. Learning Bayesian belief networks - an approach based on the MDL principle. Computational Intelligence, 10(3):269-293, 1994.

[17] M. Morjaia, J. Rink, W. Smith, G. Klempner, C. Burns, and J. Stein. Commercialization of EPRI's generator expert monitoring system (GEMS). In Expert System Application for the Electric Power Industry, Phoenix, 1993. EPRI. Also: GE technical report GER-3790.

[18] S. A. Musman, L. W. Chang, and L. B. Booker. Application of a real-time control strategy for Bayesian belief networks to ship classification problem solving. International Journal of Pattern Recognition and Artificial Intelligence, 7(3):513-526, 1993.

[19] J. Pearl and T. S. Verma. A theory of inferred causation. In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, pages 441-452, 1991.

[20] R. W. Robinson. Counting unlabeled acyclic digraphs. In Proceedings of the 5th Australian Conference on Combinatorial Mathematics, pages 28-43, 1976.

[21] A. M. Segre and D. B. Sturgill. Using hundreds of workstations to solve first-order logic problems. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA, pages 187-192, 1994.

[22] P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62-71, 1991.

[23] P. Spirtes, C. Glymour, and R. Scheines. Causality from probability. In Evolving Knowledge in Natural Science and Artificial Intelligence, pages 181-199, 1990.

[24] P. Spirtes and C. Meek. Learning Bayesian networks with discrete variables from data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 294-299, 1995.

[25] D. B. Sturgill and A. M. Segre. A novel asynchronous parallelization scheme for first-order logic. In Proceedings of the Twelfth Conference on Automated Deduction, Nancy, France, Springer-Verlag Lecture Notes in Computer Science v814, pages 484-498, 1994.

[26] D. B. Sturgill and A. M. Segre. Nagging: A distributed adversarial search-pruning technique applied to first-order logic. Journal of Automated Reasoning, to appear.

[27] J. Suzuki. A construction of Bayesian networks from databases based on an MDL principle. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 266-273, 1993.

[28] J. Suzuki. Learning Bayesian belief networks based on the minimum description length principle: An efficient algorithm using the B & B technique. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 462-470, 1996.

[29] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 220-227, 1990.
