Bayesian Classification with Correlation and Inheritance

Robin Hanson (Sterling Software)    John Stutz (NASA)    Peter Cheeseman (RIACS*)

Artificial Intelligence Research Branch
NASA Ames Research Center, Mail Stop 244-17
Moffett Field, CA 94035, USA
Internet: @ptolemy.arc.nasa.gov

Abstract

The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass system searches for the most probable classifications, automatically choosing the number of classes and the complexity of the class descriptions. Simpler versions of AutoClass have been applied to many large real data sets, have discovered new independently-verified phenomena, and have been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters through a class hierarchy.

1 Introduction

The task of supervised classification - i.e., learning to predict class memberships of test cases given labeled training cases - is a familiar machine learning problem. A related problem is unsupervised classification, where training cases are also unlabeled. Here one tries to predict all features of new cases; the best classification is the least "surprised" by new cases. This type of classification, related to clustering, is often very useful in exploratory data analysis, where one has few preconceptions about what structures new data may hold.

We have previously developed and reported on AutoClass [Cheeseman et al., 1988a; Cheeseman et al., 1988b], an unsupervised classification system based on Bayesian theory. Rather than just partitioning cases, as most clustering techniques do, the Bayesian approach searches in a model space for the "best" class descriptions. A best classification optimally trades off predictive accuracy against the complexity of the classes, and so does not "overfit" the data. Such classes are also "fuzzy"; instead of each case being assigned to a class, a case has a probability of being a member of each of the different classes.

* Research Institute for Advanced Computer Science


AutoClass III, the most recent released version, combines real and discrete data, allows some data to be missing, and automatically chooses the number of classes from first principles. Extensive testing has indicated that it generally produces significant and useful results, but is primarily limited by the simplicity of the models it uses, rather than, for example, inadequate search heuristics. AutoClass III assumes that all attributes are relevant, that they are independent of each other within each class, and that classes are mutually exclusive. Recent extensions, embodied in AutoClass IV, let us relax two of these assumptions, allowing attributes to be selectively correlated and to have more or less relevance via a class hierarchy.

We begin by describing the Bayesian theory of learning, and then apply it to increasingly complex classification problems, from various single class models up to hierarchical class mixtures. Finally, we report empirical results from an implementation of these extensions.

2 Bayesian Learning

Bayesian theory gives a mathematical calculus of degrees of belief, describing what it means for beliefs to be consistent and how they should change with evidence. This section briefly reviews that theory, describes an approach to making it tractable, and comments on the resulting tradeoffs. In general, a Bayesian agent uses a single real number to describe its degree of belief in each proposition of interest.

2.1 Theory

Let E denote some evidence that is known or could potentially be known to an agent; let H denote a hypothesis specifying that the world is in some particular state; and let the sets of possible evidence E and possible states of the world H each be mutually exclusive and exhaustive sets. In general, P(ab|cd) denotes a real number describing an agent's degree of belief in the conjunction of propositions a and b, conditional on the assumption that propositions c and d are true. More specifically, π(H|S) is a "prior" describing the agent's belief in H before, or in the absence of, seeing evidence E, P(H|ES) is a "posterior" describing the agent's belief after observing some particular evidence E, and L(E|HS) is a "likelihood" embodying the agent's expectations about which evidence would be seen in each state H.

In theory, all an agent needs to do in any given situation is to choose a set of states H, an associated likelihood function describing what evidence is expected to be observed in those states, a set of prior expectations on the states, and then collect some evidence. Bayes' rule then specifies the appropriate posterior beliefs about the state of the world, which can be used to answer most questions of interest.

2.2 Practice

In practice this theory can be difficult to apply, as the sums and integrals involved are often mathematically intractable. Here is our approach. Rather than consider all possible states of the world, we focus on some smaller space of models, and do all of our analysis conditional on an assumption S that the world really is described by one of the models in our space. This assumption is almost certainly false, but it makes the analysis tractable.

The parameters which specify a particular model are split into two sets. First, a set of discrete parameters T describe the general form of the model, usually by specifying some functional form for the likelihood function. For example, T might specify whether two variables are correlated or not, or how many classes are present in a classification. Second, free variables in this general form, such as the magnitude of the correlation or the relative sizes of the classes, constitute the remaining continuous model parameters V.

We generally prefer a likelihood¹ L(E|VTS) which is mathematically simple and yet still embodies the kinds of complexity relevant in some context. Similarly, we prefer a simple prior distribution dπ(VT|S) over the model space, allowing the resulting V integrals, described below, to be at least approximated. We also usually prefer a relatively broad and uninformative prior, and one that gives nearly equal weight to different levels of model complexity, resulting in a "significance test". Adding more parameters to a model then induces a cost, which must be paid for by a significantly better fit to the data before the more complex model can be preferred.

The joint can now be written as dJ(EVT|S) = L(E|VTS) dπ(VT|S) and, for a reasonably-complex problem, is usually a very rugged distribution in VT, with an immense number of sharp peaks distributed widely over a huge high-dimensional space. Because of this we despair of directly normalizing the joint, as required by Bayes' rule, or of communicating the detailed shape of the posterior distribution. Instead we break the continuous V space into regions R surrounding each sharp peak, and search until we tire for combinations RT for which the "marginal" joint

    M(ERT|S) = ∫_R dJ(EVT|S)

is as large as possible. The best few such "models" RT found are then reported, even though it is usually almost certain that more probable models remain to be found. Each model RT is reported by describing its marginal joint M(ERT|S), its discrete parameters T, and estimates of typical values of V in the region R, such as the mean estimate of V,

    ∫_R V dJ(EVT|S) / M(ERT|S),

or the V for which dJ(EVT|S) is maximum in R. While these estimates are not invariant under reparameterizations of the V space, and hence depend on the syntax with which the likelihood was expressed, the peak is usually sharp enough that such differences don't matter. A weighted average of the best few models found is used to make predictions. Almost all of the weight is usually in the best few, justifying the neglect of the rest.

Even though the sums and integrals can be difficult, and large spaces must be searched, Bayesian theory offers the advantages of being theoretically well-founded and empirically well-tested [Berger, 1985]; one can almost "turn the crank", modulo doing integrals and search², to deal with any new problem. Disadvantages include being forced to be explicit about the space of models one is searching in, and occasional ambiguities regarding what an appropriate prior is. Also, it is not clear how one can take the computational cost of doing a Bayesian analysis into account without a crippling infinite regress.

We will now illustrate this general approach by applying it to the problem of unsupervised classification.

¹ A variable like V in a probability expression stands for the proposition that the variable has a particular value.
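As a small illustration of this reporting step (our sketch, not part of AutoClass; the models' log marginal joints and their predict method are assumed placeholders), the best few models RT found by search can be combined for prediction by normalizing their marginal joints into relative weights:

    import math

    def posterior_weights(log_marginals):
        """Normalize the log marginal joints M(ERT|S) of the best few models
        found into relative posterior weights, guarding against overflow."""
        m = max(log_marginals)
        ws = [math.exp(lm - m) for lm in log_marginals]
        total = sum(ws)
        return [w / total for w in ws]

    def predict(models, log_marginals, case):
        """Weighted-average prediction over the best few models, as described
        above. Each model is assumed to supply a predictive density for a case."""
        weights = posterior_weights(log_marginals)
        return sum(w * m.predict(case) for w, m in zip(weights, models))

Since almost all of the weight usually falls on the best few peaks, truncating the average to those few loses little.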

3 Single Class Models

For all the models to be considered in this paper, the evidence E will consist of a set of I cases, an associated set K of attributes, of size³ K, and case attribute values⁴ Xik, which can include "unknown."⁵ For example, medical case number 8, described as (age = 23, blood-type = A, ...), would have X8,age = 23, X8,blood-type = A, and so on. In this section and the next we will describe applications of Bayesian learning theory to various kinds of models which could explain this evidence, beginning with simple model spaces and building more complex spaces from them. We begin in this section with a single class. First, a single attribute is considered, then multiple independent attributes, then fully covariant attributes, and finally selective covariance. In the next section we combine these single classes into class mixtures, first flat then tree-based.

² The joint probability provides a good local evaluation function for searching, though.
³ We use script letters like K to denote sets, and matching ordinary letters K to denote their size.
⁴ Nothing in principle prevents a Bayesian analysis of more complex model spaces for relational data.

[Table 1: Model Spaces Described - table not reproduced here]

For each space S we will describe the continuous parameters V, any discrete model parameters T, normalized likelihoods L(E|VTS), and priors dπ(VT|S). As most spaces have no discrete parameters T, and only one region R, we can usually ignore these parameters. Approximations to the resulting marginals M(ERT|S) and estimates will be given, but not derived. These will often be given in terms of general functions F which are common to various models. As appropriate, comments will be made about algorithms and computational complexity. All of the likelihood functions considered here assume the cases are independent, i.e., L(E|VTS) = Π_i L(Ei|VTS), so we need only give L(Ei|VTS) for each space, where Ei denotes the evidence associated with case i.

4 Single Class Models

4.1 Single Discrete Attribute - SD1

A discrete attribute k allows only a finite number L of possible values l for any Xik. A set of independent coin tosses, for example, might have L = 3 with l1 = heads, l2 = tails, and l3 = "unknown". If we make the assumption SD1 that there is only one discrete attribute, then the only parameters are the continuous V = q1 ... qL, consisting of the likelihoods for each possible value l. In the coin example, q1 = .7 would say that the coin had a 70% chance of coming up heads each time.

⁵ If the fact that a data value is unknown might be informative, one can model "unknown" as just another possible (discrete) data value; otherwise the likelihood for an unknown value is just a sum over the possible known values.
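The marginal and estimates for this space can be illustrated with a small sketch (ours; it assumes a symmetric Dirichlet prior on q1 ... qL, whereas the paper's exact prior and its F1, F2 helper functions are not reproduced in this extraction):

    from math import lgamma

    def log_marginal_discrete(counts, a=1.0):
        """Log marginal likelihood for a single discrete attribute, integrating
        the value probabilities q1..qL against an assumed symmetric Dirichlet(a)
        prior. counts[l] is the number of known cases with value l."""
        n = sum(counts)
        L = len(counts)
        return (lgamma(L * a) - lgamma(L * a + n)
                + sum(lgamma(a + c) - lgamma(a) for c in counts))

    def mean_estimates(counts, a=1.0):
        """Posterior mean estimates of q1..qL under the same assumed prior."""
        n = sum(counts)
        L = len(counts)
        return [(c + a) / (n + L * a) for c in counts]

    # Coin example from the text: 7 heads and 3 tails among 10 known tosses.
    print(mean_estimates([7, 3]))   # roughly [0.67, 0.33]

The prior-derived pseudo-counts keep the estimates away from 0 and 1, and the marginal automatically penalizes attributes with many possible values.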


4.3 Independent Attributes - SI

We now introduce some notation for collecting sets of indexed terms like Xik: a single such term inside a {} will denote the set of all such indexed terms, collected over the omitted indices. δuv denotes 1 when u = v and 0 otherwise, and Γ(y) is the Gamma function [Spiegel, 1968].

⁸ Actually, log(weight) is more normally distributed.

4.6 Block Covariance - SV

Rather than just having either full independence or full dependence of attributes, we prefer a model space SV where some combinations of attributes may covary while others remain independent, with full or no dependence as special limiting cases. This allows us to avoid paying the cost of specifying covariance parameters when they cannot buy us a significantly better fit to the data. Our approach to partial dependence is simply to partition the attributes K into D blocks Kb, each of size Kb, with full covariance within each block and full independence between blocks.¹³ We also currently prohibit reals and discretes in the same covariant block.
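As an illustration of this block structure only (the Wishart-based marginal used by AutoClass is not reproduced in this extraction, so each block is scored here with a BIC-style penalized Gaussian fit instead), the score of a partition decomposes into a sum of independent block scores, with full covariance (one block) and full independence (all singleton blocks) as the limiting cases:

    import numpy as np

    def block_score(X):
        """Penalized Gaussian log-likelihood of one covariant block.
        X is an (I cases) x (Kb attributes) array with no unknown values.
        This BIC-style score stands in for the Wishart-based marginal joint."""
        I, Kb = X.shape
        cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(Kb)
        sign, logdet = np.linalg.slogdet(cov)
        loglike = -0.5 * I * (Kb * np.log(2 * np.pi) + logdet + Kb)
        n_params = Kb + Kb * (Kb + 1) / 2          # means plus covariances
        return loglike - 0.5 * n_params * np.log(I)

    def partition_score(X, blocks):
        """Score a partition of attribute indices into covariant blocks;
        blocks are independent, so their scores simply add."""
        return sum(block_score(X[:, list(b)]) for b in blocks)

    # Limiting cases for 4 attributes:
    # full independence: [(0,), (1,), (2,), (3,)]; full covariance: [(0, 1, 2, 3)]

Because the total is a sum over blocks, merging or splitting blocks only requires rescoring the attributes involved.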

⁹ L' can be a very large number!
¹⁰ F1 and F2 are defined in Section 4.1.
¹¹ At present, we lack a satisfactory way to approximate this marginal when some values are unknown.
¹² To obtain these formulas, certain free hyperparameters in the Wishart prior have been set using simple statistics from the data. This is more robust and simplifies the math, but is "cheating" because priors are supposed to be independent of the data. Similar cheating was also done in Section 4.2.
¹³ We choose it because it is easy, fits well with our model of class hierarchy, and allows full dependence.


5 Class Mixtures

5.1 Flat Mixtures - SM

The above model spaces Sc = SV or SI can be thought of as describing a single class, and so can be extended by considering a space SM of simple mixtures of such classes [Titterington et al., 1985]. With Sc = SI, this is the model space of AutoClass III, and Figure 1 shows how it can fit a set of artificial real-valued data in five dimensions. In this model space the likelihood sums over products of "class weights" αc, each giving the probability that any case would belong to class c of the C classes, and class likelihoods describing how members of each class are distributed. In the limit of large C this model space is general enough to be able to fit any distribution arbitrarily closely, and hence is "asymptotically correct". The parameters V and T combine the parameters Vc and Tc for each class with the parameters αc and C describing the mixture. The prior is similarly broken down.

Here the αc are treated as if the choice of class were another discrete attribute, except that a C! is added because classes are not distinguishable a priori.

Except in very simple problems, the resulting joint dJ(EVT|S) has many local maxima, and so we must now distinguish regions R of the V space. To find a local maximum we use the "EM" algorithm [Dempster et al., 1977], which is based on the fact that at a maximum the class parameters Vc can be estimated from weighted sufficient statistics. Relative likelihood weights wic = αc L(Ei|VcTcSc) / Σc' αc' L(Ei|Vc'Tc'Sc'), satisfying Σc wic = 1, give the probability that a particular case i is a member of class c. Using these weights we can break each case into "fractional cases", assign these to their respective classes, and create new "class data" with new weighted class sufficient statistics, obtained by using weighted sums Σi wic instead of simple sums Σi. Substituting these statistics into any previous class likelihood function L(E|VcTcSc) gives a weighted likelihood and associated new estimates and marginals. At the maximum, the weights wic should be consistent with the estimates they produce. To reach a maximum we start out at a random seed and repeatedly use our current best estimates of V to compute the wic, and then use the wic to re-estimate the V, stopping after 10 - 100 iterations when they both predict each other. Integrating the joint in R can't be done directly, because the product of a sum in the full likelihood is hard to decompose, but by using fractional cases to approximate the likelihood while holding the wic fixed, we get an approximate marginal.
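As a concrete sketch of this procedure (ours, for illustration; AutoClass's actual re-estimates also fold in the priors, which are omitted here), one EM iteration for a flat mixture with independent real attributes (Sc = SI) can be written as:

    import numpy as np

    def em_step(X, alpha, mu, var):
        """One EM iteration for a flat mixture of classes with independent
        (diagonal) Gaussian attributes. X is I x K; alpha has length C;
        mu and var are C x K. Returns updated (alpha, mu, var, weights)."""
        I, K = X.shape
        C = len(alpha)
        # E step: relative likelihood weights w[i, c] proportional to
        # alpha_c * L(E_i | Vc Tc Sc), normalized over classes.
        log_w = np.zeros((I, C))
        for c in range(C):
            log_w[:, c] = (np.log(alpha[c])
                           - 0.5 * np.sum(np.log(2 * np.pi * var[c])
                                          + (X - mu[c]) ** 2 / var[c], axis=1))
        log_w -= log_w.max(axis=1, keepdims=True)
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)
        # M step: weighted sufficient statistics ("fractional cases").
        n_c = w.sum(axis=0)                       # weighted class totals
        alpha = n_c / I
        mu = (w.T @ X) / n_c[:, None]
        var = (w.T @ (X ** 2)) / n_c[:, None] - mu ** 2 + 1e-6
        return alpha, mu, var, w

Iterating this step 10 - 100 times from a random seed, as described above, drives the weights and the estimates to predict each other at a local maximum.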

Our standard search procedure begins each converging trial from classes built around C random case pairs. The number of classes C is chosen randomly from a log-normal distribution fit to the Cs of the 6 - 10 best trials seen so far, after trying a fixed range of Cs to start. We also have developed alternative search procedures which selectively merge and split classes according to various heuristics. While these usually do better, they sometimes do much worse. The marginal joints of the different trials generally follow a log-normal distribution, allowing us to estimate during the search how much longer it will take on average to find a better peak, and how much better it is likely to be.

In the simpler model space SMI, where Sc = SI, the computation cost grows with N, with the data size, and with the average number of classes over the search trials; N is the number of possible peaks, out of the immense number usually present, that a computation actually examines. In the covariant space SMV, where Sc = SV, the corresponding cost is larger.
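The log-normal fit to the trial marginals mentioned above suggests a simple stopping heuristic; the following sketch (our illustration, not necessarily the AutoClass heuristic) estimates the chance that one more random restart will beat the best peak found so far:

    import math
    from statistics import mean, stdev

    def chance_next_trial_is_better(log_marginals):
        """Treat the trials' log marginal joints as samples from a normal
        distribution (i.e., the marginals as log-normal) and estimate the
        probability that one more trial beats the best seen so far."""
        best = max(log_marginals)
        m, s = mean(log_marginals), stdev(log_marginals)
        z = (best - m) / s
        # Standard normal upper-tail probability via the error function.
        return 0.5 * math.erfc(z / math.sqrt(2))

When this probability, weighted by the expected size of the improvement, drops below the cost of another trial, the search can be stopped.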

5.2 Class Hierarchy and Inheritance - SH

When there are many attributes and each class must have its own set of parameters for each of these attributes, multiple classes are strongly penalized. Attributes which are irrelevant to the whole classification, like a medical patient's favorite color, can be particularly costly. To reduce this cost, one can allow classes to share the specification of parameters associated with some of their independent blocks. Rather than allow arbitrary combinations of classes to share blocks, it is simpler to organize the classes as leaves of a tree. Each block can be placed at some node in this tree, to be shared by all the leaves below (farther from the root than) that node. In this way different attributes can be explained at different levels of an abstraction hierarchy. For medical patients the tree might have viral-infections near the root, predicting fevers, and some more specific viral disease near the leaves, predicting more disease-specific symptoms. Irrelevant attributes like favorite-color would go at the root. To make predictions about a member of a leaf class, one first inherits down the attribute descriptions from classes above it.

Therefore the above class mixture model space SM can be generalized to a hierarchical space SH by replacing the above set of classes with a tree of classes, and using the tree to inherit specifications of class parameters. From the view of the parameters specified at a class, all of the classes below that class pool their weight into one big class. Figure 3 shows some sample trees, and Figure 2 shows how a class tree, this time with Sc = SV, can better fit the same data as in Figure 1.

A tree of classes has one root class r. Every other class c has one parent class Pc, and every class has Jc child classes given by Ccj, where the index j ranges over the children of a class. Each child class has a weight

relative to its siblings, with the sibling weights summing to one, and an absolute weight given by the product of the relative weights on the path down from the root, whose absolute weight is one. Each class has an associated set of attributes which it predicts independently through a likelihood, and which no class above or below it predicts. To avoid having redundant trees which describe the same likelihood function, only the root's kept attribute set can be empty, and non-leaves must have more than one child. We need to ensure that all attributes are predicted somewhere at or above each leaf class. So we call Ac the set of attributes which are predicted at or below each class c, start with the root's Ar equal to the full attribute set K, and then recursively partition each Ac into attributes Kc "kept" at that class, and hence predicted directly by it, and the remaining attributes to be predicted at or below each child. For leaves Ac = Kc. Expressed in terms of the leaves, the likelihood is again a mixture,

allowing the same EM procedure as before to find local maxima. The case weights here, wci (with wri = 1), sum like in the flat mixture case and define class statistics. We also choose a similar prior, though it must now specify the Kc as well:

for all subsets Kc of Ac with sizes in the allowed range (the range differing when Ac = Kc). Note that this prior is recursive, as the prior for each class depends on what attributes have been chosen for its parent class. This prior says that each possible number of attributes kept is equally likely, and given the number to be kept, each particular combination is equally likely. This prior prefers the simpler cases of Kc = Ac and Kc = 1 and so again offers a significance test. In comparing the hypothesis that all attributes are kept at a class with the hypothesis that all but one particular attribute is kept at that class, this prior penalizes the all-but-one hypothesis in proportion to the number of attributes that could have been kept instead. The marginal joint becomes

a weighted combination of the class marginals, and the estimates take the same weighted forms as before. In the general case of SHV, where Sc = SV, computation takes the same order as in the flat case, except that the class-count factor is now, for each attribute k, an average of the number of classes in the hierarchy which use that k (i.e., have k in Kc). Since this is usually less than the number of leaves, the model SH is typically cheaper to compute than SM for the same number of leaves.
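The inheritance bookkeeping can be sketched as follows (our illustration, not AutoClass code): each node keeps a set Kc of attributes that it predicts for every leaf below it, passes the remaining attributes to its children, and a leaf's absolute weight is the product of the relative weights on the path from the root.

    class ClassNode:
        """A node in the class tree: keeps attribute set Kc, predicts those
        attributes for every leaf below it, and splits the rest among children."""
        def __init__(self, kept, children=None, rel_weights=None):
            self.kept = set(kept)                  # Kc, predicted at this node
            self.children = children or []          # child ClassNodes (Ccj)
            self.rel_weights = rel_weights or []    # relative weights, summing to 1

        def leaf_models(self, inherited=frozenset(), weight=1.0):
            """Yield (absolute weight, attributes predicted at or above the leaf)
            for every leaf below this node; the root is called with weight 1."""
            covered = inherited | self.kept
            if not self.children:
                yield weight, covered
            else:
                for a, child in zip(self.rel_weights, self.children):
                    yield from child.leaf_models(covered, weight * a)

    # Example: favorite-color explained once at the root, shared by both leaves.
    root = ClassNode({"favorite-color"},
                     children=[ClassNode({"fever", "rash"}),
                               ClassNode({"fever", "cough"})],
                     rel_weights=[0.6, 0.4])
    for w, attrs in root.leaf_models():
        print(w, sorted(attrs))

Because a block placed at an internal node is specified only once, its parameters are shared by all the leaves below it rather than repeated per leaf.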

Searching in this most complex space SHV is challenging. There are a great many search dimensions where one can trade off simplicity and fit to the data, and we have only begun to explore possible heuristics. Blocks can be merged or split, classes can be merged or split, blocks can be promoted or demoted in the class tree, EM iterations can be continued farther, and one can try a random restart to seek a new peak. But even the simplest approaches to searching a more general model space seem to do better than smarter searches of simpler spaces.

6 Results

Figure 1: AutoClass III Finds Three Classes. We plot attributes 1 vs. 2, and 3 vs. 4, for an artificial data set. One-deviation ovals are drawn around the centers of the three classes.

Figure 2: AutoClass IV Finds Class Tree 10^120 Better. Lists of attribute numbers denote covariant blocks within each class, and the ovals now indicate the leaf classes.

We have built a robust software package, AutoClass III, around the flat independent model SMI, with utilities for reading data, controlling search, and viewing results, and have released it for general use. It is written in Common Lisp, and runs on many different machines. We, and others, have applied this system to many large and real databases. When applied to infrared stellar data we found new, independently verified, phenomena [Goebel et al., 1989]. Others have successfully applied it to protein structure [Hunter and States, 1991].


We are now developing AutoClass IV around the full hierarchical block covariant model space SHV. As hoped, it gives a significant improvement over the previous version when applied to real data, though we have only tried it on small problems so far. On the standard IRIS flowers data (150 cases, 4 attributes) we find a model over 10^68 times more probable than the independent model, i.e., with a marginal joint (absolute value 10^-913) that much larger.¹⁴ This compares with typical improvements of 10^2 for doubling the search time or for using a smarter merging search.

Figures 1, 2, and 3 illustrate a similar gain on an artificial data set with 400 cases and 5 real attributes. Figure 1 shows how the data is distributed in attributes #1 and #2, and in attributes #3 and #4 (attribute #5 is not shown). Superimposed upon this is the best result found with AutoClass III. Since the data is dominated by a covariance between attributes #1 and #2, the independent model tries to model this by stringing classes along the 1-2 covariance axis. There is too little data to justify any more structure in this model class, so the #3, #4 and #5 axes are basically ignored. Figure 3 shows a progression of models between this model and the current best answer from AutoClass IV, given in Figure 2. Each new model adds one new feature and gains an improved marginal joint.

Moving from a flat independent model to a flat fully covariant model, the best model found is over 10^81 times more probable, with 4 classes, each of which has significant covariances in 2 pairs of attributes. This large gain comes from a much better fit to the data, despite the extra costs paid for the added class and covariance parameters. But on inspection one finds that the four covariant classes are virtually the same in the 1-2 projection and show little correlation with the other attributes. We therefore split the covariance blocks and raise the (1 2) block to the common root for another relative increase of 10^30. This process can be repeated on pairs of the (3 4 5) blocks to get the 3-level tree shown in Figure 2, gaining another 10^9 in relative probability. We have thus gained a total factor of over 10^120 in relative marginal probability over the best classification found using the independent model. Of that total, about 10^39 comes from the fact that the tree now requires fewer parameters to specify a similar likelihood. These preliminary results support previous indications that the ability to represent various kinds of structures in the data is the major limiting factor of such a system.

Figure 3: Each New Feature Allows a Better Model. The marginal probabilities improve by large factors as we fit each new class tree.

¹⁴ Cross-validation would probably be a better test here, since our priors are fairly crude.

References

[Berger, 1985] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, 1985.

[Cheeseman et al., 1988a] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass: a Bayesian classification system. In Proceedings of the Fifth International Conference on Machine Learning, 1988.

[Cheeseman et al., 1988b] P. Cheeseman, M. Self, J. Kelly, J. Stutz, W. Taylor, and D. Freeman. Bayesian classification. In Seventh National Conference on Artificial Intelligence, pages 607-611, Saint Paul, Minnesota, 1988.

[Dempster et al., 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. B, 39:1-38, 1977.

[Titterington et al., 1985] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, New York, 1985.

[Goebel et al., 1989] J. Goebel, K. Volk, H. Walker, F. Gerbault, P. Cheeseman, M. Self, J. Stutz, and W. Taylor. A Bayesian classification of the IRAS LRS atlas. Astron. Astrophys., 222:L5-L8, 1989.

[Hanson et al., 1991] Robin Hanson, John Stutz, and Peter Cheeseman. Bayesian classification theory. Technical Report FIA-90-12-7-01, NASA Ames Research Center, Artificial Intelligence Branch, 1991.

[Hunter and States, 1991] Lawrence Hunter and David States. Applying Bayesian classification to protein structure. In IEEE Conference on Applications of AI, 1991.

[Mardia et al., 1979] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press Inc., New York, 1979.

[Spiegel, 1968] Murray Spiegel. Mathematical Handbook of Formulas and Tables. McGraw-Hill Book Company, New York, 1968.