Probabilistic Models for Hierarchical Clustering and Categorisation: Applications in the Information Society

Eric Gaussier, Cyril Goutte

Corresponding address: Xerox Research Center Europe, 6 chemin de Maupertuis, F-38240 Meylan, France. E-mail: Eric.Gaussier, [email protected]

Abstract: We propose a new hierarchical generative model for textual data, where words may be generated by topic-specific distributions at any level in the hierarchy. This model is naturally well-suited to clustering documents in preset or automatically generated hierarchies, as well as categorising new documents in an existing hierarchy. Furthermore, we present a series of applications that can benefit from our model, as well as an experimental evaluation for both clustering and categorisation on frequently used data sets.

Keywords: Probabilistic models, hierarchical clustering, hierarchical categorisation, document modelling.

I. Introduction

Interest in data categorisation continues to grow, largely because of the availability of data through a number of access media, such as the Internet. As the popularity of such media increases, so does the responsibility of data providers to offer quick and efficient access to data. On the user side, the continuously incoming flow of information calls for methods to organise this material in order to provide users with effective ways to handle large amounts of data, usually in textual form. These ways include: filtering non-relevant information, indexing for efficient storage and retrieval, categorising new documents (emails, reports, articles) into pre-defined sets of categories, clustering documents in order to provide a global view on a collection, summarising the information, and finally displaying it to users. The information extracted in these ways can then be mined and organised into knowledge bases, as proposed in eg [1]. Among the general applications above, we focus in this paper on clustering and categorisation, mainly of documents, since: first, these two processes are at the core of many information access systems, and secondly, the methods they rely on (similarity measures, discrimination) can be used in many different settings. Clustering and categorisation can be viewed as two sides of the same coin, inasmuch as clustering documents amounts to identifying groups of objects within a collection, whereas categorisation consists in assigning an object to one or several pre-defined groups. Furthermore, both operations involve hierarchies of groups: clustering may allow data to be hierarchically grouped based on its characteristics, whereas categorisation into subject catalogs offered by data providers such as Yahoo(TM) requires assigning new documents

to one or more nodes in the catalog hierarchy, where general categories are located at the top levels, and lower-level categories are associated with more specific topics. Although conventional organisation techniques, such as hierarchical agglomerative clustering [2] and top-down partitioning methods [3], allow common objects to be grouped together, the resulting hierarchy generally involves a hard assignment of objects to clusters, ie an object is usually assigned to only one cluster in the hierarchy. This form of assignment limits the scope of such clustering techniques, since a document will be assigned to, say, only one topic and may not be retrieved at a later stage upon a search on a different, yet related, topic. We present in this paper a new generative probabilistic model that can induce cluster hierarchies while assigning documents to clusters in a "soft" manner. We also present a version of this model that can be used for categorisation into pre-defined categories, whether organised hierarchically or not. Section 2 introduces these models. We then review in section 3 the motivations behind our model and the different applications we envisage. Lastly, section 4 provides an experimental evaluation, for both clustering and categorisation.

II. Modelling documents

In this contribution we address the problem of modelling documents as collections of co-occurrence data [4]. Rather than considering each document as a vector of word frequencies (bag-of-words), we model the document-word matrix as the result of co-occurrences of words in documents. In that setting, the data can be viewed as a set of triples (i(r), j(r), r), where r = 1...L is an index over the triples, each triple indicating the occurrence of word j(r) in document i(r). An alternative representation is (i, j, n_ij), indicating that word j occurs n_ij times in document i (the index is now over i and j, with several instances of n_ij = 0). In order to structure the data, we assume it was generated from a hierarchical model. Some early attempts at defining such models are Hofmann's Hierarchical Asymmetric Clustering Model (HACM) [4] or Hierarchical Shrinkage [5]. In this paper we propose a new model which has some additional flexibility. In our model the data are generated by the following process (a sampling sketch is given below):
1. Sample a document class α from the distribution P(α),
2. Sample a document i from the class-conditional distribution P(i|α),
3. Sample a word topic τ from the class-conditional distribution P(τ|α),
4. Sample a word j from the topic-conditional distribution P(j|τ).
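For concreteness, the four sampling steps can be sketched as follows. This is a minimal illustration, assuming the model parameters are stored as dense NumPy arrays; the array names are our own convention, not notation from the paper.

```python
import numpy as np

def sample_cooccurrence(rng, P_a, P_i_a, P_t_a, P_j_t):
    """Draw one (document, word) co-occurrence from the generative model.

    P_a   : (A,)   class prior P(alpha)
    P_i_a : (I, A) class-conditional document distribution P(i|alpha)
    P_t_a : (T, A) class-conditional topic distribution P(tau|alpha),
                   zero whenever tau is not an ancestor of alpha
    P_j_t : (J, T) topic-conditional word distribution P(j|tau)
    """
    a = rng.choice(len(P_a), p=P_a)                 # 1. class alpha ~ P(alpha)
    i = rng.choice(P_i_a.shape[0], p=P_i_a[:, a])   # 2. document i ~ P(i|alpha)
    t = rng.choice(P_t_a.shape[0], p=P_t_a[:, a])   # 3. topic tau ~ P(tau|alpha)
    j = rng.choice(P_j_t.shape[0], p=P_j_t[:, t])   # 4. word j ~ P(j|tau)
    return i, j
```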

Fig. 1. An example hierarchical model with classes at the leaves: document classes α1 ... α5 are indicated by filled circles, topics τ0 ... τ25 are indicated by circles (accordingly, classes are also topics). Sampling of a co-occurrence is done by first sampling a class α (ie selecting the fourth leaf here) and a document from the class-conditional P(i|α), then sampling a topic τ from the class-conditional P(τ|α) (second topic at the second level here) and a word from the topic-conditional P(j|τ).
Word topics and document classes may or may not be chosen from identical sets. In a restrictive hierarchy, for example, we may assign documents to classes at the leaves of the hierarchy, while words are sampled from topics located at any node in the hierarchy (figure 1). This therefore allows classes to be linguistically related by sharing some common ancestors/topics. In a more general hierarchy, documents may be assigned to classes at any node of the hierarchy. In all cases, topics τ will be restricted to the set of ancestors of a class α, a fact that we note τ ↑ α. This can also be achieved by setting P(τ|α) = 0 whenever τ is not an ancestor of α. According to this model, the probability of a given co-occurrence (i, j) is:

$$P(i,j) = \sum_\alpha P(\alpha)\,P(i|\alpha)\,\sum_\tau P(\tau|\alpha)\,P(j|\tau)$$

where we have used the conditional independence of documents and words given the class α. The parameters of the model are the discrete probabilities P(α), P(i|α), P(τ|α) and P(j|τ). Note that the conditioning on the parameters in the individual likelihood above is implicit for brevity. Note also that several known models can be written as special cases of ours. For hierarchical models, by decreasing order of complexity:
1. P(i|α) = P(i), ∀α (documents independent of class) yields the HACM [4];
2. Dropping P(α) and P(i|α) (or imposing uniform probabilities on these quantities) yields a hierarchical version of the Naive Bayes model [5], [6];1
3. Flat (ie non-hierarchical) models are naturally represented using P(τ|α) = δ(τ, α) (ie one topic per class), yielding P(i, j) = Σ_α P(α) P(i|α) P(j|α), aka Probabilistic Latent Semantic Analysis (PLSA, [7]);
4. A flat model with a uniform distribution over classes α and documents i reduces to the (flat) Naive Bayes model.
Because of its analogy to PLSA, where both documents and words may belong to different clusters, we will refer to our hierarchical model as HPLSA (Hierarchical Probabilistic Latent Semantic Analysis) in the case of clustering, and PLC (Probabilistic Latent Categoriser) or HPLC (Hierarchical PLC) in the case of categorisation.
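As an illustration of the expression above, here is a minimal sketch that computes P(i, j) for all document-word pairs from dense parameter arrays. The array names are ours; the ancestor constraint τ ↑ α is assumed to be encoded by zeros in the topic-given-class matrix, and replacing that matrix by the identity recovers the flat PLSA special case.

```python
import numpy as np

def cooccurrence_prob(P_a, P_i_a, P_t_a, P_j_t):
    """P(i, j) for all (i, j) under the hierarchical model.

    P_a   : (A,)   class prior P(alpha)
    P_i_a : (I, A) document given class, P(i|alpha)
    P_t_a : (T, A) topic given class, P(tau|alpha); zero for non-ancestors,
                   so the hierarchy is encoded directly in this matrix
    P_j_t : (J, T) word given topic, P(j|tau)
    Returns an (I, J) matrix of co-occurrence probabilities.
    """
    # Inner sum over topics: P(j|alpha) = sum_tau P(tau|alpha) P(j|tau)
    P_j_a = P_j_t @ P_t_a                # (J, A)
    # Outer sum over classes: P(i,j) = sum_alpha P(alpha) P(i|alpha) P(j|alpha)
    return (P_i_a * P_a) @ P_j_a.T       # (I, J)
```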

A. Hierarchical clustering

As is traditional in clustering based on probabilistic models, we rely on a maximum likelihood approach to estimate the values of the different probability distributions our model relies on. In the case of clustering, given an i.i.d. data set D = {(i(r), j(r), r)}, r = 1...L, the likelihood of the parameters can be expressed as the product of the individual likelihoods: P(D) = ∏_r P(i(r), j(r)). Due to the presence of the multiple sums under the product over examples, the (log-)likelihood must be maximised numerically. This is elegantly done using the EM algorithm [8] by introducing the unobserved indicator variables specifying the class and topic choices for each observed co-occurrence. In addition, we used deterministic annealing [9], [10] in conjunction with the EM iterations. Deterministic annealing has two interesting features: first, it has been shown empirically [10], [7] to yield solutions that are more stable with respect to parameter initialisation; secondly, it provides a natural way to grow the hierarchy [11]. This leads us to the following re-estimation formulas (see [12] for more details on the derivation):

$$\langle C_\alpha(r)\rangle = \frac{\left[P(\alpha)\,P(i(r)|\alpha)\,\sum_\tau P(\tau|\alpha)\,P(j(r)|\tau)\right]^\beta}{\sum_{\alpha'}\left[P(\alpha')\,P(i(r)|\alpha')\,\sum_\tau P(\tau|\alpha')\,P(j(r)|\tau)\right]^\beta}$$

$$\langle T_{\alpha\tau}(r)\rangle = \frac{\left[P(\alpha)\,P(i(r)|\alpha)\,P(\tau|\alpha)\,P(j(r)|\tau)\right]^\beta}{\sum_{\alpha'}\left[P(\alpha')\,P(i(r)|\alpha')\,\sum_{\tau'} P(\tau'|\alpha')\,P(j(r)|\tau')\right]^\beta}$$

also known as the E-step formulas, and:

$$P^{(t)}(\alpha) = \frac{1}{L}\sum_r \langle C_\alpha(r)\rangle$$

$$P^{(t)}(i|\alpha) = \frac{\sum_{r:\,i(r)=i} \langle C_\alpha(r)\rangle}{\sum_r \langle C_\alpha(r)\rangle}$$

$$P^{(t)}(\tau|\alpha) = \frac{\sum_r \langle T_{\alpha\tau}(r)\rangle}{\sum_r \sum_{\tau'} \langle T_{\alpha\tau'}(r)\rangle}$$

$$P^{(t)}(j|\tau) = \frac{\sum_{r:\,j(r)=j} \sum_\alpha \langle T_{\alpha\tau}(r)\rangle}{\sum_r \sum_\alpha \langle T_{\alpha\tau}(r)\rangle}$$

also known as the M-step formulas, where the superscript (t) denotes the estimated value at iteration t.

1 The difference between Hierarchical Shrinkage [5] and the Hierarchical Mixture Model [6] boils down to different estimation procedures for P(τ|α).

β is an additional parameter introduced by the annealing, varying between 0 and 1. For β = 0, the annealed distribution is uniform, and for β = 1, we retrieve the posterior P(C, T | i(r), j(r)). Starting from a low value of β, the E-step and M-step formulas are iterated until convergence (guaranteed), and the value of β is progressively increased until the performance on a validation set reaches an optimum. Deterministic annealing EM has another advantage in connection with hierarchical clustering. When starting with β = 0, only one class exists. As β is carefully increased, phase transitions appear, which cause classes to differentiate (typically, one class will split in two). This provides a natural way to grow a hierarchy: starting from a single class, we track phase transitions by duplicating this class, slightly perturbing the duplicates and carefully increasing β until the duplicates diverge into distinct classes. The resulting classes are again duplicated and perturbed slightly while β keeps growing, and so on. While β is increased, the tree keeps growing, until the performance measured by the likelihood on held-out data is optimised.
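For concreteness, here is a minimal NumPy sketch of one annealed EM sweep over the co-occurrence triples, for a fixed hierarchy and a fixed β in (0, 1]. The array names and the exact annealing variant (raising the joint class/topic responsibility to the power β and normalising, which coincides with the formulas above at β = 1) are our own illustration under stated assumptions, not the authors' implementation; growing the hierarchy by tracking phase transitions is not shown, and no smoothing or zero-guarding is applied.

```python
import numpy as np

def em_sweep(doc, word, P_a, P_i_a, P_t_a, P_j_t, beta=1.0):
    """One deterministic-annealing EM sweep for the hierarchical model.

    doc, word : (L,) int arrays with doc[r] = i(r) and word[r] = j(r)
    P_a (A,), P_i_a (I,A), P_t_a (T,A), P_j_t (J,T): current parameters;
    P_t_a is zero for topics that are not ancestors of a class.
    Returns updated copies of the four parameter arrays.
    """
    L = len(doc)
    I, A = P_i_a.shape
    J, T = P_j_t.shape

    # E-step: annealed joint responsibility over (class, topic) per triple,
    # joint[r, a, t] proportional to (P(a) P(i(r)|a) P(t|a) P(j(r)|t)) ** beta
    joint = (P_a[None, :, None]
             * P_i_a[doc][:, :, None]
             * P_t_a.T[None, :, :]
             * P_j_t[word][:, None, :]) ** beta
    joint /= joint.sum(axis=(1, 2), keepdims=True)    # <T_at(r)>
    resp_c = joint.sum(axis=2)                        # <C_a(r)>

    # M-step: re-estimate the discrete distributions from responsibilities.
    new_P_a = resp_c.sum(axis=0) / L
    new_P_i_a = np.zeros((I, A))
    np.add.at(new_P_i_a, doc, resp_c)                 # sum over r with i(r) = i
    new_P_i_a /= resp_c.sum(axis=0, keepdims=True)
    new_P_t_a = joint.sum(axis=0).T                   # (T, A)
    new_P_t_a /= new_P_t_a.sum(axis=0, keepdims=True)
    new_P_j_t = np.zeros((J, T))
    np.add.at(new_P_j_t, word, joint.sum(axis=1))     # sum over classes, scatter on j(r)
    new_P_j_t /= new_P_j_t.sum(axis=0, keepdims=True)
    return new_P_a, new_P_i_a, new_P_t_a, new_P_j_t
```

In an annealing schedule, this sweep would be iterated to convergence at each value of β before β is increased.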

B. Hierarchical categorisation

In a categorisation context, the training data is D = {(i(r), j(r), α(i(r)), r)}, r = 1...L, where α(i(r)) is a class label for document i(r) (footnote 2). The purpose of the categorisation task is to assign one or more categories to a new document d, expressed as a set of co-occurrences (d, j(s), s), where s runs across the number L_d of occurrences in document d (or alternatively (d, j, n_dj)). In the training phase, the model parameters are obtained again by maximising the likelihood. Note that the class labels actually give additional information, and the only latent variable is the choice of the topic (once again, we refer the interested reader to [12]). Categorisation of a new document is carried out based on the posterior class probability P(α|d) ∝ P(d|α)P(α). The class probability P(α) is known from the training phase, and P(d|α) is again estimated iteratively by maximising the (log-)likelihood of the document to categorise:

$$\mathcal{L}(d) = \sum_s \log\left(\sum_\alpha P(\alpha)\,P(d|\alpha)\,\sum_\tau P(\tau|\alpha)\,P(j(s)|\tau)\right)$$

L(d) can be iteratively maximised wrt P(d|α) using again deterministic annealing EM and the following steps, for s = 1...L_d:

E-step:

$$\langle C_\alpha(s)\rangle = \frac{\left[P(\alpha)\,P^{(t)}(d|\alpha)\,\sum_\tau P(\tau|\alpha)\,P(j(s)|\tau)\right]^\beta}{\sum_{\alpha'}\left[P(\alpha')\,P^{(t)}(d|\alpha')\,\sum_\tau P(\tau|\alpha')\,P(j(s)|\tau)\right]^\beta}$$

M-step:

$$P^{(t+1)}(d|\alpha) = \frac{\sum_s \langle C_\alpha(s)\rangle}{\#\alpha + \sum_s \langle C_\alpha(s)\rangle}$$

where #α denotes the number of examples with class α in the training set. Notice that this step is usually much faster than training, as only the P(d|α), ∀α, are calculated and all other probabilities are kept fixed to their training values.
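As a rough sketch of this folding-in step (our own illustration, written at β = 1, ie without the annealing exponent; array names are assumptions, not the authors' code), the posterior over classes for a new document can be obtained as follows:

```python
import numpy as np

def categorise(words, P_a, P_t_a, P_j_t, n_class, n_iter=50):
    """Fold in a new document d and return the posterior P(alpha | d).

    words   : (Ld,) int array of word indices j(s) observed in d
    P_a     : (A,)   class prior from training
    P_t_a   : (T, A) class-conditional topic distribution from training
    P_j_t   : (J, T) topic-conditional word distribution from training
    n_class : (A,)   number of training examples per class (#alpha)
    Only P(d|alpha) is re-estimated; all trained parameters stay fixed.
    Assumes every observed word has non-zero probability under some topic.
    """
    A = len(P_a)
    P_j_a = P_j_t @ P_t_a            # (J, A): P(j|alpha) = sum_t P(t|a) P(j|t)
    lik = P_j_a[words]               # (Ld, A): per-occurrence word likelihoods
    P_d_a = np.full(A, 1.0 / A)       # initial guess for P(d|alpha)
    for _ in range(n_iter):
        # E-step: responsibility of each class for each occurrence
        resp = P_a * P_d_a * lik     # (Ld, A)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: fold the new document in among the #alpha training docs
        s = resp.sum(axis=0)
        P_d_a = s / (n_class + s)
    post = P_a * P_d_a               # P(alpha|d) proportional to P(d|alpha) P(alpha)
    return post / post.sum()
```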

III. Applications

A. Discussion on hard and soft assignments

Depending on the meaning we may give to i and j, different problems can be addressed with our model. Generally, we can view i and j as representing any pairs of co-occurring objects. In the preceding section, we have focused on modelling documents, with i objects representing documents, and j objects representing words. Within the general framework of document clustering, an important application we envisage consists of topic detection, where the idea is to identify topics covered by a set of documents. In such a case, a cluster can be interpreted as a topic defined by the word probability distributions P(j|τ). Our soft hierarchical model takes into account several important properties with respect to this task: 1) a document can cover (or be explained by) several topics (soft assignment of i objects provided by P(i|α)), 2) a topic is best described by a set of words, which may belong to different topics due to polysemy and specialisation (soft assignment of j objects provided by P(j|τ)), and 3) topics are in many instances hierarchically organised, which corresponds to the hierarchy we induce over clusters. Moreover, our use of a general probabilistic model for hierarchies allows us to deal with document collections in which topics cannot be hierarchically organised. In such a case, the probabilities P(τ|α) are concentrated on τ = α, thus rendering a flat set of topics rather than a hierarchy. We obtained such a result on an internal, highly heterogeneous, collection. Another important application we envisage for our model is knowledge structuring (see for example [13]), where it is important first to recognise the different realisations (i.e. terms and their variants) of the main concepts used in a domain, and secondly to organise them into ontologies (footnote 3). A point common to all ways of organising terms in taxonomies is the central role of the "generalisation/specialisation" relation. Traditional approaches [14], [15] to this organisation induce hierarchies by repeatedly clustering terms into nested classes, each identified with a concept. Different kinds of contexts can be considered, from local ones, such as direct syntactic relations, or small windows centered on the term under consideration, to global ones, such as sentences or paragraphs. However, problems in such clustering approaches arise from the fact that both terms and their contexts may be polysemous, and thus need to be assigned to different clusters. Polysemy of words, whether regarded as terms under focus or contexts of terms, and polytopicality of textual units, at various levels of granularity (from sentences to complete documents), thus call for models able to induce hierarchies

2 We assume that each example is assigned to a single class. Examples assigned to multiple classes are replicated such that each replication has only one class.
3 We call here ontologies the taxonomies into which concepts of a domain are organised.

of clusters while assigning objects to different clusters. The additional flexibility provided by our model over previously proposed ones amounts exactly to the ability of soft-clustering both objects instead of one. Even though we have focused on clustering in the above discussion, the same facts hold when categorising new documents.

B. Different objects, different combinations, different applications

As mentioned before, by considering different realisations of the i and j objects, we can address different clustering and categorisation problems. Thus, our model can be used for clustering images based on the text that is associated with the images. In this configuration, the i objects would represent the images and the j objects would represent the words contained in the title of each image. Then, the clusters would exhibit images with similar content, this content being characterised by the word distribution associated with the image. The soft clustering, through the P(i|α) distributions, could then serve as a way to segment images into "meaningful" pieces, meaningful in the sense that each piece can be described in natural language.

Also, one can envisage clustering companies based on their activity domain or customer relationships, and directing new customers, through categorisation, to the appropriate sector(s) of the company. In this case, i objects would represent business units, and j objects would represent a description of a customer. Since the assignment of an i object to a cluster does not require all j objects within the cluster to co-occur with i, we can expect in this case to discover "sub-markets" common to different business units. Such sub-markets would obviously include (some of) the customers common to the units, but would also select, based on their description, the customers that should be part of this sub-market.

We can also extend the proposed model in order to deal with data made of three rather than two objects. Such a situation arises when one wants to add a model of the user, or of the document environment, to the document model proper. In this case, we can reformulate the likelihood function to optimise by taking into account the dependencies the model should rely on. We can also decouple the model into two sub-models sharing one object. In this latter case, we can proceed in two steps by first computing the different probability distributions from one sub-model, and then computing the ones related to the second model only, while holding the others fixed. This is very similar to the folding-in approach we used for categorisation.

Lastly, clustering and categorisation can be combined in order to provide users with interactive ways to organise their data. Users should be able to validate and correct induced hierarchies, by merging sub-trees, discarding some of them and expanding others. They should then be able to select a few documents they believe to be relevant for a given cluster, and let the system assign the remaining documents to one or several clusters. Furthermore, during the process of categorising new documents, the system should be able to detect that a given document does not naturally fit within the existing categories, and should then propose the addition of a new category, at the appropriate level of the hierarchy. When a category receives many documents, the system should also be able to detect whether the category should be expanded into several ones or not. Even though we have not developed such a system, all its ingredients can be directly derived from computations on the probability distributions of the clustering and categorisation models we have presented.

IV. Experimental validation

In order to evaluate our approach, we consider two different settings. First we use clustering to identify documents with similar topics in a collection of incoming news stories. Secondly, we try to categorise newsgroup messages in an existing hierarchy of newsgroups.

A. Clustering

We used a collection of labelled documents from TDT-1, provided by the Linguistic Data Consortium for the Topic Detection and Tracking (TDT) project. It consists of news articles from CNN and Reuters covering various topics such as the Oklahoma City bombing, the Kobe earthquake, etc. Documents were manually labelled as belonging to one of 22 topics. We randomly selected 16 topics, amounting to 700 documents with 12700 different words. Labels were first removed from the documents, which were then clustered using our Hierarchical Probabilistic Latent Semantic Analysis (HPLSA) model. We compared the results to those obtained using (flat) Probabilistic Latent Semantic Analysis (PLSA) proposed by [7] and Hierarchical Shrinkage (HS) proposed by [5]. The comparison with PLSA allows us to assess the influence of the hierarchy, while the comparison with HS will illustrate the advantage of using soft class assignment as opposed to a naive Bayes approach.

The adequacy between the clusters induced by the model and the manual labels is estimated using the Gini index, averaged over labels and clusters:

$$G_L = \frac{1}{L}\sum_l \sum_\alpha \sum_{\alpha'\neq\alpha} P(\alpha|l)\,P(\alpha'|l)$$

where L is the number of labels, measures the impurity of the resulting clusters wrt the labels l, and

$$G_\alpha = \frac{1}{K}\sum_\alpha \sum_l \sum_{l'\neq l} P(l|\alpha)\,P(l'|\alpha)$$

where K is the number of clusters, measures the impurity of the labels l wrt the resulting clusters α. Smaller values indicate closer correspondence, with an upper bound at 1 and a lower bound at 0 obtained when document clusters match the manual labels. In our experiment, the results are the following:

          G_L    G_alpha
PLSA      0.34   0.30
HS        0.40   0.45
HPLSA     0.20   0.16
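For clarity, a small sketch of the cluster-impurity computation is given here. It assumes a matrix whose rows are the (normalised) distributions P(α|l), and uses the identity that the sum over distinct pairs equals one minus the sum of squared probabilities; transposing the roles of labels and clusters gives G_alpha.

```python
import numpy as np

def gini_impurity(P_c_given_l):
    """Average Gini impurity of clusters with respect to labels.

    P_c_given_l : (L, K) matrix, row l holds the distribution P(alpha | l).
    Returns (1/L) sum_l sum_{a != a'} P(a|l) P(a'|l).
    """
    # For a normalised row p: sum_{a != a'} p_a p_a' = 1 - sum_a p_a**2
    return np.mean(1.0 - np.sum(P_c_given_l ** 2, axis=1))
```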

Fig. 2. The hierarchy resulting from the clustering experiment. Each box represents a node, with the 5 most frequent words indicated: (1) oklahoma, united, city, fbi, states; (2) mcveigh, bombing, oklahoma, fbi, city; (3) building, people, city, oklahoma, rescue; (4) people, oklahoma, states, bombing, fbi; (5) nichols, oklahoma, mcveigh, federal, bombing; (6) oklahoma, city, building, people, children; (7) building, people, area, rescue, city.

On this data, HPLSA outperforms both PLSA and HS. We see that although PLSA does not use a hierarchy, it ranks in between the two hierarchical models. This is because HS results in most cases in a hard assignment of documents to clusters, whereas both PLSA and HPLSA result in a soft assignment, which is beneficial when there is some overlap between the classes (cf. figure 2). In addition, let us note that PLSA tends to outperform other flat categorisation methods like K-means (see eg [4]). In figure 2 we present the hierarchy we obtained by clustering the 273 documents related to the Oklahoma City bombing (containing 7684 different non-empty words). The hierarchy is a complete binary tree containing 7 nodes and 4 leaves (figure 2). For each node we show the five most common words, ie the words with the 5 highest values of P(j|τ). The data is first separated into two topics corresponding to the investigation and the description of the event itself (nodes (2) and (3) respectively). Node (2) is then split in two parts, corresponding to the investigation itself (node (4)) and the trial (node (5)). Node (3) is divided in two parts corresponding to the bombing and casualties (node (6)), and the work of the rescue team (node (7)). Note that some words, like oklahoma, appear in several nodes, because they appear frequently in all documents. The data is thus best explained by including these words in several topics/clusters.

B. Categorisation

We illustrate the use of our hierarchical categorisation model on the task of categorising messages from 15 different newsgroups organised in a natural hierarchy ([6] and figure 3). The only preprocessing we perform on the data is removing all tokens that do not contain at least one alphabetic character. In particular, we do not filter out words or tokens with very low frequency, thus keeping some amount of noise in the data. In order to assess the effect of sample size on the different

methods, we use datasets of increasing sizes, from 10 to 200 messages per newsgroup, with 2/3 for training (7 to 134 messages per newsgroup) and 1/3 for testing. The testing results are averaged over 10 replications of the same size. We include 4 methods in our comparison:
- Naive Bayes (NaBa) with smoothed P(j|τ) (footnote 4);
- Probabilistic Latent Categorisation (PLC), with 15 classes, no hierarchy;
- Hierarchical Mixture (HiMi) [6] with uniform P(τ|α);
- Hierarchical PLC (HPLC) using tempered EM.
The choice of uniform class-conditional topic distributions for HiMi is due to a tendency to overfit, especially with small samples. Possible remedies include using held-out data or cross-validation to estimate these parameters. As it is difficult to implement these elaborate strategies in a reproducible manner, we have chosen here to use a uniform distribution on topics. Although sub-optimal, this strategy performs much better than the degenerate ML solution, which is equivalent to Naive Bayes.

Results are displayed in table I, using micro- and macro-averaged F1 scores. The F1 score is a balance between the precision p (ratio of true positives returned over all returns) and recall r (ratio of true positives returned over all positives): F1 = 2pr/(p + r). We average over the 15 classes using micro-averaging and macro-averaging [16]. Because of the moderately high noise level, Naive Bayes performs rather badly, with a classification rate some 10 to 20 percent lower than reported in [6] (F1 is equal to the classification accuracy). As we retain all tokens with at least one alphabetic character which appear at least once, the topic-conditional word distribution P(j|τ) will include many words which occur very rarely in the collection. Even with smoothing, this will tend to alter the probability of documents that contain those rare words.

The performance of (flat) PLC is comparable to that reported by [6] for their Hierarchical Mixture model. Although it uses a hierarchy, HiMi does not manage to outperform the flat PLC model in our experiments. This is due to two factors. First, HiMi is a hierarchical version of Naive Bayes, and thus suffers the same predicament regarding the increased noise level mentioned earlier. Second, it is important to note that the estimation of the class-conditional topic distribution P(τ|α) is different here from that proposed in [6].

The hierarchical version of PLC (HPLC) does not manage to perform significantly better than flat PLC, but displays similar performance overall. This suggests that the hierarchy is not used to its full extent, and that HPLC also suffers from overfitting of the P(τ|α), although less severely than HiMi. Additional experiments using a fixed uniform P(τ|α) distribution resulted in similar performance. We take this as an indication that, although there is clearly room for improvement in the parameter estimation, HPLC does not seem too sensitive to these parameters.

4 Using Lidstone smoothing with λ = 0.5.
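As a reminder of the difference between the two averages used in the table (a generic sketch, not tied to the paper's implementation):

```python
import numpy as np

def micro_macro_f1(tp, fp, fn):
    """Micro- and macro-averaged F1 from per-class counts.

    tp, fp, fn : (K,) arrays of true positives, false positives and
    false negatives for each of the K classes (assumed non-degenerate).
    """
    # Macro: compute F1 per class, then average the K scores.
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    macro = np.mean(2 * p * r / (p + r))
    # Micro: pool the counts over all classes, then compute a single F1.
    P = tp.sum() / (tp.sum() + fp.sum())
    R = tp.sum() / (tp.sum() + fn.sum())
    micro = 2 * P * R / (P + R)
    return micro, macro
```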

Fig. 3. The natural newsgroup hierarchy, with 15 newsgroups at the leaves, organised in 5 topics: SPORTS (rec.sport.baseball, rec.sport.hockey), RELIGION (alt.atheism, soc.religion.christian, talk.religion.misc), POLITICS (talk.politics.guns, talk.politics.mideast, talk.politics.misc), MOTORS (rec.autos, rec.motorcycles), COMPUTERS (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x).

Messages/group:     10            20            30            50            100           200
              micro  macro  micro  macro  micro  macro  micro  macro  micro  macro  micro  macro
NaBa          .380   .393   .456   .483   .451   .470   .502   .516   .573   .582   .610   .631
PLC           .540   .557   .630   .645   .665   .675   .734   .743   .763   .769   .805   .812
HiMi          .458   .450   .402   .411   .501   .501   .526   .524   .682   .671   .736   .734
HPLC          .542   .578   .633   .647   .666   .676   .734   .742   .766   .775   .806   .812

TABLE I

Categorisation results for Naive Bayes (NaBa), Probabilistic Latent Categorisation (PLC), Hierarchical Mixture (HiMi) and Hierarchical PLC (HPLC). Results are given as micro-averaged and macro-averaged values of F1.

V. Conclusion

We have presented in this paper a generative probabilistic model for hierarchical clustering and categorisation of co-occurring objects. This model generalises over probabilistic models already proposed for the same tasks. We have shown how to estimate the parameters of this model, and have evaluated its performance, both for clustering and categorisation, on two standard datasets. We have also described a number of applications, central to the Information Society, that our model should allow us to address.

References

[1] Rayid Ghani, Rosie Jones, Dunja Mladenic, Kamal Nigam, and Sean Slattery, "Data mining on symbolic knowledge extracted from the web," in Proceedings of the Workshop on Text Mining, Knowledge Discovery in Databases Conference (KDD'00), 2000.
[2] P. Willet, "Recent trends in hierarchical document clustering: a critical review," Journal of Information Processing and Management, 1988.
[3] P. Willet, "Document clustering using an inverted file approach," Journal of Information Science, 1980.
[4] Thomas Hofmann and Jan Puzicha, "Statistical models for co-occurrence data," A.I. Memo 1625, A.I. Laboratory, February 1998.
[5] Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, and Andrew Y. Ng, "Improving text classification by shrinkage in a hierarchy of classes," in Proceedings of the Fifteenth International Conference on Machine Learning, 1998, pp. 359-367.
[6] Kristina Toutanova, Francine Chen, Kris Popat, and Thomas Hofmann, "Text classification in a hierarchical mixture model for small training sets," in Proceedings of the ACM Conference on Information and Knowledge Management, 2001.
[7] Thomas Hofmann, "Probabilistic latent semantic analysis," in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999, pp. 289-296, Morgan Kaufmann.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
[9] K. Rose, E. Gurewitz, and G. Fox, "Statistical mechanics and phase transitions in clustering," Physical Review Letters, vol. 65, no. 8, pp. 945-948, 1990.
[10] Naonori Ueda and Ryohei Nakano, "Deterministic annealing variant of the EM algorithm," in Advances in Neural Information Processing Systems 7, Gerry Tesauro, David Touretzky, and Todd Leen, Eds. 1995, pp. 545-552, MIT Press.
[11] Fernando Pereira, Naftali Tishby, and Lillian Lee, "Distributional clustering of English words," in Proceedings of the International Conference of the Association for Computational Linguistics, 1993.
[12] Eric Gaussier, Cyril Goutte, Kris Popat, and Francine Chen, "A hierarchical model for clustering and categorising documents," in BCS-IRSG (submitted), 2002.
[13] E. Gaussier and N. Cancedda, "Probabilistic models for terminology extraction and knowledge structuring from documents," in Proceedings of the 2001 IEEE International Conference on Systems, Man & Cybernetics, 2001.
[14] G. Grefenstette, Explorations in Automatic Thesaurus Construction, Kluwer Academic Publishers, 1994.
[15] G. Salton, Automatic Thesaurus Construction for Information Retrieval, North Holland Publishing, 1972.
[16] Yiming Yang and Xin Liu, "A re-examination of text categorization methods," in Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42-49.