Textual Similarity based on Proper Names - CiteSeerX

Textual Similarity based on Proper Names N. Friburger, D. Maurel, A. Giacometti LI, Laboratoire d’Informatique de Tours, Tours, France {Friburger, Maurel, Giacometti}@univ-tours.fr

Abstract Proper names represent about 10% of English or French newspaper articles. Their quantity and informational quality is already used in different Information Extraction systems. Proper names have widely been studied in the MUC conferences designed to promote research in Information Extraction. We have created our own named entity extraction tool based on a linguistic description with automata. The extracted names are used in an information retrieval process: we want to cluster journalistic texts with a high precision level and to provide a description of the topic of the clusters. We verify the interest in the use of proper names in a similarity measure to improve clustering. Keywords: Clustering, similarity, proper names, IE, IR

Introduction Proper names have been widely studied in the field of Information Extraction; we think that they can also play a role in systems of Information Retrieval. We suggest using the proper names to automatically cluster newspaper articles. The quantity of proper nouns and their informative quality in this type of texts make them relevant to improve the clustering thanks to a measure of similarity that highlights them with regard to the other words in a text. In this article, we verify the following premise: proper names are more important than other words in a text to characterize its content. The results of an automatic classification should be improved with a measure of similarity between texts, based on proper names. In the first part of this paper, the system used to extract the proper nouns is described. Then we present different schemes of measures of similarity that we created. Finally, results obtained with an algorithm of Hierarchical Agglomerative Clustering using this measure is presented.

1. Proper names extraction The Named Entity Task is classically defined for information extraction use. Before MUC conferences, researches based on proper names extraction existed in English language but were not sufficient. For example, Coates-Stephens (1992) describes his system named Funes along with a lot of syntactic and semantic rules describing and structuring contexts of the proper names in English texts. Lots of experiences have been conducted since this work but it is most complete in an NLP point of view in English language. MUC structures the Named Entity Task and distinguishes three types of named entity: ENAMEX, TIMEX and NUMEX (Chinchor, 1997). TIMEX contains time expressions and NUMEX contains numbers and percentages. We are only interested in the ENAMEX entities composed of proper names. ENAMEX is limited to proper names and acronyms categorized as follows: - Organization: named corporate, governmental or other organizational entity, for example Boston Chicken Corp. or Pentagon - Person: named person or family, for example Bill Clinton - Location: name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.). Examples are Silicon Valley or Germany

The Named Entity Task has obtained very good results from the first time it was described (this task appeared in MUC-6 in 1995 [Grishman, R., Sundheim, B: (1996)] ). Most of the systems get a 90% score in recall and precision, and the best score is 96% recall with a 97% precision. In fact, results are very close to those of a human expert (97% of recall). The quantity of proper names and informational quality make them relevant to improve unsupervised clustering with a better similarity measure between texts. 1.1 Proper names in terms of quality and quantity Newspapers contain a lot of proper names that provide important information on the contents of texts. Most of the names are unknown (Mani et al, 1996) and cannot be stored in dictionaries because of their quantity and because proper names belong to opened classes of words. (Mikheev, 1999) studied the use and the impact of gazetteers and showed that locations need a dictionary to be extracted but persons and organizations can be found without such sources. (Coates Stephens, 1992) studied proper names in English: he realized a syntactic and semantic study on the appearance of names (apposition, compound proper names, etc.) and he created the Funes system to extract them using the rules he described. Coates-Stephens specifies that proper names represent 10% of journalistic texts. We have studied the French newspaper Le Monde on which we have concluded that proper names are about 10% too. The three main types of proper names are not equal in term of occurrence, appearing context and amount in newspaper articles. We want to automatically organize collections of documents into interesting categories to allow an easiest browsing. Person names represent 39.8% of proper names in our texts, whereas Organization represents 16.3%, which is less than persons and locations. Location names represent 43.9% of proper names. The most common names are locations closely followed by persons; organizations are less numerous.

1.2 Our extraction system There are two main ways to extract proper names: - Description of rules using lists of words, trigger words and linguistic clues

- Learning

We have designed an extraction tool using linguistic clues and a kind of learning. Our tool uses finite state transducers1: they allow an easy description of the grammar of proper names, marking them out in the texts (in a XML-like mark-up language). Transducers are automata that have an input alphabet and an output alphabet. The input alphabet describes patterns that we want to locate in the texts; the output alphabet contains information on the type of the proper name. We use the FST Toolbox of Intex system (Silberztein, 2000). Our tool, described in (Friburger, Maurel, 2001) uses two stages to extract proper names, consisting in Finite State Transducer Cascades (FST cascades are described Abney’s work (Abney, 1996)): - The first stage searches for names thanks to a description of their contexts and to dictionaries of first names, place names and occupation nouns. The transducers locate left and right contexts that indicate the presence of the proper name. The grammar describing proper names contexts are very lexicalised. - The second step uses proper names already found by the first step like a dictionary. It allows finding the remaining proper names with the proper names found by the first stage. The drawback of such a second stage is that the errors made in the first stage remain in the second but our first step is rather precise.

The system extracts about 92% of proper names with 97% precision. This quantity of proper names seems to be sufficient to create a measure of similarity with names. 2 Textual similarities based on proper names To compute a similarity, standard techniques represent texts as term vectors (or bags of words). (Voorhees, 1999) precises that current Information Retrieval systems consist in two main processes: - Indexing: the text is tokenized and stopwords are removed to keep only potentially interesting words. The remaining terms are lemmatized or stemmed. - Matching: the similarity measure between two vectors is computed. In our system, we represent a text as two subvectors of a term vector, such as named by (Salton et al, 1983): - The first subvector is composed of the words of the text that are not proper names (common words). Those words are lemmatized with dictionaries and the Intex Sytem created by (Silberztein , 2000) - The second subvector is composed of the proper names of the texts with their types (organization, person or place names).

1

Finite state automata and transducers are more and more used in natural language processing [16]. The advantages to use transducers are their robustness, precision and speed.

The similarity simwords (respectively simnames) of the two subvectors containing words (respectively containing proper names) is computed. The merged similarity of the two subvectors is given by formula 1 in which α and β are coefficients used to weigh the two similarity measures (α+β ≤ 1).

αsimc(di,d j )+ βsimp(di,d j ) α +β

sim(di,d j )=

Formula 1: merged similarity of the two subvectors of words

In section 3, we present tests in which the coefficients α and β vary to find the best value for them. 2.3 Pragmatic knowledge on proper names Our extraction system allows us to extract person names with the forename (thanks to a forename dictionary) when the text precises it. In a newspaper article, person names and forenames are most of the time given at least one time at the beginning of the text. In the body of the text, the forename is not given but we consider that it is the same person. A journalist doesn't cite two persons with the same name without distinguishing them with their forenames. If there are different persons in the same text with the same name but with different forenames (ex: Bill Clinton and Hilary Clinton), the term frequency is shared. For example, if we find in the same text the phrase president Clinton, it is difficult to know if president Clinton refers to Bill or Hilary in a simple automatic way. In two texts, we compare the names of persons: if they match, we compare the forenames if they are known. If the forename are different, it is not the same person and the weight of this name between the two texts is zero. Our extraction system can recognize organization names and their abbreviated forms if precised in the text: for example, United Nations Organization is equal to UNO. The two forms of organisation names are synonymous and match. When two texts hold synonymous organization names, weights are computed taking this feature into account. 2.2 Similarity measure Given a collection of texts C, the vector of the text Di is defined as: the weights wij of all its terms :

Di =(wDi1,wDi2,...,wDit ) To verify that proper names can improve the quality of similarity measures, we have chosen the very famous TF.IDF measure described in Salton, Buckley (1988). This measure gives a good representation of term weights. TF.IDF is composed of two parts: TFik is the frequency of the term Tk in the text Di. whereas IDFk is the Inverse Term Frequency of the term Tk in the whole collection C described in formula 2.

( )

idf k =log N nk

Formula 2: Inverse Document Frequency of the text k

where N is the number of texts in collection C, and nk is the number of documents in C containing the term tk. The weight wik of a term is normalized as in formula 3.

wik =

tfik ⋅idf k t

∑(tf ) ⋅(idf ) 2

ik

k

2

k =1

Formula 3: weight of of a term I in the text k

This normalization is used to avoid too big deviation between vectors when the two texts they represent are too different in term of length. Finally, the similarity between two texts is given by the classic formula 4. t

sim(di,d j )=∑wik ⋅w jk k =1

Formula 4: similarity between 2 texts di and dj

3 Experimental Results 3.1 Corpus used The problem of evaluation is to have a corpus with a classification already done. We have created our own corpus using the cluster hypothesis. AMARYLLIS is a French Program, very similar to TREC : each evaluation campaign proposes queries on a French corpus; the systems participating in the campaign must find the texts answering the queries in the corpus. In TREC, the best answers are constituted by the concatenation of the answers supplied by every system then selected by human evaluators. AMARYLLIS answers are constructed by human experts. The results proposed by the systems participating in Amaryllis are used to revised the human experts’ answers. The cluster Hypothesis is proposed by (van Rijsbergen, 1979) : “closely associated documents tend to be relevant to the same requests”. Thanks to this hypothesis, we have created classes of texts that are relevant for the same queries in AMARYLLIS and we obtain a classified corpus. We can compare our results to the classified corpus. We have evaluated the results on several collections of texts2 created with this corpus. Moreover, we propose a topic with all the clusters we have obtained, close to (Cooley, Clifton, 1999) TopCat System. The topic is composed of proper names and common nouns. The type of extracted names is used to propose an answer on the questions who? and where? on the topic of the clusters. Here is an example of the topic of cluster found by our system: Who? Paula Jones, Monica Lewinsky, Bill Clinton Where ? White House

2

The texts are provided by ELDA. www.elda.fr

Other Words : complaint, meeting, divorced, survey, love, republic, job, wife, democrate …

hire, talk, administrative,

The proper names common to the clustered texts allow the user to imagine the topic of the cluster in a better way than the other words. Everybody guesses what the subject of this cluster is. In this article, our purpose is to verify the importance of proper names in similarity measure. In the following part, we present how we have verified this fact. 3.2 Evaluation method We choose to cluster the texts with Hierarchical Agglomerative Clustering algorithm (HAC). In the hierarchical approach the clusters are arranged in a tree in which related clusters occur in the same branch of the tree. The clustering with HAC is unsupervised and provides a clustering without a-priori known number of classes. Moreover this clustering provides an output of very high quality (Voorhees, 1986), (Willett, 1988). We use the Complete Link method to merge the clusters: the similarity of the new cluster is the similarity of the two less similar members of the cluster. Given two clusters ci and cj,, and x and y two texts,

sim(ci,c j )= min sim(x, y ) x∈ci , y∈c j

Formula 5: Complete Link Criterium

This method is known to be the most effective (but the most expensive in time, which is not important for our purpose). The Complete Link offers tightly bound, as said by Zamir et al.(199), spherical and more compact clusters. This measure allows to disconnect the different parts of the tree, cutting the tree with a very low threshold. The other two very well known HAC methods (Single Link and Average Link) are known to exhibit a tendency to create clusters without end (it is the scale effect). When the clustering is finished, we compute the quality of the clustering with a cluster analysis. We have verified the quality of the clusters in two ways : The first way is to verify that the set of clusters minimizes the inter-cluster similarity and maximizes the intra-cluster similarity. In other words, the best possible space would consist in many vectors gathered into small clusters and the clusters would be as far apart as possible from each other. The second way is to compute the entropy of the set of clusters. The entropy is a measure of the disorder of a classification. We use the measure proposed by (Boley, 1998). In his article, Boley presents a divisive clustering method and evaluates the results with entropy. Each document is labeled with the number of the class it belongs to the classification of reference we have created. We compare the reference to the results obtained by our own clustering system. The entropy ec of one cluster c is :

  c(i,c)  c(i,c)    ec =−∑ log  ∑c(i,c)   i  ∑c(i, c) i  i   Formula 6: Entropy of the cluster c

c(i,c) is the number of occurrences of the label i in the cluster c. The entropy ec is null when all the texts of c belong to the same class, otherwise ec >0. The entropy ec growns with the disorder of the cluster c. The total entropy eT of the set of clusters is given by formula 7.

eT = 1 ∑ec ⋅.nc m c Formula 7: Entropy of all the clusters

With m the total number of clusters, nc the number of texts in the cluster c. 3.3 Results The results show that there are 4 proper names on average in the 10 words having the biggest TF.IDF value between clusters. This is a first evidence of the importance of proper names. We use the TF.IDF similarity measure on all the words of the texts (without distinguishing proper names). We have computed the inter cluster and intra cluster in the different clusterings. The coefficients α varies decreasingly between 1 to 0 (for words alone) whereas β increases between 0 to 1 (for names). In other words: - If α=1, β=0 then the clustering is made on words alone - If α=0, β=1 then the clustering is made on proper names alone. - the other values of α and β are variations of the proportion of words and proper names. In Figure 1 and Figure 2, the dotted line represents the value of intra-cluster similarity for a classic TF.IDF measure taking all of the words of a text into account without distinguishing common words and proper names. As Figure 1 shows, the intra-cluster similarity increases with coefficient β; moreover, the intra-cluster similarity with proper names only is better than a classical measure. The inter-cluster similarity decreases with the increase of β. Classical TF.IDF and the very high values of β (0.6≤β≤0.8) obtains the best score. We note that proper names alone obtain a very high inter-cluster similarity which is not a good criterion for a good clustering. The words alone obtain bad results for intra-cluster similarity.

Intra-cluster Sim ilarity 0,11 0,1 0,09 0,08 0,07 0,06 0,05 0,04 words

a=0.8,

a=0.6,

a=0.4,

a=0.2,

Proper

alone

b=0.2

b=0.4

b=0.6

b=0.8

names alone

Figure 1: Intra-cluster similarity for our test

Inter-cluster Sim ilarity 0,04 0,03 0,02 0,01 0 words

a=0.8,

a=0.6,

a=0.4,

a=0.2,

Pr oper

alone

b=0.2

b=0.4

b=0.6

b=0.8

names alone

Figure 2: inter-cluster Similarity

We finally present the results obtained with the measure of entropy. They clearly demonstrate the importance of names in newspaper texts. The biggest disorder is obtained with the clusterings using the similarity measure with the words alone (α = 1, β = 0,), the similarity using all the words without distinction and the similarity using α = 0.8 and β = 0.2. When α < 0.6 and β > 0.4, the entropy (disorder) decreases enormously with the increase of the importance of proper names. We can notice that proper names alone give results very close to those with a low level of common words (very low values of α).

6

a=0 b=1

5

a=0,1 b=0,9 a=0,2 b=0,8

4

a=0,3 b=0,7 a=0,4 b=0,6

3

a=0,6 b=0,4 a=0,8 b=0,2

2

a=1 b=0 All the words

1

0 1

Figure 3: Entropy of the sets of clusters

The experimental results show that proper names alone can provide a good clustering, whereas the other words alone cannot clearly cluster the texts. But proper names are not sufficient to provide the better clustering, the other words are needed. This study shows that the value of α and β maximizes intra-cluster similarity and minimizes inter-cluster similarity when:

α = 0.2 (for common words) β = 0.8 (for proper names) We see that results were obtained for clustering with proper names whereas words alone were unable to provide good clusters. Conclusion Names are very special words and represent about 10% of a text in a newspaper article. We tried to see if a best similarity measure can be obtained using proper names: we show that proper names are very important for the similarity measurement. We have designed a measure in which the similarity of the vector of words and the similarity of the vector of proper names are weighed with two coefficients to increase or decrease their importance. The best clustering results are obtained for a weighing scheme promoting proper names. The premise that quantity and informational quality of proper names make them relevant is confirmed. Names really improve unsupervised clustering with a better similarity measure between texts. The same results should be obtained with the English language, thanks to the same amount of proper names in the French or English language. References Abney S. (1996) Partial Parsing via Finite-State Cascades, In Workshop on Robust Parsing, 8th European Summer School in Logic, Language and Information Ed., Prague, Czech Republic, pp.8-15. Boley D.L. (1998) Principal Direction Divisive Partitioning, Data Mining and Knowledge Discovery, 2(4):325-344. Chinchor N. (1997). Muc-7 Named Entity Task Definition,

http://www.itl.nist.gov /iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html#appendices Coates-Stephens S. (1993) The Analysis and Acquisition of Proper Names for the Understanding of Free Text, Kluwer Academic Publishers, Hingham, MA. Cooley R., Clifton C.(1999). Topcat: Data mining for topic identication in a text corpus. In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases, 1999. Friburger N., Maurel D. (2001). Finite state transducer Cascade to extract Proper Nouns in French text, 2nd Conference on Implementing and Application of Automata : in Lecture Notes in Computer Science, Pretoria (South Africa). Grishman R., Sundheim B (1996). Message Understanding Conference - 6: a brief history, In Proceedings of the Sixth Message Understanding Conference. Morgan Kaufmann. Mani I.; MacMillan R. T. (1996). Identifying Unknown Proper Names in Newswire Text, In Corpus Processing for Lexical Acquisition, MIT Press. Cambridge, MA, pp. 41-59. Mikheev A., Moens M., Grover C.(1999). Named Entity Recognition without Gazetteers, In EACL'99 Ed Salton G., and Buckley C. (1988). "Term-Weighting Approaches in Automatic Text Retrieval" Information Processing &Management, 24(5), pp. 513-523. Salton G., Fox E.A. and H. Wu (1983). Extended boolean information retrieval. Commun. ACM , 26(12), 1022-1036. Silberztein M. (2000). INTEX: an FST toolbox, Theoretical Computer Science, (234)33-46. van Rijsbergen C.J. (1979). Information Retrieval (2nd edition), Butterworths, London. Voorhees, E.M. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing and Management, 22, pp. 465-476. Voorhees, E.M. (1999). "Natural Language Processing and Information Retrieval," in Pazienza, MT (ed.), Information Extraction: Towards Scalable, Adaptable Systems, New York: Springer, 1999, pp. 32-48. Willett P. (1988) Recent trends in hierarchic document clustering: a critical review. In Information Processing and Management. 24(5), pp. 577-597. Zamir O., Etzioni O., Madani O. and Karp R.M. (1997) Fast and intuitive clustering of Web documents. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pp. 287-290.