Topic Hierarchy Generation via Linear Discriminant Projection

[Extended Abstract]

Tao Li, Shenghuo Zhu, Mitsunori Ogihara
Computer Science Dept., University of Rochester, Rochester, NY 14627-0226
[email protected], [email protected], [email protected]

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: Content Analysis and Indexing, Information Search and Retrieval; I.2 [Artificial Intelligence]: Learning; I.5 [Pattern Recognition]: Clustering

General Terms

Algorithms, Performance, Experimentation, Verification

Keywords

Text Categorization, Hierarchy, Clustering

1. INTRODUCTION

Text categorization has been receiving more and more attention with the ever-increasing growth of on-line information. Automated text categorization is generally a supervised learning problem: pre-defined category labels are assigned to new documents based on the likelihood suggested by labeled documents. Most studies in the area have focused on flat classification, where the pre-defined categories are treated individually and separately [5]. As the available information increases and the number of categories grows significantly large, browsing and searching categories becomes much more difficult. The most successful paradigm for organizing this mass of information and making it comprehensible is to categorize documents according to their topics, with the topics organized in a hierarchy of increasing specificity [3]. Hierarchical structures identify the relationships of dependence between categories and provide a valuable source of information for many problems.

Recently, several researchers have investigated the use of hierarchies for text classification and obtained promising results [1, 4]. However, little has been done to explore approaches that automatically generate topic hierarchies; most reported techniques have been evaluated on corpora with existing hierarchical structure. Automatic hierarchy generation has several motivations. First, manually building hierarchies is an expensive task, since it requires domain experts to evaluate the relevance of documents to topics. Second, existing hierarchies are optimized for human use based on "human semantics", not necessarily for classifier use; automatically generated hierarchies can be incorporated into various classification methods to help achieve better performance. Third, natural or manually generated hierarchies are not always available. Our first step in automatic hierarchy generation is to measure the similarities between categories with the linear discriminant projection approach; a hierarchical clustering method is then used to build the hierarchy.

2. LINEAR DISCRIMINANT PROJECTION AND HIERARCHY GENERATION

For efficient classification, we need a distance function that captures the inherent similarity structure of the instances and is discriminative for each class. Finding such a distance function generally involves identifying a coordinate transformation that reflects this structure: a transformation matrix W that maps an instance x to the vector W^T x in a new Euclidean space. Fisher's linear discriminant analysis discriminates between classes using the between-class covariance matrix Σ_b and the within-class covariance matrix Σ_w: the columns of W are found by the eigenvalue decomposition of Σ_b with respect to Σ_w. After finding the transformation W, we define the similarity between two classes as the distance between their centroids in the transformed space; in other words, two categories are similar if they are "close" to each other after the projection. The linear discriminant projection finds the transformation that preserves class structure by minimizing the sum of squared within-class distances while maximizing the sum of squared between-class distances, so distances in the transformed space should reflect the inherent structure of the dataset.

After obtaining the similarities/distances between classes, we used a Hierarchical Agglomerative Clustering (HAC) algorithm [2] to generate topic hierarchies from a given set of flat classes. HAC algorithms differ in their policies for combining clusters, such as single-linkage, complete-linkage, Ward's method, and the Unweighted Pair-Group Method with Arithmetic mean (UPGMA). We chose UPGMA, which is known to be simple, efficient, and stable [2]. In UPGMA, the distance between two clusters is the average of the distances between each point in one cluster and all points in the other, and the two clusters with the lowest average distance are joined to form a new cluster. The result of hierarchical clustering is a dendrogram in which similar classes are organized into hierarchies. Different HAC algorithms may produce different dendrogram structures: single-linkage clustering is easily confused by nearby overlapping clusters and tends to produce long chains that form loose, straggly clusters; complete-linkage tends to produce very tight clusters; and Ward's method is compatible only with the Euclidean metric, not with non-Euclidean metrics or semi-metrics.
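As a concrete illustration, here is a minimal sketch of the projection and the resulting inter-class distances, assuming dense numpy feature matrices. The function names, the small ridge term used to keep Σ_w invertible on high-dimensional text data, and the use of scipy's generalized eigensolver are our assumptions, not details from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_projection(X, y, n_components=2):
    """Fisher-style projection: maximize between-class over within-class scatter."""
    y = np.asarray(y)
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))                        # within-class covariance
    S_b = np.zeros((d, d))                        # between-class covariance
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)
    # Generalized eigenproblem S_b w = lambda * S_w w; the ridge keeps S_w
    # positive definite when features outnumber documents.
    evals, evecs = eigh(S_b, S_w + 1e-6 * np.eye(d))
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:n_components]]         # columns form the matrix W

def class_distances(X, y, W):
    """Pairwise Euclidean distances between class centroids after projection."""
    y = np.asarray(y)
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) @ W for c in classes])
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))      # (k, k) symmetric matrix
```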

3. EXPERIMENTS

We used a wide range of datasets in our experiments, most of which have been used in the information retrieval literature. The number of classes ranged from seven to 105 and the number of documents from 2,340 to 20,000; we anticipated that these datasets would provide enough insight into automatic hierarchy generation. The datasets are 20Newsgroups^1, WebKB, Industry Sector^2, Reuters^3, and K-dataset^4. To preprocess the datasets, we removed stop words using a standard stop list and performed stemming with a Porter stemmer. All HTML tags and all header fields except subject and organization were ignored. In all experiments, we first randomly chose 70% of each dataset for hierarchy building (and later for training in classification) and used the remaining 30% for testing. The 70% training set was further preprocessed by selecting the top 1,000 words by information gain.
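To illustrate the feature-selection step, the sketch below keeps the top-k terms by information gain, computed here as mutual information between binary term occurrence and the class label via scikit-learn. The vectorizer settings are assumptions, and the Porter stemming step is omitted for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

def select_top_terms(train_docs, train_labels, k=1000):
    """Keep the k terms with the highest information gain on the training set."""
    vec = CountVectorizer(stop_words="english", binary=True)  # stemming omitted
    X = vec.fit_transform(train_docs)
    ig = mutual_info_classif(X, train_labels, discrete_features=True)
    top = np.argsort(ig)[::-1][:k]                # indices of highest-IG terms
    return np.asarray(vec.get_feature_names_out())[top]
```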

[Figure 1: Hierarchies of WebKB and 20Newsgroups generated by the Discriminant Projection (D.P.) approach: (a) D.P. on WebKB; (b) D.P. on 20NG. Each block represents the similarity between the corresponding row and column document categories; the darker the block, the more similar.]

Figure 1 shows the hierarchies of WebKB and 20Newsgroups. From the dendrograms we can observe the semantic similarity of the classes. We also investigated the effect of exploiting the generated hierarchy for classification, using classification accuracy as the evaluation measure. An obvious way to utilize the hierarchy is a top-down, level-based approach that arranges the clusters in a two-level tree and trains a classifier at each internal node. We analyze the generated dendrogram to determine the clusters that provide maximum inter-class separation and to find the best grouping of classes at the top level: the dendrogram is scanned in a bottom-up fashion to find the distances at which successive clusters merge, and we clip it at the point where the merge distances begin increasing sharply (a sketch of this heuristic follows Table 1). For example, on the 20Newsgroups dataset, the linear discriminant projection approach yields the 8 top-level groups shown in Table 1.

Table 1: Top-level groups for 20Newsgroups.
group  members
1      alt.atheism, talk.religion.misc, talk.politics.guns, talk.politics.misc, talk.politics.mideast
2      sci.space, sci.med, sci.electronics
3      comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.graphics, comp.sys.mac.hardware, comp.windows.x
4      rec.sport.baseball, rec.sport.hockey, rec.motorcycles
5      misc.forsale
6      soc.religion.christian
7      rec.autos
8      sci.crypt
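Here is a sketch of the hierarchy generation and clipping step, assuming `dist` is the inter-class distance matrix from `class_distances` above and at least three classes. UPGMA corresponds to scipy's method="average"; the specific rule for detecting a "sharp increase" (cutting at the largest gap between successive merge distances) is our assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def top_level_groups(dist):
    """UPGMA-cluster the classes, then cut at the sharpest merge-distance jump."""
    Z = linkage(squareform(dist, checks=False), method="average")  # UPGMA
    merge_d = Z[:, 2]                         # successive merge distances
    jump = int(np.argmax(np.diff(merge_d)))   # largest gap between merges
    threshold = (merge_d[jump] + merge_d[jump + 1]) / 2.0
    return fcluster(Z, t=threshold, criterion="distance")  # group id per class
```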

The experimental results reported here were obtained via two-level classification, with LIBSVM^5 as the classifier. LIBSVM is a library for support vector classification and regression that supports multi-class classification; we used a linear kernel in all experiments, as it gave the best results. We first built a top-level classifier (the L1 classifier) to discriminate among the top-level clusters of labels. At the second level (L2), we built classifiers within each cluster of classes, so each L2 classifier can concentrate on a smaller set of classes that confuse one another. In practice, each classifier thus deals with a more easily separable problem and can use an independently optimized feature set, which should yield a slight improvement in accuracy in addition to the gain in training and testing speed (a sketch of this scheme follows Figure 2 below).

Table 2 compares flat classification with hierarchical classification. We observe improved performance on all datasets except Reuters-2, with Reuters-top10 showing the most significant gain in accuracy from using the hierarchy. Figure 2 presents the accuracy comparison for each class of Reuters-top10: the accuracy of every class except trade is improved by using the generated hierarchy.

Table 2: Accuracy of flat vs. two-level hierarchical classification.
Datasets         Flat    Level One  Overall
20Newsgroups     0.952   0.985      0.963
WebKB            0.791   0.860      0.804
Industry Sector  0.691   0.739      0.727
Reuters-top10    0.826   0.963      0.901
Reuters-2        0.923   0.948      0.916
K-dataset        0.915   0.961      0.921

[Figure 2: Accuracy comparison on each class of Reuters-top10.]
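Below is a sketch of the two-level scheme, assuming the `group_of` mapping (class label to top-level group id) comes from the clusters found by `top_level_groups` above. We use scikit-learn's SVC with a linear kernel, which wraps LIBSVM, as a stand-in for the paper's LIBSVM setup; the helper names are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def train_two_level(X, y, group_of):
    """Train an L1 group classifier plus one L2 classifier per multi-class group."""
    y = np.asarray(y)
    groups = np.array([group_of[c] for c in y])
    l1 = SVC(kernel="linear").fit(X, groups)       # discriminate among groups
    l2, singletons = {}, {}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y[mask])) > 1:
            l2[g] = SVC(kernel="linear").fit(X[mask], y[mask])
        else:
            singletons[g] = y[mask][0]             # one-class group: no L2 model
    return l1, l2, singletons

def predict_two_level(X, l1, l2, singletons):
    preds = []
    for i, g in enumerate(l1.predict(X)):          # route each doc by L1 group
        if g in singletons:
            preds.append(singletons[g])
        else:
            preds.append(l2[g].predict(X[i:i + 1])[0])
    return np.array(preds)
```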

4. CONCLUSIONS

The linear discriminant projection approach measures the similarities between categories by the distances between category centroids in the transformed space. A hierarchical clustering method is then used to generate a dendrogram based on these similarity measures. Our experiments demonstrate that the generated hierarchies can improve classification performance on most datasets.

5. REFERENCES

[1] D'Alessio, S., Murray, K., Schiaffino, R., & Kershenbaum, A. (2000). The effect of using hierarchical classifiers in text categorization. RIAO-00 (pp. 302–313).
[2] Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Prentice Hall.
[3] Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. ICML (pp. 170–178).
[4] Sun, A., & Lim, E.-P. (2001). Hierarchical text classification and evaluation. ICDM (pp. 521–528).
[5] Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. SIGIR (pp. 42–49).

^1 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
^2 http://www.cs.cmu.edu/~TextLearning/datasets.html
^3 Two subsets are used: the ten most frequent categories, called Reuters-top10, and the documents that have unique topics, called Reuters-2.
^4 ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata/
^5 http://www.csie.ntu.edu.tw/~cjlin/libsvm