Best Terms: An Efficient Feature Selection Algorithm for Text Categorization

Dimitris Fragoudis 1 ([email protected])
Dimitris Meretakis 2 ([email protected])
Spiros Likothanassis 1,3 ([email protected])

1 Computer Engineering & Informatics Department, University of Patras, Rio – Patras GR-26500, Greece
2 Independent Consultant, Zurich, Switzerland
3 Computer Technology Institute, Riga Fereou 613, Patras GR-26221, Greece

Daytime phone: +30 61 997755, Fax: +30 61 997706
Contact email: [email protected]

Abstract

In this paper we propose a new algorithm for feature selection, called Best Terms (BT). BT is linear with respect to the number of training-set documents, while it is independent of both the vocabulary size and the number of training-set categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM). Experimental results show that BT leads to a considerable increase in both the efficiency and the effectiveness of NB and SVM, compared to an extensive list of other feature selection methods. In most cases the training time of SVM drops by an order of magnitude, while there are cases where the simple, but very fast, NB algorithm proves to be more effective even than SVM.

1. Introduction

As the number of electronic documents available on the Internet or corporate intranets increases daily, effective retrieval, routing or filtering of text information has become an important component of many information organization and management tasks. An increasingly useful tool for managing this vast amount of data is text categorization: the task of assigning one or more predefined categories to natural language text documents, based on their contents.

Numerous statistical classification methods and machine learning techniques have been applied to text categorization in recent years. Some state-of-the-art text categorization algorithms include Naïve Bayes (NB) [McCallum & Nigam 1998], Support Vector Machines (SVM) [Dumais et al. 1998; Joachims 1998], k-Nearest Neighbor (kNN) [Yang 1999; Yang & Liu 1999], Ridge Regression (RR) [Zhang & Oles 2001], Linear Least Squares Fit (LLSF) [Yang 1999; Yang & Liu 1999] and Logistic Regression (LR) [Zhang & Oles 2001].

One particularity of the text categorization problem is that the number of features (unique words or phrases) can easily reach tens of thousands [Blum & Langley 1997; Lewis 1992]. This raises serious hurdles for applying many sophisticated learning algorithms to text categorization (e.g. LLSF). Feature selection has therefore been a major research area within text categorization, since reducing the number of features used to represent documents is an absolute requirement for most machine learning algorithms [Blum & Langley 1997].


The aim of feature selection methods is to reduce the dimensionality of the dataset by removing features that are considered irrelevant for the classification. This transformation has been shown to offer a number of advantages, including smaller dataset size, lower computational requirements for the text categorization algorithms (especially those that do not scale well with the feature set size) and considerable shrinking of the search space. The goal is to mitigate the "curse of dimensionality" and thus yield improved classification accuracy.

Another benefit of feature selection is its tendency to reduce overfitting, i.e. the phenomenon by which a classifier is tuned to the contingent characteristics of the training data rather than the constitutive characteristics of the categories, and therefore to increase generalization [Sebastiani 2002].

In this paper we propose a new, linear-time algorithm for feature selection, called Best Terms (BT). The BT algorithm works in two steps. Given a target class c, it first searches all documents that belong to c for a set of features that predict c with high precision. Then it searches all documents that do not belong to c, but contain at least one of the features selected during the first step, for a set of features that predict c̄ with high precision (c̄ is the complement of c). The new feature set consists of the union of the two sets of features selected during the first and the second step of BT, respectively.

Our experimental results with standard benchmark datasets show that when the BT algorithm is used for feature selection, instead of the traditional filter approach [John et al. 1994], there is a considerable increase in both the efficiency and the effectiveness of two well-known classifiers, namely NB and SVM.

The remainder of the paper is organized as follows. Section 2 presents the state of the art in feature selection methods. Section 3 describes the BT algorithm in detail. Section 4 provides a brief overview of the NB and SVM classifiers, while Section 5 briefly presents the document collections used in the experimental evaluation. Section 6 presents the experimental evaluation of the BT algorithm and, finally, Section 7 summarizes the conclusions.


2. Related Work

Extensive research work exists on feature selection for text categorization; this is due to the fact that text collections often have feature sets (i.e. vocabularies) whose size can reach tens or even hundreds of thousands, depending on the text representation. Such feature set sizes make the application of many classification algorithms infeasible, since most algorithms do not scale up well with the feature set size. Reducing the feature set size therefore enables the application of more sophisticated algorithms, hopefully grasping more complex relationships inherent in the dataset and increasing the classification quality [McCallum & Nigam 1998; Yang & Pedersen 1997; Rogati & Yang 2002].

In [Blum & Langley 1997] feature selection methods are grouped into three classes:

1. Those that embed the selection within the basic induction algorithm (e.g. ID3 [Quinlan 1983] and C4.5 [Quinlan 1993]).
2. Those that use feature selection to filter the features passed to the induction algorithm.
3. Those that treat feature selection as a wrapper around the induction process.

Although the general consensus on wrapper approaches [John et al. 1994] is that the induction method that will use the selected features should provide a better estimate of accuracy than a separate measure, their computational cost poses certain drawbacks for their usage. The most popular, and computationally fast, approach to the feature selection problem is the filter approach [John et al. 1994], i.e. keeping the |T′| << |T| features that receive the highest score according to a function that measures the importance of each feature for the classification task. Table 1 lists the score functions considered in this study: document frequency (DF), information gain (IG), mutual information (MI), the χ2 statistic (CHI), the NGL coefficient, odds ratio (OR), the GSS coefficient, the DIA association factor and relevancy score (RS).

3. The Best Terms Algorithm

3.1 Definitions

Definition 1. A feature w is called positive for a class c if the knowledge of its value can increase the probability of c. In other words, if a feature w is positive for a class c then the following relation holds:

P(c | w) > P(c)   (1)

Relation (1) is normalized as follows:

P(c | w) > ½ · p + ½ · P(c)   (2)

The parameter p is a user-defined normalization factor (0 ≤ p ≤ 1) that normalizes too small or too large class probabilities.

Definition 2. A feature w is called negative for a class c if the knowledge of its value can increase the probability of c̄. In other words, if a feature w is negative for a class c then the following relation holds:

P(c̄ | w) > P(c̄)   (3)

Using the same process as in relation (2), we can normalize relation (3) as follows:

P(c̄ | w) > ½ · p + ½ · P(c̄) ⇔ 1 – P(c | w) > ½ · p + ½ · [1 – P(c)] ⇔ P(c | w) < ½ · (1 – p) + ½ · P(c)   (4)
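As a minimal illustration of relations (2) and (4), the following Python sketch estimates P(c | w) from document counts and applies the two normalized tests. The corpus representation (documents as token sets, labels as label sets) and all names are assumptions made for illustration, not part of the paper.

```python
from collections import Counter

def positive_negative_features(docs, labels, target, p=0.5):
    """Split the vocabulary into positive / negative candidates for
    class `target`, per relations (2) and (4).
    docs:   list of token sets, one per document
    labels: parallel list of label sets"""
    prior = sum(1 for ls in labels if target in ls) / len(docs)   # P(c)
    df, df_c = Counter(), Counter()     # doc frequency, in-class doc frequency
    for toks, ls in zip(docs, labels):
        for w in toks:
            df[w] += 1
            df_c[w] += target in ls
    positive, negative = set(), set()
    for w, n in df.items():
        p_c_w = df_c[w] / n                           # estimate of P(c | w)
        if p_c_w > 0.5 * p + 0.5 * prior:             # relation (2): positive
            positive.add(w)
        elif p_c_w < 0.5 * (1 - p) + 0.5 * prior:     # relation (4): negative
            negative.add(w)
    return positive, negative
```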

3.2 The BT Algorithm

In this subsection we provide a general description of the BT algorithm. Given a target class c and a score function f (any of the functions listed in Table 1), the BT algorithm operates, in general, in two steps:

Step 1. From each in-class document select the top-scoring positive feature.
Step 2. From each out-of-class document that contains at least one of the selected positive features, select the top-scoring negative feature.

Figure 1. General description of the BT algorithm.

9 - 31

The first step of the BT algorithm, described in Figure 1, aims at selecting a set of positive features that are present in most of the in-class documents. Since the positive features are, by definition, good predictors of class membership, a side effect of the first step of the BT algorithm is the creation of a new, lower-dimensional document space that contains the majority of the in-class documents and only a small portion of the out-of-class ones.

All documents that do not contain at least one of the selected positive features are considered as not carrying any useful information; thus they are neither used in the second step of BT nor presented to the classifier for training. The intuition behind this action is that such documents are most certainly out-of-class ones and there is no need for a classifier to verify that. The classifier itself is required only on the most difficult part of the document space, where the decision of class membership is not straightforward. Thus, if a classifier is trained on the most difficult part of the document space, it will provide a better estimation of class membership.

However, it is not possible to effectively separate the in-class documents from the out-of-class ones using only the selected positive features, since all these features are good indicators of class membership. We therefore need a set of negative features, provided by step 2, to facilitate the induction process by maximizing the distance between the in-class and the out-of-class documents, since the former will mostly contain positive features while the latter will mostly contain negative features.

10 - 31

In the following, we provide a more detailed presentation of the BT algorithm:

Algorithm BT
Input:   A set of documents D
         A target class c
         A score function f
         A user-defined threshold p
Output:  A set of positive features FP
         A set of negative features FN
         A set of documents DF that contain at least one positive feature from FP

1.  FP = ∅, FN = ∅, DF = ∅
2.  let DC = {di ∈ D: P(c | di) = 1}   (the set of in-class documents)
3.  for each document d in DC
4.      let Fd = {wi ∈ d: P(c | wi) > ½ · p + ½ · P(c)}
5.      if Fd ≠ ∅ then
6.          find a feature w ∈ Fd: f(w, c) ≥ f(wi, c) for each wi ∈ Fd
7.          if w ∉ FP then FP = FP ∪ {w}
8.      end
9.  end
10. for each feature w in FP
11.     DF = DF ∪ {di ∈ D: P(w | di) = 1}
12. end
13. for each document d in DF – DC
14.     let Fd = {wi ∈ d: P(c | wi) < ½ · (1 – p) + ½ · P(c)}
15.     if Fd ≠ ∅ then
16.         find a feature w ∈ Fd: f(w, c̄) ≥ f(wi, c̄) for each wi ∈ Fd
17.         if w ∉ FN then FN = FN ∪ {w}
18.     end
19. end

Figure 2. The BT algorithm.
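The following Python sketch is a direct transcription of Figure 2, under the same corpus representation as the sketch in Section 3.1 (documents as token sets, labels as label sets). The score function f is passed in as a callable; scoring a feature against the complement class c̄ is expressed here with a `complement` flag, which is purely an assumption of this sketch.

```python
from collections import Counter

def best_terms(docs, labels, target, score, p=0.5):
    """The BT algorithm (Figure 2).  Returns (FP, FN, DF), where DF is
    the set of indices of documents containing a positive feature."""
    prior = sum(1 for ls in labels if target in ls) / len(docs)   # P(c)
    df, df_c = Counter(), Counter()
    for toks, ls in zip(docs, labels):
        for w in toks:
            df[w] += 1
            df_c[w] += target in ls
    p_c = {w: df_c[w] / df[w] for w in df}      # P(c | w) estimates
    pos_thr = 0.5 * p + 0.5 * prior             # relation (2)
    neg_thr = 0.5 * (1 - p) + 0.5 * prior       # relation (4)

    FP, FN = set(), set()                                         # line 1
    DC = {i for i, ls in enumerate(labels) if target in ls}       # line 2
    for i in DC:                                                  # lines 3-9
        fd = [w for w in docs[i] if p_c[w] > pos_thr]
        if fd:
            FP.add(max(fd, key=lambda w: score(w, target)))
    DF = {i for i, toks in enumerate(docs) if toks & FP}          # lines 10-12
    for i in DF - DC:                                             # lines 13-19
        fd = [w for w in docs[i] if p_c[w] < neg_thr]
        if fd:
            FN.add(max(fd, key=lambda w: score(w, target, complement=True)))
    return FP, FN, DF
```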


3.3 Complexity Analysis of BT

The variables used in the following complexity analysis of the BT algorithm and the filter approach are:

• N – the number of training documents
• V – the number of features
• M – the number of training-set categories (M < N, in general)
• Lu – the average number of unique words in a document
• Lc – the average number of labels in a document (Lc ≤ M)

We assume that the training documents are pre-processed asynchronously, after labeling. Thus the feature selection algorithm need not perform any further pre-processing.

A complexity analysis of the BT algorithm indicates that:

• The estimation of DC may be performed during pre-processing. Therefore the execution of line 2 costs O(1).
• Finding the top-scoring positive feature of a document, in lines 4–8, costs O(Lu). Therefore the execution of lines 3–9 costs O(|DC| · Lu).
• Determining which documents contain at least one feature from FP, in lines 10–12, costs O(|DF|).
• Finding the top-scoring negative feature of a document, in lines 14–18, costs O(Lu). Therefore the execution of lines 13–19 costs O(|DF – DC| · Lu).
• Finally, projecting the |DF| documents, which will be used for training, onto the new feature space costs O(|DF| · Lu). This action is not a part of the BT algorithm itself, but it is required in order to create the training set.

Combining the above costs, we can calculate the cost per category as follows:

O(1 + |DC| · Lu + |DF| + |DF – DC| · Lu + |DF| · Lu) = O((|DC| + |DF – DC| + |DF|) · Lu + |DF| + 1) = O(2 · |DF| · Lu + |DF| + 1) = O((2 · Lu + 1) · |DF| + 1)   (5)

Dropping all constants from expression (5), we can simplify the cost per category to:

O(|DF| · Lu)   (6)

A theoretical analysis that provides bounds on |DF| in relation to the normalization factor p and the score function f is very difficult. However, experimental results obtained from the Reuters and the Newsgroups collections (see Section 6) have shown that:

• For score functions that favor high-precision features, such as MI, DIA or OR, it holds that |DF| ∝ |DC| if p ≥ 0.25.
• For the rest of the functions listed in Table 1, it holds that |DF| ∝ |DC| if p ≥ 0.5.

Under the above restrictions the cost per category is reduced to:

O(|DC| · Lu)   (7)

Summing over all the M categories, we obtain:

O(Lu · Σ_M |DC|)   (8)

However, it holds that:

Σ_M |DC| = N · Lc   (9)

Substituting equation (9) into expression (8), we obtain the total cost of the BT algorithm:

O(N · Lc · Lu)   (10)

Notice that the cost of the BT algorithm is linear with respect to the number of training documents N, while it is independent of both the number of training-set categories M and the number of features V.

On the other hand, a complexity analysis of the filter approach indicates that:

• Score calculation costs O(V).
• Sorting costs O(V · logV).
• Projecting the N training documents onto the new feature space costs O(N · Lu). This action is not a part of the filter approach itself, but it is required in order to create the training set.

Combining the above costs, we can calculate the cost per category as follows:

O(V + V · logV + N · Lu) = O(N · Lu + V · logV)   (11)

Summing over all the M categories, we obtain the total cost of the filter approach:

O(M · Lu · N + M · V · logV)   (12)

Recall that Lc ≤ M; hence, if we compare the costs of the BT algorithm and the filter approach, as expressed in (10) and (12) respectively, we can easily deduce that:

O(Lc · Lu · N) < O(M · Lu · N + M · V · logV)   (13)

A direct consequence of relation (13) is that the BT algorithm is much more efficient than the filter approach, especially when there is a large number of features and/or training-set categories.


4. Classifiers

In this section we provide a brief description of the two classifiers, Naïve Bayes and Support Vector Machines, which we have used in the experimental evaluation of BT.

4.1 Naïve Bayes

The basic Naïve Bayes model [Duda & Hart 1973] assumes that the probability of each word wj occurring in a document di is independent of the occurrence of other words, given that the class is known. Although unrealistic, this assumption allows the easy computation of the conditional probability of a class ck given a document di as follows:

P(ck | di) = P(ck) · P(di | ck) / P(di)   (14)

For the sake of simplicity we have used the multi-variate Bernoulli model, where the conditional probability P(di | ck) is estimated using equations (2) and (3) of [McCallum & Nigam 1998].
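For concreteness, here is one reading of the multi-variate Bernoulli estimate as a Python sketch: every vocabulary word contributes to P(di | ck), whether present or absent, with Laplace-smoothed word probabilities. This is an interpretation of equations (2) and (3) of [McCallum & Nigam 1998] with illustrative names, not the paper's actual code.

```python
import math

def estimate_word_probs(in_class_docs, vocab):
    """Laplace-smoothed P(w | c) from in-class documents (token sets)."""
    n = len(in_class_docs)
    return {w: (1 + sum(w in d for d in in_class_docs)) / (2 + n) for w in vocab}

def bernoulli_log_likelihood(doc, vocab, p_w_c):
    """log P(d | c) under the multi-variate Bernoulli model: a word
    contributes P(w | c) if present in d and 1 - P(w | c) if absent."""
    return sum(math.log(p_w_c[w] if w in doc else 1.0 - p_w_c[w]) for w in vocab)
```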

4.2 Support Vector Machines

The Support Vector Machines algorithm is a relatively new approach to the text classification problem and stems from statistical learning theory [Vapnik 1998]. In its simplest form, the linear SVM treats examples as points in a high-dimensional space and tries to find a decision surface that "best" separates the positive from the negative examples. The task is treated as a quadratic optimization problem and the resulting hyperplane is used to classify new examples. In our experiments we have used the linear model offered by the SVMlight package [Joachims 1999], due to its time efficiency and its high classification performance [Yang & Liu 1999].
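The experiments use the SVMlight package directly; as an illustrative stand-in for readers, an equivalent linear SVM with C = 1 (the setting reported in Section 6) can be set up with scikit-learn on documents projected onto a selected feature set. This sketch is not the paper's actual pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def train_linear_svm(train_texts, train_labels, selected_features):
    """Binary linear SVM on documents projected onto the selected
    feature set (e.g. FP ∪ FN from BT), with C = 1."""
    vec = CountVectorizer(vocabulary=sorted(selected_features), binary=True)
    X = vec.fit_transform(train_texts)
    clf = LinearSVC(C=1.0)
    clf.fit(X, train_labels)
    return vec, clf
```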

5. Datasets and Performance Measures

In order to make our evaluation results comparable to most of the published results in text categorization evaluations, we have chosen the Reuters-21578 and the 20-Newsgroups collections as datasets.

5.1 Reuters Collection

The Reuters-21578 collection consists of 21,578 stories from the 1987 Reuters newswire, each one pre-assigned to one or more of a list of 135 topics. We used the 'ModApte' training-test split, which contains 9,603 training and 3,299 test documents. All words were converted to lower case, punctuation marks were removed, numbers were ignored and words from a common list of stop-words were removed. No stemming was used. We have also removed words occurring in at most 4 training documents; [Rogati & Yang 2002] have reported boosted performance with even more aggressive rare-word elimination. The resulting dictionary contained 8,067 unique words.
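A sketch of this pre-processing pipeline (lower-casing, dropping punctuation and numbers, stop-word removal, no stemming, pruning words that occur in at most 4 training documents). The stop-word list shown is a stand-in; the actual list used in the paper is not reproduced here.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is"}  # stand-in list

def tokenize(text):
    """Lower-case, keep letter-only tokens (punctuation and numbers are
    dropped), and remove stop-words.  No stemming is applied."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_vocabulary(train_texts, min_df=5):
    """Keep words occurring in at least `min_df` training documents,
    i.e. remove those occurring in at most 4."""
    df = Counter()
    for text in train_texts:
        df.update(set(tokenize(text)))
    return {w for w, n in df.items() if n >= min_df}
```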

5.2 Newsgroups Collection

The 20-Newsgroups collection was gathered by Ken Lang [Lang 1995] and has become one of the standard text collections for the evaluation of classification algorithms. It contains 19,949 non-empty newsgroup postings from 20 different UseNet groups. The documents are evenly divided among the classes, i.e. each class contains around 1,000 documents. All headers except the "Subject" header were removed; the subject, along with the body, was used for training or testing. As in the Reuters collection, all words were converted to lower case, punctuation marks were removed, numbers were ignored and words from a common list of stop-words were removed. Again no stemming was used and words occurring in at most 4 training documents were removed. The resulting vocabulary contained 26,983 words. Furthermore, we combined cross-posted documents into multi-labeled ones. After pre-processing, the collection contained 19,420 unique documents, 527 of which were posted to 2 to 4 different groups. Since there is no training-test split defined for this collection, we decided to sort the documents by post time and keep the last 250 documents of each group as test documents, which resulted in 14,482 training and 4,838 test documents. This split makes classification more difficult but more realistic, in contrast to e.g. 4-fold cross-validation, because the focus of a newsgroup generally shifts from one subject to another as time passes.
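A sketch of this time-based split, assuming each posting is a record carrying its group and a sortable post time parsed from its Date header (both field names are hypothetical):

```python
from collections import defaultdict

def time_based_split(posts, test_per_group=250):
    """Keep the last `test_per_group` postings of each group, by post
    time, as test documents; everything earlier becomes training data."""
    by_group = defaultdict(list)
    for post in posts:                  # post: {"group": ..., "time": ..., ...}
        by_group[post["group"]].append(post)
    train, test = [], []
    for group_posts in by_group.values():
        group_posts.sort(key=lambda p: p["time"])
        train.extend(group_posts[:-test_per_group])
        test.extend(group_posts[-test_per_group:])
    return train, test
```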

5.3 Performance Measures

For evaluating the effectiveness of the text categorization algorithms, we have used the standard recall, precision and F1 measures. Recall is the number of correct assignments by the system divided by the total number of correct assignments. Precision is the number of correct assignments by the system divided by the total number of the system's assignments. The F1 measure [van Rijsbergen 1979] is the harmonic mean of recall (r) and precision (p):

F1(r, p) = 2 · r · p / (r + p)   (15)

These scores can be computed for the binary decisions on each individual category first and then averaged over categories (macro-averaging), or they can be computed globally over all NT × M binary decisions (micro-averaging), where NT is the total number of test documents.
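Both averaging modes can be computed from per-category contingency counts; a minimal sketch, assuming a mapping from category to (true positives, false positives, false negatives):

```python
def micro_macro_f1(confusions):
    """confusions: dict mapping category -> (tp, fp, fn) over the test set."""
    def f1(tp, fp, fn):
        r = tp / (tp + fn) if tp + fn else 0.0      # recall
        p = tp / (tp + fp) if tp + fp else 0.0      # precision
        return 2 * r * p / (r + p) if r + p else 0.0

    macro = sum(f1(*c) for c in confusions.values()) / len(confusions)
    tp = sum(c[0] for c in confusions.values())     # pool all binary decisions
    fp = sum(c[1] for c in confusions.values())
    fn = sum(c[2] for c in confusions.values())
    return f1(tp, fp, fn), macro                    # (micro-F1, macro-F1)
```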


6. Experimental Results

In this section we present the experimental results obtained from the comparative evaluation of the BT algorithm versus the filter approach, on the Reuters-21578 and 20-Newsgroups datasets, using the NB and SVM classification algorithms.

6.1 The “filter” Approach

Although there are several studies on the popular filter approach [Yang & Pedersen 1997; Rogati & Yang 2002], we are not aware of a comparative evaluation of all the functions listed in Table 1. Hence we first performed a thorough evaluation of the filter approach, in order to provide an elaborate comparison with the BT algorithm. Using each of the functions listed in Table 1, we selected the 50, 100, 200, 500, 1000, 5000 and 10000 top-scoring features and evaluated the efficiency and the effectiveness of the NB and SVM classification algorithms. We have also evaluated the case of no feature selection; the results are presented in Figures 3–4. Figure 3 shows the classification performance of NB and SVM with respect to the vocabulary size, for each of the functions listed in Table 1, while Figure 4 shows the respective dataset size reduction.
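For contrast with BT, the filter approach as evaluated here reduces to scoring every feature with one of the Table 1 functions and keeping the top k; a one-function sketch:

```python
def filter_select(vocab, score, target, k):
    """Classic filter feature selection: the k top-scoring features for
    the target class (the O(V log V) per-category step of Section 3.3)."""
    return set(sorted(vocab, key=lambda w: score(w, target), reverse=True)[:k])
```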

Several findings stem from the above experiments, some more expected than others:

• The performance of the DIA and MI functions has been identical, on both datasets, for both NB and SVM. This was actually expected, because the MI formula derives from the DIA formula after dividing by the class prior and taking the logarithm (see the derivation below). Therefore, in Figure 3, we present statistics only for the MI function.
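A one-line check of this claim, writing the DIA association factor as DIA(w, c) = P(c | w) (our reading; Table 1 itself is not reproduced here):

$$\mathrm{MI}(w,c) \;=\; \log\frac{P(w,c)}{P(w)\,P(c)} \;=\; \log\frac{P(c \mid w)}{P(c)} \;=\; \log\frac{\mathrm{DIA}(w,c)}{P(c)}$$

Since the logarithm is monotonic and P(c) is constant within a class, ranking features by MI or by DIA selects exactly the same top-scoring terms.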


[Figure 3 appears here: two line charts of micro-averaged F1 versus vocabulary size, for the Reuters-21578 (left) and 20-Newsgroups (right) datasets, with one NB line per score function (DF, IG, MI, CHI, NGL, OR, GSS) and flat reference lines for NB and SVM without feature selection.]

Figure 3. Classification performance of NB and SVM on the Reuters (left) and the Newsgroups (right) datasets, when using the filter approach for feature selection. The “nb” and “svm” labeled lines illustrate the performance of NB and SVM in the case of no feature selection respectively, while the rest of the lines illustrate the performance of NB when using the respective function.



• The performance of the DF and RS functions has been almost identical, on both datasets, for both NB and SVM, regardless of the value of the damping factor d in the RS formula (we have tested various values of d, from 10⁻² to 1). Therefore, in Figure 3, we present statistics only for the DF function.



• The behavior of SVM has been monotonic on both datasets, regardless of the score function used. Consistent with the results reported in [Joachims 1998], the best classification performance of SVM has been obtained when using the maximum number of features, i.e. the case of no feature selection. Therefore, in Figure 3, we present the classification performance of SVM for that case only (the straight line labeled “svm”).



• Although the behavior of NB has been monotonic for most of the score functions on the Reuters dataset, the same does not hold for the Newsgroups dataset. However, consistent with the results reported in [McCallum & Nigam 1998], the best classification performance of the multi-variate Bernoulli model has been obtained, on both datasets, when using a small vocabulary size. In the case of the Reuters dataset, the best classification performance has been obtained when using the χ2 function (0.80 micro-avg F1), outperforming the published results of the multinomial model [Yang & Liu 1999; Zhang & Oles 2001].



• The classification performance of SVM has been very poor on the smaller classes of the Reuters dataset (44.4% macro-avg F1), which is inconsistent with the results reported in [Yang & Liu 1999; Zhang & Oles 2001].



• The 20-Newsgroups dataset has been a much tougher domain to learn than the Reuters dataset; this is due to the significant overlap between related categories in the former collection. As one may observe in Figure 3, the classification performance of both NB and SVM on the Newsgroups dataset has been almost 10 percentage points lower than on the Reuters collection.



• SVM needs much more time than NB in order to be trained, especially in the case of the 20-Newsgroups dataset. The training time of SVM has been about 0.8 secs/class on the Reuters dataset, while it has been much larger, about 12.8 secs/class, on the Newsgroups dataset (including I/O). This difference amounts to more than an order of magnitude and is not consistent with the super-linear complexity of SVM reported in [Yang et al. 2003]. In fact, it seems that the convergence of SVM does not depend only on the dataset size, but also on other factors such as the ratio of positive to negative instances, the degree of the problem's linear separability, etc.


[Figure 4 appears here: two charts of dataset size (percentage) versus vocabulary size, for the Reuters-21578 (left) and 20-Newsgroups (right) datasets, with one line per score function (DF, IG, MI, CHI, NGL, OR, GSS).]

Figure 4. Percentage of the original dataset with non-empty projection onto the new, lower-dimensional feature space, on the Reuters (left) and Newsgroups (right) datasets.



• An implementation issue that should also be discussed is that the C parameter of SVMlight has been set equal to 1, in order to achieve reasonable training time. Otherwise, when keeping the default parameter settings, the training of SVM could take a few hours (also confirmed by [Bekkerman et al. 2001]).



• Although the filter approach is generally considered a pure feature selection algorithm, it also acts as an instance selection (IS) method, especially when only a small number of features is selected. This is due to the fact that after projecting the original dataset onto the new, lower-dimensional feature space, several documents become empty and are not forwarded to the classification algorithm for training. Although this is actually a side effect, score functions such as MI or OR can achieve a significant reduction in the size of the dataset, as one may observe in Figure 4.
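The dataset-size reduction plotted in Figure 4 can be measured as the fraction of training documents whose projection onto the selected features is non-empty; a small sketch under the token-set representation used above:

```python
def retained_fraction(docs, selected):
    """Share of documents that still contain at least one selected
    feature after projection (i.e. non-empty documents)."""
    return sum(1 for toks in docs if toks & selected) / len(docs)
```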


6.2 The “BestTerms” Approach

One basic difference between the filter approach and the BT approach is that in the latter there is no explicit declaration of the number of features to be selected. Instead, once the value of the parameter p is defined, BT automatically selects the appropriate number of features. Recall that 0 ≤ p ≤ 1 holds, since p is a probability value.

Therefore, in order to perform a thorough evaluation of the BT algorithm, we have run a series of experiments for various values of p; more specifically, we varied p from 0 to 1 with a step size of 0.25. This series of experiments has been executed for each of the functions listed in Table 1 and the results are presented in Figures 5–8. Figure 5 shows the average number of features per class that have been selected from each dataset. Figures 6–7 show the classification performance of NB and SVM on the Reuters-21578 and the 20-Newsgroups datasets respectively, while Figure 8 shows the respective dataset size reduction. In addition, Table 2 provides a comparative presentation of the best results obtained using the BT algorithm and the popular filter approach.

Several findings stem from the above experiments:

• There is a direct relation between the number of in-class documents and the number of features selected, as one may observe in Figure 5, which is consistent with the nature of the BT algorithm. Hence the almost order-of-magnitude difference in the average number of features selected on the two datasets is clearly justified, considering that each class of the Newsgroups dataset contains 743 documents on average, while the respective number on the Reuters collection is just 106.


[Figure 5 appears here: two charts of vocabulary size versus the normalization factor p, for the Reuters-21578 (left) and 20-Newsgroups (right) datasets, with one line per score function (DF, IG, MI, CHI, NGL, OR, GSS).]

Figure 5. Vocabulary size on the Reuters (left) and the Newsgroups (right) datasets, when using the BT algorithm for feature selection.



• The experimental results have shown that there is also a direct relation between the number of features selected and the score function used. Functions that favor high-precision features, such as MI or OR, have selected many more features than the rest, which favor features present in a larger number of documents.



• The use of the BT algorithm has resulted in a significant increase in the classification performance of NB, on both datasets, as one may observe in Figures 6–7. The increase has been more apparent when using functions that favor high-precision features, such as MI or OR, although the rest of the functions have also provided a considerable improvement over the filter approach. An interesting observation is that the former functions have performed best with a small value of p (p ≤ 0.25), while the latter with a large value of p (p ≥ 0.75).


[Figure 6 appears here: two charts of micro-averaged F1 versus the normalization factor p on the Reuters dataset, for NB (left) and SVM (right), with one line per score function (DF, IG, MI, CHI, NGL, OR, GSS) plus a “Filter” baseline.]

Figure 6. Classification performance of NB (left) and SVM (right) on the Reuters dataset, when using the BT algorithm for feature selection. The “Filter”-labeled lines illustrate the best classification performance that has been obtained when using the filter approach.

[Figure 7 appears here: the same layout as Figure 6, for the Newsgroups dataset.]

Figure 7. Classification performance of NB (left) and SVM (right) on the Newsgroups dataset, when using the BT algorithm for feature selection. The “Filter”-labeled lines illustrate the best classification performance that has been obtained when using the filter approach.



• The use of the BT algorithm has boosted the classification performance of SVM too, although not to the same degree as in the case of NB. The increase has been considerable only when using functions that favor high-precision features, such as MI or OR, since the rest of the functions have provided just a minor improvement over the case of no feature selection. As in the case of NB, the former functions have performed best with a small value of p (p ≤ 0.25), while the latter with a large value of p (p ≥ 0.75).


                 Reuters-21578                 20-Newsgroups
             miR   miP   miF1  maF1       miR   miP   miF1  maF1
Filter SVM   79.9  91.8  85.4  44.4       67.6  85.0  75.3  75.1
Filter NB    78.8  80.8  79.8  46.4       61.0  80.0  69.2  68.7
BT SVM       85.3  88.9  87.1  55.4       73.7  84.3  78.6  78.3
BT NB        84.4  88.4  86.3  53.2       75.1  82.7  78.7  78.4

(miR = micro-avg recall, miP = micro-avg precision, miF1 = micro-avg F1, maF1 = macro-avg F1)

Table 2. Best classification performance of NB and SVM on the Reuters-21578 and 20-Newsgroups datasets, using the BT and the filter algorithms.



• The best classification performance of NB still under-performs that of SVM on the Reuters dataset, as one may observe in Table 2; however, the difference is not significant. On the other hand, the best classification performance of the two classifiers has been almost identical on the Newsgroups dataset. Notice that the best classification performance of NB has been obtained when using the MI function on both datasets, and the same holds for SVM in the case of the Reuters dataset; on the Newsgroups dataset the combination of SVM with MI has provided just sub-optimal classification performance. We can therefore say that the BT algorithm performs best when using the MI function.

[Figure 8 appears here: two charts of retained dataset size (percentage) versus the normalization factor p, for the Reuters-21578 (left) and 20-Newsgroups (right) datasets, with one line per score function (DF, IG, MI, CHI, NGL, OR, GSS).]

Figure 8. Percentage of the original dataset that has been retained after the application of BT, on the Reuters (left) and the Newsgroups (right) datasets.




• As already discussed in Section 3, the use of the BT algorithm has resulted in a significant reduction in the size of the training set. Although this is actually a side effect, since the BT algorithm does not intentionally perform any kind of instance selection, the decrease in the dataset size has reached up to 98% in the case of the Reuters collection (Figure 8). The respective decrease has reached up to 90% in the case of the Newsgroups dataset. As a direct consequence, the training time of both NB and SVM has decreased significantly. In the case of SVM, the training time has dropped from 0.8 to 0.07 secs/class on the Reuters dataset and from 12.8 to 0.3 secs/class on the Newsgroups dataset (including I/O).



• The basic intuition of BT, "all the documents that do not contain at least one of the selected positive features are most certainly out-of-class ones and we do not need a classifier to verify it", has also been applied during testing. Hence only the test documents containing at least one of the selected positive features have been forwarded to the classifier, while the rest have been immediately considered out-of-class. The classification performance of NB and SVM has been calculated after first combining each classifier's assignments with the BT algorithm's assignments (negative by default). This intuitive rule has saved considerable testing time, since the overwhelming majority of the test documents have not been forwarded to the classifier. The testing time of SVM has dropped from 0.63 to 0.02 secs/class on the Reuters dataset and from 9.4 to 0.04 secs/class on the Newsgroups dataset, including I/O.
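A sketch of this test-time rule, with hypothetical names: documents containing no selected positive feature are labeled out-of-class without ever reaching the classifier.

```python
def classify_with_bt_gate(test_docs, FP, classifier):
    """Apply BT's default-negative rule at test time: only documents
    containing at least one positive feature reach the classifier."""
    return [classifier(toks) if toks & FP else False    # negative by default
            for toks in test_docs]
```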


All experiments have been run on an AMD Athlon XP 2.4 PC with 512 MB RAM; I/O is included in all reported time statistics.

7. Conclusions

The filter approach has long been considered the most popular and computationally efficient approach to the feature selection problem. Hence the focus has mostly been on devising better score functions to use with it, rather than on developing new feature selection methods. In this paper we have proposed a new feature selection algorithm for text categorization, named Best Terms (BT). BT is faster than the filter approach, since it is linear in the number of training-set documents, while it is independent of the vocabulary size and the number of training-set categories. In addition, experimental results obtained from evaluating BT on benchmark document collections, and comparing it to a range of functions used with the filter approach, have shown a considerable improvement in both the effectiveness and the efficiency of two representative classification algorithms, Naïve Bayes and Support Vector Machines. In most cases SVM training becomes more than an order of magnitude faster, while there are cases where the simple, but very fast, NB algorithm proves to be more effective than SVM.


8. References

Baker, L. D. and McCallum, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval (Melbourne, AU, 1998), pp. 96–103.

Bekkerman, R., El-Yaniv, R., Tishby, N. and Winter, Y. 2001. On feature distributional clustering for text categorization. In Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, pp. 146–153. ACM Press, New York, US.

Blum, A. L. and Langley, P. 1997. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245–271.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Duda, R. and Hart, P. 1973. Pattern Classification and Scene Analysis. John Wiley & Sons, New York.

Dumais, S. T., Platt, J., Heckerman, D. and Sahami, M. 1998. Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Bethesda, US, 1998), pp. 148–155.

Fuhr, N., Hartmann, S., Knorz, G., Lustig, G., Schwantner, M. and Tzeras, K. 1991. AIR/X – a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO-91, 3rd International Conference “Recherche d’Information Assistée par Ordinateur” (Barcelona, ES, 1991), pp. 606–623.

Galavotti, L., Sebastiani, F. and Simi, M. 2000. Experiments on the use of feature selection and negative evidence in automated text categorization. In Proceedings of ECDL-00, 4th European Conference on Research and Advanced Technology for Digital Libraries (Lisbon, PT, 2000), pp. 59–68.

Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137–142.

Joachims, T. 1999. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning. MIT Press.

John, G. H., Kohavi, R. and Pfleger, K. 1994. Irrelevant features and the subset selection problem. In Proceedings of ICML-94, 11th International Conference on Machine Learning (New Brunswick, US, 1994), pp. 121–129.

Lang, K. 1995. NewsWeeder: learning to filter netnews. In Proceedings of ICML-95, 12th International Conference on Machine Learning, pp. 331–339.

Lewis, D. D. 1992. Feature selection and feature extraction for text categorization. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, NY, pp. 212–217. Morgan Kaufmann, San Mateo, CA.

McCallum, A. and Nigam, K. 1998. A comparison of event models for Naive Bayes text classification. In Proceedings of the AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41–48. AAAI Press.

Ng, H. T., Goh, W. B. and Low, K. L. 1997. Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval, pp. 67–73.

Quinlan, J. R. 1983. Learning efficient classification procedures and their application to chess end games. In Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Rogati, M. and Yang, Y. 2002. High-performing feature selection for text classification. In Proceedings of CIKM-02, 11th International Conference on Information and Knowledge Management.

Ruiz, M. E. and Srinivasan, P. 1999. Hierarchical neural networks for text categorization. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pp. 281–282.

Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Van Rijsbergen, C. J. 1979. Information Retrieval. Butterworths, London.

Vapnik, V. 1998. Statistical Learning Theory. John Wiley and Sons, New York.

Wiener, E. D., Pedersen, J. O. and Weigend, A. S. 1995. A neural network approach to topic spotting. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval (Las Vegas, US, 1995), pp. 317–332.

Yang, Y. and Chute, C. G. 1994. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3):252–277.

Yang, Y. and Pedersen, J. O. 1997. A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412–420. Morgan Kaufmann.

Yang, Y. and Liu, X. 1999. A re-examination of text categorization methods. In Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pp. 42–49.

Yang, Y. 1999. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2):67–88.

Yang, Y., Zhang, J. and Kisiel, B. 2003. A scalability analysis of classifiers in text categorization. In Proceedings of SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval.

Zhang, T. and Oles, F. J. 2001. Text categorization based on regularized linear classification methods. Information Retrieval, 4:5–31.