The problem of text categorization is to classify documents into a fixed number of classes. How to assign documents to a class, and how precisely this can be done, remains an active research topic. Two kinds of algorithms have frequently been used for this purpose: the first is based on the relevance feedback method introduced by Rocchio [Rocchio, 1971]; the other is based on the naive Bayes algorithm.
The naive Bayes algorithm can be applied very successfully to the text categorization task. It uses probabilistic models to estimate the likelihood that a given document belongs to each class, and this probability estimate drives the classification decision. The algorithm builds on the bag-of-words representation developed in information retrieval: an attribute-value representation in which a document is represented by the words occurring in it, without regard to their ordering.
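This representation can be sketched in a few lines of Python (a toy illustration of the idea, not rainbow's actual lexer):

```python
import re
from collections import Counter

def bag_of_words(document):
    """Represent a document as unordered word counts (bag-of-words)."""
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)

doc = "The cat sat on the mat. The mat was flat."
print(bag_of_words(doc))  # word order is discarded; only counts remain
```

Note that "the" and "The" collapse to one feature, and the sentence structure is lost entirely; only the multiset of words survives.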
The naive Bayes classifier, a highly practical Bayesian learning method, is known to be quite successful when applied to classifying natural language text documents, and in some domains its performance has been shown to be comparable to that of neural network and decision tree learning. In our final project, we ran experiments using alternative measures of evaluation, for example: how precise is the classifier for each newsgroup? Which newsgroups get the highest/lowest precision? The goal was to see how precisely it classifies text documents. In the second part of our project, we built a web page classifier based on Andrew McCallum's rainbow system; its performance is also measured and analyzed.
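To make the probabilistic model concrete, here is a minimal multinomial naive Bayes classifier with add-one (Laplace) smoothing, written as a toy Python sketch; rainbow's own implementation differs in many details:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of word lists; labels: parallel list of class names
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for words, c in zip(docs, labels):
            self.word_counts[c].update(words)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}

    def predict(self, words):
        n_docs = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            # log P(c) + sum_w log P(w | c), estimated with add-one smoothing
            total = sum(self.word_counts[c].values())
            score = math.log(n_c / n_docs)
            for w in words:
                score += math.log((self.word_counts[c][w] + 1) / (total + v))
            if score > best_score:
                best, best_score = c, score
        return best

# Tiny made-up training set for illustration
docs = [["ball", "game", "score"], ["election", "vote"], ["game", "win"]]
labels = ["rec", "politics", "rec"]
nb = NaiveBayes()
nb.fit(docs, labels)
print(nb.predict(["game", "ball"]))  # classified as "rec"
```

Working in log space avoids numeric underflow when a document contains many words, which matters for full newsgroup articles.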
First, we index the documents; this is the training phase:
$ ./rainbow --index 20_newsgroups/*
With the documents indexed, we can now do some test runs. For our run, we use 67% of the indexed documents for training and 33% for testing:
$ ./rainbow --test-percentage=33 --test=1 > rainbow_output
The --test-percentage=33 option specifies that 33% of the indexed documents should be set aside for testing. These documents are chosen at random from the pool of indexed documents, and information from them is not part of the model used to classify documents on this run. The --test=1 option indicates that one test run should be performed.
Once the above run is complete, the file rainbow_output contains all the result information: the classification of each individual document along with a confidence rating for each classification. To get summary results for the test runs, we pipe this file through the rainbow-stats program:
$ rainbow-stats < rainbow_output >rainbow_stat
This gives us the overall accuracy for each run, along with details for the testing documents from each category from which testing documents were drawn. Note that the resulting accuracy may vary between runs, since the training/test split is made at random.
Rainbow offers plenty of options that can be used in our experiments. Unless told otherwise, Rainbow uses a stoplist to remove uninformative words (such as 'a', 'the', etc.). To have Rainbow skip this stoplist, we use the -s or --no-stoplist option.
To increase the accuracy of the learning algorithm, we use the stemming option, which represents each word by its root, removing suffixes that might otherwise differentiate two words with the same meaning. Rainbow can perform stemming during the lexing and indexing of the documents; this is turned on with the -S or --use-stemming option.
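To illustrate what these two options do, here is a toy Python sketch; the stoplist is a small made-up subset, and the suffix rules are a crude stand-in for the full Porter stemmer that real systems such as rainbow use:

```python
# Toy subset of a stoplist; real stoplists contain hundreds of words.
STOPLIST = {"a", "an", "the", "of", "and", "is", "to", "in"}

def strip_stopwords(words):
    """Drop uninformative words before indexing."""
    return [w for w in words if w not in STOPLIST]

def crude_stem(word):
    """Toy suffix stripping; rainbow uses the full Porter stemmer."""
    for suffix in ("ational", "ization", "ing", "ers", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["the", "computers", "computing", "computed", "compute"]
print([crude_stem(w) for w in strip_stopwords(words)])
```

Note how "computers", "computing", and "computed" collapse to the same root "comput", so the classifier treats them as one feature, while "compute" slips through unchanged; handling such cases is why real stemmers need many more rules.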
a) Hold-out Set
We can tell Rainbow to test classification performance on a hold-out set, that is, a portion of the data that is not used for training the classifier. This should give much more trustworthy results than testing on the training data itself. By testing on hold-out sets of different sizes and training on the remainder, we obtain a different Average_percentage_accuracy for each split.
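The random hold-out split that --test-percentage performs can be sketched as follows (our own illustration, not rainbow's code):

```python
import random

def holdout_split(documents, test_percentage, seed=0):
    """Randomly set aside test_percentage% of documents for testing."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    n_test = round(len(docs) * test_percentage / 100)
    return docs[n_test:], docs[:n_test]  # (train, test)

train, test = holdout_split(range(100), 33)
print(len(train), len(test))  # 67 33
```

Because the shuffle is random, repeating the split gives a different test set each time, which is why the measured accuracy varies from run to run.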
The following experiments were run on the mini_newsgroups data set, with one test run each.
(./rainbow --method=naivebayes --no-stoplist --skip-header --prune-vocab-by-occur-count=1)
Training set : 2% Correct: 27 out of 39 (69.23 percent accuracy)
Training set : 6% Correct: 91 out of 119 (76.47 percent accuracy)
Training set : 10% Correct: 139 out of 199 (69.85 percent accuracy)
Training set : 20% Correct: 274 out of 399 (68.67 percent accuracy)
Training set : 70% Correct: 858 out of 1398 (61.37 percent accuracy)
For some reason, we got relatively low performance here. We suspect this is because our training and test sets are relatively small: the mini_newsgroups data set used for this particular experiment is much smaller than the other one, 20_newsgroups, which contains more than twenty thousand news articles.
b) K-fold Cross-Validation:
With the k-fold cross-validation technique, we repeat the evaluation with a different random partitioning of the data k times. The constant k is specified with --test=k. We use k=10: Rainbow performs 10 test/train splits of the indexed documents and outputs the classifications of all test documents each time.
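The repeated random-split evaluation behind --test=k can be sketched like this (a hypothetical illustration; the evaluate callback stands in for actually training and testing a classifier):

```python
import random

def repeated_splits(documents, test_percentage, k, evaluate, seed=0):
    """Run k independent random train/test splits and collect accuracies."""
    rng = random.Random(seed)
    docs = list(documents)
    n_test = round(len(docs) * test_percentage / 100)
    accuracies = []
    for _ in range(k):
        rng.shuffle(docs)
        test, train = docs[:n_test], docs[n_test:]
        accuracies.append(evaluate(train, test))
    return accuracies

# Demo with a placeholder evaluator that just reports the test fraction
accs = repeated_splits(range(10), 30, 5, lambda train, test: len(test) / 10)
print(accs)
```

Averaging over several random splits smooths out the luck of any single partition, which is exactly why the ten-trial average below is more trustworthy than one run.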
The following experiment was run on the 20_newsgroups data set, with ten test runs, without skipping the header part of each document.
Trial 0 -- Correct: 5810 out of 6588 (88.19 percent accuracy)
Trial 1 -- Correct: 5770 out of 6588 (87.58 percent accuracy)
Trial 2 -- Correct: 5787 out of 6588 (87.84 percent accuracy)
Trial 3 -- Correct: 5809 out of 6588 (88.18 percent accuracy)
Trial 4 -- Correct: 5806 out of 6588 (88.13 percent accuracy)
Trial 5 -- Correct: 5797 out of 6588 (87.99 percent accuracy)
Trial 6 -- Correct: 5782 out of 6588 (87.77 percent accuracy)
Trial 7 -- Correct: 5815 out of 6588 (88.27 percent accuracy)
Trial 8 -- Correct: 5788 out of 6588 (87.86 percent accuracy)
Trial 9 -- Correct: 5791 out of 6588 (87.90 percent accuracy)
Average_percentage_accuracy_and_stderr 87.97 0.07
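The summary line above can be reproduced from the ten trial accuracies. A sketch, assuming the reported standard error is the sample standard deviation divided by the square root of the number of trials:

```python
import math

# The ten per-trial accuracies reported above
accuracies = [88.19, 87.58, 87.84, 88.18, 88.13,
              87.99, 87.77, 88.27, 87.86, 87.90]

mean = sum(accuracies) / len(accuracies)
# Sample standard deviation, then standard error of the mean
var = sum((a - mean) ** 2 for a in accuracies) / (len(accuracies) - 1)
stderr = math.sqrt(var) / math.sqrt(len(accuracies))
print(f"{mean:.2f} {stderr:.2f}")  # matches the summary line above
```

The small standard error (0.07) shows the ten random splits agree closely, so the 87.97% figure is a stable estimate.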
Now we skip the header information in the documents, i.e., the subject, title, etc. lines of the newsgroup articles:
test-percentage=10 -- Correct: 1683 out of 1986 (84.74 percent accuracy)
test-percentage=20 -- Correct: 3375 out of 3980 (84.80 percent accuracy)
test-percentage=33 -- Correct: 5526 out of 6578 (84.01 percent accuracy)
test-percentage=40 -- Correct: 6711 out of 7967 (84.23 percent accuracy)
test-percentage=60 -- Correct: 9685 out of 11951 (81.04 percent accuracy)
test-percentage=80 -- Correct: 2926 out of 4230 (69.17 percent accuracy)
Analysis: As we can see from the results above, performance improves as the training percentage increases. For example, when the training set is only 20% of the whole data set (test-percentage=80), the accuracy is 69.17%; when the training set is 40% of the data (test-percentage=60), the accuracy jumps to 81.04%. This indicates that training on insufficient data yields lower performance, especially when the model is evaluated over a much larger set of unseen data.
Evaluating performance only on the training set may not fairly judge the real performance over the whole domain: a hypothesis fitted to the training set may have overfit the training examples and may not perform well when measured on unseen test cases. Thus evaluating on the training set alone is not enough; we need new data that was unseen during training. Cross-validation does this job.
As the professor mentioned, improving accuracy need not be the goal of this assignment. Instead, we consider alternative measures of evaluation: how precise is the classifier for each newsgroup? Which newsgroups get the highest/lowest precision, and why? Though the WHY question is hard to answer, we ran some experiments and tried to find out.
Analyzing the classification accuracy of each newsgroup, we found that the talk.religion.misc newsgroup had the lowest precision (around 57% accuracy), while soc.religion.christian had the highest precision among the 20 newsgroups' more than 20,000 articles, at 99% accuracy. We think this is probably because talk.religion.misc does not focus on a relatively specific set of topics; its posts may range over many other fields of interest. On the other hand, soc.religion.christian mainly focuses on the religion derived from Jesus Christ, based on the Bible as sacred scripture.
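Per-newsgroup accuracy of this kind can be computed from (actual, predicted) pairs. A small sketch with made-up example pairs (the class names are from the data set; the pairs themselves are illustrative, not our real output):

```python
from collections import defaultdict

def per_class_accuracy(pairs):
    """Fraction of each class's documents classified correctly.
    pairs: iterable of (actual_class, predicted_class)."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for actual, predicted in pairs:
        total[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    return {c: correct[c] / total[c] for c in total}

pairs = [("soc.religion.christian", "soc.religion.christian"),
         ("talk.religion.misc", "alt.atheism"),
         ("talk.religion.misc", "talk.religion.misc")]
print(per_class_accuracy(pairs))
```

Grouping errors by actual class this way is what reveals that the misclassifications concentrate in topically diffuse groups like talk.religion.misc.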
The accuracy for higher-level distinctions, for instance just comp vs. rec, is very high.
Correct: 104 out of 128 (81.25 percent accuracy)
Confusion details, row is actual, column is predicted

 classname    0   1   2   3  :total :correct percentage
 0 student   22   6   6   1  :  35    62.86%
 1 dept       2  27   1   1  :  31    87.10%
 2 extra      6   .  23   .  :  29    79.31%
 3 faculty    .   .   1  32  :  33    96.97%

Average_percentage_accuracy_and_stderr 81.25 0.00

[Figure: Performance vs. Training Percentage]
Correct: 99 out of 128 (77.34 percent accuracy)
Confusion details, row is actual, column is predicted

 classname    0   1   2   3  :total :correct percentage
 0 student   17   6   1  10  :  34    50.00%
 1 dept       1  29   1   1  :  32    90.62%
 2 faculty    .   .  32   .  :  32   100.00%
 3 extra      6   3   .  21  :  30    70.00%

Average_percentage_accuracy_and_stderr 77.34 0.00

[Figure: Performance vs. Training Percentage]
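The totals and per-class percentages in the second table can be recomputed directly from the confusion matrix (values transcribed from the table above; '.' entries are zeros):

```python
# Rows are actual classes, columns are predicted classes.
labels = ["student", "dept", "faculty", "extra"]
matrix = [
    [17,  6,  1, 10],  # student
    [ 1, 29,  1,  1],  # dept
    [ 0,  0, 32,  0],  # faculty
    [ 6,  3,  0, 21],  # extra
]

# Overall accuracy: diagonal (correct) over grand total
correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)
print(f"Correct: {correct} out of {total} ({100 * correct / total:.2f} percent)")

# Per-class percentage: diagonal entry over that class's row total
for i, (name, row) in enumerate(zip(labels, matrix)):
    print(f"{name}: {100 * row[i] / sum(row):.2f}%")
```

Reading along a row shows where a class's documents leak: here, ten student pages were labeled extra, the largest single confusion.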
0.32514 research
0.28718 professor
0.18282 ph
0.18221 department
0.14243 publications
0.13559 engineering
0.13130 physics
0.12831 interests
0.12162 faculty
0.12149 systems
0.11985 mudd
0.11280 courses
0.11198 fax
0.10975 selected
0.10606 programs
0.10315 phys
0.10299 feedback
0.08793 table
0.08780 cellpadding
0.08084 telephone