Martin Jansche

Postdoc, Center for Computational Learning Systems; jansche@cs.columbia.edu

Statistical models of word frequency and other count data

Time: Thursday, February 12th, 11:30-12:30

Abstract:

Many text processing applications are based, to some extent, on counting words; these include text categorization, information retrieval, and topic detection and tracking, among others. Such applications often use generative models of word frequency, which predict the number of occurrences of a word within a document. It is well known (Church & Gale 1995) that such occurrence counts are modeled poorly by standard probability distributions like the binomial or Poisson: observed counts vary more than these simple models predict. This overdispersion has prompted the use of continuous mixtures of standard distributions as robust alternatives (Church & Gale 1995; Lowe 1999). However, I will show that continuous mixtures do not remedy another deficiency of standard models: most words never occur in a given document at all, resulting in large numbers of zero counts. I have proposed (Jansche 2003) the use of very simple discrete mixtures, so-called zero-inflated models, which mix a point mass at zero with a standard distribution. When evaluated on a Naive Bayes text classification task, simple zero-inflated models can account for most of the practically relevant variation, outperforming standard models. Zero-inflated models have the added advantage of being easier to work with than continuous mixtures, and they help us resolve an anomaly observed by McCallum & Nigam (1998).
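To make the zero-inflated idea concrete, here is a minimal sketch of fitting a zero-inflated Poisson by expectation-maximization in Python. It is an illustration only, not the estimator presented in the talk or in Jansche (2003): the function name zip_em, the initialization, and the simulated parameter values are all assumptions made for this example.

import math
import random

def zip_em(counts, iters=100):
    """Fit a zero-inflated Poisson model to a list of counts by EM.

    Returns (pi, lam), where pi is the weight of the extra point mass
    at zero and lam is the rate of the Poisson component, so that
    P(X = k) = pi*[k == 0] + (1 - pi)*Poisson(k; lam).
    """
    n = len(counts)
    total = sum(counts)
    n_zero = sum(1 for c in counts if c == 0)
    pi, lam = 0.5, max(total / n, 1e-6)  # crude starting values
    for _ in range(iters):
        # E-step: posterior probability that a zero count came from the
        # point mass rather than the Poisson component.  Every zero
        # observation shares the same responsibility.
        p0 = math.exp(-lam)                 # Poisson probability of a zero
        z = pi / (pi + (1.0 - pi) * p0)
        structural = z * n_zero             # expected number of structural zeros
        # M-step: re-estimate the mixture weight and the Poisson rate
        # from the expected component assignments.
        pi = structural / n
        poisson_n = n - structural          # expected Poisson sample size
        lam = total / poisson_n
    return pi, lam

def sample_poisson(lam):
    # Knuth's method; adequate for small rates.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

if __name__ == "__main__":
    random.seed(0)
    # A word that is "off" in 70% of documents and otherwise occurs
    # Poisson(2.5) times; the fit should come out near (0.7, 2.5).
    data = [0 if random.random() < 0.7 else sample_poisson(2.5)
            for _ in range(20000)]
    print(zip_em(data))

Note that the E-step collapses to a single scalar because all zero observations are interchangeable, which is one sense in which these discrete mixtures are easier to work with than continuous ones.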

K. W. Church & W. A. Gale. 1995. Poisson mixtures. Natural Language Engineering 1: 163-190.

M. Jansche. 2003. Parametric models of linguistic count data. ACL 41.

S. Lowe. 1999. The beta-binomial mixture model for word frequencies in documents with applications to information retrieval. Eurospeech 6.

A. McCallum & K. Nigam. 1998. A comparison of event models for Naive Bayes text classification. AAAI Workshop on Learning for Text Categorization.