Peter Davis

Graduate Student, ptd7@cs.columbia.edu

Title: Glossary Finding across Heterogeneous Formats

Time: Thursday November 21, 12:30pm - 1pm

Place: CS Conference Room in MUDD

Abstract:

We present a component that identifies which HTML pages in a web site consist of or contain glossaries of specialized terms, and that improves the performance of text classification algorithms in finding glossaries. This component is part of GetGloss, a glossary identification and normalization module. It uses a rule-based decision system built on top of a standard crawler. We compare our approach to Support Vector Machine (SVM) and Naive Bayes (NB) classifiers and demonstrate that GetGloss outperforms both when glossaries are irregular in format, and that the performance of text classification algorithms improves when GetGloss is used first as a pre-processor. In contrast, SVM outperforms both GetGloss and NB only when glossaries are consistent in their internal structure. We conducted a series of experiments to analyze the contexts in which GetGloss, SVM, and NB each perform best. Our evaluation covers 275 Federal Agency domains. Our conclusions are:

1. When glossaries are formatted in a number of different ways, text classification algorithms perform poorly.
2. Running SVM on glossary candidates identified by GetGloss improves SVM performance.
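
The pre-processing idea in conclusion 2 can be illustrated with a minimal Python sketch. It is not the GetGloss rule set, which this abstract does not describe. The rules here (counting <dt> tags, matching "Term: definition" lines, checking the page title), the names is_glossary_candidate and classify_with_prefilter, and the min_entries threshold are all assumptions chosen for illustration. The classifier argument stands in for any trained text classifier, such as an SVM.

```python
import re
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Collects a few structural cues from a page (hypothetical rules)."""

    def __init__(self):
        super().__init__()
        self.dt_count = 0      # number of <dt> tags (definition-list terms)
        self.in_title = False
        self.title = ""
        self.text_parts = []   # text chunks found outside <title>

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.dt_count += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.text_parts.append(data)


# Assumed pattern for lines shaped like "Term: definition" or "Term - definition".
TERM_LINE = re.compile(
    r"^\s*[A-Z][\w /()-]{1,40}\s*(?::|\s-\s)\s*\S.{10,}$", re.M
)
TITLE_CUE = re.compile(r"\b(glossary|definitions|terminology)\b", re.I)


def is_glossary_candidate(html, min_entries=3):
    """Rule-based check: does this page look like it contains a glossary?"""
    counter = StructureCounter()
    counter.feed(html)
    text = "\n".join(counter.text_parts)
    # Take the strongest of the two entry cues so that either format counts.
    entries = max(counter.dt_count, len(TERM_LINE.findall(text)))
    title_hit = TITLE_CUE.search(counter.title) is not None
    return entries >= min_entries or (title_hit and entries >= 1)


def classify_with_prefilter(pages, classifier):
    """Run `classifier` only on pages that pass the rule-based filter.

    pages: dict mapping URL to HTML source.
    classifier: any callable taking HTML and returning True or False.
    """
    candidates = {u: h for u, h in pages.items() if is_glossary_candidate(h)}
    return {u: classifier(h) for u, h in candidates.items()}
```

Because the filter accepts several surface formats (definition lists and plain "Term: definition" lines), the downstream classifier sees a smaller and more uniform set of pages, which is one plausible reading of why pre-filtering could help.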