LEXical INformation extraction from Glossaries


The Columbia LEXING system takes a definition source (a web page or document) and automatically creates a Lexical Knowledge Base (LKB). An LKB is a structured representation of a set of definitions, and can be used for ontology generation and definition analysis.

The system uses part-of-speech tagging to analyze each definition, and then uses LinkIT, a noun-phrase chunker from Columbia University, to identify phrases. From there, semantic attributes of two types are tagged: predefined and probabilistic.

Output from predefined semantic attribute analysis can be fed automatically into an ontology such as the USC/ISI SENSUS system in the DGRC project. Output from probabilistic analysis is meant for the user to post-process.
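The tagging and chunking steps described above can be illustrated with a toy sketch. This is not LinkIT or the tagger LEXING uses: the lexicon, the tag names, and the functions `tag` and `chunk_noun_phrases` are all hypothetical stand-ins, assuming only that a noun phrase is a run of adjectives and nouns ending in a noun.

```python
# Toy illustration only. A hand-written lexicon stands in for a real
# part-of-speech tagger, and a simple grouping rule stands in for LinkIT.
TOY_LEXICON = {
    "a": "DET", "an": "DET", "the": "DET",
    "is": "VERB", "are": "VERB",
    "of": "PREP", "in": "PREP", "for": "PREP",
    "structured": "ADJ", "lexical": "ADJ",
}


def tag(tokens):
    """Assign a tag to each token; unknown words default to NOUN (an assumption)."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]


def _flush(current, phrases):
    """Close the current run, dropping trailing adjectives so it ends in a noun."""
    while current and current[-1][1] != "NOUN":
        current.pop()
    if current:
        phrases.append(" ".join(word for word, _ in current))


def chunk_noun_phrases(tagged):
    """Group consecutive ADJ/NOUN tokens into candidate noun phrases."""
    phrases, current = [], []
    for word, pos in tagged:
        if pos in ("ADJ", "NOUN"):
            current.append((word, pos))
        else:
            _flush(current, phrases)
            current = []
    _flush(current, phrases)
    return phrases


definition = "A glossary is a structured set of definitions"
phrases = chunk_noun_phrases(tag(definition.split()))
print(phrases)
```

A real chunker handles far more cases (proper names, conjunctions, nested phrases); the sketch only shows how tagged tokens become phrase candidates for later attribute tagging.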

The system also uses the Acrocat acronym cataloguing system to try to determine the meaning of acronyms used in the document. A list of possibilities for each acronym in the current definition is given, along with a confidence marker.
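One way to picture the acronym output is as a list of candidate expansions per acronym, ordered by confidence. The sketch below is an assumption, not Acrocat's actual data model: the `Expansion` class, the numeric 0 to 1 confidence scale, and the sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Expansion:
    """A candidate meaning for an acronym (hypothetical structure)."""
    text: str
    confidence: float  # assumed numeric score; the real marker may differ


def rank_expansions(candidates):
    """Return candidates ordered from most to least confident."""
    return sorted(candidates, key=lambda e: e.confidence, reverse=True)


# Made-up sample values, for illustration only.
candidates = {
    "LKB": [
        Expansion("Linguistic Knowledge Builder", 0.3),
        Expansion("Lexical Knowledge Base", 0.9),
    ],
}

ranked = rank_expansions(candidates["LKB"])
for expansion in ranked:
    print(f"LKB: {expansion.text} ({expansion.confidence:.1f})")
```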

Output is now being provided in easily parsable XML. Details are on the XML page.
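As a rough idea of how such output might be consumed, the sketch below reads a small XML document with Python's standard library. The element and attribute names (`lkb`, `entry`, `term`, `definition`, `attribute`) are hypothetical; the actual schema is the one documented on the XML page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real LEXING schema may look quite different.
SAMPLE = """<lkb>
  <entry term="glossary">
    <definition>A structured set of definitions.</definition>
    <attribute type="predefined" name="is-a">set</attribute>
  </entry>
</lkb>"""


def load_entries(xml_text):
    """Map each term to its definition text and its tagged attributes."""
    root = ET.fromstring(xml_text)
    entries = {}
    for entry in root.findall("entry"):
        attributes = [
            (attr.get("type"), attr.get("name"), attr.text)
            for attr in entry.findall("attribute")
        ]
        entries[entry.get("term")] = {
            "definition": entry.findtext("definition"),
            "attributes": attributes,
        }
    return entries


entries = load_entries(SAMPLE)
print(entries["glossary"]["definition"])
```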

Data sets

The system has been run on the following data sets:

The system will soon be run on the following data sets:

The output fed into the USC/ISI SENSUS project is similar to the controlled vocabulary system at the Inter-University Consortium for Political and Social Research.


Columbia University Definition Analysis Project
Judith Klavans - klavans@cs.columbia.edu
Walter Bourne - walter@columbia.edu
Brian Whitman - bwhitman@cs.columbia.edu
Deniz Sarioz - deniz@cs.columbia.edu
Samuel Popper - sp2014@cs.columbia.edu