LEXical INformation extraction from Glossaries
The Columbia LEXING system takes a definition source (a web page or
document) and automatically creates a Lexical Knowledge Base (LKB). An LKB
is a structured representation of a set of definitions, and can be used for
ontology generation and definition analysis.
The system uses part-of-speech tagging to analyze each definition, and
then uses LinkIT, a noun-phrase chunker from Columbia University, to
determine 'phrases.' From there, it tags semantic attributes of two types:
- Predefined Semantic Attributes: determined after an analysis of the
literature and a definition set. These include such attributes as
"contains," "used for," "excludes," "includes," and so on. They are
arranged into three colour-coded categories: properties,
excludes/includes, and quantifiers.
- Automatically Determined Potential Attributes: determined by running a
bigram probability model across the entire document to find other
attributes that might be useful in the document. A group of these is
identified and, if any occur in the definition currently being analyzed,
they are shown under the analysis with the phrases surrounding them. This
way, a user can note which attributes might be added to the predefined set.
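The two attribute types above can be sketched roughly as follows. This is a minimal illustration, not the LEXING implementation: it skips part-of-speech tagging and LinkIT chunking entirely, the assignment of cues to categories and the quantifier words are assumptions, and the source does not say how the bigram model scores candidates, so the conditional probability P(second word | first word) used here is only one plausible choice. All names (PREDEFINED, tokenize, tag_predefined, candidate_attributes, min_count, top_n) are hypothetical.

```python
import re
from collections import Counter

# Hypothetical cue phrases grouped into the three categories named in the text.
# Which cue belongs to which category is an assumption made for illustration.
PREDEFINED = {
    "properties": ["contains", "used for"],
    "excludes/includes": ["excludes", "includes"],
    "quantifiers": ["all", "some", "most"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tag_predefined(definition):
    """Return (category, cue) pairs for each predefined cue found in a definition."""
    padded = " " + " ".join(tokenize(definition)) + " "
    hits = []
    for category, cues in PREDEFINED.items():
        for cue in cues:
            if f" {cue} " in padded:  # whole-word match only
                hits.append((category, cue))
    return hits


def candidate_attributes(definitions, min_count=2, top_n=5):
    """Rank repeated bigrams by the estimate P(second | first) over all definitions."""
    unigrams, bigrams = Counter(), Counter()
    for definition in definitions:
        tokens = tokenize(definition)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    scored = [
        (" ".join(pair), count / unigrams[pair[0]])
        for pair, count in bigrams.items()
        if count >= min_count
    ]
    # Highest probability first; alphabetical order breaks ties.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


definitions = [
    "Coal is a fuel used for heating and measured in short tons.",
    "Natural gas is measured in cubic feet and excludes propane.",
]
for definition in definitions:
    print(tag_predefined(definition))
print(candidate_attributes(definitions))
```

In this sketch a bigram that recurs across definitions, such as "measured in", surfaces as a candidate that a user could review and promote to the predefined set.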
Output from predefined semantic attribute analysis can be automatically
fed into an ontology such as the USC/ISI SENSUS system in the DGRC project.
Output from probabilistic analysis is meant for the user to post-process.
The system also uses the acronym cataloguing system to try to determine
the meaning of acronyms used in the document. A list of possibilities for
each acronym in the current definition is shown along with a confidence
marker.
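As a rough illustration of listing candidate expansions with a confidence marker, the sketch below scans a document for word sequences whose initials spell the acronym. It is not the acronym cataloguing system itself, whose method is not described here; the function name acronym_candidates and the use of relative frequency as the confidence value are assumptions.

```python
import re


def acronym_candidates(acronym, document):
    """Return (expansion, confidence) pairs for phrases whose initials spell the acronym.

    Confidence is the share of all matching phrases that a given expansion
    accounts for. This is a hypothetical heuristic chosen for illustration.
    """
    words = re.findall(r"[A-Za-z]+", document)
    n = len(acronym)
    counts = {}
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        initials = "".join(word[0] for word in window).upper()
        if initials == acronym.upper():
            phrase = " ".join(window)
            counts[phrase] = counts.get(phrase, 0) + 1
    total = sum(counts.values())
    # Most frequent expansion first; alphabetical order breaks ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(phrase, count / total) for phrase, count in ranked]


text = (
    "The Energy Information Administration (EIA) publishes data. "
    "The Energy Information Administration also runs surveys. "
    "An early internal audit was done."
)
for expansion, confidence in acronym_candidates("EIA", text):
    print(f"{expansion}: {confidence:.2f}")
```

A real system would likely also weigh capitalization, proximity to the acronym, and parenthetical patterns; this sketch only shows the shape of a ranked candidate list.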
Output is now being provided in easily parsable XML. Details are on
the XML page.
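To give a sense of what easily parsable XML output for one analyzed definition might look like, here is a small sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (definition, term, text, attribute, category, acronym, short, confidence) are invented for illustration and are not the system's actual schema, which is described on the XML page.

```python
import xml.etree.ElementTree as ET


def definition_to_xml(term, text, attributes, acronyms):
    """Serialize one analyzed definition to an XML string (hypothetical schema)."""
    entry = ET.Element("definition", {"term": term})
    ET.SubElement(entry, "text").text = text
    # One <attribute> element per tagged semantic attribute.
    for category, cue in attributes:
        ET.SubElement(entry, "attribute", {"category": category}).text = cue
    # One <acronym> element per candidate expansion, with its confidence.
    for short, expansion, confidence in acronyms:
        node = ET.SubElement(
            entry, "acronym", {"short": short, "confidence": f"{confidence:.2f}"}
        )
        node.text = expansion
    return ET.tostring(entry, encoding="unicode")


xml_text = definition_to_xml(
    "coal",
    "A fuel used for heating.",
    [("properties", "used for")],
    [("EIA", "Energy Information Administration", 0.67)],
)
print(xml_text)

# Any standard XML parser can read the result back.
root = ET.fromstring(xml_text)
print(root.get("term"), [a.text for a in root.findall("attribute")])
```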
The system has been run on the following data sets:
- Department of Energy - Energy Information Administration
- Environmental Protection Agency
- U.S. Bureau of the Census
- Lawrence Berkeley National Laboratory - Human Genome Sequencing Department - Biology Group
- Biomedical glossary, portions taken from U.S. Congress, Office of Technology Assessment, Mapping Our Genes The Genome Projects: How Big, How Fast? OTA-BA-373, Washington, D.C.: U.S. Government Printing Office, 1998 (165 definitions)
- Centers for Disease Control and Prevention
- U.S. Department of the Interior Bureau of Reclamation
The system will soon be run on the following data sets:
- National Center for Health Statistics
- U.S. Department of Labor - Bureau of Labor Statistics
- U.S. Bureau of the Census
The output fed into the USC/ISI SENSUS project resembles the
controlled vocabulary system at the Inter-University Consortium
for Political and Social Research.
Columbia University Definition Analysis Project
Judith Klavans - email@example.com
Walter Bourne - firstname.lastname@example.org
Brian Whitman - email@example.com
Deniz Sarioz - firstname.lastname@example.org
Samuel Popper - email@example.com