
LEXical INformation extraction from Glossaries
The Columbia LEXING system takes a definition source (web page or
document) and creates a Lexical Knowledge Base automatically. An LKB is a
structured form of a set of definitions, and can be used for ontology
generation and definition analysis.
The system uses part-of-speech tagging to analyze each definition, and
then uses LinkIT, a noun-phrase chunker from Columbia University, to
identify phrases. From there, it tags semantic attributes of two types:
- Predefined Semantic Attributes: determined after an analysis of
definition literature and a definition set. These include such
attributes as "contains," "used for," "excludes," "includes," and so on.
These are arranged into three colour-coded categories: properties,
excludes/includes, and quantifiers.
- Automatically Determined Potential Attributes: determined after running
a bigram probability model across the entire document to find other
attributes that might be useful in classifying the document. A group of
these is identified and, if any occur in the definition currently being
analyzed, they are shown under the analysis with their surrounding
phrases. This way, a user can note which attributes might need to be
added to the predefined set.
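The two attribute types above can be illustrated with a minimal Python sketch. It is not the LEXING implementation: the attribute table is a hypothetical subset (the quantifier entries in particular are invented for illustration), the function names are made up, and pointwise mutual information stands in for the bigram probability model, whose exact form is not described here.

```python
import math
import re
from collections import Counter

# Hypothetical subset of the predefined attributes, grouped by the three
# categories named above. The real LEXING attribute list is not given here.
PREDEFINED = {
    "properties": ["contains", "used for"],
    "excludes/includes": ["excludes", "includes"],
    "quantifiers": ["all", "some"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tag_predefined(definition):
    """Return (category, attribute) pairs whose phrase occurs in the definition."""
    padded = " " + " ".join(tokenize(definition)) + " "
    return [
        (category, attr)
        for category, attrs in PREDEFINED.items()
        for attr in attrs
        if f" {attr} " in padded
    ]


def candidate_bigrams(definitions, min_count=2):
    """Rank repeated bigrams by pointwise mutual information (assumed scoring)."""
    unigrams, bigrams = Counter(), Counter()
    for definition in definitions:
        tokens = tokenize(definition)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # ignore bigrams too rare to suggest an attribute
        p_pair = count / n_bi
        p_first = unigrams[first] / n_uni
        p_second = unigrams[second] / n_uni
        scored[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

Bigrams that rank highly but are absent from the predefined table would be the ones a user might consider adding to it.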
Output from predefined semantic attribute analysis can be automatically
fed into an ontology such as the USC/ISI SENSUS system in the DGRC
project. Output from probabilistic analysis is meant for the user to
post-process.
The system also uses the Acrocat acronym cataloguing system to try to
determine the meaning of acronyms used in the document. For each acronym
in the current definition, a list of possibilities is shown along with a
confidence marker.
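To make the idea of ranked acronym possibilities concrete, here is a small Python sketch. It does not reproduce Acrocat, whose method is not described on this page. It assumes a simple heuristic: any run of words in the document whose initials spell the acronym is a candidate, and the confidence is that candidate's share of all matches. The function name is hypothetical.

```python
import re
from collections import Counter


def acronym_candidates(acronym, document):
    """List expansions whose word initials spell the acronym, with a confidence."""
    words = re.findall(r"[A-Za-z]+", document)
    letters = acronym.upper()
    size = len(letters)
    counts = Counter()
    # Slide a window of len(acronym) words across the document.
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        if all(word[0].upper() == letter for word, letter in zip(window, letters)):
            counts[" ".join(word.lower() for word in window)] += 1
    total = sum(counts.values())
    # Confidence is the fraction of matching windows that gave this expansion.
    return [(expansion, count / total) for expansion, count in counts.most_common()]
```

An acronym with no matching word sequence yields an empty list, which a caller could report as "no expansion found."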
Output is now being provided in easily parsable XML. Details are on
the XML page.
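As a rough illustration of what easily parsable output could look like, the following Python sketch serializes one analyzed definition and reads it back. The element and attribute names (definition, text, attribute, category, name) are assumptions made for this example; the actual schema is the one documented on the XML page.

```python
import xml.etree.ElementTree as ET


def definition_to_xml(term, text, attributes):
    """Serialize one analyzed definition using a hypothetical XML layout."""
    entry = ET.Element("definition", {"term": term})
    ET.SubElement(entry, "text").text = text
    for category, name in attributes:
        ET.SubElement(entry, "attribute", {"category": category, "name": name})
    return ET.tostring(entry, encoding="unicode")


xml_out = definition_to_xml(
    "gasohol",
    "A blend that contains gasoline and ethanol.",
    [("properties", "contains")],
)

# A downstream consumer can recover the tagged attributes with a standard parser.
parsed = ET.fromstring(xml_out)
attribute_names = [node.get("name") for node in parsed.findall("attribute")]
```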
The system has been run on the following data sets:
- Department of Energy - Energy Information Administration
- Environmental Protection Agency
- U.S. Bureau of the Census
- Lawrence Berkeley National Laboratory - Human Genome Sequencing Department - Biology Group
- Biomedical glossary, portions taken from U.S. Congress, Office of Technology Assessment, Mapping Our Genes: The Genome Projects: How Big, How Fast? OTA-BA-373, Washington, D.C.: U.S. Government Printing Office, 1988. (165 definitions)
- Centers for Disease Control and Prevention
- U.S. Department of the Interior Bureau of Reclamation
The system will soon be run on the following data sets:
- National Center for Health Statistics
- U.S. Department of Labor - Bureau of Labor Statistics
- U.S. Bureau of the Census
The output fed into the USC/ISI SENSUS project resembles the
controlled vocabulary system at the Inter-University Consortium
for Political and Social Research.
Columbia University Definition Analysis Project
Judith Klavans - klavans@cs.columbia.edu
Walter Bourne - walter@columbia.edu
Brian Whitman - bwhitman@cs.columbia.edu
Deniz Sarioz - deniz@cs.columbia.edu
Samuel Popper - sp2014@cs.columbia.edu