LEXical INformation extraction from Glossaries

Description

The Columbia LEXING system takes a definition source (a web page or document) and automatically creates a Lexical Knowledge Base (LKB). An LKB is a structured representation of a set of definitions, and can be used for ontology generation and definition analysis.

The system uses part-of-speech tagging to analyze each definition, and then uses LinkIT, a noun-phrase chunker from Columbia University, to identify phrases. From there, it tags semantic attributes of two types:

Predefined Semantic Attributes: determined after an analysis of definition literature and a definition set. These include such attributes as "contains," "used for," "excludes," "includes," and so on. These are arranged into three colour-coded categories: properties, excludes/includes, and quantifiers.
Automatically Determined Potential Attributes: determined by running a bigram probability model across the entire document to find other attributes that might be useful in classifying the document. A group of these is identified, and any that occur in the definition currently being analyzed are shown beneath the analysis together with the phrases surrounding them. This way, a user can note which attributes might need to be added to the predefined set.
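
As a rough illustration of the second attribute type, the sketch below counts bigrams across a whole glossary, ranks them, and shows the hits for one definition with a few surrounding words. It is not the LEXING implementation: the pointwise mutual information score, the function names, and the thresholds are all assumptions made for this example, and plain word windows stand in for LinkIT phrases.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_attributes(definitions, min_count=2, top_n=5):
    """Rank bigrams seen across all definitions (hypothetical scoring).

    PMI is an assumed stand-in for the system's bigram probability model.
    """
    unigrams, bigrams = Counter(), Counter()
    for text in definitions:
        tokens = tokenize(text)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue  # ignore bigrams too rare to be reliable
        p_pair = n / total_bi
        p_independent = (unigrams[w1] / total_uni) * (unigrams[w2] / total_uni)
        scored.append((math.log(p_pair / p_independent), n, f"{w1} {w2}"))
    scored.sort(reverse=True)
    return [phrase for _, _, phrase in scored[:top_n]]


def show_in_context(definition, candidates, window=3):
    """Return (left words, candidate, right words) for each hit in one definition."""
    tokens = tokenize(definition)
    hits = []
    for i in range(len(tokens) - 1):
        phrase = f"{tokens[i]} {tokens[i + 1]}"
        if phrase in candidates:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 2:i + 2 + window])
            hits.append((left, phrase, right))
    return hits
```

Calling `candidate_attributes` on the full list of definitions and then `show_in_context` on a single definition mirrors the two steps described above: candidates come from the whole document, but only those present in the current definition are displayed.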

Output from the predefined semantic attribute analysis can be fed automatically into an ontology such as the USC/ISI SENSUS system in the DGRC project. Output from the probabilistic analysis is meant for the user to post-process.
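
To make the predefined-attribute output more concrete, here is a minimal sketch that scans a definition for a few cue phrases and emits one record per match, a shape that could plausibly be loaded into an ontology. The cue table, the assignment of cues to categories, and the record layout are assumptions for illustration; only the attribute names and category names come from the description above, and the text following a cue stands in for a LinkIT noun phrase.

```python
import re

# Hypothetical cue table. The attribute and category names appear in the
# description; which cue belongs to which category is assumed here.
PREDEFINED = {
    "used for": "properties",
    "contains": "properties",
    "includes": "excludes/includes",
    "excludes": "excludes/includes",
}


def tag_predefined(term, definition):
    """Return (term, attribute, category, following text) records."""
    records = []
    for cue, category in PREDEFINED.items():
        # Capture the text after the cue up to the next punctuation mark.
        pattern = rf"\b{re.escape(cue)}\b\s+([^.,;]+)"
        for match in re.finditer(pattern, definition, re.IGNORECASE):
            records.append((term, cue, category, match.group(1).strip()))
    return records
```

Each record ties a defined term to an attribute and the phrase it governs, which is the kind of structured relation an ontology loader would need.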

The system also uses the Acrocat acronym cataloguing system to try to determine the meaning of acronyms used in the document. For each acronym in the current definition, a list of possible expansions is shown along with a confidence marker.
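
The idea of listing candidate expansions with a confidence marker can be sketched as follows. This is not how Acrocat works internally, which is not described here; the initial-letter matching and the frequency-share confidence score are invented stand-ins, and `acronym_candidates` is a hypothetical name.

```python
import re
from collections import Counter


def acronym_candidates(acronym, document):
    """List (expansion, confidence) pairs for one upper-case acronym.

    Confidence is the share of matching capitalized word sequences that
    spell a given expansion (an assumed measure, for illustration only).
    """
    words = re.findall(r"[A-Za-z]+", document)
    n = len(acronym)
    counts = Counter()
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        # Keep sequences whose capitalized initials spell the acronym.
        if all(word[0] == letter for word, letter in zip(window, acronym)):
            counts[" ".join(window)] += 1
    total = sum(counts.values())
    return [(expansion, round(count / total, 2))
            for expansion, count in counts.most_common()]
```

An acronym with no matching sequence in the document simply yields an empty list, leaving it unresolved rather than guessed.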

Output is now being provided in easily parsable XML. Details are on the XML page.
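
For readers who want a feel for what machine-readable output of this kind might look like, the sketch below serializes one analyzed definition with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`definition`, `text`, `attribute`, `term`, `name`) are hypothetical and are not the project's actual schema, which is documented on the XML page.

```python
import xml.etree.ElementTree as ET


def definition_to_xml(term, text, attributes):
    """Serialize one definition and its tagged attributes to an XML string.

    The element names are illustrative assumptions, not the LEXING schema.
    """
    entry = ET.Element("definition", {"term": term})
    ET.SubElement(entry, "text").text = text
    for name, phrase in attributes:
        attr = ET.SubElement(entry, "attribute", {"name": name})
        attr.text = phrase
    return ET.tostring(entry, encoding="unicode")
```

Because the result is ordinary XML, a consumer can read it back with `ET.fromstring` and iterate over the `attribute` elements without any custom parsing.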

Data sets

The system has been run on the following data sets:

Department of Energy - Energy Information Administration
- An edited small set of energy terms. (19 definitions)
- Glossary of Energy and Energy Related Terms, Second Edition, May 1995.
Environmental Protection Agency
- Glossary of Terms of Environment - a terms, May 1998. (142 definitions) original document
U.S. Bureau of the Census
- World Population Profile: 1996, pp. D-3 to D-4. (53 definitions) original document
Lawrence Berkeley National Laboratory - Human Genome Sequencing Department - Biology Group
- Biomedical glossary, portions taken from U.S. Congress, Office of Technology Assessment, Mapping Our Genes - The Genome Projects: How Big, How Fast? OTA-BA-373, Washington, D.C.: U.S. Government Printing Office, 1988. (165 definitions) original document
Centers for Disease Control and Prevention
- Glossary of Epidemiology Terms. (165 definitions) original document
U.S. Department of the Interior Bureau of Reclamation
- Glossary for commonly used civil engineering terms. (1943 definitions) original document

The system will soon be run on the following data sets:

National Center for Health Statistics
- Data Definitions.
US Department of Labor - Bureau of Labor Statistics
- Consumer Expenditure Surveys Glossary.
U.S. Bureau of the Census
- Decennial Management Division Glossary.

The output into the USC/ISI SENSUS project is not unlike the controlled vocabulary system at the Inter-University Consortium for Political and Social Research.

Credits

Columbia University Definition Analysis Project
Judith Klavans - klavans@cs.columbia.edu
Walter Bourne - walter@columbia.edu
Brian Whitman - bwhitman@cs.columbia.edu
Deniz Sarioz - deniz@cs.columbia.edu
Samuel Popper - sp2014@cs.columbia.edu