Automatic Lexical Knowledge Base creation from a known domain


From a small subset of definitions, we determined a general definitional structure by hand, and from that point abstracted a larger specification that fits a bigger set of data more recently received. We are building a system (currently in Perl, but with aspirations of a more abstracted lex/yacc-based parser) that will automatically pull out important information from eac definition and insert it into a specified hierarchy. We have received interest from other non-EIA organizations for such a tool, as it could prove useful with ontology creation and onward. The following documents outlines the specifications of such a hierarchy and the processes the software takes to parse non-machine-readable sets of definitions.



Definition Overview

Our definitions from the EIA all look very similar to this:

Aviation gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in aviation reciprocating engines. Fuel specifications are provided in ASTM Specification D 910 and Military Specification MIL-G-5572. Note: Data on blending components are not counted in data on finished aviation gasoline.

Assuming we know our domain through various probabilistic token-based and manual methods of determining important 'pre-phrases', and we can extract topics of note for an ontology browser/query expansion system.

Specific to the EIA set of definitions, a definition file is defined as a text file of one or more definition paragraphs separated by the lexical constant TWO_EOLS, which is two carriage returns. Each def. paragraph is one line, and from all the definitions given to us by the EIA/DOE, we have three main sections of the definition, defined as so:

Head Term - any text before the first colon in the definition
Definition Body - text from the first colon until either a Note: token or the end of the line.
Note - text from Note: and onward to the EOL. {Optional}

These three separations then expand automatically based on properties and includes/excludes. Head Term, however, uses a limited set of properties and is treated differently. As well, two other subsections appear the end of the hierarchy: acronyms and implied-defs. So our hierarchy starts to take shape:

Definition-List
     Definition-Paragraph
          Head Term
               head-term specific properties...
          Definition Body
               properties...
          Note
               properties...
          Acronyms
               {pointer to acronym-list}
          Implied-Defs
               {pointer to idef-list}

Back to Top.



Acronym List

An acronym list includes each acronym used in the definition and what their expanded value is. This knowledge is stored in an external acronym-list, which knowledge is gained throughout the entire parsing of the definition set. Acronym-knowledge is not a trivial problem, especially when dealing with this domain: 'transient words' such as with, of, about confuse the situation and many chemical acronyms appear which do not map to their capital letters. For example, in the gasoline definitions we are presented with: Note: This category excludes reformulated gasoline blendstock for oxygenate blending (RBOB) as well as other blendstock. Both gasoline and for are part of the entire RBOB definition but do not appear in its informal acronym.

Back to Top.



Implied-Defs

This list, similar in structure to the acronym list, contains clauses and phrases 'implicitly defined' by the use of such symbols as 'i.e.'; and 'e.g.' For example: "test methods for determining the antiknock rating, i.e., octane rating." Since this information is not directly tied to the particular definition but could be useful elsewhere, we maintain an external idef-list as a simple lookup table. However, since the mention of these words within the definition is useful, we also maintain a list of implicitly-defined words within the definition-paragraph node.

Back to Top.



Head-Term Properties

A head term is usually short and contains two important pieces of information at most: the term being defined and a parenthetical modifier. For example Motor Gasoline (Finished.) Accordingly, the head term only has two applicable properties.

Back to Top.



Definition Properties

This leaves us with the property list for each section within the definition. First, we catch all 'see-also's and 'cf.'s with matches to other definitions within the database. We store this information as pointers to the other parts of the hierarchy that are referenced. As well, we catch the head noun phrase for entry into the LKB.

We have manually determined the list of important phrases within the definitions given to us by the EIA using definition analysis theory. The definitions were collated and normalized by a group within the EIA, and as a result achieve a high level of compatibility for our purposes. However, they remain far from 'machine readable.'

To augment our manual analysis, we ran a quick bigram set on the definitions and obtained a set of phrases which can be used to indicate concepts.


6 having an
5 included in
3 use in
3 greater than

This is from a very small (10 definition) subset of the whole. Actually, the most used phrase by far in the system is "use(d|s)* (for|with|during|in)". Concurrently, we manually paged through the definitions pointing out important phrases that a system should "tag." Manually noticed phrases were not found by the bigram list, and there were also some bigram-encountered phrases that were not found through our manual efforts. This research combined with the mentioned statistical methods provided us with the following standard descriptors:

at most
at least
other than
contains / containing
more than
less than
greater than
used in, for, by, during
characterized as
having
intended for
classification of
excludes / excluding
includes / including

Which then can be split up into three property types, which expands our hierarchy even further:

Definition-List
     Definition-Paragraph
          Head Term
               Defined Term
               Parenthetical Modifier
          Definition Body
          Head Noun Phrase
          Properties
               Properties
               used (in,for,by,during)
               characterized as
               having
               intended for
               classification of
               Excludes/Includes
               excludes/excluding
               includes/including
               contains/containing
               other than
               Quantifiers
               greater than / more than
               less than
               at least
               at most
          x-ref
               {pointers to other definitions}
          Note
               Properties (same as above)
          Acronyms
               {pointer to acronym-list}
          Implied-Defs
               {pointer to idef-list}


The above would be an example of a pretty-print of the information. We need feedback on the UI and query systems end on how this lexical knowledge base should be represented at the 'code end.'


Back to Top.


An Example


From the example definition given below, we have extracted this information by hand:
Motor Gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in spark-ignition engines. Motor gasoline, as defined in ASTM Specification D 4814 or Federal Specification VV-G-1690C, is characterized as having a boiling range of 122 to 158 degrees Fahrenheit at the 10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent recovery point. "Motor Gasoline" includes conventional gasoline; all types of oxygenated gasoline, including gasohol; and reformulated gasoline, but excludes aviation gasoline. Note: Volumetric data on blending components, such as oxygenates, are not counted in data on finished motor gasoline until the blending components are blended into the gasoline.

Definition-List
Definition-Paragraph
Head Term
Defined Term: motor gasoline
Parenthetical Modifier: finished
Definition Body
Head Noun Phrase
"A complex mixture of relatively volatile hydrocarbons"
Properties
Properties
used (in,for,by,during):
for use in spark-ignition engines
characterized as:
having a boiling range of 122 to 158 degrees Fahrenheit at the 10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent recovery point.
Excludes/Includes
excludes/excluding:
aviation gasoline
includes/including:
conventional gasoline; all types of oxygenated gasoline, including gasohol, and reformulated gasoline
Quantifiers
x-ref
{pointers to other definitions}
Note
"Volumetric data on blending component, such as oxygenates, are not counted in data on finished motor gasoline until the blending components are blended into the gasoline."
Properties
Acronyms
ASTM
VV
{pointer to acronym-list}
Implied-Defs

Back to Top.


Problems in Parsing


This definition is a good example because of its similarity to the others: it doesn't fill all the nodes and features but rather a good-sized subset of them. It also shows a few of the problems an automatic system encounters when dealing with this kind of information.

  1. Clauses: when to begin and end. Is "includes: conventional gasoline" enough or should we continue to "conventional gasoline; all types of oxygenated..." and so on?
  2. Things such as "not including..." should be placed in the excludes pile
  3. As mentioned above, acronyms aren't always trivial to identify.
  4. Determining head phrases
  5. Where does an i.e. and e.g. refer back to?

Many of these problems may be solved with adequate text tagging tools for parts of speech and clause structure. The definitions are going to be preprocessed with a combination of Alembic tools and LinkIT to help with these situations. Some of the other issues may require "manual intelligence" to be substituted in the system.

Back to Top.


Mixed Sources


Since any ontology might already have an entry for 'gasoline', and also since many of the defintions we have have multiple conflicting definitions, we will be using an 'Agency' tag at the head of each defintion to determine source.


Back to Top.