Automatic Lexical Knowledge Base creation from a known domain
Working from a small subset of definitions, we determined a general definitional structure by hand, and from there abstracted a larger specification that fits a bigger set of data received more recently.
We are building a system (currently in Perl, but with aspirations of a more
abstracted lex/yacc-based parser) that will automatically pull out important information
from each definition and insert it into a specified hierarchy. We have received
interest from other non-EIA organizations in such a tool, as it could prove
useful for ontology creation and beyond. The following document outlines the
specifications of such a hierarchy and the processes the software uses to parse
non-machine-readable sets of definitions.
Definition Overview
Our definitions from the EIA all look very similar to this:
Aviation gasoline (Finished): A complex mixture of
relatively volatile hydrocarbons with or without small quantities of additives,
blended to form a fuel suitable for use in aviation reciprocating engines. Fuel
specifications are provided in ASTM Specification D 910 and Military
Specification MIL-G-5572. Note: Data on blending components are not counted in
data on finished aviation gasoline.
Assuming we know our domain, through various probabilistic token-based and manual
methods of determining important 'pre-phrases', we can extract
topics of note for an ontology browser/query expansion system.
Specific to the EIA set of definitions, a definition file is a text file of one or more
definition paragraphs separated by the lexical constant TWO_EOLS, which is two
carriage returns. Each definition paragraph is one line. Across all the definitions
given to us by the EIA/DOE, a definition has three main sections,
defined as follows:
Head Term - any text before the first colon in the definition
Definition Body - text from the first colon until either a Note: token or the end of the line
Note {Optional} - text from Note: onward to the EOL
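The three-way split described above can be sketched in a few lines. This is a minimal illustration in Python rather than the Perl system itself; the names TWO_EOLS_RE, Sections, split_file and split_paragraph are hypothetical, and the sketch assumes that TWO_EOLS may appear as either \n\n or \r\n\r\n.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: definition paragraphs are separated by two (or more) line ends,
# written either as "\n" or as "\r\n".
TWO_EOLS_RE = re.compile(r"(?:\r?\n){2,}")


class Sections(NamedTuple):
    head_term: str
    body: str
    note: Optional[str]  # None when the definition carries no Note


def split_file(text: str) -> List[str]:
    """Split a definition file into its non-empty definition paragraphs."""
    return [p.strip() for p in TWO_EOLS_RE.split(text) if p.strip()]


def split_paragraph(paragraph: str) -> Sections:
    """Split one paragraph at the first colon and at the 'Note:' token."""
    head, colon, rest = paragraph.partition(":")
    if not colon:
        raise ValueError("no colon found; cannot locate the head term")
    body, note_token, note = rest.partition("Note:")
    return Sections(head.strip(), body.strip(), note.strip() if note_token else None)
```

For a paragraph shortened from the aviation gasoline definition, "Aviation gasoline (Finished): A complex mixture. Note: Data on blending components are not counted.", split_paragraph returns the head term "Aviation gasoline (Finished)", the body "A complex mixture." and the note text. A head term that itself contains a colon would break this split and would need separate handling.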
These three sections then expand automatically based on
properties and includes/excludes. Head Term, however, uses a limited set of
properties and is treated differently. In addition, two other subsections appear at the
end of the hierarchy: acronyms and implied-defs. So our hierarchy
starts to take shape:
Definition-List
  Definition-Paragraph
    Head Term
      head-term specific properties...
    Definition Body
      properties...
    Note
      properties...
    Acronyms
      {pointer to acronym-list}
    Implied-Defs
      {pointer to idef-list}
Acronym List
An acronym list includes each acronym used in the definition
and its expanded form. This knowledge is stored in an external
acronym-list, which is built up throughout the parsing of the entire
definition set. Acronym recognition is not a trivial problem, especially when
dealing with this domain: 'transient words' such as with, of, and
about confuse the situation, and many chemical acronyms appear which do not
map to their capital letters. For example, in the gasoline definitions we are presented with: "Note: This category excludes reformulated gasoline blendstock for oxygenate blending (RBOB) as well as other blendstock." Both gasoline and for are part of the full RBOB expansion but do not appear in its informal acronym.
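One way to cope with skipped words such as gasoline and for is to match the acronym's letters right to left against the initials of the preceding words, stepping over any word that does not fit. The sketch below is an assumption about how this could be done, not the method used in our system; the function names and the look-back limit of three words per letter are hypothetical, and it only handles acronyms given in parentheses directly after their expansion.

```python
import re
from typing import Dict, List, Optional

PAREN_ACRONYM = re.compile(r"\(([A-Z]{2,})\)")
WORD = re.compile(r"[A-Za-z][A-Za-z-]*")


def expand(acronym: str, preceding: List[str]) -> Optional[str]:
    """Match letters right-to-left against word initials, skipping misfits."""
    window = preceding[-3 * len(acronym):]  # hypothetical look-back limit
    letters = list(acronym.lower())
    i = len(window) - 1
    while letters and i >= 0:
        if window[i].lower().startswith(letters[-1]):
            letters.pop()
            if not letters:
                # Every letter is accounted for; the expansion starts here.
                return " ".join(window[i:])
        i -= 1
    return None


def find_acronyms(text: str) -> Dict[str, str]:
    """Collect parenthesised acronyms together with a guessed expansion."""
    found: Dict[str, str] = {}
    for m in PAREN_ACRONYM.finditer(text):
        words = WORD.findall(text[:m.start()])
        expansion = expand(m.group(1), words)
        if expansion:
            found[m.group(1)] = expansion
    return found
```

Applied to the RBOB note quoted above, find_acronyms maps RBOB to "reformulated gasoline blendstock for oxygenate blending". Chemical acronyms that do not map to word initials are not caught by this approach and would still need a lookup table or manual entry.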
Implied-Defs
This list, similar in structure to the acronym list, contains
clauses and phrases 'implicitly defined' by the use of such markers
as 'i.e.' and 'e.g.' For example: "test methods
for determining the antiknock rating, i.e., octane rating." Since this
information is not directly tied to the particular definition but could be
useful elsewhere, we maintain an external idef-list as a simple lookup table.
However, since the mention of these words within the definition is useful, we
also maintain a list of implicitly-defined words within the definition-paragraph
node.
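A first pass at spotting these markers could be a single regular expression. The sketch below is illustrative only (ImpliedDef, IMPLIED and find_implied_defs are hypothetical names). It assumes that the gloss runs from the marker to the next punctuation mark and that the antecedent is everything back to the previous punctuation mark, which is usually wider than the phrase actually being glossed.

```python
import re
from typing import List, NamedTuple


class ImpliedDef(NamedTuple):
    marker: str      # "i.e." or "e.g."
    antecedent: str  # text before the marker, back to the previous clause break
    gloss: str       # text after the marker, up to the next clause break


# Assumption: commas, semicolons, colons and full stops act as clause breaks.
IMPLIED = re.compile(r"([^,;.:]+),?\s*\b(i\.e\.|e\.g\.),?\s*([^,;.:]+)")


def find_implied_defs(text: str) -> List[ImpliedDef]:
    """Return every i.e./e.g. construction with its rough left and right context."""
    return [
        ImpliedDef(m.group(2), m.group(1).strip(), m.group(3).strip())
        for m in IMPLIED.finditer(text)
    ]
```

On the example sentence this yields the marker "i.e.", the antecedent "test methods for determining the antiknock rating" and the gloss "octane rating". Narrowing the antecedent down to "antiknock rating" is the part that needs clause-level analysis.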
Head-Term Properties
A head term is usually short and contains at most two important pieces
of information: the term being defined and a parenthetical modifier. For
example: Motor Gasoline (Finished). Accordingly, the head term has only two
applicable properties.
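Because the head term has such a fixed shape, it can be picked apart with one pattern. This is a minimal sketch under the assumption that a head term carries at most one trailing parenthetical; HEAD_TERM and parse_head_term are hypothetical names, and lowercasing the result is a choice made here for illustration.

```python
import re
from typing import Optional, Tuple

# Assumption: the optional modifier is a single parenthetical at the end.
HEAD_TERM = re.compile(
    r"^\s*(?P<term>[^()]+?)\s*(?:\((?P<modifier>[^()]*)\))?\s*$"
)


def parse_head_term(head: str) -> Tuple[str, Optional[str]]:
    """Return (defined term, parenthetical modifier or None), lowercased."""
    m = HEAD_TERM.match(head)
    if m is None:
        raise ValueError(f"unrecognised head term: {head!r}")
    modifier = m.group("modifier")
    return (
        m.group("term").lower(),
        modifier.strip().lower() if modifier else None,
    )
```

With this sketch, "Motor Gasoline (Finished)" becomes ("motor gasoline", "finished"), and a head term without a parenthetical returns None as its modifier.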
Definition Properties
This leaves us with the property list for each section within
the definition. First, we catch all 'see-also's and 'cf.'s with matches to
other definitions within the database. We store this information as pointers to
the other parts of the hierarchy that are referenced. As well, we catch the head
noun phrase for entry into the LKB.
We have manually determined the list of important phrases
within the definitions given to us by the EIA using definition analysis theory. The definitions were collated and
normalized by a group within the EIA, and as a result achieve a high level of
compatibility for our purposes. However, they remain far from 'machine
readable.'
To augment our manual analysis, we ran a quick bigram count on the definitions and obtained a set of phrases that can be used to indicate concepts:
6 having an
5 included in
3 use in
3 greater than
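A bigram count of this kind takes only a few lines. The sketch below is an assumption about the procedure, not the script we actually ran: bigram_counts is a hypothetical name, and tokenizing on lowercase alphabetic runs (so that "spark-ignition" becomes two tokens) is a simplification.

```python
import re
from collections import Counter
from typing import Iterable


def bigram_counts(definitions: Iterable[str]) -> Counter:
    """Count adjacent word pairs within each definition."""
    counts: Counter = Counter()
    for text in definitions:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Pair each token with its successor; pairs never span two definitions.
        counts.update(zip(tokens, tokens[1:]))
    return counts
```

Calling counts.most_common() on the result gives (bigram, count) pairs in descending order of frequency. The listing above has the same shape: a count followed by the bigram.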
These counts are from a very small (10 definition) subset
of the whole. In fact, the most used phrase by far in the system is "use(d|s)*
(for|with|during|in)". Concurrently, we manually paged through the definitions, pointing
out important phrases that a system should "tag." Some manually noticed phrases
were not found by the bigram list, and there were also some bigram-encountered
phrases that were not found through our manual efforts. This manual research, combined with the
statistical methods mentioned, provided us with the following standard descriptors:
at most
at least
other than
contains / containing
more than
less than
greater than
used in, for, by, during
characterized as
having
intended for
classification of
excludes / excluding
includes / including
These descriptors can then be split into three property types, which
expands our hierarchy even further:
Definition-List
  Definition-Paragraph
    Head Term
      Defined Term
      Parenthetical Modifier
    Definition Body
      Head Noun Phrase
      Properties
        Properties
          used (in, for, by, during)
          characterized as
          having
          intended for
          classification of
        Excludes/Includes
          excludes/excluding
          includes/including
          contains/containing
          other than
        Quantifiers
          greater than / more than
          less than
          at least
          at most
      x-ref
        {pointers to other definitions}
    Note
      Properties (same as above)
    Acronyms
      {pointer to acronym-list}
    Implied-Defs
      {pointer to idef-list}
The above is an example of a pretty-print of the
information. We need feedback from the UI and query-system side on how this
lexical knowledge base should be represented at the 'code
end.'
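As one possible starting point for that discussion, the hierarchy maps fairly directly onto nested record types. The sketch below is a suggestion only: every class and field name is hypothetical, and representing the acronym and implied-def pointers as lookup keys into the external lists is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PropertySet:
    # Each dict maps a descriptor (e.g. "used in") to the clauses it introduces.
    properties: Dict[str, List[str]] = field(default_factory=dict)
    includes_excludes: Dict[str, List[str]] = field(default_factory=dict)
    quantifiers: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class HeadTerm:
    defined_term: str
    parenthetical_modifier: Optional[str] = None


@dataclass
class DefinitionBody:
    head_noun_phrase: str
    properties: PropertySet = field(default_factory=PropertySet)
    xrefs: List[str] = field(default_factory=list)  # head terms of other definitions


@dataclass
class Note:
    text: str
    properties: PropertySet = field(default_factory=PropertySet)


@dataclass
class DefinitionParagraph:
    head_term: HeadTerm
    body: DefinitionBody
    note: Optional[Note] = None
    acronyms: List[str] = field(default_factory=list)      # keys into the acronym-list
    implied_defs: List[str] = field(default_factory=list)  # keys into the idef-list


entry = DefinitionParagraph(
    head_term=HeadTerm("motor gasoline", "finished"),
    body=DefinitionBody("A complex mixture of relatively volatile hydrocarbons"),
)
entry.body.properties.includes_excludes["excludes"] = ["aviation gasoline"]
```

A Definition-List is then simply a list of DefinitionParagraph records, and the pretty-print above is one rendering of it.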
An Example
From the example definition given below, we have extracted this information by hand:
Motor Gasoline (Finished): A complex mixture of relatively
volatile hydrocarbons with or without small quantities of additives, blended to
form a fuel suitable for use in spark-ignition engines. Motor gasoline, as
defined in ASTM Specification D 4814 or Federal Specification VV-G-1690C, is
characterized as having a boiling range of 122 to 158 degrees Fahrenheit at the
10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent
recovery point. "Motor Gasoline" includes conventional gasoline; all types of
oxygenated gasoline, including gasohol; and reformulated gasoline, but excludes
aviation gasoline. Note: Volumetric data on blending components, such as
oxygenates, are not counted in data on finished motor gasoline until the
blending components are blended into the gasoline.
Definition-List
  Definition-Paragraph
    Head Term
      Defined Term: motor gasoline
      Parenthetical Modifier: finished
    Definition Body
      Head Noun Phrase
        "A complex mixture of relatively volatile hydrocarbons"
      Properties
        Properties
          used (in, for, by, during):
            for use in spark-ignition engines
          characterized as:
            having a boiling range of 122 to 158 degrees Fahrenheit at the 10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent recovery point
        Excludes/Includes
          excludes/excluding:
            aviation gasoline
          includes/including:
            conventional gasoline; all types of oxygenated gasoline, including gasohol; and reformulated gasoline
        Quantifiers
      x-ref
        {pointers to other definitions}
    Note
      "Volumetric data on blending components, such as oxygenates, are not counted in data on finished motor gasoline until the blending components are blended into the gasoline."
      Properties
    Acronyms
      ASTM
      VV
      {pointer to acronym-list}
    Implied-Defs
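Locating the descriptor phrases themselves is the mechanical part of this extraction. The sketch below tags each standard descriptor with its property type and character offset. It is an illustration under stated assumptions: DESCRIPTORS, TAGGERS and tag_descriptors are hypothetical names, the patterns are written from the descriptor list in this document, and deciding where each tagged clause ends is deliberately left out.

```python
import re
from typing import List, Tuple

# Patterns grouped by the three property types of the hierarchy.
DESCRIPTORS = {
    "Properties": (
        r"use(?:d|s)? (?:in|for|by|during)|characterized as|having"
        r"|intended for|classification of"
    ),
    "Excludes/Includes": (
        r"exclud(?:es|ing)|includ(?:es|ing)|contain(?:s|ing)|other than"
    ),
    "Quantifiers": r"greater than|more than|less than|at least|at most",
}
TAGGERS = {
    ptype: re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
    for ptype, pattern in DESCRIPTORS.items()
}


def tag_descriptors(text: str) -> List[Tuple[int, str, str]]:
    """Return (offset, property type, descriptor) triples in text order."""
    hits = []
    for ptype, pattern in TAGGERS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), ptype, m.group(0).lower()))
    return sorted(hits)
```

Run on the sentence beginning "Motor Gasoline" includes conventional gasoline, this tags includes, including and excludes, in that order, all under Excludes/Includes. Note that the nested "including gasohol" is tagged as its own descriptor even though, in the hand extraction above, it stays inside the includes clause.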
Problems in Parsing
This definition is a good example because of its similarity to
the others: it doesn't fill all the nodes and features but rather a
good-sized subset of them. It also shows a few of the problems an automatic
system encounters when dealing with this kind of information.
- Clauses: when to begin and end. Is "includes: conventional gasoline" enough, or should we continue to "conventional gasoline; all types of oxygenated..." and so on?
- Things such as "not including..." should be placed in the excludes pile.
- As mentioned above, acronyms aren't always trivial to identify.
- Determining head phrases.
- Where do an i.e. and an e.g. refer back to?
Many of these problems may be solved with adequate text
tagging tools for parts of speech and clause structure. The definitions are
going to be preprocessed with a combination of Alembic tools and LinkIT to help
with these situations. Some of the other issues may require "manual
intelligence" to be substituted in the system.
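Of the problems listed above, the "not including" case is the easiest to handle mechanically. The sketch below sorts include and exclude cues into their piles and flips a negated include cue into the excludes pile. It is a minimal illustration: CUE and classify_cues are hypothetical names, only a directly preceding "not" is treated as negation, and the test phrase in the usage note is made up.

```python
import re
from typing import List, Tuple

# An optional "not" directly before an include-type cue flips it to an exclusion.
CUE = re.compile(
    r"\b(?:(?P<neg>not\s+)?(?P<inc>includ(?:es|ing|ed?)|contain(?:s|ing|ed)?)"
    r"|(?P<exc>exclud(?:es|ing|ed?)))\b",
    re.IGNORECASE,
)


def classify_cues(text: str) -> List[Tuple[str, str]]:
    """Return (pile, matched cue) pairs in text order."""
    results = []
    for m in CUE.finditer(text):
        if m.group("exc") or m.group("neg"):
            pile = "excludes"
        else:
            pile = "includes"
        results.append((pile, m.group(0).lower()))
    return results
```

For the made-up phrase "oxygenated gasoline, including gasohol, but not including aviation gasoline", this returns ("includes", "including") followed by ("excludes", "not including"). Negation that sits further from the cue would need the part-of-speech and clause tagging mentioned above.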
Mixed Sources
Since any ontology might already have an entry for 'gasoline', and since many of the terms we have carry multiple conflicting definitions, we will use an 'Agency' tag at the head of each definition to identify its source.