Digital Government Project - Definition Tagging


Definitions, Definition Tagging

From documents sent to us by the EIA and dated 1999, we have 6 printed pages of definitions with spurious formatting. To extract parts-of-speech and semantic meaning, we need to parse the raw text using only line breaks to gauge definition boundaries (all other formatting is inconsistent.)

Example Definition File - Gasolines

(or go to next section)
(or see all definition documents)

8/20/99

MOTOR GASOLINE AND RELATED TERMS

Aviation gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in aviation reciprocating engines. Fuel specifications are provided in ASTM Specification D 910 and Military Specification MIL-G-5572. Note: Data on blending components are not counted in data on finished aviation gasoline.

Conventional Gasoline: Finished motor gasoline not included in the oxygenated or reformulated gasoline categories. Note: This category excludes reformulated gasoline blendstock for oxygenate blending (RBOB) as well as other blendstock.

Gasohol: A blend of finished motor gasoline containing alcohol (generally ethanol but sometimes methanol) at a concentration of 10 percent or less by volume. Data on gasohol that has at least 2.7 percent oxygen, by weight, and is intended for sale inside carbon monoxide nonattainment areas are included in data on oxygenated gasoline. See Oxygenates.

Gasoline: See Motor Gasoline (Finished).




Tagging Examples

Definiton Text:

Aviation gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in aviation reciprocating engines. Fuel specifications are provided in ASTM Specification D 910 and Military Specification MIL-G-5572. Note: Data on blending components are not counted in data on finished aviation gasoline.

Sample tagging hierarchy:

type word:      (letter*)
type phrase:    (word{1,5})
                alternately:
                (subject, verb, object)
type text:      (phrase*) 
text division:
            Note:
            See:
            See Also:
            Vertical Whitespace
type definition:

head term (phrase)
    parenthetical modifier (phrase)
    definition (text)
        head noun phrase (phrase)
        properties (text)
            for use in (phrase)
            used in (phrase)
            used by (phrase)
            used for (phrase)
            characterized as (phrase)
            intended for (phrase)
        includes/excludes (text)
            includes (phrase)
            contains (phrase)
            excludes (phrase)	
            other than (phrase)
note (text)
x-reference (phrase)

Acronyms are expanded in place via a different structure on the first pass:

uppercase_lookup (word)
    expanded meaning(phrase)
    words-involved(word-list)
    
For example, upon encountering MTBE the analyzer consults this dataset under 
uppercase_lookup to find MTBE. Expanded meaning will have Methyl Tertiary Butyl 
Ether on one of two conditions: if it was already found in the document on the 
first pass in the following format:

Methyl Tertiary Butyl Ether (MTBE)

or if the word and meaning were already entered into the databse through an acronym 
glossary meant for this task. In this manner, we can track such terms as CO 
(Carbon Monoxide) that do not directly match up their letters. 

The words-involved field matches words that appear mid-acronym so the lexer can 
throw them away while searching for acronyms. For example, RBOB is referenced as 
reformulated gasoline blendstock for oxygenate blending, and so gasoline and for 
would be placed in the words-involved list.

Tagged:


head term (phrase)                   aviation gasoline
    parenthetical modifier (phrase)     finished
    definition (text)                   A complex mixture... Fuel specifications are provided in
                                        Whatever ASTM Stands For (ASTM) Specification D 910...
                                        (the entire text up to Note:)
        head noun phrase (phrase)           complex mixture of relatively volatile hydrocarbons    
        properties (text)                   
            for use in (phrase)                 aviation reciprocating engines
            used in (phrase)
            used by (phrase)
            used for (phrase)
            characterized as (phrase)
            intended for (phrase)
        includes/excludes (text)
            includes (phrase)
            contains (phrase)
            excludes (phrase)	
            other than (phrase)
note (text)                          Data on blending component are not counted in data on finished
                                     aviation gasoline.             
x-reference (phrase)