

A Technical Description of the LinkIT System

David K. Evans1

Introduction

LinkIT is an automated system for determining and ranking candidate ideas for the overall "aboutness" of a document. When run over plain text as input, LinkIT identifies the simplex noun phrases in the document and relates those noun phrases to one another. Each topic discussed in the document should be reflected by a group of related simplex noun phrases. LinkIT then ranks these groups, using a variety of information to inform the ranking heuristics.

LinkIT is a work in progress and is still under development.

Motivation

With the proliferation of information available via the internet, it has become increasingly common for natural language processing techniques to augment statistically based methods for information retrieval. Advanced search engines now use phrases and simple noun phrase identification to help improve the quality of searches. With LinkIT, we aim to produce a simple representation of the "aboutness" of the document that goes beyond just looking at the lexical forms of the words in the document. By identifying and linking the simple noun phrases in the document and doing some simple analysis of the verbs, we can determine who the major entities in the document are, and possibly what general actions are being performed.

Such a rich representation of the "aboutness" of a document has many possible applications. Compared to looking at the words in the document without regard to their syntactic role, we should be able to match documents to user queries more accurately, since we will not be misled by spurious hits caused by a document briefly mentioning, but not actually discussing, a certain topic. While using LinkIT to index a large collection of documents is probably not feasible, it would be possible to run LinkIT on a selection of documents that some other method has shown likely to be relevant, in order to make finer distinctions between the documents. LinkIT could also be used to determine what a document is about as input for a summarization system; this information could tell the system which areas of the document to focus on and which entities to expect information about. Given a collection of documents, LinkIT could be used as the basis of a topic detection and tracking system. By looking at the LinkIT output for each document and detecting similarities and differences between the outputs, one could detect a document's topic and track how that topic changes over time.

Simplex Noun Phrases

A simplex NP is a maximal NP with a common or proper noun as its head, where the NP may include premodifiers such as determiners and possessives but not post-nominal constituents such as prepositions or relativizers. Examples are "asbestos fiber" and "9.8 billion Kent cigarettes".

Simplex NPs can be contrasted with complex NPs such as "9.8 billion Kent cigarettes with the filters", where the head of the NP is followed by a preposition, or "9.8 billion Kent cigarettes sold by the company", where the head is followed by a participial verb. Currently, simplex NPs also end at a conjunction [Wacholder 1998].

System Operation

This section describes how the LinkIT system operates: what input it takes and what output it produces. It first describes the input file and the processing that must be performed to ready a raw text file for LinkIT. It then describes the output files that LinkIT creates and how to interpret them. Some of the options that configure how LinkIT runs are also explained.

   
Input Pre-Processing

The input to LinkIT is a file that has been pre-tagged with the Alembic Utilities from the MITRE group. These utilities perform part-of-speech tagging and also identify named entities for people, organizations, and locations. More information on the Alembic Utilities is available from the Alembic website, at http://www.mitre.org/resources/centers/advanced_info/g04h/alembic.html.

There are two ways to tag files for use by LinkIT. The first is to manually tag the file using the Alembic Workbench's GUI. To do this, start up the Alembic Workbench, and from the "Utilities" menu select the "Process Text..." item. Select a source file using the "Select..." button, and set the output file to what you want. Set the rules file to $AWBDIR/awb-2.8/rules/english-rules-all-data1.lisp. The stages that should be selected for processing are: Punct, Sent, P-O-S, BiGrams, POS-Language: English, Phrasing. Then press the "Process Text" button.

The second method is to use the included perl script, awbTag.pl. This script invokes another script, apply-alembic-dave, which tags the text file using the above parameters. The awbTag.pl script accepts any number of file names as command line arguments and processes each of them, creating a (FILENAME).tagged file for each argument.

Output

LinkIT generates three types of output, two of which are of primary interest to the end user. The output is written to four files: (File).np, (File).stat, (File).stat2, and (File).out.

.stat Files

LinkIT generates a statistics file that has the number of tokens, sentences, paragraphs, parts of speech of tokens in the document, and so on. This file ends in .stat. A second file ending in .stat2 is identical, except that there is one variable per line, and no explanatory text. The -noStat switch suppresses creation of both of these files.

   
.np File

(File).np

The .np file is a listing of all the NPs in the document, followed by those NPs sorted by head, and a listing of all the words in the document with the NPs in which they occur as heads and modifiers. The output is also echoed to the terminal (this can be suppressed using the -noPrint switch). The other form of output is the .out file, which is the same as the input file except that tags have been added to identify the noun phrases in the document and various relations between them. The .out file is the input used by our visualization tool.

The LinkIT output consists first of a listing of simplex noun phrases from the document in the order in which they occur in the document. The second list groups those same NPs by head, and the third list is a breakdown of the words and the NPs they appear in either as modifiers or heads.

The first list is preceded by this header:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I: In-order Simplex NP Listing:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This list is not printed by default. To enable printing of this list, use the -printListI option.

Another useful option is to print the text with the NPs and verb phrases bracketed by []'s and ()'s respectively. To do this, use the -bracketedText option.

The second list is preceded by:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

II: Noun Phrases Ordered by Heads:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

By default, this list is printed. To suppress printing of this list, use the -printListII 0 option.

The third list is preceded by:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

III: Words as heads and mods:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

By default, this list is printed. To suppress printing of this list, use the -printListIII 0 option.

In the .np file, each NP is listed with the sentence it occurred in, its token span, and possibly information about its relationships to other NPs. This includes whether the NP is in apposition, whether it is a possible head or possible modifier of another NP, and previous occurrences of words in the NP. The format is as follows:

S9 177-178 (54) Costa (pocc: 18.58) Rica (pocc: 18.59)

If a noun phrase is possibly in apposition with another noun phrase, that is marked with a (papp: #) tag. Similarly, (phead: #) and (pmod: #) specify that this noun phrase is a possible head or modifier of another noun phrase. In all cases, # is the unique number of the other NP, as given in parentheses in that NP's entry (54 in the example above).

For each of the words of the noun phrase, there might be a (pocc: A.B) label. This denotes a previous occurrence of this word, in noun phrase A, token number B.
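
As an illustration of the entry format above, here is a minimal Python sketch of a parser for one .np entry line. It is not part of LinkIT: the names NPEntry and parse_np_line are hypothetical, and the sketch assumes that NP-level tags such as (papp: #) may appear anywhere after the NP number, which the description above does not specify.

```python
import re
from dataclasses import dataclass, field

# Sentence number, token span, and NP number, e.g. "S9 177-178 (54) ..."
HEADER = re.compile(r"^S(\d+)\s+(\d+)-(\d+)\s+\((\d+)\)\s*(.*)$")
# Either a relation tag such as "(pocc: 18.58)" or a plain word.
TOKEN = re.compile(r"\((papp|phead|pmod|pocc):\s*([\d.]+)\)|(\S+)")


@dataclass
class NPEntry:
    sentence: int
    start: int
    finish: int
    number: int
    words: list = field(default_factory=list)      # [word, (np, token) or None]
    relations: dict = field(default_factory=dict)  # "papp"/"phead"/"pmod" -> NP numbers


def parse_np_line(line):
    """Parse one entry line of a .np file (format assumed from the example)."""
    m = HEADER.match(line.strip())
    if m is None:
        raise ValueError("not an NP entry: %r" % line)
    sent, start, finish, num, rest = m.groups()
    entry = NPEntry(int(sent), int(start), int(finish), int(num))
    for tag, value, word in TOKEN.findall(rest):
        if word:
            entry.words.append([word, None])
        elif tag == "pocc":
            if not entry.words:
                raise ValueError("pocc tag without a preceding word")
            # A.B means noun phrase A, token number B; attach to the last word.
            np_num, tok_num = value.split(".")
            entry.words[-1][1] = (int(np_num), int(tok_num))
        else:
            entry.relations.setdefault(tag, []).append(int(value))
    return entry
```

Applied to the example line above, the sketch yields sentence 9, token span 177-178, NP number 54, and a previous occurrence for each of the two words.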

.out File

(File).out

The .out file mirrors the original marked-up document, but embeds tags to identify the NPs in the document. For each NP that has been identified in the document, a tag is inserted around the entire text of the NP. The tag is of the form

<NP NUM=%d GROUP=%d SENT=%d START=%d FINISH=%d HEAD=%d PAPP=%d>
The NUM, GROUP, SENT, START, FINISH, and HEAD fields appear in every tag, while the PAPP field appears only if the NP specified by this tag is in apposition with another NP. The fields take on the values of an NP as described in Section [*].
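
A consumer of the .out file could pull these fields out with a pair of regular expressions. The following Python sketch is an assumption-laden illustration rather than LinkIT code: the function name np_tags is hypothetical, only opening tags are examined (the form of the closing tag is not given here), and the field values in the usage example are invented.

```python
import re

# Opening NP tag as described above; fields are NAME=integer pairs.
NP_TAG = re.compile(r"<NP\s+([^>]*)>")
FIELD = re.compile(r"([A-Z]+)=(-?\d+)")


def np_tags(text):
    """Return one dict of integer fields per opening <NP ...> tag in text."""
    return [
        {name: int(value) for name, value in FIELD.findall(m.group(1))}
        for m in NP_TAG.finditer(text)
    ]


# Usage with made-up values; PAPP is present only on the second tag.
sample = ("<NP NUM=1 GROUP=1 SENT=1 START=0 FINISH=1 HEAD=1> "
          "<NP NUM=2 GROUP=2 SENT=1 START=3 FINISH=4 HEAD=4 PAPP=1>")
tags = np_tags(sample)
in_apposition = [t["NUM"] for t in tags if "PAPP" in t]
```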

Technical Description

Architecture Overview

The first step in the LinkIT system is to parse the input file by sequentially dividing the input text into units that LinkIT knows how to handle. The main LinkIT module calls a lexer module to read the input file and return small units of text in sequential order. The lexer is a large finite-state machine, built from a set of regular expressions, that identifies simplex noun phrases, verb phrases, and a few other units based on the part-of-speech tags in the input file. It returns the text that matched one of the expressions defining a simplex noun phrase or other unit. If the unit is a simplex NP, information about the NP is extracted from the marked-up text, an entry is created for the NP in a list of NPs for the entire document, and the NP is checked for links to previous NPs in the document. If the unit is not an NP, LinkIT performs processing specific to that type of unit.

Once all of the simplex NPs for the document have been extracted, they are clustered into groups. Currently the clusters are created based on similarity of the lexical form of the head: two NPs are placed in the same cluster if they have the same head except for differences in plurality or case. These NP clusters are then ranked in order of their relative "significance." Please see the NP Ranking section for more information on the ranking metrics. The resulting list can then be output in various ways.
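
The head-based clustering criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the head is taken to be the last word of the simplex NP, and head_key uses a naive suffix rule as a stand-in for whatever plural handling LinkIT actually applies. Both function names are hypothetical.

```python
from collections import defaultdict


def head_key(head):
    """Fold case and strip a plural ending (naive stand-in rule)."""
    w = head.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def cluster_by_head(nps):
    """Group NP strings whose heads differ only in plurality or case."""
    clusters = defaultdict(list)
    for np in nps:
        head = np.split()[-1]  # assumption: the head is the final word
        clusters[head_key(head)].append(np)
    return dict(clusters)


clusters = cluster_by_head(
    ["asbestos fiber", "the Fibers", "Kent cigarettes", "a cigarette"]
)
```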

Optionally, for each word in the document that is part of an NP, LinkIT can output a list of the NPs containing that word, broken down by whether the word occurs as the head of the NP or as a modifier in it.

NP Chunking

[Figure: lexer.ps]

To determine NP boundaries, LinkIT uses a finite-state lexer built from a small hand-crafted regular expression grammar. The input to the lexer is the part-of-speech tagged text, tagged in the manner explained in the Input Pre-Processing section. The lexer contains regular expressions to identify simplex noun phrases, sentence boundaries, paragraph boundaries, dates, and simple verb phrases.

The lexer matches the input text against its set of patterns and returns the text of the best match found. Preference is given to expressions that minimize the amount of input that must be "skipped" before the start of the matched text. Among expressions that skip the same amount of input, longer matches are preferred. Both the text that matched the chosen regular expression and the text that was "skipped" are returned to the LinkIT main module. The lexer also sets variables that indicate which regular expression was used, what sentence and paragraph the match was in, and the token span of the match.
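
The preference order described above (least skipped input first, then longest match) can be illustrated with a small Python sketch. It is not the LinkIT lexer: the two patterns are toy stand-ins over word/TAG tokens, the names tok, PATTERNS, and next_unit are hypothetical, and Python's re module returns the leftmost greedy match for each pattern, which coincides with the longest match only for simple patterns such as these.

```python
import re


def tok(tags):
    """Regex for one word/TAG token whose tag matches `tags`."""
    return rf"\S+/(?:{tags})(?:\s+|$)"


# Toy patterns, assumed for illustration only.
PATTERNS = {
    "NP": re.compile(f"(?:{tok('DT')})?(?:{tok('JJ')})*(?:{tok('NNP?S?')})+"),
    "VP": re.compile(f"(?:{tok('VB[DGNPZ]?')})+"),
}


def next_unit(text, pos=0, patterns=PATTERNS):
    """Return (pattern name, skipped text, matched text, new position) or None."""
    best = None
    for name, pattern in patterns.items():
        m = pattern.search(text, pos)
        if m is None:
            continue
        # Smaller skip wins; among equal skips, the longer match wins.
        key = (m.start() - pos, -(m.end() - m.start()))
        if best is None or key < best[0]:
            best = (key, name, m)
    if best is None:
        return None
    _, name, m = best
    return name, text[pos:m.start()], m.group(0).rstrip(), m.end()


text = "Yesterday/RB the/DT old/JJ company/NN sold/VBD cigarettes/NNS"
first = next_unit(text)
```

Calling next_unit repeatedly, each time starting from the returned position, walks through the text one unit at a time and hands back the skipped text alongside each match.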

   
Normal NP Processing

Depending on the text that is returned from the lexer, LinkIT takes some action. The main case of interest is that of NPs.

For each NP that is returned by the lexer, LinkIT creates a data structure to store information about the NP. A list of the words is created, and for each word in the NP, LinkIT extracts the part-of-speech tag and any other special feature that might be associated with that word. A word can have a POST or a TITLE feature associated with it, and might be the start or the end of a named entity. POST words indicate a job position, such as "general" or "secretary". A title is a human title, such as "Dr." or "Mr." A named entity is a sequence of words that refers to a person, location, or organization; named entities are tagged by the Alembic Utilities. The list of words and their associated information are stored in the NP structure.
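
One possible shape for such an NP record is sketched below in Python. The class and field names are hypothetical and are not taken from the LinkIT source; the head property assumes that the head of a simplex NP is its final word.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    pos: str                        # part-of-speech tag from the tagged input
    feature: Optional[str] = None   # "POST", "TITLE", or None
    entity_start: bool = False      # word begins a named entity
    entity_end: bool = False        # word ends a named entity


@dataclass
class NounPhrase:
    number: int                     # unique NP number
    sentence: int
    start: int                      # token span of the NP
    finish: int
    words: List[Word] = field(default_factory=list)

    @property
    def head(self):
        # Assumption: the head is the last word of the simplex NP.
        return self.words[-1]


np_example = NounPhrase(
    number=1, sentence=1, start=0, finish=1,
    words=[
        Word("Dr.", "NNP", feature="TITLE"),
        Word("Smith", "NNP", entity_start=True, entity_end=True),
    ],
)
```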

If the previous unit returned by the lexer was an adjective and coordinating conjunction unit, LinkIT checks to see if there was any intervening text between that unit and the current NP. If there was not, then the adjective and coordinating conjunction are prepended to the current NP, and processing continues as normal. If there was some intervening text, then the adjective and coordinating conjunction variable is just cleared.

If the head of the current NP is a "strong" noun, and the only intervening text between the current NP and the previous NP is "of", the previous NP is made a possible modifier of the current NP and the current NP is made a possible head of the previous NP.
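
The two adjacency rules above can be sketched as follows. This is a hedged Python illustration, not LinkIT code: the function names are hypothetical, NPs are represented as plain dictionaries, the intervening text is assumed to have had its part-of-speech tags stripped already, and the test for a "strong" noun is passed in as a predicate because its definition is not given here.

```python
def attach_pending(np_words, pending, between):
    """Prepend a pending adjective + conjunction unit if nothing intervenes.

    The caller is expected to clear `pending` afterwards in either case.
    """
    if pending and not between.strip():
        return list(pending) + list(np_words)
    return list(np_words)


def link_of(prev_np, cur_np, between, is_strong):
    """Apply the "of" rule; return True if the two NPs were linked."""
    if between.strip().lower() != "of" or not is_strong(cur_np["head"]):
        return False
    # The previous NP is a possible modifier of the current NP ...
    prev_np.setdefault("pmod", []).append(cur_np["num"])
    # ... and the current NP is a possible head of the previous NP.
    cur_np.setdefault("phead", []).append(prev_np["num"])
    return True
```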

Finally, the current NP is related to all the previous NPs. For each modifier in the current NP, we use a hash table of all the words seen so far to check whether the same word has occurred before. For the comparison, each word is reduced to its singular form (irregular words are reduced to their correct form using a dictionary) and case is ignored. If there has been a previous occurrence of the word, a link is added from the word to the previous occurrence. For the head of the NP, LinkIT also searches for matching words, and in addition assigns a group number to the NP based on what is matched. If no previous occurrence of the head exists, a new group is formed and the NP is assigned the next sequential group number. If a match is found and the matched word was the head of its NP, the current NP is assigned the group number of the matching word's NP, and a previous-occurrence relation is made from the head of the current NP to the matched head. If the matched word was not the head of its NP, a new group is created as above.
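
A minimal Python sketch of this linking step follows. It is one possible reading of the description, not the LinkIT implementation: the Linker class and its method are hypothetical, the IRREGULAR table and suffix rule stand in for the real dictionary and singularization, the head is assumed to be the last word of the NP, and the sketch records a previous-occurrence link for the head even when the earlier occurrence was a modifier, a case the description leaves open.

```python
# Stand-in for the dictionary of irregular forms (assumed contents).
IRREGULAR = {"men": "man", "women": "woman", "children": "child"}


def normalize(word):
    """Case-fold and singularize a word for comparison (naive rule)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


class Linker:
    def __init__(self):
        self.last_seen = {}   # normalized word -> (NP number, token number)
        self.head_group = {}  # normalized head -> group number
        self.next_group = 1

    def add_np(self, np_number, tokens):
        """tokens: list of (token number, word), head last.

        Returns (group number, {word: (NP number, token number)}).
        """
        links = {}
        for tok_no, word in tokens:
            prev = self.last_seen.get(normalize(word))
            if prev is not None:
                links[word] = prev
        head_key = normalize(tokens[-1][1])
        group = self.head_group.get(head_key)
        if group is None:
            # Head never seen as a head before: start a new group.
            group = self.next_group
            self.next_group += 1
            self.head_group[head_key] = group
        for tok_no, word in tokens:
            self.last_seen[normalize(word)] = (np_number, tok_no)
        return group, links
```

Keeping the table of earlier heads separate from the table of all earlier words means that a head which previously occurred only as a modifier still starts a new group, as the description requires.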

Special Case Processing

The lexer also returns units of text that match patterns for the following units: possessive 's, titles, sentence boundaries, commas, new paragraphs, and the construction of an adjective followed by a coordinating conjunction. In each of these cases, LinkIT updates certain state information pertinent to the returned unit. The six cases are listed below.

   
NP Ranking

Talk about how we go through the process of ranking the NPs.

Results

What we are currently using LinkIT for, and some future stuff on evaluation.

Technical Description for Use

Command Line Arguments

operation with no command line args: stats to np.stat (what about stat2?) input from stdin, output to stdout.

LinkIT has many command line options to introduce flexibility into the program. The command line arguments can be broken down into the following categories:

A Small Example

A small example of tagging a file, and running LinkIT on that file. Show the output. Or should this be in the appendix as well?

Sample Input Article

Use wsj_0006 - it is very short, and should be sufficient to demonstrate operation.

Pacific First Financial Corp. said shareholders approved its acquisition by Royal Trustco Ltd. of Toronto for $27 a share, or $212 million.

The thrift holding company said it expects to obtain regulatory approval and complete the transaction by year-end.

Sample Tagged Input Article

Sample Output Formats

permutations of the output that we can produce.


About this document ...


The translation was initiated by David Evans on 1999-11-01


Footnotes

... Evans1
This research was partly supported by NSF grant IRI-9712069, "Automatic Identification of Significant Topics in Domain Independent Full Text", Judith Klavans, PI; Nina Wacholder, co-PI.

David Evans
1999-11-01