A Technical Description of the LinkIT System

David K. Evans¹

Introduction

LinkIT is an automated system for determining and ranking candidate ideas for the overall ``aboutness'' of a document. When run over plain text as an input, LinkIT determines the simplex noun phrases in the document, and relates those noun phrases to one another. Each topic that is discussed within the document should be reflected by a grouping of these related simplex noun phrases. These simplex noun phrase groups are then ranked by LinkIT, using a variety of information to inform the ranking heuristics.

LinkIT is a work in progress, and is still under development.

Motivation

With the proliferation of information available via the internet, it has become increasingly common for natural language processing techniques to augment statistically based methods for information retrieval. Advanced search engines now use phrases and simple noun phrase identification to help improve the quality of searches. With LinkIT, we aim to produce a simple representation of the ``aboutness'' of the document that goes beyond just looking at the lexical forms of the words in the document. By identifying and linking the simple noun phrases in the document and doing some simple analysis on the verbs in the document, we can determine who the major entities are in the document, and possibly what general actions are being performed.

There are many possible applications of such a rich representation of the ``aboutness'' of the document. Compared to just looking at the words in the document without regard to their syntactic role, we should be able to match documents to user queries more accurately, since we will not be misled by spurious hits caused by a document briefly mentioning, but not actually discussing, a certain topic. While using LinkIT to index a large collection of documents is probably not feasible, it would be possible to use LinkIT on a selection of documents that some other method has shown likely to be relevant, in order to make finer distinctions between the documents. LinkIT could also be used to determine what a document is about as input for a summarization system; this information could tell the system which areas of the document to focus on, and which entities to expect information about. Given a collection of documents, LinkIT could be used as the basis of a topic detection and tracking system. By looking at the LinkIT output for each document, and detecting similarities and differences between the outputs, one could detect a document's topic, and track how that topic changes over time.

Simplex Noun Phrases

A simplex NP is a maximal NP with a common or proper noun as its head, where the NP may include premodifiers such as determiners and possessives but not post-nominal constituents such as prepositions or relativizers. Examples are asbestos fiber and 9.8 billion Kent cigarettes.

Simplex NPs can be contrasted with complex NPs such as 9.8 billion Kent cigarettes with the filters where the head of the NP is followed by a preposition, or 9.8 billion Kent cigarettes sold by the company, where the head is followed by a participial verb. Currently simplex NPs also end at a conjunction [Wacholder1998].

System Operation

This section describes how the LinkIT system operates: what input it takes, and what output it produces. First there is a description of the input file, and the specific processing that must be performed to ready a raw text file for processing by LinkIT. Then there is a description of the output files that LinkIT creates, and how to interpret them. Some of the options that configure how LinkIT runs are also explained.

Input Pre-Processing

The input to LinkIT is a file that has been pre-tagged with the Alembic Utilities from the MITRE group. These utilities perform part-of-speech tagging, and also do named entity identification for people, organizations, and locations. More information on the Alembic Utilities is available from the Alembic website, at http://www.mitre.org/resources/centers/advanced_info/g04h/alembic.html.

There are two ways to tag files for use by LinkIT. The first is to manually tag the file using the Alembic Workbench's GUI. To do this, start up the Alembic Workbench, and from the "Utilities" menu select the "Process Text..." item. Select a source file using the "Select..." button, and set the output file to what you want. Set the rules file to $AWBDIR/awb-2.8/rules/english-rules-all-data1.lisp. The stages that should be selected for processing are: Punct, Sent, P-O-S, BiGrams, POS-Language: English, Phrasing. Then press the "Process Text" button.

The second method is to use the included perl script, awbTag.pl. This script invokes another script, apply-alembic-dave, which tags the text file using the above parameters. The awbTag.pl script can take any number of file names as command line arguments, and will process all of those files, creating a (FILENAME).tagged file for each command line argument.

Output

LinkIT generates three types of output, two of which are of primary interest to the end user. LinkIT creates four files, ending in (File).np, (File).stat, (File).stat2, and (File).out.

(File).np The standard output file. This file contains listings of the NPs from the document, and also a clustered listing of those NPs. The various options that pertain to which output lists are printed affect the format of this file.
(File).stat This file contains statistical information on the frequency of the part of speech tags that occurred in the input, how many NPs were found, how many sentences, paragraphs, and tokens there were, and so on.
(File).stat2 This is the same as (File).stat, except that the values are just listed numerically, one value per line with no headings. This file is generated for the convenience of other utilities, so they do not have to contend with a complicated human-readable format.
(File).out This file is the same as the input file, enriched with SGML tags inserted for the NPs that LinkIT found.

.stat Files

LinkIT generates a statistics file that has the number of tokens, sentences, paragraphs, parts of speech of tokens in the document, and so on. This file ends in .stat. A second file ending in .stat2 is identical, except that there is one variable per line, and no explanatory text. The -noStats switch suppresses creation of both of these files.
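Because a .stat2 file holds one value per line with no headings, another utility can read it with very little code. The following is a minimal sketch in Python, not part of LinkIT; the function name read_stat2 is hypothetical, and since this document does not specify which statistic appears on which line, the values are simply returned in file order.

```python
from pathlib import Path


def read_stat2(path):
    """Return the numeric values of a .stat2 file in file order.

    Assumption: every non-blank line holds exactly one number. Which
    statistic each line represents is not specified here, so no names
    are attached to the values.
    """
    values = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            values.append(int(line))
        except ValueError:
            # assumption: some values may be fractional
            values.append(float(line))
    return values
```

A caller would pair the returned list with the headings of the matching .stat file to know what each value means.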

.np File

(File).np

The .np file lists all the NPs in the document, then those NPs sorted by head, and finally all the words in the document together with the NPs in which they occur as heads and as modifiers. The output is also echoed to the terminal (this can be suppressed using the -noPrint switch). The other form of output is the .out file, which is the same as the input file except that it has tags added to identify the noun phrases in the document and various relations between them. The .out file is the input used by our visualization tool.

The LinkIT output consists first of a listing of simplex noun phrases from the document in the order in which they occur in the document. The second list groups those same NPs by head, and the third list is a breakdown of the words and the NPs they appear in either as modifiers or heads.

The first list is preceded by this header:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I: In-order Simplex NP Listing:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This list is not printed by default. To enable printing of this list, use the -printListI option.

Another useful option is to print the text with the NPs and verb phrases bracketed by []'s and ()'s respectively. To do this, use the -bracketText option.

The second list is preceded by:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

II: Noun Phrases Ordered by Heads:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

By default, this list is printed. To suppress printing of this list, use the -printListII 0 option.

The third list is preceded by:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

III: Words as heads and mods:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

By default, this list is printed. To suppress printing of this list, use the -printListIII 0 option.

In the .np file, each NP is listed with information on the sentence it occurred in, its token span, and possibly information about its relationships to other NPs. This includes information on whether the NP is in apposition, is a possible head or possible modifier of another NP, and previous occurrences of words in the NP. The format is as follows:

S9 177-178 (54) Costa (pocc: 18.58) Rica (pocc: 18.59)

S# is the sentence number the noun phrase occurs in.
The next two numbers give the token span of the noun phrase, where [Costa] is the 177th token, and [Rica] is the 178th token.
Finally the number in parentheses is a unique identifier assigned to this simplex NP.

If a noun phrase is possibly in apposition with another noun phrase, that will be marked with a (papp: #) tag. Similarly, (phead: #) and (pmod: #) specify that this noun phrase is a possible head or modifier of another noun phrase. In all cases the # refers to the unique number for the NP given in the parentheses.

For each of the words of the noun phrase, there might be a (pocc: A.B) label. This denotes a previous occurrence of this word, in noun phrase A, token number B.
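The line format above is regular enough to parse with two regular expressions. The sketch below is illustrative Python, not code from LinkIT; the function name parse_np_line and the dictionary layout are hypothetical, and it assumes that each (pocc: A.B) label immediately follows the word it describes, as in the sample line, and that (papp: #), (phead: #) and (pmod: #) labels apply to the NP as a whole.

```python
import re

# "S9 177-178 (54) ..." : sentence, token span, unique NP number, remainder
HEADER = re.compile(r"^S(\d+)\s+(\d+)-(\d+)\s+\((\d+)\)\s*(.*)$")
# Either a relation label such as "(pocc: 18.58)" or a plain word
ITEM = re.compile(r"\((pocc|papp|phead|pmod):\s*([\d.]+)\)|(\S+)")


def parse_np_line(line):
    """Parse one NP line of a .np file; return None if it does not match."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    sent, start, end, np_id, rest = m.groups()
    words = []      # one entry per word of the NP
    relations = []  # NP-level (label, other NP number) pairs
    for label, value, word in ITEM.findall(rest):
        if word:
            words.append({"text": word, "pocc": None})
        elif label == "pocc":
            if words:
                # pocc value is "A.B": noun phrase A, token number B
                np_num, _, tok = value.partition(".")
                words[-1]["pocc"] = (int(np_num), int(tok))
        else:
            relations.append((label, int(value)))
    return {
        "sentence": int(sent),
        "span": (int(start), int(end)),
        "id": int(np_id),
        "words": words,
        "relations": relations,
    }
```

For the sample line, this yields sentence 9, span (177, 178), identifier 54, and two words whose previous occurrences are in noun phrase 18 at tokens 58 and 59.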

.out File

(File).out

The .out file mirrors the original marked-up document, but embeds tags to identify the NPs in the document. For each NP that has been identified in the document, a tag is inserted around the entire text of the NP. The tag is of the form

<NP NUM=%d GROUP=%d SENT=%d START=%d FINISH=%d HEAD=%d PAPP=%d>

The NUM, GROUP, SENT, START, FINISH, and HEAD fields appear in every tag, while the PAPP field appears only if the NP specified by this tag is in apposition with another NP. The fields take on the values of an NP as described in the .np File section:

NUM is a unique integer for this NP. The integers are assigned to the NPs in simple ascending order, as they are encountered in the input file.
GROUP is an integer that reflects which cluster of NPs this NP is a part of. Each cluster is assigned a unique integer, and each NP in the cluster is assigned that same integer.
SENT is the sentence number this NP is in.
START is the token number of the first word in this NP.
FINISH is the token number of the final word in this NP.
HEAD is a relative count from the first token in this NP to the head of the NP. In English the head of the NP is usually the final word, but in some cases that does not hold. In those cases, we have the flexibility to specify what LinkIT determines to be the actual head of the NP.
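A downstream tool can recover these fields from a .out file with a short scan for opening NP tags. The following Python sketch is only an illustration; the helper names np_tags and group_sizes are hypothetical, and it assumes the field values are written as plain integers (optionally quoted), as the format string above suggests.

```python
import re
from collections import Counter

# Opening NP tag and its FIELD=value pairs
NP_TAG = re.compile(r"<NP\s+([^>]*)>")
FIELD = re.compile(r'(\w+)="?(\d+)"?')


def np_tags(text):
    """Return one dict of integer fields per opening <NP ...> tag in text."""
    return [
        {name: int(value) for name, value in FIELD.findall(attrs)}
        for attrs in NP_TAG.findall(text)
    ]


def group_sizes(tags):
    """Count how many NPs carry each GROUP number."""
    return Counter(tag["GROUP"] for tag in tags)
```

Since PAPP is optional, callers should test for its presence (for example with `"PAPP" in tag`) rather than index it directly.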

Technical Description

Architecture Overview

The first step in the LinkIT system is to parse the input file. The input file is parsed by sequentially dividing the input text up into units that LinkIT has knowledge about. The main LinkIT Module calls on a lexer module to read in the input file, and return small units of text in a sequential order. The lexer is a large finite state machine built from a set of regular expressions that is able to identify simplex noun phrases, verb phrases, and a few other units based on the part of speech tags in the input file. The lexer returns the text that matched to one of the expressions that defines a simplex noun phrase, or other unit. If the unit is a simplex NP, information about the NP is extracted from the marked-up text, an entry is created for the NP in a list of NPs for the entire document, and the NP is checked for links to previous NPs in the document. If the unit is not an NP, LinkIT does some special processing specific to that type of unit.

Once all of the simple NPs for the document have been extracted, all of the NPs in the document are clustered into groups. Currently the clusters are created based on similarity of the lexical form of the head. Two NPs will be placed in the same cluster if they have the same head except for differences in plurality or case. These NP clusters are then ranked in order of their relative ``significance.'' Please see the NP Ranking section for more information on the ranking metrics. The resulting list can then be output in various ways.

Optionally, for each word that is in the document, if it is part of a NP, LinkIT can output a list of the NPs that the word is in broken down by occurrence of the word as the head of an NP, and as a modifier in an NP.

NP Chunking

[Figure: lexer.ps]

To determine NP boundaries, LinkIT uses a finite-state lexer built from a small hand-crafted regular expression grammar. The input to the lexer is the part of speech tagged text, tagged in the manner explained in the Input Pre-Processing section. The lexer contains regular expressions to identify simplex noun phrases, sentence boundaries, paragraph boundaries, dates, and simple verb phrases.

The lexer takes the input text and matches it against its set of patterns, returning the text of the largest match found. When matching against the set of regular expressions, preference is given to expressions that minimize the amount of input that must be ``skipped'' before the start of the matched text. Among expressions that skip the same amount of input, longer matches are preferred. The text that matched the final regular expression, as well as the text that was ``skipped'', is returned to the LinkIT main module. The lexer also sets variables that indicate which regular expression was used, what sentence and paragraph the match was in, and the token span of the match.
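The matching preference described above (least skipped input first, then longest match) can be illustrated with a few lines of Python. This is a sketch only, not LinkIT's lexer: the three patterns, the simplified word/TAG token notation, and the function name next_unit are all assumptions made for the example, and the real grammar and the Alembic markup are considerably richer.

```python
import re

# Hypothetical, highly simplified patterns over "word/TAG" tokens.
NOUN = r"\S+/NNP?S?(?!\S)"
PATTERNS = {
    "NP": re.compile(rf"(?:\S+/DT\s+)?(?:\S+/JJ\s+)*{NOUN}(?:\s+{NOUN})*"),
    "VP": re.compile(r"\S+/VB[DGNPZ]?(?!\S)"),
    "COMMA": re.compile(r",/,(?!\S)"),
}


def next_unit(text, pos=0):
    """Return (name, skipped_text, matched_text, end) for the best match
    at or after pos, or None if no pattern matches.

    The best match skips the least input; ties go to the longer match.
    """
    best = None
    for name, pattern in PATTERNS.items():
        m = pattern.search(text, pos)
        if m is None or m.end() == m.start():
            continue
        # Smaller start wins; for equal starts, the longer match wins.
        key = (m.start(), -(m.end() - m.start()))
        if best is None or key < best[0]:
            best = (key, name, m)
    if best is None:
        return None
    _, name, m = best
    return name, text[pos:m.start()], m.group(0), m.end()
```

Calling next_unit repeatedly, each time passing the previous end position, walks through the text unit by unit and hands back the skipped text along with each match, much as the main module receives it.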

Normal NP Processing

Depending on the text that is returned from the lexer, LinkIT takes some action. The most interesting case is that of NPs.

For each NP that is returned by the lexer, LinkIT creates a data structure to store information about the NP. A list of the words is created, and for each word in the NP, LinkIT extracts the part of speech tag, and any other special feature that might be associated with that word. A word can have a POST or a TITLE feature associated with it, and might possibly be the start or the end of a named entity. POST words are words that indicate a job position, such as general or secretary. A title is a human title, such as Dr. or Mr. A named entity is a sequence of words that refers to a person, location, or organization; named entities are tagged by the Alembic Utilities. The list of words and their associated information are stored in the NP structure.

If the previous unit returned by the lexer was an adjective and coordinating conjunction unit, LinkIT checks to see if there was any intervening text between that unit and the current NP. If there was not, then the adjective and coordinating conjunction are prepended to the current NP, and processing continues as normal. If there was some intervening text, then the adjective and coordinating conjunction variable is just cleared.

If the head of the current NP is a ``strong'' noun, and the only intervening text between the current NP and the previous NP is of, the previous NP is made a possible modifier of the current NP and the current NP is made a possible head of the previous NP.

Finally, the current NP is related to all the previous NPs. For each modifier in the current NP, we check whether the same word has occurred before, using a hash table of all the words we have seen. Each word is reduced to its singular form, irregular words are reduced to their correct form using a dictionary, and we ignore case in the comparison. If there has been a previous occurrence of the word, a link is added from the word to the previous word. For the head of the NP, LinkIT also searches for similar words, but in addition assigns a group number to the NP based on what is matched. If no previous occurrences of the word exist, then a new group is formed, and the NP is assigned the next sequential group number. If a match is found and the matched word was the head of its NP, then the NP is assigned the group number of the matching word's NP, and a previous occurrence relation is made from the head of the NP to the matched head. If the matched word was not the head of its NP, then a new group is created as above.
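The linking and group assignment just described can be sketched as follows. This Python is illustrative rather than LinkIT's actual code. The names normalize and Linker are hypothetical, the plural stripping is deliberately naive, the head is assumed to be the last word of each NP, and when a word has several previous occurrences the sketch links to the most recent one and takes the group of the earliest occurrence that was a head; the text above does not specify those last two choices.

```python
def normalize(word, irregular):
    """Lower-case a word and reduce it to a singular form.

    irregular maps irregular plurals to singulars (the dictionary's role).
    The regular rule here is a naive stand-in: strip one trailing "s".
    """
    w = word.lower()
    if w in irregular:
        return irregular[w]
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


class Linker:
    def __init__(self, irregular=None):
        self.irregular = irregular or {}
        self.seen = {}      # normalized word -> [(np_id, token, is_head), ...]
        self.group_of = {}  # np_id -> group number
        self.next_group = 1

    def add_np(self, np_id, start_token, words):
        """Register an NP; return (group, links).

        links maps a word index to the (np_id, token) of a previous
        occurrence. Assumption: the head is the last word of the NP.
        """
        head_index = len(words) - 1
        links = {}
        group = None
        for i, word in enumerate(words):
            previous = self.seen.get(normalize(word, self.irregular), [])
            if previous:
                links[i] = previous[-1][:2]  # most recent occurrence
            if i == head_index:
                head_matches = [p for p in previous if p[2]]
                if head_matches:
                    group = self.group_of[head_matches[0][0]]
        if group is None:
            # no earlier NP has this head: start a new group
            group = self.next_group
            self.next_group += 1
        self.group_of[np_id] = group
        # record this NP's words only after matching, so it never links to itself
        for i, word in enumerate(words):
            key = normalize(word, self.irregular)
            self.seen.setdefault(key, []).append(
                (np_id, start_token + i, i == head_index))
        return group, links
```

With this sketch, Kent cigarettes and the cigarette land in the same group, while cigarette filters starts a new group because its head, filters, has not been seen before.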

Special Case Processing

The lexer also returns units of text that match patterns for the following units: possessive 's, titles, sentence boundary, comma, new paragraph, and a construction of adjective followed by a coordinating conjunction. In each of these cases, LinkIT updates certain state information pertinent to those returned units. The six cases are listed below.

Possessive 's. For a phrase with a possessive 's, as in Boston's Dana Farber Cancer Institute, LinkIT actually sees three separate units. The first is Boston, the second is a possessive 's, and the third is Dana Farber Cancer Institute. LinkIT considers this relationship to be similar to The Dana Farber Cancer Institute of Boston. When the LinkIT main module receives a possessive 's from the lexer, it sets the first NP as a possible head of the second NP, and the second NP as a possible modifier of the first NP. At the point where a possessive 's is returned from the lexer, LinkIT does not know what the second NP will be, so a variable is set that is checked each time through the main module's loop for this case.
Titles (e.g. Mr., Dr., etc.) The Alembic Utilities will mark common, and some uncommon, titles in the input as title words. The lexer will return a title as an independent unit. When the LinkIT main module receives a title, it simply requests the next NP from the lexer, prepends the title to that NP, and marks that NP as likely a human entity. It would also have been possible to include the title words in the NP rules; however, creating rules that allowed for a special title tag in the phrase would have increased the size of the resulting finite state machine.
Sentence boundary. The Alembic Utilities also detect sentence boundaries using a statistical method. The lexer will return a sentence boundary that has been tagged in the input file, and also one for a few cases that the tagger makes consistent errors on. LinkIT updates its count of the number of sentences it has seen on receipt of a sentence boundary unit. The sentence count is used to determine which sentence an NP is in when it is returned by the lexer.
Comma. When the lexer returns a comma, LinkIT checks whether the previous two NPs are in apposition with each other. It does this by keeping a stack of the past three units. If the stack holds an NP, a comma, and an NP, and the current unit is a comma, the two previous NPs might be in apposition. A comma is placed on the stack only if there are fewer than three units on the stack and there is no intervening text between the previous NP and this comma. If there is intervening text, the entire stack is cleared, since there can be no apposition involving the previous NPs using this comma. An NP is placed on the stack only if there are fewer than three units on the stack and there is no intervening text between the last comma and the NP. If a possible apposition is found, a possible apposition relation is made between the two NPs, and the stack is reset to contain just one NP and one comma, which represent the apposition of the two previous NPs.
Adjective and coordinating conjunction. Another special case that LinkIT handles is phrases of the type fast and cheap machines. If the lexer encounters an adjective followed by a coordinating conjunction, it returns that as an adjective coordinating conjunction unit. A variable is set that retains the information for the returned unit, and if the next unit is an NP with no intervening words, the adjective and coordinating conjunction are prepended to the next NP. Similar to how the possessive 's modification is handled, this is done with a variable that is set, and a check in the main LinkIT module.
New paragraph. When the lexer detects two or more carriage returns in a row, it returns a new paragraph unit. LinkIT simply updates its count of the number of paragraphs in the document, similar to how it does with a new sentence unit.
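Of the six cases, the comma case carries the most state, so a small sketch may help. The Python below illustrates the three-unit stack under stated assumptions; it is not LinkIT's code. The class name AppositionTracker is hypothetical, whitespace is assumed not to count as intervening text, an NP that cannot extend the stack is assumed to start a fresh one, and after an apposition is found the sketch keeps the second NP plus the current comma (the text above says only that one NP and one comma remain).

```python
class AppositionTracker:
    """Detect NP , NP , sequences as possible appositions."""

    def __init__(self):
        self.stack = []  # entries: ("NP", np_id) or ("COMMA", None)
        self.pairs = []  # possible appositions as (np_id, np_id)

    def on_np(self, np_id, skipped_text=""):
        follows_comma = bool(self.stack) and self.stack[-1][0] == "COMMA"
        if skipped_text.strip() or not follows_comma or len(self.stack) >= 3:
            # Intervening text, or nothing to attach to: start over.
            self.stack = [("NP", np_id)]
        else:
            self.stack.append(("NP", np_id))

    def on_comma(self, skipped_text=""):
        if skipped_text.strip():
            # No apposition can involve the previous NPs via this comma.
            self.stack = []
            return
        kinds = [kind for kind, _ in self.stack]
        if kinds == ["NP", "COMMA", "NP"]:
            first, second = self.stack[0][1], self.stack[2][1]
            self.pairs.append((first, second))
            # Assumption: keep the second NP and this comma.
            self.stack = [("NP", second), ("COMMA", None)]
        elif kinds == ["NP"]:
            self.stack.append(("COMMA", None))
        else:
            self.stack = []
```

For input resembling "John Smith, the president, said", feeding the NP, comma, NP, comma units in order records one possible apposition between the two NPs.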

NP Ranking

Talk about how we go through the process of ranking the NPs.

Results

What we are currently using LinkIT for, and some future stuff on evaluation.

Technical Description for Use

Command Line Arguments

operation with no command line args: stats to np.stat (what about stat2?) input from stdin, output to stdout.

LinkIT has many command line options to introduce flexibility into the program. The command line arguments can be broken down into the following categories:

File Manipulation
- -dict filename -d filename This option names the dictionary LinkIT should use. The dictionary contains words that have irregular plural forms, and indicates words that have strong connective characteristics. The dictionary is currently very small, and in a highly non-optimized state. By default, LinkIT looks in the working directory for a dictionary named ``NP.dict''.
- -noPrint By default, LinkIT echoes all output to stdout. The -noPrint option disables output to stdout. Debugging messages and status messages are still printed on stdout, while error messages are sent to stderr.
- -noStats This option disables generation of the .stat and .stat2 files.
- -output directory -o directory Send the output to the named directory. By default LinkIT sends all output files to the current working directory. If you want the output files to go to a different directory, use this option followed by the path to the directory to send all output files. This includes .stat, .stat2, .np and .out files. -o with no argument defaults to the current working directory. Without using -o, output will default to standard out. Not specifying an output directory and using -noPrint will suppress all output entirely.
Display
- -bracketText The -bracketText option forces printing of list 0. Each sentence of the input is printed one at a time, with the NPs in the input bracketed off using ( ) 's, and verb phrases in the input are bracketed using [ ] 's.
- -compareFormat The -compareFormat option prints the .np file in a format that we used specifically for performing a user evaluation study. It only prints out list II, and does not print any groups with fewer than two elements.
- -printInitial The -printInitial option forces printing of the list 0. Using this option, the list is the text of the input, with any NPs in the text placed on their own line. The part of speech information for the words in the NP can be printed with either the -printPOS or -printPOS2 options. There is an alternative form for similar information, see the -bracketText option. By default this option is off.
- -printListI [integer] The -printListI option will force printing of the list of sequential NPs. By default, the list is not printed. An integer value can optionally follow the option, with 0 disabling printing, and 1 enabling it.
- -printListII [integer] The -printListII option will force printing of the list of clustered NPs. This list is printed by default. An integer value can optionally follow the option, with 0 disabling printing, and 1 enabling it.
- -printListIII [integer] The -printListIII option will force printing of the list of words as heads and mods. By default, the list is printed. An integer value can optionally follow the option, with 0 disabling printing, and 1 enabling it.
- -printOf The -printOf option will change how NPs that are in apposition are printed. By default, NPs that are modified via of are printed followed by either (phead: #) or (pmod: #), depending on whether the NP is a possible head or possible modifier of the other phrase. With this option, both NPs are printed.
- -printOnlyOfNPs The -printOnlyOfNPs option will only print NPs that are modified using of. We use this option to look at large amounts of data to cull it for ``strong'' words for the dictionary.
- -printPOS The -printPOS switch will print the part of speech for each word in a NP. The POS is printed after the word and a backslash. For example, the is printed as the\DET.
- -printPOS2 The -printPOS2 switch also displays the part of speech tag of words in NPs, however, it displays the tags on the line under the word in the NP.
- -properOnly The -properOnly option will only print NPs that have a proper noun as the head. This option is not overly useful to most applications.
- -noFinal The -noFinal switch suppresses printing of any output except for list 0, which is some form of the input with the NPs that LinkIT determines. This switch supersedes any settings made with -printListX and is essentially an alias for -printListI 0 -printListII 0 -printListIII 0.
- -noInitial The -noInitial switch suppresses printing of list 0. The default behavior is to not print list 0.
- -noRel The -noRel switch suppresses printing relation links between words. The PAPP, POCC, PMOD, and PHEAD links are not printed when this switch is in effect.
- -noWordLists The -noWordLists option suppresses printing of list III. It is an alias for -printListIII 0, however, the setting used will be whatever was last set with either command.
Informative and Debugging Messages
- -? -h -help This switch prints a brief summary of the command line options for LinkIT.
- -noTime By default, at the end of each file LinkIT has processed, the amount of time required to process the file is printed at the end of the .np file. This switch disables printing that information.
- -verbosity integer -v integer This flag requires an integer parameter that specifies how much debugging information should be printed. A value of 0 is no information at all, while a value of 10 is the maximum amount of information. This switch is not useful to most users, and is intended for developmental use only.
- -version This flag prints the LinkIT version number.
Operation
- -sort [ freq or occ ] The -sort option determines how to sort the clusters in list II for display. freq sorts based on the number of elements in the cluster, and alphabetically based on the head of the NP for clusters with the same number of NPs. occ sorts based on the occurrence of the lowest numbered NP in the group: groups with NPs that were first to occur in the text are printed first, and groups with NPs that first occurred later are printed later. -sort alone is equivalent to -sort freq. The default is also to sort using freq.
- -weight The -weight option weights the terms in list III using both heads and modifiers to compute the final score for each NP. This is the default behavior.
- -weightHeadOnly The -weightHeadOnly option weights terms in list III using only the heads for each NP.
- -verbs [integer] This option tells LinkIT to do some simple verb processing. If the integer value is 1, LinkIT prints a list of all the verb phrases it found. 2 will print lists of verbs for each cluster of NPs; the verbs are extracted from the immediate left and right of all the NPs in the cluster. 3 is the same as 2, but the list is a bit cleaner; empty verb lists will not have a blank header printed. -verbs 0 does nothing since 0 is the same as off. The default is -verbs 0. -verbs alone is equivalent to -verbs 1.
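The two orderings offered by the -sort option above can be illustrated with a short Python sketch. It is not LinkIT's code: the function name sort_clusters and the dictionary representation of clusters are hypothetical, and it assumes that freq lists the largest clusters first, which the option description does not state explicitly.

```python
def sort_clusters(clusters, mode="freq"):
    """Order clusters for display.

    clusters maps a head word to the list of NP numbers in that cluster.
    "freq": by cluster size (assumed largest first), then alphabetically by head.
    "occ":  by the lowest numbered NP in each cluster, earliest first.
    """
    if mode == "freq":
        return sorted(clusters.items(),
                      key=lambda item: (-len(item[1]), item[0].lower()))
    if mode == "occ":
        return sorted(clusters.items(), key=lambda item: min(item[1]))
    raise ValueError("mode must be 'freq' or 'occ'")
```

For example, with clusters for share (NPs 3 and 7), approval (NPs 2 and 9) and company (NP 1), freq would list approval, share, company, while occ would list company, approval, share.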

A Small Example

A small example of tagging a file, and running LinkIT on that file. Show the output. Or should this be in the appendix as well?

Sample Input Article

Use wsj_0006 - it is very short, and should be sufficient to demonstrate operation.

Pacific First Financial Corp. said shareholders approved its acquisition by Royal Trustco Ltd. of Toronto for $27 a share, or $212 million.
The thrift holding company said it expects to obtain regulatory approval and complete the transaction by year-end.

Sample Tagged Input Article

Sample Output Formats

permutations of the output that we can produce.


About this document ...

The translation was initiated by David Evans on 1999-11-01

Footnotes

¹ This research was partly supported by NSF grant IRI-9712069, ``Automatic Identification of Significant Topics in Domain Independent Full Text'', Judith Klavans, PI; Nina Wacholder, co-PI.

David Evans
1999-11-01