GlossIt

GlossIt is a system that crawls the web and selects glossary files from government websites. We identify these files with the GetGloss module and then parse each definition with a module called ParseGloss.
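
GlossIt's own code is not published on this page. As a minimal sketch of how the two stages fit together, the skeleton below uses get_gloss and parse_gloss as hypothetical stand-ins for the GetGloss and ParseGloss modules described in the sections that follow.

```python
from pathlib import Path
from typing import Dict, List


def get_gloss(seed_url: str, out_dir: Path) -> List[Path]:
    """Stand-in for GetGloss: crawl from seed_url, save pages judged to be
    glossaries into out_dir, and return the saved file paths."""
    raise NotImplementedError("sketch only; see the GetGloss section below")


def parse_gloss(glossary_file: Path) -> List[Dict[str, str]]:
    """Stand-in for ParseGloss: split one glossary file into definition records."""
    raise NotImplementedError("sketch only; see the ParseGloss section below")


def run_glossit(seed_url: str, work_dir: Path) -> List[Dict[str, str]]:
    """Two-stage pipeline: find glossary files first, then parse every definition."""
    glossary_dir = work_dir / "glossaries"
    glossary_dir.mkdir(parents=True, exist_ok=True)
    definitions: List[Dict[str, str]] = []
    for glossary_file in get_gloss(seed_url, glossary_dir):
        definitions.extend(parse_gloss(glossary_file))
    return definitions
```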

 

  GetGloss

GetGloss is a software module that takes a URL as input, finds glossaries, and saves versions of these glossaries to a directory for analysis by ParseGloss.

GetGloss is particularly good at finding glossaries embedded within HTML pages that contain a large amount of other, non-glossary content. For example, http://www.fhwa.dot.gov/legsregs/directives/fapg/cfr4924a.htm contains a glossary, but the glossary is only one part of the page. See the XML version for the result of GetGloss processing.
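
The heuristics GetGloss uses to spot an embedded glossary are not detailed on this page. Purely as an illustration, and not the actual GetGloss logic, a fetched page could be scored by how many definition-list (dt/dd) pairs or "Term - definition" style lines it contains, so that a glossary is still recognized even when it is only one part of a larger page. The function name and the threshold below are assumptions.

```python
import re
from html.parser import HTMLParser


class DtDdCounter(HTMLParser):
    """Count <dt>/<dd> tags, a common way glossaries are marked up in HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.dt = 0
        self.dd = 0

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.dt += 1
        elif tag == "dd":
            self.dd += 1


def looks_like_glossary(html: str, min_entries: int = 10) -> bool:
    """Illustrative heuristic: treat a page as containing a glossary if it has
    enough <dt>/<dd> pairs or enough 'Term - definition' style text lines."""
    counter = DtDdCounter()
    counter.feed(html)
    if min(counter.dt, counter.dd) >= min_entries:
        return True
    # Fall back to plain-text patterns such as "Term - definition text".
    text = re.sub(r"<[^>]+>", " ", html)
    defn_lines = re.findall(r"^\s*[A-Z][\w ()/-]{1,60}\s*[-:]\s+\S", text, re.MULTILINE)
    return len(defn_lines) >= min_entries
```

A page like the FHWA directive linked above would pass such a check if its glossary section lists many term/definition pairs, even though most of the page is other regulatory text.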

 
  ParseGloss

Each glossary file is analyzed by ParseGloss. This procedure separates each definition into its major components, such as the primary word or phrase (i.e., the "genus" of the definition) and other meaningful relationships (e.g., "used for", "includes").
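
ParseGloss itself is not reproduced here. The sketch below only illustrates the kind of record such an analysis produces, using naive pattern matching to pull out a genus and a few relationship cues; the ParsedDefinition structure, the cue list, and the sample term are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedDefinition:
    term: str
    genus: str = ""                     # primary word or phrase of the definition
    relations: List[str] = field(default_factory=list)  # cues such as "used for"


def parse_definition(term: str, text: str) -> ParsedDefinition:
    """Naive illustration of splitting one definition into components.
    Assumes definitions shaped like 'A/An/The <genus> that/which ...'."""
    result = ParsedDefinition(term=term)

    # Genus: the noun phrase right after the leading article.
    m = re.match(
        r"^(?:An?|The)\s+([\w -]+?)(?:\s+(?:that|which|used|of|for)\b|[,.])",
        text,
        re.IGNORECASE,
    )
    if m:
        result.genus = m.group(1).strip()

    # Other relationships: look for a few fixed cue phrases.
    for cue in ("used for", "includes", "consists of", "part of"):
        if cue in text.lower():
            result.relations.append(cue)
    return result


# Example:
# parse_definition("Arterial", "A major road that carries large volumes of "
#                  "traffic, used for long trips.")
# -> ParsedDefinition(term='Arterial', genus='major road', relations=['used for'])
```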

Users can query the large definition database, which contains analyzed definitions from all of the websites we have crawled. A sample database is provided in the query pages accessible from this site. The list below gives a few of the websites represented in that sample, along with the size of each crawl and the number of glossaries and terms found; a sketch of a query against such a database follows the list.

  • www.achp.gov -- Advisory Council on Historic Preservation, 492-page crawl, 2 glossaries, 36 terms found
  • www.fhwa.dot.gov -- U.S. Department of Transportation, 3019-page crawl, 8 glossaries, 78 terms found
  • www.cpsc.gov -- U.S. Consumer Product Safety Commission, 890-page crawl, 2 glossaries, 19 terms found
  • www.epa.gov -- Environmental Protection Agency, 3753-page crawl, 13 glossaries, 146 terms found
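
The schema of the sample database and its query interface are not documented on this page. The sketch below assumes, purely for illustration, a simple relational layout with one table of parsed definitions tagged by source site; the table and column names are hypothetical.

```python
import sqlite3

# Hypothetical schema for a definition database like the sample described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS definitions (
    site      TEXT,  -- e.g. 'www.epa.gov'
    term      TEXT,  -- the defined word or phrase
    genus     TEXT,  -- primary word or phrase of the definition
    relations TEXT,  -- cues such as 'used for, includes'
    source    TEXT   -- URL of the glossary page the definition came from
);
"""


def terms_for_site(db_path: str, site: str):
    """Return (term, genus) pairs for every definition crawled from one site."""
    with sqlite3.connect(db_path) as con:
        con.executescript(SCHEMA)
        rows = con.execute(
            "SELECT term, genus FROM definitions WHERE site = ? ORDER BY term",
            (site,),
        ).fetchall()
    return rows


# Example: terms_for_site("glossit_sample.db", "www.fhwa.dot.gov") would list
# the terms found in the www.fhwa.dot.gov crawl described above.
```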