GlossIt
GlossIt is a system that crawls the web to collect glossary files from government websites. We identify these files with the GetGloss module, then parse each definition with a module called ParseGloss.
GetGloss is a software module that takes a URL as input, finds glossaries, and puts versions of these glossaries into a directory for analysis by ParseGloss. GetGloss is particularly good at finding glossaries that are embedded within HTML pages containing a lot of other, non-glossary content. For example, http://www.fhwa.dot.gov/legsregs/directives/fapg/cfr4924a.htm contains a glossary, but the glossary is only one part of the page. See the XML version for the result of GetGloss processing.
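As a rough illustration of this kind of extraction step (not the GetGloss implementation itself), the sketch below fetches a page, uses one simple heuristic for glossaries embedded in larger pages (definition-list markup, i.e. `<dl>`/`<dt>`/`<dd>` pairs), and writes any matches to a local directory for later parsing. The function names and output directory are assumptions made for the example.

```python
# Illustrative sketch only -- not the GetGloss implementation.
# Heuristic (assumed): treat <dt>/<dd> pairs as glossary entries.
import re
import urllib.request
from pathlib import Path

OUT_DIR = Path("glossaries")  # hypothetical output directory

def fetch_glossary_candidates(url: str) -> list[tuple[str, str]]:
    """Download a page and return (term, definition) pairs found in definition lists."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    pairs = []
    # Pair each <dt>...</dt> with the <dd>...</dd> that follows it.
    for m in re.finditer(r"<dt[^>]*>(.*?)</dt>\s*<dd[^>]*>(.*?)</dd>",
                         html, flags=re.S | re.I):
        term = re.sub(r"<[^>]+>", "", m.group(1)).strip()
        definition = re.sub(r"<[^>]+>", "", m.group(2)).strip()
        if term and definition:
            pairs.append((term, definition))
    return pairs

def save_glossary(url: str) -> None:
    """Write any glossary found at `url` into OUT_DIR for later analysis."""
    pairs = fetch_glossary_candidates(url)
    if not pairs:
        return
    OUT_DIR.mkdir(exist_ok=True)
    name = re.sub(r"\W+", "_", url) + ".txt"
    with open(OUT_DIR / name, "w", encoding="utf-8") as f:
        for term, definition in pairs:
            f.write(f"{term}\t{definition}\n")

# Example, using the page cited above:
# save_glossary("http://www.fhwa.dot.gov/legsregs/directives/fapg/cfr4924a.htm")
```

A real extractor would need more than one heuristic, since many glossaries are laid out with tables, headings, or plain paragraphs rather than definition lists.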
Each glossary file is analyzed by ParseGloss. This procedure separates each definition into its major components, such as the primary word or phrase (i.e., the "genus" of the definition) and other meaningful relationships (e.g., used for, includes).
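As a minimal sketch of this kind of analysis (not ParseGloss itself), the example below pulls a candidate genus out of a definition by taking the first noun after a leading article, and flags a few relationship cues such as "used for" and "includes". The cue list, field names, and patterns are assumptions for illustration only.

```python
# Illustrative sketch only -- not the ParseGloss implementation.
import re

# Relationship cues to look for in the body of a definition (assumed list).
RELATION_CUES = ("used for", "includes", "consists of", "part of")

def analyze_definition(term: str, definition: str) -> dict:
    """Split a definition into a candidate genus and any cued relationships."""
    result = {"term": term, "genus": None, "relations": []}

    # Genus heuristic: the first word after a leading article,
    # e.g. "A structure that carries ..." -> "structure".
    m = re.match(r"\s*(?:an?|the)\s+([A-Za-z]+)", definition, re.I)
    if m:
        result["genus"] = m.group(1).lower()

    lowered = definition.lower()
    for cue in RELATION_CUES:
        if cue in lowered:
            # Keep the clause following the cue as the relation's object.
            obj = lowered.split(cue, 1)[1].strip(" ,.;")
            result["relations"].append({"type": cue, "object": obj})
    return result

# Example:
# analyze_definition("Bridge",
#     "A structure that carries traffic over an obstacle and includes abutments.")
# -> genus "structure", one "includes" relation with object "abutments".
```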
Users can query the large definition database, which contains analyzed definitions from all the websites we have crawled. A sample database is provided in the query pages accessible from this site. Below are just a few of the websites represented in this sample, along with the number of downloaded pages from which their glossaries were found.
[Table: sample websites and the number of downloaded pages from which their glossaries were found]