GlossIt


GlossIt Studies


GetGloss

We have collected statistics on which methods perform best at finding glossaries. In a sample of pages that we collected, looking for the word "glossary" in the title, header, and text gives only 73% recall: a glossary is missed whenever the word "Glossary" does not appear anywhere on the page. See the table below for more details. The table covers only pages that GetGloss classified as glossaries. We restricted the set this way because we wanted to measure recall (True Positives / (True Positives + False Negatives)), which is difficult when the total set of pages is very large. Here it is: a crawl can bring in over 100,000 pages, and examining all of them manually is not feasible. The first row covers pages that turned out to be glossaries, the second row covers pages that are neither glossaries nor ordered lists, the third row covers ordered lists that are not glossaries, and the fourth row covers pages whose category was unclear.

Category | Glossary in Title | Glossary in Header | Glossary anywhere in text | Total | Percent with the word Glossary
Glossaries | 42 | 59 | 162 | 222 | 73.0%
Not an Ordered List | 0 | 0 | 1 | 26 | 3.8%
An Ordered List, but not a Glossary | 0 | 0 | 14 | 76 | 18.4%
Unclear | 0 | 0 | 0 | 19 | 0.0%
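
The keyword baseline and the recall figures can be sketched in a few lines of Python. This is a minimal sketch, not the GetGloss implementation: the function names are hypothetical, the counts are copied from the "Glossaries" row of the table, and the matching rule (a case-insensitive search for "glossary") is an assumption about how the word was looked up.

```python
import re

# Counts copied from the "Glossaries" row of the table above.
GLOSSARY_PAGES = 222
KEYWORD_HITS = {"title": 42, "header": 59, "anywhere in text": 162}


def mentions_glossary(title, headers, text):
    """Keyword baseline (assumed rule): True if "glossary" occurs,
    case-insensitively, in the title, any header, or the body text."""
    pattern = re.compile(r"glossary", re.IGNORECASE)
    return any(pattern.search(part) for part in [title, *headers, text])


def recall(true_positives, false_negatives):
    """Recall = True Positives / (True Positives + False Negatives)."""
    return true_positives / (true_positives + false_negatives)


def keyword_recall():
    """Recall of each keyword location over the confirmed glossary pages.
    Glossary pages without the keyword count as false negatives."""
    return {
        where: recall(hits, GLOSSARY_PAGES - hits)
        for where, hits in KEYWORD_HITS.items()
    }


if __name__ == "__main__":
    for where, value in keyword_recall().items():
        print(f"{where}: {value:.3f}")
```

With these counts, the "anywhere in text" entry works out to 162 / 222, which is the 73% recall quoted above. Restricting the search to the title or the header finds fewer of the glossaries.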

We have provided full results taken from a crawl of various Federal Agency websites. Pages were downloaded, classified by GetGloss, and then manually tagged to see how effective the classification was. These results show more completely where GetGloss does well and where it is fooled by ordered lists of information that are not glossaries.
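
One way to summarize manual tags of this kind is the share of pages classified as glossaries that were confirmed as glossaries. The sketch below uses the row totals from the table above as input. Treating those totals as a precision estimate is an assumption made here for illustration, and the full results may cover a different set of pages. The function name and the `skip` parameter are hypothetical.

```python
# Row totals from the table above: manual tags for pages that
# GetGloss classified as glossaries.
MANUAL_TAGS = {
    "glossary": 222,
    "not an ordered list": 26,
    "ordered list, not a glossary": 76,
    "unclear": 19,
}


def precision(tags, positive="glossary", skip=()):
    """Share of classified pages whose manual tag is `positive`.
    Tags listed in `skip` are left out of the denominator."""
    counted = {tag: n for tag, n in tags.items() if tag not in skip}
    return counted[positive] / sum(counted.values())


if __name__ == "__main__":
    print(f"all pages:         {precision(MANUAL_TAGS):.3f}")
    print(f"excluding unclear: {precision(MANUAL_TAGS, skip=('unclear',)):.3f}")
```

Whether "unclear" pages belong in the denominator is a judgment call, so the sketch reports both variants rather than picking one.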

ParseGloss

We are conducting a study to find which parts of a glossary definition people consider most important. This will allow us to establish a "gold standard" against which the accuracy of ParseGloss can be measured. It will also allow us to measure the effect of changes and thereby improve that accuracy.

The study is accessible here.