MetaX - Automatic Extraction of Descriptive Metadata from Text for Images
The MetaX system is under design and development at Columbia
University as part of the Digital Library research effort.
The goal of the system is to use computational linguistic
techniques to automatically extract descriptive metadata from text
written about images. In this way, text can be
semi-automatically mined for potential metadata, which can then either be
reviewed by an expert or used in automatic metadata
harvesting systems (e.g. SDARTS).
This page shows the second scenario that we tested. Our second test data came from
text about a statue of Bacchus by Michelangelo (in progress), and can be
found here.
Executive Summary of Initial Results for Architectural Images:
In the standard manual markup, there were 52 noun phrases; MetaX
found 51 of them, counting fragments as correct. Of the 42 possible
keywords, MetaX found 38. MetaX also found 2 of 2 place names, 5 of 5
dates, 1 of 1 related targets, and 12 of 15 non-place names. In addition,
MetaX found words that are not in the manual markup; these need to be
evaluated so that users are not inundated with too much additional information.
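The counts above can be turned into per-category recall (the share of manually marked items that MetaX also found). The following is a minimal sketch using only the figures reported in this summary; the dictionary layout and the function name are illustrative assumptions, not part of MetaX. Precision cannot be computed from these figures, because the number of additional words MetaX found is not reported here.

```python
# Counts reported in the summary above: (found by MetaX, total in manual markup).
# The noun phrase figure counts fragments as correct, as stated in the summary.
counts = {
    "noun phrases": (51, 52),
    "keywords": (38, 42),
    "place names": (2, 2),
    "dates": (5, 5),
    "related targets": (1, 1),
    "non-place names": (12, 15),
}


def recall(found, total):
    """Fraction of manually marked items that were also found automatically."""
    return found / total


recalls = {category: recall(f, t) for category, (f, t) in counts.items()}

for category, value in recalls.items():
    print(f"{category}: {value:.1%}")
```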
The purpose of this file is to illustrate the processes under
development. We have set these steps out sequentially so that users can
follow each stage of processing and analysis of results.
Steps in the Process:
As an overview of the process, we first link to source
text from
Greene and Greene by Edward R. Bosley (Phaidon Press, Inc. 2000), henceforth
referred to as Greene & Greene (both the title of the book
and the names of the architects themselves). An
extract was scanned and then manually marked up, with specific
attention
paid to the Greene & Greene project called the L.A. Robinson house. That same text was then
marked up by MetaX.
The following files will lead you through the steps that we have explored to see how
our existing tools compare to manually marked up text. This work is still in the experimental stage, and all
criticisms are welcome.
Steps Towards the Automatic Creation of Metadata using MetaX:
Preliminary Results
- Step One: Input data. This link points to an example of text taken
from Bosley, Edward R., Greene & Greene, Phaidon Press, Inc., 2000.
The text describes the firm and the various projects in California that
they designed. The specific excerpt is from Chapter Four, "Stones
of the Arroyo".
Click here to see the fuller chapter four.
- Step Two: Images pertaining to the L.A. Robinson House -
Here
- Step Three: We have taken two paragraphs from Chapter Four for
deep analysis and comparison. These paragraphs describe the
L. A. Robinson House.
This is the data that we have
processed manually and automatically for this example case.
Click
here for the original paragraphs.
- Step Four: The extract describing the L. A. Robinson House was
manually marked up. This provides the goal, or gold standard, against which
we compare the automatic MetaX processing. In this file, you
will find keywords, noun phrases, place names, other proper nouns,
dates, and related targets.
Click here for the manually marked up file.
- Step Five: This step shows the results of processing the same two
paragraphs that were manually analyzed for Step Four, but this time
they were automatically analyzed by MetaX. What we first show is the
raw output to give an idea of the kinds of data we are able to
derive. This file is sorted by category, e.g. keywords, noun
phrases, place names, dates, etc. Each category is sorted by
frequency, so the most frequent words and phrases occur at the top of
each category.
Click here for full MetaX output
- Step Six: The next step is to compare the two methods. The first
method is in-depth manual analysis, and the second is automatic
processing. In this file, the MetaX results are on the left, and the
manual results are on the right. Each table shows the different types
of output. For example, the first table shows keywords as found and
compared for both methods. Using this table, it is easy to see what
the manual method found, what MetaX found, and what keywords both
methods found.
Click here for the comparison file.
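Steps Five and Six describe two operations: sorting each category of output by frequency, and lining up the MetaX results against the manual results. The sketch below shows one way these could be expressed. It is not the MetaX implementation; the function names are hypothetical, and the sample keywords are invented for illustration and are not actual MetaX output.

```python
from collections import Counter


def sort_by_frequency(items):
    """Return (term, count) pairs, most frequent first; ties sorted alphabetically."""
    counts = Counter(items)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def compare(metax_terms, manual_terms):
    """Split terms into those found by both methods, by MetaX only, and manually only."""
    metax, manual = set(metax_terms), set(manual_terms)
    return {
        "both": sorted(metax & manual),
        "metax_only": sorted(metax - manual),
        "manual_only": sorted(manual - metax),
    }


# Invented sample data, for illustration only (not actual MetaX output).
metax_keywords = ["house", "furniture", "house", "roof", "house", "furniture"]
manual_keywords = ["house", "furniture", "garden"]

ranked = sort_by_frequency(metax_keywords)
table = compare(metax_keywords, manual_keywords)

print(ranked)
print(table)
```

The same two functions would apply unchanged to any other category (noun phrases, place names, dates), since each is treated as a plain list of strings.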
The following files show some of the supporting information needed to
understand our output.
- This file shows each noun phrase in the manually created file that
does not appear in MetaX in the exact form that it was manually
identified. The purpose of showing this is to illustrate what MetaX
produces in more detail. Sometimes we produce a fragment, e.g. "Ming
dynasty furniture" is manually produced, whereas MetaX produces "early
Ming dynasty furniture". Decisions on the size of noun phrases, and
inclusion or exclusion of modifiers will have to be made.
Click
here to see noun phrase comparison file.
- In order to process text, we first tag the text to label parts of
speech. This is done automatically. If you want to see the output of the
tagger, click here
for tagged text.
- We next identify noun phrases, using the tagged text as input. If
you want to see the output of noun phrase identification, click here to see noun phrase tagged text.
- As background for the Greene and Greene data, you can see a full List
of Greene & Greene Projects. This file shows the project numbers,
names, and a brief description of images associated with these
projects.
Here
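The supporting files above involve two ideas: identifying noun phrases from part-of-speech tagged text, and deciding whether a manual phrase and an automatic phrase differ only by a fragment. The sketch below illustrates both under stated assumptions. It is not the MetaX tagger or noun phrase identifier: the tags are Penn Treebank-style labels assigned by hand, the sample sentence is invented, and both function names are hypothetical. Only the "Ming dynasty furniture" pair is taken from this page.

```python
def noun_phrases(tagged):
    """Collect maximal runs of adjectives and nouns that end in a noun."""
    phrases, current = [], []
    # A sentinel token flushes any run still open at the end of the input.
    for word, tag in tagged + [("", "END")]:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append((word, tag))
            continue
        # The run has ended: drop trailing adjectives so the phrase ends in a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases


def is_fragment_match(a, b):
    """True if the words of one phrase form a contiguous run inside the other."""
    short, long_ = sorted((a.lower().split(), b.lower().split()), key=len)
    n = len(short)
    return any(long_[i:i + n] == short for i in range(len(long_) - n + 1))


# Invented, hand-tagged sentence for illustration only.
tagged = [
    ("The", "DT"), ("room", "NN"), ("held", "VBD"), ("early", "JJ"),
    ("Ming", "NNP"), ("dynasty", "NN"), ("furniture", "NN"), (".", "."),
]

found = noun_phrases(tagged)
print(found)
print(is_fragment_match("Ming dynasty furniture", found[-1]))
```

Whether modifiers such as "early" belong inside the phrase is exactly the kind of decision noted above; in this sketch it is controlled by whether adjective tags are accepted into a run.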
The following link (password required) leads to Stephen Davis' pages for MetaX --
Here