MetaX - Automatic Extraction of Descriptive Metadata from Text for Images

The MetaX system is under design and development at Columbia University and is part of the Digital Library research effort. The goal of the system is to use computational linguistic techniques to automatically extract descriptive metadata from text written about images. In this way, text can be semi-automatically mined for candidate metadata, which can then either be reviewed by an expert or used in automatic metadata harvesting systems (e.g., SDARTS).

This page shows the second scenario that we tested. Our second test data set came from text about a statue of Bacchus by Michelangelo (in progress); it can be found here.

Executive Summary of Initial Results for Architectural Images: In the standard manual markup, there were 52 noun phrases; MetaX found 51 of them, counting fragments as correct. Of the 42 possible keywords, MetaX found 38. MetaX also found 2 of 2 place names, 5 of 5 dates, 1 of 1 related targets, and 12 of 15 non-place names. In addition, MetaX found words that are not in the manual markup; these need to be evaluated so that users are not inundated with extraneous information.
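To make the figures above easier to compare across categories, the following sketch converts each reported count into a recall percentage (items found by MetaX divided by items in the manual markup). The counts are copied from the summary; the names `reported` and `recall` are ours and are not part of MetaX.

```python
# Counts reported in the executive summary:
# category -> (found by MetaX, total in the manual markup).
reported = {
    "noun phrases": (51, 52),     # fragments counted as correct
    "keywords": (38, 42),
    "place names": (2, 2),
    "dates": (5, 5),
    "related targets": (1, 1),
    "non-place names": (12, 15),
}


def recall(found, total):
    """Fraction of manually identified items that MetaX also found."""
    return found / total


for category, (found, total) in reported.items():
    print(f"{category:16s} {found:2d}/{total:2d}  recall = {recall(found, total):.1%}")
```

Only recall can be computed from these numbers. Precision would also require a count of the additional words MetaX found, which the summary does not give.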

The purpose of this file is to illustrate the processes under development. We have set the steps out sequentially so that users can follow each stage of processing and the analysis of results.

Steps in the Process: As an overview, we first link to source text from Greene and Greene by Edward R. Bosley (Phaidon Press, Inc., 2000), henceforth known as Greene & Greene (both the title of the book and the names of the architects themselves). An extract was scanned and then manually marked up, with specific attention paid to the Greene & Greene project called the L.A. Robinson house. The same text was then marked up by MetaX.

The following files will lead you through the steps that we have explored to see how our existing tools compare to manually marked-up text. This work is still in the experimental stage, and all criticisms are welcome.

Steps Towards the Automatic Creation of Metadata using MetaX: Preliminary Results

Step One: Input data. This link points to an example of text taken from Bosley, Edward R., Greene & Greene, Phaidon Press, Inc., 2000. The text describes the firm and various projects in California that they designed. The specific excerpt is from Chapter Four, "Stones of the Arroyo". Click here to see the fuller Chapter Four.
Step Two: Images pertaining to the L.A. Robinson House - Here
Step Three: We have taken two paragraphs from Chapter Four for deep analysis and comparison. These paragraphs describe the L. A. Robinson House. This is the data that we have processed manually and automatically for this example case. Click here for the original paragraphs.
Step Four: The extract describing the L. A. Robinson House was manually marked up. This provides the goal, or gold standard, against which we compare the automatic MetaX processing. In this file, you will find keywords, noun phrases, place names, other proper nouns, dates, and related targets. Click here for the manually marked up file.
Step Five: This step shows the results of processing the same two paragraphs that were manually analyzed for Step Four, this time analyzed automatically by MetaX. We first show the raw output to give an idea of the kinds of data we are able to derive. This file is sorted by category, e.g., keywords, noun phrases, place names, dates, etc. Each category is sorted by frequency, so the most frequent words and phrases occur at the top of each category. Click here for full MetaX output.
Step Six: The next step is to compare the two methods. The first method is in-depth manual analysis, and the second is automatic processing. In this file, the MetaX results are on the left, and the manual results are on the right. Each table shows the different types of output. For example, the first table shows keywords as found and compared for both methods. Using this table, it is easy to see what the manual method found, what MetaX found, and what keywords both methods found. Click here for the comparison file.
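The frequency ordering in Step Five and the side-by-side comparison in Step Six can be sketched as follows. This is only an illustration under our own assumptions: the function names and the sample keyword lists are invented for the example and are not MetaX output, and matching here is a simple case-insensitive exact match.

```python
from collections import Counter


def sort_by_frequency(terms):
    """Return (term, count) pairs, most frequent first; ties in alphabetical order."""
    counts = Counter(terms)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def compare(metax_terms, manual_terms):
    """Split terms into those found by both methods and those found by only one."""
    metax = {term.lower() for term in metax_terms}
    manual = {term.lower() for term in manual_terms}
    return {
        "both": sorted(metax & manual),
        "metax_only": sorted(metax - manual),
        "manual_only": sorted(manual - metax),
    }


# Invented sample data, for illustration only.
metax_keywords = ["house", "Robinson", "house", "furniture",
                  "Pasadena", "house", "furniture"]
manual_keywords = ["house", "Robinson", "furniture", "arroyo"]

print(sort_by_frequency(metax_keywords))
print(compare(metax_keywords, manual_keywords))
```

The three groups correspond to what a reader can see in each comparison table: what both methods found, what only MetaX found, and what only the manual method found.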

The following files show some of the supporting information to understand our output.

This file shows each noun phrase in the manually created file that does not appear in the MetaX output in the exact form in which it was manually identified. The purpose of showing this is to illustrate in more detail what MetaX produces. Sometimes the two versions differ only in extent: for example, the manual markup has "Ming dynasty furniture", whereas MetaX produces "early Ming dynasty furniture". Decisions on the size of noun phrases, and on the inclusion or exclusion of modifiers, will have to be made. Click here to see noun phrase comparison file.
In order to process text, we first tag it with part-of-speech labels. This is done automatically. If you want to see the output of the tagger, click here for tagged text.
We next identify noun phrases, using the tagged text as input. If you want to see the output of noun phrase identification, click here to see noun phrase tagged text.
As background for the Greene and Greene data, you can see a full List of Greene & Greene Projects. This file shows the project numbers, names, and a brief description of images associated with these projects. Here
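The supporting steps described above (tagging, noun phrase identification, and checking whether a manual phrase appears exactly or only as a fragment) can be sketched as follows. This is a toy illustration and not the MetaX tagger or chunker: the lexicon, the sample sentence, the chunking rule (runs of adjectives and nouns that end in a noun), and all function names are our assumptions. The tag labels follow the Penn Treebank convention.

```python
# Toy lexicon mapping lower-cased words to part-of-speech tags (an assumption;
# a real tagger would not be a lookup table). Unknown words default to NN.
LEXICON = {
    "the": "DT", "house": "NN", "contains": "VBZ", "early": "JJ",
    "ming": "NNP", "dynasty": "NN", "furniture": "NN",
}


def tag(text):
    """Label each token with a part-of-speech tag from the toy lexicon."""
    tokens = text.replace(".", " ").split()
    return [(tok, LEXICON.get(tok.lower(), "NN")) for tok in tokens]


def noun_phrases(tagged):
    """Collect runs of adjectives and nouns, dropping trailing adjectives."""
    phrases, current = [], []
    for word, pos in tagged + [("", "END")]:   # sentinel flushes the last run
        if pos == "JJ" or pos.startswith("NN"):
            current.append((word, pos))
            continue
        while current and current[-1][1] == "JJ":
            current.pop()
        if current:
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases


def contains_tokens(longer, shorter):
    """True if the token list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def match_kind(manual_phrase, metax_phrases):
    """Classify a manual phrase as 'exact', 'fragment', or 'missing'."""
    manual = manual_phrase.lower().split()
    candidates = [p.lower().split() for p in metax_phrases]
    if manual in candidates:
        return "exact"
    if any(contains_tokens(c, manual) or contains_tokens(manual, c)
           for c in candidates):
        return "fragment"
    return "missing"


# Invented sample sentence, for illustration only.
tagged = tag("The house contains early Ming dynasty furniture.")
phrases = noun_phrases(tagged)
print(tagged)
print(phrases)
print(match_kind("Ming dynasty furniture", phrases))
```

Matching on whole tokens rather than characters avoids counting a phrase as a fragment merely because one word happens to be a substring of another.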

The following link (password required) leads to Stephen Davis' pages for MetaX -- Here