Findings from these activities

Starting with journal articles, our research on patient-specific summarization has involved developing techniques to identify the different patient groups studied in an article, techniques to identify the exact sentences within the Results section that report actual results, and information extraction techniques to match the characteristics of patient groups in the articles against the characteristics of the patient under the physician's care. Our current prototype produces one summary per article that presents its result sentences (see http://www.cs.columbia.edu/diglib/PERSIVAL).

This summary must be filtered to produce a much shorter version that matches the patient characteristics. We have developed some initial information extraction techniques that, based on context, determine the candidate phrases to be matched against attributes extracted from the patient records. Currently, we match only certain types of result sentences and have patterns that match multivariate results.

Over the next year, our work on journal articles will focus on learning new patterns for patient matches, encoding patterns for different sentence types, and moving from a summary of a single article to a summary across articles.

The results for consumer health have been very positive, on par with average human performance for some of the basic tasks, as reported in our latest paper submission. The system uses machine learning to identify headers, tables, captions and main text, and determines the hierarchical structure of a document both when layout features are present and when they are not. Our work on integrating layout and lexical chains for hierarchical segmentation is called CLASP, shown at http://www.cs.columbia.edu/~min/research/clasp/.

The analysis of consumer health (i.e., lay) articles has included a study of how sentences within the genre of patient information articles are structured, so that we can a) better understand the information expected when analyzing such an article, and b) decide and order the content of such articles in a text generation framework. This work is expected to continue into the fall semester as we a) further analyze the texts, b) propose algorithms to predict and to detect the flow of text in these patient health information documents, and c) manually simulate the algorithms to assess their performance.

To date, we have implemented a system called DEFINDER (Definition Finder), which recognizes and extracts definitions and the terms they define from a set of on-line consumer-oriented medical articles. We have compared our results with two existing on-line medical dictionaries and glossaries (including the UMLS Metathesaurus).

The corpus used consists of cardiology lay articles, and was split into 75% for development and 25% for testing. The analysis of the development set revealed that almost 60% of the definitions in this corpus are introduced by a limited set of text markers ('-', '()'), the other 40% being identified by more complex linguistic phenomena (apposition, anaphora, conjoined definitions). Given this, our system consists of two main modules: first, we run a pattern extractor using text markers, and then we run a natural language parser, English Slot Grammar, for more complex linguistic structures.
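The two-module design described above can be illustrated with a minimal Python sketch. It is not the DEFINDER implementation: the regular expressions, the function names, and the restriction to single-word terms are all assumptions made for illustration, and the second module is represented only by a hypothetical `parser_fallback` callable standing in for a parser such as English Slot Grammar.

```python
import re

# Hypothetical patterns: the report names only the markers '-' and '()'.
# Simplifying assumption: the term is the single word just before the marker.
PAREN = re.compile(r"\b(?P<term>[A-Za-z][A-Za-z-]*)\s*\((?P<definition>[^()]+)\)")
DASH = re.compile(r"\b(?P<term>[A-Za-z][A-Za-z-]*)\s+-\s+(?P<definition>[^.;]+)")


def extract_with_markers(sentence):
    """Module 1: return (term, definition) pairs introduced by '()' or ' - '."""
    pairs = []
    for pattern in (PAREN, DASH):
        for match in pattern.finditer(sentence):
            pairs.append((match.group("term"), match.group("definition").strip()))
    return pairs


def extract_definitions(sentence, parser_fallback=None):
    """Try the marker patterns first; otherwise defer to a parser-based module.

    parser_fallback is a hypothetical callable (sentence -> list of pairs)
    standing in for the second module; no real parser API is assumed here.
    """
    pairs = extract_with_markers(sentence)
    if not pairs and parser_fallback is not None:
        pairs = parser_fallback(sentence)
    return pairs
```

For example, `extract_with_markers("Arrhythmia - an irregular heartbeat.")` yields the pair `("Arrhythmia", "an irregular heartbeat")`. Sentences that rely on apposition or anaphora carry no such marker and would fall through to the fallback.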

We performed two kinds of evaluation. First, we built a test set of 93 terms and their associated definitions, extracted from our corpus. For term identification, our methods achieved 84% precision and 83% recall. For terms and definitions, we compared our results with two existing on-line medical dictionaries, including the UMLS. The AMIA publication discusses these comparisons, showing that existing on-line dictionaries appear to be incomplete. For example, we get only a 60% exact match between our term set and the UMLS; 24% of the terms are present but undefined; and 16% are absent altogether. However, an in-depth analysis of the absent terms indicates that the corresponding 'narrower' or 'broader' terms might be present, e.g. valvoplasty was identified by DEFINDER, whereas balloon valvoplasty exists in UMLS. Furthermore, the technical terms in resources such as UMLS tend to be defined using other unexplained technical terms, which can be confusing for the layperson.
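As a reminder of how the precision and recall figures for term identification are defined, here is a small Python sketch. The function name and the toy term sets are invented for illustration and are not taken from the 93-term test set.

```python
def precision_recall(extracted, gold):
    """Compute precision and recall of extracted terms against a gold set.

    precision = correct extracted terms / all extracted terms
    recall    = correct extracted terms / all gold terms
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (made up): 3 of 5 extracted terms are correct, 3 of 4 gold terms found.
gold_terms = {"angioplasty", "arrhythmia", "stent", "angina"}
system_terms = {"angioplasty", "arrhythmia", "stent", "artery", "valve"}
precision, recall = precision_recall(system_terms, gold_terms)
```

On this toy data the precision is 0.6 and the recall is 0.75.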

As an immediate future step, we will test our system on a larger corpus of consumer-oriented medical text. We plan to explore new techniques (including combining robust statistical methods for term identification with rule-based methods) for identifying new relational information beyond definitions, as well as methods for folding our results into existing resources. We plan to extend the architecture of our system by adding a learning component for information extraction and by using learning techniques to merge multiple extracted definitions of the same medical term. In applications, we will explore using our results to present information to patients in a language they can understand, by providing accurate and readable lay definitions for technical terms, in conjunction with summarization. We will also explore applying our system as a diagnostic for categorizing unknown articles into levels such as lay or technical.

As part of the analysis component, we have recrafted many of our analysis and term identification procedures to use XML, e.g. tokenization, POS tagging, sentence boundary determination, NP determination, resolution of within-NP coordination, and UMLS lookup. This has been an important step towards integration, since different groups in the project need similar input. Our XML tools make shared processing possible in a uniform, human-readable format. Also, the XML pipeline works on different kinds of text (book chapter, patient record, journal article, consumer health text) without any code change.
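The idea of a shared XML representation that successive stages annotate can be sketched in Python as follows. This is only an illustration under stated assumptions: the element names (`document`, `s`, `tok`), the `term` attribute, the naive sentence splitter, and the toy lexicon are all hypothetical and do not reflect the project's actual schema or its UMLS lookup.

```python
import re
import xml.etree.ElementTree as ET


def to_xml(text, doc_type):
    """First stage: wrap raw text in <document> with <s> sentences and <tok> tokens."""
    doc = ET.Element("document", {"type": doc_type})
    # Naive sentence split after '.', '!' or '?' followed by whitespace (assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        s = ET.SubElement(doc, "s")
        # Tokens are runs of word characters or single punctuation marks.
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            ET.SubElement(s, "tok").text = token
    return doc


def tag_lookup(doc, lexicon):
    """Later stage: mark tokens found in a term lexicon, editing the tree in place."""
    for tok in doc.iter("tok"):
        if tok.text.lower() in lexicon:
            tok.set("term", "yes")
    return doc


doc = tag_lookup(to_xml("Angina is chest pain. See a doctor.", "consumer health text"),
                 {"angina"})
xml_string = ET.tostring(doc, encoding="unicode")
```

Because each stage reads and writes the same tree, the same two functions apply unchanged whether `doc_type` is a journal article or a patient record, which is the property the paragraph above attributes to the XML pipeline.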

Noemie Elhadad
2000-08-01