Starting with journal articles, our research on patient-specific summarization
has involved development of techniques to identify different patient groups
studied in the article, development of techniques to identify the exact
sentences within the Results Section of the article which correspond to actual
results, and development of information extraction techniques to match the
characteristics of patient groups in the articles against characteristics of
the patient under physician care. Our current prototype produces a summary per
article that represents result sentences (see
http://www.cs.columbia.edu/diglib/PERSIVAL).
This summary must be filtered to produce a much shorter version that matches
the patient characteristics. We have developed some initial information
extraction techniques that, based on context, determine the candidate phrases
to be matched against attributes extracted from the patient records. Currently,
we match only certain types of result sentences, and we have patterns that
match multivariate results.
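The filtering step described above can be illustrated with a minimal sketch.
It is not the prototype's method: plain keyword matching stands in for the
context-based information extraction, and the function name, sentences and
attribute list are all hypothetical.

```python
import re


def match_sentences(result_sentences, patient_attributes):
    """Keep result sentences that mention at least one patient attribute.

    Assumption: whole-word, case-insensitive keyword matching is used here
    only as a stand-in for context-based candidate phrase extraction.
    """
    matched = []
    for sentence in result_sentences:
        lowered = sentence.lower()
        hits = [
            attribute
            for attribute in patient_attributes
            if re.search(r"\b" + re.escape(attribute.lower()) + r"\b", lowered)
        ]
        if hits:
            matched.append((sentence, hits))
    return matched


# Illustrative input only; not taken from any article or patient record.
sentences = [
    "Patients with diabetes had a higher rate of restenosis.",
    "Mortality was similar in both groups.",
]
filtered = match_sentences(sentences, ["diabetes", "hypertension"])
```

Here only the first sentence survives the filter, since it is the only one
that mentions an attribute from the hypothetical patient record.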
Over the next year, our work on journal articles will focus on learning new
patterns for patient matches, encoding patterns for different sentence types,
and moving from a summary of a single article to a summary across articles.
The results for consumer health have been very positive, on par with average
human performance for some of the basic tasks, as reported in our latest paper
submission. The system uses machine learning to identify headers, tables,
captions and main text and to determine the hierarchical structure of a
document, both when layout features are present and when they are not.
Our work on integrating layout and lexical chains for hierarchical segmentation
is called CLASP, shown at
http://www.cs.columbia.edu/~min/research/clasp/.
The analysis of consumer health (i.e., lay) articles has included a study of how
sentences within the genre of patient information articles are structured, so
that we can a) better understand the information expected in analyzing such an
article, and b) decide and order the content of such articles in a text
generation framework. This work is expected to continue into the fall
semester as we a) further analyze the texts, b) propose algorithms to predict
and detect the flow of text in these patient health information documents,
and c) manually simulate the algorithms to assess their performance.
To date, we have implemented a system called DEFINDER (Definition Finder) which
recognizes and extracts definitions and the terms they define from a set of
on-line consumer-oriented medical articles. We have compared our results with
two existing on-line medical dictionaries and glossaries (including the UMLS
Metathesaurus).
The corpus consists of cardiology lay articles and was split into 75% for
development and 25% for testing. Analysis of the development set revealed
that almost 60% of the definitions in this corpus are introduced by a limited
set of text markers ('-', '()'), with the other 40% identified by more
complex linguistic phenomena (apposition, anaphora, conjoined definitions).
Given this, our system consists of two main modules: first, we run a pattern
extractor using text markers, and then we run a natural language parser,
English Slot Grammar, for more complex linguistic structures.
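To make the first module concrete, here is a minimal sketch of a
marker-based extractor for the two text markers named above ('-' and '()').
The regular expressions, function names and example sentences are
assumptions made for illustration; they are not DEFINDER's actual patterns,
and the parser-based second module is represented only by a fallback check.

```python
import re

# Hypothetical pattern for "term - definition" at the start of a sentence.
DASH_PATTERN = re.compile(
    r"^(?P<term>[A-Za-z][A-Za-z ]*?)\s+-\s+(?P<definition>.+?)\.?$"
)
# Hypothetical pattern for "term (definition)"; the term is assumed to be
# the one or two words immediately before the parenthesis.
PAREN_PATTERN = re.compile(
    r"\b(?P<term>[A-Za-z]+(?: [A-Za-z]+)?)\s*\((?P<definition>[^()]+)\)"
)


def extract_definitions(sentence):
    """Return (term, definition) pairs signalled by '-' or '()' markers."""
    sentence = sentence.strip()
    pairs = []
    dash = DASH_PATTERN.match(sentence)
    if dash:
        pairs.append((dash.group("term").strip(),
                      dash.group("definition").strip()))
    for match in PAREN_PATTERN.finditer(sentence):
        pairs.append((match.group("term"), match.group("definition").strip()))
    return pairs


def needs_parser(sentence):
    """True when no marker fired, i.e. the sentence would be handed to the
    second, parser-based module (not implemented in this sketch)."""
    return not extract_definitions(sentence)


# Illustrative sentences only; not drawn from the cardiology corpus.
dash_pairs = extract_definitions(
    "Angioplasty - a procedure that opens blocked arteries."
)
paren_pairs = extract_definitions(
    "A myocardial infarction (heart attack) occurs when blood flow stops."
)
```

Running the cheap marker patterns first and reserving the parser for the
remaining sentences mirrors the two-module ordering described above.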
We performed two kinds of evaluation. First, we built a test set of 93 terms
and their associated definitions, extracted from our corpus. For term
identification, our methods achieved 84% precision and 83% recall. Second, for
terms and definitions, we compared our results with two existing on-line
medical dictionaries, including the UMLS. The AMIA publication discusses these
comparisons, showing that existing on-line dictionaries appear to be
incomplete. For example, we get only a 60% exact match between our term set
and the UMLS; 24% of the terms are present but undefined; and 16% are absent
altogether. However, an in-depth analysis of the absent terms shows that the
corresponding 'narrower' or 'broader' terms might be present; e.g., valvoplasty
was identified by DEFINDER, whereas balloon valvoplasty exists in the UMLS.
Furthermore, the technical terms in resources such as UMLS tend to be defined
using other unexplained technical terms, which can be confusing for the
layperson.
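The two kinds of measurement reported above can be sketched as follows:
precision and recall of extracted terms against a gold test set, and a
breakdown of terms into defined, present-but-undefined, and absent with
respect to a dictionary. The function names and toy data are hypothetical,
and the dictionary is modelled as a plain mapping rather than the real UMLS
interface.

```python
from collections import Counter


def precision_recall(extracted, gold):
    """Precision and recall of extracted terms against a gold term set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def dictionary_coverage(terms, dictionary):
    """Count terms that are defined, present but undefined, or absent.

    Assumption: `dictionary` maps a term to its definition, or to None when
    the term is listed without a definition.
    """
    counts = Counter()
    for term in terms:
        if term not in dictionary:
            counts["absent"] += 1
        elif dictionary[term] is None:
            counts["undefined"] += 1
        else:
            counts["defined"] += 1
    return counts


# Toy data for illustration; these are not the evaluation figures.
gold_terms = {"angioplasty", "stent", "valvoplasty", "arrhythmia"}
system_terms = {"angioplasty", "stent", "valvoplasty", "heart"}
precision, recall = precision_recall(system_terms, gold_terms)

toy_dictionary = {"angioplasty": "a procedure to open arteries", "stent": None}
coverage = dictionary_coverage(
    ["angioplasty", "stent", "valvoplasty"], toy_dictionary
)
```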
As an immediate future step we will test our system on a larger corpus of
consumer-oriented medical text. We plan to explore new techniques (including
combining robust statistical methods for term identification with rule-based
methods) for identifying new relational information beyond definitions, as
well as methods for folding our results into existing resources. We plan to
extend the architecture of our system by adding a learning component for
information extraction and to use learning techniques for merging multiple
extracted definitions of the same medical term. In applications, we will
explore using our results to present information to patients in language they
can understand, by providing accurate and readable lay definitions for
technical terms, in conjunction with summarization. We
will also explore the application of our system as a diagnostic for text
categorization of unknown articles into levels such as lay or technical.
As part of the analysis component, we have recrafted many of our analysis and
term identification procedures to use XML, e.g., tokenization, POS tagging,
sentence boundary determination, NP determination, resolution of within-NP
coordination, and UMLS lookup. This has been an important step towards
integration, since different groups in the project need similar input. Our XML
tools make shared processing possible in a uniform, human-readable format.
Also, the XML pipeline works on different kinds of text (book chapter, patient
record, journal article, consumer health text) without any code change.
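As an illustration of how such an XML pipeline can stay independent of the
kind of text, the sketch below chains two stages (sentence boundary
determination and tokenization) that each read and annotate one shared XML
tree. The element names, attribute names and stage functions are
hypothetical, not the project's actual schema or tools, and the remaining
stages (POS tagging, NP determination, UMLS lookup) are omitted.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(doc):
    """Add a <sentences> element holding one <s> per sentence."""
    text = doc.findtext("text") or ""
    sentences = ET.SubElement(doc, "sentences")
    for index, sentence in enumerate(re.split(r"(?<=[.!?])\s+", text.strip())):
        if sentence:
            ET.SubElement(sentences, "s", id=str(index)).text = sentence
    return doc


def tokenize(doc):
    """Replace the text of each <s> with a sequence of <tok> children."""
    for sentence in list(doc.iter("s")):
        words = re.findall(r"\w+|[^\w\s]", sentence.text or "")
        sentence.text = None
        for word in words:
            ET.SubElement(sentence, "tok").text = word
    return doc


def run_pipeline(raw_text, doc_type):
    """Run every stage on the same tree; doc_type is only recorded as an
    attribute, so no stage needs to change for a new kind of text."""
    doc = ET.Element("document", type=doc_type)
    ET.SubElement(doc, "text").text = raw_text
    for stage in (split_sentences, tokenize):
        doc = stage(doc)
    return doc


# Illustrative input only.
doc = run_pipeline("The stent was placed. Recovery was fast.", "patient record")
xml_output = ET.tostring(doc, encoding="unicode")
```

Because each stage takes and returns the same tree, a later stage added by
another group only needs to agree on the element names it reads.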