Hong Yu

Postdoc, hongyu@cs.columbia.edu

Title: Unsupervised Approaches for Disambiguating Biomedical Abbreviations

Time: Thursday, October 24, 12:10pm - 1pm

Place: CS Conference Room in MUDD

Abstract:

Technical terms such as abbreviations and acronyms are widely used in technical domains. Many abbreviations are ambiguous: for example, CAT can denote either chloramphenicol acetyl transferase or computed axial tomography, depending on the context. Recognizing the full form associated with each abbreviation is therefore in most cases equivalent to identifying the meaning of the abbreviation. This, in turn, allows us to perform more accurate natural language processing and information retrieval.

I will present, as a part of my thesis work, our unsupervised machine-learning approaches to identifying the full forms of ambiguous abbreviations in the context where they appear. We focus on the biomedical domain as it provides significant amounts of text data for experimentation, as well as supplementary knowledge sources (e.g., MeSH terms) that can be used by automated systems.

We first developed a pattern-recognition system that maps abbreviations to full forms when the abbreviation and full form are linked together in the same sentence. We applied the system to eleven million MEDLINE records (1966-2001) and automatically obtained a dictionary of possible abbreviation-full form pairs. Having assigned multiple possible full forms to each abbreviation, we then treated the in-context full-form prediction for each specific abbreviation occurrence as a case of word-sense disambiguation. We applied machine-learning algorithms (naive Bayes and support vector machines) for disambiguation. The features we used for machine learning included words in the abstracts and semantic classes (MeSH terms) that were assigned to the abstracts. Because some of the links between abbreviations and their corresponding full forms are explicitly given in the text and can be recovered automatically, we can use those explicit links to provide training data for disambiguating the abbreviations that are not linked at all to a full form within a text. We evaluated our methods on 150 thousand MEDLINE abstracts and obtained an accuracy of 93% in disambiguating among an average of ten full forms for each abbreviation.
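The two steps described above can be illustrated with a small sketch. It is not the system presented in the talk: the parenthesis pattern, the initial-letter matching rule, the function names (find_full_form, train, predict), and the two example sentences are all assumptions made for illustration, and a plain word-count naive Bayes classifier stands in for the actual feature set (no MeSH terms are used).

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical pattern: an upper-case abbreviation inside parentheses.
PAREN_ABBR = re.compile(r"\(([A-Z]{2,10})\)")

def find_full_form(sentence):
    """Return (abbreviation, full_form) when the words right before
    '(ABBR)' start with the abbreviation's letters, else None."""
    match = PAREN_ABBR.search(sentence)
    if match is None:
        return None
    abbr = match.group(1)
    words = re.findall(r"[A-Za-z]+", sentence[:match.start()])
    candidate = words[-len(abbr):]
    if len(candidate) != len(abbr):
        return None
    initials = "".join(w[0].upper() for w in candidate)
    if initials != abbr:
        return None
    return abbr, " ".join(candidate).lower()

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per full form; examples are (full_form, context) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, text in examples:
        sense_counts[sense] += 1
        for word in tokenize(text):
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab

def predict(model, text):
    """Pick the full form with the highest naive Bayes log-probability,
    using add-one smoothing and ignoring words never seen in training."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

# Made-up sentences in which the abbreviation is explicitly linked
# to its full form; they serve as automatically labeled training data.
linked = [
    "The chloramphenicol acetyl transferase (CAT) reporter gene was "
    "expressed in transfected cells.",
    "Computed axial tomography (CAT) of the brain showed a lesion in "
    "the patient.",
]
training = [(find_full_form(s)[1], s) for s in linked]
model = train(training)

# Occurrences with no explicit link are labeled from context alone.
print(predict(model, "CAT activity was measured in transfected cells"))
print(predict(model, "CAT scan of the patient showed a brain lesion"))
```

In this toy setting the linked sentences play the role of automatically labeled training data, and the unlinked occurrences are assigned a full form from their context words. A realistic version would probably also remove the full form itself from the training context so that the classifier does not simply memorize it.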

I will also briefly present an application of abbreviation disambiguation: a system we built that marks up gene or protein terms in MEDLINE. Finally, I will describe another system we built that applies pattern recognition to automatically identify other synonymous gene or protein terms in biomedical text.

NLP Group Meetings