Natural Language Processing
Columbia University


For current and prospective students

Relevant courses

Fall 2013

Spring 2013

Ads for Student Research Projects in NLP

We are looking for undergraduate and master's students interested in the following research projects.

Fall 2010:

Fall 2008:

Summer 2008:



Spoken Dialogue and Dialogue System Evaluation (Fall 2010)

Description:

Loqui (http://www1.ccls.columbia.edu/~Loqui/), a collaborative project with the City University of New York, is funded by the National Science Foundation (NSF). We seek to build more flexible dialogue strategies for human-machine interaction. As the performance of automatic speech recognition (ASR) decreases, dialogue system performance often falls off sharply. Our project seeks dialogue strategies that are less dependent on accurate ASR, and that degrade gracefully. We have two project areas for undergraduates. One pertains to a recently developed manual annotation method that provides a better understanding of human-human dialogue. The other pertains to user questionnaires for human dialogue participants. Both serve as vehicles for improved evaluation of human-machine dialogue. A modest stipend is associated with the project.

Requirements:

Keen interest in language and how it is used, and/or good programming skills in any of C, Java, or Perl. We are looking for highly motivated, dependable students.

Useful skills:

Background in linguistics, mathematics, speech.

Suitable for:

undergraduates only

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



American National Corpus (Fall 2008)

Description:

Help build the American National Corpus (ANC) (http://americannationalcorpus.org), a research-oriented collection of written and spoken language from 1990 or later. The ANC includes many genres and sources, including email, blogs, fiction, newswire, travel guides, medical texts and so on. The ANC is used by computational linguists and linguists as a research resource, and is particularly important for training general-purpose tools such as parsers or word-sense disambiguators intended to handle many variants of English. The National Science Foundation (NSF) is funding a project called the Manually Annotated Sub Corpus of the ANC (MASC), which adds annotations, or tags, to the documents in the corpus that represent knowledge about language, such as the difference between nouns and verbs. We have received supplemental NSF funding under the Research Experience for Undergraduates (REU) program to mentor two undergraduates for the 2008-2009 academic year. The undergraduates we enlist will work on adding annotations to our corpus that disambiguate word senses. Words have multiple meanings, and linguists have constructed lexical resources that encode these meanings. Accurate results from automated language processing tools of many sorts depend on the ability to disambiguate word senses. Students on the project will have an opportunity to participate in the creation of a new and important layer of annotation in the ANC. They will learn how senses are represented in WordNet, FrameNet and other lexical resources. They will be trained in data collection and verification methods. A modest stipend is associated with the project.

Requirements:

Keen interest in language, meaning, and why the same words mean different things in different contexts. Detail-oriented. Excellent organization and time management skills.

Useful skills:

Background in linguistics, foreign languages, or related areas.

Suitable for:

Undergraduates only, any year.

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



The Loqui Project (Fall 2008)

Description:

The Loqui project (http://www1.ccls.columbia.edu/~Loqui/) involves building an automated dialog system to be used over the phone, meaning that human callers will speak with a computer that can handle limited types of human dialog. Loqui, a collaborative project with the City University of New York, is funded by the National Science Foundation (NSF). We have additional funding to mentor two undergraduates as part of the NSF Research Experience for Undergraduates (REU) program. We are looking for two undergraduate computer science majors. They will be introduced to state-of-the-art software and techniques in computational linguistics and computer science, and acquire ethical training in the use of human subjects via Institutional Review Board certification. They will learn how data about human dialog is collected in order to inform the design of a dialog system. They will learn how to enhance the language resources used by dialog systems, how to use human-human corpora and human-system corpora to implement and evaluate dialog systems, and how research demands care and imagination. A modest stipend is associated with the project.

Requirements:

Keen interest in language and how it is used. Good programming skills in any of C, Java, or Perl.

Useful skills:

Background in linguistics, foreign languages, or mathematics. Experience with both Linux and Windows platforms. Interest in telephony and VoIP.

Suitable for:

Undergraduates only, any year.

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



Grammar extraction from treebanks (Fall 2008)

Description:

The job is to write code that extracts various types of grammars from treebanks.
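As a rough illustration of the kind of code involved, the sketch below reads bracketed phrase-structure trees and counts the context-free productions they contain. It is only one possible starting point, not the project's method: the bracketing format (every node written as "(LABEL child ...)"), the function names, and the two-sentence toy treebank are all assumptions made for this example.

```python
from collections import Counter


def tokenize(bracketed):
    """Split a bracketed tree string into "(", ")" and symbol tokens."""
    return bracketed.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse tokens into nested (label, children) tuples; leaves are plain strings."""
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1                      # consume the node label
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after tree")
    return tree


def productions(tree):
    """Yield (lhs, rhs) pairs for every internal node, top-down."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from productions(c)


def extract_grammar(bracketed_trees):
    """Count how often each production occurs across a list of trees."""
    counts = Counter()
    for s in bracketed_trees:
        counts.update(productions(parse(tokenize(s))))
    return counts


# Hypothetical two-sentence "treebank", for illustration only.
TOY_TREEBANK = [
    "(S (NP (DT the) (NN dog)) (VP (VBD barked)))",
    "(S (NP (DT the) (NN cat)) (VP (VBD slept)))",
]
GRAMMAR = extract_grammar(TOY_TREEBANK)
```

Dividing each count by the total for its left-hand side would turn the counts into relative-frequency estimates for a probabilistic grammar; other grammar types would need different extraction logic.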

Requirements:

Good programming skills, and at least a minimal knowledge of syntax (or at least an interest in syntax).

Suitable for:

Junior, Senior, Graduate.

This could be for an advanced undergraduate or a beginning graduate student, but other types of candidates will also be considered. The job will not be a GRA-ship (i.e., it will not cover tuition), but will be paid by the hour, or on a part-time basis. Rate commensurate with relevant criteria.

Contact:

Owen Rambow (rambow [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



A Machine Learning Approach for Automatic Labeling of ECS Tickets (Fall 2008)

Description:

Our goal is to develop an automatic labeling method for events that we have been labeling using a rule-based procedure. The events come from a trouble ticket database for the secondary electrical distribution system of the Consolidated Edison Company of New York. The Emergency Control Systems (ECS) "tickets" database is a rich resource for data mining, containing approximately 1 million tickets from all boroughs. Each ECS "trouble ticket" is a report of an event affecting the New York City electrical distribution system as recorded by a Con Edison dispatcher. The "front" of each ticket contains a timestamp, the type of event (such as manhole fire or smoking manhole), and address and cross street information for where the event occurred, along with other pertinent information. The "back" of the ticket (called the ECS Remarks) contains a free-text description of Con Edison's response and the repairs made.

The larger learning task we face is to predict serious events based on data from several databases, including the ECS tickets. In prior work, we extracted features from the ECS Remarks for use in two aspects of learning: labeling the data, and extracting features for the learning model. Remarks features include external features, such as the length of the ticket, the trouble type of the ticket (assigned by Con Edison), and its date, and internal features based on the content of the ticket, such as what structures (manholes, service boxes) are mentioned, and how frequently.

Based on knowledge from subject matter experts (SMEs), we labeled tickets as belonging to one of two categories: Serious or Non-serious. In this project, we are interested in developing an automatic ticket labeler using machine learning techniques. In particular, we are interested in the following tasks:

  1. Refinement of features extracted from ECS Remarks, and possible addition of new features depending on status of concurrent work on spelling normalization.
  2. Development of classifiers (such as decision trees) for performing the labeling task.
  3. Extraction of classification rules from trees that may give a better insight on the criteria required for labeling a ticket as serious or non-serious.
  4. Testing of the rules extracted in step 3 by incorporating them into the models used for ranking structures (manholes and service boxes).

Requirements:

Knowledge of PostgreSQL, and proficiency in Java or Matlab recommended. A strong background in algorithms and machine learning is a plus. Both undergraduate and graduate students with relevant expertise are encouraged to apply.

Contact:

Rebecca Passonneau (becky [at_cs]) and Haimonti Dutta (haimonti [at_ccls])

Replace [at_cs] with "@cs.columbia.edu", and [at_ccls] with "@ccls.columbia.edu".



ARA -- Automated Readers Advisor (Summer 2008)

Description:

We are collaborating with the Andrew Heiskell Talking Book and Braille Library (part of the New York Public Library) and City University of New York on a project to design an automated dialogue system for the library's 15,000 patrons to handle simple library transactions. For the same reasons that qualify patrons to become library users, most cannot conveniently travel to the library and visually browse the collection, thus most transactions are currently handled by phone. In the summer of 2006, we collected just under 200 recorded calls made to a subset of human Readers Advisors (librarians and other staff), and have transcribed a large portion of them. We are currently implementing an initial baseline dialog system using the Olympus/Ravenclaw tools from Carnegie Mellon University. One or more student projects are available within this project that would involve analysis of the transcribed human-human dialogs, testing and enhancing our initial dialog system, or a combination of the two.

Requirements:

Some exposure to NLP; reliable, attentive to detail, and highly motivated; responds to challenges eagerly and creatively.

Useful skills:

Experience with annotating corpora, or using annotated corpora; familiarity with C++, Perl, psql/mysql or other databases.

Suitable for:

junior, senior, graduate

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



CLiMB -- Computational Linguistics for Metadata Building (Summer 2008)

Description:

Digital image collections are increasing in number and size at an enormous rate, including collections associated with museums, libraries (New York Public Library; Getty Library), or online collections like ARTstor. CLiMB is a collaborative project (with University of Maryland) to develop automatic methods for extracting metadata from scholarly texts, in order to index digital art collections with subject matter descriptions. The Columbia component involves classifying sentences from art history survey texts into semantic categories pertaining to their discourse function. Functional classes include describing the image, providing biographical background about the artist, interpreting the art historical significance of the work, and so on. We are working with an analog to an ARTstor image collection and two art history survey texts. We are investigating automated methods to assign semantic scores to words from extracted sentences based on their closeness to relevant semantic domains, such as color, anatomy, and so on. To compute semantic distance in these domains, we will compare electronically available ontologies and lexicons such as WordNet, and the Getty Art and Architecture Thesaurus. The project tasks will include developing subroutines to query these resources, developing evaluation suites to test the resulting scores, and integrating the scores into feature sets for machine learning.

Requirements:

Desirable experience/skills include familiarity with one or more NLP tools or resources for language analysis (taggers, parsers, WordNet); familiarity with the Weka datamining toolset; familiarity with Python.

Suitable for:

junior, senior, graduate

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



Processing Trouble Tickets; Con Edison Secondary Events (two projects) (Summer 2008)

Description:

The Secondary Events project at the Center for Computational Learning Systems (CCLS) works with data from Con Edison's secondary distribution network. The two learning tasks we address are to predict problematic events in the network before they happen, and to rank the vulnerability of structures in this network to such events. We have devoted considerable effort to assembling a consolidated database from disparate sources after cleaning, extending and joining data collected over the past ten years. The effort has paid off in initial success in our two learning problems for data from Manhattan, using small models.

We now turn to investigating whether we can derive a larger set of features from a free-text field of trouble ticket data. Two summer positions are available on this phase.

Project 1: Text Engineering Applied to Remarks Fields of Trouble Tickets

Relational databases often have free-text fields, but extracting meaningful semantic content from free text presents serious challenges. The trouble ticket remarks fields are especially challenging because the text is highly domain-specific, with many types of domain-specific expressions that are essentially Named Entities (proper nouns). As a result, existing Named Entity (NE) recognizers cannot be applied to this text. We are importing the remarks into GATE (General Architecture for Text Engineering) in order to develop a standoff annotation where we encode the domain-specific classes of NEs. GATE stores the annotations in a relational database to facilitate complex queries over the text. The student on this project will assist in porting our existing patterns for Information Extraction of structure types and numbers (ids for manholes, service boxes, vaults) and other domain-specific NEs. More importantly, the student will help develop our meta-language for representing the content of remarks tickets.
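To give a feel for pattern-based extraction of structure mentions, here is a minimal sketch using plain Python regular expressions rather than GATE's own pattern language. The abbreviations (MH, SB, V), the separator conventions, and the sample remark are invented for illustration; the actual ticket vocabulary and the project's existing patterns are not described above. Character offsets are kept with each match, in the spirit of standoff annotation.

```python
import re

# Hypothetical abbreviations; the real remarks vocabulary is not given in the text.
STRUCTURE_TYPES = {
    "MH": "manhole",
    "SB": "service box",
    "V": "vault",
}

# An abbreviation, optional separators (whitespace, "-" or "#"), then a run of digits.
STRUCTURE_RE = re.compile(
    r"\b(?P<abbr>MH|SB|V)[\s#-]*(?P<num>\d+)\b",
    re.IGNORECASE,
)


def extract_structures(remarks):
    """Return (structure type, id, (start, end)) for each mention in the text."""
    found = []
    for m in STRUCTURE_RE.finditer(remarks):
        kind = STRUCTURE_TYPES[m.group("abbr").upper()]
        found.append((kind, m.group("num"), m.span()))
    return found


# Invented remark text, for illustration only.
SAMPLE = "CREW REPORTS SMOKE FROM MH 1234, ALSO CHECKED SB-5678 AND v#42"
MENTIONS = extract_structures(SAMPLE)
```

Because each mention carries its start and end offsets, the original text is left untouched and the annotations can be stored and queried separately.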

Project 2: Text Normalization and Spelling Correction

The free-text remarks field presents serious challenges for feature derivation due to the high noise content, which is compounded by the highly domain-specific vocabulary. For example, the size of the unigram vocabulary (individual strings composed of alphabetic, numeric, punctuation or mixed characters) is approximately 75K, of which only 8K (~11%) are alphabetic sequences that match American English dictionary entries. The remaining "word" types consist of numeric or mixed-character strings, domain-specific words or abbreviations, or misspellings. The longer the word, the more misspellings it tends to have; for example, the word "barricade" and its other forms ("barricades", "barricaded") have approximately 50 variant spellings. The student on this project will assist in normalizing the vocabulary. This will involve a range of methods including: pattern matching within the GATE framework (see project #1), using the GATE pattern language; development of special-purpose edit-distance routines; testing and/or adaptation of existing algorithms such as Double Metaphone.
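As a simple illustration of the edit-distance idea, the sketch below computes standard Levenshtein distance and maps a token to the nearest entry in a small lexicon when it is within a distance threshold. This is a generic baseline, not the special-purpose routines the project has in mind; the function names, the threshold of 2, and the misspelled variants are assumptions invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(token, lexicon, max_dist=2):
    """Map token to the closest lexicon entry within max_dist, else return it lowercased."""
    token = token.lower()
    best, best_d = token, max_dist + 1
    for word in lexicon:
        d = edit_distance(token, word)
        if d < best_d:                # ties keep the earlier lexicon entry
            best, best_d = word, d
    return best


LEXICON = ["barricade", "barricades", "barricaded"]
VARIANTS = ["baricade", "barracade", "barrikade", "manhole"]  # invented examples
NORMALIZED = [normalize(v, LEXICON) for v in VARIANTS]
```

A plain threshold like this will confuse short words that happen to be close to each other, which is one reason domain-specific distance routines or phonetic keys may be preferable in practice.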

Requirements:

Reliable, attentive to detail, and highly motivated; responds to challenges eagerly and creatively.

Desirable skills (any mix):

Experience with regular expressions and Unix/Linux scripting, Java, relational databases (especially PostgreSQL), Python, some NLP.

Suitable for:

junior, senior, graduate

Contact:

Becky Passonneau (becky [at_cs])

Replace [at_cs] with "@cs.columbia.edu"



Grammar extraction from treebanks (Summer 2008)

Description:

The job is to write code that extracts various types of grammars from treebanks.

Requirements:

Good programming skills, and at least a minimal knowledge of syntax (or at least an interest in syntax).

Suitable for:

Junior, Senior, Graduate.

This could be for an advanced undergraduate or a beginning graduate student, but other types of candidates will also be considered. The job will not be a GRA-ship (i.e., it will not cover tuition), but will be paid by the hour, or on a part-time basis. Rate commensurate with relevant criteria.

Contact:

Owen Rambow (rambow [at_cs])

Replace [at_cs] with "@cs.columbia.edu"


webmaster - wm2174x[at]xcolumbia.edu last updated - 09.10.2013