Natural Language Processing
Columbia University



Tools on this page are available free of charge for educational, research, and in-house uses. For information on commercial use of any of these tools, please contact Columbia Technology Ventures by phone at (+1) 212-854-8444.

Narrative Summarization Corpus

Developed by Jessica Ouyang, Serina Chang, and Kathleen McKeown
Described in Crowd-Sourced Iterative Annotation for Narrative Summarization Corpora. Personal narratives with aligned extractive and abstractive summaries. Available under MIT License.

Gendered Corpus

Developed by Serina Chang and Kathleen McKeown
Described in Automatically Inferring Gender Associations from Language. Online articles written about celebrities and online reviews written by students about professors. Labeled for gender.

Opinionated Claims Corpus

Developed by Sara Rosenthal and Kathleen McKeown
Described in Detecting Opinionated Claims in Online Discussions. Text from Wikipedia and LiveJournal, with sentence-level annotations of opinionated claims and phrase-based sentiment.

Wikipedia Talk Pages Agreement Corpus

Developed by Sara Rosenthal, Jacob Andreas, and Kathleen McKeown
Agreement annotations at the sentence level in Wikipedia Talk Pages. Described in Annotating Agreement and Disagreement in Threaded Discussion and I Couldn't Agree More: The Role of Conversational Structure in Agreement and Disagreement Detection in Online Discussions.

Create Debate Agreement Corpus

Developed by Sara Rosenthal and Kathleen McKeown
Agreement annotations at the post level in Create Debate. Described in I Couldn't Agree More: The Role of Conversational Structure in Agreement and Disagreement Detection in Online Discussions.

Sentence Fusion Corpus

Developed by Kathleen McKeown, Sara Rosenthal, Kapil Thadani and Coleman Moore
Described in Time-Efficient Creation of an Accurate Sentence Fusion Corpus.

Text-to-text generation

Developed by Kapil Thadani and Kathleen McKeown
Text-to-text generation software for learning compression and fusion models.

Quoted Speech Attribution Corpus

Developed by David K. Elson
This corpus collects over 3,000 instances of quoted speech from 6 works of 19th and 20th century literature, along with annotations for the speaker (if any) of each quote among the character names and nominals present in the text. Related publication: Elson and McKeown, Automatic Attribution of Quoted Speech in Literary Narrative, AAAI 2010. This material is based on research supported in part by the U.S. National Science Foundation (NSF) under IIS-0935360. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.


Developed by Nizar Habash and Owen Rambow
A full morphological tagger for Modern Standard Arabic


Developed by Michel Galley
A domain-independent discourse segmenter based on lexical cohesion.


Developed by Michel Galley
A tool to find semantically related words within unrestricted texts.


A tool for identifying and relating noun phrases within a document.


Centrifuser

Developed by Min-Yen Kan
Centrifuser is a domain- and genre-specific multidocument summarization system. It builds both extract-based summaries and indicative document-cluster summaries. The extract summary gives a high-level overview of the query topic suitable for browsers. The indicative document-cluster summaries differentiate the documents from each other as much as possible, routing users to the particular documents that can meet their underspecified information needs. Centrifuser was developed as part of the NSF's DLI 2 initiative and focuses on patient health care documents.

Annotated Bibliography Corpus

Developed by Min-Yen Kan
We have collected 2,000 annotated bibliography entries from the web and put them into a standardized XML format. We have further annotated 100 of these entries with semantic tags that describe the types of document-derived and metadata features that play a role in these summaries. Annotated bibliography entries are a good source for research on corpus-based summarization, as they provide information about what to include and how to write and stylize indicative summaries.


FUF

Developed by Michael Elhadad
FUF stands for Functional Unification Formalism.


CFUF

Developed by Michael Elhadad and Mark Kharitonov
CFUF is a graph-based implementation of the FUF language, written in C and embedded within a Scheme interpreter.


Surge

Developed by Michael Elhadad and Jacques Robin
Surge is a syntactic realization grammar for text generation.


CREP

Developed by Duford
CREP is a regular expression finder for linguistic patterns.


Segmenter

Developed by Min-Yen Kan
Segmenter is a text segmentation program.


Verber

Developed by Min-Yen Kan, Judith Klavans, and Kathleen McKeown
Verber is designed to conflate semantically related verbs.

webmaster - fl2301x[at] last updated - 12.23.2022