New York Times documents with extractive and compressive summaries
------------------------------------------------------------------

This resource contains three lists of document IDs in the New York Times
annotated corpus (LDC2008T19). The summaries of each of these documents,
i.e., the 'online_lead_paragraph' portion of the xml files, are extractive
or near-extractive:

- ex_sents.txt: 38,921 fully extractive instances in which each summary
  sentence is drawn whole from the article

- nx_spans.txt: 15,646 near-extractive instances where one or more summary
  sentences form a contiguous span of tokens within an article sentence,
  and the remaining fit the definition above

- nx_subseqs.txt: 25,381 near-extractive instances where one or more
  summary sentences form a non-contiguous token subsequence within an
  article sentence, and the remaining fit either of the definitions above

We recommend the following configuration for experiments using these
resources:

- Training: all articles from 2001-2004
- Development: articles from 2005
- Testing: articles published in 2006-2007


Related resources
-----------------

1) The New York Times Annotated Corpus contains the full text and metadata
   of NYT articles from 1987 to 2007 and can be accessed at

2) Code to extract these datasets from this corpus is available at

3) The data-cleaning procedure for this data is described in the paper:

   Junyi Jessy Li, Kapil Thadani and Amanda Stent.
   The Role of Discourse Units in Near-Extractive Summarization.
   Proceedings of the 17th Annual Meeting of the Special Interest Group
   on Discourse and Dialogue (SIGDIAL) 2016.


Citation
--------

@InProceedings{li-thadani-stent-edusumm16,
  author    = {Li, Junyi Jessy and Thadani, Kapil and Stent, Amanda},
  title     = {The Role of Discourse Units in Near-Extractive Summarization},
  booktitle = {Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)},
  year      = {2016},
}
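

Example: checking the three list definitions
--------------------------------------------

The definitions of ex_sents.txt, nx_spans.txt and nx_subseqs.txt given
earlier can be illustrated with a minimal Python sketch. It is not the
extraction code referred to under related resources: it assumes plain
whitespace tokenization and exact token matching, and all function names
and return labels below are hypothetical. The actual procedure may
tokenize and normalize text differently.

```python
def is_contiguous_span(summary_tokens, article_tokens):
    """True if summary_tokens occurs as an unbroken run inside article_tokens."""
    n, m = len(summary_tokens), len(article_tokens)
    if n == 0 or n > m:
        return False
    return any(article_tokens[i:i + n] == summary_tokens for i in range(m - n + 1))


def is_subsequence(summary_tokens, article_tokens):
    """True if summary_tokens appears in order, gaps allowed, inside article_tokens."""
    if not summary_tokens:
        return False
    remaining = iter(article_tokens)
    # Each membership test consumes the iterator up to the match,
    # so token order is enforced.
    return all(tok in remaining for tok in summary_tokens)


def classify_summary_sentence(summary_sentence, article_sentences):
    """Label one summary sentence: 'whole', 'span', 'subsequence' or 'none'."""
    summary_tokens = summary_sentence.split()  # assumption: whitespace tokens
    if not summary_tokens:
        return "none"
    tokenized = [s.split() for s in article_sentences]
    if any(summary_tokens == toks for toks in tokenized):
        return "whole"
    if any(is_contiguous_span(summary_tokens, toks) for toks in tokenized):
        return "span"
    if any(is_subsequence(summary_tokens, toks) for toks in tokenized):
        return "subsequence"
    return "none"


def classify_document(summary_sentences, article_sentences):
    """Return the list a document would fall in, or None if it fits none."""
    labels = {classify_summary_sentence(s, article_sentences)
              for s in summary_sentences}
    if not labels or "none" in labels:
        return None
    if "subsequence" in labels:
        return "nx_subseqs"  # rest may be whole sentences or spans
    if "span" in labels:
        return "nx_spans"    # rest must be whole sentences
    return "ex_sents"        # every summary sentence drawn whole
```

The checks run from strictest to loosest, so a summary sentence that is
drawn whole from the article is never counted as a span or a subsequence.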
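

Example: applying the recommended split
---------------------------------------

The recommended training, development and testing configuration can be
sketched as a small Python helper. It assumes the publication year of each
document is already known, for instance from the corpus metadata. The
mapping from document ID to year, the function names and the split labels
are all hypothetical and are not part of this resource.

```python
def split_for_year(year):
    """Map a publication year to the recommended split, or None if unused."""
    if 2001 <= year <= 2004:
        return "train"
    if year == 2005:
        return "dev"
    if 2006 <= year <= 2007:
        return "test"
    return None  # years outside 2001-2007 are not part of the configuration


def partition_by_year(doc_years):
    """Group document IDs by split.

    doc_years: dict mapping a document ID to its publication year
    (hypothetical input; the ID lists themselves may not carry the year).
    """
    splits = {"train": [], "dev": [], "test": []}
    for doc_id, year in doc_years.items():
        name = split_for_year(year)
        if name is not None:
            splits[name].append(doc_id)
    return splits
```

Documents published before 2001 are left out of all three splits, since
the recommended configuration does not assign them.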