New York Times documents with extractive and compressive summaries
------------------------------------------------------------------

This resource contains three lists of document IDs in the New York Times
annotated corpus (LDC2008T19). The nytimes.com summaries of each of these
documents, i.e., the 'online_lead_paragraph' portion of the XML files, are
extractive or near-extractive:

- ex_sents.txt: 38,921 fully extractive instances in which each summary
  sentence is drawn whole from the article

- nx_spans.txt: 15,646 near-extractive instances where one or more summary
  sentences form a contiguous span of tokens within an article sentence,
  and the remaining sentences fit the definition above

- nx_subseqs.txt: 25,381 near-extractive instances where one or more summary
  sentences form a non-contiguous token subsequence within an article
  sentence, and the remaining sentences fit either of the definitions above

We recommend the following configuration for experiments using these
resources:

- Training: articles published in 2001-2004
- Development: articles published in 2005
- Testing: articles published in 2006-2007


Related resources
-----------------

1) The New York Times Annotated Corpus contains the full text and metadata
   of NYT articles from 1987 to 2007 and can be accessed at
   https://catalog.ldc.upenn.edu/LDC2008T19

2) Code to extract these datasets from this corpus is available at
   https://github.com/grimpil/nyt-summ

3) The data-cleaning procedure for this data is described in the paper:
   Junyi Jessy Li, Kapil Thadani and Amanda Stent.
   The Role of Discourse Units in Near-Extractive Summarization.
   Proceedings of the 17th Annual Meeting of the Special Interest Group on
   Discourse and Dialogue (SIGDIAL) 2016.

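
Example: assigning documents to the recommended splits
------------------------------------------------------

The recommended training/development/testing configuration can be sketched
in Python as below. This is a minimal illustration, not the code from the
nyt-summ repository. It rests on two assumptions that are not stated in this
README: that each ID list holds one document ID per line, and that each
extracted corpus file sits under a directory named for its four-digit
publication year, with a filename of the form <document ID>.xml. All function
names here are hypothetical.

```python
from pathlib import PurePosixPath


def read_doc_ids(lines):
    """Collect document IDs from an iterable of lines (assumed one ID per line)."""
    return {line.strip() for line in lines if line.strip()}


def split_for_year(year):
    """Map a publication year to the recommended split, or None if outside it."""
    if 2001 <= year <= 2004:
        return "train"
    if year == 2005:
        return "dev"
    if 2006 <= year <= 2007:
        return "test"
    return None


def year_from_path(path):
    """Return the first path component that looks like a corpus year (1987-2007).

    Assumption: the corpus was extracted so that the year is a directory name.
    """
    for part in PurePosixPath(path).parts:
        if len(part) == 4 and part.isdigit() and 1987 <= int(part) <= 2007:
            return int(part)
    return None


def assign_splits(paths, wanted_ids):
    """Group the IDs of listed documents by split, based on each file's year."""
    splits = {"train": [], "dev": [], "test": []}
    for path in paths:
        doc_id = PurePosixPath(path).stem  # assumed: filename is <doc ID>.xml
        if doc_id not in wanted_ids:
            continue
        year = year_from_path(path)
        split = split_for_year(year) if year is not None else None
        if split is not None:
            splits[split].append(doc_id)
    return splits
```

To use it, pass an open ID list (for example ex_sents.txt) to read_doc_ids,
then call assign_splits with the paths of the extracted corpus files. Listed
documents whose year falls outside 2001-2007 are left out of all three splits.
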

Citation
--------

@InProceedings{li-thadani-stent-edusumm16,
  author    = {Li, Junyi Jessy and Thadani, Kapil and Stent, Amanda},
  title     = {The Role of Discourse Units in Near-Extractive Summarization},
  booktitle = {Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)},
  year      = {2016},
}