The file turkfusion.csv contains human-generated fusion annotations over 297 pairs of related sentences. The sentences are derived from newswire clusters. The annotation process is described in the paper:

Kathleen McKeown, Sara Rosenthal, Kapil Thadani and Coleman Moore.
"Time-Efficient Creation of an Accurate Sentence Fusion Corpus".
In Proceedings of the 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), June 2010, Los Angeles, California.

Please cite this work if you use the corpus in your research.


The CSV file contains the following columns:
1. Case: The unique ID for each sentence pair (1-297)
2. Type: The type of sentence in that row (S: original sentence, I: intersection fusion, U: union fusion)
3. Id: A sub-index that distinguishes rows sharing the same case and type
4. Sentence: The text of the sentence
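
As an illustration of the column layout above, the following Python sketch groups rows by case and then by type. It is not part of the corpus distribution. The function name load_fusions and the has_header flag are hypothetical, and the sketch assumes the file's first row holds column names; pass has_header=False if it does not.

```python
import csv
from collections import defaultdict


def load_fusions(lines, has_header=True):
    """Group corpus rows by case ID, then by type (S, I or U).

    `lines` is any iterable of CSV text lines, such as an open file.
    Returns {case: {type: [(sub_id, sentence), ...]}}.
    """
    reader = csv.reader(lines)
    if has_header:
        # Assumption: the first row is a header (Case, Type, Id, Sentence).
        next(reader, None)
    cases = defaultdict(lambda: defaultdict(list))
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed rows
        case, kind, sub_id, sentence = row[:4]
        # The sub-index is kept as a string, since its exact format is not assumed.
        cases[int(case)][kind.strip()].append((sub_id.strip(), sentence))
    return cases
```

To use it on the corpus, open turkfusion.csv with newline="" (as the csv module recommends) and pass the file object to load_fusions. Then cases[1]["S"] holds the original sentences for case 1, and cases[1]["I"] and cases[1]["U"] hold its intersection and union fusions.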


Contact Kathy McKeown (kathy@cs.columbia.edu) or Kapil Thadani (kapil@cs.columbia.edu) for comments, clarifications or requests.