This file describes an enhanced version of the Edinburgh corpus used in the joint phrasal and dependency ILP aligner by Kapil Thadani, Scott Martin and Michael White. Please cite as:

Kapil Thadani, Scott Martin and Michael White. A Joint Phrasal and Dependency Model for Paraphrase Alignment. In Proceedings of COLING 24, 2012.

The original Edinburgh corpus is described in:

Trevor Cohn, Chris Callison-Burch and Mirella Lapata. Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics 34(3):597--614, 2008. doi:10.1162/coli.08-003-R1-07-044.

This version of the corpus has hand-corrected tokenization, truecasing and quote normalization, along with a train/test split that may prove useful to other researchers. Note, however, that alignment results with this version of the corpus will not be directly comparable to the ones in the above COLING paper: further hand corrections have been made since publication, and the COLING paper made use of collapsed named entities, whereas in this version of the corpus named entities can span multiple tokens, as in the original corpus.

FORMAT
------

The corpus consists of two files, one representing the training set and the other the testing set, with paraphrases selected as described in the paper. Both files are in the JavaScript Object Notation (JSON) format.

The parsers used are listed first, each given an identifier and a human-friendly name. Currently the corpus includes output from two parsers: the Stanford dependency parser and the OpenCCG parser.

The paraphrase numbers are maintained from the original corpus, as are the subcorpus name (e.g., "mtc") and partition (e.g., "common"). The new field "id" combines all three of these fields into a single, globally unique identifier for each paraphrase instance (for example, "mtc-common:2127"). The "string" field contains the retokenized version of the original corpus string, with truecasing and quote normalization applied.
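As a minimal sketch of how the "id" field relates to the other identifying fields, the snippet below builds the identifier from an illustrative record. The field names here follow the description above, but the exact nesting in the distributed JSON files may differ:

```python
# An illustrative record; field names follow the README description,
# but the precise layout in the corpus files may differ.
record = {
    "subcorpus": "mtc",
    "partition": "common",
    "number": 2127,
}

# "id" combines subcorpus, partition and paraphrase number into a
# single, globally unique identifier, e.g. "mtc-common:2127".
record["id"] = "%s-%s:%d" % (
    record["subcorpus"], record["partition"], record["number"])
print(record["id"])  # mtc-common:2127
```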
The "train" field indicates which of the two annotators ("A" or "C") was randomly selected to serve as the gold annotator for doubly annotated paraphrase pairs.

The two paraphrases are then annotated as follows, with "S" the first sentence and "T" the second. There is a list of "dependencies" for each parser used, broken up into nodes and edges. Each node is matched up with a word from the paraphrase sentence by its (0-based) "index" field. Then there is a list of edges, each with a unique identifier like "T3-7", which is interpreted to signify that this is the 7th edge that parser 3 found for sentence "T". An edge lists its "source" and "target" nodes by string index, then gives the dependency "label" the parser found.

In cases where the source text consists of multiple sentences, the input was first split into its component sentences before parsing. The parse included in the corpus is then a single, multiply-rooted parse whose node indices line up with the indices in the entire multi-sentence string. This way, each index used in the parse is unique and lines up with the original alignments.

The alignments for the paraphrase are then listed under "annotations" for either one or both human annotators ("A" and "C"), if available, along with baseline automatic alignments generated by Meteor. The 0-based alignments are partitioned into sure ("S") and possible ("P"), and formatted as e.g. [3, [2, 5]], which indicates that word 3 in sentence "S" is aligned to words 2 and 5 in sentence "T".

Finally, named entities from the Stanford NER tool are annotated by index in the "ner" field, so that an entry like { "start": 13, "end": 14, "type": "LOCATION" } is interpreted to mean that a named entity of type LOCATION was found spanning words 13 and 14.
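The edge-id, alignment, and NER conventions above can be sketched with a few small helper functions. These helpers are illustrative (they are not part of the corpus distribution), and they assume the formats exactly as described: edge ids like "T3-7", alignment entries like [3, [2, 5]], and inclusive 0-based NER spans.

```python
def parse_edge_id(edge_id):
    """Split an edge id like "T3-7" into (sentence, parser, edge number)."""
    sentence = edge_id[0]                      # "S" or "T"
    parser, number = edge_id[1:].split("-")    # parser id, running edge count
    return sentence, int(parser), int(number)

def expand_alignment(entry):
    """Expand one alignment entry like [3, [2, 5]] into (S, T) index pairs."""
    s_index, t_indices = entry
    return [(s_index, t) for t in t_indices]

def ner_tokens(words, ner_entry):
    """Return the tokens covered by an NER entry; "start" and "end" are
    inclusive 0-based token indices, per the description above."""
    return words[ner_entry["start"]:ner_entry["end"] + 1]
```

For example, `parse_edge_id("T3-7")` yields `("T", 3, 7)`, and `expand_alignment([3, [2, 5]])` yields the word pairs `[(3, 2), (3, 5)]`.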
SAMPLE PYTHON CODE
------------------

A Python script named print_gold.py is included with the corpus. It prints the gold alignments in a human-readable form, demonstrating how to read the alignments from the JSON files. To run the program, enter either

$ ./print_gold.py

or

$ python print_gold.py

at the command line.

ACKNOWLEDGMENT
--------------

This work was supported in part by the Air Force Research Laboratory under a subcontract to FA8750-09-C-0179. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.