This file describes an enhanced version of the Edinburgh corpus used in the joint phrasal and dependency ILP aligner by Kapil Thadani, Scott Martin and Michael White. Please cite as:

Kapil Thadani, Scott Martin and Michael White. A Joint Phrasal and Dependency Model for Paraphrase Alignment. In Proceedings of COLING 24, 2012.

The original Edinburgh corpus is described in:

Trevor Cohn, Chris Callison-Burch and Mirella Lapata. Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics 34(3):597--614, 2008. doi:10.1162/coli.08-003-R1-07-044.

This version of the corpus has hand-corrected tokenization, truecasing and quote normalization, along with a train/test split that may prove useful to other researchers. Note, however, that alignment results with this version of the corpus will not be directly comparable to the ones in the above COLING paper: further hand corrections have been made since publication, and the COLING paper made use of collapsed named entities, whereas in this version of the corpus named entities can span multiple tokens, as in the original corpus.

FORMAT
------

The corpus consists of two files, one representing the training set and the other the testing set, with paraphrases selected as described in the paper. Both files are in the JavaScript Object Notation (JSON) format.

The parsers used are listed first, each given an identifier and a human-friendly name. Currently the corpus includes output from two parsers: the Stanford dependency parser and the OpenCCG parser.

The paraphrase numbers are maintained from the original corpus, as are the subcorpus name (e.g., "mtc") and partition (e.g., "common"). The new field "id" combines all three of these fields into a single, globally unique identifier for each paraphrase instance (for example, "mtc-common:2127"). The "string" field contains the retokenized version of the original corpus string, with truecasing and quote normalization applied.
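As a minimal sketch of how the "id" field relates to the other identifying fields, the snippet below builds the identifier from an illustrative record. The field names here follow the description above, but the exact nesting in the distributed JSON files may differ:

```python
# An illustrative record; field names follow the README description,
# but the precise layout in the corpus files may differ.
record = {
    "subcorpus": "mtc",
    "partition": "common",
    "number": 2127,
}

# "id" combines subcorpus, partition and paraphrase number into a
# single, globally unique identifier, e.g. "mtc-common:2127".
record["id"] = "%s-%s:%d" % (
    record["subcorpus"], record["partition"], record["number"])
print(record["id"])  # mtc-common:2127
```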
The "train" field indicates which of the two annotators ("A" or "C") was randomly selected to serve as the gold annotator for doubly annotated paraphrase pairs.

The two paraphrases are then annotated as follows, with "S" the first sentence and "T" the second. There is a list of "dependencies" for each parser used, broken up into nodes and edges. Each node is matched up with a word from the paraphrase sentence by its (0-based) "index" field. Then there is a list of edges, each with a unique identifier like "T3-7", which is interpreted to signify that this is the 7th edge that parser 3 found for sentence "T". An edge lists its "source" and "target" nodes by string index, then gives the dependency "label" the parser found.

In cases where the source text consists of multiple sentences, the input was first split into its component sentences before parsing. The parse included in the corpus is then a single, multiply-rooted parse whose node indices line up with the indices in the entire multi-sentence string. This way, each index used in the parse is unique and lines up with the original alignments.

The alignments for the paraphrase are then listed under "annotations" for either one or both human annotators ("A" and "C"), if available, along with baseline automatic alignments generated by Meteor. The 0-based alignments are partitioned into sure ("S") and possible ("P"), and formatted as e.g. [3, [2, 5]], which indicates that word 3 in sentence "S" is aligned to words 2 and 5 in sentence "T".

Finally, named entities from the Stanford NER tool are annotated by index in the "ner" field, so that an entry like { "start": 13, "end": 14, "type": "LOCATION" } is interpreted to mean that a named entity of type LOCATION was found spanning words 13 and 14.
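The edge-id, alignment, and NER conventions above can be sketched with a few small helper functions. These helpers are illustrative (they are not part of the corpus distribution), and they assume the formats exactly as described: edge ids like "T3-7", alignment entries like [3, [2, 5]], and inclusive 0-based NER spans.

```python
def parse_edge_id(edge_id):
    """Split an edge id like "T3-7" into (sentence, parser, edge number)."""
    sentence = edge_id[0]                      # "S" or "T"
    parser, number = edge_id[1:].split("-")    # parser id, running edge count
    return sentence, int(parser), int(number)

def expand_alignment(entry):
    """Expand one alignment entry like [3, [2, 5]] into (S, T) index pairs."""
    s_index, t_indices = entry
    return [(s_index, t) for t in t_indices]

def ner_tokens(words, ner_entry):
    """Return the tokens covered by an NER entry; "start" and "end" are
    inclusive 0-based token indices, per the description above."""
    return words[ner_entry["start"]:ner_entry["end"] + 1]
```

For example, `parse_edge_id("T3-7")` yields `("T", 3, 7)`, and `expand_alignment([3, [2, 5]])` yields the word pairs `[(3, 2), (3, 5)]`.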
SAMPLE PYTHON CODE
------------------

A Python script named print_gold.py is included with the corpus. It prints the gold alignments in a human-readable form, demonstrating how to read the alignments from the JSON files. To run the program, enter either

$ ./print_gold.py

or

$ python print_gold.py

at the command line.

ACKNOWLEDGMENT
--------------

This work was supported in part by the Air Force Research Laboratory under a subcontract to FA8750-09-C-0179. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.