PP Attachment Task Dataset
=========================================

The following package contains the dataset used in our paper "Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment".

The dataset contains:
1) Template for the PP attachment task presented to MTurk workers.
2) Data files (CSV) used during our experiment to populate the template.
3) Answer file: expert-annotated answers.

The data files contain sentences extracted from LiveJournal (http://www.livejournal.com).

NOTE: Researchers using this dataset for running experiments should include the following citation:

Mukund Jha, Jacob Andreas, Kapil Thadani, Sara Rosenthal, Kathleen McKeown, "Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment", Proceedings of the Workshop on Creating Speech and Language Data With Amazon's Mechanical Turk at NAACL 2010.

PP Attachment Task:
===================
The task presented workers with a sentence on a colored background, with a prepositional phrase marked in red, followed by a list of potential attachment options. Workers were asked to pick the correct attachment from the options provided. For cases where they felt that no correct answer was among the options, or that the marked prepositional phrase was not correct, two additional check boxes were presented, "Correct answer is not present in the above choices" and "Prepositional phrase is not correct", along with text boxes for entering the correct answer and prepositional phrase. In all cases, however, workers were still required to pick the choice closest to the correct answer.

Template: (final_template.html)
===============================
To make it easier for workers to locate the attachment point in the given sentence, hovering over any option highlights it in the sentence. Also, since the examples provided for the task and the two additional options take up a lot of space, they are hidden by default, with a link to view the examples and additional options at any point.

The template contains three variables which are populated using the data files:
- ${PlainSentence} = Sentence containing the prepositional phrase
- ${PP} = Prepositional phrase
- ${Options} = Set of options provided to the workers

Data Files: (part1.csv, part2.csv, part3.csv, part4.csv)
========================================================
These files contain the data used to populate the template. The files are in CSV format. Each row represents a Human Intelligence Task (HIT) and has the following attributes:
- PlainSentence = The sentence containing the prepositional phrase
- PP = The prepositional phrase under consideration
- Examples = Set of examples used (now redundant, as the examples in the final version are hard-coded in the template)
- Options = Set of options in HTML format

Each HIT has a hidden input "ppNum" which corresponds to the prepositional phrase ID. Each option for a particular sentence also has a value that corresponds to a unique chunk ID for that sentence, which is used to identify the attachment in the sentence. The data is divided into four CSV files and contains 1018 sentences in total. (An illustrative loading sketch appears at the end of this README.)

Answers:
========
The answer file contains the expert-annotated answers for each of the 941 questions used in the experiment. Each entry lists a prepositional phrase ID and the correct option, separated by a comma (-1 indicates that no correct attachment was present in the provided list of options).

Contact:
=========
Please contact Mukund Jha (mukundjha@gmail.com) or Jacob Andreas (jda2129@columbia.edu) for further details, comments or clarifications.
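
Example Usage (illustrative):
=============================
The snippet below is a minimal Python sketch of how the files described above might be loaded. It assumes the data CSVs carry a header row with the column names listed in the Data Files section and that the answer file is a plain comma-separated text file with one "ppNum,correctOption" pair per line; check the actual filenames and header conventions in the package before relying on it.

    import csv

    def load_hits(paths=("part1.csv", "part2.csv", "part3.csv", "part4.csv")):
        """Read the HIT rows from the data files (assumes a header row)."""
        hits = []
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    hits.append({
                        "sentence": row["PlainSentence"],  # sentence containing the PP
                        "pp": row["PP"],                   # prepositional phrase
                        "options": row["Options"],         # attachment options (HTML)
                    })
        return hits

    def load_answers(path):
        """Read expert answers, one "ppNum,correctOption" pair per line.

        A value of -1 means no correct attachment was among the options.
        """
        answers = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    pp_num, option = line.split(",")
                    answers[pp_num.strip()] = int(option)
        return answers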