=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Influencer Dataset
By: Sara Rosenthal

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

The Influencer Dataset consists of annotations divided into train, dev, and test sets, plus the corresponding corpus, which contains documents from 5 sources: Wikipedia, LiveJournal, Create Debate, Political Forum, and Twitter. All documents are stored in a single XML format (taken from LiveJournal).

--------------

Folders:

train: *_influence_train.csv
dev: wikipedia_influence_dev.csv (note: only Wikipedia was used as a dev set)
test: *_influence_test.csv
corpus: xml files
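
The folder layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the dataset: it assumes each split's CSV files sit directly inside a folder named after the split, and the function name annotation_files and the root argument are hypothetical.

```python
from pathlib import Path


def annotation_files(root, split):
    """Return the sorted annotation CSVs for one split: 'train', 'dev', or 'test'.

    Assumes the layout <root>/<split>/*_influence_<split>.csv described above.
    """
    if split not in ("train", "dev", "test"):
        raise ValueError(f"unknown split: {split!r}")
    split_dir = Path(root) / split
    return sorted(split_dir.glob(f"*_influence_{split}.csv"))
```

For example, annotation_files(".", "dev") should return only the Wikipedia file, since no other source has a dev set.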

--------------

Annotation Format:

CSV file: Corpus, Filename, Influencer, Annotator Confidence, Annotator Id 

Corpus: Wikipedia, LiveJournal, Create Debate, Political Forum, Twitter 
Filename: name of document in corpus folder
Influencer: The username of one of the participants in the file. If there are no influencers, the value is "None". If a file has more than one influencer, each is listed on its own line
Annotator Confidence: High, Medium, or Low. This is the confidence of the annotator in their annotation
Annotator Id: 1-9, to indicate anonymously which annotator performed the annotation
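
Because a file with several influencers spans several lines, it is often convenient to group the rows by document. The following is a minimal sketch of one way to do that in Python, not an official loader. It assumes the columns appear in the order listed above, and it skips a header row if one is present (whether the files have one is an assumption). The names read_annotations and influencers_by_file are hypothetical.

```python
import csv
from collections import defaultdict

# Column order as listed in this README (assumed to match the files).
FIELDS = ["Corpus", "Filename", "Influencer", "Annotator Confidence", "Annotator Id"]


def read_annotations(lines):
    """Parse annotation rows from an iterable of CSV lines, e.g. an open file."""
    rows = []
    for record in csv.reader(lines):
        record = [cell.strip() for cell in record]
        if not any(record):
            continue  # skip blank lines
        if [cell.lower() for cell in record[:2]] == ["corpus", "filename"]:
            continue  # skip a header row, if the file has one (assumption)
        rows.append(dict(zip(FIELDS, record)))
    return rows


def influencers_by_file(rows):
    """Map (corpus, filename) to the set of usernames annotated as influencers.

    Files whose only annotation is "None" map to an empty set.
    Annotations from different annotators on the same file are merged.
    """
    grouped = defaultdict(set)
    for row in rows:
        key = (row["Corpus"], row["Filename"])
        grouped[key]  # create the entry even when there is no influencer
        if row["Influencer"] != "None":
            grouped[key].add(row["Influencer"])
    return dict(grouped)
```

A typical call would open one annotation file with open(path, newline="") and pass the file object to read_annotations, then pass the result to influencers_by_file. Merging across annotators is a simplification; keep the Annotator Id and Annotator Confidence fields if per-annotator judgments matter.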

--------------

For citations and more information regarding the dataset see the following publications:

Sara Rosenthal and Kathleen McKeown
"Detecting Influencers in Multiple Online Genres"
In ACM Transactions on Internet Technology, May 2017. Issue 17:2.
See free link at: http://www.cs.columbia.edu/~sara/publications.php

Sara Rosenthal, Doctoral Thesis
"Detecting Influencers in Social Media Discussions"
Columbia University, July 2015.
http://www.cs.columbia.edu/~sara/publications/thesis-detecting_influencers.pdf

--------------

Contact:

Sara Rosenthal
srosenthal@us.ibm.com
http://www.cs.columbia.edu/~sara/
