Please email me if you have any questions about installing or using the code and data, or if you would like more details.

1. Code and Data in Linking-Tweets-to-News paper


2. Weighted Matrix Factorization (WMF) / Orthogonal Matrix Factorization (OrMF)

This is the C++ and MATLAB implementation of the WMF algorithm described in my ACL 2012 paper, Modeling Sentences in the Latent Space. I recently added the orthogonal projection idea from my COLING 2014 paper, Fast Tweet Retrieval with Compact Binary Codes; the resulting model is OrMF.
WMF/OrMF is a dimensionality reduction model that extracts nuanced and robust latent vectors for short texts and sentences, such as tweets, SMS messages, and short forum posts or comments. To overcome the sparsity problem in short texts (e.g. 10 words on average), we explicitly model the missing words, a feature that LSA and LDA typically overlook.
The properties of the model are:
1. An unsupervised approach. No labels required.
2. A simple model -- only bag-of-words features for sentences/short texts.
3. No additional data is required, and no specific format or genre is assumed. In contrast, other work uses metadata such as authors or hashtags to help infer the topics of tweets.
Note that the data matrix X (mentioned in the paper) stores the TF-IDF values of words.
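
The role of X and of the missing words can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released C++/MATLAB code: the function names, the whitespace tokenizer, the smoothed IDF formula, and the missing-word weight of 0.01 are all chosen here for the example. The idea shown is that observed words get full weight in the reconstruction error while missing words (zero cells of X) get a small but nonzero weight.

```python
import math
from collections import Counter


def tfidf_matrix(sentences):
    """Return (vocab, X); X[i][j] is the TF-IDF value of word i in sentence j."""
    tokenized = [s.lower().split() for s in sentences]  # naive tokenizer (assumption)
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # Document frequency: number of sentences that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    X = [[0.0] * n_docs for _ in vocab]
    for j, toks in enumerate(tokenized):
        tf = Counter(toks)
        for i, w in enumerate(vocab):
            if w in tf:
                # Smoothed IDF (assumption) so that observed cells are never zero.
                X[i][j] = tf[w] * (math.log(n_docs / df[w]) + 1.0)
    return vocab, X


def weight_matrix(X, w_missing=0.01):
    """Weight 1 for observed words, a small weight for missing words.

    The value 0.01 is an illustrative choice, not a documented default.
    """
    return [[1.0 if x != 0.0 else w_missing for x in row] for row in X]


def weighted_loss(P, Q, X, W, lam=0.0):
    """Weighted squared reconstruction error plus L2 regularization.

    P holds one K-dimensional vector per word, Q one per sentence.
    """
    loss = 0.0
    for i, p in enumerate(P):
        for j, q in enumerate(Q):
            pred = sum(a * b for a, b in zip(p, q))
            loss += W[i][j] * (pred - X[i][j]) ** 2
    reg = sum(a * a for vec in P + Q for a in vec)
    return loss + lam * reg


# Toy usage: two short "sentences" and all-zero latent vectors.
sentences = ["stock market falls", "market rally"]
vocab, X = tfidf_matrix(sentences)
W = weight_matrix(X)
K = 2  # latent dimension (hypothetical)
P = [[0.0] * K for _ in vocab]
Q = [[0.0] * K for _ in sentences]
print(weighted_loss(P, Q, X, W))
```

Because missing cells still carry a small weight, a model trained on this loss is penalized for predicting words that a sentence does not contain, which is the extra signal that helps with very short texts.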


1. C++ code
2. Pipeline (recommended!) (latest version 2014/10/17)
The pipeline includes the MATLAB code and training corpora, together with Perl scripts for text preprocessing and TF-IDF weighting. Using the default corpora, we achieved a Pearson correlation of 0.726 on the STS12 dataset and 0.741 on the STS13 dataset, which is state-of-the-art performance among unsupervised short text similarity systems.
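
For readers who want to reproduce this kind of evaluation, the scoring step can be sketched as follows. This is a minimal Python illustration, not part of the pipeline: it assumes that sentence pairs are scored by the cosine similarity of their latent vectors and that system scores are compared with gold similarity judgments by Pearson correlation. The function names and the toy numbers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy usage: latent vectors for three sentence pairs and made-up gold scores.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 1.0], [1.0, 0.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [4.8, 3.0, 0.2]  # hypothetical human judgments on a 0-5 scale
system = [cosine(u, v) for u, v in pairs]
print(pearson(system, gold))
```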

3. Data for wmfvec (Download)

wmfvec is the sense similarity measure from our ACL 2012 paper, Learning the Latent Semantics of a Concept by its Definition.
As presented in the paper, when combined with jcn it produced results comparable to state-of-the-art unsupervised systems on the SENSEVAL2 and SENSEVAL3 all-words WSD tasks.