The Columbia University Text Categorization and Image Corpus

To support research on using text to categorize images, I have created a corpus of news documents with embedded, captioned images. I have defined multiple sets of categories representing various levels of abstraction, and with the help of members of the Natural Language Processing Group at Columbia University, I have collected manual labels for the documents and images. The corpus is now ready to be made public. Instructions will be posted here as soon as that happens. I hope that the corpus will serve as an important resource for researchers.


Send questions or comments to Carl Sable at sable@cs.columbia.edu