-------------------------------------------------------------------------
-------------------------------------------------------------------------
By Stuart Andrews (stu@cs.brown.edu)
Brown University, Providence, RI
Last modified: August 2003
-------------------------------------------------------------------------
-------------------------------------------------------------------------

This tar file contains MIL data sets for text classification, image retrieval
and drug activity prediction tasks.  Each data set comes with a summary file
with a .info suffix (e.g. musk1norm.svm.info).


---------------------
Musk Data Sets

The original musk data sets are available from the UC Irvine Machine Learning
Data Repository, thanks to Tom Dietterich of Oregon State University, and David
Chapman and Ajay Jain of the AI Group at Arris Pharmaceutical Corporation.  

Both data sets, MUSK1 and MUSK2, consist of descriptions of molecules using
multiple low-energy conformations. Each conformation is represented by a
166-dimensional feature vector derived from surface properties. MUSK1 contains
approximately 6 conformations per molecule on average, while MUSK2 has more
than 60 conformations per bag on average.  For these data sets, we normalized
each feature to have zero mean and unit variance.
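
The normalization step can be sketched as follows. This is a minimal
illustration, not the script used to produce the files: the function name
normalize_features is hypothetical, and we assume that the mean and variance
are computed over all instances pooled across bags and that the population
variance (dividing by n) is used. Neither detail is stated above.

```python
import math


def normalize_features(instances):
    """Scale each feature column to zero mean and unit variance.

    `instances` is a list of equal-length feature vectors, assumed here to be
    all instances pooled across bags. Constant features are mapped to 0.0.
    """
    n = len(instances)
    dim = len(instances[0])
    # Per-feature mean over all instances.
    means = [sum(row[j] for row in instances) / n for j in range(dim)]
    # Per-feature population standard deviation (divide by n; an assumption).
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in instances) / n)
        for j in range(dim)
    ]
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(dim)]
        for row in instances
    ]


# Toy example with two 2-dimensional instances; the second feature is constant.
toy = [[1.0, 10.0], [3.0, 10.0]]
normalized = normalize_features(toy)
```

For the musk data, each row would be one 166-dimensional conformation vector.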

More details on the creation of the musk data sets are provided in
files clean1.info and clean2.info.




---------------------
Image Retrieval Data Sets

We have generated new MIL data sets for an image annotation task. The original
data are color images from the Corel data set that have been preprocessed and
segmented with the Blobworld system. In this representation, an image consists
of a set of segments (or blobs), each characterized by color, texture and shape
descriptors.  We have utilized three different categories ("elephant", "fox", 
"tiger") in our experiments. In each case, the data sets have 100 positive and 
100 negative example images. The latter have been randomly drawn from a pool 
of photos of other animals.

For more information on the image annotation MIL data sets, please see the
README in the GenerateBlobworldMilData directory.



---------------------
TREC9 MIL data sets

Starting from the publicly available TREC9 data set, also known as OHSUMED, we
split documents into passages using overlapping windows of at most 50 words
each.  The original data set consists of several years of selected MEDLINE
articles. We worked with the 1987 data set, which was used as training data in
the TREC9 filtering task and consists of approximately 54,000 documents.
MEDLINE documents are annotated with MeSH terms (Medical Subject Headings),
each defining a binary concept. The total number of MeSH terms in TREC9 was
4903.  We used the first seven categories of the pre-test portion with at
least 100 positive examples.
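
The passage splitting can be sketched as follows, where each document becomes
a bag and each passage an instance. This is only an illustration under stated
assumptions: the function name split_into_passages is hypothetical, words are
taken to be whitespace-separated tokens, and the stride of 25 words between
window starts is an assumed value, since the amount of overlap is not given
above.

```python
def split_into_passages(text, window=50, stride=25):
    """Split a document into overlapping passages of at most `window` words.

    `stride` is the offset between consecutive window starts. The default of
    25 is an assumption made for illustration, not a documented setting.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in the range 1..window")
    words = text.split()  # assumes whitespace tokenization
    if not words:
        return []
    passages = []
    start = 0
    while True:
        passages.append(words[start:start + window])
        # Stop once the current window reaches the end of the document.
        if start + window >= len(words):
            break
        start += stride
    return passages


# A synthetic 120-word document yields windows starting at words 0, 25, 50, 75.
doc = " ".join(f"w{i}" for i in range(120))
bag = split_into_passages(doc)
```

With these settings, the last passage is shorter than 50 words whenever the
document length is not covered exactly by the windows.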

Reference:

Support Vector Machines for Multiple-Instance Learning 
 Stuart Andrews, Ioannis Tsochantaridis & Thomas Hofmann 
  Advances in Neural Information Processing Systems (NIPS*15), 2002


