README.TXT, June 8th, 2001
==========================

The following piece of code is available free of charge for academic
use only. It implements the Maximum Entropy Discrimination feature
selection technique described in the paper:

  "Feature Selection and Dualities in Maximum Entropy Discrimination"
  by Tony Jebara and Tommi Jaakkola, Uncertainty in Artificial
  Intelligence (UAI) 2000.

Given a text file of input data points and their binary class labels,
the method recovers a large-margin linear decision boundary that uses
only a subset of the input features. The method outputs:

 1) classifications of the data under the model (outputs.ext)
 2) the model, as a scalar bias value followed by a (sparse) linear
    vector in the feature space (model.ext)
 3) the Lagrange multipliers (lambdas.ext)

The model classifies an input data vector X into a binary class
y = +/-1 as follows:

  y = sign( \theta^T X + b )

The model.ext file contains the scalar 'b' on the first line, followed
by the vector \theta unrolled after it.

The input data file should contain one data point per row, with the
binary label (+1 or -1) as the last column, for example:

  0.8 0.5  1
  0.2 0.6  1
  0.8 0.7 -1
  0.2 0.8 -1

Here there are 4 data points, each of which is 2-dimensional, with the
first and second columns of the text file giving the (x,y) coordinates
and the third column giving the class as +1 or -1.

Make sure that the input data is normalized to lie in the unit cube.
In other words, it is best if each dimension of the input space is
appropriately and consistently scaled to lie between 0 and 1.

The technique does not use implicit kernel mappings. It is up to you
to explicitly transform the input features into a higher-dimensional
representation (e.g. append the quadratic outer-product terms to the
feature space to simulate a second-order polynomial kernel). This is
necessary because it would otherwise be impossible to switch
individual features on or off inside an implicit kernel mapping.

The executable is medclass.exe. It is a Windows 2000 (or Windows 98)
binary which is run on the command line.
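For illustration, here is a minimal Python sketch (not part of the
medclass distribution) of the preprocessing described above: scaling
each input dimension to [0,1] and explicitly appending quadratic
outer-product terms before writing the data in the expected text
format. The helper name 'preprocess' and the file name
'mydata.traintest' are just placeholders.

  import numpy as np

  def preprocess(X, y, filename):
      """Scale features to [0,1], append quadratic terms, write file."""
      # Min-max scale every dimension into the unit cube.
      lo, hi = X.min(axis=0), X.max(axis=0)
      Xs = (X - lo) / np.where(hi > lo, hi - lo, 1.0)

      # Explicit second-order expansion: all pairwise products x_i * x_j.
      quad = np.einsum('ni,nj->nij', Xs, Xs).reshape(len(Xs), -1)
      Xq = np.hstack([Xs, quad])

      # One row per point: features first, the +1/-1 label last.
      with open(filename, 'w') as f:
          for row, label in zip(Xq, y):
              f.write(' '.join('%g' % v for v in row) + ' %d\n' % label)

  # Example usage with random 2-D data and random labels.
  X = np.random.rand(10, 2)
  y = np.random.choice([-1, 1], size=10)
  preprocess(X, y, 'mydata.traintest')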
COMMAND LINE ARGUMENTS
======================

To get a synopsis of the command line arguments, type:

  >medclass -help

which will yield:

  Usage: medclass -train -ntrain -ext -iters -lambda -sigma -p -c
  Feature Selection for Linear Classification.

-train: follow this argument with the name of the input text file
containing the training data, with the specified binary classes as
the last column. You may also put testing data at the end of this
file; it can be hidden from the learning algorithm by only permitting
it to process the first 'ntrain' data points for training. The
remaining points will be used to test and evaluate accuracy.

-ntrain: this specifies how many points in the file to use for
training. The remaining ones will only be used for testing. If this
value is not specified, ALL data in the training file will be used
for training. That is the default option.

-ext: this is the extension to add to the output files the program
generates. It will be '.ext' by default, but you can specify any
string in order to generate multiple output files for different runs
on different data sets and different parameter settings.

-iters: this is the maximum number of iterations you want from the
program. These are axis-parallel iterations where only 1 Lagrange
multiplier is changed at a time. The default is 64 million (!) but
typically the program will quit earlier once it detects that it has
converged. Typically you would ignore this argument.

-lambda: this is to initialize the axis-parallel optimization with
previous values of the Lagrange multipliers that the program
converged to earlier. So, after running and converging to a set of
solutions like 'lambdas.ext', you can reload it and continue the
optimization. It is important that none of the lambda values are
greater than the current value of the 'c' parameter. Typically this
is not a very useful argument.

-sigma: this specifies the prior on the scalar bias parameter (b).
This parameter should not affect the performance of the algorithm too
much and is set to a default value of 3.

-p: this is an IMPORTANT parameter which varies the feature selection
level. The range of values here is typically (0,1], where 1 means 'no
feature selection' and will basically generate a regular Support
Vector Machine. As this parameter is decreased (think of it on a
logarithmic scale), the linear model that is generated becomes
sparser and sparser. The default value is p=1.0 (no feature
selection).

-c: this is an IMPORTANT regularization parameter which is basically
identical to the SVM 'c' parameter. It varies the sensitivity to
outliers and allows us to handle non-separable problems. Typically
this will be varied from (1,1000), where higher values will attempt
to achieve a fully separable classification (and possibly overfit)
while lower values will ignore some outliers and be more robust. The
default value is c=30.0. If you specify a NEGATIVE value for 'c', the
program will sweep many c values from c=1 up to c=abs(value
specified), in multiplicative steps of c=c*1.5. Each step runs until
convergence, saves state, and then moves to the next larger c value.
You can then check the 'log' file to see which setting did best; this
permits a quick and dirty cross-validation to find the best 'c'
regularization parameter.

DISPLAY
=======

While running, the program will output the current value of the
objective function (i.e. the negated log-partition function
J(\lambda) as described in the paper). It should increase
monotonically. The second column is merely the number of
axis-parallel iterations.

HOTKEYS
=======

While running, you can press the 's' key to query the program about
the current accuracy of the classification boundary (on the training
and testing data in the input file). This will also save the state by
generating the output files lambdas.ext, model.ext and outputs.ext.

To quit running, press the 'q' key and the program will exit. It will
also report the accuracy on testing and training as well as save the
3 output files.

OUTPUT
======

The 3 output files are 'lambdas.ext', 'model.ext' and 'outputs.ext'.
The first two are single columns of numbers: 'lambdas.ext' contains
the values of the Lagrange multipliers, and 'model.ext' contains the
bias value followed by the linear model's values. The 'outputs.ext'
file contains two columns of numbers: the first is the estimated
classification output of the linear model on the whole input data
file, and the second is the true class as specified in the training
data file.

Also, the file 'log' will be appended to after each run with the
command line for the run as well as the training and testing accuracy
each time the files are saved.
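For illustration, here is a minimal Python sketch (again not part of
the medclass distribution) that loads a saved model and reproduces
the classification rule y = sign(\theta^T X + b) on a data file in
the format described above. The data file name is a placeholder.

  import numpy as np

  # model.ext: the bias 'b' on the first line, then theta unrolled.
  model = np.loadtxt('model.ext')
  b, theta = model[0], model[1:]

  # Data file: one point per row, features first, true +1/-1 label last.
  data = np.loadtxt('mydata.traintest')
  X, y_true = data[:, :-1], data[:, -1]

  # Linear decision rule: y = sign(theta^T x + b).
  y_pred = np.sign(X @ theta + b)

  print('accuracy = %.4f' % np.mean(y_pred == y_true))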
EXAMPLE
=======

One sample data set is provided to try out the program. It consists
of 1535 DNA sequences (60 base pairs each) from the UCI repository.
These are represented in a binary code:

  A=(1 0 0 0)  C=(0 1 0 0)  G=(0 0 1 0)  T=(0 0 0 1)

with uncertainty represented as a weighted binary code, e.g.
'A or C'=(0.5 0.5 0 0). The data forms 2 classes of splice site:
intron-exon (ie) donor or exon-intron (ei) acceptor site. This
simplifies things into a simple binary classification task. The text
data has 1535 rows and 241 columns, where the last column is -1/+1
depending on the class (ei = -1, ie = +1). The order of the rows has
been scrambled to ensure iid training and testing data.

We run as follows and wait for convergence:

  medclass -train spliceX.traintest -ntrain 200 -p 1e-3 -c 50
  ...
  3.347107e+001 49500
  3.347107e+001 50000
  J=3.347107e+001 i=50000 err1=0.000000e+000 p=1.000000e-003 c=5.000000e+001 err2=6.300000e+001

You should obtain the above, which indicates that our linear
classifier has obtained a training accuracy of 100% (err1=0.0
training errors) and a testing accuracy of 95.2% (err2=63 testing
errors). This MED run is with feature selection since p<<1 (p=1e-3),
and the sparseness can be seen by noting the many zeros (i.e. low
values) in the model.ext file.

A bitmap containing multiple runs for various c and p values is
included (splice.bmp). The runs indicate that a level of p=1e-3 helps
generalization. In fact, any level of feature selection does better
than p=1e0 (i.e. no feature selection), which can yield almost twice
the error. The p=1e0 setting is essentially the standard SVM and
yields an accuracy of 92-93%. The best setting in the results bitmap
was p=1e-2 and c=2.25, yielding an error count of 50, or a testing
accuracy of 96.25%.

__________________________________________________________
Tony Jebara
MIT Media Laboratory, Room E15-390
20 Ames St., Cambridge MA 02139-4307
Tel: 617-253-0326   Fax: 617-253-8874
http://www.media.mit.edu/~jebara