README.TXT, June 8th, 2001
==========================

The following piece of code is available free of charge for academic
use only. It implements the Maximum Entropy Discrimination feature
selection technique described in the paper:

  "Feature Selection and Dualities in Maximum Entropy Discrimination"
  by Tony Jebara and Tommi Jaakkola, Uncertainty in Artificial
  Intelligence (UAI) 2000.

Given a text file of input data points and their binary class labels,
the method recovers a large-margin linear decision boundary that uses
only a subset of the input features. The method outputs:

 1) classifications of the data under the model (outputs.ext)
 2) the model, as a scalar bias value followed by a (sparse) linear
    vector in the feature space (model.ext)
 3) the Lagrange multipliers (lambdas.ext)

The model classifies an input data vector X into a binary class
y = +/-1 as follows:

  y = sign( \theta^T X + b )

The model.ext file contains the scalar 'b' on the first line, followed
by the vector \theta unrolled after it.

The input data file should contain one data point per row, with the
binary label (+1 or -1) as the last column, for example:

  0.8 0.5  1
  0.2 0.6  1
  0.8 0.7 -1
  0.2 0.8 -1

Here there are 4 data points, each of which is 2-dimensional, with the
first and second columns of the text file giving the (x,y) coordinates
and the third column giving the class as +1 or -1.

Make sure that the input data is normalized to lie in the unit cube.
In other words, it is best if each dimension of the input space is
appropriately and consistently scaled to lie between 0 and 1.

The technique does not use implicit kernel mappings. It is up to you
to explicitly transform the input features into a higher-dimensional
representation (e.g. append the quadratic outer-product terms to the
feature space to simulate a second-order polynomial kernel). This is
necessary because it would otherwise be impossible to switch
individual features on or off inside an implicit kernel mapping.

The executable is medclass.exe. It is a Windows 2000 (or Windows 98)
binary which is run on the command line.
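For illustration, here is a minimal Python sketch (not part of the
medclass distribution) of the preprocessing described above: scaling
each input dimension to [0,1] and explicitly appending quadratic
outer-product terms before writing the data in the expected text
format. The helper name 'preprocess' and the file name
'mydata.traintest' are just placeholders.

  import numpy as np

  def preprocess(X, y, filename):
      """Scale features to [0,1], append quadratic terms, write file."""
      # Min-max scale every dimension into the unit cube.
      lo, hi = X.min(axis=0), X.max(axis=0)
      Xs = (X - lo) / np.where(hi > lo, hi - lo, 1.0)

      # Explicit second-order expansion: all pairwise products x_i * x_j.
      quad = np.einsum('ni,nj->nij', Xs, Xs).reshape(len(Xs), -1)
      Xq = np.hstack([Xs, quad])

      # One row per point: features first, the +1/-1 label last.
      with open(filename, 'w') as f:
          for row, label in zip(Xq, y):
              f.write(' '.join('%g' % v for v in row) + ' %d\n' % label)

  # Example usage with random 2-D data and random labels.
  X = np.random.rand(10, 2)
  y = np.random.choice([-1, 1], size=10)
  preprocess(X, y, 'mydata.traintest')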
COMMAND LINE ARGUMENTS
======================

To get a synopsis of the command line arguments, type:

  >medclass -help

which will yield:

  Usage: medclass -train -ntrain -ext -iters -lambda -sigma -p -c
  Feature Selection for Linear Classification.

-train: follow this argument with the name of the input text file
containing the training data, with the specified binary classes as
the last column. You may also put testing data at the end of this
file; it can be hidden from the learning algorithm by only permitting
it to process the first 'ntrain' data points for training. The
remaining points will be used to test and evaluate accuracy.

-ntrain: this specifies how many points in the file to use for
training. The remaining ones will only be used for testing. If this
value is not specified, ALL data in the training file will be used
for training. That is the default option.

-ext: this is the extension to add to the output files the program
generates. It will be '.ext' by default, but you can specify any
string in order to generate multiple output files for different runs
on different data sets and different parameter settings.

-iters: this is the maximum number of iterations you want from the
program. These are axis-parallel iterations where only 1 Lagrange
multiplier is changed at a time. The default is 64 million (!) but
typically the program will quit earlier once it detects that it has
converged. Typically you would ignore this argument.

-lambda: this is to initialize the axis-parallel optimization with
previous values of the Lagrange multipliers that the program
converged to earlier. So, after running and converging to a set of
solutions like 'lambdas.ext', you can reload it and continue the
optimization. It is important that none of the lambda values are
greater than the current value of the 'c' parameter. Typically this
is not a very useful argument.

-sigma: this specifies the prior on the scalar bias parameter (b).
This parameter should not affect the performance of the algorithm too
much and is set to a default value of 3.

-p: this is an IMPORTANT parameter which varies the feature selection
level. The range of values here is typically (0,1], where 1 means 'no
feature selection' and will basically generate a regular Support
Vector Machine. As this parameter is decreased (think of it on a
logarithmic scale), the linear model that is generated becomes
sparser and sparser. The default value is p=1.0 (no feature
selection).

-c: this is an IMPORTANT regularization parameter which is basically
identical to the SVM 'c' parameter. It varies the sensitivity to
outliers and allows us to handle non-separable problems. Typically
this will be varied from (1,1000), where higher values will attempt
to achieve a fully separable classification (and possibly overfit)
while lower values will ignore some outliers and be more robust. The
default value is c=30.0. If you specify a NEGATIVE value for 'c', the
program will sweep many c values from c=1 up to c=abs(value
specified), in multiplicative steps of c=c*1.5. Each step runs until
convergence, saves state, and then moves to the next larger c value.
You can then check the 'log' file to see which setting did best; this
permits a quick and dirty cross-validation to find the best 'c'
regularization parameter.

DISPLAY
=======

While running, the program will output the current value of the
objective function (i.e. the negated log-partition function
J(\lambda) as described in the paper). It should increase
monotonically. The second column is merely the number of
axis-parallel iterations.

HOTKEYS
=======

While running, you can press the 's' key to query the program about
the current accuracy of the classification boundary (on the training
and testing data in the input file). This will also save the state by
generating the output files lambdas.ext, model.ext and outputs.ext.

To quit running, press the 'q' key and the program will exit. It will
also report the accuracy on testing and training as well as save the
3 output files.

OUTPUT
======

The 3 output files are 'lambdas.ext', 'model.ext' and 'outputs.ext'.
The first two are single columns of numbers: 'lambdas.ext' contains
the values of the Lagrange multipliers, and 'model.ext' contains the
bias value followed by the linear model's values. The 'outputs.ext'
file contains two columns of numbers: the first is the estimated
classification output of the linear model on the whole input data
file, and the second is the true class as specified in the training
data file.

Also, the file 'log' will be appended to after each run with the
command line for the run as well as the training and testing accuracy
each time the files are saved.
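For illustration, here is a minimal Python sketch (again not part of
the medclass distribution) that loads a saved model and reproduces
the classification rule y = sign(\theta^T X + b) on a data file in
the format described above. The data file name is a placeholder.

  import numpy as np

  # model.ext: the bias 'b' on the first line, then theta unrolled.
  model = np.loadtxt('model.ext')
  b, theta = model[0], model[1:]

  # Data file: one point per row, features first, true +1/-1 label last.
  data = np.loadtxt('mydata.traintest')
  X, y_true = data[:, :-1], data[:, -1]

  # Linear decision rule: y = sign(theta^T x + b).
  y_pred = np.sign(X @ theta + b)

  print('accuracy = %.4f' % np.mean(y_pred == y_true))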
EXAMPLE
=======

One sample data set is provided to try out the program. It consists
of 1535 DNA sequences (60 base pairs each) from the UCI repository.
These are represented in a binary code:

  A=(1 0 0 0)  C=(0 1 0 0)  G=(0 0 1 0)  T=(0 0 0 1)

with uncertainty represented as a weighted binary code, e.g.
'A or C'=(0.5 0.5 0 0). The data forms 2 classes of splice site:
intron-exon (ie) donor or exon-intron (ei) acceptor site. This
simplifies things into a simple binary classification task. The text
data has 1535 rows and 241 columns, where the last column is -1/+1
depending on the class (ei = -1, ie = +1). The order of the rows has
been scrambled to ensure iid training and testing data.

We run as follows and wait for convergence:

  medclass -train spliceX.traintest -ntrain 200 -p 1e-3 -c 50
  ...
  3.347107e+001 49500
  3.347107e+001 50000
  J=3.347107e+001 i=50000 err1=0.000000e+000 p=1.000000e-003 c=5.000000e+001 err2=6.300000e+001

You should obtain the above, which indicates that our linear
classifier has obtained a training accuracy of 100% (err1=0.0
training errors) and a testing accuracy of 95.2% (err2=63 testing
errors). This MED run is with feature selection since p<<1 (p=1e-3),
and the sparseness can be seen by noting the many zeros (i.e. low
values) in the model.ext file.

A bitmap containing multiple runs for various c and p values is
included (splice.bmp). The runs indicate that a level of p=1e-3 helps
generalization. In fact, any level of feature selection does better
than p=1e0 (i.e. no feature selection), which can yield almost twice
the error. The p=1e0 setting is essentially the standard SVM and
yields an accuracy of 92-93%. The best setting in the results bitmap
was p=1e-2 and c=2.25, yielding an error count of 50, or a testing
accuracy of 96.25%.

__________________________________________________________
Tony Jebara
MIT Media Laboratory, Room E15-390
20 Ames St., Cambridge MA 02139-4307
Tel: 617-253-0326   Fax: 617-253-8874
http://www.media.mit.edu/~jebara