ENGI E1006: Introduction to Computing for Engineers and Applied Scientists


Lecture 10: Nearest-Neighbor Classification

Reading: Wikipedia on NN Classifiers

Machine Learning

Definition: (from Tom Mitchell) Suppose we have the following three ingredients:

  • A task (like playing chess)
  • A unit of experience (playing a game against yourself)
  • A measure of performance (like how many games do you win when you play against each person in the class)

We say learning happens when we improve at the task with respect to the performance measure as we gain more units of experience. So if you play 1000 games against yourself, do you win more games when you play against everyone in the class? Substitute you with a computer program and we have a definition of machine learning.

Classification

Consider a set of observations drawn from two different classes. An observation is just a vector. Consider for example the two-dimensional vector consisting of a person's weight and height. Let class 1 be people in New York City under the age of 14 and class 2 be people in the city 14 and older. A classifier is a map that takes a vector like (weight,height) as input and outputs a label, like over/under age 14. We build the map using examples of labeled pairs. For example, suppose we have data on a hundred people in New York: their weights, heights, and a label for each observation, 'o' for over 14 and 'u' for under. We call this labeled data training data, and using this data to build the map is called training the classifier. Notice that if this works, it's a special case of the definition of machine learning above.

Nearest Neighbor

Nearest neighbor classification is a very intuitive approach to building a classifier first proposed in 1951 by Fix and Hodges. Given a new unlabeled observation, this classifier will locate the nearest observation in the training data and use its label to label the new unlabeled observation. So consider the example above where we have data for 100 people in New York including their class labels, 'u' or 'o'. Let's treat each (weight,height) pair like a point on the plane. Now you give me a new pair without a label. The nearest neighbor classifier will find the point in the training data that is closest to it in the plane and use that nearest neighbor's label for the new observation. The more training data we have, the better we do (up to a point).

Your Project

Your final programming project is to implement a nearest neighbor classifier in Python and test it on two data sets. The first data set is known as the Wisconsin Breast Cancer Data. It is available online at the UCI data repository. Each observation consists of 30 measurements describing certain aspects of cells taken from a tumor. Each tumor has been labeled malignant or benign. There are 569 of these labeled observations. So what does this mean? How does this work? Just like before with the (weight,height) pairs, given a new observation, this time consisting of 30 measurements describing the geometry of cells from a tumor, we will determine which observation in the training data is nearest and use its label, malignant or benign, to label the new observation.

How do we calculate distance between observations consisting of 30 measurements? Once again we consider each observation as a vector. This time instead of two dimensional vectors we have 30-dimensional vectors. It's true we can't visualize what that looks like but we can still calculate the Euclidean distance between any two 30-dimensional vectors. We simply generalize the formula that is used in two dimensions. That is:

$$d(x,y) = \sqrt{(x_1 - y_1)^2 + (x_2-y_2)^2 + ... +(x_{30}-y_{30})^2}$$
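As a quick sketch of this formula in NumPy (the two vectors here are random placeholders, not real tumor measurements):

```python
import numpy as np

# Two illustrative 30-dimensional observations (random placeholder values)
rng = np.random.default_rng(0)
x = rng.random(30)
y = rng.random(30)

# Direct translation of the formula above
d_formula = np.sqrt(np.sum((x - y) ** 2))

# NumPy's built-in equivalent
d_norm = np.linalg.norm(x - y)

print(d_formula, d_norm)  # the two values agree
```

Either form works; `np.linalg.norm` just saves you from writing the sum and square root by hand.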

How do we organize the data? We organize $n$ training observations into a matrix that is $n$ rows by 31 columns. Each row is a 30-dimensional observation together with its label. We will call the data that we are trying to label testing data. We can organize a set of $m$ testing observations as an $m$ by 30 matrix. This time each row is an unlabeled 30-dimensional observation.

What does this look like in Python? We define a function in Python that has two parameters, a matrix of labeled training data ($n \times 31$) and a matrix of testing data ($m \times 30$). The function will output an $(m \times 1)$-dimensional matrix of labels for the testing data.

So we have 569 training examples, where do we get the testing data? To test out our classifier we will partition the training data into five sets of similar size (since 5 doesn't go into 569 evenly they will not all be exactly the same size). We will use four of those sets for training data and the other set for testing. What about the labels on the testing data? We will have to remove (but not forget) those labels when passing the testing data to the classifier. The classifier will then output what it thinks the labels should be. We can compare these with the actual labels to see how well we did. We do this five times, each time using a different set as the testing set and the other four as our training data. We then average the five results to estimate how well our classifier performs on this kind of data. The truth is there's nothing too special about five. We could have divided into ten sets or twenty or whatever. We call this process $n$-fold cross validation where $n$ is the number of sets you partition the data into. In your project, it is the job of the n_validator function to do this and return an estimate for the classifier's accuracy on this kind of data.
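A minimal sketch of the partitioning step described above, assuming we shuffle the row indices before splitting (the function name make_folds is illustrative, not part of the assignment):

```python
import numpy as np

def make_folds(n_rows, n_folds=5, seed=0):
    """Shuffle row indices and split them into n_folds nearly equal cells."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_rows)
    # np.array_split handles sizes that don't divide evenly (e.g. 569 / 5)
    return np.array_split(indices, n_folds)

folds = make_folds(569, 5)
print([len(f) for f in folds])  # [114, 114, 114, 114, 113]
```

Each fold in turn serves as the testing set while the other four are concatenated into the training set.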

Synthetic Data

In addition to the Wisconsin Breast Cancer Data you must also generate and test your classifier on artificial data generated from two multivariate normal distributions. We'll talk more about this during the next lecture, but in the meantime here is an example of generating two such classes in Python:

Example: normal_demo.py

# -*- coding: utf-8 -*-
"""
Generating multivariate normal classes

@author: Cannon
"""
import numpy as np
import matplotlib.pyplot as plt


# Means of the two classes
mean_1 = [1.0, 3.0]
mean_2 = [3.5, -2.5]

# Covariance matrices of the two classes
cov_1 = [[1, 0], [0, 25]]
cov_2 = [[2, 1], [1, 15]]

# Draw 100 samples from class 1 and 200 from class 2
c1 = np.random.multivariate_normal(mean_1, cov_1, 100)
c2 = np.random.multivariate_normal(mean_2, cov_2, 200)

# Plot class 1 in red, class 2 in blue
plt.plot(c1[:, 0], c1[:, 1], 'ro')
plt.plot(c2[:, 0], c2[:, 1], 'bo')

plt.show()

Using this kind of data to test out our classifier

Let's suppose we have generated 600 samples, 300 from each of the two multivariate normal distributions, as instructed in the assignment. Now let's go over how this will work with our implementation of nearest neighbor.

First, how do we store the data? We will build a $(600 \times 3)$-dimensional numpy array to store all 600 two-dimensional samples. The third column is for the label. You can put the label first or last, it's up to you, just be consistent. So in this big array, each row represents an artificially generated labeled observation.
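The storage step might be sketched like this, reusing the means and covariances from the demo above but drawing 300 samples from each class and labeling the classes 0 and 1 (the label values are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# 300 samples from each of two illustrative multivariate normals
c1 = rng.multivariate_normal([1.0, 3.0], [[1, 0], [0, 25]], 300)
c2 = rng.multivariate_normal([3.5, -2.5], [[2, 1], [1, 15]], 300)

# Append a label column (0 for class 1, 1 for class 2), then stack the rows
labeled_1 = np.hstack([c1, np.zeros((300, 1))])
labeled_2 = np.hstack([c2, np.ones((300, 1))])
data = np.vstack([labeled_1, labeled_2])

print(data.shape)  # (600, 3)
```

Here the label sits in the last column; if you prefer it first, just keep that convention everywhere.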

Now let's consider the NNclassifier function. What does it take as input? It takes a numpy array of training data (labeled data) and a numpy array of test data (unlabeled data). So if we used 480 training examples, that would look like a $(480 \times 3)$-dimensional numpy array. That would be the first parameter for our NNclassifier. We could use the other 120 observations we generated as the test data, but we must ignore the labels. So that would look like a $(120 \times 2)$-dimensional numpy array. You can strip the labels any way you want, but keep track of what they were. The easiest approach is to just slice out the label column when passing the test array to the classifier. Then the NNclassifier function should return a $(120 \times 1)$-dimensional array of labels for the test data. Those are the labels that the classifier guesses the observations should have. We can compare those guesses to the actual labels to see how well our classifier did. To summarize:

  1. Generate a $(600 \times 3)$-dimensional array of data drawn from two multivariate normal distributions.

  2. Use 480 of these labeled observations to construct the $(480 \times 3)$-dimensional training array.

  3. Ignoring the labels of the other 120, construct a $(120 \times 2)$-dimensional test array.

  4. The NNclassifier returns a $(120 \times 1)$-dimensional array of labels for the test data.

  5. We compare the labels the classifier returns to the actual labels to see how well we did.
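Steps 2 and 3 amount to slicing, which might look like this (the data array here is a random placeholder with labels in the last column; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((600, 3))            # placeholder for the labeled data array
data[:, 2] = rng.integers(0, 2, 600)   # fake 0/1 labels in the last column

train = data[:480]            # (480, 3): observations with their labels
test = data[480:, :2]         # (120, 2): label column sliced off
test_labels = data[480:, 2]   # kept aside for scoring in step 5

print(train.shape, test.shape, test_labels.shape)
```

In practice you would shuffle the rows first so the two classes are mixed between training and test cells.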

But how does the classifier guess the labels? We talked last time about the basics of nearest neighbor. We saw that for each test observation, i.e., each row of the $(120 \times 2)$-dimensional array, the classifier must calculate the distance to every observation in the training data. Whichever training observation is closest is the winner and we use its label as our guess. Is it difficult to implement this? No. The trick is to generate a distance matrix that is populated with the distances between all elements of the testing and training data. What would that look like? Once again, let's focus on dimensionality to help us understand. We have 120 test observations and 480 training observations. So for each test observation we need to calculate its distance from 480 training observations. If we construct a matrix where the element at row $i$ and column $j$ represents the distance between the $i$th test observation and the $j$th training observation, we end up with a distance matrix with dimensionality $(120 \times 480)$. The numbers in this matrix are all distances. To find the closest neighbor to test observation $i$ we simply locate the minimum number in row $i$ of our distance matrix. This location, call it loc_min, is a number between 0 and 479. It represents the index of the nearest neighbor in the training set. Now we just look up its label in the training array and we have our guess! To summarize:

  1. Generate a $(120 \times 480)$-dimensional matrix of distances (hint: check out scipy.spatial.distance)
  2. For each row locate the position of the minimum distance (hint: check out numpy.argmin), let's call it loc_min.
  3. Check the row indexed loc_min in the training data array and use its label.
  4. Populate a 120 element array with these guesses and return it.
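Putting the four steps together, one possible sketch of the classifier (assuming labels sit in the last column of the training array; cdist is one convenient function from scipy.spatial.distance, and the guesses come back here as a flat array rather than a $(120 \times 1)$ matrix):

```python
import numpy as np
from scipy.spatial.distance import cdist

def NNclassifier(train, test):
    """Nearest-neighbor classifier.

    train: (n, d+1) array, last column holds the labels
    test:  (m, d) array of unlabeled observations
    returns: (m,) array of guessed labels
    """
    features = train[:, :-1]            # strip off the label column
    labels = train[:, -1]
    dists = cdist(test, features)       # (m, n) distance matrix (step 1)
    loc_min = np.argmin(dists, axis=1)  # nearest neighbor per row (step 2)
    return labels[loc_min]              # look up the labels (steps 3-4)

# Tiny demo: two training points, one test point near the first
train = np.array([[0.0, 0.0, 1.0],
                  [10.0, 10.0, 2.0]])
test = np.array([[1.0, 1.0]])
print(NNclassifier(train, test))  # [1.]
```

Notice there are no explicit loops: cdist and argmin do all the work across rows at once.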

So then what about n_validator? The purpose of this function is to take that original $(600 \times 3)$-dimensional data array and use it to test the accuracy of our classifier on that data by breaking the data into $p$ pairs of training-test sets. It should be able to do this for any classifier, not just our nearest neighbor classifier. So it takes as input the complete labeled data array, an integer $p$ which is the number of cells we will be partitioning the data into, the classifier to test, and any extra parameters the classifier may need. So we are actually going to pass a function as a parameter here. For nearest neighbor we don't have any extra parameters, so the variable-length parameter *args will be empty. To pass the function as a parameter we just provide its name, in this case NNclassifier. So to summarize:

  1. Pass n_validator all of the labeled data, the number of cells to divide into (5 per the instructions on your assignment), and the NNclassifier function.

  2. The function should first randomly shuffle the rows of the data array.

  3. The function then passes 480 labeled rows to the classifier as training data and 120 unlabeled rows as test data.

  4. The function compares the classifier's guesses to the actual labels and stores the result as a fraction between 0 and 1.

  5. The function repeats this five times, each time using a different 120 unlabeled observations as the test data.

  6. The function returns the average of the five fractions stored in Step 4.
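The steps above might be sketched as follows (assuming labels in the last column and a classifier taking (train, test, *args); this is one possible implementation, not the required one):

```python
import numpy as np

def n_validator(data, p, classifier, *args):
    """Estimate a classifier's accuracy with p-fold cross-validation.

    data: (n, d+1) array, labels in the last column
    p: number of cells to partition the data into
    classifier: function taking (train, test, *args), returning guessed labels
    """
    rng = np.random.default_rng()
    shuffled = data[rng.permutation(len(data))]        # step 2: shuffle rows
    folds = np.array_split(np.arange(len(shuffled)), p)
    scores = []
    for test_idx in folds:                             # step 5: p rounds
        mask = np.ones(len(shuffled), dtype=bool)
        mask[test_idx] = False
        train = shuffled[mask]                         # p-1 cells, labeled
        test = shuffled[test_idx, :-1]                 # one cell, labels removed
        actual = shuffled[test_idx, -1]
        guesses = classifier(train, test, *args)       # step 3
        scores.append(np.mean(guesses == actual))      # step 4: fraction correct
    return float(np.mean(scores))                      # step 6: average
```

Called as n_validator(data, 5, NNclassifier), this reproduces the 480/120 splits described above for a 600-row array.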

And that's all she wrote. If you do this right, part two will be a breeze.