Homework 10

Due: Tuesday, May 6 at 11:59 pm

Instructions

Please read the instructions carefully. An incorrectly formatted submission may not be counted.

There are several questions in this assignment. This assignment starts from skeleton code: cancerpredict.py. Please include comments in your code detailing your thought process where appropriate. Put this file in a folder called uni-hw10 (so for me this would be tkp2108-hw10). Then compress this folder into a .zip file. On most operating systems, you should be able to right-click the folder and choose a “compress” option in the context menu. This should create a file called tkp2108-hw10.zip (with your uni). Submit this file on courseworks.

Machine Learning - Binary Classification Prediction

In this homework, we’ll be building on the analysis we performed in HW9.

We’ll continue to use the Wisconsin Diagnostic Breast Cancer dataset, which you can download from HW9.

In this homework, we will implement two major components:

First, we will deal with the data:
- load the data
- remove the ID column
- shuffle the rows
- split our dataset horizontally into training data (80%) and testing data (20%)
- split our training and testing data vertically into features/labels (30 data columns and the 1 diagnosis column, respectively)
Next, we will build models:
- use scikit-learn's K nearest neighbors and SVM classifiers, as discussed in class
- train the models on the training features and labels, then use the testing features to predict labels
- compare the predicted labels to the actual testing labels to determine a final accuracy for each model
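
The steps above can be sketched end to end on a small synthetic table. This is only an illustration under stated assumptions, not the expected solution: the column names "id" and "diagnosis", the "M"/"B" label values, and the three made-up feature columns are hypothetical stand-ins for the real file, and the labels are kept one-dimensional for brevity even though the skeleton's docstring asks for column vectors.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the real data. The column names, label values and
# feature columns are assumptions made for this example only.
rng = np.random.default_rng(0)
n = 100
labels = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "id": np.arange(n),
    "diagnosis": np.where(labels == 1, "M", "B"),
    "f1": rng.normal(loc=labels * 3.0, scale=1.0),
    "f2": rng.normal(loc=labels * 3.0, scale=1.0),
    "f3": rng.normal(size=n),
})

# 1. Remove the ID column, then shuffle the rows.
demo = demo.drop(columns=["id"])
demo = demo.sample(frac=1, random_state=0).reset_index(drop=True)

# 2. Horizontal split: the last test_percentage percent of rows for testing.
test_percentage = 20
n_test = len(demo) * test_percentage // 100
n_train = len(demo) - n_test
train_df, test_df = demo.iloc[:n_train], demo.iloc[n_train:]

# 3. Vertical split: feature columns vs. the label column.
train_X = train_df.drop(columns=["diagnosis"]).to_numpy()
train_y = train_df["diagnosis"].to_numpy()
test_X = test_df.drop(columns=["diagnosis"]).to_numpy()
test_y = test_df["diagnosis"].to_numpy()

# 4. Fit each model on the training data, predict on the testing features,
#    and compare the predictions with the true testing labels.
accuracies = {}
for name, model in [("KNN", KNeighborsClassifier(n_neighbors=5)), ("SVM", SVC())]:
    model.fit(train_X, train_y)
    predicted = model.predict(test_X)
    accuracies[name] = float((predicted == test_y).mean())

print(accuracies)
```

Accuracy here is simply the fraction of testing rows whose predicted label equals the actual label. On the split size: with 569 rows and a test percentage of 20, rounding the test size down gives 113 testing rows and 456 training rows, which matches the counts shown in the example output; rounding the training size down instead would give 455 and 114.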

Skeleton code is provided, which you should use.

Skeleton Code

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def readCSV(filename):
    """Read the file called "filename", expected to be a csv file"""

def splitDataset(dataset, test_percentage=20):
    """
    Takes the dataset as a dataframe

    Shuffles data, then returns 4 subsets of the dataset as numpy matrices:
        Training data without labels (100-test_percentage percent of the data)
        Training data labels (column vector!)
        Testing data without labels (test_percentage percent of the data)
        Testing data labels (column vector!)
    """

def runKNNClassifier(training, testing, training_labels, testing_labels, k):
    """
    Run KNN Classifier and return accuracy.
    sklearn.neighbors.KNeighborsClassifier
    """

def runSVMClassifier(training, testing, training_labels, testing_labels):
    """
    Run SVM Classifier and return accuracy.
    sklearn.svm.SVC
    """

def main():
    df = readCSV("wdbc.csv")

    # randomly shuffle the data and
    # split dataset into training and testing data,
    # with separate labels
    train, train_labels, test, test_labels = splitDataset(df, test_percentage=20)

    # Print some info about test/train split
    print(f"The dataset has {len(df)} entries")
    print(f"Train dataset has {len(train)} entries")
    print(f"Test dataset has {len(test)} entries")
    print("")

    # run the knn classifier from scikit-learn
    k = int(input("How many sklearn nearest neighbors? "))
    print("Running sklearn-knn classifier...")
    knn_accuracy = runKNNClassifier(train, test, train_labels, test_labels, k)

    # run the svm classifier from scikit-learn
    print("Running sklearn-svm classifier...")
    svm_accuracy = runSVMClassifier(train, test, train_labels, test_labels)

    # print the accuracies
    print(f"\nAccuracies:\n\tKNN:{knn_accuracy:.1%}\n\tSVM:{svm_accuracy:.1%}")


if __name__ == "__main__":
    main()

Example Output

The dataset has 569 entries
Train dataset has 456 entries
Test dataset has 113 entries

How many sklearn nearest neighbors? 5
Running sklearn-knn classifier...
Running sklearn-svm classifier...

Accuracies:
	KNN:92.9%
	SVM:90.3%

Note that the accuracy will vary from run to run because of the random shuffling.
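
While debugging, it can help to make the shuffle repeatable so that two runs produce the same split. The sketch below assumes the shuffle is done with pandas `DataFrame.sample`, which is one option among several, and uses a tiny made-up table; the assignment does not say whether a fixed seed is acceptable in a submission.

```python
import pandas as pd

# Tiny made-up table, used only to show the effect of the seed.
demo = pd.DataFrame({"value": range(10)})

# Same seed: the shuffled row order is identical on every call.
first = demo.sample(frac=1, random_state=42)
second = demo.sample(frac=1, random_state=42)
print(first.index.equals(second.index))

# No seed: a fresh random order each call, but every row is still present.
unseeded = demo.sample(frac=1)
print(sorted(unseeded["value"]) == list(range(10)))
```

Both checks print `True`. Removing `random_state` restores the run-to-run variation described above.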