Due: Tuesday, May 6 at 11:59 pm
Submission: On Courseworks
Please read the instructions carefully. An incorrectly formatted submission may not be counted.
There are several questions in this assignment.
This assignment starts from skeleton code: cancerpredict.py.
Please include comments in your code detailing your thought process where appropriate.
Put this file in a folder called uni-hw10 (so for me this would be tkp2108-hw10).
Then compress this folder into a .zip file.
For most operating systems, you should be able to right-click the folder and choose a “compress” option in the context menu.
This should create a file called tkp2108-hw10.zip (with your own uni in place of tkp2108).
Submit this .zip file on Courseworks.
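If the context-menu option is not available, the same archive can be built from Python. The following is a minimal sketch using only the standard library; it works inside a temporary directory with a placeholder cancerpredict.py, so the paths and the file contents are illustrative assumptions rather than part of the assignment.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

uni = "tkp2108"  # replace with your own uni

with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)

    # Build the uni-hw10 folder with a placeholder solution file (illustrative only)
    folder = work / f"{uni}-hw10"
    folder.mkdir()
    (folder / "cancerpredict.py").write_text("# your solution goes here\n")

    # root_dir + base_dir make the archive contain the folder itself,
    # not just the files inside it
    archive = shutil.make_archive(
        str(work / f"{uni}-hw10"), "zip", root_dir=str(work), base_dir=folder.name
    )

    # Inspect the archive before the temporary directory is removed
    archive_name = Path(archive).name
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(archive_name)
print(names)
```

In practice you would point root_dir at the directory that already holds your uni-hw10 folder instead of a temporary one.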
In this homework, we’ll be building on the analysis we performed in HW9.
We’ll continue to use the Wisconsin Diagnostic Breast Cancer dataset, which you can download from HW9.
In this homework, we will implement two major components:
Skeleton code is provided, which you should use.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def readCSV(filename):
    """Read the file called "filename", expected to be a csv file"""


def splitDataset(dataset, test_percentage=20):
    """
    Takes the dataset as a dataframe
    Shuffles data, then returns 4 subsets of the dataset as numpy matrices:
        Training data without labels (100-test_percentage percent of the data)
        Training data labels (column vector!)
        Testing data without labels (test_percentage percent of the data)
        Testing data labels (column vector!)
    """


def runKNNClassifier(training, testing, training_labels, testing_labels, k):
    """
    Run KNN Classifier and return accuracy.
    sklearn.neighbors.KNeighborsClassifier
    """


def runSVMClassifier(training, testing, training_labels, testing_labels):
    """
    Run SVM Classifier and return accuracy.
    sklearn.svm.SVC
    """


def main():
    df = readCSV("wdbc.csv")

    # randomly shuffle the data and
    # split dataset into training and testing data,
    # with separate labels
    train, train_labels, test, test_labels = splitDataset(df, test_percentage=20)

    # Print some info about test/train split
    print(f"The dataset has {len(df)} entries")
    print(f"Train dataset has {len(train)} entries")
    print(f"Test dataset has {len(test)} entries")
    print("")

    # run the knn classifier from scikit-learn
    k = int(input("How many sklearn nearest neighbors? "))
    print("Running sklearn-knn classifier...")
    knn_accuracy = runKNNClassifier(train, test, train_labels, test_labels, k)

    # run the svm classifier from scikit-learn
    print("Running sklearn-svm classifier...")
    svm_accuracy = runSVMClassifier(train, test, train_labels, test_labels)

    # print the accuracies
    print(f"\nAccuracies:\n\tKNN:{knn_accuracy:.1%}\n\tSVM:{svm_accuracy:.1%}")


if __name__ == "__main__":
    main()
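To make the splitDataset docstring concrete, here is one possible sketch of a shuffle-then-split on a toy dataframe. The helper name split_frame, the label column name "diagnosis", and the seed parameter are assumptions made for illustration; the actual column layout of wdbc.csv from HW9 may differ, and this is not the only reasonable way to write the split.

```python
import numpy as np
import pandas as pd


def split_frame(df, label_col, test_percentage=20, seed=None):
    """Shuffle rows, then split into train/test features and column-vector labels."""
    # sample(frac=1) returns every row in random order
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    # Round the test size down, e.g. 20% of 569 rows gives 113 test rows
    n_test = int(len(shuffled) * test_percentage / 100)
    test_part = shuffled.iloc[:n_test]
    train_part = shuffled.iloc[n_test:]

    def features(part):
        return part.drop(columns=[label_col]).to_numpy()

    def labels(part):
        # Double brackets keep a 2-D shape, so the result is an (n, 1) column vector
        return part[[label_col]].to_numpy()

    return features(train_part), labels(train_part), features(test_part), labels(test_part)


# Toy data: 10 rows, one hypothetical label column, two feature columns
toy = pd.DataFrame({
    "diagnosis": ["M", "B"] * 5,
    "radius": np.arange(10, dtype=float),
    "texture": np.arange(10, dtype=float) * 2,
})

train, train_labels, test, test_labels = split_frame(toy, "diagnosis", seed=0)
print(train.shape, train_labels.shape, test.shape, test_labels.shape)
# (8, 2) (8, 1) (2, 2) (2, 1)
```

One thing to keep in mind: scikit-learn estimators expect 1-D label arrays, so a column vector may trigger a warning when passed to fit; calling .ravel() on the labels flattens them.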
Example output of the completed program:
The dataset has 569 entries
Train dataset has 456 entries
Test dataset has 113 entries

How many sklearn nearest neighbors? 5
Running sklearn-knn classifier...
Running sklearn-svm classifier...

Accuracies:
    KNN:92.9%
    SVM:90.3%
Note that your accuracies will vary from run to run because the data is shuffled randomly before it is split.
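The effect of shuffling can be seen in a small experiment. The sketch below is an illustration under several assumptions: it uses scikit-learn's bundled copy of the Wisconsin Diagnostic Breast Cancer data (load_breast_cancer) rather than wdbc.csv, it uses train_test_split in place of the splitDataset function you will write, and the helper name accuracies is hypothetical. Different seeds give different splits and generally different accuracies, while repeating a seed reproduces the same numbers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def accuracies(seed, k=5):
    """Shuffle and split with the given seed, then return (knn_accuracy, svm_accuracy)."""
    X, y = load_breast_cancer(return_X_y=True)

    # test_size=113 mirrors the 456/113 split in the example output
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=113, random_state=seed
    )

    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    svm = SVC().fit(X_train, y_train)

    # score() returns the fraction of test samples classified correctly
    return knn.score(X_test, y_test), svm.score(X_test, y_test)


for seed in (0, 1, 2):
    knn_acc, svm_acc = accuracies(seed)
    print(f"seed={seed}  KNN:{knn_acc:.1%}  SVM:{svm_acc:.1%}")

# The same seed reproduces the same split, and therefore the same accuracies
repeatable = accuracies(0) == accuracies(0)
print("Same seed, same result:", repeatable)
```

Fixing a seed like this can be handy while debugging; whether a fixed seed is appropriate in your submission is not specified by the assignment.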