Homework 6

Due: Thursday, April 25 at 11:59pm

Submission: On Courseworks

Skeleton Code: hw6.zip

Instructions

Please read the instructions carefully. An incorrectly formatted submission may not be counted.

In this homework, we will start from some skeleton code. You are given the engi1006 and mlmodel modules. We will implement both part 1 and part 2 of this homework on top of the skeleton code, just as in homework 5. To submit, place these modules inside a folder called uni-hw6 (for me this would be tkp2108-hw6). Then compress this folder into a .zip file. On most operating systems, you should be able to right click the folder and choose a “compress” option in the context menu. This should create a file called tkp2108-hw6.zip (with your uni). Submit this on Courseworks.
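
If the right-click “compress” option is not available on your machine, the same archive can also be created from Python with the standard library. This is just an optional alternative; the folder name below uses the example uni from above, so substitute your own.

import shutil

# Produces tkp2108-hw6.zip containing the tkp2108-hw6 folder (use your own uni)
shutil.make_archive("tkp2108-hw6", "zip", root_dir=".", base_dir="tkp2108-hw6")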

Part 1 - Object Oriented Programming

For this part, we will be modeling some real-world objects using Object Oriented Programming. We will model a course, consisting of a teacher, students, and assignments. We will generate and enroll some students. Then we will generate some assignments. For each assignment, we calculate each student’s grade, pay the teacher, and record the grade for the student. Here is an example test run (from the file test_engi1006.py):

from engi1006 import Assignment, Course, Student, Teacher

t = Teacher('Joe')
c = Course(t)
c.register(Student('Jack', 50))
c.register(Student('Jill', 60))
c.register(Student('Jane', 75))
c.assign(Assignment(50))
c.assign(Assignment(75))
c.assign(Assignment(80))
c.assign(Assignment(90))

When we finish the class, we display the average grade and the teacher’s pay:

c.finish()
Class Average: 75.732901
Teacher Pay: 885.000

Finally, we can plot the data from our course. The three plots are grades by assignment, grades by student, and assignment difficulty.

c.plot()

Skeleton code will be provided. It is missing a number of methods, which you will need to implement to get this library working. This assignment is mostly an exercise in getting comfortable with an object-oriented codebase, and reading and interpreting error messages to fill in the missing details.
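
To give a rough picture of how the pieces might fit together, here is a minimal sketch based only on the constructor calls and method names in the test run above. It is not the actual skeleton, and the meanings of the numeric arguments (a student's skill level, an assignment's difficulty) are assumptions.

class Teacher:
    def __init__(self, name):
        self.name = name
        self.pay = 0  # accumulated as assignments are graded


class Student:
    def __init__(self, name, skill):
        self.name = name
        self.skill = skill    # assumed meaning of the second constructor argument
        self.grades = []


class Assignment:
    def __init__(self, difficulty):
        self.difficulty = difficulty


class Course:
    def __init__(self, teacher):
        self.teacher = teacher
        self.students = []
        self.assignments = []

    def register(self, student):
        self.students.append(student)

    def assign(self, assignment):
        self.assignments.append(assignment)
        # in the real skeleton, grading each student and paying the teacher
        # would happen here (or in a helper method)

    def finish(self):
        # print the class average and the teacher's pay
        ...

    def plot(self):
        # grades by assignment, grades by student, and assignment difficulty
        ...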

Note that the skeleton code includes a tester, which should produce a plot that looks something like this (accounting for randomness).

Part 2 - Machine Learning Part 1

For the second part of this assignment, we will get started on our machine learning model. Like the n-body problem, this project will span multiple homeworks.

In this first assignment, we will get to know the dataset and prepare it for use with scikit-learn. The dataset in question is the Wisconsin Breast Cancer Diagnostic dataset, which consists of 569 samples of tumors taken from real individuals. There are 32 columns: a patient ID, a diagnosis (benign or malignant), and 30 numeric measurements describing each tumor.

The goal of this project is to construct a predictive model on the known diagnoses in order to aid in the prediction of an unknown specimen. If we can build a good enough model, then given measurements of a tumor with an unknown diagnosis (i.e. we don’t yet know whether it is benign or malignant), we can predict whether we believe it to be benign or malignant, and use that information to help guide treatment decisions. For example, if we predict the tumor is malignant, this could help the patient receive medical intervention earlier and improve the prognosis. If we predict the tumor is benign, this could help the patient avoid costly and invasive surgery.

Before we can construct our model, however, we need to gather more information about our dataset so we can make the best decision about which model to use. We also need to clean up our dataset and remove unnecessary information like the patient ID. For this assignment, we want to do a few things: load the data, print some basic information about it, compute per-column statistics, visualize the features with a scatter matrix and a correlation heatmap, and split the data into training and testing sets.

Concretely, we are given a test script:

from mlmodel import readCSV, datasetInfo, advancedStats, scatterMatrix, correlationHeatmap, splitDataset

def main():
    df = readCSV(input("input a file name to load: "))
    print(datasetInfo(df))

    # collect some stats with pandas
    advancedStats(df)

    # show a scattermatrix of first 5 columns
    scatterMatrix(df, 5)

    # show a correlation heatmap of all columns
    correlationHeatmap(df)

    # split dataset into training and testing data, with separate labels
    # make sure data is randomly shuffled beforehand!!
    train, train_labels, test, test_labels = splitDataset(df, test_percentage=20)

    # Print some info about test/train split
    print("Test dataset has {} entries".format(len(test)))
    print("Train dataset has {} entries".format(len(train)))

This should yield the following output:

input a file name to load: wdbc.csv
{'rows': 569, 'columns': 31, 'benign': 357, 'malignant': 212}

Column 1 statistics:
	Skewness:0.9423795716730992	Kurtosis:0.8455216229065377
Column 2 statistics:
	Skewness:0.6504495420828159	Kurtosis:0.7583189723727752
Column 3 statistics:
	Skewness:0.9906504253930081	Kurtosis:0.9722135477110654
Column 4 statistics:
	Skewness:1.6457321756240424	Kurtosis:3.6523027623507582
Column 5 statistics:
	Skewness:0.45632376481955844	Kurtosis:0.8559749303632245
Column 6 statistics:
	Skewness:1.1901230311980404	Kurtosis:1.650130467219256
Column 7 statistics:
	Skewness:1.4011797389486722	Kurtosis:1.9986375291042124
Column 8 statistics:
	Skewness:1.1711800812336282	Kurtosis:1.066555702965477
Column 9 statistics:
	Skewness:0.7256089733641999	Kurtosis:1.2879329922294565
Column 10 statistics:
	Skewness:1.3044888125755076	Kurtosis:3.0058921201694933
Column 11 statistics:
	Skewness:3.0886121663847574	Kurtosis:17.686725966164644
Column 12 statistics:
	Skewness:1.646443808753053	Kurtosis:5.349168692469973
Column 13 statistics:
	Skewness:3.443615202194899	Kurtosis:21.40190492588045
Column 14 statistics:
	Skewness:5.447186284898394	Kurtosis:49.20907650724119
Column 15 statistics:
	Skewness:2.314450056636759	Kurtosis:10.469839532360393
Column 16 statistics:
	Skewness:1.9022207096378565	Kurtosis:5.10625248342338
Column 17 statistics:
	Skewness:5.110463049043661	Kurtosis:48.8613953017919
Column 18 statistics:
	Skewness:1.4446781446974786	Kurtosis:5.1263019430439565
Column 19 statistics:
	Skewness:2.1951328995478216	Kurtosis:7.896129827528971
Column 20 statistics:
	Skewness:3.923968620227413	Kurtosis:26.280847486373336
Column 21 statistics:
	Skewness:1.1031152059604372	Kurtosis:0.9440895758772196
Column 22 statistics:
	Skewness:0.49832130948716474	Kurtosis:0.22430186846478772
Column 23 statistics:
	Skewness:1.1281638713683722	Kurtosis:1.070149666654432
Column 24 statistics:
	Skewness:1.8593732724433467	Kurtosis:4.396394828992138
Column 25 statistics:
	Skewness:0.4154259962824678	Kurtosis:0.5178251903311124
Column 26 statistics:
	Skewness:1.4735549003297956	Kurtosis:3.0392881719200657
Column 27 statistics:
	Skewness:1.1502368219460262	Kurtosis:1.6152532975830205
Column 28 statistics:
	Skewness:0.49261552688550875	Kurtosis:-0.5355351225188589
Column 29 statistics:
	Skewness:1.433927765189328	Kurtosis:4.444559517846582
Column 30 statistics:
	Skewness:1.6625792663955146	Kurtosis:5.244610555815004

Dataframe statistics:        radius_mean   radius_se  radius_worst  texture_mean  texture_se  texture_worst  ...  symmetry_mean  symmetry_se  symmetry_worst  fractaldimension_mean  fractaldimension_se  fractaldimension_worst
count   569.000000  569.000000    569.000000    569.000000  569.000000     569.000000  ...     569.000000   569.000000      569.000000             569.000000           569.000000              569.000000
mean     14.127292   19.289649     91.969033    654.889104    0.096360       0.104341  ...       0.132369     0.254265        0.272188               0.114606             0.290076                0.083946
std       3.524049    4.301036     24.298981    351.914129    0.014064       0.052813  ...       0.022832     0.157336        0.208624               0.065732             0.061867                0.018061
min       6.981000    9.710000     43.790000    143.500000    0.052630       0.019380  ...       0.071170     0.027290        0.000000               0.000000             0.156500                0.055040
25%      11.700000   16.170000     75.170000    420.300000    0.086370       0.064920  ...       0.116600     0.147200        0.114500               0.064930             0.250400                0.071460
50%      13.370000   18.840000     86.240000    551.100000    0.095870       0.092630  ...       0.131300     0.211900        0.226700               0.099930             0.282200                0.080040
75%      15.780000   21.800000    104.100000    782.700000    0.105300       0.130400  ...       0.146000     0.339100        0.382900               0.161400             0.317900                0.092080
max      28.110000   39.280000    188.500000   2501.000000    0.163400       0.345400  ...       0.222600     1.058000        1.252000               0.291000             0.663800                0.207500

[8 rows x 30 columns]

Test dataset has 113 entries
Train dataset has 456 entries
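
The splitDataset function used in the test script is not shown above. As a rough illustration only (not the required implementation), here is a minimal sketch of datasetInfo and splitDataset. It assumes readCSV returns a pandas DataFrame that still contains a diagnosis column encoded as "B"/"M"; the actual skeleton may organize this differently.

def datasetInfo(df):
    # Summarize the DataFrame; assumes a "diagnosis" column encoded "B"/"M"
    return {
        "rows": len(df),
        "columns": len(df.columns),
        "benign": int((df["diagnosis"] == "B").sum()),
        "malignant": int((df["diagnosis"] == "M").sum()),
    }

def splitDataset(df, test_percentage=20):
    # Shuffle the rows first so the train/test split is random
    shuffled = df.sample(frac=1).reset_index(drop=True)

    # Number of rows reserved for testing
    test_size = int(len(shuffled) * test_percentage / 100)

    # Separate the labels (the assumed "diagnosis" column) from the features
    labels = shuffled["diagnosis"]
    features = shuffled.drop(columns=["diagnosis"])

    test = features.iloc[:test_size]
    test_labels = labels.iloc[:test_size]
    train = features.iloc[test_size:]
    train_labels = labels.iloc[test_size:]

    return train, train_labels, test, test_labels

With test_percentage=20 and 569 rows, a split like this yields 113 test entries and 456 training entries, matching the counts in the example output.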