## ENGI E1006: Introduction to Computing for Engineers and Applied Scientists
---

`scikit-learn` is a powerful library built on `numpy` that implements a large number of **Machine Learning** algorithms and tools.


Part of `scikit-learn`'s power is its simple, polymorphic interface. Most models follow the same three steps:

```python
# instantiate the model
model = ScikitLearnModel(model_specific_parameters...)

# fit the model to the training data
model.fit(KnownData, KnownDataLabels)

# predict new data
model.predict(UnknownData)
```

This flexibility lets us build a lot of powerful tools, with very little code!
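As a concrete instance of the pattern above, here is a minimal sketch on a tiny hand-made dataset. The toy values and the names `known_data`, `known_labels`, and `predictions` are made up for illustration; `KNeighborsClassifier` is the model used later in this notebook.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (illustrative only): one feature, two well-separated classes
known_data = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
known_labels = [0, 0, 0, 1, 1, 1]

# instantiate the model
model = KNeighborsClassifier(n_neighbors=3)

# fit the model to the known data
model.fit(known_data, known_labels)

# predict labels for two new points, one near each cluster
predictions = model.predict([[0.05], [1.05]])
print(predictions)
```

Swapping in a different classifier should only require changing the line that instantiates the model.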

In this notebook, we will explore some of scikit-learn's functionality using the **Iris** dataset. This is a classic dataset from machine learning. More information can be found here: https://en.wikipedia.org/wiki/Iris_flower_data_set


In [None]:
import sklearn # notice that scikit-learn's import name is sklearn!
from sklearn.datasets import load_iris

In [None]:
iris_data = load_iris()

In [None]:
iris_data

In [None]:
iris_data.keys()

In [None]:
iris_data['data']

In [None]:
iris_data['target']

In [None]:
iris_data['target_names']

In [None]:
iris_data['feature_names']

`scikit-learn` is built on `numpy`, but we can leverage `pandas` and `seaborn` to do some **exploratory data analysis**.

In [None]:
import pandas as pd
import seaborn as sns

df_orig = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
df_orig['target'] = iris_data.target
df_orig

In [None]:
# Let's swap target for target name
df = df_orig.copy()
df['target'] = df['target'].map({i: iris_data.target_names[i] for i in range(len(iris_data.target_names))})
df

In [None]:
# Now let's do some plotting
sns.pairplot(df, hue='target')

Cool!

This might seem like it's just for fun, but our data analysis has shown us something important. The pairplot shows that our dataset appears separable: the species form distinct clusters, so plenty of algorithms should work on it. If you recall the randomly generated dataset from previous lectures, it exhibited almost no separability, meaning we'd have a very difficult time constructing an algorithm around it.
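One rough way to check separability numerically, rather than by eye, is to compare the range of a single feature across species. The sketch below rebuilds the DataFrame so it runs on its own; the names `iris`, `df_check`, and `ranges` are illustrative, and looking at one feature is only a partial check, not a full test of separability.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Rebuild a DataFrame with species names as the target column
df_check = pd.DataFrame(iris.data, columns=iris.feature_names)
df_check['target'] = [iris.target_names[i] for i in iris.target]

# Min and max petal length for each species
ranges = df_check.groupby('target')['petal length (cm)'].agg(['min', 'max'])
print(ranges)

# If one species' max is below another's min, that single feature separates them
print(ranges.loc['setosa', 'max'] < ranges.loc['versicolor', 'min'])
```

Non-overlapping ranges on a single feature mean a simple threshold on that feature is enough to tell those two species apart.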

Let's go ahead and leverage scikit-learn's **K Nearest Neighbors** algorithm. In order to judge its accuracy, we'll separate some of our data for later testing.

In [None]:
import numpy as np

all_data = df_orig.to_numpy()
all_data[:10]

In [None]:
# since the targets are in order, let's shuffle our data
np.random.shuffle(all_data)
all_data[:10]

In [None]:
# Now let's reserve 20% for testing
training_data = all_data[ :int(.8*len(all_data))]
testing_data = all_data[int(.8*len(all_data)): ]

print(len(all_data), len(training_data), len(testing_data))

In [None]:
# and finally let's separate the labels
training_data_labels = training_data[ : , -1: ] # grab just the labels column
training_data_labels = training_data_labels.reshape(len(training_data_labels)) # reshape as vector
training_data = training_data[ : , :-1] # slice out the labels column

print(training_data_labels.shape, training_data.shape)
print(training_data_labels[0], training_data[0])

In [None]:
# and for testing data
testing_data_labels = testing_data[:,-1:] # grab just the labels column
testing_data_labels = testing_data_labels.reshape(len(testing_data_labels)) # reshape as vector
testing_data = testing_data[:,:-1] # slice out the labels column
print(testing_data_labels.shape, testing_data.shape)
print(testing_data_labels[0], testing_data[0])
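The shuffle, 80/20 split, and label separation done by hand above can also be done in one call with scikit-learn's `train_test_split`. This is a sketch of an alternative, not what the cells above do; the names `X_train`, `X_test`, `y_train`, `y_test` and the fixed `random_state=0` are choices made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Shuffle and split features and labels together, reserving 20% for testing.
# random_state fixes the shuffle so the split is reproducible (an assumption
# made here; the manual shuffle above is not seeded).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, shuffle=True, random_state=0
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Doing the split by hand is useful for seeing what is going on; the helper is mostly a convenience that keeps features and labels aligned.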

-----

Now we're ready to construct our KNN model. For this, let's use `K = 5`.

Scikit-learn has extensive documentation on every model. The KNN docs are here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# first, we instantiate our model
model = KNeighborsClassifier(n_neighbors=5)

In [None]:
# Now, we train the model on the training data and training labels
model.fit(training_data, training_data_labels)

In [None]:
# finally, we generate a numpy array of predictions for the testing data
predicted_test_labels = model.predict(testing_data)
predicted_test_labels

In [None]:
testing_data_labels

In [None]:
# At this point, we can do a simple comparison to the known test labels to get an accuracy
sum(predicted_test_labels == testing_data_labels) / len(testing_data_labels) * 100.
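The same accuracy can also be computed with scikit-learn's own helpers. The sketch below is self-contained, so it makes its own split with `train_test_split` and uses illustrative names (`X_train`, `manual`, `via_metric`, `via_score`) rather than the variables defined above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predicted = model.predict(X_test)

# Three equivalent ways to get the fraction of correct predictions
manual = np.sum(predicted == y_test) / len(y_test)   # same idea as the cell above
via_metric = accuracy_score(y_test, predicted)        # helper from sklearn.metrics
via_score = model.score(X_test, y_test)               # classifiers report accuracy here

print(manual * 100., via_metric * 100., via_score * 100.)
```

All three should print the same percentage; `model.score` is the shortest when the predictions themselves are not needed.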

Wow, close to 100% accuracy! Looking at the pairplot, this makes sense: the classes are well separated. Because the shuffle is random, the exact number will vary from run to run. What if we tried just the 1 nearest neighbor?

In [None]:
model = KNeighborsClassifier(n_neighbors=1)
model.fit(training_data, training_data_labels)
sum(model.predict(testing_data) == testing_data_labels) / len(testing_data_labels) * 100.

And just for fun, what if we picked too many neighbors?

In [None]:
model = KNeighborsClassifier(n_neighbors=100)
model.fit(training_data, training_data_labels)
sum(model.predict(testing_data) == testing_data_labels) / len(testing_data_labels) * 100.

This makes sense: the training set has only about 40 samples of each species, so a vote among 100 neighbors always includes many points from the other classes!
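To compare several values of K at once, the three experiments above can be folded into a loop. This is a sketch under its own assumptions: it makes a fresh, seeded split with `train_test_split`, and the names `accuracy_by_k` and the particular K values tried are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# Fit one model per K and record its test accuracy as a percentage.
# K must not exceed the number of training samples (120 here).
accuracy_by_k = {}
for k in (1, 5, 25, 100):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracy_by_k[k] = model.score(X_test, y_test) * 100.

for k, accuracy in accuracy_by_k.items():
    print(k, accuracy)
```

The exact numbers depend on the split, so it is worth rerunning with a different `random_state` before drawing conclusions about the best K.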


To illustrate the power of scikit-learn, let's pick another model. For this one, let's use a support vector machine. Recall from lecture that support vector machines (SVMs for short) attempt to find a boundary, known as a hyperplane (the higher-dimensional analogue of a line), that separates the classes. Because our dataset looks pretty separable, this should be easy for the model to do.

The documentation for the SVM classifier is here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

In [None]:
from sklearn.svm import SVC

In [None]:
# first, we instantiate our model
model = SVC()

In [None]:
# Now, we train the model on the training data and training labels
model.fit(training_data, training_data_labels)

In [None]:
# finally, we generate a numpy array of predictions for the testing data
predicted_test_labels = model.predict(testing_data)
predicted_test_labels

In [None]:
# At this point, we can do a simple comparison to the known test labels to get an accuracy
sum(predicted_test_labels == testing_data_labels) / len(testing_data_labels) * 100.
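Because both models share the same `fit` / `predict` interface, the KNN and SVM experiments above can be run with identical code. The sketch below is self-contained and uses its own seeded split; the dictionary names `models` and `results` and the labels used as keys are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# Two different algorithms, handled by exactly the same loop body
models = {
    "knn (k=5)": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)                        # train
    results[name] = model.score(X_test, y_test) * 100.  # test accuracy in percent
    print(name, results[name])
```

Adding a third classifier to the comparison should only take one more entry in `models`.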