Assignment 11 - Makeup
0) You must use version control ("git") as you develop your code. We suggest you start, from the Linux command line, by creating a new directory, e.g. assignment11, cd into that directory and initialize a git repository ("git init") within it, and perform "git add ..., git commit" repeatedly as you add to your code. You will hand in the output of "git log" for your assignment repository as part of the assignment. You must have a significant number of commits representing the modifications, alterations and changes to your code. If your log does not show a significant number of commits with meaningful comments you will lose marks.
1) Create a utilities file, called ClassificationUtilities.py. We will be using Python for this assignment.
1a) Create a function, which takes a filename as an argument. The function should load the file, separate the label column from the rest of the data, and split the data into training and testing data sets. It should then return the training and testing features and labels (targets). You may assume that the name of the column which contains the label data is "label".
The pandas library can be used to read csv files:
>>> import pandas as pd
>>> data = pd.read_csv(filename)
A handy way to extract the "label" column from a pandas data frame is:
>>> label = data.pop("label")
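A minimal sketch of how the function in 1a) might look, assuming scikit-learn's train_test_split is used for the split. The function name load_and_split, the 80/20 split and the fixed random seed are illustrative assumptions, not requirements of the assignment. The small in-memory CSV at the end contains made-up values and only stands in for a real file.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split


def load_and_split(filename, test_fraction=0.2, seed=0):
    """Read a CSV file and return training/testing features and labels.

    The function name, test_fraction and seed are hypothetical choices.
    """
    data = pd.read_csv(filename)
    # pop() removes the "label" column from the data frame and returns it,
    # so 'data' holds only the features afterwards.
    labels = data.pop("label")
    # Returned order: train features, test features, train labels, test labels.
    return train_test_split(data, labels, test_size=test_fraction, random_state=seed)


# Made-up 10-row data set; read_csv also accepts file-like objects.
demo_csv = io.StringIO(
    "a,b,label\n" + "".join(f"{i},{2 * i},{i % 2}\n" for i in range(10))
)
X_train, X_test, y_train, y_test = load_and_split(demo_csv)
print(X_train.shape, X_test.shape)
```

Fixing the random seed makes the split reproducible between runs, which is convenient while debugging.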
1b) Create a function, which takes 3 mandatory arguments. The first two mandatory arguments are the training data features and labels. The third mandatory argument is a string, which will be used to determine what kind of classification model to create.
The function will examine the third argument to determine what kind of classification model to create. If the third argument is
- "DT" the function will build a Decision Tree model,
- "LR" the function will build a Logistic Regression model.
Notice that "kNN" is not in the above list. The function will build the aforementioned model, train it on the training data, and return the model.
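One possible shape for the function in 1b), sketched with scikit-learn's DecisionTreeClassifier and LogisticRegression. The function name build_model, the max_iter setting and the decision to raise a ValueError for unknown model types are assumptions made for illustration. The tiny data set at the end is made up.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def build_model(train_features, train_labels, model_type):
    """Build and train the model selected by model_type ("DT" or "LR")."""
    if model_type == "DT":
        print("Building Decision Tree model.")
        model = DecisionTreeClassifier()
    elif model_type == "LR":
        print("Building Logistic Regression model.")
        # max_iter is raised above the default as a precaution (an assumption).
        model = LogisticRegression(max_iter=1000)
    else:
        # "kNN" is deliberately not handled here; see 1c).
        raise ValueError(f"Unknown model type: {model_type!r}")
    model.fit(train_features, train_labels)
    return model


# Made-up training data: one feature, two classes.
demo_features = [[0.0], [1.0], [2.0], [3.0]]
demo_labels = [0, 0, 1, 1]
dt_model = build_model(demo_features, demo_labels, "DT")
lr_model = build_model(demo_features, demo_labels, "LR")
```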
1c) Create a function which takes 2 arguments, the training data features and labels. The function should perform 10-fold cross-validation on a kNN model, using the supplied training data, for k values from 1 to 31 inclusive, to determine the optimal value of k to use in a kNN model for this training data (the numpy function 'argmax' may be useful here). The function should print out a sentence indicating the type of model being built, and the optimal value of k. The function should build a final kNN model using the optimal value of k, train it on the training data, and return it.
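The search over k described in 1c) could be sketched as follows, assuming scikit-learn's cross_val_score and KNeighborsClassifier. The function name build_knn_model is hypothetical, and the two-cluster data at the end is randomly generated purely for demonstration. Note the sketch assumes the training set is large enough that every cross-validation fold still contains at least 31 training rows.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def build_knn_model(train_features, train_labels):
    """Choose k by 10-fold cross-validation, then train a final kNN model."""
    print("Building kNN model.")
    k_values = list(range(1, 32))  # 1 to 31 inclusive
    # Mean accuracy over the 10 folds for each candidate k.
    mean_scores = [
        cross_val_score(
            KNeighborsClassifier(n_neighbors=k), train_features, train_labels, cv=10
        ).mean()
        for k in k_values
    ]
    # argmax returns the position of the first maximum; map it back to k.
    best_k = k_values[int(np.argmax(mean_scores))]
    print(f"Optimal value of k is {best_k}.")
    model = KNeighborsClassifier(n_neighbors=best_k)
    model.fit(train_features, train_labels)
    return model


# Randomly generated demonstration data: two well-separated clusters.
rng = np.random.default_rng(0)
demo_features = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
demo_labels = np.array([0] * 50 + [1] * 50)
knn_model = build_knn_model(demo_features, demo_labels)
```

When several values of k tie for the best score, argmax picks the smallest of them.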
1d) Create a function which takes 3 arguments, a pretrained model and the test data features and labels. The function should print out the confusion matrix for the model, and the model accuracy, based on the test data.
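A short sketch of the evaluation step in 1d), assuming scikit-learn's confusion_matrix and accuracy_score. The function name print_evaluation is hypothetical, and the demonstration data is made up; for brevity the demo evaluates on the same rows it trained on, whereas the assignment expects separate test data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier


def print_evaluation(model, test_features, test_labels):
    """Print the confusion matrix and accuracy of a trained model."""
    predictions = model.predict(test_features)
    print("Confusion matrix:")
    # Rows are the true classes, columns are the predicted classes.
    print(confusion_matrix(test_labels, predictions))
    print("Model accuracy:", accuracy_score(test_labels, predictions))


# Made-up data, used here for both fitting and evaluation (demo only).
demo_features = [[0.0], [1.0], [2.0], [3.0]]
demo_labels = [0, 0, 1, 1]
demo_model = DecisionTreeClassifier().fit(demo_features, demo_labels)
print_evaluation(demo_model, demo_features, demo_labels)
```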
1e) Create a function, which takes 3 arguments, a pretrained model and the test data features and labels. The function should print out the ROC area under the curve for the model, based on the test data. However, it should only print out the area under the curve IF there are only 2 categories. If there are more than two categories then the function should print nothing.
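The two-category check in 1e) might be handled as in the sketch below, which assumes scikit-learn's roc_auc_score and a model that provides predict_proba (the scikit-learn decision tree, logistic regression and kNN classifiers all do). The names compute_auc and print_auc are hypothetical, and counting categories via model.classes_ (the classes seen during training) is one possible interpretation. The demonstration data is made up.

```python
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier


def compute_auc(model, test_features, test_labels):
    """Return the ROC AUC for a two-class model, or None otherwise."""
    if len(model.classes_) != 2:
        return None
    # Column 1 holds the predicted probability of model.classes_[1].
    positive_scores = model.predict_proba(test_features)[:, 1]
    return roc_auc_score(test_labels, positive_scores)


def print_auc(model, test_features, test_labels):
    """Print the ROC AUC, but only when there are exactly 2 categories."""
    auc = compute_auc(model, test_features, test_labels)
    if auc is not None:
        print("ROC AUC:", auc)


# Made-up two-class and three-class examples (evaluated on training rows).
two_X, two_y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
three_X, three_y = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]], [0, 0, 1, 1, 2, 2]
two_class_model = DecisionTreeClassifier().fit(two_X, two_y)
three_class_model = DecisionTreeClassifier().fit(three_X, three_y)
print_auc(two_class_model, two_X, two_y)        # prints a ROC AUC line
print_auc(three_class_model, three_X, three_y)  # prints nothing
```

Splitting the computation from the printing is a design choice made here so that the value can be checked without capturing printed output.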
You should test the functionality of your functions, and your driver script below, on the following 2 data sets.
The first is the 'seeds' data set, hosted at the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/seeds
This data consists of measurements of 3 types of wheat seed. By default the data does not contain the column names. I've created a new version of the data which does contain the column names. It can be found here.
The second data set is the 'cars' data set, which can be found here.
This data set consists of multiple car models along with their specifications.
Note that you should use a single function to handle either data set.
2) Create a driver script, called ClassificationDriver.py, that takes two mandatory command line arguments. The first argument should be the file name of the data set to be examined. The second should be the type of model to be used, either "DT", "kNN", or "LR". When the script is called, the driver script should load the data in question, create the model in question, train it on the training data, and print out the confusion matrix and model accuracy generated from running the model on the test data.
Unlike when we were using R, for your Python scripts you are expected to use the argparse package to automate the handling of your command line arguments.
If the second argument is "kNN", the script should perform 10-fold cross-validation to determine the optimal value of k to use with this data, and use that value of k when building the final kNN model.
The script should also take an optional third argument, "--AUC". If this argument is supplied the script should print out the ROC area under the curve for the model and the test data, but only if there are exactly 2 categories. Otherwise the script should print nothing for this option.
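The command line interface described above could be set up with argparse roughly as follows. This is a sketch: the function name parse_arguments and the help strings are assumptions, and the rest of the driver logic is left out.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the driver's command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(
        description="Train and evaluate a classification model."
    )
    parser.add_argument("filename", help="file name of the data set to examine")
    # 'choices' makes argparse reject any model type other than these three.
    parser.add_argument(
        "model_type", choices=["DT", "kNN", "LR"], help="type of model to build"
    )
    # store_true: args.AUC is True when --AUC is given, False otherwise.
    parser.add_argument(
        "--AUC", action="store_true", help="print the ROC area under the curve"
    )
    return parser.parse_args(argv)


# Passing an explicit list mimics the command line for demonstration.
args = parse_arguments(["seeds_dataset2.txt", "LR", "--AUC"])
print(args.filename, args.model_type, args.AUC)
```

Accepting an optional argv list is a convenience for testing; in the driver itself the function would typically be called with no arguments.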
Please have your script print out statements describing what is happening: "Using data set ...", "Creating model ...".
$ python ClassificationDriver.py seeds_dataset2.txt LR --AUC
Gathering data from seeds_dataset2.txt
Building Logistic Regression model.
Confusion matrix:
[[14  0  2]
 [ 0 16  0]
 [ 0  0 10]]
Model accuracy: 0.9523809523809523
$
$ python ClassificationDriver.py cars_labeled.csv kNN --AUC
Gathering data from cars_labeled.csv
Building kNN model.
Optimal value of k is 3.
Confusion matrix:
[[13  3]
 [ 1  9]]
Model accuracy: 0.8461538461538461
ROC AUC: 0.9624999999999999
$
Submit your utilities file and driver script, and the output of 'git log' for this assignment. As usual, the scripts should be well documented, explaining the logic of each step; we will take points off for poorly-commented code. Be sure to use defensive programming for your command line arguments.
To capture the output of 'git log' use redirection, git log > git.log, and hand in the "git.log" file.
Assignments will be graded on a 10 point basis.
Due date is April 11th 2024 (midnight). Late assignments will NOT be accepted.