BCH2202 - Winter 2023: Assignment 4

Opened: Wednesday, 29 March 2023, 9:30 AM

Due: Wednesday, 5 April 2023, 11:59 PM

Create a utilities file called ClassificationUtilities.R. In this file, implement the following functionality:

1a) Create a function which takes a file name or URL (a string) as an argument. The function should read the data from that location (downloading it if a URL is given) and split the data into training and testing data sets. It should then return these training and testing data sets. You may assume that the name of the column which contains the label data is "label".
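
As a starting point, the loading and splitting step in 1a could be sketched as follows. This is only one possible approach: the function name load.and.split, the 80/20 split, the optional seed argument, and the assumption that the file is whitespace-delimited with a header row are all illustrative choices, not requirements of the assignment.

```r
# Hypothetical helper (the name and default arguments are assumptions).
# Reads a whitespace-delimited file with a header row from a local path or
# a URL, then splits the rows at random into training and testing sets.
load.and.split <- function(source, train.fraction = 0.8, seed = NULL) {
  cat("Gathering data from", source, "\n")

  # read.table() accepts a local file name or a complete URL
  data <- read.table(source, header = TRUE)
  if (!"label" %in% names(data)) {
    stop("No 'label' column found in ", source)
  }

  # Treat the label as categorical so classifiers see classes, not numbers
  data$label <- as.factor(data$label)

  # An optional seed makes the random split reproducible
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(data)
  train.idx <- sample(n, size = floor(train.fraction * n))

  list(train = data[train.idx, , drop = FALSE],
       test  = data[-train.idx, , drop = FALSE])
}
```

Returning a named list lets the caller pick out the two pieces as split$train and split$test.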

1b) Create two functions, each of which takes a single mandatory argument. The argument is the training data set.

One function will build a Decision Tree model.
One function will build a Logistic Regression model.

The functions will build the aforementioned models, train them on the training data, and return the trained model. If a data set that contains more than 2 categories is passed to the Logistic Regression function, the function should print an error message (using the 'stop' command, for example) and exit.
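
The two model builders in 1b might look roughly like the sketch below. It assumes the rpart package for the decision tree and base R's glm() for logistic regression; the assignment does not prescribe either, and the function names are hypothetical.

```r
library(rpart)  # assumed choice of decision tree package

# Hypothetical builder: fit a classification tree on all other columns.
build.decision.tree <- function(train) {
  cat("Building Decision Tree model.\n")
  train$label <- factor(train$label)
  rpart(label ~ ., data = train, method = "class")
}

# Hypothetical builder: fit a binomial logistic regression.
# Stops with an error if the label has more than 2 categories.
build.logistic.regression <- function(train) {
  n.classes <- length(unique(train$label))
  if (n.classes > 2) {
    stop("Logistic Regression supports at most 2 categories, but found ",
         n.classes)
  }
  cat("Building Logistic Regression model.\n")
  # factor() also drops unused levels, so glm() sees exactly the classes present
  train$label <- factor(train$label)
  glm(label ~ ., data = train, family = binomial)
}
```

Checking the number of categories before fitting means the three-class seeds data fails fast with a readable message instead of producing a misleading model.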

1c) Create a function which takes 2 arguments, a pre-trained model and the test data set. The function should print out the confusion matrix for the model, and the model accuracy, based on the test data.
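
One way to approach the evaluation step in 1c is sketched below using only table() for the confusion matrix. The function name evaluate.model is hypothetical, the 0.5 probability threshold for logistic regression is an assumption, and the sketch assumes the label column was a factor when the model was trained. The built-in iris data is used purely as a stand-in for the assignment data sets.

```r
library(rpart)  # assumed decision tree package, used for the demo model

# Hypothetical evaluator: prints the confusion matrix and accuracy of a
# trained rpart or glm model on the test data.
evaluate.model <- function(model, test) {
  if (inherits(model, "glm")) {
    # For a factor response, predict() gives P(second level of the label)
    prob <- predict(model, newdata = test, type = "response")
    lv <- levels(model$model[[1]])  # label levels seen during training
    predicted <- factor(ifelse(prob > 0.5, lv[2], lv[1]), levels = lv)
  } else {
    predicted <- predict(model, newdata = test, type = "class")
  }

  # Align the reference labels with the predicted levels so the table is square
  reference <- factor(test$label, levels = levels(predicted))
  cm <- table(Prediction = predicted, Reference = reference)
  accuracy <- mean(predicted == reference)

  cat("Confusion matrix:\n")
  print(cm)
  cat("Accuracy:", accuracy, "\n")
  invisible(list(confusion = cm, accuracy = accuracy))
}

# Usage on iris, renaming the class column to "label"
data <- iris
names(data)[5] <- "label"
set.seed(1)
idx <- sample(nrow(data), 120)
tree <- rpart(label ~ ., data = data[idx, ], method = "class")
evaluate.model(tree, data[-idx, ])
```

Returning the results invisibly keeps the printed output clean while still allowing the values to be captured for testing.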

1d) Create a function which takes 2 arguments, a file name and a string argument that indicates whether a Decision Tree or Logistic Regression model should be built. The function will load the data, build and train the appropriate model on the training data set, and print out the confusion matrix and accuracy of the model based on the test data.
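
The model selection in 1d can be handled with switch(), as in the minimal sketch below. The code 'DT' matches the example session in this handout; the code 'LR' and the helper name select.builder are assumptions made for illustration, as is the use of rpart and glm().

```r
library(rpart)  # assumed decision tree package

# Hypothetical helper: map a model-type code to a function that trains
# that kind of model on a training set with a "label" column.
select.builder <- function(model.type) {
  switch(model.type,
    DT = function(train) rpart(label ~ ., data = train, method = "class"),
    LR = function(train) glm(label ~ ., data = train, family = binomial),
    # Unnamed last argument is the default when no code matches
    stop("Unknown model type '", model.type, "'; expected 'DT' or 'LR'")
  )
}
```

A top-level function could then load and split the data, call select.builder(model.type) to obtain the training function, and pass the resulting model to the evaluation step.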

You should test the functionality of your functions on the following 2 data sets.

The first is the 'seeds' data set, hosted at the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/seeds

This data consists of measurements of 3 types of wheat seed. By default the data does not contain the column names. I've created a new version of the data which does contain the column names. It can be found here.

The second data set is the 'cars' data set, which can be found here. This data set consists of multiple car models with their specifications.

Note that you should use a single function to handle either data set. Please have your functions print out statements describing what is happening: "Using data set ...", "Creating model ...".

Example session:


> source("ClassificationUtilities.R")
> 
> build.model('seeds_dataset2.txt', 'DT')
Gathering data from seeds_dataset2.txt
Building Decision Tree model.
Confusion matrix:
          Reference
Prediction  1  2  3
         1  8  0  6
         2  0 14  0
         3  0  0 14
Accuracy: 0.8571429
>

Submit your ClassificationUtilities.R file.

Assignments will be graded on a 10 point basis.

The due date is April 5th, 2023 at 11:59 PM, with a 0.5-point penalty per day for late submissions until the cut-off date of April 12th, 2023 at 9:00 AM.