Assignment 7
Suppose you build a categorical machine-learning model. One question you must ask yourself is: is this a good model, or is it just getting the right answers by accident? How can you tell? The approach we've been taking thus far has been to do a train-test split of the data: we train on the training data and test on the test data. If the model does well on the out-of-sample (test) data, then we can have some confidence that our model is working correctly.
That's a good approach, and the one that is generally recommended. What should you do, however, if you don't have enough data to do a good train-test split? What are your options in that case?
One option is to perform a hypothesis test on the classification model, against the null hypothesis that your model has no predictive power. This is done with a permutation test: we repeatedly permute the labels of the data set and see how the model does on the permuted data, relative to the original data. If the model does noticeably worse on the permuted data, we have some evidence that the model is capturing a real relationship in the original data.
We will perform an analysis similar to that of Ojala and Garriga (2010). In their analysis, they calculated a \(p\) value for the null hypothesis that their model was indistinguishable from a model that randomly guesses, by performing the following steps:
- perform a leave-one-out cross-validation for a decision tree on the full data set, returning the average error, where the average error is defined as $$e = \frac{1}{n}\sum_{i=1}^{n} I(f_i(x_i) \ne y_i)$$ where the sum is over every data point, \(f_i\) is the model built without the \(i\)th data point, and \(I(x)\) is the indicator function: $$I(x) = \begin{cases} 1 & x = \text{True}\\ 0 & x = \text{False} \end{cases}$$ In essence, the average error \(e\) is just the misclassification rate of the model.
- Then, \(m\) times:
  - randomly permute the labels of the data set,
  - perform leave-one-out cross-validation for a decision tree on this new permuted data set,
  - return the average error, as defined above.
- Once the permutations have been performed, the \(p\) value is calculated using the equation $$p = \frac{\left(\sum_j I(e_j \le e_{\text{full}})\right) + 1}{m+1}$$ where \(e_j\) is the average error from the model for the \(j\)th permutation of the data, \(e_{\text{full}}\) is the average error from the full, unpermuted data set, and \(m\) is the number of permutations of the data. For example, if 4 of \(m = 100\) permutations yield an error no larger than \(e_{\text{full}}\), then \(p = 5/101 \approx 0.05\).
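The \(p\) value formula translates directly into R. Here is a minimal sketch, assuming e_perm is a vector holding the \(m\) permuted-data errors and e_full is the full-data error (both variable names are illustrative):

```r
# p-value for the permutation test, as defined above; e_perm and e_full
# are assumed to have been computed by leave-one-out cross-validation.
p <- (sum(e_perm <= e_full) + 1) / (length(e_perm) + 1)
```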
For this assignment we're going to build a classification model for the Pediatric Appendicitis data set, which contains data on patients with suspected appendicitis at Children's Hospital St. Hedwig in Regensburg, Germany, between 2016 and 2021. The original data has a great many columns. I've reduced the number of columns and rows for the purposes of this assignment; you can find that data here. The target in this case is the 'Diagnosis' column.
0) You must use version control ("git") as you develop your code. We suggest you start, from the Linux command line, by creating a new directory, e.g. assignment7, cd-ing into that directory, and initializing a git repository ("git init") within it, then performing "git add ..., git commit" repeatedly as you add to your code. You will hand in the output of "git log" for your assignment repository as part of the assignment. You must have a significant number of commits representing the modifications, alterations and changes to your code. If your log does not show a significant number of commits with meaningful comments, you will lose marks.
1) Create a file named Appendicitis.Utilities.R containing the following functions.
1a) Create a function which loads the file above, and returns the data.
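A minimal sketch of such a loader, assuming the reduced data set was saved as a CSV file; the file and function names here are placeholders, not prescribed:

```r
# Load the reduced appendicitis data; "appendicitis.csv" is a placeholder
# for wherever you saved the downloaded file.
load_data <- function(file = "appendicitis.csv") {
  data <- read.csv(file)
  data$Diagnosis <- as.factor(data$Diagnosis)  # ensure the target is categorical
  data
}
```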
1b) Create a function that, given a data set, will perform leave-one-out cross-validation, using a decision tree, on the data set. It should return the average error, as defined above.
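One way to structure this function, sketched with the rpart package; the function name, and the assumption that the target column is called Diagnosis, are illustrative:

```r
library(rpart)

# Leave-one-out cross-validation for a classification tree, returning the
# average error e defined above. Assumes a factor column named "Diagnosis".
loocv_error <- function(data) {
  n <- nrow(data)
  misclassified <- numeric(n)
  for (i in seq_len(n)) {
    # Fit the tree on every row except the i'th...
    model <- rpart(Diagnosis ~ ., data = data[-i, ], method = "class")
    # ...then predict the held-out row and record any misclassification.
    pred <- predict(model, newdata = data[i, ], type = "class")
    misclassified[i] <- as.integer(as.character(pred) !=
                                   as.character(data$Diagnosis[i]))
  }
  mean(misclassified)  # the average error e
}
```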
1c) Write a function which takes two arguments: a data set and \(m\), the number of times to permute the data set. The function should perform the following steps (a sketch in R follows this list):
- Calculates the average error on the data set.
- Repeats \(m\) times:
  - copies the original data set,
  - permutes the labels of the copy,
  - calculates the average error resulting from leave-one-out cross-validation of a decision tree on this new data set.
- Calculates and returns \(p\), using the formula given above.
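Putting the pieces together, a sketch of this function, reusing the hypothetical loocv_error() from the previous sketch:

```r
# Permutation test: returns the p-value for the null hypothesis that the
# decision tree has no predictive power. Uses loocv_error() from above.
permutation_test <- function(data, m) {
  e_full <- loocv_error(data)               # error on the original labels
  e_perm <- numeric(m)
  for (j in seq_len(m)) {
    permuted <- data                        # copy the original data set
    permuted$Diagnosis <- sample(permuted$Diagnosis)  # shuffle the labels
    e_perm[j] <- loocv_error(permuted)
  }
  (sum(e_perm <= e_full) + 1) / (m + 1)     # the p-value defined above
}
```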
2) Create an R script called Appendicitis.Analysis.R that will perform the following steps:
- sources your utilities file Appendicitis.Utilities.R,
- takes an argument from the command line, the file to be processed,
- reads in the data,
- calculates the \(p\) value for the null hypothesis that a decision tree model has no more predictive power than a model that generates random results, by permuting the labels 100 times,
- prints out a sentence indicating whether the null hypothesis can safely be rejected, assuming a significance level of 0.05,
- defends itself against incorrect command-line arguments in the usual way (a minimal sketch follows this list).
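For the argument checking, one common pattern, assuming the script is invoked as "Rscript Appendicitis.Analysis.R datafile":

```r
# Grab the trailing command-line arguments and sanity-check them.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 1) {
  stop("Usage: Rscript Appendicitis.Analysis.R <datafile>")
}
if (!file.exists(args[1])) {
  stop("Input file '", args[1], "' not found.")
}
```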
Submit your Appendicitis.Analysis.R script file and Appendicitis.Utilities.R file, and the output of "git log" from your assignment repository. To capture the output of "git log", use redirection, "git log > git.log", and hand in the git.log file.
Assignments will be graded on a 10-point basis. The due date is November 7th, 2024 at 11:59pm, with a 0.5 point penalty per day for late submission, until the cut-off date of November 14, 2024 at 10:00am.