DAT112 - Apr 2022: Assignment 1

Opened: Thursday, 12 May 2022, 12:00 PM

Due: Thursday, 19 May 2022, 11:59 PM

Due date: Thursday, May 19th, 2022 at midnight.

Consider the Dogs vs. Cats data set, which consists of a collection of photos of dogs and cats. I've created a modified version of this data set, in which I've scaled all the photos to 50 x 50 pixels, with black on the borders if the image was not a square. This data can be found here (192MB).

The goal of this assignment is to build the best neural network you can, which will categorize a given image into its respective category, dog or cat. Your model should overfit as little as possible while still getting the highest score it can on the test data.

Create a Python script, called "dogs_cats_nn.py", which performs the following steps:

reads in the dogs vs. cats data set, given in the link above (the numpy function "load" will be helpful here). Note that the data will be returned as a dictionary, with the keys 'images' and 'labels'. You may assume that the data file is co-located with the script; the file name may be hard-coded.
splits the input and target data into training and testing data sets, with 20% in the test set,
builds a neural network, using Keras, to predict the category of the input images,
trains the network on the training data, and prints out the final training accuracy,
evaluates the network on the test data, and prints out the test accuracy,
creates a plot of the model's training loss as a function of epoch.
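The first two steps can be sketched as follows. This is only an illustration under stated assumptions, not the expected solution: it assumes the data file is an .npz archive (so that numpy's "load" returns a dictionary-like object), that pixel values are 8-bit integers, and that the images have three colour channels. The file name "demo_dogs_cats.npz" and the helper "load_and_split" are hypothetical, and random stand-in data is used so the sketch runs without the real file.

```python
import os
import tempfile

import numpy as np


def load_and_split(path, test_fraction=0.2, seed=0):
    """Load 'images' and 'labels' from an .npz file and split them."""
    data = np.load(path)  # for an .npz archive this behaves like a dictionary
    # Assumption: pixels are stored as integers in 0..255; scale to [0, 1].
    images = data["images"].astype("float32") / 255.0
    labels = data["labels"]

    # Shuffle once so both sets contain a mix of dogs and cats.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_test = int(round(test_fraction * len(images)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return images[train_idx], images[test_idx], labels[train_idx], labels[test_idx]


# Stand-in data (hypothetical) so the sketch runs without the real 192MB file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_dogs_cats.npz")
fake_images = np.random.randint(0, 256, size=(100, 50, 50, 3), dtype=np.uint8)
fake_labels = np.arange(100) % 2
np.savez(demo_path, images=fake_images, labels=fake_labels)

x_train, x_test, y_train, y_test = load_and_split(demo_path)
print(x_train.shape, x_test.shape)  # (80, 50, 50, 3) (20, 50, 50, 3)
```

If the real file turns out to be stored differently (for example as a pickled dictionary), the loading line would need to change accordingly; the splitting logic stays the same.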

Your script will be tested from the Linux command line, thus:

    $ python dogs_cats_nn.py
    Reading dogs vs. cats data file.
    Building network.
    Training network.
    The training score is [0.4612, 0.7828]
    The test score is [0.4664459792613983, 0.7906]
    $

Be sure to address, to the best of your ability, the main problem with this data set: it's too small (even though it's got 25,000 photos), so neural networks applied to it tend to overfit. Explore the following ways of reducing overfitting:

Experiment with creating the smallest network you reasonably can.
Explore creating new, artificial data using the ImageDataGenerator class, which can be found in the tensorflow.keras.preprocessing.image subpackage. You can read about how to use this subpackage here. Use this enlarged data set to train your model.
Experiment with regularization or dropout.
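The three ideas above could be combined roughly as in the sketch below. It is only a starting point under several assumptions: TensorFlow and SciPy are installed (the image transformations rely on SciPy), the images are 50 x 50 with three colour channels, and the labels are encoded as 0 and 1. The layer sizes, the L2 strength, the dropout rate and the augmentation settings are arbitrary illustrative values, not the ones behind the 78% result, and "build_small_cnn" is a hypothetical helper name.

```python
import numpy as np
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def build_small_cnn(input_shape=(50, 50, 3), l2_strength=1e-4, dropout_rate=0.5):
    """A deliberately small CNN with L2 weight penalties and dropout."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu",
                      kernel_regularizer=regularizers.l2(l2_strength)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu",
                      kernel_regularizer=regularizers.l2(l2_strength)),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(dropout_rate),  # randomly zero activations during training
        layers.Dense(1, activation="sigmoid"),  # assumes labels are 0 or 1
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model


# Random rotations, shifts and flips produce new variants of each training image.
augmenter = ImageDataGenerator(rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)

# Stand-in data (hypothetical) so the sketch runs without the real file.
x_demo = np.random.rand(32, 50, 50, 3).astype("float32")
y_demo = np.arange(32) % 2

model = build_small_cnn()
history = model.fit(augmenter.flow(x_demo, y_demo, batch_size=16),
                    epochs=1, verbose=0)
```

Only the training data should pass through the augmenter; the test set is evaluated unchanged so that the test score reflects real photos.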

Experiment with your hyperparameters to create the best model you can which minimizes overfitting. You should run the training until the loss stops improving, as demonstrated by your plot. The best model I have found in which the training and testing accuracies are similar returns a training and test accuracy of about 78%. See if you can do better.
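One way to make "until the loss stops improving" concrete is a patience rule: stop once the loss has failed to improve by some minimum amount for a few epochs in a row. The plain-Python sketch below illustrates the rule on a made-up loss history; the helper name "epochs_until_plateau", the patience of 3 and the threshold of 0.001 are assumptions for illustration. In Keras, the per-epoch losses are available as history.history["loss"] after calling fit, and the EarlyStopping callback in tensorflow.keras.callbacks applies a similar rule during training.

```python
def epochs_until_plateau(losses, patience=3, min_delta=1e-3):
    """Return the 1-based epoch with the best loss once `patience` epochs
    pass without an improvement of at least `min_delta`; if that never
    happens, return the total number of epochs."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss - min_delta:
            # A real improvement: remember it and keep training.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: the loss has plateaued.
            return best_epoch
    return len(losses)


# Made-up loss values, for illustration only.
demo_losses = [0.69, 0.62, 0.57, 0.53, 0.51, 0.5098, 0.5095, 0.51, 0.512]
print(epochs_until_plateau(demo_losses))  # 5
```

The same flattening should be visible by eye in the plot of training loss against epoch.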

Submit your script which generates and trains your best model. The script will be graded on functionality, but also on form. This means your script should use meaningful variable names and be well commented.

Submit your dogs_cats_nn.py, and the final plot of your training loss.

Assignments will be graded on a 10 point basis.
The due date is May 19th, 2022 (midnight). Late submissions lose 0.5 points per day until the cut-off date of May 26th, at 11:00am.