DAT112 - Apr 2023: Assignment 1

Opened: Thursday, 4 May 2023, 11:30 AM

Due: Thursday, 11 May 2023, 11:59 PM

Due date: Thursday, May 11th, 2023 at midnight.

Consider the Riseholme data set, which consists of photos of strawberries in various stages of development. I've created a modified version of this data set, in which all the photos have been rescaled and padded with black bands so that they have the same dimensions. This data can be found here (228MB).

>>> import numpy as np
>>> data = np.load('strawberries.npz')
>>> x = data['x']
>>> y = data['y']

The goal of this assignment is to build the best neural network you can, which will categorize a given image into its respective category: Occluded, Ripe, or Unripe. Your model should overfit as little as possible while still getting the highest score it can on the test data.

Create a Python script, called "strawberry_nn.py", which performs the following steps:

reads in the strawberries data set, given in the link above (the numpy function "load" will be helpful here). You may assume that the data file is colocated with the script; the file name may be hard-coded. Note that the data will be returned as a dictionary-like object, with the keys 'x' and 'y'.
splits the input and target data into training and testing data sets, with 20% in the test set.
builds a neural network, using Keras, to predict the category of the input images,
trains the network on the training data, and prints out the final training accuracy,
evaluates the network on the test data, and prints out the test accuracy.
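
The first two steps (reading the file and holding out 20% for testing) might be sketched as follows. This is only an illustration under stated assumptions: the helper name split_train_test, the tiny synthetic file demo.npz, the 8x8 image shape, and the integer labels are all hypothetical stand-ins, since the real layout of strawberries.npz is not described here. scikit-learn's train_test_split is a common alternative to the hand-written split.

```python
import os
import tempfile

import numpy as np


def split_train_test(x, y, test_fraction=0.2, seed=0):
    """Shuffle x and y together, then hold out test_fraction of the rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    num_test = int(round(test_fraction * len(x)))
    test_idx, train_idx = order[:num_test], order[num_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Tiny synthetic stand-in for strawberries.npz.
# The shapes and the integer label encoding are assumptions.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(demo_path,
         x=np.zeros((10, 8, 8, 3), dtype=np.uint8),
         y=np.arange(10) % 3)

data = np.load(demo_path)  # dictionary-like object with keys 'x' and 'y'
x_train, x_test, y_train, y_test = split_train_test(data["x"], data["y"])

print(x_train.shape, x_test.shape)  # prints (8, 8, 8, 3) (2, 8, 8, 3)
```

Shuffling before splitting matters if the file happens to store the photos grouped by category; a fixed seed keeps the split reproducible between runs.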

Your script will be tested from the Linux command line, thus:

$ python strawberry_nn.py
Reading strawberry data file.
Building network.
Training network.
The training score is [0.1941, 0.9272]
The test score is [0.2535, 0.9139]
$

Be sure to address, as best you can, the main problem with this data set: it's too small (there are only 3367 photos). As a result, overfitting is a problem for neural networks applied to it. To address this problem, explore the following approaches:

Experiment with creating the smallest network you reasonably can.
Explore creating new, artificial data using the ImageDataGenerator class, which can be found in the tensorflow.keras.preprocessing.image subpackage. You can read about how to use this subpackage here. Use this enlarged data set to train your model.
Experiment with dropout.
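
The three ideas above could be combined roughly as in the sketch below. It is a starting point, not a recommended architecture: the layer sizes, the dropout rate, the augmentation settings, and the helper name build_small_cnn are illustrative assumptions, the random 32x32 arrays merely stand in for the real photos, and the labels are assumed to be integers 0, 1, 2. Note that ImageDataGenerator needs SciPy installed to apply its transformations.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def build_small_cnn(input_shape, num_classes=3, dropout_rate=0.5):
    """Build a deliberately small CNN with dropout before the output layer."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        keras.layers.Conv2D(8, (3, 3), activation="relu"),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Conv2D(16, (3, 3), activation="relu"),
        keras.layers.GlobalAveragePooling2D(),
        # Dropout randomly zeroes activations during training only.
        keras.layers.Dropout(dropout_rate),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Random stand-in images and integer labels (assumed encoding: 0, 1, 2).
rng = np.random.default_rng(0)
x_train = rng.random((12, 32, 32, 3)).astype("float32")
y_train = np.arange(12) % 3

# Every batch is a randomly flipped, rotated and shifted copy of real images.
augmenter = ImageDataGenerator(horizontal_flip=True,
                               rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1)

model = build_small_cnn(x_train.shape[1:])
history = model.fit(augmenter.flow(x_train, y_train, batch_size=4),
                    epochs=2, verbose=0)
num_params = model.count_params()
print("Trainable parameters:", num_params)
```

Global average pooling keeps the parameter count low because the final dense layer sees only one value per filter rather than a flattened feature map, which is one way to keep the network small.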

Experiment with your hyperparameters to create the best model you can which minimizes overfitting. You should run the training until the loss stops improving. The best model I have found in which the training and testing accuracies are similar returns a training and test accuracy of about 78%. See if you can do better.
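
One way to run training until the loss stops improving is Keras's EarlyStopping callback, sketched below. The choices here are assumptions rather than requirements: monitoring the validation loss (instead of the training loss), a patience of 3 epochs, the 20% validation split, and the toy dense model on random 16-feature vectors, which only stands in for a real image network.

```python
import numpy as np
from tensorflow import keras

# Random stand-in data: 30 samples, 16 features, 3 classes (assumed encoding).
rng = np.random.default_rng(0)
x_train = rng.random((30, 16)).astype("float32")
y_train = np.arange(30) % 3

model = keras.Sequential([
    keras.Input(shape=(16,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop once the validation loss has not improved for 3 consecutive epochs,
# then roll the model back to the weights from its best epoch.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=3,
                                           restore_best_weights=True)

history = model.fit(x_train, y_train,
                    validation_split=0.2,  # last 20% of the rows used for validation
                    epochs=50,             # upper bound; training may stop earlier
                    callbacks=[early_stop],
                    verbose=0)

epochs_run = len(history.history["loss"])
print("Stopped after", epochs_run, "epochs")
```

Comparing history.history["loss"] with history.history["val_loss"] after training is a simple check for overfitting: a training loss that keeps falling while the validation loss rises suggests the network is memorizing the training photos.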

Submit your script which generates and trains your best model. The script will be graded on functionality, but also on form. This means your script should use meaningful variable names and be well commented.

Submit your strawberry_nn.py file.

Assignments will be graded on a 10 point basis.
The due date is May 11th, 2023 (midnight). Late submissions lose 0.5 points per day until the cut-off date of May 18th at 11:00 AM.