DAT112 - Apr 2024: Assignment 1

Opened: Thursday, 2 May 2024, 11:30 AM

Due: Thursday, 9 May 2024, 11:59 PM

Due date: Thursday, May 9th, 2024 at midnight.

Consider the 17 Category Flower Dataset, a modified version of which can be found here, with the associated targets here. This data set consists of colour images of flowers, each of which is categorized into one of 17 categories. In the modified version of the data set, the images have been scaled to be 50 x 50 pixels each, rather than their original dimensions.

The goal of this assignment is to build the best neural network you can to categorize a given flower image into its respective category. Your script should avoid overfitting as much as possible, while still getting the highest score it can on the test data.

Create a Python script, called flowers_nn.py, which performs the following steps:

reads in the flower data set, images and targets, given in the links above (the numpy function load will be helpful here). You may assume that the files are colocated with the script; the file names may be hard-coded.
splits the input and target data into training and testing data sets,
builds a neural network, using Keras, to predict the category of the input images,
trains the network on the training data, and prints out the final training accuracy,
evaluates the network on the test data, and prints out the test accuracy.
optionally creates a plot of the model's training loss as a function of epoch. The per-epoch loss values are available in the History object returned by the model's fit method.
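The first two steps above (reading the arrays and splitting them) might be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the file names `flowers_images.npy` and `flowers_targets.npy` are hypothetical placeholders, the helper names `load_flowers` and `split_train_test` are made up for illustration, and the demo runs on synthetic arrays so that it works without the real files. The samples are shuffled before splitting in case the file happens to be ordered by category.

```python
import numpy as np


def load_flowers(images_path, targets_path):
    """Read the image and target arrays from .npy files next to the script."""
    images = np.load(images_path)
    targets = np.load(targets_path)
    return images, targets


def split_train_test(images, targets, test_fraction=0.2, seed=0):
    """Shuffle the samples, then hold out `test_fraction` of them for testing."""
    if len(images) != len(targets):
        raise ValueError("images and targets must have the same length")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_test = int(round(test_fraction * len(images)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return images[train_idx], images[test_idx], targets[train_idx], targets[test_idx]


# Synthetic stand-in data so the sketch runs without the real files.
# With the real data you would call something like
#   images, targets = load_flowers("flowers_images.npy", "flowers_targets.npy")
# where both file names are hypothetical placeholders.
demo_images = np.zeros((100, 50, 50, 3), dtype=np.uint8)
demo_targets = np.arange(100) % 17
train_x, test_x, train_y, test_y = split_train_test(demo_images, demo_targets)
print(train_x.shape, test_x.shape)
```

The `test_fraction` of 0.2 is an arbitrary choice for the demo. scikit-learn's `train_test_split` does the same job and can also stratify by category, if you prefer not to write the split by hand.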

Note that the data has a few details in it which can lead to errors in the implementation of your network, including leading to a failure to train. Be sure to import the data at the Python command line and examine the data carefully, by hand.
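As a starting point for that inspection, the kinds of checks one might run are sketched below. The helper name `summarize` is hypothetical, and the demo arrays are synthetic: they say nothing about what the real files contain. Which of these facts actually matters for the real data is for you to find out.

```python
import numpy as np


def summarize(images, targets):
    """Collect basic facts about the arrays that are worth checking by eye."""
    labels, counts = np.unique(targets, return_counts=True)
    return {
        "images_shape": images.shape,
        "images_dtype": str(images.dtype),
        "pixel_min": float(images.min()),
        "pixel_max": float(images.max()),
        "targets_shape": targets.shape,
        "targets_dtype": str(targets.dtype),
        "labels": labels.tolist(),
        "counts_per_label": counts.tolist(),
        "any_nan": bool(np.isnan(images.astype(float)).any()),
    }


# Synthetic stand-in arrays; replace them with the arrays returned by numpy.load.
demo_images = np.random.default_rng(0).integers(
    0, 256, size=(34, 50, 50, 3), dtype=np.uint8
)
demo_targets = np.repeat(np.arange(1, 18), 2)  # made-up labels, two samples each
for name, value in summarize(demo_images, demo_targets).items():
    print(f"{name}: {value}")
```

Compare what this prints for the real arrays against what your network expects: the input shape, the scale of the pixel values, and the range and type of the labels.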

Your script will be tested from the Linux command line, thus:

$ python flowers_nn.py
Reading flowers input file.
Reading flowers target file.
Building network.
Training network.
The training score is [0.291714324566543, 0.9329]
The test score is [1.6099461106693043, 0.5845588445663452]
$

Note that the result above is NOT an example of a GOOD result: the training accuracy (about 93%) is far higher than the test accuracy (about 58%), which indicates overfitting. The underlying problem is that this data set is too small. Explore various ways of addressing overfitting:

Experiment with creating the smallest network you reasonably can.
Optionally, explore the ability to create new, artificial data, by using the ImageDataGenerator class, which can be found in the tensorflow.keras.preprocessing.image subpackage. You can read about how to use this subpackage here. Use this enlarged data set to train your model.
Experiment with dropout.
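The three ideas above can be illustrated with a short sketch. Everything in it is an assumption made for illustration, not a recommended solution: the input shape `(50, 50, 3)`, integer labels in the range 0 to 16 (which `sparse_categorical_crossentropy` expects), the layer sizes, and the dropout rate are all guesses you should check against the data and tune yourself. The function `augment_with_flips` is a plain-numpy stand-in for the idea behind ImageDataGenerator (creating extra training images), not a use of that class.

```python
import numpy as np


def augment_with_flips(images, targets):
    """Double the data set by appending a left-right mirrored copy of every image.

    Assumes `images` has shape (n_samples, height, width, channels).
    """
    flipped = images[:, :, ::-1, :]
    return np.concatenate([images, flipped]), np.concatenate([targets, targets])


def build_small_cnn(input_shape=(50, 50, 3), num_classes=17, dropout_rate=0.5):
    """Build a deliberately small convolutional network with one dropout layer."""
    # Imported inside the function so the numpy-only helper above can be used
    # without loading TensorFlow.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(8, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(16, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dropout(dropout_rate),  # randomly zeroes activations during training
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # assumes integer labels 0..16
        metrics=["accuracy"],
    )
    return model


# Tiny synthetic batch to show what the augmentation helper returns.
demo_images = np.arange(2 * 4 * 4 * 3, dtype=np.float32).reshape(2, 4, 4, 3)
demo_targets = np.array([0, 1])
aug_x, aug_y = augment_with_flips(demo_images, demo_targets)
print(aug_x.shape, aug_y.tolist())
```

Apply any augmentation to the training split only, so that the test accuracy is still measured on unmodified images.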

Experiment with your hyperparameters to create the best model you can which minimizes overfitting. You should run the training until the loss stops improving, as demonstrated by your plot. The best model I have found in which the training and testing accuracies are similar returns a training accuracy of about 68% and a test accuracy of 58%. See if you can do better.
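One way to make "until the loss stops improving" concrete is a patience rule, sketched below in plain Python. The function name `epochs_until_plateau`, the default `patience` and `min_delta` values, and the demo loss curve are all made up for illustration. Keras offers the same idea as the `EarlyStopping` callback in `tensorflow.keras.callbacks`, which takes `monitor`, `patience`, and `min_delta` arguments.

```python
def epochs_until_plateau(losses, patience=5, min_delta=1e-3):
    """Return the 1-based epoch at which training could have stopped.

    Training is considered stalled once `patience` consecutive epochs fail to
    improve on the best loss seen so far by more than `min_delta`. If that
    never happens, the total number of epochs is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses)


# Made-up loss curve for illustration; in the script this list would be the
# `history.history["loss"]` list of the object returned by `model.fit`.
demo_losses = [2.8, 2.1, 1.7, 1.5, 1.45, 1.4495, 1.4493, 1.4492, 1.4492, 1.4491, 1.4491]
print(epochs_until_plateau(demo_losses, patience=3))
```

Running the same check on a held-out loss rather than the training loss is a common variation, since the training loss can keep falling while the network overfits.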

Submit your script which generates and trains your best model. The script will be graded on functionality, but also on form. This means your script should use meaningful variable names and be well commented.

Submit your flowers_nn.py. Assignments will be graded on a 10 point basis. The due date is May 9th, 2024 (midnight), with a penalty of 0.5 points per day for late submissions until the cut-off date of May 16th at 11:00 AM.