Assignment 2
Due date: Thursday, May 26th, 2022 at midnight.
Consider the Netflix prize, a contest run by Netflix from 2006 to 2009. The goal of the contest was to create a collaborative filtering model which would predict user ratings for movies. The original data set consists of (user, movie, date, rating) quadruplets: over 100 million ratings, from over 480,000 users, on over 17,000 movies.
I have modified the data set, reducing it to 4499 movies and 10,000 users. I have also reorganized the data so that each row corresponds to a user and each column to a movie, rather than each row being a single rating. If a user has not yet rated a given movie, the rating for that movie will be 0. The data can be found here. Needless to say, this data set is quite sparse, but it must be organized in this dense layout because Keras cannot consume data stored in a sparse format.
As we learned in class, autoencoders can be used as a means of encoding typical behaviours, and thus are often used for filtering applications. The goal of this assignment is to create a standard (non-variational) autoencoder which, given the existing movie ratings for a given user, will predict the movie ratings of the user for all movies.
Create a Python script, called "netflix_ae.py", which performs the following steps:
- reads in the modified Netflix data set, given in the link above. The file is a standard CSV file. Use the pandas package to read the file. You may assume that the file is colocated with the script; the file name of the data set may be hard-coded.
- splits the data into training and testing data sets,
- builds an autoencoding neural network, using Keras, to predict the movie ratings for the users,
- trains the network on the training data, and prints out the final training accuracy,
- evaluates the network on the test data, and prints out the test accuracy.
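The steps above might be sketched as follows. This is illustrative only: the layer sizes, bottleneck width, optimizer, and the file name "netflix_data.csv" are my assumptions, not part of the assignment specification, and the `__main__` section substitutes a random matrix for the real data.

```python
# Sketch of the pipeline: read data, build a dense autoencoder, train.
# Architecture and file name are guesses; adapt them to your own design.
import numpy as np
import pandas as pd
from tensorflow import keras

N_MOVIES = 4499  # number of movie columns in the modified data set


def load_ratings(path="netflix_data.csv"):
    """Read the user-by-movie ratings matrix with pandas (file name assumed)."""
    return pd.read_csv(path).to_numpy(dtype="float32")


def build_autoencoder(n_inputs=N_MOVIES):
    """A small dense autoencoder; the layer widths here are arbitrary."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_inputs,)),
        keras.layers.Dense(256, activation="relu"),   # encoder
        keras.layers.Dense(64, activation="relu"),    # bottleneck
        keras.layers.Dense(256, activation="relu"),   # decoder
        keras.layers.Dense(n_inputs, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mse"])
    return model


if __name__ == "__main__":
    model = build_autoencoder()
    # Stand-in for the real data: a random integer ratings matrix.
    x = np.random.randint(0, 6, size=(100, N_MOVIES)).astype("float32")
    model.fit(x, x, epochs=1, batch_size=32, verbose=0)
```

Note that this sketch still uses the plain `mse` loss; as discussed below, the assignment requires replacing it with a masked version.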
If you think about the data set, you should notice a problem with what has just been described: the input data is sparse (most of the entries are zeros, since each user has rated only a small fraction of the movies), and it does not make sense to include the non-rated movies in the calculation of the cost function. To fix this problem we need to use a custom cost function when training this model. This cost function is known as the MMSE ("Masked Mean Squared Error"). It is the same as the regular mean squared error, but it is calculated using only those ratings which have a non-zero input value.
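One way to write such a masked loss, assuming the TensorFlow backend for Keras, is sketched below. The function name and the epsilon-style guard against empty masks are my choices, not a prescribed implementation.

```python
import tensorflow as tf


def masked_mse(y_true, y_pred):
    """Masked MSE: mean squared error over the rated (non-zero) entries only."""
    # 1.0 where the user actually rated the movie, 0.0 elsewhere.
    mask = tf.cast(tf.not_equal(y_true, 0.0), tf.float32)
    squared_error = tf.square((y_true - y_pred) * mask)
    # Divide by the number of rated entries, guarding against a batch
    # in which no entry is rated.
    return tf.reduce_sum(squared_error) / tf.maximum(tf.reduce_sum(mask), 1.0)
```

This function can be passed directly to `model.compile(loss=masked_mse, ...)`; passing it in the `metrics` list as well would explain output lines showing two similar numbers (loss and metric).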
Your script will be tested from the Linux command line, thus:
$ python netflix_ae.py
Reading netflix input file.
Building network.
Training network.
The training score is [1.2671977962493896, 1.264156311416626]
The test score is [1.2146154642105103, 1.2141642570495605]
$
For an optional second part of the assignment, implement an improved version of the above training. To improve the training we will 'create' artificial data. We can do this by observing that, for a well-trained standard autoencoder, f(f(x)) ≈ f(x), where f is the autoencoder and x is the input data. As a result, we can alternate between training on the original data and training on f(x). To do this, modify the training step of the above script to use the "train_on_batch" functionality, as described in the GANs class. For each iteration:
- Select a batch_size of data from the training data,
- Perform a train_on_batch using this data,
- Select a different batch_size of training data,
- Use this data to calculate f(x), using model.predict,
- Perform a train_on_batch using f(x).
- Possibly perform the previous three steps (selecting data, computing f(x), and training on it) multiple times per iteration.
The results should be better, though how much better will depend on the details of your model.
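The loop described above might look like the following sketch. The batch size, iteration count, and number of refeeds per iteration are placeholders; `model` is assumed to be an already-compiled Keras autoencoder.

```python
import numpy as np


def train_with_refeeding(model, x_train, batch_size=128,
                         iterations=1000, refeeds=1):
    """Alternate train_on_batch between real batches and the model's own
    outputs f(x).  Hyperparameter defaults here are guesses."""
    n = x_train.shape[0]
    loss = None
    for _ in range(iterations):
        # Train on a batch of real data.
        idx = np.random.choice(n, batch_size, replace=False)
        batch = x_train[idx]
        loss = model.train_on_batch(batch, batch)
        for _ in range(refeeds):
            # Train on f(x) for a different batch; f(x) is dense, so it
            # acts as artificial "fully rated" data.
            idx = np.random.choice(n, batch_size, replace=False)
            fx = model.predict(x_train[idx], verbose=0)
            model.train_on_batch(fx, fx)
    return loss
```

Because f(x) has no zero entries, the masked loss degenerates to the ordinary MSE on those refeed batches, which is exactly the intended densification effect.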
$ python netflix_ae.py
Reading netflix input file.
Building network.
Training network.
The training score is [1.2103127241134644, 1.2103127241134644]
The test score is [1.205627202987671, 1.207932472229004]
$
Submit your 'netflix_ae.py'.
Assignments will be graded on a 10 point basis.
The due date is May 26th, 2022 at midnight, with a 0.5-point penalty per day for late submissions, up to the cut-off date of June 2nd at 11:00am.