Assignment 2
Due date: Thursday, May 23rd, 2024 at midnight.
Consider the KDD Cup data set. This data set consists of features describing server login attempts. Most of the login attempts are normal and valid; some are malicious. A specially-crafted version of the data set can be found here. This particular version contains only the continuous features; the categorical features have been removed from the data set. The malicious login attempts have been given a label of "1", and the normal login attempts a label of "0". The number of malicious login attempts has been reduced to about 1% of the data.
An interesting application of autoencoders is outlier detection, that is, detecting data which does not correspond to "normal". The purpose of this assignment is to build a neural network autoencoder, and use it to detect the malicious login attempts in the data set. Note that the autoencoder you will be using should just be a regular one, not a variational autoencoder.
How do we use an autoencoder to do outlier detection? This is accomplished by building an autoencoder, training it on "normal" data only, and then comparing the value of the loss function on the "normal" data versus the "outlier" data. If the autoencoder is crafted properly and well-trained, you should be able to come up with a criterion for the value of the loss function which distinguishes between normal and outlier data.
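As a tiny illustration of this reconstruction-error criterion (the arrays and the 0.05 threshold below are made-up values, not taken from the KDD data):

```python
import numpy as np

# Hypothetical inputs and their reconstructions from a trained autoencoder.
x = np.array([[0.1, 0.2], [0.9, 0.8], [0.1, 0.3]])          # original inputs
x_hat = np.array([[0.11, 0.19], [0.2, 0.1], [0.12, 0.28]])  # reconstructions

# Per-sample mean-squared reconstruction error.
mse = np.mean((x - x_hat) ** 2, axis=1)

# Points whose error exceeds the threshold are flagged as outliers.
threshold = 0.05
outliers = mse > threshold
print(outliers)  # only the middle point reconstructs poorly, so only it is flagged
```

The middle point was reconstructed badly (error far above the threshold), so it would be flagged as an outlier; the other two reconstruct almost perfectly.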
Create a Python script, called kdd_ae.py, which performs the following steps:
- reads in the KDD data set given in the link above. You may assume that the above CSV file is colocated with the script; the file name may be hard-coded.
- separates the input and output data from the data set (you may hard-code the columns for this assignment),
- scales each feature in the input data to the range 0-1 (the function sklearn.preprocessing.minmax_scale() may be useful here),
- splits the input and output data into three data sets: training, validation and testing data sets,
- builds a neural network autoencoder, using Keras,
- trains the network on the training data, but only on non-malicious login attempts,
- calculates the mean-squared error for each data point in the validation data, after passing the validation data through the network,
- for the previously calculated mean-squared errors of the validation data, calculates the mean and standard deviation separately for the normal data points and the malicious data points, and prints them out,
- using the means and standard deviations calculated above, sets an appropriate threshold to separate the normal login attempts from the malicious ones, and prints it out (you don't need to calculate this; you may just look at the numbers and pick a good value),
- calculates the mean-squared error for each data point in the test data, after passing the test data through the network,
- uses the threshold value determined above to classify each data point in the test data set as normal or malicious,
- prints out the confusion matrix of the test data predictions versus the actual test data labels (the sklearn.metrics.confusion_matrix function may be useful here).
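The steps above can be sketched end-to-end as follows. This is only a minimal illustration, not a complete solution: it substitutes small synthetic data for the KDD CSV (normal points near 0.2, "malicious" points near 0.9, already in the 0-1 range), the layer sizes and epoch count are arbitrary, and the threshold is computed rather than picked by inspection as the assignment asks.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from tensorflow import keras

# Synthetic stand-in for the KDD features; the real script would read the
# CSV here instead. About 1% of the points are "malicious" (label 1).
rng = np.random.default_rng(0)
X = np.clip(np.vstack([rng.normal(0.2, 0.05, size=(1000, 8)),
                       rng.normal(0.9, 0.05, size=(10, 8))]), 0.0, 1.0)
y = np.concatenate([np.zeros(1000), np.ones(10)])

# Split 60/20/20 into training, validation, and test sets.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

# A small symmetric autoencoder; the architecture is illustrative only.
n_features = X.shape[1]
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2, activation="relu"),    # latent space
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(n_features, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")

# Train only on the normal (label 0) training points.
X_normal = X_tr[y_tr == 0]
model.fit(X_normal, X_normal, epochs=30, batch_size=32, verbose=0)

# Per-sample mean-squared reconstruction error on the validation data.
val_mse = np.mean((X_val - model.predict(X_val, verbose=0)) ** 2, axis=1)
print("normal mean/std:   ", val_mse[y_val == 0].mean(), val_mse[y_val == 0].std())
print("malicious mean/std:", val_mse[y_val == 1].mean(), val_mse[y_val == 1].std())

# Midpoint between the two group means; in the real script you would
# inspect the printed statistics and pick a value by hand.
threshold = (val_mse[y_val == 0].mean() + val_mse[y_val == 1].mean()) / 2

# Classify the test data and print the confusion matrix.
test_mse = np.mean((X_te - model.predict(X_te, verbose=0)) ** 2, axis=1)
print(confusion_matrix(y_te, (test_mse > threshold).astype(int)))
```

Because the autoencoder only ever sees normal points during training, it learns to reconstruct them well, and the far-away "malicious" points come back with a much larger reconstruction error.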
Experiment with your script, varying the parameters in your model (number of hidden layers, number of nodes per layer, size of latent space, activation functions, presence/absence of regularization or dropout or batch normalization, cost function, optimization algorithm) to get the best model you can find. You should run the training until the loss stops improving.
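Running the training "until the loss stops improving" can be automated with Keras's EarlyStopping callback; the patience value below is illustrative.

```python
from tensorflow import keras

# Stop training once the validation loss has not improved for `patience`
# consecutive epochs, and roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# Passed to fit() alongside the training data, e.g.:
# model.fit(X_train, X_train, validation_split=0.1,
#           epochs=500, callbacks=[early_stop])
```

With this callback you can set a generous epoch limit and let the callback decide when to stop.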
Your script will be tested from the Linux command line, thus:
$ python kdd_ae.py
Reading KDD data file.
Building network.
Training network.
Summary of validation data results:
mean std
binary_labels
0.0 0.003351 0.006613
1.0 0.093298 0.034623
Setting threshold to 0.05.
Test data confusion matrix:
[[19446 21]
[ 2 181]]
$
The script will be graded on functionality, but also on form. This means your script should use meaningful variable names and be well commented.
Submit your kdd_ae.py. Assignments will be graded on a 10 point basis, with a 0.5 point penalty per day for late submission, until the cut-off date of May 30th at 11:00am.