Assignment 3
Due date: Tuesday, June 4th, 2024 at midnight.
An active area of research is the use of neural networks to predict the properties of chemicals. Being able to predict a chemical's properties without first needing to manufacture it would allow much more rapid development of drug candidates and other chemicals, and potentially the discovery or creation of chemicals with custom properties. This assignment will explore a somewhat aesthetic application of such neural networks.
Consider the Leffingwell Odor Dataset, a data set "of more than 4,100 perfumery materials". This data is most easily downloaded using the pyrfume Python package. However, if this fails, you can download the data here.
$
$ pip install pyrfume
$
$ python
>>> import pyrfume
>>> data = pyrfume.load_data('leffingwell/behavior.csv', remote = True)
>>>
>>> data.shape
(3523, 114)
>>>
There are many ways to numerically represent a chemical. One way is to represent the chemical as an ASCII string, in SMILES format. The data downloaded above will be a pandas DataFrame. The first column will be the SMILES representation of the given chemical. The remaining 113 columns will be labels which experts have given these chemicals. These labels include "almond", "tomato", "vanilla", and many others.
The goal of this assignment is to create a Graph Convolutional Network to categorize the various chemicals into their respective labels. To do this, however, we need a way to convert the SMILES representation of the chemical into a graph. I have shamelessly stolen and modified some code from here which converts a SMILES string into an array of node types (atomic types) and an adjacency matrix. You can download the modified code here.
It turns out that the heaviest element in this dataset is sulfur (atomic number 16), so we can specify the maximum number of atom types to be 16. (This could be optimized, since several of the elements with atomic numbers between 1 and 16, He for example, obviously do not appear.)
>>> data.iloc[0][0]
'CC(C)CC(C)(O)C1CCCS1'
>>>
>>> import smiles
>>> nodes, adj = smiles.smiles2graph(data.iloc[0][0], 16)
>>>
>>> nodes.shape
(32, 16)
>>> adj.shape
(32, 32)
>>>
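To make the node array and adjacency matrix concrete, here is a minimal NumPy sketch of how such arrays could be built by hand for a small molecule and passed through one graph-convolution step. It is an illustration only: the methanol example, the one-hot layout (column Z - 1 for atomic number Z), the gcn_layer helper, and the hidden size of 8 are all assumptions, and the layout produced by smiles.smiles2graph may differ.

```python
import numpy as np

# Hypothetical hand-built graph for methanol (CH3OH), hydrogens included.
# Atom order: C, O, three H on the carbon, one H on the oxygen.
atomic_numbers = np.array([6, 8, 1, 1, 1, 1])
max_atom_types = 16

# One-hot node features. ASSUMPTION: column (Z - 1) marks atomic number Z.
nodes = np.zeros((len(atomic_numbers), max_atom_types), dtype=np.float32)
nodes[np.arange(len(atomic_numbers)), atomic_numbers - 1] = 1.0

# Symmetric adjacency matrix: 1 where two atoms share a bond.
bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
adj = np.zeros((len(atomic_numbers), len(atomic_numbers)), dtype=np.float32)
for i, j in bonds:
    adj[i, j] = adj[j, i] = 1.0


def gcn_layer(nodes, adj, weights):
    """One graph-convolution step: relu(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0], dtype=adj.dtype)  # add self-loops
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))      # degrees are >= 1
    a_norm = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(a_norm @ nodes @ weights, 0.0)


rng = np.random.default_rng(0)
weights = rng.normal(size=(max_atom_types, 8)).astype(np.float32)

hidden = gcn_layer(nodes, adj, weights)  # one 8-vector per atom, shape (6, 8)
graph_vector = hidden.mean(axis=0)       # average over atoms, shape (8,)
```

Averaging over the atoms at the end gives a fixed-length vector regardless of how many atoms the molecule has, which is one common way to feed a graph into dense output layers.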
If you look closely at the 113 labels for each chemical, you'll notice that most of the chemicals have 4 or 5 different labels. This is not a regular single-label multi-class problem (the type we have done thus far), but is rather what is known as a multi-label multi-class problem. In this case we use sigmoid as our output activation function (rather than softmax). Similarly, the "binary_accuracy" metric should be used rather than "accuracy".
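The difference between the two output activations can be seen in a small NumPy sketch. The logits below are made-up scores for four hypothetical labels; nothing here comes from the actual dataset.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function: each output is independent."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Normalized exponentials: the outputs compete and sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()


# Hypothetical raw network outputs for four labels; the first two apply.
logits = np.array([3.0, 2.5, -4.0, -5.0])

soft = softmax(logits)        # sums to 1, so at most one entry can exceed 0.5
sig = sigmoid(logits)         # each entry is its own yes/no probability

predicted_labels = sig > 0.5  # several labels can be switched on at once
```

With softmax, the two strong labels have to share a total probability of 1, so only one of them can exceed 0.5. With sigmoid, both can be close to 1 at the same time, which is what a multi-label problem needs.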
Furthermore, because there are 113 categories but only 4 or 5 are non-zero, this data suffers from extreme class imbalance. Binary crossentropy (rather than categorical crossentropy) would be the normal loss function to use under these circumstances, but if we just use it as is, we will get a very high accuracy but a very poorly performing model. Why? Because the model will just output zeros for every category, and still be correct for the vast majority of labels. To get around this problem, create a custom loss function that will only compare the output the model generates for those categories that are non-zero for the actual data. Similarly, create a new accuracy metric which only gives the accuracy for those categories which are non-zero for the true data.
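The effect of the imbalance can be checked with a few lines of NumPy. This is only a sketch of the idea, not the Keras loss or metric the assignment asks for: the label positions and the function names binary_accuracy, positive_only_accuracy, and positive_only_crossentropy are hypothetical, chosen for illustration.

```python
import numpy as np

n_labels = 113

# Hypothetical chemical for which 4 of the 113 labels apply.
y_true = np.zeros((1, n_labels), dtype=np.float32)
y_true[0, [3, 17, 42, 99]] = 1.0

# A "model" that outputs zero for every category.
y_pred = np.zeros_like(y_true)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of all labels, zero or not, that are predicted correctly."""
    return float(np.mean((y_pred > threshold) == (y_true > 0.5)))


def positive_only_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of the truly non-zero labels that the model switched on."""
    mask = y_true > 0.5
    hits = (y_pred > threshold) & mask
    return float(hits.sum() / max(mask.sum(), 1))


def positive_only_crossentropy(y_true, y_pred, eps=1e-7):
    """Crossentropy averaged over the truly non-zero labels only."""
    mask = y_true > 0.5
    p = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-(np.log(p) * mask).sum() / max(mask.sum(), 1))


plain = binary_accuracy(y_true, y_pred)            # 109/113, about 0.965
masked = positive_only_accuracy(y_true, y_pred)    # 0.0
loss = positive_only_crossentropy(y_true, y_pred)  # large: every positive missed
```

The all-zeros model scores about 96.5% on plain binary accuracy while getting none of the relevant labels right. The reverse also holds: a model that outputs ones everywhere would score perfectly on the positives-only metric, so it is worth looking at both numbers together.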
Create a Python script, called "leffingwell_gcn.py", which performs the following steps:
- reads in the Leffingwell Odor data set,
- splits the data into training and testing data sets,
- builds a graph convolutional neural network, using Keras, to predict the labels of the chemicals,
- trains the network on the training data, using your custom loss function, and prints out the final training accuracy,
- evaluates the network on the test data, and prints out the test accuracy.
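The splitting step in the list above can be sketched as a shuffled index split. The 80/20 ratio, the fixed seed, and the helper name train_test_split_indices are assumptions made for illustration; the assignment does not prescribe them.

```python
import numpy as np


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled, non-overlapping (train, test) row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


# 3523 is the number of rows reported by data.shape above.
train_idx, test_idx = train_test_split_indices(3523)
```

The same index arrays can then be used to select the matching rows of the node arrays, adjacency matrices, and labels, so the three stay aligned.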
Your script will be tested from the Linux command line, thus:
$ python leffingwell_gcn.py
Reading Leffingwell data.
Building network.
Training network.
The training score is [0.1103, 0.9671]
The test score is [0.1130, 0.9675]
$
Thus far I've been able to get about 16% accuracy on the test data for those categories that are non-zero.
Submit your various Python files.
Assignments will be graded on a 10 point basis.
Due date is June 4th, 2024 (midnight), with a penalty of 0.5 points per day for late submission until the cut-off date of June 11th at 11:00am.