Assignment 3
Due date: Tuesday, June 7th, 2022 at midnight.
An active area of research is the use of neural networks to predict the properties of chemicals. Being able to predict a chemical's properties without needing to manufacture the chemical first would allow much more rapid development of possible drug candidates or other chemicals, and potentially the ability to discover or create chemicals with custom properties. This assignment will explore a somewhat aesthetic application of such neural networks.
Consider the Leffingwell Odor Dataset, a data set "of more than 4,100 perfumery materials". This data is most easily downloaded using the pyrfume Python package. However, if this fails you can download the data here.
$
$ pip install pyrfume
$
$ python
>>> import pyrfume
>>> data = pyrfume.load_data('leffingwell/behavior.csv', remote = True)
>>>
>>> data.shape
(3523, 114)
>>>
There are many ways to numerically represent a chemical. One way is to represent the chemical as an ASCII string, in SMILES format. The data downloaded above is a pandas DataFrame. The first column is the SMILES representation of the given chemical. The remaining 113 columns are labels which experts have given these chemicals. These labels include "almond", "tomato", "vanilla", and many others.
The goal of this assignment is to create a Graph Convolutional Network to categorize the various chemicals into their respective labels. To do this, however, we need a way to convert the SMILES representation of the chemical into a graph. I have shamelessly stolen and modified some code from here which converts a SMILES string into an array of node types (atomic types) and an adjacency matrix. You can download the modified code here.
It turns out that the heaviest element in this dataset is sulfur (atomic number 16), so we can specify the maximum number of atom types to be 16.
>>> data.iloc[0][0]
'CC(C)CC(C)(O)C1CCCS1'
>>>
>>> import smiles
>>> nodes, adj = smiles.smiles2graph(data.iloc[0][0], 16)
>>>
>>> nodes.shape
(32, 16)
>>> adj.shape
(32, 32)
>>>
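For orientation, one widely used graph-convolution rule (the Kipf and Welling formulation) updates the node features X as ReLU(Â X W), where Â is the adjacency matrix with self-loops added and then symmetrically normalized by node degree. The sketch below applies that rule with NumPy to a tiny hand-made graph. The function name gcn_layer, the 3-atom toy graph, the 4 atom types, and the random weights are all illustrative assumptions; they only mimic the kind of (nodes, adj) pair shown above, and this is not necessarily the layer you are expected to use in your solution.

```python
import numpy as np

def gcn_layer(nodes, adj, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: scale row i and column j by 1/sqrt(deg).
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ nodes @ weights, 0.0)

# Toy 3-atom chain with 4 hypothetical atom types (one-hot rows).
nodes = np.array([[1, 0, 0, 0],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 8))   # 4 input features -> 8 hidden features

hidden = gcn_layer(nodes, adj, weights)
print(hidden.shape)  # (3, 8)
```

Each output row mixes a node's own features with those of its bonded neighbours, which is what lets stacked layers pick up on local chemical structure.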
If you look closely at the 113 labels for each chemical, you'll notice that most of the chemicals have 4 or 5 different labels. This is not a regular single-label multi-class problem (the type we have done thus far), but is rather what is known as a multi-label multi-class problem. In this case we use sigmoid as our output activation function (rather than softmax), and binary crossentropy as our loss function (rather than categorical crossentropy). Update: similarly, the "binary_accuracy" metric should be used rather than "accuracy".
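To make the difference concrete, here is a minimal pure-Python sketch of a sigmoid output with binary crossentropy. Each label gets its own independent probability, so the outputs need not sum to 1 as they would under softmax. The 5-label target vector and the logit values are made-up numbers for illustration, and the helper names are assumptions; in Keras the corresponding pieces are the "sigmoid" activation and the "binary_crossentropy" loss.

```python
import math

def sigmoid(z):
    """Squash a raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, logits):
    """Mean binary crossentropy over independent labels."""
    eps = 1e-7  # clip probabilities to avoid log(0)
    total = 0.0
    for y, z in zip(y_true, logits):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical chemical with 5 possible labels, 2 of which apply.
y_true = [1, 0, 1, 0, 0]
logits = [2.0, -3.0, 1.5, -2.0, -4.0]

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(y_true, logits)

print(probs)       # one probability per label
print(sum(probs))  # not constrained to equal 1
print(loss)
```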
Create a Python script, called "leffingwell_gcn.py", which performs the following steps:
- reads in the Leffingwell Odor data set,
- splits the data into training and testing data sets,
- builds a graph convolutional neural network, using Keras, to predict the labels of the chemicals,
- trains the network on the training data, and prints out the final training accuracy,
- evaluates the network on the test data, and prints out the test accuracy.
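As one possible way to approach the splitting step listed above, the sketch below shuffles row indices reproducibly and holds out a fraction for testing. The function name, the 80/20 ratio, and the seed are assumptions chosen for illustration; the assignment does not prescribe a particular split. The row count of 3523 comes from data.shape above.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

# 3523 rows, as reported by data.shape.
train_idx, test_idx = train_test_split_indices(3523)
print(len(train_idx), len(test_idx))  # 2818 705
```

The same index lists can then be used to select matching rows from the node arrays, the adjacency matrices, and the label matrix, so the three stay aligned.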
Your script will be tested from the Linux command line, thus:
$ python leffingwell_gcn.py
Reading Leffingwell data.
Building network.
Training network.
The training score is [0.1103, 0.9671]
The test score is [0.1130, 0.9675]
$
Update: Thus far I've been able to get about 97% accuracy. See the note below.
Update: further exploration of this data indicates that this problem suffers from severe class imbalance. This means that certain classes are so underrepresented that the model doesn't even bother to learn them. In this situation, the model just generates zeros for all labels and gets a very high binary_accuracy.
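A quick back-of-the-envelope check shows why an all-zeros model scores so well. The sketch below assumes, purely for illustration, that every chemical has exactly 4 of the 113 labels set (the text above says most have 4 or 5); the two example rows and the helper names are made up. Binary accuracy counts every individual label entry, so the many correct zeros swamp the few missed ones.

```python
NUM_LABELS = 113  # number of label columns in the data

def make_row(positive_positions):
    """Build a 0/1 label row with ones at the given positions."""
    row = [0] * NUM_LABELS
    for pos in positive_positions:
        row[pos] = 1
    return row

def binary_accuracy(y_true, y_pred):
    """Fraction of individual label entries predicted correctly."""
    correct = 0
    total = 0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p in zip(row_true, row_pred):
            correct += int(t == p)
            total += 1
    return correct / total

# Two hypothetical chemicals, each with 4 positive labels.
y_true = [make_row([0, 1, 2, 3]), make_row([10, 20, 30, 40])]
# A "model" that predicts zero for every label.
y_pred = [[0] * NUM_LABELS for _ in y_true]

acc = binary_accuracy(y_true, y_pred)
found = sum(t * p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))

print(round(acc, 4))  # 0.9646
print(found)          # 0 positive labels were actually identified
```

A score in this range can therefore be reached without learning anything, which is why binary accuracy alone is a weak signal on this data.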
Submit your various Python files.
Assignments will be graded on a 10 point basis.
Due date is June 7th, 2022 (midnight), with a penalty of 0.5 points per day for late submission until the cut-off date of June 14th at 11:00am.