DAT112 - Apr 2023: Assignment 3

Opened: Thursday, 8 June 2023, 11:30 AM

Due: Thursday, 15 June 2023, 11:59 PM

Due date: Tuesday, June 15th, 2023 at midnight.

An active area of research is the use of neural networks to discover new drugs. In principle we'd like to be able to predict the characteristics of a potential drug without actually measuring the chemical's properties. This would allow much-more rapid development of possible drug candidates, and potentially the ability to discover, or design, chemicals with custom properties.

Many different approaches to using neural networks to determine a chemical's properties are being studied. Many involve analysing chemicals which are represented in "SMILES" format. This format represents chemicals as an ASCII string.


    >>> 

    >>> from rdkit import Chem

    >>> from rdkit.Chem import Draw

    >>> import matplotlib.pyplot as plt

    >>> 

    >>> smiles = "C1CC2=C3C(=CC=C2)C(=CN3C1)[C@H]4[C@@H](C(=O)NC4=O)C5=CNC6=CC=CC=C65"

    >>> chem = Chem.MolFromSmiles(smiles)

    >>> 

    >>> Draw.MolToMPL(chem, size = (200, 200))

    >>> plt.show()

    >>>

Suppose we want to design a neural network which would take SMILES-format strings as inputs. One way to process these inputs, since they are strings, would be to build a recurrent neural network, using LSTMs, which would process each string as a "sentence". Once processed, the data could be then used to predict the properties of the chemical.

Let us consider a collection of chemicals with annotated nanomolar activities (IC/EC/AC50), which were downloaded from ChEMBL, a database of bioactive molecules. The dataset we will consider can be found here. For this assignment, we will be interested in predicting each chemical's value of the "AlogP" column, which is a measure of molecular hydrophobicity (lipophilicity).

Create a Python script, called "chembl_lstm.py". Your script should:

read in the data set and separate out the inputs (the SMILES strings) and targets (the AlogP data),
perform whatever preprocessing is necessary to convert the input data into a form that can be used with an LSTM layer,
split the data into training and testing data sets,
build an LSTM neural network, using Keras, to predict the AlogP value, given the input SMILES string.
train the network on the training data, and print out the final training loss value,
evaluate the network on the test data, and print out the test loss value.

Your script will be tested from the Linux command line, thus:

$ python chembl_lstm.py Using Tensorflow backend. Reading ChEMB data. Building network. Training network. The training loss is 0.0108 The test loss is 0.0195 $

You do not need to use the full data set (it's quite large), but you should use enough of the data such that overfitting is minimized.

The script will be graded on functionality, but also on form. This means your script should use meaningful variable names and be well commented.

Submit your 'chembl_lstm.py' file.

Assignments will be graded on a 10 point basis.
Due date is June 15th 2023 (midnight), with 0.5 penalty point per day off for late submission until the cut-off date of June 22nd, at 11:00am.