BCH2203 - Winter 2022: 6. Classify downloaded protein structures

Opened: Sunday, 10 April 2022, 12:00 AM

Due: Sunday, 17 April 2022, 11:59 PM

In this assignment, you'll be trying out the k-means method on protein structures. The protein structures to use are those of human hemoglobin, found on the PDB at rcsb.org by searching for hemoglobin and selecting "Homo sapiens" and the best refinement resolution (<1.5 Angstrom). This results in the following 44 PDBIDs:

6LCX, 6LCW, 6KAO, 6KAP, 6L5V, 6KA9, 6KAI, 6KAH, 6KAE, 3S66, 7JY3,
2D5Z, 2W72, 7DY4, 7DY3, 2DN2, 2DN1, 2DN3, 1J40, 1J41, 1IRD, 5QR5,
5QQR, 5QR1, 5QQY, 4Y08, 6TX8, 4HF3, 5O10, 3ZOO, 5TY3, 3TEM, 6YA6,
3UVC, 4GR8, 7N5O, 6H5W, 6E4F, 4QC4, 6J6M, 7P8X, 7MU3, 3U9W, 4AJX

Your task is to write a script that

Uses Biopython's Bio.PDB module to download the protein structures with these PDBIDs
For each, extracts the 3-dimensional positions of all the atoms
Computes or determines N, the number of atoms stored in the structure, and M, the mean square displacement of atoms from the center, defined as

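$$ M = \frac{1}{N} \sum_{i=1}^{N} \left[ (x_{i} - \langle x \rangle)^{2} + (y_{i} - \langle y \rangle)^{2} + (z_{i} - \langle z \rangle)^{2} \right] $$

where

$$ \langle x \rangle = \frac{1}{N} \sum_{i=1}^{N} x_{i}, \quad \langle y \rangle = \frac{1}{N} \sum_{i=1}^{N} y_{i}, \quad \langle z \rangle = \frac{1}{N} \sum_{i=1}^{N} z_{i} $$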

With these 44 pairs of (N,M), performs a k-means clustering with the number of clusters set to 3, 4, 5, and 6.
And produces plots of the results. The plots should be scatter plots of M and N, with the colour of each point determined by the cluster number found by the k-means method.
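
As a starting point for the download and (N,M) steps, here is a minimal sketch, not a reference solution. The function names `count_and_msd` and `fetch_coords` and the `pdb_files` directory are hypothetical choices made for illustration. The sketch assumes the legacy PDB file format is available for each entry and takes every atom record that Bio.PDB yields (including waters and ligands), since the assignment asks for all atoms.

```python
import numpy as np


def count_and_msd(coords):
    """Return (N, M) for an array-like of atom positions with shape (N, 3)."""
    coords = np.asarray(coords, dtype=float)
    n_atoms = coords.shape[0]
    # Unweighted mean position: (<x>, <y>, <z>).
    center = coords.mean(axis=0)
    # Squared distance of each atom from the center, averaged over atoms.
    msd = ((coords - center) ** 2).sum(axis=1).mean()
    return n_atoms, msd


def fetch_coords(pdb_id, download_dir="pdb_files"):
    """Download one structure and return its atom positions as an (N, 3) array.

    Biopython is imported here so that count_and_msd can be used without it.
    Assumption: file_format="pdb" (legacy PDB files) works for these entries.
    """
    from Bio.PDB import PDBList, PDBParser

    path = PDBList().retrieve_pdb_file(pdb_id, pdir=download_dir, file_format="pdb")
    structure = PDBParser(QUIET=True).get_structure(pdb_id, path)
    return np.array([atom.get_coord() for atom in structure.get_atoms()])
```

Calling `count_and_msd(fetch_coords("2DN2"))` would then give the pair (N,M) for one entry. Looping over the 44 PDBIDs gives the full data set. Because PDB coordinates are stored in Angstrom, M comes out in square Angstrom.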

Your script may combine the four plots using subplots if you want. Submit your script and the four plots (or combination thereof) by April 17, 2022 at 11:59 PM.
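
The clustering and plotting steps could be sketched as follows. This is a hand-written, NumPy-only version of Lloyd's algorithm, included only to show what k-means does. In your script you would more likely call a library implementation such as the one covered in class. The names `kmeans` and `plot_clusterings`, the output file name, the 2x2 layout, and the choice of N on the horizontal axis are all assumptions made for illustration.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm. Returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Final assignment of each point to its nearest center.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), centers


def plot_clusterings(n_atoms, msd, ks=(3, 4, 5, 6), out_file="kmeans_hemoglobin.png"):
    """Save one figure with a scatter plot per value of k, coloured by cluster."""
    # Imported here so that kmeans can be used without matplotlib.
    import matplotlib.pyplot as plt

    points = np.column_stack([n_atoms, msd])
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, k in zip(axes.ravel(), ks):
        labels, _ = kmeans(points, k)
        ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10")
        ax.set_xlabel("N (number of atoms)")
        ax.set_ylabel("M (mean square displacement)")
        ax.set_title(f"k = {k}")
    fig.tight_layout()
    fig.savefig(out_file, dpi=150)
```

Note that N and M are on quite different numerical scales, so the distances k-means uses may be dominated by one of them. The assignment does not say whether to rescale the two quantities before clustering, so the sketch clusters the raw (N,M) pairs.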