6. Classify downloaded protein structures
In this assignment, you'll be trying out the k-means method on protein structures. The protein structures to use are those of human hemoglobin, as found in the PDB at rcsb.org by searching for hemoglobin and selecting "Homo sapiens" and the best refinement resolution (<1.5 Angstrom). This results in the following 44 PDBIDs:
6LCX, 6LCW, 6KAO, 6KAP, 6L5V, 6KA9, 6KAI, 6KAH, 6KAE, 3S66, 7JY3,
2D5Z, 2W72, 7DY4, 7DY3, 2DN2, 2DN1, 2DN3, 1J40, 1J41, 1IRD, 5QR5,
5QQR, 5QR1, 5QQY, 4Y08, 6TX8, 4HF3, 5O10, 3ZOO, 5TY3, 3TEM, 6YA6,
3UVC, 4GR8, 7N5O, 6H5W, 6E4F, 4QC4, 6J6M, 7P8X, 7MU3, 3U9W, 4AJX
Your task is to write a script that
- Uses Biopython's Bio.PDB module to download the protein structures with these PDBIDs
- For each, extracts the 3-dimensional positions of all the atoms
- Computes or determines N, the number of atoms stored in the structure, and M, the mean square displacement of the atoms from the center, defined as
$$ M =\ \frac{1}{N} \sum_{i} \left[ (x_{i} - <x>)^{2} + (y_{i} - <y>)^{2} + (z_{i} - <z>)^{2} \right] $$
where
$$ <x> =\ \frac{1}{N} \sum_{i} x_{i},\ <y> =\ \frac{1}{N} \sum_{i} y_{i} ,\ <z> =\ \frac{1}{N} \sum_{i} z_{i} $$
- With these 44 pairs of (N,M), performs a k-means clustering with the number of clusters set to 3, 4, 5, and 6
- Produces plots of the results. The plots should be scatter plots of M and N, with the colour of each point determined by the cluster number found by the k-means method.
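As a starting point for the download and measurement steps, here is a minimal sketch, not a reference solution. The helper names `atom_count_and_msd` and `load_coords` and the `pdb_files` directory are hypothetical. The sketch assumes the legacy PDB file format is requested and that every atom record in the file (including waters and ligands) counts towards N. It takes M to be the mean squared distance of the atoms from their centroid.

```python
import numpy as np


def atom_count_and_msd(coords):
    """Return (N, M) for an (N, 3) array of atomic positions.

    M is the mean squared distance of the atoms from their centroid.
    """
    coords = np.asarray(coords, dtype=float)
    n_atoms = coords.shape[0]
    center = coords.mean(axis=0)  # (<x>, <y>, <z>)
    msd = np.sum((coords - center) ** 2) / n_atoms
    return n_atoms, msd


def load_coords(pdb_id, pdir="pdb_files"):
    """Download one PDB entry and return its atom positions as an array.

    Assumption: the entry is available in the legacy PDB format.
    """
    # Imported here so the numerical helper above works without Biopython.
    from Bio.PDB import PDBList, PDBParser

    path = PDBList().retrieve_pdb_file(pdb_id, pdir=pdir, file_format="pdb")
    structure = PDBParser(QUIET=True).get_structure(pdb_id, path)
    return np.array([atom.coord for atom in structure.get_atoms()])
```

A loop such as `pairs = [atom_count_and_msd(load_coords(pid)) for pid in pdb_ids]` would then collect the 44 (N, M) pairs, where `pdb_ids` is your own list of the IDs above.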
Your script may combine the four plots using subplots if you want. Submit your script and the four plots (or combination thereof) by April 17, 2022 at 23:55.
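One possible way to organise the clustering and the combined figure is sketched below. It assumes scikit-learn's `KMeans` and Matplotlib are acceptable tools for this course; use the k-means routine from the lectures instead if one was prescribed. The function names, the output filename, and the choice to standardise N and M before clustering are assumptions of this sketch, not requirements of the assignment. Standardising is suggested only because N and M sit on very different numerical scales.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_labels(points, k, seed=0):
    """Return one k-means cluster label per (N, M) pair."""
    points = np.asarray(points, dtype=float)
    # Assumption: rescale each column to zero mean and unit variance so that
    # neither N nor M dominates the Euclidean distance.
    scaled = (points - points.mean(axis=0)) / points.std(axis=0)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return model.fit_predict(scaled)


def plot_clusterings(points, ks=(3, 4, 5, 6), filename="kmeans_clusters.png"):
    """Draw one scatter plot per value of k on a 2x2 grid and save it."""
    import matplotlib.pyplot as plt

    points = np.asarray(points, dtype=float)
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, k in zip(axes.flat, ks):
        labels = cluster_labels(points, k)
        # Colour each point by its cluster number.
        ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10")
        ax.set_xlabel("N (number of atoms)")
        ax.set_ylabel("M (mean square displacement)")
        ax.set_title(f"k = {k}")
    fig.tight_layout()
    fig.savefig(filename)
```

Calling `plot_clusterings(pairs)` with the list of 44 (N, M) pairs would write a single image containing all four panels.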