BCH2203 - Winter 2024: Assignment 4: Find the rotten eggs

Opened: Friday, 15 March 2024, 12:00 AM

Due: Friday, 22 March 2024, 11:59 PM

In this assignment, you'll use the subprocess module in combination with the NCBIXML reader from the Bio.Blast module to run a number of alignments for the following use case:

Given a number of batches of DNA fragments taken from 12 eggs, we want to know which eggs contain salmonella by aligning these fragments against a reference salmonella genome and a reference chicken genome.

For the purpose of this assignment, do not get the reference genomes from NCBI's GenBank. Instead, use the following files as reference genomes:

genome.fa for the salmonella genome and
chromosome1.fa for the chicken genome
(for the purpose of this assignment, chromosome1.fa is actually only one tenth of one of the chicken's chromosomes).

The DNA data from the eggs is as follows:

We have 12 batches (one per egg) of roughly 40 DNA fragments.
Each fragment is 150 bases long.

Goal: find out which of the eggs is/are contaminated with salmonella.

All the data can be found in the file a4data.zip below, or at /scinet/course/bch2203/a4data.zip on the teach cluster (where you can extract its contents with the 'unzip' command). The DNA fragments are in the 'fragments' directory and are named 'eggXfragmentY', with X replaced by the number of the egg from which the fragment originated and Y by the fragment number.
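As an illustration of the naming scheme, the following is a minimal sketch of how fragment files could be grouped by egg. It assumes the names match 'eggXfragmentY' exactly, possibly followed by a file extension; the function names and the regular expression are hypothetical and not part of the assignment data.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed pattern: "egg<number>fragment<number>", optionally followed by an extension.
FRAGMENT_NAME = re.compile(r"egg(\d+)fragment(\d+)")


def parse_fragment_name(name):
    """Return (egg_number, fragment_number), or None if the name does not match."""
    match = FRAGMENT_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def group_fragments_by_egg(fragment_dir):
    """Map each egg number to the list of its fragment file paths."""
    fragments_by_egg = defaultdict(list)
    for path in sorted(Path(fragment_dir).iterdir()):
        parsed = parse_fragment_name(path.name)
        if parsed is not None:
            egg_number, _ = parsed
            fragments_by_egg[egg_number].append(path)
    return dict(fragments_by_egg)
```

Calling group_fragments_by_egg("fragments") after unzipping should then give one entry per egg, provided the directory layout matches the description above.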

Your task is to write a script that performs the following tasks.

Using subprocess to call makeblastdb to build databases (i.e. indices) for the two reference genomes;
Using subprocess repeatedly to call blastn to align the fragments with the two reference genomes using evalue=0.001, making sure that blastn writes out the result in xml format;
Using the Bio.Blast.NCBIXML.read function to read in the resulting xml file;
Counting the number of hsps (High Scoring Pairs) for each egg for each of the two reference genomes (salmonella and chicken);
Using the ratio of the number of hsps matches with salmonella for a given egg to the number of hsps matches with chicken as a measure of the "rottenness" of that egg;
Printing out each egg number followed by the rottenness of that egg, sorted from freshest to most rotten.
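The steps above can be sketched roughly as follows. This is only an outline under stated assumptions, not a reference solution: the function names and database names are hypothetical, '-outfmt 5' is the blastn option that selects XML output, and treating an egg with zero chicken matches as infinitely rotten is a choice made here for illustration that the assignment does not specify.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name):
    """Build the argument list for indexing a nucleotide reference genome."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def blastn_command(query_path, db_name, xml_path, evalue=0.001):
    """Build the argument list for aligning one fragment, writing XML (outfmt 5)."""
    return ["blastn", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "5", "-out", xml_path]


def run_command(command):
    """Run an external command and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)


def rottenness(salmonella_hsps, chicken_hsps):
    """Ratio of salmonella hsps to chicken hsps for one egg."""
    if chicken_hsps == 0:
        # Assumption: no chicken matches at all is treated as maximally rotten
        # when there is at least one salmonella match, and as 0.0 otherwise.
        return float("inf") if salmonella_hsps > 0 else 0.0
    return salmonella_hsps / chicken_hsps


def rank_eggs(hsp_counts):
    """Sort eggs from freshest to most rotten.

    hsp_counts maps egg number -> (salmonella_hsps, chicken_hsps).
    """
    scores = {egg: rottenness(s, c) for egg, (s, c) in hsp_counts.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

For example, run_command(makeblastdb_command("genome.fa", "salmonella_db")) would build the salmonella database, where "salmonella_db" is a made-up name. Passing the command as a list rather than a single string avoids shell quoting issues.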

Note that Bio.Blast.NCBIXML.read will return a "Bio.Blast.Record.Blast" object containing a property called 'alignments', which is a list of objects of type "Bio.Blast.Record.Alignment". It is a list as it could hold all matches for each of a set of fragments, but if you have matched only one fragment at a time, it will be a list of just one such alignment. Each "Bio.Blast.Record.Alignment" contains a property called 'hsps', which is a list of the actual alignment matches. You can determine the number of elements of 'hsps' to count up the number of hsps matches for each egg.
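A minimal sketch of the counting described in this note is given below. The function names are hypothetical. The counting function relies only on the 'alignments' and 'hsps' properties, so it is demonstrated here on stand-in objects; in real use the record would come from Bio.Blast.NCBIXML.read, which requires Biopython to be installed. Summing over all alignments also covers the cases where the list is empty or has more than one entry.

```python
from types import SimpleNamespace


def count_hsps(record):
    """Total number of hsps over all alignments in one BLAST record."""
    return sum(len(alignment.hsps) for alignment in record.alignments)


def count_hsps_in_file(xml_path):
    """Read one blastn XML result file and count its hsps."""
    # Imported here so that count_hsps can be tried without Biopython.
    from Bio.Blast import NCBIXML

    with open(xml_path) as handle:
        record = NCBIXML.read(handle)
    return count_hsps(record)


# Stand-in for a parsed record: two alignments holding 2 and 1 hsps.
fake_record = SimpleNamespace(
    alignments=[
        SimpleNamespace(hsps=["hsp_a", "hsp_b"]),
        SimpleNamespace(hsps=["hsp_c"]),
    ]
)
print(count_hsps(fake_record))  # prints 3
```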

Keep best practices in mind, i.e., use functions, document and comment your code, and use meaningful names for variables and functions.

Submit your script by March 22, 2024 at 11:59 PM.

a4data.zip
15 March 2024, 10:46 AM