Assignment 5
In this assignment, we will try to identify (potential) messenger RNA encoded by the DNA sequence stored in the FASTA file chromosome1.fa.
As you probably know, sequences in DNA are recipes for proteins, with triplets of C, T, G, and A encoding specific amino acids (which is why these triplets are called codons). The first step in communicating these recipes to the ribosomes, where proteins are ultimately synthesized, is the production of messenger RNA, or mRNA.
mRNA is produced by 'transcribing' DNA. From a bioinformatics point of view, finding possible mRNA sequences entails:
- Making a choice of reading frame, i.e., where to start triplets and in which direction to read the DNA sequence.
- Transcribing the triplets from DNA to RNA, which is given by the mapping:
T → A, A → U, G → C, C → G.
- Finding a starting point of the mRNA in the sequence. This is given by a specific codon. Let's say that the RNA start codon for our sample is AUG.
- Reading and translating the sequence until a stop codon is encountered. There are several possibilities, but for this assignment, let's say the only stop codon is UAA.
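The transcription and triplet steps above can be sketched in base R on a toy string. This is only an illustration under stated assumptions: the helper names `transcribe` and `to.codons` are hypothetical, the sketch reads in frame 1 only, and it does not reverse the sequence or use Biostrings as the assignment itself asks.

```r
# Hypothetical helper: apply the mapping T -> A, A -> U, G -> C, C -> G
# to a plain character string of DNA bases.
transcribe <- function(dna) {
  chartr("TAGC", "AUCG", dna)
}

# Hypothetical helper: split an RNA string into triplets, starting at
# position 1. Trailing bases that do not fill a triplet are dropped.
to.codons <- function(rna) {
  n.codons <- nchar(rna) %/% 3
  if (n.codons < 1) return(character(0))
  starts <- 3 * seq_len(n.codons) - 2
  substring(rna, starts, starts + 2)
}

rna <- transcribe("TACATT")   # "AUGUAA"
codons <- to.codons(rna)      # "AUG" "UAA"
print(rna)
print(codons)
```

In this toy input the first triplet is the start codon AUG and the second is the stop codon UAA.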
Create a utilities file, called mRNA.Utilities.R. In this file create the following functionality:
1a) Create a function which takes a FASTA file name as an argument. The function should load the file, convert the data to the reverse complement (to convert it to the perspective of RNA), convert the data to an RNAStringSet, and return the result.
1b) Create a function which takes an RNAStringSet as an argument. The function should determine the codons of the input. The codons function in the Biostrings library is useful here.
1c) Create a function which takes the list of codons, and a specific codon, as arguments. The function will return the indices of all the locations where the specific codon occurs in the list of codons. The which and as.character functions may be useful here.
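As a rough illustration of how which and as.character combine for 1c, here is a minimal sketch. It assumes the codons can be coerced to a plain character vector; the function name `find.codon` and the toy data are hypothetical.

```r
# Hypothetical helper: return the positions at which `codon` occurs.
# as.character() is applied so the comparison is between plain strings.
find.codon <- function(codons, codon) {
  which(as.character(codons) == codon)
}

# Toy codon list, not taken from chromosome1.fa.
codons <- c("GGC", "AUG", "CCA", "UAA", "AUG", "UAA")

print(find.codon(codons, "AUG"))  # positions 2 and 5
print(find.codon(codons, "UAA"))  # positions 4 and 6
```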
1d) Create a function which, given vectors of start and stop indices as arguments, returns a list of lists. Each element of the outer list should contain the start and stop indices of one genetic sequence. If you find that this function takes forever to run, you may return the indices of only the first 10000 sequences.
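One possible reading of 1d is sketched below. The assignment does not say how starts and stops should be paired, so this sketch assumes that each start index is paired with the first stop index that follows it, and that both input vectors are sorted in ascending order. The name `pair.indices` and the `max.pairs` parameter are hypothetical.

```r
# Hypothetical helper: pair each start index with the first later stop index.
# Assumes `starts` and `stops` are sorted ascending.
# `max.pairs` caps the output, mirroring the 10000-sequence allowance.
pair.indices <- function(starts, stops, max.pairs = 10000) {
  pairs <- list()
  for (s in starts) {
    later <- stops[stops > s]
    # With sorted input, no later start can have a stop either.
    if (length(later) == 0) break
    pairs[[length(pairs) + 1]] <- list(start = s, stop = later[1])
    if (length(pairs) >= max.pairs) break
  }
  pairs
}

# Toy indices: the start at 9 has no following stop and is dropped.
pairs <- pair.indices(c(2, 5, 9), c(4, 7))
str(pairs)
```

Growing a list inside a loop is simple but slow for very long inputs, which is one reason a cap on the number of pairs can be useful.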
1e) Create a function which, given the list of RNA codons and start and stop indices, prints out the associated sequence.
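For 1e, a minimal sketch of joining the codons between two indices into one string might look like the following. It assumes the codons behave like a character vector and that the stop codon is included in the printed sequence, as in the sample output; the names `sequence.string` and `show.sequence` are hypothetical.

```r
# Hypothetical helper: concatenate codons from `start` to `stop`, inclusive.
sequence.string <- function(codons, start, stop) {
  paste(as.character(codons[start:stop]), collapse = "")
}

# Hypothetical helper: print the concatenated sequence.
show.sequence <- function(codons, start, stop) {
  print(sequence.string(codons, start, stop))
}

# Toy codon list, not taken from chromosome1.fa.
codons <- c("GGC", "AUG", "CCA", "UAA", "GGG")
show.sequence(codons, 2, 4)  # prints "AUGCCAUAA"
```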
1f) Create a driver function which, given a FASTA file name:
- Reads the FASTA file and stores the associated RNA sequence.
- Converts this sequence into a set of codons.
- Determines the indices of all the start codons ("AUG") in the sequence.
- Determines the indices of all the stop codons ("UAA") in the sequence.
- Determines the start and stop positions of each mRNA sequence in the sample.
- Prints out the mRNA sequence starting at each start codon and ending at a stop codon, for the first 10 such sequences.
Note that we have only examined one of the 3 possible 'frames', meaning that we've started determining our codons at position 1. We could also consider starting at positions 2 and 3, but will not do so here.
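Although the assignment uses only the first frame, the effect of the starting position can be seen in a small base R sketch. The function name `codons.in.frame` and the toy string are hypothetical, and the input is assumed to be a plain character string.

```r
# Hypothetical helper: split an RNA string into triplets starting at
# position `frame` (1, 2, or 3). Incomplete trailing triplets are dropped.
codons.in.frame <- function(rna, frame = 1) {
  n.codons <- (nchar(rna) - frame + 1) %/% 3
  if (n.codons < 1) return(character(0))
  starts <- frame + 3 * (seq_len(n.codons) - 1)
  substring(rna, starts, starts + 2)
}

rna <- "AAUGUAAC"
print(codons.in.frame(rna, 1))  # "AAU" "GUA"
print(codons.in.frame(rna, 2))  # "AUG" "UAA"
print(codons.in.frame(rna, 3))  # "UGU" "AAC"
```

In this toy string the start and stop codons are visible only in the second frame, which is why the choice of frame matters.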
> source("mRNA.Utilities.R")
>
> print.sequences('chromosome1.fa')
[1] "AUGCAUUCUCUUCAUAAGCAAGACCCAAAUUUCCACUCUCUUCAUAAGACCCCAAAUUCCCACAUUUUUGUA
CCCUCUCCUCACAAAAAGACCCAACUUCUCUUCAUGAAUUAGACCCAAACUCCCCCAUUUUUUGUCCUCUUCCUAAA
UGAGACCGUCAUCUCUUCCUCAGUUAGCCCCAAAUCCCCCCAUUUAUGUGCCCUCUCUUCUCAAACUUGGCGACCCU
GCACAUAGCAGGGGGUUGGAACUGGAUGAGCACUGUGGUCCUUUGCAACCAGCCGGCCCGAACGCACUUCUAUAUAU
AGGAACCCGGUGUUCCUACAUUCAACUUCUCUUAAUACCCAGAAGUUAGACAAGAAUUCUCAUUUCAGAAUUGCAAU
GGGAAAAAAAAAAAUGACACCUCCGUGAUGGCCUAGGUGGGCUCUGCCAGCGUUCUCUCUCAGAAGCAAGCAGAAGC
AAUCAGCAGAAAGGGCUCAGAGCUCUUCAUCAUCAUCAGGACCGAUGGGCAGAGAGGGCAUGUGGCUAAAUAGCAAA
GGGAAAAGAGAUGCCUGAACCAAGGCUGGGAUGAUCUGAAGGCAGGAGGAGCUGAGAGCGCACAGAGGGUGAGGGAU
GGCUGCGUUUGCCUUUCUCCGGCUUGUGGGAAUGGAGAAGAAAAAUGGCAGAAGAAGGCAAUGAACACCGAGAUGGA
GGGCUGCAUGUGCCAGCCUGUACCGAGGACCCCACGUCCCUGCUCUGCUCACACCCCUCCUUCAAUACCAACAAAAG
CUACGACGGCACAGUGCUGUUUUUCAGCUGCUAAGGGAGGCUUGAAGAGGAUGAGCCUUGCUUUUCAUGCCCUUUGC
UUGUUUUUUUUUUUAUAAAGCAUGAGAAAUCAUGUAGCACUAA"
.
.
.
>
Submit your mRNA.Utilities.R. Assignments will be graded on a 10 point basis.
Due date is April 19th, 2023 at midnight, with 0.5 point penalty per day for late submission until the cut-off date of April 26th, 2023 at 9:00am.