Assignment 1
Due date: Tuesday, January 23rd at 11:59 pm.
Please note that all of the commands and techniques you need to solve this assignment were given in class. No internet searches should be necessary to complete this assignment. If you aren't sure where to start, review the class slides.
The purpose of this assignment is to practise your bash scripting skills on a real data set. Before you begin, be sure to create a new directory to hold your assignment, and move into that directory:
[ejspence.mycomp] pwd
/c/Users/ejspence/EES1137
[ejspence.mycomp]
[ejspence.mycomp] mkdir assignment1
[ejspence.mycomp] cd assignment1
[ejspence.mycomp] pwd
/c/Users/ejspence/EES1137/assignment1
[ejspence.mycomp]
Consider the following data set, which concerns the human genome oligo microarray G4112A: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6848. To make this data set available for examination, recall two commands from class:
- curl: this command downloads files from a given internet address.
- gunzip: this command uncompresses a gzipped file.
To download and uncompress the data set, use the following commands at the bash command line:
[ejspence.mycomp] pwd
/c/Users/ejspence/EES1137/assignment1
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://pages.scinet.utoronto.ca/~ejspence/GPL6848-9572.txt.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
GPL6848-9572.txt.gz
[ejspence.mycomp] gunzip GPL6848-9572.txt.gz
[ejspence.mycomp] ls
GPL6848-9572.txt
[ejspence.mycomp]
The data is now ready to be analyzed.
If you look into the data file (try the 'less' command), you'll notice that the file is a TSV (tab-separated values) file, with many columns. Each row of the file references an Agilent feature number, which references a gene. Examine the column headers and familiarize yourself with the types of entries in the columns.
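One way to get a feel for a TSV file is to practise on a tiny one first. The sketch below builds a hypothetical three-column file (its column names and values are invented for illustration and are not taken from the assignment data) and assumes that head and cut were among the commands covered in class.

```shell
# Build a tiny tab-separated file to experiment on.
# The contents are made up; this is NOT the assignment data.
toy_file=$(mktemp)
printf 'ID\tSPECIES\tCHROM\n' > "$toy_file"
printf 'A_1\tmouse\tchr2\n' >> "$toy_file"
printf 'A_2\thuman\tchr7\n' >> "$toy_file"

# Show only the header row, to see what the columns are called.
header=$(head -n 1 "$toy_file")
echo "$header"

# Pull out the second column of every row (cut splits on tabs by default).
second_column=$(cut -f 2 "$toy_file")
echo "$second_column"

# Clean up the temporary file.
rm "$toy_file"
```

The same two commands can be pointed at the real data file to see its header and to isolate a single column.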
Using this information, write a shell script, called count.drosophila.sh, which
- takes a filename as an input argument,
- prints out the name of the input file,
- prints out the number of entries which reference "Drosophila",
- prints out the number of entries which reference "Drosophila" and are on chromosome 16, and
- prints out the ID with the largest number, among the entries whose ID starts with "A_24".
The script will be sourced from the command line, and should behave as follows:
[ejspence.mycomp] pwd
/c/Users/ejspence/EES1137/assignment1
[ejspence.mycomp] ls
count.drosophila.sh GPL6848-9572.txt
[ejspence.mycomp]
[ejspence.mycomp] source count.drosophila.sh GPL6848-9572.txt
Working with data file GPL6848-9572.txt.
The total number of entries which reference Drosophila is 648.
The number of Drosophila entries on chromosome 16 is 24.
The entry which has the largest ID, and whose ID starts with A_24, is A_24_P945408.
[ejspence.mycomp]
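How a sourced script receives its input argument can be sketched with a toy example. The script contents, the variable name "filename" and the argument "some.data.txt" below are all hypothetical, chosen only to illustrate the mechanism; the sketch assumes a bash shell.

```shell
# Write a two-line toy script into a temporary file.
# The quoted 'EOF' stops $1 from being expanded while the file is written.
toy_script=$(mktemp)
cat > "$toy_script" << 'EOF'
filename=$1
echo "The first argument is ${filename}."
EOF

# Source the script with one argument; inside the script, $1 holds it.
source "$toy_script" some.data.txt

# Clean up the temporary file.
rm "$toy_script"
```

Because the script is sourced rather than executed, it runs in the current shell, so the variable "filename" is still set after the script finishes.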
Some points to consider:
- Full points will be awarded for implementations which store the calculated values, such as the number of Drosophila entries, in local variables, before printing the output.
- Similarly, full points will be awarded for solutions which do not use "grep -c". The "grep" command may be used, just without the "-c" flag.
- Do not "hard code" the answers. This means you should not have the numbers 648 or 24, nor the string "GPL6848-9572.txt", anywhere in your script.
- Mac users may find that there is extra white space around the numbers in their output sentences. Do not worry about this white space. Extra spaces within the sentences are not important.
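The first two points can be illustrated on toy data. The following is a minimal sketch, not a solution: the file contents, the search term "apple" and the variable name "apple_count" are invented for illustration. It stores a count in a variable using command substitution, and counts matching lines by piping grep into wc rather than using "grep -c".

```shell
# Build a small toy file; the contents are made up.
toy_file=$(mktemp)
printf 'apple pie\nbanana bread\napple tart\ncherry cake\n' > "$toy_file"

# Count the matching lines with grep piped into wc -l (no "grep -c"),
# and capture the result in a variable with $( ... ).
apple_count=$(grep "apple" "$toy_file" | wc -l)

# Print the stored value as part of a sentence.
echo "The number of apple lines is ${apple_count}."

# Clean up the temporary file.
rm "$toy_file"
```

On a Mac, wc pads its output with spaces, so the number in the printed sentence may be surrounded by extra white space, as mentioned in the last point above.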
Submit your count.drosophila.sh script.
Assignments will be graded out of 10 points.
The due date is January 23, 2024 at 11:59 pm, with a 0.5-point penalty per day for late submissions until the cut-off date of January 30, 2024 at 10:00 am.