Assignment 1
Due date: Thursday, September 21st at 11:55 pm.
Please note that all of the commands and techniques you need to solve this assignment were given in class. No internet searches should be necessary to complete this assignment. If you aren't sure where to start, review the class slides.
The purpose of this assignment is to practise your bash scripting skills on a real data set. Before you begin, be sure to create a new directory to hold your assignment, and move into that directory:
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090
[ejspence.mycomp]
[ejspence.mycomp] mkdir assignment1
[ejspence.mycomp] cd assignment1
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1
[ejspence.mycomp]
Consider the following data set, which concerns the response of bipolar disorder patients to lithium treatments: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS5393. To make this dataset available for examination, we now introduce two new bash commands:
- curl: this command downloads files from a given internet address.
- gunzip: this command uncompresses a gzipped file.
To download and uncompress the data set, use the following commands at the Linux command line:
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://pages.scinet.utoronto.ca/~ejspence/GDS5393.soft.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
GDS5393.soft.gz
[ejspence.mycomp] gunzip GDS5393.soft.gz
[ejspence.mycomp] ls
GDS5393.soft
[ejspence.mycomp]
The data is now ready to be analyzed. Note that the flag used with the curl command is a capital "oh", not a zero.
If you look into the data file (try the 'less' command, and type 'q' to get out), you'll notice, once you get past the header information, that each subject of the study is identified by a character string of the form ILMN_XXXXXXX, where XXXXXXX is a 7-digit number.
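As a rough illustration of how such identifier lines can be picked out, here is a minimal sketch on a small, made-up file. The header lines, gene names, and values are invented for the demo, and the real file is far larger.

```shell
# Build a tiny, made-up sample file (not the real data set).
sample=$(mktemp)
printf '%s\n' \
  '!dataset_title = toy example' \
  'ID_REF IDENTIFIER GSM1 GSM2' \
  'ILMN_1234567 GENE_A 5.1 6.2' \
  'ILMN_7654321 GENE_B null 4.0' \
  'ILMN_3299955 GENE_C 7.3 8.8' > "$sample"

# Keep only the lines that begin with the identifier prefix;
# the two header lines are dropped.
matches=$(grep '^ILMN_' "$sample")
echo "$matches"

# Clean up the temporary file.
rm -f "$sample"
```

The '^' anchors the pattern to the start of the line, so a header line that merely mentions ILMN_ somewhere in its text would not be matched.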
Using this information, write a shell script, called count.patients.sh, which
- takes a filename as an input argument,
- prints out the name of the input file,
- prints out the number of patients listed in the file (assuming the file has the patient-identification format of the aforementioned data file),
- prints out the number of patients that do not have 'null' as one of the entries in their columns (meaning the patient has complete data), and
- prints out the identifier of the patient who has the largest numeric identifier, and whose numeric identifier is greater than 3199999 and less than 3300000.
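The first two requirements rely on the script's input argument. The sketch below shows one way an argument can be read and echoed back. The helper script and the file name some.file.txt are hypothetical, and the demo runs the helper with 'bash' rather than 'source' so that it works as a single block; in both cases the first argument is available as $1.

```shell
# Hypothetical helper script, written to a temporary file for this demo.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# $1 holds the first argument given after the script name.
infile=$1
echo "Working with data file ${infile}."
EOF

# Arguments after the script name become $1, $2, ... inside the script.
output=$(bash "$demo" some.file.txt)
echo "$output"

# Clean up the temporary file.
rm -f "$demo"
```

Copying $1 into a named variable at the top of the script keeps the rest of the script readable and avoids hard-coding any file name.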
The script will be sourced from the command line, and should produce output as follows:
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1
[ejspence.mycomp] ls
count.patients.sh GDS5393.soft
[ejspence.mycomp]
[ejspence.mycomp] source count.patients.sh GDS5393.soft
Working with data file GDS5393.soft.
The total number of patients is 48107.
The number of patients with complete data is 47323.
The patient who has the largest numeric identifier, and whose numeric identifier is greater than 3199999 and less than 3300000, is patient ILMN_3299955.
[ejspence.mycomp]
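For the last line of output, one possible approach (an assumption, not necessarily the method covered in class) is to filter the identifiers to the requested range and then sort them. The identifiers below are made up. Since every identifier has the same number of digits, a plain 'sort' orders them numerically as well.

```shell
# Made-up identifiers; only some fall between 3199999 and 3300000.
ids=$(printf '%s\n' \
  'ILMN_3200010' \
  'ILMN_3299001' \
  'ILMN_3100500' \
  'ILMN_3250000' \
  'ILMN_3300002')

# Seven-digit numbers in the range all start with 32,
# so keep those, sort them, and take the last one.
largest=$(echo "$ids" | grep '^ILMN_32' | sort | tail -n 1)
echo "Largest identifier in range: ${largest}"
```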
Some points to consider:
- Full points will only be awarded for implementations which store the numbers of patients in local variables, before printing the output.
- Similarly, full points will only be awarded for implementations which do not use 'grep -c'. The "grep" command may be used, just without the "-c" flag.
- Do not "hard code" the answers. This means you should not have the numbers 48107 and 47323, nor the string "GDS5393.soft", anywhere in your script.
- Mac users may find that there is extra white space around the numbers in their output sentences. Do not worry about this white space. Extra spaces within the sentences are not important.
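To illustrate the first two points, here is a minimal sketch of storing counts in variables without 'grep -c', using a made-up three-line file. Piping into 'wc -l' is one option among several; on a Mac, 'wc' pads its output with spaces, which is the source of the extra white space mentioned above.

```shell
# Tiny made-up file: three lines, one containing 'null'.
sample=$(mktemp)
printf '%s\n' 'alpha 1.0' 'beta null' 'gamma 3.0' > "$sample"

# Command substitution, $( ... ), stores a command's output in a variable.
nlines=$(wc -l < "$sample")

# grep -v keeps the lines that do NOT match; wc -l then counts them.
ncomplete=$(grep -v 'null' "$sample" | wc -l)

echo "The file has ${nlines} lines, ${ncomplete} of them without 'null'."

# Clean up the temporary file.
rm -f "$sample"
```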
Submit your count.patients.sh script.
Assignments will be graded out of 10 points.
The due date is September 21, 2023 at 11:55 pm, with a 0.5-point penalty per day for late submissions until the cut-off date of September 28, 2023 at 9:00 am.