Assignment 1
Due date: Thursday, September 19th at 11:59 pm.
Please note that all of the commands and techniques you need to solve this assignment were given in class. No internet searches should be necessary to complete this assignment. If you aren't sure where to start, review the class slides.
The purpose of this assignment is to practise your bash scripting skills on a real data set. Before you begin, be sure to create a new directory to hold your assignment, and move into that directory:
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090
[ejspence.mycomp]
[ejspence.mycomp] mkdir assignment1
[ejspence.mycomp] cd assignment1
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1
[ejspence.mycomp]
Let us consider a data set which was collected from the ChEMBL database, a database of bioactive chemicals. The data set is stored as a CSV (Comma-Separated Values) file. Each row corresponds to a specific chemical, and each column corresponds to a specific attribute of that chemical. Of particular interest for this assignment is the 'Smiles' column, which gives the representation of the chemical in 'SMILES' format. The SMILES format represents a chemical as an ASCII string (a sequence of characters).
I've created a cleaned and simplified version of this data set for this assignment. To make this data set available for examination, we now introduce a new bash command, "curl". The curl command will download a file from the internet.
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment1
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://pages.scinet.utoronto.ca/~ejspence/chembl_CBr.csv
[ejspence.mycomp]
[ejspence.mycomp] ls
chembl_CBr.csv
[ejspence.mycomp]
The data is now ready to be analyzed.
If you look into the data file (try the 'head' command), you'll notice that the file is a CSV file, with many columns. Each row of the file corresponds to a different chemical. Examine the column headers and familiarize yourself with the types of entries in the columns.
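As a minimal sketch of how one might inspect the headers of a CSV file, the example below builds a tiny stand-in file and looks at it. The column names (ID, Name, Weight) are made up for illustration and are not the columns of the assignment's data file. The 'tr' and 'cat -n' steps are an optional convenience for numbering the columns.

```shell
# Build a tiny stand-in CSV; the column names here are made up for illustration.
sample=$(mktemp)
printf 'ID,Name,Weight\nX001,alpha,12.5\nX002,beta,7.25\n' > "$sample"

# Show the first two lines of the file (the header row and one data row).
head -n 2 "$sample"

# Put each header on its own line and number it, to find a column's position.
headers=$(head -n 1 "$sample" | tr ',' '\n' | cat -n)
echo "$headers"

# Clean up the temporary file.
rm -f "$sample"
```

Knowing the position of a column is useful later, when a command needs to be told which field to work on.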
Using this information, write a shell script, called bromomethane.sh, which
- takes a filename as an argument,
- prints out the name of the input file,
- prints out the number of chemicals in the file that contain a bromomethane group (a "(CBr)" string in the "Smiles" column),
- prints out the number of chemicals that contain a bromomethane group and have a BAO Format ID of type "BAO_0000357", and
- prints out the ChEMBL ID of the chemical containing a bromomethane group with the highest molecular weight. Note that you will need to specify the "field separator" when you use the 'sort' command, to tell it which symbol separates the columns. Read the man page for 'sort' to determine how to do this. Because all of the ChEMBL IDs have the same number of characters, you may hard-code the number of characters when you call 'cut'.
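To illustrate the 'sort' and 'cut' techniques mentioned in the last item, here is a small sketch on a made-up two-column file. The file contents, the 4-character IDs and the column position are assumptions chosen for the example; the real data file has different columns, so the field number and character count will differ.

```shell
# Hypothetical stand-in data: an ID column followed by a numeric column.
sample=$(mktemp)
printf 'X001,12.5\nX002,107.25\nX003,7.0\n' > "$sample"

# -t ',' sets the field separator, -k 2,2 uses only the second field as the
# sort key, and -n compares numerically, so the largest value ends up last.
largest=$(sort -t ',' -k 2,2 -n "$sample" | tail -n 1)

# Every ID in this toy file is 4 characters long, so keep characters 1-4.
id=$(echo "$largest" | cut -c 1-4)
echo "The ID with the largest value is \"$id\"."

# Clean up the temporary file.
rm -f "$sample"
```

Without the -n flag, 'sort' compares the values as text, in which case "107.25" would sort before "12.5".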
The script will be sourced from the command line, and should output as follows:
[ejspence.mycomp]
[ejspence.mycomp] source bromomethane.sh chembl_CBr.csv
Working with data file chembl_CBr.csv.
The total number of chemicals with a bromomethane group is 118.
The number of chemicals with a bromomethane group which have a BAO Format ID of BAO_0000357 is 16.
The ChEMBL ID of the chemical containing a bromomethane group with the highest molecular weight is "CHEMBL1083547".
[ejspence.mycomp]
Some points to consider:
- Full points will be awarded for implementations which store the calculated values, such as the number of bromomethane entries, in local variables, before printing the output.
- Similarly, full points will be awarded for solutions which do not use "grep -c". The "grep" command may be used, just without the "-c" flag.
- Do not "hard code" the answers. This means you should not have the numbers 118 or 16, nor the string "chembl_CBr.csv", anywhere in your script.
- Mac users may find that there is extra white space around the numbers in their output sentences. Do not worry about this white space. Extra spaces within the sentences are not important.
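The first two points can be sketched as follows, using a made-up text file rather than the assignment's data. The file contents and the variable name "count" are assumptions for illustration only. On a Mac, the 'wc' command typically pads its output with leading spaces, which is the likely source of the extra white space mentioned above.

```shell
# Hypothetical stand-in data; in a real script the file name would come from "$1".
sample=$(mktemp)
printf 'apple pie\nbanana split\napple tart\n' > "$sample"

# Command substitution, $( ... ), stores a command's output in a variable.
# Here grep prints the matching lines and wc -l counts them.
count=$(grep "apple" "$sample" | wc -l)

# Print the stored value as part of a sentence.
echo "The number of lines containing apple is $count."

# Clean up the temporary file.
rm -f "$sample"
```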
Submit your bromomethane.sh script.
Assignments will be graded out of 10 points.
The due date is September 19, 2024 at 11:59 pm, with a 0.5-point penalty per day for late submissions until the cut-off date of September 26, 2024 at 10:00 am.