Assignment 4
Due date: Thursday, October 12th at midnight (Thursday night).
0) Be sure to use version control ("git") as you develop your code. We suggest you create a new directory to hold this assignment, "assignment4" for example, and initialize a new git repository within it. Do "git add ...., git commit" repeatedly as you add to your scripts. You will hand in the output of "git log" for your assignment repository as part of the assignment. You must have a significant number of commits representing the modifications, alterations and changes in your scripts. If your log does not show a significant number of meaningful commits, you will lose marks.
For this assignment we will be working with real data gathered from the City of Toronto's DineSafe Restaurant Health Inspections.
1) From your assignment directory, at the bash prompt, download and unpack the following tar file using the 'curl' and 'tar' commands. This is similar to what we saw in Assignment 1:
[ejspence.mycomp]
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment4
[ejspence.mycomp] ls
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://pages.scinet.utoronto.ca/~ejspence/dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] tar -zxf dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
dinesafe.tar.gz dinesafe2017.csv dinesafe2018.csv dinesafe2019.csv dinesafe2021.csv dinesafe2022.csv dinesafe2023.csv
[ejspence.mycomp]
The files contain the data for the Toronto DineSafe Restaurant Health Inspections for the years 2017, 2018, 2019, 2021, 2022 and a little bit of 2023.
2) Write an R script, called process.dinesafe.R, which performs the following:
a) Receives an argument from the command line, indicating which file will be read, and puts the data from this file into a data frame. Use the read.csv function.
b) Prints the name of the file being processed.
c) Calculates and prints the total number of establishments per establishment status. For this you will have to find a way to automatically identify the different types of statuses, and loop over them to compute the total number for each type. A useful function for this is unique(). Use help() and example() to learn how to use this function.
d) Calculates and prints the average fine for establishments north of the latitude 43.75. Your output should not be 'NA'. Using help() with the mean() function will be helpful here.
e) Calculates and prints the total fine given to bakeries in January. For this question, depending on your strategy, the function substr() (to determine substrings of strings) might be useful.
f) Calculates and prints the establishment type with the second-most non-zero fines. For this question, depending on your strategy, functions which might be helpful include is.na() (to determine which entries are NA), table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table).
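To get a feel for the helper functions named above before applying them to the DineSafe data, the following is a minimal sketch on toy vectors. It is not the assignment solution: the variable names (colour, score, dates), their values, and the "YYYY-MM-DD" date layout are all assumptions invented for illustration, and the real files may look different.

```r
# Toy data; names and values are invented for illustration only.
colour <- c("red", "blue", "red", "green", "red", "blue")
score  <- c(5, NA, 0, 2, 7, 3)
dates  <- c("2018-01-15", "2018-03-02")  # assumed "YYYY-MM-DD" layout

# unique() returns each distinct value once, so it can drive a loop.
for (k in unique(colour)) {
  cat(k, "--", sum(colour == k), "\n")
}

# is.na() returns TRUE where an entry is missing.
missing <- is.na(score)
print(missing)

# substr(x, start, stop) extracts characters start..stop of each string.
months <- substr(dates, 6, 7)
print(months)

# table() counts occurrences, sort() orders the counts,
# and names() recovers the labels in that order.
counts <- table(colour)
ranked <- sort(counts, decreasing = TRUE)
print(names(ranked))
print(names(ranked)[1])  # label with the largest count
```

Running help(table) and help(sort) shows the remaining options of these functions.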
Your script should output something like this, when run from the shell terminal:
[ejspence.mycomp] Rscript process.dinesafe.R dinesafe2018.csv
Processing data from file: dinesafe2018.csv
Total number of establishments per status type:
Pass -- 37001
Conditional Pass -- 9687
Closed -- 291
Average fine for establishments north of latitude 43.75: 162.75
The total amount of January fines from bakeries is 580
The establishment type with the second-most non-zero fines is: Food Take Out
----------------------------------------------------------------------------
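The invocation above passes the file name as a command-line argument. One common way to pick this up inside an R script is commandArgs(); the sketch below shows the idea under stated assumptions. The toy CSV, its column names (shop, status) and the fallback to a temporary file are hypothetical and exist only so that the example runs on its own.

```r
# Build a small toy CSV so the sketch is self-contained.
# The column names "shop" and "status" are invented for illustration.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("shop,status", "A,Open", "B,Closed", "C,Open"), demo_file)

# commandArgs(trailingOnly = TRUE) returns only the arguments that
# follow the script name, e.g. "Rscript myscript.R somefile.csv".
args <- commandArgs(trailingOnly = TRUE)

# Assumption: fall back to the toy file when no argument is supplied.
filename <- if (length(args) >= 1) args[1] else demo_file

# Read the file into a data frame and report what was read.
toy <- read.csv(filename, stringsAsFactors = FALSE)
cat("Read", nrow(toy), "rows from", filename, "\n")
```

In an actual script the fallback would normally be dropped, since the file name is always expected on the command line.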
Note that part 2c) is the only part that should have a loop. All other questions should be answered using slicing. Also note that you do not need to write your own functions for this assignment.
Note the following code, which may inspire your answers for some of the above sections:
>
> a <- 1:10
>
> a > 7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> sum(a > 7)
[1] 3
>
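The same logical-vector idea extends to selecting entries ("slicing") and to data containing NA. The following is a small sketch on invented toy vectors; the names value and pos, and the threshold 7, are assumptions for illustration and are not taken from the DineSafe files.

```r
# Toy vectors; names and values are invented for illustration only.
value <- c(100, NA, 250, 0, 50)
pos   <- c(8, 3, 9, 7.5, 2)

# A comparison gives a logical vector, which can be used as an index.
keep <- pos > 7
print(value[keep])

# mean() returns NA if any entry is NA, unless na.rm = TRUE is given.
print(mean(value))                 # NA
print(mean(value, na.rm = TRUE))   # 100

# Conditions can be combined with & before counting or slicing.
both <- keep & !is.na(value) & value > 0
print(sum(both))                   # 2

# The same indexing works on the columns of a data frame.
df <- data.frame(pos = pos, value = value)
print(df$value[df$pos > 7])
```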
3) Finally, write a shell script named "process.all.years.sh" that loops over all the CSV files in your directory and calls the previous script so that all the years are processed sequentially.
Consider the following starting point for your shell script,
for i in *csv; do
echo $i;
done
Start with this, then add and remove commands as needed so that the script executes your R script and processes all the DineSafe data files. Recall that for this to work, all 6 CSV files, the R script and the shell script must be in the same directory!
Submit your "process.dinesafe.R" and "process.all.years.sh" scripts and the output of "git log" from your assignment repository.
Both scripts should be added and committed frequently to the repository. To capture the output of 'git log', use redirection (git log > git.log) and hand in the "git.log" file.
Assignments will be graded on a 10 point basis. The due date is October 12th 2023 (midnight), with a penalty of 0.5 points per day for late submissions, until the cut-off date of October 19th 2023 at 9:00am.