Assignment 4
0) Be sure to use version control ("git") as you develop your code. We suggest you create a new directory to hold this assignment, "assignment4" for example, and initialize a new git repository within it. Run "git add ...." and "git commit" repeatedly as you add to your scripts. You will hand in the output of "git log" for your assignment repository as part of the assignment. Your log must show a significant number of meaningful commits representing the changes to your scripts; if it does not, you will lose marks.
For this assignment we will be working with real data gathered from the City of Toronto's DineSafe Restaurant Health Inspections.
1) From your assignment directory, at the bash prompt, download and unpack the following tar file using the 'curl' and 'tar' commands. This is similar to what we saw in Assignment 1:
[ejspence.mycomp]
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment4
[ejspence.mycomp] ls
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://pages.scinet.utoronto.ca/~afedosee/dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] tar -zxf dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
2022.csv 2023.csv 2024.csv dinesafe.tar.gz
[ejspence.mycomp]
The files contain the data for the Toronto DineSafe Restaurant Health Inspections for the years 2022, 2023 and 2024.
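Before writing any R, it can help to check what an archive contains and what the column names of a CSV file look like. The sketch below builds a tiny stand-in archive (demo.tar.gz, holding a made-up two-column demo.csv) in a scratch directory so that it runs without a download; the file names and columns are placeholders, not the DineSafe data. With the real files, the same tar and head commands would be pointed at dinesafe.tar.gz and, for example, 2024.csv.

```shell
# Build a small stand-in archive in a scratch directory (hypothetical data).
work=$(mktemp -d)
printf 'col_a,col_b\n1,x\n2,y\n' > "$work/demo.csv"
tar -czf "$work/demo.tar.gz" -C "$work" demo.csv
rm "$work/demo.csv"

# -t lists the contents of the archive without extracting anything.
listing=$(tar -tzf "$work/demo.tar.gz")
echo "$listing"

# -x extracts; -C chooses the directory to extract into.
tar -xzf "$work/demo.tar.gz" -C "$work"

# The first line of a CSV file is normally the header with the column names.
header=$(head -n 1 "$work/demo.csv")
echo "$header"
```

Knowing the exact column names up front makes it easier to refer to the right columns once the file has been read into a data frame.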
2) Write an R script, called process.dinesafe.R, which performs the following:
- Receives an argument from the command line, indicating which file will be read, and puts the data from this file into a data frame. Use the read.csv function.
- Prints the name of the file being processed.
- Calculates and prints the total number of establishments per establishment status. For this you will have to find a way to automatically identify the different types of statuses, and loop over them to compute the total number for each type. A useful function for this is unique(). Use help() and example() to learn how to use this function.
- Calculates and prints the average fine for establishments north of the latitude 43.75. Your output should not be 'NA'. Using help() with the mean() function will be helpful here.
- Calculates and prints the total fine given to bakeries in January. For this question, depending on your strategy, the function substr() (to determine substrings of strings) might be useful.
- Calculates and prints the establishment type with the second-most non-zero fines. For this question, depending on your strategy, functions which might be helpful include is.na() (to determine which entries are NA), table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table).
Your script should output something like this, when run from the shell terminal:
[ejspence.mycomp] Rscript process.dinesafe.R 2024.csv
Processing data from file: 2024.csv
Total number of establishments per status type:
Pass -- 45286
Conditional Pass -- 190
Average fine for establishments north of latitude 43.75: 274.1667
The total amount of January fines from bakeries is 5385
The establishment type with the second-most non-zero fines is: Food Take Out
----------------------------------------------------------------------------
Note that part 2c), the per-status totals, is the only part that should have a loop. All other questions should be answered using slicing. Also note that you do not need to write your own functions for this assignment.
Note the following code, which may inspire your answers for some of the above sections:
>
> a <- 1:10
>
> a > 7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> sum(a > 7)
[1] 3
>
3) Finally, write a shell script named "process.all.years.sh" that loops over all the CSV files in your directory and calls the previous script so that all the years are processed sequentially.
Consider the following starting point for your shell script:
for i in *.csv; do
    echo "$i"
done
Starting from this, add and remove commands as needed so that the script executes your R script and processes all the DineSafe data files. Recall that for this to work, all the CSV files, the R script and the shell script must be in the same directory!
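The general pattern of handing each file name to another program could be sketched as follows. Here wc -l is only a stand-in for the real per-file command, and the scratch directory with its two small CSV files (2001.csv and 2002.csv) is made up for the demo so that the loop has something to find.

```shell
# Scratch directory with two made-up CSV files (hypothetical data).
demo=$(mktemp -d)
printf 'a,b\n1,2\n' > "$demo/2001.csv"
printf 'a,b\n3,4\n5,6\n' > "$demo/2002.csv"
cd "$demo" || exit 1

processed=0
for i in *.csv; do
    # Quoting "$i" keeps file names that contain spaces intact.
    echo "Processing $i"
    wc -l < "$i"    # stand-in: any command that takes the file can go here
    processed=$((processed + 1))
done
echo "files processed: $processed"
```

Because the glob expands in sorted order, the files are visited one after another from the earliest name to the latest.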
Submit your "process.dinesafe.R" and "process.all.years.sh" scripts and the output of "git log" from your assignment repository.
Both scripts should be added and committed frequently to the repository. To capture the output of 'git log', use redirection (git log > git.log), and hand in the "git.log" file.
Assignments will be graded on a 10 point basis. The due date is February 6th 2025 (midnight), with a penalty of 0.5 points per day for late submissions until the cut-off date of February 13th 2025, at 10:00am.
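The commit-and-log cycle asked for in step 0 and in the hand-in instructions might look like the sketch below. It works in a throwaway directory created with mktemp so that it never touches a real repository; the file contents, the commit messages, and the name and e-mail passed with -c are placeholders for the demo only.

```shell
# Work in a scratch directory so this demo never touches a real repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Placeholder identity for the demo; normally this is set once with 'git config'.
commit() {
    git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"
}

# First small step: create the script and commit it.
echo '# first draft' > process.dinesafe.R
git add process.dinesafe.R
commit "Start process.dinesafe.R"

# Next small step: change the script and commit again.
echo '# read the command-line argument' >> process.dinesafe.R
git add process.dinesafe.R
commit "Add argument handling"

# Save the history to a file using redirection, as in the hand-in instructions.
git log > git.log
commit_count=$(git rev-list --count HEAD)
echo "commits so far: $commit_count"
```

Committing after each small, working change is what produces a log with many meaningful entries.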