MSC1090 - Fall 2023

Assignment 4

Opened: Thursday, 5 October 2023, 10:00 AM
Due: Thursday, 12 October 2023, 11:59 PM

Due date: Thursday, October 12th at midnight (Thursday night).


0) Be sure to use version control ("git") as you develop your code. We suggest you create a new directory to hold this assignment, "assignment4" for example, and initialize a new git repository within it. Run "git add ...." and "git commit" repeatedly as you add to your scripts. You will hand in the output of "git log" for your assignment repository as part of the assignment. Your log must show a significant number of meaningful commits representing the changes to your scripts; if it does not, you will lose marks.


For this assignment we will be working with real data gathered from the City of Toronto's DineSafe Restaurant Health Inspections.

1) From your assignment directory, at the bash prompt, download and unpack the following tar file using the 'curl' and 'tar' commands.  This is similar to what we saw in Assignment 1:

[ejspence.mycomp]
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment4
[ejspence.mycomp] ls
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://pages.scinet.utoronto.ca/~ejspence/dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] tar -zxf dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
dinesafe.tar.gz dinesafe2017.csv dinesafe2018.csv dinesafe2019.csv dinesafe2021.csv dinesafe2022.csv dinesafe2023.csv
[ejspence.mycomp]

The files contain the data for the Toronto DineSafe Restaurant Health Inspections for the years 2017, 2018, 2019, 2021 and 2022, plus a little bit of 2023.


2) Write an R script, called process.dinesafe.R, which performs the following:

  1. Receives an argument from the command line, indicating which file will be read, and puts the data from this file into a data frame. Use the read.csv function.
  2. Prints the name of the file being processed.
  3. Calculates and prints the total number of establishments per establishment status. For this you will have to find a way to automatically identify the different types of statuses, and loop over them to compute the total number for each type. A useful function for this is unique(). Use help() and example() to learn how to use this function.
  4. Calculates and prints the average fine for establishments north of the latitude 43.75. Your output should not be 'NA'. Using help() with the mean() function will be helpful here.
  5. Calculates and prints the total fine given to bakeries in January. For this question, depending on your strategy, the function substr() (to determine substrings of strings) might be useful.
  6. Calculates and prints the establishment type with the second-most non-zero fines. For this question, depending on your strategy, functions which might be helpful include is.na() (to determine which entries are NA), table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table).
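
As a rough illustration of the helper functions named in the list above, the sketch below applies them to small made-up vectors. The names status, amount, dates and kinds are hypothetical; they are not the column names of the DineSafe files, and the values are invented for demonstration only.

```r
# Toy data, invented for illustration only (not DineSafe values).
status <- c("open", "closed", "open", "open", "closed")
amount <- c(10, NA, 30, NA, 50)
dates  <- c("2020-01-15", "2020-02-03", "2020-01-28")
kinds  <- c("b", "a", "b", "c", "b", "a")

# unique() returns each distinct value once, in order of first appearance,
# so it can drive a loop over the categories present in the data.
for (s in unique(status)) {
    cat(s, "--", sum(status == s), "\n")
}

# mean() returns NA if any entry is NA, unless na.rm = TRUE is given.
avg <- mean(amount, na.rm = TRUE)

# substr(x, start, stop) extracts characters start..stop of each string;
# here, characters 6 to 7 hold the month.
months <- substr(dates, 6, 7)

# table() counts occurrences, sort() orders the counts,
# and names() returns the labels attached to those counts.
counts <- sort(table(kinds), decreasing = TRUE)
top <- names(counts)[1]
```

Running help() on each of these functions, as suggested above, shows the remaining options. How the pieces combine for the real data depends on your own strategy.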

Your script should output something like this, when run from the shell terminal:

[ejspence.mycomp] Rscript process.dinesafe.R dinesafe2018.csv

Processing data from file:  dinesafe2018.csv
Total number of establishments per status type:
     Pass --  37001
     Conditional Pass  -- 9687
     Closed  --  291
Average fine for establishments north of latitude 43.75: 162.75
The total amount of January fines from bakeries is 580
The establishment type with the second-most non-zero fines is: Food Take Out
----------------------------------------------------------------------------
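
One possible way to pick up the file name given after the script name is sketched below, using R's commandArgs() function. This is only a minimal sketch: the script name show.args.R and the variable names are hypothetical, and what to do when the argument is missing is a design choice left to you.

```r
# commandArgs(trailingOnly = TRUE) returns only the arguments that follow
# the script name on the command line, as a character vector.
args <- commandArgs(trailingOnly = TRUE)

if (length(args) < 1) {
    # No file name was supplied; report it instead of failing later.
    cat("Usage: Rscript show.args.R <file.csv>\n")
} else {
    # The first argument is taken to be the name of the file to read.
    filename <- args[1]
    cat("First argument: ", filename, "\n")
}
```

The value stored in filename could then be passed to read.csv().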

Note that part 2c), the third item in the list above (the number of establishments per status), is the only part that should have a loop. All other parts should be answered using slicing. Also note that you do not need to write your own functions for this assignment.
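
To illustrate what "slicing" means here, the following sketch selects rows of a small made-up data frame with logical vectors instead of a loop. The data frame and its column names (group, score, value) are hypothetical and unrelated to the DineSafe files.

```r
# Toy data frame, invented for illustration (hypothetical column names).
df <- data.frame(
    group = c("x", "y", "x", "z", "y"),
    score = c(4, 8, 15, 16, 23),
    value = c(1.5, NA, 2.5, 3.0, NA)
)

# A logical vector with one entry per row: TRUE where score exceeds 10.
keep <- df$score > 10

# Use the logical vector to slice one column; only the TRUE rows are kept.
high.values <- df$value[keep]

# Conditions can be combined with & (and) and | (or) to slice whole rows.
x.high <- df[df$group == "x" & df$score > 10, ]

# Summaries of a slice can skip missing entries with na.rm = TRUE.
total <- sum(high.values, na.rm = TRUE)
```
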

Note the following code, which may inspire your answers for some of the above sections:

>
> a <- 1:10
>
> a > 7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> sum(a > 7)
[1] 3
>
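
The session above works because sum() treats TRUE as 1 and FALSE as 0. The short sketch below, on made-up vectors, shows the same idea used to pull out the matching elements themselves and to count matches in a character vector; the vector b is hypothetical.

```r
a <- 1:10

# A logical vector can index the vector it came from,
# returning only the elements where the condition is TRUE.
big <- a[a > 7]

# The selected elements can then be summarised directly.
big.total <- sum(big)

# The same counting trick works with == on character data.
b <- c("p", "q", "p")
n.p <- sum(b == "p")
```
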


3) Finally, write a shell script named "process.all.years.sh" that loops over all the CSV files in your directory and calls the previous script so that all the years are processed sequentially.
Consider the following starting point for your shell script:

for i in *.csv; do
     echo "$i"
done

Starting from this, add and remove commands as necessary so that the script executes your R script and processes all the DineSafe data files. Recall that for this to work, all 6 CSV files, the R script and the shell script must be in the same directory!


Submit your "process.dinesafe.R" and "process.all.years.sh" scripts and the output of "git log" from your assignment repository.

Both scripts should be added and committed frequently to the repository. To capture the output of 'git log', use redirection (git log > git.log) and hand in the "git.log" file.


Assignments will be graded on a 10 point basis. The due date is October 12th 2023 (midnight), with a penalty of 0.5 points per day for late submissions until the cut-off date of October 19th 2023, at 9:00am.

All content on this website is made available under the Creative Commons Attribution 4.0 International licence, with the exception of all videos which are released under the Creative Commons Attribution-NoDerivatives 4.0 International licence.