MSC1090 - Fall 2021: Assignment 4

Opened: Thursday, 7 October 2021, 12:30 PM

Due: Thursday, 14 October 2021, 11:55 PM

You must use version control (`git`) as you develop your scripts. Start by creating a new directory and initializing a git repository in it with the following commands:

$ mkdir assignment4
$ cd assignment4
$ git init

Perform `git add` and `git commit` repeatedly as you add code to your scripts. You will hand in the output of `git log` for your assignment repository as part of the assignment. You must have a significant number of commits representing the modifications, alterations and changes in your scripts. If your log does not show a significant and meaningful number of commits, you will lose marks.
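
The add-and-commit cycle described above might look like the following minimal sketch. It works in a throwaway directory created with `mktemp -d`, so it can be run without touching your real assignment repository. The file contents, the commit messages, and the `user.name`/`user.email` values are placeholders chosen for illustration, not part of the assignment.

```shell
# Work in a temporary directory so nothing here touches a real repository.
scratch_dir=$(mktemp -d)
cd "$scratch_dir" || exit 1

git init -q
# Placeholder identity, set for this repository only (assumed values).
git config user.name "Student Name"
git config user.email "student@example.com"

# First small step: create the script and commit it.
echo '# process311.R: processes one 311 CSV file' > process311.R
git add process311.R
git commit -q -m "Start process311.R with a header comment"

# Next small step: change the script, then add and commit again.
echo '# TODO: read the file name from the command line' >> process311.R
git add process311.R
git commit -q -m "Note the next step: command-line argument handling"

# Each commit appears as one line in the condensed log.
git log --oneline
commit_count=$(git rev-list --count HEAD)
echo "Commits so far: $commit_count"
```

Committing after each small, working change is what produces a log with many meaningful entries.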

Description

For this assignment we will be working with real 311 Service Request data coming from the City of Toronto. Details about the data can be found on the City of Toronto 311 Service Requests page.

The data is stored in the zip file located here: https://pages.scinet.utoronto.ca/~afedosee/T311_2010-2021.zip. This file contains Comma Separated Values (CSV) files.

To download and uncompress the data set, use the following commands at the Linux command line:


user@scinet assignment4 $ curl -O https://pages.scinet.utoronto.ca/~afedosee/T311_2010-2021.zip
user@scinet assignment4 $ ls
T311_2010-2021.zip
user@scinet assignment4 $ unzip T311_2010-2021.zip
user@scinet assignment4 $ ls
T311_2010-2021.zip  data/
user@scinet assignment4 $ ls data/
SR2010.csv SR2012.csv SR2014.csv SR2016.csv SR2018.csv SR2020.csv
SR2011.csv SR2013.csv SR2015.csv SR2017.csv SR2019.csv SR2021.csv

The files contain the Toronto 311 Service Request data for the years 2010 to 2021. Each file contains the data for the year specified in its name, e.g., SR2010.csv, ..., SR2021.csv.

Note that it is a good idea to do some initial exploration of the data (read the data in, use `str()` to examine the names of the columns) before you proceed to the next section.

Part 1

Write an R script, called `process311.R`, which performs the following steps.

a) Receives an argument from the command line indicating which file to read, and using the `read.csv()` command, puts the file's data into a data frame.
b) Prints which file is being processed.
c) Calculates and prints the total number of service calls per city division. For this you will need to find a way to automatically identify the different divisions (do not hard-code the divisions!), and loop over them to compute the total number for each division. A useful function for this is `unique()`. Use `help()` and `example()` to get more information about it.
d) Calculates and prints the total number of service calls about dead animals on expressways.
e) Calculates and prints the ward with the most service calls from the "311" division in September. For this question, depending on your strategy, functions which might be helpful include `as.character()` (to convert inputs to strings), `substr()` (to cut substrings out of strings), `table()` (to perform a frequency analysis on data), `sort()` (to sort things), and `names()` (to get the names from your table).

When run from the shell, your script should produce the following output:

user@scinet assignment4 $ Rscript process311.R data/SR2010.csv
Processing data from file: data/SR2010.csv
Total number of service calls per city division:
     Transportation Services  --  31904
     Toronto Water  --  48921
     Solid Waste Management Services  --  136808
     311  --  1050
     Urban Forestry  --  16016
     Municipal Licensing & Standards  --  19507
     City of Toronto  --  12
The number of reports of a dead animal on an expressway is 15
The ward with the most 311 calls in September was Trinity-Spadina (20)

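Note that part c) is the only part that should have a loop. All other questions should be answered using slicing. As a reminder, a comparison applied to a vector returns a logical vector, and summing a logical vector counts its TRUE entries: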

> a <- 1:10
> a > 7
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
> sum(a > 7)
[1] 3

Part 2

Finally, write a shell script named `processALLyears.sh` that loops over all CSV files in the data directory and calls your R script from Part 1 on each one, so that all the years are processed sequentially. The following skeleton of a bash `for` loop should serve as a starting point for your shell script.

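# Loop over every file in the current directory whose name ends in .csv
for filename in *.csv
do
    # Quote the variable so file names with unusual characters stay intact
    echo "$filename"
done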

Starting from this skeleton, remove and add commands as needed so that the script runs your R script on each of the data/SR20XX.csv files. Assume that all the CSV files are in the data directory, and that the R script and the shell script are in the directory one level above it.
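
As a hedged illustration of how a glob with a directory prefix behaves, the sketch below loops over CSV files inside a `data` subdirectory of a throwaway directory. The two files are empty stand-ins created only for this example, and the `echo` is a placeholder for whatever command your real script should run on each file.

```shell
# Build a throwaway layout: a working directory with a data/ subdirectory.
work_dir=$(mktemp -d)
mkdir "$work_dir/data"
touch "$work_dir/data/SR2010.csv" "$work_dir/data/SR2011.csv"
cd "$work_dir" || exit 1

files_seen=0
# The glob is expanded relative to the current directory, so each
# value of $filename already carries the data/ prefix.
for filename in data/*.csv
do
    # Placeholder action; quote the variable so unusual names stay intact.
    echo "Would process: $filename"
    files_seen=$((files_seen + 1))
done
echo "Files seen: $files_seen"
```

Because the glob pattern includes the directory, the loop can be run from the directory above `data` without changing into it first.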

Be sure to comment your code, indent your code blocks, and use meaningful variable names.

Submit your `process311.R` and `processALLyears.sh` scripts and the output of `git log` from your assignment repository.

To capture the output of `git log` use redirection: `git log > git.log`, and hand in the `git.log` file.

Assignments will be graded on a 10-point basis.
The due date is October 14th, 2021, at 11:55 PM, with a 0.5-point penalty per day for late submission until the cut-off date of October 21st, 2021, at 12:00 PM (noon).