Assignment 4
You must use version control (git
), as you develop your scripts. Start by creating a new directory and use the following commands to initialize the git repository
$ mkdir assignment4
$ cd assignment4
$ git init
Perform git add
and git commit
repeatedly as you add code to your scripts. You will hand in the output of git log
for your assignment repository as part of the assignment. You must have a significant number of commits representing the modifications, alterations and changes in your scripts. If your log does not show a significant and meaningful number of commits, with detailed comments describing your changes, you will lose marks.
Description
For this assignment we will be working with real TTC Streetcar delay data coming from the City of Toronto. Details about the data can be found on the TTC Streetcar Delay Data page.
The data is stored in the zip file located here: https://pages.scinet.utoronto.ca/~afedosee/ttc-streetcar-delay-data-2014-2021.zip. This file contains Comma Separated Values (CSV) files.
To download and uncompress the data set, use the following commands at the Linux command line:
user@scinet assignment4 $ curl -O https://pages.scinet.utoronto.ca/~afedosee/ttc-streetcar-delay-data-2014-2021.zip
user@scinet assignment4 $ ls
ttc-streetcar-delay-data-2014-2021.zip
user@scinet assignment4 $ unzip ttc-streetcar-delay-data-2014-2021.zip
user@scinet assignment4 $ ls
data/ ttc-streetcar-delay-data-2014-2021.zip
user@scinet assignment4 $ ls data/
ttc-streetcar-delay-data-2014.csv ttc-streetcar-delay-data-2018.csv
ttc-streetcar-delay-data-2015.csv ttc-streetcar-delay-data-2019.csv
ttc-streetcar-delay-data-2016.csv ttc-streetcar-delay-data-2020.csv
ttc-streetcar-delay-data-2017.csv ttc-streetcar-delay-data-2021.csv
user@scinet assignment4 $
The files contain the TTC Streetcar Delay Data for the years 2014 to 2021. Each file contains the data corresponding to the year specified in its name, eg. ttc-streetcar-delay-data-2014.csv, ..., ttc-streetcar-delay-data-2021.csv. For consistency, please put your scripts in your assignment4 directory, but leave the data in the data directory.
Note that it is a good idea to do some initial exploration of the data (read the data in, use str()
to examine the names of the columns) before you proceed to the next section.
Part 1
Write an R script, called processTTC.R
, which performs the following steps.
- Receives an argument from the command line indicating which file to read, and using the
read.csv()
command, puts the file's data into a data frame. - Prints which file is being processed.
- Calculates and prints the total number of delays per incident type. For this you will need to find a way to automatically identify the different types of reported incidents (do not hard-code the incidents!), and loop over them to compute the total number for each incident. A useful function to assist with this is
unique()
. Usehelp()
andexample()
to learn how to use it. - Calculates and prints the average minimum delay of streetcars due to a mechanical incident, ignoring unreported data.
- Calculates and prints the route with the most delays in February. For this question, depending on your strategy, functions which might be helpful include
as.character()
(to convert variables to strings),substr()
(to cut substrings out of strings),table()
(to perform a frequency analysis on data),sort()
(to sort things), andnames()
(to get the names from your table).
Your script should output the following message, when run from the shell terminal:
user@scinet assignment4 $ Rscript processTTC.R data/ttc-streetcar-delay-data-2014.csv
Processing data from file: data/ttc-streetcar-delay-data-2014.csv
Total number of delays per incident type:
Late Leaving Garage -- 1143
Utilized Off Route -- 516
Held By -- 1493
Investigation -- 1530
Mechanical -- 5107
General Delay -- 829
Emergency Services -- 272
Diversion -- 137
The average minimum delay of the streetcars due to a mechanical incident,
ignoring unreported data, is 7.838621 minutes.
The route with the most delays in February was route 504
----------------------------------------------------------------------------
Note that part c) is the only part that should have a loop. All other questions should be answered using slicing.
Make a note of the following code, which may inspire your answers for some of the above sections:
>
> a <- 1:10
>
> a > 7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> sum(a > 7)
[1] 3
>
Part 2
Finally, write a shell script named processALLyears.sh
that loops over all CSV files in your directory and calls the previous R script so that all the years are processed sequentially. The following is the skeleton of a for loop in bash. This code
should inspire your shell script.
for filename in data/*csv
do
echo $filename
done
Start with this, remove and add the necessary commands so that this script executes your R script for all the data/ttc-streetcar-delay-data-20XX.csv files. You should assume that all the CSV files are in the data directory, the R script and the shell script in the directory one level above the data.
Be sure to comment your code, indent your code blocks, and use meaningful variable names.
Submit your processTTC.R
and processALLyears.sh
scripts and the output of git log
from your assignment repository.
To capture the output of git log
use "redirection": git log > git.log
, and hand in the git.log
file.
Assignments will be graded on 10 points basis.
Due date is October 13th 2022 (midnight), with 0.5 point penalty per day for late submission until the cut-off date of October 20th, 2022, at 9:00am.