Assignment 2
For this assignment we will be working with real TTC Streetcar delay data coming from the City of Toronto. Details about the data can be found on the TTC Streetcar Delay Data page.
The data is stored in the zip file located here: https://pages.scinet.utoronto.ca/~afedosee/ttc-streetcar-delay-data-2014-2021.zip. This file contains Comma Separated Values (CSV) files.
To download and uncompress the data set on a Windows machine, right-click on the link to download the zip file, then double-click the downloaded file to unpack it. On a Mac, you can use the following commands in the Terminal program:
$ curl -O https://pages.scinet.utoronto.ca/~afedosee/ttc-streetcar-delay-data-2014-2021.zip
$ ls
ttc-streetcar-delay-data-2014-2021.zip
$ unzip ttc-streetcar-delay-data-2014-2021.zip
$ ls
data/ ttc-streetcar-delay-data-2014-2021.zip
$ ls data/
ttc-streetcar-delay-data-2014.csv ttc-streetcar-delay-data-2018.csv
ttc-streetcar-delay-data-2015.csv ttc-streetcar-delay-data-2019.csv
ttc-streetcar-delay-data-2016.csv ttc-streetcar-delay-data-2020.csv
ttc-streetcar-delay-data-2017.csv ttc-streetcar-delay-data-2021.csv
$
You can also run these commands on a Windows machine if you have a terminal program installed, such as 'git bash' or 'MobaXterm'. Note that curl and ls are Linux shell (terminal) commands which have not been introduced in class. If you are not familiar with the Linux shell you can use graphical methods (using a mouse) to create your working directory and move the data to that directory. The R commands getwd() and setwd() can be helpful in making sure your R prompt is running where the data is.
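As a small sketch of that check, the following code reports where R is running and whether the unpacked data directory is visible from there. The path in the commented-out setwd() call is a placeholder, not a real location, and the directory name 'data' is assumed from the listing above.

```r
# Show the directory the R session is currently running in.
current.dir <- getwd()
cat("R is running in:", current.dir, "\n")

# Check whether the unpacked 'data' directory is visible from here.
data.visible <- file.exists("data")
cat("Is the data directory visible?", data.visible, "\n")

# If it is not, move to the directory that holds 'data' (placeholder path):
# setwd("path/to/your/assignment/folder")

# List the CSV files R can see; this is empty if 'data' is not found.
csv.files <- list.files("data", pattern = "\\.csv$")
print(csv.files)
```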
The files contain the TTC Streetcar Delay Data for the years 2014 to 2021. Each file contains the data corresponding to the year specified in its name, e.g. ttc-streetcar-delay-data-2014.csv, ..., ttc-streetcar-delay-data-2021.csv.
Note that it is a good idea to do some initial exploration of the data (read the data in, use str() to examine the names of the columns) before you proceed to the next section.
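A minimal sketch of this kind of exploration is given here. It writes a tiny made-up CSV file to a temporary location so that it runs on its own; the column names are invented for illustration and are not claimed to match the real TTC files, so check the real ones with str().

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained.
# These column names are invented; the real files may differ.
toy.file <- tempfile(fileext = ".csv")
writeLines(c("Date,Route,Incident,Delay",
             "2015-01-02,501,Mechanical,5",
             "2015-01-03,504,Held By,12",
             "2015-01-03,501,Mechanical,"),
           toy.file)

# Read the file into a data frame.
toy.data <- read.csv(toy.file)

# Examine the column names and types, then look at the first few rows.
str(toy.data)
print(head(toy.data))
```

Note that the empty last field in the third record is read in as NA, which is how unreported numeric values typically appear in a data frame.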
Create a file, called TTC.Utilities.R, which contains the following functions.
1a) create a function which takes a single string as an argument, the name of the file to be read. The function should read in the associated file, print out a nice sentence saying what file is being processed, and return the resulting data frame.
1b) create a function which takes the data as an argument. The function should calculate and print the total number of delays per incident type. For this you will need to find a way to automatically identify the different types of reported incidents (do not hard-code the incidents!), and loop over them to compute the total number for each incident. A useful function to assist with this is unique(). Use help() and example() to learn how to use it.
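To illustrate how unique() behaves, here is a short sketch on a made-up vector that has nothing to do with the streetcar data. It shows one possible way to loop over the distinct values and count each; the variable names are arbitrary.

```r
# A made-up vector of categories, with repeats.
fruit <- c("apple", "pear", "apple", "plum", "pear", "apple")

# unique() returns each distinct value once, in order of first appearance.
kinds <- unique(fruit)

# Loop over the distinct values and count how often each one occurs.
counts <- integer(0)
for (kind in kinds) {
    counts[kind] <- sum(fruit == kind)
    cat(kind, "--", counts[kind], "\n")
}
```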
1c) create a function which takes the data as its argument. The function should calculate and print the average minimum delay of streetcars due to a mechanical incident, ignoring unreported data.
1d) create a function which takes the data as its argument. The function should calculate and print the day of the week with the fewest delays. The functions table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table) may be useful here.
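The three functions mentioned in 1d) can be combined as in the following sketch, which uses a made-up vector of colours rather than the assignment data. It is only one possible way of chaining them.

```r
# A made-up vector of values, with repeats.
colours <- c("red", "blue", "red", "green", "red", "blue")

# table() counts how many times each distinct value occurs.
freq <- table(colours)

# sort() orders the counts from smallest to largest by default.
freq.sorted <- sort(freq)

# names() gives the labels; the first one belongs to the smallest count.
rarest <- names(freq.sorted)[1]
cat("The least common colour is", rarest, "\n")
```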
1e) create a function that calculates and prints out the route with the most delays in February. For this question, depending on your strategy, functions which might be helpful include as.character() (to convert variables to strings), and substr() (to cut substrings out of strings).
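As a sketch of how as.character() and substr() work together, the following code pulls the month out of a few made-up dates. It assumes dates written as year-month-day; the format in the real files may differ, so the character positions would need adjusting.

```r
# Made-up dates in year-month-day form; the real data's format may differ.
dates <- as.Date(c("2015-01-15", "2015-02-03", "2015-02-20", "2015-03-01"))

# Convert the dates to strings so that substr() can be applied to them.
date.strings <- as.character(dates)

# Characters 6 to 7 of "YYYY-MM-DD" hold the two-digit month.
month.codes <- substr(date.strings, 6, 7)

# A logical vector that is TRUE for the February entries.
in.february <- month.codes == "02"
print(date.strings[in.february])
```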
1f) create a function which takes a filename as an argument. The function will run the functions in parts 1a), 1b), 1c), 1d) and 1e) for the file in question.
When run from the R prompt, your script should produce the following output:
> source('TTC.Utilities.R')
>
> analyze.delays('data/ttc-streetcar-delay-data-2015.csv')
-----------------------------------------------------------
Processing data from file: data/ttc-streetcar-delay-data-2015.csv
Total number of delays per incident type:
Mechanical -- 6431
Held By -- 1433
Investigation -- 1612
Emergency Services -- 302
Late Leaving Garage -- 1304
General Delay -- 631
Diversion -- 185
Utilized Off Route -- 323
The average minimum delay of the streetcars due to a mechanical incident,
ignoring unreported data, is 8.569207 minutes.
The day of the week with the fewest delays is Sunday .
The route with the most delays in February was route 501 .
-----------------------------------------------------------
>
Note that loops should only be used for part 1b) in this assignment. All other questions should be answered using slicing.
Make a note of the following code, which you may find useful.
>
> a <- 1:10
>
> a > 7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> sum(a > 7)
[1] 3
>
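The same idea extends to slicing a data frame without a loop. The sketch below uses a made-up data frame with invented column names; it selects the rows matching a condition and averages a column while skipping missing values.

```r
# A made-up data frame; the column names are invented for illustration.
scores <- data.frame(group = c("a", "b", "a", "a", "b"),
                     value = c(4, 7, NA, 10, 1))

# A logical vector that is TRUE for the rows belonging to group "a".
is.a <- scores$group == "a"

# Use the logical vector to slice out the matching values (no loop needed).
a.values <- scores$value[is.a]

# na.rm = TRUE tells mean() to ignore the missing (NA) entries.
a.mean <- mean(a.values, na.rm = TRUE)
cat("The mean value for group a is", a.mean, "\n")
```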
Be sure to comment your code, indent your code blocks, and use meaningful variable names.
Submit your TTC.Utilities.R file. Assignments will be graded on a 10-point basis.
The due date is March 8th (midnight), with a 0.5-point penalty per day for late submissions until the cut-off date of March 15th at 9:00am.