MSC1090 - Fall 2023

Assignment 4

Opened: Thursday, October 5, 2023, 10:00
Due: Thursday, October 12, 2023, 23:59

Due date: Thursday, October 12th at midnight (Thursday night).


0) Be sure to use version control ("git") as you develop your code. We suggest you create a new directory to hold this assignment, "assignment4" for example, and initialize a new git repository within it. Run "git add" followed by "git commit" repeatedly as you add to your scripts. You will hand in the output of "git log" for your assignment repository as part of the assignment. The log must show a significant number of meaningful commits, each representing a change to your scripts; if it does not, you will lose marks.
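
The version-control workflow described above might look like the following minimal sketch. It runs in a throwaway directory created with mktemp, and the name, email address, file contents and commit messages are placeholders chosen for illustration, not part of the assignment.

```shell
# Work in a scratch directory so nothing real is touched (assumption of this sketch).
workdir=$(mktemp -d)
cd "$workdir"

# Create the assignment directory and initialize a repository inside it.
mkdir assignment4
cd assignment4
git init -q

# Placeholder identity, set for this repository only, so the commits succeed
# on a machine where git has not been configured yet.
git config user.name "Student Name"
git config user.email "student@example.com"

# First small step: create the script, then stage and commit it.
echo '# process.dinesafe.R (placeholder content)' > process.dinesafe.R
git add process.dinesafe.R
git commit -q -m "Add empty process.dinesafe.R"

# Next small step: change the script, then stage and commit again.
echo '# TODO: read the file name from the command line' >> process.dinesafe.R
git add process.dinesafe.R
git commit -q -m "Note the first task in process.dinesafe.R"

# Review the history, one line per commit.
git log --oneline
```

Committing after each small, working change keeps every message specific, which is what makes a log read as meaningful.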


For this assignment we will be working with real data gathered from the City of Toronto's DineSafe Restaurant Health Inspections.

1) From your assignment directory, at the bash prompt, download and unpack the following tar file using the 'curl' and 'tar' commands.  This is similar to what we saw in Assignment 1:

[ejspence.mycomp]
[ejspence.mycomp] pwd
/c/Users/ejspence/MSC1090/assignment4
[ejspence.mycomp] ls
[ejspence.mycomp]
[ejspence.mycomp] curl -O https://pages.scinet.utoronto.ca/~ejspence/dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] tar -zxf dinesafe.tar.gz
[ejspence.mycomp]
[ejspence.mycomp] ls
dinesafe.tar.gz dinesafe2017.csv dinesafe2018.csv dinesafe2019.csv dinesafe2021.csv dinesafe2022.csv dinesafe2023.csv
[ejspence.mycomp]

The files contain the data from the Toronto DineSafe Restaurant Health Inspections for the years 2017, 2018, 2019, 2021, 2022 and a small part of 2023.
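
If you would like to check what an archive contains before unpacking it, tar can list the entries without extracting them. The sketch below builds a small stand-in archive locally so that it runs without network access; the names demo2017.csv, demo2018.csv and demo.tar.gz are hypothetical and are not the assignment files.

```shell
# Build a small stand-in archive in a scratch directory (hypothetical file names).
workdir=$(mktemp -d)
cd "$workdir"
printf 'id,fine\n1,0\n' > demo2017.csv
printf 'id,fine\n2,50\n' > demo2018.csv
tar -czf demo.tar.gz demo2017.csv demo2018.csv
rm demo2017.csv demo2018.csv

# -t lists the contents, -z handles the gzip compression, -f names the archive.
tar -tzf demo.tar.gz

# -x extracts the files into the current directory.
tar -zxf demo.tar.gz
ls
```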


2) Write an R script, called process.dinesafe.R, which performs the following:

  1. Receives an argument from the command line, indicating which file will be read, and puts the data from this file into a data frame. Use the read.csv function.
  2. Prints the name of the file being processed.
  3. Calculates and prints the total number of establishments per establishment status. For this you will have to find a way to automatically identify the different types of statuses, and loop over them to compute the total number for each type. A useful function for this is unique(). Use help() and example() to learn how to use this function.
  4. Calculates and prints the average fine for establishments north of the latitude 43.75. Your output should not be 'NA'. Using help() with the mean() function will be helpful here.
  5. Calculates and prints the total fine given to bakeries in January. For this question, depending on your strategy, the function substr() (to determine substrings of strings) might be useful.
  6. Calculates and prints the establishment type with the second-most non-zero fines. For this question, depending on your strategy, functions which might be helpful include is.na() (to determine which entries are NA), table() (to perform a frequency analysis on data), sort() (to sort things), and names() (to get the names from your table).

Your script should output something like this, when run from the shell terminal:

[ejspence.mycomp] Rscript process.dinesafe.R dinesafe2018.csv

Processing data from file:  dinesafe2018.csv
Total number of establishments per status type:
     Pass --  37001
     Conditional Pass  -- 9687
     Closed  --  291
Average fine for establishments north of latitude 43.75: 162.75
The total amount of January fines from bakeries is 580
The establishment type with the second-most non-zero fines is: Food Take Out
----------------------------------------------------------------------------

Note that item 3 of part 2 (the number of establishments per status) is the only item that should use a loop. All other items should be answered using slicing. Also note that you do not need to write your own functions for this assignment.

Note the following code, which may inspire your answers for some of the above sections:

>
> a <- 1:10
>
> a > 7
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
>
> sum(a > 7)
[1] 3
>


3) Finally, write a shell script named "process.all.years.sh" that loops over all the CSV files in your directory and calls the previous script so that all the years are processed sequentially.
Consider the following starting point for your shell script:

for i in *.csv; do
     echo "$i"
done

Starting from this, add and remove commands as needed so that the script runs your R script on each of the DineSafe data files. Recall that for this to work, all 6 CSV files, the R script and the shell script must be in the same directory!
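
As a rough illustration of the loop pattern, the sketch below runs one command per CSV file in a scratch directory. The sample file names are hypothetical, and wc is only a stand-in for whatever command you want to run on each file.

```shell
# Scratch directory with three small stand-in CSV files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir"
printf 'a,b\n1,2\n' > sample2017.csv
printf 'a,b\n1,2\n3,4\n' > sample2018.csv
printf 'a,b\n' > sample2019.csv

# The glob *.csv expands to the matching file names in sorted order.
# "$i" is quoted so that a name containing spaces is passed as one argument.
for i in *.csv; do
    echo "Processing $i"
    wc -l < "$i"    # stand-in command: counts the lines of the current file
done
```

The loop variable takes one file name per iteration, so any command that accepts a file name as an argument can take the place of the stand-in.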


Submit your "process.dinesafe.R" and "process.all.years.sh" scripts and the output of "git log" from your assignment repository.

Both scripts should be added and committed frequently to the repository. To capture the output of 'git log', use redirection (git log > git.log) and hand in the "git.log" file.


Assignments will be graded on a 10 point basis. The due date is October 12th 2023 (midnight). Late submissions lose 0.5 points per day until the cut-off date of October 19th 2023, at 9:00am.



All content on this website is made available under the Creative Commons Attribution 4.0 International licence, with the exception of all videos which are released under the Creative Commons Attribution-NoDerivatives 4.0 International licence.