MSC1090 - Fall 2023: Assignment 7

Opened: Thursday, 26 October 2023, 10:00 AM

Due: Thursday, 2 November 2023, 11:59 PM

0) Be sure to use version control ("git") as you develop your script. Run "git add ..." and "git commit" repeatedly as you add to your script. You will hand in the output of "git log" for your assignment repository as part of the assignment.

The goal of this assignment is not only to evaluate the knowledge you have acquired during the lectures but also to guide you through a typical statistical analysis.

The following are the steps you will initially follow when analyzing your data, and that you will also implement in this assignment:

Inspect the data graphically, to look for possible insights into the relationship between the variables.
Quantify this relationship by computing the appropriate statistical estimators (e.g. covariance and correlation between the variables). What can you conclude from these values?

For this assignment we will use data from "Sport and Recreation-related Concussions and Other Traumatic Brain Injuries Among Canada's Children and Youth". The original data and their description are available on the following website: Sport and Recreation-related Concussions and Other Traumatic Brain Injuries Among Canada's Children and Youth.

In particular, we will focus on the data file concussion.csv. Please download this file and place it in your assignment directory.

You can use the following command from your shell to download the file:

$      
$ curl -L -O https://pages.scinet.utoronto.ca/~ejspence/concussion.csv
$

Before beginning the assignment you will need to load the data into R (use the read.csv function and explore its possible arguments). You should then inspect the data and select two variables, a dependent and an independent variable, that you will use for the rest of the assignment. The two variables should have some sort of relationship (correlation) so that the following parts make sense.
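As a rough illustration of this inspection step, the sketch below reads a small stand-in CSV file and looks at its structure and pairwise correlations. The column names (year, casesA, casesB) and their values are invented for the example and are not taken from concussion.csv; with the real file you would pass its name to read.csv instead.

```r
# Build a small stand-in CSV so the example runs on its own.
# The columns and values below are made up for illustration only.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(year   = 2010:2019,
                     casesA = c(12, 15, 14, 18, 21, 25, 24, 30, 33, 35),
                     casesB = c(5, 7, 6, 9, 8, 12, 11, 15, 14, 18)),
          tmp, row.names = FALSE)

# Read the file; header and stringsAsFactors are two of the
# read.csv arguments worth exploring.
dat <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)

# Inspect the structure and the first few rows.
str(dat)
print(head(dat))

# Pairwise correlations between the numeric columns can help
# when choosing a dependent and an independent variable.
numericCols <- sapply(dat, is.numeric)
print(cor(dat[, numericCols]))
```

A scatter plot of the candidate pair, for example plot(dat$casesA, dat$casesB), is a natural next check before committing to two variables.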

To answer the following questions, create an R script that receives two arguments from the command line and, depending on their values, performs one of the actions described in parts 2), 3) or 4).

The first argument should be the name of the file containing the data, in this case "concussion.csv".
The second argument will indicate which analysis to run.

The code should be modular. At a minimum, each part of this assignment should be its own function, such as loading the data, computing correlations, executing the fits, etc. Put your functions in an auxiliary file called FittingUtilities.R.

You should implement defensive programming, so that:

if the first argument is not the name of a file that can be found, the script will throw an error message
if the second argument is not a 1, 2 or 3, the script will print a message to the screen letting the user know that only these options are possible, and then stop.
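The checks above could be sketched as follows. This is only one possible arrangement; the function name checkArguments and the wording of the last error message are assumptions made for the example, not requirements of the assignment.

```r
# Hypothetical helper: validate the command-line arguments and
# return the file name and the requested option.
checkArguments <- function(args) {
    # Exactly two arguments are expected.
    if (length(args) < 2) {
        stop("This script requires two arguments: a CSV file and an option 1, 2 or 3",
             call. = FALSE)
    }
    if (length(args) > 2) {
        stop("This script requires only one second argument: 1, 2 or 3",
             call. = FALSE)
    }
    # The first argument must name an existing file.
    if (!file.exists(args[1])) {
        stop("File not found!", call. = FALSE)
    }
    # The second argument must be one of the three allowed options.
    if (!(args[2] %in% c("1", "2", "3"))) {
        stop("The second argument must be 1, 2 or 3", call. = FALSE)
    }
    list(filename = args[1], option = as.integer(args[2]))
}

# In the driver script the arguments would come from the command line:
# args   <- commandArgs(trailingOnly = TRUE)
# parsed <- checkArguments(args)
```

Using stop() with call. = FALSE prints the message prefixed by "Error:" and halts an Rscript run, which is one way to get the "print a message, then stop" behaviour.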

1) Create a function which loads the observations, prints the name of the file being processed and returns the data.
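A minimal sketch of such a loading function is shown here. The name loadData is an assumption, and the small CSV written to a temporary file only stands in for concussion.csv so that the example runs by itself.

```r
# Hypothetical loader: announce the file, read it, return the data frame.
loadData <- function(filename) {
    cat("Processing file", filename, "...\n")
    read.csv(filename)
}

# Stand-in data file with invented columns x and y.
demoFile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5, y = c(2, 4, 5, 8, 10)), demoFile, row.names = FALSE)

demoData <- loadData(demoFile)
print(demoData)
```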

Your script should perform the following actions:

2) The following actions should be performed if the second command line argument is a 1:
2.a) Print the correlation estimators for the dataset.
2.b) Implement a linear model to fit the data, and print out the details of the fitted model.
2.c) Generate a graphical representation of the model in the presence of the original data.
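One way parts 2.a) to 2.c) might look is sketched below, using synthetic data in place of the concussion dataset. The function names, the column names x and y, and the choice of writing the plot to a PDF file are all assumptions made for the example.

```r
# 2.a) Print covariance, correlation coefficient and correlation test.
printCorrelations <- function(x, y) {
    cat("Covariance:", cov(x, y), "\n")
    cat("Correlation coefficient:", cor(x, y), "\n")
    cat("Correlation Test:\n")
    print(cor.test(x, y))
}

# 2.b) Fit y as a linear function of x and print the model details.
fitLinear <- function(df) {
    model <- lm(y ~ x, data = df)
    print(summary(model))
    model
}

# 2.c) Plot the data together with the fitted straight line.
plotLinear <- function(df, model, outfile) {
    pdf(outfile)
    plot(df$x, df$y, xlab = "x", ylab = "y", main = "Linear model")
    abline(model, col = "red")
    dev.off()
    invisible(outfile)
}

# Synthetic data: a straight line plus noise (not the real dataset).
set.seed(1)
demo <- data.frame(x = 1:20)
demo$y <- 2 * demo$x + 1 + rnorm(20)

printCorrelations(demo$x, demo$y)
linModel <- fitLinear(demo)
linPlot  <- plotLinear(demo, linModel, tempfile(fileext = ".pdf"))
```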

3) The following actions should be performed if the second command line argument is a 2:
3.a) Print the correlation estimators for the dataset.
3.b) Implement a quadratic model to fit the data, and print out the details of the model.
3.c) Generate a plot of the quadratic model comparing with the original data.
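For the quadratic case, a possible sketch is given below. It assumes the quadratic term is expressed with I(x^2) inside an lm formula, which is one common approach; the function names, column names and synthetic data are again invented for the example.

```r
# 3.b) Fit y as a quadratic function of x and print the model details.
fitQuadratic <- function(df) {
    model <- lm(y ~ x + I(x^2), data = df)
    print(summary(model))
    model
}

# 3.c) Plot the data with the fitted curve evaluated on a fine grid.
plotQuadratic <- function(df, model, outfile) {
    grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 200))
    pdf(outfile)
    plot(df$x, df$y, xlab = "x", ylab = "y", main = "Quadratic model")
    lines(grid$x, predict(model, newdata = grid), col = "blue")
    dev.off()
    invisible(outfile)
}

# Synthetic data: a parabola plus noise (not the real dataset).
set.seed(2)
demo <- data.frame(x = seq(0, 10, length.out = 30))
demo$y <- 0.5 * demo$x^2 - demo$x + 3 + rnorm(30)

quadModel <- fitQuadratic(demo)
quadPlot  <- plotQuadratic(demo, quadModel, tempfile(fileext = ".pdf"))
```

The curve is drawn with predict() on a grid rather than abline(), since abline() only draws straight lines.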

4) The following actions should be performed if the second command line argument is a 3:
4.a) Print the correlation estimators for the dataset.
4.b) Implement both the linear and quadratic models to fit the data, and print out the details for both models.
4.c) Generate plots of both the quadratic and linear models compared to the original data.
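For the combined case, one option is to overlay both fitted curves on a single figure, as in the sketch below. Whether the two models share one figure or get separate ones is a design choice left open here; the function name plotBothModels, the column names and the synthetic data are assumptions made for the example.

```r
# Fit both models and draw them over the data in a single figure.
plotBothModels <- function(df, outfile) {
    linModel  <- lm(y ~ x, data = df)
    quadModel <- lm(y ~ x + I(x^2), data = df)

    # Evaluate both fits on a fine grid for smooth curves.
    grid <- data.frame(x = seq(min(df$x), max(df$x), length.out = 200))

    pdf(outfile)
    plot(df$x, df$y, xlab = "x", ylab = "y",
         main = "Linear and quadratic models")
    lines(grid$x, predict(linModel, newdata = grid), col = "red", lty = 1)
    lines(grid$x, predict(quadModel, newdata = grid), col = "blue", lty = 2)
    legend("topleft", legend = c("linear", "quadratic"),
           col = c("red", "blue"), lty = c(1, 2))
    dev.off()

    invisible(list(linear = linModel, quadratic = quadModel))
}

# Synthetic data with mild curvature (not the real dataset).
set.seed(3)
demo <- data.frame(x = seq(0, 10, length.out = 30))
demo$y <- 0.2 * demo$x^2 + demo$x + 1 + rnorm(30)

bothPlot <- tempfile(fileext = ".pdf")
models   <- plotBothModels(demo, bothPlot)
print(summary(models$linear))
print(summary(models$quadratic))
```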

Example: Note that your results will vary depending on which variables you decide to use!

$
$ Rscript generateModels.R
Error: This script requires two arguments: a CSV file and an option 1, 2 or 3
$
$ Rscript generateModels.R concu.csv 1
Error: File not found!
$
$ Rscript generateModels.R concussion.csv 1 2
Error: This script requires only one second argument: 1, 2 or 3
$
$ Rscript generateModels.R concussion.csv 1
Processing file concussion.csv ...
------------- 
Computing correlation indicators... 
Covariance: 76.29809 
Correlation coefficient: 0.9774704  
Correlation Test: 
        Pearson's product-moment correlation

data: x and y 
t = 24.063, df = 27, p-value < 2.2e-16 
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval: 
0.9520269 0.9894921
sample estimates: 
      cor 
0.9774704

---------------

Fitting a Linear Model 
Call:
  lm(formula = y ~ x, data = concData)

Residuals:
    Min      1Q Median     3Q    Max 
-8.2362 -0.1430 0.2618 0.9337 1.5036

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.07346    0.84800  -1.266    0.216
x            0.99029    0.04115  24.063   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.911 on 27 degrees of freedom
Multiple R-squared: 0.9554, Adjusted R-squared: 0.9538
F-statistic: 579 on 1 and 27 DF, p-value: < 2.2e-16
---------------

Submit your generateModels.R and FittingUtilities.R files, and the output of git log from your assignment repository.

To capture the output of 'git log' use redirection, git log > git.log, and hand in the git.log file.

Assignments will be graded on a 10 point basis.
The due date is November 2nd, 2023 (midnight). Late submissions are penalized 0.5 points per day until the cut-off date of November 9th, 2023, at 9:00am.