EES1137 - Winter 2024: Assignment 5

Opened: Tuesday, 27 February 2024, 11:00 AM

Due: Tuesday, 5 March 2024, 11:59 PM

Due date: March 5, 2024 at 11:55 pm.

Be sure to use version control (git) as you develop your script. Run git add and git commit repeatedly as you add to your script. You will hand in the output of git log for your assignment repository as part of the assignment.

Introduction

We are often interested in studying the relationships among variables to determine whether there is any underlying association among them. When we think that changes in a variable X explain, or maybe even cause, changes in a second variable Y, we call X an independent (or explanatory) variable and Y a dependent (or response) variable. Moreover, if we plot these variables (X,Y) and the plot resembles a straight line, this suggests a linear relationship between the two variables. The relationship is strong if all the data points lie close to the line, and weak if the points are widely scattered about it. The covariance and correlation both indicate the direction of a linear relationship between two quantitative variables; the correlation, which is normalized to lie between -1 and 1, also measures its strength. A regression line is a mathematical model describing the relationship between an explanatory variable X and a response variable Y.

The following are some steps that you should initially follow when analyzing data, and that you should also perform for this assignment:

Inspect the data graphically, to check for possible insights into their relationship.
Quantify this relationship by computing the appropriate statistical estimators (e.g. covariance and correlation between the variables). What can you conclude from these values?
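The two steps above can be sketched in R as follows. This is a minimal illustration on synthetic data (the vectors x and y are made up for the example and are not the assignment's data set); it uses the base R functions plot(), cov() and cor().

```r
# Synthetic data for illustration only (NOT the assignment's data set).
set.seed(1)
x <- seq(1, 20, length.out = 50)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)

# Step 1: inspect the data graphically.
plot(x, y, xlab = "x (explanatory)", ylab = "y (response)")

# Step 2: quantify the relationship.
cov.xy <- cov(x, y)   # sign gives the direction; magnitude depends on units
cor.xy <- cor(x, y)   # Pearson correlation, always between -1 and 1

cat("Covariance:", cov.xy, "\n")
cat("Correlation coefficient:", cor.xy, "\n")
```

Note that the correlation is the covariance divided by the product of the two standard deviations, which is why it does not depend on the units of x and y.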

Consider the Wine Quality data set, a description of which can be found here. In particular, let us consider the white wine data set, which can be found here. Download this data set and place it in your assignment directory. Assume for the rest of the assignment that we are only interested in the relationship between the density of the wine and its residual sugar content.

Problem

To answer the following questions, create an R script, named generateModels.R, that will receive two arguments from the Linux command line. The first will be the file name of the data to be processed. The second, depending on its value, will cause the script to perform one of the actions mentioned in parts 1), 2) or 3) below. The script should be as modular as you think is necessary. For instance, each part of this assignment, such as loading the data, computing correlations, or executing the fits, could at least be its own function. Put your functions in an auxiliary file called Utilities.R.

We also want you to implement defensive programming, so that if the second argument is not a 1, 2 or 3, the script sends a message to the screen letting the user know that only these options are possible, and then stops. It should also check to make sure that there are two command line arguments given.
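One possible way to structure these checks is sketched below. The helper name check.args and the shape of its return value are hypothetical choices made for this illustration; commandArgs(trailingOnly = TRUE) and stop() are standard base R.

```r
# Hypothetical helper (name and return value are illustrative assumptions).
# Validates the vector of command-line arguments and stops with a message
# if there are not exactly two, or if the second is not 1, 2 or 3.
check.args <- function(args) {
  if (length(args) != 2 || !(args[2] %in% c("1", "2", "3"))) {
    stop("This script requires two arguments, a file name and a 1, 2 or 3",
         call. = FALSE)
  }
  list(filename = args[1], option = as.integer(args[2]))
}

# In a driver script the arguments would come from the command line:
#   args <- commandArgs(trailingOnly = TRUE)
# Here a fixed vector stands in for them so the sketch runs on its own.
parsed <- check.args(c("winequality-white.csv", "1"))
cat("File:", parsed$filename, " Option:", parsed$option, "\n")
```

Passing the argument vector into the function, rather than reading it inside the function, keeps the check free of global state and easy to test.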

In addition to the commands in your script, include additional comments, within your driver script, explaining your observations.

0) Create a function that takes a filename as an argument, loads the data, and returns it.
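A minimal sketch of such a loading function is shown below. The function name load.data and the default separator are assumptions made for this illustration: it assumes the file is semicolon-delimited, so check the downloaded file and adjust sep if needed. The temporary file and its values are made up so that the example runs on its own.

```r
# Hypothetical loader (the name and the default separator are assumptions).
load.data <- function(filename, sep = ";") {
  if (!file.exists(filename)) {
    stop("File not found: ", filename, call. = FALSE)
  }
  # read.csv converts header names into valid R names by default,
  # so a header such as "residual sugar" becomes residual.sugar.
  read.csv(filename, sep = sep)
}

# Self-contained demonstration with a small made-up file.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"residual sugar";"density"', "5.0;0.995", "10.0;0.998"), tmp)
my.data <- load.data(tmp)
print(names(my.data))
```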

Your script should perform the following actions:

If the command line argument is 1:
1. Print the correlation estimators for the dataset.
2. Implement a linear model to fit the data, and print out the details of the fitted model.
3. Generate a graphical representation of the model in the presence of the original data.
The following actions should be performed if the command line argument is 2:
1. Print the correlation estimators for the dataset.
2. Implement a quadratic model to fit the data, and print out the details of the model.
3. Generate a graphical representation of the model in the presence of the original data.
The following actions should be performed if the command line argument is a 3:
1. Print the correlation estimators for the dataset.
2. Implement a generalized linear model to fit to the data, using a noise model and link function that you think is appropriate for the data, and print out the details of the model.
3. Generate a graphical representation of the model in the presence of the original data.
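The three kinds of fit can be sketched with the base R functions lm() and glm(), as below. This is only an illustration on synthetic data with made-up column names x and y. The Gaussian family with identity link is used purely as a placeholder and is not a recommendation for the wine data; choosing an appropriate noise model and link function is part of the assignment.

```r
# Synthetic data for illustration only (NOT the wine data set).
set.seed(42)
my.data <- data.frame(x = runif(200, 1, 20))
my.data$y <- 1 + 0.3 * my.data$x + 0.02 * my.data$x^2 + rnorm(200, sd = 0.5)

# Linear model: y = a + b*x
linear.fit <- lm(y ~ x, data = my.data)

# Quadratic model: I() protects x^2 from the formula interpretation of ^
quadratic.fit <- lm(y ~ x + I(x^2), data = my.data)

# Generalized linear model; family and link here are placeholders only.
glm.fit <- glm(y ~ x, data = my.data, family = gaussian(link = "identity"))

# Print the details of one fitted model.
print(summary(quadratic.fit))

# Plot the original data with a fitted curve drawn on top.
plot(my.data$x, my.data$y, xlab = "x", ylab = "y")
x.grid <- data.frame(x = seq(min(my.data$x), max(my.data$x), length.out = 100))
lines(x.grid$x, predict(quadratic.fit, newdata = x.grid), col = "red", lwd = 2)
```

Predicting on a sorted grid of x values, rather than on the original unsorted points, is what makes lines() draw a smooth curve instead of a zigzag.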

Some notes to follow when implementing your script:

OBSERVATION #1: Do not use global variables; i.e., pass arguments to the functions you create, otherwise you will lose marks!

OBSERVATION #2: You will notice that when running the R script from the command line, the plots will not be shown, but will instead be saved to a file named Rplots.pdf in the current working directory (the directory from which you run the script). This is the default way in which R deals with plots when running in batch mode, and is totally acceptable for this assignment.
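For reference, the default behaviour can be overridden by opening a graphics device explicitly. The sketch below is optional and not required for the assignment; the output file name is a hypothetical choice, and a temporary directory is used so that the example does not clutter the working directory.

```r
# Optional: write a plot to a named PDF instead of the default Rplots.pdf.
out.file <- file.path(tempdir(), "model-plot.pdf")  # hypothetical file name

pdf(out.file)                         # open a PDF graphics device
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")
invisible(dev.off())                  # close the device so the file is written

cat("Plot written to", out.file, "\n")
```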

Examples:


$ Rscript generateModels.R
Error: This script requires two arguments, a file name and a 1, 2 or 3
$ Rscript generateModels.R 0
Error: This script requires two arguments, a file name and a 1, 2 or 3
$ Rscript generateModels.R winequality-white.csv 0
Error: This script requires two arguments, a file name and a 1, 2 or 3
$ Rscript generateModels.R winequality-white.csv 1 
--------------- 
Computing correlation indicators... 
Covariance: 0.01272717
Correlation coefficient: 0.8389665
--------------- 
Fitting a Linear Model 

Call: 
lm(formula = density ~ residual.sugar, data = my.data) 

Residuals: 
     Min       1Q   Median       3Q      Max
-0.00569 -0.00110  0.00017  0.00115  0.01556     

Coefficients: 
               Estimate Std. Error t value Pr(>|t|) 
(Intercept)    9.91e-01  3.742e-05 26480.7   <2e-16 ***
residual.sugar 4.95e-04  4.586e-06   107.9   <2e-16 ***
--- 
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.001628 on 4896 degrees of freedom
Multiple R-squared: 0.7039, Adjusted R-squared: 0.7038 
F-statistic: 1.164e+04 on 1 and 4896 DF, p-value: < 2.2e-16
---------------

Submit your generateModels.R script file and Utilities.R file, and the output of git log from your assignment repository.

To capture the output of git log use redirection, git log > git.log, and hand in the git.log file.

Assignments will be graded on a 10 point basis. Due date is March 5th, 2024 at 11:55pm, with 0.5 point penalty per day for late submission until the cut-off date of March 12, 2024 at 10:00am.