MSC1090 - Fall 2024: Assignment 6

Opened: Thursday, 24 October 2024, 11:00 AM

Due: Thursday, 31 October 2024, 11:59 PM

Be sure to use version control (git) as you develop your script. Run git add and git commit repeatedly as you add to your script. You will hand in the output of git log for your assignment repository as part of the assignment.

Introduction

We are often interested in studying the relationships among variables to determine whether there is any underlying association among them. When we think that changes in a variable X explain, or maybe even cause, changes in a second variable Y, we call X an independent (or explanatory) variable and Y a dependent (or response) variable. Moreover, if we plot these variables (X,Y) and the plot resembles a straight line, this suggests a linear relationship between the two variables. The covariance and correlation measure the direction of a linear relationship between two quantitative variables, and the correlation also measures its strength. A regression line is a mathematical model describing a linear relationship between an explanatory variable X and a response variable Y.
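
For reference, the standard sample versions of these quantities can be written as follows, for $n$ paired observations $(x_i, y_i)$. The symbols $\bar{x}$, $\bar{y}$, $s_X$, $s_Y$, $a$ and $b$ are notation introduced here for illustration, not taken from the course material:

$$\mathrm{cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad r = \frac{\mathrm{cov}(X,Y)}{s_X\, s_Y}$$

where $\bar{x}$ and $\bar{y}$ are the sample means and $s_X$ and $s_Y$ the sample standard deviations. The least-squares regression line is then

$$\hat{y} = a + b\,x, \qquad b = \frac{\mathrm{cov}(X,Y)}{s_X^2}, \qquad a = \bar{y} - b\,\bar{x}$$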

The following are some steps that you should initially follow when analyzing data, and that you should also perform for this assignment:

Inspect the data graphically, to check for possible insights into the underlying relationship.
Quantify this relationship by computing the appropriate statistical estimators (e.g. covariance and correlation between the variables). What can you conclude from these values?
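
As a rough illustration of these two steps, the sketch below uses R's built-in `cars` dataset as stand-in data (an assumption made for the example; it is not the assignment data) and only base R functions:

```r
# Toy data from R's built-in `cars` dataset (NOT the assignment data).
x <- cars$speed
y <- cars$dist

# Step 1: inspect the data graphically.
plot(x, y, xlab = "speed", ylab = "stopping distance")

# Step 2: quantify the relationship.
covariance <- cov(x, y)    # sample covariance
correlation <- cor(x, y)   # Pearson correlation by default

cat("Covariance: ", covariance, "\n")
cat("Correlation:", correlation, "\n")
```

A positive covariance indicates that the two variables tend to increase together, while the correlation rescales that value to lie between -1 and 1 so that its magnitude can be read as strength.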

For this assignment we're going to explore Zipf's law, which states that there is an inverse relationship between a word's frequency and its rank when the words are ordered by frequency:

$$F_i \propto \frac{1}{i}$$

where $F_i$ is the frequency of the $i$th most frequent word. We will apply this to Jane Austen's Pride and Prejudice. A text file containing an all-upper-case version of the novel, with punctuation removed, can be found here. Download this file and place it in your assignment directory.
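
Writing the proportionality with an explicit constant $C$ (a symbol introduced here for illustration) makes the connection to a straight-line fit clearer:

$$F_i = \frac{C}{i} = C\,x_i, \qquad x_i = \frac{1}{i}$$

so if Zipf's law held exactly, plotting $F_i$ against $x_i$ would give a straight line through the origin with slope $C$. Equivalently, taking logarithms gives $\log F_i = \log C - \log i$.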

Problem

To answer the following questions, create an R script, named generateModels.R, that will receive two arguments from the Linux command line and, depending on the value of the second argument, perform one of the actions mentioned in parts 1), 2) or 3) below. The first argument should be the file being processed. The script should be as modular as you think is necessary. For instance, each task in this assignment (loading the data, computing correlations, executing the fits, etc.) could be its own function. Put your functions in an auxiliary file called Fitting.Utilities.R.
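
One possible way to organize such a script is sketched below. The function names `models_for_action` and `main` are hypothetical, chosen only for this example; `commandArgs()` and `switch()` are base R.

```r
# generateModels.R (skeleton sketch; all helper names are hypothetical)

# Map the action code to the model(s) that should be fitted.
models_for_action <- function(action) {
  switch(action,
         "1" = "linear",
         "2" = "quadratic",
         "3" = c("linear", "quadratic"),
         stop("unknown action: ", action))
}

main <- function(args) {
  filename <- args[1]
  action <- args[2]
  # source("Fitting.Utilities.R")   # would load the helper functions
  cat("file:", filename, "\n")
  cat("models to fit:", models_for_action(action), "\n")
  invisible(models_for_action(action))
}

# trailingOnly = TRUE keeps only the arguments given after the script name.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 2) main(args)
```

Passing `args` into `main()` rather than reading it inside each function keeps the functions free of global state.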

We also want you to implement defensive programming, so that if the second argument is not a 1, 2 or 3, the script sends a message to the screen letting the user know that only these options are possible, and then stops. It should also check to make sure that there are two command-line arguments given, no more and no less, and that, assuming that the first argument is a file name, the file exists. The file.exists() function may be useful here.
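
A minimal sketch of such checks follows. The function names `check_args` and `report_and_quit` are hypothetical, and the message wording is only modelled on the kind of output this assignment describes; `file.exists()`, `cat()` and `quit()` are base R.

```r
# Return NULL when the arguments are valid, otherwise an error message.
# (check_args is a hypothetical helper name.)
check_args <- function(args) {
  # Exactly two arguments are required.
  if (length(args) != 2) {
    return("Error: This script requires two arguments, a file name and 1, 2 or 3.")
  }
  # Command-line arguments arrive as strings, so compare against strings.
  if (!(args[2] %in% c("1", "2", "3"))) {
    return("Error: the second argument must be 1, 2 or 3.")
  }
  if (!file.exists(args[1])) {
    return(paste("Error: the file", args[1], "does not exist."))
  }
  NULL
}

# Print the message, if any, and stop the script with a non-zero status.
report_and_quit <- function(msg) {
  if (!is.null(msg)) {
    cat(msg, "\n")
    quit(save = "no", status = 1)
  }
}
```

Separating the check from the exit makes the validation logic easy to test on its own, for example `check_args(c("pants", "2"))` returns the "does not exist" message when no file called pants is present.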

0) Create a function which loads the file above, using the scan() function, which will read the file and put it into a vector of strings. The function should do a frequency analysis of the words in the file, sort the words in descending order of frequency, remove those words that occur fewer than 5 times, and put the frequencies into a data frame. It should also add to the data frame a column containing one over the rank of the corresponding word. The function should then return the data frame. Do not hard-code the name of the file into the function.
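
The steps above could be sketched as follows. The function name `load_word_frequencies`, the `min_count` parameter and the column names are assumptions made for this example; `scan()`, `table()` and `sort()` are base R.

```r
# Build a word-frequency data frame from a text file.
# (Function, parameter and column names are hypothetical.)
load_word_frequencies <- function(filename, min_count = 5) {
  # Read whitespace-separated words into a character vector.
  words <- scan(filename, what = character())

  # Count each distinct word and sort by decreasing count.
  counts <- sort(table(words), decreasing = TRUE)

  # Drop words that occur fewer than min_count times.
  counts <- counts[counts >= min_count]

  data.frame(
    word = names(counts),
    frequency = as.vector(counts),
    inverse_rank = 1 / seq_along(counts),   # one over the rank
    stringsAsFactors = FALSE
  )
}
```

Because the counts are already sorted, the rank of each word is simply its position in the table, which is what `seq_along()` provides. Words with tied counts receive consecutive ranks in this sketch.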

Your script should perform the following actions:

1) If the second command line argument is 1:
a. Print the correlation estimators for the dataset. If the dataset is bivariate normal it should print out Pearson's correlation coefficient. If not it should print out Spearman's correlation coefficient. It should indicate which type of correlation is being performed.
b. Implement a linear model to fit the data, and provide details of the fitted model (we will ignore the fact that the dependent variable is not continuous).
c. Generate a graphical representation of the model in the presence of the original data.
2) If the second command line argument is 2:
a. Print the correlation estimators for the dataset, as per 1a above.
b. Implement a quadratic model to fit the data, and provide details of the model (we will ignore the fact that the dependent variable is not continuous).
c. Generate a graphical representation of the model in the presence of the original data.
3) If the second command line argument is 3:
a. Print the correlation estimators for the dataset, as per 1a above.
b. Implement both a linear model and a quadratic model to fit the data, and provide details of both models.
c. Generate a graphical representation of both models in the presence of the original data, on the same graph.
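
The building blocks for these actions might look like the sketch below. All function names are hypothetical. In particular, the normality check is an assumption: it applies the base R `shapiro.test()` to each variable separately, which tests only marginal normality and is a stand-in for a proper bivariate normality test, not a replacement for one. Note that `shapiro.test()` accepts between 3 and 5000 observations.

```r
# Choose Pearson or Spearman. The per-variable Shapiro-Wilk checks are an
# ASSUMPTION standing in for a real bivariate normality test.
choose_correlation <- function(x, y, alpha = 0.05) {
  normal <- shapiro.test(x)$p.value > alpha &&
    shapiro.test(y)$p.value > alpha
  method <- if (normal) "pearson" else "spearman"
  list(method = method, value = cor(x, y, method = method))
}

# Fit a linear and a quadratic model; I(x^2) protects the square so that
# it is treated as arithmetic inside the formula.
fit_models <- function(x, y) {
  list(linear = lm(y ~ x), quadratic = lm(y ~ x + I(x^2)))
}

# Draw the data and overlay each fitted model on the same graph.
plot_models <- function(x, y, fits) {
  plot(x, y, xlab = "1 / rank", ylab = "frequency")
  grid_x <- seq(min(x), max(x), length.out = 200)
  colours <- c("red", "blue")
  for (k in seq_along(fits)) {
    lines(grid_x, predict(fits[[k]], newdata = data.frame(x = grid_x)),
          col = colours[k])
  }
  legend("topleft", legend = names(fits),
         col = colours[seq_along(fits)], lty = 1)
}
```

Details of a fitted model can be printed with `summary()`, for example `print(summary(fit_models(x, y)$linear))`, and passing only one element of the list to `plot_models()` draws a single curve.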

Some notes to follow when implementing your script:

OBSERVATION #1: Do not use global variables; that is, pass arguments to the functions you create, otherwise you will lose marks!

OBSERVATION #2: You will notice that when running the R script from the command line, the plots will not be shown, but instead saved to a file named Rplots.pdf in the directory from which the script is run. This is the default way in which R deals with plots when running in batch mode, and is totally acceptable for this assignment.

Examples:


    $ Rscript generateModels.R
    Error: This script requires two arguments, a file name and 1, 2 or 3.
    $ Rscript generateModels.R 0
    Error: This script requires two arguments, a file name and 1, 2 or 3.
    $ Rscript generateModels.R 1 4
    Error: the second argument must be 1, 2 or 3.
    $ Rscript generateModels.R pants 2
    Error: the file pants does not exist.
    $ Rscript generateModels.R pride-and-prejudice.txt 1
    ---------------
    loading data from file pride-and-prejudice.txt
    Read 122416 items
    ------------
    computing correlation indicators...
    Covariance:  5.606526
    Data is not multivariate normal. Performing Spearman's correlation.
    Correlation coefficient:  0.9980414
    ------------

    Call:
    lm(formula = y ~ x)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2870.19   -24.87   -22.17   -12.64  1761.82

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   27.173      2.732   9.947   <2e-16 ***
    x           7176.019     96.779  74.148   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 122.9 on 2062 degrees of freedom
    Multiple R-squared:  0.7272,	Adjusted R-squared:  0.7271
    F-statistic:  5498 on 1 and 2062 DF,  p-value: < 2.2e-16
    ---------------

Submit your generateModels.R script file and Fitting.Utilities.R file, and the output of git log from your assignment repository.

To capture the output of git log, use redirection (git log > git.log) and hand in the git.log file.

Assignments will be graded on a 10-point basis. The due date is October 31, 2024 at 11:59 pm, with a 0.5-point penalty per day for late submission until the cut-off date of November 7, 2024 at 10:00 am.