Simple linear regression (1/5)- correlation and covariance


Starting today, I’ll explain the simple linear regression model. There is a lot of information about linear regression on the web, but I believe I can tell you about things most people don’t mention. My philosophy on data analysis and statistics is to fully understand the concept, not simply follow what software programs say. Therefore I usually calculate statistics by hand, and only when my hand calculation exactly matches what the software provides do I say I understand the concept.



With this in mind, I’ll introduce simple linear regression in five parts: 1) correlation, 2) slope and intercept of the linear regression model, 3) standard error of the slope and intercept, 4) t-value of the slope and intercept, and finally 5) R-squared, also called the coefficient of determination.

If you follow these five parts step by step, I guarantee you will understand the simple linear regression model as a whole picture, like below.

First of all, I’ll introduce what correlation is.



Correlation is a statistical measure that expresses the extent to which two variables are linearly related. To understand correlation, we first need to understand covariance. Covariance is a statistical measure of how two random variables change together. Correlation and covariance express the same concept. The only difference is that correlation divides covariance by the product of the standard deviations of x and y.

I’ll explain why.

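This is the equation of covariance: Cov(x, y) = Σ(xi − x̄)(yi − ȳ) / (n − 1). Let’s calculate covariance by hand.

Here is a data set. I investigated how yield changes with the amount of nitrogen fertilizer (x = fertilizer amount: 10, 20, 30, 40, 50; y = yield: 100, 120, 140, 150, 160).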

Then, I’ll calculate covariance according to the equation.



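First, I calculate the deviations of x and y from their means (the mean of x is 30 and the mean of y is 134). Deviation x is -20, -10, 0, 10, 20 and deviation y is -34, -14, 6, 16, 26.

Then, let’s multiply each pair of deviations: 680, 140, 0, 160, 520.

Next, let’s add up all the products: 680 + 140 + 0 + 160 + 520 = 1500.

Finally, dividing the sum by n-1 (where n is the sample size, here 5) gives the covariance: 1500 / 4 = 375.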
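The same four steps can be sketched in R. This is only a minimal sketch that reuses the x and y values from the R check later in this post; the names dev_x, dev_y, products and cov_by_hand are my own, chosen for illustration.

```r
# Data from this post: nitrogen fertilizer amount (x) and yield (y)
x <- c(10, 20, 30, 40, 50)
y <- c(100, 120, 140, 150, 160)

dev_x <- x - mean(x)        # step 1: deviations of x from its mean
dev_y <- y - mean(y)        # step 1: deviations of y from its mean
products <- dev_x * dev_y   # step 2: multiply each pair of deviations
n <- length(x)              # sample size, here 5

# steps 3 and 4: add up the products and divide by n - 1
cov_by_hand <- sum(products) / (n - 1)
cov_by_hand                 # 375
cov(x, y)                   # R's built-in covariance, for comparison
```

Both lines at the end should give the same value, because cov() also divides by n - 1.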



What does this 375 indicate? If the value is large, does that mean the relationship between the two variables is well explained?

Let’s draw each pair of deviations as a box in the quadrants. For example, when deviation x is -20 and deviation y is -34, the box would look like figure I.

If we combine all the boxes, we obtain the figure below.

All boxes are located in the quadrants where deviation x and deviation y have the same sign (quadrants I and III in the usual numbering). What does this indicate? It indicates that there is a positive relationship between the two variables. To determine whether the relationship is positive or negative, is the box size important? No, it’s not. The important thing is where the boxes are located.

Therefore, the covariance value 375.0 itself is not the main issue. What matters is whether the value is positive or negative.
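A small R sketch can illustrate why the size of the covariance says little by itself. The rescaling of y by 10 is a hypothetical change of units that I made up for illustration; it is not part of the original data.

```r
x <- c(10, 20, 30, 40, 50)
y <- c(100, 120, 140, 150, 160)

# Sign of each deviation product: +1 (same sign), 0, or -1 (opposite sign)
product_signs <- sign((x - mean(x)) * (y - mean(y)))
product_signs        # no negative values, so no box works against the trend

cov(x, y)            # 375
cov(x, y * 10)       # 3750: hypothetical change of units, same relationship
cov(x, -y)           # -375: only reversing the direction flips the sign
```

The covariance grows tenfold when y is rescaled, although the relationship between the two variables is unchanged. Only the sign carries over.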



Correlation

This is the equation of correlation. When we talk about correlation, we usually mean Pearson’s correlation coefficient, which measures the strength of the linear association between two quantitative variables x and y.

As I said before, correlation and covariance express the same concept. The only difference is that correlation divides covariance by the standard deviations of x and y.

Please look at the equation. Doesn’t it look familiar?

The numerator is the equation of covariance. The denominator is the product of the standard deviations of x and y (Sx*Sy). That’s why I said the only difference between correlation and covariance is the division by the standard deviations of x and y.
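Written out, the link between the two can be sketched as follows. I am assuming the usual textbook form of Pearson’s formula here, which may be laid out differently from the equation shown in this post; the point is only that dividing numerator and denominator by n - 1 turns it into covariance over Sx*Sy.

```latex
\[
r
= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
       {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
= \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
       {\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\,
        \sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}}
= \frac{\mathrm{Cov}(x,y)}{S_x S_y}
\]
```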



Therefore, we can express correlation (r) much more simply.

Forget the longer equation. We can just say r = Cov / (Sx*Sy).

With Sx = 15.81 and Sy = 24.08, correlation is calculated as r = 375.0 / (15.81*24.08) = 0.985
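The same arithmetic can be sketched in R. The variable names s_x, s_y and r_by_hand are mine, used only for illustration; sd() and cov() are base R functions that both use the n - 1 divisor.

```r
x <- c(10, 20, 30, 40, 50)
y <- c(100, 120, 140, 150, 160)

s_x <- sd(x)    # standard deviation of x, about 15.81
s_y <- sd(y)    # standard deviation of y, about 24.08

# Correlation as covariance divided by the product of the standard deviations
r_by_hand <- cov(x, y) / (s_x * s_y)
round(r_by_hand, 3)   # 0.985
```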



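Let’s verify that our hand calculation is correct.

# Fertilizer amount (x) and yield (y)
x <- c(10, 20, 30, 40, 50)
y <- c(100, 120, 140, 150, 160)
# Pearson's correlation coefficient, about 0.985
cor(x, y, method = "pearson")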

It’s the same as what we calculated.


The basic characteristics of correlation are

  • r is always between -1 and +1.
  • r has the same sign as slope β1 in simple linear regression: positive when the line goes up and negative when it goes down.
  • Values close to -1 or 1 → strong (linear) association
  • Values close to 0 → little or no (linear) association.
  • When correlation r = 1 or r = -1, all data points satisfy y = a + bx exactly for some constants a and b (they do not need to be integers), and a plot will show a perfectly straight line.
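Some of these characteristics can be checked with a short R sketch. The perfectly linear data below (a = 2.5, b = -0.75) is hypothetical and only there to show the r = -1 case with non-integer constants.

```r
x <- c(10, 20, 30, 40, 50)
y <- c(100, 120, 140, 150, 160)

# r has the same sign as the slope of the fitted line
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])
sign(slope) == sign(cor(x, y))   # TRUE

# Hypothetical perfectly linear data: y = a + b*x with non-integer a and b
a <- 2.5
b <- -0.75
y_line <- a + b * x
cor(x, y_line)                   # -1, up to floating-point rounding

# Changing the units of y does not change r (unlike covariance)
same_r <- isTRUE(all.equal(cor(x, y), cor(x, y * 10)))
same_r                           # TRUE
```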

Even though it’s obvious, many people might not know this: in simple linear regression, r squared is R squared (r² = R²).

Therefore, 0.970 (= 0.985²) is R² (the coefficient of determination) in the linear regression between x and y.

This implies that 97% of the variability in y can be explained by x.
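This can be checked against a fitted model. The sketch below assumes the model is fitted with lm(), which this post has not used yet, and the variable names are mine.

```r
x <- c(10, 20, 30, 40, 50)
y <- c(100, 120, 140, 150, 160)

r_squared_from_cor <- cor(x, y)^2        # square of Pearson's r

fit <- lm(y ~ x)                         # simple linear regression of y on x
r_squared_from_lm <- summary(fit)$r.squared

# Both round to 0.97
round(c(r_squared_from_cor, r_squared_from_lm), 3)
```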



Now we understand what correlation is and how it is calculated. Our next step is to understand the slope in the linear regression model.

The slope is related to correlation. Therefore, the next question is how to calculate the slope (and intercept) in linear regression. The answer will be in the next post!!


Follow up!!
Simple linear regression (2/5)- slope and intercept of linear regression model

