Understanding Multiple Linear Regression Easily (Part 1: Calculating the Regression Equation Manually)
In my previous posts, I explained the simple linear regression model in five parts. I recommend reading those posts first.
□ Simple linear regression (1/5)- correlation and covariance
□ Simple linear regression (2/5)- slope and intercept of linear regression model
□ Simple linear regression (3/5)- standard error of slope and intercept
□ Simple linear regression (4/5)- t value on the slope and intercept
□ Simple linear regression (5/5)- R_squared
In this session, I will explain multiple regression analysis. Multiple regression analysis applies when two or more independent variables (x) influence a specific outcome (the dependent variable, y). Personally, I do not place much trust in multiple regression analysis, so I do not use it often. The model treats the independent variables (x) as if they were independent of one another, but in practice they can be correlated. This issue is known as multicollinearity, and it tends to become more serious as the number of independent variables increases. Therefore, the effect of any single independent variable should be interpreted with caution when the conclusion rests solely on the results of a multiple regression analysis.
Consider the following dataset:
No. | Yield (yi) | Time (xi1) | Moisture (xi2) |
1 | 4.3 | 4 | 0.2 |
2 | 5.5 | 5 | 0.2 |
3 | 6.8 | 6 | 0.2 |
4 | 8.0 | 7 | 0.2 |
5 | 4.0 | 4 | 0.3 |
6 | 5.2 | 5 | 0.3 |
7 | 6.6 | 6 | 0.3 |
8 | 7.5 | 7 | 0.3 |
9 | 2.0 | 4 | 0.4 |
10 | 4.0 | 5 | 0.4 |
11 | 5.7 | 6 | 0.4 |
12 | 6.5 | 7 | 0.4 |
This dataset records sunlight duration (Time, xi1), humidity (Moisture, xi2), and crop yield (Yield, yi). We want to understand how sunlight duration and humidity relate to crop yield, so this is a case for multiple regression analysis.
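In this case, the statistical model is as follows:
yi = β0 + β1xi1 + β2xi2 + εi
Here β0 is the intercept, β1 and β2 are the slopes for Time and Moisture, and εi is the error term.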
As I always emphasize, while you can obtain results in just 10 seconds when using statistical software, it’s more important to understand the underlying principles.
Let’s go ahead and check the results using R.
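# Crop yield for the 12 observations
yield <- c(4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5)
# Time: the sequence 4, 5, 6, 7 repeated for each moisture level
time <- rep(c(4, 5, 6, 7), times = 3)
# Moisture: each level repeated four times
moisture <- rep(c(0.2, 0.3, 0.4), each = 4)
dataA <- data.frame(yield, time, moisture)
dataA  # print the data frame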
yield time moisture
1 4.3 4 0.2
2 5.5 5 0.2
3 6.8 6 0.2
4 8.0 7 0.2
5 4.0 4 0.3
6 5.2 5 0.3
7 6.6 6 0.3
8 7.5 7 0.3
9 2.0 4 0.4
10 4.0 5 0.4
11 5.7 6 0.4
12 6.5 7 0.4
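Since multicollinearity is the main concern raised above, a quick check before fitting is the correlation between the two predictors. The sketch below is an addition for illustration, not part of the original workflow, and it rebuilds the two predictor vectors so that it runs on its own. In this dataset every time value is paired with every moisture level, so the correlation should come out as zero, up to floating-point rounding.

```r
# Same predictor values as in the dataset above
time <- rep(c(4, 5, 6, 7), times = 3)
moisture <- rep(c(0.2, 0.3, 0.4), each = 4)

# Pearson correlation between the two independent variables
r_x1x2 <- cor(time, moisture)
print(round(r_x1x2, 10))
```

A value near zero means the two slopes can be read separately in this example. With correlated predictors that would no longer hold.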
And now, let’s proceed with the multiple regression analysis.
# Fit yield on time and moisture by ordinary least squares
model <- lm(yield ~ time + moisture, data = dataA)
summary(model)
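If you only want the estimated coefficients rather than the full summary table, base R's coef() returns them as a named vector. The sketch below refits the same model so that it is self-contained. The new_plots data frame is a made-up example, used only to show what a one-unit change in time does to the prediction when moisture stays the same.

```r
yield <- c(4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5)
time <- rep(c(4, 5, 6, 7), times = 3)
moisture <- rep(c(0.2, 0.3, 0.4), each = 4)
dataA <- data.frame(yield, time, moisture)

model <- lm(yield ~ time + moisture, data = dataA)

# Named vector with elements (Intercept), time, moisture
b <- coef(model)
print(round(b, 2))

# Hypothetical new observations: same moisture, time differing by one unit
new_plots <- data.frame(time = c(5, 6), moisture = c(0.3, 0.3))
pred <- predict(model, newdata = new_plots)

# The difference between the two predictions equals the slope for time
print(unname(diff(pred)))
```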
From the R output, the fitted multiple regression model is y ≈ 0.67 + 1.32x1 - 8.0x2.
This means that when Time increases by 1 unit with moisture held constant, the yield increases by about 1.32 units, and when moisture increases by 1 unit with Time held constant, the yield decreases by about 8.0 units. Most people stop here, confirming the values from their statistical software.
However, it’s crucial to understand how these values were calculated. Knowing the principles behind anything is essential.
Now, our focus is on understanding how the intercept 0.67 and the slopes 1.32 for x1 and -8.0 for x2 were computed in the equation y = 0.67 + 1.32x1 - 8.0x2.
In multiple regression analysis, the intercept (b0) and the slopes (b1, b2, ...) are obtained by solving a set of simultaneous equations called the normal equations. In our case, we have two independent variables, so there are three equations:
Σyi = nb0 + Σxi1b1 + Σxi2b2
Σxi1yi = Σxi1b0 + Σxi1²b1 + Σxi1xi2b2
Σxi2yi = Σxi2b0 + Σxi1xi2b1 + Σxi2²b2
Even though it may seem challenging, as long as you understand the principles, it’s nothing more than a middle school math problem.
Let’s rearrange the data according to the formula above.
No. | Yield (yi) | Time (xi1) | Moisture (xi2) | (xi1)² | (xi2)² | xi1*yi | xi2*yi | xi1*xi2 |
1 | 4.3 | 4 | 0.2 | 16 | 0.04 | 17.2 | 0.86 | 0.8 |
2 | 5.5 | 5 | 0.2 | 25 | 0.04 | 27.5 | 1.10 | 1.0 |
3 | 6.8 | 6 | 0.2 | 36 | 0.04 | 40.8 | 1.36 | 1.2 |
4 | 8.0 | 7 | 0.2 | 49 | 0.04 | 56.0 | 1.60 | 1.4 |
5 | 4.0 | 4 | 0.3 | 16 | 0.09 | 16.0 | 1.20 | 1.2 |
6 | 5.2 | 5 | 0.3 | 25 | 0.09 | 26.0 | 1.56 | 1.5 |
7 | 6.6 | 6 | 0.3 | 36 | 0.09 | 39.6 | 1.98 | 1.8 |
8 | 7.5 | 7 | 0.3 | 49 | 0.09 | 52.5 | 2.25 | 2.1 |
9 | 2.0 | 4 | 0.4 | 16 | 0.16 | 8.0 | 0.80 | 1.6 |
10 | 4.0 | 5 | 0.4 | 25 | 0.16 | 20.0 | 1.60 | 2.0 |
11 | 5.7 | 6 | 0.4 | 36 | 0.16 | 34.2 | 2.28 | 2.4 |
12 | 6.5 | 7 | 0.4 | 49 | 0.16 | 45.5 | 2.60 | 2.8 |
Sum | Σyi = 66.1 | Σxi1 = 66.0 | Σxi2 = 3.60 | Σxi1² = 378.0 | Σxi2² = 1.16 | Σxi1yi = 383.3 | Σxi2yi = 19.19 | Σxi1xi2 = 19.8 |
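The extra columns and the Sum row of this table can be reproduced with a few lines of R. This is a sketch that assumes the same data as above. The names y, x1, x2, tab and sums are mine, chosen to mirror the table headers.

```r
y  <- c(4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5)
x1 <- rep(c(4, 5, 6, 7), times = 3)      # time
x2 <- rep(c(0.2, 0.3, 0.4), each = 4)    # moisture

# One column per term that appears in the normal equations
tab <- data.frame(y, x1, x2,
                  x1_sq = x1^2, x2_sq = x2^2,
                  x1_y = x1 * y, x2_y = x2 * y,
                  x1_x2 = x1 * x2)

# Column totals, i.e. the "Sum" row of the table
sums <- colSums(tab)
print(round(sums, 2))
```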
The last row of the table gives the sum of each column. Now, all we need to do is substitute these sums into the normal equations.
Let’s take a look at the first equation.
Σyi = nb0 + Σxi1b1 + Σxi2b2
-------- (1)
Here, n represents the sample size (which is 12 in this case), b0 is the intercept, b1 is the slope for x1, and b2 is the slope for x2. Solving equation (1) for b0 gives:
b0 = (Σyi - Σxi1b1 - Σxi2b2) / n
What is it called when you divide the sum of the data by the sample size? It’s called the mean.
That is, b0 = ȳ - x̄1b1 - x̄2b2
If we calculate the means from the given data, we get:
ȳ = 66.1 / 12 = 5.50833
x̄1 = 66.0 / 12 = 5.5
x̄2 = 3.6 / 12 = 0.3
So, the intercept is b0 = 5.50833 - 5.5b1 - 0.3b2.
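The expression for b0 says that the fitted plane passes through the point of means. A minimal check in R is sketched below, assuming the slopes are taken from lm() rather than computed by hand (the hand calculation follows later). The variable names are illustrative.

```r
y  <- c(4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5)
x1 <- rep(c(4, 5, 6, 7), times = 3)      # time
x2 <- rep(c(0.2, 0.3, 0.4), each = 4)    # moisture

fit <- lm(y ~ x1 + x2)
b <- coef(fit)

# Intercept rebuilt from the three means and the two slopes
b0_from_means <- mean(y) - mean(x1) * b["x1"] - mean(x2) * b["x2"]

print(c(mean_y = mean(y), mean_x1 = mean(x1), mean_x2 = mean(x2)))
print(unname(b0_from_means))      # intercept from the means
print(unname(b["(Intercept)"]))   # intercept reported by lm()
```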
Now, let’s examine the second and third equations.
Σxi1yi = Σxi1b0 + Σxi1²b1 + Σxi1xi2b2
-------- (2)
Σxi2yi = Σxi2b0 + Σxi1xi2b1 + Σxi2²b2
-------- (3)
Let’s go ahead and substitute the calculated values into the equations.
383.3 = 66.0 b0 + 378.0 b1 + 19.8 b2
-------- (2)
19.19 = 3.60 b0 + 19.8 b1 + 1.16 b2
-------- (3)
And we already know that b0 = 5.50833 - 5.5b1 - 0.3b2, so let's substitute this expression for b0 into equations (2) and (3).
383.3 = 66.0*(5.50833 - 5.5 b1 - 0.3 b2) + 378.0 b1 + 19.8 b2
-------- (2)
19.19 = 3.60*(5.50833 - 5.5 b1 - 0.3 b2) + 19.8 b1 + 1.16 b2
-------- (3)
Now, from here, we need to determine the values of b1 and b2.
You can certainly calculate the values manually, but many websites can also solve a system of two equations in two unknowns for you. I found the website below and used it to obtain the values.
https://www.symbolab.com/solver/function-intercepts-calculator
When we set b1 as x and b2 as y in the equations, the calculated values for x and y are as follows:
x = 19.75022 / 15 ≈ 1.32
y = -0.639988 / 0.08 ≈ -8.0
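To see where these two fractions come from, expand the brackets in (2) and (3) and collect terms. The worked steps below use only the numbers already in the equations above. Note that the b2 term drops out of (2) and the b1 term drops out of (3), because 66.0 × 0.3 = 19.8 and 3.60 × 5.5 = 19.8 cancel the 19.8 coefficients exactly in this dataset.

```latex
\begin{aligned}
383.3 &= 363.54978 + (378.0 - 363.0)\,b_1 + (19.8 - 19.8)\,b_2
  &&\Rightarrow\quad 15\,b_1 = 19.75022 \\
19.19 &= 19.829988 + (19.8 - 19.8)\,b_1 + (1.16 - 1.08)\,b_2
  &&\Rightarrow\quad 0.08\,b_2 = -0.639988
\end{aligned}
```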
In other words, b1 ≈ 1.32 and b2 ≈ -8.0. Since we had b0 = 5.50833 - 5.5 b1 - 0.3 b2, substituting the unrounded values (b1 ≈ 1.3167, b2 ≈ -8.0) gives b0 ≈ 0.67. Therefore, y = 0.67 + 1.32x1 - 8.0x2.
This equation matches the values provided by R.
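The same three normal equations can also be solved in one step as a 3×3 linear system. The sketch below plugs the sums from the table into a coefficient matrix and uses base R's solve(). The names A, rhs and b_manual are mine. It then compares the result with lm() on the raw data as a cross-check.

```r
# Coefficients of equations (1), (2), (3); columns correspond to b0, b1, b2
A <- matrix(c(12.0,  66.0,  3.60,
              66.0, 378.0, 19.80,
              3.60,  19.8,  1.16),
            nrow = 3, byrow = TRUE)

# Left-hand sides of the equations: sum(y), sum(x1*y), sum(x2*y)
rhs <- c(66.1, 383.3, 19.19)

# Solve A %*% b = rhs for b
b_manual <- solve(A, rhs)
names(b_manual) <- c("b0", "b1", "b2")
print(round(b_manual, 4))

# Cross-check against lm() fitted to the raw data
yield <- c(4.3, 5.5, 6.8, 8.0, 4.0, 5.2, 6.6, 7.5, 2.0, 4.0, 5.7, 6.5)
time <- rep(c(4, 5, 6, 7), times = 3)
moisture <- rep(c(0.2, 0.3, 0.4), each = 4)
b_lm <- coef(lm(yield ~ time + moisture))
print(all.equal(unname(b_lm), unname(b_manual)))
```

Working with the unrounded sums gives an intercept of about 0.6667, which is why R reports 0.67.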
The next question is how the other statistics in the R output (the values marked with the red box) were calculated. I will cover this topic in the next post.