Simple linear regression (5/5)- Coefficient of determination
Here is data for x and y. I would like to perform regression analysis to understand how y changes with x.
| n | x | y |
|---|----|-----|
| 1 | 10 | 30 |
| 2 | 20 | 40 |
| 3 | 30 | 50 |
| 4 | 40 | 80 |
| 5 | 50 | 90 |
| 6 | 60 | 100 |
| 7 | 70 | 120 |
I have data for x and y as described above, and want to determine the regression model for this data, where the dependent variable y changes according to the independent variable x, in the form y = β0 + β1x. Although the values can be obtained directly through a statistical program, I will calculate the equation manually to gain an understanding of the underlying principles. If you want to learn how to calculate a regression model equation, please refer to the link below.
□ Simple linear regression (2/5)- slope and intercept of linear regression model
■ Correlation coefficient
mean of x: 40.0, standard deviation of x (Sx): 21.602
mean of y: 72.857, standard deviation of y (Sy): 33.523

r = Σ[ (xi – x̄)/Sx * (yi – ȳ)/Sy ] / (n – 1)

r = ( (10 – 40.0)/21.602 * (30 – 72.857)/33.523 + (20 – 40.0)/21.602 * (40 – 72.857)/33.523 + (30 – 40.0)/21.602 * (50 – 72.857)/33.523 + (40 – 40.0)/21.602 * (80 – 72.857)/33.523 + (50 – 40.0)/21.602 * (90 – 72.857)/33.523 + (60 – 40.0)/21.602 * (100 – 72.857)/33.523 + (70 – 40.0)/21.602 * (120 – 72.857)/33.523 ) / (7 – 1) = 0.9896
□ Simple linear regression (1/5)- correlation and covariance
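As a cross-check, the same means, standard deviations, and correlation coefficient can be obtained in SAS with PROC MEANS and PROC CORR. This is a minimal sketch, assuming the EXP dataset created in the verification section below:

PROC MEANS DATA=EXP MEAN STD MAXDEC=3;   /* means and standard deviations of x and y */
VAR x y;
RUN;

PROC CORR DATA=EXP;                      /* Pearson correlation coefficient r */
VAR x y;
RUN;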
■ How to calculate slope (β1)
β1 = r*Sy/Sx
β1 = 0.9896 * (33.523 / 21.602) = 1.5357
■ How to calculate intercept (β0)?
In the linear regression model equation y = β0 + β1x, the fitted line always passes through the point of means (x̄, ȳ), so we can substitute x̄ = 40.0 and ȳ = 72.857:
72.857 = β0 + 1.5357 * 40.0
∴ β0 = 72.857 – 61.428 = 11.429
Therefore, y = 11.429 + 1.5357x
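These two formulas can be checked with a short SAS DATA step. A minimal sketch (the dataset name BETA is my own choice; the summary values come from the calculations above):

DATA BETA;
r = 0.9896; sx = 21.602; sy = 33.523;   /* correlation and standard deviations */
xbar = 40.0; ybar = 72.857;             /* means of x and y */
b1 = r * sy / sx;                       /* slope: 1.5357 */
b0 = ybar - b1 * xbar;                  /* intercept: 11.429 */
PUT b1= b0=;                            /* writes the results to the log */
RUN;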
■ Verification of the outcome using a statistical program
I use SAS. First, I create a data table.
DATA EXP;
INPUT x y;
CARDS;
10 30
20 40
30 50
40 80
50 90
60 100
70 120
;
RUN;
Now, I will run a simple linear regression analysis.
proc reg data=EXP alpha=0.05;
model y=x;
run;
quit;
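If you also want SAS to save the predicted values ŷ that we will use in the ANOVA section below, PROC REG accepts an OUTPUT statement. A minimal sketch (the dataset name PRED and variable name yhat are my own choices):

proc reg data=EXP alpha=0.05;
model y=x;
output out=PRED p=yhat;   /* saves the predicted value of y for each observation */
run;
quit;

proc print data=PRED;
run;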
The statistical program provided the regression equation y = 11.429 + 1.5357x. It is the same as the result we obtained manually. Moreover, the statistical program has provided the coefficient of determination (R-squared), which is 0.9793.
However, if you stop here and simply accept this result, you may not fully understand the principle behind the coefficient of determination.
■ Coefficient of determination in simple linear regression: Understanding ANOVA
You might wonder why we suddenly bring up ANOVA in the context of regression analysis. Analysis of variance (ANOVA) partitions the variation in data into components. Now, we will examine how the variation in the y values is divided up for our x and y data.
| n | x | y | ŷ | model equation |
|---|----|-----|-------|---------------------------|
| 1 | 10 | 30 | 26.8 | ŷ = 11.429 + 1.5357 * 10 |
| 2 | 20 | 40 | 42.1 | ŷ = 11.429 + 1.5357 * 20 |
| 3 | 30 | 50 | 57.5 | ŷ = 11.429 + 1.5357 * 30 |
| 4 | 40 | 80 | 72.9 | ŷ = 11.429 + 1.5357 * 40 |
| 5 | 50 | 90 | 88.2 | ŷ = 11.429 + 1.5357 * 50 |
| 6 | 60 | 100 | 103.6 | ŷ = 11.429 + 1.5357 * 60 |
| 7 | 70 | 120 | 118.9 | ŷ = 11.429 + 1.5357 * 70 |

ȳ = 72.86
Earlier, the statistical program provided the regression model y = 11.429 + 1.5357x for the x and y data. With this regression model, we can now predict ŷ values for specific x values.
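As a quick check, the ŷ column above can be reproduced by applying the fitted equation in a DATA step (a sketch; the dataset name CHECK is my own choice):

DATA CHECK;
SET EXP;                       /* one row per observation */
yhat = 11.429 + 1.5357 * x;    /* fitted regression equation applied to each x */
RUN;

PROC PRINT DATA=CHECK NOOBS;
RUN;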
Now, let’s explore the concept of Total = Fit + Error.
| n | x | yi | ŷ | yi – ȳ (Total) | ŷ – ȳ (Fit) | yi – ŷ (Error) |
|---|----|-----|-------|--------|--------|-------|
| 1 | 10 | 30 | 26.8 | -42.86 | -46.07 | 3.21 |
| 2 | 20 | 40 | 42.1 | -32.86 | -30.72 | -2.14 |
| 3 | 30 | 50 | 57.5 | -22.86 | -15.36 | -7.50 |
| 4 | 40 | 80 | 72.9 | 7.14 | 0.00 | 7.14 |
| 5 | 50 | 90 | 88.2 | 17.14 | 15.35 | 1.79 |
| 6 | 60 | 100 | 103.6 | 27.14 | 30.71 | -3.57 |
| 7 | 70 | 120 | 118.9 | 47.14 | 46.07 | 1.07 |

ȳ = 72.86
First, we calculated the mean of the y values, which is 72.86. Then, we subtracted this mean from each individual y value. Representing individual y values as yi and the overall mean of the y values as ȳ, this calculation can be expressed as yi – ȳ. We squared each result and summed them up. This process can be represented by the formula Σ(yi – ȳ)².
Second, using the model equation, we subtracted the overall mean of the y values from each predicted value. The calculation formula for this is ŷ – ȳ. Again, we squared each result and summed them up. This process can be represented by the formula Σ(ŷ – ȳ)².
Finally, we subtracted the predicted y values from the individual y values. The calculation formula for this is yi – ŷ. We squared each result and summed them up. This process can be represented by the formula Σ(yi – ŷ)².
| n | x | yi | ŷ | (yi – ȳ)² (SST) | (ŷ – ȳ)² (SSR) | (yi – ŷ)² (SSE) |
|---|----|-----------|-------|-----------|-----------|----------|
| 1 | 10 | 30 | 26.8 | 1836.7 | 2122.5 | 10.3 |
| 2 | 20 | 40 | 42.1 | 1079.6 | 943.3 | 4.6 |
| 3 | 30 | 50 | 57.5 | 522.4 | 235.8 | 56.3 |
| 4 | 40 | 80 | 72.9 | 51.0 | 0.0 | 51.0 |
| 5 | 50 | 90 | 88.2 | 293.9 | 235.8 | 3.2 |
| 6 | 60 | 100 | 103.6 | 736.7 | 943.4 | 12.8 |
| 7 | 70 | 120 | 118.9 | 2222.4 | 2122.6 | 1.1 |
| | | ȳ = 72.86 | | Σ(yi – ȳ)² = 6742.86 | Σ(ŷ – ȳ)² = 6603.45 | Σ(yi – ŷ)² = 139.29 |
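These squared-deviation columns and their sums can also be computed in SAS. A minimal sketch, assuming the CHECK dataset with the predicted values yhat created earlier:

DATA SS;
SET CHECK;
ybar = 72.857;             /* overall mean of y */
sst = (y - ybar)**2;       /* (yi - ybar)^2, total */
ssr = (yhat - ybar)**2;    /* (yhat - ybar)^2, fit */
sse = (y - yhat)**2;       /* (yi - yhat)^2, error */
RUN;

PROC MEANS DATA=SS SUM MAXDEC=2;   /* column sums: SST, SSR, SSE */
VAR sst ssr sse;
RUN;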
[Note] Understanding the variation in data

1) yi – ȳ represents the total variation in the data. Let's consider the data: 5, 7, 8, 5, 5. The mean of this data is 6. Now, subtract the mean from each individual value: 5 – 6, 7 – 6, 8 – 6, 5 – 6, 5 – 6. These values represent the total variation in the data. In simpler terms, if the data were 5, 5, 5, 5, 5, with a mean of 5, the total variation would be 0, indicating no differences between data points. These differences between each individual value and the overall mean are called deviations. The sum of deviations is always 0, so simply summing them cannot capture the total variation in the data. How can we resolve this issue? We square each deviation: (5 – 6)², (7 – 6)², (8 – 6)², (5 – 6)², (5 – 6)². Now the sum of squared deviations is not 0; it is 8. This process can be represented by the formula Σ(yi – ȳ)².

2) ŷ – ȳ represents the variation in the predicted values, i.e., how much of the variation the model equation captures. Through the regression model equation, we obtained predicted values ŷ and subtracted the overall mean of the y values. For example, if the x values are 1, 2, 3, 4, 5 and the y values are 10, 20, 30, 40, 50, the regression model equation is ŷ = 10x and the mean of the y values is ȳ = 30. According to the model equation, the ŷ values are 10, 20, 30, 40, 50, so ŷ – ȳ is 10 – 30, 20 – 30, 30 – 30, 40 – 30, 50 – 30, and the sum of squares Σ(ŷ – ȳ)² is 1,000. If the y values were instead 10, 20, 30, 40, 200, the model equation would be ŷ = -60 + 40x, with ȳ = 60 and predicted values -20, 20, 60, 100, 140. Then ŷ – ȳ is -20 – 60, 20 – 60, 60 – 60, 100 – 60, 140 – 60, and Σ(ŷ – ȳ)² is 16,000. Which model is more accurate? Certainly ŷ = 10x, as its predicted values are identical to the actual measured values. Note that the size of Σ(ŷ – ȳ)² by itself does not indicate fitness; what matters is its share of the total variation. For ŷ = 10x, Σ(ŷ – ȳ)² accounts for all of the total variation (1,000 out of 1,000), while for ŷ = -60 + 40x it accounts for only 16,000 out of a total of 25,000.

3) yi – ŷ represents the error in the regression model. Subtracting the predicted value from each individual y value gives the difference between actual and predicted values, i.e., the error generated by the regression model. For the first example (x values 1, 2, 3, 4, 5 and y values 10, 20, 30, 40, 50, with ŷ = 10x), the error is 0. For the second example (y values 10, 20, 30, 40, 200, with ŷ = -60 + 40x), we square each difference and sum them up, a process represented by the formula Σ(yi – ŷ)²: (10 – (-20))² + (20 – 20)² + (30 – 60)² + (40 – 100)² + (200 – 140)² = 9,000. If the data's variation increases (i.e., the value suddenly jumps to 200 after 10, 20, 30, 40), the fitness of the regression model decreases, and the error generated by the model increases.
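To make the note concrete, here is a minimal SAS sketch that computes the same decomposition for the second toy dataset (x = 1 to 5, y = 10, 20, 30, 40, 200); the sums come out to 25,000 = 16,000 + 9,000:

DATA TOY;
INPUT x y;
yhat = -60 + 40 * x;     /* fitted line for this data */
sst = (y - 60)**2;       /* ybar = 60 */
ssr = (yhat - 60)**2;
sse = (y - yhat)**2;
CARDS;
1 10
2 20
3 30
4 40
5 200
;
RUN;

PROC MEANS DATA=TOY SUM;   /* sums: SST = 25000, SSR = 16000, SSE = 9000 */
VAR sst ssr sse;
RUN;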
Now, we can verify the following formula:
Σ(yi – ȳ)² = Σ(ŷ – ȳ)² + Σ(yi – ŷ)²
We squared the deviations and summed them up, so we can refer to this process as “sum of squares.”
Earlier, we calculated the sum of squares for the total variation in the data, the variation in predicted values due to the regression model, and the sum of squares for the error. From now on, let’s denote these values as Sum of Squares Total (SST), Sum of Squares due to Regression (SSR), and Sum of Squares Error (SSE).
SST = SSR + SSE

SST = Sum of Squares Total = Σ(yi – ȳ)²
SSR = Sum of Squares due to Regression = Σ(ŷ – ȳ)²
SSE = Sum of Squares Error = Σ(yi – ŷ)²

Substituting our values:

6742.86 ≈ 6603.45 + 139.29

(The right-hand side sums to 6742.74; the 0.12 difference is rounding error from using ŷ values rounded to one decimal place.)
If we have calculated each sum of squares, let's create an ANOVA table. The structure of an ANOVA table in a simple linear regression model is as follows:

| Source | DF | Sum of Squares | Mean Square | F Value |
|--------|-------|----------------|---------------------|-----------|
| Model | 1 | SSR | MSR = SSR / 1 | MSR / MSE |
| Error | n – 2 | SSE | MSE = SSE / (n – 2) | |
| Corrected Total | n – 1 | SST | | |

The number of data points is n = 7. Substituting the previously calculated sum of squares values into the ANOVA table:

| Source | DF | Sum of Squares | Mean Square | F Value |
|--------|----|----------------|-------------|---------|
| Model | 1 | 6603.45 | 6603.45 | ≈ 237.0 |
| Error | 5 | 139.29 | 27.86 | |
| Corrected Total | 6 | 6742.86 | | |
If you compare this manually calculated ANOVA table with the one printed by the PROC REG step above, the values match up to rounding, which confirms the accuracy of the calculations.
Now we understand the principles behind constructing the ANOVA table in regression analysis.
Sometimes, ANOVA tables in different statistical programs are displayed with different names, causing confusion. For example, while SAS labels the rows "Model," "Error," and "Corrected Total," SPSS labels them "Regression," "Residual," and "Total." Some statistics books also write TSS (total sum of squares) in place of SST, explaining TSS = SSR + SSE, which adds to the potential confusion.
Therefore, if you thoroughly understand the concepts explained above, you’ll be able to comprehend them solidly, even if they are presented with different names in statistical programs or explained with different equations in statistical books.
Σ(yi – ȳ)² = Σ(ŷ – ȳ)² + Σ(yi – ŷ)²
SST = SSR + SSE
If you’ve followed along until here, calculating the coefficient of determination should be a piece of cake.
Simply divide SSR by SST; equivalently, the formula can be written as 1 – SSE/SST.

R² = SSR / SST = 6603.45 / 6742.86 = 0.979
R² = 1 – SSE / SST = 1 – 139.29 / 6742.86 = 0.979
The R² we calculated manually is 0.979, matching the value provided by SAS (0.9793). As a further check, in simple linear regression R² is simply the square of the correlation coefficient: 0.9896² ≈ 0.9793.
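The same division can be done in a one-off DATA step (a sketch; the values are the sums of squares from the table above):

DATA RSQ;
sst = 6742.86; ssr = 6603.45; sse = 139.29;
r2_from_ssr = ssr / sst;       /* 0.9793 */
r2_from_sse = 1 - sse / sst;   /* 0.9793 */
PUT r2_from_ssr= r2_from_sse=;
RUN;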
What does the ratio of SSR to SST signify? A large SSR relative to SST means that the variation captured by the regression model accounts for most of the total variation, leaving only a small error. In other words, it suggests a good fit of the regression model. As SSE decreases, SSR approaches SST and R² increases.
In other words, it means that 97.9% of the variation in y values can be explained by x. The remaining 2.1% is likely attributed to other factors or sources of variation.
The coefficient of determination is the proportion of the total variance of the dependent variable that is explained by the regression model. The higher this proportion, the better our estimated regression model fits the data.