R-Squared Calculation in Linear Regression with Zero Intercept
Previously, I scanned wheat grains to obtain the area of each grain, and then measured the weight corresponding to each area in order to develop a model equation. The following regression describes the relationship between grain area and weight.
# Data download https://www.kaggle.com/datasets/agronomy4future/wheat-grain-area-vs-weight
# to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/wheat_grain_area_vs_weight.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
# to analyze linear regression
model=lm(grain_weight ~ grain_area, data=dataA)
summary(model)
Call:
lm(formula = grain_weight ~ grain_area, data = dataA)
Residuals:
Min 1Q Median 3Q Max
-14.1497 -0.6232 -0.4314 0.8931 10.9166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.7155 0.4866 -28.19 <2e-16 ***
grain_area 3.3333 0.0266 125.32 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.965 on 1330 degrees of freedom
Multiple R-squared: 0.9219, Adjusted R-squared: 0.9219
F-statistic: 1.57e+04 on 1 and 1330 DF, p-value: < 2.2e-16
I obtained the equation y = 3.3333x - 13.7155, where y is the grain weight (mg) and x is the grain area (mm²), using both Excel and R. However, this model predicts negative values of y for small values of x, which is unrealistic: it implies the grain weight becomes negative once the grain area decreases beyond a certain point. To address this issue, we can simply force the intercept to be zero.
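As a quick check (a minimal sketch using the model fitted above, with a few hypothetical grain areas), we can confirm the coefficients and see where the predictions turn negative:
# coefficients of the fitted model: intercept ≈ -13.7155, slope ≈ 3.3333
coef(model)
# predictions turn negative below the x-intercept, roughly 13.7155 / 3.3333 ≈ 4.1 mm²
predict(model, newdata=data.frame(grain_area=c(2, 4, 6)))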
In Excel, we can force the intercept to zero by selecting ‘Set Intercept as 0’ in the ‘Format Trendline’ dialog box.
I obtained a new regression model, y = 2.594x, with an R-squared value of 0.9938 after forcing the intercept to be zero. However, I have doubts about this model: although the R-squared value is high, the residuals appear relatively large, which makes me question the R-squared value provided by Excel. I am therefore uncertain whether this regression model with an R-squared value of 0.99 is truly valid.
Since I had doubts about the accuracy of the R-squared value provided by Excel, I attempted to force the intercept to zero in R using the following code.
#to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/wheat_grain_area_vs_weight.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
grain_area grain_weight
1 15.57 43.40
2 17.14 49.70
3 16.24 45.20
4 7.85 11.00
5 14.32 36.40
6 10.93 25.60
7 18.69 55.60
8 10.81 24.50
9 16.90 49.90
10 13.79 42.20
...
# to force the intercept to 0
intercept_zero=lm(grain_weight ~ 0 + grain_area, data=dataA)
summary(intercept_zero)
Call:
lm(formula = grain_weight ~ 0 + grain_area, data = dataA)
Residuals:
Min 1Q Median 3Q Max
-13.1734 -2.5865 -0.3858 1.8437 12.9090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
grain_area 2.594003 0.005611 462.3 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.746 on 1331 degrees of freedom
Multiple R-squared: 0.9938, Adjusted R-squared: 0.9938
F-statistic: 2.137e+05 on 1 and 1331 DF, p-value: < 2.2e-16
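For reference, the reported value can also be extracted directly from the model object (a one-line sketch using the intercept_zero model fitted above):
# R² as reported by summary(); the same 0.9938 value Excel gives
summary(intercept_zero)$r.squared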
This value is the same as the one Excel provided, but I'm pretty sure this R² is not correct. I'd like to know the correct R² when forcing the intercept to zero, in order to estimate the fit of the regression model. However, it seems that I can't obtain the correct R² from Excel or other statistical programs.
So, let's go back to basics. I'll calculate R² by hand.
First, I'll calculate R² before forcing the intercept to 0.
1) Prediction value
In Excel, I'll calculate the predicted values using the equation y = 3.3333x - 13.7155.
You can also easily calculate the predicted values using R.
model=lm(grain_weight ~ grain_area, data=dataA)
prediction_calculation=predict(model)
dataA$prediction_value=prediction_calculation
grain_area grain_weight prediction_value
1 15.57 43.40 38.18340
2 17.14 49.70 43.41663
3 16.24 45.20 40.41669
4 7.85 11.00 12.45061
...
2) Data partitioning
Now, let's partition the data as Data = Fit + Error. This comes from the identity underlying the simple linear regression model: yᵢ - ȳ = (ŷᵢ - ȳ) + (yᵢ - ŷᵢ), where yᵢ - ȳ is the Data term, ŷᵢ - ȳ is the Fit, and yᵢ - ŷᵢ is the Error.
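As a sanity check (a small sketch using the prediction_value column created above), the identity holds for every observation:
# verify Data = Fit + Error for each observation; should print TRUE
y_mean=mean(dataA$grain_weight)
all.equal(dataA$grain_weight - y_mean,
          (dataA$prediction_value - y_mean) + (dataA$grain_weight - dataA$prediction_value))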
3) Sum of squares
Next, I’ll square each value and calculate the sum of squares for each case.
Then we can check that SST = SSR + SSE.
However, there may be small differences in the hand calculation because we used the model equation y = 3.3333x - 13.7155 with coefficients rounded to the 4th decimal; the actual fitted values are not exactly 3.3333 or -13.7155. I will perform the same calculations in R to compare the results.
# center the observations and partition each one as Data = Fit + Error
y_mean=mean(dataA$grain_weight)
dataA$Data=dataA$grain_weight-y_mean # yᵢ - ȳ
dataA$Fit=dataA$prediction_value-y_mean # ŷᵢ - ȳ
dataA$Error=dataA$grain_weight-dataA$prediction_value # yᵢ - ŷᵢ
# square each term before summing
dataA$SST=(dataA$Data)^2
dataA$SSR=(dataA$Fit)^2
dataA$SSE=(dataA$Error)^2
sum(dataA$SST)
[1] 149735.9
sum(dataA$SSR)
[1] 138044.8
sum(dataA$SSE)
[1] 11691.09
# check SST = SSR + SSE; the difference should be ~0 (up to floating-point error)
sum(dataA$SST) - (sum(dataA$SSR) + sum(dataA$SSE))
[1] 0.000000
We can verify that SST = SSR + SSE.
4) R-squared
We can calculate R-squared using this formula: R² = SSR / SST.
Therefore, R² ≈ 0.9219 (= 138047.88 / 149735.93, from the Excel hand calculation; the slight difference from the R sums above is due to the rounded coefficients). This is the same value that Excel and R provide.
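In R, the same value can be reproduced from the sums of squares above, or pulled directly from the model object:
# R² from the hand-built sums of squares
sum(dataA$SSR)/sum(dataA$SST) # ≈ 0.9219
# R² as reported by summary()
summary(model)$r.squared # ≈ 0.9219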
Second, I'll calculate R² after forcing the intercept to 0.
This calculation reproduces the R² value provided by both Excel and R.
The important thing is that when the intercept is forced to 0, SST (Sum of Squares Total) is not calculated as Σ(yᵢ - ȳ)² but simply as Σyᵢ². Likewise, SSR (Sum of Squares due to Regression) is not calculated as Σ(ŷᵢ - ȳ)² but as Σŷᵢ². Only SSE (Sum of Squared Error) is still calculated as Σ(yᵢ - ŷᵢ)².
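To see exactly where the 0.9938 comes from, here is a short sketch reproducing this uncentered calculation with the intercept_zero model fitted above:
# uncentered sums of squares used when the intercept is forced to 0
y_hat0=predict(intercept_zero)
SST0=sum(dataA$grain_weight^2) # Σyᵢ², not Σ(yᵢ - ȳ)²
SSR0=sum(y_hat0^2) # Σŷᵢ², not Σ(ŷᵢ - ȳ)²
SSE0=sum((dataA$grain_weight - y_hat0)^2) # Σ(yᵢ - ŷᵢ)², unchanged
SSR0/SST0 # ≈ 0.9938, the value Excel and R report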
Therefore, the values of SST and SSR increase dramatically, and this is where the mistake starts. If we still assume R² = SSR / SST and calculate R² with the intercept forced to zero, we get ≈ 0.9938 (= 2999131.15 / 3017806.25). This is the same value Excel provides when forcing the intercept to zero.
Sometimes programs are wrong
When programs provide you with certain values, you usually accept them without any doubt. However, when forcing the intercept to zero, the calculation R² = SSR / SST is incorrect. Instead, R² should be calculated as shown below:
R² = 1 - SSE (when the intercept is 0) / SST (when the intercept exists)
Only the SSE (Sum of Squared Error) is calculated as Σ(yᵢ - ŷᵢ)² regardless of the presence of an intercept. SST (Sum of Squares Total) and SSR (Sum of Squares due to Regression), however, are not computed the same way when the intercept is forced to zero. Therefore, we cannot use SST and SSR values obtained under a different modeling assumption to calculate R² with a zero intercept.
This equation was cross-checked with several statisticians.
https://stats.stackexchange.com/questions/495217/when-forcing-intercept-to-zero-how-r-squared-is-changed
If we calculate R² for the zero-intercept model using the equation suggested above:
R² = 1 - 18675.90 / 149735.93 ≈ 0.875
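The same corrected value can be obtained in R with a few lines (a sketch combining the zero-intercept residuals with the ordinary, mean-centered SST):
# corrected R²: SSE from the zero-intercept model,
# SST centered around the mean as in the model with an intercept
SSE0=sum(residuals(intercept_zero)^2) # ≈ 18675.90
SST=sum((dataA$grain_weight-mean(dataA$grain_weight))^2) # ≈ 149735.93
1 - SSE0/SST # ≈ 0.875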
“Regression fitting is based on minimizing error. Therefore, if we artificially manipulate the fit, R² should decrease. In other words, if artificially manipulating the fit can produce a better R², the original regression model is wrong!”
FYI
I found that different language versions of Excel report different R² values. The English version of Excel calculated an incorrect R² (0.9938), while the Spanish version calculated the correct R² (0.875). I don't know why, and I wonder how other language versions of Excel behave.