R-Squared Calculation in Linear Regression with Zero Intercept

R-Squared Calculation in Linear Regression with Zero Intercept


Previously, I scanned wheat grains to obtain the area of each grain, and then measured the weight of each grain corresponding to its area in order to develop a model equation. The following regression demonstrates the relationship between grain area and weight.


# Data download
https://www.kaggle.com/datasets/agronomy4future/wheat-grain-area-vs-weight

#to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/wheat_grain_area_vs_weight.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))

# to analyze linear regression
model=lm(grain_weight ~ grain_area, data=dataA)
summary(model)

Call:
lm(formula = grain_weight ~ grain_area, data = dataA)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.1497  -0.6232  -0.4314   0.8931  10.9166 

Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)    
(Intercept)  -13.7155   0.4866      -28.19   <2e-16 ***
grain_area    3.3333    0.0266      125.32   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.965 on 1330 degrees of freedom
Multiple R-squared:  0.9219,	Adjusted R-squared:  0.9219 
F-statistic: 1.57e+04 on 1 and 1330 DF,  p-value: < 2.2e-16

I obtained the equation y = 3.3333x - 13.7155, where y is the grain weight (mg) and x is the grain area (mm2), using both Excel and R. However, this model predicts negative values of y for small values of x, which is unrealistic as it implies that the grain weight would become negative when the grain area decreases beyond a certain point. To address this issue, we can simply force the intercept to be zero.

In Excel, we can force the intercept to zero by selecting ‘Set Intercept as 0’ in the ‘Format Trendline’ dialog box.


I obtained a new regression model, y = 2.594x, with an R-squared value of 0.9938 after forcing the intercept to be zero. However, I have doubts about the accuracy of this regression model. Although the R-squared value is high, the residuals appear to be relatively large, which leads me to question the accuracy of the R-squared value provided by Excel. Therefore, I am uncertain whether this regression model with an R-squared value of 0.99 is truly valid.



Since I had doubts about the accuracy of the R-squared value provided by Excel, I attempted to force the intercept to zero using R using the following code.

#to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/wheat_grain_area_vs_weight.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))

    grain_area grain_weight
1        15.57        43.40
2        17.14        49.70
3        16.24        45.20
4         7.85        11.00
5        14.32        36.40
6        10.93        25.60
7        18.69        55.60
8        10.81        24.50
9        16.90        49.90
10       13.79        42.20
.
.
.

# to force intercept to 0
intercept_zero=lm (grain_weight ~ 0 + grain_area, data=dataA)
summary (intercept_zero)

Call:
lm(formula = grain_weight ~ 0 + grain_area, data = dataA)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.1734  -2.5865  -0.3858   1.8437  12.9090 


***
Coefficients:
            Estimate Std.  Error      t value   Pr(>|t|)    
grain_area  2.594003       0.005611   462.3     <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.746 on 1331 degrees of freedom
Multiple R-squared:  0.9938,	Adjusted R-squared:  0.9938 
F-statistic: 2.137e+05 on 1 and 1331 DF,  p-value: < 2.2e-16

This value is the same as Excel provided, but I’m pretty sure this R2 is not correct. I’d like to know the correct R2 when forcing intercept to zero to estimate the fitness of the regression model. However, it seems that I can’t obtain the correct R2 using Excel or statistical programs.

So,, Let’s back to the basic. I’ll calculate R2 by hand.



First, I’ll calculate R2 before forcing intercept to 0.

1) Prediction value

First, in Excel, I’ll calculate the predicted values using the equation y = 3.3333x - 13.716


You can easily calculate prediction values using R.

model=lm(grain_weight ~ grain_area, data=dataA)
prediction_calculation=predict(model)
dataA$prediction_value=prediction_calculation

   grain_area  grain_weight  prediction_value
1  15.57       43.40         38.18340
2  17.14       49.70         43.41663
3  16.24       45.20         40.41669
4  7.85        11.00         12.45061
.
.
.

2) Data partitioning

Now, let’s divide data as Data = Fit + Error


This concept is based on the below simple linear regression model.


3) Sum of squares

Next, I’ll square each value and calculate the sum of squares for each case.


Then, we can calculate SST =SSR + SSE

However, there may be some differences in the calculations due to decimal issues because we used the model equation y=3.3333x - 13.7155 with values rounded to the 4th decimal. The actual values may not be exactly 3.3333 or 13.7155. I will perform the same calculations in R to compare the results.

y_mean=mean(dataA$grain_weight)

dataA$Data=dataA$grain_weight-y_mean
dataA$Fit=dataA$prediction_value-y_mean
dataA$Error=dataA$grain_weight-dataA$prediction_value

dataA$SST=(dataA$Data)^2
dataA$SSR=(dataA$Fit)^2
dataA$SSE=(dataA$Error)^2

sum(dataA$SST)
[1] 149735.9
sum(dataA$SSR)
[1] 138044.8
sum(dataA$SSE)
[1] 11691.09

sum(dataA$SST) - (sum(dataA$SSR) - sum(dataA$SSE))
[1] 0.000000

We can verify SST = SSR + SSE

4) R-squared

We can calculate R-squared using this formula.

Therefore, R2 will be ≈ 0.9219 (=138047.88 / 149735.93). This is the same value that Excel or R provides.



Second, I’ll calculate R2 after forcing intercept to 0

Now, I will calculate the R2 value after forcing the intercept to zero.

This calculation is to obtain the same value as the R2 value provided by both Excel and R.


The important thing is, when forcing intercept to 0, SST (Sum of Squares Total) was not calculated by Σ(yi - ȳ)2 , but just Σ(yi)2. Also, SSR (Sum of Squares due to regression) was not calculated by Σ(ŷi - ȳ)2 , but Σ(ŷi)2. Only SSE (Sum of Squared Error) was calculated by Σ(yi - ŷi)2.

Therefore, the value of SST and SSR will be dramatically increased. The mistake would start from here. If we still think R2 = SSR / SST, and calculate R2 when forcing intercept to zero, it will be ≈ 0.9938 (= 2999131.15 / 3017806.25). This value is the same as Excel provided when forcing intercept to zero.


Sometimes, programs are wrong

When programs provide you with certain values, you usually accept them without any doubts. However, when forcing the intercept to zero, the calculation R2 = SSR / SST is incorrect. Instead, when forcing the intercept to zero, R2 should be calculated as shown below.

1 – SSE (when intercept is 0) / SST (when intercept exists)

Only the SSE (Sum of Squared Error) is calculated as Σ(yi - ŷi)2, regardless of the presence of an intercept. However, SST (Sum of Squares Total) and SSR (Sum of Squares due to Regression) are not computed in the same way when the intercept is forced to zero. Therefore, we cannot use SST and SSR values obtained under a different modeling assumption to calculate R2 with a zero intercept.

This equation was cross-checked with several statisticians.

https://stats.stackexchange.com/questions/495217/when-forcing-intercept-to-zero-how-r-squared-is-changed

If we calculate R2 when forcing the intercept to zero using the equation I suggested above,

R2 = 1 - 18675.90 / 149735.93 ≈ 0.875

R2 will be 0.875

“Regression fitting is based on minimizing error. Therefore, if we artificially manipulate it, R2 should be decreased. In other words, if we artificially manipulate the fitting and can obtain a better R2, the original regression model is wrong!”


FYI

I found different language version of Excel provides different R2. English Excel version calculated a wrong R2 (0.9938) while Spanish Excel version calculated a correct R2 (0.875). I don’t know why. I’m wondering how about other language versions of Excel.




Leave a Reply

If you include a website address in the comment section, I cannot see your comment as it will be automatically deleted and will not be posted. Please refrain from including website addresses.