R-Squared Calculation in Linear Regression with Zero Intercept

May 8, 2023 JK Comments 0 Comment

Previously, I scanned wheat grains to obtain the area of each grain, and then measured the weight of each grain corresponding to its area in order to develop a model equation. The following regression demonstrates the relationship between grain area and weight.

# Data download
https://www.kaggle.com/datasets/agronomy4future/wheat-grain-area-vs-weight

#to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/wheat_grain_area_vs_weight.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))

# to analyze linear regression
model=lm(grain_weight ~ grain_area, data=dataA)
summary(model)

Call:
lm(formula = grain_weight ~ grain_area, data = dataA)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.1497  -0.6232  -0.4314   0.8931  10.9166 

Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)    
(Intercept)  -13.7155   0.4866      -28.19   <2e-16 ***
grain_area    3.3333    0.0266      125.32   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.965 on 1330 degrees of freedom
Multiple R-squared:  0.9219,	Adjusted R-squared:  0.9219 
F-statistic: 1.57e+04 on 1 and 1330 DF,  p-value: < 2.2e-16

I obtained the equation y = 3.3333x - 13.7155, where y is the grain weight (mg) and x is the grain area (mm²), using both Excel and R. However, this model predicts negative values of y for small values of x, which is unrealistic as it implies that the grain weight would become negative when the grain area decreases beyond a certain point. To address this issue, we can simply force the intercept to be zero.

In Excel, we can force the intercept to zero by selecting ‘Set Intercept as 0’ in the ‘Format Trendline’ dialog box.

I obtained a new regression model, y = 2.594x, with an R-squared value of 0.9938 after forcing the intercept to be zero. However, I have doubts about the accuracy of this regression model. Although the R-squared value is high, the residuals appear to be relatively large, which leads me to question the accuracy of the R-squared value provided by Excel. Therefore, I am uncertain whether this regression model with an R-squared value of 0.99 is truly valid.

Since I had doubts about the accuracy of the R-squared value provided by Excel, I attempted to force the intercept to zero using R using the following code.

#to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/wheat_grain_area_vs_weight.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))

    grain_area grain_weight
1        15.57        43.40
2        17.14        49.70
3        16.24        45.20
4         7.85        11.00
5        14.32        36.40
6        10.93        25.60
7        18.69        55.60
8        10.81        24.50
9        16.90        49.90
10       13.79        42.20
.
.
.

# to force intercept to 0
intercept_zero=lm (grain_weight ~ 0 + grain_area, data=dataA)
summary (intercept_zero)

Call:
lm(formula = grain_weight ~ 0 + grain_area, data = dataA)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.1734  -2.5865  -0.3858   1.8437  12.9090 


***
Coefficients:
            Estimate Std.  Error      t value   Pr(>|t|)    
grain_area  2.594003       0.005611   462.3     <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.746 on 1331 degrees of freedom
Multiple R-squared:  0.9938,	Adjusted R-squared:  0.9938 
F-statistic: 2.137e+05 on 1 and 1331 DF,  p-value: < 2.2e-16

This value is the same as Excel provided, but I’m pretty sure this R² is not correct. I’d like to know the correct R² when forcing intercept to zero to estimate the fitness of the regression model. However, it seems that I can’t obtain the correct R² using Excel or statistical programs.

So,, Let’s back to the basic. I’ll calculate R² by hand.

First, I’ll calculate R² before forcing intercept to 0.

1) Prediction value

First, in Excel, I’ll calculate the predicted values using the equation y = 3.3333x - 13.716

You can easily calculate prediction values using R.

model=lm(grain_weight ~ grain_area, data=dataA)
prediction_calculation=predict(model)
dataA$prediction_value=prediction_calculation

   grain_area  grain_weight  prediction_value
1  15.57       43.40         38.18340
2  17.14       49.70         43.41663
3  16.24       45.20         40.41669
4  7.85        11.00         12.45061
.
.
.

2) Data partitioning

Now, let’s divide data as Data = Fit + Error

This concept is based on the below simple linear regression model.

3) Sum of squares

Next, I’ll square each value and calculate the sum of squares for each case.

Then, we can calculate SST =SSR + SSE

However, there may be some differences in the calculations due to decimal issues because we used the model equation y=3.3333x - 13.7155 with values rounded to the 4th decimal. The actual values may not be exactly 3.3333 or 13.7155. I will perform the same calculations in R to compare the results.

y_mean=mean(dataA$grain_weight)

dataA$Data=dataA$grain_weight-y_mean
dataA$Fit=dataA$prediction_value-y_mean
dataA$Error=dataA$grain_weight-dataA$prediction_value

dataA$SST=(dataA$Data)^2
dataA$SSR=(dataA$Fit)^2
dataA$SSE=(dataA$Error)^2

sum(dataA$SST)
[1] 149735.9
sum(dataA$SSR)
[1] 138044.8
sum(dataA$SSE)
[1] 11691.09

sum(dataA$SST) - (sum(dataA$SSR) - sum(dataA$SSE))
[1] 0.000000

We can verify SST = SSR + SSE

4) R-squared

We can calculate R-squared using this formula.

Therefore, R² will be ≈ 0.9219 (=138047.88 / 149735.93). This is the same value that Excel or R provides.

Second, I’ll calculate R² after forcing intercept to 0

Now, I will calculate the R² value after forcing the intercept to zero.

This calculation is to obtain the same value as the R² value provided by both Excel and R.

The important thing is, when forcing intercept to 0, SST (Sum of Squares Total) was not calculated by Σ(yi - ȳ)² , but just Σ(yi)². Also, SSR (Sum of Squares due to regression) was not calculated by Σ(ŷi - ȳ)² , but Σ(ŷi)². Only SSE (Sum of Squared Error) was calculated by Σ(yi - ŷi)².

Therefore, the value of SST and SSR will be dramatically increased. The mistake would start from here. If we still think R² = SSR / SST, and calculate R² when forcing intercept to zero, it will be ≈ 0.9938 (= 2999131.15 / 3017806.25). This value is the same as Excel provided when forcing intercept to zero.

Sometimes, programs are wrong

When programs provide you with certain values, you usually accept them without any doubts. However, when forcing the intercept to zero, the calculation R² = SSR / SST is incorrect. Instead, when forcing the intercept to zero, R² should be calculated as shown below.

1 – SSE (when intercept is 0) / SST (when intercept exists)

Only the SSE (Sum of Squared Error) is calculated as Σ(yi - ŷi)², regardless of the presence of an intercept. However, SST (Sum of Squares Total) and SSR (Sum of Squares due to Regression) are not computed in the same way when the intercept is forced to zero. Therefore, we cannot use SST and SSR values obtained under a different modeling assumption to calculate R² with a zero intercept.

This equation was cross-checked with several statisticians.

https://stats.stackexchange.com/questions/495217/when-forcing-intercept-to-zero-how-r-squared-is-changed

If we calculate R² when forcing the intercept to zero using the equation I suggested above,

R² = 1 - 18675.90 / 149735.93 ≈ 0.875

R² will be 0.875

“Regression fitting is based on minimizing error. Therefore, if we artificially manipulate it, R² should be decreased. In other words, if we artificially manipulate the fitting and can obtain a better R², the original regression model is wrong!”

FYI

I found different language version of Excel provides different R². English Excel version calculated a wrong R² (0.9938) while Spanish Excel version calculated a correct R² (0.875). I don’t know why. I’m wondering how about other language versions of Excel.

When forcing the intercept to zero in a linear regression model, it is important not to just accept the R-squared provided by computer programs. I explains why it is essential to understand the principles behind the analysis (https://t.co/AeGS7qy43q)
— Jin.W.Kim (JK) (@el_trigo_JK) May 8, 2023

Agronomy4future

Stories about cereals and statistics (plus coding). We aim to develop open-source code for agronomy.

R-Squared Calculation in Linear Regression with Zero Intercept

May 8, 2023 JK Comments 0 Comment

First, I’ll calculate R² before forcing intercept to 0.

1) Prediction value

2) Data partitioning

3) Sum of squares

4) R-squared

Second, I’ll calculate R² after forcing intercept to 0

Sometimes, programs are wrong

“Regression fitting is based on minimizing error. Therefore, if we artificially manipulate it, R² should be decreased. In other words, if we artificially manipulate the fitting and can obtain a better R², the original regression model is wrong!”

FYI

Leave a Reply Cancel reply

First, I’ll calculate R2 before forcing intercept to 0.

1) Prediction value

2) Data partitioning

3) Sum of squares

4) R-squared

Second, I’ll calculate R2 after forcing intercept to 0

Sometimes, programs are wrong

“Regression fitting is based on minimizing error. Therefore, if we artificially manipulate it, R2 should be decreased. In other words, if we artificially manipulate the fitting and can obtain a better R2, the original regression model is wrong!”

FYI

Leave a Reply Cancel reply

First, I’ll calculate R² before forcing intercept to 0.

Second, I’ll calculate R² after forcing intercept to 0

“Regression fitting is based on minimizing error. Therefore, if we artificially manipulate it, R² should be decreased. In other words, if we artificially manipulate the fitting and can obtain a better R², the original regression model is wrong!”