[STAT Article] Mastering RMSE Calculation with Excel and R: A Comprehensive Guide

[STAT Article] Mastering RMSE Calculation with Excel and R: A Comprehensive Guide



When running statistical programs, you might come across RMSE (Root Mean Square Error). For instance, the table below displays RMSE values obtained from SAS, which indicate that it is approximately equal to 2.72

I am wondering how RMSE is calculated. The equation for RMSE is shown below.

First, calculate the difference between the estimated and observed values: (ŷi - yi), and then square the difference: (ŷi - yi)². Second, calculate the sum of squares: Σ(ŷi - yi)². Third, divide the sum of squares by the number of samples (n): Σ(ŷi - yi)²/n. Fourth, take the square root: √(Σ(ŷi - yi)²/n).

The calculation of the sum of squares, Σ(ŷi - yi)², is a similar concept to the sum of square error (SSE). When SSE is divided by the degrees of freedom, it becomes the mean squared error (MSE). Therefore, we can deduce that RMSE is the square root of MSE: RMSE = √MSE.

In SAS, MSE is 7.41946. Thus, I calculated the RMSE as √7.41946 ≈ 2.72, which is the same value that SAS provided. Therefore, we can easily calculate the RMSE as √MSE.

However, one question remains: How do we calculate MSE? While statistical programs automatically calculate it, understanding how MSE is calculated is essential to fully comprehend our data.



How to calculate MSE in simple linear regression?

[Step 1] Regression analysis

Here is one data (You can copy and paste the data below into Excel).

xy
19.5020.41
40.8037.43
45.2036.57
55.3050.34
24.9023.07
49.0044.98
50.0044.35
29.7029.27
32.3028.10
42.6040.89
54.2052.16
52.9049.04
64.3061.57
53.6052.30
34.1040.86
18.1023.18
66.7060.13
57.2051.02
37.1035.37
58.1054.46
67.2059.88
53.5045.96
63.0053.90
42.1039.33
34.9034.58
71.3061.35
23.0025.37

By performing a regression analysis on the dataset in Excel, you can obtain an equation that relates the dependent variable (y) to the independent variable (x): ŷ = 5.90 + 80309x. This equation can be used to predict y for any given value of x within the range of the dataset.


In R, the code would be like below.

# to generate data
x=c(19.5,40.8,45.2,55.3,24.9,49,50,29.7,32.3,42.6,54.2,52.9,64.3,53.6,34.1,18.1,66.7,57.2,37.1,58.1,67.2,53.5,63,42.1,34.9,71.3,23)
y=c(20.41,37.43,36.57,50.34,23.07,44.98,44.35,29.27,28.10,40.89,52.16,49.04,61.57,52.3,40.86,23.18,60.13,51.02,35.37,54.46,59.88,45.96,53.9,39.33,34.58,61.35,25.37)
dataA=data.frame (x,y)

#Linear regression
summary (lm (y ~ x, data=dataA))


[Step 2] Calculate estimated ŷ

As we have obtained ŷ = 5.90 + 0.80309x, we can calculate the estimated value (ŷ). For example, if we take x to be 19.50, the estimated value ŷ will be 21.57, which can be calculated as 5.90 + 0.80309 * 19.50.

xyŷ
19.5020.4121.57
40.8037.4338.68
45.2036.5742.21
55.3050.3450.32
24.9023.0725.91
49.0044.9845.26
50.0044.3546.06
29.7029.2729.76
32.3028.1031.85
42.6040.8940.12
54.2052.1649.44
52.9049.0448.39
64.3061.5757.55
53.6052.3048.96
34.1040.8633.30
18.1023.1820.45
66.7060.1359.48
57.2051.0251.85
37.1035.3735.70
58.1054.4652.57
67.2059.8859.88
53.5045.9648.88
63.0053.9056.50
42.1039.3337.72
34.9034.5833.94
71.3061.3563.17
23.0025.3724.38

In R, the code would be like below.

dataA$yi= 5.90973 + 0.80309*dataA$x
dataA


[Step 3] Calculate each sum of squares

First, let’s do data partitioning.

This partitioning is based on the below concept.

For more details, please see the below post.


Simple linear regression (5/5)- R_squared


The sum of squares values calculated by Excel match those provided by SAS


In R, we can also conduct data partitioning.

dataA$SST= round((dataA$y - mean(dataA$y))^2,digit=2)
dataA$SSR= round((dataA$yi - mean(dataA$y))^2,digit=2)
dataA$SSE= round((dataA$y - dataA$yi)^2,digit=2)
dataA
SST=sum(dataA$SST)  ≈ 4094.4
SSR=sum(dataA$SSR)  ≈ 3908.9
SSE=sum(dataA$SSE)  ≈ 185.5


[Step 4] Calculate Mean squared Error (MSE)

The sum of squares total (SST) can be partitioned into the sum of squares due to regression (SSR) and the sum of squares of errors (SSE).

SST (Total)= SSR (Model) + SSE (Error)

That is, 4094.4= 3908.9 + 185.5

The table below shows the sources of variance for regression.

SourceDegrees of FreedomSum of SquaresMean SquareF-ratiop-value
ModelpSSRMSR = SSR/pMSR/MSEp-value
Errorn-p-1SSEMSE = SSE/(n-p-1)
Totaln-1SST
Note: ‘p’ denotes the number of predictor variables in the model.

As we have already calculated SST, SSR, and SSE, we can easily compute their mean squared values by dividing each of them by the number of samples (n), which is 27 in our case. Based on this, we can construct the ANOVA table, which would look something like this.

MSE was calculated as 7.4 (=185.5 / (27 – 2)). It’s the same value as what SAS provided

In R, we can obtain MSE using below code.

anova (lm (y ~ x, data=dataA))

Then, RMSE will be √7.4 ≈ 2.72. It’s the same value as what SAS provide.



Wrap up!!

1) How to calculate sum of squares total (SST)?

(20.41 – 42.81)2 + (37.43 – 42.81)2 + … + (25.37 – 42.81)2 = 4094.4

First, we calculated the difference between each observed value (black dot in the below graph) and the mean of the observed values (blue line in the below graph), and calculated the sum of squares of each difference; Σ(yi - ȳ)2.



2) How to calculate Model (SSR)?

(21.56 – 42.81)2 + (38.67 – 42.81)2 + … + (24.37 – 42.81)2 = 3908.9

First, we calculated the difference between each estimated value (red line in the below graph) and the mean of the observed values (blue line in the below graph), and calculated the sum of squares of each difference; Σ(ŷi - ȳ)2.



3) How to calculate SSE?

(21.56 – 20.41)2 + (38.67 – 37.43)2 + … + (24.37 – 25.37)2 = 185.5

First, we calculated the difference between each estimated value (red line in the below graph) and each observed value (black dot in the below graph), and calculated the sum of squares of each difference; Σ(ŷi - yi)2.

That’s why SST = SSR + SSE

Eventually, MSE (also called Residuals) explains how far data points are dispersed from the regression line, and RMSE (=√MSE) is explained as the standard deviation (we put √ in variance) of the residuals.



How about calculating the differences between two variables without fitting?

Sometime, I see many people obtain RMSE, simply calculating the differences between two variables like below.

Then, they squared each difference(xi - yi)2, and calculated the sum of squares; Σ(xi - yi)2.

Then divided by number of samples (n); Σ(xi - yi)2 / n

i.e. 684.4 / 27 ≈ 25.35.

They say it would be MSE and RMSE will be √25.35 ≈ 5.04


In R, the calculation would be like below.

dataA$Difference= dataA$x- dataA$y
dataA$Difference_square= (dataA$x- DataA$y)^2
dataA
sqrt(sum(dataA$Difference_square)/length(dataA$y))
# 5.042012

Then RMSE will be ≈ 5.04

This value is different from what we calculated, 2.72. This would be an issue about how to interpret RMSE.

This raises an issue about how to interpret RMSE, but in order to understand this, we first need to understand the difference from √Σ(ŷi - yi)2/(n-2) to Σ(xi - yi)2/n.

The reason why I point out this issue is people simply believe what the statistical software provides. For example, R provides a simple code to calculate RMSE.

#install.packages("Metrics")
library(Metrics)
rmse= rmse(dataA$x, dataA$y)  # 5.042012

Now it’s 5.04. This is about Σ(xi - yi)2/n, not √Σ(ŷi - yi)2/(n-2). If you just accept this RMSE value without any doubts, you never explain why this value is different from 2.72 which is calculated as √Σ(ŷi - yi)2/(n-2)


Many people tend to overlook this issue, assuming that there must be an error in the calculation. However, it is important to note that the discrepancy is not due to an error, but rather a difference in the calculation methods. It is important to not simply rely on the output provided by statistical programs, but rather to take the time to understand the underlying principles and calculations that are used. By doing so, we can gain a deeper understanding of our data and feel more confident in your results.



Comments are closed.