An Introduction to Residual Analysis in Simple Linear Regression Models

Sample No.    x      y
1             10     30
2             20     40
3             30     50
4             40     80
5             50     90
6             60     100
7             70     120

Here is a dataset for analyzing the relationship between x and y and obtaining the model equation, y = β0 + β1x. Statistical programs can return the results in about 10 seconds, but understanding the principles behind the calculations matters more than simply knowing how to run the program. To gain a deeper understanding of the principles of simple linear regression, the following posts may be helpful:


Simple linear regression (1/5)- correlation and covariance
Simple linear regression (2/5)- slope and intercept of linear regression model
Simple linear regression (3/5)- standard error of slope and intercept
Simple linear regression (4/5)- t value on the slope and intercept


I checked the results using SAS and obtained the following equation: y= 11.43 + 1.54x

data mydata;
input x y;
datalines;
10 30
20 40
30 50
40 80
50 90
60 100
70 120
;
run;

proc reg data=mydata alpha=0.05;
	model y=x;
run;
quit;
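
The coefficients reported by SAS can also be cross-checked by hand. The following is a minimal sketch in R, assuming the usual least-squares formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x); the names sxy, sxx, b0 and b1 are chosen here for illustration and do not come from the SAS output.

```r
# Same data as in the SAS example
x <- c(10, 20, 30, 40, 50, 60, 70)
y <- c(30, 40, 50, 80, 90, 100, 120)

# Sum of cross-products and sum of squares around the means
sxy <- sum((x - mean(x)) * (y - mean(y)))
sxx <- sum((x - mean(x))^2)

# Least-squares slope and intercept
b1 <- sxy / sxx
b0 <- mean(y) - b1 * mean(x)

# Rounded to two decimals for comparison with y = 11.43 + 1.54x
round(c(intercept = b0, slope = b1), 2)
```

Rounded to two decimals, this should reproduce the intercept of 11.43 and the slope of 1.54.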

When analyzing your data using statistical programs, the program will likely provide a residual plot, as shown below.

The graph above plots the residuals against the predicted values. Residuals are often loosely called errors; strictly speaking, they are estimates of the model's error terms. To fully understand what predicted values and residuals are, it helps to calculate them manually.



What are residuals?

Residuals represent the differences between actual and predicted values, and can be calculated using the following formula:

Residual = Observed value - Predicted value

Once the residuals are computed, residual analysis is straightforward.

We can obtain the same model equation, y = 11.43 + 1.54x, using Excel, just as we did with SAS. The intercept, 11.43, is the predicted y when x = 0, and the slope, 1.54, means that the predicted y increases by 1.54 for every one-unit increase in x. With this model equation, we can predict y values. Let us compute these predicted values. From this point forward, actual values will be referred to as y, while predicted values will be referred to as ŷ, computed from the model equation: ŷ = 11.43 + 1.54x.

As mentioned earlier, residuals are the differences between actual values and predicted values, which means y - ŷ. I have calculated the residuals, which are the vertical distances between the black dots (actual values) and the blue line (predicted values). If y = ŷ for every sample, all black dots fall on the blue line and all residuals are zero.
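
As a worked example, here is the calculation for the first sample (x = 10, y = 30), using the rounded coefficients from the model equation. The subscript notation and the symbol e for the residual are conventions adopted here for illustration.

```latex
\hat{y}_1 = 11.43 + 1.54 \times 10 = 26.83, \qquad
e_1 = y_1 - \hat{y}_1 = 30 - 26.83 = 3.17
```

Software works with the unrounded coefficients, so the residual it reports for this sample is about 3.21 rather than 3.17.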


How to calculate residuals in R?

x=c(10,20,30,40,50,60,70)
y=c(30,40,50,80,90,100,120)
dataA=data.frame(x,y)

First, I generated the same data in R. Now I will calculate residuals by first calculating predicted values.

library(dplyr)
library(parsnip)
linear_model=linear_reg() %>% fit(y~x,data=dataA)

dataA$predict=predict(linear_model,dataA,type="raw")
dataA

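A simpler alternative is base R's lm(). Passing data=dataA to lm() makes the model use the columns of dataA, and predict() without new data returns the fitted values.

dataA$predict=predict(lm(y~x, data=dataA))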

Next, I will calculate the residuals by subtracting the predicted values from the actual values. As mentioned earlier, residuals are the differences between the actual values and predicted values, so it’s a straightforward calculation.

dataA$residual= dataA$y - dataA$predict
dataA

R can also return the residuals directly with residuals().

dataA$residual=residuals(lm(y~x, data=dataA))
dataA
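
To confirm that the manual subtraction and residuals() give the same numbers, the two can be compared directly. This is a small sketch on the same data; manual_residual is a name introduced here for illustration.

```r
x <- c(10, 20, 30, 40, 50, 60, 70)
y <- c(30, 40, 50, 80, 90, 100, 120)
dataA <- data.frame(x, y)

model <- lm(y ~ x, data = dataA)

# Residuals by hand: observed value minus predicted value
manual_residual <- dataA$y - fitted(model)

# Both approaches should agree up to floating-point error
all.equal(unname(manual_residual), unname(residuals(model)))

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero
sum(residuals(model))
```

The second check is a useful sanity test: if the residuals of a model with an intercept do not sum to roughly zero, something went wrong in the calculation.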


Residual plot

Let’s create a residual plot using Excel, with the predicted values on the x-axis and the residuals on the y-axis.

We can then compare this plot with the one generated by SAS. Note that the residual plot generated by SAS may appear vertically flipped. I also checked the residual plot using JMP, and it matched the one generated by Excel.

Let’s draw a residual plot using R.

# to generate data
x=c(10,20,30,40,50,60,70)
y=c(30,40,50,80,90,100,120)
dataA=data.frame(x,y)

# to set up model and to calculate residuals
model=lm(y~x,data=dataA)
res=residuals(model)

# to draw graph (margins leave room for the axis labels)
par(mar = c(4, 4, 1, 1))
plot(fitted(model), res, xlab="Predicted value", ylab="Residuals")
abline(h=0)

If you want to draw a more elaborate graph, you can use ggplot().

# to generate data
x=c(10,20,30,40,50,60,70)
y=c(30,40,50,80,90,100,120)
dataA=data.frame(x,y)

# to calculate predicted values and residuals
model=lm(y~x, data=dataA)
dataA$predict=predict(model)
dataA$residual=residuals(model)

# to draw graph
# optional, Windows only: call windows(width=5.5, height=5) before ggplot()
# to open a 5.5 x 5 inch plot window
library(ggplot2)
ggplot(data=dataA, aes(x=predict, y=residual))+
  geom_point(alpha=0.5, size=4) +
  geom_hline(yintercept=0, linetype="solid", color = "Blue") +
  labs(y="Residuals", x="Predicted value") +
  theme_grey(base_size=18, base_family="serif")+
  theme(legend.position="bottom",
        legend.title=element_blank(),
        legend.key=element_rect(color=alpha("grey",.05), fill=alpha("grey",.05)),
        legend.background= element_rect(fill=alpha("grey",.05)),
        axis.line=element_line(linewidth=0.5, colour="black"))

What does this residual plot indicate?

If the dots in the residual plot are randomly dispersed around the zero line, a linear model is a good fit. If the dots instead show a specific pattern, such as a curve, it would be better to consider a non-linear model.

source: https://stattrek.com/regression/residual-analysis.aspx

For example, here is another dataset.

x=c(10,20,30,40,50,60,70)
y=c(30,40,50,70,60,30,20)
dataA=data.frame(x,y)

The residual plot for this dataset is shown above. The dots follow a clear pattern, so in this case a non-linear model would be more suitable. Let’s check whether that is true.
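
The residual plot for this second dataset can be reproduced with a short base R sketch. The names dataB and model_b are chosen here for illustration so that the earlier dataA is not overwritten.

```r
x <- c(10, 20, 30, 40, 50, 60, 70)
y <- c(30, 40, 50, 70, 60, 30, 20)
dataB <- data.frame(x, y)

# Straight-line fit to the second dataset
model_b <- lm(y ~ x, data = dataB)

# Residuals against predicted values, with a reference line at zero
plot(fitted(model_b), residuals(model_b),
     xlab = "Predicted value", ylab = "Residuals")
abline(h = 0)

# Signs of the residuals in order of x: a run of negatives, then
# positives, then negatives is the kind of pattern that suggests curvature
sign(residuals(model_b))
```

The residuals are negative at both ends of the x range and positive in the middle, which is the systematic pattern a straight line cannot capture.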

# Linear model
summary(lm(y~x, data=dataA))

# Non-linear model (Quadratic model)
summary(lm(y~poly(x,2, raw=TRUE), data=dataA))

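The quadratic model explains far more of the variation in y than the straight line, so the non-linear model is the better fit for this dataset.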
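
One way to put numbers on that comparison is to extract R-squared from each fit and run an F-test on the two nested models. This is a minimal sketch; linear_fit and quadratic_fit are names introduced here, and the F-test via anova() is an addition, not part of the original analysis.

```r
x <- c(10, 20, 30, 40, 50, 60, 70)
y <- c(30, 40, 50, 70, 60, 30, 20)
dataA <- data.frame(x, y)

linear_fit    <- lm(y ~ x, data = dataA)
quadratic_fit <- lm(y ~ poly(x, 2, raw = TRUE), data = dataA)

# Proportion of the variation in y explained by each model
c(linear    = summary(linear_fit)$r.squared,
  quadratic = summary(quadratic_fit)$r.squared)

# F-test for whether adding the quadratic term improves on the straight line
anova(linear_fit, quadratic_fit)
```

For this dataset, R-squared comes out at roughly 0.03 for the linear model and roughly 0.82 for the quadratic model.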

full code: https://github.com/agronomy4future/r_code/blob/main/An_Introduction_to_Residual_Analysis_in_Simple_Linear_Regression_Models.ipynb

