[STAT Article] Easy Guide to Cook’s Distance Calculation Using Excel and R
I have 1,000 measurements of the length (mm) and weight (mg) of wheat grains. With this data, I want to analyze the relationship between grain length and grain weight and propose a regression model that can predict grain weight. First, I will draw a graph to visualize the data.
If you are new to R, you can copy and paste the following code into your R script window to obtain the same graph as shown above.
#to upload data
install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/cooks_distance_practical_data.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
#to draw a graph
install.packages("ggplot2")
library(ggplot2)
ggplot(data=dataA, aes(x=grain_length, y=grain_weight))+
stat_smooth(method='lm', linetype=1, se=FALSE, formula=y~x,
linewidth=0.5, color="dark red") +
geom_point(size=5, shape=21, fill="orange", color="grey15")+
scale_x_continuous(breaks=seq(0,10,2),limits=c(0,10))+
scale_y_continuous(breaks=seq(0,100,20),limits=c(0,100))+
labs(x="Grain length (mm)", y="Grain weight (mg)") +
theme_grey(base_size=18, base_family="serif")+
theme(axis.line=element_line(linewidth=0.5, colour="black"))+
windows(width=5.5, height=5)
Here, you can observe a few outliers that may require removal. Rather than relying on intuition, we will attempt to identify statistically influential outliers using Cook’s Distance.
Cook’s Distance is a method used to identify influential outliers in regression analysis. In other words, it helps identify data points that have a disproportionately large influence on the regression model. The measurement combines each observation’s leverage and residual: the higher the leverage and the residual, the higher the Cook’s Distance. Therefore, if we can calculate the leverage and the residual, we can also calculate Cook’s Distance ourselves.
In this post, I will explain how to manually calculate Cook’s Distance. First, let’s download the data. The data is available on Kaggle, so you can download it from there.
Data download
https://www.kaggle.com/datasets/agronomy4future/wheat-grain-length-vs-grain-weight
To download the data, you need to sign up for Kaggle. For those who find the sign-up process cumbersome, there is an alternative: visit the address that starts with https://raw.githubusercontent.com/~ in the R code above. Once there, you can copy and paste the entire dataset into Excel and split the text into columns to obtain the same dataset.
sample_no grain_length grain_weight
1 6.50 44.33
2 6.53 60.45
3 9.00 49.05
4 6.65 40.24
5 6.20 42.49
6 6.55 95.00
7 6.78 51.01
8 5.74 30.84
9 6.63 43.80
10 8.90 35.42
.
.
1000 5.05 43.85
Cook’s Distance formula
First, let’s examine the equation for Cook’s distance.
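In the form used throughout this post, Cook’s Distance for the i-th observation is

\[ D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} \]

where r_i is the internally studentized residual, h_ii is the leverage of the observation, and p is the number of regression parameters (here, the intercept and the slope, so p = 2).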
Most people give up on understanding the concept at this point and simply rely on statistical software. However, once you understand the principle, it is nothing more than a middle-school math problem. Now, let’s calculate it step by step according to this equation.
[Step 1] Calculating sample size, mean, and standard deviation
The first task is to calculate the mean and standard deviation for the x and y variables, and to determine the total number of data points (n). They can be calculated in R as follows.
> length(dataA$grain_length)
[1] 1000
> mean(dataA$grain_length)
[1] 6.23063
> mean(dataA$grain_weight)
[1] 41.20036
> sd(dataA$grain_length)
[1] 0.5037456
> sd(dataA$grain_weight)
[1] 9.298137
[Step 2] Calculating residuals
The second task is to calculate the residuals. Please refer to the post below for the concept of residuals.
□ An Introduction to Residual Analysis in Simple Linear Regression Models
I have calculated the residuals as follows.
Residuals are the difference between the actual value and the predicted value. For example, the predicted value 44.8048 is calculated as 44.8048 = 13.3804 * B5 - 42.1678 (where B5 holds the first grain length, 6.50), and the residual is simply the difference between the actual and predicted values: -0.475 = 44.33 - 44.8048.
Note that 13.3804 and -42.1678 are the slope and intercept of the regression of grain weight on grain length. That is, we calculate the predicted y values from the regression equation y = -42.1678 + 13.3804x.
summary(lm(grain_weight~grain_length, data=dataA))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.1678 2.5159 -16.76 <2e-16 ***
grain_length 13.3804 0.4025 33.24 <2e-16 ***
We can easily calculate the residuals using R.
#to upload data
install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/cooks_distance_practical_data.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
# to identify model
model=lm(grain_weight~grain_length, data=dataA)
# to calculate residuals
residual_values=residuals(model)
dataA$residual=residual_values
sample_no grain_length grain_weight residual
1 6.50 44.33 -0.47462983
2 6.53 60.45 15.24395911
3 9.00 49.05 -29.20555082
4 6.65 40.24 -6.57168509
5 6.20 42.49 1.69948068
.
.
.
[Step 3] Calculating leverage
Now we need to calculate the leverage. Leverage measures how far an observation’s predictor value is from the mean of the predictor variable. Observations with high leverage can have a large impact on the regression model and can potentially be outliers. This value is denoted h_ii and enters the Cook’s Distance equation through the term h_ii / (1 - h_ii).
Furthermore, the leverage equation is as follows.
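In symbols, using the sample size n, the mean x̄, and the standard deviation s_x of the predictor from Step 1,

\[ h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1)\,s_x^2} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2} \]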
We already know the sample size (n=1000) and have already calculated the mean (x̄=6.23) and standard deviation (s_x=0.50) of x (grain length). We will calculate the leverage according to this equation.
0.00128623 = 1/($B$1) + 1/($B$1-1)*((B5-$B$2)/$B$3)^2
The Excel formula used to obtain 0.00128623 for the first observation is shown above, where $B$1 holds n, $B$2 the mean, $B$3 the standard deviation of grain length, and B5 the first grain length value; try applying it to the remaining rows yourself.
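If you would rather compute the leverage in R, here is a minimal sketch (reusing dataA and model from the code above) that applies the same formula and cross-checks it against R’s built-in hat values.
# to calculate leverage manually from the formula above
n=length(dataA$grain_length)
x_bar=mean(dataA$grain_length)
s_x=sd(dataA$grain_length)
dataA$leverage= 1/n + (dataA$grain_length - x_bar)^2 / ((n-1)*s_x^2)
# to cross-check against R's built-in hat values
head(dataA$leverage)
head(hatvalues(model))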
[Step 4] Calculating internal studentized residuals
The next step is to calculate the internal studentized residuals. In the Cook’s Distance equation, this quantity appears as the squared term r_i², which is built from the following pieces:
- e_i²: the square of the individual residual
- Σ e_j² / (n - p): the variance of the residuals
Here n is the number of observations and p is the number of regression parameters, i.e. the intercept and the slope for grain length. In our data there are 1,000 observations, so n = 1000, and p = 2.
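Putting these pieces together, the squared internally studentized residual can be written as

\[ r_i^2 = \frac{e_i^2}{\left(\dfrac{\sum_j e_j^2}{n-p}\right)(1 - h_{ii})} = \frac{e_i^2}{MSE\,(1 - h_{ii})} \]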
Since we have already calculated the residual (e_i) and the leverage (h_ii), we only need to find the variance of the residuals.
What is the variance of the residuals (or errors)? We call it MSE (Mean squared error).
We can easily obtain the MSE value from statistical software.
model=lm(grain_weight~grain_length, data=dataA)
anova(model)
Analysis of Variance Table
Response: grain_weight
Df Sum Sq Mean Sq F value Pr(>F)
grain_length 1 45386 45386 1105.2 < 2.2e-16 ***
Residuals 998 40983 41
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
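From this table, the MSE can also be computed by hand as the residual sum of squares divided by its degrees of freedom:

\[ MSE = \frac{\sum_j e_j^2}{n-p} = \frac{40983}{998} \approx 41.07 \]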
When conducting regression analysis with Excel’s Data Analysis tool, MSE can be found in the ANOVA section of the output, as the MS value for the Residual row.
Please refer to the post below for the concept of MSE in regression analysis.
□ Mastering RMSE Calculation with Excel and R: A Comprehensive Guide
□ Simple linear regression (5/5)- R_squared
If you find regression analysis tedious, you can calculate MSE directly using the LINEST() function in Excel.
=LINEST(range of y, range of x, TRUE, TRUE)
If you use the function as shown above, the data below will be automatically generated, and the values are explained on the right side.
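For reference, when the last (stats) argument is TRUE, LINEST() returns a 5 × 2 block of values laid out as follows:
- row 1: slope, intercept
- row 2: standard error of the slope, standard error of the intercept
- row 3: R², standard error of the y estimate (the RMSE)
- row 4: F statistic, residual degrees of freedom
- row 5: regression sum of squares, residual sum of squares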
The Root Mean Squared Error (RMSE) value is 6.408. Since
RMSE = √MSE,
squaring this value gives the MSE: 6.408 * 6.408 ≈ 41.065.
However, one issue arises here. In the Cook’s Distance equation, the studentized residual enters as a square, so negative residuals become positive once squared. Reporting a positive internally studentized residual for an observation whose residual is actually negative would misrepresent that observation, so we want to preserve the sign of each residual when recording the studentized residuals. The solution is to take the square root of the squared expression above while keeping the sign of the residual.
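Doing so gives the internally studentized residual in the form

\[ r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}} \]

which is exactly what the Excel formula below computes.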
Then, internal studentized residuals will be calculated as follows. Please refer to the Excel function example provided below.
-0.07414048 = E5 / (($E$2*SQRT(1-F5)))
The Excel formula used to obtain -0.07414048 is shown above, where E5 holds the residual, $E$2 the RMSE (6.408), and F5 the leverage; try applying it to the remaining rows yourself.
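These values can also be cross-checked in R: for a linear model, the built-in rstandard() function returns exactly these internally studentized (standardized) residuals.
# internally studentized residuals straight from base R
model=lm(grain_weight~grain_length, data=dataA)
head(rstandard(model))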
[Step 5] Calculating Cook’s Distance
We have calculated all the values needed for the Cook’s Distance equation. Now, let’s simply calculate all the values according to the equation. Please refer to the Excel function example below.
0.00000354 = 1/2*G5^2*(F5/(1-F5))
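If you prefer to carry out the same calculation in R rather than Excel, here is a minimal sketch that reproduces Steps 2 through 5 in one pass (assuming dataA has been uploaded as in the code above):
# Steps 2-5 of the manual Cook's Distance calculation in R
model=lm(grain_weight~grain_length, data=dataA)
e=residuals(model)                        # Step 2: residuals
n=nrow(dataA); p=2                        # sample size and number of model parameters
x=dataA$grain_length
h=1/n + (x-mean(x))^2/((n-1)*var(x))      # Step 3: leverage
MSE=sum(e^2)/(n-p)                        # variance of residuals
r=e/sqrt(MSE*(1-h))                       # Step 4: internally studentized residuals
D=(r^2/p)*(h/(1-h))                       # Step 5: Cook's Distance
head(D)
The first few values should match the Cook’s Distance column produced by cooks.distance() in the next section.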
We have now calculated Cook’s Distance according to its equation. Once you understand the principle, the Cook’s Distance equation is not a difficult concept: you simply compute each component and combine them according to the formula. It often seems difficult only because the underlying principle is unclear. Now let’s check whether the values we calculated are correct using a statistical program.
Cook’s Distance using R
#to upload data
install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/cooks_distance_practical_data.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
Let’s upload the data to your R script window by copying and pasting the code above. Then, let’s create a simple linear regression model and execute the cooks.distance() command to check whether the Cook’s Distance values we calculated are correct.
model=lm(grain_weight~grain_length, data=dataA)
cooksD=cooks.distance(model)
Then, R calculates all the values we computed manually earlier in one go. Let’s add these values to dataA.
dataA$Cooks_distance=cooksD
sample_no grain_length grain_weight Cooks_distance
1 6.50 44.33 3.537089e-06
2 6.53 60.45 3.840086e-03
3 9.00 49.05 3.458660e-01
4 6.65 40.24 8.936700e-04
5 6.20 42.49 3.536792e-05
.
.
.
And now we’ll export this data as an Excel file.
install.packages("writexl")
library(writexl)
write_xlsx(dataA,"C:/Users/Desktop/dataA.xlsx")
Check the file path on your own PC.
If we increase the number of decimal places, there is a very slight difference between the values calculated in Excel and in R. The reason is that when we calculated the residuals in Excel, we used the model equation y = 13.3804x - 42.1678 to obtain the predicted values and then computed the residuals from them. The coefficients 13.3804 and -42.1678 are actually 13.3804xxxxx and -42.1678xxxxx, rounded to four decimal places, and this rounding causes the slight difference.
The difference is negligible. However, if you still want to obtain exactly the same values as the Cook’s Distance provided by R, you can use the residuals calculated in R. When explaining how to calculate residuals earlier, I introduced a method to calculate them in R.
# to calculate residuals
residual_values=residuals(model)
dataA$residual=residual_values
sample_no grain_length grain_weight residual
1 6.50 44.33 -0.47462983
2 6.53 60.45 15.24395911
3 9.00 49.05 -29.20555082
4 6.65 40.24 -6.57168509
5 6.20 42.49 1.69948068
.
.
.
If we calculate Cook’s Distance again manually using this residual value, the value would be as follows.
When we used R’s residuals in the manual calculation, the resulting values were exactly the same as those returned by cooks.distance(). The cooks.distance() function performs all the steps I explained earlier in one go. However, if you use cooks.distance() without understanding the underlying statistical concepts, you will not be able to comprehend the meaning behind the code.
How to interpret Cook’s Distance?
Now, let’s interpret this Cook’s Distance value. We will execute the following code in R:
install.packages("olsrr")
library(olsrr)
ols_plot_cooksd_bar(model) +
windows(width=5.5, height=5)
The Cook’s Distance values provide the information needed to identify outliers. The bar plot produced by the code above marks the observations whose Cook’s Distance exceeds the threshold. The threshold for flagging outliers is 4/n, where n is the number of observations; in our case, 4/1000 = 0.004.
Filtering for Cook’s Distance values greater than 0.004 in Excel yields 25 observations. Statistically, these are influential outliers. If these outliers are removed in Excel, the graph changes as follows.
Let’s try removing these outliers in R. Please run the following code.
influential=cooksD[(cooksD>(4/length(dataA$grain_weight)))]
names_of_influential=names(influential)
"3" "6" "10" "12" "18" "24" "48" "287" "288" "291" "325" "334" "381" "576" "616" "618" "634" "697" "717" "726" "766" "768" "888" "999" "1000"
First, I set the threshold to 4/n and assigned the observations exceeding it to the variable “influential”. Then, I extracted their row names and assigned them to the variable “names_of_influential”. If you compare the sample numbers selected by R with the sample numbers filtered in Excel above, they are the same.
Now, I will extract the values corresponding to these outlier values.
outliers=dataA[names_of_influential,]
sample_no grain_length grain_weight residual
3 9.00 49.05 -29.205551
6 6.55 95.00 49.526352
10 8.90 35.42 -41.497514
12 8.00 42.00 -22.875182
18 6.09 88.00 48.681321
24 4.50 40.56 22.516107
48 6.59 63.21 17.201137
287 7.31 63.44 7.797272
288 6.82 62.45 13.363652
291 6.95 61.65 10.824204
325 6.85 63.53 14.042241
334 4.05 20.76 8.737273
381 7.32 63.57 7.793468
576 4.36 22.38 6.209359
616 4.91 38.83 15.300156
618 4.91 38.83 15.300156
634 6.57 61.02 15.278744
697 6.67 32.89 -14.189292
717 6.79 35.42 -13.264937
726 6.68 33.69 -13.523096
766 6.75 35.45 -12.699722
768 6.97 40.67 -10.423403
888 4.26 21.27 6.437395
999 5.05 43.85 18.446904
1000 5.05 43.85 18.446904
Next, we will exclude these values from the original data.
install.packages("dplyr")
library(dplyr)
dataA_without_outliers=dataA %>% anti_join(outliers)
str(dataA_without_outliers)
'data.frame': 975 obs. of 4 variables:
You can confirm that the number of data points is now 975; in other words, the 25 outliers have been removed. Now let’s perform the regression analysis again.
model2=lm(grain_weight~grain_length, data=dataA_without_outliers)
summary(model2)
Call:
lm(formula = grain_weight ~ grain_length, data = dataA_without_outliers)
Residuals:
Min 1Q Median 3Q Max
-15.4973 -3.5808 0.4676 3.6968 16.6774
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -53.3694 2.3440 -22.77 <2e-16 ***
grain_length 15.1561 0.3753 40.38 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.421 on 973 degrees of freedom
Multiple R-squared: 0.6263, Adjusted R-squared: 0.6259
F-statistic: 1631 on 1 and 973 DF, p-value: < 2.2e-16
After removing the outliers, the model equation has changed and R² has slightly increased. You can see the comparison of the graphs in R below.