The Best Linear Unbiased Estimator (BLUE): Step-by-Step Guide using R (with AllInOne Package)
In this session, I will introduce the method of calculating the Best Linear Unbiased Estimator (BLUE). Instead of simply listing formulas as many websites do to explain BLUE, this post aims to help readers understand the process of calculating BLUE with an actual dataset using R.
I have the following data.
location | sulphur (kg/ha) | block | yield |
Cordoba | 0 | 1 | 750 |
Cordoba | 24 | 1 | 1250 |
Cordoba | 36 | 1 | 1550 |
Cordoba | 48 | 1 | 1120 |
Cordoba | 0 | 2 | 780 |
Cordoba | 24 | 2 | 1280 |
Cordoba | 36 | 2 | 1520 |
Cordoba | 48 | 2 | 1130 |
Cordoba | 0 | 3 | 790 |
Cordoba | 24 | 3 | 1300 |
Cordoba | 36 | 3 | 1525 |
Cordoba | 48 | 3 | 1140 |
Cordoba | 0 | 4 | 800 |
Cordoba | 24 | 4 | 1250 |
Cordoba | 36 | 4 | 1555 |
Cordoba | 48 | 4 | 1130 |
Granada | 0 | 1 | 1070 |
Granada | 24 | 1 | 1490 |
Granada | 36 | 1 | 1790 |
Granada | 48 | 1 | 1345 |
Granada | 0 | 2 | 1066 |
Granada | 24 | 2 | 1510 |
Granada | 36 | 2 | 1700 |
Granada | 48 | 2 | 1355 |
Granada | 0 | 3 | 948 |
Granada | 24 | 3 | 1560 |
Granada | 36 | 3 | 1830 |
Granada | 48 | 3 | 1368 |
Granada | 0 | 4 | 960 |
Granada | 24 | 4 | 1550 |
Granada | 36 | 4 | 1866 |
Granada | 48 | 4 | 1356 |
Leon | 0 | 1 | 800 |
Leon | 24 | 1 | 1250 |
Leon | 36 | 1 | 1550 |
Leon | 48 | 1 | 910 |
Leon | 0 | 2 | 780 |
Leon | 24 | 2 | 1180 |
Leon | 36 | 2 | 1540 |
Leon | 48 | 2 | 890 |
Leon | 0 | 3 | 663.6 |
Leon | 24 | 3 | 1260 |
Leon | 36 | 3 | 1480 |
Leon | 48 | 3 | 900 |
Leon | 0 | 4 | 750 |
Leon | 24 | 4 | 1195 |
Leon | 36 | 4 | 1525 |
Leon | 48 | 4 | 900 |
The data mentioned above has been uploaded to my GitHub.
https://github.com/agronomy4future/raw_data_practice/blob/main/BLUE_Practice.csv
Let’s consider a dataset that was collected by investigating crop yields in three regions of Spain based on the treatment of sulfur fertilizer. The experiment was conducted with four repetitions as blocks. In this case, I have set the location as a fixed factor, and sulfur and block as random factors. When both fixed and random factors are present in the same model, the Best Linear Unbiased Estimator (BLUE) is commonly used. If all factors are considered as random factors, then Best Linear Unbiased Prediction (BLUP) would be used.
I will now demonstrate how to calculate the Best Linear Unbiased Estimator (BLUE) using R.
To calculate BLUE using R
If you have understood the above calculations, fron now on, let’s calculate BLUE using R.
First, we will load the data.”
library (readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/BLUE_Practice.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
Now we will configure the model. Location is set as a fixed factor, while treatment (sulphur) and block are set as a random factor. The model configuration is as follows.
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)
Now let’s calculate the prediction values that were previously calculated in Excel.
predict(Blue)
Now let’s add these prediction values into the existing data.
dataA$prediction=predict(Blue)
dataA
Now let’s calculate the average of these prediction values.
data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(prediction),
sd=sd(prediction),
n=length(prediction),
se=sd/sqrt(n)))
Here, the mean value is the BLUE value. Please compare it with the value calculated using Excel. It’s the same. In summary, the code is as follows.
# to upload data
library (readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/BLUE_Practice.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
# to identify model
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)
# to calculate prediction values
predict(Blue)
dataA$prediction=predict(Blue)
# to calculate BLUE
library (dplyr)
data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(prediction),
sd=sd(prediction),
n=length(prediction),
se=sd/sqrt(n)))
AllInOne package
I will now introduce a method to calculate BLUE in R without writing code, using the AllInOne package developed by Dr. Mohsen Yoosefzadeh Najafabadi at University of Guelph. This package is an excellent tool for calculating BLUE and BLUP without any coding.
# to install package
remotes::install_github('MohsenYN/AllInOne')
# to run the program
library(AllInOne)
run_app()
When you run the above code, a window like the following will appear.
Go to the Dataset
tab and click on Upload Dataset
to upload the data.
Next, click on Select Variables
to choose the independent variables (Location, sulphur, block) and the dependent variable (yield)
Select Variance Analysis
under Data Pre-Processing
, and then choose Mixed Analysis
.”
Then, select ‘yield’ as the dependent variable, and choose ‘location’ as the fixed factor, and select ‘sulphur’ and ‘block’ as the random factors.”
When you click OK, a new window will appear. You can first check the Summary of Model.”
Summary of Model(yield)
Linear mixed model fit by REML ['lmerMod']
Formula: yield ~ 0 + (1 | sulphur) + (1 | block) + Location
Data: data
REML criterion at convergence: 515.3
Scaled residuals:
Min 1Q Median 3Q Max
-1.94477 -0.65178 0.03198 0.77110 1.58621
Random effects:
Groups Name Variance Std.Dev.
sulphur (Intercept) 1.067e+05 3.266e+02
block (Intercept) 5.737e-13 7.575e-07
Residual 3.061e+03 5.532e+01
Number of obs: 48, groups: sulphur, 4; block, 4
Fixed effects:
Estimate Std. Error t value
LocationCórdoba 1179.4 163.9 7.196
LocationGranada 1422.8 163.9 8.680
LocationLeon 1098.4 163.9 6.701
Correlation of Fixed Effects:
LctnCr LctnGr
LocatinGrnd 0.993
LocationLen 0.993 0.993
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
As you continue to navigate through the pages, you will be able to see the BLUE value.
What does this BLUE value mean?
By calculating the BLUE value using R, we can infer one interesting fact here.
Since the standard deviations of all mean values are the same and the sample sizes are equal at 16 each, the standard errors are also the same. This is because the residuals have been removed.
Let’s calculate the mean for the actual data
library (dplyr)
data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(yield),
sd=sd(yield),
n=length(yield),
se=sd/sqrt(n)))
If you look at the table above, you can see that there is no difference in the mean values of the yield. However, in this case, the standard deviation is different for each location, resulting in different standard errors. So, even though the mean values are the same, the concept of BLUE comes from the differences in the data that make up this mean. Let’s compare the actual yield values with the yield values after applying BLUE to the data.
location | sulphur (kg/ha) | block | Actual yield | predicted yield by R |
Cordoba | 0 | 1 | 750 | 793.3 |
Cordoba | 24 | 1 | 1250 | 1285.2 |
Cordoba | 36 | 1 | 1550 | 1564.2 |
Cordoba | 48 | 1 | 1120 | 1074.8 |
Cordoba | 0 | 2 | 780 | 793.3 |
Cordoba | 24 | 2 | 1280 | 1285.2 |
Cordoba | 36 | 2 | 1520 | 1564.2 |
Cordoba | 48 | 2 | 1130 | 1074.8 |
Cordoba | 0 | 3 | 790 | 793.3 |
Cordoba | 24 | 3 | 1300 | 1285.2 |
Cordoba | 36 | 3 | 1525 | 1564.2 |
Cordoba | 48 | 3 | 1140 | 1074.8 |
Cordoba | 0 | 4 | 800 | 793.3 |
Cordoba | 24 | 4 | 1250 | 1285.2 |
Cordoba | 36 | 4 | 1555 | 1564.2 |
Cordoba | 48 | 4 | 1130 | 1074.8 |
Granada | 0 | 1 | 1070 | 1036.6 |
Granada | 24 | 1 | 1490 | 1528.6 |
Granada | 36 | 1 | 1790 | 1807.6 |
Granada | 48 | 1 | 1345 | 1318.2 |
Granada | 0 | 2 | 1066 | 1036.6 |
Granada | 24 | 2 | 1510 | 1528.6 |
Granada | 36 | 2 | 1700 | 1807.6 |
Granada | 48 | 2 | 1355 | 1318.2 |
Granada | 0 | 3 | 948 | 1036.6 |
Granada | 24 | 3 | 1560 | 1528.6 |
Granada | 36 | 3 | 1830 | 1807.6 |
Granada | 48 | 3 | 1368 | 1318.2 |
Granada | 0 | 4 | 960 | 1036.6 |
Granada | 24 | 4 | 1550 | 1528.6 |
Granada | 36 | 4 | 1866 | 1807.6 |
Granada | 48 | 4 | 1356 | 1318.2 |
Leon | 0 | 1 | 800 | 712.2 |
Leon | 24 | 1 | 1250 | 1204.2 |
Leon | 36 | 1 | 1550 | 1483.2 |
Leon | 48 | 1 | 910 | 993.8 |
Leon | 0 | 2 | 780 | 712.2 |
Leon | 24 | 2 | 1180 | 1204.2 |
Leon | 36 | 2 | 1540 | 1483.2 |
Leon | 48 | 2 | 890 | 993.8 |
Leon | 0 | 3 | 663.6 | 712.2 |
Leon | 24 | 3 | 1260 | 1204.2 |
Leon | 36 | 3 | 1480 | 1483.2 |
Leon | 48 | 3 | 900 | 993.8 |
Leon | 2 | 4 | 750 | 712.2 |
Leon | 24 | 4 | 1195 | 1204.2 |
Leon | 36 | 4 | 1525 | 1483.2 |
Leon | 48 | 4 | 900 | 993.8 |
Let’s copy/paste the above data into Excel and calculate the mean and standard deviation.
The mean value is the same for both the actual yield values and the predicted yield by R. However, the individual yield values that make up the mean can differ, resulting in different standard deviations. In other words, when BLUE is applied, it generates new individual data that makes the standard deviation for each region (fixed factor) the same, while there is no change in the mean yield values for each region. This concept is known as the Best Linear Unbiased Estimator (BLUE), which creates a new set of individual data that equalizes the standard deviation for each fixed factor.
Let’s consider another example. Instead of the Leon region, let’s include the yield data from the Barcelona region, assuming that the data from Barcelona is not uniformly distributed due to abnormal weather conditions.
location=rep(c("Cordoba","Granada","Barcelona"), each=16)
trt=rep(c(0,24,36,48), time=12)
block=rep(rep(c(1,2,3,4), each=4),3)
yield= c(750,1250,1550,1120,780,1280,1520,1130,790,1300,
1525,1140,800,1250,1555,1130,1070,1490,1790,1345,
1066,1510,1700,1355,948,1560,1830,1368,960,1550,
1866,1356,800,1000,1400,100,1000,1100,1500,
500,900,1300,2800,1300,200,1200,3000,950)
dataA=data.frame(location,trt,block,yield)
# data summary
library(dplyr)
dataB=data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(yield),
sd=sd(yield),
n=length(yield),
se=sd/sqrt(n)))
dataB
In this case, the yield data from Barcelona would exhibit a high level of variability, indicating that there is significant variation in the data points for Barcelona.
library(ggplot2)
ggplot(data=dataB, aes(x=location, y=mean))+
geom_bar(stat="identity", position="dodge", width=0.7, size=1) +
geom_errorbar(aes(ymin= mean-se, ymax= mean+se), position=position_dodge(0.7),
width=0.5, color='Black') +
scale_fill_manual(values = c("grey25","grey55")) +
scale_y_continuous(breaks= seq(0,2000,500), limits = c(0,2000)) +
labs(x="Location", y="Yield") +
theme_grey(base_size=15, base_family="serif")+
theme(legend.position="none",
axis.line= element_line(linewidth=0.5, colour="black"))+
windows(width=5.5, height=5)
If we were to apply the concept of BLUE, even with the high variability in yield data from Barcelona, it would help to mitigate the impact of this variability on the statistical significance of the location factor. The BLUE concept would generate a new set of individual data points for Barcelona that would have the same standard error as the other locations, effectively reducing the impact of the high variability in yield data from Barcelona. This would result in a more robust estimation of the location factor’s effect on yield, accounting for the variability in the data and improving the statistical significance of the results.
# to identify model
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)
# to calculate prediction values
dataA$prediction=predict(Blue)
dataA
If the BLUE concept is applied in this case, the standard errors would be the same for all locations, including Barcelona. When plotting the graph, it would show similar standard errors for all locations, including Barcelona.
The standard errors would be the same for all locations, indicating that the variability in the data (residuals) has been adjusted. It’s important to note that the mean values remain unchanged, and only the errors have been adjusted through the application of the BLUE concept.