The Best Linear Unbiased Estimator (BLUE): Step-by-Step Guide using R (with AllInOne Package)
In this session, I will introduce how to calculate the Best Linear Unbiased Estimator (BLUE). Rather than simply listing formulas, as many websites do, this post walks through the calculation with an actual dataset in R.
I have the following data.
location | sulphur (kg/ha) | block | yield |
Cordoba | 0 | 1 | 750 |
Cordoba | 24 | 1 | 1250 |
Cordoba | 36 | 1 | 1550 |
Cordoba | 48 | 1 | 1120 |
Cordoba | 0 | 2 | 780 |
Cordoba | 24 | 2 | 1280 |
Cordoba | 36 | 2 | 1520 |
Cordoba | 48 | 2 | 1130 |
Cordoba | 0 | 3 | 790 |
Cordoba | 24 | 3 | 1300 |
Cordoba | 36 | 3 | 1525 |
Cordoba | 48 | 3 | 1140 |
Cordoba | 0 | 4 | 800 |
Cordoba | 24 | 4 | 1250 |
Cordoba | 36 | 4 | 1555 |
Cordoba | 48 | 4 | 1130 |
Granada | 0 | 1 | 1070 |
Granada | 24 | 1 | 1490 |
Granada | 36 | 1 | 1790 |
Granada | 48 | 1 | 1345 |
Granada | 0 | 2 | 1066 |
Granada | 24 | 2 | 1510 |
Granada | 36 | 2 | 1700 |
Granada | 48 | 2 | 1355 |
Granada | 0 | 3 | 948 |
Granada | 24 | 3 | 1560 |
Granada | 36 | 3 | 1830 |
Granada | 48 | 3 | 1368 |
Granada | 0 | 4 | 960 |
Granada | 24 | 4 | 1550 |
Granada | 36 | 4 | 1866 |
Granada | 48 | 4 | 1356 |
Leon | 0 | 1 | 800 |
Leon | 24 | 1 | 1250 |
Leon | 36 | 1 | 1550 |
Leon | 48 | 1 | 910 |
Leon | 0 | 2 | 780 |
Leon | 24 | 2 | 1180 |
Leon | 36 | 2 | 1540 |
Leon | 48 | 2 | 890 |
Leon | 0 | 3 | 663.6 |
Leon | 24 | 3 | 1260 |
Leon | 36 | 3 | 1480 |
Leon | 48 | 3 | 900 |
Leon | 0 | 4 | 750 |
Leon | 24 | 4 | 1195 |
Leon | 36 | 4 | 1525 |
Leon | 48 | 4 | 900 |
The data mentioned above has been uploaded to my GitHub.
https://github.com/agronomy4future/raw_data_practice/blob/main/BLUE_Practice.csv
Let’s consider a dataset collected by measuring crop yields in three regions of Spain under different rates of sulfur fertilizer. The experiment was conducted with four replications as blocks. In this case, I set location as a fixed factor, and sulfur and block as random factors. When both fixed and random factors are present in the same model, the Best Linear Unbiased Estimator (BLUE) is commonly used. If all factors were treated as random, the Best Linear Unbiased Prediction (BLUP) would be used instead.
I will now demonstrate how to calculate the Best Linear Unbiased Estimator (BLUE) using R.
To calculate BLUE using R
If you have understood the calculations above, let’s now calculate BLUE using R.
First, we will load the data.
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/BLUE_Practice.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
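As an optional sanity check (my addition, not part of the original code), you can preview the imported data before modeling:
# optional: inspect the structure and first rows of the imported data
str(dataA)
head(dataA)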
Now we will configure the model. Location is set as a fixed factor, while treatment (sulphur) and block are set as random factors. The model is specified as follows.
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)
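Because the model drops the intercept (0 + location), the fixed-effect estimates themselves are the per-location BLUEs. As an optional check that is not in the original post, you can inspect them directly:
# fixed-effect estimates = location BLUEs under the 0 + location parameterization
fixef(Blue)
# summary(Blue) additionally shows their standard errors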
Now let’s calculate the predicted values, the same values we calculated earlier in Excel.
predict(Blue)
Now let’s add these predicted values to the existing data.
dataA$prediction=predict(Blue)
dataA
Now let’s calculate the mean of the predicted values for each location (this step uses the dplyr package).
library(dplyr)
data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(prediction),
sd=sd(prediction),
n=length(prediction),
se=sd/sqrt(n)))
Here, the mean value is the BLUE for each location. Compare it with the value calculated in Excel; it is the same. In summary, the full code is as follows.
# to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/BLUE_Practice.csv"
dataA=data.frame(read_csv(url(github),show_col_types = FALSE))
# to identify model
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)
# to calculate prediction values
predict(Blue)
dataA$prediction=predict(Blue)
# to calculate BLUE
library(dplyr)
data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(prediction),
sd=sd(prediction),
n=length(prediction),
se=sd/sqrt(n)))
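As an aside (my addition, not part of the original workflow): if all factors, including location, were treated as random, you would fit a BLUP model instead and extract the predicted random effects. A minimal sketch:
# hypothetical BLUP sketch: all factors treated as random
library(lme4)
Blup=lmer(yield ~ 1 + (1|location) + (1|trt) + (1|block), data=dataA)
ranef(Blup)$location # location BLUPs (deviations from the overall mean)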
AllInOne package
I will now introduce a method to calculate BLUE in R without writing code, using the AllInOne package developed by Dr. Mohsen Yoosefzadeh Najafabadi at the University of Guelph. This package is an excellent tool for calculating BLUE and BLUP without any coding.
# to install package
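# install.packages("remotes") # run this first if the remotes package is not yet installed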
remotes::install_github('MohsenYN/AllInOne')
# to run the program
library(AllInOne)
run_app()
When you run the above code, a window like the following will appear.
Go to the Dataset tab and click on Upload Dataset to upload the data.
Next, click on Select Variables to choose the independent variables (Location, sulphur, block) and the dependent variable (yield).
Select Variance Analysis under Data Pre-Processing, and then choose Mixed Analysis.
Then select ‘yield’ as the dependent variable, ‘location’ as the fixed factor, and ‘sulphur’ and ‘block’ as the random factors.
When you click OK, a new window will appear. You can first check the Summary of Model.
Summary of Model(yield)
Linear mixed model fit by REML ['lmerMod']
Formula: yield ~ 0 + (1 | sulphur) + (1 | block) + Location
Data: data
REML criterion at convergence: 515.3
Scaled residuals:
Min 1Q Median 3Q Max
-1.94477 -0.65178 0.03198 0.77110 1.58621
Random effects:
Groups Name Variance Std.Dev.
sulphur (Intercept) 1.067e+05 3.266e+02
block (Intercept) 5.737e-13 7.575e-07
Residual 3.061e+03 5.532e+01
Number of obs: 48, groups: sulphur, 4; block, 4
Fixed effects:
Estimate Std. Error t value
LocationCórdoba 1179.4 163.9 7.196
LocationGranada 1422.8 163.9 8.680
LocationLeon 1098.4 163.9 6.701
Correlation of Fixed Effects:
LctnCr LctnGr
LocatinGrnd 0.993
LocationLen 0.993 0.993
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
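One detail worth noting (my observation, not discussed in the original post): the block variance component is essentially zero, which is why lme4 reports a singular (boundary) fit. If you refit the same model in R as the Blue object from the earlier code, you can confirm this with:
# check whether the fit is singular (a variance component estimated at the boundary, ~0)
isSingular(Blue)
VarCorr(Blue) # variance components for the treatment, block, and residual terms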
As you navigate through the subsequent tabs, you will be able to see the BLUE values.
What does this BLUE value mean?
From the BLUE values calculated in R, we can notice one interesting fact.
The standard deviations of the predicted values are the same for every location, and the sample sizes are equal (n = 16 each), so the standard errors are also the same. This is because the residual variation has been removed from the predictions.
Now let’s calculate the mean of the actual yield data.
library(dplyr)
data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(yield),
sd=sd(yield),
n=length(yield),
se=sd/sqrt(n)))
If you look at the table above, you can see that the mean yield values are the same as the BLUE values. However, the standard deviation differs for each location, so the standard errors also differ. Even though the means are identical, the idea behind BLUE lies in the individual data points that make up each mean. Let’s compare the actual yield values with the yield values predicted after applying BLUE.
location | sulphur (kg/ha) | block | Actual yield | predicted yield by R |
Cordoba | 0 | 1 | 750 | 793.3 |
Cordoba | 24 | 1 | 1250 | 1285.2 |
Cordoba | 36 | 1 | 1550 | 1564.2 |
Cordoba | 48 | 1 | 1120 | 1074.8 |
Cordoba | 0 | 2 | 780 | 793.3 |
Cordoba | 24 | 2 | 1280 | 1285.2 |
Cordoba | 36 | 2 | 1520 | 1564.2 |
Cordoba | 48 | 2 | 1130 | 1074.8 |
Cordoba | 0 | 3 | 790 | 793.3 |
Cordoba | 24 | 3 | 1300 | 1285.2 |
Cordoba | 36 | 3 | 1525 | 1564.2 |
Cordoba | 48 | 3 | 1140 | 1074.8 |
Cordoba | 0 | 4 | 800 | 793.3 |
Cordoba | 24 | 4 | 1250 | 1285.2 |
Cordoba | 36 | 4 | 1555 | 1564.2 |
Cordoba | 48 | 4 | 1130 | 1074.8 |
Granada | 0 | 1 | 1070 | 1036.6 |
Granada | 24 | 1 | 1490 | 1528.6 |
Granada | 36 | 1 | 1790 | 1807.6 |
Granada | 48 | 1 | 1345 | 1318.2 |
Granada | 0 | 2 | 1066 | 1036.6 |
Granada | 24 | 2 | 1510 | 1528.6 |
Granada | 36 | 2 | 1700 | 1807.6 |
Granada | 48 | 2 | 1355 | 1318.2 |
Granada | 0 | 3 | 948 | 1036.6 |
Granada | 24 | 3 | 1560 | 1528.6 |
Granada | 36 | 3 | 1830 | 1807.6 |
Granada | 48 | 3 | 1368 | 1318.2 |
Granada | 0 | 4 | 960 | 1036.6 |
Granada | 24 | 4 | 1550 | 1528.6 |
Granada | 36 | 4 | 1866 | 1807.6 |
Granada | 48 | 4 | 1356 | 1318.2 |
Leon | 0 | 1 | 800 | 712.2 |
Leon | 24 | 1 | 1250 | 1204.2 |
Leon | 36 | 1 | 1550 | 1483.2 |
Leon | 48 | 1 | 910 | 993.8 |
Leon | 0 | 2 | 780 | 712.2 |
Leon | 24 | 2 | 1180 | 1204.2 |
Leon | 36 | 2 | 1540 | 1483.2 |
Leon | 48 | 2 | 890 | 993.8 |
Leon | 0 | 3 | 663.6 | 712.2 |
Leon | 24 | 3 | 1260 | 1204.2 |
Leon | 36 | 3 | 1480 | 1483.2 |
Leon | 48 | 3 | 900 | 993.8 |
Leon | 0 | 4 | 750 | 712.2 |
Leon | 24 | 4 | 1195 | 1204.2 |
Leon | 36 | 4 | 1525 | 1483.2 |
Leon | 48 | 4 | 900 | 993.8 |
Let’s copy/paste the above data into Excel and calculate the mean and standard deviation.
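Alternatively, if you prefer to stay in R rather than Excel, here is a minimal sketch (assuming dataA still contains the prediction column added earlier):
# compare actual vs. BLUE-predicted yields by location
library(dplyr)
data.frame(dataA %>%
group_by(location) %>%
summarise(mean_actual=mean(yield), sd_actual=sd(yield),
mean_blue=mean(prediction), sd_blue=sd(prediction)))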
The mean value is the same for the actual yields and the yields predicted by R. However, the individual values that make up each mean differ, so the standard deviations differ. In other words, applying BLUE generates a new set of individual values that equalizes the standard deviation for each region (the fixed factor), while the mean yield of each region remains unchanged. This is the idea behind the Best Linear Unbiased Estimator (BLUE): it creates a new set of individual data with an equalized standard deviation for each level of the fixed factor.
Let’s consider another example. Instead of the Leon region, let’s include yield data from the Barcelona region, assuming the Barcelona data is highly variable due to abnormal weather conditions.
location=rep(c("Cordoba","Granada","Barcelona"), each=16)
trt=rep(c(0,24,36,48), times=12)
block=rep(rep(c(1,2,3,4), each=4),3)
yield= c(750,1250,1550,1120,780,1280,1520,1130,790,1300,
1525,1140,800,1250,1555,1130,1070,1490,1790,1345,
1066,1510,1700,1355,948,1560,1830,1368,960,1550,
1866,1356,800,1000,1400,100,1000,1100,1500,
500,900,1300,2800,1300,200,1200,3000,950)
dataA=data.frame(location,trt,block,yield)
# data summary
library(dplyr)
dataB=data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(yield),
sd=sd(yield),
n=length(yield),
se=sd/sqrt(n)))
dataB
In this case, the yield data from Barcelona exhibit a much higher level of variability than the other locations.
library(ggplot2)
ggplot(data=dataB, aes(x=location, y=mean))+
geom_bar(stat="identity", position="dodge", width=0.7, size=1) +
geom_errorbar(aes(ymin= mean-se, ymax= mean+se), position=position_dodge(0.7),
width=0.5, color='Black') +
scale_fill_manual(values = c("grey25","grey55")) +
scale_y_continuous(breaks= seq(0,2000,500), limits = c(0,2000)) +
labs(x="Location", y="Yield") +
theme_grey(base_size=15, base_family="serif")+
theme(legend.position="none",
axis.line= element_line(linewidth=0.5, colour="black"))
# windows(width=5.5, height=5) # optional: opens a separate graphics device (Windows only)
If we apply the concept of BLUE, even with the high variability in the Barcelona yields, the impact of that variability on the comparison among locations is mitigated. BLUE generates a new set of individual data points for Barcelona with the same standard error as the other locations, effectively reducing the influence of the noisy Barcelona data and giving a more robust estimate of the location effect on yield.
# to identify model
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)
# to calculate prediction values
dataA$prediction=predict(Blue)
dataA
If the BLUE concept is applied in this case, the standard errors are the same for all locations, including Barcelona, and a plot of the predicted values shows this. The equal standard errors indicate that the variability in the data (the residuals) has been removed; the mean values remain unchanged, and only the errors have been adjusted through the application of BLUE.
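To check this yourself, here is a minimal sketch reusing the dataA and Blue objects from the Barcelona example above; the plotting code is the same ggplot() call as before, just pointed at the new summary table:
# summarise the BLUE-adjusted (predicted) values by location
library(dplyr)
dataC=data.frame(dataA %>%
group_by(location) %>%
summarise(mean=mean(prediction),
sd=sd(prediction),
n=length(prediction),
se=sd/sqrt(n)))
dataC # the se column is now identical across locations
# re-run the earlier ggplot() code with data=dataC to draw the adjusted graph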