The Best Linear Unbiased Estimator (BLUE): Step-by-Step Guide using R (with AllInOne Package)


AllInOne R package

In this session, I will introduce the method of calculating the Best Linear Unbiased Estimator (BLUE). Instead of simply listing formulas as many websites do to explain BLUE, this post aims to help readers understand the process of calculating BLUE with an actual dataset using R.

I have the following data.

location   sulphur (kg/ha)   block   yield
Cordoba    0                 1       750
Cordoba    24                1       1250
Cordoba    36                1       1550
Cordoba    48                1       1120
Cordoba    0                 2       780
Cordoba    24                2       1280
Cordoba    36                2       1520
Cordoba    48                2       1130
Cordoba    0                 3       790
Cordoba    24                3       1300
Cordoba    36                3       1525
Cordoba    48                3       1140
Cordoba    0                 4       800
Cordoba    24                4       1250
Cordoba    36                4       1555
Cordoba    48                4       1130
Granada    0                 1       1070
Granada    24                1       1490
Granada    36                1       1790
Granada    48                1       1345
Granada    0                 2       1066
Granada    24                2       1510
Granada    36                2       1700
Granada    48                2       1355
Granada    0                 3       948
Granada    24                3       1560
Granada    36                3       1830
Granada    48                3       1368
Granada    0                 4       960
Granada    24                4       1550
Granada    36                4       1866
Granada    48                4       1356
Leon       0                 1       800
Leon       24                1       1250
Leon       36                1       1550
Leon       48                1       910
Leon       0                 2       780
Leon       24                2       1180
Leon       36                2       1540
Leon       48                2       890
Leon       0                 3       663.6
Leon       24                3       1260
Leon       36                3       1480
Leon       48                3       900
Leon       0                 4       750
Leon       24                4       1195
Leon       36                4       1525
Leon       48                4       900

The data mentioned above has been uploaded to my GitHub.
https://github.com/agronomy4future/raw_data_practice/blob/main/BLUE_Practice.csv



Let’s consider a dataset collected by investigating crop yields in three regions of Spain under different rates of sulphur fertilizer. The experiment was conducted with four replicates as blocks. In this case, I have set location as a fixed factor, and sulphur and block as random factors. When both fixed and random factors are present in the same model, the Best Linear Unbiased Estimator (BLUE) is commonly used. If all factors are considered random, the Best Linear Unbiased Prediction (BLUP) would be used instead.
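In lme4 terms, the two cases differ only in where each factor enters the model formula. Here is a minimal sketch of the contrast, assuming a data frame dataA with columns location, trt (sulphur), and block, which we will load below:

library(lme4)

# BLUE: location is fixed; sulphur (trt) and block are random
model_blue=lmer(yield ~ 0 + location + (1|trt) + (1|block), data=dataA)

# BLUP: every factor is treated as random
model_blup=lmer(yield ~ 1 + (1|location) + (1|trt) + (1|block), data=dataA)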

I will now demonstrate how to calculate the Best Linear Unbiased Estimator (BLUE) using R.


To calculate BLUE using R

If you have understood the above calculations, from now on let’s calculate BLUE using R.

First, we will load the data.

library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/BLUE_Practice.csv"
dataA=data.frame(read_csv(url(github), show_col_types=FALSE))

Now we will configure the model. Location is set as a fixed factor, while treatment (sulphur) and block are set as random factors. The model configuration is as follows.

library(lme4)
# 0 + removes the intercept so that each location receives its own estimate
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)

Now let’s calculate the prediction values that were previously calculated in Excel.

predict(Blue)

Now let’s add these prediction values into the existing data.

dataA$prediction=predict(Blue)
dataA

Now let’s calculate the average of these prediction values.

library(dplyr)
data.frame(dataA %>%
  group_by(location) %>%
  summarise(mean=mean(prediction),
            sd=sd(prediction),
            n=length(prediction),
            se=sd/sqrt(n)))

Here, the mean value is the BLUE value. Please compare it with the value calculated using Excel. It’s the same. In summary, the code is as follows.

# to upload data
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/BLUE_Practice.csv"
dataA=data.frame(read_csv(url(github), show_col_types=FALSE))

# to specify the model
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)

# to calculate prediction values
dataA$prediction=predict(Blue)

# to calculate BLUE
library(dplyr)
data.frame(dataA %>%
  group_by(location) %>%
  summarise(mean=mean(prediction),
            sd=sd(prediction),
            n=length(prediction),
            se=sd/sqrt(n)))
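As a quick cross-check, the BLUE values can also be read directly from the model: because the intercept is removed (0 +), the fixed-effect estimates returned by lme4’s fixef() are the location estimates themselves. A small sketch, assuming the model object Blue from the code above:

# fixed-effect estimates; these should match the mean prediction per location
fixef(Blue)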


AllInOne package

I will now introduce a method to calculate BLUE in R without writing code, using the AllInOne package developed by Dr. Mohsen Yoosefzadeh Najafabadi at the University of Guelph. This package is an excellent tool for calculating BLUE and BLUP without any coding.

# to install package
remotes::install_github('MohsenYN/AllInOne')

# to run the program
library(AllInOne)
run_app()

When you run the above code, a window like the following will appear.

Go to the Dataset tab and click on Upload Dataset to upload the data.

Next, click on Select Variables to choose the independent variables (Location, sulphur, block) and the dependent variable (yield).

Select Variance Analysis under Data Pre-Processing, and then choose Mixed Analysis.

Then select ‘yield’ as the dependent variable, choose ‘location’ as the fixed factor, and select ‘sulphur’ and ‘block’ as the random factors.

When you click OK, a new window will appear. You can first check the Summary of Model.

Summary of Model (yield)

Linear mixed model fit by REML ['lmerMod']
Formula: yield ~ 0 + (1 | sulphur) + (1 | block) + Location
   Data: data

REML criterion at convergence: 515.3

Scaled residuals:
     Min       1Q   Median       3Q      Max
-1.94477 -0.65178  0.03198  0.77110  1.58621

Random effects:
 Groups   Name        Variance  Std.Dev.
 sulphur  (Intercept) 1.067e+05 3.266e+02
 block    (Intercept) 5.737e-13 7.575e-07
 Residual             3.061e+03 5.532e+01
Number of obs: 48, groups: sulphur, 4; block, 4

Fixed effects:
                Estimate Std. Error t value
LocationCórdoba   1179.4      163.9   7.196
LocationGranada   1422.8      163.9   8.680
LocationLeon      1098.4      163.9   6.701

Correlation of Fixed Effects:
            LctnCr LctnGr
LocatinGrnd 0.993
LocationLen 0.993  0.993
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
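The boundary (singular) fit message at the end appears because the block variance is estimated as essentially zero (5.737e-13 above). With the equivalent lmer model fitted earlier in our own session (Blue), this can be inspected directly; a short sketch:

# inspect the variance components; block is essentially zero
VarCorr(Blue)

# TRUE when a variance component sits on the zero boundary
isSingular(Blue)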

As you continue to navigate through the pages, you will be able to see the BLUE value.



What does this BLUE value mean?

By calculating the BLUE value using R, we can infer one interesting fact here.

Since the standard deviations of the predictions are the same for every location and the sample sizes are equal (16 each), the standard errors are also identical. This is because the residual variation has been removed.
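Where does this come from? predict() returns, for each plot, the location estimate plus the random-effect deviations (BLUPs) for its sulphur level and block, and the same set of deviations is added within every location. These deviations can be displayed with lme4’s ranef(); a quick sketch, assuming the model object Blue:

# random-effect deviations (BLUPs) for trt and block
ranef(Blue)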

Let’s calculate the mean for the actual data.

library(dplyr)
data.frame(dataA %>%
  group_by(location) %>%
  summarise(mean=mean(yield),
            sd=sd(yield),
            n=length(yield),
            se=sd/sqrt(n)))

If you look at the table above, you can see that the mean yield values are the same as before. However, the standard deviation differs by location, resulting in different standard errors. So even though the mean values are unchanged, the essence of BLUE lies in the individual data points that make up each mean. Let’s compare the actual yield values with the yield values after applying BLUE to the data.

location   sulphur (kg/ha)   block   Actual yield   Predicted yield by R
Cordoba    0                 1       750            793.3
Cordoba    24                1       1250           1285.2
Cordoba    36                1       1550           1564.2
Cordoba    48                1       1120           1074.8
Cordoba    0                 2       780            793.3
Cordoba    24                2       1280           1285.2
Cordoba    36                2       1520           1564.2
Cordoba    48                2       1130           1074.8
Cordoba    0                 3       790            793.3
Cordoba    24                3       1300           1285.2
Cordoba    36                3       1525           1564.2
Cordoba    48                3       1140           1074.8
Cordoba    0                 4       800            793.3
Cordoba    24                4       1250           1285.2
Cordoba    36                4       1555           1564.2
Cordoba    48                4       1130           1074.8
Granada    0                 1       1070           1036.6
Granada    24                1       1490           1528.6
Granada    36                1       1790           1807.6
Granada    48                1       1345           1318.2
Granada    0                 2       1066           1036.6
Granada    24                2       1510           1528.6
Granada    36                2       1700           1807.6
Granada    48                2       1355           1318.2
Granada    0                 3       948            1036.6
Granada    24                3       1560           1528.6
Granada    36                3       1830           1807.6
Granada    48                3       1368           1318.2
Granada    0                 4       960            1036.6
Granada    24                4       1550           1528.6
Granada    36                4       1866           1807.6
Granada    48                4       1356           1318.2
Leon       0                 1       800            712.2
Leon       24                1       1250           1204.2
Leon       36                1       1550           1483.2
Leon       48                1       910            993.8
Leon       0                 2       780            712.2
Leon       24                2       1180           1204.2
Leon       36                2       1540           1483.2
Leon       48                2       890            993.8
Leon       0                 3       663.6          712.2
Leon       24                3       1260           1204.2
Leon       36                3       1480           1483.2
Leon       48                3       900            993.8
Leon       0                 4       750            712.2
Leon       24                4       1195           1204.2
Leon       36                4       1525           1483.2
Leon       48                4       900            993.8

Let’s copy/paste the above data into Excel and calculate the mean and standard deviation.

The mean value is the same for both the actual yields and the yields predicted by R. However, the individual values that make up each mean differ, resulting in different standard deviations. In other words, applying BLUE generates a new set of individual values that makes the standard deviation the same for each region (the fixed factor), while leaving the mean yield of each region unchanged. This is the essence of the Best Linear Unbiased Estimator (BLUE): it creates a new set of individual data that equalizes the standard deviation for each fixed factor.
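The same comparison can also be made directly in R instead of Excel. A small sketch, assuming dataA still contains the prediction column added earlier:

library(dplyr)
# compare the variability of actual vs. BLUE-predicted yields by location
data.frame(dataA %>%
  group_by(location) %>%
  summarise(mean_yield=mean(yield), sd_yield=sd(yield),
            mean_pred=mean(prediction), sd_pred=sd(prediction)))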



Let’s consider another example. Instead of the Leon region, let’s include yield data from the Barcelona region, assuming that the data from Barcelona is highly variable due to abnormal weather conditions.

location=rep(c("Cordoba","Granada","Barcelona"), each=16)
trt=rep(c(0,24,36,48), times=12)
block=rep(rep(c(1,2,3,4), each=4), 3)
yield=c(750,1250,1550,1120,780,1280,1520,1130,790,1300,
        1525,1140,800,1250,1555,1130,1070,1490,1790,1345,
        1066,1510,1700,1355,948,1560,1830,1368,960,1550,
        1866,1356,800,1000,1400,100,1000,1100,1500,
        500,900,1300,2800,1300,200,1200,3000,950)
dataA=data.frame(location,trt,block,yield)

# data summary
library(dplyr)
dataB=data.frame(dataA %>%
                   group_by(location) %>%
                   summarise(mean=mean(yield), 
                             sd=sd(yield), 
                             n=length(yield), 
                             se=sd/sqrt(n)))
dataB

In this case, the yield data from Barcelona exhibits a high level of variability; there is much more variation among its data points than in the other locations.

library(ggplot2)
windows(width=5.5, height=5) # opens a plotting device on Windows
ggplot(data=dataB, aes(x=location, y=mean))+
  geom_bar(stat="identity", position="dodge", width=0.7) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), position=position_dodge(0.7),
                width=0.5, color='Black') +
  scale_y_continuous(breaks=seq(0,2000,500), limits=c(0,2000)) +
  labs(x="Location", y="Yield") +
  theme_grey(base_size=15, base_family="serif")+
  theme(legend.position="none",
        axis.line=element_line(linewidth=0.5, colour="black"))

If we apply the concept of BLUE here, the high variability in Barcelona’s yield data no longer inflates the standard error for that location. BLUE generates a new set of predicted values for Barcelona with the same standard error as the other locations, which mitigates the impact of the variability and gives a more robust estimate of the location effect on yield.

# to specify the model
library(lme4)
Blue=lmer(yield ~ 0 + (1|trt) + (1|block) + location, data=dataA)

# to calculate prediction values
dataA$prediction=predict(Blue)
dataA

If the BLUE concept is applied in this case, the standard errors are the same for all locations, including Barcelona, and a graph of the predictions shows equal error bars everywhere. It’s important to note that the mean values remain unchanged; only the errors (the residual variability) have been adjusted through the application of the BLUE concept.
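To draw that graph, the predictions can be summarized and passed to the same ggplot code as before. A sketch, assuming dataA contains the prediction column from predict(Blue); dataC is a hypothetical name for the summary table:

library(dplyr)
library(ggplot2)

# summarize the BLUE predictions by location
dataC=data.frame(dataA %>%
  group_by(location) %>%
  summarise(mean=mean(prediction),
            sd=sd(prediction),
            n=length(prediction),
            se=sd/sqrt(n)))

# the same bar graph, now with equal error bars across locations
ggplot(data=dataC, aes(x=location, y=mean))+
  geom_bar(stat="identity", position="dodge", width=0.7) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se), position=position_dodge(0.7),
                width=0.5, color='Black') +
  scale_y_continuous(breaks=seq(0,2000,500), limits=c(0,2000)) +
  labs(x="Location", y="Yield") +
  theme_grey(base_size=15, base_family="serif")+
  theme(legend.position="none",
        axis.line=element_line(linewidth=0.5, colour="black"))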



