In agronomy research, how to estimate a missing value in collected field data?
Here is an example of missing data. There are four different cultivars and I would like to determine if there is a difference in yield among them. I have five replicates as blocks, so the experimental design is a Randomized Complete Block Design (RCBD), and we can analyze the data using one-way ANOVA with blocks. You can download the data using R by copying and pasting the code below into your R script. After running the code, an Excel file will be saved to your computer.
# Four cultivars in five blocks; NA marks the missing plot (Cultivar D, Block I)
Cultivar = rep(c("A","B","C","D"), 5)
Block = rep(c("I","II","III","IV","V"), each=4)
Yield = c(32.3,33.3,30.8,NA,34.0,33.0,34.3,26.0,34.3,36.3,35.3,29.8,35.0,36.8,32.3,28.0,36.5,34.5,35.8,28.8)
dataA = data.frame(Cultivar, Block, Yield)
library(writexl)  # run install.packages("writexl") first if the package is not installed
write_xlsx(dataA, "C:/Users/Usuari/Desktop/dataA.xlsx")
# Please check the file path on your computer
The data were originally supposed to consist of four different cultivars with five replicates (blocks) each. However, due to flooding in a certain spot, we were unable to grow plants of Cultivar D in the first block, resulting in a missing yield value. As a result, the design is no longer balanced: Cultivar D has four observations while the other cultivars have five, and Block I has three observations while the other blocks have four. As previously mentioned, we are analyzing the differences among cultivars (treatments) with blocks, which requires a one-way ANOVA with blocks.
The statistical model for a one-way ANOVA with blocks is
yij = μ + τi + βj + εij
where μ is the grand mean, τi is the effect of the i-th treatment, βj is the effect of the j-th block, and εij is the error (residual).
In this model, the missing value belongs to the 4th treatment (Cultivar D) and the 1st block (Block I). Therefore, y41 is the missing value.
How can we estimate a missing value using existing data?
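I suggest the following equation:
ŷ = (rB + tT - G) / ((r - 1)(t - 1))
where
r = number of replicates (blocks)
t = number of treatments (cultivars)
B = the sum of the observed values in the block containing the missing value
T = the sum of the observed values in the treatment containing the missing value
G = the total sum of all observed values
Let’s calculate each value.
r = 5
t = 4
B = 32.3 + 33.3 + 30.8 = 96.4
T = 26.0 + 29.8 + 28.0 + 28.8 = 112.6
G = 627.1
Then, we can calculate the estimated missing value as
ŷ41 = (5 × 96.4 + 4 × 112.6 - 627.1) / ((5 - 1)(4 - 1)) = 305.3 / 12 ≈ 25.4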
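The same calculation can be sketched in R. This is a minimal sketch that assumes the standard missing-plot formula for an RCBD with a single missing value, (rB + tT - G) / ((r - 1)(t - 1)), which reproduces the 25.4 reported above. The helper name estimate_missing() is hypothetical and not part of any package.

```r
# Yield data from the example; NA marks the missing plot (Cultivar D, Block I)
Cultivar <- rep(c("A", "B", "C", "D"), 5)
Block <- rep(c("I", "II", "III", "IV", "V"), each = 4)
Yield <- c(32.3, 33.3, 30.8, NA, 34.0, 33.0, 34.3, 26.0, 34.3, 36.3,
           35.3, 29.8, 35.0, 36.8, 32.3, 28.0, 36.5, 34.5, 35.8, 28.8)

# Hypothetical helper: estimate a single missing value in an RCBD
estimate_missing <- function(yield, treatment, block) {
  miss <- which(is.na(yield))
  stopifnot(length(miss) == 1)               # formula assumes exactly one missing plot
  n_blk <- length(unique(block))             # r
  n_trt <- length(unique(treatment))         # t
  B <- sum(yield[block == block[miss]], na.rm = TRUE)              # block total
  T_sum <- sum(yield[treatment == treatment[miss]], na.rm = TRUE)  # treatment total
  G <- sum(yield, na.rm = TRUE)                                    # grand total
  (n_blk * B + n_trt * T_sum - G) / ((n_blk - 1) * (n_trt - 1))
}

y_hat <- estimate_missing(Yield, Cultivar, Block)
round(y_hat, 1)
```

The function takes the data as plain vectors, so it can be reused on another data set with one missing plot by passing the corresponding columns.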
How would statistical significance be affected by the presence of missing data, and how would it differ when using estimated values?
Now, let’s explore the impact of missing data versus estimated values. We will compare the results obtained when the missing values are left as-is versus when they are estimated and filled in. This will allow us to see the differences that can arise when working with incomplete datasets.
A) Missing value exists (Ignoring missing value)
Cultivar=rep(c("A","B","C","D"),5)
Block=rep(c("I","II","III","IV","V"),each=4)
Yield= c(32.3,33.3,30.8,NA,34.0,33.0,34.3,26.0,34.3,36.3,35.3,29.8,35.0,36.8,32.3,28.0,36.5,34.5,35.8,28.8)
dataA=data.frame (Cultivar, Block, Yield)
ANOVA1= aov (Yield ~ Cultivar + factor(Block), data=dataA)
summary(ANOVA1)
              Df Sum Sq Mean Sq F value   Pr(>F)
Cultivar       3 122.46   40.82  25.911 2.75e-05 ***
factor(Block)  4  29.34    7.33   4.655   0.0192 *
Residuals     11  17.33    1.58
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1 observation deleted due to missingness
Now, it shows there is a significant difference among cultivars (p<0.001, ***).
B) Estimated value is applied
Cultivar=rep(c("A","B","C","D"),5)
Block=rep(c("I","II","III","IV","V"),each=4)
Yield= c(32.3,33.3,30.8,25.4,34.0,33.0,34.3,26.0,34.3,36.3,35.3,29.8,35.0,36.8,32.3,28.0,36.5,34.5,35.8,28.8)
dataB=data.frame (Cultivar, Block, Yield)
ANOVA2= aov (Yield ~ Cultivar + factor(Block), data=dataB)
summary(ANOVA2)
              Df Sum Sq Mean Sq F value   Pr(>F)
Cultivar       3 171.36   57.12  39.550 1.69e-06 ***
factor(Block)  4  35.38    8.85   6.125  0.00635 **
Residuals     12  17.33    1.44
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It also shows a significant difference among cultivars (p<0.001, ***). Whether the missing value is ignored or replaced by the estimate, the cultivar effect is significant.
However, the degrees of freedom differ depending on the approach used to handle the missing data.
The approach used to handle missing data can affect the size of the estimated treatment effect and its precision, so the method should be chosen carefully. It also affects the degrees of freedom in the analysis. When the observation with the missing value is excluded, the residual degrees of freedom are reduced by one. When the estimated value is filled in, the ANOVA table keeps the full residual degrees of freedom of the complete design:
df: TSS = rt-1 = 5*4-1 = 19
df: SSTr = t-1 = 4-1 = 3 (t = number of treatments)
df: SSBlock = r-1 = 5-1 = 4 (r = number of blocks)
df: SSE = (r-1)(t-1) = (5-1)*(4-1) = 12
Even though there is one missing value, the number of treatments remains four (t=4). Additionally, a missing value in a block does not change the number of blocks (r=5). Therefore, the degrees of freedom for treatment (cultivars) and block remain the same. The degrees of freedom for treatment will always be 4-1=3, and the degrees of freedom for block will always be 5-1=4.
However, the degrees of freedom for residuals (error) are affected by the missing value. In a complete design, the degrees of freedom for residuals are the product of the degrees of freedom for treatment and block. Since there were only three treatments in the first block due to the missing value (Cultivar D was missing), the total degrees of freedom drop from 19 to 18, and the residual degrees of freedom drop with them. In the ANOVA table, the degrees of freedom for residuals were 11 when the missing value was excluded. However, when the estimated value was used to replace the missing value, the degrees of freedom for residuals increased to 12 [= (r-1)(t-1) = (5-1)(4-1) = 12].
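The degrees of freedom discussed here can be reproduced with a few lines of R. This is a minimal sketch, assuming every treatment and every block still has at least one observation; the helper name rcbd_df() is hypothetical and written only for this illustration.

```r
# Hypothetical helper: degrees of freedom for an RCBD with
# r blocks, t treatments and m plots dropped because of missing values
rcbd_df <- function(r, t, m = 0) {
  n <- r * t - m                      # number of observations actually analyzed
  c(total     = n - 1,
    treatment = t - 1,
    block     = r - 1,
    residual  = (n - 1) - (t - 1) - (r - 1))
}

rcbd_df(r = 5, t = 4)          # complete data, or estimated value filled in
rcbd_df(r = 5, t = 4, m = 1)   # observation with the missing value dropped
```

Only the total and residual entries change between the two calls, which matches the two ANOVA tables.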
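The estimate of the missing value combines the total of the same cultivar in the other blocks, the total of the other cultivars in the same block, and the grand total (as per the equation used for estimation). It therefore reflects both how Cultivar D yields elsewhere and how Block I compares with the other blocks, so it is likely to be a plausible value for that plot.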
Do we really need to calculate the estimated value?
If not, would it be okay to ignore the missing value?
Let’s go back to the degrees of freedom. As mentioned earlier, if we include the estimated value, the degrees of freedom in residuals will be increased. It is important to understand what this increased degree of freedom signifies.
When ignoring the missing value
Degrees of freedom in residuals = 11
Mean square of error (MSE) = 1.58
F-ratio = 25.911
When accepting the estimated value
Degrees of freedom in residuals = 12
Mean square of error (MSE) = 1.44
F-ratio = 39.550
MSE is calculated as MSE = SSE / df of residuals, and the Sum of Squares of Error (SSE) is the same in both cases, which is 17.33. Therefore, if the degrees of freedom in residuals increase, MSE will decrease. That’s why, when accepting the estimated value, the MSE was smaller than when ignoring the missing value.
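To check this directly, the SSE and residual degrees of freedom can be pulled from both fitted models. This is a minimal sketch using deviance() and df.residual() from base R on the same data as above; the object names fitA and fitB are chosen here for illustration.

```r
Cultivar <- rep(c("A", "B", "C", "D"), 5)
Block <- rep(c("I", "II", "III", "IV", "V"), each = 4)
YieldA <- c(32.3, 33.3, 30.8, NA, 34.0, 33.0, 34.3, 26.0, 34.3, 36.3,
            35.3, 29.8, 35.0, 36.8, 32.3, 28.0, 36.5, 34.5, 35.8, 28.8)
YieldB <- replace(YieldA, is.na(YieldA), 25.4)   # fill in the estimated value

fitA <- aov(YieldA ~ Cultivar + Block)   # row with NA is dropped automatically
fitB <- aov(YieldB ~ Cultivar + Block)   # estimated value included

sse <- c(A = deviance(fitA), B = deviance(fitB))        # residual sums of squares
dfr <- c(A = df.residual(fitA), B = df.residual(fitB))  # residual degrees of freedom

round(sse, 2)        # SSE for each approach
dfr                  # residual df for each approach
round(sse / dfr, 2)  # MSE = SSE / df
```

Because the SSE is essentially unchanged, the whole difference in MSE between the two approaches comes from the divisor.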
The F-ratio is the ratio of the mean square for treatment (MST) to MSE, that is, F = MST / MSE. Thus, a lower MSE will increase the F-ratio, indicating increased statistical significance (i.e., a lower p-value).
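The arithmetic behind the two F-ratios can be retraced from the numbers printed in the ANOVA tables. This is a minimal sketch that copies the rounded table values by hand, so the results agree with the tables only up to rounding; the vector names are chosen here for illustration.

```r
# Values copied from the two ANOVA tables (rounded as printed)
ms_trt <- c(ignore = 40.82, filled = 57.12)   # mean square for Cultivar
sse    <- 17.33                               # same SSE in both tables
df_res <- c(ignore = 11, filled = 12)         # residual degrees of freedom

mse     <- sse / df_res                       # MSE = SSE / residual df
f_ratio <- ms_trt / mse                       # F = MST / MSE
p_value <- pf(f_ratio, df1 = 3, df2 = df_res, lower.tail = FALSE)

round(mse, 2)
round(f_ratio, 2)
signif(p_value, 3)
```

The residual degrees of freedom are kept in their own vector so that a different value can be tried. For instance, some textbooks recommend subtracting one residual degree of freedom for each estimated value, which would correspond to setting the "filled" entry to 11.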
Now, I’ll answer the question, “Do we really need to calculate the estimated value?”
If you want to have more statistical significance, it would be better to include the estimated value rather than ignoring the missing value. The true meaning of the estimated value is to increase statistical significance (i.e., increase the F-ratio).