A Practical Approach to Linear Mixed-Effects Modeling in R
A Linear Mixed-Effects Model (LMM) is a statistical model that combines both fixed effects and random effects to analyze data with repeated measurements or hierarchical structure.
Let’s break down the key components and concepts of a Linear Mixed-Effects Model:
1) Fixed Effects:
- Definition: Fixed effects are parameters that represent population-level effects. These effects are assumed to be constant across all levels or groups in the data.
- Role: Fixed effects capture the overall relationship between the predictors and the response variable. In simpler terms, they describe the average effect that a predictor has on the response across the entire population.
- Example: The type of fertilizer we use is a fixed effect, because its effect is assumed to be the same across all planting rows.
2) Random Effects:
- Definition: Random effects are parameters that represent group-specific or subject-specific effects. These effects are allowed to vary across different levels or groups in the data.
- Role: Random effects account for variability at different levels of the data hierarchy. They acknowledge that observations within the same group or subject may be more similar to each other than to observations in other groups.
- Example: Each planting row might have its own soil quality, sunlight exposure, or other factors that influence crop yield uniquely.
3) Linear Mixed-Effects Model Equation:
The general equation of a Linear Mixed-Effects Model can be written as: Y= Xβ + Zb + ε
- Y: response variable or dependent variable
- X: the fixed-effects design matrix
- β: fixed effects coefficients
- Z: the random-effects design matrix
- b: random effects coefficients
- ε: residuals
- For example, we’re trying to predict grain yield based on fixed effects (like fertilizer type) and random effects (planting row), and the equation will be:
y= Fixed Effects + Random Effects + Error
4) Estimation:
- Fixed Effects: Estimated jointly with the variance components; given those variances, the estimates are generalized least squares (GLS) solutions rather than ordinary least squares (OLS).
- Random Effects: The variance components are estimated by Maximum Likelihood (ML) or Restricted Maximum Likelihood (REML), and the individual random effects are then predicted as conditional modes (BLUPs). A brief sketch follows below.
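To make the estimation step concrete, here is a minimal lme4 sketch using the sleepstudy data that ships with the package (the formula, data, and object names are illustrative and not part of the crop example below):
library(lme4)
# Random-intercept model fit two ways: REML (the default) and ML
fit_reml= lmer(Reaction ~ Days + (1|Subject), data=sleepstudy, REML=TRUE)
fit_ml= lmer(Reaction ~ Days + (1|Subject), data=sleepstudy, REML=FALSE)
fixef(fit_reml)    # fixed-effect estimates
VarCorr(fit_reml)  # variance components for the random intercept and the residual
ranef(fit_reml)    # predicted random effects (conditional modes / BLUPs)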
In summary, Linear Mixed-Effects Models are a powerful statistical tool for analyzing data with hierarchical or clustered structures, accounting for both fixed and random effects to provide more accurate and flexible modeling. They provide a way to handle complex data structures while making reasonable assumptions about the relationships within and between groups.
Let’s practice applying this model and learn how it is used in actual data analysis.
Here is a dataset. Let’s consider this as grain yield data based on different nitrogen levels in various planting rows.
library(lme4)
library(Matrix)
nitrogen=rep(c("N1", "N2", "N3", "N4", "N5"), 3)
row=rep(c("I", "II", "III"), each=5)
yield=c(35, 27, 35, 52, 29, 17.7, 24, 32.7, 47.7, 22.7, 12, 24.05, 41.05, 52.05, 47)
df=data.frame(nitrogen, row, yield)
df
nitrogen row yield
1 N1 I 35.00
2 N2 I 27.00
3 N3 I 35.00
4 N4 I 52.00
5 N5 I 29.00
6 N1 II 17.70
7 N2 II 24.00
8 N3 II 32.70
9 N4 II 47.70
10 N5 II 22.70
11 N1 III 12.00
12 N2 III 24.05
13 N3 III 41.05
14 N4 III 52.05
15 N5 III 47.00
Now, I’ll create a bar graph. First, let’s summarize the data using the dplyr package.
library(dplyr)
summary=data.frame(df %>%
  group_by(nitrogen) %>%
  dplyr::summarize(across(c(yield),
    .fns=list(Mean=~mean(., na.rm=TRUE),
              SD=~sd(., na.rm=TRUE),
              n=~length(.),
              se=~sd(., na.rm=TRUE) / sqrt(length(.))))))
summary
nitrogen yield_Mean yield_SD yield_n yield_se
1 N1 21.56667 11.977618 3 6.9152810
2 N2 25.01667 1.717799 3 0.9917717
3 N3 36.25000 4.313062 3 2.4901473
4 N4 50.58333 2.497165 3 1.4417389
5 N5 32.90000 12.610710 3 7.2807967
Next, I’ll generate a bar graph to visually represent the summarized data.
library(ggplot2)
windows(width=5.5, height=5)   # open a plotting device (Windows only)
ggplot(data=summary, aes(x=nitrogen, y=yield_Mean)) +
  geom_bar(stat="identity", position="dodge", width=0.7, linewidth=1, fill="grey50", color="grey15") +
  geom_errorbar(aes(ymin=yield_Mean-yield_se, ymax=yield_Mean+yield_se),
                position=position_dodge(0.9), width=0.5) +
  scale_y_continuous(breaks=seq(0,70,10), limits=c(0,70)) +
  labs(x="Nitrogen", y="Yield") +
  theme_classic(base_size=18, base_family="serif") +
  theme(axis.line=element_line(linewidth=0.5, colour="black"))
I’ll conduct statistical analysis to determine if there are any differences in yield among different nitrogen levels.
library(agricolae)
anova= aov(yield~ nitrogen, data=df)
summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
nitrogen 4 1541.6 385.4 5.834 0.0109 *
Residuals 10 660.6 66.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TUKEY= HSD.test(anova, "nitrogen")
TUKEY
$groups
yield groups
N4 50.58333 a
N3 36.25000 ab
N5 32.90000 ab
N2 25.01667 b
N1 21.56667 b
attr(,"class")
[1] "group"
The analysis suggests that N4 differs significantly only from N1 and N2: there is no significant difference between N3 and N4, between N5 and N4, or among N1, N2, N3, and N5. The lack of separation between these nitrogen levels is likely attributable to the variability present in the data.
Statistical modelling
In this case, a Linear Mixed-Effects model lets us account for the variability attributable to planting row without changing the mean grain yield at each nitrogen level.
Initially, I assume that the planting row serves as a random factor. Each planting row may possess unique soil quality, sunlight exposure, or other factors influencing crop yield. This row-to-row variation is expected to inflate the overall variability in the data. Therefore, I’ll account for the variability caused by planting row through a Linear Mixed-Effects model.
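Before fitting, a quick check of the row means (plain base R) shows the kind of baseline shift the random intercept is meant to absorb:
aggregate(yield ~ row, data=df, FUN=mean)
# rows I, II, and III average about 35.60, 28.96, and 35.23, so row II sits noticeably lower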
So, I’ll create a model according to my assumption.
library(lme4)
library(Matrix)
model= lmer(yield ~ 0 + nitrogen + (1|row), REML=TRUE, data=df)
I use the lme4 package in R, setting nitrogen as a fixed factor and planting row as a random factor.
In this model, the 0 + term removes the intercept (constant), so the relationship between the independent variable (nitrogen) and the dependent variable (yield) is modeled directly, with one fixed-effect estimate per nitrogen level. The term (1|row) specifies a random intercept for the variable “row,” meaning the intercepts for the different levels of “row” are allowed to vary randomly. In summary, the model expresses the relationship between nitrogen and yield without an overall intercept, but with a random intercept for each level of the “row” variable. This is useful when we expect different rows to have different baseline yields and want to account for that variability in the random intercepts.
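As a side note, the 0 + coding only changes how the fixed effects are reported, not the fit itself. Here is a small sketch (the object names model_cell and model_treat are mine) comparing it with the default intercept coding:
model_cell= lmer(yield ~ 0 + nitrogen + (1|row), REML=TRUE, data=df)   # cell-means coding: one estimate per nitrogen level
model_treat= lmer(yield ~ nitrogen + (1|row), REML=TRUE, data=df)      # treatment coding: intercept (N1) plus differences
fixef(model_cell)                                    # level means
fixef(model_treat)                                   # intercept and offsets from N1
all.equal(fitted(model_cell), fitted(model_treat))   # TRUE: identical fitted values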
Now, let’s extract the fixed and random effect coefficients.
fixed_effects=fixef(model)
fixed_effects
nitrogenN1 nitrogenN2 nitrogenN3 nitrogenN4 nitrogenN5
21.56667 25.01667 36.25000 50.58333 32.90000
random_effects=ranef(model)
random_effects
$row
(Intercept)
I 0.1493495
II -0.2750502
III 0.1257007
Next, I’ll add these estimates alongside the yield data.
df$fixed_effects= fixed_effects[paste0("nitrogen", df$nitrogen)]   # map each level's estimate onto its rows
random_effects_A= random_effects$row["I", "(Intercept)"]
random_effects_B= random_effects$row["II", "(Intercept)"]
random_effects_C= random_effects$row["III", "(Intercept)"]
df$random_effects[df$row=="I"]= random_effects_A
df$random_effects[df$row=="II"]= random_effects_B
df$random_effects[df$row=="III"]= random_effects_C
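A more compact way to do the same mapping (my own one-liner, equivalent result) indexes the random-effects table by the row labels:
df$random_effects= random_effects$row[as.character(df$row), "(Intercept)"]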
Let’s check that all the data was added correctly.
df
nitrogen row yield fixed_effects random_effects
1 N1 I 35.00 21.56667 0.1493495
2 N2 I 27.00 25.01667 0.1493495
3 N3 I 35.00 36.25000 0.1493495
4 N4 I 52.00 50.58333 0.1493495
5 N5 I 29.00 32.90000 0.1493495
6 N1 II 17.70 21.56667 -0.2750502
7 N2 II 24.00 25.01667 -0.2750502
8 N3 II 32.70 36.25000 -0.2750502
9 N4 II 47.70 50.58333 -0.2750502
10 N5 II 22.70 32.90000 -0.2750502
11 N1 III 12.00 21.56667 0.1257007
12 N2 III 24.05 25.01667 0.1257007
13 N3 III 41.05 36.25000 0.1257007
14 N4 III 52.05 50.58333 0.1257007
15 N5 III 47.00 32.90000 0.1257007
Let’s go back to the model equation of Linear Mixed-Effects model.
Y= Xβ + Zb + ε
where y= Fixed Effects + Random Effects + Error
According to this equation, we can easily calculate errors.
df$error= df$yield - df$fixed_effects - df$random_effects
df
nitrogen row yield fixed_effects random_effects error
1 N1 I 35.00 21.56667 0.1493495 13.2839838
2 N2 I 27.00 25.01667 0.1493495 1.8339838
3 N3 I 35.00 36.25000 0.1493495 -1.3993495
4 N4 I 52.00 50.58333 0.1493495 1.2673172
5 N5 I 29.00 32.90000 0.1493495 -4.0493495
6 N1 II 17.70 21.56667 -0.2750502 -3.5916165
7 N2 II 24.00 25.01667 -0.2750502 -0.7416165
8 N3 II 32.70 36.25000 -0.2750502 -3.2749498
9 N4 II 47.70 50.58333 -0.2750502 -2.6082831
10 N5 II 22.70 32.90000 -0.2750502 -9.9249498
11 N1 III 12.00 21.56667 0.1257007 -9.6923674
12 N2 III 24.05 25.01667 0.1257007 -1.0923674
13 N3 III 41.05 36.25000 0.1257007 4.6742993
14 N4 III 52.05 50.58333 0.1257007 1.3409660
15 N5 III 47.00 32.90000 0.1257007 13.9742993
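As a cross-check on this manual decomposition, the same pieces can be pulled straight from the fitted model’s design matrices with lme4’s getME() (a sketch; X, Z, beta, b, and decomp are my own names):
X= getME(model, "X")          # fixed-effects design matrix (15 x 5)
Z= getME(model, "Z")          # random-effects design matrix (15 x 3, sparse)
beta= fixef(model)            # fixed-effect coefficients
b= getME(model, "b")          # conditional modes of the random effects
decomp= as.numeric(X %*% beta) + as.numeric(Z %*% b)                     # Xβ + Zb
all.equal(decomp, as.numeric(fitted(model)))                             # TRUE
all.equal(as.numeric(df$yield - decomp), as.numeric(residuals(model)))   # TRUE: ε = Y - Xβ - Zb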
Alternatively, we can obtain the same errors directly with residuals().
errors=residuals(model)
df$error2=errors
df
nitrogen row yield fixed_effects random_effects error error2
1 N1 I 35.00 21.56667 0.1493495 13.2839838 13.2839838
2 N2 I 27.00 25.01667 0.1493495 1.8339838 1.8339838
3 N3 I 35.00 36.25000 0.1493495 -1.3993495 -1.3993495
4 N4 I 52.00 50.58333 0.1493495 1.2673172 1.2673172
5 N5 I 29.00 32.90000 0.1493495 -4.0493495 -4.0493495
6 N1 II 17.70 21.56667 -0.2750502 -3.5916165 -3.5916165
7 N2 II 24.00 25.01667 -0.2750502 -0.7416165 -0.7416165
8 N3 II 32.70 36.25000 -0.2750502 -3.2749498 -3.2749498
9 N4 II 47.70 50.58333 -0.2750502 -2.6082831 -2.6082831
10 N5 II 22.70 32.90000 -0.2750502 -9.9249498 -9.9249498
11 N1 III 12.00 21.56667 0.1257007 -9.6923674 -9.6923674
12 N2 III 24.05 25.01667 0.1257007 -1.0923674 -1.0923674
13 N3 III 41.05 36.25000 0.1257007 4.6742993 4.6742993
14 N4 III 52.05 50.58333 0.1257007 1.3409660 1.3409660
15 N5 III 47.00 32.90000 0.1257007 13.9742993 13.9742993
Let’s think about the residuals. Even though N2 had a greater yield than N1, the difference was not statistically significant because of the variability (residuals). What if we adjust for that variability? Would the differences become significant? We have already done the groundwork through statistical modelling: we calculated the fixed and random effects, as well as the residuals (errors).
According to the model equation of the Linear Mixed-Effects model, Y = Xβ + Zb + ε, where Y represents grain yield, the first observation breaks down as follows: 35.00 (N1, row I) consists of 21.56667 (fixed effect), 0.1493495 (random effect), and 13.2839838 (residual).
Now, I’ll obtain the predicted values (the fixed plus random effects, i.e., the yield with the residuals removed) using predict() and add them to the original data.
prediction=predict(model)
df$prediction=prediction
df
nitrogen row yield fixed_effects random_effects error error2 prediction
1 N1 I 35.00 21.56667 0.1493495 13.2839838 13.2839838 21.71602
2 N2 I 27.00 25.01667 0.1493495 1.8339838 1.8339838 25.16602
3 N3 I 35.00 36.25000 0.1493495 -1.3993495 -1.3993495 36.39935
4 N4 I 52.00 50.58333 0.1493495 1.2673172 1.2673172 50.73268
5 N5 I 29.00 32.90000 0.1493495 -4.0493495 -4.0493495 33.04935
6 N1 II 17.70 21.56667 -0.2750502 -3.5916165 -3.5916165 21.29162
7 N2 II 24.00 25.01667 -0.2750502 -0.7416165 -0.7416165 24.74162
8 N3 II 32.70 36.25000 -0.2750502 -3.2749498 -3.2749498 35.97495
9 N4 II 47.70 50.58333 -0.2750502 -2.6082831 -2.6082831 50.30828
10 N5 II 22.70 32.90000 -0.2750502 -9.9249498 -9.9249498 32.62495
11 N1 III 12.00 21.56667 0.1257007 -9.6923674 -9.6923674 21.69237
12 N2 III 24.05 25.01667 0.1257007 -1.0923674 -1.0923674 25.14237
13 N3 III 41.05 36.25000 0.1257007 4.6742993 4.6742993 36.37570
14 N4 III 52.05 50.58333 0.1257007 1.3409660 1.3409660 50.70903
15 N5 III 47.00 32.90000 0.1257007 13.9742993 13.9742993 33.02570
For the first observation, subtracting the residual (13.2839838) from the observed yield (35.00) gives 21.71602, the predicted value. In other words, the prediction is the yield with the residual removed: the sum of the fixed and random effects.
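For completeness, predict() on a fitted lmer model includes the random effects by default; setting re.form=NA returns the fixed-effects part only (a short sketch):
all.equal(predict(model), fitted(model))   # TRUE: default predictions are fixed + random effects
head(predict(model, re.form=NA))           # fixed-effects part only (drops the row intercepts)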
Let’s summarize the prediction values.
library(dplyr)
summary=data.frame(df %>%
  group_by(nitrogen) %>%
  dplyr::summarize(across(c(prediction),
    .fns=list(Mean=~mean(., na.rm=TRUE),
              SD=~sd(., na.rm=TRUE),
              n=~length(.),
              se=~sd(., na.rm=TRUE) / sqrt(length(.))))))
summary
nitrogen prediction_Mean prediction_SD prediction_n prediction_se
1 N1 21.56667 0.2384938 3 0.1376944
2 N2 25.01667 0.2384938 3 0.1376944
3 N3 36.25000 0.2384938 3 0.1376944
4 N4 50.58333 0.2384938 3 0.1376944
5 N5 32.90000 0.2384938 3 0.1376944
The standard deviation and standard error of the predictions are now identical across all nitrogen levels, and far smaller than before. For comparison, the summary below repeats the same statistics computed from the actual grain yield.
nitrogen yield_Mean yield_SD yield_n yield_se
1 N1 21.56667 11.977618 3 6.9152810
2 N2 25.01667 1.717799 3 0.9917717
3 N3 36.25000 4.313062 3 2.4901473
4 N4 50.58333 2.497165 3 1.4417389
5 N5 32.90000 12.610710 3 7.2807967
It’s noteworthy that the means remained unchanged, while only the residual variability was removed. This suggests that the central tendency of the grain yield data persists, with the adjustment acting only on the variability captured by the residuals.
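In fact, the common standard deviation of 0.2384938 is simply the spread of the three row intercepts, which is the only within-level variation left in the predictions; the model’s variance partitioning can also be inspected directly:
sd(random_effects$row[, "(Intercept)"])   # 0.2384938, matching prediction_SD for every nitrogen level
VarCorr(model)                            # variance attributed to row vs. residual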
Now let’s repeat the ANOVA and Tukey’s HSD test, this time on the predicted values.
library(agricolae)
anova= aov(prediction~ nitrogen, data=df)
summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
nitrogen 4 1541.6 385.4 6776 <2e-16 ***
Residuals 10 0.6 0.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TUKEY= HSD.test(anova, "nitrogen")
TUKEY
$groups
prediction groups
N4 50.58333 a
N3 36.25000 b
N5 32.90000 c
N2 25.01667 d
N1 21.56667 e
The current findings indicate that grain yield differs significantly among all nitrogen levels. It’s evident that adjusting for the residual variability has contributed to the overall significance of the experiment.