A Practical Approach to Linear Mixed-Effects Modeling in R
A Linear Mixed-Effects Model (LMM) is a statistical model that combines both fixed effects and random effects to analyze data with repeated measurements or hierarchical structure.
Let’s break down the key components and concepts of a Linear Mixed-Effects Model:
1) Fixed Effects:
- Definition: Fixed effects are parameters that represent population-level effects. These effects are assumed to be constant across all levels or groups in the data.
- Role: Fixed effects capture the overall relationship between the predictors and the response variable. In simpler terms, they describe the average effect that a predictor has on the response across the entire population.
- Example: The type of fertilizer we use is a fixed effect, because its effect is assumed to be the same across all planting rows.
2) Random Effects:
- Definition: Random effects are parameters that represent group-specific or subject-specific effects. These effects are allowed to vary across different levels or groups in the data.
- Role: Random effects account for variability at different levels of the data hierarchy. They acknowledge that observations within the same group or subject may be more similar to each other than to observations in other groups.
- Example: Each planting row might have its own soil quality, sunlight exposure, or other factors that influence crop yield uniquely.
3) Linear Mixed-Effects Model Equation:
The general equation of a Linear Mixed-Effects Model can be written as: Y= Xβ + Zb + ε
- Y: response variable or dependent variable
- X: the fixed-effects design matrix
- β: fixed effects coefficients
- Z: the random-effects design matrix
- b: random effects coefficients
- ε: residuals
- For example, we’re trying to predict grain yield based on fixed effects (like fertilizer type) and random effects (planting row), and the equation will be:
y= Fixed Effects + Random Effects + Error
4) Estimation:
- Fixed Effects: Estimated jointly with the variance components; given those variances, the estimates are generalized least squares (GLS) solutions rather than ordinary least squares (OLS).
- Random Effects: The variance components are estimated by Maximum Likelihood (ML) or Restricted Maximum Likelihood (REML), and the individual random effects are then predicted as conditional modes (BLUPs). A brief sketch follows below.
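To make the estimation step concrete, here is a minimal lme4 sketch using the sleepstudy data that ships with the package (the formula, data, and object names are illustrative and not part of the crop example below):
library(lme4)
# Random-intercept model fit two ways: REML (the default) and ML
fit_reml= lmer(Reaction ~ Days + (1|Subject), data=sleepstudy, REML=TRUE)
fit_ml= lmer(Reaction ~ Days + (1|Subject), data=sleepstudy, REML=FALSE)
fixef(fit_reml)    # fixed-effect estimates
VarCorr(fit_reml)  # variance components for the random intercept and the residual
ranef(fit_reml)    # predicted random effects (conditional modes / BLUPs)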
In summary, Linear Mixed-Effects Models are a powerful statistical tool for analyzing data with hierarchical or clustered structures, accounting for both fixed and random effects to provide more accurate and flexible modeling. They provide a way to handle complex data structures while making reasonable assumptions about the relationships within and between groups.
Let’s practice applying this model and learn how it is used in actual data analysis.
Here is a dataset. Let’s consider this as grain yield data based on different nitrogen levels in various planting rows.
library(lme4)
library(Matrix)
nitrogen=rep(c("N1", "N2", "N3", "N4", "N5"), 3)
row=rep(c("I", "II", "III"), each=5)
yield=c(35, 27, 35, 52, 29, 17.7, 24, 32.7, 47.7, 22.7, 12, 24.05, 41.05, 52.05, 47)
df=data.frame(nitrogen, row, yield)
df
nitrogen row yield
1 N1 I 35.00
2 N2 I 27.00
3 N3 I 35.00
4 N4 I 52.00
5 N5 I 29.00
6 N1 II 17.70
7 N2 II 24.00
8 N3 II 32.70
9 N4 II 47.70
10 N5 II 22.70
11 N1 III 12.00
12 N2 III 24.05
13 N3 III 41.05
14 N4 III 52.05
15 N5 III 47.00
Now, I’ll create a bar graph. First, let’s summarize the data using the dplyr package.
library(dplyr)
summary=data.frame(df %>%
  group_by(nitrogen) %>%
  dplyr::summarize(across(c(yield),
    .fns=list(Mean=~mean(., na.rm=TRUE),
              SD=~sd(., na.rm=TRUE),
              n=~length(.),
              se=~sd(., na.rm=TRUE) / sqrt(length(.))))))
summary
nitrogen yield_Mean yield_SD yield_n yield_se
1 N1 21.56667 11.977618 3 6.9152810
2 N2 25.01667 1.717799 3 0.9917717
3 N3 36.25000 4.313062 3 2.4901473
4 N4 50.58333 2.497165 3 1.4417389
5 N5 32.90000 12.610710 3 7.2807967
Next, I’ll generate a bar graph to visually represent the summarized data.
library(ggplot2)
windows(width=5.5, height=5)   # open a plotting device (Windows only)
ggplot(data=summary, aes(x=nitrogen, y=yield_Mean)) +
  geom_bar(stat="identity", position="dodge", width=0.7, linewidth=1, fill="grey50", color="grey15") +
  geom_errorbar(aes(ymin=yield_Mean-yield_se, ymax=yield_Mean+yield_se),
                position=position_dodge(0.9), width=0.5) +
  scale_y_continuous(breaks=seq(0,70,10), limits=c(0,70)) +
  labs(x="Nitrogen", y="Yield") +
  theme_classic(base_size=18, base_family="serif") +
  theme(axis.line=element_line(linewidth=0.5, colour="black"))
I’ll conduct statistical analysis to determine if there are any differences in yield among different nitrogen levels.
library(agricolae)
anova= aov(yield~ nitrogen, data=df)
summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
nitrogen 4 1541.6 385.4 5.834 0.0109 *
Residuals 10 660.6 66.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TUKEY= HSD.test(anova, "nitrogen")
TUKEY
$groups
yield groups
N4 50.58333 a
N3 36.25000 ab
N5 32.90000 ab
N2 25.01667 b
N1 21.56667 b
attr(,"class")
[1] "group"
The analysis suggests that N4 differs significantly only from N1 and N2: there is no significant difference between N3 and N4, between N5 and N4, or among N1, N2, N3, and N5. The lack of separation between these nitrogen levels is likely attributable to the variability present in the data.
Statistical modelling
In this case, a Linear Mixed-Effects model lets us account for the variability attributable to planting row without changing the mean grain yield at each nitrogen level.
Initially, I assume that the planting row serves as a random factor. Each planting row may possess unique soil quality, sunlight exposure, or other factors influencing crop yield. This row-to-row variation is expected to inflate the overall variability in the data. Therefore, I’ll account for the variability caused by planting row through a Linear Mixed-Effects model.
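Before fitting, a quick check of the row means (plain base R) shows the kind of baseline shift the random intercept is meant to absorb:
aggregate(yield ~ row, data=df, FUN=mean)
# rows I, II, and III average about 35.60, 28.96, and 35.23, so row II sits noticeably lower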
So, I’ll create a model according to my assumption.
library(lme4)
library(Matrix)
model= lmer(yield ~ 0 + nitrogen + (1|row), REML=TRUE, data=df)
I use the lme4 package in R, setting nitrogen as a fixed factor and planting row as a random factor.
In this model, the 0 + term removes the intercept (constant), so the relationship between the independent variable (nitrogen) and the dependent variable (yield) is modeled directly, with one fixed-effect estimate per nitrogen level. The term (1|row) specifies a random intercept for the variable “row,” meaning the intercepts for the different levels of “row” are allowed to vary randomly. In summary, the model expresses the relationship between nitrogen and yield without an overall intercept, but with a random intercept for each level of the “row” variable. This is useful when we expect different rows to have different baseline yields and want to account for that variability in the random intercepts.
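As a side note, the 0 + coding only changes how the fixed effects are reported, not the fit itself. Here is a small sketch (the object names model_cell and model_treat are mine) comparing it with the default intercept coding:
model_cell= lmer(yield ~ 0 + nitrogen + (1|row), REML=TRUE, data=df)   # cell-means coding: one estimate per nitrogen level
model_treat= lmer(yield ~ nitrogen + (1|row), REML=TRUE, data=df)      # treatment coding: intercept (N1) plus differences
fixef(model_cell)                                    # level means
fixef(model_treat)                                   # intercept and offsets from N1
all.equal(fitted(model_cell), fitted(model_treat))   # TRUE: identical fitted values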
Now, let’s extract the fixed and random effect coefficients.
fixed_effects=fixef(model)
fixed_effects
nitrogenN1 nitrogenN2 nitrogenN3 nitrogenN4 nitrogenN5
21.56667 25.01667 36.25000 50.58333 32.90000
random_effects=ranef(model)
random_effects
$row
(Intercept)
I 0.1493495
II -0.2750502
III 0.1257007
Next, I’ll add these estimates alongside the yield data.
df$fixed_effects= fixed_effects[paste0("nitrogen", df$nitrogen)]   # map each level's estimate onto its rows
random_effects_A= random_effects$row["I", "(Intercept)"]
random_effects_B= random_effects$row["II", "(Intercept)"]
random_effects_C= random_effects$row["III", "(Intercept)"]
df$random_effects[df$row=="I"]= random_effects_A
df$random_effects[df$row=="II"]= random_effects_B
df$random_effects[df$row=="III"]= random_effects_C
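A more compact way to do the same mapping (my own one-liner, equivalent result) indexes the random-effects table by the row labels:
df$random_effects= random_effects$row[as.character(df$row), "(Intercept)"]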
Let’s check that all the data was added correctly.
df
nitrogen row yield fixed_effects random_effects
1 N1 I 35.00 21.56667 0.1493495
2 N2 I 27.00 25.01667 0.1493495
3 N3 I 35.00 36.25000 0.1493495
4 N4 I 52.00 50.58333 0.1493495
5 N5 I 29.00 32.90000 0.1493495
6 N1 II 17.70 21.56667 -0.2750502
7 N2 II 24.00 25.01667 -0.2750502
8 N3 II 32.70 36.25000 -0.2750502
9 N4 II 47.70 50.58333 -0.2750502
10 N5 II 22.70 32.90000 -0.2750502
11 N1 III 12.00 21.56667 0.1257007
12 N2 III 24.05 25.01667 0.1257007
13 N3 III 41.05 36.25000 0.1257007
14 N4 III 52.05 50.58333 0.1257007
15 N5 III 47.00 32.90000 0.1257007
Let’s go back to the model equation of Linear Mixed-Effects model.
Y= Xβ + Zb + ε
where y= Fixed Effects + Random Effects + Error
According to this equation, we can easily calculate errors.
df$error= df$yield - df$fixed_effects - df$random_effects
df
nitrogen row yield fixed_effects random_effects error
1 N1 I 35.00 21.56667 0.1493495 13.2839838
2 N2 I 27.00 25.01667 0.1493495 1.8339838
3 N3 I 35.00 36.25000 0.1493495 -1.3993495
4 N4 I 52.00 50.58333 0.1493495 1.2673172
5 N5 I 29.00 32.90000 0.1493495 -4.0493495
6 N1 II 17.70 21.56667 -0.2750502 -3.5916165
7 N2 II 24.00 25.01667 -0.2750502 -0.7416165
8 N3 II 32.70 36.25000 -0.2750502 -3.2749498
9 N4 II 47.70 50.58333 -0.2750502 -2.6082831
10 N5 II 22.70 32.90000 -0.2750502 -9.9249498
11 N1 III 12.00 21.56667 0.1257007 -9.6923674
12 N2 III 24.05 25.01667 0.1257007 -1.0923674
13 N3 III 41.05 36.25000 0.1257007 4.6742993
14 N4 III 52.05 50.58333 0.1257007 1.3409660
15 N5 III 47.00 32.90000 0.1257007 13.9742993
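As a cross-check on this manual decomposition, the same pieces can be pulled straight from the fitted model’s design matrices with lme4’s getME() (a sketch; X, Z, beta, b, and decomp are my own names):
X= getME(model, "X")          # fixed-effects design matrix (15 x 5)
Z= getME(model, "Z")          # random-effects design matrix (15 x 3, sparse)
beta= fixef(model)            # fixed-effect coefficients
b= getME(model, "b")          # conditional modes of the random effects
decomp= as.numeric(X %*% beta) + as.numeric(Z %*% b)                     # Xβ + Zb
all.equal(decomp, as.numeric(fitted(model)))                             # TRUE
all.equal(as.numeric(df$yield - decomp), as.numeric(residuals(model)))   # TRUE: ε = Y - Xβ - Zb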
Alternatively, we can obtain the same errors directly with residuals().
errors=residuals(model)
df$error2=errors
df
nitrogen row yield fixed_effects random_effects error error2
1 N1 I 35.00 21.56667 0.1493495 13.2839838 13.2839838
2 N2 I 27.00 25.01667 0.1493495 1.8339838 1.8339838
3 N3 I 35.00 36.25000 0.1493495 -1.3993495 -1.3993495
4 N4 I 52.00 50.58333 0.1493495 1.2673172 1.2673172
5 N5 I 29.00 32.90000 0.1493495 -4.0493495 -4.0493495
6 N1 II 17.70 21.56667 -0.2750502 -3.5916165 -3.5916165
7 N2 II 24.00 25.01667 -0.2750502 -0.7416165 -0.7416165
8 N3 II 32.70 36.25000 -0.2750502 -3.2749498 -3.2749498
9 N4 II 47.70 50.58333 -0.2750502 -2.6082831 -2.6082831
10 N5 II 22.70 32.90000 -0.2750502 -9.9249498 -9.9249498
11 N1 III 12.00 21.56667 0.1257007 -9.6923674 -9.6923674
12 N2 III 24.05 25.01667 0.1257007 -1.0923674 -1.0923674
13 N3 III 41.05 36.25000 0.1257007 4.6742993 4.6742993
14 N4 III 52.05 50.58333 0.1257007 1.3409660 1.3409660
15 N5 III 47.00 32.90000 0.1257007 13.9742993 13.9742993
Let’s think about the residuals. Even though N2 had a greater yield than N1, the difference was not statistically significant because of the variability (residuals). What if we adjust for that variability? Would the differences become significant? We have already done the groundwork through statistical modelling: we calculated the fixed and random effects, as well as the residuals (errors).
According to the model equation of the Linear Mixed-Effects model, Y = Xβ + Zb + ε, where Y represents grain yield, the first observation breaks down as follows: 35.00 (N1, row I) consists of 21.56667 (fixed effect), 0.1493495 (random effect), and 13.2839838 (residual).
Now, I’ll obtain the predicted values (the fixed plus random effects, i.e., the yield with the residuals removed) using predict() and add them to the original data.
prediction=predict(model)
df$prediction=prediction
df
nitrogen row yield fixed_effects random_effects error error2 prediction
1 N1 I 35.00 21.56667 0.1493495 13.2839838 13.2839838 21.71602
2 N2 I 27.00 25.01667 0.1493495 1.8339838 1.8339838 25.16602
3 N3 I 35.00 36.25000 0.1493495 -1.3993495 -1.3993495 36.39935
4 N4 I 52.00 50.58333 0.1493495 1.2673172 1.2673172 50.73268
5 N5 I 29.00 32.90000 0.1493495 -4.0493495 -4.0493495 33.04935
6 N1 II 17.70 21.56667 -0.2750502 -3.5916165 -3.5916165 21.29162
7 N2 II 24.00 25.01667 -0.2750502 -0.7416165 -0.7416165 24.74162
8 N3 II 32.70 36.25000 -0.2750502 -3.2749498 -3.2749498 35.97495
9 N4 II 47.70 50.58333 -0.2750502 -2.6082831 -2.6082831 50.30828
10 N5 II 22.70 32.90000 -0.2750502 -9.9249498 -9.9249498 32.62495
11 N1 III 12.00 21.56667 0.1257007 -9.6923674 -9.6923674 21.69237
12 N2 III 24.05 25.01667 0.1257007 -1.0923674 -1.0923674 25.14237
13 N3 III 41.05 36.25000 0.1257007 4.6742993 4.6742993 36.37570
14 N4 III 52.05 50.58333 0.1257007 1.3409660 1.3409660 50.70903
15 N5 III 47.00 32.90000 0.1257007 13.9742993 13.9742993 33.02570
For the first observation, subtracting the residual (13.2839838) from the observed yield (35.00) gives 21.71602, the predicted value. In other words, the prediction is the yield with the residual removed: the sum of the fixed and random effects.
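For completeness, predict() on a fitted lmer model includes the random effects by default; setting re.form=NA returns the fixed-effects part only (a short sketch):
all.equal(predict(model), fitted(model))   # TRUE: default predictions are fixed + random effects
head(predict(model, re.form=NA))           # fixed-effects part only (drops the row intercepts)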
Let’s summarize the prediction values.
library(dplyr)
summary=data.frame(df %>%
  group_by(nitrogen) %>%
  dplyr::summarize(across(c(prediction),
    .fns=list(Mean=~mean(., na.rm=TRUE),
              SD=~sd(., na.rm=TRUE),
              n=~length(.),
              se=~sd(., na.rm=TRUE) / sqrt(length(.))))))
summary
nitrogen prediction_Mean prediction_SD prediction_n prediction_se
1 N1 21.56667 0.2384938 3 0.1376944
2 N2 25.01667 0.2384938 3 0.1376944
3 N3 36.25000 0.2384938 3 0.1376944
4 N4 50.58333 0.2384938 3 0.1376944
5 N5 32.90000 0.2384938 3 0.1376944
The standard deviation and standard error of the predictions are now identical across all nitrogen levels, and far smaller than before. For comparison, the summary below repeats the same statistics computed from the actual grain yield.
nitrogen yield_Mean yield_SD yield_n yield_se
1 N1 21.56667 11.977618 3 6.9152810
2 N2 25.01667 1.717799 3 0.9917717
3 N3 36.25000 4.313062 3 2.4901473
4 N4 50.58333 2.497165 3 1.4417389
5 N5 32.90000 12.610710 3 7.2807967
It’s noteworthy that the means remained unchanged, while only the residual variability was removed. This suggests that the central tendency of the grain yield data persists, with the adjustment acting only on the variability captured by the residuals.
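In fact, the common standard deviation of 0.2384938 is simply the spread of the three row intercepts, which is the only within-level variation left in the predictions; the model’s variance partitioning can also be inspected directly:
sd(random_effects$row[, "(Intercept)"])   # 0.2384938, matching prediction_SD for every nitrogen level
VarCorr(model)                            # variance attributed to row vs. residual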
Now let’s repeat the ANOVA and Tukey’s HSD test, this time on the predicted values.
library(agricolae)
anova= aov(prediction~ nitrogen, data=df)
summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
nitrogen 4 1541.6 385.4 6776 <2e-16 ***
Residuals 10 0.6 0.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TUKEY= HSD.test(anova, "nitrogen")
TUKEY
$groups
prediction groups
N4 50.58333 a
N3 36.25000 b
N5 32.90000 c
N2 25.01667 d
N1 21.56667 e
The current findings indicate that grain yield differs significantly among all nitrogen levels. It’s evident that adjusting for the residual variability has contributed to the overall significance of the experiment.