What is a nested model in statistics?

What is a nested model in statistics?


One tomato farmer is growing tomato seedlings, and all of sudden he wants to investigate the amount of calcium in leaves. So, he selected four tomato seedlings, and he randomly chose three leaves in each seedling and investigated the amount of calcium. He measured twice in each leaf.

This experimental design would be explained by below table.

y111 means the amount of calcium in the 1st seedling – 1st leaf – 1st replicate. Then, y432 will mean the outcome of the 4th seedling – 3rd leaf – 2nd replicate.



Now, I doubt whether this experimental design could be a factorial experiment or not.

The farmer did not have any rules to select leaves in each seedling. He just randomly selected 3 leaves. If he selected leaves with a certain rule (i.e. leaf at the most bottom, 3rd leaf from the bottom, and leaf at the most top), and he applied this rule to select leaves in all seedlings, this experimental design will have two factors (seedlings and leaves) and we can regard this experimental design as a factorial experiment. In this case, however, leaves cannot be a fixed factor.

Instead, we say the leave factor is nested in regard to the seedling factor.

This is the linear model of nested treatment.

yijk = μ + τi + β(i)j + ε(ij)k

.

where yijk represents the calcium content of the rep k, leave j and seedling i. The subscript (i)j indicates that the level j of factor B, in this case leave j, is nested to the level i of factor A, in this case seedling. That is, it makes reference to leave j of seedling i. 

The model could be explained by below equation.

It seems tricky, but if you follow me step by step, it’s just a simple arithmetic in elementary school.

# Download above data using R
Plant<- rep(c("P1","P2","P3","P4"), each=6)
Leaves<- rep(rep(c("L1","L2","L3"), each=2),4)
Rep<-rep(c(1,2),12)
Ca<- c(3.28, 3.09, 3.52, 3.48, 2.88, 2.80, 2.46, 2.44, 1.87, 1.92, 2.19, 2.19, 2.77, 2.66, 3.74, 3.44, 2.55, 2.55, 3.78, 3.87, 4.07, 4.12, 3.31, 3.31)
tomato<- data.frame(Plant,Leaves,Rep,Ca)

library(writexl)
write_xlsx (tomato,"C:/Users/Usuari/Desktop/tomato2.xlsx")
# check the pathway in your computer

Let’s say this is the data the farmer collected. For example, y111 = 3.28 (1st seedling, 1st leaf, 1st rep) and y432 = 3.31 (4th seedling, 3rd leaf, 2nd rep).

Now, I’ll summarize the mean of the data.

  1. ȳ is the grand mean of total data. It’s 3.01.
  2. ȳi.. is the mean of treatment i which is plant (seedling). The mean of each plant will be 3.18, 2.18, 2.95 and 3.74.
  3. ȳ(i)j is the mean of two values in leaves. This is because the leave factor (j) is nested in regard to the the seedling factor (j).

If we finished to calculate the mean of each case, now data partitioning is necessary.

Let’s go back to the below equation.

  1. ȳ is the grain mean which is 3.01
  2.  (ȳi.. – ȳ) is the difference between mean of plant and grand mean, which indicates the effect of plant. It’s written as τi. For example, in ȳ111, the calcium content was 3.28, the effect of plant (τi) is 0.16 (≈ 3.18 – 3.01)
  3. (ȳij. – ȳi..) is the difference between mean of leaves about two replicates (i.e. in ȳ111, 3.28 +3.09) / 2 = 3.19 as the leave factor is nested in regard to the seedling factor) and mean of plants. For example, in ȳ111, it’s 3.19 – 3.18 = 0.01. It’s β(i)j
  4. ȳijkȳij. is error which is the difference between each calcium value and mean of leaves about two replicates (β(i)j). In ȳ111, error is 3.28 – 3.19 = 0.10.

Therefore, ȳ111 is partitioning as

3.28 = 3.01 (μ) + 0.16 (τi) + 0.01 (β(i)j) + 0.10 (ε(ij)k)
yijk = μ + τi + β(i)j + ε(ij)k

Now we need to square each partitioning data to avoid the sum of each partitioning data becomes zero (Remember!! the sum of deviation is always zero). Simply we can use =SUMSQ() in excel to calculate the sum of square in each case.



How to analyze a nested model in R?

Now, I’ll analyze the above data using R.

Plant<- rep(c("P1","P2","P3","P4"), each=6)
Leaves<- rep(rep(c("L1","L2","L3"), each=2),4)
Rep<-rep(c(1,2),12)
Ca<- c(3.28, 3.09, 3.52, 3.48, 2.88, 2.80, 2.46, 2.44, 1.87, 1.92, 2.19, 2.19, 2.77, 2.66, 3.74, 3.44, 2.55, 2.55, 3.78, 3.87, 4.07, 4.12, 3.31, 3.31)
tomato<- data.frame(Plant,Leaves,Rep,Ca)
library(lme4)
result <- lmer(Ca ~ (1|Plant)+ (1|Plant:Leaves), data=tomato)
summary(result)

Let’s check the variance in random effect. The total variance is ≈ 0.532 (=0.161170 + 0.364603 + 0.006654). Then the % of variance in each case is

plant ≈ 0.69
plant:leaves ≈ 0.30
residual ≈ 0.01

It means the most variability of the data is mainly due to plants (seedlings) ≈ 69%, and the environmental error generates variability less than 1%. Therefore, based on this result, we can suggest that if we have to increase replicates in the experiment, it might be better to increase number of plants rather than increasing replicate of trials.

For example, in this experiment, the farmer investigated calcium content twice per leaf in 4 plants (seedlings). However, based on this statistical result, if the farmer should increase replicates, it would be better to increase number of plants (i.e. 4 -> 8 plants) rather than increasing trials (i.e. 2 -> 4 times).



How to analyze a nested model in SAS?

In R, REML does not show statistical outcomes. Instead, it shows variance component. Now, I’ll introduce how to analyze a nested model in SAS. I uploaded Excel file to SAS and set up data name as WORK.TOMATO.

First, I’ll choose estimation method as Restricted maximum likelihood (REML).

proc mixed data=WORK.TOMATO method=reml alpha=0.05;
	class Plant Leaves;
	model Ca= /;
	random Plant Plant(Leaves) /;
run;

It shows Variance Components. It’s the same as in R.

Second, I’ll choose estimation method as Type 1.

proc mixed data=WORK.TOMATO method=type1 alpha=0.05;
	class Plant Leaves;
	model Ca= /;
	random Plant Plant(Leaves) /;
run;

When choosing type 1, we can see the statistical outcomes.

Let’s check the sum of squares.

SS of plant ≈ 7.56
SS of plant (leaves) ≈ 2.63
SS residual ≈ 0.08

This value is the same as what we calculated by Excel. SAS shows both plant and leaves (nested to plants) are significant.

Comments are closed.