[STAT Article] Steps to Calculate Log-Likelihood Prior to AIC and BIC: [Part 2] ANOVA model

November 15, 2024 JK

In my previous post, I explained how to calculate the Log-Likelihood, AIC, and BIC in a regression model. In this post, I will demonstrate the same concepts, but in the context of an ANOVA model.

Here I have one dataset.

if(!require(readr)) install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/refs/heads/main/Fertilizer%20(One%20Way%20ANOVA).csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))
head(df[sample(nrow(df), 10), ])

   Fertilizer Yield
9     Control  12.4
19       Slow  16.0
5     Control  11.8
17       Slow  15.8
7     Control  13.1
24       Fast   8.8
.
.
.

Let’s say this data represents yield in response to different fertilizer types (Control, Slow, and Fast), and I want to determine the effect of fertilizer type on yield. Therefore, I will perform a one-way ANOVA.

model= aov(Yield~ Fertilizer, data=df)
summary(model)

            Df Sum Sq Mean Sq F value Pr(>F)    
Fertilizer   2 207.68  103.84   394.3 <2e-16 ***
Residuals   27   7.11    0.26

Now, I observe that the effect of fertilizer on yield is significant. Next, I will calculate the Log-Likelihood, AIC, and BIC for this one-way ANOVA. The equation for the Log-Likelihood in ANOVA is shown below. It may seem tricky at first, but once you understand the principle, it becomes a piece of cake.

Reading my previous post would be helpful for understanding the process of these calculations.

In this dataset, there is one factor, which is fertilizer, and the replicates were not designated as blocks. Therefore, this is a one-way ANOVA without blocks. In other words, it follows a Completely Randomized Design (CRD).

The model equation is as follows:

y_ij= μ + τ_i + ϵ_ij 

Where:
y_ij = yield for fertilizer treatment 𝑖 and replicate 𝑗
μ = grand mean of yield
τi = Effect of fertilizer treatment 𝑖
εij = Residual error for the observation at treatment 𝑖 and replicate 𝑗

Based on this model equation, I will analyze the variation in the data.

x	y_ij	ȳ..	ȳ_i.	ȳi. – ȳ..	y_ij – ȳ_i.	y_ij – ȳ_..	(ȳ_i. – ȳ_..)²	(y_ij – ȳ_i.)²	(y_ij – ȳ_..)²
Fertilizer	Yield	Grand mean	Fertilizer mean	Fertilizer effect	Residuals	Total	Fertilizer effect^2	Residuals^2	Total^2
Control	12.2	12.46	12.13	-0.33	0.07	-0.26	0.11	0.00	0.07
Control	12.4	12.46	12.13	-0.33	0.27	-0.06	0.11	0.07	0.00
Control	11.9	12.46	12.13	-0.33	-0.23	-0.56	0.11	0.05	0.31
Control	11.3	12.46	12.13	-0.33	-0.83	-1.16	0.11	0.69	1.34
Control	11.8	12.46	12.13	-0.33	-0.33	-0.66	0.11	0.11	0.43
Control	12.1	12.46	12.13	-0.33	-0.03	-0.36	0.11	0.00	0.13
Control	13.1	12.46	12.13	-0.33	0.97	0.64	0.11	0.94	0.41
Control	12.7	12.46	12.13	-0.33	0.57	0.24	0.11	0.32	0.06
Control	12.4	12.46	12.13	-0.33	0.27	-0.06	0.11	0.07	0.00
Control	11.4	12.46	12.13	-0.33	-0.73	-1.06	0.11	0.53	1.12
Slow	16.6	12.46	15.83	3.37	0.77	4.14	11.38	0.59	17.17
Slow	15.8	12.46	15.83	3.37	-0.03	3.34	11.38	0.00	11.18
Slow	16.5	12.46	15.83	3.37	0.67	4.04	11.38	0.45	16.35
Slow	15.0	12.46	15.83	3.37	-0.83	2.54	11.38	0.69	6.47
Slow	15.4	12.46	15.83	3.37	-0.43	2.94	11.38	0.18	8.66
Slow	15.6	12.46	15.83	3.37	-0.23	3.14	11.38	0.05	9.88
Slow	15.8	12.46	15.83	3.37	-0.03	3.34	11.38	0.00	11.18
Slow	15.8	12.46	15.83	3.37	-0.03	3.34	11.38	0.00	11.18
Slow	16.0	12.46	15.83	3.37	0.17	3.54	11.38	0.03	12.56
Slow	15.8	12.46	15.83	3.37	-0.03	3.34	11.38	0.00	11.18
Fast	9.5	12.46	9.4	-3.05	0.09	-2.96	9.28	0.01	8.74
Fast	9.5	12.46	9.4	-3.05	0.09	-2.96	9.28	0.01	8.74
Fast	9.6	12.46	9.4	-3.05	0.19	-2.86	9.28	0.04	8.16
Fast	8.8	12.46	9.4	-3.05	-0.61	-3.66	9.28	0.37	13.37
Fast	9.5	12.46	9.4	-3.05	0.09	-2.96	9.28	0.01	8.74
Fast	9.8	12.46	9.4	-3.05	0.39	-2.66	9.28	0.15	7.06
Fast	9.1	12.46	9.4	-3.05	-0.31	-3.36	9.28	0.10	11.27
Fast	10.3	12.46	9.4	-3.05	0.89	-2.16	9.28	0.79	4.65
Fast	9.5	12.46	9.4	-3.05	0.09	-2.96	9.28	0.01	8.74
Fast	8.5	12.46	9.4	-3.05	-0.91	-3.96	9.28	0.83	15.66
						Σ	207.68	7.11	214.79

Reading my previous post would be helpful for understanding the process of these calculations.

■ Understanding Mean Absolute Error (MAE) in ANOVA: A Step-by-Step Guide to Calculation in Excel

If you can understand the equation of the Log-Likelihood, you can see the equation as follows:

I already calculate Sum of Squared Error (SSE), which is 7.11, and I’ll calculate Mean Squared Error (MSE) by dividing degree of freedom (n – t) where n is total sample number and t is the number of levels in the factor, which n is 30, and t is 27 (= 30 – 3 as there are 3 fertilizer types)

However, when calculating MSE for the Log-Likelihood, it is calculated as SSE/n, not SSE/(n-t)

Therefore, MSE is 7.11 / 30 ≈ 0.237

Now, it’s a piece of the cake to calculate the Log-Likelihood

LL = (-30/2)*ln(2*3.14*0.237) - (1/(2*0.237))*7.11 
   ≈ -20.97

The Log-Likelihood for this ANOVA model is around -20.97

Let’s verify this value is correct using R.

First, you can easily find SSE and MSE in the ANOVA table.

model= aov(Yield~ Fertilizer, data=df)
summary(model)

            Df Sum Sq Mean Sq F value Pr(>F)    
Fertilizer   2 207.68  103.84   394.3 <2e-16 ***
Residuals   27   7.11    0.26

The ANOVA table provide SSE value as 7.11, but MSE is calculated as SSE/(n-t), which is 7.11 / 27 (=30-3) ≈ 0.26, but as mentioned, for the Log-Likelihood, MSE will be calculated as 7.11/30 ≈ 0.237

To calculate the Log-Likelihood, R provides a simple code.

logLik(model)

'log Lik.' -20.97484 (df=4)

You can see the same value as what we calculate by hand. Therefore, we can verify the calculation for Log-Likelihood is correct.

df=4 correspond to one intercept parameter, two regression coefficients for the levels of the Fertilizer variable (with one level used as the reference category), and one variance parameter for the residuals.

Let’s understand how to calculate the Log-Likelihood manually. We’ll calculate it step by step in R, not simply using logLik(model).

residuals= resid(model)
SSE=sum(residuals^2)
MSE= SSE / length(df$Yield)
LL= -length(df$Yield)/2*(log(2*pi*MSE)) - SSE/(2*MSE)
cat("Log-likelihood:", LL, "\n")

Log-likelihood: -20.97484

Then, now we can calculate Akaike information criterion (AIC) and Bayesian Information Criterion (BIC).

The Akaike Information Criterion (AIC) is a statistical measure used to compare and evaluate different models based on their goodness of fit and complexity. It helps to identify the model that best balances fit (how well the model explains the data) and complexity (how many parameters the model uses).

The formula for AIC:

AIC= 2*k - 2*ln(L)

where:
k is the number of parameters in the model
ln(L) is the likelihood of the model

We already calculated Log-Likelihood, and k is 4 in the model. Also, given this, the number of parameters, k used in the AIC formula is 4

Therefore, AIC is calculated as 2*4 - 2*-20.97 = 49.94

let’s verify this value is correct using R.

AIC (model)
49.94968

My calculation is correct!!

Let’s calculate BIC. The Bayesian Information Criterion (BIC) is another model selection criterion, similar to the Akaike Information Criterion (AIC), but it applies a stronger penalty for models with more parameters. The goal of BIC, like AIC, is to balance model fit (how well the model explains the data) with model complexity (how many parameters the model uses).

The formula for BIC:

BIC= k*ln(n) - 2*ln(L)

where:
k is the number of parameters in the model
ln(L) is the likelihood of the model
n is the number of data points.

Again, we already calculated Log-Likelihood, and k is 4 in the model. Also, given this, the number of parameters, k used in the BIC formula is 4.

Therefore, AIC is calculated as 4*ln(30) - (2*-20.97) = 55.55

let’s verify this value is correct using R.

BIC (model)
55.55447

My calculation is correct!!

How about more factors are added?

The One-Way ANOVA model is the simplest model, but we’ll have more factors in the experimental fields. Let’s discuss again with another dataset.

Cultivar= rep(c("CV1","CV2"),each=12)
Nitrogen= rep(rep(c("N0","N1","N2","N3"), each=3),2)
Block= rep(c("I","II","III"),8)
Yield= c (99, 109, 89, 115, 142, 133, 121, 157, 142, 125, 150, 139, 82, 104, 99, 117, 125, 127, 145, 154, 154, 151, 166, 175)
df= data.frame(Cultivar,Nitrogen,Block,Yield)

head(df[sample(nrow(df), 10), ])

   Cultivar Nitrogen Block Yield
13      CV2       N0     I    82
17      CV2       N1    II   125
16      CV2       N1     I   117
18      CV2       N1   III   127
8       CV1       N2    II   157
21      CV2       N2   III   154
.
.
.

In this case, we want to analyze how yield is affected by nitrogen fertilizer across different genotypes. Both genotype and fertilizer would be treated as factors, while replicates are considered blocks, indicating a Randomized Complete Block Design (RCBD).

If you want to calculate the variation in the data for this Two-Way ANOVA with blocks in Excel, reading my previous post could help you understand the process behind these calculations.

■ Easy-to-Understand Guide to Factorial Experiments and Two-Way ANOVA

Now, I’ll analyze two-way ANOVA using R.

model= aov(Yield~ Cultivar+Nitrogen+Cultivar:Nitrogen+ factor(Block), data=df)

summary(model)
                  Df Sum Sq Mean Sq F value   Pr(>F)    
Cultivar           1    253     253    4.99 0.042327 *  
Nitrogen           3  10695    3565   70.17 1.12e-08 ***
factor(Block)      2   1505     752   14.81 0.000351 ***
Cultivar:Nitrogen  3   1040     347    6.82 0.004604 ** 
Residuals         14    711      51

In this ANOVA table, you can find SSE, and calculate MSE. SSE is 711, and MSE will be 711/24 = 29.625

Then, we can calculate the Log-Likelihood

LL = (-24/2)*ln(2*3.14*29.625) - (1/(2*29.625))*711
   ≈ -74.72

The Log-Likelihood for this ANOVA model is around -74.71186. Let’s verify this value is correct using R.

logLik(model)
'log Lik.' -74.72217 (df=11)

df = 11 correspond to the following: one intercept parameter, one regression coefficient for the levels of the genotype variable (Cultivar, with one level used as the reference category), three regression coefficients for the levels of the nitrogen variable (Nitrogen, with one level used as the reference category), two regression coefficients for the levels of the block variable (Block, with one level used as the reference category), three regression coefficients for the interaction between the genotype and nitrogen variables (1 for the genotype × 3 for nitrogen, excluding reference levels), and one parameter for the residual variance.

Total Parameters:
1(Intercept)+1(Genotype)+3(Nitrogen)+2(Block)+3(Interaction)+1(Residual Variance)= 11

Again, let’s calculate manually using R.

residuals= resid(model)
SSE=sum(residuals^2)
MSE= SSE / length(df$Yield)
LL= -length(df$Yield)/2*(log(2*pi*MSE)) - SSE/(2*MSE)
cat("Log-likelihood:", LL, "\n")

Log-likelihood: -74.72217

My calculation is correct. Let’s calculate AIC and BIC.

AIC= 2*k - 2*ln(L) = 2*11 - 2*-74.72217 
   = 171.4443

BIC= k*ln(n) - 2*ln(L) = 11*ln(24) - 2*-74.72217 
   = 184.4029

Using R, I’ll check my calculation is correct.

AIC (model)
171.4443

BIC (model)
184.4029

From, now on you can calculate the Log-Likelihood, AIC and BIC when you see the ANOVA table.

We aim to develop open-source code for agronomy (kimjk@agronomy4future.com)

Agronomy4future

Stories about cereals and statistics (plus coding). We aim to develop open-source code for agronomy.

[STAT Article] Steps to Calculate Log-Likelihood Prior to AIC and BIC: [Part 2] ANOVA model

November 15, 2024 JK

■ Understanding Mean Absolute Error (MAE) in ANOVA: A Step-by-Step Guide to Calculation in Excel

How about more factors are added?

■ Easy-to-Understand Guide to Factorial Experiments and Two-Way ANOVA