What is the F-ratio in statistics?

Today, I will explain the meaning of the F-value used to test for significance in statistical analysis. Let me give you an example. Suppose we want to determine whether yield differs among three varieties (A, B, and C). The total number of experimental units is 12 (3 varieties × 4 replicates).

What happens if there is a large difference in yield between varieties A and C? If so, the statistical program will calculate that there is a difference due to the treatment (which here refers to the variety).

This value is called SSTreat, which stands for Sum of Squares Treatment.

Understanding this daunting-looking mathematical formula even a little can be very helpful for grasping the concept. Here is the formula for SSTreat (Sum of Squares for Treatment):

SSTreat = Σ nᵢ(ȳᵢ. - ȳ..)²

What does this mean? First, we take the difference between the average yield (ȳᵢ.) of each variety (since there were four replicates, this is the average of those four values) and the overall average yield (ȳ..), and square it. Second, we multiply each squared difference by the number of replicates (nᵢ) and sum over the varieties. In other words, we calculate a sum of squared deviations.

Summary of basic concepts

Here are 5 pieces of data:

4, 5, 6, 3, 7

The mean of these numbers is 5. Now let's say we want to know how far each data point is from the mean. We can do this by subtracting the mean from each value.

4 - 5 = -1, 5 - 5 = 0, 6 - 5 = 1, 3 - 5 = -2, 7 - 5 = 2

Now, let's add up all of these differences from the mean. The sum will be zero. Each difference from the mean is called a "deviation", and the sum of all deviations from the mean is always zero. Therefore, using only deviations, we cannot know the level of distribution of this data. That's why we squared each deviation to prevent the sum from being zero.

(-1)² = 1, (0)² = 0, (1)² = 1, (-2)² = 4, (2)² = 4

The sum of squared deviations, which is 10, divided by the number of observations (n = 5), is called the variance. Since the variance is in squared units, we take the square root; this value is called the standard deviation. Thus, √(10/5) = √2 ≈ 1.414.

In the usual formula for the sample variance, we divide by the n - 1 degrees of freedom, but for ease of explanation we treated the data as an entire population and divided by n.
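The steps above can be checked with a quick Python sketch (plain Python, no libraries; the variable names are mine):

```python
# Deviations, variance, and standard deviation for the five data points above
data = [4, 5, 6, 3, 7]
n = len(data)
mean = sum(data) / n                    # 5.0

deviations = [x - mean for x in data]   # [-1.0, 0.0, 1.0, -2.0, 2.0]
print(sum(deviations))                  # 0.0 -- deviations always sum to zero

squared = [d ** 2 for d in deviations]  # [1.0, 0.0, 1.0, 4.0, 4.0]
variance = sum(squared) / n             # 10 / 5 = 2.0 (population variance)
std_dev = variance ** 0.5               # 1.414...
print(variance, std_dev)
```

Note that dividing by n here matches the population-variance simplification used in the text.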

Using this basic concept, let's do the calculation by hand.

There are three varieties, A, B, and C, each with four replicates. The average yield of A is 44, B is 50, and C is 41, and the overall mean is 45.

How can we calculate SSTreat (Sum of squares Treatment)?

It is:

{(44 - 45)² + (50 - 45)² + (41 - 45)²} × 4 = (1 + 25 + 16) × 4 = 168

Since there are four repetitions, we multiplied the value by 4. This can be formulated as the following mathematical equation:

That is, SSTreat (Sum of Squares for Treatment) = Σ nᵢ(ȳᵢ. - ȳ..)² = 168

What happens if the yields of A, B, and C are all the same? The sum of squares will be 0, which means no difference in yield is recognized. The differences among the varieties therefore constitute the "treatment effect".
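The SSTreat arithmetic above can be sketched in a few lines of Python (the variable names are mine, not from any particular package):

```python
# SSTreat: squared deviation of each variety mean from the overall mean,
# multiplied by the number of replicates (4)
variety_means = {"A": 44, "B": 50, "C": 41}
overall_mean = 45
reps = 4

ss_treat = sum(reps * (m - overall_mean) ** 2 for m in variety_means.values())
print(ss_treat)  # 168
```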



On the other hand, let's focus only on variety A. There are four replicates of variety A (39, 46, 48, and 43), and their mean is 44. But what if the replicate values were 155, 8, 12, and 1, with the same mean of 44? If the yield varies that wildly among replicates of the same variety, it becomes ambiguous whether the difference between varieties A and C is due to the treatment effect or to environmental error. The sum of squares of such errors is called SSE (Sum of Squares for Error).

The formula for SSE (Sum of Squares for Error) is:

SSE = Σ Σ (yᵢⱼ - ȳᵢ.)²

That is, the sum of the squared differences between each individual yield and the mean of its variety.

For Variety A (mean 44):
(39 - 44)² + (46 - 44)² + (48 - 44)² + (43 - 44)² = 46

For Variety B (mean 50):
(46 - 50)² + (52 - 50)² + (49 - 50)² + (53 - 50)² = 30

For Variety C (mean 41):
(41 - 41)² + (47 - 41)² + (39 - 41)² + (37 - 41)² = 56

Thus, SSE (Sum of squares Error) is 46 + 30 + 56 = 132.
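A short Python sketch of the SSE calculation above:

```python
# SSE: within each variety, sum the squared deviations from that variety's mean
yields = {
    "A": [39, 46, 48, 43],
    "B": [46, 52, 49, 53],
    "C": [41, 47, 39, 37],
}

ss_error = 0
for variety, values in yields.items():
    variety_mean = sum(values) / len(values)            # 44, 50, 41
    ss_error += sum((y - variety_mean) ** 2 for y in values)

print(ss_error)  # 132.0
```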



TSS (Total Sum of Squares) is the sum of SSTreat and SSE.

How is TSS calculated? It is the sum of the squared differences between each yield value and the overall mean. In other words, it involves subtracting the overall mean from each of the 12 individual yield data points, squaring the differences, and then adding them up.

Therefore, the value would be (39 - 45)² + (46 - 45)² + (48 - 45)² + (43 - 45)² + (46 - 45)² + … + (37 - 45)², which equals 300.

As mentioned earlier, TSS = SSTreat + SSE. In this case, it is observed that 300 = 168 + 132.
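The TSS calculation and the identity TSS = SSTreat + SSE can be verified in Python:

```python
# TSS: squared deviation of each of the 12 yields from the overall mean
all_yields = [39, 46, 48, 43, 46, 52, 49, 53, 41, 47, 39, 37]
overall_mean = sum(all_yields) / len(all_yields)       # 45.0

tss = sum((y - overall_mean) ** 2 for y in all_yields)
print(tss)  # 300.0, which equals SSTreat (168) + SSE (132)
```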



Let’s now consider the degrees of freedom in the analysis of variance. How many data points make up TSS? There are a total of 12. So the degrees of freedom is 11 (n-1).

df: TSS = n - 1 = 12 - 1 = 11 (where n is the total number of data points)

How many data points make up SSTreat? There are 3. So the degrees of freedom is 2 (n-1).

df: SSTreat = t-1 = 3-1 = 2 (where t is the number of treatments)

How many data points make up SSE? This is a bit confusing, but we can think of it as n - t: each of the t variety means uses up one degree of freedom, leaving n - t for the error.

df: SSE = n-t = 12-3 = 9

Now, let’s create an analysis of variance (ANOVA) table.

When we divide each Sum of Squares (SS) by its corresponding degrees of freedom (df), we get the Mean Square (MS). Let's call the MS for SSTreat the MST (Mean Square for Treatment) and the MS for SSE the MSE (Mean Square for Error) to distinguish them. So, what is the value of MST? It is 84 (= 168/2). The value of MSE is 14.67 (= 132/9).

Now, let’s finally calculate the F-value. The F-value is a ratio, that is, the ratio of MST to MSE.

F = MST / MSE = 84 / 14.67 = 5.73
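Putting the mean squares and the F-ratio together in Python:

```python
# Mean squares and F-ratio from the sums of squares and degrees of freedom
ss_treat, ss_error = 168, 132
df_treat, df_error = 3 - 1, 12 - 3   # t - 1 and n - t

mst = ss_treat / df_treat            # 84.0
mse = ss_error / df_error            # 14.666...
f_value = mst / mse                  # 5.727...

print(round(f_value, 2))  # 5.73
```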

What does it mean when the F-value increases? If MST increases (the treatment effect grows) and MSE decreases (the error shrinks), the F-ratio increases.

So, how can we interpret a large F-value? It means the difference between treatments is large relative to the environmental error, so we can say that "there is a difference between the treatments." Statisticians have worked out mathematically at what level this ratio becomes significant, and we call that threshold the critical value.



Let’s take a closer look at this critical value.

Go to https://pqrs.software.informer.com/ and download the PQRS program and run it!

Select F in the Distribution and enter the degrees of freedom for treatments and errors, which are 2 and 9, respectively. What is the critical value of F for α=0.05 in this case? PQRS makes it easy to check.

When the degrees of freedom for treatment and error are 2 and 9, the F-value corresponding to an upper-tail area of 0.05 is 4.26. This means that if our F-value is greater than 4.26, the tail area is less than 0.05, so the result can be considered significant! What is our F-value? It is 5.73, which lies to the right of 4.26. Therefore p < 0.05, indicating that there is a difference between varieties. To make this easier to see, let's change the F value in PQRS from 4.26 to 5.73.

How did the area change? It changed from 0.05 to 0.0248. Therefore, our p-value is 0.0248. In other words, when our F-value is 5.73, the corresponding p-value is 0.0248. Running a statistical software program will give us the same result.
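If you prefer not to use PQRS, the same critical value and p-value can be obtained in Python with SciPy (assuming it is installed):

```python
# Critical value and p-value from the F-distribution via scipy.stats
from scipy.stats import f

df_treat, df_error = 2, 9
critical = f.ppf(0.95, df_treat, df_error)  # upper 5% point of F(2, 9)
p_value = f.sf(5.73, df_treat, df_error)    # right-tail area at F = 5.73

print(round(critical, 2), round(p_value, 4))  # 4.26 0.0248
```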



The value can also be calculated using Excel.

You can use the function:

=FDIST(F-value, degrees of freedom of treatment, degrees of freedom of error)

For example, if you enter =FDIST(5.73, 2, 9) in Excel, the value 0.024832 will be returned. (In Excel 2010 and later, the equivalent function is =F.DIST.RT.)

To find the critical value, you can use the function:

=FINV(α, degrees of freedom of treatment, degrees of freedom of error)

For example, if you enter =FINV(0.05, 2, 9), the value 4.2565 will be returned. (The newer equivalent is =F.INV.RT.)



To verify the manual calculation using statistical software:

1) R Studio

# Generate the data
A <- c(39, 46, 48, 43)
B <- c(46, 52, 49, 53)
C <- c(41, 47, 39, 37)
Reps <- c(1, 2, 3, 4)
DataA <- data.frame(Reps, A, B, C)

# Reshape from wide to long format
DataB <- reshape2::melt(DataA[c("Reps", "A", "B", "C")], id.vars = c("Reps"))
colnames(DataB)[2] <- "Cultivar"
colnames(DataB)[3] <- "Yield"
DataB

# ANOVA
ANOVA <- aov(Yield ~ Cultivar, data = DataB)
summary(ANOVA)

2) SAS Studio

proc glm data=DATABASE.DATA;
	 class Cultivar Reps;
	 model Yield = Cultivar / ss1;
	 lsmeans Cultivar / adjust=tukey pdiff=all alpha=0.05 cl;
run;
quit;

3) Python

# Generate the data
import pandas
from pandas import DataFrame

source = {"Cultivar": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
          "Yield": [39, 46, 48, 43, 46, 52, 49, 53, 41, 47, 39, 37]}
DataA = DataFrame(source)
DataA

# ANOVA
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('Yield ~ C(Cultivar)', data=DataA).fit()
sm.stats.anova_lm(model, typ=1)

