How to calculate the optimum sample size for a 2-Sample t test (using R and the G*Power program)?
When we set up an experimental design, it is not easy to decide the sample size because we don’t know exactly how many samples our experiment requires. Of course, the more, the better. Eventually, however, we need to decide on an appropriate sample size according to our time and resources.
For example, if we want to know the average height of students at the University of Guelph, the best way is to measure the height of every student. According to Wikipedia, the total number of students is 29,923. So, if we take 29,923 measurements, we’ll know the exact mean height of students at the University of Guelph.
Make sense? No!! It’s nonsense.
We should ‘estimate’ the population parameter (the actual mean height of students at the University) from a sample. Therefore, the main question is: how many samples are necessary to estimate the population?
Here is one example. There is a commercially used hybrid corn cultivar. An agronomist wants to know whether ear dry weight per plant at the R1 stage differs between two groups grown under different nitrogen conditions (in corn, the R1 stage is regarded as flowering time).
Reading Matter
Corn Development Stages- R1 (Silking)
So, he measured ear dry weight (g) per plant at the R1 stage under two different pre-plant nitrogen conditions (0 lbs-N/ac and 260 lbs-N/ac). He simply selected 4 plants from each N condition and measured ear dry weight per plant.
| Group_A [0 lbs-N/ac pre-plant] | Group_B [260 lbs-N/ac pre-plant] |
| --- | --- |
| 23.3 g | 24.0 g |
| 20.2 g | 25.0 g |
| 23.3 g | 22.5 g |
| 19.9 g | 25.0 g |
| Mean: 21.7 g / Stdev: 1.88 g | Mean: 24.1 g / Stdev: 1.18 g |
He wants to compare the means of the two groups in order to know which group has the greater ear dry weight per plant. In this case, a 2-Sample t test would be used to analyze the data. So, now he wants to know how many samples are necessary to conduct the 2-Sample t test.
R Studio
In R, we can use pwr.t.test() in the pwr package to calculate the sample size for a 2-Sample t test.
pwr.t.test(d=, sig.level=, power=, type="two.sample", alternative="two.sided")
• d = effect size
• sig.level = significance level
• power = power of the test
• type = type of test
• alternative = one-sided or two-sided
Effect size: Cohen's d = (Mean 1 - Mean 2) / SDpooled
* SDpooled = pooled standard deviation
The effect size is calculated by dividing the difference between the two group means by the pooled standard deviation. Because there are two different groups, we need a common standard deviation that can be applied to both groups.
The pooled variance of two groups is calculated as:
Pooled variance = ((n1 - 1) × S1² + (n2 - 1) × S2²) / (n1 + n2 - 2)
When the number of samples is the same in both groups (n1 = n2), the pooled variance is simply the mean of the two group variances:
Pooled variance = (S1² + S2²) / 2
The standard deviation is the square root of the variance. So, after calculating the pooled variance, we obtain the pooled standard deviation by taking its square root.
√pooled variance = pooled standard deviation
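To make the step from the general pooled variance to the equal-sample-size shortcut explicit, here is a short worked derivation. The symbols S_p² (pooled variance) and n (the common group size) are notation introduced here for illustration only.

```latex
\[
\begin{aligned}
S_p^2 &= \frac{(n_1 - 1)\,S_1^2 + (n_2 - 1)\,S_2^2}{n_1 + n_2 - 2} \\
      &= \frac{(n - 1)\,(S_1^2 + S_2^2)}{2\,(n - 1)} \qquad \text{when } n_1 = n_2 = n \\
      &= \frac{S_1^2 + S_2^2}{2},
\qquad
SD_{pooled} = \sqrt{S_p^2}
\end{aligned}
\]
```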
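First, let’s calculate the pooled variance of the two groups.
Using the standard deviations in the table, the pooled variance is (1.88² + 1.18²) / 2 ≈ 2.463. The pooled standard deviation is then 1.57 (= √2.463).
If you copy the code below and paste it into R, you can obtain the pooled variance and pooled standard deviation from the raw data (the variance differs slightly in the later decimals because the values in the table are rounded).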
Group_A <- c(23.3, 20.2, 23.3, 19.9)
Group_B <- c(24.0, 25.0, 22.5, 25.0)
# mean, variance and number of samples in each group
Mean1 <- mean(Group_A)
var1 <- var(Group_A)
n1 <- length(Group_A)
Mean2 <- mean(Group_B)
var2 <- var(Group_B)
n2 <- length(Group_B)
# pooled variance (general formula, valid for equal or unequal n)
pooled_var <- ((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2)
# if n1 == n2, this is the same as the mean of the two variances
if (n1 == n2) pooled_var <- (var1 + var2) / 2
# pooled standard deviation (about 1.57)
pooled_sd <- sqrt(pooled_var)
pooled_var
pooled_sd
Now that we have the pooled standard deviation, 1.57, we can calculate the effect size.
Cohen's d = (Mean 1 - Mean 2) / SDpooled
effect size = (24.1 - 21.7) / 1.57 ≈ 1.53.
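The same calculation can be sketched as a small R function working directly on the raw measurements. The helper name `cohens_d` is hypothetical (it is not part of base R or the pwr package), and the sign convention (second group minus first) is an assumption made here so the result is positive. Because it uses unrounded means and variances, it gives about 1.56 rather than the 1.53 obtained from the rounded table values.

```r
# Hypothetical helper (not from any package): Cohen's d for two
# independent groups, using the pooled standard deviation.
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  pooled_var <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
  (mean(y) - mean(x)) / sqrt(pooled_var)
}

Group_A <- c(23.3, 20.2, 23.3, 19.9)
Group_B <- c(24.0, 25.0, 22.5, 25.0)

d_raw <- cohens_d(Group_A, Group_B)
print(round(d_raw, 2))
```

The small gap between 1.53 and 1.56 comes only from rounding the summary statistics before dividing; either value leads to a similar sample size here.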
Let’s calculate the sample size!!
install.packages("pwr")
library(pwr)
pwr.t.test (d=1.53, sig.level= 0.05, power=0.85, type="two.sample", alternative="two.sided")
For the given effect size (d = 1.53), let’s set the significance level to α = 0.05 and the power of the test (1-β) to 0.85.
If we increase the power of the test (1-β) while keeping α and the effect size the same, the required sample size increases. Increasing α instead would also raise the power, but then the type I error (a.k.a. α error) increases, which means we take more risk of rejecting the null hypothesis even though it is correct (I’ll explain this again below).
As an independent researcher, you need to decide the power of the test (1-β) based on your data, but normally I set it to 0.85.
This is a comparison between two independent groups, and therefore it’s a 2-Sample t test. So, for type, write type="two.sample". Also, the question is about ‘the same’ or ‘not the same’. So, for alternative, write alternative="two.sided".
R calculated the sample size: 9 samples are required in each group, so 18 samples in total are necessary to compare the means of the two groups. The agronomist collected only 4 samples per group, so he would need to more than double his sample size for a more reliable analysis.
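As a cross-check that needs no extra package, base R's `power.t.test()` (in the stats package) can solve for the same sample size. This is only a sketch under one assumption: passing `sd = 1` so that `delta` is read as the standardized effect size d. The returned n is fractional, so it is rounded up to whole plants.

```r
# Cross-check with base R (stats package); no installation needed.
# Assumption: sd = 1, so delta is interpreted as Cohen's d.
res <- power.t.test(delta = 1.53, sd = 1,
                    sig.level = 0.05, power = 0.85,
                    type = "two.sample", alternative = "two.sided")

# n is returned as a fractional number per group; round up.
n_per_group <- ceiling(res$n)
n_total <- 2 * n_per_group
print(c(per_group = n_per_group, total = n_total))
```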
G*Power program
As another way to determine the sample size, I’ll introduce the G*Power program, one of the most popular tools for calculating sample size.
https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower
You can download the program from the link above.
First, in Test family, select t tests and choose Means: Difference between two independent means (two groups), which corresponds to the 2-Sample t test.
Now, we need to input several values, but we already know what they are.
- Tails is two (as I said, this is about “the same” or “not the same”)
- Effect size = 1.53
- α= 0.05
- power (1-β) = 0.85
Sometimes, people who don’t understand the effect size simply accept one of the effect sizes (0.2, 0.5 or 0.8) that G*Power provides as a guide.
It’s totally wrong!!
You should calculate the effect size based on your data.
It’s as simple as Cohen's d = (Mean 1 - Mean 2) / SDpooled
G*Power says we need 9 samples per group, so 18 samples are required in total. This is the same result that R calculated.
FYI
Actually, when using G*Power, we do not need to calculate the effect size by hand because G*Power can calculate it for us.
In the main window, click the ‘Determine’ button and a new sub-window opens. When the sample size is the same in both groups (n1 = n2), enter the mean and standard deviation of each group, then click the ‘Calculate’ button. You will see an effect size of ≈ 1.53, which is the same value we calculated with Cohen's d = (Mean 1 - Mean 2) / SDpooled
You can transfer this value by clicking the ‘Calculate and transfer to main window’ button.
How does the sample size increase when the power of the test (1-β) increases?
G*Power provides an excellent graph of how the sample size changes with the power of the test (1-β).
In the main window, click the ‘X-Y plot for a range of values’ button, and then click the ‘Draw plot’ button. G*Power plots the change in sample size according to the power of the test (1-β). When the power of the test (1-β) is 0.85, 18 samples in total are necessary, but when the power of the test (1-β) is 0.95, we need 24 samples (12 samples per group).
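A similar table of sample size against power can be sketched in R. This is a rough stand-in for the G*Power plot, not a reproduction of it: it uses base R's `power.t.test()` with the assumption `sd = 1` (so `delta` is Cohen's d), and the grid of power values is chosen here only for illustration. Values right at a rounding boundary may differ by one sample from what G*Power displays.

```r
# Required sample size per group across several power levels,
# at fixed effect size d = 1.53 and alpha = 0.05.
# Assumption: sd = 1, so delta is interpreted as Cohen's d.
powers <- c(0.80, 0.85, 0.90, 0.95)

n_raw <- sapply(powers, function(p) {
  power.t.test(delta = 1.53, sd = 1, sig.level = 0.05, power = p,
               type = "two.sample", alternative = "two.sided")$n
})

power_table <- data.frame(
  power       = powers,
  n_per_group = ceiling(n_raw),      # round up to whole samples
  n_total     = 2 * ceiling(n_raw)
)
print(power_table)

# Optional quick plot, loosely analogous to G*Power's X-Y plot:
# plot(powers, 2 * ceiling(n_raw), type = "b",
#      xlab = "Power (1 - beta)", ylab = "Total sample size")
```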
Why does the required sample size go up? When the power of the test (1-β) increases, β decreases (since power is 1-β). At a fixed sample size, α and β work against each other: the only way to decrease β is to increase α.
Think about it!! Suppose your p-value was 0.075. We say it’s not significant (when α = 0.05). However, if you set α = 0.10, the same p-value becomes statistically significant. When α is increased, it is easier to reach statistical significance, but the type I error (a.k.a. α error) also increases.
type I error (a.k.a. α error) = the null hypothesis is correct, but you reject it.
Under α = 0.05, you fail to reject the null hypothesis because the p-value is 0.075 > 0.05, but under α = 0.10, you reject the null hypothesis because 0.075 < 0.10. The null hypothesis might be correct, so with the increased α you take more risk of rejecting it wrongly. That is why we keep α fixed at 0.05 in the sample size calculation, and then the only way to reduce β (that is, to increase the power of the test) is to take more samples. However, there is no single correct answer for the power of the test (1-β). As independent researchers, we should decide it, and therefore the sample size, based on our own data.
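The trade-off between α and power at a fixed sample size can be checked numerically. The sketch below holds n at 9 per group and d at 1.53 and lets base R's `power.t.test()` solve for the power at three α levels; the α grid and the `sd = 1` convention are assumptions made here for illustration.

```r
# Power at a fixed sample size (n = 9 per group, d = 1.53)
# for three significance levels.
# Assumption: sd = 1, so delta is interpreted as Cohen's d.
alphas <- c(0.01, 0.05, 0.10)

power_at_n9 <- sapply(alphas, function(a) {
  power.t.test(n = 9, delta = 1.53, sd = 1, sig.level = a,
               type = "two.sample", alternative = "two.sided")$power
})

print(data.frame(alpha = alphas, power = round(power_at_n9, 3)))
```

With n fixed, a larger α buys more power only at the cost of a higher type I error rate, which is why α is usually held at 0.05 and the sample size is increased instead.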
However, data variability is what matters most when deciding the sample size.
You can choose the power of the test (1-β) based on your data, and you might be able to take more samples if you can manage it. When the power of the test (1-β) is 0.85, you need 9 samples per group, while you need 12 samples per group when the power of the test (1-β) is 0.95. You can adjust the sample size based on your capacity.
However, the most important factor in determining the sample size is not the power of the test (1-β) itself, but the data variability.
| Group_A [0 lbs-N/ac pre-plant] | Group_B [260 lbs-N/ac pre-plant] |
| --- | --- |
| 23.3 g | 5.3 g |
| 20.2 g | 25.0 g |
| 23.3 g | 22.5 g |
| 3.2 g | 25.0 g |
| Mean: 17.5 g / Stdev: 9.64 g | Mean: 19.5 g / Stdev: 9.51 g |
Here is another data set. The agronomist found that some plants had extremely low ear dry weight, but he included these plants too and calculated the sample size.
Then, how many samples are required?
Because each group contains an extreme outlier, the effect size drops to about 0.21 (= (19.5 - 17.5) / 9.58), and 826 samples in total (413 per group) are required. Due to the higher data variability, a much larger sample size is needed.
In this case, choosing the power of the test (1-β) would be almost meaningless. Even if we decrease the power of the test (1-β), we will still need lots of samples.
What I want to point out is that the power of the test (1-β) is not the main factor that determines the sample size. It’s just a statistical guide. Data variability, however, is a critical factor in determining the sample size.
So, what should we do? Always monitor outliers in the field, and decide whether to include or exclude them.
Agronomy research is about coping with outliers in the field.
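The effect of variability can be sketched in R from the summary values in the table above (means 17.5 and 19.5, standard deviations 9.64 and 9.51). The example uses base R's `power.t.test()` and passes the mean difference and pooled standard deviation separately, which is an assumption about how to set the call up, not necessarily how the original figure was produced. The result should land close to the 413 per group reported above; a difference of a sample or two is possible because the inputs are rounded.

```r
# Sample size for the high-variability data set, from the
# rounded summary statistics in the table.
mean_A <- 17.5; sd_A <- 9.64
mean_B <- 19.5; sd_B <- 9.51

# Equal group sizes, so pooled variance is the mean of the variances.
sd_pooled <- sqrt((sd_A^2 + sd_B^2) / 2)
d_noisy   <- (mean_B - mean_A) / sd_pooled

res_noisy <- power.t.test(delta = mean_B - mean_A, sd = sd_pooled,
                          sig.level = 0.05, power = 0.85,
                          type = "two.sample", alternative = "two.sided")

n_noisy <- ceiling(res_noisy$n)   # per group, rounded up
print(c(effect_size = round(d_noisy, 2),
        per_group = n_noisy, total = 2 * n_noisy))
```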