How to calculate Probability Density Function in Excel and R (feat. normal distribution)?

How to calculate Probability Density Function in Excel and R (feat. normal distribution)?



When we analyze data, there are instances where we need to display normal distribution graphs. These graphs are distinct from density graphs. Normal distribution graphs convey various concepts that simple bar graphs cannot depict. While it’s easy to create such graphs in Excel, understanding the underlying concept holds greater significance.

Today, I will explain what a Probability Density Function (PDF) is and demonstrate how to calculate it using Excel and R.



I have collected data on the individual wheat grain weights of 1,000 samples, including both CV1 and CV2 genotypes.

You can download above data in Kaggle
https://www.kaggle.com/datasets/agronomy4future/wheat-grain-weight-at-two-genotypes

We can represent the results as a bar graph by calculating the average grain weight and adding the standard error (standard deviation / √n).

Now, we observe that the average grain weight in CV1 is greater than in CV2, but this alone doesn’t provide insights into why CV1 is superior. However, by calculating the Probability Density Function (PDF), we can visualize the distribution of grain weights, revealing a more detailed story.



What is Probability Density Function (PDF)

First, select all the grain weight data (B column) and choose ‘Histogram.’

Then, you will see the following graph.

If you connect each bar with a line, it would look like this.

Now, I’d like to represent this line as a curve using a mathematical formula. We refer to this curve as a normal distribution or Gaussian distribution, which is a type of continuous probability distribution.

The main characteristics of a normal distribution include:

1. The shape of the bell curve is symmetrical, with the highest probability at the mean.
2. The area under the curve is equal to 1.
3. As the curve moves away from the mean, it gets closer to the x-axis but never intersects it. In other words, the probability never reaches 0.

The formula for the Probability Density Function (PDF) is as follows:

It may seem tricky, but it’s actually quite simple.

x is each observational value
µ is average
σ is standard deviation, and therefore σ2 is variance
π is pie which is 3.14159...
e is Euler's number

Practicing the calculation of PDF by hand will show you that it’s a straightforward process. Let’s calculate it step by step using the formula.

Next, we will calculate the Probability Density Function (PDF) and create a normal distribution graph. Set up the GW column as the x-axis and the PDF column as the y-axis. Then, select a scatter graph.

We’ve created a normal distribution graph for CV1.

Excel provides a function for this process. Let’s calculate PDF using this function.

=NORM.DIST(x, mean, stdev, FALSE)  

You will notice that the Excel function provides the same values as our manual calculation. However, by manually calculating, you gain a deeper understanding of the principles behind PDF. That’s why I explained the manual calculation process.

We can calculate the PDF for CV2 using the same process.

Now, let’s compare the two graphs.

Using the same data, we created two different graphs. While both graphs show the mean grain weight, the distribution graph provides deeper insights. It reveals that grain weights in CV1 are greater than in CV2, with CV1 having both smaller and larger grains in greater abundance compared to CV2.

We can extract values from specific percentiles for analysis and comparison.

Therefore, if data points are more than 30, I recommend to calculate Probability Density Function and see the normal distribution graph. It will tell you more stories!!



R provides simple codes about this

In Excel, sometimes it doesn’t appear to adhere to the main characteristics of a normal distribution.

1. The shape of the bell curve is symmetrical, with the highest probability at the mean.
2. The area under the curve is equal to 1.
3. As the curve moves away from the mean, it gets closer to the x-axis but never intersects it. In other words, the probability never reaches 0.

This happens because Excel’s appearance may vary depending on the sample size. With a larger sample size, like 10,000 instead of 1,000, Excel tends to show a clearer normal distribution. However, in R, we can circumvent this issue using stat_function().

First, I upload the Excel file to R. Before uploading the file, I rearrange the data as shown below.

Then, use the below code.

ggplot (data=dataA, aes(x=GW)) +

  stat_function(data=subset(dataA,Genotype=="CV1"),aes(x=GW),
  color="Blue",size=1,fun=dnorm, args=c(mean=mean(subset(dataA, 
  Genotype=="CV1")$GW),sd=sd(subset(dataA,Genotype=="CV1")$GW))) +  
  
  stat_function(data=subset(dataA,Genotype=="CV2"),aes(x=GW),
  color="Black",size=1,fun=dnorm, args=c(mean=mean(subset(dataA,
  Genotype=="CV2")$GW),sd=sd(subset(dataA,Genotype=="CV2")$GW))) +   
 
  scale_x_continuous(breaks=seq(0,80,10),limits=c(0,80)) + 
  scale_y_continuous(breaks=seq(0,0.05,0.01),limits=c(0,0.05)) +  
  labs(x="Grain weight (mg)", y="Frequency") +
  theme_grey(base_size=15, base_family="serif")+
  theme(axis.line= element_line(size=0.5, colour="black")) +
windows(width=5.5, height=5)

R provides an accurate normal distribution graph, even with a small sample size, unlike Excel. That’s why I prefer using R for data analysis involving PDF.

Here’s one more tip: you can add a histogram to the normal distribution graph.

ggplot (data=dataA, aes(x=GW, fill=Genotype)) +

  geom_histogram (aes(y=0.5*..density..), alpha=0.5, 
  position='identity', binwidth=1.5) +

  stat_function(data=subset(dataA,Genotype=="CV1"),aes(x=GW),
  color="Blue",size=1,fun=dnorm, args=c(mean=mean(subset(dataA, 
  Genotype=="CV1")$GW), sd=sd(subset(dataA,Genotype=="CV1")$GW))) + 
 
  stat_function(data=subset(dataA,Genotype=="CV2"),aes(x=GW),
  color="Black",size=1,fun=dnorm, args=c(mean=mean(subset(dataA, 
  Genotype=="CV2")$GW), sd=sd(subset(dataA,Genotype=="CV2")$GW))) +  
  
  scale_x_continuous(breaks=seq(0,80,10), limits=c(0,80)) +
  scale_y_continuous(breaks=seq(0,0.05,0.01), limits=c(0,0.05)) + 
  labs(x="Grain weight (mg)", y="Frequency") +
  theme(legend.position ="bottom",
        legend.key = element_rect(colour=NA, fill=NA), 
        legend.key.size = unit(0.5,"cm"),
        legend.title = element_text(face="plain", size=16, 
        color="Black"),
        legend.background = element_rect(fill = alpha("grey",.05)),
        legend.text = element_text(face="plain", size=14, 
        color="Black"),
        axis.line= element_line(size=0.5, colour="black")) +
windows(width=5.5, height=5)


Leave a Reply

If you include a website address in the comment section, I cannot see your comment as it will be automatically deleted and will not be posted. Please refrain from including website addresses.