[R package] Probability Distribution and Z-Score Calculation Function (feat. probdistz)

August 31, 2024 JK

■ Introduction

■ What is Probability Density Function (PDF) and Cumulative Distribution Function (CDF): How to calculate using Excel and R?

In my previous post, I explained what the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF) are. I also explained the formula for the PDF and demonstrated how to manually calculate it in Excel.

Additionally, I mentioned the Excel function that performs the same calculation for the PDF, as follows:

 =NORM.DIST(x, mean, stdev, FALSE)

I then introduced how to create a probability distribution graph in R using stat_function(). To use stat_function(), you need to specify the mean and standard deviation of your data in the code. For example, this is an example of the stat_function().

ggplot(data=df, aes(x=yield)) +
   stat_function(data=df ,aes(x=yield),color="blue",size=1,
     fun=dnorm, args=c(mean=mean(df$yield), sd=sd(df$yield))) +

I initially thought I could create a probability distribution graph using a simple code like geom_line() instead of stat_function(), which requires typing in the mean and standard deviation. Therefore, recently, I developed an R package called probdistz() to facilitate this process.

□ Github: https://github.com/agronomy4future/probdistz/blob/main/README.md

■ Package installation

First, let’s install the package.

if(!require(remotes)) install.packages("remotes")
if (!requireNamespace("probdistz", quietly = TRUE)) {
  remotes::install_github("agronomy4future/probdistz")
}
library(remotes)
library(probdistz)

If you type ?probdistz in your R script, you can view detailed information about this package.

■ Code format

The basic format of the code is as follows:

probdistz(data, env_cols, yield_cols, smooth= TRUE)
*smooth= TRUE or FALSE (Default is TRUE)

□ data refers to the dataset you want to analyze.
□ env_cols specifies the grouping for calculating the PDF and CDF. For example, if you set env_cols= c("field", "genotype"), the PDF and CDF will be calculated per genotype within each field. Conversely, if you set env_cols= c("genotype"), the PDF and CDF will be calculated per genotype across all environments.
□ yield_cols specifies the independent variables for which you want to calculate the PDF and CDF.

When selecting smooth= TRUE, the probdistz() package predicts additional data to compensate for a lack of sample size, based on 6σ, and uses this to calculate the PDF, CDF, and Z-score, resulting in a continuous probability curve. On the contrary, when selecting smooth= FALSE, the probdistz() package calculates the PDF, CDF, and Z-score based on the actual dataset. Therefore, if the datapoints are limited, the probability curve may not be connected.

■ Upload the data for practice

Let’s practice using this code with an actual dataset. Please copy and paste the code below into your R script.

if(!require(readr)) install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/grains.csv"
df = data.frame(read_csv(url(github), show_col_types=FALSE))

and you can obtain the below dataset. This data is grain number per certain unit and grain weight (mg) data across different environmental conditions.

head(df,5)
     field genotype block        stage   treatment shoot      grain_number grain_weight
1 up_state     cv_1     I pre_anthesis treatment_1 main stem  513          49.26
2 up_state     cv_1     I pre_anthesis treatment_1 tillers    268          44.68
3 up_state     cv_2     I pre_anthesis treatment_1 main stem  616          45.19
4 up_state     cv_2     I pre_anthesis treatment_1 tillers    83           44.34
5 up_state     cv_1     I pre_anthesis     control main stem  516          48.25
.
.
.

First, let’s create a Probability Density Function (PDF) curve using stat_function(). Before doing that, you need to decide within which groups you’ll calculate the PDF.

The dataset is structured as follows:

Independent variable	Number of levels	Remarks
Field	2	Up state, Down state
Genotype	3	CV1, CV2, CV3
Block	3	I, II, III
Stage	2	pre_anthesis ,post_anthesis
Treatment	3	Control, treatment 1, treatment 2
Shoot	2	Main stem, tillers

Dependent variable
Gran number
Grain weight

First, I want to visualize the probability distribution of grain weight per genotype across all environments.

ggplot(data=df, aes(x=grain_weight)) +
  
  stat_function(data=subset(df, genotype=="cv_1"), aes(color= "cv_1"), linewidth=1, 
    fun=dnorm, args=list(mean=mean(subset(df, genotype=="cv_1")$grain_weight, na.rm=TRUE),
    sd=sd(subset(df, genotype=="cv_1")$grain_weight, na.rm=TRUE))) +
  
  stat_function(data=subset(df, genotype=="cv_2"), aes(color= "cv_2"), linewidth=1, 
    fun=dnorm, args=list(mean=mean(subset(df, genotype=="cv_2")$grain_weight, na.rm=TRUE),
    sd=sd(subset(df, genotype=="cv_2")$grain_weight, na.rm=TRUE))) +
  
  stat_function(data=subset(df, genotype=="cv_3"), aes(color= "cv_3"), linewidth=1, 
    fun=dnorm, args=list(mean=mean(subset(df, genotype=="cv_3")$grain_weight, na.rm=TRUE),
    sd=sd(subset(df, genotype=="cv_3")$grain_weight, na.rm=TRUE))) +
  
scale_color_manual(name= "Genotype", values = c("cv_1"="black", "cv_2"="red", "cv_3"="blue")) +
scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80)) +
scale_y_continuous(breaks=seq(0,0.15,0.03), limits=c(0,0.15)) +
labs(x="Wheat grain weight (mg)", y="Frequency") +
theme_classic(base_size=15, base_family="serif") +
theme(legend.position=c(0.85, 0.85),
      legend.title=element_blank(),
      legend.key=element_rect(color="white", fill="white"),
      legend.text=element_text(family="serif", face="plain", size=15, color= "Black"),
      legend.background=element_rect(fill="white"),
      axis.line=element_line(linewidth=0.5, colour="black"))

This is the probability distribution of grain weight across all environments.

Next, I want to create a Probability Density Function (PDF) curve for each genotype, categorized by fields. The df dataset contains data from two fields: upstate and downstate. I want to analyze the distribution for either of these locations.

For example, I’ll slightly modify the previous code to generate a PDF curve for each genotype in the upstate field.

ggplot(data=subset(df, field=="up_state"), aes(x=grain_weight)) +
  
 stat_function(data=subset(subset(df, field=="up_state"), genotype=="cv_1"), 
  aes(color="cv_1"), linewidth=1, fun=dnorm, 
  args=list(mean=mean(subset(subset(df, field=="up_state"), genotype=="cv_1")$grain_weight, na.rm=TRUE),
  sd=sd(subset(subset(df, field=="up_state"), genotype=="cv_1")$grain_weight, na.rm=TRUE))) +
  
 stat_function(data=subset(subset(df, field=="up_state"), genotype=="cv_2"), 
  aes(color="cv_2"), linewidth=1, fun=dnorm, 
  args=list(mean=mean(subset(subset(df, field=="up_state"), genotype=="cv_2")$grain_weight, na.rm=TRUE),
  sd=sd(subset(subset(df, field=="up_state"), genotype=="cv_2")$grain_weight, na.rm=TRUE))) +
  
 stat_function(data=subset(subset(df, field=="up_state"), genotype=="cv_3"), 
  aes(color="cv_3"), linewidth=1, fun=dnorm, 
  args=list(mean=mean(subset(subset(df, field=="up_state"), genotype=="cv_3")$grain_weight, na.rm=TRUE),
  sd=sd(subset(subset(df, field=="up_state"), genotype=="cv_3")$grain_weight, na.rm=TRUE))) +
  
 scale_color_manual(name= "Genotype", values = c("cv_1"="black", "cv_2"="red", "cv_3"="blue")) +
 scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80)) +
 scale_y_continuous(breaks=seq(0,0.15,0.03), limits=c(0,0.15)) +
 labs(x="Wheat grain weight (mg)", y="Frequency") +
 theme_classic(base_size=15, base_family="serif") +
 theme(legend.position=c(0.85, 0.85),
       legend.title=element_blank(),
       legend.key=element_rect(color="white", fill="white"),
       legend.text=element_text(family="serif", face="plain", size=15, color= "Black"),
       legend.background=element_rect(fill="white"),
       axis.line=element_line(linewidth=0.5, colour="black"))

Now, the probability distribution of grain weight per genotype is for upstate field only.

If you change the code to subset(df, field=="down_state") in the code above, you can also generate the graph for the downstate field only.

■ probdistz code for PDF

I developed the probdistz() package. As mentioned above, the basic format of the code is below.

probdistz(data, env_cols, yield_cols, smooth= TRUE)
*smooth= TRUE or FALSE, Default is TRUE

Again, I want to create a Probability Density Function (PDF) curve for each genotype, categorized by field.

head(df,5)
     field genotype block        stage   treatment shoot      grain_number grain_weight
1 up_state     cv_1     I pre_anthesis treatment_1 main stem  513          49.26
2 up_state     cv_1     I pre_anthesis treatment_1 tillers    268          44.68
3 up_state     cv_2     I pre_anthesis treatment_1 main stem  616          45.19
4 up_state     cv_2     I pre_anthesis treatment_1 tillers    83           44.34
5 up_state     cv_1     I pre_anthesis     control main stem  516          48.25
.
.
.

I’ll run code.

# to install package
if(!require(remotes)) install.packages("remotes")
if (!requireNamespace("probdistz", quietly = TRUE)) {
  remotes::install_github("agronomy4future/probdistz")
}
library(remotes)
library(probdistz)

# to run prodistz package
dataA= probdistz(df, env_cols= c("field","genotype"), yield_cols="grain_weight), smooth= TRUE)

The output was generated and named dataA, as shown below.

set.seed(100)
dataA[sample(nrow(dataA),5),]
     grain_weight   smooth_pdf   smooth_cdf    smooth_z      field genotype
3786     53.69313 3.036760e-04 9.996976e-01  3.42942943 down_state     cv_1
503      45.21273 6.573905e-02 5.119784e-01  0.03003003   up_state     cv_1
3430     38.00036 7.595374e-02 1.985402e-01 -0.84684685 down_state     cv_1
3696     49.72585 6.898674e-03 9.905716e-01  2.34834835 down_state     cv_1
4090     21.05160 4.985506e-07 4.091933e-07 -4.93093093 down_state     cv_3
.
.
.

You can see the Probability Density Function (PDF), the Cumulative Distribution Function (CDF) and Z-score were generated.

probdistz() package generates additional data based on the mean and standard deviation of the current dataset. This is necessary because, with limited datapoints, a complete probability density curve is not feasible. You may encounter such a difficulty when attempting to create the probability density curve using Excel like below graph.

probdistz() package predicts additional data to make up for the missing data based on 6σ.

I’ll filter field, only “up_state” and create the PDF curve.

ggplot(data=subset(dataA, field=="up_state"), aes(x=grain_weight, y=smooth_pdf, color=genotype)) +
  geom_line ()+
  scale_color_manual(name= "Genotype", values= c("cv_1"="black", "cv_2"="red", "cv_3"="blue")) +
  scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80)) +
  scale_y_continuous(breaks=seq(0,0.15,0.03), limits=c(0,0.15)) +
  labs(x="Wheat grain weight (mg)", y="Frequency") +
  theme_classic(base_size=15, base_family="serif") +
  theme(legend.position=c(0.85, 0.85),
        legend.title=element_blank(),
        legend.key=element_rect(color="white", fill="white"),
        legend.text=element_text(family="serif", face="plain", size=15, color= "Black"),
        legend.background=element_rect(fill="white"),
        axis.line=element_line(linewidth=0.5, colour="black"))

This graph is identical to the one produced using stat_function().

Let’s compare the two codes: stat_function() and probdistz(). The code using probdistz() is much simpler.

Let’s practice the package code more. In this time, I want to create a Probability Density Function (PDF) curve for each genotype ‘across all environments‘.

dataB= probdistz(df, env_cols= c("genotype"), yield_cols="grain_weight", smooth= TRUE)

and I’ll create PDF curve.

ggplot(data=dataB, aes(x=grain_weight, y=smooth_pdf, color=genotype)) +
  geom_line ()+
  scale_color_manual(name= "Genotype", values= c("cv_1"="black", "cv_2"="red", "cv_3"="blue")) +
  scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80)) +
  scale_y_continuous(breaks=seq(0,0.15,0.03), limits=c(0,0.15)) +
  labs(x="Wheat grain weight (mg)", y="Frequency") +
  theme_classic(base_size=15, base_family="serif") +
  theme(legend.position=c(0.85, 0.85),
        legend.title=element_blank(),
        legend.key=element_rect(color="white", fill="white"),
        legend.text=element_text(family="serif", face="plain", size=15, color= "Black"),
        legend.background=element_rect(fill="white"),
        axis.line=element_line(linewidth=0.5, colour="black"))

This graph is identical to the one produced using stat_function().

■ probdistz code for CDF

I’ll create the Cumulative Distribution Function (CDF) curve for each genotype, categorized by field. First, I’ll use stat_function() to create the CDF curve for the upstate region only.

head(df,5)
     field genotype block        stage   treatment     shoot grain_number grain_weight
1 up_state     cv_1     I pre_anthesis treatment_1 main stem          513        49.26
2 up_state     cv_1     I pre_anthesis treatment_1   tillers          268        44.68
3 up_state     cv_2     I pre_anthesis treatment_1 main stem          616        45.19
4 up_state     cv_2     I pre_anthesis treatment_1   tillers           83        44.34
5 up_state     cv_1     I pre_anthesis     control main stem          516        48.25

###
ggplot(data=df) +

  stat_function(fun= pnorm,
      args= list(mean= mean(subset(subset(df, field=="up_state"), 
      genotype== "cv_1")$grain_weight, na.rm= TRUE),
      sd= sd(subset(subset(df, field=="up_state"), genotype== "cv_1")$grain_weight, 
      na.rm= TRUE)), aes(color= "cv_1"), size = 1) +

  stat_function(fun= pnorm,
      args= list(mean= mean(subset(subset(df, field=="up_state"), 
      genotype== "cv_2")$grain_weight, na.rm= TRUE),
      sd= sd(subset(subset(df, field=="up_state"), genotype== "cv_2")$grain_weight, 
      na.rm= TRUE)), aes(color= "cv_2"), size = 1) +

  stat_function(fun= pnorm,
      args= list(mean= mean(subset(subset(df, field=="up_state"), 
      genotype== "cv_3")$grain_weight, na.rm= TRUE),
      sd= sd(subset(subset(df, field=="up_state"), genotype== "cv_3")$grain_weight, 
      na.rm= TRUE)), aes(color= "cv_3"), size = 1) +

  scale_color_manual(name= "Genotype", values= c("cv_1"= "black", "cv_2"= "red", "cv_3"= "blue")) +
  scale_x_continuous(breaks= seq(0, 80, 20), limits= c(0, 80)) +
  scale_y_continuous(breaks= seq(0, 1, 0.2), limits= c(0, 1)) +
  labs(x= "Wheat grain weight (mg)", y= "Frequency") +
  theme_classic(base_size= 15, base_family= "serif") +
  theme(legend.position= c(0.85, 0.85),
        legend.title= element_blank(),
        legend.key= element_rect(color= "white", fill= "white"),
        legend.text= element_text(family= "serif", face= "plain", size= 15, color= "Black"),
        legend.background= element_rect(fill = "white"),
        axis.line= element_line(linewidth= 0.5, colour= "black"))

Next, I’ll use probdistz().

dataA= probdistz(df, env_cols= c("field","genotype"), yield_cols="grain_weight", smooth= TRUE)

ggplot(data=subset(dataA, field=="up_state"), aes(x=grain_weight, y=smooth_cdf, color=genotype)) +
  geom_line ()+
  scale_color_manual(name= "Genotype", values = c("cv_1"="black", "cv_2"="red", "cv_3"="blue")) +
  scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80)) +
  scale_y_continuous(breaks=seq(0,1,0.2), limits=c(0,1)) +
  labs(x="Wheat grain weight (mg)", y="Frequency") +
  theme_classic(base_size=15, base_family="serif") +
  theme(legend.position=c(0.85, 0.85),
        legend.title=element_blank(),
        legend.key=element_rect(color="white", fill="white"),
        legend.text=element_text(family="serif", face="plain", size=15, color= "Black"),
        legend.background=element_rect(fill="white"),
        axis.line=element_line(linewidth=0.5, colour="black"))

■ probdistz code for Z-distribution

probdistz() can provide z-distribution.

dataA= probdistz(df, env_cols= c("field","genotype"), yield_cols= c("grain_weight"), smooth= TRUE)

ggplot(data=subset(dataA, field=="up_state"), aes(x=smooth_z, y=smooth_pdf, color=genotype)) +
  geom_line ()+
  scale_color_manual(name= "Genotype", values= c("cv_1"="black", "cv_2"="red", "cv_3"="blue")) +
  scale_x_continuous(breaks=seq(-4,4,1), limits=c(-4,4)) +
  scale_y_continuous(breaks=seq(0,0.2,0.05), limits=c(0,0.2)) +
  labs(x="Z-score", y="Frequency") +
  theme_classic(base_size=15, base_family="serif") +
  theme(legend.position=c(0.85, 0.85),
        legend.title=element_blank(),
        legend.key=element_rect(color="white", fill="white"),
        legend.text=element_text(family="serif", face="plain", size=15, color= "Black"),
        legend.background=element_rect(fill="white"),
        axis.line=element_line(linewidth=0.5, colour="black"))

■ probdistz code (smooth= FALSE)

In probdistz() package, when selecting smooth= FALSE, it provides PDF, CDF and Z-score regarding the actual data, not predicting additional data based on 6σ.

dataA= probdistz(df, env_cols= c("field","genotype"), yield_cols="grain_weight", smooth= FALSE)

ggplot(data=subset(dataA, field=="up_state"), aes(x=grain_weight, y=pdf_grain_weight, color=genotype)) +
  geom_line ()+
  scale_color_manual(name= "Genotype", values= c("cv_1"="black", "cv_2"="red", "cv_3"="blue")) +
  scale_x_continuous(breaks=seq(0,80,20), limits=c(0,80)) +
  scale_y_continuous(breaks=seq(0,0.15,0.03), limits=c(0,0.15)) +
  labs(x="Wheat grain weight (mg)", y="Frequency") +
  theme_classic(base_size=15, base_family="serif") +
  theme(legend.position=c(0.85, 0.85),
        legend.title=element_blank(),
        legend.key=element_rect(color="white", fill="white"),
        legend.text=element_text(family="serif", face="plain", size=15, color= "Black"),
        legend.background=element_rect(fill="white"),
        axis.line=element_line(linewidth=0.5, colour="black"))

So, this graph is the same as the one created using Excel. If the data size increases, it will display a probability curve graph. This means that when using stat_function() and probdistz(), the estimation is based on the mean and standard deviation of the entire dataset. For probdistz(), I estimated it up to 6σ.

□ Github: https://github.com/agronomy4future/probdistz/blob/main/README.md
□ Code summary: https://github.com/agronomy4future/r_code/blob/main/Probability_Distribution_and_Z_Score_Calculation_Function_feat_probdistz.ipynb

Agronomy4future

Stories about cereals and statistics (plus coding). We aim to develop open-source code for agronomy.