How to draw a normal distribution graph using Python?
In this session, using Python, I will draw a normal distribution graph using the actual data I collected. I measured individual grain areas for two wheat genotypes. For Cultivar_A, I measured the area for 1,225 grains and for Cultivar_B, I measured the area for 1,841 grains. Therefore, the total number of grain area data is 3,066. This is natural data collected by me in the actual cultivation field, and it would be interesting to see if it also follows a normal distribution.
1) to upload data
First, I will upload the data that I have saved on my Github to Python.
import pandas as pd
url="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/Grain%20area%20for%20two%20genotypes.csv"
df=pd.read_csv(url)
I uploaded 3,066 data points to Python.
2) to filter data
In this data, Cultivar_A and Cultivar_B are in the same row. Since I will draw a normal distribution for each cultivar, I will divide the data by cultivar.
cv1 = df.loc[df['Genotype']== 'Cultivar_A']
cv1.head()
cv2 = df.loc[df['Genotype']== 'Cultivar_B']
cv2.head()
Now that I have divided the data into cv1 and cv2, I will draw a normal distribution graph for each cultivar.
3) to calculate mean and standard deviation
I’ll calculate mean and standard deviation per cultivar.
import numpy as np
cv1_mean = np.mean(cv1["Area"]) # 17.41
cv1_std = np.std(cv1["Area"]) # 2.94
cv2_mean = np.mean(cv2["Area"]) # 16.87
cv2_std = np.std(cv2["Area"]) # 3.28
4) to calculate probability density function
Now I will calculate the probability density functions corresponding to the mean and standard deviation for each cultivar.
import scipy.stats as stats
cv1_pdf = stats.norm.pdf(cv1["Area"].sort_values(), cv1_mean, cv1_std)
cv2_pdf = stats.norm.pdf(cv2["Area"].sort_values(), cv2_mean, cv2_std)
Why are we doing this? Please refer to the post below for more details.
□ What is Probability Density Function (PDF) and Cumulative Distribution Function (CDF): How to calculate using Excel and R ?
5) to draw normal distribution (with histogram) gragh
I will draw a normal distribution graph for each cultivar. In addition, I will import rcParams
from pylab
to adjust the graph size and resolution for the graph image.
import numpy as np
import matplotlib.pyplot as plt
from pylab import rcParams
plt.plot(cv1["Area"].sort_values(), cv1_pdf, color="Black", label="Cultivar_A")
plt.plot(cv2["Area"].sort_values(), cv2_pdf, color="Orange", label="Cultivar_B")
plt.xlim([0,30])
plt.ylim([0,0.15])
plt.legend()
plt.xlabel("Grain area (mm2)", size=15)
plt.ylabel("Frequency", size=15)
plt.grid(True, alpha=0.3, linestyle="--")
plt.rcParams["figure.figsize"] = [7,5] # to adjust 'inch'
plt.rcParams["figure.dpi"] = 500 # to adjust image resolution
plt.show()
5.1) to include a histogram graph
I’ll add a histogram graph to the normal distibution graph.
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sns
plt.plot(cv1["Area"].sort_values(), cv1_pdf, color="Black", label="Cultivar_A")
plt.plot(cv2["Area"].sort_values(), cv2_pdf, color="Orange", label="Cultivar_B")
sns.histplot(data = cv1["Area"], color="Black",stat = "probability",alpha=0.3)
sns.histplot(data = cv2["Area"], color="Orange", stat = "probability",alpha=0.3)
plt.xlim([0,30])
plt.ylim([0,0.15])
plt.legend()
plt.xlabel("Grain area (mm2)", size=15)
plt.ylabel("Frequency", size=15)
plt.grid(True, alpha=0.3, linestyle="--")
plt.rcParams["figure.figsize"] = [7,5]
plt.rcParams["figure.dpi"] = 500
plt.show()
The histogram has been overlaid onto the normal distribution graph as shown. The full code for the above graph is shown below.
# to open library
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sns
# to upload data
Url="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/Grain%20area%20for%20two%20genotypes.csv"
df=pd.read_csv(Url)
# to filter data
cv1 = df.loc[df['Genotype']== 'Cultivar_A']
cv2 = df.loc[df['Genotype']== 'Cultivar_B']
# to calculate mean and standard deviation
cv1_mean = np.mean(cv1["Area"])
cv1_std = np.std(cv1["Area"])
cv2_mean = np.mean(cv2["Area"])
cv2_std = np.std(cv2["Area"])
# to calculate probability density function
cv1_pdf = stats.norm.pdf(cv1["Area"].sort_values(), cv1_mean, cv1_std)
cv2_pdf = stats.norm.pdf(cv2["Area"].sort_values(), cv2_mean, cv2_std)
# to draw normal distribution (with histogram) gragh
plt.plot(cv1["Area"].sort_values(), cv1_pdf, color="Black", label="Cultivar_A")
plt.plot(cv2["Area"].sort_values(), cv2_pdf, color="Orange", label="Cultivar_B")
sns.histplot(data = cv1["Area"], color="Black",stat = "probability",alpha=0.3)
sns.histplot(data = cv2["Area"], color="Orange", stat = "probability",alpha=0.3)
plt.xlim([0,30])
plt.ylim([0,0.15])
plt.legend()
plt.xlabel("Grain area (mm2)", size=15)
plt.ylabel("Frequency", size=15)
plt.grid(True, alpha=0.3, linestyle="--")
plt.rcParams["figure.figsize"] = [7,5]
plt.rcParams["figure.dpi"] = 500
plt.show()
Extra tip!! How to draw the same normal distribution graph using R?
This is the code to draw the same graph using R.
# to open library
library(readxl)
library(readr)
library(ggplot2)
# to upload data
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/Grain%20area%20for%20two%20genotypes.csv"
df=data.frame(read_csv(url(github),show_col_types=FALSE))
# to filter data
df=read_excel ("Grain area for two genotypes.xlsx")
cv1=subset(df, Genotype=="Cultivar_A")
cv2=subset(df, Genotype=="Cultivar_B")
# to draw a normal distribution graph (with a histogram)
ggplot (df, aes(x=Area, fill= Genotype)) +
geom_histogram (aes(y=0.5*..density..),
alpha=0.5, position='identity', binwidth=0.2) +
scale_fill_manual(values=c("grey25","Orange")) +
stat_function(data=cv1, aes(x=Area), color="Grey", size=1,
fun=dnorm, args=c(mean=mean(cv1$Area), sd=sd(cv1$Area))) +
stat_function(data=cv2, aes(x=Area), color="Orange", size=1,
fun=dnorm, args=c(mean=mean(cv2$Area), sd=sd(cv2$Area))) +
scale_x_continuous(breaks=seq(0,30,10),limits=c(0,30)) +
scale_y_continuous(breaks=seq(0,0.15,0.02), limits=c(0,0.15)) +
labs(x="Grain area (mm2)", y="Frequency") +
theme_bw(base_size=15, base_family="serif")+
theme(axis.line= element_line(size=0.5, colour="black"),
legend.position=c(0.85,0.85)) +
windows(width=6, height=5)
Which graph do you prefer? The normal distribution graph created in R is slightly different from Python.
- In the case of R, the normal distribution graph remains the same irrespective of the sample size. R generates an overall normal distribution graph based on the provided mean and standard deviation. On the other hand, in Python, the shape of the normal distribution curve varies with the sample size. In cases where the sample size is limited, the graph curve is unstable in Python, just like in Excel. For instance, when there are only ten data points (which is not ideal for drawing a normal distribution graph), R displays a normal distribution curve that looks similar to the one shown for an infinitely large number of data points with the same mean and standard deviation, while Python displays an unstable curve.
- Similarly, the histogram’s frequency varies based on the range (bins) of the histogram in Python, whereas R always displays the same frequency regardless of the range. In essence, R always provides the same graph, irrespective of the number of data points and the range of the histogram, whereas Python’s graph always varies.
The total number of data points is 3,066. If the range of the histogram is made very dense, it would be the same as the histogram provided in R. However, if the range of the histogram is made very dense in Python, the frequency of the histogram becomes extremely low, like the Python graph below.
plt.plot(cv1["Area"].sort_values(), cv1_pdf, color="Black", label="Cultivar_A")
plt.plot(cv2["Area"].sort_values(), cv2_pdf, color="Orange", label="Cultivar_B")
sns.histplot(data = cv1["Area"], bins=100, color="Black",stat = "probability",alpha=0.3)
sns.histplot(data = cv2["Area"], bins=100, color="Orange", stat = "probability",alpha=0.3)
plt.xlim([0,30])
plt.ylim([0,0.15])
plt.legend()
plt.xlabel("Grain area (mm2)", size=15)
plt.ylabel("Frequency", size=15)
plt.grid(True, alpha=0.3, linestyle="--")
plt.rcParams["figure.figsize"] = [7,5]
plt.rcParams["figure.dpi"] = 500
plt.show()
If you want to convey that there is an extremely large amount of data, using R to draw a histogram graph may be more suitable. However, mathematically speaking, if the range of the histogram becomes very dense, the corresponding frequency will also decrease. For instance, the frequency of data in the 5.0-6.0 mm2 range will differ from the frequency of data in the 5.01-5.02 mm2 range. Therefore, in principle, the graph provided by Python is more accurate. Nevertheless, if you have a sufficient understanding of the data principles and can defend your approach, using the R graph is also acceptable.