How to summarize data using Python?
In my previous post, I demonstrated how to create a data table using Python. If you’re interested, please refer to the post below.
■ How to create a data table in Python?
import pandas as pd
genotypes = ["Genotype_A", "Genotype_B", "Genotype_C", "Genotype_D"] * 16
blocks = ["I", "II", "III", "IV"] * 16
treatment = ["Control", "Fertilizer1", "Fertilizer2", "Fertilizer3"] * 16
grain_yield = [
    42.9, 41.6, 28.9, 30.8, 53.3, 69.6, 45.4, 35.1, 62.3, 58.5, 44.6,
    50.3, 75.4, 65.6, 54, 52.7, 53.8, 58.5, 43.9, 46.3, 57.6, 69.6, 42.4,
    51.9, 63.4, 50.4, 45, 46.7, 70.3, 67.3, 57.6, 58.5, 49.5, 53.8, 40.7,
    39.4, 59.8, 65.8, 41.4, 45.4, 64.5, 46.1, 62.6, 50.3, 68.8, 65.3, 45.6,
    51, 44.4, 41.8, 28.3, 34.7, 64.1, 57.4, 44.1, 51.6, 63.6, 56.1, 52.7,
    51.8, 71.6, 69.4, 56.6, 47.4
]
# Combine the four lists (64 values each) into one data frame
df = pd.DataFrame({
    "genotype": genotypes,
    "block": blocks,
    "treatment": treatment,
    "grain_yield": grain_yield
})
df
      genotype block    treatment  grain_yield
0   Genotype_A     I      Control         42.9
1   Genotype_B    II  Fertilizer1         41.6
2   Genotype_C   III  Fertilizer2         28.9
3   Genotype_D    IV  Fertilizer3         30.8
4   Genotype_A     I      Control         53.3
..         ...   ...          ...          ...
59  Genotype_D    IV  Fertilizer3         51.8
60  Genotype_A     I      Control         71.6
61  Genotype_B    II  Fertilizer1         69.4
62  Genotype_C   III  Fertilizer2         56.6
63  Genotype_D    IV  Fertilizer3         47.4

64 rows × 4 columns
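Before summarizing, it can help to check how the grouping columns combine. The sketch below rebuilds only the genotype and treatment columns from the lists above and counts the rows in each combination with pd.crosstab. The names layout and counts are just illustrative. Because both lists repeat in the same order, each genotype is paired with a single treatment, so only four combinations contain data.

```python
import pandas as pd

# Same label lists as in the data table above
genotypes = ["Genotype_A", "Genotype_B", "Genotype_C", "Genotype_D"] * 16
treatment = ["Control", "Fertilizer1", "Fertilizer2", "Fertilizer3"] * 16

# "layout" and "counts" are illustrative names, not part of the original post
layout = pd.DataFrame({"genotype": genotypes, "treatment": treatment})

# Count how many rows fall into each genotype x treatment combination
counts = pd.crosstab(layout["genotype"], layout["treatment"])
print(counts)
```

Only the diagonal cells are non-zero (16 rows each), which is why a summary grouped by genotype and treatment has four rows rather than sixteen.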
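Next, I’ll summarize grain yield for each genotype and treatment combination by its mean and standard error, where the standard error is the sample standard deviation divided by the square root of the number of observations.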
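import numpy as np
# Group by genotype and treatment, then compute the mean and standard error
summary_stats = df.groupby(['genotype', 'treatment']).agg(
    # the string 'mean' avoids the FutureWarning recent pandas raises for np.mean
    mean_value=('grain_yield', 'mean'),
    # standard error = sample SD (ddof=1) / sqrt(n)
    std_error=('grain_yield', lambda x: np.std(x, ddof=1) / np.sqrt(len(x)))
).reset_index()
summary_stats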
     genotype    treatment  mean_value  std_error
0  Genotype_A      Control    60.33125   2.376547
1  Genotype_B  Fertilizer1    58.55000   2.419969
2  Genotype_C  Fertilizer2    45.86250   2.336733
3  Genotype_D  Fertilizer3    46.49375   1.918147
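As a cross-check, the same standard error can be obtained with pandas' built-in "sem" aggregation, which uses ddof=1 by default. The sketch below uses a small made-up data set (the demo values are hypothetical, not the grain yield data above) and compares the built-in result with the manual formula.

```python
import numpy as np
import pandas as pd

# Hypothetical demo values, chosen only to keep the arithmetic easy to follow
demo = pd.DataFrame({
    "genotype": ["A", "A", "A", "B", "B", "B"],
    "grain_yield": [40.0, 50.0, 60.0, 30.0, 35.0, 40.0],
})

# Built-in aggregations: "count" for n, "mean", and "sem" (standard error of the mean)
builtin = demo.groupby("genotype")["grain_yield"].agg(
    n="count", mean_value="mean", std_error="sem"
).reset_index()

# Manual formula: sample SD (ddof=1) divided by sqrt(n)
manual = demo.groupby("genotype")["grain_yield"].agg(
    lambda x: x.std(ddof=1) / np.sqrt(len(x))
)

print(builtin)
print(np.allclose(builtin["std_error"], manual.to_numpy()))
```

If the two approaches agree, the last line prints True. Adding the count column also makes it easy to see how many observations each mean is based on.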
Full code: https://github.com/agronomy4future/python_code/blob/main/How_to_summarize_data_using_Python.ipynb