Efficient Multivariate Summary in R: A Guide to Analyzing Multiple Independent Variables
In my previous post, I introduced how to summarize data, such as mean, standard deviation, and standard error. However, at that moment, I demonstrated how to summarize only one variable.
□ Streamlined Data Summary in R STUDIO: Enhancing Bar Graphs with Error Bars
Now, let’s discuss this further with a dataset.
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/yield%20component_nitrogen.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))
dataA
Genotype Block Nitrogen GN AGW Yield
1 CV1 I N1 21488 43.9 902.4
2 CV1 II N1 23707 41.7 944.9
3 CV1 III N1 20817 45.6 907.7
4 CV1 I N0 13570 47.7 645.0
5 CV1 II N0 8593 47.6 393.1
6 CV1 III N0 8588 45.5 389.8
7 CV2 I N1 22072 51.6 1119.7
8 CV2 II N1 14675 46.5 675.7
9 CV2 III N1 17180 48.6 800.1
10 CV2 I N0 10969 52.2 572.0
.
.
.
I would like to summarize the Yield data, including the mean, standard deviation, and standard error. I’ll use ddply()
dataB= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(Yield),
sd=sd(Yield), n=length(Yield), se=sd/sqrt(n))
dataB
Genotype Nitrogen mean sd n se
1 CV1 N0 475.9667 146.39646 3 84.52204
2 CV1 N1 918.3333 23.15952 3 13.37115
3 CV2 N0 603.8000 41.99190 3 24.24404
4 CV2 N1 865.1667 229.03985 3 132.23622
5 CV3 N0 485.1667 105.48423 3 60.90135
6 CV3 N1 960.6333 84.80627 3 48.96292
7 CV4 N0 640.1000 47.77457 3 27.58266
8 CV4 N1 846.5000 80.20916 3 46.30878
9 CV5 N0 701.0000 86.74001 3 50.07937
10 CV5 N1 966.4667 63.15262 3 36.46118
Now, I also want to summarize variables GN and AGW. Do you think it is a good approach to summarize all variables?
dataC= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(GN),
sd=sd(GN), n=length(GN), se=sd/sqrt(n))
dataD= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(AGW),
sd=sd(AGW), n=length(AGW), se=sd/sqrt(n))
dataB$GN=dataC$mean
dataB$GN_se=dataC$se
dataB$AGW=dataD$mean
dataB$AGW_se=dataD$se
colnames(dataB)[3]=c("Yield")
colnames(dataB)[6]=c("Yield_se")
dataB
Genotype Nitrogen Yield sd n Yield_se GN GN_se AGW AGW_se
1 CV1 N0 475.9667 146.39646 3 84.52204 10250.33 1659.8340 46.93333 0.71724783
2 CV1 N1 918.3333 23.15952 3 13.37115 22004.00 873.2539 43.73333 1.12891295
3 CV2 N0 603.8000 41.99190 3 24.24404 11618.33 498.8525 52.06667 0.06666667
4 CV2 N1 865.1667 229.03985 3 132.23622 17975.67 2172.0740 48.90000 1.47986486
5 CV3 N0 485.1667 105.48423 3 60.90135 11677.00 1202.7329 43.13333 1.18649812
6 CV3 N1 960.6333 84.80627 3 48.96292 24803.00 922.3407 41.60000 0.55677644
7 CV4 N0 640.1000 47.77457 3 27.58266 13364.00 748.6497 49.13333 1.76477887
8 CV4 N1 846.5000 80.20916 3 46.30878 17466.33 732.2956 50.50000 0.77674535
9 CV5 N0 701.0000 86.74001 3 50.07937 15419.00 730.9747 45.96667 0.72188026
10 CV5 N1 966.4667 63.15262 3 36.46118 26012.33 1237.1395 39.10000 1.20554275
It seems alright, but we can save time using a much simpler code. I’ll introduce how to summarize multiple variables at once. I’ll use the following code, dplyr::summarize()
.
library(dplyr)
dataB= data.frame(dataA %>%
group_by(Genotype, Nitrogen) %>%
dplyr::summarize(across(c(GN, AGW, Yield),
.fns = list(Mean = mean,
SD = sd,
n = length,
se = ~ sd(.)/sqrt(length(.))))))
dataB
Genotype Nitrogen GN_Mean GN_SD GN_n GN_se AGW_Mean AGW_SD AGW_n AGW_se Yield_Mean Yield_SD Yield_n Yield_se
1 CV1 N0 10250.33 2874.9168 3 1659.8340 46.93333 1.2423097 3 0.71724783 475.9667 146.39646 3 84.52204
2 CV1 N1 22004.00 1512.5201 3 873.2539 43.73333 1.9553346 3 1.12891295 918.3333 23.15952 3 13.37115
3 CV2 N0 11618.33 864.0378 3 498.8525 52.06667 0.1154701 3 0.06666667 603.8000 41.99190 3 24.24404
4 CV2 N1 17975.67 3762.1425 3 2172.0740 48.90000 2.5632011 3 1.47986486 865.1667 229.03985 3 132.23622
5 CV3 N0 11677.00 2083.1944 3 1202.7329 43.13333 2.0550750 3 1.18649812 485.1667 105.48423 3 60.90135
6 CV3 N1 24803.00 1597.5409 3 922.3407 41.60000 0.9643651 3 0.55677644 960.6333 84.80627 3 48.96292
7 CV4 N0 13364.00 1296.6993 3 748.6497 49.13333 3.0566867 3 1.76477887 640.1000 47.77457 3 27.58266
8 CV4 N1 17466.33 1268.3731 3 732.2956 50.50000 1.3453624 3 0.77674535 846.5000 80.20916 3 46.30878
9 CV5 N0 15419.00 1266.0853 3 730.9747 45.96667 1.2503333 3 0.72188026 701.0000 86.74001 3 50.07937
10 CV5 N1 26012.33 2142.7884 3 1237.1395 39.10000 2.0880613 3 1.20554275 966.4667 63.15262 3 36.46118
All variables were summarized at once.
How about when missing values exist? Here is another example.
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/fertilizer_fungicide_treatment_with_missing_values.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))
dataA
Fertilizer Fungicide Yield Height
1 Control Yes 12.2 45
2 Control Yes 12.4 NA
3 Control Yes 11.9 42
4 Control Yes 11.3 35
5 Control Yes 11.8 40
6 Control No 12.1 48
7 Control No 13.1 NA
8 Control No 12.7 61
9 Control No 12.4 50
10 Control No 11.4 33
11 Slow Yes NA 63
12 Slow Yes 15.8 50
13 Slow Yes 16.5 63
14 Slow Yes 15.0 33
15 Slow Yes 15.4 38
16 Slow No 15.6 45
17 Slow No NA 50
18 Slow No 15.8 48
19 Slow No 16.0 50
20 Slow No 15.8 NA
21 Fast Yes 9.5 52
22 Fast Yes 9.5 54
23 Fast Yes 9.6 58
24 Fast Yes NA 45
25 Fast Yes 9.5 57
26 Fast No 9.8 62
27 Fast No 9.1 52
28 Fast No 10.3 67
29 Fast No 9.5 55
30 Fast No NA 40
In this data, there are some missing values. I’ll use the same code, dplyr::summarize()
.
library(dplyr)
dataB= data.frame(dataA %>%
group_by(Fertilizer, Fungicide) %>%
dplyr::summarize(across(c(Yield, Height),
.fns= list(Mean= mean,
SD= sd,
n= length,
se= ~ sd(.)/sqrt(length(.))))))
dataB
Fertilizer Fungicide Yield_Mean Yield_SD Yield_n Yield_se Height_Mean Height_SD Height_n Height_se
1 Control No 12.34 0.6426508 5 0.2874022 NA NA 5 NA
2 Control Yes 11.92 0.4207137 5 0.1881489 NA NA 5 NA
3 Fast No NA NA 5 NA 55.2 10.329569 5 4.619524
4 Fast Yes NA NA 5 NA 53.2 5.167204 5 2.310844
5 Slow No NA NA 5 NA NA NA 5 NA
6 Slow Yes NA NA 5 NA 49.4 13.867228 5 6.201613
When there are missing values, the mean and standard deviation and standard error cannot be calculated. In this case, we can add na.rm= TRUE
to calculate the mean and others even though missing values exist.
library(dplyr)
dataB = data.frame(dataA %>%
group_by(Fertilizer, Fungicide) %>%
dplyr::summarize(across(c(Yield, Height),
.fns= list(Mean=~mean(., na.rm= TRUE),
SD= ~sd(., na.rm= TRUE),
n=~length(.),
se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))
dataB
Fertilizer Fungicide Yield_Mean Yield_SD Yield_n Yield_se Height_Mean Height_SD Height_n Height_se
1 Control No 12.340 0.6426508 5 0.28740216 48.00 11.518102 5 5.151052
2 Control Yes 11.920 0.4207137 5 0.18814888 40.50 4.203173 5 1.879716
3 Fast No 9.675 0.5057997 5 0.22620050 55.20 10.329569 5 4.619524
4 Fast Yes 9.525 0.0500000 5 0.02236068 53.20 5.167204 5 2.310844
5 Slow No 15.800 0.1632993 5 0.07302967 48.25 2.362908 5 1.056724
6 Slow Yes 15.675 0.6396614 5 0.28606526 49.40 13.867228 5 6.201613