Efficient Multivariate Summary in R: A Guide to Analyzing Multiple Independent Variables

Efficient Multivariate Summary in R: A Guide to Analyzing Multiple Independent Variables


In my previous post, I introduced how to summarize data, such as mean, standard deviation, and standard error. However, at that moment, I demonstrated how to summarize only one variable.


Streamlined Data Summary in R STUDIO: Enhancing Bar Graphs with Error Bars


Now, let’s discuss this further with a dataset.

library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/yield%20component_nitrogen.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))

dataA
  Genotype Block Nitrogen    GN  AGW  Yield
1       CV1     I       N1 21488 43.9  902.4
2       CV1    II       N1 23707 41.7  944.9
3       CV1   III       N1 20817 45.6  907.7
4       CV1     I       N0 13570 47.7  645.0
5       CV1    II       N0  8593 47.6  393.1
6       CV1   III       N0  8588 45.5  389.8
7       CV2     I       N1 22072 51.6 1119.7
8       CV2    II       N1 14675 46.5  675.7
9       CV2   III       N1 17180 48.6  800.1
10      CV2     I       N0 10969 52.2  572.0
.
.
.

I would like to summarize the Yield data, including the mean, standard deviation, and standard error. I’ll use ddply()

dataB= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(Yield), 
              sd=sd(Yield), n=length(Yield), se=sd/sqrt(n))

dataB
   Genotype Nitrogen     mean        sd n        se
1       CV1       N0 475.9667 146.39646 3  84.52204
2       CV1       N1 918.3333  23.15952 3  13.37115
3       CV2       N0 603.8000  41.99190 3  24.24404
4       CV2       N1 865.1667 229.03985 3 132.23622
5       CV3       N0 485.1667 105.48423 3  60.90135
6       CV3       N1 960.6333  84.80627 3  48.96292
7       CV4       N0 640.1000  47.77457 3  27.58266
8       CV4       N1 846.5000  80.20916 3  46.30878
9       CV5       N0 701.0000  86.74001 3  50.07937
10      CV5       N1 966.4667  63.15262 3  36.46118

Now, I also want to summarize variables GN and AGW. Do you think it is a good approach to summarize all variables?

dataC= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(GN), 
              sd=sd(GN), n=length(GN), se=sd/sqrt(n))

dataD= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(AGW), 
              sd=sd(AGW), n=length(AGW), se=sd/sqrt(n))

dataB$GN=dataC$mean 
dataB$GN_se=dataC$se
dataB$AGW=dataD$mean 
dataB$AGW_se=dataD$se
colnames(dataB)[3]=c("Yield")
colnames(dataB)[6]=c("Yield_se")

dataB
   Genotype Nitrogen    Yield        sd n  Yield_se       GN     GN_se      AGW     AGW_se
1       CV1       N0 475.9667 146.39646 3  84.52204 10250.33 1659.8340 46.93333 0.71724783
2       CV1       N1 918.3333  23.15952 3  13.37115 22004.00  873.2539 43.73333 1.12891295
3       CV2       N0 603.8000  41.99190 3  24.24404 11618.33  498.8525 52.06667 0.06666667
4       CV2       N1 865.1667 229.03985 3 132.23622 17975.67 2172.0740 48.90000 1.47986486
5       CV3       N0 485.1667 105.48423 3  60.90135 11677.00 1202.7329 43.13333 1.18649812
6       CV3       N1 960.6333  84.80627 3  48.96292 24803.00  922.3407 41.60000 0.55677644
7       CV4       N0 640.1000  47.77457 3  27.58266 13364.00  748.6497 49.13333 1.76477887
8       CV4       N1 846.5000  80.20916 3  46.30878 17466.33  732.2956 50.50000 0.77674535
9       CV5       N0 701.0000  86.74001 3  50.07937 15419.00  730.9747 45.96667 0.72188026
10      CV5       N1 966.4667  63.15262 3  36.46118 26012.33 1237.1395 39.10000 1.20554275

It seems alright, but we can save time using a much simpler code. I’ll introduce how to summarize multiple variables at once. I’ll use the following code, dplyr::summarize().

library(dplyr)
dataB= data.frame(dataA %>%
  group_by(Genotype, Nitrogen) %>%
  dplyr::summarize(across(c(GN, AGW, Yield), 
                          .fns = list(Mean = mean, 
                                      SD = sd, 
                                      n = length,
                                      se = ~ sd(.)/sqrt(length(.))))))
dataB

   Genotype Nitrogen  GN_Mean  GN_SD GN_n   GN_se AGW_Mean  AGW_SD AGW_n   AGW_se Yield_Mean  Yield_SD Yield_n  Yield_se
1     CV1    N0 10250.33 2874.9168    3 1659.8340 46.93333 1.2423097     3 0.71724783   475.9667 146.39646     3  84.52204
2     CV1    N1 22004.00 1512.5201    3  873.2539 43.73333 1.9553346     3 1.12891295   918.3333  23.15952     3  13.37115
3     CV2    N0 11618.33  864.0378    3  498.8525 52.06667 0.1154701     3 0.06666667   603.8000  41.99190     3  24.24404
4     CV2    N1 17975.67 3762.1425    3 2172.0740 48.90000 2.5632011     3 1.47986486   865.1667 229.03985     3 132.23622
5     CV3    N0 11677.00 2083.1944    3 1202.7329 43.13333 2.0550750     3 1.18649812   485.1667 105.48423     3  60.90135
6     CV3    N1 24803.00 1597.5409    3  922.3407 41.60000 0.9643651     3 0.55677644   960.6333  84.80627     3  48.96292
7     CV4    N0 13364.00 1296.6993    3  748.6497 49.13333 3.0566867     3 1.76477887   640.1000  47.77457     3  27.58266
8     CV4    N1 17466.33 1268.3731    3  732.2956 50.50000 1.3453624     3 0.77674535   846.5000  80.20916     3  46.30878
9     CV5    N0 15419.00 1266.0853    3  730.9747 45.96667 1.2503333     3 0.72188026   701.0000  86.74001     3  50.07937
10    CV5    N1 26012.33 2142.7884    3 1237.1395 39.10000 2.0880613     3 1.20554275   966.4667  63.15262     3  36.46118

All variables were summarized at once.



How about when missing values exist? Here is another example.

library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/fertilizer_fungicide_treatment_with_missing_values.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))

dataA
   Fertilizer Fungicide Yield Height
1     Control       Yes  12.2     45
2     Control       Yes  12.4     NA
3     Control       Yes  11.9     42
4     Control       Yes  11.3     35
5     Control       Yes  11.8     40
6     Control        No  12.1     48
7     Control        No  13.1     NA
8     Control        No  12.7     61
9     Control        No  12.4     50
10    Control        No  11.4     33
11       Slow       Yes    NA     63
12       Slow       Yes  15.8     50
13       Slow       Yes  16.5     63
14       Slow       Yes  15.0     33
15       Slow       Yes  15.4     38
16       Slow        No  15.6     45
17       Slow        No    NA     50
18       Slow        No  15.8     48
19       Slow        No  16.0     50
20       Slow        No  15.8     NA
21       Fast       Yes   9.5     52
22       Fast       Yes   9.5     54
23       Fast       Yes   9.6     58
24       Fast       Yes    NA     45
25       Fast       Yes   9.5     57
26       Fast        No   9.8     62
27       Fast        No   9.1     52
28       Fast        No  10.3     67
29       Fast        No   9.5     55
30       Fast        No    NA     40

In this data, there are some missing values. I’ll use the same code, dplyr::summarize().

library(dplyr)
dataB= data.frame(dataA %>%
                  group_by(Fertilizer, Fungicide) %>%
                  dplyr::summarize(across(c(Yield, Height), 
                          .fns= list(Mean= mean, 
                            SD= sd, 
                             n= length,
                            se= ~ sd(.)/sqrt(length(.))))))

dataB
  Fertilizer Fungicide Yield_Mean  Yield_SD Yield_n  Yield_se Height_Mean Height_SD Height_n Height_se
1    Control        No      12.34 0.6426508       5 0.2874022          NA        NA        5        NA
2    Control       Yes      11.92 0.4207137       5 0.1881489          NA        NA        5        NA
3       Fast        No         NA        NA       5        NA        55.2 10.329569        5  4.619524
4       Fast       Yes         NA        NA       5        NA        53.2  5.167204        5  2.310844
5       Slow        No         NA        NA       5        NA          NA        NA        5        NA
6       Slow       Yes         NA        NA       5        NA        49.4 13.867228        5  6.201613

When there are missing values, the mean and standard deviation and standard error cannot be calculated. In this case, we can add na.rm= TRUE to calculate the mean and others even though missing values exist.

library(dplyr)
dataB = data.frame(dataA %>%
                   group_by(Fertilizer, Fungicide) %>%
                   dplyr::summarize(across(c(Yield, Height), 
                                     .fns= list(Mean=~mean(., na.rm= TRUE), 
                                      SD= ~sd(., na.rm= TRUE), 
                                      n=~length(.),
                                      se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))

dataB
  Fertilizer Fungicide Yield_Mean  Yield_SD Yield_n   Yield_se Height_Mean Height_SD Height_n Height_se
1    Control        No     12.340 0.6426508       5 0.28740216       48.00 11.518102        5  5.151052
2    Control       Yes     11.920 0.4207137       5 0.18814888       40.50  4.203173        5  1.879716
3       Fast        No      9.675 0.5057997       5 0.22620050       55.20 10.329569        5  4.619524
4       Fast       Yes      9.525 0.0500000       5 0.02236068       53.20  5.167204        5  2.310844
5       Slow        No     15.800 0.1632993       5 0.07302967       48.25  2.362908        5  1.056724
6       Slow       Yes     15.675 0.6396614       5 0.28606526       49.40 13.867228        5  6.201613


Comments are closed.