Streamlined Data Summary in R Studio: Enhancing Bar Graphs with Error Bars

November 13, 2020 JK

When working with data in R, you often need summary statistics such as means, standard deviations, and standard errors. Today, I will introduce two ways to compute them and then plot the result as a bar graph with error bars.

Let’s start by loading a dataset.

install.packages("readr")
library(readr)
github= "https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/fertilizer_treatment.csv"
dataA= data.frame(read_csv(url(github),show_col_types = FALSE))

     Genotype Block    variable value
1  Genotype_A     I     Control  42.9
2  Genotype_A    II     Control  41.6
3  Genotype_A   III     Control  28.9
4  Genotype_A    IV     Control  30.8
5  Genotype_B     I     Control  53.3
6  Genotype_B    II     Control  69.6
7  Genotype_B   III     Control  45.4
8  Genotype_B    IV     Control  35.1
9  Genotype_C     I     Control  62.3
10 Genotype_C    II     Control  58.5
11 Genotype_C   III     Control  44.6
12 Genotype_C    IV     Control  50.3
13 Genotype_D     I     Control  75.4
14 Genotype_D    II     Control  65.6
15 Genotype_D   III     Control  54.0
16 Genotype_D    IV     Control  52.7
17 Genotype_A     I Fertilizer1  53.8
18 Genotype_A    II Fertilizer1  58.5
19 Genotype_A   III Fertilizer1  43.9
20 Genotype_A    IV Fertilizer1  46.3
21 Genotype_B     I Fertilizer1  57.6
22 Genotype_B    II Fertilizer1  69.6
23 Genotype_B   III Fertilizer1  42.4
24 Genotype_B    IV Fertilizer1  51.9
25 Genotype_C     I Fertilizer1  63.4
26 Genotype_C    II Fertilizer1  50.4
27 Genotype_C   III Fertilizer1  45.0
28 Genotype_C    IV Fertilizer1  46.7
29 Genotype_D     I Fertilizer1  70.3
30 Genotype_D    II Fertilizer1  67.3
31 Genotype_D   III Fertilizer1  57.6
32 Genotype_D    IV Fertilizer1  58.5
33 Genotype_A     I Fertilizer2  49.5
34 Genotype_A    II Fertilizer2  53.8
35 Genotype_A   III Fertilizer2  40.7
36 Genotype_A    IV Fertilizer2  39.4
37 Genotype_B     I Fertilizer2  59.8
38 Genotype_B    II Fertilizer2  65.8
39 Genotype_B   III Fertilizer2  41.4
40 Genotype_B    IV Fertilizer2  45.4
41 Genotype_C     I Fertilizer2  64.5
42 Genotype_C    II Fertilizer2  46.1
43 Genotype_C   III Fertilizer2  62.6
44 Genotype_C    IV Fertilizer2  50.3
45 Genotype_D     I Fertilizer2  68.8
46 Genotype_D    II Fertilizer2  65.3
47 Genotype_D   III Fertilizer2  45.6
48 Genotype_D    IV Fertilizer2  51.0
49 Genotype_A     I Fertilizer3  44.4
50 Genotype_A    II Fertilizer3  41.8
51 Genotype_A   III Fertilizer3  28.3
52 Genotype_A    IV Fertilizer3  34.7
53 Genotype_B     I Fertilizer3  64.1
54 Genotype_B    II Fertilizer3  57.4
55 Genotype_B   III Fertilizer3  44.1
56 Genotype_B    IV Fertilizer3  51.6
57 Genotype_C     I Fertilizer3  63.6
58 Genotype_C    II Fertilizer3  56.1
59 Genotype_C   III Fertilizer3  52.7
60 Genotype_C    IV Fertilizer3  51.8
61 Genotype_D     I Fertilizer3  71.6
62 Genotype_D    II Fertilizer3  69.4
63 Genotype_D   III Fertilizer3  56.6
64 Genotype_D    IV Fertilizer3  47.4
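Before summarizing, it can help to confirm that the design is balanced. The sketch below is only a stand-in under stated assumptions: it rebuilds the design columns (Genotype, Block, variable) with expand.grid() instead of downloading the file, and the object names design and cell_counts are hypothetical. With the real data, the same table() call on dataA should give four blocks per cell.

```r
# Rebuild the 4 x 4 x 4 layout of the example data (design columns only, no yield values)
design <- expand.grid(
  Block    = c("I", "II", "III", "IV"),
  Genotype = paste0("Genotype_", c("A", "B", "C", "D")),
  variable = c("Control", "Fertilizer1", "Fertilizer2", "Fertilizer3"),
  stringsAsFactors = FALSE
)

# Count observations for each Genotype x variable combination
cell_counts <- table(design$Genotype, design$variable)
print(cell_counts)

# Total number of rows, which should match the 64 rows listed above
nrow(design)
```

Every cell containing the same count is what makes n identical across groups in the summaries that follow.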

I work with this data in various ways, and I often need to summarize it first. Below are two methods for doing so.

1) using plyr package

First, install and activate the package.

install.packages ("plyr")
library (plyr)

I want to summarize the mean of value for each combination of ‘Genotype’ and ‘variable’ in the dataset “dataA”. The summarized data will be named “dataB”.

dataB= ddply (dataA, c("Genotype","variable"), summarise, mean=mean(value), sd=sd(value), n=length(value), se=sd/sqrt(n))

     Genotype    variable   mean        sd n       se
1  Genotype_A     Control 36.050  7.220572 4 3.610286
2  Genotype_A Fertilizer1 50.625  6.733684 4 3.366842
3  Genotype_A Fertilizer2 45.850  6.943822 4 3.471911
4  Genotype_A Fertilizer3 37.300  7.266820 4 3.633410
5  Genotype_B     Control 50.850 14.552548 4 7.276274
6  Genotype_B Fertilizer1 55.375 11.368487 4 5.684244
7  Genotype_B Fertilizer2 53.100 11.581019 4 5.790509
8  Genotype_B Fertilizer3 54.300  8.504509 4 4.252254
9  Genotype_C     Control 53.925  7.982637 4 3.991319
10 Genotype_C Fertilizer1 51.375  8.327615 4 4.163807
11 Genotype_C Fertilizer2 55.875  9.059939 4 4.529970
12 Genotype_C Fertilizer3 56.050  5.363146 4 2.681573
13 Genotype_D     Control 61.925 10.692482 4 5.346241
14 Genotype_D Fertilizer1 63.425  6.336863 4 3.168432
15 Genotype_D Fertilizer2 57.675 11.139532 4 5.569766
16 Genotype_D Fertilizer3 61.250 11.357670 4 5.678835

The means, standard deviations (sd), and standard errors (se) of the values are compiled for each combination of Genotype and variable.
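To make the se column concrete, the calculation for one group can be reproduced by hand in base R. This is a minimal sketch using the four Genotype_A Control values copied from the data listing; the variable names x, m, s, and se are chosen only for illustration.

```r
# Yield values of Genotype_A under Control, copied from the data listing above
x <- c(42.9, 41.6, 28.9, 30.8)

n  <- length(x)     # number of blocks (replicates)
m  <- mean(x)       # group mean
s  <- sd(x)         # sample standard deviation (n - 1 in the denominator)
se <- s / sqrt(n)   # standard error of the mean

# Compare with row 1 of the summary table
round(c(mean = m, sd = s, n = n, se = se), 3)
```

The rounded values should agree with the first row of dataB, which confirms that se is simply sd divided by the square root of the number of blocks.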

□ Utilizing R Studio for Data Grouping and Mean/Standard Error Calculation (feat ddply)

2) using dplyr package

This time, I will demonstrate how to create the same data using the dplyr package. First, let’s install the package.

install.packages ("dplyr")
library (dplyr)

I will create a dataset that compiles the means, standard deviations, and standard errors of value based on Genotype and variable in the given dataset “dataA”. The summarized data will be named “dataC”.

In R Studio, the %>% operator can be inserted by pressing Ctrl + Shift + M (Cmd + Shift + M on macOS), so there is no need to type it manually.

dataC= dataA %>%
       group_by(Genotype, variable) %>%
       summarise(mean=mean(value), sd=sd(value), n=length(value), se=sd/sqrt(n))

# A tibble: 16 × 6
# Groups:   Genotype [4]
   Genotype   variable     mean    sd     n    se
   <chr>      <chr>       <dbl> <dbl> <int> <dbl>
 1 Genotype_A Control      36.0  7.22     4  3.61
 2 Genotype_A Fertilizer1  50.6  6.73     4  3.37
 3 Genotype_A Fertilizer2  45.8  6.94     4  3.47
 4 Genotype_A Fertilizer3  37.3  7.27     4  3.63
 5 Genotype_B Control      50.8 14.6      4  7.28
 6 Genotype_B Fertilizer1  55.4 11.4      4  5.68
 7 Genotype_B Fertilizer2  53.1 11.6      4  5.79
 8 Genotype_B Fertilizer3  54.3  8.50     4  4.25
 9 Genotype_C Control      53.9  7.98     4  3.99
10 Genotype_C Fertilizer1  51.4  8.33     4  4.16
11 Genotype_C Fertilizer2  55.9  9.06     4  4.53
12 Genotype_C Fertilizer3  56.0  5.36     4  2.68
13 Genotype_D Control      61.9 10.7      4  5.35
14 Genotype_D Fertilizer1  63.4  6.34     4  3.17
15 Genotype_D Fertilizer2  57.7 11.1      4  5.57
16 Genotype_D Fertilizer3  61.2 11.4      4  5.68

The output is labeled as a tibble, which means the data is in tibble format rather than a regular data frame. For this kind of analysis, the difference is not significant. I wasn’t sure about it either, so I looked it up; the differences are well explained on the webpage below:

https://sulgik.github.io/r4ds/tibble.html
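The “# Groups: Genotype [4]” line in the tibble output means the result is still grouped by Genotype, because summarise() removes only the last grouping variable by default. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for the .groups argument) and using a small hand-typed subset of dataA; the names small and summary_tbl are hypothetical.

```r
library(dplyr)

# Two genotypes under Control, values copied from the data listing above
small <- data.frame(
  Genotype = rep(c("Genotype_A", "Genotype_B"), each = 4),
  variable = "Control",
  value    = c(42.9, 41.6, 28.9, 30.8, 53.3, 69.6, 45.4, 35.1)
)

summary_tbl <- small %>%
  group_by(Genotype, variable) %>%
  summarise(mean = mean(value), sd = sd(value), n = n(), se = sd / sqrt(n),
            .groups = "drop")          # return an ungrouped tibble

is_grouped_df(summary_tbl)             # checks whether any grouping is left
class(as.data.frame(summary_tbl))      # class after conversion to a data frame
```

Here n() is dplyr's own row counter and gives the same count as length(value). Dropping the groups mainly matters when further dplyr verbs are applied to the summary afterwards.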

If you want to convert the tibble to a regular data frame, add as.data.frame() at the end of the pipeline:

dataC= dataA %>%
       group_by(Genotype, variable) %>%
       summarise(mean=mean(value), sd=sd(value), n=length(value), se=sd/sqrt(n)) %>% 
       as.data.frame()

     Genotype    variable   mean        sd n       se
1  Genotype_A     Control 36.050  7.220572 4 3.610286
2  Genotype_A Fertilizer1 50.625  6.733684 4 3.366842
3  Genotype_A Fertilizer2 45.850  6.943822 4 3.471911
4  Genotype_A Fertilizer3 37.300  7.266820 4 3.633410
5  Genotype_B     Control 50.850 14.552548 4 7.276274
6  Genotype_B Fertilizer1 55.375 11.368487 4 5.684244
7  Genotype_B Fertilizer2 53.100 11.581019 4 5.790509
8  Genotype_B Fertilizer3 54.300  8.504509 4 4.252254
9  Genotype_C     Control 53.925  7.982637 4 3.991319
10 Genotype_C Fertilizer1 51.375  8.327615 4 4.163807
11 Genotype_C Fertilizer2 55.875  9.059939 4 4.529970
12 Genotype_C Fertilizer3 56.050  5.363146 4 2.681573
13 Genotype_D     Control 61.925 10.692482 4 5.346241
14 Genotype_D Fertilizer1 63.425  6.336863 4 3.168432
15 Genotype_D Fertilizer2 57.675 11.139532 4 5.569766
16 Genotype_D Fertilizer3 61.250 11.357670 4 5.678835

Whichever package is used, the summarized data is the same. Now, let’s create a bar graph from the summarized means.

library(ggplot2)
ggplot(data=dataB, aes(x=Genotype, y=mean, fill=variable))+
  geom_bar(stat="identity",position="dodge", width = 0.7, linewidth=1) +
  geom_errorbar(aes(ymin= mean-se, ymax=mean + se), position=position_dodge(0.7),
                width=0.2, color='Black') +
  scale_fill_manual(values= c ("dark blue", "darkred", "blue", "orange")) +
  scale_y_continuous(breaks = seq(0,100,20), limits = c(0,100)) +
  labs(x="Genotype", y="Yield") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.90,0.9),
        legend.title=element_blank(),
        legend.key.size=unit(0.5,'cm'),
        legend.key=element_rect(color=alpha("white",.05), 
                                fill=alpha("white",.05)),
        legend.text=element_text(size=11),
        legend.background= element_rect(fill=alpha("white",.05)),
        panel.grid.major=element_line(colour="grey90", linewidth=0.5),
        axis.line=element_line(linewidth=0.5, colour="black")) +
  windows(width=8, height=5)

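Because windows() is included, the graph is displayed in a new window. Note that windows() is available only in R on Windows; on other systems, calling dev.new(width=8, height=5) before the ggplot() code is a portable alternative.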

A bar graph with standard errors included has been successfully plotted.
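If the figure should go to a file rather than an on-screen window, one option is ggsave(). The sketch below is an illustration under stated assumptions: it uses a four-row excerpt of dataB (Genotype_A only, values copied from the summary table) with a plain theme, and the names excerpt, p, and out_file are hypothetical.

```r
library(ggplot2)

# Four rows of the summary table (Genotype_A only)
excerpt <- data.frame(
  Genotype = "Genotype_A",
  variable = c("Control", "Fertilizer1", "Fertilizer2", "Fertilizer3"),
  mean     = c(36.050, 50.625, 45.850, 37.300),
  se       = c(3.610286, 3.366842, 3.471911, 3.633410)
)

# Store the plot in an object instead of printing it directly
p <- ggplot(excerpt, aes(x = Genotype, y = mean, fill = variable)) +
  geom_col(position = position_dodge(0.7), width = 0.7) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                position = position_dodge(0.7), width = 0.2) +
  labs(x = "Genotype", y = "Yield") +
  theme_classic()

# Save at the same 8 x 5 inch size used for the window; a temporary file keeps the example tidy
out_file <- file.path(tempdir(), "yield_bar.png")
ggsave(out_file, plot = p, width = 8, height = 5, dpi = 300)
```

Using the same dodge width (0.7) for the bars and the error bars keeps each error bar centered on its bar.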

Tip 1 > If you want to change legend titles:

In the graph above, the legend title was hidden with legend.title = element_blank(). Now, I want to display the legend title as “Treatment”.

First, I will change the code from legend.title = element_blank() to legend.title = element_text(face= "plain", family= "serif", size= 12, color= "Black"). Additionally, I will include name= "Treatment" in the scale_fill_manual(values= c("dark blue", "darkred", "blue", "orange")) code. In other words, the modified code will be scale_fill_manual(name= "Treatment", values= c("dark blue", "darkred", "blue", "orange")).

The complete code is as follows:

library(ggplot2)
ggplot(data=dataB, aes(x=Genotype, y=mean, fill=variable))+
  geom_bar(stat="identity",position="dodge", width = 0.7, linewidth=1) +
  geom_errorbar(aes(ymin= mean-se, ymax=mean + se), position=position_dodge(0.7),
                width=0.2, color='Black') +
  scale_fill_manual(name="Treatment", values= c("dark blue", "darkred", "blue", "orange")) +
  scale_y_continuous(breaks = seq(0,100,20), limits = c(0,100)) +
  labs (x="Genotype", y="Yield") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.90,0.9),
        legend.title= element_text(face= "plain", family="serif", size= 12, color= "Black"),
        legend.key.size=unit(0.5,'cm'),
        legend.key=element_rect(color=alpha("white",.05), 
                                fill=alpha("white",.05)),
        legend.text=element_text(size=11),
        legend.background= element_rect(fill=alpha("white",.05)),
        panel.grid.major=element_line(colour="grey90", linewidth=0.5),
        axis.line=element_line(linewidth=0.5, colour="black")) +
  windows(width=8, height=5)
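As a possible alternative to the name= argument, the legend title can also be set through labs(), since the legend here belongs to the fill aesthetic. This is a minimal sketch on a two-row excerpt of the summary table; the names excerpt and p are hypothetical, and it assumes the fill scale itself has no name= set (a name given in the scale would take precedence).

```r
library(ggplot2)

# Two rows of the summary table (Genotype_A only)
excerpt <- data.frame(
  Genotype = "Genotype_A",
  variable = c("Control", "Fertilizer1"),
  mean     = c(36.050, 50.625)
)

p <- ggplot(excerpt, aes(x = Genotype, y = mean, fill = variable)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_manual(values = c("dark blue", "darkred")) +
  labs(x = "Genotype", y = "Yield", fill = "Treatment")  # fill= sets the legend title

p
```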

Tip 2 > If you want to change legend labels:

Now, I want to change the legend labels ‘Fertilizer1’, ‘Fertilizer2’, and ‘Fertilizer3’ to ‘Nitrogen’, ‘Phosphorus’, and ‘Potassium’. How can I achieve this? There are various ways to rename the values in a column, but for now I’ll make the change directly in the plotting code.

□ How to Rename Variables within Columns in R?

In the scale_fill_manual() function, I will add some code as shown below.

scale_fill_manual(name="Treatment", values= c("dark blue", "darkred", "blue", "orange"),
                    breaks=c("Control","Fertilizer1","Fertilizer2","Fertilizer3"), 
                    labels=c("Control", "Nitrogen","Phosphorus","Potassium")) +

The complete code is provided below.

library(ggplot2)
ggplot(data=dataB, aes(x=Genotype, y=mean, fill=variable))+
  geom_bar(stat="identity",position="dodge", width = 0.7, linewidth=1) +
  geom_errorbar(aes(ymin= mean-se, ymax=mean + se), position=position_dodge(0.7),
                width=0.2, color='Black') +
  scale_fill_manual(name="Treatment", values= c("dark blue", "darkred", "blue", "orange"),
                    breaks=c("Control","Fertilizer1","Fertilizer2","Fertilizer3"), 
                    labels=c("Control", "Nitrogen","Phosphorus","Potassium")) +
  scale_y_continuous(breaks = seq(0,100,20), limits = c(0,100)) +
  labs (x="Genotype", y="Yield") +
  theme_classic(base_size=20, base_family="serif")+
  theme(legend.position=c(0.90,0.9),
        legend.title= element_text(face= "plain", family="serif", size= 12, color= "Black"),
        legend.key.size=unit(0.5,'cm'),
        legend.key=element_rect(color=alpha("white",.05), 
                                fill=alpha("white",.05)),
        legend.text=element_text(size=11),
        legend.background= element_rect(fill=alpha("white",.05)),
        panel.grid.major=element_line(colour="grey90", linewidth=0.5),
        axis.line=element_line(linewidth=0.5, colour="black")) +
  windows(width=8, height=5)
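One of the other ways to rename the treatments is to recode the column itself before plotting, so the legend picks up the new labels without breaks= and labels= in the scale. The following is a minimal base R sketch on a standalone vector; the names variable and treatment are used only for illustration, and the mapping of Fertilizer1 to Fertilizer3 onto the three nutrients follows the example above.

```r
# Treatment codes as they appear in the 'variable' column
variable <- c("Control", "Fertilizer1", "Fertilizer2", "Fertilizer3")

# Map each code to a display label; levels and labels are matched by position
treatment <- factor(variable,
                    levels = c("Control", "Fertilizer1", "Fertilizer2", "Fertilizer3"),
                    labels = c("Control", "Nitrogen", "Phosphorus", "Potassium"))

levels(treatment)
as.character(treatment)
```

Applied to the summary data, the same factor() call could overwrite dataB$variable, after which the original scale_fill_manual(name="Treatment", values=...) line would be enough.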

Agronomy4future

Stories about cereals and statistics (plus coding). We aim to develop open-source code for agronomy.