Data filtering using R Studio
When you conduct statistical analysis, you might want to include/exclude some variables. For example, here is one data.
data:image/s3,"s3://crabby-images/ff330/ff330ff9e5a579935e1dc9d78ac00234a3a7c290" alt=""
This is data about how yield, grain number (GN) and averge grain weight (AGW) are different according to two different fertilizers (N0, N1) in five genotypes (CV1 – CV5). That is, there will be 10 treatments [Genotype (5) x Nitrogen (2) =10]. Replicates are 10 as blocks, and therefore experimental unit will be 30 [10 treatments x 3 blocks = 30].
What if we want to analyze in only N1 condition? or only about CV1? I’ll introdce how to filter data in R studio?
Let’s upload the data above.
# to uplopad data
library (readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/yield%20component_nitrogen.csv"
dataA=data.frame(read_csv(url(github),show_col_types=FALSE))
data:image/s3,"s3://crabby-images/5bd7d/5bd7d3f6d544f9b702aa67cde88a9bab61a145a9" alt=""
1. subset
I’d like to filter one variable. For example, there are two ways to select N1.
N1= subset (dataA, Nitrogen=="N1")
N1= subset (dataA, Nitrogen!="N1")
print(N1)
data:image/s3,"s3://crabby-images/a3a5c/a3a5cb1f7d4c690a15c9f0c29b1b5df7ad27d0d0" alt=""
How about selecting several variables? For example, I want to select CV1 and N1.
CV1_N1= subset (dataA, Genotype=="CV1" & Nitrogen=="N1")
CV1_N1
data:image/s3,"s3://crabby-images/56b16/56b1642eb06149b69d846e4cffbbf645cccfc5b9" alt=""
How about selecting one variable and excluding another variable?
CV1X_N1= subset (dataA, Genotype!="CV1" & Nitrogen=="N1")
CV1X_N1
data:image/s3,"s3://crabby-images/4cf6d/4cf6d91382e8693d0f8da88996e2f73859a69e3b" alt=""
How about selecting two variables within the same factor? For example, I want to select both CV1 and CV3. So I used the below code.
CV1_CV3= subset (dataA, Genotype=="CV1" & Genotype=="CV3")
CV1_CV3
data:image/s3,"s3://crabby-images/213f1/213f107615313954693300b91912bb83cf3045d7" alt=""
But I can’t select any variables. This is because if I select CV1, and now only CV1 exists. In this condition, if I select CV3 which does not exist, no variables are selected.
We can solve this problem using |
CV1_CV3= subset (dataA,Genotype=="CV1" | Genotype=="CV3")
CV1_CV3
data:image/s3,"s3://crabby-images/6aceb/6aceb5c6e4237965450bde2c8b0f59ca950ec8c3" alt=""
Or below code is possible.
CV1_CV3= subset (dataA, Genotype!="CV2" & Genotype!="CV4" & Genotype!="CV5" )
How about selecting two variables within the same factor, and another variable? For example, I want to select both CV1 and CV3, and then select N1.
CV1_CV3_N1= subset (dataA, c(Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")
CV1_CV3_N1
data:image/s3,"s3://crabby-images/5f451/5f451b173cf9d87b736fdd833fd2da77674d2eea" alt=""
There are no specific answers. Simply we can use below code.
N1= subset (dataA, Nitrogen=="N1")
CV1_CV3_N1= subset(N1, Genotype=="CV1" | Genotype=="CV3")
CV1_CV3_N1
First, we can select N1, and in N1, we can selecte both CV1 and CV3. We can shorten the code like below.
CV1_CV3_N1= subset(subset (dataA, Nitrogen=="N1"), Genotype=="CV1" | Genotype=="CV3")
In summary, these three codes are the same code to select CV1, CV3, N1.
#1
CV1_CV3_N1= subset (dataA, c(Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")
#2
N1<- subset (dataA, Nitrogen=="N1")
CV1_CV3_N1= subset(N1, Genotype=="CV1" | Genotype=="CV3")
#3
CV1_CV3_N1= subset(subset(dataA, Nitrogen=="N1"), Genotype=="CV1" | Genotype=="CV3")
2. dplyr
Now, let’s use dplyr package.
install.packages ("dplyr")
library (dplyr)
Now, I’d like to select N1
#1
dataB= dataA %>% filter (Nitrogen=="N1")
#2
dataB= dataA %>% filter (Nitrogen!="N0")
print (dataB)
data:image/s3,"s3://crabby-images/13bba/13bba737c1d432fdbbe6f17a5b8e8bc599a40acf" alt=""
How about selecting CV1 and N1? It’s similar with subset()
#subset()
CV1_N1= subset (dataA, Genotype=="CV1" & Nitrogen=="N1")
#dplyr
CV1_N1= dataA %>% filter (Genotype=="CV1" & Nitrogen!="N0")
How about selecting both CV1 and CV3?
#subset()
CV1_CV3= subset (dataA, Genotype=="CV1" | Genotype=="CV3")
#dplyr
CV1_CV3= dataA %>% filter(Genotype=="CV1"| Genotype=="CV3")
How about selecting both CV1 and CV3, and then N1?
#subset()
CV1_CV3_N1= subset (dataA, c(Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")
#dplyr
CV1_CV3_N1= dataA %>% filter (c(Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")
Now, let’s export this data to Excel.
install.packages("writexl")
library(writexl)
write_xlsx(CV1_CV3_N1,"C://Users/Usuari/Desktop//CV1_3_N1.xlsx")