How to select/delete specific variables using R STUDIO?
□ How to select and delete specific columns using R STUDIO?
In my previous post, I explained how to select or delete specific columns. This time, I’ll explain how to select or delete specific variables (values) within a column, that is, the rows that contain them. Once again, I’ll generate a new set of data.
Genotype=rep(c("CV1","CV2","CV3"), times=5)
Yield=c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA=data.frame(Genotype,Yield)
head(dataA, 10)
Genotype Yield
1 CV1 20
2 CV2 25
3 CV3 28
4 CV1 35
5 CV2 25
6 CV3 26
7 CV1 34
8 CV2 57
9 CV3 36
10 CV1 44
.
.
.
If I want to divide the data by genotype, I use the code below.
cv1=subset(dataA, Genotype=="CV1")
cv2=subset(dataA, Genotype=="CV2")
cv3=subset(dataA, Genotype=="CV3")
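If there are many genotypes, writing one subset() call per genotype gets repetitive. As a minimal sketch (the name byGenotype is my own, not from the original code), base R's split() can produce all the pieces at once as a named list:

```r
# Recreate the example data from above
Genotype <- rep(c("CV1", "CV2", "CV3"), times = 5)
Yield <- c(20, 25, 28, 35, 25, 26, 34, 57, 36, 44, 29, 36, 41, 25, 29)
dataA <- data.frame(Genotype, Yield)

# split() returns a named list with one data frame per genotype
# (byGenotype is a hypothetical name chosen for this sketch)
byGenotype <- split(dataA, dataA$Genotype)

names(byGenotype)   # "CV1" "CV2" "CV3"
byGenotype$CV1      # the same rows as subset(dataA, Genotype == "CV1")
```

Each piece can then be reached by name, for example byGenotype$CV2 or byGenotype[["CV3"]].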
But what if I simply want to delete all instances of the CV2 genotype? The code is below.
dataB= subset(dataA, Genotype!="CV2")
dataB
Genotype Yield
1 CV1 20
3 CV3 28
4 CV1 35
6 CV3 26
7 CV1 34
9 CV3 36
10 CV1 44
12 CV3 36
13 CV1 41
15 CV3 29
Alternatively, the code below is also a valid option.
dataC=subset(dataA, Genotype=="CV1" | Genotype=="CV3")
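The two versions give the same result here only because CV1, CV2, and CV3 are the only genotypes in the data. A small sketch to check this, assuming the same example data:

```r
# Recreate the example data from above
Genotype <- rep(c("CV1", "CV2", "CV3"), times = 5)
Yield <- c(20, 25, 28, 35, 25, 26, 34, 57, 36, 44, 29, 36, 41, 25, 29)
dataA <- data.frame(Genotype, Yield)

# Version 1: drop CV2
dataB <- subset(dataA, Genotype != "CV2")

# Version 2: keep CV1 or CV3
dataC <- subset(dataA, Genotype == "CV1" | Genotype == "CV3")

# Compare the two results
all.equal(dataB, dataC)   # TRUE
```

If the data had a fourth genotype, the first version would keep it and the second would drop it, so the choice depends on whether you are thinking in terms of what to remove or what to keep.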
How about deleting multiple variables?
Country=c("Spain","Canada","USA","Korea","Netherlands","Denmark","France","UK","Japan","Germany")
Income=c("40k","50k","60k","45k","55k","70k","50k","55k","55k","50k")
dataA=data.frame(Country,Income)
dataA
Country Income
1 Spain 40k
2 Canada 50k
3 USA 60k
4 Korea 45k
5 Netherlands 55k
6 Denmark 70k
7 France 50k
8 UK 55k
9 Japan 55k
10 Germany 50k
Let’s assume that the data above lists the salaries of postdoctoral researchers by country. However, upon examining the data, I found inaccuracies for Spain, France, and Japan, so I would like to delete them. We can use the & operator to delete multiple variables.
dataA=subset (dataA, Country!="Spain" & Country!="France" & Country!="Japan")
dataA
Country Income
2 Canada 50k
3 USA 60k
4 Korea 45k
5 Netherlands 55k
6 Denmark 70k
8 UK 55k
10 Germany 50k
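Alternatively, the code below is also a valid option. Note that dataA was already overwritten by the code above, so to see %in% actually remove the three countries, recreate the original dataA first.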
dataB=subset(dataA,!(Country %in% c("Spain","France","Japan")))
dataB
Country Income
2 Canada 50k
3 USA 60k
4 Korea 45k
5 Netherlands 55k
6 Denmark 70k
8 UK 55k
10 Germany 50k
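To confirm that the chained != conditions and the %in% version really agree, here is a small sketch that starts from a fresh copy of the data. The names salary, toRemove, viaAnd, and viaIn are my own and only used for illustration.

```r
# Fresh copy of the example data (salary is a hypothetical name)
Country <- c("Spain", "Canada", "USA", "Korea", "Netherlands",
             "Denmark", "France", "UK", "Japan", "Germany")
Income <- c("40k", "50k", "60k", "45k", "55k", "70k", "50k", "55k", "55k", "50k")
salary <- data.frame(Country, Income)

# Countries to remove, stored once so the list is easy to extend
toRemove <- c("Spain", "France", "Japan")

# Approach 1: one != condition per country, joined with &
viaAnd <- subset(salary, Country != "Spain" & Country != "France" & Country != "Japan")

# Approach 2: negate a single %in% test
viaIn <- subset(salary, !(Country %in% toRemove))

all.equal(viaAnd, viaIn)   # TRUE
nrow(salary)               # 10
nrow(viaIn)                # 7
```

The %in% form tends to be easier to maintain, since adding a country only means adding one element to toRemove.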
Or, we can delete them using filter() from the dplyr package.
library (dplyr)
dataC= dataA %>%
filter(!(Country %in% c("Spain","France","Japan")))
What is the difference between & and |?
I have a dataset that looks like the following. Let’s say these are the math and English scores for 8 students from different countries.
name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
eng=c(85,90,90,88,95,85,87,88)
country=c("USA","Spain","France","Germany","Netherlands", rep("Korea",3))
gender=c(rep(c("Male","Female"),times=4))
enroll=c(rep(c("Yes","No"),each=4))
grade=data.frame(name,math,eng,country,gender,enroll)
grade
name math eng country gender enroll
1 Jack 90 85 USA Male Yes
2 Kate 85 90 Spain Female Yes
3 John 95 90 France Male Yes
4 Jane 75 88 Germany Female Yes
5 David 80 95 Netherlands Male No
6 Min 90 85 Korea Female No
7 Hyuk 90 87 Korea Male No
8 Jisoo 85 88 Korea Female No
Now, I would like to exclude David, people from Korea, and all male students. So, I used the code below.
dataA= subset (grade, name!="David" & country!="Korea" & gender!="Male")
dataA
name math eng country gender enroll
2 Kate 85 90 Spain Female Yes
4 Jane 75 88 Germany Female Yes
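It can help to look at the TRUE/FALSE vector that subset() works with. The sketch below (the names keep and drop are mine) builds that vector directly. It also shows that keeping rows where every != condition holds is the same as dropping rows where any == condition holds.

```r
# Recreate the example data from above
name <- c("Jack", "Kate", "John", "Jane", "David", "Min", "Hyuk", "Jisoo")
math <- c(90, 85, 95, 75, 80, 90, 90, 85)
eng <- c(85, 90, 90, 88, 95, 85, 87, 88)
country <- c("USA", "Spain", "France", "Germany", "Netherlands", rep("Korea", 3))
gender <- rep(c("Male", "Female"), times = 4)
enroll <- rep(c("Yes", "No"), each = 4)
grade <- data.frame(name, math, eng, country, gender, enroll)

# A row is kept only if ALL three "not equal" conditions are TRUE
keep <- with(grade, name != "David" & country != "Korea" & gender != "Male")
keep   # FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE

# Equivalent view: a row is dropped if ANY of the three "equal" conditions is TRUE
drop <- with(grade, name == "David" | country == "Korea" | gender == "Male")
identical(keep, !drop)   # TRUE

grade[keep, ]   # the rows for Kate and Jane
```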
Now, I would like to include only Jack, David, and Jisoo. The code is below.
dataB= subset (grade, name=="Jack" | name=="David" | name=="Jisoo")
dataB
name math eng country gender enroll
1 Jack 90 85 USA Male Yes
5 David 80 95 Netherlands Male No
8 Jisoo 85 88 Korea Female No
Why did I use | and not &?
Think about this!!
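Each row holds only one name. Once a row matches name=="Jack", it cannot also be David or Jisoo. So code like dataB=subset(grade, name=="Jack" & name=="David" & name=="Jisoo") runs without error but returns zero rows, because no row can satisfy all three conditions at once. The | operator keeps a row when at least one of the conditions is true, which is what allows us to select multiple variables in the same column.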
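A quick way to see this is to run both versions and count the rows. This is a minimal sketch using only the name and math columns; withAnd and withOr are names I chose for illustration.

```r
# Only the columns needed for this check
name <- c("Jack", "Kate", "John", "Jane", "David", "Min", "Hyuk", "Jisoo")
math <- c(90, 85, 95, 75, 80, 90, 90, 85)
grade <- data.frame(name, math)

# & requires one row to be Jack, David, and Jisoo at the same time: impossible
withAnd <- subset(grade, name == "Jack" & name == "David" & name == "Jisoo")

# | keeps a row if it matches at least one of the names
withOr <- subset(grade, name == "Jack" | name == "David" | name == "Jisoo")

nrow(withAnd)   # 0
nrow(withOr)    # 3
```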
We can also use %in% with filter() from the dplyr package.
library (dplyr)
grade1= grade %>%
filter((name %in% c("Jack","David","Jisoo")))
grade1
name math eng country gender enroll
1 Jack 90 85 USA Male Yes
2 David 80 95 Netherlands Male No
3 Jisoo 85 88 Korea Female No
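If dplyr is not installed, the same selection can be sketched in base R with bracket indexing. The names wanted and picked are hypothetical, and only two columns are recreated here to keep the example short.

```r
# Only the columns needed for this check
name <- c("Jack", "Kate", "John", "Jane", "David", "Min", "Hyuk", "Jisoo")
math <- c(90, 85, 95, 75, 80, 90, 90, 85)
grade <- data.frame(name, math)

# Names to keep, stored once
wanted <- c("Jack", "David", "Jisoo")

# Logical row index before the comma; empty after the comma keeps all columns
picked <- grade[grade$name %in% wanted, ]

picked
rownames(picked)   # "1" "5" "8"
```

One visible difference: the base R result keeps the original row numbers (1, 5, 8), whereas the filter() output above is renumbered from 1 to 3.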