How to select/delete specific variables using RStudio?



In my previous post, I explained how to select or delete specific columns. This time, I’ll explain how to select or delete specific values within a column, that is, the rows that contain those values. Once again, I’ll generate a new set of data.

Genotype=rep(c("CV1","CV2","CV3"), times=5)
Yield=c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA=data.frame(Genotype,Yield)

head(dataA, 10)
   Genotype Yield
1       CV1    20
2       CV2    25
3       CV3    28
4       CV1    35
5       CV2    25
6       CV3    26
7       CV1    34
8       CV2    57
9       CV3    36
10      CV1    44
.
.
.

If I want to divide the data by genotype, I use the code below.

cv1=subset(dataA, Genotype=="CV1")
cv2=subset(dataA, Genotype=="CV2")
cv3=subset(dataA, Genotype=="CV3")
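The three subset() calls above can also be written as one call. Below is a minimal sketch using base R's split(), assuming the same dataA as above; the name by_genotype is my own and only for illustration.

```r
# Rebuild the example data so this block runs on its own
Genotype = rep(c("CV1","CV2","CV3"), times=5)
Yield = c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA = data.frame(Genotype, Yield)

# split() returns a named list with one data frame per genotype
by_genotype = split(dataA, dataA$Genotype)   # by_genotype is a hypothetical name

names(by_genotype)
by_genotype$CV1   # the CV1 rows, like subset(dataA, Genotype=="CV1")
```

This is handy when there are many genotypes, because no genotype name has to be typed by hand.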

But what if I simply want to delete all instances of the CV2 genotype? The code is below.

dataB= subset(dataA, Genotype!="CV2")

dataB
   Genotype Yield
1       CV1    20
3       CV3    28
4       CV1    35
6       CV3    26
7       CV1    34
9       CV3    36
10      CV1    44
12      CV3    36
13      CV1    41
15      CV3    29

Alternatively, the code below is also a valid option.

dataC=subset(dataA, Genotype=="CV1" | Genotype=="CV3")
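The two versions agree here only because the data contains exactly three genotypes. As a quick check, the sketch below compares them and also shows the same filter written with bracket indexing in base R. The name dataD is one I made up for this example.

```r
# Rebuild the example data so this block runs on its own
Genotype = rep(c("CV1","CV2","CV3"), times=5)
Yield = c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA = data.frame(Genotype, Yield)

dataB = subset(dataA, Genotype!="CV2")
dataC = subset(dataA, Genotype=="CV1" | Genotype=="CV3")

# Logical indexing: keep rows where the condition is TRUE, keep all columns
dataD = dataA[dataA$Genotype != "CV2", ]   # dataD is a hypothetical name

# Compare the three results
isTRUE(all.equal(dataB, dataC))
isTRUE(all.equal(dataB, dataD))
```

If a fourth genotype were added later, the != version would keep it and the == version would drop it, so the choice depends on what should happen to new values.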


How about deleting multiple variables?

Country=c("Spain","Canada","USA","Korea","Netherlands","Denmark","France","UK","Japan","Germany")
Income=c("40k","50k","60k","45k","55k","70k","50k","55k","55k","50k")
dataA=data.frame(Country,Income)

dataA
       Country Income
1        Spain    40k
2       Canada    50k
3          USA    60k
4        Korea    45k
5  Netherlands    55k
6      Denmark    70k
7       France    50k
8           UK    55k
9        Japan    55k
10     Germany    50k

Let’s assume that the data above lists the salaries of postdoctoral researchers by country. However, upon examining the data, I found inaccuracies in the entries for Spain, France, and Japan, so I would like to delete them. We can chain several != conditions with & to delete multiple values at once.

dataA=subset (dataA, Country!="Spain" & Country!="France" & Country!="Japan")

dataA
       Country Income
2       Canada    50k
3          USA    60k
4        Korea    45k
5  Netherlands    55k
6      Denmark    70k
8           UK    55k
10     Germany    50k

Alternatively, the code below is also a valid option.

dataB=subset(dataA,!(Country %in% c("Spain","France","Japan")))

dataB
       Country Income
2       Canada    50k
3          USA    60k
4        Korea    45k
5  Netherlands    55k
6      Denmark    70k
8           UK    55k
10     Germany    50k
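To see why the two forms agree, it can help to look at the logical vectors themselves. The sketch below rebuilds the original ten-row data under the hypothetical name salary (so it does not overwrite dataA) and compares the chained != conditions with the negated %in% test. The helper vector drop_list is also my own name.

```r
Country = c("Spain","Canada","USA","Korea","Netherlands","Denmark","France","UK","Japan","Germany")
Income = c("40k","50k","60k","45k","55k","70k","50k","55k","55k","50k")
salary = data.frame(Country, Income)      # hypothetical name for the original data

drop_list = c("Spain","France","Japan")   # hypothetical helper vector

# One TRUE/FALSE per row: TRUE means "keep this row"
keep_and = salary$Country!="Spain" & salary$Country!="France" & salary$Country!="Japan"
keep_in  = !(salary$Country %in% drop_list)

keep_in                        # the logical vector used for filtering
identical(keep_and, keep_in)   # do both forms mark the same rows?
salary[keep_in, ]              # rows that survive the filter
```

When the list of countries grows, only drop_list has to change in the %in% version, while the & version needs one more condition per country.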

Or, we can delete the rows using filter() from the dplyr package.

library (dplyr)

dataC= dataA %>%
  filter(!(Country %in% c("Spain","France","Japan")))


What is the difference between & and | ?

I have a dataset that looks like the following. Let’s say these are the math and English scores of 8 students from different countries.

name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
eng=c(85,90,90,88,95,85,87,88)
country=c("USA","Spain","France","Germany","Netherlands", rep("Korea",3))
gender=c(rep(c("Male","Female"),times=4))
enroll=c(rep(c("Yes","No"),each=4))
grade=data.frame(name,math,eng,country,gender,enroll)

grade
   name math eng     country gender enroll
1  Jack   90  85         USA   Male    Yes
2  Kate   85  90       Spain Female    Yes
3  John   95  90      France   Male    Yes
4  Jane   75  88     Germany Female    Yes
5 David   80  95 Netherlands   Male     No
6   Min   90  85       Korea Female     No
7  Hyuk   90  87       Korea   Male     No
8 Jisoo   85  88       Korea Female     No

Now, I would like to exclude David, people from Korea, and all male students. So, I used the code below.

dataA= subset (grade, name!="David" & country!="Korea" & gender!="Male")
dataA
  name math eng country gender enroll
2 Kate   85  90   Spain Female    Yes
4 Jane   75  88 Germany Female    Yes
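Another way to read this filter is "drop a row if any one of the exclusion reasons applies". The sketch below writes it both ways on a cut-down copy of the data (the names roster, keep_a, and keep_b are hypothetical) and compares the results.

```r
name = c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
country = c("USA","Spain","France","Germany","Netherlands", rep("Korea",3))
gender = rep(c("Male","Female"), times=4)
roster = data.frame(name, country, gender)   # hypothetical cut-down version of grade

# Keep a row only if it passes all three "not equal" tests
keep_a = subset(roster, name!="David" & country!="Korea" & gender!="Male")

# Same idea written as "not (any of the exclusion reasons)"
keep_b = subset(roster, !(name=="David" | country=="Korea" | gender=="Male"))

keep_a$name
isTRUE(all.equal(keep_a, keep_b))   # compare the two results
```

So when excluding, the != conditions are joined with &, and the equivalent == conditions are joined with | inside a single !( ).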

Now, I would like to include only Jack, David, and Jisoo. The code is below.

dataB= subset (grade, name=="Jack" | name=="David" | name=="Jisoo")
dataB
   name math eng     country gender enroll
1  Jack   90  85         USA   Male    Yes
5 David   80  95 Netherlands   Male     No
8 Jisoo   85  88       Korea Female     No

Why did I use | and not &?

Think about this!!
A single row has only one name, so a row where the name is "Jack" cannot also be "David" or "Jisoo". So, if I use code like dataB=subset(grade, name=="Jack" & name=="David" & name=="Jisoo"), the condition is never TRUE for any row and no rows are returned. The | operator keeps a row when at least one of the conditions is TRUE, which allows us to select multiple values in the same column.
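This is easy to check by counting rows. A minimal sketch on a cut-down copy of the data (grade_small, with_and, and with_or are names I made up for this example):

```r
name = c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math = c(90,85,95,75,80,90,90,85)
grade_small = data.frame(name, math)   # hypothetical cut-down version of grade

# & needs every condition to be TRUE in the same row, which never happens here
with_and = subset(grade_small, name=="Jack" & name=="David" & name=="Jisoo")

# | needs at least one condition to be TRUE in a row
with_or = subset(grade_small, name=="Jack" | name=="David" | name=="Jisoo")

nrow(with_and)   # number of rows kept by the & version
nrow(with_or)    # number of rows kept by the | version
with_or$name
```

Note that the & version does not raise an error. It silently returns an empty data frame, so it is worth checking nrow() after filtering.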

We can also use filter() from the dplyr package.

library (dplyr)
grade1= grade %>%
  filter((name %in% c("Jack","David","Jisoo")))

grade1
   name math eng     country gender enroll
1  Jack   90  85         USA   Male    Yes
2 David   80  95 Netherlands   Male     No
3 Jisoo   85  88       Korea Female     No

