Matching Datasets in R: An Approach Comparable to Excel’s VLOOKUP Function
I have two datasets.
dataA= data.frame(
season= c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022),
treatment= c("Tr1", "Tr1", "Tr1", "Tr2", "Tr2", "Tr2", "Tr3", "Tr3", "Tr3"),
rep= c(1, 2, 3, 1, 2, 3, 1, 2, 3),
grain_yield= c(10, 12, 13, 14, 15, 17, 12, 15, 18)
)
season treatment rep grain_yield
1 2022 Tr1 1 10
2 2022 Tr1 2 12
3 2022 Tr1 3 13
4 2022 Tr2 1 14
5 2022 Tr2 2 15
6 2022 Tr2 3 17
7 2022 Tr3 1 12
8 2022 Tr3 2 15
9 2022 Tr3 3 18
dataB= data.frame(
season= c(2023, 2023, 2023, 2023, 2023, 2023, 2023),
treatment= c("Tr1", "Tr1", "Tr2", "Tr2", "Tr2", "Tr3", "Tr3"),
rep= c(1, 2, 1, 2, 3, 1, 3),
grain_yield= c(12, 12, 14, 15, 17, 12, 18)
)
season treatment rep grain_yield
1 2023 Tr1 1 12
2 2023 Tr1 2 12
3 2023 Tr2 1 14
4 2023 Tr2 2 15
5 2023 Tr2 3 17
6 2023 Tr3 1 12
7 2023 Tr3 3 18
Now, I want to combine these two datasets, but the row numbers differ between the two datasets. In dataB
, the 3rd replicate for Tr1
and the 2nd replicate for Tr3
were deleted due to environmental errors.
In this case, simply combining the two datasets is not feasible. One solution is to merge them row-wise using the rbind()
function. This way, the two datasets will be combined by row.
merged_data= cbind(dataA, dataB)
season treatment rep grain_yield
1 2022 Tr1 1 10
2 2022 Tr1 2 12
3 2022 Tr1 3 13
4 2022 Tr2 1 14
5 2022 Tr2 2 15
6 2022 Tr2 3 17
7 2022 Tr3 1 12
8 2022 Tr3 2 15
9 2022 Tr3 3 18
10 2023 Tr1 1 12
11 2023 Tr1 2 12
12 2023 Tr2 1 14
13 2023 Tr2 2 15
14 2023 Tr2 3 17
15 2023 Tr3 1 12
16 2023 Tr3 3 18
However, my goal is to combine the two datasets column-wise, allowing me to check for missing data between the two datasets. This is akin to the functionality in Excel using VLOOKUP()
.
Using left_join()
I’ll use left_join()
dplyr.
library(dplyr)
merged_data=left_join(dataA, dataB, by=c("treatment","rep"), suffix=c("_2022", "_2023"))
treatment rep season_2022 grain_yield_2022 season_2023 grain_yield_2023
1 Tr1 1 2022 10 2023 12
2 Tr1 2 2022 12 2023 12
3 Tr1 3 2022 13 NA NA
4 Tr2 1 2022 14 2023 14
5 Tr2 2 2022 15 2023 15
6 Tr2 3 2022 17 2023 17
7 Tr3 1 2022 12 2023 12
8 Tr3 2 2022 15 NA NA
9 Tr3 3 2022 18 2023 18
Now two datasets were combined, but the missing values are also included as NA.
If there is a left_join()
function, there also will be likely to be right_join()
. Let’s use right_join()
merged_data=right_join(dataA, dataB, by=c("treatment","rep"), suffix=c("_2022", "_2022"))
season_2022 treatment rep grain_yield_2022 season_2023 grain_yield_2023
1 2022 Tr1 1 10 2023 12
2 2022 Tr1 2 12 2023 12
3 2022 Tr2 1 14 2023 14
4 2022 Tr2 2 15 2023 15
5 2022 Tr2 3 17 2023 17
6 2022 Tr3 1 12 2023 12
7 2022 Tr3 3 18 2023 18
Can you identify the difference? When I use right_join()
, the missing data in dataB
is excluded, and only the existing matching data is displayed.