Matching Datasets in R: An Approach Comparable to Excel’s VLOOKUP Function

Matching Datasets in R: An Approach Comparable to Excel’s VLOOKUP Function


I have two datasets.

dataA= data.frame(
  season= c(2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022),
  treatment= c("Tr1", "Tr1", "Tr1", "Tr2", "Tr2", "Tr2", "Tr3", "Tr3", "Tr3"),
  rep= c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  grain_yield= c(10, 12, 13, 14, 15, 17, 12, 15, 18)
)

  season treatment rep grain_yield
1   2022       Tr1   1          10
2   2022       Tr1   2          12
3   2022       Tr1   3          13
4   2022       Tr2   1          14
5   2022       Tr2   2          15
6   2022       Tr2   3          17
7   2022       Tr3   1          12
8   2022       Tr3   2          15
9   2022       Tr3   3          18

dataB= data.frame(
  season= c(2023, 2023, 2023, 2023, 2023, 2023, 2023),
  treatment= c("Tr1", "Tr1", "Tr2", "Tr2", "Tr2", "Tr3", "Tr3"),
  rep= c(1, 2, 1, 2, 3, 1, 3),
  grain_yield= c(12, 12, 14, 15, 17, 12, 18)
)

  season treatment rep grain_yield
1   2023       Tr1   1          12
2   2023       Tr1   2          12
3   2023       Tr2   1          14
4   2023       Tr2   2          15
5   2023       Tr2   3          17
6   2023       Tr3   1          12
7   2023       Tr3   3          18

Now, I want to combine these two datasets, but the row numbers differ between the two datasets. In dataB, the 3rd replicate for Tr1 and the 2nd replicate for Tr3 were deleted due to environmental errors.

In this case, simply combining the two datasets is not feasible. One solution is to merge them row-wise using the rbind() function. This way, the two datasets will be combined by row.

merged_data= cbind(dataA, dataB)

  season treatment rep grain_yield
1    2022       Tr1   1          10
2    2022       Tr1   2          12
3    2022       Tr1   3          13
4    2022       Tr2   1          14
5    2022       Tr2   2          15
6    2022       Tr2   3          17
7    2022       Tr3   1          12
8    2022       Tr3   2          15
9    2022       Tr3   3          18
10   2023       Tr1   1          12
11   2023       Tr1   2          12
12   2023       Tr2   1          14
13   2023       Tr2   2          15
14   2023       Tr2   3          17
15   2023       Tr3   1          12
16   2023       Tr3   3          18

However, my goal is to combine the two datasets column-wise, allowing me to check for missing data between the two datasets. This is akin to the functionality in Excel using VLOOKUP().



Using left_join()

I’ll use left_join() dplyr.

library(dplyr)

merged_data=left_join(dataA, dataB, by=c("treatment","rep"), suffix=c("_2022", "_2023"))

  treatment rep season_2022 grain_yield_2022 season_2023 grain_yield_2023
1       Tr1   1        2022               10        2023               12
2       Tr1   2        2022               12        2023               12
3       Tr1   3        2022               13          NA               NA
4       Tr2   1        2022               14        2023               14
5       Tr2   2        2022               15        2023               15
6       Tr2   3        2022               17        2023               17
7       Tr3   1        2022               12        2023               12
8       Tr3   2        2022               15          NA               NA
9       Tr3   3        2022               18        2023               18

Now two datasets were combined, but the missing values are also included as NA.

If there is a left_join() function, there also will be likely to be right_join(). Let’s use right_join()

merged_data=right_join(dataA, dataB, by=c("treatment","rep"), suffix=c("_2022", "_2022"))

  season_2022 treatment rep grain_yield_2022 season_2023 grain_yield_2023
1        2022       Tr1   1               10        2023               12
2        2022       Tr1   2               12        2023               12
3        2022       Tr2   1               14        2023               14
4        2022       Tr2   2               15        2023               15
5        2022       Tr2   3               17        2023               17
6        2022       Tr3   1               12        2023               12
7        2022       Tr3   3               18        2023               18

Can you identify the difference? When I use right_join(), the missing data in dataB is excluded, and only the existing matching data is displayed.


Comments are closed.