Combining Factors from Separate Datasets into a Single Column Using R Studio (feat. dplyr package)
When data is divided into two separate datasets, it needs to be combined into a single column. Using R, we can simply combine the two datasets.
I will create a simple dataset.
Defoliation= rbind(data.frame(Genotype=c ("CV1","CV2","CV3","CV4","CV5"),
Block="I", yield=c("50","52","55","56","54")),
data.frame(Genotype=c ("CV1","CV2","CV3","CV4","CV5"),
Block="II", yield=c("52","50","56","58","59")),
data.frame(Genotype=c ("CV1","CV2","CV3","CV4","CV5"),
Block="III", yield=c("52","50","56","58","59")))
Defoliation
Genotype Block yield
1 CV1 I 50
2 CV2 I 52
3 CV3 I 55
4 CV4 I 56
5 CV5 I 54
6 CV1 II 52
7 CV2 II 50
8 CV3 II 56
9 CV4 II 58
10 CV5 II 59
11 CV1 III 52
12 CV2 III 50
13 CV3 III 56
14 CV4 III 58
15 CV5 III 59
De_graining= rbind(data.frame(Genotype= c("CV1","CV2","CV3","CV4","CV5"),
Block="I", yield= c("62","59","62","60","59")),
data.frame(Genotype= c("CV1","CV2","CV3","CV4","CV5"),
Block="II", yield= c("61","60","63","62","60")),
data.frame(Genotype= c("CV1","CV2","CV3","CV4","CV5"),
Block="III", yield= c("64","58","65","61","62")))
De_graining
Genotype Block yield
1 CV1 I 62
2 CV2 I 59
3 CV3 I 62
4 CV4 I 60
5 CV5 I 59
6 CV1 II 61
7 CV2 II 60
8 CV3 II 63
9 CV4 II 62
10 CV5 II 60
11 CV1 III 64
12 CV2 III 58
13 CV3 III 65
14 CV4 III 61
15 CV5 III 62
Now I will combine these two datasets into one.
1) using data.frame()
To explain the below code simply, we are using the function data.frame()
to combine two datasets. Regarding the repetition of the text “Defoliation,” it indicates repeating it by the number of times corresponding to the values in the ‘yield’ column of the Defoliation dataset, which is length(Defoliation$yield)
. Similarly, for the text “De_graining,” it means repeating it by the number of times corresponding to the values in the ‘yield’ column of the De_graining dataset, which is length(De_graining$yield)
. By doing this repetition, we are taking the values of the ‘yield’ column from the Defoliation dataset and the De_graining dataset.
When you execute the above code, the data will be structured as follows. However, this data lacks an ID. In other words, we need to incorporate the Genotype and Block, which constitute the ID in this data.
new_data= data.frame (dataset=c(rep("Defoliation", length(Defoliation$yield)),
rep("De_graining",length(De_graining$yield))),
value=c(Defoliation$yield, De_graining$yield))
new_data
dataset value
1 Defoliation 50
2 Defoliation 52
3 Defoliation 55
4 Defoliation 56
5 Defoliation 54
6 Defoliation 52
7 Defoliation 50
8 Defoliation 56
9 Defoliation 58
10 Defoliation 59
11 Defoliation 52
12 Defoliation 50
13 Defoliation 56
14 Defoliation 58
15 Defoliation 59
16 De_graining 62
17 De_graining 59
18 De_graining 62
19 De_graining 60
20 De_graining 59
21 De_graining 61
22 De_graining 60
23 De_graining 63
24 De_graining 62
25 De_graining 60
26 De_graining 64
27 De_graining 58
28 De_graining 65
29 De_graining 61
30 De_graining 62
So, I will make a slight modification to the code as shown below.
new_data2= data.frame (Defoliation$Genotype, De_graining$Genotype,
Defoliation$Block, De_graining$Block,
dataset=c(rep("Defoliation", length(Defoliation$yield)),
rep("De_graining", length(De_graining$yield))),
value=c(Defoliation$yield,De_graining$yield))
As a result, the data structure will be organized as shown below.
The reason for including Defoliation$Genotype
, De_graining$Genotype
, Defoliation$Block
, and De_graining$Block
in the data.frame()
function is to verify whether the two datasets have been stacked vertically properly. If both datasets share identical conditions for Genotype and Block, you could include only Defoliation$Genotype
and Defoliation$Block
. These two columns serve as variable conditions describing the vertically stacked data.
In other words, if you write the code as shown below:
new_data3= data.frame (Defoliation$Genotype, Defoliation$Block,
dataset=c(rep("Defoliation", length(Defoliation$yield)),
rep("De_graining", length(De_graining$yield))),
value=c(Defoliation$yield, De_graining$yield))
new_data3
Defoliation.Genotype Defoliation.Block dataset value
1 CV1 I Defoliation 50
2 CV2 I Defoliation 52
3 CV3 I Defoliation 55
4 CV4 I Defoliation 56
5 CV5 I Defoliation 54
6 CV1 II Defoliation 52
7 CV2 II Defoliation 50
8 CV3 II Defoliation 56
9 CV4 II Defoliation 58
10 CV5 II Defoliation 59
11 CV1 III Defoliation 52
12 CV2 III Defoliation 50
13 CV3 III Defoliation 56
14 CV4 III Defoliation 58
15 CV5 III Defoliation 59
16 CV1 I De_graining 62
17 CV2 I De_graining 59
18 CV3 I De_graining 62
19 CV4 I De_graining 60
20 CV5 I De_graining 59
21 CV1 II De_graining 61
22 CV2 II De_graining 60
23 CV3 II De_graining 63
24 CV4 II De_graining 62
25 CV5 II De_graining 60
26 CV1 III De_graining 64
27 CV2 III De_graining 58
28 CV3 III De_graining 65
29 CV4 III De_graining 61
30 CV5 III De_graining 62
The data structure will look as shown above. Let’s change the column names to make the data more clear and understandable.
colnames(new_data3)[1]= c("Genotype")
colnames(new_data3)[2]= c("Block")
colnames(new_data3)[3]= c("Treatment")
colnames(new_data3)[4]= c("Grain weight")
new_data3
Genotype Block Treatment Grain weight
1 CV1 I Defoliation 50
2 CV2 I Defoliation 52
3 CV3 I Defoliation 55
4 CV4 I Defoliation 56
5 CV5 I Defoliation 54
6 CV1 II Defoliation 52
7 CV2 II Defoliation 50
8 CV3 II Defoliation 56
9 CV4 II Defoliation 58
10 CV5 II Defoliation 59
11 CV1 III Defoliation 52
12 CV2 III Defoliation 50
13 CV3 III Defoliation 56
14 CV4 III Defoliation 58
15 CV5 III Defoliation 59
16 CV1 I De_graining 62
17 CV2 I De_graining 59
18 CV3 I De_graining 62
19 CV4 I De_graining 60
20 CV5 I De_graining 59
21 CV1 II De_graining 61
22 CV2 II De_graining 60
23 CV3 II De_graining 63
24 CV4 II De_graining 62
25 CV5 II De_graining 60
26 CV1 III De_graining 64
27 CV2 III De_graining 58
28 CV3 III De_graining 65
29 CV4 III De_graining 61
30 CV5 III De_graining 62
The drawback of the data.frame()
function is that if the number of rows is different, the datasets cannot be merged. So, in the case above, since the number of rows was the same, they could be combined. This is where the dplyr package comes in as a solution to easily merge data without such issues.
2. using dplyr package
The method mentioned above is quite primitive in terms of R usage. The dplyr package simplifies tasks like the one described above and can accomplish them in a matter of seconds. After installing the package, you can use the bind_rows()
function.
install.packages("dplyr")
library(dplyr)
new_data4= bind_rows (Defoliation, De_graining)
new_data4
Genotype Block yield
1 CV1 I 50
2 CV2 I 52
3 CV3 I 55
4 CV4 I 56
5 CV5 I 54
6 CV1 II 52
7 CV2 II 50
8 CV3 II 56
9 CV4 II 58
10 CV5 II 59
11 CV1 III 52
12 CV2 III 50
13 CV3 III 56
14 CV4 III 58
15 CV5 III 59
16 CV1 I 62
17 CV2 I 59
18 CV3 I 62
19 CV4 I 60
20 CV5 I 59
21 CV1 II 61
22 CV2 II 60
23 CV3 II 63
24 CV4 II 62
25 CV5 II 60
26 CV1 III 64
27 CV2 III 58
28 CV3 III 65
29 CV4 III 61
30 CV5 III 62
The data has been sorted as columns all at once. However, even this data lacks proper ID. In other words, there is no distinction for the treatments like Defoliation or De_graining. So, I will make a slight modification to the code.
new_data5= bind_rows (list(DF= Defoliation, DG= De_graining),.id="Treatment")
new_data5
Treatment Genotype Block yield
1 DF CV1 I 50
2 DF CV2 I 52
3 DF CV3 I 55
4 DF CV4 I 56
5 DF CV5 I 54
6 DF CV1 II 52
7 DF CV2 II 50
8 DF CV3 II 56
9 DF CV4 II 58
10 DF CV5 II 59
11 DF CV1 III 52
12 DF CV2 III 50
13 DF CV3 III 56
14 DF CV4 III 58
15 DF CV5 III 59
16 DG CV1 I 62
17 DG CV2 I 59
18 DG CV3 I 62
19 DG CV4 I 60
20 DG CV5 I 59
21 DG CV1 II 61
22 DG CV2 II 60
23 DG CV3 II 63
24 DG CV4 II 62
25 DG CV5 II 60
26 DG CV1 III 64
27 DG CV2 III 58
28 DG CV3 III 65
29 DG CV4 III 61
30 DG CV5 III 62
So, we set the Defoliation data as a new variable called “DF” and the De_graining data as “DG,” and we’ll assign the ID (column name) for these as “Treatment.”
Although the rearrangement of the data is similar to using the data.frame()
function, the time taken has been significantly reduced. Therefore, using a suitable package for rearranging data like this is much more efficient.