How to Sample a Portion of Data using R?
I have a large dataset. Let’s load it into R.
if (!require("rio")) install.packages("rio")
library(rio)
url= "https://github.com/agronomy4future/raw_data_practice/raw/main/wheat_grain_size_big_data.RData"
df= import(url)
head (df, 3)
tail (df, 3)
head (df, 3)
Field Genotype Block fungicide planting_date fertilizer Shoot Length.mm. Width.mm. Area.mm2.
1 South Alkaline II Yes normal N/A Main stem 4.720 2.270 8.269
2 South Alkaline II Yes normal N/A Main stem 5.487 3.163 13.378
3 South Alkaline II Yes normal N/A Main stem 6.004 3.621 16.122
.
.
.
tail (df, 3)
Field Genotype Block fungicide planting_date fertilizer Shoot Length.mm. Width.mm. Area.mm2.
96317 South Peele II No early N/A Tillers 5.674 2.210 9.154
96318 South Peele II No late N/A Tillers 6.041 2.138 18.092
96319 South Peele II No late N/A Tillers 6.041 2.138 18.092
This dataset has 96,319 rows. I want to use only part of it. How can I randomly extract a subset from the whole dataset?
First, I’ll add an ID column that numbers the rows from 1 to the last row, so that each row has a unique identifier.
df$ID= seq_along(df[, 1])
Caret package
The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. You can find more details at the link below.
https://topepo.github.io/caret/index.html#
if (!require("caret")) install.packages("caret")
library(caret)
# Set the seed for reproducibility
set.seed(123)
# Create indices for training and testing sets
indices= createDataPartition(df$Area.mm2., p=0.8, list= FALSE)
# Create training and testing sets
train_data= df[indices, ]
test_data= df[-indices, ]
The list parameter in createDataPartition() determines the form of the output. If list= TRUE (the default), the function returns a list of index vectors, with one element per resample; the number of resamples is set by the times argument, which defaults to 1. If list= FALSE, the function returns a matrix of indices with one column per resample. With a single partition, as here, list= FALSE gives a one-column matrix that can be used directly to subset rows, as in df[indices, ], which makes it convenient for straightforward operations on the data. The list parameter only changes the shape of the output, not which rows are selected.
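To see the two output shapes side by side, here is a minimal sketch on a made-up numeric vector. The vector y and the choice of times = 2 are assumptions for illustration only and are not part of the wheat dataset.

```r
library(caret)

set.seed(123)
y <- rnorm(100)  # hypothetical numeric outcome, for illustration only

# list = TRUE: a list with one vector of row indices per resample
idx_list <- createDataPartition(y, p = 0.8, times = 2, list = TRUE)
str(idx_list)

# list = FALSE: a matrix with one column of row indices per resample
idx_mat <- createDataPartition(y, p = 0.8, times = 2, list = FALSE)
dim(idx_mat)
```

With times = 1, as in the code above, the matrix has a single column, so df[indices, ] works as ordinary row subsetting.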
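The rows were selected at random, but it’s important to understand how. createDataPartition() performs stratified random sampling on the outcome passed to it (here, Area.mm2.): for a numeric outcome, the values are binned into groups based on percentiles and rows are sampled within each group, so the training and test sets have a similar distribution of that variable.
Let’s check whether the sampling was done correctly. 80% of the original data has been assigned to train_data (77,056 rows), and 20% to test_data (19,263 rows).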
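The idea of sampling within percentile-based groups of a numeric outcome can be sketched in base R. This is only an illustration of the concept, not caret's actual implementation; the toy data frame, the column name area, and the choice of five bins are assumptions.

```r
set.seed(123)
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(ID = 1:1000, area = rnorm(1000, mean = 12, sd = 3))

# Bin the outcome into five quantile-based groups
breaks <- unique(quantile(toy$area, probs = seq(0, 1, length.out = 6)))
toy$bin <- cut(toy$area, breaks = breaks, include.lowest = TRUE)

# Within each bin, draw 80% of the row positions at random
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$bin), function(rows) {
  rows[sample.int(length(rows), size = ceiling(0.8 * length(rows)))]
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Share of each bin that ended up in the training set
table(train_toy$bin) / table(toy$bin)
```

Because every bin contributes the same share of its rows, small and large values of the outcome are represented in both sets in similar proportions.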
dim(df) # original data
96319 11
dim(train_data) # 80% sampling data from the original data
77056 11
dim(test_data) # 20% sampling data from the original data
19263 11
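Row counts alone do not show that the two sets are disjoint. A few extra checks based on the ID column can confirm that no row is in both sets and that no row was lost. The sketch below uses a small made-up data frame and a plain sample() split as assumptions; the same three checks could be applied to train_data, test_data, and df.

```r
set.seed(123)
# Hypothetical toy data with an ID column, mirroring df$ID
toy <- data.frame(ID = 1:50, value = runif(50))

# Simple 80/20 split by row position
train_idx <- sample(nrow(toy), size = 40)
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# 1. No ID appears in both sets
no_overlap <- length(intersect(train_toy$ID, test_toy$ID)) == 0

# 2. Together the two sets cover every original row
full_cover <- setequal(c(train_toy$ID, test_toy$ID), toy$ID)

# 3. Proportion of rows assigned to the training set
train_share <- nrow(train_toy) / nrow(toy)

no_overlap
full_cover
train_share
```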
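Here is a much simpler way, using sample_frac() from the dplyr package.
library(dplyr)
# Randomly sample 20% of the rows (call set.seed() first for a reproducible sample)
test_data1= df %>%
  sample_frac(0.2)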
dim(test_data1)
19264 11
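sample_frac() returns only the sampled rows. If the remaining 80% is also needed, one option is to use the ID column to pull out every row that was not sampled. The sketch below is one possible approach under assumptions: the toy data frame and the object names are made up, and anti_join() from dplyr is used to keep the rows whose ID is not in the sample.

```r
library(dplyr)

set.seed(123)
# Hypothetical toy data with an ID column, mirroring df$ID
toy <- data.frame(ID = 1:50, value = runif(50))

# Randomly sample 20% of the rows
test_toy <- toy %>%
  sample_frac(0.2)

# Keep the rows whose ID is not in the sample
train_toy <- anti_join(toy, test_toy, by = "ID")

dim(test_toy)
dim(train_toy)
```

Unlike createDataPartition(), this draws rows uniformly at random without stratifying on any variable.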