How to Sample a Portion of Data using R?

May 9, 2024 JK

I have one large dataset. Let’s load it into R.

if (!require("rio")) install.packages("rio")
library(rio)

url= "https://github.com/agronomy4future/raw_data_practice/raw/main/wheat_grain_size_big_data.RData"

df= import(url)

head (df, 3) 
tail (df, 3)

head (df, 3)
  Field  Genotype  Block  fungicide  planting_date  fertilizer  Shoot      Length.mm.  Width.mm.  Area.mm2.
1 South  Alkaline  II     Yes        normal         N/A         Main stem  4.720       2.270      8.269
2 South  Alkaline  II     Yes        normal         N/A         Main stem  5.487       3.163      13.378
3 South  Alkaline  II     Yes        normal         N/A         Main stem  6.004       3.621      16.122
.
.
.

tail (df, 3)
      Field  Genotype  Block  fungicide  planting_date  fertilizer  Shoot    Length.mm.  Width.mm.  Area.mm2.
96317 South  Peele     II     No         early          N/A         Tillers  5.674       2.210      9.154
96318 South  Peele     II     No         late           N/A         Tillers  6.041       2.138      18.092
96319 South  Peele     II     No         late           N/A         Tillers  6.041       2.138      18.092

This dataset has 96,319 rows. I want to use only part of it. How can I randomly extract a subset of rows from the whole dataset?

First, I’ll number the rows from 1 to the last row so that each row has its own ID.

df$ID= seq_along(df[, 1])

Caret package

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. You can find more details at the link below.

https://topepo.github.io/caret/index.html#

if (!require("caret")) install.packages("caret")
library(caret)

# Set the seed for reproducibility
set.seed(123)

# Create indices for training and testing sets
indices= createDataPartition(df$Area.mm2., p=0.8, list= FALSE)

# Create training and testing sets
train_data= df[indices, ]
test_data= df[-indices, ]

The list argument of createDataPartition() controls only the format of the output, not how the rows are chosen. With list= TRUE (the default), the function returns a list with one element per resample, and each element is a vector of row indices. With list= FALSE, it returns a matrix of row indices with one column per resample, which is convenient for subsetting with df[indices, ] as in the code above. The number of resamples is set by the times argument (default 1), so here both forms hold a single set of indices. In both cases the sampling is stratified on the outcome: for a numeric variable such as Area.mm2., the values are divided into groups based on percentiles (the groups argument), and rows are sampled at random within each group.
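To see the two output formats side by side, here is a minimal sketch on a small made-up vector (y is hypothetical and only stands in for df$Area.mm2.). It assumes caret is already installed.

```r
library(caret)

set.seed(123)
y <- rnorm(100)  # hypothetical numeric outcome, not the wheat data

# Same kind of split, two output formats
idx_matrix <- createDataPartition(y, p = 0.8, list = FALSE)
idx_list   <- createDataPartition(y, p = 0.8, list = TRUE)

is.matrix(idx_matrix)  # matrix of row indices, one column per resample
is.list(idx_list)      # list of index vectors, one element per resample

# With times = 3 the matrix has three columns, one per resample
idx_multi <- createDataPartition(y, p = 0.8, list = FALSE, times = 3)
ncol(idx_multi)
```

The matrix form can be passed directly to df[indices, ] when there is a single resample, which is why list= FALSE is used in this post.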

The rows were sampled at random, but it’s important to understand how the random selection was made.
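As a rough illustration of what a stratified random split on a numeric variable can look like, the base R sketch below cuts a made-up variable into five percentile bins and samples 80% of the rows within each bin. This is only an approximation for intuition, not caret's actual implementation, and the names area, bins, and train_idx are hypothetical.

```r
set.seed(123)

# Hypothetical toy outcome standing in for df$Area.mm2.
area <- runif(1000, min = 5, max = 20)

# 1. Cut the outcome into 5 bins at its 0%, 20%, ..., 100% quantiles
breaks <- quantile(area, probs = seq(0, 1, length.out = 6))
bins   <- cut(area, breaks = unique(breaks), include.lowest = TRUE)

# 2. Within each bin, draw 80% of the row numbers at random
rows_by_bin <- split(seq_along(area), bins)
train_idx <- unlist(lapply(rows_by_bin, function(rows) {
  rows[sample.int(length(rows), size = ceiling(length(rows) * 0.8))]
}), use.names = FALSE)

# 3. The two subsets should then have similar distributions
summary(area[train_idx])
summary(area[-train_idx])
```

Because every bin contributes the same share of rows, small and large grains end up represented in both subsets in roughly the same proportions as in the full data.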

Let’s check that the split has the expected size. About 80% of the original data has been assigned to train_data (77,056 rows), and about 20% has been assigned to test_data (19,263 rows).

dim(df) # original data
96319    11

dim(train_data) # 80% sampling data from the original data
77056    11

dim(test_data) # 20% sampling data from the original data
19263    11
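Row counts alone do not show that the two subsets are disjoint. The ID column added earlier makes that easy to check. Below is a minimal sketch on a made-up data frame (toy, toy_train, and toy_test are hypothetical names); the same three checks could be run on train_data and test_data.

```r
set.seed(123)

# Hypothetical toy data frame with an ID column, like df$ID above
toy <- data.frame(ID = 1:50, value = rnorm(50))

train_idx <- sample(nrow(toy), size = 40)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# No ID appears in both subsets
length(intersect(toy_train$ID, toy_test$ID))     # 0

# Together the subsets contain every original ID
setequal(c(toy_train$ID, toy_test$ID), toy$ID)   # TRUE

# Row counts add up to the original
nrow(toy_train) + nrow(toy_test) == nrow(toy)    # TRUE
```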

Here is a much simpler way.

library(dplyr)
test_data1= df %>% 
            sample_frac(0.2)

dim(test_data1)
19264    11
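The dplyr call above returns only the sampled 20%. If the remaining rows are also needed, one option is to drop the sampled IDs from the original data. The sketch below uses slice_sample(), the newer dplyr verb that supersedes sample_frac(), on a made-up data frame; toy, toy_test, and toy_train are hypothetical names, and the two verbs may not round the row count in the same way.

```r
library(dplyr)

set.seed(123)  # makes the random draw reproducible

# Hypothetical toy data frame with an ID column, like df$ID above
toy <- data.frame(ID = 1:100, value = rnorm(100))

# Draw 20% of the rows at random
toy_test <- toy %>%
  slice_sample(prop = 0.2)

# Keep every row whose ID was not drawn
toy_train <- toy %>%
  anti_join(toy_test, by = "ID")

nrow(toy_test)
nrow(toy_train)
```

Unlike createDataPartition(), this draws rows uniformly at random without stratifying on any variable.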

Agronomy4future

Stories about cereals and statistics (plus coding)
