In R, how to check the data structure?


When importing data into R, we should check its structure before analyzing it. Here are some tips for checking the data structure in R. First, I’ll load a dataset from my GitHub.

library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/biomass_N_P.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))

print(df)
     season cultivar treatment rep biomass nitrogen phosphorus
1    2022      cv1        N0   1    9.16     1.23       0.41
2    2022      cv1        N0   2   13.06     1.49       0.45
3    2022      cv1        N0   3    8.40     1.18       0.31
4    2022      cv1        N0   4   11.97     1.42       0.48
5    2022      cv1        N1   1   24.90     1.77       0.49
.
.
.

In this dataset, let’s check the structure of the data.



Code to display the first or last certain rows

To examine the data, we can simply type df or use print(df) to display all of it. However, if we only want a quick look at the structure of the data, we can use the head() or tail() functions.

head(df, 3) # display the first three rows
tail(df, 3) # display the last three rows

  season cultivar treatment rep biomass nitrogen phosphorus
1   2022      cv1        N0   1    9.16     1.23       0.41
2   2022      cv1        N0   2   13.06     1.49       0.45
3   2022      cv1        N0   3    8.40     1.18       0.31

   season cultivar treatment rep biomass nitrogen phosphorus
60   2023      cv3        N3   2   47.69     2.18       0.46
61   2023      cv3        N4   1   54.42     2.48       0.53
62   2023      cv3        N4   2   59.06     2.58       0.48
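Alongside head() and tail(), the base functions dim() and names() give the size and column names at a glance. Below is a minimal sketch; the data frame toy is a small hypothetical example made up for illustration, not the biomass dataset loaded above.

```r
# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  season   = rep(c(2022, 2023), each = 3),
  cultivar = c("cv1", "cv1", "cv2", "cv2", "cv3", "cv3"),
  biomass  = c(9.2, 13.1, 8.4, 12.0, 24.9, 30.5)
)

dims         <- dim(toy)       # number of rows, then number of columns
cols         <- names(toy)     # column names
first_rows   <- head(toy, 2)   # first two rows
last_rows    <- tail(toy, 2)   # last two rows (original row numbers are kept)
all_but_last <- head(toy, -2)  # a negative n drops the last two rows instead

print(dims)
print(cols)
print(first_rows)
print(last_rows)
```

A negative n, as in head(toy, -2), is handy when we want everything except a few trailing rows.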


Cross-tabulation of the variables from the data.

When we want to check the data structure using a cross table, we can use the xtabs() function.

xtabs(~ cultivar + treatment + season, data= df) # adding season gives one table per season

, , season = 2022
        treatment
cultivar N0 N1 N2 N3 N4
     cv1  4  4  4  4  4
     cv2  5  5  4  5  5
     cv3  0  0  0  0  0

, , season = 2023
        treatment
cultivar N0 N1 N2 N3 N4
     cv1  0  0  0  0  0
     cv2  2  2  1  1  2
     cv3  2  2  2  2  2
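A cross table can also be turned into a long data frame, which makes it easy to list the combinations that have no observations. The sketch below assumes a small hypothetical data frame named toy (not the biomass dataset) and uses only base R.

```r
# Hypothetical toy data: cv3 was never grown under N1
toy <- data.frame(
  cultivar  = c("cv1", "cv1", "cv1", "cv2", "cv2", "cv3"),
  treatment = c("N0",  "N0",  "N1",  "N0",  "N1",  "N0")
)

# Count observations for every cultivar x treatment combination
counts <- xtabs(~ cultivar + treatment, data = toy)
print(counts)

# Convert the table to long format: one row per combination, counts in Freq
long <- as.data.frame(counts)

# Keep only the combinations with no observations
missing_cells <- long[long$Freq == 0, ]
print(missing_cells)
```

The same idea should carry over to a three-way table: adding season to the formula adds a season column to the long data frame.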


Display the unique values for each column

Now, I’d like to see how many different levels there are in each variable. We can use the str() function to view the data structure, but it does not show all levels of each variable.

str (df)
'data.frame':  62 obs. of  7 variables:
 $ season    : num  2022 2022 2022 2022 2022 ...
 $ cultivar  : chr  "cv1" "cv1" "cv1" "cv1" ...
 $ treatment : chr  "N0" "N0" "N0" "N0" ...
 $ rep       : num  1 2 3 4 1 2 3 4 1 2 ...
 $ biomass   : num  9.16 13.06 8.4 11.97 24.9 ...
 $ nitrogen  : num  1.23 1.49 1.18 1.42 1.77 1.74 1.85 1.75 1.81 2.01 ...
 $ phosphorus: num  0.41 0.45 0.31 0.48 0.49 0.46 0.42 0.42 0.46 0.36 ...

So, I’ll use the code below.

sapply(df, function(x) list(Count= length(unique(x)), UniqueValues= unique(x)))
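One thing to keep in mind with the call above: when every column returns a result of the same length (here, a list of two elements), sapply() simplifies the results into a matrix whose cells are lists, which can be awkward to read. A minimal sketch of the difference, using a hypothetical data frame toy and a hypothetical helper summarise_column(), with lapply() as an alternative that always returns a plain named list:

```r
# Hypothetical toy data (not the biomass dataset)
toy <- data.frame(
  season   = c(2022, 2022, 2023, 2023),
  cultivar = c("cv1", "cv2", "cv1", "cv2"),
  biomass  = c(9.2, 13.1, 8.4, 12.0)
)

# Hypothetical helper: number of distinct values plus the values themselves
summarise_column <- function(x) {
  list(Count = length(unique(x)), UniqueValues = unique(x))
}

res_sapply <- sapply(toy, summarise_column)  # simplified to a 2 x 3 list-matrix
res_lapply <- lapply(toy, summarise_column)  # stays a named list, one entry per column

print(dim(res_sapply))
str(res_lapply)

# Individual results are easy to reach in the lapply() version
print(res_lapply$cultivar$UniqueValues)
```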

However, we do not need to see the numeric variables, and I’ll exclude them using the code below.

sapply(df, function(x) if (!is.numeric(x)) list(Count= length(unique(x)), UniqueValues= unique(x)))

$season
NULL

$cultivar
$cultivar$Count
[1] 3

$cultivar$UniqueValues
[1] "cv1" "cv2" "cv3"


$treatment
$treatment$Count
[1] 5

$treatment$UniqueValues
[1] "N0" "N1" "N2" "N3" "N4"


$rep
NULL

$biomass
NULL

$nitrogen
NULL

$phosphorus
NULL

A simpler version of the code is provided below.

sapply(df[sapply(df, function(x) is.factor(x) || is.character(x))], unique)

$cultivar
[1] "cv1" "cv2" "cv3"

$treatment
[1] "N0" "N1" "N2" "N3" "N4"
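The same selection can probably be written a little more compactly with the base functions Filter() and Negate(). This is only a sketch on a hypothetical data frame toy, and it keeps every non-numeric column rather than testing for factor or character explicitly.

```r
# Hypothetical toy data (not the biomass dataset)
toy <- data.frame(
  season    = c(2022, 2022, 2023, 2023),
  cultivar  = c("cv1", "cv2", "cv1", "cv2"),
  treatment = c("N0", "N1", "N2", "N0"),
  biomass   = c(9.2, 13.1, 8.4, 12.0)
)

# Keep only the columns for which is.numeric() is FALSE
non_numeric <- Filter(Negate(is.numeric), toy)

# lapply() returns a named list regardless of how many unique values each column has
unique_values <- lapply(non_numeric, unique)
print(unique_values)
```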

Here is one issue: season and rep (replicate) are factors in this experiment, but R currently treats them as numeric because they are stored as numbers. Therefore, when I run the code above, neither season nor rep appears.

We can manually exclude the numerical variables that we consider irrelevant.

sapply(df[setdiff(names(df), c("biomass", "nitrogen", "phosphorus"))], unique)

$season
[1] 2022 2023

$cultivar
[1] "cv1" "cv2" "cv3"

$treatment
[1] "N0" "N1" "N2" "N3" "N4"

$rep
[1] 1 2 3 4 5

Alternatively, we can convert numerical variables to categorical variables.

df$season= as.factor(df$season)
df$rep= as.factor(df$rep)

sapply(df[sapply(df, function(x) is.factor(x) || is.character(x))], unique)

$season
[1] 2022 2023
Levels: 2022 2023

$cultivar
[1] "cv1" "cv2" "cv3"

$treatment
[1] "N0" "N1" "N2" "N3" "N4"

$rep
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
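When several columns need converting, they can be handled in one step, and levels() and nlevels() then report the categories directly. The sketch below uses a hypothetical data frame toy; the vector group_cols is simply my assumption about which columns are grouping variables.

```r
# Hypothetical toy data (not the biomass dataset)
toy <- data.frame(
  season  = c(2022, 2022, 2023, 2023),
  rep     = c(1, 2, 1, 2),
  biomass = c(9.2, 13.1, 8.4, 12.0)
)

# Assumed grouping columns to convert from numeric to factor
group_cols <- c("season", "rep")
toy[group_cols] <- lapply(toy[group_cols], factor)

level_list  <- lapply(toy[group_cols], levels)   # levels of each factor (as character)
level_count <- sapply(toy[group_cols], nlevels)  # number of levels per factor

print(level_list)
print(level_count)

# To get the original numbers back, go through character first;
# as.numeric() on a factor alone would return the internal level codes
season_numeric <- as.numeric(as.character(toy$season))
print(season_numeric)
```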

