How to check the data structure in R
After loading data into R, we first need to check the data structure before analyzing it. Here are some tips for checking the data structure in R. First, I’ll load a dataset from my GitHub.
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/biomass_N_P.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))
print(df)
season cultivar treatment rep biomass nitrogen phosphorus
1 2022 cv1 N0 1 9.16 1.23 0.41
2 2022 cv1 N0 2 13.06 1.49 0.45
3 2022 cv1 N0 3 8.40 1.18 0.31
4 2022 cv1 N0 4 11.97 1.42 0.48
5 2022 cv1 N1 1 24.90 1.77 0.49
.
.
.
In this dataset, let’s check the structure of the data.
■ Code to display the first or last few rows
When we examine the data, we can simply run the variable df or use print(df) to display it. However, if we want to quickly understand the structure of the data, we can use the head() or tail() functions.
head(df, 3) # code to display the first three rows
tail(df, 3) # code to display the last three rows
season cultivar treatment rep biomass nitrogen phosphorus
1 2022 cv1 N0 1 9.16 1.23 0.41
2 2022 cv1 N0 2 13.06 1.49 0.45
3 2022 cv1 N0 3 8.40 1.18 0.31
season cultivar treatment rep biomass nitrogen phosphorus
60 2023 cv3 N3 2 47.69 2.18 0.46
61 2023 cv3 N4 1 54.42 2.48 0.53
62 2023 cv3 N4 2 59.06 2.58 0.48
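Alongside head() and tail(), the base functions dim() and names() give a quick view of the overall size and the column names. The sketch below uses a small toy data frame (its values are illustrative only, not the full biomass_N_P.csv data), so it runs without downloading anything.

```r
# Toy data frame for illustration only (hypothetical subset, not the real dataset)
toy <- data.frame(
  season   = c(2022, 2022, 2023),
  cultivar = c("cv1", "cv2", "cv3"),
  biomass  = c(9.16, 13.06, 54.42)
)

size <- dim(toy)    # number of rows, then number of columns
cols <- names(toy)  # column names

print(size)
print(cols)
```

On the real data, dim(df) should agree with the "62 obs. of 7 variables" line that str(df) reports.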
■ Cross-tabulation of the variables from the data.
When we want to check how many observations fall into each combination of variables, we can use the xtabs() function. Adding season as a third variable produces one table per season.
xtabs(~ cultivar + treatment + season, data= df)
, , season = 2022
treatment
cultivar N0 N1 N2 N3 N4
cv1 4 4 4 4 4
cv2 5 5 4 5 5
cv3 0 0 0 0 0
, , season = 2023
treatment
cultivar N0 N1 N2 N3 N4
cv1 0 0 0 0 0
cv2 2 2 1 1 2
cv3 2 2 2 2 2
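The cross table above shows that the design is unbalanced: some cultivar and treatment combinations have fewer observations than others, and some have none. One way to list such cells, rather than reading them off the table, is to convert the xtabs() result to a long data frame and filter on the count. This is a minimal sketch on a toy data frame (hypothetical values); the threshold of 2 is an arbitrary choice for illustration.

```r
# Toy data frame for illustration only (not the real dataset)
toy <- data.frame(
  cultivar  = c("cv1", "cv1", "cv2", "cv2", "cv2"),
  treatment = c("N0", "N1", "N0", "N0", "N1")
)

# Cross-tabulate, then reshape to one row per cultivar x treatment cell
tab    <- xtabs(~ cultivar + treatment, data = toy)
counts <- as.data.frame(tab)  # columns: cultivar, treatment, Freq

# Keep only the cells with fewer than 2 observations
sparse <- counts[counts$Freq < 2, ]
print(sparse)
```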
■ Display the unique values for each column
Now, I’d like to see how many different levels there are in each variable. We can use the str() function to view the data structure, but it does not show all levels of each variable.
str(df)
'data.frame': 62 obs. of 7 variables:
$ season : num 2022 2022 2022 2022 2022 ...
$ cultivar : chr "cv1" "cv1" "cv1" "cv1" ...
$ treatment : chr "N0" "N0" "N0" "N0" ...
$ rep : num 1 2 3 4 1 2 3 4 1 2 ...
$ biomass : num 9.16 13.06 8.4 11.97 24.9 ...
$ nitrogen : num 1.23 1.49 1.18 1.42 1.77 1.74 1.85 1.75 1.81 2.01 ...
$ phosphorus: num 0.41 0.45 0.31 0.48 0.49 0.46 0.42 0.42 0.46 0.36 ...
So, I’ll use the code below, which returns the number of unique values and the unique values themselves for each column.
sapply(df, function(x) list(Count= length(unique(x)), UniqueValues= unique(x)))
However, we do not need to see the numeric variables, so I’ll exclude them using the code below. Numeric columns now return NULL.
sapply(df, function(x) if (!is.numeric(x)) list(Count= length(unique(x)), UniqueValues= unique(x)))
$season NULL
$cultivar
$Count 3
$UniqueValues
'cv1''cv2''cv3'
$treatment
$Count 5
$UniqueValues
'N0''N1''N2''N3''N4'
$rep NULL
$biomass NULL
$nitrogen NULL
$phosphorus NULL
A simpler version of the code is provided below.
sapply(df[sapply(df, function(x) is.factor(x) || is.character(x))], unique)
$cultivar
'cv1''cv2''cv3'
$treatment
'N0''N1''N2''N3''N4'
Here is one issue: ‘Season’ and ‘Replicate’ (the season and rep columns) are factors in this experiment, but they are currently treated as numeric because they are represented by numbers. Therefore, when I run the code above, neither season nor rep appears.
One option is to manually exclude the numeric variables that are actual measurements (biomass, nitrogen, and phosphorus).
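To spot numeric columns that are really grouping variables, one option is to count the distinct values in every column; a numeric column with only a handful of distinct values is a likely candidate. The sketch below uses a small toy data frame (values chosen for illustration, not the full dataset).

```r
# Toy data frame for illustration only (not the real dataset)
toy <- data.frame(
  season   = c(2022, 2022, 2023, 2023),
  cultivar = c("cv1", "cv2", "cv2", "cv3"),
  biomass  = c(9.16, 13.06, 47.69, 54.42)
)

# Number of distinct values in each column, returned as a named vector
n_unique <- sapply(toy, function(x) length(unique(x)))
print(n_unique)
```

Here season is numeric but has only two distinct values, which suggests it is a grouping variable rather than a measurement.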
sapply(df[setdiff(names(df), c("biomass", "nitrogen", "phosphorus"))], unique)
$season 2022 2023
$cultivar 'cv1''cv2''cv3'
$treatment 'N0''N1''N2''N3''N4'
$rep 1 2 3 4 5
Alternatively, we can convert the numeric grouping variables to factors, so that the filter on factor and character columns picks them up.
df$season= as.factor(df$season)
df$rep= as.factor(df$rep)
sapply(df[sapply(df, function(x) is.factor(x) || is.character(x))], unique)
$season 2022 2023
Levels:'2022''2023'
$cultivar 'cv1''cv2''cv3'
$treatment 'N0''N1''N2''N3''N4'
$rep 1 2 3 4 5
Levels:'1''2''3''4''5'
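When several columns need converting, repeating as.factor() line by line gets tedious. A possible shortcut is to convert a vector of column names in one step with lapply() and then read the levels back with levels(). This is a sketch on a toy data frame; group_cols is a name introduced here for illustration.

```r
# Toy data frame for illustration only (not the real dataset)
toy <- data.frame(
  season  = c(2022, 2022, 2023),
  rep     = c(1, 2, 1),
  biomass = c(9.16, 13.06, 47.69)
)

# Columns to treat as categorical (hypothetical helper vector)
group_cols <- c("season", "rep")

# Convert all of them to factors in one step
toy[group_cols] <- lapply(toy[group_cols], as.factor)

# List the levels of every factor column
lvls <- lapply(toy[sapply(toy, is.factor)], levels)
print(lvls)
```

Note that levels() returns the levels as character strings, even when the original column was numeric.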