In RStudio, how do you exclude missing values (NA)?
First, I’ll create a small data set.
Genotype= c("A", "B", "C", "D", "E")
Yield= c(100, 120, 130, NA, 110)
dataA= data.frame(Genotype,Yield)
dataA
Genotype Yield
1 A 100
2 B 120
3 C 130
4 D NA
5 E 110
For genotype D, the yield value is missing, so it is shown as NA. Now I’ll calculate the mean yield across all genotypes.
mean(dataA$Yield)
[1] NA
As you can see above, we can’t calculate the mean due to NA: by default, mean() returns NA when any value is missing. To obtain the mean yield, we should exclude NA. Using subset(), we can simply exclude genotype D:
dataB= subset (dataA, Genotype!="D")
mean(dataB$Yield)
[1] 115
However, a much simpler way is to use the argument na.rm=TRUE, which lets you avoid using subset().
mean(dataA$Yield, na.rm=TRUE)
[1] 115
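The same idea carries over to other summary functions, and rows can also be filtered on is.na() instead of naming the genotype by hand. The following is a minimal sketch that rebuilds the small data set above; the object names mean_yield, sum_yield, and sd_yield are only illustrative.

```r
# Same values as the small example data set above
Genotype <- c("A", "B", "C", "D", "E")
Yield <- c(100, 120, 130, NA, 110)
dataA <- data.frame(Genotype, Yield)

# Many base R summary functions accept na.rm=TRUE
mean_yield <- mean(dataA$Yield, na.rm = TRUE)  # mean of the 4 observed values
sum_yield  <- sum(dataA$Yield, na.rm = TRUE)   # sum of the 4 observed values
sd_yield   <- sd(dataA$Yield, na.rm = TRUE)    # standard deviation without NA

# Filtering on is.na() drops missing rows without hard-coding "D"
dataB <- subset(dataA, !is.na(Yield))
nrow(dataB)
```

Filtering on !is.na(Yield) keeps working if a different genotype has the missing value, whereas Genotype!="D" has to be edited by hand.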
When the data set is small and there is only one variable, we can simply delete or ignore NA values. However, what should we do when the data set is large and contains several variables? Let’s load another dataset.
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/chlorophyll_contents_on_leaves.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))
print(df)
Now there are several independent variables (Location, Genotype, Treatment) as well as dependent variables. First, let’s check the unique values of each non-numeric variable.
sapply(df, function(x) if (!is.numeric(x)) list(UniqueValues= unique(x)))
$Location
[1] "Northern area" "Southern area"
$Genotype
[1] "CV1" "CV2"
$Treatment
[1] "Control" "Stress_2" "Stress_1"
$Days_after_planting
NULL
$Chlorophyll_contents
NULL
$Chlorophyll_contents_Std_error
NULL
$Loss_of_greenness_on_leaves
NULL
$Loss_of_greenness_on_leaves_Std_error
NULL
There are 2 locations, 2 genotypes, and 3 treatments, giving a total of 12 treatment combinations. Over the 20 days after planting, chlorophyll content and loss of greenness on leaves were measured, and the standard error of each measurement was also included. Let’s check for NA values.
1) How many data rows contain NA values?
num_rows_with_na= sum(!complete.cases(df))
cat("Number of rows with NA values:", num_rows_with_na)
or
num_rows_with_na= sum(rowSums(is.na(df)) > 0)
cat("Number of rows with NA values:", num_rows_with_na)
Number of rows with NA values: 31
In this data, 31 rows have NA values. Note that a row is counted as soon as any one of its columns is NA, regardless of the other columns.
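To see how the two row-counting approaches agree, here is a minimal sketch on a made-up data frame. The object toy and its columns a, b, and c are hypothetical and are not part of the chlorophyll dataset.

```r
# Hypothetical 4-row data frame: row 2 has NA in 'a', row 3 has NA in 'b'
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, "z"),
  c = c(10, 20, 30, 40)
)

# complete.cases() is TRUE only for rows with no NA in any column
n_complete_cases <- sum(!complete.cases(toy))

# rowSums(is.na()) counts the NA values per row; > 0 flags the same rows
n_row_sums <- sum(rowSums(is.na(toy)) > 0)

# Row positions that contain at least one NA
rows_with_na <- which(!complete.cases(toy))
```

Both counts come out the same because each approach flags a row as soon as one of its columns is missing.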
2) How many variables contain NA values?
Next, I want to see how many variables have NA values.
num_variables_with_na= sum(colSums(is.na(df)) > 0)
cat("Number of variables with NA values:", num_variables_with_na)
Number of variables with NA values: 2
There are two variables containing NA values. Let’s see what they are.
colSums(is.na(df))
Location Genotype Treatment
0 0 0
Days_after_planting Chlorophyll_contents Chlorophyll_contents_Std_error
0 0 28
Loss_of_greenness_on_leaves Loss_of_greenness_on_leaves_Std_error
0 31
The standard error columns for chlorophyll content and loss of greenness on leaves contain NA values.
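If only the names of the affected columns, or the share of missing values in each, are needed, the same is.na() matrix can be reused. This is a minimal sketch on a made-up data frame; toy, value, and value_se are hypothetical names, not columns of the dataset above.

```r
# Hypothetical data: a measurement and its standard error, with 2 missing SEs
toy <- data.frame(
  value    = c(1.2, 3.4, 2.2, 5.1),
  value_se = c(0.1, NA, NA, 0.3)
)

# Names of the columns that contain at least one NA
na_cols <- names(toy)[colSums(is.na(toy)) > 0]

# Proportion of missing values per column (mean of a logical is a proportion)
na_share <- colMeans(is.na(toy))
```

The proportion is often easier to judge than the raw count when columns are compared across data sets of different sizes.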
How to discard NA values?
I’ll delete all rows with NA values in Loss_of_greenness_on_leaves_Std_error because it contains the most NA values.
library(dplyr)
df_na_trit= df %>%
filter(!is.na(Loss_of_greenness_on_leaves_Std_error))
or
df_na_trit= df[complete.cases(df$Loss_of_greenness_on_leaves_Std_error),]
Then, let’s check that all NA values were discarded.
colSums(is.na(df_na_trit))
Location Genotype Treatment
0 0 0
Days_after_planting Chlorophyll_contents Chlorophyll_contents_Std_error
0 0 0
Loss_of_greenness_on_leaves Loss_of_greenness_on_leaves_Std_error
0 0
Now, none of the variables contain any NA values. This is because every row with an NA in Chlorophyll_contents_Std_error also had an NA in Loss_of_greenness_on_leaves_Std_error, so removing the rows for the latter removed the NA values of the former too. More generally, deleting NA values in one variable removes whole rows and therefore affects the other variables as well. If we delete all NA values, 31 data rows will be discarded, and the valid measurements of chlorophyll content and loss of greenness on leaves in those rows will be deleted with them. This might affect the final result.
Therefore, deleting NA values should be done carefully.
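One way to see the cost of row deletion is to compare a mean before and after dropping rows. The sketch below uses a made-up data frame; toy, chl, and chl_se are hypothetical names standing in for a measurement and its standard error, and the numbers are invented for illustration.

```r
# Hypothetical data: every chl value is observed, one chl_se is missing
toy <- data.frame(
  chl    = c(40, 42, 38, 41),
  chl_se = c(1.0, NA, 0.8, 1.2)
)

# Dropping rows with a missing chl_se also throws away a valid chl value
trimmed <- toy[!is.na(toy$chl_se), ]
mean_trimmed <- mean(trimmed$chl)   # based on 3 rows

# Keeping all rows and handling NA per variable uses every observed value
mean_full <- mean(toy$chl)                    # based on 4 rows
mean_se   <- mean(toy$chl_se, na.rm = TRUE)   # NA skipped only where needed
```

In this toy case the two means of chl differ even though chl itself has no missing values, which is the kind of side effect to watch for before deleting rows.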
<Exercise>
Genotype= rep(c("A","B","C","D","E"),3)
Block= rep(c("I","II","III"), each=5)
Yield= c(100,150,NA,120,115,130,NA,125,140,NA,135,NA,120,100,NA)
DataA= data.frame(Genotype,Block,Yield)
There are several NA values in the data.
- Calculate the mean of total yield
- Calculate the mean of yield in genotype A
The answer is given below.