In RStudio, how to exclude missing values (NA)?



I’ll create a small dataset.

Genotype= c("A", "B", "C", "D", "E")
Yield= c(100, 120, 130, NA, 110)
dataA= data.frame(Genotype,Yield)

dataA
  Genotype Yield
1        A   100
2        B   120
3        C   130
4        D    NA
5        E   110

For genotype D, the yield data is missing, so it is recorded as NA. Now I’ll calculate the mean yield across all genotypes.

mean(dataA$Yield)
[1] NA

As you can see above, the mean can’t be calculated due to the NA. To obtain the mean yield, we should exclude the NA. Using subset(), we can simply exclude Genotype D:

dataB= subset (dataA, Genotype!="D")
mean(dataB$Yield)
[1] 115
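
Hard-coding the genotype name only works because we already know which row is missing. A more general sketch, assuming the same dataA as above, filters on the NA status of Yield itself with is.na(), so no genotype has to be named:

```r
# Rebuild the example data so this block runs on its own
Genotype <- c("A", "B", "C", "D", "E")
Yield <- c(100, 120, 130, NA, 110)
dataA <- data.frame(Genotype, Yield)

# Keep only the rows where Yield is not missing
dataB <- subset(dataA, !is.na(Yield))

# Mean of the four remaining yields
mean_yield <- mean(dataB$Yield)
mean_yield
```

This gives the same mean as excluding Genotype D by name, and it keeps working if the missing row changes.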

A much simpler way is to pass the argument na.rm=TRUE to mean(), which removes NA values before the calculation, so subset() is not needed.

mean(dataA$Yield, na.rm=TRUE)
[1] 115
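
The na.rm argument is not specific to mean(). As a small sketch using the same yield values, several other base R summary functions accept it as well. Note that length() still counts the NA, so the number of valid observations has to be counted separately:

```r
Yield <- c(100, 120, 130, NA, 110)

# Summary functions that accept na.rm
total_yield <- sum(Yield, na.rm = TRUE)
max_yield <- max(Yield, na.rm = TRUE)
sd_yield <- sd(Yield, na.rm = TRUE)

# length() has no na.rm argument and counts every element, NA included
n_all <- length(Yield)

# Count only the non-missing observations
n_valid <- sum(!is.na(Yield))

# Standard error based on the valid observations only
se_yield <- sd_yield / sqrt(n_valid)
```

Using n_valid rather than length() matters when a standard error is calculated from data that contains NA.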


When the dataset is small and there is only one variable, we can simply delete or ignore NA values. However, what should we do when the dataset is large and contains several variables? Let’s load another dataset.

library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/chlorophyll_contents_on_leaves.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))

print(df)

Now, there are several independent variables (Location, Genotype, Treatment), as well as dependent variables. First, let’s check which levels each non-numeric variable contains.

sapply(df, function(x) if (!is.numeric(x)) list(UniqueValues= unique(x)))

$Location
[1] "Northern area" "Southern area"

$Genotype
[1] "CV1" "CV2"

$Treatment
[1] "Control" "Stress_2" "Stress_1"

$Days_after_planting
NULL

$Chlorophyll_contents
NULL

$Chlorophyll_contents_Std_error
NULL

$Loss_of_greenness_on_leaves
NULL

$Loss_of_greenness_on_leaves_Std_error
NULL

There are 2 locations, 2 genotypes, and 3 treatments, indicating a total of 12 treatment combinations. Over the 20 days after planting, chlorophyll content and loss of greenness on leaves were measured, and the standard error of each measurement was also included. Let’s check for NA values.

1) How many data rows contain NA values?

num_rows_with_na= sum(!complete.cases(df))
cat("Number of rows with NA values:", num_rows_with_na)

or

num_rows_with_na= sum(rowSums(is.na(df)) > 0)
cat("Number of rows with NA values:", num_rows_with_na)

Number of rows with NA values: 31

In this dataset, 31 rows contain at least one NA. Both approaches flag a row as incomplete if any single column is NA, regardless of the values in the other columns.
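
Beyond counting, it is often useful to look at the incomplete rows themselves. The sketch below uses a small hypothetical data frame (toy, with made-up values and column names) so that it runs without downloading anything. The same two lines should work on df.

```r
# Hypothetical toy data, not the chlorophyll dataset
toy <- data.frame(
  Genotype = c("CV1", "CV1", "CV2", "CV2"),
  Chl      = c(40.1, 38.5, NA, 41.0),
  Chl_se   = c(1.2, NA, NA, 0.9)
)

# Row positions that contain at least one NA
na_positions <- which(!complete.cases(toy))

# The incomplete rows themselves
rows_with_na <- toy[!complete.cases(toy), ]
rows_with_na
```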

2) How many variables contain NA values?

Next, I want to see how many variables have NA values.

num_variables_with_na= sum(colSums(is.na(df)) > 0)
cat("Number of variables with NA values:", num_variables_with_na)

Number of variables with NA values: 2

There are two variables containing NA values. Let’s see what they are.

colSums(is.na(df))

Location Genotype Treatment 
0        0        0
Days_after_planting Chlorophyll_contents Chlorophyll_contents_Std_error
0                   0                    28
Loss_of_greenness_on_leaves Loss_of_greenness_on_leaves_Std_error 
0                           31       

The standard error columns for chlorophyll content and for loss of greenness on leaves contain NA values.
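
With many columns, it can be easier to pull out only the names of the affected variables and the share of NA in each. A minimal sketch on a hypothetical toy data frame (made-up values), which should carry over to df unchanged:

```r
# Hypothetical toy data, not the chlorophyll dataset
toy <- data.frame(
  Genotype = c("CV1", "CV1", "CV2", "CV2"),
  Chl      = c(40.1, 38.5, NA, 41.0),
  Chl_se   = c(1.2, NA, NA, 0.9)
)

# Names of the columns that contain at least one NA
na_columns <- names(toy)[colSums(is.na(toy)) > 0]

# Proportion of NA in each column (is.na() gives TRUE/FALSE, so the mean is a share)
na_share <- colMeans(is.na(toy))
na_share
```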

How to discard NA values?

I’ll remove all rows with NA in Loss_of_greenness_on_leaves_Std_error because it contains the most NA values.

library(dplyr)
df_na_trit= df %>% 
            filter(!is.na(Loss_of_greenness_on_leaves_Std_error))

or 

df_na_trit= df[complete.cases(df$Loss_of_greenness_on_leaves_Std_error),]

Then, let’s check that all NA values were discarded.

colSums(is.na(df_na_trit))

Location Genotype Treatment 
0        0        0
Days_after_planting Chlorophyll_contents Chlorophyll_contents_Std_error
0                   0                    0
Loss_of_greenness_on_leaves Loss_of_greenness_on_leaves_Std_error 
0                           0       

Now, none of the variables contain any NA values. This is because the rows with NA in Chlorophyll_contents_Std_error were all among the rows removed for having NA in Loss_of_greenness_on_leaves_Std_error. Deleting rows based on one variable therefore removes data in every other variable too: here 31 rows are discarded, including their valid measurements of chlorophyll content and loss of greenness on leaves. This might affect the final result.

Therefore, deleting NA values should be done carefully.
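
To illustrate the risk, the sketch below compares the mean of a complete column before and after dropping rows that are NA only in another column. The toy data frame and its values are hypothetical.

```r
# Hypothetical toy data: Chl is complete, only its standard error has NA
toy <- data.frame(
  Chl    = c(40, 38, 42, 44),
  Chl_se = c(1.2, NA, NA, 0.9)
)

# Mean of Chl using every measured value
mean_all_rows <- mean(toy$Chl)

# na.omit() drops every row that has an NA in any column
toy_complete <- na.omit(toy)

# Mean of Chl after the row-wise deletion
mean_after_drop <- mean(toy_complete$Chl)

c(all_rows = mean_all_rows, after_drop = mean_after_drop)
```

Here the mean of Chl shifts even though Chl itself had no missing values. One alternative, when each variable is analyzed separately, is to keep all rows and handle NA per column with na.rm=TRUE.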



<Exercise>

Genotype= rep(c("A","B","C","D","E"),3)
Block= rep(c("I","II","III"), each=5)
Yield= c(100,150,NA,120,115,130,NA,125,140,NA,135,NA,120,100,NA)
DataA= data.frame(Genotype,Block,Yield)

There are several NA values in the data.

  1. Calculate the mean of total yield
  2. Calculate the mean of yield in genotype A

The answer is below.
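
One possible way to solve the exercise, as a sketch using na.rm=TRUE as introduced earlier (other approaches, such as subset(), would work too):

```r
# Exercise data
Genotype <- rep(c("A", "B", "C", "D", "E"), 3)
Block <- rep(c("I", "II", "III"), each = 5)
Yield <- c(100, 150, NA, 120, 115, 130, NA, 125, 140, NA, 135, NA, 120, 100, NA)
DataA <- data.frame(Genotype, Block, Yield)

# 1. Mean of total yield, ignoring NA
mean_total <- mean(DataA$Yield, na.rm = TRUE)

# 2. Mean of yield in genotype A, ignoring NA
mean_A <- mean(DataA$Yield[DataA$Genotype == "A"], na.rm = TRUE)

mean_total   # 123.5
mean_A       # about 121.67
```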
