12 Handling missing data
12.1 Questions
What is missing data?
Why is missing data a common challenge in data analysis?
How can we identify and handle missing data effectively in R?
What are the consequences of ignoring or mishandling missing data?
What are the best practices for imputing missing values?
When should I choose deletion, imputation, or alternative approaches?
12.2 Learning Objectives
Define the main types of missing data.
Identify missing data in R and assess its impact on analyses.
Work with datasets that contain missing data.
Understand the importance of addressing missing data in data analysis.
Apply best practices for handling missing data in your R projects.
Apply functions like is.na() and complete.cases() to identify NA values.
12.3 Lesson Content
12.3.1 Introduction
Handling missing values is one of the most common tasks in data analysis. Missing data appear in almost every real dataset, so it is important to learn how to identify and summarize them. They can arise from many sources, including:
- Unanswered survey questions
- Data entry errors
- Errors in measurements
- Data merge issues
Missingness presents itself in various ways in R. The most common way to represent a missing value in R is with the value NA. Other reserved values include:
- NULL = the null object, which represents the absence of a value (is.null() identifies a null value and as.null() converts to a null value).
- NaN = "not a number", the result of an undefined operation such as 0/0 (is.nan() identifies it).
- Inf = an infinite value, obtained for example by dividing a nonzero number by zero (is.infinite() identifies an infinite value).
Examples of these values:
- NA = 1/NA
- NaN = 0/0
- Inf = 10/0
- NULL = c() (an empty call to c() returns NULL)
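As a quick check, each reserved value can be produced in the console and tested with its matching is.*() function. This is a minimal sketch in base R; the variable names are chosen only for illustration.

```r
# Each reserved value, produced by a small expression
na_value   <- 1 / NA    # NA: arithmetic with a missing value stays missing
nan_value  <- 0 / 0     # NaN: the result is undefined ("not a number")
inf_value  <- 10 / 0    # Inf: a nonzero number divided by zero
null_value <- c()       # NULL: an empty call to c() returns the null object

is.na(na_value)         # TRUE
is.nan(nan_value)       # TRUE
is.infinite(inf_value)  # TRUE
is.null(null_value)     # TRUE

# Dividing by Inf gives an ordinary number, not a special value
five_over_inf <- 5 / Inf
five_over_inf           # 0
```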
Common strategies for handling missing data
When investigating a dataset, several functions can assist with the process. To find missing and non-missing values, use is.na() and !is.na(), respectively. To drop rows that contain missing values, use na.omit() from base R or drop_na() from the tidyr package. Many summary functions, such as mean() and sum(), accept the argument na.rm = TRUE to ignore missing values during the calculation (some packages name this argument differently, for example remove_na = TRUE). The function complete.cases() returns a logical vector that marks the rows with no missing values, which can then be used to subset the data.
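The functions above can be tried on a small data frame. The sketch below uses base R only, and the survey data frame is made up for illustration.

```r
# Hypothetical survey data with gaps in two columns
survey <- data.frame(
  id    = 1:5,
  age   = c(34, NA, 29, 41, NA),
  score = c(7.5, 8.0, NA, 6.5, 9.0)
)

is.na(survey$age)            # FALSE TRUE FALSE FALSE TRUE
colSums(is.na(survey))       # missing values per column: id 0, age 2, score 1

# complete.cases() flags rows; it does not remove anything by itself
complete_rows <- complete.cases(survey)
complete_rows                # TRUE FALSE FALSE TRUE FALSE

survey[complete_rows, ]      # keep the complete rows by subsetting
cleaned <- na.omit(survey)   # drop incomplete rows in one step
nrow(cleaned)                # 2

mean(survey$age)                 # NA, because the missing values propagate
mean(survey$age, na.rm = TRUE)   # mean of the three observed ages
```

With tidyr loaded, drop_na(survey) would give the same two rows as na.omit(survey).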
Deletion: Remove rows or columns with missing values (the simplest method, but it can lead to bias).
Imputation: Replace missing values with estimated values based on statistical methods (this can introduce artificial data).
R has two main ways to represent absent data, NA and NULL.
Missing values are sometimes encoded as a fixed sentinel value, such as 0 (zero) or a negative number, and need to be recoded before analysis.
A missing value can appear as NA or NaN, which are detected with is.na() and is.nan(), respectively. Note that is.na() also returns TRUE for NaN.
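To make the two strategies concrete, here is a minimal sketch that recodes a sentinel value to NA and then compares deletion with simple mean imputation. The temperature readings and the -99 code are assumptions made up for this example, and mean imputation stands in for the more careful statistical methods a real analysis would use.

```r
# Hypothetical temperature readings where -99 was used to code "missing"
temps <- c(21.5, -99, 19.0, 23.5, -99, 20.0)

# Recode the sentinel to NA so that R treats it as missing
temps[temps == -99] <- NA

# Deletion: drop the missing readings
temps_deleted <- temps[!is.na(temps)]
length(temps_deleted)   # 4

# Imputation: fill each gap with the mean of the observed readings
temps_imputed <- temps
temps_imputed[is.na(temps_imputed)] <- mean(temps, na.rm = TRUE)
temps_imputed           # 21.5 21.0 19.0 23.5 21.0 20.0

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
is.na(c(NA, NaN))       # TRUE TRUE
is.nan(c(NA, NaN))      # FALSE TRUE
```

In this example the imputed vector keeps the same mean as the observed readings but has a smaller variance, which is one way that imputation can introduce artificial structure.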
12.3.2 NA
R uses NA to represent missing data. An NA appears as just another element of a vector. To test each element for missingness, we use is.na(). More generally, we can use packages such as mi, mice, and Amelia (which will be discussed later) to deal with missing data. Deleting missing data may lead to bias or data loss, so we need to be careful when we choose this option. In subsequent blog posts, we will look at the use of imputation to deal with missing data.
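A short base R sketch shows how NA behaves inside a vector. The vector x is made up for illustration.

```r
x <- c(4, NA, 7, NA, 10)

is.na(x)              # FALSE TRUE FALSE TRUE FALSE
sum(is.na(x))         # 2 missing values
which(is.na(x))       # positions 2 and 4

# NA propagates through arithmetic and comparisons
x + 1                 # 5 NA 8 NA 11
NA == NA              # NA, so use is.na() rather than == to test for missingness

# NA has typed variants, so it can sit inside any atomic vector
class(NA)             # "logical"
class(NA_character_)  # "character"
class(c("a", NA))     # "character"
```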
12.3.3 NULL
NULL represents nothingness or the “absence of anything”. It does not mean missing; it means that nothing is there at all. NULL cannot exist as an element of an atomic vector: if you include it in c(), it simply disappears.
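The difference from NA is easiest to see by comparing lengths. This is a minimal base R sketch with made-up object names.

```r
v <- c(1, NULL, 3)
length(v)             # 2, the NULL was dropped
length(c(1, NA, 3))   # 3, the NA still holds a position

length(NULL)          # 0
is.null(NULL)         # TRUE
is.na(NULL)           # logical(0), there is nothing to test

# A list, unlike an atomic vector, can hold NULL as an element
lst <- list(a = 1, b = NULL)
length(lst)           # 2
is.null(lst$b)        # TRUE
is.null(lst$c)        # TRUE as well: a name that does not exist returns NULL
```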
12.4 Exercises
Identify missing values in a dataset using the is.na() and complete.cases() functions.
Practice removing missing data using the na.omit() function.
How can you handle missing values when calculating summary statistics like mean or median?
If you take the mean of a vector with NA values, what will the result be?
How are missing values represented in R?
What is NA in R? How is it different from NULL?
What are some common reasons you may get NA values in a dataset?
How can you check if a value is NA in R?
12.5 Summary
The learner should now have a firm grasp of what missing data is and why it is a common challenge in data analysis. After completing this chapter, the learner should also know how to identify and handle missing data with various R functions. This is the final lesson in the book, and I hope you have enjoyed your learning journey so far. In the next chapter, I provide a conclusion to the book and some next steps you can take to continue your R learning journey.