12  Handling missing data

12.1 Questions

  • What is missing data?

  • Why is missing data a common challenge in data analysis?

  • How can we identify and handle missing data effectively in R?

  • What are the consequences of ignoring or mishandling missing data?

  • What are the best practices for imputing missing values?

  • When should I choose deletion, imputation, or alternative approaches?

12.2 Learning Objectives

  • Define the main types of missing data.

  • Identify missing data in R and assess its impact on analyses.

  • Work with datasets that contain missing data.

  • Understand the importance of addressing missing data in data analysis.

  • Apply best practices for handling missing data in your R projects.

  • Apply functions like is.na() and complete.cases() to identify NA values.

12.3 Lesson Content

12.3.1 Introduction

One of the most common tasks/activities in data analysis is handling missing values. Missing data are found everywhere; therefore, it is important to learn how to identify and summarize them. Missing data can arise from different sources, including nonresponses in surveys or technical difficulties during data collection.

  • Unanswered survey questions
  • Data entry errors
  • Errors in measurements
  • Data merge issues

Missingness presents itself in various ways in R. The most common way to represent a missing value in R is with the value NA. Other reserved values include:

  • NULL = neither TRUE or FALSE (is.null() identifies a null value and as.null() converts to a null value).

  • NaN = impossible values

  • Inf = infinite value obtained by dividing by zero (is.infinite() identifies an infinite value and as.infinite() converts to an infinite value).

Examples of the missing values

NA = 1/NA

NaN = 0/0

Inf = 10/0

NULL = 5/Inf

Common strategies for handling missing data

When investigating a dataset, there are several functions that can assist with the process. To find missing and non-missing values, the user can work with is.na() and !is.na(), respectively. If any rows contain missing values, one can use na.omit() or drop_na(). If there are missing values in the dataset, the arguments na.rm = TRUE, remove_na = TRUE, or the function complete.cases() can be used to remove missing values.

  • Deletion: Remove rows or columns with missing values (simplest method but it can lead to bias).

  • Imputation: Replace missing values with estimated values based on statistical methods (can introduce artificial data).

R has two types of missing data, NA and NULL.

Missing values can represent a fixed value like 0 (zero) or a negative number.

Missing can be NA or NaN and represented by is.na() or is.nan()

12.3.2 NA

R uses NA to represent missing data. The NA appears as another element of a vector. To test each element for missingness we use is.na(). Generally, we can use tools such as mi, mice, and Amelia (which will be discussed later) to deal with missing data. The deletion of this missing data may lead to bias or data loss, so we need to be careful when we choose this option. In subsequent blog posts, we will look at the use of imputation to deal with missing data.

12.3.3 NULL

NULL represents nothingness or the “absence of anything”. 2 It does not mean missing but represents nothing. NULL cannot exist within a vector because it disappears.

12.4 Exercises

  1. Identify missing values in a dataset using the is.na() and complete.cases() functions.

  2. Practice removing missing data using the na.omit() function.

  3. How can you handle missing values when calculating summary statistics like mean or median?

  4. If you take the mean of a vector with NA values, what will the result be?

  5. How are missing values represented in R?

  6. What is NA in R? How is it different from NULL?

  7. What are some common reasons you may get NA values in a dataset?

  8. How can you check if a value is NA in R?

12.5 Summary

The learner should now have a firm grasp of what missing data is and why it is a common challenge in data analysis. Additionally, after completing this chapter, the learner should know how to identify and handle missing data with various R functions. This is the final chapter in the book, and I hope you have enjoyed your learning journey so far. In the next chapter, I provide a conclusion to the book and some next steps you can take to continue on your R learning journey.