Introduction to missing data

Using and finding missing values

When working with missing data, there are a couple of functions you should be familiar with. First, you should be able to identify whether there are any missing values, and where they are.

Using the any_na() and are_na() functions, identify which values are missing.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.5     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(naniar)

# Create x, a vector, with values NA, NaN, Inf, ".", and "missing"
x <- c(NA, NaN, Inf, ".", "missing")

# Use any_na() and are_na() on x to explore the missing values
any_na(x)
## [1] TRUE
are_na(x)
## [1]  TRUE FALSE FALSE FALSE FALSE
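
are_na() returns a logical vector, so if you also want the positions of the missing values ("where they are"), you can pass the result to base R's which(). A minimal sketch using the x defined above:

# Find the positions of the missing values in x
# (only the first element, NA, is treated as missing here)
which(are_na(x))
## [1] 1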

Well done! You’ve now learned how to use any_na() and are_na() to identify missing values.

How many missing values are there?

One of the first things you will want to check in a new dataset is whether there are any missing values, and how many there are.

You could use are_na() and count up the missing values, but the most efficient way to count missings is the n_miss() function, which returns the total number of missing values in the data.

You can then find the percentage of missing values in the data with the pct_miss() function.

You can also find the complement of these - how many complete values there are - using n_complete() and pct_complete().

# Use n_miss() to count the total number of missing values in dat_hw
n_miss(dat_hw)
[1] 30
# Use n_miss() on dat_hw$weight to count the total number of missing values
n_miss(dat_hw$weight)
[1] 15
# Use n_complete() on dat_hw to count the total number of complete values
n_complete(dat_hw)
[1] 170
# Use n_complete() on dat_hw$weight to count the total number of complete values
n_complete(dat_hw$weight)
[1] 85
# Use prop_miss() and prop_complete() on dat_hw to find the proportion of missing and complete values
prop_miss(dat_hw)
[1] 0.15
prop_complete(dat_hw)
[1] 0.85
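
The percentage equivalents, pct_miss() and pct_complete(), work the same way. A minimal sketch, assuming the same dat_hw data as above (30 of 200 values missing):

# Use pct_miss() and pct_complete() to express the same quantities as percentages
pct_miss(dat_hw)
[1] 15
pct_complete(dat_hw)
[1] 85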

Congratulations! You’ve learned how to count the number and proportion of missing values using n_miss(), prop_miss(), n_complete(), and prop_complete().

Why care about missing values?

Summarizing missingness

Now that you understand the behavior of missing values in R, and how to count them, let’s scale up our summaries for cases (rows) and variables, using miss_var_summary() and miss_case_summary(), and also explore how they can be applied to groups in a dataframe, using the group_by() function from dplyr.

# Summarize missingness in each variable of the `airquality` dataset
miss_var_summary(airquality)
## # A tibble: 6 x 3
##   variable n_miss pct_miss
##   <chr>     <int>    <dbl>
## 1 Ozone        37    24.2 
## 2 Solar.R       7     4.58
## 3 Wind          0     0   
## 4 Temp          0     0   
## 5 Month         0     0   
## 6 Day           0     0
# Summarize missingness in each case of the `airquality` dataset
miss_case_summary(airquality)
## # A tibble: 153 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1     5      2     33.3
##  2    27      2     33.3
##  3     6      1     16.7
##  4    10      1     16.7
##  5    11      1     16.7
##  6    25      1     16.7
##  7    26      1     16.7
##  8    32      1     16.7
##  9    33      1     16.7
## 10    34      1     16.7
## # ... with 143 more rows
# Return the summary of missingness in each variable, 
# grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_summary()
## # A tibble: 25 x 4
## # Groups:   Month [5]
##    Month variable n_miss pct_miss
##    <int> <chr>     <int>    <dbl>
##  1     5 Ozone         5     16.1
##  2     5 Solar.R       4     12.9
##  3     5 Wind          0      0  
##  4     5 Temp          0      0  
##  5     5 Day           0      0  
##  6     6 Ozone        21     70  
##  7     6 Solar.R       0      0  
##  8     6 Wind          0      0  
##  9     6 Temp          0      0  
## 10     6 Day           0      0  
## # ... with 15 more rows
# Return the summary of missingness in each case, 
# grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_summary()
## # A tibble: 153 x 4
## # Groups:   Month [5]
##    Month  case n_miss pct_miss
##    <int> <int>  <int>    <dbl>
##  1     5     5      2       40
##  2     5    27      2       40
##  3     5     6      1       20
##  4     5    10      1       20
##  5     5    11      1       20
##  6     5    25      1       20
##  7     5    26      1       20
##  8     5     1      0        0
##  9     5     2      0        0
## 10     5     3      0        0
## # ... with 143 more rows

Hooray! You now know how to summarize missingness for variables and cases within each group!

Tabulating Missingness

The summaries of missingness we just calculated give us the number and percentage of missing observations for the cases and variables.

Another way to summarize missingness is to tabulate the number of times there are 0, 1, 2, 3, or more missings in a variable or in a case.

In this exercise we are going to tabulate the number of missings in each case and variable using miss_var_table() and miss_case_table(), and also combine these summaries with the group_by() operator from dplyr to explore them over a grouping variable in the dataset.

# Tabulate missingness in each variable and case of the `airquality` dataset
miss_var_table(airquality)
## # A tibble: 3 x 3
##   n_miss_in_var n_vars pct_vars
## *         <int>  <int>    <dbl>
## 1             0      4     66.7
## 2             7      1     16.7
## 3            37      1     16.7
miss_case_table(airquality)
## # A tibble: 3 x 3
##   n_miss_in_case n_cases pct_cases
## *          <int>   <int>     <dbl>
## 1              0     111     72.5 
## 2              1      40     26.1 
## 3              2       2      1.31
# Tabulate the missingness in each variable, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_var_table()
## # A tibble: 12 x 4
## # Groups:   Month [5]
##    Month n_miss_in_var n_vars pct_vars
##    <int>         <int>  <int>    <dbl>
##  1     5             0      3       60
##  2     5             4      1       20
##  3     5             5      1       20
##  4     6             0      4       80
##  5     6            21      1       20
##  6     7             0      4       80
##  7     7             5      1       20
##  8     8             0      3       60
##  9     8             3      1       20
## 10     8             5      1       20
## 11     9             0      4       80
## 12     9             1      1       20
# Tabulate the missingness in each case, grouped by Month, in the `airquality` dataset
airquality %>% group_by(Month) %>% miss_case_table()
## # A tibble: 11 x 4
## # Groups:   Month [5]
##    Month n_miss_in_case n_cases pct_cases
##    <int>          <int>   <int>     <dbl>
##  1     5              0      24     77.4 
##  2     5              1       5     16.1 
##  3     5              2       2      6.45
##  4     6              0       9     30   
##  5     6              1      21     70   
##  6     7              0      26     83.9 
##  7     7              1       5     16.1 
##  8     8              0      23     74.2 
##  9     8              1       8     25.8 
## 10     9              0      29     96.7 
## 11     9              1       1      3.33

Fantastic! You have now learned how to tabulate and summarize missingness for each variable and case, and how to get summaries for each group.

Other summaries of missingness

Some summaries of missingness are particularly useful for different types of data, such as time series. For example, miss_var_run() summarizes the runs (streaks) of complete and missing values in a variable, and miss_var_span() summarizes missingness over repeating spans of a given length.

Both miss_var_span() and miss_var_run() work with the group_by operator from dplyr.

# Calculate the summaries for each run of missingness for the variable, hourly_counts
miss_var_run(pedestrian, var = hourly_counts)
## # A tibble: 35 x 2
##    run_length is_na   
##         <int> <chr>   
##  1       6628 complete
##  2          1 missing 
##  3       5250 complete
##  4        624 missing 
##  5       3652 complete
##  6          1 missing 
##  7       1290 complete
##  8        744 missing 
##  9       7420 complete
## 10          1 missing 
## # ... with 25 more rows
# Calculate the summaries for each span of missingness, 
# for a span of 4000, for the variable hourly_counts
miss_var_span(pedestrian, var = hourly_counts, span_every = 4000)
## # A tibble: 10 x 5
##    span_counter n_miss n_complete prop_miss prop_complete
##  *        <int>  <int>      <dbl>     <dbl>         <dbl>
##  1            1      0       4000   0               1    
##  2            2      1       3999   0.00025         1.00 
##  3            3    121       3879   0.0302          0.970
##  4            4    503       3497   0.126           0.874
##  5            5    745       3255   0.186           0.814
##  6            6      0       4000   0               1    
##  7            7      1       3999   0.00025         1.00 
##  8            8      0       4000   0               1    
##  9            9    745       3255   0.186           0.814
## 10           10    432       3568   0.108           0.892
# For each `month` variable, calculate the run of missingness for hourly_counts
pedestrian %>% group_by(month) %>% miss_var_run(var = hourly_counts)
## # A tibble: 51 x 3
## # Groups:   month [12]
##    month    run_length is_na   
##    <ord>         <int> <chr>   
##  1 January        2976 complete
##  2 February       2784 complete
##  3 March          2976 complete
##  4 April           888 complete
##  5 April           552 missing 
##  6 April          1440 complete
##  7 May             744 complete
##  8 May              72 missing 
##  9 May            2160 complete
## 10 June           2880 complete
## # ... with 41 more rows
# For each `month` variable, calculate the span of missingness 
# of a span of 2000, for the variable hourly_counts
pedestrian %>% group_by(month) %>% miss_var_span(var = hourly_counts, span_every = 2000)
## # A tibble: 25 x 6
## # Groups:   month [12]
##    month    span_counter n_miss n_complete prop_miss prop_complete
##    <ord>           <int>  <int>      <dbl>     <dbl>         <dbl>
##  1 January             1      0       2000     0             1    
##  2 January             2      0       2000     0             1    
##  3 February            1      0       2000     0             1    
##  4 February            2      0       2000     0             1    
##  5 March               1      0       2000     0             1    
##  6 March               2      0       2000     0             1    
##  7 April               1    552       1448     0.276         0.724
##  8 April               2      0       2000     0             1    
##  9 May                 1     72       1928     0.036         0.964
## 10 May                 2      0       2000     0             1    
## # ... with 15 more rows

Wonderful! You have now learned how to summarize missingness over repeating spans, and find runs of missingness!

How do we visualize missing values?

Your first missing data visualizations

It can be difficult to get a handle on where the missing values are in your data, and here is where visualization can really help.

The function vis_miss() creates an overview visualization of the missingness in the data. It also has options to cluster rows based on missingness, using cluster = TRUE; as well as options for sorting the columns, from most missing to least missing (sort_miss = TRUE).

# Visualize all of the missingness in the `riskfactors`  dataset
vis_miss(riskfactors)

# Visualize and cluster all of the missingness in the `riskfactors` dataset
vis_miss(riskfactors, cluster = TRUE)

# Visualize and sort the columns by missingness in the `riskfactors` dataset
vis_miss(riskfactors, sort_miss = TRUE)

Excellent! You can now produce quick visual summaries of all missingness in the data using vis_miss!

Visualizing missing cases and variables

To get a clear picture of the missingness across variables and cases, use gg_miss_var() and gg_miss_case(). These are the visual counterpart to miss_var_summary() and miss_case_summary().

These can be split up into multiple plots with one for each category by choosing a variable to facet by.

# Visualize the number of missings in cases using `gg_miss_case()`
gg_miss_case(riskfactors)

# Explore the number of missings in cases using `gg_miss_case()` 
# and facet by the variable `education`
gg_miss_case(riskfactors, facet = education)

# Visualize the number of missings in variables using `gg_miss_var()`
gg_miss_var(riskfactors)

# Explore the number of missings in variables using `gg_miss_var()` 
# and facet by the variable `education`
gg_miss_var(riskfactors, facet = education)

Wahoo! You can now visualize missingness in a dataset for variables and cases, and explore how other variables interact with missingness.

Visualizing missingness patterns

Let’s practice a few different ways to visualize patterns of missingness using gg_miss_upset(), gg_miss_fct(), and gg_miss_span().

What do you notice with the missingness and the faceting in the data?

# Using the airquality dataset, explore the missingness pattern using gg_miss_upset()
gg_miss_upset(airquality)

# With the riskfactors dataset, explore how the missingness changes across the marital variable using gg_miss_fct()
gg_miss_fct(x = riskfactors, fct = marital)

# Using the pedestrian dataset, explore how the missingness of hourly_counts changes over a span of 3000 
gg_miss_span(pedestrian, var = hourly_counts, span_every = 3000)

# Using the pedestrian dataset, explore the impact of month by faceting by month
# and explore how missingness changes for a span of 1000
gg_miss_span(pedestrian, var = hourly_counts, span_every = 1000, facet = month)

Splendid! You can now visualize different missingness patterns for different kinds of data.

Searching for and replacing missing values

Using miss_scan_count

You have a dataset with missing values coded as "N/A", "missing", and "na". But before we go ahead and start replacing these with NA, we should get an idea of how big the problem is.

Use miss_scan_count() to count the possible missings in the pacman dataset, a dataset of Pac-Man scores containing the columns year, month, day, initial, score, and country.

# Explore the strange missing values "N/A"
miss_scan_count(data = pacman, search = list("N/A"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year         0
2 month        0
3 day          0
4 initial      0
5 score      100
6 country      0
# Explore the strange missing values "missing"
miss_scan_count(data = pacman, search = list("missing"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year        93
2 month       93
3 day         93
4 initial      0
5 score        0
6 country      0
# Explore the strange missing values "na"
miss_scan_count(data = pacman, search = list("na"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year       100
2 month      100
3 day        100
4 initial      0
5 score        0
6 country      0
# Explore the strange missing values " " (a single space)
miss_scan_count(data = pacman, search = list(" "))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year         0
2 month        0
3 day          0
4 initial      0
5 score        0
6 country    100
# Explore all of the strange missing values, "N/A", "missing", "na", " "
miss_scan_count(data = pacman, search = list("N/A", "missing","na", " "))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year       193
2 month      193
3 day        193
4 initial      0
5 score      100
6 country    100

Fantastic! You’ve learned how to search for and count strange missing values.

Using replace_with_na

Following on from the previous dataset, we now know that we have a few strange missing values.

Now, we are going to do something about it, and replace these values with missings (e.g. NA) using the function replace_with_na().

# Print the top of the pacman data using `head()`
head(pacman)
# A tibble: 6 x 6
  year  month day   initial score   country
  <chr> <chr> <chr> <chr>   <chr>   <chr>  
1 2007  10    27    LEX     2065812 "CA"   
2 1995  8     23    PNY     1163465 "JP"   
3 1980  2     8     MBJ     175380  " "    
4 1982  5     9     QRC     2025632 "ES"   
5 na    na    na    YPZ     925357  "NZ"   
6 2013  11    15    RVJ     319733  "AU" 
# Replace the strange missing values "N/A", "na", and 
# "missing" with `NA` for the variables, year, and score
pacman_clean <- replace_with_na(pacman, replace = list(year = c("N/A", "na", "missing"),
                                score = c("N/A", "na", "missing")))
                                        
# Check whether `pacman_clean` still contains these values
miss_scan_count(pacman_clean, search = list("N/A", "na", "missing"))
# A tibble: 6 x 2
  Variable     n
  <chr>    <int>
1 year         0
2 month      193
3 day        193
4 initial      0
5 score        0
6 country      0

You are doing great! You’ve now learned how to search for, replace, and confirm strange missing values in a dataset!

Using replace_with_na scoped variants

To reduce code repetition when replacing values with NA, use the “scoped variants” of replace_with_na(): replace_with_na_at(), replace_with_na_if(), and replace_with_na_all().

The syntax of replacement looks like this:

~.x == "N/A"

This replaces all cases that are equal to “N/A”.

~.x %in% c("N/A", "missing", "na", " ")

This replaces all cases that are equal to "N/A", "missing", "na", or " ".
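
For example, a minimal sketch (assuming the pacman data from the previous exercises is loaded) that replaces only the values exactly equal to "N/A", across every variable:

# Replace only exact "N/A" values with NA across all variables
replace_with_na_all(pacman, ~.x == "N/A")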

# Use `replace_with_na_at()` to replace with NA
replace_with_na_at(pacman,
                   .vars = c("year", "month", "day"), 
                   ~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
   year  month day   initial score   country
   <chr> <chr> <chr> <chr>   <chr>   <chr>  
 1 2007  10    27    LEX     2065812 "CA"   
 2 1995  8     23    PNY     1163465 "JP"   
 3 1980  2     8     MBJ     175380  " "    
 4 1982  5     9     QRC     2025632 "ES"   
 5 <NA>  <NA>  <NA>  YPZ     925357  "NZ"   
 6 2013  11    15    RVJ     319733  "AU"   
 7 2003  12    4     VKD     3322668 "US"   
 8 2016  9     9     ZIS     2137806 "CN"   
 9 2013  3     20    IYD     3059716 "CN"   
10 1993  5     19    CHQ     231892  "AU"   
# ... with 1,990 more rows
# Use `replace_with_na_if()` to replace with NA the character values using `is.character`
replace_with_na_if(pacman,
                   .predicate = is.character, 
                   ~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
   year  month day   initial score   country
   <chr> <chr> <chr> <chr>   <chr>   <chr>  
 1 2007  10    27    LEX     2065812 CA     
 2 1995  8     23    PNY     1163465 JP     
 3 1980  2     8     MBJ     175380  <NA>   
 4 1982  5     9     QRC     2025632 ES     
 5 <NA>  <NA>  <NA>  YPZ     925357  NZ     
 6 2013  11    15    RVJ     319733  AU     
 7 2003  12    4     VKD     3322668 US     
 8 2016  9     9     ZIS     2137806 CN     
 9 2013  3     20    IYD     3059716 CN     
10 1993  5     19    CHQ     231892  AU     
# ... with 1,990 more rows
# Use `replace_with_na_all()` to replace with NA
replace_with_na_all(pacman, ~.x %in% c("N/A", "missing", "na", " "))
# A tibble: 2,000 x 6
   year  month day   initial score   country
   <chr> <chr> <chr> <chr>   <chr>   <chr>  
 1 2007  10    27    LEX     2065812 CA     
 2 1995  8     23    PNY     1163465 JP     
 3 1980  2     8     MBJ     175380  <NA>   
 4 1982  5     9     QRC     2025632 ES     
 5 <NA>  <NA>  <NA>  YPZ     925357  NZ     
 6 2013  11    15    RVJ     319733  AU     
 7 2003  12    4     VKD     3322668 US     
 8 2016  9     9     ZIS     2137806 CN     
 9 2013  3     20    IYD     3059716 CN     
10 1993  5     19    CHQ     231892  AU     
# ... with 1,990 more rows

Super! Now you know how to use the powerful scoped variants of replace_with_na() to replace values with NA.

Filling down missing values

Fix implicit missings using complete()

We are going to explore a new dataset, frogger.

This dataset contains 4 scores per player recorded at different times: morning, afternoon, evening, and late_night.

Every player should have played 4 games, one at each of these times, but it looks like not every player completed all of these games.

Use the complete() function to make these implicit missing values explicit.

# Print the frogger data to have a look at it
frogger
    name       time  value
1  jesse    morning   6678
2  jesse  afternoon 800060
3  jesse    evening 475528
4  jesse late_night 143533
5   andy    morning 425115
6   andy  afternoon 587468
7   andy late_night 111000
8    nic  afternoon 588532
9    nic late_night 915533
10   dan    morning 388148
11   dan    evening 180912
12  alex    morning 552670
13  alex  afternoon  98355
14  alex    evening 266055
15  alex late_night 121056
# Use `complete()` on the `time` and `name` variables to  
# make implicit missing values explicit
frogger_tidy <- frogger %>% complete(time, name)
frogger_tidy
# A tibble: 20 x 3
   time       name   value
   <fct>      <fct>  <int>
 1 afternoon  alex   98355
 2 afternoon  andy  587468
 3 afternoon  dan       NA
 4 afternoon  jesse 800060
 5 afternoon  nic   588532
 6 evening    alex  266055
 7 evening    andy      NA
 8 evening    dan   180912
 9 evening    jesse 475528
10 evening    nic       NA
11 late_night alex  121056
12 late_night andy  111000
13 late_night dan       NA
14 late_night jesse 143533
15 late_night nic   915533
16 morning    alex  552670
17 morning    andy  425115
18 morning    dan   388148
19 morning    jesse   6678
20 morning    nic       NA

Excellent! The complete() function converts implicitly missing values into explicitly missing values.

Fix explicit missings using fill()

One type of missing value that is relatively straightforward to deal with occurs when the first entry of a group is given, but subsequent entries are marked NA.

These missing values often result from cells being left blank in spreadsheets to avoid entering the same name multiple times, as well as for “human readability”.

This type of problem can be solved by using the fill() function from the tidyr package.

# Print the frogger data to have a look at it
frogger
# A tibble: 15 x 4
   name  time        value  year
   <chr> <chr>       <int> <dbl>
 1 jesse morning      6678  1981
 2 <NA>  afternoon  800060    NA
 3 <NA>  evening    475528    NA
 4 <NA>  late_night 143533    NA
 5 andy  morning    425115    NA
 6 <NA>  afternoon  587468    NA
 7 <NA>  late_night 111000    NA
 8 nic   afternoon  588532    NA
 9 <NA>  late_night 915533    NA
10 dan   morning    388148  1988
11 <NA>  evening    180912    NA
12 alex  morning    552670    NA
13 <NA>  afternoon   98355    NA
14 <NA>  evening    266055    NA
15 <NA>  late_night 121056    NA
# Use `fill()` to fill down the name variable in the frogger dataset
frogger %>% fill(name)
# A tibble: 15 x 4
   name  time        value  year
   <chr> <chr>       <int> <dbl>
 1 jesse morning      6678  1981
 2 jesse afternoon  800060    NA
 3 jesse evening    475528    NA
 4 jesse late_night 143533    NA
 5 andy  morning    425115    NA
 6 andy  afternoon  587468    NA
 7 andy  late_night 111000    NA
 8 nic   afternoon  588532    NA
 9 nic   late_night 915533    NA
10 dan   morning    388148  1988
11 dan   evening    180912    NA
12 alex  morning    552670    NA
13 alex  afternoon   98355    NA
14 alex  evening    266055    NA
15 alex  late_night 121056    NA

Excellent! The fill function makes it easy to fill down observations.

Using complete() and fill() together

Now let’s put it together!

Use complete() and fill() together to fix the explicitly and implicitly missing values in the frogger dataset.

# Print the frogger data to have a look at it
frogger
# A tibble: 15 x 4
   name  time        value  year
   <chr> <chr>       <int> <dbl>
 1 jesse morning      6678  1981
 2 <NA>  afternoon  800060    NA
 3 <NA>  evening    475528    NA
 4 <NA>  late_night 143533    NA
 5 andy  morning    425115    NA
 6 <NA>  afternoon  587468    NA
 7 <NA>  late_night 111000    NA
 8 nic   afternoon  588532    NA
 9 <NA>  late_night 915533    NA
10 dan   morning    388148  1988
11 <NA>  evening    180912    NA
12 alex  morning    552670    NA
13 <NA>  afternoon   98355    NA
14 <NA>  evening    266055    NA
15 <NA>  late_night 121056    NA
# Correctly fill() and complete() missing values so that our dataset becomes sensible
frogger %>%
  fill(name) %>%
  complete(name, time)
# A tibble: 20 x 4
   name  time        value  year
   <chr> <chr>       <int> <dbl>
 1 alex  afternoon   98355    NA
 2 alex  evening    266055    NA
 3 alex  late_night 121056    NA
 4 alex  morning    552670    NA
 5 andy  afternoon  587468    NA
 6 andy  evening        NA    NA
 7 andy  late_night 111000    NA
 8 andy  morning    425115    NA
 9 dan   afternoon      NA    NA
10 dan   evening    180912    NA
11 dan   late_night     NA    NA
12 dan   morning    388148  1988
13 jesse afternoon  800060    NA
14 jesse evening    475528    NA
15 jesse late_night 143533    NA
16 jesse morning      6678  1981
17 nic   afternoon  588532    NA
18 nic   evening        NA    NA
19 nic   late_night 915533    NA
20 nic   morning        NA    NA

Faaaantastic! You correctly used the fill and complete functions in the right order to create a sensible dataset.

Missing Data dependence

Note: missingness can depend on data observed, but not on data unobserved.

Exploring missingness dependence

To learn about the structure of the missingness in data, you can explore how sorting changes how missingness is presented.

For the oceanbuoys dataset, explore the missingness with vis_miss(), and then arrange the data by a few different variables.

This is not a definitive process, but it will get you started to ask the right questions of your data. We explore more powerful techniques in the next chapter.

# Arrange by year
oceanbuoys %>% arrange(year) %>% vis_miss()

# Arrange by latitude
oceanbuoys %>% arrange(latitude) %>% vis_miss()

# Arrange by wind_ew (wind east west)
oceanbuoys %>% arrange(wind_ew) %>% vis_miss()

Wahoo! You did it. We now have some ideas for how to explore missingness in the data, and we will look at more powerful ways to explain and explore missingness in the next chapter.

Tools to explore missing data dependence

Creating shadow matrix data

Missing data can be tricky to think about, as they don’t usually proclaim themselves for you, and instead hide amongst the weeds of the data.

One way to help expose missing values is to change the way we think about the data - by thinking about every single data value being missing or not missing.

The as_shadow() function in R transforms a dataframe into a shadow matrix, a special data format where the values are either missing (NA), or Not Missing (!NA).

The column names of a shadow matrix are the same as in the data, but with the suffix _NA added.

To keep track of and compare data values to their missingness state, use the bind_shadow() function. Having data in this format, with the shadow matrix column bound to the regular data is called nabular data.

# Create shadow matrix data with `as_shadow()`
as_shadow(oceanbuoys)
## # A tibble: 736 x 8
##    year_NA latitude_NA longitude_NA sea_temp_c_NA air_temp_c_NA humidity_NA
##    <fct>   <fct>       <fct>        <fct>         <fct>         <fct>      
##  1 !NA     !NA         !NA          !NA           !NA           !NA        
##  2 !NA     !NA         !NA          !NA           !NA           !NA        
##  3 !NA     !NA         !NA          !NA           !NA           !NA        
##  4 !NA     !NA         !NA          !NA           !NA           !NA        
##  5 !NA     !NA         !NA          !NA           !NA           !NA        
##  6 !NA     !NA         !NA          !NA           !NA           !NA        
##  7 !NA     !NA         !NA          !NA           !NA           !NA        
##  8 !NA     !NA         !NA          !NA           !NA           !NA        
##  9 !NA     !NA         !NA          !NA           !NA           !NA        
## 10 !NA     !NA         !NA          !NA           !NA           !NA        
## # ... with 726 more rows, and 2 more variables: wind_ew_NA <fct>,
## #   wind_ns_NA <fct>
# Create nabular data by binding the shadow to the data with `bind_shadow()`
bind_shadow(oceanbuoys)
## # A tibble: 736 x 16
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ... with 726 more rows, and 8 more variables: year_NA <fct>,
## #   latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
## #   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>
# Bind only the variables with missing values by using bind_shadow(only_miss = TRUE)
bind_shadow(oceanbuoys, only_miss = TRUE)
## # A tibble: 736 x 11
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ... with 726 more rows, and 3 more variables: sea_temp_c_NA <fct>,
## #   air_temp_c_NA <fct>, humidity_NA <fct>

Nice, you did it! We can now create shadow matrix and nabular data. This is a useful first step in more advanced summaries of missing data.

Performing grouped summaries of missingness

Now that you can create nabular data, let’s use it to explore the data. Let’s calculate summary statistics based on the missingness of another variable.

To do this, we are going to first create nabular data using bind_shadow().

Next, perform some summaries on the data using group_by() and summarize() to calculate the mean and standard deviation, using the mean() and sd() functions.

# `bind_shadow()` and `group_by()` humidity missingness (`humidity_NA`)
oceanbuoys %>%
  bind_shadow() %>%
  group_by(humidity_NA) %>%
  summarize(wind_ew_mean = mean(wind_ew), # calculate mean of wind_ew
            wind_ew_sd = sd(wind_ew)) # calculate standard deviation of wind_ew
## # A tibble: 2 x 3
##   humidity_NA wind_ew_mean wind_ew_sd
## * <fct>              <dbl>      <dbl>
## 1 !NA                -3.78       1.90
## 2 NA                 -3.30       2.31
# Repeat this, but calculating summaries for wind north south (`wind_ns`)
oceanbuoys %>%
  bind_shadow() %>%
  group_by(humidity_NA) %>%
  summarize(wind_ns_mean = mean(wind_ns),
            wind_ns_sd = sd(wind_ns))
## # A tibble: 2 x 3
##   humidity_NA wind_ns_mean wind_ns_sd
## * <fct>              <dbl>      <dbl>
## 1 !NA                 2.78       2.06
## 2 NA                  1.66       2.23

Fantastic! Now we can summarize and explore values at each level of missingness.

Further exploring more combinations of missingness

It can be useful to get a bit of extra information about the number of cases in each missing condition.

In this exercise, we are going to add information about the number of observed cases using n() inside the summarize() function.

We will then add an additional level of grouping by looking at the combination of humidity being missing (humidity_NA) and air temperature being missing (air_temp_c_NA).

# Summarize wind_ew by the missingness of `air_temp_c_NA`
oceanbuoys %>% 
  bind_shadow() %>%
  group_by(air_temp_c_NA) %>%
  summarize(wind_ew_mean = mean(wind_ew),
            wind_ew_sd = sd(wind_ew),
            n_obs = n())
## # A tibble: 2 x 4
##   air_temp_c_NA wind_ew_mean wind_ew_sd n_obs
## * <fct>                <dbl>      <dbl> <int>
## 1 !NA                  -3.91       1.85   655
## 2 NA                   -2.17       2.14    81
# Summarize wind_ew by missingness of `air_temp_c_NA` and `humidity_NA`
oceanbuoys %>% 
  bind_shadow() %>%
  group_by(air_temp_c_NA, humidity_NA) %>%
  summarize(wind_ew_mean = mean(wind_ew),
            wind_ew_sd = sd(wind_ew),
            n_obs = n())
## `summarise()` has grouped output by 'air_temp_c_NA'. You can override using the `.groups` argument.
## # A tibble: 4 x 5
## # Groups:   air_temp_c_NA [2]
##   air_temp_c_NA humidity_NA wind_ew_mean wind_ew_sd n_obs
##   <fct>         <fct>              <dbl>      <dbl> <int>
## 1 !NA           !NA                -4.01       1.74   565
## 2 !NA           NA                 -3.24       2.31    90
## 3 NA            !NA                -2.06       2.08    78
## 4 NA            NA                 -4.97       1.74     3

Excellent! By counting the number of observations in each of the groups, we can get some useful context for our summary statistics.

Visualizing missingness across one variable

Nabular data and filling by missingness

Summary statistics are useful to calculate, but as they say, a picture is worth a thousand words.

In this exercise, we are going to use nabular data to explore the variation in a variable by the missingness of another.

We are going to use the oceanbuoys dataset from naniar.

# First explore the missingness structure of `oceanbuoys` using `vis_miss()`
vis_miss(oceanbuoys)

# Explore the distribution of `wind_ew` for the missingness  
# of `air_temp_c_NA` using  `geom_density()`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew, 
             color = air_temp_c_NA)) + 
  geom_density()

# Explore the distribution of sea temperature for the  
# missingness of humidity (humidity_NA) using  `geom_density()`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = sea_temp_c,
             color = humidity_NA)) + 
  geom_density()
## Warning: Removed 3 rows containing non-finite values (stat_density).

Excellent! You can now use nabular data to visualize and explore missing data using density plots.

Nabular data and summarising by missingness

In this exercise, we are again going to use nabular data to explore the variation in a variable by the missingness of another.

We are going to use the oceanbuoys dataset from naniar, and then create multiple plots of the data using facets.

This allows you to explore different layers of missingness.

# Explore the distribution of wind east west (wind_ew) for the missingness of air temperature 
# using geom_density() and faceting by the missingness of air temperature (air_temp_c_NA).
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = wind_ew)) + 
  geom_density() + 
  facet_wrap(~air_temp_c_NA)

# Build upon this visualization by filling by the missingness of humidity (humidity_NA).
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = wind_ew,
             color = humidity_NA)) + 
  geom_density() + 
  facet_wrap(~air_temp_c_NA)

Excellent! You can now use nabular data to visualize and explore missing data.

Explore variation by missingness: box plots

Previous exercises used nabular data along with density plots to explore the variation in a variable by the missingness of another.

We are going to use the oceanbuoys dataset from naniar again, this time using box plots, together with facets, to explore different layers of missingness.

# Explore the distribution of wind east west (`wind_ew`) for  
# the missingness of air temperature using  `geom_boxplot()`
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = air_temp_c_NA,
             y = wind_ew)) + 
  geom_boxplot()

# Build upon this visualization by faceting by the missingness of humidity (`humidity_NA`).
oceanbuoys %>%
  bind_shadow() %>%
  ggplot(aes(x = air_temp_c_NA,
             y = wind_ew)) + 
  geom_boxplot() + 
  facet_wrap(~humidity_NA)

Excellent! You can now use nabular data to visualize and explore missing data with box plots and facet wraps.

Visualizing missingness across two variables

Exploring missing data with scatter plots

Missing values in a scatter plot in ggplot2 are removed by default, with a warning.

We can display missing values in a scatter plot, using geom_miss_point() - a special ggplot2 geom that shifts the missing values into the plot, displaying them 10% below the minimum of the variable.

Let’s practice using this visualization with the oceanbuoys dataset.

# Explore the missingness in wind and air temperature, and  
# display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
       aes(x = wind_ew,
           y = air_temp_c)) + 
  geom_miss_point()

# Explore the missingness in humidity and air temperature,  
# and display the missingness using `geom_miss_point()`
ggplot(oceanbuoys,
       aes(x = humidity,
           y = air_temp_c)) + 
  geom_miss_point()

Splendid! Now you can use geom_miss_point() to explore missing values in a scatter plot.

Using facets to explore missingness

Because geom_miss_point() is a ggplot geom, you can use it with ggplot2 features like faceting.

This means we can rapidly explore the missingness and stay within the familiar bounds of ggplot2.

# Explore the missingness in wind and air temperature, and display the 
# missingness using `geom_miss_point()`. Facet by year to explore this further.
ggplot(oceanbuoys,
       aes(x = wind_ew,
           y = air_temp_c)) + 
  geom_miss_point() + 
  facet_wrap(~year)

# Explore the missingness in humidity and air temperature, and display the 
# missingness using `geom_miss_point()` Facet by year to explore this further.
ggplot(oceanbuoys,
       aes(x = humidity,
           y = air_temp_c)) + 
  geom_miss_point() + 
  facet_wrap(~year)

Splendid! Now you can use geom_miss_point() to explore missing values in a scatter plot, and can use faceting to expand and explore further.

Faceting to explore missingness (multiple plots)

Another useful technique with geom_miss_point() is to explore the missingness by creating multiple plots.

Just as we have done in the previous exercises, we can use the nabular data to help us create additional faceted plots.

We can even create multiple faceted plots according to values in the data, such as year, and features of the data, such as missingness.

# Use geom_miss_point() and facet_wrap to explore how the missingness 
# in wind_ew and air_temp_c is different for missingness of humidity
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew,
           y = air_temp_c)) + 
  geom_miss_point() + 
  facet_wrap(~humidity_NA)

# Use geom_miss_point() and facet_grid to explore how the missingness in wind_ew and air_temp_c 
# is different for missingness of humidity AND by year - by using `facet_grid(humidity_NA ~ year)`
bind_shadow(oceanbuoys) %>%
  ggplot(aes(x = wind_ew,
             y = air_temp_c)) + 
  geom_miss_point() + 
  facet_grid(humidity_NA~year)

Wonderful! Now you can use facet_grid() and facet_wrap() to dive deeper into understanding your missing data!

Filling in the blanks

Impute data below range with nabular data

We want to keep track of values we imputed. If we don’t, it is very difficult to assess how good the imputed values are.

We are going to practice imputing data and recreating the visualizations from the previous set of exercises by imputing values below the range of the data.

This is a very useful way to help further explore missingness, and also provides the framework for imputing missing values.

First, we are going to impute the data below the range using impute_below_all(), and then visualize the data. We notice that although we can see where the missing values are in this instance, we need some way to track them. The track missing data programming pattern can help with this.

# Impute the data below the range using `impute_below_all()`.
ocean_imp <- impute_below_all(oceanbuoys)

# Visualize the new missing values
ggplot(ocean_imp,
       aes(x = wind_ew, y = air_temp_c)) + 
  geom_point()

# Impute and track data with `bind_shadow()`, `impute_below_all()`, and `add_label_shadow()`
ocean_imp_track <- bind_shadow(oceanbuoys) %>% 
  impute_below_all() %>%
  add_label_shadow()

# Look at the imputed values
ocean_imp_track
## # A tibble: 736 x 17
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # ... with 726 more rows, and 9 more variables: year_NA <fct>,
## #   latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
## #   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>,
## #   any_missing <chr>

Wonderful! You can now impute and track missing values, allowing you to better explore imputations!

Visualize imputed values in a scatter plot

Now, let’s recreate one of the previous plots we saw in chapter three that used geom_miss_point().

To do this, we need to impute the data below the range of the data. This is a special kind of imputation to explore the data. This imputation will illustrate what we need to practice: how to track missing values. To impute the data below the range of the data, we use the function impute_below_all().

# Impute and track the missing values
ocean_imp_track <- bind_shadow(oceanbuoys) %>% 
  impute_below_all() %>%
  add_label_shadow()

# Visualize the missingness in wind and air temperature,  
# coloring missing air temp values with air_temp_c_NA
ggplot(ocean_imp_track,
       aes(x = wind_ew, y = air_temp_c, color = air_temp_c_NA)) + 
  geom_point()

# Visualize humidity and air temp, coloring any missing cases using the variable any_missing
ggplot(ocean_imp_track,
       aes(x = humidity, y = air_temp_c, color = any_missing)) + 
  geom_point()

Excellent! You can now impute, track, and visualize missing values! Now you can perform other kinds of explorations of the data.

Create histogram of imputed data

Now that we can recreate the first visualization of geom_miss_point(), let’s explore how we can apply this to other exploratory tasks.

One useful task is to evaluate the number of missings in a given variable using a histogram. We can do this using the ocean_imp_track dataset we created in the last exercise, which is loaded into this session.

# Explore the values of air_temp_c, visualizing the amount of missings with `air_temp_c_NA`.
p <- ggplot(ocean_imp_track, aes(x = air_temp_c, fill = air_temp_c_NA)) + geom_histogram()

# Explore the missings in humidity using humidity_NA
p2 <- ggplot(ocean_imp_track, aes(x = humidity, fill = humidity_NA)) + geom_histogram()

# Explore the missings in air_temp_c according to year, using `facet_wrap(~year)`.
p + facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Explore the missings in humidity according to year, using `facet_wrap(~year)`.
p2 + facet_wrap(~year)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Well done! Now you can use imputations to explore structure in visualizations for different types of data. You can apply the same principles for imputations to assist with prediction.

What makes a good imputation

Evaluating bad imputations

In order to evaluate imputations, it helps to know what something bad looks like. To explore this, let’s look at a typically bad imputation method: imputing using the mean value.

In this exercise we are going to explore how the mean imputation method works using a box plot, using the oceanbuoys dataset.

# Impute the mean value and track the imputations 
ocean_imp_mean <- bind_shadow(oceanbuoys) %>% 
  impute_mean_all() %>% 
  add_label_shadow()

# Explore the mean values in humidity in the imputed dataset
ggplot(ocean_imp_mean, 
       aes(x = humidity_NA, y = humidity)) + 
  geom_boxplot()

# Explore the values in air temperature in the imputed dataset
ggplot(ocean_imp_mean, 
       aes(x = air_temp_c_NA, y = air_temp_c)) + 
  geom_boxplot()

Great job! You can explore the imputed values and look at how the mean values change across different variables.

Evaluating imputations: The scale

While mean imputation might not look so bad when compared using a box plot, it is important to get a sense of the variation in the data. This is why you should explore how the scale and spread of the imputed values change compared to the rest of the data.

One way to evaluate the appropriateness of the scale of the imputations is to use a scatter plot to explore whether or not the values are appropriate.

# Explore imputations in air temperature and humidity,  
# coloring by the variable, any_missing
ggplot(ocean_imp_mean, 
       aes(x = air_temp_c, y = humidity, color = any_missing)) + 
  geom_point()

# Explore imputations in air temperature and humidity,  
# coloring by the variable, any_missing, and faceting by year
ggplot(ocean_imp_mean, 
       aes(x = air_temp_c, y = humidity, color = any_missing)) +
  geom_point() + 
  facet_wrap(~year)

Fantastic! You can now explore the variation in imputed values using a scatter plot!

Evaluating imputations: Across many variables

So far, we have covered ways to look at individual variables or pairs of variables and their imputed values. However, sometimes you want to look at imputations for many variables. To do this, you need to perform some data munging and re-arranging, which can get a little hairy with nabular data. The shadow_long() function gets the data into the right shape for these kinds of visualizations.

# Gather the imputed data 
ocean_imp_mean_gather <- shadow_long(ocean_imp_mean,
                                     humidity,
                                     air_temp_c)
# Inspect the data
ocean_imp_mean_gather
## # A tibble: 1,472 x 4
##    variable   value       variable_NA   value_NA
##    <chr>      <chr>       <chr>         <chr>   
##  1 air_temp_c 27.14999962 air_temp_c_NA !NA     
##  2 air_temp_c 27.02000046 air_temp_c_NA !NA     
##  3 air_temp_c 27          air_temp_c_NA !NA     
##  4 air_temp_c 26.93000031 air_temp_c_NA !NA     
##  5 air_temp_c 26.84000015 air_temp_c_NA !NA     
##  6 air_temp_c 26.94000053 air_temp_c_NA !NA     
##  7 air_temp_c 27.04000092 air_temp_c_NA !NA     
##  8 air_temp_c 27.11000061 air_temp_c_NA !NA     
##  9 air_temp_c 27.20999908 air_temp_c_NA !NA     
## 10 air_temp_c 27.25       air_temp_c_NA !NA     
## # ... with 1,462 more rows
# Explore the imputations in a histogram 
ggplot(ocean_imp_mean_gather, 
       aes(x = as.numeric(value), fill = value_NA)) + 
  geom_histogram() + 
  facet_wrap(~variable)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Well done! Now you know how to explore variation across many variables and their missingness. This is a useful skill that we will repeat throughout the course.

Performing imputations

library(simputation)
## 
## Attaching package: 'simputation'
## The following object is masked from 'package:naniar':
## 
##     impute_median

Using simputation to impute data

There are many imputation packages in R. We are going to focus on using the simputation package, which provides a simple, powerful interface into performing imputations.

Building a good imputation model is very important, but it is a complex topic - there is as much involved in building a good imputation model as there is in building a good statistical model. In this course, we are going to focus on how to evaluate imputations.

First, we are going to look at the impute_lm() function, which imputes values according to a specified linear model.

In this exercise, we are going to apply the previous assessment techniques to data with impute_lm(), and then build upon this imputation method in subsequent lessons.

# Impute humidity and air temperature using wind_ew and wind_ns, and track missing values
ocean_imp_lm_wind <- oceanbuoys %>% 
    bind_shadow() %>%
    impute_lm(air_temp_c ~ wind_ew + wind_ns) %>% 
    impute_lm(humidity ~ wind_ew + wind_ns) %>%
    add_label_shadow()
    
# Plot the imputed values for air_temp_c and humidity, colored by missingness
ggplot(ocean_imp_lm_wind, 
       aes(x = air_temp_c, y = humidity, color = any_missing)) +
  geom_point()

Fantastic job! Now you can impute missing values using a more complex model and explore the variation amongst the values using visualization techniques.

Evaluating and comparing imputations

When you build up an imputation model, it’s a good idea to compare it to another method. In this lesson, we are going to compare the previously imputed dataset created using impute_lm() to the mean imputed dataset. Both of these datasets are included in this exercise as ocean_imp_lm_wind and ocean_imp_mean respectively.

# Bind the models together 
bound_models <- bind_rows(mean = ocean_imp_mean,
                          lm_wind = ocean_imp_lm_wind,
                          .id = "imp_model")

# Inspect the values of air_temp and humidity as a scatter plot
ggplot(bound_models,
       aes(x = air_temp_c,
           y = humidity,
           color = any_missing)) +
  geom_point() + 
  facet_wrap(~imp_model)

Wonderful work! Notice how imputing the data with a linear model helps create better imputations. You can now compare missing values coming from different models, building upon the previous missing data visualization techniques you’ve learned.

Evaluating imputations (many models & variables)

When you build up an imputation model, it’s a good idea to compare it to another method.

In this lesson, we are going to add a final imputation model that contains an extra useful piece of information, one that helps explain some of the variation in the data. You will then compare the values, as in the previous lesson.

# Build a model adding year to the outcome
ocean_imp_lm_wind_year <- bind_shadow(oceanbuoys) %>%
  impute_lm(air_temp_c ~ wind_ew + wind_ns + year) %>%
  impute_lm(humidity ~ wind_ew + wind_ns + year) %>%
  add_label_shadow()

# Bind the mean, lm_wind, and lm_wind_year models together
bound_models <- bind_rows(mean = ocean_imp_mean,
                          lm_wind = ocean_imp_lm_wind,
                          lm_wind_year = ocean_imp_lm_wind_year,
                          .id = "imp_model")

# Explore air_temp and humidity, coloring by any missings, and faceting by imputation model
ggplot(bound_models, aes(x = air_temp_c, y = humidity, color = any_missing)) +
  geom_point() + facet_wrap(~imp_model)

Excellent! You helped create better imputations by building a better model.

Evaluating imputations and models

Combining and comparing many imputation models

To evaluate the different imputation methods, we need to put them into a single dataframe. Next, you will compare three different approaches to handling missing data using the dataset, oceanbuoys.

You will create the third imputed dataset, ocean_imp_lm_all, using a linear model to impute the variables sea_temp_c, air_temp_c, and humidity from the variables wind_ew, wind_ns, year, latitude, and longitude.

You will then bind all of the datasets together (ocean_cc, ocean_imp_lm_wind, and ocean_imp_lm_all), calling it bound_models.

# Create a complete case dataset
ocean_cc <- oceanbuoys %>%
  na.omit() %>%
  bind_shadow() %>%
  add_label_shadow()

# Create an imputed dataset using linear models
ocean_imp_lm_all <- bind_shadow(oceanbuoys) %>%
  add_label_shadow() %>%
  impute_lm(sea_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
  impute_lm(air_temp_c ~ wind_ew + wind_ns + year + latitude + longitude) %>%
  impute_lm(humidity ~ wind_ew + wind_ns + year + latitude + longitude)

# Bind the datasets
bound_models <- bind_rows(cc = ocean_cc,
                          imp_lm_wind = ocean_imp_lm_wind,
                          imp_lm_all = ocean_imp_lm_all,
                          .id = "imp_model")

# Look at the models
bound_models
## # A tibble: 2,037 x 18
##    imp_model  year latitude longitude sea_temp_c air_temp_c humidity wind_ew
##    <chr>     <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>
##  1 cc         1997        0      -110       27.6       27.1     79.6   -6.40
##  2 cc         1997        0      -110       27.5       27.0     75.8   -5.30
##  3 cc         1997        0      -110       27.6       27       76.5   -5.10
##  4 cc         1997        0      -110       27.6       26.9     76.2   -4.90
##  5 cc         1997        0      -110       27.6       26.8     76.4   -3.5 
##  6 cc         1997        0      -110       27.8       26.9     76.7   -4.40
##  7 cc         1997        0      -110       28.0       27.0     76.5   -2   
##  8 cc         1997        0      -110       28.0       27.1     78.3   -3.70
##  9 cc         1997        0      -110       28.0       27.2     78.6   -4.20
## 10 cc         1997        0      -110       28.0       27.2     76.9   -3.60
## # ... with 2,027 more rows, and 10 more variables: wind_ns <dbl>,
## #   year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>, sea_temp_c_NA <fct>,
## #   air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>,
## #   any_missing <chr>

Now you can create multiple datasets with different handling of missing values to set up your comparisons.

Evaluating the different parameters in the model

We are imputing our data for a reason - we want to analyze the data!

In this example, we are interested in predicting sea temperature, so we will build a linear model predicting sea temperature.

We will fit this model to each of the datasets we created and then explore the coefficients in the data.

The objects from the previous lesson (ocean_cc, ocean_imp_lm_wind, ocean_imp_lm_all, and bound_models) are loaded into the workspace.

library(broom)

# Create the model summary for each dataset
model_summary <- bound_models %>% 
  group_by(imp_model) %>%
  nest() %>%
  mutate(mod = map(data, ~lm(sea_temp_c ~ air_temp_c + humidity + year, data = .)),
         res = map(mod, residuals),
         pred = map(mod, predict),
         tidy = map(mod, tidy))

# Explore the coefficients in the model
model_summary %>% 
    select(imp_model,tidy) %>%
    unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(tidy)`
## # A tibble: 12 x 6
## # Groups:   imp_model [3]
##    imp_model   term          estimate std.error statistic   p.value
##    <chr>       <chr>            <dbl>     <dbl>     <dbl>     <dbl>
##  1 cc          (Intercept)  -735.      45.9        -16.0  8.19e- 48
##  2 cc          air_temp_c      0.864    0.0231      37.4  2.64e-154
##  3 cc          humidity        0.0341   0.00390      8.74 2.69e- 17
##  4 cc          year            0.369    0.0232      15.9  3.46e- 47
##  5 imp_lm_wind (Intercept) -1742.      56.1        -31.0  1.83e-135
##  6 imp_lm_wind air_temp_c      0.365    0.0279      13.1  2.73e- 35
##  7 imp_lm_wind humidity        0.0225   0.00690      3.26 1.17e-  3
##  8 imp_lm_wind year            0.880    0.0283      31.1  6.79e-136
##  9 imp_lm_all  (Intercept)  -697.      51.8        -13.5  5.04e- 37
## 10 imp_lm_all  air_temp_c      0.890    0.0255      35.0  2.90e-158
## 11 imp_lm_all  humidity        0.0127   0.00463      2.75 6.03e-  3
## 12 imp_lm_all  year            0.351    0.0262      13.4  1.12e- 36
best_model <- "imp_lm_all"

Wonderful! You fit many models, and then identified a good dataset for the model! Also, congratulations on making it through the course! Now you know how to identify, visualize, and deal with missing data!

Final Lesson